fekade.getahun@aau.edu.et
1
Agenda
Object Relational Model
Distributed Database
2
1. Object-Relational Model
3
Recap
Database
Types of databases
Relational
Non-relational
Object-oriented model
Object-relational model
4
OO database concept
Representing complex objects
Encapsulation
Class
Inheritance
5
OO database concept
Association: the link between entities in an application.
It is represented by means of references between objects.
It can be binary or ternary, and may have an inverse (reverse) reference.
6
ADVANTAGES OF OODB
An integrated repository of information that is shared by
multiple users, multiple products, multiple applications on
multiple platforms.
It also solves the following problems:
The semantic gap: the conceptual model closely mirrors the real world, so the gap between them is small.
Impedance mismatch: programming languages and database systems must be interfaced to solve application problems, but the language style and data structures of a programming language (such as C) and of the DBMS (such as Oracle) differ. The OODB supports general-purpose programming within the OODB framework.
New application requirements: especially in OA (office automation), CAD, CAM, and CASE, object orientation is the most natural and convenient model.
7
Complex object model
Allows
Sets of atomic values
Tuple-valued attributes
Sets of tuples (nested relations)
General set and tuple constructors
Object identity
Thus, formally
Every atomic value in A is an object.
If a1, ..., an are attribute names in N, and O1, ..., On are objects, then the tuple T = [a1:O1, ..., an:On] is also an object, and T.ai retrieves the value Oi.
If O1, ..., On are objects, then the set S = {O1, ..., On} is also an object.
8
Object Model
An object is defined by a triple (OID, type constructor, state)
where OID is the unique object identifier,
type constructor is its type (such as atom, tuple, set, list, array, bag, etc.) and
state is its actual value.
Example:
(i1, atom, 'John')
(i2, atom, 30)
(i3, atom, 'Mary')
(i4, atom, 'Mark')
(i5, atom, 'Vicki')
(i6, tuple, [Name:i1, Age:i2])
(i7, set, {i4, i5})
(i8, tuple, [Name:i3, Friends:i7])
(i9, set, {i6, i8})
9
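The identifier-based object model above can be sketched in Python (an illustrative sketch, not part of any OODB API): each object is an (OID, type constructor, state) triple, and shared subobjects are stored once and referenced by OID.

```python
# The slide's nine objects, keyed by OID; tuple/set states hold OIDs, not values.
objects = {
    'i1': ('atom', 'John'),
    'i2': ('atom', 30),
    'i3': ('atom', 'Mary'),
    'i4': ('atom', 'Mark'),
    'i5': ('atom', 'Vicki'),
    'i6': ('tuple', {'Name': 'i1', 'Age': 'i2'}),
    'i7': ('set', {'i4', 'i5'}),
    'i8': ('tuple', {'Name': 'i3', 'Friends': 'i7'}),
    'i9': ('set', {'i6', 'i8'}),
}

def deref(oid):
    """Recursively resolve an OID into a plain Python value."""
    kind, state = objects[oid]
    if kind == 'atom':
        return state
    if kind == 'tuple':
        return {attr: deref(o) for attr, o in state.items()}
    if kind == 'set':
        return [deref(o) for o in sorted(state)]  # sorted for deterministic output
    raise ValueError(kind)
```

For example, `deref('i8')` yields Mary together with her set of friends, even though 'Mark' and 'Vicki' are stored only once.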
OBJECT-ORIENTED DATABASES
OODB = Object Orientation + Database Capabilities
11
OODB
COMMERCIAL OODB
Relational DB Extensions: Many relational systems support
OODB extensions.
User-defined functions (dBase).
User-defined ADTs (POSTGRES)
Very-long multimedia fields (BLOB or Binary Large Object). (DB2
from IBM, SQL from SYBASE, Informix, Interbase)
12
OODB Implementation Strategies
Develop novel database data model or data language
(SIM)
Extend an existing database language with object-oriented
capabilities. (IRIS, O2 and VBASE/ONTOS extended
SQL)
Extend existing object-oriented programming language
with database capabilities (GemStone OPAL extended Smalltalk)
Extendable object-oriented DBMS library (ONTOS)
13
ODL A Class With Key and Extent
A class definition with “extent”, “key”, and more elaborate
attributes; still relatively straightforward
SELECT d.name
FROM departments d
WHERE d.college = 'Engineering';
Object-Relational Data Models
Extend the relational data model by including object
orientation and constructs to deal with added data types.
Allow attributes of tuples to have complex types,
including non-atomic values such as nested relations.
Preserve relational foundations, in particular the
declarative access to data, while extending modeling
power.
Upward compatibility with existing relational languages.
16
Nested Relations
Motivation:
Permit non-atomic domains (atomic = indivisible)
Examples of non-atomic domains: a set of integers, or a set of tuples
Allows more intuitive modeling for applications with complex
data
Intuitive definition:
allow relations whenever we allow atomic (scalar) values -
relations within relations
Retains mathematical foundation of relational model
Violates first normal form.
17
Example of a Nested Relation
Example: library information system
Each book has
a title,
a set of authors,
a publisher, and
a set of keywords
Non-1NF relation books
18
1NF Version of Nested Relation
1NF version of books
flat-books
19
4NF Decomposition of Nested Relation
Remove awkwardness of flat-books by assuming that the
following multi-valued dependencies hold:
title ↠ author
title ↠ keyword
title ↠ pub-name, pub-branch
Decompose flat-books into 4NF using the schemas:
(title, author)
(title, keyword)
(title, pub-name, pub-branch)
20
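The 4NF decomposition above can be sketched in Python; the nested book tuple below is hypothetical sample data following the slides' schema.

```python
# Hypothetical nested 'books' tuples: each book has sets of authors and keywords.
books = [
    {'title': 'Compilers', 'authors': ['Smith', 'Jones'],
     'pub_name': 'McGraw-Hill', 'pub_branch': 'New York',
     'keywords': ['parsing', 'analysis']},
]

# Decompose into the three 4NF schemas from the slide.
title_author  = {(b['title'], a) for b in books for a in b['authors']}
title_keyword = {(b['title'], k) for b in books for k in b['keywords']}
title_pub     = {(b['title'], b['pub_name'], b['pub_branch']) for b in books}
```

Each relation now records one multi-valued fact per tuple, instead of the 2 x 2 = 4 redundant rows the flat 1NF version would need for this book.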
4NF Decomposition of flat–books
21
Problems with 4NF Schema
4NF design requires users to include joins in their queries.
1NF relational view flat-books defined by join of 4NF
relations:
eliminates the need for users to perform joins,
but loses the one-to-one correspondence between tuples and
documents.
And has a large amount of redundancy
Nested relations representation is much more natural here.
22
Complex Types and SQL:1999
Extensions to SQL to support complex types include:
Collection and large object types
Nested relations are an example of collection types
Structured types
Nested record structures like composite attributes
Inheritance
Object orientation
Including object identifiers and references
23
Collection Types
Set type (not in SQL:1999)
create table books (
…..
keyword-set setof(varchar(20))
……
)
Sets are an instance of collection types. Other instances
include
Arrays (are supported in SQL:1999)
E.g. author-array varchar(20) array[10]
Can access elements of array in usual fashion:
E.g. author-array[1]
25
Structured and Collection Types
(PostgreSQL)
Structured types can be declared and used in SQL
CREATE TYPE Publisher as (name varchar(20),
branch varchar(20));
27
Structured Types (Cont.)
Add two records into the books table
INSERT INTO books (title, authors, pub_date, pub, keywords) VALUES
('Compilers', '{"Smith","Jones"}', now()::date, row('McGraw-Hill','New York')::publisher, '{"Parsing","Analysis"}'),
('Networks', '{"Jones","Frick"}', now()::date, row('Oxford','London')::publisher, '{"Internet","Web"}');
Retrieve the content of the books table – two rows will be returned
Select * from Books;
30
Inheritance in PostgreSQL
PostgreSQL supports only table inheritance, not the type inheritance specified in SQL:1999
create type Person_Ty as (PID varchar (20), fullname name_type,
address full_address);
create table People of Person_ty;
Create table Emps (id serial, salary numeric) INHERITS (people);
-- inherits columns of the base table people
Inserting data into the Emps table also adds the inherited part of the data to the base table people, but the reverse is not true
INSERT INTO emps (pid, fullname, address, salary) VALUES ('1245',
row('Dawit', 'Bekele')::name_type, row('DZ','AM')::full_address,
9878);
31
Structured and Collection Types (Oracle)
Structured types can be declared and used in SQL
CREATE OR REPLACE TYPE Publisher as Object (name varchar(20), branch
varchar(20));
/
CREATE OR REPLACE TYPE VA as VARRAY (5) of VARCHAR(30);
/
CREATE OR REPLACE TYPE Book AS OBJECT (title varchar(20), authors VA,
pub_date date, pub Publisher, keywords VA);
/
Structured types can be used to create tables
32
Structured Types (Cont.)
Creating tables without creating an intermediate type
For example, the table books could also be defined with its columns spelled out inline:
CREATE TABLE books (title varchar(20), authors VA, pub_date date, pub Publisher, keywords VA);
34
Creation of Values of Complex Types
To insert the preceding tuple into the relation books
Insert into books (title, authors, pub, keywords) values
('Compilers', VA('Smith', 'Jones'),
Publisher('McGraw-Hill', 'New York'), VA('parsing','analysis'));
35
Inheritance
(Diagram: Person_Typ with subtypes Teacher_Typ and Student_Typ.)
Suppose that we have the following type definition for people:
create or replace type Person_typ as Object
(name varchar(20),
address varchar(20)) not final;
/
Using inheritance to define the student type (the teacher type is analogous)
create or replace type Student_typ UNDER Person_typ
(degree varchar(20),
department varchar(20)) not final;
/
Subtypes can redefine methods by using overriding method in place of member in the
member declaration
36
Reference Types
Object-oriented languages provide the ability to create and
refer to objects.
In SQL:1999
References are to tuples, and
references must be scoped,
i.e., they can only point to tuples in one specified table
37
Reference Declaration in SQL:1999
E.g., define a type Department with a field name and a field head that is a reference to a Person_typ, scoped to the table people
create type Department as Object
(name varchar(20), head ref Person_typ )
38
Initializing Reference Typed Values
In Oracle, to create a tuple with a reference value, first
create the tuple with a null reference and then set the
reference separately using the function ref(p) applied to a
tuple variable
40
Nested Table
CREATE TYPE animal_ty AS OBJECT (breed
VARCHAR(25), name VARCHAR(25), birthdate DATE);
/
CREATE TYPE animals_nt AS TABLE OF animal_ty;
/
CREATE TABLE breeder (breederName VARCHAR(25),
animals animals_nt)
NESTED TABLE animals STORE AS animals_nt_tab;
(Diagram: each breederName row holds a nested Animals table with columns Breed, Name, Birthdate.)
41
Nested Table
CREATE TABLE breeder (breederName VARCHAR(25),
animals animals_nt) nested table animals store as
animals_nt_tab;
INSERT INTO breeder VALUES (
'John Smith ',
animals_nt(
animal_ty('DOG', 'BUTCH', '31-MAR-01'),
animal_ty('DOG', 'ROVER', '05-JUN-01'),
animal_ty('DOG', 'JULIO', '10-JUN-01') )
);
43
Comparison of O-O and O-R Databases
Relational systems
simple data types, powerful query languages, high protection.
Persistent-programming-language-based OODBs
complex data types, integration with programming language,
high performance.
Object-relational systems
complex data types, powerful query languages, high protection.
Note: Many real systems blur these boundaries
E.g. persistent programming language built as a wrapper on a
relational database offers first two benefits, but may have poor
performance.
44
Distributed Database
45
Outline
Distributed Database
Introduction
DDBMS Architecture
DDB Design
Distributed Query Processing
46
1. Introduction to Distributed
Database
47
File Systems
(Diagram: programs 1-3 each embed their own data description and access separate files 1-3, leading to redundant data.)
48
Database Management
(Diagram: application programs 1-3, each with data semantics, share a common data description and data manipulation layer over a single DATABASE.)
49
Objective of database technology
The key objective of DBS is Integration not centralization
50
Motivation
(Diagram: database technology provides integration; computer networks provide distribution; distributed database systems combine both.)
Integration ≠ centralization
51
What is distributed …
Processing logic or processing elements
Functions
Data
Control
52
Classification of Distributed computing
Criteria [Bochmann, 1983]
Degree of coupling – how closely the processing elements are
connected together
Amount of data exchanged relative to the amount of local processing
Weak vs. strong coupling
Interconnection structure
Point-to-point interconnection between processing units
Common interconnection channel
Interdependence of components
Synchronization between components
Synchronous or asynchronous
53
What is a Distributed Database
System?
A distributed database (DDB) is a collection of multiple,
logically interrelated databases distributed over a
computer network.
A distributed database management system (D–DBMS) is
the software that manages the DDB and provides an
access mechanism that makes this distribution transparent
to the users.
Distributed database system (DDBS) = DDB + D–DBMS
54
What is not a DDBS?
A timesharing computer system
A loosely or tightly coupled multiprocessor system
A database system which resides at one of the nodes of a
network of computers - this is a centralized database on a
network node
55
Centralized DBMS on a Network
(Diagram: Sites 1-4 connected through a communication network; the database resides at a single site.)
59
Distributed DBMS Environment
(Diagram: Sites 1-4 connected through a communication network, each site holding part of the database.)
60
Implicit Assumptions
Data stored at a number of sites ➯ each site logically
consists of a single processor.
Processors at different sites are interconnected by a computer network ➯ not a multiprocessor (that is the realm of parallel database systems)
Distributed database is a database, not a collection of files ➯ data logically related, as exhibited in the users' access patterns (e.g., via the relational data model)
D-DBMS is a full-fledged DBMS
not remote file system, not a TP system
61
Promises of Distributed DBMS
Transparent management of distributed, fragmented, and
replicated data
Improved reliability/availability through distributed
transactions
Improved performance
Easier and more economical system expansion
62
Transparency
Transparency is the separation of the higher-level semantics of a system from the lower-level implementation issues.
Fundamental issue is to provide
Data independence in the distributed environment
Network (distribution) transparency
Replication transparency
Fragmentation transparency
horizontal fragmentation: selection
vertical fragmentation: projection
hybrid
63
2. Distributed DBMS Architecture
64
Introduction: Architecture
Defines the structure of the system, i.e.,
the components of the system are identified,
the functions of each component are specified, and
the interrelationships and interactions among these components are defined
65
DBMS Standardization
Reference Model
A conceptual framework whose purpose is to divide standardization work into manageable pieces and to show at a general level how these pieces are related
1. Component-based: the components of the system are defined, together with the interrelationships between them. The objectives of the system are clearly identified, but it gives very little insight into how they are achieved.
2. Function-based: the classes of users are identified, together with the functions that the system performs for them.
3. Data-based: the different types of data are identified, and the functional units that use or manage this data are specified. As data is the central resource that a DBMS manages, this datalogical approach is the preferred one.
67
ANSI/SPARC Architecture
Conceptual schema: represents the data and the relationships between them, without considering the users' views or the physical organization
Internal schema: the physical definition and organization of data
68
Conceptual Schema Definition
RELATION PROJ [
KEY = {PNO}
ATTRIBUTES = {
PNO : CHARACTER(7)
PNAME : CHARACTER(20)
BUDGET : NUMERIC(7)
LOC : CHARACTER(15)
}
]
RELATION ASG [
KEY = {ENO,PNO}
ATTRIBUTES = {
ENO : CHARACTER(9)
PNO : CHARACTER(7)
RESP : CHARACTER(10)
DUR : NUMERIC(3)
}
]
69
Internal Schema Definition
RELATION EMP [
KEY = {ENO}
ATTRIBUTES = {
ENO : CHARACTER(9)
ENAME : CHARACTER(15)
TITLE : CHARACTER(10)
}
]
INTERNAL_REL E [
INDEX ON E# CALL EMINX
FIELD = {
E# : BYTE(9)
ENAME : BYTE(15)
TIT : BYTE(10)
}
]
70
External View Definition –
Example 1
Create a BUDGET view from the PROJ relation
71
External View Definition –
Example 2
Create a Payroll view from relations EMP and
Pay
72
Architectural models for Distributed
DBMS
Ways of organizing multiple DBMSs so that multiple databases can be shared
73
Dimensions of the Problem
Distribution
Whether the components of the system (those that deal with data) run on the same machine or are distributed over multiple sites
Heterogeneity
Various levels (hardware, communications, operating system)
Various dimensions (data model, query language, transaction management)
Autonomy
Design autonomy: ability of a component DBMS to decide on issues related to its own design
Communication autonomy: ability of a component DBMS to decide whether and how to communicate with other DBMSs, i.e., what type of information it wants to provide to them
75
Architectural alternatives
A0, D0, H0: a logically integrated system
A set of homogeneous multiple DBMSs
79
Client/server
Task distribution
80
Advantages of Client-Server Architectures
More efficient division of labor
Horizontal and vertical scaling of resources
Better price/performance on client machines
Ability to use familiar tools on client machines
Client access to remote data (via standards)
Full DBMS functionality provided to client workstations
Overall better system price/performance
81
Problems With Multiple-Client/Single
Server
Server forms bottleneck
Server forms single point of failure
Database scaling difficult
82
Multiple client- multiple server
85
Components of DDBMS
86
MDBMS architecture with GCS
87
Components of a Multi-DBMS
88
3. Distributed Database Design
89
Introduction
The design of a DDB involves
making decisions on the placement of data and programs across the sites of a computer network, as well as possibly designing the network itself
90
Design strategies
Top-down
Based on designing systems from scratch
Begins with the requirement analysis that defines the
environment of the system and elicits both the data and
processing needs of all potential database users
It is applicable for the design of homogeneous databases
Bottom-up
When the databases already exist at a number of sites
Design involves integrating databases into one database
Integrate Local schema into Global schema
It is ideal in the context of heterogeneous databases
91
Top-Down Design Process
92
Distribution Design Issues
Why fragment at all?
How should we fragment?
How much should we fragment?
Is there any way to test the correctness of decomposition?
How should we allocate?
What is the necessary information for fragmentation and
allocation?
93
Reasons for Fragmentation
Can't we just distribute whole relations?
What is a reasonable unit of distribution?
If the unit is the relation:
views are subsets of relations, so application locality is defined on subsets of relations
shipping a whole relation when only part of it is needed causes extra communication
If the unit is fragments of relations (sub-relations):
permits concurrent execution of a number of transactions that access different portions of a relation
but views that cannot be defined on a single fragment will require extra processing
and semantic data control (especially integrity enforcement) becomes more difficult
94
Fragmentation Alternatives-Horizontal
95
Fragmentation Alternatives-Vertical
96
Degree of Fragmentation
97
ER model – for the running examples
(Diagram: PROJ(PNO, PName, Budget, Location); Skill(Title, Sal); EMP(ENO, EName, Title); ASG(PNO, ENO, Dur, Resp).)
98
Fragmentation
Horizontal Fragmentation (HF)
Primary Horizontal Fragmentation (PHF)
Derived Horizontal Fragmentation (DHF)
99
PHF – Information Requirements
Application Information
minterm selectivity: sel(mi)
The number of tuples of the relation that would be accessed by a user
query which is specified according to a given minterm predicate mi
access frequencies: acc(qi)
The frequency with which a user application accesses data. If Q =
{q1, q2, …, qq} is a set of user queries, acc(qi) indicates the access
frequency of the query qi in a given period
acc(mi) is computed from the acc(qi) of the queries whose qualifications contain the minterm mi
100
Primary Horizontal Fragmentation
Definition:
Rj = σFj(R), 1 ≤ j ≤ w
101
PHF – Algorithm
Given:
A relation R, the set of simple predicates Pr
Output:
The set of fragments of R = {R1, R2,…,Rw} which obey the
fragmentation rules.
Preliminaries :
1. Pr should be complete
2. Pr should be minimal
102
Completeness of Simple Predicates
A set of simple predicates Pr is said to be complete iff any two tuples of the same minterm fragment defined on Pr have the same probability of being accessed by any application
Example:
Assume PROJ[PNO, PNAME, BUDGET, LOC] has two
applications defined on it
Find the budgets of projects at each location (1)
Find projects with budgets less than $200000 (2)
103
Completeness of Simple Predicates
According to (1),
Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”}
which is not complete with respect to (2).
Modify
Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”,
BUDGET≤200000, BUDGET>200000}
which is complete.
104
Minimality of Simple Predicates
If a predicate influences how fragmentation is performed,
(i.e., causes a fragment f to be further fragmented into,
say, fi and fj) then there should be at least one application
that accesses fi and fj differently
In other words, the simple predicate should be relevant in
determining a fragmentation.
If all the predicates of a set Pr are relevant, then Pr is
minimal
105
Minimality of Simple Predicates
Example:
Pr = {LOC=“Montreal”, LOC=“New York”, LOC=“Paris”, BUDGET≤200000, BUDGET>200000} is complete and minimal.
However, if we add PNAME = “Instrumentation”, Pr is no longer minimal, since no application accesses the resulting fragments differently; the new predicate is not relevant.
106
COM_MIN Algorithm
Given:
a relation R and a set of simple predicates Pr
Output:
a complete and minimal set of simple predicates Pr' for Pr
Rule 1:
a relation or fragment is partitioned into at least two parts
which are accessed differently by at least one application.
107
COM_MIN Algorithm
❶ Initialization
find a pi ∈ Pr such that pi partitions R according to Rule 1
set Pr' = pi ;
Pr ←Pr – pi;
F ←fi
❷ Iteratively add predicates to Pr' until it is complete
find a pj ∈ Pr such that pj partitions some fk defined according to a minterm
predicate over Pr' according to Rule 1
set Pr' = Pr' ∪ pj ;
Pr ← Pr – pj;
F ← F ∪ fj
if ∃pk ∈ Pr' which is non-relevant then
Pr' ← Pr' – pk
F ← F – fk
108
PHORIZONTAL Algorithm
Makes use of COM_MIN to perform fragmentation
Input:
a relation R and a set of simple predicates Pr
Output:
a set of minterm predicates M according to which relation
R is to be fragmented
110
PHF - Example
(Figure: relation Skill partitioned into fragments Skill1 and Skill2.)
111
PHF - Example
Fragmentation of relation PROJ
Applications:
Find the name and budget of projects given their location
Issued at three sites
Access project information according to budget
one site accesses ≤200000 other accesses >200000
Simple predicates
For application (1)
p1 : LOC = “Montreal”
p2 : LOC = “New York”
p3 : LOC = “Paris”
For application (2)
p4 : BUDGET ≤ 200000
p5 : BUDGET > 200000
Pr = Pr' = {p1,p2,p3,p4,p5}
112
PHF – Example
Fragmentation of relation PROJ continued
Minterm fragments left after elimination
m1 : (LOC = “Montreal”) ∧ (BUDGET ≤ 200000)
m2 : (LOC = “Montreal”) ∧ (BUDGET > 200000)
m3 : (LOC = “New York”) ∧ (BUDGET ≤ 200000)
m4 : (LOC = “New York”) ∧ (BUDGET > 200000)
m5 : (LOC = “Paris”) ∧ (BUDGET ≤ 200000)
m6 : (LOC = “Paris”) ∧ (BUDGET > 200000)
113
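The six minterm fragments above can be sketched in Python; the PROJ tuples below are hypothetical sample data following the slides' schema.

```python
# Hypothetical PROJ tuples following the slides' schema.
PROJ = [
    {'PNO': 'P1', 'PNAME': 'Instrumentation', 'BUDGET': 150000, 'LOC': 'Montreal'},
    {'PNO': 'P2', 'PNAME': 'Database Develop.', 'BUDGET': 135000, 'LOC': 'New York'},
    {'PNO': 'P3', 'PNAME': 'CAD/CAM', 'BUDGET': 250000, 'LOC': 'New York'},
    {'PNO': 'P4', 'PNAME': 'Maintenance', 'BUDGET': 310000, 'LOC': 'Paris'},
]

locations = ['Montreal', 'New York', 'Paris']
budget_tests = [('BUDGET<=200000', lambda t: t['BUDGET'] <= 200000),
                ('BUDGET>200000', lambda t: t['BUDGET'] > 200000)]

# One fragment per minterm: a location predicate ANDed with a budget predicate.
fragments = {}
for loc in locations:
    for label, pred in budget_tests:
        fragments[f'LOC={loc} AND {label}'] = [
            t for t in PROJ if t['LOC'] == loc and pred(t)]
```

Because the minterms are mutually exclusive and jointly cover all tuples, the fragments are disjoint and their union reconstructs PROJ (the PHF correctness rules on the next slide).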
PHF Correctness
Completeness
Since Pr' is complete and minimal, the selection predicates are
complete
Reconstruction
If relation R is fragmented into FR = {R1,R2,…,Rr}
R = ∪Ri∈FR Ri
Disjointness
Minterm predicates that form the basis of fragmentation should
be mutually exclusive.
114
Derived Horizontal Fragmentation
Defined on a member relation of a link according to a
selection operation specified on its owner.
Each link is an equijoin.
Equijoin can be implemented by means of semi-joins.
115
DHF – Definition
Given a link L where owner(L)=S and member(L)=R, the
derived horizontal fragments of R are defined as
Ri = R ⋉ Si, 1 ≤ i ≤ w, where Si is a fragment of the owner relation S
117
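Derived horizontal fragmentation via semijoin can be sketched in Python; the fragmentation of owner EMP by TITLE and the sample tuples below are assumptions for illustration.

```python
# Hypothetical owner fragments of EMP (split by TITLE) and member relation ASG.
EMP1 = [{'ENO': 'E1', 'TITLE': 'Elect. Eng.'}, {'ENO': 'E2', 'TITLE': 'Elect. Eng.'}]
EMP2 = [{'ENO': 'E3', 'TITLE': 'Programmer'}]
ASG = [
    {'ENO': 'E1', 'PNO': 'P1', 'DUR': 12},
    {'ENO': 'E2', 'PNO': 'P1', 'DUR': 24},
    {'ENO': 'E3', 'PNO': 'P2', 'DUR': 6},
]

def semijoin(member, owner_frag, attr):
    """member ⋉ owner_frag on attr: keep member tuples with a matching owner tuple."""
    keys = {t[attr] for t in owner_frag}
    return [t for t in member if t[attr] in keys]

# Each member fragment follows its owner fragment: ASGi = ASG ⋉ EMPi on ENO.
ASG1 = semijoin(ASG, EMP1, 'ENO')
ASG2 = semijoin(ASG, EMP2, 'ENO')
```

With a simple join graph (each ENO appears in exactly one owner fragment), the derived fragments are disjoint and together cover ASG.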
DHF – Correctness
Completeness
Let R be the member relation of a link whose owner is relation S which is
fragmented as FS = {S1, S2, ..., Sn}. Furthermore, let A be the join attribute
between R and S. Then, for each tuple t of R, there should be a tuple t' of S
such that: t[A] = t'[A]
i.e., Referential integrity :(tuples of any fragment of the member
relation are also in the owner relation)
Reconstruction
Reconstruction of a global relation R from its fragments {R1, R2, …,
Rn} is performed by the union operator (R is union of its fragments)
Disjointness
In DHF disjointness is guaranteed only if the join graph between
the owner and the member fragments is simple.
118
Paper review template
Introduction
Statement of the problem
Objective
Methodology
Approach/ proposed solution
Critique
Conclusion
119
Vertical Fragmentation
Has been studied within the centralized context
design methodology
physical clustering
More difficult than horizontal, because more alternatives
exist
Two approaches :
Grouping: attributes to fragments
Splitting: relation to fragments
120
VF
Overlapping fragments
grouping
Non-overlapping fragments
splitting
We do not consider the replicated key attributes to be
overlapping
Advantage:
Easier to enforce functional dependencies (for integrity
checking etc.)
121
VF – Information requirements
Application Information
Attribute affinities
a measure that indicates how closely related the attributes are
This is obtained from more primitive usage data
Attribute usage values
Given a set of queries Q = {q1, q2,…, qq} that will run on the relation
R[A1, A2,…, An]
122
VF – Definition of use(qi,Aj)
Consider the following 4 queries for relation PROJ:
q1: SELECT BUDGET FROM PROJ WHERE PNO=Value
q2: SELECT PNAME, BUDGET FROM PROJ
q3: SELECT PNAME FROM PROJ WHERE LOC=Value
q4: SELECT SUM(BUDGET) FROM PROJ WHERE LOC=Value
Let A1 = PNO, A2 = PNAME, A3 = BUDGET, A4 = LOC
123
VF – Affinity Measure aff(Ai,Aj)
The attribute affinity measure between two attributes Ai and Aj
of a relation R[A1, A2, …, An] with respect to the set of
applications Q = (q1, q2, …, qq) is defined as the total access frequency, over all sites, of the queries that use both Ai and Aj.
Then, since q1 is the only query that uses both A1 and A3, and its access frequencies at the three sites are 15, 20, and 10,
aff(A1, A3) = 15*1 + 20*1 + 10*1
= 45
and the attribute affinity matrix AA is obtained by computing aff for every pair of attributes
125
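The affinity computation can be sketched in Python. The use matrix follows the queries on the previous slide; q1's per-site access frequencies (15, 20, 10) reproduce the slide's total, while the other frequencies are hypothetical values for illustration.

```python
# use(qi, Aj) matrix derived from the four queries (A1=PNO, A2=PNAME, A3=BUDGET, A4=LOC).
use = {
    'q1': {'A1': 1, 'A2': 0, 'A3': 1, 'A4': 0},
    'q2': {'A1': 0, 'A2': 1, 'A3': 1, 'A4': 0},
    'q3': {'A1': 0, 'A2': 1, 'A3': 0, 'A4': 1},
    'q4': {'A1': 0, 'A2': 0, 'A3': 1, 'A4': 1},
}
# Per-site access frequencies; q1's (15, 20, 10) matches the slide, the rest are
# hypothetical values chosen for illustration.
acc = {'q1': [15, 20, 10], 'q2': [5, 0, 0], 'q3': [25, 25, 25], 'q4': [3, 0, 0]}

def aff(ai, aj):
    """Total access frequency of every query that uses both ai and aj."""
    return sum(sum(acc[q]) for q in use if use[q][ai] and use[q][aj])
```

Running `aff` over all attribute pairs yields the attribute affinity matrix AA; the diagonal entry aff(Ai, Ai) is simply the total access frequency of Ai.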
VF – Clustering Algorithm
Take the attribute affinity matrix AA and reorganize the
attribute orders to form clusters where the attributes in
each cluster demonstrate high affinity to one another
The clustering maximizes the global affinity measure
AM = Σi=1..n Σj=1..n aff(Ai, Aj) [aff(Ai, Aj−1) + aff(Ai, Aj+1) + aff(Ai−1, Aj) + aff(Ai+1, Aj)]
with aff(A0, Aj) = aff(Ai, A0) = aff(An+1, Aj) = aff(Ai, An+1) = 0 at the boundaries
126
Bond Energy Algorithm
Input: The AA matrix
Output: The clustered affinity matrix CA which is a
perturbation of AA
❶ Initialization: Place and fix one of the columns of AA in
CA
❷ Iteration: Place the remaining n-i columns in the
remaining i+1 positions in the CA matrix. For each
column, choose the placement that makes the most
contribution to the global affinity measure
❸ Row order: Order the rows according to the column
ordering
127
bond(Ax, Ay) = Σz=1..n aff(Az, Ax) · aff(Az, Ay)
cont(Ai, Ak, Aj) = 2·bond(Ai, Ak) + 2·bond(Ak, Aj) − 2·bond(Ai, Aj)
128
BEA – Example
Consider the following AA matrix and the corresponding CA matrix where A1
and A2 have been placed.
Place A3:
Ordering (0-3-1):
cont(A0,A3,A1) = 2bond(A0 , A3)+2bond(A3 , A1)–2bond(A0 , A1)
= 2* 0 + 2* 4410 – 2*0 = 8820
Ordering (1-3-2):
cont(A1,A3,A2) = 2bond(A1 , A3)+2bond(A3 , A2)–2bond(A1,A2)
= 2* 4410 + 2* 890 – 2*225 = 10150
Ordering (2-3-4): cont (A2,A3,A4) = 1780
129
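The bond and cont computations can be reproduced in Python. The AA matrix below is an assumption: it is the standard textbook example, chosen because its columns reproduce exactly the bond values quoted on the slide (4410, 890, 225). Bonds with the left boundary (A0) or with a column not yet placed in CA are taken as 0.

```python
# Assumed attribute affinity matrix, consistent with the slide's bond values.
AA = {
    'A1': {'A1': 45, 'A2': 0,  'A3': 45, 'A4': 0},
    'A2': {'A1': 0,  'A2': 80, 'A3': 5,  'A4': 75},
    'A3': {'A1': 45, 'A2': 5,  'A3': 53, 'A4': 3},
    'A4': {'A1': 0,  'A2': 75, 'A3': 3,  'A4': 78},
}

def bond(x, y):
    """bond(Ax, Ay) = sum over rows z of aff(Az, Ax) * aff(Az, Ay).
    None stands for the boundary or an unplaced column and contributes 0."""
    if x is None or y is None:
        return 0
    return sum(AA[z][x] * AA[z][y] for z in AA)

def cont(left, new, right):
    """Net contribution of placing column `new` between `left` and `right`."""
    return 2 * bond(left, new) + 2 * bond(new, right) - 2 * bond(left, right)

# Placing A3 when only A1 and A2 are in CA (A0 boundary and unplaced A4 -> None):
best = max(['(0-3-1)', '(1-3-2)', '(2-3-4)'],
           key=lambda o: {'(0-3-1)': cont(None, 'A3', 'A1'),
                          '(1-3-2)': cont('A1', 'A3', 'A2'),
                          '(2-3-4)': cont('A2', 'A3', None)}[o])
```

Ordering (1-3-2) wins with a contribution of 10150, so A3 is placed between A1 and A2, exactly as the slide concludes.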
BEA: Example
130
Partitioning Algorithm
The objective is to find sets of attributes that are, in most cases, accessed alone, i.e., to divide a set of clustered attributes {A1, A2, …, An} into two (or more) sets {A1, A2, …, Ai} and {Ai+1, …, An} such that there are no (or minimal) applications that access both (or more than one) of the sets.
131
Partitioning algorithm
Define
AQ(qi) = {Aj|use(qi, Aj) =1}
TQ = {qi| AQ(qi) subset of TA}
BQ = {qi| AQ(qi) subset of BA}
OQ = Q –{TQ U BQ} //set of applications that access both TA and BA
and
CTQ = total number of accesses to attributes by applications that access only
TA
CBQ = total number of accesses to attributes by applications that access only
BA
COQ = total number of accesses to attributes by applications that access both
TA and BA
Then find the point along the diagonal that maximizes
z = CTQ·CBQ − COQ²
132
Partitioning algorithm
Two problems :
❶ Cluster forming in the middle of the CA matrix
Shift a row up and a column left and apply the algorithm to
find the “best” partitioning point
Do this for all possible shifts
Cost: O(m²)
❷ More than two clusters
m-way partitioning
try 1, 2, …, m–1 split points along diagonal and try to find the
best point for each of these
Cost: O(2^m)
133
VF correctness
A relation R, defined over attribute set A and key K, generates the
vertical partitioning FR = {R1, R2, …, Rr}.
Completeness
The following should hold for the attribute set A:
A = ∪ A(Ri), ∀Ri ∈ FR
Reconstruction
Reconstruction can be achieved by joining the fragments on the key K:
R = ⋈K Ri, ∀Ri ∈ FR
Disjointness
TID's are not considered to be overlapping since they are maintained by
the system
Duplicated keys are not considered to be overlapping
134
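Reconstruction by key join can be sketched in Python (hypothetical PROJ data; both fragments replicate the key PNO, which is not counted as overlap).

```python
# Hypothetical PROJ tuples and a vertical split that replicates the key PNO.
PROJ = [
    {'PNO': 'P1', 'PNAME': 'Instrumentation', 'BUDGET': 150000, 'LOC': 'Montreal'},
    {'PNO': 'P2', 'PNAME': 'Database Develop.', 'BUDGET': 135000, 'LOC': 'New York'},
]
PROJ1 = [{'PNO': t['PNO'], 'BUDGET': t['BUDGET']} for t in PROJ]
PROJ2 = [{'PNO': t['PNO'], 'PNAME': t['PNAME'], 'LOC': t['LOC']} for t in PROJ]

def key_join(r1, r2, key):
    """Reconstruct R = R1 ⋈K R2 by joining the fragments on the key."""
    index = {t[key]: t for t in r2}
    return [{**t, **index[t[key]]} for t in r1 if t[key] in index]
```

Joining the two fragments on PNO recovers the original relation tuple for tuple, which is the completeness and reconstruction guarantee above in miniature.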
Hybrid fragmentation
135
Allocation
Problem Statement
Given
F = {F1, F2, …, Fn} fragments
S = {S1, S2, …, Sm} network sites
Q = {q1, q2,…, qq} applications
Find the "optimal" distribution of F to S.
Optimality
Minimal cost
Communication + storage + processing (read & update)
Cost in terms of time (usually)
Performance
Response time and/or throughput
Constraints
Per site constraints (storage & processing)
136
Information Requirements
Database information
selectivity of fragments
size of a fragment
Application information
access types and numbers
access localities
Communication network information
bandwidth
latency
communication overhead
Computer system information
unit cost of storing data at a site
unit cost of processing at a site
137
Allocation – Information Requirements
Database Information
selectivity of fragments
size of a fragment
Application Information
number of read accesses of a query to a fragment
number of update accesses of a query to a fragment
a matrix indicating which queries update which fragments
a similar matrix for retrievals
originating site of each query
Site Information
unit cost of storing data at a site
unit cost of processing at a site
Network Information
communication cost per frame between two sites
frame size
138
Allocation Model
General Form
min(Total Cost)
subject to
response time constraint
storage constraint
processing constraint
Decision variable xij: xij = 1 if fragment Fi is stored at site Sj, and 0 otherwise
140
Allocation Model
Query Processing Cost
Processing component
Access cost
141
Allocation Model
Query Processing Cost
Transmission component
Cost of updates
update message cost + acknowledgment cost
Retrieval Cost
(cost of retrieval command + cost of sending back the result)
142
Allocation Model
Constraints
Response time
Execution time of query <= max allowable response time for that
query
Storage constraints
Storage requirement of a fragment at that site <=storage capacity at
that site
143
Allocation Model
Attempts to reduce the solution space
assume all candidate partitioning are known and select the
“best” partitioning
ignore replication at first
sliding window on fragments
144
4. Distributed Query Processing
145
Introduction
Query processing: the query processor transforms a high-level query into an execution plan over the stored data
Query optimization
How do we determine the “best” execution plan?
147
Query processing problem
Example
SELECT ENAME
FROM EMP,ASG
WHERE EMP.ENO = ASG.ENO
AND DUR > 37
148
Example …
149
Cost of Alternatives
Assume:
size(EMP) = 400, size(ASG) = 1000
tuple access cost = 1 unit; tuple transfer cost = 10 units
Strategy 1
produce ASG': (10+10) ∗ tuple access cost = 20
transfer ASG' to the sites of EMP: (10+10) ∗ tuple transfer cost = 200
produce EMP': (10+10) ∗ tuple access cost ∗ 2 = 40
transfer EMP' to result site: (10+10) ∗ tuple transfer cost = 200
Total cost: 460
Strategy 2
transfer EMP to site 5: 400 ∗ tuple transfer cost = 4,000
transfer ASG to site 5: 1000 ∗ tuple transfer cost = 10,000
produce ASG': 1000 ∗ tuple access cost = 1,000
join EMP and ASG': 400 ∗ 20 ∗ tuple access cost = 8,000
Total cost: 23,000
150
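The two strategy totals can be recomputed in Python directly from the slide's line items, which makes the 50x gap easy to see.

```python
tuple_access, tuple_transfer = 1, 10  # unit costs assumed on the slide

# Strategy 1: select ASG fragments locally, ship them to the EMP sites,
# join there, ship the joined result to the query site.
s1 = ((10 + 10) * tuple_access          # produce the two 10-tuple ASG fragments
      + (10 + 10) * tuple_transfer      # transfer them to the EMP sites
      + (10 + 10) * tuple_access * 2    # produce the joined EMP fragments
      + (10 + 10) * tuple_transfer)     # transfer results to the query site

# Strategy 2: ship both whole relations to one site and do everything there.
s2 = (400 * tuple_transfer              # transfer EMP (400 tuples)
      + 1000 * tuple_transfer           # transfer ASG (1000 tuples)
      + 1000 * tuple_access             # select ASG tuples with DUR > 37
      + 400 * 20 * tuple_access)        # join EMP with the selected ASG tuples
```

Strategy 1 costs 460 units against 23,000 for Strategy 2, illustrating why moving small intermediate results beats moving whole relations over the network.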
Objective of Query processing
To transform a high-level query on a distributed database into an efficient execution strategy expressed in a low-level language on local databases
Minimize a cost function
I/O cost + CPU cost + communication cost
These might have different weights in different distributed
environments
Wide area networks
communication cost will dominate
low bandwidth
low speed
high protocol overhead
Local area networks
communication cost not that dominant
total cost function should be considered
151
Complexity of Relational Operations
Assume
• relations of cardinality n
• sequential scan
Operation                                          Complexity
Select, Project (without duplicate elimination)    O(n)
Project (with duplicate elimination), Group        O(n log n)
Join, Semi-join, Division, Set operations          O(n log n)
Cartesian product                                  O(n²)
152
Characterization of Query processors
Four characteristics that hold for Centralized query processors
Language
Input language – relational calculus or relational algebra
Types of optimization
Exhaustive search
cost-based
Optimal
combinatorial complexity in the number of relations
Heuristics
not optimal
regroup common sub-expressions
perform selection, projection first
replace a join by a series of semi-joins
reorder operations to reduce intermediate relation size
optimize individual operations
153
Optimization Timing
Static
optimize prior to execution, at compile time
difficult to estimate the size of the intermediate results, which leads to error propagation
can amortize optimization cost over many executions
E.g., R*
Dynamic
run time optimization
exact information on the intermediate relation sizes
have to reoptimize for multiple executions
E.g. Distributed INGRES
Hybrid
compile using a static algorithm
if the error in estimated sizes exceeds a threshold, reoptimize at run time
E.g. MERMAID
154
Statistics
Relation
cardinality
size of a tuple
fraction of tuples participating in a join with another relation
Attribute
cardinality of domain
actual number of distinct values
Common assumptions
independence between different attribute values
uniform distribution of attribute values within their domain
155
Decision Sites
Centralized
single site determines the “best” schedule
simple
need knowledge about the entire distributed database
Distributed
cooperation among sites to determine the schedule
need only local information
cost of cooperation
Hybrid
one site determines the global schedule
each site optimizes the local subqueries
156
Network Topology
Wide area networks (WAN)
characteristics
low bandwidth
low speed
high protocol overhead
communication cost will dominate; ignore all other cost factors
global schedule to minimize communication cost
local schedules according to centralized query optimization
Local area networks (LAN)
communication cost not that dominant
total cost function should be considered
broadcasting can be exploited (e.g. joins) to optimize query processing
special algorithms exist for star networks
157
Exploitation of Replicated Fragments
In distributed query processing, queries on global relations are
mapped into queries on physical fragments of relations by
translating relations into fragments – localization
Replication is needed for increasing reliability and
availability
158
Use of semijoins
A semijoin reduces the size of the operand relation,
but it increases the number of messages and the local
processing time
E.g. SDD-1, designed for slow wide area networks, uses
semijoins extensively
159
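The communication-saving idea behind the semijoin can be sketched in a few lines of Python. This is a toy model, with relations as lists of dicts; the relation and attribute names are borrowed from the running EMP/ASG example:

```python
# R semijoin S on `attr`: ship only S's join-attribute values to R's site,
# then keep the R tuples that will participate in the join.
def semijoin(r, s, attr):
    shipped_keys = {t[attr] for t in s}   # small projection crosses the network
    return [t for t in r if t[attr] in shipped_keys]

emp = [{"ENO": "E1", "ENAME": "J. Doe"}, {"ENO": "E2", "ENAME": "M. Smith"}]
asg = [{"ENO": "E2", "PNO": "P1"}]
print(semijoin(emp, asg, "ENO"))  # [{'ENO': 'E2', 'ENAME': 'M. Smith'}]
```

Only the reduced EMP then needs to be shipped for the final join, which is why a slow-network design such as SDD-1 accepts the extra message round.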
Layers of Query Processing
160
Query Decomposition
Input : Calculus query on global relations
1. Normalization
manipulate query quantifier and qualification
2. Analysis
detect and reject “incorrect” queries
possible for only a subset of relational calculus
3. Simplification
eliminate redundant predicates
4. Restructuring
calculus query is restructured into algebraic query
more than one translation is possible
use transformation rules
161
Normalization
Lexical and syntactic analysis
check validity (similar to compilers)
check for attributes and relations
type checking on the qualification
Put into normal form
Conjunctive normal form
(p11∨p12∨…∨p1n) ∧…∧ (pm1∨pm2∨…∨pmn)
Disjunctive normal form
(p11∧p12 ∧…∧p1n) ∨…∨ (pm1 ∧pm2∧…∧pmn)
OR's mapped into union
AND's mapped into join or selection
162
Analysis
Remove incorrect queries
Type incorrect
If any of its attribute or relation names are not defined in the global
schema
If operations are applied to attributes of the wrong type
Semantically incorrect
Components do not contribute in any way to the generation of the
result
Only a subset of relational calculus queries can be tested for
correctness
Those that do not contain disjunction and negation
Technique to detect incorrect queries
connection graph (query graph) that represent the semantic of the query
join graph
163
Analysis – Example
SELECT ENAME,RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND PNAME = "CAD/CAM"
AND DUR ≥ 36
AND TITLE = "Programmer"
164
Analysis
If the query graph is not connected, the query is wrong.
SELECT ENAME,RESP, PNAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND PNAME = "CAD/CAM"
AND DUR ≥ 36
AND TITLE = "Programmer"
165
Simplification
Use transformation rules
elimination of redundancy
idempotency rules
p1 ∧ ¬( p1) ⇔ false
p1 ∧ (p1 ∨ p2) ⇔ p1
p1 ∨ false ⇔ p1
application of transitivity
use of integrity rules
166
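The idempotency rules above can be checked mechanically by enumerating truth assignments. A minimal sketch; the `equivalent` helper is hypothetical, for illustration only:

```python
from itertools import product

def equivalent(f, g, nvars):
    """True iff boolean formulas f and g agree on every truth assignment."""
    return all(f(*v) == g(*v) for v in product((False, True), repeat=nvars))

# p1 AND NOT p1 <=> false
print(equivalent(lambda p1: p1 and not p1, lambda p1: False, 1))           # True
# p1 AND (p1 OR p2) <=> p1
print(equivalent(lambda p1, p2: p1 and (p1 or p2), lambda p1, p2: p1, 2))  # True
# p1 OR false <=> p1
print(equivalent(lambda p1: p1 or False, lambda p1: p1, 1))                # True
```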
Simplification – Example
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = “J. Doe”
OR (NOT(EMP.TITLE = “Programmer”)
AND (EMP.TITLE = “Programmer” OR EMP.TITLE = “Elect. Eng.”)
AND NOT(EMP.TITLE = “Elect. Eng.”))
simplifies to
SELECT TITLE
FROM EMP
WHERE EMP.ENAME = “J. Doe”
167
Restructuring
Convert relational calculus to
relational algebra
Make use of query trees
Example
Find the names of employees other than
J. Doe who worked on the CAD/CAM
project for either 1 or 2 years.
SELECT ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME ≠ “J. Doe”
AND PNAME = “CAD/CAM”
AND (DUR = 12 OR DUR = 24)
168
Restructuring –Transformation Rules
Commutativity of binary operations
R×S⇔S×R
R join S ⇔S join R
R∪S⇔S∪R
Associativity of binary operations
( R × S ) × T ⇔ R × (S × T)
( R join S) join T ⇔ R join (S join T)
Idempotence of unary operations
ΠA’(ΠA”(R)) ⇔ ΠA’(R)
σp1(A1)(σp2(A2)(R)) ⇔ σp1(A1) ∧ p2(A2)(R)
where R[A] and A’ ⊆ A” ⊆ A
Commuting selection with projection
169
Restructuring –Transformation Rules
Commuting selection with binary operations
σp(A)(R × S) ⇔ (σp(A) (R)) × S
σp(Ai)(R join(Aj,Bk) S) ⇔ (σp(Ai)(R)) join(Aj,Bk) S
σp(Ai)(R ∪ T) ⇔ σp(Ai)(R) ∪ σp(Ai)(T)
where Ai belongs to R and T
Commuting projection with binary operations
ΠC(R × S) ⇔ΠA’(R) × ΠB’(S)
ΠC(R join(Aj,Bk) S)⇔ΠA’(R) join(Aj,Bk) ΠB’(S)
ΠC(R ∪ S) ⇔ΠC (R) ∪ ΠC (S)
where R[A] and S[B]; C = A' ∪ B' where A' ⊆ A, B' ⊆ B
170
Example
Find the names of employees other than
J. Doe who worked on the CAD/CAM
project for either 1 or 2 years
SELECT ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME ≠ “J. Doe”
AND PNAME = “CAD/CAM”
AND (DUR = 12 OR DUR = 24)
171
Equivalent Query
172
Restructuring
σDUR=12 ∨ DUR=24
173
Step 2 – Data Localization
Input: Algebraic query on distributed relations
Determine which fragments are involved
Localization program
substitute for each global query its materialization program
➠ optimize
174
Example
Assume
EMP is fragmented into EMP1, EMP2,
EMP3 as follows:
EMP1=σENO≤“E3”(EMP)
EMP2= σ“E3”<ENO≤“E6”(EMP)
EMP3=σENO>“E6”(EMP)
ASG fragmented into ASG1 and ASG2 as
follows:
ASG1=σENO≤“E3”(ASG)
ASG2=σENO>“E3”(ASG)
175
Provides Parallelism
176
Eliminates …
177
Reduction for PHF
Reduction with selection
Relation R and FR={R1, R2, …, Rw} where Rj=σ pj(R)
σ pi(Rj)= φ if ∀x in R: ¬(pi(x) ∧ pj(x))
Example
EMP1= σENO≤“E3”(EMP)
EMP2= σ“E3”<ENO≤“E6”(EMP)
EMP3= σENO>“E6”(EMP)
SELECT *
FROM EMP
WHERE ENO=“E5”
178
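The contradiction test behind this reduction can be sketched as follows, modeling each fragment predicate on ENO as a half-open (low, high] interval. The interval encoding is an illustrative assumption:

```python
# A fragment is relevant to the query ENO = "E5" only if its predicate
# range can contain "E5"; all other fragments reduce to empty relations.
def can_contain(frag_range, value):
    low, high = frag_range                 # (low, high]; None means unbounded
    return (low is None or value > low) and (high is None or value <= high)

fragments = {"EMP1": (None, "E3"), "EMP2": ("E3", "E6"), "EMP3": ("E6", None)}
relevant = [f for f, rng in fragments.items() if can_contain(rng, "E5")]
print(relevant)  # ['EMP2']
```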
Reduction for PHF
Reduction with join
Possible if fragmentation is done on join attribute
Distribute join over union
(R1 ∪ R2) join S ⇔ (R1 join S) ∪ (R2 join S)
Given Ri = σpi(R) and Rj = σpj(R)
Ri join Rj = φ if ∀x in Ri, ∀y in Rj: ¬(pi(x) ∧ pj(y))
179
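Under the same interval model, the useless EMPi join ASGj terms produced by distributing the join over the unions can be pruned with an interval-intersection test. A sketch, assuming the fragments of the example:

```python
emp = {"EMP1": (None, "E3"), "EMP2": ("E3", "E6"), "EMP3": ("E6", None)}
asg = {"ASG1": (None, "E3"), "ASG2": ("E3", None)}

def intersects(a, b):
    """Non-empty intersection of two half-open (low, high] ENO ranges."""
    (alo, ahi), (blo, bhi) = a, b
    lo = blo if alo is None else alo if blo is None else max(alo, blo)
    hi = bhi if ahi is None else ahi if bhi is None else min(ahi, bhi)
    return lo is None or hi is None or lo < hi

pairs = [(e, a) for e, er in emp.items() for a, ar in asg.items()
         if intersects(er, ar)]
print(pairs)  # [('EMP1', 'ASG1'), ('EMP2', 'ASG2'), ('EMP3', 'ASG2')]
```

Only three of the six fragment joins survive; the other three are guaranteed empty and never need to be shipped or computed.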
Reduction for PHF
Reduction with join - Example
Assume EMP is fragmented into three fragments and ASG into two:
EMP1= σENO≤“E3”(EMP)
EMP2= σ“E3”<ENO≤“E6”(EMP)
EMP3= σENO>“E6”(EMP)
ASG1: σENO ≤ "E3"(ASG)
ASG2: σENO > "E3"(ASG)
Consider the query
SELECT * FROM EMP, ASG
WHERE EMP.ENO=ASG.ENO
180
Reduction for PHF
Reduction with join
Distribute join over unions
Apply the reduction rule
181
Reduction for VF
Find useless (not empty) intermediate relations
Relation R defined over attributes A = {A1, ..., An} vertically
fragmented as Ri = ΠA'(R) where A' ⊆ A:
ΠD,K(Ri) is useless if the set of projection attributes D is not in A’
Example: EMP1= ΠENO,ENAME(EMP); EMP2= ΠENO,TITLE (EMP)
SELECT ENAME
FROM EMP
182
Reduction for DHF
Rule :
Distribute joins over unions
Apply the join reduction for horizontal fragmentation
Example
ASG1: ASG ⋉ENO EMP1
ASG2: ASG ⋉ENO EMP2
EMP1: σTITLE=“Programmer” (EMP)
EMP2: σTITLE<>“Programmer” (EMP)
Query
SELECT *
FROM EMP, ASG
WHERE ASG.ENO = EMP.ENO
AND EMP.TITLE = “Mech. Eng.”
183
Reduction for DHF
184
Reduction for DHF
Joins over unions
185
Reduction for Hybrid Fragmentation
Combine the rules already specified:
Remove empty relations generated by contradicting selections
on horizontal fragments
Remove useless relations generated by projections on vertical
fragments
Distribute joins over unions in order to isolate and remove
useless joins
186
Reduction for Hybrid Fragmentation
Example
Consider the following hybrid
fragmentation:
EMP1= σENO≤"E4"(ΠENO,ENAME(EMP))
EMP2= σENO>"E4"(ΠENO,ENAME(EMP))
EMP3= ΠENO,TITLE(EMP)
and the query
SELECT ENAME
FROM EMP
WHERE ENO=“E5”
187
Global Query Optimization
Input: Fragment query
Find the best (not necessarily optimal) global schedule
Minimize a cost function
Distributed join processing
Bushy vs. linear trees
Which relation to ship where?
Ship-whole vs ship-as-needed
Decide on the use of semi-joins
Semi-join saves on communication at the expense of more local
processing.
Join methods
nested loop vs ordered joins (merge join or hash join)
188
Cost-Based Optimization
Solution space
The set of equivalent algebra expressions (query trees).
Cost function (in terms of time)
I/O cost + CPU cost + communication cost
These might have different weights in different distributed
environments (LAN vs WAN).
Can also maximize throughput
Search algorithm
How do we move inside the solution space?
Exhaustive search, heuristic algorithms (iterative improvement,
simulated annealing, genetic,…)
189
5. Concurrency Control
190
Concurrency Control in Distributed
Database
Concurrency control schemes deal with the handling of data
accessed by concurrent transactions.
Various locking protocols are used for handling
concurrent transactions in centralized database systems.
There are no major differences between the schemes in
centralized and distributed databases; the only major
difference is the way the lock manager deals
with replicated data.
191
Locking protocols
1. Single lock manager approach
2. Distributed lock manager approach
a) Primary Copy protocol
b) Majority protocol
c) Biased protocol
d) Quorum Consensus protocol
192
Single Lock Manager - Concurrency
Control in Distributed Database
193
Single Lock Manager …
1. Transaction T1 at site S5 requests data item
D.
2. The initiator site S5’s transaction manager
sends the lock request for data item D
to the lock-manager site S3.
The lock manager at site S3 checks the
availability of the data item D.
3. If the requested item is not locked by any
other transaction, the lock-manager site
responds with a lock grant message to the
initiator site S5.
4. The initiator site S5 can use the data item
D from any of the sites S1, S2, and S6 to
complete transaction T1.
5. After successful completion of
transaction T1, the transaction manager
of S5 releases the lock by sending an
unlock request to the lock-manager site S3.
194
Primary Copy Protocol
195
Majority Based Protocol
A transaction which needs to lock data item Q has to
request and lock data item Q at a majority of the sites at which
Q is replicated (i.e., more than half of the sites holding a
replica of Q).
The lock managers of all the sites in which Q is replicated
are responsible for handling lock and unlock requests
locally and individually.
Irrespective of the lock type (read or write, i.e., shared or
exclusive), a majority of the sites must be locked.
196
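The quorum arithmetic can be sketched in a couple of lines. A toy model, where `grants` is the set of replica sites that granted the lock:

```python
# Majority protocol: a lock on Q succeeds only with grants from more
# than half of the sites holding a replica of Q.
def majority_locked(grants, n_replicas):
    return len(grants) > n_replicas // 2

print(majority_locked({"S1", "S2"}, 3))  # True  (2 of 3 is a majority)
print(majority_locked({"S1"}, 3))        # False (1 of 3)
print(majority_locked({"S1", "S2"}, 4))  # False (2 of 4 is not a majority)
```

Because any two majorities of the same replica set overlap in at least one site, two conflicting transactions can never both acquire their quorum.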
Majority Based Protocol
197
Parallel Databases
198
Parallel Databases
Introduction
I/O Parallelism
Interquery Parallelism
Intraquery Parallelism
Intraoperation Parallelism
Interoperation Parallelism
Design of Parallel Systems
199
Introduction
Parallel machines are becoming quite common and affordable
Prices of microprocessors, memory and disks have dropped sharply
Recent desktop computers feature multiple processors and this trend
is projected to accelerate
Databases are growing increasingly large
large volumes of transaction data are collected and stored for later
analysis.
multimedia objects like images are increasingly stored in databases
Large-scale parallel database systems increasingly used for:
storing large volumes of data
processing time-consuming decision-support queries
providing high throughput for transaction processing
200
Parallelism in Databases
Data can be partitioned across multiple disks for parallel
I/O.
Individual relational operations (e.g., sort, join,
aggregation) can be executed in parallel
data can be partitioned and each processor can work
independently on its own partition.
Queries are expressed in high level language (SQL,
translated to relational algebra)
makes parallelization easier.
Different queries can be run in parallel with each other.
Concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
201
Modes of Parallelism
At the heart of all parallel machines is a collection of
processors.
Each processor has its own local cache
Classify parallel architectures into three broad groups
The most tightly coupled architecture shares memory (shared memory)
A less tightly coupled architecture shares disks but not memory (shared disk)
Shared nothing: processors share neither memory nor disks
202
Shared-Memory
203
Shared-Nothing
all processors have their own memory and their own disk or disks
the shared-nothing architecture is the most commonly used architecture for database systems
Used by Teradata, IBM, Sybase, Microsoft for OLAP
Prototypes: Gamma, Bubba, Grace, Prisma, EDS
+ Extensibility, availability
- Complexity, difficult load balancing
205
I/O Parallelism
Reduce the time required to retrieve relations from disk by
partitioning the relations on multiple disks.
Horizontal partitioning – tuples of a relation are divided
among many disks such that each tuple resides on one
disk.
Partitioning techniques (number of disks = n):
Round-robin: Send the ith tuple inserted in the relation to disk i
mod n.
Hash partitioning: send a tuple with partitioning-attribute value
v to disk h(v), where h is a hash function with range 0, ..., n-1
207
I/O Parallelism (Cont.)
Range partitioning: break tuples up into contiguous
ranges of keys, requires a key that can be ordered linearly
Choose an attribute as the partitioning attribute.
A partitioning vector [v0, v1, ..., vn-2] is chosen.
Let v be the partitioning attribute value of a tuple. Tuples such
that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0 and
tuples with v ≥ vn-2 go to disk n-1.
E.g., with a partitioning vector [5,11], a tuple with partitioning
attribute value of 2 will go to disk 0, a tuple with value 8 will
go to disk 1, while a tuple with value 20 will go to disk 2.
208
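The three schemes can be sketched side by side for n = 3 disks, reusing the partitioning vector [5, 11] from the example. The integer-modulo hash is an illustrative stand-in for a uniform hash function:

```python
from bisect import bisect_right

n = 3

def round_robin(i):                         # i = insertion order of the tuple
    return i % n

def hash_partition(value):                  # toy uniform hash on integer keys
    return value % n

def range_partition(value, vector=(5, 11)):
    # disk 0: v < 5; disk 1: 5 <= v < 11; disk 2: v >= 11
    return bisect_right(vector, value)

print(range_partition(2), range_partition(8), range_partition(20))  # 0 1 2
```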
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the
following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point
queries).
Example: r.A = 25.
3. Locating all tuples whose value for a given
attribute lies within a specified range (range queries).
Example: 10 ≤ r.A < 25.
209
Comparison of Partitioning Techniques(Cont.)
Round robin:
Advantages
Best suited for sequential scan of entire relation on each query.
All disks have almost an equal number of tuples; retrieval work
is thus well balanced between disks.
Disadvantages
Range queries are difficult to process
No clustering – tuples are scattered across all disks
210
Comparison of Partitioning Techniques(Cont.)
Hash partitioning:
Good for sequential access
Assuming hash function is good, and partitioning attributes
form a key, tuples will be equally distributed between disks
Retrieval work is then well balanced between disks.
Good for point queries on partitioning attribute
Can lookup single disk, leaving others available for answering
other queries.
Index on partitioning attribute can be local to disk, making
lookup and update more efficient
No clustering, so difficult to answer range queries
211
Range partitioning
Partition requires a partitioning attribute A usually the
primary key
A vector of dimension n partitions A
Vector {v0,v1,…,vn-1}
Each tuple t goes into:
Partition 0 if t[A] < v0
Partition n-1 if t[A] ≥ vn-2
Partition k if vk-1 ≤ t[A] < vk, 1 ≤ k ≤ n-2
Simple range partitioning: #disks = #partitions
212
Comparison of Partitioning Techniques (Cont.)
Range partitioning:
Provides data clustering by partitioning attribute value.
Good for sequential access
Good for point queries on partitioning attribute: only one
disk needs to be accessed.
For range queries on partitioning attribute, one to a few disks
may need to be accessed
Remaining disks are available for other queries.
Good if result tuples are from one to a few blocks.
If many blocks are to be fetched, they are still fetched from one to
a few disks, and potential parallelism in disk access is wasted
Example of execution skew.
213
Partitioning a Relation across Disks
If a relation contains only a few tuples which will fit into a
single disk block, then assign the relation to a single disk.
Large relations are preferably partitioned across all the
available disks.
If a relation consists of m disk blocks and there are n disks
available in the system, then the relation should be
allocated min(m,n) disks.
214
Handling of Skew
The distribution of tuples to disks may be skewed — that is,
some disks have many tuples, while others may have fewer
tuples.
Types of skew:
Attribute-value skew.
when lots of tuples are clustered around the same (or nearly same value)
i.e. some values appear in the partitioning attributes of many tuples; all
the tuples with the same value for the partitioning attribute end up in the
same partition.
Can occur with range-partitioning and hash-partitioning.
Partition skew.
With range-partitioning, badly chosen partition vector may assign too
many tuples to some partitions and too few to others.
Less likely with hash-partitioning if a good hash-function is chosen.
215
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming
partitioning attribute forms a key of the relation):
Sort the relation on the partitioning attribute.
Construct the partition vector by scanning the relation in sorted
order as follows.
After every 1/nth of the relation has been read, the value of the
partitioning attribute of the next tuple is added to the partition vector.
n denotes the number of partitions to be constructed.
Duplicate entries or imbalances can result if duplicates are
present in partitioning attributes.
Alternative technique based on histograms used in
practice
216
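The sort-and-sample construction can be sketched directly, assuming (as stated above) that the partitioning attribute forms a key:

```python
# Build a balanced range-partitioning vector: sort on the partitioning
# attribute, then take a boundary value after every 1/n-th of the tuples.
def balanced_vector(values, n):
    s = sorted(values)
    step = len(s) // n
    return [s[i * step] for i in range(1, n)]

keys = [7, 3, 9, 1, 5, 8, 2, 6, 4, 10, 12, 11]
print(balanced_vector(keys, 3))  # [5, 9] -> thirds: v < 5, 5 <= v < 9, v >= 9
```

Each of the three resulting ranges receives exactly four of the twelve keys; with duplicates in the attribute the boundaries can repeat, which is exactly the imbalance the histogram technique addresses.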
Handling Skew using Histograms
Balanced partitioning vector can be constructed from
histogram in a relatively straightforward fashion
Assume uniform distribution within each range of the
histogram
Histogram can be constructed by scanning relation, or
sampling (blocks containing) tuples of the relation.
217
Handling Skew Using Virtual Processor
Partitioning
Skew in range partitioning can be handled elegantly using
virtual processor partitioning:
create a large number of partitions (say 10 to 20 times the
number of processors)
Assign virtual processors to partitions either in round-robin
fashion or based on estimated cost of processing each virtual
partition
Basic idea:
If any normal partition would have been skewed, it is very
likely the skew is spread over a number of virtual partitions
Skewed virtual partitions get spread across a number of
processors, so work gets distributed evenly!
218
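A minimal sketch of the round-robin assignment of virtual partitions to real processors (the partition and processor counts are illustrative):

```python
n_procs, n_virtual = 4, 16                 # many more partitions than processors
assignment = {vp: vp % n_procs for vp in range(n_virtual)}

# A skewed hot range covering virtual partitions 4..7 is spread over
# every real processor instead of landing on a single one:
print(sorted({assignment[vp] for vp in range(4, 8)}))  # [0, 1, 2, 3]
```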
Interquery Parallelism
It is a form of parallelism where many different Queries or Transactions
are executed in parallel with one another on many processors
Increases transaction throughput; used primarily to scale up a
transaction processing system to support a larger number of
transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory
parallel database, because even sequential database systems support
concurrent processing.
More complicated to implement on shared-disk or shared-nothing
architectures
Locking and logging must be coordinated by passing messages between
processors.
Data in a local buffer may have been updated at another processor.
Cache-coherency has to be maintained - reads and writes of data in buffer
must find latest version of data.
219
Cache Coherency Protocol
Example of a cache coherency protocol for shared disk
systems:
Before reading/writing to a page, the page must be locked in
shared/exclusive mode.
On locking a page, the page must be read from disk
Before unlocking a page, the page must be written to disk if it
was modified.
More complex protocols with fewer disk reads/writes exist.
Cache coherency protocols for shared-nothing systems are
similar. Each database page is assigned a home processor.
Requests to fetch the page or write it to disk are sent to the
home processor.
220
Intraquery Parallelism
Execution of a single query in parallel on multiple
processors/disks; important for speeding up long-running
queries.
SELECT * FROM Email ORDER BY Start_Date;
Two complementary forms of intraquery parallelism :
Intraoperation Parallelism – parallelize the execution of each
individual operation in the query.
SELECT * FROM Email ORDER BY Start_Date; //(Sort
Operation)
SELECT * FROM Student, CourseRegd WHERE Student.Regno
= CourseRegd.Regno; //(Join)
221
Intraquery Parallelism
Interoperation Parallelism – execute the different operations
in a query expression in parallel.
A single query may involve multiple operations at once.
SELECT AVG(Salary) , dept_id FROM Employee GROUP BY
Dept_Id;
225
Sort 2: Sort each temporary table in ascending order and later merge
226
Parallel Sort (Cont.)
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks
D0, ..., Dn-1.
Each processor Pi locally sorts the data on disk Di.
The sorted runs on each processor are then merged to get the final
sorted output.
Parallelize the merging of sorted runs as follows:
The sorted partitions at each processor Pi are range-partitioned across the
processors P0, ..., Pm-1.
Each processor Pi performs a merge on the streams as they are received,
to get a single sorted run.
The sorted runs on processors P0,..., Pm-1 are concatenated to get the final
result.
227
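The two phases map naturally onto `sorted` and `heapq.merge`. This is a sequential simulation; each inner list stands for the tuples stored on one disk Di:

```python
import heapq

partitions = [[24000, 9000], [31000, 14000], [5000, 40000]]  # data on D0..D2
runs = [sorted(p) for p in partitions]   # phase 1: each Pi sorts locally
merged = list(heapq.merge(*runs))        # phase 2: merge the sorted runs
print(merged)  # [5000, 9000, 14000, 24000, 31000, 40000]
```

In the parallel version the merge itself is also distributed, by range-partitioning the sorted runs across processors before merging.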
SELECT * FROM Employee ORDER BY Salary;
v[14000, 24000]
228
Parallel Join
The join operation requires pairs of tuples to be tested to
see if they satisfy the join condition, and if they do, the
pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be
tested over several processors. Each processor then
computes part of the join locally.
In a final step, the results from each processor can be
collected together to produce the final result.
229
Partitioned Join
For equi-joins and natural joins, it is possible to partition the two
input relations across the processors, and compute the join locally
at each processor.
Let r and s be the input relations, and we want to compute r ⋈r.A=s.B s.
r and s each are partitioned into n partitions, denoted r0, r1, ..., rn-1
and s0, s1, ..., sn-1.
Can use either range partitioning or hash partitioning.
r and s must be partitioned on their join attributes r.A and s.B,
using the same range-partitioning vector or hash function.
Partitions ri and si are sent to processor Pi,
Each processor Pi locally computes ri ⋈ri.A=si.B si. Any of the
standard join methods can be used.
230
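A sequential simulation of a partitioned equi-join, hash-partitioning both inputs on their join attributes; the relation and attribute names are made up:

```python
n = 2  # number of processors

def hash_partition(rel, attr):
    parts = [[] for _ in range(n)]
    for t in rel:
        parts[t[attr] % n].append(t)   # same hash function for both inputs
    return parts

r = [{"A": 1, "x": "a"}, {"A": 2, "x": "b"}]
s = [{"B": 2, "y": "c"}, {"B": 3, "y": "d"}]

out = []
for ri, si in zip(hash_partition(r, "A"), hash_partition(s, "B")):
    # local join at processor Pi over its own partitions only
    out += [{**t, **u} for t in ri for u in si if t["A"] == u["B"]]
print(out)  # [{'A': 2, 'x': 'b', 'B': 2, 'y': 'c'}]
```

Because matching tuples always hash to the same partition, no cross-partition comparisons are ever needed.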
Partitioned Join (Cont.)
231
232
Partitioned Parallel Hash-Join
Parallelizing partitioned hash join:
Assume s is smaller than r and therefore s is chosen as the build
relation.
A hash function h1 takes the join attribute value of each tuple in
s and maps this tuple to one of the n processors.
Each processor Pi reads the tuples of s that are on its disk Di,
and sends each tuple to the appropriate processor based on hash
function h1. Let si denote the tuples of relation s that are sent to
processor Pi.
As tuples of relation s are received at the destination processors,
they are partitioned further using another hash function, h2,
which is used to compute the hash-join locally. (Cont.)
237
Partitioned Parallel Hash-Join (Cont.)
Once the tuples of s have been distributed, the larger relation r is
redistributed across the n processors using the hash function h1
Let ri denote the tuples of relation r that are sent to processor Pi.
As the r tuples are received at the destination processors, they are
repartitioned using the function h2
(just as the probe relation is partitioned in the sequential hash-join
algorithm).
Each processor Pi executes the build and probe phases of the hash-
join algorithm on the local partitions ri and si of r and s to produce a
partition of the final result of the hash-join.
Note: Hash-join optimizations can be applied to the parallel case
e.g., the hybrid hash-join algorithm can be used to cache some of the
incoming tuples in memory and avoid the cost of writing them and reading
them back in.
238
Parallel Nested-Loop Join
Assume that
relation s is much smaller than relation r and that r is stored by partitioning.
there is an index on a join attribute of relation r at each of the partitions of
relation r.
Use asymmetric fragment-and-replicate, with relation s being
replicated, and using the existing partitioning of relation r.
Each processor Pj where a partition of relation s is stored reads the
tuples of relation s stored in Dj, and replicates the tuples to every
other processor Pi.
At the end of this phase, relation s is replicated at all sites that store tuples
of relation r.
Each processor Pi performs an indexed nested-loop join of relation s
with the ith partition of relation r.
239
Other Relational Operations
Selection σθ(r)
If θ is of the form ai = v, where ai is an attribute and v a
value.
If r is partitioned on ai the selection is performed at a single
processor.
If θ is of the form l ≤ ai ≤ u (i.e., θ is a range selection)
and the relation has been range-partitioned on ai
Selection is performed at each processor whose partition overlaps
with the specified range of values.
In all other cases: the selection is performed in parallel at
all the processors.
240
Other Relational Operations (Cont.)
Duplicate elimination
Perform by using either of the parallel sort techniques
eliminate duplicates as soon as they are found during sorting.
Can also partition the tuples (using either range- or hash-
partitioning) and perform duplicate elimination locally at each
processor.
Projection
Projection without duplicate elimination can be performed as
tuples are read in from disk in parallel.
If duplicate elimination is required, any of the above duplicate
elimination techniques can be used.
241
Grouping/Aggregation
Partition the relation on the grouping attributes and then
compute the aggregate values locally at each processor.
Can reduce cost of transferring tuples during partitioning by
partly computing aggregate values before partitioning.
Consider the sum aggregation operation:
Perform the aggregation operation at each processor Pi on those tuples
stored on disk Di
results in tuples with partial sums at each processor.
The result of the local aggregation is partitioned on the grouping
attributes, and the aggregation is performed again at each processor Pi
to get the final result.
Fewer tuples need to be sent to other processors during
partitioning.
242
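The two-step SUM scheme can be sketched with `Counter` as the partial-sum container. Toy data; the department/salary columns are illustrative:

```python
from collections import Counter

per_disk = [
    [("sales", 10), ("hr", 5)],            # tuples on disk D0
    [("sales", 7), ("hr", 2), ("it", 1)],  # tuples on disk D1
]

# Step 1: each processor computes partial sums over its local tuples.
partials = []
for chunk in per_disk:
    c = Counter()
    for dept, salary in chunk:
        c[dept] += salary
    partials.append(c)

# Step 2: partial results are repartitioned on the grouping attribute and
# combined; only the small partial-sum tuples cross the network.
final = sum(partials, Counter())
print(dict(final))  # {'sales': 17, 'hr': 7, 'it': 1}
```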
Cost of Parallel Evaluation of Operations
If there is no skew in the partitioning, and there is no
overhead due to the parallel evaluation, the parallel execution
time is expected to be 1/n of the sequential time (a speed-up of n)
If skew and overheads are also to be taken into account,
the time taken by a parallel operation can be estimated as
Tpart + Tasm + max (T0, T1, …, Tn-1)
Tpart is the time for partitioning the relations
Tasm is the time for assembling the results
Ti is the time taken for the operation at processor Pi
this needs to be estimated taking into account the skew, and the time
wasted in contentions.
243
Interoperator Parallelism
Pipelined parallelism
Consider a join of four relations
r1 ⋈ r2 ⋈ r3 ⋈ r4
Set up a pipeline that computes the three joins in parallel
Let P1 be assigned the computation of
temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = temp1 ⋈ r3
And P3 be assigned the computation of temp2 ⋈ r4
Each of these operations can execute in parallel, sending result
tuples it computes to the next operation even as it is computing
further results
Provided a pipelineable join evaluation algorithm (e.g. indexed nested
loops join) is used
244
Factors Limiting Utility of Pipeline
Parallelism
Pipeline parallelism is useful since it avoids writing
intermediate results to disk
Useful with small number of processors, but does not
scale up well with more processors. One reason is that
pipeline chains do not attain sufficient length.
Cannot pipeline operators which do not produce output
until all inputs have been accessed (e.g. aggregate and
sort)
Little speedup is obtained for the frequent cases of skew
in which one operator's execution cost is much higher than
the others.
245
Independent Parallelism
Independent parallelism
Consider a join of four relations
r1 ⋈ r2 ⋈ r3 ⋈ r4
Let P1 be assigned the computation of temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = r3 ⋈ r4
And P3 be assigned the computation of temp1 ⋈ temp2
P1 and P2 can work independently in parallel
P3 has to wait for input from P1 and P2
Can pipeline output of P1 and P2 to P3, combining independent parallelism
and pipelined parallelism
Does not provide a high degree of parallelism
useful with a lower degree of parallelism.
less useful in a highly parallel system,
246
Design of Parallel Systems
Some issues in the design of parallel systems:
Parallel loading of data from external sources is needed in
order to handle large volumes of incoming data.
Resilience to failure of some processors or disks.
Probability of some disk or processor failing is higher in a
parallel system.
Operation (perhaps with degraded performance) should be
possible in spite of failure.
Redundancy achieved by storing extra copy of every data item
at another processor.
247
Design of Parallel Systems (Cont.)
Online reorganization of data and schema changes must
be supported.
For example, index construction on terabyte databases can take
hours or days even on a parallel system.
Need to allow other processing (insertions/deletions/updates) to be
performed on relation even as index is being constructed.
Basic idea: index construction tracks changes and “catches up”
on changes at the end.
Also need support for online repartitioning and schema
changes (executed concurrently with other processing).
248