1 Terabyte = 1,000,000,000,000 bytes!
Scanned at 10 MB/s
Parallelism:
divide a big problem into many smaller ones to be solved in parallel.
1. Parallel DB / D.S. Jagli, 2/12/2013
Parallel DB
A parallel database system seeks to improve performance by
parallelizing operations such as loading data, building indexes,
and evaluating queries, using multiple CPUs and disks in parallel.
INTERQUERY PARALLELISM
It is possible to process a number of transactions in
parallel with each other.
Improves Throughput.
INTRAQUERY PARALLELISM
It is possible to process ‘sub-tasks’ of a transaction in
parallel with each other.
Speed-Up
– Adding more resources results in proportionally less running time for a
fixed amount of data.
10 seconds to scan a DB of 10,000 records using 1 CPU
1 second to scan a DB of 10,000 records using 10 CPUs
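The scan figures above can be checked with a small sketch. The function name `speed_up` is only illustrative; it assumes speed-up is measured as elapsed time on the smaller system divided by elapsed time on the larger one, for the same amount of data.

```python
def speed_up(time_small_system: float, time_large_system: float) -> float:
    """Elapsed time on the small system divided by elapsed time on the large one."""
    return time_small_system / time_large_system


# Figures from the scan example: 10 s with 1 CPU, 1 s with 10 CPUs.
observed = speed_up(10.0, 1.0)
resource_ratio = 10 / 1  # ten times as many CPUs

# Speed-up is linear when it matches the increase in resources.
print(observed, observed == resource_ratio)
```

A speed-up below the resource ratio (for example 6 with 10 CPUs) would be sub-linear.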
Scale-Up
If resources are increased in proportion to an increase in data/problem
size, the overall time should remain constant
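A minimal sketch of the scale-up measure, assuming it is computed as the time for the small problem on the small system divided by the time for the proportionally larger problem on the larger system. The timings below are made up for illustration and do not come from the slides.

```python
def scale_up(time_small: float, time_large: float) -> float:
    """Small problem on small system / large problem on large system."""
    return time_small / time_large


# Hypothetical timings: 5 CPUs on a 1 GB database, 10 CPUs on a 2 GB database.
ideal = scale_up(20.0, 20.0)     # time stayed constant
degraded = scale_up(20.0, 25.0)  # the larger configuration was slower

print(ideal, degraded)
```

A value of 1.0 means linear scale-up; anything below 1.0 means the larger system lost ground as the data grew.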
[Figure: speed-up. Number of transactions/second (1000/Sec, 1600/Sec, 2000/Sec) against number of CPUs, showing sub-linear speed-up.]
PARALLEL DBMSs: SCALE-UP
[Figure: scale-up. Number of transactions/second against number of CPUs and database size (5 CPUs with a 1 GB database, 10 CPUs with a 2 GB database).]
PARALLEL QUERY EVALUATION
A relational query execution plan is a graph (tree) of
relational algebra operators; based on this structure, operators
can execute in parallel.
2. Vertical Partitioning
1. Range Partitioning
Tuples are sorted (conceptually), and n ranges are chosen for
the sort key values so that each range contains roughly the
same number of tuples;
tuples in range i are assigned to processor i.
E.g.:
sailor_id 1-10 assigned to disk 1
sailor_id 11-20 assigned to disk 2
sailor_id 21-30 assigned to disk 3
Range partitioning can lead to data skew; that is, partitions with widely
varying numbers of tuples.
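A minimal sketch of range partitioning and of the skew it can produce. It assumes three disks with inclusive upper bounds 10, 20 and 30 (so the boundary values 10 and 20 belong to the lower range); the function name and the sample sailor ids are illustrative only.

```python
from bisect import bisect_left
from collections import Counter

# Inclusive upper bound of each range, one entry per disk (assumed boundaries).
UPPER_BOUNDS = [10, 20, 30]


def range_partition(key: int, upper_bounds=UPPER_BOUNDS) -> int:
    """Return the 1-based disk number whose range contains key."""
    idx = bisect_left(upper_bounds, key)
    if idx == len(upper_bounds):
        raise ValueError(f"key {key} is above the last range")
    return idx + 1


# Made-up sailor ids clustered in the first range to illustrate skew.
sailor_ids = [1, 2, 3, 4, 5, 6, 7, 8, 12, 25]
tuples_per_disk = Counter(range_partition(sid) for sid in sailor_ids)
print(dict(tuples_per_disk))
```

With these ids, disk 1 holds eight tuples while disks 2 and 3 hold one each, which is the kind of imbalance the ranges were meant to avoid.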
Hash partitioning has the additional virtue that it keeps data evenly
distributed even if the data grows and shrinks over time.
If only a subset of the tuples (e.g., those that satisfy the selection
condition age = 20) is required, hash partitioning and range partitioning
are better than round-robin partitioning, because only the disks that can
hold matching tuples need to be accessed.
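The selection argument can be sketched as follows. This assumes three disks numbered 0 to 2, a relation that is hash partitioned on age, and a deliberately simple hash function (key modulo the number of disks); all names are illustrative.

```python
NUM_DISKS = 3


def round_robin_partition(position: int, n: int = NUM_DISKS) -> int:
    """The i-th tuple goes to disk i mod n, regardless of its contents."""
    return position % n


def hash_partition(key: int, n: int = NUM_DISKS) -> int:
    """A simple stand-in hash function: key mod n."""
    return key % n


def disks_to_search(scheme: str, age: int, n: int = NUM_DISKS) -> set:
    """Disks that may hold tuples satisfying the selection age = <age>."""
    if scheme == "hash":
        # Equal keys always hash to the same disk.
        return {hash_partition(age, n)}
    # Round-robin placement ignores the key, so every disk must be scanned.
    return set(range(n))


print(disks_to_search("hash", 20))
print(disks_to_search("round_robin", 20))
```

Under these assumptions the hash-partitioned relation answers age = 20 from one disk, while the round-robin relation needs all three.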
[Figure: data partitioned across five disks, with partitions labeled A...E, F...J, K...N, O...S, T...Z]
Techniques
1. Bulk loading & scanning
2. Sorting
3. Joins
2. By using the same partitioning function for both A and B, we ensure that
the union of the k smaller joins computes the join of A and B.
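A minimal sketch of the partitioned join, assuming an equality join and tuples represented as Python tuples. Both relations are split with the same hash-based partitioning function, each pair of partitions is joined locally, and the k results are combined. The loop runs sequentially here; in a parallel DBMS each iteration would run on its own processor. All names and sample data are illustrative.

```python
from collections import defaultdict


def partition(tuples, key_index, k):
    """Split tuples into k partitions by hashing the join column."""
    parts = defaultdict(list)
    for t in tuples:
        parts[hash(t[key_index]) % k].append(t)
    return parts


def local_join(a_part, b_part, a_key, b_key):
    """Hash join of one pair of partitions."""
    index = defaultdict(list)
    for b in b_part:
        index[b[b_key]].append(b)
    return [a + b for a in a_part for b in index.get(a[a_key], [])]


def partitioned_join(A, B, a_key, b_key, k):
    a_parts = partition(A, a_key, k)
    b_parts = partition(B, b_key, k)
    result = []
    for i in range(k):  # one smaller join per partition
        result.extend(local_join(a_parts[i], b_parts[i], a_key, b_key))
    return result


A = [(1, "x"), (2, "y"), (3, "z")]
B = [(1, "p"), (3, "q"), (3, "r"), (4, "s")]
print(sorted(partitioned_join(A, B, 0, 0, k=2)))
```

Because matching tuples always land in the same partition number, no pair is missed and none is produced twice.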
Hash-Join
Sort-merge-join
The result of the join of A and B, that is, the output of the join process,
may be split into several data streams.
I. Introduction to DDBMS
II. Architecture of DDBs
III. Storing data in DDBs
IV. Distributed catalog management
V. Distributed query processing
VI. Transaction Processing
VII. Distributed concurrency control and recovery
I. Introduction to DDBMS
Data in a distributed database system is stored across several sites.
Transparent implies that each user within the system may access all of
the data within all of the databases as if they were a single database.
[Figure: client LANs at Delhi, Mumbai, Hyderabad and Pune connected to one DBMS]
DISTRIBUTED DATABASE ARCHITECTURE
[Figure: four sites (Delhi, Mumbai, Hyderabad, Pune), each with its own DBMS and clients on a LAN]
Distributed database
Communication Network- DBMS and Data at each node
II. DISTRIBUTED DBMS ARCHITECTURES
1. Client-Server Systems:
2. Collaborating Server Systems
3. Middleware Systems
[Figure: clients #1, #2 and #3 connected to server #1 and server #2, each server with its own database; labels: DUMB, NOT TRANSPARENT!]
3. Middleware Systems:
The Middleware architecture is designed to allow a single
query to span multiple servers, without requiring all
database servers to be capable of managing such multisite
execution strategies.
III. Storing Data in DDBs
In a distributed DBMS, relations are stored across
several sites.
Accessing a relation that is stored at a remote site
includes message-passing costs.
A single relation may be partitioned or fragmented
across several sites.
IV. Distributed Catalog Management
1. Naming Objects
• If a relation is fragmented and replicated, we must be able to
uniquely identify each replica of each fragment.
1. A local name field
2. A birth site field
2. Catalog Structure
A centralized system catalog can be used (it is vulnerable to failure of
the site containing the catalog).
An alternative is to maintain a copy of a global system catalog at every
site (this compromises site autonomy).
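The naming scheme under "Naming Objects" can be sketched as a pair of small record types. The local name and birth site fields come from the list above; the `fragment_id` and `replica_id` fields are hypothetical additions used here only to show how each replica of each fragment could be told apart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalRelationName:
    local_name: str  # name chosen at the site where the relation was created
    birth_site: str  # site where the relation was created


@dataclass(frozen=True)
class ReplicaName:
    relation: GlobalRelationName
    fragment_id: int  # hypothetical field: which fragment of the relation
    replica_id: int   # hypothetical field: which copy of that fragment


# Two sites can both create a relation called "Sailors" without a clash.
mumbai_sailors = GlobalRelationName("Sailors", "Mumbai")
delhi_sailors = GlobalRelationName("Sailors", "Delhi")

# Two replicas of the same fragment still get distinct names.
replicas = {ReplicaName(mumbai_sailors, 1, 1), ReplicaName(mumbai_sailors, 1, 2)}
print(mumbai_sailors != delhi_sailors, len(replicas))
```

Because the birth site is part of the name, a site can name its relations locally without consulting a central authority.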
V. Distributed Query Processing
Distributed query processing: Transform a high-level
query (of relational calculus/SQL) on a distributed database
(i.e., a set of global relations) into an equivalent and
efficient lower-level query (of relational algebra) on
relation fragments.
SELECT S.age
FROM Sailors S
WHERE S.rating > 3 AND S.rating < 7
Suppose that the Sailors relation is horizontally fragmented, with all
tuples having a rating less than 5 at Mumbai and all tuples having a
rating greater than 5 at Delhi.
The DBMS must answer this query by evaluating it at both sites and
taking the union of the answers.
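A minimal sketch of this evaluation, with each fragment held as a Python list of (sid, rating, age) tuples. The sample tuples are made up, and none has a rating of exactly 5, since the text does not say where such a tuple would be stored.

```python
# Horizontal fragments of Sailors as (sid, rating, age); sample data only.
MUMBAI = [(1, 2, 30), (2, 4, 25)]  # tuples with rating < 5
DELHI = [(3, 6, 41), (4, 8, 19)]   # tuples with rating > 5


def local_query(fragment):
    """SELECT S.age FROM Sailors S WHERE S.rating > 3 AND S.rating < 7."""
    return [age for (_sid, rating, age) in fragment if 3 < rating < 7]


def distributed_query(sites):
    """Evaluate the query at every site and combine the answers."""
    result = []
    for fragment in sites:
        result.extend(local_query(fragment))
    return result


print(distributed_query([MUMBAI, DELHI]))
```

Both sites contribute here because the predicate 3 < rating < 7 spans the fragmentation boundary; a predicate such as rating < 4 would only need the Mumbai fragment.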
=1000(2td+ts)+(result)td.
If individual sites are run under the control of different DBMSs, the
autonomy of each site must be respected while doing global query
planning.
The query site constructs a global plan, with suggested local plans describing
processing at each site.
If a site can improve its suggested local plan, it is free to do so.
Three solutions for distributed deadlock detection:
Centralized: send all local waits-for graphs to one site;
Hierarchical: organize sites into a hierarchy and send local graphs to the
parent in the hierarchy;
Timeout: abort a Xact if it waits too long.
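The centralized option can be sketched as follows, assuming each local graph is a waits-for graph stored as a dict that maps a transaction to the set of transactions it is waiting for. The function names and the two-site example are illustrative.

```python
def merge_graphs(local_graphs):
    """Union the local waits-for graphs collected at the central site."""
    global_graph = {}
    for graph in local_graphs:
        for txn, waits_for in graph.items():
            global_graph.setdefault(txn, set()).update(waits_for)
    return global_graph


def has_cycle(graph):
    """Depth-first search; a cycle in the waits-for graph means deadlock."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True  # back edge: nxt is on the current path
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


# T1 waits for T2 at one site, T2 waits for T1 at the other.
site_a = {"T1": {"T2"}}
site_b = {"T2": {"T1"}}
print(has_cycle(site_a), has_cycle(site_b), has_cycle(merge_graphs([site_a, site_b])))
```

Neither local graph contains a cycle on its own; the deadlock only becomes visible once the graphs are combined at one site.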
3. If the coordinator gets all yes votes, it force-writes a commit log record
and sends a commit msg to all subs. Else, it force-writes an abort log rec and
sends an abort msg.
5. The coordinator writes an end log rec after getting ack msgs from all subs.
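A simplified sketch of the coordinator's side of these steps. Messages are modelled as plain function calls, the log is a Python list (an append stands in for a force-write), and the sketch assumes that every subordinate votes and acknowledges without failures or timeouts. The prepare and ack steps are filled in from the usual description of 2PC rather than from the numbered steps shown here.

```python
def two_phase_commit(subordinates, log):
    """Run one simplified 2PC round and return "commit" or "abort".

    subordinates: dict mapping a sub's name to a zero-argument callable that
                  returns its vote ("yes" or "no") in reply to prepare.
    log:          list standing in for the coordinator's log.
    """
    # Phase 1: send prepare to every sub and collect the votes.
    votes = {name: vote() for name, vote in subordinates.items()}

    # Step 3: decide, log the decision, then notify the subs.
    decision = "commit" if all(v == "yes" for v in votes.values()) else "abort"
    log.append(decision)  # stands in for a force-write

    # Assumption: every sub receives the decision and acknowledges it.
    acks = set(subordinates)

    # Step 5: write the end record once all acks have arrived.
    if acks == set(subordinates):
        log.append("end")
    return decision


log = []
outcome = two_phase_commit({"sub1": lambda: "yes", "sub2": lambda: "no"}, log)
print(outcome, log)
```

A single no vote is enough to abort; the commit record is written only when every vote is yes.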
TWO-PHASE COMMIT (2PC) - commit
2. The basic idea is that when the coordinator sends out prepare
messages and receives yes votes from all subordinates.
Queries?