Advances in Data Management: Parallel and Distributed Query Processing

Parallel implementation of the relational algebra; Distributed Query Processing

The basic foundations of parallel query evaluation are the splitting
and merging of parallel streams of data:
Data streams arising from different disks or processors provide the
input for each operator in the query.
The results being produced as an operator is evaluated in parallel
over a set of nodes need to be merged at one node in order to
produce the overall result of the operator.
This result set may then be split again in order to parallelise
subsequent processing in the query.


Sort
Given the frequency with which sorting is required in query
processing, parallel sorting algorithms have been much studied.
A commonly used parallel sorting algorithm is the parallel merge
sort:
This first sorts each fragment of the relation individually on its
local disk, e.g. using the external merge sort algorithm we have
already looked at.
Groups of fragments are then shipped to one node per group,
which merges the group of fragments into a larger sorted fragment.
This process repeats until a single sorted fragment is produced at
one of the nodes.
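Below is a minimal Python sketch of these two phases, using multiprocessing worker processes to stand in for the nodes; the fragment contents, the fan-in of two fragments per merging node, and the final merge at a single "node" are illustrative assumptions.

```python
# A sketch of parallel merge sort over horizontally fragmented data.
import heapq
from multiprocessing import Pool

def sort_fragment(fragment):
    # Phase 1: each node sorts its own fragment locally
    # (in a real system this would be an external merge sort on local disk).
    return sorted(fragment)

def merge_group(group_of_sorted_fragments):
    # Phase 2: a node receives a group of sorted fragments and merges them
    # into one larger sorted fragment.
    return list(heapq.merge(*group_of_sorted_fragments))

if __name__ == "__main__":
    fragments = [[5, 1, 9], [7, 2], [8, 3, 6], [4, 0]]     # one list per node
    with Pool() as pool:
        sorted_frags = pool.map(sort_fragment, fragments)   # local sorts in parallel
        # group the sorted fragments in pairs; each group is merged on one node
        groups = [sorted_frags[i:i + 2] for i in range(0, len(sorted_frags), 2)]
        merged = pool.map(merge_group, groups)
    # repeat the merging until a single sorted result remains at one node
    result = merge_group(merged)
    print(result)
```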


Projection and selection


Selection and projection operators on a relation are usually
performed together in one scan. This can be considered as a
reduction operator.
Reducing a fragment at a single node can be done as in a
centralised DBMS, i.e. by a sequential scan or by utilising
appropriate indexes if available.
If duplicates are permitted in the overall result, then the
fragments can just be shipped to one node to be merged.
If duplicates need to be eliminated, then a variant of the parallel
merge sort algorithm described in the previous slide can be used
to sort the result while at the same time eliminating duplicates.
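As a small illustration, the following Python sketch (with lists of dictionaries standing in for fragments) performs selection and projection in a single scan of each fragment and then merges the partial results, either keeping duplicates or eliminating them. The sample data is illustrative, and duplicate elimination here uses a set rather than the parallel merge sort, purely for brevity.

```python
# Selection and projection combined into one scan per fragment ("reduction"),
# followed by a merge of the partial results at one node.
def select_project(fragment, predicate, attrs):
    # one sequential scan of a local fragment: filter, then project
    return [tuple(row[a] for a in attrs) for row in fragment if predicate(row)]

fragments = [
    [{"empID": 1, "site": "A", "salary": 20000}, {"empID": 2, "site": "B", "salary": 35000}],
    [{"empID": 3, "site": "A", "salary": 28000}, {"empID": 1, "site": "A", "salary": 20000}],
]
partials = [select_project(f, lambda r: r["salary"] < 30000, ("empID", "site"))
            for f in fragments]                      # in parallel, one per node
with_dups = [t for p in partials for t in p]         # merge only: duplicates allowed
no_dups = sorted(set(with_dups))                     # merge with duplicate elimination
print(with_dups, no_dups)
```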


Join
R ⋈ S using Parallel Nested Loops or Index Nested Loops
First, the ‘outer’ relation has to be chosen. In particular, if S has
an appropriate index on the join attribute(s), then R should be the
outer relation.
All the fragments of R are then shipped to all nodes, so each node
i now has a whole copy of R as well as its own fragment of S, Si.
The local joins R ⋈ Si are performed in parallel on all the nodes,
and the results are finally shipped to a chosen node for merging.
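A minimal Python sketch of this scheme is given below; the relations, the join attribute cname, and the sequential loop standing in for the per-node parallelism are all illustrative assumptions.

```python
# Parallel nested loops join: a full copy of the outer relation R is shipped
# to every node, each node joins it with its local fragment of S, and the
# partial results are merged at a chosen node.
def local_nested_loops_join(R_copy, S_fragment, key):
    out = []
    for r in R_copy:                       # R is the (broadcast) outer relation
        for s in S_fragment:               # Si is the local fragment of S
            if r[key] == s[key]:
                out.append({**r, **s})
    return out

R = [{"accno": 1, "cname": "ann"}, {"accno": 2, "cname": "bob"}]
S_fragments = [[{"cname": "ann", "city": "London"}],
               [{"cname": "bob", "city": "Leeds"}]]
partials = [local_nested_loops_join(R, Si, "cname") for Si in S_fragments]  # in parallel
result = [t for p in partials for t in p]   # ship partial results to one node and merge
print(result)
```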


R ⋈ S using Parallel Sort-Merge Join (for natural/equijoins)
The first phase of this involves sorting R and S on the join
attribute(s). These sorts can be performed using the parallel merge
sort operation.
The sorted relations are then partitioned across the nodes using
range partitioning with the same subranges on the join attribute(s)
for both relations.
The local joins of each pair of sorted fragments Ri ⋈ Si are
performed in parallel, and the results are finally shipped to a
chosen node for merging.
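The following Python sketch illustrates the partitioning and local-join phases under simple assumptions: the relations are already sorted, a single range boundary defines the subranges, and an ordinary merge join runs on each pair of fragments.

```python
# Parallel sort-merge join: range partition both sorted relations on the join
# key with the same subranges, then merge-join each pair of fragments locally.
def range_partition(sorted_rel, key, boundaries):
    parts = [[] for _ in range(len(boundaries) + 1)]
    for t in sorted_rel:
        i = sum(t[key] >= b for b in boundaries)     # index of the subrange
        parts[i].append(t)
    return parts

def merge_join(Ri, Si, key):
    out, i, j = [], 0, 0
    while i < len(Ri) and j < len(Si):
        if Ri[i][key] < Si[j][key]:
            i += 1
        elif Ri[i][key] > Si[j][key]:
            j += 1
        else:
            v = Ri[i][key]
            rs = [r for r in Ri[i:] if r[key] == v]  # all R tuples with this key
            ss = [s for s in Si[j:] if s[key] == v]  # all S tuples with this key
            out += [{**r, **s} for r in rs for s in ss]
            i += len(rs); j += len(ss)
    return out

R = sorted([{"k": 1, "a": "x"}, {"k": 5, "a": "y"}], key=lambda t: t["k"])
S = sorted([{"k": 1, "b": "p"}, {"k": 5, "b": "q"}], key=lambda t: t["k"])
R_parts, S_parts = range_partition(R, "k", [3]), range_partition(S, "k", [3])
result = [t for Ri, Si in zip(R_parts, S_parts) for t in merge_join(Ri, Si, "k")]
print(result)
```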


R ⋈ S using Parallel Hash Join (for natural/equijoins only)
Each bucket of R and S is logically assigned to one node.
The first hashing phase, using the first hash function h1, is
undertaken in parallel on all nodes. Each tuple t from R or S
is shipped to node i if the bucket assigned to it by h1 is the ith
bucket.
The next phase is also undertaken in parallel on all nodes. On each
node i, a hash table is created from the local fragment of R, Ri,
using another hash function h2. The local fragment of S, Si, is
then scanned and h2 is used to probe the hash table for matching
records of Ri for each record of Si.
The results produced at each node are shipped to a chosen node
for merging.
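A minimal Python sketch of the two phases is shown below; the number of nodes, the hash functions h1 and h2, and the sample data are illustrative assumptions, with ordinary lists standing in for the per-node buckets.

```python
# Parallel hash join for an equijoin: phase 1 routes tuples of R and S to
# nodes with h1 (one bucket per node); phase 2 builds a local hash table on
# Ri with h2 and probes it with Si.
NUM_NODES = 2
h1 = lambda v: hash(v) % NUM_NODES            # routes tuples to nodes
h2 = lambda v: hash((v, "salt"))              # builds/probes the local hash table

def partition(rel, key):
    buckets = [[] for _ in range(NUM_NODES)]
    for t in rel:
        buckets[h1(t[key])].append(t)         # ship tuple t to node h1(t[key])
    return buckets

def local_hash_join(Ri, Si, key):
    table = {}
    for r in Ri:                              # build phase on the local R fragment
        table.setdefault(h2(r[key]), []).append(r)
    out = []
    for s in Si:                              # probe phase with the local S fragment
        for r in table.get(h2(s[key]), []):
            if r[key] == s[key]:              # guard against h2 collisions
                out.append({**r, **s})
    return out

R = [{"cname": "ann", "accno": 1}, {"cname": "bob", "accno": 2}]
S = [{"cname": "ann", "city": "London"}, {"cname": "cat", "city": "Leeds"}]
R_parts, S_parts = partition(R, "cname"), partition(S, "cname")
partials = [local_hash_join(Ri, Si, "cname") for Ri, Si in zip(R_parts, S_parts)]
print([t for p in partials for t in p])       # merge at a chosen node
```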


Parallel Query Optimisation

The costs of the parallel versions of the operators differ from those of their centralised counterparts:


The costs of partitioning and merging the data now need to
be taken into account.
If the data distribution is skewed, this will have an impact on
the overall time taken to complete the evaluation.


The results being produced by one operator can be pipelined
into the evaluation of another operator that is executing at
the same time on a different node.
For example, consider this left-deep join tree where all the
joins are nested loops joins:

((R1 ⋈ R2) ⋈ R3) ⋈ R4

The tuples being produced by R1 ⋈ R2 can be used to ‘probe’
R3, and the resulting tuples can be used to probe R4.
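The sketch below mimics this pipelining with Python generators: each join operator consumes tuples as they arrive and emits matches immediately, so no operator has to wait for its input to be fully materialised. The dictionaries acting as indexes on R2, R3 and R4, and the join attribute k, are illustrative assumptions.

```python
# Pipelined execution of the left-deep plan ((R1 join R2) join R3) join R4,
# with indexes on the inner relations modelled as dictionaries.
def index_nested_loops(outer, index, key):
    for t in outer:                            # consume tuples as they arrive
        for m in index.get(t[key], []):
            yield {**t, **m}                   # emit immediately: no blocking

R1 = [{"k": 1, "a": 10}, {"k": 2, "a": 20}]
R2_index = {1: [{"k": 1, "b": "x"}], 2: [{"k": 2, "b": "y"}]}
R3_index = {1: [{"k": 1, "c": "p"}]}
R4_index = {1: [{"k": 1, "d": "q"}]}

stage1 = index_nested_loops(R1, R2_index, "k")      # R1 join R2
stage2 = index_nested_loops(stage1, R3_index, "k")  # ... join R3, probed as tuples arrive
stage3 = index_nested_loops(stage2, R4_index, "k")  # ... join R4
print(list(stage3))
```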


There is now the possibility of executing different operators of
the query concurrently on different nodes.
For example, with this bushy join tree:

(R1 ⋈ R2) ⋈ (R3 ⋈ R4)

the join of R1 with R2 and the join of R3 with R4 can be
executed concurrently. The partial results of these joins can
also be pipelined into their parent join operator.


In uniprocessor systems only left-deep join orders are usually
considered (and this still gives n! possible join orders for a join of n
relations).
In multi-processor systems, other join orders can result in more
parallelisation, e.g. bushy trees, as illustrated above. Even if such a
plan is more costly in terms of the number of I/O operations
performed, it may execute more quickly than a left-deep plan due
to the increased parallelism it affords.
So, in general, the number of candidate query plans is much
greater in parallel database systems and more heuristics need to be
employed to limit the search space of query plans.


The purpose of distributed query processing is to process global
queries, i.e. queries expressed with respect to the global or external
schemas of a DDB system.
The local query processor at each site is responsible for
processing sub-queries of global queries that can be evaluated at
that site.
A global query processor is needed at sites of the DDB system to
which global queries can be submitted. This will optimise each
global query, distribute sub-queries of the query to the appropriate
local query processors, and collect the results of these sub-queries
to return to the user.


In more detail, processing global queries consists of:
1. translating the query into a query tree;
2. replacing fragmented relations in this tree by their definition
as unions/joins of their horizontal/vertical fragments;
3. simplifying the resulting tree using several heuristics (see
below);
4. global query optimisation, resulting in the selection of a query
plan; this consists of sub-queries each of which will be
executed at one local site; the query plan is annotated with
the data transmission that will occur between sites;
5. local processing of the local sub-queries; this may include
further optimisation of the local sub-queries, based on local
information about access paths and database statistics.


In Step 3, the simplifications that can be carried out in the case of
horizontal partitioning include the following:
eliminating fragments from the argument to a selection
operation if they cannot contribute any tuples to the result of
that selection;
distributing join operations over unions of fragments, and
eliminating joins that can yield no tuples.


For example, suppose a table Employee(empID, site, salary, ...) is
horizontally fragmented into four fragments:

E1 = σ_{site='A' AND salary<30000}(Employee)
E2 = σ_{site='A' AND salary>=30000}(Employee)
E3 = σ_{site='B' AND salary<30000}(Employee)
E4 = σ_{site='B' AND salary>=30000}(Employee)

then the query σ_{salary<25000}(Employee) is replaced in Step 2 by

σ_{salary<25000}(E1 ∪ E2 ∪ E3 ∪ E4)

which simplifies to

σ_{salary<25000}(E1 ∪ E3)
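The sketch below illustrates this elimination step, under the assumption that each fragment's defining predicate is recorded as a site plus a salary interval: a fragment is kept only if its interval can overlap the selection condition salary < 25000.

```python
# Fragment definitions from the example above, recorded as metadata;
# a fragment can contribute to the selection salary < 25000 only if its
# minimum salary is below 25000.
fragments = {
    "E1": {"site": "A", "salary_min": 0,     "salary_max": 29999},
    "E2": {"site": "A", "salary_min": 30000, "salary_max": None},
    "E3": {"site": "B", "salary_min": 0,     "salary_max": 29999},
    "E4": {"site": "B", "salary_min": 30000, "salary_max": None},
}
threshold = 25000
useful = [name for name, f in fragments.items() if f["salary_min"] < threshold]
print(useful)   # ['E1', 'E3'] -- E2 and E4 cannot contribute any tuples
```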


For example, suppose a table WorksIn(empID, site, project, ...) is
horizontally fragmented into two fragments:

W1 = σ_{site='A'}(WorksIn)
W2 = σ_{site='B'}(WorksIn)

then the query Employee ⋈ WorksIn is replaced in Step 2 by:

(E1 ∪ E2 ∪ E3 ∪ E4) ⋈ (W1 ∪ W2)

Distributing the join over the unions of fragments gives:

(E1 ⋈ W1) ∪ (E2 ⋈ W1) ∪ (E3 ⋈ W1) ∪ (E4 ⋈ W1) ∪
(E1 ⋈ W2) ∪ (E2 ⋈ W2) ∪ (E3 ⋈ W2) ∪ (E4 ⋈ W2)

and this simplifies to:

(E1 ⋈ W1) ∪ (E2 ⋈ W1) ∪ (E3 ⋈ W2) ∪ (E4 ⋈ W2)
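A corresponding sketch for the join simplification, assuming the natural join includes the common site attribute, so that a pair Ei ⋈ Wj can be discarded whenever the defining site predicates of Ei and Wj conflict:

```python
# Keep only the fragment pairs whose defining site predicates can both hold;
# fragment names and sites mirror the example above.
E_site = {"E1": "A", "E2": "A", "E3": "B", "E4": "B"}
W_site = {"W1": "A", "W2": "B"}
kept = [(e, w) for e in E_site for w in W_site if E_site[e] == W_site[w]]
print(kept)   # [('E1', 'W1'), ('E2', 'W1'), ('E3', 'W2'), ('E4', 'W2')]
```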


One simplification that can be carried out in Step 3 in the case of
vertical partitioning is that fragments in the argument of a
projection operation which have no non-key attributes in common
with the projection attributes can be eliminated.
For example, if a table Projects(projNum, budget, location,
projName) is vertically partitioned into two fragments:

P1 = π_{projNum,budget,location}(Projects)
P2 = π_{projNum,projName}(Projects)

then the query π_{projNum,location}(Projects) is replaced in Step 2 by:

π_{projNum,location}(P1 ⋈ P2)

which simplifies to:

π_{projNum,location}(P1)

assuming that each projNum value in P1 also appears in P2.
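A small sketch of this check, under the assumption that projNum is the key shared by both fragments: a fragment is needed only if it shares some non-key attribute with the projection attributes.

```python
# Attribute sets of the fragments and of the projection, from the example above.
key = {"projNum"}
fragments = {"P1": {"projNum", "budget", "location"},
             "P2": {"projNum", "projName"}}
proj_attrs = {"projNum", "location"}
# P2 shares only the key with the projection, so it can be eliminated.
needed = [name for name, attrs in fragments.items()
          if (attrs - key) & (proj_attrs - key)]
print(needed)   # ['P1']
```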


Step 4 (Query Optimisation) consists of generating a set of
alternative query plans, estimating the cost of each plan, and
selecting the cheapest plan.
It is carried out in much the same way as for centralised query
optimisation, but now communication costs must also be taken
into account when estimating the overall cost of a query plan.
Also, the replication of relations or fragments of relations is now a
factor — there may be a choice of which replica to use within a
given query plan, with different costs being associated with using
different replicas.
Another difference from centralised query optimisation is that there
may be a speed-up in query execution times due to the parallel
processing of parts of the query at different sites.


Distributed Processing of Joins

Given the potential size of the results of join operations, the
efficient processing of joins is a significant aspect of global query
processing in distributed databases, and a number of distributed
join algorithms have been developed. These include:
the full-join method, and
the semi-join method.


Full-join method
The simplest method for computing R ⋈ S at the site of S
consists of shipping R to the site of S and doing the join there.
This has a cost of

cost of reading R + c × pages(R) + cost of computing R ⋈ S at site(S)

where c is the cost of transmitting one page of data from the site
of R to the site of S, and pages(R) is the number of pages that R
consists of.
If the result of this join were needed at a different site, then there
would also be the additional cost of sending the result of the join
from site(S) to where it is needed.


Semi-join method
This is an alternative method for computing R ⋈ S at the site of S
and consists of the following steps:
(i) Compute π_{R∩S}(S) at the site of S, where π_{R∩S} denotes
projection on the common attributes of R and S.
(ii) Ship π_{R∩S}(S) to the site of R.
(iii) Compute R ⋉ S at the site of R, where ⋉ is the semi-join
operator, defined as follows:

R ⋉ S = R ⋈ π_{R∩S}(S)

(iv) Ship R ⋉ S to the site of S.
(v) Compute R ⋈ S at the site of S, using the fact that

R ⋈ S = (R ⋉ S) ⋈ S
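A minimal Python sketch of these five steps is shown below, with lists of dictionaries standing in for the relations at their two sites, and cname assumed to be the single common attribute.

```python
# The semi-join method for computing R join S at the site of S.
def project(rel, attrs):
    return [{a: t[a] for a in attrs} for t in rel]

def join(R, S, key):
    return [{**r, **s} for r in R for s in S if r[key] == s[key]]

R = [{"accno": 1, "cname": "ann"}, {"accno": 2, "cname": "dan"}]   # at site(R)
S = [{"cname": "ann", "city": "London"}]                            # at site(S)

# (i) project S on the common attribute(s) at site(S)
S_proj = project(S, ["cname"])
# (ii) ship S_proj to site(R); (iii) compute the semi-join of R with S there
R_semi = [r for r in R if any(r["cname"] == p["cname"] for p in S_proj)]
# (iv) ship the semi-join result back to site(S); (v) join it with S there
result = join(R_semi, S, "cname")
print(result)
```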


Example 1. Consider the following relations, stored at different
sites:
R = accounts(accno, cname, balance)
S = customer(cname, address, city, telno, creditRating)
Suppose we need to compute R ⋈ S at the site of S.
Suppose also that
accounts contains 100,000 tuples on 1,000 pages
customer contains 50,000 tuples on 500 pages
the cname field of S occupies 20% of each record of S


With the full-join method we have a cost of

cost of reading R + c × pages(R) + cost of computing R ⋈ S at site(S)

which is 1000 I/Os to read R, plus (c × 1000) to transmit R to the
site of S, plus 1000 I/Os to save it there, plus (3 × (1000 + 500))
I/Os (assuming a hash join) to perform the join. This gives a total
cost of:

(c × 1000) + 6500 I/Os


With the semi-join method we have the cost of:
(i) Computing π_{R∩S}(S) at site(S), i.e. 500 I/Os to scan S,
generating 100 pages of just the cname values.
(ii) Shipping π_{R∩S}(S) to site(R), i.e. c × 100, and saving it there,
i.e. 100 I/Os.
(iii) Computing R ⋉ S at site(R), i.e. 3 × (100 + 1000) I/Os,
assuming a hash join.
(iv) Shipping the result of R ⋉ S to the site of S, i.e. c × 1000,
and saving it there, i.e. 1000 I/Os (all of R is shipped since
cname in R is a foreign key, so every tuple of R joins).
(v) Computing R ⋈ S at site(S), i.e. 3 × (1000 + 500) I/Os.


This gives a total cost of

(c × 1100) + 9400 I/Os

So in this case the full join method ((c × 1000) + 6500 I/Os) is
cheaper: we have gained nothing by using the semi-join method
since all the tuples of R join with tuples of S.
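The following small sketch reproduces this arithmetic, with the page counts passed in as parameters and the hash-join cost modelled as 3 × (pages of one input + pages of the other); the function names and the (pages shipped, I/Os) return convention are just for illustration.

```python
# Cost comparison for Example 1: pages(R) = 1000, pages(S) = 500,
# 100 pages of projected cnames, and all of R shipped back (1000 pages).
def full_join_cost(pR, pS):
    read_R, save_R, join = pR, pR, 3 * (pR + pS)
    return pR, read_R + save_R + join            # (pages shipped, I/Os)

def semi_join_cost(pR, pS, p_proj, p_semi):
    ios = (pS                        # (i)   scan S to project it
           + p_proj                  # (ii)  save the projection at site(R)
           + 3 * (p_proj + pR)       # (iii) semi-join at site(R), hash join
           + p_semi                  # (iv)  save the semi-join result at site(S)
           + 3 * (p_semi + pS))      # (v)   final join at site(S)
    return p_proj + p_semi, ios      # (pages shipped, I/Os)

print(full_join_cost(1000, 500))             # (1000, 6500): (c x 1000) + 6500 I/Os
print(semi_join_cost(1000, 500, 100, 1000))  # (1100, 9400): (c x 1100) + 9400 I/Os
```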


Example 2. Let R be as above (i.e., accounts) and let

S = σ_{city='London'}(customer)

Suppose again that we need to compute R ⋈ S at the site of S.
Suppose also that there are 100 different cities in customer, that
there is a uniform distribution of customers across cities, and a
uniform distribution of accounts over customers. So S contains
500 tuples on 5 pages.
With the full-join method we have a cost of

cost of reading R + c × pages(R) + cost of computing R ⋈ S at site(S)

which is 1000 + (c × 1000) + 1000 + (3 × (1000 + 5)) I/Os

= (c × 1000) + 5015 I/Os.


With the semi-join method we have the cost of:
(i) Computing π_{R∩S}(S) at site(S), i.e. 5 I/Os to scan S,
generating 1 page of cname values.
(ii) Shipping π_{R∩S}(S) to site(R), i.e. c × 1, plus 1 I/O to save it
there.
(iii) Computing R ⋉ S at site(R), i.e. 3 × (1 + 1000) I/Os, assuming a
hash join.
(iv) Shipping R ⋉ S to the site of S, i.e. c × 10 since, due to the
uniform distribution of accounts over customers, 1/100th of R
will match the cname values sent to it from S.
Plus the cost of saving the result of R ⋉ S at the site of S, i.e.
10 I/Os.
(v) Computing R ⋈ S at site(S), i.e. 3 × (10 + 5) I/Os.


The overall cost is thus (c × 11) + 3064 I/Os versus
(c × 1000) + 5015 I/Os for the full-join method.
So in this case the semi-join method is cheaper. This is because a
significant number of tuples of R do not join with S and so are not
sent to the site of S.
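For what it is worth, the cost functions sketched after Example 1 reproduce these figures too: full_join_cost(1000, 5) gives (1000, 5015), and semi_join_cost(1000, 5, 1, 10) gives (11, 3064).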


Bloom join method


In general, a Bloom filter is an efficient memory structure (usually
a bit vector) used to approximate the contents of a set S in the
following sense:
given an item i,
if the filter returns false or 0 for i, then i is definitely not in S;
if it returns true or 1, then i may (or may not) be in S.

Apart from their use in distributed join processing, Bloom filters
are also used in NoSQL systems such as BigTable to avoid having
to read an SSTable in full to find out that the key being searched
for does not appear in it.


The Bloom join method is similar to the semi-join method, as it
too aims to reduce the amount of data being sent from the site of
R to the site of S.
However, rather than doing a projection on S and sending the
resulting data to the site of R, a bit-vector of a fixed size k is
computed by hashing each tuple of S to the range [0..k − 1] (using
the join attribute values). The ith bit of the vector is set to 1 if
some tuple of S hashes to i and is set to 0 otherwise.
Then, at the site of R, the tuples of R are also hashed to [0..k − 1]
(using the same hash function and the join attribute values), and
only those tuples of R whose hash value corresponds to a 1 in the
bit-vector sent from S are retained for shipping to the site of S.
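A minimal Python sketch of the method is shown below, assuming a k-bit vector represented as a list of 0/1 values and a single hash function on the join attribute; the relations and the choice of k = 64 are illustrative.

```python
# Bloom join: build a bit-vector at site(S), ship it to site(R) to filter R,
# then ship the retained tuples of R to site(S) for the actual join.
K = 64
bloom_hash = lambda v: hash(v) % K

R = [{"accno": 1, "cname": "ann"}, {"accno": 2, "cname": "dan"}]   # at site(R)
S = [{"cname": "ann", "city": "London"}]                            # at site(S)

# At site(S): build the bit-vector from the join attribute values of S
bits = [0] * K
for s in S:
    bits[bloom_hash(s["cname"])] = 1

# Ship 'bits' to site(R): keep only tuples of R whose hash hits a 1 bit.
# False positives are possible, so some retained tuples may still not join.
R_filtered = [r for r in R if bits[bloom_hash(r["cname"])] == 1]

# Ship R_filtered to site(S) and perform the join there
result = [{**r, **s} for r in R_filtered for s in S if r["cname"] == s["cname"]]
print(result)
```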


The cost of shipping the bit-vector from the site of S to the site of
R is less than the cost of shipping the projection of S in the
semi-join method.
However, the size of the subset of R that is sent back to the site of
S is likely to be larger (since only approximate matching of tuples
is taking place now), and so the shipping costs and join costs are
likely to be higher than with the semi-join method.
