Query Optimization in Distributed Database Systems

Framework for query optimization
• The selection of a query processing strategy
involves:
– determining the physical copies of the fragments upon
which to execute the query
– selecting the order of execution of operations;
in particular, this involves determining a "good"
sequence of joins
– selecting the method for executing each operation
Transmission cost
• Transmission requirements are neutral with respect to
systems; they are typically a function of the amount of data
transmitted among sites
• The optimization of a distributed query can be partitioned
into two independent problems: the distribution of the
access strategy among sites, which is done considering
transmission only, and the determination of local access
strategies at each site, which use traditional methods of
centralized databases
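• Transmission cost:
TC(x) = C0 + C1 * x
where x is the amount of data transmitted, C0 is the fixed cost
of initiating a transmission between two sites, and C1 is the
cost per unit of data transmitted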
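As a rough illustration of the cost formula above, the following Python sketch evaluates the transmission cost for a given amount of data. The function name and the sample values of C0 and C1 are assumptions chosen for illustration; they do not come from the slides.

```python
def transmission_cost(x: float, c0: float, c1: float) -> float:
    """Return TC(x) = C0 + C1 * x for x units of transmitted data."""
    return c0 + c1 * x

# Hypothetical parameters: C0 = 10 cost units per transmission,
# C1 = 0.5 cost units per unit of data.
cost_small = transmission_cost(100, c0=10, c1=0.5)    # 10 + 50  = 60.0
cost_large = transmission_cost(1000, c0=10, c1=0.5)   # 10 + 500 = 510.0
```

With a fixed C0, sending one large message is cheaper than sending the same data in many small ones, which is the motivation for shipping data in blocks.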
Database Profile
A database profile records:
• The number of tuples in each relation Ri (card(Ri))
• The size of each attribute A (size(A))
• The size of Ri (size(Ri)), which is the sum of the sizes of its attributes
• For each attribute A in each relation Ri: the number of
distinct values appearing in Ri (val(A[Ri])), and the max and min values
[Figure: the fragments Supply1, Supply2, Dept1 and Dept2 distributed over two local database systems, LDBS1 and LDBS2]
Supply: card(Supply) = 50 000

        SNUM   PNUM   DEPTNUM   QUAN
size    6      7      2         10
val     3000   1000   30        500

Dept: card(Dept) = 30

        DEPTNUM   NAME   AREA   MGRNUM
size    2         15     1      7
val     30        30     6      30
Supply1: card(Supply1) = 30 000, site(Supply1) = 1

        SNUM   PNUM   DEPTNUM   QUAN
size    6      7      2         10
val     1800   1000   20        500

Dept1: card(Dept1) = 10, site(Dept1) = 2

        DEPTNUM   NAME   AREA   MGRNUM
size    2         15     1      7
val     10        10     2      10
Profile of partial results of algebraic
operations - SELECTION
Let S denote the result of performing a selection on
a relation R
• Cardinality - to each selection we associate a selectivity
factor ρ, which is the fraction of tuples satisfying it
For a simple selection attribute = value (A = v), ρ can be
defined as follows:
ρ = 1/val(A[R])
under the assumption that values are homogeneously
distributed. Thus
card(S) = ρ * card(R)
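A minimal sketch of this estimate in Python, applied to the Supply relation from the database profile (card(Supply) = 50 000, val(DEPTNUM[Supply]) = 30). The function name is an assumption for illustration, and the estimate holds only under the uniform-distribution assumption stated above.

```python
def selection_cardinality(card_r: float, val_a: int) -> float:
    """Estimate card(S) for the selection A = v on R, assuming uniform values."""
    selectivity = 1 / val_a          # fraction of tuples expected to match
    return selectivity * card_r

# Selection DEPTNUM = v on Supply: one department out of 30.
card_s = selection_cardinality(50_000, 30)   # about 1667 tuples
```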
• Size: selection does not affect the size of relations
size(S) = size(R)
• Distinct values: this depends on the selection criterion
Consider an attribute B which is not used in the selection
formula. The determination of val(B[S]) can be stated as follows:
given n = card(R) objects uniformly distributed over m =
val(B[R]) colors, how many different colors c = val(B[S])
are selected if we take just r = card(S) objects?
• Yao approximation:

               r           for r < m/2
c(n, m, r) =   (r + m)/3   for m/2 ≤ r < 2m
               m           for r ≥ 2m
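The piecewise approximation can be written as a small Python function. This is a sketch only: the function name is hypothetical, and assigning the boundary cases r = m/2 and r = 2m to the middle and last branch respectively is an assumption. Note that n does not appear in the approximation itself; it is kept as a parameter to mirror the notation c(n, m, r).

```python
def yao_distinct(n: int, m: int, r: float) -> float:
    """Approximate the number of distinct colors among r of n objects
    that are uniformly distributed over m colors."""
    if r < m / 2:
        return r               # few objects: almost every one has a new color
    if r < 2 * m:
        return (r + m) / 3     # intermediate range
    return m                   # many objects: all colors are expected to appear

# Supply profile: n = 50 000 tuples; suppose a selection keeps r = 1667 tuples.
snum_values = yao_distinct(50_000, 3000, 1667)   # val(SNUM) = 3000: middle branch
quan_values = yao_distinct(50_000, 500, 1667)    # val(QUAN) = 500: r >= 2m, so 500
```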
Profile of partial results of algebraic
operations - PROJECTION
Let S denote the result of performing a projection on
a relation R
• Cardinality – projection affects the cardinality of
operands since duplicates are eliminated from the result.
This effect is difficult to evaluate; the following three rules
can be applied
– If the projection involves a single attribute A, set
card(S) = val(A[R])
– If the product Π Ai∈Attr(S) val(Ai[R]) is less than card(R), where
Attr(S) are the attributes in the result of the projection, set
card(S) = Π Ai∈Attr(S) val(Ai[R])
– If the projection includes a key of R, set
card(S) = card(R)
• Note that if the system does not eliminate duplicates, the
cardinality of the result is the same as the cardinality of the
operand relation
• Size: the size of the result of a projection is reduced to the
sum of the sizes of attributes in its specification
• Distinct values : the distinct values of projected attributes
are the same as in the operand relation
Profile of partial results of algebraic
operations – GROUP BY
Let S denote the result of grouping a relation R; G denotes
the attributes on which the grouping is performed, and AF
the aggregate functions to be evaluated
• Cardinality – we give an upper bound on the cardinality
of S:
card(S) ≤ Π Ai∈G val(Ai[R])
• Size: for all attributes A appearing in G
size(R.A) = size(S.A)
• Distinct values: for all attributes A appearing in G
val(A[S]) = val(A[R])
Profile of partial results of algebraic
operations – UNION
Let T denote the union of R and S
• Cardinality – we have:
card(T) ≤ card(R) + card(S)
Equality holds when duplicates are not eliminated
• Size: we have
size(T) = size(R) = size(S)
• Distinct values: an upper bound is
val(A[T]) ≤ val(A[R]) + val(A[S])
Profile of partial results of algebraic
operations – DIFFERENCE
Let T denote the difference R - S
• Cardinality – we have:
max(0, card(R) - card(S)) ≤ card(T) ≤ card(R)
• Size: we have
size(T) = size(R) = size(S)
• Distinct values: an upper bound is
val(A[T]) ≤ val(A[R])
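Profile of partial results of algebraic
operations – CARTESIAN PRODUCT
Let T denote the Cartesian product of R and S
• Cardinality – we have:
card(T) = card(R) x card(S)
• Size: we have
size(T) = size(R) + size(S)
• Distinct values: the distinct values of attributes are the
same as in the operand relations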
Profile of partial results of algebraic
operations – JOIN
Consider the join T = R JN A=B S
• Cardinality – estimating precisely the cardinality of T is
very complex; we can give an upper bound on card(T)
because card(T) ≤ card(R) x card(S), but this value is
usually much higher than the actual cardinality. Assuming
that all the values of A in R also appear as values of B in S
and vice versa, and that the two attributes are both
uniformly distributed over the tuples of R and S, we have
card(T) = (card(R) x card(S))/val(A[R])
If one of the two attributes, say A, is a key of R, then
card(T) = card(S)
• Size: we have
size(T) = size(R) + size(S)
In the case of a natural join, the size of the join attribute must
be subtracted from the size of the result
• Distinct values: if A is a join attribute, an upper bound is
val(A[T]) ≤ min(val(A[R]), val(B[S]))
if A is not a join attribute, an upper bound is
val(A[T]) ≤ val(A[R]) + val(B[S])
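The join estimates can be tried on the profile given earlier, taking R = Dept and S = Supply joined on DEPTNUM. The sketch below is illustrative only; the function and parameter names are assumptions, and the tuple sizes are the sums of the attribute sizes in the profile (25 for both relations).

```python
def join_cardinality(card_r, card_s, val_a_r, a_is_key_of_r=False):
    """Estimate card(R JN A=B S) under the uniformity assumptions."""
    if a_is_key_of_r:
        return card_s                      # every S tuple matches exactly one R tuple
    return card_r * card_s / val_a_r

def join_size(size_r, size_s, join_attr_size=0):
    """Tuple size of the join; pass join_attr_size for a natural join."""
    return size_r + size_s - join_attr_size

# Dept: card = 30, val(DEPTNUM) = 30; Supply: card = 50 000.
general = join_cardinality(30, 50_000, 30)                        # 50000.0
with_key = join_cardinality(30, 50_000, 30, a_is_key_of_r=True)   # 50000
natural_size = join_size(25, 25, join_attr_size=2)                # 48
```

Here the two cardinality rules agree because DEPTNUM is a key of Dept, so val(DEPTNUM[Dept]) equals card(Dept).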
Profile of partial results of algebraic
operations – SEMIJOIN
Consider the semijoin T = R SJ A=B S
• Cardinality – the estimation of the cardinality of T is
similar to that of a selection operation; we denote with ρ
the selectivity of the semijoin operation, which measures
the fraction of the tuples of R which belong to the result.
The estimation is the following:
ρ = val(A[S]) / val(dom(A))
Given ρ,
card(T) = ρ * card(R)
• Size: the size of the result of a semijoin is the same as the
size of its first operand
size(T) = size(R)
• Distinct values: the number of distinct values of attributes
which do not belong to the semijoin specification can be
estimated using Yao's formula with n = card(R),
m = val(A[R]), and r = card(T). If A is the only attribute
appearing in the semijoin specification, then
val(A[T]) = ρ * val(A[R])
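Architecture of Query Processing
[Figure: a query passes through the Parser, Query Rewrite, Query Optimizer, Plan Refinement and the Query Execution Engine, which returns the query result; the stages hand over an internal representation and then a plan, and the figure also shows the Catalog and the base data]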
• Parser: the query is parsed and translated into an internal
representation (flex and bison can be used to build
an SQL parser)
• Query Rewrite: query rewrite transforms a query in order
to carry out optimizations that are good regardless of the
physical state of the system (elimination of redundant
predicates, unnesting of subqueries, simplification of
expressions). Query rewrite is carried out by a rule engine
• Query Optimizer: this component carries out
optimizations that depend on the physical state of the
system. The QO decides which indexes to use, which method
to use for each operation, and in which order to execute the
operations of a query.
• Query Optimizer: in a distributed system, the QO must also
decide at which site each operation is to be executed. The QO
enumerates alternative plans and chooses the best plan
using a cost estimation model
• Plan: specifies precisely how the query is to be executed.
The nodes are operators, and every operator carries out one
particular operation. The edges represent consumer-producer
relationships between operators.
• Plan Refinement: this component transforms the plan into
an executable plan. In DB2 this transformation involves
the generation of assembler-like code to evaluate
expressions and predicates efficiently
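Query evaluation plan
[Figure: example plan. Site 0 evaluates PJ A1 over NLJ A2=B2, with a scan of a temp and two receive operators below the join; the matching send operators sit on top of subplans built from PJ A3, PJ B3, SL C=cos, Inxscan(A) and Scan(B)]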
• Fragment reducers: a set of unary operations which apply
to the same fragment are collected into programs
• Binary operations: joins and unions
• Optimization graph: nodes represent reduced fragments,
and joins (unions) are represented by edges (hypernodes)
[Figure: optimization graph with nodes A and B connected by an edge labeled A2=B2]
Query Optimization (1)
• Plan enumeration with Dynamic Programming
Input: SPJ query q on relations R1, ..., Rn
Output: A query plan for q
for i = 1 to n do {
    optPlan({Ri}) = accessPlans(Ri)
    prunePlans(optPlan({Ri}))
}
for i = 2 to n do {
    for all S ⊆ {R1, ..., Rn} such that |S| = i do {
        optPlan(S) = ∅
        for all O ⊂ S, O ≠ ∅ do {
            optPlan(S) = optPlan(S) ∪ joinPlans(optPlan(O), optPlan(S - O))
            prunePlans(optPlan(S))
        }
    }
}
return optPlan({R1, ..., Rn})
Problem: alternative plans cannot be immediately pruned
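The enumeration above can be sketched in Python as follows. Everything about the cost model here is a toy assumption made for illustration: one access plan per relation costing card(R), a single fixed join selectivity for every pair of inputs, and a join cost equal to the cost of both inputs plus the size of the result. The pruning step keeps only the cheapest plan per subset, which is the centralized simplification; it does not model the distributed case in which plans cannot be pruned immediately.

```python
from itertools import combinations

def access_plans(name, card):
    # One hypothetical access plan per relation: a full scan costing card(R).
    return [{"tree": name, "card": card, "cost": float(card)}]

def join_plans(left_plans, right_plans, selectivity):
    plans = []
    for left in left_plans:
        for right in right_plans:
            card = left["card"] * right["card"] * selectivity
            cost = left["cost"] + right["cost"] + card   # toy cost model
            plans.append({"tree": (left["tree"], right["tree"]),
                          "card": card, "cost": cost})
    return plans

def prune_plans(plans):
    # Keep only the cheapest plan found so far for this set of relations.
    return [min(plans, key=lambda p: p["cost"])] if plans else []

def optimize(cards, selectivity=0.01):
    names = sorted(cards)
    opt = {}
    for name in names:                                  # single relations
        opt[frozenset([name])] = prune_plans(access_plans(name, cards[name]))
    for i in range(2, len(names) + 1):                  # subsets of size i
        for subset in combinations(names, i):
            s = frozenset(subset)
            opt[s] = []
            for k in range(1, i):                       # non-empty proper subsets O
                for chosen in combinations(subset, k):
                    o = frozenset(chosen)
                    candidates = join_plans(opt[o], opt[s - o], selectivity)
                    opt[s] = prune_plans(opt[s] + candidates)
    return opt[frozenset(names)][0]

best = optimize({"R1": 1000, "R2": 100, "R3": 10})
```

Under these assumptions the cheapest plan joins the two small relations R2 and R3 first and joins the result with R1 last.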
• Optimization criteria:
– Classic cost model (total time, total resource
consumption) – estimate the cost of every individual
operator of the plan and then sum up these costs – this
model is useful to estimate the overall throughput of a
system
– Mean response time model – estimate the lowest
response time of a query
Query Execution Techniques
• Row blocking – the implementation of the send and receive
operators is based on the TCP/IP and UDP protocols;
idea: ship tuples in a blockwise fashion
• Optimization of multicasts: send data sequentially from site to
site instead of sending it twice from the source (NY → Berlin → Poznan)
• Joins with horizontally partitioned data –
(A1 ∪ A2) JN B or (A1 JN B) ∪ (A2 JN B)
If A and B are both partitioned, then we have more plans
• Semijoin and Bloomjoin programs
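The two plans for a join over horizontally partitioned data produce the same result, which a small Python check can illustrate. The relations below are made-up sample data, and the helper `join` is a hypothetical function written for this sketch; it joins sets of (key, payload) tuples on the key.

```python
def join(r, s):
    """Join two sets of (key, payload) tuples on their first component."""
    return {(k1, p1, p2) for (k1, p1) in r for (k2, p2) in s if k1 == k2}

# A is horizontally partitioned into fragments A1 and A2 (made-up data).
a1 = {(1, "a"), (2, "b")}
a2 = {(3, "c"), (4, "d")}
b = {(2, "x"), (3, "y"), (5, "z")}

plan_union_first = join(a1 | a2, b)           # (A1 ∪ A2) JN B
plan_join_first = join(a1, b) | join(a2, b)   # (A1 JN B) ∪ (A2 JN B)
```

Both plans return the same tuples; they differ in where the work is done and in how much data has to be shipped between sites.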
Semijoin Programs
• A semijoin program for the join of R and S over two attributes
A and B is based on the following equivalence:
(R SJ A=B S) JN A=B S is equal to R JN A=B S
1. Send PJ B (S) to the site of R at a cost
C0 + C1 * size(B) * val(B[S])
2. Compute the semijoin at the site of R at null cost; let R' = R SJ A=B S
3. Send R' to the site of S at a cost
C0 + C1 * size(R) * card(R')
4. Compute the join at the site of S at null cost
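The cost of the four steps can be compared with the cost of shipping R whole, as in the sketch below. The function names, the values of C0 and C1, and the scenario are assumptions for illustration: R is taken to be Supply (tuple size 25, 50 000 tuples), and S is assumed to be a selection of Dept that keeps 6 of the 30 departments, so that card(R') is about 10 000 under the uniformity assumption.

```python
def transmission_cost(units, c0, c1):
    """TC(x) = C0 + C1 * x."""
    return c0 + c1 * units

def semijoin_program_cost(size_b, val_b_s, size_r, card_r_reduced, c0, c1):
    # Step 1: ship the projection of S on B to the site of R.
    step1 = transmission_cost(size_b * val_b_s, c0, c1)
    # Step 3: ship the reduced relation R' to the site of S.
    step3 = transmission_cost(size_r * card_r_reduced, c0, c1)
    return step1 + step3          # steps 2 and 4 are local, at null cost

def ship_whole_cost(size_r, card_r, c0, c1):
    # Alternative: send all of R to the site of S and join there.
    return transmission_cost(size_r * card_r, c0, c1)

# Hypothetical network parameters: C0 = 10, C1 = 1.
with_semijoin = semijoin_program_cost(size_b=2, val_b_s=6, size_r=25,
                                      card_r_reduced=10_000, c0=10, c1=1)
without_semijoin = ship_whole_cost(size_r=25, card_r=50_000, c0=10, c1=1)
```

The semijoin program pays off here because the semijoin is selective; if most tuples of R survive, the extra transmission in step 1 is wasted.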
Reducers
• Semijoin programs can be regarded as reducers, i.e.,
operations that can be applied to reduce the cardinality of
their operands
• Let RED(Q, R) denote the set of reducer programs that can
be built for a given relation R in a given query Q
• There is one reducer program, an element of RED(Q, R),
which reduces R more than all other programs: the full
reducer
• The problem: find the full reducers for all the relations of a
query (a difficult task)
• Acyclic (tree) queries versus cyclic queries
• Is it possible to give a bound on the length of the full
reducer?
• Tree queries – YES
The bound on the length of the full reducer is
n-1, where n is the number of nodes of the tree
• Cyclic queries – NO
The length of the 'best' reducer is linearly
bounded by the number of tuples of some relations of the
query
• The best reducer is not necessarily a full reducer
Example (1)

R          S          T
A   B      B   C      C   A
1   a      a   x      x   2
2   b      b   y      y   3
3   c      c   z      z   4

[Figure: cyclic query graph with edges R-S (B=B), S-T (C=C) and R-T (A=A)]
The final result is the empty relation; the length of the reducers
is 3*(m-1), where m is the number of tuples in each relation
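The reduction of Example (1) can be replayed with a short Python sketch that applies semijoins around the cycle until nothing changes. The helper `semijoin` and the round-robin order of application are assumptions made for this illustration; the slides do not prescribe a particular order.

```python
def semijoin(r, r_col, s, s_col):
    """Keep the tuples of r whose value in column r_col appears in column s_col of s."""
    keys = {t[s_col] for t in s}
    return [t for t in r if t[r_col] in keys]

R = [(1, "a"), (2, "b"), (3, "c")]          # R(A, B)
S = [("a", "x"), ("b", "y"), ("c", "z")]    # S(B, C)
T = [("x", 2), ("y", 3), ("z", 4)]          # T(C, A)

effective_steps = 0      # semijoins that actually removed tuples
changed = True
while changed:
    changed = False
    new_R = semijoin(R, 0, T, 1)            # R SJ A=A T
    if new_R != R:
        R, changed, effective_steps = new_R, True, effective_steps + 1
    new_S = semijoin(S, 0, R, 1)            # S SJ B=B R
    if new_S != S:
        S, changed, effective_steps = new_S, True, effective_steps + 1
    new_T = semijoin(T, 0, S, 1)            # T SJ C=C S
    if new_T != T:
        T, changed, effective_steps = new_T, True, effective_steps + 1
```

Each semijoin removes only one tuple at a time, so the number of steps grows with the number of tuples rather than with the number of relations. All three relations end up empty, in agreement with the empty final result.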
Example (2)

R          S          T
A   B      B   C      C   D
1   a      a   x      x   10
2   b      b   y      p   20
3   e      c   z      q   30

[Figure: acyclic query graph with edges R-S (B=B) and S-T (C=C)]
The final result contains one tuple; S is reduced to (a, x)
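For the acyclic query of Example (2), two semijoins are enough to reduce S fully, which matches the bound of n-1 for a tree with n = 3 nodes. The sketch below replays this on the example data; the helper `semijoin` is a hypothetical function written for this illustration.

```python
def semijoin(r, r_col, s, s_col):
    """Keep the tuples of r whose value in column r_col appears in column s_col of s."""
    keys = {t[s_col] for t in s}
    return [t for t in r if t[r_col] in keys]

R = [(1, "a"), (2, "b"), (3, "e")]          # R(A, B)
S = [("a", "x"), ("b", "y"), ("c", "z")]    # S(B, C)
T = [("x", 10), ("p", 20), ("q", 30)]       # T(C, D)

# Full reducer for S: one semijoin per neighbour in the tree (n - 1 = 2).
S_step1 = semijoin(S, 0, R, 1)              # S SJ B=B R  keeps (a, x), (b, y)
S_reduced = semijoin(S_step1, 1, T, 0)      # ... SJ C=C T keeps (a, x)

# Reducing R and T against the reduced S leaves the tuples that join.
R_reduced = semijoin(R, 1, S_reduced, 0)
T_reduced = semijoin(T, 0, S_reduced, 1)
```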
Testing the graph for cycles
• There are two cases in which cycles can be broken without
changing the meaning of the query
1. In the cycle (R.A=S.B), (S.B=T.C), (T.C=R.A), in which
R, S, T are relation names and A, B, C are attributes, any
one of the edges can be dropped, as any edge can be
obtained from the remaining ones by transitivity.
2. In the cycle (R.A=S.B), (S.B=T.C), (T.C=R.D), we can
substitute (R.A=R.D) for (T.C=R.D) because, by
transitivity, T.C must equal R.A; the remaining graph
contains two edges, (R, S) and (S, T), and is acyclic, because
an interrelation clause has been substituted by an intrarelation
clause