
Query optimization in distributed database systems


Framework for query optimization
• The selection of a query processing strategy
involves:
– determining the physical copies of the fragments upon
which to execute the query
– selecting the order of execution of operations; in particular,
this involves determining a "good"
sequence of joins
– selecting the method for executing each operation

Transmission cost
• Transmission requirements are neutral with respect to
systems; they are typically a function of the amount of data
transmitted among sites
• The optimization of a distributed query can be partitioned
into two independent problems: the distribution of the
access strategy among sites, which is done considering
transmission only, and the determination of local access
strategies at each site, which use traditional methods of
centralized databases
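• Transmission cost of sending x units of data between two sites:
TC(x) = C0 + C1 * x
where C0 is the fixed cost of initiating a transmission and C1 is the
cost per unit of data transmitted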
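As a small illustration, the cost formula can be written as a function. The values used for C0 and C1 below are placeholders chosen for the example, not measurements from any system.

```python
def transmission_cost(x, c0=1.0, c1=0.001):
    """TC(x) = C0 + C1 * x.

    x  : amount of data transmitted (for example, bytes)
    c0 : fixed cost of initiating a transmission (assumed value)
    c1 : cost per unit of data (assumed value)
    """
    return c0 + c1 * x


# Shipping the same data in two messages pays the fixed cost C0 twice.
one_message = transmission_cost(10_000)
two_messages = transmission_cost(5_000) + transmission_cost(5_000)
```

Under these assumed constants a single message is cheaper than two half-sized ones, which is the reason the fixed term C0 matters when choosing a strategy.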

Database Profile
Database profile:
• The number of tuples in each relation Ri (card(Ri))
• The size of each attribute A (size(A) )
• The size of Ri (size(Ri)) is the sum of the sizes of its attributes
• For each attribute A in each relation Ri: the number of
distinct values appearing in Ri (val(A[Ri])), together with the
maximum and minimum values

[Figure: two local database systems, LDBS1 and LDBS2, holding the fragments Supply1, Supply2, Dept1 and Dept2]

Database Profile

Supply    card(Supply) = 50 000

        SNUM   PNUM   DEPTNUM   QUAN
size    6      7      2         10
val     3000   1000   30        500

Dept      card(Dept) = 30

        DEPTNUM   NAME   AREA   MGRNUM
size    2         15     1      7
val     30        30     6      30
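One possible way to hold such a profile in a program is a nested dictionary. The layout and the names PROFILE and tuple_size are assumptions of this sketch; the numbers are copied from the Supply and Dept profile above.

```python
# Profile of the global relations, copied from the tables above.
PROFILE = {
    "Supply": {
        "card": 50_000,
        "size": {"SNUM": 6, "PNUM": 7, "DEPTNUM": 2, "QUAN": 10},
        "val": {"SNUM": 3000, "PNUM": 1000, "DEPTNUM": 30, "QUAN": 500},
    },
    "Dept": {
        "card": 30,
        "size": {"DEPTNUM": 2, "NAME": 15, "AREA": 1, "MGRNUM": 7},
        "val": {"DEPTNUM": 30, "NAME": 30, "AREA": 6, "MGRNUM": 30},
    },
}


def tuple_size(relation):
    """size(Ri): the sum of the sizes of the attributes of Ri."""
    return sum(PROFILE[relation]["size"].values())


supply_size = tuple_size("Supply")  # 6 + 7 + 2 + 10
dept_size = tuple_size("Dept")      # 2 + 15 + 1 + 7
```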

Database Profile

Supply1   card(Supply1) = 30 000   site(Supply1) = 1

        SNUM   PNUM   DEPTNUM   QUAN
size    6      7      2         10
val     1800   1000   20        500

Dept1     card(Dept1) = 10   site(Dept1) = 2

        DEPTNUM   NAME   AREA   MGRNUM
size    2         15     1      7
val     10        10     2      10

Profile of partial results of algebraic
operations - SELECTION
Let S denote the result of performing a unary operation on
a relation R
• Cardinality - to each selection we associate a selectivity
factor ρ, which is the fraction of tuples satisfying it
For a simple selection attribute = value (A = v), ρ can be
defined as follows:
ρ = 1/val(A[R])
under the assumption that values are uniformly
distributed. Thus
card(S) = ρ * card(R)
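A short worked example, using the Supply profile given earlier (card(Supply) = 50 000, val(DEPTNUM[Supply]) = 30). The function name is hypothetical, and the estimate holds only under the uniform-distribution assumption stated above.

```python
def selection_cardinality(card_r, val_a):
    """card(S) = rho * card(R), with rho = 1 / val(A[R]) for a selection A = v."""
    rho = 1.0 / val_a
    return rho * card_r


# Selection DEPTNUM = v on Supply.
estimate = selection_cardinality(50_000, 30)
```

The estimate is about 1,667 tuples, that is, one thirtieth of the relation.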

Profile of partial results of algebraic
operations - SELECTION
• Size: selection does not affect the size of relations
size(S) = size(R)
• Distinct values: depends on the selection criterion
Consider an attribute B which is not used in the selection
formula. val(B[S]) can be estimated by solving the following
problem: given n = card(R) objects uniformly distributed over m =
val(B[R]) colors, how many different colors c = val(B[S])
are selected if we take just r = card(S) objects?

Profile of partial results of algebraic
operations - SELECTION
• Yao approximation:

             r,         for r < m/2
c(n, m, r) = (r+m)/3,   for m/2 ≤ r < 2m
             m,         for r ≥ 2m
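The approximation translates directly into a small function. This is a sketch: the function name is hypothetical, and the boundary points r = m/2 and r = 2m are assigned to the middle and last branch respectively, which is harmless because the three pieces agree at those points.

```python
def yao_distinct(n, m, r):
    """Approximate number of distinct colors among r objects.

    n is kept only to mirror the signature c(n, m, r); the
    approximation itself depends on m and r alone.
    """
    if r < m / 2:
        return r
    if r < 2 * m:
        return (r + m) / 3
    return m


few = yao_distinct(50_000, 30, 10)      # r < m/2
some = yao_distinct(50_000, 30, 30)     # m/2 <= r < 2m
many = yao_distinct(50_000, 30, 1_000)  # r >= 2m
```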

Profile of partial results of algebraic
operations - PROJECTION
Let S denote the result of performing a projection on
a relation R
• Cardinality – projection affects the cardinality of
operands since duplicates are eliminated from the result.
This effect is difficult to evaluate; the following three rules
can be applied
– If the projection involves a single attribute A, set
card(S) = val(A[R])
– If the product ∏_{Ai ∈ Attr(S)} val(Ai[R]) is less than card(R), where
Attr(S) are the attributes in the result of the projection, set
card(S) = ∏_{Ai ∈ Attr(S)} val(Ai[R])

Profile of partial results of algebraic
operations - PROJECTION
– If the projection includes a key of R, set
card(S) = card(R)
• Note that if the system does not eliminate duplicates, the
cardinality of the result is the same as the cardinality of the
operand relation
• Size: the size of the result of a projection is reduced to the
sum of the sizes of attributes in its specification
• Distinct values : the distinct values of projected attributes
are the same as in the operand relation
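The three cardinality rules for projection can be combined into one estimator. This is a sketch under assumptions: the function name and parameters are hypothetical, and when none of the three rules applies the code falls back to card(R) as an upper bound, a choice the rules themselves do not prescribe. The example values come from the Supply and Dept profile, and treating DEPTNUM as the key of Dept is also an assumption.

```python
from math import prod


def projection_cardinality(card_r, val, attrs, key=None):
    """Estimate card(S) for a duplicate-eliminating projection of R on attrs.

    val : dict mapping attribute name -> val(A[R])
    key : optional set of attributes forming a key of R
    """
    attrs = list(attrs)
    if len(attrs) == 1:                       # rule 1: single attribute
        return val[attrs[0]]
    if key is not None and set(key) <= set(attrs):
        return card_r                         # rule 3: key is projected
    product = prod(val[a] for a in attrs)
    if product < card_r:                      # rule 2: small product
        return product
    return card_r                             # assumed fallback (upper bound)


dept_val = {"DEPTNUM": 30, "NAME": 30, "AREA": 6, "MGRNUM": 30}
supply_val = {"SNUM": 3000, "PNUM": 1000, "DEPTNUM": 30, "QUAN": 500}

by_area = projection_cardinality(30, dept_val, ["AREA"])
with_key = projection_cardinality(30, dept_val, ["DEPTNUM", "NAME"],
                                  key={"DEPTNUM"})
dept_quan = projection_cardinality(50_000, supply_val, ["DEPTNUM", "QUAN"])
```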

Profile of partial results of algebraic
operations – GROUP BY
Let G denote the attributes on which the grouping is
performed, AF indicates the aggregate functions to be
evaluated
• Cardinality – we give an upper bound on the cardinality
of S:
card(S) ≤ ∏_{Ai ∈ G} val(Ai[R])
• Size: for all attributes A appearing in G
size(R.A) = size (S.A)
• Distinct values : for all attributes A appearing in G
val(A[S]) = val(A[R])

Profile of partial results of algebraic
operations – UNION

• Cardinality – we have:
card(T) ≤ card(R) + card(S)
Equality holds when duplicates are not eliminated
• Size: we have
size(T) = size(R) = size(S)
• Distinct values: an upper bound is
val(A[T]) ≤ val(A[R]) + val(A[S])

Profile of partial results of algebraic
operations – DIFFERENCE
• Cardinality – we have:
max(0, card(R) - card(S)) ≤ card(T) ≤ card(R)
• Size: we have
size(T) = size(R) = size(S)
• Distinct values: an upper bound is
val(A[T]) ≤ val(A[R])

Profile of partial results of algebraic
operations – CARTESIAN PRODUCT

• Cardinality – we have:
card(T) = card(R) x card(S)
• Size: we have
size(T) = size(R) + size(S)
• Distinct values : the distinct values of attributes are the
same as in the operand relation

Profile of partial results of algebraic
operations – JOIN
Consider the join T = R JN A=B S
• Cardinality – estimating precisely the cardinality of T is
very complex; we can give an upper bound to card(T)
because card(T) ≤ card(R) x card(S), but this value is
usually much higher than the actual cardinality. Assuming
that all the values of A in R appear also as values of B in S
and vice versa, and that the two attributes are both
uniformly distributed over the tuples of R and S, we have
card(T) = (card(R) x card(S))/val(A[R])
If one of the two attributes, say A, is a key of R, then
card(T) = card(S)
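The two estimates can be checked against each other on the profile given earlier. The sketch assumes that DEPTNUM is the key of Dept (the profile shows val(DEPTNUM[Dept]) = card(Dept) = 30, but does not state it); the function name is hypothetical.

```python
def join_cardinality(card_r, card_s, val_a_r):
    """card(T) = card(R) * card(S) / val(A[R]) under the uniformity assumptions."""
    return card_r * card_s / val_a_r


# R = Dept (join attribute DEPTNUM, assumed to be its key), S = Supply.
estimate = join_cardinality(card_r=30, card_s=50_000, val_a_r=30)
```

Because val(A[R]) equals card(R) when A is a key, the general formula reduces to card(S), here 50 000, in agreement with the key rule.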

Profile of partial results of algebraic
operations – JOIN
• Size: we have
size(T) = size(R) + size(S)
In the case of natural join the size of the join attribute must
be subtracted from the size of the result
• Distinct values: if A is a join attribute, an upper bound is
val(A[T]) ≤ min(val(A[R]), val(B[S]))
if A is not a join attribute, an upper bound is
val(A[T]) ≤ val(A[R]) + val(B[S])

Profile of partial results of algebraic
operations – SEMIJOIN
Consider the semijoin T = R SJ A=B S
• Cardinality – the estimation of the cardinality of T is
similar to that of a selection operation; we denote with ρ
the selectivity of the semijoin operation, which measures
the fraction of the tuples of R which belong to the result.
The estimation is the following:
ρ = val(A[S]) / val(dom(A))
Given ρ,
card(T) = ρ * card(R)

Profile of partial results of algebraic
operations – SEMIJOIN
• Size: The size of the result of a semijoin is the same as the
size of its first operand
size(T) = size(R)
• Distinct values : the number of distinct values of attributes
which do not belong to the semijoin specification can be
estimated using Yao’s formula with n= card(R),
m=val(A[R]), and r =card(T). If A is the only attribute
appearing in the semijoin specification, then
val(A[T]) = ρ * val(A[R])

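Architecture of a Query Processor

[Figure: the query passes through the Parser, Query Rewrite, Query Optimizer, Plan Refinement and Query Execution Engine, which returns the result. The stages up to the optimizer exchange an internal representation; from the optimizer onward they exchange a plan. The Catalog and the Base data are the supporting stores.]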

Architecture of a Query Processor
• Parser: the query is parsed and translated into an internal
representation (flex and bison can be used to
construct an SQL parser)
• Query Rewrite: query rewrite transforms a query in order
to carry out optimizations that are good regardless of the
physical state of the system (elimination of redundant
predicates, unnesting of subqueries, simplification of
expressions). Query rewrite is carried out by a rule engine
• Query Optimizer: this component carries out
optimizations that depend on the physical state of the
system. QO decides which index, which method, and in
which order to execute operations of a query.

Architecture of a Query Processor
• Query optimizer: in a distributed system the QO must also decide at
which site each operation is to be executed. QO
enumerates alternative plans and chooses the best plan
using a cost estimation model
• Plan: specifies precisely how the query is to be executed.
The nodes are operators, and every operator carries out one
particular operation. The edges represent consumer-
producer relationships of operators.
• Plan Refinement: this component transforms the plan into
an executable plan. In DB2 this transformation involves
generating assembler-like code to evaluate
expressions and predicates efficiently

Query evaluation plan
[Figure: query evaluation plan rooted at site 0. Operators shown, from the root down: PJ A1; NLJ A2=B2; scan; temp; receive (twice); send (twice); PJ A3 and PJ B3; SL C=cos; Inxscan(A) and Scan(B)]
Query evaluation plan
• Fragment reducers: a set of unary operations which apply
to the same fragment are collected into programs
• Binary operations: joins and unions
• Optimization graph: nodes represent reduced fragments,
and joins (unions) are represented by edges (hypernodes)

[Figure: optimization graph with nodes A and B connected by an edge labeled A2=B2]

Query Optimization (1)
• Plan enumeration with Dynamic Programming
Input: SPJ query q on relations R1, ..., Rn
Output: A query plan for q
1. for i = 1 to n do {
2.     optPlan({Ri}) = accessPlans(Ri)
3.     prunePlans(optPlan({Ri}))
4. }
5. for i = 2 to n do {
6.     for all S ⊆ {R1, ..., Rn} such that |S| = i do {
7.         optPlan(S) = ∅

Query Optimization (2)
8.         for all O ⊂ S do {
9.             optPlan(S) = optPlan(S) ∪
                   joinPlans(optPlan(O), optPlan(S-O))
10.            prunePlans(optPlan(S))
11.        }
12.    }
13. }
14. return optPlan({R1, ..., Rn})

Problem: alternative plans cannot be immediately pruned
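The enumeration above can be sketched in runnable form for a deliberately simplified, centralized setting. Everything in the cost model is an assumption of this sketch: access plans cost nothing, a join costs the cardinality of its result, pruning keeps a single cheapest plan per subset, and site selection is ignored. The function and parameter names are hypothetical.

```python
from itertools import combinations


def dp_join_order(cards, selectivity):
    """Dynamic-programming join enumeration (toy cost model).

    cards       : dict relation name -> cardinality
    selectivity : dict frozenset({r1, r2}) -> join selectivity factor;
                  pairs without an entry are Cartesian products (1.0)
    Returns (cost, result cardinality, plan as nested tuples).
    """
    rels = sorted(cards)
    # Lines 1-4: one access plan per relation, cost 0 in this model.
    opt = {frozenset([r]): (0.0, float(cards[r]), r) for r in rels}

    def out_card(left, right):
        card = opt[left][1] * opt[right][1]
        for a in left:
            for b in right:
                card *= selectivity.get(frozenset([a, b]), 1.0)
        return card

    # Lines 5-13: plans for subsets of increasing size i.
    for i in range(2, len(rels) + 1):
        for subset in combinations(rels, i):
            s = frozenset(subset)
            best = None
            for k in range(1, i):              # every proper subset O of S
                for o in combinations(subset, k):
                    left = frozenset(o)
                    right = s - left
                    card = out_card(left, right)
                    cost = opt[left][0] + opt[right][0] + card
                    # prunePlans: keep only the cheapest plan for S
                    if best is None or cost < best[0]:
                        best = (cost, card, (opt[left][2], opt[right][2]))
            opt[s] = best
    return opt[frozenset(rels)]               # line 14


cards = {"R": 1000, "S": 100, "T": 10}
selectivity = {frozenset(["R", "S"]): 0.01, frozenset(["S", "T"]): 0.1}
cost, card, plan = dp_join_order(cards, selectivity)
```

With these made-up statistics the cheapest plan joins S and T first and then joins the result with R. Keeping a single plan per subset is exactly the pruning step that becomes problematic in a distributed system, where plans for the same subset may deliver their result at different sites.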

Query Optimization (3)

• Optimization criteria:
– Classic cost model (total time, total resource
consumption) – estimate the cost of every individual
operator of the plan and then sum up these costs – this
model is useful to estimate the overall throughput of a
system
– Mean response time model – estimate the lowest
response time of a query
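The difference between the two criteria can be illustrated on a tiny plan tree. The sketch assumes that the inputs of an operator run fully in parallel and that an operator starts only after its inputs finish (no pipelining); this is a simplification chosen for the example, and the operator costs are made up.

```python
def total_cost(plan):
    """Classic model: sum the cost of every operator in the plan."""
    cost, children = plan
    return cost + sum(total_cost(child) for child in children)


def response_time(plan):
    """Response-time sketch: children run in parallel, so only the
    slowest child adds to the elapsed time of its parent."""
    cost, children = plan
    return cost + max((response_time(child) for child in children),
                      default=0.0)


# A join (cost 5) over two scans executed at different sites (costs 10 and 4).
plan = (5.0, [(10.0, []), (4.0, [])])
total = total_cost(plan)
elapsed = response_time(plan)
```

The total cost counts both scans (19 units), while the response time counts only the slower one (15 units); a plan that is best under one criterion need not be best under the other.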

Query Execution Techniques
• Row blocking – the implementation of send and receive
operators is based on TCP/IP or UDP;
idea: ship tuples in blocks rather than one at a time
• Optimization of multicasts: forward data from site to site
instead of sending it twice from the source (NY → Berlin → Poznan)
• Joins with horizontally partitioned data –
(A1 ∪ A2) JN B or (A1 JN B) ∪ (A2 JN B)
If A and B are both partitioned, then we have more plans
• Semijoin and Bloomjoin programs
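The equivalence of the two plans for a horizontally partitioned relation can be checked on toy data. The relations, their contents and the helper name join are invented for this illustration.

```python
def join(r, s):
    """Equi-join of r(k, a) with s(k, b) on the first column."""
    return {(k, a, b) for (k, a) in r for (k2, b) in s if k == k2}


A1 = {(1, "p"), (2, "q")}          # fragment of A stored at one site
A2 = {(3, "r")}                    # fragment of A stored at another site
B = {(1, "x"), (3, "y"), (4, "z")}

union_first = join(A1 | A2, B)            # (A1 ∪ A2) JN B
join_first = join(A1, B) | join(A2, B)    # (A1 JN B) ∪ (A2 JN B)
```

Both plans return the same tuples; they differ only in which data has to be shipped, which is what the optimizer weighs.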

Semijoin Programs
• A semijoin program for the join of R and S over attributes A and B
relies on the following identity:
(R SJ A=B S) JN A=B S is equal to R JN A=B S

1. Send PJ B (S) to the site of R at a cost
C0 + C1 * size(B) * val(B[S])
2. Compute the semijoin at the site of R at a null cost; let R' = R SJ A=B S
3. Send R' to the site of S at a cost
C0 + C1 * size(R) * card(R')
4. Compute the join at the site of S at a null cost
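The four steps can be turned into a cost comparison between the semijoin program and shipping R unreduced. All numbers below are illustrative assumptions: R is taken to be Supply (25 bytes per tuple, 50 000 tuples), S is taken to be a selection of Dept that keeps 5 of the 30 department numbers, so that card(R') is estimated as 5/30 of card(R), and C0 = C1 = 1. The function names are hypothetical.

```python
def tc(x, c0, c1):
    """Transmission cost TC(x) = C0 + C1 * x."""
    return c0 + c1 * x


def semijoin_program_cost(size_b, val_b_s, size_r, card_r_reduced, c0, c1):
    step1 = tc(size_b * val_b_s, c0, c1)         # send PJ B (S) to the site of R
    step3 = tc(size_r * card_r_reduced, c0, c1)  # send R' to the site of S
    return step1 + step3                         # steps 2 and 4 are local


def direct_cost(size_r, card_r, c0, c1):
    """Alternative: ship all of R to the site of S in one transmission."""
    return tc(size_r * card_r, c0, c1)


card_r = 50_000
card_r_reduced = card_r * 5 / 30   # assumed semijoin selectivity 5/30
with_semijoin = semijoin_program_cost(2, 5, 25, card_r_reduced, 1.0, 1.0)
without = direct_cost(25, card_r, 1.0, 1.0)
```

Under these assumptions the semijoin program transmits roughly one sixth of the data. When the semijoin is not selective, the extra transmission in step 1 is wasted and shipping R directly is cheaper.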

Reducers
• Semijoin programs can be regarded as reducers, i.e.
operations that can be applied to reduce the cardinality of
their operands
• Let RED(Q, R) denote the set of reducer programs that can
be built for a given relation R in a given query Q
• There is one reducer program, element of RED(Q, R),
which reduces R more than all other programs – full
reducer
• The problem : find all full reducers for the relations of a
query (difficult task)
• Acyclic (tree queries) versus cyclic queries

Reducers
• Is it possible to bound the length of the full
reducer?
• Tree queries – YES
The length of the full reducer is bounded by
n-1, where n is the number of nodes of the tree
• Cyclic queries – NO
The length of the 'best' reducer is linearly
bounded by the number of tuples of some relations of the
query
• Best reducer does not mean full reducer

Example (1)
R          S          T
A  B       B  C       C  A
1  a       a  x       x  2
2  b       b  y       y  3
3  c       c  z       z  4

Cyclic query: graph with edges R-S (B=B), S-T (C=C) and R-T (A=A)

The final result is the empty relation; the length of the reducer
is 3*(m-1), where m is the number of tuples in each relation
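The behaviour of semijoin reduction on this cyclic example can be traced with a short simulation. The order in which the semijoins are applied (R by T, S by R, T by S, repeated) is a choice made for this sketch, as are the helper and variable names.

```python
def semijoin(r, r_col, s, s_col):
    """Tuples of r whose value in column r_col appears in column s_col of s."""
    keys = {t[s_col] for t in s}
    return {t for t in r if t[r_col] in keys}


rels = {
    "R": {(1, "a"), (2, "b"), (3, "c")},        # R(A, B)
    "S": {("a", "x"), ("b", "y"), ("c", "z")},  # S(B, C)
    "T": {("x", 2), ("y", 3), ("z", 4)},        # T(C, A)
}
# (target, target column, source, source column)
program = [("R", 0, "T", 1),   # R SJ A=A T
           ("S", 0, "R", 1),   # S SJ B=B R
           ("T", 0, "S", 1)]   # T SJ C=C S

useful_steps = 0
changed = True
while changed:
    changed = False
    for target, t_col, source, s_col in program:
        reduced = semijoin(rels[target], t_col, rels[source], s_col)
        if reduced != rels[target]:
            rels[target] = reduced
            useful_steps += 1
            changed = True
```

With this ordering every semijoin removes exactly one tuple: after 3*(m-1) = 6 semijoins each relation holds a single tuple, and three further semijoins empty all of them. The program length therefore grows with the number of tuples m, not with the number of relations.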
Example (2)
R          S          T
A  B       B  C       C  D
1  a       a  x       x  10
2  b       b  y       p  20
3  e       c  z       q  30

Acyclic query: graph with edges R-S (B=B) and S-T (C=C)

The final result - one tuple (a, x)
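For this tree query, a reducer of length n-1 = 2 is enough to reduce S fully. The sketch below applies the two semijoins to the example data; the helper name and the choice of S as the relation to reduce are assumptions of the illustration.

```python
def semijoin(r, r_col, s, s_col):
    """Tuples of r whose value in column r_col appears in column s_col of s."""
    keys = {t[s_col] for t in s}
    return {t for t in r if t[r_col] in keys}


R = {(1, "a"), (2, "b"), (3, "e")}          # R(A, B)
S = {("a", "x"), ("b", "y"), ("c", "z")}    # S(B, C)
T = {("x", 10), ("p", 20), ("q", 30)}       # T(C, D)

s_after_r = semijoin(S, 0, R, 1)            # S SJ B=B R
s_reduced = semijoin(s_after_r, 1, T, 0)    # (S SJ B=B R) SJ C=C T
```

The first semijoin removes (c, z), the second removes (b, y), and S is left with the single tuple (a, x), the tuple quoted above.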

Testing the graph for cycles
• There are two cases in which cycles can be broken without
changing the meaning of the query
1. In the cycle (R.A=S.B), (S.B=T.C), (T.C=R.A), in which
R, S, T are relation names, and A, B, C are attributes, any
one of the edges can be dropped, as any edge can be
obtained from the remaining ones by transitivity.
2. In the cycle (R.A=S.B), (S.B=T.C), (T.C=R.D), we can
substitute (R.A=R.D) for (T.C=R.D) because, by
transitivity, T.C must equal R.A; the interrelation clause is
thereby replaced by an intrarelation clause, so the remaining graph
contains only the two edges (R, S) and (S, T) and is acyclic