You are on page 1of 69

5.

Distributed Query Processing

Chapter 6
   Overview of Query Processing 

Chapter 7
  Query Decomposition and Data   
  Localization

1
Outline

v Overview of Query Processing 
v Query Decomposition and Localization

2
Query Processing

High level user query

Query
Processor

Low level data manipulation commands

3
Query Processing Components
v Query language that is used
w SQL (Structured Query Language)

v Query execution methodology
w The steps that the system goes through in executing 
high-level (declarative) user queries

v Query optimization 
w How to determine the “best” execution plan?

4
Query Language

v Tuple calculus: { t | F(t) }
where t is a tuple variable, and F(t) is a well formed formula

v Example: 

w Get the numbers and names of all programmers.

5
Query Language (cont.)
v Domain calculus:
   where xi is a domain variable, and is a well
formed formula
v Example:

{ x, y | EMP(x, y, “Programmer") }

Variables are position sensitive!

6
Query Language (cont.)
v SQL is a tuple calculus language.

SELECT ENO,ENAME
FROM EMP
WHERE TITLE=“Programmer”

End user uses non-procedural (declarative) 
languages to express queries.
7
Query Processing Objectives & Problems

v Query processor transforms queries into procedural 
operations to access data in an optimal way.

v Distributed query processor has to deal with query 
decomposition and data localization.

8
DB example

Figure 3.3

9
Centralized Query Processing
Alternatives
SELECT ENAME
FROM EMP E, ASG G
WHERE E.ENO = G.ENO AND RESP=“manager”

v Strategy 1:

v Strategy 2:

Strategy 2 avoids Cartesian product, so is it “better”.
10
Distributed Query Processing
v Query processor must consider the communication 
cost and select the best site.
v The same query example, but relation G and E are 
fragmented and distributed.

11
Distributed Query Processing Plans

v  By centralized optimization,

v  Two distributed query processing plans

12
12
Distributed Query Plan I
Plan I: To transport all segments to query site and 
            execute there.
Site 5

Result = (EMP1  EMP2) ⋈ ENO TITLE=“manager” (ASG1  ASG2)

ASG1 ASG2 EMP1 EMP2

Site 1 Site 2 Site 3 Site 4

This causes too much network traffic, very costly.

13
Distributed Query Plan II
Plan II (Optimized):

Site 5
Result = (EMP1 ’  EMP2 ’)

EMP1’ EMP2’
Site 3 Site 4
EMP1’ = EMP1 ⋈ ENO ASG1’ EMP2’ = EMP2 ⋈ ENO ASG2’
ASG1’ ASG2’
Site 1 Site 2
ASG1’ =  RESP=“manager” (ASG1) ASG2’ =  RESP =“manager” (ASG2)
14
Costs of the Two Plans
v Assume:
w size(EMP)=400, size(ASG)=1000, 20 tuples with RESP =“manager” 
w tuple access cost = 1 unit; tuple transfer cost = 10 units
w ASG and EMP are locally clustered on attribute RESP and ENO, respectively.
v Plan 1
w Transfer EMP to site 5:  400*tuple transfer cost                                4000
w Transfer ASG to site 5:  1000*tuple transfer cost                            10000
w Produce ASG’: 1000*tuple access cost                                             1000
w Join EMP and ASG’: 400*20*tuple access cost                                 8000
    Total cost                                                                                         23,000
v Plan 2
w Produce ASG’:  (10+10)*tuple access cost                                          20
w Transfer ASG’ to the sites of EMP:  (10+10)*tuple transfer cost        200
w Produce EMP’:  (10+10)*tuple access cost * 2                                    40
w Transfer EMP’ to result site: (10+10)*tuple transfer cost                   200
    Total cost                                                                                            460 15
Query Optimization Objectives
v Minimize a cost function

          I/O cost + CPU cost + communication cost
w These might have different weights in different distributed 
environments

w Can also maximize throughout

16
Communication Cost
v  Wide area network 
w Communication cost will dominate
- Low bandwidth
- Low speed
- High protocol overhead
w Most algorithms ignore all other cost components
 
v Local area network
w Communication cost not that dominate
w Total cost function should be considered
17
Complexity of Relational Algebra
Operations
v Measured by cardinality n and tuples are sorted 
on comparison attributes

s , p (without duplicate elimination)


p (with duplicate elimination), GROUP

Cartesian-Product X

Fig (6.2)
18
Complexity of Relational Algebra
Operations
• The complexity of relational algebra operators, which directly affects their
execution time, dictates some principles useful to a query processor.
• These principles can help in choosing the final execution strategy.
• The simplest way of defining complexity is in terms of relation cardinalities
independent of physical implementation details such as fragmentation and storage
structures.
• Figure 6.2 shows the complexity of unary and binary operators in the order of
increasing complexity, and thus of increasing execution time.
• Complexity is O(n log n) for binary operators if each tuple of one relation must be
compared with each tuple of the other on the basis of the equality of selected
attributes.
• This complexity assumes that tuples of each relation must be sorted on the
comparison attributes.
• This simple look at operator complexity suggests two principles. First, because
complexity is relative to relation cardinalities, the most selective operators that
reduce cardinalities (e.g., selection) should be performed first.
• Second, operators should be ordered by increasing complexity so that Cartesian
products can be avoided or delayed. 19
Characterization of Query Processors

• It is quite difficult to evaluate and compare query processors in


the context of both centralized systems and distributed systems
because they may differ in many aspects.
• In what follows, we list important characteristics of query
processors that can be used as a basis for comparison.
• The first four characteristics hold for both centralized and
distributed query processors while the next four characteristics
are particular to distributed query processors.

20
Types of Query Optimization
• Query optimization aims at choosing the “best” point in the
solution space of all possible execution strategies.
v Exhaustive search
§ method for  query optimization  is to  search  the  solution 
space,  exhaustively  predict  the  cost  of  each  strategy, and 
select the strategy with minimum cost.
§ Cost-based
§ Optimal
§ Combinatorial  complexity  in  the  number  of relations  (The 
problem  becomes  worse  as  the  number  of  relations  or 
fragments increases (e.g., becomes greater than 5 or 6).
§ Workable for small solution spaces
21
Types of Query Optimization

v Heuristics
• Not optimal
• restrict  the  solution  space  so  that  only  a  few 
strategies are considered
• Perform  unary  operations  (selection  and 
projection) first
• Reorder  operations  to  reduce  intermediate 
relation size
• Replace  a  join  by  a  series  of  semijoins to 
minimize data communication.
22
Query Optimization Granularity
v Single query at a time
w Cannot use common intermediate results

v Multiple queries at a time
w Efficient if many similar queries
w Decision space is much larger

23
Query Optimization Timing

v Static
w Do it at compilation time by using statistics, appropriate 
for exhaustive search, optimized once, but executed 
many times.
w Difficult to estimate the size of the intermediate results
w Can amortize over many executions

v Dynamic
w Do it at execution time, accurate about the size of the 
intermediate results, repeated for every execution, 
expensive. 24
Query Optimization Timing (cont.)

v Hybrid
w Compile using a static algorithm
w If the error in estimate size > threshold, re-optimizing at 
run time

25
Statistics
v Relation
w Cardinality
w Size of a tuple
w Fraction of tuples participating in a join with another relation
v Attributes
w Cardinality of the domain
w Actual number of distinct values
v Common assumptions
w Independence between different attribute values
w Uniform distribution of attribute values within their domain 26
Decision Sites
v For query optimization, it may be done by
w Single site – centralized approach 
– Single site determines the best schedule
– Simple
– Need knowledge about the entire distributed database
w All the sites involved – distributed approach
– Cooperation among sites to determine the schedule
– Need only local information
– Cost of operation
w Hybrid – one site makes major decision in cooperation 
with other sites making local decisions
– One site determines the global schedule
– Each site optimizes the local subqueries 27
Network Topology

v Wide Area Network (WAN) – point-to-point
w Characteristics
– Low bandwidth
– Low speed
– High protocol overhead
w Communication cost will dominate; ignore all other cost 
factors
w Global schedule to minimize communication cost
w Local schedules according to centralized query 
optimization

28
Network Topology (cont.)

v Local Area Network (LAN)
w communication cost not that dominate
w Total cost function should be considered
w Broadcasting can be exploited
w Special algorithms exist for star networks

29
Other Information to Exploit

v Using replications to minimize communication 
costs 

v Using semijoins to reduce the size of operand 
relations to cut down communication costs when 
overhead is not significant.

30
Layers of Query Processing
Calculus Query on Distributed Relations
QUERY GLOBAL
DECOMPOSITION SCHEMA
Algebra Query on Distributed Relations
CONTROL
DATA FRAGMENT
SITE LOCALIZATION SCHEMA
Fragment Query
GLOBAL
STATISTICS ON
OPTIMIZATION
FRAGMENTS
Optimized Fragment Query
With Communication Operations
LOCAL
SITE
LOCAL LOCAL
OPTIMIZATION SCHEMA

Optimized Local Queries 31


Query Decomposition
v Query  decomposition is the  first  phase  of  query  processing that  transforms a 
relational calculus query into a relational algebra query. 
v Both input and output queries refer  to  global relations,  without  knowledge of  the 
distribution  of data.  Therefore,  query  decomposition  is  the  same  for  centralized 
and distributed systems.
v The successive steps of query decomposition are 
(1) normalization, 
(2) analysis, 
(3) elimination of redundancy, and
(4) rewriting.
v Steps 1, 3, and 4 rely on the fact that various transformations are equivalent for a
given query, and some can have better performance than others. 
v We present  the first  three steps  in  the  context  of  tuple  relational  calculus  (e.g., 
SQL). Only the last step rewrites the query into relational algebra.

32
Step 1 - Query Decomposition
v Decompose calculus query into algebra query using 
global conceptual schema information.

33
Step 1 - Query Decomposition (cont.)

1) Normalization 
w The calculus query is written in a normalized form (CNF or 
DNF) for subsequent manipulation
2) Analysis 
w To reject normalized queries for which further processing 
is either impossible or unnecessary (type incorrect or 
semantically incorrect)
3) Simplification (elimination of redundancy)
w Redundant predicates are eliminated to obtain simplified 
queries 
4) Restructuring (rewriting)
w The calculus query is translated to optimal algebraic query 
representation 
w More than one translation is possible 34
1) Normalization
v Lexical and syntactic analysis 
w check validity (similar to compilers)
w check for attributes and relations 
w type checking on the qualification 
v There are two possible forms of representing the predicates in query 
qualification (the WHERE clause): 
w Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF) 
v The conjunctive normal form is a conjunction ( predicate) of disjunctions (  
predicates) as follows (where pij is a simple predicate): 
– CNF: (p11  p12 ...  p1n)  ...  (pm1  pm2 ...  pmn)
– DNF: (p11  p12 ...  p1n)  ...  (pm1  pm2 ...  pmn)
– OR's mapped into union 
– AND's mapped into join or selection 

35
1) Normalization (cont.)
v The transformation of the quantifier-free predicate is 
straightforward using the well-known equivalence rules for 
1.
logical operations (  ):
2.
3.
4.
5.
6.
7.
8.
9.
10.
36
1) Normalization (cont.)
Example  7.1.  Let  us  consider  the  following  query  on  the  engineering 
database that we have been referring to:
“Find the names of employees who have been working on project P1 for 12 
or 24 months”. 
The query expressed in SQL is
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO=ASG.ENO AND ASG.PNO=”P1”
AND (DUR=12 OR DUR=24)

The qualification in conjunctive normal form:

37
2) Analysis
v Objective
w reject type incorrect or semantically incorrect queries.

v Type incorrect 
w if any of its attribute or relation names is not defined in 
the global schema; 
w if operations are applied to attributes of the wrong type 

38
2) Analysis (cont.)
Example  7.2.  The  following  SQL  query  on  the  engineering  database  is 
type incorrect  for two  reasons.  First,  attribute  E#  is  not  declared  in  the 
schema. Second, the operation “>200” is incompatible with the type string 
of ENAME.

SELECT E# ! Undefined 
attribute
FROM EMP
WHERE ENAME>200
! Type 
mismatch
39
2) Analysis (cont.)
v Semantically incorrect
w Components do not contribute in any way to the 
generation of the result
w For only those queries that do not use disjunction () or 
negation (), semantic correctness can be determined by 
using query graph 

40
Query Graph
v Two kinds of nodes
w One node represents the result relation
w Other nodes represent operand relations  
v Two types of edges
w an edge to represent a join if neither of its two nodes is 
the result
w an edge to represent a projection if one of its node is 
the result node

Nodes and edges may be labeled by 
predicates for selection, projection, or join.
41
Query Graph Example
Example 7.3. Let us consider the following query:
“Find the names and responsibilities of programmers who have been working on the CAD/CAM project for more than 3 years.” The query 
expressed in SQL is

SELECT ENAME, RESP


FROM EMP,ASG,PROJ
WHERE EMP.ENO=ASG.ENO
AND ASG.PNO=PROJ.PNO AND PNAME=“CAD/CAM”
AND DUR>=36 AND TITLE=“Programmer”

42
Join Graph Example 7.3

A subgraph of query graph for join operation.

43
Tool of Analysis

v  A conjunctive query without negation is 
semantically incorrect if its query graph is NOT 
connected!

44
Analysis Example
Example 7.4. Let us consider the following SQL query:

SELECT ENAME, RESP


FROM EMP,ASG,PROJ
WHERE EMP.ENO=ASG.ENO
AND PNAME=“CAD/CAM”
AND DUR>36
AND TITLE=“Programmer”
TITLE=“Programmer”
45
Query Graph Example 2

46
3) Simplification
(elimination of redundancy)

v Using well-known 
idempotency rules to 
eliminate redundant 
predicates from 
WHERE clause.

47
Simplification Example
The SQL query: 
SELECT TITLE
FROM EMP
WHERE (NOT(TITLE=”Programmer)
AND (TITLE=“Programmer”
OR TITLE=“Electrical Eng.”)
AND NOT(TITLE=“Electrical Eng.”))
OR ENAME=“J.Doe”
can be simplified using the previous rules to become: 
SELECT TITLE
FROM EMP
WHERE ENAME="J.Doe"
48
Simplification Example (cont.)

p1 = <TITLE = ``Programmer''>
p2 = <TITLE = ``Elec. Engr''>
p3 = <ENAME = ``J.Doe''>

Let the query qualification is


(¬ p1  (p1  p2)  ¬ p2)  p3

The disjunctive normal form of the query is


= (¬p1  ((p1  ¬p2)  (p2  ¬ p2)))  p3
(¬ p1  p1  ¬p2)  (¬ p1  p2  ¬ p2)  p3
= (false  ¬ p2)  (¬ p1  false)  p3
= false  false  p3
= p3 49
4) Rewriting

v Converting a calculus query into relational algebra
w straightforward transformation from relational calculus to 
relational algebra  
w restructuring relational algebra expression to improve 
performance
w making use of query trees

50
Relational Algebra Tree
v A tree defined by:
w a root node representing the query result
w leaves representing database relations
w non-leaf nodes representing relations produced by operations 
w edges from leaves to root representing the sequences of operations
Example 7.6. The query  “Find the names of employees other than J. Doe who worked 
on the CAD/CAM project for either one or two years” whose SQL expression is: 
SELECT ENAME
FROM PROJ, ASG, EMP
WHERE ASG.ENO = EMP.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME != "J. Doe"
AND PROJ.PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)
51
An SQL Query and Its Query Tree

52
How to translate an SQL query into an
algebra tree?
1. Create a leaf for every relation in the FROM 
clause
2. Create the root as a project operation involving 
attributes in the SELECT clause
3. Create the operation sequence by the predicates 
and operators in the WHERE clause

53
Rewriting -- Transformation Rules (I)
v Commutativity of binary operations: 
RS SR
R S S R
RS SR
v Associativity of binary operations: 
(R  S)  T  R  ( S  T )
(R S) TR (S T)
v Idempotence of  unary operations: grouping of  projections and  selections,  which is  the  property of 
certain operations in mathematics and computer science, that can be applied multiple times without 
changing the result beyond the initial application.

w A’ (  A’’ (R ))   A’ (R ) for A’A’’ A


w p1(A1) (  p2(A2) (R ))   p1(A1) p2(A2) (R )
54
Rewriting -- Transformation Rules (II)
v Commuting selection with projection
A1, …, An (  p (Ap) (R ))  A1, …, An (  p (Ap) ( A1, …, An, Ap(R )))
v Commuting selection with binary operations 
 p (Ai)(R  S)   ( p (Ai)(R))   S
 p (Ai)(R      S)   ( p (Ai)(R))        S
 p (Ai)(R  S)     p (Ai)(R)   p (Ai)(S) 

v Commuting projection with binary operations 
 C(R  S)  A(R)   B (S)   C = A  B
 C(R      S)  C(R)        C (S)
 C (R  S)  C (R)  C (S)  55
How to use transformation rules to
optimize?
v These  rules  allow the  separation  of  the  unary 
operations to simplify the query expression
v Unary  operations  on  the  same  relation  may  be 
grouped to access the same relation once
v Unary  operations  (select,  project)  may  be  commuted 

with binary operations, so that they may be performed 
first to reduce the size of intermediate relations
v Binary operations may be reordered. This rule is used 

extensively in query optimization. 
56
Optimization of Previous Query Tree

57
Step 2 : Data
Localization
v Task : To translate 
an algebraic query 
on global relation 
into an algebraic 
query expressed on 
physical fragments.
v Localization uses 
information stored 
in the fragment 
schema.
58
Data Localization-- An Example
EMP is fragmented into
ENAME EMP1 = ENO “E3” (EMP)
Dur=12  Dur=24 EMP2 =  “E3” < ENO “E6” (EMP)
EMP3 = ENO >“E6” (EMP)
ENAME< >“J.DOE”
ASG is fragmented into
PNAME=“CAD/CAM” ASG1 = ENO “E3” (ASG)
ASG2 = ENO >“E3” (ASG)
PNO

ENO

PROJ 

ASG1 ASG2 EMP1 EMP2 EMP3 59


Reduction with Selection for PHF
SELECT * EMP is fragmented into
FROM EMP EMP1 = ENO “E3” (EMP)
WHERE ENO=“E5”
EMP2 =  “E3” < ENO “E6” (EMP)
EMP3 = ENO >“E6” (EMP)
Given Relation R, FR={R1, R2, …, Rn} where Rj =pj(R)
pj(Rj) =  if x  R: (pi(x)pj(x))
where pi and pj are selection predicates, x denotes a tuple, and p(x) denotes “predicate p holds for x.”

ENO=“E5”
ENO=“E5” ENO=“E5”

EMP EMP1 EMP2 EMP3 EMP2


(a) Localized query (b) reduced query 60
Reduction with Join for PHF
SELECT *
FROM EMP, ASG ENO

WHERE EMP.ENO=ASG.ENO


ENO
ASG1 ASG2 EMP1 EMP2 EMP3
ASG EMP

EMP is fragmented into


ASG is fragmented into
EMP1 = ENO “E3” (EMP)
ASG1 = ENO “E3” (ASG)
EMP2 =  “E3” < ENO “E6” (EMP)
ASG2 = ENO >“E3” (ASG)
EMP3 = ENO >“E6” (EMP)
61
Reduction with Join for PHF (I)
ENO (R1  R2) S 
(R1 S)  (R2 S)
 

ASG1 ASG2 EMP1 EMP2 EMP3


(a) Localized query

ENO ENO ENO ENO ENO ENO

EMP1 ASG1 EMP1 ASG2 EMP2 ASG1 EMP2 ASG2 EMP3 ASG1 EMP3 ASG2

62
Reduction with Join for PHF (II)

ENO ENO ENO

EMP1 ASG1 EMP2 ASG2 EMP3 ASG2


(b) Reduced query
Given Ri =pi(R) and Rj =pj(R)
Ri Rj =  if x  Ri , y Rj: (pi(x)pj(y))

Reduction with join


1. Distribute join over union
2. Eliminate unnecessary work
63
Reduction for VF (Vertical Fragmentation)
v Find useless intermediate relations
Relation R defined over attributes A = {A1, A2, …, An} 
vertically fragmented as Ri =A’ (R) where A’ A 
D (Ri) is useless if the set of projection attributes D is 
not in A’
EMP1= ENO,ENAME (EMP)
EMP2= ENO,TITLE (EMP) ENAME ENAME

ENO
SELECT ENAME
FROM EMP
EMP1 EMP2 EMP1
(a) Localized query (b) Reduced query 64
Reduction for DHF
Distribute joins over union
Apply the join reduction for horizontal fragmentation

EMP1: TITLE=“Programmer” (EMP)


ENO
EMP2: TITLE“Programmer” (EMP)
ASG1: ASG ENO EMP1 TITLE=“MECH. Eng.”
ASG2: ASG ENO EMP2
 
SELECT *
FROM EMP, ASG ASG1 ASG2 EMP1 EMP2
WHERE ASG.ENO = EMP.ENO (a) Localized query
AND EMP.TITLE = “Mech. Eng.”
65
Reduction for DHF (II)
ENO
Selection first

TITLE=“Mech. Eng.” Joins over union


ASG1 ASG2 EMP2

(b) Query after pushing selection down

ENO ENO ENO

TITLE=“Mech. Eng.” TITLE=“Mech. Eng.” TITLE=“Mech. Eng.”

ASG2
ASG1 EMP2 ASG1 EMP2 ASG1
ASG2 EMP2
(d) Reduced query after eliminating the left subtree (c) Query after moving unions up 66
Reduction for Hybrid Fragmentation
v Combines the rules already specified:
w Remove empty relations generated by contradicting 
selection on horizontal fragments;
w Remove useless relations generated by projections on 
vertical fragments;
w Distribute joins over unions in order to isolate and 
remove useless joins.

67
Reduction for Hybrid Fragmentation -
Example
EMP1 = ENO“E4” (ENO,ENAME (EMP))
EMP2 = ENO>“E4” (ENO,ENAME (EMP))
ENAME
EMP3 = ENO,TITLE (EMP)
ENO=“E5”
QUERY: ENAME
SELECT ENAME
EMP2
FROM EMP ENO=“E5”
WHERE ENO = “E5”
(b) Reduced query
ENO


EMP1
ASG1 EMP2 EMP3
(a) Localized query 68
Question & Answer

69

You might also like