DDB Lec 4 PDF

5.
Distributed Query Processing
Chapter 6
Overview of Query Processing
Chapter 7
Query Decomposition and Data
Localization
1
Outline
v Overview of Query Processing
v Query Decomposition and Localization
2
Query Processing
High level user query
Query
Processor
Low level data manipulation commands
3
Query Processing Components
v Query language that is used
w SQL (Structured Query Language)
v Query execution methodology
w The steps that the system goes through in executing
high-level (declarative) user queries
v Query optimization
w How to determine the “best” execution plan?
4
Query Language
v Tuple calculus: { t | F(t) }
where t is a tuple variable, and F(t) is a well formed formula
v Example:
w Get the numbers and names of all programmers.
5
Query Language (cont.)
v Domain calculus: { x1, x2, ..., xn | F(x1, x2, ..., xn) }
where each xi is a domain variable and F(x1, x2, ..., xn) is a well
formed formula
v Example:
{ x, y | EMP(x, y, "Programmer") }
Variables are position sensitive!
6
Query Language (cont.)
v SQL is a tuple calculus language.
SELECT ENO,ENAME
FROM EMP
WHERE TITLE = "Programmer"
End user uses non-procedural (declarative)
languages to express queries.
7
Query Processing Objectives & Problems
v Query processor transforms queries into procedural
operations to access data in an optimal way.
v Distributed query processor has to deal with query
decomposition and data localization.
8
DB example
Figure 3.3
9
Centralized Query Processing
Alternatives
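SELECT ENAME
FROM EMP E, ASG G
WHERE E.ENO = G.ENO AND RESP = "manager"
v Strategy 1: Π_{ENAME}(σ_{RESP="manager" ∧ E.ENO=G.ENO}(E × G))
v Strategy 2: Π_{ENAME}(E ⋈_{ENO} σ_{RESP="manager"}(G))
Strategy 2 avoids the Cartesian product, so it is “better”.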
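The gap between the two strategies can be seen on a toy instance. The sketch below is only an illustration: the rows in EMP and ASG and the function names strategy_1 and strategy_2 are made up, and Strategy 1 is read as "Cartesian product, then selection" while Strategy 2 is read as "selection on ASG, then join on ENO".

```python
from itertools import product

# Toy instances of EMP(ENO, ENAME) and ASG(ENO, RESP); the rows are invented.
EMP = [("E1", "J. Doe"), ("E2", "M. Smith"), ("E3", "A. Lee")]
ASG = [("E1", "manager"), ("E2", "analyst"), ("E3", "manager")]

def strategy_1(emp, asg):
    """Cartesian product first, then selection, then projection on ENAME."""
    pairs = list(product(emp, asg))  # |EMP| * |ASG| intermediate tuples
    kept = [(e, g) for e, g in pairs if e[0] == g[0] and g[1] == "manager"]
    return sorted({e[1] for e, _ in kept}), len(pairs)

def strategy_2(emp, asg):
    """Selection on ASG first, then join on ENO, then projection on ENAME."""
    managers = [g for g in asg if g[1] == "manager"]
    by_eno = {e[0]: e for e in emp}  # lookup table on EMP.ENO
    joined = [(by_eno[g[0]], g) for g in managers if g[0] in by_eno]
    return sorted({e[1] for e, _ in joined}), len(managers)

names_1, size_1 = strategy_1(EMP, ASG)
names_2, size_2 = strategy_2(EMP, ASG)
print(names_1 == names_2, size_1, size_2)
```

Both functions return the same names; the second value is the size of the intermediate relation each strategy builds before the join predicate has done its work.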
10
Distributed Query Processing
v Query processor must consider the communication
cost and select the best site.
v The same query example, but relations ASG (G) and EMP (E) are
fragmented and distributed.
11
Distributed Query Processing Plans
v By centralized optimization,
v Two distributed query processing plans
12
Distributed Query Plan I
Plan I: Transport all fragments to the query site and
execute the query there.
Site 5: Result = (EMP1 ∪ EMP2) ⋈_{ENO} σ_{RESP="manager"}(ASG1 ∪ ASG2)
Site 1: ASG1, Site 2: ASG2, Site 3: EMP1, Site 4: EMP2
This causes heavy network traffic and is very costly.
13
Distributed Query Plan II
Plan II (Optimized):
Site 5: Result = EMP1’ ∪ EMP2’
Site 3: EMP1’ = EMP1 ⋈_{ENO} ASG1’
Site 4: EMP2’ = EMP2 ⋈_{ENO} ASG2’
Site 1: ASG1’ = σ_{RESP="manager"}(ASG1)
Site 2: ASG2’ = σ_{RESP="manager"}(ASG2)
14
Costs of the Two Plans
v Assume:
w size(EMP) = 400, size(ASG) = 1000, 20 tuples with RESP = "manager"
w tuple access cost = 1 unit; tuple transfer cost = 10 units
w ASG and EMP are locally clustered on attributes RESP and ENO, respectively.
v Plan 1
w Transfer EMP to site 5: 400 * tuple transfer cost = 4,000
w Transfer ASG to site 5: 1000 * tuple transfer cost = 10,000
w Produce ASG’: 1000 * tuple access cost = 1,000
w Join EMP and ASG’: 400 * 20 * tuple access cost = 8,000
Total cost = 23,000
v Plan 2
w Produce ASG’: (10+10) * tuple access cost = 20
w Transfer ASG’ to the sites of EMP: (10+10) * tuple transfer cost = 200
w Produce EMP’: (10+10) * tuple access cost * 2 = 40
w Transfer EMP’ to result site: (10+10) * tuple transfer cost = 200
Total cost = 460
15
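The arithmetic behind the two totals can be replayed in a few lines. This is a sketch under the slide's stated assumptions; the constant and function names are invented for illustration, and the even split of the 20 manager tuples over ASG1 and ASG2 is taken from the (10+10) terms above.

```python
TUPLE_ACCESS = 1     # cost units per tuple accessed
TUPLE_TRANSFER = 10  # cost units per tuple shipped between sites

SIZE_EMP, SIZE_ASG = 400, 1000
MANAGER_TUPLES = 10 + 10  # manager tuples in ASG1 plus those in ASG2

def plan_1_cost():
    """Ship everything to the query site, then select and join there."""
    transfer_emp = SIZE_EMP * TUPLE_TRANSFER
    transfer_asg = SIZE_ASG * TUPLE_TRANSFER
    produce_asg = SIZE_ASG * TUPLE_ACCESS
    join = SIZE_EMP * MANAGER_TUPLES * TUPLE_ACCESS
    return transfer_emp + transfer_asg + produce_asg + join

def plan_2_cost():
    """Select at the ASG sites, join at the EMP sites, ship only results."""
    produce_asg = MANAGER_TUPLES * TUPLE_ACCESS
    transfer_asg = MANAGER_TUPLES * TUPLE_TRANSFER
    produce_emp = MANAGER_TUPLES * TUPLE_ACCESS * 2
    transfer_emp = MANAGER_TUPLES * TUPLE_TRANSFER
    return produce_asg + transfer_asg + produce_emp + transfer_emp

print(plan_1_cost(), plan_2_cost())
```

Changing TUPLE_TRANSFER shows how strongly the ranking depends on the ratio between communication and local access cost.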
Query Optimization Objectives
v Minimize a cost function
I/O cost + CPU cost + communication cost
w These might have different weights in different distributed
environments
w Can also maximize throughput
16
Communication Cost
v Wide area network
w Communication cost will dominate
- Low bandwidth
- Low speed
- High protocol overhead
w Most algorithms ignore all other cost components

v Local area network
w Communication cost is not as dominant
w Total cost function should be considered
17
Complexity of Relational Algebra
Operations
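v Complexity is measured in terms of relation cardinality n, assuming tuples are sorted
on the comparison attributes
Operation                                            Complexity
Select σ, Project Π (without duplicate elimination)  O(n)
Project Π (with duplicate elimination), Group by     O(n log n)
Join, Semijoin, Division, Set operators              O(n log n)
Cartesian product ×                                  O(n^2)
Fig (6.2)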
18
Complexity of Relational Algebra
Operations
• The complexity of relational algebra operators, which directly affects their
execution time, dictates some principles useful to a query processor.
• These principles can help in choosing the final execution strategy.
• The simplest way of defining complexity is in terms of relation cardinalities
independent of physical implementation details such as fragmentation and storage
structures.
• Figure 6.2 shows the complexity of unary and binary operators in the order of
increasing complexity, and thus of increasing execution time.
• Complexity is O(n log n) for binary operators if each tuple of one relation must be
compared with each tuple of the other on the basis of the equality of selected
attributes.
• This complexity assumes that tuples of each relation must be sorted on the
comparison attributes.
• This simple look at operator complexity suggests two principles. First, because
complexity is relative to relation cardinalities, the most selective operators that
reduce cardinalities (e.g., selection) should be performed first.
• Second, operators should be ordered by increasing complexity so that Cartesian
products can be avoided or delayed.
19
Characterization of Query Processors
• It is quite difficult to evaluate and compare query processors in
the context of both centralized systems and distributed systems
because they may differ in many aspects.
• In what follows, we list important characteristics of query
processors that can be used as a basis for comparison.
• The first four characteristics hold for both centralized and
distributed query processors while the next four characteristics
are particular to distributed query processors.
20
Types of Query Optimization
• Query optimization aims at choosing the “best” point in the
solution space of all possible execution strategies.
v Exhaustive search
§ Search the solution space exhaustively, predict the cost of each
strategy, and select the strategy with minimum cost.
§ Cost-based
§ Optimal
§ Combinatorial complexity in the number of relations: the
problem becomes worse as the number of relations or
fragments increases (e.g., beyond 5 or 6).
§ Workable for small solution spaces
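A minimal sketch of exhaustive search over left-deep join orders follows. Everything numeric here is hypothetical: the cardinalities in CARD, the join selectivities in JOIN_SEL, and the cost model (sum of estimated intermediate result sizes) are assumptions chosen only to make the enumeration concrete. A pair of relations with no join predicate is treated as a Cartesian product (selectivity 1).

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical statistics; not taken from the slides.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 50}
JOIN_SEL = {
    frozenset({"EMP", "ASG"}): Fraction(1, 400),
    frozenset({"ASG", "PROJ"}): Fraction(1, 50),
}

def plan_cost(order):
    """Sum of estimated intermediate sizes for a left-deep join order."""
    joined = [order[0]]
    size = Fraction(CARD[order[0]])
    cost = Fraction(0)
    for rel in order[1:]:
        size *= CARD[rel]
        for prev in joined:
            # No predicate between prev and rel means selectivity 1.
            size *= JOIN_SEL.get(frozenset({prev, rel}), 1)
        joined.append(rel)
        cost += size
    return cost

def exhaustive_search(relations):
    """Enumerate every permutation and keep the cheapest one."""
    plans = list(permutations(relations))
    return min(plans, key=plan_cost), len(plans)

best_order, plans_examined = exhaustive_search(list(CARD))
print(best_order, plan_cost(best_order), plans_examined)
```

With n relations the loop examines n! orders, which is why the approach is only workable for small solution spaces.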
21
Types of Query Optimization
v Heuristics
• Not optimal
• Restrict the solution space so that only a few
strategies are considered
• Perform unary operations (selection and
projection) first
• Reorder operations to reduce intermediate
relation size
• Replace a join by a series of semijoins to
minimize data communication.
22
Query Optimization Granularity
v Single query at a time
w Cannot use common intermediate results
v Multiple queries at a time
w Efficient if many similar queries
w Decision space is much larger
23
Query Optimization Timing
v Static
w Done at compilation time using statistics; appropriate
for exhaustive search; the query is optimized once but executed
many times.
w Difficult to estimate the size of the intermediate results
w Optimization cost can be amortized over many executions
v Dynamic
w Done at execution time; accurate about the size of the
intermediate results, but repeated for every execution and
therefore expensive.
24
Query Optimization Timing (cont.)
v Hybrid
w Compile using a static algorithm
w If the error in the estimated sizes exceeds a threshold, re-optimize at
run time
25
Statistics
v Relation
w Cardinality
w Size of a tuple
w Fraction of tuples participating in a join with another relation
v Attributes
w Cardinality of the domain
w Actual number of distinct values
v Common assumptions
w Independence between different attribute values
w Uniform distribution of attribute values within their domain
26
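The two common assumptions translate directly into a size estimate. The sketch below is illustrative only: the function names and the distinct-value counts are hypothetical. Uniformity gives an equality predicate on an attribute the selectivity 1 / (number of distinct values), and independence lets the selectivities of a conjunction be multiplied.

```python
from fractions import Fraction

def equality_selectivity(distinct_values):
    """Uniform distribution: each distinct value is equally likely."""
    return Fraction(1, distinct_values)

def estimate_cardinality(relation_cardinality, *selectivities):
    """Independence: selectivities of conjunctive predicates multiply."""
    estimate = Fraction(relation_cardinality)
    for s in selectivities:
        estimate *= s
    return estimate

# Hypothetical statistics: ASG has 1000 tuples, RESP has 50 distinct
# values, DUR has 10 distinct values.
resp_only = estimate_cardinality(1000, equality_selectivity(50))
resp_and_dur = estimate_cardinality(
    1000, equality_selectivity(50), equality_selectivity(10)
)
print(resp_only, resp_and_dur)
```

When attribute values are skewed or correlated, these estimates can be far off, which is one reason intermediate result sizes are hard to predict statically.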
Decision Sites
v Query optimization may be carried out by
w Single site – centralized approach
– Single site determines the best schedule
– Simple
– Needs knowledge about the entire distributed database
w All the sites involved – distributed approach
– Cooperation among sites to determine the schedule
– Needs only local information
– Incurs the cost of cooperation
w Hybrid – one site makes the major decisions in cooperation
with other sites making local decisions
– One site determines the global schedule
– Each site optimizes the local subqueries
27
Network Topology
v Wide Area Network (WAN) – point-to-point
w Characteristics
– Low bandwidth
– Low speed
– High protocol overhead
w Communication cost will dominate; ignore all other cost
factors
w Global schedule to minimize communication cost
w Local schedules according to centralized query
optimization
28
Network Topology (cont.)
v Local Area Network (LAN)
w Communication cost is not as dominant
w Total cost function should be considered
w Broadcasting can be exploited
w Special algorithms exist for star networks
29
Other Information to Exploit
v Using replications to minimize communication
costs
v Using semijoins to reduce the size of operand
relations to cut down communication costs when
overhead is not significant.
30
Layers of Query Processing
Calculus Query on Distributed Relations
  ↓ QUERY DECOMPOSITION (uses GLOBAL SCHEMA)
Algebra Query on Distributed Relations
  ↓ DATA LOCALIZATION (uses FRAGMENT SCHEMA)
Fragment Query
  ↓ GLOBAL OPTIMIZATION (uses STATISTICS ON FRAGMENTS)
Optimized Fragment Query With Communication Operations
  ↓ LOCAL OPTIMIZATION (uses LOCAL SCHEMA)
Optimized Local Queries
The first three layers run at the CONTROL SITE; local optimization runs at the LOCAL SITES.
31

Query Decomposition
v Query decomposition is the first phase of query processing that transforms a
relational calculus query into a relational algebra query.
v Both input and output queries refer to global relations, without knowledge of the
distribution of data. Therefore, query decomposition is the same for centralized
and distributed systems.
v The successive steps of query decomposition are
(1) normalization,
(2) analysis,
(3) elimination of redundancy, and
(4) rewriting.
v Steps 1, 3, and 4 rely on the fact that various transformations are equivalent for a
given query, and some can have better performance than others.
v We present the first three steps in the context of tuple relational calculus (e.g.,
SQL). Only the last step rewrites the query into relational algebra.
32
Step 1 - Query Decomposition
v Decompose calculus query into algebra query using
global conceptual schema information.
33
Step 1 - Query Decomposition (cont.)
1) Normalization
w The calculus query is written in a normalized form (CNF or
DNF) for subsequent manipulation
2) Analysis
w To reject normalized queries for which further processing
is either impossible or unnecessary (type incorrect or
semantically incorrect)
3) Simplification (elimination of redundancy)
w Redundant predicates are eliminated to obtain simplified
queries
4) Restructuring (rewriting)
w The calculus query is translated to optimal algebraic query
representation
w More than one translation is possible
34
1) Normalization
v Lexical and syntactic analysis
w check validity (similar to compilers)
w check for attributes and relations
w type checking on the qualification
v There are two possible forms of representing the predicates in query
qualification (the WHERE clause):
w Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF)
v The conjunctive normal form is a conjunction (∧ predicate) of disjunctions (∨
predicates); the disjunctive normal form is a disjunction of conjunctions (where pij is a simple predicate):
– CNF: (p11 ∨ p12 ∨ ... ∨ p1n) ∧ ... ∧ (pm1 ∨ pm2 ∨ ... ∨ pmn)
– DNF: (p11 ∧ p12 ∧ ... ∧ p1n) ∨ ... ∨ (pm1 ∧ pm2 ∧ ... ∧ pmn)
– ORs are mapped into union
– ANDs are mapped into join or selection
35
1) Normalization (cont.)
v The transformation of the quantifier-free predicate is
straightforward using the well-known equivalence rules for
logical operations (∧, ∨, and ¬).
[Equivalence rules 1 to 10: shown as an image on the slide; not recoverable from the extracted text]
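The equivalences usually used for normalization are commutativity, associativity, distributivity, De Morgan's laws, and double negation. Whether these nine match the slide's own numbered list one for one is an assumption. The sketch below checks each of them by brute-force truth table; the dictionary RULES and the helper holds are names invented for this example.

```python
from itertools import product

# Each entry pairs the two sides of one equivalence over predicates p, q, r.
RULES = {
    "AND commutes": (lambda p, q, r: p and q, lambda p, q, r: q and p),
    "OR commutes": (lambda p, q, r: p or q, lambda p, q, r: q or p),
    "AND associates": (lambda p, q, r: p and (q and r),
                       lambda p, q, r: (p and q) and r),
    "OR associates": (lambda p, q, r: p or (q or r),
                      lambda p, q, r: (p or q) or r),
    "AND over OR": (lambda p, q, r: p and (q or r),
                    lambda p, q, r: (p and q) or (p and r)),
    "OR over AND": (lambda p, q, r: p or (q and r),
                    lambda p, q, r: (p or q) and (p or r)),
    "De Morgan (AND)": (lambda p, q, r: not (p and q),
                        lambda p, q, r: (not p) or (not q)),
    "De Morgan (OR)": (lambda p, q, r: not (p or q),
                       lambda p, q, r: (not p) and (not q)),
    "double negation": (lambda p, q, r: not (not p), lambda p, q, r: p),
}

def holds(lhs, rhs):
    """True if both sides agree on every truth assignment of p, q, r."""
    return all(lhs(*v) == rhs(*v) for v in product([False, True], repeat=3))

results = {name: holds(lhs, rhs) for name, (lhs, rhs) in RULES.items()}
print(all(results.values()))
```

Repeated application of the distributivity and De Morgan rules is what moves a qualification into CNF or DNF.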
36
1) Normalization (cont.)
Example 7.1. Let us consider the following query on the engineering
database that we have been referring to:
“Find the names of employees who have been working on project P1 for 12
or 24 months”.
The query expressed in SQL is
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO AND ASG.PNO = "P1"
AND (DUR = 12 OR DUR = 24)
The qualification in conjunctive normal form:
EMP.ENO = ASG.ENO ∧ ASG.PNO = "P1" ∧ (DUR = 12 ∨ DUR = 24)
37
2) Analysis
v Objective
w reject type incorrect or semantically incorrect queries.
v Type incorrect
w if any of its attribute or relation names is not defined in
the global schema;
w if operations are applied to attributes of the wrong type
38
2) Analysis (cont.)
Example 7.2. The following SQL query on the engineering database is
type incorrect for two reasons. First, attribute E# is not declared in the
schema. Second, the operation “>200” is incompatible with the type string
of ENAME.
SELECT E#          -- undefined attribute
FROM EMP
WHERE ENAME > 200  -- type mismatch
39
2) Analysis (cont.)
v Semantically incorrect
w Components do not contribute in any way to the
generation of the result
w Only for queries that do not use disjunction (∨) or
negation (¬) can semantic correctness be determined by
using a query graph
40
Query Graph
v Two kinds of nodes
w One node represents the result relation
w Other nodes represent operand relations
v Two types of edges
w an edge to represent a join if neither of its two nodes is
the result
w an edge to represent a projection if one of its nodes is
the result node
Nodes and edges may be labeled by
predicates for selection, projection, or join.
41
Query Graph Example
Example 7.3. Let us consider the following query:
“Find the names and responsibilities of programmers who have been working on the
CAD/CAM project for more than 3 years.” The query
expressed in SQL is
SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND ASG.PNO = PROJ.PNO AND PNAME = "CAD/CAM"
AND DUR >= 36 AND TITLE = "Programmer"
42
Join Graph Example 7.3
A subgraph of query graph for join operation.
43
Tool of Analysis
v A conjunctive query without negation is
semantically incorrect if its query graph is NOT
connected!
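The connectivity test is a plain graph traversal. The sketch below assumes the query graph is given as a list of node names plus a list of undirected edges; is_connected and the two edge lists are names made up for this example. The first graph follows the query of Example 7.3, and the second is the same query with the ASG.PNO = PROJ.PNO join predicate left out.

```python
def is_connected(nodes, edges):
    """Depth-first search: True if every node is reachable from the first."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    start = nodes[0]
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return seen == set(nodes)

NODES = ["RESULT", "EMP", "ASG", "PROJ"]

# Join edges EMP-ASG and ASG-PROJ; projection edges to RESULT for
# ENAME (from EMP) and RESP (from ASG).
complete_query = [("EMP", "ASG"), ("ASG", "PROJ"),
                  ("EMP", "RESULT"), ("ASG", "RESULT")]
# Same query without the ASG-PROJ join predicate: PROJ is isolated.
missing_join = [("EMP", "ASG"), ("EMP", "RESULT"), ("ASG", "RESULT")]

print(is_connected(NODES, complete_query), is_connected(NODES, missing_join))
```

A disconnected graph means some relation in the FROM clause is never tied to the result, so the query is rejected or the missing join predicate has to be supplied.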
44
Analysis Example
Example 7.4. Let us consider the following SQL query:
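SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO = ASG.ENO
AND PNAME = "CAD/CAM"
AND DUR > 36
AND TITLE = "Programmer"
The join predicate ASG.PNO = PROJ.PNO is missing, so the query graph is disconnected and the query is semantically incorrect.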
45
Query Graph Example 2
46
3) Simplification
(elimination of redundancy)
v Using well-known idempotency rules to eliminate redundant
predicates from the WHERE clause.
47
Simplification Example
The SQL query:
SELECT TITLE
FROM EMP
WHERE (NOT(TITLE = "Programmer")
AND (TITLE = "Programmer"
OR TITLE = "Electrical Eng.")
AND NOT(TITLE = "Electrical Eng."))
OR ENAME = "J.Doe"
can be simplified using the previous rules to become:
SELECT TITLE
FROM EMP
WHERE ENAME="J.Doe"
48
Simplification Example (cont.)
p1 = <TITLE = "Programmer">
p2 = <TITLE = "Electrical Eng.">
p3 = <ENAME = "J.Doe">
Let the query qualification be
(¬p1 ∧ (p1 ∨ p2) ∧ ¬p2) ∨ p3
The disjunctive normal form of the query is
(¬p1 ∧ ((p1 ∧ ¬p2) ∨ (p2 ∧ ¬p2))) ∨ p3
= (¬p1 ∧ p1 ∧ ¬p2) ∨ (¬p1 ∧ p2 ∧ ¬p2) ∨ p3
= (false ∧ ¬p2) ∨ (¬p1 ∧ false) ∨ p3
= false ∨ false ∨ p3
= p3
49
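The result of the derivation can be double-checked mechanically. The sketch below treats p1, p2, and p3 as independent Boolean variables (an assumption that is stronger than needed, since the equivalence then holds for every assignment) and compares the original qualification with the simplified one over the full truth table. The function names are invented for this example.

```python
from itertools import product

def qualification(p1, p2, p3):
    """Original WHERE qualification of the simplification example."""
    return ((not p1) and (p1 or p2) and (not p2)) or p3

def simplified(p1, p2, p3):
    """Qualification after eliminating the redundant predicates."""
    return p3

# Compare both forms on all 8 truth assignments.
equivalent = all(
    qualification(*values) == simplified(*values)
    for values in product([False, True], repeat=3)
)
print(equivalent)
```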
4) Rewriting
v Converting a calculus query into relational algebra
w straightforward transformation from relational calculus to
relational algebra
w restructuring relational algebra expression to improve
performance
w making use of query trees
50
Relational Algebra Tree
v A tree defined by:
w a root node representing the query result
w leaves representing database relations
w non-leaf nodes representing relations produced by operations
w edges from leaves to root representing the sequences of operations
Example 7.6. The query “Find the names of employees other than J. Doe who worked
on the CAD/CAM project for either one or two years” whose SQL expression is:
SELECT ENAME
FROM PROJ, ASG, EMP
WHERE ASG.ENO = EMP.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME != "J. Doe"
AND PROJ.PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)
51
An SQL Query and Its Query Tree
52
How to translate an SQL query into an
algebra tree?
1. Create a leaf for every relation in the FROM
clause
2. Create the root as a project operation involving
attributes in the SELECT clause
3. Create the operation sequence by the predicates
and operators in the WHERE clause
53
Rewriting -- Transformation Rules (I)
v Commutativity of binary operations:
R × S ⇔ S × R
R ⋈ S ⇔ S ⋈ R
R ∪ S ⇔ S ∪ R
v Associativity of binary operations:
(R × S) × T ⇔ R × (S × T)
(R ⋈ S) ⋈ T ⇔ R ⋈ (S ⋈ T)
v Idempotence of unary operations: an idempotent operation can be applied several times without
changing the result beyond its first application, so successive projections or selections on the
same relation can be grouped.
w Π_{A’}(Π_{A’’}(R)) ⇔ Π_{A’}(R) for A’ ⊆ A’’ ⊆ A
w σ_{p1(A1)}(σ_{p2(A2)}(R)) ⇔ σ_{p1(A1) ∧ p2(A2)}(R)
54
Rewriting -- Transformation Rules (II)
v Commuting selection with projection
Π_{A1, …, An}(σ_{p(Ap)}(R)) ⇔ Π_{A1, …, An}(σ_{p(Ap)}(Π_{A1, …, An, Ap}(R)))
v Commuting selection with binary operations
σ_{p(Ai)}(R × S) ⇔ (σ_{p(Ai)}(R)) × S
σ_{p(Ai)}(R ⋈ S) ⇔ (σ_{p(Ai)}(R)) ⋈ S
σ_{p(Ai)}(R ∪ S) ⇔ σ_{p(Ai)}(R) ∪ σ_{p(Ai)}(S)
v Commuting projection with binary operations
Π_{C}(R × S) ⇔ Π_{A}(R) × Π_{B}(S), where C = A ∪ B
Π_{C}(R ⋈ S) ⇔ Π_{C}(R) ⋈ Π_{C}(S)
Π_{C}(R ∪ S) ⇔ Π_{C}(R) ∪ Π_{C}(S)
55
How to use transformation rules to
optimize?
v These rules allow the separation of the unary
operations to simplify the query expression
v Unary operations on the same relation may be
grouped to access the same relation once
v Unary operations (select, project) may be commuted
with binary operations, so that they may be performed
first to reduce the size of intermediate relations
v Binary operations may be reordered. This rule is used
extensively in query optimization.
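Commuting a selection with a join can be tried out on toy data. In the sketch below the relations are lists of dictionaries, and select, join, the sample rows, and the predicate are all hypothetical. It evaluates the selection after the join and before the join and compares the two results.

```python
def select(relation, predicate):
    """Relational selection: keep the tuples that satisfy the predicate."""
    return [t for t in relation if predicate(t)]

def join(r, s, attr):
    """Equijoin of r and s on a shared attribute (nested loops)."""
    return [{**a, **b} for a in r for b in s if a[attr] == b[attr]]

# Invented sample rows.
EMP = [{"ENO": "E1", "ENAME": "J. Doe"}, {"ENO": "E2", "ENAME": "M. Smith"}]
ASG = [{"ENO": "E1", "DUR": 12}, {"ENO": "E2", "DUR": 36},
       {"ENO": "E2", "DUR": 24}]

def one_or_two_years(t):
    return t["DUR"] in (12, 24)

# Selection applied after the join: the join sees all of ASG.
selection_last = select(join(ASG, EMP, "ENO"), one_or_two_years)
# Selection pushed below the join: the join sees a smaller operand.
reduced_asg = select(ASG, one_or_two_years)
selection_first = join(reduced_asg, EMP, "ENO")

print(selection_last == selection_first, len(ASG), len(reduced_asg))
```

The results are identical, but the join in the second form receives fewer tuples, which is the point of performing unary operations first.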
56
Optimization of Previous Query Tree
57
Step 2 : Data Localization
v Task: To translate an algebraic query on global relations
into an algebraic query expressed on physical fragments.
v Localization uses information stored in the fragment schema.
58
Data Localization-- An Example
EMP is fragmented into
EMP1 = σ_{ENO ≤ "E3"}(EMP)
EMP2 = σ_{"E3" < ENO ≤ "E6"}(EMP)
EMP3 = σ_{ENO > "E6"}(EMP)
ASG is fragmented into
ASG1 = σ_{ENO ≤ "E3"}(ASG)
ASG2 = σ_{ENO > "E3"}(ASG)
Localized query tree (root first):
Π_{ENAME}
σ_{DUR=12 ∨ DUR=24}
σ_{ENAME ≠ "J. Doe"}
σ_{PNAME="CAD/CAM"}
⋈_{PNO} with operands PROJ and
⋈_{ENO} with operands (ASG1 ∪ ASG2) and (EMP1 ∪ EMP2 ∪ EMP3)
59

Reduction with Selection for PHF
SELECT *
FROM EMP
WHERE ENO = "E5"
EMP is fragmented into
EMP1 = σ_{ENO ≤ "E3"}(EMP)
EMP2 = σ_{"E3" < ENO ≤ "E6"}(EMP)
EMP3 = σ_{ENO > "E6"}(EMP)
Given relation R and F_R = {R1, R2, …, Rn} where Rj = σ_{pj}(R):
σ_{pi}(Rj) = ∅ if ∀x ∈ R: ¬(pi(x) ∧ pj(x))
where pi and pj are selection predicates, x denotes a tuple, and p(x) denotes “predicate p holds for x.”
(a) Localized query: σ_{ENO="E5"}(EMP1 ∪ EMP2 ∪ EMP3)
(b) Reduced query: σ_{ENO="E5"}(EMP2)
60
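The reduction rule amounts to testing the query predicate against each fragment's defining predicate. The sketch below models every EMP fragment as a half-open range (low, high] on ENO, with None for an open end. This representation, the names FRAGMENTS, may_contain, and reduce_selection, and the use of plain string comparison on values such as "E5" are all assumptions made for illustration.

```python
# Fragment predicates as (low, high] ranges on ENO; None means unbounded.
FRAGMENTS = {
    "EMP1": (None, "E3"),  # ENO <= "E3"
    "EMP2": ("E3", "E6"),  # "E3" < ENO <= "E6"
    "EMP3": ("E6", None),  # ENO > "E6"
}

def may_contain(bounds, value):
    """True if a tuple with ENO = value can satisfy the fragment predicate."""
    low, high = bounds
    above_low = low is None or value > low
    below_high = high is None or value <= high
    return above_low and below_high

def reduce_selection(fragments, value):
    """Keep only fragments whose predicate does not contradict ENO = value."""
    return [name for name, bounds in fragments.items()
            if may_contain(bounds, value)]

print(reduce_selection(FRAGMENTS, "E5"))
```

Fragments that are filtered out would yield empty relations, so the corresponding subtrees can be dropped from the localized query.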
Reduction with Join for PHF
SELECT *
FROM EMP, ASG
WHERE EMP.ENO = ASG.ENO
EMP is fragmented into
EMP1 = σ_{ENO ≤ "E3"}(EMP)
EMP2 = σ_{"E3" < ENO ≤ "E6"}(EMP)
EMP3 = σ_{ENO > "E6"}(EMP)
ASG is fragmented into
ASG1 = σ_{ENO ≤ "E3"}(ASG)
ASG2 = σ_{ENO > "E3"}(ASG)
Generic query: ASG ⋈_{ENO} EMP
Localized query: (ASG1 ∪ ASG2) ⋈_{ENO} (EMP1 ∪ EMP2 ∪ EMP3)
Reduction with Join for PHF (I)
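Distribute join over union: (R1 ∪ R2) ⋈ S ⇔ (R1 ⋈ S) ∪ (R2 ⋈ S)
(a) Localized query: (ASG1 ∪ ASG2) ⋈_{ENO} (EMP1 ∪ EMP2 ∪ EMP3)
After distributing the join over the unions:
(EMP1 ⋈_{ENO} ASG1) ∪ (EMP1 ⋈_{ENO} ASG2) ∪ (EMP2 ⋈_{ENO} ASG1) ∪
(EMP2 ⋈_{ENO} ASG2) ∪ (EMP3 ⋈_{ENO} ASG1) ∪ (EMP3 ⋈_{ENO} ASG2)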
62
Reduction with Join for PHF (II)

(b) Reduced query:
(EMP1 ⋈_{ENO} ASG1) ∪ (EMP2 ⋈_{ENO} ASG2) ∪ (EMP3 ⋈_{ENO} ASG2)
Given Ri = σ_{pi}(R) and Rj = σ_{pj}(R):
Ri ⋈ Rj = ∅ if ∀x ∈ Ri, ∀y ∈ Rj: ¬(pi(x) ∧ pj(y))
Reduction with join
1. Distribute join over union
2. Eliminate unnecessary work
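The two steps can be sketched as: enumerate every fragment pair produced by distributing the join over the unions, then drop the pairs whose ENO ranges cannot overlap. As before, modelling fragment predicates as (low, high] ranges with None for an open end, the dictionary and function names, and string comparison on ENO values are assumptions for illustration.

```python
from itertools import product

EMP_FRAGMENTS = {
    "EMP1": (None, "E3"),  # ENO <= "E3"
    "EMP2": ("E3", "E6"),  # "E3" < ENO <= "E6"
    "EMP3": ("E6", None),  # ENO > "E6"
}
ASG_FRAGMENTS = {
    "ASG1": (None, "E3"),  # ENO <= "E3"
    "ASG2": ("E3", None),  # ENO > "E3"
}

def overlaps(a, b):
    """True if the (low, high] ranges a and b can share a join value."""
    lows = [x for x in (a[0], b[0]) if x is not None]
    highs = [x for x in (a[1], b[1]) if x is not None]
    if not lows or not highs:
        return True  # unbounded on one side in both ranges
    return max(lows) < min(highs)

def reduce_join(left, right):
    """Distribute the join over the unions, keep only non-empty joins."""
    return [(l, r) for l, r in product(left, right)
            if overlaps(left[l], right[r])]

print(reduce_join(EMP_FRAGMENTS, ASG_FRAGMENTS))
```

Three of the six candidate joins survive, matching the reduced query on the slide.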
63
Reduction for VF (Vertical Fragmentation)
v Find useless intermediate relations
Relation R defined over attributes A = {A1, A2, …, An} is
vertically fragmented as Ri = Π_{A’}(R) where A’ ⊆ A
Π_{D}(Ri) is useless if the set of projection attributes D is
not in A’
EMP1 = Π_{ENO,ENAME}(EMP)
EMP2 = Π_{ENO,TITLE}(EMP)
SELECT ENAME
FROM EMP
(a) Localized query: Π_{ENAME}(EMP1 ⋈_{ENO} EMP2)
(b) Reduced query: Π_{ENAME}(EMP1)
64
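For vertical fragments the check is a set comparison between the projected attributes and each fragment's attributes. In the sketch below, the names FRAGMENTS, KEY, and useful_fragments are hypothetical, and the key ENO is ignored in the comparison because every vertical fragment carries it. Falling back to a single fragment when only the key is projected is a choice made for this example.

```python
KEY = "ENO"
FRAGMENTS = {
    "EMP1": {"ENO", "ENAME"},
    "EMP2": {"ENO", "TITLE"},
}

def useful_fragments(fragments, projected, key):
    """Return the fragments that supply at least one projected attribute."""
    needed = set(projected) - {key}
    if not needed:
        # Only the key is requested: any single fragment is enough.
        return [next(iter(fragments))]
    return [name for name, attrs in fragments.items()
            if needed & (attrs - {key})]

print(useful_fragments(FRAGMENTS, ["ENAME"], KEY))
```

Fragments that are not returned contribute nothing to the projection, so the join that reconstructs EMP from them can be removed.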
Reduction for DHF
Distribute joins over union
Apply the join reduction for horizontal fragmentation
EMP1 = σ_{TITLE="Programmer"}(EMP)
EMP2 = σ_{TITLE≠"Programmer"}(EMP)
ASG1 = ASG ⋉_{ENO} EMP1
ASG2 = ASG ⋉_{ENO} EMP2
SELECT *
FROM EMP, ASG
WHERE ASG.ENO = EMP.ENO
AND EMP.TITLE = "Mech. Eng."
(a) Localized query: (ASG1 ∪ ASG2) ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP1 ∪ EMP2)
65
Reduction for DHF (II)
Selection first
(b) Query after pushing selection down:
(ASG1 ∪ ASG2) ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP2)
Joins over union
(c) Query after moving unions up:
(ASG1 ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP2)) ∪ (ASG2 ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP2))
(d) Reduced query after eliminating the left subtree:
ASG2 ⋈_{ENO} σ_{TITLE="Mech. Eng."}(EMP2)
66
Reduction for Hybrid Fragmentation
v Combines the rules already specified:
w Remove empty relations generated by contradicting
selection on horizontal fragments;
w Remove useless relations generated by projections on
vertical fragments;
w Distribute joins over unions in order to isolate and
remove useless joins.
67
Reduction for Hybrid Fragmentation -
Example
EMP1 = σ_{ENO ≤ "E4"}(Π_{ENO,ENAME}(EMP))
EMP2 = σ_{ENO > "E4"}(Π_{ENO,ENAME}(EMP))
EMP3 = Π_{ENO,TITLE}(EMP)
QUERY:
SELECT ENAME
FROM EMP
WHERE ENO = "E5"
(a) Localized query: Π_{ENAME}(σ_{ENO="E5"}((EMP1 ∪ EMP2) ⋈_{ENO} EMP3))
(b) Reduced query: Π_{ENAME}(σ_{ENO="E5"}(EMP2))
68
Question & Answer
69
