Professional Documents
Culture Documents
Chapter 6
Overview of Query Processing
Chapter 7
Query Decomposition and Data
Localization
1
Outline
v Overview of Query Processing
v Query Decomposition and Localization
2
Query Processing
Query
Processor
3
Query Processing Components
v Query language that is used
w SQL (Structured Query Language)
v Query execution methodology
w The steps that the system goes through in executing
high-level (declarative) user queries
v Query optimization
w How to determine the “best” execution plan?
4
Query Language
v Tuple calculus: { t | F(t) }
where t is a tuple variable, and F(t) is a well formed formula
v Example:
5
Query Language (cont.)
v Domain calculus:
where xi is a domain variable, and is a well
formed formula
v Example:
{ x, y | EMP(x, y, “Programmer") }
Variables are position sensitive!
6
Query Language (cont.)
v SQL is a tuple calculus language.
SELECT ENO,ENAME
FROM EMP
WHERE TITLE=“Programmer”
End user uses non-procedural (declarative)
languages to express queries.
7
Query Processing Objectives & Problems
v Query processor transforms queries into procedural
operations to access data in an optimal way.
v Distributed query processor has to deal with query
decomposition and data localization.
8
DB example
Figure 3.3
9
Centralized Query Processing
Alternatives
SELECT ENAME
FROM EMP E, ASG G
WHERE E.ENO = G.ENO AND RESP=“manager”
v Strategy 1:
v Strategy 2:
Strategy 2 avoids Cartesian product, so is it “better”.
10
Distributed Query Processing
v Query processor must consider the communication
cost and select the best site.
v The same query example, but relation G and E are
fragmented and distributed.
11
Distributed Query Processing Plans
v By centralized optimization,
v Two distributed query processing plans
12
12
Distributed Query Plan I
Plan I: To transport all segments to query site and
execute there.
Site 5
13
Distributed Query Plan II
Plan II (Optimized):
Site 5
Result = (EMP1 ’ EMP2 ’)
EMP1’ EMP2’
Site 3 Site 4
EMP1’ = EMP1 ⋈ ENO ASG1’ EMP2’ = EMP2 ⋈ ENO ASG2’
ASG1’ ASG2’
Site 1 Site 2
ASG1’ = RESP=“manager” (ASG1) ASG2’ = RESP =“manager” (ASG2)
14
Costs of the Two Plans
v Assume:
w size(EMP)=400, size(ASG)=1000, 20 tuples with RESP =“manager”
w tuple access cost = 1 unit; tuple transfer cost = 10 units
w ASG and EMP are locally clustered on attribute RESP and ENO, respectively.
v Plan 1
w Transfer EMP to site 5: 400*tuple transfer cost 4000
w Transfer ASG to site 5: 1000*tuple transfer cost 10000
w Produce ASG’: 1000*tuple access cost 1000
w Join EMP and ASG’: 400*20*tuple access cost 8000
Total cost 23,000
v Plan 2
w Produce ASG’: (10+10)*tuple access cost 20
w Transfer ASG’ to the sites of EMP: (10+10)*tuple transfer cost 200
w Produce EMP’: (10+10)*tuple access cost * 2 40
w Transfer EMP’ to result site: (10+10)*tuple transfer cost 200
Total cost 460 15
Query Optimization Objectives
v Minimize a cost function
I/O cost + CPU cost + communication cost
w These might have different weights in different distributed
environments
w Can also maximize throughout
16
Communication Cost
v Wide area network
w Communication cost will dominate
- Low bandwidth
- Low speed
- High protocol overhead
w Most algorithms ignore all other cost components
v Local area network
w Communication cost not that dominate
w Total cost function should be considered
17
Complexity of Relational Algebra
Operations
v Measured by cardinality n and tuples are sorted
on comparison attributes
Cartesian-Product X
Fig (6.2)
18
Complexity of Relational Algebra
Operations
• The complexity of relational algebra operators, which directly affects their
execution time, dictates some principles useful to a query processor.
• These principles can help in choosing the final execution strategy.
• The simplest way of defining complexity is in terms of relation cardinalities
independent of physical implementation details such as fragmentation and storage
structures.
• Figure 6.2 shows the complexity of unary and binary operators in the order of
increasing complexity, and thus of increasing execution time.
• Complexity is O(n log n) for binary operators if each tuple of one relation must be
compared with each tuple of the other on the basis of the equality of selected
attributes.
• This complexity assumes that tuples of each relation must be sorted on the
comparison attributes.
• This simple look at operator complexity suggests two principles. First, because
complexity is relative to relation cardinalities, the most selective operators that
reduce cardinalities (e.g., selection) should be performed first.
• Second, operators should be ordered by increasing complexity so that Cartesian
products can be avoided or delayed. 19
Characterization of Query Processors
20
Types of Query Optimization
• Query optimization aims at choosing the “best” point in the
solution space of all possible execution strategies.
v Exhaustive search
§ method for query optimization is to search the solution
space, exhaustively predict the cost of each strategy, and
select the strategy with minimum cost.
§ Cost-based
§ Optimal
§ Combinatorial complexity in the number of relations (The
problem becomes worse as the number of relations or
fragments increases (e.g., becomes greater than 5 or 6).
§ Workable for small solution spaces
21
Types of Query Optimization
v Heuristics
• Not optimal
• restrict the solution space so that only a few
strategies are considered
• Perform unary operations (selection and
projection) first
• Reorder operations to reduce intermediate
relation size
• Replace a join by a series of semijoins to
minimize data communication.
22
Query Optimization Granularity
v Single query at a time
w Cannot use common intermediate results
v Multiple queries at a time
w Efficient if many similar queries
w Decision space is much larger
23
Query Optimization Timing
v Static
w Do it at compilation time by using statistics, appropriate
for exhaustive search, optimized once, but executed
many times.
w Difficult to estimate the size of the intermediate results
w Can amortize over many executions
v Dynamic
w Do it at execution time, accurate about the size of the
intermediate results, repeated for every execution,
expensive. 24
Query Optimization Timing (cont.)
v Hybrid
w Compile using a static algorithm
w If the error in estimate size > threshold, re-optimizing at
run time
25
Statistics
v Relation
w Cardinality
w Size of a tuple
w Fraction of tuples participating in a join with another relation
v Attributes
w Cardinality of the domain
w Actual number of distinct values
v Common assumptions
w Independence between different attribute values
w Uniform distribution of attribute values within their domain 26
Decision Sites
v For query optimization, it may be done by
w Single site – centralized approach
– Single site determines the best schedule
– Simple
– Need knowledge about the entire distributed database
w All the sites involved – distributed approach
– Cooperation among sites to determine the schedule
– Need only local information
– Cost of operation
w Hybrid – one site makes major decision in cooperation
with other sites making local decisions
– One site determines the global schedule
– Each site optimizes the local subqueries 27
Network Topology
v Wide Area Network (WAN) – point-to-point
w Characteristics
– Low bandwidth
– Low speed
– High protocol overhead
w Communication cost will dominate; ignore all other cost
factors
w Global schedule to minimize communication cost
w Local schedules according to centralized query
optimization
28
Network Topology (cont.)
v Local Area Network (LAN)
w communication cost not that dominate
w Total cost function should be considered
w Broadcasting can be exploited
w Special algorithms exist for star networks
29
Other Information to Exploit
v Using replications to minimize communication
costs
v Using semijoins to reduce the size of operand
relations to cut down communication costs when
overhead is not significant.
30
Layers of Query Processing
Calculus Query on Distributed Relations
QUERY GLOBAL
DECOMPOSITION SCHEMA
Algebra Query on Distributed Relations
CONTROL
DATA FRAGMENT
SITE LOCALIZATION SCHEMA
Fragment Query
GLOBAL
STATISTICS ON
OPTIMIZATION
FRAGMENTS
Optimized Fragment Query
With Communication Operations
LOCAL
SITE
LOCAL LOCAL
OPTIMIZATION SCHEMA
32
Step 1 - Query Decomposition
v Decompose calculus query into algebra query using
global conceptual schema information.
33
Step 1 - Query Decomposition (cont.)
1) Normalization
w The calculus query is written in a normalized form (CNF or
DNF) for subsequent manipulation
2) Analysis
w To reject normalized queries for which further processing
is either impossible or unnecessary (type incorrect or
semantically incorrect)
3) Simplification (elimination of redundancy)
w Redundant predicates are eliminated to obtain simplified
queries
4) Restructuring (rewriting)
w The calculus query is translated to optimal algebraic query
representation
w More than one translation is possible 34
1) Normalization
v Lexical and syntactic analysis
w check validity (similar to compilers)
w check for attributes and relations
w type checking on the qualification
v There are two possible forms of representing the predicates in query
qualification (the WHERE clause):
w Conjunctive Normal Form (CNF) or Disjunctive Normal Form (DNF)
v The conjunctive normal form is a conjunction ( predicate) of disjunctions (
predicates) as follows (where pij is a simple predicate):
– CNF: (p11 p12 ... p1n) ... (pm1 pm2 ... pmn)
– DNF: (p11 p12 ... p1n) ... (pm1 pm2 ... pmn)
– OR's mapped into union
– AND's mapped into join or selection
35
1) Normalization (cont.)
v The transformation of the quantifier-free predicate is
straightforward using the well-known equivalence rules for
1.
logical operations ( ):
2.
3.
4.
5.
6.
7.
8.
9.
10.
36
1) Normalization (cont.)
Example 7.1. Let us consider the following query on the engineering
database that we have been referring to:
“Find the names of employees who have been working on project P1 for 12
or 24 months”.
The query expressed in SQL is
SELECT ENAME
FROM EMP, ASG
WHERE EMP.ENO=ASG.ENO AND ASG.PNO=”P1”
AND (DUR=12 OR DUR=24)
The qualification in conjunctive normal form:
37
2) Analysis
v Objective
w reject type incorrect or semantically incorrect queries.
v Type incorrect
w if any of its attribute or relation names is not defined in
the global schema;
w if operations are applied to attributes of the wrong type
38
2) Analysis (cont.)
Example 7.2. The following SQL query on the engineering database is
type incorrect for two reasons. First, attribute E# is not declared in the
schema. Second, the operation “>200” is incompatible with the type string
of ENAME.
SELECT E# ! Undefined
attribute
FROM EMP
WHERE ENAME>200
! Type
mismatch
39
2) Analysis (cont.)
v Semantically incorrect
w Components do not contribute in any way to the
generation of the result
w For only those queries that do not use disjunction () or
negation (), semantic correctness can be determined by
using query graph
40
Query Graph
v Two kinds of nodes
w One node represents the result relation
w Other nodes represent operand relations
v Two types of edges
w an edge to represent a join if neither of its two nodes is
the result
w an edge to represent a projection if one of its node is
the result node
Nodes and edges may be labeled by
predicates for selection, projection, or join.
41
Query Graph Example
Example 7.3. Let us consider the following query:
“Find the names and responsibilities of programmers who have been working on the CAD/CAM project for more than 3 years.” The query
expressed in SQL is
42
Join Graph Example 7.3
A subgraph of query graph for join operation.
43
Tool of Analysis
v A conjunctive query without negation is
semantically incorrect if its query graph is NOT
connected!
44
Analysis Example
Example 7.4. Let us consider the following SQL query:
46
3) Simplification
(elimination of redundancy)
v Using well-known
idempotency rules to
eliminate redundant
predicates from
WHERE clause.
47
Simplification Example
The SQL query:
SELECT TITLE
FROM EMP
WHERE (NOT(TITLE=”Programmer)
AND (TITLE=“Programmer”
OR TITLE=“Electrical Eng.”)
AND NOT(TITLE=“Electrical Eng.”))
OR ENAME=“J.Doe”
can be simplified using the previous rules to become:
SELECT TITLE
FROM EMP
WHERE ENAME="J.Doe"
48
Simplification Example (cont.)
p1 = <TITLE = ``Programmer''>
p2 = <TITLE = ``Elec. Engr''>
p3 = <ENAME = ``J.Doe''>
v Converting a calculus query into relational algebra
w straightforward transformation from relational calculus to
relational algebra
w restructuring relational algebra expression to improve
performance
w making use of query trees
50
Relational Algebra Tree
v A tree defined by:
w a root node representing the query result
w leaves representing database relations
w non-leaf nodes representing relations produced by operations
w edges from leaves to root representing the sequences of operations
Example 7.6. The query “Find the names of employees other than J. Doe who worked
on the CAD/CAM project for either one or two years” whose SQL expression is:
SELECT ENAME
FROM PROJ, ASG, EMP
WHERE ASG.ENO = EMP.ENO
AND ASG.PNO = PROJ.PNO
AND ENAME != "J. Doe"
AND PROJ.PNAME = "CAD/CAM"
AND (DUR = 12 OR DUR = 24)
51
An SQL Query and Its Query Tree
52
How to translate an SQL query into an
algebra tree?
1. Create a leaf for every relation in the FROM
clause
2. Create the root as a project operation involving
attributes in the SELECT clause
3. Create the operation sequence by the predicates
and operators in the WHERE clause
53
Rewriting -- Transformation Rules (I)
v Commutativity of binary operations:
RS SR
R S S R
RS SR
v Associativity of binary operations:
(R S) T R ( S T )
(R S) TR (S T)
v Idempotence of unary operations: grouping of projections and selections, which is the property of
certain operations in mathematics and computer science, that can be applied multiple times without
changing the result beyond the initial application.
v Commuting projection with binary operations
C(R S) A(R) B (S) C = A B
C(R S) C(R) C (S)
C (R S) C (R) C (S) 55
How to use transformation rules to
optimize?
v These rules allow the separation of the unary
operations to simplify the query expression
v Unary operations on the same relation may be
grouped to access the same relation once
v Unary operations (select, project) may be commuted
with binary operations, so that they may be performed
first to reduce the size of intermediate relations
v Binary operations may be reordered. This rule is used
extensively in query optimization.
56
Optimization of Previous Query Tree
57
Step 2 : Data
Localization
v Task : To translate
an algebraic query
on global relation
into an algebraic
query expressed on
physical fragments.
v Localization uses
information stored
in the fragment
schema.
58
Data Localization-- An Example
EMP is fragmented into
ENAME EMP1 = ENO “E3” (EMP)
Dur=12 Dur=24 EMP2 = “E3” < ENO “E6” (EMP)
EMP3 = ENO >“E6” (EMP)
ENAME< >“J.DOE”
ASG is fragmented into
PNAME=“CAD/CAM” ASG1 = ENO “E3” (ASG)
ASG2 = ENO >“E3” (ASG)
PNO
ENO
PROJ
ENO=“E5”
ENO=“E5” ENO=“E5”
WHERE EMP.ENO=ASG.ENO
ENO
ASG1 ASG2 EMP1 EMP2 EMP3
ASG EMP
EMP1 ASG1 EMP1 ASG2 EMP2 ASG1 EMP2 ASG2 EMP3 ASG1 EMP3 ASG2
62
Reduction with Join for PHF (II)
ENO
SELECT ENAME
FROM EMP
EMP1 EMP2 EMP1
(a) Localized query (b) Reduced query 64
Reduction for DHF
Distribute joins over union
Apply the join reduction for horizontal fragmentation
ASG1 ASG2 EMP2
(b) Query after pushing selection down
ASG2
ASG1 EMP2 ASG1 EMP2 ASG1
ASG2 EMP2
(d) Reduced query after eliminating the left subtree (c) Query after moving unions up 66
Reduction for Hybrid Fragmentation
v Combines the rules already specified:
w Remove empty relations generated by contradicting
selection on horizontal fragments;
w Remove useless relations generated by projections on
vertical fragments;
w Distribute joins over unions in order to isolate and
remove useless joins.
67
Reduction for Hybrid Fragmentation -
Example
EMP1 = ENO“E4” (ENO,ENAME (EMP))
EMP2 = ENO>“E4” (ENO,ENAME (EMP))
ENAME
EMP3 = ENO,TITLE (EMP)
ENO=“E5”
QUERY: ENAME
SELECT ENAME
EMP2
FROM EMP ENO=“E5”
WHERE ENO = “E5”
(b) Reduced query
ENO
EMP1
ASG1 EMP2 EMP3
(a) Localized query 68
Question & Answer
69