
Advanced Database Systems (CoSc2072)

Chapter Two

QUERY PROCESSING & OPTIMIZATION


Query Processing and Optimization: Outline
 Query processing
 Operator Evaluation Strategies
 Selection
 Join
 Query Optimization
 Heuristic query optimization
 Cost-based query optimization
 Query Tuning

Overview of Query Processing
 Query processing: the activities involved in parsing, validating,
optimizing, and executing a query.

 Aims
 To transform a query written in a high-level language, typically SQL,
into a correct and efficient execution strategy expressed in a
low-level language (implementing the relational algebra), and
 To execute the strategy to retrieve the required data.

Steps of Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation

 The DBMS has algorithms to implement relational algebra expressions.
 SQL is a high-level language: it specifies what is wanted, not how it
is obtained.

Query optimization:
The activity of choosing an efficient execution strategy for
processing a query.
 Task: Find an efficient physical query plan (aka execution plan) for
an SQL query
Goal: Minimize the evaluation time for the query, i.e., compute
query result as fast as possible
Cost Factors: Disk accesses, read/write operations, [I/O, page
transfer] (CPU time is typically ignored)
Optimization: find the most efficient evaluation plan for a query, since
a query can usually be evaluated in more than one way.

Examples:

Find all Managers who work at a London branch.
SELECT * FROM Staff s, Branch b WHERE s.branchNo =
b.branchNo AND (s.position = 'Manager' AND b.city = 'London');

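The equivalent relational algebra queries corresponding to this SQL
statement are:

(1) σ_{(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo)}(Staff × Branch)
(2) σ_{(position='Manager') ∧ (city='London')}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch)
(3) (σ_{position='Manager'}(Staff)) ⋈_{Staff.branchNo=Branch.branchNo} (σ_{city='London'}(Branch))
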
Different Strategies

Cost Comparison
 Costs (in disk accesses) of the three strategies are:

(1) (1000 + 50) + 2*(1000 * 50) = 101 050

(2) 2*1000 + (1000 + 50) = 3 050

(3) 1000 + 2*50 + 5 + (50 + 5) = 1 160

The third option significantly reduces the size of the relations being
joined together.

Cartesian product and join operations are much more expensive than
selection.
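
The three cost figures above can be reproduced with simple arithmetic. The sketch below assumes the cardinalities that the formulas imply (1000 Staff tuples, 50 Branch tuples, 50 Managers, 5 London branches), one disk access per tuple, and intermediate results that are written to disk and read back. The variable names and the reading of each term are illustrative assumptions, not part of the original slides.

```python
# Assumed cardinalities, inferred from the cost formulas above.
n_staff = 1000     # tuples in Staff
n_branch = 50      # tuples in Branch
n_managers = 50    # Staff tuples with position = 'Manager'
n_london = 5       # Branch tuples with city = 'London'

# (1) Selection over the Cartesian product: read both relations,
#     then write the product and read it back for the selection.
cost_1 = (n_staff + n_branch) + 2 * (n_staff * n_branch)

# (2) Join first, then select: read both relations, then write the
#     join result (one tuple per staff member) and read it back.
cost_2 = 2 * n_staff + (n_staff + n_branch)

# (3) Select on each relation first, then join the small results:
#     read Staff, write Managers, read Branch, write London branches,
#     then read both intermediate results for the join.
cost_3 = n_staff + n_managers + n_branch + n_london + (n_managers + n_london)

for label, cost in [("(1)", cost_1), ("(2)", cost_2), ("(3)", cost_3)]:
    print(label, cost)
```

Strategy (3) is cheapest because the selections shrink both inputs before the join is performed.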

Phases of query processing

 Query processing has four main phases:
1. Decomposition.
• Analysis.
• Normalization.
• Semantic Analysis.
• Simplification.
• Restructuring.
2. Optimization.
• Heuristics.
• Comparing costs.
3. Code Generation.
4. Execution.

 Query Decomposition
 Transform the high-level query into a relational algebra (RA) query.
 Check that the query is syntactically and semantically correct.
 Typical stages are:
 Analysis,
 Normalization,
 Semantic analysis,
 Simplification,
 Query restructuring.

 Analysis
 Analyze query lexically and syntactically using compiler
techniques.
 Verify relations and attributes exist.
 Verify operations are appropriate for object type.
Example
SELECT staff_no FROM Staff WHERE position > 10;

 This query would be rejected on two grounds:
 staff_no is not defined for the Staff relation (it should be staffNo).
 The comparison '> 10' is incompatible with the type of position, which
is a variable-length character string.
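
The two checks in this example can be sketched in a few lines. The catalog layout, the `analyse` function, and the attribute types below are hypothetical and chosen only for illustration; a real DBMS performs these checks against its own system catalog.

```python
# Hypothetical catalog: relation name -> {attribute name: allowed Python type(s)}.
CATALOG = {
    "Staff": {"staffNo": str, "position": str, "salary": (int, float)},
}

def analyse(relation, select_attrs, predicates):
    """Return a list of error messages; an empty list means the query passes."""
    schema = CATALOG.get(relation)
    if schema is None:
        return [f"relation {relation} does not exist"]
    errors = []
    # Every selected attribute must exist in the relation.
    for attr in select_attrs:
        if attr not in schema:
            errors.append(f"{attr} is not defined for {relation}")
    # Every predicate (attribute, operator, constant) must be type-compatible.
    for attr, op, const in predicates:
        if attr not in schema:
            errors.append(f"{attr} is not defined for {relation}")
        elif not isinstance(const, schema[attr]):
            errors.append(f"'{op} {const}' is incompatible with the type of {attr}")
    return errors

# SELECT staff_no FROM Staff WHERE position > 10;
for message in analyse("Staff", ["staff_no"], [("position", ">", 10)]):
    print(message)
```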

Analysis
 Finally, the query is transformed into a query tree, constructed as follows:
 A leaf node is created for each base relation.
 A non-leaf node is created for each intermediate relation produced by
an RA operation.
 The root of the tree represents the query result.
 The sequence of operations is directed from the leaves to the root.
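
A minimal sketch of such a query tree, assuming a simple `Node` class (hypothetical, not from the slides). A post-order walk visits the leaves first and the root last, which matches the leaves-to-root direction described above.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                   # relation name or RA operation
    children: list = field(default_factory=list)

def evaluation_order(node):
    """Post-order walk: inputs (children) come before the operation using them."""
    order = []
    for child in node.children:
        order.extend(evaluation_order(child))
    order.append(node.label)
    return order

# Hypothetical tree: project staffNo (select position='Manager' (Staff join Branch))
tree = Node("project staffNo", [
    Node("select position='Manager'", [
        Node("join on branchNo", [Node("Staff"), Node("Branch")]),
    ]),
])

print(evaluation_order(tree))
```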

Normalization
 Converts query into a normalized form for easier manipulation.

 Predicate can be converted into one of two forms:

 Conjunctive normal form:

(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')

 Disjunctive normal form:

(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
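
The conjunctive and disjunctive normal forms above describe the same predicate. The sketch below checks this by brute force over all truth assignments. The three boolean parameters stand for the three atomic conditions; their names are illustrative.

```python
from itertools import product

def cnf(is_manager, high_salary, in_b003):
    # (position = 'Manager' OR salary > 20000) AND (branchNo = 'B003')
    return (is_manager or high_salary) and in_b003

def dnf(is_manager, high_salary, in_b003):
    # (position = 'Manager' AND branchNo = 'B003')
    #   OR (salary > 20000 AND branchNo = 'B003')
    return (is_manager and in_b003) or (high_salary and in_b003)

# Compare both forms on all 8 combinations of the atomic conditions.
equivalent = all(cnf(*v) == dnf(*v) for v in product([False, True], repeat=3))
print(equivalent)
```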

Semantic Analysis
 Rejects normalized queries that are incorrectly formulated or
contradictory.
 Query is incorrectly formulated if components do not contribute
to generation of result.
 Query is contradictory if its predicate cannot be satisfied by any
tuple.
 Algorithms to determine correctness exist only for queries that
do not contain disjunction and negation.

Semantically incorrect
 Components do not contribute in any way to the
generation of the result
 Only a subset of relational calculus queries can be tested
for correctness
● Those that do not contain disjunction and negation
● To detect
➠ connection graph (query graph)
➠ join graph

Relation connection graph
a. Create node for each relation and node for result.
b. Create edges between two nodes that represent a join.
c. Create edges between nodes that represent projection.
 If not connected, query is incorrectly formulated.

Example: SELECT p.propertyNo, p.street FROM Client c, Viewing v,
PropertyForRent p WHERE c.clientNo = v.clientNo AND c.maxRent >= 500
AND c.prefType = 'Flat' AND p.ownerNo = 'CO93';

 The relation connection graph is not fully connected, so the query is
not correctly formulated.
 The join condition (v.propertyNo = p.propertyNo) has been omitted.
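
The connectivity test for this example can be sketched as a graph traversal. The `is_connected` helper is an illustrative assumption; the nodes and edges follow steps a to c above for the Client, Viewing, PropertyForRent query.

```python
def is_connected(nodes, edges):
    """Return True if every node is reachable from the first node."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = set(), [nodes[0]]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return seen == set(nodes)

nodes = ["Client", "Viewing", "PropertyForRent", "result"]
edges = [
    ("Client", "Viewing"),           # join: c.clientNo = v.clientNo
    ("PropertyForRent", "result"),   # projection: p.propertyNo, p.street
]
print(is_connected(nodes, edges))    # query as written

# Adding the omitted join condition v.propertyNo = p.propertyNo
fixed = edges + [("Viewing", "PropertyForRent")]
print(is_connected(nodes, fixed))
```

The first call reports a disconnected graph; adding the missing join edge connects it.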
Example 2
SELECT Ename, Resp FROM Emp, Works, Project WHERE
Emp.Eno = Works.Eno AND Works.Pno = Project.Pno AND
Pname = 'CAD/CAM' AND Dur > 36 AND Title = 'Programmer'

If the query graph is connected, the query is semantically correct.

Simplification
1. Detects redundant qualifications,
2. Eliminates common sub-expressions,
3. Transforms the query to a semantically equivalent but more easily and
efficiently computed form.

 Apply well-known transformation rules of Boolean algebra.

Example
 SELECT TITLE FROM E WHERE (NOT (TITLE = 'Programmer') AND
(TITLE = 'Programmer' OR TITLE = 'Electrical Eng.') AND NOT
(TITLE = 'Electrical Eng.')) OR ENAME = 'J.Doe';

is equivalent to

 SELECT TITLE FROM E WHERE ENAME = 'J.Doe';
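
The equivalence can be checked by evaluating both predicates on sample values. In the sketch below, "Analyst" and "A.Lee" are made-up values added so that the non-matching cases are covered as well.

```python
from itertools import product

def original(title, ename):
    # The WHERE clause of the first query.
    return (
        not (title == "Programmer")
        and (title == "Programmer" or title == "Electrical Eng.")
        and not (title == "Electrical Eng.")
    ) or ename == "J.Doe"

def simplified(title, ename):
    # The WHERE clause of the second query.
    return ename == "J.Doe"

titles = ["Programmer", "Electrical Eng.", "Analyst"]   # "Analyst" is made up
names = ["J.Doe", "A.Lee"]                              # "A.Lee" is made up

same = all(original(t, e) == simplified(t, e) for t, e in product(titles, names))
print(same)
```

The conjunction on TITLE can never be true: it requires TITLE to be one of two values while excluding both. Only the ENAME condition remains.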

Restructuring
 Convert SQL to relational algebra
 Make use of query trees
 Example: SELECT Ename FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno
AND Ename <> 'J. Doe' AND Pname = 'CAD/CAM'
AND (Dur = 12 OR Dur = 24)

 Query tree:
 A tree data structure that corresponds to a relational algebra
expression.
 It represents the input relations of the query as leaf nodes of the tree,
and represents the relational algebra operations as internal nodes.
 Query graph:
 A graph data structure that corresponds to a relational calculus
expression.
 It does not indicate an order on which operations to perform first.
 There is only a single graph corresponding to each query.

Transformation Rules for RA Operations
1. Conjunctive Selection operations can cascade into individual
Selection operations (and vice versa).
   σ_{p ∧ q ∧ r}(R) = σ_p(σ_q(σ_r(R)))

 Sometimes referred to as cascade of Selection.

2. Commutativity of Selection.
   σ_p(σ_q(R)) = σ_q(σ_p(R))

Con…
3. In a sequence of Projection operations, only the last in the
sequence is required.
   π_L(π_M(... π_N(R) ...)) = π_L(R)

4. Commutativity of Selection and Projection.
If predicate p involves only attributes in the projection list, Selection
and Projection operations commute:
   π_{A1, ..., Am}(σ_p(R)) = σ_p(π_{A1, ..., Am}(R))

Con…
5. Commutativity of Theta join (and Cartesian product).
   R ⋈_p S = S ⋈_p R
   R × S = S × R

Rule also applies to Equijoin and Natural join.


Example:

6. Commutativity of Selection and Theta join (or Cartesian product)
 If the selection predicate involves only attributes of one of the join
relations, Selection and Join (or Cartesian product) operations
commute:
   σ_p(R ⋈_r S) = (σ_p(R)) ⋈_r S
   σ_p(R × S) = (σ_p(R)) × S
   where p involves only attributes of R.

 If the selection predicate is a conjunctive predicate of the form (p ∧ q),
where p involves only attributes of R, and q only attributes of S,
Selection and Theta join operations commute as:
   σ_{p ∧ q}(R ⋈_r S) = (σ_p(R)) ⋈_r (σ_q(S))
   σ_{p ∧ q}(R × S) = (σ_p(R)) × (σ_q(S))

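7. Commutativity of Projection and Theta join (or Cartesian product)
 If the projection list has the form L = L1 ∪ L2, where L1 involves only
attributes of R and L2 only attributes of S, and the join condition
contains only attributes of L, Projection and Theta join commute as:
   π_{L1 ∪ L2}(R ⋈_r S) = (π_{L1}(R)) ⋈_r (π_{L2}(S))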

8. Commutativity of Union and Intersection (but not Set difference).
   R ∪ S = S ∪ R
   R ∩ S = S ∩ R

9. Commutativity of Selection and set operations (Union,
Intersection, and Set difference).
   σ_p(R ∪ S) = σ_p(S) ∪ σ_p(R)
   σ_p(R ∩ S) = σ_p(S) ∩ σ_p(R)
   σ_p(R - S) = σ_p(R) - σ_p(S)

10. Commutativity of Projection and Union.
   π_L(R ∪ S) = π_L(S) ∪ π_L(R)

11. Associativity of Union and Intersection (but not Set difference).
   (R ∪ S) ∪ T = S ∪ (R ∪ T)
   (R ∩ S) ∩ T = S ∩ (R ∩ T)

12. Associativity of Theta join (and Cartesian product).
   (R ⋈ S) ⋈ T = R ⋈ (S ⋈ T)
   (R × S) × T = R × (S × T)

 Cartesian product and Natural join are always associative.
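
The set-operation rules (8 to 11) can be checked on small examples. The sketch below models relations as Python sets of tuples; the data, the predicate `p`, and the helper names are made up for illustration. Note the operand order in the set difference case: the selection is applied to R and to S, and the results are subtracted in the same order.

```python
# Toy relations as sets of (id, value) tuples; the data is made up.
R = {(1, 10), (2, 20), (3, 30)}
S = {(2, 20), (3, 30), (4, 40)}
T = {(3, 30), (5, 50)}

def select(pred, rel):
    """Selection: keep the tuples of rel that satisfy pred."""
    return {t for t in rel if pred(t)}

def project(rel):
    """Projection onto the first attribute (duplicates vanish in a set)."""
    return {(t[0],) for t in rel}

def p(t):
    return t[1] >= 20

# Rule 8: union and intersection commute; set difference does not.
assert R | S == S | R and R & S == S & R
assert R - S != S - R

# Rule 9: selection distributes over union, intersection and difference.
assert select(p, R | S) == select(p, R) | select(p, S)
assert select(p, R & S) == select(p, R) & select(p, S)
assert select(p, R - S) == select(p, R) - select(p, S)

# Rule 10: projection distributes over union.
assert project(R | S) == project(R) | project(S)

# Rule 11: union and intersection are associative.
assert (R | S) | T == R | (S | T)
assert (R & S) & T == R & (S & T)
print("all rules hold on the toy data")
```

Passing on one data set does not prove the rules; it only illustrates them.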

2. Query Optimization
Optimization – not necessarily “optimal”, but reasonably
efficient

Techniques:

Heuristic rules

 Query tree (relational algebra) optimization

 Query graph optimization

Cost-based (physical) optimization

 Cost estimation (comparing costs of different plans)

a. Heuristic based Processing Strategies
► Perform Selection operations as early as possible.
► Keep predicates on the same relation together.
► Combine a Cartesian product with a subsequent Selection whose predicate
represents a join condition into a Join operation.
► Use associativity of binary operations to rearrange leaf nodes so that
the leaf nodes with the most restrictive Selection operations are
executed first.
► Perform Projection as early as possible.
► Keep projection attributes on the same relation together.
► Compute common expressions once.
► If a common expression appears more than once, and the result is not too
large, store the result and reuse it when required.

Examples
 What are the names of customers living on Elm Street who have
checked out "Terminator"?
 SQL query:
SELECT Name FROM Customer CU, CheckedOut CH, Film F WHERE
Title = 'Terminator' AND F.FilmID = CH.FilmID AND CU.CustomerID =
CH.CustomerID AND CU.Street = 'Elm'

Apply Selections Early
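
A small sketch of why early selection helps, using the Terminator query. The rows below are made-up toy data, and the two functions are illustrative: one builds the full Cartesian product before filtering, the other filters the base relations first. Both return the same names, but the number of intermediate tuples differs sharply.

```python
# Made-up toy data; attribute names follow the SQL query.
customers = [
    {"CustomerID": 1, "Name": "Ann", "Street": "Elm"},
    {"CustomerID": 2, "Name": "Bob", "Street": "Oak"},
    {"CustomerID": 3, "Name": "Cy", "Street": "Elm"},
]
films = [{"FilmID": 10, "Title": "Terminator"}, {"FilmID": 11, "Title": "Alien"}]
checked_out = [
    {"CustomerID": 1, "FilmID": 10},
    {"CustomerID": 2, "FilmID": 10},
    {"CustomerID": 3, "FilmID": 11},
]

def late_selection():
    """Cartesian product first, then one big selection."""
    prod = [(cu, ch, f) for cu in customers for ch in checked_out for f in films]
    names = [cu["Name"] for cu, ch, f in prod
             if f["Title"] == "Terminator" and f["FilmID"] == ch["FilmID"]
             and cu["CustomerID"] == ch["CustomerID"] and cu["Street"] == "Elm"]
    return names, len(prod)

def early_selection():
    """Selections on the base relations first, then the joins."""
    elm = [cu for cu in customers if cu["Street"] == "Elm"]
    terminator = [f for f in films if f["Title"] == "Terminator"]
    joined = [(cu, ch, f) for cu in elm for ch in checked_out for f in terminator
              if cu["CustomerID"] == ch["CustomerID"] and f["FilmID"] == ch["FilmID"]]
    return [cu["Name"] for cu, ch, f in joined], len(joined)

print(late_selection())    # names, size of the intermediate product
print(early_selection())   # names, size of the joined intermediate result
```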

Apply More Restrictive Selections Early

Form Joins

Apply Projections Early

Cost-Based Optimization
 Statistics on the inputs to each operator are needed.
 Statistics on leaf relations are stored in the system catalog.
 Statistics on intermediate relations must be estimated; most
important is the relations' cardinalities.
 Cost formulas estimate the cost of executing each operation in each
candidate query tree.
 Cost can be CPU time, I/O time, communication time, main
memory usage, or a combination.
 The candidate query tree with the least total cost is selected for execution.

Example: Cost Estimation

Operation 3: σ followed by a π

Measures of Query Cost
 There are many possible ways to estimate cost, e.g., based on
disk accesses, CPU time, or communication overhead.
 Disk access is the predominant cost (in terms of time) and is relatively
easy to estimate; therefore, the number of block transfers from/to disk
is typically used as the measure.
 Simplifying assumption: each block transfer has the same cost.
 The cost of an algorithm (e.g., for join or selection) depends on the
database buffer size; more memory for the DB buffer reduces disk accesses.
 Thus the DB buffer size is a parameter for estimating cost.
 We refer to the cost estimate of algorithm S as cost(S).
 We do not consider the cost of writing output to disk.

Selectivity and Cost Estimates in Query Optimization
 Catalog Information Used in Cost Functions
 Information about the size of a file
 number of records (tuples) (r),
 record size (R),
 number of blocks (b)
 blocking factor (bfr)
 Information about indexes and indexing attributes of a file
 Number of levels (x) of each multilevel index
 Number of first-level index blocks (bI1)
 Number of distinct values (d) of an attribute
 Selectivity (sl) of an attribute
 Selection cardinality (s) of an attribute. (s = sl * r)
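
The catalog quantities are related by simple formulas. The sketch below assumes unspanned records (a record never crosses a block boundary) and, for an equality condition, a uniform distribution so that sl = 1/d. The function names and the sample numbers are hypothetical.

```python
import math

def file_stats(r, record_size, block_size):
    """Return (bfr, b): blocking factor and number of blocks of the file."""
    bfr = block_size // record_size    # whole records that fit in one block
    b = math.ceil(r / bfr)             # blocks needed to hold r records
    return bfr, b

def selection_cardinality(r, d):
    """s = sl * r with sl = 1/d (equality condition, uniform distribution)."""
    return r / d

# Hypothetical numbers, for illustration only.
bfr, b = file_stats(r=10_000, record_size=200, block_size=4_000)
s = selection_cardinality(r=10_000, d=50)
print(bfr, b, s)
```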
Selection Operation

σA=a(R), where a is a constant value and A is an attribute of R

File scan: search algorithms that locate and retrieve records
that satisfy a selection condition

S1 - Linear search
cost(S1) = bR

S2 - Binary search, i.e., the file is ordered on attribute A
(primary index)

Con…

Cost of Operations

 Cost = I/O cost + CPU cost


 I/O cost: # pages (reads & writes) or # operations (multiple pages)

 CPU cost: # comparisons or # tuples processed

 I/O cost dominates (for large databases)

 Cost depends on
 Types of query conditions

 Availability of fast access paths

 DBMSs keep statistics for cost estimation


Notations

 Used to describe the cost of operations.


 Relations: R, S

 nR: # tuples in R, nS: # tuples in S

 bR: # pages in R

 dist(R.A) : # distinct values in R.A

 min(R.A) : smallest value in R.A

 max(R.A) : largest value in R.A

 HI: # index pages accessed (B+ tree height?)

Simple Selection
 Simple selection: σ_{A op a}(R)
 A is a single attribute, a is a constant, op is one of =, ≠, <, ≤, >, ≥.
 We do not discuss ≠ further because it requires a sequential scan of the
table.
How many tuples will be selected?
 Selectivity Factor (SF_{A op a}(R)): fraction of tuples of R satisfying
"A op a"
 0 ≤ SF_{A op a}(R) ≤ 1
# tuples selected: NS = nR × SF_{A op a}(R)

Options of Simple Selection
Sequential (linear) Scan
 General condition: cost = bR
 Equality on key: average cost = bR / 2
Binary Search
 Records are stored in sorted order
 Equality on key: cost = log2(bR)
 Equality on non-key (duplicates allowed)
cost = log2(bR) + NS/bfR - 1
= sorted search time + selected – first one

Example: Cost of Selection
Relation: R(A, B, C)
nR = 10000 tuples
bfR = 20 tuples/page
dist(A) = 50, dist(B) = 500
B+ tree clustering index on A with order 25 (p=25)
B+ tree secondary index on B w/ order 25
Query:
 select * from R where A = a1 and B = b1
Relational Algebra: σ_{A=a1 ∧ B=b1}(R)

Example: Cost of Selection (cont.)
Option 1: Sequential Scan
 Have to go through the entire relation
 Cost = bR = 10000/20 = 500
Option 2: Binary Search using A = a1
 It is sorted on A (why?)
 NS = 10000/50 = 200
 assuming a uniform distribution
 Cost = ⌈log2(bR)⌉ + NS/bfR - 1
= ⌈log2(500)⌉ + 200/20 - 1 = 9 + 10 - 1 = 18
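
The two options can be recomputed with a short script. The function names are illustrative, and rounding log2 up to a whole number of pages is an assumption, chosen because it reproduces the result of 18 in the example.

```python
import math

def sequential_scan_cost(b_r, equality_on_key=False):
    """Pages read by a linear scan; half the file on average for a key match."""
    return b_r / 2 if equality_on_key else b_r

def binary_search_cost(b_r, n_selected=1, bf_r=1):
    """Pages to find the first match, plus the remaining pages of matches."""
    return math.ceil(math.log2(b_r)) + math.ceil(n_selected / bf_r) - 1

n_r, bf_r, dist_a = 10_000, 20, 50
b_r = math.ceil(n_r / bf_r)     # pages in R
n_s = n_r // dist_a             # matching tuples, assuming a uniform distribution

print(sequential_scan_cost(b_r))             # option 1
print(binary_search_cost(b_r, n_s, bf_r))    # option 2
```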

Cost of Join

Cost = # I/O reading R & S + # I/O writing result
Additional notation:
 M: # buffer pages available to join operation
 LB: # leaf blocks in B+ tree index
Limitation of cost estimation
 Ignoring CPU costs
 Ignoring timing
 Ignoring double buffering requirements

Estimate Size of Join Result

How many tuples in join result?


 Cross product (special case of join)
NJ = nR × nS
 R.A is a foreign key referencing S.B
NJ = nR (assume no null value)
 S.B is a foreign key referencing R.A
NJ = nS (assume no null value)
 Both R.A & S.B are non-key

NJ = min( (nR × nS) / dist(R.A), (nR × nS) / dist(S.B) )
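
The four cases can be collected in one small estimator. The function signature and the `fk` flag are hypothetical conveniences; the formulas are the ones listed above, and the sample numbers are made up.

```python
def join_size(n_r, n_s, dist_ra=None, dist_sb=None, fk=None):
    """Estimate NJ, the number of tuples in the join of R and S on R.A = S.B.

    fk="R": R.A is a foreign key referencing S.B  -> nR
    fk="S": S.B is a foreign key referencing R.A  -> nS
    fk=None with both dist values: both attributes are non-key
    fk=None without dist values: cross product
    """
    if fk == "R":
        return n_r
    if fk == "S":
        return n_s
    if dist_ra is None or dist_sb is None:
        return n_r * n_s
    return min(n_r * n_s / dist_ra, n_r * n_s / dist_sb)

# Hypothetical sizes, for illustration only.
print(join_size(1000, 50))                            # cross product
print(join_size(1000, 50, fk="R"))                    # foreign key in R
print(join_size(1000, 200, dist_ra=50, dist_sb=100))  # both non-key
```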
Estimate Size of Join Result (cont.)
How wide is a tuple in join result?
 Natural join: W = W(R) + W(S) - W(S ∩ R)
 Theta join: W = W(R) + W(S)
What is blocking factor of join result?
 bfJoin = block size / W
How many blocks does join result have?
 bJoin = NJ / bfJoin

Query Execution Plans
 An execution plan for a relational algebra query consists of a
combination of the relational algebra query tree and information
about the access methods to be used for each relation as well as
the methods to be used in computing the relational operators
stored in the tree.
 Materialized evaluation: the result of an operation is stored as a
temporary relation.
 Pipelined evaluation: as the result of an operator is produced, it
is forwarded to the next operator in sequence
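
The difference between the two evaluation styles can be sketched with Python generators, which pass each tuple on as soon as it is produced. The operator functions and the Staff tuples below are made up for illustration; real executors use a similar iterator interface, but this is not any particular DBMS's API.

```python
def scan(rows):
    """Leaf operator: produce the tuples of a stored relation one by one."""
    for row in rows:
        yield row

def select(pred, rows):
    """Selection: forward each qualifying tuple as soon as it is seen."""
    for row in rows:
        if pred(row):
            yield row

def project(attrs, rows):
    """Projection: keep only the listed attributes of each tuple."""
    for row in rows:
        yield {a: row[a] for a in attrs}

staff = [  # made-up tuples
    {"staffNo": "S1", "position": "Manager", "salary": 30000},
    {"staffNo": "S2", "position": "Assistant", "salary": 12000},
]

def is_manager(row):
    return row["position"] == "Manager"

# Materialized: the intermediate result is stored as a temporary relation.
temp = list(select(is_manager, scan(staff)))
materialized = list(project(["staffNo"], temp))

# Pipelined: operators are chained and no temporary relation is built.
pipelined = list(project(["staffNo"], select(is_manager, scan(staff))))

print(materialized == pipelined)
```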

Query Tuning
 Monitoring or revising a query to increase throughput or to lower
response time for time-critical applications.
 Having to tune queries is a fact of life.
 Query tuning has a localized effect and is thus relatively
attractive.
 It is a time-consuming and specialized task.
 It makes the queries harder to understand.
 However, it is often a necessity.
 This is not likely to change any time soon.

Assignment one
 Using the heuristic algorithm, optimize the following SQL query.
SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'AQUARIUS' AND
PNUMBER = PNO AND ESSN = SSN AND
BDATE > '1957-12-31';
