SOEN 363 - Data Systems For Software Engineers: Query Optimization

SOEN 363 - Data Systems
for Software Engineers

Lecture 16: Query Optimization
Essam Mansour
Today…
 Last Session:
 Query Optimization
 Today’s Session:
 Query Optimization
DBMS Layers
Queries
Query Optimization
and Execution
Relational Operators
Transaction Files and Access Methods
Manager
Recovery
Buffer Management Manager
Lock
Manager Disk Space Management
DB
Outline
A Brief Primer on Query Optimization
Evaluating Query Plans 

Relational Algebra Equivalences
Estimating Plan Costs
Enumerating Plans
Query Evaluation Plans
 A query evaluation plan (or simply a plan) consists of an
extended relational algebra tree (or simply a tree)
 A plan tree consists of annotations at each node indicating:

 The access methods to use for each relation
 The implementation method to use for each operator
 Consider the following SQL query Q:

SELECT S.sname What is the
FROM Reserves R, Sailors S corresponding
WHERE R.sid=S.sid AND RA of Q?
R.bid=100 AND S.rating>5
Query Evaluation Plans (Cont’d)
 Q can be expressed in relational algebra as follows:
 sname( (Re serves  Sailors)
bid 100 rating  5 sid  sid
A RA Tree: An Extended RA Tree:

(On-the-fly)
sname
sname
bid=100 rating > 5 (On-the-fly)

bid=100 rating > 5
(Simple Nested Loops)

sid=sid sid=sid
Reserves Sailors Sailors (File Scan)

(File Scan) Reserves
Pipelining vs. Materializing
 When a query is composed of several operators, the
result of one operator can sometimes be pipelined to
another operator Applied on-the-fly
(On-the-fly)
sname
Pipeline the output of the join into the

selection and projection that follow

sid=sid
(File Scan) Reserves Sailors (File Scan)

Pipelining vs. Materializing
 When a query is composed of several operators, the
result of one operator can sometimes be pipelined to
another operator Applied on-the-fly
(On-the-fly)
sname
Pipeline the output of the join into the

selection and projection that follow
In contrast, a temporary table can be materialized (Simple Nested Loops)

to hold the intermediate result of the join and read sid=sid
back by the selection operation!
Pipelining can significantly save I/O cost!

The I/O Cost of the Q Plan
 What is the I/O cost of the following evaluation plan?
(On-the-fly)
sname

sid=sid
 The cost of the join is 1000 + 1000 * 500 = 501,000 I/Os (assuming page-oriented
Simple NL join)
 The selection and projection are done on-the-fly; hence, do not incur additional I/Os
Pushing Selections
 How can we reduce the cost of a join?
 By reducing the sizes of the input relations!
sname
bid=100 rating > 5
Involves bid in Reserves; Involves rating in Sailors;

hence, “push” ahead of the join! hence, “push” ahead of the join!
sid=sid
Reserves Sailors
Pushing Selections
(On-the-fly) (On-the-fly)
sname sname
(Sort-Merge Join)
bid=100 rating > 5 (On-the-fly) sid=sid
(Scan; (Scan;
write to bid=100 rating > 5 write to
(Simple Nested Loops) temp T1) temp T2)
sid=sid
Reserves Sailors
Reserves Sailors
(File Scan) (File Scan)
The I/O Cost of the New Q Plan
(On-the-fly)
sname
(Sort-Merge Join)
sid=sid
(Scan; (Scan;
temp T1) temp T2)
Reserves Sailors
Cost of Scanning Reserves = 1000 I/Os Cost of Scanning Sailors = 500 I/Os
Cost of Writing T1 = 10* I/Os (later) Cost of Writing T2 = 250* I/Os (later)
*Assuming 100 boats and uniform distribution of reservations across boats.

*Assuming 10 ratings and uniform distribution over ratings.
The I/O Costs of the Two Q Plans
(On-the-fly)
(On-the-fly) sname
sname
(Sort-Merge Join)
bid=100 rating > 5 (On-the-fly) sid=sid
(Scan; (Scan;
temp T1) temp T2)
sid=sid Reserves Sailors
Reserves Sailors
Total Cost = 501, 000 I/Os Total Cost = 4060 I/Os

Pushing Projections
 Consider (again) the following plan:

 What are the attributes required
sname
from T1 and T2?
 Sid from T1
 Sid and sname from T2
sid=sid
Hence, as we scan Reserves and

(Scan; (Scan;
write to bid=100 rating > 5 write to Sailors we can also remove
temp T1) temp T2) unwanted columns (i.e., “Push” the
Reserves Sailors
projections ahead of the join)!
Pushing Projections
 Consider (again) the following plan:

sname
“Push” ahead
the join
The cost after applying
sid=sid this heuristic can become
2000 I/Os (as opposed to
(Scan; (Scan;
write to bid=100 rating > 5 write to 4060 I/Os with only
temp T1) temp T2)
pushing the selection)!
Reserves Sailors
Using Indexes
 What if indexes are available on Reserves and Sailors?
(On-the-fly)
sname
rating > 5 (On-the-fly)
(Index Nested Loops,

sid=sid with pipelining )
(Use hash
index; do
not write bid=100 Sailors (Hash index on sid)
result to
temp)
(Clustered hash index on bid) Reserves
 With clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples (assuming 100
boats and uniform distribution of reservations across boats)
 Since the index is clustered, the 1000 tuples appear consecutively within the same
bucket; thus # of pages = 1000/100 = 10 pages
Using Indexes
(On-the-fly)
sname

(Use hash
index; do
result to
temp)
 For each selected Reserves tuple, we can retrieve matching Sailors tuples using the hash
index on the sid field
 Selected Reserves tuples need not be materialized and the join result can be pipelined!
 For each tuple in the join result, we apply rating > 5 and the projection of sname on-the-fly
Using Indexes
(On-the-fly)
sname
Is it necessary to project out

unwanted columns?
NO, since selection results

are NOT materialized (Index Nested Loops,
(Use hash
index; do
result to
temp)

Using Indexes
(On-the-fly)
sname
Does the hash index on sid

rating > 5 (On-the-fly) need to be clustered?
NO, since there is at most

sid=sid with pipelining ) 1 matching Sailors tuple
per a Reserves tuple! Why?
(Use hash
index; do
result to
temp)

Using Indexes
(On-the-fly)
sname

(Use hash Cost = 1.2 I/Os (if

index; do
not write bid=100 Sailors (Hash index on sid) A(1)) or 2.2 (if A(2))
result to
temp)

Using Indexes
(On-the-fly)
sname
Why not pushing this selection
ahead of the join?
It would require a scan on Sailors!

(Use hash
index; do
result to
temp)

The I/O Cost of the New Q Plan
(On-the-fly)
sname
10 I/Os sid=sid
with pipelining )
(Use hash
index; do
Cost = 1.2 I/Os for
result to
temp) 1000 Reserves
tuples; hence,
1200 I/Os
Total Cost = 10 + 1200 = 1210 I/Os

Comparing I/O Costs: Recap
(On-the-fly) (On-the-fly)
sname (On-the-fly) sname sname

bid=100 rating > 5 (On-the-fly) (Sort-Merge Join)
sid=sid
(Index Nested
Loops,
(Scan; (Scan; sid=sid with pipelining )
(Simple Nested temp T1) temp T2) (Hash
Loops) index)
sid=sid
Reserves Sailors bid=100 Sailors (Hash
index
on sid)
Reserves Sailors Reserves
Total Cost = 501, 000 I/Os Total Cost = 4060 I/Os Total Cost = 1210 I/Os
But, How Can we Ensure Correctness?
sname sname
Canonical form
rating > 5
bid=100 rating > 5
sid=sid
sid=sid
bid=100 Sailors
Reserves Sailors Reserves
Still the same result!
How can this be guaranteed?

Outline
A Brief Primer on Query Optimization
Evaluating Query Plans
Relational Algebra Equivalences


Estimating Plan Costs
Enumerating Plans

SOEN 363 - Data Systems For Software Engineers: Query Optimization

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

SOEN 363 - Data Systems For Software Engineers: Query Optimization

Uploaded by

Copyright:

Available Formats

SOEN 363 - Data Systems

for Software Engineers

Evaluating Query Plans 

Estimating Plan Costs

 A plan tree consists of annotations at each node indicating:

 Consider the following SQL query Q:

A RA Tree: An Extended RA Tree:

bid=100 rating > 5 (On-the-fly)

(Simple Nested Loops)

Reserves Sailors Sailors (File Scan)

Pipeline the output of the join into the

(Simple Nested Loops)

(File Scan) Reserves Sailors (File Scan)

Pipeline the output of the join into the

In contrast, a temporary table can be materialized (Simple Nested Loops)

Pipelining can significantly save I/O cost!

bid=100 rating > 5 (On-the-fly)

(Simple Nested Loops)

(File Scan) Reserves Sailors (File Scan)

bid=100 rating > 5

Involves bid in Reserves; Involves rating in Sailors;

*Assuming 100 boats and uniform distribution of reservations across boats.

Total Cost = 501, 000 I/Os Total Cost = 4060 I/Os

 Consider (again) the following plan:

Hence, as we scan Reserves and

 Consider (again) the following plan:

rating > 5 (On-the-fly)

(Index Nested Loops,

(Clustered hash index on bid) Reserves

rating > 5 (On-the-fly)

(Index Nested Loops,

(Clustered hash index on bid) Reserves

Is it necessary to project out

NO, since selection results

(Clustered hash index on bid) Reserves

Does the hash index on sid

NO, since there is at most

(Clustered hash index on bid) Reserves

rating > 5 (On-the-fly)

(Index Nested Loops,

(Use hash Cost = 1.2 I/Os (if

(Clustered hash index on bid) Reserves

(Index Nested Loops,

(Clustered hash index on bid) Reserves

rating > 5 (On-the-fly)

Total Cost = 10 + 1200 = 1210 I/Os

rating > 5 (On-the-fly)

Reserves Sailors Reserves

Still the same result!

How can this be guaranteed?

Evaluating Query Plans

Relational Algebra Equivalences

You might also like