You are on page 1of 32

More Operators

and Query Plans


Immanuel Trummer

itrummer@cornell.edu

www.itrummer.org
Database Management
Systems (DBMS)
Application 1 Connections, Security, Utilities, ...

Application 2
DBMS Interface Query Processor
Query Parser Query Rewriter

... Query Optimizer Query Executor

Storage Manager
Data Access Buffer Manager

Transaction Manager Recovery Manager

[RG, Sec. 12]


Slides by Immanuel Trummer, Cornell University
Data
Projection (𝝿)

Slides by Immanuel Trummer, Cornell University


Projection Operator
• Straight forward for SELECT without DISTINCT

• Calculate SELECT items, drop other columns

• More difficult if DISTINCT keyword is present

• Need to filter out duplicates - multiple options:

• Exploit hash function

• Exploit sorting

• Exploit index

Slides by Immanuel Trummer, Cornell University


Projection & Duplicate Elimination 

via Hashing
• Phase 1: partition data into hash buckets

• Scan input, calculate projection, partition by hash function

• Data partitions are written back to hard disk

• Phase 2: eliminate duplicates for each partition

• Read one partition into memory and eliminate duplicates

• Can use second hash function to detect duplicates

• Constraints on memory similar as for hash join

• Count hash buckets for Phase 1, bucket size for Phase 2

Slides by Immanuel Trummer, Cornell University


Projection & Duplicate Elimination 

via Hashing
• Phase 1: partition data into hash buckets

• Scan input, calculate projection, partition by hash function

• Data partitions are written back to hard disk

• Phase 2: eliminate duplicates for each partition

• Read one partition into memory and eliminate duplicates

• Can use second hash function to detect duplicates

• Constraints on memory similar as for hash join

• Count hash buckets for Phase 1, bucket size for Phase 2

(Cost is 3 * Number of Pages - at most)


Slides by Immanuel Trummer, Cornell University
What If Hash Buckets
Do Not Fit in Memory?

Slides by Immanuel Trummer, Cornell University


Projection & Duplicate Elimination 

via Sorting
• Idea: sorting rows helps finding duplicates

• (Duplicates appear consecutively)

• Use variant of external sort algorithm seen before

• Apply projection during first pass over data

• Eliminate in-memory duplicates during all steps

• The result is duplicate-free and sorted

• Can reduce number of passes with more main memory

Slides by Immanuel Trummer, Cornell University


Projection & Duplicate Elimination 

via Sorting
• Idea: sorting rows helps finding duplicates

• (Duplicates appear consecutively)

• Use variant of external sort algorithm seen before

• Apply projection during first pass over data

• Eliminate in-memory duplicates during all steps

• The result is duplicate-free and sorted

• Can reduce number of passes with more main memory


(Cost is external sorting cost - at most)
Slides by Immanuel Trummer, Cornell University
Projection & Duplicate Elimination 

via Index

• Assume index key includes projection columns

• Can retrieve relevant data from index alone

• Saves cost considering index smaller than data

• Even better: tree index with projections as key prefix

• Duplicates retrieved consecutively, easy to eliminate

Slides by Immanuel Trummer, Cornell University


Projection & Duplicate Elimination 

via Index

• Assume index key includes projection columns

• Can retrieve relevant data from index alone

• Saves cost considering index smaller than data

• Even better: tree index with projections as key prefix

• Duplicates retrieved consecutively, easy to eliminate

(Cost of reading index data)


Slides by Immanuel Trummer, Cornell University
Grouping (𝚪) &
Aggregation (∑ ...)

Slides by Immanuel Trummer, Cornell University


Aggregation without Groups

• SQL offers Min, Max, Sum, Count, Avg

• Scan input data and update in-memory aggregate

• Can use constant amount of memory

• Cost of reading input data once

• Count distinct requires duplicate elimination (see prior)

Slides by Immanuel Trummer, Cornell University


Aggregation with Groups
• Can use hashing

• Maintain hash table of group keys with aggregates

• Can use sorting

• Sort on group keys, aggregate groups consecutively

• Can use indexes

• Index key must contain group-by keys

Slides by Immanuel Trummer, Cornell University


Set Operations (∩,∪,-)

Slides by Immanuel Trummer, Cornell University


Set Operations
• INTERSECT can be handled like a join

• Join condition is equality on all columns

• UNION eliminates duplicates (unlike UNION ALL)

• Either hash and eliminate duplicates in each bucket

• Or sort and eliminate duplicates during merging

• R EXCEPT S

• Either partition via hash, then treat each bucket separately

• Or sort and check whether R tuple in S during merge steps

Slides by Immanuel Trummer, Cornell University


Query Plans

Slides by Immanuel Trummer, Cornell University


Reminder: Query Plans

• Query plans describes query processing as tree

• Inner nodes are operators, leaf nodes are tables

• Logical query plan just specifies types of operations

• Physical query plan specifies implementation as well

Slides by Immanuel Trummer, Cornell University


Example Plan (Logical)
𝛑S.Sname

⨝E.SID=S.SID
Data Flow

⨝C.CID=E.CID Students (S)

𝛔C.CID='CS4320' Enrollment (E)


SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'
Example Plan (Physical) i n g
n v ia S o r t
Projectio
𝛑S.Sname
Hash Join

⨝E.SID=S.SID
te d L o o p Jo in

Block Nes
Data Flow

S i z e : 10 P a ges
Block
⨝C.CID=E.CID Students (S)

Scan Filter

𝛔C.CID='CS4320' Enrollment (E)


SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'
Passing Results 

Between Operators
• Simplest version: each operator writes output to disk

• May lead to unnecessary read/write overheads!

• Better: keep intermediate results in main memory

• This may not always be possible, depending on size

• Physical plan specifies how results are passed on

• Pipelined operator passes result in-memory to next operation

• Label "On the fly" for unary operators means pipelined input

Slides by Immanuel Trummer, Cornell University


Pipelined 

Nested Loop Joins

• Full join output may be too large to fit in memory

• Hence, produce small join result parts consecutively

• Directly invoke next operator for result part in memory

• Can easily chain nested loop joins in this way

• Can start producing output with part of left input

Slides by Immanuel Trummer, Cornell University


Example Plan (Physical) F ly)
n (O n t h e
Projectio
𝛑S.Sname

i p e l i n e d B N L

P
k S i z e : 1 0 P ages
Bloc
⨝E.SID=S.SID
Data Flow

e l i ne d B N L 

Pi p
S i z e : 1 0 P a ges
Block ⨝C.CID=E.CID Students (S)
Scan Filter 

u t P i pe l i n e d )
(Outp
𝛔C.CID='CS4320' Enrollment (E)
SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'
Plan Cost Estimation

• (Calculate intermediate result sizes if not given)

• Calculate cost of each operator

• Take into account how data is passed on

• Do not count output cost of final operator

• Sum up cost of all operators in plan

Slides by Immanuel Trummer, Cornell University


Example Plan (Physical) F ly)
n (O n t h e
Projectio
𝛑S.Sname

P i p e l i n e d B N L
 40 Pages
k S i z e : 1 0 P ages
Bloc
⨝E.SID=S.SID
Data Flow

l i ne d B N L 
 20 Pages
Pi p e
S i z e : 1 0 P a ges
Block ⨝C.CID=E.CID Students (S)
20 Pages
Scan Filter 
 10 Pages
u t P i pe l i n e d )
(Outp
𝛔C.CID='CS4320' Enrollment (E)
100 Pages SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

100 Pages Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'
What is the 

Plan Execution Cost?

Slides by Immanuel Trummer, Cornell University


Example Plan (Physical) F ly)
n (O n t h e
Projectio
𝛑S.Sname

P i p e l i n e d B N L
 40 Pages
k S i z e : 1 0 P ages
Bloc
⨝E.SID=S.SID
Data Flow

l i ne d B N L 
 20 Pages
Pi p e
S i z e : 1 0 P a ges
Block ⨝C.CID=E.CID Students (S)
20 Pages
Scan Filter 
 10 Pages
u t P i pe l i n e d )
(Outp
𝛔C.CID='CS4320' Enrollment (E)
100 Pages SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

100 Pages Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'
Example Plan (Physical) F ly)
n (O n t h e
Projectio
𝛑S.Sname

P i p e l i n e d B N L
 40 Pages
k S i z e : 1 0 P ages
Bloc
⨝E.SID=S.SID
Data Flow

l i ne d B N L 
 20 Pages
Pi p e
S i z e : 1 0 P a ges
Block ⨝C.CID=E.CID Students (S)
20 Pages
Scan Filter 
 10 Pages
u t P i pe l i n e d )
(Outp
𝛔C.CID='CS4320' 100 Enrollment (E)
100 Pages SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

100 Pages Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'
Example Plan (Physical) F ly)
n (O n t h e
Projectio
𝛑S.Sname

P i p e l i n e d B N L
 40 Pages
k S i z e : 1 0 P ages
Bloc
⨝E.SID=S.SID
Data Flow

l i ne d B N L 
 20 Pages
Pi p e
S i z e : 1 0 P a ges
Block ⨝C.CID=E.CID Students (S)
100
20 Pages
Scan Filter 
 10 Pages
u t P i pe l i n e d )
(Outp
𝛔C.CID='CS4320' 100 Enrollment (E)
100 Pages SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

100 Pages Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'
Example Plan (Physical) F ly)
n (O n t h e
Projectio
𝛑S.Sname

P i p e l i n e d B N L
 40 Pages
k S i z e : 1 0 P ages
Bloc
⨝E.SID=S.SID 40
Data Flow

l i ne d B N L 
 20 Pages
Pi p e
S i z e : 1 0 P a ges
Block ⨝C.CID=E.CID Students (S)
100
20 Pages
Scan Filter 
 10 Pages
u t P i pe l i n e d )
(Outp
𝛔C.CID='CS4320' 100 Enrollment (E)
100 Pages SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

100 Pages Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'
Example Plan (Physical) F ly)
n (O n t h e
Projectio
𝛑S.Sname 0

P i p e l i n e d B N L
 40 Pages
k S i z e : 1 0 P ages
Bloc
⨝E.SID=S.SID 40
Data Flow

l i ne d B N L 
 20 Pages
Pi p e
S i z e : 1 0 P a ges
Block ⨝C.CID=E.CID Students (S)
100
20 Pages
Scan Filter 
 10 Pages
u t P i pe l i n e d )
(Outp
𝛔C.CID='CS4320' 100 Enrollment (E)
100 Pages SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

100 Pages Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'
Example Plan (Physical) F ly)
n (O n t h e
Projectio
𝛑S.Sname 0

P i p e l i n e d B N L
 40 Pages
k S i z e : 1 0 P ages
Bloc
⨝E.SID=S.SID 40
Data Flow

l i ne d B N L 
 20 Pages
Pi p e
S i z e : 1 0 P a ges
Block ⨝C.CID=E.CID Students (S)
100
20 Pages
Scan Filter 
 10 Pages
u t P i pe l i n e d )
(Outp
𝛔C.CID='CS4320' 100 Enrollment (E)
240
100 Pages SELECT Sname

FROM Enrollment E

NATURAL JOIN Students S

Courses (C) NATURAL JOIN Courses C

100 Pages Slides by Immanuel Trummer, Cornell University
WHERE C.CID = 'CS4320'

You might also like