You are on page 1of 55

Query Processing

Query Processing

 Overview
 Measures of Query Cost
 Selection Operation
 Sorting
 Join Operation
 Other Operations
 Evaluation of Expressions

2
Basic Steps in Query Processing

1. Parsing and translation


2. Optimization
3. Evaluation

3
Basic Steps in Query Processing (Cont.)

 Parsing and translation


• translate the query into its internal form. This is then
translated into relational algebra.
• Parser checks syntax, verifies relations

 Evaluation
• The query-execution engine takes a query-evaluation
plan, executes that plan, and returns the answers to the
query.

4
Basic Steps in Query Processing:
Optimization
EQUIVALENT OF EXPRESSION?
 Given schema
instructor (id, name, dept-name, salary)
Query: Write relational algebra expression to find id and salary of
all instructors whose id is less than 50501.
SQL: SELECT id, salary FROM INSTRUCTOR WHERE id<50501
Answer 1: id50501(id, salary(instructor))
equivalent to
Answer 2: id, salary(id50501(instructor))

Discussion 1: Instructor is stored physically sorted order of id. Will


the query processing cost of answer 1 and answer 2 be the
same? How?
5
Basic Steps in Query Processing:
Optimization
 Each relational algebra operation can be evaluated using one
of several different algorithms
• Correspondingly, a relational-algebra expression can be
evaluated in many ways.

 Annotated expression specifying detailed evaluation strategy is


called an evaluation-plan. E.g.,:
• Use an index on id to find instructors with id < 50501
(Answer 2),
• Or perform complete relation scan and discard instructors
with id < 50501 (Answer 1)

6
Basic Steps: Optimization (Cont.)

 Query Optimization: Amongst all equivalent evaluation plans


choose the one with lowest cost.
• Cost is estimated using statistical information from the
database catalog
 e.g.. number of tuples in each relation, size of tuples,
etc.

 In this chapter we shall study


• How to measure query costs
• Algorithms for evaluating relational algebra operations
• How to combine algorithms for individual operations in
order to evaluate a complete expression

7
Measures of Query Cost
 Many factors contribute to time cost
• disk access, CPU, and network communication
 Cost can be measured based on
• response time, i.e. total elapsed time for answering query, or
• total resource consumption
 We use total resource consumption as cost metric
• Response time harder to estimate, and minimizing resource
consumption is a good idea in a shared database
 We ignore CPU costs for simplicity
• Real systems do take CPU cost into account
• Network costs must be considered for parallel systems
 We describe how to estimate the cost of each operation
• We do not include cost to writing output to disk

8
Measures of Query Cost

 Disk cost can be estimated as:


• Number of seeks * average-seek-cost
• Number of blocks read * average-block-read-cost
• Number of blocks written * average-block-write-cost
 What is number of seeks?
 Answer: The number of times, the disk head directly moves
from one track to another track.
 What is number of block reads?
 Answer: The number of blocks need to be transferred from
disk to memory to process the query.
 What is number of blocks written?
 Answer: The number of blocks need to be written to disk from
memory to process the query.

9
Measures of Query Cost

 For simplicity we just use the number of block transfers from


disk and the number of seeks as the cost measures
• tT – time to transfer one block
 Assuming for simplicity that write cost is same as read
cost
• tS – time for one seek
• Cost for b block transfers plus S seeks
b * tT + S * tS

 tS and tT depend on where data is stored; with 4 KB blocks:


• High end magnetic disk: tS = 4 msec and tT =0.1 msec
• SSD: tS = 20-90 microsec and tT = 2-10 microsec for 4KB

10
Selection Operation ()

 File scan
 Algorithm A1 (linear search). Scan each file block and test all
records to see whether they satisfy the selection condition.
Example: The size of the given student relation is 2000 blocks. The seek
time is 5ms and block transfer time is 0.1 ms. Find the estimated query
cost for the following query
SELECT * FROM STUDENT WHERE cgpa > 3.5 or name =
‘Rafique’
Given student(id, name, CGPA, street, city)
Answer: The query requires 1 seek and full file scan to find all students
with cgpa > 3.5. So br = 2000

Total cost = 1 × 5 + 2000 × 0.1 = 5 + 200 = 205ms


For full scan, cost estimate = 1 seek + br block transfers
 br denotes number of blocks containing records from relation r

11
Selection Operation ()
 File scan
 Algorithm A1 (linear search). Scan each file block and test all records
to see whether they satisfy the selection condition.
Question: The size of the given student relation is 2000 blocks. The seek
time is 5ms and block transfer time is 0.1 ms. Find the estimated query cost
for the following query

SELECT * FROM STUDENT WHERE id <1801111042


The student relation is physically sorted on id. The number of blocks with
id <1801111042 is 800.

For partial scan, cost estimate (average) = 1 seek + (br /2) block transfers

Linear search can be applied regardless of selection condition or ordering of


records in the file, or availability of indices

12
Selections Using
Indices

 Index scan – search


algorithms that use an index
• selection condition must be
on search-key of index.
Question: Given the relation
Instructor (id, name, dept, salary)
Seek cost = 4ms, block cost =
0.1ms
Find the cost of query:
SELECT *
Query cost = cost of index and cost of data
FROM INSTRUCTOR Index cost = root cost (1 block + 1 seek) + leaf cost (1 block + 1 seek)
= 2 blocks + 2 seeks
WHERE id = 83821 Data cost = 1 block + 1 seek
Total cost = 3 blocks + 3 seeks = 3*0.1 + 3*4 = 12.3 ms

13
Selections Using Indices

 Index scan – search algorithms that use an index


• selection condition must be on search-key of index.
Given the SQL
SELECT * FROM STUDENT WHERE id = 1520303042
Clustering B+ tree index on id
Example: How many records will be returned? Given the
height of the index B+ tree is hi , find the cost of the query.

 A2 (clustering index, equality on key). Retrieve a single


record that satisfies the corresponding equality condition
• Cost = cost of index traversal + cost of data block
= hi * (tS + tT) + (tS + tT)

14
Index Search

Each node in different location in


disk

Total Number of seek = hi ,


Total number of block transfer = hi
Estimated time = hi * TS + hi*TT

Data Search

Total number of seek = 1,


Total number of block transfer = 1
Estimated time = TS +TT

Index and Data Search

Total Estimated time


= (hi * TS + hi*TT )+ (TS +TT )

15
Selections Using Indices

 A3 (clustering index, equality on nonkey) Retrieve multiple


records.
• Records will be on consecutive blocks
 Let b = number of blocks containing matching records
• Cost = hi * (tT + tS) + tS + tT * b

Example: The relation student(id, name, CGPA, street, city, NID)


is given. There is a clustering B+ tree index on city attribute of
student relation and there are 200 cities. The height of the index
is 2 and there are 400 students living in Uttara city. The record
size is 400 byte and block size is 4KB.
Find the query cost for the following query
SELECT * FROM STUDENT WHERE CITY = ‘UTTARA’

16
Selections Using
Indices
Example: The relation student(id,
name, CGPA, street, city, NID) is
given. There is a clustering B+
tree index on city attribute of
student relation and there are 200
cities. The height of the index is 2
and there are 400 students living
in Uttara city. The record size is
400 byte and block size is 4KB.
Seek time = 10ms, block transfer
time = 0.2ms
Find the query cost for the
following query
SELECT * FROM STUDENT
WHERE CITY = ‘UTTARA’
17
Selections Using
Indices
 Total cost =
Index cost + data cost
= 2 * (10+0.2) + 1 * 10 + 40 *
0.2
= 20.4 + 10+ 8
=

18
Selections Using Indices
 A3 (clustering index, equality on nonkey) Retrieve multiple
records.
• Records will be on consecutive blocks
 Let b = number of blocks containing matching records
• Cost = hi * (tT + tS) + tS + tT * b

Question: The relation student(id, name, CGPA, street, city, NID)


is given. There is a clustering B+ tree index on name attribute of
student relation and there are 2000 unique names. The height of
the index is 4 and there are 20 students whose name is Abdullah.
The record size is 400 bytes and block size is 4KB. Seek time =
10ms, block transfer time = 0.2ms
Find the query cost for the following query
SELECT * FROM STUDENT WHERE name = ‘Abdullah’

19
Selections Using Indices

 A4 (secondary index, equality What is secondary index?


on key/non-key).
• Retrieve a single record if the Student relation is physically sorted
search-key is a candidate key order in disk on id. There is an index
on id.
 Cost = (hi + 1) * (tT + tS) What type of index on id? Why?
• Retrieve multiple records if
search-key is not a candidate Answer: This is primary index because
key the file is sorted with the same
 each of n matching records attribute (id) as the index (id).
may be on a different block
Other than the primary index of a
relation, all indices of the relation are
 Cost = (hi + n) * (tT + tS), the secondary index.
Can be very expensive!
There is an index on district attribute.
What type of index is it?
20
Selections Using Indices

 A4 (secondary index, equality Example: There is a B+ tree index on


on key/non-key). NID attribute of student relation. The
• Retrieve a single record if the height of the index is 4. Seek cost is
search-key is a candidate key 10ms and block transfer cost is 0.2
ms.
 Cost = (hi + 1) * (tT + tS)
Q1. Find the query cost for the
following query using the above index.

SELECT * FROM student WHERE


NID = 999888331

Q2: Find the query cost for the above


query without using the index. Total
number of blocks of student relation is
2000. The NID 999888331 is in 500th
block.
21
Selections Using
Indices
Example: Given B+ tree index on
NID attribute of student relation.
The height of the index is 4. Seek
cost is 10ms and block transfer
cost is 0.2 ms.
Q1. Find the query cost for the
following query using the above
index.
SELECT * FROM student WHERE
NID = 999888331
Q2. Find the query cost for the
above query without using the
index. Total number of blocks of
student relation is 2000. The NID
999888331 is in 500th block.

22
Selections Using Indices

 A4 (secondary index, equality Example Given B+ tree index on


on key/non-key). district attribute of student relation and
• Retrieve multiple records if there are 64 districts. The height of the
search-key is not a candidate index is 1 and there are 800 students
key from Comilla. Each of the records is in
different block. Seek cost is 10ms and
 each of n matching records
block transfer cost is 0.2 ms.
may be on a different block
Q1: Find the query cost for the
 Cost = (hi + n) * (tT + tS), following query
Can be very expensive!
SELECT * FROM student WHERE
district = ‘Comilla’

Q2: Find the query cost without using


the index. (This requires full scan of
the relation. Why?)
23
Selections Using
Indices
Example Given B+ tree index on
district attribute of student relation
and there are 64 districts. The
height of the index is 1 and there
are 800 students from Comilla.
Each of the records is in different
block. Seek cost is 10ms and block
transfer cost is 0.2 ms.

Q1: Find the query cost for the


following query using the index.

SELECT * FROM student WHERE


district = ‘Comilla’

Q2: Find the query cost without


using the index. (This requires full
scan of the relation. Why?)
24
Selections Using
Indices
Question Given B+ tree index on city
attribute of student relation and there
are 500 cities. The height of the index
is 2 and there are 30 students from city
Bhola. Each of the records is in
different block and no block is
consecutive. Seek cost is 10ms and
block transfer cost is 0.1 ms. The
number of blocks of student relation is
2000.
a. Find the query cost for the following
query using the index.
SELECT * FROM student WHERE city
= ‘Bhola’
b. Find the query cost without using the
index.
c. Find cost of a if Bhola would have 5
students.
d. Compare and explain the cost of a, b
and c
25
Selections Involving Comparisons
Can implement selections of
the form AV (r) or A  V(r) by
Example: Given clustering index on id of
using
student relation.
a linear file scan,
or by using indices in the Q1. Find all students with id <= 1500001042.
following ways: The id 1500001042 is the first tuple of student
A5 (clustering index, relation.
comparison). (Relation is SELECT * FROM student WHERE id <= 1500001042
sorted on A)
For A  V(r) use index to Q2. Find all students with id <= 1800001042.
find first tuple  v and The id 1800001042 is in 1200th block of student
scan relation sequentially relation.
from there SELECT * FROM student WHERE id <= 1800001042

For AV (r) just scan


relation sequentially till Find the query cost of Q1 and Q2. Here no
first tuple > v; do not use need to use index (why?). Seek cost is 10ms
index and block transfer cost is 0.2 ms.

26
Selections Involving Comparisons
Example: Given clustering index on id of Answer of Example:
student relation.
With Index ( No need to use
Q1. Find all students with id <= index)
1500001042. The id 1500001042 is the first Cost without index
tuple of student relation. Q1 cost
SELECT * FROM student WHERE id <= 1500001042 = index cost + data cost
= 0 + 1 * (10+0.2)
Q2. Find all students with id <= = 10.2ms
1800001042. The id 1800001042 is in 1200th
block of student relation. Q2 cost
SELECT * FROM student WHERE id <= 1800001042
= index cost + data cost
= 0 + 1 seek * 10 + 1200block
Find the query cost of Q1 and Q2. Here no * 0.2
need to use index (why?). Seek cost is =10 + 240
10ms and block transfer cost is 0.2 ms. =250ms

27
Selections Involving Comparisons
Example: Given clustering index on id of Answer of Example:
student relation.
With Index
Query cost
Q3. Find all students with id > 1800001042. = index cost + data cost
The id 1800001042 is in 1200th block of = 4 * (10+0.2) + 10+
student relation. (2000 – 1200) * (0.2)
SELECT * FROM student WHERE id > 1800001042 = 50.8 + 160

Find the query cost of Q3 with index. The


height of the index is 4. Seek cost is 10ms
and block transfer cost is 0.2 ms.

28
Selections Involving Comparisons
 A6 (Secondary index, comparison).
 For A  V(r) use index to find first index entry  v and scan index sequentially from
there, to find pointers to records.
For instructor relation, 12 blocks for 12 records.
Select * from instructor where salary >= 60000
Asssume index in memory,
Query Cost = 11 seek + 11 blocks

29
Selections Involving Comparisons
 A6 (Secondary index, comparison).
 For AV (r) just scan leaf pages of index finding pointers to records, till first entry > v
For instructor relation, 12 blocks for 12 records.
Select * from instructor where salary <= 87000

Data Cost = 9 seek + 9 blocks

30
Selections Involving Comparisons
 A6 (Secondary index, comparison).
 For A  V(r) use index to find first index entry  v and scan index sequentially from
there, to find pointers to records. Index in memory.
For instructor relation, 12 blocks for 12 records. Q1:Select * from instructor where salary
>= 60000, Data Cost = 11 seek + 11 blocks, seek cost 10ms, block cost 0.1ms
 For AV (r) just scan leaf pages of index finding pointers to records, till first entry > v
Q2:Select * from instructor where salary <= 87000, Data Cost = 9 seek + 9 blocks
Question: In either case, retrieve records that are pointed to, requires an I/O per record;
Linear file scan may be cheaper! Explain?

31
Implementation of Complex What is conjunctive selection?
Selections (12-11-20) Selection that uses a combinations of
Conjunction: 1 2. . . n(r) predicates
A7 (conjunctive selection using one index).
Given schema
Student(id, name, fname, mname, cgpa,
 Select a combination of i and algorithms year-admit, division, district, thana,
A1 through A7 that results in the least cost street, house-number, city, NID, DOB,
for i (r). total-credit)
 Test other conditions on tuple after
fetching it into memory buffer. Example: Find students with CGPA =
4 and city = Feni and year-admit =
A8 (conjunctive selection using composite
2019
index).
 Use appropriate composite (multiple-key)
index if available. Algebra:
A9 (conjunctive selection by intersection of CGPA=4 city = Feni year-admit=2019(student)
identifiers).
1= (CGPA=4), 2= (City=‘Feni’),
 Requires indices with record pointers.
3= (year-admit=2019),
 Use corresponding index for each
condition, and take intersection of all the
obtained sets of record pointers.
 Then fetch records from file 32
Implementation of Example Given B+ tree indices on CGPA, city
Complex Selections and year-admit. The height of the indices are
2, 2 and 1 respectively.
Conjunction: 1 2. . . n(r) SQL: SELECT * FROM student WHERE
A7 (conjunctive selection using one CGPA=4 AND City=‘Feni’ AND year-
index). admit=2019. tT = 0.2ms, tS = 10ms
Select a combination of i and
algorithms A1 through A7 that results in The number of tuples (pointers) for CGPA=4,
the least cost for i (r). City=‘Feni’ AND year-admit=2019 are 5, 20
Test other conditions on tuple after and 4000 respectively
fetching it into memory buffer. Find the least cost of the SQL using A7
algorithm.
Example: Find students with CGPA The indices are secondary index, We have to
= 4 and city = Feni and and year- use A4 algorithm.
admit = 2019 Query cost = index cost + data cost
= hi * (tT + tS) + n * (tT + tS)
Algebra:
CGPA=4 city = Feni year-admit=2019(r)
1= (CGPA=4), 2= (City=‘Feni’),
3= (year-admit=2019),

33
Implementation of Example Given B+ tree indices on CGPA, city
Complex Selections and year-admit. The height of the indices are
2, 2 and 1 respectively.
C1: Cost using CGPA index SQL: SELECT * FROM student WHERE
CGPA=4 AND City=‘Feni’ AND year-
n = 5,
admit=2019. tT = 0.2ms, tS = 10ms
cost = 2 * (10 + 0.2) + 5 * (10 + 0.2)
C2: Cost using city index The number of tuples (pointers) for CGPA=4,
City=‘Feni’ AND year-admit=2019 are 5, 20
n = 20 and 4000 respectively
cost = 2 * (10 + 0.2) + 20 * (10 + 0.2) Find the least cost of the SQL using A7
algorithm and analyze the result.
Cost using year-admit index
The indices are secondary index, We have to
n = 4000 use A4 algorithm.
Query cost = index cost + data cost
C3: cost = 2 * (10 + 0.2) + 4000 * (10 +
= hi * (tT + tS) + n * (tT + tS)
0.2)
Analysis:

34
Implementation of
Complex Selections

Question Given B+ tree indices on name,


and total-credit. The height of the indices are
4 and 2 respectively.
SQL: SELECT * FROM student WHERE
name=‘Abdullah’ AND total-credit = 100
tT = 0.1ms, tS = 10ms

The number of tuples (pointers)


name=‘Abdullah’ and total-credit = 100 are 50
and 200 respectively
a. Find the least cost of the SQL using A7
algorithm and analyze the result.
The indices are secondary index, We have to
use A4 algorithm.
Query cost = index cost + data cost
= hi * (tT + tS) + n * (tT + tS)
b. Analyze the result

35
Implementation of Given B+ tree composite index on
(CGPA, city, year-admit).
Complex Selections Discussion 2: Explain how will this
query be executed using composite
Conjunction: 1 2. . . n(r) index?
A8 (conjunctive selection using composite
index). What is composite index?
Use appropriate composite (multiple-key) index Index on multiple attributes.
if available. composite index on
(CGPA, city, year-admit)

Example: Find students with CGPA = 4 and Search keys in index:


city = Uttara and and year-admit = 2019 (3, ‘Dhaka’, 2015) record
(3, ‘Dhaka’, 2017) record
Algebra: (3, ‘Dhaka’, 2020) record
CGPA=4 city = Feni year-admit=2019(r) (3, ‘Feni’, 2015) record
1= (CGPA=4), 2= (City=‘Feni’), (3, ‘Feni’, 2020) record
3= (year-admit=2019),
SELECT * FROM student WHERE
SQL: SELECT * FROM student WHERE CGPA=3 AND City=‘Feni’ AND year-
CGPA=4 AND City=‘Feni’ AND year- admit=2015
admit=2019

36
Implementation of Given B+ tree composite index on
(CGPA, city, year-admit).
Complex Selections Discussion 2: Explain how will this
query be executed using composite
Conjunction: 1 2. . . n(r) index?
A8 (conjunctive selection using composite
index). Example Find the cost of the SQL
Use appropriate composite (multiple-key) index using A8 algorithm.
if available. The height of the composite index is
4 and the number of tuples (pointers)
for CGPA=4, City=‘Feni’ AND year-
Example: Find students with CGPA = 4 and admit=2019 is 2.
city = Uttara and and year-admit = 2019
The composite index is a secondary
Algebra: index.
CGPA=4 city = Feni year-admit=2019(r)
1= (CGPA=4), 2= (City=‘Feni’), Query cost = index cost + data cost
3= (year-admit=2019),
= hi * (tT + tS) + n * (tT + tS)
SQL: SELECT * FROM student WHERE
CGPA=4 AND City=‘Feni’ AND year- Analyze A7 and A8
admit=2019

37
Implementation of Given B+ tree indices on CGPA, city
Complex Selections and year-admit.
Conjunction: 1 2. . . n(r) Discussion 3: Explain how will this
query be executed using algorithm
A9 (conjunctive selection by intersection of
A9?
identifiers).
Requires indices with record pointers. Example: Find the cost of the SQL
Use corresponding index for each condition, using A9 algorithm.
and take intersection of all the obtained sets of The height of the indices are 2, 2 and
record pointers. 1 respectively.
Then fetch records from file The number of tuples (pointers) for
CGPA=4, City=‘Feni’ AND year-
Example: Find students with CGPA = 4 and admit=2019 are 5, 20 and 4000
city = Feni and and year-admit = 2019 respectively

Algebra: The indices are secondary index.


CGPA=4 city = Feni year-admit=2019(r)
1= (CGPA=4), 2= (City=‘Feni’),
3= (year-admit=2019),

SQL: SELECT * FROM student WHERE


CGPA=4 AND City=‘Feni’ AND year-
admit=2019
38
Given B+ tree indices on CGPA, city
and year-admit.
Discussion 3: Explain how will this
query be executed using algorithm
A9?

Example Find the cost of the SQL


using A9 algorithm.
The height of the indices are 2, 2 and
1 respectively.
The number of tuples (pointers) for
CGPA=4, City=‘Feni’ AND year-
admit=2019 are 5, 20 and 4000
respectively

The indices are secondary index.


Cost = Cost of (CGPA index + City
index+year-admit index) + 1 data

= 2 * (tT + tS) + 2 * (tT + tS) +1 * (tT + tS)


+1 * (tT + tS)

= 61.2ms

39
Question Given B+ tree indices on name, and total-credit.
The height of the indices are 4 and 2 respectively.
SQL: SELECT * FROM student WHERE name=‘Abdullah’
AND total-credit = 100
tT = 0.1ms, tS = 10ms

The number of tuples (pointers) name=‘Abdullah’ AND total-


credit = 100 are 50 and 200 respectively. The number of
common pointers are 5 and each tuple is in different block.

a. Find the cost of the SQL using A9 algorithm.


The indices are secondary index
Query cost = index cost + data cost
Cost = Cost of (name index + total-credit index) + 5 data

b. Analyze the result

40
Algorithms for Disjunctive Selection

Complex Selections SQL: SELECT * FROM student WHERE CGPA=4


OR City=‘Feni’ OR year-admit=2019
Disjunction:1 2 . . . n (r).
A10 (disjunctive selection by Given B+ tree indices on CGPA, city and year-
union of identifiers). admit.
Applicable if all conditions
Discussion 4: Explain how will this query be
have available indices.
executed using algorithm A10?
Otherwise use linear
scan. Example: Find the cost of the SQL using A10
Use corresponding index for algorithm.
each condition, and take The height of the indices are 2, 2 and 1
union of all the obtained sets respectively.
of record pointers. The number of tuples (pointers) for CGPA=4,
City=‘Feni’ and year-admit=2019 are 5, 20 and
Then fetch records from file
4000 respectively

The indices are secondary index.

41
Algorithms for Example Find the cost of the SQL using A10
algorithm.
Complex Selections The height of the indices are 2, 2 and 1
respectively.
Disjunction:1 2 . . . n (r). The number of tuples (pointers) for CGPA=4,
A10 (disjunctive selection by City=‘Feni’ AND year-admit=2019 are 5, 20 and
union of identifiers). 4000 respectively
Applicable if all conditions The indices are secondary index.
have available indices.
Otherwise use linear
scan. Cost = Cost of (CGPA index + City index+year-
Use corresponding index for admit index) + ??? data
each condition, and take
union of all the obtained sets = 2 * (tT + tS) + 2 * (tT + tS) +1 * (tT + tS) +n(?) * (tT +
of record pointers. tS )
Then fetch records from file
= 5*10.2 + n(?) *10.2

42
Example Find the cost of the SQL using
A10 algorithm.
The height of the indices are 2, 2 and 1
respectively. 2 1
The number of tuples (pointers) for 6001 6001
CGPA=4, City=‘Feni’ AND year- 9002 6002
admit=2019 are 5, 20 and 4000 12001 12001
respectively 12002 16001

The indices are secondary index.


12001
12002
Cost = Cost of (CGPA index + City 12003
index+year-admit index) + ??? data ……
15999
= 2 * (tT + tS) + 2 * (tT + tS) +1 * (tT + tS)
+??? * (tT + tS)

= 5*10.2 + n(?) *10.2


n = CGPA record set  city record set  year-admit record set
= 5 + 18 + 3998 = 4021

43
Question: Given B+ tree indices on name, and total-credit.
Total number of blocks is 2000. The height of the indices are
4 and 3 respectively.
SQL: SELECT * FROM student WHERE name=‘Abdullah’
OR total-credit = 100
tT = 0.1ms, tS = 10ms

The number of tuples (pointers) name=‘Abdullah’ AND total-


credit = 100 are 50 and 200 respectively. The number of
common pointers are 5 and each tuple is in different block.

a. Find the cost of the SQL using A10 algorithm.


The indices are secondary index
Query cost = index cost + data cost
Cost = Cost of (name index + total-credit index) + n(?) data

b. Find the cost of the query using A1 algorithm


c. Analyze the result

44
Algorithms for Complex Selections

 Negation: (r)
• Use linear scan on file
• If very few records satisfy , and an index is applicable to

 Find satisfying records using index and fetch from file

45
Sorting

 We may build an index on the relation, and then use the index
to read the relation in sorted order. May lead to one disk
block access for each tuple.

Discussion: The case when it May lead to one disk block access
for each tuple.

 For relations that fit in memory, techniques like quicksort can


be used.

 For relations that don’t fit in memory, quicksort is not


applicable. Why? external sort-merge is a good choice in
this case.

46
Example: External Sorting Using Sort-Merge

47
External Sort-Merge
Memory
(Run Creation)
Run Creation Let M denote memory size (in pages).
Here M = 3
1. Create sorted runs. Let i be 1 initially.
R1 Repeatedly do the following till the end
of the relation:
(a) Read M blocks of relation into
memory
(b) Sort the in-memory blocks
R2
(c) Write sorted data to run Ri;
increment i.

R3 Let the final value of i be N (Here N = ?)

Here N = 4
2. Merge the runs (next slide)…..
R4

48
External Sort-Merge
a 19
b 14
Input Buffer (Sort-Merge)
Output Merge the runs (2-way merge). Here
a 19
Buffer
Merge pass1 N > M. M=3, N=4
Use 2 blocks of memory to buffer input
runs, and 1 block to buffer output. Read the
R1 first block of each run into its buffer page
repeat
Select the first record (in sort order)
among all buffer pages
R2
Write the record to the output buffer. If
the output buffer is full write it to disk.
Delete the record from its input buffer
R3 page.
If the buffer page becomes empty
then
read the next block (if any) of the run
R4 into the buffer.
until all input buffer pages are empty:

49
External Sort-Merge
d 31
b 14
Memory
(Sort Merge)
Merge the runs (2-way merge). Here
b 14
Merge pass1 N > M. M=3, N=2
Use 2 blocks of memory to buffer input
runs, and 1 block to buffer output. Read the
R1 first block of each run into its buffer page
repeat
Select the first record (in sort order)
among all buffer pages
R2
Write the record to the output buffer. If
the output buffer is full write it to disk.
Delete the record from its input buffer
R3 page.
If the buffer page becomes empty
then
read the next block (if any) of the run
R4 into the buffer.
until all input buffer pages are empty:

50
External Sort-Merge
d 31
b 14
Memory
(Algorithm Analysis)
Merge pass2
 If N  M, several merge passes are
b 14 required.
Merge pass1
 Here, it is 2
 In each pass, contiguous groups of M - 1
R1 runs are merged.
 A pass reduces the number of runs by a
factor of M -1, and creates runs longer by
the same factor.
R2 E.g. If M=11, and there are 90 runs,
one pass reduces the number of runs
to 9, each 10 times the size of the
initial runs
R3
 Repeated passes are performed till all
runs have been merged into one.

R4

51
External Sort-Merge
d 31
b 14
Memory
(Algorithm Analysis)
Merge pass2
b 14
Merge pass1
Question
Find the number of runs and number of
passes for a relation with 2200 blocks.
R1 Memory size is 11.

 Cost analysis (Total number of passes)


R2  Total number of blocks of the relation is br
 Size of the memory is M
 Total number of runs = (br/M)
R3

R4

52
External Sort-Merge
d 31
b 14
Memory
(Algorithm Analysis)
Merge pass2
b 14
Merge pass1
 Cost analysis (Total number of passes)
 Total number of blocks of the relation is br
R1  Size of the memory is M
 Total number of runs = (br/M)
 After merge pass 1, number of runs = 
(br/M) /(M-1) 
R2
 After merge pass 2, number of runs
=  (br/M) /(M-1) (M-1) 
=  (br/M) /(M-1) 2
R3

Total number of pass = log (M–1)(br/M).

R4

53
External Sort-Merge
d 31
b 14
Memory
(Algorithm Analysis)
Merge pass2
b 14 Cost analysis (Total number of block
Merge pass1
transfer)
Block transfers for initial run creation as well
R1
as in each pass is 2br
for final pass, we don’t count write cost
since the output of an operation may be sent
to the parent operation
R2 Block transfer for run creation
= br for read+ br for write = 2 br
Block transfers for merging
R3 = 2 br * log (M–1)(br/M)
BT for run and merge (B_RM) = 2 br + 2 br *
log (M–1)(br/M)
Final merge, no write.
R4
So Net block transfer = B_RM - br
= 2 br + 2 br * log (M–1)(br/M) - br
= br * (2 * log (M–1)(br/M) +1)
54
External Sort-Merge
d 31
b 14
Memory
(Algorithm Analysis)
Merge pass2
b 14 Cost analysis (Total number of seek)
Merge pass1

During run generation: one seek to read


R1 each run and one seek to write each run
2 br / M
During the merge phase
R2 Need 2 br seeks for each merge pass
except the final one which does not require
a write
R3 Total number of seeks:
= seek for run creation + seek for merge
= 2 br / M + 2br logM–1(br / M) - br

R4 = 2 br / M + br ( 2 logM–1(br / M) - 1)

55

You might also like