Chapter 15 (TCDS)

Query Execution
Sukarna Barua
Associate Professor, CSE, BUET
03/20/2024
Nested-loop Joins
• Nested-loop joins:
  • Require "one and a half" passes.
  • Tuples of one argument are read only once, while tuples of the other argument are read repeatedly.
  • Can be used for relations of any size. [no requirement on memory]
  • Neither relation needs to fit in main memory.
  • $M \ge 2$ is sufficient for this join.

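Tuple-based Nested Loop Join
• Algorithm for computing $R \bowtie S$: for each tuple $s$ of $S$, scan every tuple $r$ of $R$ and output the joined tuple whenever $r$ and $s$ agree on the join attributes.

• Cost of tuple-based nested loop join:
  • As much as $T(R)\,T(S)$ disk I/Os [if not managed efficiently]
  • Much lower if block-based retrieval is done [discussed next]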

Block-based Nested-loop joins
• An improvement over tuple-based nested loop join.
• Assume $B(S) \le B(R)$ and $B(S) > M$ [neither relation fits in main memory].
• Algorithm:
  • Repeatedly read $M-1$ blocks of $S$ into main memory.
  • Create a search data structure (search key = join attributes) in main memory.
  • Read the blocks of $R$ one by one and, for each tuple $t$ of $R$:
    • Find the matching tuples of $S$ [from main memory].
    • Compute the joined tuples and send them to the output.

• What if we read one block of $S$ at a time into main memory instead of $M-1$?
  • The cost increases significantly! [discussed next]
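
Block-based Nested-loop joins
• Algorithm for computing $R \bowtie S$: the outer loop runs over chunks of $M-1$ blocks of $S$; the inner loop scans $R$ block by block and probes the in-memory chunk for matching tuples.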
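The block-based nested-loop join described on these slides can be sketched in Python as follows. This is only an illustrative sketch: relations are modelled as lists of blocks (each block a list of tuples), and the function name and the `r_key`/`s_key` parameters are assumptions, not part of the slides.

```python
from collections import defaultdict

def block_nested_loop_join(r_blocks, s_blocks, m, r_key, s_key):
    """Join R and S; each relation is a list of blocks (lists of tuples).

    m is the number of memory buffers: m - 1 hold a chunk of S,
    the remaining one holds the current block of R.
    """
    output = []
    chunk = m - 1
    for start in range(0, len(s_blocks), chunk):
        # Outer loop: load the next chunk of up to M-1 blocks of S
        # and build a search structure keyed on the join attribute.
        index = defaultdict(list)
        for block in s_blocks[start:start + chunk]:
            for s in block:
                index[s_key(s)].append(s)
        # Inner loop: scan R one block at a time and probe the chunk.
        for block in r_blocks:
            for r in block:
                for s in index.get(r_key(r), []):
                    output.append(r + s)
    return output
```

Note that all of $R$ is rescanned once per chunk of $S$, which is where the repeated I/O cost comes from.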

Block-based Nested-loop joins
• I/O cost of block-based nested-loop join:
  • Let $B(R) = 1000$, $B(S) = 500$, and $M = 101$.
  • $100$ disk I/Os to read each chunk of $100$ blocks of $S$ into main memory.
  • $1000$ disk I/Os to read $R$ in the inner loop [$R$ must be retrieved again for every 100 blocks of $S$].
  • Total I/O $= 5 \times (100 + 1000) = 5500$.
  • If we read one block of $S$ at a time, then cost $= 500 \times (1 + 1000) = 500{,}500$.
  • Significantly higher than 5500!
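The arithmetic behind these two figures can be checked with a small helper. This is a sketch under the assumption that the outer relation is read in fixed-size chunks and the inner relation is rescanned in full for every chunk; the function name is hypothetical.

```python
import math

def nested_loop_io(b_outer, b_inner, chunk_blocks):
    """Disk I/Os when the outer relation is read `chunk_blocks` blocks at a time.

    The outer relation is read once in total; the inner relation is read
    once per chunk of the outer relation.
    """
    iterations = math.ceil(b_outer / chunk_blocks)
    return b_outer + iterations * b_inner

# Chunks of M - 1 = 100 blocks of S versus one block of S at a time.
print(nested_loop_io(500, 1000, 100))  # 5500
print(nested_loop_io(500, 1000, 1))    # 500500
```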

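Analysis of nested-loop join
• Assume $B(S) \le B(R)$, so $S$ is used in the outer loop.
• The number of iterations of the outer loop is $B(S)/(M-1)$.
• At each iteration, we read $M-1$ blocks of $S$ and $B(R)$ blocks of $R$.
• The number of disk I/Os:

  $$\frac{B(S)}{M-1}\,\bigl(M - 1 + B(R)\bigr) = B(S) + \frac{B(S)\,B(R)}{M-1}$$

  [What if $R$ is used in the outer loop instead of $S$?]

• Can be approximated as $B(R)\,B(S)/M$ [considering $M$ is much smaller than $B(R)$ and $B(S)$].
• If $B(S) \le M-1$, then cost $= B(R) + B(S)$ [same as the one-pass algorithm].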

Cost Summary
• Summary of costs for different operations: [table not recovered]

Two-Phase Multi-way Merge Sort (TPMMS)
• Algorithm:
  • Phase 1: Repeatedly fill $M$ buffers with blocks of $R$ (from disk) and sort them using any main-memory algorithm. Write the sorted sublists to disk.
  • Phase 2: Merge the sorted sublists as follows.
    • Assume there are at most $M-1$ sorted sublists. [Why this constraint?]
    • Allocate one memory block for each sorted sublist and one block for output.
    • Keep a pointer into the current block of each input sublist:
      • It points to the first tuple in that block not yet moved to the output.
    • Merge the input sublists [discussed next].

Two-Phase Multi-way Merge Sort (TPMMS)
• Algorithm for merging sorted sublists:
  • Find the smallest tuple among all input sublists [main memory operation].
  • Move the smallest tuple to the output block.
  • If the output block is full, write it to disk.
  • If any input block becomes empty, fetch the next block from the corresponding sorted sublist.
  • If no next block remains, the input block stays empty because the sublist is exhausted.
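A minimal in-memory sketch of the two phases is shown below. It assumes the relation is a list of blocks of comparable tuples and simulates disk sublists as Python lists, so the block-level refilling of input buffers is simplified to a per-sublist position; the function name `tpmms` is chosen here for illustration.

```python
import heapq

def tpmms(blocks, m):
    """Sort a relation given as a list of blocks, using m memory buffers."""
    # Phase 1: fill M buffers at a time, sort in memory, emit a sorted sublist.
    sublists = []
    for start in range(0, len(blocks), m):
        tuples = [t for block in blocks[start:start + m] for t in block]
        sublists.append(sorted(tuples))
    if len(sublists) > m - 1:
        raise ValueError("too many sublists for a single merge phase")
    # Phase 2: repeatedly move the smallest head tuple to the output.
    # Each heap entry is (tuple, sublist index, position in that sublist).
    heap = [(sub[0], i, 0) for i, sub in enumerate(sublists) if sub]
    heapq.heapify(heap)
    output = []
    while heap:
        value, i, pos = heapq.heappop(heap)
        output.append(value)
        if pos + 1 < len(sublists[i]):
            heapq.heappush(heap, (sublists[i][pos + 1], i, pos + 1))
    return output
```

The `ValueError` corresponds to the constraint that at most $M-1$ sublists can be merged in one pass, since each sublist needs its own input buffer.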

Two-Phase Multi-way Merge Sort (TPMMS)
• Memory requirement for TPMMS:
  • There cannot be more than $M-1$ sublists.
  • Each sublist consists of $M$ blocks.
  • The number of sublists: $B(R)/M$.
  • We require that:

    $$\frac{B(R)}{M} \le M - 1 \;\Rightarrow\; B(R) \le M(M-1), \quad \text{or roughly } B(R) \le M^2 \text{ [for simplicity!]}$$

• I/O cost of TPMMS:
  • Phase 1: $B(R)$ blocks read, $B(R)$ blocks written.
  • Phase 2: $B(R)$ blocks read.
  • Total cost: $3B(R)$, or $4B(R)$ [if the sorted result needs to be stored on disk].

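Two-Phase Multi-way Merge Sort (TPMMS)
• Calculation of the maximum relation size for TPMMS:
  • Suppose the block size and the memory size are given. What is the maximum size of $R$ for TPMMS?
  • Number of buffers: $M$ = memory size / block size.
  • Max size of $R$: $B(R) \le M(M-1) \approx M^2$ blocks, i.e. about $M^2 \times$ block size in bytes.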
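As a worked illustration of this calculation, the helper below derives $M$ and the largest sortable relation from a block size and a memory size. The concrete values in the usage lines (64 KiB blocks, 1 GiB of memory) are assumptions chosen for the example, not figures taken from the slide.

```python
def tpmms_max_relation(block_bytes, memory_bytes):
    """Return (M, max blocks of R, max bytes of R) for a two-phase sort."""
    m = memory_bytes // block_bytes      # number of memory buffers M
    max_blocks = m * (m - 1)             # B(R) <= M(M-1), roughly M^2
    return m, max_blocks, max_blocks * block_bytes

# Assumed example values: 64 KiB blocks and 1 GiB of main memory.
m, max_blocks, max_bytes = tpmms_max_relation(2**16, 2**30)
print(m)                   # 16384
print(max_bytes / 2**40)   # just under 16 (TiB)
```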

Duplicate Elimination Using Sorting: Two Pass Algorithm
• Algorithm: Similar to TPMMS.
  • Phase 1: Sort the tuples of $R$ into sublists as in TPMMS. [Do not merge]
  • Phase 2:
    • Like TPMMS, use one memory block for each sublist and one memory block for output. Bring one block from each sublist into memory.
    • Repeatedly perform the following:
      • Find the smallest tuple $t$ among all sublists, together with all other tuples equal to $t$.
      • Send one copy of $t$ to the output and discard the other copies.
      • When an output block is full, write it to disk.
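Phase 2 of this algorithm can be sketched with the standard library's `heapq.merge`, which lazily merges already-sorted inputs. The sketch assumes the phase 1 sublists are available as sorted Python lists; the function name is hypothetical.

```python
import heapq

def sort_based_distinct(sublists):
    """Merge sorted sublists and keep one copy of each distinct tuple."""
    output = []
    for t in heapq.merge(*sublists):
        # Equal tuples arrive consecutively, so comparing with the last
        # emitted tuple is enough to discard the extra copies.
        if not output or output[-1] != t:
            output.append(t)
    return output

print(sort_based_distinct([[1, 2, 2, 5], [2, 3, 5], [1, 4]]))  # [1, 2, 3, 4, 5]
```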

Duplicate Elimination Using Sorting: Two Pass Algorithm
• I/O cost and memory requirement:
  • Disk I/O: $3B(R)$ [ignoring the output write to disk]
  • Memory requirement: $B(R) \le M^2$, i.e. $M \ge \sqrt{B(R)}$
  • Compare: the memory requirement for one-pass duplicate elimination is $M \ge B(\delta(R))$.

Grouping and Aggregation Using Sorting: Two Pass Algorithm
• Phase 1: Sort the tuples of $R$ into sublists as in TPMMS, based on the grouping attributes [sort key = $L$].
• Phase 2:
  • Like TPMMS, use one memory block for each sublist and one memory block for output. Bring one block from each sublist into memory.
  • Repeatedly perform the following:
    • Find the smallest sort-key value among the tuples in the buffers. Call this value $v$; it becomes the next group. Prepare aggregate accumulation variables for this group (e.g., MIN, MAX, SUM, COUNT).
    • Examine all tuples with sort key $v$ and update the accumulated results.
    • If an input buffer becomes empty, replace it with the next block from the same sublist.
    • When there are no more tuples with sort key $v$, output a tuple consisting of the grouping attributes and the aggregate values.
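The grouping phase can be sketched in the same style. This is an illustration only: it assumes each sublist is already sorted on the grouping key, computes SUM and COUNT as the example aggregates, and uses hypothetical `key` and `value` accessor parameters.

```python
import heapq
from itertools import groupby

def sort_based_group_sum(sublists, key, value):
    """Return (group key, SUM, COUNT) per group from key-sorted sublists."""
    merged = heapq.merge(*sublists, key=key)
    result = []
    for k, group in groupby(merged, key=key):
        total = 0
        count = 0
        for t in group:            # all tuples sharing sort key k
            total += value(t)
            count += 1
        result.append((k, total, count))
    return result

sublists = [[('a', 1), ('b', 5)], [('a', 2), ('c', 7)], [('b', 1)]]
print(sort_based_group_sum(sublists, lambda t: t[0], lambda t: t[1]))
# [('a', 3, 2), ('b', 6, 2), ('c', 7, 1)]
```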

Grouping and Aggregation Using Sorting: Two Pass Algorithm
• I/O cost and memory requirement of two-pass grouping and aggregation:
  • Disk I/O: $3B(R)$ [ignoring the output write to disk]
  • Memory requirement for two pass: $B(R) \le M^2$, i.e. $M \ge \sqrt{B(R)}$.
  • Compare with the one-pass algorithm: the memory requirement is $M \ge B(\gamma_L(R))$.

Sort-based Set Union: Two Pass Algorithm
• Algorithm for $R \cup S$:
  • Phase 1: Create sorted sublists for both $R$ and $S$. [similar to TPMMS]
  • Phase 2:
    • Use one memory buffer for each sublist of $R$ and $S$ and one buffer for output. Bring one block from each sublist of $R$ and $S$ into memory.
    • Repeatedly perform the following:
      • Find the smallest tuple $t$ among all buffers. Send one copy of $t$ to the output.
      • Find all other copies of $t$ and discard them.
      • Handle input and output blocks as in TPMMS.

Sort-based Set Union: Two Pass Algorithm
• Cost and memory requirement of two-pass set union:
  • Disk I/O:
    • $R$ and $S$ are read and written once while the sorted sublists are created.
    • $R$ and $S$ are read a second time in phase 2 while the output is generated.
    • Cost $= 3\,(B(R) + B(S))$
  • Memory requirement:
    • The total number of sublists of $R$ and $S$ cannot exceed $M$ [$M-1$ to be more correct!]
    • Hence $B(R) + B(S) \le M^2$.
    • Compare with the one-pass algorithm: $\min(B(R), B(S)) \le M$.

Sort-based Set Intersection: Two Pass Algorithm
• Algorithm for $R \cap S$:
  • Phase 1: Same as set union.
  • Phase 2:
    • Use one memory buffer for each sublist of $R$ and $S$ and one buffer for output. Bring one block from each sublist of $R$ and $S$ into memory.
    • Repeatedly perform the following:
      • Find the smallest tuple $t$ and check whether it is present in at least one sublist of $R$ and at least one sublist of $S$ [main memory operation]. If yes, copy $t$ to the output.
      • Find all tuples of $R$ and $S$ that are equal to $t$. Remove and discard them.
      • Handle input and output blocks as in TPMMS.

• I/O cost and memory requirement: Same as set union.


Sort-based Set Difference: Two Pass Algorithm
• Algorithm for $R - S$:
  • Phase 1: Same as set union.
  • Phase 2:
    • Repeatedly perform the following:
      • Find the smallest tuple $t$ and check whether it is present in at least one sublist of $R$ but in no sublist of $S$ [main memory operation]. If yes, copy $t$ to the output.
      • Find all tuples of $R$ and $S$ that are equal to $t$. Remove and discard them.
      • Handle input and output blocks as in TPMMS.

• I/O cost and memory requirement: Same as set union.
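The three sort-based set operations differ only in the membership test applied to each distinct tuple during the merge. One way to sketch this, assuming sorted sublists held as Python lists and set semantics, is to tag every tuple with its source relation; the function name and the `op` parameter are illustrative assumptions.

```python
import heapq
from itertools import groupby

def sort_based_set_op(r_sublists, s_sublists, op):
    """op is 'union', 'intersection', or 'difference' (meaning R - S)."""
    # Tag each tuple with its source so the merge can tell R copies from S copies.
    tagged_r = [[(t, 'R') for t in sub] for sub in r_sublists]
    tagged_s = [[(t, 'S') for t in sub] for sub in s_sublists]
    merged = heapq.merge(*tagged_r, *tagged_s)
    output = []
    for t, copies in groupby(merged, key=lambda pair: pair[0]):
        sources = {src for _, src in copies}   # consumes all copies of t
        in_r = 'R' in sources
        in_s = 'S' in sources
        if (op == 'union'
                or (op == 'intersection' and in_r and in_s)
                or (op == 'difference' and in_r and not in_s)):
            output.append(t)
    return output

r, s = [[1, 3], [2, 5]], [[2, 3], [4]]
print(sort_based_set_op(r, s, 'union'))         # [1, 2, 3, 4, 5]
print(sort_based_set_op(r, s, 'intersection'))  # [2, 3]
print(sort_based_set_op(r, s, 'difference'))    # [1, 5]
```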

Sort based Join operation
• Algorithm for $R(X, Y) \bowtie S(Y, Z)$:
  • Sort $R$ using TPMMS with $Y$ as the sort key. Store the sorted $R$ on disk.
  • Sort $S$ using TPMMS with $Y$ as the sort key. Store the sorted $S$ on disk.
  • Merge the sorted $R$ and $S$ as follows:
    • Use two memory buffers: one for $R$ and one for $S$.
    • Repeatedly do the following:
      • Find the least value $y$ of the join attribute currently at the front of the blocks for $R$ and $S$.
      • If $y$ does not appear at the front of both relations, remove the tuples with sort key $y$.
      • Otherwise, get all tuples from both relations having sort key $y$. [More blocks may need to be read from disk during this step. Why?]
      • Output all tuples formed by joining tuples from $R$ and $S$ having the common value $y$.
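The merge step of the sort-based join can be sketched as below, assuming both inputs are already sorted on the join attribute and held as Python lists rather than disk blocks. The function name and key accessors are hypothetical.

```python
def merge_join(r_sorted, s_sorted, r_key, s_key):
    """Join two relations that are already sorted on the join attribute."""
    output = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        y_r, y_s = r_key(r_sorted[i]), s_key(s_sorted[j])
        if y_r < y_s:
            i += 1          # value appears only at the front of R: skip it
        elif y_r > y_s:
            j += 1          # value appears only at the front of S: skip it
        else:
            # Collect every tuple with this join value from both sides.
            i_end = i
            while i_end < len(r_sorted) and r_key(r_sorted[i_end]) == y_r:
                i_end += 1
            j_end = j
            while j_end < len(s_sorted) and s_key(s_sorted[j_end]) == y_r:
                j_end += 1
            for r in r_sorted[i:i_end]:
                for s in s_sorted[j:j_end]:
                    output.append(r + s)
            i, j = i_end, j_end
    return output
```

The two inner `while` loops are the step where, on disk, additional blocks may have to be read, because tuples sharing one join value can span several blocks.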

Cost of Sort Based Join
• Total I/O cost:
  • Cost of sorting $R$ and $S$ and storing the sorted results on disk: $4\,(B(R) + B(S))$
  • Cost of the final merge: $B(R) + B(S)$ [read sorted $R$ and $S$ once]
  • Total cost $= 5\,(B(R) + B(S))$.
• Memory requirement:
  • Sorting $R$ using TPMMS requires $B(R) \le M^2$.
  • Sorting $S$ using TPMMS requires $B(S) \le M^2$.
  • Combining, we get $\max(B(R), B(S)) \le M^2$.
  • All tuples having the same join-attribute value must fit in $M$ memory blocks.
    [If all tuples have the same value, the memory requirement grows accordingly.]

Cost of Sort Based Join
• Calculation of I/O cost for sort-based join:
  • Assume $B(R) = 1000$, $B(S) = 500$, and $M = 101$.
  • Sorting $R$ using TPMMS takes $4 \times 1000 = 4000$ I/Os.
  • Sorting $S$ using TPMMS takes $4 \times 500 = 2000$ I/Os.
  • Merging requires reading $R$ and $S$ one last time, causing $1500$ I/Os.
  • Total I/O cost $= 7500$.

• Compare with block-based nested-loop join:
  • Nested-loop I/O cost: $5500$
  • Seems like nested-loop join is better than sort-based join!
  • But this is not true in general! [next slide]

Cost of Sort Based Join
• Calculate the same cost for larger relations, with $M$ unchanged.
  • Sorting $R$ using TPMMS takes $4B(R)$ I/Os.
  • Sorting $S$ using TPMMS takes $4B(S)$ I/Os.
  • Merging requires reading $R$ and $S$ one last time, causing $B(R) + B(S)$ I/Os.
  • Total I/O cost $= 5\,(B(R) + B(S))$.

• Compare with block-based nested-loop join:
  • Nested-loop I/O cost: $B(S) + B(S)\,B(R)/(M-1)$
  • Now the nested-loop join cost is significantly higher than the sort-based cost!

• Reason: nested-loop join requires I/O proportional to $B(R)\,B(S)/M$, while sort-based join requires I/O proportional to $B(R) + B(S)$.
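The crossover between the two joins can be illustrated numerically. The sketch below uses the cost formulas from the slides; the second pair of relation sizes (10000 and 5000 blocks) is an assumption picked for illustration, since the values used on the original slide are not recoverable here.

```python
import math

def sort_join_io(b_r, b_s):
    """Simple sort-based join: 4 I/Os per block to sort, 1 to merge."""
    return 5 * (b_r + b_s)

def block_nested_loop_io(b_r, b_s, m):
    """S in the outer loop, read in chunks of M - 1 blocks."""
    return b_s + math.ceil(b_s / (m - 1)) * b_r

for b_r, b_s in [(1000, 500), (10000, 5000)]:   # second pair is assumed
    print(b_r, b_s, sort_join_io(b_r, b_s), block_nested_loop_io(b_r, b_s, 101))
# 1000 500 7500 5500
# 10000 5000 75000 505000
```

Growing both relations tenfold multiplies the sort-based cost by ten but the nested-loop cost by roughly a hundred.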

An Efficient Sort Based Join: Two Pass Algorithm
• An efficient version with a constraint: the number of tuples sharing a common sort-key value should be small.
• Algorithm:
  • Phase 1: Create sorted sublists of $R$ and $S$ based on the sort key $Y$.
  • Phase 2:
    • Bring the first block of each sublist of $R$ and $S$ into a memory buffer.
    • Repeatedly perform the following:
      • Find the smallest sort-key value $y$ among all sublists. Find all tuples that have the same sort key $y$.
      • Combine the tuples of $R$ and $S$ with the common value $y$ to create joined tuples.
      • Copy the joined tuples to the output.

An Efficient Sort Based Join: Two Pass Algorithm
• Cost of efficient sort-based join:
  • I/O cost: $3\,(B(R) + B(S))$.
  • Memory requirement: $B(R) + B(S) \le M^2$
  • Example: Assume $B(R) = 1000$, $B(S) = 500$, and $M = 101$.
    • Using 100 blocks per sublist: 10 sublists for $R$, 5 sublists for $S$.
    • This requires $15$ blocks of memory in phase 2 for reading the sublists of $R$ and $S$.
    • The remaining $86$ blocks are free to handle a large number of tuples with the same sort-key value [if more blocks are required].
    • $3000$ I/Os for creating the sorted sublists in phase 1.
    • $1500$ I/Os for reading the sorted sublists of both $R$ and $S$ in phase 2.
    • Total I/O $= 4500$

Summary of Sort Based Algorithms
• Cost summary for sort-based algorithms: [table not recovered]

Two Pass Algorithm Based on Hashing
• Involves partitioning the relation into $M-1$ buckets of roughly equal size.
• Algorithm for hash partitioning:
  • Assume $h$ is a hash function. The algorithm to partition $R$ into $M-1$ buckets using $h$ is as follows.
  • Allocate one memory buffer for each bucket and one memory buffer for reading one block of $R$.
  • Read one block of $R$ at a time and, for each tuple $t$:
    • Compute $h(t)$ and copy $t$ to the buffer holding the bucket for $h(t)$.
    • If that buffer is full, write it to disk and reinitialize it for the next tuples of the same bucket.
  • At the end, write the non-empty memory buffers of all buckets to disk.
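The partitioning steps above can be sketched as follows. Disk is simulated by lists of written blocks, and the `block_size` parameter (tuples per block) and the function name are assumptions made for the illustration.

```python
def hash_partition(blocks, m, block_size, h=hash):
    """Partition a relation (list of blocks) into m - 1 buckets.

    Returns, per bucket, the list of blocks that were written to "disk".
    """
    n_buckets = m - 1
    buffers = [[] for _ in range(n_buckets)]          # one buffer per bucket
    buckets_on_disk = [[] for _ in range(n_buckets)]
    for block in blocks:                              # one input buffer for R
        for t in block:
            i = h(t) % n_buckets
            buffers[i].append(t)
            if len(buffers[i]) == block_size:         # buffer full: flush it
                buckets_on_disk[i].append(buffers[i])
                buffers[i] = []
    for i, buf in enumerate(buffers):                 # final flush
        if buf:
            buckets_on_disk[i].append(buf)
    return buckets_on_disk

blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(hash_partition(blocks, 4, 2, h=lambda t: t))
# [[[0, 3], [6]], [[1, 4], [7]], [[2, 5], [8]]]
```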

Two Pass Algorithm Based on Hashing
• Algorithm for hash partitioning: [pseudocode figure not recovered]

Duplicate Elimination Using Hashing
• Compute $\delta(R)$:
  • Phase 1: Hash $R$ into $M-1$ buckets. Note that identical tuples hash to the same bucket.
  • Phase 2: Read one bucket at a time into main memory.
    • Use the one-pass duplicate elimination algorithm on the bucket.

• Memory requirement:
  • Approximate size of each bucket: $B(R)/(M-1)$
  • For phase 2 to work, each bucket must fit in main memory: $B(R)/(M-1) \le M$
  • Thus, $B(R) \le M(M-1) \approx M^2$ [simplified]
• I/O cost:
  • Phase 1 partitioning: $2B(R)$
  • Phase 2: $B(R)$
  • Total cost: $3B(R)$

Hash based Union, Intersection, and Difference
• Algorithm:
  • Phase 1: Hash $R$ and $S$ using the same hash function. Let $R_1, R_2, \ldots, R_{M-1}$ and $S_1, S_2, \ldots, S_{M-1}$ be the buckets of $R$ and $S$.
  • Phase 2: For each $i$, use the one-pass algorithm on the buckets $R_i$ and $S_i$, which have the same hash value. For example, for union, this can be done as follows:
    • Retrieve bucket $S_i$ into main memory blocks. Use a main-memory data structure to search efficiently for a tuple in $S_i$.
    • Copy all tuples of $S_i$ to the output.
    • Retrieve one block at a time from the bucket $R_i$ and do the following:
      • For each tuple $t$ of $R_i$, if it does not appear in $S_i$, copy it to the output. Otherwise, discard it.

Hash Based Union, Intersection, and Difference
• Cost of set union, intersection, and difference:
  • I/O cost:
    • Phase 1 for hashing: $2\,(B(R) + B(S))$
    • Phase 2 for the one-pass algorithm: $B(R) + B(S)$
    • Total cost $= 3\,(B(R) + B(S))$
  • Memory requirement:
    • In phase 2, the smaller of the $R_i$ and $S_i$ buckets must fit in the available memory (about $M$ blocks).
    • The bucket sizes are roughly $B(R)/(M-1)$ and $B(S)/(M-1)$, the smaller of which should be at most about $M$.
    • Thus the requirement is roughly: $\min(B(R), B(S)) \le M^2$.

Hash Based Join
• Algorithm for $R(X, Y) \bowtie S(Y, Z)$:
  • The algorithm is similar to the previous ones for set operations.
  • Phase 1: Hash $R$ and $S$ into $M-1$ buckets each, using the same hash function applied to the join attributes $Y$.
  • Phase 2: Take each pair of buckets $R_i$ and $S_i$, and use the one-pass join algorithm.
• Memory requirement:
  • Same as for set union, intersection, etc.
  • Thus, the memory requirement is: $\min(B(R), B(S)) \le M^2$.
• I/O cost:
  • Cost $= 3\,(B(R) + B(S))$ [same as before for set union, intersection, etc.]
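Both phases of the hash join can be sketched in memory as follows. The sketch assumes relations are plain lists of tuples, uses Python's built-in `hash` on the join attribute as the partitioning function, and keeps buckets in dictionaries instead of writing them to disk; all names are illustrative.

```python
from collections import defaultdict

def hash_join(r, s, n_buckets, r_key, s_key):
    """Two-pass hash join of R and S on r_key(t) == s_key(t)."""
    # Phase 1: partition both relations with the same hash function,
    # applied to the join attribute.
    r_buckets = defaultdict(list)
    s_buckets = defaultdict(list)
    for t in r:
        r_buckets[hash(r_key(t)) % n_buckets].append(t)
    for t in s:
        s_buckets[hash(s_key(t)) % n_buckets].append(t)
    # Phase 2: one-pass join on each pair of corresponding buckets.
    output = []
    for i in range(n_buckets):
        index = defaultdict(list)              # search structure on S_i
        for t in s_buckets.get(i, []):
            index[s_key(t)].append(t)
        for t in r_buckets.get(i, []):         # stream R_i and probe
            for match in index.get(r_key(t), []):
                output.append(t + match)
    return output
```

Tuples that join always land in buckets with the same index, so only corresponding bucket pairs need to be compared.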

Summary of Hash Based Algorithms
• Summary of costs for hash-based algorithms: [table not recovered]
