
Chapter 13: External Sorting

CPSC 404, Ed Knorr, Slides from September-December 2022

Based on:
 Chapter 13 (pages 421-436), but skip Section 13.3.1 on
“Minimizing the Number of Runs” on pages 428-430 of
Ramakrishnan & Gehrke (textbook);
 Garcia-Molina, Ullman, & Widom (reference text);
 Some CPSC 221 review material

1
Learning Goals
 Justify the importance of external sorting for database applications.
 Argue that, for many database applications, I/O activity dominates CPU
time when measuring complexity using elapsed time.
 Compute the number of I/Os (e.g., disk requests, pages, seek operations, or
passes) that are required to sort a file of size N using general multiway
(k-way) external mergesort (GMEMS), where N is the number of pages in the
file, and k ≥ 2.
 Compute the number of sorted runs (or passes) that are required to sort a
large file using GMEMS, including the sizes of the sorted runs as we
progress.
 Analyze the complexity and scalability of sorting a large file using
GMEMS.
 Using the parameters for a given disk geometry, compute the number of
I/Os, or estimate the elapsed time, for sorting a large file with GMEMS,
including two-phase multiway mergesort (2PMMS).

2
Learning Goals (cont.)
 Suggest several optimizations that can be exploited when sorting large files via
GMEMS (including 2PMMS)—e.g., cylindrification, larger block sizes, double
buffering, disk scheduling, and multiple smaller disks.
 Show how our earlier concepts of disk geometry, buffer pool management,
disk scheduling, etc., directly relate to the elapsed time for sorting large
datasets.
 Argue that sorting will continue to be a bottleneck for many applications,
regardless of how much memory computers may have in the future.
 Explain why the sorting performance of large files is data dependent.
 Estimate the number of short seeks and the number of long seeks that are
required in a data dependent sort of large files, where a “short seek” is to a
neighbouring cylinder, and a “long seek” is a seek that’s further away than one
cylinder.
 Argue for or against: Sorting performance has linear complexity for large
datasets.

3
Motivation

 Sorting is a classic problem in computer science.
– Many CPU cycles and I/O operations are devoted to
sorting.
 Database reasons for sorting:
– Data is often requested in sorted order.
– Elimination of duplicates in a collection of records
– First step in bulk loading (i.e., populating, reorganizing) a
table and its index(es)
– Aggregation types of queries (GROUP BY)
– Efficient joins of relations (e.g., Sort-Merge Join algorithm)
4
Mergesort (from CPSC 221 or Equivalent)
 Mergesort is fast and efficient
– It works well on memory-resident data that can fit completely into
memory (e.g., a large array), although Quicksort often outperforms it.
– O(n log n) average case time for Mergesort
– O(n log n) worst case time for Mergesort
 Quicksort is O(n log n) on average, but degrades to O(n2) in the worst case.
 Merge = take two sorted lists and repeatedly choose the
smaller of the “heads” of the lists
 In CPSC 404, a sorted list is often called a sorted run (SR).
 Example: Merge sorted run SR1: 1, 3, 4, 8
with sorted run SR2: 2, 5, 7, 12
Result: merged, sorted list: 1, 2, 3, 4, 5, 7, 8, 12
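As a sketch (not from the slides), the merge step can be written in a few lines of Python; the standard library's heapq.merge generalizes the same idea to k runs:

```python
def merge(sr1, sr2):
    """Merge two sorted runs by repeatedly choosing the smaller head."""
    result, i, j = [], 0, 0
    while i < len(sr1) and j < len(sr2):
        if sr1[i] <= sr2[j]:
            result.append(sr1[i]); i += 1
        else:
            result.append(sr2[j]); j += 1
    # One run is exhausted; append the remainder of the other.
    return result + sr1[i:] + sr2[j:]

print(merge([1, 3, 4, 8], [2, 5, 7, 12]))  # [1, 2, 3, 4, 5, 7, 8, 12]
```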

5
Mergesort (cont.)
 Here’s an important database-related problem: How
can we sort 100 GB (or 100 TB!) of data with only a small
fraction of that amount of RAM?
– CPU issues? Not a problem.
– I/O issues? A potentially big problem.
 Common in-memory sorting algorithms (like Quicksort)
don’t focus on disk I/Os because they don’t have to.
– They simply read the whole file into memory, sort it, and
write it out.
– However, as an important part of its algorithm, External
Mergesort (CPSC 404) tries to optimize disk I/Os.
 If we don’t pay attention to how I/Os are handled, there could be
an enormous difference in sort performance, as measured by
elapsed time (e.g., hours vs. minutes).

6
CPSC 404’s General Multiway External
Mergesort Algorithm (GMEMS)
 Suppose we need to sort a disk-resident file of size N pages
with B pages of RAM (buffer pool space) available, where
N > B.
 Algorithm:
1. [First part of the sort phase] Read B blocks/pages of data into
memory, sort that data, and write it to a temporary file on disk.
This file is called a sorted run (SR) or sublist of length B.

7
CPSC 404’s GMEMS Algorithm (cont.)

2. [Second part of the sort phase] Repeat Step 1 until all N pages
have been read in, and therefore the file has possibly many SRs of
size B (except for the last SR which may be shorter than B).
a) Each SR is independent. Although each SR is itself sorted, the elements
of SRi are probably not all ≤ (or ≥) the elements of SRj for i ≠ j.
b) The sorted runs are stored using temporary disk space.

8
CPSC 404’s GMEMS Algorithm (cont.)

3. [Merge phase] Until the whole file is sorted (i.e., we get a single
sorted run of size N), do:
• Merge up to B – 1 of the sorted runs (why not B of them?) to
form an even longer SR. This is called a (B – 1)-way mergesort.
• This longer SR is also written out as a temporary file on disk
(unless it’s the final product, in which case you probably want to
store it permanently).
4. Delete the consumed (smaller, input) sorted runs (except for the
original file, which we probably don’t want to delete).
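Steps 1-4 can be sketched as a small in-memory simulation (a sketch, not the textbook's code; each "page" is modelled as a Python list of records, and temporary files are modelled as in-memory lists):

```python
import heapq

def gmems(pages, B):
    """Simulate GMEMS: sort a 'file' of N pages with B buffer pages (B >= 3).

    Sort phase: read B pages at a time and sort them into one sorted run.
    Merge phase: repeatedly merge up to B-1 runs into longer runs.
    """
    runs = []
    for i in range(0, len(pages), B):
        chunk = [r for page in pages[i:i + B] for r in page]
        runs.append(sorted(chunk))          # one sorted run of <= B pages
    while len(runs) > 1:
        # (B-1)-way merge: one buffer page is reserved for output.
        runs = [list(heapq.merge(*runs[i:i + B - 1]))
                for i in range(0, len(runs), B - 1)]
    return runs[0] if runs else []

print(gmems([[3, 1], [4, 2], [9, 5]], 3))  # [1, 2, 3, 4, 5, 9]
```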

9
GMEMS Diagram of Merging
 We need at least B = 3 buffer pages.
 To sort a file with N pages using B buffer pages:
– Pass 0: Use all B buffer pages.
 Produce ⌈N / B⌉ sorted runs of B pages each.
– Passes 1, 2, etc.: Now, merge B - 1 sorted runs:
[Diagram: pages stream from Disk into input buffers INPUT 1, INPUT 2, ..., INPUT B-1; the merged output accumulates in the OUTPUT buffer, which is written back to Disk. B Main Memory Buffers in total.]
10
GMEMS: The Merge Phase
 Divide the total buffer space B appropriately, by using a
block of one or more pages for each input buffer, and a block
of one or more pages for the output buffer. Bigger
input/output buffer block sizes are usually much better.
– When the pages (input block) for an input buffer have been merged,
get the next block from the same sorted run. Don’t worry; the pages
within a sorted run are always in order.
– When the output block is full, write it to disk.
– At the end of merging, write the remaining (partially filled) output
buffer to disk.
 The cost of the sort is a function of the number of passes
required.
 To be consistent with the textbook, let us define one pass as
one complete read PLUS one complete write of all the data in
the file.
11
GMEMS Example

 To sort an N = 108 page file using B = 5 buffer
pages, we need 4 passes overall:
– Pass 0: ⌈108 / 5⌉ = 22 sorted runs (SRs) of 5 pages
each (but the last SR is only 3 pages)
– Pass 1: ⌈22 / 4⌉ = 6 sorted runs of 20 pages each (but
the last SR is only 8 pages)
– Pass 2: 2 SRs: 80 pages and 28 pages
– Pass 3: Final product: A sorted file of 108 pages.
 We’ll fill in the next slide with more details
including a diagram. →
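The pass-by-pass arithmetic in this example can be reproduced with a short script (a sketch; the function name is ours):

```python
import math

def runs_per_pass(N, B):
    """Return the number of sorted runs remaining after each pass of GMEMS."""
    counts = [math.ceil(N / B)]      # Pass 0: runs of B pages each
    while counts[-1] > 1:            # Later passes: (B-1)-way merges
        counts.append(math.ceil(counts[-1] / (B - 1)))
    return counts

print(runs_per_pass(108, 5))  # [22, 6, 2, 1] -> 4 passes overall
```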

12
GMEMS Example (cont.)

13
Two-Phase Multiway Mergesort
(2PMMS)—i.e., GMEMS in 2 Passes
 If B is sufficiently large, then most cases of General
Multiway External Mergesort can be completed in
only 2 passes!
– We want to use as much memory as we can get, up to
the size of the file (which would be ideal because then
we can finish in just one pass).
– Nevertheless, even a 2-pass scenario has major
performance benefits.
 Compare this to the 3-buffer case, which requires too many
passes!
 We call the 2-pass version of GMEMS: Two-Phase
Multiway Mergesort (2PMMS).
14
Number of Passes Required for GMEMS
with B Buffer Pages and N Data Pages
N B=3 B=5 B=9 B=17 B=129 B=257
100 7 4 3 2 1 1
1,000 10 5 4 3 2 2
10,000 13 7 5 4 2 2
100,000 17 9 6 5 3 3
1,000,000 20 10 7 5 3 3
10,000,000 23 12 8 6 4 3
100,000,000 26 14 9 7 4 4
1,000,000,000 30 15 10 8 5 4
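The table's entries can be spot-checked with a short function (a sketch; the name is ours): one sort pass producing ⌈N/B⌉ runs, then merge passes of fan-in B − 1 until one run remains.

```python
import math

def num_passes(N, B):
    """Total GMEMS passes: 1 sort pass + merge passes of fan-in B-1."""
    runs = math.ceil(N / B)
    passes = 1                       # Pass 0 (the sort phase)
    while runs > 1:
        runs = math.ceil(runs / (B - 1))
        passes += 1
    return passes

# Spot-check a few table entries:
print(num_passes(100, 3), num_passes(10**7, 5), num_passes(10**9, 257))
# 7 12 4
```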
15
Sorting Large Files in Only 2 Passes?
Given 40 records/page and 12,800 4K pages for the buffer pool:
How many records (max.) can you sort with 2PMMS?

External mergesort runs in “O(N)” time for “most” big files that IT
shops work with.
What would you do if there were even more records?
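One way to bound the answer (a sketch of the standard 2PMMS capacity argument, not worked out on the slide): Phase 1 produces sorted runs of B pages each, and Phase 2 can merge up to B − 1 of them in a single pass, so at most B × (B − 1) ≈ B² pages fit in two passes.

```python
B = 12_800                  # buffer pages (4 KB each, 50 MB total)
records_per_page = 40

max_pages = B * (B - 1)     # up to B-1 runs of B pages each
max_records = max_pages * records_per_page
print(f"{max_pages:,} pages = {max_records:,} records")
# 163,827,200 pages = 6,553,088,000 records (~625 GB at 4 KB/page)
```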

16
Improving the Running Time of
External Mergesort
Here are some techniques that can make secondary
memory algorithms more efficient:

1. Use the cylindrification principle. Usually:
a) Best choice: k cylinders at a time (k > 1)
b) Next best choice: 1 cylinder at a time
c) Next best choice: k tracks at a time
d) Next best choice: 1 track at a time
e) Next best choice: k pages at a time
f) Worst choice: 1 page at a time

17
Improving the Running Time of
External Mergesort (cont.)
2. Instead of one big disk that stores the file → use several
smaller disks (e.g., data striping) in parallel

3. Mirrored disks = multiple copies of the same data, or a
RAID redundancy scheme

4. Disk scheduling: e.g., the Elevator Algorithm
18


Improving the Running Time of
External Mergesort (cont.)
5. Use Double Buffering (see diagram on
next page)
– General idea: While the CPU is processing
(i.e., sorting and/or merging) using half of
memory, the other half of memory is being
refilled in the background (asynchronously),
so that when one input block runs out, its
successor is (hopefully!) already in memory.
– There are trade-offs to consider. I/O time
usually greatly exceeds CPU time, and it
depends on memory size, disk location of
the blocks, which blocks run out first, disk
contention, etc.
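A minimal sketch of the idea in Python, using a background thread as the asynchronous refill and a one-slot queue as the shadow buffer (the names and the simulated read are ours, not from the slides):

```python
import queue
import threading
import time

def read_block(run, i):
    """Stand-in for a disk read of block i of a sorted run (simulated I/O)."""
    time.sleep(0.01)                  # pretend this is disk latency
    return list(range(i * 4, i * 4 + 4))

def double_buffered_blocks(run, n_blocks):
    """Yield blocks while the next one is prefetched in the background."""
    q = queue.Queue(maxsize=1)        # the 'shadow' buffer slot

    def prefetch():
        for i in range(n_blocks):
            q.put(read_block(run, i))  # refill happens asynchronously

    threading.Thread(target=prefetch, daemon=True).start()
    for _ in range(n_blocks):
        yield q.get()                  # consume current; next fills behind us

merged = [x for block in double_buffered_blocks("SR1", 3) for x in block]
print(merged)  # [0, 1, 2, ..., 11]
```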

19
Double Buffering (Prefetching)
 To reduce wait time for an I/O request to complete,
we can prefetch data into “shadow blocks”.
– Potentially, more passes; but, in practice, many
files can still be sorted in 2 passes

[Diagram: Disk feeds k input buffers INPUT 1, ..., INPUT k, each paired with a shadow block INPUT 1', ..., INPUT k'; the OUTPUT buffer is paired with a shadow block OUTPUT' and is written back to Disk.]

B main memory buffers; k-way merge


20
Analysis of Naïve Implementation
We’ll use the same “Megatron 747” disk drive (instructive
toy example) from our unit on Disks earlier in the term.
 We want to sort the records by some field(s).
– We might do this as a step before bulk loading records in
clustering order.
– Note that external sorting is extremely common for many kinds of
large files, and they don’t have to involve databases.
 Disk geometry (from earlier in the term):
– Suppose the average access time is 15.49 ms.
 In our examples, we’ll use 10 ms for average seek time, 5 ms for avg. rotational
latency, and (a very generous) 0.49 ms/page for the transfer time.
– Suppose 1 cyl. = 1 MB, and suppose there are 256 pages/cyl.
 Yes, this is very small; but, it’s good enough for our example.

21
Analysis (cont.)
 Assume pages are stored at random.
– This is unlikely to be the case in general, but it will give us an
appreciation for how time-consuming a pass can be.
 Suppose we have 10M (10,000,000) records of 100 bytes,
that is, approximately a 1 GB file size.
– We’ll store the file on our Megatron 747 disk, with a 4 KB block or
page size, with each page holding 40 records of 100 bytes.
– The entire file consumes 250,000 4K pages. How did we get this?

 This equates to 977 cylinders. How did we get this?
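Both "How did we get this?" figures follow from the given geometry (a sketch of the arithmetic):

```python
import math

records = 10_000_000
records_per_page = 40        # 4 KB page / 100-byte records
pages_per_cyl = 256          # 1 MB cylinder / 4 KB pages

pages = records // records_per_page             # 250,000 pages
cylinders = math.ceil(pages / pages_per_cyl)    # 977 cylinders
print(pages, cylinders)  # 250000 977
```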

22
Analysis (cont.)
 Naïve case for 2PMMS:
– 1,000,000 page I/Os * 15.49 ms/page = 15,490 seconds = 258.17
minutes = 4.3 hours!
 Let us consider a more intelligent approach to reduce
the number of seeks, namely, cylindrification.
 Suppose 50 MB of main memory is available.
– This is 12,800 blocks of size 4K = about 1/20th of the file. How
did we get 12,800?
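The 4.3-hour figure and the 12,800-block buffer both fall out of simple arithmetic (a sketch): 2PMMS makes two passes, each a full read plus a full write of the file.

```python
pages_in_file = 250_000
io_per_pass = 2 * pages_in_file      # one full read + one full write
total_ios = 2 * io_per_pass          # 2 passes -> 1,000,000 page I/Os
avg_access_ms = 15.49                # random placement: full access per page

seconds = total_ios * avg_access_ms / 1000
print(f"{total_ios:,} I/Os -> {seconds:,.0f} s = {seconds / 3600:.1f} hours")
# 1,000,000 I/Os -> 15,490 s = 4.3 hours

buffer_pages = 50 * 2**20 // (4 * 2**10)   # 50 MB / 4 KB = 12,800 blocks
print(buffer_pages)  # 12800
```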

23
Cylindrification Example
When reading or writing blocks in a known order, place them together on
the same cylinder. Once we reach that cylinder (with one seek), we can read
or write page after page, with negligible rotational latency, and no extra
seeks (unless we need > 1 cylinder for the given sorted run).

Application to Phase 1 of 2PMMS:

1. Initially, (unsorted) records are on 977 consecutive cylinders.

2. Load main memory with (unsorted) data from 50 consecutive
cylinders. We will assume that a disk request completes
uninterrupted after the start of reading or writing the request.

24
Cylindrification Example (cont.)
Application to Phase 1 of 2PMMS (cont.):
 The order of blocks read is unimportant, so the only time besides
transfer time is: one random seek and 49 1-cylinder seeks (each of
minimal seek time).

 The time to transfer 12,800 pages (50 MB) at 0.49 ms/page = 6.33
sec. Note that the last sorted run (SR) may be shorter.

3. Write each SR onto 50 consecutive cylinders.
 Is this realistic?
 The write time is also about 6.33 sec/SR.

4. Repeat for the remaining input sorted runs.
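Phase 1's per-run time is dominated by the transfer. A sketch of the arithmetic (the 1 ms figure for a 1-cylinder seek is our assumption; it is consistent with the slide's ≈ 6.33 s total):

```python
transfer_ms = 12_800 * 0.49   # 50 MB of pages at 0.49 ms/page
random_seek_ms = 10           # one average seek to reach the first cylinder
track_seek_ms = 1             # assumed minimal seek to a neighbouring cylinder

total_ms = transfer_ms + random_seek_ms + 49 * track_seek_ms
print(f"{total_ms / 1000:.2f} s per 50-cylinder sorted run")  # 6.33 s
```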

25
Cylindrification Example (cont.)
Phase 1 and Phase 2 Details
Phase 1 Calculations: See page 1 of the separate PDF download to
save yourself a lot of writing.

In Phase 2: The blocks are read from the 20 SRs in a data dependent
order, so we can’t gain much of an advantage in terms of
cylinder seeks.

Why 20 SRs? There are 977 cylinders of data.

Phase 2 Calculations: See page 2 of the separate PDF download to
save yourself a lot of writing.
26
Summary
 External sorting is very important.
 The DBMS may dedicate part of its buffer pool to sorting.
 External Mergesort minimizes disk I/O cost:
– Pass 0: Produces sorted runs of size B (# of buffer pages). In later
passes, we merge the runs.
– # of runs merged at a time depends on B and on the block size
 A larger block size means less disk access cost per page, when
amortizing seek and rotational latency.
 A larger block size means a smaller number of runs is merged.
– In practice, the number of passes is rarely more than two.
 In practice, External Mergesort runs in O(N) time, where
N is the number of pages.

27
