You are on page 1of 10

Sorting

CS 102
File Structures & File Organizations

Sorting arranging the items in a list in ascending or descending order by a key value. Applicable for all file organizations, not just sequential Why sort ? to make a report, to merge files in queries, to merge files in master file maintenance, to make searches easier, to prioritize, etc.
CJD

Chapter 05

External Sorting Algorithms

Internal vs External Sorts


Internal Sort sorting items entirely in main memory ICS 2, ICS 3, CS 101 External Sort sorting files in secondary storage using main memory CS 102 Why external sort ?
Some files may be too large to fit in main memory

Some Terminologies
A Pass an iteration that goes through the items (or records) of a list (or file) once to include reading it from file, processing it in main memory and writing it to file. A Run a grouping of some items of a list. Usually a run starts as a block of records but eventually increases in size. Size of a Run the number of items in a run. Usually no less than the blocking factor. A Merge combining lists into one
CJD CJD

The Algorithms
External Sort Algorithms 2-way Sort Merge Balanced 2-way Sort Merge Balanced k-way Sort Merge Polyphase Sort Merge Overview :

2-Way Sort Merge


A simple 2-way Sort Merge repeatedly merges 2 smaller sorted components of a file into a sorted bigger component of the file. The algorithm Phase 1 : The Sort Phase Phase 2 : The Merge Phase

CJD

CJD

The Sort Phase


Phase 1 : The Sort Phase Divide the records of a file into several runs, internal sort the records in a run, and distribute the runs evenly to two external files file_1 and file_2

The Merge Phase


Phase 2 : The Merge Phase For each pair of runs, one from file_1 and another from file_2,
merge the pair resulting in a longer run. Store the new resulting run in a third external file file_3

Redistribute the runs evenly in file_3 to file_1 and file_2 Repeat Phase 2 until all records are in one long run.
CJD CJD

Tips for Efficiency


As much sorting in main memory must be performed using internal sort because file accesses are slower than main memory accesses. The size of a run must be as large as available space in main memory, limited by other data that must also be in main memory. Each file must be on a separate device (such as tapes or disks) to allow easy access during the merge phase. The original file and file_3 may be assigned to the same device. The output will be in file_3.
CJD

Algorithm Simulation (1)


File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 File Size = 15 records Size of Run = 3 initially (Usually a large number blocking factor) Number of Runs = 5 initially Sort Phase : Pass 1: Group records into size of run 50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80 3 records are in main memory at a time Do an internal sort and distribute File 1: 50 95 110 40 120 153 22 80 140 File 2: 10 36 100 60 70 130 File 3: empty
CJD

Algorithm Simulation (2)


File 1: 50 95 110 40 120 153 22 80 140 File 2: 10 36 100 60 70 130 File 3: empty Merge Phase : Pass 2: Merge : 2 sets of 3 records are in main memory at a time File 1: empty File 2: empty File 3: 10 36 50 95 100 110 40 60 70 120 130 153 22 80 140 Pass 3: Distribute: 3 records are in main memory at a time File 1: 10 36 50 95 100 110 22 80 140 File 2: 40 60 70 120 130 153 File 3: empty
CJD

Algorithm Simulation (3)


File 1: 10 36 50 95 100 110 22 80 140 File 2: 40 60 70 120 130 153 File 3: empty Pass 4: Merge : 2 sets of 3 records are in main memory at a time File 1: empty File 2: empty File 3: 10 36 40 50 60 70 95 100 110 120 130 153 22 80 140 Pass 5: Distribute: 3 records are in main memory at a time File 1: 10 36 40 50 60 70 95 100 110 120 130 153 File 2: 22 80 140 File 3: empty
CJD

Algorithm Simulation (4)


File 1: 10 36 40 50 60 70 95 100 110 120 130 153 File 2: 22 80 140 File 3: empty Pass 6: 2 sets of 3 records are in main memory at a time File 1: empty File 2: empty File 3: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153 Number of passes = 1 Sort Pass + 5 Merge Passes Total Number of passes = 6

Algorithm Analysis (1)


Amount of space occupied Requires 3 files Ease in the implementation of the algorithm Straightforward Awkward in having separate merge and redistribution steps in merge phase The redistribution increases the number of passes

CJD

CJD

Algorithm Analysis (2)


Speed of the algorithm
Let NR be the number of runs initially. If NR = 1, Total Passes = 2 (one sort phase and one trivial merge phase) Suppose NR > 1. Total Passes = log2 NR * 2

Algorithm Analysis (3)


Assume NR is a power of 2. That is NR = 2 j Each iteration is composed of a distribution step and a merge step. It divides the number of runs by 2 Until there is only 1 run. The number of divisions to go from 2 j to 1 is j. So the number of iterations is j = log2 NR If NR is not a power of 2, the number of iterations is log2 NR

CJD

CJD

Algorithm Analysis (4)


1 2 3 4 5 6 7 8 Merge 1

Balanced 2-Way Sort Merge


Overview : The Balanced 2-Way Sort Merge improves the 2-Way Sort merge. It combines the merge and distribution steps. It requires 4 files. The algorithm Phase 1 : The Sort Phase Phase 2 : The Merge Phase

Merge 2

Merge 3

Each iteration requires 2 passes : to distribute and to merge So Total Passes = log2 NR * 2 When NR=5, algorithm requires 6 passes
CJD

CJD

The Sort Phase


Phase 1 : The Sort Phase Divide the records of a file into several runs, internal sort the records in a run, and distribute the runs evenly to two external files file_1 and file_2

The Merge Phase


Phase 2 : The Merge Phase For each pair of runs, one from file_1 and another from file_2,
merge the pairs resulting in longer runs. alternately store the resulting runs in external files file_3 and file_4

Repeat Phase 2 until all records are in one long run. Alternate the roles of file_1 and file_2 with file_3 and file_4 depending on which files need to be merged and which would hold the redistributed resultant longer runs.
CJD CJD

End of Algorithm
Copy or assign the file that contains the one long run to the desired output file. It is called balanced because in each iteration, the number of input files used is equal to the number of output files used.

Algorithm Simulation (1)


File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 File Size = 15 records Size of Run = 3 initially Number of Runs = 5 initially Sort Phase : Pass 1 50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80 3 records are in main memory at a time do an internal sort and distribute File 1: 50 95 110 40 120 153 22 80 140 File 2: 10 36 100 60 70 130 File 3: empty File 4 : empty
CJD CJD

Algorithm Simulation (2)


Merge Phase : 2 sets of 3 records are in main memory at a time merge and distribute Pass 2 File 1, File 2: empty File 3: 10 36 50 95 100 110 22 80 140 File 4: 40 60 70 120 130 153 Pass 3 File 1: 10 36 40 50 60 70 95 100 110 120 130 153 File 2: 22 80 140 File 3, File 4: empty Pass 4 File 1, File 2, File 4: empty File 3: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153
CJD

Algorithm Analysis (1)


Amount of space occupied Requires 4 files (instead of 3) Ease in the implementation of the algorithm Straightforward implementation Requires an additional file Easy to combine merge and redistribution steps in merge phase Reduces the number of passes

CJD

Algorithm Analysis (2)


Speed of the algorithm
Let NR be the number of runs initially If NR = 1, Total Passes = 1 (Sort Phase only) Suppose NR > 1. Total Passes = log2 NR + 1

Algorithm Analysis (3)


Assume NR is a power of 2. That is NR = 2 j The sort phase takes 1 pass: sorts each run, but does not reduce the number of runs. Each execution of merge phase is composed of a merge step and a distribution step. It divides the number of runs by 2 Until there is only 1 run. The number of divisions to go from 2 j to 1 is j. So the number of merges is j = log2 NR And the total number of passes is j + 1 = (log2 NR) + 1, including the one for sort phase If NR is not a power of 2, the number of passes is log2 NR + 1 When NR=5, requires 4 passes instead of 6
CJD CJD

Balanced k-Way Sort Merge


Overview : The Balanced k-Way Sort Merge is an improvement of Balanced 2-way sort merge. Instead of merging 2 files at a time, merge k files at a time, k 2 Requires 2k files Each pass results in fewer number of runs compared to each pass of balanced 2-way sort merge.

The Sort Phase


Phase 1 : The Sort Phase Divide the records of a file into several runs, internal sort the records in a run, and distribute the runs alternately to k external files

CJD

CJD

The Merge Phase


Phase 2 : The Merge Phase
For each k-tuple of runs, one from each file with the distribution of runs, merge the k-tuples resulting in longer runs. alternately distribute the resulting runs in the other k external files Repeat Phase 2 until all records are in one long run. Alternate the roles of the first k files and the second k files depending on which files need to be merged and which would hold the redistributed resultant longer runs.

Algorithm Simulation (1)


File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 File Size = 15 records; Size of Run = 3 initially Number of Runs = 5 initially k=3 Sort Phase : Pass 1 50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80 3 records are in main memory at a time do an internal sort and distribute File 1: 50 95 110 60 70 130 File 2: 10 36 100 22 80 140 File 3: 40 120 153 File 4, File 5, File 6 : empty
CJD CJD

Copy or assign the file that contains the one long run to the desired output file.

Algorithm Simulation (2)


File 1: 50 95 110 60 70 130 File 2: 10 36 100 22 80 140 File 3: 40 120 153 File 4, File 5, File 6 : empty Merge Phase : 3 sets of 3 records are in main memory at a time Pass 2 File 1, File 2, File 3: empty File 4 : 10 36 40 50 95 100 110 120 153 File 5 : 22 60 70 80 130 140 File 6 : empty Pass 3 File 1: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153 File 2, File 3, File 4, File 5, File 6: empty
CJD

Algorithm Analysis (1)


Amount of space occupied k input files (File_1, File_2, , File_k) k output files (File_k+1, File_k+2, , File 2k) 2k = total no. of files needed Ease in the implementation of the algorithm More complicated than a (balanced) 2-way sort merge Requires more files Merging becomes more complicated
CJD

Algorithm Analysis (2)


Speed of the algorithm
Let NR be the number of runs initially If NR = 1, Total Passes = 1 (Sort Phase only) Suppose NR > 1. Total Passes = logk NR + 1

Algorithm Analysis (3)


Assume NR is a power of k. That is NR = kj The sort phase takes 1 pass: sorts each run, but does not reduce the number of runs. Each execution of merge phase is composed of a merge step and a distribution step. It divides the number of runs by 2 Until there is only 1 run. The number of divisions to go from k j to 1 is j. So the number of merges is j = logk NR

CJD

CJD

Algorithm Analysis (4)


1 2 3 4 5 6 7 8 9 Merge 1

Exercise 1
Fill in the following table with NR = 100 NR = 100 2-Way Sort Merge 3 14 Balanced 2-Way Balanced 3-Way Balanced 4-Way

Merge 2

No. of Files Used And the total number of passes is j + 1 = (logk NR) + 1, including the one for sort phase If NR is not a power of 2, the number of passes is logk NR + 1 When NR=5 and k=3, requires 3 passes instead of 4 (Balanced 2-way)
CJD

4 8

6 6

8 5

Total No. of Passes

Question: What conclusion/s can you draw based on the above table.

CJD

More Exercises
Exercise 2: Using Balanced 3-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 3. Determine the total number of passes. File : 28 17 79 38 5 70 24 91 37 3 19 63 15 44 8 Exercise 3: Using Balanced 3-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 4. Determine the total number of passes. File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 Exercise 4: Using Balanced 4-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 3. Determine the total number of passes. File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
CJD

Challenges
What if each sorted run from the sort phase is distributed to a separate file and all such files are merged into one output file. What are the implications ? What factors make this approach possible? impossible? There are main memory and number of file devices limitations How do you implement a k-way merge efficiently if k > 2 ? If k is large, use priority queues CS101 (or an advanced CS101 course) The realistic sort/merge situation is somewhere between the basic balanced two-way sort merge, and the idealistic balanced k-way sort/merge, which uses k input files for k runs and merges to one output file.
CJD

Polyphase Sort Merge


Overview : Balanced 2-way sort/merge closer analysis :
Suppose the number of runs in a pass does not divide evenly among the 2 files. The last pair of runs to be merged is trivial (i.e. a copy) as one of the files is already empty. Possible improvement : reduce this trivial merge Another possible improvement : reduce the number of files required For a balanced k-way sort/merge, uneven distribution cause merging of less than k runs.

Polyphase Improvements
In the merge phase, do not distribute the merges into several files just send them to a (k+1)st file. When an input file becomes empty, discontinue the previous merge phase. Instead, merge the (k-1) possibly non-empty file(s) with the (k+1)st file into the empty file. Perform this repeatedly each time a file becomes empty, until there is only one nonempty file containing one long sorted run. The name Polyphase is attributed to the many phases of the merging process to sort the records.
CJD

CJD

The Sort Phase


Phase 1 : The Sort Phase Divide the records of a file into several runs, internal sort the records in a run, and distribute the runs to k external files using a distribution special to Polyphase Sort Merge.

The Merge Phase


Phase 2 : The Merge Phase while not all records are in one long run
repeat merge the next k-tuple of runs, one from each of the k input files, into the output (k+1)st file until an input file becomes empty reassign the empty file as the next output file and the other k files as input files.

Copy or assign the file that contains the one long run to the desired output file.
CJD CJD

Algorithm Simulation (1)


File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80 File Size = 15 records Size of Run = 3 initially Number of Runs = 5 initially

Algorithm Simulation (2)


Merge Phase : Pass 2 File 1, File 2: empty File 3: 10 36 50 95 100 110 22 80 140 (trivial merge : a copy) File 4: 40 60 70 120 130 153 Pass 3 File 1: 10 36 40 50 60 70 95 100 110 120 130 153 File 2: 22 80 140 (another trivial merge : a copy) File 3, File 4: empty Pass 4 File 1, File 2: empty File 3: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153 File 4: empty
CJD CJD

Recall : With a balanced 2-way sort merge :


Sort Phase : Pass 1 50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80 File 1: 50 95 110 40 120 153 22 80 140 File 2: 10 36 100 60 70 130 File 3, File 4: empty

Algorithm Simulation (3)


With Polyphase sort merge :
Sort Phase : 50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80 (Assume 1 block = 3 records) (after 5 blocks passed) File 1: 50 95 110 40 120 153 22 80 140 File 2: 10 36 100 60 70 130 File 3: empty

Algorithm Simulation (4)


Merge Phase :
(after 2 run merges, 4 more blocks passed = 9 total blocks passed)

File 1: 22 80 140 File 2: empty File 3: 10 36 50 95 100 110 40 60 70 120 130 153
(after 1 run merge, 3 more blocks passed = 12 total blocks passed)

File 1: empty File 2: 10 22 36 50 80 95 100 110 140 File 3: 40 60 70 120 130 153
(after 1 run merge, 5 more blocks passed = 17 total blocks passed)

File 1: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153 File 2: empty File 3: empty
CJD CJD

Algorithm Simulation Summary


17 total blocks / 5 blocks per file pass = 3.4 passes with Polyphase 2-way Sort Merge compared to 4 passes with balanced 2-way sort merge Summary Number of Runs in File 1 Sort Phase Merge Phase 1 Merge Phase 2 Merge Phase 3 3 1 0 1 File 2 2 0 1 0 File 3 0 2 1 0
CJD

Another Run Distribution


Consider Polyphase 3-way sort merge with NR = 17 and distribute the initial runs 7-6-4 into files 1 to 3.
Summary Sort Phase Merge Phase 1 Merge Phase 2 Merge Phase 3 Merge Phase 4 Level 4 3 2 1 0 Number of Number of Runs in merges File 1 File 2 File 3 File 4 0 7 6 4 0 4 3 2 0 4 2 1 0 2 2 1 1 0 1 1 0 1 0 1 0

Observe that by distributing the initial runs 17=7-6-4, at most only one file becomes empty after each merge, except the last. This is because 17 has a perfect 3rd order Fibonacci distribution of 7-6-4.
CJD

Fibonacci Sequence
2nd order Fibonacci sequence
F
( 2) 0

Polyphase Fibonacci Distrib (ex1)


In first example, NR = 5 For k=2, using 3 files = 2 input files + 1 output file The recommended distribution of NR=5 over two input files is to use 2 nd order Fibonacci distribution.
Summary Sort Phase Merge Phase 1 Merge Phase 2 Merge Phase 3 Level (n)
3 2 1 0

= 0,

F1( 2 ) = 1,

) Fn( 2 ) = Fn(21) + Fn(22 , {Fi ( 2 ) } = {0,1,1,2,3,5,8,13,21,...}

3rd order Fibonacci sequence


) 3) F0( 3) = 0, F1(3 ) = 0, F2( 3) = 1, Fn(3 ) = Fn(31) + Fn(32 + Fn(3 , {Fi (3 ) } = {0,0,1,1,2,4,7,13,24, 44...}

Number of merges
0 2 1 1

Number of Runs in File 1


3 1 0 1

File 2
2 0 1 0

File 3
0 2 1 0

kth order Fibonacci sequence


Fi ( k ) = 0, i < k 1,
k k Fk(1) = 1, Fn( k ) = Fn(1) + Fn(k2) + L + Fn(kk)

The ith largest file on the nth level (n>0) initially contains the following number of runs :
k k k Fn(+ k) 2 + Fn(+ k) 3 + L + Fn(+ i) 2
CJD

the number of runs on ith largest file on the nth level is : Fn(+kk) 2 + Fn(+kk) 3 + L + Fn(+ki)2 = Fn( 2) + L + Fn(+2i)2 which means largest is Fn+Fn-1 and 2 nd largest is Fn.

CJD

Polyphase Fibonacci Distrib (ex2)


In second example, NR=17 and k=3, the number of runs on the ith largest file on the nth level (n>0) initially contains

Imperfect NR
If NR is not perfect, add dummy runs to make it perfect. This is done during the Sort Phase. Where? Some say distribute them either at the end or beginning of each file

F
level (n) 5 4 3 2 1 0

(k ) n+k 2

+F

(k ) n+ k 3

+L+ F

(k ) n+i 2

Largest File (i=1) 13=7+4+2 7=4+2+1 4=2+1+1 2=1+1+0 1=1+0+0 1

2nd Largest File (i=2) 11=7+4 6=4+2 3=2+1 2=1+1 1=1+0 0

3rd largest File (i=3) 7=7 4=4 2=2 1=1 1=1 0

Runs (perfect run sizes for k=3) 31=13+11+7 17=7+6+4 9=4+3+2 5=2+2+1 3=1+1+1 1

Exercise : build the table for k=4 from level 1 to level 5


CJD CJD

Perfect Fibonacci Numbers


Let n = level number an, bn, cn = decreasing order of sizes of non-empty files at level n for k=3,
3) c n = Fn(+1 3 bn = Fn(+1) + Fn(3) 3 3 3) an = Fn(+1) + Fn(3) + Fn(1) = Fn(+2

Perfect Polyphase Distributions


level (n) n n+1 0 1 2 3 4 5 n+1
CJD

Largest 2nd 3rd 4th File File Largest Largest (Empty (i=1) File (i=2) File (i=3) File) an bn cn dn an+bn 1 1 2 4 7 13 an+bn an+cn 0 1 2 3 6 11 an+cn an 0 1 1 2 4 7 an+dn 0 0 0 0 0 0 0 0

Runs (perfect run sizes for k=3) tn tn+2an 1 3 5 9 17 31 tn+2an


CJD

t n = an + bn + c n
3) c n +1 = Fn(+ 2 = a n + 0

3) 3 bn +1 = Fn(+2 + Fn(+1) = an + cn
3) 3 an +1 = Fn(+ 2 + Fn(+1) + Fn( 3) = an + bn

t n +1 = an +1 + bn +1 + c n +1 = 3a n + bn + cn = (an + bn + c n ) + 2an = t n + 2a n

Algorithm Analysis (1)


Amount of space occupied
k + 1 files for polyphase k-way sort merge vs 2k files for balanced k-way sort merge

Algorithm Analysis (2)


Speed of the algorithm
According to theoretical computations by Knuth, Note : logk x = ln x / ln k
Number of Passes for NR = 100 Polyphase Balanced k-way k-way 7.90 8 5.66 6 4.88 5 4.54 4 4.01 4

Ease in the implementation of the algorithm


Polyphase adds more complication to the balanced kway algorithm. The sort phase must distribute the runs according to the Fibonacci perfect distribution, adding dummy runs when necessary. Each merge phase iteration may not run its full course due to some files becoming empty, thereby making it a little more difficult to trace the algorithm.
CJD

k 2 3 4 5 8

Polyphase k-way
1.50 1.02 0.86 0.80 0.73 ln NR ln NR ln NR ln NR ln NR + + + + + 0.99 0.96 0.92 0.86 0.65

Balanced k-way log2 NR + 1 log3 NR + 1 log4 NR + 1 log5 NR + 1 log8 NR + 1

CJD

Algorithm Analysis (3)


Speed of the algorithm
Runs best when NR has a perfect kth order Fibonacci distribution For small k, Polyphase Sort Merge performs better than Balanced k-Way Sort Merge Why ? Because trivial copies are minimized. For large k, Balanced k-Way Sort Merge beats Polyphase Sort Merge. Why ? Because with more files, the merged runs are distributed to more files.

Summary of Analyses
Comparison :
Algorithm Space Time (# of (# of Passes) Files) 3 2 * log2 NR 4 1 + log2 NR 2k 1 + logk NR 3 4 5 1.50 ln NR + 0.99 1.02 ln NR + 0.96 0.86 ln NR + 0.92 # of Passes (NR=100) 14 8 (6 if k=3; 5 if k=4) 7.90 5.66 4.88

2-way Balanced 2-way Balanced k-way Polyphase 2-way Polyphase 3-way Polyphase 4-way

CJD

CJD

Impact of Devices
Device Impact on External Sorts
The sort time is of course highly influenced by the secondary storage device being used. Tapes require to be rewound between passes. On disk, all files may reside on the same disk but has more overhead because of seek time and latency time as the head(s) switch from file to file. If possible, store the files on separate disks. This allows I/O to overlap and run in parallel. If a disk is dedicated to a file, it will reduce seek time and latency time. Further complications arise in a multi-user environment.
CJD

End

CJD

You might also like