CS102-05 External Sort
CS 102
File Structures & File Organizations
Sorting: arranging the items in a list in ascending or descending order by a key value. It applies to all file organizations, not just sequential.
Why sort? To make a report, to merge files in queries, to merge files in master file maintenance, to make searches easier, to prioritize, etc.
CJD
Chapter 05
Some Terminologies
A Pass: an iteration that goes through the items (or records) of a list (or file) once, including reading it from file, processing it in main memory, and writing it back to file.
A Run: a grouping of some items of a list. A run usually starts as a block of records and grows in size as runs are merged.
Size of a Run: the number of items in a run, usually no less than the blocking factor.
A Merge: combining lists into one.
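To make the terms concrete, here is a minimal sketch of how initial runs might be formed. The function name `make_runs` and the sample keys are hypothetical, and plain Python lists stand in for files and blocks.

```python
def make_runs(records, run_size):
    """Split records into runs of at most run_size items, each sorted in memory."""
    runs = []
    for start in range(0, len(records), run_size):
        block = records[start:start + run_size]  # read one block into main memory
        runs.append(sorted(block))               # sort it internally to form one run
    return runs

records = [9, 4, 7, 1, 8, 2, 6]  # hypothetical key values
runs = make_runs(records, 3)
print(runs)       # [[4, 7, 9], [1, 2, 8], [6]]
print(len(runs))  # 3 runs, so NR = 3
```

Going through all the records once to produce these runs counts as one pass.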
The Algorithms
External Sort Algorithms (overview):
2-way Sort Merge
Balanced 2-way Sort Merge
Balanced k-way Sort Merge
Polyphase Sort Merge
Redistribute the runs in file_3 evenly to file_1 and file_2. Repeat Phase 2 until all records are in one long run.
Each iteration requires 2 passes: one to distribute and one to merge. So Total Passes = 2 * ceil(log2 NR), where NR is the number of initial runs. When NR = 5, the algorithm requires 2 * 3 = 6 passes.
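The pass count can be checked with a small simulation. This is a sketch only: lists of runs stand in for file_1, file_2 and file_3, the function name `two_way_sort_merge` and the sample runs are hypothetical, and the first distribute pass is assumed to coincide with the sort phase.

```python
import heapq

def two_way_sort_merge(runs):
    """Simulate 2-way sort merge; returns (sorted_records, passes)."""
    file_3 = [list(r) for r in runs]
    passes = 0
    while len(file_3) > 1:
        # Distribute pass: deal the runs alternately to file_1 and file_2.
        file_1, file_2 = file_3[0::2], file_3[1::2]
        passes += 1
        # Merge pass: merge the i-th runs of file_1 and file_2 into file_3.
        merged = []
        for i, run in enumerate(file_1):
            partner = file_2[i] if i < len(file_2) else []
            merged.append(list(heapq.merge(run, partner)))
        file_3 = merged
        passes += 1
    result = file_3[0] if file_3 else []
    return result, passes

runs = [[4, 7, 9], [1, 2, 8], [3, 6], [5, 10], [0, 11]]  # NR = 5
records, passes = two_way_sort_merge(runs)
print(records)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(passes)   # 6
```

With NR = 5 the number of runs shrinks 5, 3, 2, 1 over three iterations of two passes each.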
Repeat Phase 2 until all records are in one long run. Alternate the roles of file_1 and file_2 with those of file_3 and file_4, depending on which files need to be merged and which will hold the redistributed, longer resulting runs.
End of Algorithm
Copy or assign the file that contains the one long run to the desired output file. The algorithm is called balanced because, in each iteration, the number of input files used equals the number of output files used.
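A balanced merge can be sketched as follows, under the assumption that runs are dealt round-robin onto the files. The function name `balanced_merge_sort` and the parameter `k` are hypothetical; k = 2 corresponds to the four-file algorithm described here, and lists stand in for files.

```python
import heapq

def balanced_merge_sort(runs, k=2):
    """Simulate a balanced k-way sort merge; returns (sorted_records, passes)."""
    # Sort phase (1 pass): deal the initial runs round-robin onto k input files.
    inputs = [[] for _ in range(k)]
    for i, run in enumerate(runs):
        inputs[i % k].append(list(run))
    passes = 1
    total_runs = len(runs)
    while total_runs > 1:
        # Merge pass: merge the j-th run of every input file and write the
        # results round-robin onto k output files.
        outputs = [[] for _ in range(k)]
        rounds = max(len(f) for f in inputs)
        for j in range(rounds):
            group = [f[j] for f in inputs if j < len(f)]
            outputs[j % k].append(list(heapq.merge(*group)))
        inputs = outputs  # the output files become the next input files
        total_runs = rounds
        passes += 1
    result = inputs[0][0] if total_runs == 1 else []
    return result, passes

runs = [[4, 7, 9], [1, 2, 8], [3, 6], [5, 10], [0, 11]]  # NR = 5
print(balanced_merge_sort(runs, k=2)[1])  # 4
print(balanced_merge_sort(runs, k=3)[1])  # 3
```

No separate redistribution pass is needed because each merge already writes its output spread over the other set of files.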
Copy or assign the file that contains the one long run to the desired output file.
Exercise 1
Fill in the following table with NR = 100.

Algorithm           No. of Files Used   No. of Passes (NR = 100)
2-Way Sort Merge    3                   14
Balanced 2-Way      4                   8
Balanced 3-Way      6                   6
Balanced 4-Way      8                   5

For the balanced k-way sort merge, the total number of passes is j + 1 = (logk NR) + 1, including the one for the sort phase. If NR is not a power of k, the number of passes is ceil(logk NR) + 1. When NR = 5 and k = 3, the algorithm requires 3 passes instead of the 4 required by Balanced 2-way.
Question: What conclusion/s can you draw from the above table?
More Exercises
Exercise 2: Using the Balanced 3-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 3. Determine the total number of passes.
File : 28 17 79 38 5 70 24 91 37 3 19 63 15 44 8
Exercise 3: Using the Balanced 3-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 4. Determine the total number of passes.
File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
Exercise 4: Using the Balanced 4-way Sort Merge algorithm, sort the given master file with the following records. Assume that the size of the run is 3. Determine the total number of passes.
File : 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
Challenges
What if each sorted run from the sort phase is distributed to a separate file, and all such files are merged into one output file? What are the implications? What factors make this approach possible or impossible? Main memory and the number of file devices are the limiting factors.
How do you implement a k-way merge efficiently if k > 2? If k is large, use priority queues (covered in CS101 or an advanced CS101 course).
The realistic sort/merge situation lies somewhere between the basic balanced two-way sort merge and the idealistic balanced k-way sort/merge, which uses k input files for k runs and merges to one output file.
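The priority-queue idea can be sketched with Python's standard `heapq` module. The function name `k_way_merge` and the sample runs are hypothetical, and lists stand in for the k input files. The heap holds one record per run, so choosing the next output record costs O(log k) rather than O(k).

```python
import heapq

_END = object()  # sentinel marking an exhausted run

def k_way_merge(runs):
    """Merge k sorted runs into one sorted list using a min-heap."""
    iters = [iter(r) for r in runs]
    heap = []
    for idx, it in enumerate(iters):
        first = next(it, _END)
        if first is not _END:
            heap.append((first, idx))  # (key, index of the run it came from)
    heapq.heapify(heap)
    out = []
    while heap:
        key, idx = heapq.heappop(heap)  # smallest key among the run heads
        out.append(key)
        nxt = next(iters[idx], _END)    # refill from the same run
        if nxt is not _END:
            heapq.heappush(heap, (nxt, idx))
    return out

print(k_way_merge([[10, 36, 50], [22, 80, 140], [40, 60, 70]]))
# [10, 22, 36, 40, 50, 60, 70, 80, 140]
```

The standard library's `heapq.merge` offers the same behaviour lazily; the explicit loop is shown only to expose the mechanism.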
Polyphase Improvements
In the merge phase, do not distribute the merged runs over several files; just send them to a (k+1)st file. When an input file becomes empty, discontinue the current merge phase. Instead, merge the (k-1) possibly non-empty file(s) together with the (k+1)st file into the empty file. Perform this repeatedly each time a file becomes empty, until there is only one nonempty file containing one long sorted run. The name Polyphase refers to the many phases of the merging process needed to sort the records.
Copy or assign the file that contains the one long run to the desired output file.
File 1: 22 80 140
File 2: empty
File 3: 10 36 50 95 100 110 40 60 70 120 130 153
(after 1 run merge, 3 more blocks passed = 12 total blocks passed)
File 1: empty
File 2: 10 22 36 50 80 95 100 110 140
File 3: 40 60 70 120 130 153
(after 1 run merge, 5 more blocks passed = 17 total blocks passed)
File 1: 10 22 36 40 50 60 70 80 95 100 110 120 130 140 153
File 2: empty
File 3: empty
Observe that by distributing the 17 initial runs as 7-6-4 (17 = 7 + 6 + 4), at most one file becomes empty after each merge phase, except the last. This is because 17 has a perfect 3rd order Fibonacci distribution of 7-6-4.
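This observation can be checked by tracking only the number of runs on each file. The sketch below assumes one file starts empty and that each phase merges until the smallest input file is exhausted; the function name `polyphase_phases` is hypothetical.

```python
def polyphase_phases(counts):
    """Return the run counts per file after each polyphase merge phase."""
    counts = list(counts)
    history = [tuple(counts)]
    while sum(counts) > 1:
        inputs = [i for i, c in enumerate(counts) if c > 0]
        if len(inputs) < 2 or 0 not in counts:
            raise ValueError("not a perfect distribution with one empty file")
        out = counts.index(0)                # the empty file receives the merged runs
        m = min(counts[i] for i in inputs)   # merges until the smallest input empties
        for i in inputs:
            counts[i] -= m
        counts[out] += m
        history.append(tuple(counts))
    return history

for state in polyphase_phases([7, 6, 4, 0]):
    print(state)
# (7, 6, 4, 0)
# (3, 2, 0, 4)
# (1, 0, 2, 2)
# (0, 1, 1, 1)
# (1, 0, 0, 0)
```

The totals go 17, 9, 5, 3, 1, and exactly one file is empty after every phase before the last.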
Fibonacci Sequence
2nd order Fibonacci sequence: $F^{(2)}_0 = 0$, $F^{(2)}_1 = 1$, and $F^{(2)}_n = F^{(2)}_{n-1} + F^{(2)}_{n-2}$ for $n \ge 2$.
Number of merges    0  2  1  1
File 2              2  0  1  0
File 3              0  2  1  0
The ith largest file on the nth level (n > 0) initially contains the following number of runs:
$F^{(k)}_{n+k-2} + F^{(k)}_{n+k-3} + \cdots + F^{(k)}_{n+i-2}$
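For k = 2, the number of runs on the ith largest file on the nth level is: $F^{(k)}_{n+k-2} + F^{(k)}_{n+k-3} + \cdots + F^{(k)}_{n+i-2} = F^{(2)}_{n} + \cdots + F^{(2)}_{n+i-2}$, which means the largest file holds $F^{(2)}_{n} + F^{(2)}_{n-1}$ runs and the 2nd largest holds $F^{(2)}_{n}$.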
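The formula can be evaluated directly. This sketch assumes the common convention that the kth order sequence starts with k-1 zeros followed by a 1, with each later term the sum of the previous k terms; for k = 2 this matches the initial values given above. The function names `fib_k` and `runs_on_file` are hypothetical.

```python
def fib_k(k, n):
    """k-th order Fibonacci number F_n, assuming F_0..F_{k-2} = 0 and F_{k-1} = 1."""
    seq = [0] * (k - 1) + [1]
    while len(seq) <= n:
        seq.append(sum(seq[-k:]))  # each term is the sum of the previous k terms
    return seq[n]

def runs_on_file(k, n, i):
    """Runs on the i-th largest file at level n: F_{n+k-2} + ... + F_{n+i-2}."""
    return sum(fib_k(k, j) for j in range(n + i - 2, n + k - 1))

print([runs_on_file(3, 4, i) for i in (1, 2, 3)])  # [7, 6, 4]
print([runs_on_file(2, 3, i) for i in (1, 2)])     # [3, 2]
```

For k = 3 and level 4 this reproduces the 7-6-4 distribution of 17 runs.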
Imperfect NR
If NR is not perfect, add dummy runs to make it perfect. This is done during the Sort Phase. Where should the dummy runs go? Some suggest distributing them at either the end or the beginning of each file.
Perfect run distributions for k = 3. On level n, the ith largest file holds $F^{(k)}_{n+k-2} + F^{(k)}_{n+k-3} + \cdots + F^{(k)}_{n+i-2}$ runs.

Level (n)   Runs t_n        Largest file a_n (i=1)   2nd largest b_n (i=2)   3rd largest c_n (i=3)   4th file d_n (empty)
5           31 = 13+11+7    13                       11                      7                       0
4           17 = 7+6+4      7                        6                       4                       0
3           9 = 4+3+2       4                        3                       2                       0
2           5 = 2+2+1       2                        2                       1                       0
1           3 = 1+1+1       1                        1                       1                       0
0           1               1                        0                       0                       0
n+1         t_n + 2a_n      a_n + b_n                a_n + c_n               a_n + d_n = a_n         0

$t_n = a_n + b_n + c_n$
$c_{n+1} = F^{(3)}_{n+2} = a_n + 0$
$b_{n+1} = F^{(3)}_{n+2} + F^{(3)}_{n+1} = a_n + c_n$
$a_{n+1} = F^{(3)}_{n+2} + F^{(3)}_{n+1} + F^{(3)}_{n} = a_n + b_n$
$t_{n+1} = a_{n+1} + b_{n+1} + c_{n+1} = 3a_n + b_n + c_n = (a_n + b_n + c_n) + 2a_n = t_n + 2a_n$
Polyphase k-way: number of passes

k    Passes
2    1.50 ln NR + 0.99
3    1.02 ln NR + 0.96
4    0.86 ln NR + 0.92
5    0.80 ln NR + 0.86
8    0.73 ln NR + 0.65
Summary of Analyses
Comparison :
Algorithm         Space (# of Files)   Time (# of Passes)    # of Passes (NR=100)
2-way             3                    2 * ceil(log2 NR)     14
Balanced 2-way    4                    1 + ceil(log2 NR)     8
Balanced k-way    2k                   1 + ceil(logk NR)     6 if k=3; 5 if k=4
Polyphase 2-way   3                    1.50 ln NR + 0.99     7.90
Polyphase 3-way   4                    1.02 ln NR + 0.96     5.66
Polyphase 4-way   5                    0.86 ln NR + 0.92     4.88
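The NR = 100 column can be reproduced from the formulas. In this sketch the function names are hypothetical, the logarithms are rounded up to whole passes for the 2-way and balanced algorithms (an assumption consistent with the listed values), and the polyphase coefficients are taken from the comparison as given.

```python
import math

def ceil_log(nr, k):
    """Smallest j with k**j >= nr, using integers to avoid float rounding."""
    j, reach = 0, 1
    while reach < nr:
        reach *= k
        j += 1
    return j

def passes_two_way(nr):
    return 2 * ceil_log(nr, 2)

def passes_balanced(nr, k):
    return 1 + ceil_log(nr, k)

# (coefficient, constant) pairs for polyphase k-way, from the comparison
POLYPHASE = {2: (1.50, 0.99), 3: (1.02, 0.96), 4: (0.86, 0.92)}

def passes_polyphase(nr, k):
    coeff, const = POLYPHASE[k]
    return coeff * math.log(nr) + const  # math.log is the natural logarithm

nr = 100
print(passes_two_way(nr))                                  # 14
print([passes_balanced(nr, k) for k in (2, 3, 4)])         # [8, 6, 5]
print([round(passes_polyphase(nr, k), 2) for k in (2, 3, 4)])
```

The polyphase figures are averages over the merge, so they need not be whole numbers.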
Impact of Devices
Device Impact on External Sorts
The sort time is highly influenced by the secondary storage device being used. Tapes must be rewound between passes. On disk, all files may reside on the same disk, but this incurs more overhead from seek time and latency time as the head(s) switch from file to file. If possible, store the files on separate disks. This allows I/O to overlap and run in parallel. Dedicating a disk to a single file reduces seek time and latency time. Further complications arise in a multi-user environment.
End