File Structures by Folk, Zoellick, and Riccardi

Chap 8. Cosequential Processing
and the Sorting of Large Files
Seoul National University, School of Computer Science and Engineering
Object-Oriented Systems Laboratory
SNU-OOPSLA-LAB
Professor Hyoung-Joo Kim
File Structures

SNU-OOPSLA Lab.

1

Chapter Objectives(1)

Describe a class of frequently used processing activities
known as cosequential processes
Provide a general object-oriented model for implementing
varieties of cosequential processes
Illustrate the use of the model to solve a number of
different kinds of cosequential processing problems,
including problems other than simple merges and
matches
Introduce heapsort as an approach to overlapping I/O with
sorting in RAM


Chapter Objectives(2)


Show how merging provides the basis for sorting very
large files
Examine the costs of K-way merges on disk and find ways
to reduce those costs
Introduce the notion of replacement selection
Examine some of the fundamental concerns associated
with sorting large files using tapes rather than disks
Introduce UNIX utilities for sorting, merging, and
cosequential processing


Contents
8.1 Cosequential operations
8.2 Application of the OO Model to a General Ledger Program
8.3 Extension of the OO Model to Include Multiway Merging
8.4 A Second Look at Sorting in Memory
8.5 Merging as a Way of Sorting Large Files on Disk
8.6 Sorting Files on Tape
8.7 Sort-Merge Packages
8.8 Sorting and Cosequential Processing in Unix


8.1 An Object-Oriented Model for
Implementing Cosequential Processes

Cosequential operations

Coordinated processing of two or more sequential
lists to produce a single list
Kinds of operations


merging, or union
matching, or intersection
combination of above


Matching Names in Two Lists(1)


The so-called "intersection operation"
Output the names common to two lists
Things that must be dealt with to make the match procedure
work reasonably

initializing: arranging things so that the procedure can start properly
getting and accessing the next list item
synchronizing the two lists
handling EOF conditions
recognizing errors
e.g. duplicate names or names out of sequence


Matching Names in Two Lists(2)

In comparing two names

if Item(1) is less than Item(2), read the next name from List 1

if Item(1) is greater than Item(2), read the next name from List 2

if the names are the same, output the name and read the
next names from the two lists
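
The three comparison rules above can be tried out on in-memory lists. The following is a minimal sketch, assuming both lists are sorted in ascending order and free of duplicates; the function name matchSortedLists and the use of std::vector instead of files are illustrative choices, not part of the course code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the names common to two lists (hypothetical helper for illustration).
// Precondition: both lists are sorted in ascending order.
std::vector<std::string> matchSortedLists(const std::vector<std::string>& list1,
                                          const std::vector<std::string>& list2) {
    assert(std::is_sorted(list1.begin(), list1.end()));
    assert(std::is_sorted(list2.begin(), list2.end()));
    std::vector<std::string> common;
    std::size_t i = 0, j = 0;
    while (i < list1.size() && j < list2.size()) {  // stop when either list ends
        if (list1[i] < list2[j]) {
            ++i;                          // Item(1) < Item(2): advance list 1
        } else if (list1[i] > list2[j]) {
            ++j;                          // Item(1) > Item(2): advance list 2
        } else {
            common.push_back(list1[i]);   // same name: output it, advance both
            ++i;
            ++j;
        }
    }
    return common;
}
```

Each list is scanned once, so the number of comparisons is at most the sum of the two list lengths.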


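Cosequential match procedure(1)
[Figure: PROGRAM: match. Item(1) is read from List 1 and Item(2) from List 2, using the input() and initialize() procedures. When Item(1) < Item(2), List 1 is advanced; when Item(1) > Item(2), List 2 is advanced; when both hold the same name, the name is output.]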


Cosequential match procedure(2)
int Match(char * List1, char * List2, char * OutputList)
{
    int MoreItems; // true if items remain in both of the lists
    // initialize input and output lists
    InitializeList(1, List1); InitializeList(2, List2);
    InitializeOutput(OutputList);
    // get first item from both lists
    MoreItems = NextItemInList(1) && NextItemInList(2);
    while (MoreItems) { // loop until no items in one of the lists
        if (Item(1) < Item(2))
            MoreItems = NextItemInList(1);
        else if (Item(1) == Item(2)) { // match found
            ProcessItem(1);
            MoreItems = NextItemInList(1) && NextItemInList(2);
        }
        else // Item(1) > Item(2)
            MoreItems = NextItemInList(2);
    }
    FinishUp();
    return 1;
}


General Class for Cosequential Processing(1)
template <class ItemType> class CosequentialProcess
// base class for cosequential processing
{ public:
    // the following methods provide basic list processing
    // these must be defined in subclasses
    virtual int InitializeList (int ListNumber, char * ListName) = 0;
    virtual int InitializeOutput (char * OutputListName) = 0;
    virtual int NextItemInList (int ListNumber) = 0;
        // advance to next item in this list
    virtual ItemType Item (int ListNumber) = 0;
        // return current item from this list
    virtual int ProcessItem (int ListNumber) = 0;
        // process the item in this list
    virtual int FinishUp() = 0; // complete the processing
    // 2-way cosequential match method
    virtual int Match2Lists (char * List1, char * List2, char * OutputList);
};


General Class for Cosequential Processing(2)

A Subclass to support lists that are files of strings, one per line

class StringListProcess : public CosequentialProcess<String &>
{ public:
    StringListProcess (int NumberOfLists); // constructor
    // Basic list processing methods
    int InitializeList (int ListNumber, char * List1);
    int InitializeOutput (char * OutputList);
    int NextItemInList (int ListNumber); // get next
    String & Item (int ListNumber); // return current
    int ProcessItem (int ListNumber); // process the item
    int FinishUp(); // complete the processing
  protected:
    ifstream * List; // array of list files
    String * Items; // array of current Item from each list
    ofstream OutputList;
    static const char * LowValue; // used so that NextItemInList() doesn't
        // have to get the first item in a special way
    static const char * HighValue;
};


General Class for Cosequential Processing(3)

Appendix H: full implementation
An example of main
#include "coseq.h"
int main()
{
    StringListProcess ListProcess(2); // process with 2 lists
    ListProcess.Match2Lists("list1.txt", "list2.txt", "match.txt");
}


Merging Two Lists(1)

Based on the matching operation
Differences

must read each of the lists completely
must change the behavior of MoreItems
keep this flag set to true as long as there are records in
either list

HighValue

the special value (we use "\xFF")
comes after all legal input values in the files, to ensure that both
input files are read to completion


Merging Two Lists(2)

Cosequential merge procedure based on a single
loop

This method has been added to class CosequentialProcess
No modifications are required to class StringListProcess

template <class ItemType>
int CosequentialProcess<ItemType>::Merge2Lists
    (char * List1Name, char * List2Name, char * OutputListName)
{
    int MoreItems1, MoreItems2; // true if more items in list
    InitializeList(1, List1Name);
    InitializeList(2, List2Name);
    InitializeOutput(OutputListName);
    MoreItems1 = NextItemInList(1);
    MoreItems2 = NextItemInList(2);
    while (MoreItems1 || MoreItems2) { // if either file has more
        if (Item(1) < Item(2)) { // list 1 has next item to be processed
            ProcessItem(1);
            MoreItems1 = NextItemInList(1);
        }
        else if (Item(1) == Item(2)) { // same item in both lists: output it once
            ProcessItem(1);
            MoreItems1 = NextItemInList(1);
            MoreItems2 = NextItemInList(2);
        }
        else { // Item(1) > Item(2)
            ProcessItem(2);
            MoreItems2 = NextItemInList(2);
        }
    }
    FinishUp();
    return 1;
}


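Cosequential merge procedure(1)
[Figure: PROGRAM: merge. NAME_1 is read from List 1 and NAME_2 from List 2. When Item(1) < Item(2), or on a match, the item from List 1 goes to OutputList; when Item(1) > Item(2), the item from List 2 goes to OutputList.]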


Summary of the Cosequential Processing Model(1)

Assumptions




two or more input files are processed in a parallel fashion
each file is sorted
in some cases, there must exist a high key value or a low
key
records are processed in a logical sorted order
for each file, there is only one current record
records should be manipulated only in internal memory


Summary of the Cosequential Processing Model(2)

Essential Components
initialization: reads the first logical record from each file
one main synchronization loop
- continues as long as relevant records remain

selection in the main synchronization loop

if (Item(1) > Item(2)) then ..........
else if (Item(1) < Item(2)) then .........
else ........... /* current keys equal */
endif

Input files and output files are sequence checked by
comparing the previous item value with the new one


Summary of the Cosequential Processing Model(3)

Essential components (cont'd)

substitute high values for the actual key at EOF
the main loop terminates when high values have occurred for
all relevant input files
no special code is needed to deal with EOF
I/O and error detection are relegated to supporting methods
so that the details of these activities do not obscure the principal
processing logic


8.2 The General Ledger Program (1)
Account table (Fig 8.6)
Acct-No  Acct-Title  Jan  Feb  Mar  Apr
101      check #1    100  200  170
102      check #2    500  270  320
505      advertise   300  129  230

Journal entry table (Fig 8.7)
Acct-No  Check-No  Date      Description  Debit/Credit
101      112       04/02/86  auto-repair  -30
505      213       05/13/86  newspaper    -39
540      670       04/13/86  printer      +60

Ledger Printout (Fig 8.8)
101 check #1
    1271  04/02/86  auto-expense  -78
    1272  04/03/86  advertise     -30


8.2 The General Ledger Program(2)
Ledger List and Journal List (Fig 8.10)
Ledger list    Journal list
101 check#1    101 1271 Auto-expense
               101 1272 Rent
               101 1273 Advertising
102 check#2    102 670 Office-expense

The ledger (master) account number
The journal (transaction) account number
Class MasterTransactionProcess (Fig 8.12)
Subclass LedgerProcess (Fig 8.14)


8.2 The General Ledger Program (3)
template <class ItemType>
class MasterTransactionProcess : public CosequentialProcess<ItemType>
// a cosequential process that supports master/transaction processing
{ public:
    MasterTransactionProcess(); // constructor
    virtual int ProcessNewMaster() = 0; // processing when new master read
    virtual int ProcessCurrentMaster() = 0;
    virtual int ProcessEndMaster() = 0;
    virtual int ProcessTransactionError() = 0;
    // cosequential processing of master and transaction records
    int PostTransactions (char * MasterFileName, char * TransactionFileName,
        char * OutputListName);
};


8.3 Extension of the Model to Include
Multiway Merging

A K-way Merge Algorithm

A very general form of cosequential file processing
Merge K input lists to create a single, sequentially
ordered output list
Algorithm



begin loop
determine which list has the key with the lowest value
output that key
move ahead one key in that list
if the same key appears in more than one list, move ahead in each of those lists
loop again
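
The loop above can be sketched for in-memory lists as follows. This is an illustrative sketch, not the course implementation: it assumes integer keys, ascending lists, and no duplicates inside any single list, and the name kWayMerge is hypothetical. The minimum is found with a plain scan over the K current keys, which is the approach that works well for small K.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Merges K ascending lists into one ascending list; a key that appears in
// several lists is output once (hypothetical helper for illustration).
std::vector<int> kWayMerge(const std::vector<std::vector<int>>& lists) {
    for (const auto& l : lists) assert(std::is_sorted(l.begin(), l.end()));
    std::vector<std::size_t> pos(lists.size(), 0);  // current position in each list
    std::vector<int> out;
    while (true) {
        // Determine which list has the key with the lowest value.
        bool found = false;
        int lowest = 0;
        for (std::size_t k = 0; k < lists.size(); ++k) {
            if (pos[k] < lists[k].size() && (!found || lists[k][pos[k]] < lowest)) {
                lowest = lists[k][pos[k]];
                found = true;
            }
        }
        if (!found) break;        // every list is exhausted
        out.push_back(lowest);    // output that key
        // Move ahead in every list whose current key equals the key just output.
        for (std::size_t k = 0; k < lists.size(); ++k) {
            if (pos[k] < lists[k].size() && lists[k][pos[k]] == lowest) ++pos[k];
        }
    }
    return out;
}
```

Every output key costs one pass over the K current keys, which is the comparison loop that becomes expensive as K grows.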


Selection Tree for Merging Large Number of Lists

K-way merge

works nicely if K is no larger than 8 or so
if K > 8, the set of comparisons needed to find the minimum key becomes expensive
(a loop of comparisons for every selection)

Selection Tree (if K > 8)

time vs. space trade-off
a kind of "tournament" tree
the minimum value is at the root node
the depth of the tree is ⌈log2 K⌉
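
To see the effect of replacing the linear scan with a logarithmic selection step, the sketch below uses std::priority_queue as a min-heap. Note that this is a stand-in for the tournament-style selection tree, not the same data structure; both give the next minimum in O(log K) steps. The name mergeWithHeap is hypothetical, keys are assumed to be integers, and duplicate keys are kept rather than collapsed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// K-way merge in which a min-heap holds one (key, list index) entry per
// non-exhausted list (hypothetical helper for illustration).
std::vector<int> mergeWithHeap(const std::vector<std::vector<int>>& lists) {
    using Entry = std::pair<int, std::size_t>;  // (current key, index of its list)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    std::vector<std::size_t> pos(lists.size(), 0);
    for (std::size_t k = 0; k < lists.size(); ++k)
        if (!lists[k].empty()) heap.push({lists[k][0], k});
    std::vector<int> out;
    while (!heap.empty()) {
        assert(heap.size() <= lists.size());  // never more than K candidates
        Entry smallest = heap.top();          // minimum of the K current keys
        heap.pop();
        out.push_back(smallest.first);
        std::size_t k = smallest.second;
        // Replace the winner with the next key from the same list, if any.
        if (++pos[k] < lists[k].size()) heap.push({lists[k][pos[k]], k});
    }
    return out;
}
```

The heap uses extra memory for K entries, which is the space side of the time vs. space trade-off.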


Selection Tree

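Input lists
List 0: 7, 10, 17, ...
List 1: 9, 19, 23, ...
List 2: 11, 13, 32, ...
List 3: 18, 22, 24, ...
List 4: 12, 14, 21, ...
List 5: 5, 6, 25, ...
List 6: 15, 20, 30, ...
List 7: 8, 16, 29, ...

Winners at each level of the tree (the lower key wins)
Lists 0-1: 7    Lists 2-3: 11    Lists 4-5: 5    Lists 6-7: 8
Lists 0-3: 7    Lists 4-7: 5
Root: 5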


8.4 A Second Look at Sorting in Memory


Read the whole file from disk into memory, perform
the sort, and write the whole file back to disk

Can we improve on the time that this RAM sort takes?

perform some of the parts in parallel
selection sort is good but cannot be used to sort the entire file

Using the heap technique!
processing and I/O can occur in parallel
keep all the keys in a heap
Heap building while reading a block
Heap rebuilding while writing a block

Overlapping processing and I/O : Heapsort

Heap


a kind of binary tree: a complete binary tree
each node has a single key, and that key is greater than or equal to
the key at its parent node
storage for the tree can be allocated sequentially
so there is no need for pointers or other dynamic overhead
for maintaining the heap


A heap in both its tree form and
as it would be stored in an array
* the children of the node at position n are at positions 2n and 2n+1

Tree form (array position in parentheses)
                  A (1)
         B (2)             C (3)
    E (4)     H (5)    I (6)     D (7)
 G (8)  F (9)

Array form
Position:  1  2  3  4  5  6  7  8  9
Key:       A  B  C  E  H  I  D  G  F


Class Heap and Method Insert(1)
class Heap
{ public:
    Heap(int maxElements);
    int Insert (char * newKey);
    char * Remove();
  protected:
    int MaxElements; int NumElements;
    char ** HeapArray;
    void Exchange (int i, int j); // exchange element i and j
    int Compare (int i, int j) // compare element i and j
        { return strcmp(HeapArray[i], HeapArray[j]); }
};


Class Heap and Method Insert(2)
int Heap::Insert(char * newKey)
{
    if (NumElements == MaxElements) return FALSE; // heap is full
    NumElements++; // add the new key at the last position
    HeapArray[NumElements] = newKey;
    // re-order the heap
    int k = NumElements; int parent;
    while (k > 1) { // k has a parent
        parent = k / 2;
        if (Compare(k, parent) >= 0) break;
            // HeapArray[k] is in the right place
        // else exchange k and parent
        Exchange(k, parent);
        k = parent;
    }
    return TRUE;
}


Heap Building Algorithm
input key order : F D C G H I B E A

New key to     Heap, after insertion of the new key
be inserted    (array positions 1 2 3 4 5 6 7 8 9)
F              F
D              D F
C              C F D
G              C F D G
H              C F D G H
I              C F D G H I
B              B F C G H I D
E              B E C F H I D G
A              A B C E H I D G F

Selected heaps in tree form (one tree level per group)
after C:  C / F D
after I:  C / F D / G H I
after B:  B / F C / G H I D
after A:  A / B C / E H I D / G F


Illustration for overlapping input with heap building(1)
(Free ride for main-memory processing: heap building is faster than I/O!)

Total RAM area allocated for heap

First input buffer. The first part of the heap is built here. The
first record is added to the heap, then the second record
is added, and so forth.
Second input buffer. This buffer is being filled
while the heap is being built in the first buffer.


Illustration for overlapping input with heap building(2)
(One heap is growing during I/O time!)

The second part of the heap is built here. The first record is
added to the heap, then the second record, etc.

Third input buffer. This buffer is filled while the heap is being
built in the second buffer.
The third part of the heap is built here.

Fourth input buffer. This buffer is filled while the heap is being
built in the third buffer.


Sorting while Writing to the File

Heap rebuilding while writing a block
(Free ride of main memory processing)
Retrieving the keys in order (Fig 8.20)

while (elements remain in the heap)

get the smallest value (the root)
move the last element of the array into the root
decrease the # of elements
reorder the heap

Overlapping retrieve-in-order with I/O

retrieve-in-order a block of records
while writing this block,
retrieve-in-order the next block
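
The retrieval loop above can be sketched as a small self-contained heap. This is not the Heap class from the slides: it stores std::string keys in a std::vector, and the class name StringHeap and its method names are hypothetical. Slot 0 of the array is left unused so that the children of position k sit at positions 2k and 2k+1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Minimal min-heap sketch (hypothetical class, for illustration only).
class StringHeap {
public:
    StringHeap() : heap_(1) {}  // slot 0 is unused
    bool empty() const { return heap_.size() == 1; }

    void insert(const std::string& key) {
        heap_.push_back(key);                   // add at the last position
        std::size_t k = heap_.size() - 1;
        while (k > 1 && heap_[k] < heap_[k / 2]) {  // move up past larger parents
            std::swap(heap_[k], heap_[k / 2]);
            k /= 2;
        }
    }

    std::string remove() {
        assert(!empty());
        std::string smallest = heap_[1];  // get the smallest value (the root)
        heap_[1] = heap_.back();          // move the last element into the root
        heap_.pop_back();                 // decrease the number of elements
        std::size_t n = heap_.size() - 1;
        std::size_t k = 1;
        while (2 * k <= n) {              // reorder: sink the root key
            std::size_t child = 2 * k;
            if (child < n && heap_[child + 1] < heap_[child]) ++child;  // smaller child
            if (heap_[k] <= heap_[child]) break;
            std::swap(heap_[k], heap_[child]);
            k = child;
        }
        return smallest;
    }

private:
    std::vector<std::string> heap_;
};
```

Calling remove() repeatedly yields the keys in ascending order, so a block of output can be assembled while the previous block is being written.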


8.5 Merging as a Way of Sorting Large
Files on Disk


Keysort: holding keys in memory
Two Shortcomings of Keysort

a substantial cost of seeking may be incurred after the keysort
cannot sort really large files
e.g. a file with 800,000 records, each 100 bytes long with
a 10-byte key: the keys alone take 800,000 × 10 bytes = 8 MB
cannot even sort all the keys in RAM

Multiway merge algorithm

small overhead for maintaining pointers and temporary variables

run: a sorted subfile

use heapsort for each run
split, read in, heapsort, write back


Sorting through the creation of runs
and subsequent merging of runs
[Figure: 800,000 unsorted records -> 80 internal sorts -> 80 runs, each containing 10,000 sorted records -> merge -> 800,000 records in sorted order]


Multiway merging (K-way merge-sort)

Can be extended to files of any size
Reading during run creation is sequential

no seeking, due to sequential reading

Reading and writing are sequential
Sort each run: overlap I/O with sorting by using heapsort
K-way merge of K runs
Since I/O is largely sequential, tapes can be used


How Much Time Does a Merge Sort Take?

Assumptions

only one seek is required for any sequential access

only one rotational delay is required per access

Four I/Os (refer to page 39)

during the sort phase

reading all records into RAM for sorting and forming runs

writing sorted runs out to disk

during the merge phase

reading sorted runs into RAM for merging

writing the sorted file out to disk


Four Steps(1)

Step1: Reading records into RAM for sorting and forming runs

assume: 10 MB input buffer, 800 MB file size
seek time --> 8 msec, rotational delay --> 3 msec
transmission rate --> 0.0145 MB/msec
Time for step1:

access 80 blocks (80 × 11 msec) + transfer 80 blocks (800 / 0.0145 msec)

Step2: Writing sorted runs out to disk

writing is the reverse of reading
the time for step2 equals the time for step1
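
The step 1 estimate can be evaluated numerically. The sketch below plugs in the figures stated above (8 msec seek, 3 msec rotational delay, 0.0145 MB/msec); the struct DiskModel and the function ioTimeMs are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// Disk parameters taken from the assumptions on this slide.
struct DiskModel {
    double seekMs = 8.0;
    double rotationalDelayMs = 3.0;
    double transferMBPerMs = 0.0145;
};

// Time in milliseconds to read (or write) fileMB megabytes using the given
// number of separate accesses, each costing one seek and one rotational delay.
double ioTimeMs(const DiskModel& d, double fileMB, int accesses) {
    assert(accesses > 0 && fileMB > 0);
    double accessMs = accesses * (d.seekMs + d.rotationalDelayMs);  // 80 × 11 msec
    double transferMs = fileMB / d.transferMBPerMs;                 // 800 / 0.0145 msec
    return accessMs + transferMs;
}
```

With 80 accesses and 800 MB, the access part is 880 msec and the transfer part is roughly 55,000 msec, so step 1 takes a little under one minute and is dominated by transfer time.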


Four Steps(2)

Step3: Reading sorted runs into RAM for merging
10 MB of RAM is available for the 80 runs

reallocate the 10 MB of RAM as 80 input buffers
access each run 80 times to read all of it
Each buffer holds 1/80 of a run (0.125 MB)

total seek & rotational time --> 80 runs × 80 seeks
--> 6,400 seeks. 6,400 × 11 msec = 70 seconds
transfer time --> 60 seconds

total time = total seek & rotation time + transfer time
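
The seek count for step 3 follows directly from the buffer arithmetic and can be checked with a few lines of code. This sketch assumes, as on the slide, that every run is as large as the available RAM and that RAM is split into one equal input buffer per run; the function names are hypothetical.

```cpp
#include <cassert>

// Seeks needed to read K runs during a K-way merge when each input buffer
// holds 1/K of a run, so each run must be accessed K times.
long long mergeReadSeeks(long long runs) {
    assert(runs > 0);
    long long accessesPerRun = runs;
    return runs * accessesPerRun;  // K × K
}

// Converts a seek count into seconds at a fixed cost per seek (msec).
double seekTimeSeconds(long long seeks, double msPerSeek) {
    return seeks * msPerSeek / 1000.0;
}
```

For 80 runs this gives 6,400 seeks, and at 11 msec per seek about 70 seconds, matching the figures above.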


Four Steps(3)
Step4: Writing the sorted file out to disk

need to know how big the output buffers are
with 20,000-byte output buffers,
80,000,000 bytes / 20,000 bytes per seek = 4,000 seeks

total seek & rotation time = 4,000 × 11 msec
transfer time is still 60 seconds
Consider Table 8.1 (p. 323)
What if we use keysort for an 800 MB file? --> 24 hrs 26 mins 40 secs


Effect of buffering on the number of seeks required
[Figure: the 800 MB file is held as 80 runs; the 1st, 2nd, ..., 80th run are each 80 buffers' worth of data (80 accesses per run). The runs are read through 80 buffers (10 MB) and merged into 800,000 sorted records.]


Sorting a Very Large File

Two kinds of I/O

Sort phase

I/O is sequential if using heapsort
Since sequential access already involves minimal seeking, we cannot
algorithmically speed up this I/O

Merge phase

RAM buffers for each run get loaded and reloaded at unpredictable
times -> random access
For performance, look for ways to cut down on the number of
random accesses that occur while reading runs
there is some room for improvement here!


The Cost of Increasing the File Size

K-way merge of K runs

Merge sort = O(K^2) (the merge operation requires K^2 seeks)

If K is a big number, you are in trouble!

Some ways to reduce time!! (8.5.4, 8.5.5, 8.5.6)

more hardware (disk drives, RAM, I/O channels)
reduce the order of the merge (K), increasing the buffer size
of each run
increase the lengths of the initial sorted runs
find ways to overlap I/O operations


Hardware-based Improvements

Increasing the amount of RAM

longer & fewer initial runs
fewer seeks

Increasing the number of disk drives

no delay due to seek time after generation of runs
assign input and output to separate drives

Increasing the number of I/O channels

separate I/O channels, so I/O can overlap
improves transmission time


Decreasing the Number of Seeks Using Multiple-step Merges

K-way merge characteristics

a selection tree is used

the number of comparisons is N*log K
(K-way merge with N records)

K is proportional to N

O(N*log N) : reasonably efficient

To reduce seeks, reduce the number of runs merged at once

give each run a bigger buffer space
a multiple-step merge provides a way to do so without more RAM


Multiple-step merge(1)

Do not merge all runs at one time
Break the original set of runs into small groups and
merge the runs in each group separately
Leads to fewer seeks, but extra transmission time in the
second pass
Reads every record twice

to form the intermediate runs and then the final sorted file

Similar to using a selection tree when merging n lists!!
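
A rough count of read seeks shows why splitting the merge helps. The sketch below rests on simplifying assumptions that are not stated on the slide: every initial run is exactly one memory-load long, memory is divided into equal input buffers (one per run being merged), and only the seeks for reading runs are counted. The function names are hypothetical.

```cpp
#include <cassert>

// Seeks to read `runs` runs in one merge, where each run is
// `runLengthInMemoryLoads` memory-loads long. Each input buffer holds
// 1/runs of memory, so a run needs runLengthInMemoryLoads × runs refills.
long long seeksForMerge(long long runs, long long runLengthInMemoryLoads) {
    assert(runs > 0 && runLengthInMemoryLoads > 0);
    return runs * (runLengthInMemoryLoads * runs);
}

// One-step merge of all initial runs at once.
long long oneStepSeeks(long long runs) {
    return seeksForMerge(runs, 1);
}

// Two-step merge: `groups` separate merges of `runsPerGroup` runs each,
// followed by one merge of the `groups` intermediate runs.
long long twoStepSeeks(long long groups, long long runsPerGroup) {
    long long firstStep = groups * seeksForMerge(runsPerGroup, 1);
    long long secondStep = seeksForMerge(groups, runsPerGroup);
    return firstStep + secondStep;
}
```

Under these assumptions, merging 800 runs in one step needs 640,000 read seeks, while 25 merges of 32 runs followed by one 25-way merge needs 25,600 + 20,000 = 45,600, at the price of reading every record twice.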


Two-step merge of 800 runs

[Figure: 800 runs split into 25 sets of 32 runs each (25 sets × 32 runs = 800 runs). Each set is merged separately, and the results are merged in a second step.]


Multiple-step merge(2)

Essence of multiple-step merging

increase the available buffer space for each run
extra pass vs. decrease in random accesses

Can we do even better with more than two steps?

trade-offs between the seek & rotation time and the
transmission time

major cost factors in merge sort

seek time, rotation time, transmission time, buffer size, number of
runs


Increasing Run Lengths Using Replacement Selection(1)

Facts of Life

Want to use heapsort in memory
Want to produce longer output runs
Can we produce longer output runs while using heapsort in memory?

Replacement Selection

Idea
always select the key from memory that has the lowest value
output the key
replace it with a new key from the input list
use 2 heaps in the memory buffer


Increasing Run Lengths Using Replacement Selection(2)

Implementation
step1: read in records and sort them using heapsort
this heap is the primary heap
step2: write out only the record with the lowest key value
step3: bring in a new record and compare its key with that of
the record just output

step3-a: if the new key is higher, insert the new record into its proper place in the
primary heap, along with the other records being selected for output

step3-b: if the new key is lower, place the record in a secondary heap
of records with key values lower than those already written out

step4: repeat step2 & step3 as long as there are records in the primary heap and
there are records to be read in. When the primary heap is empty, make
the secondary heap into the primary heap and repeat step2 & step3
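
The four steps can be sketched with two min-heaps. This is an illustrative sketch, not the textbook code: it works on integer keys held in a std::vector instead of records on disk, uses std::priority_queue for both heaps, and the function name replacementSelection is hypothetical. The parameter p is the number of memory locations.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Splits `input` into sorted runs using replacement selection with p
// memory locations (hypothetical helper for illustration).
std::vector<std::vector<int>> replacementSelection(const std::vector<int>& input,
                                                   std::size_t p) {
    assert(p > 0);
    using MinHeap = std::priority_queue<int, std::vector<int>, std::greater<int>>;
    MinHeap primary, secondary;
    std::size_t next = 0;
    // step1: fill memory with the first p keys.
    while (next < input.size() && primary.size() < p) primary.push(input[next++]);

    std::vector<std::vector<int>> runs;
    while (!primary.empty()) {
        runs.emplace_back();
        std::vector<int>& run = runs.back();
        while (!primary.empty()) {
            int lowest = primary.top();   // step2: write out the lowest key
            primary.pop();
            run.push_back(lowest);
            if (next < input.size()) {    // step3: bring in a new key
                int key = input[next++];
                if (key >= lowest) primary.push(key);  // 3-a: fits in the current run
                else secondary.push(key);              // 3-b: held for the next run
            }
        }
        std::swap(primary, secondary);    // step4: secondary becomes primary
    }
    return runs;
}
```

Because keys that still fit the current run keep entering the primary heap, a run can grow well beyond p keys; only keys smaller than the last key written are deferred to the next run.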


Example of the principle underlying
replacement selection
Input:
21, 67, 12, 5, 47, 16
(the front of the input string is at the right; heapsort!)

Remaining input    Memory (P=3)    Output run
21, 67, 12         5   47  16      -
21, 67             12  47  16      5
21                 67  47  16      12, 5
-                  67  47  21      16, 12, 5
-                  67  47  -       21, 16, 12, 5
-                  67  -   -       47, 21, 16, 12, 5
-                  -   -   -       67, 47, 21, 16, 12, 5


Replacement Selection(1)

What happens if a key arrives in memory too late to be output into its
proper position relative to the other keys? (e.g. if the 4th key is 2 rather than 12)

use a second heap, whose keys are included in the next run

refer to page 335, Figure 8.25

Two questions

Given P locations in memory, how long a run can we expect replacement
selection to produce, on the average?

On the average, we can expect a run length of 2P

Knuth provides an excellent description (pages 335-336)


Comparisons of access times required to sort 8 million records using
both RAM sort and replacement selection

Approach                               # of Records   Size of   # of Seeks    Merge   Total       Total Seek &
                                       per Seek to    Runs      Required to   Order   Number      Rotation Delay
                                       Form Runs      Formed    Form Runs     Used    of Seeks    Time
800 RAM sorts followed by an           10,000         10,000    1,600         800     681,600     4 hr 58 min
  800-way merge
Replacement selection followed by      2,500          15,000    6,400         534     521,134     3 hr 48 min
  534-way merge (records in
  random order)
Replacement selection followed by      2,500          40,000    200           200     206,400     1 hr 30 min
  200-way merge (records
  partially ordered)


Step-by-step op. of replacement selection with 2 heaps
working to form two sorted runs(1)
Input
33, 18, 24, 58, 14, 17, 7, 21, 67, 12, 5, 47, 16
(the front of the input string is at the right; heapsort!)

Remaining input                           Memory (P=3)        Output run (A)
33, 18, 24, 58, 14, 17, 7, 21, 67, 12     5    47    16       -
33, 18, 24, 58, 14, 17, 7, 21, 67         12   47    16       5
33, 18, 24, 58, 14, 17, 7, 21             67   47    16       12, 5
33, 18, 24, 58, 14, 17, 7                 67   47    21       16, 12, 5
33, 18, 24, 58, 14, 17                    67   47    (7)      21, 16, 12, 5
33, 18, 24, 58, 14                        67   (17)  (7)      47, 21, 16, 12, 5
33, 18, 24, 58                            (14) (17)  (7)      67, 47, 21, 16, 12, 5

Keys in parentheses belong to the secondary heap and are held for the next run.


Step-by-step op. of replacement selection
working to form two sorted runs(2)
First run complete; start building the second

Remaining input    Memory (P=3)    Output run (B)
33, 18, 24, 58     14  17  7       -
33, 18, 24         14  17  58      7
33, 18             24  17  58      14, 7
33                 24  18  58      17, 14, 7
-                  24  33  58      18, 17, 14, 7
-                  -   33  58      24, 18, 17, 14, 7
-                  -   -   58      33, 24, 18, 17, 14, 7
-                  -   -   -       58, 33, 24, 18, 17, 14, 7


Replacement Selection Plus Multiple Merging

The total number of seeks is less than for the one-step merges
The two-step merge requires transferring the data two more
times than does the one-step merge

the two-step merges & replacement selection are still better, but the
results are less dramatic

refer to the table on the next slide


Comparison of merges, considering transmission times(1)
: 1-step merge

Approach                     Number of     Merge     Number of   Seek +       Total     Total          Total of Seek,
                             Records per   Pattern   Seeks for   Rotational   Passes    Transmission   Rotation, and
                             Seek to       Used      Sorts and   Delay        over the  Time (min)     Transmission
                             Form Runs               Merges      Time (min)   File                     Times (min)
RAM sorts                    10,000        800-way   681,700     298          4         43             341
Replacement selection        2,500         534-way   521,134     228          4         43             271
  (records in random order)
Replacement selection        2,500         200-way   206,400     90           4         43             133
  (records partially
  ordered)


Comparison of merges, considering transmission times(2)
: 2-step merge

Approach                     Number of     Merge           Number of   Seek +       Total     Total          Total of Seek,
                             Records per   Pattern         Seeks for   Rotational   Passes    Transmission   Rotation, and
                             Seek to       Used            Sorts and   Delay        over the  Time (min)     Transmission
                             Form Runs                     Merges      Time (min)   File                     Times (min)
RAM sorts                    10,000        25 x 32-way     127,200     56           6         65             121
                                           (one 25-way)
Replacement selection        2,500         19 x 28-way     124,438     55           6         65             120
  (records in random order)                (one 19-way)
Replacement selection        2,500         20 x 10-way     110,400     48           6         65             113
  (records partially                       (one 20-way)
  ordered)


Using Two Disks with Replacement Selection

Two disk drives

Sort phase

run selection and output can overlap

Merge phase

input and output can overlap
reduces transmission time by 50%
seeking is virtually eliminated

the output disk becomes the input disk, and vice versa
seeking will occur on the input disk; output is sequential

substantially reduces merge and transmission time

Memory organization for replacement selection

[Figure: disk1 -> input buffers -> heap -> output buffers -> disk2]


More Drives? More Processors?

More drives?

Until I/O becomes so fast that processing cannot keep up
with it

More processors?



mainframes
vector and array processors
massively parallel machines
very fast local area networks


Effects of Multiprogramming

Increase the efficiency of overall system by
overlapping processing and I/O
Effects are very hard to predict


A Concept Toolkit for External Sorting


For in-RAM sorting, use heapsort
Use as much RAM as possible
Use a multiple-step merge when

the number of initial runs is so large that seek and rotation time is much
greater than transmission time

Use replacement selection when

there is a possibility that the input is partially ordered

Use more than one disk drive and I/O channel

read/write can overlap

Look for ways to take advantage of new architectures and systems

parallel processing or high-speed networks


Sorting Files on Tape

Balanced Merge with several tape drives

Tape   Contains runs (Step 1)
T1     R1 R3 R5 R7 R9
T2     R2 R4 R6 R8 R10
T3     -
T4     -

Figure 8.28 (2-way balanced, 4-tape merge)

P is the number of passes, N is the number of runs, k is the number of
input drives ==> then,
P = ceiling of (log_k N)

4 tape drives (2 for input, 2 for output), 10 runs ==> 4 passes
20 tape drives (10 for input, 10 for output), 200 runs ==> 3 passes
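
The pass count can be computed without floating-point logarithms by repeated multiplication: each pass multiplies the length of the merged runs by k. The following sketch uses a hypothetical function name, mergePasses, and returns the smallest P with k^P >= N.

```cpp
#include <cassert>

// Number of passes of a k-way balanced merge over `runs` initial runs,
// i.e. the ceiling of log_k(runs), computed with integer arithmetic.
int mergePasses(long long runs, long long inputDrives) {
    assert(runs >= 1 && inputDrives >= 2);
    int passes = 0;
    long long mergedLength = 1;  // initial runs combined into one run so far
    while (mergedLength < runs) {
        mergedLength *= inputDrives;  // one more pass merges k runs into one
        ++passes;
    }
    return passes;
}
```

For the two cases on the slide, mergePasses(10, 2) gives 4 and mergePasses(200, 10) gives 3.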


Other ways of Balanced Merge

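(Fig 8.30)
Each digit is one run; its value is the run's length in initial runs. A dot marks a run already read from that tape.

         T1           T2           T3        T4
Step1    1 1 1 1 1    1 1 1 1 1    -         -
Step2    -            -            2 2 2     2 2
Step3    4            4            . . 2     -
Step4    -            -            -         10

(Fig 8.31)

         T1           T2       T3     T4
Step1    1 1 1 1 1    1 1 1    1 1    -
Step2    . . 1 1 1    . . 1    -      3 3
Step3    . . . 1 1    -        5      . 3
Step4    . . . . 1    4        5      -
Step5    -            -        -      10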


K-way Balanced Merge on Tapes

Some difficult questions

How does one choose an initial distribution that leads readily to an
efficient merge pattern?

Are there algorithmic descriptions of the merge patterns, given an
initial distribution?

Given N runs and J tape drives, is there some way to compute the
optimal merging performance so we have a yardstick against which
to compare the performance of any specific algorithm?


Unix: Sorting and Cosequential Processing

Sorting in Unix

The Unix sort command
The qsort library routine

Cosequential processing utilities in Unix


Compares: cmp
Difference: diff
Common: comm


Let’s Review !!
8.1 Cosequential operations
8.2 Application of the Model to a General Ledger Program
8.3 Extension of the Model to Include Multiway Merging
8.4 A Second Look at Sorting in Memory
8.5 Merging as a Way of Sorting Large Files on Disk
8.6 Sorting Files on Tape
8.7 Sort-Merge Packages
8.8 Sorting and Cosequential Processing in Unix
