CS2604: Data Structures and File Processing

C++ Edition
Clifford A. Shaffer, Department of Computer Science, Virginia Tech. Copyright © 1995, 1996, 1998

COURSENOTES

The Need for Data Structures

Data structures organize data ⇒ more efficient programs.

More powerful computers ⇒ more complex applications. More complex applications demand more calculations. Complex computing tasks are unlike our everyday experience, so we need special training.

Any organization for a collection of records can be searched, processed in any order, or modified, if you are willing to pay enough in time delay. (Ex: a simple unordered array of records.)

The choice of data structure and algorithm can make the difference between a program running in a few seconds or many days.

A primary concern of this course is efficiency. You might believe that faster computers make it unnecessary to be concerned with efficiency. However...

1

Efficiency

A solution is said to be efficient if it solves the problem within its resource constraints:
• space
• time
(These are typical constraints for programs.)

Alternatively: better than known alternatives ("relatively" efficient).

The cost of a solution is the amount of resources that the solution consumes.

This does not mean we should always strive for the most efficient program. If the program operates well within resource constraints, there is no benefit to making it faster or smaller.

2

Selecting a Data Structure

Select a data structure as follows:
1. Analyze the problem to determine the resource constraints a solution must meet.
2. Determine the basic operations that must be supported. Quantify the resource constraints for each operation.
3. Select the data structure that best meets these requirements.

Typically we want the "simplest" data structure that will meet the requirements. Some questions to ask (these often help to narrow the possibilities):

• Are all data inserted into the data structure at the beginning, or are insertions interspersed with other operations?
• Can data be deleted? If so, a more complex representation is typically required.
• Are all data processed in some well-defined order, or is random access allowed?
3

Data Structure Philosophy

Each data structure has costs and benefits. Rarely is one data structure better than another in all situations.

A data structure requires:
• space for each data item it stores (data plus overhead),
• time to perform each basic operation,
• programming effort. (Some data structures and algorithms are more complicated than others.)

Each problem has constraints on available space and time. Only after a careful analysis of problem characteristics can we know the best data structure for the task.

Bank example:
• Start account: a few minutes
• Transactions: a few seconds
• Close account: overnight
4

Goals of this Course

1. Reinforce the concept that there are costs and benefits for every data structure. (A worldview to adopt.)

2. Learn the commonly used data structures. These form a programmer's basic data structure "toolkit," the "nuts and bolts" of the course.

3. Understand how to measure the effectiveness of a data structure or program. These techniques also allow you to judge the merits of new data structures that you or others might invent, and prepare you for the future.

5

Definitions

A type is a set of values. (Ex: Integer, Boolean, Float.)

A data type is a type and a collection of operations that manipulate the type. (Ex: Addition.)

A data item or element is a piece of information or a record; a physical instantiation. A data item is said to be a member of a data type.

A simple data item contains no subparts. (Ex: Integer.)

An aggregate data item may contain several pieces of information. (Ex: payroll record, city database record.)

6

Abstract Data Types

Abstract Data Type (ADT): a definition for a data type solely in terms of a set of values and a set of operations on that data type.

Each ADT operation is defined by its inputs and outputs. Encapsulation: hide implementation details.

A data structure is the physical implementation of an ADT. Each operation associated with the ADT is implemented by one or more subroutines in the implementation.

Data structure usually refers to an organization for data in main memory. A file structure is an organization for data on peripheral storage, such as a disk drive or tape.

An ADT manages complexity through abstraction: metaphor, hierarchies of labels. Ex: transistors → gates → CPU. In a program, implement an ADT, then think only about the ADT, not its implementation.
7

Logical vs. Physical Form

Data items have both a logical and a physical form.

Logical form: definition of the data item within an ADT. Ex: integers in the mathematical sense: +, −.

Physical form: implementation of the data item within a data structure. Ex: 16- or 32-bit integers: overflow.

[Figure: the Data Type, consisting of the ADT (type, operations), gives data items their logical form; the Data Structure (storage space, subroutines) gives them their physical form]

In this class, we frequently move above and below "the line" separating logical and physical forms.

8

Problems

Problem: a task to be performed.
• Best thought of as inputs and matching outputs.
• A problem definition should include constraints on the resources that may be consumed by any acceptable solution. (But NO constraints on HOW the problem is solved.)

Problems ⇔ mathematical functions
• A function is a matching between inputs (the domain) and outputs (the range).
• An input to a function may be a single number, or a collection of information.
• The values making up an input are called the parameters of the function.
• A particular input must always result in the same output every time the function is computed.

9

Algorithms and Programs

Algorithm: a method or a process followed to solve a problem. (A recipe.)

An algorithm takes the input to a problem (function) and transforms it to the output. (A mapping of input to output.)

A problem can have many algorithms.

An algorithm possesses the following properties:
1. It must be correct. (It computes the proper function.)
2. It must be composed of a series of concrete steps. (Executable by the machine.)
3. There can be no ambiguity as to which step will be performed next.
4. It must be composed of a finite number of steps.
5. It must terminate.

A computer program is an instance, or concrete representation, of an algorithm in some programming language.

We frequently interchange the use of "algorithm" and "program," though they are actually different concepts.

10

Mathematical Background

Look over Chapter 2; read as needed depending on your familiarity with this material.

Set concepts and notation. (A set has no duplicates; a sequence may.)

Recursion

Induction proofs

Logarithms. (Almost always use log to base 2; that is our default base.)

Summations

11

Estimation Techniques

Known as "back of the envelope" or "back of the napkin" calculation.

1. Determine the major parameters that affect the problem.
2. Derive an equation that relates the parameters to the problem.
3. Select values for the parameters, and apply the equation to yield an estimated solution.

Example: How many library bookcases does it take to store books totaling one million pages?

Estimate:
• pages/inch: guess 500
• feet/shelf: guess 4 (actually, 3)
• shelves/bookcase: guess 5 (actually, 7)

Units check: pages/inch × feet/shelf × shelves/bookcase ⇒ pages/bookcase
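Carrying the guesses through: 1,000,000 pages ÷ 500 pages/inch = 2,000 inches ≈ 167 feet of shelving; 167 feet ÷ 4 feet/shelf ≈ 42 shelves; 42 shelves ÷ 5 shelves/bookcase ≈ 9 bookcases.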

12

Algorithm Efficiency

There are often many approaches (algorithms) to solve a problem. How do we choose between them?

At the heart of computer program design are two (sometimes conflicting) goals:
1. To design an algorithm that is easy to understand, code and debug.
2. To design an algorithm that makes efficient use of the computer's resources.

Goal (1) is the concern of Software Engineering. Goal (2) is the concern of data structures and algorithm analysis. When goal (2) is important, how do we measure an algorithm's cost?

13

How to Measure Efficiency?

1. Empirical comparison (run programs). Difficult to do "fairly," and time consuming.
2. Asymptotic algorithm analysis.

Critical resources: time; space (disk, RAM); programmer's effort; ease of use (user's effort).

Factors affecting running time: machine load; OS; compiler; problem size; specific input values for a given problem size.

For most algorithms, running time depends on the "size" of the input. Running time is expressed as T(n) for some function T on input size n.
14

Examples of Growth Rate

Example 1: As n grows, how does T(n) grow?

int largest(int* array, int n) { // Find largest value
  int currlarge = array[0];      // Store largest seen
  for (int i=1; i<n; i++)        // For each element
    if (array[i] > currlarge)    // If largest
      currlarge = array[i];      // Remember it
  return currlarge;              // Return largest
}

Cost: T(n) = c1·n + c2 steps.

Example 2: An assignment statement. Constant cost.

Example 3:

sum = 0;
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    sum++;

Cost: T(n) = c1·n² + c2. Roughly n² steps, with sum being n² at the end. Ignore various overhead such as loop counter increments.
15

Growth Rate Graph

[Figure: growth rates of 10n, 20n, 5n log n, 2n², and 2^n as functions of input size n, with a zoomed view of the range where the curves cross]

2^n is an exponential algorithm. 10n and 20n differ only by a constant.

16

Best, Worst and Average Cases

Not all inputs of a given size take the same time.

Sequential search for K in an array of n integers:
• Begin at the first element in the array and look at each element in turn until K is found.

Best case: find it at the first position: 1 compare.
Worst case: find it at the last position: n compares.
Average case: (n + 1)/2 compares.

While average time seems to be the fairest measure, it may be difficult to determine: it depends on the input distribution. (The analysis above assumes K is equally likely to be at any position.) A sketch of this search appears below.
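A minimal sketch of this sequential search (the function name and the -1 "not found" convention are assumptions, not from the notes):

int seqsearch(int* array, int n, int K) {
  for (int i=0; i<n; i++)          // Look at each element in turn
    if (array[i] == K) return i;   // Found at position i
  return -1;                       // K is not in the array
}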

When is worst-case time important? (Real-time algorithms.)

17

Faster Computer or Algorithm?

What happens when we buy a computer 10 times faster? How much speedup? 10 times. More important: how much increase in problem size for the same time? That depends on the growth rate.

n: size of input that can be processed in one hour (10,000 steps).
n': size of input that can be processed in one hour on the new machine (100,000 steps).

T(n)      n      n'      Change             n'/n
10n       1,000  10,000  n' = 10n           10
20n       500    5,000   n' = 10n           10
5n log n  250    1,842   √10·n < n' < 10n   7.37
2n²       70     223     n' = √10·n         3.16
2^n       13     16      n' = n + 3         --

For 2^n, if n = 1000, then n' would be 1003.

Compare T(n) = n² to T(n) = n log n. For n > 58, it is faster to have the n log n algorithm than to have a computer that is 10 times faster.

18

Asymptotic Analysis: Big-oh

Definition: For T(n) a non-negatively valued function, T(n) is in the set O(f(n)) if there exist two positive constants c and n0 such that T(n) ≤ c·f(n) for all n > n0.

Usage: The algorithm is in O(n²) in the [best, average, worst] case. (We must pick one of these to complete the statement: big-oh notation applies to some set of inputs.)

Meaning: For all data sets big enough (i.e., n > n0), the algorithm always executes in less than c·f(n) steps in the [best, average, worst] case.

Big-oh is an upper bound. Example: if T(n) = 3n², then T(n) is in O(n²).

We wish the tightest upper bound: while T(n) = 3n² is in O(n³), we prefer O(n²). (It provides more information to say O(n²) than O(n³).)
19

Big-oh Example

Example 1: Finding value X in an array (average case).
T(n) = cs·n/2, where cs is a constant. (Its actual value is irrelevant.)
For all values of n > 1, cs·n/2 ≤ cs·n. Therefore, by the definition, T(n) is in O(n) for n0 = 1 and c = cs.

Example 2: T(n) = c1·n² + c2·n (average case).
c1·n² + c2·n ≤ c1·n² + c2·n² ≤ (c1 + c2)·n² for all n > 1.
So T(n) ≤ c·n² for c = c1 + c2 and n0 = 1. Therefore, T(n) is in O(n²) by the definition.

Example 3: T(n) = c. We say this is in O(1), rather than O(c).

20

Big-Omega
De nition: For Tn a non-negatively valued function, Tn is in the set gn if there exist two positive constants c and n0 such that Tn ≥ cg n for all n > n0. Meaning: For all data sets big enough i.e., n > n0, the algorithm always executes in more than cgn steps. Lower Bound. Example: Tn = c1n2 + c2n. c1n2 + c2 n ≥ c1n2 for all n > 1. Tn ≥ cn2 for c = c1 and n0 = 1. Therefore, Tn is in n2 by the de nition. Want greatest lower bound.

21

Theta Notation

When big-Oh and Ω meet, we indicate this by using Θ (big-Theta) notation.

Definition: An algorithm is said to be Θ(h(n)) if it is in O(h(n)) and it is in Ω(h(n)).

(For polynomial equations on T(n), we always have Θ: there is no uncertainty, a "complete" analysis.)

Simplifying rules:
1. If f(n) is in O(g(n)) and g(n) is in O(h(n)), then f(n) is in O(h(n)).
2. If f(n) is in O(k·g(n)) for any constant k > 0, then f(n) is in O(g(n)). (No constants.)
3. If f1(n) is in O(g1(n)) and f2(n) is in O(g2(n)), then (f1 + f2)(n) is in O(max(g1(n), g2(n))). (Drop low-order terms.)
4. If f1(n) is in O(g1(n)) and f2(n) is in O(g2(n)), then f1(n)·f2(n) is in O(g1(n)·g2(n)). (Loops.)
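For example, applying the rules: if T(n) = 3n² + 5n log n + 12, rule (3) drops the low-order terms and rule (2) drops the constant factor, so T(n) is in Θ(n²).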
22

Running Time of a Program

Asymptotic analysis is defined for equations. We need to convert a program to an equation.

Example 1: a = b;
This assignment takes constant time, so it is Θ(1). (Not Θ(c): notation by tradition.)

Example 2:

sum = 0;
for (i=1; i<=n; i++)
  sum += n;

Θ(n), even though sum is n² at the end.

Example 3:

sum = 0;
for (j=1; j<=n; j++)     // First for loop
  for (i=1; i<=j; i++)   //   is a double loop
    sum++;
for (k=0; k<n; k++)      // Second for loop
  A[k] = k;

The first statement is Θ(1). The double for loop is Σ_{i=1}^{n} i = Θ(n²). The final for loop is Θ(n). Result: Θ(n²).

23

More Examples

Example 4:

sum1 = 0;
for (i=1; i<=n; i++)     // First double loop
  for (j=1; j<=n; j++)   //   do n times
    sum1++;
sum2 = 0;
for (i=1; i<=n; i++)     // Second double loop
  for (j=1; j<=i; j++)   //   do i times
    sum2++;

In the first loop, sum1 is n². In the second, sum2 is (n + 1)n/2. Both are Θ(n²).

Example 5:

sum1 = 0;
for (k=1; k<=n; k*=2)
  for (j=1; j<=n; j++)
    sum1++;
sum2 = 0;
for (k=1; k<=n; k*=2)
  for (j=1; j<=k; j++)
    sum2++;

The first is Σ_{k=1}^{log n} n = Θ(n log n). The second is Σ_{k=0}^{log n − 1} 2^k = Θ(n).

24

Binary Search

Position: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
Key:      11 13 21 26 29 36 40 41 45 51 54 56 65 72 77 83

int binary(int K, int* array, int left, int right) {
  // Return position of element (if any) with value K
  int l = left-1;
  int r = right+1;       // l, r are beyond array bounds
  while (l+1 != r) {     // Stop when l and r meet
    int i = (l+r)/2;     // Look at middle of subarray
    if (K < array[i]) r = i;       // In left half
    if (K == array[i]) return i;   // Found it
    if (K > array[i]) l = i;       // In right half
  }
  return UNSUCCESSFUL;   // Search value not in array
}

Analysis: How many elements can be examined in the worst case? Θ(log n).

25

Other Control Statements

while loop: analyze like a for loop.

if statement: take the greater complexity of the then/else clauses. (If the branch probabilities are independent of n.)

switch statement: take the complexity of the most expensive case. (If the case probabilities are independent of n.)

Subroutine call: the complexity of the subroutine.

26

Analyzing Problems

Upper bound: the upper bound of the best known algorithm.
Lower bound: the lower bound for every possible algorithm.

The examples so far have been easy, in that exact equations always yield Θ. Thus, it was hard to distinguish Ω and O. The following example should help to explain the difference: bounds are used to describe our level of uncertainty about an algorithm. (We typically do a lot of this in a senior algorithms course.)

Example: Sorting
1. Cost of I/O: Θ(n).
2. Bubble or insertion sort: O(n²).
3. A better sort (Quicksort, Mergesort, Heapsort, etc.): O(n log n).
4. We prove later that sorting is Ω(n log n).

27

Multiple Parameters

Compute the rank ordering for all C pixel values in a picture of P pixels. (Ex: 256 colors (8 bits), 1000 × 1000 pixels.)

for (i=0; i<C; i++)    // Initialize count
  count[i] = 0;
for (i=0; i<P; i++)    // Look at all of the pixels
  count[value(i)]++;   // Increment proper value count
sort(count);           // Sort pixel value counts

If we use P as the measure, then the time is Θ(P log P). More accurate is Θ(P + C log C).

28

Space Bounds

Space bounds can also be analyzed with asymptotic complexity analysis. Time: algorithm. Space: data structure.

Space/Time Tradeoff Principle: One can often achieve a reduction in time if one is willing to sacrifice space, or vice versa.
• Encoding or packing information (Boolean flags)
• Table lookup (factorials)

Disk-Based Space/Time Tradeoff Principle: The smaller you can make your disk storage requirements, the faster your program will run.

29

Lists

(Students should already be familiar with lists. Objectives: use algorithm analysis in a familiar context; compare implementations.)

A list is a finite, ordered sequence of data items called elements. (The positions are ordered, NOT the values.)

Each list element has a data type. The empty list contains no elements. The length of the list is the number of elements currently stored. The beginning of the list is called the head, the end of the list is called the tail.

Sorted lists have their elements positioned in ascending order of value, while unsorted lists have no necessary relationship between element values and positions.

Notation: ⟨a0, a1, ..., an−1⟩

What operations should we implement? Add/delete an element anywhere; find; next; previous; test for empty.

30

List ADT
class List {                  // List class ADT
public:
  List(int =LIST_SIZE);       // Constructor
  ~List();                    // Destructor
  void clear();               // Remove all Elems
  void insert(const Elem);    // Insert Elem at curr
  void append(const Elem);    // Insert Elem at tail
  Elem remove();              // Remove and return Elem
  void setFirst();            // Set curr to first pos
  void prev();                // Move curr to prev pos
  void next();                // Move curr to next pos
  int length() const;         // Return current length
  void setPos(int);           // Set curr to position
  void setValue(const Elem);  // Set current value
  Elem currValue() const;     // Return current value
  bool isEmpty() const;       // TRUE if list is empty
  bool isInList() const;      // TRUE if curr in list
  bool find(int);             // Find value
};

This is an example of an ADT. Our list implementations will match it. Note that the generic type "Elem" is being used for the element type.

31

List ADT Examples

List: ⟨12, 32, 15⟩

MyList.insert(99);

The above is an example use of the insert function. MyList is an assumed parameter; 99 is another parameter.

Assume the current position is at 32. The insert puts 99 before the current element, yielding ⟨12, 99, 32, 15⟩.

Process an entire list:

for (MyList.setFirst(); MyList.isInList(); MyList.next())
  DoSomething(MyList.currValue());

32

Array-Based List Insert

Push items up/down. Cost: Θ(n).

[Figure: inserting 23 at the head of the array ⟨13, 12, 20, 8, 3⟩: (a) the initial array, (b) the elements shifted up one position, (c) 23 placed in position 0, yielding ⟨23, 13, 12, 20, 8, 3⟩]

33

Array-Based List Class
class List {                  // Array-based list class
private:
  int msize;                  // Maximum size of list
  int numinlist;              // Actual number of Elems
  int curr;                   // Position of "current"
  Elem* listarray;            // Array of list Elems
public:
  List(int =LIST_SIZE);       // Constructor
  ~List();                    // Destructor
  void clear();               // Remove all Elems
  void insert(const Elem);    // Insert Elem at curr
  void append(const Elem);    // Insert Elem at tail
  Elem remove();              // Remove and return Elem
  void setFirst();            // Set curr to first pos
  void prev();                // Move curr to prev pos
  void next();                // Move curr to next pos
  int length() const;         // Return current length
  void setPos(int);           // Set curr to position
  void setValue(const Elem);  // Set current value
  Elem currValue() const;     // Return current value
  bool isEmpty() const;       // TRUE if list is empty
  bool isInList() const;      // TRUE if curr in list
  bool find(int);             // Find value
};

34

Array-Based List Implementation
List::List(int sz) {          // Constructor
  msize = sz; numinlist = 0; curr = 0;
  listarray = new Elem[sz];
}

List::~List() { delete [] listarray; }  // Destructor

void List::clear() { numinlist = 0; curr = 0; }

// Insert Elem at current position
void List::insert(const Elem item) {
  assert((numinlist < msize) && (curr >= 0)
         && (curr <= numinlist));
  for (int i=numinlist; i>curr; i--)   // Shift Elems up
    listarray[i] = listarray[i-1];
  listarray[curr] = item;
  numinlist++;                         // Increment size
}

void List::append(const Elem item) {   // Insert at tail
  assert(numinlist < msize);           // Can't be full
  listarray[numinlist++] = item;       // Increment size
}

Elem List::remove() {         // Remove/return current Elem
  assert(!isEmpty() && isInList());    // Must have Elem
  Elem temp = listarray[curr];         // Store Elem
  for (int i=curr; i<numinlist-1; i++) // Shift down
    listarray[i] = listarray[i+1];
  numinlist--;
  return temp;
}

35

List Implementation (cont)

void List::setFirst() { curr = 0; }  // Set to first pos
void List::prev() { curr--; }        // Move curr to prev pos
void List::next() { curr++; }        // Move curr to next pos
int List::length() const { return numinlist; }
void List::setPos(int pos) { curr = pos; }

void List::setValue(const Elem val)  // Set curr value
{ assert(isInList()); listarray[curr] = val; }

Elem List::currValue() const         // Return current value
{ assert(isInList()); return listarray[curr]; }

bool List::isEmpty() const { return numinlist == 0; }

bool List::isInList() const          // TRUE if in list
{ return (curr >= 0) && (curr < numinlist); }

bool List::find(int val) {           // Find value
  while (isInList())                 // Stop if reach end
    if (key(currValue()) == val) return TRUE;  // Found
    else next();
  return FALSE;                      // Not found
}

36

Link Class
Dynamic allocation of new list elements.
class Link {                 // Singly-linked node
public:
  Elem element;              // Elem value for node
  Link* next;                // Pointer to next node
  Link(const Elem elemval, Link* nextval =NULL)
  { element = elemval; next = nextval; }
  Link(Link* nextval =NULL) { next = nextval; }
};

37

Linked List Position

Naive approach: point to the current node. Here the current node is 12, and we want to insert a node with value 10.

[Figure (a)/(b): list head → 20 → 23 → 12 → 15 with curr pointing at 12; the new node 10 must go before 12]

Problem: no access is available to the node holding 23. How can we do the insert?

Alternative implementation: point to the node preceding the actual current node. Now we can do the insert. Also note the use of a header node.

[Figure (a)/(b): with curr pointing at 23, the node 10 is easily inserted before 12, yielding 20 → 23 → 10 → 12 → 15]

38

Linked List Class
class List {                  // Linked list class
private:
  Link* head;                 // Pointer to list header
  Link* tail;                 // Pointer to last Elem
  Link* curr;                 // Pos of "current" Elem
public:
  List(int =LIST_SIZE);       // Constructor
  ~List();                    // Destructor
  void clear();               // Remove all Elems
  void insert(const Elem);    // Insert at current pos
  void append(const Elem);    // Insert at tail of list
  Elem remove();              // Remove/return Elem
  void setFirst();            // Set curr to first pos
  void prev();                // Move curr to prev pos
  void next();                // Move curr to next pos
  int length() const;         // Return length
  void setPos(int);           // Set current pos
  void setValue(const Elem);  // Set current value
  Elem currValue() const;     // Return current value
  bool isEmpty() const;       // TRUE if list is empty
  bool isInList() const;      // TRUE if now in list
  bool find(int);             // Find value
};

39

Linked List Insertion
// Insert Elem at current position
void List::insert(const Elem item) {
  assert(curr != NULL);     // Must be pointing to Elem
  curr->next = new Link(item, curr->next);
  if (tail == curr)         // Appended new Elem
    tail = curr->next;
}

[Figure: inserting 10 when curr points at 23 (the node preceding the logical current element 12): (a) before, (b) the new link is spliced in between 23 and 12]

40

Linked List Remove
Elem List::remove() {                // Remove/return Elem
  assert(isInList());                // Must be valid pos
  Elem temp = curr->next->element;   // Remember value
  Link* ltemp = curr->next;          // Remember link
  curr->next = ltemp->next;          // Remove from list
  if (tail == ltemp) tail = curr;    // Set tail
  delete ltemp;                      // Free link
  return temp;                       // Return value
}

[Figure: removing 10 from ... 23 → 10 → 15 ...: (a) before, (b) curr->next is redirected from 10 to 15 and the link holding 10 is freed]

41

Freelists
System new and delete are slow.
class Link {                 // Singly-linked node
public:                      //   with freelist
  Elem element;              // Elem value for node
  Link* next;                // Pointer to next node
  static Link* freelist;     // Link class freelist
  Link(const Elem elemval, Link* nextval =NULL)
  { element = elemval; next = nextval; }
  Link(Link* nextval =NULL) { next = nextval; }
  void* operator new(size_t);    // Overloaded new
  void operator delete(void*);   // Overloaded delete
};

Link* Link::freelist = NULL;     // Create static variable

void* Link::operator new(size_t) {  // Overload new
  if (freelist == NULL) return ::new Link;  // New space
  Link* temp = freelist;         // Or get from freelist
  freelist = freelist->next;
  return temp;                   // Return the link
}

void Link::operator delete(void* ptr) {  // Overload
  ((Link*)ptr)->next = freelist; // Put on freelist
  freelist = (Link*)ptr;
}

42

Comparison of List Implementations

Array-Based Lists (average and worst cases):
• Insertion and deletion are Θ(n).
• The array must be allocated in advance.
• No overhead if all array positions are full.

Linked Lists:
• Insertion and deletion are Θ(1); prev and direct access are Θ(n).
• Space grows with the number of elements.
• Every element requires overhead.

Space "break-even" point, where E is the space for a data value, P the space for a pointer, and D the number of elements in the array:

  D·E = n(P + E),  so  n = D·E / (P + E)

The linked list is more space-efficient when the number of stored elements n is below this break-even value.

Estimation question: what are the dimensions of these variables?

43

Doubly Linked Lists

Simplify insertion and deletion: add a prev pointer.

class Link {                 // Doubly-linked node
public:                      //   with freelist
  Elem element;              // Node Elem value
  Link* next;                // Pointer to next node
  Link* prev;                // Pointer to prev node
  static Link* freelist;     // Link class freelist
  Link(const Elem Elemval, Link* nextp =NULL,
       Link* prevp =NULL)
  { element = Elemval; next = nextp; prev = prevp; }
  Link(Link* nextp =NULL, Link* prevp =NULL)
  { next = nextp; prev = prevp; }
  void* operator new(size_t);    // Overloaded new
  void operator delete(void*);   // Overloaded delete
};

[Figure: doubly linked list head → 20 ⇄ 23 ⇄ 12 ⇄ 15 with curr and tail pointers]

44

Doubly Linked List Operations
[Figure: inserting 10 between 23 and 12 in a doubly linked list: (a) before, (b) after; the new link's next and prev pointers and both neighbors' pointers are updated]

// Insert Elem at current position
void List::insert(const Elem item) {
  assert(curr != NULL);
  curr->next = new Link(item, curr->next, curr);
  if (curr->next->next != NULL)
    curr->next->next->prev = curr->next;
  if (tail == curr) tail = curr->next;
}

Elem List::remove() {          // Remove current Elem
  assert(isInList());          // Must be valid position
  Elem temp = curr->next->element;
  Link* ltemp = curr->next;
  if (ltemp->next != NULL) ltemp->next->prev = curr;
  else tail = curr;   // Removed tail Elem - change tail
  curr->next = ltemp->next;
  delete ltemp;
  return temp;
}

45

Stacks
LIFO: Last In, First Out.

A restricted form of list: insert and remove only at the front of the list.

Notation:
• Insert: PUSH
• Remove: POP
• The accessible element is called TOP.

46

Array-Based Stack

Define top as the first free position.

class Stack {               // Array-based stack class
private:
  int size;                 // Maximum size of stack
  int top;                  // Index for top Elem
  Elem* listarray;          // Array holding stack Elems
public:
  Stack(int sz =LIST_SIZE)  // Constructor: initialize
  { size = sz; top = 0; listarray = new Elem[sz]; }
  ~Stack()                  // Destructor: free array
  { delete [] listarray; }
  void clear() { top = 0; }   // Remove all Elems
  void push(const Elem item)  // Push Elem onto stack
  { assert(top < size); listarray[top++] = item; }
  Elem pop()                // Pop Elem from stack top
  { assert(!isEmpty()); return listarray[--top]; }
  Elem topValue() const     // Return value of top Elem
  { assert(!isEmpty()); return listarray[top-1]; }
  bool isEmpty() const      // Return TRUE if empty
  { return top == 0; }
};

47

Linked Stack
class Stack {              // Linked stack class
private:
  Link* top;               // Pointer to top Elem
public:
  Stack(int sz =LIST_SIZE) // Constructor: initialize
  { top = NULL; }
  ~Stack() { clear(); }    // Destructor
  void clear();            // Remove stack Elems
  void push(const Elem item)  // Push Elem onto stack
  { top = new Link(item, top); }
  Elem pop();              // Pop Elem from stack
  Elem topValue() const    // Get value of top Elem
  { assert(!isEmpty()); return top->element; }
  bool isEmpty() const     // Return TRUE if empty
  { return top == NULL; }
};

void Stack::clear() {      // Remove Elems
  while (top != NULL) {    // Free link nodes
    Link* temp = top;
    top = top->next;
    delete temp;
  }
}

Elem Stack::pop() {        // Pop Elem from stack
  assert(!isEmpty());
  Elem temp = top->element;
  Link* ltemp = top->next;
  delete top;
  top = ltemp;
  return temp;
}

48

Queues
FIFO: First In, First Out.

A restricted form of list: insert at one end, remove from the other.

Notation:
• Insert: enqueue
• Delete: dequeue
• First element: FRONT
• Last element: REAR

49

Queue Implementations

Array-Based Queue

[Figure: (a) a queue ⟨20, 5, 12, 17⟩ stored in an array with front and rear indices; (b) after several dequeues and enqueues, the contents have "drifted" toward the end of the array]

Application of the Pigeonhole Principle: given a fixed arbitrary position for front, there are n + 1 states (0 through n elements in the queue) but only n positions for rear. So, something must be done to distinguish between two of the states.

Choices: leave one slot empty, or use a separate flag. For the "drifting" queue, use modular arithmetic.

Linked Queue: a modified linked list. Operations are Θ(1). A sketch of the circular array queue appears below.
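A minimal sketch of the circular array queue, using modular arithmetic and the leave-one-slot-empty convention (the class layout is an assumption modeled on the Stack class above):

class Queue {                 // Array-based circular queue
private:
  int size;                   // Array size (holds size-1 Elems)
  int front;                  // Index one before the front Elem
  int rear;                   // Index of rear Elem
  Elem* listarray;            // Array holding queue Elems
public:
  Queue(int sz =LIST_SIZE)    // Constructor: one slot left empty
  { size = sz+1; front = rear = 0; listarray = new Elem[size]; }
  ~Queue() { delete [] listarray; }
  void enqueue(const Elem item)  // Insert at rear
  { assert(((rear+1) % size) != front);  // Must not be full
    rear = (rear+1) % size; listarray[rear] = item; }
  Elem dequeue()                 // Remove from front
  { assert(!isEmpty());
    front = (front+1) % size; return listarray[front]; }
  bool isEmpty() const { return front == rear; }
};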

50

Binary Trees

A binary tree is made up of a finite set of nodes that is either empty or consists of a node called the root together with two binary trees, called the left and right subtrees, which are disjoint from each other and from the root.

Notation: node, children, edge, parent, ancestor, descendant, path, depth, height, level, leaf node, internal node, subtree.

[Figure: example tree with root A; children B and C; further descendants D, E, F, G, H, I]

A has depth 0. B and C form level 1. The tree has height 4. Height = max depth + 1.

51

Full and Complete Binary Trees

Full binary tree: each node either is a leaf or is an internal node with exactly two non-empty children.

Complete binary tree: if the height of the tree is d, then all levels except possibly level d are completely full, and the bottom level has all of its nodes filled in from the left side.

[Figure: (a) a full binary tree; (b) a complete binary tree]

These terms can be hard to distinguish. You will need to remember which is which.

52

Full Binary Tree Theorem

Theorem: The number of leaves in a non-empty full binary tree is one more than the number of internal nodes. (Relevant since it helps us calculate space requirements.)

Proof by mathematical induction:
• Base case: a full binary tree with 1 internal node must have two leaf nodes.
• Induction hypothesis: assume any full binary tree T containing n − 1 internal nodes has n leaves.
• Induction step: given a tree T with n internal nodes, pick an internal node I with two leaf children. Remove I's children and call the resulting tree T'. By the induction hypothesis, T' is a full binary tree with n leaves. Restore I's two children. The number of internal nodes has now gone up by 1 to reach n; the number of leaves has also gone up by 1, to n + 1.

53

Full Binary Tree Theorem Corollary

Theorem: The number of NULL pointers in a non-empty binary tree is one more than the number of nodes in the tree.

Proof: Replace all NULL pointers with a pointer to an empty leaf node. The result is a full binary tree, so the previous theorem applies.

54

Binary Tree Node Class
class BinNode {            // Binary tree node class
public:
  Belem element;           // The node's value
  BinNode* left;           // Pointer to left child
  BinNode* right;          // Pointer to right child
  static BinNode* freelist;
  // Two constructors: with and without initial values
  BinNode() { left = right = NULL; }
  BinNode(Belem e, BinNode* l =NULL, BinNode* r =NULL)
  { element = e; left = l; right = r; }
  ~BinNode() { }           // Destructor
  BinNode* leftchild() const { return left; }
  BinNode* rightchild() const { return right; }
  Belem value() const { return element; }
  void setValue(Belem val) { element = val; }
  bool isLeaf() const      // TRUE if is a leaf
  { return (left == NULL) && (right == NULL); }
  void* operator new(size_t);   // Overload new
  void operator delete(void*);  // Overload delete
};

55

Traversals
Any process for visiting the nodes in some order is called a traversal. Any traversal that lists every node in the tree exactly once is called an enumeration of the tree's nodes.

Preorder traversal: visit each node before visiting its children.

Postorder traversal: visit each node after visiting its children.

Inorder traversal: visit the left subtree, then the node, then the right subtree.

void preorder(BinNode* rt) { // rt is root of a subtree
  if (rt == NULL) return;    // Empty subtree
  visit(rt);    // visit performs the desired action
  preorder(rt->leftchild());
  preorder(rt->rightchild());
}
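The other two traversals follow the same pattern; these bodies are straightforward variants not listed in the notes:

void inorder(BinNode* rt) {    // Visit between the subtrees
  if (rt == NULL) return;      // Empty subtree
  inorder(rt->leftchild());
  visit(rt);
  inorder(rt->rightchild());
}

void postorder(BinNode* rt) {  // Visit after both subtrees
  if (rt == NULL) return;      // Empty subtree
  postorder(rt->leftchild());
  postorder(rt->rightchild());
  visit(rt);
}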

56

Binary Tree Implementation

[Figure: the example tree (root A) implemented with pointer-based nodes]

Here leaves are represented the same as internal nodes: lots of wasted space.

[Figure: expression tree for 4x(2x + a) − c]

In an expression tree, leaves (operands) are different from internal nodes (operators).

57

Union Implementation
enum Nodetype {leaf, internal};  // Enumerate node types

class VarBinNode {         // Generic node class
public:
  Nodetype mytype;         // Stores type for this node
  union {
    struct {               // Structure for internal node
      VarBinNode* left;    // Children
      VarBinNode* right;
      Operator opx;        // Internal node value
    } intl;
    Operand var;           // Leaves just store a value
  };
  VarBinNode(const Operand& val)   // Constructor: leaf
  { mytype = leaf; var = val; }
  // Constructor: internal
  VarBinNode(const Operator& op, VarBinNode* l,
             VarBinNode* r)
  { mytype = internal; intl.opx = op;
    intl.left = l; intl.right = r; }
  bool isLeaf() { return mytype == leaf; }
  VarBinNode* leftchild() { return intl.left; }
  VarBinNode* rightchild() { return intl.right; }
};

void traverse(VarBinNode* rt) {  // Preorder traversal
  if (rt == NULL) return;
  if (rt->isLeaf())
    cout << "Leaf: " << rt->var << "\n";
  else {
    cout << "Internal: " << rt->intl.opx << "\n";
    traverse(rt->leftchild());
    traverse(rt->rightchild());
  }
}

58

Inheritance
class VarBinNode {         // Node base class
public:
  // isLeaf is a "pure" virtual function.
  // Each derived class must define isLeaf.
  virtual bool isLeaf() = 0;
};

class LeafNode : public VarBinNode {  // Leaf subclass
private:
  Operand var;             // Operand value
public:
  LeafNode(const Operand& val) { var = val; }
  bool isLeaf() { return TRUE; }   // Subclass version
  Operand value() { return var; }  // Return value
};

class IntlNode : public VarBinNode {  // Internal node
private:
  VarBinNode* left;        // Left child
  VarBinNode* right;       // Right child
  Operator opx;            // Operator value
public:
  IntlNode(const Operator& op, VarBinNode* l,
           VarBinNode* r)  // Constructor
  { opx = op; left = l; right = r; }
  bool isLeaf() { return FALSE; }  // Subclass version
  VarBinNode* leftch() { return left; }    // Left
  VarBinNode* rightch() { return right; }  // Right
  Operator value() { return opx; }         // Value
};

59

Inheritance cont
void traverse(VarBinNode* rt) {  // Preorder traversal
  if (rt == NULL) return;        // Nothing to visit
  if (rt->isLeaf())              // Do leaf node
    cout << "Leaf: "
         << ((LeafNode*)rt)->value() << "\n";
  else {                         // Do internal node
    cout << "Internal: "
         << ((IntlNode*)rt)->value() << "\n";
    traverse(((IntlNode*)rt)->leftch());
    traverse(((IntlNode*)rt)->rightch());
  }
}

60

Space Overhead

(The following is for full binary trees.)

From the Full Binary Tree Theorem: half of the pointers are NULL. (Half the nodes have 2 pointers, which is overhead.)

If only leaves store information, then the overhead depends on whether the tree is full.

All nodes the same, with two pointers to children: for n nodes, pointer size p, and data size d, the total space required is (2p + d)n. Overhead: 2pn. If p = d, this means 2p/(2p + d) = 2/3 overhead.

Eliminate the pointers from the leaf nodes (n/2 of the nodes):

  (n/2)(2p) / ((n/2)(2p) + dn) = p/(p + d)

This is 1/2 if p = d. With data stored only at the leaves, the overhead is 2p/(2p + d) ⇒ 2/3.

Some method is needed to distinguish leaves from internal nodes; this adds overhead.

This is a good example of logical representation vs. physical implementation.

61

Array Implementation

For complete binary trees. Since the complete binary tree is so limited in its shape (there is only one shape for a tree of n nodes), it is reasonable to expect that space efficiency can be achieved.

[Figure: a complete binary tree with nodes labeled 0 through 11 in level order]

For node r (0-based, n nodes):
• Parent(r) = (r − 1)/2 if r ≠ 0 and r < n.
• Leftchild(r) = 2r + 1 if 2r + 1 < n.
• Rightchild(r) = 2r + 2 if 2r + 2 < n.
• Leftsibling(r) = r − 1 if r is even, r > 0 and r < n.
• Rightsibling(r) = r + 1 if r is odd and r + 1 < n.
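As a sketch, the formulas translate directly into code (these helper functions are assumptions matching the formulas above, for a tree of n nodes):

int parent(int r)       { assert((r > 0) && (r < n)); return (r-1)/2; }
int leftchild(int r)    { assert(2*r+1 < n); return 2*r+1; }
int rightchild(int r)   { assert(2*r+2 < n); return 2*r+2; }
int leftsibling(int r)  { assert((r % 2 == 0) && (r > 0)); return r-1; }
int rightsibling(int r) { assert((r % 2 == 1) && (r+1 < n)); return r+1; }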

62

Huffman Coding Trees

ASCII codes use 8 bits per character: fixed-length coding.

We can take advantage of the relative frequency of letters to save space: variable-length coding.

Letter: Z  K  F  C  U  D  L  E
Freq:   2  7  24 32 37 42 42 120

Build the tree with minimal external path weight.

63

Huffman Tree Construction

[Figure: the construction steps; at each step the two subtrees of least total weight are merged: (Z:2, K:7) → 9; (9, F:24) → 33; (C:32, 33) → 65; (U:37, D:42) → 79; (L:42, 65) → 107; finally (79, 107) → 186 and (E:120, 186) → 306]

64

Assigning Codes

[Figure: the finished Huffman tree (total weight 306), with each left edge labeled 0 and each right edge labeled 1]

C code: 1110; D code: 101; E code: 0; F code: 11111; K code: 111101; L code: 110; U code: 100; Z code: 111100.

Letter: C  D  E   F  K  L  U  Z
Freq:   32 42 120 24 7  42 37 2

(The completed code table appears on the next slide.)
65

Coding and Decoding
A set of codes is said to meet the prefix property if no code in the set is the prefix of another.

Code for DEED: 101 0 0 101.

Decode 1011001110111101: DUCK.

Expected cost per letter:

  (1·120 + 3·121 + 4·32 + 5·24 + 6·9) / 306 = 785/306 = 2.57 bits

Letter  Freq  Code    Bits
C       32    1110    4
D       42    101     3
E       120   0       1
F       24    11111   5
K       7     111101  6
L       42    110     3
U       37    100     3
Z       2     111100  6
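A sketch of decoding by walking the tree, using the BinNode class from earlier (treating the element as a printable letter and taking the code as a character string are assumptions):

void decode(BinNode* root, const char* bits) {
  BinNode* curr = root;
  for (int i=0; bits[i] != '\0'; i++) {
    if (bits[i] == '0') curr = curr->leftchild();  // 0: go left
    else curr = curr->rightchild();                // 1: go right
    if (curr->isLeaf()) {      // Reached a letter
      cout << curr->value();   // Output it
      curr = root;             // Restart for the next code
    }
  }
}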
66

Binary Search Trees

Problem with lists: either insert/delete or search must be Θ(n) time. How can we make both update and search efficient? Answer: use a new data structure.

Binary Search Tree (BST) Property: All elements stored in the left subtree of a node whose value is K have values less than K. All elements stored in the right subtree of a node whose value is K have values greater than or equal to K.

[Figure: two BSTs storing the values 2, 7, 24, 32, 37, 40, 42, 42, 120: (a) rooted at 37, (b) rooted at 120]

67

BST Search
class BST {
private:
  BinNode* root;
  void clearhelp(BinNode*);                 // Private
  void inserthelp(BinNode*&, const Belem);  //   functions
  BinNode* deletemin(BinNode*&);
  void removehelp(BinNode*&, int);
  Belem findhelp(BinNode*, int) const;
  void printhelp(const BinNode*, int) const;
public:
  BST() { root = NULL; }
  ~BST() { clearhelp(root); }
  void clear() { clearhelp(root); root = NULL; }
  void insert(const Belem val) { inserthelp(root, val); }
  void remove(const int val) { removehelp(root, val); }
  Belem find(const int val) const
  { return findhelp(root, val); }
  bool isEmpty() const { return root == NULL; }
  void print() const {
    if (root == NULL) cout << "The BST is empty.\n";
    else printhelp(root, 0);
  }
};

Belem BST::findhelp(BinNode* rt, int val) const {
  if (rt == NULL) return UNSUCCESSFUL;  // Empty tree
  else if (val < key(rt->value()))      // Go left
    return findhelp(rt->leftchild(), val);
  else if (val == key(rt->value())) return rt->value();
  else return findhelp(rt->rightchild(), val);  // Right
}

68

BST Insert
void BST::inserthelp(BinNode*& rt, const Belem val) {
  if (rt == NULL)   // Empty tree: create node
    rt = new BinNode(val, NULL, NULL);
  else if (key(val) < key(rt->value()))
    inserthelp(rt->left, val);   // Check left
  else
    inserthelp(rt->right, val);  // Check right
}

Note that rt is declared "by reference."

[Figure: inserting 35 into the example BST; it becomes the right child of 32]

69

Alternate Approach
BinNode* BST::inserthelp(BinNode* rt, const Belem val) {
  if (rt == NULL)   // Empty tree: create node
    return new BinNode(val, NULL, NULL);
  if (key(val) < key(rt->value()))
    rt->left = inserthelp(rt->left, val);
  else
    rt->right = inserthelp(rt->right, val);
  return rt;        // Return the (possibly new) subtree root
}

70

Remove Minimum Value
BinNode* BST::deletemin(BinNode*& rt) {
  assert(rt != NULL);    // Must be a node to delete
  if (rt->left != NULL)  // Continue left
    return deletemin(rt->left);
  else {                 // Found it
    BinNode* temp = rt;
    rt = rt->right;
    return temp;
  }
}

[Figure: deleting the minimum from a subtree rooted at 10 with children 5 and 20; node 5 is removed, and its right child 9 takes its place]

71

BST Remove
void BST::removehelp(BinNode*& rt, int val) {
  if (rt == NULL)
    cout << val << " is not in tree.\n";
  else if (val < key(rt->value()))
    removehelp(rt->left, val);      // Check left
  else if (val > key(rt->value()))
    removehelp(rt->right, val);     // Check right
  else {                            // Found it: remove
    BinNode* temp = rt;
    if (rt->left == NULL)           // Only a right child:
      rt = rt->right;               //   point to right
    else if (rt->right == NULL)     // Only a left child:
      rt = rt->left;                //   point to left
    else {                          // Both non-empty:
      temp = deletemin(rt->right);  //   replace with min
      rt->setValue(temp->value());  //   in right subtree
    }
    delete temp;                    // Free up space
  }
}

[Figure: removing 37 from the example BST; its value is replaced by 40, the minimum of its right subtree]

72

Cost of BST Operations

Find, insert, and remove each cost the depth of the node in question: Θ(n) in the worst case, Θ(log n) in the average case. Building a tree of n nodes by repeated insertion thus averages Θ(n log n).

73

Heaps
Heap: a complete binary tree with the Heap Property:
• Min-heap: every value is less than or equal to its children's values.
• Max-heap: every value is greater than or equal to its children's values.

The values in a heap are partially ordered.

Heap representation: normally the array-based complete binary tree representation.

74

Building the Heap

[Figure: two input orderings of the values 1-7 in a complete tree, (a) and (b), and the max-heap produced from each]

(a) requires exchanges (4-2), (4-1), (2-1), (5-2), (5-4), (6-3), (6-5), (7-5), (7-6).
(b) requires exchanges (5-2), (7-3), (7-1), (6-1).

How do we get a good number of exchanges? By induction: heapify the root's subtrees, then push the root down to its correct level.

75

Heap ADT
class heap {               // Max-heap class
private:
  Elem* Heap;              // Pointer to heap array
  int size;                // Maximum size of heap
  int n;                   // Number of Elems in heap
  void siftdown(int);      // Put Elem in place
public:
  heap(Elem*, int, int);   // Constructor
  int heapsize() const;    // Return current size
  bool isLeaf(int) const;  // TRUE if pos is a leaf
  int leftchild(int) const;   // Return L child position
  int rightchild(int) const;  // Return R child position
  int parent(int) const;      // Return parent position
  void insert(const Elem);    // Insert value into heap
  Elem removemax();           // Remove maximum value
  Elem remove(int);           // Remove specified value
  void buildheap();           // Heapify contents
};
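The notes declare insert and removemax without bodies; a minimal sketch consistent with siftdown (next slide) and the remove code on the priority queue slide:

void heap::insert(const Elem val) {  // Insert into heap
  assert(n < size);
  Heap[n] = val;                     // Place at end
  for (int i=n++; (i != 0) &&        // Sift up while larger
       (key(Heap[i]) > key(Heap[parent(i)])); i = parent(i))
    swap(Heap, i, parent(i));
}

Elem heap::removemax() {             // Remove maximum value
  assert(n > 0);
  swap(Heap, 0, --n);                // Swap max with last value
  if (n != 0) siftdown(0);           // Restore heap property
  return Heap[n];                    // Return former maximum
}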

76

Siftdown

For fast heap construction:
• Work from the high end of the array to the low end.
• Call siftdown for each item.
• Don't need to call siftdown on leaf nodes.

void heap::buildheap()     // Heapify contents
{ for (int i=n/2-1; i>=0; i--) siftdown(i); }

void heap::siftdown(int pos) {  // Put Elem in place
  assert((pos >= 0) && (pos < n));
  while (!isLeaf(pos)) {
    int j = leftchild(pos);
    if ((j < (n-1)) && (key(Heap[j]) < key(Heap[j+1])))
      j++;   // j now index of child with greater value
    if (key(Heap[pos]) >= key(Heap[j])) return;  // Done
    swap(Heap, pos, j);
    pos = j; // Move down
  }
}

Cost for heap construction:

  Σ_{i=1}^{log n} (i − 1)(n/2^i) ≈ n,

where i − 1 is the number of steps down and n/2^i is the number of nodes at that level.

77

Priority Queues
A priority queue stores objects, and on request releases the object with greatest value.

Example: scheduling jobs in a multi-tasking operating system. The priority of a job may change, requiring some reordering of the jobs.

Implementation: use a heap to store the priority queue. To support priority reordering, delete and re-insert. We need to know the index of the object.

// Remove value at specified position
Elem heap::remove(int pos) {
  assert((pos >= 0) && (pos < n));
  swap(Heap, pos, --n);    // Swap with last value
  while ((pos != 0) &&     // Push up if large
         (key(Heap[pos]) > key(Heap[parent(pos)]))) {
    swap(Heap, pos, parent(pos));
    pos = parent(pos);
  }
  if (n != 0) siftdown(pos);  // Push down if small
  return Heap[n];
}

78

General Trees
A tree T is a finite set of one or more nodes such that there is one designated node r called the root of T, and the remaining nodes in T − {r} are partitioned into k ≥ 0 disjoint subsets T1, T2, ..., Tk, each of which is a tree, and whose roots r1, r2, ..., rk, respectively, are children of r.

Note: the subsets are disjoint because a node cannot have two parents.

[Figure: a tree rooted at R showing a node V, its parent P, its children C1 and C2, its siblings S1 and S2, the ancestors of V, and the subtree rooted at V]

79

General Tree ADT
class GTNode {
public:
  GTNode(const Elem);        // Constructor
  ~GTNode();                 // Destructor
  Elem value();              // Return node's value
  bool isLeaf();             // TRUE if is a leaf
  GTNode* parent();          // Return parent
  GTNode* leftmost_child();  // Return first child
  GTNode* right_sibling();   // Return right sibling
  void setValue(Elem);       // Set node's value
  void insert_first(GTNode* n);  // Insert first child
  void insert_next(GTNode* n);   // Insert right sibling
  void remove_first();       // Remove first child
  void remove_next();        // Remove right sibling
};

class GenTree {
public:
  GenTree();                 // Constructor
  ~GenTree();                // Destructor
  void clear();              // Free nodes
  GTNode* root();            // Return root
  void newroot(Elem, GTNode*, GTNode*);  // Combine
};

80

General Tree Traversal
void print(GTNode* rt) {  // Preorder traverse from root
  if (rt->isLeaf()) cout << "Leaf: ";
  else cout << "Internal: ";
  cout << rt->value() << "\n";  // Print or take action
  GTNode* temp = rt->leftmost_child();
  while (temp != NULL) {
    print(temp);
    temp = temp->right_sibling();
  }
}

[Figure: tree with root R, children A and B; A's children are C, D, E, and B's child is F]

The preorder traversal visits the nodes in the order R A C D E B F.

81

Parent Pointer Implementation

[Figure: two trees; the first rooted at R with children A and B (A's children C, D, E; B's child F), the second rooted at W with children X, Y, Z]

Node Index:     0  1  2  3  4  5  6  7  8  9  10
Label:          R  A  B  C  D  E  F  W  X  Y  Z
Parent's Index: -  0  0  1  1  1  2  -  7  7  7

The parent pointer representation is good for answering: are two elements in the same tree?

bool Gentree::differ(int a, int b) {
  // Are nodes a and b in different trees?
  GTNode* root1 = FIND(&array[a]);  // Find a's root
  GTNode* root2 = FIND(&array[b]);  // Find b's root
  return root1 != root2;            // Compare roots
}

Examples of equivalence classes: connected components in graphs; point clustering.

82

Equivalence Classes

When joining equivalence classes, we want to keep the depth small.

Weighted union rule: join the tree with fewer nodes to the tree with more nodes. This limits depth to log n for n nodes, since less than half of the nodes can increase in depth by 1 at each union.

Path compression: make all nodes visited point to the root. This gives nearly constant cost per operation.

class GTNode {              // General tree node
public:
  GTNode* par;              // Parent pointer
  GTNode() { par = NULL; }  // Constructor
  GTNode* parent() { return par; }  // Return parent
};

class Gentree {             // General tree: UNION/FIND
private:
  GTNode* array;            // Node array
  int size;                 // Size of node array
  GTNode* FIND(GTNode*) const;  // Find root
public:
  Gentree(int);             // Constructor
  ~Gentree();               // Destructor
  void UNION(int, int);     // Merge equivalences
  bool differ(int, int);    // TRUE if not in same tree
};

83

UNION-FIND
Gentree::Gentree(int sz) {  // Constructor
  size = sz;
  array = new GTNode[sz];   // Create node array
}

Gentree::~Gentree() {       // Destructor
  delete [] array;          // Free node array
}

bool Gentree::differ(int a, int b) {
  // Are nodes a and b in different trees?
  GTNode* root1 = FIND(&array[a]);  // Find a's root
  GTNode* root2 = FIND(&array[b]);  // Find b's root
  return root1 != root2;            // Compare roots
}

void Gentree::UNION(int a, int b) { // Merge subtrees
  GTNode* root1 = FIND(&array[a]);  // Find a's root
  GTNode* root2 = FIND(&array[b]);  // Find b's root
  if (root1 != root2) root2->par = root1;  // Merge
}
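The UNION above ignores the weighted union rule. A sketch that applies it, assuming each root also tracks the number of nodes in its tree (a count field that is not part of the class shown earlier):

void Gentree::UNION(int a, int b) {   // Weighted union
  GTNode* root1 = FIND(&array[a]);    // Find a's root
  GTNode* root2 = FIND(&array[b]);    // Find b's root
  if (root1 == root2) return;         // Already merged
  if (root1->count >= root2->count) { // Smaller tree goes under
    root2->par = root1;               //   the larger tree's root
    root1->count += root2->count;
  } else {
    root1->par = root2;
    root2->count += root1->count;
  }
}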

84

Equivalence Processing Example

[Figure (a): ten single-node trees A through J, indices 0 through 9]

Process (A,B), (C,H), (G,F), (D,E), (I,F):

[Figure (b): trees {A,B}, {C,H}, {F,G,I}, {D,E}, {J}, with the parent array updated accordingly]

Process (H,A), (E,G):

[Figure (c): trees {A,B,C,H}, {D,E,F,G,I}, {J}]

Process (H,E):

[Figure (d): one tree containing A through I; J remains alone]

85

Path Compression Example

[Figure: the final tree from the previous example after a FIND with path compression; every node visited on the search path now points directly to the root]

GTNode* Gentree::FIND(GTNode* curr) const {
  while (curr->parent() != NULL) curr = curr->par;
  return curr;  // At root
}
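The FIND above does not compress paths; a recursive sketch that makes every visited node point directly to the root:

GTNode* Gentree::FIND(GTNode* curr) const {
  if (curr->parent() == NULL) return curr;  // At root
  curr->par = FIND(curr->par);  // Point directly to root
  return curr->par;
}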

86

Lists of Children

[Figure/table: an array-based representation where each slot stores a node's value, its parent's index, and a pointer to a linked list of the indices of its children]

It is hard to find a node's right sibling in this representation.

87

Leftmost Child/Right Sibling

[Figure/table: an array-based representation where each slot stores the index of the node's leftmost child, the node's value, its parent's index, and the index of its right sibling; the example shows two trees stored together]

Note: two trees can share the same array.

88

Linked Implementations

[Figure: (a) the example tree and (b) an implementation in which each node stores its value, its size (number of children), and an array of child pointers]

Allocate the child pointer space when the node is created.

[Figure: (a) the example tree and (b) an implementation in which each node holds a linked list of pointers to its children]

89

Sequential Implementations

List the node values in the order they would be visited by a preorder traversal. This saves space, but allows only sequential access. We need to retain the tree structure for reconstruction.

For binary trees: use a symbol to mark NULL links. Ex:

  AB/D//CEG///FH//I//

For full binary trees: mark each node as a leaf or an internal node. (A NULL mark is still needed in this example since the tree shown is not full.) Ex:

  A'B'/DC'E'G/F'HI

For general trees: mark the end of each subtree. Ex:

  RAC)D)E))BF)))
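A sketch of the binary tree version, writing "/" for NULL links during a preorder traversal (uses the BinNode class from earlier; writing to cout is an assumption):

void serialize(BinNode* rt) {   // Preorder with NULL marks
  if (rt == NULL) { cout << "/"; return; }  // Mark empty link
  cout << rt->value();          // Visit the node first
  serialize(rt->leftchild());   // Then the left subtree
  serialize(rt->rightchild());  // Then the right subtree
}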
90

Convert to Binary Tree

The leftmost child/right sibling representation essentially stores a binary tree. Use this process to convert any general tree to a binary tree. (A "dynamic" implementation of leftmost child/right sibling.)

A forest is a collection of one or more general trees.

[Figure: (a) a forest and (b) the equivalent binary tree]

91

Graphs
A graph G = (V, E) consists of a set of vertices V and a set of edges E, such that each edge in E is a connection between a pair of vertices in V.

The number of vertices is written |V|, and the number of edges is written |E|.

A sequence of vertices v1, v2, ..., vn forms a path of length n − 1 if there exist edges from vi to vi+1 for 1 ≤ i < n.

A path is simple if all vertices on the path are distinct.

A cycle is a path of length 3 or more that connects vi to itself. A cycle is simple if the path is simple, except for the first and last vertices being the same.

92

Graph Definitions (cont)

An undirected graph is connected if there is at least one path from any vertex to any other.

The maximal connected subgraphs of an undirected graph are called connected components.

A graph without cycles is acyclic. A directed graph without cycles is a directed acyclic graph or DAG.

A free tree is a connected, undirected graph with no simple cycles. Equivalently, a free tree is connected and has |V| − 1 edges.

[Figure: (a) an undirected graph, (b) a directed graph, (c) a labeled graph with edge weights]

93

Connected Components
[Figure: an undirected graph on vertices 0 through 7 and its connected components]

94

Graph Representations

Adjacency Matrix: Θ(|V|²) space.
Adjacency List: Θ(|V| + |E|) space.

[Figure: (a) a directed graph on vertices 0-4, (b) its adjacency matrix, (c) its adjacency list]

[Figure: (a) the undirected version of the graph, (b) its adjacency matrix, (c) its adjacency list]

Instead of bits, the graph could store edge weights.

95

Implementation: Adjacency Matrix
class EdgeClass {
public:
  int v1, v2;
  EdgeClass(int in1, int in2) { v1 = in1; v2 = in2; }
};
typedef EdgeClass* Edge;

class Graph {              // Adjacency matrix
private:
  int** matrix;            // The edge matrix
  int numVertex;           // Number of vertices
  int numEdge;             // Number of edges
  bool* Mark;              // The mark array
public:
  Graph();                 // Constructor
  ~Graph();                // Destructor
  int n();                 // Number of graph vertices
  int e();                 // Number of edges for graph
  Edge first(int);         // Get vertex first edge
  bool isEdge(Edge);       // TRUE if this is an edge
  Edge next(Edge);         // Get vertex next edge
  int v1(Edge);            // Vertex edge comes from
  int v2(Edge);            // Vertex edge goes to
  int weight(int, int);    // Weight of edge
  int weight(Edge);        // Weight of edge
  bool getMark(int);       // Return a Mark value
  void setMark(int, bool); // Set a Mark value
};

For completeness, we would also add setedge and deledge.

96

Adjacency Matrix Functions
Edge Graph::first(int v) {  // Get vertex first edge
  for (int i=0; i<numVertex; i++)
    if (matrix[v][i] != NOEDGE)
      return new EdgeClass(v, i);
  return NULL;
}

bool Graph::isEdge(Edge w)  // TRUE if this is an edge
{ return (w != NULL) && (matrix[w->v1][w->v2] != NOEDGE); }

Edge Graph::next(Edge w) {  // Get vertex next edge
  for (int i=w->v2+1; i<numVertex; i++)
    if (matrix[w->v1][i] != NOEDGE)
      return new EdgeClass(w->v1, i);
  return NULL;
}

int Graph::weight(int i, int j) {  // Return edge weight
  if (matrix[i][j] == NOEDGE) return INFINITY;
  else return matrix[i][j];
}

int Graph::weight(Edge w) {  // Return weight of edge
  if ((w == NULL) || (matrix[w->v1][w->v2] == NOEDGE))
    return INFINITY;
  else return matrix[w->v1][w->v2];
}

int Graph::v1(Edge w) { return w->v1; }  // Comes from
int Graph::v2(Edge w) { return w->v2; }  // Goes to

bool Graph::getMark(int v) { return Mark[v]; }
void Graph::setMark(int v, bool val) { Mark[v] = val; }

97

Implementation: Adjacency List
class EdgeLink {           // Singly-linked list node
public:
  int weight;              // Edge weight
  int v1, v2;              // Edge vertices
  EdgeLink* next;          // Pointer to next list edge
  EdgeLink(int vt1, int vt2, int w, EdgeLink* nxt=NULL)
  { v1 = vt1; v2 = vt2; weight = w; next = nxt; }
  EdgeLink(EdgeLink* nxt =NULL) { next = nxt; }
};
typedef EdgeLink* Edge;

class Graph {              // Graph class: adjacency list
private:
  Edge* list;              // The vertex list
  int numVertex, numEdge;  // Number of vertices, edges
  bool* Mark;              // The mark array
public:
  Graph();                 // Constructor
  ~Graph();                // Destructor
  int n();                 // Number of vertices
  int e();                 // Number of edges
  Edge first(int);         // Get vertex first edge
  bool isEdge(Edge);       // TRUE if this is an edge
  Edge next(Edge);         // Get vertex next edge
  int v1(Edge);            // Vertex edge comes from
  int v2(Edge);            // Vertex edge goes to
  int weight(int, int);    // Weight of edge
  int weight(Edge);        // Weight of edge
  bool getMark(int);       // Return a Mark value
  void setMark(int, bool); // Set a Mark value
};

98

Adjacency List Functions
Edge Graph::first(int v)    // Get first edge for vertex
{ return list[v]; }

bool Graph::isEdge(Edge w)  // TRUE if this is an edge
{ return w != NULL; }

Edge Graph::next(Edge w) {  // Get next edge for vertex
  if (w == NULL) return NULL;
  else return w->next;
}

int Graph::v1(Edge w) { return w->v1; }  // Comes from
int Graph::v2(Edge w) { return w->v2; }  // Goes to

int Graph::weight(int i, int j) {  // Return edge weight
  for (Edge curr = list[i]; curr != NULL; curr = curr->next)
    if (curr->v2 == j) return curr->weight;
  return INFINITY;
}

int Graph::weight(Edge w) {  // Return weight of edge
  if (w == NULL) return INFINITY;
  else return w->weight;
}

99

Graph Traversals
Some applications require visiting every vertex in the graph exactly once. The application may require that vertices be visited in some special order based on graph topology.

Example: Artificial Intelligence
• The problem domain consists of many "states."
• We need to get from the Start State to the Goal State.
• Start and Goal are typically not directly connected.

To ensure we visit all vertices:

void graph_traverse(Graph& G) {
  for (int v=0; v<G.n(); v++)
    G.setMark(v, UNVISITED);  // Initialize mark bits
  for (int v=0; v<G.n(); v++)
    if (G.getMark(v) == UNVISITED)
      do_traverse(G, v);
}

Two traversals we will talk about: DFS and BFS.

100

Depth First Search
void DFS(Graph& G, int v) {  // Depth first search
  PreVisit(G, v);            // Take appropriate action
  G.setMark(v, VISITED);
  for (Edge w = G.first(v); G.isEdge(w); w = G.next(w))
    if (G.getMark(G.v2(w)) == UNVISITED)
      DFS(G, G.v2(w));
  PostVisit(G, v);           // Take appropriate action
}

Cost: Θ(|V| + |E|).

[Figure: (a) a graph on vertices A-F and (b) the order in which DFS traverses its edges]

The directions are imposed by the traversal. This is the Depth First Search Tree.

101

Breadth First Search
Like DFS, but replace the stack with a queue: visit the vertex's neighbors before continuing deeper in the tree.

void BFS(Graph& G, int start) {
  Queue Q(G.n());
  Q.enqueue(start);
  G.setMark(start, VISITED);
  while (!Q.isEmpty()) {
    int v = Q.dequeue();
    PreVisit(G, v);          // Take appropriate action
    for (Edge w = G.first(v); G.isEdge(w); w = G.next(w))
      if (G.getMark(G.v2(w)) == UNVISITED) {
        G.setMark(G.v2(w), VISITED);
        Q.enqueue(G.v2(w));
      }
    PostVisit(G, v);         // Take appropriate action
  }
}

[Figure: (a) the same graph and (b) the Breadth First Search Tree]
102

Topological Sort

Problem: Given a set of jobs, courses, etc. with prerequisite constraints, output the jobs in an order that does not violate any of the prerequisites.

[Figure: a DAG of jobs J1 through J7 with prerequisite edges]

void topsort(Graph& G) {  // Topological sort: recursive
  for (int i=0; i<G.n(); i++)   // Initialize Mark array
    G.setMark(i, UNVISITED);
  for (int i=0; i<G.n(); i++)   // Process all vertices
    if (G.getMark(i) == UNVISITED)
      tophelp(G, i);            // Call helper function
}

void tophelp(Graph& G, int v) { // Helper function
  G.setMark(v, VISITED);        // No PreVisit operation
  for (Edge w = G.first(v); G.isEdge(w); w = G.next(w))
    if (G.getMark(G.v2(w)) == UNVISITED)
      tophelp(G, G.v2(w));
  printout(v);                  // PostVisit for vertex v
}

This prints the vertices in reverse topological order.

103

Queue-based Topological Sort
void topsort(Graph& G) {   // Topological sort: queue
  Queue Q(G.n());
  int Count[G.n()];
  for (int v=0; v<G.n(); v++) Count[v] = 0;  // Init
  for (int v=0; v<G.n(); v++)    // Process every edge
    for (Edge w=G.first(v); G.isEdge(w); w=G.next(w))
      Count[G.v2(w)]++;    // Add to v2's prereq count
  for (int v=0; v<G.n(); v++)    // Initialize queue
    if (Count[v] == 0)     // Vertex has no prereqs
      Q.enqueue(v);
  while (!Q.isEmpty()) {   // Process the vertices
    int v = Q.dequeue();
    printout(v);           // PreVisit for vertex v
    for (Edge w=G.first(v); G.isEdge(w); w=G.next(w)) {
      Count[G.v2(w)]--;    // One less prerequisite
      if (Count[G.v2(w)] == 0)  // Vertex is now free
        Q.enqueue(G.v2(w));
    }
  }
}

104

Shortest Paths Problems
Input: a graph with weights or costs associated with each edge.
Output: the list of edges forming the shortest path.

Sample problems:
• Find the shortest path between two specified vertices.
• Find the shortest path from vertex S to all other vertices.
• Find the shortest path between all pairs of vertices.

Our algorithms will actually calculate only distances.

105

Shortest Paths De nitions
d(A, B) is the shortest distance from vertex A to B.
w(A, B) is the weight of the edge connecting A to B.
• If there is no such edge, then w(A, B) = ∞.

[Figure: a directed graph with edges A→B (10), A→C (3), A→D (20), C→B (2), C→E (15), B→D (5), D→E (11)]

Here w(A, D) = 20, but d(A, D) = 10, through the path A, C, B, D.

106

Single Source Shortest Paths
Given a start vertex s, find the shortest path from s to all other vertices.

Try 1: Visit the vertices in some order, compute shortest paths for all vertices seen so far, then add the shortest path to the next vertex x.

Problem: the shortest path to a vertex already processed might go through x.

Solution: process the vertices in order of distance from s.

107

Dijkstra's Algorithm Example

[Figure: the directed graph from the shortest-paths definitions slide]

             A   B   C   D   E
Initial:     0   ∞   ∞   ∞   ∞
Process A:   0   10  3   20  ∞
Process C:   0   5   3   20  18
Process B:   0   5   3   10  18
Process D:   0   5   3   10  18
Process E:   0   5   3   10  18

108

Dijkstra's Algorithm: Array
void Dijkstra(Graph& G, int s) { // Use array
  int D[G.n()];
  for (int i=0; i<G.n(); i++)    // Initialize
    D[i] = INFINITY;
  D[s] = 0;
  for (i=0; i<G.n(); i++) {      // Process vertices
    int v = minVertex(G, D);
    if (D[v] == INFINITY) return; // Unreachable
    G.setMark(v, VISITED);
    for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
      if (D[G.v2(w)] > (D[v] + G.weight(w)))
        D[G.v2(w)] = D[v] + G.weight(w);
  }
}

int minVertex(Graph& G, int* D) { // Get mincost vertex
  int v;                          // Initialize v to any unvisited vertex
  for (int i=0; i<G.n(); i++)
    if (G.getMark(i) == UNVISITED) { v = i; break; }
  for (i++; i<G.n(); i++)         // Now find smallest D value
    if ((G.getMark(i) == UNVISITED) && (D[i] < D[v]))
      v = i;
  return v;
}

Approach 1: Scan the table on each pass for the closest vertex. Total cost: Θ(|V|² + |E|) = Θ(|V|²).

109

Dijkstra's Algorithm: Priority Queue
class Elem { public: int vertex, dist; };
int key(Elem x) { return x.dist; }

void Dijkstra(Graph& G, int s) { // W/ priority queue
  int v;                         // The current vertex
  int D[G.n()];                  // Distance array
  Elem temp;
  Elem E[G.e()];                 // Heap array
  temp.dist = 0; temp.vertex = s;
  E[0] = temp;                   // Initialize heap
  heap H(E, 1, G.e());           // Create the heap
  for (int i=0; i<G.n(); i++) D[i] = INFINITY;
  D[s] = 0;
  for (i=0; i<G.n(); i++) {      // Now, get distances
    do { temp = H.removemin(); v = temp.vertex; }
    while (G.getMark(v) == VISITED);
    G.setMark(v, VISITED);
    if (D[v] == INFINITY) return; // Rest unreachable
    for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
      if (D[G.v2(w)] > (D[v] + G.weight(w))) {
        D[G.v2(w)] = D[v] + G.weight(w); // Update D
        temp.dist = D[G.v2(w)]; temp.vertex = G.v2(w);
        H.insert(temp);          // Insert new distance in heap
      }
  }
}

Approach 2: Store the unprocessed vertices in a min-heap used as a priority queue ordered by D value. Must update the priority queue for each edge. Total cost: Θ((|V| + |E|) log |V|).
110

All Pairs Shortest Paths
For every pair of vertices u, v ∈ V, calculate d(u, v). Could run Dijkstra's Algorithm |V| times. Cost: |V||E| log |V|, which is |V|³ log |V| for a dense graph. Better is Floyd's Algorithm.

Define a k-path from u to v to be any path whose intermediate vertices all have indices less than k.
The issue is how to efficiently check all the paths without computing any path more than once.

[Figure: an example directed graph on vertices 0-3 with weighted edges.]

0,3 is a 0-path. 2,0,3 is a 1-path. 0,2,3 is a 3-path, but not a 2- or 1-path. Everything is a 4-path.

111

Floyd's Algorithm
void Floyd(Graph& G) {           // All-pairs shortest paths
  int D[G.n()][G.n()];           // Store distances
  for (int i=0; i<G.n(); i++)    // Initialize D
    for (int j=0; j<G.n(); j++)
      D[i][j] = G.weight(i, j);
  for (int k=0; k<G.n(); k++)    // Compute all k paths
    for (int i=0; i<G.n(); i++)
      for (int j=0; j<G.n(); j++)
        if (D[i][j] > (D[i][k] + D[k][j]))
          D[i][j] = D[i][k] + D[k][j];
}

112

Minimum Cost Spanning Trees
Minimum Cost Spanning Tree (MST) Problem:
• Input: An undirected, connected graph G.
• Output: The subgraph of G that (1) has minimum total cost as measured by summing the values for all of the edges in the subset, and (2) keeps the vertices connected.

[Figure: an example graph on vertices A-F with weighted edges (weights 1, 1, 2, 2, 5, 6, 7, 9).]

113

Prim's MST Algorithm
void Prim(Graph& G, int s) { // Prim's MST alg
  int D[G.n()];              // Distance to vertex
  int V[G.n()];              // Who's closest
  for (int i=0; i<G.n(); i++) // Initialize
    D[i] = INFINITY;
  D[s] = 0;
  for (i=0; i<G.n(); i++) {  // Process vertices
    int v = minVertex(G, D);
    G.setMark(v, VISITED);
    if (v != s) AddEdgetoMST(V[v], v); // Add to MST
    if (D[v] == INFINITY) return;      // Rest unreachable
    for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
      if (D[G.v2(w)] > G.weight(w)) {
        D[G.v2(w)] = G.weight(w);      // Update distance,
        V[G.v2(w)] = v;                // who came from
      }
  }
}

int minVertex(Graph& G, int* D) { // Min cost vertex
  int v;                          // Initialize v to any unvisited vertex
  for (int i=0; i<G.n(); i++)
    if (G.getMark(i) == UNVISITED) { v = i; break; }
  for (i=0; i<G.n(); i++)         // Now find smallest value
    if ((G.getMark(i) == UNVISITED) && (D[i] < D[v]))
      v = i;
  return v;
}

This is an example of a greedy algorithm.
114

Alternative Prim's Implementation
Like Dijkstra's algorithm, we can implement Prim's algorithm with a priority queue.
void Prim(Graph& G, int s) { // W/ priority queue
  int v;                     // The current vertex
  int D[G.n()];              // Distance array
  int V[G.n()];              // Who's closest
  Elem temp;
  Elem E[G.e()];             // Heap array
  temp.dist = 0; temp.vertex = s;
  E[0] = temp;               // Initialize heap array
  heap H(E, 1, G.e());       // Create the heap
  for (int i=0; i<G.n(); i++) D[i] = INFINITY; // Init
  D[s] = 0;
  for (i=0; i<G.n(); i++) {  // Now build MST
    do { temp = H.removemin(); v = temp.vertex; }
    while (G.getMark(v) == VISITED);
    G.setMark(v, VISITED);
    if (v != s) AddEdgetoMST(V[v], v); // Add to MST
    if (D[v] == INFINITY) return;      // Rest unreachable
    for (Edge w = G.first(v); G.isEdge(w); w=G.next(w))
      if (D[G.v2(w)] > G.weight(w)) {  // Update D
        D[G.v2(w)] = G.weight(w);
        V[G.v2(w)] = v;                // Who came from
        temp.dist = D[G.v2(w)]; temp.vertex = G.v2(w);
        H.insert(temp);                // Insert distance in heap
      }
  }
}

115

Proof of Prim's MST Algorithm
Theorem 14.1: Prim's algorithm produces a minimum cost spanning tree.

Proof by contradiction: Order the vertices by how they are added to the MST by Prim's algorithm: v1, v2, ..., vn. Let edge ei connect (vx, vi+1), x < i. Let ej be the lowest numbered (first) edge added by the algorithm such that the set of edges selected so far cannot be extended to form an MST for G. Let V1 = {v1, ..., vj}. Let V2 = {vj+1, ..., vn}.

[Figure: marked vertices vi, i < j, on one side; unmarked vertices vi, i ≥ j, on the other; Prim's edge ej and a "correct" edge e′ both cross between the two sets, connected through intermediate vertices u, w, and vp.]

116

Kruskal's MST Algorithm
void Kruskal(Graph& G) { // Kruskal's MST algorithm
  Gentree A(G.n());      // Equivalence class array
  Elem E[G.e()];         // Array of edges for min-heap
  int edgecnt = 0;
  for (int i=0; i<G.n(); i++) // Put edges in array
    for (Edge w = G.first(i); G.isEdge(w); w = G.next(w)) {
      E[edgecnt].weight = G.weight(w);
      E[edgecnt++].edge = w;
    }
  heap H(E, edgecnt, edgecnt); // Heapify the edges
  int numMST = G.n();          // Init w/ n equiv classes
  for (i=0; numMST>1; i++) {   // Combine equiv classes
    Elem temp = H.removemin(); // Get next cheap edge
    Edge w = temp.edge;
    int v = G.v1(w); int u = G.v2(w);
    if (A.differ(v, u)) {      // If different equiv classes
      A.UNION(v, u);           // Combine equiv classes
      AddEdgetoMST(G.v1(w), G.v2(w)); // Add to MST
      numMST--;                // One less MST
    }
  }
}

How do we compute the function MSTof(v)? Solution: Use the Parent Pointer representation to merge equivalence classes. A sketch appears below.
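As a rough illustration of the Gentree interface used above (not the book's actual class; the weighted-union and path-compression refinements are omitted), a parent-pointer tree might look like:

// Hedged sketch of a parent-pointer equivalence-class structure
// supporting the differ/UNION interface used by Kruskal above.
class Gentree {
  int* parent;             // parent[i] = parent of i, or -1 for a root
public:
  Gentree(int sz) {
    parent = new int[sz];
    for (int i = 0; i < sz; i++) parent[i] = -1;
  }
  ~Gentree() { delete [] parent; }
  int find(int i) {        // Follow parent pointers to the root
    while (parent[i] != -1) i = parent[i];
    return i;
  }
  bool differ(int a, int b) { return find(a) != find(b); }
  void UNION(int a, int b) {  // Merge two equivalence classes
    int ra = find(a), rb = find(b);
    if (ra != rb) parent[ra] = rb;
  }
};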
117

Kruskal's Algorithm Example

Time is dominated by the cost of the initial edge sort. Total cost: Θ(|V| + |E| log |E|).

Alternative: Use a min-heap and quit when only one set is left (a "Kth-smallest" implementation).

[Figure: Initial: each of A-F is its own equivalence class. Step 1: process edge (C, D), weight 1. Step 2: process edge (E, F), weight 1. Step 3: process edge (C, F), weight 2.]
118

Sorting
Each record contains a field called the key. Linear order: comparison.

a < b and b < c ⇒ a < c.

The Sorting Problem

Given a sequence of records R1, R2, ..., Rn with key values k1, k2, ..., kn, respectively, arrange the records into any order s such that records Rs1, Rs2, ..., Rsn have keys obeying the property ks1 ≤ ks2 ≤ ... ≤ ksn. That is, put the keys in ascending order.

Measures of cost:
• Comparisons
• Swaps

119

Insertion Sort
void inssort(Elem* array, int n) { // Insertion Sort
  for (int i=1; i<n; i++)          // Insert i'th record
    for (int j=i; (j>0) && (key(array[j])<key(array[j-1])); j--)
      swap(array, j, j-1);
}

Initial: 42 20 17 13 28 14 23 15
i=1:     20 42 17 13 28 14 23 15
i=2:     17 20 42 13 28 14 23 15
i=3:     13 17 20 42 28 14 23 15
i=4:     13 17 20 28 42 14 23 15
i=5:     13 14 17 20 28 42 23 15
i=6:     13 14 17 20 23 28 42 15
i=7:     13 14 15 17 20 23 28 42

Best Case: 0 swaps, n − 1 comparisons.
Worst Case: n²/2 swaps and compares.
Average Case: n²/4 swaps and compares.

Good best case performance.

120

Bubble Sort
void bubsort(Elem* array, int n) { // Bubble Sort
  for (int i=0; i<n-1; i++)        // Bubble up i'th record
    for (int j=n-1; j>i; j--)
      if (key(array[j]) < key(array[j-1]))
        swap(array, j, j-1);
}

Using the test "j > i" saves a factor of 2 over "j > 0".

Initial: 42 20 17 13 28 14 23 15
i=0:     13 42 20 17 14 28 15 23
i=1:     13 14 42 20 17 15 28 23
i=2:     13 14 15 42 20 17 23 28
i=3:     13 14 15 17 42 20 23 28
i=4:     13 14 15 17 20 42 23 28
i=5:     13 14 15 17 20 23 42 28
i=6:     13 14 15 17 20 23 28 42

Best Case: n²/2 compares, 0 swaps.
Worst Case: n²/2 compares, n²/2 swaps.
Average Case: n²/2 compares, n²/4 swaps.

No redeeming features to this sort.

121

Selection Sort
void selsort(Elem* array, int n) { // Selection Sort
  for (int i=0; i<n-1; i++) {      // Select i'th record
    int lowindex = i;              // Remember its index
    for (int j=n-1; j>i; j--)      // Find the least value
      if (key(array[j]) < key(array[lowindex]))
        lowindex = j;
    swap(array, i, lowindex);      // Put it in place
  }
}

Select the value to go in the i'th position.

Initial: 42 20 17 13 28 14 23 15
i=0:     13 20 17 42 28 14 23 15
i=1:     13 14 17 42 28 20 23 15
i=2:     13 14 15 42 28 20 23 17
i=3:     13 14 15 17 28 20 23 42
i=4:     13 14 15 17 20 28 23 42
i=5:     13 14 15 17 20 23 28 42
i=6:     13 14 15 17 20 23 28 42

Best Case: 0 swaps (n − 1 as written), n²/2 compares.
Worst Case: n − 1 swaps, n²/2 compares.
Average Case: Θ(n) swaps, n²/2 compares.

122

Pointer Swapping
[Figure: (a) swapping two large records directly; (b) swapping only the pointers to the records (keys 42 and 5).]

Useful for large records.

123

Exchange Sorting
Summary:

                Insertion   Bubble   Selection
Comparisons:
  Best Case        n          n²        n²
  Average Case     n²         n²        n²
  Worst Case       n²         n²        n²
Swaps:
  Best Case        0          0         n
  Average Case     n²         n²        n
  Worst Case       n²         n²        n

All of these sorts rely on exchanges of adjacent records. What is the average number of exchanges required? n²/4, based on the average distance from a record to its sorted position.

124

Shellsort
void shellsort(Elem* array, int n) { // Shellsort
  for (int i=n/2; i>2; i/=2)         // For each increment
    for (int j=0; j<i; j++)          // Sort each sublist
      inssort2(&array[j], n-j, i);
  inssort2(array, n, 1);
}

// Version of Insertion Sort for varying increments
void inssort2(Elem* A, int n, int incr) {
  for (int i=incr; i<n; i+=incr)
    for (int j=i; (j>=incr) && (key(A[j])<key(A[j-incr])); j-=incr)
      swap(A, j, j-incr);
}
Initial:              59 20 17 13 28 14 23 83 36 98 11 70 65 41 42 15
8 lists of length 2:  36 20 11 13 28 14 23 15 59 98 17 70 65 41 42 83
4 lists of length 4:  28 14 11 13 36 20 17 15 59 41 23 70 65 98 42 83
2 lists of length 8:  11 13 17 14 23 15 28 20 36 41 42 70 59 83 65 98
1 list of length 16:  11 13 14 15 17 20 23 28 36 41 42 59 65 70 83 98

Average case cost: O(n^1.5).

Any increments will work, provided the last is 1. Shellsort takes advantage of Insertion Sort's best case performance.

125

Quicksort
Divide and Conquer: divide list into values less than pivot and values greater than pivot.
Initial call:
qsort(array, 0, n-1);

void qsort(Elem* array, int i, int j) { // Quicksort
  int pivotindex = findpivot(array, i, j);
  swap(array, pivotindex, j);           // Swap pivot to end
  // k will be the first position in the right subarray
  int k = partition(array, i-1, j, key(array[j]));
  swap(array, k, j);                    // Put pivot in place
  if ((k-i) > 1) qsort(array, i, k-1);  // Sort left
  if ((j-k) > 1) qsort(array, k+1, j);  // Sort right
}

int findpivot(Elem* array, int i, int j) { return (i+j)/2; }

126

Quicksort Partition
int partition(Elem* array, int l, int r, int pivot) {
  do {  // Move the bounds inward until they meet
    while (key(array[++l]) < pivot);        // Move right
    while (r && (key(array[--r]) > pivot)); // Move left
    swap(array, l, r);  // Swap out-of-place values
  } while (l < r);      // Stop when they cross
  swap(array, l, r);    // Reverse wasted swap
  return l;             // Return first pos in right partition
}
Initial (pivot 60 already swapped to the end):
          72  6 57 88 85 42 83 73 48 60
Pass 1:   l stops at 72, r stops at 48
Swap 1:   48  6 57 88 85 42 83 73 72 60
Pass 2:   l stops at 88, r stops at 42
Swap 2:   48  6 57 42 85 88 83 73 72 60
Pass 3:   l stops at 85, r stops at 42 (l and r have crossed)
Swap 3:   48  6 57 85 42 88 83 73 72 60
Reverse:  48  6 57 42 85 88 83 73 72 60

The cost of Partition is Θ(n).

127

Quicksort Example
[Figure: the Quicksort recursion tree.]
Initial:          72 6 57 88 60 42 83 73 48 85   (Pivot = 60)
After partition:  48 6 57 42 | 60 | 88 83 73 72 85
Left sublist (Pivot = 6):   6 | 42 57 48; then (Pivot = 57): 42 48 | 57; (Pivot = 42) finishes.
Right sublist (Pivot = 73): 72 | 73 | 85 88 83; then (Pivot = 88): 85 83 | 88; (Pivot = 85) finishes.

Final Sorted Array: 6 42 48 57 60 72 73 83 85 88

128

Cost for Quicksort
Best Case: Always partition in half: Θ(n log n).
Worst Case: Bad partition: Θ(n²).
Average Case:
T(n) = n + 1 + (1/(n−1)) Σ_{k=1}^{n−1} (T(k) + T(n−k)) = Θ(n log n).

Optimizations for Quicksort:
• Better pivot (see the sketch below).
• Use a better algorithm for small sublists.
• Eliminate recursion.
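As one example of a better pivot rule, a median-of-three findpivot could replace the simple middle-element version given earlier. This is a common variant, not the version printed above:

// Hedged sketch: median-of-three pivot selection. Returns the index
// of the median of the first, middle, and last keys, which avoids
// worst-case partitions on sorted or nearly sorted input.
int findpivot(Elem* array, int i, int j) {
  int mid = (i+j)/2;
  int a = key(array[i]), b = key(array[mid]), c = key(array[j]);
  if ((a <= b && b <= c) || (c <= b && b <= a)) return mid;
  if ((b <= a && a <= c) || (c <= a && a <= b)) return i;
  return j;
}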

129

Mergesort
List mergesort(List inlist) {
  if (inlist.length() <= 1) return inlist;
  List l1 = half of the items from inlist;
  List l2 = other half of the items from inlist;
  return merge(mergesort(l1), mergesort(l2));
}

36 20 17 13 28 14 23 15
20 36 13 17 14 28 15 23
13 17 20 36 14 15 23 28
13 14 15 17 20 23 28 36

130

Mergesort Implementation
Mergesort is tricky to implement.
void mergesort(Elem* array, Elem* temp, int left, int right) {
  int mid = (left+right)/2;
  if (left == right) return;            // List of one element
  mergesort(array, temp, left, mid);    // Sort 1st half
  mergesort(array, temp, mid+1, right); // Sort 2nd half
  for (int i=left; i<=right; i++)       // Copy to temp
    temp[i] = array[i];
  // Do the merge operation back to array
  int i1 = left; int i2 = mid + 1;
  for (int curr=left; curr<=right; curr++) {
    if (i1 == mid+1)                    // Left sublist exhausted
      array[curr] = temp[i2++];
    else if (i2 > right)                // Right sublist exhausted
      array[curr] = temp[i1++];
    else if (key(temp[i1]) < key(temp[i2]))
      array[curr] = temp[i1++];
    else array[curr] = temp[i2++];
  }
}

Mergesort cost: Θ(n log n).

Note: This requires a second array.

Mergesort is good for sorting linked lists.

Send records to alternating linked lists, mergesort each, then merge.

131

Optimized Mergesort
void mergesort(Elem* array, Elem* temp, int left, int right) {
  int i, j, k, mid = (left+right)/2;
  if (left == right) return;
  mergesort(array, temp, left, mid);    // Sort 1st half
  mergesort(array, temp, mid+1, right); // Sort 2nd half
  // Do merge operation. First, copy halves to temp.
  for (i=left; i<=mid; i++) temp[i] = array[i];
  for (j=1; j<=right-mid; j++) temp[right-j+1] = array[j+mid];
  // Merge sublists back to array
  for (i=left, j=right, k=left; k<=right; k++)
    if (key(temp[i]) < key(temp[j])) array[k] = temp[i++];
    else array[k] = temp[j--];
}

The right sublist is copied in reverse order, so each sublist increases toward the middle. This means that each list serves as a sentinel for the other, so no test for end-of-list is required.

132

Heapsort
Heapsort uses a max-heap.
void heapsort(Elem* array, int n) { // Heapsort
  heap H(array, n, n);              // Build the heap
  for (int i=0; i<n; i++)           // Now sort
    H.removemax();                  // Value placed at end of heap
}

Cost of Heapsort: Θ(n log n). Cost of finding the k largest elements: Θ(k log n + n).
Time to build the heap: Θ(n). Time to remove the maximum element: Θ(log n). Compare to sorting with a BST: the BST is expensive in space overhead and has potentially bad balance, and it does not take advantage of having all records available in advance. The heap is space efficient, balanced, and building the initial heap is efficient.
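For instance, finding the k largest values might look like the following sketch, assuming (as an illustration; neither signature is fixed by the notes) that removemax() returns the removed element and that output() prints a record:

// Hedged sketch: report the k largest values using the heap class
// from the heapsort example above.
void kLargest(Elem* array, int n, int k) {
  heap H(array, n, n);        // Build the heap: Theta(n)
  for (int i=0; i<k; i++)     // k removals: Theta(k log n)
    output(H.removemax());
}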

133

Heapsort Example
Original numbers: 73  6 57 88 60 42 83 72 48 85
Build heap:       88 85 83 72 73 42 57  6 48 60
Remove 88:        85 73 83 72 60 42 57  6 48 | 88
Remove 85:        83 73 57 72 60 42 48  6 | 85 88
Remove 83:        73 72 57  6 60 42 48 | 83 85 88

[Figure: the heap at each step, also drawn as a binary tree.]

134

Binsort
A simple, efficient sort:

for (i=0; i<n; i++)
  B[key(A[i])] = A[i];

This only works when the keys are a permutation of 0 to n−1.

Ways to generalize:
• Make each bin the head of a list. (Supports duplicate keys.)
• Allow more keys than records, i.e., a larger key space.
void binsort(ELEM *A, int n) {
  list B[MaxKeyValue];
  for (i=0; i<n; i++)
    B[key(A[i])].append(A[i]);    // Place record in its bin
  for (i=0; i<MaxKeyValue; i++)   // Output bins in key order
    for (B[i].first(); B[i].isInList(); B[i].next())
      output(B[i].currValue());
}

Cost: Θ(n)? Oops! Actually Θ(n · MaxKeyValue), and MaxKeyValue could be Θ(n²) or worse.

135

Radix Sort
Initial List: 27 91 1 97 17 23 84 28 72 5 67 25

First pass (on the rightmost digit), using bins 0-9:
Result of first pass:  91 1 72 23 84 5 25 27 97 17 67 28

Second pass (on the leftmost digit), using bins 0-9:
Result of second pass:  1 5 17 23 25 27 28 67 72 84 91 97

136

Cost of Radix Sort
void radix(Elem* A, Elem* B, int n, int k, int r, int* count) {
  // count[i] stores number of records in bin[i]
  for (int i=0, rtok=1; i<k; i++, rtok*=r) { // k digits
    for (int j=0; j<r; j++) count[j] = 0;    // Initialize
    // Count the number of records for each bin on this pass
    for (j=0; j<n; j++) count[(key(A[j])/rtok)%r]++;
    // Index B: count[j] will be the index for the last slot of bin j
    for (j=1; j<r; j++) count[j] = count[j-1] + count[j];
    // Put records into bins, working from the bottom of each bin.
    // Since bins fill from the bottom, j counts downwards.
    for (j=n-1; j>=0; j--) B[--count[(key(A[j])/rtok)%r]] = A[j];
    for (j=0; j<n; j++) A[j] = B[j];         // Copy B to A
  }
}

Cost: Θ(nk + rk). How do n, k and r relate? r can be viewed as a constant. k ≥ log_r n if there are n distinct keys.

137

Radix Sort Example
Initial Input: Array A:  27 91 1 97 17 23 84 28 72 5 67 25

First pass values for Count (rtok = 1):
  digit:  0  1  2  3  4  5  6  7  8  9
  count:  0  2  1  1  1  2  0  4  1  0

Count array after prefix sums (index positions for Array B):
  0  2  3  4  5  7  7 11 12 12

End of Pass 1: Array A:  91 1 72 23 84 5 25 27 97 17 67 28

Second pass values for Count (rtok = 10):
  digit:  0  1  2  3  4  5  6  7  8  9
  count:  2  1  4  0  0  0  1  1  1  2

Count array after prefix sums (index positions for Array B):
  2  3  7  7  7  7  8  9 10 12

End of Pass 2: Array A:  1 5 17 23 25 27 28 67 72 84 91 97
138

Empirical Comparison
Algorithm        10      100    1,000      10K      30K
Insert. Sort   .00200   .1833   18.13    1847.0    16544
Bubble Sort    .00233   .2267   22.47    2274.0    20452
Selec. Sort    .00167   .0967    2.17     900.3     8142
Shellsort      .00233   .0600    1.00      17.0       59
Shellsort O    .00233   .0500     .93      16.3       65
QSort          .00367   .0500     .63       7.3       24
QSort O        .00200   .0300     .43       5.7       18
Merge          .00700   .0700     .87      10.7       35
Merge O        .00133   .0267     .37       5.0       16
Heapsort       .00900   .1767    2.67      36.3      122
Rad Sort 1     .02433   .2333    2.30      23.3       69
Rad Sort 4     .00700   .0600     .60       6.0       17
Rad Sort 8     .00967   .0333     .30       3.0        8

Algorithm        10      100    1,000      10K     100K
Insert. Sort   .0002    .0170    1.68     168.8    23382
Bubble Sort    .0003    .0257    2.55     257.2    41874
Selec. Sort    .0003    .0273    2.65     267.5    40393
Shellsort      .0003    .0027    0.12       1.9       40
Shellsort O    .0003    .0060    0.11       1.8       33
QSort          .0004    .0057    0.08       0.9       12
QSort O        .0002    .0040    0.06       0.8       10
Merge          .0009    .0130    0.17       2.3       30
Merge O        .0003    .0067    0.11       1.5       21
Heapsort       .0010    .0173    0.26       3.5       49
Rad Sort 1     .0123    .1197    1.21      12.5      135
Rad Sort 4     .0035    .0305    0.30       3.2       34
Rad Sort 8     .0047    .0183    0.16       1.6       18

139

Sorting Lower Bound
Want to prove a lower bound for all possible sorting algorithms. Sorting is O(n log n). Sorting I/O takes Ω(n) time. We will now prove an Ω(n log n) lower bound. Form of proof:
• Comparison-based sorting can be modeled by a binary tree.
• The tree must have Ω(n!) leaves.
• The tree must be Ω(n log n) levels deep.

140

Decision Trees
[Figure: the decision tree for sorting three values X, Y, Z; each internal node compares two values, and each leaf is one of the 3! = 6 permutations XYZ, XZY, YXZ, YZX, ZXY, ZYX.]

There are n! permutations, and at least one leaf for each permutation. A binary tree with n! leaves has at least log(n!) levels. The worst case is the deepest leaf in the decision tree: log(n!) = Ω(n log n).
141

Primary vs. Secondary Storage
Primary Storage: Main memory (RAM).
Secondary Storage: Peripheral devices:
• Disk Drives
• Tape Drives

Medium               Price    Price per Mbyte
32MB RAM             $225     $7.00/MB
1.4MB floppy disk    $.50     $0.36/MB
2.1GB disk drive     $210     $0.10/MB
1GB JAZ cassette     $100     $0.10/MB
2GB cartridge tape   $20      $0.01/MB

RAM is usually volatile. RAM is about 1/4 million times faster than disk.

142

Golden Rule of File Processing
Minimize the number of disk accesses!
1. Arrange information so that you get what you want with few disk accesses.
2. Arrange information to minimize future disk accesses.
An organization for data on disk is often called a file structure.
Disk-based space/time tradeoff: Compress information to save processing time by reducing disk accesses.

143

Disk Drives
[Figure: (a) a disk drive: platters on a spindle, with read/write heads on a boom arm; (b) one platter surface divided into tracks and sectors, with intersector gaps and bits of data along each track.]

Spacing loses 1/2 of the potential data density.

CD-ROM: a spiral with equally spaced dots, read with variable speed rotation.

144

Sectors
[Figure: (a) sectors 1-8 in logical order around a track; (b) the same track with an interleaving factor, so logically adjacent sectors are physically separated.]

A sector is the basic unit of I/O.
Interleaving factor: the physical distance between logically adjacent sectors on a track, to allow time for processing of sector data.
Locality of Reference: If a record is read from disk, the next request is likely to come from near the same place in the file.
Cluster: The smallest unit of file allocation (several sectors).
Extent: A group of physically contiguous clusters.
Internal Fragmentation: Wasted space within a sector if the record size does not match the sector size, or wasted space within a cluster if the file size is not a multiple of the cluster size.
145

Factors affecting disk access time
1. Seek time: time for the I/O head to reach the desired track. Largely determined by the distance between the I/O head and the desired track. If seeking at random, the expected distance is n = (disk width)/3. Seek time: f(n) = t·n + s, where t is the time to traverse one track and s is the startup time for the I/O head.
2. Rotational delay (or latency): time for the data to rotate to the I/O head position. One disk rotation at 3600 RPM takes 16.7 msec, so one half rotation takes 16.7/2 = 8.3 msec. (Newer disks run at 7200 RPM.)
3. Transfer time: time for the data to move under the I/O head: (number of sectors read / number of sectors per track) × 16.7 msec.

146

Disk Access Cost Example

675 Mbyte disk drive:
• 15 platters ⇒ 45 Mbyte/platter.
• 612 tracks/platter.
• 150 sectors/track, 512 bytes/sector.
• 8 sectors/cluster (4K bytes/cluster) ⇒ 18 clusters/track.
• Interleaving factor of 3 ⇒ 3 revolutions (50.1 msec) to read one track.

How long to read a file of 128 Kbytes divided into 256 records of 512 bytes? The file occupies 32 clusters.

If the file fills the minimum number of tracks (150 sectors of one track, 106 of the next):
Total time: (612/3 × 0.08 + 3 + 3.5 × 16.7) + (0.08 + 3 + 3.5 × 16.7) = 139.3 msec.

If the clusters are spread randomly across the disk:
32 × (612/3 × 0.08 + 3 + 16.7/2 + (24/150) × 16.7) = 32 × 30.3 = 969.6 msec.

147

Magnetic Tape
Example: 9 track tape at 6250 bytes per inch (bpi). At 2400 feet, this yields 170 Mbytes for $20, or $0.12/Mbyte. Workstation/PC cartridge tape is similar.
Magnetic tape requires sequential access.
Magnetic tape has two speeds:
• High speed for "skipping."
• Low speed for "reading."
The interblock gap is the space required for the I/O head to recognize the beginning of a record at high speed. This is a significant amount of space compared to a typical record size.
Blocking Factor: the number of records in a block between gaps.
148

Buffers
Read time for one track: 612/3 × 0.08 + 3 + 3.5 × 16.7 = 77.8 msec.
Read time for one sector: 612/3 × 0.08 + 3 + 16.7/2 + 16.7/150 = 27.8 msec.
Read time for one byte: 612/3 × 0.08 + 3 + 16.7/2 = 27.7 msec.
Nearly all disk drives read/write one sector at every I/O access.
• Also called a page.
The information in a sector is stored in a buffer or cache. If the next I/O access is to the same buffer, then there is no need to go to disk. There are usually one or more input buffers and one or more output buffers.
149

Buffer Pools
A series of buffers used by an application to cache disk data is called a buffer pool. Virtual memory uses a buffer pool to imitate greater RAM memory by actually storing information on disk and "swapping" between disk and RAM.
Organization for buffer pools: which one to use next?
• First-in, First-out: Use the first one on the queue.
• Least Frequently Used (LFU): Count buffer accesses, pick the least used.
• Least Recently Used (LRU): Keep buffers on a linked list. When a buffer is accessed, bring it to the front. Reuse the one at the end. (A sketch follows.)
Double Buffering: Read data from disk for the next buffer while the CPU is processing the previous buffer.
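A minimal sketch of the LRU bookkeeping, with buffers identified by block number and a small array standing in for the linked list (names and structure are illustrative, not from the notes):

// Hedged sketch: LRU buffer pool bookkeeping. The most recently
// used buffer sits at position 0; the slot at the end is the
// least recently used and is reused when a new block is needed.
const int POOLSIZE = 8;
int pool[POOLSIZE];   // Block number held by each buffer slot
int used = 0;         // Number of slots currently occupied

void access(int block) {
  int i;
  for (i = 0; i < used; i++)
    if (pool[i] == block) break;  // Already buffered?
  if (i == used) {                // Not found: load it
    if (used < POOLSIZE) used++;  // Grow until pool is full
    i = used - 1;                 // Victim: least recently used slot
    // (A real pool would write back the victim's data here.)
  }
  for (; i > 0; i--)              // Shift others down one slot
    pool[i] = pool[i-1];
  pool[0] = block;                // Block is now most recently used
}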
150

Programmer's View of Files
Logical view of files:
• An array of bytes.
• A file pointer marks the current position.
Three fundamental operations:
• Read bytes from the current position (moves the file pointer).
• Write bytes to the current position (moves the file pointer).
• Set the file pointer to a specified byte position.

151

C/C++ File Functions
FILE *fopen(char *filename, char *mode);
int fclose(FILE *stream);

Mode examples:
• "rb": open a binary file, read-only.
• "w+t": create a text file for reading and writing.
size_t fread(void *ptr, size_t size, size_t n, FILE *stream);

if (numrec != fread(recarr, sizeof(rec), numrec, myfile))
  its_an_error();

size_t fwrite(void *ptr, size_t size, size_t n, FILE *stream);
int fseek(FILE *stream, long offset, int whence);

where whence is one of:
• SEEK_SET: Offset from the file beginning.
• SEEK_CUR: Offset from the current position.
• SEEK_END: Offset from the end of the file.
152

External Sorting
Problem: Sorting data sets too large to fit in main memory.
• Assume the data is stored on a disk drive.
To sort, portions of the data must be brought into main memory, processed, and returned to disk.
An external sort should minimize disk accesses.

153

Model of External Computation
Secondary memory is divided into equal-sized blocks (512, 2048, 4096 or 8192 bytes are typical sizes). The basic I/O operation transfers the contents of one disk block to/from main memory.
Under certain circumstances, reading blocks of a file in sequential order is more efficient. When? (1) Adjacent logical blocks of the file are physically adjacent on disk. (2) There is no competition for the I/O head.
Typically, the time to perform a single block I/O operation is sufficient to Quicksort the contents of the block. Thus, our primary goal is to minimize the number of block I/O operations.
Most workstations today must do all sorting on a single disk drive. So, the algorithm presented here is general for these conditions.

154

Key Sorting
Often records are large while keys are small.
• Ex: Payroll entries keyed on ID number.
Approach 1: Read in entire records, sort them, then write them out again.
Approach 2: Read only the key values; store with each key the location on disk of its associated record. If necessary, after the keys are sorted the records can be read and rewritten in sorted order.
But this is not usually done: (1) it is expensive (random access to all records), and (2) if there are multiple keys, there is no "correct" order.

155

External Sort: Simple Mergesort
Quicksort requires random access to the entire set of records. Better: a modified Mergesort algorithm.
• Process n elements in Θ(log n) passes.
1. Split the file into two files.
2. Read in a block from each file.
3. Take the first record from each block; output them in sorted order.
4. Take the next record from each block; output them to a second file in sorted order.
5. Repeat until finished, alternating between output files. Read new input blocks as needed.
6. Repeat steps 2-5, except this time the input files have groups of two sorted records that are merged together.
7. Each pass through the files provides larger and larger groups of sorted records.
A group of sorted records is called a run.
156

Problems with Simple Mergesort
Runs of length 1:  36 17 28 23 20 13 14 15
Runs of length 2:  20 36 14 28 13 17 15 23
Runs of length 4:  13 17 20 36 14 15 23 28

Is each pass through the input and output files sequential? Yes.
What happens if all work is done on a single disk drive? Competition for the I/O head eliminates the advantage of sequential processing.
How can we reduce the number of Mergesort passes? Read in a block (or several blocks) and do an in-memory sort to generate large initial runs.
In general, external sorting consists of two phases:
1. Break the file into initial runs.
2. Merge the runs together into a single sorted run.
157

Breaking a file into runs
General approach:
• Read as much of the file into memory as possible.
• Perform an in-memory sort.
• Output this group of records as a single run.

158

Replacement Selection
1. Break available memory into an array for the heap, an input buffer, and an output buffer.
2. Fill the array from disk.
3. Make a min-heap.
4. Send the smallest value (the root) to the output buffer.
5. If the next key in the file is greater than the last value output, then replace the root with this key; else replace the root with the last key in the array, and add the next record in the file to a new heap (actually, stick it at the end of the array).

[Figure: data flow: Input File → Input Buffer → RAM (heap) → Output Buffer → Output Run File.]

159

Example of Replacement Selection
[Figure: a step-by-step trace of replacement selection with three memory slots, showing the input value read, the memory contents, and the value output at each step. The first value output is 12; input values smaller than the last value output are set aside for the next run.]

160

Bene t from Replacement Selection
Use double buffering to overlap input, processing, and output. How many disk drives give the greatest advantage?
Snowplow argument:
• A snowplow moves around a circular track onto which snow falls at a steady rate.
• At any instant, there is a certain amount of snow S on the track. Some falling snow lands in front of the plow, some behind.
• During the next revolution of the snowplow, all of this is removed, plus 1/2 of what falls during that revolution.
• Thus, the plow removes 2S amount of snow.
Is this always true?

[Figure: the snowplow on a circular track at start time T, with existing snow, falling and future snow, and the direction of movement marked.]

161

Simple Mergesort may not be Best
Simple Mergesort: Place the runs into two files.
• Merge the first two runs to the output file, then the next two runs, etc. This process is repeated until only one run remains.
• How many passes for r initial runs? Θ(log r).
Is there benefit from sequential reading? Not if everything is on one disk drive.
Is working memory well used? No: only 2 blocks are used.
Need a way to reduce the number of passes, and to use memory better.

162

Multiway Merge
With replacement selection, each initial run is several blocks long. Assume that each run is placed in a separate disk file. We could then read the first block from each file into memory and perform an r-way merge. When a buffer becomes empty, read a block from the appropriate run file. Each record is read only once from disk during the merge process. In practice, use only one file and seek to the appropriate block.
[Figure: three input runs (5 10 15 ..., 6 7 23 ..., 12 18 20 ...) merged through an output buffer into 5 6 7 10 12 ....]
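A rough in-memory sketch of the r-way merge idea, using the C++ standard library for brevity (the real implementation works over run files and fixed-size buffers, which this illustration omits):

#include <functional>
#include <queue>
#include <vector>

// Hedged sketch: merge r sorted runs (vectors standing in for run
// files) into one sorted output using a min-priority queue.
std::vector<int> multimerge(const std::vector<std::vector<int> >& runs) {
  typedef std::pair<int, size_t> Item;       // (value, run index)
  std::priority_queue<Item, std::vector<Item>,
                      std::greater<Item> > pq;
  std::vector<size_t> pos(runs.size(), 0);   // Next unread slot per run
  for (size_t r = 0; r < runs.size(); r++)   // Prime with first values
    if (!runs[r].empty()) pq.push(Item(runs[r][0], r));
  std::vector<int> out;
  while (!pq.empty()) {
    Item t = pq.top(); pq.pop();
    out.push_back(t.first);                  // Smallest value overall
    size_t r = t.second;
    if (++pos[r] < runs[r].size())           // Refill from that run
      pq.push(Item(runs[r][pos[r]], r));
  }
  return out;
}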

163

Limits to Single Pass Multiway Merge
Assume working memory is b blocks in size. How many runs can be processed at one time? The runs are 2b blocks long on average (because of replacement selection). How big a file can be merged in one pass? 2b² blocks.

Larger files will need more passes, but the run size grows quickly! This approach trades Θ(log b) (possibly sequential) passes for a single (or very few) random block access passes.

Example: 128K of working memory is 32 4K blocks. With replacement selection, we get 256K-length runs on average. One merge pass: 8MB. Two merge passes: 256MB. Three merge passes: 8GB.

In k merge passes, we can process 2b^(k+1) blocks.

164

General Principles of External Sorting
In summary, a good external sorting algorithm will seek to do the following:
• Make the initial runs as long as possible.
• At all stages, overlap input, processing, and output as much as possible.
• Use as much working memory as possible. Applying more memory usually speeds processing.
• If possible, use additional disk drives for more overlapping of processing with I/O, and to allow for more sequential file processing.

165

Search
Given: distinct keys k1, k2, ..., kn and a collection T of n records of the form (k1, I1), (k2, I2), ..., (kn, In), where Ij is information associated with key kj for 1 ≤ j ≤ n.
Search Problem: For a key value K, locate the record (kj, Ij) in T such that kj = K.
Searching is a systematic method for locating the record (or records) with key value kj = K.
A successful search is one in which a record with key kj = K is found.
An unsuccessful search is one in which no record with kj = K is found (and presumably no such record exists).

166

Approaches to Search
1. Sequential and list methods (lists, tables, arrays).
2. Direct access by key value (hashing).
3. Tree indexing methods.

167

Searching Ordered Arrays
Sequential search, or, on a sorted array, binary search:
int binary(int K, int* array, int left, int right) {
  // Return position of element in array (if any) with value K
  int l = left-1;
  int r = right+1;        // l and r are beyond array bounds
  while (l+1 != r) {      // Stop when l and r meet
    int i = (l+r)/2;      // Look at middle of subarray
    if (K < array[i]) r = i;      // In left half
    if (K == array[i]) return i;  // Found it
    if (K > array[i]) l = i;      // In right half
  }
  return UNSUCCESSFUL;    // Search value not in array
}

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Key:      11 13 21 26 29 36 40 41 45 51 54 56 65 72 77 83

Dictionary Search: estimate the likely position of K within the array, as when searching a dictionary.
168

Lists Ordered by Frequency

Order lists by expected frequency of occurrence.
• Perform sequential search.
Cost to access the first record: 1. Cost to access the second record: 2.
Expected search cost: C(n) = 1p1 + 2p2 + ... + npn, where pi is the probability of the ith record being accessed.

Example: all records have equal frequency:
C(n) = Σ_{i=1}^{n} i/n = (n + 1)/2.

Example: exponential frequency:
pi = 1/2^i if 1 ≤ i ≤ n − 1; pi = 1/2^(n−1) if i = n.
(The second case makes the probabilities sum to 1.)
C(n) ≈ Σ_{i=1}^{n} i/2^i ≈ 2.
169

Zipf Distributions
Applications:
• Distribution for frequency of word usage in natural languages.
• Distribution for populations of cities, etc.
Definition: the Zipf frequency for item i in a distribution of n records is 1/(i·Hn), where

Hn = Σ_{i=1}^{n} 1/i ≈ log_e n.

C(n) = Σ_{i=1}^{n} i/(i·Hn) = n/Hn ≈ n/log_e n.

80/20 rule: 80% of the accesses are to 20% of the records. For distributions following the 80/20 rule, C(n) ≈ 0.122n.

170

Self-Organizing Lists
Self-organizing lists modify the order of records within the list based on the actual pattern of record accesses. Self-organizing lists use a rule called a heuristic for deciding how to reorder the list. These heuristics are similar to the rules for managing buffer pools. (Buffer pools can be viewed as a form of self-organizing list.)

• Count: Order by actual historical frequency of access. (Similar to the LFU buffer pool replacement strategy.) When a record is found, swap it with the first record on the list.
• Move-to-Front: When a record is found, move it to the front of the list. ("Not worse than twice the best arrangement.")
• Transpose: When a record is found, swap it with the record ahead of it. A bad example: alternately accessing the last two elements just keeps swapping them.

171

Example of Self-Organizing Tables
Application: text compression. Keep a table of words already seen, organized via the Move-to-Front heuristic. If a word has not yet been seen, send the word; otherwise, send its current index in the table.

"The car on the left hit the car I left." → "The car on 3 left hit 3 5 I 5."

This is similar in spirit to Ziv-Lempel coding.
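A toy sketch of the encoder, treating the table as a list of strings searched linearly (an illustration only; the notes give no implementation):

#include <iostream>
#include <string>
#include <vector>

// Hedged sketch: Move-to-Front word coding. Emits the word itself
// on first sight, or its current (1-based) table position otherwise;
// either way the word then moves to the front of the table.
void mtfEncode(const std::vector<std::string>& words) {
  std::vector<std::string> table;
  for (size_t w = 0; w < words.size(); w++) {
    size_t i = 0;
    while (i < table.size() && table[i] != words[w]) i++;
    if (i == table.size())
      std::cout << words[w] << ' ';        // New word: send it
    else {
      std::cout << (i + 1) << ' ';         // Seen: send its position
      table.erase(table.begin() + i);      // Remove from old spot
    }
    table.insert(table.begin(), words[w]); // Move to front
  }
  std::cout << '\n';
}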

172

Searching in Sets
For dense sets (small key range, many elements in the set), represent the set as a bit vector: bit i is 1 if element i is in the set, 0 otherwise. Then we can use logical bit operators.

Example: To find all primes that are odd numbers, AND the bit vector for the primes with the bit vector for the odd numbers:

0011010100010100 & 0101010101010101
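The same computation in code, with the two constants below being the example's bit vectors written for positions 0-15 (leftmost bit = position 0):

// Hedged sketch: set intersection via bitwise AND.
unsigned primes = 0x3514; // 0011 0101 0001 0100: {2,3,5,7,11,13}
unsigned odds   = 0x5555; // 0101 0101 0101 0101: {1,3,5,...,15}
unsigned oddPrimes = primes & odds; // 0x1514: {3,5,7,11,13}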

173

Hashing
Hashing: the process of mapping a key value to a position in a table.
A hash function maps key values to positions. It is denoted by h.
A hash table is an array that holds the records. It is denoted by T.
The hash table has M slots, indexed from 0 to M − 1.
For any value K in the key range and some hash function h, h(K) = i, 0 ≤ i < M, such that key(T[i]) = K.

174

Hashing continued
Hashing is appropriate only for sets (no duplicates). It is good for both in-memory and disk-based applications. Hashing answers the question "What record, if any, has key value K?" It is not good for range queries.
Example: Store n records with keys in the range 0 to n − 1.
• Store the record with key i in slot i.
• Use the hash function h(K) = K.

175

Collisions
A more reasonable example:
• Store about 1000 records with keys in the range 0 to 16,383.
• It is impractical to keep a hash table with 16,384 slots.
• We must devise a hash function to map the key range to a smaller table.
Given: hash function h, keys k1 and k2, and a slot β in the hash table. If h(k1) = β = h(k2), then k1 and k2 have a collision at β under h.
Search for the record with key K:
1. Compute the table location h(K).
2. Starting with slot h(K), locate the record containing key K using (if necessary) a collision resolution policy.
Collisions are inevitable in most applications.
• Example: among 23 people, two are likely to share a birthday.
176

Hash Functions
A hash function MUST return a value within the hash table range. To be practical, a hash function SHOULD evenly distribute the records stored among the hash table slots. Ideally, the hash function should distribute records with equal probability to all hash table slots. In practice, success depends on the distribution of the actual records stored. If we know nothing about the incoming key distribution, evenly distribute the key range over the hash table slots while avoiding obvious opportunities for clustering. If we have knowledge of the incoming distribution, use a distribution-dependent hash function.

177

Example Hash Functions
int h(int x) { return x % 16; }

This function depends entirely on the lower 4 bits of the key.

Mid-square method: square the key value, then take the middle r bits of the result for a hash table of 2^r slots. All bits of the key contribute to the result. A sketch follows.
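One possible rendering of mid-square, assuming (as an illustration) keys that fit in 16 bits so the square fits in 32 bits:

// Hedged sketch of the mid-square method: hash to the middle r
// bits of the squared key. Assumes the key fits in 16 bits.
int hMidSquare(unsigned key, int r) {
  unsigned sq = key * key;                // 32-bit square of the key
  int shift = (32 - r) / 2;               // Skip the low bits
  return (sq >> shift) & ((1u << r) - 1); // Keep the middle r bits
}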
Sum the ASCII values of the letters and take the result modulo M:

int h(char x[10]) {
  int i, sum;
  for (sum=0, i=0; i<10; i++)
    sum += (int) x[i];
  return sum % M;
}

This is only good if the sum is large compared to M.

178

ELF Hash
From the Executable and Linking Format (ELF), UNIX System V Release 4.
int ELFhash(char* key) {
  unsigned long h = 0;
  while (*key) {
    h = (h << 4) + *key++;
    unsigned long g = h & 0xF0000000L;
    if (g) h ^= g >> 24;
    h &= ~g;
  }
  return h % M;
}

179

Open Hashing
What to do when collisions occur? Open hashing treats each hash table slot as a bin (a linked list of the records hashing there).

[Figure: a 10-slot table using h(K) = K mod 10; slot 0 holds 1000 and 9530, slot 3 holds 3013, slot 7 holds 9877, 2007 and 1057, slot 9 holds 9879.]

180

Bucket Hashing
Divide the hash table slots into buckets.
• Example: 8 slots/bucket.
Include an overflow bucket. Records hash to the first slot of the bucket, and fill the bucket from there. Go to the overflow bucket if necessary. When searching, first check the proper bucket, then check the overflow bucket.

181

Closed Hashing
Closed hashing stores all records directly in the hash table. Each record i has a home position h(ki). If record i is to be inserted and another record already occupies i's home position, then another slot must be found to store i. The new slot is found by a collision resolution policy. Search must follow the same policy to find records not in their home slots.

182

Collision Resolution

During insertion, the goal of collision resolution is to find a free slot in the table.
Probe sequence: the series of slots visited during insert/search by following a collision resolution policy. Let β0 = h(K). Let (β0, β1, ...) be the series of slots making up the probe sequence.

void hashInsert(Elem R) {      // Insert R into hash table T
  int home;                    // Home position for R
  int pos = home = h(key(R));  // Initial position on sequence
  for (int i=1; key(T[pos]) != EMPTY; i++) {
    pos = (home + p(key(R), i)) % M;  // Next slot
    if (key(T[pos]) == key(R)) ERROR; // No duplicates
  }
  T[pos] = R;                  // Insert R
}

Elem hashSearch(int K) {       // Search for record w/ key K
  int home;                    // Home position for K
  int pos = home = h(K);       // Initial position on sequence
  for (int i = 1; (key(T[pos]) != K) &&
                  (key(T[pos]) != EMPTY); i++)
    pos = (home + p(K, i)) % M; // Next position on sequence
  if (key(T[pos]) == K) return T[pos]; // Found it
  else return UNSUCCESSFUL;    // K not in hash table
}

183

Linear Probing
Use the probe function
int p(int K, int i) { return i; }

This is called linear probing. Linear probing simply goes to the next slot in the table. If the bottom is reached, wrap around to the top. To avoid an infinite loop, one slot in the table must always be empty.

184

Linear Probing Example
[Figure: (a) an 11-slot hash table holding 1001, 9537, 3016, 9874, 2009, 9875 in slots 0, 1, 2, 7, 8, 9; (b) the same table after inserting 1052 into slot 10.]

Primary Clustering: Records tend to cluster in the table under linear probing, since the probabilities for which slot is used next are not the same.
For (a): prob(3) = 4/11, prob(4) = 1/11, prob(5) = 1/11, prob(6) = 1/11, prob(10) = 4/11.
For (b): prob(3) = 8/11; prob(4), prob(5), prob(6) = 1/11 each.

185

Improved Linear Probing
Instead of going to the next slot, skip by some constant c.
Warning: Pick M and c carefully. The probe sequence SHOULD cycle through all slots of the table, so pick c to be relatively prime to M.
There is still some clustering.
• Example: c = 2, h(k1) = 3, h(k2) = 5.
• The probe sequences for k1 and k2 are linked together.
If M = 10 with c = 2, then we have effectively created 2 hash tables (evens vs. odds).
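The probe function is a one-line change from linear probing (c here is an assumed constant, chosen relatively prime to M as discussed above):

const int c = 3;  // Step constant, relatively prime to M
int p(int K, int i) { return c*i; }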

186

Pseudo Random Probing
The ideal probe function would select the next slot on the probe sequence at random. An actual probe function cannot operate randomly (search must be able to retrace the same sequence).
Pseudo-random probing:
• Select a random permutation r1, r2, ..., rM−1 of the numbers from 1 to M − 1. All insertions and searches use the same permutation.
Example: hash table of size M = 101.
• r1 = 2, r2 = 5, r3 = 32.
• h(k1) = 30, h(k2) = 28.
• Probe sequence for k1: 30, 32, 35, 62.
• Probe sequence for k2: 28, 30, 33, 60.

187

Quadratic Probing
Set the i'th value in the probe sequence to be (h(K) + i²) mod M.
Example: M = 101.
• h(k1) = 30, h(k2) = 29.
• Probe sequence for k1: 30, 31, 34, 39.
• Probe sequence for k2: 29, 30, 33, 38.
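In code, the probe function is simply:

// Quadratic probing: the i'th probe is i^2 slots past home.
int p(int K, int i) { return i*i; }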

188

Double Hashing

Pseudo-random probing eliminates primary clustering. But if two keys hash to the same slot, they still follow the same probe sequence; this is called secondary clustering. To avoid secondary clustering, the probe sequence needs to be a function of the original key value, not just the home position.
Double hashing: p(K, i) = i · h2(K) for 0 ≤ i ≤ M − 1.
Be sure that all probe sequence constants are relatively prime to M.
Example: hash table of size M = 101.
• h(k1) = 30, h(k2) = 28, h(k3) = 30.
• h2(k1) = 2, h2(k2) = 5, h2(k3) = 5.
• Probe sequence for k1: 30, 32, 34, 36.
• Probe sequence for k2: 28, 33, 38, 43.
• Probe sequence for k3: 30, 35, 40, 45.
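One common choice for the second hash function (an illustration; the notes do not fix a particular h2) is:

// Double hashing: the step size depends on the key itself.
// Assumes M (the table size) is prime, so every step 1..M-1
// is relatively prime to M.
int h2(int K) { return 1 + (K % (M-1)); }
int p(int K, int i) { return i * h2(K); }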
189


Analysis of Closed Hashing
The load factor is α = N/M, where N is the number of records currently in the table.

[Figure: expected number of probes (1 to 5) as a function of the load factor α (0 to 1.0), with curves for insert and delete; costs stay low until roughly α = 0.5 and then climb steeply as α approaches 1.0.]
190

Deletion
1. Deleting a record must not hinder later searches.
2. We do not want to make positions in the hash table unusable because of deletion.
Both of these problems can be resolved by placing a special mark, called a tombstone, in place of the deleted record. A tombstone will not stop a search, but that slot can be used for future insertions.
Unfortunately, tombstones add to the average path length. Solutions:
1. Local reorganizations to try to shorten the average path length.
2. Periodically rehash the table (by order of most frequently accessed record).
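A sketch of deletion in the probe-sequence style of hashInsert/hashSearch above. Here TOMBSTONE is a hypothetical reserved key value (like EMPTY), and setkey is a hypothetical helper for marking a slot; neither appears in the notes:

// Hedged sketch: deletion with tombstones.
void hashDelete(int K) {
  int home;
  int pos = home = h(K);       // Follow K's probe sequence
  for (int i = 1; (key(T[pos]) != K) &&
                  (key(T[pos]) != EMPTY); i++)
    pos = (home + p(K, i)) % M;
  if (key(T[pos]) == K)
    setkey(T[pos], TOMBSTONE); // Searches pass through; inserts may reuse
}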
191

Indexing
Goals:
• Store large files.
• Support multiple search keys.
• Support efficient insert, delete, and range queries.
Entry sequenced file: Order records by time of insertion; use sequential search. Not practical as a database organization.
Index file: Organized; stores pointers to the actual records. Could be a tree or other data structure.
Primary key: A unique identifier for records. May be inconvenient for search.
Secondary key: An alternate search key, often not unique for each record. Often used as the search key.

192

Linear Indexing
Linear Index: an index file organized as a simple sequence of (key, record pointer) pairs, where the key values are in sorted order. If the index is too large to fit in main memory, a second-level index may be used.
Linear indexing is good for searching variable-length records, and also good for indexing an entry sequenced file. Linear indexing is poor for insert/delete.

[Figure: a linear index with sorted keys 37, 42, 52, 73, 98 pointing to the database records of an entry sequenced file.]

[Figure: a linear index stored on disk pages; a second-level index holds the first key on each disk page (1, 2003, 5894, 10528) over pages containing the sorted keys 1 2001 | 2003 5688 | 5894 9942 | 10528 10984.]
193

Inverted List
An inverted list is another term for a secondary index: a secondary key is associated with a primary key, which in turn locates the record.

Secondary Key   Primary Keys
Jones           AA10  AB12  AB39  FF37
Smith           AX33  AX35  ZX45
Zukowski        ZQ99

An improvement over a 1D list: a 2D table saves the space of duplicating secondary keys, but may waste space in rows. It is easier to update than a linear list, but still requires some shifting of values.
194

Inverted List Continued
Secondary Key Index: Jones → 0, Smith → 1, Zukowski → 3.

   Primary Key   Next
0  AA10          4
1  AX33          6
2  ZX45
3  ZQ99
4  AB12          5
5  AB39          7
6  AX35          2
7  FF37

Each secondary key's records form a linked list through the Next field.

195

Tree Indexing
A linear index is poor for insertion/deletion. A tree index can efficiently support all desired operations:
• Insert/delete.
• Multiple search keys (via multiple tree indices).
• Key range search.
Storing a tree index on disk causes additional problems:
1. The tree must be balanced, to minimize disk accesses.
2. Each path from the root to a leaf should cover few disk pages.

[Figure: (a) and (b): two BSTs on the keys 1-7 with different shapes; a badly arranged tree spreads a search path over more disk pages.]
196

2-3 Tree
A 2-3 Tree has the following properties:
1. A node contains one or two keys.
2. Every internal node has either two children (if it contains one key) or three children (if it contains two keys).
3. All leaves are at the same level in the tree, so the tree is always height balanced.
The 2-3 Tree also has a search tree property analogous to the BST. The advantage of the 2-3 Tree over the BST is that it can be updated at low cost.

[Figure: a 2-3 tree with root (18, 33); internal nodes (12), (23, 30), (48); leaves (10), (15), (20, 21), (24), (31), (45, 47), (50, 52).]

197

2-3 Tree Insertion
Insert 14: it joins the leaf containing 15, which becomes (14, 15).

[Figure: the tree after inserting 14.]

Insert 55 (always insert at a leaf node): the leaf (50, 52) gains 55 and splits; 52 is promoted, so the parent becomes (48, 52) with leaves (50) and (55).

[Figure: the tree after inserting 55.]

198

2-3 Tree Splitting
Insert 19: the leaf (20, 21) becomes overfull with 19, 20, 21 and splits; 20 is promoted into the parent (23, 30), which also becomes overfull and splits, promoting 23. The split propagates to the root (18, 33), which splits in turn, producing a new root (23).

[Figure: (a) the overfull leaf; (b) the leaf split that promotes 20; (c) the final tree with new root (23) above subtrees rooted at (18) and (33).]

All operations are local to the original search path.

199

B-Trees
The B-Tree is an extension of the 2-3 Tree. The B-Tree is now the standard file organization for applications requiring insertion, deletion, and key range searches.
1. B-Trees are always balanced.
2. B-Trees keep related records on a disk page, which takes advantage of locality of reference.
3. B-Trees guarantee that every node in the tree will be full at least to a certain minimum percentage. This improves space efficiency while reducing the typical number of disk fetches necessary during a search or update operation.

200

B-Trees Continued
A B-Tree of order m has the following properties:
• The root is either a leaf or has at least two children.
• Each node, except for the root and the leaves, has between m/2 and m children.
• All leaves are at the same level in the tree, so the tree is always height balanced.
A B-Tree node is usually selected to match the size of a disk block. A B-Tree node could have hundreds of children. A sketch of a possible node layout follows.
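As a rough illustration (the field names and layout are assumptions, not from the notes), an order-m B-Tree node holding int keys might be declared as:

// Hedged sketch of an order-m B-Tree node layout.
const int m = 128;      // Order, chosen so a node fills a disk block
struct BTNode {
  int numkeys;          // Number of keys currently stored
  int key[m-1];         // Keys, kept in sorted order
  BTNode* child[m];     // child[i] leads to keys less than key[i]
  bool isLeaf;          // True if this node has no children
};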

201

B-Tree Example
Search in a B-Tree is a generalization of search in a 2-3 Tree:
1. Perform a binary search on the keys in the current node. If the search key is found, then return the record. If the current node is a leaf node and the key is not found, then report an unsuccessful search.
2. Otherwise, follow the proper branch and repeat the process.

[Figure: a B-Tree with root (24); children (15, 20) and (33, 45, 48); leaves (10, 12), (18), (21, 23), (30, 31), (38), (47), (50, 52, 60).]

202

B+-Trees
The most commonly implemented form of the B-Tree is the B+-Tree. Internal nodes of the B+-Tree do not store records, only key values to guide the search. Leaf nodes store records or pointers to records. A leaf node may store more or fewer records than an internal node stores keys. The leaf nodes are linked together in a linked list.

[Figure: a B+-Tree (leaves hold up to 5 values; internal nodes hold 3 keys / 4 children) with root (33); internal nodes (18, 23) and (48); leaves (10 12 15), (18 19 20 21 22), (23 30 31), (33 45 47), (48 50 52), linked left to right.]
203

B+-Tree Insertion

Note the special rule for the root: it may have as few as two children.

[Figure: building a B+-Tree by insertion.
(a) Add 10, 12, 23, 33, 48: a single leaf (10 12 23 33 48).
(b) Add 50 (split): root (33) with leaves (10 12 23) and (33 48 50).
(c) Add 45, 52, 47 (split), 15, 31 (split), 21, 18, 20: root (18 33 48) with leaves (10 12 15), (18 20 21 23 31), (33 45 47), (48 50 52).
(d) Add 30 (split): root (33) with internal nodes (18 23) and (48); leaves (10 12 15), (18 20 21), (23 30 31), (33 45 47), (48 50 52).]
204

B+-Tree Deletion

Simple delete: delete 18 from the original example. The key is removed from its leaf, which remains at least half full, so nothing more is required.

[Figure: after deleting 18: root (33); internal nodes (18, 23) and (48); leaves (10 12 15), (19 20 21 22), (23 30 31), (33 45 47), (48 50 52).]

Delete 12 from the original example: the leaf underflows, so borrow from a sibling.

[Figure: after borrowing: root (33); internal nodes (19, 23) and (48); leaves (10 15 18), (19 20 21 22), (23 30 31), (33 45 47), (48 50 52).]
205

B-Tree Space Analysis

B+-Tree nodes are always at least half full. The B*-Tree splits two pages into three, and combines three pages into two; in this way, nodes are always 2/3 full. The asymptotic cost of search, insertion, and deletion of records from B-Trees, B+-Trees, and B*-Trees is Θ(log n), where the base of the log is the average branching factor of the tree.

Example: Consider a B+-Tree of order 100 with leaf nodes containing 100 records each.
• 1 level: Max: 100 records.
• 2 levels: Min: 2 leaves of 50, for 100 records. Max: 100 leaves with 100 records each, for 10,000 records.
• 3 levels: Min: 2 × 50 nodes of leaves, for 5000 records. Max: 100³ = 1,000,000 records.
• 4 levels: Min: 250,000 records (2 × 50 × 50 × 50). Max: 100 million records (100 × 100 × 100 × 100).

Ways to reduce the number of disk fetches:
• Keep the upper levels in memory.
• Manage B+-Tree pages with a buffer pool.
206
