
Unit I

• Introduction to Data Structure

Introduction
It is important for every Computer Science student to understand the concept of Information:
how it is organized and how it can be utilized.
What is Information?
Data is raw, unorganized facts that need to be processed. Data can be something simple and
seemingly random and useless until it is organized. If we arrange some data in an appropriate
sequence, it forms a structure and gives us a meaning. This structured data is called
Information. The basic unit of information in Computer Science is the bit (Binary Digit). So
Information has two components: Data and Structure.

Comparison chart

             Data                                   Information
Meaning:     Data is raw, unorganized facts that    When data is processed, organized,
             need to be processed. Data can be      structured or presented in a given
             something simple and seemingly         context so as to make it useful, it is
             random and useless until it is         called Information.
             organized.

Example:     Each student's test score is one       The class' average score or the
             piece of data.                         school's average score is the
                                                    information that can be concluded
                                                    from the given data.

Definition:  From Latin 'datum', meaning "that      Information is interpreted data.
             which is given". Data was the plural
             form of the singular datum. (M150
             adopts the general use of data as
             singular. Not everyone agrees.)

What is Data Structure?


1. A data structure is a systematic way of organizing and accessing data.
2. A data structure tries to structure data!
• Usually more than one piece of data
• Should define legal operations on the data
• The data might be grouped together (e.g. in a linked list)

Why Data Structures?


1. Data structures study how data are stored in a computer so that operations can be
implemented efficiently.
2. Data structures are especially important when you have a large amount of information.
3. Conceptual and concrete ways to organize data for efficient storage and manipulation.

Definition:- A data structure is a way of organizing data that considers not only the items stored,
but also their relationship to each other. Advance knowledge about the relationship between data
items allows designing of efficient algorithms for the manipulation of data.
Introduction to algorithm

The word "algorism" comes from the name of the 9th-century Persian mathematician Abu Abdullah
Muhammad bin Musa al-Khwarizmi; it originally meant the method of doing arithmetic using the
Indo-Arabic decimal system. It is also the root of the word "algorithm".

Definition: An algorithm is a well defined computational method that takes some value(s) as input
and produces some value(s) as output. In other words, an algorithm is a sequence of
computational steps that transforms input(s) into output(s).

An algorithm is correct if, for every input, it halts with the correct output. A correct algorithm solves
the given problem, whereas an incorrect algorithm might not halt at all on some input instance, or
it might halt with an answer other than the desired one.

Each algorithm must have

▪ Specification: Description of the computational procedure.


▪ Pre-conditions: The condition(s) on input.
▪ Body of the Algorithm: A sequence of clear and unambiguous instructions.
▪ Post-conditions: The condition(s) on output.

Q. Describe the characteristics of algorithms. Explain procedure and recursion in
algorithms.

Characteristics of Algorithms:

While designing an algorithm as a solution to a given problem, we must take care of the following
five important characteristics of an algorithm.

Finiteness:

An algorithm must terminate after a finite number of steps, and further, each step must be executable
in a finite amount of time. In order to establish a sequence of steps as an algorithm, it should be
shown that it terminates (in a finite number of steps) on all allowed inputs.

Definiteness (no ambiguity):

Each step of an algorithm must be precisely defined; the action to be carried out must be
rigorously and unambiguously specified for each case.

Inputs:

An algorithm has zero or more, but only a finite number of, inputs.

Output:
An algorithm has one or more outputs. The requirement of at least one output is obviously
essential because otherwise we cannot know the answer/solution provided by the algorithm. The
outputs have a specific relation to the inputs, where the relation is defined by the algorithm.

Effectiveness:

An algorithm should be effective. This means that each of the operations to be performed in an
algorithm must be sufficiently basic that it can, in principle, be done exactly and in a finite length
of time by a person using pencil and paper. It may be noted that the FINITENESS condition is a
special case of EFFECTIVENESS: if a sequence of steps is not finite, then it cannot be effective
either.

Algorithms are written in pseudo code that resembles programming languages like C and Pascal.

Example
Consider a simple algorithm for finding the factorial of n.
Algorithm Factorial (n)
Step 1: FACT = 1
Step 2: for i = 1 to n do
Step 3: FACT = FACT * i
Step 4: print FACT

Specification: Computes n!.


Pre-condition: n >= 0
Post-condition: FACT = n!
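
As a concrete illustration, here is a direct transcription of this algorithm into C (a minimal sketch; the function name factorial and the choice of unsigned long are ours, not part of the original algorithm):

    #include <stdio.h>

    /* Computes n! iteratively, mirroring Algorithm Factorial(n).
       Pre-condition: n >= 0. Post-condition: the return value equals n!. */
    unsigned long factorial(unsigned int n) {
        unsigned long fact = 1;                   /* Step 1: FACT = 1 */
        for (unsigned int i = 1; i <= n; i++)     /* Step 2: for i = 1 to n */
            fact *= i;                            /* Step 3: FACT = FACT * i */
        return fact;
    }

    int main(void) {
        printf("%lu\n", factorial(5));            /* Step 4: print FACT; prints 120 */
        return 0;
    }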
For better understanding, conditions can also be defined after any statement, to specify the values
of particular variables.
A pre-condition and post-condition can also be defined for a loop, to state the conditions satisfied
before starting and after completion of the loop, respectively.
What remains true before execution of the i-th iteration of a loop is called a "loop invariant". These
conditions are useful when debugging implementations of algorithms. Moreover, these
conditions can also be used in correctness proofs.

Time analysis of Algorithms


Execution time of an algorithm depends on the number of instructions executed.
Consider the following algorithm fragment:
for i = 1 to n do
sum = sum + i ;
The for loop test is executed n+1 times, for i values 1, 2, ..., n, n+1. Each instruction in the body of
the loop is executed once for each value of i = 1, 2, ..., n. So the number of steps executed is 2n+1.
Consider another algorithm fragment:
for i = 1 to n do
for j = 1 to n do
k = k +1
From the previous example, the number of instructions executed by the inner loop is 2n+1, and this
inner loop is the body of the outer loop.
The total number of instructions executed is

(n+1) + n(2n+1) = 2n² + 2n + 1
• Measuring time complexity in absolute time units has the following problems:
1. The time required for an algorithm depends on the number of instructions executed, which
is a complex polynomial.
2. The execution time of an instruction depends on the computer's power; different
computers take different amounts of time for the same instruction.
3. Different types of instructions take different amounts of time on the same computer.
Complexity analysis abstracts away these machine-dependent factors. In this
approach, we assume that every instruction takes a constant amount of time to execute.

Asymptotic bounds, expressed as polynomials, are used to estimate the number of
instructions executed by the algorithm.

Time complexity
How long does such a sorting program (here: sorting by minimum search) run? It may take a very
long time on large inputs (that is, many strings) until the program has completed its work and gives
a sign of life again. Sometimes it makes sense to be able to estimate the running time before
starting a program; nobody wants to wait years for a sorted phone book! Obviously, the running
time depends on the number n of the strings to be sorted. Can we find a formula for the running
time that depends on n?

Having a close look at the program, we notice that it consists of two nested for-loops. In both loops
the variables run from 0 to n, but the inner variable starts right from where the outer one just
stands. An if with a comparison, and some assignments that are not necessarily executed, reside
inside the two loops. A good measure for the running time is the number of executed comparisons.
In the first iteration n-1 comparisons take place, in the second n-2, then n-3, etc. So
1+2+...+(n-1) comparisons are performed altogether. According to the well-known Gaussian sum
formula these are exactly ½·(n-1)·n comparisons. Figure 1 illustrates this. The screened area
corresponds to the number of comparisons executed. It corresponds approximately to half of the
area of a square with a side length of n. So it amounts to approximately ½·n².
Figure 1. Running time analysis of sorting by minimum search

How is this expression to be judged? Is this good or bad? If we double the number of
strings to be sorted, the computing time quadruples! If we increase it ten-fold, it takes even
100 = 10² times longer until the program will have terminated! All this is caused by the term n².
One says: sorting by minimum search has quadratic complexity. This gives us a first hint that
this method is unsuitable for large amounts of data because it simply takes far too much time.

So it would be a fallacy here to say: “For a lot of money, we'll simply buy a machine that is twice
as fast, then we can sort twice as many strings (in the same time).” Theoretical running time
considerations offer protection against such fallacies.

The number of (machine) instructions that a program executes during its running time is called
its time complexity in computer science. This number depends primarily on the size of the
program's input, that is, approximately on the number of the strings to be sorted (and their length)
and on the algorithm used. So, approximately, the time complexity of the program "sort an array of n
strings by minimum search" is described by the expression c·n².

c is a constant that depends on the programming language used, on the quality of the compiler or
interpreter, on the CPU, on the size of the main memory and the access time to it, on the
knowledge of the programmer, and last but not least on the algorithm itself, which may require
simple but also time-consuming machine instructions. (For the sake of simplicity we have drawn
the factor 1/2 into c here.) So while one can make c smaller by improving external
circumstances (and thereby often investing a lot of money), the term n², however, always remains
unchanged.

Space complexity
The better the time complexity of an algorithm is, the faster the algorithm will carry out its work in
practice. Apart from time complexity, its space complexity is also important: this is essentially the
number of memory cells that an algorithm needs. A good algorithm keeps this number as small as
possible, too.
There is often a time-space tradeoff involved in a problem; that is, it cannot be solved with both
little computing time and low memory consumption. One then has to make a compromise and
exchange computing time for memory consumption or vice versa, depending on which algorithm
one chooses and how one parameterizes it.

Notations
The O-notation
In other words: c is not really important for the description of the running time! To take this
circumstance into account, running time complexities are always specified in the so-called O-
notation in computer science. One says: the sorting method has running time O(n²). The
symbol O is also called Landau's symbol.

Mathematically speaking, O(n²) stands for a set of functions, exactly for all those functions that,
"in the long run", do not grow faster than the function n², that is, for those functions for which the
function n² is an upper bound (apart from a constant factor). To be precise, the following holds
true: a function f is an element of the set O(n²) if there are a factor c and an integer n₀ such
that for all n equal to or greater than this n₀ the following holds:

f(n) ≤ c·n².

The function n² is then called an asymptotically upper bound for f. Generally, the
notation f(n) = O(g(n)) says that the function f is asymptotically bounded from above by the
function g.
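
Stated compactly in symbols (a restatement of the definition above, not an addition to it):

    f ∈ O(n²)  ⟺  there exist c > 0 and n₀ such that for all n ≥ n₀: f(n) ≤ c·n²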

A function f from O(n²) may grow considerably more slowly than n², so that, mathematically
speaking, the quotient f/n² converges to 0 with growing n. An example of this is the function
f(n) = n. However, this does not hold for the function f that describes the running time of our sorting
method. This method always requires n² comparisons (apart from a constant factor of 1/2). n² is
therefore also an asymptotically lower bound for f. This f behaves in the long run exactly like n².
Expressed mathematically: there are factors c₁ and c₂ and an integer n₀ such that for all n
equal to or larger than n₀ the following holds:

c₁·n² ≤ f(n) ≤ c₂·n².

So f is bounded by n² from above and from below. There also is a notation of its own for the set of
these functions: Θ(n²).

Figure 2 contrasts a function f that is bounded from above by O(g(n)) with a function whose
asymptotic behavior is described by Θ(g(n)): the latter lies in a tube around g(n), which results
from the two factors c₁ and c₂.
Figure 2. The asymptotic bounds O and Θ

These notations appear again and again in the LEDA manual at the description of non-trivial
operations. Thereby we can estimate the order of magnitude of the method used; in general,
however, we cannot make an exact running time prediction. (Because in general we do not know c,
which depends on too many factors, even if it can often be determined experimentally; see also on
this.)

Frequently the statement is found in the manual that an operation takes “constant time”. By this it
is meant that this operation is executed with a constant number of machine instructions,
independently from the size of the input. The function describing the running time behavior is
therefore in O(1). The expressions “linear time” and “logarithmic time” describe corresponding
running time behaviors: By means of the O-notation this is often expressed as “takes
time O(n) and O(log(n))”, respectively.

Furthermore, the phrase “expected time” O(g(n)) often appears in the manual. By this it is meant
that the running time of an operation can vary from execution to execution, that the expectation
value of the running time is, however, asymptotically bounded from above by the function g(n).

Back to our sorting algorithm: a running time of Θ(n²) indicates that an adequately big input will
always bring the system to its knees concerning its running time. So instead of investing a lot of
money and effort in a reduction of the factor c, we should rather start to search for a better
algorithm. Thanks to LEDA, we do not have to spend a long time searching: all known
efficient sorting methods are built into LEDA.

To give an example, the Quicksort algorithm has (expected) complexity O(n·log(n)), which (seen
asymptotically) is fundamentally better than Θ(n²). This means that Quicksort defeats sorting by
minimum search in the long run: if n is large enough, the expression c₁·n·log(n) certainly becomes
smaller than the expression c₂·n², independently of how large the two system-dependent
constants c₁ and c₂ of the two methods actually are; the quotient of the two expressions converges
to 0. (For small n, however, c₁·n·log(n) may definitely be larger than c₂·n²; indeed, Quicksort does
not pay off on very small arrays compared to sorting by minimum search.)
Now back to the initial question: can we sort phone books with our sorting algorithm in acceptable
time? In accordance with what we said above, this depends solely on the number of entries (that is,
the number of inhabitants of the town) and on the system-dependent constant c. Applied to today's
machines: the phone book of Saarbrücken in any case, the one of Munich maybe in some hours, but
surely not the one of Germany. With the method sort() of the class array, however, the last
problem is not a problem either.

Best, Worst, and Average-Case Complexity

Using the RAM model of computation, we can count how many steps our algorithm will take on
any given input instance by simply executing it on the given input. However, to really understand
how good or bad an algorithm is, we must know how it works over all instances.

To understand the notions of the best, worst, and average-case complexity, one must think about
running an algorithm on all possible instances of data that can be fed to it. For the problem of
sorting, the set of possible input instances consists of all the possible arrangements of all the
possible numbers of keys. We can represent every input instance as a point on a graph, where the x-
axis is the size of the problem (for sorting, the number of items to sort) and the y-axis is the number
of steps taken by the algorithm on this instance. Here we assume, quite reasonably, that it doesn't
matter what the values of the keys are, just how many of them there are and how they are ordered.
It should not take longer to sort 1,000 English names than it does to sort 1,000 French names, for
example.

Figure: Best, worst, and average-case complexity

As shown in the figure, these points naturally align themselves into columns, because only integers
represent possible input sizes. After all, it makes no sense to ask how long it takes to sort 10.57
items. Once we have these points, we can define three different functions over them:

• The worst-case complexity of the algorithm is the function defined by the maximum
number of steps taken on any instance of size n. It represents the curve passing through the
highest point of each column.
• The best-case complexity of the algorithm is the function defined by the minimum number
of steps taken on any instance of size n. It represents the curve passing through the lowest
point of each column.
• Finally, the average-case complexity of the algorithm is the function defined by the average
number of steps taken on any instance of size n.

In practice, the most useful of these three measures proves to be the worst-case complexity, which
many people find counterintuitive. To illustrate why worst-case analysis is important, consider
trying to project what will happen to you if you bring n dollars to gamble in a casino. The best
case, that you walk out owning the place, is possible but so unlikely that you should place no
credence in it. The worst case, that you lose all n dollars, is easy to calculate and distressingly
likely to happen. The average case, that the typical person loses 87.32% of the money that they
bring to the casino, is difficult to compute and its meaning is subject to debate. What exactly
does average mean? Stupid people lose more than smart people, so are you smarter or dumber than
the average person, and by how much? People who play craps lose more money than those playing
the nickel slots. Card counters at blackjack do better on average than customers who accept three or
more free drinks. We avoid all these complexities and obtain a very useful result by just
considering the worst case.

The important thing to realize is that each of these time complexities defines a numerical function,
representing time versus problem size. These functions are as well-defined as any other numerical
function, be it y = x² - 2x + 1 or the price of General Motors stock as a function of time. Time
complexities are complicated functions, however. In order to simplify our work with such messy
functions, we will need the big Oh notation.

SORTING
Sorting is very important in almost every computer application. Sorting refers to arranging data
elements in some given order. Many sorting algorithms are available to sort a given set of
elements.
We will now discuss the two categories of sorting techniques and analyze their performance.
The two categories are:
• Internal Sorting
• External Sorting

Internal Sorting
Internal sorting takes place in the main memory of a computer. The internal sorting methods are
applied to small collections of data; that is, the entire collection of data to be sorted is
small enough that the sorting can take place within main memory. We will study the following
methods of internal sorting:
1. Insertion sort
2. Selection sort
3. Merge Sort
4. Radix Sort
5. Quick Sort
6. Heap Sort
7. Bubble Sort

Insertion Sort
In this sorting we can read the given elements from 1 to n, inserting each element into its proper
position. For example, the card player arranging the cards dealt to him. The player picks up the
card and inserts them into the proper position. At every step, we insert the item into its proper
place.
This sorting algorithm is frequently used when n is small. The insertion sort algorithm scans A
from A[l] to A[N], inserting each element A[K] into its proper position in the previously sorted
subarray A[l], A[2], . . . , A[K-1]. That is:

Pass 1. A[1] by itself is trivially sorted.
Pass 2. A[2] is inserted either before or after A[1] so that: A[1], A[2] is sorted.
Pass 3. A[3] is inserted into its proper place in A[1], A[2], that is, before A[1],
between A[1] and A[2], or after A[2], so that: A[1], A[2], A[3] is sorted.
Pass 4. A[4] is inserted into its proper place in A[1], A[2], A[3] so that:
A[1], A[2], A[3], A[4] is sorted.
………………………………………………………
Pass N. A[N] is inserted into its proper place in A[1], A[2], ..., A[N-1] so
that: A[1], A[2], ..., A[N] is sorted.
Example 2.1

Insertion sort for n = 8 items

Algorithm 2.1
INSERTION (A, N)
This algorithm sorts the array A with N elements.
1. Set A[0] := -∞ [Initializes the sentinel element]
2. Repeat Steps 3 to 5 for K = 2, 3, ..., N
3. Set TEMP := A[K] and PTR := K-1
4. Repeat while TEMP < A[PTR]:
(a) Set A[PTR+1] := A[PTR] [Moves element forward]
(b) Set PTR := PTR-1
[End of loop]
5. Set A[PTR+1] := TEMP [Inserts element in proper place]
[End of Step 2 loop]
6. Return
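
The following C sketch mirrors Algorithm 2.1; it is our own transcription, using 0-based array indexing and an explicit bounds check in place of the -∞ sentinel:

    /* Insertion sort: inserts each a[k] into the sorted prefix a[0..k-1]. */
    void insertion_sort(int a[], int n) {
        for (int k = 1; k < n; k++) {
            int temp = a[k];                      /* TEMP := A[K] */
            int ptr = k - 1;                      /* PTR := K - 1 */
            while (ptr >= 0 && temp < a[ptr]) {
                a[ptr + 1] = a[ptr];              /* move element forward */
                ptr--;
            }
            a[ptr + 1] = temp;                    /* insert element in proper place */
        }
    }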

Complexity of Insertion Sort:


The insertion sort algorithm is a very slow algorithm when n is very large.

Algorithm          Worst Case            Average Case
Insertion Sort     n(n-1)/2 = O(n²)      n(n-1)/4 = O(n²)

Worst Case
The worst case occurs when the array A is in reverse order and the inner loop must use
the maximum number K-1 of comparisons.

f(n) = 1 + 2 + ... + (n-1) = n(n-1)/2 = O(n²)
Average Case
The average case occurs when there are, on average, (K-1)/2 comparisons in the inner loop.

f(n) = 1/2 + 2/2 + ... + (n-1)/2 = n(n-1)/4 = O(n²)

Selection Sort
In this sorting we find the smallest element in this list and put it in the first position. Then find
the second smallest element in the list and put it in the second position. And so on.
Pass 1. Find the location LOC of the smallest element in the list of N elements
A[1], A[2], ..., A[N], and then interchange A[LOC] and A[1].
Then A[1] is sorted.
Pass 2. Find the location LOC of the smallest element in the sublist of N-1
elements A[2], A[3], ..., A[N], and then interchange A[LOC]
and A[2]. Then: A[1], A[2] is sorted, since A[1] ≤ A[2].
Pass 3. Find the location LOC of the smallest element in the sublist of N-2
elements A[3], A[4], ..., A[N], and then interchange A[LOC]
and A[3]. Then: A[1], A[2], A[3] is sorted, since A[2] ≤ A[3].
………………………………
Pass N-1. Find the location LOC of the smaller of the elements A[N-1],
A[N], and then interchange A[LOC] and A[N-1]. Then: A[1],
A[2], ..., A[N] is sorted, since A[N-1] ≤ A[N].
Thus A is sorted after N-1 passes.
Example 2.2
Suppose an array A contains 8 elements as follows:
77, 33, 44, 11, 88, 22, 66, 55
Pass              A[1]  A[2]  A[3]  A[4]  A[5]  A[6]  A[7]  A[8]
K = 1, LOC = 4     77    33    44    11    88    22    66    55
K = 2, LOC = 6     11    33    44    77    88    22    66    55
K = 3, LOC = 6     11    22    44    77    88    33    66    55
K = 4, LOC = 6     11    22    33    77    88    44    66    55
K = 5, LOC = 8     11    22    33    44    88    77    66    55
K = 6, LOC = 7     11    22    33    44    55    77    66    88
K = 7, LOC = 7     11    22    33    44    55    66    77    88
Sorted:            11    22    33    44    55    66    77    88

Algorithm 2.2:
1. To find the minimum element
MIN (A, K, N, LOC)
An array A is in memory. This procedure finds the location
LOC of the smallest element among A[K], A[K+1], ..., A[N].
1. Set MIN := A[K] and LOC := K [Initializes pointers]
2. Repeat for J = K+1, K+2, ..., N:
If MIN > A[J], then: Set MIN := A[J] and LOC := J
[End of loop]
3. Return
2. To sort the elements
SELECTION (A, N)
1. Repeat Steps 2 and 3 for K = 1, 2, ..., N-1
2. Call MIN(A, K, N, LOC)
3. [Interchange A[K] and A[LOC]]
Set TEMP := A[K], A[K] := A[LOC] and A[LOC] := TEMP
4. Exit.
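
A C sketch of the two procedures (our own transcription with 0-based indices; the names min_index and selection_sort are ours):

    /* MIN: returns the index of the smallest element among a[k..n-1]. */
    static int min_index(const int a[], int k, int n) {
        int loc = k;                              /* LOC := K */
        for (int j = k + 1; j < n; j++)
            if (a[j] < a[loc])
                loc = j;                          /* smaller element found */
        return loc;
    }

    /* SELECTION: one interchange per pass. */
    void selection_sort(int a[], int n) {
        for (int k = 0; k < n - 1; k++) {
            int loc = min_index(a, k, n);
            int temp = a[k];                      /* interchange A[K] and A[LOC] */
            a[k] = a[loc];
            a[loc] = temp;
        }
    }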
Complexity of the Selection Sort Algorithm
First note that the number f(n) of comparisons in the selection sort algorithm is
independent of the original order of the elements. Observe that MIN(A, K, N, LOC) requires n-K
comparisons. That is, there are n-1 comparisons during Pass 1 to find the smallest element,
there are n-2 comparisons during Pass 2 to find the second smallest element, and so on.
Accordingly,

f(n) = (n-1) + (n-2) + ... + 2 + 1 = n(n-1)/2 = O(n²)

The above result is summarized in the following table:

Algorithm          Worst Case            Average Case
Selection Sort     n(n-1)/2 = O(n²)      n(n-1)/2 = O(n²)

Merge Sort

Combining two lists is called merging. For example, suppose A is a sorted list with r elements
and B is a sorted list with s elements. The operation that combines the elements of A and B into
a single sorted list C with n = r + s elements is called merging. After combining the two lists, the
elements are sorted by using the following merging algorithm.
Suppose one is given two sorted decks of cards. The decks are merged as in Fig. 2.1.
That is, at each step, the two front cards are compared and the smaller one is placed in the
combined deck. When one of the decks is empty, all of the remaining cards in the other deck are
put at the end of the combined deck. Similarly, suppose we have two lines of students sorted by
increasing heights, and suppose we want to merge them into a single sorted line. The new line is
formed by choosing, at each step, the shorter of the two students who are at the head of their
respective lines. When one of the lines has no more students, the remaining students line up at
the end of the combined line.
Fig. 2.1
The above discussion will now be translated into a formal algorithm which merges a
sorted r-element array A and a sorted s-element array B into a sorted array C, with n = r + s
elements. First of all, we must always keep track of the locations of the smallest element of A
and the smallest element of B which have not yet been placed in C. Let NA and NB denote these
locations, respectively. Also, let PTR denote the location in C to be filled. Thus, initially, we set
NA : = 1, NB : = 1 and PTR : = 1. At each step of the algorithm, we compare A[NA] and
B[NB] and assign the smaller element to C[PTR]. Then we increment PTR by setting PTR:=
PTR + 1, and we either increment NA by setting NA: = NA + 1 or increment NB by setting NB:
= NB + 1, according to whether the new element in C has come from A or from B. Furthermore,
if NA> r, then the remaining elements of B are assigned to C; or if NB > s, then the remaining
elements of A are assigned to C.

Algorithm 2.3
MERGING ( A, R, B, S, C)
Let A and B be sorted arrays with R and S elements. This algorithm
merges A and B into an array C with N = R + S elements.
1. [Initialize ] Set NA : = 1 , NB := 1 AND PTR : = 1
2. [Compare] Repeat while NA <= R and NB <= S
If A[NA] < B[NB] , then
(a)[Assign element from A to C ] set C[PTR] := A[NA]
(b)[Update pointers ] Set PTR := PTR +1 and NA := NA +1
Else
(a) [Assign element from B to C] Set C[PTR] := B[NB]
(b) [Update Pointers] Set PTR := PTR +1 and NB := NB +1
[End of loop]
3. [Assign remaining elements to C]
If NA > R , then
Repeat for K = 0 ,1,2,……..,S- NB
Set C[PTR+K] := B[NB+K]
[End of loop]
Else
Repeat for K = 0,1,2,……,R-NA
Set C[PTR+K] := A[NA+K]
[End of loop]
4. Exit
The total computing time of merge sort is O(n log₂ n).
The disadvantage of using mergesort is that it requires two arrays of the same size and type for
the merge phase.
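
A C transcription of the merging algorithm (a sketch with 0-based indices; the remaining-element step of the algorithm is written as two copy loops):

    /* Merges sorted a[0..r-1] and sorted b[0..s-1] into c[0..r+s-1]. */
    void merge(const int a[], int r, const int b[], int s, int c[]) {
        int na = 0, nb = 0, ptr = 0;              /* NA, NB, PTR (0-based) */
        while (na < r && nb < s) {
            if (a[na] < b[nb])
                c[ptr++] = a[na++];               /* take the smaller front element */
            else
                c[ptr++] = b[nb++];
        }
        while (na < r) c[ptr++] = a[na++];        /* assign remaining elements of A */
        while (nb < s) c[ptr++] = b[nb++];        /* assign remaining elements of B */
    }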
Radix Sort
Radix sort is the method that many people intuitively use or begin to use when
alphabetizing a large list of names. Specifically, the list of names is first sorted according to the
first letter of each name. That is, the names are arranged in 26 classes, where the first class
consists of those names that begin with "A," the second class consists of those names that begin
with "B," and so on. During the second pass, each class is alphabetized according to the second
letter of the name. And so on. If no name contains, for example, more than 12 letters, the names
are alphabetized with at most 12 passes.
The radix sort is the method used by a card sorter. A card sorter contains 13 receiving
pockets labeled as follows:
9, 8, 7, 6, 5, 4, 3, 2, 1, 0, 11, 12, R (reject)
Each pocket other than R corresponds to a row on a card in which a hole can be punched.
Decimal numbers, where the radix is 10, are punched in the obvious way and hence use only the
first 10 pockets of the sorter. The sorter uses a radix reverse-digit sort on numbers. That is,
suppose a card sorter is given a collection of cards where each card contains a 3-digit number
punched in columns 1 to 3. The cards are first sorted according to the units digit. On the second
pass, the cards are sorted according to the tens digit. On the third and last pass, the cards are
sorted according to the hundreds digit. We illustrate with an example.

Example 2.3
Suppose 9 cards are punched as follows:
348, 143, 361, 423, 538, 128, 321, 543, 366
Given to a card sorter, the numbers would be sorted in three phases, as pictured in Fig. 2.2:

(a) In the first pass, the units digits are sorted into pockets. (The pockets are pictured
upside down, so 348 is at the bottom of pocket 8.) The cards are collected pocket by
pocket, from pocket 9 to pocket 0. (Note that 361 will now be at the bottom of the pile
and 128 at the top of the pile.) The cards are now reinput to the sorter.
(b) In the second pass, the tens digits are sorted into pockets. Again the cards are collected
pocket by pocket and reinput to the sorter.
(c) In the third and final pass, the hundreds digits are sorted into pockets.

(a) First pass.


(b) Second pass.

(c) Third pass.


Figure: 2.2
When the cards are collected after the third pass, the numbers are in the following order:
128, 143, 321, 348, 361, 366, 423, 538, 543
Thus the cards are now sorted.
The number C of comparisons needed to sort nine such 3-digit numbers is bounded as
follows:
C ≤ 9 * 3 * 10
The 9 comes from the nine cards, the 3 comes from the three digits in each number, and the 10
comes from radix d = 10 digits.

Complexity of Radix Sort


Suppose a list A of n items A1, A2, . . . , An is given. Let d denote the radix (e.g., d = 10 for
decimal digits, d = 26 for letters and d = 2 for bits), and suppose each item Ai is represented by
means of s of the digits:
Ai = di1 di2 …. dis
The radix sort algorithm will require s passes, the number of digits in each item. Pass K will
compare each diK with each of the d digits. Hence the number C(n) of comparisons for the
algorithm is bounded as follows:

C(n) ≤ d · s · n

Although d is independent of n, the number s does depend on n. In the worst case, s = n, so C(n)
= O(n²). In the best case, s = log_d n, so C(n) = O(n log n). In other words, radix sort performs
well only when the number s of digits in the representation of the Ai's is small.
Another drawback of radix sort is that one may need d*n memory locations. This comes from the
fact that all the items may be "sent to the same pocket" during a given pass. This drawback may
be minimized by using linked lists rather than arrays to store the items during a given pass.
However, one will still require 2*n memory locations.
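
For illustration, here is a sketch of a decimal (d = 10) least-significant-digit radix sort in C for non-negative integers. It follows the card-sorter idea, but counts pocket sizes and uses one auxiliary array instead of physical pockets; all names are ours:

    #include <stdlib.h>

    /* LSD radix sort for non-negative integers, radix d = 10. */
    void radix_sort(int a[], int n) {
        int *out = malloc(n * sizeof *out);       /* the "pockets", flattened */
        if (out == NULL) return;
        for (int exp = 1; ; exp *= 10) {          /* one pass per digit position */
            int count[10] = {0};
            int done = 1;
            for (int i = 0; i < n; i++) {
                if (a[i] / exp / 10 != 0) done = 0;  /* higher digits remain */
                count[(a[i] / exp) % 10]++;       /* pocket sizes for this pass */
            }
            for (int p = 1; p < 10; p++)
                count[p] += count[p - 1];         /* prefix sums: pocket boundaries */
            for (int i = n - 1; i >= 0; i--)      /* backwards scan keeps it stable */
                out[--count[(a[i] / exp) % 10]] = a[i];
            for (int i = 0; i < n; i++)
                a[i] = out[i];                    /* collect pockets back into a */
            if (done) break;                      /* last pass reached */
        }
        free(out);
    }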

Quick Sort
This is one of the most widely used internal sorting algorithms. It is based on the divide-and-conquer
strategy, i.e., divide the problem into sub-problems until the sub-problems are small enough to be
solved directly.

Example 2.4
Suppose A is the following list of 12 numbers:
(44) 33, 11, 55, 77, 90, 40, 60, 99, 22, 88, 66
The quick sort algorithm finds the final position of one of the numbers; in this
illustration, we use the first number, 44. This is accomplished as follows. Beginning with the last
number, 66, scan the list from right to left, comparing each number with 44 and stopping at the
first number less than 44. The number is 22. Interchange 44 and 22 to obtain the list
(22) 33, 11, 55, 77, 90, 40, 60, 99, (44) 88, 66
(Observe that the numbers 88 and 66 to the right of 44 are each greater than 44.)
Beginning with 22, next scan the list in the opposite direction, from left to right, comparing each
number with 44 and stopping at the first number greater than 44. The number is 55. Interchange
44 and 55 to obtain the list
22, 33, 11, (44) 77, 90, 40, 60, 99, (55), 88, 66
(Observe that the numbers 22, 33 and 11 to the left of 44 are each less than 44.)
Beginning this time with 55, now scan the list in the original direction, from right to left, until
meeting the first number less than 44. It is 40. Interchange 44 and 40 to obtain the list
22, 33, 11, (40) 77, 90, (44) 60, 99, 55, 88, 66
(Again, the numbers to the right of 44 are each greater than 44.) Beginning with 40, scan
the list from left to right. The first number greater than 44 is 77. Interchange 44 and 77 to obtain
the list
22, 33, 11, 40, (44) 90, (77) 60, 99, 55, 88, 66
(Again, the numbers to the left of 44 are each less than 44.) Beginning with 77, scan the
list from right to left seeking a number less than 44. We do not meet such a number before
meeting 44. This means all numbers have been scanned and compared with 44. Furthermore, all
numbers less than 44 now form the sublist of numbers to the left of 44, and all numbers greater
than 44 now form the sublist of numbers to the right of 44, as shown below:
22, 33, 11, 40, (44) 90, 77, 60, 99, 55, 88, 66

First sublist Second sublist

Thus 44 is correctly placed in its final position, and the task of sorting the original list A
has now been reduced to the task of sorting each of the above sublists.
The above reduction step is repeated with each sublist containing 2 or more elements.
Algorithm 2.4
This algorithm sorts an array A with N elements
1. [Initialize] TOP := NULL
2. If N >1 , then TOP := TOP +1 , LOWER[1] := 1 , UPPER[1] :=N
3. Repeat Steps 4 to 7 while TOP ≠ NULL
4. Set BEG := LOWER[TOP] , END := UPPER[TOP],TOP:=TOP-1
5. Call QUICK(A,N,BEG,END,LOC)
6. If BEG < LOC -1 then
TOP := TOP +1 , LOWER [TOP] := BEG
UPPER[TOP] = LOC-1
End If
7. If LOC +1 < END then
TOP := TOP +1 , LOWER[TOP] := LOC +1
UPPER[TOP] := END
End If
8. Exit

The quick sort algorithm uses O(N log₂ N) comparisons on average.
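
The reduction step can be coded as a partition function in C. The sketch below uses the common "hole" formulation, which moves elements instead of performing explicit interchanges but places the pivot in the same final position as the example above; the recursive driver replaces the explicit stack of Algorithm 2.4:

    /* Partition step of Example 2.4: the first element (the pivot) settles
       into its final position LOC, which is returned. */
    static int quick(int a[], int beg, int end) {
        int left = beg, right = end;
        int pivot = a[beg];                       /* e.g. 44 in the example */
        while (left < right) {
            while (left < right && a[right] >= pivot)
                right--;                          /* scan from right to left */
            a[left] = a[right];
            while (left < right && a[left] <= pivot)
                left++;                           /* scan from left to right */
            a[right] = a[left];
        }
        a[left] = pivot;                          /* LOC = left */
        return left;
    }

    /* Recursive driver; Algorithm 2.4 achieves the same with an explicit stack. */
    void quick_sort(int a[], int beg, int end) {
        if (beg < end) {
            int loc = quick(a, beg, end);
            quick_sort(a, beg, loc - 1);          /* first sublist */
            quick_sort(a, loc + 1, end);          /* second sublist */
        }
    }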

Heap Sort
A heap is a complete binary tree, in which each node satisfies the heap condition.

Heap condition
The key of each node is greater than or equal to the key in its children. Thus the
root node will have the largest key value.

MaxHeap
Suppose H is a complete binary tree with n elements. Then H is called a heap, or
maxheap, if the value at each node N is greater than or equal to the value at any of the children of N.

MinHeap
The value at each node N is less than or equal to the value at any of the children of N.

The operations on a heap


(i) New node is inserted into a Heap
(ii) Deleting the Root of a Heap

Example 2.5
Consider the complete tree H in Fig.2.3 (a) . Observe that H is a heap. Figure 2.3 (b)
shows the sequential representation of H by the array TREE. That is, TREE[1] is the root of the
tree H, and the left and right children of node TREE[K] are, respectively, TREE[2K] and
TREE[2K + 1]. This means, in particular, that the parent of any nonroot node TREE[J] is the node
TREE[J / 2] (where J / 2 means integer division). Observe that the nodes of H on the same level
appear one after the other in the array TREE.
The sequential representation of H by the array TREE

97 88 95 66 55 95 48 66 35 48 55 62 77 25 38 18 40 30 26 24

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
(a) Sequential representation
Fig. 2.3


Inserting into a Heap


Suppose H is a heap with N elements. We insert a new ITEM
into the heap H as follows:
(i) First adjoin ITEM at the end of H.
(ii) Then let ITEM rise to its "appropriate place" in H.
Example 2.6
To add ITEM = 70 to H, first we adjoin 70 as the next element in the complete tree; that is,
we set TREE[21] := 70. Then 70 is the right child of TREE[10] = 48. The path from 70 to the root of
H is pictured in Fig. 2.4(a). We now find the appropriate place of 70 in the heap as follows:
(a) Compare 70 with its parent, 48. Since 70 is greater than 48, interchange 70 and 48; Fig.
2.4(b).
(b) Compare 70 with its new parent, 55. Since 70 is greater than 55, interchange 70 and 55; the
path will now look like Fig 2.4(c)
(c) Compare 70 with its new parent, 88. Since 70 does not exceed 88, ITEM = 70 has risen to its
appropriate place in H.
The following figure shows the final tree. A dotted line indicates that an exchange has
taken place.
Algorithm 2.5
INSHEAP (TREE, N, ITEM)
A heap H with N elements is stored in the array TREE, and an ITEM of
information is given. This procedure inserts ITEM as a new element of H .
PTR gives the location of ITEM as it rises in the tree, and PAR denotes the
location of the parent of ITEM.
1. [Add new node to H and initialize PTR]
Set N := N+1 and PTR := N
2. [Find location to insert ITEM]
Repeat Steps 3 to 6 while PTR > 1
3. Set PAR := ⌊PTR/2⌋ [Location of parent node]
4. If ITEM <= TREE[PAR], then:
Set TREE[PTR] := ITEM, and Return
[End of If]
5. Set TREE[PTR] := TREE[PAR] [Moves node down]
6. Set PTR := PAR [Updates PTR]
[End of loop]
7. [Assign ITEM as the root of H] Set TREE[1] := ITEM
8. Return
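
A C transcription of Algorithm 2.5 (a sketch; the heap occupies tree[1..n] and index 0 is unused, matching the sequential representation above):

    /* INSHEAP: inserts item into the max-heap tree[1..*n] (index 0 unused). */
    void insheap(int tree[], int *n, int item) {
        int ptr = ++(*n);                         /* add new node, PTR := N */
        while (ptr > 1) {                         /* rise while not at the root */
            int par = ptr / 2;                    /* location of parent node */
            if (item <= tree[par]) {
                tree[ptr] = item;                 /* proper place found */
                return;
            }
            tree[ptr] = tree[par];                /* move parent node down */
            ptr = par;
        }
        tree[1] = item;                           /* ITEM becomes the root of H */
    }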
Deleting the Root of a Heap
Deleting the root element from a heap tree. Suppose H is a heap with N elements. We delete the
root R from the heap H as follows:
(i) Assign the root R to some variable ITEM
(ii) Replace the deleted node R by the last node L of H
(iii) L sinks to its appropriate place in H.

Example 2.7
Consider the heap H in Fig 2.5(a), where R = 95 is the root and L = 22 is the last node of
the tree. Step 1 of the above procedure deletes R = 95, and Step 2 replaces R = 95 by L = 22.
(a) Compare 22 with its two children, 85 and 70. Since 22 is less than the larger child, 85,
interchange 22 and 85. (Fig. 2.5(c))
(b) Compare 22 with its two new children, 55 and 33. Since 22 is less than the larger child,
55, interchange 22 and 55. (Fig. 2.5(d))
(c) Compare 22 with its new children, 15 and 20. Since 22 is greater than both children,
node 22 has dropped to its appropriate place in H.

Fig. 2.5 Reheaping


Algorithm 2.6
DELHEAP (TREE, N, ITEM)
A heap H with N elements is stored in the array TREE. This procedure assigns the
root TREE [1] of H to the variable ITEM and then reheaps the remaining elements.
The variable LAST saves the value of the original last node of H. The pointers PTR,
LEFT and RIGHT give the locations of LAST and its left and right children as LAST
sinks in the tree.
1. Set ITEM := TREE[1] [Removes root of H]
2. Set LAST := TREE[N] and N := N-1 [Removes last node of H]
3. Set PTR := 1, LEFT := 2 and RIGHT := 3 [Initializes pointers]
4. Repeat Steps 5 to 7 while RIGHT <= N
5. If LAST >= TREE[LEFT] and LAST >= TREE[RIGHT], then:
Set TREE[PTR] := LAST and Return
[End of If]
6. If TREE[RIGHT] <= TREE[LEFT], then:
Set TREE[PTR] := TREE[LEFT] and PTR := LEFT
Else:
Set TREE[PTR] := TREE[RIGHT] and PTR := RIGHT
[End of If]
7. Set LEFT := 2*PTR and RIGHT := LEFT+1
[End of Step 4 loop]
8. If LEFT = N and LAST < TREE[LEFT], then: Set TREE[PTR] := TREE[LEFT] and PTR := LEFT
9. Set TREE[PTR] := LAST
10. Return.
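
A C transcription of Algorithm 2.6 in the same 1-based representation (a sketch; note that the one-child case of Step 8 promotes the child before placing LAST, so that no value is lost):

    /* DELHEAP: removes and returns the root of the max-heap tree[1..*n]. */
    int delheap(int tree[], int *n) {
        int item = tree[1];                       /* root to be returned */
        int last = tree[(*n)--];                  /* remove last node of H */
        int ptr = 1, left = 2, right = 3;
        while (right <= *n) {
            if (last >= tree[left] && last >= tree[right]) {
                tree[ptr] = last;                 /* proper place found */
                return item;
            }
            if (tree[right] <= tree[left]) {      /* promote the larger child */
                tree[ptr] = tree[left];
                ptr = left;
            } else {
                tree[ptr] = tree[right];
                ptr = right;
            }
            left = 2 * ptr;
            right = left + 1;
        }
        if (left == *n && last < tree[left]) {    /* node with only a left child */
            tree[ptr] = tree[left];               /* promote that child first */
            ptr = left;
        }
        tree[ptr] = last;
        return item;
    }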

Bubble Sort
In this sorting algorithm, multiple swaps take place in one iteration. Smaller elements move,
or "bubble", up to the top of the list. In this method, we compare adjacent members of the list
to be sorted; if the item on top is greater than the item immediately below it, they are swapped.
Example 2.8
Suppose the following numbers are stored in an array A:
32, 51, 27, 85, 66, 23, 13, 57
We apply the bubble sort to the array A. We discuss each pass separately.

Pass 1. We have the following comparisons:


(a) Compare A1 and A2. Since 32 < 51, the list is not altered.
(b) Compare A2 and A3. Since 51 > 27, interchange 51 and 27 as follows:
32, (27), (51), 85, 66, 23, 13, 57
(c) Compare A3 and A4. Since 51 < 85, the list is not altered.
(d) Compare A4 and A5. Since 85 > 66, interchange 85 and 66 as follows:
32, 27, 51, (66), (85), 23, 13, 57
(e) Compare A5 and A6. Since 85 > 23, interchange 85 and 23 as follows:
32, 27, 51, 66, (23), (85), 13, 57
(f) Compare A6 and A7. Since 85 > 13, interchange 85 and 13 as follows:
32, 27, 51, 66, 23, (13), (85), 57
(g) Compare A7 and A8. Since 85 > 57, interchange 85 and 57 as follows:
32, 27, 51, 66, 23, 13, (57), (85)
At the end of this first pass, the largest number, 85, has moved to the last position.
However, the rest of the numbers are not sorted, even though some of them have changed
their positions. For the remainder of the passes, we show only the positions of the
numbers in the array.

At the end of Pass 2, the second largest number, 66, has moved its way down to the next-to-last
position.
27, 32, 51, 23, 13, 57, 66, 85

At the end of Pass 3, the third largest number, 57, has moved its way down to its position in the
list.
27, 32, 23, 13, 51, 57, 66, 85
At the end of Pass 4, the fourth largest number, 51, has moved its way down to its position in
the list.
27, 23, 13, 32, 51, 57, 66, 85
At the end of Pass 5, the fifth largest number, 32, has moved its way down to its position in the
list.
23, 13, 27, 32, 51, 57, 66, 85
At the end of Pass 6, the sixth largest number, 27, has moved its way down to its position in the
list.
13, 23, 27, 32, 51, 57, 66, 85
At the end of Pass 7 (the last pass), the seventh largest number, 23, has moved its way down to its
position in the list.
13, 23, 27, 32, 51, 57, 66, 85

Algorithm 2.7
BUBBLE (DATA, N)
Here DATA is an array with N elements. This algorithm sorts the elements in
DATA.
1. Repeat Steps 2 and 3 for K = 1 to N-1
2. Set PTR := 1 [Initialize pass pointer PTR]
3. Repeat while PTR <= N-K : [Execute Pass]
(a) If DATA[PTR] > DATA[PTR+1], then
Interchange DATA[PTR] and DATA[PTR+1]
End if
(b) Set PTR := PTR+1
[End of inner loop]
[End of step 1 outer loop]
4. Exit.

The total number of comparisons in bubble sort is:

f(n) = (N-1) + (N-2) + ... + 2 + 1 = N(N-1)/2 = O(N²)

The time required to execute the bubble sort algorithm is proportional to n², where n is the
number of input items. The bubble sort algorithm uses O(n²) comparisons on average.
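
A C sketch of Algorithm 2.7 with 0-based indices (names are ours):

    /* BUBBLE: after pass k, the k-th largest element is in its final position. */
    void bubble_sort(int data[], int n) {
        for (int k = 1; k <= n - 1; k++) {
            for (int ptr = 0; ptr < n - k; ptr++) {   /* execute pass k */
                if (data[ptr] > data[ptr + 1]) {      /* out of order: interchange */
                    int temp = data[ptr];
                    data[ptr] = data[ptr + 1];
                    data[ptr + 1] = temp;
                }
            }
        }
    }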

EXTERNAL SORT
External sorting methods are applied when the number of data elements to be
sorted is too large to fit in main memory. These methods involve as much external processing as
processing in the CPU. To study external sorting, we need to study the various external devices
used for storage, in addition to sorting algorithms. This kind of sorting requires auxiliary storage.
The following are examples of external sorting:
• Sorting with Disk
• Sorting with Tapes
Sorting with Disks
We will first illustrate merge sort using disks. The following example illustrates the concept of
sorting with disks.
Suppose a file F containing 6000 records is to be sorted. The main memory is capable of sorting
1000 records at a time. The input file F is stored on one disk and we have, in addition, another
scratch disk. The block length of the input file is 500 records.
We see that the file can be treated as 6 sets of 1000 records each. Each set is sorted and
stored on the scratch disk as a run. These 6 runs will then be merged as follows.
Allocate 3 blocks of memory, each capable of holding 500 records. Two of these buffers,
B1 and B2, will be treated as input buffers and the third, B3, as the output buffer. We now
have the following:
1) 6 runs R1, R2, R3, R4, R5, R6 on the scratch disk.
2) 3 buffers B1, B2 and B3.
• Read 500 records from R1 into B1.
• Read 500 records from R2 into B2.
• Merge B1 and B2 and write into B3.
• When B3 is full- write it out to the disk as run R11.
• Similarly merge R3 and R4 to get run R12.
• Merge R5 and R6 to get run R13.
Thus, from 6 runs of size 1000 each, we have now 3 runs of size 2000 each.
The steps are repeated for runs R11 and R12 to get a run of size 4000.
This run is merged with R13 to get a single sorted run of size 6000.
Sorting with Tapes
Sorting with tapes is essentially similar to the merge sort used for sorting with disks. The
differences arise due to the sequential access restriction of tapes. This makes the selection time
prior to data transmission an important factor, unlike seek time and latency time. Thus, in sorting
with tapes, we will be more concerned with the arrangement of blocks and runs on the tape so as to
reduce the selection or access time.
Example
A file of 6000 records is to be sorted. It is stored on a tape and the block length is 500. The
main memory can sort up to 1000 records at a time. We have, in addition, 4 scratch tapes T1-T4.

Check Your Progress 1


1. Define Sorting.
2. What is internal sorting?
3. Given 2 sorted lists of sizes 'm' and 'n' respectively, the number of comparisons
needed in the worst case by the merge algorithm will be
(a) mn (b) max(m,n) (c) min(m,n) (d) m+n-1
4. Sorting is useful for
(a) report generation (b) minimizing the storage needed
(c) making searching easier and efficient (d) responding to queries easily
5. Choose the correct statement
(a) Internal sorting is used if the number of items to be sorted is very large
(b) External sorting is used if the number of items to be sorted is very large
(c) External sorting needs auxiliary storage
(d) Internal sorting needs auxiliary storage
6. The way a card game player arranges his cards as he picks them up one by one is an
example of
(a) bubble sort (b) selection sort (c) insertion sort (d) merge sort
7. You are asked to sort 15 randomly ordered numbers. You should prefer
(a) bubble sort (b) quick sort (c) merge sort (d) heap sort
8. Describe Heap.
9. The maximum number of comparisons needed to sort 7 items (each item is 4 digits)
using radix sort is
(a) 280 (b) 40 (c) 47 (d) 38
10. What is the difference between MinHeap and MaxHeap?
11. Which of the following algorithm design technique is used in the quick sort?
(a) Dynamic programming
(b) BackTracking
(c) Divide and conquer
(d) Greedy method
12. The number of swaps needed to sort the numbers 8, 22, 7, 9, 31, 19, 5, 13 in
ascending order, using bubble sort, is
(a) 11 (b) 12 (c) 13 (d) 14
13. What is merging?

SEARCHING

Searching refers to the operation of finding the location of a given item in a collection of items.
The search is said to be successful if ITEM does appear in DATA and unsuccessful otherwise.

The following searching algorithms are discussed in this chapter.


1. Sequential Searching
2. Binary Search
3. Binary Tree Search
Sequential Search
This is the most natural searching method. The most intuitive way to search for a given ITEM in
DATA is to compare ITEM with each element of DATA one by one. The algorithm for a
sequential search procedure is now presented.
Algorithm 2.8
SEQUENTIAL SEARCH
INPUT: List of size N, target value T
OUTPUT: Position of T in the list, or -1 if T is not found
BEGIN
Set FOUND := false
Set I := 0
While (I < N) and (FOUND is false)
If List[I] == T then
FOUND := true
Else
I := I + 1
If FOUND == false then
Report that T is not present in the list
END
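
In C, the same procedure can be written as follows (a sketch; we return the 0-based position, or -1 for an unsuccessful search):

    /* Sequential search: returns the 0-based index of t in list[0..n-1], or -1. */
    int sequential_search(const int list[], int n, int t) {
        for (int i = 0; i < n; i++)
            if (list[i] == t)
                return i;                         /* found: position of T */
        return -1;                                /* T is not present in the list */
    }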

Binary Search
Suppose DATA is an array which is sorted in increasing numerical order. Then there is an
extremely efficient searching algorithm, called binary search, which can be used to find the
location LOC of a given ITEM of information in DATA.

The binary search algorithm applied to our array DATA works as follows. During each stage of
our algorithm, our search for ITEM is reduced to a segment of elements of DATA:
DATA[BEG], DATA[BEG + 1], DATA[BEG + 2], ...... DATA[END].

Note that the variable BEG and END denote the beginning and end locations of the segment
respectively. The algorithm compares ITEM with the middle element DATA[MID] of the
segment, where MID is obtained by
MID = INT((BEG + END) / 2)
(We use INT(A) for the integer value of A.) If DATA[MID] = ITEM, then the search is
successful and we set LOC: = MID. Otherwise a new segment of DATA is obtained as follows:

(a) If ITEM < DATA[MID], then ITEM can appear only in the left half of the
segment: DATA[BEG],DATA[BEG + 1],….. ,DATA[MID - 1]
So we reset END := MID - 1 and begin searching again.

(b) If ITEM > DATA[MID], then ITEM can appear only in the right half of the
segment: DATA[MID + 1], DATA[MID + 2],....,DATA[END]
So we reset BEG := MID + 1 and begin searching again.
Initially, we begin with the entire array DATA; i.e., we begin with BEG = 1
and END = n. If ITEM is not in DATA, then eventually we obtain END < BEG.

This condition signals that the search is unsuccessful, and in this case we assign LOC: =
NULL. Here NULL is a value that lies outside the set of indices of DATA. We now formally
state the binary search algorithm.

Algorithm 2.9: (Binary Search) BINARY(DATA, LB, UB, ITEM, LOC)


Here DATA is a sorted array with lower bound LB and upper bound
UB, and ITEM is a given item of information. The variables BEG,
END and MID denote, respectively, the beginning, end and middle
locations of a segment of elements of DATA. This algorithm finds
the location LOC of ITEM in DATA or sets LOC=NULL.
1. [Initialize segment variables.]
Set BEG := LB, END := UB and MID := INT((BEG + END)/2).
2. Repeat Steps 3 and 4 while BEG ≤ END and
DATA[MID] ≠ ITEM.
3. If ITEM<DATA[MID], then:
Set END := MID - 1.
Else:
Set BEG := MID + 1.
[End of If structure]
4. Set MID := INT((BEG + END)/2).
[End of Step 2 loop.]
5. If DATA[MID] = ITEM, then:
Set LOC:=MID.
Else:
Set LOC := NULL.
[End of If structure.]
6. Exit.
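
A C sketch of Algorithm 2.9 (0-based indices; -1 plays the role of NULL for an unsuccessful search):

    /* Binary search: data[lb..ub] must be sorted in increasing order.
       Returns the location of item, or -1 (playing the role of NULL). */
    int binary_search(const int data[], int lb, int ub, int item) {
        int beg = lb, end = ub;
        while (beg <= end) {
            int mid = (beg + end) / 2;            /* MID = INT((BEG + END)/2) */
            if (data[mid] == item)
                return mid;                       /* successful search */
            else if (item < data[mid])
                end = mid - 1;                    /* continue in the left half */
            else
                beg = mid + 1;                    /* continue in the right half */
        }
        return -1;                                /* unsuccessful: END < BEG */
    }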
Example 2.9
Let DATA be the following sorted 13-element array:
DATA: 11, 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, 99
We apply the binary search to DATA for different values of ITEM.
(a) Suppose ITEM = 40. The search for ITEM in the array DATA is pictured below,
where the values of DATA[BEG] and DATA[END] in each stage of the algorithm
are indicated by parentheses and the value of DATA[MID] in bold. Specifically,
BEG, END and MID will have the following successive values:
(1) Initially, BEG = 1 and END = 13. Hence,
MID = INT[(1 + 13) / 2 ] = 7 and so DATA[MID] = 55
(2) Since 40 < 55, END = MID – 1 = 6. Hence,
MID = INT[(1 + 6) / 2 ] = 3 and so
DATA[MID] = 30
(3) Since 40 > 30, BEG = MID + 1 = 4. Hence,
MID = INT[(4 + 6) / 2 ] = 5 and so
DATA[MID] = 40
The search is successful and LOC = MID = 5.

(1) (11), 22, 30, 33, 40, 44, 55, 60, 66, 77, 80, 88, (99)
(2) (11), 22, 30, 33, 40, (44), 55, 60, 66, 77, 80, 88, 99
(3) 11, 22, 30, (33), 40, (44), 55, 60, 66, 77, 80, 88, 99 [Successful]

Complexity of the Binary Search Algorithm


The complexity is measured by the number of comparisons f(n) to locate ITEM in
DATA, where DATA contains n elements. Observe that each comparison reduces the
sample size by half.
Hence we require at most f(n) comparisons to locate ITEM,
where f(n) = ⌊log₂ n⌋ + 1
That is, the running time for the worst case is approximately equal to log₂ n. The
running time for the average case is approximately equal to the running time for the worst
case.
Limitations of the Binary Search Algorithm
The algorithm requires two conditions:
(1) the list must be sorted and
(2) one must have direct access to the middle element in any sublist.
Abstract Data Type

1. Abstract Data Types (ADTs) are a model used to understand the design of a data
structure.
2. 'Abstract' implies that we give an implementation-independent view of the data
structure.
3. ADTs specify the type of data stored and the operations that support the data.
4. Viewing a data structure as an ADT allows a programmer to focus on an idealized
model of the data and its operations; a small sketch of this idea follows.
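
As an illustration of the ADT idea, here is a small array-based stack in C. Callers need only the operations; the representation could be changed (say, to a linked list) without affecting code that uses it. The interface and names here are illustrative, not from the original text:

    #include <stdlib.h>

    /* An ADT view of a stack: users rely only on the operations below,
       not on the array representation chosen here. */
    typedef struct {
        int *data;
        int top;                                  /* index of the next free slot */
        int capacity;
    } Stack;

    Stack *stack_create(int capacity) {
        Stack *s = malloc(sizeof *s);
        if (s != NULL) {
            s->data = malloc(capacity * sizeof *s->data);
            s->top = 0;
            s->capacity = capacity;
        }
        return s;
    }

    int stack_is_empty(const Stack *s) { return s->top == 0; }

    void stack_push(Stack *s, int value) {        /* legal only if not full */
        if (s->top < s->capacity)
            s->data[s->top++] = value;
    }

    int stack_pop(Stack *s) {                     /* pre-condition: not empty */
        return s->data[--s->top];
    }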

INTRODUCTION

This is an introductory unit that gives you an understanding of what a data structure
is. It is about structuring and organizing data as a fundamental aspect of developing a
computer application. This unit guides you from the definition of a data structure to the
linear data structures: arrays and records, stacks, queues and lists. Finally, the last
chapter leads you to advanced data structures like the binary tree. There you will
learn the issues related to trees: how to traverse a binary tree,
how to delete a node, how to search for a node, how to construct a tree, and so on.

Knowledge of data structures is required for people who design and develop computer
programs of any kind: systems software or application software. As you already know,
data means a collection of facts, concepts or instructions in a formalized manner
suitable for communication or processing. Processed data is called Information. A
data structure is an arrangement of data in a computer's memory or even disk storage.
Examples of common data structures are arrays, linked lists, queues, stacks,
binary trees, and hash tables. Algorithms, on the other hand, are used to manipulate the
data contained in these data structures, as in searching and sorting.

Software engineering is the study of ways in which to create large and complex
computer applications by programmers and designers. Software engineering involves the
full life cycle of a software project which includes Analysis, Design, Coding, Testing and
Maintenance. However, the subject of data structures and algorithms is concerned with
the Coding phase. Programmers use data structures and algorithms to store and
manipulate data.

One of the basic data structures of a program is the array. An array is a data structure which can
represent a collection of elements of the same data type. An array can be of any dimension:
it can be two dimensional, three dimensional, etc. Collections of data are frequently
organized into a hierarchy of fields, records and files. Specifically, a record is a collection
of related data items, each of which is called a field or attribute, and a file is a collection of
similar records.

Stacks and queues are very useful in Computer Science. A stack is a data structure which
allows elements to be inserted as well as deleted from only one end; a stack is also known
as a LIFO data structure. A queue is another common data structure, which allows
elements to be inserted at one end, called the Rear, and deleted at the other end, called the Front;
a queue is also known as a FIFO data structure.
The other data structure is the list. A list is an ordered set consisting of a variable number
of elements to which additions and deletions can be made. The first element of a list is
called the head of the list and the last element is called the tail of the list. A list can be represented
using arrays or pointers. If pointers are used, then operations on the list such as
insertion and deletion become easy.

The final data structure is Tree. A Tree is a connected, acyclic graph. A Tree contains no
loops or cycles. The concept of trees is one of the most fundamental and useful concepts
in Computer Science. A Tree structure is one in which items of data are related by edges.

As a first section, we will discuss the basic terminology and concepts of data
organization. Then we discuss the different operations that are applied to these data
structures.

OBJECTIVES

After going through this unit, you should be able to:

• understand the data organization;


• define the term 'data structure';
• know the classifications of data structures, i.e., linear and non-linear;
• understand the basic operations of linear and non-linear data structures;
• explain the memory representation of all types of data structures; and
• explain how to implement all kinds of data structures.

BASIC TERMINOLOGY; ELEMENTARY DATA ORGANIZATION

Data structures are the building blocks of a program. If a program is built using improper
data structures, then the program may not always work as expected. It is therefore very
important to use the right data structures for a program.

As pointed out earlier, data are simply values or sets of values. A data item is either the
value of a variable or a constant; for example, a data item is a row in a database table,
which is described by a data type. A data item that does not have subordinate data items
is called an elementary item. A data item that is composed of one or more subordinate
data items is called a group item. A record can be either an elementary item or a group
item. For example, an employee's name may be divided into three sub-items – first name,
middle name and last name – but the social_security_number would normally be treated
as a single item.

The data are frequently organized into a hierarchy of fields, records and files. In order to
understand these terms, let us see the following example (1.1),
Attributes: Name Age Sex Social_Security_Number
Values: Vignesh 30 m 123-34-23

(Example 1.1)

An entity is something that has certain attributes or properties which may be assigned
values. The values themselves may be either numeric or non-numeric. In the above
example, an employee of a given organization is an entity. Entities with similar attributes
(e.g., all the employees in an organization) collectively form an entity set. Each attribute
of an entity set has a range of values: the set of all possible values that could be assigned to
the particular attribute.

The way that data are organized into the hierarchy of fields, records and files reflects the
relationship between attributes, entities and entity sets. That is, a field is a single
elementary unit of information representing an attribute of an entity, a record is the
collection of field values of a given entity, and a file is the collection of records of the
entities in a given entity set.

Each record in a file may contain many field items, but the value in a certain field may
uniquely determine the record in the file. Such a field K is called a Primary Key, and the
values K1, K2, … in such a field are called keys or key values. Let us see examples of a
Key and a Primary Key.

See examples 1.2(a) and 1.2(b) below.

(a) An automobile dealership maintains an inventory file where each record contains the
following data:

Serial_ Number Type Year Price Accessories

The Serial_Number field can serve as a primary key for the file, since each automobile
has a unique serial number.
Example 1.2(a)

(b) An organization maintains a membership file where each record contains the following
data:

Name Address Telephone_Number Dues_Owed

Example 1.2(b)

Although there are four data items, Name and Address may be group items. Here the
Name field is a primary key. Note that the Address and Telephone_Number fields may not
serve as primary keys, since some members may belong to the same family and have the
same address and telephone number.
There are two categories of records according to the length of the record. A file can have
fixed-length records or variable-length records. In fixed-length records, all the records
contain the same data items, with the same amount of space assigned to each data item.
In variable-length records, file records may have different lengths. For example, student
records usually have variable lengths, since different students take different numbers of
courses. Usually, variable-length records have a maximum and a minimum length.

As we know already, the organization of data into fields, records and files may not be
complex enough to maintain and efficiently process certain collections of data. For this
reason, data are also organized into more complex types of structures. To learn about
these data structures, we have to follow these three steps:
1. Logical or mathematical description of the structure.
2. Implementation of the structure on a computer.
3. Quantitative analysis of the structure, which includes determining the amount of
memory needed to store the structure and the time required to process the
structure.

We discuss some of these data structures (the study of the first step) as an introduction
in the next section. The study of the second and third steps depends on whether the data
are stored in the main (primary) memory of the computer or in a secondary (external)
storage unit.

Check Your Progress 1

1. Define Data
2. What do you mean by Data Item, Elementary Data Item and Group Item?
3. The data is organized into Fields, Records and Files. True/False
4. If the records contain the same data items with same amount of space assigned to
each data item is called -----------
5. In variable-length records, file records may contain different lengths. True/False

DATA STRUCTURES - AN OVERVIEW

Data may be organized in many different ways: the logical or mathematical model of a
particular organization of data is called a data structure.

Arrays

The simplest type of data structure is a linear (or one dimensional) array. By a linear
array, we mean a list of a finite number n of similar data elements referenced respectively
by a set of n consecutive numbers, usually 1, 2, 3, …n. If we choose the name A for the
array, then the elements of A are denoted by subscript notation
a1, a2, a3, ……., an
or, by the parenthesis notation
A(1), A(2), A(3),……., A(N)
or, by the bracket notation
A[1], A[2], A[3],…….., A[N]

Regardless of the notation, the number K in A[K] is called a subscript and A[K] is called
a subscripted variable.

Linear arrays are called one-dimensional arrays because each element in such an array is
referenced by one subscript. A two-dimensional array is a collection of similar data
elements where each element is referenced by two subscripts. Such arrays are called
matrices in mathematics, and tables in business applications.

Stack

A stack, also called a last-in-first-out (LIFO) system, is a linear list in which items may
be inserted or removed only at one end, called the top of the stack. Stacks occur in our
daily life, for example a stack of dishes on a spring system. We can observe that any
dish may be added or removed only from the top of the stack. Such lists are also called
"piles" or "push-down lists".

Queue:

A queue, also called a first-in-first-out (FIFO) system, is a linear list in which deletions
can take place only at one end of the list, the “front” of the list, and insertions can take
place only at the other end of the list, the "rear" of the list. The features of a Queue are
similar to those of any queue of customers at a counter, at a bus stop, at a railway
reservation counter, etc. A queue can be implemented using arrays or linked lists. A queue
can be represented as a circular queue. This representation saves space when compared
to the linear queue. Finally, there are special cases of queues called Dequeues which
allow insertion and deletion of elements at both ends.

Tree

A tree is an acyclic, connected graph. A tree contains no loops or cycles. The concept of
tree is one of the most fundamental and useful concepts in computer science. Trees have
many variations, implementations and applications. Trees find their use in applications
such as compiler construction, database design, windows, operating system programs,
etc. A tree structure is one in which items of data are related by edges. A very common
example is the ancestor tree as given in Figure 1.1. This tree shows the ancestors of
KUMAR. His parents are RAMESH and RADHA. RAMESH’s parents are PADMA and
SURESH who are also grand parents of KUMAR (on father’s side); RADHA’s parents
are JAYASHRI and RAMAN who are also grand parents of KUMAR (on mother’s
side).
KUMAR

RADHA RAMESH

JAYASHRI RAMAN PADMA SURESH

(Figure 1.1: A Family Tree)

Graph

The data structures discussed so far (Arrays, Lists, Stacks, and Queues) are linear data
structures. Graphs are classified in the non-linear category of data structures.

A graph G may be defined as a finite set V of vertices and a set E of edges (pair of
connected vertices). The notation used is as follows:

Graph G = (V,E)
Let us consider the graph of Figure 1.2.

(Figure 1.2: A graph with vertices 1, 2, 3, 4, 5 joined by the edges listed below)

From the above graph, we may observe that the set of vertices for the graph is
V = {1,2,3,4,5}, and the set of edges for the graph is E = {(1,2), (1,5), (1,3), (5,4), (4,3),
(2,3)}. The elements of E are always a pair of elements. The relationship between pairs
of these elements is not necessarily hierarchical in nature.

Check Your Progress 2


1. The classifications of data structures are -----------------
2. The number K in A[K] is called a ………….. and A[K] is called a ………………
3. Dequeues allow insertion and deletion of elements at both the end. True/False
4. Graphs are classified into ----------- category of data structures.
5. What do you mean by FIFO?
6. What do you mean by LIFO?
7. List out the applications which are using the data structure “Tree”.
LINEAR ARRAYS

A linear array is a list of a finite number n of homogeneous data elements (i.e., data
elements of the same type) such that:
(a) The elements of the array are referenced respectively by an index set consisting of
n consecutive numbers.
(b) The elements of the array are stored respectively in successive memory locations.

Let us analyze these properties for linear arrays.

The number n of elements is called the length or size of the array. If not explicitly stated,
we may assume the index set consists of the integers 1, 2, 3, ….., n.
The general equation to find the length or the number of data elements of the array is,
Length = UB – LB + 1 (1.4a)

where, UB is the largest index, called the upper bound, and LB is the smallest index,
called the lower bound of the array. Remember that length=UB when LB=1.

Also, remember that the elements of an array A may be denoted by the subscript notation,
A1, A2, A3……..An

Let us consider the following example 1.3(a):

a) Let DATA be a 6-element linear array of integers such that DATA[1]=247,
DATA[2]=56, DATA[3]=429, DATA[4]=135, DATA[5]=87, DATA[6]=156.

Example 1.3(a)

The following figures 1.3(a) and (b) depict the array DATA.

(Figure 1.3(a): DATA pictured as a vertical column of its six elements 247, 56, 429, 135, 87, 156)
(Figure 1.3(b): DATA pictured as a horizontal row of the same six elements)

Let us consider the other example 1.3(b):

b) An automobile company uses an array AUTO to record the number of automobiles
sold each year from 1932 through 1984.

Example 1.3(b)

Rather than beginning the index set with 1, it is more useful to begin the index set with
1932, so that

AUTO[K] = number of automobiles sold in the year K


Then, LB = 1932 is the lower bound and UB = 1984 is the upper bound of AUTO. Using
equation (1.4a), we can find the length of this array:

Length = UB - LB + 1 = 1984 - 1932 + 1 = 53

From this equation, we may conclude that AUTO contains 53 elements and that its index
set consists of all integers from 1932 through 1984.

REPRESENTATION OF LINEAR ARRAYS IN MEMORY

Let LA be a linear array in the memory of the computer. Recall that the memory of the
computer is simply a sequence of addressed locations, as pictured in Figure 1.4 below.

(Figure 1.4: Computer memory pictured as a sequence of addressed locations 1000, 1001, 1002, 1003, 1004, 1005, ...)

Let us use the following notation when calculating the address of any element of a linear
array in memory:
LOC(LA[K]) = address of the element LA[K] of the array LA.

As previously noted, the elements of LA are stored in successive memory cells.
Accordingly, the computer does not need to keep track of the address of every element of
LA, but needs to keep track only of the address of the first element of LA, which is
denoted by
Base(LA)

and called the base address of LA. Using this address Base(LA), the computer calculates
the address of any element of LA by the following formula:

LOC(LA[K]) = Base(LA) + w (K- lower bound) (1.4b)


where w is the number of words per memory cell for the array LA.
Observe that the time to calculate LOC(LA[K]) is essentially the same for any
value of K. Furthermore, given any subscript K, one can locate and access the content of
LA[K] without scanning any other element of LA.
Example 1.4,

Consider the previous example 1.3(b), the array AUTO, which records the number of
automobiles sold each year from 1932 through 1984. The way AUTO appears in memory
is pictured in Figure 1.5. Assume Base(AUTO) = 200 and w = 4 words per memory cell
for AUTO.

Then the addresses of the first few elements of the array are:

LOC(AUTO[1932]) = 200, LOC(AUTO[1933]) = 204, LOC(AUTO[1934]) = 208, …

Let us find the address of the array element for the year K = 1965. It can be obtained by
using Equation (1.4b):

LOC(AUTO[1965]) = Base(AUTO) + w(1965 – lower bound) = 200 + 4(1965-1932) = 332

Again, we emphasize that the contents of this element can be obtained without scanning
any other element in array AUTO.

(Figure 1.5: Array AUTO in memory. AUTO[1932] occupies words 200-203,
AUTO[1933] occupies words 204-207, AUTO[1934] occupies words 208-211, and so on.)
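For readers who want to see the address arithmetic of Equation (1.4b) in a programming
language, the following is a minimal C sketch. The function name loc_of is our own
illustrative choice; the constants come from Example 1.4.

#include <stdio.h>

/* Address of LA[K] for a linear array, following Equation (1.4b):
   LOC(LA[K]) = Base(LA) + w*(K - lower bound). */
long loc_of(long base, int w, int k, int lb)
{
    return base + (long)w * (k - lb);
}

int main(void)
{
    /* Example 1.4: Base(AUTO) = 200, w = 4, lower bound = 1932. */
    printf("LOC(AUTO[1965]) = %ld\n", loc_of(200, 4, 1965, 1932)); /* 332 */
    return 0;
}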
MULTIDIMENSIONAL ARRAYS

The linear arrays discussed so far are also called one-dimensional arrays, since each
element in the array is referenced by a single subscript. Most programming languages
allow two-dimensional and three-dimensional arrays, i.e., arrays whose elements are
referenced, respectively, by two and three subscripts. In fact, some programming
languages allow the number of dimensions for an array to be as high as 7. This section
discusses these multidimensional arrays.

Two-Dimensional Arrays

A two-dimensional m x n array A is a collection of m . n data elements such that each
element is specified by a pair of integers (such as J, K), called subscripts, with the
property that
1 <= J <= m and 1 <= K <= n
The element of A with first subscript J and second subscript K will be denoted by the
subscript notation AJ,K or by the bracket notation A[J, K].
As we learned already, two-dimensional arrays are called matrices in mathematics and
tables in business applications; hence two-dimensional arrays are called matrix arrays.

There is a standard way of drawing a two-dimensional m x n array A, where the elements
of A form a rectangular array with m rows and n columns, and where the element A[J, K]
appears in row J and column K. Recall that a row is a horizontal list of elements and a
column is a vertical list of elements. In the following figure 1.8, we may observe that the
two-dimensional array A has 3 rows and 4 columns. We emphasize that each row contains
those elements with the same first subscript, and each column contains those elements
with the same second subscript.

1 2 3 4
1 A[1, 1] A[1, 2] A[1, 3] A[1, 4]
Rows 2 A[2, 1] A[2, 2] A[2, 3] A[2, 4]
3 A[3, 1] A[3, 2] A[3, 3] A[3, 4]

Figure 1.8: Two-dimensional 3 x 4 array A

Let us go through example 1.8.

Suppose each student in a class of 25 students is given 4 tests. If the students are
numbered from 1 to 25, the test scores can be assigned to a 25 x 4 matrix array SCORE
as pictured in figure 1.9.

Thus, SCORE[K, L] contains the Kth student's score on the Lth test. In particular, the
second row of the array,
SCORE[2, 1], SCORE[2, 2], SCORE[2, 3], SCORE[2, 4]
contains the four test scores of the second student.

Student Test 1 Test 2 Test 3 Test 4


1 84 73 88 81
2 95 100 88 96
3 72 66 77 72
. . . . .
. . . . .
. . . . .
25 78 82 70 85

Figure 1.9: Array SCORE


Let A be a two-dimensional m x n array. The first dimension of A contains the index set
1, …, m, with lower bound 1 and upper bound m; and the second dimension of A
contains the index set 1, 2, …, n, with lower bound 1 and upper bound n. The length of a
dimension is the number of integers in its index set. The pair of lengths m x n (read "m
by n") is called the size of the array. The length of a given dimension (i.e., the number of
integers in its index set) is obtained from the formula
Length = upper bound - lower bound + 1
Representation of Two-Dimensional Arrays in Memory
Let A be a two-dimensional m x n array. Although A is pictured as a rectangular array of
elements with m rows and n columns, the array will be represented in memory by a block
of m . n sequential memory locations. If they are being stored in sequence, then how are
they sequenced? Are the elements stored row-wise or column-wise? This depends on the
programming language. Specifically, a programming language will store the array A in
either
• Column by column, called column-major order, or
• Row by row, called row-major order.
The following figures 1.10 (a) & (b) shows these two ways when A is a two-dimensional
3 x 4 array.
Fig. 1.10(a) Column-major order: the elements are stored in the sequence
(1,1), (2,1), (3,1), (1,2), (2,2), (3,2), (1,3), (2,3), (3,3), (1,4), (2,4), (3,4).

Fig. 1.10(b) Row-major order: the elements are stored in the sequence
(1,1), (1,2), (1,3), (1,4), (2,1), (2,2), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4).


Recall that, for a linear array LA, the computer does not keep track of the address
LOC(LA[K]) of every element LA[K] of LA, but does keep track of Base(LA), the
address of the first element of LA. The computer uses the formula
LOC(LA[K]) = Base(LA) + w (K-1)
to find the address of LA[K]. Here w is the number of words per memory cell for the
array LA, and 1 is the lower bound of the index set of LA.

A similar situation also holds for any two-dimensional m x n array A. That is, the
computer keeps track of Base(A) – the address of the first element A[1,1] of A – and
computes the address LOC(A[J, K]) of A[J, K].

The formula for column-major order is

LOC(A[J, K]) = Base(A) + w[M(K-1) + (J-1)] (1.4.6.2a)

The formula for row major order is,

LOC(A[J, K]) = Base(A) + w[N(J-1) + (K-1)] (1.4.6.2b)

Again, w denotes the number of words per memory location for array A. Note that the
formulas are linear in J and K, and that we may find the address LOC(A[J, K]) in time
independent of J and K.

By using these formulas, let us see the following example 1.9.

Example 1.9

Consider the previous example 1.8 of the 25 x 4 matrix array SCORE. Suppose
Base(SCORE) = 200 and there are w = 4 words per memory cell. Furthermore, suppose
the programming language stores two-dimensional arrays using row-major order. Then the
address of SCORE[12, 3], the twelfth student's score on the third test, is computed as
follows:

LOC(SCORE[12,3]) = 200 + 4[4(12-1) + (3-1)] = 200 + 4[46] = 384

Can you see how we obtained the address of the twelfth student's third test score? We
derived it by simply applying Equation (1.4.6.2b).
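Equation (1.4.6.2b) can be sketched in C in the same spirit. The function name
loc_row_major is our own; the values are those of Example 1.9, so the program prints 384.

#include <stdio.h>

/* Row-major address of A[J, K] in an M x N array with 1-based subscripts,
   following Equation (1.4.6.2b): LOC = Base + w*(N*(J-1) + (K-1)). */
long loc_row_major(long base, int w, int n, int j, int k)
{
    return base + (long)w * ((long)n * (j - 1) + (k - 1));
}

int main(void)
{
    /* Example 1.9: Base(SCORE) = 200, w = 4, SCORE is a 25 x 4 array. */
    printf("LOC(SCORE[12,3]) = %ld\n", loc_row_major(200, 4, 4, 12, 3));
    return 0;
}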

Multidimensional arrays clearly illustrate the difference between the logical and the
physical views of data. Figure 1.8 shows how we can logically view a 3 x 4 matrix array
A, that is, as a rectangular array of data where A[J, K] appears in row J and column K.
On the other hand, the data are physically stored in memory as a linear collection of
memory cells.
General Multidimensional Arrays

General multidimensional arrays are defined analogously. More specifically, an
n-dimensional m1 x m2 x … x mn array B is a collection of m1 . m2 . … . mn data
elements in which each element is specified by a list of n integers such as K1, K2, …, Kn,
called subscripts, with the property that

1 <= K1 <= m1, 1 <= K2 <= m2, …, 1 <= Kn <= mn

The element of B with subscripts K1, K2, …, Kn will be denoted by

B[K1, K2, …, Kn]

The array will be stored in memory in a sequence of memory locations. Specifically, the
programming language will store the array B either in row-major order or column-major
order.
Unit 2
LINKED LIST

BASIC TERMINOLOGY

A linked list, or one-way list, is a linear collection of data elements, called nodes, where
the linear order is given by means of pointers. That is, each node is divided into two
parts: the first part contains the information of the element, and the second part, called the
link field or next pointer field, contains the address of the next node in the list.

Figure 1.20 is a schematic diagram of a linked list with 6 nodes. Each node is pictured
with two parts. The left part represents the information part of the node, which may
contain an entire record of data items (e.g., NAME, ADDRESS, ...). The right part
represents the next pointer field of the node, and there is an arrow drawn from it to the
next node in the list. This follows the usual practice of drawing an arrow from a field to a
node when the address of the node appears in the given field. The pointer of the last node
contains a special value, called the null pointer, which is an invalid address.

(Figure 1.20)

REPRESENTATION OF LINKED LISTS IN MEMORY

Let LIST be a linked list. Then LIST will be maintained in memory as follows. First of
all, LIST requires two linear arrays, which we will call INFO and LINK, such that
INFO[K] and LINK[K] contain the information part and the next pointer field of a node
of LIST, respectively. LIST also requires a variable START, which contains the location
of the beginning of the list, and a next pointer sentinel, denoted by NULL, which
indicates the end of the list.

The following examples of linked lists indicate that more than one list may be maintained
in the same linear arrays INFO and LINK. However, each list must have its own pointer
variable giving the location of its first node.
Example 1.22

Figure 1.21 pictures a linked list in memory where each node of the list contains a
single character. We can obtain the actual list of characters, or, in other words, the
string, as follows:

(Figure 1.21)

START = 9, so INFO[9] = N is the first character.


LINK[9] = 3, so INFO[3] = O is the second character.
LINK[3] = 6, so INFO[6] = (blank) is the third character.
LINK[6] = 11, so INFO[11] = E is the fourth character.
LINK[11] = 7, so INFO[7] = X is the fifth character.
LINK[7] = 10, so INFO[10] = I is the sixth character.
LINK[10] = 4, so INFO[4] = T is the seventh character.
LINK[4] = 0, the NULL value, so the list has ended.

TRAVERSING A LINKED LIST


Let LIST be a linked list in memory stored in linear arrays INFO and LINK with START
pointing to the first element and NULL indicating the end of LIST. Suppose we want to
traverse LIST in order to process each node exactly once. Our traversing algorithm uses a
pointer variable PTR which points to the node that is currently being processed.
Accordingly, LINK[PTR] points to the next node to be processed. The assignment
PTR := LINK[PTR]
moves the pointer to the next node in the list.

(Figure 1.22: PTR := LINK[PTR])

The details of Algorithm 1.6 are as follows. Initialize PTR. Then process INFO[PTR],
the information at the first node. Update PTR by the assignment PTR:=LINK[PTR], and
then process INFO[PTR], the information at the second node and so on until
PTR=NULL, which signals the end of the list.
Algorithm 1.6: (Traversing a Linked List) Let LIST be a linked list in memory. This
algorithm traverses LIST, applying an operation PROCESS to each
element of LIST. The variable PTR points to the node currently being
processed.
1. Set PTR := START. [Initializes pointer PTR].
2. Repeat Steps 3 and 4 while PTR # NULL.
3. Apply PROCESS to INFO[PTR].
4. Set PTR:=LINK[PTR]. [PTR now points to the next node.]
[End of Step 2 loop.]
5. Exit.

The following Procedure 1.6 shows how to traverse and print a linked list. It is similar
to Algorithm 1.6.

Procedure 1.6: PRINT(INFO, LINK, START)


This procedure prints the information at each node of the list.
1. Set PTR := START.
2. Repeat Steps 3 and 4 while PTR # NULL:
3. Write: INFO[PTR].
4. Set PTR := LINK[PTR]. [Updates pointer.]
[End of Step 2 loop.]
5. Return.
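This unit maintains lists in the parallel arrays INFO and LINK, but in a language such as
C a linked list is usually built from self-referential structures. The following is a
minimal pointer-based sketch of Algorithm 1.6/Procedure 1.6; the names Node and
print_list are our own, and the sample list spells out the string of Example 1.22.

#include <stdio.h>

/* A node holds an information part and a link (next pointer) field. */
struct Node {
    char info;
    struct Node *link;
};

/* Procedure 1.6: print the information at each node of the list. */
void print_list(struct Node *start)
{
    struct Node *ptr = start;          /* Set PTR := START */
    while (ptr != NULL) {              /* Repeat while PTR != NULL */
        printf("%c", ptr->info);       /* Write: INFO[PTR] */
        ptr = ptr->link;               /* Set PTR := LINK[PTR] */
    }
    printf("\n");
}

int main(void)
{
    /* The list of Example 1.22: N -> O -> (blank) -> E -> X -> I -> T. */
    struct Node t = {'T', NULL}, i = {'I', &t}, x = {'X', &i};
    struct Node e = {'E', &x}, sp = {' ', &e}, o = {'O', &sp};
    struct Node n = {'N', &o};
    print_list(&n);                    /* prints: NO EXIT */
    return 0;
}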

The following Procedure 1.7 counts the number of elements in a linked list.

Procedure 1.7: COUNT (INFO, LINK, START, NUM)


1. Set NUM := 0. [Initializes counter.]
2. Set PTR := START. [Initializes pointer.]
3. Repeat Steps 4 and 5 while PTR # NULL.
4. Set NUM := NUM+1. [Increases NUM by 1]
5. Set PTR := LINK[PTR]. [Updates pointer.]
[End of Step 3 loop.]
6. Return.
SEARCHING A LINKED LIST
Let LIST be a linked list in memory. We are given an ITEM of information. In this
section we are going to discuss the two searching algorithms for finding the location LOC
of the node where ITEM first appears in LIST.
The first algorithm 1.7 does not assume that the data in LIST are sorted. The second
algorithm 1.8 does assume that LIST is sorted. If ITEM is actually a key value and we
are searching through a file for the record containing ITEM, then ITEM can appear only
once in LIST.

LIST is Unsorted
The data in LIST are not necessarily sorted. Then one searches for ITEM in LIST by
traversing through the list using a pointer variable PTR and comparing ITEM with the
contents INFO[PTR] of each node, one by one, of LIST. Before we update the pointer
PTR by
PTR := LINK[PTR]
we require two tests. First we have to check whether we reached the end of the list; i.e.,
PTR = NULL
If not, then we check to see whether
INFO[PTR] = ITEM

Algorithm 1.7: SEARCH(INFO, LINK, START, ITEM, LOC)


LIST is a linked list in memory. This algorithm finds the location LOC
of the node where ITEM first appears in LIST, or sets LOC=NULL.
1. Set PTR := START.
2. Repeat Step 3 while PTR # NULL:
3. If ITEM = INFO[PTR] then:
Set LOC := PTR and Exit.
Else:
Set PTR := LINK[PTR].[PTR now points to the next node.]
[End of If structure.]
[End of Step 2 loop.]
4. Set LOC:=NULL. [Search is unsuccessful.]
5. Exit.
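A pointer-based C version of Algorithm 1.7 follows the same pattern. This sketch reuses
the Node structure from the traversal example above; the function name search is our own.

#include <stddef.h>

struct Node { char info; struct Node *link; };

/* Algorithm 1.7: return the location of the first node whose INFO equals
   item, or NULL when the search is unsuccessful. */
struct Node *search(struct Node *start, char item)
{
    struct Node *ptr = start;          /* Set PTR := START */
    while (ptr != NULL) {              /* Repeat while PTR != NULL */
        if (ptr->info == item)
            return ptr;                /* Set LOC := PTR and Exit */
        ptr = ptr->link;               /* PTR now points to the next node */
    }
    return NULL;                       /* Search is unsuccessful */
}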

LIST is Sorted
The data in LIST are sorted, say in ascending order. Again we search for ITEM in LIST
by traversing the list using a pointer variable PTR and comparing ITEM with the contents
INFO[PTR] of each node, one by one, of LIST. Here we can stop once INFO[PTR]
exceeds ITEM.

Algorithm 1.8: SRCHSL(INFO, LINK, START, ITEM, LOC)


LIST is a sorted list in memory. This algorithm finds the location LOC of
the node where ITEM first appears in LIST, or sets LOC=NULL.
1. Set PTR := START.
2. Repeat Step 3 while PTR # NULL:
3. If ITEM > INFO[PTR], then:
Set PTR := LINK[PTR]. [PTR now points to the next node.]
Else if ITEM = INFO[PTR], then:
Set LOC := PTR, and Exit. [Search is successful.]
Else:
Set LOC := NULL, and Exit. [INFO[PTR] now exceeds ITEM.]
[End of If structure.]
[End of Step 2 loop.]
4. Set LOC := NULL.
5. Exit.

INSERTION INTO A LINKED LIST

Let LIST be a linked list with successive nodes A and B, as pictured in fig. 1.23(a).
Suppose a node N is to be inserted into the list between nodes A and B. The schematic
diagram of such an insertion appears in fig. 1.23(b). That is, node A now points to the
new node N, and node N points to node B, to which A previously pointed.

Figure 1.23(a): Before Insertion

Figure 1.23(b): After Insertion

Insertion Algorithms
Let us discuss three insertion algorithms.
(a) The first one inserts a node at the beginning of the list.
(b) The second one inserts a node after a node with a given location.
(c) The third one inserts a node into a sorted list.

In all the following algorithms assume that the linked list is in memory in the form
LIST(INFO, LINK, START; AVAIL) and that the variable ITEM contains the new
information to be added to the list. Since our insertion algorithms will use a node in the
AVAIL list, all of the algorithms will include the following steps:
(a) Checking to see if space is available in the AVAIL list. If not, that is, if
AVAIL = NULL, then the algorithm will print the message OVERFLOW.
(b) Removing the first node from the AVAIL list. Using the variable NEW to keep
track of the location of the new node, this step can be implemented by the pair
of assignments
NEW := AVAIL, AVAIL := LINK[AVAIL]
(c) Copying the new information into the new node. In other words,
INFO[NEW] := ITEM

Insertion at the Beginning of a List


The linked list is not sorted. The algorithm 1.9 inserts the node at the beginning of the
list.
Algorithm 1.9: INSFIRST(INFO, LINK, START, AVAIL, ITEM)
This algorithm inserts ITEM as the first node in the list.
1. [OVERFLOW?]
If AVAIL = NULL, then: Write: OVERFLOW, and Exit.
2. [Remove first node from AVAIL list.]
Set NEW := AVAIL and AVAIL := LINK[AVAIL].
3. [Copy new data into new node.]
Set INFO[NEW] := ITEM.
4. [Make the new node point to the original first node.]
Set LINK[NEW] := START.
5. [Change START so it points to the new node.]
Set START := NEW.
6. Exit.

(Figure 1.24 (a) : Before Insertion)

(Figure 1.24 (b) : After Insertion)
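In pointer-based C the role of the AVAIL list is played by the heap allocator, so the
OVERFLOW test of Algorithm 1.9 becomes a test of malloc's return value. A minimal
sketch, with the illustrative name insert_first:

#include <stdlib.h>

struct Node { char info; struct Node *link; };

/* Algorithm 1.9: insert item as the first node; returns the new START. */
struct Node *insert_first(struct Node *start, char item)
{
    struct Node *n = malloc(sizeof *n);   /* NEW := AVAIL */
    if (n == NULL)
        return start;                     /* OVERFLOW: no space available */
    n->info = item;                       /* INFO[NEW] := ITEM */
    n->link = start;                      /* LINK[NEW] := START */
    return n;                             /* START := NEW */
}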


Inserting after a Given Node
Suppose we are given the value of LOC where either LOC is the location of a node A in a
linked LIST or LOC=NULL. The following is an algorithm which inserts ITEM into
LIST so that ITEM follows node A or, when LOC = NULL, so that ITEM is the first
node:
Let N denote the new node (whose location is NEW). If LOC = NULL, then N is inserted
as the first node in LIST as in algorithm 1.9. We let node N point to node B (which
originally followed node A) by the assignment
LINK[NEW] := LINK[LOC]
and we let node A point to the new node N by the assignment
LINK[LOC] := NEW
The algorithm is as follows.

Algorithm 1.10: INSLOC(INFO, LINK, START, AVAIL, LOC, ITEM)


This algorithm inserts ITEM so that ITEM follows the node with
location LOC, or inserts ITEM as the first node when LOC = NULL.

1. [OVERFLOW?] If AVAIL=NULL, then: Write: OVERFLOW, and Exit.


2. [Remove first node from AVAIL list.]
Set NEW := AVAIL and AVAIL := LINK[AVAIL].
3. [Copy new data into new node.] Set INFO[NEW] := ITEM.
4. If LOC = NULL, then: [Insert as first node.]
Set LINK[NEW] := START and START := NEW.
Else: [Insert after node with location LOC.]
Set LINK[NEW] := LINK [LOC] and LINK[LOC]:=NEW.
[End of If structure.]
5. Exit.
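Algorithm 1.10 translates similarly. In this sketch START is passed by address so that
the LOC = NULL case can update it; the name insert_loc is our own.

#include <stdlib.h>

struct Node { char info; struct Node *link; };

/* Algorithm 1.10: insert item after the node at loc, or as the first
   node when loc is NULL. */
void insert_loc(struct Node **start, struct Node *loc, char item)
{
    struct Node *n = malloc(sizeof *n);
    if (n == NULL)
        return;                           /* OVERFLOW */
    n->info = item;                       /* INFO[NEW] := ITEM */
    if (loc == NULL) {                    /* insert as first node */
        n->link = *start;                 /* LINK[NEW] := START */
        *start = n;                       /* START := NEW */
    } else {                              /* insert after node at loc */
        n->link = loc->link;              /* LINK[NEW] := LINK[LOC] */
        loc->link = n;                    /* LINK[LOC] := NEW */
    }
}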

Inserting into a Sorted Linked List


Suppose ITEM is to be inserted into a sorted linked LIST, so that the list remains sorted.
Then ITEM must be inserted between nodes A and B so that
INFO(A) < ITEM < INFO(B)
The following procedure finds the location LOC of node A, that is, which finds the
location LOC of the last node in LIST whose value is less than ITEM.
Traverse the list, using a pointer variable PTR and comparing ITEM with INFO[PTR] at
each node. While traversing, keep track of the location of the preceding node by using a
pointer variable PREV, as pictured in figure 1.25. Thus PREV and PTR are updated by
the assignments
PREV := PTR and PTR := LINK[PTR]

(Figure 1.25)
The traversing stops as soon as ITEM < INFO[PTR]. Then PTR points to node B, so
PREV will contain the location of node A.

Procedure 1.8: FINDA(INFO, LINK, START, ITEM, LOC)


This procedure finds the location LOC of the last node in a sorted list
such that INFO[LOC] < ITEM, or sets LOC=NULL.
1. [List empty?]
If START = NULL, then: Set LOC := NULL, and Return.
2. [Special case?]
If ITEM < INFO[START], then: Set LOC := NULL, and Return.
3. [Initializes pointers.]
Set PREV := START and PTR := LINK[START].
4. Repeat Steps 5 and 6 while PTR # NULL.
5. If ITEM < INFO[PTR],then:
Set LOC := PREV and Return.
[End of If structure.]
6. [Updates pointers.] Set PREV := PTR and PTR := LINK[PTR].
[End of Step 4 loop.]
7. Set LOC := PREV.
8. Return.

Algorithm 1.11: INSSRT(INFO, LINK, START, AVAIL, ITEM)


This algorithm inserts ITEM into a sorted linked list.
1. [Use Procedure 1.8 to find the location of the node preceding ITEM]
Call FINDA(INFO, LINK, START, ITEM, LOC).
2. [Use Algorithm 1.10 to insert ITEM after the node with location LOC]
Call INSLOC(INFO, LINK, START, AVAIL, LOC, ITEM).
3. Exit.

DELETION FROM A LINKED LIST


Let LIST be a linked list with a node N between nodes A and B, as pictured in
Fig.1.26(a). Suppose node N is to be deleted from the linked list. The schematic diagram
of such a deletion appears in Fig.1.26(b). The deletion occurs as soon as the next pointer
field of node A points to node B. (Accordingly, when performing deletions, one must
keep track of the address of the node which immediately precedes the node that is to be
deleted.)

Figure 1.26(a) : Before deletion


Figure 1.26(b) : After deletion

Deletion Algorithms
We discuss two deletion algorithms.
(a) The first one deletes a node following a given node.
(b) The second one deletes the node with a given ITEM of information.
All our algorithms assume that the linked list is in memory in the form LIST(INFO,
LINK, START, AVAIL) and that the variable ITEM contains the information of the node
to be deleted.

All of our deletion algorithms will return the memory space of the deleted node N to the
beginning of the AVAIL list. Accordingly, all of our algorithms will include the
following pair of assignments, where LOC is the location of the deleted node N:
LINK[LOC] := AVAIL and then AVAIL := LOC
If START = NULL, that is, if the list is empty, then the algorithm will print the message
UNDERFLOW.

Deleting the Node Following a Given Node


Let LIST be a linked list in memory. Suppose we are given the location LOC of a node N
in LIST. Furthermore, suppose we are given the location LOCP of the node preceding N
or, when N is the first node, we are given LOCP = NULL. The following Algorithm 1.12
deletes N from the list.

Algorithm 1.12: DEL(INFO, LINK, START, AVAIL, LOC, LOCP)


This algorithm deletes the node N with location LOC. LOCP is the location of
the node which precedes N or, when N is the first node, LOCP=NULL.
1. [Deletes first node.]
If LOCP = NULL, then:
Set START:=LINK[START].
Else: [Deletes node N.]
Set LINK[LOCP] := LINK[LOC].
[End of If structure.]
2. [Return deleted node to the AVAIL list.]
Set LINK[LOC] := AVAIL and AVAIL := LOC.
3. Exit.
Deleting the Node with a Given ITEM of Information
Let LIST be a linked list in memory. Suppose we are given an ITEM of information and
we want to delete from the LIST the first node N which contains ITEM. (If ITEM is a
key value, then only one node can contain ITEM.). First we give a procedure 1.9 which
finds the location LOC of the node N containing ITEM and the location LOCP of the
node preceding node N. If N is the first node, we set LOCP = NULL, and if ITEM does
not appear in LIST, we set LOC = NULL. (This procedure is similar to Procedure 1.8)

Traverse the list, using a pointer variable PTR and comparing ITEM with INFO[PTR] at
each node. While traversing, keep track of the location of the preceding node by using a
pointer variable SAVE as in fig.1.25. Thus SAVE and PTR are updated by the
assignments
SAVE := PTR and PTR := LINK[PTR]
The traversing stops as soon as ITEM = INFO[PTR]. Then PTR contains the location
LOC of node N and SAVE contains the location LOCP of the node preceding N.

The formal statement of our procedure follows. The cases where the list is empty or
where INFO[START]=ITEM(i.e., where node N is the first node) are treated separately,
since they do not involve the variable SAVE.

Procedure 1.9: FINDB(INFO, LINK, START, ITEM, LOC, LOCP)


This procedure finds the location LOC of the first node N which contains ITEM
and the location LOCP of the node preceding N. If ITEM does not appear in the
list, then the procedure sets LOC=NULL; and if ITEM appears in the first node,
then it sets LOCP=NULL.
1. [List empty?]
If START = NULL, then:
Set LOC:=NULL and LOCP:=NULL, and Return.
[End of If structure.]
2. [ITEM in first node?]
If INFO[START] = ITEM, then:
Set LOC:=START and LOCP=NULL, and Return.
[End of If structure.]
3. [Initializes pointers.]
Set SAVE := START and PTR := LINK[START].
4. Repeat Steps 5 and 6 while PTR # NULL.
5. If INFO[PTR] = ITEM, then:
Set LOC := PTR and LOCP := SAVE, and Return.
[End of If structure.]
6. [Updates pointers.]
Set SAVE := PTR and PTR := LINK[PTR].
[End of Step 4 loop.]
7. [Search unsuccessful.]
Set LOC := NULL.
8. Return.
Algorithm 1.13: DELETE(INFO, LINK, START, AVAIL, ITEM)
This algorithm deletes from a linked list the first node N which contains the
given ITEM of information.
1. [Use Procedure 1.9 to find the location of N and its preceding node.]
Call FINDB(INFO, LINK, START, ITEM, LOC, LOCP)
2. [Item not found?]
If LOC = NULL, then:
Write: ITEM not in list, and Exit.
3. [Delete node.]
If LOCP = NULL, then:
Set START := LINK[START]. [Deletes first node.]
Else:
Set LINK[LOCP] := LINK[LOC].
[End of If structure.]
4. [Return deleted node to the AVAIL list.]
Set LINK[LOC] := AVAIL and AVAIL := LOC.
5. Exit.
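Procedure 1.9 and Algorithm 1.13 combine naturally into a single C function: the
variable prev plays the role of SAVE/LOCP, and free() returns the node to the system,
playing the role of the AVAIL list. The name delete_item is our own.

#include <stdlib.h>

struct Node { char info; struct Node *link; };

/* Delete the first node containing item (Procedure 1.9 + Algorithm 1.13). */
void delete_item(struct Node **start, char item)
{
    struct Node *ptr = *start, *prev = NULL;
    while (ptr != NULL && ptr->info != item) {
        prev = ptr;                       /* SAVE := PTR */
        ptr = ptr->link;                  /* PTR := LINK[PTR] */
    }
    if (ptr == NULL)
        return;                           /* ITEM not in list */
    if (prev == NULL)
        *start = ptr->link;               /* deletes first node */
    else
        prev->link = ptr->link;           /* LINK[LOCP] := LINK[LOC] */
    free(ptr);                            /* return deleted node to free storage */
}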

HEADER LINKED LISTS


A header linked list is a linked list which always contains a special node, called the
header node, at the beginning of the list. The following are two kinds of widely used
header lists:
1. A grounded header list is a header list where the last node contains the null
pointer.
2. A circular header list is a header list where the last node points back to the header
node.

Figure 1.27 contains schematic diagrams of these header lists. Unless otherwise stated or
implied, our header lists will always be circular. Accordingly, in such a case, the header
node also acts as a sentinel indicating the end of the list.

(Figure 1.27)
Observe that the list pointer START always points to the header node. Hence,
LINK[START] = NULL indicates that a grounded header list is empty, and
LINK[START] = START indicates that a circular header list is empty.
The following are the various operations performed on a circular header list.
1. Traversing a circular header list
2. Searching in a circular header list
3. Deleting from a circular header list

Let us discuss the respective algorithms below.

Algorithm 1.14: (Traversing a Circular Header List) Let LIST be a circular header
list in memory. This algorithm traverses LIST, applying an
operation PROCESS to each node of LIST.
1. [Initializes the pointer PTR.]
Set PTR := LINK[ START].
2. Repeat Steps 3 and 4 while PTR # START:
3. Apply PROCESS to INFO[PTR].
4. Set PTR := LINK[PTR]. [PTR now points to the next node.]
[End of Step 2 loop.]
5. Exit.
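In pointer-based C the header node itself serves as the sentinel, so the traversal stops
when the pointer comes back around to START. A sketch, with the illustrative name
traverse_header:

struct Node { char info; struct Node *link; };

/* Algorithm 1.14: traverse a circular header list, applying process to
   each data node. start points to the header node, which holds no data. */
void traverse_header(struct Node *start, void (*process)(struct Node *))
{
    struct Node *ptr = start->link;       /* PTR := LINK[START] */
    while (ptr != start) {                /* Repeat while PTR != START */
        process(ptr);                     /* Apply PROCESS to INFO[PTR] */
        ptr = ptr->link;                  /* PTR := LINK[PTR] */
    }
}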

Algorithm 1.15: SRCHHL(INFO, LINK, START, ITEM, LOC)


LIST is a circular header list in memory. This algorithm finds the
location LOC of the node where ITEM first appears in LIST or
sets LOC = NULL.
1. Set PTR := LINK[START].
2. Repeat while INFO[PTR] # ITEM and PTR # START:
Set PTR := LINK[PTR].[PTR now points to the next node]
[End of loop]
3. If INFO[PTR] = ITEM, then:
Set LOC := PTR.
Else:
Set LOC := NULL.
[End of If structure]
4. Exit.

The following procedure 1.10 finds the location LOC of the first node N which contains
ITEM and also the location LOCP of the node preceding N.
Procedure 1.10: FINDBHL(INFO, LINK, START, ITEM, LOC, LOCP)
1. [Initializes pointers.]
Set PREV := START and PTR := LINK[START].
2. [Updates pointers.]
Repeat while INFO[PTR] # ITEM and PTR # START.
Set PREV := PTR and PTR := LINK[PTR].
[End of loop.]
3. If INFO[PTR] = ITEM, then:
Set LOC := PTR and LOCP := PREV.
Else:
Set LOC :=NULL and LOCP := PREV.
[End of If structure.]
4. Exit.

The following algorithm deletes the first node N which contains ITEM in a circular header list.

Algorithm 1.16: DELLOCHL(INFO, LINK, START, AVAIL, ITEM)


1. [Use Procedure 1.10 to find the location of N and its preceding node]
Call FINDBHL(INFO, LINK, START, ITEM; LOC, LOCP).
2. [Item not found?]
If LOC = NULL, then:
Write: ITEM not in list, and Exit.
3. [Delete the node.]
Set LINK[LOCP] := LINK[LOC].
4. [Return deleted node to the AVAIL list.]
Set LINK[LOC] := AVAIL and AVAIL := LOC.
5. Exit.

Remark: There are two other variations of linked lists which sometimes appear in the
literature:
1. A linked list whose last node points back to the first node instead of
containing the null pointer, called a circular list
2. A linked list which contains both a special header node at the beginning of
the list and a special trailer node at the end of the list.
Figure 1.28 contains schematic diagrams of these lists.

(Figure 1.28)
TWO-WAY LISTS (Doubly Linked Lists)
Let us discuss the two-way list, which can be traversed in two directions, either
1. in the usual forward direction from the beginning of the list to the end, or
2. in the backward direction from the end of the list to the beginning.
Furthermore, given the location LOC of a node N in the list, one has immediate
access to both the next node and the preceding node in the list. This means, in particular,
that we are able to delete N from the list without traversing any part of the list.

A two-way list is a linear collection of data elements, called nodes, where each node N is
divided into three parts:
1. An information field INFO which contains the data of N
2. A pointer field FORW which contains the location of the next node in the list
3. A pointer field BACK which contains the location of the preceding node in
the list
The list also requires two list pointer variables: FIRST, which points to the first node in
the list, and LAST, which points to the last node in the list. Figure 1.29 contains a
schematic diagram of such a list. Observe that the null pointer appears in the FORW field
of the last node in the list and also in the BACK field of the first node in the list.

(Figure 1.29 : Two way list)

Observe that, using the variable FIRST and the pointer field FORW, we can traverse a
two-way list in the forward direction. On the other hand, using the variable LAST and the
pointer field BACK, we can also traverse the list in the backward direction.

Suppose LOCA and LOCB are the locations of nodes A and B in a two-way list,
respectively. Then the way that the pointers FORW and BACK are defined gives us the
following:
Pointer property: FORW[LOCA] = LOCB if and only if BACK[LOCB] = LOCA
In other words, the statement that node B follows node A is equivalent to the statement
that node A precedes node B.
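In C, a two-way list node simply carries both pointer fields. The following sketch (with
our own names DNode and delete_node) illustrates the point made above: given a node's
location, deletion requires no traversal, since FORW and BACK give both neighbours
directly. FIRST and LAST are passed by address so deleting an end node can update them.

struct DNode {
    int info;
    struct DNode *forw;                   /* location of the next node */
    struct DNode *back;                   /* location of the preceding node */
};

/* Delete node n from a two-way list without traversing any part of it. */
void delete_node(struct DNode *n, struct DNode **first, struct DNode **last)
{
    if (n->back != NULL)
        n->back->forw = n->forw;          /* predecessor now skips over n */
    else
        *first = n->forw;                 /* n was the first node */
    if (n->forw != NULL)
        n->forw->back = n->back;          /* successor now points back past n */
    else
        *last = n->back;                  /* n was the last node */
}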

Two-Way Header Lists


The advantages of a two-way list and a circular header list may be combined into a
two-way circular header list, as pictured in Figure 1.30. The list is circular because the
two end nodes point back to the header node. Observe that such a two-way list requires
only one list pointer variable START, which points to the header node. This is because
the two pointers in the header node point to the two ends of the list.
Unit 3
Stack and Queue
STACK OPERATIONS

Stacks and Queues are two data structures that allow insertion and deletion operations
only at the beginning or the end of the list, not in the middle. A stack is a linear structure
in which items may be added or removed only at one end. Figure 1.12 pictures three
everyday examples of such a structure: a stack of dishes, a stack of pennies and a stack of
folded towels.

Stacks are also called last-in first-out (LIFO) lists. Other names used for stacks are
"piles" and "push-down lists". Stacks have many important applications in computer
science. The notion of recursion is fundamental in computer science, and one way of
simulating recursion is by means of a stack structure. Let us learn the operations which
are performed on stacks.

Figure 1.12
A stack is a list of elements in which an element may be inserted or deleted only at one
end, called the top of the stack. This means that elements which are inserted last will be
removed first. Special terminology is used for the two basic operations associated with
stacks:
(a) "Push" is the term used to insert an element into a stack.
(b) "Pop" is the term used to delete an element from a stack.
Apart from these, the following operations may also be performed on a stack: (i) Create a
stack, (ii) Check whether a stack is empty, (iii) Check whether a stack is full, (iv) Initialize
a stack, (v) Read the stack top, (vi) Print the entire stack.

Example 1.16

Suppose that the following 6 elements are pushed, in order, onto an empty stack:
AAA, BBB, CCC, DDD, EEE, FFF

Figure 1.13(a), (b) & (c) shows three ways of picturing such a stack:
(a) as a vertical pile, with FFF (the element pushed last) on top and AAA at the bottom;
(b) as a vertical N-element array, with AAA, BBB, ..., FFF in positions 1 through 6 and
TOP pointing to position 6;
(c) as a horizontal N-element array, with AAA, BBB, ..., FFF in positions 1 through 6
and TOP marking position 6.

Figure 1.13 (a), (b) & (c)

ARRAY REPRESENTATION OF STACKS


Stacks will be maintained by a linear array STACK; a pointer variable TOP, which
contains the location of the top element of the stack; and a variable MAXSTK which
gives the maximum number of elements that can be held by the stack. The condition
TOP=0 or TOP=NULL will indicate that the stack is empty.

Figure 1.14 pictures such an array representation of a stack. Since TOP = 3, the stack has
three elements, XXX, YYY and ZZZ; and since MAXSTK = 8, there is room for 5 more
items in the stack.
MAXSTK 8
7
6
5
4
TOP → 3 ZZZ
2 YYY
1 XXX
Figure 1.14

The operation of adding (pushing) an item onto a stack and the operation of removing
(popping) an item from a stack are implemented by the following Procedures 1.1 and
1.2, called PUSH and POP respectively. When adding a new element, we must first
test whether there is free space in the stack for the new item; if not, then we have the
condition known as overflow. Analogously, in executing the procedure POP, we must
first test whether there is an element in the stack to be deleted; if not, then we have the
condition known as underflow.

Procedure 1.1: PUSH(STACK, TOP, MAXSTK, ITEM)


This procedure pushes an ITEM onto a stack.
1. [Stack already filled?]
If TOP=MAXSTK, then: Print: OVERFLOW, and Return.
2. Set TOP:=TOP + 1 [Increases TOP by 1.]
3. Set STACK[TOP] := ITEM. [Inserts ITEM in new TOP position.]
4. Return.

Procedure 1.2: POP(STACK, TOP, ITEM)


This procedure deletes the top element of STACK and assigns it
to the variable ITEM.
1. [Stack has an item to be removed?]
If TOP = 0, then: Print: UNDERFLOW, and Return.
2. Set ITEM := STACK[TOP]. [Assigns TOP element to ITEM.]
3. Set TOP := TOP-1. [Decreases TOP by 1]
4. Return.
Frequently, TOP and MAXSTK are global variables; hence the procedures may be called
using only
PUSH(STACK, ITEM) and POP(STACK, ITEM)

respectively. We note that the value of TOP is changed before the insertion in PUSH but
the value of TOP is changed after the deletion in POP.
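Procedures 1.1 and 1.2 can be sketched in C as follows. One difference: C arrays are
0-based, so here TOP counts the elements and STACK[TOP] is the next free slot, whereas
the procedures above use 1-based indexing. The names push and pop mirror the
procedures.

#include <stdio.h>

#define MAXSTK 8

int stack[MAXSTK];
int top = 0;                 /* top = 0 indicates an empty stack */

/* Procedure 1.1: PUSH, testing for overflow first. */
void push(int item)
{
    if (top == MAXSTK) { printf("OVERFLOW\n"); return; }
    stack[top++] = item;     /* TOP := TOP + 1; STACK[TOP] := ITEM */
}

/* Procedure 1.2: POP, testing for underflow first. */
int pop(void)
{
    if (top == 0) { printf("UNDERFLOW\n"); return -1; }
    return stack[--top];     /* ITEM := STACK[TOP]; TOP := TOP - 1 */
}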

Example 1.17 (a) & (b)

a. Consider the stack in Fig.1.14. We perform the operation PUSH(STACK, WWW):

1. Since TOP=3, control is transferred to Step 2.


2. TOP=3+1 = 4.
3. STACK[TOP] = STACK[4] = WWW.
Note that WWW is now the top element in the stack.

b. Consider again the stack in Fig 1.14. This time we simulate the operation
POP(STACK, ITEM):

1. Since TOP = 3, control is transferred to Step 2.
2. ITEM = ZZZ.
3. TOP = 3 - 1 = 2.
4. Return.
Observe that STACK[TOP] = STACK[2] = YYY is now the top element in the
stack.
APPLICATIONS OF STACKS

Arithmetic Expression; Polish Notation


Let Q be an arithmetic expression involving constants and operations. This section gives
an algorithm which finds the value of Q by using reverse Polish (postfix) notation. We
will see that the stack is an essential tool in this algorithm.
Recall that the binary operations in Q may have different levels of precedence.
Highest : Exponentiation ( ↑ )
Next highest : Multiplication ( * ) and division ( / )
Lowest : Addition ( + ) and subtraction ( - )
Example 1.18

Let us evaluate the following parenthesis-free arithmetic expression:

2 ↑ 3 + 5 * 2 ↑ 2 - 12 / 6
First we evaluate the exponentiations to obtain
8 + 5 * 4 - 12 / 6
Then we evaluate the multiplication and division to obtain 8 + 20 - 2. Last, we evaluate
the addition and subtraction to obtain the final result, 26. Observe that the expression is
traversed three times, each time corresponding to a level of precedence of the operations.

Polish Notation
In infix notation, the operator symbol is placed between its two operands. For example,
A + B, C - D, E * F, G / H
With this notation, we must distinguish between
(A + B) * C and A + (B * C)
by using either parentheses or an operator-precedence convention such as the usual
precedence levels discussed above. Accordingly, the order of the operators and operands
in an arithmetic expression does not uniquely determine the order in which the operations
are to be performed. In Polish notation, named after the Polish mathematician Jan
Lukasiewicz, the operator symbol is placed before its two operands.
Let us translate, step by step, the following infix expressions into Polish notation, using
brackets [ ] to indicate a partial translation:
(A + B) * C = [+AB] * C = *+ABC
A + (B * C) = A + [*BC] = +A*BC
(A + B) / (C - D) = [+AB] / [-CD] = /+AB-CD
The fundamental property of Polish notation is that the order in which the operations are
to be performed is completely determined by the positions of the operators and operands
in the expression. There is no need of parentheses when writing expressions in Polish
notation. The computer usually evaluates an arithmetic expression written in infix
notation in two steps. First, it converts the expression to postfix notation, and then it
evaluates the postfix expression.
Evaluation of a Postfix Expression
Suppose P is an arithmetic expression written in postfix notation. The following
Algorithm 1.3 uses a STACK to hold operands and evaluates P.

Algorithm 1.3: This algorithm finds the VALUE of an arithmetic expression P written in
postfix notation.
1. Add a right parenthesis ")"at the end of P. [This acts as a sentinel].
2. Scan P from left to right and repeat Steps 3 and 4 for each element of P
until the sentinel ")" is encountered.
3. If an operand is encountered, put it on STACK.
4. If an operator (x) is encountered, then:
a) Remove the two top elements of STACK, where A is the top
element and B is the next-to-top element.
b) Evaluate B (x) A.
c) Place the result of (b) back on STACK
[End of If structure.]
[End of Step 2 loop.]
5. Set VALUE equal to the top element on STACK.
6. Exit.
Example 1.19
Consider the following arithmetic expression P written in postfix notation:
P: 5, 6, 2, +, *, 12, 4, /, -

(Commas are used to separate the elements of P so that 5, 6, 2 is not interpreted as the
number 562.) The equivalent infix expression Q follows:
Q: 5 * (6 + 2) - 12 / 4
We evaluate P using algorithm 1.3. First we add a sentinel right parenthesis at the end of
P to obtain
P: 5, 6, 2, +, *, 12, 4, /, -, )
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
Figure 1.15 shows the contents of STACK as each element of P is scanned. The final
number in STACK, 37, which is assigned to VALUE when the sentinel “)” is scanned, is
the value of P.
Symbol Scanned STACK
(1) 5 5
(2) 6 5, 6
(3) 2 5, 6, 2
(4) + 5, 8
(5) * 40
(6) 12 40, 12
(7) 4 40, 12, 4
(8) / 40, 3
(9) - 37
(10) )
Figure 1.15
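Algorithm 1.3 can be sketched compactly in C for integer operands separated by spaces.
The function name eval_postfix is our own, and the sentinel ")" of Step 1 is replaced by
simply running off the end of the token list.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* Evaluate a space-separated postfix expression (Algorithm 1.3). Operands
   are pushed; an operator pops A (top) and B (next-to-top), then pushes
   the result of B (x) A. */
int eval_postfix(char *expr)
{
    int stack[100], top = 0;
    char *tok = strtok(expr, " ");
    while (tok != NULL) {
        if (isdigit((unsigned char)tok[0])) {
            stack[top++] = atoi(tok);      /* operand: put it on STACK */
        } else {
            int a = stack[--top];          /* A, the top element */
            int b = stack[--top];          /* B, the next-to-top element */
            switch (tok[0]) {
                case '+': stack[top++] = b + a; break;
                case '-': stack[top++] = b - a; break;
                case '*': stack[top++] = b * a; break;
                case '/': stack[top++] = b / a; break;
            }
        }
        tok = strtok(NULL, " ");
    }
    return stack[top - 1];                 /* VALUE := top element of STACK */
}

int main(void)
{
    char p[] = "5 6 2 + * 12 4 / -";       /* Example 1.19 */
    printf("%d\n", eval_postfix(p));       /* prints 37 */
    return 0;
}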
Transforming Infix Expressions into Postfix Expressions
Let Q be an arithmetic expression written in infix notation. The following algorithm 1.4
transforms the infix expression Q into its equivalent postfix expression P. The algorithm
uses a stack to temporarily hold operators and left parentheses. The postfix expression P
will be constructed from left to right using the operands from Q and the operators which
are removed from STACK. We begin by pushing a left parenthesis onto STACK and
adding a right parenthesis at the end of Q. The algorithm is completed when STACK is
empty.

Algorithm 1.4: POLISH(Q,P)


Suppose Q is an arithmetic expression written in infix notation. This algorithm finds the
equivalent postfix expression P
1. Push "(" onto STACK, and add ")" to the end of Q.
2. Scan Q from left to right and repeat Steps 3 to 6 for each element of Q until the
STACK is empty
3. If an operand is encountered, add it to P.
4. If a left parenthesis is encountered, push it onto STACK.
5. If an operator (x) is encountered, then:
a) Repeatedly pop from STACK and add to P each operator (on the
top of STACK) which has the same precedence as or higher
precedence than (x)
b) Add (x) to STACK.
[End of If structure.]
6. If a right parenthesis is encountered, then:
a) Repeatedly pop from STACK and add to P each operator (on top
of STACK) until a left parenthesis is encountered.
b) Remove the left parenthesis. [Do not add the left parenthesis to
P.]
[End of If structure.]
[End of Step 2 loop.]
7. Exit.

Example 1.20
Let us see an example. Consider the following arithmetic infix expression
Q: A + (B * C - (D / E ↑ F) * G) * H
We transform Q using algorithm 1.4 into its equivalent postfix expression P. First we
push "(" onto STACK, and then we add ")" to the end of Q to obtain:
Q: A + ( B * C - ( D / E ↑ F ) * G ) * H )

Figure 1.16 shows the status of STACK and of the string P as each element of Q is
scanned.
Symbol Scanned STACK Expression P
(1) A ( A
(2) + ( + A
(3) ( (+( A
(4) B (+( AB
(5) * (+(* AB
(6) C (+(* A B C
(7) - (+(- ABC*
(8) ( (+(-( ABC*
(9) D (+(-( A B C * D
(10) / (+(-(/ A B C * D
(11) E (+(-(/ ABC*DE
(12) ↑ (+(-(/↑ ABC*DE
(13) F (+(-(/↑ ABC*DEF
(14) ) (+(- ABC*DEF↑/
(15) * (+(-* ABC*DEF↑/
(16) G (+(-* ABC*DEF↑/G
(17) ) (+ ABC*DEF↑/G*-
(18) * (+* ABC*DEF↑/G*-
(19) H (+* ABC*DEF↑/G*-H
(20) ) ABC*DEF↑/G*-H*+
Figure 1.16
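Algorithm 1.4 can also be sketched in C for single-letter operands, writing ^ for the
exponentiation operator. Like the algorithm above, the sketch pops operators of equal or
higher precedence (Step 5a); the names prec and polish are our own.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Precedence levels; "(" gets 0 so it stays on STACK until ")" arrives. */
int prec(char c)
{
    switch (c) {
        case '^': return 3;
        case '*': case '/': return 2;
        case '+': case '-': return 1;
        default:  return 0;
    }
}

/* Algorithm 1.4: transform infix q into its postfix equivalent p. */
void polish(const char *q, char *p)
{
    char stack[100];
    int top = 0, j = 0;
    size_t len = strlen(q);
    stack[top++] = '(';                    /* Step 1: push "(" onto STACK */
    for (size_t i = 0; i <= len; i++) {
        char c = (i < len) ? q[i] : ')';   /* Step 1: add ")" to the end of Q */
        if (c == ' ')
            continue;                      /* ignore blanks */
        if (isalnum((unsigned char)c)) {
            p[j++] = c;                    /* Step 3: operand, add to P */
        } else if (c == '(') {
            stack[top++] = c;              /* Step 4: push left parenthesis */
        } else if (c == ')') {
            while (stack[top - 1] != '(')
                p[j++] = stack[--top];     /* Step 6a: pop until "(" */
            --top;                         /* Step 6b: remove the "(" */
        } else {
            while (prec(stack[top - 1]) >= prec(c))
                p[j++] = stack[--top];     /* Step 5a: pop >= precedence */
            stack[top++] = c;              /* Step 5b: add (x) to STACK */
        }
    }
    p[j] = '\0';
}

int main(void)
{
    char p[100];
    polish("A+(B*C-(D/E^F)*G)*H", p);      /* Example 1.20 */
    printf("%s\n", p);                     /* prints ABC*DEF^/G*-H*+ */
    return 0;
}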

Quick Sort, An application of Stacks


Let A be a list of n data items. "Sorting A" refers to the operation of rearranging the
elements of A so that they are in some logical order, such as numerically ordered when A
contains numerical data, or alphabetically ordered when A contains character data. We
discuss the subject of sorting, including various sorting algorithms, in Unit II.

Thus, stacks are frequently used in the evaluation of arithmetic expressions written in
infix notation. Polish notations are evaluated by stacks. Conversions of different notations
(Prefix, Postfix, Infix) into one another are performed using stacks. Stacks are also widely
used inside the computer when recursive functions are called. We discuss this in the next
section.

RECURSION

Recursion is an important concept in computer science. Many algorithms can be best
described in terms of recursion. Let us discuss how recursion may be a useful tool in
developing algorithms for specific problems. Suppose P is a procedure containing either
a Call statement to itself or a Call statement to a second procedure that may eventually
result in a Call statement back to the original procedure P. Then P is called a recursive
procedure. A recursive procedure must have the following two properties:
1) There must be certain criteria, called base criteria, for which the procedure
does not call itself.
2) Each time the procedure does call itself (directly or indirectly); it must be
closer to the base criteria.
A recursive procedure with these two properties is said to be well-defined. Similarly, a
function is said to be recursively defined if the function definition refers to itself.
The following examples should help us to clarify these ideas.
Factorial Function
The product of the positive integers from 1 to n, inclusive, is called "n factorial" and is
usually denoted by n!:
n! = 1 . 2 . 3 . . . (n - 2) (n - 1) n
It is also defined that 0! = 1. Thus we have
5! = 1 . 2 . 3 . 4 . 5 = 120 and 6! = 1 . 2 . 3 . 4 . 5 . 6 = 720
Observe that 6! = 6 . 5!. This is true for every positive integer n; that is, n! = n . (n - 1)!.
Accordingly, the factorial
function may also be defined as follows:
Definition: (Factorial Function)
(a) If n = 0, then n!= 1.
(b) If n > 0, then n! = n . (n-1)!
Example 1.21

Let us calculate 4! using the recursive definition. This calculation requires the following
nine steps:
(1) 4! = 4 . 3!
(2) 3! = 3 . 2!
(3) 2! = 2 . 1!
(4) 1! = 1 . 0!
(5) 0! = 1
(6) 1! = 1 . 1 = 1
(7) 2! = 2 . 1 = 2
(8) 3! = 3 . 2 = 6
(9) 4! = 4 . 6 = 24
The following Procedure 1.3 calculates n! and returns the value in the variable FACT.
Procedure 1.3: FACTORIAL(FACT, N)
1. If N = 0, then: Set FACT := 1, and Return.
2. Call FACTORIAL (FACT, N-1).
3. Set FACT := N * FACT.
4. Return.
The above procedure is a recursive procedure, since it contains a call statement to itself.
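Procedure 1.3 translates directly into C. The recursion is well-defined: n == 0 is the
base criterion, and each call moves closer to it.

#include <stdio.h>

/* Procedure 1.3: compute n! recursively. */
long factorial(int n)
{
    if (n == 0)
        return 1;                      /* base criterion: 0! = 1 */
    return n * factorial(n - 1);       /* FACT := N * (N-1)! */
}

int main(void)
{
    printf("%ld\n", factorial(4));     /* prints 24, as in Example 1.21 */
    return 0;
}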
IMPLEMENTATION OF RECURSIVE PROCEDURES BY STACKS

Let us discuss how stacks may be used to implement recursive procedures. Recall
that a subprogram can contain both parameters and local variables.

The parameters are the variables which receive values from objects in the calling
program, called arguments, and which transmit values back to the calling program.
Besides the parameters and local variables, the subprogram must also keep track of the
return address in the calling program. This return address is essential, since control must
be transferred back to its proper place in the calling program. At the time that the
subprogram is finished executing and control is transferred back to the calling program,
the values of the local variables and the return address are no longer needed.

If our subprogram is a recursive program then each level of execution of the subprogram
may contain different values for the parameters and local variables and for the return
address. Furthermore, if the recursive program does call itself then these current values
must be saved, since they will be used again when the program is reactivated.

Translation of a Recursive Procedure into a Non-recursive Procedure


Suppose P is a recursive procedure. We assume that a recursive call to P comes only
from the procedure P itself. The translation of the recursive procedure P into a
non-recursive procedure works as follows. First of all, we define:
1. A stack STPAR for each parameter PAR
2. A stack STVAR for each local variable VAR
3. A local variable ADD and a stack STADD to hold return addresses
Each time there is a recursive call to P, the current values of the parameters and local
variables are pushed onto the corresponding stacks for future processing, and each time
there is a recursive return to P, these values are restored from the stacks. The handling
of the return address is done as follows.

Suppose the procedure P contains a recursive Call P in Step K. Then there are two return
addresses associated with the execution of this Step K:
1. There is the current return address of the procedure P, which will be used
when the current level of execution of P is finished executing.
2. There is the new return address K+1, which is the address of the step
following the Call P and which will be used to return to the current level of
execution of procedure P
Some texts push the first of these two addresses, the current return address, onto the
return address stack STADD, whereas some texts push the second address, the new return
address K+1, onto STADD. We will choose the latter method, since the translation of P
into a non-recursive procedure will then be simpler. This also means, in particular, that an
empty stack STADD will indicate a return to the main program that initially called the
recursive procedure P.
The algorithm 1.5 which translates the recursive procedure P into a non-recursive
procedure follows. It consists of three parts: (1) preparation, (2) translating each recursive
Call P in procedure P and (3) translating each Return in procedure P.
Algorithm 1.5:
(1) Preparation.
a) Define a stack STPAR for each parameter PAR, a stack STVAR for each
local variable VAR, and a local variable ADD and a stack STADD to
hold return addresses.
b) Set TOP := NULL.
(2) Translation of "Step K. Call P."
a) Push the current values of the parameters and local variables onto the
appropriate stacks, and push the new return address [Step] K + 1 onto
STADD.
b) Reset the parameters using the new argument values.
c) Go to Step 1. [The beginning of the procedure P.]
(3) Translation of "Step J. Return."
a) If STADD is empty, then: Return. [Control is returned to the main
program.]
b) Restore the top values of the stacks. That is, set the parameters and local
variables equal to the top values on the stacks, and set ADD equal to the
top value on the stack STADD.
c) Go to Step ADD.

Observe that the translation of "Step K. Call P" does depend on the value of K, but that
the translation of "Step J. Return" does not depend on the value of J. According to that,
we need to translate only one Return statement, for example, by using
Step L. Return.
as above and then replace every other Return statement by
Go to Step L.
This will simplify the translation of the procedure.
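As an illustration of the translation, the following C sketch removes the recursion from the factorial procedure by using an explicit stack for the parameter N (the names and the fixed stack size are our own assumptions; a return-address stack is not needed here because FACTORIAL contains only one recursive call site):

    /* Non-recursive factorial in the spirit of Algorithm 1.5. */
    long factorial_iter(int n)
    {
        int stack[64];             /* STPAR: stack for the parameter N */
        int top = -1;              /* TOP = NULL (empty stack)         */
        long fact;

        while (n > 0)              /* each "Call FACTORIAL(FACT, N-1)" */
            stack[++top] = n--;    /* push current N, recurse on N - 1 */

        fact = 1;                  /* base case: 0! = 1                */
        while (top >= 0)           /* each "Return" restores N and     */
            fact *= stack[top--];  /* performs FACT := N * FACT        */
        return fact;
    }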

Check your Progress 5


1. Stacks are sometimes called FIFO lists. True/False
2. Stack allows Push and Pop from both ends. True/False
3. TOS (Top of the Stack) gives the bottom most element in the stack. True/False
4. Multiple stacks can be implemented using ……………
5. …………… are evaluated by stacks.
6. Stack is used whenever a ………………. function is called.

QUEUES
A queue is a linear structure in which elements may be inserted at one end, called the rear,
and deleted at the other end, called the front. Figure 1.17 pictures a queue of people
waiting at a bus stop. Queues are also called first-in first-out (FIFO) lists. An important
example of a queue in computer science occurs in a timesharing system, in which
programs with the same priority form a queue while waiting to be executed. Analogous to
the stack operations, two basic operations define a queue: insertion at the rear and deletion at the front.

(Figure 1.17)

REPRESENTATION OF QUEUES
Queues may be represented in the computer in various ways, usually by means of one-
way lists or linear arrays. Queues will be maintained by a linear array QUEUE and two
pointer variables: FRONT, containing the location of the front element of the queue; and
REAR, containing the location of the rear element of the queue. The condition FRONT =
NULL will indicate that the queue is empty.

Figure 1.18, indicates the way elements will be deleted from the queue and the way new
elements will be added to the queue.

Figure 1.18: Array Representation of a Queue

Observe that whenever an element is deleted from the queue, the value of FRONT is
increased by 1; this can be implemented by the assignment
FRONT := FRONT + 1
Similarly, whenever an element is added to the queue, the value of REAR is increased by
1; this can be implemented by the assignment
REAR := REAR + 1
Suppose REAR = N but locations at the front of the array are free because of earlier deletions. To insert another element, we assume that QUEUE is circular, that is, that QUEUE[1] comes after QUEUE[N] in the array.
With this assumption, we insert ITEM into the queue by assigning ITEM to QUEUE[1].
Specifically, instead of increasing REAR to N+1, we reset REAR := 1 and then assign
QUEUE[REAR]:= ITEM
Similarly, if FRONT=N and an element of QUEUE is deleted, we reset FRONT=1
instead of increasing FRONT to N+1. Next suppose that our queue contains only one element,
i.e., suppose that
FRONT = REAR ≠ NULL
and suppose that the element is deleted. Then we assign
FRONT: = NULL and REAR:= NULL
to indicate that the queue is empty.
The operation of inserting an element into a queue and the operation of removing an
element from a queue are implemented by the following procedures, called QINSERT
and QDELETE respectively.

Procedure 1.4: QINSERT (QUEUE, N, FRONT, REAR, ITEM)


This procedure inserts an element ITEM into a queue.
1. [Queue already filled?]
If FRONT = 1 and REAR = N, or if FRONT = REAR+1, then:
Write: OVERFLOW, and Return.
2. [Find new value of REAR.]
If FRONT = NULL, then: [Queue initially empty.]
Set FRONT:=1 and REAR:=1.
Else if REAR = N, then:
Set REAR:=1.
Else:
Set REAR := REAR + 1.
[End of If structure.]
3. Set QUEUE[REAR] := ITEM. [This inserts new element.]
4. Return.

Procedure 1.5: QDELETE (QUEUE, N, FRONT, REAR, ITEM)


This procedure deletes an element from a queue and assigns it to the
variable item.
1. [Queue already empty?]
If FRONT = NULL, then: Write: UNDERFLOW, and Return.
2. Set ITEM := QUEUE[FRONT].
3. [Find new value of FRONT.]
If FRONT = REAR, then: [Queue has only one element to start.]
Set FRONT := NULL and REAR := NULL.
Else if FRONT = N, then:
Set FRONT := 1.
Else:
Set FRONT := FRONT+1.
[End of If structure.]
4. Return
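A minimal C sketch of the two procedures on a circular array follows. C arrays are 0-based, so the value -1 plays the role of NULL; the names and the fixed size N are illustrative assumptions:

    #include <stdio.h>
    #define N 8

    int queue[N];
    int front = -1, rear = -1;

    int qinsert(int item)
    {
        if ((front == 0 && rear == N - 1) || front == rear + 1) {
            printf("OVERFLOW\n");
            return 0;
        }
        if (front == -1)            /* queue initially empty */
            front = rear = 0;
        else if (rear == N - 1)     /* wrap around to the start */
            rear = 0;
        else
            rear = rear + 1;
        queue[rear] = item;
        return 1;
    }

    int qdelete(int *item)
    {
        if (front == -1) {
            printf("UNDERFLOW\n");
            return 0;
        }
        *item = queue[front];
        if (front == rear)          /* queue had only one element */
            front = rear = -1;
        else if (front == N - 1)    /* wrap around to the start */
            front = 0;
        else
            front = front + 1;
        return 1;
    }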

DEQUES

A deque (pronounced either "deck" or "dequeue") is a linear list in which elements can be
added or removed at either end but not in the middle. The term deque is a contraction of
the name double-ended queue.

There are two variations of a deque - namely, an input-restricted deque and an output-
restricted deque - which are intermediate between a deque and a queue. An input-
restricted deque is a deque which allows insertions at only one end of the list but allows
deletions at both ends of the list; and an output-restricted deque is a deque, which allows
deletions at only one end of the list but allows insertions at both ends of the list.

Figure 1.19 pictures two deques, each with 4 elements maintained in an array with N = 8
memory locations. The condition LEFT = NULL will be used to indicate that a deque is
empty.
(Figure 1.19: two deques, each with 4 elements, maintained in an array with N = 8 memory locations. Deque (a), with LEFT = 4 and RIGHT = 7, holds AAA, BBB, CCC, DDD in locations 4 through 7; deque (b) holds YYY, ZZZ, WWW, XXX.)

PRIORITY QUEUES
A priority queue is a collection of elements such that each element has been assigned a
priority and such that the order in which elements are deleted and processed comes from
the following rules:
1) An element of higher priority is processed before any element of lower
priority.
2) Two elements with the same priority are processed according to the order in
which they were added to the queue.
Many applications involving queues require priority queues rather than the simple FIFO
strategy. For elements of the same priority, the FIFO order is used. For example, in a multi-
user system, several programs compete for use of the central processor at one time. The
programs have a priority value associated with them and are held in a priority queue. The
program with the highest priority is given first use of the central processor.
Scheduling of jobs within a time-sharing system is another application of queues. In such
a system many users may request processing at a time, and computer time is divided among
these requests. The simplest approach sets up one queue that stores all requests for
processing. The computer processes the request at the front of the queue and finishes it
before starting on the next. The same approach is also used when several users want to use
the same output device, say a printer.
In a time sharing system, another common approach used is to process a job only for a
specified maximum length of time. If the program is fully processed within that time,
then the computer goes on to the next process. If the program is not completely
processed within the specified time, the intermediate values are stored and the remaining
part of the program is put back on the queue. This approach is useful in handling a mix of
long and short jobs.
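The two rules above can be realized in many ways; the following C sketch (our own illustration, not a construction from the text) stores each element with its priority and an insertion ticket, and deletes by scanning for the highest priority, breaking ties in FIFO order:

    #define MAX 100

    struct pqent { int item; int prio; long seq; };
    static struct pqent pq[MAX];
    static int count = 0;
    static long ticket = 0;

    int pq_insert(int item, int prio)
    {
        if (count == MAX) return 0;            /* overflow */
        pq[count].item = item;
        pq[count].prio = prio;
        pq[count].seq  = ticket++;             /* records FIFO order */
        count++;
        return 1;
    }

    int pq_delete(int *item)
    {
        if (count == 0) return 0;              /* underflow */
        int best = 0;
        for (int i = 1; i < count; i++)        /* highest priority wins;  */
            if (pq[i].prio > pq[best].prio ||  /* earlier ticket on ties  */
                (pq[i].prio == pq[best].prio && pq[i].seq < pq[best].seq))
                best = i;
        *item = pq[best].item;
        pq[best] = pq[--count];                /* remove by replacement */
        return 1;
    }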
Unit IV

Tree
TREES

It is a non-linear data structure. The tree is used to represent the hierarchical relationship
between the data items.
Binary Trees
A binary tree T is defined as a finite set of elements, called nodes, such that:
(a) T is empty (called the null tree or empty tree), or
(b) T contains a distinguished node R, called the root of T, and the remaining
nodes of T form an ordered pair of disjoint binary trees T1 and T2.
If T does contain a root R, then the two trees T1 and T2 are called, respectively, the left
and right subtrees of R. If T1 is nonempty, then its root is called the left successor of R;
similarly, if T2 is nonempty, then its root is called the right successor of R.
A binary tree T is frequently presented by means of a diagram. The diagram in Fig. 4.17
represents a binary tree T as follows.
(i) T consists of 11 nodes, represented by the letter A through L, excluding I.
(ii) The root of T is the node A at the top of the diagram.
(iii) A left-downward slanted line from a node N indicates a left successor of N,
and a right-downward slanted line from N indicates a right successor of N.
Observe that:
(a) B is a left successor and C is a right successor of the node A.
(b) The left subtree of the root A consists of the nodes B,D,E and F,
and the right subtree of A consists of the nodes C,G,H,J,K and L.

Any node N in a binary tree T has either 0, 1 or 2 successors. The nodes A, B, C and H have
two successors, the nodes E and J have only one successor, and the nodes D, F, G, L and K
have no successors. The nodes with no successors are called terminal nodes.
Fig. 4.17

Expression Tree
Consider any algebraic expression E involving only binary operations, such as
E = (a - b) / ((c * d) + e)
E can be represented by means of the binary tree T pictured in Fig 4.18. That is, each
operation in E appears as an "internal" node in T whose left and right subtrees correspond
to the operands of the operation, and each variable or constant in E appears as an "external" node. For example:
1. In the expression E, the operands of + are c * d and e.
2. In the tree T, the subtrees of the node + correspond to the subexpressions c * d
and e.
Every algebraic expression will correspond to a unique tree, and vice versa.

Fig. 4.18

Terminology
Terminology describing family relationships is frequently used to describe
relationships between the nodes of a tree T. Suppose N is a node in T with left successor
S1 and the right successor S2. Then N is called the parent (or father) of S1 and S2.
Analogously, S1 is called the left child (or son) of N, and S2 is called the right child (or
son) of N. Furthermore, S1 and S2 are said to be siblings (or brothers). Every node N in a
binary tree T, except the root, has a unique parent, called the predecessor of N.
The terms descendant and ancestor have their usual meaning. That is, a node L is
called a descendant of a node N (and N is called an ancestor of L) if there is a succession
of children from N to L. In particular, L is called a left or right descendant of N according
to whether L belongs to the left or right subtree of N.
The line drawn from a node N of T to a successor is called an edge, and a
sequence of consecutive edges is called a path. A terminal node is called a leaf, and a
path ending in a leaf is called a branch.
Each node in a binary tree T is assigned a level number, as follows. The root R of the tree
T is assigned the level number 0, and every other node is assigned a level number which
is one more than the level number of its parent. Those nodes with the same level number
are said to belong to the same generation.
The depth(or height) of a tree T is the maximum number of nodes in a branch of
T. This turns out to be 1 more than the largest level number of T. The tree T in Fig. 4.17
has depth 4.

Complete Binary Trees


Consider any binary tree T. Each node of T can have at most two children.
Accordingly, one can show that level r of T can have at most 2^r nodes; e.g., level 0 has at
most 1 node (the root), level 1 at most 2 nodes and level 2 at most 4 nodes.
The tree T is said to be complete if all its levels, except possibly the last, have the
maximum number of possible nodes, and if all the nodes at the last level appear as far left
as possible. The complete binary tree appears as in Fig.4.19.

Fig. 4.19 Complete Binary Tree

Extended Binary Trees: 2-Trees


A binary tree T is said to be a 2-tree or an extended binary tree if each node N has
either 0 or 2 children. In such a case, the nodes with 2 children are called internal nodes,
and the nodes with 0 children are called external nodes. Sometimes the nodes are
distinguished in diagrams by using circles for internal nodes and squares for external
nodes.
The term "extended binary tree" comes from the following operation.. Consider
any binary tree T, such as the tree in Fig.4.20(a). Then T may be "converted" into a 2-tree
by replacing each empty subtree by a new node, as pictured in Fig.4.20(b). Observe that
the new tree is, indeed, a 2-tree. Furthermore, the nodes in the original tree T are now the
internal nodes in the extended tree, and the new nodes are the external nodes in the
extended tree.
(a) Binary tree T. (b) Extended 2-tree
Fig. 4.20 Converting a binary tree T into a 2-tree.

An important example of a 2-tree is the tree T corresponding to any algebraic


expression E which uses only binary operations. As illustrated in Fig. 4.18, the variables in
E will appear as the external nodes, and the operations in E will appear as internal nodes.

REPRESENTING BINARY TREES IN MEMORY

Let T be a binary tree. This section discusses the two ways of representing T in memory.
(1) Linked representation of T.
(2) Sequential representation of T.
The main requirement of any representation of T is that one should have direct
access to the root R of T and, given any node N of T, one should have direct access to the
children of N.

Linked Representation of Binary Trees

Consider a binary tree T. Each node N of T will correspond to a location K such that:

1. INFO[K] contains the data at the node N.


2. LEFT[K] contains the location of the left child of node N.
3. RIGHT[K] contains the location of the right child of node N.

Furthermore, ROOT will contain the location of the root R of T. If any subtree is empty,
then the corresponding pointer will contain the null value; if the tree T itself is empty,
then ROOT will contain the null value.
Fig. 4.21 Linked Representation of tree of fig. 4.17
Sequential Representation of Binary Trees
Suppose T is a binary tree that is complete or nearly complete. Then there is an
efficient way of maintaining T in memory called the sequential representation of T. This
representation uses only a single linear array TREE as follows:
(1) The root R of T is stored in TREE[1].
(2) If a node N occupies TREE[K] then its left child is stored in TREE[2*K]
and its right child is stored in TREE[2*K+1].
NULL is used to indicate an empty subtree. TREE[1] = NULL indicates that the tree is
empty.
The sequential representation of a tree with depth d will require an array with
approximately 2^(d+1) elements. The sequential representation is usually inefficient unless
the binary tree T is complete or nearly complete.
Location:  1  2  3  4  5     6  7  8     9     10    11    12    13  14    15
TREE:      A  B  D  C  null  E  G  null  null  null  null  null  F   null  null

Fig. 4.22 Sequential representation of Binary Tree of fig. 4.17
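The index arithmetic behind the sequential representation can be summed up in three one-line C helpers (assuming, as above, 1-based locations with the root in TREE[1]):

    int left_child(int k)  { return 2 * k;     }   /* TREE[2*K]        */
    int right_child(int k) { return 2 * k + 1; }   /* TREE[2*K + 1]    */
    int parent(int k)      { return k / 2;     }   /* integer division */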

TRAVERSING BINARY TREES


There are three standard ways of traversing a binary tree T with root R. They are
1. Preorder,
2. Inorder and
3. Postorder.

Preorder: (NLR)
1. Process the root R.
2. Traverse the left subtree of R in preorder.
3. Traverse the right subtree of R in preorder.

Inorder: (LNR)
(1) Traverse the left subtree of R in inorder.
(2) Process the root R.
(3) Traverse the right subtree of R in inorder.

PostOrder: (LRN)

(1) Traverse the left subtree of R in postorder.


(2) Traverse the right subtree of R in postorder.
(3) Process the root R.

Consider the binary tree T in Fig. 4.17. The three traversals of T are given below.
Preorder : A, B, C, D, E, F, G.
Inorder : C, B, A, E, F, D, G.
Postorder : C, B, F, E, G, D, A.
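The three definitions translate directly into C. In the following sketch the node type and the PROCESS step (printing the label) are our own illustrative choices:

    #include <stdio.h>

    struct node {
        char info;
        struct node *left, *right;
    };

    void preorder(struct node *r)      /* NLR */
    {
        if (r == NULL) return;
        printf("%c ", r->info);
        preorder(r->left);
        preorder(r->right);
    }

    void inorder(struct node *r)       /* LNR */
    {
        if (r == NULL) return;
        inorder(r->left);
        printf("%c ", r->info);
        inorder(r->right);
    }

    void postorder(struct node *r)     /* LRN */
    {
        if (r == NULL) return;
        postorder(r->left);
        postorder(r->right);
        printf("%c ", r->info);
    }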
TRAVERSAL ALGORITHMS USING STACKS

Suppose a binary tree T is maintained in memory by some linked representation.


TREE (INFO,LEFT,RIGHT,ROOT)
This section discusses the implementation of the three standard traversals of T, which
were defined recursively in the last section, by means of nonrecursive procedures using stacks.

Preorder Traversal
The preorder traversal algorithm uses a variable PTR (pointer) which will contain
the location of the node N currently being scanned. L(N) denotes the left child of node N
and R(N) denotes the right child. The algorithm also uses an array STACK, which will
hold the addresses of nodes for future processing.

Algorithm 4.10: Initially push NULL onto STACK and then set PTR := ROOT. Then
repeat
the following steps until NULL is popped from STACK.

(a) Proceed down the left-most path rooted at PTR, processing each node N on the
path and pushing each right child R(N), if any, onto STACK. The traversing
ends after a node N with no left child L(N) is processed. (Thus PTR is updated
using the assignment PTR := LEFT[PTR] and the traversing stops when
LEFT[PTR] = NULL.)

(b) [Backtracking] Pop and assign to PTR the top element on STACK. If PTR #
NULL, then return to step (a); else Exit.
(We note that the initial element NULL on STACK is used as a sentinel.)
We simulate the algorithm in example 4.11.

Example 4.11
Consider the binary tree T in. Fig 4.17. We simulate the above algorithm with T,
showing the contents of STACK at each step.

1. Initially push NULL onto STACK:


STACK : 0.
Then set PTR:= A, the root of T.
2. Proceed down the left-most path rooted at PTR = A as follows:
(i) Process A and push its right child D onto STACK:
STACK: 0, D.
(ii) Process B (There is no right child)
(iii) Process C (There is no right child)
No other node is processed, since C has no left child.
3. [Backtracking] Pop the top element D from STACK, and set PTR := D. This
leaves:
STACK: 0.
Since PTR # NULL, return to Step(a) of the algorithm.
4. Proceed down the left-most path rooted at PTR := D as follows:
(iv) Process D and push its right child G onto STACK:
STACK: 0, G.
(v) Process E and push its right child F onto STACK:
STACK: 0, G, F.
No other node is processed, since E has no left child.
5. [Backtracking] Pop F from STACK, and set PTR := F. This leaves:
STACK: 0, G.
Since PTR # NULL, return to Step(a) of the algorithm.

6. Proceed down the left-most path rooted at PTR := F as follows:

(vi) Process F. (There is no right child.)
No other node is processed, since F has no left child.

7. [Backtracking] Pop the top element G from STACK, and set PTR := G. This
leaves:
STACK: 0.
Since PTR # NULL, return to Step(a) of the algorithm.

8. Proceed down the left-most path rooted at PTR := G as follows:

(vii) Process G. (There is no right child.)
9. [Backtracking] Pop the top element NULL from STACK,
Since NULL is popped, the algorithm is completed.

As seen from Steps 2, 4, 6 and 8 the nodes are processed in the order A, B, C, D, E, F, G.
A formal presentation of our preorder traversal algorithm follows:

Algorithm 4.11 PREORD(INFO,LEFT,RIGHT,ROOT)


A binary tree T is in memory. The algorithm does a preorder traversal
of T, applying an operation PROCESS to each of its nodes. An array
STACK is used to temporarily hold the addresses of nodes.
1. [Initially push NULL onto STACK, and initialize PTR.]
Set TOP := 1, STACK[1] := NULL and PTR := ROOT.
2. Repeat Steps 3 to 5 while PTR # NULL:
3. Apply PROCESS to INFO[PTR]
4. [Right child?]
If RIGHT[PTR] # NULL, then: [Push on STACK]
Set TOP := TOP+1, and
STACK[TOP] := RIGHT[PTR]
[End of If structure]
5. [Left child?]
If LEFT[PTR] # NULL, then:
Set PTR := LEFT[PTR]
Else: [Pop from STACK]
Set PTR :=STACK[TOP] and
TOP:=TOP -1 [End of If structure]
[End of Step 2 loop]
6. Exit.
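A C sketch of Algorithm 4.11, reusing the struct node type (and printf) from the recursive-traversal sketch above; the fixed-size pointer stack is an illustrative simplification, and the NULL stored first plays the role of the sentinel:

    void preorder_stack(struct node *root)
    {
        struct node *stack[100];
        int top = 0;
        struct node *ptr = root;

        stack[top] = NULL;                 /* Step 1: push the sentinel */
        while (ptr != NULL) {
            printf("%c ", ptr->info);      /* Step 3: PROCESS the node  */
            if (ptr->right != NULL)
                stack[++top] = ptr->right; /* Step 4: save right child  */
            if (ptr->left != NULL)
                ptr = ptr->left;           /* Step 5: go left ...       */
            else
                ptr = stack[top--];        /* ... or backtrack          */
        }
    }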

Inorder Traversal
The inorder traversal algorithm uses a variable PTR (pointer) which will contain
the location of the node N currently being scanned. L(N) denotes the left child of node N
and R(N) denotes the right child. The algorithm also uses an array STACK, which will
hold the addresses of nodes for future processing.

Algorithm 4.12:
Initially push NULL onto STACK and then set PTR := ROOT. Then repeat
the following steps until NULL is popped from STACK.

(a) Proceed down the left-most path rooted at PTR, pushing each node N
onto STACK and stopping when a node N with no left child is pushed
onto STACK.

(b) [Backtracking] Pop and process the nodes on STACK. If NULL is
popped, then Exit. If a node N with a right child R(N) is processed, set
PTR := R(N) and return to step (a).

(We note that the initial element NULL on STACK is used as a sentinel.)
We simulate the algorithm in example 4.12.
Example 4.12
Consider the binary tree T in. Fig 4.17. We simulate the above algorithm with T,
showing the contents of STACK at each step.
1. Initially push NULL onto STACK:
STACK: 0.
Then set PTR := A, the root of T.
2. Proceed down the left-most path rooted at PTR = A, pushing the nodes A, B and C
onto STACK:
STACK: 0, A, B, C.
(No other node is pushed onto STACK, since C has no left child)
3. [Backtracking] The nodes C, B and A are popped and processed, leaving:
STACK: 0.
(We stop the processing at A, since A has a right child.)
Then set PTR := D, the right child of A.
4. Proceed down the left-most path rooted at PTR := D, pushing the nodes D and E onto
STACK:
STACK: 0. D, E.
(No other node is pushed onto STACK, since E has no left child.)
5. [Backtracking] The node E is popped and processed, leaving:
STACK: 0, D.
(We stop the processing at E, since E has a right child.)
Then set PTR := F, the right child of E.
6. Proceed down the left-most path rooted at PTR := F, pushing the node F onto STACK:
(No other node is pushed onto STACK, since F has no left child.)
STACK: 0, D, F.
7. [Backtracking] The nodes F and D are popped and processed, leaving:
STACK: 0.
(We stop the processing at D, since D has a right child.) Then set PTR := G, the right child of D.
8. Proceed down the left-most path rooted at PTR := G, pushing the node G onto
STACK:
STACK: 0, G.
9. [Backtracking] Node G is popped and processed. Since G has no right child, the next
element, NULL, is popped from STACK.
Since NULL is popped, the algorithm is completed.

As seen from Steps 3, 5, 7 and 9 the nodes are processed in the order C, B, A, E, F, D, G.

A formal presentation of our inorder traversal algorithm follows:

Algorithm 4.13 INORD(INFO,LEFT,RIGHT,ROOT)


A binary tree T is in memory. The algorithm does an inorder traversal of T,
applying an operation PROCESS to each of its nodes. An array STACK is
used to temporarily hold the addresses of nodes.
1. [Initially push NULL onto STACK, and initialize PTR.]
Set TOP := 1, STACK[1] := NULL and PTR := ROOT.
2. Repeat while PTR # NULL: [Pushes left-most path onto STACK]
(a) Set TOP := TOP + 1 and STACK[TOP] := PTR. [Saves node.]
(b) Set PTR := LEFT[PTR]. [Updates PTR.]
[End of loop.]
3. Set PTR := STACK[TOP] and TOP := TOP - 1.
[Pops node from STACK.]
4. Repeat Steps 5 to 7 while PTR # NULL: [Backtracking.]
5. Apply PROCESS to INFO[PTR].
6. [Right child?] If RIGHT[PTR] # NULL, then:
(a) Set PTR := RIGHT[PTR].
(b) Go to Step 3
[End of If structure]
7. Set PTR := STACK[TOP] and TOP := TOP - 1. [Pops node.]
[End of Step 4 loop.]
8. Exit.
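A C sketch of Algorithm 4.13, again reusing the struct node type and printf from the earlier traversal sketch (the stack size is illustrative):

    void inorder_stack(struct node *root)
    {
        struct node *stack[100];
        int top = 0;
        struct node *ptr = root;

        stack[top] = NULL;                 /* sentinel */
        for (;;) {
            while (ptr != NULL) {          /* Step 2: push left-most path */
                stack[++top] = ptr;
                ptr = ptr->left;
            }
            ptr = stack[top--];            /* Step 3: pop a node          */
            if (ptr == NULL) break;        /* sentinel popped: done       */
            printf("%c ", ptr->info);      /* Step 5: PROCESS the node    */
            ptr = ptr->right;              /* Step 6: visit right subtree */
        }
    }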

Postorder Traversal
The postorder traversal algorithm uses a variable PTR (pointer) which will contain
the location of the node N currently being scanned. L(N) denotes the left child of node N
and R(N) denotes the right child. The algorithm also uses an array STACK, which will
hold the addresses of nodes for future processing.
Algorithm 4.14: Initially push NULL onto STACK and then set PTR := ROOT.
Then repeat the following steps until NULL is popped from
STACK.

(a) Proceed down the left-most path rooted at PTR, at each node N of the
path, push N onto STACK and, if N has a right child R(N), push –R(N)
onto STACK.

(b) [Backtracking] Pop and process the positive nodes on STACK. If NULL is
popped, then Exit. If a negative node is popped, that is, if PTR = -N for
some node N, set PTR = N (by assigning PTR := -PTR) and return to step
(a).

(We note that the initial element NULL on STACK is used as a sentinel.)
We simulate the algorithm in example 4.13

Example 4.13
Consider the binary tree T in. Fig4.17 We simulate the above algorithm with T,
showing the contents of STACK at each step.
1. Initially push NULL onto STACK:
STACK: 0.
Then set PTR := A, the root of T.
2. Proceed down the left-most path rooted at PTR = A, pushing the nodes A, B and C
onto STACK. Furthermore, since A has a right child D, push -D onto STACK after A but
before B. This leaves:
STACK: 0, A, -D, B, C.
3. [Backtracking] Pop and process C and B. Since –D is negative, only pop –D.
This leaves
STACK: 0, A.
Now PTR = -D. Reset PTR = D and return to step (a).
4. Proceed down the left-most path rooted at PTR = D. First push D onto STACK. Since
D has a right child G, push -G onto STACK after D.
Then push E and –F (right child of E) onto STACK. This leaves:
STACK: 0, A, D, -G, E, -F.
5. [Backtracking.] Pop –F. Since –F is negative, only pop –F. This leaves:
STACK: 0, A, D, -G, E.
Now PTR = -F. Reset PTR = F and return to Step(a).
6. Proceed down the left-most path rooted at PTR = F. Now, only F is pushed onto
STACK. This yields:
STACK: 0, A,D, -G, E, F.
7. [Backtracking.] Pop and process F and E, but only pop -G. This leaves:
STACK: 0, A, D.
Now PTR = -G. Reset PTR = G and return to Step(a).
8. Proceed down the left-most path rooted at PTR = G. Now, only G is pushed onto
STACK. This yields:
STACK: 0, A, D, G.
9. [Backtracking.] Pop and process G, D and A.
When NULL is popped, STACK is empty and the algorithm is completed.

As seen from Steps 3, 7 and 9 the nodes are processed in the order C, B, F, E, G, D, A .

A formal presentation of our postorder traversal algorithm follows:

Algorithm 4.15: POSTORD(INFO,LEFT,RIGHT,ROOT)


A binary tree T is in memory. This algorithm does a postorder traversal of
T, applying an operation PROCESS to each of its nodes. An array STACK
is used to temporarily hold the addresses of nodes.

1. [Push NULL onto STACK and initialize PTR.]


Set TOP := 1, STACK[1] := NULL and PTR := ROOT.
2. [Push left-most path onto STACK]
Repeat Steps 3 to 5 while PTR # NULL:
3. [Pushes PTR on STACK]
Set TOP := TOP + 1 and STACK[TOP] := PTR.
4. If RIGHT[PTR] # NULL, then :[Push on STACK.]
Set TOP := TOP + 1 and STACK[TOP] := -RIGHT[PTR].
[End of If structure]
5. Set PTR := LEFT[PTR]. [Updates pointer PTR.]
[End of Step 2 loop.]
6. [Pops node from STACK.]
Set PTR := STACK[TOP] and TOP := TOP - 1.
7. Repeat while PTR > 0:
(a) Apply PROCESS to INFO[PTR].
(b) [Pops node from STACK.]
Set PTR := STACK[TOP] and TOP:= TOP-1.
[End of loop]

8. If PTR < 0, then:


(a) Set PTR := -PTR.
(b) Go to Step 2.
[End of If structure]
9. Exit.
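A C sketch of Algorithm 4.15. To keep the text's negative-address trick, the tree is held in parallel arrays indexed from 1, with 0 standing for NULL, so a right child K can be pushed as -K (the array sizes and names are illustrative):

    #include <stdio.h>
    #define NIL 0

    char INFO[100];
    int  LEFT[100], RIGHT[100];

    void postorder_stack(int root)
    {
        int stack[200];
        int top = 0, ptr = root;

        stack[top] = NIL;                       /* sentinel */
        do {
            while (ptr != NIL) {                /* Steps 2-5: push path  */
                stack[++top] = ptr;
                if (RIGHT[ptr] != NIL)
                    stack[++top] = -RIGHT[ptr]; /* marked right child    */
                ptr = LEFT[ptr];
            }
            ptr = stack[top--];                 /* Step 6: pop a node    */
            while (ptr > 0) {                   /* Step 7: process the   */
                printf("%c ", INFO[ptr]);       /* positive nodes        */
                ptr = stack[top--];
            }
            if (ptr < 0)
                ptr = -ptr;                     /* Step 8: revisit right */
        } while (ptr != NIL);
    }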

HEADER NODES; THREADS


Header Nodes
Suppose a binary tree T is maintained in memory by means of a linked
representation. Sometimes an extra, special node, called a header node, is added to the
beginning of T. When this extra node is used, the tree pointer variable, which we will call
HEAD (instead of ROOT), will point to the header node, and the left pointer of the
header node will point to the root of T. Figure 4.23 shows a schematic picture of the
binary tree in Fig.4.17 that uses a linked representation with a header node.
Suppose a binary tree T is empty. Then T will still contain a header node, but the
left pointer of the header node will contain the null value. Thus the condition
LEFT[HEAD] = NULL will indicate an empty tree.
Another variation of the above representation of a binary tree T is to use the header node
as a sentinel. That is, if a node has an empty subtree, then the pointer field for the subtree
will contain the address of the header node instead of the null value. The condition
LEFT[HEAD] = HEAD will then indicate an empty tree.

Fig. 4.23 Header Binary Tree of tree fig. 4.17


Threads; Inorder Threading
Consider the linked representation of a binary tree T. Approximately half of the
entries in the pointer fields LEFT and RIGHT will contain null elements. This space may
be more efficiently used by replacing the null entries by some other type of information.
Specifically, we will replace certain null entries by special pointers which point to nodes
higher in the tree. These special pointers are called threads, and binary trees with such
pointers are called threaded trees.
The threads in a threaded tree must be distinguished in some way from ordinary
pointers. The threads in a diagram of a threaded tree are usually indicated by dotted lines.
In computer memory, an extra 1-bit TAG field may be used to distinguish threads from
ordinary pointers, or, alternatively threads may be denoted by negative integers when
ordinary pointers are denoted by positive integers.
There are many ways to thread a binary tree T, but each threading will correspond
to a particular traversal of T. Also, one may choose a one-way threading or a two-way
threading. Unless otherwise stated, our threading will correspond to the inorder traversal
of T.
In the one-way threading of T, a thread will appear in the right field of a node and
will point to the next node in the inorder traversal of T.
In the two-way threading of T, a thread will also appear in the LEFT field of a
node and will point to the preceding node in the inorder traversal of T. Furthermore, the
left pointer of the first node and the right pointer of the last node (in the inorder traversal
of T) will contain the null value when T does not have a header node, but will point to the
header node when T does have a header node.
The following figure shows threaded binary trees of tree T of figure 4.17

Fig. 4.24 Threaded binary trees of fig. 4.17

Check your progress 5


1. If a tree has 45 edges, how many vertices does it have?
2. What is binary tree?
3. Define full binary tree
4. Define complete binary tree.
5. List out the ways of traversing a binary tree.

SUMMARY

Graphs are data structures that consist of a set of vertices and a set of edges that connect
the vertices. A graph where the edges are directed is called directed graph. Otherwise, it
is called an undirected graph. Graphs are represented by adjacency lists and adjacency
matrices. Graphs can be used to represent a road network where the edges are weighted
as the distance between the cities.
SOLUTIONS/ANSWERS

Check Your Progress 1


1. A graph G consists of two things:
(i) A set V of elements called nodes (or points or vertices)
(ii) A set E of edges such that each edge e in E is identified with a unique
(unordered) pair [u, v] of nodes in V, denoted by e = [u, v]
Sometimes we indicate the parts of a graph by writing G = (V, E).
2. Edges
3. Simple
4. Cycle
5. n(n-1)/2
6. Simple Directed
7. A graph where the edges are directed is called directed graph.

Check Your Progress 2


1. Sequential and linked list representation
2. True
3. aij = 1 if vi is adjacent to vj, that is, if there is an edge (vi, vj)
       0 otherwise

4. Pij = 1 if there is a path from vi to vj
       0 otherwise

5. The transitive closure of graph G is defined to be the graph G’ such that G’ has the
same nodes as G and there is an edge (vi, vj) in G’ whenever there is a path from vi
to vj in G.
6. The graph G is said to be strongly connected if, for any pair of nodes u and v in G,
there are both a path from u to v and a path from v to u.

Check Your Progress 3


1. Search, Insert and Delete

Check Your Progress 4


1. Shortest Path
2. True
3. Wij = w(e) if there is an edge e from vi to vj

0 if there is no edge from vi , vj


Check Your Progress 5
1. If a tree has e edges and n vertices, then e=n-1. Hence if a tree has 45 edges, then
it has 46 vertices.
2. A binary tree is a special tree where each non-leaf node can have at most two child
nodes.
3. A binary tree of height h which has 2^h - 1 elements is called a Full Binary Tree.
4. A binary tree whereby if the height is d, then all levels, except possibly level d, are
completely full. If the bottom level is incomplete, then it has all nodes to the left
side. That is, the tree has been filled in level order from left to right.
5. Preorder, inorder and postorder.
Unit V
Graph
INTRODUCTION

In this unit, we will discuss a data structure called Graph. A graph is a general nonlinear
structure with no parent-child relationship between its nodes. Graphs have many
applications in computer science and other fields of science. As we have done with other
data structures, we discuss the representation of graphs in memory and present various
operations and algorithms on them.
The tree is a nonlinear data structure. This structure is mainly used to represent data
containing a hierarchical relationship between elements, e.g. records, family trees and
tables of contents.

OBJECTIVES

After going through this unit, you should be able to


• Know about graphs and related terminologies
• Know about the sequential and linked representations of graphs
• Perform different operations on graphs
• Understand the concept of shortest path algorithm
• Understand the concept of Topological sorting
• Know about trees and related terminologies
• Know about the sequential and linked representations of binary trees
• Perform different techniques for traversing binary trees
• Understand the concept of traversal algorithms

GRAPH THEORY TERMINOLOGY

Graphs and Multigraphs


A graph G consists of two things:
(1) A set V of elements called nodes (or points or vertices)
(2) A set E of edges such that each edge e in E is identified with a unique
(unordered) pair [u, v] of nodes in V, denoted by e = [u, v]
Sometimes we indicate the parts of a graph by writing G = (V, E).
Suppose e = [u, v]. Then the nodes u and v are called the endpoints of e, and u and
v are said to be adjacent nodes or neighbors. The degree of a node u, written deg(u), is the
number of edges containing u. If deg(u) = 0 — that is, if u does not belong to any edge—
then u is called an isolated node.

Path and Cycle


A path P of length n from a node u to a node v is defined as a sequence of n + 1
nodes.
P = (v0, v1, v2, . . . , vn)
such that u = v0; vi-1 is adjacent to vi for i = 1,2, . . ., n and vn = v.
Types of Path
(i) Simple Path
(ii) Cycle Path
(i) Simple Path
Simple path is a path in which first and last vertex are different (V0 ≠ Vn)

(ii) Cycle Path


Cycle path is a path in which the first and last vertex are the same (V0 = Vn). It is also
called a closed path.

Connected Graph
A graph G is said to be connected if there is a path between any two of its nodes.

Proposition 4.1: A graph G is connected if and only if there is a simple path


between any two nodes in G.

Complete Graph
A graph G is said to be complete if every node u in G is adjacent to every other
node v in G.

Tree
A connected graph T without any cycles is called a tree graph or free tree or,
simply, a tree.

Labeled or Weighted Graph


If a weight is assigned to each edge of the graph, then it is called a weighted or
labeled graph.

(Fig 4.1)
The definition of a graph may be generalized by permitting the following:
(1) Multiple edges: Distinct edges e and e' are called multiple edges if they connect
the
same endpoints, that is, if e = [u, v] and e' = [u, v ] .
(2) Loops: An edge e is called a loop if it has identical endpoints, that is, if e = [u,
u].
(3) Finite graph: A multigraph M is said to be finite if it has a finite number of
nodes and a finite number of edges.
Example 4.1
(a) Figure 4-2(a) is a picture of a connected graph with nodes A, B, C, D and E and 7 edges:
[A, B], [B, C], [C, D], [D, E], [A, E], [C, E], [A, C]
There are two simple paths of length 2 from B to E: (B, A, E) and (B, C, E). There is only
one simple path of length 2 from B to D: (B, C, D). We note that (B, A, D) is not a path,
since [A, D] is not an edge. There are two 4-cycles in the graph:
[A, B, C, E, A] and [A, C, D, E, A].
Note that deg(A) = 3, since A belongs to 3 edges. Similarly, deg(C) = 4 and deg(D) = 2.
(b) Figure 4-2(b) is not a graph but a multigraph. The reason is that it has multiple
edges—
e4 = [B, C] and e5 = [B, C]—and it has a loop, e6 = [D, D]. The definition of a graph usually
does not allow either multiple edges or loops.

(c) Figure 4.2(c) is a tree graph with m = 6 nodes and, consequently, m - 1 = 5 edges. The
reader can verify that there is a unique simple path between any two nodes of the tree
graph. Figure 4.2(d) is the same graph as in Fig. 4.2(a), except that now the graph is
weighted.

Fig 4-2

Directed Graphs
A directed graph G, also called a digraph or graph, is the same as a multigraph
except that each edge e in G is assigned a direction, or in other words, each edge e is
identified with an ordered pair (u, v) of nodes in G.
Suppose G is a directed graph with a directed edge e = (u, v). Then e is also called
an arc. Moreover, the following terminology is used:
(1) e begins at u and ends at v.
(2) u is the origin or initial point of e, and v is the destination or terminal point
of e.
(3) u is a predecessor of v, and v is a successor or neighbor of u.
(4) u is adjacent to v, and v is adjacent to u.
Outdegree and Indegree
Indegree: The indegree of a vertex v, written indeg(v), is the number of edges for which v is the head, that is, the number of edges ending at v.
Example 4.2

1 2 3

(Figure 4.3)

Indegree of 1 = 1
Indegree of 2 = 2
Outdegree: The outdegree of a node or vertex v, written outdeg(v), is the number of edges for which v is the tail, that is, the number of edges beginning at v.
Example 4.3
1 2 3

(Figure 4.4)
Outdegree of 1 =1
Outdegree of 2 =2
Source and Sink
A node u is called a source if it has a positive outdegree but zero indegree.
Similarly, u is called a sink if it has a zero outdegree but a positive indegree.

A directed graph G is said to be connected, or strongly connected, if for each pair


u, v of nodes in G there is a path from u to v and there is also a path from v to u. On the
other hand, G is said to be unilaterally connected if for any pair u, v of nodes in G there is
a path from u to v or a path from v to u.

Example 4.5

Figure 4.5 shows a directed graph G with 4 nodes and 7 (directed) edges. The edges
e2 and e3 are said to be parallel, since each begins at B and ends at A. The edge e7 is a loop,
since it begins and ends at the same point, B. The sequence P1 = (D, C, B, A) is not a path,
since (C, B) is not an edge—that is, the direction of the edge e5 = (B, C) does not agree
with the direction of the path P1. On the other hand, P2 = (D, B, A) is a path from D to A,
since (D, B) and (B, A) are edges. Thus A is reachable from D. There is no path from C to
any other node, so G is not strongly connected. However, G is unilaterally connected. Note
that indeg(D) = 1 and outdeg(D) = 2. Node C is a sink, since indeg(C) = 2 but outdeg(C) =
0. No node in G is a source.

Fig 4.5

Let T be any nonempty tree graph. Suppose we choose any node R in T. Then T
with this designated node R is called a rooted tree and R is called its root. Recall that
there is a unique simple path from the root R to any other node in T. This defines a
direction to the edges in T, so the rooted tree T may be viewed as a directed graph.
Furthermore, suppose we also order the successors of each node v in T. Then T is called
an ordered rooted tree.
Simple Directed Graph

A directed graph G is said to be simple if G has no parallel edges. A simple graph


G may have loops, but it cannot have more than one loop at a given node.
Check Your Progress 1

1. Define Graph.
2. The degree of node u is the number of ______ containing node u.
3. In a ______ path the first and last vertex are different.
4. In a ______ path the first and last vertex are same.
5. A complete graph with n nodes will have ______ edges.
6. Which graph does not contain parallel edges?
7. What is Directed graph?

REPRESENTATION OF GRAPHS

There are two standard ways of maintaining a graph G in the memory of a computer.
1. The sequential representation
2. The linked representation

SEQUENTIAL REPRESENTATION OF GRAPHS


There are two different sequential representations of a graph. They are
• Adjacency Matrix representation
• Path Matrix representation
ADJACENCY MATRIX REPRESENTATION
Suppose G is a simple directed graph with m nodes, and suppose the nodes of G
have been ordered and are called v1, v2, . . . , vm. Then the adjacency matrix A = (aij) of
the graph G is the m x m matrix defined as follows:

aij =  1 if vi is adjacent to vj, that is, if there is an edge (vi, vj)
       0 otherwise

Suppose G is an undirected graph. Then the adjacency matrix A of G will be a symmetric
matrix, i.e., one in which aij = aji for every i and j.

Drawbacks
1. It may be difficult to insert and delete nodes in G. This is because the size of A
may need to be changed and the nodes may need to be reordered, so there may be many,
many changes in the matrix A.
2. If the number of edges is O(m) or O(m log2 m), then the matrix A will be sparse
(will contain many zeros); hence a great deal of space will be wasted.

Example 4.6
Consider the graph G in Fig. 4.6. Suppose the nodes are stored in memory in a linear
array DATA as follows:
DATA: X, Y, Z, W
Then we assume that the ordering of the nodes in G is as follows: v1 = X, v2 = Y, v3 = Z
and v4 = W. The adjacency matrix A of G is as follows:
The number of 1's in A is equal to the number of edges in G.

Fig. 4.6
Consider the powers A, A^2, A^3, . . . of the adjacency matrix A of a graph G.
Let aK(i, j) = the ij entry in the matrix A^K.
Observe that a1(i, j) = aij gives the number of paths of length 1 from node vi to node vj.
One can show that a2(i, j) gives the number of paths of length 2 from vi to vj.

Proposition 4.2: Let A be the adjacency matrix of a graph G. Then aK(i, j), the ij entry
in the matrix A^K, gives the number of paths of length K from vi to vj.
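A small C sketch of this fact (the matrix size M is an illustrative assumption): the (i, j) entry of the product A*A counts the paths of length 2 from vi to vj, one for each intermediate node k.

    #define M 4

    /* A2[i][j] = number of paths of length 2 from node i to node j. */
    void paths_len2(int A[M][M], int A2[M][M])
    {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++) {
                A2[i][j] = 0;
                for (int k = 0; k < M; k++)
                    A2[i][j] += A[i][k] * A[k][j];  /* path i -> k -> j */
            }
    }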
PATH MATRIX REPRESENTATION
Let G be a simple directed graph with m nodes, v1, v2, . . . , vm. The path matrix
of G is the m-square matrix P = (pij) defined as follows:

pij =  1 if there is a path from vi to vj
       0 otherwise

Proposition 4.3: Let A be the adjacency matrix and let P = (pij) be the path matrix of a
digraph G. Then pij = 1 if and only if there is a nonzero number in the ij entry of the
matrix
Bm = A + A^2 + A^3 + … + A^m

Consider the graph G with m = 4 nodes in Fig 4.6. Adding the matrices A, A^2, A^3
and A^4, we obtain the following matrix B4, and, replacing the nonzero entries in B4 by 1,
we obtain the path matrix P of the graph G:
B4 =
    1 0 2 3
    5 0 6 8
    3 0 3 5
    2 0 3 3

and

P =
    1 0 1 1
    1 0 1 1
    1 0 1 1
    1 0 1 1
4.4.2 LINKED LIST REPRESENTATION OF GRAPHS
G is usually represented in memory by a linked representation, also called an adjacency
structure, Consider the graph G in Fig. 4-7(a). The table in Fig. 4.7(b) shows each node in
G followed by its adjacency list, which is its list of adjacent nodes, also called its
successors or neighbors. Figure 4-8 shows a schematic diagram of a linked representation
of G in memory. Specifically, the linked representation will contain two lists (or files), a
node list NODE and an edge list EDGE, as follows.

(a) Node list. Each element in the list NODE will correspond to a node in G, and it will
be a record of the form:
NODE NEXT ADJ

Here NODE will be the name or key value of the node, NEXT will be a pointer to the
next node in the list NODE and ADJ will be a pointer to the first element in the adjacency
list of the node, which is maintained in the list EDGE. The shaded area indicates that
there may be other information in the record, such as the indegree INDEG of the node,
the outdegree OUTDEG of the node, the STATUS of the node during the execution of an
algorithm, and so on. (Alternatively, one may assume that NODE is an array of records
containing fields such as NAME, INDEG, OUTDEG, STATUS, . . . .) The
nodes themselves, as pictured in Fig. 4.8, will be organized as a linked list and hence will
have a pointer variable START for the beginning of the list and a pointer variable
AVAILN for the list of available space.
(b) Edge list. Each element in the list EDGE will correspond to an edge in G, and it will be
a record of the form:
DEST LINK

The field DEST will point to the location in the list NODE of the destination or terminal
node of the edge. The field LINK will link together the edges with the same initial node,
that is, the nodes in the same adjacency list. The shaded area indicates that there may be
other information in the record corresponding to the edge, such as a field EDGE
containing the labeled data of the edge when G is a labeled graph, a field WEIGHT
containing the weight of the edge when G is a weighted graph, and so on. We also need a
pointer variable AVAILE for the list of available space in the list EDGE.
Figure 4.9 shows how the graph G in Fig. 4.7(a) may appear in memory. The choice of 10
locations for the list NODE and 12 locations for the list EDGE is arbitrary.

The linked representation of a graph G that we have been discussing may be denoted by
GRAPH(NODE, NEXT, ADJ, START, AVAILN, DEST, LINK, AVAILE)
The representation may also include an array WEIGHT when G is weighted or may
include an array EDGE when G is a labeled graph.
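A pointer-based C sketch of this adjacency structure (the field names follow the text; the types themselves are our own illustrative choices):

    struct edge {
        struct gnode *dest;   /* DEST: terminal node of the edge       */
        struct edge  *link;   /* LINK: next edge in the adjacency list */
    };

    struct gnode {
        char          name;   /* NODE: key value of the node           */
        struct gnode *next;   /* NEXT: next node in the node list      */
        struct edge  *adj;    /* ADJ: first edge leaving this node     */
    };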
Example 4.7
Suppose Friendly Airways has nine daily flights, as follows:
103 Atlanta to Houston 203 Boston to Denver 305 Chicago to Miami
106 Houston to Atlanta 204 Denver to Boston 308 Miami to Boston
201 Boston to Chicago 301 Denver to Reno 402 Reno to Chicago
Fig 4.11
Clearly, the data may be stored efficiently in a file where each record contains three
fields:
Flight Number, City of Origin, City of Destination

Check Your Progress 2


1. What are two standard way of maintaining graph in memory?
2. Adjacency matrix is a sequential representation of graphs. (True/False)
3. The Adjacency matrix A = (aij) of a directed graph G is ______.
4. The Path matrix of a simple directed graph is ______.
5. Define transitive closure.
6. What is strongly connected Graph?
OPERATIONS ON GRAPHS

This section discusses the operations of searching, inserting and deleting nodes and
edges in the graph G.

Searching in a Graph

Suppose we want to find the location LOC of a node N in a graph G. This


can be accomplished by using Procedure 4.1, as follows:
Call FIND(NODE, NEXT, START, N, LOC)
That is, this Call statement searches the list NODE for the node N.

Procedure 4.1: FIND(INFO, LINK, START, ITEM, LOC)


Finds the location LOC of the first node containing ITEM,
or sets LOC : = NULL.
1. Set PTR := START.
2. Repeat while PTR ≠ NULL:
If ITEM = INFO[PTR], then: Set LOC := PTR, and Return.
Else: Set PTR := LINK[PTR].
[End of loop.]
3. Set LOC := NULL, and Return.

On the other hand, suppose we want to find the location LOC of an edge (A,
B) in the graph G. First we must find the location LOCA of A and the location
LOCB of B in the list NODE. Then we must find in the list of successors of A,
which has the list pointer ADJ[LOCA], the location LOC of LOCB. This is
implemented by Procedure 4.2, which also checks to see whether A and B are nodes
in G. Observe that LOC gives the location of LOCB in the list EDGE.

Procedure 4.2: FINDEDGE(NODE, NEXT, ADJ, START, DEST, LINK, A, B, LOC)

This procedure finds the location LOC of an edge (A, B) in the
graph G, or sets LOC := NULL.
1. Call FIND(NODE, NEXT, START, A, LOCA).
2. Call FIND(NODE, NEXT, START, B, LOCB).
3. If LOCA = NULL or LOCB = NULL, then: Set LOC := NULL.
Else: Call FIND(DEST, LINK, ADJ[LOCA], LOCB, LOC).
4. Return.

Inserting in a Graph
Suppose a node N is to be inserted in the graph G. Note that N will be
assigned to NODE[AVAILN], the first available node. Moreover, since N will be an
isolated node, one must also set ADJ[AVAILN] := NULL. Procedure 4.3
accomplishes this task using a logical variable FLAG to indicate overflow.
The Procedure 4.3 must be modified if the list NODE is maintained as a
sorted list or a binary search tree.

Procedure 4.3: INSNODE(NODE, NEXT, ADJ, START, AVAILN, N, FLAG)


This procedure inserts the node N in the graph G.
1. [OVERFLOW?] If AVAILN = NULL, then:
Set FLAG : = FALSE, and Return.
2. Set ADJ[AVAILN]: = NULL.
3. [Removes node from AVAILN list.]
Set NEW: = AVAILN and AVAILN : =
NEXT[AVAILN].
4. [Inserts node N in the NODE list.]
Set NODE[NEW]: = N, NEXT[NEW] := START and
START := NEW.
5. Set FLAG := TRUE, and Return.

Suppose an edge (A, B) is to be inserted in the graph G. (The procedure will


assume that both A and B are already nodes in the graph G.) The procedure first
finds the location LOCA of A and the location LOCB of B in the node list. Then (A,
B) is inserted as an edge in G by inserting LOCB in the list of successors of A,
which has the list pointer ADJ[LOCA]. Again, a logical variable FLAG is used to
indicate overflow. The procedure follows.

Procedure 4.4: INSEDGE(NODE, NEXT, ADJ, START, DEST, LINK, AVAILE, A, B,


FLAG) This procedure inserts the edge (A, B) in the graph G.
1. Call FIND(NODE, NEXT, START, A, LOCA).
2. Call FIND(NODE, NEXT, START, B, LOCB).
3. [OVERFLOW?] If AVAILE = NULL, then:
Set FLAG := FALSE, and Return.
4. [Remove node from AVAILE list.]
Set NEW := AVAILE and AVAILE: = LINK[ AVAILE].
5. [Insert LOCB in list of successors of A.]
Set DEST[NEW] := LOCB, LINK[NEW] := ADJ[LOCA] and
ADJ[LOCA] := NEW.
6. Set FLAG := TRUE, and Return.

Deleting from a Graph


Suppose an edge (A, B) is to be deleted from the graph G. (Our procedure
will assume that A and B are both nodes in the graph G.) Again, we must first find
the location LOCA of A and the location LOCB of B in the node list. Then we
simply delete LOCB from the list of successors of A, which has the list pointer
ADJ[LOCA]. A logical variable FLAG is used to indicate that there is no such edge
in the graph G. The procedure follows.
Procedure 4.5: DELEDGE(NODE, NEXT, ADJ, START, DEST, LINK, AVAILE, A,
B, FLAG) This procedure deletes the edge (A, B) from the graph G.
1. Call FIND(NODE, NEXT, START, A, LOCA). [Locates node A.]
2. Call FIND(NODE, NEXT, START, B, LOCB). [Locates node B.]
3. Call DELETE(DEST, LINK, ADJ[LOCA], AVAILE, LOCB, FLAG).
[Uses the linked-list DELETE procedure.]
4. Return.
Suppose a node N is to be deleted from the graph G. This operation is more
complicated than the search and insertion operations and the deletion of an edge,
because we must also delete all the edges that contain N. Note these edges come in two
kinds; those that begin at N and those that end at N. Accordingly, our procedure will
consist mainly of the following four steps:
(1) Find the location LOC of the node N in G.
(2) Delete all edges ending at N; that is, delete LOC from the list of successors of
each node M in G. (This step requires traversing the node list of G.)
(3) Delete all the edges beginning at N. This is accomplished by finding the location
BEG of the first successor and the location END of the last successor of N, and
then adding the successor list of N to the free AVAILE list.
(4) Delete N itself from the list NODE.
The procedure follows.

Procedure 4.6: DELNODE(NODE, NEXT, ADJ, START, AVAILN, DEST, LINK,


AVAILE, N, FLAG) This procedure deletes the node N from the graph
G.
1. Call FIND(NODE, NEXT, START, N, LOC). [Locates node N.]
2. If LOC = NULL, then: Set FLAG: = FALSE, and Return.
3. [Delete edges ending at N.]
(a) Set PTR := START.
(b) Repeat while PTR ≠ NULL:
(i) Call DELETE(DEST, LINK, ADJ[PTR], AVAILE, LOC, FLAG).
(ii) Set PTR := NEXT[PTR].
[End of loop.]
4. [Successor list empty?] If ADJ[LOC] = NULL, then: Go to Step 7.
5. [Find the first and last successor of N.]
(a) Set BEG : = ADJ[LOC], END : = ADJ[LOC] and
PTR:=LINK[END].
(b) Repeat while PTR ≠ NULL:
Set END := PTR and PTR := LINK[PTR]. [End of loop.]
6. [Add successor list of N to AVAILE list.]
Set LINK[END] := AVAILE and AVAILE := BEG.
7. [Delete N from the node list, using the linked-list DELETE procedure.]
Call DELETE(NODE, NEXT, START, AVAILN, N, FLAG).
8. Return
Example 4.8
Consider the (undirected) graph G in Fig. 4.12(a), whose adjacency lists appear in
Fig. 4.12(b). Observe that G has 14 directed edges, since there are 7 undirected edges.

Fig. 4.12

Suppose G is maintained in memory as in Fig. 4.13(a). Furthermore, suppose

node B is deleted from G by using Procedure 4.6. We obtain the following steps:
Step 1. Finds LOC = 2, the location of B in the node list.
Step 3. Deletes LOC = 2 from the edge list, that is, from each list of successors.
Step 5. Finds BEG = 4 and END = 6, the first and last successors of B.
Step 6. Deletes the list of successors from the edge list.
Step 7. Deletes node B from the node list.
Step 8. Returns.
The deleted elements are circled in Fig. 4.13(a). Figure 4.13(b) shows G in memory
after node B (and its edges) are deleted.
Fig. 4-13
Check Your Progress 3
1. List out the operations performed on graph.

4.6 SHORTEST PATH ALGORITHM

WARSHALL’S ALGORITHM; SHORTEST PATH


Let G be a directed graph with m nodes, v 1 , v 2 , . . . , v m . Suppose we want to
find the path matrix P of the graph G. Warshall gave an algorithm for this purpose. This
algorithm is described in this section, and a similar algorithm is used to find shortest
paths in G when G is weighted.
First we define m-square Boolean matrices P0, P1, . . . , Pm as follows. Let Pk[i, j] denote
the ij entry of the matrix Pk. Then we define:
Pk[i, j] = 1 if there is a simple path from vi to vj which does not use any other
            nodes except possibly v1, v2, . . . , vk
          0 otherwise

In other words,
P0[I, J] = 1 if there is an edge from vi to vj
P1[I, J] = 1 if there is a simple path from vi to vj which does not use any other
nodes except possibly v1
P2[I, J] = 1 if there is a simple path from vi to vj which does not use any other
nodes except possibly v1 and v2
…………………………………………………………………..
Warshall observed that Pk[i, j] = 1 can occur only if one of the following two
cases occurs:
(1) There is a simple path from vi to vj which does not use any other nodes except
possibly v1, v2, . . . , vk-1; hence
Pk-1[i, j] = 1
(2) There is a simple path from vi to vk and a simple path from vk to vj where each
path does not use any other nodes except possibly v1, v2, . . . , vk-1; hence
Pk-1[i, k] = 1 and Pk-1[k, j] = 1
Accordingly, the elements of the matrix Pk can be obtained by
Pk[i, j] = Pk-1[i, j] ∨ (Pk-1[i, k] ∧ Pk-1[k, j])
where we use the logical operations ∧ (AND) and ∨ (OR). Warshall's algorithm
follows.
Algorithm 4.7: (Warshall's Algorithm) A directed graph G with M nodes is maintained
in memory by its adjacency matrix A. This algorithm finds the (Boolean)
path matrix P of the graph G.
1. Repeat for I, J = 1, 2, . . . , M: [Initializes P.]
If A[I, J] = 0, then: Set P[I, J] := 0;
Else: Set P[I, J] := 1.
[End of loop.]
2. Repeat Steps 3 and 4 for K = 1, 2, . . . , M: [Updates P.]
3. Repeat Step 4 for I = 1, 2, . . . , M:
4. Repeat for J = 1, 2, . . . ,M :
Set P[I, J] := P[I, J] ∨ (P[I, K] ∧ P[K, J]).
[End of loop.]
[End of Step 3 loop.]
[End of Step 2 loop.]
5. Exit.
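A direct C rendering of Algorithm 4.7 follows (0-based indices; the matrix size M is an illustrative assumption):

    #define M 4

    void warshall(int A[M][M], int P[M][M])
    {
        for (int i = 0; i < M; i++)            /* Step 1: initialize P from A */
            for (int j = 0; j < M; j++)
                P[i][j] = (A[i][j] != 0);

        for (int k = 0; k < M; k++)            /* Steps 2-4: update P */
            for (int i = 0; i < M; i++)
                for (int j = 0; j < M; j++)
                    P[i][j] = P[i][j] || (P[i][k] && P[k][j]);
    }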
Shortest-Path Algorithm

Let G be a directed graph with m nodes, v1, v2, . . . , vm, Suppose G is weighted;
that is, suppose each edge e in G is assigned a nonnegative number w(e) called the weight
or length of the edge e. Then G may be maintained in memory by its weight matrix W =
(wij), defined as follows:
wij = w(e) if there is an edge e from vi to vj
      0 if there is no edge from vi to vj

The path matrix P tells us whether or not there are paths between the nodes. Now we
want to find a matrix Q which will tell us the lengths of the shortest paths between the
nodes or, more exactly, a matrix Q = (qij) where
qij = length of a shortest path from vi to vj
Next we describe a modification of Warshall's algorithm which finds us the matrix Q.
Here we define a sequence of matrices Q0, Q1 . . . , Qm whose entries are defined as
follows:
Qk[i, j] = the smaller of the length of the preceding path from vi to vj or the sum of
the lengths of the preceding paths from vi to vk and from vk to vj

More exactly,
Qk[i, j] = MIN(Qk-1[i, j], Qk-1[i, k] + Qk-1[k, j])
The initial matrix Q0 is the same as the weight matrix W except that each 0 in W is
replaced by ∞ (or a very, very large number). The final matrix Qm will be the desired
matrix Q.

Example 4.9
Consider the weighted graph G in Fig. 4.14. Assume v1 = R, v2 = S, v3 = T and v4 =
U. Then the weight matrix W of G is as follows:

W =
    7 5 0 0
    7 0 0 2
    0 3 0 0
    4 0 1 0

Applying the modified Warshall's algorithm, we obtain the following matrices Q0, Q1,
Q2, Q3 and Q4 = Q. To the right of each matrix Qk, we show the matrix of paths which
correspond to the lengths in the matrix Qk.
Fig 4.14

We indicate how the circled entries are obtained:


The formal statement of the algorithm follows.
Algorithm 4.8: (Shortest-Path Algorithm) A weighted graph G with M nodes is
maintained in memory by its weight matrix W. This algorithm finds a
matrix Q such that Q[I, J] is the length of a shortest path from node VI
to node VJ. INFINITY is a very large number, and MIN is the minimum
value function.
1. Repeat for I, J = 1, 2, . . . , M: [Initializes Q.]
If W[I, J] = 0, then: Set Q[I, J] := INFINITY;
Else: Set Q[I, J] : = W[I, J].
[End of loop.]
2. Repeat Steps 3 and 4 for K = 1, 2, . . . , M: [Updates Q.]
3. Repeat Step 4 for I = 1, 2, . . . , M:
4. Repeat for J = 1, 2, . . . , M:
Set Q[I, J] := MIN(Q[I, J], Q[I, K] + Q[K, J]).
[End of loop.]
[End of Step 3 loop.]
[End of Step 2 loop.]
5. Exit.
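
As with Warshall's algorithm, Algorithm 4.8 maps straightforwardly onto code. The following is a minimal C++ sketch: INFINITY is modeled by a large constant, the names are illustrative, and a guard is added before the addition to avoid integer overflow on "infinite" entries. The main function uses the weight matrix from Example 4.9.

#include <algorithm>
#include <iostream>
#include <vector>

const int INF = 1000000000;  // stands in for INFINITY

// W[i][j] is the edge weight, or 0 when there is no edge (as in the text).
std::vector<std::vector<int>> shortest_paths(const std::vector<std::vector<int>>& W) {
    int M = (int)W.size();
    std::vector<std::vector<int>> Q(M, std::vector<int>(M));
    for (int I = 0; I < M; ++I)                    // Step 1: initialize Q
        for (int J = 0; J < M; ++J)
            Q[I][J] = (W[I][J] == 0) ? INF : W[I][J];
    for (int K = 0; K < M; ++K)                    // Steps 2-4: update Q
        for (int I = 0; I < M; ++I)
            for (int J = 0; J < M; ++J)
                if (Q[I][K] < INF && Q[K][J] < INF)  // skip "infinite" paths
                    Q[I][J] = std::min(Q[I][J], Q[I][K] + Q[K][J]);
    return Q;
}

int main() {
    // Weight matrix of the graph in Example 4.9 (nodes R, S, T, U).
    std::vector<std::vector<int>> W = {
        {7, 5, 0, 0},
        {7, 0, 0, 2},
        {0, 3, 0, 0},
        {4, 0, 1, 0}};
    for (auto& row : shortest_paths(W)) {
        for (int x : row) std::cout << x << '\t';
        std::cout << '\n';
    }
}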

Suppose S is a graph such that each node vi of S represents a task and each edge (u, v)
means that the completion of the task u is a prerequisite for starting the task v. Suppose
such a graph S contains a cycle, such as
P = (u, v, w, u)
This means that we cannot begin v until completing u, we cannot begin w until
completing v and we cannot begin u until completing w. Thus we cannot complete any of
the tasks in the cycle. Accordingly, such a graph S, representing tasks and a prerequisite
relation, cannot have cycles.
Suppose S is a graph without cycles. Consider the relation < on S defined as
follows:
u<v if there is a path from u to v
This relation has the following three properties:
(1) For each element u in S, we do not have u < u. (Irreflexivity.)
(2) If u < v, then we do not have v < u. (Asymmetry.)
(3) If u < v and v < w, then u < w. (Transitivity.)
Such a relation < on S is called a partial ordering of S, and S with such an ordering is
called a partially ordered set, or poset. Thus a graph S without cycles may be regarded as
a partially ordered set.
On the other hand, suppose S is a partially ordered set with the partial ordering
denoted by <. Then S may be viewed as a graph whose nodes are the elements of S and
whose edges are defined as follows:
(u, v) is an edge in S if u<v
Furthermore, one can show that a partially ordered set S, regarded as a graph, has no
cycles:
Example 4.10
Let S be the graph in Fig. 4-15. Observe that S has no cycles. Thus S may be
regarded as a partially ordered set. Note that G < C, since there is a path from G to C.
Similarly, B < F and B < C. On the other hand, B ≮ A, since there is no path from B to A.
Also, A ≮ B.

Fig. 4.15 (a) and (b)
Check Your Progress 4
1. Warshall’s algorithm is used to find ________.
2. The path matrix p tells us whether or not there are paths between the nodes
[True/False]
3. How is a weighted graph G represented in memory?

TOPOLOGICAL SORTING

Let S be a directed graph without cycles (or a partially ordered set). A topological
sort T of S is a linear ordering of the nodes of S which preserves the original partial
ordering of S. That is: If u < v in S (i.e., if there is a path from u to v in S), then u comes
before v in the linear ordering T. Figure 4.16 shows two different topological sorts of the
graph S in Fig. 4.15. We have included the edges in Fig. 4.16 to indicate that they agree
with the direction of the linear ordering.
The following is the main theoretical result in this section.
Proposition: Let S be a finite directed graph without cycles or a finite partially ordered
set. Then there exists a topological sort T of the set S.
Fig. 4.16 Two topological sorts

Note that the proposition states only that a topological sort exists. We now give an
algorithm which will find such a topological sort.

The main idea behind our algorithm to find a topological sort T of a graph S without
cycles is that any node N with zero indegree, i.e., without any predecessors, may be
chosen as the first element in the sort T. Accordingly, our algorithm will repeat the
following two steps until the graph S is empty:
(1) Finding a node N with zero indegree
(2) Deleting N and its edges from the graph S

The order in which the nodes are deleted from the graph S will use an auxiliary array
QUEUE which will temporarily hold all the nodes with zero indegree. The algorithm also
uses a field INDEG such that INDEG(N) will contain the current indegree of the node N.
The algorithm follows.
Algorithm 4.9: TOPOSORT This algorithm finds a topological sort T of a graph S
without cycles.
1. Find the indegree INDEG(N) of each node N of S. (This can be done by
traversing each adjacency list as in Prob. 8.15.)
2. Put in a queue all the nodes with zero indegree.
3. Repeat Steps 4 and 5 until the queue is empty.
4. Remove the front node N of the queue
(by setting FRONT: = FRONT +1).
5. Repeat the following for each neighbor M of the node N:
(a) Set INDEG(M): = INDEG(M) - 1.
[This deletes the edge from N to M.]
(b) If INDEG(M) = 0, then: Add M to the rear of the
queue.
[End of loop.]
[End of Step 3 loop.]
6. Exit
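
Algorithm 4.9 is often implemented with an adjacency list and a queue of zero-indegree nodes. Below is a minimal C++ sketch; the names are illustrative, nodes are 0-based, and the sample graph in main is invented for demonstration.

#include <iostream>
#include <queue>
#include <vector>

// adj[u] lists the successors of node u; returns the nodes in topological order.
std::vector<int> toposort(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> indeg(n, 0);                 // Step 1: find indegrees
    for (int u = 0; u < n; ++u)
        for (int v : adj[u]) ++indeg[v];
    std::queue<int> q;                            // Step 2: enqueue zero-indegree nodes
    for (int u = 0; u < n; ++u)
        if (indeg[u] == 0) q.push(u);
    std::vector<int> T;
    while (!q.empty()) {                          // Step 3: repeat until queue is empty
        int u = q.front(); q.pop();               // Step 4: remove the front node
        T.push_back(u);
        for (int v : adj[u])                      // Step 5: delete edges from u
            if (--indeg[v] == 0) q.push(v);
    }
    return T;   // contains fewer than n nodes if the graph had a cycle
}

int main() {
    // Edges: 0->1, 0->2, 1->3, 2->3 (illustrative acyclic graph).
    std::vector<std::vector<int>> adj = {{1, 2}, {3}, {3}, {}};
    for (int u : toposort(adj)) std::cout << u << ' ';
    std::cout << '\n';
}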
Unit 6
Symbol Table

Symbol table:
The symbol table contains all the information that must be passed between the different
phases of a compiler/interpreter.
• A Symbol (or token) has at least the following attributes:
1. Symbol Name
2. Symbol Type (int, real, char, ...)
3. Symbol Class (static, automatic, const, ...)
Hashing
We have all used a dictionary, and many of us have a word processor equipped with
a limited dictionary, that is, a spelling checker. We consider the dictionary as an ADT.
Examples of dictionaries are found in many applications, including the spelling checker, the
thesaurus, the data dictionary found in database management applications, and the symbol
tables generated by loaders, assemblers, and compilers.
In computer science, we generally use the term symbol table rather than dictionary,
when referring to the ADT. Viewed from this perspective, we define the symbol table as a
set of name-attribute pairs. The characteristics of the name and attribute vary according to
the application. For example, in a thesaurus, the name is a word, and the attribute is a list of
synonyms for the word; in a symbol table for a compiler, the name is an identifier, and the
attributes might include an initial value and a list of lines that use the identifier.
Generally we would want to perform the following operations on any symbol table (a small interface sketch follows the list):
(1) Determine if a particular name is in the table
(2) Retrieve the attributes of that name
(3) Modify the attributes of that name
(4) Insert a new name and its attributes
(5) Delete a name and its attributes
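
As a concrete illustration, here is a minimal C++ interface sketch of these five operations. The names SymbolTable and Attributes are illustrative, and std::unordered_map is assumed as the backing store; note that it is itself a hash table, which anticipates the discussion of hashing below.

#include <string>
#include <unordered_map>

struct Attributes {
    std::string type;    // e.g. "int", "real", "char"
    std::string storage; // e.g. "static", "automatic", "const"
};

class SymbolTable {
    std::unordered_map<std::string, Attributes> table;
public:
    bool contains(const std::string& name) const {       // (1) is the name in the table?
        return table.count(name) > 0;
    }
    const Attributes* retrieve(const std::string& name) const {  // (2) get attributes
        auto it = table.find(name);
        return it == table.end() ? nullptr : &it->second;
    }
    void modify(const std::string& name, const Attributes& a) {  // (3) change attributes
        auto it = table.find(name);
        if (it != table.end()) it->second = a;
    }
    void insert(const std::string& name, const Attributes& a) {  // (4) add a new name
        table[name] = a;
    }
    void remove(const std::string& name) {               // (5) delete a name
        table.erase(name);
    }
};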

There are only three basic operations on symbol tables: searching, inserting, and deleting.
The technique for those basic operations is hashing. Unlike search tree methods that rely on
identifier comparisons to perform a search, hashing relies on a formula called the hash
function. We divide our discussion of hashing into two parts: static hashing and dynamic
hashing.

Static Hashing
Hash Tables
In static hashing, we store the identifiers in a fixed size table called a hash table. We
use an arithmetic function, f, to determine the address, or location, of an identifier, x, in the
table. Thus, f (x) gives the hash, or home address, of x in the table. The hash table ht is
stored in sequential memory locations that are partitioned into b buckets, ht [0], …, ht [b-1].
Each bucket has s slots. Usually s = 1, which means that each bucket holds exactly one
record. We use the hash function f (x) to transform the identifiers x into an address in the
hash table. Thus, f (x) maps the set of possible identifiers onto the integers 0 through b-1.

Definition
The identifier density of a hash table is the ratio n/T, where n is the number of
identifiers in the table and T is the total number of possible identifiers. The loading
density or loading factor of a hash table is α = n/(sb).
Example 7.1: Consider the hash table ht with b = 26 buckets and s = 2. We have n = 10 distinct
identifiers, each representing a C library function. This table has a loading factor, α, of 10/52 =
0.19. The hash function must map each of the possible identifiers onto one of the numbers 0-25.
We can construct a fairly simple hash function by associating the letters a-z with the numbers 0-25,
respectively, and then defining the hash function, f (x), as the first character of x. Using this
scheme, the library functions acos, define, float, exp, char, atan, ceil, floor, clock, and ctime hash
into buckets 0, 3, 5, 4, 2, 0, 2, 5, 2, and 2, respectively.

Hashing Function
A hash function, f, transforms an identifier, x, into a bucket address in the hash table. We want
a hash function that is easy to compute and that minimizes the number of collisions.
Although the hash function we used in Example 7.1 was easy to compute, using only the first
character in an identifier is bound to have disastrous consequences. We know that identifiers,
whether they represent variable names in a program, words in a dictionary, or names in a
telephone book, cluster around certain letters of the alphabet. To avoid collisions, the hash
function should depend on all the characters in an identifier. It also should be unbiased. That
is, if we randomly choose an identifier, x, from the identifier space (the universe of all possible
identifiers), the probability that f(x) = i is 1/b for all buckets i. This means that a random x has
an equal chance of hashing into any of the b buckets; a hash function satisfying this property is
called a uniform hash function.
There are several types of uniform hash functions, and we shall describe four of them. We
assume that the identifiers have been suitably transformed into a numerical equivalent.

Mid-square:
The middle of square hash function is frequently used in symbol table application. We
compute the function fm by squaring the identifier and then using an appropriate number
of bits from the middle of the square to obtain the bucket address. Since the middle bits
of the square usually depend upon all the characters in an identifier, there is a high
probability that different identifiers will produce different hash addresses, even when
some of the characters are the same. The number of bits used to obtain the bucket
address depends on the table size. If we use r bits, the range of the values is 2^r. Therefore,
the size of the hash table should be a power of 2 when we use this scheme.
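
A minimal C++ sketch of the mid-square method follows; the bit widths, shift amounts, and sample key are illustrative assumptions, not part of the original text.

#include <cstdint>
#include <iostream>

// Table size is a power of two: 2^r buckets.
uint32_t mid_square(uint32_t key, unsigned r) {
    uint64_t sq = (uint64_t)key * key;       // square the (numeric) identifier
    unsigned mid = (64 - r) / 2;             // position of the middle r bits
    return (uint32_t)((sq >> mid) & ((1u << r) - 1));
}

int main() {
    // With r = 8 the bucket addresses range over 0..255.
    std::cout << mid_square(123456, 8) << '\n';
}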

Division:
This hash function uses the modulus (%) operator. We divide the identifier x by some
number M and use the remainder as the hash address of x. The hash function is: fD(x) = x % M.
This gives bucket addresses that range from 0 to M-1, where M = the table size. The
choice of M is critical. In the division function, if M is a power of 2, then fD(x) depends only
on the least significant bits of x. Such a choice for M results in a biased use of the hash
table when several of the identifiers in use have the same suffix. If M is divisible by 2, then
odd keys are mapped to odd buckets, and even keys are mapped to even buckets. Hence,
an even M results in a biased use of the table when a majority of identifiers are even or
when a majority are odd.

Folding:
In this method, we partition the identifier x into several parts. All parts, except for the last
one, have the same length. We then add the parts together to obtain the hash address for
x. There are two ways of carrying out this addition. In the first method, we shift all parts
except for the last one, so that the least significant bit of each part lines up with the
corresponding bit of the last part. We then add the parts together to obtain f(x). This
method is known as shift folding. The second method, known as folding at the boundaries,
reverses every other partition before adding.
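
The two folding variants can be sketched as follows in C++. This assumes the identifier has already been split into fixed-size decimal parts; the three-digit parts and the sample key are illustrative.

#include <iostream>
#include <vector>

// Shift folding: simply add the parts.
unsigned shift_fold(const std::vector<unsigned>& parts) {
    unsigned sum = 0;
    for (unsigned p : parts) sum += p;
    return sum;
}

// Folding at the boundaries: reverse the digits of every other part first.
unsigned boundary_fold(const std::vector<unsigned>& parts) {
    unsigned sum = 0;
    for (size_t i = 0; i < parts.size(); ++i) {
        unsigned p = parts[i];
        if (i % 2 == 1) {                 // reverse alternate parts
            unsigned rev = 0;
            for (unsigned q = p; q > 0; q /= 10) rev = rev * 10 + q % 10;
            p = rev;
        }
        sum += p;
    }
    return sum;
}

int main() {
    // Key 123203241 split into three-digit parts.
    std::vector<unsigned> parts = {123, 203, 241};
    std::cout << shift_fold(parts) << ' ' << boundary_fold(parts) << '\n';
}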

Digit Analysis:
The last method we will examine, digit analysis, is used with static files. A static file is one
in which all the identifiers are known in advance. Using this method, we first transform
the identifiers into numbers using some radix, r. We then examine the digits of each
identifier, deleting those digits that have the most skewed distributions. We continue
deleting digits until the number of remaining digits is small enough to give an address in
the range of the hash table. The digits used to calculate the hash address must be the
same for all identifiers and must not have abnormally high peaks or valleys (the standard
deviation must be small).

Overflow Handling
There are two methods for detecting collisions and overflows in a static hash table; each
method uses a different data structure to represent the hash table.
Two methods:
Linear Open Addressing (Linear probing)
Chaining

Linear Open Addressing


When we use linear open addressing, the hash table is represented as a one-dimensional array
with indices that range from 0 to the desired table size - 1. The component type of the array is a
struct that contains at least a key field. Since the keys are usually words, we use a string to denote
them. Creating the hash table ht with one slot per bucket is:

#define MAX_CHAR 10 /* max number of characters in an identifier */


#define TABLE_SIZE 13 /*max table size = prime number */

struct element
{
char key[MAX_CHAR];
/* other fields */
};

element hash_table[TABLE_SIZE];

Before inserting any elements into this table, we must initialize the table to represent the
situation where all slots are empty. This allows us to detect overflows and collisions when we
insert elements into the table. The obvious choice for an empty slot is the empty string since it will
never be a valid key in any application.

Initialization of a hash table:


void init_table ( element ht[ ] )
{
   short i;
   for ( i = 0; i < TABLE_SIZE; i++ )
      ht[i].key[0] = '\0';   /* the empty string marks an empty slot */
}

To insert a new element into the hash table we convert the key field into a natural
number, and then apply one of the hash functions discussed in Hashing Function. We can
transform a key into a number if we convert each character into a number and then add these
numbers together. The function transform (below) uses this simplistic approach. To find the hash
address of the transformed key, hash (below) uses the division method.

short transform (char *key)
{ /* simple additive approach to create a natural number that is within the integer range */
   short number = 0;
   while (*key)
      number += *key++;
   return number;
}

short hash (char *key)
{ /* transform key to a natural number, and return this result modulus the table size */
   return (transform (key) % TABLE_SIZE);
}

To implement the linear probing strategy, we first compute f (x) for identifier x and then
examine the hash table buckets ht[(f(x) + j) % TABLE_SIZE], 0 ≤ j < TABLE_SIZE, in this order. Four
outcomes result from the examination of a hash table bucket:
(1) The bucket contains x. In this case, x is already in the table. Depending on the application,
we may either simply report a duplicate identifier, or we may update information in the
other fields of the element.
(2) The bucket contains the empty string. In this case, the bucket is empty, and we may insert
the new element into it.
(3) The bucket contains a nonempty string other than x. In this case we proceed to examine
the next bucket.
(4) We return to the home bucket ht [f (x)] (j = TABLE_SIZE). In this case, the home bucket is
being examined for the second time and all remaining buckets have been examined. The
table is full and we report an error condition and exit.

Implementation of the insertion strategy:


void linear_insert (element item, element ht[ ])
{ /* insert the key into the table using the linear probing technique; exit the function if the
table is full */
   short i, hash_value;
   hash_value = hash (item.key);
   i = hash_value;
   while (strlen (ht[i].key))
   {
      if (!strcmp (ht[i].key, item.key))
      {
         cout << "Duplicate entry!\n";
         exit (1);
      }
      i = (i + 1) % TABLE_SIZE;
      if (i == hash_value)
      {
         cout << "The table is full!\n";
         exit (1);
      }
   }
   ht[i] = item;
}
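
Retrieval follows the same probe sequence. The following companion lookup is not from the original text but can be sketched against the same declarations (the name linear_search is illustrative):

short linear_search (char *key, element ht[ ])
{ /* return the index of key in ht, or -1 if it is not present */
   short i, hash_value;
   hash_value = hash (key);
   i = hash_value;
   while (strlen (ht[i].key))
   {
      if (!strcmp (ht[i].key, key))
         return i;                   /* found the key */
      i = (i + 1) % TABLE_SIZE;
      if (i == hash_value)           /* wrapped around: every bucket examined */
         return -1;
   }
   return -1;                        /* hit an empty slot: key not in table */
}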

Chaining
Linear probing and its variations perform poorly because inserting an identifier requires
the comparison of identifiers with different hash values. We can avoid these comparisons by
keeping a separate list of identifiers for each hash value. To insert a new element we would only
have to compute the hash address f (x) and examine the identifiers in the list for f (x). Since we
would not know the sizes of the lists in advance, we should maintain them as linked chains. We
now require additional space for a link field. Since we will have M lists, where M is the desired
table size, we employ a head node for each chain. These head nodes only need a link field, so they
are smaller than the other nodes. We store the head nodes sequentially, indexed 0, . . . , M-1, so
that we may access the lists at random. The C++ declarations required to create the chained hash
table are:
#define MAX_CHAR 10 /* maximum identifier size*/
#define TABLE_SIZE 13 /* prime number */
#define IS_FULL(ptr) (!(ptr))

struct element
{
   char key[MAX_CHAR];
   /* other fields */
};
typedef struct list *list_pointer;
struct list
{
element item;
list_pointer link;
};
list_pointer hash_table[TABLE_SIZE];

The function chain_insert (below) implements the chaining strategy. The function first computes
the hash address for the identifier. It then examines the identifiers in the list for the selected
bucket. If the identifier is found, we print an error message and exit. If the identifier is not in the
list, we insert it at the end of the list. If the list was empty, we change the head node to point to
the new entry.

Implementation of the function chain_insert:


void chain_insert (element item, list_pointer ht[])
{ /* insert the key into the table using chaining */
   short hash_value = hash (item.key);
   list_pointer ptr, trail = NULL, lead = ht[hash_value];
   for ( ; lead; trail = lead, lead = lead->link)
   {
      if (!strcmp (lead->item.key, item.key))
      {
         cout << "The key is in the table\n";
         exit (1);
      }
   }
   ptr = new struct list;
   if (IS_FULL (ptr))
   {
      cout << "The memory is full\n";
      exit (1);
   }
   ptr->item = item;
   ptr->link = NULL;
   if (trail)
      trail->link = ptr;
   else
      ht[hash_value] = ptr;
}

Collision Resolution in Internal Hashing

A collision occurs in a hash table any time that the hash function maps two or more
key values into the same address within the address space. There are two basic
ways to handle collisions: the first is a lazy approach (also referred to as an
optimistic approach) and the second is a greedy approach (also referred to as a
pessimistic approach).

1. Ignore the collision. This is acceptable only if the probability of collision is
very low, or if the hash function is already too slow to add the overhead of
collision resolution.
2. Create and utilize a collision resolution protocol. This adds complexity to
hashed operations and causes extra implementation work.

Collision Resolution Protocols

Collision resolution protocols can range from fairly simple to very complex
techniques. Among the simplest protocols are:

1. linear probing
2. quadratic probing
3. chaining

More advanced techniques such as multiple hash functions and bucketing can be
applied when the table size is relatively large.

Linear Probing
Technique: When a collision occurs sequentially search through the table from
the point of the collision (using wrap-around searching – modulo
arithmetic) until an empty location is found. Specifically, if the hash
function returns a value H and location (cell) H is not empty then
cell H+1 is attempted, followed by H+2, H+3, …, H+i (using
wraparound).

Example: Suppose our hash function maps the letter A to location 0, B to 1, ..., Z to
25, and we are hashing based upon the first letter of a person's name.
With the input sequence: Insert (Al), Insert (Bob), Insert (Betty), Insert
(Carl), we can see how linear probing handles collisions.

location   value
0          Al
1          Bob
2          Betty   (should be in location 1, but a collision moved it to location 2)
3          Carl    (should be in location 2, but a collision moved it to location 3)
...
25         (empty)

Details: Retrievals are handled by hashing the key and comparing the data at the
location provided by the hash function. If the two values are not equal,
the location is incremented and the comparison is made again against
the value in this new location. This is repeated until either the key value
is found or an empty location is encountered. Deletion must be lazy.
This entails marking the item as deleted but leaving it in place in the
table (using a delete bit) without actually physically removing it from the
table. This ensures that the look-up operation always works. Items
which have been lazily deleted are only removed when they won't break
a chain of valid items, or when a new item can be inserted at this location,
overwriting the deleted item.

Analysis:

Definition
Load factor: The load factor of a probing hash table is the fraction of the table that is
full. The load factor is represented by the symbol α, and generally ranges from 0
(empty table) to 1 (full table).

Assuming that the probes are independent, the average number of locations (cells in the table)
that will be examined in a single probe is 1/(1 - α). This comes simply from the fact that the
probability that a location is empty is 1 - α. For example, at α = 0.5 this is 2 cells, and at
α = 0.9 it is 10 cells.

The above assumption is bad! In fact, linear probing causes a phenomenon
called primary clustering. These clusters are blocks of occupied cells (locations).
These blocks cause excessive attempts to resolve collisions. Taking this into
account, the average number of cells that will need to be examined for an
insertion into the hash table is:

(1/2) [1 + 1/(1 - α)²]

For half-full tables, i.e., when α = 0.5, this is an acceptable value of 2.5, but when
α = 0.9, the search will require that about 50 cells (on the average) be examined!

We need a solution that eliminates primary clustering. The following picture
illustrates (sort of!) the long-term effect primary clustering has on the file density.
The shaded areas indicate areas of the file that are occupied with records. The
unshaded areas are unoccupied areas containing no information. Primary
clustering tends to divide the file space into discrete clusters, which further increases
the probability of collision, and tends only to expand each cluster rather than spread
the information across the file space.

Quadratic Probing

Quadratic probing eliminates the problem of primary clustering caused by linear
probing. The technique is similar to linear probing, but the location increment is
not 1. Specifically, if the hash function produces a hash value (a location or cell
index) of H and the search at location H is unsuccessful, then the next location
that is searched is H+1², followed by H+2², H+3², H+4², ..., H+i² (using
wraparound as before).

Example: Suppose our hashing function is a simple mod operation on the size of
the hash table. If the hash table is size 10 and the input sequence is:
Insert(89), Insert (18), Insert (49), Insert (58), Insert (9). Then the
hash table is filled as shown below:

location   value   description
0          49      H=9, collision, (H+1)mod 10 = 0
1
2          58      H=8, collision, (H+1)mod 10 collision, (H+4)mod 10 = 2
3          9       H=9, collision, (H+1)mod 10 collision, (H+4)mod 10 = 3
4
5
6
7
8          18      ok
9          89      ok

The question now becomes, "Is quadratic probing any better than linear probing?"
If the size of the hash table is a prime number and α ≤ 0.5, then all probes will be
to different locations, an item can always be inserted, and furthermore, no location
will be probed twice during an access.

However, at α = 0.5, linear probing is fairly good, and the removal of primary
clustering by use of quadratic probing will only save 0.5 probes for an average
insertion and 0.1 probes for an average successful search. Quadratic probing
provides an additional benefit in that it is unlikely to encounter an excessively
long probe, as might be the case with linear probing. However, quadratic probing
requires a multiplication (the i² term), so an efficient algorithm for this multiplication
will be necessary.

Given the previous value Hi-1, it is possible to determine the next value Hi
without requiring the computation of i². Assuming that we still require a
wraparound technique, this new value Hi is computed as follows:

Hi = Hi-1 + 2i − 1 (mod tablesize)

This can be implemented as follows:

1. use an addition to increment i
2. use a left bit shift (<< 1) to compute 2i
3. a subtraction to compute 2i − 1
4. a second addition to add 2i − 1 to the old value of H
5. finally a modulo operation if wraparound is needed

Example: Using the example from earlier, consider the steps to insert(58).
Initially H0 = 58 mod 10 = 8 and collision results. Then i = 1 and H0 = 8. H1 = [H0 +
2(1) – 1]mod 10 = [8+1]mod 10 = 9. This too results in a collision so another value
of H must be calculated as follows: H2 = [H1 + 2(2) – 1]mod 10 = [9+3]mod 10 = 2
which is empty, so insertion occurs at position 2 in the hash table.

Using the shift operation this example proceeds as (with numbers shown in binary
form):
Initially H0 = 58 mod 10 = 8 and collision results. Then i = 0001 and H0 = 1000. H1
= [1000 + 0010 – 0001]mod 10 = [8+1]mod 10 = 9. This too results in a collision
so another value of H must be calculated as follows: H2 = [1001 + 0100 –
0001]mod 10 = [9+3]mod 10 = 2 which is empty, so insertion occurs at position 2
in the hash table.
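
Here is a minimal C++ sketch of quadratic probing using this incremental identity; the names are illustrative, 0 marks an empty cell, and the table size of 10 matches the example above (note it is not prime, as the text cautions). Running it reproduces the example table: 49 lands at 0, 58 at 2, and 9 at 3.

#include <iostream>

const int SIZE = 10;          // the example table size
int table[SIZE];              // globals are zero-initialized; 0 marks an empty cell

// Insert key, probing H, H+1^2, H+2^2, ... computed incrementally.
bool quadratic_insert(int key) {
    int h = key % SIZE;
    for (int i = 1; i <= SIZE; ++i) {
        if (table[h] == 0) { table[h] = key; return true; }
        h = (h + (i << 1) - 1) % SIZE;   // add 2i - 1; the shift computes 2i
    }
    return false;                        // no empty cell found
}

int main() {
    int keys[] = {89, 18, 49, 58, 9};    // the example input sequence
    for (int k : keys) quadratic_insert(k);
    for (int i = 0; i < SIZE; ++i)
        std::cout << i << ": " << table[i] << '\n';
}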

Quadratic probing eliminates primary clustering but introduces the problem of
secondary clustering: elements which hash to the same location will probe the
same set of alternative locations. This, however, is not a real concern.
Simulations have shown that, in general, less than 0.5 additional probes are
required per search, and this only occurs for high load factors. If secondary
clustering does present a problem for a given application, there are techniques
which will eliminate it altogether. One of the more popular techniques is called
double hashing, in which a second hash function is used to drive the collision
resolution.
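
The text names double hashing without developing it; the following is an assumed minimal sketch. The second hash function h2 determines the probe step, so keys with the same home address follow different probe sequences; h2 is chosen so the step is never zero, and the prime table size guarantees every cell is eventually probed. All names and sample keys are illustrative.

#include <iostream>

const int SIZE = 11;                      // prime table size
int table[SIZE];                          // 0 marks an empty cell

int h1(int key) { return key % SIZE; }
int h2(int key) { return 7 - (key % 7); } // step is in 1..7, never 0

bool double_hash_insert(int key) {
    int pos = h1(key), step = h2(key);
    for (int i = 0; i < SIZE; ++i) {
        if (table[pos] == 0) { table[pos] = key; return true; }
        pos = (pos + step) % SIZE;        // probe with stride h2(key)
    }
    return false;
}

int main() {
    // 100 collides with 89 at position 1 and is displaced by its own stride.
    int keys[] = {89, 18, 49, 58, 9, 100};
    for (int k : keys) double_hash_insert(k);
    for (int i = 0; i < SIZE; ++i)
        std::cout << i << ": " << table[i] << '\n';
}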

Chaining
• Maintain an array of linked lists at each hash addressable location.
• The hash function returns an index of a specific list.
• Insertions, deletions, and searches occur in that list.
• If the lists are kept short, then the potential performance bottleneck is
eliminated.
• λ is calculated by dividing the total number of nodes N, by the number of lists
which are maintained M.
• λ= N/M
• λ is no longer bounded by 1.0; a well-tuned table keeps its average value near 1.0.
• The expected number of probes for insertion and an unsuccessful search is: λ.
• The expected number of probes for a successful search is: 1 + λ/2.

Example: M = 6, N = 15, λ = N/M = 15/6 = 2.5

Hash Address   List
0 → Al, Ann, Art, Ali
1 → Kris, Kristi
2 → Bo
3 → Cris, Cindi, Cyn, Calli, Carl
4 → (empty)
5 → Jimi, Jane, Jack


• Each list referenced by the "hash table" is a singly-linked list (see previous notes
for implementation details).

• The singly-linked lists shown above do not have a tail node. Would the use of a
tail node be beneficial in this data structure? The answer is yes, it could help in
two different ways! Notice that there is no implied order to the elements of a
specific list. This is deliberate, since insertion into a hash table should be an O(1)
operation. If the list were maintained in alphabetical order, then insertion would not
be an O(1) operation, and we would violate one of the specifications of the hash
table data structure. The same problem occurs in the implementation shown above,
since we have no way, other than traversing the list, of finding the end of the list.
Therefore a "better" implementation is the one shown on the next page.

Hash Address   List
0 → Al, Ann, Art, Ali, TAIL
1 → Kris, Kristi, TAIL
2 → Bo, TAIL
3 → Cris, Cindi, Cyn, Calli, Carl, TAIL
4 → TAIL
5 → Jimi, Jane, Jack, TAIL

• Notice in this implementation of the hash table that even the hash addresses
with no entries maintain an empty list (chain).

• The first way that the tail node improves the implementation is as follows: in
typical implementations, the tail node will actually contain a data field which is
usually set to the largest possible key value that could be hashed. This
eliminates null value comparisons in the code (replacing them with perhaps
comparisons to MaxInt or something similar). Since each list has a logical end,
there should be no problems associated with running off the end of a list.
• Also notice how wasteful of space it is to have a separate tail node for every list.
In reality, all of these nodes will be condensed to a single node to which all lists
will link. This is shown in the next diagram.

Hash Address   List (all chains share a single TAIL node)
0 → Al, Ann, Art, Ali → TAIL
1 → Kris, Kristi → TAIL
2 → Bo → TAIL
3 → Cris, Cindi, Cyn, Calli, Carl → TAIL
4 → TAIL
5 → Jimi, Jane, Jack → TAIL

• Notice that this “better” implementation still does not provide O(1) insert time,
unless we can identify (have a reference to) the node immediately preceding the
tail node in any given list. For example, if we want to insert Alice into the first
list, having a tail node only tells us where the end of the list is, not where the
node next to the end of the list is! What do we do to get our required O(1) insert?

The answer has been available all along, and none of the “improvements” that we
have made to our structure have done anything toward this end. Recall some of the
issues we discussed when dealing with the implementation of linked lists in CS2.
We stated that in a list without header and tail nodes that insertion at either end of
the list was a “special case” that was different from inserting in the middle of the
list. So we put header and tail nodes in to prevent the special cases from occurring.
However, in our hash table structure, there has been a header node all along. It is
embedded in the hash table itself as the reference to the chain for each hashable
location. Therefore, to achieve O(1) insertion time, we simply perform ALL
insertions at the head of the list rather than at the tail of the list. (A potential benefit
of this is that the chain will contain the elements in the order of their arrival – i.e.
they appear in entry order within each chain.) This again illustrates that you need to
be aware of the various implementation issues for all of the data structures that are
involved in any application. The final diagram illustrates the insertion of a newly
hashed value into our hash table.
Hash Address   List (James has just been inserted at the head of its chain)
0 → Al, Ann, Art, Ali → TAIL
1 → Kris, Kristi → TAIL
2 → Bo → TAIL
3 → Cris, Cindi, Cyn, Calli, Carl → TAIL
4 → TAIL
5 → James, Jimi, Jane, Jack → TAIL
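
The head-insertion step can be written directly against the chained declarations from earlier. This is a minimal sketch, not from the original text; the function name is illustrative, and the duplicate check and memory check from chain_insert are omitted for brevity.

void chain_insert_at_head (element item, list_pointer ht[])
{ /* insert at the front of the chain: no traversal, so O(1) time */
   short hash_value = hash (item.key);
   list_pointer ptr = new struct list;
   ptr->item = item;
   ptr->link = ht[hash_value];   /* old head (or NULL) becomes the second node */
   ht[hash_value] = ptr;         /* the hash-table slot is the embedded header */
}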

Hash tables can be used to implement insert and find operations in O(1) time, on the
average. There are many implementation factors that can influence the performance
of the hash table such as the load factor, the hash function itself, file size, input
rates and distributions, as well as many other factors. It is important to pay
attention to these details if you are to perform these operations in O(1) time.
