
EC 2202 DATA STRUCTURES AND OBJECT ORIENTED PROGRAMMING IN C++

UNIT III DATA STRUCTURES & ALGORITHMS

3.1. Algorithm
An algorithm is a sequence of unambiguous instructions for solving a problem, i.e., for obtaining a required
output for any legitimate input in a finite amount of time.

3.1.1. Fundamentals of algorithmic problem solving

Understanding the Problem:


• Clearly read the problem
• Clarify any doubts you have
• Analyse what input is required

Decision making on:
• Capabilities of computational devices: The algorithm should suit your system. For example, if your system executes instructions in parallel, then you have to write a parallel algorithm; if you have a RAM machine (a system that executes instructions sequentially), then you have to write a serial algorithm.
• Exact or approximate output: Decide what kind of output is needed and write the algorithm accordingly. For example, to find a square root an approximate output is enough, but to find the shortest route between cities we need an exact result.
• Data structure: Decide in which format the input is going to be given. Example data structures are stack, queue, list, etc.
• Algorithm design technique: Decide which technique is used to write the algorithm. For example, you can divide a big problem into a number of smaller units and solve each unit; this technique is called Divide and Conquer. So, depending on the problem, you can select any one technique. Some techniques are Decrease and Conquer, Brute Force, Divide and Conquer, etc.
Specification of an Algorithm:
This is the way of specifying an algorithm. There are three ways to represent an algorithm:
• Natural language: Represent the instructions in the form of sentences.
• Pseudo code: Use identifiers, keywords and symbols to represent the instructions. For example, c = a + b.
• Flowchart: Another way of representing the instructions. Here diagrams are used and the sequence is linked using lines; separate symbols are available for the different kinds of instructions.
Algorithm Verification:
To verify the algorithm, apply all the possible inputs and check whether the algorithm produces the required output or not.
Analysis of Algorithm:
Compute the efficiency of the algorithm. The efficiency depends on the following factors:
• Time: The amount of time taken to execute the algorithm. If the algorithm takes less time to execute, then it is the better one.
• Space: The amount of memory required to store the algorithm and the amount of memory required to store the input for that algorithm.
• Simplicity: The algorithm should not have any complex instructions; such instructions should be simplified into a number of smaller instructions.
• Generality: The algorithm should be general. It should be possible to implement it in any language, and it should not be restricted to one specific output alone.
Implementation:
After satisfying all the above factors, code the algorithm in any language you know and execute the program.
3.1.3. Properties of the algorithm
1) Finiteness: An algorithm terminates after a finite number of steps.
2) Definiteness: Each step in the algorithm is unambiguous. This means that the action specified by the step cannot be interpreted in multiple ways and can be performed without any confusion.
3) Input: An algorithm accepts zero or more inputs.
4) Output: It produces at least one output.
5) Effectiveness: It consists of basic instructions that are realizable. This means that the instructions can be carried out using the given inputs in a finite amount of time.
6) Non-ambiguity: The algorithm should not have any conflicting meaning.
7) Range of input: Before designing an algorithm, decide what type of input is going to be given and what the required output is.
8) Multiplicity: Different algorithms can be written to solve the same problem.
9) Speed: Apply suitable ideas to speed up the execution time.
3.1.4. Analysis of algorithm
Analysis Framework:
Analysis means computing the efficiency of an algorithm. When computing the efficiency of an algorithm, consider the following two factors:
• Space Complexity
• Time Complexity
(i) Space Complexity: The amount of memory required to store the algorithm and the amount of memory required to store the inputs for this algorithm.
S(p) = C + Sp where,
C – Constant (the amount of memory required to store the algorithm)
Sp – The amount of memory required to store the inputs. Each input is stored in one unit.
Example: Write an algorithm to find the summation of n numbers and analyse the space complexity of that algorithm.
Algorithm Summation(X, n)
// Input : n, the number of elements
// Output : the result of the summation of n numbers
sum = 0
for i = 1 to n
sum = sum + X[i]
return sum
The space complexity of the above algorithm:
S(p) = C + Sp
1. One unit for each element in the array. The array has n elements, so the array requires n units.
2. One unit for the variable n, one unit for the variable i and one unit for the variable sum.
3. Add all the above units to find the space complexity.
S(p) = C + (n + 1 + 1 + 1)
S(p) = C + (n + 3)
(ii) Time Complexity
The amount of time required to run the algorithm. The execution time depends on the following factors:
• System load
• Number of other programs running
• Speed of the hardware
How to measure the running time?
• Find the basic operation of the algorithm. (The operation in the innermost loop is called the basic operation.)
• Compute the time required to execute the basic operation.
• Compute how many times the basic operation is executed. The time complexity is calculated by the following formula:
T(n) = Cop * C(n) where,
Cop – Constant (the amount of time required to execute the basic operation once)
C(n) – The number of times the basic operation is executed.
Example: The time complexity of the above algorithm (Summation of n Numbers)
1. The basic operation is: addition.
2. The number of times the basic operation is executed: n.
3. So, T(n) = Cop * n.
4. Remove the constant, i.e. assume Cop = 1.
5. The time complexity is T(n) = n.
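As an illustration (not part of the original algorithm statement), the Summation algorithm can be coded in C as below; the ops counter and the sample array in main() are added only to make the analysis visible.

#include <stdio.h>

/* C version of Algorithm Summation(X, n) analysed above. The counter "ops"
   shows that the basic operation (the addition inside the loop) executes
   exactly n times, so T(n) = n when Cop = 1, while the n array cells plus
   the variables n, i and sum give S(p) = C + (n + 3). */
int summation(int X[], int n)
{
    int sum = 0;
    int ops = 0;                      /* counts executions of the basic operation */
    for (int i = 0; i < n; i++) {
        sum = sum + X[i];             /* basic operation */
        ops++;
    }
    printf("basic operation executed %d times\n", ops);
    return sum;
}

int main(void)
{
    int X[] = { 3, 1, 4, 1, 5 };
    printf("sum = %d\n", summation(X, 5));   /* 5 executions, sum = 14 */
    return 0;
}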
3.2. List ADT
Abstract Data Type (ADT)
An abstract data type (ADT) is a set of data values and associated operations that are precisely specified independently of any particular implementation.

List ADT
       A list is a sequential data structure, i.e. a collection of items accessible one after another, beginning at the head and ending at the tail.

• It is a widely used data structure for applications which do not need random access.
• Additions and removals can be made at any position in the list.
• Lists are normally of the form a1, a2, a3, ..., an. The size of this list is n. The first element of the list is a1, and the last element is an. The position of element ai in the list is i.
• A list of size 0 is called a null list.

 Basic Operations on a List

 Creating a list
 Traversing the list
 Inserting an item in the list
 Deleting an item from the list
 Concatenating two lists into one 

 Implementation of List:
  A list can be implemented in two ways

1. Array list
2. Linked list

3.2.1. Array Implementation of List

  This implementation stores the list in an array.

• The position of each element is given by an index from 0 to n-1, where n is the number of elements.
• The element at a given index can be accessed in constant time, i.e. the time to access it does not depend on the size of the list.
• The time taken to add an element at the end of the list does not depend on the size of the list. But the time taken to add an element at any other point in the list depends on the size of the list, because the subsequent elements must be shifted to the next index value. So additions near the start of the list take longer than additions near the middle or end.
• Similarly, when an element is removed, subsequent elements must be shifted to the previous index value. So removals near the start of the list take longer than removals near the middle or end of the list.

Problems with Array implementation of lists

• Insertion and deletion are expensive. For example, inserting at position 0 (a new first element) requires first pushing the entire array down one spot to make room, whereas deleting the first element requires shifting all the elements in the list up one, so the worst case of these operations is O(n). A short sketch of this shifting is given at the end of this subsection.
• Even if the array is dynamically allocated, an estimate of the maximum size of the list is required. Usually this requires a high over-estimate, which wastes considerable space. This could be a serious limitation if there are many lists of unknown size.
• Simple arrays are generally not used to implement lists, because the running time for insertion and deletion is so slow and the list size must be known in advance.
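The following small C sketch illustrates the shifting described above; the fixed capacity, the global array named list and the function name insert_at are illustrative choices, not taken from the text.

#include <stdio.h>

#define CAPACITY 100                      /* illustrative maximum list size */

int list[CAPACITY];
int size = 0;                             /* current number of elements */

/* Insert x at index pos (0 <= pos <= size); returns 0 on failure. */
int insert_at( int pos, int x )
{
    if (size == CAPACITY || pos < 0 || pos > size)
        return 0;
    for (int i = size; i > pos; i--)      /* shift later elements right: O(n - pos) */
        list[i] = list[i - 1];
    list[pos] = x;
    size++;
    return 1;
}

int main( void )
{
    for (int i = 0; i < 5; i++)
        insert_at( i, i * 10 );           /* builds 0 10 20 30 40 (appends: no shifts) */
    insert_at( 0, 99 );                   /* worst case: shifts all five elements */
    for (int i = 0; i < size; i++)
        printf( "%d ", list[i] );         /* prints 99 0 10 20 30 40 */
    return 0;
}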

3.2.2. Linked list implementation

A Linked list is a chain of structs or records called Nodes. Each node has at least two members, one
of which points to the next Node in the list and the other holds the data. These are defined as Single Linked
Lists because they can only point to the next Node in the list but not to the previous.

The list can be made just as long as required. It does not waste memory space because successive elements are connected by pointers. The position of each element is given by an index from 0 to n-1, where n is the number of elements. The time taken to access an element with a given index depends on the index, because the list must be traversed from the beginning until the required index is found. Once the position is known, adding an element at that point does not depend on the size of the list, since no shifts are required. However, additions and deletions near the end of the list take longer than those near the start, because the list must be traversed until the required position is reached.
Types of Linked List

 Singly Linked List


 Doubly Linked List
 Circular Linked List

Operation on Linked List(List ADT)

1. is_last()
int is_last( position p, LIST L )
{
return( p->next == NULL );
}

This procedure is used to check whether the position p is the end of the linked list L.

2. is_empty()

int is_empty( LIST L )
{
return( L->next == NULL );
}

This is used to check whether the linked list is empty or not. If the header of the linked list points to NULL then it returns TRUE, otherwise it returns FALSE.

3.Insert()
void insert( element_type x, LIST L, position p )
{
position tmp_cell;
tmp_cell = (position) malloc( sizeof (struct node) );
tmp_cell->element = x;
tmp_cell->next = p->next;
p->next = tmp_cell;
}

This procedure is used to insert an element x at the position p in the linked list L. This is done by:

• Create a new node tmp_cell
• Assign the value x to the data field of tmp_cell
• Assign the address of the cell following p to the next field of tmp_cell
• Assign the address of tmp_cell to the next field of p

4. Delete()

void delete( element_type x, LIST L )


{
position p, tmp_cell;
p = find_previous( x, L );
if( p->next != NULL )
{
tmp_cell = p->next;
p->next = tmp_cell->next;
free( tmp_cell );
}}

This is used to delete an element x from the linked list L. This is done by:

• Locate the node before x in the linked list L using the find_previous() function
• Call the node containing x tmp_cell
• Assign the address of the cell following x to the next field of the previous cell
• Free the node tmp_cell

5.Find()

position find ( element_type x, LIST L )


{
position p;
p = L->next;
while( (p != NULL) && (p->element != x) )
p = p->next;
return p;
}

This is used to check whether the element x is present in the linked list L or not. The search starts from the beginning of the linked list; if the current node does not match, we move to the next node and continue until the end of the list. If a node containing x is found, its position is returned; otherwise the function returns NULL.
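The routines above rely on a node type with a header-based LIST and on a find_previous() helper (used by delete()) that the text does not show. The declarations below are a hedged sketch consistent with that code; element_type is taken to be int purely for illustration.

#include <stdlib.h>

typedef int element_type;                 /* illustrative element type */

struct node
{
    element_type element;
    struct node *next;
};
typedef struct node *position;
typedef struct node *LIST;                /* the list has a header node */

/* Return the node whose successor holds x, or the last node if x is absent;
   delete() then checks p->next before unlinking. */
position find_previous( element_type x, LIST L )
{
    position p = L;
    while (p->next != NULL && p->next->element != x)
        p = p->next;
    return p;
}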

3.2.3. Cursor based Linked List Implementation

Many languages, such as BASIC and FORTRAN, do not support pointers. If linked lists are required and
pointers are not available, then an alternate implementation must be used. The alternate method we will
describe is known as a cursor implementation.
The two important items present in a pointer implementation of linked lists are

1. The data is stored in a collection of structures. Each structure contains the data and a pointer to the next
structure.
2. A new structure can be obtained from the system's global memory by a call to malloc and released by a
call to free.

Figure : Example of a cursor implementation of linked lists
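The routines below refer to CURSOR_SPACE, cursor_alloc() and cursor_free(), which the text does not define. The declarations here are a hedged sketch of one possible layout: a global array of cells whose slot 0 heads a freelist, with index 0 playing the role of NULL; the array size is an arbitrary illustrative value.

#define SPACE_SIZE 100                    /* illustrative size of the cursor space */

typedef int element_type;
typedef unsigned int position;            /* a "pointer" is just an array index */
typedef unsigned int LIST;                /* index of the list's header cell */

struct cursor_node
{
    element_type element;
    position next;                        /* 0 plays the role of NULL */
};
struct cursor_node CURSOR_SPACE[SPACE_SIZE];

void initialize_cursor_space( void )
{
    for (position i = 0; i < SPACE_SIZE - 1; i++)
        CURSOR_SPACE[i].next = i + 1;     /* chain every cell onto the freelist */
    CURSOR_SPACE[SPACE_SIZE - 1].next = 0;
}

position cursor_alloc( void )             /* plays the role of malloc() */
{
    position p = CURSOR_SPACE[0].next;    /* first cell on the freelist */
    CURSOR_SPACE[0].next = CURSOR_SPACE[p].next;
    return p;                             /* 0 means "out of space" */
}

void cursor_free( position p )            /* plays the role of free() */
{
    CURSOR_SPACE[p].next = CURSOR_SPACE[0].next;
    CURSOR_SPACE[0].next = p;
}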

Operation on Linked List

1. is_last()
int is_last( position p, LIST L ) /* using a header node */
{
return( CURSOR_SPACE[p].next == 0 );
}

This procedure is used to check whether the position p is the end of the linked list L.

2. is_empty()

int is_empty( LIST L ) /* using a header node */
{
return( CURSOR_SPACE[L].next == 0 );
}

This is used to check whether the linked list is empty or not. If the next field of the header cell is 0 then it returns TRUE, otherwise it returns FALSE.

3.Insert()

void insert( element_type x, LIST L, position p )


{
position tmp_cell;
tmp_cell = cursor_alloc( );
if( tmp_cell ==0 )
fatal_error("Out of space!!!");
else
{
CURSOR_SPACE[tmp_cell].element = x;
CURSOR_SPACE[tmp_cell].next = CURSOR_SPACE[p].next;
CURSOR_SPACE[p].next = tmp_cell;
}
}

This procedure is used to insert an element x at the position p in the linked list L. This is done by:

• Allocate a new cell tmp_cell from the cursor space
• Assign the value x to the element field of tmp_cell
• Assign the index of the cell following p to the next field of tmp_cell
• Assign the index of tmp_cell to the next field of p

4. Delete()

void delete( element_type x, LIST L )


{
position p, tmp_cell;
p = find_previous( x, L );
if( !is_last( p, L) )
{
tmp_cell = CURSOR_SPACE[p].next;
CURSOR_SPACE[p].next = CURSOR_SPACE[tmp_cell].next;
cursor_free( tmp_cell );
}
}

This is used to delete an element x from the linked list L. This is done by:

• Locate the cell before x in the linked list L using the find_previous() function
• Call the cell containing x tmp_cell
• Assign the index of the cell following x to the next field of the previous cell
• Return tmp_cell to the cursor space using cursor_free()

5.Find()

position find( element_type x, LIST L) /* using a header node */


{
position p;
p = CURSOR_SPACE[L].next;
while( p && CURSOR_SPACE[p].element != x )
p = CURSOR_SPACE[p].next;
return p;
}

This is used to check whether the element x is present in the linked list L or not. The search starts from the beginning of the list; if the current cell does not match, we move to the next cell and continue until the end of the list. If a cell containing x is found, its index is returned; otherwise the function returns 0.

3.2.4. Double linked list implementation

A doubly linked list is a linked data structure that consists of a set of sequentially linked records called
nodes. Each node contains two fields, called links, that are references to the previous and to the next node in
the sequence of nodes.
Operations on double linked list

1. Insert()

void insert( element_type B, LIST L, position p )


{
position newnode, temp;
newnode = (position) malloc( sizeof (struct node) );
newnode->element = B;
temp = p->next;                /* old successor of p */
newnode->next = temp;
newnode->prev = p;
p->next = newnode;
if( temp != NULL )
temp->prev = newnode;
}
This procedure is used to insert an element B after the node p in the doubly linked list. This is done by:
• Create a new node and assign the value B to its data field.
• Set the new node's prev pointer to p and its next pointer to p's old successor.
• Update p->next and the old successor's prev pointer so that both point to the new node.
2.Delete()

void delete( element_type B, LIST L)


{
position p, temp;
p = find_previous( B, L );     /* node before the one holding B */
temp = p->next;                /* node to be removed */
p->next = temp->next;
if( temp->next != NULL )
temp->next->prev = p;
free( temp );
}
This is used to remove the node containing B from the doubly linked list. Removing a node from the middle requires that the preceding node skips over the node being removed, and that the following node skips over it in the reverse direction.

3.2.5. Circular Linked List

A linked list has a beginning node and an ending node: the beginning node is the node pointed to by a special pointer called head, and the end of the list is denoted by a special node which has a NULL pointer in its next field. If a problem requires operations to be performed on the nodes of a linked list and it is not important for the list to have special nodes indicating its front and rear, then the problem can be solved using a circular linked list. A singly linked circular list is a linked list where the last node in the list points to the first node in the list. A circular list does not contain NULL pointers; a traversal sketch is given below.
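As a small illustration of the "no NULL pointers" property, the sketch below visits every node of a singly linked circular list once; the node type and the function name are hypothetical.

#include <stdio.h>

struct cnode
{
    int element;
    struct cnode *next;                  /* in a circular list this is never NULL */
};

void print_circular( struct cnode *start )
{
    struct cnode *p = start;
    if (p == NULL)
        return;                          /* empty list */
    do {
        printf( "%d ", p->element );
        p = p->next;
    } while (p != start);                /* stop when the walk returns to the start */
    printf( "\n" );
}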

3.3. Stack ADT

A stack is a list with the restriction that insertions and deletions can be performed in only one position, namely the end of the list, called the top. The fundamental operations on a stack are push, which is equivalent to an insert, and pop, which deletes the most recently inserted element. A stack is also called a Last In First Out (LIFO) list.
Implementation of Stacks

 Linked List Implementation of Stacks


 Array Implementation of Stacks

3.3.1. Linked List Implementation of Stacks

A stack can be implemented using a linked list. All the insertion (push) and deletion (pop) operations are performed in LIFO (Last In First Out) manner. Here, elements can be inserted up to the limit of the available storage space. The operations that can be performed on the stack are push(), pop() and top().

(i) push():

This is the function for insertion (pushing) of an element onto the stack. It is similar to the insertion of an element at the beginning of a singly linked list. The steps to be followed are:
o Get the element to be inserted
o Create a new cell to store the new element using malloc()
o If the linked list is empty, then insert the new element as the first cell of the linked list and set the next pointer of the new cell to NULL
o Otherwise, insert the new cell at the front of the linked list by rearranging the pointers

void push( element_type x, STACK S )


{
node_ptr tmp_cell;
tmp_cell = (node_ptr) malloc( sizeof ( struct node ) );
tmp_cell->element = x;
tmp_cell->next = S->next;
S->next = tmp_cell;
}

 (ii) pop()

This is the function for deletion (popping) of an element from the stack. It is similar to the deletion of an element at the beginning of a singly linked list. Before deleting the element, check whether the stack (linked list) is empty or not. If it is empty then report an error, otherwise delete the first element from the linked list by rearranging the pointers.

void pop( STACK S )


{
node_ptr first_cell;
if( is_empty( S ) )
error("Empty stack");
else
{
first_cell = S->next;
S->next = S->next->next;
free( first_cell );
}
}

(iii) top()
This function is used to return the top element of the stack, which is the most recently inserted element.
element_type top( STACK S )
{
if( is_empty( S ) )
error("Empty stack");
else
return S->next->element;
}
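The push(), pop() and top() routines above assume the declarations of node_ptr, STACK, is_empty() and error(), which the text does not show. The following is a hedged sketch consistent with that code: the stack is a singly linked list with a header node whose next pointer refers to the top element, and element_type is taken to be int for illustration.

#include <stdio.h>
#include <stdlib.h>

typedef int element_type;                /* illustrative element type */

struct node
{
    element_type element;
    struct node *next;
};
typedef struct node *node_ptr;
typedef struct node *STACK;              /* header node; S->next is the top */

int is_empty( STACK S ) { return S->next == NULL; }

STACK create_stack( void )
{
    STACK S = (STACK) malloc( sizeof( struct node ) );
    S->next = NULL;                      /* header with no elements below it */
    return S;
}

void error( const char *msg )            /* assumed helper: report and stop */
{
    fprintf( stderr, "%s\n", msg );
    exit( 1 );
}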
3.3.2. Array Implementation of Stacks
A stack can also be implemented using an array. All the insertion (push) and deletion (pop) operations are performed in LIFO (Last In First Out) manner. Here, elements can be inserted only up to the size of the array. The operations that can be performed on the stack are push(), pop() and top().

push()
This is used to push an element x onto the stack using an array. This is done by incrementing the top pointer by 1 and assigning the value x to the new top position.

void push( element_type x, STACK S )


{
if( is_full( S ) )
error("Full stack");
else
S->stack_array[ ++S->top_of_stack ] = x;
}
pop()

This is used to delete (pop) an element from the stack. First check whether the stack is empty or not. If it is empty then report an error, otherwise decrement the top pointer by 1.

void pop( STACK S )


{
if( is_empty( S ) )
error("Empty stack");
else
S->top_of_stack--;
}
top()
This function is used to return the top element of the stack, which is the most recently inserted element.
element_type top( STACK S )
{
if( is_empty( S ) )
error("Empty stack");
else
return S->stack_array[ S->top_of_stack ];
}
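The push(), pop() and top() routines above assume a STACK record and the helpers is_empty() and is_full(), which the text does not declare. The following is a hedged sketch consistent with that code; the names stack_size and EMPTY_TOS and the choice of int elements are assumptions.

#define EMPTY_TOS (-1)                   /* top_of_stack value of an empty stack */

typedef int element_type;

struct stack_record
{
    unsigned int stack_size;             /* capacity of stack_array */
    int top_of_stack;                    /* index of the current top element */
    element_type *stack_array;
};
typedef struct stack_record *STACK;

int is_empty( STACK S ) { return S->top_of_stack == EMPTY_TOS; }
int is_full( STACK S )  { return S->top_of_stack == (int) S->stack_size - 1; }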

3.4.Queue ADT

A queue is a linear data structure in which elements are inserted at the rear (enqueue) and deleted from the front (dequeue). It is also called First In First Out (FIFO): the first inserted element is the first to be deleted.

Array Implementation of Queue

A queue can be implemented using an array. Elements can be inserted up to the size of the array. All the insertion (enqueue) and deletion (dequeue) operations are performed in FIFO (First In First Out) manner. Here two pointers are used: the rear pointer is used for insertion and the front pointer is used for deletion.

(i) enqueue()
void enqueue( element_type x, QUEUE Q )
{
if( is_full( Q ) )
error("Full queue");
else
{
Q->q_size++;
Q->q_rear = succ( Q->q_rear, Q );
Q->q_array[ Q->q_rear ] = x;
}
}

This procedure is used to insert an element into the queue using an array. Before inserting any element, first check whether the queue is full or not. If it is full, then display an error. Otherwise identify the successor of the rear position using the succ() function and insert the element at that rear position of the queue.

(ii) dequeue()

This is used to delete an element from the queue. This is done by taking the element at the front position and then moving the front pointer to its successor, so the first inserted element is deleted first. A sketch of this routine, written to mirror enqueue() above, is given below.
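The text describes dequeue() but does not list its code. The sketch below mirrors the enqueue() routine above; the QUEUE record, the q_max_size field, the helper succ() and the int element type are assumptions consistent with that code rather than definitions taken from the text.

typedef int element_type;                 /* illustrative element type */

void error( const char *msg );            /* assumed helper, as in the routines above */

struct queue_record
{
    unsigned int q_max_size;              /* assumed capacity of q_array */
    unsigned int q_front;
    unsigned int q_rear;
    unsigned int q_size;                  /* current number of elements */
    element_type *q_array;
};
typedef struct queue_record *QUEUE;

int is_empty( QUEUE Q ) { return Q->q_size == 0; }
int is_full( QUEUE Q )  { return Q->q_size == Q->q_max_size; }

unsigned int succ( unsigned int value, QUEUE Q )
{
    if (++value == Q->q_max_size)         /* wrap around: the array is used circularly */
        value = 0;
    return value;
}

element_type dequeue( QUEUE Q )
{
    element_type x;
    if (is_empty( Q ))
        error( "Empty queue" );
    Q->q_size--;
    x = Q->q_array[ Q->q_front ];         /* the first inserted element leaves first */
    Q->q_front = succ( Q->q_front, Q );
    return x;
}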
3.5.Applications of stack

 Infix to postfix conversion


 Evaluating postfix expression
 Function calls

(i) Infix to postfix conversion

There are three different ways in which an expression like a + b can be represented:
• Prefix (Polish): +ab
• Postfix (Suffix or reverse Polish): ab+
• Infix: a+b
Note that an infix expression can have parentheses, but postfix and prefix expressions are parenthesis-free expressions.
Conversion from infix to postfix
Suppose this is the infix expression:
((A + (B - C) * D) ^ E + F)
To convert it to postfix, we add an extra special value ] at the end of the infix string and push [ onto the stack:
((A + (B - C) * D) ^ E + F)]
We move from left to right in the infix expression. Whenever we reach an operand, we add it to the output. Opening parentheses are pushed onto the stack. If we reach a ) symbol, we pop elements off the stack and add them to the output until we reach the corresponding ( symbol. If we reach an operator, we pop off any operators on the stack which are of higher precedence than this one and then push this operator onto the stack.
As an example:
As an example

Expression                      Stack      Output
----------------------------------------------------------------
((A + (B - C) * D) ^ E + F)]    [
((A + (B - C) * D) ^ E + F)]    [((        A
((A + (B - C) * D) ^ E + F)]    [((+       A
((A + (B - C) * D) ^ E + F)]    [((+(-     ABC
((A + (B - C) * D) ^ E + F)]    [(         ABC-D*+
((A + (B - C) * D) ^ E + F)]    [(         ABC-D*+E^F+
((A + (B - C) * D) ^ E + F)]    [          ABC-D*+E^F+
Is there a way to find out if the converted postfix expression is valid or not? Yes. We need to associate a rank with each symbol of the expression. The rank of an operator is -1 and the rank of an operand is +1. The total rank of an expression can be determined as follows:
- If an operand is placed in the postfix expression, increment the rank by 1.
- If an operator is placed in the postfix expression, decrement the rank by 1.
At any point of time while converting an infix expression to a postfix expression, the rank of the expression must be greater than or equal to one. If the rank is ever less than one, the expression is invalid. Once the entire expression is converted, the rank must be equal to 1; otherwise the expression is invalid.
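The conversion just described can be sketched compactly in C as below. It handles single-character operands, the operators + - * / ^ and parentheses; the fixed stack size, the precedence table and the treatment of ^ as right-associative are illustrative simplifications rather than part of the original statement.

#include <stdio.h>
#include <ctype.h>

/* Precedence of the supported operators; '(' gets 0 so it is never popped here. */
static int prec( char op )
{
    switch (op) {
        case '^':           return 3;
        case '*': case '/': return 2;
        case '+': case '-': return 1;
        default:            return 0;
    }
}

void infix_to_postfix( const char *infix, char *postfix )
{
    char stack[100];
    int top = -1, k = 0;
    for (int i = 0; infix[i] != '\0'; i++) {
        char c = infix[i];
        if (isalnum( (unsigned char) c )) {
            postfix[k++] = c;                          /* operands go straight to output */
        } else if (c == '(') {
            stack[++top] = c;
        } else if (c == ')') {
            while (top >= 0 && stack[top] != '(')      /* pop back to the matching '(' */
                postfix[k++] = stack[top--];
            if (top >= 0) top--;                       /* discard the '(' itself */
        } else if (prec( c ) > 0) {
            /* pop operators of higher precedence; an equal-precedence operator is
               also popped unless the new operator is '^' (right-associative) */
            while (top >= 0 && stack[top] != '(' &&
                   (prec( stack[top] ) > prec( c ) ||
                    (prec( stack[top] ) == prec( c ) && c != '^')))
                postfix[k++] = stack[top--];
            stack[++top] = c;
        }                                              /* blanks are skipped */
    }
    while (top >= 0)                                   /* flush the remaining operators */
        postfix[k++] = stack[top--];
    postfix[k] = '\0';
}

int main( void )
{
    char out[100];
    infix_to_postfix( "((A + (B - C) * D) ^ E + F)", out );
    printf( "%s\n", out );                             /* prints ABC-D*+E^F+ */
    return 0;
}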
(ii) Evaluating postfix expression

Here is the pseudocode. As we scan from left to right:

* If we encounter an operand, push it onto the stack.
* If we encounter an operator, pop two operands from the stack. The first one popped is called operand2 and the second one is called operand1.
* Perform result = operand1 operator operand2.
* Push the result onto the stack.
* Repeat the above steps till the end of the input.

Example:

For instance, the postfix expression


6523+8*+3+*

is evaluated as follows: the first four symbols (6, 5, 2 and 3) are operands, so they are pushed onto the stack.

Next a '+' is read, so 3 and 2 are popped from the stack and their sum, 5, is pushed.
Next 8 is pushed.

Now a '*' is seen, so 8 and 5 are popped as 8 * 5 = 40 is pushed.

Next a '+' is seen, so 40 and 5 are popped and 40 + 5 = 45 is pushed.

Now, 3 is pushed.

Next '+' pops 3 and 45 and pushes 45 + 3 = 48.

Finally, a '*' is seen and 48 and 6 are popped; the result 6 * 48 = 288 is pushed.
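The trace above can be reproduced with the short C sketch below. Single-digit operands and the operators + - * are assumed, which is enough for this worked example; the fixed stack size is also an illustrative choice.

#include <stdio.h>
#include <ctype.h>

int eval_postfix( const char *expr )
{
    int stack[100], top = -1;
    for (int i = 0; expr[i] != '\0'; i++) {
        char c = expr[i];
        if (isdigit( (unsigned char) c )) {
            stack[++top] = c - '0';               /* operand: push its value */
        } else {
            int operand2 = stack[top--];          /* first pop is operand2 */
            int operand1 = stack[top--];          /* second pop is operand1 */
            int result = 0;
            switch (c) {
                case '+': result = operand1 + operand2; break;
                case '-': result = operand1 - operand2; break;
                case '*': result = operand1 * operand2; break;
            }
            stack[++top] = result;                /* push the result back */
        }
    }
    return stack[top];                            /* final value left on the stack */
}

int main( void )
{
    printf( "%d\n", eval_postfix( "6523+8*+3+*" ) );   /* prints 288 */
    return 0;
}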

(iii) Function Calls

The algorithm to check balanced symbols suggests a way to implement function calls. The problem here is
that when a call is made to a new function, all the variables local to the calling routine need to be saved by
the system, since otherwise the new function will overwrite the calling routine's variables. Furthermore, the
current location in the routine must be saved so that the new function knows where to go after it is done. The
variables have generally been assigned by the compiler to machine registers, and there are certain to be
conflicts (usually all procedures get some variables assigned to register #1), especially if recursion is
involved.
Applications of Queue
 Printers
 Job Scheduling
 Ticket Reservation ….

3.6. Binary Heaps


Heaps (occasionally called as partially ordered trees) are a very popular data structure for implementing
priority queues.

 A heap is either a min-heap or a max-heap. A min-heap supports the insert and deletemin operations
while a max-heap supports the insert and deletemax operations.
 Heaps could be binary or d-ary. Binary heaps are special forms of binary trees while d-ary heaps are a
special class of general trees.
DEF. A binary heap is a complete binary tree with elements from a partially ordered set, such that the element at every node is less than (or equal to) the element at its left child and the element at its right child.

• Since a heap is a complete binary tree, the elements can be conveniently stored in an array. If an element is at position i in the array, then the left child will be in position 2i and the right child will be in position 2i + 1. By the same token, a non-root element at position i will have its parent at position ⌊i/2⌋.
• Because of its structure, a heap with height k will have between 2^k and 2^(k+1) - 1 elements. Therefore a heap with n elements will have height ⌊log2 n⌋.
• Because of the heap property, the minimum element will always be present at the root of the heap. Thus the findmin operation has worst-case O(1) running time.

Property:
1. Structure property -> A heap is a complete binary tree: every level is completely filled, with the possible exception of the bottom level, which is filled from left to right.
2. Heap order property -> In a min heap, every parent node is smaller than (or equal to) each of its children.

Implementation of Insert and Deletemin


Insert ()

To insert an element, say x, into a heap with n elements, we first create a hole in position (n+1) and see if the heap property would be violated by putting x into the hole. If the heap property is not violated, then we have found the correct position for x. Otherwise, we "push up" or "percolate up" x until the heap property is restored. To do this, we slide the element that is in the hole's parent node into the hole, thus bubbling the hole up toward the root. We continue this process until x can be placed in the hole. See the figure for an example.

The worst-case complexity of insert is O(h), where h is the height of the heap. Thus insertions are O(log n), where n is the number of elements in the heap.
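The insert and delete_min routines below operate on a PRIORITY_QUEUE type that the text does not declare. The following is a hedged sketch consistent with that code; the capacity field, the MIN_DATA sentinel stored in elements[0] (which is what lets the percolate-up loop in insert stop at the root) and the int element type are assumptions.

typedef int element_type;                 /* illustrative element type */
#define MIN_DATA (-32767)                 /* sentinel smaller than every real key */

struct heap_struct
{
    unsigned int capacity;                /* assumed maximum number of elements */
    unsigned int size;                    /* current number of elements */
    element_type *elements;               /* elements[1..size] hold the heap;
                                             elements[0] holds MIN_DATA */
};
typedef struct heap_struct *PRIORITY_QUEUE;

int is_empty( PRIORITY_QUEUE H ) { return H->size == 0; }
int is_full( PRIORITY_QUEUE H )  { return H->size == H->capacity; }

/* Array layout of the complete binary tree: the node at index i has its
   left child at 2*i, its right child at 2*i + 1, and its parent at i/2. */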

void insert( element_type x, PRIORITY_QUEUE H )


{
unsigned int i;
if( is_full( H ) )
error("Priority queue is full");
else
{
i = ++H->size;
while( H->elements[i/2] > x )
{
H->elements[i] = H->elements[i/2];
i /= 2;
}
H->elements[i] = x;
}
}

Deletemin()

When the minimum is deleted, a hole is created at the root level. Since the heap now has one less element and the heap is a complete binary tree, the element in the last position has to be relocated. We do this by first placing the last element in the hole created at the root. This may leave the heap property violated at the root level. We now "push down" or "percolate down" the hole at the root until the violation of the heap property stops. While pushing down the hole, it is important to slide it down to the smaller of its two children (pushing the latter up); this is done so as not to create another violation of the heap property. See the figure. It is easy to see that the worst-case running time of deletemin is O(log n), where n is the number of elements in the heap.
element_type delete_min( PRIORITY_QUEUE H )
{
unsigned int i, child;
element_type min_element, last_element;
if( is_empty( H ) )
{
error("Priority queue is empty");
return H->elements[0];
}
min_element = H->elements[1];
last_element = H->elements[H->size--];
for( i=1; i*2 <= H->size; i=child )
{
child = i*2;
if( ( child != H->size ) && ( H->elements[child+1] < H->elements [child] ) )
child++;
if( last_element > H->elements[child] )
H->elements[i] = H->elements[child];
else
break;
}
H->elements[i] = last_element;
return min_element;
}
3.7. Hashing
Hashing is a method of storing data in an array so that sorting, searching, inserting and deleting data is fast. For this, every record needs a unique key. The basic idea is not to search for the correct position of a record with comparisons, but to compute the position within the array. The function that returns the position is called the 'hash function' and the array is called a 'hash table'.
Types of hashing
 Separate chaining

 Open addressing

o Linear probing

o Quadratic probing

o Double hashing

Separate chaining

Separate chaining hashing uses an array as the primary hash table, except that the array is an array of lists
of entries, each list initially being empty. When an entry is inserted, it is inserted at the end of the list at the
index corresponding to the hash code of the key in question. Searching the hash table now involves walking
down the list at the given index though, with good design, it should be a relatively short list.
In the strategy known as separate chaining, direct chaining, or simply chaining, each slot of the bucket array
is a pointer to a linked list that contains the key-value pairs that hashed to the same location. Lookup requires
scanning the list for an entry with the given key. Insertion requires adding a new entry record to either end of
the list belonging to the hashed slot. Deletion requires searching the list and removing the element.

Chained hash tables with linked lists are popular because they require only basic data structures with simple
algorithms, and can use simple hash functions that are unsuitable for other methods.

The cost of a table operation is that of scanning the entries of the selected bucket for the desired key. If the distribution
of keys is sufficiently uniform, the average cost of a lookup depends only on the average number of keys per bucket—
that is, on the load factor.

For separate-chaining, the worst-case scenario is when all entries were inserted into the same bucket, in
which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter
is a linear list, the lookup procedure may have to scan all its entries; so the worst-case cost is proportional to
the number n of entries in the table.

position find( element_type key, HASH_TABLE H )


{
position p;
LIST L;
L = H->the_lists[ hash( key, H->table_size) ];
p = L->next;
while( (p != NULL) && (p->element != key) )
p = p->next;
return p;
}

void insert( element_type key, HASH_TABLE H )


{
position pos, new_cell;
LIST L;
pos = find( key, H );
if( pos == NULL )
{
new_cell = (position) malloc(sizeof(struct list_node));
L = H->the_lists[ hash( key, H->table_size ) ];
new_cell->next = L->next;
new_cell->element = key; /* Probably need strcpy!! */
L->next = new_cell;
}
}
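The find() and insert() routines above assume a chained HASH_TABLE type and a hash() function, which the text does not declare. The following is a hedged sketch consistent with that code: each slot of the_lists is the header node of a singly linked list, element_type is taken to be int, and the simple modular hash() is only one possible choice.

#include <stdlib.h>

typedef int element_type;                    /* illustrative element type */

struct list_node
{
    element_type element;
    struct list_node *next;
};
typedef struct list_node *position;
typedef struct list_node *LIST;              /* header node of one chain */

struct hash_tbl
{
    unsigned int table_size;
    LIST *the_lists;                         /* array of chain headers */
};
typedef struct hash_tbl *HASH_TABLE;

/* A simple modular hash function of the kind the code above assumes. */
unsigned int hash( element_type key, unsigned int table_size )
{
    return ((unsigned int) key) % table_size;
}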

Open addressing

In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new
entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some
probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in
the same sequence, until either the target record is found, or an unused array slot is found, which indicates
that there is no such key in the table. [9] The name "open addressing" refers to the fact that the location
("address") of the item is not determined by its hash value.

Well-known probe sequences include:

 Linear probing, in which the interval between probes is fixed (usually 1)


 Quadratic probing, in which the interval between probes is increased by adding the successive
outputs of a quadratic polynomial to the starting value given by the original hash computation
 Double hashing, in which the interval between probes is computed by another hash function

A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the
number of slots in the bucket array. In fact, even with good hash functions, their performance dramatically
degrades when the load factor grows beyond 0.7 or so. Thus a more aggressive resize scheme is needed.
Separate chaining works correctly with any load factor, although performance is likely to be reasonable only if it is kept below 2 or so. For many applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.

Open addressing schemes also put more stringent requirements on the hash function: besides distributing the
keys more uniformly over the buckets, the function must also minimize the clustering of hash values that are
consecutive in the probe order. Using separate chaining, the only concern is that too many objects map to the
same hash value; whether they are adjacent or nearby is completely irrelevant.
(i) Linear probing

The need for a rehash function arises when a collision occurs. This happens when two or more data items would land in the same cell allocated in the hash table; thus, rehashing is needed. The hash function for storing data into the allocated memory is computed by taking the remainder of the data's numerical equivalent divided by the number of spaces allocated. The rehashing function is then the method for finding the second, third, and subsequent locations for the data. One rehashing technique is linear probing, where rehashing is done by looking for the next empty space that the item can occupy. The rehash function is the following:

rehash(key) = (n + 1) % k;

where n is the key that collided and k is the number of allocated spaces. This method works in such a way that if the first location is not free, it goes to the next location and checks whether that location is free, and so on, until it finds a free location or none at all. For convenience, an empty space is given the value -1, while a deleted item's space is given the value -2. In this way, finding an empty space is easy, and the search for a stored item is also easier.

To test this algorithm, consider the following example.

Suppose we have a hash table that can accommodate 9 items, and the data to be stored are integers. To insert 27, we use hash(key) = 27 % 9 = 0. Therefore, 27 is stored at index 0. If another input 18 arrives, and we know that 18 % 9 = 0, then a collision occurs. In this event rehashing is needed. Using linear probing, we have rehash(key) = (18 + 1) % 9 = 1. Since index 1 is empty, 18 can be stored in it.

To retrieve data, the hash function and the rehash function are also used. Using the example above, retrieving 18 is done by using the hash function to find the home index and checking whether the data there matches the data needed. If not, rehashing is applied, until either the correct location is found or an empty space (one whose value is -1) is encountered, which means that the data does not exist. This is correct because the search follows the same path that was used when storing the data, so if an empty space is encountered, the data cannot be present.

The use of the -2 value for deleted items is useful because, when traversing the hash table, encountering a deleted cell does not end the traversal. A short probing sketch is given below.
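The linear-probing scheme just described can be sketched in C as follows. The table size of 9 and the integer keys match the example above, and the EMPTY / DELETED markers follow the -1 / -2 convention from the text; the function names are illustrative.

#include <stdio.h>

#define K 9                                   /* table size used in the example */
#define EMPTY   (-1)
#define DELETED (-2)

void init( int table[] )
{
    for (int i = 0; i < K; i++)
        table[i] = EMPTY;
}

int insert( int table[], int key )
{
    int home = key % K;                       /* hash(key) = key % K */
    for (int i = 0; i < K; i++) {
        int j = (home + i) % K;               /* linear probe: try the next slot */
        if (table[j] == EMPTY || table[j] == DELETED) {
            table[j] = key;
            return j;                         /* index where key was stored */
        }
    }
    return -1;                                /* table is full */
}

int find( int table[], int key )
{
    int home = key % K;
    for (int i = 0; i < K; i++) {
        int j = (home + i) % K;
        if (table[j] == EMPTY) return -1;     /* empty slot: key is not present */
        if (table[j] == key)   return j;      /* deleted (-2) slots are skipped  */
    }
    return -1;
}

int main( void )
{
    int t[K];
    init( t );
    insert( t, 27 );                          /* 27 % 9 = 0, stored at index 0    */
    insert( t, 18 );                          /* collides at 0, stored at index 1 */
    printf( "%d\n", find( t, 18 ) );          /* prints 1 */
    return 0;
}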
(ii) Quadratic probing

This is a method of accessing data in nearly O(1) time when the data must be administered dynamically (mostly by inserting new or deleting existing data). The only data structure which guarantees O(1) access is the array, but access is O(1) only if the index of the element is known; otherwise searching algorithms must be used (like binary search), which have a complexity of O(log n).

Each data item to be stored is associated with a key, to which a hash function is applied; the resulting hash value is used as an index to store the object in one of a number of "hash buckets" in a hash table. So for each object, the index at which it is stored in the hash table can be calculated. Sometimes more than one object has the same hash value; in that case a collision resolution procedure must determine the position for the new object. With the collision resolution strategy "quadratic probing", starting from the collision point, free spaces are searched for using a quadratic step. The hash values probed for an object are calculated as follows: h, h+1, h+4, h+9, ... .

(iii) Double hashing

Like all other forms of open addressing, double hashing becomes linear as the hash table approaches maximum capacity. The only solution to this is to rehash to a larger size. Double hashing uses the idea of applying a second hash function to the key when a collision occurs. The result of the second hash function is the number of positions from the point of collision at which to insert.

There are a couple of requirements for the second function:

 it must never evaluate to 0


 must make sure that all cells can be probed

On top of that, it is possible for the secondary hash function to evaluate to zero. For example, if we choose k = 5 with the function hash2(key) = key % 5, then for any key that is a multiple of 5 the step size is 0 and the resulting probe sequence remains at the initial hash value. One possible solution is to change the secondary hash function to hash2(key) = 5 - (key % 5).

This ensures that the secondary hash function will always be non-zero.
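As a small sketch, the probe step produced by such a secondary hash function might look as follows; R = 5 matches the k = 5 example above, and in general R is chosen as a prime smaller than the table size.

#define R 5                               /* prime smaller than the table size */

/* Secondary hash function: its value is always in the range 1..R, never zero. */
unsigned int hash2( unsigned int key )
{
    return R - (key % R);
}

/* The i-th probe for a key whose primary hash is h then examines slot
   (h + i * hash2(key)) % table_size. */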
Rehashing

Once the hash table gets too full, the running time for operations will start to take too long and may fail. To
solve this problem, a table at least twice the size of the original will be built and the elements will be
transferred to the new table. The new size of the hash table:

 should also be prime


 will be used to calculate the new insertion spot (hence the name rehashing)

This is a very expensive operation! O(N) since there are N elements to rehash and the table size is roughly
2N. This is ok though since it doesn't happen that often.

Rehashing is applied when:

 once the table becomes half full


 once an insertion fails
 once a specific load factor has been reached, where load factor is the ratio of the number of elements
in the hash table to the table size
Deletion from a Hash Table
It is very difficult to delete an item from the hash table that uses rehashes for search and insertion.
Suppose that a record r is placed at some specific location. We want to insert some other record r1 on
the same location. We will have to insert the record in the next empty location to the specified original
location. Suppose that the record r which was there at the specified location is deleted.

Now, we want to search the record r1, as the location with record r is now empty, it will erroneously
conclude that the record r1 is absent from the table. One possible solution to this problem is that the
deleted record must be marked "deleted" rather than "empty" and the search must continue whenever a
"deleted" position is encountered. But this is possible only when there are small numbers of deletions
otherwise an unsuccessful search will have to search the entire table since most of the positions will be
marked "deleted" rather than "empty".
