DSA with Java/Unit- 7

Unit -9: Hashing


Introduction to searching techniques
 Searching is the process of finding a given value (the search key) in a collection of items
and deciding whether it is present or not.
 A search is successful if the element being searched for is found, and unsuccessful
otherwise.
 Some of the standard searching techniques used with data structures are:
 Linear Search or Sequential Search
 Binary Search
Sequential Search
 Sequential search starts at the beginning of the list and checks each element in turn. It
is a basic and simple searching algorithm.
 Sequential search compares the search key with the elements of the list one by one. If an
element matches, it returns that element's index; otherwise it returns -1.

The above figure shows how sequential search works. It examines the elements of the array in
order until the desired value is found or the end of the array is reached. If we search for the
element 25, it goes through the array step by step, in sequence.
Pseudocode
searchValue(data, key)
{
    for (i = 0; i < data.length; i++)
    {
        if (data[i] == key)
        {
            return i;    // key found at index i
        }
    }
    return -1;           // key is not present in data
}

Collected by Bipin Timalsina

 It is suitable for small lists whose elements are unsorted.
 It has a time complexity of O(n), which means the running time grows linearly with the
number of elements.
 It has a very simple implementation.
Binary Search
 Binary search is a fast and efficient searching technique. It requires the list to be in
sorted order.
 In this method, the search key is compared with the element at the center of the list. If
they match, the search is successful. Otherwise, the list is divided into two halves: the
first half, from the 0th element up to the element before the center, whose values are less
than or equal to the center element, and the second half, from the element after the center
to the last element, whose values are greater than or equal to the center element.
 The search then proceeds in one of the two halves, depending on whether the target element
is greater or smaller than the central element. If the target is smaller than the central
element, searching continues in the first half; otherwise it continues in the second half.
The implementation follows these steps:
 Start with the middle element:
 If the target value is equal to the middle element of the array, then return the index
of the middle element.
 If not, then compare the middle element with the target value:
 If the target value is greater than the middle element, then pick the elements
to the right of the middle index, and start again from Step 1.
 If the target value is less than the middle element, then pick the elements
to the left of the middle index, and start again from Step 1.
 When a match is found, return the index of the element matched.
 If no match is found, then return -1.

Pseudocode
binarySearch(data, n, key){
    low = 0
    high = n − 1
    while(low ≤ high){
        middle = floor((low + high) / 2)
        if (data[middle] < key)
            low = middle + 1
        else if (data[middle] > key)
            high = middle − 1
        else
            return middle
    }
    return -1
}
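
The pseudocode above can be written in Java roughly as follows. This is a minimal sketch: the class name BinarySearchSketch and the sample array are assumptions made for illustration, and the array length is used in place of the parameter n.

```java
// Iterative binary search over a sorted int array.
final class BinarySearchSketch {

    // Returns the index of key in data, or -1 if key is not present.
    static int binarySearch(int[] data, int key) {
        int low = 0;
        int high = data.length - 1;
        while (low <= high) {
            // Same value as floor((low + high) / 2), but cannot overflow int
            int middle = low + (high - low) / 2;
            if (data[middle] < key) {
                low = middle + 1;       // key can only be in the right half
            } else if (data[middle] > key) {
                high = middle - 1;      // key can only be in the left half
            } else {
                return middle;          // match found
            }
        }
        return -1;                      // search range became empty
    }

    public static void main(String[] args) {
        int[] sorted = {3, 8, 15, 25, 31, 42, 57};   // sample data (assumed)
        System.out.println(binarySearch(sorted, 25));
        System.out.println(binarySearch(sorted, 26));
    }
}
```

Each pass of the loop halves the remaining range, which is where the O(log n) running time comes from.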

Introduction to Hashing
There are several searching techniques, such as linear search, binary search, and search trees.
In these techniques, the time taken to search for a particular element depends on the total
number of elements.
 Linear search takes O(n) time to search an unsorted array of n elements.
 Binary search takes O(log n) time to search a sorted array of n elements.
 A search in a balanced binary search tree of n elements takes O(log n) time.
The main drawback of these techniques is:
 As the number of elements increases, the time taken to perform the search also increases.
 This becomes problematic when the total number of elements becomes very large.

Hashing is a well-known technique for searching for a particular element among many elements.
It minimizes the number of comparisons performed during the search.
Hashing is designed to solve the problem of efficiently finding or storing an item in a
collection.
Unlike other searching techniques,
 Hashing is extremely efficient.
 With a good hash function and few collisions, the time taken to perform the search does
not depend on the total number of elements.
 On average, it completes the search in constant time, O(1); collisions can make individual
searches slower.
 Constant time O(1) means the operation does not depend on the size of the data.
 Hashing is used with databases to enable items to be retrieved more quickly.
 Hashing is the process of mapping data to a representative integer value using a hash
function. It is a technique to convert a range of key values into a range of indexes of an
array. It gives faster access than linear or binary search because the position of an item
is computed directly from its key.

Hashing Mechanism
 In hashing, an array data structure called a hash table is used to store the data items.
 Data items are inserted into the hash table based on their hash key value.
 The hash key value is a special value that serves as an index for a data item.
 It indicates where the data item should be stored in the hash table.
 The hash key value is generated using a hash function.

Hash Functions
 A function h that transforms a particular key K, be it a string, number, record, or the
like, into an index in the table used for storing items of the same type as K is called a
hash function.
 If a hash function h transforms different keys into different numbers, it is called a perfect
hash function.
 In other words, a hash function converts a key to a hash key value.
 It takes a key and maps it to a value of a certain length, which is called a hash value or
hash.
 The input to a hash function can be of variable length, while the output is of fixed length.
 A hash function takes the data item as input and returns a small integer value as output.
This small integer is called the hash value. The hash value of the data item is then used
as the index for storing it in the hash table.
The properties of a good hash function are-
 It should be efficiently computable.
 It should minimize the number of collisions.
 It should distribute the keys uniformly over the table.
There are various types of hash functions available such as-
 Division Hash Function
 Folding Hash Function
 Mid-Square Hash Function, etc.

 Hash Table uses an array as a storage medium and uses hash technique to generate an
index where an element is to be inserted or is to be located from.

Hash Table

Division
 In the division hashing method, a key K is divided by the size of the table and the
remainder is used as the index into the hash table.
h(K) = K mod TSize
 A hash function must guarantee that the number it returns is a valid index to one of the
table cells. The simplest way to accomplish this is to use division modulo TSize, where
TSize = sizeof(table).
 It is best if TSize is a prime number; otherwise, h(K) = (K mod p) mod TSize for some
prime p > TSize can be used.
 The division method is usually the preferred choice for the hash function if very little is
known about the keys.
Example
Suppose we have the integer data items {26, 70, 18, 31, 54, 93} and the size of the hash table is 10.
Data Item    Hash value = Key % SizeOfTable
26           26 % 10 = 6
70           70 % 10 = 0
18           18 % 10 = 8
31           31 % 10 = 1
54           54 % 10 = 4
93           93 % 10 = 3

The data items are stored in the hash table as follows:

 After computing the hash values, we can insert each item into the hash table at the
designated position as shown in the above figure.
 In the hash table, 6 of the 10 slots are occupied. The fraction of occupied slots is called
the load factor and is denoted by λ = number of items / table size. In this example,
λ = 6/10 = 0.6.
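
The division example can be reproduced with a few lines of Java. This is only a sketch: the class and method names are assumptions for illustration, and the table builder assumes (as in this example) that no two keys collide.

```java
// Division hashing for the example keys, plus the resulting load factor.
final class DivisionHashSketch {

    // h(K) = K mod TSize; Math.floorMod keeps the index non-negative even for negative keys
    static int hash(int key, int tSize) {
        return Math.floorMod(key, tSize);
    }

    // Stores each key at its hash index. Assumes the keys do not collide.
    static Integer[] buildTable(int[] keys, int tSize) {
        Integer[] table = new Integer[tSize];   // null marks an empty slot
        for (int key : keys) {
            table[hash(key, tSize)] = key;
        }
        return table;
    }

    // Load factor = number of occupied slots / table size
    static double loadFactor(Integer[] table) {
        int used = 0;
        for (Integer slot : table) {
            if (slot != null) {
                used++;
            }
        }
        return (double) used / table.length;
    }

    public static void main(String[] args) {
        Integer[] table = buildTable(new int[] {26, 70, 18, 31, 54, 93}, 10);
        System.out.println(java.util.Arrays.toString(table));
        System.out.println("load factor = " + loadFactor(table));
    }
}
```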

Folding
 In the folding method, the key is divided into several parts. These parts are combined, or
folded together, using some simple operation and are often transformed in a certain way to
create the target address (the hash key).
 There are two types of folding:
 shift folding and
 boundary folding
 In both versions, the key is usually divided into equal parts of some fixed size, plus some
remainder, and the parts are then added.
Shift Folding
The key is divided into several parts, and these parts are then combined using a simple operation
such as addition. In shift folding, the parts are put underneath one another and then processed.
Example: The SSN 123-45-6789 can be divided into three parts, 123, 456, and 789, which are then
combined using a simple operation such as addition. The resulting number is then divided modulo
TSize.
   123
   456
 + 789
  1368 mod TSize
NOTE: The division can be done in many different ways. Another possibility is to divide the same
number 123-45-6789 into five parts (say, 12, 34, 56, 78, and 9), add them, and divide the result
modulo TSize.
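
Shift folding can be sketched in Java as follows. The class name, the method signature, and the table size of 1000 used in main are assumptions for illustration; the key is passed as a string of digits with the dashes removed.

```java
// Shift folding: cut the key into fixed-size parts, add them, reduce modulo the table size.
final class ShiftFoldingSketch {

    // digits: the key as a digit string; partSize: digits per part (the last part may be shorter)
    static int shiftFold(String digits, int partSize, int tSize) {
        int sum = 0;
        for (int start = 0; start < digits.length(); start += partSize) {
            int end = Math.min(start + partSize, digits.length());
            sum += Integer.parseInt(digits.substring(start, end));
        }
        return sum % tSize;
    }

    public static void main(String[] args) {
        // 123 + 456 + 789 = 1368, reduced modulo an assumed TSize of 1000
        System.out.println(shiftFold("123456789", 3, 1000));
        // 12 + 34 + 56 + 78 + 9 = 189
        System.out.println(shiftFold("123456789", 2, 1000));
    }
}
```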
Boundary Folding
The key is divided into several parts, and these parts are then combined using a simple operation
such as addition. In boundary folding, alternate parts are flipped (reversed) at the boundary
before being added.
Example: Consider the same three parts of the SSN: 123, 456, and 789. The first part, 123, is taken
in the same order; then the piece of paper with the second part is folded underneath it so that 123
is aligned with 654, which is the second part, 456, in reverse order. When the folding continues,
789 is aligned with the two previous parts.
   123
   654
 + 789
  1566 mod TSize
Or, reversing the first and third parts instead:
   321
   456
 + 987
  1764 mod TSize
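
A possible Java sketch of boundary folding is shown below. The class name, the reverseFirst flag (which selects between the two variants of the example), and the table size used in main are assumptions for illustration.

```java
// Boundary folding: like shift folding, but every other part is reversed before adding.
final class BoundaryFoldingSketch {

    // reverseFirst = false reverses the 2nd, 4th, ... parts; true reverses the 1st, 3rd, ... parts
    static int boundaryFold(String digits, int partSize, int tSize, boolean reverseFirst) {
        int sum = 0;
        boolean reverse = reverseFirst;
        for (int start = 0; start < digits.length(); start += partSize) {
            int end = Math.min(start + partSize, digits.length());
            String part = digits.substring(start, end);
            if (reverse) {
                part = new StringBuilder(part).reverse().toString();
            }
            sum += Integer.parseInt(part);
            reverse = !reverse;   // alternate between plain and flipped parts
        }
        return sum % tSize;
    }

    public static void main(String[] args) {
        // 123 + 654 + 789 = 1566 and 321 + 456 + 987 = 1764, with an assumed TSize of 10000
        System.out.println(boundaryFold("123456789", 3, 10000, false));
        System.out.println(boundaryFold("123456789", 3, 10000, true));
    }
}
```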

Mid-Square Function
 In the mid-square method, the key is squared and the middle part of the result is used as the
address (the hash value).
 In a mid-square hash function, the entire key participates in generating the address, so there
is a better chance that different addresses are generated for different keys.
Example:
k = 3121 in a hash table of size 1000.
$3121^2 = 9740641$, and the middle part of this result is 406, so h(3121) = 406.
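
The mid-square computation can be sketched in Java as below. The class name and the choice of extracting the middle digits from the decimal string of the square are assumptions for illustration (the square must have at least as many digits as are extracted); other ways of picking the middle part are possible.

```java
// Mid-square hashing: square the key and keep the middle digits of the square.
final class MidSquareSketch {

    // Returns the middle `digits` decimal digits of key * key.
    static int midSquare(int key, int digits) {
        String square = Long.toString((long) key * key);   // long avoids int overflow
        int start = (square.length() - digits) / 2;        // leans left when centering is uneven
        return Integer.parseInt(square.substring(start, start + digits));
    }

    public static void main(String[] args) {
        // 3121 * 3121 = 9740641; three middle digits for a table of size 1000
        System.out.println(midSquare(3121, 3));
    }
}
```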
Extraction
 In the extraction method, only a part of the key is used to compute the address (the hash
value).
 For the social security number 123-45-6789, this method might use
 the first four digits, 1234;
 the last four, 6789;
 the first two combined with the last two, 1289;
 or some other combination.
 The starting digits of an ISBN are the same for all books from one publisher, so they should
be excluded if the hash table is for only one publisher.
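
The three extraction choices listed above can be sketched in Java as follows. The class and method names are assumptions for illustration, and the key is passed as a digit string without dashes.

```java
// Extraction: use only selected digits of the key as the hash value.
final class ExtractionSketch {

    static int firstFour(String digits) {
        return Integer.parseInt(digits.substring(0, 4));
    }

    static int lastFour(String digits) {
        return Integer.parseInt(digits.substring(digits.length() - 4));
    }

    static int firstTwoLastTwo(String digits) {
        String first = digits.substring(0, 2);
        String last = digits.substring(digits.length() - 2);
        return Integer.parseInt(first + last);
    }

    public static void main(String[] args) {
        String ssn = "123456789";
        System.out.println(firstFour(ssn));
        System.out.println(lastFour(ssn));
        System.out.println(firstTwoLastTwo(ssn));
    }
}
```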

Radix transformation
 The key is translated into another number base to compute the hash value.
Example: The key 345 (decimal) is 423 in base 9, so h(345) = 423 mod TSize.
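
A short Java sketch of the radix transformation is given below. The class name and the table sizes used in main are assumptions; the sketch assumes a non-negative key and a radix of at most 10, so that the converted digits can be read back as a decimal number.

```java
// Radix transformation: rewrite the key in another base, then reduce modulo the table size.
final class RadixHashSketch {

    static int radixHash(int key, int radix, int tSize) {
        String digits = Integer.toString(key, radix);   // e.g. 345 in base 9 is "423"
        return Integer.parseInt(digits) % tSize;        // treat those digits as a decimal number
    }

    public static void main(String[] args) {
        System.out.println(radixHash(345, 9, 1000));   // 423 mod 1000
        System.out.println(radixHash(345, 9, 100));    // 423 mod 100
    }
}
```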

COLLISION RESOLUTION
 When a hash function returns the same hash value for more than one key, it is called a
collision.
 Collisions are problematic in hashing, so they should be avoided as far as possible.
 There are many strategies that attempt to avoid hashing multiple keys to the same location.
 Some of the collision resolution techniques are:
o Open Addressing
o Chaining
o Bucket Addressing
A collision occurs when two or more keys compete for a single hash table slot. In the simplest
resolution scheme, the colliding key is assigned the next available empty slot.
Open Addressing
 In the open addressing method, when a key collides with another key, the collision is
resolved by finding an available table entry other than the position (address) to which the
colliding key is originally hashed.
 If position h(K) is occupied, then the positions in the probing sequence
norm(h(K) + p(1)), norm(h(K) + p(2)), . . . , norm(h(K) + p(i)), . . . are tried until either an
available cell is found, or the same positions are tried repeatedly, or the table is full.
 Function p is a probing function, i is the probe number, and norm is a normalization
function, most likely division modulo the size of the table.

The Open Addressing methods are:
 Linear Probing
 Quadratic Probing
 Double Hashing
 The simplest method is linear probing, for which p(i) = i, and for the ith probe, the position
to be tried is (h(K) + i) mod TSize.
o In linear probing, the position in which a key can be stored is found by sequentially
searching all positions, starting from the position calculated by the hash function,
until an empty cell is found. If the end of the table is reached and no empty cell has
been found, the search continues from the beginning of the table and stops, in the
extreme case, in the cell preceding the one from which the search started.
o The main advantage of linear probing is that the next position is easy to calculate.
o Linear probing, however, has a tendency to create clusters in the table: many
consecutive occupied cells form groups, and it then takes longer to search for an
element or to find an empty cell.
Disadvantages of linear probing
→ The main problem is clustering.
→ As clusters grow, it can take many probes to find an empty slot.
Example:
→ Figure 1 contains an example where a key Ki is hashed to the position i.
→ In Figure 1a, three keys (A5, A2, and A3) have been hashed to their home positions.
→ Then B5 arrives (Figure 1b), whose home position is occupied by A5. Because the next
position is available, B5 is stored there.
→ Next, A9 is stored with no problem, but B2 is stored in position 4, two positions from
its home address.
→ A large cluster has already been formed.
→ Next, B9 arrives. Position 9 is not available, and because it is the last cell of the table,
the search starts from the beginning of the table, whose first slot can now host B9.
→ The next key, C2, ends up in position 7, five positions from its home address.

Figure 1: Resolving collisions with the linear probing method. Subscripts indicate the
home positions of the keys being hashed.
 In quadratic probing, the probe position is calculated using a quadratic function of the
probe number. This is usually a better option than linear probing because it reduces clustering.
$p(i) = h(K) + (-1)^{i-1}\left((i + 1)/2\right)^{2}$ for $i = 1, 2, \ldots, TSize - 1$, where
$(i + 1)/2$ is an integer division and the result is taken modulo TSize.
 In a simpler variant, when a collision occurs, we probe the slot that is $i^2$ positions from
the home position in the ith iteration, and this probing is performed until an empty slot is found.
 Although quadratic probing gives much better results than linear probing, the problem of
cluster buildup is not avoided altogether, because for keys hashed to the same location, the
same probe sequence is used. Such clusters are called secondary clusters.
These secondary clusters, however, are less harmful than primary clusters.
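
The quadratic probe formula can be traced with a small Java sketch. The class name, the home position 3, and the table size 7 are assumptions chosen for illustration; (i + 1) / 2 is evaluated with integer division and the result is normalized modulo the table size.

```java
// Quadratic probing: positions h(K) + 1, h(K) - 1, h(K) + 4, h(K) - 4, ... modulo TSize.
final class QuadraticProbeSketch {

    // Position tried on the i-th probe (i >= 1) for a key whose home position is `home`.
    static int probe(int home, int i, int tSize) {
        int step = (i + 1) / 2;                              // integer division: 1, 1, 2, 2, 3, 3, ...
        int offset = (i % 2 == 1 ? 1 : -1) * step * step;    // (-1)^(i-1) * step^2
        return Math.floorMod(home + offset, tSize);          // floorMod keeps the index non-negative
    }

    public static void main(String[] args) {
        int home = 3;    // assumed home position h(K)
        int tSize = 7;   // assumed table size
        for (int i = 1; i < tSize; i++) {
            System.out.println("probe " + i + " -> position " + probe(home, i, tSize));
        }
    }
}
```

For these assumed values the six probes visit positions 4, 2, 0, 6, 5, and 1, that is, every cell other than the home position.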

Figure 2: Using quadratic probing for collision resolution.

 The problem of secondary clustering is best addressed with double hashing. This method
utilizes two hash functions: one for accessing the primary position of a key, h, and a second
function, hp, for resolving conflicts. The probing sequence becomes
h(K), h(K) + hp(K), . . . , h(K) + i · hp(K), . . . (all divided modulo TSize).
o The table size should be a prime number so that each position in the table can be
included in the sequence.
o In double hashing,
 We use a second hash function hp(K) and probe the slot i * hp(K) positions
from h(K) in the ith iteration.
 It requires more computation time, as two hash functions need to be
computed.
The hash functions for this technique are:
h(K) = K mod TSize
hp(K) = P - (K mod P)
where P is a prime number smaller than the size of the hash table.

Example: Suppose we have to insert 67, 90, 55, 17, and 49 into a hash table of size 10.
 67, 90, and 55 can be inserted into the hash table using the first hash function (positions
7, 0, and 5).
 For 17, however, h(17) = 7 and that slot is already occupied by 67, so we have to use the
second hash function hp(K) = P - (K mod P).
 When P = 7 (which is less than the table size, 10):
hp(17) = 7 - (17 mod 7) = 7 - 3 = 4
Therefore p(1) = [h(17) + 1 * hp(17)] mod 10
= (7 + (1 * 4)) % 10 = 1, so the key 17 is inserted at position 1.
 49 hashes to position 9, which is free.

0 90
1 17
2
3
4
5 55
6
7 67
8
9 49
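
The double hashing example above can be checked with the following Java sketch. The class name and the use of -1 as an "empty" marker are assumptions for illustration; the sketch assumes non-negative keys.

```java
// Double hashing with h(K) = K mod TSize and hp(K) = P - (K mod P).
final class DoubleHashingSketch {

    static final int EMPTY = -1;   // sentinel for a free slot; assumes keys are non-negative

    static int h(int key, int tSize) {
        return key % tSize;
    }

    static int hp(int key, int p) {
        return p - (key % p);
    }

    // Inserts key and returns the slot used, or -1 if no free slot is found in TSize probes.
    static int insert(int[] table, int key, int p) {
        int tSize = table.length;
        for (int i = 0; i < tSize; i++) {
            int pos = (h(key, tSize) + i * hp(key, p)) % tSize;
            if (table[pos] == EMPTY) {
                table[pos] = key;
                return pos;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] table = new int[10];
        java.util.Arrays.fill(table, EMPTY);
        for (int key : new int[] {67, 90, 55, 17, 49}) {
            System.out.println(key + " -> slot " + insert(table, key, 7));
        }
    }
}
```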

Question: Using the hash function ‘key mod 7’, insert the following sequence of keys in the hash
table-
50, 700, 76, 85, 92, 73 and 101
Use linear probing technique for collision resolution.
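
One way to check an answer to this question is to run the insertions in Java. The sketch below is an illustration only: the class name and the -1 "empty" marker are assumptions, and keys are assumed to be non-negative.

```java
// Linear probing with h(K) = K mod TSize.
final class LinearProbingSketch {

    static final int EMPTY = -1;   // sentinel for a free slot; assumes keys are non-negative

    // Inserts key at the first free slot at or after its home position, wrapping around.
    // Returns the slot used, or -1 if the table is full.
    static int insert(int[] table, int key) {
        int tSize = table.length;
        int home = key % tSize;
        for (int i = 0; i < tSize; i++) {
            int pos = (home + i) % tSize;
            if (table[pos] == EMPTY) {
                table[pos] = key;
                return pos;
            }
        }
        return -1;
    }

    // Builds a table of the given size by inserting the keys in order.
    static int[] build(int[] keys, int tSize) {
        int[] table = new int[tSize];
        java.util.Arrays.fill(table, EMPTY);
        for (int key : keys) {
            insert(table, key);
        }
        return table;
    }

    public static void main(String[] args) {
        int[] table = build(new int[] {50, 700, 76, 85, 92, 73, 101}, 7);
        System.out.println(java.util.Arrays.toString(table));
    }
}
```

Tracing by hand: 50, 700, and 76 go to their home positions 1, 0, and 6; 85 and 92 both hash to 1 and move on to 2 and 3; 73 and 101 both hash to 3 and end up in 4 and 5.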

[Figures (A) to (H): step-by-step states of the hash table]


Chaining
 Keys do not have to be stored in the table itself.
 In chaining, each position of the table is associated with a linked list, or chain, of
structures whose info fields store keys or references to keys.
 This method is called separate chaining, and a table of references (pointers) is called a
scatter table.
 In this method, the table can never overflow, because the linked lists are extended only
upon the arrival of new keys.
 To handle a collision,
 This technique attaches a linked list to the slot in which the collision occurs.
 The new key is then inserted into the linked list.
 These linked lists attached to the slots look like chains.
 That is why this technique is called separate chaining.
Example:

Figure 3: In chaining, colliding keys are put on the same linked list.

 For short linked lists, this is a very fast method, but increasing the length of these lists can
significantly degrade retrieval performance. Performance can be improved by maintaining an
order on all these lists, so that an exhaustive search is not required for most unsuccessful
searches, or by using self-organizing linked lists.
 This method requires additional space for maintaining references. The table stores only
references, and each node requires one reference field.

Question: Using the hash function ‘key mod 7’, insert the following sequence of keys in the hash
table-
50, 700, 76, 85, 92, 73 and 101
Use separate chaining technique for collision resolution
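
A possible way to check an answer to this question is the Java sketch below. The class and method names are assumptions for illustration, and new keys are appended at the end of a chain, which is one of several valid choices (inserting at the head is also common).

```java
// Separate chaining with h(K) = K mod TSize; each table slot refers to a linked list of keys.
final class ChainingSketch {

    // One node of a chain: the key plus a reference to the next node.
    static final class Node {
        final int key;
        Node next;

        Node(int key) {
            this.key = key;
        }
    }

    final Node[] table;

    ChainingSketch(int tSize) {
        table = new Node[tSize];   // every slot starts as null (empty chain)
    }

    // Appends key to the end of the chain at its home position (assumes non-negative keys).
    void insert(int key) {
        int pos = key % table.length;
        Node node = new Node(key);
        if (table[pos] == null) {
            table[pos] = node;
            return;
        }
        Node tail = table[pos];
        while (tail.next != null) {
            tail = tail.next;
        }
        tail.next = node;
    }

    // Renders one chain, e.g. "50 -> 85 -> 92"; returns an empty string for an empty chain.
    String chain(int pos) {
        StringBuilder sb = new StringBuilder();
        for (Node n = table[pos]; n != null; n = n.next) {
            if (sb.length() > 0) {
                sb.append(" -> ");
            }
            sb.append(n.key);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ChainingSketch t = new ChainingSketch(7);
        for (int key : new int[] {50, 700, 76, 85, 92, 73, 101}) {
            t.insert(key);
        }
        for (int pos = 0; pos < 7; pos++) {
            System.out.println(pos + ": " + t.chain(pos));
        }
    }
}
```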

[Figures (A) to (H): step-by-step states of the hash table and its chains]

Coalesced hashing (or coalesced chaining)


 A version of chaining called coalesced hashing (or coalesced chaining) combines linear
probing with chaining. In this method, the first available position is found for a key
colliding with another key, and the index of this position is stored with the key already in
the table. In this way, a sequential search down the table can be avoided by directly
accessing the next element on the linked list. Each position pos of the table includes two
fields: an info field for a key and a next field with the index of the next key that is hashed
to pos.
 An overflow area known as a cellar can be allocated to store keys for which there is no
room in the table.
 Figure 4 illustrates an example where coalesced hashing puts a colliding key in the last
position of the table. In Figure 4a, no collision occurs. In Figure 4b, B5 is put in the last
cell of the table, which is found occupied by A9 when it arrives. Hence, A9 is attached to
the list accessible from position 9. In Figure 4c, two new colliding keys are added to the
corresponding lists.

Figure 4: Coalesced hashing puts a colliding key in the last available position of the table.

 Figure 5 illustrates coalesced hashing that uses a cellar. Noncolliding keys are stored in
their home positions, as in Figure 5a. Colliding keys are put in the last available slot of the
cellar and added to the list starting from their home position, as in Figure 5b. In Figure 5c,
the cellar is full, so an available cell is taken from the table when C2 arrives.

Figure 5: Coalesced hashing that uses a cellar

Bucket Addressing
 Another solution to the collision problem is to store colliding elements in the same position
in the table. This can be achieved by associating a bucket with each address.
 A bucket is a block of space large enough to store multiple items.
 By using buckets, the problem of collisions is not totally avoided. If a bucket is already
full, then an item hashed to it has to be stored somewhere else.
 By incorporating the open addressing approach, the colliding item can be stored in the
next bucket if it has an available slot when using linear probing, as illustrated in Figure 6,
or it can be stored in some other bucket when, say, quadratic probing is used.
 The colliding items can also be stored in an overflow area. In this case, each bucket includes
a field that indicates whether the search should be continued in this area or not. It can be
simply a yes/no marker. In conjunction with chaining, this marker can be the number
indicating the position in which the beginning of the linked list associated with this bucket
can be found in the overflow area (see Figure 7).

Figure 6: Collision resolution with buckets and linear probing method.

Figure 7: Collision resolution with buckets and overflow area.

Deletion
 With a chaining method, deleting an element leads to the deletion of a node from a linked
list holding the element.
 For other methods, a deletion operation may require a more careful treatment of collision
resolution, except for the rare occurrence when a perfect hash function is used.
Example:
 Consider the table in Figure 8a in which the keys are stored using linear probing.
 The keys have been entered in the following order: A1, A4, A2, B4, B1.
 After A4 is deleted and position 4 is freed (Figure 8b), we try to find B4 by first checking
position 4.
 But this position is now empty, so we may conclude that B4 is not in the table.
 The same result occurs after deleting A2 and marking cell 2 as empty (Figure 8c). Then,
the search for B1 is unsuccessful, because if we are using linear probing, the search
terminates at position 2.
 The situation is the same for the other open addressing methods.

Figure 8: Linear search in the situation where both insertion and deletion of keys are permitted

 If we leave deleted keys in the table with markers indicating that they are not valid elements
of the table, any subsequent search for an element does not terminate prematurely.

 When a new key is inserted, it overwrites a key that is only a space filler. However, for a
large number of deletions and a small number of additional insertions, the table becomes
overloaded with deleted records, which increases the search time because the open
addressing methods require testing the deleted elements.
 Therefore, the table should be purged after a certain number of deletions by moving
undeleted elements to the cells occupied by deleted elements. Cells with deleted elements
that are not overwritten by this procedure are marked as free. Figure 8d illustrates this
situation.
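
The marker idea described above can be sketched in Java for a table that uses linear probing. The class name, the sentinel values -1 and -2, the table size 10, and the integer keys 1, 4, 2, 14, 11 (standing in for A1, A4, A2, B4, B1) are assumptions for illustration. The sketch assumes distinct, non-negative keys and leaves out the purging step.

```java
// Linear probing with "deleted" markers so that searches do not stop at removed keys.
final class LazyDeletionSketch {

    static final int EMPTY = -1;     // cell that has never held a key
    static final int DELETED = -2;   // marker left behind by a deletion

    final int[] table;

    LazyDeletionSketch(int tSize) {
        table = new int[tSize];
        java.util.Arrays.fill(table, EMPTY);
    }

    // Returns the position of key, or -1 if absent. Only a truly empty cell ends the search.
    int find(int key) {
        int tSize = table.length;
        for (int i = 0; i < tSize; i++) {
            int pos = (key % tSize + i) % tSize;
            if (table[pos] == EMPTY) {
                return -1;
            }
            if (table[pos] == key) {
                return pos;
            }
        }
        return -1;
    }

    // Stores key in the first empty or deleted cell of its probe sequence.
    boolean insert(int key) {
        int tSize = table.length;
        for (int i = 0; i < tSize; i++) {
            int pos = (key % tSize + i) % tSize;
            if (table[pos] == EMPTY || table[pos] == DELETED) {
                table[pos] = key;   // a deleted cell is only a space filler and may be reused
                return true;
            }
        }
        return false;               // table is full
    }

    // Replaces key with a DELETED marker instead of freeing the cell.
    boolean delete(int key) {
        int pos = find(key);
        if (pos < 0) {
            return false;
        }
        table[pos] = DELETED;
        return true;
    }

    public static void main(String[] args) {
        LazyDeletionSketch t = new LazyDeletionSketch(10);
        for (int key : new int[] {1, 4, 2, 14, 11}) {
            t.insert(key);
        }
        t.delete(4);
        System.out.println(t.find(14));   // still found: the marker at position 4 is skipped
    }
}
```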

Case Study: Hashing with Buckets


(follow the text book)
