DS Unit-5

Prepared by D.
HIMAGIRI
UNIT - V
----------------------------------------------------------------------------------------------------------
Hashing-Hash table, Hash table representations, hash functions, collision resolution techniques-
separate chaining, open addressing-linear probing, quadratic probing, double hashing, Re
hashing, Extendible hashing,
Pattern matching : Introduction, Brute force, the Boyer –Moore algorithm, Knuth-Morris-Pratt
algorithm.
----------------------------------------------------------------------------------------------------------
Hashing: Hashing is a technique to convert a range of key values into a range of indexes of an
array. Hashing techniques is implemented using hash function and hash table.
Hash Table: It is data structure which
contains index and associated data.
Access of data becomes very fast
if we know the index of the desired data.
Hash Function: It is a function which is
used to map the key to a hash value.
It is represented as h(x).
Collision: If the same index or hash value
is produced by the hash function for
multiple keys then, conflict arises.
1
This situation is called collision.
Prepared by D.HIMAGIRI
Hashing…
Types of hash functions:
Division method
It is the most simple method of hashing an integer x. This method divides x by m and then
uses the remainder obtained as hash value. In this case, the hash function can be given as
h(x) = x mod m.
It requires only a single division operation, therefore this method works very fast.
Example:
calculate the hash values of keys 1234 and 5462,where m=97.
h(1234) = 1234 % 97 = 70 , h(5642) = 5642 % 97 = 16
Multiplication method
The steps involved in the multiplication method are as follows:
Step 1: choose a constant a such that 0 < a < 1.
Step 2: multiply the key k by a.
Step 3: extract the fractional part of ka.
Step 4: multiply the result of step 3 by the size of hash table (m).
Hence, the hash function can be given as:
h(k) = m (ka mod 1) where, (ka mod 1) gives the fractional part of ka and m is the total
number of indices in the hash table.
2
Hash Function Types…
Example:
Given a hash table of size 1000, map the key 12345 to an appropriate location in the hash
table.
we will use a = 0.618033, m = 1000, and k = 12345
h(12345) = 1000 (12345 * 0.618033 mod 1)
= 1000 (7629.617385 mod 1)
= 1000 (0.617385)
= 617.385
= 617
Mid Square Method:

The mid-square method is a good hash function which works in two steps:
Step 1: square the value of the key. That is, find k2.
Step 2: extract the middle r digits of the result obtained in step 1.
In the mid-square method, the same r digits must be chosen from all the keys. Therefore, the
hash function can be given as:
H(k) = s ,Where s is obtained by selecting r digits from k2.
.
3
Hash Function Types…
Example:
calculate the hash value for keys 1234 and 5642 using the mid-square method.
The hash table has 100 memory locations.
Note that the hash table has 100 memory locations whose indices vary from 0 to 99.
This means that only two digits are needed to map the key to a location in the hash table, so
r = 2.
When k = 1234, k2 = 1522756, h (1234) = 27
When k = 5642, k2 = 31832164, h (5642) = 21
Observe that the 3rd and 4th digits starting from the right are chosen.
Folding method :
The folding method works in the following two steps:
Step 1: divide the key value into a number of parts. That is, divide k into parts k1 , k2 , ...,
kn , where ,each part has the same number of digits except the last part which may have
lesser digits than the other parts.
Step 2: add the individual parts. That is, obtain the sum of k1+k2+k3+......kn .This hash
value produced by ignoring the last carry, if any.
4
Hash FunctionTypes…
Example:
Given a hash table of 100 locations, calculate the hash value using folding method for
keys 5678, 321, and 34567.
Since there are 100 memory locations to address, we will break the key into parts where
each part (except the last) will contain two digits. The hash values can be obtained as shown
below:
5
Collision Resolution Techniques
The collision resolution techniques are :
1. Separate chaining
2. Open addressing
Separate chaining:
In this technique, if a hash function produces the same index for multiple elements, these
elements are stored in the same index by using a linked list.
 if no element is hashed to a particular index then it will contain NULL.
6
Collision Resolution Techniques…
Insert the keys 7, 24, 18, 52, 36, 54, 11, and 23 in a chained hash table of 9 memory locations.
Use hash function h(k) = k mod m.
In this case, m=9.
Insert 7
Insert 24 Insert 18
7
Insert 52 Insert 36
Insert 54 Insert 11
8
Insert 23
9
Open Addressing:
In this technique, the hash table contains two types of values: sentinel values (e.g., –1) and
data values. The presence of a sentinel value indicates that the location contains no data
value at present but can be used to hold a value.
When a key is mapped to a particular memory location, then the value it holds is checked.
If it contains a sentinel value, then the location is free and the data value can be stored in
it.
if the location already has some data value stored in it, then other slots are examined
systematically in the forward direction to find a free slot. If even a single free location is
not found, then we have an overflow condition.
The process of examining memory locations in the hash table is called probing.
Open addressing technique can be implemented using:
1. Linear probing
2. Quadratic probing
3. Double hashing
4. Rehashing.
10
Linear probing:
The simplest approach to resolve a collision is linear probing. In this technique, if a value is
already stored at a location generated by h(k), then the following hash function is used to resolve
the collision:
H(k, i) = [h¢(k) + i] mod m
Where m is the size of the hash table, h¢(k) = (k mod m), and i is the probe number that varies
from 0 to m–1.
Example: Consider a hash table of size 10. Using linear probing, insert the keys 72, 27, 36, 24,
63, 81, 92 into the table.
Let h¢(k) = k mod m, m = 10 ,Initially, the hash table can be given as:
11
Linear Probing…
Let h¢(k) = k mod m, m = 10
12
Linear Probing…
13
Linear Probing…
14
Linear Probing…
15
Linear Probing…
Advantages:
Easy to compute.
Disadvantages:
Table must be big enough to get a free cell.
Time to get a free cell may be quite large.
Primary Clustering: Any key that hashes into the cluster will require several attempts to
resolve the collision.
16
Quadratic probing
Quadratic probing:
In this technique, if a value is already stored at a location generated by h(k), then the following
hash function is used to resolve the collision:
H(k, i) = [h¢(k) + c1i + c2i2] mod m
Where,
m is the size of the hash table,
h¢(k) = (k mod m),
i is the probe number that varies from 0 to M–1, and
c1 and c2 are constants.
Quadratic probing eliminates the primary clustering phenomenon of linear probing because
instead of doing a linear search, it does a quadratic search.
Example: Consider a hash table of size 10. Using quadratic probing, insert the keys 72, 27, 36,
24, 63, 81, and 101 into the table. Take c = 1 and c = 3.
17
Quadratic probing…
18
19
20
21
Advantage:
Eliminates Primary Clustering problem.
Disadvantage:
Secondary Clustering: Elements that hash to the same position will probe the same alternative
cells.
Double Hashing:
In double hashing, we use two hash functions rather than a single function. The hash function
in the case of double hashing can be given as:
H(k, i) = [h1(k) + ih2(k)] mod m
Where m is the size of the hash table, h (k) and h (k) are two hash functions given as h1 (k) = k
mod m and h2 (k) = k mod m', i is the probe number that varies from 0 to m–1, and m' is
chosen to be less than m. We can choose m' = m–1 or m–2.
Advantage:
Double hashing minimizes repeated collisions and the effects of clustering. That is, double
hashing is free from problems associated with primary clustering as well as secondary
clustering.
Disadvantage:
Implementation is difficult as it uses two hash functions. 22
Double Hashing…
Example : consider a hash table of size = 10. Using double hashing, insert the keys 72, 27, 36,
24, 63, 81, 92 into the table. Take h1 = (k mod 10) and h2 = (k mod 8).
23
Double Hashing…
24
Double Hashing…
25
Double Hashing…
26
Rehashing
Rehashing :
When the hash table becomes nearly full, the number of collisions increases, thereby degrading
the performance of insertion and search operations. In such cases, a better option is to create a
new hash table with size double of the original hash table.
All the entries in the original hash table will then have to be moved to the new hash table. This
is done by taking each entry, computing its new hash value, and then inserting it in the new
hash table. Though rehashing seems to be a simple process, it is quite expensive and must
therefore not be done frequently.
Example : Consider the hash table of size 5 given below. The hash function used is h(x)= x % 5.
Rehash the entries into to a new hash table.
27
Extendible Hashing
Extendible Hashing :
It is a dynamic hashing method in which directories and buckets are used to hash data.
 It is an aggressively flexible method in which the hash function also experiences dynamic
changes.
Terminology used in Extendible hashing:
Directories: The directories store addresses of
the buckets in pointers. An id is assigned to each
directory which may change each time when
Directory Expansion takes place.
Buckets: The buckets are used to hash the actual
data.
Global Depth: It is associated with the Directories.
They denote the number of bits which are used by
the hash function to categorize the keys.
Global Depth = Number of bits in directory id.
Local Depth: It is the same as that of Global Depth except for the fact that Local Depth is
associated with the buckets and not the directories. Local depth in accordance with the global
depth is used to decide the action that to be performed in case an overflow occurs. Local Depth is
28
always less than or equal to the Global Depth.
Extendible Hashing...
Bucket Splitting: When the number of elements in a bucket exceeds a particular size, then the
bucket is split into two parts.
Directory Expansion: Directory Expansion Takes place when a bucket overflows. Directory
Expansion is performed when the local depth of the overflowing bucket is equal to the global
depth.
Basic Working of Extendible Hashing:
Data Hash Function f(h)

(converted into Returns the
Directory
binary)
location
Directory
Points to buckets
Data
traverse to
bucket
Bucket
Data is stored /Hashed here
29
Tackling Over Flow Condition during Data Insertion:
Many times, while inserting data in the buckets, it might happen that the Bucket overflows. In
such cases, we need to follow an appropriate procedure to avoid mishandling of data.
First, Check if the local depth is less than or equal to the global depth. Then choose one of the
cases below.
Case1: If the local depth of the overflowing Bucket is equal to the global depth, then
Directory Expansion, as well as Bucket Split, needs to be performed. Then increment the
global depth and the local depth value by 1. And, assign appropriate pointers.
Directory expansion will double the number of directories present in the hash structure.
Case2: In case the local depth is less than the global depth, then only Bucket Split takes
place. Then increment only the local depth value by 1. And, assign appropriate pointers.
Over Flow Occurs
Local Depth< Global Local Depth=Global

Depth Depth
Performs Bucket Split &

Performs Bucket Split
Directory Expansion 30
Example :Hash the following elements: 16,4,6,22,24,10,31,7,9,20,26 using Extendible hashing
where bucket size is 3 and Hash Function: Suppose the global depth is X ,then the Hash Function
returns X LSBs.
Initially, the global-depth and local-depth is always 1.
Key Binary
Thus, the hashing frame looks like this:
Representation
16 10000
4 00100
6 00110
22 10110
Insert 16: 24 11000
The binary format of 16 is 10000 and global-depth is 1.
10 01010
The hash function returns 1 LSB of 10000 which is 0.
Hence, 16 is mapped to the directory with id=0. 31 11111
7 00111
9 01001
20 10100
26 01101
31
Insert 4:
The binary format of 4 is 00100 and global-depth is 1. Key Binary
The hash function returns 1 LSB of 00100 which is 0. Representation
Hence, 4 is mapped to the directory with id=0.
16 10000
4 00100
6 00110
22 10110
24 11000
10 01010
Insert 6: 31 11111
Hash(6)=00110
7 00111
9 01001
20 10100
26 01101
32
Insert 22
Hash(22)=10110, The bucket pointed by directory 0 is already full. Hence, Over Flow occurs.
Since Local Depth = Global Depth, the bucket splits and directory expansion takes place. Also,
rehashing of numbers present in the overflowing bucket takes place after the split. And, since the
global depth is incremented by 1, now,the global depth is 2. Hence, 16,4,6,22 are now rehashed
w.r.t 2 LSBs.[ 16(10000),4(100),6(110),22(10110) ]
33
Inserting 24 and 10: 24(11000) and 10 (1010) can be hashed based on directories with id 00
and 10
Inserting 31,7,9: All of these elements[ 31(11111), 7(111), 9(1001) ] have either 01 or 11 in
their LSBs. Hence, they are mapped on the bucket pointed out by 01 and 11.
34
Inserting 20: Insertion of data element
20 (10100) will again cause the overflow
problem.
20 is inserted in bucket pointed out by 00.

since the local depth of the bucket = global-
depth, directory expansion (doubling) takes
place along with bucket splitting. Elements
present in overflowing bucket are rehashed
with the new global depth..
35
Inserting 26: Global depth is 3. Hence, 3 LSBs of 26(11010) are considered. Therefore 26 best
fits in the bucket pointed out by directory 010.
Key Binary
Representation
16 10000
4 00100
6 00110
22 10110
24 11000
10 01010
31 11111
7 00111
9 01001
20 10100
26 01101
36
The bucket overflows, and the local depth of bucket < Global depth (2<3), directories are not
doubled but, only the bucket is split and elements are rehashed.
Finally, the output of hashing the given list of numbers is obtained. Key Binary
Representation
16 10000
4 00100
6 00110
22 10110
24 11000
10 01010
31 11111
7 00111
9 01001
20 10100
26 01101
37
Pattern Matching
Pattern Matching:
The Pattern Matching algorithms are also referred as String Searching Algorithms
These algorithms are useful in the case of searching a pattern within the text or searching a sub
string within a String.
Pattern Matching Algorithms:

1. Brute force algorithm
2. Boyer –Moore algorithm
3. Knuth-Morris-Pratt algorithm
38
Brute Force Algorithm
Brute force algorithm:
 The Brute Force algorithm compares the pattern to the text, one character at a time, until
unmatching characters are found.
 If unmatched characters are found move the pattern down the text by one character.
 The algorithm can be designed to stop on either the first occurrence of the pattern, or upon
reaching the end of the text.
39
Brute Force Algorithm...
Example:
40
KMP Algorithm
 KMP Algorithm is one of the most popular patterns matching algorithms. KMP stands for
Knuth Morris Pratt.
 KMP algorithm was invented by Donald Knuth and Vaughan Pratt together and
independently by James H Morris in the year 1970. In the year 1977, all the three jointly
published KMP Algorithm.
 KMP algorithm is used to find a "Pattern" in a "Text". This algorithm compares character
by character from left to right. But whenever a mismatch occurs, it uses a pre-processed table
called "Prefix Table" to skip characters comparison while matching.
 Prefix table is also known as LPS Table. Here LPS stands for "Longest proper Prefix
which is also Suffix".
Steps for Creating LPS Table (Prefix Table)
Step 1 : Define a one dimensional array with the size equal to the length of the Pattern.
(LPS[size])
Step 2 : Define variables i & j. Set i = 0, j = 1 and LPS[0] = 0.
Step 3 : Compare the characters at Pattern[i] and Pattern[j].
Step 4 : If both are matched then set LPS[j] = i+1 and increment both i & j values by one.
Goto to Step 3.
Step 5 : If both are not matched then check the value of variable 'i'. If it is '0' then set LPS[j]
= 0 and increment 'j' value by one, if it is not '0' then set i = LPS[i-1]. Goto Step 3.
41
Step 6: Repeat above steps until all the values of LPS[] are filled.
KMP Algorithm...
42
KMP Algorithm...
43
KMP Algorithm...
44
KMP Algorithm...
How to use LPS Table:
We use the LPS table to decide how many characters are to be skipped for comparison when a
mismatch has occurred.
When a mismatch occurs, check the LPS value of the previous character of the mismatched
character in the pattern. If it is '0' then start comparing the first character of the pattern with the
next character to the mismatched character in the text.
If it is not '0' then start comparing the character which is at an index value equal to the LPS
value of the previous character to the mismatched character in pattern with the mismatched
character in the Text.
Example : Apply the KMP for the following
45
KMP Algorithm...
46
KMP Algorithm...
47
KMP Algorithm...
48
Boyer Moore Algorithm
It was developed by Robert S. Boyern and J Strother Moore in 1977.
The Boyer-Moore algorithm is consider the most efficient string-matching algorithm.
It works the fastest when the Text is moderately sized and the pattern is relatively long.
Boyer Moore is a combination of following two approaches:
1) Bad Character Heuristic
2) Good Suffix Heuristic
Both of the above heuristics can also be used independently to search a pattern in a text.
It processes the pattern and creates different arrays for both heuristics. At every step, it slides
the pattern by the max of the slides suggested by the two heuristics. So it uses best of the two
heuristics at every step.
Boyer Moore algorithm starts matching from the last character of the pattern.
Bad Character Heuristic:
The character of the text which doesn’t match with the current character of the pattern is called
the Bad Character.
Upon mismatch, we shift the pattern until –
1) The mismatch becomes a match.
2) Pattern P move past the mismatched character.
49
Boyer Moore Algorithm
Case 1 : Mismatch become match We will lookup the position of last occurrence of mismatching
character in pattern and if mismatching character exist in pattern then we’ll shift the pattern such
that it get aligned to the mismatching character in text T.
Case 2: Pattern move past the mismatch character We’ll lookup the position of last occurence of
mismatching character in pattern and if character does not exist we will shift pattern past the
mismatching character.
50

DS Unit-5

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

DS Unit-5

Uploaded by

Copyright:

Available Formats

Prepared by D.

Mid Square Method:

Table must be big enough to get a free cell.

Time to get a free cell may be quite large.

resolve the collision.

Data Hash Function f(h)

Over Flow Occurs

Local Depth< Global Local Depth=Global

Performs Bucket Split &

20 is inserted in bucket pointed out by 00.

Pattern Matching Algorithms:

You might also like