AADS - 14 - Hash Tables & Hash Functions

AADS-14
Significance of complexity of Search

In an unordered list, the time required to find a value is O(n).
In an ordered list stored as an array, binary search improves the search time to O(log n), but the modification operations (insert and delete) still leave room for improvement.
In a balanced Binary Search Tree, the search time is O(log n).
The same O(log n) bound applies to AVL trees, which are always kept balanced.
Dictionary Data Structure

A dictionary is a general data structure that stores keys and their associated values.
It can be implemented using array or linked-list structures.
If the array is large enough to hold every possible key, each element can be addressed directly by using its key as the array index (direct addressing).
Key Search/Dictionary Storage

In any complex application, however, memory is shared by many processes at the same time.
Keys may also be accessed frequently at run time.
So there is a need to reduce both the space used and the search time.
Example 1
A 4-digit number used as a key needs up to 10,000 locations (0000 to 9999) under direct addressing.
If the key is the Employee ID of a company with 500 employees, only 500 of those locations are ever used.
Example 2
A hospital may have a large number of patients, both inpatients and outpatients.
The database can be modeled to group the patients and then index them, so that records are retrieved quickly.
Another way is not to group them, but to assign a single number to each case.
Search time of O(1)

Whether the amount of data is large or small, an amortized search time of O(1), or close to it, can be achieved if we know the location of the key or data we are looking for.
This location can be obtained by mapping the key to a new hashed key using a suitable function.
Hashing
Hashing gives each key either a unique location or a reference to a short list, from which the data for that key is easily retrieved.
It can also use less memory: instead of one large array, we can use a short array or linked list.
Hash Table
A hash table is a data structure.
Hash tables provide O(1) time on average for search, insert and delete on any value in the set stored in the table.
Hash Table?
A hash table is an array, say T[1..m], where m is a positive integer called the table size.
When we try to put an item into a slot of the hash table that is already occupied, the situation is called a collision.
It is resolved using a collision resolution policy.
Hashing-Mathematical Definition
Hashing is a mapping operation.
Consider a set K of keys.
Let H be a function that maps the keys to a new set L, such that
H : K → L
Hash Function & Hash Address

The function H is called the hash function.
The mapping done by the function H is called hashing.
The set L is the hash table.
Each cell (location) in L is identified by its hash address.
Hash Address
Let k be a key in K, that is, k ∈ K.
Then k has a mapped address in L given by H(k), known as the hash address.
The hash address d is the location given by the hashing operation d = H(k) for a key k.
Indexing on the Hash table

The hash address d points directly to a location in L.
This address d is also called the hash code for the key k.
The process of hashing is also called compression.
Notes
There is no meaningful relationship between the actual data value k and the hash address d.
So there is no practical way to traverse a hash table in key order; access is by direct search using d.
Hash table items are not in any order.
There is no mapping function from d back to k, except the hash table itself.
The purpose of hash tables is to provide fast lookups.
Illustration- Bucket Array Structure for Hash Table

Address   Key
1         k1
2         k2
3         k3
...       ...
L-1       kN-1
L         kN
Uses of Hash Tables

Compilers use hash tables for symbol storage.
The Linux kernel uses hash tables to manage memory pages and buffers.
High-speed routing tables use hash tables.
Database systems use hash tables.
Operations on Hash Tables

Initialize
Insert(k)
Search(k)
Remove(k)
Sizeof
Isempty
Types of hashing
There are two types:
1. Open hashing, also called open chaining, separate chaining or closed addressing
2. Closed hashing, also called open addressing
Open hashing-Open Chaining

Used when the amount of data to be stored is high.
A hash function is used to obtain the hash address.
All data with the same hash address are stored as a short list, referenced from that hash address.
Bucket in Open hashing
Each hash location in the hash table is said to be a bucket for the data with that index.
Data within a bucket is best organized as a linked list.
Address   Key
1         k1
2         k2
3         k3
...       ...
L-1       kN-1
L         kN
Closed hashing-Open Addressing

Closed hashing uses a fixed amount of space.
Hashing maps a key into one of the locations in the earmarked space.
If multiple keys hash to the same address (a collision), the tie must be resolved.
A bucket may be small enough to hold only one value at a time.
Topics in Hashing
There are two main subareas under hashing:
1. Hash functions
2. Collision resolution
Hash Functions
1. The hash function H should be easy to compute.
2. The function H should, as far as possible, distribute the hash addresses uniformly throughout the set L, so that the number of collisions is minimal.
Hash Functions
Requirement of Hash Functions

The main idea is that, for a key k, the hash function H produces a value H(k) that serves as an index to a cell (bucket) of the hash table, so that the key k can be located easily for search and insert.
Hash Functions
Division Method
Mid Square Method
Multiplication Method
Division Method
Choose a prime number that is not close to a power of 2.
Let m be the selected number.
In the ideal case, with one cell in each bucket, m is also the size of the hash table.
The hash address (bucket address) is given by
H(k) = k mod m
Example
Given keys: 4845, 5679, 6381, 3636, 7180, 8126, 1127
Use table size m = 7 and hash the keys into a table with 7 cells.
Repeat the exercise with m = 11 and m = 8.
Answer
Hash address   Key
0              1127
1              4845
2              5679
3              3636
4              6381
5              7180
6              8126
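The division-method example can be checked with a few lines of Python. This is only a minimal sketch: the function name division_hash and the dictionary addresses are illustrative names that do not appear in the slides.

```python
def division_hash(k: int, m: int) -> int:
    """Division-method hash address H(k) = k mod m."""
    return k % m


keys = [4845, 5679, 6381, 3636, 7180, 8126, 1127]

# Map each key to its hash address for table size m = 7.
addresses = {k: division_hash(k, 7) for k in keys}

# Print the table ordered by hash address.
for k, d in sorted(addresses.items(), key=lambda item: item[1]):
    print(d, k)
```

The same function can be reused for the repeat exercise with m = 11 and m = 8. With m = 8, for instance, 4845 and 6381 both map to address 5, which illustrates why a power of 2 is a poor table size for these keys.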
Choosing Table size in Division Method

When using the division method, ample
consideration must be given to the size of the table. The best choice for table size is usually a prime number not too close to a power of 2.
Division Method for Chaining
Here the hash table has many cells.
Hash addresses may map multiple keys to a single location, so there can be multiple entries in one location.
These multiple entries under a single hash code are held as a linked list.
Illustration
Take table size m = 11 to map a set of keys.
Keys: 122, 221, 661, 90, 69, 167, 57
Divide each key modulo 11 to get the hash addresses.
Answer: we get the following table

Hash address   Keys in chain
0              (empty)
1              122, 221, 661
2              90, 167, 57
3              69
4 to 10        (empty)
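The chaining illustration can be sketched in Python as follows. Python lists stand in for the linked lists here, which is a simplification, and build_chained_table is a hypothetical helper name, not something defined in the slides.

```python
def build_chained_table(keys, m):
    """Hash each key with k mod m and append it to that slot's chain."""
    table = [[] for _ in range(m)]  # one empty chain per slot
    for k in keys:
        table[k % m].append(k)
    return table


chained = build_chained_table([122, 221, 661, 90, 69, 167, 57], 11)

# Show every slot with the keys chained under it.
for address, chain in enumerate(chained):
    print(address, chain)
```

Keys that share a hash address simply accumulate in the same chain, in order of arrival.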
Load Factor
Let there be m slots in a hash table.
At the instant of observation, let the number of elements be n.
Then the load factor is α = n/m.

This is the average number of elements stored per slot of the hash table.
α can be less than, equal to or greater than 1.
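Find the Load Factor

Chained table with slots 0 to 10; the keys are placed here by k mod 11, which matches the slots shown:
Slot   Keys in chain
0      110
1      89, 45
2      68, 167, 57
3      (empty)
4      554
5      225, 82
6      (empty)
7      (empty)
8      (empty)
9      108
10     109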
Solution
There are 11 slots and 11 elements, so α = 11/11 = 1.
So α indicates the average number of elements per position.
We get α = 1 even though there are vacant slots, because α only shows the average.
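A short Python sketch of the load factor calculation follows. It assumes the exercise table is a chained table built with k mod 11, which is an inference from the slot numbers and not stated on the slide; load_factor and the variable names are illustrative.

```python
def load_factor(table):
    """Return n / m for a chained table given as a list of chains."""
    n = sum(len(chain) for chain in table)  # number of stored elements
    m = len(table)                          # number of slots
    return n / m


# Keys from the exercise, placed by k mod 11 (an assumption about the slide's table).
exercise_keys = [110, 89, 45, 68, 167, 225, 554, 57, 82, 108, 109]
exercise_table = [[] for _ in range(11)]
for k in exercise_keys:
    exercise_table[k % 11].append(k)

alpha = load_factor(exercise_table)
vacant = [i for i, chain in enumerate(exercise_table) if not chain]
print(alpha, vacant)
```

The load factor comes out as 1 although some slots are vacant, because it is only an average over all slots.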
Notes on α
The load factor α takes various values as the number of keys in the hash table changes.
Accordingly, α can be less than, equal to or greater than 1 in a hash table formed using separate chaining (open hashing).
In a hash table formed using open addressing (closed hashing), α can never exceed 1.
α determines the complexity of operations on the hash table such as insert, search and delete.
Hashing the Strings
Exercise
Map the following keys using the hash function defined as follows:

1. Find the ASCII values of the first and last characters. If there is only one character, it serves as both the first and the last.
2. Multiply the ASCII value of the first character by 256 and add the ASCII value of the last character.
3. Apply mod m division to the resulting number.
Keys
A, BABU, CHOWHAN, SUMAN, DILIP
The 5 symbols are:

AA
BU
CN
SN
DP
These 5 symbols are then converted to numerical codes using the rule above, employing the ASCII values of their characters.
ASCII Values
A-65 B-66 C-67 D-68 E-69 F-70 G-71 H-72 I-73
J-74 K-75 L-76 M-77 N-78 O-79 P-80 Q-81 R-82
S-83 T-84 U-85 V-86 W-87 X-88 Y-89 Z-90
Example- Answer
AA 256*65+65=16705
BU 256*66+85=16981
CN 256*67+78=17230
SN 256*83+78=21326
DP 256*68+80=17488
Solution
Take m=7
Obtain the Hash Addresses
AA 16705 mod 7 = 3
BU 16981 mod 7 = 6
CN 17230 mod 7 = 3
SN 21326 mod 7 = 4
DP 17488 mod 7 = 2
Solution
Hash address   Keys
0              (empty)
1              (empty)
2              DILIP
3              A, CHOWHAN
4              SUMAN
5              (empty)
6              BABU
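The first-and-last-character rule of this exercise can be written as a small Python sketch that computes each code directly from the rule, so the hand calculation can be checked. The names first_last_code and string_hash are illustrative, and ord() is used to obtain the ASCII value of a character.

```python
def first_last_code(s: str) -> int:
    """256 * ASCII(first character) + ASCII(last character).

    For a one-character string the same character is used twice.
    """
    return 256 * ord(s[0]) + ord(s[-1])


def string_hash(s: str, m: int) -> int:
    """Reduce the numeric code to a hash address with mod m division."""
    return first_last_code(s) % m


names = ["A", "BABU", "CHOWHAN", "SUMAN", "DILIP"]
for name in names:
    print(name, first_last_code(name), string_hash(name, 7))
```

Keeping the string-to-number step separate from the mod m step means only string_hash is affected if the table size changes.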
Symbol Table
Compilers use a method similar to the previous one to form a symbol table for parsing during compilation.
Hash Functions for string hashing

Hash functions perform two separate tasks:
1. Convert the string to a numeric key.
2. Constrain the key to a non-negative value less than the size of the table.
The best strategy is to keep the two tasks separate, so that only one part has to change if the size of the table changes.
Notes-Chaining method
In principle, the chaining method gives unlimited space in the hash table, since each chain can keep growing.
In practical applications, however, only limited memory is allotted to a hash table.
Chaining needs no probing for another slot: keys that collide simply share the same chain.
Collisions
Collision
In closed hashing (open addressing), H would ideally give a distinct address in L for each member of K.
In real situations, two or more keys may lead to a single hash address under a given hash function.
This situation is called a collision.
We need some method to resolve collisions; the method is called a collision resolution policy.
Collision Resolution Policy

Linear Probing
Quadratic Probing
Double Hashing
Linear Probing
For insert: if a collision occurs, look for the next free location and store the key there.
For search: if the key is not found at its home position, look for it in the following cells in a linear manner.
Example
Let H(k) = k mod 11.
Let the keys 56, 78, 100 arrive in this order for hashing.
All of these have position 1 as their home.
The table is considered a circular array.
Position   Key
1          56
2          78
3          100
Other positions are empty.
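Exercise
Hash 45, 39, 66, 74 in that order with table size m = 7.
45 mod 7 = 3
39 mod 7 = 4
66 mod 7 = 3
74 mod 7 = 4
Position   Key
3          45
4          39
5          66 (home 3; positions 3 and 4 are occupied)
6          74 (home 4; positions 4 and 5 are occupied)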
Exercise
Let H(k) = k mod 11.
Let the keys 46, 122, 222, 441 arrive in this order for hashing.
46 mod 11 = 2
122 mod 11 = 1
222 mod 11 = 2
441 mod 11 = 1
Solution
Position   Key
1          122
2          46
3          222
4          441
Other positions are empty.
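The linear probing examples above can be reproduced with the following Python sketch. It assumes H(k) = k mod m with m equal to the table length, treats the table as a circular array, and uses None to mark an empty cell; the function names are illustrative. The search shown is only valid for tables without deletions.

```python
def linear_probe_insert(table, k):
    """Insert k at its home position or the next free cell (circular scan)."""
    m = len(table)
    home = k % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("hash table is full")


def linear_probe_search(table, k):
    """Return the position of k, or None if k is absent.

    Stops at the first empty cell, which is valid only if no key was deleted.
    """
    m = len(table)
    home = k % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is None:
            return None
        if table[slot] == k:
            return slot
    return None


table11 = [None] * 11
slots11 = [linear_probe_insert(table11, k) for k in [46, 122, 222, 441]]

table7 = [None] * 7
slots7 = [linear_probe_insert(table7, k) for k in [45, 39, 66, 74]]
print(slots11, slots7)
```

Each key ends up at the first free cell at or after its home position, which is why later keys drift further from home as the run of occupied cells grows.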
More on Hash Functions
Mid Square Method of hashing
Mid square method

1. The key k is squared to get k².
2. This value is treated as a string of digits.
3. The hash function is defined as H(k) = f.
4. Here f is obtained by deleting digits from both ends of k².
5. Once chosen, the same digit positions of k² must be used consistently for all keys.
Example
k      k²            H(k)
3205   10 272 025    72
7148   51 093 904    93
2345   5 499 025     99
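The mid square example can be sketched in Python. The slides do not say which digit positions are kept; the values 72, 93 and 99 are consistent with taking the 4th and 5th digits of k² counted from the right, so the sketch below assumes that choice. The parameter names skip_right and width are hypothetical.

```python
def mid_square_hash(k: int, skip_right: int = 3, width: int = 2) -> int:
    """Square k, drop `skip_right` digits from the right, keep the next `width` digits.

    The default positions are an assumption that matches the worked example.
    """
    squared = k * k
    return (squared // 10 ** skip_right) % 10 ** width


for k in [3205, 7148, 2345]:
    print(k, k * k, mid_square_hash(k))
```

Counting positions from the right keeps the rule consistent even when k² has a different number of digits, as with 2345.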
Multiplication Method for hashing
Multiplication method for Hashing

This method uses a hash function different from that of the division method.
The function takes the form
H(k) = floor(m * (kA mod 1))
where 0 < A < 1 and kA mod 1 refers to the fractional part of kA.
Since 0 ≤ kA mod 1 < 1, H(k) ranges from 0 to m-1.
Advantage of Multiplication Method

The advantage of the multiplication method is that it works equally well with any table size m.
A should be chosen carefully.
Rational numbers should not be chosen for A.
An example of a good choice for A is
A = (√5 − 1)/2 ≈ 0.618
Obtain the Hash Codes for the keys

2343, 4345, 6567, 3476, 1215
m = 11, A = 0.618, an approximation of (√5 − 1)/2
Key    Computation                         H(k)
2343   floor(11 * (2343 * 0.618 mod 1))    10
4345   floor(11 * (4345 * 0.618 mod 1))    2
6567   floor(11 * (6567 * 0.618 mod 1))    4
3476   floor(11 * (3476 * 0.618 mod 1))    1
1215   floor(11 * (1215 * 0.618 mod 1))    9
MATLAB command: floor(11*mod((k*0.618),1))
Solution
Position   Key
1          3476
2          4345
4          6567
9          1215
10         2343
Other positions are empty.
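A Python counterpart of the MATLAB one-liner is sketched below. It uses floating-point arithmetic, so results for keys whose fractional part lies very close to a slot boundary could differ from exact arithmetic; multiplication_hash is an illustrative name.

```python
import math


def multiplication_hash(k: int, m: int, A: float = 0.618) -> int:
    """H(k) = floor(m * (k*A mod 1)), where k*A mod 1 is the fractional part of k*A."""
    fractional = (k * A) % 1.0
    return math.floor(m * fractional)


for k in [2343, 4345, 6567, 3476, 1215]:
    print(k, multiplication_hash(k, 11))
```

Only m changes if the table is resized; the constant A stays the same.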
More on Collision Resolution
Quadratic Probing for Collision Resolution
Notes on Linear Probing

Linear probing is simple to program.
Linear probing has better locality of reference and hence better cache performance.
Primary Clustering in Linear Probing

Linear probing uses the probe sequence H+1, H+2, H+3 and so on to find space for a key whose primary hash value is H.
This leads to hash codes clustering around certain cells, called primary clustering.
The larger the cluster, the lower the search efficiency.
Uniform Hashing & Random Probing

If we generate hash codes in a uniformly distributed manner with a larger table size, we may avoid most collisions.
Even if collisions occur, we may use a pseudo-random sequence to probe the locations.
But this approach reduces locality of reference, because the next probe location is effectively random.
So it is better to use a middle path between linear probing and random probing.
Quadratic Probing
Instead of traversing the hash table slots linearly after a collision, quadratic probing puts increasing spacing between the slots it tries.
This reduces the clustering effect seen in linear probing.
Clustering can still occur, because quadratic probing is not immune to it.
Quadratic probing preserves some locality of reference and hence gives good cache performance, though lower than that of linear probing.
Hash Function for quadratic probing

H(k,i) = (H(k) + c1*i + c2*i²) mod m
where c1 and c2 are auxiliary constants and H(k) on the right-hand side is an auxiliary hash function, for example k mod m.
i = 0, 1, 2, ..., m-1 is called the probe number.

For a given hash table, c1 and c2 remain constant.
Choices for c1 and c2 include c1 = c2 = 1/2; c1 = c2 = 1; and c1 = 0, c2 = 1.
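Example
c1 = c2 = 1/2
Take m = 11
Let the keys 46, 122, 222, 441 arrive in this order for hashing.

46 mod 11 = 2, stored at 2
122 mod 11 = 1, stored at 1
222 mod 11 = 2 (occupied); i = 1: (2 + 0.5*1 + 0.5*1²) mod 11 = 3, stored at 3

441 mod 11 = 1 (occupied); i = 1: (1 + 0.5*1 + 0.5*1²) mod 11 = 2 (occupied); i = 2: (1 + 0.5*2 + 0.5*2²) mod 11 = 4, stored at 4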
Exercise
Apply quadratic probing (c1 = c2 = 1/2, m = 11) to keys with the following hash addresses:
78 mod 11 = 1
89 mod 11 = 1
111 mod 11 = 1
166 mod 11 = 1
Answer
Key   Home   Probe                            Position
78    1      i = 0                            1
89    1      (1 + 0.5*1 + 0.5*1²) mod 11      2
111   1      (1 + 0.5*2 + 0.5*2²) mod 11      4
166   1      (1 + 0.5*3 + 0.5*3²) mod 11      7
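The quadratic probing exercise can be checked with the Python sketch below. It assumes c1 = c2 = 1/2 and the auxiliary hash k mod m, as in the worked answer, and computes 0.5*i + 0.5*i² as the integer (i + i*i) // 2, which is exact because i + i² is always even. The function name is illustrative.

```python
def quadratic_probe_insert(table, k):
    """Insert k using H(k, i) = (k mod m + i/2 + i*i/2) mod m, for i = 0, 1, ..., m-1."""
    m = len(table)
    home = k % m
    for i in range(m):
        offset = (i + i * i) // 2  # equals 0.5*i + 0.5*i*i exactly
        slot = (home + offset) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    # Quadratic probing may not visit every cell, so this is not always "table full".
    raise OverflowError("no free cell found along the probe sequence")


table = [None] * 11
positions = [quadratic_probe_insert(table, k) for k in [78, 89, 111, 166]]
print(positions)
```

All four keys share home position 1, so each one walks the same probe sequence 1, 2, 4, 7 and takes the first free cell on it.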
Notes
If two keys have the same initial probe position, then their probe sequences are the same, since H(k1, 0) = H(k2, 0) implies H(k1, i) = H(k2, i).
This property leads to a milder form of clustering called secondary clustering.
Clustering
Problems with Linear Probing

Linear probing leads to primary clustering: hashed keys share substantial segments of their probe sequences, because all keys hashed to the same home position follow the same probe sequence.
The keys that collide at a home address, say b, extend the cluster around b.
Primary Clustering
As we have seen, once a block of a few contiguous occupied positions emerges in the hash table, it becomes a target for subsequent collisions.
As clusters grow, they also merge to form larger clusters.
Primary clustering means that elements which hash to different cells end up probing the same alternative cells.
Clustering is reduced only if the hash addresses have different home positions.
Example
Suppose we have 10 hash codes with value 1 and 5 hash codes with value 2.
All these codes cluster around positions 1 and 2.
Problems with Quadratic Probing

Keys with the same home hash address follow the same probe sequence, so they form clusters along that sequence.
This is called secondary clustering.

It happens because, in quadratic probing too, the probe sequence is a function of the home position and not of the original key value.
Double hashing for Collision Resolution
Double Hashing
To avoid secondary clustering, the probe sequence must make use of the original key value.
This is achieved using double hashing, so called because the hashing is done in two stages.
A second hash function is used to reduce collisions.
Double Hashing
Let H1(k) and H2(k) be two hash functions for the same key k.
The hash address for the i-th probe is obtained as

H(k,i) = (H1(k) + i*H2(k)) mod m
If the table size m is a prime number, the above sequence is likely to access all locations in the hash table.
Notes
The functions H1(k) and H2(k) are auxiliary hash functions, selected like any other hash function so that the keys are distributed in a uniform and random manner.
Example 1
We let H1(k) = k mod m and H2(k) = 1 + (k mod m'), where m' is slightly less than m, say m-1 or m-2.
For example, m = 11 and m' = 9.
Example 2
First use the mid square method and then apply modulo division.
Double hashing
Double hashing can be used to avoid both primary and secondary clustering.
H2(k) must be chosen with care.
m and H2(k) must be relatively prime; this can be ensured by making m a prime number.
If m is a power of two, choose an H2(k) that is always odd.
Example
Generate hash codes using double hashing for the following keys:
2227, 3545, 4537, 8981, 7857, 3433, 6965
Use the division method with H1(k) = k mod m and H2(k) = 1 + (k mod m').
We have H(k,i) = (H1(k) + i*H2(k)) mod m.
Use m = 11 and m' = 9.
Steps
First generate hash codes with H1(k) = k mod m using m = 11.
Then apply the second hash function wherever collisions occur, taking m' = 9.
Step 1-Answer
2227 mod 11 = 5
3545 mod 11 = 3
4537 mod 11 = 5
8981 mod 11 = 5
7857 mod 11 = 3
3433 mod 11 = 1
6965 mod 11 = 2
Step 2
To resolve collisions, use the second hash function: twice for hash code 5 and once for hash code 3, and see how the mapping evolves.
A key relocated in this way can itself cause a further collision for a later key.
Answer-Step 2
Key    H1(k) = k mod 11   H2(k) = (k mod 9) + 1
2227   5                  5
3545   3                  9
4537   5                  2
8981   5                  9
7857   3                  1
3433   1                  5
6965   2                  9
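Step 3
Insert the keys in order, probing with H(k,i) = (H1(k) + i*H2(k)) mod 11:
2227: position 5
3545: position 3
4537: 5 is occupied; i = 1: (5 + 1*2) mod 11 = 7
8981: 5 is occupied; i = 1: (5 + 1*9) mod 11 = 3 is occupied; i = 2: (5 + 2*9) mod 11 = 1
7857: 3 is occupied; i = 1: (3 + 1*1) mod 11 = 4
3433: 1 is occupied (by 8981); i = 1: (1 + 1*5) mod 11 = 6
6965: position 2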
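The three steps of this double hashing example can be combined into one Python sketch. It assumes H1(k) = k mod m with m equal to the table length and H2(k) = 1 + (k mod m'), as stated in the example; the function name and the parameter m2 (standing for m') are illustrative.

```python
def double_hash_insert(table, k, m2):
    """Insert k using H(k, i) = (H1(k) + i * H2(k)) mod m.

    H1(k) = k mod m and H2(k) = 1 + (k mod m2), where m = len(table).
    """
    m = len(table)
    h1 = k % m
    h2 = 1 + (k % m2)
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("no free cell found along the probe sequence")


keys = [2227, 3545, 4537, 8981, 7857, 3433, 6965]
table = [None] * 11
positions = [double_hash_insert(table, k, 9) for k in keys]
for k, p in zip(keys, positions):
    print(k, p)
```

Because the step size H2(k) depends on the key itself, keys that share the home position 5 (2227, 4537 and 8981) follow different probe sequences, which is how double hashing avoids secondary clustering.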
Sparse Matrices
