
AADS-14

Significance of complexity of Search


In an unordered list, the time required to find a value is O(n).
In an ordered list this search time can be improved, though the modification operations (insert, delete) still leave room for improvement.
In a Binary Search Tree the search time can improve to O(log n), provided the tree stays balanced.
The same O(log n) bound holds for AVL trees.

Dictionary Data Structure


A Dictionary is a general Data Structure for storing keys and values.
It can be implemented using Array or Linked List structures.
In a Dictionary, each element can be addressed directly by using its key as the index, provided the Dictionary is large enough to cover every possible key.

Key Search/Dictionary Storage


But in complex applications, memory is used simultaneously by many processes.
There can also be frequent accesses to the Keys at runtime.
So there is a need to reduce both the space used and the search time.

Example-1
A 4-digit number used as a Key with direct addressing may need up to 10,000 locations (0000 to 9999).
If the Key is the Employee ID of a company with 500 employees, only 500 of those locations are used once all the Keys are placed in memory.
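The space argument in Example-1 can be sketched in a few lines. The employee IDs below are made up for illustration (they are not from the slides); the point is only that a directly addressed table allocates a slot for every possible key while using very few of them.

```python
# Hypothetical data: 500 four-digit employee IDs, made up for illustration.
employee_ids = list(range(1000, 6000, 10))

# Direct addressing: one slot for every possible 4-digit key (0000 to 9999).
direct_table = [None] * 10000
for emp_id in employee_ids:
    direct_table[emp_id] = emp_id  # the key itself is the index

used = sum(slot is not None for slot in direct_table)
print(f"slots allocated: {len(direct_table)}, slots used: {used}")
```

Only 5% of the allocated slots hold a record, which is the waste that hashing into a smaller table tries to avoid.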

Example 2
A Hospital may have a large number of patients, both inpatients and outpatients.
The database system can be modeled to group the patients and then index them, so that records are retrieved quickly.
Another way is not to group, but to assign a single number to each case.

Search time of O(1)


For both large and small amounts of data, an average search time of O(1), or close to it, can be achieved if we know the location of the data or key we are looking for.
This location can be obtained by mapping the key to a new hashed key using a suitable function.

Hashing
Hashing can provide either a unique location for each key or a reference to a shorter list of keys, from which the data for one key is easily retrieved.
It can also use less space in memory: instead of a large array, we can use a short array or linked list.

Hash Table
A Hash Table is a Data Structure.
Hash tables provide O(1) average time for search/insert/delete on any value in the set held in the Hash Table.

Hash Table?
A hash table is an array, say T[1..m], where m is a positive integer called the table size.
When we try to put an item into a spot in the hash table that is already occupied, the situation is called a collision.
It is resolved using a collision resolution policy.

Hashing-Mathematical Definition
Hashing is a mapping operation.
Consider a set K of keys.
Let H be a function that maps the keys to a new set L, such that

H: K → L

Hash Function & Hash Address

The function H is called the HASH FUNCTION.
The mapping done by the function H is called HASHING.
The object L is the Hash table.
Each cell/location in L is identified by its Hash address.

Hash Address
Let k be a Key in K, that is, k ∈ K.
Then k has a mapped address in L given by H(k), known as the Hash Address.
The Hash Address d is the mapped address/location given by the hashing operation d = H(k) of a key k.

Indexing on the Hash table


The hash address d points directly to a location in L.
This address d is also called the Hash Code for the key k.
The process of Hashing is also called Compression.

Notes
There is no meaningful relationship between the actual data value k and the hash key d.
So there is no practical way to traverse a hash table, except by a direct search using d.
Hash table items are not in any order.
There is no mapping function from d back to k, except the hash table itself.
The purpose of hash tables is to provide fast lookups.

Illustration- Bucket Array Structure for Hash Table


Hash address   Key
1              k1
2              k2
3              k3
...            ...
L-1            kN-1
L              kN

Uses of Hash Tables


Compilers use hash tables for symbol storage.
The Linux Kernel uses hash tables to manage memory pages and buffers.
High speed routing tables use hash tables.
Database systems use hash tables.

Operations on Hash Tables


Initialize
Insert(k)
Search(k)
Remove(k)
Sizeof
Isempty

Types of hashing
There are two types:
1. Open hashing, also called Open Chaining, Closed Addressing or Separate Chaining
2. Closed hashing, also called Open Addressing

Open hashing-Open Chaining


The amount of data to be stored is high.
A hash function is used to obtain the hash address.
All data with the same hash address are stored as a shorter list, referenced by that hash address.

Bucket in Open hashing

Each hash location in the Hash table is said to be a bucket for the data, identified by an index.
Data within a bucket is best organized as a Linked List.

Hash address   Key
1              k1
2              k2
3              k3
...            ...
L-1            kN-1
L              kN
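The operations listed earlier (Initialize, Insert, Search, Remove, Sizeof, Isempty) can be sketched for open hashing as follows. This is a minimal illustration, not a reference implementation: the class name and method names are hypothetical, the hash address is assumed to be k mod m, and a Python list stands in for the linked list inside each bucket.

```python
class ChainedHashTable:
    """Open hashing (separate chaining) sketch.

    Assumptions: integer keys, hash address k mod m, and a Python list
    standing in for the linked list held in each bucket.
    """

    def __init__(self, m=11):
        # Initialize: m empty buckets.
        self.m = m
        self.buckets = [[] for _ in range(m)]
        self.count = 0

    def _address(self, k):
        return k % self.m

    def insert(self, k):
        bucket = self.buckets[self._address(k)]
        if k not in bucket:          # ignore duplicate keys
            bucket.append(k)
            self.count += 1

    def search(self, k):
        # Only the one bucket selected by the hash address is scanned.
        return k in self.buckets[self._address(k)]

    def remove(self, k):
        bucket = self.buckets[self._address(k)]
        if k in bucket:
            bucket.remove(k)
            self.count -= 1
            return True
        return False

    def size_of(self):
        return self.count

    def is_empty(self):
        return self.count == 0
```

Each operation touches a single bucket, so its cost depends on the length of that bucket's list rather than on the total number of keys.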

Closed hashing-Open Addressing


Closed hashing uses a fixed space.
Hashing maps a key into one of the locations in the earmarked space.
If multiple keys hash to the same address (collision), the tie must be resolved.
A bucket may be small enough to hold only one value at a time.

Topics in Hashing
Basically there are two subareas under Hashing:
1. Hash Functions
2. Collision Resolutions

Hash Functions
1. The Hash Function H should be easy to compute.
2. The function H should, as far as possible, distribute the hash addresses uniformly throughout the set L, so that the number of collisions is minimal.

Hash Functions

Requirement of Hash Functions


The main idea of using a Hash Function H is that, for a key k, H gives a value H(k) that serves as an index into the hash table cell/bucket, so that we can easily locate the key k in the Hash Table for search/insert.

Hash Functions
Division Method
Mid Square Method
Multiplication Method

Division Method
Choose a prime number that is not close to a power of 2.
Let m be the selected number.
Then m is also the size of the Hash Table in the ideal case, with one cell in each bucket.
The hash address/bucket address is given by

H(k) = k mod m

Example
Given keys are 4845, 5679, 6381, 3636, 7180, 8126, 1127.
Use Table size m=7 and hash to a Table with 7 cells.
Also repeat the exercise with m=11 and m=8.

Answer
HASH ADDRESS   KEY
0              1127
1              4845
2              5679
3              3636
4              6381
5              7180
6              8126
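The division method on these keys can be checked with a short sketch. The function name `division_hash` is a hypothetical helper; the keys and table sizes are the ones given in the example.

```python
def division_hash(k, m):
    """Division method: H(k) = k mod m."""
    return k % m

keys = [4845, 5679, 6381, 3636, 7180, 8126, 1127]

# Repeat the exercise for each table size named in the example.
for m in (7, 11, 8):
    addresses = {k: division_hash(k, m) for k in keys}
    print(f"m={m}: {addresses}")
```

With m=7 every key lands in its own cell, while m=11 and m=8 both produce collisions for this particular key set, so a collision resolution policy would then be needed.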

Choosing Table size in Division Method


When using the division method, ample consideration must be given to the size of the table.
The best choice for table size is usually a prime number not too close to a power of 2.

Division Method for Chaining
Here, the Hash Table will have many cells.
Hash addresses map multiple keys to a single location, so there can be multiple entries in one location.
These multiple entries under a single hash Code are held as a linked list.

Illustration
Take Table size m as 11 to map a set of keys.
Keys: 122, 221, 661, 90, 69, 167, 57
Divide each key modulo 11 to get the hash addresses.

Answer- We get the following Table


Hash address   Keys (linked list)
0
1              122 -> 221 -> 661
2              90 -> 167 -> 57
3              69
4
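The chained table for this illustration can be reproduced with a small sketch. `build_chains` is a hypothetical helper, and Python lists stand in for the linked lists.

```python
def build_chains(keys, m):
    """Place each key in the chain at address k mod m, in arrival order."""
    table = [[] for _ in range(m)]
    for k in keys:
        table[k % m].append(k)
    return table

table = build_chains([122, 221, 661, 90, 69, 167, 57], 11)
for address, chain in enumerate(table):
    if chain:  # print only the occupied addresses
        print(address, " -> ".join(str(k) for k in chain))
```

Three keys share address 1, three share address 2, and 69 sits alone at address 3; the remaining eight addresses stay empty.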

Load Factor
Let there be m slots in a Hash Table.
At the instant of observation, the number of elements is n.
Therefore the Load factor = n/m
This is the average number of elements stored per slot in the Hash Table.
The Load factor can be less than, equal to, or greater than 1.

Find the Load Factor


[Figure: a chained Hash Table with slots 0 to 10 holding the keys 110, 89, 45, 68, 167, 225, 554, 57, 82, 108, 109; some slots are vacant]

Solution
There are 11 slots and 11 elements.
Load factor = 11/11 = 1
So, the Load factor indicates the average number of elements per position.
We get a Load factor of 1 even though there are vacant slots, because it only shows the average.
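The calculation can be sketched as below. The function name is a hypothetical helper. Counting the vacant slots additionally assumes the keys were placed by k mod 11, which the slides do not state.

```python
def load_factor(n, m):
    """Load factor: average number of stored elements per slot."""
    return n / m

keys = [110, 89, 45, 68, 167, 225, 554, 57, 82, 108, 109]
m = 11

# Assumption (not stated in the slides): each key sits at address k mod m.
occupied = {k % m for k in keys}

print("load factor:", load_factor(len(keys), m))
print("vacant slots:", m - len(occupied))
```

Under that assumption four of the eleven slots are empty, yet the load factor is still exactly 1, because some chains hold two or three keys.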

Notes on Load Factor
The Load factor takes various values as the number of keys in the Hash Table changes.
Accordingly, it can be less than, equal to, or greater than one in a Hash Table formed using Separate Chaining (Open Hashing).
In a Hash Table formed using Open Addressing (Closed Hashing), it can never exceed one.
The Load factor decides the complexity of operations on the Hash Table such as insert, search and delete.

Hashing the Strings

Exercise
Map the following keys using the hash function described below:
1. Find the ASCII values of the first and last characters. If there is only one character, it serves as both the first and the last.
2. Multiply the ASCII value of the first character by 256 and add the ASCII value of the last character.
3. Apply mod m division to the resulting number.

Keys
A, BABU, CHOWHAN, SUMAN, DILIP

The 5 symbols are:
AA
BU
CN
SN
DP
These 5 symbols are then converted to a numerical code using the rule given previously, employing the ASCII values of the characters in the symbols.

ASCII Values
A-65  B-66  C-67  D-68  E-69  F-70  G-71  H-72  I-73
J-74  K-75  L-76  M-77  N-78  O-79  P-80  Q-81  R-82
S-83  T-84  U-85  V-86  W-87  X-88  Y-89  Z-90

Example- Answer
AA 256*65+65=16705
BU 256*66+85=16981
CN 256*67+78=17230
SN 256*83+78=21326
DP 256*68+80=17488

Solution
Take m=7
Obtain the Hash Addresses
AA 256*65+65=16705 mod 7=3
BU 256*66+85=16981 mod 7=6
CN 256*67+78=17230 mod 7=3
SN 256*83+78=21326 mod 7=4
DP 256*68+80=17488 mod 7=2

Solution
Hash address   Keys
0
1
2              DILIP
3              A (symbol AA) -> CHOWHAN
4              SUMAN
5
6              BABU

Symbol Table
Compilers use a method similar to the previous one to form a symbol table for parsing during compilation.

Hash Functions for string hashing


Hash Functions perform two separate functions:
1. Convert the string to a key.
2. Constrain the key to a non-negative value less than the size of the table.
The best strategy is to keep the two functions separate, so that there is only one part to change if the size of the table changes.
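The two-step advice can be illustrated with the first-and-last-character rule from the earlier exercise. Both function names are hypothetical; the sketch assumes non-empty ASCII strings.

```python
def string_to_key(s):
    """Step 1: 256 * ASCII(first character) + ASCII(last character).

    For a one-character string the same character is both first and last.
    """
    return 256 * ord(s[0]) + ord(s[-1])

def constrain(key, m):
    """Step 2: reduce the key to a table address in the range 0..m-1."""
    return key % m

names = ["A", "BABU", "CHOWHAN", "SUMAN", "DILIP"]
m = 7
for name in names:
    key = string_to_key(name)
    print(name, key, constrain(key, m))
```

If the table is resized, only the value of m passed to `constrain` changes; `string_to_key` is untouched.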

Notes-Chaining method
The chaining method gives, in principle, unlimited space in the hash table.
But in practical applications, only limited space is allotted to one hash table in memory.
A collision needs no probing in chaining: keys with the same hash address simply share a bucket.

Collisions

Collision
In closed hashing (open addressing), H ideally gives a distinct address in L for each member of K.
In real situations, however, two or more Keys may LEAD TO A SINGLE Hash Address when a given Hash Function is used.
This situation is called a collision.
We need some method to resolve collisions.
The method is called the Collision Resolution Policy.

Collision Resolution Policy


Linear Probing
Quadratic Probing
Double Hashing

Linear Probing
If a collision occurs during an insert operation, look for the next free location and store the key there.
During a search operation, if the key is not found at its home position, look for it in the following cells in a linear manner.

Example
Let H be mod 11.
Let the keys 56, 78, 100 appear in this order for hashing.
All of these have position 1 as their home.
The table is considered a circular array.

Hash address   Key
1              56
2              78
3              100
(all other cells are empty)

Exercise
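Hash 45, 39, 66, 74 in that order with Table size m=7

45 mod 7 = 3
39 mod 7 = 4
66 mod 7 = 3 (collision; cell 4 is also occupied, so 66 goes to 5)
74 mod 7 = 4 (collision; cell 5 is also occupied, so 74 goes to 6)

Hash address   Key
3              45
4              39
5              66
6              74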

Exercise
Let H be mod 11.
Let the keys 46, 122, 222, 441 appear in this order for hashing.
46 mod 11 = 2
122 mod 11 = 1
222 mod 11 = 2
441 mod 11 = 1

Solution

Hash address   Key
1              122
2              46
3              222
4              441
(all other cells are empty)
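Insert and search with linear probing can be sketched as follows. The function names are hypothetical, the home position is assumed to be k mod m, the table is treated as a circular array, and deletion is not handled.

```python
def linear_probe_insert(table, k):
    """Store k at its home cell or the next free cell; return the cell used."""
    m = len(table)
    home = k % m
    for i in range(m):
        slot = (home + i) % m        # wrap around: circular array
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("hash table is full")

def linear_probe_search(table, k):
    """Return the cell holding k, or None if k is absent."""
    m = len(table)
    home = k % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is None:      # an empty cell ends the probe sequence
            return None
        if table[slot] == k:
            return slot
    return None

table = [None] * 11
slots = [linear_probe_insert(table, k) for k in (46, 122, 222, 441)]
print(slots)
```

Note that 441 needs three extra probes even though only one other key shares its home position, because the occupied run at cells 1 to 3 already blocks it.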

More on Hash Functions

Mid Square Method of hashing

Mid square method


1. The key k is squared to get k².
2. This value is then treated as a string of digits.
3. The hash function H(k) is defined as H(k) = f.
4. Here f is obtained by deleting digits from both ends of k².
5. Once chosen, the same positions of k² must be used for all keys consistently.

Example
k      k²           H(k)
3205   10 272 025   72
7148   51 093 904   93
2345   5 499 025    99
(H(k) is taken from the 4th and 5th digits of k², counting from the right.)
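The mid square method can be sketched as below. The function name and its parameters are hypothetical. The digit positions are an assumption inferred from the three worked values: the hash is read from the 4th and 5th digits of k², counting from the right.

```python
def mid_square_hash(k, skip_right=3, width=2):
    """Mid square sketch: keep `width` digits of k*k after dropping
    `skip_right` digits from the right end (and the rest from the left).

    The defaults are an assumption that matches the worked example.
    """
    digits = str(k * k).zfill(skip_right + width)  # pad very small squares
    end = len(digits) - skip_right
    return int(digits[end - width:end])

for k in (3205, 7148, 2345):
    print(k, k * k, mid_square_hash(k))
```

Counting from the right keeps the chosen positions consistent even when k² has a different number of digits, as with 2345.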

Multiplication Method for hashing

Multiplication method for Hashing


This method uses a hash function that differs from the Division method.
The function takes the form

H(k) = floor(m * (kA mod 1))

where 0 < A < 1 and kA mod 1 is the fractional part of kA.
Since 0 <= kA mod 1 < 1, H(k) ranges from 0 to m-1.

Advantage of Multiplication Method


The advantage of the multiplication method is that it works equally well with any table size m.
A should be chosen carefully.
Rational numbers should not be chosen for A.
An example of a good choice for A is

A = (√5 - 1)/2 ≈ 0.618

Obtain the Hash Codes for the keys


Keys: 2343, 4345, 6567, 3476, 1215
m=11, A=0.618, taken from A ≈ (√5 - 1)/2

Key    Computation                          H(k)
2343   floor(11 * (2343 * 0.618 mod 1))     10
4345   floor(11 * (4345 * 0.618 mod 1))     2
6567   floor(11 * (6567 * 0.618 mod 1))     4
3476   floor(11 * (3476 * 0.618 mod 1))     1
1215   floor(11 * (1215 * 0.618 mod 1))     9

MATLAB command: floor(11*mod((k*0.618),1))

Solution

Hash address   Key
1              3476
2              4345
4              6567
9              1215
10             2343
(all other cells are empty)
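A Python counterpart of the MATLAB one-liner is sketched here. The function name is hypothetical, and A is the rounded value 0.618 used in the exercise rather than the exact (√5 - 1)/2.

```python
import math

def multiplication_hash(k, m, A=0.618):
    """Multiplication method: floor(m * fractional_part(k * A))."""
    fractional = (k * A) % 1          # fractional part of k*A
    return math.floor(m * fractional)

keys = [2343, 4345, 6567, 3476, 1215]
for k in keys:
    print(k, multiplication_hash(k, 11))
```

Because the result depends on floating-point arithmetic, a key whose product m * (kA mod 1) falls extremely close to an integer could round differently across platforms; none of the keys in this exercise is near such a boundary.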

More on Collision Resolution

Quadratic Probing for Collision Resolution

Notes on Linear Probing


Linear probing is simple to program.
Linear probing has better locality of reference and hence better cache performance.

Primary Clustering in Linear Probing


Linear probing uses the probe sequence H+1, H+2, H+3 and so on to find space for a key whose primary hash value is H.
This leads to clustering of hash codes around some cells, called primary clustering.
The larger the cluster, the lower the search efficiency.

Uniform Hashing & Random Probing


If we generate Hash codes in a uniformly distributed manner with a larger table size, the process may avoid collisions.
Even if collisions occur, we may use a pseudo-random sequence to probe the locations.
But this approach reduces locality of reference, because the next probe location becomes effectively random.
So it is better to use a middle path between linear probing and random probing.

Quadratic Probing
Instead of traversing the hash table slots linearly after a collision, quadratic probing introduces more spacing between the slots we try.
This reduces the clustering effect seen in linear probing.
Clustering can still occur, because Quadratic Probing is not immune to it.
Quadratic Probing preserves some locality of reference and hence gives good cache performance, though lower than that of Linear Probing.

Hash Function for quadratic probing


H(k,i) = (H'(k) + c1*i + c2*i²) mod m
where c1 and c2 are constants (auxiliary constants) and H' is an auxiliary hash function, for example H'(k) = k mod m.
i = 0, 1, 2, ..., m-1 is called the probe number.


For a given Hash table, c1 and c2 remain constant.
Choices for c1 and c2 include c1 = c2 = 1/2; c1 = c2 = 1; and c1 = 0, c2 = 1.

Example
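c1 = c2 = 1/2
Take m = 11
Let the keys 46, 122, 222, 441 appear in this order for hashing.
46 mod 11 = 2, stored at 2
122 mod 11 = 1, stored at 1
222 mod 11 = 2 (occupied); i=1: (2 + 0.5*1 + 0.5*1²) mod 11 = 3
441 mod 11 = 1 (occupied); i=1: (1 + 0.5*1 + 0.5*1²) mod 11 = 2 (occupied); i=2: (1 + 0.5*2 + 0.5*2²) mod 11 = 4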

Exercise
Apply Quadratic Probing for the following Hash Addresses
78 mod 11 = 1
89 mod 11 = 1
111 mod 11 = 1
166 mod 11 = 1

Answer
Key   Home address      Probe                                       Hash address
78    78 mod 11 = 1     i=0                                         1
89    89 mod 11 = 1     i=1: (1 + 0.5*1 + 0.5*1²) mod 11            2
111   111 mod 11 = 1    i=2: (1 + 0.5*2 + 0.5*2²) mod 11            4
166   166 mod 11 = 1    i=3: (1 + 0.5*3 + 0.5*3²) mod 11            7
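Quadratic probing with c1 = c2 = 1/2 can be sketched as below. The function name is hypothetical and the auxiliary hash is assumed to be k mod m. With these constants the offset 0.5*i + 0.5*i² equals i*(i+1)/2, which is always an integer, so the sketch uses integer arithmetic.

```python
def quadratic_probe_insert(table, k):
    """Insert k using H(k,i) = (k mod m + i/2 + i*i/2) mod m.

    Assumes c1 = c2 = 1/2, so the offset is the integer i*(i+1)//2.
    Returns the cell used.
    """
    m = len(table)
    home = k % m
    for i in range(m):
        slot = (home + (i + i * i) // 2) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("no free cell found on this probe sequence")

table = [None] * 11
slots = [quadratic_probe_insert(table, k) for k in (78, 89, 111, 166)]
print(slots)
```

The gaps between successive probes grow (1, 2, 3, ...), which spreads colliding keys further apart than linear probing does.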

Notes
If two keys have the same initial probe position, then their probe sequences are the same, since H(k1, 0) = H(k2, 0) implies H(k1, i) = H(k2, i).
This property leads to a milder form of clustering called secondary clustering.

Clustering

Problems with Linear Probing


Linear probing leads to Primary Clustering: the hashed keys share substantial segments of a probe sequence, because all keys hashed to the same home position follow the same probe sequence.
The hash addresses that collide at a home address, say b, extend the cluster.

Primary Clustering
As we have seen, once a block of a few contiguous occupied positions emerges in the Hash Table, it becomes a target for subsequent collisions.
As clusters grow, they also merge to form larger clusters.
Primary clustering means that elements that hash to different cells probe the same alternative cells.
Clustering is reduced only if the hash addresses have different home positions.

Example
Suppose we have 10 Hash Codes with value 1 and 5 Hash Codes with value 2.
All these codes will cluster around 1 and 2.

Problems with Quadratic Probing


There could be clusters that join to form composite clusters.
This is called secondary clustering.
It happens because keys with the same home hash address follow the same probe sequence.
In Quadratic probing, too, the probe sequence is a function of the home position and not of the original key value.

Double hashing for Collision Resolution

Double Hashing
To avoid secondary clustering, we need a probe sequence that makes use of the original key value in its decision process.
This is achieved using Double Hashing, so called because the Hashing is done in two stages.
We use a second hash function as well, so as to reduce collisions.

Double Hashing
Let H1(k) and H2(k) be two hash functions for the same key k.
The hash address is obtained as

H(k,i) = {H1(k) + i*H2(k)} mod m   for the i-th probe

If the Table size m is a prime number, the above sequence is likely to access all locations in the Hash Table.

Notes
The functions H1(k) and H2(k) are auxiliary hash functions, which are selected like any hash function: so that the Keys are distributed in a uniform and random manner.

Example 1
We let H1(k) = k mod m and H2(k) = 1 + (k mod m'), where m' is slightly less than m, say m-1 or m-2.
For example, m=11 and m'=9.

Example 2
First use the Mid Square Method and then use Modulo Division.

Double hashing
Double hashing can be used to avoid primary and secondary clustering.
H2(k) must be chosen with care.
m and H2(k) must be relatively prime; this can be ensured by making m a prime number.
If m is a power of two, choose an H2(k) that is always odd.
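Why m and H2(k) should be relatively prime can be seen by listing the cells a probe sequence visits. Both function names are hypothetical, and the sample values are chosen only for illustration.

```python
from math import gcd

def probe_sequence(h1, h2, m):
    """Cells visited by (h1 + i*h2) mod m for i = 0 .. m-1."""
    return [(h1 + i * h2) % m for i in range(m)]

def visits_every_cell(h1, h2, m):
    return len(set(probe_sequence(h1, h2, m))) == m

# Illustrative values: prime m, power-of-two m with even step, then odd step.
for h1, h2, m in [(5, 2, 11), (0, 4, 8), (0, 3, 8)]:
    print(f"m={m}, H2={h2}, gcd={gcd(h2, m)}, "
          f"visits every cell: {visits_every_cell(h1, h2, m)}")
```

With m=8 and a step of 4 the sequence only alternates between two cells, so an insert could fail even when the table has free space; an odd step with m=8, or any step from 1 to m-1 with prime m, reaches every cell.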

Example
Generate Hash Codes using Double Hashing for the following: 2227, 3545, 4537, 8981, 7857, 3433, 6965
Use the Division Method with H1(k) = k mod m and H2(k) = 1 + (k mod m')
We have H(k,i) = {H1(k) + i*H2(k)} mod m
Use m=11 and m'=9

Steps
First generate Hash codes with H1(k) = k mod m using m=11.
Then apply the second hash function wherever Collisions occur, taking m'=9.

Step 1-Answer
2227 mod 11 = 5
3545 mod 11 = 3
4537 mod 11 = 5
8981 mod 11 = 5
7857 mod 11 = 3
3433 mod 11 = 1
6965 mod 11 = 2

Step 2
To resolve collisions, use the second Hash Function for the keys that collide (Hash Code 5 occurs three times and Hash Code 3 twice) and see how the mapping evolves.

Answer-Step 2
Key    H1(k) = k mod 11   H2(k) = 1 + (k mod 9)
2227   5                  5
3545   3                  9
4537   5                  2
8981   5                  9
7857   3                  1
3433   1                  5
6965   2                  9

Step 3
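Key    H2(k)   Probe               Hash Address
2227   5       5                   5
3545   9       3                   3
4537   2       (5 + 1*2) mod 11    7
8981   9       (5 + 2*9) mod 11    1
7857   1       (3 + 1*1) mod 11    4
3433   5       (1 + 1*5) mod 11    6
6965   9       2                   2

For 8981 the first probe, (5 + 1*9) mod 11 = 3, is already occupied by 3545, so the second probe is used.
For 3433 the home address 1 is already occupied by 8981, so one probe is needed.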
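The three steps can be combined into one insert routine, sketched here. The function name and the parameter name `m_prime` are hypothetical; the hash functions, keys and table sizes are those of the example.

```python
def double_hash_insert(table, k, m_prime):
    """Insert k using H(k,i) = (H1(k) + i*H2(k)) mod m, where
    H1(k) = k mod m and H2(k) = 1 + (k mod m_prime). Returns the cell used."""
    m = len(table)
    h1 = k % m
    h2 = 1 + (k % m_prime)
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("no free cell found on this probe sequence")

keys = [2227, 3545, 4537, 8981, 7857, 3433, 6965]
table = [None] * 11
placed = {k: double_hash_insert(table, k, 9) for k in keys}
print(placed)
```

The three keys with home address 5 end up at cells 5, 7 and 1, because each one steps by its own H2(k); under linear or quadratic probing they would have followed the same probe sequence.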

Sparse Matrices
