
Hashing

By
Pravakar Ghimire
• Hashing is an efficient searching technique in which each key is
placed at a directly accessible address for rapid lookup.
• Hashing provides direct access to a record in the file no matter
where the record is stored, which eliminates unnecessary
comparisons.
• The technique uses a hash function, say h, which maps a key to
its corresponding address or location.
Hashing Performance
There are three factors that influence the
performance of hashing:
Hash function
– should distribute the keys and entries evenly
throughout the entire table
– should minimize collisions
Hashing Performance contd.
Collision resolution strategy
– Open Addressing: store the key/entry in a different position
– Separate Chaining: chain several keys/entries at the same
position
Table size
– Too large a table wastes memory
– Too small a table causes more collisions and eventually forces
rehashing (creating a new hash table of larger size and
copying the contents of the current hash table into it)
– The size should be appropriate to the hash function used and
should typically be a prime number. Why?
Hash Function
• A function that transforms a key into a table
index is called a hash function.
• A common hash function is
h(x)=x mod SIZE
Some Popular Hash Functions
• The hash function converts the key into the table position. It can be carried out
using:
• Modular Arithmetic (the one we saw on the previous slide): Compute the index by
dividing the key by some value and using the remainder as the index. This forms
the basis of the next two techniques. For example: index := key MOD table_size
• Truncation: Ignore part of the key and use the rest as the array index. The
problem with this approach is that the keys may not distribute evenly
throughout the table. For example: if a student id such as 928324312 is the key,
select just the last three digits, i.e. 312, as the index. => the table size
has to be at least 1000 (indices 000–999). Why?
• Folding: Partition the key into several pieces and then combine them in some
convenient way (e.g. by adding the pieces together).

.......and a lot more for which I’m too lazy to describe them here, see it yourself 
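For concreteness, the three schemes above might look like this in Python (a sketch; the function names and the sample table size are my own choices, not standard APIs):

```python
TABLE_SIZE = 97  # a prime, following the earlier advice on table size

def h_modular(key):
    # Modular arithmetic: the remainder of the key is the index.
    return key % TABLE_SIZE

def h_truncation(key):
    # Truncation: keep only the last three digits of the key.
    return key % 1000              # e.g. 928324312 -> 312

def h_folding(key):
    # Folding: split the key into 3-digit pieces and add them up.
    total = 0
    while key > 0:
        total += key % 1000        # take the lowest piece
        key //= 1000               # drop it and continue
    return total % TABLE_SIZE      # 312 + 324 + 928 = 1564, then mod 97
```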
Desired Properties in Hash Function

A good hash function should:


– Minimize collisions.
– Be easy and quick to compute.
– Distribute key values evenly in the hash table.
– Use all the information provided in the key.
Hash Collision
• If two or more records hash to the same
index of the hash table, the situation is
called a hash collision.
Collision Resolution
Some popular methods for dealing with collisions are:
– Closed Hashing/Open Addressing (Probing)
✔ Linear probing
✔ Quadratic probing
✔ Double hashing
– Open Hashing/Closed Addressing
✔ Chaining
✔ Hashing using buckets
– Rehashing
Closed Hashing/Open Addressing (Probing)

• Open addressing / probing is used for insertion
into fixed-size hash tables. If the index
given by the hash function is occupied,
the table position is advanced by some
increment until an empty slot is found.
Open Addressing contd.....
There are three schemes commonly used for probing:
– Linear Probing:
A hash table in which a collision is resolved by putting
the item in the next empty place within the
array. Starting from the location where the collision
occurred, it does a sequential search through the hash
table for the nearest empty location. Because this
method searches in a straight line, it is
called linear probing.
Linear Probing Contd.....
index := hash(key)
start := index
while Table(index) is occupied do
    index := (index + 1) MOD Table_Size
    if index = start then
        return table_full
Table(index) := Entry
Linear Probing
• In linear probing, collisions are resolved by
sequentially scanning an array (with
wraparound) until an empty cell is found.
– i.e. the i-th probe offset f(i) is a linear function of i, typically f(i) = i.
• Example:
– Insert items with keys: 89, 18, 49, 58, 9 into an
empty hash table.
– Table size is 10.
– Hash function is hash(x) = x mod 10.
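The example can be traced with a short Python sketch (illustrative code, not from the slides; `None` marks an empty cell):

```python
def linear_probe_insert(table, key):
    """Insert key by scanning sequentially (with wraparound) for a free cell."""
    size = len(table)
    home = key % size                  # hash(x) = x mod table size
    for step in range(size):           # at most `size` probes
        index = (home + step) % size   # wrap around the array
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("table full")

table = [None] * 10
for k in (89, 18, 49, 58, 9):
    linear_probe_insert(table, k)
# 49, 58 and 9 all collide at cell 9 and walk forward to cells 0, 1 and 2
```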
(Figure: linear probing hash table after each insertion.)
Clustering Problem
• As long as the table is big enough, a free cell can
always be found, but the time to do so can get
quite large.
• Worse, even if the table is relatively empty,
blocks of occupied cells start forming.
• This effect is known as primary clustering.
• Any key that hashes into the cluster will
require several attempts to resolve the
collision, and then it will add to the cluster.
Quadratic Probing
Quadratic probing increments the position computed
by the hash function in quadratic fashion, i.e. by
1, 4, 9, 16, .... It is a collision resolution
method that eliminates the primary clustering
problem that occurs in linear probing.
Quadratic Probing Contd....
When a collision occurs, quadratic probing works
as follows:
(hash value + 1^2) % table size
if a collision occurs again:
(hash value + 2^2) % table size
if a collision occurs again:
(hash value + 3^2) % table size
in general, on the ith collision:
hi(x) = (hash value + i^2) % table size
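The same idea as a Python sketch (illustrative; it reuses the earlier keys 89, 18, 49, 58, 9 and table size 10 just to show the probe sequence):

```python
def quadratic_probe_insert(table, key):
    """Probe h, h + 1, h + 4, h + 9, ... (mod table size) until a cell is free."""
    size = len(table)
    h = key % size
    for i in range(size):
        index = (h + i * i) % size     # i-th probe offset is i^2
        if table[index] is None:
            table[index] = key
            return index
    # can fail even with empty cells left if the table is over half full
    raise RuntimeError("no free cell found")

table = [None] * 10
for k in (89, 18, 49, 58, 9):
    quadratic_probe_insert(table, k)
# 49 moves to cell 0, 58 to cell 2 (offset 4), 9 to cell 3 (offset 4)
```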
(Figure: a quadratic probing hash table after each insertion; note that
the table size was poorly chosen because it is not a prime number.)

CENG 213 Data Structures 19


Quadratic Probing
• Problem:
– We may not be sure that we will probe all locations in the
table (i.e. there is no guarantee of finding an empty cell if the
table is more than half full.)
– If the hash table size is not prime, this problem is
much more severe.
• However, there is a theorem stating that:
– If the table size is prime and load factor is not larger than
0.5, all probes will be to different locations and an item can
always be inserted.
Analysis of Quadratic Probing
• Quadratic probing has not yet been mathematically
analyzed.
• Although quadratic probing eliminates primary
clustering, elements that hash to the same location
will probe the same alternative cells. This is known as
secondary clustering.
• Techniques that eliminate secondary clustering are
available.
– the most popular is double hashing.
Double Hashing
Compute the index as a function of two different
hash functions.
• One important problem with linear probing is
clustering
– as collisions start to occur, then blocks of contiguous
occupied bins (clusters) appear.
• And a quite unfortunate aspect is that the longer
these clusters, the longer our searches or insertions
(or deletions) will take (and remember that we
wanted them to be constant time and fast!)
Double Hashing
• Define h1(k) to map keys to a table location,
but also define h2(k) to produce a
probing step size:
– First look at h1(k)
– Then if it is occupied, look at h1(k) + h2(k)
– Then if it is occupied, look at h1(k) + 2*h2(k)
– Then if it is occupied, look at h1(k) + 3*h2(k)
(all positions taken mod the table size)
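In Python this probe sequence might look as follows (a sketch; the step function h2(k) = R - (k mod R) is a common textbook choice, and the constants SIZE and R are illustrative, not taken from the slides):

```python
SIZE = 11   # a prime table size
R = 7       # a prime smaller than the table size

def h1(k):
    return k % SIZE

def h2(k):
    return R - (k % R)     # step size in 1..R; never 0, so probes always advance

def double_hash_insert(table, key):
    for i in range(len(table)):
        index = (h1(key) + i * h2(key)) % len(table)   # h1, h1+h2, h1+2*h2, ...
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("table full")

table = [None] * SIZE
double_hash_insert(table, 13)   # h1(13) = 2, cell is free
double_hash_insert(table, 24)   # h1(24) = 2 collides; step h2(24) = 4 -> cell 6
```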
Open Hashing/Close Addressing
(Chaining)
• The idea is to keep a list of all elements that hash to
the same value.
– The array elements are pointers to the first nodes of the
lists.
– A new item is inserted to the front of the list.
• Advantages:
– Better space utilization for large items.
– Simple collision handling: searching linked list.
– Overflow: we can store more items than the hash table
size.
– Deletion is quick and easy: deletion from the linked list.
Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.
Resulting chains (new items at the front of each list):
0: 0
1: 81 → 1
2: (empty)
3: (empty)
4: 64 → 4
5: 25
6: 36 → 16
7: (empty)
8: (empty)
9: 49 → 9
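The chains above can be rebuilt with a list-of-lists table (a minimal sketch; new items go to the front of each list, as the previous slide describes):

```python
NBUCKETS = 10
table = [[] for _ in range(NBUCKETS)]

def chain_insert(key):
    # A new item goes to the front of its bucket's list.
    table[key % NBUCKETS].insert(0, key)

def chain_search(key):
    # Only the one chain selected by the hash needs to be scanned.
    return key in table[key % NBUCKETS]

for k in (0, 1, 4, 9, 16, 25, 36, 49, 64, 81):
    chain_insert(k)
# bucket 9 now holds the chain 49 -> 9, bucket 6 holds 36 -> 16, etc.
```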
Analysis of Separate Chaining
• Collisions are very likely.
– How likely and what is the average length of lists?
• Load factor λ definition:
– Ratio of the number of elements (N) in a hash table to
the hash table size.
• i.e. λ = N/TableSize
– The average length of a list is also λ.
– For chaining λ is not bounded by 1; it can be > 1.



Rehashing
• For open addressing/probing, time also depends on the load factor (α)
• As α approaches 1, the expected number of comparisons gets very large
– capped at the tableSize (i.e. O(n))
• Using a dynamic array (vector) we can simply allocate a bigger
table
– Don't just double it, because we want the tableSize to be prime
• Can we just copy items over to the corresponding location in the
new table?
– No, because the hash function usually depends on tableSize, so
we need to re-hash each entry to its correct location
– h(k) = k % 13 != h'(k) = k % 17 (e.g. k = 15)
• Books suggest increasing the table size to keep α < 2/3
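Rehashing can be sketched as below (illustrative; `next_prime` is a helper of my own, and the table uses linear probing as described earlier):

```python
def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def next_prime(n):
    while not is_prime(n):
        n += 1
    return n

def rehash(old_table):
    """Grow to a prime at least twice the old size and re-insert every entry."""
    new_size = next_prime(2 * len(old_table) + 1)
    new_table = [None] * new_size
    for key in old_table:
        if key is None:
            continue
        index = key % new_size                # position depends on the NEW size
        while new_table[index] is not None:   # linear probing on collision
            index = (index + 1) % new_size
        new_table[index] = key
    return new_table

old = [None] * 13
old[15 % 13] = 15          # k = 15 sits at index 2 when the size is 13
new = rehash(old)          # new size is 29; 15 must be re-hashed to index 15
```

This shows why a plain copy is wrong: the same key occupies different slots under h(k) = k % 13 and h'(k) = k % 29.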
Extendible Hashing
– Keys are stored in buckets.
– Each bucket can hold only a fixed number of
items.
– The index is an extendible table;
h(x) hashes a key value x to a bit string;
only a portion of that bit string is used to build
a directory.
How to minimize the rehashing cost?
• Use another level of indirection:
– a directory of pointers to buckets.
• Doubling the number of buckets then only doubles the
directory (no rehashing of existing entries).
– Split just the bucket that overflowed!
• What is the cost here?
(Figure: inserting 0 (00); directory entries 00, 01, 10, 11 point to
Bucket A (4, 12, 32, 8), Bucket B (1, 5, 21, 13), Bucket C, and
Bucket D (15, 7, 19).)
Example
• Directory is an array of size 4.
• What info does this hash data
structure maintain?
– It needs to know the current number of bits
used by the hash function
• Global Depth
– It may need to know the number of bits used
to hash into a specific bucket
• Local Depth
– Can global depth < local depth?
Points to remember
• What is the global depth of directory?
– Max # of bits needed to tell which bucket an entry belongs to.
• What is the local depth of a bucket?
– # of bits used to determine if an entry belongs to this bucket.
• When does bucket split cause directory doubling?
– Before insert, bucket is full & local depth = global depth.
• How to efficiently double the directory?
– The directory is doubled by copying it over and 'fixing' the
pointer to the split image page.
– Note that you can do this only by using the least significant bits in
the directory.
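These rules can be sketched in Python (a minimal illustration assuming least-significant-bit indexing; the class design, method names, and bucket size are my own, not from the slides):

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # bits used to belong to this bucket
        self.items = []

class ExtendibleHash:
    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self.global_depth = 1            # bits currently used by the directory
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        # Use the global_depth least significant bits of the key.
        return key & ((1 << self.global_depth) - 1)

    def search(self, key):
        return key in self.directory[self._index(key)].items

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.items) < self.bucket_size:
            bucket.items.append(key)
            return
        # Full bucket: double the directory only when local depth = global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory  # copy it over
            self.global_depth += 1
        # Split just the bucket that overflowed.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)   # the new distinguishing bit
        for i, b in enumerate(self.directory):
            if b is bucket and (i & bit):     # 'fix' pointers to the split image
                self.directory[i] = sibling
        pending, bucket.items = bucket.items + [key], []
        for k in pending:                     # redistribute the old entries
            self.insert(k)
```

With bucket size 2, inserting the keys 0 through 15 grows the global depth to 3 while every key stays findable, and each overflow splits only one bucket.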
