ds23 7

Chapter 4
Hash Tables
COMP2015 HKBU/CS/JF/2023 1
Motivation
n Binary search tree and AVL tree can be used to implement
Insert and Find operations
q The premise is that the data should be sorted in advance
n Objective:
Based on a table named hash table, hashing is a technique

used for performing
q Insertions
q Deletions
q Finds
in constant time “on an average” without sorting the data
Hash Table
n A hash table is a kind of symbol table.
n A symbol table is like a dictionary, with which you can look

up a word for its meaning.
Symbol Table
n Formally, a symbol table is a collection of 〈key, value〉 pairs.

q “Key” and “value” can be anything.
q “Value” can be composed of multiple parts.
Key Value
23123456 Chan Yat, Male, 96868686
23234567 Cheung Wai Man, Male, 62323232
23345678 Lee Wing Man, Female, 94343434
23456789 Wong Siu Ming, Male, 61700800
… ...
Hash Function
n Hash Function: a function h that maps a search key to an
integer index.
q e.g., h(“06053344”) = 36
“06053344”
(a search key)
q A hash function:
n Should be simple to compute;
n Should ensure that any two distinct keys get different

cells;
n Should distribute the keys evenly among the cells.
Hash Table
n A symbol table that uses a hash function is called a hash
table.
“06053344”
h(“06053344”)
n Hash table is a fixed size (TableSize) array containing keys.

n Each key is mapped into some number in the range 0 to
TableSize - 1, and placed in the appropriate cells.
Collision in Hashing
n A hash function can sometimes return the same index
(called hash value) for two different keys.
h(“06053344”) = 36
h(“06059999”) = 36
…
n Having two or more keys hash to the same index is called
collision.
n Hashing basic plan:

q create an array for the items to be stored
q use a hash function to figure out storage location from

key
q a collision resolution scheme is necessary
Hash Function
Usually, m << N.
h(Ki) = an integer in [0, …, m-1], the hash value of Ki
Hash Function
n The division method for numeric keys

q h(k) = k mod m
where m is the table size and a prime number.
q e.g. m = 11, k = 100, h(k) = 1

Requires only a single division operation (quite fast)
n Can the keys be strings?

n Non-numeric keys are converted to numbers
Hash Function
n Method 1
q Add up the ASCII values of the characters in the string
q e.g. the string “HongKong” becomes 795
(72+111+110+103+75+111+110+103)
q Problems:
n Different permutations of the same set of characters
would have the same hash value
n If the table size is large, the keys are not distributed well.
n e.g. If m=10007 and all the keys are eight or fewer

characters long. Since ASCII value <= 127, the hash
function can only assume values between 0 and
127*8=1016
Hash Function
n Method 2
Represent the first three characters s1s2s3 in the string by the
ASCII code, choose a number r, the integer value for the
string is
ASCII (s1) + ASCII(s2)r + ASCII(s3)r2
Collision Handling: Separate Chaining
n Instead of a hash table, we use a table of linked list

n keep a linked list of keys that hash to the same value
h(K) = K mod 10
Separate Chaining
n To insert a key K
q Compute h(K) to determine which list to traverse, then
insert in the linked list
n To delete a key K
q Compute h(K), then search for K within the list. Delete K
if it is found.
n To find a key K, compute h(K), then search in the linked list
n Disadvantage: Memory allocation in linked list manipulation

will slow down the program.
Collision Handling: Open Addressing
n Open addressing:
q Relocate the key K to be inserted if it collides with an
existing key. That is, we store K at a different entry.
n Two issues arise

q What is the relocation scheme?
q How to search for K later?
n Three common methods for resolving a collision in open

addressing
q Linear probing
q Quadratic probing
q Double hashing
Open Addressing
n To insert a key K, compute h0(K)
- If the cell is empty, insert it there
- If collision occurs, probe alternative cell h1(K), h2(K), ....

until an empty cell is found,
where
hi(K) = (hash(K) + f(i)) mod m, with f(0) = 0
f: collision resolution strategy
Linear Probing
n f(i) = i
q cells are probed sequentially (with wraparound)
q hi(K) = (hash(K) + i) mod m
n Insertion:
q Let K be the new key to be inserted. We compute hash(K)
q For i = 0 to m-1
n compute L = ( hash(K) + i ) mod m
n If the cell is empty, then we put K there and stop.
q If we cannot find an empty entry to put K, it means that the
table is full and we should report an error.
Linear Probing
Question: Insert the following values into the hash table using
hash(K)=K mod 10 (table size) and linear probing to resolve
collisions
1, 5, 11, 7, 12, 17, 6, 25
0 1 2 3 4 5 6 7 8 9
111 12 25
5 6 17
7
Linear Probing
n hi(K) = (hash(K) + i) mod m

n Example: inserting keys 89, 18, 49, 58, 69 with hash(K)=K mod 10
To insert 58, probe

T[8], T[9], T[0],
T[1]
To insert 69, probe

T[9], T[0], T[1],
T[2]
Primary Clustering
n We call a block of contiguously occupied table entries a

cluster. The larger the cluster, the slower the performance.
n Linear probing has the following disadvantages:

q Once h(K) falls into a cluster, this cluster will definitely
grow in size by one. Thus, this may worsen the
performance of insertion in the future.
q If two clusters are only separated by one entry, then
inserting one key into a cluster can merge the two clusters
together. Thus, the cluster size can be increased drastically
by a single insertion. This means that the performance of
insertion can deteriorate drastically after a single insertion.
q Large clusters are easy targets for collisions.
Primary Clustering in Linear
Probing
n Linear probing tends to create long sequences of filled cells

in a hash table.
n This effect is called primary clustering.
n Primary clustering degrades the performance of a hash table.
n Quadratic probing can be used to address the primary

clustering problem.
Quadratic Probing
n f(i) = i2
n hi(K) = ( hash(K) + i2 ) mod m
n Example: inserting keys 89, 18, 49, 58, 69 with hash(K) = K
mod 10
To insert 58, probe

T[8], T[9], T[(8+4)
mod 10]
To insert 69, probe

T[9], T[(9+1) mod
10], T[(9+4) mod
10]
Quadratic Probing
n Two keys with different home positions will have different

probe sequences
q e.g., m=101, h(k1)=30, h(k2)=29
q probe sequence for k1: 30, 30+1, 30+4, 30+9
q probe sequence for k2: 29, 29+1, 29+4, 29+9
n If the table size is prime, a new key can always be inserted if

the table is at least half empty.
Quadratic Probing
n If the table size is not prime?

q The number of alternative locations can be severely
reduced.
q e.g., m=16, h(k1)=h0
q probe sequence for k1: h0, h0 + 1, h0 + 4, h0 + 9, ……
q The only alternative locations would be at distances 1, 4,

or 9 away.
n If the table size is a prime of the form 4k+3, the entire table
can be probed.
Quadratic Probing
n Secondary clustering
q Keys that hash to the same home position will probe the
same alternative cells
q To avoid secondary clustering, the probe sequence needs

to be a function of the original key value, not the home
position
Tutorial
Insert the following keys into the hash table using hash(K)=K
mod m (table size) and linear/quadratic probing to resolve
collisions
19, 39, 29, 9, 79, 49, 69
0 1 2 3 4 5 6 7 8 9
Double Hashing
n To alleviate the problem of clustering, the sequence of

probes for a key should be independent of its primary
position
=> use two hash functions: hash() and hash2()
n The idea is that even if two keys hash to the same value
using hash(), they will have different values using hash2().
n f(i) = i * hash2(K)
q e.g., hash2(K) = R - (K mod R), where R is a prime
number smaller than m
q hash2 should return a nonzero integer.
Double Hashing
n hi(K) = ( hash(K) + f(i) ) mod m; hash(K) = K mod m
n f(i) = i * hash2(K); hash2(K) = R - (K mod R),
n Example: m=10, R = 7 and insert keys 89, 18, 49, 58, 69
To insert 49,
hash2(49)=7, 2nd
probe is T[(9+7)
mod 10]
To insert 58,
hash2(58)=5, 2nd
probe is T[(8+5)
mod 10]
To insert 69,
hash2(69)=1, 2nd
probe is T[(9+1)
mod 10]
Rehashing
n No matter what collision resolution scheme is used, the

performance of an open addressing hash table degrades when
the table is becoming full.
n Build another table, twice as big (and prime).

n The hash function has to be changed.
n All entries in the hash table have to be re-entered using the
new hash function.
n This process is called rehashing.
n Note: Rehashing is rarely necessary for most applications.
Example Applications
n Compilers use hash tables (symbol tables) to keep track of

declared variables.
n On-line spelling checkers: after prehashing the entire

dictionary, one can check each word in constant time and
print out the misspelled word in order of their appearance
in the document.
Issues
n What do we lose?
q Operations that require ordering are inefficient
q FindMax
q FindMin
q PrintSorted
n What do we gain?
q Insert: O(1)
q Delete: O(1)
q Find: O(1)
n How to handle collision?

q Separate chaining
q Open addressing
Tutorial
v [Hash Table – More Collision Resolution] Suppose you
have a hash table of size 10, with initial hash function h(x) =
2x. Insert the following keys into the hash table:
2, 12, 3, 10, 5, 7
v (a) Have your hash table using linear probing to resolve
collisions.
v (b) Have your hash table using quadratic probing to resolve
collisions.
v (c) Have your hash table using double hashing, with second
hash function f(x) = x + 1.

ds23 7

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

ds23 7

Uploaded by

Copyright:

Available Formats

Chapter 4

Based on a table named hash table, hashing is a technique

in constant time “on an average” without sorting the data

n A symbol table is like a dictionary, with which you can look

n Formally, a symbol table is a collection of 〈key, value〉 pairs.

q “Value” can be composed of multiple parts.

n Should ensure that any two distinct keys get different

n Hash table is a fixed size (TableSize) array containing keys.

n Hashing basic plan:

q use a hash function to figure out storage location from

h(Ki) = an integer in [0, …, m-1], the hash value of Ki

n The division method for numeric keys

where m is the table size and a prime number.

q e.g. m = 11, k = 100, h(k) = 1

n Can the keys be strings?

q e.g. the string “HongKong” becomes 795

n e.g. If m=10007 and all the keys are eight or fewer

n Instead of a hash table, we use a table of linked list

n To find a key K, compute h(K), then search in the linked list

n Disadvantage: Memory allocation in linked list manipulation

n Two issues arise

q How to search for K later?

n Three common methods for resolving a collision in open

n To insert a key K, compute h0(K)

- If the cell is empty, insert it there

- If collision occurs, probe alternative cell h1(K), h2(K), ....

q hi(K) = (hash(K) + i) mod m

n compute L = ( hash(K) + i ) mod m

n If the cell is empty, then we put K there and stop.

q If we cannot find an empty entry to put K, it means that the

table is full and we should report an error.

n hi(K) = (hash(K) + i) mod m

To insert 58, probe

To insert 69, probe

n We call a block of contiguously occupied table entries a

n Linear probing has the following disadvantages:

n Linear probing tends to create long sequences of filled cells

n This effect is called primary clustering.

n Primary clustering degrades the performance of a hash table.

n Quadratic probing can be used to address the primary

To insert 58, probe

To insert 69, probe

n Two keys with different home positions will have different

q probe sequence for k1: 30, 30+1, 30+4, 30+9

q probe sequence for k2: 29, 29+1, 29+4, 29+9

n If the table size is prime, a new key can always be inserted if

n If the table size is not prime?

q e.g., m=16, h(k1)=h0

q probe sequence for k1: h0, h0 + 1, h0 + 4, h0 + 9, ……

q The only alternative locations would be at distances 1, 4,

q To avoid secondary clustering, the probe sequence needs

n To alleviate the problem of clustering, the sequence of

n No matter what collision resolution scheme is used, the

n Build another table, twice as big (and prime).

n Note: Rehashing is rarely necessary for most applications.

n Compilers use hash tables (symbol tables) to keep track of

n On-line spelling checkers: after prehashing the entire

n How to handle collision?

You might also like