Professional Documents
Culture Documents
U
(universe of keys) h(k1)
h(k4)
K k1
h(k2) = h(k5)
(actual k4 k2
keys) Collisions!
k5 k3 h(k3)
m-1
7
8 Collisions
Two or more keys hash to the same location
For a given set K of keys
If |K| ≤ m, collisions may or may not happen, depending on the
hash function
If |K| > m, collisions will definitely happen (i.e., there must be at
least two keys that have the same hash value)
Avoiding collisions completely is hard, even with a good hash
function
9 Collision Resolution
Methods:
Separate Chaining (open hashing)
Open addressing (closed hashing)
Linear probing
Quadratic probing
Double hashing
10
Slot j contains a pointer to the head of the list of all elements that hash to j
Collision with Chaining - Discussion
15
Analysis of Hashing with Chaining:
16 Average Case
Average case
depends on how well the hash function distributes the n keys among the
m slots
T
Simple uniform hashing assumption:
n0 = 0
Any given element is equally likely to hash into any of the m slots
(i.e., probability of collision Pr(h(x)=h(y)), is 1/m)
n2
Length of a list: n3
T[j] = nj, j = 0, 1, . . . , m – 1 nj
Number of keys in the table:
nk
n = n0 + n1 +· · · + nm-1
Average value of nj: nm – 1 = 0
E[nj] = α = n/m
Load Factor of a Hash Table
17
Case 1: Unsuccessful Search
18 (i.e., item not stored in the table)
Theorem
An unsuccessful search in a hash table takes expected time under the(1 )
assumption of simple uniform hashing
(i.e., probability of collision Pr(h(x)=h(y)), is 1/m)
Proof
Searching unsuccessfully for any key k
need to search to the end of the list T[h(k)]
n = O(m)
Minimize the chance that closely related keys hash to the same slot
Derive a hash value that is independent from any patterns that may exist in the
distribution of the keys
Hash keys such as 199 and 499 should hash to different slots
The Division Method
23
Idea:
p=1m=2
h(k) = {0, 1} , least significant 1 bit of k
p=2m=4
h(k) = {0, 1, 2, 3} , least significant 2 bits of k
Key k is squared.
h(k) = l
l is obtained by deleting digits from both
ends of k2
Same positions of
k2 are used for all the keys
H(k) 72 93 99
28
The Folding Method
Idea:
H(k)
37 19 68
H(k)
82 55 77 second part is reversed
Open Addressing
If we have enough contiguous memory to store all the keys (m > N) store the keys in
the table itself
It is called “open” because the address where key k is stored also
e.g., insert 14
depends on keys already stored in hash table along with h(k)
It is also called closed hashing
probe sequence!
30 Common Open Addressing Methods
Linear probing
Quadratic probing
Double hashing
Linear probing: Inserting a key
31
Idea: when there is a collision, check the next available
position in the table (i.e., probing)
5 5 5 5
6 6 6 6
1 1 3 2
probes:
Linear probing: Searching for a key
probe the next higher index until the element is
found (successful search) or an empty position 0
is found (unsuccessful search)
h(k1)
The process wraps around to the beginning of
h(k4)
the table
h(k2) = h(k5)
h(k3)
m-1
33
34 Deletion in Closed Hashing
delete(2) find(7)
00 00
11 11 Where is it?!
22 2
37 37
4 4
5 5
6 6
Solution 5 98
6
Mark the slot with a sentinel value DELETED
7 72
The deleted slot can later be used for insertion 8
Searching will be able to find all the keys 9 14
10
11 50
12
45, 69 and 98 are hashed to same hash
address 3
Delete 69
36 Lazy Deletion
delete(2) find(7)
00 00 Indicates deleted value:
if you find it, probe again
11 11
22 2#
37 37
4 4
5 5
6 6
Linear probing: Primary Clustering Problem
Some slots become more likely than others
if a bunch of elements hash to the same area of the table, they mess each other up! (Even
though the hash function isn’t producing lots of collisions!)
Slot b:
2/m
Slot d:
4/m
Slot e:
5/m
4 4 4 4
21 21
5 5 5 5
6 6 6 6
1 1 3 1
probes:
Problem With Quadratic Probing
40
4 4 4 4 4
21 21 21
5 5 5 5 5
6 6 6 6 6
1 1 3 1 ??
probes:
41 Double Hashing
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the increment for the
probe sequence
(h1(k) + i h2(k) ) mod m, i=0,1,...
Initial probe: h1(k)
Second probe is offset by h2(k) mod m, so on ...
Advantage: avoids clustering
Disadvantage: harder to delete an element
42 Double Hashing: Example
h1(k) = k mod 13
0
h2(k) = 1+ (k mod 11) 1 79
0 0 0 0 0
14 14 14 14 14
1 1 1 1 1
8 8 8 8
2 2 2 2 2
2 2
3 3 3 3 3
4 4 4 4 4
21 21 21
5 5 5 5 5
6 6 6 6 6
1 1 2 1 ??
probes:
Double Hashing Example
44
insert(14) insert(8) insert(21) insert(2) insert(7)
14%7 = 0 8%7 = 1 21%7 =0 2%7 = 2 7%7 = 0
5-(21%5)=4 5-(21%5)=4
0 0 0 0 0
14 14 14 14 14
1 1 1 1 1
8 8 8 8
2 2 2 2 2
2 2
3 3 3 3 3
4 4 4 4 4
21 21 21
5 5 5 5 5
7
6 6 6 6 6
1 1 2 1 4
probes:
Theoretical Results
•Let = n/m
the load factor: average number of keys per array index
•Analysis is probabilistic, rather than worst-case
45
46 Analysis of Double Hashing
Example
Unsuccessful retrieval:
=0.5 E(#steps) = 2
=0.9 E(#steps) = 10
Successful retrieval:
=0.5 E(#steps) = 3.387
=0.9 E(#steps) = 3.670
Rehashing
47
0 1 2 3 4
=1 25 37 83
52 98
0 1 2 3 4 5 6 7 8 9 10
=5/11
25 37 83 52 98
49 Case Study
Array +
Closed hashing
binary search Separate chaining
…
n words