You are on page 1of 33

Data Analysis and Applications

Lecture 12 : Hash Tables


Dongjin Ji
Maps and Dictionaries

2
MAPS

A map is a searchable collection of items that are key-value pair


s
The main operations of a map are for searching, inserting, and d
eleting items
Multiple items with the same key are not allowed
Applications:
address book
student-record database

3
DICTIONARIES

Python’s dict class is arguably the most


significant data structure in the language.
It represents an abstraction known as a dictionary in
which unique keys are mapped to associated values.
Here, we use the term “dictionary” when
specifically discussing Python’s dict class, and the
term “map” when discussing the more general
notion of the abstract data type.

4
THE MAP ADT
(USING DICT SYNTAX)

5
MORE MAP OPERATIONS

6
A FEW MORE MAP OPERATIONS

7
EXAMPLE

8
A Simple List-Based Map

We can efficiently implement a map using an


unsorted list
We store the items of the map in a list S (based
on a doubly-linked list), in arbitrary order

header nodes/positions trailer

9 c 6 c 5 c 8 c
Items (key-value pairs)

9
OUR MAPBASE CLASS

10
THE MAPBASE ABSTRACT CLASS

11
AN UNSORTED LIST IMPLEMENTATION

12
Performance of a List-Based Map

Performance:
Inserting an item takes O(1) time since we can insert the new
item at the beginning or at the end of the unsorted list
Searching for or removing an item takes O(n) time, since in
the worst case (the item is not found) we traverse the entire
listto look for an item with the given key
The unsorted list implementation is effective only for
maps of small size or for maps in which insertions are
the most common operations, while searches and
removals are rarely performed (e.g., historical record
of logins to a workstation)
13
Hash Tables

0 ∅
1 025-612-0001
2 981-101-0002
3 ∅
4 451-229-0004

14
RECALL THE NOTION OF A MAP

Intuitively, a map M supports the abstraction of


using keys as indices with a syntax such as M[k].
As a mental warm-up, consider a restricted setting
in which a map with n items uses keys that are
known to be integers in a range from 0 to N − 1,
for some N ≥ n.

15
MORE GENERAL KINDS OF KEYS

But what should we do if our keys are not


integers in the range from 0 to N – 1?
Use a hash function to map general keys to
corresponding indices in a table.
For instance, the last four digits of a Social Security
number.
0 ∅
1 025-612-0001
2 981-101-0002
3 ∅
4 451-229-0004

16
Hash Functions and Hash Table
s
A hash function h maps keys of a given type to integers
in a fixed interval [0, N − 1]
Example:
h(x) = x mod N
is a hash function for integer keys
The integer h(x) is called the hash value of key x
A hash table for a given key type consists of
Hash function h
Array (called table) of size N
When implementing a map with a hash table, the goal
is to store item (k, o) at index i = h(k)
17
SSN EXAMPLE

We design a hash table for 0 ∅

a map storing entries as 1 025-612-0001


2 981-101-0002
(SSN, Name), where SSN 3 ∅
(social security number) is a 4 451-229-0004

nine-digit positive integer


Our hash table uses an 9997 ∅

array of size N = 10,000 and 9998 200-751-9998


9999 ∅
the hash function
h(x) = last four digits of x

18
Hash Functions

A hash function is The hash code is


usually specified as the applied first, and the
compression function
composition of two
is applied next on the
functions: result, i.e.,
Hash code: h(x) = h2(h1(x))
h1: keys → integers The goal of the hash
Compression function: function is to “disper
se” the keys in an ap
h2: integers → [0, N − 1]
parently random way
19
Hash Codes

Memory address: Component sum:


We reinterpret the memory a We partition the bits of
ddress of the key object as a the key into components
n integer Good in general, ex of fixed length (e.g., 16 or
cept for numeric and string ke 32 bits) and we sum the
ys components (ignoring
Integer cast: overflows)
We reinterpret the bits of the Suitable for numeric keys
key as an integer of fixed length greater
Suitable for keys of length les than or equal to the
s than or equal to the number number of bits of the
of bits of the integer integer type

20
Hash Codes (cont.)

Polynomial accumulation: Polynomial p(z) can be


We partition the bits of the key evaluated in O(n) time
into a sequence of
components of fixed length using Horner’s rule:
(e.g., 8, 16 or 32 bits)
The following
a0 a1 … an−1
We evaluate the polynomial polynomials are
p(z) = a0 + a1 z + a2 z2 + … successively computed,
… + an−1zn−1 each from the previous
at a fixed value z, ignoring one in O(1) time
overflows p0(z) = an−1
Especially suitable for strings
(e.g., the choice z = 33 gives pi (z) = an−i−1 + zpi−1(z)
at most 6 collisions on a set of (i = 1, 2, …, n −1)
50,000 English words)
We have p(z) = pn−1(z)
21
Compression Functions

Division: Multiply, Add and


h2 (y) = y mod N Divide (MAD):
The size N of the h2 (y) = (ay + b) mod N
hash table is usually a and b are
chosen to be a prime nonnegative integers
The reason has to do such that
with number theory a mod N ≠ 0
and is beyond the Otherwise, every
scope of this course integer would map to
the same value b
22
ABSTRACT HASH TABLE CLASS

23
Collision Handling

Collisions occur when 0 ∅


1 025-612-0001
different elements are 2 ∅
mapped to the same 3 ∅
4 451-229-0004 981-101-0004
cell
Separate Chaining: let
Separate chaining is
each cell in the table
simple, but requires
point to a linked list of
additional memory
entries that map there
outside the table

24
MAP WITH SEPARATE CHAINING

Delegate operations to a list-based map at each cell:


Algorithm get(k):
return A[h(k)].get(k)

Algorithm put(k,v):
t = A[h(k)].put(k,v)
if t = null then {k is a new key}
n=n+1
return t

Algorithm remove(k):
t = A[h(k)].remove(k)
if t ≠ null then {k was found}
n=n-1
return t
25
HASH TABLE WITH SEPARATE CHAINING

26
Linear Probing

Open addressing: the Example:


colliding item is placed in a
different cell of the table h(x) = x mod 13
Linear probing: handles Insert keys 18, 41,
collisions by placing the 22, 44, 59, 32, 31, 73,
colliding item in the next in this order
(circularly) available table cell
Each table cell inspected is
referred to as a “probe” 0 1 2 3 4 5 6 7 8 9 10 11 12

Colliding items lump together,


causing future collisions to 41 18 44 59 32 22 31 73
cause a longer sequence of 0 1 2 3 4 5 6 7 8 9 10 11 12
probes
27
Search with Linear Probing

Consider a hash table Algorithm get(k)


i ← h(k)
A that uses linear
p←0
probing repeat
get(k) c ← A[i]
if c = ∅
We start at cell h(k)
return null
We probe consecutive else if c.getKey () = k
locations until one of the return c.getValue()
following occurs else
An item with key k is i ← (i + 1) mod N
found, or p←p+1
An empty cell is found, until p = N
or return null
N cells have been
28
Updates with Linear Probing

To handle insertions and d put(k, o)


eletions, we introduce a sp We throw an exception if t
ecial object, called AVAIL he table is full
ABLE, which replaces dele We start at cell h(k)
ted elements We probe consecutive cell
remove(k) s until one of the following
We search for an entry with occurs
key k A cell i is found that is eith
er empty or stores AVAILA
If such an entry (k, o) is fo
BLE, or
und, we replace it with the
N cells have been unsucce
special item AVAILABLE a
ssfully probed
nd we return element o
We store (k, o) in cell i
Else, we return null

29
Hash Table with Linear Probing

30
Double Hashing

Double hashing uses a


secondary hash function Common choice of
d(k) and handles collisions compression function for
by placing an item in the the secondary hash
first available cell of the function:
series d2(k) = q − k mod q
(i + jd(k)) mod N where
for j = 0, 1, … , N − 1
q<N
The secondary hash
q is a prime
function d(k) cannot have
zero values The possible values for
The table size N must be a d2(k) are
prime to allow probing of all 1, 2, … , q
the cells

31
EXAMPLE OF DOUBLE HASHING

k h (k ) d (k ) Probes
Consider a hash 18 5 3 5
table storing integer 41 2 1 2
22 9 6 9
keys that handles 44 5 5 5 10
59 7 4 7
collision with double 32 6 3 6
hashing 31
73
5
8
4
4
5
8
9 0

N = 13
h(k) = k mod 13
0 1 2 3 4 5 6 7 8 9 10 11 12
d(k) = 7 − k mod 7
Insert keys 18, 41,
31 41 18 32 59 73 22 44
22, 44, 59, 32, 31, 73, 0 1 2 3 4 5 6 7 8 9 10 11 12
in this order
32
Performance of Hashing

In the worst case, searches,


insertions and removals on a The expected running
hash table take O(n) time time of all the dictionary
The worst case occurs when ADT operations in a
all the keys inserted into the hash table is O(1)
map collide
In practice, hashing is
The load factor α = n/N
affects the performance of a
very fast provided the
hash table load factor is not close
Assuming that the hash to 100%
values are like random Applications of hash
numbers, it can be shown tables:
that the expected number of
small databases
probes for an insertion with
open addressing is compilers
1 / (1 − α) browser caches

33

You might also like