Professional Documents
Culture Documents
4.2 Hashing
"More computing sins are committed in the name of efficiency (without
necessarily achieving it) than for any other single reason - including
blind stupidity." - William A. Wulf
"We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil." - Donald E. Knuth
Save items in a key-indexed table. Index is a function of the key. Goal: scramble the keys.
! Efficiently computable.
Hash function. Method for computing table index from key. ! Each table position equally likely for each key.
thoroughly researched problem
Collision resolution strategy. Algorithm and data structure to handle
two keys that hash to the same index. Ex: Social Security numbers.
! Bad: first three digits. 573 = California, 574 = Alaska
Classic space-time tradeoff. ! Better: last three digits. assigned in chronological order within a
given geographic region
! No space limitation: trivial hash function with key as address.
! No time limitation: trivial collision resolution = sequential search. Ex: date of birth.
! Limitations on both time and space: hashing (the real world). ! Bad: birth year.
! Better: birthday.
3 4
Hash Function: String Keys hashCode
Java string library hash functions. Hash code. For any references x and y:
!Repeated calls to x.hashCode() must return the same value
public int hashCode() { provided no information used in equals comparison has changed.
int hash = 0; s = "call"; !If x.equals(y) then x and y must have the same hash code.
for (int i = 0; i < length(); i++) h = s.hashCode();
hash = (31 * hash) + s[i]; hash = h % M; "consistent with equals"
return hash; 3045982
} ith character of s
7121 8191
Default implementation: memory address of x.
Customized implementations: String, URL, Integer, Date.
5 6
Phone numbers: (609) 867-5309. Collision = two keys hashing to same value.
area code exchange extension
! Essentially unavoidable.
! Birthday problem: how many people will have to enter a room until
two have the same birthday? 23
! With M hash values, expect a collision after sqrt(! ! M) insertions.
public final class PhoneNumber {
private final int area, exch, ext;
public PhoneNumber(int area, int exch, int ext) { Conclusion: can't avoid collisions unless
this.area = area;
this.exch = exch; you have a ridiculous amount of memory.
this.ext = ext;
} Challenge: efficiently cope with collisions.
public boolean equals(Object y) { // as before }
7 8
Collision Resolution: Two Approaches. Separate Chaining
Separate chaining. st[0] jocularly seriously Separate chaining: array of M linked lists.
! M much smaller than N. st[1] listen ! Hash: map key to integer i between 0 and M-1.
! " N / M keys per table position. st[2] null ! Insert: put at front of ith chain (if not already there).
! Put keys that collide in a list. st[3] suburban untravelled considerating ! Search: only need to search ith chain.
! Need to search lists. ! Running time: proportional to length of chain.
st[8190] browsing
M = 8191, N = 15000
key hash
!When a new key collides, find next st[3] suburban st[2] null seriously 0
M = 30001, N = 15000
9 10
Symbol Table: Separate Chaining Symbol Table: Separate Chaining Implementation (cont)
11 12
Separate Chaining Performance Symbol Table: Implementations Cost Summary
13 14
Linear probing: array of size M. typically twice as many slots as elements public class ArrayHashST<Key, Val> {
! Hash: map key to integer i between 0 and M-1. private int M = 30001;
private Key[] keys = (Key[]) new Object[M]; no generic array
! Insert: put in slot i if free, if not try i+1, i+2, etc. private Val[] vals = (Val[]) new Object[M]; creation in Java
! Search: search slot i, if occupied but no match, try i+1, i+2, etc.
private int hash(Key key) { // as before }
15 16
Linear Probing Performance Symbol Table: Implementations Cost Summary
Parameters.
! M too large $ too many empty array entries. Advantages: fast insertion, fast search.
! M too small $ clusters coalesce. Disadvantage: hash table has fixed size.
! Typical choice: M " 2N $ constant-time search/insert. fix: use repeated doubling, and rehash all keys
17 18
Double hashing. Avoid clustering by using second hash to compute skip Linear probing performance.
for search. ! Insert and search cost depend on length of cluster.
! Trivial: average length of cluster = # = N / M.
Hash. Map key to integer i between 0 and M-1. ! Worst case: all keys hash to same cluster.
Second hash. Map key to nonzero skip value k.
Theorem. [Guibas-Szemeredi] Let # = N / M < 1 be average length of list.
Ex: k = 1 + (v mod 97).
depends on hash map being
hashCode random map
Parameters.
! M too large $ too many empty array entries.
! M too small $ clusters coalesce.
Result. Skip values give different search paths for keys that collide. ! Typical choice: M " 2N $ constant-time search/insert.
Best practices. Make k and M relatively prime. Disadvantage: delete cumbersome to implement.
19 20
Hashing Tradeoffs Hash Table: Java Library
Separate chaining vs. linear probing/double hashing. Java has built-in libraries for symbol tables.
! Space for links vs. empty table slots. ! HashMap = linear probing hash table implementation.
! Small table + linked allocation vs. big coherent array.
import java.util.HashMap;
public class HashMapDemo {
public static void main(String[] args) {
Linear probing vs. double hashing. HashMap<String, String> st = new HashMap <String, String>();
st.put("www.cs.princeton.edu", "128.112.136.11");
st.put("www.princeton.edu", "128.112.128.15");
load factor # System.out.println(st.get("www.cs.princeton.edu"));
}
50% 66% 75% 90% }
21 22
Symbol table. Implement our interface using HashMap. Java 1.1 string library hash function.
! For long strings: only examines 8 evenly spaced characters.
! Saves time in performing arithmetic.
! Great potential for bad collision patterns.
import java.util.HashMap;
import java.util.Iterator;
23 24
Algorithmic Complexity Attacks Algorithmic Complexity Attacks
Is the random hash map assumption important in practice? Q. How easy is it to break Java's hashCode with String keys?
! Obvious situations: aircraft control, nuclear reactors. A. Almost trivial: string hashCode is part of Java 1.5 API.
! Surprising situations: denial-of-service attacks. ! Ex: hashCode of "BB" equals hashCode of "Aa".
! Can now create 2N strings of length 2N that all hash to same value!
malicious adversary learns your ad hoc hash function
(e.g., by reading Java API) and causes a big pile-up in
single address that grinds performance to a halt AaAaAaAa BBAaAaAa
AaAaAaBB BBAaAaBB
AaAaBBAa BBAaBBAa
AaAaBBBB BBAaBBBB
Real-world exploits. [Crosby-Wallach 2003] AaBBAaAa BBBBAaAa
AaBBAaBB BBBBAaBB
! Bro server: send carefully chosen packets to DOS the server, using AaBBBBAa BBBBBBAa
less bandwidth than a dial-up modem AaBBBBBB BBBBBBBB
! Perl 5.8.0: insert carefully chosen strings into associative array. Possible to fix?
! Linux 2.4.20 kernel: save files with carefully chosen names. ! Security by obscurity. [not recommended]
Reference: http://www.cs.rice.edu/~scrosby/hash
! Cryptographically secure hash functions.
! Universal hashing.
25 26