You are on page 1of 7

Universal Classes of Hash Functions

(extended abstract)
J. Lawrence Carter and Mark N. Wegman
IBM Thomas J. Watson Research Center
Y o r k t o w n Heights, New York 10598

Abstract:
This paper gives an i , put independe, t average linear time algorithm for storage
and retrieval on keys. The algorithm makes a random choice of hash function from a
suitable class of hash functions.

Given any sequence of inputs the expected time

(averaging over all functions in the class) to store and retrieve elements is linear in
the length of~the sequence. The number of references to the data base required by
the algorithm for any input is extremely close to the theoretical minimum for any
possible hash function with randomly distributed inputs.

We present three suitable

classes of hash functions which also may be evaluated rapidly. The ability to analyze
the cost of storage and retrieval w i t h o u t worrying about the distribution of the input
allows as corollaries improvements on the bounds of several algorithms.
Introduction:

2) A consequence of (1) is that one cannot classi-

One may v i e w different inputs to a program as


elements from a class of problems.

The answer

cally analyze the average performance of a subroutine independently of the main routine, since the

given by the program is, hopefully, a correct solution

main routine may skew the distribution of data.

to the problem.

Ordinarily, when one talks about

3) If the program is presented with a w o r s t - c a s e

the average performance of a program, one aver-

input, there is no way to avoid the resulting poor

ages over the class of problems the program can

performance.

solve.

G i l l [ 2 ] , R a b i n [ 7 ] , Strassen and S o l o v a y [ 9 ]

ithms to choose from and was able to realize that a

have used a different approach on some classes of

particular algorithm was running slowly on a g i v e n

problems. They suggest that the program randomly

input, then it might be possible to choose a different

choose an algorithm from the class of algorithms to

algorithm.

solve the problem.

They are able to bound the a v -

erage performance of the class of algorithms for the


worst case input.

However, if one had a class of algor-

This average on the worst case

In this paper, we apply these notions to hashing for storage and retrieval, and suggest that a

can be better than the performance of any known

class of hash functions be used.

single algorithm on its worst case.

the class of functions is chosen properly, then the

Some of the

p r o b l e m s which this approach overcomes are the

We show that if

average performance of the program on any input

following:

will be nearly as good as if a single function, chosen

1)

with knowledge of the input, were used.

Classical analysis (averaging over the class of

W~ pres-

inputs) must make assumptions about the distrib-

ent several classes of hash functions which insure

ution of the inputs. These assumptions may not hold

that every sample chosen from the input space will

in certain applications.

be distributed evenly by enough of the functions to

106

compensate for the poor performance of the algorProposition 1: Given any collection H of hash func-

ithm w h e n an unlucky choice of function is made.

tions (not necessarily universal2), there exists x,yeA


A brief outline of our paper follows.

such that

After in-

troducing some notation, w e define a property of a


class of functions: universal 2.

class of functions that is universal 2 has the property

then

IAI
A coun-

ting argument shows that for each fell,


a2
(~f (A,A) > - - - a.

of that class will be expected to distribute the s a m We

IHI

Proof (sketch): Let a = I A I and b = I B I .

that given any sample, a randomly chosen member

ple evenly.

IHI
8.(x,y) > IBI

W e show that any

exhibit several universal 2

classes of functions which can be evaluated easily.

Thus, 8H(A,A) _> a 2 1 H l ( 1 / b

- l/a).

Therefore, by

the pigeon hole principle, there exists x,yeA such

Finally w e give several examples of the use of these

that 8H(x,y) > [ H I(1/b

functions.

- 1/a).

[]

Proposition 2: Let x be any element of A and S any

Notation:

If S is a set,
elements in S.

I SI

Ix]

subset of A.

will denote the number of

means the least integer > x.

Z will represent the integers mod n.

All hash functions map a set A into a set B. W e will


always assume I A I

> IBI.

A is sometimes called

t h e set of possible keys, and B the set of indices.

If

f is a hash function and x,y~A, w e let ~r(x,y) be 1 if


x p y and f(x) = f(y), and 0 otherwise.

(with equal

T h e n . t h e expected

number of elements of S that x collides with, i.e.


ISI
8,(x,S), is _< - IBI
Proof:
Mean value of 8f(x,S)
1

~'~

I HI

feB

Thus, 8f(x,y)

is 1 if x and y are distinct elements of A which map


to the same value under f.

Let f be a function chosen randomly

a universal 2 class of functions

probabilities on the functions.)

If x and y are bit strings, then x ( ~ y is the exclusiveor of x and y.

from

If f, x or y is replaced in

1
IHI

< ~

8,(x,y) by a set, w e sum over all the elements in the


set. Thus, if H is a collection of hash functions, x~A

8f(x,S)

~_, 8H(x,y) (by notation)


y~s

~'~
IHI
~**~.q~-

1
_

and S e A then 8H(X,S) means

ISl

(by def. of universal 2)


[]

IBI"
In addition to being useful later, Proposition 2

Z
f~H

~ 8f(x,y).
y~S

has some direct applications.

cal character reader postprocessing system is d e -

Notice that the order of summation does not matter.

scribed in [ 8 ] .
Properties

say

of Universal

For instance, an o p t i -

This system is designed t o check if

a word x is a member of a set of valid words S.

Classes:

Let H be a class of functions from A to B. W e

The set { f(y) l yeS } is stored in memory.

that

A,

whether x is in S, a check is made to see if f(x) is in

That is, H is universal 2 if no pair of

the stored set. Since f(y) is generally shorter than y,

8.(x,y) <
-

H
IHI.

is

universal 2

if

for

all

x,y

in

IBI

To test

distinct keys are m a p p e d into the same index by

a considerable amount of space was saved. H o w e v -

more than one I B I'h of the functions.

er, there is a chance of error; if f(x)=f(y) and yeS,

Proposition 1

shows that this bound on 8.(x,y) is tight when


is much larger than

I BI.

I AI

then x may erroneously be accepted as valid

The second proposition

Prop-

osition 2 gives a bound on the probability of error

f o l l o w s almost i m m e d i a t e l y from the definition of

when f is chosen from a class of universal 2 f0nc-

universal 2 .

tions.

107

W e are interested in the cost of using these


functions in storage and retrieval operations.

be c l ~ s e n appropriately.

If there is no known upper

Given

bound it is possible to dynamically choose a size,

a sequence R of requests (insertions or retrievals) to

and rehash when that choice proves to be t o o small.

some data base, and a hash function f, w e define

Rehashing can be done in linear time, and even in

the cost of R with respect to f, C(f,R), to be the sum

real-time.

of the costs of the individual requests.

The cost of

an individual request referring to an element x is one

W e can s h o w that the expected cost (averaging

plus the number of distinct previously inserted ys

over the hash functions) of any request is virtually

for which f(x) = f(y).

the same as the expected cost (averaging over the


possible requests) of any single hash function when

This cost function reflects the w o r s t case cost


of inserting or finding elements in a storage and
retrieval scheme in which each element of B is a s sociated with a linked list, and an element x is
stored in the list associated with f(x) (see [ 1 ], page
111

would
them.

113.)

Other

collision

resolution

schemes

have other cost functions associated with


For example, if the keys with the same index

were stored in a balanced tree, the corresponding

applied to a random request after random insertions


have been made.

The reason is as f o l l o w s : Let

a = I AI

I BI.

and b =

The counting

argument

cited in proposition 1 implies that if f is any hash


function and x and y are chosen at' random from A,
then the expectation of 8f(x,y) is > ( 1 / b - l / a ) .
follows
is >

that

the

ISa(1/b-

expectation

of

l / a ) , where S is the random sub-

set of A which has been previously stored.

cost function w o u l d be smaller.

the
The following theorem gives a nice bound on

r.ost

1 + I S I(1/b

It

8f(x,S)

of

the

- l/a).

request

is

at

Thus,
least

When A is much larger than

the e x p e c t e d cost of using a universal 2 class of

B (which will be the case in most applications of

hash functions with the linked list method for re-

hashing), this is virtually the same as the cost of a

solving collisions.

request cited in the proof of Proposition 3.

Proposition 3:
which

includes

Let R be a sequence of r requests


k insertions.

Suppose

universal 2 class of hash functions.


choose f at random from
k
C(f,R) is _< r(1 + ~ - ~ - ) .

H is a

Then if w e

H, the expected cost

given a sequence of requests R, the performance of


a randomly chosen function will be worse than t o l erable on R. Since we k n o w that C(f,R) must be at
least r, we can conclude that when k is roughly the

Proof: The e x p e c t e d cost of R is the sum of the


expected costs of the individual requests.

It is also possible to bound the probability that

Proposi-

tion 2 and the definition of cost tell us that an i n d i vidual request has expected cost no greater than
k
1 + - []
IBI "
A special case of this proposition is that if k is
roughly the size of B then the expected cost is 2r.

size of I BI, the probability that C(f,R) > t.r is less


than 1 / ( t - 1 ) .

For some classes of hash functions

(such as the last class mentioned in this paper), it is


possible to derive a bound on the standard deviation
or higher moments of the cost of a randomly chosen
function on a particular R.

This allows us to get a

much better estimate of the probability that C(f,R)


will be large.

Notice that this linear bound holds for any sequence


of requests, not just for the " a v e r a g e " request.

For

Some

universal

2 classes:

many applications, there is an upper bound on the

The first class of universal 2 hash functions we

number of elements to be stored and hence, B can

present is suitable for applications where the bit

108

strings w h i c h represent the keys can c o n v e n i e n t l y

n u m b e r of choices for s such that r # s but g(r)=g(s)

be multiplied by the computer.

is < P - - I . Since there are p choices for r,


b
P p-1 > the n u m b e r of ordered pairs (r,s) satisfyb
ing the condition in the l e m m a = 8H(x,y). Recalling

S u p p o s e A={O,1 ..... a - l }
Let p be a prime with p>_a.

and B={O,1 ..... b - l } .


Let g be any function

from Zp to B which, as closely as possible, maps the

that

for

x=y,

same number of elements of Z, into each element of

universal 2 .

8H(x,y)=O,

this

shows

that

is

I-I

B. Formally, w e require

I {y~Zp I g(y)=z} I <

[ -~ ] for all z ~ B. A

natural choice for g is the residue m o d u l o b.

Frequently, algorithms are analyzed making the

When

assumption that multiplication takes unit time.

The

b = 2 k for some k, this a m o u n t s to taking the last k

n u m b e r of multiplications is s a i d , t o be the cost of

bits in the binary representation of y.

the algorithm. This m o d e l is appropriate when there


are no operations which can be done an unbounded

Let m and n be elements of Zp with mpO.


define hm,o:A--,Z p by h

(x) = (mx+n) m o d p.

define fm.n(X) = g(hm,,(X)).


{f~,n I m,neZp

m#0}.

We

n u m b e r of times for each multiplication.

Now

When a

universal 2 class of hash functions is used, then the

The class H is the set

n u m b e r of m e m o r y references per request can be

If desired, p can be chosen

b o u n d e d w h e n averaged over all functions in the

so the mod p operation can be calculated w i t h o u t a

class as in Proposition 3 (assuming k is not u n -

division.

b o u n d e d with respect to J BI.) Therefore the m o d e l


is appropriate, and under it the hash functions in the

The f o l l o w i n g lemma is useful in proving that

class given above may be applied in constant time.

this class is universal 2.

Thus in this model, for any sequence of requests,


Lemma:

When H is defined as above, then for any

the algorithm takes average time linear in the n u m -

x,y~A with x # y , 8H(x,y) equals the number of o r -

ber of requests.

dered pairs (r,s) with r,s~Zp, r ~ s and g(r)=g(s).


Proof:

It may seem that the addition of n in the class

There is a natural correspondence b e t w e e n

the functions hm, n and the ordered pairs (r,s) w h e r e

of functions given a b o v e plays an unimportant role.

r, seZp and r~s.

This is only partly troe.

Specifically, w e identify the f u n c -

tion hm, n by the ordered pair (hm,n(X),hm, n(y)).

Suppose for m~Zp w e d e -

fine hm(X) = (mx) rood p, and as before define fm(X)

Since

as g(hm(X)).

m~O, hm,n(X)~hrn,n(y). This correspondence is o n e -

Let H = {frn I meZp

m~O}.

It can

to-one

and

equations

be s h o w n that this class of functions comes within a

xm+n~r

(mod p) and y m + n - - s (mod p) have a uni-

factor of t w o of being universal 2, that is, for any x


^ IHI
and y, 8H(x,y) <_ Z ' ~ T - .
On the other hand, this

onto

since

the

linear

que solution for m and n in the field Zp.


If

(r,s)

is

the

pair

(hm,n(X),hm,n(y)),

fm,n(X)=fm,n(y) if and only if g(r)=g(s).


is the n u m b e r of such pairs.

b o u n d c a n n o t be i m p r o v e d significantly.

then

Thus, 8H(x,y)

[]

stance,

let

p = kb+k+l

b = JBI,

and

choose

For inso

is prime (there will be infinitely m a n y

such k's.) Let g(x) be the function x (mod b).


Proposition

4:

The

class

H defined

above

is

universal 2 .
Proof:

Let

that

x = 1 and y = b + l .

Let

Then the 2k functions fl ,

f2 ..... fk fp-k fp-k.l ..... fp-1 each m a p x and y to


n i be the

number

of

elements

in

[ teZp I g(t)=i }. The restriction on g is that for


each i, ni < I - ~ ] .

Since p and b are integers, this

implies that n i < p - 1 +1.


-

Thus, for a given r, the

the same bin. Thus, 8H(x,y) = 2k w h i l e


IH_~_l_ p-_~_l _ k b + k _ ( 1 + 1 ) k .
IBI
b
b
b
The universal 2 class of functions given above

109

may not be convenient when the keys are t o o long

length ia, each of w h o s e elements are bit strings o f

to be multiplied u=~ing a single machine instruction.

length j.

However, the next proposition gives a method of

is the k th element of m, and for x t A , let x k be the k 'h

extending a class of functions for long keys.

digit of x written in base a. W e define

For m~M, let m(k) be the bit string which

fro(x) -- m ( x , + l ) O m(xl+x2+2) (~ ...


Proposition 4: Suppose

I BI

is a p o w e r of t w o and

(~) m(~ xk+k).


k=l

The class H is the set { f . I m e M } .

H is a class of functions from A to B with the p r o p Another w a y of defining this class is to give a

erty that for each i~B,


I {f~H
Q

I f(x)(~f(y)

is

= i } I

exclusive-or.

Then

universal 2 class of hash function from A x A


follows.
Then

program for applying a function to an input x.

]HI
Recall that
IBI "
w e can define a

dcl m(ia) bit(j)init(random);


dcl x(i) digits base a;
dcl value bit(j);
disp := 0;
value := 0;
for k := 1 to i do begin
disp := disp + x(k) + 1;
value := value (~) m(disp);
end;
return (value);

to B as

For f, geH, define hf.g((x,y)) = f(x) (~ g(y).


this

new

J = {hf.Q ] f,g~H}

collection

of

hash

functions

is universal 2 and also satisfies

the condition of this proposition.


The proof is quite similar to the proof of Proposition 6, and therefore omitted.

Proposition 5 can

be applied repeatedly to extend the functions to


arbitrarily long keys.

Proposition 6:

If the functions in H can be

applied in constant time, then the time required to

Proof:

compute an extended function is proportional to the

For x and y in A, s u p p o s e fm(X) is the

e x c l u s i v e - o r of the rows r 1, r 2 ..... r s of m, and fro(Y)

length of the key.

is rs.l(~...(~)rt.

only one of fm(X) or fm(y). Then fro(x) will equal fro(y)

BI cannot be

if and only if r k is the e x c l u s i v e - o r of the other ri's.

an integer) and because the number of functions for

Since there are 2 j = B possibilities for that row, x

which f(x) (~ f(y) = 0, i.e. 6H(x,y), is actually less

and y will collide for one B th of the possible f u n c -

Both of these differences add small

tions fm"
[]

factors to 8j(x,y) which barely prevent J from being


universal 2. The percentage contributed

A s s u m i n g x ~ y , there will be

some k such that r k is involved in the calculation of

universal 2 class of functions given earlier both b e cause H is not a power of 2 (so I H I / I

Notice that fro(x) = fro(Y) if and only

if rl(~)...(~)r t = 0.

The proposition does not quite apply to the

than I H I / I B I .

The class H defined above is

universal 2 .

Thus, the class of all frn'S is universal 2.

by these

small factors decrease asymptotically to 0 as p is

For a given B, each function in H takes time

increased. More details will be given in a future p a -

linear in the length of the key.

per.

more accurately describe the distribution of costs of

In addition, w e can

a particular sequence of requests under the different


The following is a class of functions which do

functions.

For instance, M a r k o w s k y [ 6 ] has shown

not require multiplication, which may be better for

that for any sequence R of r requests with k inser-

many applications,

tions,

Suppose A can be viewed as

and

any

positive

the set of i - d i g ; t numbers written in base a, and B

C(f,R)--r(l+

as the set of binary numbers of length j. Then I A I

7k
also less than - t3iB]

= a'and

I BI = 2 j.

Let M be the class of arrays of

ii0

t,

the

) > r-t is less than

probability
k
tZlBI

and

that

importance:

amount of time refining hash functions for applications where it is critical that a unifor m distribution
be achieved ( [ 5 ] , p . 5 0 8 - 5 1 3 ) .

The following algorithm will

retrieved.

Further-

gram is run, then we can be sure of good p e r f o r m ance averaged over all runs.
The theoretical importance is that it allows one
to get a good bound on the average performance of

If a value has not been stored previously

presented w o u l d seem appropriate for this analysis.

unevenly by the particular hash function being used.

Future research:

Rabin has developed an algorithm which finds

There are a number of areas which can be in-

the nearest neighbors of a collection of points in a


plane, given the coordinates of the points [ 7 ] .

points, and it uses hashing.

taking constant time, the first class of functions we

being stored and re-

using

Their first argu-

Since addition and multiplication are viewed as

trieved t o w a r d s those cases that are distributed

making

assuming

The problem with

an ordinary hashing scheme is that the algorithm

involves

suggested,
implemented

Begin
Choose a hash function;
For i := 1 to n do
F o r j := 1 t o m d o
Begin
Coefficient := CP i * CQj;
Retrieve (EP i + EQj,k);
Store (EP i + EQi,Coefficient + k);
End;
Print all keys and values which have been stored;
End;

more, if the function is changed each time the p r o -

algorithm

we
are

for a given key, a retrieve will have a zero value.

Simply choosing a single hash function ran-

might bias the information

Retrieve

in the

domly from such a class gives a high expectation

an algorithm which uses hashing.

performance

ment is a key, and the second is the value stored or

of a class of universal 2 functions is that we know

that a uniform distribution will be achieved.

and

universal 2 class of hash functions.

One of the practical values

that there are many acceptable functions

Let CQ i and EQi stand for the

Store

not be biased in such a w a y as to make tl~e hash


function perform poorly.

nents of those terms.


same quantities of Q.
have the

This may be difficult

because it is necessary that the expected input set

class.

Let EP1,EP 2 ..... EPn be the expo-

the n terms of P.

Programmers sometimes spend a considerable

random

vestigated, such as:

This

choice

1) Improve the bounds cited here on the probability

of

that a particular function

If one also randomly

from the table l o o k - u p

class will perform poorly on a particular input.

chooses the hash function from a universal 2 class,

2) Suppose the

then the expected running time of the algorithm will

changed so that the last element retrieved is moved

always be linear the number of points.

to the first position on the list.

store

and

retrieve algorithm

is

Can a smaller bound

on the probability that the cost function exceeds a


In [ 3 ]

and [ 4 ]

an algorithm is suggested for

multiplying sparse polynomials, using hashing.


can strengthen the results of these papers.

specified value be derived?

We

3) Extend the analysis to other storage and retrieval

Let t w o

algorithms which involve hashing, such as double

polynomials, P and Q, have n and m n o n - z e r o terms

hashing and open addressing.

respectively.

4)

If multiplies and aCds are viewed as

Generalize

the

definition

of

universal 2

to

taking constant time, then given any t w o p o l y n o m i -

universal n to consider the action of the class of

als P and O., we can multiply them in average time

functions on any collection of n elements of A.

O(n'm).

termine if one can obtain improved results with such

Let CP1,CP 2 ..... CP n be the coefficients of

iii

De-

Computing, May, 1976, Seattle, Washington,


p. 91-95.

a stronger assumption. (One definition of universalk


implies that the expected number of keys from a
k-element set mapping into a given element of B

[3]

Goto, E., and Kanada, Y., Hashing Lemmas


on Time Complexities with Applications to
Formula Manipulation, Proceedings of the
1976 ACM Symposium on Symbolic and Algebraic Computation, August, 1976, Yorktown
Heights, New York, p.149-153.

[4]

Gustavson, F., and Yun, D.Y.Y., Arithmetic


Complexity of Unordered or Sparse
Polynomials Proceedings of the 1976 ACM
Symposium on Symbolic and Algebraic Computation, August, 1976, Yorktown Heights, New
York, p.154-159.

[5]

Knuth, D.E.,
Sorting and Searching,
Addison-Wesley, Reading, Mass. (1973).

[6]

Markowsky, G.,

[7]

Rabin, M.O.,
Probabilistic algorithms,
Proceedings of Symposium on New Directions and
Recent Results in Algorithms and Complexity
(1976 - to appear).

[8]

Rosenbaum, W.S., and Hilliard, J.J,, Multifont OCR postprocessing system. I B M Journal of Research and Development (July 1975).

[9]

Strassen, V., and Solovay, R.,


A fast
Mont~-Carlo test for primality, SlAM Journal
on Computing (to appear).

would be binomially distributed.)


5) When should~one decide that a particular function
is a poor choice and it would be worth the effort to
choose a new function and rehash?

Acknowledgements:
We would like to thank Ashok Chandra for helping
suggest and formulate the problem; Dave Glickman
for suggesting we examine the table look-up technique; Hania Gajewska and David Smith for suggesting revisions in an earlier manuscript; Walter
Rosenbaum for discussions about a practical use of
this work; and George Markowsky for help in understanding the distribution of performance of the
class obtained by;table look-up.

References:
[1]

[2]

Aho, A.V., Hopcroft, J.E., and Ullman, J.D.,


The Design and Analysis of Computer Algorithms, Addison-Wesley, Reading, Mass.
(1974).
Gill, J.T., III, Computational Complexity of
Probabilistic Turing Machines, Proceedings of
the Sixth AC,~f Symposium on the Theory of

112

private communication.

You might also like