Universal Classes of Hash Functions
(extended abstract)
J. Lawrence Carter and Mark N. Wegman
IBM Thomas J. Watson Research Center
Yorktown Heights, New York 10598
Abstract:
This paper gives an input independent average linear time algorithm for storage
and retrieval on keys. The algorithm makes a random choice of hash function from a
suitable class of hash functions.
Given any sequence of inputs, the expected time (averaging over all functions in the class) to store and retrieve elements is linear in
the length of the sequence. The number of references to the data base required by
the algorithm for any input is extremely close to the theoretical minimum for any
possible hash function with randomly distributed inputs.
We present suitable classes of hash functions which also may be evaluated rapidly. The ability to analyze
the cost of storage and retrieval without worrying about the distribution of the input
allows as corollaries improvements on the bounds of several algorithms.
Introduction:
The answer
cally analyze the average performance of a subroutine independently of the main routine, since the
to the problem.
performance.
solve.
Gill [2], Rabin [7], Strassen and Solovay [9]
algorithm.
In this paper, we apply these notions to hashing for storage and retrieval, and suggest that a
Some of the
We show that if
following:
1)
W~ pres-
in certain applications.
compensate for the poor performance of the algor-
Proposition 1: Given any collection H of hash func-
such that
After in-
then
IAI
A coun-
IHI
ple evenly.
IHI
8.(x,y) > IBI
- 1/a).
Therefore, by
functions.
- 1/a).
[]
Notation:
If S is a set,
elements in S.
$|S|$
$\lceil x \rceil$
subset of A.
> $|B|$.
A is sometimes called
If
(with equal
Then the expected
~'~
I HI
feB
Thus, $\delta_f(x,y)$
from
If f, x or y is replaced in
$\frac{1}{|H|} \sum_{f \in H} \delta_f(x,S) \le \frac{|S|}{|B|}$
In addition to being useful later, Proposition 2
$\sum_{f \in H} \sum_{y \in S} \delta_f(x,y)$.
scribed in [ 8 ] .
say that H is universal_2 if for all x, y in A, $\delta_H(x,y) \le |H|/|B|$.
Properties of Universal Classes:
For instance, an opti-
To test
Proposition 1
I BI.
I AI
Prop-
universal_2.
tions.
be chosen appropriately.
Given
real-time.
The cost of
would
them.
113.)
Other
collision
resolution
schemes
a = $|A|$ and b = $|B|$.
The counting
argument
that
the
ISa(1/b-
expectation
of
the
The following theorem gives a nice bound on
cost
1 + |S|(1/b
It
$\delta_f(x,S)$
of
the
- 1/a).
request
is
at
Thus,
least
solving collisions.
Proposition 3:
which
includes
Suppose
H is a
Then if we
Proposition 2 and the definition of cost tell us that an individual request has expected cost no greater than
$1 + k/|B|$. []
A special case of this proposition is that if k is
roughly the size of B, then each request has expected cost of about 2, so a sequence of r requests has expected cost of about 2r.
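The bound above can be checked exhaustively on a small case. The following is a minimal sketch, not the authors' code. It assumes the cost of a request for key x is 1 plus the number of stored keys that collide with x, and it assumes a class of linear functions f(z) = ((m*z + n) mod p) mod b with m nonzero. The function name `expected_request_cost` and the sample parameters are illustrative only.

```python
from fractions import Fraction


def expected_request_cost(x, stored, p, b):
    """Average of 1 + (stored keys colliding with x) over every f in the class."""
    total = 0
    count = 0
    for m in range(1, p):          # m ranges over the nonzero residues mod p
        for n in range(p):         # n ranges over all residues mod p
            f = lambda z: ((m * z + n) % p) % b  # assumed hash class
            total += 1 + sum(1 for y in stored if y != x and f(y) == f(x))
            count += 1
    return Fraction(total, count)


p, b = 13, 5                       # small prime and table size (assumed values)
stored = [0, 3, 4, 7, 12]          # k = 5 keys already in the table
k = len(stored)
cost = expected_request_cost(9, stored, p, b)
bound = 1 + Fraction(k, b)         # the 1 + k/|B| bound from the text
print(cost, "<=", bound)
```

Because the average runs over every function in the class rather than over random samples, the comparison is exact for these parameters and does not depend on a random seed.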
For
Some universal_2 classes:
Suppose A = {0, 1, ..., a-1}.
Let p be a prime with p ≥ a.
that
for
x=y,
universal_2.
$\delta_H(x,y) = 0$,
this
shows
that
is
I-I
B. Formally, we require
$|\{y \in Z_p : g(y) = z\}| \le \lceil p/b \rceil$ for all $z \in B$. A
When
The
(x) = (mx+n) mod p.
m ≠ 0}.
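A minimal sketch of this class follows. It assumes g is the residue modulo b, which is one possible choice and not necessarily the only one the paper allows. The names `make_hash`, `random_hash`, and `collision_count` are illustrative. For a small prime the sketch also counts, for every pair of keys, how many functions in the class map the pair to the same index, and compares the worst case with |H|/|B|.

```python
import random


def make_hash(m, n, p, b):
    """Return f(x) = ((m*x + n) mod p) mod b; g = residue mod b is an assumption."""
    if not 0 < m < p or not 0 <= n < p:
        raise ValueError("need 0 < m < p and 0 <= n < p")
    return lambda x: ((m * x + n) % p) % b


def random_hash(p, b, rng=random):
    """Pick one member of the class uniformly at random."""
    return make_hash(rng.randrange(1, p), rng.randrange(p), p, b)


def collision_count(x, y, p, b):
    """Number of functions in the class mapping distinct keys x and y together."""
    if x == y:
        return 0  # a key is not counted as colliding with itself
    return sum(
        1
        for m in range(1, p)
        for n in range(p)
        if ((m * x + n) % p) % b == ((m * y + n) % p) % b
    )


# Exhaustive check for a small case: p = 11, b = 4, keys 0..10.
p, b = 11, 4
class_size = p * (p - 1)           # one function per pair (m, n) with m != 0
worst = max(collision_count(x, y, p, b) for x in range(p) for y in range(p))
print(worst, class_size / b)
```

In use, a program would call `random_hash(p, b)` once at start-up and then apply the returned function to every key for the rest of the run.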
We
Now
When a
division.
ber of requests.
Since
as $g(h_m(x))$.
m ≠ 0}.
It can
to-one
and
equations
$xm + n \equiv r$
onto
since
the
linear
(r,s)
is
the
pair
$(h_{m,n}(x), h_{m,n}(y))$,
bound cannot be improved significantly.
then
Thus, $\delta_H(x,y)$
[]
stance,
let
p = kb+k+1
b = $|B|$,
and
choose
For inso
Proposition 4: The class H defined above is universal_2.
Proof:
Let
that
x = 1 and y = b + l .
Let
number
of
elements
in
length j.
$|B|$ is a power of two and
H is a class of functions from A to B with the prop-
Another way of defining this class is to give a
$|\{f \mid f(x) \oplus f(y) = i\}|$
is
exclusive-or.
Then
$|H|$
Recall that
$|B|$.
we can define a
to B as
new
J = {h_{f,g} | f, g ∈ H}
collection
of
hash
functions
Proposition 5 can
Proposition 6:
Proof:
is $r_{s+1} \oplus \cdots \oplus r_t$.
B| cannot be
tions fm"
[]
Assuming $x \ne y$, there will be
universal_2 class of functions given earlier both because H is not a power of 2 (so |H|/|
if $r_1 \oplus \cdots \oplus r_t = 0$.
than $|H|/|B|$.
universal_2.
by these
per.
In addition, we can
functions.
many applications,
tions,
and
any
positive
C(f,R)--r(l+
7k
also less than - t3iB]
= a'and
$|B| = 2^j$.
t,
the
probability
k
tZlBI
and
that
importance:
amount of time refining hash functions for applications where it is critical that a uniform distribution
be achieved ([5], p. 508-513).
retrieved.
Further-
gram is run, then we can be sure of good performance averaged over all runs.
The theoretical importance is that it allows one
to get a good bound on the average performance of
Future research:
using
making
assuming
involves
suggested,
implemented
Begin
  Choose a hash function;
  For i := 1 to n do
    For j := 1 to m do
      Begin
        Coefficient := CP[i] * CQ[j];
        Retrieve (EP[i] + EQ[j], k);
        Store (EP[i] + EQ[j], Coefficient + k);
      End;
  Print all keys and values which have been stored;
End;
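The loop above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. A built-in `dict` stands in for the hash table, so the hash function is Python's own rather than one chosen at random from a universal_2 class. Retrieve is assumed to yield 0 for an exponent that has not been stored yet. The name `multiply_sparse` and the (coefficient, exponent) input layout are hypothetical.

```python
def multiply_sparse(p_terms, q_terms):
    """Multiply two polynomials given as lists of (coefficient, exponent) pairs.

    Returns a sorted list of (exponent, coefficient) pairs for the product.
    """
    table = {}  # stands in for the hash table: key = exponent, value = coefficient
    for cp, ep in p_terms:
        for cq, eq in q_terms:
            coefficient = cp * cq
            k = table.get(ep + eq, 0)          # Retrieve (assumed 0 if absent)
            table[ep + eq] = coefficient + k   # Store
    return sorted(table.items())


# (1 + x) * (1 + x) = 1 + 2x + x^2
print(multiply_sparse([(1, 0), (1, 1)], [(1, 0), (1, 1)]))
```

Each of the n*m coefficient products triggers one Retrieve and one Store, so the number of table requests is proportional to n*m.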
algorithm
we
are
Retrieve
in the
performance
and
Store
class.
the n terms of P.
random
This
choice
of
2) Suppose the
store
and
retrieve algorithm
is
and [ 4 ]
We
Let two
respectively.
4)
Generalize
the
definition
of
universal_2
to
O(n·m).
De-
[3]
[4]
[5] Knuth, D.E., Sorting and Searching, Addison-Wesley, Reading, Mass. (1973).
[6] Markowsky, G.,
[7] Rabin, M.O., Probabilistic algorithms, Proceedings of Symposium on New Directions and Recent Results in Algorithms and Complexity (1976 - to appear).
[8] Rosenbaum, W.S., and Hilliard, J.J., Multifont OCR postprocessing system, IBM Journal of Research and Development (July 1975).
[9]
Acknowledgements:
We would like to thank Ashok Chandra for helping
suggest and formulate the problem; Dave Glickman
for suggesting we examine the table look-up technique; Hania Gajewska and David Smith for suggesting revisions in an earlier manuscript; Walter
Rosenbaum for discussions about a practical use of
this work; and George Markowsky for help in understanding the distribution of performance of the
class obtained by table look-up.
References:
[1]
[2]
private communication.