are defined by

    init(v)       = v
    step(i, h, c) = h ⊕ ((h ≪ L) ⊕ (h ≫ R) ⊕ c)
    final(h, v)   = h mod T

given the belief that exclusive-or is appropriate for hashing. This shift-xor-xor class is uniform but, apparently, not universal. The reasons are not entirely clear to us, but it seems that a mix of addition and exclusive-or is required; in particular, addition appears to be valuable because it propagates change between bits, and leads to a more even distribution of 0 and 1 at each position.

There are essentially two methods for hashing strings. One is to directly reduce the string to a string of bits, as in the shift-add-xor class. Another is to convert the string to a number, then apply an integer hashing function such as

    final(h, v) = ((v ⊕ h) mod p) mod T

where p is a large prime. We would expect such functions to be well-behaved, but the operations required in the conversion and hashing make it unlikely that they would be faster than shift-add-xor. For example, consider Cormen, Leiserson, and Rivest [2] and Sedgewick [17], which are two of the better-known recent algorithms texts. Cormen, Leiserson, and Rivest [2] suggest that strings be converted to numbers through radix conversion. For alphanumeric strings, implicitly of base 62, radix conversion requires up to two comparisons, a subtraction, and a multiplication for each character, with possibly further operations because of overflow. For ASCII characters radix conversion is rather simpler, involving only a left-shift of 7 places (multiplication by 128), but such conversion can lead to the contribution of the first characters in a string being lost as they are shifted out to the left. Thus the technique of regarding strings as numbers and using numerical hashing is inappropriate unless arbitrarily large numbers can be manipulated efficiently. Sedgewick [17] suggests a method that avoids overflow by use of a modulo operation at each step, which is considerably more expensive to evaluate.

Our hypothesis, then, is that by choosing functions at random from the shift-add-xor class of string hashing functions (that is, by making a random selection of seed) we can in practice obtain the analytically predicted performance of hashing schemes. Prediction of performance is reviewed in the next section.

3 Predicted behaviour of hashing

Hashing techniques are usually analysed under the assumption that the hash values are uniformly distributed. Consider a set of n keys mapped into an address range of T values. Given a key s and a hashing function h that maps the key into this range, the probability that the key hashes to a particular address is 1/T and is independent of the outcome of hashing other keys. There are T^n ways in which n keys can be distributed among the T addresses, that is, there are T^n functions that map the given set of keys into the table. It is assumed that each of these distributions is equally likely when the n keys are hashed into T slots.

The analytically-predicted performance of a class of hashing functions corresponds to the expected performance of a randomly-chosen function from the set of T^n functions. It is interesting to consider both average-case and worst-case behaviour. In the average case, behaviour is measured by the average length of the probe sequence (that is, the average number of accesses) for successful and unsuccessful search. Analytical results for the average case in a chained hash table are given by Knuth [8, page 535].

The worst case for hashing occurs when all the keys hash to the same address and the search length is O(n). Knuth [8, page 540] expressed fear of this possibility by concluding that "hashing would be inappropriate for certain real-time applications such as air traffic control, where people's lives are at stake". However, Gonnet [5] proved that such fears of hashing are baseless, since the probability of the worst case is, in his words, ridiculously small.

Gonnet proposed a measure for the worst case of hashing based on the length of the longest probe sequence, or llps. Out of all the keys stored in the hash table, one has the maximum successful search length. Gonnet proposed that the expected value of llps is a better measure of the worst case of hashing than is the (extraordinarily improbable) worst case of llps, and demonstrated theoretically that llps is very narrowly distributed, with the expected value being quite small, that is, not dramatically greater than would be given by dividing the keys evenly amongst the buckets. Larson extended these results for the general case of bucket size greater than 1 [9].

We now use these analytical results, for both average-case and worst-case behaviour of a class of ideal hashing functions, as a yardstick for evaluating the behaviour in practice of classes of string hashing functions.

4 Experimental results

Our hypothesis is that, by choosing hashing functions at random from the shift-add-xor class of hashing functions, the analytically-predicted performance of hashing schemes can be achieved in practice. To support the hypothesis, in this section we experimentally evaluate the shift-add-xor class of string hashing functions on real data sets.
Exhaustively checking whether a class of hashing functions is indeed uniform would require evaluating the function over all potential key sets for all seeds. This, or any close approximation to it, is clearly impractical, but by applying the class to a selection of data sets with a reasonable number of seeds we can be highly confident that the observed behaviour is a good approximation to the behaviour over the whole class.

For these experiments we had several sets of keys available to us. We used these to explore the performance of the hashing functions discussed in this paper. The reported results are based on the following key sets. (However, results for all of the key sets were similar.) One was names, a file of 31,918 distinct surnames extracted from Internet news articles and hand-edited to remove errors and nonsense [18].¹ Another was trec, a file of 1,073,726 distinct words (that is, contiguous alphabetic strings) extracted from the first 3 gigabytes of the TREC data [6]; this data contains the full text of newspaper articles, abstracts, and scientific journals. In our experiments we have focused on certain table sizes and load factors, to allow comparison with previously published analytical results, and thus did not usually require the full data sets. Instead we used subsets of the data of the required size: ten random subsets of 1000 strings each, from each of trec and names; the lexicographically first 1000 strings from each of trec and names; a file, fives, of the first 1000 distinct strings of exactly five characters (that is, "aaaaa", "aaaab", and so on); and a file, sevif, of the strings from fives reversed. These last four files are pathological cases that should help to expose flaws in weak hashing functions.

¹ This file is available by ftp from goanna.cs.rmit.edu.au in the file pub/rmit/fnetik/data/Surnames.Z

In these experiments we have focused on hash tables with separate chaining, which (with their tolerance to overflows and similarity to dynamic-table schemes such as linear hashing and extensible hashing) we consider to be most typical of hash tables in practical use. However, the results are independent of the hash table organisation: they demonstrate properties of the class of hashing function that apply regardless of how it is used, whether for internal or external hashing, to slots of size 1 or buckets of many keys each, or to applications such as hash joins.

Average-case search length

We first investigated average search lengths for successful and unsuccessful search. Results are shown in Table 1. The "actual" results are an average over 10,000 randomly-selected hashing functions (equivalently, seeds), based on one set of trec keys; the "±" figure is one standard deviation. In these results the number of keys was held at 1000 and the table size varied to give a load factor. For example, with a load factor of 70% the table size was 1000/70% = 1429. The "predicted" results are quoted from Knuth [8, page 535]. As can be seen, the correspondence is extremely good, thus confirming our hypothesis that functions in the class shift-add-xor generate uniform hash addresses. Almost identical results, usually to within 0.01, were observed with the other data files, including the four "pathological" data files. We also tried other table sizes and key set sizes, including table sizes such as powers of 2 that might lead to poor behaviour, but again similar behaviour was observed. Note that we have not reported figures for larger bucket sizes; changing bucket size does not change the distribution or the properties of the hashing function.

By way of comparison, consider the class of hashing functions given by

    init(v)       = 0
    step(i, h, c) = (h ≪ 1) + c
    final(h, v)   = h mod T

This function is like that used in several compilers, as reported by McKenzie, Harries, and Bell [12]. For a load factor of 90%, on the twenty randomised data files average successful search length was 1.701, already significantly greater than the prediction of 1.450, while on sevif it was 5.110 and on fives it was 9.358.

It is also interesting to consider performance on a large data set, the more realistic case for a hashing function to be used in a database system. For the full set of trec keys, 1000 randomly-chosen hashing functions, and a load factor of 90%, shift-add-xor gave an average successful search length of 1.459 and the average search length was 1.310, essentially identical to performance with a small set of keys. With the simple function above, however, average successful search length was 19.103; increasing the value of the left shift to 4 decreases this value, to 4.669, a figure that is however still unacceptable. We ran this experiment and similar experiments with other large sets of keys, such as the first 1,000,000 five-character strings.
    Load factor               40%           60%           70%           80%           90%
    Predicted successful      1.200         1.300         1.350         1.400         1.450
    Actual successful         1.200±0.014   1.299±0.017   1.350±0.019   1.400±0.020   1.450±0.021
    Predicted unsuccessful    1.070         1.149         1.197         1.249         1.307
    Actual unsuccessful       1.070±0.004   1.148±0.006   1.196±0.007   1.249±0.008   1.307±0.009

Table 1: Average search length, successful and unsuccessful, for 1000 keys at load factors from 40% to 90%, averaged over 10,000 seeds (± one standard deviation). The keys are extracted from the trec data.
From these experiments we have observed that, with a poorly-chosen hashing function, performance can markedly deteriorate as the number of keys increases. However, a good hashing function such as a member of shift-add-xor will indeed give the theoretically-predicted performance.

Worst-case search length

Experimental results for the expected length of the longest probe sequence, or llps, are shown in Table 2, from the same experiments reported in Table 1. The "predicted" results are quoted from Gonnet [5]. As can be seen, llps values vary significantly between runs, as indicated by the high standard deviations; but, as Figure 1 shows, the distribution of experimental llps values is extremely narrow. Even with a load factor of 90%, over 95% of the llps values are 4, 5, or 6; the largest observed value was 12, which occurred only once in the 4,000,000 experiments.

Pushing this experiment further, we chose a random set of 20 keys from names, a table size of 20, and measured llps for the 2^30 hashing functions given by the seeds between 1 and 2^30. The worst llps was 11, which occurred only 9 times in over one billion trials. That is, for even such a small table, exhaustive search of the class failed to find a hashing function that maps all keys to the same value. Average llps was 3.231.

For the full set of trec keys, the distribution of llps values is even narrower. With 1000 randomly-chosen hashing functions and a load factor of 90%, the average llps was 8.900, with a minimum of 8 and a maximum of 11. (Note that llps is expected to rise slowly as table size is increased; this is not an indicator of poor performance.) Interestingly, our experiments indicate that llps is a better tool than average search length for discriminating between hashing functions, particularly on large key sets. For example, on the same data, consider the hashing function given by simplifying the step operation in shift-add-xor to

    step(i, h, c) = h ⊕ ((h ≪ L) + c)

Although it is not possible to conclusively demonstrate that a class of hashing functions is universal, there is evidence that can indicate whether universality holds. The method we have used is deliberate attack: for some hashing function and table size, find a set of strings that hash to the same value; then for that set of strings explore llps and average search lengths. A significant increase in llps indicates that some strings are being hashed to the same value for more seeds than would be expected for a universal class of hashing functions.

To use this approach to provide evidence for universality, we used the full trec key set, assumed a loading factor of 90% and a table size of 1111, and randomly chose a hashing function.
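The attack just described is easy to script. The sketch below is ours, and everything in it is an assumption for illustration: a shift-add-xor member with L = 5 and R = 2 over 32-bit words, synthetic four-letter keys as the key source, and a small table.

```python
import itertools
import random
import string

MASK = 0xFFFFFFFF
L, R = 5, 2  # illustrative shift amounts

def shift_add_xor(key: str, seed: int, table_size: int) -> int:
    h = seed & MASK
    for c in key.encode():
        h = (h ^ (((h << L) & MASK) + (h >> R) + c)) & MASK
    return h % table_size

TABLE = 1111
ATTACKED_SEED = 42

# Step 1: collect keys that all hash to one address under the attacked seed.
colliding, target = [], None
for key in map("".join, itertools.product(string.ascii_lowercase, repeat=4)):
    h = shift_add_xor(key, ATTACKED_SEED, TABLE)
    if target is None:
        target = h
    if h == target:
        colliding.append(key)
        if len(colliding) == 100:
            break

# Step 2: under fresh random seeds the colliding set should scatter again;
# a set that keeps colliding would be evidence against universality.
rng = random.Random(1)
for seed in (rng.randrange(1, 1 << 30) for _ in range(5)):
    distinct = len({shift_add_xor(k, seed, TABLE) for k in colliding})
    print(seed, distinct)
```

For a well-behaved class, the number of distinct addresses under a fresh seed should be close to the size of the colliding set, just as a random set of keys would behave.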
    Load factor      60%           70%           80%           90%
    Predicted        4.333         4.636         4.947         5.242
    Actual           4.556±0.644   4.797±0.677   5.069±0.679   5.306±0.688

Table 2: Length of the longest probe sequence (llps) for 1000 keys at load factors from 60% to 90%, averaged over 10,000 seeds (± one standard deviation). The keys are extracted from the trec data.
[Figure 1: a histogram of frequency against length of the longest probe sequence, with one curve for each of the load factors 60%, 70%, 80%, and 90%.]

Figure 1: Distribution of the length of the longest probe sequence (llps) for 1000 keys and 1,000,000 randomly-selected seeds. The keys are extracted from the trec data.
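The llps statistic reported in Table 2 and Figure 1 is simple to measure directly. The sketch below is illustrative only: synthetic keys stand in for the trec words, and only 200 seeds are sampled rather than 10,000.

```python
import random
from collections import Counter

MASK = 0xFFFFFFFF
L, R = 5, 2  # illustrative shift amounts

def shift_add_xor(key: str, seed: int, table_size: int) -> int:
    h = seed & MASK
    for c in key.encode():
        h = (h ^ (((h << L) & MASK) + (h >> R) + c)) & MASK
    return h % table_size

def llps(keys, seed, table_size) -> int:
    # llps = size of the fullest chain, i.e. the longest successful probe sequence
    chains = Counter(shift_add_xor(k, seed, table_size) for k in keys)
    return max(chains.values())

keys = [f"key{i}" for i in range(1000)]   # stand-in for 1000 real keys
table = 1111                               # roughly a 90% load factor
rng = random.Random(1)
dist = Counter(llps(keys, rng.randrange(1, 1 << 30), table) for _ in range(200))
print(sorted(dist.items()))                # most mass should sit on a few small values
```

Plotting such a distribution for each load factor gives a histogram of the kind shown in Figure 1.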
We then searched for a set of 1000 keys with the same hash value, attacked the shift-add-xor class by choosing 1,000,000 random seeds, and examined the distribution of llps values. After this attack the average llps was 5.307 (almost identical to the value in Table 2) and, aside from one occurrence of an llps of 15 and one of an llps of 12, the values were between 4 and 10. That is, behaviour was virtually indistinguishable from that of a random set of keys.

This process of attack was a key tool in our evaluation of other hashing functions. For example, for the shift-xor-xor class defined with

    step(i, h, c) = h ⊕ ((h ≪ L) ⊕ (h ≫ R) ⊕ c)

attack increases average llps over 1000 seeds to 6.229. For the class defined with

    init(v)       = 0
    step(i, h, c) = (h ≪ L) + (h ≫ R) + c

average llps is increased to 41.198, with an average successful search length of 5.491.

To survive such an attack, a class of functions must be large. For example, the class defined by

    init(v)       = v
    step(i, h, c) = (h ≪ 7) + c
    final(h, v)   = h mod T

only has, in effect, T distinct members for a table size T that is a power of 2.

5 Other string hashing functions

In addition to string hashing functions proposed in the literature, not surprisingly many different functions are to be found embodied in software. We now consider some of these functions.

As discussed above, some algorithms texts describe a form of multiplicative method: two-stage methods in which the string is reduced to a number before processing with a hashing function for integers. One form of multiplicative method is as follows.

    step(i, h, c) = h × r + c
    final(h, v)   = ((v ⊕ h) mod p) mod T

where p is a large prime and r is a radix. In a variation of this form, an array P of distinct large primes can be used as follows.

    init(v)       = 0
    step(i, h, c) = h + c × P_i
    final(h, v)   = ((v ⊕ h) mod p) mod T
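The radix form of the multiplicative method above can be sketched as follows. The specific constants are our illustrative assumptions: a 32-bit word, r = 31 (an odd radix), and p = 2^31 - 1 as the large prime.

```python
# Sketch of the multiplicative method (illustrative constants only).
# step(i, h, c) = h * r + c;  final(h, v) = ((v XOR h) mod p) mod T.

MASK = 0xFFFFFFFF       # assume 32-bit unsigned arithmetic
RADIX = 31              # an odd radix, not a power of 2
PRIME = 2147483647      # a large prime (2^31 - 1); any large prime would do

def multiplicative_hash(key: str, seed: int, table_size: int) -> int:
    h = 0
    for c in key.encode():
        h = (h * RADIX + c) & MASK               # step: radix accumulation
    return ((seed ^ h) % PRIME) % table_size     # final mixes in the seed

print(multiplicative_hash("trec", seed=7, table_size=1429))
```

Note that in this form the seed enters only through the final stage, rather than diffusing through every step as in shift-add-xor.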
We tested several functions of this kind using the same methodology as in Section 4. These approaches are uniform (provided that radix r is not a power of 2); probably universal, as they are resistant to adversarial attack; and widely applicable. However, Sedgewick's function [17] does not resist attack quite as well as other functions on the full set of trec strings; average …

… form classes) and, with respect to our criteria, are not particularly interesting. Nor are they distinct from the classes we have already discussed; most are variants of simple radix or shift methods. A more interesting class of hashing functions is defined by

    init(v) = v
    …
[4] E.A. Fox, Q.F. Chen, A.M. Daoud and L.S. Heath. Order-preserving minimal hash functions and information retrieval. ACM Transactions on Information Systems, Volume 9, Number 3, pages 281–308, 1991.

[5] G. Gonnet. Expected length of the longest probe sequence in hash code searching. Journal of the ACM, Volume 28, Number 2, pages 289–304, 1981.

[6] D.K. Harman. Overview of the first Text Retrieval Conference. In D.K. Harman (editor), Proc. TREC Text Retrieval Conference, pages 1–20, Washington, November 1992. National Institute of Standards Special Publication 500-207.

[15] M.V. Ramakrishna. Practical performance of Bloom filters and parallel free-text searching. Communications of the ACM, Volume 32, Number 10, pages 1237–1239, 1989.

[16] M.V. Ramakrishna and P.A. Larson. File organization using composite perfect hashing. ACM Transactions on Database Systems, Volume 14, Number 2, pages 231–263, 1989.

[17] R. Sedgewick. Algorithms in C. Addison-Wesley, Reading, Massachusetts, second edition, 1990.

[18] J. Zobel and P. Dart. Phonetic string matching: Lessons from information retrieval. In Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 166–173, Zurich, Switzerland, August 1996.