Performance in Practice of String Hashing Functions

M.V. Ramakrishna    Justin Zobel
Department of Computer Science, RMIT
{rama,jz}@cs.rmit.edu.au

Proceedings of the Fifth International Conference on Database Systems for Advanced Applications, Melbourne, Australia, April 1-4, 1997.

Abstract

String hashing is a fundamental operation, used in countless applications where fast access to distinct strings is required. In this paper we describe a class of string hashing functions and explore its performance. In particular, using experiments with both small sets of keys and a large key set from a text database, we show that it is possible to achieve performance close to that theoretically predicted for hashing functions. We also consider criteria for choosing a hashing function and use them to compare our class of functions to other methods for string hashing. These results show that our class of hashing functions is reliable and efficient, and is therefore an appropriate choice for general-purpose hashing.

1 Introduction

String hashing is the process of reducing a string to a pseudo-random number in a specified range. It is a fundamental operation, used widely in applications where speed is critical. On a small scale, a hash table is often the basic data structure in applications such as symbol tables in compilers and account names in password files. Hashing is also used in applications such as spell checking and Bloom filters [15]. In databases, hashing is important, not just for indexing, but also for operations such as joins and inverted-file construction.

The performance of a hashing scheme depends primarily on two factors: the efficiency of the overflow-handling scheme and the behaviour of the hashing function. There has been much research addressing the problems of overflow and collisions. Hashing functions have received less attention, but analytically the behaviour of hashing is now well-understood [3, 7, 10, 11, 14]. However, in much of the work on hashing it is assumed that the keys are integers, while in practice keys are often strings of alphanumeric characters, an aspect of hashing that has attracted surprisingly little research. Some recent papers have examined specific string hashing functions [12, 13] but how these functions compare to the analytically-predicted performance of hashing is unknown.

Moreover, good choice of hashing function is crucial to efficiency. It is often assumed that for a given load factor access costs are independent of table size, but for a poor function this assumption breaks down. In comparison to a good hashing function, a badly designed function may give acceptable performance for a small application such as a symbol table but be much slower when used for a large database application such as a join.

In this paper we present a class of string hashing functions and demonstrate experimentally that the analytically-predicted performance can be achieved in practice by choosing hashing functions at random from this class; to our knowledge there has been no previous investigation of classes of string hashing functions. In these results performance is evaluated by two measures: the average number of probes during successful and unsuccessful search, and the largest number of probes during successful search, that is, the worst case. Our experimental results are based on sets of strings drawn from real data, including a set of over one million distinct words drawn from a text database. The results show that the class gives good average performance.

We also identify four properties that a class of string hashing functions should satisfy: uniformity, universality, applicability, and efficiency. We use these properties to motivate our class of string hashing functions and to compare it to other string hashing functions that have been proposed. These results show that functions in the class are, as well as reliable, faster than other good hashing functions. This class of functions is therefore a good choice for any application involving hashing of strings, including schemes such as hash joins and external hashing as well as the chained hashing used in this paper to explore function performance.

In Section 2 we describe our class of hashing functions. Analysis of hashing schemes is reviewed
in Section 3. In Section 4 we describe our test data and experimental results, considering both average-case and worst-case search lengths. Other string hashing functions are discussed in Section 5.

2 Classes of hashing functions

In this section we describe a class of string hashing functions. First we outline our notation.

String hashing functions can be represented in the following generic form, in which $s = c_1 \ldots c_m$ is a string of $m$ characters, $v$ is a seed, and $h_i$ is an intermediate hash value after examination of $i$ characters.

    $h(s, v) =$
        set $h_0 \leftarrow \mathit{init}(v)$
        for each character $c_i$ in $s$,
            set $h_i \leftarrow \mathit{step}(i, h_{i-1}, c_i)$
        return $h = \mathit{final}(h_m, v)$

That is, the hash value of $s$ is computed as follows. Some function init is applied to $v$ to yield the initial $h_0$. At each step $h_i$ is a function step of $i$, the hash value computed so far, and the current character. The hash value returned is a function final of $v$ and the internal hash value $h_m$. Defining init, step, and final describes a string hashing function. For example, we might define

\[
\begin{aligned}
\mathit{init}(v) &= 0 \\
\mathit{step}(i, h, c) &= h + c \\
\mathit{final}(h, v) &= h
\end{aligned}
\]

yielding a simple (rather uninteresting) hashing function in which the hash value is the sum of the ASCII values in the given string.

Hash values must be truncated in some way to give values in the range $0 \ldots T - 1$, where $T$ is the table size. In general the only practical mechanism is to take $h$ modulo $T$ (remainder of $h$ after division by $T$) but, for values of $T$ of the form $2^b$ with integer $b$, bitwise and can be used.

Operations that might be used in a hashing function include addition ($+$), multiplication ($\times$), bitwise and ($\wedge$), bitwise or ($\vee$), bitwise exclusive or ($\oplus$), modulo ($\|$), left-shift of value $h$ by $b$ bits ($L_b(h)$), and right-shift of value $h$ by $b$ bits ($R_b(h)$). On most architectures today modulo is implemented in software; multiplication, although usually in hardware, is relatively slow; while the other operations are typically single-cycle instructions. We assume that characters are represented in some integer code such as ASCII.

We contend that, to be useful for general-purpose hashing, a class of hashing functions should satisfy the following properties.

Uniformity. If a hashing function is uniform then the probability of an arbitrary key hashing to a given slot is $1/T$ for table size $T$, independent of the hash values of other keys. In practical terms, uniformity means that for a given load factor (ratio of keys to slots) average access time is roughly constant, regardless of table size.

Universality. A class of hashing functions $H$ is universal if, for a given table size $T$ and any pair of valid keys $s_1$ and $s_2$, the number of hashing functions $h \in H$ such that $h(s_1) = h(s_2)$ is less than or equal to $|H|/T$ [2]. That is, for a randomly-chosen hashing function the probability that $s_1$ and $s_2$ hash to the same value is less than or equal to $1/T$.

    In practice universality means that, with high probability, a randomly-chosen hashing function will perform well. For any hashing function it is true that there exist sets of keys that all hash to the same value, and no hashing function is invulnerable to a deliberate attempt to identify such a set of keys. However, if a class of hashing functions is universal and the functions in the class are uniform then it is guaranteed that the class cannot be subjected to such attack. If for some hashing function $h$ and set $S$ of keys every key $s \in S$ is such that $h(s) = k$ for some $k$, it is still true that for another randomly-chosen function $h'$ the set of hash values $h'(s)$ will be uniformly distributed.

    It is somewhat difficult to test universality in practice; such a test would require hashing every pair of keys for every possible seed value and table size. However, by subjecting the class to attack of the kind outlined above (actively searching for keys that hash to the same value) we can obtain a strong indication that the class is indeed universal.

Applicability. At a more pragmatic level, hashing functions should be applicable in all circumstances where hashing might be used. A function that is limited to a few table sizes, can only hash strings of a certain length, or cannot accept seeds (which allow, for example, double hashing) is not as valuable as functions without such restrictions.

Efficiency. The primary advantage of hashing as an access method is its speed: given a data set
of $n$ keys and a table of $O(n)$ size search time is $O(1)$, assuming a hashing function with time complexity $O(1)$. Hashing functions should also be small; in many applications there is little advantage to a function that is as large as the key set.

    In practice, constant factors can be important. For example, in some applications it is possible for search in an array, although $O(\log n)$, to be as fast as search in a hash table. Consider an application in which the keys are long strings. During search of an array, all that is required at each element is inspection of the first few characters of the key (up to a mismatch), then a full comparison when the correct element is found. During hashing, the key must be completely inspected at least twice, once to form the hash value and at least once to check the key in the table. A slow hashing function, with complex operations for each character, could well be unacceptable.

Other valuable properties are perfection, where the hashing function is collision-free, and order-preservation, where the sort order of the hash values is the same as the sort-order of the original keys [1, 4, 16]. Both are valuable in specific applications; perfect hashing functions can be used for lookup in static tables, for example, because it may then not be necessary to store the keys. However, such functions require prior knowledge of the complete set of keys to be hashed. We do not consider perfect or order-preserving hashing functions in this paper.

We now define our class of string hashing functions. To obtain a class of hashing functions that meets the criteria above, we wish to use efficient primitive operators such as addition and exclusive or; to use as few as possible of these operators; to allow the function to generate large hash values; and to design the function to scramble the input bits as thoroughly as possible, without losing the contribution of any characters. Thus it is essential to use some mechanism such as left shift to make use of higher-order bits, while operators such as bitwise and should be avoided because they tend to erase information. Based on these principles we experimented with many combinations of primitive operators, and as a result propose the shift-add-xor class of hashing functions, in which the components are defined by

\[
\begin{aligned}
\mathit{init}(v) &= v \\
\mathit{step}(i, h, c) &= h \oplus (L_L(h) + R_R(h) + c) \\
\mathit{final}(h, v) &= h \,\|\, T
\end{aligned}
\]

in which the modulo operation in the final step can be replaced by bitwise-and for suitable $T$ values. As we discuss below, this function was the simplest we could identify that had the required properties. Functions of this general form are not new, but to our knowledge they have not previously been analysed with respect to the theoretical behaviour of hashing functions.

Uniformity and universality are investigated in Section 4; for now we simply note that each seed gives a new function and hence we have defined a class of hashing functions.

Given appropriate choice of shift magnitudes $L$ and $R$ good use is made of the 32-bit space, providing greater likelihood of a uniform distribution of hash values. For example, with 5-character keys and $L \ge 7$ it is possible to obtain any 32-bit string using this class of functions. We used $L = 5$ and $R = 2$ in our experiments, but found that variation in these values made little difference for $4 \le L \le 7$ (so that only a few characters are required to yield a large hash value) and $1 \le R \le 3$ (so that the contribution of the first characters is not diminished); note that the character set was ASCII, that is, 7-bit values. We conclude that the class is widely applicable.

The class is fairly efficient. There is no use of slow operations such as modulo or multiplication (other than the necessary final use of modulo to reduce the hash value to the table size) and only five operations per character (one exclusive or, two shifts, and two additions), each of which on our machines requires only a single instruction cycle. It is possible that there exists a simpler effective hashing function, but it cannot be obtained by simplifying shift-add-xor. Considering the possible simplifications: the left-shift is required to obtain 32-bit values; the right-shift is required for uniformity, because, we suspect, the majority of occurrences of letters in English, including all six vowels, have a rightmost 1-bit; and, as we discuss below, the exclusive or is required for universality. Efficiency is considered further in Section 5.

Note that we do not require that table size $T$ be prime, or be carefully chosen in any way; hashing functions should be effective for all table sizes.

Some readers may be curious as to why we chose to define step as above rather than as

\[
\mathit{step}(i, h, c) = h \oplus L_L(h) \oplus R_R(h) \oplus c
\]

given the belief that exclusive or is appropriate for hashing. This shift-xor-xor class is uniform but, apparently, not universal. The reasons are not entirely clear to us, but it seems that a mix of
addition and exclusive or is required; in particular, addition appears to be valuable because it propagates change between bits, and leads to a more even distribution of 0 and 1 at each position.

There are essentially two methods for hashing strings. One is to directly reduce the string to a string of bits, as in the shift-add-xor class. Another is to convert the string to a number, then apply an integer hashing function such as

\[
\mathit{final}(h, v) = ((v \times h) \,\|\, p) \,\|\, T
\]

where $p$ is a large prime. We would expect such functions to be well-behaved, but the operations required in the conversion and hashing make it unlikely that they would be faster than shift-add-xor. For example, consider Cormen, Leiserson, and Rivest [2] and Sedgewick [17], which are two of the better-known recent algorithms texts. Cormen, Leiserson, and Rivest [2] suggest that strings be converted to numbers through radix conversion. For alphanumeric strings, implicitly of base 62, radix conversion requires up to two comparisons, a subtraction, and a multiplication for each character, with possibly further operations because of overflow. For ASCII characters radix conversion is rather simpler, involving only a left-shift of 7 places (multiplication by 128), but such conversion can lead to the contribution of the first characters in a string being lost as they are shifted out to the left. Thus the technique of regarding strings as numbers and using numerical hashing is inappropriate unless arbitrarily large numbers can be manipulated efficiently. Sedgewick [17] suggests a method that avoids overflow by use of a modulo operation at each step, which is considerably more expensive to evaluate.

Our hypothesis, then, is that by choosing functions at random from the shift-add-xor class of string hashing functions (that is, by making a random selection of seed) we can in practice obtain the analytically predicted performance of hashing schemes. Prediction of performance is reviewed in the next section.

3 Predicted behaviour of hashing

Hashing techniques are usually analysed under the assumption that the hash values are uniformly distributed. Consider a set of $n$ keys mapped into an address range of $T$ values. Given a key $s$ and a hashing function $h$ that maps the key into this range, the probability that the key hashes to a particular address is $1/T$ and is independent of the outcome of hashing other keys. There are $T^n$ ways in which $n$ keys can be distributed among the $T$ addresses, that is, there are $T^n$ functions that map the given set of keys into the table. It is assumed that each of these distributions is equally likely when the $n$ keys are hashed into $T$ slots.

The analytically-predicted performance of a class of hashing functions corresponds to the expected performance of a randomly-chosen function from the set of $T^n$ functions. It is interesting to consider both average-case and worst-case behaviour. In the average case, behaviour is measured by the average length of the probe sequence (that is, the average number of accesses) for successful and unsuccessful search. Analytical results for the average case in a chained hash table are given by Knuth [8, page 535].

The worst case for hashing occurs when all the keys hash to the same address and the search length is $O(n)$. Knuth [8, page 540] expressed fear of this possibility by concluding that "hashing would be inappropriate for certain real-time applications such as air traffic control, where people's lives are at stake". However, Gonnet [5] proved that such fears of hashing are baseless, since the probability of the worst case is, in his words, ridiculously small.

Gonnet proposed a measure for the worst case of hashing based on the length of the longest probe sequence, or llps. Out of all the keys stored in the hash table, one has the maximum successful search length. Gonnet proposed that the expected value of llps is a better measure of the worst case of hashing than is the (extraordinarily improbable) worst case of llps, and demonstrated theoretically that llps is very narrowly distributed with the expected value being quite small, that is, not dramatically greater than would be given by dividing the keys evenly amongst the buckets. Larson extended these results for the general case of bucket size greater than 1 [9].

We now use these analytical results, for both average-case and worst-case behaviour of a class of ideal hashing functions, as a yardstick for evaluating the behaviour in practice of classes of string hashing functions.

4 Experimental results

Our hypothesis is that, by choosing hashing functions at random from the shift-add-xor class of hashing functions, the analytically-predicted performance of hashing schemes can be achieved in practice. To support the hypothesis, in this section we experimentally evaluate the shift-add-xor class of string hashing functions on real data sets.
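The kind of measurement described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' experimental code: the function names are ours, the keys are synthetic random strings standing in for the paper's data files, and 32-bit arithmetic is imitated by masking. The shift amounts 5 and 2 are the values the text reports using. The closed forms $1 + \alpha/2$ and $\alpha + e^{-\alpha}$ are the usual approximations for separate chaining at load factor $\alpha$; they agree with the predicted rows of Table 1 to three decimals, but that they are exactly the expressions quoted from Knuth is an assumption.

```python
import math
import random

MASK32 = 0xFFFFFFFF  # keep intermediate values in a 32-bit space


def shift_add_xor(s, seed, table_size, left=5, right=2):
    """init(v) = v; step = h xor (L_left(h) + R_right(h) + c); final = h mod T."""
    h = seed & MASK32
    for ch in s:
        h ^= ((h << left) + (h >> right) + ord(ch)) & MASK32
    return h % table_size


def chain_statistics(keys, seed, table_size):
    """Return (average successful search length, llps) under separate chaining."""
    chain_len = [0] * table_size
    for key in keys:
        chain_len[shift_add_xor(key, seed, table_size)] += 1
    # The j-th key on a chain is found after j probes, so a chain holding
    # n keys contributes 1 + 2 + ... + n = n(n + 1)/2 probes in total.
    probes = sum(n * (n + 1) // 2 for n in chain_len)
    return probes / len(keys), max(chain_len)


def predicted_successful(alpha):
    """Approximate expected probes for a successful search (assumed closed form)."""
    return 1 + alpha / 2


def predicted_unsuccessful(alpha):
    """Approximate expected probes for an unsuccessful search (assumed closed form)."""
    return alpha + math.exp(-alpha)


def synthetic_keys(n, rng):
    """Stand-in data: n distinct random lowercase strings (not the paper's files)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    keys = set()
    while len(keys) < n:
        length = rng.randint(4, 10)
        keys.add("".join(rng.choice(letters) for _ in range(length)))
    return sorted(keys)


def experiment(n_keys=1000, load=0.7, n_seeds=100, rng_seed=0):
    """Average both measures over randomly chosen seeds (that is, functions)."""
    rng = random.Random(rng_seed)
    keys = synthetic_keys(n_keys, rng)
    table_size = math.ceil(n_keys / load)  # 1000 keys at 70% load gives 1429 slots
    results = [chain_statistics(keys, rng.getrandbits(32), table_size)
               for _ in range(n_seeds)]
    mean_search = sum(r[0] for r in results) / n_seeds
    mean_llps = sum(r[1] for r in results) / n_seeds
    return mean_search, mean_llps


if __name__ == "__main__":
    mean_search, mean_llps = experiment()
    print(f"measured successful search length:  {mean_search:.3f}")
    print(f"predicted successful search length: {predicted_successful(0.7):.3f}")
    print(f"mean llps over seeds:               {mean_llps:.3f}")
```

Only chain lengths are tracked, because both measures depend on how many keys share a slot rather than on which keys they are. Replacing `synthetic_keys` with a reader of real key files would be the natural next step for anyone repeating the experiments on their own data.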
Exhaustively checking whether a class of hashing functions is indeed uniform would require evaluating the function over all potential key sets for all seeds. This or any close approximation to it is clearly impractical, but by applying the class to a selection of data sets with a reasonable number of seeds we can be highly confident that the observed behaviour is a good approximation to the behaviour over the whole class.

For these experiments we had several sets of keys available to us. We used these to explore the performance of the hashing functions discussed in this paper. The reported results are based on the following key sets. (However, results for all of the key sets were similar.) One was names, a file of 31,918 distinct surnames extracted from Internet news articles and hand-edited to remove errors and nonsense [18].(1) Another was trec, a file of 1,073,726 distinct words (that is, contiguous alphabetic strings) extracted from the first 3 gigabytes of the TREC data [6]; this data contains the full-text of newspaper articles, abstracts, and scientific journals. In our experiments we have focused on certain table sizes and load factors, to allow comparison with previously published analytical results, and thus did not usually require the full data sets. Instead we used subsets of the data of the required size: ten random subsets of 1000 strings each, from each of trec and names; the lexicographically first 1000 strings from each of trec and names; a file, fives, of the first 1000 distinct strings of exactly five characters (that is, "aaaaa", "aaaab", and so on); and a file, sevif, of the strings from fives reversed. These last four files are pathological cases that should help to expose flaws in weak hashing functions.

(1) This file is available by ftp from goanna.cs.rmit.edu.au in the file pub/rmit/fnetik/data/Surnames.Z

In these experiments we have focused on hash tables with separate chaining, which, with their tolerance to overflows and similarity to dynamic-table schemes such as linear hashing and extensible hashing, we consider to be most typical of hash tables in practical use. However, the results are independent of the hash table organisation: they demonstrate properties of the class of hashing functions that apply regardless of how it is used, whether for internal or external hashing, to slots of size 1 or buckets of many keys each, or to applications such as hash joins.

Average-case search length

We first investigated average search lengths for successful and unsuccessful search. Results are shown in Table 1. The "actual" results are an average over 10,000 randomly-selected hashing functions (equivalently, seeds), based on one set of trec keys; the "±" figure is one standard deviation. In these results the number of keys was held at 1000 and the table size varied to give a load factor. For example, with a load factor of 70% the table size was $\lceil 1000 / 70\% \rceil = 1429$. The "predicted" results are quoted from Knuth [8, page 535]. As can be seen, the correspondence is extremely good, thus confirming our hypothesis that functions in the class shift-add-xor generate uniform hash addresses. Almost identical results (usually to within 0.01) were observed with the other data files, including the four "pathological" data files. We also tried other table sizes and key set sizes, including table sizes such as powers of 2 that might lead to poor behaviour, but again similar behaviour was observed. Note that we have not reported figures for larger bucket sizes; changing bucket size does not change the distribution or the properties of the hashing function.

                          40%            60%            70%            80%            90%
Predicted successful      1.200          1.300          1.350          1.400          1.450
Actual successful         1.200 ± 0.014  1.299 ± 0.017  1.350 ± 0.019  1.400 ± 0.020  1.450 ± 0.021
Predicted unsuccessful    1.070          1.149          1.197          1.249          1.307
Actual unsuccessful       1.070 ± 0.004  1.148 ± 0.006  1.196 ± 0.007  1.249 ± 0.008  1.307 ± 0.009

Table 1: Average search length, successful and unsuccessful, for 1000 keys at load factors from 40% to 90%, averaged over 10,000 seeds (± one standard deviation). The keys are extracted from the trec data.

By way of comparison, consider the class of hashing functions given by

\[
\begin{aligned}
\mathit{init}(v) &= 0 \\
\mathit{step}(i, h, c) &= L_1(h) + c \\
\mathit{final}(h, v) &= h \,\|\, T
\end{aligned}
\]

This function is like that used in several compilers, as reported by McKenzie, Harries, and Bell [12]. For a load factor of 90%, on the twenty randomised data files average successful search length was 1.701, already significantly greater than the prediction of 1.450, while on sevif it was 5.110 and on fives it was 9.358.

It is also interesting to consider performance on a large data set, the more realistic case for a hashing function to be used in a database system. For the full set of trec keys, 1000 randomly-chosen hashing functions, and a load factor of 90%, shift-add-xor gave an average successful search length of 1.459 and an average unsuccessful search length of 1.310, essentially identical to performance with a small set of keys. With the simple function above, however, average successful search length was 19.103; increasing the value of the left shift to 4 decreases this value, to 4.669, a figure that is however still unacceptable. From this experiment and similar experiments with other large sets of keys (such as the first 1,000,000 five-character strings) we have observed that with a poorly-chosen hashing function performance can markedly deteriorate as the number of keys increases. However, a good hashing function such as a member of shift-add-xor will indeed give the theoretically-predicted performance.

Worst-case search length

Experimental results for the expected length of the longest probe sequence, or llps, are shown in Table 2, from the same experiments reported in Table 1. The "predicted" results are quoted from Gonnet [5]. As can be seen, llps values vary significantly between runs, as indicated by the high standard deviation. For a load factor of 60%, the greatest llps observed in the 10,000 runs was 8; for load factors of 70%, 80%, and 90% the greatest llps was 9. The llps values varied somewhat between data files; for example, for a load factor of 90% and the files drawn from names and trec the minimum average value of llps was 5.257 and the maximum was 5.332. However, all of these values are, within the error indicated by the standard deviation, close to the analytically-predicted value. For fives, average llps was 3.034.

We decided to examine in detail the distribution of llps values, by hashing the strings in one data set with 1,000,000 randomly-selected hashing functions. The results are shown in Figure 1. As predicted by the analysis, the distribution of experimental llps values is extremely narrow: even with a load factor of 90%, over 95% of the llps values are 4, 5, or 6; the largest observed value was 12, which occurred only once in the 4,000,000 experiments.

Pushing this experiment further, we chose a random set of 20 keys from names, a table size of 20, and measured llps for the $2^{30}$ hashing functions given by the seeds between 1 and $2^{30}$. The worst llps was 11, which occurred only 9 times in over one billion trials. That is, for even such a small table exhaustive search of the class failed to find a hashing function that maps all keys to the same value. Average llps was 3.231.

For the full set of trec keys, the distribution of llps values is even narrower. With 1000 randomly-chosen hashing functions and with a load factor of 90%, the average llps was 8.900, with a minimum of 8 and a maximum of 11. (Note that llps is expected to rise slowly as table size is increased; this is not an indicator of poor performance.) Interestingly, our experiments indicate that llps is a better tool than average search length for discriminating between hashing functions, particularly on large key sets. For example, on the same data the hashing function given by simplifying the step operation in shift-add-xor to

\[
\mathit{step}(i, h, c) = h \oplus (L_L(h) + c)
\]

has reasonable average successful search length but average llps (the worst-case successful search length) markedly deteriorates, to 25.886.

Note that the llps values quoted in Table 2 are not a lower bound; it is quite possible for a hashing function to have better worst-case performance for a given data set. In particular, perfect hashing functions, which are constructed with respect to the set of keys to be hashed, have by definition an llps of 1. The weakness of such functions is their inefficiency for dynamic sets of keys and the unpredictable behaviour for an arbitrary key set.

Universality

Although it is not possible to conclusively demonstrate that a class of hashing functions is universal, there is evidence that can indicate whether universality holds. The method we have used is deliberate attack: for some hashing function and table size, find a set of strings that hash to the same value; then for that set of strings explore llps and average search lengths. A significant increase in llps indicates that some strings are being hashed to the same value for more seeds than would be expected for a universal class of hashing functions.

To use this approach to provide evidence for universality we used the full trec key set, assumed a loading factor of 90% and a table size of 1111, randomly chose a hashing function, then searched for
any set of 1000 keys with the same hash value. We then attacked the shift-add-xor class by choosing 1,000,000 random seeds and examined the distribution of llps values. After this attack the average llps was 5.307 (almost identical to the value in Table 2) and, aside from one occurrence of an llps of 15 and one of an llps of 12, the values were between 4 and 10. That is, behaviour was virtually indistinguishable from that of a random set of keys.

             60%            70%            80%            90%
Predicted    4.333          4.636          4.947          5.242
Actual       4.556 ± 0.644  4.797 ± 0.677  5.069 ± 0.679  5.306 ± 0.688

Table 2: Length of the longest probe sequence (llps) for 1000 keys at load factors from 60% to 90%, averaged over 10,000 seeds (± one standard deviation). The keys are extracted from the trec data.

[Figure 1 plot: frequency (0 to 600,000) against length of the longest probe sequence (5 to 15), one curve for each load factor of 60%, 70%, 80%, and 90%.]

Figure 1: Distribution of the length of the longest probe sequence (llps) for 1000 keys and 1,000,000 randomly-selected seeds. The keys are extracted from the trec data.

This process of attack was a key tool in our evaluation of other hashing functions. For example, for the shift-xor-xor class defined with

\[
\mathit{step}(i, h, c) = h \oplus L_L(h) \oplus R_R(h) \oplus c
\]

attack increases average llps, over 1000 seeds, to 6.229. For the class defined with

\[
\mathit{step}(i, h, c) = L_L(h) + R_R(h) + c
\]

average llps is increased to 41.198, with an average successful search length of 5.491.

To survive such an attack, a class of functions must be large. For example, the class defined by

\[
\begin{aligned}
\mathit{init}(v) &= v \\
\mathit{step}(i, h, c) &= L_7(h) + c \\
\mathit{final}(h, v) &= h \,\|\, T
\end{aligned}
\]

only has, in effect, $T$ distinct members for a table size $T$ that is a power of 2.

5 Other string hashing functions

In addition to string hashing functions proposed in the literature, not surprisingly many different functions are to be found embodied in software. We now consider some of these functions.

As discussed above, some algorithms texts describe forms of multiplicative method, which are two-stage methods in which the string is reduced to a number before processing with a hashing function for integers. One form of multiplicative method is as follows.

\[
\begin{aligned}
\mathit{init}(v) &= 0 \\
\mathit{step}(i, h, c) &= h \times r + c \\
\mathit{final}(h, v) &= ((v \times h) \,\|\, p) \,\|\, T
\end{aligned}
\]

where $p$ is a large prime and $r$ is a radix. In a variation of this form, an array $P$ of distinct large primes can be used as follows.

\[
\begin{aligned}
\mathit{init}(v) &= 0 \\
\mathit{step}(i, h, c) &= h + P_i \times c \\
\mathit{final}(h, v) &= ((v \times h) \,\|\, p) \,\|\, T
\end{aligned}
\]
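As a rough sketch of the two multiplicative forms, the following Python relies on arbitrary-precision integers, so the overflow problem the text raises for radix conversion does not arise (at the cost of the slow arithmetic the text warns about). Everything concrete here is an illustrative assumption of ours rather than a value from the paper: the function names, the radix 131, the prime $2^{31} - 1$ standing in for $p$, the three-entry prime table, and the decision to cycle through that table for strings longer than it.

```python
MERSENNE_31 = 2**31 - 1  # a prime; stand-in for the "large prime" p

# Hypothetical table for the P_i variant. These are well-known primes,
# chosen for illustration only, and none of them equals MERSENNE_31.
PRIME_TABLE = (1_000_000_007, 1_000_000_009, 998_244_353)


def radix_hash(s, seed, table_size, radix=131, prime=MERSENNE_31):
    """init(v) = 0; step = h * r + c; final = ((v * h) mod p) mod T."""
    h = 0
    for ch in s:
        h = h * radix + ord(ch)  # radix conversion; Python ints do not overflow
    # A seed of 0 would send every key to slot 0, so seeds should be non-zero.
    return ((seed * h) % prime) % table_size


def prime_table_hash(s, seed, table_size, primes=PRIME_TABLE, prime=MERSENNE_31):
    """init(v) = 0; step = h + P_i * c; final = ((v * h) mod p) mod T."""
    h = 0
    for i, ch in enumerate(s):
        # Reusing primes cyclically for long strings is our assumption; the
        # text only says that P holds distinct large primes.
        h += primes[i % len(primes)] * ord(ch)
    return ((seed * h) % prime) % table_size
```

For example, `radix_hash("ab", 3, 1000, radix=128)` converts the string to 97 × 128 + 98 = 12514, multiplies by the seed to get 37542, and reduces that to slot 542. A radix of 128 is used in that call only to make the arithmetic easy to follow; the text notes that a radix that is a power of 2 does not give uniform behaviour.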

We tested several functions of this kind using the same methodology as in Section 4. These approaches are uniform (provided that radix $r$ is not a power of 2); probably universal, as they are resistant to adversarial attack; and widely applicable. However, Sedgewick's function [17] does not resist attack quite as well as other functions on the full set of trec strings; average llps is 10.334, a significant increase.

Moreover, these functions are relatively slow. On a Sun SPARC 20, functions from the shift-add-xor class can process just over 1000 strings per millisecond. In contrast, the multiplicative methods processed under 300 strings per millisecond and Sedgewick's method (which uses a modulo for each character) processed under 150 strings per millisecond. We would expect similar relative performance on other current architectures.

Two recent papers concern string hashing functions. Pearson [13] proposed an algorithm that can be defined as

\[
\begin{aligned}
\mathit{init}(v) &= 0 \\
\mathit{step}(i, h, c) &= A_{h \oplus c} \\
\mathit{final}(h, v) &= h
\end{aligned}
\]

in which $A$ is an array of the 256 distinct 8-bit values, randomly permuted, and $A_{h \oplus c}$ denotes the $(h \oplus c)$'th value in $A$. This algorithm computes 8-bit hash values only, albeit quickly. Pearson also gives an extension to 16-bit hash values, which is somewhat slower. The array $A$ is in effect the seed, since different permutations yield different hashing functions. However, this function is of limited value in practice, since it is expensive to store each seed, and the function is only applicable to limited table sizes. As a generalisation of Pearson's function we tested the class

\[
\begin{aligned}
\mathit{init}(v) &= v \\
\mathit{step}(i, h, c) &= h \oplus (L_L(h) + A_{(h \oplus c) \,\|\, 256}) \\
\mathit{final}(h, v) &= h \,\|\, T
\end{aligned}
\]

This class is uniform (experimental results are almost identical to those for shift-add-xor), resistant to attack, and at around 800 keys per millisecond is only slightly slower than shift-add-xor. We have found no simplification of this algorithm that preserves uniformity and universality.

The other recent paper on string hashing functions is a survey of their use in software,

form classes) and, with respect to our criteria, are not particularly interesting. Nor are they distinct from the classes we have already discussed; most are variants of simple radix or shift methods.

A more interesting class of hashing functions is defined by

\[
\begin{aligned}
\mathit{init}(v) &= v \\
\mathit{step}(i, h, c) &= c \oplus (L_L(h) \vee (R_{32-L}(h) \wedge \mathit{MASK})) \\
\mathit{final}(h, v) &= h \,\|\, T
\end{aligned}
\]

in which step is a left-rotation of $h$ by $L$ bits xor'ed with $c$ and MASK is $2^L - 1$, that is, $L$ one-bits. This function, which is similar to a method attributed by Knuth to Knott [8, page 412] and is embodied in the ispell spelling checking utility, has almost exactly the same cost as shift-add-xor and is thus about the same speed, but is slightly vulnerable to attack, with an average llps of 6.064.

6 Conclusions

Analytically, the behaviour of hashing schemes is well understood. In this paper we have presented criteria by which we believe practical hashing functions should be evaluated: uniformity, universality, applicability, and efficiency. We developed a class of shift-add-xor string hashing functions and experimentally showed that, by choosing hashing functions at random from this class, the analytically-predicted performance can be achieved in practice. We have also shown that the class is likely to be universal, as it is resistant to one method of adversarial attack. Moreover, the functions from the class are computationally efficient, processing more keys per unit time than other good string hashing functions, and, as shown in our experiments with the distinct words in trec, are effective even for large key sets such as strings in a database.

The shift-add-xor class of functions is thus an appropriate choice for practical applications. Our results answer an important question posed by all users of hashing, namely, what function should be used for hashing strings. The answer is, make a random choice from this class; with high probability it will work well, and will be at least as efficient as other effective hash functions.

The worst-case performance results for the shift-add-xor class of string hashing functions
by M Kenzie, Harries, and Bell [12℄. Like the
are of parti ular interest. We have provided
fun tion des ribed by Pearson, these fun tions are
experimental eviden e|in luding, for one set of
not designed to a ept seeds (and thus do not
strings, exhaustive search amongst one billion hashing functions, confirming the theoretical prediction that the length of the longest probe sequence is narrowly distributed. To our knowledge these are the first experiments testing this prediction. These results are also a further confirmation that, with an appropriately chosen class of hashing functions, hashing is indeed safe in practice: the likelihood of the theoretical worst case of many keys hashing to the same value is extraordinarily low.

Acknowledgements

We thank Evan Harris for suggesting several hashing functions. We also thank Kotagiri Ramamohanarao. This work was supported by the Australian Research Council.

References

[1] G.V. Cormack, R.N.S. Horspool and M. Kaiserwerth. Practical perfect hashing. Computer Journal, Volume 28, Number 1, pages 54–55, February 1985.

[2] T.H. Cormen, C.E. Leiserson and R.L. Rivest. Introduction to Algorithms. The MIT Press, Massachusetts, 1990.

[3] R.F. Deutscher, P.G. Sorenson and J.P. Tremblay. Distribution dependent hashing functions and their characteristics. In Proc. ACM-SIGMOD International Conference on the Management of Data, pages 224–236, 1975.

[4] E.A. Fox, Q.F. Chen, A.M. Daoud and L.S. Heath. Order-preserving minimal hash functions and information retrieval. ACM Transactions on Information Systems, Volume 9, Number 3, pages 281–308, 1991.

[5] G. Gonnet. Expected length of the longest probe sequence in hash code searching. Journal of the ACM, Volume 28, Number 2, pages 289–304, 1981.

[6] D.K. Harman. Overview of the first Text Retrieval Conference. In D.K. Harman (editor), Proc. TREC Text Retrieval Conference, pages 1–20, Washington, November 1992. National Institute of Standards Special Publication 500-207.

[7] G.D. Knott. Hashing functions. Computer Journal, Volume 18, Number 3, pages 265–278, 1975.

[8] D.E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, Massachusetts, 1973.

[9] P. Larson. Expected worst-case performance of hash files. Computer Journal, Volume 25, Number 3, pages 347–352, 1982.

[10] V.Y. Lum. General performance analysis of key-to-address transformations methods using an abstract file concept. Communications of the ACM, Volume 16, Number 10, pages 603–612, 1973.

[11] V.Y. Lum, P.S.T. Yuen and M. Dodd. Key-to-address transform techniques: A fundamental performance study on large existing files. Communications of the ACM, Volume 14, Number 4, pages 228–239, 1971.

[12] B.J. McKenzie, R. Harries and T. Bell. Selecting a hashing algorithm. Software: Practice and Experience, Volume 20, Number 2, pages 209–224, 1990.

[13] P.K. Pearson. Fast hashing of variable-length text strings. Communications of the ACM, Volume 33, Number 6, pages 677–680, 1990.

[14] M.V. Ramakrishna. Hashing in practice, analysis of hashing and universal hashing. In Proc. ACM-SIGMOD International Conference on the Management of Data, pages 191–199, 1988.

[15] M.V. Ramakrishna. Practical performance of Bloom filters and parallel free-text searching. Communications of the ACM, Volume 32, Number 10, pages 1237–1239, 1989.

[16] M.V. Ramakrishna and P.A. Larson. File organization using composite perfect hashing. ACM Transactions on Database Systems, Volume 14, Number 2, pages 231–263, 1989.

[17] R. Sedgewick. Algorithms in C. Addison-Wesley, Reading, Massachusetts, second edition, 1990.

[18] J. Zobel and P. Dart. Phonetic string matching: Lessons from information retrieval. In Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 166–173, Zurich, Switzerland, August 1996.
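
For readers who wish to experiment with two of the functions discussed in Section 5, Pearson's 8-bit function and the rotate-and-xor class, the following Python sketch may be a useful starting point. It is an illustration under stated assumptions rather than the code used in the experiments: the function names, the default shift of L = 5, the restriction of h to 32 bits, and the use of Python's random module to build the permutation A are all choices made here for the example.

```python
import random

MASK32 = 0xFFFFFFFF  # assumption: h is treated as a 32-bit unsigned value


def make_pearson_table(rng):
    """Return a random permutation of 0..255; this table plays the role of the seed."""
    table = list(range(256))
    rng.shuffle(table)
    return table


def pearson_hash(key: bytes, table) -> int:
    """Pearson-style 8-bit hash: init = 0, step = A[h xor c], final = h."""
    h = 0
    for c in key:
        h = table[h ^ c]
    return h


def rotate_xor_hash(key: bytes, seed: int, table_size: int, shift: int = 5) -> int:
    """Rotate-and-xor class: init = v, step = rotl(h, L) xor c, final = h mod T.

    The default shift (L = 5) is a hypothetical choice; it must satisfy 0 < L < 32.
    """
    h = seed & MASK32
    low_mask = (1 << shift) - 1                    # MASK = 2^L - 1, i.e. L one-bits
    for c in key:
        left = (h << shift) & MASK32               # L_L(h), truncated to 32 bits
        wrapped = (h >> (32 - shift)) & low_mask   # R_{32-L}(h) AND MASK
        h = (left | wrapped) ^ c                   # left-rotation xor'ed with c
    return h % table_size


if __name__ == "__main__":
    rng = random.Random(1)                # fixed seed only to make runs repeatable
    table = make_pearson_table(rng)
    print(pearson_hash(b"hashing", table))
    print(rotate_xor_hash(b"hashing", seed=12345, table_size=1009))
```

Choosing a different permutation for `make_pearson_table`, or a different `seed` for `rotate_xor_hash`, selects a different function; only the second family can be seeded with a single integer, which reflects the storage cost noted for Pearson's function.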
