
Locality-Sensitive Hashing Scheme Based on p-Stable Distributions

Mayur Datar
Department of Computer Science, Stanford University

datar@cs.stanford.edu

Nicole Immorlica
Laboratory for Computer Science, MIT

nickle@theory.lcs.mit.edu

Piotr Indyk
Laboratory for Computer Science, MIT

indyk@theory.lcs.mit.edu

Vahab S. Mirrokni
Laboratory for Computer Science, MIT

mirrokni@theory.lcs.mit.edu

ABSTRACT

We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the ℓ_p norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the ℓ_2 norm. It also yields the first known provably efficient approximate NN algorithm for the case p < 1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying a certain "bounded growth" condition. Unlike earlier schemes, our LSH scheme works directly on points in Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors and is simple and easy to implement. Our experiments (on synthetic data sets) show that our data structure is up to 40 times faster than the kd-tree.

Categories and Subject Descriptors

E.1 [Data]: Data Structures; F.0 [Theory of Computation]: General

General Terms

Algorithms, Experimentation, Design, Performance, Theory

Keywords

Sublinear Algorithm, Approximate Nearest Neighbor, Locally Sensitive Hashing, p-Stable Distributions

1. INTRODUCTION

A similarity search problem involves a collection of objects (documents, images, etc.) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we

This material is based upon work supported by the NSF CAREER grant CCR-0133849.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

SCG'04, June 9–11, 2004, Brooklyn, New York, USA.
Copyright 2004 ACM 1-58113-885-7/04/0006 ...$5.00.

are required to find the nearest (most similar) object to the query. A particularly interesting and well-studied instance is d-dimensional Euclidean space. This problem is of major importance to a variety of applications; some examples are: data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in ℝ^d and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands.

The low-dimensional case (say, for dimensionality equal to 2 or 3) is well-solved, so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality". Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they often provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in [28] (both empirically and theoretically) that all current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions.

In recent years, several researchers proposed to avoid the running-time bottleneck by using approximation (e.g., [3, 22, 19, 24, 15], see also [12]). This is due to the fact that, in many cases, an approximate nearest neighbor is almost as good as the exact one; in particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. In fact, in situations when the quality of the approximate nearest neighbor is much worse than the quality of the actual nearest neighbor, the nearest neighbor problem is unstable, and it is not clear if solving it is at all meaningful [4, 17].

In [19, 14], the authors introduced an approximate high-dimensional similarity search scheme with provably sublinear dependence on the data size. Instead of using tree-like space partitioning, it relied on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. In [19, 14] the authors provided such locality-sensitive hash functions for the case when the points live in the binary Hamming space {0, 1}^d. They showed experimentally that the data structure achieves large speedup over several tree-based data structures when


the data is stored on disk. In addition, since LSH is a hashing-based scheme, it can be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. This avoids the complexity of dealing with tree structures when the data is dynamic.

The LSH algorithm has since been used in numerous applied settings, e.g., see [14, 10, 16, 27, 5, 7, 29, 6, 26, 13]. However, it suffers from a fundamental drawback: it is fast and simple only when the input points live in the Hamming space (indeed, almost all of the above applications involved binary data). As mentioned in [19, 14], it is possible to extend the algorithm to the ℓ_2 norm, by embedding ℓ_2 space into ℓ_1 space, and then ℓ_1 space into Hamming space. However, this increases the query time and/or error by a large factor and complicates the algorithm.

In this paper we present a novel version of the LSH algorithm. As with the previous schemes, it works for the (R, c)-Near Neighbor (NN) problem, where the goal is to report a point within distance cR from a query q, if there is a point in the data set within distance R from q. Unlike the earlier algorithm, our algorithm works directly on points in Euclidean space without embeddings. As a consequence, it has the following advantages over the previous algorithm:

- For the ℓ_2 norm, its query time is O(n^ρ), where ρ = ρ(c) < 1/c (the inequality is strict, see Figure 1(b)). Thus, for a large range of values of c, the query time exponent is better than the one in [19, 14].
- It is simple and quite easy to implement.
- It works for any ℓ_p norm, as long as p ∈ (0, 2]. Specifically, we show that for any p ∈ (0, 2] and γ > 1 there exists an algorithm for (R, c)-NN under ℓ_p^d which uses O(dn + n^{1+ρ}) space, with query time O(n^ρ log_{1/p_2} n), where ρ ≤ (1 + γ) · max(1/c, 1/c^p). To our knowledge, this is the only known provable algorithm for the high-dimensional nearest neighbor problem for the case p < 1. Similarity search under such fractional norms has recently attracted interest [1, 11].

Our algorithm also inherits two very convenient properties of LSH schemes. The first one is that it works well on data that is extremely high-dimensional but sparse. Specifically, the running time bound remains unchanged if d denotes the maximum number of non-zero elements in vectors. To our knowledge, this property is not shared by other known spatial data structures. Thanks to this property, we were able to use our new LSH scheme (specifically, the ℓ_1 norm version) for fast color-based image similarity search [20]. In that context, each image was represented by a point in a high-dimensional space, but only about 100 dimensions were non-zero per point. The use of our LSH scheme enabled achieving order(s) of magnitude speed-up over the linear scan.

The second property is that our algorithm provably reports the

exact near neighbor very quickly, if the data satisfies a certain bounded growth property. Specifically, for a query point q and c ≥ 1, let N(q, c) be the number of c-approximate nearest neighbors of q in P. If N(q, c) grows "sub-exponentially" as a function of c, then the LSH algorithm reports p*, the nearest neighbor, with constant probability within time O(d log n), assuming it is given a constant-factor approximation to the distance from q to its nearest neighbor. In particular, we show that if N(q, c) = O(c^b), then the running time is O(log n + 2^{O(b)}). Efficient nearest neighbor algorithms for data sets with polynomial growth properties in general metrics have recently been a focus of several papers [9, 21, 23]. LSH solves an easier problem (near neighbor under the ℓ_p norm), while working under weaker assumptions about the growth function. It is also somewhat faster, due to the fact that the 2^{O(b)} factor in the query time of the earlier schemes is multiplied by a function of b, while in our case this factor is additive.

We complement our theoretical analysis with an experimental evaluation of the algorithm on data with a wide range of parameters. In particular, we compare our algorithm to an approximate version of the kd-tree algorithm [2]. We performed the experiments on synthetic data sets containing a "planted" near neighbor (see Section 5 for more details); a similar model was earlier used in [30]. Our experiments indicate that the new LSH scheme achieves query times up to 40 times better than the query time of the kd-tree algorithm.

1.1 Notations and problem deﬁnitions

We use ℓ_p^d to denote the space ℝ^d under the ℓ_p norm. For any point v ∈ ℝ^d, we denote by ||v||_p the ℓ_p norm of the vector v. Let M = (X, D) be any metric space, and v ∈ X. The ball of radius r centered at v is defined as B(v, r) = {q ∈ X | D(v, q) ≤ r}. Let c = 1 + ε.

In this paper we focus on the (R, c)-NN problem. Observe that (R, c)-NN is simply a decision version of the Approximate Nearest Neighbor problem. Although in many applications solving the decision version is good enough, one can also reduce the approximate NN problem to the decision version via a binary-search-like approach. In particular, it is known [19, 15] that the c-approximate NN problem reduces to O(log(n/ε)) instances of (R, c)-NN. Then, the complexity of c-approximate NN is the same (within a log factor) as that of the (R, c)-NN problem.

2. LOCALITY-SENSITIVE HASHING

An important technique from [19] to solve the (R, c)-NN problem is locality-sensitive hashing, or LSH. For a domain S of the point set with distance measure D, an LSH family is defined as:

DEFINITION 1. A family H = {h : S → U} is called (r_1, r_2, p_1, p_2)-sensitive for D if for any v, q ∈ S:
- if v ∈ B(q, r_1) then Pr_H[h(q) = h(v)] ≥ p_1,
- if v ∉ B(q, r_2) then Pr_H[h(q) = h(v)] ≤ p_2.

In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy the inequalities p_1 > p_2 and r_1 < r_2.

We will briefly describe, from [19], how an LSH family can be used to solve the (R, c)-NN problem: We choose r_1 = R and r_2 = cR. Given a family H of hash functions with parameters (r_1, r_2, p_1, p_2) as in Definition 1, we amplify the gap between the "high" probability p_1 and "low" probability p_2 by concatenating several functions. In particular, for k specified later, define a function family G = {g : S → U^k} such that g(v) = (h_1(v), ..., h_k(v)), where h_i ∈ H. For an integer L we choose L functions g_1, ..., g_L from G, independently and uniformly at random. During preprocessing, we store each v ∈ P (the input point set) in the bucket g_j(v), for j = 1, ..., L. Since the total number of buckets may be large, we retain only the non-empty buckets by resorting to hashing. To process a query q, we search all buckets g_1(q), ..., g_L(q); as it is possible (though unlikely) that the total number of points stored in those buckets is large, we interrupt the search after finding the first 3L points (including duplicates). Let v_1, ..., v_t be the points encountered therein. For each v_j, if v_j ∈ B(q, r_2) then we return YES and v_j, else we return NO.

The parameters k and L are chosen so as to ensure that with a constant probability the following two properties hold:

1. If there exists v* ∈ B(q, r_1) then g_j(v*) = g_j(q) for some j = 1, ..., L, and


2. The total number of collisions of q with points from P \ B(q, r_2) is less than 3L, i.e.,

   Σ_{j=1}^{L} |(P \ B(q, r_2)) ∩ g_j^{-1}(g_j(q))| < 3L.

Observe that if (1) and (2) hold, then the algorithm is correct. It follows (see [19], Theorem 5 for details) that if we set k = log_{1/p_2} n and L = n^ρ, where ρ = ln(1/p_1)/ln(1/p_2), then (1) and (2) hold with a constant probability. Thus, we get the following theorem (a slightly different version of Theorem 5 in [19]), which relates the efficiency of solving the (R, c)-NN problem to the sensitivity parameters of the LSH family.

THEOREM 1. Suppose there is a (R, cR, p_1, p_2)-sensitive family H for a distance measure D. Then there exists an algorithm for (R, c)-NN under measure D which uses O(dn + n^{1+ρ}) space, with query time dominated by O(n^ρ) distance computations and O(n^ρ log_{1/p_2} n) evaluations of hash functions from H, where ρ = ln(1/p_1)/ln(1/p_2).
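The bucketing procedure above can be sketched in code. The following Python sketch (ours, for illustration; not the authors' implementation) instantiates the base family H with bit sampling over the Hamming cube, i.e., the family given for the binary Hamming space in [19, 14] and mentioned in the introduction, and follows the bucketing and 3L-interruption rules described in the text; the planted-neighbor demo at the end is a hypothetical setup.

```python
import random

class HammingLSH:
    """Bucketing scheme of Section 2, with bit sampling as the base family H."""
    def __init__(self, points, k, L, r2, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        self.r2 = r2
        self.L = L
        # Each g_j concatenates k functions from H: here, k sampled coordinates.
        self.gs = [tuple(rng.randrange(d) for _ in range(k)) for _ in range(L)]
        self.tables = [{} for _ in range(L)]
        for v in points:
            for g, table in zip(self.gs, self.tables):
                table.setdefault(tuple(v[i] for i in g), []).append(v)

    def query(self, q):
        """Return a point within Hamming distance r2 of q (YES), else None (NO)."""
        seen = 0
        for g, table in zip(self.gs, self.tables):
            for v in table.get(tuple(q[i] for i in g), []):
                if sum(a != b for a, b in zip(q, v)) <= self.r2:
                    return v                  # YES: report v
                seen += 1
                if seen >= 3 * self.L:        # interrupt after first 3L points
                    return None
        return None

# Hypothetical demo: 50 random points in {0,1}^32 plus a planted point
# at Hamming distance 2 from the query q.
data_rng = random.Random(1)
d = 32
points = [tuple(data_rng.randrange(2) for _ in range(d)) for _ in range(50)]
q = tuple(data_rng.randrange(2) for _ in range(d))
planted = tuple(1 - b if i < 2 else b for i, b in enumerate(q))
index = HammingLSH(points + [planted], k=7, L=25, r2=5)
res = index.query(q)
```

Choosing k = log_{1/p_2} n and L = n^ρ, as in the discussion preceding Theorem 1, yields the stated guarantees; the values k = 7, L = 25 here are ad hoc choices for the demo.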

3. OUR LSH SCHEME

In this section, we present an LSH family based on p-stable distributions that works for all p ∈ (0, 2]. Since we consider points in ℓ_p^d, without loss of generality we can consider R = 1, which we assume from now on.

3.1 p-stable distributions

Stable distributions [31] are defined as limits of normalized sums of independent identically distributed variables (an alternate definition follows). The most well-known example of a stable distribution is the Gaussian (or normal) distribution. However, the class is much wider; for example, it includes heavy-tailed distributions.

Stable Distribution: A distribution D over ℝ is called p-stable if there exists p ≥ 0 such that for any n real numbers v_1, ..., v_n and i.i.d. variables X_1, ..., X_n with distribution D, the random variable Σ_i v_i X_i has the same distribution as the variable (Σ_i |v_i|^p)^{1/p} X, where X is a random variable with distribution D.

It is known [31] that stable distributions exist for any p ∈ (0, 2]. In particular:

- a Cauchy distribution D_C, defined by the density function c(x) = (1/π) · 1/(1 + x^2), is 1-stable;
- a Gaussian (normal) distribution D_G, defined by the density function g(x) = (1/√(2π)) e^{-x^2/2}, is 2-stable.

We note from a practical point of view that, despite the lack of closed-form density and distribution functions, it is known [8] that one can generate p-stable random variables essentially from two independent variables distributed uniformly over [0, 1].

Stable distributions have found numerous applications in various fields (see the survey [25] for more details). In computer science, stable distributions were used for "sketching" of high-dimensional vectors by Indyk ([18]) and have since found use in various applications. The main property of p-stable distributions mentioned in the definition above directly translates into a sketching technique for high-dimensional vectors. The idea is to generate a random vector a of dimension d whose entries are chosen independently from a p-stable distribution. Given a vector v of dimension d, the dot product a·v is a random variable which is distributed as (Σ_i |v_i|^p)^{1/p} X (i.e., ||v||_p X), where X is a random variable with a p-stable distribution. A small collection of such dot products (a·v), corresponding to different a's, is termed the sketch of the vector v and can be used to estimate ||v||_p (see [18] for details). It is easy to see that such a sketch is linearly composable, i.e., a·(v_1 − v_2) = a·v_1 − a·v_2.
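As a concrete illustration of this sketching property in the 2-stable (Gaussian) case, the following Python sketch (ours, not from the paper) estimates ||v||_2 from dot products with i.i.d. Gaussian vectors: by 2-stability, each dot product a·v is distributed as ||v||_2 · X with X standard normal, so the empirical standard deviation of many dot products estimates the norm.

```python
import math
import random

def l2_sketch_estimate(v, m=4000, seed=0):
    """Estimate ||v||_2 from m dot products with i.i.d. Gaussian vectors.

    By 2-stability, each dot product a.v is distributed as N(0, ||v||_2^2),
    so the sample standard deviation of the dot products estimates the norm.
    """
    rng = random.Random(seed)
    dots = []
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in range(len(v))]
        dots.append(sum(ai * vi for ai, vi in zip(a, v)))
    return math.sqrt(sum(x * x for x in dots) / m)

v = [3.0, 4.0]  # ||v||_2 = 5
print(l2_sketch_estimate(v))  # close to 5
```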

3.2 Hash family

In this paper we use p-stable distributions in a slightly different manner. Instead of using the dot products (a·v) to estimate the ℓ_p norm, we use them to assign a hash value to each vector v. Intuitively, the hash function family should be locality-sensitive, i.e., if two vectors v_1, v_2 are close (small ||v_1 − v_2||_p) then they should collide (hash to the same value) with high probability, and if they are far they should collide with small probability. The dot product a·v projects each vector to the real line; it follows from p-stability that for two vectors v_1, v_2 the distance between their projections (a·v_1 − a·v_2) is distributed as ||v_1 − v_2||_p X, where X is a p-stable distribution. If we "chop" the real line into equi-width segments of appropriate size w and assign hash values to vectors based on which segment they project onto, then it is intuitively clear that this hash function will be locality-preserving in the sense described above.

Formally, each hash function h_{a,b}(v) : ℝ^d → ℤ maps a d-dimensional vector v onto the set of integers. Each hash function in the family is indexed by a choice of random a and b, where a is, as before, a d-dimensional vector with entries chosen independently from a p-stable distribution, and b is a real number chosen uniformly from the range [0, w]. For a fixed a, b the hash function h_{a,b} is given by

   h_{a,b}(v) = ⌊(a·v + b) / w⌋.

Next, we compute the probability that two vectors v_1, v_2 collide under a hash function drawn uniformly at random from this family. Let f_p(t) denote the probability density function of the absolute value of the p-stable distribution. We may drop the subscript p whenever it is clear from the context. For the two vectors v_1, v_2, let c = ||v_1 − v_2||_p. For a random vector a whose entries are drawn from a p-stable distribution, a·v_1 − a·v_2 is distributed as cX, where X is a random variable drawn from a p-stable distribution. Since b is drawn uniformly from [0, w], it is easy to see that

   p(c) = Pr_{a,b}[h_{a,b}(v_1) = h_{a,b}(v_2)] = ∫_0^w (1/c) f_p(t/c) (1 − t/w) dt.

For a fixed parameter w the probability of collision decreases monotonically with c = ||v_1 − v_2||_p. Thus, as per Definition 1, the family of hash functions above is (r_1, r_2, p_1, p_2)-sensitive for p_1 = p(1) and p_2 = p(c) with r_2/r_1 = c.

In what follows we will bound the ratio ρ = ln(1/p_1)/ln(1/p_2), which as discussed earlier is critical to the performance when this hash family is used to solve the (R, c)-NN problem. Note that we have not specified the parameter w, for it depends on the values of p and c. For every c we would like to choose a finite w that makes ρ as small as possible.
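A minimal Python sketch of this family for p = 2 (Gaussian entries for a; the names a, b, w follow the text) illustrates the locality-sensitive behavior empirically; the test vectors and the choice w = 4 are arbitrary values for the demonstration, not parameters from the paper.

```python
import math
import random

def make_hash(d, w, rng):
    """Draw one h_{a,b}(v) = floor((a.v + b) / w) with Gaussian (2-stable) a."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    return lambda v: math.floor((sum(x * y for x, y in zip(a, v)) + b) / w)

def collision_rate(u, v, w=4.0, trials=2000, seed=0):
    """Fraction of randomly drawn hash functions under which u and v collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_hash(len(u), w, rng)
        if h(u) == h(v):
            hits += 1
    return hits / trials

origin = [0.0, 0.0, 0.0]
close = [1.0, 0.0, 0.0]   # distance 1 from origin
far = [4.0, 0.0, 0.0]     # distance 4 from origin
# Collision probability decreases monotonically with distance.
print(collision_rate(origin, close), collision_rate(origin, far))
```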

4. COMPUTATIONAL ANALYSIS OF THE RATIO ρ

In this section we focus on the cases p = 1, 2. In these cases the ratio ρ can be explicitly evaluated. We compute and plot this ratio and compare it with 1/c. Note that 1/c is the best (smallest) known exponent for n in the space requirement and query time that is achieved in [19] for these cases.
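For p = 2, the density of the absolute value of the 2-stable distribution is f_2(t) = (2/√(2π)) e^{−t²/2}, so the collision probability p(c) = ∫_0^w (1/c) f_2(t/c)(1 − t/w) dt can be evaluated numerically. The Python sketch below (ours) does this by the midpoint rule and evaluates ρ = ln(1/p_1)/ln(1/p_2) for one sample choice, w = 4 and c = 2, chosen only for illustration.

```python
import math

def f2(t):
    """Density of |X| for X standard normal (the 2-stable case)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0)

def collision_prob(c, w, steps=100000):
    """p(c) = integral_0^w (1/c) f2(t/c) (1 - t/w) dt, via the midpoint rule."""
    h = w / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += (1.0 / c) * f2(t / c) * (1.0 - t / w)
    return total * h

w, c = 4.0, 2.0
p1 = collision_prob(1.0, w)   # collision probability at distance 1
p2 = collision_prob(c, w)     # collision probability at distance c
rho = math.log(1.0 / p1) / math.log(1.0 / p2)
print(rho)  # below 1/c = 0.5 for this choice of w
```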

4.1 Computing the ratio ρ for special cases

For the special cases p = 1, 2 we can compute the probabilities p_1, p_2 using the density functions mentioned before. A simple
