Locality-Sensitive Hashing Scheme Based on p-Stable Distributions
Mayur Datar, Department of Computer Science, Stanford University
Nicole Immorlica, Laboratory for Computer Science, MIT
Piotr Indyk, Laboratory for Computer Science, MIT
Vahab S. Mirrokni, Laboratory for Computer Science, MIT
Abstract

We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the l_2 norm. It also yields the first known provably efficient approximate NN algorithm for the case p < 1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying a certain "bounded growth" condition. Unlike earlier schemes, our LSH scheme works directly on points in the Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors and is simple and easy to implement. Our experiments (on synthetic data sets) show that our data structure is up to 40 times faster than the kd-tree.
Categories and Subject Descriptors
E.1 [Data]: Data Structures; F.0 [Theory of Computation]: General
General Terms
Algorithms, Experimentation, Design, Performance, Theory
Keywords
Sublinear Algorithm, Approximate Nearest Neighbor, Locality-Sensitive Hashing, p-Stable Distributions
1. Introduction

A similarity search problem involves a collection of objects (documents, images, etc.) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we are required to find the nearest (most similar) object to the query. A particularly interesting and well-studied instance is d-dimensional Euclidean space. This problem is of major importance to a variety of applications; some examples are: data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in R^d and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands.

This material is based upon work supported by the NSF CAREER grant CCR-0133849.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SCG'04, June 9–11, 2004, Brooklyn, New York, USA.
Copyright 2004 ACM 1-58113-885-7/04/0006.

The low-dimensional case (say, for dimensionality equal to 2 or 3) is well-solved, so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality". Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they often provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in [28] (both empirically and theoretically) that
current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions.

In recent years, several researchers proposed to avoid the running-time bottleneck by using approximation (e.g., [3, 22, 19, 24, 15]; see also [12]). This is due to the fact that, in many cases, an approximate nearest neighbor is almost as good as the exact one; in particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. In fact, in situations when the quality of the approximate nearest neighbor is much worse than the quality of the actual nearest neighbor, then the nearest neighbor problem is
unstable, and it is not clear if solving it is at all meaningful [4, 17].

In [19, 14], the authors introduced an approximate high-dimensional similarity search scheme with provably sublinear dependence on the data size. Instead of using tree-like space partitioning, it relied on a new method called
locality-sensitive hashing (LSH)
. The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. In [19, 14] the authors provided such locality-sensitive hash functions for the case when the points live in the binary Hamming space {0, 1}^d. They showed experimentally that the data structure achieves large speedup over several tree-based data structures when
the data is stored on disk. In addition, since LSH is a hashing-based scheme, it can be naturally extended to the dynamic setting, i.e., when insertion and deletion operations also need to be supported. This avoids the complexity of dealing with tree structures when the data is dynamic.

The LSH algorithm has since been used in numerous applied settings, e.g., see [14, 10, 16, 27, 5, 7, 29, 6, 26, 13]. However, it suffers from a fundamental drawback: it is fast and simple only when the input points live in the Hamming space (indeed, almost all of the above applications involved binary data). As mentioned in [19, 14], it is possible to extend the algorithm to the l_2 norm, by embedding l_2 space into l_1 space, and then l_1 space into the Hamming space. However, this increases the query time and/or error by a large factor and complicates the algorithm.

In this paper we present a novel version of the LSH algorithm. As with the previous schemes, it works for the (R, c)-Near Neighbor (NN) problem, where the goal is to report a point within distance cR from a query q, if there is a point in the data set within distance R from q. Unlike the earlier algorithm, our algorithm works directly on points in Euclidean space without embeddings. As a consequence, it has the following advantages over the previous algorithm:

- For the l_2 norm, its query time is O(n^ρ), where ρ < 1/c (the inequality is strict, see Figure 1(b)). Thus, for a large range of values of c, the query time exponent is better than the one in [19, 14].
- It is simple and quite easy to implement.
- It works for any l_p norm, as long as p ∈ (0, 2]. Specifically, we show that for any p ∈ (0, 2] there exists an algorithm for (R, c)-NN under l_p^d which uses O(dn + n^{1+ρ}) space, with query time O(n^ρ log n), where ρ ≤ (1 + γ) · max(1/c, 1/c^p) for an arbitrarily small γ > 0. To our knowledge, this is the only known provable algorithm for the high-dimensional nearest neighbor problem for the case p < 1. Similarity search under such l_p norms has recently attracted interest [1, 11].

Our algorithm also inherits two very convenient properties of LSH schemes. The first one is that it works well on data that is extremely high-dimensional but sparse. Specifically, the running time bound remains unchanged if d denotes the maximum number of non-zero elements in vectors. To our knowledge, this property is not shared by other known spatial data structures. Thanks to this property, we were able to use our new LSH scheme (specifically, the l_1 norm version) for fast color-based image similarity search [20]. In that context, each image was represented by a point in a very high-dimensional space, but only about 100 dimensions were non-zero per point. The use of our LSH scheme enabled achieving order(s) of magnitude speed-up over the linear scan.

The second property is that our algorithm provably reports the
exact near neighbor very quickly, if the data satisfies a certain bounded growth property. Specifically, for a query point q and c > 1, let B(q, c) be the number of c-approximate nearest neighbors of q in P. If B(q, c) grows "sub-exponentially" as a function of c, then the LSH algorithm reports p, the exact nearest neighbor, with constant probability within time O(d log n), assuming it is given a constant-factor approximation to the distance from q to its nearest neighbor. In particular, we show that if B(q, c) = O(c^b), then the running time is O(log n + 2^{O(b)}). Efficient nearest neighbor algorithms for data sets with polynomial growth properties in general metrics have recently been a focus of several papers [9, 21, 23]. LSH solves an easier problem (near neighbor under the l_p norm), while working under weaker assumptions about the growth function. It is also somewhat faster, due to the fact that the O(log n) factor in the query time of the earlier schemes is multiplied by a function of b, while in our case this factor is additive.

We complement our theoretical analysis with an experimental evaluation of the algorithm on data with a wide range of parameters. In particular, we compare our algorithm to an approximate version of the kd-tree algorithm [2]. We performed the experiments on synthetic data sets containing a "planted" near neighbor (see Section 5 for more details); a similar model was earlier used in [30]. Our experiments indicate that the new LSH scheme achieves a query time of up to 40 times better than the query time of the kd-tree algorithm.
1.1 Notations and problem definitions
We use l_p^d to denote the space R^d under the l_p norm. For any point v ∈ R^d, we denote by |v|_p the l_p norm of the vector v. Let M = (X, D) be any metric space, and v ∈ X. The ball of radius r centered at v is defined as B(v, r) = {q ∈ X | D(v, q) ≤ r}. Let c = 1 + ε. In this paper we focus on the (R, c)-NN problem. Observe that (R, c)-NN is simply a decision version of the Approximate Nearest Neighbor problem. Although in many applications solving the decision version is good enough, one can also reduce the c-approximate NN problem to approximate NN via a binary-search-like approach. In particular, it is known [19, 15] that the c-approximate NN problem reduces to O(log(n/ε)) instances of (R, c)-NN. Then, the complexity of c-approximate NN is the same (within a log factor) as that of the (R, c)-NN problem.
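The reduction just described — solving approximate NN through a sequence of (R, c)-NN decision queries over increasing radii — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names and the geometric radius grid are ours, and a brute-force scan stands in for the (R, c)-NN data structure (the binary-search variant mentioned above is analogous).

```python
import math

def rcnn_decide(points, q, R, c):
    """Stand-in for an (R, c)-NN decision structure: report some point
    within distance c*R of q whenever a point within distance R exists.
    (Here a brute-force scan; an LSH structure would replace this.)"""
    best = min(points, key=lambda p: math.dist(p, q))
    return best if math.dist(best, q) <= c * R else None

def approx_nn(points, q, r_min, r_max, c, gamma=0.5):
    """Reduce approximate NN to O(log_{1+gamma}(r_max/r_min)) calls to
    the (R, c)-NN decision structure by sweeping a geometric grid of
    radii from r_min up to r_max."""
    R = r_min
    while R <= r_max:
        hit = rcnn_decide(points, q, R, c)
        if hit is not None:
            return hit
        R *= 1 + gamma
    return None
```

Each successful decision query returns a point within c·R of the query while a point within R exists, so the first radius that succeeds yields a c(1+γ)-approximate nearest neighbor.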
2. Locality-Sensitive Hashing

An important technique from [19] to solve the (R, c)-NN problem is locality-sensitive hashing, or LSH. For a domain S of the point set with distance measure D, an LSH family is defined as follows.

Definition 1. A family H = {h : S → U} is called (r1, r2, p1, p2)-sensitive for D if for any v, q ∈ S:
- if v ∈ B(q, r1) then Pr_H[h(q) = h(v)] ≥ p1,
- if v ∉ B(q, r2) then Pr_H[h(q) = h(v)] ≤ p2.
In order for a locality-sensitive hash (LSH) family to be useful, it has to satisfy the inequalities p1 > p2 and r1 < r2.

We will briefly describe, from [19], how an LSH family can be used to solve the (R, c)-NN problem. We choose r1 = R and r2 = cR. Given a family H of hash functions with parameters (r1, r2, p1, p2) as in Definition 1, we amplify the gap between the "high" probability p1 and the "low" probability p2 by concatenating several functions. In particular, for k specified later, define a function family G = {g : S → U^k} such that g(v) = (h1(v), ..., hk(v)), where hi ∈ H. For an integer L we choose L functions g1, ..., gL from G, independently and uniformly at random. During preprocessing, we store each v ∈ P (the input point set) in the bucket gj(v), for j = 1, ..., L. Since the total number of buckets may be large, we retain only the non-empty buckets by resorting to hashing. To process a query q, we search all buckets g1(q), ..., gL(q); as it is possible (though unlikely) that the total number of points stored in those buckets is large, we interrupt the search after finding the first 3L points (including duplicates). Let v1, ..., vt be the points encountered therein. For each vj, if vj ∈ B(q, r2) then we return YES and vj, else we return NO.

The parameters k and L are chosen so as to ensure that with a constant probability the following two properties hold:
1. If there exists v* ∈ B(q, r1), then gj(v*) = gj(q) for some j = 1, ..., L, and
2. The total number of collisions of q with points from P − B(q, r2) is less than 3L, i.e.,

Σ_{j=1}^{L} |(P − B(q, r2)) ∩ g_j^{−1}(g_j(q))| < 3L.

Observe that if (1) and (2) hold, then the algorithm is correct. It follows (see [19], Theorem 5 for details) that if we set k = log_{1/p2} n and L = n^ρ, where ρ = ln(1/p1)/ln(1/p2), then (1) and (2) hold with a constant probability. Thus, we get the following theorem (a slightly different version of Theorem 5 in [19]), which relates the efficiency of solving the (R, c)-NN problem to the sensitivity parameters of the LSH family.
Theorem 1. Suppose there is a (r1, r2, p1, p2)-sensitive family H for a distance measure D. Then there exists an algorithm for (R, c)-NN under measure D which uses O(dn + n^{1+ρ}) space, with query time dominated by O(n^ρ) distance computations and O(n^ρ log_{1/p2} n) evaluations of hash functions from H, where ρ = ln(1/p1)/ln(1/p2).
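A minimal sketch of the preprocessing and query procedure described above, under stated assumptions: the class and parameter names are ours, ordinary dictionaries play the role of the bucket hashing, and the demo base family is the bit-sampling family for the Hamming space from [19, 14].

```python
import random

class LSHIndex:
    """L concatenations g_j = (h_1, ..., h_k) of base hash functions;
    `family()` must return a fresh random base hash h: point -> value."""
    def __init__(self, family, k, L, points, seed=0):
        random.seed(seed)
        self.g = [[family() for _ in range(k)] for _ in range(L)]
        self.tables = [{} for _ in range(L)]
        self.points = points
        for i, v in enumerate(points):          # preprocessing
            for j in range(L):
                key = tuple(h(v) for h in self.g[j])
                self.tables[j].setdefault(key, []).append(i)

    def query(self, q, r2, dist):
        """Scan buckets g_1(q), ..., g_L(q); interrupt after 3L candidates
        (duplicates included); return a point within r2 of q ("YES"),
        or None ("NO")."""
        limit = 3 * len(self.g)
        seen = 0
        for j, table in enumerate(self.tables):
            key = tuple(h(q) for h in self.g[j])
            for i in table.get(key, []):
                v = self.points[i]
                if dist(v, q) <= r2:
                    return v
                seen += 1
                if seen >= limit:
                    return None
        return None

# Demo with the bit-sampling family of [19, 14] for the Hamming space:
d = 16
def bit_sample():
    i = random.randrange(d)         # h(v) = v_i for a random coordinate i
    return lambda v: v[i]

hamming = lambda u, v: sum(a != b for a, b in zip(u, v))
pts = [tuple([0] * d), tuple([1] * d)]
idx = LSHIndex(bit_sample, k=4, L=8, points=pts)
near = idx.query(pts[0], r2=4, dist=hamming)
```

Any base family satisfying Definition 1 can be plugged in for `family`; the p-stable family of Section 3 is one such choice.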
3. LSH Scheme Based on p-Stable Distributions

In this section, we present an LSH family based on p-stable distributions that works for all p ∈ (0, 2]. Since we consider points in l_p^d, without loss of generality we can assume R = 1, which we do from now on.
3.1 p-stable distributions
Stable distributions [31] are defined as limits of normalized sums of independent identically distributed variables (an alternate definition follows). The best-known example of a stable distribution is the Gaussian (or normal) distribution. However, the class is much wider; for example, it includes heavy-tailed distributions.

Definition 2 (p-stable distribution). A distribution D over R is called p-stable, if there exists p ≥ 0 such that for any n real numbers v1, ..., vn and i.i.d. variables X1, ..., Xn with distribution D, the random variable Σ_i vi Xi has the same distribution as the variable (Σ_i |vi|^p)^{1/p} X, where X is a random variable with distribution D.

It is known [31] that stable distributions exist for any p ∈ (0, 2]. In particular:
- a Cauchy distribution D_C, defined by the density function c(x) = (1/π) · 1/(1 + x^2), is 1-stable;
- a Gaussian (normal) distribution D_G, defined by the density function g(x) = (1/√(2π)) · e^{−x^2/2}, is 2-stable.

We note that, from a practical point of view, despite the lack of closed-form density and distribution functions, it is known [8] that one can generate p-stable random variables essentially from two independent variables distributed uniformly over [0, 1].

Stable distributions have found numerous applications in various fields (see the survey [25] for more details). In computer science, stable distributions were used for "sketching" of high-dimensional vectors by Indyk ([18]) and have since found use in various applications. The main property of p-stable distributions mentioned in the definition above directly translates into a sketching technique for high-dimensional vectors. The idea is to generate a random vector a of dimension d whose entries are chosen independently from a p-stable distribution. Given a vector v of dimension d, the dot product a·v is a random variable which is distributed as (Σ_i |vi|^p)^{1/p} X (i.e., |v|_p X), where X is a random variable with a p-stable distribution. A small collection of such dot products (a·v), corresponding to different a's, is termed the sketch of the vector v and can be used to estimate |v|_p (see [18] for details). It is easy to see that such a sketch is linearly composable, i.e., a·(v1 − v2) = a·v1 − a·v2.
3.2 Hash family
In this paper we use p-stable distributions in a slightly different manner. Instead of using the dot products (a·v) to estimate the l_p norm, we use them to assign a hash value to each vector v. Intuitively, the hash function family should be locality sensitive; i.e., if two vectors (v1, v2) are close (small |v1 − v2|_p) then they should collide (hash to the same value) with high probability, and if they are far they should collide with small probability. The dot product a·v projects each vector to the real line; it follows from p-stability that for two vectors (v1, v2) the distance between their projections (a·v1 − a·v2) is distributed as |v1 − v2|_p X, where X is a p-stable distribution. If we "chop" the real line into equi-width segments of appropriate size r and assign hash values to vectors based on which segment they project onto, then it is intuitively clear that this hash function will be locality preserving in the sense described above.

Formally, each hash function h_{a,b} : R^d → Z maps a d-dimensional vector v onto the set of integers. Each hash function in the family is indexed by a choice of random a and b, where a is, as before, a d-dimensional vector with entries chosen independently from a p-stable distribution and b is a real number chosen uniformly from the range [0, r]. For a fixed a, b the hash function h_{a,b} is given by

h_{a,b}(v) = ⌊(a·v + b)/r⌋.

Next, we compute the probability that two vectors v1, v2 collide under a hash function drawn uniformly at random from this family. Let f_p(t) denote the probability density function of the absolute value of the p-stable distribution. We may drop the subscript p whenever it is clear from the context. For the two vectors v1, v2, let c = |v1 − v2|_p. For a random vector a whose entries are drawn from a p-stable distribution, a·v1 − a·v2 is distributed as cX, where X is a random variable drawn from a p-stable distribution. Since b is drawn uniformly from [0, r], it is easy to see that

p(c) = Pr_{a,b}[h_{a,b}(v1) = h_{a,b}(v2)] = ∫_0^r (1/c) f_p(t/c) (1 − t/r) dt.

For a fixed parameter r the probability of collision decreases monotonically with c = |v1 − v2|_p. Thus, as per Definition 1, the family of hash functions above is (r1, r2, p1, p2)-sensitive for p1 = p(1) and p2 = p(c), with r1 = 1 and r2 = c. In what follows we will bound the ratio ρ = ln(1/p1)/ln(1/p2), which as discussed earlier is critical to the performance when this hash family is used to solve the (R, c)-NN problem. Note that we have not specified the parameter r, for it depends on the value of c and p. For every c we would like to choose a finite r that makes ρ as small as possible.
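A small sketch of this hash family for the Gaussian (p = 2) case, together with a numerical evaluation of the collision probability p(c) by quadrature. The function and parameter names are ours, and the choice r = 4 in the usage example below is purely illustrative.

```python
import math
import random

def make_hash(d, r, seed):
    """Draw one h_{a,b} from the family for p = 2: the entries of a are
    i.i.d. standard Gaussian, b is uniform in [0, r], and
    h_{a,b}(v) = floor((a.v + b) / r)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, r)
    return lambda v: math.floor((sum(x * y for x, y in zip(a, v)) + b) / r)

def collision_prob(c, r, steps=20000):
    """p(c) = integral_0^r (1/c) f_2(t/c) (1 - t/r) dt, where f_2 is the
    density of |X| for a standard Gaussian X (midpoint-rule quadrature)."""
    f = lambda t: math.sqrt(2.0 / math.pi) * math.exp(-t * t / 2.0)
    h = r / steps
    return sum((1.0 / c) * f((i + 0.5) * h / c) * (1.0 - (i + 0.5) * h / r) * h
               for i in range(steps))
```

For example, with r = 4.0 one can take p1 = collision_prob(1.0, 4.0) and p2 = collision_prob(c, 4.0) and evaluate ρ = ln(1/p1)/ln(1/p2) numerically; p(c) indeed decreases as c grows, as claimed above.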
4. Computational Analysis of the Ratio ρ

In this section we focus on the cases p = 1, 2. In these cases the ratio ρ can be explicitly evaluated. We compute and plot this ratio and compare it with 1/c. Note that 1/c is the best (smallest) known exponent for n in the space requirement and query time that is achieved in [19] for these cases.
4.1 Computing the ratio for special cases
For the special cases p = 1, 2 we can compute the probabilities p1, p2 using the density functions mentioned before. A simple
