Locality-Sensitive Hashing Scheme Based on p-Stable Distributions
Mayur Datar, Department of Computer Science, Stanford University
Nicole Immorlica, Laboratory for Computer Science, MIT
Piotr Indyk, Laboratory for Computer Science, MIT
Vahab S. Mirrokni, Laboratory for Computer Science, MIT
ABSTRACT
We present a novel Locality-Sensitive Hashing scheme for the Approximate Nearest Neighbor Problem under the l_p norm, based on p-stable distributions. Our scheme improves the running time of the earlier algorithm for the case of the l_2 norm. It also yields the first known provably efficient approximate NN algorithm for the case p < 1. We also show that the algorithm finds the exact near neighbor in O(log n) time for data satisfying certain "bounded growth" condition. Unlike earlier schemes, our LSH scheme works directly on points in the Euclidean space without embeddings. Consequently, the resulting query time bound is free of large factors and is simple and easy to implement. Our experiments (on synthetic data sets) show that our data structure is up to 40 times faster than kd-tree.
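The hashing scheme summarized above can be sketched in a few lines. The following is a minimal illustration for the l_2 case, where the Gaussian distribution is 2-stable; the function name, the interval width w, and the concrete parameter values are illustrative choices, not taken from the paper.

```python
import numpy as np

def pstable_hash_family(dim, w, num_hashes, seed=0):
    """One LSH function built from p-stable projections (here p = 2).

    The Gaussian distribution is 2-stable: for a vector a of i.i.d.
    standard Gaussians, the dot product a.v is distributed as ||v||_2
    times a standard Gaussian. Mapping v to the index of the width-w
    interval containing a.v + b therefore makes points that are close
    in l_2 collide with higher probability than distant ones -- the
    locality-sensitive property.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(num_hashes, dim))    # p-stable (Gaussian) projections
    b = rng.uniform(0.0, w, size=num_hashes)  # random offsets in [0, w)

    def h(v):
        # Concatenating several quantized projections sharpens the gap
        # between the collision probabilities of near and far points.
        return tuple(np.floor((a @ np.asarray(v) + b) / w).astype(int))

    return h

h = pstable_hash_family(dim=16, w=4.0, num_hashes=8)
```

Identical points always hash to the same bucket, while far-apart points almost never agree on the full tuple. In a complete index, several independent functions of this form would feed separate hash tables, and a query would collect candidates from the buckets it hashes to in each table.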
Categories and Subject Descriptors
E.1 [Data Structures]; F.0 [Theory of Computation]
General Terms
Algorithms, Experimentation, Design, Performance, Theory
Keywords
Sublinear Algorithm, Approximate Nearest Neighbor, Locality-Sensitive Hashing, p-Stable Distributions
This material is based upon work supported by the NSF CAREER grant CCR-0133849.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SCG'04, June 9–11, 2004, Brooklyn, New York, USA.
Copyright 2004 ACM 1-58113-885-7/04/0006 ...

1. INTRODUCTION
A similarity search problem involves a collection of objects (documents, images, etc.) that are characterized by a collection of relevant features and represented as points in a high-dimensional attribute space; given queries in the form of points in this space, we are required to find the nearest (most similar) object to the query. A particularly interesting and well-studied instance is d-dimensional Euclidean space. This problem is of major importance to a variety of applications; some examples are: data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis. Typically, the features of the objects of interest (documents, images, etc.) are represented as points in R^d and a distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands.
The low-dimensional case (say, for the dimensionality d equal to 2 or 3) is well-solved, so the main issue is that of dealing with a large number of dimensions, the so-called "curse of dimensionality". Despite decades of intensive effort, the current solutions are not entirely satisfactory; in fact, for large enough d, in theory or in practice, they often provide little improvement over a linear algorithm which compares a query to each point from the database. In particular, it was shown in (both empirically and theoretically) that
current indexing techniques (based on space partitioning) degrade to linear search for sufficiently high dimensions.
In recent years, several researchers proposed to avoid the running time bottleneck by using approximation (e.g., [3, 22, 19, 24, 15], see also ). This is due to the fact that, in many cases, approximate nearest neighbor is almost as good as the exact one; in particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. In fact, in situations when the quality of the approximate nearest neighbor is much worse than the quality of the actual nearest neighbor, then the nearest neighbor problem is
unstable, and it is not clear if solving it is at all meaningful [4, 17].
In [19, 14], the authors introduced an approximate high-dimensional similarity search scheme with provably sublinear dependence on the data size. Instead of using tree-like space partitioning, it relied on a new method called locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. In [19, 14] the authors provided such locality-sensitive hash functions for the case when the points live in the binary Hamming space {0,1}^d. They showed experimentally that the data structure achieves large speedup over several tree-based data structures when