Similarity Estimation Techniques from Rounding Algorithms

Moses S. Charikar
Dept. of Computer Science, Princeton University
35 Olden Street, Princeton, NJ 08544
moses@cs.princeton.edu
ABSTRACT

A locality sensitive hashing scheme is a distribution on a family $F$ of hash functions operating on a collection of objects, such that for two objects $x, y$,

$$\Pr_{h \in F}[h(x) = h(y)] = sim(x, y),$$

where $sim(x, y) \in [0, 1]$ is some similarity function defined on the collection of objects. Such a scheme leads to a compact representation of objects so that similarity of objects can be estimated from their compact sketches, and also leads to efficient algorithms for approximate nearest neighbor search and clustering. Min-wise independent permutations provide an elegant construction of such a locality sensitive hashing scheme for a collection of subsets with the set similarity measure

$$sim(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

We show that rounding algorithms for LPs and SDPs used in the context of approximation algorithms can be viewed as locality sensitive hashing schemes for several interesting collections of objects. Based on this insight, we construct new locality sensitive hashing schemes for:

1. A collection of vectors with the distance between $\vec{u}$ and $\vec{v}$ measured by $\theta(\vec{u}, \vec{v})/\pi$, where $\theta(\vec{u}, \vec{v})$ is the angle between $\vec{u}$ and $\vec{v}$. This yields a sketching scheme for estimating the cosine similarity measure between two vectors, as well as a simple alternative to minwise independent permutations for estimating set similarity.

2. A collection of distributions on $n$ points in a metric space, with distance between distributions measured by the Earth Mover Distance (EMD), a popular distance measure in graphics and vision. Our hash functions map distributions to points in the metric space such that, for distributions $P$ and $Q$,

$$EMD(P, Q) \leq E_h[d(h(P), h(Q))] \leq O(\log n \log\log n) \cdot EMD(P, Q).$$
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
STOC'02, May 19-21, 2002, Montreal, Quebec, Canada.
Copyright 2002 ACM 1-58113-495-9/02/0005 ...$5.00.
1. INTRODUCTION
The current information explosion has resulted in an increasing number of applications that need to deal with large volumes of data. While traditional algorithm analysis assumes that the data fits in main memory, it is unreasonable to make such assumptions when dealing with massive data sets such as data from phone calls collected by phone companies, multimedia data, web page repositories and so on. This new setting has resulted in an increased interest in algorithms that process the input data in restricted ways, including sampling a few data points, making only a few passes over the data, and constructing a succinct sketch of the input which can then be efficiently processed.

There has been a lot of recent work on streaming algorithms, i.e. algorithms that produce an output by making one pass (or a few passes) over the data while using a limited amount of storage space and time. To cite a few examples, Alon et al. [2] considered the problem of estimating frequency moments and Guha et al. [25] considered the problem of clustering points in a streaming fashion. Many of these streaming algorithms need to represent important aspects of the data they have seen so far in a small amount of space; in other words they maintain a compact sketch of the data that encapsulates the relevant properties of the data set. Indeed, some of these techniques lead to sketching algorithms: algorithms that produce a compact sketch of a data set so that various measurements on the original data set can be estimated by efficient computations on the compact sketches. Building on the ideas of [2], Alon et al. [1] give algorithms for estimating join sizes. Gibbons and Matias [18] give sketching algorithms producing so-called synopsis data structures for various problems including maintaining approximate histograms, hot lists and so on. Gilbert et al. [19] give algorithms to compute sketches for data streams so as to estimate any linear projection of the data and use this to get individual point and range estimates. Recently, Gilbert et al. [21] gave efficient algorithms for the dynamic maintenance of histograms. Their algorithm processes a stream of updates and maintains a small sketch of the data from which the optimal histogram representation can be approximated very quickly.

In this work, we focus on sketching algorithms for estimating similarity, i.e. the construction of functions that produce succinct sketches of objects in a collection, such that the similarity of objects can be estimated efficiently from their sketches. Here, similarity $sim(x, y)$ is a function that maps pairs of objects $x, y$ to a number in $[0, 1]$, measuring the degree of similarity between $x$ and $y$. $sim(x, y) = 1$ corresponds to objects $x, y$ that are identical while $sim(x, y) = 0$ corresponds to objects that are very different.

Broder et al. [8, 5, 7, 6] introduced the notion of min-wise independent permutations, a technique for constructing such sketching functions for a collection of sets. The similarity measure considered there was

$$sim(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$

We note that this is exactly the Jaccard coefficient of similarity used in information retrieval.

The min-wise independent permutation scheme allows the construction of a distribution on hash functions $h : 2^U \rightarrow U$ such that

$$\Pr_h[h(A) = h(B)] = sim(A, B).$$

Here $F$ denotes the family of hash functions (with an associated probability distribution) operating on subsets of the universe $U$. By choosing, say, $t$ hash functions $h_1, \ldots, h_t$ from this family, a set $S$ could be represented by the hash vector $(h_1(S), \ldots, h_t(S))$. Now, the similarity between two sets can be estimated by counting the number of matching coordinates in their corresponding hash vectors.(1)

Footnote (1): One question left open in [7] was the issue of compact representation of hash functions in this family; this was settled by Indyk [28], who gave a construction of a small family of minwise independent permutations.
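For concreteness, here is a minimal Python sketch of this estimation procedure. It simulates the $t$ random permutations with salted integer hashes rather than a truly (approximately) min-wise independent family, which is an assumption made for brevity; the guarantees of [8, 5, 7, 6] require the latter.

```python
import random

def minhash_sketch(s, t, seed=0):
    # One salted hash per coordinate simulates t independent random
    # permutations of the universe; the sketch coordinate is the
    # minimum hash value attained over the set's elements.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(t)]
    return [min(hash((salt, x)) for x in s) for salt in salts]

def estimate_similarity(sketch_a, sketch_b):
    # The fraction of matching coordinates estimates |A ∩ B| / |A ∪ B|.
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)

A = {"the", "quick", "brown", "fox"}
B = {"the", "quick", "brown", "dog"}
print(estimate_similarity(minhash_sketch(A, 512), minhash_sketch(B, 512)))
# true Jaccard coefficient is 3/5; the estimate concentrates around it
# as the sketch length t grows
```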
The work of Broder et al. was originally motivated by the application of eliminating near-duplicate documents in the Altavista index. Representing documents as sets of features with similarity between sets determined as above, the hashing technique provided a simple method for estimating similarity of documents, thus allowing the original documents to be discarded and reducing the input size significantly.

In fact, the minwise independent permutations hashing scheme is a particular instance of a locality sensitive hashing scheme introduced by Indyk and Motwani [31] in their work on nearest neighbor search in high dimensions.

Definition 1. A locality sensitive hashing scheme is a distribution on a family $F$ of hash functions operating on a collection of objects, such that for two objects $x, y$,

$$\Pr_h[h(x) = h(y)] = sim(x, y) \qquad (1)$$

Here $sim(x, y)$ is some similarity function defined on the collection of objects.

Given a hash function family $F$ that satisfies (1), we will say that $F$ is a locality sensitive hash function family corresponding to similarity function $sim(x, y)$. Indyk and Motwani showed that such a hashing scheme facilitates the construction of efficient data structures for answering approximate nearest-neighbor queries on the collection of objects. In particular, using the hashing scheme given by minwise independent permutations results in efficient data structures for set similarity queries and leads to efficient clustering algorithms. This was exploited later in several experimental papers: Cohen et al. [14] for association-rule mining, Haveliwala et al. [27] for clustering web documents, Chen et al. [13] for selectivity estimation of boolean queries, Chen et al. [12] for twig queries, and Gionis et al. [22] for indexing set value attributes. All of this work used the hashing technique for set similarity together with ideas from [31].

We note that the definition of locality sensitive hashing used by [31] is slightly different, although in the same spirit as our definition. Their definition involves parameters $r_1 > r_2$ and $p_1 > p_2$. A family $F$ is said to be $(r_1, r_2, p_1, p_2)$-sensitive for a similarity measure $sim(x, y)$ if $\Pr_h[h(x) = h(y)] \geq p_1$ when $sim(x, y) \geq r_1$, and $\Pr_h[h(x) = h(y)] \leq p_2$ when $sim(x, y) \leq r_2$. Despite the difference in the precise definition, we chose to retain the name locality sensitive hashing in this work since the two notions are essentially the same. Hash functions with closely related properties were investigated earlier by Linial and Sasson [34] and Indyk et al. [32].
1.1 Our Results
In this paper, we explore constructions of locality sensitive hash functions for various other interesting similarity functions. The utility of such hash function schemes (for nearest neighbor queries and clustering) crucially depends on the fact that the similarity estimation is based on a test of equality of the hash function values. We make an interesting connection between constructions of similarity preserving hash functions and rounding procedures used in the design of approximation algorithms. We show that procedures used for rounding fractional solutions from linear programs and vector solutions to semidefinite programs can be used to derive similarity preserving hash functions for interesting classes of similarity functions.

In Section 2, we prove some necessary conditions on similarity measures $sim(x, y)$ for the existence of locality sensitive hash functions satisfying (1). Using this, we show that such locality sensitive hash functions do not exist for certain commonly used similarity measures in information retrieval, the Dice coefficient and the Overlap coefficient.

In seminal work, Goemans and Williamson [24] introduced semidefinite programming relaxations as a tool for approximation algorithms. They used the random hyperplane rounding technique to round vector solutions for the MAX-CUT problem. We will see in Section 3 that the random hyperplane technique naturally gives a family of hash functions $F$ for vectors such that

$$\Pr_h[h(\vec{u}) = h(\vec{v})] = 1 - \frac{\theta(\vec{u}, \vec{v})}{\pi}.$$

Here $\theta(\vec{u}, \vec{v})$ refers to the angle between vectors $\vec{u}$ and $\vec{v}$. Note that the function $1 - \theta/\pi$ is closely related to the function $\cos(\theta)$. (In fact it is always within a factor 0.878 from it. Moreover, $\cos(\theta)$ can be estimated from an estimate of $\theta$.) Thus this similarity function is very closely related to the cosine similarity measure, commonly used in information retrieval. (In fact, Indyk and Motwani [31] describe how the set similarity measure can be adapted to measure dot product between binary vectors in $d$-dimensional Hamming space. Their approach breaks up the data set into $O(\log d)$ groups, each consisting of approximately the same weight. Our approach, based on estimating the angle between vectors, is more direct and is also more general since it applies to general vectors.) We also note that the cosine between vectors can be estimated from known techniques based on random projections [2, 1, 20]. However, the advantage of a locality sensitive hashing based scheme is that this directly yields techniques for nearest neighbor search for the cosine similarity measure.
An attractive feature of the hash functions obtained from the random hyperplane method is that the output is a single bit; thus the output of $t$ hash functions can be concatenated very easily to produce a $t$-bit vector.(2) Estimating similarity between vectors amounts to measuring the Hamming distance between the corresponding $t$-bit hash vectors. We can represent sets by their characteristic vectors and use this locality sensitive hashing scheme for measuring similarity between sets. This yields a slightly different similarity measure for sets, one that is linearly proportional to the angle between their characteristic vectors.

Footnote (2): In Section 2, we will show that we can convert any locality sensitive hashing scheme to one that maps objects to {0, 1} with a slight change in similarity measure. However, the modified hash functions convey less information, e.g. the collision probability for the modified hash function family is at least 1/2 even for a pair of objects with original similarity 0.
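A minimal Python sketch of the resulting $t$-bit scheme follows (the helper names are ours, and Gaussian normals again stand in for uniformly random hyperplanes). The fraction of mismatching bits estimates $\theta/\pi$, from which the angle, and hence $\cos(\theta)$, can be recovered:

```python
import math
import random

def simhash_sketch(u, planes):
    # Concatenate t single-bit hyperplane hashes into a t-bit sketch.
    return [sum(ri * ui for ri, ui in zip(r, u)) >= 0 for r in planes]

def estimate_angle(sketch_u, sketch_v):
    # The normalized Hamming distance between sketches estimates
    # theta(u, v)/pi, so multiplying by pi recovers the angle.
    mismatches = sum(a != b for a, b in zip(sketch_u, sketch_v))
    return math.pi * mismatches / len(sketch_u)

random.seed(2)
d, t = 5, 4096
planes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(t)]
u, v = [1, 0, 1, 0, 1], [1, 1, 1, 0, 0]
est = estimate_angle(simhash_sketch(u, planes), simhash_sketch(v, planes))
dot = sum(a * b for a, b in zip(u, v))
true = math.acos(dot / (math.hypot(*u) * math.hypot(*v)))
print(est, true)   # the estimate concentrates around the true angle
```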
In Section 4, we present a locality sensitive hashing scheme for a certain metric on distributions on points, called the Earth Mover Distance. We are given a set of points $L = \{l_1, \ldots, l_n\}$, with a distance function $d(i, j)$ defined on them. A probability distribution $P(X)$ (or distribution for short) is a set of weights $p_1, \ldots, p_n$ on the points such that $p_i \geq 0$ and $\sum p_i = 1$. (We will often refer to distribution $P(X)$ as simply $P$, implicitly referring to an underlying set $X$ of points.) The Earth Mover Distance $EMD(P, Q)$ between two distributions $P$ and $Q$ is defined to be the cost of the min cost matching that transforms one distribution to the other. (Imagine each distribution as placing a certain amount of earth on each point. $EMD(P, Q)$ measures the minimum amount of work that must be done in transforming one distribution to the other.) This is a popular metric for images and is used for image similarity, navigating image databases and so on [37, 38, 39, 40, 36, 15, 16, 41, 42]. The idea is to represent an image as a distribution on features with an underlying distance metric on features (e.g. colors in a color spectrum). Since the earth mover distance is expensive to compute (requiring a solution to a minimum transportation problem), applications typically use an approximation of the earth mover distance (e.g. representing distributions by their centroids).

We construct a hash function family for estimating the earth mover distance. Our family is based on rounding algorithms for LP relaxations for the problem of classification with pairwise relationships studied by Kleinberg and Tardos [33], and further studied by Calinescu et al. [10] and Chekuri et al. [11]. Combining a new LP formulation described by Chekuri et al. together with a rounding technique of Kleinberg and Tardos, we show a construction of a hash function family which approximates the earth mover distance to a factor of $O(\log n \log\log n)$. Each hash function in this family maps a distribution on points $L = \{l_1, \ldots, l_n\}$ to some point $l_i$ in the set. For two distributions $P(X)$ and $Q(X)$ on the set of points, our family of hash functions satisfies the property that:

$$EMD(P, Q) \leq E_h[d(h(P), h(Q))] \leq O(\log n \log\log n) \cdot EMD(P, Q).$$
We also show an interesting fact about a rounding algorithm in Kleinberg and Tardos [33] applying to the case where the underlying metric on points is a uniform metric. In this case, we show that their rounding algorithm can be viewed as a generalization of min-wise independent permutations extended to a continuous setting. Their rounding procedure yields a locality sensitive hash function for vectors whose coordinates are all non-negative. Given two vectors $\vec{a} = (a_1, \ldots, a_n)$ and $\vec{b} = (b_1, \ldots, b_n)$, the similarity function is

$$sim(\vec{a}, \vec{b}) = \frac{\sum_i \min(a_i, b_i)}{\sum_i \max(a_i, b_i)}.$$

(Note that when $\vec{a}$ and $\vec{b}$ are the characteristic vectors for sets $A$ and $B$, this expression reduces to the set similarity measure for min-wise independent permutations.)
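One way to realize such a hash function in code is the following rejection-sampling sketch. This is our reading of the continuous min-hash idea, not the paper's exact procedure, and it assumes coordinates bounded by a known max_coord. All vectors replay the same seeded pseudo-random sequence of (index, threshold) pairs and hash to the first pair whose threshold falls under the corresponding coordinate; in the idealized model with truly random draws, two vectors collide with probability $\sum_i \min(a_i, b_i) / \sum_i \max(a_i, b_i)$:

```python
import random

def threshold_hash(vec, max_coord, seed, max_draws=1_000_000):
    # The seed makes the (index, threshold) sequence identical across
    # vectors, so floats drawn are bitwise identical and equality of
    # the returned pairs is a well-defined collision event.
    rng = random.Random(seed)
    n = len(vec)
    for _ in range(max_draws):
        i = rng.randrange(n)
        t = rng.uniform(0.0, max_coord)
        if t <= vec[i]:
            return (i, t)   # first accepted pair is the hash value
    raise RuntimeError("no draw accepted; increase max_draws")

a = [0.5, 0.0, 2.0, 1.0]
b = [1.0, 0.5, 2.0, 0.0]
trials = 2000
matches = sum(threshold_hash(a, 2.0, s) == threshold_hash(b, 2.0, s)
              for s in range(trials))
exact = sum(map(min, a, b)) / sum(map(max, a, b))
print(matches / trials, exact)   # both near 2.5 / 4.5 ≈ 0.556
```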
Applications of locality sensitive hash functions to solving nearest neighbor queries typically reduce the problem to the Hamming space. Indyk and Motwani [31] give a data structure that solves the approximate nearest neighbor problem on the Hamming space. Their construction is a reduction to the so called PLEB (Point Location in Equal Balls) problem, followed by a hashing technique concatenating the values of several locality sensitive hash functions. We give a simple technique that achieves the same performance as the Indyk Motwani result in Section 5. The basic idea is as follows: Given bit vectors consisting of $d$ bits each, we choose a number of random permutations of the bits. For each random permutation $\sigma$, we maintain a sorted order of the bit vectors, in lexicographic order of the bits permuted by $\sigma$. To find a nearest neighbor for a query bit vector $q$ we do the following: For each permutation $\sigma$, we perform a binary search on the sorted order corresponding to $\sigma$ to locate the bit vectors closest to $q$ (in the lexicographic order obtained by bits permuted by $\sigma$). Further, we search in each of the sorted orders proceeding upwards and downwards from the location of $q$, according to a certain rule. Of all the bit vectors examined, we return the one that has the smallest Hamming distance to the query vector. The performance bounds we can prove for this simple scheme are identical to that proved by Indyk and Motwani for their scheme.
2. EXISTENCE OF LOCALITY SENSITIVE HASH FUNCTIONS
In this section, we discuss certain necessary properties for the existence of locality sensitive hash function families for given similarity measures.

Lemma 1. For any similarity function $sim(x, y)$ that admits a locality sensitive hash function family as defined in (1), the distance function $1 - sim(x, y)$ satisfies the triangle inequality.
Proof. Suppose there exists a locality sensitive hash function family such that

$$\Pr_h[h(x) = h(y)] = sim(x, y).$$

Then,

$$1 - sim(x, y) = \Pr_h[h(x) \neq h(y)].$$

Let $\Delta_h(x, y)$ be an indicator variable for the event $h(x) \neq h(y)$. We claim that $\Delta_h(x, y)$ satisfies the triangle inequality, i.e.

$$\Delta_h(x, y) + \Delta_h(y, z) \geq \Delta_h(x, z).$$
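The preview ends here, mid-proof. For completeness, the remaining step is a short case check; the sketch below is our reconstruction of the standard argument, not the paper's verbatim continuation:

```latex
% Delta_h takes values in {0, 1}. If Delta_h(x,z) = 1 then
% h(x) != h(z), so h(y) must differ from at least one of h(x), h(z),
% giving Delta_h(x,y) + Delta_h(y,z) >= 1 = Delta_h(x,z); the case
% Delta_h(x,z) = 0 is immediate. Taking expectations over h:
\[
  1 - \mathrm{sim}(x,z) = \mathbf{E}_h[\Delta_h(x,z)]
    \le \mathbf{E}_h[\Delta_h(x,y)] + \mathbf{E}_h[\Delta_h(y,z)]
    = \bigl(1 - \mathrm{sim}(x,y)\bigr) + \bigl(1 - \mathrm{sim}(y,z)\bigr),
\]
% i.e., 1 - sim(x,y) satisfies the triangle inequality.
```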