Proceedings of the 43rd Annual Meeting of the ACL, pages 622–629, Ann Arbor, June 2005. © 2005 Association for Computational Linguistics
Randomized Algorithms and NLP: Using Locality Sensitive Hash Function for High Speed Noun Clustering
Deepak Ravichandran, Patrick Pantel, and Eduard Hovy
Information Sciences Institute, University of Southern California, 4676 Admiralty Way, Marina del Rey, CA 90292
{ravichan, pantel, hovy}@ISI.EDU
Abstract
In this paper, we explore the power of randomized algorithms to address the challenge of working with very large amounts of data. We apply these algorithms to generate noun similarity lists from 70 million pages. We reduce the running time from quadratic to practically linear in the number of elements to be computed.
1 Introduction
In the last decade, the field of Natural Language Processing (NLP) has seen a surge in the use of corpus-motivated techniques. Several NLP systems are modeled based on empirical data and have had varying degrees of success. Of late, however, corpus-based techniques seem to have reached a plateau in performance. Three possible areas of future research for overcoming this plateau include:
1. Working with large amounts of data (Banko and Brill, 2001)

2. Improving semi-supervised and unsupervised algorithms.

3. Using more sophisticated feature functions.

The above listing may not be exhaustive, but it is probably not a bad bet to work in one of the above directions. In this paper, we investigate the first two avenues. Handling terabytes of data requires more efficient algorithms than are currently used in NLP. We propose a web-scalable solution to clustering nouns, which employs randomized algorithms. In doing so, we explore the literature and techniques of randomized algorithms. All clustering algorithms make use of some distance similarity (e.g., cosine similarity) to measure pairwise distance between sets of vectors. Assume that we are given $n$ points to cluster with a maximum of $k$ features. Calculating the full similarity matrix would take time complexity $n^2k$. With large amounts of data, say $n$ in the order of millions or even billions, an $n^2k$ algorithm would be infeasible. To be scalable, we ideally want our algorithm to be proportional to $nk$.

Fortunately, we can borrow some ideas from the Math and Theoretical Computer Science community to tackle this problem. The crux of our solution lies in defining Locality Sensitive Hash (LSH) functions. LSH functions involve the creation of short signatures (fingerprints) for each vector in space such that those vectors that are closer to each other are more likely to have similar fingerprints. LSH functions are generally based on randomized algorithms and are probabilistic. We present LSH algorithms that can help reduce the time complexity of calculating our distance similarity matrix to $nk$.

Rabin (1981) proposed the use of hash functions from random irreducible polynomials to create short fingerprint representations for very large strings. These hash functions had the nice property that two identical strings had the same fingerprint, while dissimilar strings had different fingerprints with a very small probability of collision. Broder (1997) first introduced LSH. He proposed the use of Min-wise independent functions to create fingerprints that preserved the Jaccard similarity between every pair of vectors. These techniques are used today, for example, to eliminate duplicate web pages. Charikar (2002) proposed the use of random hyperplanes to generate an LSH function that preserves the cosine similarity between every pair of vectors. Interestingly, cosine similarity is widely used in NLP for various applications such as clustering.

In this paper, we perform high-speed similarity list creation for nouns collected from a huge web corpus. We linearize this step by using the LSH proposed by Charikar (2002). This reduction in complexity of similarity computation makes it possible to address vastly larger datasets, at the cost, as shown in Section 5, of only a little reduction in accuracy. In our experiments, we generate a similarity list for each noun extracted from a 70 million page web corpus. Although the NLP community has begun experimenting with the web, we know of no work in the published literature that has applied complex language analysis beyond IR and simple surface-level pattern matching.
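For reference, here is a minimal sketch (in Python with NumPy; not from the paper) of the quadratic baseline discussed above: computing the full pairwise cosine similarity matrix for $n$ vectors with $k$ features, i.e., the $O(n^2k)$ computation that the proposed approach is designed to avoid. The function name is illustrative only.

```python
import numpy as np

def full_cosine_similarity_matrix(X):
    """Brute-force O(n^2 k) baseline: cosine similarity between every pair of rows of X.

    X is an (n, k) matrix holding n feature vectors; the result is the full
    (n, n) matrix of cos(theta(u, v)) values, which becomes impractical to
    compute directly when n reaches millions of elements.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)   # (n, 1) vector lengths |u|
    unit_rows = X / np.clip(norms, 1e-12, None)        # guard against zero-length rows
    return unit_rows @ unit_rows.T                     # entry (i, j) = cos(theta(x_i, x_j))
```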
2 Theory
The core theory behind the implementation of fast cosine similarity calculation can be divided into two parts:

1. Developing LSH functions to create signatures;

2. Using a fast search algorithm to find nearest neighbors.

We describe these two components in greater detail in the next subsections.
2.1 LSH Function Preserving Cosine Similarity
We first begin with the formal definition of cosine similarity.

Definition: Let $u$ and $v$ be two vectors in a $k$-dimensional hyperplane. Cosine similarity is defined as the cosine of the angle between them, $\cos(\theta(u,v))$. We can calculate $\cos(\theta(u,v))$ by the following formula:

$$\cos(\theta(u,v)) = \frac{u \cdot v}{|u|\,|v|} \qquad (1)$$

Here $\theta(u,v)$ is the angle between the vectors $u$ and $v$ measured in radians, $u \cdot v$ is the scalar (dot) product of $u$ and $v$, and $|u|$ and $|v|$ represent the lengths of vectors $u$ and $v$ respectively.

The LSH function for cosine similarity, as proposed by Charikar (2002), is given by the following theorem:
Theorem: Suppose we are given a collection of vectors in a $k$-dimensional vector space (written as $R^k$). Choose a family of hash functions as follows: generate a spherically symmetric random vector $r$ of unit length from this $k$-dimensional space. We define a hash function, $h_r$, as:

$$h_r(u) = \begin{cases} 1 & : r \cdot u \geq 0 \\ 0 & : r \cdot u < 0 \end{cases} \qquad (2)$$

Then for vectors $u$ and $v$,

$$\Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u,v)}{\pi} \qquad (3)$$

Proof of the above theorem is given by Goemans and Williamson (1995). We rewrite the proof here for clarity. The above theorem states that the probability that a random hyperplane separates two vectors is directly proportional to the angle between the two vectors (i.e., $\theta(u,v)$). By symmetry, we have $\Pr[h_r(u) \neq h_r(v)] = 2\Pr[u \cdot r \geq 0, v \cdot r < 0]$. This corresponds to the intersection of two half spaces, the dihedral angle between which is $\theta$. Thus, we have $\Pr[u \cdot r \geq 0, v \cdot r < 0] = \theta(u,v)/2\pi$. Proceeding, we have $\Pr[h_r(u) \neq h_r(v)] = \theta(u,v)/\pi$ and $\Pr[h_r(u) = h_r(v)] = 1 - \theta(u,v)/\pi$. This completes the proof.

Hence, from equation 3 we have,
$$\cos(\theta(u,v)) = \cos\left(\left(1 - \Pr[h_r(u) = h_r(v)]\right)\pi\right) \qquad (4)$$

This equation gives us an alternate method for finding cosine similarity. Note that the above equation is probabilistic in nature. Hence, we generate a large number ($d$) of random vectors to achieve this. Having calculated $h_r(u)$ with $d$ random vectors for each of the vectors $u$, we apply equation 4 to find the cosine distance between two vectors. As we generate more random vectors, we can estimate the cosine similarity between two vectors more accurately. However, in practice, the number ($d$) of random vectors required is highly domain dependent, i.e., it depends on the total number of vectors ($n$), the number of features ($k$), and the way the vectors are distributed. Using $d$ random vectors, we can represent each vector by a bit stream of length $d$.

Looking carefully at equation 4, we can observe that $\Pr[h_r(u) = h_r(v)] = 1 - (\text{hamming distance})/d$.¹ Thus, the above theorem converts the problem of finding the cosine distance between two vectors into the problem of finding the hamming distance between their bit streams (as given by equation 4). Finding the hamming distance between two bit streams is faster and highly memory efficient. Also worth noting is that this step could be considered as dimensionality reduction, wherein we reduce a vector in $k$ dimensions to $d$ bits while still preserving the cosine distance between them.

¹ Hamming distance is the number of bits which differ between two binary strings.
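To make equations 2 and 4 concrete, the following is a minimal sketch (in Python with NumPy; not the authors' implementation) of generating $d$-bit random-hyperplane signatures and estimating the cosine similarity of two vectors from the Hamming distance between their signatures. Function and variable names are illustrative.

```python
import numpy as np

def random_hyperplane_signatures(X, d, seed=None):
    """d-bit LSH signatures for the rows of X, following equation 2.

    Each bit records the side of a random hyperplane through the origin on
    which the vector falls: h_r(u) = 1 if r.u >= 0, else 0. The hyperplane
    normals r are spherically symmetric Gaussian vectors, normalized to unit
    length (the sign of r.u does not depend on the length of r).
    """
    rng = np.random.default_rng(seed)
    n, k = X.shape
    R = rng.standard_normal((k, d))            # d random direction vectors in k dimensions
    R /= np.linalg.norm(R, axis=0)             # make each column a unit vector
    return (X @ R) >= 0                        # boolean (n, d) signature matrix

def cosine_from_signatures(sig_u, sig_v):
    """Estimate cos(theta(u, v)) from two d-bit signatures, following equation 4.

    Pr[h_r(u) = h_r(v)] is estimated as 1 - hamming_distance / d, so
    cos(theta(u, v)) is approximately cos((hamming_distance / d) * pi).
    """
    d = len(sig_u)
    hamming = int(np.count_nonzero(sig_u != sig_v))
    return float(np.cos(np.pi * hamming / d))
```

With $d$ treated as a constant, computing all $n$ signatures this way takes $O(nk)$ time, which is the cost quoted for steps 2 and 3 in Section 3.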
2.2 Fast Search Algorithm
To calculate the fast hamming distance, we use the search algorithm PLEB (Point Location in Equal Balls), first proposed by Indyk and Motwani (1998). This algorithm was further improved by Charikar (2002). It involves random permutations of the bit streams and their sorting to find the vector with the closest hamming distance. The algorithm given in Charikar (2002) finds the nearest neighbor for a given vector; we modify it so that we are able to find the top $B$ closest neighbors for each vector. We omit the math of this algorithm but sketch its procedural details in the next section. Interested readers are further encouraged to read Theorem 2 from Charikar (2002) and Section 3 from Indyk and Motwani (1998).
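As an illustration of this idea, here is a minimal sketch (in Python; not the authors' implementation) of how sorted lists of permuted bit streams can be used: each vector is compared, by Hamming distance, only against entries that land within a small window of it in each sorted list, rather than against all $n$ vectors. The fixed window size and all names here are assumptions made for the sketch; the procedure actually used is sketched procedurally in Section 3.

```python
from typing import Dict, List, Tuple

def hamming(a: str, b: str) -> int:
    """Number of differing bits between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def candidate_pairs(sorted_lists: List[List[Tuple[int, str]]], window: int) -> Dict[Tuple[int, int], int]:
    """Map (vector index, vector index) -> Hamming distance for pairs that fall
    within `window` positions of each other in any of the q sorted, permuted lists.

    Only these local comparisons are made, instead of all n^2 pairs; taking the
    union over several random permutations makes it likely that genuinely close
    vectors become adjacent in at least one list. Hamming distance is invariant
    under bit permutation, so it can be computed on the permuted bit streams.
    """
    distances: Dict[Tuple[int, int], int] = {}
    for sorted_list in sorted_lists:
        for pos, (i, sig_i) in enumerate(sorted_list):
            for j, sig_j in sorted_list[pos + 1 : pos + 1 + window]:
                key = (min(i, j), max(i, j))
                if key not in distances:
                    distances[key] = hamming(sig_i, sig_j)
    return distances
```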
3 Algorithmic Implementation
In the previous section, we introduced the theory for the calculation of fast cosine similarity. We implement it as follows:

1. Initially we are given $n$ vectors in a huge $k$-dimensional space. Our goal is to find all pairs of vectors whose cosine similarity is greater than a particular threshold.

2. Choose $d$ ($d \ll k$) unit random vectors $\{r_0, r_1, \ldots, r_d\}$, each of $k$ dimensions. A $k$-dimensional unit random vector, in general, is generated by independently sampling a Gaussian function with mean $0$ and variance $1$, $k$ times. Each of the $k$ samples is used to assign one dimension to the random vector. We generate a random number from a Gaussian distribution by using the Box-Muller transformation (Box and Muller, 1958); see the sampling sketch after this list.

3. For every vector $u$, we determine its signature by using the function $h_r(u)$ (as given by equation 2). We can represent the signature of vector $u$ as $\bar{u} = \{h_{r_1}(u), h_{r_2}(u), \ldots, h_{r_d}(u)\}$. Each vector is thus represented by a bit stream of length $d$. Steps 2 and 3 take $O(nk)$ time (we can assume $d$ to be a constant since $d \ll k$).

4. The previous step gives $n$ vectors, each of them represented by $d$ bits. For the calculation of fast hamming distance, we take the original bit index of all vectors and randomly permute them (see Appendix A for more details on random permutation functions). A random permutation can be considered as a random jumbling of the bits of each vector.² A random permutation function can be approximated by the following function:

$$\pi(x) = (ax + b) \bmod p \qquad (5)$$

where $p$ is prime and $0 < a < p$, $0 \leq b < p$, and $a$ and $b$ are chosen at random.

We apply $q$ different random permutations for every vector (by choosing random values for $a$ and $b$, $q$ times). Thus for every vector we have $q$ different bit permutations of the original bit stream; a sketch of this permutation and the sorting in step 5 follows the list.

5. For each permutation function $\pi$, we lexicographically sort the list of $n$ vectors (whose bit streams are permuted by the function $\pi$) to obtain a sorted list. This step takes $O(n \log n)$ time. (We can assume $q$ to be a constant.)

6. For each sorted list (obtained after applying the random permutation function $\pi$), we calculate the hamming distance of every vector with

² The jumbling is performed by a mapping of the bit index as directed by the random permutation function. For a given permutation, we reorder the bit indexes of all vectors in similar fashion. This process could be considered as a column reordering of bit vectors.
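Below is a minimal sketch (in Python; not the authors' implementation) of the sampling in step 2: each coordinate of a $k$-dimensional random vector is drawn from a standard Gaussian using the Box-Muller transformation, and the vector is then normalized to unit length. Function names are illustrative.

```python
import math
import random

def box_muller_pair():
    """Two independent standard Gaussian samples (mean 0, variance 1) via the Box-Muller transformation."""
    u1 = max(random.random(), 1e-12)           # avoid log(0)
    u2 = random.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    return radius * math.cos(2.0 * math.pi * u2), radius * math.sin(2.0 * math.pi * u2)

def random_unit_vector(k):
    """A spherically symmetric k-dimensional unit random vector, as in step 2.

    Each of the k coordinates is an independent Gaussian sample; the vector is
    then scaled to unit length.
    """
    samples = []
    while len(samples) < k:
        samples.extend(box_muller_pair())      # Gaussian samples arrive in pairs
    coords = samples[:k]
    norm = math.sqrt(sum(x * x for x in coords))
    return [x / norm for x in coords]
```

A library Gaussian generator would serve equally well; the explicit Box-Muller form is shown only because it is the method the paper cites.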
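The permutation-and-sort sketch referenced in steps 4 and 5 follows (in Python; not the authors' implementation): a random permutation of bit positions is approximated by $\pi(x) = (ax + b) \bmod p$, applied as a column reordering of every signature, and the permuted bit streams are then sorted lexicographically. Signatures are represented here as '0'/'1' strings for readability, and $p$ is assumed to be a prime no smaller than $d$ so that the permuted indices are distinct; these representation choices are assumptions of the sketch, not details from the paper.

```python
import random

def random_bit_permutation(d, p):
    """Approximate a random permutation of bit indices 0..d-1 using pi(x) = (ax + b) mod p (equation 5).

    p is a prime with p >= d; a and b are chosen at random with 0 < a < p and
    0 <= b < p. Ranking the indices by pi(x) yields an ordering of the d bit positions.
    """
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return sorted(range(d), key=lambda x: (a * x + b) % p)

def permute_bits(signature, order):
    """Reorder the bits of one signature (a string of '0'/'1') according to the permuted index order."""
    return "".join(signature[i] for i in order)

def sorted_permuted_list(signatures, d, p):
    """Step 5: apply one random permutation to every signature and sort lexicographically.

    Returns (vector index, permuted bit stream) pairs; vectors sharing long
    prefixes (and hence likely having small Hamming distance) end up adjacent.
    """
    order = random_bit_permutation(d, p)
    permuted = [(i, permute_bits(sig, order)) for i, sig in enumerate(signatures)]
    return sorted(permuted, key=lambda pair: pair[1])
```

Sorting each of the $q$ permuted lists costs $O(n \log n)$, matching the analysis given in step 5.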