
Efficient Distributed Locality Sensitive Hashing

Bahman Bahmani

Ashish Goel

Rajendra Shinde

Stanford University
Stanford, CA
bahman@stanford.edu ∗

Stanford University
Stanford, CA
ashishg@stanford.edu †

Stanford University
Stanford, CA
rbs@stanford.edu ‡

ABSTRACT


Distributed frameworks are gaining increasingly widespread use in applications that process large amounts of data. One important example application is large scale similarity search, for which Locality Sensitive Hashing (LSH) has emerged as the method of choice, especially when the data is high-dimensional. To guarantee high search quality, the LSH scheme needs a rather large number of hash tables. This entails a large space requirement, and, in the distributed setting, with each query requiring a network call per hash bucket lookup, also a big network load. Panigrahy's Entropy LSH scheme significantly reduces the space requirement but does not help with (and in fact worsens) the search network efficiency. In this paper, focusing on the Euclidean space under the l2 norm and building on Entropy LSH, we propose the distributed Layered LSH scheme, and prove that it exponentially decreases the network cost while maintaining a good load balance between different machines. Our experiments also verify our theoretical results.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Indexing Methods

Keywords
Similarity Search, Locality Sensitive Hashing, Distributed
Systems, MapReduce

1. INTRODUCTION

Similarity search is the problem of retrieving data objects similar to a query object. It has become an important component of modern data-mining systems, with wide ranging applications [14, 17, 7, 5, 3]. In these applications, objects are usually represented by a high dimensional feature vector. A scheme to solve the similarity search problem constructs an index which, given a query point, allows for quickly finding the data points similar to it. The index construction also needs to be time and space efficient. Furthermore, since today's massive datasets are typically stored and processed in a distributed fashion, where network communication is one of the most important bottlenecks, these methods need to be network efficient.

An important family of similarity search methods is based on the notion of Locality Sensitive Hashing (LSH) [12]. LSH scales well with the data dimension [12, 15]. However, to guarantee a good search quality, it needs a large number of hash tables. This entails a rather large index space requirement and also, in the distributed setting, a large network load, as each hash bucket lookup requires a network call. By looking up a number of query offsets in addition to the query itself, Panigrahy's Entropy LSH [18] significantly reduces the number of required hash tables, and hence the LSH space complexity. But it does not help with network efficiency, as now each query offset lookup requires a network call. In fact, since the number of required offsets in Entropy LSH is larger than the number of required hash tables in conventional LSH, Entropy LSH amplifies the network inefficiency issue.

In this paper, focusing on the Euclidean space, we design the Layered LSH method which, compared to a straightforward distributed implementation of LSH or Entropy LSH, results in an exponential improvement in the network load, while maintaining a good load balance across machines. Our experiments also verify our theoretical results.

1.1 Background

In this paper, we focus on the network efficiency of LSH
in distributed frameworks. Two main instantiations of such
frameworks are the batched processing system MapReduce
[9] (with its open source implementation Hadoop [1]), and
real-time processing systems denoted as Active Distributed
Hash Tables (Active DHTs), such as Twitter Storm [2]. The
common feature in all these systems is that they process data
in the form of (Key, Value) pairs, distributed over a set of
machines. This distributed (Key, Value) abstraction is all
we need for both our scheme and analyses to apply. However, to make the later discussions more concrete, here we
briefly overview the mentioned distributed systems.
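For concreteness, the only primitive our scheme relies on is the distributed (Key, Value) abstraction described above: pairs are routed by Key, and a user-supplied function processes each Key's group. The following is a minimal single-process sketch of that abstraction (our own illustration; the function names are ours and belong to no framework):

```python
from collections import defaultdict

def group_by_key(pairs):
    """Route (Key, Value) pairs so that all Values with the same Key end up
    together, mimicking the Shuffle phase of MapReduce or DHT routing."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def process(pairs, udf):
    """Apply a user-defined function to each (Key, [Values]) group, the way a
    Reducer (MapReduce) or a UDF (Active DHT) would."""
    out = []
    for key, values in group_by_key(pairs).items():
        out.extend(udf(key, values))
    return out
```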

∗ Research supported by NSF grant 0904314.
† Research supported by NSF grants 0915040 and 0904314.
‡ Research supported by NSF grant 0915040.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
CIKM’12, October 29–November 2, 2012, Maui, HI, USA.
Copyright 2012 ACM 978-1-4503-1156-4/12/10 ...$15.00.

MapReduce: MapReduce [9] is a simple model for batched distributed processing, where computations are done in three phases.


The Map phase reads a collection of (Key, Value) pairs from an input source and, by invoking a user-defined Mapper function on each input element independently and in parallel, emits zero or more (Key, Value) pairs. The Shuffle phase then groups together all the Mapper-emitted (Key, Value) pairs sharing the same Key and outputs each distinct group to the next phase. The Reduce phase invokes a user-defined Reducer function on each distinct group, independently and in parallel, and emits zero or more values to associate with the group's Key.

Active DHT: A DHT (Distributed Hash Table) is a distributed (Key, Value) store which allows Lookups, Inserts, and Deletes on the basis of the Key. The term Active refers to the fact that an arbitrary User Defined Function (UDF) can be executed on a (Key, Value) pair, in addition to Insert, Delete, and Lookup. Twitter's Storm [2] is an example of Active DHTs which is gaining widespread use. All the (Key, Value) pairs in a node of the Active DHT are usually stored in main memory to allow for fast real-time processing of data and queries. The Active DHT model is broad enough to act as a distributed stream processing system and as a continuous version of MapReduce [16].

In addition to the typical performance measures of total running time and total space, two other measures are very important for both MapReduce and Active DHTs: 1) the total network traffic generated, and 2) the maximum number of values with the same key. A high value of the latter can lead to the "curse of the last reducer" in MapReduce or to one compute node becoming a bottleneck in an Active DHT.

1.2 Our Results

We design a new scheme, called Layered LSH, to implement Entropy LSH in the distributed (Key, Value) model. A summary of our results is:

1. We prove that Layered LSH incurs only O(√log n) network cost per query. This is an exponential improvement over the O(n^Θ(1)) query network cost of the simple distributed implementation of both Entropy LSH and Basic LSH.

2. Surprisingly, we prove that, in contrast with both Entropy LSH and Basic LSH, the network efficiency of Layered LSH is independent of the search quality.

3. We prove that, despite this network efficiency (which requires collocating nearby points on the same machines), Layered LSH sends points which are only Ω(1) apart to different machines with high likelihood, and hence hits the right tradeoff between load balance and network efficiency.

4. We also verify this empirically on MapReduce.

2. PRELIMINARIES

In this section we briefly review the preliminaries required for the paper.

Similarity Search: The similarity search problem is that of finding data objects similar to a query object. In many practical applications, the objects are represented by multidimensional feature vectors, and hence the problem reduces to finding objects close to the query object under the feature space distance metric. Formally, in a metric space with domain T and metric ζ, similarity search reduces to the problem more commonly known as the (c, r)-NN problem: given an approximation ratio c > 1 and a distance threshold r, the goal is to construct an index which, given any query point q ∈ T within distance r of a data point, allows for quickly finding a data point p ∈ T whose distance to q is at most cr.

LSH: Indyk and Motwani [12] introduced the following notion of LSH functions:

Definition 1. For the space T with metric ζ, given a distance threshold r, an approximation ratio c > 1, and probabilities p1 > p2, a family of hash functions H = {h : T → U} is said to be an (r, cr, p1, p2)-LSH family if for all x, y ∈ T:
1. if ζ(x, y) ≤ r then Pr_H[h(x) = h(y)] ≥ p1;
2. if ζ(x, y) ≥ cr then Pr_H[h(x) = h(y)] ≤ p2.

Such families are known for a variety of metric spaces [6]. In particular, Datar et al. [8] proposed LSH families for lp norms, with 0 < p ≤ 2, using p-stable distributions. For the case p = 2, with any W > 0, they consider the family of hash functions H_W = {h_{a,b} : R^d → Z} such that

h_{a,b}(v) = ⌊(a · v + b) / W⌋,

where b ∈ R is chosen uniformly from [0, W] and a ∈ N^d(0, 1). Here, N^d(0, 1) denotes the d-dimensional normal distribution around the origin 0 ∈ R^d, in which the i-th coordinate has the distribution N(0, 1), independently for each 1 ≤ i ≤ d.

Basic LSH: LSH families can be used to design an index for the (c, r)-NN problem as follows. To utilize this indexing scheme, one needs an LSH family H to start with. Then, for an integer k, let H′ = {H : T → U^k} be the family of hash functions in which any H ∈ H′ is the concatenation of k functions in H, i.e., H = (h1, h2, ..., hk), where hi ∈ H (1 ≤ i ≤ k). Then, for an integer M, draw M hash functions from H′, independently and uniformly at random, and use them to construct the index consisting of M hash tables on the data points. With this index, the similarity search for a query q is done by first generating the set of all data points mapping to the same bucket as q in at least one hash table, and then finding the closest point to q among those data points. Indyk and Motwani [12] proved that with n data points, choosing k = O(log n) and M = O(n^{1/c}) solves the (c, r)-NN problem with constant probability.

Although Basic LSH yields a significant improvement in the running time over both the brute force linear scan and the space partitioning approaches [20, 13], unfortunately the required number of hash functions is usually large [5, 10], which entails a very large space requirement for the index. Also, in the distributed setting, each hash table lookup at query time corresponds to a network call, which entails a large network load.

Entropy LSH: To mitigate the space inefficiency, Panigrahy [18] introduced the Entropy LSH scheme, which uses the same indexing as the Basic LSH scheme but a different query search procedure. The idea here is that, for each hash function H ∈ H′, the data points close to the query point q are highly likely to hash either to the same value as H(q) or to a value very close to that. Hence, in this scheme, in addition to q, several "offsets" q + δi (1 ≤ i ≤ L), chosen randomly from the surface of B(q, r), the sphere of radius r centered at q, are also hashed, and the data points in their hash buckets are also considered as search result candidates. Panigrahy [18] shows that for n data points, choosing k ≥ log n / log(1/p2) (with p2 as in Definition 1) and L = Õ(n^{2/c}) offsets, as few as O(1) hash tables suffice to solve the (c, r)-NN problem. Hence Entropy LSH improves the space efficiency significantly. However, in the distributed setting, it does not help with reducing the network load of LSH queries: each offset lookup requires a network call, and since one needs to look up L = Õ(n^{2/c}) offsets with this scheme but only M = O(n^{1/c}) buckets with Basic LSH, it in fact makes the network inefficiency issue even more severe [12, 18, 17].
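To make the hashing used throughout the paper concrete, the following is a minimal sketch of the 2-stable family H_W of Datar et al. [8] and its k-fold concatenation H ∈ H′_W. This is our own illustration, not the authors' code; the class name, parameter names, and the toy usage are assumptions made only for exposition.

```python
import numpy as np

class L2LSH:
    """Sketch of h_{a,b}(v) = floor((a.v + b)/W), concatenated k times to form H."""

    def __init__(self, d, k, W, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W
        self.A = rng.standard_normal((k, d))   # each row a_i ~ N^d(0, 1)
        self.b = rng.uniform(0.0, W, size=k)   # each b_i uniform in [0, W]

    def hash(self, v):
        # H(v) = (h_1(v), ..., h_k(v)) with h_i(v) = floor((a_i.v + b_i)/W)
        return tuple(np.floor((self.A @ v + self.b) / self.W).astype(int))

# Toy usage: nearby points are likely to share the bucket H(v).
if __name__ == "__main__":
    lsh = L2LSH(d=5, k=4, W=1.0, seed=42)
    v = np.random.default_rng(1).standard_normal(5)
    print(lsh.hash(v), lsh.hash(v + 0.01))
```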

3. DISTRIBUTED LSH

In this section, we present the Layered LSH scheme and theoretically analyze it. We first fix some notation. Let S be a set of n data points available a-priori, and let Q be the set of query points, either given as a batch (in case of MapReduce) or arriving in real-time (in case of Active DHT). Without loss of generality and to simplify the notation, in this section we assume r = 1/c, that is, we are interested in the (c, 1/c)-NN problem; this can be achieved by a simple scaling. Parameters k, L, W and the LSH families H = H_W and H′ = H′_W will be as defined in Section 2. Since multiple hash tables can obviously be implemented in parallel, for the sake of clarity we will focus on a single hash table and use a randomly chosen hash function H ∈ H′_W as our LSH function throughout, that is, H = (H1, ..., Hk), where for 1 ≤ i ≤ k,

Hi(v) = ⌊Γi(v)⌋,   Γi(v) = (ai · v + bi) / W,

where ai is a d-dimensional vector each of whose entries is chosen from the standard Gaussian N(0, 1) distribution, and bi ∈ R is chosen uniformly from [0, W]. We will also let Γ : R^d → R^k be Γ = (Γ1, ..., Γk). We will use the following small lemma, which follows from the triangle inequality, in our analysis:

Lemma 2. For any two vectors u, v ∈ R^d, we have ||Γ(u) − Γ(v)|| − √k ≤ ||H(u) − H(v)|| ≤ ||Γ(u) − Γ(v)|| + √k.

We will also assume that a (Key, Value) pair gets processed by the machine with id Key.

We start with a simple distributed implementation of Entropy LSH. For any data point p ∈ S, a (Key, Value) pair (H(p), p) is generated, and the receiving machine then adds p to the bucket H(p) by a Reducer in MapReduce or a UDF in Active DHT. Similarly, for any query point q ∈ Q, after generating the offsets q + δi (1 ≤ i ≤ L), for each offset a (Key, Value) pair (H(q + δi), q) is generated. Then, the machine with id H(q + δi) retrieves all data points p, if any, with H(p) = H(q + δi) which are within distance cr of q. This is done via a UDF in Active DHT or the Reducer in MapReduce. The amount of data transmitted per query in this implementation is O(Ld): L (Key, Value) pairs, one per offset, each with the d-dimensional point q as Value. Both L and d are large in many practical applications with high-dimensional data. Hence, this implementation causes a large network load.

3.1 Layered LSH

In this subsection, we present the Layered LSH scheme. The main idea is to use another layer of locality sensitive hashing to distribute the data and query points over the machines. More specifically, given a parameter value D > 0, we sample an LSH function G : R^k → Z such that

G(v) = ⌊(α · v + β) / D⌋,    (3.1)

where α ∈ N^k(0, 1) and β is chosen uniformly from [0, D]. Having chosen the LSH functions G and H, and denoting G(H(·)) by GH(·), for each data point p ∈ S we generate a (Key, Value) pair (GH(p), <H(p), p>). Then, for each query point q ∈ Q, after generating the offsets q + δi (1 ≤ i ≤ L), for each unique value x in the set {GH(q + δi) | 1 ≤ i ≤ L} we generate a (Key, Value) pair (x, q). Hence, machine x will have all the data points p such that GH(p) = x, as well as the queries q ∈ Q one of whose offsets gets mapped to x by GH(·). Then, for any received query point q, this machine regenerates the offsets q + δi (1 ≤ i ≤ L), finds their hash buckets H(q + δi), and for any of these buckets such that GH(q + δi) = x (there exists at least one such bucket), it performs a similarity search among the data points in that bucket. Note that since q is sent to this machine, the offset regeneration, hash, and bucket search can all be done by either a UDF in Active DHT or the Reducer in MapReduce.

At an intuitive level, the main idea in Layered LSH is that since G is an LSH function, if two data points p, p′ are far apart, GH(p) and GH(p′) are highly likely to be different. This means that Layered LSH partitions faraway data points on different machines, which in turn ensures a good load balance across the machines. Note that this is critical, as without a good load balance the point in distributing the implementation would be lost. On the other hand, since G and H are both LSH functions, for any query point q we have H(q + δi) ≈ H(q) for all offsets q + δi (1 ≤ i ≤ L), and hence the set {GH(q + δi) | 1 ≤ i ≤ L} has a very small cardinality. This means that only a small number of (Key, Value) pairs are generated per query, which in turn implies a small amount of network communication per query, while locating the nearby points on the same machines. Before proceeding to the analysis, we give a definition:

Definition 3. For a query point q ∈ Q, define

f_q = |{GH(q + δi) | 1 ≤ i ≤ L}|    (3.2)

to be the number of (Key, Value) pairs sent over the network for query q.

Since q is d-dimensional, the network load due to query q is O(d f_q). Hence, to analyze the network efficiency of Layered LSH, it suffices to analyze f_q.
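The (Key, Value) pairs generated by Layered LSH can be sketched as follows. This is our own illustrative code (function names and the in-memory representation are assumptions), reusing an inner hash H such as the sketch in Section 2 and the outer hash G of equation (3.1); q and the offsets are assumed to be numpy arrays.

```python
import numpy as np

def make_G(k, D, seed=1):
    """Sample the outer LSH G(v) = floor((alpha.v + beta)/D) of equation (3.1)."""
    rng = np.random.default_rng(seed)
    alpha = rng.standard_normal(k)      # alpha ~ N^k(0, 1)
    beta = rng.uniform(0.0, D)          # beta uniform in [0, D]
    return lambda v: int(np.floor((alpha @ np.asarray(v, dtype=float) + beta) / D))

def map_data_point(p, H, G):
    """For a data point p, emit the single pair (GH(p), <H(p), p>)."""
    hp = H(p)
    return [(G(hp), (hp, tuple(p)))]

def map_query_point(q, offsets, H, G):
    """For a query q with offsets q + delta_i, emit one pair (x, q) per
    distinct x in {GH(q + delta_i) : 1 <= i <= L}."""
    keys = {G(H(q + delta)) for delta in offsets}
    return [(x, tuple(q)) for x in keys]
```

The reducer (or UDF) for key x would then rebuild the buckets H(p) from the received <H(p), p> values, regenerate the offsets of each received query q, and search the buckets with GH(q + δi) = x, as described above.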

3.2 Analysis

In this section, we analyze the Layered LSH scheme presented in the previous section. Our analysis uses two well-known facts. The first is the sharp concentration of χ²-distributed random variables, which is also used in the proof of the Johnson-Lindenstrauss (JL) lemma [12], and the second is the 2-stability property of the Gaussian distribution: for a ∈ N^d(0, 1) and any u, v ∈ R^d, the inner product a · (u − v) is distributed as N(0, ||u − v||). We first bound the network load of Layered LSH. This is done in the following theorem:

Theorem 4. For any query point q, with high probability we have f_q = O(k/D).

Proof. Since for any offset q + δi the value GH(q + δi) is an integer, we have

f_q ≤ max_{1≤i,j≤L} {GH(q + δi) − GH(q + δj)} + 1.

Then, using equation 3.1,

f_q ≤ (1/D) max_{1≤i,j≤L} {α · (H(q + δi) − H(q + δj))} + 1 ≤ (||α|| / D) max_{1≤i,j≤L} {||H(q + δi) − H(q + δj)||} + 1,    (3.3)

where the last inequality follows from the Cauchy-Schwarz inequality. First, since each entry of α ∈ R^k is distributed as N(0, 1), an application of the JL lemma [12] shows that there exists an ε = ε(p2) = Θ(1) such that, with probability at least 1 − 1/n^Θ(1),

||α|| ≤ (1 + ε)√k ≤ 2√k.    (3.4)

Furthermore, for any 1 ≤ t ≤ k, using the 2-stability of the Gaussian distribution, Γt(q + δi) − Γt(q + δj) = (at · (δi − δj))/W is distributed as N(0, ||δi − δj||/W). Hence, again using the JL lemma [12], with probability at least 1 − 1/n^Θ(1),

||Γ(q + δi) − Γ(q + δj)|| ≤ (1 + ε)√k ||δi − δj|| / W.

Since the offsets are chosen from the surface of the sphere B(q, 1/c) of radius 1/c centered at q, we have ||δi − δj|| ≤ 2/c. Hence, since there are only L² different choices of 1 ≤ i, j ≤ L, and L is only polynomially large in n, with high probability we have

max_{1≤i,j≤L} ||Γ(q + δi) − Γ(q + δj)|| ≤ 2(1 + ε)√k / (cW) ≤ 4√k / (cW).    (3.5)

Then, by Lemma 2, with high probability

max_{1≤i,j≤L} ||H(q + δi) − H(q + δj)|| ≤ (1 + 4/(cW))√k.    (3.6)

Equations 3.3, 3.4, and 3.6 together give f_q ≤ 2(1 + 4/(cW)) k/D + 1 with high probability, which completes the proof.

Remark 5. A surprising property of Layered LSH demonstrated by Theorem 4 is that the network load is independent of the number of query offsets, L, and the number of hash tables for Basic LSH. To increase the search quality, one needs to increase the number of offsets for Entropy LSH and the number of hash tables for Basic LSH, both of which increase the network load of their simple distributed implementations. With Layered LSH, in contrast, the network efficiency is achieved independently of the level of search quality.

Next, we proceed to analyzing the load balance of Layered LSH. To this end, we define the function P(·):

P(z) = (2/√π) ∫_0^z e^{−τ²} dτ − (1/(√π z)) (1 − e^{−z²}).    (3.7)

One can easily see that P(·) is a monotonically increasing function, and for any 0 < ξ < 1 there exists a number z = zξ such that P(zξ) = ξ. Using this notation, we state the following lemma, which is proved in the full version of the paper:

Lemma 6. For any two points u, v ∈ R^k with ||u − v|| = λ, we have Pr[G(u) = G(v)] = P(D / (√2 λ)).

Theorem 7. For any constant 0 < ξ < 1, there is a λξ with λξ/W = O(1 + D/√k) such that for any two points u, v with ||u − v|| ≥ λξ, we have Pr[GH(u) = GH(v)] ≤ ξ + o(1), where o(1) is polynomially small in n.

Proof. Let u, v ∈ R^d be two points and denote ||u − v|| = λ. As in the proof of Theorem 4, using the 2-stability of the Gaussian distribution and the JL lemma [12], one can see that there exists an ε = ε(p2) = Θ(1) such that, with probability at least 1 − 1/n^Θ(1), ||Γ(u) − Γ(v)|| ≥ (1 − ε)√k λ/W. Then, by Lemma 2, with probability at least 1 − 1/n^Θ(1),

||H(u) − H(v)|| ≥ λ0 = ((1 − ε) λ/W − 1)√k.

Now, let λξ be the smallest λ for which λ0 ≥ D/(√2 zξ); then λξ/W = O(1 + D/√k), and if λ ≥ λξ, then by Lemma 6,

Pr[GH(u) = GH(v) | ||H(u) − H(v)|| ≥ λ0] ≤ P(D / (√2 λ0)) ≤ ξ,

which finishes the proof by recalling that Pr[||H(u) − H(v)|| < λ0] = o(1).

Theorems 4 and 7 show the tradeoff governing the choice of the parameter D. Intuitively, since H has a bin size of W, a pair of points p1, p2 mapping to the same buckets as two offsets of a query point most likely have distance O(W). Hence, D should be only large enough to make points which are O(W) away likely to be sent to the same machine, and yet points which are Ω(W) away get sent to different machines with constant probability. Theorem 7 shows that choosing D = o(√k) does not asymptotically help with the distance threshold at which points become likely to be sent to different machines, so to maintain the load balance it suffices to have O(1 + D/√k) = O(1), that is D = O(√k). By Theorem 4, among such choices the network traffic is minimized by D = Θ(√k). Hence we choose D = Θ(√k), and recalling that k ≥ log n / log(1/p2) [18], we get f_q = O(√k) = O(√log n):

Corollary 8. Choosing D = Θ(√k), Layered LSH guarantees that the number of (Key, Value) pairs sent over the network per query is O(√log n) with high probability.

Remark 9. Corollary 8 shows that, compared to the simple distributed implementation of Entropy LSH and Basic LSH, Layered LSH exponentially improves the network load, from O(n^Θ(1)) to O(√log n), while maintaining a good load balance across the different machines.
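As a compact restatement of this parameter choice (our summary of the bounds above, not an additional result):

```latex
f_q = O\!\left(\tfrac{k}{D}\right), \qquad
\frac{\lambda_\xi}{W} = O\!\left(1 + \tfrac{D}{\sqrt{k}}\right)
\quad\Longrightarrow\quad
D = \Theta(\sqrt{k}) \text{ gives } \lambda_\xi = O(W)
\text{ and } f_q = O(\sqrt{k}) = O\big(\sqrt{\log n}\big),
```

using k ≥ log n / log(1/p2) = Θ(log n) for Entropy LSH [18].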

4. EXPERIMENTS

In this section we describe our experimental setup and results. We perform experiments on a small cluster of 13 compute nodes using Hadoop [1] with 800MB JVMs.

Implementation Details: We use the English Wikipedia corpus from February 2012 as the data set. We compute TF-IDF vectors for each document after some standard preprocessing, and partition the 3.75M articles in the corpus randomly into a data set of size 3M and a query set of size 750K. We solve the (c, r)-NN problem with r = 0.1. We choose the LSH parameters W and k = 12 according to [18, 19]. Since the underlying dimensionality (vocabulary size 549532) is large, we use the Multi-Probe LSH (MPLSH) [17] as our first layer of hashing. We optimized D, the parameter of Layered LSH, using a simple binary search to minimize the wall-clock run time. Consistency in the choice of the hash functions H, G (Section 3) as well as the offsets across mappers and reducers is ensured by setting the seeds of the random number generators appropriately. We measure the accuracy of search results by computing the recall, i.e., the fraction of query points with at least one data point within a distance r returned in the output. Further details are in the full version of the paper.

[Figure 3.1: Variation in recall, shuffle size, and wall-clock run time with L. Panels: (a) Wiki: recall; (b) Wiki: shuffle size; (c) Wiki: runtime.]

Summary of Results: Here we present a summary of the results of our experiments using MapReduce with respect to the network cost (shuffle size) and wall-clock run time:

• Comparison with Simple LSH: As shown in Figure 3.1, improving recall by increasing L results in a linear increase in the shuffle size for Simple LSH, while the shuffle size for Layered LSH remains almost constant. This also verifies Theorem 4 and Remark 9. Further, on average Layered LSH provides a factor 3 improvement over Simple LSH in the wall-clock run time, on account of a factor 10 or more decrease in the shuffle size.

• Comparison with Sum and Cauchy schemes: We compare Layered LSH with the Sum and Cauchy distributed LSH schemes described in Haghani et al. [11]. The MapReduce job for the Sum scheme failed due to a reduce task running out of memory, indicating load imbalance. Layered LSH compares favorably with the Cauchy scheme in shuffle size and run time.

• Load Balance: Layered LSH offers a tunable way of trading off load balance with network cost. It compares favorably in load balance with the Simple LSH and Cauchy schemes and is much better than Sum.

In addition, we also perform experiments on other datasets, which yield similar results, as presented in the full version of the paper.

5. REFERENCES

[1] Hadoop. http://hadoop.apache.org.
[2] https://github.com/nathanmarz/storm/.
[3] P. Berkhin. A Survey of Clustering Data Mining Techniques. Springer.
[4] A. Beygelzimer, S. Kakade, and J. Langford. Cover trees for nearest neighbours. ICML '06.
[5] J. Buhler. Efficient large scale sequence comparison by locality-sensitive hashing. Bioinformatics, 17:419–428, 2001.
[6] M. Charikar. Similarity estimation techniques from rounding algorithms. STOC '02.
[7] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: Scalable online collaborative filtering. WWW '07.
[8] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality sensitive hashing scheme based on p-stable distributions. SoCG '04.
[9] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI '04.
[10] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. VLDB '99.
[11] P. Haghani, S. Michel, and K. Aberer. Distributed similarity search in high dimensions using locality sensitive hashing. EDBT '09.
[12] P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. STOC '98.
[13] R. Krauthgamer and J. Lee. Navigating nets: Simple algorithms for proximity search. SODA '04.
[14] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. ICCV '09.
[15] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search of approximate nearest neighbor in high dimensional spaces. STOC '98.
[16] B. Li, E. Mazur, Y. Diao, A. McGregor, and P. Shenoy. A platform for scalable one-pass analytics using MapReduce. SIGMOD '11.
[17] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. VLDB '07.
[18] R. Panigrahy. Entropy based nearest neighbor search in high dimensions. SODA '06.
[19] V. Satuluri and S. Parthasarathy. Bayesian locality sensitive hashing for fast similarity search. VLDB '12.
[20] R. Weber, H. Schek, and S. Blott. A quantitative analysis and performance study for similarity search methods in high dimensional spaces. VLDB '98.