Densify One Permutation Hashing NIPS2012 ICML2014 UAI2014
One can then estimate R from k independent permutations, π1, ..., πk:

R̂M = (1/k) Σ_{j=1}^{k} 1{min(πj(S1)) = min(πj(S2))},   Var(R̂M) = (1/k) R(1 − R)   (3)
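As an illustrative sanity check of (3) (not code from the paper; the toy sets and parameters below are made up), one can simulate k independent permutations and average the indicator:

```python
import random

def minwise_estimate(S1, S2, D, k, seed=0):
    """Estimate resemblance R by eq. (3): the fraction of k independent
    random permutations under which the two sets share the same minimum."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)
        if min(perm[e] for e in S1) == min(perm[e] for e in S2):
            matches += 1
    return matches / k

# Toy sets with resemblance R = |{2, 3}| / |{0, ..., 5}| = 1/3
print(minwise_estimate({0, 1, 2, 3}, {2, 3, 4, 5}, D=16, k=500))
```

With k = 500, the estimate should fall within a few standard deviations (√(R(1−R)/k) ≈ 0.02) of the true value 1/3.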
Because the indicator function 1{min(πj(S1)) = min(πj(S2))} can be written as an inner product
between two binary vectors (each having only one 1) in D dimensions [16]:

1{min(πj(S1)) = min(πj(S2))} = Σ_{i=0}^{D−1} 1{min(πj(S1)) = i} × 1{min(πj(S2)) = i}   (4)
we know that minwise hashing can potentially be used for training linear SVM and logistic regression
on high-dimensional binary data by converting the permuted data into a new data matrix in
D × k dimensions. This of course would not be realistic if D = 2^64.
The method of b-bit minwise hashing [18, 19] provides a simple solution by storing only the lowest
b bits of each hashed value, reducing the dimensionality of the (expanded) hashed data matrix to just
2^b × k. [16] applied this idea to large-scale learning on the webspam dataset and demonstrated that
using b = 8 and k = 200 to 500 could achieve very similar accuracies as using the original data.
π(S1): 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 0
π(S2): 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
π(S3): 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0
Figure 1: Consider S1 , S2 , S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16). We apply one permutation π on the
sets and present π(S1 ), π(S2 ), and π(S3 ) as binary (0/1) vectors, where π(S1 ) = {2, 4, 7, 13}, π(S2 ) =
{0, 6, 13}, and π(S3 ) = {0, 1, 10, 12}. We divide the space Ω evenly into k = 4 bins, select the smallest
nonzero in each bin, and re-index the selected elements as: [2, 0, ∗, 1], [0, 2, ∗, 1], and [0, ∗, 2, 0]. For
now, we use ‘*’ for empty bins, which occur rarely unless the number of nonzeros is small compared to k.
As illustrated in Figure 1, the idea of one permutation hashing is simple. We view sets as 0/1 vectors
in D dimensions so that we can treat a collection of sets as a binary data matrix in D dimensions.
After we permute the columns (features) of the data matrix, we divide the columns evenly into k
parts (bins) and we simply take, for each data vector, the smallest nonzero element in each bin.
In the example in Figure 1 (which concerns 3 sets), the sample selected from π(S1) is [2, 4, ∗, 13],
where we use '∗' to denote an empty bin, for the time being. Since we only want to compare elements
with the same bin number (so that we can obtain an inner product), we can actually re-index the
elements of each bin to use the smallest possible representations. For example, for π(S1), after
re-indexing, the sample [2, 4, ∗, 13] becomes [2 − 4 × 0, 4 − 4 × 1, ∗, 13 − 4 × 3] = [2, 0, ∗, 1].
We will show that empty bins occur rarely unless the total number of nonzeros for some set is small
compared to k, and we will present strategies on how to deal with empty bins should they occur.
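A minimal sketch of this binning and re-indexing step (illustrative Python, not the authors' implementation), using the Figure 1 example with D = 16 and k = 4; empty bins are returned as None in place of '∗':

```python
def one_permutation_hash(permuted_set, D, k):
    """Split {0, ..., D-1} into k equal bins; in each bin keep the
    smallest element, re-indexed relative to the bin's start.
    Empty bins are reported as None."""
    width = D // k                      # assumes k divides D
    samples = [None] * k
    for e in sorted(permuted_set):
        j = e // width                  # bin index of element e
        if samples[j] is None:          # first (= smallest) element in bin j
            samples[j] = e - width * j  # re-index within the bin
    return samples

print(one_permutation_hash({2, 4, 7, 13}, D=16, k=4))   # [2, 0, None, 1]
print(one_permutation_hash({0, 6, 13}, D=16, k=4))      # [0, 2, None, 1]
```

The two printed samples match the re-indexed vectors [2, 0, ∗, 1] and [0, 2, ∗, 1] given in the Figure 1 caption.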
1.5 Advantages of One Permutation Hashing
Reducing k (e.g., 500) permutations to just one permutation (or a few) is much more computationally
efficient. From the perspective of energy consumption, this scheme is desirable, especially consid-
ering that minwise hashing is deployed in the search industry. Parallel solutions (e.g., GPU [17]),
which require additional hardware and software implementation, will not be energy-efficient.
In the testing phase, if a new data point (e.g., a new document or a new image) has to be first pro-
cessed with k permutations, then the testing performance may not meet the demand in, for example,
user-facing applications such as search or interactive visual analytics.
One permutation hashing should be easier to implement, from the perspective of random number
generation. For example, if a dataset has one billion features (D = 10^9), we can simply generate a
“permutation vector” of length D = 10^9, the memory cost of which (i.e., 4GB) is not significant.
On the other hand, it would not be realistic to store a “permutation matrix” of size D × k if D = 10^9
and k = 500; instead, one usually has to resort to approximations such as universal hashing [5].
Universal hashing often works well in practice although theoretically there are always worst cases.
One permutation hashing is a better matrix sparsification scheme. In terms of the original binary data
matrix, the one permutation scheme simply makes many nonzero entries be zero, without further
“damaging” the matrix. Using the k-permutation scheme, we store, for each permutation and each
row, only the first nonzero and make all the other nonzero entries be zero; and then we have to
concatenate k such data matrices. This significantly changes the structure of the original data matrix.
Specifically, we hash the data points using k random permutations and store each hash value using
b bits. For each data point, we concatenate the resultant B = bk bits as a signature (e.g., bk = 16).
This way, we create a table of 2B buckets and each bucket stores the pointers of the data points
whose signatures match the bucket number. In the testing phase, we apply the same k permutations
to a query data point to generate a bk-bit signature and only search data points in the corresponding
bucket. Since using only one table will likely miss many true near neighbors, as a remedy, we
independently generate L tables. The query result is the union of data points retrieved in L tables.
An illustration of two hash tables (with 4-bit signatures as bucket indexes):

Table 1: 0000 → 6, 110, 143;  0001 → 3, 38, 217;  0010 → (empty);  ...
Table 2: 0000 → 8, 159, 331;  0001 → 11, 25, 99;  0010 → 3, 14, 32, 97;  ...
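The table-building and querying logic described above can be sketched as follows (illustrative only; the signatures and point ids below are made up, and real signatures would come from b-bit minwise hashing):

```python
from collections import defaultdict

def build_tables(signatures):
    """signatures[i][t] = bk-bit signature of data point i in table t.
    Returns L dicts mapping a signature -> ids of points in that bucket."""
    L = len(signatures[0])
    tables = [defaultdict(list) for _ in range(L)]
    for i, sigs in enumerate(signatures):
        for t, s in enumerate(sigs):
            tables[t][s].append(i)
    return tables

def query(tables, query_sigs):
    """Candidates = union, over the L tables, of the matching buckets."""
    candidates = set()
    for t, s in enumerate(query_sigs):
        candidates.update(tables[t].get(s, []))
    return candidates

# Three points, L = 2 tables, 4-bit signatures (made-up values)
tables = build_tables([[0b0000, 0b0101], [0b0000, 0b1111], [0b0010, 0b0101]])
print(query(tables, [0b0000, 0b0101]))   # {0, 1, 2}
```

Point 0 collides with the query in both tables; points 1 and 2 are retrieved from one table each, illustrating why the union over L tables improves recall.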
In [16], they apply k random permutations on each (binary) feature vector xi and store the lowest
b bits of each hashed value, to obtain a new dataset which can be stored using merely nbk bits. At
run-time, each new data point has to be expanded into a vector of length 2^b × k with exactly k 1’s.
To illustrate this simple procedure, [16] provided a toy example with k = 3 permutations. Sup-
pose for one data vector, the hashed values are {12013, 25964, 20191}, whose binary digits
are respectively {010111011101101, 110010101101100, 100111011011111}. Using b = 2 bits,
the binary digits are stored as {01, 00, 11} (which corresponds to {1, 0, 3} in decimals). At
run-time, the (b-bit) hashed data are expanded into a new feature vector of length 2^b k = 12:
{0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}. The same procedure is then applied to all n feature vectors.
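A sketch of this expansion (illustrative; the one-hot position within each length-2^b block is a convention chosen here so that the toy example above is reproduced exactly):

```python
def expand_bbit(hashed_values, b):
    """Expand k b-bit hash values into a 0/1 vector of length 2^b * k,
    one length-2^b block per hash value, with a single 1 per block."""
    width = 1 << b                   # 2^b
    vec = []
    for h in hashed_values:
        v = h & (width - 1)          # keep only the lowest b bits
        block = [0] * width
        block[width - 1 - v] = 1     # one-hot position within the block
        vec.extend(block)
    return vec

print(expand_bbit([12013, 25964, 20191], b=2))
# [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```

The lowest two bits of {12013, 25964, 20191} are {1, 0, 3} in decimal, and the three one-hot blocks concatenate to the 12-dimensional vector from the text.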
Clearly, in both applications (near neighbor search and linear learning), the hashed data have to be
“aligned” in that only the hashed data generated from the same permutation are interacted. Note
that, with our one permutation scheme as in Figure 1, the hashed data are indeed aligned.
This section presents the probability analysis to provide a rigorous foundation for one permutation
hashing as illustrated in Figure 1. Consider two sets S1 and S2 . We first introduce two definitions,
for the number of “jointly empty bins” and the number of “matched bins,” respectively:

Nemp = Σ_{j=1}^{k} Iemp,j,   Nmat = Σ_{j=1}^{k} Imat,j   (7)

where Iemp,j and Imat,j are defined for the j-th bin, as

Iemp,j = 1 if both π(S1) and π(S2) are empty in the j-th bin; 0 otherwise   (8)

Imat,j = 1 if both π(S1) and π(S2) are not empty and the smallest element of π(S1)
matches the smallest element of π(S2), in the j-th bin; 0 otherwise   (9)
Lemma 1

Pr(Nemp = j) = Σ_{s=0}^{k−j} (−1)^s (k! / (j! s! (k−j−s)!)) Π_{t=0}^{f−1} (D(1 − (j+s)/k) − t)/(D − t),   0 ≤ j ≤ k − 1   (10)
Assume D(1 − 1/k) ≥ f = f1 + f2 − a. Then

E(Nemp)/k = Π_{j=0}^{f−1} (D(1 − 1/k) − j)/(D − j) ≤ (1 − 1/k)^f   (11)

E(Nmat)/k = R (1 − E(Nemp)/k) = R (1 − Π_{j=0}^{f−1} (D(1 − 1/k) − j)/(D − j))   (12)
In practical scenarios, the data are often sparse, i.e., f = f1 + f2 − a ≪ D. In this case, the upper
bound in (11), (1 − 1/k)^f, is a good approximation to the true value of E(Nemp)/k. Since (1 − 1/k)^f ≈
e^{−f/k}, we know that the chance of empty bins is small when f ≫ k. For example, if f/k = 5 then
(1 − 1/k)^f ≈ 0.0067. For practical applications, we would expect f ≫ k (for most data pairs);
otherwise hashing probably would not be too useful anyway. This is why we do not expect empty
bins to significantly impact (if at all) the performance in practical settings.
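A one-line numerical check of this approximation (with f/k = 5, as in the text; the particular values of f and k are illustrative):

```python
import math

f, k = 2500, 500                 # f/k = 5 (illustrative values)
exact = (1 - 1 / k) ** f         # the upper bound (1 - 1/k)^f
approx = math.exp(-f / k)        # e^{-5} ≈ 0.0067
print(exact, approx)
```

Both quantities agree to about four decimal places, confirming that e^{−f/k} is an accurate proxy for the empty-bin bound.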
Lemma 2 shows that the following estimator R̂mat of the resemblance is unbiased:

Lemma 2

R̂mat = Nmat / (k − Nemp),   E(R̂mat) = R   (14)

Var(R̂mat) = R(1 − R) ( E(1/(k − Nemp)) (1 + 1/(f − 1)) − 1/(f − 1) )   (15)

E(1/(k − Nemp)) = Σ_{j=0}^{k−1} Pr(Nemp = j)/(k − j) ≥ 1/(k − E(Nemp))   (16)
The fact that E(R̂mat) = R may seem surprising, as in general ratio estimators are not unbiased.
Note that k − Nemp > 0, because we assume the original data vectors are not completely empty (all-zero).
As expected, when k ≪ f = f1 + f2 − a, Nemp is essentially zero and hence Var(R̂mat) ≈
R(1 − R)/k. In fact, Var(R̂mat) is a bit smaller than R(1 − R)/k, especially for large k.
It is probably not surprising that our one permutation scheme (slightly) outperforms the original
k-permutation scheme (at merely 1/k of the preprocessing cost), because one permutation hashing,
which is “sampling-without-replacement”, provides a better strategy for matrix sparsification.
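The unbiasedness of R̂mat in Lemma 2 is easy to check by simulation (illustrative code with made-up toy sets; not from the paper):

```python
import random

def r_hat_mat(S1, S2, D, k, rng):
    """One trial: draw a random permutation, bin the permuted sets,
    and return R_hat = N_mat / (k - N_emp) as in Lemma 2."""
    perm = list(range(D))
    rng.shuffle(perm)
    width = D // k
    def bin_mins(S):
        mins = [None] * k
        for e in S:
            p = perm[e]
            j = p // width
            if mins[j] is None or p < mins[j]:
                mins[j] = p
        return mins
    m1, m2 = bin_mins(S1), bin_mins(S2)
    n_emp = sum(a is None and b is None for a, b in zip(m1, m2))
    n_mat = sum(a is not None and a == b for a, b in zip(m1, m2))
    return n_mat / (k - n_emp)

rng = random.Random(0)
S1, S2 = set(range(0, 24)), set(range(8, 32))    # R = 16/32 = 0.5
avg = sum(r_hat_mat(S1, S2, 64, 8, rng) for _ in range(2000)) / 2000
print(avg)   # concentrates near the true resemblance R = 0.5
```

Averaging over 2000 random permutations, the estimate lands within a fraction of a percent of R = 0.5, consistent with E(R̂mat) = R.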
4 Strategies for Dealing with Empty Bins
In general, we expect that empty bins should not occur often because E(Nemp)/k ≈ e^{−f/k}, which
is very close to zero if f/k > 5. (Recall f = |S1 ∪ S2|.) If the goal of using minwise hashing is
data reduction, i.e., reducing the number of nonzeros, then we would expect f ≫ k anyway.
Nevertheless, in applications where we need the estimators to be inner products, we need strategies
to deal with empty bins in case they occur. Fortunately, we found a simple (in retrospect) strategy
which can be nicely integrated with linear learning algorithms and performs well.
[Figure omitted: histogram of the numbers of nonzeros per vector, which should be much larger than k (e.g., 500).]

R̂mat^(0) = Nmat / ( √(k − Nemp^(1)) √(k − Nemp^(2)) )   (17)
where Nemp^(1) = Σ_{j=1}^{k} Iemp,j^(1) and Nemp^(2) = Σ_{j=1}^{k} Iemp,j^(2) are the numbers of empty bins in π(S1)
and π(S2), respectively. This modified estimator makes sense for a number of reasons.

Basically, since each data vector is processed and coded separately, we actually do not know Nemp
(the number of jointly empty bins) until we see both π(S1) and π(S2). In other words, we cannot
really compute Nemp if we want to use linear estimators. On the other hand, Nemp^(1) and Nemp^(2) are
always available. In fact, the use of √(k − Nemp^(1)) √(k − Nemp^(2)) in the denominator corresponds to the
normalization step which is needed before feeding the data to a solver for SVM or logistic regression.

When Nemp^(1) = Nemp^(2) = Nemp, (17) is equivalent to the original R̂mat. When two original vectors
are very similar (e.g., large R), Nemp^(1) and Nemp^(2) will be close to Nemp. When two sets are highly
unbalanced, using (17) will overestimate R; however, in this case, Nmat will be so small that the
absolute error will not be large.
is summarized as follows:

Original hashed values (k = 4):  12013, 25964, 20191, ∗
Original binary representations:  010111011101101, 110010101101100, 100111011011111, ∗
Lowest b = 2 binary digits:  01, 00, 11, ∗
Expanded 2^b = 4 binary digits:  0010, 0001, 1000, 0000
New feature vector fed to a solver:  (1/√(4 − 1)) × [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

We apply the same procedure to all feature vectors in the data matrix to generate a new data matrix.
The normalization factor 1/√(k − Nemp^(i)) varies, depending on the number of empty bins in the i-th vector.
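The zero-coding step above can be sketched as follows (illustrative; it reuses the toy hashed values, with None marking the empty bin, and the same one-hot block convention as in the earlier toy example):

```python
import math

def zero_code_vector(hashed_bins, b, k):
    """Zero-coding: an empty bin (None) expands to an all-zero block of
    length 2^b; the whole vector is scaled by 1/sqrt(k - N_emp)."""
    width = 1 << b
    n_emp = sum(h is None for h in hashed_bins)
    vec = []
    for h in hashed_bins:
        block = [0] * width
        if h is not None:
            block[width - 1 - (h & (width - 1))] = 1   # lowest b bits, one-hot
        vec.extend(block)
    scale = 1.0 / math.sqrt(k - n_emp)
    return [scale * x for x in vec]

# k = 4 bins, the 4th bin empty, as in the worked example above
v = zero_code_vector([12013, 25964, 20191, None], b=2, k=4)
print(v[:4])   # first block: [0, 0, 1/sqrt(3), 0]
```

After scaling by 1/√(4 − 1), the resulting vector has unit l2 norm, which is exactly the normalization a linear SVM or logistic regression solver expects.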
[Figure 4 plots omitted: test accuracy (%) versus the regularization parameter C on webspam; upper panels SVM with k = 64, 128, 256, 512, lower panels logit with k = 64, 128, 256, 512, each for b ∈ {1, 2, 4, 6, 8}, compared with the original data, the 1-permutation scheme, and the k-permutation scheme.]
Figure 4: Test accuracies of SVM (upper panels) and logistic regression (bottom panels), averaged
over 50 repetitions. The accuracies of using the original data are plotted as dashed (red, if color is
available) curves with “diamond” markers. C is the regularization parameter. Compared with the
original k-permutation minwise hashing (dashed and blue if color is available), the one permutation
hashing scheme achieves similar accuracies, or even slightly better accuracies when k is large.
The empirical results on the webspam datasets are encouraging because they verify that our proposed
one permutation hashing scheme performs as well as (or even slightly better than) the original k-
permutation scheme, at merely 1/k of the original preprocessing cost. On the other hand, it would
be more interesting, from the perspective of testing the robustness of our algorithm, to conduct
experiments on a dataset (e.g., news20) where the empty bins will occur much more frequently.
as large as 4096 (i.e., most of the bins are empty). In fact, the one permutation scheme achieves
noticeably better accuracies than the original k-permutation scheme. We believe this is because the
one permutation scheme is “sampling-without-replacement” and provides a better matrix sparsification
strategy without “contaminating” the original data matrix too much.
We experiment with k ∈ {2^5, 2^6, 2^7, 2^8, 2^9, 2^10, 2^11, 2^12} and b ∈ {1, 2, 4, 6, 8}, for both the one
permutation scheme and the k-permutation scheme. We use 10,000 samples for training and the other
10,000 samples for testing. For convenience, we let D = 2^21 (which is larger than 1,355,191).
Figure 5 and Figure 6 present the test accuracies for linear SVM and logistic regression, respectively.
When k is small (e.g., k ≤ 64) both the one permutation scheme and the original k-permutation
scheme perform similarly. For larger k values (especially as k ≥ 256), however, our one permu-
tation scheme noticeably outperforms the k-permutation scheme. Using the original data, the test
accuracies are about 98%. Our one permutation scheme with k ≥ 512 and b = 8 essentially achieves
the original test accuracies, while the k-permutation scheme could only reach about 97.5%.
[Figure 5 plots omitted: test accuracy (%) versus C for linear SVM on news20, panels k = 32, 64, 128, 256, 512, 1024, 2048, 4096, each for b ∈ {1, 2, 4, 6, 8}, compared with the original data, the 1-permutation scheme, and the k-permutation scheme.]
Figure 5: Test accuracies of linear SVM averaged over 100 repetitions. The one permutation scheme
noticeably outperforms the original k-permutation scheme especially when k is not small.
[Figure 6 plots omitted: test accuracy (%) versus C for logistic regression on news20, panels k = 32, 64, 128, 256, 512, 1024, 2048, 4096, each for b ∈ {1, 2, 4, 6, 8}, compared with the original data, the 1-permutation scheme, and the k-permutation scheme.]
Figure 6: Test accuracies of logistic regression averaged over 100 repetitions. The one permutation
scheme noticeably outperforms the original k-permutation scheme especially when k is not small.
7 Conclusion
A new hashing algorithm is developed for large-scale search and learning in massive binary data.
Compared with the original k-permutation (e.g., k = 500) minwise hashing (which is a standard
procedure in the context of search), our method requires only one permutation and can achieve
similar or even better accuracies at merely 1/k of the original preprocessing cost. We expect that one
permutation hashing (or its variant) will be adopted in practice. See more details in arXiv:1208.1259.
Acknowledgement: The research of Ping Li is partially supported by NSF-IIS-1249316, NSF-
DMS-0808864, NSF-SES-1131848, and ONR-YIP-N000140910911. The research of Art B. Owen
is partially supported by NSF-0906056. The research of Cun-Hui Zhang is partially supported by
NSF-DMS-0906420, NSF-DMS-1106753, NSF-DMS-1209014, and NSA-H98230-11-1-0205.
References
[1] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in
high dimensions. In Commun. ACM, volume 51, pages 117–122, 2008.
[2] Leon Bottou. http://leon.bottou.org/projects/sgd.
[3] Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent
permutations (extended abstract). In STOC, pages 327–336, Dallas, TX, 1998.
[4] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. Syntactic clustering of
the web. In WWW, pages 1157 – 1166, Santa Clara, CA, 1997.
[5] J. Lawrence Carter and Mark N. Wegman. Universal classes of hash functions (extended abstract). In
STOC, pages 106–112, 1977.
[6] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and
its applications. Journal of Algorithms, 55(1):58–75, 2005.
[7] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A
library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
[8] Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of the evolution
of web pages. In WWW, pages 669–678, Budapest, Hungary, 2003.
[9] Jerome H. Friedman, F. Baskett, and L. Shustek. An algorithm for finding nearest neighbors. IEEE
Transactions on Computers, 24:1000–1006, 1975.
[10] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimen-
sionality. In STOC, pages 604–613, Dallas, TX, 1998.
[11] Thorsten Joachims. Training linear svms in linear time. In KDD, pages 217–226, Pittsburgh, PA, 2006.
[12] Ping Li. Very sparse stable random projections for dimension reduction in lα (0 < α ≤ 2) norm. In
KDD, San Jose, CA, 2007.
[13] Ping Li and Kenneth W. Church. Using sketches to estimate associations. In HLT/EMNLP, pages 708–
715, Vancouver, BC, Canada, 2005 (The full paper appeared in Computational Linguistics in 2007).
[14] Ping Li, Kenneth W. Church, and Trevor J. Hastie. One sketch for all: Theory and applications of
conditional random sampling. In NIPS, Vancouver, BC, Canada, 2008 (Preliminary results appeared
in NIPS 2006).
[15] Ping Li, Trevor J. Hastie, and Kenneth W. Church. Very sparse random projections. In KDD, pages
287–296, Philadelphia, PA, 2006.
[16] Ping Li, Anshumali Shrivastava, Joshua Moore, and Arnd Christian König. Hashing algorithms for large-
scale learning. In NIPS, Granada, Spain, 2011.
[17] Ping Li, Anshumali Shrivastava, and Arnd Christian König. b-bit minwise hashing in practice: Large-
scale batch and online learning and using GPUs for fast preprocessing with simple hash functions. Tech-
nical report.
[18] Ping Li and Arnd Christian König. b-bit minwise hashing. In WWW, pages 671–680, Raleigh, NC, 2010.
[19] Ping Li, Arnd Christian König, and Wenhao Gui. b-bit minwise hashing for estimating three-way simi-
larities. In NIPS, Vancouver, BC, 2010.
[20] Shai Shalev-Shwartz, Yoram Singer, and Nathan Srebro. Pegasos: Primal estimated sub-gradient solver
for svm. In ICML, pages 807–814, Corvalis, Oregon, 2007.
[21] Qinfeng Shi, James Petterson, Gideon Dror, John Langford, Alex Smola, and S.V.N. Vishwanathan. Hash
kernels for structured data. Journal of Machine Learning Research, 10:2615–2637, 2009.
[22] Anshumali Shrivastava and Ping Li. Fast near neighbor search in high-dimensional binary data. In ECML,
2012.
[23] Josef Sivic and Andrew Zisserman. Video google: a text retrieval approach to object matching in videos.
In ICCV, 2003.
[24] Simon Tong. Lessons learned developing a practical large scale machine learning system.
http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html, 2008.
[25] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing
for large scale multitask learning. In ICML, pages 1113–1120, 2009.
Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search
min π(S1) and min π(S2), as the hashed values. By an elementary probability argument,

Pr(min π(S1) = min π(S2)) = |S1 ∩ S2| / |S1 ∪ S2| = R   (1)

This method is known as minwise hashing (Broder, 1997; Broder et al., 1998; 1997). See Figure 1 for an illustration.

π(S1): 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 1 1 0
π(S2): 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0

Figure 1. (b-bit) minwise hashing. Consider two binary vectors (sets), S1, S2 ⊆ Ω = {0, 1, ..., 23}. The two permuted sets are: π(S1) = {5, 7, 14, 15, 16, 18, 21, 22}, π(S2) = {5, 6, 12, 14, 16, 17}. Thus, minwise stores min π(S1) = 5 and min π(S2) = 5. With b-bit minwise hashing, we store their lowest b bits. For example, if b = 1, we store 1 and 1, respectively.

Li & König (2010); Li et al. (2013) proposed b-bit minwise hashing by only storing the lowest b bits (as opposed to, e.g., 64 bits) of the hashed values; also see Figure 1. This substantially reduces not only the storage but also the dimensionality. Li et al. (2011) applied this technique in large-scale learning. Shrivastava & Li (2012) directly used these bits to build hash tables for near neighbor search.

1.4. Classical LSH with Minwise Hashing

The crucial formula of the collision probability (1) suggests that minwise hashing belongs to the LSH family.

Let us consider the task of similarity search over a giant collection of sets (binary vectors) C = {S1, S2, ..., SN}, where N could run into billions. Given a query Sq, we are interested in finding S ∈ C with a high value of |Sq ∩ S| / |Sq ∪ S|. The idea of linear scan is infeasible due to the size of C. The idea behind LSH with minwise hashing is to concatenate K different minwise hashes for each Si ∈ C, and then store Si in the bucket indexed by

B(Si) = [h_{π1}(Si); h_{π2}(Si); ...; h_{πK}(Si)],   (2)

where πi : i ∈ {1, 2, ..., K} are K different random permutations and h_{πi}(S) = min(πi(S)). Given a query Sq, we retrieve all the elements from bucket B(Sq) as potential similar candidates. The bucket B(Sq) contains elements Si ∈ C whose K different hash values collide with those of the query. By the LSH property of the hash function, i.e., (1), these elements have a higher probability of being similar to the query Sq compared to a random point. This probability value can be tuned by choosing an appropriate value for the parameter K. The entire process is repeated independently L times to ensure good recall. Here, the value of L is again optimized depending on the desired similarity levels.

The query time cost is dominated by O(dKL) hash evaluations, where d is the number of nonzeros of the query vector. Our work will reduce the cost to O(KL + dL).

Previously, there were two major attempts to reduce query time (i.e., hashing cost), which generated multiple hash values from one single permutation: (i) Conditional Random Sampling (CRS) and (ii) One Permutation Hashing (OPH).

1.5. Conditional Random Sampling (CRS)

π(S1): 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 1 1 0
π(S2): 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0

Figure 2. Conditional Random Sampling (CRS). Suppose the data are already permuted under π. We store the smallest k = 3 nonzeros as “sketches”: {5, 7, 14} and {5, 6, 12}, respectively. Because min{14, 12} = 12, we only look at up to the 12-th column (i.e., 13 columns in total). Because we have already applied a random permutation, any chunk of the data can be viewed as a random sample, in particular, the chunk from the 0-th column to the 12-th column. This means we equivalently obtain a random sample of size 13, from which we can estimate any similarities, not just resemblance. Also, it is clear that the method is naturally applicable to non-binary data. A careful analysis showed that it is (slightly) better NOT to use the last column, i.e., we only use up to the 11-th column (a sample of size 12 instead of 13).

Minwise hashing requires applying many permutations on the data. Li & Church (2007); Li et al. (2006) developed Conditional Random Sampling (CRS) by directly taking the k smallest nonzeros (called sketches) from only one permutation. The method is unique in that it constructs an equivalent random sample pairwise (or group-wise) from these nonzeros. Note that the original minwise hashing paper (Broder, 1997) actually also took the smallest k nonzeros from one permutation. Li & Church (2007) showed that CRS is significantly more accurate. CRS naturally extends to multi-way similarity (Li et al., 2010; Shrivastava & Li, 2013) and the method is not restricted to binary data.

Although CRS requires only one permutation, the drawback of that method is that the samples (sketches) are not aligned. Thus, strictly speaking, CRS cannot be used for linear learning or near neighbor search by building hash tables; otherwise the performance would not be ideal.

1.6. One Permutation Hashing

π(S1): 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 1 1 0
π(S2): 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0

Figure 3. One Permutation Hashing. Instead of only storing the smallest nonzero in each permutation and repeating the permutation k times, we can use just one permutation, break the space evenly into k bins, and store the smallest nonzero in each bin. It is not directly applicable to near neighbor search due to empty bins.

Li et al. (2012) developed one permutation hashing. As illustrated in Figure 3, the method breaks the space equally
into k bins after one permutation and stores the smallest of empty bins, unlike the original paper (Li et al., 2012).
nonzero of each bin. In this way, the samples are naturally
aligned. Note that if k = D, where D = |Ω|, then we 2. Our Proposed Hashing Method
get back the (permuted) original data matrix. When there 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
are no empty bins, one permutation hashing estimates the
resemblance without bias. However, if empty bins occur, π(S ) 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 1 1 0
1
we need a strategy to deal with them. The strategy adopted π(S ) 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 1 1 0 0
2
0 0 0 0
by Li et al. (2012) is to simply treat empty bins as “ze-
ros” (i.e., zero-coding), which works well for linear learn-
OPH(π(S1)) = [ E, 1, E, 2, 0, 1 ] H(π(S1)) = [ 1+C, 1, 2+C, 2, 0, 1 ]
ing. Note that, in Li et al. (2012), “zero” means zero in the
“expanded metric space” (so that indicator function can be OPH(π(S2)) = [ E, 1, E, 0, 0, E ] H(π(S2)) = [ 1+C, 1, C, 0, 0, 1+2C]
explicitly written as an inner product).
Figure 4. One Permutation Hashing+Rotation, with D = 24
1.7. Limitation of One Permutation Hashing (space size) and k = 6 bins (i.e., D/k = 4). For one permutation
One permutation hashing can not be directly used for near hashing (OPH), the hashed value is defined in (4). In particular,
neighbor search by building hash tables because empty bins for set S1 , OP H(π(S1 )) = [E, 1, E, 2, 0, 1] where E stands for
do not offer indexing capability. In other words, because of “empty bins”. The 0-th and 2-th bins are empty. In the 1-th bin,
the minimum is 5, which becomes 1 after the modular operation:
these empty bins, it is not possible to determine which bin
5 mod 4 = 1. Our proposed hash method, defined in (5), needs
value to use for bucketing.Using the LSH framework, each
to fill the two empty bins for S1 . For the 0-th bin, the first non-
hash table may need (e.g.,) 10 ∼ 20 hash values and we empty bin on the right is the 1-th bin (i.e., t = 1) with hashed
may need (e.g.,) 100 or more hash tables. If we generate all value 1. Thus, we fill the 0-th empty bin with “1 + C”. For the
the necessary (e.g., 2000) hash values from merely one permutation, the chance of having empty bins might be high in sparse data. For example, suppose on average each data vector has 500 nonzeros. If we use 2000 bins, then roughly (1 − 1/2000)^500 ≈ 78% of the bins would be empty.

Suppose we adopt the zero-coding strategy, by coding every empty bin with any fixed special number. If empty bins dominate, then two sparse vectors will become artificially "similar". On the other hand, suppose we use the "random-coding" strategy, by coding an empty bin randomly (and independently) as a number in {0, 1, 2, ..., D − 1}. Again, if empty bins dominate, then two sparse vectors which are similar in terms of the original resemblance may artificially become not so similar. We will see later that these schemes lead to significant deviations from the expected behavior. We need a better strategy to handle empty bins.

1.8. Our Proposal: One Permutation with Rotation

Our proposed hashing method is simple and effective. After one permutation hashing, we collect the hashed values for each set. If a bin is empty, we "borrow" the hashed value from the first non-empty bin on the right (circular). Due to the circular operation, the scheme for filling empty bins appears like a "rotation". We can show mathematically that we obtain a valid LSH scheme. Because we generate all necessary hash values from merely one permutation, the query time of a (K, L)-parameterized LSH scheme is reduced from O(dKL) to O(KL + dL), where d is the number of nonzeros of the query vector. We will show empirically that its performance in near neighbor search is virtually identical to that of the original minwise hashing. An interesting consequence is that our work supplies an unbiased estimate of resemblance regardless of the number of empty bins.

Our proposed hashing method is easy to understand with an example as in Figure 4. Consider S ⊆ Ω = {0, 1, ..., D − 1} and a random permutation π : Ω → Ω. As in Figure 3 (also in Figure 4), we divide the space evenly into k bins. For the j-th bin, where 0 ≤ j ≤ k − 1, we define the set

    M_j(π(S)) = π(S) ∩ [Dj/k, D(j + 1)/k)                                 (3)

If M_j(π(S)) is empty, we denote the hashed value under this one permutation hashing by OPH_j(π(S)) = E, where E stands for "empty". If the set is not empty, we just take the minimum of the set (mod by D/k). In this paper, without loss of generality, we always assume D is divisible by k (otherwise we just increase D by padding zeros). That is, if M_j(π(S)) ≠ ∅, we have

    OPH_j(π(S)) = min M_j(π(S)) mod D/k                                   (4)

We are now ready to define our hashing scheme, which is basically one permutation hashing plus a "rotation" scheme for filling empty bins. Formally, we define

    H_j(π(S)) = OPH_j(π(S))                      if OPH_j(π(S)) ≠ E
                OPH_{(j+t) mod k}(π(S)) + tC     otherwise                (5)

    t = min z,  s.t.  OPH_{(j+z) mod k}(π(S)) ≠ E                         (6)

For example, in Figure 4, for the 2-th bin, the first non-empty bin on the right is the 3-th bin (i.e., t = 1 again) with hashed value 0, and hence we fill the 2-th bin with "C". The interesting case for S2 is the 5-th bin, which is empty. The first non-empty bin on the right (circular) is the 1-th bin (as 5 + 2 mod 6 = 1). Thus, the 5-th bin becomes "1 + 2C".
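The bin-minimum computation (3)-(4) and the rotation fill (5)-(6) can be sketched in a few lines of Python. This is our own illustrative sketch, not the paper's code: function and variable names are ours, the permutation is passed as an array mapping a feature index to its permuted index, and the input set is assumed non-empty.

```python
def oph_rotation(s, perm, D, k):
    """One permutation hashing with 'rotation' densification, a sketch
    of Eqs. (3)-(6). s: set of nonzero feature indices; perm: list
    mapping each of the D feature indices to its permuted position."""
    assert D % k == 0
    width = D // k
    # Eqs. (3)-(4): minimum within each bin under the permutation,
    # re-indexed mod D/k; None plays the role of the empty marker E.
    bins = [None] * k
    for e in s:
        j, v = divmod(perm[e], width)
        if bins[j] is None or v < bins[j]:
            bins[j] = v
    # Eqs. (5)-(6): an empty bin borrows from the first non-empty bin
    # on the (circular) right, offset by t*C with C >= D/k + 1.
    C = width + 1
    H = [0] * k
    for j in range(k):
        t = 0
        while bins[(j + t) % k] is None:
            t += 1
        H[j] = bins[(j + t) % k] + t * C
    return H
```

With the identity permutation, D = 12 and k = 4, the set {1, 5, 6} produces bin minima [1, 2, 0, E]; the empty 3-th bin borrows 1 from the 0-th bin at distance t = 1, giving 1 + C = 5.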
Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search

Here C ≥ D/k + 1 is a constant, to avoid undesired collisions. Basically, we fill the empty bins with the hashed value of the first non-empty bin on the "right" (circular).

Theorem 1

    Pr (H_j(π(S1)) = H_j(π(S2))) = R                                      (7)

Theorem 1 proves that our proposed method provides a valid hash function which satisfies the LSH property from one permutation. While the main application of our method is for approximate near neighbor search, which we will elaborate on in Section 4, another promising use of our work is for estimating resemblance using only a linear estimator (i.e., an estimator which is an inner product); see Section 3.

3. Resemblance Estimation

Theorem 1 naturally suggests a linear, unbiased estimator of the resemblance R:

    R̂ = (1/k) Σ_{j=0}^{k−1} 1{H_j(π(S1)) = H_j(π(S2))}                   (8)

A linear estimator is important because it implies that the hashed data form an inner product space, which allows us to take advantage of modern linear learning algorithms (Joachims, 2006; Shalev-Shwartz et al., 2007; Bottou; Fan et al., 2008) such as linear SVM.

The original one permutation hashing paper (Li et al., 2012) proved the following unbiased estimator of R:

    R̂_{OPH,ub} = N_mat / (k − N_emp)                                      (9)

    N_emp = Σ_{j=0}^{k−1} 1{OPH_j(π(S1)) = E and OPH_j(π(S2)) = E}

    N_mat = Σ_{j=0}^{k−1} 1{OPH_j(π(S1)) = OPH_j(π(S2)) ≠ E}

which unfortunately is not a linear estimator, because the number of "jointly empty" bins N_emp would not be known until we see both hashed sets. To address this issue, Li et al. (2012) provided a modified estimator:

    R̂_{OPH} = N_mat / ( √(k − N_emp^(1)) √(k − N_emp^(2)) )              (10)

    N_emp^(1) = Σ_{j=0}^{k−1} 1{OPH_j(π(S1)) = E},
    N_emp^(2) = Σ_{j=0}^{k−1} 1{OPH_j(π(S2)) = E}

This estimator, although not unbiased (for estimating R), works well in the context of linear learning. In fact, the normalization by √(k − N_emp^(1)) √(k − N_emp^(2)) is smoothly integrated in the SVM training procedure, which usually requires the input vectors to have unit l2 norms. It is easy to see that as k increases (to D), R̂_OPH estimates the original (normalized) inner product, not resemblance. From the perspective of applications (in linear learning), this is not necessarily a bad choice, of course.

Here, we provide an experimental study to compare the two estimators, R̂ in (8) and R̂_OPH in (10). The dataset, extracted from a chunk of Web crawl (with 2^16 documents), is described in Table 1, which consists of 12 pairs of sets (i.e., 24 words in total). Each set consists of the document IDs which contain the word at least once.

Table 1. 12 pairs of words used in Experiment 1. For example, "A" and "THE" correspond to the two sets of document IDs which contained word "A" and word "THE" respectively.

    Word 1    Word 2       |S1|     |S2|     R
    HONG      KONG           940      948    0.925
    RIGHTS    RESERVED     12234    11272    0.877
    A         THE          39063    42754    0.644
    UNITED    STATES        4079     3981    0.591
    SAN       FRANCISCO     3194     1651    0.456
    CREDIT    CARD          2999     2697    0.285
    TOP       BUSINESS      9151     8284    0.163
    SEARCH    ENGINE       14029     2708    0.152
    TIME      JOB          12386     3263    0.128
    LOW       PAY           2936     2828    0.112
    SCHOOL    DISTRICT      4555     1471    0.087
    REVIEW    PAPER         3197     1944    0.078

Figure 5 and Figure 6 summarize the estimation accuracies in terms of the bias and mean square error (MSE). For R̂, bias = E(R̂) − R, and MSE = E(R̂ − R)^2. The definitions for R̂_OPH are analogous. The results confirm that our proposed hash method leads to an unbiased estimator regardless of the data sparsity or the number of bins k. The estimator in the original one permutation hashing paper can be severely biased when k is too large (i.e., when there are many empty bins). The MSEs suggest that the variance of R̂ essentially follows R(1−R)/k, which is the variance of the original minwise hashing estimator (Li & König, 2010), unless k is too large. But even when the MSEs deviate from R(1−R)/k, they are not large, unlike those of R̂_OPH.

This experimental study confirms that our proposed hash method works well for resemblance estimation (and hence it is useful for training resemblance kernel SVM using a linear algorithm). The more exciting application of our work would be approximate near neighbor search.
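Given the raw (un-densified) bin values of two sets, the estimators (9) and (10) reduce to simple counts. A sketch with illustrative names, using None as the empty marker E:

```python
import math

def oph_estimators(oph1, oph2, E=None):
    """Compute the unbiased estimator of Eq. (9) and the normalized
    estimator of Eq. (10) from two one-permutation-hash vectors whose
    entries are bin minima, or E for an empty bin."""
    k = len(oph1)
    n_mat = sum(1 for a, b in zip(oph1, oph2) if a == b and a != E)  # N_mat
    n_emp = sum(1 for a, b in zip(oph1, oph2) if a == E and b == E)  # N_emp
    n1 = sum(1 for a in oph1 if a == E)                              # N_emp^(1)
    n2 = sum(1 for b in oph2 if b == E)                              # N_emp^(2)
    r_ub = n_mat / (k - n_emp)                                       # Eq. (9)
    r_oph = n_mat / (math.sqrt(k - n1) * math.sqrt(k - n2))          # Eq. (10)
    return r_ub, r_oph
```

For example, with bin values [E, 1, E, 2, 0, 1] and [E, 1, E, 0, 0, E] we get N_mat = 2, N_emp = 2, N_emp^(1) = 2, N_emp^(2) = 3, so Eq. (9) gives 2/4 = 0.5 while Eq. (10) gives 2/(√4·√3).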
[Figure 5: twelve panels, one per word pair in Table 1, each plotting the bias against the number of bins k (10^1 to 10^5, log scale) for the two estimators, labeled "Proposed" and "OPH".]

Figure 5. Bias in resemblance estimation. The plots are the biases of the proposed estimator R̂ defined in (8) and the previous estimator R̂_OPH defined in (10) from the original one permutation hashing paper (Li et al., 2012). See Table 1 for a detailed description of the data. It is clear that the proposed estimator R̂ is strictly unbiased regardless of k, the number of bins, which ranges from 2^2 to 2^15.
[Figure 6: twelve panels, one per word pair in Table 1, each plotting the MSE (log scale) against k for "Proposed", "OPH", and the reference curve R(1−R)/k.]

Figure 6. MSE in resemblance estimation. See the caption of Figure 5 for more descriptions. The MSEs of R̂ essentially follow R(1 − R)/k, which is the variance of the original minwise hashing, unless k is too large (i.e., when there are too many empty bins). The previous estimator R̂_OPH estimates R poorly when k is large, as it is not designed for estimating R.
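The unbiasedness of R̂ and its roughly R(1−R)/k variance can be spot-checked on tiny synthetic sets by averaging the estimator over many random permutations. This is our own sketch: all names are illustrative and the sizes (D = 256, k = 16) are arbitrary.

```python
import random

def densified_hashes(s, perm, D, k):
    # One permutation hashing followed by the "rotation" fill of Eq. (5).
    w = D // k
    bins = [None] * k
    for e in s:
        j, v = divmod(perm[e], w)
        if bins[j] is None or v < bins[j]:
            bins[j] = v
    C = w + 1
    out = []
    for j in range(k):
        t = 0
        while bins[(j + t) % k] is None:
            t += 1
        out.append(bins[(j + t) % k] + t * C)
    return out

def estimate_R(s1, s2, D=256, k=16, reps=2000, seed=0):
    """Average of the estimator R-hat of Eq. (8) over many permutations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        perm = list(range(D))
        rng.shuffle(perm)
        h1 = densified_hashes(s1, perm, D, k)
        h2 = densified_hashes(s2, perm, D, k)
        total += sum(a == b for a, b in zip(h1, h2)) / k  # Eq. (8)
    return total / reps

# Two sparse sets with resemblance R = |intersection|/|union| = 5/15 = 1/3;
# the average of R-hat over many permutations should be close to 1/3.
s1, s2 = set(range(10)), set(range(5, 15))
```

Running `estimate_R(s1, s2)` should land close to 1/3; the spread of the individual estimates roughly tracks R(1−R)/k.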
4. Near Neighbor Search

We implement the classical LSH algorithm described in Section 1.4 using the standard procedures (Andoni & Indyk, 2004) with the following choices of hash functions:

• Traditional Minwise Hashing. We use the standard minwise hashing scheme h_i(S) = min(π_i(S)), which uses K × L independent permutations.

• The Proposed Scheme. We use our proposed hashing scheme H given by (5), which uses only one permutation to generate all the K × L hash evaluations.

• Empty Equal Scheme (EE). We use a heuristic way of dealing with empty bins by treating the event of simultaneous empty bins as a match (or hash collision). This can be achieved by assigning a fixed special symbol to all the empty bins. We call this the Empty Equal (EE) scheme. It can be formally defined as:

    h_j^{EE}(S) = OPH_j(π(S))              if OPH_j(π(S)) ≠ E
                  a fixed special symbol   otherwise                     (11)

• Empty Not Equal Scheme (ENE). Alternatively, one can also consider the strategy of treating simultaneous empty bins as a mismatch of hash values, referred to as Empty Not Equal (ENE). ENE can be reasonably achieved by assigning a new random number to each empty bin independently. The random number will ensure, with high probability, that two empty bins do not match. This leads to the following hash function:

    h_j^{ENE}(S) = OPH_j(π(S))       if OPH_j(π(S)) ≠ E
                   rand(new seed)    otherwise                           (12)

Our aim is to compare the performance of minwise hashing with the proposed hash function H. In particular, we would like to evaluate the deviation in performance of H with respect to the performance of minwise hashing. Since H has the same collision probability as that of minwise hashing, we expect them to have similar performance. In addition, we would like to study the performance of the simple strategies (EE and ENE) on real data.

4.1. Datasets

To evaluate the proposed bucketing scheme, we chose the following three publicly available datasets.

• MNIST. Standard dataset of handwritten digit samples. The feature dimension is 784, with an average of around 150 non-zeros. We use the standard partition of MNIST, which consists of 10000 data points in one set and 60000 in the other.

• NEWS20. Collection of newsgroup documents. The feature dimension is 1,355,191, with 500 non-zeros on average. We randomly split the dataset into two halves having around 10000 points in each partition.

• WEBSPAM. Collection of email documents. The feature dimension is 16,609,143, with 4000 non-zeros on average. We randomly selected 70000 data points and generated two partitions of 35000 each.

These datasets cover a wide variety in terms of size, sparsity and task. In the above datasets, one partition (the bigger one in the case of MNIST) was used for creating hash buckets, and the other partition was used as the query set. All datasets were binarized by setting non-zero values to 1.

We perform a rigorous evaluation of these hash functions by comparing their performances over a wide range of choices for the parameters K and L. In particular, we want to understand if varying the parameters K and L has a different effect on the performance of H as compared to minwise hashing. Given parameters K and L, we need K × L hash evaluations per data point. For minwise hashing, we need K × L independent permutations, while for the other three schemes we bin the data into k = K × L bins using only one permutation.

For both WEBSPAM and NEWS20, we implemented all the combinations for K = {6, 8, 10, 12} and L = {4, 8, 16, 32, 64, 128}. For MNIST, with only 784 features, we used K = {6, 8, 10, 12} and L = {4, 8, 16, 32}.

We use two metrics for evaluating retrieval: (i) the fraction of the total number of points retrieved by the bucketing scheme per query; (ii) the recall at a given threshold T0, defined as the ratio of the number of retrieved elements having similarity with the query greater than T0 to the total number of elements having similarity with the query greater than T0. It is important to balance both of them; for instance, in linear scan we retrieve everything, and so we always achieve a perfect recall. For a given choice of K and L, we report both of these metrics independently. For reporting the recall, we choose two values of threshold, T0 = {0.5, 0.8}. Since the implementation involves randomizations, we repeat the experiments for each combination of K and L 10 times, and report the average over these 10 runs.

Figure 7 presents the plots of the fraction of points retrieved per query corresponding to K = 10 for all three datasets with different choices of L. Due to space constraints, here we only show the recall plots for various values of L and T0 corresponding to K = 10, in Figure 8. The plots corresponding to K = {6, 8, 12} are very similar.
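The (K, L)-parameterized bucketing used above follows the classical recipe: concatenate K hash values into one bucket key and repeat for L tables. A minimal sketch, assuming the K·L hash evaluations per point are already computed (by minwise hashing or by H); names are ours:

```python
from collections import defaultdict

def build_tables(hash_matrix, K, L):
    """Classical (K, L)-parameterized LSH bucketing: L tables, each keyed
    by a tuple of K hash values. hash_matrix[i] holds the K*L hash
    evaluations of data point i."""
    tables = [defaultdict(set) for _ in range(L)]
    for i, h in enumerate(hash_matrix):
        for l in range(L):
            key = tuple(h[l * K:(l + 1) * K])  # K concatenated hashes
            tables[l][key].add(i)
    return tables

def query_candidates(tables, h, K):
    """Union of the L probed buckets: the candidate set for a query
    whose K*L hash evaluations are h."""
    out = set()
    for l, table in enumerate(tables):
        out |= table.get(tuple(h[l * K:(l + 1) * K]), set())
    return out
```

Only the candidates returned by `query_candidates` are then re-ranked, which is what makes the fraction-retrieved and recall metrics meaningful.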
[Figure 7: three panels (MNIST, NEWS20, WEBSPAM), each plotting the fraction of points retrieved (log scale) against L, the number of tables, for K = 10, comparing EE, ENE, Minwise, and Proposed.]
Figure 7. Fraction of points retrieved by the bucketing scheme per query corresponding to K = 10 shown over different choices of L.
Results are averaged over 10 independent runs. Plots with K = {6, 8, 12} are very similar in nature.
One can see that the performance of the proposed hash function H is indistinguishable from minwise hashing, irrespective of the sparsity of the data and the choices of K, L and T0. Thus, we conclude that minwise hashing can be replaced by H without loss in performance and with a huge gain in query processing time (see Section 5).

Except for the WEBSPAM dataset, the EE scheme retrieves almost all the points, and so it is not surprising that it achieves a perfect recall even at T0 = 0.5. The EE scheme treats the event of simultaneous empty bins as a match. The probability of this event is high in general for sparse data, especially for large k, and therefore even non-similar points have a significant chance of being retrieved. On the other hand, ENE shows the opposite behavior. Simultaneous empty bins are highly likely even for very similar sparse vectors, but ENE treats this event as a rejection, and therefore we can see that as k = K × L increases, the recall values start decreasing even for the case of T0 = 0.8. WEBSPAM has significantly more nonzeros, so the occurrence of empty bins is rare for small k. Even in WEBSPAM, we observe an undesirable deviation in the performance of EE and ENE with K ≥ 10.

5. Reduction in Computation Cost

5.1. Query Processing Cost Reduction

Let d denote the average number of nonzeros in the dataset. For running a (K, L)-parameterized LSH algorithm, we need to generate K × L hash evaluations of a given query vector (Indyk & Motwani, 1998). With minwise hashing, this requires storing and processing K × L different permutations. The total computation cost for simply processing a query is thus O(dKL).

On the other hand, generating K × L hash evaluations using the proposed hash function H requires only processing a single permutation. It involves evaluating K × L minimums with one permutation hashing and one pass over the K × L bins to reassign empty values via "rotation" (see Figure 4). Overall, the total query processing cost is O(d + KL). This is a massive saving over minwise hashing.

To verify the above claim, we compute the ratio of the time taken by minwise hashing to the time taken by our proposed scheme H for the NEWS20 and WEBSPAM datasets. We randomly choose 1000 points from each dataset. For every vector, we compute 1024 hash evaluations using minwise hashing and the proposed H. The plot of the ratio against the number of hash evaluations is shown in Figure 9. As expected, as the number of hash evaluations increases, minwise hashing becomes linearly more costly than our proposal.

[Figure 9: ratio of the time taken by minwise hashing to the time taken by the proposed scheme, plotted against the number of hash evaluations, for NEWS20 and WEBSPAM.]

Figure 9 does not include the time required to generate permutations. Minwise hashing requires storing K × L random permutations in memory, while our proposed hash function H only needs to store 1 permutation. Each fully random permutation needs a space of size D. Thus, with minwise hashing, storing K × L permutations takes O(KLD) space.

5.2. Reduction in Total Query Time

The total query complexity of the LSH algorithm for retrieving approximate near neighbors using minwise hashing
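The O(dKL) vs. O(d + KL) contrast is easy to reproduce with a toy timing experiment. This is our own sketch, not the benchmark code behind Figure 9; the sizes are arbitrary and the names are illustrative.

```python
import random
import time

def minwise(s, perms):
    # K*L independent permutations: O(d * KL) work per vector.
    return [min(p[e] for e in s) for p in perms]

def oph_rotated(s, perm, k, D):
    # One permutation hashing plus rotation: O(d + k) work per vector.
    w = D // k
    bins = [None] * k
    for e in s:
        j, v = divmod(perm[e], w)
        if bins[j] is None or v < bins[j]:
            bins[j] = v
    C = w + 1
    res = []
    for j in range(k):
        t = 0
        while bins[(j + t) % k] is None:
            t += 1
        res.append(bins[(j + t) % k] + t * C)
    return res

# Toy sizes: D features, k = K*L hash evaluations, d nonzeros per vector.
D, k, d = 4096, 1024, 500
rng = random.Random(0)
s = rng.sample(range(D), d)
perms = []
for _ in range(k):
    p = list(range(D))
    rng.shuffle(p)
    perms.append(p)

t0 = time.perf_counter(); minwise(s, perms); t1 = time.perf_counter()
t2 = time.perf_counter(); oph_rotated(s, perms[0], k, D); t3 = time.perf_counter()
# The ratio (t1 - t0) / (t3 - t2) grows with k, mirroring Figure 9's trend.
```

Both calls return k hash values, but the minwise variant touches every nonzero once per permutation, while the one-permutation variant touches the data once in total.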
[Figure 8: six recall panels for K = 10, with T0 = 0.5 (top row) and T0 = 0.8 (bottom row) on MNIST, NEWS20, and WEBSPAM, each plotting recall against L, the number of tables, for EE, ENE, Minwise, and Proposed.]
is the sum of the query processing cost and the cost incurred in evaluating the retrieved candidates. The total running cost of the LSH algorithm is dominated by the query processing cost (Indyk & Motwani, 1998; Dasgupta et al., 2011). In theory, the number of data points retrieved using the appropriate choice of LSH parameters (K, L) is, in the worst case, O(L) (Indyk & Motwani, 1998). The total cost of evaluating the retrieved candidates in brute force fashion is O(dL). Thus, the total query time of LSH based retrieval with minwise hashing is O(dKL + dL) = O(dKL). Since both schemes behave very similarly for any given (K, L), the total query time with the proposed scheme is O(KL + d + dL), which comes down to O(KL + dL). This analysis ignores the effect of correlation in the LSH algorithm, but we can see from the plots that the effect is negligible.

The need for evaluating all the retrieved data points based on exact similarity leads to the O(dL) cost. In practice, an efficient estimate of similarity can be obtained by using the same hash evaluations used for indexing (Shrivastava & Li, 2012). This decreases the re-ranking cost further.

5.3. Pre-processing Cost Reduction

The LSH algorithm needs a costly pre-processing step, which is the bucket assignment for all N points in the collection. The bucket assignment takes in total O(dN KL) with minwise hashing. The proposed hashing scheme H reduces this cost to O(N KL + N d).

6. Conclusion

With the explosion of data in modern applications, the brute force strategy of scanning all data points when searching for near neighbors is prohibitively expensive, especially in user-facing applications like search. LSH is popular for achieving sub-linear near neighbor search. Minwise hashing, which belongs to the LSH family, is one of the basic primitives in many "big data" processing algorithms. In addition to near-neighbor search, these techniques are frequently used in popular operations like near-duplicate detection (Broder, 1997; Manku et al., 2007; Henzinger, 2006), all-pair similarity (Bayardo et al., 2007), similarity join/record-linkage (Koudas & Srivastava, 2005), temporal correlation (Chien & Immorlica, 2005), etc.

Current large-scale data processing systems deploying indexed tables based on minwise hashing suffer from the problem of costly processing. In this paper, we propose a new hashing scheme based on a novel "rotation" technique to densify one permutation hashes (Li et al., 2012). The obtained hash function possesses the properties of minwise hashing, and at the same time it is very fast to compute.

Our experimental evaluations on real data suggest that the proposed hash function offers massive savings in computation cost compared to minwise hashing. These savings come without any trade-offs in the performance measures.

Acknowledgement

The work is supported by NSF-III-1360971, NSF-Bigdata-1419210, ONR-N00014-13-1-0764, and AFOSR-FA9550-13-1-0137. We thank the constructive review comments from SIGKDD 2013, NIPS 2013, and ICML 2014, for improving the quality of the paper. The sample matlab code for our proposed hash function is available upon request.
Fan, Rong-En, Chang, Kai-Wei, Hsieh, Cho-Jui, Wang, Xiang-Rui, and Lin, Chih-Jen. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.

Fetterly, Dennis, Manasse, Mark, Najork, Marc, and Wiener, Janet L. A large-scale study of the evolution of web pages. In WWW, pp. 669–678, Budapest, Hungary, 2003.

Friedman, Jerome H., Baskett, F., and Shustek, L. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, 24:1000–1006, 1975.

Henzinger, Monika Rauch. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pp. 284–291, 2006.

Indyk, Piotr and Motwani, Rajeev. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pp. 604–613, Dallas, TX, 1998.

Joachims, Thorsten. Training linear svms in linear time. In KDD, pp. 217–226, Pittsburgh, PA, 2006.

Shalev-Shwartz, Shai, Singer, Yoram, and Srebro, Nathan. Pegasos: Primal estimated sub-gradient solver for svm. In ICML, pp. 807–814, Corvalis, Oregon, 2007.

Shrivastava, Anshumali and Li, Ping. Fast near neighbor search in high-dimensional binary data. In ECML, 2012.

Shrivastava, Anshumali and Li, Ping. Beyond pairwise: Provably fast algorithms for approximate k-way similarity search. In NIPS, Lake Tahoe, NV, 2013.

Tong, Simon. Lessons learned developing a practical large scale machine learning system. http://googleresearch.blogspot.com/2010/04/lessons-learned-developing-practical.html, 2008.

Weinberger, Kilian, Dasgupta, Anirban, Langford, John, Smola, Alex, and Attenberg, Josh. Feature hashing for large scale multitask learning. In ICML, pp. 1113–1120, 2009.
Improved Densification of One Permutation Hashing
• Our detailed variance analysis of the hashes obtained from the existing densification scheme [24] reveals that there is not enough randomness in that procedure, which leads to high variance on very sparse datasets.

• We provide a new densification scheme for one permutation hashing with provably smaller variance than the scheme in [24]. The improvement becomes more significant for very sparse datasets, which are common in practice. The improved scheme retains the computational complexity of O(d + k) for computing k different hash evaluations of a given vector.

• We provide experimental evidence on publicly available datasets which demonstrates the superiority of the improved densification procedure over the existing scheme, in the task of resemblance estimation as well as the task of near neighbor retrieval with LSH.

2 Background

2.1 One Permutation Hashing

As illustrated in Figure 1, instead of conducting k independent permutations, one permutation hashing [18] uses only one permutation and partitions the (permuted) feature space into k bins. In other words, a single permutation π is used to first shuffle the given binary vector, and then the shuffled vector is binned into k evenly spaced bins.

We assume for the rest of the paper that D is divisible by k; otherwise we can always pad extra dummy features. We define OPH_j ("OPH" for one permutation hashing) as

    OPH_j(π(S)) = E                   if π(S) ∩ [Dj/k, D(j+1)/k) = ∅
                  M_j(π(S)) mod D/k   otherwise                          (5)

i.e., OPH_j(π(S)) denotes the minimum value in Bin j under the permutation mapping π, as shown in the example in Figure 1. If this intersection is null, i.e., π(S) ∩ [Dj/k, D(j+1)/k) = ∅, then OPH_j(π(S)) = E.

Consider the events of "simultaneously empty bin", I_emp^j = 1, and "simultaneously non-empty bin", I_emp^j = 0, between given vectors S1 and S2, defined as:

    I_emp^j = 1   if OPH_j(π(S1)) = OPH_j(π(S2)) = E
              0   otherwise                                              (6)

Simultaneously empty bins are only defined with respect to two sets S1 and S2. In Figure 1, I_emp^0 = 1 and I_emp^2 = 1, while I_emp^1 = I_emp^3 = I_emp^4 = I_emp^5 = 0. Bin 5 is only empty for S2 and not for S1, so I_emp^5 = 0.

Given a bin number j, if it is not simultaneously empty (I_emp^j = 0) for both the vectors S1 and S2, [18] showed that

    Pr ( OPH_j(π(S1)) = OPH_j(π(S2)) | I_emp^j = 0 ) = R                 (7)

On the other hand, when I_emp^j = 1, no such guarantee exists: in that case a collision does not carry enough information about the similarity R. Since the event I_emp^j = 1 can only be determined given the two vectors S1 and S2 and the materialization of π, one permutation hashing cannot be directly used for indexing, especially when the data are very sparse. In particular, OPH_j(π(S)) does not lead to a valid LSH hash function because of the coupled event I_emp^j = 1 in (7). The simple strategy of ignoring empty bins leads to biased estimators of resemblance and shows poor performance [24]. For this same reason, one permutation hashing cannot be directly used to extract random features for linear learning with the resemblance kernel.

2.2 Densifying One Permutation Hashing for Indexing and Linear Learning

[24] proposed a "rotation" scheme that assigns new values to all the empty bins, generated from one permutation hashing, in an unbiased fashion. The rotation scheme for filling the empty bins from Figure 1 is shown in Figure 2. The idea is that every empty bin borrows the value of the closest non-empty bin in the clockwise direction (circular right hand side), added with an offset C. The new values of simultaneously empty bins (I_emp^j = 1), at any location j for S1 and S2, never match if their new values come from different bin numbers.

Formally, the hashing scheme with "rotation", denoted by H, is defined as:

    H_j(S) = OPH_j(π(S))                      if OPH_j(π(S)) ≠ E
             OPH_{(j+t) mod k}(π(S)) + tC     otherwise                  (8)

    t = min z,  s.t.  OPH_{(j+z) mod k}(π(S)) ≠ E                        (9)

Here C = D/k + 1 is a constant.

             Bin 0   Bin 1   Bin 2   Bin 3   Bin 4   Bin 5
    H(S1)    1+C     1       2+C     2       0       1
    H(S2)    1+C     1       0+C     0       0       1+2C

Figure 2: Densified values of the bins from Figure 1 under the rotation scheme.

This densification scheme ensures that whenever I_emp^j = 1, i.e., Bin j is simultaneously empty for the two vectors S1 and S2 under consideration, the newly assigned value mimics the collision probability of the nearest simultaneously non-empty bin towards the right (circular) hand side, making the final collision probability equal to R, irrespective of whether I_emp^j = 0 or I_emp^j = 1. [24] proved this fact as a theorem.

Theorem 1 [24]

    Pr (H_j(S1) = H_j(S2)) = R                                           (10)

Theorem 1 implies that H satisfies the LSH property and hence is suitable for indexing-based sublinear similarity search. Generating KL different hash values of H only requires O(d + KL), which saves a factor of d in the query processing cost compared to the cost of O(dKL) with traditional minwise hashing. For fast linear learning [19] with k different hash values, the new scheme only needs O(d + k) testing (or prediction) time, compared to standard b-bit minwise hashing, which requires O(dk) time for testing.
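The Figure 2 values can be reproduced mechanically from the bin minima of Figure 1, OPH(S1) = [E, 1, E, 2, 0, 1] and OPH(S2) = [E, 1, E, 0, 0, E]. A sketch of the rotation rule (8)-(9), with illustrative names and the offset C kept symbolic for readability:

```python
E = None  # marker for an empty bin

def rotate_densify(oph):
    """Eqs. (8)-(9): an empty Bin j takes the value of the first
    non-empty bin on the circular right, plus t*C, where t is the
    distance traveled; C is kept as the symbol 'C' in the output."""
    k = len(oph)
    out = []
    for j in range(k):
        t = 0
        while oph[(j + t) % k] is E:
            t += 1
        v = oph[(j + t) % k]
        if t == 0:
            out.append(str(v))
        elif t == 1:
            out.append(f"{v}+C")
        else:
            out.append(f"{v}+{t}C")
    return out
```

Applied to the two vectors above, this reproduces exactly the H(S1) and H(S2) rows of Figure 2, including the 1+2C entry for Bin 5 of S2, whose nearest non-empty bin on the circular right is Bin 1 at distance 2.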
where R̃ = a−1 Combining the expressions from the above 3 Lemmas and
f 1+f 2−a−1 . Using these new events, we have
Eq.(19), we can compute f (m). Taking a further expec-
1 ∑[ E ]
k−1
tation over values of m to remove the conditional depen-
R̂ = Mj + MjN (16) dency, the variance of R̂ can be shown in the next Theorem.
k j=0
1
∑[
1
]
1
k−1
E (MjN )2 + (MjE )2 m = kR (19) Figure 3: Illustration of the existing densification
j=0 scheme [24]. The 3 boxes indicate 3 simultaneously non-
empty bins. Any simultaneously empty bin has 4 possi-
The values of the remaining three terms are given by the
ble positions shown by blank spaces. Arrow indicates the
following 3 Lemmas; See the proofs in the Appendix.
choice of simultaneous non-empty bins picked by simul-
Lemma 1 taneously empty bins occurring in the corresponding posi-
tions. A simultaneously empty bin occurring in position 3
∑
uses the information from Bin c. The randomness is in the
E MiN MjN m = m(m − 1)RR̃ (20)
position number of these bins which depends on π.
i̸=j
On the other hand, under the given scheme, the probability Bin 0 Bin 1 Bin 2 Bin 3 Bin 4 Bin 5
i
that two simultaneously empty bins, i and j, (i.e., Iemp =
j Direction 0 1 0 0 1 1
1, Iemp = 1), both pick the same simultaneous non-empty Bits (q)
t
Bin t (Iemp = 0) is given by (see proof of Lemma 3) H+(S1) 1+C 1 1+C 2 0 1
H+(S2) 0+2C 1 1+C 0 0 1+2C
2
p= (24)
m+1
The value of p is high because there is no enough random-
ness in the selection procedure. Since R ≤ 1 and R ≤ RR̃,
Figure 5: Assigned values (in red) of empty bins from Fig-
if we can reduce this probability p then we reduce the value
ure 1 using the improved densification procedure. Every
of [pR + (1 − p)RR̃]. [This directly reduces ] the value empty Bin i uses the value of the closest non-empty bin,
of (k − m)(k − m − 1) m+1 2R
+ (m−1)R
m+1
R̃
as given by towards circular left or circular right depending on the ran-
Lemma 3. The reduction scales with Nemp . dom direction bit qi , with offset C.
For every simultaneously empty bin, the current scheme For S1 , we have q0 = 0 for empty Bin 0, we therefore
uses the information of the closest non-empty bin in the move circular left and borrow value from Bin 5 with offset
right. Because of the symmetry in the arguments, changing C making the final value 1 + C. Similarly for empty Bin 2
the direction to left instead of right also leads to a valid we have q2 = 0 and we use the value of Bin 1 (circular left)
densification scheme with exactly same variance. This added with C. For S2 and Bin 0, we have q0 = 0 and the
is where we can infuse randomness without violating the next circular left bin is Bin 5 which is empty so we continue
alignment necessary for unbiased densification. We show and borrow value from Bin 4, which is 0, with offset 2C. It
that randomly switching between left and right provably is a factor of 2 because we traveled 2 bins to locate the first
improves (reduces) the variance by making the sampling non-empty bin. For Bin 2, again q2 = 0 and the closest
procedure of simultaneously non-empty bins more random. circular left non-empty bin is Bin 1, at distance 1, so the
new value of Bin 2 for S2 is 1 + C. For Bin 5, q5 = 1, so
5 The Improved Densification Scheme
we go circular right and find non-empty Bin 1 at distance
2. The new hash value of Bin 5 is therefore 1 + 2C. Note
a b c that the non-empty bins remain unchanged.
÷ LÙ ÷ LÙ ÷ LÙ Formally, let qj j = {0, 1, 2, ..., k − 1} be k i.i.d. Bernoulli
y LÚz z
y LÚ y LÙz random variables such that qj = 1 with probability 12 . The
improved hash function H+ is given by
1 2 3 4
1 1 1
OP H (π(S)) + t1 C
(j−t1 )mod k
1 1
1
value of the closest non-empty bin from the right (circular), (25)
where The theoretical variance of the new estimator R̂+ is given
by the following Theorem 4.
t1 = min z, s.t. OP H (π(S)) ̸= E (26)
(j−z) mod k
Theorem 4
t2 = min z, s.t. OP H (π(S)) ̸= E (27)
(j+z) mod k
R R RR̃
D V ar(R̂+ ) = + A+ 2 + B + 2 − R2 (33)
with same C = + 1. Computing k hash evaluations with
k [ k k k]
H+ requires evaluating π(S) followed by two passes over Nemp (4k − Nemp + 1)
A+ = E
the k bins from different directions. The total complexity 2(k − Nemp + 1)
of computing k hash evaluations is again O(d + k) which [ ]
3 2
2k + Nemp − Nemp (2k 2 + 2k + 1) − 2k
is the same as that of the existing densification scheme. We B =E
+
1∑
k−1
More precisely,
R̂+ = 1{Hj+ (S1 ) = Hj+ (S2 )}. (29)
k j=0
V ar(R̂) − V ar(R̂+ )
[ ]
(Nemp )(Nemp − 1)
6 Variance Analysis of Improved Scheme =E [R − RR̃] (35)
2k 2 (k − Nemp + 1)
When m = 1 (an event with probability (1/k)^{f1+f2−a} ≈ 0), i.e., only one simultaneously non-empty bin, both schemes are exactly the same. For simplicity of expressions, we will assume that the number of simultaneously non-empty bins is strictly greater than 1, i.e., m > 1. The general case has an extra term for m = 1, which makes the expression unnecessarily complicated without changing the final conclusion.

Following the notation of Sec. 3, we denote

    M_j^{N+} = 1{I_emp^j = 0 and H+_j(S1) = H+_j(S2)}    (30)

    M_j^{E+} = 1{I_emp^j = 1 and H+_j(S1) = H+_j(S2)}    (31)

The two expectations E[ Σ_{i≠j} M_i^{N+} M_j^{N+} | m ] and E[ Σ_{i≠j} M_i^{N+} M_j^{E+} | m ] are the same as given by Lemma 1 and Lemma 2 respectively, as all the arguments used to prove them still hold for the new scheme. The only change is in the term E[ Σ_{i≠j} M_i^{E+} M_j^{E+} | m ].

Lemma 4

    E[ Σ_{i≠j} M_i^{E+} M_j^{E+} | m ] = (k − m)(k − m − 1) × [ 3R/(2(m + 1)) + (2m − 1)RR̃/(2(m + 1)) ]    (32)

The probability of simultaneously empty bins increases with increasing sparsity of the dataset and with the total number of bins k. We can see from Theorem 5 that with more simultaneously empty bins, i.e., higher Nemp, the gain of the improved scheme H+ over H is larger. Hence, H+ should be significantly better than the existing scheme for very sparse datasets or in scenarios where we need a large number of hash values.

7 Evaluations

Our first experiment concerns the validation of the theoretical variances of the two densification schemes. The second experiment focuses on comparing the two schemes in the context of near neighbor search with LSH.

7.1 Comparisons of Mean Square Errors

We empirically verify the theoretical variances of R̂ and R̂+ and their effects in many practical scenarios. To achieve this, we extracted 12 pairs of words (which cover a wide spectrum of sparsity and similarity) from a web-crawl dataset which consists of word representations from 2^16 documents. Every word is represented as a binary vector (or set) of D = 2^16 dimensions, with a feature value of 1 indicating the presence of that word in the corresponding document. See Table 1 for detailed information on the data. For all 12 pairs of words, we estimate the resemblance using the two estimators R̂ and R̂+. We plot the empirical MSE along with the theoretical values in Figure 6.
[Figure 6: twelve log-log panels of MSE versus k (number of hashes), one per word pair (HONG-KONG, RIGHTS-RESERVED, A-THE, UNITED-STATES, TOGO-GREENLAND, ANTILLES-ALBANIA, CREDIT-CARD, COSTA-RICO, LOW-PAY, VIRUSES-ANTIVIRUS, REVIEW-PAPER, FUNNIEST-ADDICT), each showing the curves Old, Old-Theo, Imp, and Imp-Theo.]
Figure 6: Mean Square Error (MSE) of the old scheme R̂ and the improved scheme R̂+ along with their theoretical values
on 12 word pairs (Table 1) from a web crawl dataset.
As we know from our analysis, the first three terms inside the expectation on the RHS of the above equation behave similarly for both H+ and H. The fourth term E[ M_{j1}^E M_{j2}^E ] is likely to be smaller in the case of H+ because of the smaller values of p. We therefore see that H retrieves more points than necessary as compared to H+. The difference is visible ... are common in practice.

[Figure 7: four panels plotting points retrieved per query and mean recall of the top 10 near neighbors against L (number of tables), for RCV1 (K = 3) and URL (K = 15), for both schemes.]

Figure 7: Average number of points scanned per query and the mean recall values of top 10 near neighbors, obtained from the (K, L)-parameterized LSH algorithm, using H (old) and H+ (Imp). Both schemes achieve the same recall but H+ reports fewer points compared to H. Results are averaged over 10 independent runs.
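The (K, L)-parameterized LSH algorithm referred to in the caption follows the standard recipe: concatenate K hash values into a bucket key, repeat over L independent tables, and report the union of the query's buckets. A minimal sketch with hypothetical hash vectors (`build_tables` and `query` are our names; in the experiments the K·L values per point would come from H or H+):

```python
def build_tables(hashes, K, L):
    """hashes: dict point-id -> list of K*L hash values.
    Returns L dictionaries, each keyed by a tuple of K values."""
    tables = [{} for _ in range(L)]
    for pid, h in hashes.items():
        for l in range(L):
            key = tuple(h[l * K:(l + 1) * K])
            tables[l].setdefault(key, []).append(pid)
    return tables

def query(tables, h, K):
    """Candidates: union of the buckets the query key falls into."""
    out = set()
    for l, table in enumerate(tables):
        out.update(table.get(tuple(h[l * K:(l + 1) * K]), []))
    return out

hashes = {0: [1, 2, 3, 4, 5, 6],   # shares its first K-block with point 1
          1: [1, 2, 3, 9, 9, 9],
          2: [7, 7, 7, 7, 7, 7]}
tables = build_tables(hashes, K=3, L=2)
print(query(tables, hashes[1], K=3))  # {0, 1}
```

Increasing L raises recall (more chances to collide) while larger K makes each bucket more selective, which is the trade-off swept along the x-axes of Figure 7.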
It is clear from Figure 7 that the improved hashing scheme H+ achieves the same recall while retrieving fewer points than the old scheme H. To achieve 90% recall on the URL dataset, the old scheme retrieves around 3300 points per query on average while the improved scheme only needs to check around 2700 points per query. For the RCV1 dataset, with L = 200, the old scheme retrieves around 3000 points and achieves a recall ...

8 Conclusion

Analysis of the densification scheme for one permutation hashing, which reduces the processing time of minwise hashes, reveals a sub-optimality in the existing procedure. We provide a simple improved procedure which adds more randomness to the current densification technique, leading to a provably better scheme, especially for very sparse datasets. The improvement comes without any compromise in computation and only requires O(d + k) (linear) cost for generating k hash evaluations. We hope that our improved scheme will be adopted in practice.

Acknowledgement

Anshumali Shrivastava is a Ph.D. student partially supported by NSF (DMS0808864, III1249316) and ONR (N00014-13-1-0764). The work of Ping Li is partially supported by AFOSR (FA9550-13-1-0137), ONR (N00014-13-1-0764), and NSF (III1360971, BIGDATA1419210).
A Proofs

For the analysis, it is sufficient to consider the configurations of empty and non-empty bins arising after throwing |S1 ∪ S2| balls uniformly into k bins, with exactly m non-empty bins and k − m empty bins. Under uniform throwing of balls, any ordering of m non-empty and k − m empty bins is equally likely. The proofs involve elementary combinatorial arguments of counting configurations.

A.1 Proof of Lemma 1

Given exactly m simultaneously non-empty bins, any two of them can be chosen in m(m − 1) ways (with ordering of i and j). Each term M_i^N M_j^N, for both simultaneously non-empty i and j, is 1 with probability RR̃ (note that E( M_i^N M_j^N | i ≠ j, I_emp^i = 0, I_emp^j = 0 ) = RR̃).

A.2 Proof of Lemma 2

The permutation is random, and any sequence of m simultaneously non-empty and k − m remaining empty bins is equally likely: while randomly throwing |S1 ∪ S2| balls into k bins with exactly m non-empty bins, every sequence of simultaneously empty and non-empty bins has equal probability. Given m, there are in total 2m(k − m) different pairs of empty and non-empty bins (including the ordering). Now, for every simultaneously empty bin j, i.e., I_emp^j = 1, M_j^E replicates the M_t^N corresponding to the nearest non-empty Bin t towards the circular right. There are two cases we need to consider:

Case 1: t = i, which has probability 1/m, and

    E( M_i^N M_j^E | I_emp^i = 0, I_emp^j = 1 ) = E( M_i^N | I_emp^i = 0 ) = R

Case 2: t ≠ i, which has probability (m − 1)/m, and

    E( M_i^N M_j^E | I_emp^i = 0, I_emp^j = 1 ) = E( M_i^N M_t^N | t ≠ i, I_emp^i = 0, I_emp^t = 0 ) = RR̃

Thus, the value of E[ Σ_{i≠j} M_i^N M_j^E | m ] comes out to be

    2m(k − m) × [ R/m + (m − 1)RR̃/m ]

A.3 Proof of Lemma 3

Let p be the probability that two simultaneously empty bins i and j have the same closest non-empty bin on the right (circular). Then E[ Σ_{i≠j} M_i^E M_j^E | m ] is given by

    (k − m)(k − m − 1) × [ pR + (1 − p)RR̃ ]    (37)

because with probability (1 − p) the pair uses estimators from different simultaneously non-empty bins, and in that case M_i^E M_j^E = 1 with probability RR̃.

Consider Figure 3, where we have 3 simultaneously non-empty bins, i.e., m = 3 (shown by colored boxes). Given any two simultaneously empty bins, Bin i and Bin j (out of the total k − m), they will occupy any of the m + 1 = 4 blank positions. The arrow shows the chosen non-empty bins for filling the empty bins. There are (m + 1)² + (m + 1) = (m + 1)(m + 2) different ways of fitting two simultaneously empty bins i and j between the m non-empty bins. Note that if both i and j go to the same blank position they can be permuted, which adds the extra term (m + 1).

If both i and j choose the same blank space, or the first and the last blank space, then both simultaneously empty bins, Bin i and Bin j, correspond to the same non-empty bin. The number of ways in which this happens is 2(m + 1) + 2 = 2(m + 2). So we have

    p = 2(m + 2) / ((m + 1)(m + 2)) = 2/(m + 1).

Substituting p in Eq. (37) leads to the desired expression.
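The value p = 2/(m + 1) is easy to sanity-check numerically. The following Monte Carlo sketch (our illustration, not part of the paper) places m non-empty bins uniformly among k circular bins, picks two empty bins at random, and tests whether they share the same closest non-empty bin on the circular right:

```python
import random

def same_source_prob(k, m, trials, rng):
    """Estimate P(two random simultaneously empty bins have the same
    closest non-empty bin on the circular right)."""
    def closest_right(b, nonempty):
        t = 1
        while (b + t) % k not in nonempty:
            t += 1
        return (b + t) % k
    hits = 0
    for _ in range(trials):
        nonempty = set(rng.sample(range(k), m))      # uniform arrangement
        empties = [b for b in range(k) if b not in nonempty]
        i, j = rng.sample(empties, 2)                # a random empty pair
        hits += closest_right(i, nonempty) == closest_right(j, nonempty)
    return hits / trials

rng = random.Random(0)
k, m = 20, 5
print(same_source_prob(k, m, 20000, rng), "vs", 2 / (m + 1))
```

The estimate lands close to 2/(m + 1) = 1/3, matching the counting argument; an analogous simulation with the coin flips of the improved scheme would check the 1/2 and 1/4 cases used below in the proof of Lemma 4.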
A.4 Proof of Lemma 4

Similar to the proof of Lemma 3, we need to compute p, the probability that two simultaneously empty bins, Bin i and Bin j, use information from the same bin. As argued before, the total number of positions for any two simultaneously empty bins i and j, given m simultaneously non-empty bins, is (m + 1)(m + 2). Consider Figure 4: under the improved scheme, if both Bin i and Bin j choose the same blank position, then they choose the same simultaneously non-empty bin with probability 1/2. If Bin i and Bin j choose consecutive positions (e.g., position 2 and position 3), then they choose the same simultaneously non-empty bin (Bin b) with probability 1/4. There are several boundary cases to consider too. Accumulating the terms leads to