
MinHash

In computer science and data mining, MinHash (or the min-wise independent permutations locality
sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was
invented by Andrei Broder (1997),[1] and initially used in the AltaVista search engine to detect duplicate
web pages and eliminate them from search results.[2] It has also been applied in large-scale clustering
problems, such as clustering documents by the similarity of their sets of words.[1]

Jaccard similarity and minimum hash values


The Jaccard similarity coefficient is a commonly used indicator of the similarity between two sets. Let U be a set and A and B be subsets of U; the Jaccard index is then defined as the ratio of the number of elements in their intersection to the number of elements in their union:

J(A,B) = |A ∩ B| / |A ∪ B|.
This value is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0 and 1
otherwise. Two sets are more similar (i.e. have relatively more members in common) when their Jaccard
index is closer to 1. The goal of MinHash is to estimate J(A,B) quickly, without explicitly computing the
intersection and union.
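As a quick sanity check, the definition can be computed directly on two small sets (a sketch; the word sets here are made up for illustration):

```python
# Two small word sets; the Jaccard index is |A ∩ B| / |A ∪ B|.
A = {"the", "quick", "brown", "fox"}
B = {"the", "lazy", "brown", "dog"}

intersection = A & B            # {"the", "brown"}
union = A | B                   # 6 distinct words
j = len(intersection) / len(union)
print(j)                        # 2 / 6 ≈ 0.333
```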

Let h be a hash function that maps the members of U to distinct integers, let perm be a random permutation of the elements of the set U, and for any subset S of U define h_min(S) to be the minimal member of S with respect to h ∘ perm, that is, the member x of S with the minimum value of h(perm(x)). (In cases where the hash function used is assumed to have pseudo-random properties, the random permutation is not used.)

Now, applying h_min to both A and B, and assuming no hash collisions, the two values are equal (h_min(A) = h_min(B)) if and only if, among all elements of A ∪ B, the element with the minimum hash value lies in the intersection A ∩ B. The probability of this event is exactly the Jaccard index, therefore:

Pr[ h_min(A) = h_min(B) ] = J(A,B).

That is, the probability that h_min(A) = h_min(B) equals the similarity J(A,B), assuming perm is drawn uniformly at random. In other words, if r is the random variable that is one when h_min(A) = h_min(B) and zero otherwise, then r is an unbiased estimator of J(A,B). However, r has too high a variance to be a useful estimator of the Jaccard similarity on its own, because it is always zero or one. The idea of the MinHash scheme is to reduce this variance by averaging together several variables constructed in the same way.

Algorithm

Variant with many hash functions

The simplest version of the MinHash scheme uses k different hash functions, where k is a fixed integer parameter, and represents each set S by the k values of h_min(S) for these k functions.

To estimate J(A,B) using this version of the scheme, let y be the number of hash functions for which h_min(A) = h_min(B), and use y/k as the estimate. This estimate is the average of k different 0–1 random variables, each of which is one when h_min(A) = h_min(B) and zero otherwise, and each of which is an unbiased estimator of J(A,B). Therefore, their average is also an unbiased estimator, and by standard bounds on the deviation of sums of 0–1 random variables, its expected error is O(1/√k).[3]

Therefore, for any constant ε > 0 there is a constant k = O(1/ε²) such that the expected error of the estimate is at most ε. For example, 400 hashes would be required to estimate J(A,B) with an expected error of at most 0.05.
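As a sketch of this variant, the following uses k random affine functions modulo a large prime as the hash family; the family, the parameter choices, and the example sets are illustrative assumptions, not details from the article:

```python
import random

PRIME = 2_147_483_647  # a large prime; illustrative choice

def make_hashes(k, seed=0):
    """Draw k affine hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(k)]

def signature(s, hashes):
    """The k-value MinHash signature: the minimum of each hash over the set."""
    return [min((a * x + b) % PRIME for x in s) for a, b in hashes]

def estimate_jaccard(sig_a, sig_b):
    """y/k: the fraction of hash functions whose minima agree."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)

A = set(range(0, 60))
B = set(range(30, 90))  # true J(A,B) = 30/90 = 1/3
hashes = make_hashes(k=400)
est = estimate_jaccard(signature(A, hashes), signature(B, hashes))
# est should fall within roughly 1/sqrt(400) = 0.05 of 1/3
```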

Variant with a single hash function

It may be computationally expensive to compute multiple hash functions, but a related version of the MinHash scheme avoids this penalty by using only a single hash function, which it uses to select multiple values from each set rather than a single minimum value per hash function. Let h be a hash function, and let k be a fixed integer. If S is any set of k or more values in the domain of h, define h_(k)(S) to be the subset consisting of the k members of S that have the smallest values of h. This subset h_(k)(S) is used as a signature for the set S, and the similarity of any two sets is estimated by comparing their signatures.

Specifically, let A and B be any two sets. Then X = h_(k)(h_(k)(A) ∪ h_(k)(B)) = h_(k)(A ∪ B) is a set of k elements of A ∪ B, and if h is a random function then any subset of k elements is equally likely to be chosen; that is, X is a simple random sample of A ∪ B. The subset Y = X ∩ h_(k)(A) ∩ h_(k)(B) is the set of members of X that belong to the intersection A ∩ B. Therefore, |Y|/k is an unbiased estimator of J(A,B). The difference between this estimator and the one produced by multiple hash functions is that X always has exactly k members, whereas the multiple hash functions may yield fewer sampled elements, since two different hash functions may have the same minimum. However, when k is small relative to the sizes of the sets, this difference is negligible.

By standard Chernoff bounds for sampling without replacement, this estimator has expected error
O(1/√k), matching the performance of the multiple-hash-function scheme.
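The single-hash variant can be sketched with one fixed hash function and bottom-k signatures; the SHA-1-based hash and the example sets below are illustrative assumptions:

```python
import hashlib

def h(x):
    """One fixed hash function: SHA-1 of the item, truncated to 64 bits."""
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")

def bottom_k(s, k):
    """h_(k)(S): the k members of S with the smallest hash values."""
    return set(sorted(s, key=h)[:k])

def estimate_jaccard(sig_a, sig_b, k):
    """|Y|/k, where X = h_(k)(sig_a ∪ sig_b) and Y = X ∩ sig_a ∩ sig_b."""
    x = bottom_k(sig_a | sig_b, k)  # equals h_(k)(A ∪ B)
    y = x & sig_a & sig_b
    return len(y) / k

A = set(range(0, 600))
B = set(range(200, 800))  # true J(A,B) = 400/800 = 1/2
k = 200
est = estimate_jaccard(bottom_k(A, k), bottom_k(B, k), k)
```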

Time analysis

The estimator |Y|/k can be computed in time O(k) from the two signatures of the given sets, in either variant of the scheme. Therefore, when ε and k are constants, the time to compute the estimated similarity from the signatures is also constant. The signature of each set can be computed in time linear in the size of the set, so when many pairwise similarities need to be estimated this method can lead to substantial savings in running time compared with a full comparison of the members of each set. Specifically, for set size n the many-hash variant takes O(nk) time. The single-hash variant is generally faster, requiring O(n) time to maintain the queue of minimum hash values, assuming n ≫ k.[1]
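The queue of minimum hash values can be maintained over a stream with a max-heap of the k smallest hashes seen so far, which is O(n log k) overall (close to the O(n) claim when k is small). The toy hash below is an illustrative assumption, and the stream elements are assumed distinct:

```python
import heapq

def bottom_k_stream(stream, k, h):
    """Keep the k smallest hash values using a max-heap of negated values."""
    heap = []  # stores -hash, so -heap[0] is the current maximum kept value
    for x in stream:
        v = h(x)
        if len(heap) < k:
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            heapq.heapreplace(heap, -v)  # evict the current maximum
    return sorted(-u for u in heap)

toy_hash = lambda x: (37 * x) % 101  # illustrative toy hash, injective on 0..100
sig = bottom_k_stream(range(50), 5, toy_hash)
```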

Incorporating weights
A variety of techniques to introduce weights into the computation of MinHashes have been developed. The simplest extends the scheme to integer weights.[4] Extend the hash function h to accept both a set member and an integer, then generate multiple hashes for each item according to its weight: if item i has weight n, generate the hashes h(i,1), h(i,2), ..., h(i,n). Running the original algorithm on this expanded set of hashes yields the weighted Jaccard index as the collision probability.
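Equivalently to hashing (i,1) through (i,n), one can expand each weighted item into distinct pairs and run ordinary MinHash on the expanded sets; the example weights here are made up for illustration:

```python
def expand(weighted):
    """Replace each item of integer weight n by the pairs (item, 1) ... (item, n)."""
    return {(item, j) for item, n in weighted.items() for j in range(1, n + 1)}

A = {"x": 3, "y": 1}
B = {"x": 1, "y": 2}
EA, EB = expand(A), expand(B)

# Plain Jaccard on the expanded sets equals the weighted Jaccard index:
# sum(min(wA, wB)) / sum(max(wA, wB)) = (1 + 1) / (3 + 2) = 2/5
exact = len(EA & EB) / len(EA | EB)
print(exact)  # 0.4
```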

Further extensions that achieve this collision probability on real weights with better runtime have been
developed, one for dense data,[5] and another for sparse data.[6]

Another family of extensions uses exponentially distributed hashes. A uniformly random hash between 0 and 1 can be converted to follow an exponential distribution by inverting the CDF. This method exploits the convenient properties of the minimum of a set of exponential variables: the minimum is itself exponentially distributed, with rate equal to the sum of the rates, and each variable achieves the minimum with probability proportional to its rate.

This yields as its collision probability the probability Jaccard index:[7]

J_P(x, y) = Σ_{i : x_i, y_i ≠ 0} 1 / Σ_j max(x_j / x_i, y_j / y_i).
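A minimal sketch of the CDF-inversion idea: a uniform hash U in (0, 1) becomes an exponential variable via -ln(U)/w, and the element achieving the minimum is selected with probability proportional to its weight. The hash construction and weights below are illustrative assumptions:

```python
import hashlib
import math

def uniform_hash(item, seed=0):
    """Pseudo-random uniform in (0, 1) derived from a SHA-256 digest."""
    d = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / (2**64 + 2)

def exp_hash(item, weight, seed=0):
    """CDF inversion: if U ~ Uniform(0,1), then -ln(U)/w ~ Exponential(rate w)."""
    return -math.log(uniform_hash(item, seed)) / weight

def weighted_minhash(weights, seed=0):
    """The element with minimum exponential hash; item i wins with
    probability weights[i] / sum(weights.values())."""
    return min(weights, key=lambda i: exp_hash(i, weights[i], seed))

# Over many independent seeds, "a" (weight 3) should win about 75% of the
# time against "b" (weight 1).
wins = sum(weighted_minhash({"a": 3.0, "b": 1.0}, seed=s) == "a"
           for s in range(2000))
```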

Min-wise independent permutations


In order to implement the MinHash scheme as described above, one needs the hash function h to define a
random permutation on n elements, where n is the total number of distinct elements in the union of all of
the sets to be compared. But because there are n! different permutations, it would require Ω(n log n) bits
just to specify a truly random permutation, an infeasibly large number for even moderate values of n .
Because of this fact, by analogy to the theory of universal hashing, there has been significant work on
finding a family of permutations that is "min-wise independent", meaning that for any subset of the domain,
any element is equally likely to be the minimum. It has been established that a min-wise independent family of permutations must include at least

lcm(1, 2, ..., n) ≥ e^(n − o(n))

different permutations, and therefore that Ω(n) bits are needed to specify a single permutation, still infeasibly large.[2]

Practical min-wise independent hash functions


Because of the above impracticality, two variant notions of min-wise independence have been introduced:
restricted min-wise independent permutations families, and approximate min-wise independent families.
Restricted min-wise independence is the min-wise independence property restricted to certain sets of
cardinality at most k.[8] Approximate min-wise independence has at most a fixed probability ε of varying
from full independence.[9]

In 1999 Piotr Indyk proved[10] that any k-wise independent family of hash functions is also approximately min-wise independent, for k large enough. In particular, there is a constant c such that if k ≥ c log(1/ε), then

Pr[ min h(S) = h(x) ] = (1 ± ε)/|S|

for all sets S and all x ∈ S. (Here 1 ± ε means the probability is at most a factor 1 + ε too big, and at most a factor 1 − ε too small.)

This guarantee is, among other things, sufficient to give the Jaccard bound required by the MinHash algorithm. That is, if A and B are sets, then

Pr[ h_min(A) = h_min(B) ] = J(A,B) ± ε.

Since k-wise independent hash functions can be specified using just O(k log n) = O(log(1/ε) log n) bits, this approach is much more practical than using fully min-wise independent permutations.

Another practical family of hash functions that gives approximate min-wise independence is tabulation hashing.
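A minimal sketch of tabulation hashing for 32-bit keys (the table count, byte split, and seeding here are illustrative assumptions):

```python
import random

def make_tabulation(seed=0):
    """Build a tabulation hash for 32-bit keys: split the key into 4 bytes
    and XOR together one random table entry per byte."""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(32) for _ in range(256)] for _ in range(4)]

    def h(x):
        out = 0
        for i in range(4):
            out ^= tables[i][(x >> (8 * i)) & 0xFF]
        return out

    return h

h = make_tabulation(seed=1)
```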

Applications
The original applications for MinHash involved clustering and eliminating near-duplicates among web
documents, represented as sets of the words occurring in those documents.[1][2][11] Similar techniques have
also been used for clustering and near-duplicate elimination for other types of data, such as images: in the
case of image data, an image can be represented as a set of smaller subimages cropped from it, or as sets of
more complex image feature descriptions.[12]

In data mining, Cohen et al. (2001) use MinHash as a tool for association rule learning. Given a database in
which each entry has multiple attributes (viewed as a 0–1 matrix with a row per database entry and a
column per attribute) they use MinHash-based approximations to the Jaccard index to identify candidate
pairs of attributes that frequently co-occur, and then compute the exact value of the index for only those
pairs to determine the ones whose frequencies of co-occurrence are below a given strict threshold.[13]

The MinHash algorithm has been adapted for bioinformatics, where the problem of comparing genome sequences has a similar theoretical underpinning to that of comparing documents on the web. MinHash-based tools[14][15] allow rapid comparison of whole genome sequencing data with reference genomes (around 3 minutes to compare one genome with the 90,000 reference genomes in RefSeq), and are suitable for species identification and perhaps a limited degree of microbial sub-typing. There are also applications for metagenomics[14] and the use of MinHash-derived algorithms for genome alignment and genome assembly.[16] Accurate average nucleotide identity (ANI) values can be generated very efficiently with MinHash-based algorithms.[17]

Other uses
The MinHash scheme may be seen as an instance of locality-sensitive hashing, a collection of techniques
for using hash functions to map large sets of objects down to smaller hash values in such a way that, when
two objects have a small distance from each other, their hash values are likely to be the same. In this
instance, the signature of a set may be seen as its hash value. Other locality sensitive hashing techniques
exist for Hamming distance between sets and cosine distance between vectors; locality sensitive hashing
has important applications in nearest neighbor search algorithms.[18] For large distributed systems, and in
particular MapReduce, there exist modified versions of MinHash to help compute similarities with no
dependence on the point dimension.[19]

Evaluation and benchmarks


A large-scale evaluation was conducted by Google in 2006[20] to compare the performance of the MinHash and SimHash[21] algorithms. In 2007 Google reported using SimHash for duplicate detection in web crawling[22] and using MinHash and LSH for Google News personalization.[23]

See also
Bloom filter
Count–min sketch
w-shingling

References
1. Broder, Andrei Z. (1997), "On the resemblance and containment of documents",
Compression and Complexity of Sequences: Proceedings, Positano, Amalfitan Coast,
Salerno, Italy, June 11-13, 1997 (https://web.archive.org/web/20150131043133/http://gateke
eper.dec.com/ftp/pub/dec/SRC/publications/broder/positano-final-wpnums.pdf) (PDF), IEEE,
pp. 21–29, CiteSeerX 10.1.1.24.779 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.
1.1.24.779), doi:10.1109/SEQUEN.1997.666900 (https://doi.org/10.1109%2FSEQUEN.199
7.666900), ISBN 978-0-8186-8132-5, S2CID 11748509 (https://api.semanticscholar.org/Cor
pusID:11748509), archived from the original (http://gatekeeper.dec.com/ftp/pub/dec/SRC/pu
blications/broder/positano-final-wpnums.pdf) (PDF) on 2015-01-31, retrieved 2014-01-18.
2. Broder, Andrei Z.; Charikar, Moses; Frieze, Alan M.; Mitzenmacher, Michael (1998), "Min-
wise independent permutations", Proc. 30th ACM Symposium on Theory of Computing
(STOC '98), New York, NY, USA: Association for Computing Machinery, pp. 327–336,
CiteSeerX 10.1.1.409.9220 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.409.
9220), doi:10.1145/276698.276781 (https://doi.org/10.1145%2F276698.276781), ISBN 978-
0897919623, S2CID 465847 (https://api.semanticscholar.org/CorpusID:465847).
3. Vassilvitskii, Sergey (2011), COMS 6998-12: Dealing with Massive Data (lecture notes,
Columbia university) (https://web.archive.org/web/20181024161246/https://www.cs.columbi
a.edu/~coms699812/lecture1.pdf) (PDF), archived from the original (https://www.cs.columbi
a.edu/~coms699812/lecture1.pdf) (PDF) on 2018-10-24.
4. Chum, Ondrej; Philbin, James; Zisserman, Andrew (2008), "Near Duplicate Image
Detection: min-Hash and tf-idf Weighting." (http://www.bmva.org/bmvc/2008/papers/119.pdf)
(PDF), BMVC, 810: 812–815
5. Shrivastava, Anshumali (2016), "Exact weighted minwise hashing in constant time",
arXiv:1602.08393 (https://arxiv.org/abs/1602.08393) [cs.DS (https://arxiv.org/archive/cs.DS)]
6. Ioffe, Sergey (2010), "Improved consistent sampling, weighted minhash and L1 sketching" (h
ttps://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36928.pdf)
(PDF), Data Mining (ICDM), 2010 IEEE 10th International Conference on: 246–255,
CiteSeerX 10.1.1.227.9749 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.227.
9749), doi:10.1109/ICDM.2010.80 (https://doi.org/10.1109%2FICDM.2010.80), ISBN 978-1-
4244-9131-5, S2CID 9970906 (https://api.semanticscholar.org/CorpusID:9970906)
7. Moulton, Ryan; Jiang, Yunjiang (2018), "Maximally Consistent Sampling and the Jaccard
Index of Probability Distributions", 2018 IEEE International Conference on Data Mining
(ICDM), pp. 347–356, arXiv:1809.04052 (https://arxiv.org/abs/1809.04052),
doi:10.1109/ICDM.2018.00050 (https://doi.org/10.1109%2FICDM.2018.00050), ISBN 978-1-
5386-9159-5, S2CID 49746072 (https://api.semanticscholar.org/CorpusID:49746072)
8. Matoušek, Jiří; Stojaković, Miloš (2003), "On restricted min-wise independence of
permutations", Random Structures and Algorithms, 23 (4): 397–408,
CiteSeerX 10.1.1.400.6757 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.400.
6757), doi:10.1002/rsa.10101 (https://doi.org/10.1002%2Frsa.10101), S2CID 1483449 (http
s://api.semanticscholar.org/CorpusID:1483449).
9. Saks, M.; Srinivasan, A.; Zhou, S.; Zuckerman, D. (2000), "Low discrepancy sets yield
approximate min-wise independent permutation families", Information Processing Letters, 73
(1–2): 29–32, CiteSeerX 10.1.1.20.8264 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi
=10.1.1.20.8264), doi:10.1016/S0020-0190(99)00163-5 (https://doi.org/10.1016%2FS0020-0
190%2899%2900163-5).
10. Indyk, Piotr. "A small approximately min-wise independent family of hash functions." Journal
of Algorithms 38.1 (2001): 84-90.
11. Manasse, Mark (2012). On the Efficient Determination of Most Near Neighbors: Horseshoes,
Hand Grenades, Web Search, and Other Situations when Close is Close Enough. Morgan &
Claypool. p. 72. ISBN 9781608450886.
12. Chum, Ondřej; Philbin, James; Isard, Michael; Zisserman, Andrew (2007), "Scalable near
identical image and shot detection", Proceedings of the 6th ACM International Conference
on Image and Video Retrieval (CIVR'07), pp. 549–556, doi:10.1145/1282280.1282359 (http
s://doi.org/10.1145%2F1282280.1282359), ISBN 9781595937339, S2CID 3330908 (https://
api.semanticscholar.org/CorpusID:3330908); Chum, Ondřej; Philbin, James; Zisserman,
Andrew (2008), "Near duplicate image detection: min-hash and tf-idf weighting",
Proceedings of the British Machine Vision Conference (http://www.bmva.org/bmvc/2008/pap
ers/119.pdf) (PDF), vol. 3, p. 4.
13. Cohen, E.; Datar, M.; Fujiwara, S.; Gionis, A.; Indyk, P.; Motwani, R.; Ullman, J. D.; Yang, C.
(2001), "Finding interesting associations without support pruning", IEEE Transactions on
Knowledge and Data Engineering, 13 (1): 64–78, CiteSeerX 10.1.1.192.7385 (https://citesee
rx.ist.psu.edu/viewdoc/summary?doi=10.1.1.192.7385), doi:10.1109/69.908981 (https://doi.o
rg/10.1109%2F69.908981).
14. Ondov, Brian D.; Treangen, Todd J.; Melsted, Páll; Mallonee, Adam B.; Bergman, Nicholas
H.; Koren, Sergey; Phillippy, Adam M. (2016-06-20). "Mash: fast genome and metagenome
distance estimation using MinHash" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC491504
5). Genome Biology. 17 (1): 132. doi:10.1186/s13059-016-0997-x (https://doi.org/10.1186%2
Fs13059-016-0997-x). ISSN 1474-760X (https://www.worldcat.org/issn/1474-760X).
PMC 4915045 (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4915045). PMID 27323842
(https://pubmed.ncbi.nlm.nih.gov/27323842).
15. "Welcome to sourmash! — sourmash 1.0 documentation" (https://sourmash.readthedocs.io/e
n/latest/). sourmash.readthedocs.io. Retrieved 2017-11-13.
16. Berlin, Konstantin; Koren, Sergey; Chin, Chen-Shan; Drake, James P; Landolin, Jane M;
Phillippy, Adam M (2015-05-25). "Assembling large genomes with single-molecule
sequencing and locality-sensitive hashing". Nature Biotechnology. 33 (6): 623–630.
doi:10.1038/nbt.3238 (https://doi.org/10.1038%2Fnbt.3238). ISSN 1546-1696 (https://www.w
orldcat.org/issn/1546-1696). PMID 26006009 (https://pubmed.ncbi.nlm.nih.gov/26006009).
S2CID 17246729 (https://api.semanticscholar.org/CorpusID:17246729).
17. Jain, Chirag; Rodriguez-R, Luis M.; Phillippy, Adam M.; Konstantinidis, Konstantinos T.;
Aluru, Srinivas (December 2018). "High throughput ANI analysis of 90K prokaryotic
genomes reveals clear species boundaries" (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC
6269478). Nature Communications. 9 (1): 5114. doi:10.1038/s41467-018-07641-9 (https://do
i.org/10.1038%2Fs41467-018-07641-9). PMC 6269478 (https://www.ncbi.nlm.nih.gov/pmc/a
rticles/PMC6269478). PMID 30504855 (https://pubmed.ncbi.nlm.nih.gov/30504855).
18. Andoni, Alexandr; Indyk, Piotr (2008), "Near-optimal hashing algorithms for approximate
nearest neighbor in high dimensions", Communications of the ACM, 51 (1): 117–122,
CiteSeerX 10.1.1.226.6905 (https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.226.
6905), doi:10.1145/1327452.1327494 (https://doi.org/10.1145%2F1327452.1327494),
S2CID 6468963 (https://api.semanticscholar.org/CorpusID:6468963).
19. Zadeh, Reza; Goel, Ashish (2012), "Dimension Independent Similarity Computation",
arXiv:1206.2082 (https://arxiv.org/abs/1206.2082) [cs.DS (https://arxiv.org/archive/cs.DS)].
20. Henzinger, Monika (2006), "Finding near-duplicate web pages: a large-scale evaluation of
algorithms", Proceedings of the 29th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (https://archive.org/details/sigirseattle20
060000inte/page/284), pp. 284 (https://archive.org/details/sigirseattle20060000inte/page/28
4), doi:10.1145/1148170.1148222 (https://doi.org/10.1145%2F1148170.1148222),
ISBN 978-1595933690, S2CID 207160068 (https://api.semanticscholar.org/CorpusID:20716
0068).
21. Charikar, Moses S. (2002), "Similarity estimation techniques from rounding algorithms",
Proceedings of the 34th Annual ACM Symposium on Theory of Computing, p. 380,
doi:10.1145/509907.509965 (https://doi.org/10.1145%2F509907.509965), ISBN 978-
1581134957, S2CID 4229473 (https://api.semanticscholar.org/CorpusID:4229473).
22. Gurmeet Singh, Manku; Jain, Arvind; Das Sarma, Anish (2007), "Detecting near-duplicates
for web crawling", Proceedings of the 16th International Conference on World Wide Web (htt
p://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33026.pdf)
(PDF), p. 141, doi:10.1145/1242572.1242592 (https://doi.org/10.1145%2F1242572.124259
2), ISBN 9781595936547, S2CID 1414324 (https://api.semanticscholar.org/CorpusID:14143
24).
23. Das, Abhinandan S.; Datar, Mayur; Garg, Ashutosh; Rajaram, Shyam; et al. (2007), "Google
news personalization: scalable online collaborative filtering", Proceedings of the 16th
International Conference on World Wide Web, p. 271, doi:10.1145/1242572.1242610 (http
s://doi.org/10.1145%2F1242572.1242610), ISBN 9781595936547, S2CID 207163129 (http
s://api.semanticscholar.org/CorpusID:207163129).

Retrieved from "https://en.wikipedia.org/w/index.php?title=MinHash&oldid=1155091988"
