Professional Documents
Culture Documents
Min-Hashing
and
Locality Sensitive Hashing
1
Finding Similar Items
Outline
Introduction
Shingling
Minhashing
Locality-Sensitive Hashing
2
Finding Similar Items Introduction
Goals
Many Web-mining problems can be expressed
as finding “similar” sets:
1. Pages with similar words, e.g., for
classification by topic.
2. NetFlix users with similar tastes in movies,
for recommendation systems.
3. Dual: movies with similar sets of fans.
4. Images of related things.
3
Finding Similar Items
Motivating problem
Find duplicate and near-duplicate documents from a
web crawl.
Example Problem:
Comparing Documents
Goal: common text.
5
Finding Similar Items Introduction
6
Finding Similar Items
Main issues
What is the right representation of the document
when we check for similarity?
E.g., representing a document as a set of characters will
not do (why?)
When we have billions of documents, keeping the
full text in memory is not an option.
We need to find a shorter representation
How do we do pairwise comparisons of billions of
documents?
If exact match was the issue it would be ok, can we
replicate this idea?
Finding Similar Items Introduction
Candidate
pairs :
Locality-
Docu- those pairs
sensitive
ment of signatures
Hashing
that we need
to test for
The set Signatures : similarity.
of strings short integer
of length k vectors that
that appear represent the
in the doc- sets, and
ument reflect their
similarity
8
Finding Similar Items Introduction
9
Finding Similar Items
Outline
Introduction
Shingling
Minhashing
Locality-Sensitive Hashing
10
Finding Similar Items Shingling
Shingles
A k -shingle (or k -gram) for a document is a
sequence of k characters that appears in the
document.
11
Finding Similar Items Shingling
Working Assumption
Documents that have lots of shingles in common
have similar text, even if the text appears in different
order.
12
Finding Similar Items Shingling
13
Finding Similar Items
Outline
Introduction
Shingling
Minhashing
Locality-Sensitive Hashing
14
Finding Similar Items Minhashing
15
Finding Similar Items Minhashing
16
Finding Similar Items Minhashing
3 in intersection.
8 in union.
Jaccard similarity
= 3/8
17
Finding Similar Items Minhashing
18
Finding Similar Items Minhashing
19
Finding Similar Items Minhashing
Aside
We might not really represent the data by a
boolean matrix.
Sparse matrices are usually better represented by
the list of places where there is a non-zero value.
But the matrix picture is conceptually useful.
20
Finding Similar Items Minhashing
21
Finding Similar Items Minhashing
22
Finding Similar Items Minhashing
Warnings
1. Comparing all pairs of signatures may take too
much time, even if not too much space.
A job for Locality-Sensitive Hashing.
23
Finding Similar Items Minhashing
Signatures
Key idea: “hash” each column C to a small
signature Sig (C), such that:
1. Sig (C) is small enough that we can fit a signature in main
memory for each column.
2. Sim (C1, C2) is the same as the “similarity” of Sig (C1) and
Sig (C2).
24
Finding Similar Items Minhashing
25
Finding Similar Items Minhashing
Minhashing
Imagine the rows permuted randomly.
26
Finding Similar Items
3 1 3 1
Finding Similar Items
≈
C 0 1 0 1 h1 1 2 1 2
D 0 1 0 1 h2 2 1 3 1
E 0 1 0 1 h3 3 1 3 1
F 1 0 1 0
G 1 0 1 0
• Sig(S) = vector of hash values
• e.g., Sig(S2) = [2,1,1]
• Sig(S,i) = value of the i-th hash
function for set S
• E.g., Sig(S2,3) = 1
Finding Similar Items Minhashing
Surprising Property
The probability (over all permutations of the
rows) that h (C1) = h (C2) is the same as Sim (C1,
C2).
Both are a /(a +b +c )!
Why?
Look down the permuted columns C1 and C2 until we
see a 1.
If it’s a type-a row, then h (C1) = h (C2). If a type-b or
type-c row, then not.
31
Finding Similar Items Minhashing
32
Finding Similar Items
≈
1 2 1 2 (S1, S4) 1/7 0
C 0 1 0 1
2 1 3 1 (S2, S3) 0 0
D 0 1 0 1
3 1 3 1 (S2, S4) 3/4 1
E 0 1 0 1
(S3, S4) 0 0
F 1 0 1 0
G 1 0 1 0 Zero similarity is preserved
High similarity is well approximated
With multiple signatures we get a good
approximation
33
Finding Similar Items Minhashing
Minhash Signatures
Pick (say) 100 random permutations of the rows.
Think of Sig (C) as a column vector.
Let Sig (C)[i] =
according to the i th permutation, the number of the
first row that has a 1 in column C.
34
Finding Similar Items Minhashing
Implementation – (1)
Suppose 1 billion rows.
Implementation – (2)
A good approximation to permuting rows: pick
100 (?) hash functions.
36
Finding Similar Items Minhashing
Implementation – (3)
Initialize M(i,c) to ∞ for all i and c
for each row r
for each column c
if c has 1 in row r
for each hash function hi do
if hi (r ) is a smaller value than M (i, c ) then
M (i, c ) := hi (r );
37
Finding Similar Items Minhashing
Implementation – (4)
Often, data is given by column, not row.
E.g., columns = documents, rows = shingles.
38
Finding Similar Items
h(3) = 4 1 2
h(x) = x+1 mod 5 g(3) = 4 2 0
g(x) = 2x+3 mod 5
h(Row) Row S1 S2 g(Row) Row S1 S2 h(4) = 0 1 0
0 E 0 1 0 B 0 1 g(4) = 1 2 0
1 A 1 0 1 E 0 1
2 B 0 1 2 C 1 0
3 C 1 1 3 A 1 1
4 D 1 0 4 D 1 0 39
Finding Similar Items Locality-Sensitive Hashing
40
Finding Similar Items Locality-Sensitive Hashing
At 1 microsecond/comparison: 6 days.
41
Finding Similar Items Locality-Sensitive Hashing
Locality-Sensitive Hashing
General idea: Use a function f(x,y) that tells whether
or not x and y is a candidate pair : a pair of
elements whose similarity must be evaluated.
42
Finding Similar Items Locality-Sensitive Hashing
LSH Summary
Tune to get almost all pairs with similar
signatures, but eliminate most pairs that do not
have similar signatures.