Lect 26 and 27 - Locality Sensitive Hashing

Finding Similar Items
Min-Hashing
and
Locality Sensitive Hashing
Thanks to the foll. for their lecture slides:

 Rajaraman and Ullman, “Mining Massive Datasets”
 Evimaria Terzi and Wu-Jun Li
1
Outline
 Introduction
 Shingling
 Minhashing
 Locality-Sensitive Hashing
2
Finding Similar Items Introduction
Goals
 Many Web-mining problems can be expressed
as finding “similar” sets:
1. Pages with similar words, e.g., for
classification by topic.
2. NetFlix users with similar tastes in movies,
for recommendation systems.
3. Dual: movies with similar sets of fans.
4. Images of related things.
3
Motivating problem
 Find duplicate and near-duplicate documents from a
web crawl.
 If we wanted exact duplicates we could do this by

hashing
 We will see how to adapt this technique for near
duplicate documents
Example Problem:
Comparing Documents
 Goal: common text.
 Special cases are easy, e.g., identical

documents, or one document contained
character-by-character in another.
 General case, where many small pieces of one

doc appear out of order in another, is very
hard.
5
Similar Documents – (2)

 Given a body of documents, e.g., the Web, find pairs
of documents with a lot of text in common, e.g.:
 Mirror sites, or approximate mirrors.
 Application: Don’t want to show both in a
search.
 Plagiarism, including large quotations.
 Similar news articles at many news sites.
 Application: Cluster articles by “same story.”
6
Main issues
 What is the right representation of the document
when we check for similarity?
 E.g., representing a document as a set of characters will
not do (why?)
 When we have billions of documents, keeping the
full text in memory is not an option.
 We need to find a shorter representation
 How do we do pairwise comparisons of billions of
documents?
 If exact match was the issue it would be ok, can we
replicate this idea?
The Big Picture
Candidate
pairs :
Locality-
Docu- those pairs
sensitive
ment of signatures
Hashing
that we need
to test for
The set Signatures : similarity.
of strings short integer
of length k vectors that
that appear represent the
in the doc- sets, and
ument reflect their
similarity
8
Three Essential Techniques for Similar

Documents
1. Shingling : convert documents, emails, etc., to
sets.
2. Minhashing : convert large sets to short

signatures, while preserving similarity.
3. Locality-sensitive hashing : focus on pairs of

signatures likely to be similar.
9
Outline
 Introduction
 Shingling
 Minhashing
10
Finding Similar Items Shingling
Shingles
 A k -shingle (or k -gram) for a document is a
sequence of k characters that appears in the
document.
 Example: k=2; doc = abcab. Set of 2-shingles = {ab,

bc, ca}.
 Represent a doc by its set of k-shingles.
11
Working Assumption
 Documents that have lots of shingles in common
have similar text, even if the text appears in different
order.
 Careful: you must pick k large enough, or most

documents will have most shingles.
 k = 5 is OK for short documents; k = 10 is better for long
documents.
12
Shingles: Compression Option

 To compress long shingles, we can hash them to
(say) 4 bytes (integer).
 Represent a doc by the set of hash values of its k-

shingles.
 Two documents could rarely appear to have

shingles in common, when in fact only the hash-
values were shared.
13
Outline
 Introduction
 Shingling
 Minhashing
14
Finding Similar Items Minhashing
Basic Data Model: Sets

 Many similarity problems can be couched as
finding subsets of some universal set that have
significant intersection.
 Examples include:
1. Documents represented by their sets of shingles (or
hashes of those shingles).
 Applicable to any kind of sets.
Similar customers or products.
15
Jaccard Similarity of Sets

 The Jaccard similarity of two sets is the size of their
intersection divided by the size of their union.
 Sim (C1, C2) = |C1C2|/|C1C2|.
16
Example: Jaccard Similarity
3 in intersection.
8 in union.
Jaccard similarity
= 3/8
17
From Sets to Boolean Matrices

 Rows = elements of the universal set.
 Columns = sets.
 1 in row e and column S if and only if e is a
member of S.
 Column similarity is the Jaccard similarity of the
sets of their rows with 1.
 Typical matrix is sparse.
18
Example: Jaccard Similarity of Columns

C1 C2
0 1 *
1 0 *
1 1 ** Sim (C1, C2) = 2/5 = 0.4
0 0
1 1 **
0 1 *
19
Aside
 We might not really represent the data by a
boolean matrix.
 Sparse matrices are usually better represented by
the list of places where there is a non-zero value.
 But the matrix picture is conceptually useful.
20
When Is Similarity Interesting?

1. When the sets are so large or so many that they
cannot fit in main memory.
2. Or, when there are so many sets that comparing all
pairs of sets takes too much time.
3. Or both.
21
Outline: Finding Similar Columns

1. Compute signatures of columns = small summaries
of columns.
2. Examine pairs of signatures to find similar signatures.

 Essential: similarities of signatures and columns are
related.
3. Optional: check that columns with similar signatures

are really similar.
22
Warnings
1. Comparing all pairs of signatures may take too
much time, even if not too much space.
 A job for Locality-Sensitive Hashing.
2. These methods can produce false negatives, and

even false positives (if the optional check is not
made).
23
Signatures
 Key idea: “hash” each column C to a small
signature Sig (C), such that:
1. Sig (C) is small enough that we can fit a signature in main
memory for each column.
2. Sim (C1, C2) is the same as the “similarity” of Sig (C1) and
Sig (C2).
24
Four Types of Rows

 Given columns C1 and C2, rows may be classified as:
C1 C2
a 1 1
b 1 0
c 0 1
d 0 0
 Also, a = # rows of type a , etc.
 Note Sim (C1, C2) = a /(a +b +c ).
25
Minhashing
 Imagine the rows permuted randomly.
 Define “hash” function h (C ) = the number of the

first (in the permuted order) row in which column C
has 1.
 Use several (e.g., 100) independent hash functions to

create a signature.
26
Example of minhash signatures

 Input matrix
S1 S2 S3 S4 S1 S2 S3 S4
A
A 1 0 1 0 1 A 1 0 1 0
C
B 1 0 0 1 2 C 0 1 0 1
C 0 1 0 1 G 3 G 1 0 1 0
D 0 1 0 1 F 4 F 1 0 1 0
E 0 1 0 1 B 5 B 1 0 0 1
F 1 0 1 0 E 6 E 0 1 0 1
G 1 0 1 0 D 7 D 0 1 0 1
1 2 1 2

 Input matrix
S1 S2 S3 S4 S1 S2 S3 S4
D
A 1 0 1 0 1 D 0 1 0 1
B
B 1 0 0 1 2 B 1 0 0 1
C 0 1 0 1 A 3 A 1 0 1 0
D 0 1 0 1 C 4 C 0 1 0 1
E 0 1 0 1 F 5 F 1 0 1 0
F 1 0 1 0 G 6 G 1 0 1 0
G 1 0 1 0 E 7 E 0 1 0 1
2 1 3 1

 Input matrix
S1 S2 S3 S4 S1 S2 S3 S4
C
A 1 0 1 0 1 C 0 1 0 1
D
B 1 0 0 1 2 D 0 1 0 1
C 0 1 0 1 G 3 G 1 0 1 0
D 0 1 0 1 F 4 F 1 0 1 0
E 0 1 0 1 A 5 A 1 0 1 0
F 1 0 1 0 B 6 B 1 0 0 1
G 1 0 1 0 E 7 E 0 1 0 1
3 1 3 1

 Input matrix
S1 S2 S3 S4 Signature matrix
A 1 0 1 0
B 1 0 0 1 S1 S2 S3 S4
≈
C 0 1 0 1 h1 1 2 1 2
D 0 1 0 1 h2 2 1 3 1
E 0 1 0 1 h3 3 1 3 1
F 1 0 1 0
G 1 0 1 0
• Sig(S) = vector of hash values
• e.g., Sig(S2) = [2,1,1]
• Sig(S,i) = value of the i-th hash
function for set S
• E.g., Sig(S2,3) = 1
Surprising Property
 The probability (over all permutations of the
rows) that h (C1) = h (C2) is the same as Sim (C1,
C2).
 Both are a /(a +b +c )!
 Why?
 Look down the permuted columns C1 and C2 until we
see a 1.
 If it’s a type-a row, then h (C1) = h (C2). If a type-b or
type-c row, then not.
31
Similarity for Signatures

 The similarity of signatures is the fraction of the
hash functions in which they agree.
32
Similarity for Signatures

 The similarity of signatures is the fraction of the
hash functions in which they agree.
Actual Sig
S1 S2 S3 S4
Signature matrix (S1, S2) 0 0
A 1 0 1 0
S1 S2 S3 S4 (S1, S3) 3/5 2/3
B 1 0 0 1
≈
1 2 1 2 (S1, S4) 1/7 0
C 0 1 0 1
2 1 3 1 (S2, S3) 0 0
D 0 1 0 1
3 1 3 1 (S2, S4) 3/4 1
E 0 1 0 1
(S3, S4) 0 0
F 1 0 1 0
G 1 0 1 0 Zero similarity is preserved
High similarity is well approximated
 With multiple signatures we get a good
approximation
33
Minhash Signatures
 Pick (say) 100 random permutations of the rows.
 Think of Sig (C) as a column vector.
 Let Sig (C)[i] =
according to the i th permutation, the number of the
first row that has a 1 in column C.
34
Implementation – (1)
 Suppose 1 billion rows.
 Hard to pick a random permutation from

1…billion.
 Representing a random permutation requires 1

billion entries.
 Accessing rows in permuted order leads to

thrashing.
35
 A good approximation to permuting rows: pick
100 (?) hash functions.
 For each column c and each hash function hi ,

keep a “slot” M (i, c ).
 Intent: M (i, c ) will become the smallest value

of hi (r ) for which column c has 1 in row r.
 I.e., hi (r ) gives order of rows for i th permuation.
36
Initialize M(i,c) to ∞ for all i and c
for each row r
for each column c
if c has 1 in row r
for each hash function hi do
if hi (r ) is a smaller value than M (i, c ) then
M (i, c ) := hi (r );
37
 Often, data is given by column, not row.
 E.g., columns = documents, rows = shingles.
 If so, sort matrix once so it is by row.
 And always compute hi (r ) only once for each row.
38
Example Sig1 Sig2

h(0) = 1 1 -
x Row S1 S2 h(x) g(x)
g(0) = 3 3 -
0 A 1 0 1 3
1 B 0 1 2 0 h(1) = 2 1 2
2 C 1 1 3 2 g(1) = 0 3 0
3 D 1 0 4 4
4 E 0 1 0 1
h(2) = 3 1 2
g(2) = 2 2 0
h(3) = 4 1 2
h(x) = x+1 mod 5 g(3) = 4 2 0
g(x) = 2x+3 mod 5
h(Row) Row S1 S2 g(Row) Row S1 S2 h(4) = 0 1 0
0 E 0 1 0 B 0 1 g(4) = 1 2 0
1 A 1 0 1 E 0 1
2 B 0 1 2 C 1 0
3 C 1 1 3 A 1 1
4 D 1 0 4 D 1 0 39
Finding Similar Items Locality-Sensitive Hashing
Finding Similar Pairs

 Suppose we have, in main memory, data
representing a large number of objects.
 May be the objects themselves .
 May be signatures as in minhashing.
 We want to compare each to each, finding those

pairs that are sufficiently similar.
40
Checking All Pairs is Hard

 While the signatures of all columns may fit in main
memory, comparing the signatures of all pairs of
columns is quadratic in the number of columns.
 Example: 106 columns implies 5*1011 column-

comparisons.
 At 1 microsecond/comparison: 6 days.
41
Locality-Sensitive Hashing
 General idea: Use a function f(x,y) that tells whether
or not x and y is a candidate pair : a pair of
elements whose similarity must be evaluated.
 For minhash matrices: Hash columns to many

buckets, and make elements of the same bucket
candidate pairs.
42
LSH Summary
 Tune to get almost all pairs with similar
signatures, but eliminate most pairs that do not
have similar signatures.
 Check in main memory that candidate pairs really

do have similar signatures.
 Optional: In another pass through data, check

that the remaining candidate pairs really
represent similar sets .
43

Lect 26 and 27 - Locality Sensitive Hashing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lect 26 and 27 - Locality Sensitive Hashing

Uploaded by

Copyright:

Available Formats

Finding Similar Items

Thanks to the foll. for their lecture slides:

 If we wanted exact duplicates we could do this by

 Special cases are easy, e.g., identical

 General case, where many small pieces of one

Similar Documents – (2)

The Big Picture

Three Essential Techniques for Similar

2. Minhashing : convert large sets to short

3. Locality-sensitive hashing : focus on pairs of

 Example: k=2; doc = abcab. Set of 2-shingles = {ab,

 Represent a doc by its set of k-shingles.

 Careful: you must pick k large enough, or most

Shingles: Compression Option

 Represent a doc by the set of hash values of its k-

 Two documents could rarely appear to have

Basic Data Model: Sets

Jaccard Similarity of Sets

Example: Jaccard Similarity

From Sets to Boolean Matrices

Example: Jaccard Similarity of Columns

When Is Similarity Interesting?

Outline: Finding Similar Columns

2. Examine pairs of signatures to find similar signatures.

3. Optional: check that columns with similar signatures

2. These methods can produce false negatives, and

Four Types of Rows

 Also, a = # rows of type a , etc.

 Note Sim (C1, C2) = a /(a +b +c ).

 Define “hash” function h (C ) = the number of the

 Use several (e.g., 100) independent hash functions to

Example of minhash signatures

Example of minhash signatures

Example of minhash signatures

Example of minhash signatures

Similarity for Signatures

Similarity for Signatures

 Hard to pick a random permutation from

 Representing a random permutation requires 1

 Accessing rows in permuted order leads to

 For each column c and each hash function hi ,

 Intent: M (i, c ) will become the smallest value

 If so, sort matrix once so it is by row.

 And always compute hi (r ) only once for each row.

Example Sig1 Sig2

Finding Similar Pairs

 We want to compare each to each, finding those

Checking All Pairs is Hard

 Example: 106 columns implies 5*1011 column-

 For minhash matrices: Hash columns to many

 Check in main memory that candidate pairs really

 Optional: In another pass through data, check

You might also like