
Finding Similar Items:

Locality Sensitive Hashing


New thread: High dim. data

Course map:
- High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
- Graph data: PageRank, SimRank; Network Analysis; Spam Detection
- Infinite data: Filtering data streams; Web advertising; Queries on streams
- Machine learning: SVM; Decision Trees; Perceptron, kNN
- Apps: Recommender systems; Association Rules; Duplicate document detection
[Hays and Efros, SIGGRAPH 2007]

Scene Completion Problem

10 nearest neighbors from a collection of 20,000 images

10 nearest neighbors from a collection of 2 million images
A Common Metaphor
 Many problems can be expressed as
finding “similar” sets:
 Find near-neighbors in high-dimensional space
 Examples:
 Pages with similar words
 For duplicate detection, classification by topic
 Customers who purchased similar products
 Products with similar customer sets
 Images with similar features
 Users who visited similar websites

Finding Similar Items
Distance Measures
 Goal: Find near-neighbors in high-dim. space
 We formally define “near neighbors” as
points that are a “small distance” apart
 For each application, we first need to define what
“distance” means
 Today: Jaccard distance/similarity
 The Jaccard similarity of two sets is the size of their
intersection divided by the size of their union:
sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
 Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|
3 in intersection, 8 in union:
Jaccard similarity = 3/8
Jaccard distance = 5/8
Application of Set Similarity

Similar Documents
Task: Finding Similar Documents

 Goal: Given a large number (in the millions or
billions) of documents, find “near duplicate” pairs
 Applications:
 Mirror websites, or approximate mirrors
 Don’t want to show both in search results
 Similar news articles at many news sites
 Cluster articles by “same story”
 Problems:
 Many small pieces of one document can appear
out of order in another
 Too many documents to compare all pairs
 Documents are so large or so many that they cannot
fit in main memory
3 Essential Steps for Similar Docs
1. Shingling: Convert documents to sets

2. Min-Hashing: Convert large sets to short signatures, while preserving similarity

3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
 Candidate pairs!
The Big Picture

Document
  → Shingling → the set of strings of length k that appear in the document
  → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity
  → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity
Shingling

Step 1: Shingling: Convert documents to sets
Documents as High-Dim Data
 Step 1: Shingling: Convert documents to sets

 Simple approaches:
 Document = set of words appearing in document
 Document = set of “important” words
 Don’t work well for this application. Why?

 Need to account for ordering of words!


 A different way: Shingles!

Define: Shingles
 A k-shingle (or k-gram) for a document is a
sequence of k tokens that appears in the doc
 Tokens can be characters, words or something
else, depending on the application
 Assume tokens = characters for examples

 Example: k=2; document D1 = abcab


Set of 2-shingles: S(D1) = {ab, bc, ca}
 Option: Shingles as a bag (multiset), count ab
twice: S’(D1) = {ab, bc, ca, ab}
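The 2-shingle example above can be reproduced with a short helper (a minimal sketch; the function name is illustrative):

```python
def shingles(doc, k):
    """Set of k-shingles (contiguous character k-grams) of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# Slide example: k = 2, D1 = abcab
print(shingles("abcab", 2))  # {'ab', 'bc', 'ca'} — 'ab' occurs twice but a set keeps it once
```

Using a bag (multiset) instead of a set would keep the duplicate 'ab', as in the S’(D1) option above.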

Shingles and Similarity
Compressing Shingles
 To compress long shingles, we can hash them
to (say) 4 bytes
 Represent a document by the set of hash
values of its k-shingles
 Caveat: Two documents could (rarely) appear to have
shingles in common, when in fact only the hash-values
were shared
 Example: k=2; document D1= abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
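Hashing shingles to 4-byte values can be sketched as follows (CRC32 is one possible 32-bit hash, chosen here for illustration; the slide's values {1, 5, 7} are schematic):

```python
import zlib

def hashed_shingles(doc, k):
    """Replace each k-shingle by a 4-byte integer hash (here CRC32)."""
    return {zlib.crc32(doc[i:i + k].encode()) for i in range(len(doc) - k + 1)}

# D1 = abcab has 3 distinct 2-shingles, hence 3 hash values
print(len(hashed_shingles("abcab", 2)))  # 3
```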
Similarity Metric for Shingles
 Document D1 is a set of its k-shingles C1=S(D1)
 Equivalently, each document is a
0/1 vector in the space of k-shingles
 Each unique shingle is a dimension
 Vectors are very sparse
 A natural similarity measure is the
Jaccard similarity:
sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

Example: Jaccard Similarity
 Let’s see an example of how Jaccard similarity works:
 doc_1 = "Data is the new oil of the digital economy"
 doc_2 = "Data is a new oil"
 Let’s get the set of unique words for each document:
 words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}
 words_doc2 = {'data', 'is', 'a', 'new', 'oil'}
Example: Jaccard Similarity
 We will calculate the intersection and union of these two sets of words and measure the Jaccard similarity between doc_1 and doc_2.
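The calculation can be sketched directly (word sets built by lowercasing and splitting on whitespace, an assumption about the tokenization):

```python
doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"

words_doc1 = set(doc_1.lower().split())   # 8 unique words
words_doc2 = set(doc_2.lower().split())   # 5 unique words

intersection = words_doc1 & words_doc2    # {'data', 'is', 'new', 'oil'}
union = words_doc1 | words_doc2           # 9 unique words in total

jaccard = len(intersection) / len(union)
print(jaccard)  # 4/9 ≈ 0.444
```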
Working Assumption
 Documents that have lots of shingles in
common have similar text, even if the text
appears in different order

 Caveat: You must pick k large enough, or most
documents will have most shingles
 k = 5 is OK for short documents
 k = 10 is better for long documents

Motivation for Minhash/LSH
 Suppose we need to find near-duplicate documents among N = 1 million documents

 Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs
 N(N−1)/2 ≈ 5×10^11 comparisons
 At 10^5 secs/day and 10^6 comparisons/sec, it would take 5 days

 For N = 10 million, it takes more than a year…
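The arithmetic can be checked directly (N = 10^6 is inferred from the ≈5×10^11 pair count on the slide):

```python
n = 1_000_000                      # documents
pairs = n * (n - 1) // 2           # ≈ 5 × 10**11 pairwise comparisons
per_sec = 10**6                    # comparisons per second
secs_per_day = 10**5               # the slide's round figure for a day
days = pairs / per_sec / secs_per_day
print(round(days, 2))              # ≈ 5 days; 10x the documents means 100x the pairs
```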


MinHashing

Step 2: Min-Hashing: Convert large sets to short signatures, while preserving similarity
From Sets to Boolean Matrices
 Rows = elements (shingles)
 Columns = sets (documents)
 1 in row e and column s if and only if e is a member of s
 Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
 Typical matrix is sparse!
 Each document is a column. Example (Shingles x Documents):

   1 1 1 0
   1 1 0 1
   0 1 0 1
   0 0 0 1
   1 0 0 1
   1 1 1 0
   1 0 1 0

 Example: sim(C1, C2) = ?
 Size of intersection = 3; size of union = 6; Jaccard similarity (not distance) = 3/6
 d(C1, C2) = 1 − (Jaccard similarity) = 3/6
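With documents as 0/1 columns, the column similarity above can be reproduced directly (a sketch; `column_jaccard` is an illustrative name):

```python
# Shingle-document matrix from the slide: rows = shingles, columns = documents
M = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]

def column_jaccard(M, i, j):
    """Jaccard similarity of two 0/1 columns of M."""
    both = sum(1 for row in M if row[i] and row[j])      # rows with 1 in both
    either = sum(1 for row in M if row[i] or row[j])     # rows with 1 in at least one
    return both / either

print(column_jaccard(M, 0, 1))  # 3/6 = 0.5, as on the slide
```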
Example: Column Similarity

Surprising Property

Implementation of Minhashing
Locality Sensitive Hashing

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
Candidate Generation from Minhash Signatures

LSH for Minhash Signatures

Partition Into Bands

Hash Function for One Bucket

Example Bands

Suppose C1, C2 are 80% Similar

Suppose C1, C2 Only 40% Similar

What One Band of One Row Gives You

What b Bands of r Rows Give You

Example: b=20; r=5
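The candidate probability for b bands of r rows is 1 − (1 − s^r)^b; for b=20, r=5 it can be tabulated with a few lines (a sketch; the function name is illustrative):

```python
def p_candidate(s, b=20, r=5):
    """Probability that a pair with similarity s becomes a candidate
    under b bands of r rows each: 1 - (1 - s**r) ** b."""
    return 1 - (1 - s ** r) ** b

# Steep S-curve around the threshold ~ (1/b) ** (1/r) ≈ 0.55 for b=20, r=5
for s in (0.2, 0.4, 0.6, 0.8):
    print(s, round(p_candidate(s), 4))
```

Dissimilar pairs (s = 0.2) are almost never candidates, while similar pairs (s = 0.8) are caught with near certainty.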
Analysis of LSH – What We Want

[Ideal step function: probability of sharing a bucket is 1 if t > s and 0 ("no chance") if t < s, where t = sim(C1, C2) is the similarity of the two sets and s is the similarity threshold. x-axis: similarity t; y-axis: probability of sharing a bucket.]

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


Min-Hashing Example

Permutation π        Input matrix             Signature matrix M
                     (Shingles x Documents)

2 4 3                1 0 1 0                  2 1 2 1
3 2 4                1 0 0 1                  2 1 4 1
7 1 7                0 1 0 1                  1 2 1 2
6 3 2                0 1 0 1
1 6 6                0 1 0 1
5 7 1                1 0 1 0
4 5 5                1 0 1 0

(E.g., a signature entry of 2 means the 2nd element of that permutation is the first to map to a 1 in that column; an entry of 4 means the 4th element is the first to map to a 1.)

Note: Another (equivalent) way is to store row indexes instead of permutation values:

1 5 1 5
2 3 1 3
6 4 6 4
Four Types of Rows
 Given cols C1 and C2, rows may be classified as:

        C1  C2
   A     1   1
   B     1   0
   C     0   1
   D     0   0

 a = # rows of type A, etc.
 Note: sim(C1, C2) = a/(a + b + c)
 Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
 Look down the cols C1 and C2 until we see a 1
 If it’s a type-A row, then h(C1) = h(C2); if a type-B or type-C row, then not
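The property Pr[h(C1) = h(C2)] = sim(C1, C2) can be checked empirically by sampling random row permutations (the two columns below are made up for illustration):

```python
import random

C1 = [1, 0, 0, 1, 1, 0, 1]   # two example 0/1 columns (hypothetical)
C2 = [1, 1, 0, 1, 0, 0, 1]

a = sum(1 for x, y in zip(C1, C2) if x and y)    # type-A rows: 3
bc = sum(1 for x, y in zip(C1, C2) if x != y)    # type-B/C rows: 2
true_sim = a / (a + bc)                          # 3/5 = 0.6

random.seed(0)
rows = list(range(len(C1)))
agree, trials = 0, 100_000
for _ in range(trials):
    random.shuffle(rows)                          # a random permutation of the rows
    h1 = next(r for r in rows if C1[r])           # first row (in permuted order) with a 1
    h2 = next(r for r in rows if C2[r])
    agree += h1 == h2

print(true_sim, agree / trials)                   # both ≈ 0.6
```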
Similarity for Signatures
 We know: Pr[h(C1) = h(C2)] = sim(C1, C2)
 Now generalize to multiple hash functions

 The similarity of two signatures is the fraction of the hash functions in which they agree

 Note: Because of the Min-Hash property, the similarity of columns is the same as the expected similarity of their signatures
Min-Hashing Example
Permutation π        Input matrix             Signature matrix M
                     (Shingles x Documents)

2 4 3                1 0 1 0                  2 1 2 1
3 2 4                1 0 0 1                  2 1 4 1
7 1 7                0 1 0 1                  1 2 1 2
6 3 2                0 1 0 1
1 6 6                0 1 0 1
5 7 1                1 0 1 0
4 5 5                1 0 1 0

Similarities:   1-3   2-4   1-2   3-4
    Col/Col     0.75  0.75  0     0
    Sig/Sig     0.67  1.00  0     0


Min-Hash Signatures
 Pick K = 100 random permutations of the rows
 Think of sig(C) as a column vector
 sig(C)[i] = according to the i-th permutation π_i, the index of the first row that has a 1 in column C:
    sig(C)[i] = min(π_i(C))
 Note: The sketch (signature) of document C is small: just K integers!

 We achieved our goal! We “compressed” long bit vectors into short signatures
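In practice, the K random permutations are simulated with K random hash functions. A minimal sketch, assuming the common linear-hash scheme h(r) = (a·r + b) mod p (one option, not the slides' prescribed implementation):

```python
import random

def minhash_signature(column_rows, num_hashes=100, prime=2_147_483_647, seed=42):
    """Minhash signature of a set of row indices, using num_hashes random
    linear hash functions (a*r + b) mod prime in place of true permutations."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime))
              for _ in range(num_hashes)]
    return [min((a * r + b) % prime for r in column_rows) for a, b in coeffs]

# Two overlapping sets of shingle IDs (hypothetical), true Jaccard = 60/100 = 0.6
s1 = set(range(0, 80))
s2 = set(range(20, 100))
sig1 = minhash_signature(s1)
sig2 = minhash_signature(s2)
est = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
true = len(s1 & s2) / len(s1 | s2)
print(true, est)   # the signature agreement estimates the true similarity
```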
LSH Summary
 Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures

 Check in main memory that candidate pairs really do have similar signatures

 Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents


Summary: 3 Steps
 Shingling: Convert documents to sets
 We used hashing to assign each shingle an ID
 Min-Hashing: Convert large sets to short
signatures, while preserving similarity
 We used similarity preserving hashing to generate
signatures with property Pr[h(C1) = h(C2)] = sim(C1, C2)
 We used hashing to get around generating random
permutations
 Locality-Sensitive Hashing: Focus on pairs of
signatures likely to be from similar documents
 We used hashing to find candidate pairs of similarity ≥ s
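The banding step of LSH can be sketched end-to-end (function and variable names are illustrative; documents whose band slices collide in any band become candidates):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, b=20, r=5):
    """Split each length-(b*r) signature into b bands of r rows; pairs of
    documents whose band slices hash to the same bucket in any band
    become candidate pairs."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])   # the band's slice is the bucket key
            buckets[key].append(doc_id)
        for docs in buckets.values():
            candidates.update(combinations(sorted(docs), 2))
    return candidates

# Tiny hypothetical example: identical signatures must collide in every band
sigs = {"d1": [7] * 100, "d2": [7] * 100, "d3": list(range(100))}
print(lsh_candidates(sigs))  # {('d1', 'd2')}
```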
