5 Ch03-Lsh Class Lecture

Finding Similar Items:
Locality Sensitive Hashing

New thread: High dim. data
High dim. Graph Infinite Machine

Apps
data data data learning
Locality Filtering
PageRank, Recommen
sensitive data SVM
SimRank der systems
hashing streams
Network Web Decision Association

Clustering
Analysis advertising Trees Rules
Dimensiona Duplicate
Spam Queries on Perceptron,
lity document
Detection streams kNN
reduction detection
2
[Hays and Efros, SIGGRAPH 2007]
Scene Completion Problem
3
4
10 nearest neighbors from a collection of 20,000 images

5
10 nearest neighbors from a collection of 2 million images

6
A Common Metaphor
 Many problems can be expressed as
finding “similar” sets:
 Find near-neighbors in high-dimensional space
 Examples:
 Pages with similar words
 For duplicate detection, classification by topic
 Customers who purchased similar products
 Products with similar customer sets
 Images with similar features
 Users who visited similar websites
7
Finding Similar Items
Distance Measures
 Goal: Find near-neighbors in high-dim. space
 We formally define “near neighbors” as
points that are a “small distance” apart
 For each application, we first need to define what
“distance” means
 Today: Jaccard distance/similarity
 The Jaccard similarity of two sets is the size of their
intersection divided by the size of their union:
sim(C1, C2) = |C1C2|/|C1C2|
 Jaccard distance: d(C1, C2) = 1 - |C1C2|/|C1C2|
3 in intersection
8 in union
Jaccard similarity= 3/8
Jaccard distance = 5/8
9
Application of Set Similarity
10
Similar Documents
11
Task: Finding Similar Documents

Goal:
Given a large number ( in the millions or
billions) of documents, find “near duplicate” pairs
 Applications:
 Mirror websites, or approximate mirrors
 Don’t want to show both in search results
 Similar news articles at many news sites
 Cluster articles by “same story”
 Problems:
 Many small pieces of one document can appear
out of order in another
 Too many documents to compare all pairs
 Documents are so large or so many that they cannot
fit in main memory 12
3 Essential Steps for Similar Docs
1. Shingling: Convert documents to sets
2. Min-Hashing: Convert large sets to short

signatures, while preserving similarity
3. Locality-Sensitive Hashing: Focus on

pairs of signatures likely to be from
similar documents
 Candidate pairs!
13
The Big Picture
Candidate
Shingling
pairs:
Hashing
Locality-
those pairs
Min
Docu-
Sensitive
ment of signatures
Hashing
that we need
to test for
The set Signatures: similarity
of strings short integer
of length k vectors that
that appear represent the
in the doc- sets, and
ument reflect their
similarity
14
Shingling
Docu-
ment
The set
of strings
of length k
that appear
in the doc-
ument
Shingling
Step 1: Shingling: Convert documents to sets
Documents as High-Dim Data
 Step 1: Shingling: Convert documents to sets
 Simple approaches:
 Document = set of words appearing in document
 Document = set of “important” words
 Don’t work well for this application. Why?
 Need to account for ordering of words!

 A different way: Shingles!
16
Define: Shingles
 A k-shingle (or k-gram) for a document is a
sequence of k tokens that appears in the doc
 Tokens can be characters, words or something
else, depending on the application
 Assume tokens = characters for examples
 Example: k=2; document D1 = abcab

Set of 2-shingles: S(D1) = {ab, bc, ca}
 Option: Shingles as a bag (multiset), count ab
twice: S’(D1) = {ab, bc, ca, ab}
17
Shingles and Similarity
18
Compressing Shingles
 To compress long shingles, we can hash them
to (say) 4 bytes
 Represent a document by the set of hash
values of its k-shingles
 Idea: Two documents could (rarely) appear to have
shingles in common, when in fact only the hash-
values were shared
 Example: k=2; document D1= abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the singles: h(D1) = {1, 5, 7}
19
Similarity Metric for Shingles
 Document D1 is a set of its k-shingles C1=S(D1)
 Equivalently, each document is a
0/1 vector in the space of k-shingles
 Each unique shingle is a dimension
 Vectors are very sparse
 A natural similarity measure is the
Jaccard similarity:
sim(D1, D2) = |C1C2|/|C1C2|
20
Example: Jaccard Similarity
21
 Let’ssee the example about how to Jaccard
Similarity work?
 doc_1 = "Data is the new oil of the digital
economy"
 doc_2 = "Data is a new oil"
22
 Let’s see the example about how to Jaccard
Similarity work?
 doc_1 = "Data is the new oil of the digital
economy"
 doc_2 = "Data is a new oil“
 Let’s get the set of unique words for each
document
 words_doc1 = {'data', 'is', 'the', 'new', 'oil',
'of', 'digital', 'economy’}
 words_doc2 = {'data', 'is', 'a', 'new', 'oil'}
23
 wewill calculate the intersection and union of
these two sets of words and measure the Jaccard
Similarity between doc_1 and doc_2.
24
 wewill calculate the intersection and union of
these two sets of words and measure the Jaccard
Similarity between doc_1 and doc_2.
25
Working Assumption
 Documents that have lots of shingles in
common have similar text, even if the text
appears in different order
 Caveat: You must pick k large enough, or most

documents will have most shingles
 k = 5 is OK for short documents
 k = 10 is better for long documents
26
Motivation for Minhash/LSH
 Suppose we need to find near-duplicate
documents among million documents
 Naïvely, we would have to compute pairwise

Jaccard similarities for every pair of docs
 ≈ 5*1011 comparisons
 At 105 secs/day and 106 comparisons/sec,
it would take 5 days
 For million, it takes more than a year…

27
Min-Hash-
Shingling
ing
Docu-
ment
The set Signatures:

ument reflect their
similarity
MinHashing
Step 2: Minhashing: Convert large sets to
short signatures, while preserving similarity
From Sets to Boolean Matrices
 Rows = elements (shingles)
 Columns = sets (documents)
Documents
 1 in row e and column s if and only if
1 1 1 0
e is a member of s
 Column similarity is the Jaccard 1 1 0 1
similarity of the corresponding sets 0 1 0 1
Shingles
(rows with value 1) 0 0 0 1
 Typical matrix is sparse!
 Each document is a column:
1 0 0 1
 Example: sim(C1 ,C2) = ? 1 1 1 0
 Size of intersection = 3; size of union = 6, 1 0 1 0
Jaccard similarity (not distance) = 3/6
 d(C1,C2) = 1 – (Jaccard similarity) = 3/6 30
Example: Column Similarity
32
33
34
Four Types of rows
35
Minhashing
36
Minhashing Example
37
Minhashing Example
38
Minhashing Example
39
Minhashing Example
40
Minhashing Example
41
Surprising Property
42
Similarity of Signature
43
Min Hashing Example
44
Implementation of Minhasing
45
Implementation of Minhasing 2
46
Implementation of Minhasing 3
47
Example
48
Implementation
49
Candidate
Min-Hash-
Shingling
pairs:
Locality-
ing
Docu- those pairs
Sensitive
ment of signatures
Hashing
that we need
to test for
The set Signatures: similarity
ument reflect their
similarity

Step 3: Locality-Sensitive Hashing:
Focus on pairs of signatures likely to be from
similar documents
51
Candidate Generation from Minhash signatures
52
LSH For Minhash Signature
53
Partition Into Band
54
Partition Into Band (2)
55
Hash Function for One Bucket
56
Example Bands
57
Suppose C1, C2 are 80% Similar
58
Suppose C1, C2 Only 40% Similar
59
Analysis of LSH- What We Want
60
What One Band of One Row Gives You
61
What One Band of One Row Gives You
62
What b Bands of Rows Gives You
63
Example: b=20; r=5
64
Analysis of LSH – What We Want
Probability = 1
if t > s
Similarity threshold s
Probability
No chance
of sharing
if t < s
a bucket
Similarity t =sim(C1, C2) of two sets
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 71

Note: Another (equivalent) way is
to 1 5 1 5
Min-Hashing Example store row indexes:
2
6
3
4
1
6
3
4
2nd element of the permutation

is the first to map to a 1
Permutation  Input matrix (Shingles x Documents)

Signature matrix M
2 4 3 1 0 1 0 2 1 2 1
3 2 4 1 0 0 1 2 1 4 1
7 1 7 0 1 0 1
1 2 1 2
6 3 2 0 1 0 1
1 6 6 0 1 0 1 4th element of the permutation
is the first to map to a 1
5 7 1 1 0 1 0
4 5 5 1 0 1 0
Four Types of Rows
 Given cols C1 and C2, rows may be classified as:
C1 C2
A 1 1
B 1 0
C 0 1
D 0 0
 a = # rows of type A, etc.
 Note: sim(C1, C2) = a/(a +b +c)
 Then: Pr[h(C1) = h(C2)] = Sim(C1, C2)
 Look down the cols C1 and C2 until we see a 1
 If it’s a type-A row, then h(C1) = h(C2)
If a type-B or type-C row, then not
Similarity for Signatures
 We know: Pr[h(C1) = h(C2)] = sim(C1, C2)
 Now generalize to multiple hash functions
 The similarity of two signatures is the

fraction of the hash functions in which they
agree
 Note: Because of the Min-Hash property, the

similarity of columns is the same as the
expected similarity of their signatures
Min-Hashing Example
Permutation  Input matrix (Shingles x Documents)
Signature matrix M
2 4 3 1 0 1 0 2 1 2 1
3 2 4 1 0 0 1 2 1 4 1
7 1 7 0 1 0 1
1 2 1 2
6 3 2 0 1 0 1
1 6 6 0 1 0 1 Similarities:
1-3 2-4 1-2 3-4
5 7 1 1 0 1 0 Col/Col 0.75 0.75 0 0
4 5 5 1 0 1 0 Sig/Sig 0.67 1.00 0 0

Min-Hash Signatures
 Pick K=100 random permutations of the rows
 Think of sig(C) as a column vector
 sig(C)[i] = according to the i-th permutation, the
index of the first row that has a 1 in column C
sig(C)[i] = min (i(C))
 Note: The sketch (signature) of document C is
small bytes!
 We achieved our goal! We “compressed”

long bit vectors into short signatures
LSH Summary
 Tune M, b, r to get almost all pairs with
similar signatures, but eliminate most pairs
that do not have similar signatures
 Check in main memory that candidate pairs

really do have similar signatures
 Optional: In another pass through data,

check that the remaining candidate pairs
really represent similar documents

Summary: 3 Steps
 Shingling: Convert documents to sets
 We used hashing to assign each shingle an ID
 Min-Hashing: Convert large sets to short
signatures, while preserving similarity
 We used similarity preserving hashing to generate
signatures with property Pr[h(C1) = h(C2)] = sim(C1, C2)
 We used hashing to get around generating random
permutations
 Locality-Sensitive Hashing: Focus on pairs of
signatures likely to be from similar documents
 We used hashing to find candidate pairs of similarity  s

5 Ch03-Lsh Class Lecture

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

5 Ch03-Lsh Class Lecture

Uploaded by

Copyright:

Available Formats

Finding Similar Items:

Locality Sensitive Hashing

High dim. Graph Infinite Machine

Network Web Decision Association

Scene Completion Problem

Scene Completion Problem

Scene Completion Problem

10 nearest neighbors from a collection of 20,000 images

Scene Completion Problem

10 nearest neighbors from a collection of 2 million images

2. Min-Hashing: Convert large sets to short

3. Locality-Sensitive Hashing: Focus on

 Need to account for ordering of words!

 Example: k=2; document D1 = abcab

 Caveat: You must pick k large enough, or most

 Naïvely, we would have to compute pairwise

 For million, it takes more than a year…

The set Signatures:

Locality Sensitive Hashing

Similarity t =sim(C1, C2) of two sets

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 71

2nd element of the permutation

Permutation  Input matrix (Shingles x Documents)

 The similarity of two signatures is the

 Note: Because of the Min-Hash property, the

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 76

 We achieved our goal! We “compressed”

 Check in main memory that candidate pairs

 Optional: In another pass through data,

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 95

You might also like