Locality sensitive hashing
Dimensionality reduction
PageRank, SimRank
Spam detection
Filtering data streams
Queries on streams
SVM
Perceptron, kNN
Recommender systems
Duplicate document detection
[Hays and Efros, SIGGRAPH 2007]
Finding Similar Items
Distance Measures
Goal: Find near-neighbors in high-dim. space
We formally define “near neighbors” as
points that are a “small distance” apart
For each application, we first need to define what
“distance” means
Today: Jaccard distance/similarity
The Jaccard similarity of two sets is the size of their
intersection divided by the size of their union:
sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
Jaccard distance: d(C1, C2) = 1 - |C1 ∩ C2| / |C1 ∪ C2|
3 in intersection
8 in union
Jaccard similarity = 3/8
Jaccard distance = 5/8
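A minimal sketch of the two definitions above, assuming sets are stored as Python `set` objects. The function names and the two example sets are illustrative; the sets were chosen so that the intersection has 3 elements and the union has 8, as in the figure.

```python
def jaccard_similarity(c1, c2):
    """Size of the intersection divided by the size of the union."""
    union = c1 | c2
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(c1 & c2) / len(union)


def jaccard_distance(c1, c2):
    """Jaccard distance is one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(c1, c2)


# Hypothetical sets: 3 shared elements, 8 distinct elements overall.
c1 = {1, 2, 3, 4, 5}
c2 = {3, 4, 5, 6, 7, 8}

print(jaccard_similarity(c1, c2))  # 0.375 (= 3/8)
print(jaccard_distance(c1, c2))    # 0.625 (= 5/8)
```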
Application of Set Similarity
Similar Documents
Task: Finding Similar Documents
Goal:
Given a large number (in the millions or
billions) of documents, find “near duplicate” pairs
Applications:
Mirror websites, or approximate mirrors
Don’t want to show both in search results
Similar news articles at many news sites
Cluster articles by “same story”
Problems:
Many small pieces of one document can appear
out of order in another
Too many documents to compare all pairs
Documents are so large or so many that they cannot
fit in main memory
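3 Essential Steps for Similar Docs
1. Shingling: Convert documents to sets
2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
3. Locality-Sensitive Hashing: Find candidate pairs, the pairs of signatures that we need to test for similarity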
The Big Picture
Document
→ Shingling → The set of strings of length k that appear in the document
→ Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity
→ Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity
Shingling
Document → Shingling → The set of strings of length k that appear in the document
Step 1: Shingling: Convert documents to sets
Documents as High-Dim Data
Step 1: Shingling: Convert documents to sets
Simple approaches:
Document = set of words appearing in document
Document = set of “important” words
Don’t work well for this application. Why?
Define: Shingles
A k-shingle (or k-gram) for a document is a
sequence of k tokens that appears in the doc
Tokens can be characters, words or something
else, depending on the application
Assume tokens = characters for examples
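A short sketch of character-level shingling, assuming tokens = characters as stated above. The function name `k_shingles` is illustrative; the input string is the document abcab that these slides use as their k = 2 example.

```python
def k_shingles(text, k):
    """Return the set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


print(sorted(k_shingles("abcab", 2)))  # ['ab', 'bc', 'ca']
```

The shingle ab occurs twice in the string but only once in the result, because a document is represented as a set of shingles.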
Shingles and Similarity
Compressing Shingles
To compress long shingles, we can hash them
to (say) 4 bytes
Represent a document by the set of hash
values of its k-shingles
Idea: Two documents could (rarely) appear to have
shingles in common, when in fact only the hash-
values were shared
Example: k=2; document D1= abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
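One possible way to compress shingles to 4 bytes, sketched with the standard-library `zlib.crc32`, which returns an unsigned 32-bit integer. The choice of CRC-32 is an assumption made for illustration (the slides do not name a hash function), and the resulting values will not be the small numbers 1, 5, 7 shown in the example.

```python
import zlib


def k_shingles(text, k):
    """Return the set of all length-k character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def hashed_shingles(text, k):
    """Represent a document by the set of 32-bit hash values of its k-shingles."""
    return {zlib.crc32(s.encode("utf-8")) for s in k_shingles(text, k)}


h_d1 = hashed_shingles("abcab", 2)
print(len(h_d1))  # at most 3: one value per distinct shingle, fewer on a collision
```

Two different shingles can map to the same 32-bit value, which is the rare false match described above.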
Similarity Metric for Shingles
Document D1 is a set of its k-shingles C1=S(D1)
Equivalently, each document is a
0/1 vector in the space of k-shingles
Each unique shingle is a dimension
Vectors are very sparse
A natural similarity measure is the
Jaccard similarity:
sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
Example: Jaccard Similarity
Let’s see an example of how Jaccard similarity works.
doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"
Let’s get the set of unique words for each document:
words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}
words_doc2 = {'data', 'is', 'a', 'new', 'oil'}
Example: Jaccard Similarity
We will calculate the intersection and union of
these two sets of words and measure the Jaccard
similarity between doc_1 and doc_2.
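The calculation can be sketched as follows. The two documents are the ones given in the example; lowercasing and splitting on whitespace is an assumed tokenization that happens to reproduce the word sets listed above.

```python
doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"

# Assumed tokenization: lowercase, then split on whitespace.
words_doc1 = set(doc_1.lower().split())
words_doc2 = set(doc_2.lower().split())

intersection = words_doc1 & words_doc2
union = words_doc1 | words_doc2
similarity = len(intersection) / len(union)

print(sorted(intersection))  # ['data', 'is', 'new', 'oil']
print(len(union))            # 9
print(round(similarity, 3))  # 0.444 (= 4/9)
```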
Working Assumption
Documents that have lots of shingles in
common have similar text, even if the text
appears in different order
Motivation for Minhash/LSH
Suppose we need to find near-duplicate
documents among a million documents
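To get a feel for the scale, the number of distinct pairs among N documents is N(N-1)/2. A quick check, assuming N = 1,000,000 documents:

```python
import math

n_docs = 1_000_000  # assumed collection size
n_pairs = math.comb(n_docs, 2)  # number of unordered pairs, N(N-1)/2

print(n_pairs)  # 499999500000, roughly 5 * 10**11 pairwise comparisons
```

Computing an exact Jaccard similarity for every one of these pairs is what Minhashing and LSH are meant to avoid.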
MinHashing
Step 2: Minhashing: Convert large sets to
short signatures, while preserving similarity
From Sets to Boolean Matrices
Rows = elements (shingles)
Columns = sets (documents)
1 in row e and column s if and only if e is a member of s
Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
Typical matrix is sparse!
Each document is a column (rows = shingles, columns = documents):
1 1 1 0
1 1 0 1
0 1 0 1
0 0 0 1
1 0 0 1
1 1 1 0
1 0 1 0
Example: sim(C1, C2) = ?
Size of intersection = 3; size of union = 6,
Jaccard similarity (not distance) = 3/6
d(C1, C2) = 1 – (Jaccard similarity) = 3/6
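The column similarity in this example can be checked with a small sketch. The matrix is the 7 × 4 example matrix from this slide, stored as a list of rows; the function name `column_overlap` is illustrative.

```python
# Rows = shingles, columns = documents (the example matrix from the slide).
matrix = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]


def column_overlap(matrix, i, j):
    """Return (intersection size, union size) for columns i and j."""
    both = sum(1 for row in matrix if row[i] == 1 and row[j] == 1)
    either = sum(1 for row in matrix if row[i] == 1 or row[j] == 1)
    return both, either


both, either = column_overlap(matrix, 0, 1)  # columns C1 and C2
print(both, either)       # 3 6
print(both / either)      # 0.5 (Jaccard similarity)
print(1 - both / either)  # 0.5 (Jaccard distance)
```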
Example: Column Similarity
Four Types of Rows
Minhashing
Minhashing Example
Surprising Property
Similarity of Signatures
Minhashing Example
Implementation of Minhashing
Implementation of Minhashing (2)
Implementation of Minhashing (3)
Example
Implementation
Document
→ Shingling → The set of strings of length k that appear in the document
→ Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity
→ Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity
Candidate Generation from Minhash Signatures
LSH for Minhash Signatures
Partition into Bands
Partition into Bands (2)
Hash Function for One Bucket
Example Bands
Suppose C1, C2 are 80% Similar
Suppose C1, C2 Only 40% Similar
Analysis of LSH – What We Want
What One Band of One Row Gives You
What b Bands of r Rows Gives You
Example: b=20; r=5
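The bodies of the banding slides are not reproduced in this text, so the following is only a sketch under the standard assumptions of the banding technique: the signature is split into b bands of r rows each, two columns become a candidate pair if they agree on every row of at least one band, and each signature row agrees independently with probability s (the Jaccard similarity of the two sets). Under those assumptions the probability of becoming a candidate pair is 1 - (1 - s^r)^b. The function name `candidate_probability` is illustrative.

```python
def candidate_probability(s, b, r):
    """Probability that two columns with similarity s share at least one band.

    Assumes b bands of r rows, with rows agreeing independently with probability s.
    """
    all_rows_of_one_band_agree = s ** r
    no_band_agrees = (1.0 - all_rows_of_one_band_agree) ** b
    return 1.0 - no_band_agrees


# The parameters and similarities named in the slide titles: b=20, r=5; 80% and 40%.
p_80 = candidate_probability(0.8, b=20, r=5)
p_40 = candidate_probability(0.4, b=20, r=5)
print(round(p_80, 4))  # about 0.9996
print(round(p_40, 4))  # about 0.186
```

With these settings a pair that is 80% similar is almost certain to become a candidate, while a pair that is 40% similar becomes one less than a fifth of the time.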
Analysis of LSH – What We Want
[Plot: probability of sharing a bucket, with similarity threshold s]
Probability = 1 if t > s
No chance if t < s
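Permutations and input matrix (rows = shingles, columns = documents):
π1 π2 π3 | C1 C2 C3 C4
 2  4  3 |  1  0  1  0
 3  2  4 |  1  0  0  1
 7  1  7 |  0  1  0  1
 6  3  2 |  0  1  0  1
 1  6  6 |  0  1  0  1
 5  7  1 |  1  0  1  0
 4  5  5 |  1  0  1  0
Signature matrix:
     C1 C2 C3 C4
π1    2  1  2  1
π2    2  1  4  1
π3    1  2  1  2
4th element of the permutation is the first to map to a 1 (the entry 4 in the signature matrix)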
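The signature matrix can be recomputed from the three permutation columns and the input matrix above. This sketch assumes that each number in a permutation column gives the position of that row in the permuted order, so the minhash of a column is the smallest position among the rows where the column has a 1. That reading is an interpretation of the figure, but it reproduces the signature values shown. The names `row_positions` and `minhash_signatures` are illustrative.

```python
# One list per permutation: entry r is the position of row r in the permuted order.
row_positions = [
    [2, 3, 7, 6, 1, 5, 4],
    [4, 2, 1, 3, 6, 7, 5],
    [3, 4, 7, 2, 6, 1, 5],
]

# Rows = shingles, columns = documents.
input_matrix = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]


def minhash_signatures(input_matrix, row_positions):
    """Return one signature row per permutation, one entry per column."""
    n_cols = len(input_matrix[0])
    signatures = []
    for positions in row_positions:
        signature_row = [
            min(positions[r] for r, row in enumerate(input_matrix) if row[col] == 1)
            for col in range(n_cols)
        ]
        signatures.append(signature_row)
    return signatures


signature_matrix = minhash_signatures(input_matrix, row_positions)
for signature_row in signature_matrix:
    print(signature_row)
# [2, 1, 2, 1]
# [2, 1, 4, 1]
# [1, 2, 1, 2]
```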
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Four Types of Rows
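Given cols C1 and C2, rows may be classified as:
     C1  C2
A     1   1
B     1   0
C     0   1
D     0   0
a = # rows of type A, etc.
Note: sim(C1, C2) = a/(a + b + c)
Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
Look down the cols C1 and C2, in permuted order, until we see a 1
If it’s a type-A row, then h(C1) = h(C2)
If a type-B or type-C row, then not
The first such row is equally likely to be any of the a + b + c rows of type A, B or C, so the probability of agreement is a/(a + b + c)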
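The claim can be checked exhaustively on a small case. This sketch counts the row types for two columns and then enumerates every permutation of the 7 rows to measure how often the two minhash values agree. The two columns are the first and third columns of the minhashing example matrix; the function names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

# First and third columns of the minhashing example matrix.
c1 = [1, 1, 0, 0, 0, 1, 1]
c3 = [1, 0, 0, 0, 0, 1, 1]


def row_type_counts(col_x, col_y):
    """Count rows of type A (1,1), B (1,0), C (0,1) and D (0,0)."""
    a = sum(1 for x, y in zip(col_x, col_y) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(col_x, col_y) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(col_x, col_y) if x == 0 and y == 1)
    d = len(col_x) - a - b - c
    return a, b, c, d


def minhash(column, order):
    """Position, in the permuted row order, of the first row holding a 1."""
    for position, row in enumerate(order):
        if column[row] == 1:
            return position
    return None  # column with no 1s


def agreement_probability(col_x, col_y):
    """Exact Pr[h(X) = h(Y)] over all permutations of the rows."""
    orders = list(permutations(range(len(col_x))))
    agree = sum(1 for order in orders if minhash(col_x, order) == minhash(col_y, order))
    return Fraction(agree, len(orders))


a, b, c, d = row_type_counts(c1, c3)
print(a, b, c, d)                     # 3 1 0 3
print(Fraction(a, a + b + c))         # 3/4
print(agreement_probability(c1, c3))  # 3/4
```

Type-D rows never affect the outcome, which is why d does not appear in the formula.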
Similarity for Signatures
We know: Pr[h(C1) = h(C2)] = sim(C1, C2)
Now generalize to multiple hash functions
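Permutations and input matrix (rows = shingles, columns = documents):
π1 π2 π3 | C1 C2 C3 C4
 2  4  3 |  1  0  1  0
 3  2  4 |  1  0  0  1
 7  1  7 |  0  1  0  1
 6  3  2 |  0  1  0  1
 1  6  6 |  0  1  0  1
 5  7  1 |  1  0  1  0
 4  5  5 |  1  0  1  0
Signature matrix:
     C1 C2 C3 C4
π1    2  1  2  1
π2    2  1  4  1
π3    1  2  1  2
Similarities:
          1-3   2-4   1-2   3-4
Col/Col   0.75  0.75  0     0
Sig/Sig   0.67  1.00  0     0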
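The similarity table can be reproduced with a short sketch. It assumes that the similarity of two signatures is the fraction of signature rows (hash functions) on which they agree, which matches the 0.67 and 1.00 entries above. The function names are illustrative; the matrices are the ones from the example.

```python
# Rows = shingles, columns = documents.
input_matrix = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]

# One row per permutation (hash function), one entry per document.
signature_matrix = [
    [2, 1, 2, 1],
    [2, 1, 4, 1],
    [1, 2, 1, 2],
]


def column_jaccard(matrix, i, j):
    """Jaccard similarity of columns i and j of a 0/1 matrix."""
    both = sum(1 for row in matrix if row[i] == 1 and row[j] == 1)
    either = sum(1 for row in matrix if row[i] == 1 or row[j] == 1)
    return both / either


def signature_similarity(signatures, i, j):
    """Fraction of signature rows in which columns i and j agree."""
    agree = sum(1 for row in signatures if row[i] == row[j])
    return agree / len(signatures)


# Column pairs from the table, converted to 0-based indices.
for i, j in [(0, 2), (1, 3), (0, 1), (2, 3)]:
    col_sim = column_jaccard(input_matrix, i, j)
    sig_sim = signature_similarity(signature_matrix, i, j)
    print(f"{i + 1}-{j + 1}", round(col_sim, 2), round(sig_sim, 2))
# 1-3 0.75 0.67
# 2-4 0.75 1.0
# 1-2 0.0 0.0
# 3-4 0.0 0.0
```

With only three hash functions the signature similarity is a coarse estimate; more hash functions bring it closer to the column similarity on average.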