You are on page 1of 2

Set Similarity

Finding Near Duplicates „ Set Similarity (Jaccard measure)


Ci I C j
simJ(Ci , C j ) =
Ci U C j
„ View sets as columns of a matrix; one row for
each element in the universe. aij = 1 indicates
presence of item i in set j
(Adapted from slides and material
„ Example
from Rajeev Motwani and Jeff Ullman) C1 C2

0 1
1 0
1 1 simJ(C1,C2) = 2/5 = 0.4
0 0
1 1
0 1

Identifying Similar Sets? Key Observation


„ Signature Idea „ For columns Ci, Cj, four types of rows
„ Hash columns Ci to signature sig(Ci) Ci Cj
„ simJ(Ci,Cj) approximated by A 1 1
simH(sig(Ci),sig(Cj)) B 1 0
„ Naï
ve Approach C 0 1
„ Sample P rows uniformly at random D 0 0
„ Define sig(Ci) as P bits of Ci in sample „ Overload notation: A = # of rows of type A
„ Problem „ Claim
sparsity Æ would miss interesting part of columns A
simJ(Ci , C j ) =
„

„ sample would get only 0’s in columns A + B+C

Min Hashing Min-Hash Signatures


„ Randomly permute rows „ Pick – P random row permutations
„ Hash h(Ci) = index of first row with 1 in „ MinHash Signature
column Ci
„ Suprising Property sig(C) = list of P indexes of first rows with 1 in
column C
[ ]
P h(Ci ) = h(C j ) = simJ (Ci , C j )
„ Why?
„ Similarity of signatures
„ Both are A/(A+B+C)
„ Let simH(sig(Ci),sig(Cj)) = fraction of
„ Look down columns Ci, Cj until first non-Type-
permutations where MinHash values agree
D row
„ Observe E[simH(sig(Ci),sig(Cj))] = simJ(Ci,Cj)
„ h(Ci) = h(Cj) ÅÆ type A row

1
Example Implementation Trick
Signatures
S1 S2 S3 „ Permuting rows even once is prohibitive
Perm 1 = (12345) 1 2 1 „ Row Hashing
C1 C2 C3 Perm 2 = (54321) 4 5 4
„ Pick P hash functions hk: {1,…,n}Æ{1,…,O(n)}
R1 1 0 1 Perm 3 = (34512) 3 5 4
„ Ordering under hk gives random row
R2 0 1 1
permutation
R3 1 0 0
R4 1 0 1 „ One-pass Implementation
Similarities
R5 0 1 0 1-2 1-3 2-3 „ For each Ci and hk, keep “slot” for min-hash
Col-Col 0.00 0.50 0.25 value
Sig-Sig 0.00 0.67 0.00 „ Initialize all slot(Ci,hk) to infinity
„ Scan rows in arbitrary order looking for 1’s
„ Suppose row Rj has 1 in column Ci
„ For each hk,

Example Comparing Signatures


C1 C2 C1 slots C2 slots
R1 1 0 h(1) = 1 1 - „ Signature Matrix S
R2 0 1 g(1) = 3 3 - „ Rows = Hash Functions
R3 1 1 „ Columns = Columns
R4 1 0 h(2) = 2 1 2 „ Entries = Signatures
g(2) = 0 3 0
R5 0 1 „ Compute – Pair-wise similarity of signature columns
h(3) = 3 1 2 „ Problem
g(3) = 2 2 0 „ MinHash fits column signatures in memory
„ But comparing signature-pairs takes too much time
h(4) = 4 1 2
h(x) = x mod 5 „ Technique to limit candidate pairs?
g(4) = 4 2 0
g(x) = 2x+1 mod 5 „ A-Priori does not work
h(5) = 0 1 0 „ Locality Sensitive Hashing (LSH)
g(5) = 1 2 0

You might also like