
# Finding Similar Items

A Common Metaphor
Many problems can be expressed as finding "similar" sets:
Find near-neighbors in high-dimensional space

Examples:
Pages with similar words (for duplicate detection, classification by topic)
Customers who purchased similar products (products with similar customer sets)
Images with similar features
Users who visited similar websites
Slides by Jure Leskovec: Mining Massive Datasets

Distance Measures
We formally define "near neighbors" as points that are a "small distance" apart
For each use case, we need to define what "distance" means
Today: Jaccard similarity/distance
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
$\mathrm{sim}(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$
The Jaccard distance is 1 minus the Jaccard similarity:
$d(C_1, C_2) = 1 - |C_1 \cap C_2| / |C_1 \cup C_2|$
Example: 3 elements in the intersection and 8 in the union give Jaccard similarity = 3/8 and Jaccard distance = 5/8
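
The definition above can be sketched in a few lines of Python. The function names and the two example sets are illustrative assumptions (chosen so that 3 elements are shared out of 8 overall), not part of the slides; treating two empty sets as identical is also just a convention picked here.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_distance(a: set, b: set) -> float:
    """1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Hypothetical sets with 3 shared elements and 8 elements in the union.
c1 = {1, 2, 3, 4, 5}
c2 = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(c1, c2))  # 0.375 (= 3/8)
print(jaccard_distance(c1, c2))    # 0.625 (= 5/8)
```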

Distance Measures
Two major classes of distance measures:
A Euclidean distance is based on the locations of points in a Euclidean space
A non-Euclidean distance is based on properties of points, but not their "location" in a space

Some Euclidean Distances
L2 norm: d(p,q) = square root of the sum of the squares of the differences between p and q in each dimension:
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
The most common notion of "distance"

L1 norm: sum of the absolute differences in each dimension:
$d(p, q) = \sum_i |p_i - q_i|$
Also known as the Manhattan distance
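
As a minimal sketch of the two norms described above, assuming points are given as equal-length sequences of numbers (the function names are illustrative):

```python
import math


def l2_distance(p, q):
    """Square root of the sum of squared per-dimension differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def l1_distance(p, q):
    """Sum of absolute per-dimension differences (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))


p, q = (0, 0), (3, 4)
print(l2_distance(p, q))  # 5.0
print(l1_distance(p, q))  # 7
```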

Non-Euclidean Distances: Cosine
Think of a point as a vector from the origin (0,0,…,0) to its location
Two vectors make an angle, whose cosine is the normalized dot product of the vectors:
$\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$
Example: A = 00111; B = 10011
$A \cdot B = 2$; $\|A\| = \|B\| = \sqrt{3}$, so $\cos\theta = 2/3$
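
The cosine example can be checked numerically. This is a sketch under the assumption that vectors are plain Python lists; reporting the angle in degrees is one common way to turn the cosine into a distance, not something the slide prescribes.

```python
import math


def cosine_similarity(a, b):
    """Normalized dot product of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def angle_degrees(a, b):
    """Angle between the vectors, clamped against rounding error."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))


A = [0, 0, 1, 1, 1]
B = [1, 0, 0, 1, 1]
print(cosine_similarity(A, B))  # close to 2/3
print(angle_degrees(A, B))      # roughly 48 degrees
```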

Non-Euclidean Distances: Jaccard
The Jaccard similarity of two sets is the size of their intersection divided by the size of their union:
$\mathrm{sim}(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$
The Jaccard distance between sets is 1 minus their Jaccard similarity:
$d(C_1, C_2) = 1 - |C_1 \cap C_2| / |C_1 \cup C_2|$
Example: 3 elements in the intersection and 8 in the union give Jaccard similarity = 3/8 and Jaccard distance = 5/8

Finding Similar Items

Finding Similar Documents
Goal: Given a large number (N in the millions or billions) of text documents, find pairs that are "near duplicates"
Applications:
Mirror websites, or approximate mirrors (we don't want to show both in a search)
Similar news articles at many news sites (cluster articles by "same story")

Problems:
Many small pieces of one doc can appear out of order in another
Too many docs to compare all pairs
Docs are so large or so many that they cannot fit in main memory

3 Essential Steps for Similar Docs
1. Shingling: Convert documents, emails, etc., to sets
2. Minhashing: Convert large sets to short signatures, while preserving similarity
3. Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents

The Big Picture
[Figure: pipeline. Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.]

Documents as High-Dim Data
Step 1: Shingling: Convert documents, emails, etc., to sets
Simple approaches:
Document = set of words appearing in doc
Document = set of "important" words
These don't work well for this application. Why?

We need to account for the ordering of words

Define: Shingles
A k-shingle is a sequence of k tokens that appears in the doc
Tokens can be characters, words, or something else, depending on the application
Assume tokens = characters for examples

Example: k = 2; D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Option: Treat shingles as a bag (multiset) and count ab twice
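
A minimal sketch of character-level shingling, reproducing the k = 2 example. The function names are illustrative, and the bag variant uses `collections.Counter` as one possible multiset representation.

```python
from collections import Counter


def shingles(doc: str, k: int) -> set:
    """Set of all length-k character substrings of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc: str, k: int) -> Counter:
    """Bag (multiset) variant: repeated shingles are counted."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


print(sorted(shingles("abcab", 2)))  # ['ab', 'bc', 'ca']
print(shingle_bag("abcab", 2)["ab"])  # 2
```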

Compressing Shingles
To compress long shingles, we can hash them to (say) 4 bytes
A document is then represented by the set of hash values of its k-shingles
Caveat: Two documents could (rarely) appear to have shingles in common, when in fact only the hash values were shared
Example: k = 2; D1 = abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
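
One way to hash shingles to 4 bytes is sketched below. Using CRC-32 from the standard `zlib` module is an assumption made for illustration; the slides do not name a hash function, and the resulting values will not be the {1, 5, 7} of the toy example.

```python
import zlib


def hashed_shingles(doc: str, k: int) -> set:
    """Set of 32-bit (4-byte) hash values of the k-shingles of doc."""
    return {
        zlib.crc32(doc[i:i + k].encode("utf-8"))
        for i in range(len(doc) - k + 1)
    }


hashes = hashed_shingles("abcab", 2)
print(len(hashes))                         # one value per distinct shingle here
print(all(h < 2 ** 32 for h in hashes))    # every value fits in 4 bytes
```

With longer shingles, distinct shingles can occasionally collide on the same 4-byte value, which is the rare false sharing the slide warns about.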

Working Assumption
Documents that have lots of shingles in common have similar text, even if the text appears in different order
Careful: You must pick k large enough, or most documents will have most shingles
k = 5 is OK for short documents
k = 10 is better for long documents

Motivation for Minhash/LSH
Suppose we need to find near-duplicate documents among N = 1 million documents
Naïvely, we'd have to compute pairwise Jaccard similarities for every pair of docs
i.e., $N(N-1)/2 \approx 5 \times 10^{11}$ comparisons
At $10^5$ secs/day and $10^6$ comparisons/sec, it would take 5 days

For N = 10 million, it takes more than a year
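
The arithmetic behind these estimates can be reproduced directly. The function name is illustrative; the default rates are the slide's round figures ($10^6$ comparisons/sec, $10^5$ secs/day).

```python
def all_pairs_days(n_docs: int,
                   comparisons_per_sec: int = 10 ** 6,
                   secs_per_day: int = 10 ** 5) -> float:
    """Days needed to compare every pair of n_docs documents."""
    pairs = n_docs * (n_docs - 1) // 2
    return pairs / (comparisons_per_sec * secs_per_day)


print(all_pairs_days(10 ** 6))  # just under 5 days
print(all_pairs_days(10 ** 7))  # just under 500 days, i.e. more than a year
```

Because the pair count grows quadratically, a 10x increase in documents costs 100x the time.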

MinHashing
Step 2: Minhashing: Convert large sets to short signatures, while preserving similarity
[Figure: pipeline. Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets and reflect their similarity.]

Encoding Sets as Bit Vectors
Many similarity problems can be formalized as finding subsets that have significant intersection
Encode sets using 0/1 (bit, boolean) vectors
One dimension per element in the universal set

Interpret set intersection as bitwise AND, and set union as bitwise OR
Example: C1 = 10111; C2 = 10011

Size of intersection = 3; size of union = 4
Jaccard similarity (not distance) = 3/4
$d(C_1, C_2) = 1 - (\text{Jaccard similarity}) = 1/4$
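
The AND/OR interpretation can be sketched with Python integers used as bit vectors. The function name is illustrative, and returning 1.0 for two all-zero vectors is an assumed convention.

```python
def jaccard_bits(x: int, y: int) -> float:
    """Jaccard similarity of two sets encoded as bit vectors."""
    intersection = bin(x & y).count("1")  # bitwise AND = set intersection
    union = bin(x | y).count("1")         # bitwise OR = set union
    return intersection / union if union else 1.0


c1 = int("10111", 2)
c2 = int("10011", 2)
print(jaccard_bits(c1, c2))      # 0.75
print(1 - jaccard_bits(c1, c2))  # 0.25
```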

From Sets to Boolean Matrices
Rows = elements of the universal set
Columns = sets
1 in row e and column s if and only if e is a member of s
Column similarity is the Jaccard similarity of the sets of their rows with 1
Typical matrix is sparse
[Figure: example boolean matrix.]

Example: Jaccard of Columns
Each document is a column:
Example: C1 = 1100011; C2 = 0110010
Size of intersection = 2; size of union = 5
Jaccard similarity (not distance) = 2/5
$d(C_1, C_2) = 1 - (\text{Jaccard similarity}) = 3/5$

Note: We might not really represent the data by a boolean matrix
Sparse matrices are usually better represented by the list of places where there is a non-zero value
[Figure: boolean matrix with shingles as rows and documents as columns.]
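
The sparse representation mentioned in the note can be sketched by storing, for each column, only the indices of the rows that hold a 1. The helper name and the 0-based row numbering are assumptions made for illustration.

```python
def column_to_rows(bits: str) -> set:
    """Indices (0-based) of the rows in which the column has a 1."""
    return {i for i, bit in enumerate(bits) if bit == "1"}


c1 = column_to_rows("1100011")  # {0, 1, 5, 6}
c2 = column_to_rows("0110010")  # {1, 2, 5}
intersection = len(c1 & c2)     # 2
union = len(c1 | c2)            # 5
print(intersection / union)     # 0.4 (= 2/5)
```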

Outline: Finding Similar Columns
So far:
Documents → Sets of shingles
Represent sets as boolean vectors in a matrix

Next goal: Find similar columns, using small signatures
Approach:
1) Signatures of columns: small summaries of columns
2) Examine pairs of signatures to find similar columns
Essential: Similarities of signatures and columns are related
3) Optional: check that columns with similar signatures are really similar

Warnings:
Comparing all pairs may take too much time: job for LSH
These methods can produce false negatives, and even false positives

Hashing Columns (Signatures)
Key idea: "hash" each column C to a small signature h(C), such that:
(1) h(C) is small enough that the signature fits in RAM
(2) sim(C1, C2) is the same as the "similarity" of signatures h(C1) and h(C2)

Goal: Find a hash function h() such that:
if sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
if sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)

Hash docs into buckets, and expect that "most" pairs of near-duplicate docs hash into the same bucket

MinHashing
Goal: Find a hash function h() such that:
if sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
if sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)

Clearly, the hash function depends on the similarity metric:
Not all similarity metrics have a suitable hash function

There is a suitable hash function for Jaccard similarity: Minhashing

MinHashing
Imagine the rows of the boolean matrix permuted under a random permutation $\pi$
Define a "hash" function $h_\pi(C)$ = the number of the first (in the permuted order $\pi$) row in which column C has value 1:
$h_\pi(C) = \min \pi(C)$
Use several (e.g., 100) independent hash functions to create a signature of a column
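
A minimal sketch of a single minhash, assuming a column is stored as the set of its row indices and a permutation is a list where `permutation[r]` is the position of original row `r` in the permuted order. The specific permutation and columns below are made up for illustration.

```python
def minhash(column_rows: set, permutation: list) -> int:
    """Position, in the permuted order, of the first row where the column has a 1."""
    return min(permutation[r] for r in column_rows)


# Hypothetical permutation of 5 rows: row 0 moves to position 3, row 1 to 0, ...
perm = [3, 0, 4, 1, 2]
print(minhash({0, 2, 4}, perm))  # 2  (positions 3, 4, 2 -> minimum is 2)
print(minhash({1, 3}, perm))     # 0  (positions 0, 1 -> minimum is 0)
```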

MinHashing Example
[Figure: three row permutations applied to the input matrix (shingles × documents) to produce the signature matrix M. The permutations, read down the rows, are (1, 3, 7, 6, 2, 5, 4), (4, 2, 1, 3, 6, 7, 5), and (3, 4, 7, 6, 1, 2, 5).]

Surprising Property
Choose a random permutation $\pi$
Then $\Pr[h_\pi(C_1) = h_\pi(C_2)] = \mathrm{sim}(C_1, C_2)$
Why?
Let X be a set of shingles, $X \subseteq [2^{64}]$, and $y \in X$
Then: $\Pr[\pi(y) = \min(\pi(X))] = 1/|X|$
It is equally likely that any $y \in X$ is mapped to the min element
Let x be s.t. $\pi(x) = \min(\pi(C_1 \cup C_2))$
Then either: $\pi(x) = \min(\pi(C_1))$ if $x \in C_1$, or $\pi(x) = \min(\pi(C_2))$ if $x \in C_2$
So the prob. that both are true is the prob. $x \in C_1 \cap C_2$
$\Pr[\min(\pi(C_1)) = \min(\pi(C_2))] = |C_1 \cap C_2| / |C_1 \cup C_2| = \mathrm{sim}(C_1, C_2)$
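
The property can be verified exactly on a tiny universe by enumerating every permutation rather than sampling. This brute-force check is only an illustration (it is feasible for a handful of rows, here 5! = 120 permutations); the sets and names are assumptions.

```python
from fractions import Fraction
from itertools import permutations


def minhash_agreement(c1: set, c2: set, n_rows: int) -> Fraction:
    """Exact fraction of row permutations under which both minhashes agree."""
    agree = total = 0
    for perm in permutations(range(n_rows)):
        total += 1
        if min(perm[r] for r in c1) == min(perm[r] for r in c2):
            agree += 1
    return Fraction(agree, total)


c1 = {0, 1, 2}
c2 = {1, 2, 3}  # row 4 of the 5-row universe belongs to neither set
jaccard = Fraction(len(c1 & c2), len(c1 | c2))
print(minhash_agreement(c1, c2, n_rows=5), jaccard)  # 1/2 1/2
```

Rows that appear in neither column (row 4 here) do not affect the result, matching the argument that only the first row of $C_1 \cup C_2$ matters.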

Similarity for Signatures
We know: $\Pr[h_\pi(C_1) = h_\pi(C_2)] = \mathrm{sim}(C_1, C_2)$
Now generalize to multiple hash functions
The similarity of two signatures is the fraction of the hash functions in which they agree
Note: Because of the minhash property, the similarity of columns is the same as the expected similarity of their signatures
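
Signature similarity as defined above is just the fraction of positions that match. A small sketch, with made-up signatures and an illustrative function name:

```python
def signature_similarity(sig1, sig2) -> float:
    """Fraction of hash functions (positions) on which two signatures agree."""
    if len(sig1) != len(sig2) or not sig1:
        raise ValueError("signatures must be non-empty and of equal length")
    agree = sum(a == b for a, b in zip(sig1, sig2))
    return agree / len(sig1)


# Hypothetical 4-entry signatures that agree in 3 of 4 positions.
print(signature_similarity([2, 1, 4, 1], [2, 1, 3, 1]))  # 0.75
```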

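MinHashing Example
[Figure: input matrix, the three row permutations (1, 3, 7, 6, 2, 5, 4), (4, 2, 1, 3, 6, 7, 5), (3, 4, 7, 6, 1, 2, 5), and the resulting signature matrix M.]

Similarities:

| Pair    | 1-3  | 2-4  | 1-2 | 3-4 |
|---------|------|------|-----|-----|
| Col/Col | 0.75 | 0.75 | 0   | 0   |
| Sig/Sig | 0.67 | 1.00 | 0   | 0   |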

MinHash Signatures
Pick 100 random permutations of the rows
Think of sig(C) as a column vector
Let sig(C)[i] = according to the i-th permutation, the index of the first row that has a 1 in column C:
$\mathrm{sig}(C)[i] = \min(\pi_i(C))$
Note: The sketch (signature) of document C is small: ~100 bytes!
We achieved our goal! We "compressed" long bit vectors into short signatures
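
The construction above might be sketched as follows, using explicit random permutations. This is a toy under stated assumptions: columns are sets of row indices, the universe has only 1,000 rows so that storing full permutations is cheap, and the seed and example columns are arbitrary.

```python
import random


def random_permutations(n_rows: int, count: int, seed: int = 0) -> list:
    """count independent random permutations of the rows 0..n_rows-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(count):
        perm = list(range(n_rows))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def minhash_signature(column_rows: set, perms: list) -> list:
    """sig(C)[i] = position of the first row with a 1 under the i-th permutation."""
    return [min(perm[r] for r in column_rows) for perm in perms]


perms = random_permutations(n_rows=1000, count=100)
c1 = set(range(0, 80))
c2 = set(range(20, 100))  # true Jaccard similarity: 60/100 = 0.6
sig1 = minhash_signature(c1, perms)
sig2 = minhash_signature(c2, perms)
estimate = sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)
print(estimate)  # an estimate that should land near 0.6
```

With 100 hash functions the estimate carries sampling noise of a few percentage points, so it approximates rather than equals the true similarity.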

Locality-Sensitive Hashing
Step 3: Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents
[Figure: pipeline. Document → the set of strings of length k that appear in the document → signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity.]

LSH: First Cut
Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8)
LSH general idea: Use a function f(x,y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
For minhash matrices:
Hash columns of signature matrix M to many buckets
Each pair of documents that hashes into the same bucket is a candidate pair

Candidates from Minhash
Pick a similarity threshold s, a fraction < 1
Columns x and y of M are a candidate pair if their signatures agree on at least fraction s of their rows:
M(i, x) = M(i, y) for at least frac. s values of i
We expect documents x and y to have the same similarity as their signatures

LSH for Minhash
Big idea: Hash columns of signature matrix M several times
Arrange that (only) similar columns are likely to hash to the same bucket, with high probability
Candidate pairs are those that hash to the same bucket

Partition M into Bands
[Figure: signature matrix M divided into b bands of r rows per band; each column is one signature.]

Partition M into Bands
Divide matrix M into b bands of r rows
For each band, hash its portion of each column to a hash table with k buckets
Make k as large as possible
Candidate column pairs are those that hash to the same bucket for at least 1 band
Tune b and r to catch most similar pairs, but few non-similar pairs
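
The banding step can be sketched as below. As an assumption for simplicity, each band's slice of a signature is used directly as a dictionary key, which behaves like a hash table with so many buckets that two columns share a bucket only when they are identical in that band. The function name, document ids, and signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures: dict, bands: int, rows_per_band: int) -> set:
    """Pairs of doc ids whose signatures are identical in at least one band."""
    candidates = set()
    for band in range(bands):
        start = band * rows_per_band
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[start:start + rows_per_band])  # this band's rows
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates


# Hypothetical signatures of length 4, split into b = 2 bands of r = 2 rows.
sigs = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 7, 7]}
print(lsh_candidate_pairs(sigs, bands=2, rows_per_band=2))  # {('A', 'B')}
```

Only pairs that collide in some band are ever compared, so the cost depends on bucket sizes rather than on all N(N-1)/2 pairs.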

Hashing Bands
[Figure: matrix M split into b bands of r rows, with each band hashed to buckets. Columns 2 and 6 are probably identical (candidate pair); columns 6 and 7 are surely different.]

Simplifying Assumption
There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
Hereafter, we assume that "same bucket" means "identical in that band"
Assumption needed only to simplify analysis, not for correctness of algorithm

Example of Bands

Assume the following case:
Suppose 100,000 columns of M (100k docs)
Signatures of 100 integers (rows)
Therefore, signatures take 40 MB (100,000 × 100 integers at 4 bytes each)
Choose 20 bands of 5 integers/band
Goal: Find pairs of documents that are at least s = 80% similar

C1, C2 are 80% Similar
Assume: C1, C2 are 80% similar

Since s = 80% we want C1, C2 to hash to at least one common bucket (at least one band is identical)

Probability C1, C2 identical in one particular band: $(0.8)^5 = 0.328$
Probability C1, C2 are not identical in any of the 20 bands: $(1 - 0.328)^{20} = 0.00035$
i.e., about 0.035% of the 80%-similar column pairs are false negatives
We would find 99.965% of the pairs of truly similar documents

C1, C2 are 30% Similar
Assume: C1, C2 are 30% similar

Since s = 80% we want C1, C2 to hash to NO common buckets (all bands should be different)

Probability C1, C2 identical in one particular band: $(0.3)^5 = 0.00243$
Probability C1, C2 identical in at least 1 of 20 bands: $1 - (1 - 0.00243)^{20} = 0.0474$
In other words, approximately 4.74% of pairs of docs with similarity 30% end up becoming candidate pairs: false positives
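
The two worked cases (80% and 30% similar, with b = 20 bands of r = 5 rows) can be recomputed as a sanity check. Function names are illustrative; small differences in the last digit against the slide figures come from the slides rounding intermediate values.

```python
def prob_band_identical(s: float, r: int) -> float:
    """Probability that all r rows of one band agree, given similarity s."""
    return s ** r


def prob_candidate(s: float, r: int, b: int) -> float:
    """Probability that at least one of b bands is identical."""
    return 1 - (1 - prob_band_identical(s, r)) ** b


r, b = 5, 20
false_negative_rate = 1 - prob_candidate(0.8, r, b)  # 80%-similar pair missed
false_positive_rate = prob_candidate(0.3, r, b)      # 30%-similar pair flagged
print(prob_band_identical(0.8, r), false_negative_rate)
print(prob_band_identical(0.3, r), false_positive_rate)
```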

Pick:
the number of minhashes (rows of M),
the number of bands b, and
the number of rows r per band
to balance false positives/negatives
Example: with fewer bands of the same number of rows, the number of false positives would go down, but the number of false negatives would go up

Analysis of LSH: What We Want
[Figure: ideal step function. x-axis: similarity s of two sets; y-axis: probability of sharing a bucket. No chance if s < t; probability = 1 if s > t, where t is the threshold.]

What 1 Band of 1 Row Gives You
[Figure: straight line. x-axis: similarity s of two sets; y-axis: probability of sharing a bucket. Remember: probability of equal hash-values = similarity.]

[Figure: S-curve. x-axis: similarity s of two sets; y-axis: probability of sharing a bucket. $s^r$: all rows of a band are equal; $1 - s^r$: some row of a band unequal; $(1 - s^r)^b$: no bands identical; $1 - (1 - s^r)^b$: at least one band identical. Threshold $t \approx (1/b)^{1/r}$.]

Example: b = 20; r = 5
Similarity threshold s
Prob. that at least 1 band is identical:

| s   | $1 - (1 - s^r)^b$ |
|-----|-------------------|
| 0.2 | 0.006             |
| 0.3 | 0.047             |
| 0.4 | 0.186             |
| 0.5 | 0.470             |
| 0.6 | 0.802             |
| 0.7 | 0.975             |
| 0.8 | 0.9996            |
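
The table can be regenerated from the formula $1 - (1 - s^r)^b$, together with the approximate threshold $(1/b)^{1/r}$ where the curve rises most steeply. This is a sketch with illustrative function names; printed values are rounded differently from the slide in the last row.

```python
def prob_candidate(s: float, r: int, b: int) -> float:
    """Probability that at least one of b bands of r rows is identical."""
    return 1 - (1 - s ** r) ** b


def approx_threshold(r: int, b: int) -> float:
    """Approximate similarity at which the S-curve rises most steeply."""
    return (1 / b) ** (1 / r)


r, b = 5, 20
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {prob_candidate(s, r, b):.4f}")
print(approx_threshold(r, b))  # a bit below 0.55 for b = 20, r = 5
```

Increasing r pushes the threshold up (fewer false positives); increasing b pushes it down (fewer false negatives).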

Picking r and b: The S-curve
Picking r and b to get the best S-curve
50 hash functions (r = 5, b = 10)
[Figure: S-curve for r = 5, b = 10. x-axis: similarity; y-axis: prob. sharing a bucket. Blue area: false negative rate; green area: false positive rate.]

LSH Summary
Tune b and r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
Check in main memory that candidate pairs really do have similar signatures
Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents

Summary: 3 Steps
1. Shingling: Convert documents, emails, etc., to sets
2. Minhashing: Convert large sets to short signatures, while preserving similarity
3. Locality-sensitive hashing: Focus on pairs of signatures likely to be from similar documents