Algorithms for Duplicate Documents (Princeton)

Published by: newtonapple on Nov 19, 2011
Copyright:Attribution Non-commercial

Availability:

Read on Scribd mobile: iPhone, iPad and Android.
download as PDF, TXT or read online from Scribd
See more
See less

12/24/2013

pdf

text

original

 
Algorithms for duplicate documents
Andrei Broder
IBM Research
abroder@us.ibm.com
 
A. Broder, Algorithms for near-duplicate documents, February 18, 2005
Fingerprinting (discussed last week)
Fingerprints are short tags for larger objects.
Notations
  $\Omega$ = the set of all objects
  $k$ = the length of the fingerprint
  $f : \Omega \to \{0,1\}^k$ = a fingerprinting function
Properties
  $f(A) \neq f(B) \implies A \neq B$
  $\Pr\big(f(A) = f(B) \mid A \neq B\big) \approx \frac{1}{2^k}$
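The two properties can be tried out with a small sketch. The slides do not prescribe any particular function here, so the code below assumes, purely for illustration, that a k-bit fingerprint is the top k bits of SHA-256; the names `fingerprint` and `collision_rate` are hypothetical helpers, not part of the original material. It measures how often distinct objects share a fingerprint, which should be close to 1/2^k when k is small enough for collisions to be observable.

```python
import hashlib
from collections import Counter

def fingerprint(obj: bytes, k: int = 64) -> int:
    """Return a k-bit fingerprint of obj.

    Assumption for illustration: the top k bits of SHA-256 stand in
    for a fingerprinting function f mapping objects to {0,1}^k.
    """
    if not 1 <= k <= 256:
        raise ValueError("k must be between 1 and 256")
    digest = int.from_bytes(hashlib.sha256(obj).digest(), "big")
    return digest >> (256 - k)

def collision_rate(objects, k: int) -> float:
    """Fraction of pairs of distinct objects whose k-bit fingerprints coincide."""
    counts = Counter(fingerprint(o, k) for o in objects)
    colliding_pairs = sum(c * (c - 1) // 2 for c in counts.values())
    n = len(objects)
    return colliding_pairs / (n * (n - 1) // 2)

# 2000 distinct objects; with k = 8 the rate should be near 1/2^8.
objects = [f"document-{i}".encode() for i in range(2000)]
rate_k8 = collision_rate(objects, 8)
print(f"observed {rate_k8:.5f}, expected about {1 / 2**8:.5f}")
```

Because the function is deterministic, equal objects always get equal fingerprints, so differing fingerprints prove the objects differ; matching fingerprints only make equality very likely once k is large.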
 
Fingerprinting schemes
Fingerprints vs hashing
  - For hashing, I want a good distribution so that bins will be equally filled.
  - For fingerprints, I don't want any collisions; this requires much longer hashes, but the distribution does not matter!
Cryptographically secure:
  - MD2, MD4, MD5, SHS, etc.
  - relatively slow
Rabin's scheme
  - Based on polynomial arithmetic
  - Very fast: (1 table lookup + 1 xor + 1 shift) per byte
  - Nice extra properties
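The per-byte cost quoted for Rabin's scheme can be illustrated with a minimal sketch. It treats the input as a polynomial over GF(2) and reduces it modulo a degree-64 polynomial P(x), one byte at a time, using a 256-entry table. Several things here are assumptions not taken from the slides: the fixed polynomial x^64 + x^4 + x^3 + x + 1 (Rabin's scheme normally picks an irreducible polynomial at random), the 64-bit length, and the hypothetical names `rabin_fingerprint` and `rabin_bitwise`. The slower bit-by-bit version is included only as a cross-check.

```python
K = 64                      # fingerprint length in bits (assumed)
POLY_LOW = 0x1B             # x^4 + x^3 + x + 1, low-order terms of the assumed P(x)
POLY = (1 << K) | POLY_LOW  # P(x) = x^64 + x^4 + x^3 + x + 1
MASK = (1 << K) - 1

def _reduce_top_byte(t: int) -> int:
    """Return (t(x) * x^K) mod P(x) for an 8-bit polynomial t."""
    v = t << K
    for i in range(K + 7, K - 1, -1):
        if (v >> i) & 1:
            v ^= POLY << (i - K)
    return v

# Precomputed once: the reduction of every possible overflowing top byte.
TABLE = [_reduce_top_byte(t) for t in range(256)]

def rabin_fingerprint(data: bytes) -> int:
    """Table-driven reduction: per byte, one lookup, one shift, xor to combine."""
    f = 0
    for byte in data:
        top = f >> (K - 8)  # the 8 bits about to overflow past x^(K-1)
        f = ((f << 8) & MASK) ^ byte ^ TABLE[top]
    return f

def rabin_bitwise(data: bytes) -> int:
    """Reference version: shift in one bit at a time and reduce mod P(x)."""
    f = 0
    for byte in data:
        for i in range(7, -1, -1):
            carry = f >> (K - 1)
            f = ((f << 1) & MASK) | ((byte >> i) & 1)
            if carry:
                f ^= POLY_LOW
    return f

sample = b"Algorithms for near-duplicate documents"
assert rabin_fingerprint(sample) == rabin_bitwise(sample)
print(hex(rabin_fingerprint(sample)))
```

In this plain form, inputs that differ only in leading zero bytes reduce to the same value, so practical implementations usually prepend a fixed nonzero prefix before fingerprinting.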
