**Entity Resolution**

A Real-World Problem of Matching Records
Techniques: Minhashing, Locality-Sensitive Hashing
Measuring the Quality of the Results

**What Is Entity Resolution?**

- Data from several sources may refer to the same “entities,” e.g., people.
- There is no universal key to help us match records.
- Big question: how do we tell whether two records refer to the same underlying entity?

**A Matching Problem**

- Company A sold the services of Company B.
- They then got mad at each other and sued over how many customers of B were originally from A.
- B never bothered to store a “from A” bit.
- I was asked to find how many of B’s customers came from A.

**Matching Details**

- There were about 1,000,000 records from each company.
- Each had name, address, and phone-number fields.
- Because the records were created independently, there were many differences between records that represented the same entity (person).

**Examples of Differences**

1. Typos of many sorts.
2. Abbreviations (St./Street).
3. Nicknames (Bob/Robert).
4. Missing middle name or initial.
5. First/last names reversed.
6. Area-code changes.
7. Etc., etc.

**Simple Approach**

1. Develop a score of how close two name-address-phone records are.
2. Consider all pairs of records, one from A, one from B. If their score is above a threshold, consider them to represent the same customer.

- That is 10^12 scorings: far too many.
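A minimal sketch of this brute-force approach. The three fields follow the slides, but the scoring rule (100 points per exactly matching field) and the sample records are stand-ins; the actual formula is not given here:

```python
# Brute-force matching: score every (A, B) record pair, keep pairs above a
# threshold. Fine on toy data; hopeless at 10^12 pairs.

def field_score(a, b):
    """100 points for an exact field match, else 0 (a toy stand-in rule)."""
    return 100 if a == b else 0

def score(rec_a, rec_b):
    return sum(field_score(rec_a[f], rec_b[f]) for f in ("name", "addr", "phone"))

def brute_force_matches(records_a, records_b, threshold=200):
    return [(ra, rb) for ra in records_a for rb in records_b
            if score(ra, rb) >= threshold]

# Hypothetical records: the address differs only by an abbreviation.
a = [{"name": "Bob Smith", "addr": "12 Main St", "phone": "555-0101"}]
b = [{"name": "Bob Smith", "addr": "12 Main Street", "phone": "555-0101"},
     {"name": "Alice Jones", "addr": "9 Oak Ave", "phone": "555-0202"}]
print(brute_force_matches(a, b, threshold=200))  # matches the first B-record
```

Note that this toy rule already shows the abbreviation problem from the earlier slide: "12 Main St" vs. "12 Main Street" scores zero on the address field.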

**Scoring Matches**

- The exact formula used to measure similarity turned out not to matter much, because we were able to measure the “quality” of any given score.
- In general: an ad hoc, experimental process.

**Finding Pairs to Score**

- The key problem is deciding which of the 10^12 pairs are worth scoring.
- This is an instance of the near-neighbors problem: given N points, find all pairs of points at distance less than some threshold. Usually expressed as similarity = 1 - normalized distance.

**Standard N-N Framework**

1. Points are sets.
2. Similarity of sets = ratio of the size of the intersection to the size of the union (the Jaccard similarity).
3. Minhashing converts sets into manageable summaries (signatures).
4. Locality-sensitive hashing focuses attention on pairs likely to be similar.

**Locality-Sensitive Hashing**

1. Choose many hash functions from points to “buckets.”
2. Arrange that “nearby” points have a good chance of going to the same bucket.
3. Candidates = pairs of points sent to the same bucket by at least one hash function.
4. Evaluate only candidate pairs.

**Example: Similar Documents**

- Replace a document by its k-shingles: all its substrings of length k.
- Example (k = 2): Doc1 = abcdb; shingle set = {ab, bc, cd, db}.
- Doc2 = cdab; shingle set = {cd, da, ab}.
- |intersection| = 2; |union| = 5; similarity = 2/5 = 40%.
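The shingling arithmetic above can be checked directly; a minimal sketch:

```python
def shingles(doc, k=2):
    """The document's k-shingles: all its substrings of length k, as a set."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(s1, s2):
    """Jaccard similarity: |intersection| / |union|."""
    return len(s1 & s2) / len(s1 | s2)

s1 = shingles("abcdb")   # {'ab', 'bc', 'cd', 'db'}
s2 = shingles("cdab")    # {'cd', 'da', 'ab'}
print(jaccard(s1, s2))   # 2/5 = 0.4
```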

**Minhashing**

- Pick a number of hash functions (say 100) from set elements to integers.
- For each hash function, the minhash value for a set is the smallest integer to which any of its members hashes.
- The signature of a set is the list of its minhash values for the selected hash functions.
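A minimal minhashing sketch. The random linear hash functions over Python's built-in `hash` are an assumption for illustration; any family of hash functions from elements to integers would do:

```python
import random

def make_hash(p=2_147_483_647):
    """One random hash function from set elements to integers in [0, p)."""
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: (a * hash(x) + b) % p

def signature(s, hash_funcs):
    """The minhash signature: for each hash function, the smallest value
    to which any member of the set hashes."""
    return [min(h(x) for x in s) for h in hash_funcs]

random.seed(0)
hs = [make_hash() for _ in range(100)]
sig = signature({"ab", "bc", "cd", "db"}, hs)
print(len(sig))  # 100 integers summarize the whole set
```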

**Theorem**

- The probability that the minhash values of two sets agree equals the Jaccard similarity of the sets.
- Consequence: if we minhash two sets with many hash functions, the fraction of functions on which their minhashes agree approximates the similarity of the sets.
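A quick empirical check of the theorem, under the same illustrative assumption of random linear hash functions: the agreement fraction should land near the true Jaccard similarity of the two shingle sets from the earlier example, 0.4:

```python
import random

def minhash_agreement(s1, s2, n=2000, p=2_147_483_647):
    """Fraction of n random hash functions on which the two sets' minhashes agree."""
    agree = 0
    for _ in range(n):
        a, b = random.randrange(1, p), random.randrange(p)
        h = lambda x: (a * hash(x) + b) % p
        if min(map(h, s1)) == min(map(h, s2)):
            agree += 1
    return agree / n

random.seed(1)
est = minhash_agreement({"ab", "bc", "cd", "db"}, {"cd", "da", "ab"})
print(est)  # close to the true Jaccard similarity, 0.4
```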

**Back to LSH**

- Represent each set (e.g., the shingle set of a document) by its signature: a column of (say) 100 minhash values.
- Matrix M has one column per set.
- LSH starts by partitioning the rows into b bands of r rows each.

**Partition Into Bands**

[Figure: matrix M, one signature column per set, with its rows divided into b bands of r rows each.]

**Partition into Bands --- (2)**

- For each band, hash its portion of each column to a hash table with many buckets.
- Candidate column pairs are those that hash to the same bucket for ≥ 1 band.
- Tune b and r to catch most similar pairs but few nonsimilar pairs.
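A minimal sketch of the banding step; here a Python dict keyed by each band's tuple of values stands in for the per-band hash table with many buckets:

```python
from collections import defaultdict

def lsh_candidates(signatures, b, r):
    """signatures: {set_id: list of b*r minhash values}.
    Hash each band of r rows separately; pairs that share a bucket in
    at least one band become candidates."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for set_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's slice
            buckets[key].append(set_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates

# Toy 4-row signatures: A and B agree in band 0; B and C agree in band 1.
sigs = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 8, 9, 9]}
print(lsh_candidates(sigs, b=2, r=2))  # {('A', 'B'), ('B', 'C')}
```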

**Buckets**

[Figure: each of the b bands of r rows of matrix M is hashed into its own collection of buckets.]

**Example --- Efficiency of LSH**

- Suppose 100,000 columns.
- Signatures of 100 integers each.
- Therefore the signatures take 40 MB (100,000 × 100 × 4 bytes), so they fit in main memory.
- But comparing all 5,000,000,000 pairs of signatures would take a while.
- Choose 20 bands of 5 integers per band.

**Suppose C1, C2 Are 80% Similar**

- Probability that C1 and C2 are identical in one particular band: (0.8)^5 = 0.328.
- Probability that C1 and C2 are identical in none of the 20 bands: (1 - 0.328)^20 = 0.00035.
- I.e., we miss about 1/3000 of the 80%-similar column pairs.

**Suppose C1, C2 Are Only 40% Similar**

- Probability that C1 and C2 are identical in any one particular band: (0.4)^5 = 0.01.
- Probability that C1 and C2 are identical in ≥ 1 of the 20 bands: ≤ 20 × 0.01 = 0.2.
- But false positives are much rarer for similarities well below 40%.
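The two calculations above, done directly (the function name `p_candidate` is ours, not from the slides):

```python
def p_candidate(s, r, b):
    """Probability that two columns with similarity s become candidates
    under b bands of r rows each: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b

# 80%-similar pairs are almost always caught...
print(1 - p_candidate(0.8, r=5, b=20))  # miss rate, roughly 0.00035
# ...while 40%-similar pairs become candidates well under 20% of the time.
print(p_candidate(0.4, r=5, b=20))      # roughly 0.19
```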

**LSH Involves a Tradeoff**

- Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives and false negatives.
- Example: with fewer than 20 bands (and correspondingly more rows per band), the number of false positives would go down, but the number of false negatives would go up.

**LSH --- Graphically**

- Target: all pairs with Sim > t.
- [Figure: probability of becoming a candidate vs. similarity. The ideal is a step function jumping from 0 to 1 at t; a single hash function gives the straight line Prob. = Sim.]
- Partitioning into bands instead gives an S-curve: Prob. = 1 - (1 - s^r)^b, which rises steeply near its threshold t ~ (1/b)^(1/r).
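A quick check on where that approximate threshold comes from: for small s^r, the S-curve 1 - (1 - s^r)^b is roughly b·s^r, which approaches 1 when b·s^r = 1, i.e., at s = (1/b)^(1/r):

```python
# Approximate threshold of the S-curve for b bands of r rows:
# solve b * s^r = 1 for s.
b, r = 20, 5
t = (1 / b) ** (1 / r)
print(round(t, 3))  # 0.549: pairs more similar than this are usually candidates
```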

**Back to Entity Resolution**

- Name-address-phone records are not naturally representable as sets (e.g., shingle sets).
- So we adapted the idea by using three “hash functions”:
  1. Hash by name.
  2. Hash by address.
  3. Hash by phone.
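A minimal sketch of this three-“hash-function” scheme: each field sends a record to a bucket, and a pair becomes a candidate if the two records share a bucket for at least one field. Field names and sample records are illustrative:

```python
from collections import defaultdict

def candidates_by_field(records_a, records_b, fields=("name", "addr", "phone")):
    """Candidate pairs: records that agree exactly on at least one field.
    Each field plays the role of one LSH 'hash function'."""
    pairs = set()
    for f in fields:
        buckets = defaultdict(lambda: ([], []))  # value -> (A indices, B indices)
        for i, rec in enumerate(records_a):
            buckets[rec[f]][0].append(i)
        for j, rec in enumerate(records_b):
            buckets[rec[f]][1].append(j)
        for a_ids, b_ids in buckets.values():
            pairs.update((i, j) for i in a_ids for j in b_ids)
    return pairs

a = [{"name": "Bob Smith", "addr": "12 Main St", "phone": "555-0101"}]
b = [{"name": "Robert Smith", "addr": "12 Main St", "phone": "555-0199"},
     {"name": "Alice Jones", "addr": "9 Oak Ave", "phone": "555-0202"}]
print(candidates_by_field(a, b))  # {(0, 0)}: candidate via the shared address
```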

**Entity-Resolution LSH**

- We get a false negative for every pair of records that represented the same customer but had none of the three fields exactly identical.
- Given more machine cycles, we could have used bigger buckets and gotten fewer false negatives.

**Example**

- Hash on positions 1, 3, and 5 of the (5-digit) zip code.
- Approximately 1000 records from each dataset go into each of the 1000 buckets.
- That gives about 1 billion candidate pairs to score.
- We would need many more hash functions like this one.
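A sketch of this particular hash function. The sample records are invented; note that three extracted digits give at most 10^3 = 1000 distinct buckets, matching the slide's count:

```python
from collections import defaultdict

def zip_bucket(record):
    """Bucket key: digits at positions 1, 3, and 5 of the 5-digit zip
    (1-indexed, as on the slide)."""
    z = record["zip"]
    return z[0] + z[2] + z[4]

records = [{"name": "Bob Smith", "zip": "94305"},
           {"name": "Ann Lee",   "zip": "94315"},  # same digits 1, 3, 5
           {"name": "Cy Young",  "zip": "10019"}]
buckets = defaultdict(list)
for rec in records:
    buckets[zip_bucket(rec)].append(rec["name"])
print(dict(buckets))  # {'935': ['Bob Smith', 'Ann Lee'], '109': ['Cy Young']}
```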

**How Many False Positives?**

- Scoring system: 100 points for each of name, address, and phone.
- Pairs with a score of 300 certainly refer to the same entity.
- What about pairs with a score of 220? 150? Etc.

**Using the Time-Lag**

- We took advantage of the fact that a B-record was probably created shortly after the corresponding A-record.
- For the 300-score pairs, the average delay was 10 days.
- We did not even consider matching records with more than a 90-day lag.

**Time-Lag Trick --- (2)**

- Bogus-pair time-lag average = 45 days (a random lag within the 90-day window averages 45).
- Good-pair time-lag average = 10 days.
- Suppose the pairs with score s have average time-lag d.
- Then the fraction of score-s pairs that are good is (45 - d)/35.
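The mixture algebra behind that fraction, checked numerically: if a fraction f of the score-s pairs are good, their average lag is d = 10f + 45(1 - f); solving for f gives f = (45 - d)/35:

```python
def good_fraction(d, good_avg=10, bogus_avg=45):
    """Fraction of pairs that are good, given the observed average lag d:
    d = f*good_avg + (1 - f)*bogus_avg  =>  f = (bogus_avg - d) / (bogus_avg - good_avg)."""
    return (bogus_avg - d) / (bogus_avg - good_avg)

print(good_fraction(10))    # 1.0: lag as short as the good pairs; all good
print(good_fraction(45))    # 0.0: lag as long as the bogus pairs; all bogus
print(good_fraction(27.5))  # 0.5: halfway in between
```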

**Profile of Time-Lag**

[Figure: average time-lag as a function of score. The lag is 10 days at score 300 and approaches the bogus-pair average of 45 days at lower scores; the axis marks scores 100, 120, and 185.]

**Generalizing the Time-Lag Trick**

- All we need is some property of records with a predictable correlation for bogus matches and a measurable correlation for good matches.
- Example: reserve phone numbers for checking.
- It is not even essential that all records have the property.

**Summary**

- Entity resolution: an important step in database integration.
- Minhashing: a useful tool for converting sets into easily comparable vectors (signatures).
- Locality-sensitive hashing: a powerful technique for finding similar objects of many kinds.
