Entity Resolution

A Real-World Problem of Matching Records
Techniques: Minhashing, Locality-Sensitive Hashing
Measuring the Quality of the Results


What Is Entity Resolution?
- Data from several sources may refer to the same “entities,” e.g., people.
- There is no universal key to help us match records.
- Big question: how do we tell if records refer to the same underlying entity?


A Matching Problem
- Company A sold the services of Company B.
- They then got mad at each other and sued over how many customers of B were originally from A.
- B never bothered to store a “from A” bit.
- I was asked to find how many of B’s customers came from A.

Matching Details
- There were about 1,000,000 records from each company.
- Each had name, address, and phone # fields.
- Because the records were created independently, there were many differences between records that represented the same entity (person).

Examples of Differences
1. Typos of many sorts.
2. Abbreviations (St./Street).
3. Nicknames (Bob/Robert).
4. Missing middle name or initial.
5. First/last names reversed.
6. Area-code changes.
7. Etc., etc.

Simple Approach
1. Develop a score of how close two name-addr-phone records are.
2. Consider all pairs of records, one from A, one from B. If their score is above a threshold, consider them to represent the same customer.
- But that is 10^12 scorings --- way too many. (A sketch of the blowup follows.)
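
A minimal sketch of the naive approach, assuming records are Python dicts with name/addr/phone keys and an all-or-nothing score (both illustrative stand-ins, not the real system):

```python
def score(a: dict, b: dict) -> int:
    """100 points for each field that matches exactly (300 max)."""
    return sum(100 for f in ("name", "addr", "phone") if a[f] == b[f])

def naive_matches(records_a, records_b, threshold=200):
    # With ~10**6 records on each side, this loop performs ~10**12
    # scorings -- exactly the blowup the slide warns about.
    return [(a, b) for a in records_a for b in records_b
            if score(a, b) >= threshold]
```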

Scoring Matches
- The exact formula used to measure similarity turned out not to matter much, because we were able to measure the “quality” of any given score.
- In general: an ad-hoc, experimental process.


Finding Pairs to Score
- The key problem is deciding which of the 10^12 pairs are worth scoring.
- An example of the near-neighbors problem:
  - Given N points, find all pairs of points that are at distance less than some threshold.
  - Usually expressed as similarity = 1 - normalized distance.

Standard N-N Framework
1. Points are sets.
2. Similarity of sets = ratio of size of intersection to size of union.
3. Minhashing to convert sets into manageable summaries (signatures).
4. Locality-sensitive hashing to focus on pairs likely to be similar.
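
The intersection-over-union measure in step 2 is Jaccard similarity; a one-liner in Python (a standalone illustration, not from the slides):

```python
def jaccard(s: set, t: set) -> float:
    """Similarity of two sets: |intersection| / |union|."""
    return len(s & t) / len(s | t)

print(jaccard({1, 2, 3}, {2, 3, 4}))  # 2 shared of 4 total -> 0.5
```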

Locality-Sensitive Hashing
1. Choose many hash functions from points to “buckets.”
2. Arrange that “nearby” points have a good chance of going to the same bucket.
3. Candidates = pairs of points sent to the same bucket by at least one hash function.
4. Evaluate only candidate pairs.

Example: Similar Documents
- Replace a document by its k-shingles = all substrings of length k.
- Example: Doc1: abcdb; shingle set = {ab, bc, cd, db}.
- Doc2: cdab; shingle set = {cd, da, ab}.
- |intersection| = 2; |union| = 5; similarity = 40%.
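
A sketch of shingling in Python, reproducing the slide's numbers (the function name is mine):

```python
def shingles(doc: str, k: int = 2) -> set:
    """All length-k substrings (k-shingles) of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

s1, s2 = shingles("abcdb"), shingles("cdab")  # {ab,bc,cd,db}, {cd,da,ab}
print(len(s1 & s2) / len(s1 | s2))            # 2/5 = 0.4
```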

Minhashing
- Pick a number of hash functions (say 100) from set elements to integers.
- For each hash function, the minhash value for a set is the smallest integer to which any of its members hash.
- The signature of a set is the list of minhash values for the selected hash functions.
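
A minimal minhash sketch in Python; building the family of hash functions by XOR-ing Python's built-in hash() with random 64-bit masks is an illustrative assumption, not the slides' construction:

```python
import random

def make_hash_funcs(n: int, seed: int = 0):
    """n cheap 'hash functions' from set elements to integers
    (XOR-mask construction -- one illustrative choice among many)."""
    rng = random.Random(seed)
    return [lambda x, m=rng.getrandbits(64): hash(x) ^ m for _ in range(n)]

def signature(s: set, hash_funcs) -> list:
    """Minhash signature: for each hash function, the smallest value
    to which any member of s hashes."""
    return [min(h(x) for x in s) for h in hash_funcs]

hs = make_hash_funcs(100)
print(signature({"ab", "bc", "cd", "db"}, hs)[:3])  # first 3 minhashes
```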

Theorem
- The probability that the minhash of two sets is the same = the similarity of the sets.
  - Why: under a random hash function, each element of the union is equally likely to be the one with the smallest hash value, and the minhashes agree exactly when that element lies in the intersection.
- Consequence: if we minhash two sets many times, the fraction of hash functions for which their minhashes are the same will approximate the similarity of the sets.
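
A standalone empirical check of the theorem (my own demo, using explicit random permutations as the idealized hash functions):

```python
import random

a, b = {"ab", "bc", "cd", "db"}, {"cd", "da", "ab"}  # Jaccard sim = 0.4
universe = sorted(a | b)
rng = random.Random(42)
trials, agree = 10_000, 0
for _ in range(trials):
    # A random permutation of the universe plays the role of one hash fn.
    rank = {x: i for i, x in enumerate(rng.sample(universe, len(universe)))}
    agree += min(rank[x] for x in a) == min(rank[x] for x in b)
print(agree / trials)  # close to 0.4, the Jaccard similarity
```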

Back to LSH
- Represent a set (e.g., the shingle set of a doc) by a column of (say) 100 minhash values (its signature).
- Matrix M consists of a column per set.
- LSH starts by partitioning the rows into b bands of r rows each.


Partition Into Bands
[Figure: the signature matrix M, one column per set; the rows are partitioned into b bands of r rows each.]

Partition into Bands --- (2)
- For each band, hash its portion of each column to a hash table with many buckets.
- Candidate column pairs are those that hash to the same bucket for ≥ 1 band.
- Tune b and r to catch most similar pairs, but few nonsimilar pairs. (A sketch follows.)
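
A sketch of the banding step in Python, assuming signatures are stored per set-id; using the band's tuple of values directly as a dictionary key stands in for a real many-bucket hash table:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures: dict, b: int, r: int) -> set:
    """signatures maps set_id -> list of b*r minhash values.
    Returns pairs that agree on all r rows of at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for sid, sig in signatures.items():
            # This band's slice of the column is the bucket key.
            buckets[tuple(sig[band * r:(band + 1) * r])].append(sid)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates
```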


Buckets

[Figure: each of the b bands (r rows apiece) of matrix M is hashed separately into its own collection of buckets.]

Example --- Efficiency of LSH
- Suppose 100,000 columns.
- Signatures of 100 integers.
- Therefore, signatures take 40 MB (100,000 × 100 × 4 bytes).
  - So they fit in main memory.
- But 5,000,000,000 pairs of signatures can take a while to compare.
- Choose 20 bands of 5 integers/band.

Suppose C1, C2 are 80% Similar
- Probability C1, C2 are identical in one particular band: (0.8)^5 = 0.328.
- Probability C1, C2 are identical in none of the 20 bands: (1 - 0.328)^20 = 0.00035.
  - I.e., we miss about 1/3000 of the 80%-similar column pairs.


Suppose C1, C2 Only 40% Similar
- Probability C1, C2 are identical in any one particular band: (0.4)^5 = 0.01.
- Probability C1, C2 are identical in ≥ 1 of the 20 bands: ≤ 20 × 0.01 = 0.2.
- But false positives are much rarer for similarities << 40%. (Both slides' numbers are recomputed below.)
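
Both computations follow from the banding formula p = 1 - (1 - s^r)^b; a quick check in Python:

```python
# Probability a pair of similarity s becomes a candidate with b bands
# of r rows: p = 1 - (1 - s**r)**b.
b, r = 20, 5
for s in (0.8, 0.4):
    p_band = s ** r                  # all r rows of one band agree
    p_cand = 1 - (1 - p_band) ** b   # at least 1 of the b bands agrees
    print(f"sim={s}: band prob={p_band:.5f}, candidate prob={p_cand:.5f}")
# sim=0.8 -> candidate prob ~ 0.9996 (miss rate ~ 0.0004)
# sim=0.4 -> candidate prob ~ 0.186  (below the union bound of 0.2)
```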

LSH Involves a Tradeoff
- Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
- Example: if we had fewer than 20 bands, the number of false positives would go down, but the number of false negatives would go up.

LSH --- Graphically
- Example target: all pairs with Sim > t.
- [Figure: probability of becoming a candidate pair vs. similarity Sim, three panels.]
  - Ideal: a step function -- probability 0 below the threshold t, probability 1 above it.
  - One hash function (a single minhash row): probability rises linearly, Prob = Sim.
  - Partition into bands gives an S-curve: Prob = 1 - (1 - s^r)^b, where s = Sim.
- The S-curve's threshold falls at roughly t ≈ (1/b)^(1/r): that is the similarity at which a single band agrees with probability 1/b, i.e., one of the b bands is expected to agree.

Back to Entity Resolution
- Name-addr-phone records are not naturally representable by sets (e.g., shingle sets).
- So we adapted the idea by using 3 “hash functions” (see the sketch below):
  1. Hash by name.
  2. Hash by address.
  3. Hash by phone.
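
A sketch of those three “hash functions” as exact-match bucketing; the record layout (dicts with name/addr/phone plus an id) is an assumption for illustration:

```python
from collections import defaultdict

def er_candidates(records_a, records_b) -> set:
    """Candidate pairs: records that agree exactly on name, on
    address, or on phone."""
    pairs = set()
    for field in ("name", "addr", "phone"):
        buckets = defaultdict(lambda: ([], []))
        for rec in records_a:
            buckets[rec[field]][0].append(rec["id"])
        for rec in records_b:
            buckets[rec[field]][1].append(rec["id"])
        for a_ids, b_ids in buckets.values():
            pairs.update((a, b) for a in a_ids for b in b_ids)
    return pairs
```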

Entity-Resolution LSH
- We get a false negative for every pair of records that represented the same customer but had none of the three components identical.
- With more cycles, we could have used bigger buckets and gotten fewer false negatives.

Example
- Hash on positions 1, 3, and 5 of the (5-digit) zip code, giving 1000 buckets (sketch below).
- Approximately 1000 records from each dataset go into each of the 1000 buckets.
- That yields 1000 buckets × (1000 × 1000) pairs per bucket = 1 billion candidate pairs to score.
- Need many more hash functions like this one.
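
A sketch of this bucket function; the digit positions and 1-indexing follow the slide, the rest is illustrative:

```python
def zip_bucket(zip5: str) -> str:
    """Bucket key from digits 1, 3, and 5 of a 5-digit zip code
    (1-indexed, as on the slide): 10**3 = 1000 possible buckets."""
    return zip5[0] + zip5[2] + zip5[4]

print(zip_bucket("94305"))  # '935'
# 10**6 records / 1000 buckets ~ 1000 per bucket per dataset, so
# 1000 buckets * (1000 * 1000) cross pairs = 10**9 candidates.
```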

How Many False Positives?
- Scoring system: 100 pts. for each of name, addr, phone.
- Pairs with a score of 300 certainly refer to the same entity.
- What about pairs with a score of 220? 150? Etc.
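
One plausible way to award partial credit per field, producing intermediate scores like 220 or 150; difflib's similarity ratio is my stand-in here, since the slides note the exact formula mattered little:

```python
import difflib

def field_score(a: str, b: str) -> float:
    """0-100 partial credit for one field."""
    return 100 * difflib.SequenceMatcher(None, a, b).ratio()

def score(rec_a: dict, rec_b: dict) -> float:
    """Up to 100 points each for name, addr, phone (300 max)."""
    return sum(field_score(rec_a[f], rec_b[f])
               for f in ("name", "addr", "phone"))
```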


Using the Time-Lag
- We took advantage of the fact that a B-record was probably created shortly after the corresponding A-record.
- For the 300-score pairs, the average delay was 10 days.
- We did not even consider matching records with more than a 90-day lag.

Time-Lag Trick --- (2)
- Bogus-pair time-lag avg. = 45 days.
- Good-pair time-lag avg. = 10 days.
- Suppose the pairs with score s have average time-lag d.
- If a fraction f of those pairs is good, then d = 10f + 45(1 - f) = 45 - 35f; solving, the fraction of pairs with score s that are good is (45 - d)/35.
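
The same computation as a tiny helper (constants from the slide):

```python
def fraction_good(d: float, good_lag: float = 10.0,
                  bogus_lag: float = 45.0) -> float:
    """Solve f*good_lag + (1 - f)*bogus_lag = d for f."""
    return (bogus_lag - d) / (bogus_lag - good_lag)

print(fraction_good(10.0))  # 1.0 -- all pairs at this score are good
print(fraction_good(45.0))  # 0.0 -- all bogus
print(fraction_good(24.0))  # 0.6
```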

Profile of Time-Lag
[Figure: average time-lag plotted against score. Lag = 10 days at score = 300, climbing toward the 45-day bogus-pair average at lower scores (ticks at 185, 120, 100).]

Generalizing the Time-Lag Trick
- All we need is some property of records with a predictable correlation for bogus matches and a measurable correlation for good matches.
- Example: reserve phones for checking.
- Not even essential that all records have the property.

Summary
- Entity resolution: an important step in database integration.
- Minhashing: a useful tool for converting sets into easily comparable vectors.
- Locality-sensitive hashing: a powerful technique for finding similar objects of many kinds.