Entity Resolution

A Real-World Problem of Matching Records
Techniques: Minhashing, Locality-Sensitive Hashing
Measuring the Quality of the Results


What Is Entity Resolution?
- Data from several sources may refer to the same "entities," e.g., people.
- There is no universal key to help us match records.
- Big question: how do we tell if records refer to the same underlying entity?


A Matching Problem
- Company A sold the services of Company B.
- They then got mad at each other and sued over how many customers of B were originally from A.
- B never bothered to store a "from A" bit.
- I was asked to find how many of B's customers came from A.

Matching Details
- There were about 1,000,000 records from each company.
- Each record had name, address, and phone-number fields.
- Because the records were created independently, there were many differences between records that represented the same entity (person).

Examples of Differences
1. Typos of many sorts.
2. Abbreviations (St./Street).
3. Nicknames (Bob/Robert).
4. Missing middle name or initial.
5. First/last names reversed.
6. Area-code changes.
7. Etc., etc.

Simple Approach
1. Develop a score of how close two name-addr-phone records are.
2. Consider all pairs of records, one from A, one from B. If their score is above a threshold, consider them to represent the same customer.

- But that is 10^12 scorings --- way too many.

Scoring Matches
- The exact formula used to measure similarity turned out not to matter much, because we were able to measure the "quality" of any given score.
- In general: an ad hoc, experimental process.


Finding Pairs to Score
- The key problem is deciding which of the 10^12 pairs are worth scoring.
- An example of the near-neighbors problem:
  - Given N points, find all pairs of points that are at distance less than some threshold.
  - Usually expressed as similarity = 1 - normalized distance.

Standard N-N Framework
1. Points are sets.
2. Similarity of sets = ratio of the size of the intersection to the size of the union (Jaccard similarity).
3. Minhashing converts sets into manageable summaries (signatures).
4. Locality-sensitive hashing focuses attention on pairs likely to be similar.
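Step 2's set similarity can be computed in a couple of lines; a minimal sketch (the function name `jaccard` is illustrative):

```python
# Jaccard similarity of two sets: |intersection| / |union|.
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 0.0  # convention: two empty sets have similarity 0
    return len(a & b) / len(a | b)

# Example using the two shingle sets from the document example below.
print(jaccard({"ab", "bc", "cd", "db"}, {"cd", "da", "ab"}))  # 0.4
```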

Locality-Sensitive Hashing
1. Choose many hash functions from points to "buckets."
2. Arrange that "nearby" points have a good chance of going to the same bucket.
3. Candidates = pairs of points sent to the same bucket by at least one hash function.
4. Evaluate only the candidate pairs.

Example: Similar Documents
- Replace a document by its k-shingles = all substrings of length k.
- Example (k = 2): Doc1 = abcdb; shingle set = {ab, bc, cd, db}.
- Doc2 = cdab; shingle set = {cd, da, ab}.
- |intersection| = 2; |union| = 5; similarity = 40%.
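The shingling example above can be reproduced directly; a sketch (the helper name `shingles` is illustrative):

```python
# k-shingles of a string: all contiguous substrings of length k.
def shingles(doc: str, k: int) -> set:
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

s1 = shingles("abcdb", 2)  # {'ab', 'bc', 'cd', 'db'}
s2 = shingles("cdab", 2)   # {'cd', 'da', 'ab'}
sim = len(s1 & s2) / len(s1 | s2)
print(sim)  # 0.4
```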

- Pick a number of hash functions (say 100) from set elements to integers.
- For each hash function, the minhash value for a set is the smallest integer to which any of its members hash.
- The signature of a set is the list of minhash values for the selected hash functions.

- The probability that the minhashes of two sets are the same = the similarity of the sets.
- Consequence: if we minhash two sets many times, the fraction of hash functions for which their minhashes agree will approximate the similarity of the sets.
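A minimal sketch of minhash signatures and the similarity estimate; the linear hash family (a*x + b) mod p is a common choice but an assumption here, as is the string-to-integer encoding:

```python
import random

P = 2**31 - 1  # a Mersenne prime, used as the hash modulus

def make_hashes(n: int, seed: int = 0):
    # n random linear hash functions h(x) = (a*x + b) mod P.
    rnd = random.Random(seed)
    return [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(n)]

def elem(x: str) -> int:
    # Deterministic encoding of a shingle as an integer.
    return int.from_bytes(x.encode(), "big") % P

def signature(s: set, hashes) -> list:
    # Minhash value per hash function: smallest hash over the set's members.
    xs = [elem(x) for x in s]
    return [min((a * x + b) % P for x in xs) for a, b in hashes]

hashes = make_hashes(100)
sig1 = signature({"ab", "bc", "cd", "db"}, hashes)
sig2 = signature({"cd", "da", "ab"}, hashes)
est = sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(hashes)
print(est)  # should be close to the true similarity, 0.4
```

With only 100 hash functions the estimate has noticeable variance (standard deviation around 0.05 here), which is the usual accuracy/space tradeoff.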

Back to LSH
- Represent a set (e.g., the set of shingles of a doc) by a column of (say) 100 minhash values (its signature).
- Matrix M consists of one column per set.
- LSH starts by partitioning the rows into b bands of r rows each.


Partition Into Bands
[Diagram: matrix M; each column is the signature for one set; rows partitioned into b bands of r rows per band.]

Partition into Bands --- (2)
- For each band, hash its portion of each column to a hash table with many buckets.
- Candidate column pairs are those that hash to the same bucket for ≥ 1 band.
- Tune b and r to catch most similar pairs but few nonsimilar pairs.
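The banding step can be sketched as follows; signatures are kept as plain lists, and the bucket key is just the tuple of the band's rows (an assumption standing in for a real hash table):

```python
from collections import defaultdict
from itertools import combinations

# LSH banding: split each signature into b bands of r rows; columns whose
# band portions collide in any band become candidate pairs.
def candidate_pairs(signatures: dict, b: int, r: int) -> set:
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])  # this band's rows
            buckets[key].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                candidates.add(pair)
    return candidates

# Toy example: C1 and C2 agree on the first band only; C3 matches no one.
sigs = {"C1": [1, 2, 3, 4], "C2": [1, 2, 9, 9], "C3": [7, 7, 7, 7]}
print(candidate_pairs(sigs, b=2, r=2))  # {('C1', 'C2')}
```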





Example --- Efficiency of LSH
- Suppose 100,000 columns.
- Signatures of 100 integers each.
- Therefore, signatures take 40 MB (at 4 bytes/integer).
  - So they fit in main memory.

- But 5,000,000,000 pairs of signatures can take a while to compare.
- Choose 20 bands of 5 integers/band.

Suppose C1, C2 Are 80% Similar

- Probability C1, C2 are identical in one particular band: (0.8)^5 = 0.328.
- Probability C1, C2 are not identical in any of the 20 bands: (1 - 0.328)^20 = 0.00035.
  - I.e., we miss about 1/3000 of the 80%-similar column pairs.


Suppose C1, C2 Only 40% Similar
- Probability C1, C2 are identical in one particular band: (0.4)^5 ≈ 0.01.
- Probability C1, C2 are identical in ≥ 1 of the 20 bands: ≤ 20 × 0.01 = 0.2.
- But false positives are much rarer for similarities << 40%.

LSH Involves a Tradeoff
- Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives and false negatives.
- Example: with fewer than 20 bands (and hence more rows per band), the number of false positives would go down, but the number of false negatives would go up.
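The tradeoff can be computed directly from the candidate probability 1 - (1 - s^r)^b; a sketch reproducing the b = 20, r = 5 numbers above, plus a fewer-bands variant for comparison:

```python
# Probability that two columns with similarity s become candidates
# under b bands of r rows each: 1 - (1 - s^r)^b.
def p_candidate(s: float, r: int, b: int) -> float:
    return 1 - (1 - s ** r) ** b

print(p_candidate(0.8, 5, 20))   # ≈ 0.9996: 80%-similar pairs almost never missed
print(p_candidate(0.4, 5, 20))   # ≈ 0.186: most 40%-similar pairs filtered out
# Same 100 minhashes split as 10 bands of 10 rows: fewer false positives,
# but many more false negatives for 80%-similar pairs.
print(p_candidate(0.8, 10, 10))  # ≈ 0.68
print((1 / 20) ** (1 / 5))       # threshold t ≈ (1/b)^(1/r) ≈ 0.55
```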

LSH --- Graphically

- Example target: all pairs with Sim > t.
- [Plots: probability of becoming a candidate vs. similarity s. The ideal is a step function that jumps from 0 to 1 at s = t; one hash function alone gives a straight line instead.]
- Partitioning into bands gives an S-curve approximating that step:

  1 - (1 - s^r)^b

  with its steepest rise near the threshold t ~ (1/b)^(1/r).

Back to Entity Resolution
- Name-addr-phone records are not naturally representable as sets (e.g., shingle sets).
- So we adapted the idea by using 3 "hash functions":
  1. Hash by name.
  2. Hash by address.
  3. Hash by phone.

Entity-Resolution LSH
- False negative: every pair of records that represented the same customer but had none of the three components identical.
- With more cycles, we could have used bigger buckets and gotten fewer false negatives.

- Example: hash on positions 1, 3, and 5 of the (5-digit) zip code.
- Approximately 1000 records from each dataset go into each of 1000 buckets.
- That yields about 10^9 (1 billion) candidate pairs to score.
- We would need many more hash functions like this one.

How Many False Positives?
- Scoring system: 100 points for each of name, addr, and phone.
- Pairs with a score of 300 certainly refer to the same entity.
- What about pairs with a score of 220? 150? Etc.


Using the Time-Lag
- We took advantage of the fact that a B-record was probably created shortly after the corresponding A-record.
- For the 300-score pairs, the average delay was 10 days.
- We did not even consider matching records with more than a 90-day lag.

Time-Lag Trick --- (2)

- Bogus-pair time-lag average = 45 days.
- Good-pair time-lag average = 10 days.
- Suppose the pairs with score s have average time-lag d.
- Fraction of pairs with score s that are good: (45 - d)/35.
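The formula follows from treating the observed average lag d as a mixture: if a fraction f of the pairs are good, then d = 10f + 45(1 - f), so f = (45 - d)/35. A minimal sketch:

```python
# Fraction of pairs at a given score that are good, from the observed
# average time-lag d, assuming good pairs average good_lag days and
# bogus pairs average bogus_lag days.
def fraction_good(d: float, good_lag: float = 10.0, bogus_lag: float = 45.0) -> float:
    return (bogus_lag - d) / (bogus_lag - good_lag)

print(fraction_good(10))  # 1.0: all pairs at this score are good
print(fraction_good(45))  # 0.0: all pairs at this score are bogus
print(fraction_good(24))  # 0.6: 60% of pairs at this score are good
```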

Profile of Time-Lag

[Plot: average time-lag vs. score, falling to lag = 10 at score = 300; score-axis marks at 100, 120, 185, 300.]

Generalizing the Time-Lag Trick
- All we need is some property of records with a predictable correlation for bogus matches and a measurable correlation for good matches.
- Example: reserve phone numbers for checking (rather than using them in the score).
- It is not even essential that all records have the property.

- Entity resolution: an important step in database integration.
- Minhashing: a useful tool for converting sets into easily comparable vectors (signatures).
- Locality-sensitive hashing: a powerful technique for finding similar objects of many kinds.