
Chapter 7: Data Matching

PRINCIPLES OF DATA INTEGRATION
AnHai Doan, Alon Halevy, Zachary Ives
Introduction
 Data matching: find structured data items that refer to
the same real-world entity
 e.g., (David Smith, 608-245-4367, Madison WI)
vs (D. M. Smith, 245-4367, Madison WI)
 Data matching arises in many integration scenarios
 merging multiple databases with the same schema
 joining rows from sources with different schemas
 matching a user query to a data item
 One of the most fundamental problems in data
integration

2
Entity Matching
 Find data records referring to the same real-world entity
Table A
Name              City      Age
Dave Smith        Altanta   18
Daniel Smith      LA        18
Joe Welson        New York  25
Charles Williams  Chicago   45
Charlie William   Atlanta   28

Table B
Name              City      Age
David Smith       Atlanta   18
Joe Wilson        NY        25
Daniel W. Smith   LA        30
Charles Williams  Chicago   45

 Critical problem in all domains (FSS, Telco, public safety, …)


EM in Practice
 Two fundamental steps: blocking and matching
Table A
     Name              City      Age
a1   Dave Smith        Altanta   18
a2   Daniel Smith      LA        18
a3   Joe Welson        New York  25
a4   Charles Williams  Chicago   45
a5   Charlie William   Atlanta   28

Table B
     Name              City      Age
b1   David Smith       Atlanta   18
b2   Joe Wilson        NY        25
b3   Daniel W. Smith   LA        30
b4   Charles Williams  Chicago   45

block on a.City = b.City → candidate pairs (a2, b3), (a4, b4), (a5, b1)
match → (a2, b3) −, (a4, b4) +, (a5, b1) −

Research has focused on these two steps:
– develop algorithms that are fast, minimize selectivity, and maximize accuracy (in terms of true matches, i.e., recall)
– build stand-alone monolithic systems (Blocker 1 + Matcher 1, Blocker 2 + Matcher 2, …)

 Many blocking methods have been developed
 hash, sorting, phonetic, rule-based, etc.

A hash blocker can be very fast, but the recall can be low… why?

4
Development Stage

[Figure: the development-stage workflow. Take samples A' and B' of tables A and B. Apply blocker X to a sample S to get a candidate set Cx; sample Cx, label the sampled pairs (+/−), and cross-validate matchers on them (matcher U: 0.89 F1; matcher V: 0.93 F1). Do the same for blocker Y with candidate set Cy. A quality check (yes/no) closes the loop.]

Select the best blocker: X, Y        Select the best matcher: U, V


This is Far from Enough for Handling EM in Practice

 EM in practice is significantly more complex
 a messy, iterative, multi-step process
 many steps perceived as trivial are actually quite difficult to do
 Even if we let a human user be in charge of the whole EM process, he/she often doesn't know what to do

6
Outline

 Selecting blocking attributes
 Matching problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Collective matching
 Scaling up data matching

7
Debugging Blocking
Table A
     Name              City      Age
a1   Dave Smith        Altanta   18
a2   Daniel Smith      LA        18
a3   Joe Welson        New York  25
a4   Charles Williams  Chicago   45
a5   Charlie William   Atlanta   28

Table B
     Name              City      Age
b1   David Smith       Atlanta   18
b2   Joe Wilson        NY        25
b3   Daniel W. Smith   LA        30
b4   Charles Williams  Chicago   45

blocker Q: a.City = b.City → C = {(a2, b3), (a4, b4), (a5, b1)}

 Does Q kill off too many matches?
 What are the killed-off matches?
 Why are they killed off by Q?
 Need to debug blocking!

8
The MatchCatcher Solution
 Find matches killed off by the blocker
 the user can examine these matches and improve the blocker
 C = Q(A, B) is the candidate set produced by the blocker; D = (A×B) − C is the set of killed-off pairs
 Goal: quickly find as many killed-off true matches as possible

9
Example

Q1: a.City = b.City

Table A
     Name              City      Age
a1   Dave Smith        Altanta   18
a2   Daniel Smith      LA        18
a3   Joe Welson        New York  25
a4   Charles Williams  Chicago   45
a5   Charlie William   Atlanta   28

Table B
     Name              City      Age
b1   David Smith       Atlanta   18
b2   Joe Wilson        NY        25
b3   Daniel W. Smith   LA        30
b4   Charles Williams  Chicago   45

C1 = {(a2, b3), (a4, b4), (a5, b1)}
Debugger output: (a1, b1), (a3, b2), (a2, b1)

10
Example

Q2: a.City = b.City OR lastword(a.Name) = lastword(b.Name)

(Tables A and B as on the previous slide)

C2 = {(a1, b1), (a1, b3), (a2, b1), (a2, b3), (a4, b4), (a5, b1)}
Debugger output: (a3, b2), (a1, b2), (a1, b4)

11
Example

Q3: a.City = b.City OR ed(lastword(a.Name), lastword(b.Name)) ≤ 2

(Tables A and B as on the previous slide)

C3 = {(a1, b1), (a1, b3), (a2, b1), (a2, b3), (a3, b2), (a4, b4), (a5, b1), (a5, b4)}
Debugger output: (a1, b2), (a1, b4), (a2, b2)

12
Key Idea
 Must find killed-off matches quickly
 Use string similarity joins (SSJs)

Config = {Name, City}

Table A → A' (one concatenated string per tuple)
a1: "Dave Smith Altanta"
a2: "Daniel Smith LA"
a3: "Joe Welson New York"
a4: "Charles Williams Chicago"
a5: "Charlie William Atlanta"

Table B → B'
b1: "David Smith Atlanta"
b2: "Joe Wilson NY"
b3: "Daniel W. Smith LA"
b4: "Charles Williams Chicago"

Top-3 SSJ over A' × B' → (a1, b1), (a3, b2), (a2, b1)

13
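A minimal Python sketch of this step, for concreteness: it concatenates the attributes in a config into one string per tuple and runs a brute-force top-k join using token Jaccard. The function names (make_strings, topk_ssj) are illustrative, not MatchCatcher's actual API, and a real SSJ would replace the brute-force loop with the optimized algorithms discussed later.

```python
import heapq

def make_strings(table, config):
    """Concatenate the config attributes of each tuple into one string."""
    return {tid: " ".join(str(row[a]) for a in config)
            for tid, row in table.items()}

def jaccard(s, t):
    s, t = set(s.split()), set(t.split())
    return len(s & t) / len(s | t) if s | t else 0.0

def topk_ssj(A, B, config, k):
    """Brute-force top-k string similarity join over one config."""
    A2, B2 = make_strings(A, config), make_strings(B, config)
    pairs = ((jaccard(sa, sb), aid, bid)
             for aid, sa in A2.items() for bid, sb in B2.items())
    return heapq.nlargest(k, pairs)   # highest-scoring candidate pairs
```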
Challenges

Tables A and B (as before) can be paired under many different configs, each requiring its own top-k SSJ:

Top-k SSJ on {Name}
Top-k SSJ on {City}
Top-k SSJ on {Age}
Top-k SSJ on {Name, City}
…
Top-k SSJ on {Name, City, Age}

14
Selection of Configurations
 Matching tuples have similar values on certain attributes

Tuple a: (David A. Smith, Las Vegas, 18)
Tuple b: (David Smith, Vegas, 19)

Under the config Name + City:
Jaccard("David A. Smith Las Vegas", "David Smith Vegas") = 3/5 = 0.6

– run a top-k string similarity join, use the top-k pairs as match candidates

 Use multiple configs to find as many killed-off matches as possible
– but cannot use all of them, as there are too many
– e.g., 10 attributes → 2^10 − 1 = 1023 configs

 Need to select a set of configs

15
Generating a Config Tree
 Procedure for generating a config tree
1. Given schema S, select a set T of the most promising attributes, e.g., T = {n, c, s, d}
    discard numeric, boolean, and categorical attributes
2. Generate a tree of configs in a top-down fashion:
   {n,c,s,d} → {c,s,d}, {n,s,d}, {n,c,d}, {n,c,s} → {c,d}, {n,d}, {n,c} → {d}, {n}
    pick one node to expand
   – equivalent to "pick one attribute to exclude"
   – attributes with more unique values and fewer missing values are more useful
      e.g., "name" vs. "city"
   – assign each attribute a score (based on # of non-missing values / # of tuples and # of unique values / # of non-missing values); remove the attribute with the lowest score (a scoring sketch follows below)
      e.g., if e(n) > e(d) > e(c) > e(s), remove attribute s

 Manage long string attributes
   – e.g., description, comments
   – they cause multiple configs to generate similar top-k lists
   – need to remove the long attribute first:
   {n,c,s,d} → {c,s,d}, {n,s,d}, {n,c,d}, {n,c,s} → {c,s}, {n,s}, {n,c} → {c}, {n}
16
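The slide names the two ratios behind the attribute score but not how they are combined; the sketch below simply multiplies them, which is one plausible choice rather than the system's exact formula.

```python
def attribute_score(values):
    """Score one column; values is a list with None for missing cells."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    non_missing_ratio = len(present) / len(values)   # non-missing / # of tuples
    unique_ratio = len(set(present)) / len(present)  # unique / # of non-missing
    return non_missing_ratio * unique_ratio          # assumed combination

def attribute_to_exclude(rows, attrs):
    """Expanding a tree node = dropping the lowest-scoring attribute."""
    scores = {a: attribute_score([r.get(a) for r in rows]) for a in attrs}
    return min(scores, key=scores.get)
```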
Top-k String Similarity Joins

17
QJoin Algorithm
 Key idea
– keep track of the number of token overlaps each pair has; only compute the similarity score if there are at least q overlaps in the prefix
 Saves unnecessary similarity-score computations
 Selecting the best q
– select the best q empirically, as it is difficult to analyze
– on a machine with n CPU cores: run a small top-50 join with a different q on each core (core 1: q = 1, core 2: q = 2, …, core n: q = n), pick the best q (say q = 2), then run the full top-k join with it
18
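A sketch of the overlap-counting idea in Python (not the authors' QJoin implementation): tokens are ordered by global rarity, an inverted index is built over the prefixes, and the expensive similarity is computed only for pairs with at least q overlapping prefix tokens.

```python
from collections import defaultdict

def jaccard(s, t):
    s, t = set(s), set(t)
    return len(s & t) / len(s | t) if s | t else 0.0

def qjoin_candidates(A, B, prefix_len, q):
    """A, B: dicts mapping tuple id -> token list."""
    freq = defaultdict(int)                     # global token frequencies
    for toks in list(A.values()) + list(B.values()):
        for tok in set(toks):
            freq[tok] += 1
    def prefix(toks):                           # rarest tokens first
        return sorted(set(toks), key=lambda t: (freq[t], t))[:prefix_len]
    index = defaultdict(list)                   # prefix token -> B ids
    for bid, toks in B.items():
        for tok in prefix(toks):
            index[tok].append(bid)
    overlaps = defaultdict(int)                 # (a, b) -> # prefix overlaps
    for aid, toks in A.items():
        for tok in prefix(toks):
            for bid in index[tok]:
                overlaps[(aid, bid)] += 1
    # Compute the real similarity only for the surviving pairs.
    return {(a, b): jaccard(A[a], B[b])
            for (a, b), c in overlaps.items() if c >= q}
```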
Joint Top-k SSJs Across All Configs
 Need to do joint SSJs on a set of configs
– generate the configs in a tree structure
– speed up the SSJs by reusing computations and parallelization
 Reuse similarity-score computations
– process the configs in breadth-first order
– reuse the computation from the parents
[Figure: configs over fields f1, f2, f3; overlap counts such as o(f1, f1) = 2, o(f1, f2) = 3, …, o(f3, f3) = 1 computed for a pair (a, b) at a parent config are reused at its children]
 Reuse top-k lists
– initialize the top-k list by re-adjusting the top-k pair scores from the parent
– can make QJoin stop earlier, as the top-k list is not generated from scratch
 Parallel processing of the configs
– process one config join per core, in the same breadth-first order
– raises concurrency-control issues
19
Outline

 Selecting blocking attributes
 Matching problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Collective matching
 Scaling up data matching

20
Problem Definition

 Given two relational tables X and Y with identical schemas
 assume each tuple in X and Y describes an entity (e.g., a person)
 We say tuple x ∈ X matches tuple y ∈ Y if they refer to the same real-world entity
 (x, y) is called a match
 Goal: find all matches between X and Y

21
Example

 Other variations
 Tables X and Y have different schemas
 Match tuples within a single table X
 The data is not relational, but XML or RDF

22
Why is This Different than
String Matching?
 In theory, can treat each tuple as a string by
concatenating the fields, then apply string matching
techniques
 But doing so makes it hard to apply sophisticated
techniques and domain-specific knowledge
 E.g., consider matching tuples that describe persons
 suppose we know that in this domain two tuples match if the names and phones match exactly
 this knowledge is hard to encode if we use string matching
 so it is better to keep the fields apart

23
Challenges

 Same as in string matching
 How to match accurately?
 difficult due to variations in formatting conventions, use of abbreviations, shortening, different naming conventions, omissions, nicknames, and errors in data
 several common approaches: rule-based, learning-based, clustering, probabilistic, collective
 How to scale up to large data sets?

24
Outline

 Selecting blocking attributes
 Matching problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Collective matching
 Scaling up data matching

25
Rule-based Matching

 The developer writes rules that specify when two tuples match
 typically after examining many matching and non-matching tuple pairs, using a development set of tuple pairs
 rules are then tested and refined, using the same development set or a test set
 Many types of rules exist; we will consider:
 linearly weighted combination of individual similarity scores
 logistic regression combination
 more complex rules

26
Linearly Weighted Combination Rules
 Compute the overall similarity as a weighted sum of per-attribute similarity scores:
sim(x, y) = α_1·s_1(x, y) + … + α_n·s_n(x, y), where the weights α_i sum to 1
 declare (x, y) matched if sim(x, y) ≥ β for a pre-specified threshold β, and not matched otherwise
27
Example

 sim(x, y) = 0.3·s_name(x, y) + 0.3·s_phone(x, y) + 0.1·s_city(x, y) + 0.3·s_state(x, y)
 s_name(x, y): based on Jaro-Winkler
 s_phone(x, y): based on the edit distance between x's phone (after removing the area code) and y's phone
 s_city(x, y): based on edit distance
 s_state(x, y): based on exact match; yes → 1, no → 0
28
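A self-contained sketch of this rule. To stay dependency-free it substitutes a normalized edit-distance similarity for Jaro-Winkler on names, and assumes phones are formatted like "(608) 245-4367"; both are stand-ins, not part of the slide.

```python
def edit_distance(s, t):
    """Classic Levenshtein distance with a rolling 1-D table."""
    d = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, ct in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (cs != ct))
    return d[len(t)]

def norm_sim(s, t):
    m = max(len(s), len(t))
    return 1.0 - edit_distance(s, t) / m if m else 1.0

def strip_area_code(phone):
    return phone.split(")")[-1].strip()   # assumes "(608) 245-4367" style

def sim(x, y):
    return (0.3 * norm_sim(x["name"], y["name"])   # stand-in for Jaro-Winkler
            + 0.3 * norm_sim(strip_area_code(x["phone"]),
                             strip_area_code(y["phone"]))
            + 0.1 * norm_sim(x["city"], y["city"])
            + 0.3 * (1.0 if x["state"] == y["state"] else 0.0))
```

Declaring (x, y) matched when sim(x, y) is at least some threshold β completes the rule.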
Pros and Cons
 Pros
 conceptually simple, easy to implement
 can learn the weights α_i from training data
 Cons
 an increase δ in the value of any s_i causes a linear increase α_i · δ in the value of sim
 in certain scenarios this is not desirable: after a certain threshold an increase in s_i should count less (i.e., "diminishing returns" should kick in)
 e.g., if s_name(x, y) is already 0.95, then the two names already match very closely
 so any further increase in s_name(x, y) should contribute only minimally

29
Logistic Regression Rules
 Combine the individual similarity scores s_1(x, y), …, s_n(x, y) using the logistic (sigmoid) function:
sim(x, y) = 1 / (1 + e^(−z)), where z = α_1·s_1(x, y) + … + α_n·s_n(x, y)
 an increase in any s_i yields diminishing returns as sim approaches 1
30
Logistic Regression Rules

 Logistic regression rules are also very useful in situations where
 there are many "signals" (e.g., 10-20) that can contribute to whether two tuples match
 we don't need all of these signals to "fire" in order to conclude that the tuples match
 as long as a reasonable number of them fire, we have sufficient confidence
 Logistic regression is a natural fit for such cases
 hence it is quite popular as a first matching method to try

31
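A minimal sketch; the weights, bias, and 0.5 decision threshold are made-up values for illustration, and in practice they would be learned from training data.

```python
import math

def logistic_sim(scores, weights, bias=0.0):
    """Squash a weighted sum of similarity 'signals' into (0, 1)."""
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))

scores = [0.9, 0.8, 0.0, 0.7, 1.0]     # s_1 ... s_5 for one tuple pair
weights = [1.2, 0.9, 0.4, 1.1, 0.8]    # illustrative learned weights
print(logistic_sim(scores, weights, bias=-2.0) >= 0.5)  # match? -> True
```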
More Complex Rules

 Appropriate when we want to encode more complex matching knowledge
 e.g., two persons match if their names match approximately, and either their phones match exactly or their addresses match exactly
1. If s_name(x, y) < 0.8 then return "not matched"
2. Otherwise, if e_phone(x, y) = true then return "matched"
3. Otherwise, if e_city(x, y) = true and e_state(x, y) = true then return "matched"
4. Otherwise return "not matched"

32
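The rule translates directly into code; in this sketch the exact-equality predicates e_phone, e_city, and e_state are implemented as plain string comparisons, and s_name is any name-similarity function passed in.

```python
def complex_rule(x, y, s_name):
    """Direct transcription of steps 1-4 above."""
    if s_name(x["name"], y["name"]) < 0.8:
        return "not matched"          # names must match approximately
    if x["phone"] == y["phone"]:      # e_phone(x, y)
        return "matched"
    if x["city"] == y["city"] and x["state"] == y["state"]:
        return "matched"              # e_city(x, y) and e_state(x, y)
    return "not matched"
```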
Pros and Cons of
Rule-Based Approaches
 Pros
 easy to start, conceptually relatively easy to understand,
implement, debug
 typically run fast
 can encode complex matching knowledge
 Cons
 can be labor intensive; it takes a lot of time to write good rules
 can be difficult to set appropriate weights
 learning-based approaches address these issues

33
Outline

 Selecting blocking attributes
 Matching problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Collective matching
 Scaling up data matching

34
Learning-based Matching

 Let's first consider supervised learning
 learn a matching model M from training data, then apply M to match new tuple pairs
 Learning a matching model M (the training phase)
 start with training data T = {(x_1, y_1, l_1), …, (x_n, y_n, l_n)}, where each (x_i, y_i) is a tuple pair and l_i is a label: "yes" if x_i matches y_i and "no" otherwise
 define a set of features f_1, …, f_m, each quantifying one aspect of the domain judged possibly relevant to matching the tuples
35
Learning-based Matching

 Learning a matching model M (continued)
 convert each training example (x_i, y_i, l_i) in T to a pair (<f_1(x_i, y_i), …, f_m(x_i, y_i)>, c_i)
 v_i = <f_1(x_i, y_i), …, f_m(x_i, y_i)> is a feature vector that encodes (x_i, y_i) in terms of the features
 c_i is an appropriately transformed version of the label l_i (e.g., yes/no or 1/0, depending on what matching model we want to learn)
 thus T is transformed into T' = {(v_1, c_1), …, (v_n, c_n)}
 apply a learning algorithm (e.g., decision trees, SVMs) to T' to learn a matching model M
 Applying model M to match new tuple pairs
 given a pair (x, y), transform it into a feature vector v = <f_1(x, y), …, f_m(x, y)>
 apply M to v to predict whether x matches y
36
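A sketch of this train-then-apply loop, assuming scikit-learn is available; the three features are simple illustrative choices, not a prescribed feature set.

```python
from sklearn.tree import DecisionTreeClassifier

def featurize(x, y):
    """v = <f_1(x, y), ..., f_m(x, y)> for one tuple pair."""
    return [1.0 if x["name"] == y["name"] else 0.0,
            1.0 if x["city"] == y["city"] else 0.0,
            abs(x["age"] - y["age"])]

def train_matcher(T):
    """T = [(x_i, y_i, l_i), ...] with labels l_i in {'yes', 'no'}."""
    V = [featurize(x, y) for x, y, _ in T]        # T -> T' = {(v_i, c_i)}
    c = [1 if l == "yes" else 0 for _, _, l in T]
    return DecisionTreeClassifier().fit(V, c)

def match(model, x, y):
    return model.predict([featurize(x, y)])[0] == 1
```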
Example:
Learning a Linearly Weighted Rule

 s_1 and s_2 use Jaro-Winkler and edit distance
 s_3 uses edit distance (ignoring the area code of a)
 s_4 and s_5 return 1 if exact match, 0 otherwise
 s_6 encodes a heuristic constraint
37
Example:
Learning a Linearly Weighted Rule

38
Example: Learning a Decision Tree

Now the labels are yes/no, not 1/0

39
The Pros and Cons of
Learning-based Approach
 Pros compared to rule-based approaches
 in rule-based approaches one must manually decide whether a particular feature is useful → labor intensive, and limits the number of features we can consider
 learning-based approaches can automatically examine a large number of features
 learning-based approaches can construct very complex "rules"
 Cons
 still require training examples, in many cases a large number of them, which can be hard to obtain
 clustering addresses this problem

40
Outline

 Selecting blocking attributes
 Matching problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Collective matching
 Scaling up data matching

41
Matching by Clustering
 Many common clustering techniques have been used
 agglomerative hierarchical clustering (AHC), k-means, graph-theoretic, …
 here we focus on AHC, a simple yet very commonly used one
 AHC
 partitions a given set of tuples D into a set of clusters
 all tuples in a cluster refer to the same real-world entity; tuples in different clusters refer to different entities
 begins by putting each tuple of D into its own single-tuple cluster
 iteratively merges the two most similar clusters
 stops when a desired number of clusters has been reached, or when the similarity between the two closest clusters falls below a pre-specified threshold

42
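A compact sketch of AHC with the threshold stopping rule; cluster_sim can be any of the cluster-level measures defined on the next slide (average link is used as the default here so the sketch is self-contained).

```python
from itertools import combinations

def average_link(c, d, sim):
    return sum(sim(x, y) for x in c for y in d) / (len(c) * len(d))

def ahc(tuples, sim, threshold, cluster_sim=average_link):
    clusters = [[t] for t in tuples]           # each tuple starts alone
    while len(clusters) > 1:
        score, i, j = max((cluster_sim(c, d, sim), i, j)
                          for (i, c), (j, d)
                          in combinations(enumerate(clusters), 2))
        if score < threshold:                  # closest pair too dissimilar
            break
        clusters[i] += clusters[j]             # merge most similar clusters
        del clusters[j]                        # j > i, so deletion is safe
    return clusters                            # one cluster per entity
```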
Example

 sim(x, y) = 0.3·s_name(x, y) + 0.3·s_phone(x, y) + 0.1·s_city(x, y) + 0.3·s_state(x, y)
43
Computing a Similarity Score
between Two Clusters
 Let c and d be two clusters
 Single link: s(c, d) = min_{x_i ∈ c, y_j ∈ d} sim(x_i, y_j)
 Complete link: s(c, d) = max_{x_i ∈ c, y_j ∈ d} sim(x_i, y_j)
 Average link: s(c, d) = [Σ_{x_i ∈ c, y_j ∈ d} sim(x_i, y_j)] / [# of (x_i, y_j) pairs]
 Canonical tuple
 create a canonical tuple that represents each cluster
 the sim between c and d is the sim between their canonical tuples
 the canonical tuple is created from attribute values of the tuples
 e.g., "Mike Williams" and "M. J. Williams" → "Mike J. Williams"
 (425) 247 4893 and 247 4893 → (425) 247 4893

44
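The three linkage measures translate directly into code, following the slide's definitions; a sketch where sim is any tuple-level similarity function:

```python
def single_link(c, d, sim):
    return min(sim(x, y) for x in c for y in d)

def complete_link(c, d, sim):
    return max(sim(x, y) for x in c for y in d)

def average_link(c, d, sim):
    return sum(sim(x, y) for x in c for y in d) / (len(c) * len(d))
```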
Key Ideas underlying
the Clustering Approach
 View matching as the problem of constructing entities (i.e., clusters)
 The process is iterative
 leverage what we have learned so far to build "better" entities
 In each iteration, merge all matching tuples within a cluster to build an "entity profile", then use it to match other tuples → merging, then exploiting the merged information to help matching

45
Outline

 Selecting blocking attributes
 Matching problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Collective matching
 Scaling up data matching

46
Collective Matching

 The matching approaches discussed so far make independent matching decisions
 they decide whether a and b match independently of whether any two other tuples c and d match
 Matching decisions, however, are often correlated
 exploiting such correlations can improve matching accuracy

47
An Example

 Goal: match the authors of the four papers listed above
 Solution
 extract their names to create the table above
 apply current approaches to match the tuples in the table
 This fails to exploit co-author relationships in the data
48
An Example (cont.)

 nodes = authors, hyperedges connect co-authors
 Suppose we have matched a3 and a5
 then intuitively a1 and a4 should be more likely to match
 they share the same name and the same co-author relationship to the same author
49
An Example (cont.)

 First solution:
 add a coAuthors attribute to the tuples
 e.g., tuple a1 has coAuthors = {C. Chen, A. Ansari}
 tuple a4 has coAuthors = {A. Ansari}
 apply current methods, using say a Jaccard measure for coAuthors

50
An Example (cont.)

 Problem:
 suppose a3: A. Ansari and a5: A. Ansari share the same name but do not match
 we would match them, and incorrectly boost the score of (a1, a4)
 How to fix this?
 we want to match a3 and a5, then use that info to help match a1 and a4; we also want to do the opposite
 so we should match tuples collectively, all at once and iteratively
51
Collective Matching using Clustering

 Many collective matching approaches exist
 clustering-based, probabilistic, etc.
 Here we consider clustering-based ones (see notes for more)
 Assume the input is a graph
 nodes = tuples to be matched
 edges = relationships among tuples

52
Collective Matching using Clustering

 To match, perform agglomerative hierarchical clustering
 but modify the sim measure to consider correlations among tuples
 Let A and B be two clusters of nodes; define
 sim(A, B) = α · sim_attributes(A, B) + (1 − α) · sim_neighbors(A, B)
 α is a pre-defined weight
 sim_attributes(A, B) uses only the attributes of A and B; examples of such scores are single link, complete link, average link, etc.
 sim_neighbors(A, B) considers correlations; we discuss it next

53
An Example of sim_neighbors(A, B)
 Assume a single relationship R on the graph edges
 can generalize to the case of multiple relationships
 Let N(A) be the bag of the cluster IDs of all nodes that are in relationship R with some node in A
 e.g., cluster A has two nodes a and a':
a is in relationship R with node b with cluster ID 3, and
a' is in relationship R with node b' with cluster ID 3 and another node b'' with cluster ID 5
 N(A) = {3, 3, 5}
 Define sim_neighbors(A, B) = Jaccard(N(A), N(B)) = |N(A) ∩ N(B)| / |N(A) ∪ N(B)|
54
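Since N(A) and N(B) are bags, the Jaccard must respect multiplicities; a sketch using Counter, where the edges dict encoding of relationship R is an assumed representation for illustration:

```python
from collections import Counter

def neighbor_bag(cluster, edges, cluster_id):
    """N(A): bag of cluster IDs of nodes related (via R) to a node in A."""
    return Counter(cluster_id[b] for a in cluster for b in edges.get(a, ()))

def sim_neighbors(A, B, edges, cluster_id):
    na = neighbor_bag(A, edges, cluster_id)
    nb = neighbor_bag(B, edges, cluster_id)
    union = sum((na | nb).values())    # bag union: max of counts
    inter = sum((na & nb).values())    # bag intersection: min of counts
    return inter / union if union else 0.0

# The slide's example: N(A) = {3, 3, 5}
edges = {"a": ["b"], "a2": ["b2", "b3"]}
cluster_id = {"b": 3, "b2": 3, "b3": 5}
print(neighbor_bag(["a", "a2"], edges, cluster_id))   # Counter({3: 2, 5: 1})
```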
An Example of sim_neighbors(A, B)
 Recall that earlier we also defined a Jaccard measure:
 JaccardSim_coAuthors(a, b) = |coAuthors(a) ∩ coAuthors(b)| / |coAuthors(a) ∪ coAuthors(b)|
 Contrast that with
 sim_neighbors(A, B) = Jaccard(N(A), N(B)) = |N(A) ∩ N(B)| / |N(A) ∪ N(B)|
 In the former, we assume two co-authors match if their "strings" match
 In the latter, two co-authors match only if they have the same cluster ID

55
An Example to Illustrate the Working of
Agglomerative Hierarchical Clustering

56
Outline

 Selecting blocking attributes
 Matching problem definition
 Rule-based matching
 Learning-based matching
 Matching by clustering
 Collective matching
 Scaling up data matching

57
Scaling up Rule-based Matching
 Two goals: minimize the # of tuple pairs to be matched, and minimize the time it takes to match each pair
 For the first goal:
 hashing
 sorting
 indexing
 canopies
 using representatives
 combining the techniques
 Hashing
 hash tuples into buckets, match only tuples within each bucket
 e.g., hash house listings by zip code, then match within each zip code

58
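A sketch of hash blocking; the key argument is any blocking function, e.g., lambda t: t["zip"]. It also shows why recall can be low: a true match whose key values differ (say, a typo in the zip code) lands in different buckets and is never compared.

```python
from collections import defaultdict
from itertools import combinations

def hash_block(tuples, key):
    """Hash tuples into buckets; only pairs within a bucket are compared."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[key(t)].append(t)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)   # candidate pairs per bucket
```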
Scaling up Rule-based Matching

 Sorting
 use a key to sort the tuples, then scan the sorted list and match each tuple with only the previous (w − 1) tuples, where w is a pre-specified window size
 the key should be strongly "discriminative": it brings together tuples that are likely to match, and pushes apart tuples that are not
 example keys: social security number, student ID, last name, Soundex value of last name
 sorting employs a stronger heuristic than hashing: it also requires that tuples likely to match be within a window of size w
 but it is often faster than hashing, because it matches fewer pairs

59
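A sketch of the sorted-neighborhood scan just described (the soundex key in the comment assumes some phonetic-coding helper is available):

```python
def sorted_neighborhood(tuples, key, w):
    """Sort by a discriminative key; pair each tuple with the previous w-1."""
    ordered = sorted(tuples, key=key)
    for i in range(len(ordered)):
        for j in range(max(0, i - (w - 1)), i):
            yield ordered[j], ordered[i]

# e.g., pairs = sorted_neighborhood(tuples, key=lambda t: soundex(t["last"]), w=4)
```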
Scaling up Rule-based Matching

 Indexing
 index the tuples such that given any tuple a, we can use the index to quickly locate a relatively small set of tuples that are likely to match a
 e.g., an inverted index on names
 Canopies
 use a computationally cheap sim measure to quickly group tuples into overlapping clusters called canopies (or umbrella sets)
 use a different (far more expensive) sim measure to match tuples within each canopy
 e.g., use TF/IDF to create the canopies

60
Scaling up Rule-based Matching

 Using representatives
 applied during the matching process
 assign tuples that have been matched into groups, such that tuples within a group match and tuples across groups do not
 create a representative for each group, by selecting a tuple in the group or by merging the tuples in the group
 when considering a new tuple, match it only against the representatives
 Combining the techniques
 e.g., hash houses into buckets using zip codes, then sort the houses within each bucket using street names, then match them using a sliding window

61
Scaling up Rule-based Matching

 For the second goal of minimizing the time it takes to match each pair
 no well-established technique as yet
 tailor the solution to the application and the matching approach
 e.g., if using a simple rule-based approach that matches individual attributes and then combines their scores using weights
 we can use short circuiting: stop the computation of the sim score if it is already so high that the tuple pair will match even if the remaining attributes do not match

62
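A sketch of short circuiting for a weighted-sum rule. Beyond the early exit described above, it also bails out on the low side once the remaining weight can no longer reach the threshold, a symmetric optimization the slide does not mention.

```python
def match_short_circuit(x, y, sims, weights, beta):
    """sims: per-attribute similarity functions; weights sum to 1."""
    total, remaining = 0.0, sum(weights)
    for s, w in zip(sims, weights):
        total += w * s(x, y)
        remaining -= w
        if total >= beta:                  # already a match; skip the rest
            return True
        if total + remaining < beta:       # can never reach the threshold
            return False
    return total >= beta
```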
Scaling up Other Matching Methods
 Learning, clustering, probabilistic, and collective approaches often face similar scalability challenges, and can benefit from the same solutions
 Probabilistic approaches raise additional challenges
 if the model has too many parameters → it is difficult to learn efficiently, and a large amount of training data is needed to learn it accurately
 make independence assumptions to reduce the # of parameters
 once learned, inference with the model is also costly
 use approximate inference algorithms
 simplify the model so that closed-form equations exist
 the EM algorithm can be expensive
 truncate EM, or initialize it as accurately as possible

63
Scaling up Using Parallel Processing

 Commonly done in practice
 Examples
 hash tuples into buckets, then match each bucket in parallel
 match tuples against a taxonomy of entities (e.g., a product taxonomy or a Wikipedia-like concept taxonomy) in parallel
 two tuples are declared matched if they match the same taxonomic node
 a variant of using representatives to scale up, discussed earlier

64
Summary

 Selecting blocking attributes is an important problem, since it can drastically reduce the EM time
 There is a huge amount of work in academia and industry on matching:
 rule-based matching
 learning-based matching
 matching by clustering
 collective matching
 We have discussed only the most common and basic approaches

65
