DQ Matching

What is Matching?
Matching is the real heart of Quality Stage. Different probabilistic algorithms are
available for different types of data. Using the frequencies developed during
investigation (or subsequently), the information content (or “rarity value”) of each value
in each field can be estimated. The less common a value, the more information it
contributes to the decision. A separate agreement weight or disagreement weight is
calculated for each field in each pair of records being compared, incorporating both its
information content (the likelihood that a match has actually been found) and the
probability that the agreement occurred purely at random. These weights are summed
across the fields in the record to produce an aggregate weight, which is the basis for
reporting that a particular pair of records probably are, or probably are not, duplicates
of each other.
There is a third possibility, a “grey area” in the middle, which Quality Stage refers to as
the “clerical review” area: record pairs in this category need to be referred to a human
for a decision because there is not enough certainty either way. Over time the
algorithms can be tuned with improved rule sets, weight overrides, different
probability settings and so on, so that fewer and fewer “clericals” are found.
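The weight calculation described above can be sketched in a few lines. This is a minimal illustration in the Fellegi-Sunter style, not Quality Stage's actual implementation. The m and u probabilities, the field names, and the two cutoff values are all hypothetical, and the sketch uses one u value per field, whereas the text describes estimating rarity per value from frequency data.

```python
import math

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match), i.e. chance agreement
FIELD_PROBS = {
    "surname":    {"m": 0.95, "u": 0.01},
    "birth_year": {"m": 0.90, "u": 0.05},
    "gender":     {"m": 0.98, "u": 0.50},
}

def agreement_weight(m, u):
    # The rarer a chance agreement (small u), the larger the weight.
    return math.log2(m / u)

def disagreement_weight(m, u):
    # Negative whenever m > u: a disagreement counts against a match.
    return math.log2((1 - m) / (1 - u))

def composite_weight(rec_a, rec_b):
    # Sum the agreement or disagreement weight of every compared field.
    total = 0.0
    for field, p in FIELD_PROBS.items():
        if rec_a[field] == rec_b[field]:
            total += agreement_weight(p["m"], p["u"])
        else:
            total += disagreement_weight(p["m"], p["u"])
    return total

def classify(weight, match_cutoff, clerical_cutoff):
    # Two cutoffs split the weight scale into match / clerical / nonmatch.
    if weight >= match_cutoff:
        return "match"
    if weight >= clerical_cutoff:
        return "clerical"
    return "nonmatch"

a = {"surname": "Okafor", "birth_year": 1980, "gender": "F"}
b = {"surname": "Okafor", "birth_year": 1981, "gender": "F"}
print(classify(composite_weight(a, b), match_cutoff=8.0, clerical_cutoff=3.0))
```

Agreement on surname contributes far more than agreement on gender here because a chance agreement on surname is much less likely. Raising or lowering the two cutoffs is what widens or narrows the clerical review band.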
Matching makes use of a concept called “blocking”, an unfortunately chosen
term that means that potential sets of duplicates form blocks (or groups, or sets) which
can be treated as separate sets of potentially duplicated values. Each block of potential
duplicates is given a unique ID, which can be used by the next phase (survivorship) and
can also be used to set up a table of linkages between the blocks of potential duplicates
and the keys of the original data records in those blocks. This is often a
requirement when de-duplication is being performed, for example when combining
records from multiple sources or generating a list of unique addresses from a customer
file.
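The unique IDs and the linkage table mentioned above can be illustrated as follows. This is a sketch under assumptions: the function name `build_linkage` and the record keys are hypothetical, and matched pairs are merged transitively with a small union-find, which may differ from how Quality Stage forms its groups.

```python
def build_linkage(record_keys, matched_pairs):
    """Return (set_id, record_key) rows linking each record to its group."""
    parent = {k: k for k in record_keys}

    def find(k):
        # Follow parent pointers to the group representative (with path halving).
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # Merge the groups of every pair that was judged a duplicate.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Number the groups 1, 2, 3, ... in order of first appearance.
    set_ids = {}
    linkage = []
    for k in record_keys:
        root = find(k)
        set_id = set_ids.setdefault(root, len(set_ids) + 1)
        linkage.append((set_id, k))
    return linkage

keys = ["C1", "C2", "C3", "C4"]
pairs = [("C1", "C3")]  # hypothetical result of a match pass
for row in build_linkage(keys, pairs):
    print(row)
```

The resulting rows are the kind of table a later survivorship step could use to pick one surviving record per set ID.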
More than one pass through the data may be required to identify all the potential
duplicates. For example, one customer record may refer to a customer with a street
address while another record for the same customer gives the customer’s post
office box address. Searching for duplicate addresses would not link these two records; an
additional pass based on other criteria would be required. Quality Stage
provides for multiple passes, either fully passing through the data for each pass, or only
examining the unmatched records on subsequent passes (which is usually faster).
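The second option, where later passes examine only the records left unmatched, might look like the sketch below. Exact equality on one field stands in for a real probabilistic comparison, and the function name `run_passes`, the field names, and the sample records are all hypothetical.

```python
from itertools import combinations

def run_passes(records, pass_keys):
    """Run one pass per key function; later passes see only unmatched records."""
    matched = set()
    pairs = []
    for key_fn in pass_keys:
        remaining = [k for k in records if k not in matched]
        groups = {}
        for k in remaining:
            value = key_fn(records[k])
            if value:  # skip records with no usable value for this pass
                groups.setdefault(value, []).append(k)
        for members in groups.values():
            if len(members) > 1:
                pairs.extend(combinations(members, 2))
                matched.update(members)
    return pairs

records = {
    1: {"name": "Ann Lee", "address": "12 Oak St", "phone": "555-0101"},
    2: {"name": "Ann Lee", "address": "PO Box 9",  "phone": "555-0101"},
    3: {"name": "Bob Ray", "address": "7 Elm Rd",  "phone": "555-0199"},
    4: {"name": "Bob Ray", "address": "7 Elm Rd",  "phone": "555-0199"},
}

# Pass 1 compares on address; pass 2 compares the leftovers on phone number.
found = run_passes(records, [lambda r: r["address"], lambda r: r["phone"]])
print(found)
```

Records 1 and 2 have different addresses, so the address pass misses them, and they are only paired by the phone pass. Records 3 and 4 are settled in the first pass and are not examined again.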
Matching vs. Lookups, Joins, and Merges
Within Information Server, multiple stages offer capabilities that can be considered matching,
for example:
• Lookup
• Join
• Merge
• Unduplicate Match
• Reference Match
• Lookups, Joins, and Merges typically use key attributes, exact match criteria, or
matches to a range of values or simple formats.
• The Unduplicate Match stage and Reference Match stage offer probabilistic matching
capability.
There are two types of match stage:
• Unduplicate Match locates and groups all similar records within a single input data
source. This process identifies potential duplicate records, which might then be
removed.
• Reference Match identifies relationships among records in two data sources. An
example of many-to-one matching is matching the ZIP codes in a customer file against the
list of valid ZIP codes. More than one record in the customer file can have the same
ZIP code.
Blocking step
• Blocking provides a method of limiting the number of pairs to examine. When you
partition data sources into mutually exclusive and exhaustive subsets and only
search for matches within a subset, the process of matching becomes manageable.
• Basic blocking concepts include:
  – Blocking partitions the sources into subsets that make computation feasible. Block
  size is the single most important factor in match performance. Blocks should be as
  small as possible without causing block overflows. Smaller blocks are more efficient
  than larger blocks during matching.
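A small count shows why block size matters so much. The sketch below compares the number of candidate pairs with and without blocking. The blocking column (`zip`) and the sample records are hypothetical.

```python
from collections import defaultdict

def pairs_without_blocking(n):
    # Every record is compared with every other record: n choose 2.
    return n * (n - 1) // 2

def block_records(records, block_key):
    # Partition records into mutually exclusive blocks by the blocking value.
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    return blocks

def pairs_with_blocking(blocks):
    # Only records that share a block are compared with each other.
    return sum(len(b) * (len(b) - 1) // 2 for b in blocks.values())

sample = [
    {"id": 1, "zip": "10001"}, {"id": 2, "zip": "10001"}, {"id": 3, "zip": "10001"},
    {"id": 4, "zip": "20002"}, {"id": 5, "zip": "20002"},
    {"id": 6, "zip": "30003"},
]

blocks = block_records(sample, lambda r: r["zip"])
print(pairs_without_blocking(len(sample)), pairs_with_blocking(blocks))
```

Because the pair count inside a block grows with the square of the block size, one oversized block can dominate the cost of a pass. The trade-off is that records placed in different blocks are never compared in that pass.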
Reference Match Stage
• The Reference Match stage identifies relationships among records. This match can
group the records being compared in different ways:
• One-to-one matching
• Many-to-one matching
One-to-one matching
• Identifies all records in one data source that correspond to a record for the same
individual, event, household, or street address in a second data source.
• Only one record in the reference source can match one record in the data source
because the matching applies to individual events. For example, finding the same individual
by comparing SSNs in a voter registration list and a department of motor vehicles
list.
Many-to-one matching
• Multiple records in the data file can match a single record in the reference file.
For example, matching a transaction data source to a master data source allows many
transactions for one person in the master data source.
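The difference between the two groupings above comes down to whether a reference record may be used more than once. The sketch below assumes candidate pairs have already been scored. The function names, record keys, and weights are hypothetical, and the greedy highest-weight-first assignment is a simple stand-in rather than the method the stage itself uses.

```python
def assign_many_to_one(scored_pairs):
    # Each data record takes its best reference record; references may repeat.
    best = {}
    for data_key, ref_key, weight in scored_pairs:
        if data_key not in best or weight > best[data_key][1]:
            best[data_key] = (ref_key, weight)
    return {d: r for d, (r, _) in best.items()}

def assign_unique_reference(scored_pairs):
    # Highest weights first; each reference record is used at most once.
    assigned = {}
    used_refs = set()
    for data_key, ref_key, weight in sorted(scored_pairs, key=lambda p: -p[2]):
        if data_key not in assigned and ref_key not in used_refs:
            assigned[data_key] = ref_key
            used_refs.add(ref_key)
    return assigned

# (transaction key, master key, composite weight): illustrative values only
scored = [
    ("T1", "M1", 12.0),
    ("T2", "M1", 10.5),
    ("T2", "M2", 4.0),
    ("T3", "M2", 9.0),
]

print(assign_many_to_one(scored))
print(assign_unique_reference(scored))
```

With many-to-one assignment, transactions T1 and T2 both link to master record M1. When each reference record may be used only once, T2 is left without a match because M1 and M2 are both claimed by higher-weight pairs.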
The Reference Match stage delivers up to six outputs:
– Match contains matched records for both inputs.
– Clerical has records that fall in the clerical range for both inputs.
– Data Duplicate contains duplicates in the data source.
– Reference Duplicate contains duplicates in the reference source.
– Data Residual contains records that are non-matches from the data input.
– Reference Residual contains records that are non-matches from the reference input.
