
Record linkage system

History

• The term record linkage was first used by the chief of
the U.S. National Office of Vital Statistics, Dr. Halbert
L. Dunn, in a talk given in Canada in 1946.

• Dr. Dunn advocated the use of a unique number (e.g.
birth registration number).
• Historically record linkage was assigned to clerks who
would search and review lists to bring together the
appropriate pairs of records for comparison, seek
additional information when there were questionable
matches, and finally make decisions regarding the
linkages based on established rules.
• Formal development of a theory of record linkage
started with the pioneering work of Fellegi and
Sunter (1969).

• Several people have worked on extending or
modifying their procedure (Jaro 1989; Winkler 1994).
Need for record linkage
• Meeting researchers' and the community's demand for detailed statistical information
• Responding to increasing business and health needs
• Improving data quality and timeliness
• Reducing the complexity of data
• Reducing respondent burden and costs


What is Record Linkage?
• Record linkage is the process of bringing together two or
more records relating to the same individual (person),
family or entity (e.g. event, object, geography, business
etc).

• Its aim is to find syntactically distinct data entries that refer to the
same entity in two or more input files.

• It is part of the data cleaning process, which is a crucial first
step in the knowledge discovery process.
[Figure: Input 1, Input 2 and Input 3 joined at a common Link/Event]
Types
• Two basic types of strategies: deterministic
and probabilistic, both of which are
considered to be a type of exact matching
[Figure: types of record linkage strategies: deterministic and probabilistic]

Deterministic Record linkage
• A pair of records is said to be a link if the two records
agree exactly on each element within a collection of
identifiers called the match key.

• ALL or NONE

• For example, when comparing two records on last name,
street name, year of birth, and street number, the pair of
records is deemed to be a link only if the names agree on
all characters, the years of birth are the same, and the
street numbers are identical.
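
The all-or-none rule above can be sketched in a few lines of Python. This is a minimal illustration, not a production matcher: the field names in MATCH_KEY and the sample records are hypothetical and simply mirror the example on this slide.

```python
# Hypothetical match key, mirroring the identifiers named in the example above.
MATCH_KEY = ("last_name", "street_name", "year_of_birth", "street_number")

def is_link(rec_a, rec_b, match_key=MATCH_KEY):
    """Return True only if the records agree exactly on every identifier."""
    return all(rec_a[field] == rec_b[field] for field in match_key)

# Illustrative records (invented for this sketch).
a = {"last_name": "Smith", "street_name": "Main",
     "year_of_birth": 1970, "street_number": 12}
b = dict(a)                       # exact copy of a
c = dict(a, last_name="Smithe")   # one character differs in the surname

print(is_link(a, b))  # True
print(is_link(a, c))  # False
```

A single differing character is enough to reject the pair, which is why deterministic linkage depends heavily on clean, standardised identifiers.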
Probabilistic Record Linkage
• Formalized by Fellegi and Sunter [1969].
• Pairs of records are classified as links, possible links, or
non-links.

• Here, we consider the probability of a match given the
observed data.
• In probability matching, a threshold of likelihood is set
(which can be varied in different circumstances) above
which a pair of records is accepted as a match, relating
to the same person, and below which the match is
rejected.
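
A rough sketch of how such a classification might look, using the agreement and disagreement weights of the Fellegi and Sunter approach. Everything numeric here is an assumption made for illustration: the m-probabilities (chance a field agrees for a true match), the u-probabilities (chance it agrees for a non-match), and the two thresholds `upper` and `lower` that separate links, possible links, and non-links.

```python
import math

# Hypothetical (m, u) probabilities per field; real values are estimated from data.
M_U = {
    "last_name": (0.95, 0.01),
    "year_of_birth": (0.98, 0.05),
    "street_number": (0.90, 0.02),
}

def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum log2 likelihood ratios over fields: agreement adds, disagreement subtracts."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total

def classify(weight, upper=10.0, lower=0.0):
    """Assumed thresholds: at or above upper is a link, at or below lower a non-link."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"

a = {"last_name": "Smith", "year_of_birth": 1970, "street_number": 12}
b = dict(a)
c = dict(a, last_name="Smithe")
d = {"last_name": "Jones", "year_of_birth": 1985, "street_number": 40}

for other in (b, c, d):
    print(classify(match_weight(a, other)))
```

Raising or lowering the thresholds trades false links against missed links, which is the sense in which the threshold "can be varied in different circumstances".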
INFORMATION FLOW IN RLS
Standardization
• Every data set contains manual entry errors, inconsistent
abbreviations, etc., which can make records for the same entity
appear to be separate entries

• The first step is to clean and standardise the data

• E.g. : For input data belonging to Mr. William Marcus Smith, entries
could have been made by different individuals as :
– Smith W. M.
– William M. Smith
– W.M. Smith
– W.M. Smithe etc
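
One possible way to standardise the variants above is to reduce each name to a (surname, initials) pair. The sketch below is a simple heuristic, not an established standardisation routine: the function name `standardise_name` and the rule for deciding whether the surname comes first or last are assumptions made for this example.

```python
import re

def standardise_name(raw):
    """Reduce a free-text name to a lowercase (surname, initials) pair."""
    # Split on whitespace, periods and commas; drop empty pieces.
    tokens = [t for t in re.split(r"[\s.,]+", raw.lower()) if t]
    if not tokens:
        return ("", "")
    # Assumption: a full word followed only by single letters means "surname first".
    surname_first = (
        len(tokens) > 1
        and len(tokens[0]) > 1
        and all(len(t) == 1 for t in tokens[1:])
    )
    if surname_first:
        surname, given = tokens[0], tokens[1:]
    else:
        surname, given = tokens[-1], tokens[:-1]
    initials = "".join(t[0] for t in given)
    return (surname, initials)

for raw in ["Smith W. M.", "William M. Smith", "W.M. Smith", "W.M. Smithe"]:
    print(raw, "->", standardise_name(raw))
```

The first three variants collapse to the same pair, while "Smithe" stays distinct: standardisation removes formatting differences, but spelling errors still need approximate or probabilistic comparison.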
Blocking:
• Blocking reduces the search space (i.e. the
number of record pairs to be compared)

• Similar records are grouped together into blocks or
clusters

• The data sets are split into smaller blocks and
only records within the same block are
compared
• E.g. instead of making detailed comparisons of
all 90 billion pairs from two lists of 300,000
records representing all businesses in a State
of the U.S., it may be sufficient to consider the
set of 30 million pairs that agree on U.S. Postal
ZIP code.
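
The ZIP code example can be sketched as follows. This is a minimal illustration under assumed names: `candidate_pairs`, the `zip` field, and the toy records are hypothetical. It generates only the pairs whose blocking key agrees, instead of the full cross product of the two files.

```python
from collections import defaultdict

def candidate_pairs(file_a, file_b, key):
    """Yield only the record pairs that share the same blocking key value."""
    blocks = defaultdict(lambda: ([], []))  # key value -> (records from A, records from B)
    for rec in file_a:
        blocks[rec[key]][0].append(rec)
    for rec in file_b:
        blocks[rec[key]][1].append(rec)
    for left, right in blocks.values():
        for a in left:
            for b in right:
                yield a, b

# Toy files (invented for this sketch).
file_a = [
    {"name": "Acme Ltd", "zip": "10001"},
    {"name": "Bolt Inc", "zip": "10002"},
    {"name": "Core LLC", "zip": "10001"},
]
file_b = [
    {"name": "ACME Limited", "zip": "10001"},
    {"name": "Delta Co", "zip": "10003"},
]

pairs = list(candidate_pairs(file_a, file_b, "zip"))
full = len(file_a) * len(file_b)
print(len(pairs), "candidate pairs instead of", full)

# The slide's figure: two lists of 300,000 records give 90 billion pairs.
print(300_000 * 300_000)
```

The trade-off is that a true match whose blocking key is wrong in one file (for example, a mistyped ZIP code) is never compared at all.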
Matching
Exact Matching
• Linkage of data for the same unit (e.g., establishment)
from different files.
• Uses identifiers such as name, address, or tax unit
number

Statistical Matching
• Attempts to link files that may have few units in
common
• Linkages are based on similar characteristics rather
than unique identifying information
Requirements for defining a RLS
• Types of linkages required,
• Whether the linkage is performed in batch and/or interactive
mode,
• The security provisions for confidential data files,

• The speed of operation needed,

• The volume of records that can be linked with the system,

• The initial cost of software, including licensing and maintenance
costs,

• Simplicity and flexibility in defining the rules used for linkages.


• FEBRL

• GRLS (used in Canada)

• OX-RL (Oxford Record Linkage System, used in
the UK)

• FRIL
General record linkage system
Uses
• The system is used to improve data quality and coverage, for
long term medical follow up of cohorts, for creating patient-
oriented rather than event-oriented data, for building new
data sources, and for a range of other statistical purposes

• It helps create a statistically relevant source of ‘new’
information

• It answers research questions relating to genetics,
occupational and environmental health, and medical
research.
Applications
• Duplication in data is minimized
• Powerful tool for generating more value out of
existing databases
• Large projects regarding the census of an
entire country can be planned
• More detailed information can be obtained
• Becomes easier to follow cohorts
Drawback
• Issues of privacy and confidentiality

• Policies for conducting studies using such
systems must be transparent
Conclusion
• “Each person in the world creates a book of
life. The book starts with birth and ends with
death. Its pages are made up of the principal
events of life. Record linkage is the name given
to the process of assembling the pages of the
book into a volume.”
– Dr. Dunn (1946)
Thank you!
