
Data Integration Manual

Acknowledgement
This report was prepared by Statistics New Zealand's Statistical Methods team and produced by the Product Development and Publishing unit.

Further information
For further information on the statistics in this report, or on other reports or products, contact Statistics New Zealand's Information Centre. Visit our website: www.stats.govt.nz or email us at: info@stats.govt.nz or phone toll free: 0508 525 525

Auckland: Private Bag 92003, Phone 09 920 9100, Fax 09 920 9198
Wellington: PO Box 2922, Phone 04 931 4600, Fax 04 931 4610
Christchurch: Private Bag 4741, Phone 03 964 8700, Fax 03 964 8964

Information Centre
Your gateway to Statistics New Zealand

Statistics New Zealand collects more than 60 million pieces of information each year. New Zealanders tell us how and where they live and about their work, spending and recreation. We also collect a complete picture of business in New Zealand. This valuable resource is yours to use. But with all the sophisticated options available, finding exactly what you need can sometimes be a problem.

Giving you the answers

Our customer services staff can provide the answers. They are the people who know what information is available and how it can be used to your best advantage. Think of them as your guides to Statistics New Zealand. They operate a free enquiry service where answers can be quickly provided from published material. More extensive answers and customised solutions will incur costs, but we always give you a free, no-obligation quote before going ahead.

Liability statement
Statistics New Zealand gives no warranty that the information or data supplied in this report is error free. All care and diligence has been used, however, in processing, analysing and extracting information. Statistics New Zealand will not be liable for any loss or damage suffered by customers consequent upon the use, directly or indirectly, of information in this report.

Reproduction of material
Any table or other material published in this report may be reproduced and published without further licence, provided that it does not purport to be published under government authority and that acknowledgement is made of this source.

Published in August 2006 by
Statistics New Zealand
PO Box 2922, Wellington
www.stats.govt.nz
Crown Copyright
ISBN 0-478-26971-4

Contents
Preface vi
Abbreviations vii
1 Introduction to Data Integration 1
1.1 Introduction 1
1.1.1 What is data integration? 1
1.1.2 Levels of data integration 1
1.1.3 The role of data integration 1
1.1.4 Why integrate? 1
1.1.5 Legal and policy considerations 2
1.2 Some key data integration concepts 2
1.2.1 Integration for statistical and administrative purposes 2
1.2.2 Exact linkage and probabilistic linkage 2
1.2.3 Quality assessment 2
1.2.4 Data integration scenarios 3
1.3 Data integration at Statistics NZ and elsewhere 4
1.3.1 The emergence of data integration 4
1.3.2 Data integration at Statistics NZ 4
1.4 Key steps in a data integration project 5
2 Legal and Policy Considerations 7
2.1 Introduction 7
2.2 The Statistics Act 1975 7
2.3 The Privacy Act 1993 8
2.3.1 General information 8
2.3.2 Use of unique identifiers 9
2.4 The Statistics NZ Data Integration Policy 10
2.5 Codes of practice 12
2.6 Data integration business case 12
2.6.1 Privacy impact assessment 13
2.6.2 Consultation with other agencies 14
2.7 The Statistics NZ Confidentiality Protocol 15
2.8 The Statistics NZ Microdata Access Protocols 15
3 Operational Aspects of a Statistics NZ Data Integration Project 17
3.1 Early stages of a data integration project 17
3.2 Other relationships 17
3.3 Obtainment and safe keeping of external data 18
3.3.1 Data extract 18
3.3.2 Data transfer, storage, security and internal access controls 18
3.4 Documentation and quality assurance 19
3.4.1 Documenting record linkage methodology 19
3.4.2 Reviewing record linkage methodology 19
3.4.3 Supporting documentation 20
3.5 IT considerations 20
4 Preparing for Record Linkage 21
4.1 Introduction 21
4.2 Gathering information about source data 21
4.2.1 Preliminary investigation of source data 21
4.2.2 Target population and units 22
4.2.2.1 Identification of population 22
4.2.2.2 Identification of units 23
4.2.3 Understanding the source data metadata 24
4.2.4 Implications from the metadata 25
4.3 Procedure for obtaining data 26
4.3.1 Request for supply of data 26
4.3.2 Data transfer 27
4.3.3 Data verification 30
4.3.4 Feedback to provider 30
4.4 Preparing data for record linkage 30
4.4.1 Typical errors in linking variables 30
4.4.2 Standardisation: editing, parsing, formatting, concordance 32
4.4.2.1 Editing 32
4.4.2.2 Parsing and standardisation of linking variables 32
4.4.2.3 Concordances 34
4.4.3 Deduplication 34
4.4.4 Anonymisation of unique identifiers 34
5 Statistical Theory of Record Linkage 36
5.1 Introduction 36
5.2 Exact matching 36
5.3 Terminology 36
5.4 Matching files 37
5.5 The human approach 37
5.6 The mathematical approach 38
5.6.1 The m probability 38
5.6.2 The u probability 39
5.6.3 The field weight 39
5.6.4 The composite weight 39
5.6.5 Example 40
5.6.6 Changing the m and u probabilities 40
5.7 Weights 42
5.7.1 Distribution 42
5.7.2 Cut-off thresholds 43
5.7.3 Clerical review 44
5.8 Blocking 44
5.9 Passes 45
6 Record Linkage in Practice 46
6.1 Types of matching 46
6.2 Pre-matching process 48
6.2.1 Deduplication 48
6.2.2 A data integration process flow 49
6.2.3 Standardised datasets 50
6.3 Matching method 51
6.3.1 Choice of blocking variables 51
6.3.2 Choice of linking variables 52
6.3.3 Commonly used comparison functions for linking variables 53
6.3.4 The m and u probabilities 55
6.4 Quality assessment of linked data 56
6.4.1 Setting the cut-off threshold 56
6.4.2 False positives, false negatives and match rates 58
6.4.3 Measurement error in integration 59
6.5 Adding data over time 60
Appendix: Statistics New Zealand's Uses of Data Integration 61
Glossary 62
Bibliography 66

Preface
The Data Integration Manual provides a guide to data integration as carried out at Statistics New Zealand. The manual was written by Statistics NZ staff, following involvement in several large interagency data integration projects. The aim of the manual is to provide a guide to best practice and to share the insights gained from Statistics NZ's experience. We hope the manual will assist agencies collaborating with Statistics NZ, and others interested in data integration, to understand the basic concepts, theory and processes involved in data integration, as well as providing practical advice.

The manual begins with an introduction to data integration that describes what data integration is and why it is carried out, and outlines the key steps involved. Chapter 2 introduces the legal environment and Statistics NZ policy on data integration. Chapter 3 describes operational aspects of Statistics NZ data integration projects. The remaining chapters focus on technical aspects of the linkage itself: the preparation of data needed before record linkage can be undertaken, the statistical theory of record linkage and the practical implementation of record linkage techniques.


Abbreviations
ACC     Accident Compensation Corporation
CD      compact disc
CURF    confidentialised unit record file
DSW     Department of Social Welfare
DVD     digital video disc
FAQs    frequently asked questions
IRD     Inland Revenue Department
IT      information technology
IUID    internally assigned unique identifier
LEED    Linked Employer-Employee Data
MOU     Memorandum of Understanding
NHI     National Health Index
NMDS    National Minimum Dataset
NYSIIS  New York State Identification and Intelligence Algorithm
NZHIS   New Zealand Health Information Services
OPC     Office of the Privacy Commissioner
PGP     Pretty Good Privacy
PIA     privacy impact assessment
SLA     Service Level Agreement
UID     unique identifier



1 Introduction to Data Integration

Summary This chapter provides an introduction to data integration, describes why data integration is carried out, and presents a brief history of data integration at Statistics New Zealand.

1.1 Introduction
1.1.1 What is data integration?

Data integration is defined broadly as the combination of data from different sources about the same or a similar individual or unit. This definition includes linkages between survey and administrative data, as well as between data from two or more administrative sources. An alternative application of data integration theory is in identifying records on a single file that belong to the same individual or unit. Other terms used to describe the process of data integration include record linkage and data matching.

1.1.2 Levels of data integration

When integration occurs at the micro level, information on one individual (unit) can be linked to:
(i) a different set of information on the same person (unit)
(ii) information on an individual (unit) with the same characteristics.
At the macro level, collective statistics on a group of people or a region can be compared and used together. The main focus of this manual is micro-level data integration of type (i), that is, linkage of records that are likely to belong to the same individual or unit.

1.1.3 The role of data integration

The role of data integration in helping to produce an effective official statistical system is becoming increasingly apparent. The process of bringing together information from different sources paves the way for a broader range of questions to be answered. Through integration it becomes possible to examine underlying relationships between various aspects of society, thus improving our knowledge and understanding about a particular subject.

1.1.4 Why integrate?

Linking administrative data from different sectors creates a valuable source of information for statistical and research purposes because relationships that previously could not have been considered can be examined. There may sometimes be other possible methods of investigating relationships of particular interest, for example conducting a survey. However, data integration can offer a less time consuming and less costly alternative, although it does still require a significant level of time and resource. Data integration also has the advantage of reducing respondent burden by making more effective use of existing data sources.


1.1.5 Legal and policy considerations

Data integration raises a range of legal and policy issues, some of which can be complex to resolve. These are discussed in more detail in Chapter 2.

1.2 Some key data integration concepts


1.2.1 Integration for statistical and administrative purposes

When data is linked for statistical purposes, individuals (or units) are identified only to enable the link to be made. When the linkage is complete, the identity of the individual (or unit) is no longer of any statistical interest. The linked dataset is used to report statistical findings about the population or sub-populations. In contrast, when data is linked for administrative purposes, individuals are identified not only to enable the link to be made, but also for administrative use subsequent to the linkage. This may sometimes result in adverse action, such as prosecution, being taken against individuals. Statistics NZ undertakes data integration only for statistical purposes.

1.2.2 Exact linkage and probabilistic linkage

There are two key methods for matching records. Exact linkage involves using a unique identifier (for example a tax number, passport number or driver's licence number) that is present on both files to link records. It is the easiest and most efficient way to link datasets, and standard statistical software such as SAS can be used. Where a unique identifier is not available, or is not of sufficient quality or coverage to be relied on alone, probabilistic linkage1 is employed. This involves the use of other variables common to both files (for example names, addresses, date of birth and sex). Probabilistic linking is more complex, and sophisticated data integration software is required in order to achieve high-quality results.
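To make the distinction concrete, the sketch below shows exact linkage as a join on a shared identifier. It is a minimal illustration only, written in Python with pandas and entirely hypothetical column names (tax_number, income, benefit_weeks); it is not drawn from any Statistics NZ system.

# Minimal sketch of exact linkage on a shared (hypothetical) unique identifier.
import pandas as pd

survey = pd.DataFrame({
    "tax_number": ["A123", "B456", "C789"],
    "income": [42000, 55000, 61000],
})
admin = pd.DataFrame({
    "tax_number": ["A123", "C789", "D012"],
    "benefit_weeks": [0, 12, 4],
})

# Exact linkage: an inner join keeps only records whose identifier
# appears on both files.
linked = survey.merge(admin, on="tax_number", how="inner")
print(linked)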

1.2.3 Quality assessment

Either linking method can result in two types of errors: false positive matches and false negative matches. A false positive match is where two records are linked together, when in reality they are not the same person or unit. A false negative match is where two records are not linked together, when they do in fact belong to the same person or unit. Generally there is a trade-off between the two types of errors since, for example, reducing the rate of false positives may increase the rate of false negatives. Thus it is important to consider the consequences of each type of error and to determine whether one is more critical than the other.

1. There are a range of terms used to describe types of linkage, including probabilistic, statistical, stochastic, and demographic. In the literature, different authors use these to communicate different concepts. Sometimes they are used interchangeably within a paper. The term probabilistic as defined above is used consistently throughout this manual.

An assessment of the size of each of these sources of linkage error should be undertaken as part of the integration and the results made available. Analysis of an integrated dataset should take into account possible impacts of the linkage error. Further details on quality assessment can be found in section 6.4, below.
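As a small illustration of reporting these error rates, the sketch below computes false positive and false negative rates from a clerically reviewed sample of record pairs. The field names (is_linked, is_true_match) and the figures are invented for illustration; this is not a prescribed Statistics NZ procedure.

# Minimal sketch: estimate linkage error rates from a reviewed sample of pairs.
def linkage_error_rates(pairs):
    links = [p for p in pairs if p["is_linked"]]
    true_matches = [p for p in pairs if p["is_true_match"]]
    false_pos = sum(1 for p in links if not p["is_true_match"])
    false_neg = sum(1 for p in true_matches if not p["is_linked"])
    return {
        "false_positive_rate": false_pos / len(links) if links else 0.0,
        "false_negative_rate": false_neg / len(true_matches) if true_matches else 0.0,
    }

sample = [
    {"is_linked": True,  "is_true_match": True},
    {"is_linked": True,  "is_true_match": False},   # false positive
    {"is_linked": False, "is_true_match": True},    # false negative
    {"is_linked": False, "is_true_match": False},
]
print(linkage_error_rates(sample))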

1.2.4 Data integration scenarios

There are different ways in which two datasets being integrated relate to each other.

[Figure: Venn diagrams of datasets A and B illustrating Situations 1, 2 and 3]

Situation 1
This is where every individual on dataset A is also on dataset B and vice versa. For example, dataset A might consist of addresses while dataset B contains rates information for each address.

Situation 2
This is where every individual on dataset B is on dataset A but there are individuals who appear on dataset A who are not on dataset B. For example, dataset A could be student enrolments and dataset B could be information for those students who have student loans.

Situation 3
This is where some individuals appear on both dataset A and dataset B. However, other individuals will appear on only one dataset or the other. For example, dataset A might be Accident Compensation Corporation (ACC) clients, while dataset B could be people who are admitted to hospital.

It should be noted that these are theoretical relationships between pairs of files. Real life is rarely that perfect. For example, in situation 1 there could be duplication and omissions within the files, and timing differences between the two files, which mean that they do not have 100 percent overlap. There are also different desired results from a pair of integrated datasets: the union or the intersection.

[Figure: Venn diagrams of datasets A and B showing the intersection and the union]

For example, if dataset A was ACC claims and dataset B was hospitalisations for injury, the intersection would be of interest if statistics were wanted on the number of ACC claimants admitted to hospital as a result of their injury. The union would be of interest if statistics were wanted on the total number of injuries, without double counting injuries represented in both datasets. Sometimes a different combination of records may be required, for example, all records on B, with information added from A if it is available.

[Figure: Venn diagram of datasets A and B showing all records on B]

Continuing the previous example, this combination may be of interest if hospitalisation costs were to be combined with costs to ACC, for the population of ACC claims.
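These three combinations can be expressed as different joins. The sketch below is a minimal illustration using pandas, with an invented person_id key and made-up cost columns; it is not taken from the ACC or hospitalisation datasets themselves.

# Minimal sketch of the intersection, union and "all records on B" combinations.
import pandas as pd

acc = pd.DataFrame({"person_id": [1, 2, 3], "acc_cost": [500, 1200, 300]})
hosp = pd.DataFrame({"person_id": [2, 3, 4], "hosp_cost": [8000, 2500, 4300]})

intersection = acc.merge(hosp, on="person_id", how="inner")  # claimants admitted to hospital
union = acc.merge(hosp, on="person_id", how="outer")         # all injuries, without double counting
all_on_b = hosp.merge(acc, on="person_id", how="left")       # all hospitalisations, with ACC costs where available
print(all_on_b)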

1.3 Data integration at Statistics NZ and elsewhere


1.3.1 The emergence of data integration

The most significant early contributions to record linkage came in the 1950s, in the field of medical research.2 Two of the most influential early papers were by Newcombe et al3 and Fellegi and Sunter.4

1.3.2 Data integration at Statistics NZ

Data integration has been used in a variety of ways at Statistics NZ, beginning in the 1990s, and gaining momentum in recent years. The earliest uses were mostly Statistics NZ driven, but more recent momentum has come from external interest in integrating datasets collected from different agencies for unrelated purposes. In 1997, the Government directed that where datasets are integrated across agencies from information collected for unrelated purposes, Statistics NZ should be custodian of these datasets in order to ensure public confidence in the protection of individual records.5 In the same Cabinet meeting, it was agreed that Statistics NZ should carry out a feasibility study

2. Gill L (2001). Methods for Automatic Record Matching and Linkage and their use in National Statistics, National Statistics Methodological Series No 25, National Statistics, United Kingdom.
3. Newcombe H, Kennedy J, Axford S, and James A (1959). Automatic Linkage of Vital Records, Science 130, 954-959.
4. Fellegi I and Sunter A (1969). A theory of record linkage, Journal of the American Statistical Association 64, 1183-1210.
5. Cabinet meeting minutes CAB (1997) M 31/14 [electronic copy unavailable].

into the costs and benefits of integrating cross-sectoral administrative data to produce new social statistics. This feasibility study was subsequently carried out, including a trial integration of Inland Revenue Department (IRD) income information with beneficiary data from the Department of Social Welfare (DSW). The feasibility, costs, barriers and benefits of that integration were assessed.6 The resulting final report, completed in 1998, was instrumental in laying the foundation for following data integration projects.

The Statistics New Zealand Statement of Intent for the year ended 30 June 2004 stated:7 Data from administrative and transaction databases will increasingly be used for producing statistics. Users will expect this data to be made available, and where applicable, be integrated with data from other sources, resulting in greater information richness.

Data integration has been used at Statistics NZ to create survey frames, supplement survey data, and produce new datasets. The main uses of probabilistic linkage have been as follows:
- New Zealand Census and Mortality Study
- Student Loan Data Integration Project
- Linked Employer-Employee Data (LEED) Project
- Injury Statistics Project.

More detail on Statistics NZ's uses of data integration is given in the Appendix.

1.4 Key steps in a data integration project


Through experience gained over the various data integration projects, Statistics NZ has identified a number of key steps that must be undertaken for a successful outcome. The list below is a high-level summary of what must be addressed. Each of these is fundamental, and none are trivial. The importance of clear and well-defined objectives cannot be overemphasised. The objectives will inform decisions at every other step of the project, from gaining approval to undertake the project (under Statistics NZ policy requirements), to assessing whether the integrated data is able to support outputs that are fit for purpose.

Most data integration projects go through a feasibility or development stage, where the steps will be investigated and carried out to a greater or lesser extent. The results of the feasibility study determine whether full production systems will be developed. Many of the steps have parallels in the production of statistics from survey data. Differences occur where the nature of the data collection is different, and because there is an additional step of matching two or more data sources.

6. Statistics New Zealand (1998). Final report on the feasibility study into the costs and benefits of integrating cross-sectoral administrative data to produce new social statistics. (Internal report available on request.)
7. Statistics New Zealand (2003c). Statistics New Zealand Statement of Intent: Year ending 30 June 2004, Statistics New Zealand, Wellington. http://www.stats.govt.nz/about-us/corporate-reports/statement-of-intent-03/default.htm

Key steps in a data integration project:
- develop clearly defined objectives
- address legal, policy, privacy, and security issues
- define governance structures and establish relationships with data providers and data users
- gain a thorough understanding of data sources
- decide how you will do the matching
- define and build information technology (IT) data storage and processing requirements
- obtain the source data
- carry out the matching
- validate the matching and provide quality measures
- consider provision of access to microdata and confidentiality of published outputs
- carry out the analysis and disseminate results.

These aspects are discussed in more detail in the remainder of the manual.


2 Legal and Policy Considerations

Summary All data integration projects are subject to a wide range of legislation, codes of practice, protocols and policies. This chapter gives an overview of the relevant guidelines and a practical guide to their application within Statistics NZ.

2.1 Introduction
Staff working on a data integration project should be aware of the various policies and legislative provisions that affect their project. Some of these, such as the Statistics Act 1975,8 are applicable to much of the business carried out by Statistics NZ. Others, such as the Statistics NZ Data Integration Policy, are much more specific to data integration projects. Sometimes it can be difficult to interpret legislation and apply it in a practical way to a particular situation. Different parties have differing views on what is or is not acceptable, and the first project teams to work on data integration projects have had to debate and work through issues, often establishing precedents in the process. Questions can often be resolved by discussion with experienced colleagues, the project manager and stakeholders. Sometimes advice must be sought from external parties such as a reference group or the Privacy Commissioner (see section 2.6.2, below). The following sections give an overview of some relevant documents and processes.

2.2 The Statistics Act 1975


Statistics NZ operates under the authority of the Statistics Act 1975. The Act provides the framework for the production of official statistics in New Zealand. It covers statistics collected in surveys of households and businesses, as well as statistics derived from administrative records of central and local government agencies. It covers official statistics produced by Statistics NZ as well as by other government agencies. As stated in the Statistics New Zealand Statement of Intent 2006,9 Statistics NZ's main roles are to:
- lead New Zealand's Official Statistics System
- be the key contributor to the collection, analysis and dissemination of official statistics relating to New Zealand's economy, environment and society
- build and maintain trust in official statistics
- ensure official statistics are of high integrity and quality, and are equally available to all
- guarantee that statistical information provided to Statistics NZ remains confidential, and that it will be used for statistical purposes only.

8. http://www.stats.govt.nz/about-us/who-we-are/statistics-act-1975.htm
9. Internal document available on request. The latest published version can be found at: http://www.stats.govt.nz/about-us/corporate-reports/statement-of-intent-05/default.htm

The Statistics Act 1975 does not specifically refer to data integration, so it is necessary to interpret its provisions in the data integration context. It includes the following points relevant to data integration:
- Official statistics shall be collected to provide information required by the Executive Government of New Zealand, Government Departments, local authorities, and businesses for the purpose of making policy decisions, and to facilitate the appreciation of economic, social, demographic, and other matters of interest to the said Government, Government Departments, local authorities, businesses, and to the general public [section 3(1)].
- Official statistics means statistics derived by Government Departments from: (a) statistical surveys as defined in this section; and (b) administrative and registration records and other forms and papers the statistical analysis of which are published regularly, or are planned to be published regularly, or could reasonably be published regularly [section 2].
- Information may be required of any person in a position to provide it to enable the production of official statistics of any or all of the kinds specified [section 4].
- Independence of the Government Statistician in respect of deciding: (a) the procedures and methods employed in the provision of statistics produced by the Statistician; and (b) the extent, form, and timing of publication of those statistics [section 15(1)].
- Furnishing of information required [section 32].
- Security of information provided [section 37].
- Information furnished under the Act to be used only for statistical purposes [section 37(1)].
- Only employees of the department may view individual schedules [section 37(2)].
- No information from an individual schedule is to be separately published or disclosed [section 37(3)], except as authorised by the Statistics Act 1975 (the Act permits others to see information from an individual schedule, but only when it is in a form that prevents identification of the respondent concerned, and then only under strict security conditions).
- All statistical information published is to be arranged in such a manner as to prevent any particulars published from being identifiable by any person as particulars relating to any particular person or undertaking [section 37(4)].

Data obtained by Statistics NZ for integration, and all integrated datasets, are considered to be furnished under the Statistics Act 1975, and therefore subject to the provisions of the Act.

2.3 The Privacy Act 1993


2.3.1 General information

The Privacy Act 1993 aims to promote and protect individual privacy. It relates to personal information (not information about businesses). In section 6, twelve principles relating to the collection, storage, security, access, retention, use and disclosure of personal information are outlined. A Statistics NZ corporate document exists that outlines how the Privacy Act 1993 relates to statistics.10 Several of the principles provide for exemption on the grounds that the information is used for statistical or research purposes and will not be published in a form that could reasonably be expected to identify the individual concerned. Regardless of these exemptions, it is important to consider the ideals expressed by the principles.
10. Statistics New Zealand (1999b). Statistics and the Privacy Act 1993. (Internal report available on request.)

Each situation must be evaluated in the light of the other privacy principles, and likely public perception of the proposed use. There are no guidelines in the Act regarding linkage for statistical purposes. In lieu of this, Statistics NZ has produced a set of data integration principles and guidelines (see section 2.4). The Privacy Act 1993 contains a chapter governing information matching. This does not relate to data integration as carried out by Statistics NZ; it relates to the comparison of two files for the purpose of producing or verifying information that may be used for the purpose of taking adverse action against an identifiable individual. The Statistics NZ Data Integration Policy states that Statistics New Zealand must not provide information to data providers about individual records in integrated data that could assist the data provider in carrying out any administrative purpose [principle 7(d)]. There is a designated channel for communication with the Office of the Privacy Commissioner on privacy issues in data integration projects (see section 2.6.2, below).

2.3.2 Use of unique identifiers

Principle 12 of section 6 of the Privacy Act 199311 regarding unique identifiers has particular relevance for data integration projects and has been discussed at length with the Office of the Privacy Commissioner. Principle 12 states (under the heading Unique identifiers):
(1) An agency shall not assign a unique identifier to an individual unless the assignment of that identifier is necessary to enable the agency to carry out any one or more of its functions efficiently.
(2) An agency shall not assign to an individual a unique identifier that, to that agency's knowledge, has been assigned to that individual by another agency, unless those 2 agencies are associated persons within the meaning of section OD 7 of the Income Tax Act 2004.
(3) An agency that assigns unique identifiers to individuals shall take all reasonable steps to ensure that unique identifiers are assigned only to individuals whose identity is clearly established.
(4) An agency shall not require an individual to disclose any unique identifier assigned to that individual unless the disclosure is for one of the purposes in connection with which that unique identifier was assigned or for a purpose that is directly related to one of those purposes.

11. http://www.privacy.org.nz/privacy-act/

A unique identifier is defined in section 2 of the Privacy Act 1993 as follows: Unique identifier means an identifier (a) That is assigned to an individual by an agency for the purposes of the operations of the agency; and (b) That uniquely identifies that individual in relation to that agency but, for the avoidance of doubt, does not include an individual's name used to identify that individual.

Not all of the identifiers used by other departments are unique identifiers in terms of the above. However, in practical terms, this privacy principle impacts on Statistics NZ's ability to use unique identifiers in data integration projects, particularly in retention of the unique identifiers over time. The Statistics NZ Data Integration Policy states that unique identifiers assigned by an external agency must not be retained in an integrated dataset. It goes on to say:
(a) Data can be received that includes unique identifiers assigned by an external agency. These identifiers can be used to verify the integrity of the data, or to clean the data. They can also be used for integration of data, but they must be removed immediately after integration.
(b) When linking needs to occur on an ongoing basis, then the externally assigned identifier must be replaced by a new identifier. This new identifier can be used for integration. It must not be possible to derive the externally assigned identifier from the new identifier [principle 11].
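To illustrate the kind of replacement principle 11 describes, the sketch below substitutes randomly generated internal identifiers for externally assigned ones, so that the external identifier cannot be derived from the new one. It is a minimal illustration under assumed field names (external_uid, internal_uid), not Statistics NZ's actual procedure; in practice the mapping table would be destroyed, or retained only under strict access controls where ongoing linkage has been approved.

# Minimal sketch: replace an external UID with a random, non-derivable internal UID.
import secrets

def assign_internal_ids(records, external_key="external_uid"):
    mapping = {}
    for rec in records:
        ext = rec.pop(external_key)              # drop the external identifier
        if ext not in mapping:
            mapping[ext] = secrets.token_hex(8)  # random, so not derivable from ext
        rec["internal_uid"] = mapping[ext]
    return records, mapping

records = [{"external_uid": "IRD-123", "income": 42000},
           {"external_uid": "IRD-456", "income": 55000}]
anonymised, key_map = assign_internal_ids(records)
print(anonymised)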

2.4 The Statistics NZ Data Integration Policy


The Statistics NZ Data Integration Policy12 states:
This Data Integration Policy states Statistics NZ policy on integrating personal data. Data integration involves linking together information from different sources. This can be used to produce new statistics, enhance the value of existing statistics, and also enable a greater level of research. This can benefit New Zealand by increasing knowledge on the country's people, economy and environment.
Individuals have legitimate privacy expectations. Integration of personal data can be privacy intrusive when it uses information for purposes other than those for which the information was originally provided. New Zealand legislation recognises individual expectations of privacy and the wider benefits of statistical information. The Statistics Act 1975 allows Statistics NZ to require responses to its surveys while also requiring that these responses are kept confidential. The Privacy Act 1993 requires adherence to a set of information privacy principles while also recognising exceptions when information is used for statistical or research purposes.
This policy describes how Statistics NZ ensures that any integration of personal data is justified.

12. http://www.stats.govt.nz/about-us/policies-and-guidelines/data-integration-policy/default.htm


It details the care taken by Statistics NZ when integrating personal data to ensure any impact on privacy is minimised. This policy provides strict conditions, often beyond statutory obligations, on how Statistics NZ undertakes data integration. The Government Statistician's approval is required for all data integration projects. The Government Statistician will only give the go-ahead to a data integration project if satisfied that the principles set out in this policy will be fully observed. The policy covers:
- applicability
- data integration principles
- applying the data integration principles.

There are 12 data integration principles. The following principles govern when integration of personal data for statistical or related research purposes can occur:
1. Statistics NZ must only undertake data integration if integration will produce or improve official statistics.
2. Data integration should be considered when it can reduce costs, increase quality or minimise compliance load.
3. Data integration benefits must clearly outweigh any privacy concerns about the use of data and risks to the integrity of the official statistics system.
4. Data integration must not occur when it will materially threaten the integrity of the source data collections.
5. Data must not be integrated where any undertaking has been given to respondents that would preclude this.
6. Data integration must be approved at an appropriate level by all the agencies involved.

The following principles govern how integration of personal data for statistical or related research purposes will be done:
7. Integrated data must only be used for approved statistical or related research purposes.
8. The size and data variables of the linked dataset must be no larger than necessary to support the approved purposes.
9. Integrated data will be stored apart from other data.
10. Names and addresses can only be kept in an integrated dataset while necessary for linking.
11. Unique identifiers assigned by an external agency must not be retained in an integrated dataset.
12. Data integration must be conducted openly.

Statistics NZ's Data Integration Policy Guidelines13 also describe practical steps that can be taken to comply with the policy.

13. Statistics New Zealand (2005a). Data Integration Policy Guidelines. (Internal report available on request.)


These guidelines explain what must be done at various stages of a data integration project: before data integration starts, during data integration and when bringing a data integration project to a close. The Statistics NZ Data Integration Policy is essential reading for anyone embarking on a data integration project.

2.5 Codes of practice


The Privacy Commissioner14 may issue codes of practice which modify the Information Privacy Principles set out in the Privacy Act to take into account the special characteristics of specific industries, agencies or types of personal information. The provisions in a code may be more stringent or less stringent than the principles. Although there is no specific code of practice for data integration, a number of other codes of practice have been issued by the Privacy Commissioner. Where the code applies it substitutes for the principles [in the Privacy Act 1993]. For example, an action that would otherwise be a breach of one of the principles is deemed not to breach that principle if done in accordance with the code. Also a failure to comply with the code, where it applies, is for the purposes of the complaints procedures under the Privacy Act, deemed to be a breach of the principles.15 Examples of such codes are:
- The Health Information Privacy Code 1994
- The Post-Compulsory Education Unique Identifier Code 2001.

Staff working on data integration projects should be aware of the existence of codes of practice that relate to their project and, in particular, how these affect the agency supplying the data.

2.6 Data integration business case


It is important that before data is actually acquired and linked, appropriate discussions are held and approvals given. A specific requirement of the Statistics NZ Data Integration Policy is that a business case documenting how the proposed project will comply with the policy be submitted to the Government Statistician for approval. An approved data integration business case sets the boundaries for the data integration project. To produce the data integration business case (Statistics NZ, 2005a), the following tasks need to be undertaken:
(1) Define purpose(s)
It is essential to identify the statistical and related research purposes that the integrated data can be used for. Once this business case is approved, these purposes cannot be changed. The data integration business case must explain how these purposes will produce or improve official statistics.

14. Office of the Privacy Commissioner: http://www.privacy.org.nz/library/five-strategies-for-addressing-publicregister-privacy-problems
15. Health Information Privacy Code 1994: http://www.privacy.org.nz/filestore/docfiles/38197000.pdf


Improving official statistics could mean improving: accuracy, reliability, timeliness, consistency, coverage, concepts, definitions or methodologies.
(2) Stakeholder consultation
As part of producing a data integration business case, it is necessary to consult with stakeholders in the project. This is likely to include consultation with at least the following groups: data suppliers, respondent representatives, the Privacy Commissioner, the Chief Archivist.
(3) Prepare a privacy impact assessment
A privacy impact assessment (PIA) needs to be undertaken and included in the data integration business case.
(4) Prepare the data integration business case
The data integration business case needs to document how the proposed data integration project will comply with the Statistics NZ Data Integration Policy. A list of requirements for a data integration business case is available in Attachment A to that policy.
(5) Obtain approval
First, approval for a data integration business case must be obtained from the chief executives of any agencies that would supply data for integration (other than Statistics NZ). Following this, the business case can be submitted to the Government Statistician for approval. Approval must be received from the Government Statistician before any data integration occurs.
Statistics NZ can undertake a pilot study to determine whether using data integration to produce or improve official statistics is feasible in a particular case prior to full project approval [principle 1(b)]. A pilot study is likely to be needed for any significantly new or large-scale data integration work. Any pilot study still needs to be conducted in accordance with the Data Integration Policy and consequently needs its own business case. Once approved, the Minister of Statistics needs to be formally notified that the data integration project has been approved and the project must also be included in the list of data integration projects maintained on Statistics NZ's website.

2.6.1 Privacy impact assessment

A key component of a data integration business case is the PIA. PIAs are not unique to Statistics NZ; they are used in many situations where risks to privacy arise, for example, from a new technology, the convergence of existing technologies, the use of a known privacy intrusive technology in new circumstances, a major endeavour, or a change in practice. Privacy impact assessments for Statistics NZ's data integration projects include a description of procedures for collection, use, disclosure and retention of personal information. They analyse the risks to privacy, and state how these risks will be managed, avoided or reduced. Statistics NZ has compiled a summary of privacy issues that must be considered in an integration project.16 The following is an extract:
16. Statistics New Zealand (2002c). Pro forma Privacy Impact Assessment Report Data Integration Projects (draft). (Internal report available on request.)


The risks associated with failing to address the privacy implications of a given proposal can take many forms, and may include:
- failing to comply with either the letter or the spirit of the Privacy Act, or fair information practices generally, resulting in criticism from the public or Privacy Commissioner or complaints under the Act
- stimulating public outcry as a result of a perceived loss of privacy or a failure to meet expectations regarding the protection of personal information
- loss of credibility or public confidence when the public feels that a proposed project has not adequately considered or addressed privacy concerns
- underestimating privacy requirements with the result that systems need to be redesigned or retro-fitted at considerable expense.

An important consideration is the expectations of the general public, customers, clients or employees. Proposals may be subject to public criticism even where the requirements of the various Acts have been met. If people perceive their privacy is seriously at risk, they are unlikely to be satisfied by justification that the project has not technically breached the law. Risks to actual and perceived privacy can arise in many circumstances. Collecting excessive information, using intrusive means of collection, or obtaining sensitive details in unexpected circumstances all represent risks to the individual, and might already be present to some extent with the various source datasets. Unexpected or unwelcome use or disclosure of that information, as in a [data integration] project, could put perceived privacy at risk. One task of the PIA is to sort out which risks are serious and which are trivial. The privacy impact report should identify the avoidable risks and suggest cost-effective measures to reduce them to an appropriate level.

PIAs are usually compiled by the project manager, with contributions from team members and policy staff. However, it is important for project staff to be aware of privacy issues relating to their project and how they are to be managed. PIAs for established projects are a useful way of getting to grips with privacy issues, for example the Linked Employer-Employee Data PIA.17 While not a formal part of a PIA, any issues regarding appropriate protection for business information should also be considered.

2.6.2 Consultation with other agencies

(i) Office of the Privacy Commissioner
It is sometimes necessary to consult with the Office of the Privacy Commissioner (OPC) regarding proposed data integration projects. A role of the Privacy Commissioner is to: ... investigate on receipt of a complaint or on his own initiative and form an opinion about whether there had been an interference with the privacy of an individual.

17. http://www.stats.govt.nz/NR/rdonlyres/F5025B36-85D8-4464-B683CE070FFC4807/0/LEEDPrivacyImpactAssessment.pdf


He does not give a decision. Nor does his opinion bind anyone. Only the Complaints Review Tribunal can give a decision which could be accurately described as a ruling.
Therefore, Statistics NZ cannot and does not seek approval from the Privacy Commissioner to proceed with a data integration project, but does seek advice from the Privacy Commissioner as to whether the proposed approach raises any concerns, and works closely with him or her to determine the most appropriate solution to any such concerns. For a Statistics NZ project manager, faced with making a choice between continuing with an approach that precedent and best current understanding can accept, or arguing for a change in what the OPC is comfortable with to get a better outcome, it is much easier and more certain to take the first option. The OPC is not resourced for quick responses to such approaches, because it is not their core business. It is up to Statistics NZ to be compliant with the Privacy Act 1993 and, if necessary, to seek professional advice on issues.
(ii) Other agencies
Integration projects must recognise the direct interests of stakeholders and take any concerns into account in a decision to integrate. Providers of source datasets, as well as groups that represent the interests of those whose information is being integrated, must be consulted.

2.7 The Statistics NZ Confidentiality Protocol


The Statistics NZ Confidentiality Protocol18 is another document that has an impact on data integration projects. It includes sections on restricting use of information to statistical purposes, protecting confidential information, and rules for avoiding disclosure of confidential information in outputs and microdata. Although this protocol is not specific to integrated data, its provisions do apply to integrated data, and it is therefore important to be aware of its content. There is greater risk of disclosure from integrated datasets and therefore extra care is required to protect the data.

2.8 The Statistics NZ Microdata Access Protocols


Integrated datasets potentially pose more risk in terms of disclosure so any access to microdata needs careful consideration. The Statistics NZ Data Integration Policy notes that the Government Statistician's decision on provision of microdata access will be made after consultation with data providers and will take into account:
- the legislation under which the data was collected
- Statistics NZ's Microdata Access Protocols19
- the data integration business case (Statistics NZ, 2005a) (eg allowed uses of integrated data)
- any agreements entered into with data suppliers.

18. Statistics New Zealand (1999a). Confidentiality Protocol. (Internal report available on request.)
19. http://www.stats.govt.nz/about-us/policies-and-guidelines/general/microdata-access-protocols.htm


In some cases, the data providers might advise Statistics NZ of legal requirements or other conditions that preclude some of the forms of access that would otherwise be possible. Tax data is a particular instance where this has occurred. Statistics NZ's Microdata Access Protocols provide guidance on:
- the type of research that will be considered eligible for use of Statistics NZ microdata
- the methods of access that are available, including on-site Data Laboratory, off-site Data Laboratory, confidentialised unit record files (CURFs) or remote access
- the conditions placed on the researchers and their obligations in using microdata and producing output
- the data that is potentially eligible for use.



3 Operational Aspects of a Statistics NZ Data Integration Project

Summary Individual data integration projects can vary greatly in terms of the data sources and methods used. This chapter outlines some of the operational aspects that are common to most data integration projects carried out by Statistics NZ.

3.1 Early stages of a data integration project


A data integration project begins with the approval processes outlined in Chapter 2. It is recommended that a pilot study be undertaken to assess the feasibility of a data integration project. Once approved, a successful data integration proposal moves into the usual phases of a Statistics NZ project, with the development of project initiation documents, and the bringing together of a project team. It is also important to establish inter-agency relationships and support from interest groups as early as possible in the initiation phase.

3.2 Other relationships


Statistics NZ's data integration projects usually involve data from external agencies, and produce outputs that are of wide interest outside the organisation. In addition to a project team and internal Statistics NZ governance, each data integration project is likely to have a number of critical relationships with external groups. The nature of these relationships differs, depending on how the project is structured. The roles these groups play include providing independent purchase advice to the relevant Minister(s), overseeing development, providing expert review and advice, representing user views and needs, and providing data and advice on data. In the Student Loans, Linked Employer-Employee Data (LEED) and Injury projects these roles have been filled by bodies such as a ministerial advisory panel, an IRD-led steering committee, an inter-departmental working group, a sponsors group, an expert advisory group, an external reference group, users groups and data providers.

A critical factor in the success and efficiency of any data integration project is the quality of the relationship with the data providers and users.


3.3 Obtainment and safe keeping of external data


3.3.1 Data extract

The data integration business case submitted for Statistics NZs approval needs to detail the data it proposes integrating. This should include a list of the variables from each data source. If it is not possible to determine these details, then a business case for a pilot study should be produced instead. A main objective of the pilot study will be to determine the variables that are needed for integration and whether the integrated dataset is suitable for achieving the statistical objectives of the project. The specifications for the data extract should use the table and variable names of the source. Specific inclusions and exclusions must be clear and unambiguous. The data received should be examined to ensure it complies with expectations (see section 4.3, below).

3.3.2 Data transfer, storage, security and internal access controls


In keeping with Statistics NZ's core value of Security, it is important that data integration projects provide adequate protection to the data involved. The following checklist, developed by Paul Maxwell based on international practice and now used in Statistics NZ privacy impact assessments, contains a range of issues that need to be addressed within data integration projects. The project team should actively contribute to the maintenance of high standards of security.

Security checklist:
- Have security procedures for the collection, transmission, storage and disposal of personal information, and access to them, been developed and documented?
- Are privacy controls in place for the project?
- Have technological tools and system design techniques been considered that may enhance both privacy and security?
- Has there been an expert review of all the security risks and the reasonableness of countermeasures to secure the system against unauthorised or improper collection, access, modification, use, disclosure and disposal?
- Have staff been trained in requirements for protecting personal information, and are they aware of policies regarding breaches of security or confidentiality?
- Are there authorisation controls defining which staff may add, change or delete information from records?
- Is the system designed so that access and changes to data can be audited by date and user identification?
- Does the system footprint (log) inspection of records and provide an audit trail?
- Are user accounts, access rights and security authorisations controlled and recorded by an accountable systems or records management process?
- Are access rights only provided to users who actually require access for the stated purposes of collection or consistent purposes?
- Is user access to personal information limited to that required to discharge the assigned functions?
- Are the security measures commensurate with the sensitivity of the information recorded?
- Are there contingency plans and mechanisms in place to identify security breaches or disclosures of personal information in error?
- Are there mechanisms in place to notify security breaches to relevant parties to enable them to mitigate collateral risks?
- Are there adequate ongoing resources budgeted for security upgrades, with performance indicators in systems maintenance plans?
- What steps are to be taken to make the public aware of the project? Are individuals covered in the source datasets aware of that use?

Statistics NZ's Data Integration Policy provides guidelines to ensure protection of the linked data:
- Information about individual records cannot be sent to data providers.
- Names and addresses can only be retained in an integrated dataset for a limited period, and only if this was approved in the data integration business case.
- Unique identifiers assigned by an external agency must be removed immediately after integration, and an externally assigned unique identifier cannot be used for longitudinal linking.
- All data integration projects must have exclusive use of their own physical server(s) for processing and exclusive use of their own physical disk(s) for storage, and must be accessible only to the smallest practical number of Statistics NZ employees.

3.4 Documentation and quality assurance


3.4.1 Documenting record linkage methodology

Any matching exercise should be accompanied by full documentation of the method used. This can be thought of as a technical description of the matching methodology. It has two main uses:
- to allow a peer review of the methodology
- to provide a record of what has been done for the future.

It is vital that full details of the matching method and results are written down and available for the future. They provide the formal documentation of what has been done, both for future matching with the same data sources, and as examples for other matching projects. Statistics NZ (2002a)20 has outlined the key components of a technical description for a record linkage project.

3.4.2 Reviewing record linkage methodology

A peer review is needed in order to provide confirmation that a sound job has been done. The peer review should not be a repeat of the matching, but rather a review of the process. The review should ideally be done before the linked data is handed over to clients, so that any improvements in methodology suggested by the reviewer can be carried out. This might mean a two-stage process, where the first results are essentially a trial: the match method and results are reported and reviewed, any modifications carried out, and the final linked file handed over to clients. This may not be possible in practice, and in that case improvements can be noted for the future. The documentation and peer review of the matching methodology should be included as tasks in planning the project, and enough time and resources allowed for them.
20 Statistics New Zealand (2002a). Guidelines for Writing a Technical Description of a Record Linkage Project. (Internal Statistics NZ document.)


Statistics NZ (2003a)21 has outlined the information that a reviewer should use when reviewing a technical description of a record linkage method.

3.4.3 Supporting documentation

All output data should be supported by adequate metadata that enables a user to come to an adequate understanding of the data. This may include, for example, information about:
- the context of the source data
- processing of the data at Statistics NZ
- quality indicators, including match rates
- the format of the output data, including detailed information about individual variables (a data dictionary)
- advice on how the technical description of the matching can be obtained.

3.5 IT considerations
The size of an integrated dataset, its complexity, and the differing needs of official statistics and researchers place considerable demands on the IT solution for a data integration project. Use of administrative data in data integration can result in a much larger data file than for data collections based on sample surveys. Some Statistics NZ data integration projects, for example LEED, have datasets that are orders of magnitude larger than what is usually processed for sample surveys. This presents challenges for data storage capacity and for the efficiency of updates, general processing and retrieval of information.

Integrated data can also be conceptually complex in structure. There will be a link to maintain that may be longitudinal and/or cross-sectional. There may be complex relationships between the original unit record structure and the statistical units used for analysis. For example, the student loans data includes units for loans, enrolments, individual students and tertiary institutions.

There are often two types of uses for integrated data, with possibly conflicting requirements. Official statistics outputs are aggregated and summarised data, usually tabular and standardised. In contrast, researchers often require access to microdata; the range of variables and the scope of the investigation are likely to be different for each research project, and unpredictable. It is important for the IT development to have, at an early stage, a general indication of how the integrated data will be reported, the types of statistical outputs and the breakdowns of these. Each of these features must be carefully considered when designing IT storage and processing systems.

21 Statistics New Zealand (2003a). Guidelines for Peer Review of the Technical Description of a Record Linkage Project. (Internal Statistics NZ document.)


4 Preparing for Record Linkage

Summary
It is often remarked that the actual process of doing the record linkage is only a small fraction of the overall data integration project. This chapter describes the tasks that should be carried out before embarking on the linkage itself.

4.1 Introduction
It has been estimated by Gill (2001) that in the implementation of record linkage:
- 75 percent of the effort is in preparing the input files
- 5 percent of the effort is in carrying out the linkage itself
- 20 percent of the effort is in checking the results of the linkage.

This is consistent with the balance of work experienced in current Statistics NZ data integration projects. The importance of adequate preparation for linkage needs to be emphasised. This includes investigating, obtaining, assessing and transforming the input data. It is also important to emphasise that, in the implementation stage, the linkage itself is not the only work to be done: everything collected and implemented for record linkage should be well documented for ongoing maintenance and for future data users. Although the emphasis of this manual is on how to create the first record linkage, it should be noted that in the future the majority of the work will be maintenance and monitoring of already established record linkage projects.

4.2 Gathering information about source data


Development of a thorough understanding of the source data is fundamental to obtaining meaningful results from analysis of the integrated data. A common experience is that understanding new data sources is likely to be time-consuming and resource-intensive. The investigation can be carried out in several phases. An initial investigation may focus on what is needed to determine the relevance to a particular research programme (mainly the population and the variable concepts and values) and be restricted to just the variables likely to be of use. This can often be done from available documentation and contact with the agency, without access to actual data files. Information from this early stage may be used to determine whether a data source is likely to be of sufficient quality to meet the objectives of the data integration, and can act as a stop/go point in a project. More detail will be required to specify the data to be transferred, to prepare files for the linkage, and for analysis of the integrated data.

4.2.1 Preliminary investigation of source data

Once a data integration project has been initiated, it is usually clear where the data to be linked will come from. This data may be internal to Statistics NZ (such as the Business Frame, the population census or survey datasets) or it may be from an external agency.


Existing departmental knowledge of the input data sources can range from no knowledge at all to extensive knowledge. The following discussion assumes that data from an external agency is involved, but many of the principles apply equally to internal datasets. It also assumes that this investigation takes place in the context of a well-managed relationship, with some kind of service agreement, such as a Memorandum of Understanding (see section 4.3.1), either being developed or already in place. Before requesting the datasets from the source agency or agencies, preliminary investigation should be carried out to clarify dataset specifications for the integration. This investigation is done by Statistics NZ and involves some (or all) of the following tasks.

Collation of existing departmental knowledge
If the data source has been used previously within Statistics NZ, other staff may be able to provide a valuable starting point to the investigation in the form of a briefing on the data, provision of written documentation, or an overview of how the data was used in the past. Over time, it is likely that more administrative data will be used by Statistics NZ and it is important to share knowledge that is gained, both for internal efficiency and to convey professionalism to data providers.

Review of hard- and soft-copy information
Organisations' websites generally provide an excellent overview of the context of the data source, the motivation for collecting the data, the data collection environment, and how the data is used. In some cases, a large amount of more detailed information is also available, from fact sheets and frequently asked questions (FAQs), to downloadable data collection forms and data dictionaries.

Meeting with data providers
Meeting with the data providers facilitates the effective transfer of knowledge from the people who work with the data on a daily basis. These meetings may involve general briefings, and will provide the opportunity to ask questions, allow viewing of the electronic data storage/query system, and permit the passing on of further documentation such as data dictionaries and data models, where these have not been previously available.

4.2.2 Target population and units

4.2.2.1 Identification of population

Understanding how the data sources relate to one another and defining the target population is an important part of the preparation for linking. The nature of an individual dataset itself is also important when thinking about population coverage. Administrative data records have often been compiled into a dataset from multiple locations. For example, the National Minimum Dataset (NMDS) is a collection of discharge information from all public and private hospitals. Each of the source data files has its own target population and actual population. One data source may have quite a different target population to the other(s), and deciding where the populations overlap is an important step in the preparation for record linkage.


Target administrative population
The target population is the theoretical population that the administrative data covers. For example, there may be a legal requirement which defines the target population. In other cases, there may be a transaction undertaken, or a voluntary application, in which case the target population may simply be the collection of reporting units on the dataset.

Actual administrative population
For administrative data, the actual administrative population is the population that the administrative data includes in practice, that is, the population actually on the administrative dataset. It is only about this population that results can be reported, although in some cases it is possible to have some estimate of the undercount and bias.22 The actual population may be more than, less than, or the same as the administrative target population. The difference between the target and actual populations is known as coverage.

The purpose of an integration project is to produce a new set of data, for which a target population must also be defined. This is the ideal population to make inferences about by using the integrated dataset. Depending on the actual populations that are available in the source datasets, it may not be possible to produce a combined dataset that exactly covers the target population. Therefore some compromises would have to be made and a dataset utilised that does not exactly correspond to the actual integrated population. Some data providers have unpublished research verifying the coverage of their target population. This information could be useful in estimating the link rate.

4.2.2.2 Identification of units

Another critical step in preparing for integration is identifying the reporting units on the source data files, assessing whether they are consistent across the files, and carrying out any necessary transformations. Reporting units in the source data may differ from the units of interest in the target population of the integrated dataset. Also, there may be multiple units of interest in the integrated dataset, as well as differences between reporting units in the different data sources. For example, the Linked Employer-Employee Data (LEED)23 integrated dataset will be used to produce statistics for businesses, jobs and workers. Developing the methodology to transform reporting units into the units of interest can be a time-consuming and complex process. For example, in the Injury Statistics Project,24 the unit of integration and analysis is the injury. However, the reporting unit on the New Zealand Health Information Service input file is at a lower level: the health event. There can be several health events (discharges) for a particular injury.

22 Statistics New Zealand (2005b). Proposed Methodology for Estimating Undercounting of Vehicle-related Injuries in NZ. (Internal report available on request.)
23 http://www.stats.govt.nz/leed/default.htm
24 http://www.stats.govt.nz/additional-information/injury-statistics/default.htm


4.2.3 Understanding the source data metadata

The Statistics NZ metadata template for administrative data is a helpful tool for coming to a full understanding of a data source. The metadata template is based around a quality framework for statistical outputs, where quality is considered as being made up of six dimensions: relevance, accuracy, timeliness, accessibility, interpretability and coherence. Brackstone25 discusses what these terms mean and how they are interrelated. Statistical outputs produced from administrative data, including integrated data, also need to meet standards of data quality. The six dimensions of quality can be used to assess and report on the quality of the source administrative data, as well as for the final integrated data. The metadata template provides headings as prompts to ensure that adequate information is collected in each of these areas. Further explanation of these headings can be obtained from Statistics NZ.26 Use of the template provides: a focus and direction for those needing to understand an administrative data source a structured format for the documentation of the administrative data the information needed to assess the statistical integrity of an administrative data source the information needed to determine the usefulness of the administrative data source in any given context.

The Statistics NZ metadata template takes the following structure and contains the following information:

Data source: Name of agency; Application of data; High-level summary of time period, variables, file structure.

Population: Target and actual population; Reporting units; Coverage.

Variables: Name; Definition; Values.

Glossary: Definitions of terms used.

File structure: Conceptual (what is the simplest, most logical way to think about the files involved, which may be quite different from the way the data is actually stored); IT (detailed IT structure should be covered separately; an indication here is sufficient, eg flat file, relational database, SAS/SQL server formats).

Data collection and data entry: Application; Collection method; Frequency and timing of collection; Consistent approach (over different centres); Question adequacy and respondent understanding; Contextual or methodological biases; Data capture, coding and editing; Updating procedures.

Data accuracy: Missing data; Imputation; Duplication; Rounding; Internal consistency.

Changes over time in: Concepts; Coverage; Data collection and data entry; Data accuracy.

Accessibility: Privacy/security/confidentiality; Documentation available; Ease of access; Forms of dissemination; Timeliness; Storage of historical information.

Comparison with other data sources: Consistency at aggregate level; Comparison of variable concepts and values with Statistics NZ standards.

Summary of statistical integrity: Is the population well defined? Do reporting units map to statistical units? Are variables well defined? Do data collection and entry systems have good quality controls? Is data accuracy reasonable (bias, reliability, internal consistency)? Is there a consistent time series?

25 Brackstone (1999). Managing Data Quality in a Statistical Agency, Survey Methodology, Dec 1999, Vol 25, No 2, 139-149.
26 Statistics New Zealand (2002b). Meta Information Template for Description and Assessment of Administrative Data Sources. (Internal report available on request.)

In addition to using the available documentation and contact with data providers, field visits to data collection agencies or data entry points can also be very helpful in understanding the collection system.

4.2.4 Implications from the metadata

Once the source data is understood, the next task is to make an assessment of the implications for the integration project, and determine the usefulness of its fields. This assessment should include:
- Comparisons with similar fields in other datasets (format, ranges, classifications etc): are the formats, ranges and classifications compatible? If not, how much work is required to make them compatible? For example, sex is coded as 1 for male and 2 for female in one dataset and the other way around in another. Comparing the metadata may highlight the need to standardise these codes before attempting linking.
- Understanding of how the field came about: for example, finding many records with sex recorded as male but with undoubtedly female first names can be due to a system that has male as the default sex.
- What fields could be useful for linking the files? Fields common to both files are candidates for linking, and their quality needs to be assessed. See section 4.4.1 and Chapter 6 (sections 6.3.1 and 6.3.2), below, for more information.
- Are there any implications for the main outputs expected to be produced from the integrated dataset? That is, are there any changes in collection or maintenance practices in the source data that have implications for the quality of the integrated dataset? For example, if one of the outputs is a breakdown by region, it is important to know whether the address is collected once and then never checked again, because if the integrated dataset is a time series the regional breakdown may not be particularly accurate.

4.3 Procedure for obtaining data


4.3.1 Request for supply of data

Following approval of a pilot and/or an integration project, Statistics NZ will formally request a dataset from the source agency or agencies. All data in datasets that are obtained by Statistics NZ for integration will be considered to have been collected under the Statistics Act 1975 and all relevant provisions of that Act will apply to the data. Prior to requesting the data, an agreement document such as a Service Level Agreement (SLA) or a Memorandum of Understanding (MOU) should ideally be in place. An SLA is a formal agreement between two or more parties that seeks to achieve a mutually agreed level of services through the efforts of all the parties involved. An MOU is a formal voluntary agreement between two or more parties that seeks to achieve mutually agreed outcomes through the efforts of the parties. The only difference between an SLA and an MOU is that an SLA is a contract, while an MOU is a voluntary agreement. In Statistics NZ data integration projects an MOU is commonly used, and would stipulate:
- a specification of which variables to request and the formats of the data
- security and confidentiality measures
- the frequency of data supply.

A request to obtain a dataset can be prepared once the source agency and Statistics NZ have agreed on the appropriate dataset specifications to allow Statistics NZ to proceed with integration in the most cost-effective way. Such specifications are to include the variables requested, the corresponding format, the periodicity and timing of delivery, the transport mechanism, missing data handling, information about data quality, and responsibility for cleaning the data. The source agency:
- is responsible for ensuring that all required variables are identified, specified and supplied as agreed
- must provide relevant documentation about the dataset, as well as access to data experts who can assist with queries as to structure, format, etc
- must also provide information on who currently, or potentially, has access to the dataset, to assist Statistics NZ in establishing confidentiality requirements
- must advise Statistics NZ of any changes in its collection mode or classifications.


All these requests to the source agencies should be clearly documented in the MOU, and information collected about administrative data should be well documented in the metadata. It has proved helpful to obtain the data model of the provider's database so that data can be requested in terms used by the source agency. Once sufficient understanding of the data has been gained, a request for a data extract can be prepared. This will include the following items:
- content of the file: population, time period, fields required
- how the file will be formatted: file type, field separators, provision of separate look-up tables
- a checklist for the extract before it is sent (eg no special characters, valid data values, range checks etc)
- how the data will be delivered: media, transport, encryption (see section 4.3.2).

Good communication with the data providers, and unambiguous specification of data requirements, reduces the likelihood of the data extract failing to meet the needs of the project. Sufficient time should be allowed for the providing agency to extract and supply the data; this can range from days to months, and should be discussed with the provider before submitting the request.

Along with the choice of the fields to request and the corresponding formats of the data, another very important decision in requesting data from providers is the time period over which the request is made. Data integration projects may integrate data over a monthly, quarterly or annual cycle. This requires precise definition of which records should be supplied each period, without overlap or gap. Again, using the field names on the data model helps avoid misunderstandings. For instance, 'all records collected in December quarter 2002' and 'all records with a creation_date field within 1 October to 31 December 2002' could result in different datasets: in the first instance records that were merely updated in that period could be included, while in the second they would not be. The agency receiving the data can detect records that should not have been received; however, there is usually no way for the receiving agency to check which records have not been received. Deciding on the time period can be more complex, because it can involve decisions around frequency of supply, what to do about late returns or claims filed much later, and so on.

The data request and supply process may need to be iterative, with modifications or corrections made to the data supplied as needed. It is recommended that the specification be tested first by transfer of a small version of the full dataset.

4.3.2 Data transfer

Statistics NZ corporate standards and policies on data transfer are being developed, but as yet there is no standard approach in place. Various transmission modes have been employed in data integration projects: by email, by courier, or carried by hand. The medium of storage has also not been consistent across projects and has variously involved the use of tape, compact disc (CD) and digital video disc (DVD). Furthermore, the method of ensuring data security for these media has also been variable, ranging from none, to a password-protected compressed (ZIP) file, to Pretty Good Privacy (PGP) encryption.

Email is a fast method of transferring data between the data source agency and Statistics NZ. This method of transfer has two main weaknesses: first, email transfer can be less secure than other methods and, secondly, email transfer is suitable only for small datasets. However, more secure email systems are being introduced that could provide better options for consideration in the future. Considering future demands for more efficient and secure ways of transferring data, the method of carriage by hand should be reviewed. In general, the recommended data transfer option is by courier. Although storing data on CDs is still a common practice, transferring large datasets by DVD would be the preferable mode in the future. Because the different storage media do not have a built-in security mechanism, it is expected that all media content will be encrypted. Past records show that few measures have been taken to secure the media content. PGP encryption is the recommended method of encryption for data identified by Statistics NZ; however, where providers do not have the resources for PGP encryption, password protection should be used.

Example of data collection process: Injury Statistics
This example outlines the administrative data collection process of the Injury Statistics programme. This process was developed as a temporary measure by the injury team to serve their current requirements. Process diagram:

[Process diagram: shows the flow of media and information between the Data Provider, Commercial Courier, Data Custodian, Help Desk, Server/Network, Safe Storage, Disposal and Developer, with numbered arrows corresponding to the steps described below. Legend: Media/Instruction Flow; Information/Communication Flow; Working Together.]


Step 1: Media creation
The data provider will extract information from their administrative systems, which is then copied to CD/DVD media in PGP-encrypted format. The encryption key/password will be communicated to the data custodian by telephone. Statistics NZ highly recommends encryption; however, if the provider does not wish to encrypt the data, it will be received in flat-file text format. The file naming convention should be specified in the SLA or MOU.

Step 2: Handover to courier
The media is handed over to the courier. The package is to be addressed securely, as specified in the MOU.

Step 3: Delivery
The Injury Statistics data custodian will receive the data media and the Data Collection Log will be updated. The Data Collection Log is a log of all events related to data collection and will be maintained by the data custodian in hard copy or electronic format to keep a track record of the collection.

Step 4: Transfer at Helpdesk
The data custodian or a representative will place a Helpdesk request to copy the file and personally carry the media to the Helpdesk for the transfer of data to the file server (the Statistics NZ network). On completion of the transfer, the media is to be brought back by the same person. Although the transfer is expected to take only a few minutes, the data custodian is expected to make an appointment with the Helpdesk beforehand. If the data transfer is unsuccessful (eg media corruption, incorrect format), the data custodian will be informed and the medium stored in an appropriate place.

Step 5: Collect from Helpdesk
This is the process of collecting the media on completion of the transfer from the medium to the server. On completion, the Data Collection Log is to be updated.

Step 6: Place in safe storage
If the data transfer and the database loading are completed successfully, the same medium is to be stored safely for the period specified in the MOU. The Data Collection Log will be updated and the storage location of the medium recorded.

Step 7: Disposal
On completion of the storage period specified in the MOU, the media is to be disposed of by shredding.

Step 8: Feedback to provider
The data custodian is expected to inform the provider of all successful data transfers. In the case of an unsuccessful data transfer, the provider has to be informed about why the transfer failed, and a request for a new set of data must be made.

Step 9: Developer expertise
If data from the server is not loaded to the database, or tests on loaded data show possible errors, the data custodian may consult Application Services for expert advice.

Step 10: Transfer from media
This is the actual process of transferring the data from the CD/DVD medium to the server, carried out by Helpdesk personnel.


4.3.3 Data verification

On receipt of a data extract, a number of checks can be performed to verify that:
- the number of records extracted is equal to the number received
- there are no duplicate unique identifiers
- numeric fields contain numbers, and text fields are predominantly text
- all the variables requested are present, and no extra variables have been provided by mistake
- the range of values in each field is appropriate, and there are no unusual or surprising values
- the distribution of values in each field is as expected
- there is consistency with other fields in the data
- the relationship between files is as expected (only relevant if more than one file has been supplied).
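Checks of this kind lend themselves to simple scripting. The following is a minimal sketch in Python using the pandas library; the file name, column names and expected record count are hypothetical placeholders rather than part of any actual Statistics NZ system:

import pandas as pd

# Hypothetical extract and specification; substitute the names agreed in the MOU.
extract = pd.read_csv("extract_2002q4.csv", dtype=str)
expected_columns = ["uid", "surname", "first_name", "date_of_birth", "sex"]
records_reported_by_provider = 125000  # figure quoted by the provider (assumed)

# Record count agrees with the number the provider says was sent
print("record count ok:", len(extract) == records_reported_by_provider)

# No duplicate unique identifiers
print("duplicate uids:", extract["uid"].duplicated().sum())

# All requested variables present, and no unexpected extras
print("missing variables:", set(expected_columns) - set(extract.columns))
print("extra variables:", set(extract.columns) - set(expected_columns))

# Ranges and distributions, eg sex should contain only the agreed codes
print(extract["sex"].value_counts(dropna=False))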

4.3.4 Feedback to provider

If the data validation succeeds, the data custodian is expected to inform the provider that the data transfer and validation were successful. If the data transfer or data validation fails, the provider has to be informed about why the transfer or the validation failed, and a new set of data will be requested. For privacy reasons (Privacy Act 1993, see section 2.3, above), when contacting the provider about a failure in transfer or validation, the data custodian should never disclose personal information, such as 'the record with IRD number XXXX has the payment field missing'.

4.4 Preparing data for record linkage


A number of issues need to be addressed when linking data. Often, data is recorded or captured in various formats and classifications, and data items may be missing or contain errors. A pre-processing phase that aims to edit and standardise the data is therefore an essential first step in every linkage process. Datasets may also contain duplicate entries, in which case linkage may need to be applied within a dataset to deduplicate it before linkage with other files is attempted.

4.4.1 Typical errors in linking variables

Errors in linking variables may occur during the capture and processing of these variables. Sources of error in the linking variables include: variation in spellings, data coding and preparation, use of nicknames, anglicisation of foreign names, use of initials, truncation or abbreviation of names and addresses, use of compound names, missing words and extra words (Gill, 2001). The errors occurring in the commonly used linking variables are illustrated below.

Unique numeric identifiers
Unique numeric identifiers, when available, can be excellent linking variables. However, very strict control over the issue of new identifiers and their recording in the data file is necessary to produce high-quality linkage with the numeric identifier alone. Typical errors include: missing identifiers (particularly important where links are longitudinal); transcription errors in recording, such as transposed digits; the same identifier being used for more than one unit; the same unit having more than one identifier assigned to it (duplicates); and the units referring to different identities in different files. Numeric identifiers that include a check digit are much less likely to be incorrectly recorded.

Surname
Name changes due to marriage or divorce are perhaps the main difficulty. For some ethnic groups, there can be many surnames and the order of their use varies. Concatenation of the birth surname and the marriage or partnership name into a compound (or hyphenated) name is common, so both parts are required for linking purposes. Spelling variation is quite common in surnames due to the effects of transcription of the names through various systems. In some cultures there is no exact equivalent of a surname (Gill, 2001).

First names
There are wide variations in the spelling of first names due to recording and transcription errors. Widespread problems include the use of nicknames and contractions. Some are readily identifiable (Jim for James, Will for William, Liz for Elizabeth), but others are not (Ginger for Paul, Blondie for Jane). Some records may just record the fact that the person is a baby, or a twin, and until such time as the birth is registered, the record may contain BABY OF ... or TWIN OF ...

Address
This is an excellent variable for confirming otherwise questionable links. Disagreements are hard to interpret, however, because of address changes, address variation and differences between mailing addresses and physical addresses (Gill, 2001).

Sex
Sex is generally well reported and, except for transcription and recording errors, it is a very reliable variable. The main difficulty is that sex may not always be available in some administrative records. For example, some databases do not collect this variable and it can only be generated from the recorded first name, which cannot be done with complete accuracy (Gill, 2001). Some datasets collect titles such as MISS or MR, which could also be used for sex imputation.

Date of birth
Date of birth is in general well reported. Problems may occur when the date of birth is filled in for others (ie by proxy), for example for children and the elderly, when an approximation may be provided. Typical transcription errors arise when day and month have been transposed, and when the two digits for year are transposed. For example, a correct birth date of 11/03/75 may be recorded as 03/11/75 or 11/03/57. The current date can be filled in mistakenly in the date of birth field, or the current year in the birth year field.

Other problems encountered in the use of linking variables are:

Swapping of first name with surname
Occasionally the surnames and first names are swapped around.

Embedded titles in the name
Surname and first name fields may contain titles such as MR, MRS, DR, JR etc. Before the names can be used for linking they should be parsed and the various components identified and separated (Gill, 2001).

Sections 6.3.1 and 6.3.2, below, give more detail on choosing linking variables.


4.4.2 Standardisation: editing, parsing, formatting, concordance

The success of a data integration exercise depends on having standardised data fields. Because of potential quality problems, some variables may not be suitable for use in linking. Rigorous editing, parsing and formatting of the linking variables, and the creation of concordances, is undertaken to minimise errors. Briefly, these terms can be described thus:
- Editing is the process of detecting and dealing with erroneous or suspicious data.
- Parsing a field separates the entities within that field to make comparison easier. For example, a name field containing first name and surname would be separated.
- Formatting is necessary when fields are recorded in different formats, such as date of birth 01Jan2002 in one file and 010102 in another.
- Creation of consistent coding across files (a concordance) is very important for variables that require classification, such as sex coded as 1 and 2 in one file and as M and F in another.

4.4.2.1 Editing

While probabilistic matching takes data errors into account, some basic data cleaning may be needed before the matching to remove definite errors. Edit checks should be used to identify invalid responses such as character strings in a numeric variable, or non-alpha characters such as # or ^ in a character text response. Other edits may check for out-of-range or impossible responses such as birth dates in the future. Often the best approach is to treat these invalid responses as missing values.
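As an illustration of such edit checks, the Python sketch below validates a date of birth field and sets invalid values to missing, as suggested above. The DD-MM-YY format and the two-digit year rule are assumptions made purely for the example:

import re
from datetime import date

def edit_date_of_birth(value):
    """Return an ISO date string, or None (missing) if the value fails basic edit checks."""
    if value is None or not re.fullmatch(r"\d{2}-\d{2}-\d{2}", value.strip()):
        return None                          # wrong shape or invalid characters
    day, month, year = (int(p) for p in value.strip().split("-"))
    year += 1900 if year > 20 else 2000      # crude two-digit year rule (assumption)
    try:
        dob = date(year, month, day)
    except ValueError:
        return None                          # impossible date, eg 31-02-75
    return None if dob > date.today() else dob.isoformat()  # future dates are impossible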

4.4.2.2 Parsing and standardisation of linking variables

The process of parsing and standardisation of linking variables involves identifying the constituent parts of the linking variables and representing them in a common standard way through the use of look-up tables, lexicons and phonetic coding systems (Gill, 2001). The standardised individual elements are then rearranged in a common order. An example of parsing a name that has a title, first names and a surname:

Input: Mr John Peter Smith
title: Mr
first name 1: John
first name 2: Peter
surname: Smith
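A very simplified parsing routine along these lines might look as follows; the list of recognised titles is illustrative only, and production parsers also handle compound surnames, suffixes and other cases:

TITLES = {"MR", "MRS", "MS", "MISS", "DR"}   # illustrative list only

def parse_name(raw_name):
    """Split a free-text name into title, first names and surname (simplified sketch)."""
    parts = raw_name.upper().split()
    title = ""
    if parts and parts[0].rstrip(".") in TITLES:
        title = parts.pop(0).rstrip(".")
    surname = parts.pop() if parts else ""
    return {"title": title, "first names": parts, "surname": surname}

print(parse_name("Mr John Peter Smith"))
# {'title': 'MR', 'first names': ['JOHN', 'PETER'], 'surname': 'SMITH'}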

The parsing and standardisation of commonly used linking variables are detailed below.

Standardisation of surnames and first names
The basic uses of standardisation are: first, to replace the many spelling and abbreviation variations of commonly occurring names and addresses with standard spellings and fixed abbreviations and, secondly, to use key words generated during the standardisation process as hints for the development of parsing subroutines. The purpose of name standardisation in data integration is to allow the data integration software to work more efficiently by presenting names in a consistent fashion and by separating out parts of the name that would be of little or no value when making comparisons (Gill, 2001). In the standardisation process, first name spelling variations such as LIZ and BETTY might for consistency be replaced with the original or formal spelling such as ELIZABETH. It is also possible to convert identifying stem words such as FRED, although these could equally be associated with ALFRED or FREDERIC. It is important to note that these nicknames could actually be the real name, so caution should be taken when applying this kind of standardisation. Other standardisation procedures sometimes used in formatting names include removal of punctuation or blanks. For example, O'BRIEN becomes OBRIEN, TE AROHA becomes TEAROHA and VAN DAMM becomes VANDAMM. Dictionaries and lexicons have been developed that can relate commonly used nicknames and name contractions to formal names (BOB to ROBERT, LIZA to ELIZABETH) and link the common variations in spelling (SMITH, SMYTH, SMYTHE) (Gill, 2001).

Phonetic coding
Phonetic coding is a way of writing a string of characters based on the way the string is pronounced, and is a useful tool to summarise names and allow for some spelling variations. Used in the context of data linkage, its aim is to dampen the effects of coding errors which could result in two variables disagreeing when, in fact, they should not. Two traditional phonetic coding methods are the Russell SOUNDEX, initially developed for the 1890 United States Census, and the New York State Identification and Intelligence Algorithm (NYSIIS),27 published in 1970. For instance, suppose a surname is listed as Camden in one dataset and is misspelled in another as Comden. If a character comparison is done on these character strings, the surname variables would disagree. Using SOUNDEX, both Camden and Comden would be coded as C535. In the case of NYSIIS, both would be coded as CANDAN. Thus, even if a typographical error was made in encoding the surname, the two field entries would still agree. (Unless the error was in the first letter! In such a case, reverse SOUNDEX/NYSIIS may be employed.) NYSIIS has higher accuracy (discriminating power) than SOUNDEX, but a lower selectivity factor (bringing together alternative forms of the same name). As an example, assuming that the surnames Days and Dais are the same, NYSIIS would perform relatively poorly, as it codes these surnames differently (non-match). Conversely, assuming the surnames William and Williams to be the same, SOUNDEX performs poorly this time, as it would code these two surnames as though they are distinct from one another. The choice between NYSIIS and SOUNDEX then comes down to the level of trade-off the analyst is willing to accept between these two measures of accuracy and selectivity.
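To illustrate, a basic implementation of the Russell SOUNDEX code is sketched below in Python; it ignores the special treatment of H and W found in fuller variants of the algorithm:

SOUNDEX_CODES = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}

def soundex(name):
    """Basic Russell SOUNDEX code: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    previous = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        code = SOUNDEX_CODES.get(c, "")
        if code and code != previous:
            digits.append(code)
        previous = code            # uncoded letters (vowels etc) break runs of repeats
    return (letters[0] + "".join(digits) + "000")[:4]

print(soundex("Camden"), soundex("Comden"))                    # C535 C535
print(soundex("Smith"), soundex("Smyth"), soundex("Smythe"))   # S530 S530 S530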
Standardisation of business names
The main difficulty with business names is that even when they are properly parsed the identifying information may be indeterminate. Sometimes a pair of names can refer to the same business, although the names are quite different. For example, the burger shops Habib Burger and Smith Burger decided to merge and changed the name to Karori Burger. On other occasions, the names can be quite similar, but the businesses very different. For example, Habib Burger and Habib's are a burger shop and an Arabian food restaurant, respectively. Because the name information may be insufficient to accurately determine the status of the link, address information and other identifying characteristics are obtained for integration (Gill, 2001).

Standardisation of addresses
Standardisation of addresses operates in a similar fashion to standardisation of names. Abbreviations like Rd or Cres should be replaced by appropriate expansions to Road or Crescent, or by a set of standard abbreviations commonly used by the organisations. For example, when a variation of a rural address (eg R.D. 1 Tauranga or Tokoroa Farm Kapiro Road) is encountered, the software should use a set of parsing routines different from those associated with a home-number/street-name address. Parsing divides the free-form address variable into a common set of components that can be compared, for example, by street number, suburb and town. Parsing algorithms often use words that have been standardised. For example, STREET or ROAD would cause parsing algorithms to apply different procedures than words such as R.D. or Auckland. While exact character-by-character comparison of the standardised but unparsed names could result in no links, use of the components in the address might help designate some pairs as links. Commercial software is available for the parsing and standardisation of addresses.

27 Taft R (1970). Name Search Techniques, New York State Identification and Intelligence System, as cited in http://www.name-searching.com/Working/Name_SearchKeyWordPhoneticcoding.htm

4.4.2.3 Concordances

Often there is interest in a variable that is collected using different classifications. Sometimes one classification is a simplified version of the other, but at other times one part of the classification agrees across files, while the rest does not. Correct comparison of the variable requires a consistent classification for use as a linking variable. An example in ethnicity would be to have European New Zealander, Māori and Other in one data source, while the ethnicity classification in the other data source has European New Zealander, Māori New Zealander, Cook Island Māori and Other. Several concordances are possible, and choosing the best solution requires an understanding of the concepts behind the data collection. One would need to form an idea of whether Cook Island Māori was typically coded to Māori, or to Other, in the first file. For example, one solution could be to concord Māori with Māori New Zealander and Cook Island Māori.
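In code, a concordance is often just a look-up table applied to one file before linking. The mapping below is purely illustrative of one possible choice; the correct mapping must come from an understanding of how each source actually coded its values:

# Hypothetical concordance from the detailed classification to the coarser one.
ETHNICITY_CONCORDANCE = {
    "European New Zealander": "European New Zealander",
    "Maori New Zealander": "Maori",
    "Cook Island Maori": "Maori",   # one possible choice; could equally be "Other"
    "Other": "Other",
}

def recode_ethnicity(value):
    return ETHNICITY_CONCORDANCE.get(value, "Other")   # unknown codes treated as Other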

4.4.3 Deduplication

Duplicate records are common in administrative datasets. They are usually created by mistake, either in the form-filling process or at the input stage in the data source agency. Examples are typos in filling out a form, or forms being filled out many times during the processing of the same case. Most agencies have systems in place to deal with duplication; however, some duplicates are often left. The data integration analyst can eliminate duplicates using a process called deduplication. Deduplication can be thought of as a data integration exercise in which a file is linked to itself, and can be performed using the same techniques as for integration between two files. Another option is to ignore the duplicates, but at least estimate how many there are, which is helpful in understanding their impact on the final integrated data. The impact of duplicates on integration depends on the frequency of duplicates, how they are generated and what type of integrated dataset is being created. False positives will occur if the duplicates are linked to the wrong unit. If the resulting integrated dataset is the intersection of the source files, then unlinked duplicates will appear as false negative links. Most care should be taken where the integrated data is the union of the source files, as unlinked duplicates will inflate the number of cases in the final integrated file.
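A minimal sketch of deduplication on exact agreement of key fields is shown below (file and field names are hypothetical). Duplicates that differ because of errors in the key fields need the probabilistic techniques described in Chapter 6, with the file linked to itself:

import pandas as pd

admin = pd.read_csv("admin_extract.csv", dtype=str)        # hypothetical input file
key_fields = ["surname", "first_name", "date_of_birth", "sex"]

# At least estimate the number of apparent duplicates before deciding what to do
n_duplicates = admin.duplicated(subset=key_fields, keep="first").sum()
print(f"{n_duplicates} apparent duplicate records")

deduplicated = admin.drop_duplicates(subset=key_fields, keep="first")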

4.4.4 Anonymisation of unique identifiers

The use of unique identifiers (UIDs) assigned by other agencies must meet the requirements of the Privacy Act 1993. Their use in data integration is governed by Statistics NZ's Data Integration Policy and is discussed in Chapter 2, above.


Any UID assigned by other agencies and passed to Statistics NZ as part of an integration project will be converted to an internally (Statistics NZ) assigned unique identifier (IUID) as soon as is practicably possible. The IUID is created using a common encryption process and a key unique to each project. External UIDs will only be retained within the Statistics NZ systems (servers, databases and applications) as long as is necessary to perform validation, editing and integration. They will then be replaced and removed completely.
An externally assigned UID will not be used for longitudinal linking. The IUID provides the capacity to create a consistent longitudinal link to the same unit without the need for Statistics NZ to store the original UID in any of its production databases.
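The exact encryption process is internal to Statistics NZ and is not described in this manual. Purely as an illustration of the idea, a keyed hash yields a consistent, non-reversible identifier from an external UID, as in the sketch below (the key and UID values are placeholders):

import hashlib
import hmac

PROJECT_KEY = b"key-unique-to-this-project"   # placeholder; held securely, one per project

def make_iuid(external_uid):
    """Derive an internal unique identifier (IUID) from an externally assigned UID."""
    return hmac.new(PROJECT_KEY, str(external_uid).encode("utf-8"), hashlib.sha256).hexdigest()

# The same external UID always yields the same IUID, so records for the same unit can be
# linked longitudinally without retaining the external UID itself.
print(make_iuid("123456789"))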


5 Statistical Theory of Record Linkage

Summary
This chapter introduces the mathematical foundations of record linkage theory. The basic ideas are presented using easy-to-understand illustrations to explain the concepts.

5.1 Introduction
In order to understand the process of data integration, it is important to understand the statistical theory behind it. This chapter looks at the data integration process, building up from a simple integration situation (that of exact matching), to how human beings process the integration, to the mathematical/computer process. This chapter is not intended as a complete view of the mathematics of data integration.

5.2 Exact matching


When two files contain the same unique identifier, they can be linked via that unique identifier. Linking via a unique identifier is called exact matching. A unique identifier might either be a single variable or a combination of variables, such as name, date of birth and sex, as long as they are of sufficient quality to be used in combination to uniquely define a record. There is no uncertainty in exact matching. Either a pair of records agrees on the unique identifier or they do not. The problem is when the quality of the variables is not good enough to guarantee that the value of the unique identifier is available, correct and unique. Where exact matching alone will not result in a sufficiently robust integrated dataset, probabilistic linking may be used.
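In practice, exact matching is simply a join on the unique identifier. A minimal sketch with hypothetical file and field names is shown below; it also fails loudly if the supposedly unique identifier turns out to contain duplicates:

import pandas as pd

file_a = pd.read_csv("file_a.csv", dtype=str)
file_b = pd.read_csv("file_b.csv", dtype=str)

# Exact matching: a record pair is linked if and only if the unique identifiers agree.
linked = file_a.merge(file_b, on="uid", how="inner",
                      suffixes=("_a", "_b"), validate="one_to_one")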

5.3 Terminology
Two records are a match when they relate to the same person/business/entity/event. The role of data integration is to determine which records are a match. This term can be differentiated from other uses of the word match by using true match. Two records are a link if, by some process, it is determined that the two records refer to the same unit (eg person/business/entity/event). Creating links is what the data integration process does. Note that not every match is a link, and not every link is a match, as the table outlines.

                   Link                    Non-link
True match         Correct outcome         False negative link
True non-match     False positive link     Correct outcome

(See Chapter 6, below, for more details.)

Matching is the process of comparing records and deciding which are links. The variables used in the matching process are called matching variables, matching fields, linking variables or comparison variables. This process is also known as record linkage.

5.4 Matching files


The following discussion assumes the following scenario. There are two files, File A and File B. The task is to compare one record from each file, and to decide whether the records should be linked. The two records are:

Field            File A               File B
Name             John Black           Jon Block
Date of birth    23-11-63             23-11-65
Sex              M                    M
Address          112 Hiropi Street    89 Molesworth St

5.5 The human approach


Comparing the two records by eye, human beings make a judgement on how likely they think it is that the two records refer to the same unit. Consider the example above. The aim is to know if the two records refer to the same person, so a comparison of each field must be made to judge how likely this is. An initial impression might be that these records are for the same person, so further evidence is sought to see whether this is true.

For the name field, the differences can be explained as: spelling mistakes; errors created when one of the records was entered; or someone trying to read scrawled handwriting. For the date of birth field, the day and month are the same, although the year is two years out, a difference that could have a similar explanation as for name, above. The sex field agrees. If one record did have F for sex, people would be more likely to view that as an obvious error, as neither John nor Jon is a feminine name (although it is possible that Jon is a mistake for Joan).

The address fields do not agree at all. The reason might be that John moved between having information recorded for File A and having information recorded for File B. It might be that one of the files holds the home address and the other the work address. Knowledge about how the information for these files was collected, and what the information means, would influence the amount of reliance to be placed on the address.

Assuming that address is unreliable, and that the differences can be explained as a simple data quality issue, then it would be declared that these records refer to the same person. However, there is the possibility that another record might be an even better match.

There is other information that might be a factor in decision making. For example, there might be 100,000 Blacks and 50,000 Blocks. In such a situation, the chance of getting two records such as those in the example above isn't too unlikely. With this knowledge, it might be decided that the differences are very important, and therefore these two records do not refer to the same person.

A central problem in record linkage is scale: if file A contains 1,000 records, and file B contains 10,000 records, the number of possible record pairs is 1,000 x 10,000, which is 10 million record pairs. While humans can make value judgements to decide whether to link records or not, they can't do thousands of records a minute, which is the speed needed to make a data integration project feasible. Computers can handle such speeds, but they need to be told how to make judgements under uncertainty, which they can then do in a consistent manner.

5.6 The mathematical approach


We introduce in simple terms the theory of probabilistic record linkage as formalised by Fellegi and Sunter (1969). Other useful and accessible references include Jaro (1995)28 and Winkler (1995).29 Much of the record linkage software available, including the software used at Statistics New Zealand, is based on this approach.

When comparing two records, the computer compares each field and assigns a measure that reflects how similar they are. This measure is called the field weight. It is calculated from two pieces of information: how reliable the data is, and how common the value is.

5.6.1 The m probability

The reliability of the data is described by the m probability or m prob. It is a measure of how trustworthy the data is, and can be expressed as the probability of the two values agreeing given that they refer to the same unit (eg person/business/entity/event). That is:

m = Pr(the two values agree | the records are a match)

Another way of thinking about it is this: given that the two records are a match, how likely is it that there is an issue with the data that makes the values different? (This could be an error, an inconsistent definition, a timing difference, etc.) This is why the m prob is related to the data quality. That is:

m = 1 - Pr(the values disagree | the records are a match)

For example, sex might be very well collected and monitored, so it would be given an m prob of 0.98. Address might not be used and is just collected as a matter of course with no checks made on it at all, so it might be given an m prob of 0.7. Chapter 6 discusses how to determine in practice the m prob for a variable.

28 Jaro, Matthew (1995). Probabilistic Linkage of Large Public Health Data Files, Statistics in Medicine, Vol 14, 491–498.
29 Winkler WE (1995). Matching and Record Linkage, Business Survey Statistics, 355–384.


5.6.2 The u probability

The commonness of the value is described by the u probability, or u prob. It is a measure of how likely it is that two values will agree by chance, expressed as the probability of the values agreeing given that the records do not relate to the same unit. That is:

u = Pr(the two values agree | the records are not a match)

Simply, this is a measure of relative frequency: the more common a value is, the more likely two unrelated records are to contain that value. Hence, the u prob is often defined as:

u = 1 / (number of different values)

For example, sex is either male or female, with about equal probability, so the u prob is 0.5. There are 12 months in a year, so a month of birth variable would have a u prob of about 0.08. Address is usually unique, so the u prob will be 0.01 or lower.

This assumes that each value has the same probability of occurring, and is known as the global u probability. Specific u probabilities can be created for each value that a field can take, allowing for non-uniform distributions. For example, surnames such as Smith and Quimby have different relative frequencies, and could usefully be assigned specific u probabilities.

5.6.3 The field weight

From these component probabilities, a weight for the field can be calculated. The calculation used depends on whether or not the two values in the field agree. If they do agree, a positive weight is generated, and if they disagree a negative weight is generated. The size of the weight measures the evidence the values provide about the record pair being a match. The two calculations are:

agreement field weight = log2( m / u )

and

disagreement field weight = log2( (1 - m) / (1 - u) )


Each field weight is therefore the log of a likelihood ratio (the probability of what is observed if the records are a match, relative to the probability if they are not). Logs are used to make later calculations easier, because field weights, assuming independence, then become additive. As for the logarithmic base, it has been customary from information theory to use base 2 for the logarithm.
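As an illustration only (not the syntax of any particular linkage package), the following Python sketch computes the agreement and disagreement field weights from given m and u probabilities using the two formulas above.

    import math

    def field_weights(m: float, u: float):
        """Agreement and disagreement field weights for one field,
        given its m and u probabilities, using base-2 logarithms."""
        agreement = math.log2(m / u)
        disagreement = math.log2((1 - m) / (1 - u))
        return agreement, disagreement

    # A sex field with m = 0.95 and u = 0.5 (values used in the example below)
    print(field_weights(0.95, 0.5))   # approximately (0.93, -3.32)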

5.6.4 The composite weight

Once field weights have been calculated, a composite weight for the entire record pair is calculated, based on the variables examined. The composite weight for the record pair is simply the sum of the field weights. This composite weight is usually what is meant by "weight" when talking about record pair weights. Note that adding the weights is equivalent to multiplying the likelihood values, just as one would multiply independent probabilities. The assumption is made that the fields are independent of each other, and that errors in the fields occur independently, although this is not necessarily true in the real world.

5.6.5 Example

Using the example from above, first the m and u probs are set up, and field weights calculated.

Field            m prob   u prob   Agreement field weight   Disagreement field weight
Name             0.95     0.01     6.57                     -4.31
Date of birth    0.90     0.01     6.49                     -3.31
Sex              0.95     0.50     0.93                     -3.32
Address          0.70     0.01     6.13                     -1.72

Next, the fields are compared.

Field            Agreement?   Field weight
Name             No           -4.31
Date of birth    No           -3.31
Sex              Yes          0.93
Address          No           -1.72

In practice, allowance can be made for partial agreements, for example a minor difference in spelling could generate a lower, but still positive, agreement weight. This gives a final composite weight for this pair of records as:

(-4.31) + (-3.31) + 0.93 + (-1.72) = -8.41


As this final composite weight is negative, the matching process would reject this record pair as a link.
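The same arithmetic can be checked with a short sketch (Python, illustrative only); the m and u probabilities and agreement outcomes are those from the tables above.

    import math

    def field_weight(m, u, agree):
        """log2(m/u) on agreement, log2((1-m)/(1-u)) on disagreement."""
        return math.log2(m / u) if agree else math.log2((1 - m) / (1 - u))

    # m prob, u prob and agreement outcome for each field in the tables above
    fields = {
        "name":          (0.95, 0.01, False),
        "date of birth": (0.90, 0.01, False),
        "sex":           (0.95, 0.50, True),
        "address":       (0.70, 0.01, False),
    }

    composite = sum(field_weight(m, u, agree) for m, u, agree in fields.values())
    print(round(composite, 2))   # -8.41, so the pair is rejected as a link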

5.6.6 Changing the m and u probabilities

The best way to understand m and u probs and how they affect weight calculations is to take an example and make changes, holding one of the values constant at a time. The examples below help to demonstrate what happens.
Note that logs are being calculated to base 10, and then divided by Log(2) to produce the required log to base 2.

Original probabilities:
m prob = 0.9, u prob = 0.1
agreement weight = log(0.9/0.1)/log(2) = 3.17
disagreement weight = log((1-0.9)/(1-0.1))/log(2) = -3.17


Keeping u prob constant, and increasing/decreasing m prob:

Decrease m prob: m prob = 0.85, u prob = 0.1
agreement weight = log(0.85/0.1)/log(2) = 3.09
disagreement weight = log((1-0.85)/(1-0.1))/log(2) = -2.58

Increase m prob: m prob = 0.95, u prob = 0.1
agreement weight = log(0.95/0.1)/log(2) = 3.25
disagreement weight = log((1-0.95)/(1-0.1))/log(2) = -4.17

[Figure: agreement and disagreement weights plotted against m prob (0.7 to 0.95), holding u prob = 0.1.]

Keeping m prob constant, and increasing/decreasing u prob:

Decrease u prob: m prob = 0.9, u prob = 0.05
agreement weight = log(0.9/0.05)/log(2) = 4.17
disagreement weight = log((1-0.9)/(1-0.05))/log(2) = -3.24

Increase u prob: m prob = 0.9, u prob = 0.15
agreement weight = log(0.9/0.15)/log(2) = 2.58
disagreement weight = log((1-0.9)/(1-0.15))/log(2) = -3.09


[Figure: agreement and disagreement weights plotted against u prob (0.05 to 0.3), holding m prob = 0.9.]

So, if the m prob is changed, the disagreement weight moves more than the agreement weight; whereas if the u prob is changed, the agreement weight moves more than the disagreement weight.

5.7 Weights
Once the weights have been calculated, the next step is to decide which records are links and which are non-links, based on the evidence of the weights.

5.7.1 Distribution

In a typical integration project there are hundreds of thousands of records, and millions of possible pairings. Most of those record pairs do not refer to the same entity, and thus there will be more non-links created than links. The distribution of these weights therefore is bimodal, like the following figure:


[Figure: distribution of composite weights across all possible comparison pairs. The number of comparison pairs is plotted against weight (about -150 to 100); the non-match distribution peaks at negative weights, the match distribution peaks at positive weights, and an observed line shows the combined distribution.]

(Note: the observed line is the distribution actually observed, which has been offset slightly to make it more differentiable from the other distributions.)

5.7.2 Cut-off thresholds

Once the weights have been calculated, upper and lower thresholds are established. The upper threshold is the weight above which every record pair is determined to be a link. There is usually only one link per record, so other possible pairings can either be ignored or considered duplicate records. The lower threshold is the weight below which every record pair is determined to be a non-link.

[Figure: distribution of composite weights with threshold cut-offs. Record pairs with weights below the lower cut-off are non-links; record pairs with weights above the upper cut-off are links.]


The problem is that, although record pairs are in reality either true matches or true non-matches, in the world of data integration, with imperfect or insufficient data, the picture is not so clear. Some true matches have low weights because of data errors or similar problems, just as some true non-matches are given high weights for the same reasons. There is theory stating the best way to determine the threshold levels (as outlined in Fellegi and Sunter, 1969), but in practice it is up to the person working on the data integration project to decide where the cut-off thresholds will go. This is usually done by reviewing record pairs near a likely cut-off point and making judgements about how the computer differentiated the pairings. More detail about the impact of setting particular threshold levels is given in Chapter 6, below.

5.7.3 Clerical review

If the link and non-link thresholds are the same, this divides the set of record pairs cleanly into two sets. However, if they are not, then the record pairs with weights in between the two limits fall into the clerical review area. In this area, a human operator decides which record pairs are links and which are non-links. With some statistical integration software, it might not be possible to assess record pairs in a clerical review area; in this case, it is necessary to make the link and non-link thresholds the same.
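A minimal sketch of how the two thresholds partition record pairs into links, non-links and a clerical review area is given below (Python, illustrative only; the record identifiers, weights and threshold values are invented for the example).

    def classify(weight, lower, upper):
        """Assign a record pair to link, clerical review or non-link,
        based on its composite weight and the two cut-off thresholds."""
        if weight >= upper:
            return "link"
        if weight < lower:
            return "non-link"
        return "clerical review"

    # invented record pairs and thresholds, for illustration only
    pairs = {("A17", "B05"): 24.3, ("A17", "B91"): 7.8, ("A02", "B44"): -12.6}
    for pair, weight in pairs.items():
        print(pair, classify(weight, lower=5.0, upper=15.0))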

5.8 Blocking
As mentioned above, there is likely to be a very large number of records to compare. Comparing 1,000 records with 1,000 records means that 1,000,000 (1 million!) comparisons are made. With at most 1,000 record pairs being a match, this leaves 999,000 record pairs that are non-matches, which are determined to be non-links.

[Figure: comparing two files of 1,000 records each without blocking: 1,000 x 1,000 = 1,000,000 total comparisons.]

To reduce the number of comparisons made and focus on the records that are more likely to be matches, the records can be filtered first so that only certain records are considered in comparison to each other. This filtering is called blocking, and is done by selecting variables to block on. Only records that agree on the values in those variables are compared to each other. For example, if sex is chosen as a blocking variable, only records with the same value of sex are compared to each other. This cuts out about half of the comparisons required. If month of birth were chosen, this decreases the number of comparisons by a factor of 12. Choosing both sex and month of birth means that 1/24th of the comparisons are made. The following diagram illustrates the reduction in comparisons for the case where there are five equal-sized blocks on each file.



[Figure: with each file of 1,000 records divided into five blocks of 200 records, only records in corresponding blocks are compared: 5 x (200 x 200) = 200,000 total comparisons.]

In the example introduced in section 5.4, above, with the John Black/Jon Block record pair, if sex were used as a blocking variable, the two records would still be compared. If year of birth were used as a blocking variable, they would not be compared.
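The following sketch (Python, illustrative only) shows blocking as a filter on candidate record pairs. The birth years are invented (the example only says the years differ by two); blocking on sex keeps the John Black/Jon Block pair together, while blocking on year of birth separates them.

    from collections import defaultdict

    def candidate_pairs(file_a, file_b, block_vars):
        """Yield only those pairs of records that agree on every blocking variable."""
        def key(rec):
            return tuple(rec[v] for v in block_vars)

        blocks_b = defaultdict(list)
        for rec in file_b:
            blocks_b[key(rec)].append(rec)
        for rec_a in file_a:
            for rec_b in blocks_b.get(key(rec_a), []):
                yield rec_a, rec_b

    # birth years are invented, but two years apart as in the example
    file_a = [{"name": "John Black", "sex": "M", "birth_year": 1965}]
    file_b = [{"name": "Jon Block",  "sex": "M", "birth_year": 1967}]

    print(list(candidate_pairs(file_a, file_b, ["sex"])))         # still compared
    print(list(candidate_pairs(file_a, file_b, ["birth_year"])))  # never compared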

5.9 Passes
In the example above, if year of birth were chosen to block on, the two records would not be compared, and if that were the only pass run, they would never be compared. However, more than one pass can be run. A pass is one iteration of record linkage using a particular combination of blocking variables and matching variables. In a data integration project, more than one pass is used to block the file in different ways, to allow for different variable comparisons, and to allow for errors in the blocking variables. For example, one pass might block on year of birth and match on name and sex; another pass might block on sex and match on name and address. The number of passes used should reflect how well the record linkage process is working. If only a small number of links are created in each pass, then multiple passes might be needed; on the other hand, this might indicate that there is not enough information to give high-quality links and more passes will not help. It is up to the user to decide.
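A minimal sketch of the pass structure is shown below (Python, illustrative only). The link_one_pass function is a placeholder standing in for the blocking, weighting and threshold steps described earlier; only the handling of residual records between passes is shown.

    def run_passes(file_a, file_b, passes, link_one_pass):
        """Run several linkage passes; records linked in one pass are removed
        from the residual files before the next pass is attempted."""
        links = []
        residual_a, residual_b = list(file_a), list(file_b)
        for blocking_vars, matching_vars in passes:
            new_links = link_one_pass(residual_a, residual_b,
                                      blocking_vars, matching_vars)
            links.extend(new_links)
            linked_a = {id(a) for a, _ in new_links}
            linked_b = {id(b) for _, b in new_links}
            residual_a = [r for r in residual_a if id(r) not in linked_a]
            residual_b = [r for r in residual_b if id(r) not in linked_b]
        return links

    # two passes as in the text: block on year of birth, then on sex
    passes = [(["birth_year"], ["name", "sex"]),
              (["sex"],        ["name", "address"])]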


6 Record Linkage in Practice

Summary: This chapter focuses on the practical application of record linkage. It also covers things that can go wrong, and discusses what makes the output of a record linkage exercise fit for use.

6.1 Types of matching


A data linkage exercise may take several forms: for instance, a many-to-one match (eg geocoding), a deduplication exercise, or a one-to-one match.

Many-to-one matching
In a many-to-one match, a record from one dataset (file) is allowed to link to more than one record in another. In the Student Loan Data Integration Project, a person would have one loan record, but might be enrolled in more than one institution. A common example of the use of many-to-one matching is geocoding, the process of matching addresses to a geographic location. For example:

File A (three records):
Location: Rossmore House
Location: Aorangi House
Location: Back Bencher

File B (one record):
Location: Molesworth St

Outcome: each of the records in File A links with the record in File B, producing a file with three records:
Rossmore House   Molesworth St
Aorangi House    Molesworth St
Back Bencher     Molesworth St

Deduplicating: Cleaning up lists
Deduplication is the process of removing the duplicates on a file. Multiple occurrences of a single unit within one dataset are consolidated into a single record. This might be done on an address list or client database to ensure that the file is as clean as possible for its purpose. The theory used is the same as when there are two files; here the second file is simply the same as the first. The practice can be slightly different, depending on the software. For example:


File A (five records):
Name    Birth date
Amy     15 June 1985
Bill    22 March 1987
Chris   1 September 1980
Cris    1 September 1980
Dave    12 August 1990

Outcome: Chris links with Cris, producing a file with four records:
Name            Birth date
Amy             15 June 1985
Bill            22 March 1987
Chris or Cris   1 September 1980
Dave            12 August 1990

In the example above, the choice whether to retain the spelling Chris or Cris is left to the analyst.

Many-to-many matching
Many-to-many matching is similar to many-to-one matching, except that records on both files may link to more than one record on the other file. To date, no integration project at Statistics NZ has involved creating many-to-many links.

One-to-one matching
In a data integration project with one-to-one matching, one record on File A links to one record on File B. This is the situation in the New Zealand Census Mortality Study, where one death record should match to one census record.

File A (seven records): Nicolla, Mike, John, Sharon, Jamas, Rissa, Allyson
File B (six records): Nicola, Mick, Jon, Sharon, James, Andy


The following records link:
File A    File B
Nicolla   Nicola
Mike      Mick
John      Jon
Jamas     James
Sharon    Sharon

The outcome could be either the union or the intersection of File A and File B. Thus:

Outcome 1: the union of File A and File B, a file with eight records:
Nicolla or Nicola, Mike or Mick, John or Jon, Sharon, Jamas or James, Rissa, Allyson, Andy

Outcome 2: the intersection of File A and File B, a file with five records:
Nicolla or Nicola, Mike or Mick, John or Jon, Sharon, Jamas or James

Our focus in this document is one-to-one matching, but much of what is contained herein can be made to apply to the other data linkage forms.

6.2 Pre-matching process


6.2.1 Deduplication

Deduplication refers to the process of removing duplicate records belonging to the same entity, within the same file. Even if it is acknowledged that a certain level of duplication is acceptable for planning or research purposes,30 it is at times best to remove the duplicates from the files before integration commences. Retention of duplicates complicates the integration, especially when integrating multiple datasets.

30 Community Services Ministers Advisory Council (2004). Statistical Data Linkage in Community Services Data Collections. Australian Institute of Health and Welfare, Canberra. http://www.aihw.gov.au/publications/index.cfm/author/3541


6.2.2 A data integration process flow

The diagram below shows a typical process flow for data integration:

[Figure: a typical data integration process flow. Dataset A and Dataset B are each standardised (edited and parsed). Blocking variables are decided on, giving a subset of A and a subset of B matched on the blocking variables. The linking step compares the linking variables using comparison functions and the m and u probabilities, and a cut-off threshold is set; passes are added or revised as required. The output is the linked data, which is followed by quality measurement.]

This chapter describes each of the steps in the diagram above in more detail. Dataset standardisation involves data editing and parsing to a prescribed format to allow comparison of field entries. Gill (2001) notes that 75 percent of the effort in a data integration exercise is devoted to preparing the input files. Chapter 4 includes a detailed discussion of the data standardisation process. Blocking, a filtering process that reduces the number of record comparison pairs, was initially discussed in Chapter 5; section 6.3.1 below looks at how blocking is done in practice.


The linking step is examined more closely in section 6.3.2, which gives some pointers on choosing linking variables. Section 6.3.3 covers commonly used comparison functions for linking variables; these functions provide the means by which a decision can be made on whether two variables being compared fully agree, partially agree or fully disagree. Section 6.3.4 covers the iterative approach to determining the m and u probability values. The tricky aspect of determining a cut-off value is discussed in section 6.4.1; the composite weight cut-off value is important, as it separates record pairs that are considered to be links from those that are not. Estimation of false positives, false negatives and match rates, and measurement error in data integration, are discussed in sections 6.4.2 and 6.4.3. Chapter 6 concludes with a discussion of issues that need to be considered when the integration is not a one-off exercise, but data is added over time.

6.2.3 Standardised datasets

If no standardisation is carried out, it is possible for records that are true matches not to be linked, because the common variables might appear so different that the composite weight turns out to be low or negative. Standardisation is discussed in detail in section 4.4.

Field standardisation may be carried out using rule sets. Rule sets are groups of files that contain the instructions to parse free-form fields. Linking software may have a number of built-in rule sets which can be modified. Alternatively, new rule sets or reformatting may be applied outside the linking software environment. For example, a surname field called SURNAME_TEXT might be standardised as follows:

Surnames File A
Take SURNAME_TEXT
Capitalise
Remove spaces
Remove any characters other than alphabetic characters
Name the resulting field SURNAME1
Define new variable INITIAL_SURNAME = first character of SURNAME1
Define new variable SOUNDEX_SURNAME = SOUNDEX code of SURNAME1

Surnames File B
Take SURNAME_TEXT
Capitalise
Remove spaces
Set to missing if surname contains UNKNOWN
Remove any characters other than alphabetic characters
Name the resulting field SURNAME1
Define new variable INITIAL_SURNAME = first character of SURNAME1
Define new variable SOUNDEX_SURNAME = SOUNDEX code of SURNAME1

In this example, the standardisation of File B requires the extra step of removing the word UNKNOWN if it exists in the surname text, and the end result is two datasets that produce equivalent standardised variables for surname (SURNAME1, INITIAL_SURNAME and SOUNDEX_SURNAME).
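The rule set above could be implemented in many ways; the sketch below (Python, illustrative only, with a simplified Soundex rather than the exact code used by any linkage package) shows one possible reading of the File B rules.

    import re

    def soundex(name):
        """Simplified Soundex: first letter plus up to three digits coding
        the following consonants (an approximation, for illustration only)."""
        codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
                 **dict.fromkeys("DT", "3"), "L": "4",
                 **dict.fromkeys("MN", "5"), "R": "6"}
        out, prev = [], codes.get(name[0], "")
        for ch in name[1:]:
            digit = codes.get(ch, "")
            if digit and digit != prev:
                out.append(digit)
            prev = digit
        return (name[0] + "".join(out) + "000")[:4]

    def standardise_surname(surname_text):
        """One reading of the File B rule set: capitalise, remove spaces and
        non-alphabetic characters, set to missing if it contains UNKNOWN,
        then derive the initial and the Soundex code."""
        s = re.sub(r"[^A-Z]", "", surname_text.upper())
        if not s or "UNKNOWN" in s:
            return {"SURNAME1": None, "INITIAL_SURNAME": None, "SOUNDEX_SURNAME": None}
        return {"SURNAME1": s, "INITIAL_SURNAME": s[0], "SOUNDEX_SURNAME": soundex(s)}

    print(standardise_surname("  O'Brien "))   # OBRIEN, O, O165
    print(standardise_surname("unknown"))      # all components set to missing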


6.3 Matching method


6.3.1 Choice of blocking variables

Blocking is employed to compare two datasets efficiently by reducing the number of records to compare between the two. For example, if one wants to link a dataset of 100,000 records with another containing 1,000,000 records, the total number of possible comparisons would be 100,000 x 1,000,000. Blocking cuts down the total number of comparisons by only comparing records that exactly match on the specified blocking variables. In effect, the comparison space is cut down to only those records which have a potential to match, as specified by the blocking variables.

In choosing the blocking variables, the analyst aims to keep the size of each block small, to reduce the number of comparison pairs efficiently, yet big enough to avoid missing true matching record pairs.31 For instance, if the analyst blocks on gender, two huge blocks are created, resulting in an inefficiently large number of comparisons to perform. On the other hand, blocking on a numeric identifier produces numerous mini-blocks, perhaps as many blocks as there are records in the datasets. A problem arises when there is an error or missing value for the blocking variable: two matching records will not be compared and the match will be missed.

The following methods have been employed successfully at Statistics NZ to design blocks of good quality and size. One technique is to choose a variable that has a good number of values (overcoming the problem with the gender variable in the example above), with a fairly uniform distribution, so as to have blocks of uniform size. Blocks of uniform size are desired because the number of comparison pairs generated by any blocking method depends on the number of blocks the method generates and the sizes of the resulting blocks; very large blocks therefore have a dominant effect on the efficiency of the blocking method.32 It is also desirable to have a blocking variable with high reliability, to avoid the scenario of two matching records failing to be in the same block, with no chance of being linked.

Another approach is to keep the block sizes as small as possible and compensate for errors in blocking by running multiple passes.33 This is achieved by using multiple blocking variables in the different passes to overcome block size problems (very large blocks may heavily slow down the linkage software, or could even cause it to crash) and data errors. Essentially, each time a pass is run the links are kept, and another pass with new blocks and new comparison pairs is performed on the remaining unlinked records. New blocks and new comparison pairs mean more chance of not missing true matches.

Truncated fields can also be used to mitigate the effects of erroneous encoding when blocking, in addition to using phonetic coding and variables which are thought to be reliable. For instance, because the SOUNDEX codes for the surnames William and Williams are different, a new variable containing a truncated form of the surname could be considered.
31 Baxter R, Christen P, Churches T (2003). A Comparison of Fast Blocking Methods for Record Linkage, CMIS Technical Report 03/139, First Workshop on Data Cleaning, Record Linkage and Object Consolidation, KDD 2003, Washington DC.
32 Gu L, Baxter R (2004). Adaptive Filtering for Efficient Record Linkage, SIAM International Conference on Data Mining Conference Proceedings, Florida.
33 Ascential Software (2002). Integrity SuperMATCH Concepts and Reference Guide Version 4.0, 517.

This new field, together with other fields, could produce new matches which otherwise might have been missed.

Event dates, birth dates separated into month, day and year, forenames, and surnames (or their corresponding phonetic codes) are good blocking variables. Unique identification numbers, although potentially erroneous or missing, partition the files into a large number of sets. Unless there is rigorous control of the issue and recording of identifiers, the recommendation is to use unique identifiers as blocking variables in the first pass, with other matching variables to verify the link; other variables would be used to block in subsequent passes. Sparsely populated fields are not good for blocking purposes, since records with missing values will remain unblocked and ineligible for potential linking.
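As an illustration of deriving blocking keys such as a truncated surname or a phonetic code, a possible sketch follows (Python, illustrative only; the field names and the YYYYMMDD date format are assumptions for the example).

    def blocking_keys(record):
        """Candidate blocking keys for one record: a truncated surname (so that
        WILLIAM and WILLIAMS fall in the same block), the Soundex code, and the
        year of birth; different passes can block on different keys."""
        return {
            "surname_trunc4":  record["SURNAME1"][:4],
            "surname_soundex": record["SOUNDEX_SURNAME"],
            "birth_year":      record["BIRTH_DATE"][:4],   # assumes YYYYMMDD
        }

    rec = {"SURNAME1": "WILLIAMS", "SOUNDEX_SURNAME": "W452", "BIRTH_DATE": "19670321"}
    print(blocking_keys(rec))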

6.3.2 Choice of linking variables

Practically all the variables common to the two datasets undergoing integration can be used for linking. In doing so, redundancies in the information imparted by the related variables may be helpful in reducing matching errors, provided the errors are not highly correlated or functionally dependent.34 It should be noted, however, that linkage software does not necessarily compute correlations. Moreover, it is not advisable to have highly correlated linking variables in the same pass, as they increase the composite weight without providing additional discrimination between record pairs which should link and those which should not.

Usually, however, only a subset of the variables common to the datasets is used for linking. Gill (2001) suggests using six groups of variables, with a combination of variables from the different groups being used for linking. The six groups are:

Group 1: Proper names, which rarely change over a person's lifetime (except possibly for a woman's surname) (eg forenames, initials, surnames)
Group 2: Non-name personal characteristics, which rarely change over a lifetime (eg date of birth, sex)
Group 3: Socio-demographic variables that may have several changes over a lifetime (eg address, marital status)
Group 4: Variables collected for special registers (eg occupation, date of injury, diagnosis)
Group 5: Variables used for family record linkage (eg surnames in Group 1 plus other surnames, birth weight)
Group 6: Arbitrarily allocated numbers that identify the record (eg IRD number).

Gill notes that it is common practice to choose and combine linking variables from Groups 1, 2, 3 and 6. When choosing linking variables, spelling errors, the choice of phonetic coding (the SOUNDEX codes for William and Williams are different) and the like may affect the classification of the variables as either agreeing or disagreeing. A quick run-through of the problems with some common variables, and what has been done in practice to increase their reliability, will also help the analyst in selecting the linking (and blocking) variables.

34 Gu L, Baxter R, Vickers D, Rainsford C (2003). Record Linkage: Current Practice and Future Directions, CMIS Technical Report No 03/83, CSIRO Mathematical and Information Sciences, Canberra.


Surnames: May be prone to change, as with marriage and divorce. The order of use of surnames in some ethnic groups may differ. Surnames may be prone to spelling variations resulting from erroneous transcription. A phonetically coded surname may be used to reduce transcription/spelling errors. A surname array (different surname fields merged into one) may also be used to handle multiple surnames; the arrays are then compared using a comparison function (see section 6.3.3) to make allowance for misspellings.

Forenames: Have many of the same problems as surnames. Modernised name versions and nicknames may be used in some documents, while the formal forename is used in others. Forenames may be prone to transcription/spelling errors. Sometimes only forename initials, instead of full forenames, are available from the dataset; an array of the initials may be created as a new variable.

Sex: Generally reliable, if collected at all. Sex, however, has low discriminatory power in distinguishing between a match and a non-match.

Birth date: There may be differences in format (eg European v American format, although this should have been handled during the standardisation phase). Birth month and birth day are usually more reliable than the birth year. Gill suggests some tolerance when using the birth year, as it is more prone to error than the month or day of birth.

Age: May be used with some tolerance, like year of birth. When available together with the birth date, a data check can be performed to see whether the two agree.

Address: As with dates, there can be format problems, but the field can be standardised. The standardisation process can be laborious. (When done in SAS, for instance, rule sets can be used to standardise addresses in a relatively straightforward manner.) Address is a good field for confirming matches, but can be poor in cases of disagreement, as the person might have changed address. When not used as a linking variable, the unlinked records from each of the two datasets being integrated can be sorted by address, and the sorted files compared to check whether some matching records have failed to link.

Experience at Statistics NZ across various data integration projects shows that, generally, the standardised forms of most of the above variables are reliable.

6.3.3 Commonly used comparison functions for linking variables

Each field that is used for linking (and thus compared) will have an agreement or a disagreement weight (a positive or negative value, respectively). The field weight takes the full agreement weight if the fields completely agree (see Chapter 5). Field agreement or disagreement, however, need not be exact: with the use of comparison functions, partial agreements are possible. Below are some commonly used comparison functions available in QualityStage,35 the software package currently used by Statistics NZ for its data integration projects.

ABS_DIFF: The absolute difference comparison. It compares the difference between two numeric values. As an example, assume the field being compared is age and the tolerance specified is 5 (ie plus or minus five years in the ages would still be considered to match). If the age in the first dataset is 24 and the age in the other is 28, since the difference is within the allowed tolerance, the full agreement weight is assigned to the field age for this particular

35 QualityStage (2003). Match Concepts and Reference Guide Version 7.0, Chapter 5, 134.


comparison pair. If the age in the second dataset is 30, however, since the difference is beyond the tolerance, the field weight for age would be the full disagreement weight. Note that, unlike several of the comparison functions below, ABS_DIFF does not assign partial agreement weights.

CHAR: The character-by-character comparison. Any mismatch in the characters of the fields undergoing comparison merits the assignment of the full disagreement weight.

DATE8: The comparison that allows tolerance in dates. At least one of two tolerance parameters has to be specified. If only the first tolerance parameter is set, say 2, the analyst has allowed up to two days' difference between the two dates compared. Unlike ABS_DIFF and CHAR, however, a partial agreement weight is assigned to the field if the difference is within the prescribed tolerance. If only the first parameter is specified and the value entered is 2, a one-day difference between the dates compared reduces the agreement weight by one-third of the weight range (agreement weight minus disagreement weight); a two-day difference cuts the agreement weight by two-thirds of the range; a three-day difference merits the assignment of the full disagreement weight. If two parameters are specified, the first parameter is the number of days tolerated when variable B > variable A, while the second parameter is the number of days tolerated when variable B < variable A.

MULT_EXACT: The comparison function used to allow the agreement of free-form text when the order of the words does not matter and where there may be missing or erroneous words. It is similar to comparing arrays where the individual words are the array elements. The string of characters to be compared from each of File A and File B must be specified.

MULT_UNCERT: Identical to the MULT_EXACT comparison function, except that a parameter of uncertainty must be specified, based on how similar the two comparison strings are. A higher value is given to identical strings; a lower value to strings that are almost certainly different. Weights are linearly proportioned between the full agreement and disagreement weights, depending on how close the score is to the specified threshold. A score outside the threshold is given the full disagreement weight.

NAME_UNCERT: The comparison allowing truncated fields. For example, the field in one dataset has the name Albert, while the other dataset has Al. If CHAR is used, the fields will not match (total disagreement). With NAME_UNCERT, the comparison will use the shorter length (truncation) of the two names and will not compare characters after that length; in the example above, the two names are considered to fully agree. A parameter, the minimum threshold, must be specified, where the value given is based on how similar the two strings are. Weights are linearly proportioned between the full agreement and disagreement weights, depending on how close the score is to the specified threshold. A score outside the threshold is given the full disagreement weight.

PRORATED: Like ABS_DIFF, the PRORATED comparison function is for comparing numeric fields. The prorated comparison allows numeric fields to disagree by an absolute amount specified by an additional parameter. For example, if the parameter was 15 and the absolute value of the difference in the field values is greater than 15, the disagreement weight would be assigned to the comparison.
If the difference were zero, the full agreement weight would be assigned. Any difference between 0 and 15 would receive a weight reduced in proportion to the size of the difference; a difference of eight, for example, would receive a weight roughly halfway between the agreement and disagreement weights.


Two additional arguments can be specified if it matters whether the difference is positive or negative. The first argument is the tolerance if the value on File B is greater than the value on File A; the second argument is the tolerance if the value on File A is greater than the value on File B.

UNCERT: A character comparison which allows partial weight assignments, like NAME_UNCERT. The weight assigned is based on the difference between the two strings compared, as a function of the string length and the number of character transpositions, insertions, deletions or replacements (recall that NAME_UNCERT is only for truncated names). A parameter, the minimum threshold, must be specified, where the value given is based on how similar the two strings are. Weights are linearly proportioned between the full agreement and disagreement weights, depending on how close the score is to the specified threshold. A score outside the threshold is given the full disagreement weight.
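The behaviour of a prorated numeric comparison can be sketched as follows (Python, illustrative only; this is not QualityStage code, just the interpolation idea described above, with invented weights).

    def prorated_weight(value_a, value_b, agreement, disagreement, tolerance):
        """Full agreement weight for identical values, full disagreement weight
        beyond the tolerance, and a linearly interpolated weight in between."""
        diff = abs(value_a - value_b)
        if diff > tolerance:
            return disagreement
        fraction = diff / tolerance
        return agreement - fraction * (agreement - disagreement)

    # ages differing by 8 with a tolerance of 15; the weights are invented
    print(prorated_weight(24, 32, agreement=3.0, disagreement=-3.0, tolerance=15))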

6.3.4 The m and u probabilities

The m and u probabilities can be defined in two different ways. Global m and u probabilities assume the probability is constant across all values of the variable; value-specific m and u probabilities may differ between the values a variable takes. Global u probabilities are used if it is assumed that the distribution of possible values within the field is (nearly) uniform. In practice, the linkage software may automatically estimate value-specific u probabilities to reflect the actual distribution of variable values in the dataset. Value-specific m probabilities may be used for fields where some values are more reliable than others. However, global m probabilities are generally used, as it is to be expected that all the values in a field are affected in the same way by the things that make a field reliable or unreliable (mode of collection, maintenance practices, etc).

The m probability is the probability that the fields agree given that the record pair is a match. It is a reflection of how reliable the field is, as it is computed as 1 minus the error rate of the field. Because all fields are not equally reliable, it is expected that m probabilities for different fields will vary. In practice, the error rates are generally not accurately known. Initially, when no estimates of the m probabilities are available, the following may be used:

For most fields, 0.9
For very important fields, 0.999
For moderately important fields, 0.95
For fields with poor reliability, 0.8 or less.

Setting a high m probability value for a field forces a high penalty for disagreement in that field. Statistics NZ experience across various data integration projects shows that the standardised variables sex, name, surname and birth date have good m probability values. Variables such as address, ethnicity and phone number have been observed to generally have a lower reliability. This is not to say that these fields are unreliable in the absolute sense. Experience has shown that variables which have been collected and maintained carefully by the source agencies have good m values (are reliable), whereas variables that are of less importance to source agencies that is, those not necessary to support their core operational requirements tend to be less reliable. Where the law requires an event to be reported within a prescribed short period of time, event dates have proven to be reliable fields.


While there have been some theoretical approaches to modelling the m values (eg Winkler, 1988),36 an iterative approach has been used at Statistics NZ. The first linking is done using an estimate for m based on what is known broadly about the importance of the variable, or from previous experience, as above. A new m value is then estimated from the data that has been linked. The m probability may be estimated by dividing the number of times the field values agree in a comparison by the number of times the value participated in a comparison (excluding from the computation records with missing entries for the field of interest). This should be done when the analyst has a certain degree of confidence that most of the good links have been captured.

The u probability is the probability that the fields agree given that the record pair is not a match. This is a reflection of how likely values are to agree by chance. Assuming a uniform distribution for the values a field may take, u is estimated by 1/n, where n is the number of field values. For example, the u probability for gender may be estimated as 1/2, as the variable takes two possible values. Similarly, for the variable month of birth, the u probability may be estimated as 1/12.
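The re-estimation of an m probability from already-linked pairs might look like the following sketch (Python, illustrative only; the toy data is invented).

    def estimate_m(linked_pairs, field):
        """Re-estimate the m probability for one field from pairs already accepted
        as links: agreements divided by comparisons, ignoring pairs where the
        field is missing on either record."""
        agreements = comparisons = 0
        for rec_a, rec_b in linked_pairs:
            a, b = rec_a.get(field), rec_b.get(field)
            if a is None or b is None:
                continue
            comparisons += 1
            agreements += (a == b)
        return agreements / comparisons if comparisons else None

    links = [({"sex": "M"}, {"sex": "M"}),
             ({"sex": "F"}, {"sex": "M"}),
             ({"sex": None}, {"sex": "F"})]
    print(estimate_m(links, "sex"))   # 0.5 on this invented toy data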

6.4 Quality assessment of linked data


6.4.1 Setting the cut-off threshold

A trade-off exists between the level of false positive matches and the level of false negative matches. It is important to consider the objectives of the matching exercise when determining cut-off thresholds. For example, if it is critical to avoid false matches, then the cut-off threshold should be set higher, mindful that some true matches will be missed.

The (non-negative) cut-off threshold is the composite weight value that demarcates the record pairs the analyst considers to be links from those the analyst does not. All record pairs whose composite weight is greater than or equal to the cut-off are regarded as links. Deciding on the cut-off value is one of the more difficult tasks the analyst faces in a data integration project, as the boundary is not clear-cut; it is acknowledged that even experienced analysts can produce significantly different linked outputs.37

In practice, the cut-off is initially set at zero for a given pass and is iteratively changed before proceeding to the next pass. After running the pass, the weights histogram can be examined to aid in deciding the cut-off score for the pass. Ideally, the frequencies of matched records trail off as the weights become lower, while the frequencies of unmatched records trail off as the weights become higher. This ideal situation produces a bimodal distribution: the farther apart the modes are, the better the discrimination between the matched and unmatched records. This scenario is represented in the figure below.

36 Winkler WE (1988). Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. In Proceedings of Survey Research Methods Section, American Statistical Association, 667–671.
37 Gomatam S, Carter R, Ariet M, Mitchell G (2002). An empirical comparison of record linkage procedures, Statistics in Medicine 21, 1485–1496.


[Figure: distribution of composite weights across all possible comparison pairs, showing well-separated non-match and match distributions and the observed (combined) distribution.]

In reality, the distribution is far more complex. Multimodal distributions are not uncommon, and the trailing of frequencies described above may not be as observable. Also, in some software, as comparisons are not made for records that have no chance of matching, comparisons with negative weights do not appear in the histogram.

As the ideal situation above is not often encountered in practice (although such an ideal distribution has been noted for the Student Loan Data Integration Project at Statistics NZ), it is good to produce a file of linked records for examination. The file can be sorted by weight in descending order. The record pairs with high composite weights represent (relatively) good links; as the weight value lowers, the links become dubious. The sorted record pairs are examined for increasing patterns of field disagreements as the weights decrease, to determine an appropriate cut-off level for the pass. Of course, this is easier said than done, but as an analyst gains experience and familiarity with the data undergoing integration, a certain level of confidence is gained in setting the cut-off scores.

A sample actual histogram of weights under non-ideal conditions is shown below. After a visual assessment of the file of linked records, a cut-off score of 21.07 was set for this pass. Note the multiple peaks and the not-so-distinct trailing frequencies near the chosen cut-off.


[Figure: weight distribution histogram for an actual pass. Frequency is plotted against composite weights from 0 to 32; the distribution has multiple peaks, and the chosen cut-off is marked at about 21.]

A side-effect of adjusting the cut-off threshold is the possibility of creating duplicate pairs. Record pairs whose composite weights fall below the cut-off become residuals and are eligible for linking in the next pass. However, one record in file A may form a pair with a weight above the chosen threshold for more than one record in file B. Depending on the nature of the integration exercise, these may be treated as genuine duplicates, possibly for further review, and not available in other passes. If the matching is treated strictly as one to one, then the record pair with the highest weight is taken as the link, and the rest become eligible for linking in the next pass. In the case where record pairs have the same weight, one might be chosen at random as the link. Data integration software may be equipped with options for handling cases where duplicates with the same or different weights exist. Carrying out deduplication simplifies the subsequent linking by generating confidence that no genuine duplicates exist.
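A strict one-to-one treatment, keeping only the highest-weighted pair at or above the cut-off for each record on file A, could be sketched as follows (Python, illustrative only; the identifiers and weights are invented, and a full implementation would also need to resolve duplicates on file B).

    def best_links(scored_pairs, cutoff):
        """Keep, for each record on file A, the highest-weighted pair at or above
        the cut-off; lower-weighted pairs become residuals for the next pass."""
        best = {}
        for id_a, id_b, weight in scored_pairs:
            if weight < cutoff:
                continue
            if id_a not in best or weight > best[id_a][1]:
                best[id_a] = (id_b, weight)
        return {a: b for a, (b, _) in best.items()}

    # invented identifiers and weights; cut-off as in the histogram above
    pairs = [("A1", "B7", 24.3), ("A1", "B9", 22.8), ("A2", "B4", 12.1)]
    print(best_links(pairs, cutoff=21.07))   # {'A1': 'B7'}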

6.4.2 False positives, false negatives and match rates

False positives are record pairs that are deemed to be links but which are actually true non-matches. False negatives are true matches that remain unlinked. Generally, there is no good method for automatic estimation of error rates, so false positive rates have been estimated by manually checking samples of linked records.

In large datasets, analysis of false positives can be time-consuming work, and it is often useful to group the linked data before selecting a sample. For example, in the Student Loan Data Integration Project, the passes constitute groups from which samples for false positive analysis were drawn. Alternatively, new groups, different from the groups induced by the passes, can be constructed for sampling purposes. In the Injury Statistics Project, for example, the linked Accident Compensation Corporation (ACC) and New Zealand Health Information Services (NZHIS) records may fall into any one of the sample groups below:


Group 1: linked on injury date, National Health Index (NHI) number, first name and surname, all the same
Group 2: the same injury date and date of birth, plus the same NHI if present on both records
Group 3: the same injury date, first name, surname and date of birth.

Samples from each of the groups can be selected and analysed for false positives. The clerical review of these samples is done by visually comparing the records, and while this method is able to draw upon subject-matter knowledge and other information, it still involves the subjective view of the reviewer. If it is understood where errors are most likely to occur in the datasets, it may be necessary to target the sample to these areas with a view to improving the quality of the match. Several iterations of clerical review and adjustment of match criteria may be necessary before a linked dataset is confirmed and final false positive error rates are calculated.

If at least one of the files is expected to match completely and the false positive rate is low, then the false negative rate may be calculated simply as one minus the match rate (where the match rate for a given file is the number of matched records over total records). However, in other situations, such as when the integrated dataset is the union of two files, the expected matches are unknown and the false negative rate is difficult to estimate.
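Drawing review samples from such groups could be sketched as follows (Python, illustrative only; the group labels, record contents and sample size are invented).

    import random

    def review_sample(linked_pairs, group_of, sample_size=10, seed=1):
        """Draw a simple random sample of linked record pairs from each quality
        group, for clerical review of false positives."""
        groups = {}
        for pair in linked_pairs:
            groups.setdefault(group_of(pair), []).append(pair)
        rng = random.Random(seed)
        return {g: rng.sample(members, min(sample_size, len(members)))
                for g, members in groups.items()}

    # invented linked records, grouped here by the pass that created the link
    linked = [{"pass": 1, "id": i} for i in range(200)] + \
             [{"pass": 2, "id": i} for i in range(30)]
    samples = review_sample(linked, group_of=lambda p: p["pass"])
    print({g: len(s) for g, s in samples.items()})   # {1: 10, 2: 10}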

6.4.3 Measurement error in integration

Measurement error affects inference: it can lead to bias in estimation, which can be severe. Best-practice procedures in data analysis examine the data being used for measurement errors, and known measurement error properties are incorporated into the analysis (Chesher and Nesheim, 2004).38 The measurement error processes that arise when there is probabilistic record linkage are complex and non-standard.

Chesher and Nesheim list causes of measurement error in data linking, including:
- units incorrectly linked, so that data from one unit is incorrectly associated with another unit (aka false positive links)
- in many-to-one linking, statistics computed using only a few sub-units being used to measure characteristics of all sub-units
- in many-to-one linking, characteristics of sub-units being inferred from features of major units (and vice versa).

Chesher and Nesheim go on to say that, from a practical perspective, measurement error is inevitable, and since the potential effects are so damaging, one should avoid using data-linking procedures that are likely to generate large amounts of measurement error. The first step in estimating the quality of linked datasets is often the estimation of rates of false positives and false negatives. In record linkage projects carried out by Statistics NZ to date, quality measurement has focused on these two dimensions of quality, with the aim of minimising false positive links.

38 Chesher A and Nesheim L (2004). Review of the Literature on the Statistical Properties of Linked Datasets, Report to the Department of Trade and Industry, United Kingdom.


6.5 Adding data over time


Adding data over time may impose additional difficulty on the data integration exercise, and policies need to be established to account for changes or updates in the data for a given time period. It is common for agencies providing administrative data to add, delete, modify or update their records over time. For example, an ACC claim can be made at any time by a claimant after an accident has occurred, resulting in late claims, or an agency may update its records to account for new information such as a change in address or family name. As a consequence, data received in one time period may not be the complete dataset for that period. A policy must be established that imposes time cut-offs on data that arrive late, to ensure that late data does not impact on the integration exercise and does not result in major adjustments to the outputs over time.

Another issue arises when integrating data from multiple data sources: difficulties exist when the definitions of reference periods vary between sources. It is important that data received from the different data providers refer to the same time period, and ensuring this is easier said than done. Agencies use different dates to refer to different parts of the process they use to gather records. For example, there could be a date lodged for when a record is received by the agency, perhaps initially in paper form; a processing date for when a record is entered into a computer system; another date for when a record is registered or accepted by the agency; and yet another for the time period the record actually refers to. Understanding the nature of the data, and discussion with the data providers, is necessary to ensure that the correct data is received for a specified time period.

Another potential issue is the carry-over effect of false positives in an ongoing production environment. For example, during production of the Injury Statistics Project,39 linking is carried out six-monthly. An injury event may occur in June 2005, with a corresponding NZHIS record in the period January to June 2005, but with the ACC claim lodged in August 2005 (a different six-month period). If the NZHIS record for this injury links to an ACC record found in the January to June 2005 data (a false positive), this NZHIS injury record would never have any chance of linking to its correct ACC record lodged in August 2005. This false positive link is carried over as new datasets are integrated in future six-month periods. The situation would be worse if the correct ACC record, to which the NZHIS record should have linked, were to link to a different NZHIS record in the future (another false positive); this, in turn, would be carried over again as new datasets were integrated and new time periods considered. The scenario described here for ACC and NZHIS can happen with any record in any of the other datasets and in any time period. Extra caution must therefore be exercised to keep the number of false positives down at each step of the integration, for any given time period, to minimise such errors being propagated.

39 Statistics New Zealand (2003b). Injury Statistics Project Pilot: Quality Report Part Two: Assessment of Bias. (Internal document available on request.)


Appendix: Statistics New Zealand's Uses of Data Integration


1. New Zealand Census Mortality Study (NZCMS)
Wellington School of Medicine and Health Sciences website: http://www.wnmeds.ac.nz/academic/dph/research/HIRP/nzcms/

2. Injury Statistics Project


Statistics New Zealand website: http://www.stats.govt.nz/injury

3. Linked Employer-Employee Database (LEED)


Statistics New Zealand website: http://www.stats.govt.nz/leed

4. Student Loan Data Integration Project


Statistics New Zealand website: http://www.stats.govt.nz/datasets/education-training/student-loan-borrowers.htm


Glossary
Summary: The Glossary lists some of the commonly used terms in the area of data integration, and defines how they are used by Statistics NZ.

Glossary of common data integration terms


Term Agreement weight Definition A numeric value assigned when there is agreement on a particular field for a pair of records being compared. See Disagreement weight. Some record linkage software allows the user to combine a number of single fields into an array. If there are several fields that contain similar information (eg several alternative name fields), the use of arrays can reduce the number of cross-comparisons that must be made. A data linkage method may be biased if there are systematic errors in the links created. If the linkage method is biased, then results from analysis using the linked data may differ systematically from the true results. The files to be linked are divided into blocks (pockets) which have some information in common. Records are only compared with others in the same block. The purpose of using blocks is to reduce the number of comparisons that must be made. Variables used to divide a file into blocks. See Block. The composite weight at or above which all record pairs are linked and below which all record pairs are not linked. The sum of the agreement weights for all matching variables that agree (positive values) and the disagreement weights for all matching variables that disagree (negative values). The composite weight measures the relative likelihood that the two records are in fact a true match. See Total weight, weight. The combination of data from different sources about the same or a similar individual or unit. Data integration at the micro level is synonymous with Record linkage. The identification of records belonging to the same unit (eg person). Once identified, duplicate records may be removed or combined with a record nominated as the master record. Linking records belonging to the same unit by way of a unique identifier.

Array

Bias

Block

Blocking variables Cut-off weight Composite weight

Data integration

Deduplication

Deterministic (exact) record linkage

62

Data Integration Manual

Disagreement weight False negative link

A numeric value assigned when there is disagreement on a particular field for a pair of records being compared. See Agreement weight. Two records that should have been linked because they correspond to the same unit (they are a true match) but that were not linked.

False negative rate The proportion of true matches on a file that have not been linked. False positive link Two records that should not have been linked because they do not correspond to the same unit (they are a not a true match), but that have been linked. The proportion of links that are false positives. Global m and u probabilities assume the probability is constant for all values of the variable. For example, if u probability is set to 0.03 for the year of birth variable, this applies for all years. See m probability, u probability. The dataset resulting after record linkage has taken place. A dataset containing data that has been edited parsed and standardised in readiness for integration. Level at which the data is integrated. May not be the same as Reporting Unit. A decision that two records belong to the same unit. See Non-link, Match, Non-match. The file output from record linkage that lists all the linked pairs. See Link, Linked. See Record linkage. The status of a record that has passed through the integration process and was linked to a record from the other file. Variables used to compare two records, including both blocking variables and matching variables. In record linkage, the m probability is the probability that a field has the same value on both files, given that the records being compared truly belong to the same individual/unit. It is a measure of how reliable the field is. See u probability. Variables used to compare two records that fall within the same block, to see how likely it is that the two records belong to the same unit. See Block, Blocking variables, linking variables. A formal voluntary agreement between two or more parties that seeks to achieve mutually agreed outcomes through the efforts of the parties. A file consisting of a record for each unit (Unit record data). The lowest level of data available. With reference to record linkage, a decision that two records do not correspond to the same unit. See Link, True Match, True non-match.

False positive rate Global m and u probability

Integrated dataset Integration input dataset Integration unit Link Link file Linkage Linked

Linking variables m probability

Matching variables

Memorandum of Understanding (MOU) Microdata Non-link

63

Parsing: Process of splitting a text string into a series of variables (eg full name is split into first names and surnames). A simple parsing and standardisation sketch follows this glossary.

Pass: One iteration of a record linkage, using a particular set of blocking and matching variables. See Block, Blocking variables, Matching variables.

Probabilistic record linkage: Record linkage methodology based on the relative likelihood that two records belong to the same unit, given a set of similarities/differences between the values of the linking variables (eg name, date of birth, sex) on the two records. See Deterministic record linkage.

Record linkage: The combination of data from different sources about the same individual or unit, or a similar individual or unit, at the level of individual unit records. Synonymous with Data integration at the micro level.

Reporting unit: Level at which the data source is provided. May not be the same as Integration unit.

Sensitivity: The proportion of all records on one file that have a match in the other file that were correctly accepted as a link.

Specificity: The proportion of all records on one file that have no match in the other file that were correctly not accepted as a link. An illustrative calculation of these quality measures follows this glossary.

Service Level Agreement (SLA): A formal voluntary agreement between two or more parties that seeks to achieve a mutually agreed level of service through the efforts of the parties.

Source data: The original dataset as it was received from the data provider.

Standardisation: Process of changing the formats of variables to make them comparable across different datasets.

Statistical matching: Statistical matching occurs at the unit-record level, but it does not necessarily link records of the same person. In statistical matching, a unit record for one individual is linked to a record or records for similar individuals in other datasets on a probabilistic basis. See Stochastic matching.

Stochastic matching: Matching groups from two different datasets based on similar characteristics, with the assumption that such people will act in the same way. Useful for creating synthetic datasets. See Statistical matching.

Total weight: The sum of the agreement weights for all matching variables that agree (positive values) and the disagreement weights for all matching variables that disagree (negative values). Synonymous with Composite weight. An illustrative weight calculation follows this glossary.

True match: Two records that truly do correspond to the same unit. See Link, Non-link, True non-match.

True non-match: Two records that truly do not correspond to the same unit (eg two different people). See Link, Non-link, True match.

u probability: In record linkage, the u probability is the probability that a field has the same value on both files, given that the records being compared do not belong to the same individual/unit. It is a measure of how likely the field is to agree by chance on non-matching record pairs. See m probability.

Unique identifier (UI or UID): A variable that uniquely identifies a person, place, event or other unit.

Unlinked: The status of a record that has passed through the integration process and was not linked to a record from the other file.

Weight: A numeric value assigned to a pair of records compared during integration on the basis of the similarity of the linking variables. See Composite weight.
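The m probability, u probability and total weight entries above describe the core quantities of probabilistic record linkage (Fellegi and Sunter 1969). The sketch below shows one common way of turning them into agreement and disagreement weights using base-2 logarithms; the field names and probability values are illustrative assumptions for the example only, not figures drawn from any Statistics NZ project.

```python
import math

# Illustrative m and u probabilities for three matching variables
# (assumed values for this example only).
M_U = {
    "surname": (0.95, 0.01),
    "sex": (0.98, 0.50),
    "year_of_birth": (0.90, 0.03),
}

def field_weights(m, u):
    """Agreement and disagreement weights for one matching variable,
    in the usual base-2 log formulation."""
    agreement = math.log2(m / u)                 # positive: field agrees
    disagreement = math.log2((1 - m) / (1 - u))  # negative: field disagrees
    return agreement, disagreement

def total_weight(field_agreement):
    """Total (composite) weight for one candidate record pair: sum the
    agreement weights where fields agree and the disagreement weights
    where they do not."""
    total = 0.0
    for field, agrees in field_agreement.items():
        agree_w, disagree_w = field_weights(*M_U[field])
        total += agree_w if agrees else disagree_w
    return total

# Candidate pair: surname and year of birth agree, sex disagrees.
print(round(total_weight({"surname": True, "sex": False, "year_of_birth": True}), 2))
```

A reliable field (high m, low u) contributes a large positive weight when it agrees, while a field that often agrees by chance (such as sex) contributes little either way.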

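Several of the quality terms defined above (false negative rate, false positive rate, sensitivity, specificity) are typically estimated once links have been checked against a set of record pairs whose true match status is known, for example from clerical review. The function below is a minimal sketch of those calculations; the input counts and the example figures are hypothetical.

```python
def linkage_quality(true_match_linked, true_match_not_linked,
                    non_match_linked, non_match_not_linked):
    """Quality measures from comparing links against known truth.

    true_match_linked      -- true matches accepted as links
    true_match_not_linked  -- true matches that were missed (false negatives)
    non_match_linked       -- non-matches wrongly accepted as links (false positives)
    non_match_not_linked   -- non-matches correctly left unlinked
    """
    links = true_match_linked + non_match_linked
    true_matches = true_match_linked + true_match_not_linked
    non_matches = non_match_linked + non_match_not_linked
    return {
        "false negative rate": true_match_not_linked / true_matches,
        "false positive rate": non_match_linked / links,
        "sensitivity": true_match_linked / true_matches,
        "specificity": non_match_not_linked / non_matches,
    }

# Example with made-up counts from a hypothetical clerical review.
print(linkage_quality(900, 100, 30, 970))
```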

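Parsing and standardisation, as defined above, prepare the integration input dataset before any linking takes place. The snippet below is a deliberately simplified illustration of both steps (splitting a full name and converting dates to a single format); the rules and formats shown are assumptions for the example, and production systems would use far more robust logic.

```python
from datetime import datetime

def parse_full_name(full_name):
    """Parse a free-text name into first name(s) and surname
    (simplistic: assumes the last token is the surname)."""
    tokens = full_name.strip().split()
    return {"first_names": " ".join(tokens[:-1]), "surname": tokens[-1]}

def standardise_date(value):
    """Standardise dates supplied in assorted formats to ISO (YYYY-MM-DD)."""
    for fmt in ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")

print(parse_full_name("Jane Mary Smith"))  # {'first_names': 'Jane Mary', 'surname': 'Smith'}
print(standardise_date("07/08/1975"))      # 1975-08-07
```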

Bibliography
Ascential Software (2002). Integrity SuperMATCH Concepts and Reference Guide Version 4.0, 517.

Baxter R, Christen P, Churches T (2003). A Comparison of Fast Blocking Methods for Record Linkage. CMIS Technical Report 03/139, First Workshop on Data Cleaning, Record Linkage and Object Consolidation, KDD 2003, Washington DC.

Brackstone (1999). Managing Data Quality in a Statistical Agency, Survey Methodology, Dec 1999, Vol 25, No 2, 139-149.

Cabinet meeting minutes CAB (1997) M 31/14 [electronic copy unavailable].

Chesher A and Nesheim L (2004). Review of the Literature on the Statistical Properties of Linked Datasets, Report to the Department of Trade and Industry, United Kingdom.

Community Services Ministers' Advisory Council (2004). Statistical Data Linkage in Community Services Data Collections, Australian Institute of Health and Welfare, Canberra.

Fellegi I and Sunter A (1969). A theory of record linkage, Journal of the American Statistical Association, 64, 1183-1210.

Gill L (2001). Methods for Automatic Record Matching and Linkage and their use in National Statistics, National Statistics Methodological Series No 25, National Statistics, United Kingdom.

Gomatam S, Carter R, Ariet M, Mitchell G (2002). An empirical comparison of record linkage procedures, Statistics in Medicine, 21, 1485-1496.

Gu L, Baxter R (2004). Adaptive Filtering for Efficient Record Linkage. 2004 SIAM International Conference on Data Mining Conference Proceedings, Florida.

Gu L, Baxter R, Vickers D, Rainsford C (2003). Record Linkage: Current Practice and Future Directions, CMIS Technical Report No 03/83, CSIRO Mathematical and Information Sciences, Canberra.

Jaro M (1995). Probabilistic Linkage of Large Public Health Data Files, Statistics in Medicine, Vol 14, 491-498.

Newcombe H, Kennedy J, Axford S, and James A (1959). Automatic Linkage of Vital Records, Science, 130, 954-959.

QualityStage (2003). Match Concepts and Reference Guide Version 7.0, Chapter 5, 134.

Statistics New Zealand (1998). Final report on the feasibility study into the costs and benefits of integrating cross-sectoral administrative data to produce new social statistics. (Internal report available on request.)

Statistics New Zealand (1999a). Confidentiality Protocol. (Internal report available on request.)


Statistics New Zealand (1999b). Statistics and the Privacy Act 1993. (Internal report available on request.)

Statistics New Zealand (2002a). Guidelines for Writing a Technical Description of a Record Linkage Project. (Internal Statistics NZ document.)

Statistics New Zealand (2002b). Meta Information Template for Description and Assessment of Administrative Data Sources. (Internal report available on request.)

Statistics New Zealand (2002c). Pro forma Privacy Impact Assessment Report: Data Integration Projects (draft). (Internal report available on request.)

Statistics New Zealand (2003a). Guidelines for Peer Review of the Technical Description of a Record Linkage Project. (Internal Statistics NZ document.)

Statistics New Zealand (2003b). Injury Statistics Project Pilot: Quality Report Part Two - Assessment of Bias. (Internal document available on request.)

Statistics New Zealand (2003c). Statistics New Zealand's Statement of Intent: Year ending 30 June 2004, Statistics New Zealand, Wellington. http://www.stats.govt.nz/about-us/corporate-reports/statement-of-intent-03/default.htm

Statistics New Zealand (2005a). Data Integration Policy Guidelines. (Internal report available on request.)

Statistics New Zealand (2005b). Proposed Methodology for Estimating Undercounting of Vehicle-related Injuries in NZ. (Internal report available on request.)

Taft R (1970). Name Search Techniques, New York State Identification and Intelligence System. http://www.name-searching.com/Working/Name_SearchKeyWordPhoneticcoding.htm

Winkler WE (1988). Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage. In Proceedings of the Survey Research Methods Section, American Statistical Association, 667-671.

Winkler WE (1995). Matching and Record Linkage, Business Survey Statistics, 355-384.

