
Statistical Journal of the IAOS 37 (2021) 1391–1410 1391

DOI 10.3233/SJI-210835
IOS Press

Review Article

Data integration using statistical matching techniques: A review
Israa Lewaa a,∗, Mai Sherif Hafez b and Mohamed Ali Ismail b

a Faculty of Business Administration, Economics and Political Science, Department of Business Administration, The British University in Egypt, Cairo, Egypt
b Faculty of Economics and Political Science, Department of Statistics, Cairo University, Egypt

Abstract. In the era of the data revolution, the availability of data is a huge wealth that has to be utilized. Instead of conducting new surveys, benefit can be drawn from data that already exist. As enormous amounts of data become available, it is becoming essential to undertake research that integrates data from multiple sources in order to make the best use of them. Statistical Data Integration (SDI) is the statistical tool for this issue. SDI can be used to integrate data files that have common units, and it also allows merging unrelated files that do not share any common units, depending on the input data. The convenient method of data integration is determined according to the nature of the input data. SDI has two main methods, Record Linkage (RL) and Statistical Matching (SM). SM techniques typically aim to achieve a complete data file from different sources which do not contain the same units. This paper aims at giving a complete overview of existing SM methods, both classical and recent, in order to provide a unified summary of various SM techniques along with their drawbacks. Points for future research are suggested at the end of this paper.

Keywords: Statistical matching, record linkage, parametric statistical matching, nonparametric statistical matching, mixed methods, Bayesian statistical matching
∗Corresponding author: Israa Lewaa, Faculty of Business Administration, Economics and Political Science, Department of Business Administration, The British University in Egypt, Cairo, Egypt. Tel.: +20 102 949 9446; E-mail: israalewaa@feps.edu.eg; israa.lewaa@bue.edu.eg.

1874-7655/$35.00 © 2021 – IOS Press. All rights reserved.

1. Introduction

Attention has been growing towards data integration in response to the increasing flood of available data. Data integration aims at integrating two or more data sources (usually data from sample surveys) sharing the same target population. Statistical Data Integration (SDI) can be used to integrate data files that have common units, and it also allows merging unrelated files that do not share any common units, depending on the input data. The convenient method of data integration is determined according to the nature of the input data. SDI has two main methods, Record Linkage (RL) and Statistical Matching (SM). [1] provide a review of data integration techniques for combining probability samples, probability and non-probability samples, and probability and big data samples. [2] presents a review of methods for combining data from different sources, focusing on record linkage. RL is the method of gathering information from multiple sources that relates to the same entity. If there is a unique identifier in the data sources to be integrated, then there is no difficulty in matching the data sources. In this case, deterministic linkage is used. Deterministic linkage is a direct way of linking which usually requires exact agreement on a unique identifier (such as a national identity number). If there is no unique identifier, probabilistic record linkage is employed (see [3–6] for details). The advantage of SM is that it can be used as a supplement to RL to reduce the potential bias when obtaining the estimates

Table 1
Pattern of missing data originated by statistical matching

File A
Unit  Gender  Age    Education  ...  Purchasing information about   View information about
                                     Cereals  Wines  ...            Daily soaps  News  ...
1     Female  40–45  Low             1 Kg     1      ...            [missing data]
2     Male    30–35  High            None     2      ...            [missing data]
...   ...     ...    ...

File B
Unit  Gender  Age    Education  ...  Purchasing information about   View information about
                                     Cereals  Wines  ...            Daily soaps  News  ...
1     Female  40–45  Low             [missing data]                 Regular      No
2     Male    30–35  High            [missing data]                 No           Regular
...   ...     ...    ...

Source: [13].

using RL [7]. On the other hand, SM techniques typically aim to achieve a complete data file from different sources which do not contain the same units. In SM, the data sources share only a set of common variables, and inference is required on the other variables. Besides the main goal of SM, which is data integration, SM also has the objective of jointly analysing pairs of variables observed in two distinct sample surveys [8–11]. Inconsistent SM can only produce meaningless data that are impossible to compare and use. Accordingly, suitable approaches should be used for SM to keep pace with the rapid development of data science.

SM can be viewed as a missing data problem, where methods such as imputation can be used to solve it [12]. In the basic SM structure, there exist two files A and B as sources of data. File A contains variables X and Y, while file B contains variables X and Z. In this case, variables Y and Z are not found jointly in one dataset. SM provides a complete file containing variables X, Y and Z. SM can be very useful in many applications. A motivating example is presented in this section to demonstrate the importance of SM. The objective is to integrate two or more surveys that have no units in common.

The following example from [13] represents two different surveys, each having a different goal, with some variables in common. The first dataset includes data about gender, age, education and purchasing information about some products like cereals, wines, etc. The second dataset includes data about gender, age, education and information about TV viewing behavior for daily soaps, news, etc. (see Table 1). An appealing option for market researchers is to combine viewing information available from the second dataset with purchasing data available from the first. This can save a lot of time and money while providing a richer data source for analysing relationships among those variables. Variables which are not found jointly in both datasets are imputed individually to create a comprehensive and large database through SM. On matching the two datasets and imputing the missing parts of the data, it becomes plausible to study the relationship between TV viewing behavior and purchasing behavior.

This paper aims to provide a unified piece of work that gives a complete view of the SM techniques available in the literature. The organization of this paper is as follows: Section 2 introduces SM as a missing data problem. Section 3 presents classical approaches in SM, showing the drawbacks of these procedures. Methods of SM are presented within three frameworks: under the conditional independence assumption, in the presence of auxiliary information, and subject to uncertainty analysis. A number of alternative SM techniques are given in Section 4. Moreover, Section 5 discusses issues related to SM, including a review of existing methods for assessing the quality of the data resulting from various SM techniques. Finally, Section 6 gives the conclusion of this paper.

2. Statistical matching as a missing data problem

There are several methods for dealing with data with missing values, including procedures based on completely recorded units, where incomplete units are discarded and only units with complete data are analysed. Another approach for dealing with missing data depends on weighting procedures, where weights are given to the observed units. Imputation techniques constitute yet another common approach for the treatment of missing data, where imputation can be conducted via single or Multiple Imputation (MI). Single imputation is defined as replacing each missing value by a single value. Single imputation techniques have different

types including mean imputation, regression imputation, and hot-deck imputation, as well as advanced imputation techniques such as using an Expectation Maximization (EM) algorithm to obtain Maximum Likelihood (ML) estimates [14–16]. Rubin was the first to propose MI as a technique to address the problem of missing data (see, e.g., [17–19]). The main notion of MI is the replacement of each missing value by a set of M plausible values, where each value is drawn from the conditional distribution of the missing values given the observed values. The M completed datasets are produced by an appropriate imputation model, and the method which would have been suitable for analysis had the data been complete is then used to analyse each set. The resulting M analyses are then combined to give one final result. MI is most straightforward to use under ignorable missingness mechanisms (MAR and MCAR). If the missingness is ignorable, the estimates of the statistical model of interest resulting from MI often have the desirable properties of being unbiased, consistent, and asymptotically normal [20,21]. MI can still be applied when data are Missing Not At Random (MNAR); however, the resulting estimates are not proved to have the same properties. For more details about the types of missingness (MCAR, MAR, and MNAR), see [13,22,23]. Since SM can be viewed as a problem of missing data [12], many authors have been using MI in the context of SM. For example, [24] performed cross-survey MI to combine regression estimation where more covariates were found in one survey but fewer in the other. Another MI method used in SM, proposed by [13], was Multiple Imputation Chained Equations (MICE).
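The impute-analyse-combine logic of MI described above can be sketched as follows. This is a minimal illustration with simulated data and a simple mean estimand, not code from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated variable with values missing completely at random (MCAR).
y = rng.normal(50.0, 10.0, size=200)
miss = rng.random(200) < 0.3
y_obs = y[~miss]

M = 20                       # number of imputations
point_estimates, variances = [], []
for _ in range(M):
    # Impute each missing value by a draw from a distribution fitted
    # to the observed values (here: a simple normal approximation).
    imputed = rng.normal(y_obs.mean(), y_obs.std(ddof=1), size=miss.sum())
    completed = np.concatenate([y_obs, imputed])
    point_estimates.append(completed.mean())                  # analysis step
    variances.append(completed.var(ddof=1) / len(completed))  # its variance

# Combine the M analyses (Rubin's rules for a scalar estimand).
q_bar = np.mean(point_estimates)        # pooled point estimate
w_bar = np.mean(variances)              # within-imputation variance
b = np.var(point_estimates, ddof=1)     # between-imputation variance
total_var = w_bar + (1 + 1 / M) * b
```

The between-imputation component b is what distinguishes MI from single imputation: it propagates the uncertainty due to the missing values into the final variance estimate.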
The most important question raised when considering SM is how to match the sets of data in the right way. One of the first SM attempts is by [25], who presented the use of equivalence classes of the common variables in two data sources by making a random matching of (X, Y) with (X, Z) among "equivalent" (X, Z)'s that achieve the closest score. [26,27] opposed Okner's approach, questioning the assumption that Y and Z, given X, are independent, and stressed that there is a need for a theory of matching. [28] supported and defended the independence assumption, whereas [29] argued that this conditional independence assumption is valid in many cases. Through a second round of discussion, [30–32] showed some improvements in the method of equivalence classes. Again, [33] stressed his belief that the methods proposed would not perform well under the conditional independence assumption.

Various developments of SM techniques have been discussed in the literature, and many applications have used SM depending on the Conditional Independence Assumption (CIA) proposed by [25]. For instance, [34] assumed the CIA when developing a theoretical framework for SM. [35] applied a sample likelihood approach in the case of the CIA to statistically match two independent samples, whereas [36] proposed a mixed procedure based on multinomial logistic regression models depending on the CIA. [37] used the CIA to combine the information of two different datasets: the Statistics on Income and Living Conditions, and the Household Budget Survey.

Recently, [38] used a mixed procedure of SM based on the CIA to integrate two different surveys: the Statistics on Income and Living Conditions, and the Household Budget Survey. In Section 3, we will present classical approaches in SM, showing the drawbacks of these procedures.

3. Traditional statistical matching methods

In their comprehensive book about SM, [11] classify SM techniques into procedures developed under the CIA, and SM procedures in light of auxiliary information. For a general review of the CIA, see [11,39]. In this section, we present classical SM techniques under the CIA and with auxiliary information, using the same structure proposed by [11]. In addition, we consider uncertainty analysis for SM methods. Within this framework of three settings, different estimation approaches are considered, such as parametric, nonparametric, mixed and Bayesian.

3.1. Statistical matching methods under the conditional independence assumption

Traditionally, the framework of SM occurs when only file A and file B are available. File A contains variables X and Y, while file B contains variables X and Z. The match is done based on the set of common variables X in file A and file B. All SM techniques throughout this section assume conditional independence of Y and Z given X. In other words, variable X is the only source of association between the data in files A and B. In particular, the CIA implies that the joint probability density function for X, Y and Z can be factorized as follows:

f(x, y, z) = f(y, z|x) f(x) = f(y|x) f(z|x) f(x),  (1)

where f_{Y|X} is the conditional distribution of Y given X, f_{Z|X} is the conditional distribution of Z given X, and f_X is the marginal distribution of X.

To estimate Eq. (1), we need information about the marginal distribution of X and the partial relationship

between X and Y on one hand and that of X and Z on the other hand. In fact, such information can be found in the different samples A and B.

3.1.1. Parametric approaches
Parametric SM methods are mainly based on two steps: first, a model has to be specified; then its parameters have to be estimated. The method of estimation depends on the specific SM framework. There are two main approaches for parametric SM, namely micro and macro approaches.

In the macro case, the goal is to estimate the parameter of interest, e.g. the correlation coefficient between Y and Z, or the contingency table Y × Z in the case of categorical data. The only parameter that cannot be estimated directly is the covariance between Y and Z, as they are not jointly found in one data source. One possibility to derive an estimate of such a covariance is provided by assuming the conditional independence of Y and Z given the X variable. However, the estimates from different independent samples may lead to a variance-covariance matrix that is not positive semi-definite. To avoid such a problem, Maximum Likelihood Estimation (MLE) can be used for partially observed data [11]. In the case of categorical variables, the parameters of interest are the joint probabilities of X, Y and Z, which can be obtained using maximum likelihood too (see [40–42]).

The goal of micro approaches for SM is to create a complete set of data for (X, Y, Z), by completing the missing values in files A and B. More precisely, the missing Z in file A and the missing Y in file B are filled in. Accordingly, there are essentially two main parametric micro methods: conditional mean matching and draws from a predictive distribution. Conditional mean matching is the most popular method in the case of continuous variables. This method is based on regression imputation. The values are imputed using the estimated regression functions of Z on X, and of Y on X, respectively. Since X is a random variable, we consider the conditional mean of Z given X = x_a,

E(Z|X = x_a) = ẑ^(A)_primary = α̂_Z + β̂_ZX x_a,  a = 1, 2, ..., n_A,  (2)

where the estimated parameter values β̂_ZX and α̂_Z are the parametric macro ML estimates. Thus, a complete file A is obtained after adding ẑ^(A)_primary. Similarly, file B is filled with the following predicted values:

E(Y|X = x_b) = ŷ^(B)_primary = α̂_Y + β̂_YX x_b,  b = 1, 2, ..., n_B.  (3)

Notice that when dealing with categorical variables, the variables are replaced by the indicator variables of each category. [43] showed that conditional mean matching leads to a lack of variability in the imputations with respect to the same conditioning variables, and hence to an underestimation of the covariance between Y and Z. To preserve variability and avoid this underestimation, [22] suggest adding a random residual to each predicted value. They referred to this approach as stochastic regression imputation. As in the conditional mean matching method, the random draws can be obtained by replacing the parameter values with their ML estimates. Therefore, file A is filled in with the values

ẑ^(A)_primary = α̂_Z + β̂_ZX x_a + ε_a,  a = 1, 2, ..., n_A,  (4)

where ε_a is a normally distributed residual N(0, σ̂²_{Z|X}) with σ̂²_{Z|X} = S_{ZZ,B} − β̂²_{ZX} S_{XX,B} (the Z on X regression being estimated from file B, where Z is observed). Similarly, file B is filled in with the values

ŷ^(B)_primary = α̂_Y + β̂_YX x_b + ε_b,  b = 1, 2, ..., n_B,  (5)

where ε_b is a normally distributed residual N(0, σ̂²_{Y|X}) with σ̂²_{Y|X} = S_{YY,A} − β̂²_{YX} S_{XX,A}.
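Equations (2)–(5) can be sketched in code. The following is an illustrative sketch with simulated files A and B, in which simple OLS estimates play the role of the ML estimates; it is not the implementation of any cited paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated file A (X, Y observed) and file B (X, Z observed).
n_A, n_B = 100, 150
x_A = rng.normal(0, 1, n_A)
y_A = 2.0 + 1.5 * x_A + rng.normal(0, 1, n_A)
x_B = rng.normal(0, 1, n_B)
z_B = -1.0 + 0.8 * x_B + rng.normal(0, 1, n_B)

def fit(x, y):
    """Least-squares fit y = alpha + beta*x; returns (alpha, beta, resid_var)."""
    beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    alpha = y.mean() - beta * x.mean()
    resid_var = np.var(y, ddof=1) - beta**2 * np.var(x, ddof=1)
    return alpha, beta, resid_var

# Regression of Z on X estimated from file B; of Y on X from file A.
a_Z, b_ZX, s2_Z = fit(x_B, z_B)
a_Y, b_YX, s2_Y = fit(x_A, y_A)

# Eqs (2)-(3): conditional mean matching (deterministic imputation).
z_hat_A = a_Z + b_ZX * x_A
y_hat_B = a_Y + b_YX * x_B

# Eqs (4)-(5): stochastic regression imputation adds a random residual.
z_tilde_A = z_hat_A + rng.normal(0, np.sqrt(s2_Z), n_A)
y_tilde_B = y_hat_B + rng.normal(0, np.sqrt(s2_Y), n_B)
```

With the residual term included, the imputed Z values in file A recover roughly the variability of Z observed in file B, which the deterministic imputations of Eqs (2)–(3) understate.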
3.1.2. Nonparametric approaches
Nonparametric approaches to SM do not explicitly refer to a model. As in the parametric case, there are two settings in nonparametric approaches, namely macro and micro.

In nonparametric macro approaches, as mentioned before, the goal is to estimate the joint distribution of X, Y and Z. There are a number of nonparametric approaches that can be used in this sense. Three approaches mentioned in the literature are the empirical cumulative distribution function, the kernel density estimator, and k Nearest Neighbours (kNN). For more details, see [44,45]. These three methods focus on the distribution of (X, Y, Z), either jointly using the cumulative distribution function, or as a product of conditional and marginal distributions. However, one may be interested in certain characteristics of this distribution. The conditional expectations of Z given X, and of Y given X, are considered among the most important characteristics of interest. These can be derived using a nonparametric regression function. Nonparametric regression describes the relation between a response variable and one or more predictors without assuming a functional form. Nonparametric regression techniques

include kernel smoothing, local polynomial regression, spline-based regression models, and regression trees. All these nonparametric techniques result in finding the conditional expectation of Z given X or of Y given X, which is the most important characteristic of interest (for details see [46]).

In nonparametric micro approaches, a complete dataset for files A and B can be obtained without considering any specific parametric distribution for the variables X, Y and Z. In a nonparametric micro setting, this can be done by applying nonparametric imputation procedures, known as hot deck imputation methods. Such procedures replace missing values in file A by values that are found in file B based on some matching criteria, and vice versa. The advantage of using hot deck imputation procedures is that they do not need any specification of a family of distributions. Also, there is no need to estimate the distribution function or any of its characteristics. Most of these procedures are based on filling in the dataset chosen as the "recipient" with the values of the variable which is available only in the other dataset, the "donor" file. In the following, we suppose that file A is the recipient whereas file B is the donor, such that the sample size of the recipient file n_A is less than the sample size of the donor file n_B, i.e. n_A < n_B. [42] stated three hot deck methods that are used in SM: distance hot deck, random hot deck and rank hot deck. Such methods may be regarded as the nonparametric counterparts of the micro parametric approaches presented above.

Random hot deck is performed by selecting a donor record in file B at random; the value of Z observed in file B is assigned to the missing Z in the recipient file A, and this is done for each record in file A [42,47,48]. A necessary step when making the random selection is that the units in files A and B are grouped into donation classes. Donation classes are defined according to the values of one or more categorical variables chosen within the common set of variables in the datasets A and B (for instance gender, region, etc.). Once the files are grouped into homogeneous subsets, those subsets are referred to as donation classes. Then, for a given record in file A belonging to a given group (say males), a donor record in file B is randomly chosen from within the same group (males in B).

Distance hot deck, which is widely used in the imputation of missing data, is performed by choosing the value of Z observed in file B to be assigned to a recipient record in file A based on the closest distance, according to a set of common variables X, after splitting the units in files A and B into donation classes.

Table 2
Data of file A and file B

File A
Unit  Weight wA  Gender X1A  Age X2A  Profit Y
A1    3          1           42       9.156
A2    3          1           35       9.149
A3    3          0           63       9.287
A4    3          1           55       9.512
A5    3          0           28       8.484
A6    3          0           53       8.891
A7    3          0           22       8.425
A8    3          1           25       8.867

File B
Unit  Weight wB  Gender X1B  Age X2B  Income Z
B1    4          0           33       6.932
B2    4          1           52       5.524
B3    4          1           28       4.224
B4    4          0           59       6.147
B5    4          1           41       7.243
B6    4          0           45       3.230

Source: [13,49]. Gender: 1 = male, 0 = female. The population size is 24; wA = 24/8 = 3, wB = 24/6 = 4.

For the simple case of a single continuous variable X, the distance hot deck method is based on measuring the absolute difference between a record in file A and another in file B as

d_{ab*} = |x^A_a − x^B_{b*}| = min_{1≤b≤n_B} |x^A_a − x^B_b|,  (6)

where b* is the unit in file B achieving the minimum distance to x^A_a. In the case of more than one variable X, this distance may be measured, for example, by the Mahalanobis distance.

Distance hot deck can be performed using a constrained or an unconstrained approach, both introduced by [49]. The unconstrained approach has the objective of finding the case in the target file that is most similar to each case in the other file. Suppose we have file A containing the variables gender X1, age X2 and profit Y, whereas file B contains gender X1, age X2 and income Z (see Table 2). Files A and B have no units in common, but they have the variables gender and age in common. A matched file A with variables X1, X2, Y and Z is created using the unconstrained method by finding, for each A unit, the same-gender B unit that is closest in age (e.g., A1 is male and aged 42, and the best matching B unit is B5, who is male and aged 41; B5 is also the closest match to A2). The resulting matched file A is given in Table 3. To create a matched file B, one can use the same procedure to obtain matches for the B units from file A. On the other hand, the constrained distance hot deck approach starts by making an exploded file for both files A and B. The exploded file depends on sampling weights wA and wB,

that – assuming that datasets A and B have both been obtained by simple random sampling from the population – are calculated by dividing the population size by the number of units in each file. Then, every unit in files A and B is duplicated wA and wB times respectively, so that the exploded files A and B both reach the population size N. The units in the exploded file A and the exploded file B are then reordered according to the common variables X1 and X2. This procedure is shown in detail in Table 4. Each unit in the reordered exploded file A is matched to the corresponding unit in the reordered exploded file B, so as to minimize the distance, in the X space, between the exploded files A and B. For instance, unit A7 is matched three times with unit B1, unit A5 is matched once with unit B1 and twice with unit B6, and so on. The results of SM using the constrained method are summarised in Table 5.

Table 3
Statistical matching using the unconstrained approach

Unit  Gender X1A  Age X2A  Age X2B  Profit Y  Income Z
A1    1           42       41       9.156     7.243
A2    1           35       41       9.149     7.243
A3    0           63       59       9.287     6.147
A4    1           55       52       9.512     5.524
A5    0           28       33       8.484     6.932
A6    0           53       59       8.891     6.147
A7    0           22       33       8.425     6.932
A8    1           25       28       8.867     4.224

Source: [13,49].

Table 4
Reordered exploded file A and file B for the constrained approach

Unit i  Gender X1A  Age X2A    Unit i  Gender X1B  Age X2B
A7      0           22         B1      0           33
A7      0           22         B1      0           33
A7      0           22         B1      0           33
A5      0           28         B1      0           33
A5      0           28         B6      0           45
A5      0           28         B6      0           45
A6      0           53         B6      0           45
A6      0           53         B6      0           45
A6      0           53         B4      0           59
A3      0           63         B4      0           59
A3      0           63         B4      0           59
A3      0           63         B4      0           59
A8      1           25         B3      1           28
A8      1           25         B3      1           28
A8      1           25         B3      1           28
A2      1           35         B3      1           28
A2      1           35         B5      1           41
A2      1           35         B5      1           41
A1      1           42         B5      1           41
A1      1           42         B5      1           41
A1      1           42         B2      1           52
A4      1           55         B2      1           52
A4      1           55         B2      1           52
A4      1           55         B2      1           52

Source: [13,49].

Table 5
Statistical matching using the constrained approach

Units   Gender X1A  Age X2A  Age X2B  Profit Y  Income Z
A1, B2  1           42       52       9.156     5.524
A1, B5  1           42       41       9.156     7.243
A2, B3  1           35       28       9.149     4.224
A2, B5  1           35       41       9.149     7.243
A3, B4  0           63       59       9.287     6.147
A4, B2  1           55       52       9.512     5.524
A5, B1  0           28       33       8.484     6.932
A6, B4  0           53       59       8.891     6.147
A6, B6  0           53       45       8.891     3.230
A7, B1  0           22       33       8.425     6.932
A7, B6  0           22       45       8.425     3.230
A8, B3  1           25       28       8.867     4.224

Source: [13,49].
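Both variants can be sketched in code using the Table 2 data; with these data, the unconstrained sketch below reproduces the matches of Table 3, and the constrained pairing reproduces Table 4. This is an illustrative sketch, not the implementation used in [13,49]:

```python
# Files A and B from Table 2: (unit, gender, age, Y-or-Z value).
file_A = [("A1", 1, 42, 9.156), ("A2", 1, 35, 9.149), ("A3", 0, 63, 9.287),
          ("A4", 1, 55, 9.512), ("A5", 0, 28, 8.484), ("A6", 0, 53, 8.891),
          ("A7", 0, 22, 8.425), ("A8", 1, 25, 8.867)]
file_B = [("B1", 0, 33, 6.932), ("B2", 1, 52, 5.524), ("B3", 1, 28, 4.224),
          ("B4", 0, 59, 6.147), ("B5", 1, 41, 7.243), ("B6", 0, 45, 3.230)]

# Unconstrained distance hot deck: for each recipient record, take the
# closest same-gender donor (gender is the donation class, age is X).
matched_A = []
for unit, gender, age, profit in file_A:
    donors = [d for d in file_B if d[1] == gender]
    best = min(donors, key=lambda d: abs(d[2] - age))   # Eq. (6)
    matched_A.append((unit, gender, age, best[2], profit, best[3]))

# Constrained approach: duplicate each unit by its weight (wA=3, wB=4),
# sort each exploded file by donation class and age, then pair row-wise.
exploded_A = sorted([a for a in file_A for _ in range(3)],
                    key=lambda r: (r[1], r[2]))
exploded_B = sorted([b for b in file_B for _ in range(4)],
                    key=lambda r: (r[1], r[2]))
constrained_pairs = list(zip(exploded_A, exploded_B))
```

The first three constrained pairs are (A7, B1), matching the top rows of Table 4, and each donor is used exactly its weight's worth of times, which is the defining property of the constrained approach.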
Rank hot deck, first introduced by [50], is performed by separately ranking the units with respect to the X values in both files. Files A and B are matched in this method by associating the records with the same rank. If files A and B contain a different number of records, the matching is done using the empirical cumulative distribution function of X. This cumulative distribution is calculated for the two files as follows.

In the recipient file: F̂^A_X(x) = (1/n_A) Σ_{a=1}^{n_A} I(x_a ≤ x), x ∈ X, where I is the indicator function taking the value 1 when x_a ≤ x, and 0 otherwise.

In the donor file: F̂^B_X(x) = (1/n_B) Σ_{b=1}^{n_B} I(x_b ≤ x), x ∈ X, where I is the indicator function taking the value 1 when x_b ≤ x, and 0 otherwise.

After that, every a = 1, ..., n_A is linked with the record b* in file B according to the following condition:

|F̂^A_X(x_a) − F̂^B_X(x_{b*})| = min_{1≤b≤n_B} |F̂^A_X(x_a) − F̂^B_X(x_b)|.  (7)

Then, again as in distance hot deck, the value of Z observed for the donor unit in file B is assigned to the missing Z of the recipient unit in file A.

3.1.3. Mixed approaches
Mixed approaches in SM depend on predictive mean matching imputation [11]. These are two-step procedures which depend on a mixture of parametric and nonparametric methods. A mixed SM method consists of the following two steps:
1. A model is fitted and all its parameters are estimated.
2. A nonparametric method is used to create the complete dataset.
There exist various methods for mixed SM, depending on whether the variables are continuous or categorical.
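The two steps above can be sketched for continuous variables: a regression of Z on X fitted on file B, followed by a distance hot deck on the predicted values. This is an illustrative sketch with simulated data, not the procedure of any specific paper cited here:

```python
import numpy as np

rng = np.random.default_rng(5)

# File A observes (X, Y); file B observes (X, Z).
n_A, n_B = 50, 80
x_A = rng.normal(0, 1, n_A)
x_B = rng.normal(0, 1, n_B)
z_B = 1.0 + 2.0 * x_B + rng.normal(0, 0.5, n_B)

# Step 1 (parametric): regression of Z on X estimated from file B.
b_ZX = np.cov(x_B, z_B, ddof=1)[0, 1] / np.var(x_B, ddof=1)
a_Z = z_B.mean() - b_ZX * x_B.mean()
z_primary = a_Z + b_ZX * x_A        # primary (intermediate) values

# Step 2 (nonparametric): for each A record, assign the live value z_b
# of the donor in B whose observed Z is closest to the primary value.
donor_idx = np.abs(z_primary[:, None] - z_B[None, :]).argmin(axis=1)
z_live = z_B[donor_idx]
```

Because the imputed values are live donor values rather than model predictions, the completed file only contains Z values that were actually observed, while the parametric step steers each recipient towards a plausible donor.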

In the case of mixed approaches under the CIA for continuous variables X, Y and Z, the steps of the SM procedure are as follows:
1. Estimate the regression parameters of Z on X from file B.
2. For each unit a = 1, ..., n_A, a primary value ẑ^(A)_primary is imputed in file A using the estimated regression function from the first step.
3. For each unit a = 1, ..., n_A, a live value z^(B)_live from file B is assigned to the a-th record in file A using a convenient distance hot deck, considering the primary value ẑ^(A)_primary.
In the same manner, Y is estimated (using data from file A) and imputed in file B.

Using the above steps, similar procedures were proposed by [11,34,43,51,52]. The main difference between their procedures lies in the parameter estimates they use. They also differ in the matching step: the unconstrained or constrained approach. Another difference is that Rubin's procedure suggests concatenating the resulting statistically matched files and assigning the weight (w_A^{-1} + w_B^{-1})^{-1} to each unit, where w_A and w_B are the weights corresponding to files A and B, respectively.

Analogous to SM for continuous variables, two steps are applied within the mixed approach when categorical variables are considered. These are the loglinear regression model step, and the matching step. For more details about this method, see [42].

3.1.4. Bayesian approaches
[13] presented three procedures for SM using Bayesian approaches. First, the author followed [52] by presenting a frequentist regression imputation method with a random residual, known as RIEPS. This procedure is based on drawing the missing values from their conditional predictive distribution, which is equivalent to the imputed values in Eqs (2) and (3), but after considering the residual in the imputation process. Then, [13] presented the Bayesian version of RIEPS: a Non-Iterative Bayesian-based Imputation Procedure (NIBAS) for the univariate and multivariate cases. The format of the procedure is the same for both the univariate and multivariate cases; the only difference lies in the mathematical formulation of the matrices that are used, and in the posterior distributions of the variables. In this approach, instead of estimating the parameters as in the above approaches, she performed random draws for them using a Bayesian version. [13] first assumed a general linear model for both datasets, as suggested by Rubin's procedure, and then derived different posterior distributions for the parameters. Accordingly, she proposed the following MI algorithm based on a Bayesian approach, consisting of the following steps:
1. Perform a regression for each dataset and compute the OLS estimates β̂_YX and β̂_ZX.
2. Calculate each sum of squared residual errors: (z − X_B β̂_ZX)'(z − X_B β̂_ZX) and (y − X_A β̂_YX)'(y − X_A β̂_YX), where y = (y_1, y_2, ..., y_{n_A})' and z = (z_1, z_2, ..., z_{n_B})'.
3. Choose a value for ρ_{ZY|X} from its prior, or from a set of arbitrary levels.
4. Conduct random draws for the parameters from their observed-data posterior distribution.

In a nutshell, this procedure depends on random draws for the parameters, instead of estimating them from the data. This algorithm is repeated m times, which yields m imputed datasets. Then, the results are combined. For more details, see [13].

The second method presented by [13] was the Normal Approach (NORM). This method is mainly based on the data augmentation algorithm, in the case of a normal model. The normal data model is used because it is the most flexible in terms of the number of variables included in any realistic matching task. NORM uses a data model much like NIBAS. The difference between NIBAS and NORM is that NORM uses the parametric data model in the case of complete data, whereas NIBAS uses the observed-data posterior distribution. Since the NORM method needs complete data, an algorithm that handles the missing data is needed, namely data augmentation. The data augmentation algorithm can be viewed as a Bayesian version of the EM algorithm for dealing with missing data [53].

The third method proposed by [13] was MICE, an abbreviation for Multiple Imputation Chained Equations. MICE, a common modification of MI, is an iterative method that considers the imputation problem as a set of estimations where each variable takes its turn in being regressed on the other variables. MICE, first introduced by [54], is a flexible method that handles various types of variables, since each variable is imputed using its own imputation model [55]. This procedure is also called regression switching, chained equations, or variable-by-variable Gibbs sampling [54]. [13] employed MICE within a Bayesian estimation framework as a tool for SM. For more details about MICE, see [13,54]. [56] compared, through a simulation study, the different Bayesian approaches described above. Using another simulation study, [57] found that the MI approaches are superior to the traditional SM procedures.
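The chained-equations idea behind MICE, in which each incomplete variable is regressed in turn on the others and its missing entries are refilled from that model, can be sketched as follows. This is a minimal deterministic illustration on a toy dataset with scattered missing values; it omits the posterior draws that full MICE adds, and it is not the implementation used in [13]:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: three correlated variables with scattered missing values.
n = 300
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)
z = -0.5 + x + 0.5 * y + rng.normal(0, 0.5, n)
data = np.column_stack([x, y, z])
mask = rng.random(data.shape) < 0.2          # True marks a missing cell
data_obs = np.where(mask, np.nan, data)

# Start from mean imputation, then iterate: regress each variable on the
# others over its observed rows, and refill its missing entries.
filled = np.where(mask, np.nanmean(data_obs, axis=0), data_obs)
for _ in range(10):                          # chained-equations sweeps
    for j in range(3):
        others = np.delete(filled, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        obs = ~mask[:, j]
        coef, *_ = np.linalg.lstsq(design[obs], filled[obs, j], rcond=None)
        filled[~obs, j] = (design @ coef)[~obs]   # observed cells are kept
```

Each sweep uses the current imputations of the other variables as predictors, so the fills are refined iteratively; full MICE would additionally draw the imputations (and, in a Bayesian set-up, the parameters) rather than plugging in predictions.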

3.2. Statistical matching methods with auxiliary information

The problem with the conditional independence assumption of the variables Y and Z given X is that it may or may not hold. Moreover, it is not a testable assumption. Thus, the use of existing auxiliary information can be very helpful. Auxiliary information can take one of the following two forms:

– The existence of a third data source C that contains the variables (X, Y, Z) or just (Y, Z).
– Parameter estimates of the joint distribution of (Y, Z) or the conditional distribution (Y, Z)|X; for example, the covariance σYZ, the correlation coefficient ρYZ or the partial correlation coefficient ρYZ|X, a contingency table of Y × Z, etc.

We will follow the same structure as that given for CIA to present the different approaches for SM with auxiliary information, as proposed by [11].

3.2.1. Parametric approaches
In case of a parametric macro approach, the auxiliary information may be an additional data source or an external estimate. In case of having an additional data source C, containing nC i.i.d. observations, parameters can be estimated by concatenating all the data sources (A ∪ B ∪ C) and then exploiting all the available information. Using external estimates seems to be quite difficult, as it requires validation of some compatibility assumptions that are not always straightforward to check or easy to verify.

In case of a parametric micro approach with external information, a model is assumed for the joint distribution of X, Y and Z. After that, the parameters are estimated by exploiting all the available information (data sources and auxiliary information). When parameters can be estimated uniquely, as in Section 3.1.1, the estimates are used to generate a complete data source using conditional mean matching or draws from the predicted distribution [42,52]. The additional information only affects the estimation stage; no change happens in the next step when generating the synthetic dataset. Another difference occurs in the equations for conditional mean matching and draws from the predicted distribution. Under CIA, Eqs (2) and (3) in Section 3.1.1 were reduced versions, in which the terms β̂ZY ya and β̂YZ zb, respectively, were omitted from the following more general case with auxiliary information:

ẑ(A)Primary = α̂Z + β̂ZX xa + β̂ZY ya,  a = 1, . . . , nA,  (8)

ŷ(B)Primary = α̂Y + β̂YX xb + β̂YZ zb,  b = 1, . . . , nB.  (9)

Analogous to Section 3.1.1 under CIA, we get the equations for the method of draws from the predicted distribution by adding an error term as follows:

ẑ(A)Primary = α̂Z + β̂ZX xa + β̂ZY ya + εa,  a = 1, . . . , nA,  (10)

ŷ(B)Primary = α̂Y + β̂YX xb + β̂YZ zb + εb,  b = 1, . . . , nB,  (11)

where εa is a random residual drawn from N(0, σ̂²Z|XY) and εb is a random residual drawn from N(0, σ̂²Y|XZ). The regression coefficients β̂YX and β̂ZX are obtained from files A and B respectively, whereas β̂ZY and β̂YZ are obtained from file C.

3.2.2. Nonparametric approaches
In case of a nonparametric macro setting, the only type of auxiliary information available is that from an additional data source C. There are two cases for auxiliary information available from an additional data source C. First, when all variables are available in C, the marginal X distribution can be estimated either using kernels or k Nearest Neighbour methods depending on the whole sample A ∪ B ∪ C. For more on kernel estimators, see [58–60]. The second case is when only Y and Z are available in C, with X missing. Although nonparametric methods should be used when we have a complete data source, i.e. a file C containing (X, Y, Z), [61] confirm the possibility of using kernel estimators of distribution functions when the data source is not complete, i.e. a file C containing only Y and Z. In this case, [40] proposes using k Nearest Neighbour in the EM algorithm for continuous variables, or the counterpart of EM, the Iterative Proportional Fitting (IPF) algorithm, in case of categorical variables.

In case of a nonparametric micro setting, auxiliary information is solely represented by an additional data source C. This procedure is presented by [40] in two steps as follows:

Step 1: Impute a live z value z(C)live using file C as a donor. This live value, which is observed in file C, is then assigned to the ath record in file A according to the nearest distance:
– dac((x(A)observed, y(A)observed), (x(C)observed, y(C)observed)), if C contains X, Y and Z;
– dac(y(A)observed, y(C)observed), if C contains Y and Z only.
Step 2: The final live z value z(B)live is imputed using file B as a donor, then it is assigned to the ath record in file A with distance dab((x(A)observed, z(C)live), (x(B)observed, z(B)observed)).

In particular, if C is a sample that is large enough and the information it provides can be considered "reliable", then the SM procedure can be limited to Step 1 outlined above; that is, imputing Z in A using C as a donor.

An alternative procedure based on loglinear models, called categorical restriction approaches, was introduced by [42].

3.2.3. Mixed approaches
Analogous to CIA in Section 3.1.3, mixed approaches contain the same two steps in case of auxiliary information. The first step is to estimate the model parameters, followed by a second step where a nonparametric method is used.

The same procedure outlined in Section 3.1.3 for SM under CIA can be used here with continuous variables. The only difference here is in the way of estimation, as information about parameter estimates is available, either from an additional data source C or from an external estimate (i.e. information about ρYZ|X). In this context, SM takes place using two steps:

Step 1: The mixed procedure remains unchanged, focusing on imputing primary values using the same regression models as in Eqs (10) and (11).
Step 2: Final nonparametric imputation: a live z value z(B)live, that is observed in file B, is then assigned to each ath record in file A, corresponding to the closest Mahalanobis distance computed using observed and primary imputed values for Y and Z, namely d[(y(A)Observed, ẑ(A)Primary), (ŷ(B)Primary, z(B)Observed)].

[34,51] proposed a similar procedure that permits the incorporation of additional information represented by the correlation coefficient ρ*YZ, where the parameters of the normal distribution are estimated by their sample counterparts.

In case of categorical variables, a number of mixed methods are introduced in [42]. For more details, see [11].

3.2.4. Bayesian approaches
The Bayesian approach is especially useful when dealing with auxiliary information. [13] considered a Bayesian approach for SM when an additional data source C is available. This additional dataset C can be used for estimating the conditional association of Y and Z; that is, to use the value ρ*YZ|X estimated from file C as the prior conditional correlation, instead of setting a prior value for ρYZ|X as in Section 3.1.4. After that, the algorithm will perform as outlined in Section 3.1.4, exploiting all the available information. Therefore, the posterior distribution of all the parameters is changed according to ρ*YZ|X.

3.3. Uncertainty analysis

In practice, CIA rarely holds true, and results relying on CIA may be misleading. If CIA is not valid, bias will exist among the variables of interest after applying the SM process. This issue is solvable with the help of auxiliary information when available [42,62]. When CIA is not valid and auxiliary information is not available, which is often the case, uncertainty analysis may be performed.

Statistical matching is essentially related to an identification problem concerning the association of variables not found together in the same dataset. As known, CIA cannot be validated from the observed data. Depending on the explanatory power of the common variables X, a smaller or wider range of admissible values of the unconditional association of Y and Z can then be estimated. In official statistics this kind of approach is named uncertainty analysis [63]. In other application domains it takes other names; for instance, it is known as partial identification in econometrics and the social sciences [64–67].

Uncertainty in SM can be viewed as a special case of an estimation problem, as there is a kind of uncertainty about the joint model of (X, Y, Z). In case of continuous data, uncertainty is quantified by taking the range of an association parameter (e.g. the correlation coefficient) between the variables that are not jointly observed. Whereas, in case of categorical data (e.g. k-way contingency tables), upper and lower bounds are set on cell counts. Basically, uncertainty analysis is concerned with identifying which values of the target parameters are compatible with the available data. To know more about uncertainty and the different methods released for uncertainty, see [57,68–76].

Uncertainty analysis can be performed in two settings, either parametric or nonparametric. In the parametric setting, the result of the identification problem is handled by considering ranges of plausible values of the missing records, extracted from models fitted to the available sample information. Then, intervals are defined by these ranges and are known as uncertainty intervals [13,34,43,52]. In the nonparametric setting, uncertainty for SM is considered in [69,76,77]. Uncertainty in a nonparametric setting is still described by a class of models, or specifically, by a class of distributions for (X, Y, Z). If compared to the parametric case (either multinormal or multinomial), there are two main sources of trouble. First, since the class of distributions for (X, Y, Z) is not identified by a finite number of parameters, we need a technical tool to describe them. Second, in order to measure and quantify uncertainty, we need to "summarise" the class of all possible distributions for (X, Y, Z) with a single number. An even more important problem is the quantification of the uncertainty in the presence of auxiliary information on the model [78,79].

Recently, [80] performed a unique analysis to choose the matching variables by searching for the common variables X that are the most effective in reducing the uncertainty between Y and Z. Also, it is taken into consideration not to select too many common variables, to avoid a large number of parameters to estimate. Other measures of uncertainty are considered in [81,82]. [35] discussed the uncertainty in SM under informative sampling designs, considering the three-variate normal case. [83] proposed the use of graphical models to deal with the SM uncertainty for multivariate categorical variables.

4. Developments in statistical matching methods

In addition to the traditional SM techniques outlined in Section 3, there are a number of alternative methods that have been introduced in the literature. This section will present an overview of some of these methods.

4.1. Propensity scores method

[84,85] were among the first who proposed the use of propensity score matching. Propensity scores are predicted values from a logistic regression, where the effect of a certain group or treatment is estimated, accounting for covariates. The matching process is done based on these scores. The main advantage of using this method in SM is that it reduces the matching to one constructed propensity score, which is a very great advantage, especially when there is a large number of common variables X. The methodology of constrained statistical matching using estimated propensity scores was developed by [86] to produce the required synthetic data sets. [87,88] outlined three steps for propensity score matching:

1. Data preparation and harmonization step: The harmonization step involves identification of all common variables X shared by both datasets A and B, before data integration takes place. In the harmonization step, common variables in the two datasets should have the same categories. If two similar variables cannot be harmonized, they have to be discarded. Also, the common variables X should not include missing values. If A and B are representative samples of the same population, the common variables in the two datasets should have the same marginal/joint distribution.
2. Weight adjustments step: In this step, the sum of the attached weights for records (weighted population totals) in the donor file is adjusted to make them comparable with those in the recipient file.
3. Estimation of the propensity scores step: In this step, estimation of the propensity scores for the matching process is done by creating an outcome variable R, where R = 1 for all records in the recipient file and R = 0 for all records in the donor file. The two files are joined by stacking their records. A logistic regression model is then performed, in which R is the dependent variable and the independent variables are the selected common variables X. The common variables X are selected carefully in the logistic regression model for measuring propensity scores to maximize explanatory power. This is because the validity of propensity score statistical matching relies on the power of the common variables to act as good predictors that can be transformed into effective propensity scores. The propensity score is defined as

e(zi) = P(R = 1|X = xi) = f(x′i β),  (12)

the conditional probability of unit i to belong to a certain group (R = 1) given the covariates (X = x). The estimated propensity score is defined as

ê(zi) = f(x′i β̂) = 1 / (1 + e^(−x′i β̂)).  (13)

The individual propensity scores ê(zi) are the predicted values from the logistic regression output for β. Having calculated the propensity scores, all units in each file are sorted in ascending order according to their respective propensity scores ê(zi). For the two data sets sorted with respect to propensity scores, a cumulative weight variable is created. This cumulative weight is then sorted in descending order. Under this sorting scheme, units with larger weights in the donor file are assigned to multiple units in the recipient file until all of their weight is used up. If not all of the units in both files are matched, additional estimation of propensity scores is performed.
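The three steps above can be sketched as follows. This is a simplified, unweighted illustration, not the constrained procedure of [86] or the exact scheme of [87,88]: plain rank-by-rank pairing stands in for the cumulative-weight assignment, the logistic regression of Eq. (13) is fitted by a hand-rolled Newton-Raphson, and the file sizes and covariates are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Common variables X in a recipient file (R = 1) and a donor file (R = 0).
xa = rng.normal(0.2, 1.0, size=(300, 2))         # recipient file
xb = rng.normal(0.0, 1.0, size=(300, 2))         # donor file
X = np.vstack([xa, xb])
R = np.r_[np.ones(len(xa)), np.zeros(len(xb))]   # stacked file indicator

# Logistic regression of R on X by Newton-Raphson, as in Eqs (12)-(13).
F = np.column_stack([np.ones(len(X)), X])        # add an intercept column
beta = np.zeros(F.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-F @ beta))          # current fitted e-hat(z_i)
    grad = F.T @ (R - p)                         # score vector
    hess = F.T @ (F * (p * (1.0 - p))[:, None])  # observed information
    beta += np.linalg.solve(hess, grad)

score = 1.0 / (1.0 + np.exp(-F @ beta))          # estimated propensity scores
sa, sb = score[R == 1], score[R == 0]

# Match: sort both files by propensity score and pair them rank by rank
# (an unweighted stand-in for the cumulative-weight scheme in the text).
order_a, order_b = np.argsort(sa), np.argsort(sb)
donor_for = np.empty(len(sa), dtype=int)
donor_for[order_a] = order_b                     # recipient i -> donor index

gap = np.abs(sa - sb[donor_for]).mean()          # mean score gap of the pairs
```

In a survey setting the pairing loop would instead walk down the two score-sorted files consuming the donors' weights, so a large-weight donor serves several recipients before the next donor is used.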
On the contrary, [89] opposed using propensity scores for SM, justifying their opinion by stating that the procedure ignores the measurements of the matching variables X, and this may lead to increased bias. They claimed that the measurement of matching variables is less transparent with propensity score matching than with other methods of matching. Therefore, they suggested a set of precautions for using propensity score matching, to avoid as much as possible the imbalance of matching resulting from this method.

4.2. Statistical matching using fractional imputation method

[90,91] proposed fractional imputation, a comparatively recent form of imputation for handling missing data. Fractional imputation can be handled in two settings, either parametric or nonparametric. The parametric approach for generating imputed values is performed using the EM algorithm, which is a popular tool for finding the MLE of the parameters of the model. In fractional imputation, the E-step can be approximated by the weighted mean of the imputed data likelihood, where the fractional weights are computed from the current value of the parameter estimates. Using fractional imputation, numerous imputed values with fractional weights are generated for each unit containing missing data. Each fractional weight indicates the conditional probability of the imputed value given the observed data. For instance, the following two steps should be considered if we would like to generate the missing z in file A from the conditional distribution of z given the observed data, i.e. f(z|x, y) ∝ f(y|x, z)f(z|x). First, generate z* using the estimated f̂a(z|x) from file A under the conditional independence assumption. Next, accept z* if f(y|x, z*) is sufficiently large. Imputed values in f(z|x, y) ∝ f(y|x, z)f(z|x) can be generated by parametric fractional imputation via the EM algorithm as follows:

1. For each i ∈ A, generate m imputed values z*1, . . . , z*m from f̂b(z|xi), where f̂b(z|x) is the estimated conditional distribution of z given x from file B.
2. A fractional weight is assigned to the jth imputed value z*ij using

w*ij(t) ∝ f(yi|xi, z*ij; Ψ̂t),

where Σ(j=1,…,m) w*ij = 1, Ψt is the parameter(s) defining the conditional distribution of yt given y1, . . . , yt−1, and Ψ̂t is its current estimated value.
3. Solve the fractionally imputed score equation for Ψ to obtain Ψ̂t+1:

Σ(i∈A) Σ(j=1,…,m) wia w*ij(t) S(Ψ; xi, z*ij, yi) = 0,

where S(Ψ; x, z, y) = ∂ log f(y|x, z; Ψ)/∂Ψ and wia is the sampling weight of unit i in file A.
4. Repeat steps 2 and 3 till convergence.

For the nonparametric setting, fractional hot deck imputation is considered; that is, a mixed idea of fractional imputation and the hot deck procedure. Instead of generating z*ij from f̂a(z|x), which is a parametric setting, hot deck fractional imputation, which is a nonparametric approach, can be used. In this approach, all observed values of zi in file B are considered as imputed values and the fractional weights in Step 2 are obtained by

w*ij(Ψ̂t) ∝ w*ij0 f(yi|xi, z*ij; Ψ̂t),  (14)

where w*ij0 = f̂b(zj|xi) / Σ(k∈A) wka f̂b(zj|xk).

4.3. Statistical matching based on statistical learning

Statistical Learning (SL) refers to a wide set of classification/regression techniques that "learn from data". These are typically algorithm-based techniques that do not assume a statistical model. Statistical learning techniques include methods such as classification and regression trees (CART), random forests, etc., rather than methods based on fitting stochastic models. Statistical learning techniques can be used in statistical matching to replace methods that use model predictions. Some techniques, like nearest neighbour distance hot deck, are special cases of a well-known statistical learning technique called kNN. For more details about SL techniques, see [92–95]. [11] were the first to use SL techniques in SM. They showed how SL can be used to pick a subset of matching variables prior to the matching step. Even when measuring the distance between the observations in A and those in B, SL techniques can be useful when employing the nearest neighbour distance hot deck [96]. The use of SL in SM takes it a bit further by incorporating other current classification or regression approaches into SM. [97] showed how popular SL techniques can be beneficial for matching purposes. Two different ideas are looked at: (i) integrating datasets; (ii) assessing the uncertainty. The characteristics of these approaches are investigated by a simulation study and an application on real survey data. The results obtained in his paper are promising, showing that certain SL approaches can be very successful in using the information given by already available survey data, allowing for a reduction in the uncertainty when using traditional SM techniques. Moreover, [80] compared the SL techniques to traditional hot deck methods, to avoid consuming time in selecting the matching variables that are typically needed in hot deck and to get the distances between units in file A and potential donors in file B. [80] showed the superiority of SL techniques over traditional hot deck methods in two aspects: first, the reduction of time consumed in selecting the matching variables typically needed in hot deck, and second, in calculating distances between units in file A and potential donors in file B.

4.4. Statistical matching using multinomial logistic regression models

[36] proposed several mixed procedures based on multinomial logistic regression models depending on CIA and auxiliary information. First, they proposed two mixed methods for SM without auxiliary information, assuming conditional independence. The first approach utilizing multinomial logistic regression is a mixed method that uses distance hot deck imputation. It is based on the following three steps:

1. The probability of a unit being in the jth category, Π̂(B)j, is estimated from all units in file B by fitting a multinomial logistic regression model in which the common variables X are the independent variables and the observed variable Z in file B is the dependent variable.
2. For each unit in file A, the probability of being in the jth category, Π(A)j, is estimated using the following equation:

Π̂(A)j = exp(β̂′j x) / (1 + Σ(i=1,…,J−1) exp(β̂′i x)),  j = 1, . . . , J − 1,

where β̂′j is obtained from the first step.
3. A distance hot deck approach is applied, in which a value of Z in file B is imputed for file A based on the nearest Euclidean distance dab(Π̂(A)j, Π̂(B)j), and is denoted by Ẑ.

The second mixed method, based on multinomial logistic regression models under CIA, uses a randomization mechanism, and involves the same first and second steps as the previous method. The only difference is in the third step, as Ẑ for file A is obtained by generating a multinomial random variable with probability Π̂(A)j.

They also extended their method to be used in the presence of auxiliary information. Four versions of this method were introduced in the presence of auxiliary information. The first version has three steps as follows:

1. For file C, fit a multinomial logistic regression model in which Y is the independent variable and the observed variable Z in file B is the dependent variable.
2. For file A, the estimates in the first step are used to get Π̂(A)j.
3. For file A, Ẑ is obtained by generating a multinomial random variable with probability Π̂(A)j.

A variation of the above method is to replace the first step by fitting a multinomial logistic regression model, in which X and Y are the independent variables and Z is the dependent variable, using file C.

The third variation of this method involves the following six steps:

1. For file C, fit a multinomial logistic regression model in which Z is the independent variable and the observed variable Y in file A is the dependent variable.
2. For file B, the estimates in the first step are used to get the probability Π̂(B)j.
3. For file B, Ŷ is obtained by generating a multinomial random variable with probability Π̂(B)j.
4. For file B, fit a multinomial logistic regression model in which X and Ŷ are the independent variables and the observed variable Z in file B is the dependent variable.
5. For file A, the estimates in the fourth step are used to get the probability Π̂(A)j.
6. For file A, Ẑ is obtained by generating a multinomial random variable with probability Π̂(A)j.

The first step can be replaced by fitting a multinomial logistic regression model in which X and Z are the independent variables and Y is the dependent variable, using file C, thus giving a fourth version of the proposed method.

The proposed methods were compared to a random hot deck procedure via a simulation study. The results were very similar in the case of CIA. However, the proposed methods exhibit better performance in the case of auxiliary information, especially with high levels of association between Y and Z. Also, the simulation results showed that there is no need to have a large size of file C; therefore the expense of overcoming the CIA would not be a major worry in practice.

4.5. Mixed integer linear programming procedure

A set of approaches have been raised recently to tackle the problem of incoherence arising when SM is considered. A number of authors including [98–101]
and [102] applied a mixed integer linear programming procedure, based on efficient L1 distance minimization, in the context of SM. [102] dealt with managing inconsistencies inside the SM framework when logical relations among the variables are present. Incoherence can arise in the probability evaluations. They used several advanced adjustment procedures to remove such incoherence. They applied these procedures to real data, and their study found some differences between these adjustment procedures.

Recently, [98,99] recommended applying a merging approach for jointly inconsistent probabilistic assessments to the SM problem. The merging technique is based on an efficient L1 distance minimization through mixed-integer linear programming, which results in the elicitation of imprecise (lower-upper) probability assessments that are not only feasible but also meaningful. They emphasized how their method is meaningful whenever there are structural zeros among the variables. Structural zeros are an important feature of survey data; they refer to the existence of impossible combinations of variables. For example, in the combination of the variables pregnancy status and gender, there should not exist a pregnant male. For a household survey, in the combination of the variables relationship and age, there should not exist a household where a son is older than his biological father. The presence of these structural zeros prevents the sure coherence of the merging of estimates coming from different sources of information. The importance of their approach seems to be apparent whenever there are logical (structural) constraints with varied sources of information.

4.6. Statistical matching using Bayesian networks

[103] described and discussed first attempts for SM of discrete data using Bayesian networks. In Bayesian networks, so-called (directed acyclic) graphs are used to model the data. Their micro matching approach has three steps: estimating and combining the (directed acyclic) graphs for datasets A and B, estimating the corresponding local parameters and combining them into the joint probability distribution, and imputing the missing values in A and B to obtain the integrated dataset.

Their first attempt at using Bayesian networks encouraged a further study into how probabilistic graphical models may be used for SM [104]. In their study, they considered not only discrete but also continuous and mixed variables for SM. Further research is also needed to see if using undirected probabilistic graphical models is more promising. Some of these new points of research for Bayesian networks are considered in [104]. [104] performed log-linear Markov networks under the assumption of conditional independence. Their approach visualizes dependencies among variables using undirected graphical models, and obtains a powerful factorization of their joint distribution. It is utilized to estimate the probability components of the joint distribution. They embedded the identification problem of SM into the theory of log-linear Markov networks and showed an exemplary implementation of their approach based on the German General Social Survey. The findings showed that their suggested SM approach can reconstruct the joint distribution reasonably well. These preliminary findings showed that their method yields good results. Small differences between the sample distribution and the distribution estimated using their SM procedure are particularly encouraging, because they avoided overconfidence by deliberately not selecting the specific and common variables based on previous association analysis.

Recently, [83] proposed the use of Bayesian networks to deal with the SM uncertainty for multivariate categorical variables. The motivation for using Bayesian networks in their work is that extra sample information on qualitative dependencies between the components of Y and Z can be utilized in SM. Such information can be used to factorize the joint probability mass function based on the conditional independencies implied by the graph. Extra sample information significantly increases the quality of SM. Because a smaller number of lower dimension parameters must be estimated, the representation of the joint probability distribution using local relationships simplifies both parameter estimation and SM evaluation in a multivariate context. Furthermore, the graphical model's modularity enables one to cope with the subgraphs produced by variables related to the same sample and the subgraphs produced by variables existing in different samples. Accordingly, parameters that are influenced by uncertainty are separated from those that may be directly estimated by the sample data. Computational complexity is thus limited to a subset of variables. A simulation study is conducted to evaluate the performance of their suggested approach, with and without auxiliary information, and to compare it with the saturated multinomial model in terms of uncertainty reduction. Based on a real application, the findings demonstrated the practicality of their suggested approach in real case situations.
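A toy numerical illustration shows why a graph factorization helps in SM. Under a graph with edges X → Y and X → Z only (the CIA structure), the joint pmf factorizes into low-dimensional tables, each estimable from the file in which its variables are jointly observed. The sketch below is only a minimal multinomial example of this factorization, not the procedures of [83,103,104]; the category counts and probabilities are invented.

```python
import numpy as np

rng = np.random.default_rng(2)

# Categorical toy population: X appears in both files, Y only in file A,
# Z only in file B. With edges X -> Y and X -> Z (no Y-Z edge), the joint
# pmf factorises as P(x, y, z) = P(x) P(y|x) P(z|x).
nx, ny, nz = 2, 3, 2
px = np.array([0.4, 0.6])
py_x = np.array([[0.2, 0.5, 0.3], [0.6, 0.2, 0.2]])   # row v: P(y | x = v)
pz_x = np.array([[0.7, 0.3], [0.3, 0.7]])             # row v: P(z | x = v)

def sample(n):
    x = rng.choice(nx, size=n, p=px)
    y = np.array([rng.choice(ny, p=py_x[v]) for v in x])
    z = np.array([rng.choice(nz, p=pz_x[v]) for v in x])
    return x, y, z

xa, ya, _ = sample(4000)     # file A observes (X, Y)
xb, _, zb = sample(4000)     # file B observes (X, Z)

# Estimate each local factor from the file where its variables appear.
px_hat = np.bincount(np.r_[xa, xb], minlength=nx) / (len(xa) + len(xb))
py_x_hat = np.array([np.bincount(ya[xa == v], minlength=ny) / (xa == v).sum()
                     for v in range(nx)])
pz_x_hat = np.array([np.bincount(zb[xb == v], minlength=nz) / (xb == v).sum()
                     for v in range(nx)])

# Combine the local tables into the full nx-by-ny-by-nz joint distribution.
joint_hat = px_hat[:, None, None] * py_x_hat[:, :, None] * pz_x_hat[:, None, :]
joint_true = px[:, None, None] * py_x[:, :, None] * pz_x[:, None, :]
```

Only the small tables P(x), P(y|x) and P(z|x) are ever estimated; the Y-Z association implied by joint_hat is the one induced through X, which is exactly the part of the model that remains uncertain when no auxiliary information on Y-Z dependence is available.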
5. Issues related to statistical matching

5.1. Methods for complex surveys

Complex surveys refer to surveys that involve sampling designs that are not based on simple random selection. These may include many sampling designs, such as probability proportional to size sampling, multistage sampling, cluster sampling, stratified sampling, etc. It is also very common to combine some of these designs according to the population under study or the survey objectives.

In general, samples selected using a complex survey design have many differences compared with simple random sampling. The secondary sampling units, in a multistage sampling design, may show unequal inclusion probabilities; units belonging to the same primary sampling unit are not independent and usually show an intraclass correlation. For more details about complex surveys, see [105].

In the literature, there are some methods raised in SM to deal with complex surveys. These include traditional micro approaches and SM methods that adjust for the sampling design, proposed by [52,106,107] and [108]. Traditional micro approaches for SM of data from complex sample surveys are usually performed using nonparametric micro methods such as nearest neighbour donor, rank or random hot deck. These methods neglect the sampling design and the weights of the units in the matching step. After the matching step and filling the missing values in the recipient file by using one of these traditional micro approaches, the sampling design for the recipient data is ready to be used to perform statistical analysis. [109] compared several traditional approaches for SM in the context of complex surveys. Rank and random hot deck approaches appeared to be efficient in terms of preserving the joint distribution X × Z in the integrated file A and the marginal distribution of the variable Z after integration. Whereas, the nearest neighbour donor works admirably only when constrained matching is used and a design variable is considered in forming donation classes.

For SM methods that explicitly consider the sampling design and the weights associated with it, there are Renssen's method [106], Rubin's file concatenation [52] and a method depending on empirical likelihood suggested by [107]. A comparison of Renssen's and Wu's approaches with Rubin's file concatenation procedure is found in [110]. In general, Renssen's method relies on calibration of the survey weights to achieve consistency between estimators derived from files A and B separately. Calibration is a common approach in sample surveys to derive new weights (for more details about calibration, see [111]). Rubin's procedure is the only method that takes the sampling design into consideration through the concatenation step. It gives weights for all units in the matched file. The general steps of Rubin's procedure are mentioned in Section 3.1.3. Whereas, [107] proposed his approach based on empirical likelihood. Unlike other methods, Wu's approach, which is used in the context of SM, guarantees positive adjusted weights as it is based on maximum likelihood. Also, his approach is consistent and efficient compared with other methods introduced for adjusted weights, like the methods proposed in [112,113]. Table 6 summarises the advantages and disadvantages of the SM methods for complex surveys proposed by [52,106,107], in addition to traditional micro approaches.

Recently, [108] dealt with the problem of SM in the case of complex sample surveys using a nonparametric setting. They proposed to use an iterative proportional fitting algorithm for estimating the distribution function of variables that are not found together, and demonstrated how to assess its reliability.

5.2. Practical situations within SM context

One of the situations where SM may need to be applied is the Split Questionnaire Design (SQD). The aim of SQD is to reduce the burden on the respondent by shortening questionnaires. Since long questionnaires deter prospective respondents and cause high nonresponse rates, split questionnaires are used as an option to reduce the pressure on respondents [114–116]. Also, long questionnaires decrease the quality of survey responses and, hence, lead to low accuracy [117]. To increase the quality of survey responses, SQD can be used [118]. [12] discussed an SQD in which a sample m is selected randomly from the target population. The sample is then split into two independent random samples ma and mb such that ma ∪ mb = m and ma ∩ mb = ∅. Accordingly, there are common variables that appear in the two samples and there exist distinct variables that appear exclusively in either one of the two independent samples. This is a clear situation that requires SM. The SQD yields effects that are close to those of a complete questionnaire, with one exception, that the power is reduced since the number of
Renssen’s approach relies on a series of calibration’s observations for each variable is reduced [114]. [119]
weights, that are applied in the two datasets, to achieve proposed a matching method to be applied for a SQD,
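The random split underlying the SQD described above can be sketched as follows; this is a minimal illustration, and the sample size, the respondent IDs and the even split between ma and mb are hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample m of 100 respondent IDs drawn from the target population.
m = np.arange(100)

# Split m at random into two disjoint subsamples ma and mb, so that
# ma ∪ mb = m and ma ∩ mb = ∅, as in the SQD described above.
perm = rng.permutation(m)
ma, mb = np.sort(perm[:50]), np.sort(perm[50:])

# ma answers the common variables plus the Y block and mb answers the
# common variables plus the Z block, so Y and Z are never jointly observed.
assert len(np.intersect1d(ma, mb)) == 0       # ma ∩ mb = ∅
assert np.array_equal(np.union1d(ma, mb), m)  # ma ∪ mb = m
```

Because the two subsamples share only the common variables, an integrated file for (Y, Z) can then only be produced through SM.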
Table 6
Comparison between different methods of SM in complex surveys

Traditional approaches
– Advantages: very simple and flexible to perform.
– Disadvantages: selection of the matching variables is essential; theoretical and practical implications have not been investigated; for estimating characteristics relating to Y, Z and their relationship, methods to cope with missing values must be considered.

Rubin's file concatenation
– Advantages: extremely simple in practice (only the concatenated file is used); more accurate for estimates of the X variables.
– Disadvantages: the theory is developed to concatenate the theoretic samples, but in practice the available data refer to the respondents to a survey (final weights incorporate corrections for unit nonresponse, coverage, . . . ).

Renssen's approach
– Advantages: after harmonization, the parameters linked to (X, Y) on A, as well as (X, Z) on B, can be directly estimated; facilitates the incorporation of future auxiliary data sources (file C) into the estimation process; offers micro imputations with a distribution that is usually coherent with the one in the donor file.
– Disadvantages: calibration may fail; managing continuous and categorical variables is complex; imputed micro values for categorical variables are estimated probabilities, which may be negative or greater than one.

Wu's approach
– Advantages: more adaptable than Renssen's approach (no negative weights; handles mixed-type variables); provides a complete framework for inference from complex sample survey data.
– Disadvantages: the theory is more complex, and considerable additional complexity is required to deal with nonproportional stratified sampling; the methods allow combining theoretic samples, but unit nonresponse is difficult to handle.
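As a concrete illustration of the traditional micro approaches compared in Table 6, the sketch below performs unconstrained nearest neighbour donor hot deck matching, ignoring the survey weights exactly as those approaches do; the two files, the variables and the sample sizes are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated recipient file A observing (X, Y) and donor file B observing (X, Z).
x_a = rng.normal(size=200)              # matching variable X in file A
x_b = rng.normal(size=300)              # matching variable X in file B
z_b = 2.0 * x_b + rng.normal(size=300)  # target variable Z, observed only in B

# Unconstrained nearest neighbour donor hot deck: each recipient record
# receives Z from the donor record closest on X (sampling weights are
# ignored, as in the traditional micro approaches discussed above).
donor_idx = np.abs(x_a[:, None] - x_b[None, :]).argmin(axis=1)
z_imputed = z_b[donor_idx]

# A minimal check in the spirit of Section 5.4: the marginal distribution
# of Z in the integrated file should resemble the one in the donor file.
gap = abs(z_imputed.mean() - z_b.mean())
```

With both files drawn from the same population, `gap` should be small; a donation-class variant would simply restrict the argmin to donors belonging to the recipient's class of a design variable.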
[119] proposed a matching method to be applied for an SQD, based on ordinary least squares. They consider using the macro approach of SM under the CIA to estimate the joint distribution of all variables of interest, thus obtaining an identifiable model given the available data. In their method, they fit three models to obtain the production model with all covariates in common. To validate their proposed method, they compared its performance with other imputation methods such as MICE, CART and random forest. They found that their proposed method for SM has the best performance among the methods demonstrated for MI in terms of prediction error, and that it does not necessitate the use of a specific imputation model for missing data.

Another very useful application of SM is combining data from sample surveys with census data. Sample surveys can include detailed data about specific issues that may not be available via census data. [120] used SM to combine census data with the Demographic and Health Survey, owing to the increasing demand for integrating surveys with census data in order to provide richer, more detailed sources of data. SM methods can be used to connect records from survey and census data where RL is impractical due to confidentiality constraints. [120] created a technique that uses an iterative proportional updating approach to construct a synthetic population of individuals and households that is similar to reality. Then, SM is used for imputing survey data to individuals and households in the synthetic population using the nearest neighbour approach. To assess this process, 2011 Bangladesh census data is used to produce a district-specific synthetic population of individuals and households. The wealth index for each household within the synthetic population is then estimated to impute the closest available records from the 2011 Bangladesh Demographic and Health Survey. The findings show that the method proposed by [120] achieved more representative estimates (when compared to direct survey estimates), particularly in areas with small sample sizes and a small number of population units with different socio-demographic characteristics.

5.3. Essential steps for applying SM in practice

Some considerations and preprocessing steps are required before utilizing SM approaches to integrate two or more data sources. Assuming that we have two data files A and B, the following steps are essential for SM:
1. Selection of the target variables Y and Z, i.e. variables that are not found jointly in the two datasets.
2. Data preparation and harmonization.
3. Selection of the matching variables that will be used in the matching process for the two separate sample surveys. Many common variables may exist in both files A and B; in practice, only the most relevant ones from this set of common variables are employed in the matching process, and these are known as matching variables. They should be chosen using appropriate statistical procedures and with the help of subject matter experts.
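The matching-variable screening of step 3 can be sketched with a simple association measure; the simulated data, the three candidate variables and the use of Spearman's rank correlation are illustrative assumptions, and the same screening would be repeated for Z on file B.

```python
import numpy as np

rng = np.random.default_rng(1)

def spearman(u, v):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    ranks_u, ranks_v = u.argsort().argsort(), v.argsort().argsort()
    return np.corrcoef(ranks_u, ranks_v)[0, 1]

# Simulated file A: three candidate common variables X1, X2, X3 and target Y.
n = 500
x = rng.normal(size=(n, 3))
y = 1.5 * x[:, 0] + 0.1 * x[:, 2] + rng.normal(size=n)

# Rank the candidates by their absolute association with Y; the strongest
# candidates would be retained as matching variables.
scores = [abs(spearman(x[:, j], y)) for j in range(3)]
best = int(np.argmax(scores))  # here X1, which drives Y most strongly
```

Using ranks rather than raw values also captures monotone nonlinear relationships, which is why Spearman's coefficient is suggested alongside the ordinary correlation.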
The appropriate statistical procedures may be descriptive or inferential. For instance, the easiest approach for identifying the optimal set of matching variables is to calculate measures of pairwise correlation/association between Y and each of the available predictors X, and between Z and each of the available predictors X, for files A and B respectively. If the response variable is continuous, its correlation with the predictors can be examined; Spearman's rank correlation coefficient can also be considered to detect potential nonlinear relationships.
4. The matching framework should be determined with respect to the objective of SM (micro or macro) and with respect to the appropriate setting (parametric, nonparametric, mixed or Bayesian).
5. After deciding the SM framework using the previous steps, the suitable SM approach is performed to integrate the two separate datasets.
6. An appropriate quality assessment technique should be performed to evaluate the dataset after matching.

5.4. Statistical matching quality assessment

Having reviewed all the above traditional and recent developments for SM, perhaps the most important question to be asked is: how good is the resulting integrated data? [48] stated two questions about the quality of integrated data. First, how can we measure the quality of estimates based on statistically matched data in practice, i.e. when one does not know the complete data? And second, can we develop a theoretical model for data, including relations between target variables, that enables us to measure the quality of integrated data?

Quality assessment of the joint distribution of variables that have never been jointly observed is a nontrivial task [121], and [49] suggested relatively simple measures of quality assessment of integrated datasets through a comparison of basic statistics (mean, standard deviation, etc.) in the donor and integrated datasets. [13] proposed a more complex way of evaluating integrated data quality, called 'integration validity'. It is a multilevel framework for the evaluation of quality in an SM procedure, based on four levels of validity for a matching procedure:
1. A reproduction of the true but unknown values of Z of the recipient units.
2. A joint distribution preservation, where the true unknown joint distribution of (X, Y, Z) is preserved in the integrated dataset.
3. A covariance structure cov(X, Y, Z) is reflected in the integrated dataset, and the marginal distributions fX,Y and fX,Z are preserved.
4. The marginal as well as joint distributions of the variables in the donor file are preserved in the integrated file.
The first and third levels of quality are not achievable; the only way to validate these two levels is using simulation experiments. The second level may be validated by auxiliary information. In reality, marginal and joint distributions in the integrated datasets are derived when using traditional approaches. This is a minimum criterion to assess the validity of an SM technique. It does not, however, imply the validation of the estimates for the joint distributions of the variables found separately in the two datasets.

In practice, the most commonly used quality assessment technique for data resulting from SM is the one suggested by the German Association of Media Analysis [13], which involves:
1. Comparing the empirical distribution of the target variables included in the integrated file with the corresponding distributions in the recipient and donor files,
2. Comparing the joint distributions fX,Z and fX,Y observed in the recipient and donor files with their corresponding joint distributions in the integrated file.
More work still needs to focus on the assessment of the quality of integrated data and on developing reliable indicators that can be used in this context.

6. Conclusion and future work

Through this paper, we introduced the methods available in the existing literature on SM, whether under the CIA, with auxiliary information or via uncertainty analysis, along with their drawbacks whenever applicable. Recent contributions and alternative techniques for SM have also been reviewed. It is noted that available SM procedures focus more on the case of continuous variables rather than categorical data. For future work, it is advised that studies focus more on categorical data, as it can be considered the most common type of data in most social surveys. An important result by [57] was that MI approaches are superior to the traditional SM procedures. Consequently, recent MI approaches that are used for categorical data can be employed in the context of SM.
Latent class models have recently been used for MI, where they are employed as a tool for estimating the density of categorical variables [122,123]. The advantage of using latent class models is that they can be used for datasets drawn from large-scale studies, where there are a large number of variables and complex relationship structures. On the other hand, it is known that assuming the CIA in cases where it does not really hold can lead to misleading results. To address this problem, it is suggested by [9,57] to choose the common variables carefully, in a way that already establishes conditional independence, so that inference about the actually unobserved association becomes valid. In this context, another advantage of using latent class models is their conditional independence property, that is, the scores of different items are independent of each other given the latent classes. A new approach for SM of categorical data is currently being developed by the authors, based on latent class models within a Bayesian framework. Simulation studies will be performed to evaluate the performance of the proposed latent class model in SM, through an empirical comparison of several different matching procedures based on simulated data.

Acknowledgments

The authors are grateful to the reviewers for their valuable comments, which permitted to improve the manuscript.

References

[1] Yang S, Kim JK. Statistical data integration in survey sampling: A review. arXiv preprint arXiv: 200103259. 2020.
[2] Thompson ME. Combining data from new and traditional sources in population surveys. International Statistical Review. 2019; 87: S79-S89.
[3] Fellegi IP, Sunter AB. A theory for record linkage. Journal of the American Statistical Association. 1969; 64(328): 1183-1210.
[4] Sayers A, Ben-Shlomo Y, Blom A, Steele F. Probabilistic record linkage. International Journal of Epidemiology. 2016; 45(3): 954-964.
[5] Murray J. Probabilistic record linkage and deduplication after indexing, blocking, and filtering. Journal of Privacy and Confidentiality. 2015; 7.
[6] Hof MH, Ravelli AC, Zwinderman AH. A probabilistic record linkage model for survival data. Journal of the American Statistical Association. 2017; 112(520): 1504-1515. Available from: https://doi.org/10.1080/01621459.2017.1311262.
[7] Gessendorfer J, Beste J, Drechsler J, Sakshaug JW, et al. Statistical matching as a supplement to record linkage: A valuable method to tackle non-consent bias. Journal of Official Statistics. 2018; 34(4): 909-933.
[8] Rässler S, Fleischer K. Aspects concerning data fusion techniques. in: International Workshop on Household Survey Nonresponse. DEU; 1998; 4: p. 317-333.
[9] Rässler S, Fleischer K. An evaluation of data fusion techniques. in: Proceedings of Statistics Canada Symposium 99 on Combining Data from Different Sources; 1999. p. 129-136.
[10] D'Orazio M, Di Zio M, Scanu M. Statistical matching and official statistics. Rivista di Statistica Ufficiale. 2002.
[11] D'Orazio M, Di Zio M, Scanu M. Statistical matching: Theory and practice. John Wiley & Sons; 2006.
[12] Kim JK, Berg E, Park T. Statistical matching using fractional imputation. arXiv preprint arXiv: 151003782. 2016.
[13] Rässler S. Statistical matching: A frequentist theory, practical applications, and alternative Bayesian approaches. vol. 168. New York: Springer-Verlag; 2002.
[14] Rubin DB. An overview of multiple imputation. in: Proceedings of the Survey Research Methods Section of the American Statistical Association. Citeseer; 1988. p. 79-84.
[15] Little RJ. Missing-data adjustments in large surveys. Journal of Business and Economic Statistics. 1988; 6(3): 287-296.
[16] Song Q, Shepperd M. Missing data imputation techniques. International Journal of Business Intelligence and Data Mining. 2007; 2(3): 261-291.
[17] Rubin DB. Basic ideas of multiple imputation for nonresponse. Survey Methodology. 1986; 12(1): 37-47.
[18] Rubin DB. Multiple imputations in sample surveys – a phenomenological Bayesian approach to nonresponse; 1978.
[19] Rubin DB. Multiple imputation for nonresponse in surveys. John Wiley & Sons; vol. 81: 2004.
[20] Soley-Bori M. Dealing with missing data: Key assumptions and methods for applied analysis. Boston University. 2013; 4: 1-19.
[21] van Ginkel JR, Linting M, Rippe RCA, van der Voort A. Rebutting existing misconceptions about multiple imputation as a method for handling missing data. Journal of Personality Assessment. 2019; 0(0): 1-12. PMID: 30657714. Available from: https://doi.org/10.1080/00223891.2018.1530680.
[22] Little RJ, Rubin DB. Statistical analysis with missing data. John Wiley & Sons; vol. 793. 1987.
[23] Rubin DB. Multiple imputation for nonresponse in surveys. John Wiley & Sons; vol. 81: 1987.
[24] Rendall MS, Ghosh-Dastidar B, Weden MM, Baker EH, Nazarov Z. Multiple imputation for combined-survey estimation with incomplete regressors in one but not both surveys. Sociological Methods and Research. 2013; 42(4): 483-530.
[25] Okner B. Constructing a new data base from existing microdata sets: The 1966 merge file. 1972; 325-342.
[26] Sims C. Comments. Annals of Economic and Social Measurement. 1972; 1(3): 343-345.
[27] Sims C. Rejoinder. Annals of Economic and Social Measurement. 1972; 1(3): 355-357.
[28] Peck J. Comments. Annals of Economic and Social Measurement. 1972; 1(3): 347-348.
[29] Okner B. Reply and comments. 1972; 359-362.
[30] Okner B. in: Data matching and merging: An overview. NBER; 1974. p. 347-352. Available from: http://www.nber.org/chapters/c10114.
[31] Ruggles N, Ruggles R. A strategy for merging and matching microdata sets. in: Annals of Economic and Social Measurement, NBER. 1974; 3(2): p. 353-371.
[32] Alter HE. Creation of a synthetic data set by linking records of the Canadian survey of consumer finances with the family expenditure survey. in: Annals of Economic and Social Measurement, NBER. 1974; 3(2): p. 373-397.
[33] Sims C. Comments. Annals of Economic and Social Measurement. 1974; 1(3): 395-397.
[34] Moriarity C, Scheuren F. Statistical matching: A paradigm for assessing the uncertainty in the procedure. Journal of Official Statistics. 2001; 17(3): 407.
[35] Marella D, Pfeffermann D. Matching information from two independent informative samples. Journal of Statistical Planning and Inference. 2019; 203: 70-81.
[36] Kim K, Park M. Statistical micro matching using a multinomial logistic regression model for categorical data. Communications for Statistical Applications and Methods. 2019; 26(5): 507-517.
[37] Donatiello G, D'Orazio M, Frattarola D, Rizzi A, Scanu M, Spaziani M. The role of the conditional independence assumption in statistically matching income and consumption. Statistical Journal of the IAOS. 2016; 32(4): 667-675.
[38] Cutillo A, Scanu M. A mixed approach for data fusion of HBS and SILC. Social Indicators Research. 2020; 1-27.
[39] Doretti M, Geneletti S, Stanghellini E. Missing data: A unified taxonomy guided by conditional independence. International Statistical Review. 2018; 86(2): 189-204.
[40] Paass G. Statistical match: Evaluation of existing procedures and improvements by using additional information. Microanalytic Simulation Models to Support Social and Financial Policy. 1986; 401-420.
[41] Singh AC, Lemaître GE, Armstrong J. Statistical matching using log linear imputation. Social Survey Methods Division, Statistics Canada; 1988.
[42] Singh A, Mantel H, Kinack M, Rowe G. Statistical matching: Use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology. 1993; 19(1): 59-79.
[43] Kadane JB. Some statistical problems in merging data files. Journal of Official Statistics. 2001; 17(3): 423-433.
[44] Silverman B. Density estimation for statistics and data analysis. London: Chapman and Hall; 1986.
[45] Wand M, Jones M. Kernel smoothing. Monographs on Statistics and Applied Probability. 1995.
[46] Härdle W, Linton O. Applied nonparametric methods. vol. 9203. Center for Economic Research, Tilburg University; 1992.
[47] Joenssen DW, Bankhofer U. Hot deck methods for imputing missing data. in: International Workshop on Machine Learning and Data Mining in Pattern Recognition. Springer; 2012. p. 63-75.
[48] De Waal T. Statistical matching: Experimental results and future research questions. Statistics Netherlands; 2015.
[49] Rodgers WL. An evaluation of statistical matching. Journal of Business and Economic Statistics. 1984; 2: 91-102.
[50] Singh A, Mantel H, Kinack M, Rowe G. On methods of statistical matching with and without auxiliary information. Technical Report; 1990.
[51] Moriarity C, Scheuren F. A note on Rubin's statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics. 2003; 21(1): 65-73.
[52] Rubin DB. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics. 1986; 4(1): 87-94. Available from: http://www.jstor.org/stable/1391390.
[53] Rubin DB, Stern HS, Vehovar V. Handling don't know survey responses: The case of the Slovenian plebiscite. Journal of the American Statistical Association. 1995; 90(431): 822-828.
[54] Van Buuren S, Oudshoorn K. Flexible multivariate imputation by MICE. Leiden: TNO; 1999.
[55] Shah AD, Bartlett JW, Carpenter J, Nicholas O, Hemingway H. Comparison of random forest and parametric imputation models for imputing missing data using MICE: A CALIBER study. American Journal of Epidemiology. 2014; 179(6): 764-774.
[56] Rässler S. A non-iterative Bayesian approach to statistical matching. Statistica Neerlandica. 2003; 57: 58-74.
[57] Rässler S. Data fusion: Identification problems, validity, and multiple imputation. Austrian Journal of Statistics. 2004; 33(1-2): 153-171.
[58] Gasser T, Müller HG. Kernel estimation of regression functions. in: Smoothing Techniques for Curve Estimation. Springer; 1979. p. 23-68.
[59] Andrews DW. Nonparametric kernel estimation for semiparametric models. Econometric Theory. 1995; 560-596.
[60] Geenens G, Charpentier A, Paindaveine D, et al. Probit transformation for nonparametric kernel estimation of the copula density. Bernoulli. 2017; 23(3): 1848-1873.
[61] Cheng P, Chu C. Kernel estimation of distribution functions and quantiles with missing data. Statistica Sinica. 1996; 63-78.
[62] Vantaggi B. Statistical matching of multiple sources: A look through coherence. International Journal of Approximate Reasoning. 2008; 49(3): 701-711.
[63] D'Orazio M, Di Zio M, Scanu M. The use of uncertainty to choose matching variables in statistical matching. International Journal of Approximate Reasoning. 2017; 90: 433-440.
[64] Molinari F. Partial identification of probability distributions with misclassified data. Journal of Econometrics. 2008; 144(1): 81-117.
[65] Ahfock D, Pyne S, Lee SX, McLachlan GJ. Partial identification in the statistical matching problem. Computational Statistics and Data Analysis. 2016; 104: 79-90.
[66] Fan Y, Park SS. Sharp bounds on the distribution of treatment effects and their statistical inference. Econometric Theory. 2010; 931-951.
[67] Komarova T, Nekipelov D, Yakovlev E. Identification, data combination, and the risk of disclosure. Quantitative Economics. 2018; 9(1): 395-440.
[68] Manski C. Identification problems in the social sciences. 1995.
[69] Conti PL, Marella D, Scanu M. Uncertainty analysis in statistical matching. Journal of Official Statistics. 2012; 28(1): 69-88.
[70] Endres E, Fink P, Augustin T. Imprecise imputation: A nonparametric micro approach reflecting the natural uncertainty of statistical matching with categorical data. Journal of Official Statistics. 2019; 35(3): 599-624.
[71] D'Orazio M, Di Zio M, Scanu M. Statistical matching and the likelihood principle: Uncertainty and logical constraints. ISTAT Technical Report; 2004.
[72] D'Orazio M, Di Zio M, Scanu M. Uncertainty intervals for nonidentifiable parameters in statistical matching. Proceedings of the 57th Session of the International Statistical Institute. 2009.
[73] D'Orazio M. Statistical matching and imputation of survey data with StatMatch. Italian National Institute of Statistics: Rome, Italy. 2017.
[74] D'Orazio M, Di Zio M, Scanu M. Statistical matching for categorical data: Displaying uncertainty and using logical constraints. Journal of Official Statistics. 2005; 22(1): 137.
[75] Zhang LC, Chambers R. An overview on uncertainty and estimation in statistical matching. Analysis of Integrated Data. 2019; 73.
[76] Conti PL, Marella D, Scanu M. How far from identifiability? A systematic overview of the statistical matching problem in a non parametric framework. Communications in Statistics – Theory and Methods. 2017; 46(2): 967-994.
[77] Conti PL, Marella D, Scanu M. Uncertainty analysis for statistical matching of ordered categorical variables. Computational Statistics and Data Analysis. 2013; 68: 311-325.
[78] Ridder G, Moffitt R. The econometrics of data combination. Handbook of Econometrics. North Holland, Amsterdam; 2007.
[79] Rässler S, Kiesl H. How useful are uncertainty bounds? Some recent theory with an application to Rubin's causal model. Proceedings of the 57th Session of the International Statistical Institute. 2009.
[80] D'Orazio M, Di Zio M, Scanu M. Auxiliary variable selection in a statistical matching problem. in: Zhang LC, Chambers R. Analysis of Integrated Data. CRC/Chapman and Hall. 2019; 101-120.
[81] Zhang LC. On proxy variables and categorical data fusion. Journal of Official Statistics. 2015; 31(4): 783-807.
[82] Conti P, Marella D, Scanu M. An overview on uncertainty and estimation in statistical matching. in: Zhang LC, Chambers R. Analysis of Integrated Data. CRC/Chapman and Hall. 2019.
[83] Conti PL, Marella D, Vicard P, Vitale V. Multivariate statistical matching using graphical modeling. International Journal of Approximate Reasoning. 2021; 130: 150-169.
[84] Rubin DB, Thomas N. Matching using estimated propensity scores: Relating theory to practice. Biometrics. 1996; 249-264.
[85] Rubin DB, Thomas N. Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association. 2000; 95(450): 573-585.
[86] Kum H, Masterson T. Statistical matching using propensity scores: Theory and application to the Levy Institute measure of economic well-being. 2008.
[87] Kum H, Masterson TN. Statistical matching using propensity scores: Theory and application to the analysis of the distribution of income and wealth. Journal of Economic and Social Measurement. 2010; 35(3-4): 177-196.
[88] Bestehorn F, Bestehorn M, Bestehorn M, Kirches C. A deterministic balancing score algorithm to avoid common pitfalls of propensity score matching. arXiv preprint arXiv: 180302704. 2018.
[89] King G, Nielsen R. Why propensity scores should not be used for matching. Political Analysis. 2019; 27(4): 435-454.
[90] Kim JK. Parametric fractional imputation for missing data analysis. Biometrika. 2011; 98(1): 119-132.
[91] Yang S, Kim JK, et al. Fractional imputation in survey sampling: A comparative review. Statistical Science. 2016; 31(3): 415-432.
[92] Vapnik VN. An overview of statistical learning theory. IEEE Transactions on Neural Networks. 1999; 10(5): 988-999.
[93] Breiman L, et al. Statistical modeling: The two cultures (with comments and a rejoinder by the author). Statistical Science. 2001; 16(3): 199-231.
[94] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning: Data mining, inference, and prediction. Springer Science & Business Media; 2009.
[95] James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. vol. 112. Springer; 2013.
[96] D'Orazio M. A two step non parametric procedure for statistical matching. in: 8th Scientific Meeting of the CLAssification and Data Analysis Group of the Italian Statistical Society (CLADAG 2011); 2011. p. 7-9.
[97] D'Orazio M. Statistical learning in official statistics: The case of statistical matching. Statistical Journal of the IAOS. 2019; 35(3): 435-441.
[98] Baioletti M, Capotorti A. A L1 based probabilistic merging algorithm and its application to statistical matching. Applied Intelligence. 2019; 49(1): 112-124.
[99] Baioletti M, Capotorti A. An efficient probabilistic merging procedure applied to statistical matching. in: International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer; 2017. p. 65-74.
[100] Capotorti A. A further empirical study on the over-performance of estimate correction in statistical matching. in: International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems. Springer; 2012. p. 124-133.
[101] Baioletti M, Capotorti A. Efficient L1-based probability assessments correction: Algorithms and applications to belief merging and revision. in: Proceedings of the 9th International Symposium on Imprecise Probability: Theories and Applications (ISIPTA 2015), Pescara (IT); 2015. p. 37-46.
[102] Capotorti A, Vantaggi B. Incoherence correction strategies in statistical matching. in: Proceedings of ISIPTA. Citeseer; 2011. p. 109-118.
[103] Endres E, Augustin T. Statistical matching of discrete data by Bayesian networks. in: Conference on Probabilistic Graphical Models. PMLR; 2016. p. 159-170.
[104] Endres E, Augustin T. Utilizing log-linear Markov networks to integrate categorical data files. 2019.
[105] Lumley T. Complex surveys: A guide to analysis using R. John Wiley & Sons. 2011; 565.
[106] Renssen RH. Use of statistical matching techniques in calibration estimation. Survey Methodology. 1998; 24: 171-184.
[107] Wu C. Combining information from multiple surveys through the empirical likelihood method. Canadian Journal of Statistics. 2004; 32(1): 15-26.
[108] Conti PL, Marella D, Scanu M. Statistical matching analysis for complex survey data with applications. Journal of the American Statistical Association. 2016; 111(516): 1715-1725.
[109] D'Orazio M, Di Zio M, Scanu M. Statistical matching of data from complex sample surveys. in: Proceedings of the European Conference on Quality in Official Statistics – Q2012. 2012; 29.
[110] D'Orazio M, Di Zio M, Scanu M. Old and new approaches in statistical matching when samples are drawn with complex survey designs. Proceedings of the 45th "Riunione Scientifica della Società Italiana di Statistica", Padova. 2010; 16-18.
[111] Särndal CE, Lundström S. Estimation in surveys with nonresponse. John Wiley & Sons; 2005.
[112] Zieschang KD. Sample weighting methods and estimation of totals in the consumer expenditure survey. Journal of the American Statistical Association. 1990; 85(412): 986-1001.
[113] Renssen RH, Nieuwenbroek NJ. Aligning estimates for common variables in two or more sample surveys. Journal of the American Statistical Association. 1997; 92(437): 368-374.
[114] Raghunathan TE, Grizzle JE. A split questionnaire survey design. Journal of the American Statistical Association. 1995; 90(429): 54-63.
[115] Chipperfield JO, Steel DG. Design and estimation for split questionnaire surveys. 2009.
[116] Kamgar S, Navvabpour H. An efficient method for estimating population parameters using split questionnaire design. Journal of Statistical Research of Iran JSRI. 2017; 14(1): 77-99.
[117] Peytchev A, Peytcheva E. Reduction of measurement error due to survey length: Evaluation of the split questionnaire design approach. in: Survey Research Methods. 2017; 11: p. 361-368.
[118] Stuart M, Yu C. A computationally efficient method for selecting a split questionnaire design. Communications in Statistics – Simulation and Computation. 2019; 1-23.
[119] Ali M, Kauermann G. A split questionnaire survey design in the context of statistical matching. Statistical Methods and Applications. 2021; 1-18.
[120] Namazi-Rad MR, Tanton R, Steel D, Mokhtarian P, Das S. An unconstrained statistical matching algorithm for combining individual and household level geo-specific census and survey data. Computers, Environment and Urban Systems. 2017; 63: 3-14.
[121] Barr RS, Turner JS. Quality issues and evidence in statistical file merging. Data Quality Control: Theory and Pragmatics. 1990; 112: 245.
[122] van der Palm DW, van der Ark LA, Vermunt JK. A comparison of incomplete data methods for categorical data. Statistical Methods in Medical Research. 2016; 25(2): 754-774. PMID: 23166159. Available from: https://doi.org/10.1177/0962280212465502.
[123] Vidotto D, Vermunt JK, Kaptein MC. Multiple imputation of missing categorical data using latent class models: State of the art. Psychological Test and Assessment Modeling. 2015; 57(4): 542.