
Computational Statistics and Data Analysis 68 (2013) 311–325


Uncertainty analysis for statistical matching of ordered categorical variables

Pier Luigi Conti a, Daniela Marella b,∗, Mauro Scanu c

a Dipartimento di Scienze Statistiche, Sapienza Università di Roma, Italy
b Dipartimento di Scienze della Formazione, Università ‘‘Roma Tre’’, Italy
c ISTAT, Italian National Statistical Institute, Roma, Italy

Article history: Received 26 January 2012; Received in revised form 3 July 2013; Accepted 3 July 2013; Available online 15 July 2013.

Keywords: Statistical matching; Contingency tables; Structural zeros; Nonidentifiability; Uncertainty

Abstract. The aim is to analyze the uncertainty in statistical matching for ordered categorical variables. Uncertainty in statistical matching consists in estimating a joint distribution by observing only samples from its marginals. Unless very restrictive conditions are met, observed data do not identify the joint distribution to be estimated, and this is the reason for uncertainty. The notion of uncertainty is first formally introduced, and a measure of uncertainty is then proposed. Moreover, the reduction of uncertainty in the statistical model due to the introduction of logical constraints is investigated and evaluated via simulation.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

In current practice, information needed for statistical analysis is frequently available in different data sources, each
containing a subset of the variables of interest. This is the case of the ecological inference problem (cf. King, 1997), where the main interest is estimating the joint probabilities of a contingency table when the marginals are known from population counts. A typical example consists of contingency tables whose rows and columns correspond to votes for political parties (known from election results) and race (known from population registers), respectively. One is then interested in the voters' behavior for different races. When the rows' and columns' proportions are (separately) estimated from two different samples, the problem becomes a genuine statistical matching problem. Another example of a statistical matching problem is
studied in Tonkin and Webber (2012), where the authors ‘‘statistically match expenditures for the Household Budget Survey
(HBS) with income and material deprivation contained within EU Statistics on Income and Living Conditions (EU-SILC)’’.
Formally speaking, the statistical matching problem can be described as follows. Let (X , Y , Z ) be a three-dimensional
random variable (r.v.), and let A and B be two independent samples of nA and nB i.i.d. records from (X , Y , Z ), respectively.
Assume that the marginal (bivariate) (X , Y ) is observed in A, and that the marginal (bivariate) (X , Z ) is independently
observed in B. The main goal of statistical matching, at a macro level, consists in estimating the joint distribution of
(X , Y , Z ). Such a distribution is not identifiable due to the absence of joint information on Z and Y given X , see D’Orazio
et al. (2006b) and references therein, and Aluja-Banet et al. (2007) and Saporta (2002) for alternative approaches based on
general multivariate analyses, as well as Conti et al. (2008) for an evaluation of statistical matching based on the matching
noise.

∗ Corresponding author. E-mail address: daniela.marella@uniroma3.it (D. Marella).

http://dx.doi.org/10.1016/j.csda.2013.07.004

Generally speaking, two approaches have been considered to ensure the identifiability of the joint distribution of
(X , Y , Z ):
• techniques based on the conditional independence assumption of Y and Z given X (CIA), see, e.g., Okner (1972), or on other kinds of identifiable models, such as independence of Y and Z given a latent variable (Kamakura and Wedel, 1997);
• techniques based on external auxiliary information regarding the statistical relationship between Y and Z, e.g. when an additional file C where (X, Y, Z) are jointly observed is available, as in Singh et al. (1993).
Unfortunately, since the CIA is rarely met in practice (Rodgers, 1984; Sims, 1972), and external auxiliary information is hardly ever available, the lack of joint information on the variables of interest is the cause of uncertainty about the model of (X, Y, Z). In other terms, the sample information provided by A and B does not allow one to identify the joint distribution of (X, Y, Z), but only a class of possible distributions of (X, Y, Z) (identification problem). Such distributions are compatible with the available information, namely they may have generated the observed data. Note that, even if the marginal distributions of (X, Y) and (X, Z) were perfectly known, it would not be possible to draw certain conclusions about the model of (X, Y, Z).
The lack of identifiability of the distribution of (X, Y, Z) is due to the sampling mechanism, which is actually unable to identify the conditional distribution of (Y, Z) given X. Hence, considering uncertainty about the conditional distribution of (Y, Z) given X is equivalent to considering uncertainty about the distribution of (X, Y, Z).
An important task, in this setting, consists in constructing a coherent measure that can reasonably quantify the uncertainty about the (estimated) model. From an operational point of view, a measure of uncertainty essentially quantifies how ‘‘large’’ the class of models estimable on the basis of the available sample information is. The smaller the measure of uncertainty, the smaller the class of estimated models. For a review of uncertainty in statistical matching providing a unified framework for the parametric and nonparametric approaches, see Conti et al. (2012). A specific approach dealing with the case of dichotomous Y and Z is in Gilula et al. (2006), where a Bayesian analysis of the uncertainty space is also proposed.
In our setting, the main task consists in providing a precise definition of uncertainty on the (estimated) model, and in constructing a coherent measure that can reasonably quantify such an uncertainty.
The paper is organized as follows. In Section 2 the model uncertainty for ordered categorical variables is investigated.
More specifically, model uncertainty is defined and uncertainty measures are introduced. In Section 3 the effect on model
uncertainty due to the introduction of logical constraints is evaluated. In Section 4 estimators of uncertainty measures are
proposed and their asymptotic behavior is studied. Finally, in Section 5 a simulation experiment is performed.

2. Uncertainty in statistical matching for ordered categorical variables

As stressed above, the statistical matching problem is characterized by uncertainty on the statistical model for the joint
distribution of all variables of interest. Uncertainty in statistical matching can be viewed as a special case of estimation
problems for general partially identifiable models, as in Manski (1995, 2003), and references therein. In those cases,
estimation is not pointwise, but consists of ranges. Another example comes from the so-called ‘‘disclosure problem’’ for confidentiality protection. In the case of categorical data (e.g. k-way contingency tables), upper and lower bounds on cell counts induced by a set of released margins play an important role in disclosure limitation techniques; see Dobra and Fienberg (2001). In that context, for each suppressed cell we get an uncertainty interval called the ‘‘feasibility interval’’. Such an interval should be sufficiently wide in order to ensure adequate confidentiality protection.
Uncertainty in statistical matching for parametric models, mainly in the multinormal case, is studied in Kadane (1978),
Rubin (1986), Moriarity and Scheuren (2001) and Raessler (2002). The basic feature common to all those papers is that a
multivariate distribution is not completely observed; only (some of) its marginals are observed. Sample observations cannot
identify the statistical model generating data. As already said, this produces a kind of uncertainty about the model. Such an
uncertainty is quantified by taking the range of an association parameter (e.g. the correlation coefficient in the normal
bivariate case) between the non-jointly observed variables. In the case of categorical (non-ordinal) variables, uncertainty is
dealt with in D’Orazio et al. (2006a).
Evaluation of the uncertainty in a statistical matching problem is also used for validation purposes. In particular,
Raessler (2002) evaluates for multinormal models the length of the uncertainty intervals for unidentifiable parameters
in order to define a measure of the reliability of estimates under CIA. ‘‘Small’’ uncertainty intervals imply that parameter
estimates obtained under the different models compatible with the available sample information slightly differ from the
ones estimated under the CIA.
The attention to the estimates under the CIA is justified by the fact that when (X , Y , Z ) are multinormal, estimates
under the CIA are the midpoint of the uncertainty interval of the inestimable parameters, usually the correlation coefficients
between Y and Z . For other parametric models this property of the estimates under the CIA does not hold. Generalizations
have been considered in D’Orazio et al. (2006a) in the case of categorical data, and in D’Orazio et al. (2006b) for general
parametric models. They consider a maximum likelihood approach, and a related general measure of uncertainty given
by the (hyper)volume of the likelihood ridge (in this case called the ‘‘uncertainty space’’). Formally, the parameter estimate which maximizes the likelihood function is not unique; the set of maximum likelihood estimates is called the likelihood ridge. Statistical analysis of the likelihood ridge determines the middle point of the uncertainty interval for each parameter.
P.L. Conti et al. / Computational Statistics and Data Analysis 68 (2013) 311–325 313

Assume that, given a discrete r.v. X with I categories, Y and Z are discrete r.v.s too, with J and K categories, not necessarily
ordered. With no loss of generality, the symbols i = 1, . . . , I , j = 1, . . . , J, and k = 1, . . . , K , denote the categories taken
by X , Y and Z , respectively.
Let γjk|i be the conditional probability Pr(Y = j, Z = k|X = i), and denote by φj|i = Pr(Y = j|X = i) and ψk|i =
Pr(Z = k|X = i) the corresponding marginals, respectively. For real numbers a, b, define further the two quantities
U (a, b) = min(a, b), L(a, b) = max(0, a + b − 1), (1)
then
L(φj|i , ψk|i ) ≤ γjk|i ≤ U (φj|i , ψk|i ). (2)
The interval (2) summarizes the pointwise uncertainty about the statistical model for every triple (i, j, k). It is intuitive to take the length of such an interval as a pointwise measure of uncertainty. Formally

$$\Delta_{jk|i} = U(\phi_{j|i}, \psi_{k|i}) - L(\phi_{j|i}, \psi_{k|i}). \qquad (3)$$

The larger $\Delta_{jk|i}$, the more uncertain the statistical model generating the data w.r.t. (i, j, k).
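As a quick illustration of (1)–(3), the bounds and the pointwise width can be computed directly. The following is a minimal sketch in Python; the function names are ours, not the paper's:

```python
# Pointwise Frechet bounds (1)-(2) and the uncertainty width (3).
# phi = Pr(Y = j | X = i), psi = Pr(Z = k | X = i).

def frechet_upper(a, b):
    """U(a, b) = min(a, b)."""
    return min(a, b)

def frechet_lower(a, b):
    """L(a, b) = max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)

def pointwise_uncertainty(phi, psi):
    """Delta_{jk|i} = U(phi, psi) - L(phi, psi): the width of the
    interval that must contain gamma_{jk|i}."""
    return frechet_upper(phi, psi) - frechet_lower(phi, psi)
```

For instance, phi = 0.6 and psi = 0.7 give the interval [0.3, 0.6], hence a pointwise width of 0.3.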

The pointwise differences (3) can be summarized into a unique overall measure of uncertainty, taking the general form

$$\Delta = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\left[U(\phi_{j|i},\psi_{k|i}) - L(\phi_{j|i},\psi_{k|i})\right] t_{ijk}, \qquad (4)$$

where the $t_{ijk}$s are positive weights summing up to 1. Among the possible choices of the $t_{ijk}$s, from now on we consider the following

$$t_{ijk} = \xi_i\, \phi_{j|i}\, \psi_{k|i}, \qquad (5)$$

where $\xi_i$ is the probability of the event X = i. The choice of (5) is motivated by the following considerations: (i) it is the simplest choice given $\xi_i$, $\phi_{j|i}$, $\psi_{k|i}$; (ii) it makes the computation of (4) easy; (iii) it is ‘‘neutral’’, because it does not give preference to any specific form of association between Y and Z (given X). The choice of (5) leads to the following measure

$$\Delta = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K}\left[U(\phi_{j|i},\psi_{k|i}) - L(\phi_{j|i},\psi_{k|i})\right]\phi_{j|i}\,\psi_{k|i}\,\xi_i = \sum_{i=1}^{I} \Delta_{x=i}\, \xi_i, \qquad (6)$$

where

$$\Delta_{x=i} = \sum_{j=1}^{J}\sum_{k=1}^{K}\left[U(\phi_{j|i},\psi_{k|i}) - L(\phi_{j|i},\psi_{k|i})\right]\phi_{j|i}\,\psi_{k|i} \qquad (7)$$

is the conditional measure of uncertainty given X = i.
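The measures (6)–(7) under the weights (5) amount to simple weighted sums. A minimal sketch, assuming the conditional marginals are given as lists (all names are ours):

```python
def conditional_uncertainty(phi_i, psi_i):
    """Delta_{x=i} of (7): sum over (j, k) of (U - L) * phi_{j|i} * psi_{k|i}."""
    total = 0.0
    for p in phi_i:
        for q in psi_i:
            upper = min(p, q)              # U(phi, psi)
            lower = max(0.0, p + q - 1.0)  # L(phi, psi)
            total += (upper - lower) * p * q
    return total

def overall_uncertainty(xi, phi, psi):
    """Delta of (6): xi-weighted average of the conditional measures."""
    return sum(x * conditional_uncertainty(p, q)
               for x, p, q in zip(xi, phi, psi))
```

For example, with a single X category and uniform marginals phi = psi = (0.5, 0.5), every pair has width 0.5 and weight 0.25, so the measure equals 0.5.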


Sharper results are obtained when the categories taken by (X, Y, Z) are ordered. For the sake of simplicity, we use the customary order for natural numbers. In this case, the cumulative d.f.'s are

$$H_{jk|i} = \sum_{y=1}^{j}\sum_{z=1}^{k} \gamma_{yz|i}, \quad j = 1, \ldots, J,\; k = 1, \ldots, K,\; i = 1, \ldots, I,$$

$$F_{j|i} = \sum_{y=1}^{j} \phi_{y|i}, \quad j = 1, \ldots, J,\; i = 1, \ldots, I,$$

$$G_{k|i} = \sum_{z=1}^{k} \psi_{z|i}, \quad k = 1, \ldots, K,\; i = 1, \ldots, I,$$

respectively. Then, the inequalities

$$L(F_{j|i}, G_{k|i}) \le H_{jk|i} \le U(F_{j|i}, G_{k|i}) \qquad (8)$$

hold true. Inequalities (8) imply that

$$\gamma^-_{jk|i} \le \gamma_{jk|i} \le \gamma^+_{jk|i}, \qquad (9)$$

where

$$\gamma^-_{jk|i} = L(F_{j|i}, G_{k|i}) - L(F_{j-1|i}, G_{k|i}) - L(F_{j|i}, G_{k-1|i}) + L(F_{j-1|i}, G_{k-1|i}),$$
$$\gamma^+_{jk|i} = U(F_{j|i}, G_{k|i}) - U(F_{j-1|i}, G_{k|i}) - U(F_{j|i}, G_{k-1|i}) + U(F_{j-1|i}, G_{k-1|i}).$$

Taking into account that (as a consequence of the Fréchet inequalities)

$$\gamma^-_{jk|i} \ge L(\phi_{j|i}, \psi_{k|i}), \qquad \gamma^+_{jk|i} \le U(\phi_{j|i}, \psi_{k|i}), \qquad (10)$$

it is seen that inequalities (9) are sharper than (2). By repeating verbatim the reasoning leading to (6), and using the d.f.'s $F_{j|i}$, $G_{k|i}$ instead of the probabilities $\phi_{j|i}$, $\psi_{k|i}$, the pointwise uncertainty measure (3) becomes

$$\Delta_{jk|i} = U(F_{j|i}, G_{k|i}) - L(F_{j|i}, G_{k|i}), \qquad (11)$$

and the conditional uncertainty measure (7) is replaced by

$$\Delta_{x=i} = \sum_{j=1}^{J}\sum_{k=1}^{K}\left[U(F_{j|i}, G_{k|i}) - L(F_{j|i}, G_{k|i})\right]\phi_{j|i}\,\psi_{k|i}. \qquad (12)$$

Although the notation is identical, the uncertainty measure (12) is smaller than (7), as a consequence of (10). The overall uncertainty measure $\Delta$ is the average of (12) with respect to the probabilities $\xi_i$s. Due to the discrete nature of Y, Z, the uncertainty measure (12) does depend on the d.f.'s $F_{j|i}$s, $G_{k|i}$s. As stressed in Conti et al. (2009), the same does not hold when Y and Z are continuous.
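The reduction from (7) to (12) is easy to check numerically. The following sketch (our own helper names) computes both measures for one X category:

```python
def cdf(probs):
    """Cumulative distribution function of a discrete probability vector."""
    out, s = [], 0.0
    for p in probs:
        s += p
        out.append(s)
    return out

def uncertainty_ordered(phi_i, psi_i):
    """Delta_{x=i} of (12): Frechet widths on the c.d.f.'s F, G,
    weighted by phi_{j|i} * psi_{k|i}."""
    F, G = cdf(phi_i), cdf(psi_i)
    return sum((min(Fj, Gk) - max(0.0, Fj + Gk - 1.0)) * p * q
               for Fj, p in zip(F, phi_i) for Gk, q in zip(G, psi_i))

def uncertainty_unordered(phi_i, psi_i):
    """Delta_{x=i} of (7), ignoring the ordering."""
    return sum((min(p, q) - max(0.0, p + q - 1.0)) * p * q
               for p in phi_i for q in psi_i)
```

With phi = psi = (0.5, 0.5), (7) gives 0.5 while (12) gives 0.125, in line with the inequality implied by (10).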

3. Reducing uncertainty under constraints

3.1. Structural zeros and regular domains

In several real cases, uncertainty about the joint distribution of Y and Z can be considerably reduced by introducing appropriate logical constraints among the values taken by Y and Z. Precisely, we consider constraints acting as structural zeros, i.e. constraints that set equal to 0 some of the joint probabilities $\gamma_{jk|i} = \Pr(Y = j, Z = k \mid X = i)$. Of course, this is equivalent to assuming that logical constraints ‘‘reduce’’ the support of the joint distribution of Y and Z (given X), which becomes strictly smaller than the Cartesian product of the supports of Y and Z. In the sequel, we will concentrate on structural zeros that reduce the support of Y and Z in a ‘‘regular way’’, useful to manage uncertainty.
To introduce the kind of constraints we will deal with, consider the support of (Y, Z) given X, which is a subset (either proper or improper) of {(j, k); j = 1, ..., J; k = 1, ..., K}. For each j ∈ {1, ..., J}, define the two integers:

$$k_j^+ = \text{largest integer } k \text{ such that } \gamma_{jk|i} > 0,$$
$$k_j^- = \text{smallest integer } k \text{ such that } \gamma_{jk|i} > 0.$$

Of course, there exist integers $j_1$, $j_2$ such that $k_{j_1}^+ = K$ and $k_{j_2}^- = 1$. Note that $k_j^+$, $k_j^-$ actually depend on i; the symbols $k_{j|i}^+$, $k_{j|i}^-$ would be more appropriate. However, in order to keep the notation not too heavy, we avoid including the index i in the symbols.
Analogously, for each k ∈ {1, ..., K}, define the two integers:

$$j_k^+ = \text{largest integer } j \text{ such that } \gamma_{jk|i} > 0,$$
$$j_k^- = \text{smallest integer } j \text{ such that } \gamma_{jk|i} > 0.$$

Again, there exist integers $k_1$, $k_2$ such that $j_{k_1}^+ = J$ and $j_{k_2}^- = 1$. The same considerations on the dependency of $j_k^+$ and $j_k^-$ on i hold as for $k_j^+$, $k_j^-$.

The support of (Y, Z) (given X) is Y-regular if, for all j = 1, ..., J,

$$\gamma_{jk|i} = 0 \;\text{ if and only if }\; k > k_j^+ \text{ or } k < k_j^-, \quad \forall\, i = 1, \ldots, I. \qquad (13)$$

Similarly, the support of (Y, Z) (given X) is Z-regular if, for all k = 1, ..., K,

$$\gamma_{jk|i} = 0 \;\text{ if and only if }\; j > j_k^+ \text{ or } j < j_k^-, \quad \forall\, i = 1, \ldots, I. \qquad (14)$$

To visualize the meaning of Y-regularity, consider the piecewise straight line $d_y(j)$ joining the downstairs points $(j, k_j^-)$, j = 1, ..., J, and the piecewise straight line $u_y(j)$ joining the upstairs points $(j, k_j^+)$, j = 1, ..., J. Y-regularity means that the structural zeros are exactly the points above $u_y(\cdot)$ and below $d_y(\cdot)$. Similar concepts hold in the case of Z-regularity. Let $d_z(k)$ be the piecewise straight line joining the downstairs points $(k, j_k^-)$, k = 1, ..., K, and $u_z(k)$ the piecewise straight line joining the upstairs points $(k, j_k^+)$, k = 1, ..., K. Z-regularity means that the structural zeros are exactly the points above $u_z(\cdot)$ and below $d_z(\cdot)$.
In Fig. 1 the case of a Y-regular support (which is not Z-regular) is illustrated.

Fig. 1. Support of (Y , Z ) Y -regular but not Z -regular.

It is of interest to establish when a Y-regular support is also Z-regular (and vice-versa). A graphical inspection of Fig. 2 shows that a Y-regular support is also Z-regular (and vice-versa) if and only if the following two conditions are met:

R1 no triple of indices $j_1 < j_2 < j_3$ exists such that $k_{j_1}^- < k_{j_2}^-$ and $k_{j_3}^- < k_{j_2}^-$;
R2 no triple of indices $l_1 < l_2 < l_3$ exists such that $k_{l_1}^+ > k_{l_2}^+$ and $k_{l_2}^+ < k_{l_3}^+$.

A simple sufficient condition ensuring both R1 and R2 is that $u_y(\cdot)$ and $d_y(\cdot)$ are monotonic. In the case of continuous Y, Z, this case has been dealt with in Conti et al. (2009).

3.2. Constrained lower and upper bounds

The constraints introduced on the support of the joint distribution of (Y, Z) given X can be used to improve the lower and upper bounds for $H_{jk|i}$ given in (8). To see this, suppose that the domain of (Y, Z) (given X) is Y-regular, and take a fixed j. From

$$\gamma_{jk|i} = 0 \quad \forall\, k < k_j^-,$$

it follows that

$$H_{jk|i} = H_{j-1\,k|i} \quad \forall\, k < k_j^-. \qquad (15)$$

On the opposite side, from

$$F_{j|i} - H_{jk|i} = F_{j-1|i} - H_{j-1\,k|i} \quad \forall\, k > k_j^+,$$

the relationship

$$H_{jk|i} = F_{j|i} - F_{j-1|i} + H_{j-1\,k|i} \quad \forall\, k > k_j^+ \qquad (16)$$

follows.
The two relationships (15) and (16) can be used to construct bounds for $H_{jk|i}$ better than the unconstrained bounds (8). The basic idea consists in initially computing improved bounds, based on the given constraints, for the cells in the first column of the contingency table. Then, bounds for the cells of the jth column are obtained by combining, via (15) and (16), the constraints for the cells of the jth column with the bounds already obtained for columns j − 1, j − 2, ..., 1. The improved bounds for a Y-regular domain can be formally obtained via the algorithm reported below.
Step 0 Take j = 1, and compute $k_j^-$, $k_j^+$. Define $F_{0|i} = 0$, $G_{0|i} = 0$, and

$$H^{y+}_{jk|i} = \begin{cases} 0 & \text{if } k < k_j^- \\ \min(F_{j|i}, G_{k|i}) & \text{if } k_j^- \le k \le k_j^+ \\ F_{j|i} & \text{if } k > k_j^+ \end{cases}$$

and

$$H^{y-}_{jk|i} = \begin{cases} 0 & \text{if } k < k_j^- \\ \max(0, F_{j|i} + G_{k|i} - 1) & \text{if } k_j^- \le k \le k_j^+ \\ F_{j|i} & \text{if } k > k_j^+. \end{cases}$$

Replace j by j + 1. Go to Step 1.

Fig. 2. Visual representation of conditions R1, R2.

Step 1 If j > J, then go to Step 3. Otherwise, go to Step 2.

Step 2 Compute $k_j^-$, $k_j^+$. Define

$$H^{y+}_{jk|i} = \begin{cases} H^{y+}_{j-1\,k|i} & \text{if } k < k_j^- \\ \min(F_{j|i}, G_{k|i}) & \text{if } k_j^- \le k \le k_j^+ \\ \min(F_{j|i}, G_{k|i}, F_{j|i} - F_{j-1|i} + H^{y+}_{j-1\,k|i}) & \text{if } k > k_j^+ \end{cases}$$

and

$$H^{y-}_{jk|i} = \begin{cases} \max(0, F_{j|i} + G_{k|i} - 1, H^{y-}_{j-1\,k|i}) & \text{if } k < k_j^- \\ \max(0, F_{j|i} + G_{k|i} - 1) & \text{if } k_j^- \le k \le k_j^+ \\ \max(0, F_{j|i} + G_{k|i} - 1, F_{j|i} - F_{j-1|i} + H^{y-}_{j-1\,k|i}) & \text{if } k > k_j^+. \end{cases}$$

Replace j by j + 1. Go to Step 1.

Step 3 Stop. $H^{y-}_{jk|i}$ and $H^{y+}_{jk|i}$ are the lower and upper bounds for $H_{jk|i}$, respectively.
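The Step 0–Step 3 recursion can be sketched as follows. The code is ours, assumes a feasible Y-regular support, and encodes the support through the arrays of $k_j^-$ and $k_j^+$ (1-based categories); initializing the ‘‘previous row’’ to zero makes the first pass of the Step 2 formulas reproduce Step 0.

```python
def constrained_bounds(F, G, kmin, kmax):
    """Lower/upper bounds H^{y-}, H^{y+} for the joint c.d.f. H_{jk|i}
    under a Y-regular support.  F[j-1] = F_{j|i}, G[k-1] = G_{k|i};
    kmin[j-1] = k_j^-, kmax[j-1] = k_j^+.  Returns (lo, up) as J x K lists."""
    J, K = len(F), len(G)
    lo_all, up_all = [], []
    Fprev = 0.0                 # F_{0|i} = 0
    up_prev = [0.0] * K         # H^{y+}_{0k|i} = 0
    lo_prev = [0.0] * K
    for j in range(J):
        up_row, lo_row = [0.0] * K, [0.0] * K
        for k in range(K):
            kk = k + 1          # 1-based category of Z
            if kk < kmin[j]:    # structural zeros: H_{jk} = H_{j-1,k} by (15)
                up_row[k] = up_prev[k]
                lo_row[k] = max(0.0, F[j] + G[k] - 1.0, lo_prev[k])
            elif kk <= kmax[j]:  # unconstrained Frechet bounds (8)
                up_row[k] = min(F[j], G[k])
                lo_row[k] = max(0.0, F[j] + G[k] - 1.0)
            else:               # H_{jk} = F_j - F_{j-1} + H_{j-1,k} by (16)
                up_row[k] = min(F[j], G[k], F[j] - Fprev + up_prev[k])
                lo_row[k] = max(0.0, F[j] + G[k] - 1.0,
                                F[j] - Fprev + lo_prev[k])
        lo_all.append(lo_row)
        up_all.append(up_row)
        Fprev, up_prev, lo_prev = F[j], up_row, lo_row
    return lo_all, up_all
```

For example, with F = G = (0.3, 0.7, 1.0) and an upper-triangular set of structural zeros ($k_1^+ = 1$, $k_2^+ = 2$), the constrained bounds pin $H_{12|i}$ to 0.3 exactly, whereas the unconstrained interval (8) for that cell is [0, 0.3].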

The same reasoning holds also when the support of (Y, Z) given X is Z-regular: it is enough to exchange the roles of Y and Z. Let us denote by $H^{z-}_{jk|i}$ and $H^{z+}_{jk|i}$ the lower and upper bounds for $H_{jk|i}$, respectively, in the case of Z-regularity.

Finally, define:

$$H^+_{jk|i} = \begin{cases} \min(H^{y+}_{jk|i}, H^{z+}_{jk|i}) & \text{if the support of } (Y, Z \mid X) \text{ is both } Y\text{- and } Z\text{-regular}, \\ H^{y+}_{jk|i} & \text{if the support of } (Y, Z \mid X) \text{ is only } Y\text{-regular}, \\ H^{z+}_{jk|i} & \text{if the support of } (Y, Z \mid X) \text{ is only } Z\text{-regular}, \end{cases}$$

and

$$H^-_{jk|i} = \begin{cases} \max(H^{y-}_{jk|i}, H^{z-}_{jk|i}) & \text{if the support of } (Y, Z \mid X) \text{ is both } Y\text{- and } Z\text{-regular}, \\ H^{y-}_{jk|i} & \text{if the support of } (Y, Z \mid X) \text{ is only } Y\text{-regular}, \\ H^{z-}_{jk|i} & \text{if the support of } (Y, Z \mid X) \text{ is only } Z\text{-regular}. \end{cases}$$
Taking into account that

$$H^+_{jk|i} \le \min(F_{j-1|i}, G_{k|i}) \le \min(F_{j|i}, G_{k|i}) \quad \forall\, k < k_j^-,$$
$$H^-_{jk|i} \ge F_{j|i} - F_{j-1|i} + \max(0, F_{j-1|i} + G_{k|i} - 1) \ge \max(0, F_{j|i} + G_{k|i} - 1) \quad \forall\, k > k_j^+,$$

it is immediate to see that the pair of inequalities

$$H^+_{jk|i} \le U(F_{j|i}, G_{k|i}), \qquad H^-_{jk|i} \ge L(F_{j|i}, G_{k|i}) \quad \forall\, j = 1, \ldots, J,\; k = 1, \ldots, K,$$

holds, so that $H^+_{jk|i}$, $H^-_{jk|i}$ improve the unconstrained bounds in (8) whenever the support of (Y, Z) given X is Y-regular. From the algorithm used to compute the constrained upper and lower bounds $H^+_{jk|i}$, $H^-_{jk|i}$, respectively, it follows that $H^+_{jk|i}$ and $H^-_{jk|i}$ are continuous functions of $F_{1|i}, \ldots, F_{j|i}, G_{1|i}, \ldots, G_{k|i}$. Formally

$$H^+_{jk|i} = f^+_{jk|i}(F_{1|i}, \ldots, F_{J|i}, G_{1|i}, \ldots, G_{K|i}), \qquad (17)$$
$$H^-_{jk|i} = f^-_{jk|i}(F_{1|i}, \ldots, F_{J|i}, G_{1|i}, \ldots, G_{K|i}), \qquad (18)$$

where the two functions $f^+_{jk|i}$, $f^-_{jk|i}$ are:
(a) continuous in $\mathbb{R}^{J+K}$;
(b) differentiable (with continuous derivatives) for all but a finite set of points of $\mathbb{R}^{J+K}$. More precisely, the ‘‘non-differentiability’’ points are those where two or more elements in the max(·) and/or min(·) terms defining $H^{y+}_{jk|i}$, $H^{z+}_{jk|i}$, $H^{y-}_{jk|i}$, $H^{z-}_{jk|i}$ are equal. Again, the non-differentiability points only depend on the marginal d.f.'s $F_{j|i}$, $G_{k|i}$, and on the constraints, as well.
As a consequence of the theory developed so far, the conditional and unconditional measures of uncertainty under the constraints represented by structural zeros now take the form

$$\Delta^{x=i}_c = \sum_{j=1}^{J}\sum_{k=1}^{K}\left[H^+_{jk|i} - H^-_{jk|i}\right]\phi_{j|i}\,\psi_{k|i}, \qquad (19)$$

$$\Delta_c = \sum_{i=1}^{I} \Delta^{x=i}_c\, \xi_i, \qquad (20)$$

respectively, where the subscript c denotes the constraint.
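Given constrained bounds $H^-_{jk|i}$, $H^+_{jk|i}$ (however obtained), (19) and (20) are straightforward weighted sums. A minimal sketch with names of our own:

```python
def constrained_conditional_uncertainty(H_lo, H_up, phi_i, psi_i):
    """Delta_c^{x=i} of (19): widths of the constrained bounds on H_{jk|i},
    weighted by phi_{j|i} * psi_{k|i}."""
    return sum((H_up[j][k] - H_lo[j][k]) * phi_i[j] * psi_i[k]
               for j in range(len(phi_i)) for k in range(len(psi_i)))

def constrained_overall_uncertainty(cond_deltas, xi):
    """Delta_c of (20): xi-weighted average over the categories of X."""
    return sum(d * x for d, x in zip(cond_deltas, xi))
```

For instance, if only one cell has a nonzero bound width of 0.5 and carries weight 0.25, the conditional measure is 0.125.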

Example 1. Let u(j) be a monotone increasing function and suppose that d(j) = 1 for j = 1, ..., J, so that the bounds are one-sided (upper bounds). As a consequence, the structural zeros are all points above u(·) (here $k_j = u(j)$). From the relationship

$$H_{jk|i} = H_{j k_j|i} \quad \forall\, k > k_j, \qquad (21)$$

we obtain

$$H_{jk|i} \le \min(F_{j|i}, G_{k|i}), \qquad H_{jk|i} \le \min(F_{j|i}, G_{k_j|i}), \qquad (22)$$

and then

$$H_{jk|i} \le \min(F_{j|i}, G_{k|i}, G_{k_j|i}) = \min(F_{j|i}, G_{k_j|i}), \quad k > k_j. \qquad (23)$$

It is straightforward to prove that the lower bound does not improve. In fact, if $k > k_j$,

$$H_{jk|i} \ge \max(0, F_{j|i} + G_{k|i} - 1), \qquad H_{jk|i} \ge \max(0, F_{j|i} + G_{k_j|i} - 1), \qquad (24)$$

then

$$H_{jk|i} \ge \max(0, F_{j|i} + G_{k|i} - 1, F_{j|i} + G_{k_j|i} - 1) = \max(0, F_{j|i} + G_{k|i} - 1). \qquad (25)$$

Example 2. Let d(j) be a monotone decreasing function and suppose that u(j) = K for j = 1, ..., J, so that the bounds are one-sided (lower bounds). As a consequence, the structural zeros are all points below d(·) (here $k_j = d(j)$). From the relationship

$$H_{jk|i} = H_{j k_j|i} \quad \forall\, k < k_j, \qquad (26)$$

we obtain

$$H_{jk|i} \ge \max(0, F_{j|i} + G_{k|i} - 1), \qquad H_{jk|i} \ge \max(0, F_{j|i} + G_{k_j|i} - 1), \qquad (27)$$

and then

$$H_{jk|i} \ge \max(0, F_{j|i} + G_{k|i} - 1, F_{j|i} + G_{k_j|i} - 1) = \max(0, F_{j|i} + G_{k_j|i} - 1) \quad \forall\, k < k_j. \qquad (28)$$

It is straightforward to prove that the upper bound does not improve. In fact, if $k < k_j$,

$$H_{jk|i} \le \min(F_{j|i}, G_{k|i}), \qquad H_{jk|i} \le \min(F_{j|i}, G_{k_j|i}), \qquad (29)$$

then

$$H_{jk|i} \le \min(F_{j|i}, G_{k|i}, G_{k_j|i}) = \min(F_{j|i}, G_{k|i}), \quad k < k_j. \qquad (30)$$

4. Estimation of the measure(s) of uncertainty

An important feature of the measures of uncertainty introduced so far is that they can be estimated on the basis of sample data. Let $n^x_{A,i}$ ($n^x_{B,i}$) be the number of sample observations in sample A (B) such that X = i, and let $n^{xy}_{A,ij}$ ($n^{xz}_{B,ik}$) be the number of observations in sample A (B) such that X = i and Y = j (X = i and Z = k), i = 1, ..., I, j = 1, ..., J, k = 1, ..., K. The probabilities $\xi_i$, $\phi_{j|i}$, $\psi_{k|i}$ can then be estimated by the corresponding sample proportions

$$\widehat{\xi}_i = \frac{n^x_{A,i} + n^x_{B,i}}{n_A + n_B}, \quad i = 1, \ldots, I;$$

$$\widehat{\phi}_{j|i} = \frac{n^{xy}_{A,ij}}{n^x_{A,i}}, \quad i = 1, \ldots, I,\; j = 1, \ldots, J;$$

$$\widehat{\psi}_{k|i} = \frac{n^{xz}_{B,ik}}{n^x_{B,i}}, \quad i = 1, \ldots, I,\; k = 1, \ldots, K.$$

Furthermore, the c.d.f.'s $F_{j|i}$, $G_{k|i}$ can be estimated by the corresponding empirical cumulative distribution functions (e.c.d.f.'s):

$$\widehat{F}_{j|i} = \frac{n^{xy}_{A,i1} + \cdots + n^{xy}_{A,ij}}{n^x_{A,i}}, \quad i = 1, \ldots, I,\; j = 1, \ldots, J;$$

$$\widehat{G}_{k|i} = \frac{n^{xz}_{B,i1} + \cdots + n^{xz}_{B,ik}}{n^x_{B,i}}, \quad i = 1, \ldots, I,\; k = 1, \ldots, K.$$
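The sample proportions and e.c.d.f.'s above are simple counts. A sketch assuming the samples are stored as lists of pairs with 1-based categories (the function name and data layout are ours):

```python
from collections import Counter

def plugin_estimates(sample_A, sample_B, I, J, K):
    """Sample proportions xi-hat, phi-hat, psi-hat and e.c.d.f.'s F-hat, G-hat.
    sample_A holds (x, y) pairs, sample_B holds (x, z) pairs."""
    nA, nB = len(sample_A), len(sample_B)
    nAx = Counter(x for x, _ in sample_A)   # n^x_{A,i}
    nBx = Counter(x for x, _ in sample_B)   # n^x_{B,i}
    nAxy = Counter(sample_A)                # n^{xy}_{A,ij}
    nBxz = Counter(sample_B)                # n^{xz}_{B,ik}
    xi = [(nAx[i] + nBx[i]) / (nA + nB) for i in range(1, I + 1)]
    phi = [[nAxy[(i, j)] / nAx[i] for j in range(1, J + 1)]
           for i in range(1, I + 1)]
    psi = [[nBxz[(i, k)] / nBx[i] for k in range(1, K + 1)]
           for i in range(1, I + 1)]
    F = [[sum(row[:j + 1]) for j in range(J)] for row in phi]  # e.c.d.f.'s
    G = [[sum(row[:k + 1]) for k in range(K)] for row in psi]
    return xi, phi, psi, F, G
```

For instance, a sample A of four (x, y) records with one Y = 1 and three Y = 2 gives $\widehat{\phi}_{1|1} = 0.25$ and $\widehat{F}_{2|1} = 1$.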
As a consequence, using (17), (18), the upper and lower bounds $H^+_{jk|i}$, $H^-_{jk|i}$ for $H_{jk|i}$ can be estimated by

$$\widehat{H}^+_{jk|i} = f^+_{jk|i}(\widehat{F}_{1|i}, \ldots, \widehat{F}_{J|i}, \widehat{G}_{1|i}, \ldots, \widehat{G}_{K|i}), \qquad (31)$$
$$\widehat{H}^-_{jk|i} = f^-_{jk|i}(\widehat{F}_{1|i}, \ldots, \widehat{F}_{J|i}, \widehat{G}_{1|i}, \ldots, \widehat{G}_{K|i}), \qquad (32)$$

respectively.
Hence, the conditional and unconditional measures of uncertainty can be estimated by

$$\widehat{\Delta}^{x=i}_c = \sum_{j=1}^{J}\sum_{k=1}^{K}\left[\widehat{H}^+_{jk|i} - \widehat{H}^-_{jk|i}\right]\widehat{\phi}_{j|i}\,\widehat{\psi}_{k|i}, \qquad (33)$$

$$\widehat{\Delta}_c = \sum_{i=1}^{I} \widehat{\Delta}^{x=i}_c\, \widehat{\xi}_i, \qquad (34)$$

respectively. The consistency of the estimators (33) and (34) is established in Proposition 1.



Proposition 1. Assume that $n_A/(n_A + n_B) \to \alpha$ as $n_A$, $n_B$ go to infinity, with $0 < \alpha < 1$. Then $\widehat{\Delta}^{x=i}_c$, $\widehat{\Delta}_c$ converge almost surely (a.s.) to $\Delta^{x=i}_c$, $\Delta_c$, respectively. In symbols:

$$\widehat{\Delta}^{x=i}_c \xrightarrow{a.s.} \Delta^{x=i}_c \quad \text{as } n_A \to \infty,\; n_B \to \infty,\; i = 1, \ldots, I; \qquad (35)$$
$$\widehat{\Delta}_c \xrightarrow{a.s.} \Delta_c \quad \text{as } n_A \to \infty,\; n_B \to \infty. \qquad (36)$$

Proof. First of all, from the strong law of large numbers it follows that

$$\widehat{\xi}_i \xrightarrow{a.s.} \xi_i, \quad \widehat{\phi}_{j|i} \xrightarrow{a.s.} \phi_{j|i}, \quad \widehat{\psi}_{k|i} \xrightarrow{a.s.} \psi_{k|i}, \qquad (37)$$
$$\widehat{F}_{j|i} \xrightarrow{a.s.} F_{j|i}, \quad \widehat{G}_{k|i} \xrightarrow{a.s.} G_{k|i}, \qquad (38)$$

as $n_A \to \infty$, $n_B \to \infty$. From (38), taking into account that the functions $f^+_{jk|i}$, $f^-_{jk|i}$ are continuous, it is immediate to deduce that

$$\widehat{H}^+_{jk|i} \xrightarrow{a.s.} H^+_{jk|i}, \quad \widehat{H}^-_{jk|i} \xrightarrow{a.s.} H^-_{jk|i} \quad \text{as } n_A \to \infty,\; n_B \to \infty. \qquad (39)$$

Relationship (35) is an easy consequence of (39) and (37). Relationship (36), in its turn, is a consequence of (35) and (37).
In the second place, using the piecewise differentiability of $f^+_{jk|i}$, $f^-_{jk|i}$, the estimators $\widehat{H}^+_{jk|i}$s, $\widehat{H}^-_{jk|i}$s are jointly asymptotically normally distributed, provided that the ‘‘true’’ $F_{j|i}$s, $G_{k|i}$s are not ‘‘non-differentiability’’ points as mentioned in (b) of Section 3.2. As a consequence, the following proposition holds.

Proposition 2. Assume that $n_A/(n_A + n_B) \to \alpha$ as $n_A$, $n_B$ go to infinity, with $0 < \alpha < 1$, and that the $F_{j|i}$s, $G_{k|i}$s satisfy the differentiability condition for the $f^+_{jk|i}$s, $f^-_{jk|i}$s. Then

$$\sqrt{\frac{n^x_{A,i}\, n^x_{B,i}}{n^x_{A,i} + n^x_{B,i}}}\,\left(\widehat{\Delta}^{x=i}_c - \Delta^{x=i}_c\right) \qquad (40)$$

does have a normal asymptotic distribution with mean zero and positive variance $\sigma_i^2$ as $n_A$, $n_B$ tend to infinity. Similarly, the variate

$$\sqrt{\frac{n_A\, n_B}{n_A + n_B}}\,\left(\widehat{\Delta}_c - \Delta_c\right) \qquad (41)$$

possesses a normal asymptotic distribution with mean zero and positive variance $\sigma^2$ as $n_A$, $n_B$ tend to infinity.

Proof. Consider the r.v.

$$\sqrt{\frac{n^x_{A,i}\, n^x_{B,i}}{n^x_{A,i} + n^x_{B,i}}}\,\left(\widehat{H}^+_{jk|i} - H^+_{jk|i},\; \widehat{H}^-_{jk|i} - H^-_{jk|i}\right), \quad j = 1, \ldots, J,\; k = 1, \ldots, K. \qquad (42)$$

Since the functions $f^+_{jk|i}$s, $f^-_{jk|i}$s are differentiable, the r.v. (42) does possess a (possibly degenerate) multivariate asymptotic normal distribution with null mean vector, as $n^x_{A,i}$, $n^x_{B,i}$ go to infinity. Next, using the delta method, and taking into account that the $\widehat{\phi}_{j|i}$s, $\widehat{\psi}_{k|i}$s are asymptotically normally distributed (if properly rescaled), it is immediate to conclude that the r.v. (40) possesses a normal asymptotic distribution. The same holds for the r.v. (41).
The asymptotic variances $\sigma_i^2$s, $\sigma^2$ do have a complicated form, depending on the ‘‘true’’ $F_{j|i}$s, $G_{k|i}$s. However, they can be consistently estimated by the bootstrap method, which works as follows:
1. generate from the e.d.f. of sample A a bootstrap sample of size $n_A$;
2. generate from the e.d.f. of sample B a bootstrap sample of size $n_B$;
3. use the samples generated in steps 1, 2 to compute the ‘‘bootstrap version’’ $\widehat{\Delta}^{x=i*}_c$ of $\widehat{\Delta}^{x=i}_c$.

Steps 1–3 are repeated M times, so that the M bootstrap values $\widehat{\Delta}^{x=i*}_{c,m}$, m = 1, ..., M, are obtained. Let $\bar{\Delta}^{x=i}_c$ be their average, and let $S^2_M$ be their variance:

$$\bar{\Delta}^{x=i}_c = \frac{1}{M}\sum_{m=1}^{M} \widehat{\Delta}^{x=i*}_{c,m}, \qquad S^2_M = \frac{1}{M-1}\sum_{m=1}^{M}\left(\widehat{\Delta}^{x=i*}_{c,m} - \bar{\Delta}^{x=i}_c\right)^2.$$

As an estimate of $\sigma_i^2$, we may take

$$\widehat{\sigma}^2_{i,M} = \frac{n^x_{A,i}\, n^x_{B,i}}{n^x_{A,i} + n^x_{B,i}}\, S^2_M. \qquad (43)$$
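Steps 1–3 and the rescaling (43) can be sketched as follows; `statistic` stands for the map from the two resampled files to $\widehat{\Delta}^{x=i}_c$, whose implementation depends on the bounds algorithm of Section 3 (all names here are ours):

```python
import random

def bootstrap_variance(sample_A, sample_B, statistic, M=200, seed=0):
    """S^2_M: variance of M bootstrap replications of `statistic`,
    resampling A and B independently from their e.d.f.'s (steps 1-3)."""
    rng = random.Random(seed)
    values = []
    for _ in range(M):
        A_star = rng.choices(sample_A, k=len(sample_A))  # step 1
        B_star = rng.choices(sample_B, k=len(sample_B))  # step 2
        values.append(statistic(A_star, B_star))         # step 3
    mean = sum(values) / M
    return sum((v - mean) ** 2 for v in values) / (M - 1)

def variance_estimate(nAi, nBi, S2M):
    """sigma-hat^2_{i,M} of (43): rescale S^2_M by n_A n_B / (n_A + n_B)."""
    return nAi * nBi / (nAi + nBi) * S2M
```

As a sanity check, a constant statistic yields a zero bootstrap variance, and (43) simply rescales $S^2_M$ by the effective sample size.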

Table 1
Response categories for the variables (X, Y, Z).

Variable   Response categories
X          1 = 15–17 years old; 2 = 18–22; 3 = 23–64; 4 = 65 and above
Y          1 = None or compulsory school; 2 = Vocational school; 3 = Secondary school; 4 = Degree
Z          1 = Worker; 2 = Clerk; 3 = Manager

Table 2
Distribution of (X, Y) in file A.

X        Y=1    Y=2    Y=3    Y=4    Total
1          6      0      –      –        6
2         14      6     13      –       33
3        387    102    464    158     1111
4         10      0      3      2       15
Total    417    108    480    160     1165

Table 3
Distribution of (X, Z) in file B.

X        Z=1    Z=2    Z=3    Total
1          9      –      –        9
2         17      5      –       22
3        486    443    179     1108
4          2      1      6        9
Total    514    449    185     1148

From (43) it is also easy to construct an estimate of the unconditional variance $\sigma^2$. The functional $\widehat{\Delta}$ is continuous, asymptotically normal, and possesses finite moments of all orders. Using results in Bickel and Freedman (1981), the consistency of the variance estimator follows.
The above results are useful to construct point and interval estimates of the uncertainty measures $\Delta^{x=i}_c$, $\Delta_c$. They are also useful to test the hypothesis that the class of bivariate d.f.'s with upper bounds $H^+_{jk|i}$s and lower bounds $H^-_{jk|i}$s is ‘‘narrow’’, when structural zeros are considered.

5. Simulation study

In order to evaluate the uncertainty for ordered categorical variables, a simulation experiment has been performed. The data considered are in D'Orazio et al. (2006a). A subset of 2,313 employees (people at least 15 years old) has been extracted from the 2000 pilot survey of the Italian Population and Household Census. Three variables have been analyzed: Age (X), Educational Level (Y) and Professional Status (Z). For the sake of simplicity, and without loss of information with respect to our aim, the original variables have been transformed by grouping homogeneous response categories. The results of this grouping are shown in Table 1.
To reproduce the statistical matching situation, the original file has been randomly split into two almost equal subsets. The variable Z has been removed from the first subset (file A), containing 1,165 units, and the variable Y has been removed from the second subset (file B), consisting of the remaining 1,148 observations.
Tables 2 and 3 show the joint distributions of (X, Y) and (X, Z) in files A and B, respectively, after the original dataset has been split. Structural zeros are denoted by ‘‘–’’ throughout the paper.
There are 48 (= 4 × 4 × 3) data patterns, many of which are structural zeros. Note that each structural zero in a marginal table implies a set of structural zeros on the joint distribution of (Y, Z) given X. Furthermore, the joint distribution has some additional structural zeros that cannot be inferred from the marginals in Tables 2 and 3. Conditionally on x = i, constraints acting as structural zeros are introduced. Such constraints reduce the support of the joint distribution of (Y, Z | X). More specifically, we consider the following constraints:
1. conditionally on x = 1, the joint distribution of (Y, Z) is described in Table 4;
2. conditionally on x = 2, the joint distribution of (Y, Z) is described in Table 5;
3. conditionally on x = 3, 4, the joint distribution of (Y, Z) is described in Table 6.
In the simulation, the category x = 4 has been collapsed into the category x = 3 for two reasons. First, the sample sizes for x = 4 in Tables 2 and 3 are too small. Second, the same kind of constraints (Table 6) are imposed on the joint distribution of (Y, Z) given X. The simulation involves the following steps:
1. a sample A composed of $n_A$ i.i.d. records has been generated from Table 2;

Table 4
Structural zeros for the distribution of (Y, Z)|x = 1.

Y \ Z    1    2    3
1             –    –
2             –    –
3        –    –    –
4        –    –    –

Table 5
Structural zeros for the distribution of (Y, Z)|x = 2.

Y \ Z    1    2    3
1             –    –
2                  –
3                  –
4        –    –    –

Table 6
Structural zeros for the distribution of (Y, Z)|x for x = 3, 4.

Y \ Z    1    2    3
1             –    –
2                  –
3
4
2. a sample B composed of nB i.i.d. records has been generated from Table 3.
3. conditionally on x = i (for i = 1, 2, 3) the pointwise uncertainty measure (11) in (i, j, k) is estimated by

   U(F̂_{j|i}, Ĝ_{k|i}) − L(F̂_{j|i}, Ĝ_{k|i}),   (44)

   where F̂_{j|i} and Ĝ_{k|i} are the empirical distribution functions estimating F_{j|i} and G_{k|i}, respectively.
4. conditionally on x = i (for i = 1, 2, 3) the uncertainty measure when no external auxiliary information in the form of structural zeros is available is obtained as a weighted average of the pointwise uncertainty measures (44). Formally, from (12) we obtain

   ∆̂_{x=i} = Σ_{j=1}^{J} Σ_{k=1}^{K} [U(F̂_{j|i}, Ĝ_{k|i}) − L(F̂_{j|i}, Ĝ_{k|i})] φ̂_{j|i} ψ̂_{k|i},   (45)

   where φ̂_{j|i} = F̂_{j|i} − F̂_{j−1|i} and ψ̂_{k|i} = Ĝ_{k|i} − Ĝ_{k−1|i}.


5. let n_{A,i} and n_{B,i} be the number of units such that x = i in samples A and B, respectively. Then the overall unconditional uncertainty measure is given by

   ∆̂ = Σ_{i=1}^{I} ∆̂_{x=i} ξ̂_i,   (46)

   where

   ξ̂_i = (n_{A,i} + n_{B,i}) / (n_A + n_B).   (47)

6. conditionally on x = i (for i = 1, 2, 3) the constraints in Tables 4–6 acting as structural zeros are introduced. According to the algorithm described in Section 3, the conditional and unconditional uncertainty measures ∆̂_c^{x=i} and ∆̂_c are estimated by (33) and (34), respectively.
7. steps 1–6 have been repeated 500 times, for sample sizes n_A = n_B = n = 1000, 4000, 8000.

Given n, for each sample m (for m = 1, . . . , 500) and for each category i (for i = 1, 2, 3), denote by ∆̂_{c,m}^{x=i} and ∆̂_c^m the uncertainty measures (33) and (34), respectively.
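Steps 3–5 can be sketched as follows, under the assumption (standard in this framework) that the pointwise bounds L and U in (11) are the Fréchet bounds max(0, F + G − 1) and min(F, G) on the joint d.f. The function names are illustrative reconstructions, not the authors' code.

```python
import numpy as np

def uncertainty_given_x(y_sample, z_sample, y_cats, z_cats):
    """Estimate the conditional uncertainty measure (45): a weighted average
    of the pointwise Frechet-bound widths U - L in (44), computed from the
    empirical d.f.s of Y and Z within one X-category."""
    y, z = np.asarray(y_sample), np.asarray(z_sample)
    F = np.array([np.mean(y <= j) for j in y_cats])  # F_hat(j | x=i)
    G = np.array([np.mean(z <= k) for k in z_cats])  # G_hat(k | x=i)
    phi = np.diff(np.concatenate(([0.0], F)))        # cell probabilities of Y
    psi = np.diff(np.concatenate(([0.0], G)))        # cell probabilities of Z
    total = 0.0
    for Fj, ph in zip(F, phi):
        for Gk, ps in zip(G, psi):
            width = min(Fj, Gk) - max(0.0, Fj + Gk - 1.0)  # U - L, eq. (44)
            total += width * ph * ps
    return total

def overall_uncertainty(per_x_measures, x_counts):
    """Overall measure (46): conditional measures weighted by the
    X-category shares (47) pooled over samples A and B."""
    w = np.asarray(x_counts, dtype=float)
    return float(np.dot(per_x_measures, w / w.sum()))
```

As a sanity check, the conditional measure vanishes whenever one of the two variables is degenerate (its d.f. is 0/1), since then U − L = 0 at every cell carrying positive mass.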

Table 7
The actual uncertainty measures ∆_{x=i} and ∆_c^{x=i}.

i    ∆_{x=i}    ∆_c^{x=i}
1    0          0
2    0.108      0.108
3    0.150      0.127

Fig. 3. Uncertainty measures ∆_c, ∆_c^{x=i} and the estimated uncertainty measures ∆̄_c^{x=i}, for each x = i.

ˆ cx=i,m , ∆
If we refer to the conditional uncertainty measure estimates ∆ ˆmc , we have that the average over simulation runs
is given by
500
1 
∆c
x=i
= ˆ xc=i,m ,
∆ (48)
500 m=1

and
500
m

∆c = ∆c (49)
m=1

where ∆c represents the corresponding overall uncertainty under the constraint. The standard deviation over simulation
runs is

500

 1 
SD(∆c ) = 
ˆ x =i ˆ xc=i,m − ∆xc=i )2 ,
(∆ (50)
499 m=1

and finally the corresponding mean square error is


x =i
MSE(∆
ˆ xc=i ) = [SD(∆
ˆ xc=i )]2 + (∆c − ∆xc=i )2 . (51)

The value ∆_c^{x=i} in (51) is the actual conditional uncertainty measure for the ith category under the constraints. ∆_c^{x=i} has been numerically computed by generating N = 100 000 i.i.d. records from (X, Y) and (X, Z) according to the distributions specified in Tables 2 and 3, respectively. The same procedure has been used to evaluate the uncertainty measure ∆_{x=i} (for i = 1, 2, 3) according to simulation steps 3, 4, 5.
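Given the 500 replicates, the summaries (48)–(51) reduce to elementary sample statistics; a minimal sketch follows, with an illustrative function name and synthetic replicate values (not the paper's data):

```python
import numpy as np

def mc_summary(estimates, true_value):
    """Average (48)-(49), standard deviation (50) and mean square error (51)
    of replicated uncertainty estimates against the actual measure."""
    est = np.asarray(estimates, dtype=float)
    mean = est.mean()                          # simulation average, (48)/(49)
    sd = est.std(ddof=1)                       # (50): divisor m - 1 = 499
    mse = sd ** 2 + (mean - true_value) ** 2   # (51): variance + squared bias
    return mean, sd, mse
```

Note that (51) decomposes the error into the Monte Carlo variance and the squared bias with respect to the actual measure ∆_c^{x=i}.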

6. Simulation results

The actual uncertainty measures ∆_{x=i} and ∆_c^{x=i}, numerically computed, are reported in Table 7.
Fig. 4. Standard deviation of estimated uncertainty measures under the constraint, for each x = i.

Fig. 5. Mean square error of estimated uncertainty measures under the constraint, for each x = i.
Conditionally on x = 1, the structural zeros on the joint distribution of (Y, Z) come from the structural zeros in the marginal Tables 2 and 3. As a consequence ∆_{x=1} = ∆_c^{x=1}, and such a value is zero because of the sampling zero in the cell (X, Y) = (1, 2) in Table 2.
Analogously, conditionally on x = 2 the possible data patterns are obtained by setting Y = 1, 2, 3 and Z = 1, 2. Then, the only structural zero referring to the joint distribution of (Y, Z) (given X) is γ_{12|2} = 0. Clearly, the reduction in the model uncertainty depends on how informative the constraint is, that is, on how large the reduction in the support of Y and Z (given X) is. In this case, the structural zero γ_{12|2} = 0 is not informative because of the small sample size, so the class of possible distributions is not reduced at all.
As expected, conditionally on x = 3 the structural zeros described in Table 6 reduce uncertainty from ∆_{x=3} = 0.150 to ∆_c^{x=3} = 0.127.
Fig. 6. Density estimate of overall uncertainty measure under the constraint.

Fig. 7. Density estimation for categories x = 2, 3 (panels a and b).
Furthermore, given the actual uncertainty measures ∆_{x=i} and ∆_c^{x=i} in Table 7, the actual overall uncertainty measures are ∆ = 0.148 and ∆_c = 0.125, respectively.
In Fig. 3 the expectations ∆̄_c^{x=i} of the estimated uncertainty measures, given by (48), are reported. More specifically, the horizontal line is the overall unconditional uncertainty measure ∆_c = 0.125, the circles represent the conditional uncertainty measures ∆_c^{x=i} as reported in Table 7, while the dashed, dotted and solid lines depict ∆̄_c^{x=i} for sample sizes n = 1000, 4000, 8000, respectively. Note that, conditionally on x = 2, 3, the estimates become more precise as the sample size increases; this is most evident for x = 2, since that category is characterized by a smaller sample size.
In Figs. 4 and 5 the standard deviation (50) and the mean square error (51) of our uncertainty estimators are reported. The highest estimation efficiency is obtained for the class x = 3, and the lowest for the class x = 2, because of its smaller sample size. At any rate, we stress that the mean square errors of the conditional uncertainty estimators are all considerably small.
Finally, in Fig. 6 the kernel density estimate of the overall uncertainty measure under the constraint is shown. Such an estimate has been computed, for each sample size n, from the 500 values ∆̂_c given by (34). Note that, as the sample size n increases, the distribution of the uncertainty measure tends to a normal distribution. Similar considerations hold for the estimated conditional uncertainty measures ∆̂_c^{x=i} given by (33), as Fig. 7 shows. The bandwidth selection rule is that of Sheather and Jones (1991).
7. Conclusions

The extension of the uncertainty measure to multivariate X and/or Y and/or Z can be based on multivariate d.f.'s. Let F_{j|i} and G_{k|i} be the multivariate d.f.'s of Y and Z given X = i, where j, k, i are multidimensional indices. The measure of uncertainty now becomes

   ∆ = Σ_i Σ_j Σ_k [U(F_{j|i}, G_{k|i}) − L(F_{j|i}, G_{k|i})] φ_{j|i} ψ_{k|i} ξ_i.   (52)

From (52) it follows that the theory developed in the univariate case extends to the multivariate case.

References

Aluja-Banet, T., Daunis-i-Estadella, J., Pellicer, D., 2007. GRAFT, a complete system for data fusion. Computational Statistics and Data Analysis 52, 635–649.
Bickel, P.J., Freedman, D.A., 1981. Some asymptotic theory for the bootstrap. Annals of Statistics 9, 1196–1217.
Conti, P.L., Marella, D., Scanu, M., 2008. Evaluation of matching noise for imputation techniques based on nonparametric local linear regression estimators.
Computational Statistics and Data Analysis 53, 354–365.
Conti, P.L., Marella, D., Scanu, M., 2009. How far from identifiability? A nonparametric approach to uncertainty in statistical matching under logical
constraints. Technical Report n. 22, DSPSA, Università di Roma La Sapienza.
Conti, P.L., Marella, D., Scanu, M., 2012. Uncertainty analysis in statistical matching. Journal of Official Statistics 28, 1–21.
Dobra, A., Fienberg, S.E., 2001. Bounds for cell entries in contingency tables induced by fixed marginal totals with applications to disclosure limitation. Statistical Journal of the United Nations ECE 18, 363–371.
D’Orazio, M., Di Zio, M., Scanu, M., 2006a. Statistical matching for categorical data: displaying uncertainty and using logical constraints. Journal of Official
Statistics 22 (1), 137–157.
D’Orazio, M., Di Zio, M., Scanu, M., 2006b. Statistical Matching: Theory and Practice. Wiley.
Gilula, Z., McCulloch, R.E., Rossi, P.E., 2006. Direct approach to data fusion. Journal of Marketing Research 43, 73–83.
Kadane, J.B., 1978. Some statistical problems in merging data files. In: Compendium of Tax Research, Department of Treasury, US Government Printing Office, Washington D.C., pp. 159–179. Reprinted in 2001 in Journal of Official Statistics 17, 423–433.
Kamakura, W., Wedel, M., 1997. Statistical data fusion for cross-tabulation. Journal of Marketing Research 34, 485–498.
King, G., 1997. A Solution to the Ecological Inference Problem. Princeton University Press, Princeton.
Manski, C.F., 1995. Identification Problems in the Social Sciences. Harvard University Press, Cambridge, MA.
Manski, C.F., 2003. Partial Identification of Probability Distributions. Springer-Verlag, New York.
Moriarity, C., Scheuren, F., 2001. Statistical matching: a paradigm of assessing the uncertainty in the procedure. Journal of Official Statistics 17, 407–422.
Okner, B.A., 1972. Constructing a new microdata base from existing microdata sets: the 1966 merge file. Annals of Economic and Social Measurement 1,
325–362.
Raessler, S., 2002. Statistical Matching: A Frequentist Theory, Practical Applications and Alternative Bayesian Approaches. In: Lecture Notes in Statistics, Springer-Verlag, New York.
Rodgers, W.L., 1984. An evaluation of statistical matching. Journal of Business and Economic Statistics 2, 91–102.
Rubin, D.B., 1986. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics 4, 87–94.
Saporta, G., 2002. Data fusion and data grafting. Computational Statistics and Data Analysis 38, 465–473.
Sheather, S.J., Jones, M.C., 1991. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society,
Series B 53, 683–690.
Sims, C.A., 1972. Comments and rejoinder (on Okner (1972)). Annals of Economic and Social Measurement 1, 343–345 and 355–357.
Singh, A.C., Mantel, H., Kinack, M., Rowe, G., 1993. Statistical matching: use of auxiliary information as an alternative to the conditional independence
assumption. Survey Methodology 19, 59–79.
Tonkin, R., Webber, D., 2012. Statistical matching of EU-SILC and household budget survey to compare poverty estimates using income, expenditures and
material deprivation. In: EU-SILC International Conference, Vienna. December 6–7, 2012.
