
Statistics & Probability Letters 59 (2002) 219–225

Empirical simulation extrapolation for measurement error models with replicate measurements
Viswanath Devanarayan^{a,*}, Leonard A. Stefanski^{b}
^a Lilly Research Laboratories, Eli Lilly & Company, Indianapolis, IN 46285, USA
^b Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA
Received March 2001; received in revised form February 2002

Abstract

We present a variation of the simex algorithm (J. Amer. Statist. Assoc. 89 (1994) 1314) appropriate for the
case in which the measurement error variance(s) are unknown but replicate measurements are available. The
method uses pseudo errors generated from random linear contrasts of the observed replicate measurements.
An attractive feature of the new method is its ability to accommodate heteroscedastic measurement error.
© 2002 Elsevier Science B.V. All rights reserved.

Keywords: Errors-in-variables; Heteroscedasticity; Logistic regression; Method of moments; Simulation; Variance components

1. Introduction

We consider heteroscedastic measurement error models with replicate measurements. The observed
data are $\{Y_i, Z_i, W_{i1}, \ldots, W_{im_i}\}$, where $Y_i$ is a response variable, $Z_i$ is an error-free predictor, $X_i$ is the
true value of the error-prone predictor, and the $m_i$ replicate measurements $\{W_{ij}\}$ of $X_i$ follow the
additive error model $W_{ij} = X_i + \sigma_i U_{ij}$, where $\{U_{ij}\}$ are independent and identically distributed $N(0, 1)$,
independent of $\{Y_i, Z_i, X_i\}$, and $\{\sigma_i^2\}$ are unknown measurement error variances (curly brackets denote
sequences, e.g. $\{a_i\} = a_1, \ldots, a_n$).
We describe an adaptation of the simex method introduced by Cook and Stefanski (1994); see also
Stefanski and Cook (1995), and Carroll et al. (1995,1996). For applications of the simex method,
see Carroll et al. (1999), Fung and Krewski (1999), Holcomb (1999), and Wang et al. (1999).
The simulation method of Cook and Stefanski (1994), called parametric simex in this paper,
assumes that measurement error variances are homoscedastic and known, and pseudo errors are

* Corresponding author. Tel.: +1-317-433-1309; fax: +1-317-277-3220.
E-mail address: devan@lilly.com (V. Devanarayan).

0167-7152/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-7152(02)00098-6

generated from a normal distribution. We relax both of these assumptions, albeit at the expense
of requiring replicate measurements. In our new method, called empirical simex, pseudo data are
generated via simulation of random linear contrasts of the replicate measurements.
We do not assume a specific model for the data, but rather that in the absence of measurement
error the parameter of interest, $\theta$, is estimated by $\hat\theta_{\mathrm{True}} = T(\{Y_i, Z_i, X_i\})$, where $T$ is the function that
maps the data into the parameter space. With this notation, the so-called naive estimator is denoted
$\hat\theta_{\mathrm{Naive}} = T(\{Y_i, Z_i, W_i\})$.

2. Simulation extrapolation methods

2.1. Parametric SIMEX

This section contains a brief description of parametric simex generalized to the case of known
heteroscedastic error variances; see Carroll et al. (1995) for a more detailed account. Pseudo data
are generated via simulation by computing, for several values of $\lambda > 0$,
$$W_{bi}(\lambda) = \bar W_{i\cdot} + \lambda^{1/2} \sigma_i \bar U_{bi\cdot}, \quad i = 1, \ldots, n, \quad b = 1, \ldots, B,$$
where $\bar W_{i\cdot}$ and $\bar U_{bi\cdot}$ are the averages over $j = 1, \ldots, m_i$ of $W_{ij}$ and $U_{bij}$, and the added pseudo
measurement errors, $\{U_{bij}\}$, are mutually independent standard normal random variables, independent
of the observed data. The $b$th pseudo data set for a given $\lambda$ is
$\{Y_i, Z_i, W_{bi}(\lambda)\}$. Note that because $W_{bi}(\lambda)$ given $X_i$ is a linear combination of independent normal
random variables,
$$W_{bi}(\lambda) \mid X_i \sim N\{X_i, (1 + \lambda)\sigma_i^2 / m_i\}. \tag{1}$$
This is the key distributional property upon which the simex method is based. The parametric pseudo
estimators are computed,
$$\hat\theta_b(\lambda) = T(\{Y_i, Z_i, W_{bi}(\lambda)\}), \quad b = 1, \ldots, B, \tag{2}$$
and averaged to obtain
$$\hat\theta(\lambda) = \frac{1}{B} \sum_{b=1}^{B} \hat\theta_b(\lambda) \tag{3}$$
for each fixed contamination level $\lambda$.
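The simulation step described above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions: the estimator is taken to be a function of the pseudo values alone (the response and error-free predictor are omitted), and the function names `parametric_pseudo_data` and `averaged_estimate` are hypothetical, not from the paper.

```python
import math
import random


def parametric_pseudo_data(w_bar, sigma, m, lam, rng):
    """Return one pseudo data set W_bi(lam), i = 1..n, for known sigma_i."""
    pseudo = []
    for w_bar_i, sigma_i, m_i in zip(w_bar, sigma, m):
        # U-bar_bi: average of m_i independent N(0, 1) pseudo errors.
        u_bar = sum(rng.gauss(0.0, 1.0) for _ in range(m_i)) / m_i
        pseudo.append(w_bar_i + math.sqrt(lam) * sigma_i * u_bar)
    return pseudo


def averaged_estimate(estimator, w_bar, sigma, m, lam, n_sim, rng):
    """Average n_sim pseudo estimators at contamination level lam, as in (3)."""
    draws = [
        estimator(parametric_pseudo_data(w_bar, sigma, m, lam, rng))
        for _ in range(n_sim)
    ]
    return sum(draws) / n_sim


def sample_mean(values):
    return sum(values) / len(values)


# Illustrative call with made-up replicate means and error standard deviations.
rng = random.Random(1)
estimate = averaged_estimate(
    sample_mean, [1.2, 0.7, 2.1, 1.5], [0.3, 0.5, 0.2, 0.4], [2, 3, 2, 4],
    lam=1.0, n_sim=200, rng=rng,
)
```

Passing a seeded `random.Random` instance keeps the simulation reproducible; with `lam = 0` the pseudo data reduce to the replicate means themselves.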


The extrapolation step entails modeling $\hat\theta(\lambda)$ as a function of $\lambda$ and extrapolating to the case
$\lambda = -1$, resulting in the parametric simex estimator, $\hat\theta_{\mathrm{psimex}}$.
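As a concrete illustration of the extrapolation step, the sketch below fits a quadratic in $\lambda$ by least squares and evaluates it at $\lambda = -1$. The quadratic is only one possible extrapolant, chosen here for simplicity; the helper names `solve_linear` and `quadratic_extrapolate` are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def quadratic_extrapolate(lams, thetas, target=-1.0):
    """Least-squares fit of a + b*lam + c*lam^2, evaluated at target."""
    basis = [[1.0, lam, lam * lam] for lam in lams]
    # Normal equations (X'X) beta = X'y for the three basis functions.
    xtx = [[sum(row[i] * row[j] for row in basis) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(basis, thetas)) for i in range(3)]
    a, b, c = solve_linear(xtx, xty)
    return a + b * target + c * target * target


# Illustrative grid of contamination levels with made-up averaged estimates.
grid = [0.0, 0.5, 1.0, 1.5, 2.0]
averages = [1.00, 1.24, 1.51, 1.74, 2.01]
extrapolated = quadratic_extrapolate(grid, averages)
```

When the averaged estimates are exactly linear or quadratic in $\lambda$, the fit reproduces them and the extrapolated value is exact up to rounding.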

Note that as $B \to \infty$, $B^{-1} \sum_{b=1}^{B} \hat\theta_b(\lambda) \to E[\hat\theta_b(\lambda) \mid \{Y_i, Z_i, W_{i1}, \ldots, W_{im_i}\}]$. The latter expectation is
denoted $\hat\theta(\lambda)$ and is referred to as the exact extrapolant, and $\hat\theta(-1)$ is called the exact simex estimator.
In application, simex is an approximate method because $B$ is finite and an estimated extrapolant is
used. Its theoretical justification derives from the facts that, for many types of estimators, $\hat\theta(-1)$ is
a consistent ($n \to \infty$) estimator of $\theta$, and that the underlying theory is closely related to that of the
jackknife for bias reduction (Stefanski and Cook, 1995).
In the case of homoscedastic error with common variance $\sigma^2$, replacing $\sigma^2$ with a $\sqrt{n}$-consistent
estimator, $\hat\sigma^2$, in the parametric method results in the same bias reductions obtained with known $\sigma^2$,
but with increased variance asymptotically. With heteroscedastic variation, the plug-in approach to

parametric simex (replacing each $\sigma_i^2$ with the sample variance of the $m_i$ replicates) generally fails
to eliminate asymptotic bias (Devanarayan, 1996).

2.2. Empirical SIMEX

We now show that with replicate measurements, it is possible to generate pseudo data having the
key property (1) without knowing the measurement error variance $\sigma_i^2$ and without generating pseudo
random measurement errors directly.
Consider pseudo data obtained by taking linear combinations of the replicate measurements,
$W_{bi}(\lambda) = \sum_{j=1}^{m_i} A_{bij} W_{ij}$. In order that the pseudo data have the desired conditional moments as in (1),
the coefficients $\{A_{bij}\}$ must satisfy
$$\sum_{j=1}^{m_i} A_{bij} = 1, \qquad \sum_{j=1}^{m_i} A_{bij}^2 = \frac{1 + \lambda}{m_i}. \tag{4}$$
The first condition in (4) also ensures that $\bar W_{i\cdot}$ and $W_{bi}(\lambda) - \bar W_{i\cdot}$ are uncorrelated. The conditions
in (4) do not uniquely determine the coefficients $A_{bij}$. Multiple pseudo data sets are obtained by
sampling uniformly from the set of all coefficients satisfying (4). The sampling is more easily
explained with reparameterized coefficients.
In terms of $C_{bij} = (m_i/\lambda)^{1/2} (A_{bij} - 1/m_i)$, the conditions in (4) are equivalent to $\sum_{j=1}^{m_i} C_{bij} = 0$ and
$\sum_{j=1}^{m_i} C_{bij}^2 = 1$. Therefore, the $C_{bij}$, $j = 1, \ldots, m_i$, lie on the intersection of a hyperplane through the
origin of $m_i$-dimensional space with the unit sphere of that space, or equivalently the $\{C_{bij}\}$ lie on the unit
sphere of an $(m_i - 1)$-dimensional subspace. When $m_i = 2$ the $\{C_{bij}\}$ lie on the intersection of a line through the origin and the unit circle,
and thus there are exactly two solutions, whereas for $m_i > 2$ there are an infinite number of solutions.
The simulation component of empirical simex entails generating pseudo data by sampling uniformly
from the set of solutions.
A convenient way to sample a point $(C_{bi1}, \ldots, C_{bim_i})$ uniformly from this $(m_i - 1)$-dimensional unit sphere
is to generate $m_i$ independent and identically distributed standard normal random numbers, $\{\mathcal{Z}_{bij}\}$,
and set
$$C_{bij} = \frac{\mathcal{Z}_{bij} - \bar{\mathcal{Z}}_{bi\cdot}}{\left[\sum_{j=1}^{m_i} (\mathcal{Z}_{bij} - \bar{\mathcal{Z}}_{bi\cdot})^2\right]^{1/2}}.$$
It is readily verified that the $\{C_{bij}\}$ satisfy the required constraints; that they are uniformly
distributed on the $(m_i - 1)$-dimensional unit sphere is well known (Watson, 1983).
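The sampling recipe above is easy to check numerically. The following sketch draws the centered, normalized normal vector and maps it back to the original coefficients by inverting the reparameterization, $A = 1/m + (\lambda/m)^{1/2} C$; the function names `sample_contrast` and `coefficients` are hypothetical, and at least two replicates are assumed.

```python
import math
import random


def sample_contrast(m, rng):
    """Draw (C_1, ..., C_m) with sum zero and sum of squares one (m >= 2)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(m)]
    z_bar = sum(z) / m
    norm = math.sqrt(sum((v - z_bar) ** 2 for v in z))
    return [(v - z_bar) / norm for v in z]


def coefficients(c, lam):
    """Invert the reparameterization: A_j = 1/m + sqrt(lam / m) * C_j."""
    m = len(c)
    return [1.0 / m + math.sqrt(lam / m) * c_j for c_j in c]


rng = random.Random(2002)
c = sample_contrast(4, rng)
a = coefficients(c, lam=0.5)
# Both conditions in (4) should hold up to rounding error.
sum_a = sum(a)                      # close to 1
sum_a_sq = sum(v * v for v in a)    # close to (1 + 0.5) / 4
```

With two replicates the draw can only land on one of the two points $\pm(2^{-1/2}, -2^{-1/2})$, in line with the remark about $m_i = 2$.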
Solving for $A_{bij}$, and after some algebraic rearrangement, the empirical pseudo data have the
representation
$$W_{bi}(\lambda) = \bar W_{i\cdot} + c_i(\lambda) \sum_{j=1}^{m_i} T_{bij} W_{ij}, \tag{5}$$
where $c_i(\lambda) = [\lambda / \{m_i (m_i - 1)\}]^{1/2}$, $T_{bij} = (\mathcal{Z}_{bij} - \bar{\mathcal{Z}}_{bi\cdot}) / S_{bi}$, and $\bar{\mathcal{Z}}_{bi\cdot}$ and $S_{bi}$ are the sample mean and
standard deviation of $\{\mathcal{Z}_{bij}\}_{j=1}^{m_i}$. Certain distributional properties of the $\{T_{bij}\}$ used throughout the paper
are established in the appendix.
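A minimal sketch of representation (5) for a single subject follows. It needs only that subject's replicates and a contamination level; no error variance enters. The function name `empirical_pseudo_value` is hypothetical, and at least two replicates are assumed.

```python
import math
import random


def empirical_pseudo_value(w, lam, rng):
    """Return one pseudo value W_bi(lam) built from the replicates w, as in (5)."""
    m = len(w)
    w_bar = sum(w) / m
    # Studentized standard normal draws: T_j = (Z_j - Z-bar) / S.
    z = [rng.gauss(0.0, 1.0) for _ in range(m)]
    z_bar = sum(z) / m
    s = math.sqrt(sum((v - z_bar) ** 2 for v in z) / (m - 1))
    t = [(v - z_bar) / s for v in z]
    c_lam = math.sqrt(lam / (m * (m - 1)))
    # Random linear contrast of the replicates, scaled by c_i(lam).
    contrast = sum(t_j * w_j for t_j, w_j in zip(t, w))
    return w_bar + c_lam * contrast


rng = random.Random(59)
replicates = [4.61, 4.72, 4.58]   # made-up replicate measurements
pseudo = [empirical_pseudo_value(replicates, 1.0, rng) for _ in range(5)]
```

Because the $T_{bij}$ sum to zero, the added term is a contrast: it vanishes when all replicates agree, and with two replicates the pseudo value is the replicate mean plus or minus $\lambda^{1/2}$ times half the difference.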
Using the facts that $\sum_{j=1}^{m_i} T_{bij} = 0$ and $\sum_{j=1}^{m_i} T_{bij}^2 = m_i - 1$, it follows that the conditional distribution
of $\sum_{j=1}^{m_i} T_{bij} U_{ij}$ given $\{T_{bij}\}_{j=1}^{m_i}$ is normal with mean 0 and variance $m_i - 1$. Because the conditional

distribution does not depend on $\{T_{bij}\}_{j=1}^{m_i}$, $\sum_{j=1}^{m_i} T_{bij} U_{ij}$ and $\{T_{bij}\}_{j=1}^{m_i}$ are independent. Therefore,
$\sum_{j=1}^{m_i} T_{bij} W_{ij} = \sigma_i \sum_{j=1}^{m_i} T_{bij} U_{ij}$ has a normal distribution with mean 0 and variance $(m_i - 1)\sigma_i^2$.
The conditional distribution of $\bar W_{i\cdot}$ given $X_i$ is normal with mean $X_i$ and variance $\sigma_i^2 / m_i$. Note
that $\sum_{j=1}^{m_i} U_{ij}$ and $\sum_{j=1}^{m_i} T_{bij} U_{ij}$ are uncorrelated. Therefore, given $X_i$, $\bar W_{i\cdot}$ and $\sum_{j=1}^{m_i} T_{bij} W_{ij}$ are
uncorrelated. Because they are normally distributed, they are also independent, and thus the conditional
distribution of $W_{bi}(\lambda)$, conditioned on $X_i$, is normal with mean and variance as in (1).
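For reference, the variance part of this conclusion can be written out in one line, using only $c_i(\lambda)$ from (5) and the two conditional variances derived above:

```latex
\operatorname{Var}\{W_{bi}(\lambda) \mid X_i\}
  = \frac{\sigma_i^2}{m_i} + c_i^2(\lambda)\,(m_i - 1)\,\sigma_i^2
  = \frac{\sigma_i^2}{m_i} + \frac{\lambda}{m_i (m_i - 1)}\,(m_i - 1)\,\sigma_i^2
  = \frac{(1 + \lambda)\,\sigma_i^2}{m_i}.
```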
Therefore, the distributions of the pseudo data are the same for the parametric and empirical methods.
The empirical simex pseudo data do not depend on the measurement error variance structure;
that is, the linear coefficients $\{A_{bij}\}_{j=1}^{m_i}$ were derived independently of the measurement error
variances. Consequently, the empirical simex method automatically handles the case of heteroscedastic
measurement error variances.
Just as with its parametric counterpart, the empirical simex method entails generating pseudo data,
as defined in (5), $B$ times over a grid of $\lambda$ values ranging from 0 to 2. The naive estimators
$\hat\theta_b(\lambda)$, $b = 1, \ldots, B$, are computed as in (2) and then averaged to obtain $\hat\theta(\lambda)$ defined in (3). In the
extrapolation stage, $\hat\theta(\lambda)$ is modeled as a function of $\lambda$ using a suitable extrapolant function. This
model is then extrapolated back to $\lambda = -1$, resulting in the empirical simex estimator. Because the
pseudo data generated by the parametric and empirical methods have the same key distributional
properties, the latter method inherits many properties of the former. Therefore in this paper, we
confine ourselves to further description and illustration of the new method of simulation, but will
not discuss the extrapolation step or the sampling properties of the estimator.

3. Some theoretical comparisons

The two versions of the simex method, parametric with estimated variances, and empirical, are
illustrated on two simple but informative models. We also describe the parametric method with
known variances for comparison. The observed data for these non-regression models consist only of
the replicate measurements {Wij }. The Xi are assumed to be independent and identically distributed.
Variance estimation: In this example, the parameter of interest is $\theta = \sigma_X^2$. The true-data estimator
is the sample variance $\hat\theta_{\mathrm{True}} = (n - 1)^{-1} \sum_{i=1}^{n} (X_i - \bar X)^2$. The naive estimator is the sample variance
of $\{\bar W_{i\cdot}\}$, denoted $S_{\bar W}^2$. It has expectation $E\{\hat\theta_{\mathrm{Naive}}\} = \sigma_X^2 + n^{-1} \sum_{i=1}^{n} (\sigma_i^2 / m_i)$, and is biased whenever
any $\sigma_i^2 > 0$.
Straightforward calculations show that for the case of known measurement error variances the
exact parametric simex estimator is $\hat\theta_{\mathrm{psimex}} = S_{\bar W}^2 - n^{-1} \sum_{i=1}^{n} \sigma_i^2 / m_i$, and is unbiased for $\theta = \sigma_X^2$.
The parametric simex estimator with estimated measurement error variance is obtained by replacing
$\sigma_i^2$ in the known-variance parametric estimator with
$$\hat\sigma_i^2 = \frac{1}{m_i - 1} \sum_{j=1}^{m_i} (W_{ij} - \bar W_{i\cdot})^2,$$
resulting in the estimator $\hat\theta_{\mathrm{psimex}} = S_{\bar W}^2 - n^{-1} \sum_{i=1}^{n} \hat\sigma_i^2 / m_i$, which is unbiased for $\theta = \sigma_X^2$.
Routine calculations show that for this simple model the exact empirical simex estimator and the
exact parametric simex estimator with estimated variances are equal almost surely.
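The closed-form estimators in this example are simple enough to compute directly. The sketch below evaluates the naive estimator and the version corrected with the replicate sample variances; the function names are hypothetical and the data are made up for illustration.

```python
from statistics import mean, variance


def naive_variance(replicates):
    """Sample variance of the subject-level replicate means."""
    return variance([mean(row) for row in replicates])


def corrected_variance(replicates):
    """Naive estimator minus the average of sigma-hat_i^2 / m_i."""
    correction = mean([variance(row) / len(row) for row in replicates])
    return naive_variance(replicates) - correction


# Each inner list holds the replicate measurements for one subject.
data = [[1.0, 3.0], [2.0, 2.0], [5.0, 7.0]]
naive = naive_variance(data)
corrected = corrected_variance(data)
```

Subjects may have different numbers of replicates, since each row contributes its own $\hat\sigma_i^2 / m_i$; every row needs at least two values for the sample variance to exist.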

Estimating a moment generating function: The data are as in the variance-components example,
but now the parameter of interest is the moment generating function of $\{X_i\}$ at the point $t_0$, i.e.,
$\theta = m_X(t_0) = E\{\exp(t_0 X_i)\}$. Without loss of generality take $t_0 = 1$. The true-data estimator, $\hat\theta_{\mathrm{True}} =
n^{-1} \sum_{i=1}^{n} \exp(X_i)$, is unbiased for $\theta$. The naive estimator, $\hat\theta_{\mathrm{Naive}} = n^{-1} \sum_{i=1}^{n} \exp(\bar W_{i\cdot})$, has expectation
$E(\hat\theta_{\mathrm{Naive}}) = \theta \gamma_n$, where $\gamma_n = n^{-1} \sum_{i=1}^{n} \exp\{\sigma_i^2 / (2 m_i)\}$, and is positively biased.
The exact parametric simex estimator with known measurement error variances is $\hat\theta_{\mathrm{psimex}} = \gamma_n^{-1} \hat\theta_{\mathrm{Naive}}$,
which is unbiased for $\theta$.
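The exact parametric simex estimator with estimated measurement error variances, obtained by
replacing $\sigma_i^2$ with the estimator $\hat\sigma_i^2$, is $\hat\theta_{\mathrm{psimex}} = n^{-1} \sum_{i=1}^{n} \exp\{\bar W_{i\cdot} - \hat\sigma_i^2 / (2 m_i)\}$. This estimator has
expectation
$$E(\hat\theta_{\mathrm{psimex}}) = \frac{1}{n} \sum_{i=1}^{n} E\{\exp(\bar W_{i\cdot})\}\, E[\exp\{-\hat\sigma_i^2 / (2 m_i)\}] = \frac{1}{n} \sum_{i=1}^{n} E\{\exp(\bar W_{i\cdot})\} \left\{\frac{m_i (m_i - 1)}{m_i (m_i - 1) + \sigma_i^2}\right\}^{(m_i - 1)/2},$$
where the exponent follows because $(m_i - 1)\hat\sigma_i^2 / \sigma_i^2$ has a chi-square distribution with $m_i - 1$ degrees of freedom. The expectation
is strictly greater than $\theta$ unless all $\sigma_i^2 = 0$.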
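The size of this residual bias can be explored numerically for a single subject. The sketch below assumes normal replicate errors, so that $(m_i - 1)\hat\sigma_i^2 / \sigma_i^2$ is chi-square with $m_i - 1$ degrees of freedom and the shrinkage factor carries the exponent $(m_i - 1)/2$; the function name `plug_in_bias_factor` is hypothetical.

```python
import math


def plug_in_bias_factor(sigma2, m):
    """Ratio E[exp(W-bar - sigma-hat^2 / (2m))] / E[exp(X)] for one subject.

    Assumes normal errors, so (m - 1) * sigma-hat^2 / sigma^2 ~ chi-square(m - 1).
    """
    inflation = math.exp(sigma2 / (2 * m))          # bias of exp(W-bar)
    pairs = m * (m - 1)
    shrinkage = (pairs / (pairs + sigma2)) ** ((m - 1) / 2)
    return inflation * shrinkage


# Made-up error variances for subjects with two replicates each.
factors = [plug_in_bias_factor(s2, 2) for s2 in (0.0, 0.25, 1.0, 4.0)]
```

A factor of exactly 1 corresponds to no bias; under these assumptions the factor exceeds 1 whenever the error variance is positive, but it is smaller than the inflation factor of the naive estimator.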
The exact empirical simex estimator is
$$\hat\theta(-1) = n^{-1} \sum_{i=1}^{n} \exp\{\bar W_{i\cdot}\}\, h_{m_i}\{c_i(-1) W_{i1}, \ldots, c_i(-1) W_{im_i}\},$$
where $h_m(\cdot)$ is the moment generating function defined in (A.1) in the appendix and $c_i(\lambda)$ is defined
in (5). It has expectation
$$E\{\hat\theta(-1)\} = \frac{1}{n} \sum_{i=1}^{n} E[\exp(\bar W_{i\cdot})\, h_{m_i}\{c_i(-1) W_{i1}, \ldots, c_i(-1) W_{im_i}\}]$$
$$= \frac{1}{n} \sum_{i=1}^{n} E\{\exp(X_i)\} \exp\{\sigma_i^2 / (2 m_i)\} \exp\left\{\frac{c_i^2(-1)\, \sigma_i^2}{2} (m_i - 1)\right\} = \theta.$$
The key steps in proving unbiasedness use the facts that, conditioned on $\{T_{bij}\}$, $\bar W_{i\cdot}$ and $\sum_{j=1}^{m_i} T_{bij} W_{ij}$
are independent, and $\sum_{j=1}^{m_i} T_{bij} X_i = 0$ almost surely.
Summary: These two simple examples illustrate the key features of the empirical method and
its advantages over the parametric method with plugged-in estimated variances. In general, the two
methods for handling heteroscedastic measurement errors yield the same exact extrapolants only
when the bias is a linear function of the measurement error variances, as in the variance component
example.

4. An application

The methods are illustrated using data from the Framingham Heart Study (Gordon and Kannel,
1968). The logistic regression model has response $Y$ = the occurrence of coronary heart disease within
a specified follow-up period, and predictors: $Z_1$ = age at the time of enrollment into the study; $Z_2$ =
smoking indicator; $Z_3$ = baseline cholesterol; and $X$ = transformed systolic blood pressure (SBP). We
use the transformation log(SBP − 50), as described in Carroll et al. (1984). Systolic blood pressures
taken at successive exams two years apart provide the replicate measurements after transformation,
so that $m_i = 2$ for all $n = 1615$ subjects. The data and model have been used previously to illustrate
measurement error methods by Carroll et al. (1995), where a more detailed description of the data
can be found.

Table 1
Coefficient estimates from the Framingham data logistic regression analysis. CS, conditional
score; P-simex, parametric simex; E-simex, empirical simex

Estimator                                      Age × 10   Smoke   Chol × 10^2   LSBP
Naive                                          0.555      0.593   0.787         1.701
CS homoscedastic, $\hat\sigma^2$               0.536      0.601   0.782         1.940
CS heteroscedastic, $\hat\sigma_i^2$           0.537      0.600   0.779         1.930
P-simex homoscedastic, $\hat\sigma^2$          0.535      0.600   0.781         1.943
P-simex heteroscedastic, $\hat\sigma_i^2$      0.537      0.600   0.782         1.940
E-simex                                        0.538      0.601   0.783         1.942

Table 1 displays estimates of the logistic regression model parameters for several estimators. In
addition to the naive and empirical simex estimators, we computed parametric simex estimators
and conditional score estimators (Carroll et al., 1995) using plug-in variance estimates under both
the assumptions of heteroscedastic variation ($\hat\sigma_i^2 = (W_{i1} - W_{i2})^2 / 2$), and homoscedastic variation
($\hat\sigma^2 = n^{-1} \sum_{i=1}^{n} \hat\sigma_i^2$). For the parametric and empirical simex analysis, the estimates obtained using
quadratic and rational-linear extrapolants (Carroll et al., 1995) were nearly identical so only the
latter are presented.
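The two plug-in variance estimates used here are straightforward to compute when every subject has exactly two replicates. The sketch below is illustrative only: the function names are hypothetical and the values are made up, not taken from the Framingham data.

```python
def subject_variances(pairs):
    """Per-subject estimates (W_i1 - W_i2)^2 / 2 for paired replicates."""
    return [(w1 - w2) ** 2 / 2 for w1, w2 in pairs]


def pooled_variance(pairs):
    """Homoscedastic estimate: the average of the per-subject estimates."""
    per_subject = subject_variances(pairs)
    return sum(per_subject) / len(per_subject)


# Made-up transformed measurements from two exams for three subjects.
exams = [(4.40, 4.52), (4.25, 4.25), (4.61, 4.49)]
sigma2_i = subject_variances(exams)
sigma2 = pooled_variance(exams)
```

For a pair of values, $(W_{i1} - W_{i2})^2 / 2$ coincides with the usual sample variance with divisor $m_i - 1 = 1$.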
After transformation, the blood pressure measurements show only slight heteroscedasticity and one
expects little difference among methods that assume constant or non-constant variation. For these
data the empirical simex estimates are nearly identical to those of the conditional score and parametric
simex estimators computed assuming homogeneous measurement error. Also the conditional score
and parametric simex estimators computed using $\hat\sigma_i^2$ assuming heteroscedastic measurement error
variances are not very different from those computed assuming homogeneous measurement error
variances and the empirical simex estimators.
This example suggests that little is lost by using empirical simex relative to other methods that
assume homogeneous error variance when in fact the assumption of homogeneity is reasonable. In
simulation studies reported elsewhere (Devanarayan, 1996), the estimators calculated in this
application were compared under cases of homoscedastic and heteroscedastic error variation for different
levels of measurement error. The simulation results provided further evidence that the empirical
method performs comparably to other methods when the errors are homoscedastic. For cases in
which the errors were heteroscedastic, the empirical method performed as well as or better than the
other methods, although the conditional score estimator with plugged-in variance estimates ($\hat\sigma_i^2$)
was a close competitor. The empirical simex method has the advantage that it does not require
specification of a model for measurement variances.

Appendix A. Technical details

Certain properties of the random variables Tbij are established in the following lemma. For brevity,
we write Tbij as Tj in this section.

Lemma. Suppose $\{\mathcal{Z}_j\}_{j=1}^{m}$ are independently and identically distributed random variables. Let $\bar{\mathcal{Z}}$ and
$S_{\mathcal{Z}}^2$ denote the sample mean and variance of $\{\mathcal{Z}_j\}_{j=1}^{m}$ and let $T_j = (\mathcal{Z}_j - \bar{\mathcal{Z}}) / S_{\mathcal{Z}}$, $j = 1, \ldots, m$. Then (i) $\{T_j\}_{j=1}^{m}$
are identically distributed; (ii) $(T_j, T_k)$ and $(T_{j'}, T_{k'})$ are identically distributed for any pairs $(j, k)$
and $(j', k')$, $j \neq k$, $j' \neq k'$; (iii) $E\{T_j\} = 0$, $\mathrm{Var}\{T_j\} = (m - 1)/m$, and $\mathrm{Cov}\{T_j, T_k\} = -1/m$, $j \neq k$; and
(iv) the moment generating function
$$h_m(s_1, \ldots, s_m) = E\left\{\exp\left(\sum_{j=1}^{m} s_j T_j\right)\right\} \tag{A.1}$$
exists for all real $s_1, \ldots, s_m$.

Proof. Assertions (i) and (ii) follow from the fact that $T_1, \ldots, T_m$ are exchangeable; and (iii) follows
from exchangeability and the constraints $\sum T_j = 0$ and $\sum T_j^2 = m - 1$ upon exploiting the identities
$0 = E(\sum T_j)$, $m - 1 = E(\sum T_j^2)$, and $0 = E\{(\sum T_j)^2\}$. The proof of (iv) follows from the fact that,
by the Cauchy-Schwarz inequality and $\sum T_j^2 = m - 1$,
$|\sum_{j=1}^{m} s_j T_j| \le \{(m - 1) \sum_{j=1}^{m} s_j^2\}^{1/2}$. The moment generating function $h_m(\cdot)$ is that of a vector distributed
uniformly on an $(m - 1)$-dimensional sphere and its exact form is given in Watson (1983).
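The moment claims in part (iii) can be checked by simulation. The sketch below takes the underlying variables to be standard normal, as in the main text; this is one admissible choice, since the lemma only requires them to be independent and identically distributed. The function names `studentize` and `moment_check` are hypothetical.

```python
import math
import random


def studentize(z):
    """Return T_j = (z_j - z-bar) / s, with s the sample standard deviation."""
    m = len(z)
    z_bar = sum(z) / m
    s = math.sqrt(sum((v - z_bar) ** 2 for v in z) / (m - 1))
    return [(v - z_bar) / s for v in z]


def moment_check(m, n_sim, rng):
    """Monte Carlo estimates of E[T_1^2] and E[T_1 T_2] for normal draws."""
    sum_sq = 0.0
    sum_cross = 0.0
    for _ in range(n_sim):
        t = studentize([rng.gauss(0.0, 1.0) for _ in range(m)])
        sum_sq += t[0] ** 2
        sum_cross += t[0] * t[1]
    # Since E[T_j] = 0, these estimate Var(T_1) and Cov(T_1, T_2).
    return sum_sq / n_sim, sum_cross / n_sim


var_hat, cov_hat = moment_check(4, 20000, random.Random(7))
# For m = 4 the lemma gives Var = 3/4 and Cov = -1/4.
```

The two constraints $\sum T_j = 0$ and $\sum T_j^2 = m - 1$ hold exactly for every draw, up to rounding, while the variance and covariance are recovered only approximately by the simulation.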

References

Carroll, R.J., Spiegelman, C.H., Lan, K.K.G., Bailey, K.T., Abbott, R.D., 1984. On errors-in-variables for binary regression
models. Biometrika 71, 19–25.
Carroll, R.J., Ruppert, D., Stefanski, L.A., 1995. Measurement Error in Nonlinear Models. Chapman & Hall, London,
England.
Carroll, R.J., Küchenhoff, H., Lombard, F., Stefanski, L.A., 1996. Asymptotics for the SIMEX estimator in structural
measurement error models. J. Amer. Statist. Assoc. 91, 242–250.
Carroll, R.J., Maca, J.D., Ruppert, D., 1999. Nonparametric regression in the presence of measurement error. Biometrika
86, 541–554.
Cook, J.R., Stefanski, L.A., 1994. A simulation extrapolation method for parametric measurement error models. J. Amer.
Statist. Assoc. 89, 1314–1328.
Devanarayan, V., 1996. Simulation Extrapolation Method for Heteroscedastic Measurement Error Models with Replicate
Measurements. Ph.D. Thesis, North Carolina State University, Raleigh, North Carolina, USA, unpublished.
Fung, K.Y., Krewski, D.J., 1999. Evaluation of regression calibration and SIMEX methods in logistic regression when
one of the predictors is subject to additive measurement error. J. Epidemiol. Biostatist. 4 (2), 65–74.
Gordon, T., Kannel, W.E., 1968. The Framingham Study, introduction and general background in the Framingham study,
Sections 1 and 2, National Heart, Lung and Blood Institute, Bethesda, MD, USA.
Holcomb, J., 1999. Regression with covariates and outcome calculated from a common set of variables in the presence
of measurement error: estimation using the SIMEX method. Statist. Med. 18, 2847–2862.
Stefanski, L.A., Cook, J.R., 1995. Simulation-extrapolation: the measurement error jackknife. J. Amer. Statist. Assoc. 90,
1247–1256.
Wang, N., Lin, X., Gutierrez, R., Carroll, R.J., 1999. Bias analysis and SIMEX approach in generalized linear mixed
measurement error models. J. Amer. Statist. Assoc. 93, 249–261.
Watson, G.S., 1983. Statistics on Spheres. Wiley, New York.
