Abstract
We present a variation of the simex algorithm (J. Amer. Statist. Assoc. 89 (1994) 1314) appropriate for the
case in which the measurement error variance(s) are unknown but replicate measurements are available. The
method uses pseudo errors generated from random linear contrasts of the observed replicate measurements.
An attractive feature of the new method is its ability to accommodate heteroscedastic measurement error.
© 2002 Elsevier Science B.V. All rights reserved.
1. Introduction
We consider heteroscedastic measurement error models with replicate measurements. The observed
data are $\{Y_i, Z_i, W_{i1}, \ldots, W_{im_i}\}$, where $Y_i$ is a response variable, $Z_i$ is an error-free
predictor, $X_i$ is the true value of the error-prone predictor, and the $m_i$ replicate measurements
$\{W_{ij}\}$ of $X_i$ follow the additive error model, $W_{ij} = X_i + \sigma_i U_{ij}$, where $\{U_{ij}\}$ are
independent and identically distributed $N(0, 1)$, independent of $\{Y_i, Z_i, X_i\}$, and $\{\sigma_i^2\}$ are
unknown measurement error variances (curly brackets denote sequences, e.g. $\{a_i\} = a_1, \ldots, a_n$).
We describe an adaptation of the simex method introduced by Cook and Stefanski (1994); see also
Stefanski and Cook (1995), and Carroll et al. (1995,1996). For applications of the simex method,
see Carroll et al. (1999), Fung and Krewski (1999), Holcomb (1999), and Wang et al. (1999).
The simulation method of Cook and Stefanski (1994), called parametric simex in this paper,
assumes that measurement error variances are homoscedastic and known, and pseudo errors are
generated from a normal distribution. We relax both of these assumptions, albeit at the expense
of requiring replicate measurements. In our new method, called empirical simex, pseudo data are
generated via simulation of random linear contrasts of the replicate measurements.
∗ Corresponding author. Tel.: +1-317-433-1309; fax: +1-317-277-3220.
E-mail address: devan@lilly.com (V. Devanarayan).
0167-7152/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-7152(02)00098-6
V. Devanarayan, L.A. Stefanski / Statistics & Probability Letters 59 (2002) 219–225
We do not assume a specific model for the data, but rather that in the absence of measurement
error the parameter of interest, $\theta$, is estimated by $\hat\theta_{True} = T(\{Y_i, Z_i, X_i\})$, where $T$ is
the function that maps the data into the parameter space. With this notation, the so-called naive
estimator is denoted $\hat\theta_{Naive} = T(\{Y_i, Z_i, W_i\})$.
This section contains a brief description of parametric simex generalized to the case of known
heteroscedastic error variances; see Carroll et al. (1995) for a more detailed account. Pseudo data
are generated via simulation by computing, for several values of $\lambda > 0$,
$$W_{bi}(\lambda) = \bar W_{i\cdot} + \lambda^{1/2} \sigma_i \bar U_{bi\cdot}, \quad i = 1, \ldots, n, \quad b = 1, \ldots, B,$$
where $\bar U_{bi\cdot}$ is the mean of the $m_i$ added pseudo measurement errors $\{U_{bij}\}$, which are mutually
independent standard normal random variables, independent of the observed data. The $b$th pseudo
data set for a given $\lambda$ is $\{Y_i, Z_i, W_{bi}(\lambda)\}$. Note that because $W_{bi}(\lambda)$ given $X_i$ is a linear
combination of independent normal random variables,
$$W_{bi}(\lambda) \mid X_i \sim N(X_i, (1 + \lambda)\sigma_i^2/m_i). \tag{1}$$
This is the key distributional property upon which the simex method is based. The parametric pseudo
estimators are computed,
$$\hat\theta_b(\lambda) = T(\{Y_i, Z_i, W_{bi}(\lambda)\}), \quad b = 1, \ldots, B, \tag{2}$$
and averaged to obtain
$$\hat\theta(\lambda) = \frac{1}{B} \sum_{b=1}^{B} \hat\theta_b(\lambda). \tag{3}$$
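The simulation step of parametric simex with known heteroscedastic variances might be sketched as follows. This is only an illustration under stated assumptions: every subject has the same number of replicates $m$ (so the data fit in an $n \times m$ array), there are no $Y_i$ or $Z_i$, and $T$ is taken to be the sample variance. The function names `parametric_pseudo_data` and `simex_curve` are hypothetical, not from the paper.

```python
import numpy as np

def parametric_pseudo_data(W, sigma2, lam, rng):
    """One pseudo data set: Wbar_i + sqrt(lam) * sigma_i * Ubar_bi (known sigma_i^2)."""
    n, m = W.shape
    w_bar = W.mean(axis=1)
    # Mean of m independent standard normal pseudo errors per subject.
    u_bar = rng.standard_normal((n, m)).mean(axis=1)
    return w_bar + np.sqrt(lam) * np.sqrt(sigma2) * u_bar

def simex_curve(W, sigma2, lambdas, B, estimator, rng):
    """Average the pseudo estimators over B pseudo data sets for each lambda (eqs. 2-3)."""
    return np.array([
        np.mean([estimator(parametric_pseudo_data(W, sigma2, lam, rng))
                 for _ in range(B)])
        for lam in lambdas
    ])

# Illustrative data (assumed values, not from the paper).
rng = np.random.default_rng(0)
n, m = 500, 3
X = rng.normal(0.0, 1.0, size=n)
sigma2 = rng.uniform(0.5, 1.5, size=n)          # known heteroscedastic variances
W = X[:, None] + np.sqrt(sigma2)[:, None] * rng.standard_normal((n, m))

lambdas = np.array([0.0, 1.0, 2.0])
theta_curve = simex_curve(W, sigma2, lambdas, B=50,
                          estimator=lambda w: w.var(ddof=1), rng=rng)
print(theta_curve)  # grows with lambda as more error variance is added
```

At $\lambda = 0$ no pseudo error is added, so the first entry of `theta_curve` is the naive estimate itself.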
We now show that with replicate measurements, it is possible to generate pseudo data having the
key property (1) without knowing the measurement error variance $\sigma_i^2$ and without generating pseudo
random measurement errors directly.
Consider pseudo data obtained by taking linear combinations of the replicate measurements,
$W_{bi}(\lambda) = \sum_{j=1}^{m_i} A_{bij} W_{ij}$. In order that the pseudo data have the desired conditional moments
as in (1) the coefficients $\{A_{bij}\}$ must satisfy
$$\sum_{j=1}^{m_i} A_{bij} = 1, \qquad \sum_{j=1}^{m_i} A_{bij}^2 = \frac{1 + \lambda}{m_i}. \tag{4}$$
The first condition in (4) also ensures that $\bar W_{i\cdot}$ and $W_{bi}(\lambda) - \bar W_{i\cdot}$ are uncorrelated. The
conditions in (4) do not uniquely determine the coefficients $A_{bij}$. Multiple pseudo data sets are
obtained by sampling uniformly from the set of all coefficients satisfying (4). The sampling is more
easily explained with reparameterized coefficients.
In terms of $C_{bij} = (m_i/\lambda)^{1/2}(A_{bij} - 1/m_i)$, the conditions in (4) are equivalent to
$\sum_{j=1}^{m_i} C_{bij} = 0$ and $\sum_{j=1}^{m_i} C_{bij}^2 = 1$. Therefore, the $C_{bij}$, $j = 1, \ldots, m_i$, lie on the
intersection of a hyperplane through the origin of $\mathbb{R}^{m_i}$ with the unit sphere of $\mathbb{R}^{m_i}$, or
equivalently the $\{C_{bij}\}$ lie on a $(m_i - 1)$-dimensional unit sphere. When $m_i = 2$ the $\{C_{bij}\}$ lie on
the intersection of a line through the origin and the unit circle, and thus there are exactly two
solutions, whereas for $m_i > 2$ there are an infinite number of solutions.
The simulation component of empirical simex entails generating pseudo data by sampling uniformly
from the set of solutions.
A convenient way to sample a point uniformly from a $(m_i - 1)$-dimensional unit sphere
is to generate $m_i$ independent and identically distributed standard normal random numbers,
$\{Z_{bij}\}$, and set
$$C_{bij} = \frac{Z_{bij} - \bar Z_{bi\cdot}}{\left[\sum_{j=1}^{m_i} (Z_{bij} - \bar Z_{bi\cdot})^2\right]^{1/2}}.$$
It is readily verified that the $\{C_{bij}\}$ satisfy the required constraints, and it is well known that
they are uniformly distributed on the $(m_i - 1)$-dimensional unit sphere (Watson, 1983).
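As a minimal sketch of this sampling step, the snippet below draws one vector of contrast coefficients for a single subject and maps it back to the $A_{bij}$ via $A_{bij} = 1/m_i + (\lambda/m_i)^{1/2} C_{bij}$. The helper names `sample_contrast` and `coefficients` and the values of `m` and `lam` are assumptions made for illustration.

```python
import numpy as np

def sample_contrast(m, rng):
    """Draw C_1..C_m: centered standard normals scaled to unit length (needs m >= 2)."""
    z = rng.standard_normal(m)
    d = z - z.mean()
    return d / np.sqrt(np.sum(d ** 2))

def coefficients(c, lam):
    """Invert C = sqrt(m/lam) * (A - 1/m) to recover the A coefficients."""
    m = c.size
    return 1.0 / m + np.sqrt(lam / m) * c

rng = np.random.default_rng(0)
m, lam = 4, 1.5
C = sample_contrast(m, rng)
A = coefficients(C, lam)

print(C.sum(), np.sum(C ** 2))   # approximately 0 and 1
print(A.sum(), np.sum(A ** 2))   # approximately 1 and (1 + lam) / m
```

The printed sums agree with the constraints on $C_{bij}$ and with conditions (4) up to floating-point error.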
Solving for $A_{bij}$, and after some algebraic rearrangement, the empirical pseudo data have the
representation
$$W_{bi}(\lambda) = \bar W_{i\cdot} + c_i(\lambda) \sum_{j=1}^{m_i} T_{bij} W_{ij}, \tag{5}$$
where $c_i(\lambda) = (\lambda/\{m_i(m_i - 1)\})^{1/2}$, $T_{bij} = (Z_{bij} - \bar Z_{bi\cdot})/S_{bi}$, and $\bar Z_{bi\cdot}$ and $S_{bi}$ are the
sample mean and standard deviation of $\{Z_{bij}\}_{j=1}^{m_i}$. Certain distributional properties of the
$\{T_{bij}\}$ used throughout the paper are established in the appendix.
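Representation (5) translates almost directly into array code. The sketch below assumes a common number of replicates $m \ge 2$ for all subjects so that the data form an $n \times m$ array; the function name `empirical_pseudo_data` is hypothetical.

```python
import numpy as np

def empirical_pseudo_data(W, lam, rng):
    """Empirical simex pseudo data: Wbar_i + c(lam) * sum_j T_bij * W_ij, as in (5)."""
    n, m = W.shape
    Z = rng.standard_normal((n, m))
    # Standardize each row by its own sample mean and sample standard deviation.
    T = (Z - Z.mean(axis=1, keepdims=True)) / Z.std(axis=1, ddof=1, keepdims=True)
    c = np.sqrt(lam / (m * (m - 1)))
    return W.mean(axis=1) + c * np.sum(T * W, axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=5)

# With identical replicates (no measurement error) the contrast vanishes,
# so the pseudo data reduce to the replicate means for any lambda.
W_noerr = np.repeat(x[:, None], 3, axis=1)
print(np.allclose(empirical_pseudo_data(W_noerr, 2.0, rng), x))

# With lambda = 0 the pseudo data are the replicate means of the observed data.
W = x[:, None] + rng.standard_normal((5, 3))
print(np.allclose(empirical_pseudo_data(W, 0.0, rng), W.mean(axis=1)))
```

No error variance appears anywhere in the function: the replicates themselves supply the scale of the added noise.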
Using the facts that $\sum_{j=1}^{m_i} T_{bij} = 0$ and $\sum_{j=1}^{m_i} T_{bij}^2 = m_i - 1$, it follows that the conditional
distribution of $\sum_{j=1}^{m_i} T_{bij} U_{ij}$ given $\{T_{bij}\}_{j=1}^{m_i}$ is normal with mean 0 and variance $m_i - 1$.
Because the conditional distribution does not depend on $\{T_{bij}\}_{j=1}^{m_i}$, $\sum_{j=1}^{m_i} T_{bij} U_{ij}$ and
$\{T_{bij}\}_{j=1}^{m_i}$ are independent. Therefore, $\sum_{j=1}^{m_i} T_{bij} W_{ij} = \sigma_i \sum_{j=1}^{m_i} T_{bij} U_{ij}$ has a normal
distribution with mean 0 and variance $(m_i - 1)\sigma_i^2$.
The conditional distribution of $\bar W_{i\cdot}$ given $X_i$ is normal with mean $X_i$ and variance $\sigma_i^2/m_i$. Note
that $\sum_{j=1}^{m_i} U_{ij}$ and $\sum_{j=1}^{m_i} T_{bij} U_{ij}$ are uncorrelated. Therefore, given $X_i$, $\bar W_{i\cdot}$ and
$\sum_{j=1}^{m_i} T_{bij} W_{ij}$ are uncorrelated. Because they are normally distributed, they are also independent,
and thus the conditional distribution of $W_{bi}(\lambda)$, conditioned on $X_i$, is normal with mean and
variance as in (1).
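The distributional claim can be checked numerically. The following Monte Carlo sketch fixes one true value $X_i = x_0$, simulates many independent replicate sets, and compares the mean and variance of the empirical pseudo data with $x_0$ and $(1 + \lambda)\sigma_i^2/m_i$. The particular values of `x0`, `sigma`, `m`, `lam` and `N` are assumptions chosen for illustration.

```python
import numpy as np

def empirical_pseudo_data(W, lam, rng):
    """Pseudo data from (5), assuming the same number of replicates in every row."""
    n, m = W.shape
    Z = rng.standard_normal((n, m))
    T = (Z - Z.mean(axis=1, keepdims=True)) / Z.std(axis=1, ddof=1, keepdims=True)
    return W.mean(axis=1) + np.sqrt(lam / (m * (m - 1))) * np.sum(T * W, axis=1)

rng = np.random.default_rng(0)
x0, sigma, m, lam, N = 2.0, 1.5, 3, 1.0, 200_000

# N independent replicate sets, all measuring the same true value x0.
W = x0 + sigma * rng.standard_normal((N, m))
pseudo = empirical_pseudo_data(W, lam, rng)

target_var = (1 + lam) * sigma ** 2 / m   # variance in (1)
print(pseudo.mean(), x0)
print(pseudo.var(ddof=1), target_var)
```

The sample moments should agree with the targets up to Monte Carlo error, even though `sigma` is never passed to the pseudo data generator.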
Therefore, the distributions of the pseudo data are the same for the parametric and empirical
methods. The empirical simex pseudo data do not depend on the measurement error variance structure,
that is, the linear coefficients $\{A_{bij}\}_{j=1}^{m_i}$ were derived independently of the measurement error
variances. Consequently, the empirical simex method automatically handles the case of heteroscedastic
measurement error variances.
Just as with its parametric counterpart, the empirical simex method entails generating pseudo data,
as defined in (5), $B$ times over a grid of $\lambda$ values ranging from 0 to 2. The naive estimators
$\hat\theta_b(\lambda)$, $b = 1, \ldots, B$, are computed as in (2) and then averaged to obtain $\hat\theta(\lambda)$ defined in (3). In
the extrapolation stage, $\hat\theta(\lambda)$ is modeled as a function of $\lambda$ using a suitable extrapolant function.
This model is then extrapolated back to $\lambda = -1$, resulting in the empirical simex estimator. Because
the pseudo data generated by the parametric and empirical methods have the same key distributional
properties, the latter method inherits many properties of the former. Therefore in this paper, we
confine ourselves to further description and illustration of the new method of simulation, but will
not discuss the extrapolation step or the sampling properties of the estimator.
The two versions of the simex method, parametric with estimated variances, and empirical, are
illustrated on two simple but informative models. We also describe the parametric method with
known variances for comparison. The observed data for these non-regression models consist only of
the replicate measurements $\{W_{ij}\}$. The $X_i$ are assumed to be independent and identically distributed.
Variance estimation: In this example, the parameter of interest is $\theta = \sigma_X^2$. The true-data estimator
is the sample variance $\hat\theta_{True} = (n - 1)^{-1} \sum_{i=1}^{n} (X_i - \bar X)^2$. The naive estimator is the sample
variance of $\{\bar W_{i\cdot}\}$, denoted $S_{\bar W}^2$. It has expectation
$E\{\hat\theta_{Naive}\} = \sigma_X^2 + n^{-1} \sum_{i=1}^{n} (\sigma_i^2/m_i)$, and is biased whenever any $\sigma_i^2 > 0$.
Straightforward calculations show that for the case of known measurement error variances the
exact parametric simex estimator is $\hat\theta_{psimex} = S_{\bar W}^2 - n^{-1} \sum_{i=1}^{n} \sigma_i^2/m_i$, and is unbiased for
$\theta = \sigma_X^2$.
The parametric simex estimator with estimated measurement error variance is obtained by replacing
$\sigma_i^2$ with
$$\hat\sigma_i^2 = \frac{1}{m_i - 1} \sum_{j=1}^{m_i} (W_{ij} - \bar W_{i\cdot})^2,$$
resulting in the estimator $\hat\theta_{psimex} = S_{\bar W}^2 - n^{-1} \sum_{i=1}^{n} \hat\sigma_i^2/m_i$, which is unbiased for
$\theta = \sigma_X^2$.
Routine calculations show that for this simple model the exact empirical simex estimator and the
exact parametric simex estimator with estimated variances are equal almost surely.
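The agreement between the two estimators in the variance example can be illustrated by simulation. The sketch below is not the exact estimator: it uses a finite $B$, a five-point $\lambda$ grid and a linear extrapolant (chosen because the averaged pseudo estimator is linear in $\lambda$ for this model), and it assumes a common number of replicates $m$. All numerical settings are assumptions made for illustration.

```python
import numpy as np

def empirical_pseudo_data(W, lam, rng):
    """Pseudo data from (5), assuming the same number of replicates in every row."""
    n, m = W.shape
    Z = rng.standard_normal((n, m))
    T = (Z - Z.mean(axis=1, keepdims=True)) / Z.std(axis=1, ddof=1, keepdims=True)
    return W.mean(axis=1) + np.sqrt(lam / (m * (m - 1))) * np.sum(T * W, axis=1)

rng = np.random.default_rng(1)
n, m, B = 200, 3, 400
lambdas = np.array([0.0, 0.5, 1.0, 1.5, 2.0])

# Simulated data with heteroscedastic errors; true sigma_X^2 = 1.
X = rng.normal(0.0, 1.0, size=n)
sigma = rng.uniform(0.5, 1.5, size=n)
W = X[:, None] + sigma[:, None] * rng.standard_normal((n, m))

# Naive estimator and plug-in (parametric simex with estimated variances).
naive = W.mean(axis=1).var(ddof=1)
sigma2_hat = W.var(axis=1, ddof=1)
plug_in = naive - np.mean(sigma2_hat / m)

# Simulation step: average the sample variance of the pseudo data over B draws.
theta_lam = np.array([
    np.mean([empirical_pseudo_data(W, lam, rng).var(ddof=1) for _ in range(B)])
    for lam in lambdas
])

# Extrapolation step: fit a line in lambda and evaluate it at lambda = -1.
slope, intercept = np.polyfit(lambdas, theta_lam, 1)
e_simex = intercept - slope

print(naive, plug_in, e_simex)
```

With these settings `e_simex` and `plug_in` should differ only by Monte Carlo error, and both sit below the naive estimate.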
Estimating a moment generating function: The data are as in the variance-components example,
but now the parameter of interest is the moment generating function of $\{X_i\}$ at the point $t_0$, i.e.,
$\theta = m_X(t_0) = E\{\exp(t_0 X_i)\}$. Without loss of generality take $t_0 = 1$. The true-data estimator,
$\hat\theta_{True} = n^{-1} \sum_{i=1}^{n} \exp(X_i)$, is unbiased for $\theta$. The naive estimator,
$\hat\theta_{Naive} = n^{-1} \sum_{i=1}^{n} \exp(\bar W_{i\cdot})$, has expectation $E(\hat\theta_{Naive}) = \theta\kappa_n$, where
$\kappa_n = n^{-1} \sum_{i=1}^{n} \exp\{\sigma_i^2/(2m_i)\}$, and is positively biased.
The exact parametric simex estimator with known measurement error variances is
$\hat\theta_{psimex} = \kappa_n^{-1} \hat\theta_{Naive}$, which is unbiased for $\theta$.
The exact parametric simex estimator with estimated measurement error variances, obtained by
replacing $\sigma_i^2$ with the estimator $\hat\sigma_i^2$, is
$\hat\theta_{psimex} = n^{-1} \sum_{i=1}^{n} \exp\{\bar W_{i\cdot} - \hat\sigma_i^2/(2m_i)\}$. This estimator has expectation
$$E(\hat\theta_{psimex}) = \frac{1}{n} \sum_{i=1}^{n} E\{\exp(\bar W_{i\cdot})\} E\{\exp\{-\hat\sigma_i^2/(2m_i)\}\}
= \frac{1}{n} \sum_{i=1}^{n} E\{\exp(\bar W_{i\cdot})\} \left[\frac{m_i(m_i - 1)}{m_i(m_i - 1) + \sigma_i^2}\right]^{(m_i - 1)/2},$$
which is strictly greater than $\theta$ unless all $\sigma_i^2 = 0$.
The exact empirical simex estimator is
$$\hat\theta(-1) = n^{-1} \sum_{i=1}^{n} \exp\{\bar W_{i\cdot}\} h_{m_i}(c_i(-1)W_{i1}, \ldots, c_i(-1)W_{im_i}),$$
where $h_m(\cdot)$ is the moment generating function defined in (A.1) in the appendix and $c_i(\lambda)$ is defined
in (5). It has expectation
$$E\{\hat\theta(-1)\} = \frac{1}{n} \sum_{i=1}^{n} E[\exp(\bar W_{i\cdot}) h_{m_i}(c_i(-1)W_{i1}, \ldots, c_i(-1)W_{im_i})]$$
$$= \frac{1}{n} \sum_{i=1}^{n} E\{\exp(X_i)\} \exp\{\sigma_i^2/(2m_i)\} \exp\left\{(m_i - 1)\frac{c_i^2(-1)\sigma_i^2}{2}\right\} = \theta.$$
The key steps in proving unbiasedness use the facts that, conditioned on $\{T_{bij}\}$, $\bar W_{i\cdot}$ and
$\sum_{j=1}^{m_i} T_{bij} W_{ij}$ are independent, and $\sum_{j=1}^{m_i} T_{bij} X_i = 0$ almost surely.
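The bias of the naive estimator in the moment generating function example, and its removal when the error variances are known, can be illustrated numerically. The sketch assumes $X_i \sim N(0, 1)$, so that $\theta = E\{\exp(X_i)\} = \exp(1/2)$, and a common $m = 2$. The name `kappa_n` denotes the multiplicative bias factor $n^{-1} \sum_i \exp\{\sigma_i^2/(2m_i)\}$ and is chosen here for illustration, as are the simulation settings.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 200_000, 2

# Assumed true-value distribution: X ~ N(0, 1), so theta = exp(0.5).
X = rng.normal(0.0, 1.0, size=n)
theta = np.exp(0.5)

# Known heteroscedastic measurement error variances.
sigma2 = rng.uniform(0.2, 1.0, size=n)
W = X[:, None] + np.sqrt(sigma2)[:, None] * rng.standard_normal((n, m))
w_bar = W.mean(axis=1)

naive = np.mean(np.exp(w_bar))                 # biased upward by the factor kappa_n
kappa_n = np.mean(np.exp(sigma2 / (2 * m)))    # multiplicative bias factor
corrected = naive / kappa_n                    # known-variance correction

print(theta, naive, corrected)
```

The corrected value should be close to $\exp(1/2)$, while the naive value is inflated by roughly the factor `kappa_n`.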
Summary: These two simple examples illustrate the key features of the empirical method and
its advantages over the parametric method with plugged-in estimated variances. In general, the two
methods for handling heteroscedastic measurement errors yield the same exact extrapolants only
when the bias is a linear function of the measurement error variances, as in the variance component
example.
4. An application
The methods are illustrated using data from the Framingham Heart Study (Gordon and Kannel,
1968). The logistic regression model has response Y = the occurrence of coronary heart disease within
a specified follow-up period, and predictors: Z1 = age at the time of enrollment into the study; Z2 =
smoking indicator; Z3 = baseline cholesterol; and X = transformed systolic blood pressure (SBP). We
use the transformation, log(SBP − 50), as described in Carroll et al. (1984). Systolic blood pressures
taken at successive exams two years apart provide the replicate measurements after transformation
so that $m_i = 2$ for all $n = 1615$ subjects. The data and model have been used previously to illustrate
measurement error methods by Carroll et al. (1995), where a more detailed description of the data
can be found.
Table 1
Coefficient estimates from the Framingham data logistic regression analysis. CS, Conditional
Score; P-simex, parametric simex; E-simex, empirical simex
Table 1 displays estimates of the logistic regression model parameters for several estimators. In
addition to the naive and empirical simex estimators, we computed parametric simex estimators
and conditional score estimators (Carroll et al., 1995) using plug-in variance estimates under both
the assumptions of heteroscedastic variation ($\hat\sigma_i^2 = (W_{i1} - W_{i2})^2/2$), and homoscedastic variation
($\hat\sigma^2 = n^{-1} \sum_{i=1}^{n} \hat\sigma_i^2$). For the parametric and empirical simex analyses, the estimates obtained
using quadratic and rational-linear extrapolants (Carroll et al., 1995) were nearly identical so only
the latter are presented.
After transformation, the blood pressure measurements show only slight heteroscedasticity and one
expects little difference among methods that assume constant or non-constant variation. For these
data the empirical simex estimates are nearly identical to those of the conditional score and parametric
simex estimators computed assuming homogeneous measurement error. Also the conditional score
and parametric simex estimators computed using $\hat\sigma_i^2$ assuming heteroscedastic measurement error
variances are not very different from those computed assuming homogeneous measurement error
variances and the empirical simex estimators.
This example suggests that little is lost by using empirical simex relative to other methods that
assume homogeneous error variance when in fact the assumption of homogeneity is reasonable. In
simulation studies reported elsewhere (Devanarayan, 1996), the estimators calculated in this
application were compared under cases of homoscedastic and heteroscedastic error variation for
different levels of measurement error. The simulation results provided further evidence that the
empirical method performs comparably to other methods when the errors are homoscedastic. For cases
in which the errors were heteroscedastic, the empirical method performed as well as or better than
the other methods, although the conditional score estimator with plugged-in variance estimates
($\hat\sigma_i^2$) was a close competitor. The empirical simex method has the advantage that it does not
require specification of a model for measurement variances.
Certain properties of the random variables $T_{bij}$ are established in the following lemma. For brevity,
we write $T_{bij}$ as $T_j$ in this section.
Lemma. Suppose $\{Z_j\}_{j=1}^{m}$ are independently and identically distributed random variables. Let $\bar Z$
and $S_Z^2$ denote the sample mean and variance of $\{Z_j\}_{j=1}^{m}$ and let $T_j = (Z_j - \bar Z)/S_Z$, $j = 1, \ldots, m$.
Then (i) $\{T_j\}_{j=1}^{m}$ are identically distributed; (ii) $(T_j, T_k)$ and $(T_{j'}, T_{k'})$ are identically
distributed for any pairs $(j, k)$ and $(j', k')$, $j \ne k$, $j' \ne k'$; (iii) $E\{T_j\} = 0$, $\mathrm{Var}\{T_j\} = (m - 1)/m$,
and $\mathrm{Cov}\{T_j, T_k\} = -1/m$, $j \ne k$; and (iv) the moment generating function
$$h_m(s_1, \ldots, s_m) = E\left\{\exp\left(\sum_{j=1}^{m} s_j T_j\right)\right\} \tag{A.1}$$
exists for all real $s_1, \ldots, s_m$.
Proof. Assertions (i) and (ii) follow from the fact that $T_1, \ldots, T_m$ are exchangeable; and (iii) follows
from exchangeability and the constraints $\sum_j T_j = 0$ and $\sum_j T_j^2 = m - 1$ upon exploiting the identities
$0 = E(\sum_j T_j)$, $m - 1 = E(\sum_j T_j^2)$, and $0 = E\{(\sum_j T_j)^2\}$. The proof of (iv) follows from the fact
that, by the Cauchy-Schwarz inequality and $\sum_j T_j^2 = m - 1$,
$|\sum_{j=1}^{m} s_j T_j| \le \{(m - 1) \sum_{j=1}^{m} s_j^2\}^{1/2}$, so the exponent in (A.1) is bounded. The moment
generating function $h_m(\cdot)$ is that of a vector distributed uniformly on an $m - 1$ sphere and its exact
form is given in Watson (1983).
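The moment results in part (iii) of the lemma are easy to check by simulation. The sketch below assumes standard normal $Z_j$ with $m = 4$; the sample size `N` is an arbitrary choice made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, N = 4, 200_000

# Each row is one draw of (Z_1, ..., Z_m), standardized by its own mean and SD.
Z = rng.standard_normal((N, m))
T = (Z - Z.mean(axis=1, keepdims=True)) / Z.std(axis=1, ddof=1, keepdims=True)

# Exact constraints: rows sum to 0 and squared rows sum to m - 1.
row_sums = T.sum(axis=1)
row_sq_sums = np.sum(T ** 2, axis=1)

# Monte Carlo estimates of Var(T_1) and Cov(T_1, T_2).
var_T1 = T[:, 0].var()
cov_T12 = np.mean(T[:, 0] * T[:, 1]) - T[:, 0].mean() * T[:, 1].mean()

print(var_T1, (m - 1) / m)
print(cov_T12, -1 / m)
```

The estimated variance and covariance should be close to $(m - 1)/m$ and $-1/m$ respectively, up to Monte Carlo error.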
References
Carroll, R.J., Spiegelman, C.H., Lan, K.K.G., Bailey, K.T., Abbott, R.D., 1984. On errors-in-variables for binary regression
models. Biometrika 71, 19–25.
Carroll, R.J., Ruppert, D., Stefanski, L.A., 1995. Measurement Error in Nonlinear Models. Chapman & Hall, London,
England.
Carroll, R.J., Küchenhoff, H., Lombard, F., Stefanski, L.A., 1996. Asymptotics for the SIMEX estimator in structural
measurement error models. J. Amer. Statist. Assoc. 91, 242–250.
Carroll, R.J., Maca, J.D., Ruppert, D., 1999. Nonparametric regression in the presence of measurement error. Biometrika
86, 541–554.
Cook, J.R., Stefanski, L.A., 1994. A simulation extrapolation method for parametric measurement error models. J. Amer.
Statist. Assoc. 89, 1314–1328.
Devanarayan, V., 1996. Simulation Extrapolation Method for Heteroscedastic Measurement Error Models with Replicate
Measurements. Ph.D. Thesis, North Carolina State University, Raleigh, North Carolina, USA, unpublished.
Fung, K.Y., Krewski, D.J., 1999. Evaluation of regression calibration and SIMEX methods in logistic regression when
one of the predictors is subject to additive measurement error. J. Epidemiol. Biostatist. 4 (2), 65–74.
Gordon, T., Kannel, W.E., 1968. The Framingham Study, introduction and general background in the Framingham study,
Sections 1 and 2, National Heart, Lung and Blood Institute, Bethesda, MD, USA.
Holcomb, J., 1999. Regression with covariates and outcome calculated from a common set of variables in the presence
of measurement error: estimation using the SIMEX method. Statist. Med. 18, 2847–2862.
Stefanski, L.A., Cook, J.R., 1995. Simulation-extrapolation: the measurement error jackknife. J. Amer. Statist. Assoc. 90,
1247–1256.
Wang, N., Lin, X., Gutierrez, R., Carroll, R.J., 1999. Bias analysis and SIMEX approach in generalized linear mixed
measurement error models. J. Amer. Statist. Assoc. 93, 249–261.
Watson, G.S., 1983. Statistics on Spheres. Wiley, New York.