
J. R. Statist. Soc. B (1997)
59, No. 3, pp. 589-602

Multistage Sampling Designs and Estimating Equations


By ALICE S. WHITTEMORE
Stanford University, USA

[Received June 1995. Final revision September 1996]

SUMMARY
In some applications it is cost efficient to sample data in two or more stages. In the first
stage a simple random sample is drawn and then stratified according to some easily
measured attribute. In each subsequent stage a random subset of previously selected units
is sampled for more detailed and costly observation, with a unit's sampling probability
determined by its attributes as observed in the previous stages. This paper describes
multistage sampling designs and estimating equations based on the resulting data.
Maximum likelihood estimates (MLEs) and their asymptotic variances are given for designs
using parametric models. Horvitz-Thompson estimates are introduced as alternatives to
MLEs, their asymptotic distributions are derived and their strengths and weaknesses are
evaluated. The designs and the estimates are illustrated with data on corn production.
Keywords: DOUBLE SAMPLING; HORVITZ-THOMPSON ESTIMATOR; MEAN SCORE METHOD;
MISSING DATA; STRATIFIED SAMPLING; TWO-STAGE CASE-CONTROL STUDIES

1. INTRODUCTION
In designing a study it can be advantageous to sample units in more than one stage.
The criteria for selecting a unit at a given stage typically depend on attributes
observed in the previous stages. Some types of units may be more informative than
others, and it is better to sample them at a higher rate. If it costs little to determine
the attributes that are necessary to classify the units, it can be cost efficient to stratify
a large sample in stage 1, and then in stage 2 to subsample the strata at different
rates. (These parts of the sampling plan are called `stages' in the biometrics literature
but are called `phases' in survey research. A stage in survey research is part of a
sampling plan in which different types of units (e.g. enumeration districts, households
or individuals) are selected. I shall use the term stage here in the hope that no
confusion will occur.)
Data from Cochran (1977), pages 168 and 332, on corn production in Jefferson
County, Iowa, illustrate two-stage sampling. The goal is to estimate the total number
of acres devoted to corn growth in the county by sampling its 2010 farms. Much of
the variability in acres of corn planting is due to variability in farm size. Thus, if the
distribution of farm sizes were known, a sample stratified by size would give a more
precise estimate than a simple random sample would. If the distribution is not known
but the farm sizes are available from county records, it can be advantageous to
stratify a large simple random sample of farms by size in stage 1, and then to
subsample each stratum and to determine total acreage and corn acreage in stage 2.

Address for correspondence: Northern California Cancer Center, Room T204, Redwood Building, Stanford
University School of Medicine, Stanford, CA 94305-5092, USA.
E-mail: ASW@OSIRIS.STANFORD.EDU

© 1997 Royal Statistical Society 0035-9246/97/59589


A second example is provided by a registry of families at high risk of breast cancer.
The registry was formed by augmenting a population-based cancer registry with
family disease information for women newly diagnosed with breast cancer (hereafter
called probands). Families with multiple cases of breast cancer are of primary
interest. Since fewer than 20% of probands have a first-degree relative with the
disease, it is cost efficient first to stratify them according to the presence or absence of
breast cancer in their first-degree relatives, and then to sample at different rates in the
two strata, to obtain detailed pedigree data. Moreover, since fewer than 3% of
probands report two or more relatives with breast cancer, a third stage might seek
blood samples from all such probands but from only a subset of the others. A
proband's sampling probability in this third stage depends on her attributes as
determined in the previous two stages. The sampling must allow future studies to
make population-based inferences about molecular and clinical characteristics and
environmental risk factors in the probands and their families. Some studies will use
data from the set of probands who completed the first two sampling stages, but
others will need data from the smaller set of probands who complete all three stages.
Other applications of multistage designs include validation substudies, wherein
more accurate response or covariate information is obtained from a subset of
subjects stratified by response or covariates or both (e.g. Pepe et al. (1994), Reilly and
Pepe (1995) and Robins et al. (1994)). Multistage designs also are used in two-stage
case-control studies and biased case-control sampling designs (Breslow and Cain,
1988; Weinberg and Wacholder, 1990).
Such multistage studies raise issues for both design and analysis. The design issues
include how to choose sample sizes at the various stages to optimize some end point,
such as the variance of a parameter estimate. The analysis issues relate to the
drawbacks of likelihood-based inference: difficulty in specifying the probabilities
underlying the responses and the sampling design, missing information and the
complexity of the likelihood. Horvitz-Thompson estimating equations have been
proposed as a solution to these problems (Kalbfleisch and Lawless, 1988; Flanders
and Greenland, 1991; Pepe et al., 1994; Reilly and Pepe, 1995; Robins et al., 1994).
To focus the discussion, consider a model in which a parameter $\theta$ is estimated by
the solution $\hat\theta$ to the equation $U_1(\theta) = 0$, where
$$U_1(\theta) = \sum_{t=1}^{N} u_t(\theta).$$
The $u_t(\theta) = u(y_t; \theta)$ are column vectors which depend on the data $y_t$ and $\theta$, and which
we call scores. Both $\theta$ and $u_t(\theta)$ are $q$ dimensional. We assume that
(a) $y_1, \ldots, y_N$ are independent, identically distributed random vectors such that
$E[u(y_t; \theta_0)] = 0$ at the true parameter $\theta_0$ and
(b) $N^{1/2}(\hat\theta - \theta_0)$ converges in distribution to $N\{0, A^{-1} B_1 (A^{-1})^T\}$, where
$$A = E\left[-\left.\frac{\partial u_t(\theta)}{\partial\theta}\right|_{\theta=\theta_0}\right], \qquad B_1 = E[u(y_t; \theta_0)\, u(y_t; \theta_0)^T] \tag{1}$$
exist and are positive definite.
If $A = B_1$, as typically holds when the $u_t(\theta)$ are likelihood scores (Cox and Hinkley
(1974), p. 281), the asymptotic variance of $N^{1/2}\hat\theta$ reduces to $A^{-1}$.
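As a concrete illustration of assumptions (a) and (b), the following minimal sketch (not part of the original paper) computes plug-in versions of $A$, $B_1$ and the sandwich variance $A^{-1} B_1 (A^{-1})^T$ for the simple scalar score $u(y; \theta) = y - \theta$, whose root is the sample mean. The function name `sandwich_variance` and the toy data are illustrative assumptions.

```python
def sandwich_variance(y):
    """Plug-in sandwich variance for the scalar score u(y; theta) = y - theta.

    Illustrative assumption: the score is the simplest possible one, so the
    root of U_1(theta) = sum_t (y_t - theta) = 0 is the sample mean.
    """
    n = len(y)
    theta_hat = sum(y) / n                       # solves U_1(theta) = 0
    scores = [v - theta_hat for v in y]          # u_t evaluated at theta_hat
    a_hat = 1.0                                  # -d u / d theta = 1 for every unit
    b1_hat = sum(u * u for u in scores) / n      # sample analogue of E[u u^T]
    return theta_hat, b1_hat / (a_hat * a_hat)   # scalar A^{-1} B_1 A^{-1}


theta_hat, avar = sandwich_variance([2.0, 4.0, 4.0, 6.0])
print(theta_hat, avar)  # 4.0 2.0
```

For this score $A = 1$, so the sandwich reduces to the variance of $y$; with a likelihood score the same computation would give $A^{-1}$ when $A = B_1$.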
These assumptions describe the estimating equation $U_1(\theta)$ based on a simple
random sample in which all values $y_t$ are observed. The subscript 1 designates the
estimating equation obtained from such a sample, hereafter called a stage 1 sample.
Here we are concerned with estimating $\theta_0$ by using data from multistage sampling
designs, in which only partial information on $y_t$ is obtained on the stage 1 sample. In
each subsequent stage of sampling more information is obtained on a subset of units,
but the complete value of $y_t$ is obtained only on those units sampled in the final stage.
Section 2 introduces multistage designs and relates them to the two-stage designs of
classical survey research (e.g. Cochran (1977)) and discussed recently by Pepe et al.
(1994) and Reilly and Pepe (1995). Section 3 presents the likelihood function and
Rao-Cramér bound for parametric multistage designs. The latter makes explicit the
minimum loss of precision associated with multistage sampling compared with
complete sampling. Section 3 also discusses the drawbacks of likelihood-based
inference for multistage designs and introduces Horvitz-Thompson estimates as
alternatives. Section 4 gives the asymptotic distribution of the Horvitz-Thompson
estimate and a related estimate. Comparison of their asymptotic variances with
the Rao-Cramér bound quantifies their inefficiency in parametric settings. The
asymptotic distribution of the Horvitz-Thompson estimate applies regardless of
whether the units are sampled as independent Bernoulli variates or sampled without
replacement. Section 5 illustrates the designs and estimates by application to the data
on corn production, and Section 6 concludes the paper.

2. MULTISTAGE DESIGN
Suppose that the sample space $S$ of $y$ is partitioned as $S = \bigcup_j S_j$ and it is costly to
observe the exact values $y$, but inexpensive to determine their stratum membership.
Suppose also that units in different strata contribute different amounts of
information about $\theta$. Several researchers (e.g. Breslow and Cain (1988)) have shown that
substantial efficiency gains can accompany the following two-stage design.

Stage 1: draw a random sample $y_1, \ldots, y_N$ and observe only the stratum to which
each $y_t$ belongs. Let $Q_j = \{t \mid y_t \in S_j\}$, with $N_j = |Q_j|$ and $\sum_j N_j = N$.
Stage 2: for each stratum $S_j$, draw a random subsample $\bar Q_j \subseteq Q_j$ of size $\bar N_j$.
Draw samples independently in different strata with $\mathrm{pr}(t \in \bar Q_j \mid t \in Q_j) \equiv p_j > 0$,
and observe $y_t$ for $t \in \bar Q_j$.
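The two stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `two_stage_sample`, the `scheme` argument and the toy strata are hypothetical, and the two schemes correspond to the fixed fraction and Bernoulli sampling types defined later in this section.

```python
import random


def two_stage_sample(strata, p, scheme="fixed", seed=0):
    """Stage 1: group unit indices by stratum label; stage 2: subsample each stratum.

    strata : list of stratum labels, one per stage 1 unit (index t = 0, ..., N-1)
    p      : dict mapping stratum label j to the stage 2 sampling probability p_j
    scheme : "fixed" draws round(p_j * N_j) units without replacement;
             "bernoulli" keeps each unit independently with probability p_j
    Returns (Q, Q_bar): dicts mapping each label to its list of unit indices.
    """
    rng = random.Random(seed)
    Q = {}
    for t, j in enumerate(strata):
        Q.setdefault(j, []).append(t)        # only stratum membership is used here
    Q_bar = {}
    for j, units in Q.items():
        if scheme == "fixed":
            size = round(p[j] * len(units))
            Q_bar[j] = sorted(rng.sample(units, size))
        elif scheme == "bernoulli":
            Q_bar[j] = [t for t in units if rng.random() < p[j]]
        else:
            raise ValueError("scheme must be 'fixed' or 'bernoulli'")
    return Q, Q_bar


strata = ["small"] * 4 + ["large"] * 2
Q, Q_bar = two_stage_sample(strata, {"small": 0.5, "large": 1.0}, scheme="fixed")
print({j: len(units) for j, units in Q_bar.items()})  # {'small': 2, 'large': 2}
```

Under the fixed scheme the subsample sizes are determined in advance, whereas under the Bernoulli scheme they are random; the distinction matters for the asymptotic results of Section 4.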

In some applications the stage 1 sample may itself be stratified, with the
investigator choosing the stratum sizes, and with the stage 2 sampling done within each of
the initial strata. For example, the stage 1 strata in the familial breast cancer registry
are defined by the races and/or ages of the probands. It is straightforward to extend
the results described here to a stage 1 stratified design, since it involves mutually
independent data from finitely many strata. For notational simplicity, we assume
only one stage 1 stratum.
As illustrated by the familial breast cancer registry, it may be cost efficient to
sample units in more than two stages, with each stage providing a more refined
stratification of the units. Sampling in $r > 2$ stages is a straightforward extension of
two-stage sampling. In stage 1 the units in $S$ are sampled as shown in Fig. 1. The
sample is partitioned into strata $S_{2:j}$, with $Q_{2:j} = \{t \mid y_t \in S_{2:j}\}$. In stage 2 a subset $\bar Q_{2:j}$
of units is drawn from each $Q_{2:j}$ with probability $p_{2:j} > 0$ and classified into finer
strata $S_{3:jk}$, the stage 3 stratification. Thus $Q_{3:jk} = \bar Q_{2:j} \cap \{t \mid y_t \in S_{3:jk}\}$, $|Q_{3:jk}| = N_{3:jk}$
and $\sum_k N_{3:jk} = \bar N_{2:j}$. This process is repeated r times, each stage producing a
more refined stratification of the units. In the final rth stage random subsamples
$\bar Q_{r:j\ldots m} \subseteq Q_{r:j\ldots m}$ of size $\bar N_{r:j\ldots m}$ are drawn independently across strata, with
probabilities $p_{r:j\ldots m} > 0$, and the y-values of these units are observed.

$$\{y_1, \ldots, y_N\} \to Q_{2:j} \xrightarrow{p_{2:j}} \bar Q_{2:j} \to Q_{3:jk} \xrightarrow{p_{3:jk}} \bar Q_{3:jk} \to \cdots \xrightarrow{p_{r:j\ldots m}} \bar Q_{r:j\ldots m}$$

Fig. 1. Schematic diagram of r-stage sampling (stage 1, stage 2, stage 3, ..., stage r)
We have not yet indicated how the strata are sampled in stages 2 to r. We shall
consider two types of sampling. The first type (fixed fraction sampling) draws simple
random samples without replacement, with each unit in a stratum having equal
probability of being sampled. The second type (Bernoulli sampling) selects units
according to the outcomes of independent Bernoulli variables, with all units in a
stratum having the same probability. Mixtures of both sampling types are possible,
as discussed briefly in closing.
Let
$$\pi_{s:j\ldots l} = p_{2:j}\, p_{3:jk} \cdots p_{s:j\ldots l} = \mathrm{pr}(t \in \bar Q_{s:j\ldots l} \mid y_t \in S_{s:j\ldots l}), \qquad s = 2, \ldots, r, \tag{2}$$
denote the probability that a unit $y_t$ selected in stage 1 is also selected in stages 2 to s,
given that $y_t \in S_{s:j\ldots l}$. The sampling design requires that this selection be
conditionally independent of $y_t$, given $y_t \in S_{s:j\ldots l}$. If we regard as missing the values $y_t$
for units that are unsampled in a second or subsequent stage, this assumption implies
that the missing data within each stratum are missing at random (Little and Rubin,
1987). By design, unbiased estimates of $p_{s:j\ldots l}$ and $\pi_{s:j\ldots l}$ are respectively
$\hat p_{s:j\ldots l} = \bar N_{s:j\ldots l}/N_{s:j\ldots l}$ and $\hat\pi_{s:j\ldots l} = \hat p_{2:j}\, \hat p_{3:jk} \cdots \hat p_{s:j\ldots l}$.
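To make the product form of equation (2) concrete, here is a minimal sketch (an illustration, not the paper's code) that turns the stratum counts along one nested path of strata into the estimated cumulative sampling probabilities. The function name `cumulative_probability_estimates` and the example counts are hypothetical.

```python
def cumulative_probability_estimates(counts):
    """Estimated cumulative sampling probabilities along one nested stratum path.

    counts : list of (N_s, N_bar_s) pairs for stages s = 2, ..., r, where N_s is
             the number of units in the stratum and N_bar_s the number sampled.
    Returns [pi_hat_2, ..., pi_hat_r], each the running product of N_bar_s / N_s.
    """
    pi_hat = []
    running = 1.0
    for n_stratum, n_sampled in counts:
        if not 0 < n_sampled <= n_stratum:
            raise ValueError("need 0 < N_bar <= N in every stage")
        running *= n_sampled / n_stratum     # p_hat for this stage
        pi_hat.append(running)
    return pi_hat


# Stage 2 samples 100 of 200 units; stage 3 samples 10 of the 40 units
# that fall in the finer stratum.
print(cumulative_probability_estimates([(200, 100), (40, 10)]))  # [0.5, 0.125]
```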

3. MULTISTAGE ANALYSIS
When a parametric form $f(y_t; \theta)$ is correctly assumed for the probability density
function of $y_t$, the likelihood score is asymptotically efficient. Since the $\pi_{s:j\ldots l}$ are
independent of $\theta$, the likelihood function is proportional to
$$L(\theta) = \prod_j \left[ \omega_{2:j}^{N_{2:j}} \prod_k \left\{ \left( \frac{\omega_{3:jk}}{\omega_{2:j}} \right)^{N_{3:jk}} \cdots \prod_m \left[ \left( \frac{\omega_{r:j\ldots m}}{\omega_{r-1:j\ldots l}} \right)^{N_{r:j\ldots m}} \prod_{t \in \bar Q_{r:j\ldots m}} \frac{f(y_t; \theta)}{\omega_{r:j\ldots m}} \right] \right\} \right],$$
where $\omega_{s:j\ldots l} = \mathrm{pr}(y_t \in S_{s:j\ldots l})$. Let
$$u_t(\theta) = \frac{\partial}{\partial\theta} \log f(y_t; \theta)$$
denote the likelihood score for the tth unit, $t = 1, \ldots, N$. Using the identity
$$\frac{\partial}{\partial\theta} \log \omega_{s:j\ldots l} = E[u_t(\theta) \mid y_t \in S_{s:j\ldots l}] \equiv \mu_{s:j\ldots l}$$
(see Appendix A, lemma 1), we can write the likelihood score as the sum of
conditional mean scores for the N units, given the data observed for them:
$$\frac{\partial}{\partial\theta} \log L(\theta) \equiv \bar U_r(\theta) = \sum_{t=1}^{N} E[u_t(\theta) \mid \text{data for unit } t]. \tag{3}$$

Here
$$E[u_t(\theta) \mid \text{data for unit } t] = \begin{cases} u_t(\theta) & \text{if } t \in \bigcup_{j\ldots m} \bar Q_{r:j\ldots m}, \\ \sum_{s=2}^{r} \sum_{j\ldots l} 1(t \in Q_{s:j\ldots l} \setminus \bar Q_{s:j\ldots l})\, \mu_{s:j\ldots l} & \text{otherwise,} \end{cases} \tag{4}$$
and $1(E)$ is the indicator function assuming the value 1 when $E$ is true and 0
otherwise. The asymptotic variance of the maximum likelihood estimate $\hat\theta$ is shown
in Appendix A, lemma 2, to be
$$\mathrm{var}\{N^{1/2}(\hat\theta - \theta)\} = \left[ A - \sum_{s=2}^{r} \sum_{j\ldots l} \omega_{s:j\ldots l}\, \pi_{s-1:j\ldots k}\, (1 - p_{s:j\ldots l})\, \Gamma_{s:j\ldots l} \right]^{-1}, \tag{5}$$
where
$\Gamma_{s:j\ldots l} = \Sigma_{s:j\ldots l} - \mu_{s:j\ldots l}\, \mu_{s:j\ldots l}^T$, with $\Sigma_{s:j\ldots l} = E[u_t(\theta)\, u_t^T(\theta) \mid y_t \in S_{s:j\ldots l}]$.
Here $\Gamma_{s:j\ldots l}$ is the conditional variance of the score $u_t(\theta)$ given $y_t \in S_{s:j\ldots l}$. These
results are special cases of the general theory of Breckling et al. (1994) applied to
multistage designs with uninformative sampling at each phase.
Despite its asymptotic optimality, likelihood-based inference has drawbacks. A
major drawback is its restriction to applications with a plausible model $f(y_t; \theta)$.
Another is that it can be cumbersome, intractable or vulnerable to misspecification of
the $\omega_{s:j\ldots l}$, which usually are not of interest (for examples, see Section 5, Pepe (1992)
and Whittemore and Halpern (1997)). Consider instead the estimating function
$U_r(\hat\pi, \theta) = 0$, where
$$U_r(\hat\pi, \theta) = \sum_{t=1}^{N} a_{r:t}(\hat\pi)\, u_t(\theta), \qquad a_{r:t}(\hat\pi) = \sum_{j\ldots m} \hat\pi_{r:j\ldots m}^{-1}\, 1(t \in \bar Q_{r:j\ldots m}). \tag{6}$$
We note several features of equations (6). First, $U_r(\hat\pi, \theta)$ is a weighted sum of scores
for those units that are selected in all stages. The score for a unit is weighted by the
inverse of its estimated overall sampling probability. Thus $U_r(\hat\pi, \theta)$ is useful when the
$u_t(\theta)$ are not likelihood scores but rather summands of a general estimating
function. Second, as seen from equation (2), when all the stratum-specific sampling
probabilities are 1 for a stage, equations (6) reduce to the estimating function for an
$(r - 1)$-stage design.
Third, $U_r(\hat\pi, \theta)$ has the same form as the likelihood score (3) and (4), except that
the means $\mu_{s:j\ldots l}$ are replaced by nonparametric estimates. Specifically,
$$U_r(\hat\pi, \theta) = \sum_{t=1}^{N} \hat E[u_t(\theta) \mid \text{data for unit } t],$$
where $\hat E[u_t(\theta) \mid \text{data for unit } t]$ is given by equation (4) with the $\mu_{s:j\ldots l}$ replaced by the
stratified estimate
$$\hat\mu_{s:j\ldots l} = \hat\omega_{s:j\ldots l}^{-1} \sum \hat\omega_{r:j\ldots l\ldots m}\, \hat\mu_{r:j\ldots l\ldots m}, \quad s < r; \qquad \hat\mu_{r:j\ldots m} = \frac{1}{\bar N_{r:j\ldots m}} \sum_{t \in \bar Q_{r:j\ldots m}} u_t(\theta). \tag{7}$$
Here the estimates $\hat\omega_{s:j\ldots l}$ are
$$\hat\omega_{2:j} = \frac{N_{2:j}}{N} \quad \text{and} \quad \hat\omega_{s:j\ldots l} = \hat\omega_{s-1:j\ldots l}\, \frac{N_{s:j\ldots l}}{\bar N_{s-1:j\ldots l}}, \quad s > 2.$$
The first summation in equations (7) is taken over all substrata $j \ldots l \ldots m$ of the
s-stage stratum $j \ldots l$. Thus $U_r(\hat\pi, \theta)$ is semiparametric in the sense that the means
$\mu_{s:j\ldots l}$ in the likelihood score are replaced by nonparametric estimates.
Fourth,
$$N^{-1} U_r(\hat\pi, \theta) = \sum \hat\omega_{r:j\ldots m}\, \hat\mu_{r:j\ldots m} \tag{8}$$
is itself a stratified estimate of the mean score $E[u_t(\theta)]$. Thus, regarding the scores
$u_t(\theta)$, $t = 1, \ldots, N$, as a random sample from an infinite population, we see that
$N^{-1} U_1(\theta)$ is the sample mean, whereas, for $r = 2$, $N^{-1} U_r(\hat\pi, \theta)$ is a Horvitz-Thompson
two-stage stratified estimator of the population mean (Horvitz and Thompson,
1952). We call $U_r(\hat\pi, \theta)$ a Horvitz-Thompson estimating function; it also has been
called a pseudolikelihood estimating function (Pfeffermann, 1993).
Finally, when fixed fraction sampling is used in stages 2 to r, $p_{s:j\ldots l} = \bar N_{s:j\ldots l}/N_{s:j\ldots l}$,
so the $\hat\pi$s and the corresponding $\pi$s are equal and $U_r(\hat\pi, \theta) = U_r(\pi, \theta)$. Under
Bernoulli sampling, however, $U_r(\hat\pi, \theta) \neq U_r(\pi, \theta)$ in general, since $\bar N_{s:j\ldots l}/N_{s:j\ldots l} =
p_{s:j\ldots l}$ only in expectation. The estimating function $U_2(\pi, \theta)$ was proposed by
Kalbfleisch and Lawless (1988). In the following section we show that, for Bernoulli
sampling, $U_r(\pi, \theta)$ gives estimates that are asymptotically less efficient than those
given by $U_r(\hat\pi, \theta)$.
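The following sketch illustrates equations (6) and (8) in the simplest setting, assuming a two-stage design and the scalar score $u_t(\theta) = y_t - \theta$ (an illustrative choice, not one made in the paper). Solving the weighted equation with weights $1/\hat\pi_j = N_j/\bar N_j$ gives the same answer as the stratified mean $\sum_j (N_j/N)\,\bar y_j$. The function names and the toy data are hypothetical.

```python
def ht_mean(y_sampled, n_stage1):
    """Root of sum_t a_t (y_t - theta) = 0 with weights a_t = N_j / N_bar_j.

    y_sampled : dict stratum -> list of y values observed in stage 2
    n_stage1  : dict stratum -> stage 1 stratum size N_j
    """
    num = 0.0
    den = 0.0
    for j, ys in y_sampled.items():
        weight = n_stage1[j] / len(ys)   # inverse of the estimated sampling probability
        num += weight * sum(ys)
        den += weight * len(ys)          # sums to N over all strata
    return num / den


def stratified_mean(y_sampled, n_stage1):
    """Stratified estimate sum_j (N_j / N) * (mean of sampled y in stratum j)."""
    n_total = sum(n_stage1.values())
    return sum(n_stage1[j] / n_total * (sum(ys) / len(ys))
               for j, ys in y_sampled.items())


y_sampled = {"small": [10.0, 20.0, 30.0], "large": [100.0, 140.0]}
n_stage1 = {"small": 6, "large": 4}
print(ht_mean(y_sampled, n_stage1))          # 60.0
print(stratified_mean(y_sampled, n_stage1))  # also 60, up to floating-point rounding
```

The agreement of the two functions is the scalar, two-stage instance of the statement that $N^{-1} U_r(\hat\pi, \theta)$ is a stratified estimate of the mean score.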

4. ASYMPTOTIC PROPERTIES
The following theorem, proved in Appendix A, gives the asymptotic distributions
of solutions to $U_r(\hat\pi, \theta) = 0$ and $U_r(\pi, \theta) = 0$ under fixed fraction and Bernoulli
sampling.
Theorem 1. Consider an r-stage design as described above. Under assumptions (a)
and (b) in Section 1 and under the regularity conditions described in Appendix A, the
following hold.
(a) For fixed fraction sampling, as $N \to \infty$,
(i) with probability approaching 1, the equation $U_r(\pi, \theta)\ (= U_r(\hat\pi, \theta)) = 0$ has
a unique root $\hat\theta$ and
(ii) $N^{1/2}(\hat\theta - \theta_0)$ converges in distribution to a Gaussian with mean 0 and
variance $A^{-1} V (A^{-1})^T$, where $A$ is given by equations (1) and
$$V = B_1 + B_2 + \ldots + B_r. \tag{9}$$
Here $B_1$ is given by equations (1) and
$$B_s = \sum \omega_{s:j\ldots l}\, \frac{1 - p_{s:j\ldots l}}{\pi_{s:j\ldots l}}\, \Gamma_{s:j\ldots l}, \qquad s = 2, \ldots, r.$$
(b) For Bernoulli sampling, as $N \to \infty$,
(i) with probability approaching 1, $U_r(\hat\pi, \theta)$ and $U_r(\pi, \theta)$ have unique roots
$\hat\theta(\hat\pi)$ and $\hat\theta(\pi)$ respectively and
(ii) $N^{1/2}\{\hat\theta(\hat\pi) - \theta_0\}$ and $N^{1/2}\{\hat\theta(\pi) - \theta_0\}$ converge in distribution to Gaussians
with mean 0 and with respective variances $A^{-1} V (A^{-1})^T$ and $A^{-1} V_p (A^{-1})^T$.
Here $V$ is given by equation (9) and
$$V_p = \sum \omega_{r:j\ldots m}\, \pi_{r:j\ldots m}^{-1}\, \Sigma_{r:j\ldots m} = B_1 + C_2 + \ldots + C_r, \tag{10}$$
with
$$C_s = \sum \omega_{s:j\ldots l}\, \frac{1 - p_{s:j\ldots l}}{\pi_{s:j\ldots l}}\, \Sigma_{s:j\ldots l}, \qquad s = 2, \ldots, r.$$
Theorem 1 shows that each stage of sampling adds a positive definite matrix to the
asymptotic variance of the estimates. Moreover, when $U_r(\hat\pi, \theta)$ is used to estimate
$\theta$, fixed fraction sampling and Bernoulli sampling give asymptotically equivalent
estimates. However, this equivalence does not hold when $U_r(\pi, \theta)$ is used. Indeed, a
comparison of equations (9) and (10) shows that $\hat\theta(\pi)$ is asymptotically no more efficient
than $\hat\theta(\hat\pi)$ under Bernoulli sampling, since $\Sigma_{s:j\ldots l} - \Gamma_{s:j\ldots l} = \mu_{s:j\ldots l}\, \mu_{s:j\ldots l}^T$ is
non-negative definite, and it is less efficient whenever some $\mu_{s:j\ldots l}$ in a subsampled
stratum is non-zero. Thus, as noted by Robins et al. (1994) for two-stage regression with
incomplete covariates, it is advantageous to use the observed sampling fractions even
when the theoretical fractions are known. Note also from equation (10) that, whatever
the number of stages of Bernoulli sampling is, $\hat\theta(\pi)$ is asymptotically equivalent
to the corresponding estimate for a two-stage design in which all N units are
classified in their final refined strata in the first stage, and then sampled with
probabilities $\pi_{r:j\ldots m}$.
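A small numerical check of equations (9) and (10) may help. The sketch below assumes a scalar score and a two-stage design (so $\pi_j = p_j$); the function name `two_stage_variances` and the stratum values are hypothetical inputs chosen so that the mean score $\sum_j \omega_j \mu_j$ is zero, as assumption (a) requires.

```python
def two_stage_variances(omega, p, mu, gamma):
    """Return (V, V_p) of equations (9) and (10) for a scalar score with r = 2.

    omega[j] : stratum probability
    p[j]     : stage 2 sampling probability (equal to pi_j when r = 2)
    mu[j]    : conditional mean of the score in stratum j
    gamma[j] : conditional variance of the score in stratum j
    """
    sigma = [g + m * m for g, m in zip(gamma, mu)]   # conditional second moments
    b1 = sum(w * s for w, s in zip(omega, sigma))    # B_1 = E[u^2]
    b2 = sum(w * (1 - q) / q * g for w, q, g in zip(omega, p, gamma))
    c2 = sum(w * (1 - q) / q * s for w, q, s in zip(omega, p, sigma))
    return b1 + b2, b1 + c2


v, v_p = two_stage_variances(omega=[0.5, 0.5], p=[0.5, 0.25],
                             mu=[-1.0, 1.0], gamma=[1.0, 2.0])
print(v, v_p)  # 6.0 8.0
```

Here $V_p - V = \sum_j \omega_j (1 - p_j) \mu_j^2 / p_j = 2$, which is the efficiency cost, in this toy setting, of weighting by the known rather than the observed sampling fractions under Bernoulli sampling.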
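Consistent variance estimates can be obtained from the relevant formulae by
replacing $A$ by
$$\hat A = -N^{-1} \left. \frac{\partial}{\partial\theta} U_r(\cdot, \theta) \right|_{\theta = \hat\theta},$$
by replacing $\omega_{s:j\ldots l}$ and $\mu_{s:j\ldots l}$ by $\hat\omega_{s:j\ldots l}$ and $\hat\mu_{s:j\ldots l}$ respectively, and by replacing
$\Sigma_{s:j\ldots l}$ by $\hat\Sigma_{s:j\ldots l}$, defined by
$$\hat\Sigma_{s:j\ldots l} = \hat\omega_{s:j\ldots l}^{-1} \sum \hat\omega_{r:j\ldots l\ldots m}\, \hat\Sigma_{r:j\ldots l\ldots m}, \quad s < r; \qquad \hat\Sigma_{r:j\ldots m} = \frac{1}{\bar N_{r:j\ldots m}} \sum_{t \in \bar Q_{r:j\ldots m}} u_t(\hat\theta)\, u_t^T(\hat\theta), \tag{11}$$
and finally by replacing $B_1$ by
$$\hat B_1 = \sum \hat\omega_{r:j\ldots m}\, \hat\Sigma_{r:j\ldots m}.$$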

A `model-based' variance estimate, i.e. an estimate obtained by assuming that $A = B_1$,
is $\hat A^{-1} + \hat A^{-1} \hat V (\hat A^{-1})^T$, where $\hat V$ estimates $V$ or $V_p$, as appropriate.

5. EXAMPLE
We compare the maximum likelihood and Horvitz-Thompson estimates of the
number of acres devoted to corn growth in Jefferson County, Iowa. Suppose that in
stage 1 a sample of N farms is stratified by size as small (160 acres or fewer) or large
(more than 160 acres). In stage 2, total acreage and corn acreage are obtained for a
proportion $p_1$ of the small farms and $p_2$ of the large farms. Let $y_t = (x_t, z_t)^T$, where $x_t$
and $z_t$ represent respectively the total size in acres and acres devoted to corn
production for farm t, $t = 1, \ldots, N$. A parametric approach to the problem might
assume that $y_t$ has a bivariate Gaussian distribution with mean $\theta = (\alpha, \beta)$ and with
covariance parameters $(\sigma_x^2, \sigma_z^2, \rho)$ which we assume are known. We wish to estimate $\beta$.
The two-stage likelihood score (3) gives the maximum likelihood estimate
$$\hat\beta = \bar z + \rho\, \frac{\sigma_z}{\sigma_x}\, (\hat\alpha - \bar x),$$
where $\bar x$ and $\bar z$ are the means among all farms sampled in stage 2, and $\hat\alpha$ satisfies
$$N \hat\alpha = (N_1 - \bar N_1)\, \alpha_1(\hat\alpha) + (N_2 - \bar N_2)\, \alpha_2(\hat\alpha) + (\bar N_1 + \bar N_2)\, \bar x.$$
Here $\alpha_j(\alpha)$ is the expected farm size given that the farm belongs to stratum j, which is
determined (up to $\alpha$) by the Gaussian distribution of farm sizes. Thus departures
from the assumed Gaussian farm size distribution (which is of no direct interest) can
bias $\hat\beta$. In contrast, the Horvitz-Thompson function uses $\hat\alpha_j = \bar N_j^{-1} \sum_{t \in \bar Q_j} x_t$ in place of
$\alpha_j$, giving $\hat\alpha = N^{-1}(N_1 \hat\alpha_1 + N_2 \hat\alpha_2)$.
This example is somewhat unrealistic as it assumes that the investigator knows
only the strata (large compared with small) of the farms sampled in stage 1 but not in
stage 2. More plausibly, the actual sizes $x_1, \ldots, x_N$ of all sampled farms would be
obtained in stage 1. In the latter circumstance, the maximum likelihood estimate $\hat\alpha$
is just the mean size of the N farms (see also Breckling et al. (1994), section 3.1).
Nevertheless the example illustrates the robustness of Horvitz-Thompson
estimation when $f(y_t; \theta) = g(x; \alpha)\, h(z \mid x; \beta)$ and $\alpha$ is a nuisance parameter. The
Horvitz-Thompson function avoids modelling the $\omega_j$ which involve $\alpha$. A more realistic
example of this robustness involving familial cancer survival times is given by
Whittemore and Halpern (1997).
6. DISCUSSION
Multistage sampling has proved cost efficient in many applications and raises
interesting design issues. The objectives usually are to minimize the variance of a
parameter estimate or to maximize the total number of units with a certain desirable
attribute, subject to the constraint of a fixed total budget. For variance minimization,
the simple forms of the Rao-Cramér bound and the asymptotic variance of the
Horvitz-Thompson estimator facilitate optimization. These variances apply to any
combination of Bernoulli and fixed fraction sampling.
The Horvitz-Thompson or pseudolikelihood estimating function has been the
subject of considerable research, beginning with the seminal work of Godambe and
Thompson (1986). Its properties have been described in the survey research literature
(e.g. Skinner et al. (1989) and Pfeffermann (1993)) and in the biometrics literature (e.g.
Robins et al. (1994) and Pepe et al. (1994)). The function is useful when fitting
complex models for which the likelihood score is cumbersome, intractable or sensitive to
model misspecification. Although it can give estimates that are asymptotically less
efficient than the maximum likelihood estimates (see Robins et al. (1994) and
Whittemore and Halpern (1997)), substantial efficiency loss seems to occur chiefly
when multistage sampling is unnecessary (i.e. when stratified sampling is not useful).
Further work is needed to clarify the trade-off between the Horvitz-Thompson
estimator's efficiency loss and its virtues: ease of implementation, robustness and
versatility. There also is a need to examine the performance of Horvitz-Thompson
estimates in small samples, and under `sparse data' asymptotics, i.e. as the number of
final strata grows large but the number of units per stratum remains bounded.

ACKNOWLEDGEMENTS
This research was supported by National Institutes of Health grant CA 47448.

APPENDIX A
Although the asymptotic distribution of estimates obtained from $U_r(\hat\pi, \theta)$ is independent of
the sampling plan (Bernoulli compared with fixed fraction sampling), its derivation is not.
The work of Pepe et al. (1994) can be extended to cover Bernoulli sampling but not fixed
fraction sampling, because under fixed fraction sampling the sampling indicators for distinct
units are not independent. However, arguments of Cochran (1977) for double sampling can be
generalized to handle fixed fraction sampling in arbitrarily many stages.
We assume conditions (a) and (b) of Section 1 and the following regularity condition: there
is a neighbourhood of the true parameter $\theta_0$ within which $u_t(\theta)$ has bounded second
derivatives and is bounded away from 0 almost surely.
Lemma 1. Suppose that
$$u(\theta) = \frac{\partial}{\partial\theta} \log f(y; \theta),$$
where $f(y; \theta)$ is the probability density function of $y$, and let
$$\omega_{s:j\ldots l} = \int_{S_{s:j\ldots l}} f(y; \theta)\, dy.$$
Then
(a) $\dfrac{\partial}{\partial\theta} \log \omega_{s:j\ldots l} = E[u(\theta) \mid y \in S_{s:j\ldots l}] \equiv \mu_{s:j\ldots l}$,
(b) $-\dfrac{\partial^2}{\partial\theta^2} \log \omega_{s:j\ldots l} = A_{s:j\ldots l} - \Gamma_{s:j\ldots l}$,
where
$$A_{s:j\ldots l} = E\left[ -\frac{\partial}{\partial\theta} u(\theta) \,\Big|\, y \in S_{s:j\ldots l} \right].$$
This lemma is easily verified by interchanging differentiation with respect to the
components of $\theta$ and integration over the subspace $S_{s:j\ldots l}$ of the sample space $S$ of $y$.
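For completeness, the interchange can be written out as follows. This is a sketch under the assumption that differentiation under the integral sign is permitted; for brevity it writes $S$, $\omega$, $\mu$, $\Sigma_S$, $\Gamma_S$ and $A_S$ for the quantities indexed by $s:j\ldots l$.

```latex
\begin{align*}
\frac{\partial}{\partial\theta}\log\omega
  &= \frac{1}{\omega}\int_S \frac{\partial f(y;\theta)}{\partial\theta}\,dy
   = \int_S u(\theta)\,\frac{f(y;\theta)}{\omega}\,dy
   = E[u(\theta)\mid y\in S] = \mu, \\
\frac{\partial^2}{\partial\theta\,\partial\theta^T}\log\omega
  &= \frac{1}{\omega}\int_S
     \left\{\frac{\partial u(\theta)}{\partial\theta^T} + u(\theta)\,u(\theta)^T\right\}
     f(y;\theta)\,dy \;-\; \mu\mu^T \\
  &= -A_S + \Sigma_S - \mu\mu^T = -(A_S - \Gamma_S).
\end{align*}
```

The first line is part (a). The second uses $\partial f/\partial\theta = u f$ twice, together with $\Gamma_S = \Sigma_S - \mu\mu^T$, and gives part (b) after a change of sign.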
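Lemma 2. The Rao-Cramér lower bound for an r-stage design is given by equation (5).
Proof. From expression (3) the Fisher information for the r-stage design is
$$E\left[ -\frac{\partial}{\partial\theta} \bar U_r(\theta) \right] = \sum_{s=2}^{r} \sum_{j\ldots l} \omega_{s:j\ldots l}\, \pi_{s-1:j\ldots k}\, (1 - p_{s:j\ldots l}) \left[ -\frac{\partial^2}{\partial\theta^2} \log \omega_{s:j\ldots l} \right] + \sum_{j\ldots m} \omega_{r:j\ldots m}\, \pi_{r:j\ldots m}\, A_{r:j\ldots m}.$$
Now we use part (b) of lemma 1 to replace $-\partial^2 (\log \omega_{s:j\ldots l})/\partial\theta^2$ by $A_{s:j\ldots l} - \Gamma_{s:j\ldots l}$, equation
(2) to replace $\pi_{s-1:j\ldots k}\, p_{s:j\ldots l}$ by $\pi_{s:j\ldots l}$ and rearrange terms to obtain
$$E\left[ -\frac{\partial}{\partial\theta} \bar U_r(\theta) \right] = \sum_j \omega_{2:j} A_{2:j} - \sum_{s=2}^{r} \sum_{j\ldots l} \omega_{s:j\ldots l}\, \pi_{s-1:j\ldots k}\, (1 - p_{s:j\ldots l})\, \Gamma_{s:j\ldots l}$$
$$= A - \sum_{s=2}^{r} \sum_{j\ldots l} \omega_{s:j\ldots l}\, \pi_{s-1:j\ldots k}\, (1 - p_{s:j\ldots l})\, \Gamma_{s:j\ldots l}.$$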

Lemma 3. Under the assumptions of theorem 1 and the above regularity condition, the
following hold for both fixed fraction and Bernoulli sampling:
(a) with probability approaching 1 as $N \to \infty$, the equations $U_r(\hat\pi, \theta) = 0$ and
$U_r(\pi, \theta) = 0$ have unique roots $\hat\theta(\hat\pi)$ and $\hat\theta(\pi)$ respectively;
(b) the roots are consistent;
(c) $-N^{-1}\, \partial U_r(\hat\pi, \theta)/\partial\theta$ and $-N^{-1}\, \partial U_r(\pi, \theta)/\partial\theta$
both converge in probability to the positive definite matrix $A$ of equations (1).
The proof of lemma 3 uses the same arguments as given in the proof of theorem 3.1 in Pepe
et al. (1994) and is omitted.
Lemma 4. Under the assumptions of theorem 1 each of the following is $o_P(1)$:

(a) $N^{-1/2} (\hat\pi_{r:j\ldots m} - \pi_{r:j\ldots m}) \sum_{t \in \bar Q_{r:j\ldots m}} \{u_t(\theta) - \mu_{r:j\ldots m}\}$;
(b) $N^{-1/2} \sum \hat\omega_{r:j\ldots m}\, \mu_{r:j\ldots m}$;
(c) $N^{-1/2} \sum_t \{1(t \in \bar Q_{r-1:j\ldots l})\, \mu_{r-1:j\ldots l} - \sum_m 1(t \in Q_{r:j\ldots m})\, \mu_{r:j\ldots m}\}$.

Proof of lemma 4. Part (a) follows from the proof of lemma A.1 of Pepe et al. (1994). Part
(b) holds because
$$N^{-1/2} \sum \hat\omega_{r:j\ldots m}\, \mu_{r:j\ldots m} = N^{-1/2} \sum (\hat\omega_{r:j\ldots m} - \omega_{r:j\ldots m})\, \mu_{r:j\ldots m} = o_P(1).$$
To verify part (c), we note that up to terms $o_P(1)$
$$N^{-1/2} \sum_t 1(t \in \bar Q_{r-1:j\ldots l})\, \mu_{r-1:j\ldots l} = N^{-1/2} \sum_t \sum_m 1(t \in \bar Q_{r-1:j\ldots l})\, \frac{\omega_{r:j\ldots m}}{\omega_{r-1:j\ldots l}}\, \mu_{r:j\ldots m} = N^{-1/2} \sum_t \sum_m 1(t \in Q_{r:j\ldots m})\, \mu_{r:j\ldots m}.$$

Proof of theorem 1. Consistency follows from lemma 3. To derive asymptotic normality
and variances, we expand $N^{-1/2} U_r(\hat\pi, \theta)$ in a second-order Taylor series to obtain, using
lemma 3,
$$N^{1/2} \{\hat\theta(\hat\pi) - \theta_0\} = \left[ -N^{-1} \frac{\partial}{\partial\theta} U_r(\hat\pi, \theta) \right]^{-1} N^{-1/2} U_r(\hat\pi, \theta) + o_P(1) = A^{-1} N^{-1/2} U_r(\hat\pi, \theta) + o_P(1).$$
A similar expansion is valid for $N^{-1/2} U_r(\pi, \theta)$. Therefore it suffices to determine the
asymptotic properties of $N^{-1/2} U_r(\hat\pi, \theta)$ and $N^{-1/2} U_r(\pi, \theta)$.
We begin with $N^{-1/2} U_r(\pi, \theta)$ under Bernoulli sampling. Normality follows from the
central limit theorem, since $U_r(\pi, \theta)$ is the sum of the N independent and identically
distributed random variables $a_{r:t}(\pi)\, u_t(\theta)$. For its asymptotic variance, we note that
$$a_{r:t}^2(\pi)\, u_t(\theta)\, u_t^T(\theta) = \sum \pi_{r:j\ldots m}^{-2}\, 1(t \in \bar Q_{r:j\ldots m})\, u_t(\theta)\, u_t^T(\theta)$$
converges in probability to $\sum \omega_{r:j\ldots m}\, \pi_{r:j\ldots m}^{-1}\, \Sigma_{r:j\ldots m}$, in agreement with equation (10).
Next we turn to the asymptotic distribution of $N^{-1/2} U_r(\hat\pi, \theta)$ under Bernoulli sampling,
using arguments similar to those of Pepe et al. (1994), theorem 3.2, for the two-stage design.
We rewrite equation (8) as
$$N^{-1/2} U_r(\hat\pi, \theta) = N^{1/2} \sum \hat\omega_{r:j\ldots m}\, (\hat\mu_{r:j\ldots m} - \mu_{r:j\ldots m}) + N^{1/2} \sum \hat\omega_{r:j\ldots m}\, \mu_{r:j\ldots m}.$$
Using $\hat\omega_{r:j\ldots m} = N^{-1} \bar N_{r:j\ldots m}\, \hat\pi_{r:j\ldots m}^{-1}$, we can write this as
$$N^{-1/2} U_r(\hat\pi, \theta) = N^{-1/2} \sum_{t=1}^{N} \sum \hat\pi_{r:j\ldots m}^{-1}\, 1(t \in \bar Q_{r:j\ldots m}) \{u_t(\theta) - \mu_{r:j\ldots m}\} + N^{1/2} \sum \hat\omega_{r:j\ldots m}\, \mu_{r:j\ldots m}.$$
By part (a) of lemma 4, we can, up to a term $o_P(1)$, replace $\hat\pi_{r:j\ldots m}$ in this expression by
$\pi_{r:j\ldots m}$. Also, by part (b) of lemma 4, the second summand is $o_P(1)$. Thus $N^{-1/2} U_r(\hat\pi, \theta)$ is
asymptotically equivalent to $N^{-1/2} \sum_{t=1}^{N} h_r(t)$, where
$$h_r(t) = \sum \pi_{r:j\ldots m}^{-1}\, 1(t \in \bar Q_{r:j\ldots m}) \{u_t(\theta) - \mu_{r:j\ldots m}\}.$$
Since the $h_r(t)$ are independent and identically distributed variables, asymptotic normality
again follows by the central limit theorem. The asymptotic variance of $N^{-1/2} U_r(\hat\pi, \theta)$ is
$$\lim_{N \to \infty} \left( V_1\left[ E_2\left\{ N^{-1/2} \sum_t h_r(t) \right\} \right] + E_1\left[ V_2\left\{ N^{-1/2} \sum_t h_r(t) \right\} \right] \right).$$
Here the subscripts 1 and 2 represent respectively expectation over repeated sampling in the
first $r - 1$ stages and over repeated sampling in the rth stage, given the values $y_1, \ldots, y_N$ of
the first-stage sample and the indicators for selection in the $r - 1$ preceding stages. Now
$$E_2\left\{ N^{-1/2} \sum_t h_r(t) \right\} = N^{-1/2} \sum_t \sum_{j\ldots l} \pi_{r-1:j\ldots l}^{-1} \sum_m 1(t \in Q_{r:j\ldots m}) \{u_t(\theta) - \mu_{r:j\ldots m}\} = N^{-1/2} \sum_t h_{r-1}(t) + o_P(1),$$
where the second equality follows from part (c) of lemma 4. By induction,
$$\lim_{N \to \infty} V_1\left\{ N^{-1/2} \sum_t h_{r-1}(t) \right\} = B_1 + \ldots + B_{r-1}.$$
Moreover
$$V_2\left\{ N^{-1/2} \sum_t h_r(t) \right\} = \sum \pi_{r-1:j\ldots l}^{-2}\, \frac{1 - p_{r:j\ldots m}}{p_{r:j\ldots m}}\, 1(t \in \bar Q_{r-1:j\ldots l}) \{u_t(\theta) - \mu_{r:j\ldots m}\} \{u_t(\theta) - \mu_{r:j\ldots m}\}^T.$$
Thus
$$\lim_{N \to \infty} E_1\left[ V_2\left\{ N^{-1/2} \sum_t h_r(t) \right\} \right] = \sum \omega_{r:j\ldots m}\, \frac{1 - p_{r:j\ldots m}}{\pi_{r:j\ldots m}}\, \Gamma_{r:j\ldots m} = B_r$$
as required.
Finally we consider fixed fraction sampling. The results of Hajek (1960) show that the
mutually independent and unbiased estimates $\hat\mu_{r:j\ldots m}$ are asymptotically normal. The
asymptotic normality of $N^{-1/2} U_r(\pi, \theta)$ as represented by equation (8) then follows from
Slutsky's theorem, since $E[\hat\omega_{r:j\ldots m}] = \omega_{r:j\ldots m}$. The asymptotic variance of $N^{-1/2} U_r(\pi, \theta)$ is
$$\lim_{N \to \infty} \left( V_1\{E_2[N^{-1/2} U_r(\pi, \theta)]\} + E_1[V_2\{N^{-1/2} U_r(\pi, \theta)\}] \right). \tag{12}$$
Since $E_2[1(t \in \bar Q_{r:j\ldots m})] = p_{r:j\ldots m}\, 1(t \in Q_{r:j\ldots m})$, it follows from equations (6) and (2) that
$E_2[a_{r:t}(\pi)] = a_{r-1:t}(\pi)$, i.e. $E_2[N^{-1/2} U_r(\pi, \theta)] = N^{-1/2} U_{r-1}(\pi, \theta)$. So by induction the first
summand of expression (12) converges to $B_1 + \ldots + B_{r-1}$. It remains to show that the
second summand converges to $B_r$. To do so, we write
$$N^{-1/2} U_r(\pi, \theta) = \begin{cases} N^{-1/2} \sum N_{2:j}\, \hat\mu_{2:j} & r = 2, \\ N^{-1/2} \sum \pi_{r-1:j\ldots l}^{-1}\, \bar N_{r-1:j\ldots l}\, \hat\mu_{r-1:j\ldots l} & r > 2. \end{cases}$$
Here $\hat\mu_{r-1:j\ldots l}$ is defined by equations (7). Using the variance of a stratified estimate (Cochran
(1977), theorem 5.3, p. 92),
$$V_2\{U_2(\pi, \theta)\} = \sum_j N_{2:j}\, \frac{1 - p_{2:j}}{p_{2:j}}\, \tilde\Gamma_{2:j},$$
$$V_2(\hat\mu_{r-1:j\ldots l}) = \bar N_{r-1:j\ldots l}^{-2} \sum_m N_{r:j\ldots m}\, \frac{1 - p_{r:j\ldots m}}{p_{r:j\ldots m}}\, \tilde\Gamma_{r:j\ldots m}, \qquad r > 2.$$
Here we have introduced $\tilde\Gamma_{r:j\ldots m} = \tilde\Sigma_{r:j\ldots m} - \tilde\mu_{r:j\ldots m} (\tilde\mu_{r:j\ldots m})^T$, where $\tilde\mu_{r:j\ldots m}$ and $\tilde\Sigma_{r:j\ldots m}$ are
defined by equations (7) and (11) respectively, with $\bar Q_{r:j\ldots m}$ replaced by $Q_{r:j\ldots m}$. Thus
$$V_2\{N^{-1/2} U_r(\pi, \theta)\} = \begin{cases} \sum_j \dfrac{N_{2:j}}{N}\, \dfrac{1 - p_{2:j}}{p_{2:j}}\, \tilde\Gamma_{2:j} & r = 2, \\ \sum_{j\ldots l} \pi_{r-1:j\ldots l}^{-2} \sum_m \dfrac{N_{r:j\ldots m}}{N}\, \dfrac{1 - p_{r:j\ldots m}}{p_{r:j\ldots m}}\, \tilde\Gamma_{r:j\ldots m} & r > 2. \end{cases}$$
Since $N_{r:j\ldots m}/N$ converges in probability to $\omega_{r:j\ldots m}\, \pi_{r-1:j\ldots l}$, $E_1[V_2\{N^{-1/2} U_r(\pi, \theta)\}]$ converges
to
$$\sum \omega_{r:j\ldots m}\, \frac{1 - p_{r:j\ldots m}}{\pi_{r:j\ldots m}}\, \Gamma_{r:j\ldots m} = B_r$$
as required.
This completes the proof of theorem 1.

REFERENCES
Breckling, J. U., Chambers, R. L., Dorfman, A. H., Tam, S. M. and Welsh, A. H. (1994) Maximum
likelihood inference from sample survey data. Int. Statist. Rev., 62, 349-363.
Breslow, N. E. and Cain, K. C. (1988) Logistic regression for two-stage case-control data. Biometrika,
75, 11-20.
Cochran, W. G. (1977) Sampling Techniques, 3rd edn. New York: Wiley.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman and Hall.
Flanders, W. D. and Greenland, S. (1991) Analytic methods for two-stage case control studies and
other stratified designs. Statist. Med., 10, 739-747.
Godambe, V. P. and Thompson, M. E. (1986) Parameters of superpopulation and survey population:
their relationship and estimation. Int. Statist. Rev., 54, 127-138.
Hajek, J. (1960) Limiting distributions in simple random sampling from a finite population. Publ. Math.
Inst. Hung. Acad. Sci., 5, 361-374.
Horvitz, D. G. and Thompson, D. J. (1952) A generalization of sampling without replacement from a
finite universe. J. Am. Statist. Ass., 47, 663-685.
Kalbfleisch, J. D. and Lawless, J. F. (1988) Likelihood analysis of multi-state models for disease
incidence and mortality. Statist. Med., 7, 149-160.
Little, R. J. A. and Rubin, D. B. (1987) Statistical Analysis with Missing Data. New York: Wiley.
Pepe, M. S. (1992) Inference using surrogate outcome data and a validation sample. Biometrika, 79,
355-362.
Pepe, M. S., Reilly, M. and Fleming, T. R. (1994) Auxiliary outcome data and the mean score method.
J. Statist. Planng Inf., 42, 137-160.
Pfeffermann, D. (1993) The role of sampling weights when modeling survey data. Int. Statist. Rev., 61,
317-337.
Reilly, M. and Pepe, M. S. (1995) A mean-score method for missing and auxiliary covariate data in
regression models. Biometrika, 82, 299-314.
Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994) Estimation of regression coefficients when some
regressors are not always observed. J. Am. Statist. Ass., 89, 846-866.
Skinner, C. J., Holt, D. and Smith, T. M. F. (1989) Analysis of Complex Surveys. Chichester: Wiley.
Weinberg, C. R. and Wacholder, S. (1990) The design and analysis of case-control studies with biased
sampling. Biometrics, 46, 963-976.
Whittemore, A. S. and Halpern, J. (1997) Multistage sampling designs in genetic epidemiology. Statist.
Med., 16, 153-167.
