Stratified Sampling
has been allocated proportional to stratum size, so that n1 = 20, n2 = 10, and
n3 = n4 = 5.
The results in the next section are written to allow for the possibility of any
design within a given stratum, provided that the selections are independent between
strata; then specific results for stratified random sampling are given.
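$$\hat{\tau}_h = N_h \bar{y}_h$$

$$\bar{y}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi}$$

$$\hat{\tau}_{st} = \sum_{h=1}^{L} N_h \bar{y}_h$$

having variance

$$\operatorname{var}(\hat{\tau}_{st}) = \sum_{h=1}^{L} N_h (N_h - n_h) \frac{\sigma_h^2}{n_h}$$

where

$$\sigma_h^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (y_{hi} - \mu_h)^2$$

where

$$s_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2$$

$$\bar{y}_{st} = \frac{1}{N} \sum_{h=1}^{L} N_h \bar{y}_h \tag{11.1}$$

Its variance is

$$\operatorname{var}(\bar{y}_{st}) = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \left(\frac{N_h - n_h}{N_h}\right) \frac{\sigma_h^2}{n_h} \tag{11.2}$$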
Example 1: The results of a stratified random sample are summarized in Table 11.1.
Substituting in Equation (11.1), the estimate of the population mean is
$$\bar{y}_{st} = \frac{1}{41}[20(1.6) + 9(2.8) + 12(0.6)] = \frac{1}{41}(32 + 25.2 + 7.2) = \frac{64.4}{41} = 1.57$$
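As a quick check of the arithmetic, the same calculation can be sketched in R. The vectors `Nh` and `ybar` hold the stratum sizes and stratum sample means as they appear in the substitution above; the variable names are illustrative and not from the text.

```r
# Stratum sizes and stratum sample means, as substituted into Equation (11.1)
Nh   <- c(20, 9, 12)
ybar <- c(1.6, 2.8, 0.6)

N <- sum(Nh)                    # population size (41 units)
tauhat_st <- sum(Nh * ybar)     # stratified estimate of the population total
ybar_st   <- tauhat_st / N      # stratified estimate of the population mean

round(ybar_st, 2)               # 1.57, matching the hand calculation
```

Because the estimate of the mean is just the estimated total divided by N, the same two vectors give both quantities.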
When all the stratum sample sizes are sufficiently large, an approximate
100(1 − α)% confidence interval for the population total is provided by

$$\hat{\tau}_{st} \pm t \sqrt{\widehat{\operatorname{var}}(\hat{\tau}_{st})}$$

where t is the upper α/2 point of the normal distribution. For the mean, the
confidence interval is $\hat{\mu}_{st} \pm t \sqrt{\widehat{\operatorname{var}}(\hat{\mu}_{st})}$. As a rule of thumb, the normal approximation
may be used if all the sample sizes are at least 30. With small sample sizes, the t dis-
tribution with an approximate degrees of freedom may be used. The Satterthwaite
(1946) approximation for the degrees of freedom d to be used is
$$d = \left(\sum_{h=1}^{L} a_h s_h^2\right)^2 \Bigg/ \sum_{h=1}^{L} \frac{(a_h s_h^2)^2}{n_h - 1} \tag{11.4}$$
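The degrees-of-freedom formula can be sketched in R as follows. This is a minimal illustration, not the author's code: it assumes that a_h = N_h(N_h − n_h)/n_h, the coefficient of s_h² in the estimated variance of the stratified total (the definition of a_h does not appear in this excerpt), and the input values are hypothetical.

```r
# Satterthwaite approximate degrees of freedom, Equation (11.4).
# ASSUMPTION: a_h = N_h (N_h - n_h) / n_h (not defined in this excerpt).
satterthwaite_df <- function(Nh, nh, s2h) {
  ah <- Nh * (Nh - nh) / nh
  sum(ah * s2h)^2 / sum((ah * s2h)^2 / (nh - 1))
}

# Hypothetical small-sample inputs, for illustration only
Nh  <- c(100, 50, 30)    # stratum sizes
nh  <- c(8, 6, 5)        # stratum sample sizes
s2h <- c(4.0, 9.5, 2.2)  # stratum sample variances

d <- satterthwaite_df(Nh, nh, s2h)
d
qt(0.975, df = d)        # t multiplier for an approximate 95% interval
```

The value of d always lies between the smallest n_h − 1 and the sum of the n_h − 1, so the resulting t multiplier is never smaller than the one based on the pooled degrees of freedom.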
Since the formula for the variance of the estimator of the population mean or total
with stratified sampling contains only within-stratum population variance terms, the
estimators will be more precise the smaller the $\sigma_h^2$. Equivalently, estimation of the
population mean or total will be most precise if the population is partitioned into
strata in such a way that within each stratum, the units are as similar as possible.
Thus, in a survey of a plant or animal population, the study area might be stratified
into regions of similar habitat or elevation, with the idea that within strata, abun-
dances will be more similar than between strata. In a survey of a human population,
stratification may be based on socioeconomic factors or geographic region.
Given a total sample size n, one may choose how to allocate it among the L strata. If
each stratum is the same size and one has no prior information about the population,
a reasonable choice would be to assume equal sample sizes for the strata, so that
for stratum h the sample size would be

$$n_h = \frac{n}{L}$$
$$c = c_0 + c_1 n_1 + c_2 n_2 + \cdots + c_L n_L$$

where c is the total cost of the survey, $c_0$ is an “overhead” cost, and $c_h$ is the cost
per unit observed in stratum h. Then for a fixed total cost c, the lowest variance is
achieved with sample size in stratum h proportional to $N_h \sigma_h / \sqrt{c_h}$, that is,

$$n_h = \frac{(c - c_0) N_h \sigma_h / \sqrt{c_h}}{\sum_{k=1}^{L} N_k \sigma_k \sqrt{c_k}}$$
Thus, the optimum scheme allocates larger sample size to the larger or more
variable strata and smaller sample size to the more expensive or difficult-to-sample
strata.
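A minimal R sketch of the fixed-cost allocation follows. The function name `allocate_fixed_cost` and all input values are hypothetical, chosen only to show that the allocation spends exactly the available budget and shifts sample away from the costly stratum.

```r
# Optimum allocation for a fixed total cost c = c0 + sum(ch * nh).
# Function name and inputs are illustrative, not from the text.
allocate_fixed_cost <- function(total_cost, c0, Nh, sigmah, ch) {
  (total_cost - c0) * (Nh * sigmah / sqrt(ch)) / sum(Nh * sigmah * sqrt(ch))
}

Nh     <- c(500, 300, 200)   # hypothetical stratum sizes
sigmah <- c(10, 20, 40)      # hypothetical stratum standard deviations
ch     <- c(1, 4, 9)         # hypothetical per-unit costs
c0     <- 100                # hypothetical overhead cost

nh <- allocate_fixed_cost(total_cost = 1000, c0 = c0, Nh, sigmah, ch)
nh                           # unrounded stratum sample sizes
c0 + sum(ch * nh)            # total cost of the allocation
```

In practice the n_h would be rounded to whole numbers, after which the total cost matches the budget only approximately.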
$$n_3 = \frac{12(120)(300)}{150(100) + 90(200) + 120(300)} = 6.3$$
Rounding to whole numbers gives n1 = 3, n2 = 3, and n3 = 6.
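The rounded allocation can be checked with a short R sketch. It assumes the products in the calculation above are N_h σ_h with stratum sizes 150, 90, 120, standard deviations 100, 200, 300, and total sample size n = 12; that reading is inferred from the arithmetic shown, since the example setup is not part of this excerpt.

```r
# ASSUMPTION: the terms 150(100), 90(200), 120(300) are N_h * sigma_h
n      <- 12
Nh     <- c(150, 90, 120)
sigmah <- c(100, 200, 300)

# Equal-cost optimum allocation: n_h proportional to N_h * sigma_h
nh <- n * Nh * sigmah / sum(Nh * sigmah)

round(nh, 1)     # 2.6 3.1 6.3
round(nh)        # 3 3 6
sum(round(nh))   # 12, so rounding preserves the total here
```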
11.6. POSTSTRATIFICATION
In some situations it may be desired to classify the units of a sample into strata
and to use a stratified estimate, even though the sample was selected by simple
random, rather than stratified, sampling. For example, a simple random sample of
a human population may be stratified by sex after selection of the sample, or a
simple random sample of sites in a fishery survey may be poststratified on depth.
In contrast to conventional stratified sampling, with poststratification, the stratum
sample sizes $n_1, n_2, \ldots, n_L$ are random variables.
With proportional allocation in conventional stratified random sampling, the
sample size in stratum h is fixed at $n_h = nN_h/N$ and the variance (Eq. (11.2))
simplifies to $\operatorname{var}(\bar{y}_{st}) = [(N - n)/nN] \sum_{h=1}^{L} (N_h/N)\sigma_h^2$. With poststratification of
a simple random sample of n units from the whole population, the sample size
$n_h$ in stratum h has expected value $nN_h/N$, so that the resulting sample tends
to approximate proportional allocation. With poststratification the variance of the
stratified estimator $\bar{y}_{st} = \sum_{h=1}^{L} (N_h/N)\bar{y}_h$ is approximately

$$\operatorname{var}(\bar{y}_{st}) \approx \frac{N - n}{nN} \sum_{h=1}^{L} \frac{N_h}{N} \sigma_h^2 + \frac{1}{n^2} \frac{N - n}{N - 1} \sum_{h=1}^{L} \frac{N - N_h}{N} \sigma_h^2 \tag{11.5}$$

and the variance of $\hat{\tau}_{st} = N\bar{y}_{st}$ is $\operatorname{var}(\hat{\tau}_{st}) = N^2 \operatorname{var}(\bar{y}_{st})$. The first term is the
variance that would be obtained using a stratified random sampling design with
proportional allocation. An additional term is added to the variance with poststratification,
due to the random sample sizes.
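To see how large the additional term is relative to the proportional-allocation term, the two pieces of Equation (11.5) can be computed separately. The sketch below is illustrative only: the function name `poststrat_var` and the population values are hypothetical.

```r
# Approximate variance of the poststratified mean, Equation (11.5),
# returned term by term. Function name and inputs are illustrative.
poststrat_var <- function(N, n, Nh, sigma2h) {
  W <- Nh / N                                            # relative stratum sizes
  term1 <- (N - n) / (n * N) * sum(W * sigma2h)          # proportional-allocation term
  term2 <- (1 / n^2) * (N - n) / (N - 1) * sum((1 - W) * sigma2h)  # random-n_h term
  c(proportional = term1, extra = term2, total = term1 + term2)
}

# Hypothetical population: three strata, simple random sample of n = 50
v <- poststrat_var(N = 1000, n = 50, Nh = c(600, 300, 100),
                   sigma2h = c(4, 9, 25))
v
unname(v["extra"] / v["total"])   # share of the variance due to random n_h
```

Because the second term carries a factor 1/n² while the first carries 1/n, the penalty for random stratum sample sizes shrinks quickly as n grows.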
For a variance estimate with which to construct a confidence interval for the
population mean with poststratified data from a simple random sample, it is rec-
ommended to use the standard stratified sampling method (Eq. (11.3)) rather than
substituting the sample variances directly into Equation (11.5). With poststratification,
the standard formula (Eq. (11.3)) estimates the conditional variance (given by
Eq. (11.2)) of $\bar{y}_{st}$ given the sample sizes $n_1, \ldots, n_L$, while Equation (11.5) is the
unconditional variance [and see the comments of J. N. K. Rao (1988, p. 440)].
To use poststratification, the relative size $N_h/N$ of each stratum must be known.
If the relative stratum sizes are not known, they may be estimated using double
sampling (see Chapter 14). Further discussion of poststratification may be found
in Cochran (1977), Hansen et al. (1953), Hedayat and Sinha (1991), Kish (1965),
Levy and Lemeshow (1991), Singh and Chaudhary (1986), and Sukhatme and
Sukhatme (1970). Variance approximations for poststratification vary among the
sampling texts. The derivation for the expression given here is given in Section
11.8 under the heading “Poststratification Variance.”
A simple model for a stratified population assumes that the population Y-values are
independent random variables, each having a normal distribution, and with means
and variances depending on stratum membership. Under this model, the value $Y_{hi}$
for the ith unit in stratum h has a normal distribution with mean $\mu_h$ and variance $\sigma_h^2$,
for $h = 1, \ldots, L$, $i = 1, \ldots, N_h$, and the $Y_{hi}$ are independent. A stratified sample
s is selected using any conventional design within each stratum.
Since for each unit $Y_{hi}$ is a random variable, the population total
$T = \sum_{h=1}^{L} \sum_{i=1}^{N_h} Y_{hi}$ is also a
random variable. Since the Y-values are observed only for units in the sample, we
wish to predict T using a predictor $\hat{T}$ computed from the sample data. Desirable
properties to have in a predictor $\hat{T}$ include model unbiasedness,

$$E(\hat{T}) = E(T)$$

where expectation is taken with respect to the model. In addition, we would like
the mean square prediction error $E(\hat{T} - T)^2$ to be as low as possible.
For a given sample the best unbiased predictor of the population total T is

$$\hat{T} = \sum_{h=1}^{L} N_h \bar{y}_h$$
Optimum Allocation
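Consider the variance of the estimator $\hat{\tau}_{st}$ as a function f of the sample sizes, with
the total sample size n given. The object is to choose $n_1, n_2, \ldots, n_L$ to minimize

$$f(n_1, \ldots, n_L) = \operatorname{var}(\hat{\tau}_{st}) = \sum_{h=1}^{L} N_h \sigma_h^2 \left(\frac{N_h}{n_h} - 1\right)$$

subject to the constraint

$$\sum_{h=1}^{L} n_h = n$$

With the Lagrange function $H = f - \lambda\left(\sum_{h=1}^{L} n_h - n\right)$, setting the partial derivatives to zero gives

$$\frac{\partial H}{\partial n_h} = -\frac{N_h^2 \sigma_h^2}{n_h^2} - \lambda = 0$$

for $h = 1, \ldots, L$, and solving together with the constraint yields

$$n_h = \frac{n N_h \sigma_h}{\sum_{k=1}^{L} N_k \sigma_k}$$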
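The algebra between the stationarity condition and the allocation formula is short and can be written out explicitly. The steps below are a sketch that assumes the multiplier λ is negative at the solution, which is what the condition ∂H/∂n_h = 0 requires when all n_h are positive.

```latex
% From the stationarity condition, solve for n_h (taking the positive root):
n_h^2 = \frac{N_h^2 \sigma_h^2}{-\lambda}
\quad\Longrightarrow\quad
n_h = \frac{N_h \sigma_h}{\sqrt{-\lambda}}

% Sum over strata and apply the constraint \sum_h n_h = n:
n = \sum_{k=1}^{L} n_k = \frac{1}{\sqrt{-\lambda}} \sum_{k=1}^{L} N_k \sigma_k
\quad\Longrightarrow\quad
\frac{1}{\sqrt{-\lambda}} = \frac{n}{\sum_{k=1}^{L} N_k \sigma_k}

% Substitute back:
n_h = \frac{n N_h \sigma_h}{\sum_{k=1}^{L} N_k \sigma_k}
```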
To verify that the solution gives a minimum of the variance function, as opposed
to a maximum or saddle point, the second derivatives are examined. Writing $H_{hk}$
for the second partial derivative $\partial^2 H / \partial n_h \partial n_k$ gives

$$H_{hh} = \frac{2 N_h^2 \sigma_h^2}{n_h^3}, \qquad h = 1, \ldots, L$$

$$H_{hk} = 0, \qquad h \neq k$$

$$H_{\lambda h} = -1$$

$$H_{\lambda\lambda} = 0$$
Poststratification Variance
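With simple random sampling, the number $n_h$ of sample units in stratum h has a
hypergeometric distribution with $E(n_h) = nN_h/N$ and

$$\operatorname{var}(n_h) = n \left(\frac{N_h}{N}\right) \left(1 - \frac{N_h}{N}\right) \left(\frac{N - n}{N - 1}\right)$$

Using a Taylor series approximation for $1/n_h$, whose first derivative is $-n_h^{-2}$ and
second derivative is $2n_h^{-3}$, and taking expectation gives the approximation

$$E\left(\frac{1}{n_h}\right) \approx \frac{1}{E(n_h)} + \frac{1}{[E(n_h)]^3} \operatorname{var}(n_h) = \frac{N}{nN_h} + \frac{N^2}{n^2 N_h^2} \left(\frac{N - N_h}{N}\right) \left(\frac{N - n}{N - 1}\right)$$

Substituting this approximation into the variance expression gives

$$\operatorname{var}(\bar{y}_{st}) \approx \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \sigma_h^2 \left[\frac{N}{nN_h} + \frac{N^2}{n^2 N_h^2} \left(\frac{N - N_h}{N}\right) \left(\frac{N - n}{N - 1}\right) - \frac{1}{N_h}\right]$$

$$= \frac{N - n}{nN} \sum_{h=1}^{L} \frac{N_h}{N} \sigma_h^2 + \frac{1}{n^2} \frac{N - n}{N - 1} \sum_{h=1}^{L} \frac{N - N_h}{N} \sigma_h^2$$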
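The quality of the Taylor approximation for E(1/n_h) can be checked numerically against the hypergeometric distribution. The sketch below uses hypothetical sizes and conditions on n_h > 0, since 1/n_h is undefined when a stratum receives no sample units; for these values that event has negligible probability.

```r
N <- 1000; Nh <- 400; n <- 50     # hypothetical population, stratum, and sample sizes

x <- 1:min(n, Nh)                 # possible nonzero values of n_h
p <- dhyper(x, Nh, N - Nh, n)     # P(n_h = x) under simple random sampling
exact <- sum(p / x) / sum(p)      # E(1/n_h | n_h > 0)

E_nh   <- n * Nh / N
var_nh <- n * (Nh / N) * (1 - Nh / N) * (N - n) / (N - 1)
taylor <- 1 / E_nh + var_nh / E_nh^3

# Closed form of the same approximation, as written in the derivation
closed <- N / (n * Nh) + N^2 / (n^2 * Nh^2) * (N - Nh) / N * (N - n) / (N - 1)

c(exact = exact, taylor = taylor, closed = closed)
```

The two versions of the approximation agree to rounding error, and both sit slightly above 1/E(n_h), as expected for the expectation of a convex function.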
Calculations and a simulation for stratification will be illustrated using data from the
1997 aerial moose survey along the Yukon River corridor, Yukon-Charley Rivers
National Preserve, Alaska: Project report, November, 1997 (Burch and Demma,
1997). The survey was stratified into three strata and a stratified random sampling
design was used to estimate the number of moose in the study refuge. For units in
the sample, moose were counted from the air. For the purposes of our example, we
will assume that every moose in a sample plot is detected. In the actual survey,
a factor for sightability (detectability) was estimated and an additional adjustment
was made.
The population has L = 3 strata based on habitat type. The numbers of units in
each stratum are N1 = 122, N2 = 57, and N3 = 22. The sample sizes used were
n1 = 39, n2 = 38, and n3 = 21.
First, estimates are made using the stratified sample data.
# Read in the data, or enter them from the printout below.
moosedat <- read.table(file="http://www.stat.sfu.ca/~thompson/data/moosedata")
# You can see it by printing out the whole data structure:
moosedat
# Note there are two columns called "str" for stratum and "moose" for the
# total count of moose in each sample plot. The strata are labeled
# 1 = "low", 2 = "medium", and 3 = "high", representing habitat favorability.
# You can rename these for accessibility as follows:
stratum <- moosedat$str
y <- moosedat$moose
# The total sample size:
length(y)
# The stratum sample sizes:
table(stratum)
# Two simple ways to get stratum sample means and sample variances
# to use in calculations.
# First method: apply a function within each stratum using tapply.
?tapply
tapply(y,stratum,mean)
tapply(y,stratum,var)
# Second method: extract the data for one stratum at a time.
y1 <- y[stratum==1]
y1
length(y1)
mean(y1)
var(y1)
# With the second method, repeat for strata 2 and 3.
> moosedat
str moose
1 3 0
2 3 0
3 3 1
4 3 7
5 3 5
6 3 7
7 3 7
8 2 13
9 3 17
10 3 1
11 2 7
12 3 10
13 2 1
14 2 0
15 2 1
16 3 4
17 2 8
18 1 2
19 2 3
20 2 0
21 2 2
22 1 2
23 1 2
24 2 4
25 1 1
26 1 1
27 1 0
28 3 10
29 2 4
30 3 8
31 2 23
32 1 1
33 1 0
34 1 0
35 1 0
36 1 0
37 1 4
38 3 2
39 1 0
40 1 12
41 2 3
42 2 3
43 2 0
44 1 1
45 2 18
46 2 3
47 2 2
48 2 2
49 2 13
50 2 1
51 2 0
52 2 8
53 2 10
54 3 17
55 1 3
56 1 3
57 1 0
58 1 0
59 1 0
60 1 0
61 1 2
62 1 0
63 3 3
64 1 0
65 1 1
66 2 11
67 1 5
68 1 17
69 3 33
70 2 2
71 3 8
72 3 10
73 1 1
74 2 0
75 3 2
76 3 9
77 2 2
78 1 0
79 2 0
80 1 0
81 1 0
82 1 2
83 2 1
84 2 0
85 2 0
86 1 3
87 2 0
88 2 4
89 1 0
90 1 0
91 1 0
92 2 10
93 2 1
94 2 0
95 1 0
96 1 0
97 2 0
98 1 0
>
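The stratified estimate of the moose total can be sketched without network access by entering the counts from the printout above directly. This is a minimal illustration rather than the author's code: the variance estimate substitutes the sample variances s_h² for σ_h² in the formula for var(τ̂st), and the normal-based interval is shown only for simplicity (with n3 = 21, a t multiplier might be preferred).

```r
# Moose counts by stratum, transcribed from the printout above
y1 <- c(2, 2, 2, 1, 1, 0, 1, 0, 0, 0, 0, 4, 0, 12, 1, 3, 3, 0, 0, 0,
        0, 2, 0, 0, 1, 5, 17, 1, 0, 0, 0, 2, 3, 0, 0, 0, 0, 0, 0)
y2 <- c(13, 7, 1, 0, 1, 8, 3, 0, 2, 4, 4, 23, 3, 3, 0, 18, 3, 2, 2, 13,
        1, 0, 8, 10, 11, 2, 0, 2, 0, 1, 0, 0, 0, 4, 10, 1, 0, 0)
y3 <- c(0, 0, 1, 7, 5, 7, 7, 17, 1, 10, 4, 10, 8, 2, 17, 3, 33, 8, 10, 2, 9)

Nh <- c(122, 57, 22)          # stratum sizes given in the text
ys <- list(y1, y2, y3)
nh   <- sapply(ys, length)    # stratum sample sizes: 39, 38, 21
ybar <- sapply(ys, mean)      # stratum sample means
s2   <- sapply(ys, var)       # stratum sample variances

tauhat_st <- sum(Nh * ybar)                  # estimate of the population total
var_hat   <- sum(Nh * (Nh - nh) * s2 / nh)   # plug-in variance estimate
se        <- sqrt(var_hat)

tauhat_st
se
tauhat_st + c(-1, 1) * qnorm(0.975) * se     # approximate 95% interval
```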
A simulation like this can be used to compare a stratified design with a different
design such as simple random sampling from the whole population, or to study
the effect of changing sample size or allocation scheme. The simulation procedure
follows.
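# Assumes that b (the number of simulation runs), the stratum sizes N1, N2, N3,
# the sample sizes n1, n2, n3, and the stratum population vectors y1aug, y2aug,
# y3aug have already been defined.
tauhat <- numeric(b)  # storage for the estimate from each run
for (k in 1:b){
  s1 <- sample(N1, n1)
  tauhat1 <- N1 * mean(y1aug[s1])
  s2 <- sample(N2, n2)
  tauhat2 <- N2 * mean(y2aug[s2])
  s3 <- sample(N3, n3)
  tauhat3 <- N3 * mean(y3aug[s3])
  tauhat[k] <- tauhat1 + tauhat2 + tauhat3
}
hist(tauhat)
# Mean and variance of the simulated estimates:
mean(tauhat)
var(tauhat)
# The true population total:
tau <- sum(y1aug) + sum(y2aug) + sum(y3aug)
tau
# Mean square error and root mean square error of the estimator:
mean((tauhat-tau)^2)
sqrt(mean((tauhat-tau)^2))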
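The comparison with simple random sampling mentioned above can be sketched as a self-contained simulation. Everything about the population here is an assumption made for illustration: the counts are generated from Poisson distributions with hypothetical stratum means, and only the stratum sizes and sample sizes are taken from the moose example.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
Nh <- c(122, 57, 22)             # stratum sizes from the moose example
nh <- c(39, 38, 21)              # stratum sample sizes from the moose example
lambda <- c(1.5, 4, 8)           # HYPOTHETICAL stratum means for a synthetic population

# Synthetic population: one vector of counts per stratum
pop  <- lapply(1:3, function(h) rpois(Nh[h], lambda[h]))
yall <- unlist(pop)
N <- sum(Nh); n <- sum(nh)
tau <- sum(yall)                 # true total of the synthetic population

b <- 2000                        # number of simulation runs
tauhat_st  <- numeric(b)
tauhat_srs <- numeric(b)
for (k in 1:b) {
  # Stratified random sampling: independent simple random samples per stratum
  tauhat_st[k] <- sum(sapply(1:3, function(h)
    Nh[h] * mean(sample(pop[[h]], nh[h]))))
  # Simple random sampling of n units from the whole population
  tauhat_srs[k] <- N * mean(sample(yall, n))
}

mse_st  <- mean((tauhat_st - tau)^2)
mse_srs <- mean((tauhat_srs - tau)^2)
c(tau = tau, mean_st = mean(tauhat_st), mean_srs = mean(tauhat_srs))
c(rmse_st = sqrt(mse_st), rmse_srs = sqrt(mse_srs))
```

With stratum means this different, the stratified design should show a clearly smaller root mean square error than simple random sampling with the same total sample size; changing `nh` or `lambda` lets one explore other allocations and populations.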
EXERCISES
2. Allocate a total sample size of n = 100 between two strata having sizes $N_1 = 200$
and $N_2 = 300$ and variances $\sigma_1^2 = 81$ and $\sigma_2^2 = 16$ (a) using proportional
allocation and (b) using optimal allocation (assume equal costs).