
# 6 ONE-STAGE CLUSTER SAMPLING and SYSTEMATIC SAMPLING

In general, we want the target and study populations to be the same. When they are not
the same, the researcher must be careful to ensure that conclusions based on the sample
results can be applied to the target population.
Because of restrictions such as cost or scheduling conflicts, it is often impossible to collect
a simple random sample or a stratified simple random sample. In many cases, however,
it may be possible to define a sampling frame that does not correspond with the target
population or the study population and still obtain statistically valid estimates.
Cluster sampling and, specifically, systematic sampling are examples of designs in which
the sampling frame differs from the target population. Despite the difference, if
executed properly, conclusions based on the sample results from these sampling designs
can be applied to the target population.
Situation: A population contains M population units. The set of M units is partitioned
into N disjoint groups of population units called primary units. The population units
contained in the primary units are called secondary units.
The primary units may be of different sizes. That is, the numbers of secondary units in
the primary units need not all be the same.
Think of these disjoint groups of population units as strata. Suppose we define the
sampling frame as a set of strata. Then the sampling units in this sampling frame are not
individual units in the population. The sampling units are clusters of population units. In
this case, the sampling frame does not correspond with the units of the target population
or the study population.
Thus, whenever any secondary unit of a primary unit is included in the sample, all the
other secondary units in that primary unit will also be included in the sample.
That is, the primary units in the sampling frame are strata. The number of strata is
usually large while each stratum contains only a small number of population (secondary)
units. Note: the population has M individual units but the sampling frame has only N
primary sampling units, corresponding to the number of clusters (or strata) formed.
The responses from the secondary population units are not analyzed individually, but are
combined with all other secondary units that are in the same cluster. Therefore, there are
only N possible y values (not M). The researcher hopes that reducing the population of
size M to a sampling frame containing only N sampling units is offset by the practical
conveniences (such as reduced cost) that this type of sampling frame can offer.
6.1 One-Stage Cluster Sampling
What is the difference between this type of cluster sampling and stratified sampling?
In stratified sampling, we take a subset of population sampling units within each
stratum to form the sample.
In cluster sampling, we take a subset of strata as the primary sampling units.
When the strata themselves are the primary sampling units, the strata are called clusters.
The selection of a sample of clusters to provide a sample of population units is called
cluster sampling.
If all of the population units in every selected cluster are in the sample, then this is known
as one-stage cluster sampling.
When a cluster is defined as a group of population units, the clusters are called the
primary units. Subgroups within primary units are called secondary units. For
one-stage cluster sampling, the secondary units are the individual population units.
[Figure: a one-stage cluster sample with N = 50 primary units, each having 8 secondary units, in a population containing M = 400 secondary units.]
If the selection of the population units within every selected cluster is restricted a second
time, then this technique is known as subsampling or two-stage cluster sampling.
For example, we may take a SRS of secondary units within each primary unit. This will
be discussed later in Section 7 of the course notes.
If a sample of primary units (Stage 1) is selected, followed by a selection of secondary
units (Stage 2) within the sample of primary units, followed by a selection of tertiary units
(Stage 3) within the sample of secondary units, and so on, then the sampling procedure
is known as multistage cluster sampling.
In cluster sampling, the size of the cluster can also be used as an auxiliary variable to
select clusters with unequal sampling probabilities or used in a ratio estimator.
Stratified sampling vs cluster sampling:
A researcher will use a stratified sampling design because of its potential to produce
an efficient (less variable) estimator of a population characteristic. It will, in general,
be more expensive to collect data for a stratified sample than for a cluster sample.
A researcher will use cluster sampling because of its administrative convenience. That
is, cluster sampling can significantly reduce sampling costs at the expense of a less
efficient estimator of a population characteristic.
Notation used in one-stage cluster sampling:
$N$ = the number of clusters (primary units) in the population
$M_i$ = the number of secondary units in cluster $i$
$M = \sum_{i=1}^{N} M_i$ = the number of secondary units in the population
$y_{ij}$ = the y-value associated with secondary unit $j$ in cluster $i$
$y_i = \sum_{j=1}^{M_i} y_{ij}$ = cluster $i$ total
$\bar{y}_i = y_i / M_i$ = cluster $i$ mean
$s_u^2 = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}$ = the sample variance of cluster totals
$\tau = \sum_{i=1}^{N}\sum_{j=1}^{M_i} y_{ij} = \sum_{i=1}^{N} y_i$ = population total
$\mu = \dfrac{1}{M}\sum_{i=1}^{N}\sum_{j=1}^{M_i} y_{ij} = \dfrac{\tau}{M}$ = population mean of the secondary units
$\mu_1 = \dfrac{1}{N}\sum_{i=1}^{N} y_i$ = population mean of the cluster totals (mean of the primary units)
$\sigma_u^2 = \dfrac{\sum_{i=1}^{N} (y_i - \mu_1)^2}{N-1}$ = the population variance of cluster totals
6.2 Equal Sized Clusters
Suppose that each of the N clusters has the same number L of secondary units
($M_1 = M_2 = \cdots = M_N = L$). Then $M = NL$.
Suppose a SRS of n clusters (primary units) is taken. Then the total number of secondary
units selected is $m = nL$.
There is a total of $\binom{N}{n}$ possible one-stage cluster samples and each one has the same
probability of being selected.
Thus, the probability of selecting any particular one-stage cluster sample is $1 / \binom{N}{n}$.
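As a small illustration of the count above, the following sketch evaluates the number of possible one-stage cluster samples and the selection probability of any one of them. The values N = 100 and n = 8 are borrowed from the Figure 8a setup; the variable names are illustrative assumptions, not part of the notes.

```python
from math import comb

# Number of clusters in the population and in the sample
# (values borrowed from the Figure 8a example; names are illustrative).
N, n = 100, 8

n_samples = comb(N, n)      # number of possible one-stage cluster samples
p_sample = 1 / n_samples    # probability of selecting any one particular sample

print(n_samples, p_sample)
```

Because every sample of n clusters is equally likely, the design is a SRS at the primary unit level even though the secondary units are never selected individually.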
Figure 7a: Cluster Sampling Example for the Longleaf Pine Data
The total abundance = 584. There are M = 400 secondary units and N = 100 primary units
(clusters) of size $M_i = 4$.
1 1 1 1 1 2 1 0 0 0 4 5 0 1 0 1 2 1 0 1
3 2 1 0 1 0 0 0 1 2 2 2 0 2 2 2 0 2 0 1
7 4 1 1 1 1 0 0 0 2 2 0 4 3 2 4 2 1 2 2
0 1 2 0 0 0 0 0 4 6 5 1 5 0 0 0 2 1 2 0
1 1 0 2 3 2 0 0 2 1 3 1 4 1 1 1 2 2 1 1
2 0 0 0 4 3 3 0 1 16 5 0 1 3 8 0 0 1 3 3
0 0 1 14 3 3 1 2 0 8 0 2 0 3 9 0 4 2 1 0
0 0 5 1 8 7 6 6 6 1 0 4 0 0 1 2 2 0 1 2
0 0 2 2 3 2 2 3 1 1 1 3 0 0 2 2 0 3 4 0
0 0 0 0 1 0 3 1 1 1 2 0 2 0 2 0 2 1 1 0
1 8 7 7 8 0 5 0 1 0 1 2 0 0 2 4 2 2 2 4
0 9 1 0 0 1 1 1 0 0 0 1 2 4 0 2 1 3 3 1
0 0 0 1 0 2 4 3 1 2 2 0 0 1 1 2 2 0 2 4
0 1 0 0 1 2 0 2 3 5 2 0 0 2 1 1 2 0 1 3
1 0 0 1 1 0 0 0 2 2 2 1 1 1 0 0 2 0 0 0
0 2 0 2 2 0 1 1 0 2 0 0 1 0 0 1 1 1 5 3
0 0 0 3 2 1 0 0 0 0 0 2 1 0 1 1 1 3 1 2
1 0 0 1 0 3 0 1 0 0 2 1 2 0 0 0 1 1 1 0
0 0 0 0 0 0 0 1 1 1 0 1 0 3 0 2 0 1 1 0
2 0 0 0 0 0 0 0 1 2 0 1 3 0 0 1 0 1 2 4
Figure 7b: Cluster Sampling Example for a Spatially Correlated Population
The abundance counts show a strong diagonal spatial correlation. The total abundance = 13354. There are
M = 400 secondary units and N = 50 primary units (clusters) of size $M_i = 8$.
18 20 15 20 20 15 19 18 24 23 20 26 29 28 28 31 31 34 28 32
13 20 16 20 15 23 19 26 21 21 24 30 23 26 25 33 31 28 32 38
16 18 20 24 25 26 22 23 26 26 22 27 25 25 34 28 37 36 38 31
17 17 16 22 21 23 22 27 27 24 28 32 29 33 27 37 37 38 35 33
15 19 23 17 21 23 21 23 24 25 31 26 32 34 32 33 31 31 36 37
21 24 20 21 28 26 30 22 31 25 29 29 27 30 29 37 35 32 38 43
23 17 24 25 24 27 31 29 31 34 27 36 29 29 34 39 37 37 40 36
18 24 21 25 27 22 32 32 31 26 28 34 34 37 35 34 38 38 37 40
22 26 28 26 24 29 33 26 27 27 34 31 39 32 36 38 37 40 44 43
23 27 28 29 26 32 25 31 35 34 32 33 37 32 42 40 40 37 42 44
23 21 31 23 30 27 31 30 32 35 30 40 32 37 37 36 40 44 44 40
26 29 31 26 30 31 34 36 30 38 36 32 38 38 37 42 42 41 40 49
28 24 28 27 26 31 32 29 32 33 38 34 39 38 40 37 41 43 42 43
32 25 31 32 29 29 35 38 38 32 36 35 39 42 39 40 44 42 41 45
27 29 35 28 35 35 31 40 35 37 38 44 40 40 47 39 49 48 51 49
30 29 32 32 33 30 36 38 42 36 35 38 44 47 45 49 41 43 44 51
28 35 35 34 34 33 41 33 34 35 39 44 44 48 44 50 49 48 53 54
29 33 32 36 39 33 33 34 35 42 46 47 48 47 46 45 44 52 54 55
28 37 38 37 33 33 34 37 45 40 39 42 42 46 47 48 52 47 46 53
38 39 39 37 34 38 39 45 39 42 45 41 44 51 46 50 52 51 51 53
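6.2.1 Estimation of $\tau$, $\mu$, and $\mu_1$
The unbiased estimators of $\tau$ and $\mu$ are
$$\hat{\tau}_{cl} = \frac{M}{nL}\sum_{i=1}^{n}\sum_{j=1}^{L} y_{ij} = \frac{N}{n}\sum_{i=1}^{n}\sum_{j=1}^{L} y_{ij} = \frac{N}{n}\sum_{i=1}^{n} y_i = N\bar{y} \qquad (77)$$
$$\hat{\mu}_{cl} = \frac{1}{nL}\sum_{i=1}^{n}\sum_{j=1}^{L} y_{ij} = \frac{1}{nL}\sum_{i=1}^{n} y_i = \frac{\bar{y}}{L} = \frac{\hat{\tau}_{cl}}{M} \qquad (78)$$
where $\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n} y_i = \dfrac{\hat{\tau}_{cl}}{N}$ is the sample mean of the cluster totals.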
Next, we want to study the variances of these estimators:
$$var(\hat{\tau}_{cl}) = N(N-n)\frac{\sigma_u^2}{n} \qquad var(\hat{\mu}_{cl}) = \frac{N(N-n)}{M^2}\frac{\sigma_u^2}{n} \qquad (79)$$
where $\sigma_u^2 = \dfrac{\sum_{i=1}^{N}(y_i - \mu_1)^2}{N-1}$ is the variance of the N cluster totals $y_i$. Taking a square
root of the true variances in (79) yields the standard deviations of the estimators.
Because $\sigma_u^2$ is unknown, we use the sample variance of the cluster totals,
$s_u^2 = \dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}$, to get unbiased estimators of the variances:
$$\widehat{var}(\hat{\tau}_{cl}) = N(N-n)\frac{s_u^2}{n} \qquad \widehat{var}(\hat{\mu}_{cl}) = \frac{N(N-n)}{M^2}\frac{s_u^2}{n} \qquad (80)$$
Taking the square root of the estimated variances in (80) yields the standard errors of
the estimators.
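An unbiased estimator of the mean per primary unit $\mu_1$ is $\hat{\mu}_1 = \bar{y} = \dfrac{\hat{\tau}_{cl}}{N}$.
The variance of $\hat{\mu}_1$ is $var(\hat{\mu}_1) = \dfrac{1}{N^2}\,var(\hat{\tau}_{cl})$, with the estimated variance being obtained
by dividing the estimated variance of $\hat{\tau}_{cl}$ in (80) by $N^2$. That is,
$\widehat{var}(\hat{\mu}_1) = \left(\dfrac{N-n}{N}\right)\dfrac{s_u^2}{n}$.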
6.2.2 Confidence Intervals for $\tau$ and $\mu$
The confidence intervals for $\tau$ and $\mu$ are:
$$\hat{\tau}_{cl} \pm t^{*}\sqrt{\widehat{var}(\hat{\tau}_{cl})} \qquad \hat{\mu}_{cl} \pm t^{*}\sqrt{\widehat{var}(\hat{\mu}_{cl})} \qquad (81)$$
where $t^{*}$ is the upper $\alpha/2$ critical value from the $t(n-1)$ distribution. Note that the
degrees of freedom are based on n, the number of primary units or sampled clusters (and
not on the total number of secondary units $m = nL$).
Figure 8a: Cluster Sampling Example for the Longleaf Pine Data
The total abundance = 584. There are M = 400 secondary units and N = 100 primary units
(clusters) of size L = 4. The sample contains n = 8 clusters. The sampled secondary units are shown in parentheses.
1 (1) 1 1 1 2 1 0 0 0 4 5 0 1 0 1 2 1 0 1
3 (2) 1 0 1 0 0 0 1 2 2 2 0 2 2 2 0 2 0 1
7 (4) 1 1 1 1 0 0 0 2 2 0 4 3 2 4 2 1 2 2
0 (1) 2 0 0 0 0 0 4 6 5 1 5 0 0 0 2 1 2 0
1 1 0 2 3 2 0 (0) 2 (1) 3 1 4 1 1 1 2 2 1 1
2 0 0 0 4 3 3 (0) 1 (16) 5 0 1 3 8 0 0 1 3 3
0 0 1 14 3 3 1 (2) 0 (8) 0 2 0 3 9 0 4 2 1 0
0 0 5 1 8 7 6 (6) 6 (1) 0 4 0 0 1 2 2 0 1 2
0 0 2 (2) 3 2 2 3 1 1 1 3 0 0 (2) 2 0 3 4 0
0 0 0 (0) 1 0 3 1 1 1 2 0 2 0 (2) 0 2 1 1 0
1 8 7 (7) 8 0 5 0 1 0 1 2 0 0 (2) 4 2 2 2 4
0 9 1 (0) 0 1 1 1 0 0 0 1 2 4 (0) 2 1 3 3 1
0 0 0 1 0 (2) 4 3 1 2 2 0 0 1 1 (2) 2 (0) 2 4
0 1 0 0 1 (2) 0 2 3 5 2 0 0 2 1 (1) 2 (0) 1 3
1 0 0 1 1 (0) 0 0 2 2 2 1 1 1 0 (0) 2 (0) 0 0
0 2 0 2 2 (0) 1 1 0 2 0 0 1 0 0 (1) 1 (1) 5 3
0 0 0 3 2 1 0 0 0 0 0 2 1 0 1 1 1 3 1 2
1 0 0 1 0 3 0 1 0 0 2 1 2 0 0 0 1 1 1 0
0 0 0 0 0 0 0 1 1 1 0 1 0 3 0 2 0 1 1 0
2 0 0 0 0 0 0 0 1 2 0 1 3 0 0 1 0 1 2 4
Figure 8b: Cluster Sampling Example for a Spatially Correlated Population
The abundance counts show a strong diagonal spatial correlation. The total abundance = 13354. There are
M = 400 secondary units and N = 100 primary units (clusters) of size L = 4. The sample contains n = 10
clusters. The secondary units sampled are in ( )
18 20 15 20 (20) 15 19 18 24 23 20 26 29 28 28 31 (31) 34 28 32
13 20 16 20 (15) 23 19 26 21 21 24 30 23 26 25 33 (31) 28 32 38
16 18 20 24 (25) 26 22 23 26 26 22 27 25 25 34 28 (37) 36 38 31
17 17 16 22 (21) 23 22 27 27 24 28 32 29 33 27 37 (37) 38 35 33
15 (19) 23 17 21 23 21 23 24 25 31 26 32 34 32 33 31 31 36 37
21 (24) 20 21 28 26 30 22 31 25 29 29 27 30 29 37 35 32 38 43
23 (17) 24 25 24 27 31 29 31 34 27 36 29 29 34 39 37 37 40 36
18 (24) 21 25 27 22 32 32 31 26 28 34 34 37 35 34 38 38 37 40
22 26 28 26 24 29 (33) 26 27 27 34 31 39 (32) 36 38 37 40 44 43
23 27 28 29 26 32 (25) 31 35 34 32 33 37 (32) 42 40 40 37 42 44
23 21 31 23 30 27 (31) 30 32 35 30 40 32 (37) 37 36 40 44 44 40
26 29 31 26 30 31 (34) 36 30 38 36 32 38 (38) 37 42 42 41 40 49
28 24 28 27 26 31 32 (29) 32 33 38 34 39 38 40 37 (41) 43 42 43
32 25 31 32 29 29 35 (38) 38 32 36 35 39 42 39 40 (44) 42 41 45
27 29 35 28 35 35 31 (40) 35 37 38 44 40 40 47 39 (49) 48 51 49
30 29 32 32 33 30 36 (38) 42 36 35 38 44 47 45 49 (41) 43 44 51
28 35 35 34 (34) 33 41 33 34 35 39 44 44 48 44 (50) (49) 48 53 54
29 33 32 36 (39) 33 33 34 35 42 46 47 48 47 46 (45) (44) 52 54 55
28 37 38 37 (33) 33 34 37 45 40 39 42 42 46 47 (48) (52) 47 46 53
38 39 39 37 (34) 38 39 45 39 42 45 41 44 51 46 (50) (52) 51 51 53
6.2.3 Comparison to Simple Random Sampling
Because the variance formulas for $\hat{\tau}_{cl}$ and $\hat{\mu}_{cl}$ in (79) are determined only from the
cluster-to-cluster variability, the precision of the estimators can be improved by forming clusters
with small cluster-to-cluster variability.
Equivalently, we want to form clusters such that the y-values within each cluster are as
variable as possible but the $y_i$ values across clusters are as similar as possible.
We will compare $var(\hat{\tau})$ from a SRS to $var(\hat{\tau}_{cl})$ from a one-stage cluster sample.
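Because $\sigma^2 = \dfrac{1}{NL-1}\sum_{i=1}^{N}\sum_{j=1}^{L}(y_{ij} - \mu)^2$, we have
$$\begin{aligned}
(NL-1)\,\sigma^2 &= \sum_{i=1}^{N}\sum_{j=1}^{L}(y_{ij} - \mu)^2 = \sum_{i=1}^{N}\sum_{j=1}^{L}(y_{ij} - \bar{y}_i + \bar{y}_i - \mu)^2 \\
&= \sum_{i=1}^{N}\sum_{j=1}^{L}(y_{ij} - \bar{y}_i)^2 + L\sum_{i=1}^{N}(\bar{y}_i - \mu)^2 \\
&= N(L-1)\,\bar{\sigma}^2 + L\sum_{i=1}^{N}(\bar{y}_i - \mu)^2 \qquad (82)
\end{aligned}$$
where $\bar{\sigma}^2 = \dfrac{1}{N}\sum_{i=1}^{N}\sigma_i^2$ is the average within-cluster variance and
$\sigma_i^2 = \dfrac{1}{L-1}\sum_{j=1}^{L}(y_{ij} - \bar{y}_i)^2$ is the variance within cluster $i$. The cross-product term
vanishes because the deviations $y_{ij} - \bar{y}_i$ sum to zero within each cluster.
The sum in (82) is a weighted sum of within-cluster and cluster-to-cluster variabilities.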
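Let $\hat{\tau}$ be the estimator of $\tau$ from a SRS of $nL$ secondary units (see Section 2 of the course notes). We use (82)
to compare the variance $var(\hat{\tau})$ of the SRS estimator and the variance of the one-stage
cluster sample $var(\hat{\tau}_{cl})$. After simplification, we get:
$$var(\hat{\tau}) - var(\hat{\tau}_{cl}) = \frac{N^2 L\,(N-n)(L-1)}{n(N-1)}\left(\bar{\sigma}^2 - \sigma^2\right) \qquad (83)$$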
If $var(\hat{\tau}) - var(\hat{\tau}_{cl}) > 0$ (or, if $\bar{\sigma}^2 > \sigma^2$), then we say that $\hat{\tau}_{cl}$ is more efficient than $\hat{\tau}$ for
estimating $\tau$.
This result is also true for estimation of $\mu$. That is, if $var(\hat{\mu}) - var(\hat{\mu}_{cl}) > 0$, then the
one-stage cluster sample estimator $\hat{\mu}_{cl}$ would be more efficient than the SRS estimator $\hat{\mu}$ for
estimating $\mu$.
Practically speaking, the one-stage cluster sample estimator will be more efficient than
the SRS estimator of $\tau$ or $\mu$ if the average within-cluster variability ($\bar{\sigma}^2$) is larger than the
population variance ($\sigma^2$).
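The comparison can be checked numerically on a toy population. The sketch below is only an illustration: the four clusters of three units are hypothetical values chosen so that each cluster spans most of the range of y, and the SRS is assumed to contain the same number nL of secondary units as the cluster sample.

```python
from statistics import mean

# Hypothetical population: N = 4 clusters of L = 3 secondary units each.
clusters = [[1, 5, 9], [2, 6, 7], [3, 4, 8], [2, 6, 8]]
N, L, n = len(clusters), len(clusters[0]), 2
M = N * L

units = [y for c in clusters for y in c]
mu = mean(units)

# Population variance of all secondary units (divisor NL - 1).
sigma2 = sum((y - mu) ** 2 for y in units) / (M - 1)

# Average within-cluster variance (each cluster variance uses divisor L - 1).
sigma2_bar = mean(
    [sum((y - mean(c)) ** 2 for y in c) / (L - 1) for c in clusters]
)

# Variance of the cluster totals (divisor N - 1), as used in (79).
totals = [sum(c) for c in clusters]
mu1 = mean(totals)
sigma2_u = sum((t - mu1) ** 2 for t in totals) / (N - 1)

var_cl = N * (N - n) * sigma2_u / n                   # one-stage cluster sample
var_srs = M**2 * (1 - n * L / M) * sigma2 / (n * L)   # SRS of nL secondary units

# Cluster sampling wins exactly when the within-cluster variance is the larger one.
print(sigma2_bar > sigma2, var_srs > var_cl)
```

Rearranging the same twelve values so that similar values share a cluster (for example sorting them and cutting into consecutive groups of three) reverses both comparisons, which is the situation in which cluster sampling loses precision.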
6.3 Relationship between Cluster Sampling and Systematic Sampling
Systematic sampling is a sampling plan in which the sample population units are
collected systematically throughout the population. More specifically, a single primary unit
consists of secondary units that are spaced in some systematic pattern throughout the
population.
Suppose the study area is partitioned into a 20 × 20 grid of 400 population units. A
systematic sample primary unit could consist of all population units that form a lattice
which are 5 units apart horizontally and vertically. In Figure 9a, N = 25 and L = 16. In
Figure 9b, each of the N = 50 primary units contains L = 8 secondary units.
Initially, systematic sampling and cluster sampling appear to be opposites because sys-
tematic samples contain secondary units that are spread throughout the population (good
global coverage of the study area) while cluster samples are collected in groups of close
proximity (good coverage locally within the study area).
Systematic and cluster sampling are similar, however, because whenever a primary unit
is selected from the sampling frame, all secondary units of that primary unit will be
included in the sample. Thus, random selection occurs at the primary unit level and not
the secondary unit level.
For estimation purposes, you could ignore the secondary unit $y_{ij}$-values and only retain
the primary unit $y_i$-values. This is what we did with one-stage cluster sampling.
The systematic and cluster sampling principle: To obtain estimators of low variance,
the population must be partitioned into primary unit clusters in such a way that the
clusters are similar to each other with respect to the $y_i$-values (small cluster-to-cluster
variability).
This is equivalent to saying that the within-cluster variability should be as large as possible
to obtain the most precise estimators. Thus, the ideal primary unit is representative of
the full diversity of $y_{ij}$-values within the population.
With natural populations of spatially distributed plants, animals, minerals, etc., these
conditions are typically satisfied by systematic primary units (and are not satisfied by
primary units with spatially clustered secondary units).
6.4 Using Proc Surveymeans for One-Stage Cluster Samples
To use Proc Surveymeans to analyze data from a one-stage cluster sample with the goal
of estimating $\mu$ or $\tau$, we need to include a Cluster statement followed by a cluster label.
In the first example, the clusters are labeled _cluster.
The value following total = is the number of primary units in the population.
The appropriate weight to use in the weight statement to get the correct estimates for
$\tau$ is M/(nL).
Analysis of the One-Stage Cluster Sample in Figure 8a
data Clus_8a;
wgt= 400/(8*4); * wgt = M/(n*L) ;
input _cluster trees @@;
datalines;
1 1 1 2 1 4 1 1 2 2 2 0 2 7 2 0
3 2 3 2 3 0 3 0 4 0 4 0 4 2 4 6
5 1 5 16 5 8 5 1 6 2 6 2 6 2 6 0
7 2 7 1 7 0 7 1 8 0 8 0 8 0 8 1
;
proc surveymeans data=Clus_8a total=100 mean clm sum clsum;
var trees;
cluster _cluster;
weight wgt;
title1 One-Stage Cluster Sample in Figure 8a --- Estimating mu and tau;
run;
=========================================================================
One-Stage Cluster Sample in Figure 8a -- Estimating mu and tau
The SURVEYMEANS Procedure
Data Summary
Number of Clusters 8
Number of Observations 32
Sum of Weights 400
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
-----------------------------------------------------------------
trees 2.062500 0.648436 0.52919341 3.59580659
-----------------------------------------------------------------
Variable Sum Std Dev 95% CL for Sum
-----------------------------------------------------------------
trees 825.000000 259.374247 211.677365 1438.32263
-----------------------------------------------------------------
Analysis of the One-Stage Cluster Sample in Figure 8b
data Clus_8b;
wgt= 400/(10*4); * wgt = M/(nL) ;
do _cluster = 1 to 10;
do sec_unit = 1 to 4;
input count @@; output;
end; end;
datalines;
19 24 17 24 20 15 25 21 34 39 33 34 33 25 31 34 29 38 40 38
32 32 37 38 50 45 48 50 31 31 37 37 41 44 49 41 49 44 52 52
;
proc surveymeans data=Clus_8b total=100 mean clm sum clsum;
var count;
cluster _cluster;
weight wgt;
title1 One-Stage Cluster Sample in Figure 8b -- Estimating mu and tau;
run;
========================================================================
One-Stage Cluster Sample in Figure 8b -- Estimating mu and tau
The SURVEYMEANS Procedure
Data Summary
Number of Clusters 10
Number of Observations 40
Sum of Weights 400
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
-----------------------------------------------------------------
count 35.325000 2.980573 28.5824765 42.0675235
-----------------------------------------------------------------
Variable Sum Std Dev 95% CL for Sum
-----------------------------------------------------------------
count 14130 1192.229005 11432.9906 16827.0094
-----------------------------------------------------------------
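The Figure 8a output can be cross-checked by applying equations (77), (78), and (80) directly. The sketch below is an illustration rather than a replacement for Proc Surveymeans: the cluster totals are the sums of the four counts listed for each cluster in the Clus_8a data step, and the critical value is taken from a t table instead of being computed.

```python
from math import sqrt

# Cluster totals y_i for the n = 8 sampled clusters of Figure 8a
# (each total is the sum of the 4 counts listed for that cluster in the data step).
totals = [8, 9, 4, 8, 26, 6, 4, 1]
N, L = 100, 4          # clusters in the population, secondary units per cluster
n = len(totals)
M = N * L

ybar = sum(totals) / n                                  # mean of the cluster totals
s2_u = sum((y - ybar) ** 2 for y in totals) / (n - 1)   # sample variance of the totals

tau_hat = N * ybar                       # equation (77)
mu_hat = tau_hat / M                     # equation (78)
se_tau = sqrt(N * (N - n) * s2_u / n)    # square root of (80)
se_mu = se_tau / M

# Upper 0.025 critical value of t(7), taken from a t table (assumed, not computed).
t_star = 2.3646
ci_tau = (tau_hat - t_star * se_tau, tau_hat + t_star * se_tau)

print(tau_hat, round(se_tau, 3), mu_hat, round(se_mu, 4), ci_tau)
```

The point estimates and standard errors agree with the Sum, Std Dev, Mean, and Std Error of Mean columns above; the interval endpoints differ from the SAS limits only through the rounding of the tabled critical value.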
6.5 Systematic Sampling
If a systematic sample is selected using simple random sampling to select the
systematic primary units, we can apply the estimation results for cluster sampling to define (i)
estimators, (ii) the variance of each estimator, and (iii) the estimated variance of each
estimator.
The formulas we are about to introduce will be the same as those used for one-stage cluster
sampling. The subscript sys denotes the fact that data were collected under systematic
sampling.
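6.5.1 Estimation of $\tau$ and $\mu$
The unbiased estimators of $\tau$ and $\mu$ are:
$$\hat{\tau}_{sys} = \frac{N}{n}\sum_{i=1}^{n} y_i = N\bar{y} \qquad \hat{\mu}_{sys} = \frac{1}{nL}\sum_{i=1}^{n} y_i = \frac{\bar{y}}{L} = \frac{\hat{\tau}_{sys}}{M} \qquad (84)$$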
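with variance
$$var(\hat{\tau}_{sys}) = N(N-n)\frac{\sigma_u^2}{n} \qquad var(\hat{\mu}_{sys}) = \frac{N(N-n)}{M^2}\frac{\sigma_u^2}{n} \qquad (85)$$
where $\sigma_u^2 = \dfrac{\sum_{i=1}^{N}(y_i - \mu_1)^2}{N-1}$.
Recall that $\bar{y} = \dfrac{1}{n}\sum_{i=1}^{n} y_i$ is the sample mean
and that $s_u^2 = \dfrac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}$ is the sample variance of the primary units.
Because $\sigma_u^2$ is unknown, we use $s_u^2$ to get unbiased estimators of the variances:
$$\widehat{var}(\hat{\tau}_{sys}) = N(N-n)\frac{s_u^2}{n} \qquad \widehat{var}(\hat{\mu}_{sys}) = \frac{N(N-n)}{M^2}\frac{s_u^2}{n} \qquad (86)$$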
6.5.2 Confidence Intervals for $\tau$ and $\mu$
For a relatively small number n of sampled primary units, the following confidence intervals
are recommended:
$$\hat{\tau}_{sys} \pm t^{*}\sqrt{\widehat{var}(\hat{\tau}_{sys})} \qquad \hat{\mu}_{sys} \pm t^{*}\sqrt{\widehat{var}(\hat{\mu}_{sys})} \qquad (87)$$
where $t^{*}$ is the upper $\alpha/2$ critical value from the $t(n-1)$ distribution. Note that the
degrees of freedom are based on n, the number of sampled primary units, and not on the
total number of secondary units nL.
Systematic Sampling Examples
In Figure 9a, each of the 25 primary units contains the 16 secondary units corresponding to the
same location within the 16 5x5 subregions. n = 3 primary units were sampled. In Figure 9b,
each of the 50 primary units contains the 8 secondary units corresponding to the same location
within the 8 10x5 subregions. n = 6 primary units were sampled.
Figure 9a
1 1 (1) 1 1 2 1 (0) 0 0 4 5 (0) 1 0 1 2 (1) 0 1
3 2 1 0 1 0 0 0 1 2 2 2 0 2 2 2 0 2 0 1
7 (4) 1 1 1 1 (0) 0 0 2 2 (0) 4 3 2 4 (2) 1 2 2
0 1 2 0 0 0 0 0 4 6 5 1 5 0 0 0 2 1 2 0
1 1 0 (2) 3 2 0 0 (2) 1 3 1 4 (1) 1 1 2 2 (1) 1
2 0 (0) 0 4 3 3 (0) 1 16 5 0 (1) 3 8 0 0 (1) 3 3
0 0 1 14 3 3 1 2 0 8 0 2 0 3 9 0 4 2 1 0
0 (0) 5 1 8 7 (6) 6 6 1 0 (4) 0 0 1 2 (2) 0 1 2
0 0 2 2 3 2 2 3 1 1 1 3 0 0 2 2 0 3 4 0
0 0 0 (0) 1 0 3 1 (1) 1 2 0 2 (0) 2 0 2 1 (1) 0
1 8 (7) 7 8 0 5 (0) 1 0 1 2 (0) 0 2 4 2 (2) 2 4
0 9 1 0 0 1 1 1 0 0 0 1 2 4 0 2 1 3 3 1
0 (0) 0 1 0 2 (4) 3 1 2 2 (0) 0 1 1 2 (2) 0 2 4
0 1 0 0 1 2 0 2 3 5 2 0 0 2 1 1 2 0 1 3
1 0 0 (1) 1 0 0 0 (2) 2 2 1 1 (1) 0 0 2 0 (0) 0
0 2 (0) 2 2 0 1 (1) 0 2 0 0 (1) 0 0 1 1 (1) 5 3
0 0 0 3 2 1 0 0 0 0 0 2 1 0 1 1 1 3 1 2
1 (0) 0 1 0 3 (0) 1 0 0 2 (1) 2 0 0 0 (1) 1 1 0
0 0 0 0 0 0 0 1 1 1 0 1 0 3 0 2 0 1 1 0
2 0 0 (0) 0 0 0 0 (1) 2 0 1 3 (0) 0 1 0 1 (2) 4
Figure 9b
18 (20) 15 20 20 15 (19) 18 24 23 20 (26) 29 28 28 31 (31) 34 28 32
13 20 16 20 15 23 19 26 21 21 24 30 23 26 25 33 31 28 32 38
(16) 18 20 24 (25) (26) 22 23 26 (26) (22) 27 25 25 (34) (28) 37 36 38 (31)
17 17 16 22 21 23 22 27 27 24 28 32 29 33 27 37 37 38 35 33
15 19 23 17 21 23 21 23 24 25 31 26 32 34 32 33 31 31 36 37
21 (24) 20 21 28 26 (30) 22 31 25 29 (29) 27 30 29 37 (35) 32 38 43
23 17 24 25 24 27 31 29 31 34 27 36 29 29 34 39 37 37 40 36
(18) 24 21 25 27 (22) 32 32 31 26 (28) 34 34 37 35 (34) 38 38 37 40
22 26 28 (26) 24 29 33 26 (27) 27 34 31 39 (32) 36 38 37 40 (44) 43
23 27 28 29 26 32 25 31 35 34 32 33 37 32 42 40 40 37 42 44
23 (21) 31 23 30 27 (31) 30 32 35 30 (40) 32 37 37 36 (40) 44 44 40
26 29 31 26 30 31 34 36 30 38 36 32 38 38 37 42 42 41 40 49
(28) 24 28 27 (26) (31) 32 29 32 (33) (38) 34 39 38 (40) (37) 41 43 42 (43)
32 25 31 32 29 29 35 38 38 32 36 35 39 42 39 40 44 42 41 45
27 29 35 28 35 35 31 40 35 37 38 44 40 40 47 39 49 48 51 49
30 (29) 32 32 33 30 (36) 38 42 36 35 (38) 44 47 45 49 (41) 43 44 51
28 35 35 34 34 33 41 33 34 35 39 44 44 48 44 50 49 48 53 54
(29) 33 32 36 39 (33) 33 34 35 42 (46) 47 48 47 46 (45) 44 52 54 55
28 37 38 (37) 33 33 34 37 (45) 40 39 42 42 (46) 47 48 52 47 (46) 53
38 39 39 37 34 38 39 45 39 42 45 41 44 51 46 50 52 51 51 53
Advantages of systematic sampling:
Intuitively, systematic sampling seems likely to be more precise than simple
random sampling. In effect, it stratifies the population into [N] strata, which
consist of the first [L] units, the second [L] units, and so on. We might
therefore expect the systematic sample to be about as precise as the corresponding
stratified random sample with one unit per stratum. The difference is that
with the systematic sample the units all occur at the same relative position
in the stratum, whereas with the stratified random sample the position in the
stratum is determined separately by randomization within each stratum. The
systematic sample is spread more evenly over the population, and this fact has
sometimes made systematic sampling considerably more precise than stratified
random sampling.
Cochran also warns us that:
The performance of systematic sampling relative to that of stratified or simple
random sampling is greatly dependent on the properties of the population. There
are populations for which systematic sampling is extremely precise and others for
which it is less precise than simple random sampling. For some populations and
values of [L], [$var(\hat{\mu}_{sys})$] may even increase when a larger sample is taken,
a startling departure from good behavior. Thus it is difficult to give general
advice about the situation in which systematic sampling is to be recommended. A
knowledge of the structure of the population is necessary for its most effective
use.
If a population contains a linear trend:
1. The variances of the estimators from systematic and stratified sampling will be smaller
than the variance of the estimator from simple random sampling.
2. The variance of the estimator from systematic sampling will be larger than the
variance of the estimator from stratified sampling. Why? If the starting point of the
systematic sample is selected too low or too high, it will be too low or too high
across the population of units. Whereas, stratified sampling gives an opportunity for
within-stratum errors to cancel.
For example: Suppose a population has 16 secondary units ($\tau$ = 130) and is ordered as
follows:
Sampling unit   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
y-value         1  2  2  3  3  4  5  6  8  9 12 13 14 15 16 17
Note there is a linearly increasing trend in the y-values with the order of the sampling
units. Suppose we take a 1-in-4 systematic sample. The following table summarizes the
four possible 1-in-4 systematic samples.
Sampling unit   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16   $\hat{\tau}_{sys}$
y-values        1  2  2  3  3  4  5  6  8  9 12 13 14 15 16 17
Sample 1        1           3           8          14            104
Sample 2           2           4           9          15         120
Sample 3              2           5          12          16      140
Sample 4                 3           6          13          17   156
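The table can be reproduced by enumerating the four possible starting positions. The sketch below is an illustration under the assumption that exactly one of the N = 4 systematic primary units is selected at random; the variable names are not from the notes.

```python
# Ordered population from the example: 16 secondary units with total tau = 130.
y = [1, 2, 2, 3, 3, 4, 5, 6, 8, 9, 12, 13, 14, 15, 16, 17]
k = 4                       # sampling interval for a 1-in-4 systematic sample
tau = sum(y)

# Each starting position r defines one primary unit: y[r], y[r + k], y[r + 2k], ...
samples = [y[r::k] for r in range(k)]

# With n = 1 primary unit sampled out of N = k, tau_hat = N * (primary unit total).
estimates = [k * sum(s) for s in samples]

# The k estimates are equally likely, so expectation and variance are plain averages.
expected = sum(estimates) / k
variance = sum((e - tau) ** 2 for e in estimates) / k

print(estimates, expected, variance)
```

The estimates average to the true total, so the estimator is unbiased, but every estimate is either uniformly low or uniformly high because the starting position shifts all four sampled values in the same direction along the trend.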
If a population has periodic trends, the effectiveness of the systematic sample depends
on the relationship between the periodic interval and the systematic sampling interval or
pattern. The following idealized curve was given by Cochran to show this. The height of
the curve represents the population y-value.
The sample points A represent the least favorable systematic sample because when-
ever L is equal to the period, every observation in the systematic sample will be the
same so the sample is no more precise than a single observation taken at random
from the population.
The sample points B represent the most favorable systematic sample because L is
equal to a half-period. Every systematic sample has mean exactly equal to the true
population mean because successive y-value deviations above and below the mean
cancel. Thus, the variance of the estimator is zero.
For other values of L, the sample has varying degrees of effectiveness that depend
on the relation between L and the period.
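The two extreme cases can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is a hypothetical sine curve with period 8 over 80 units, the sampling interval is called k in the code, and the design variance is computed by averaging over all k equally likely starting positions.

```python
from math import sin, pi

period = 8
M = 80                                             # hypothetical population size
y = [sin(2 * pi * t / period) for t in range(M)]   # idealized periodic population
mu = sum(y) / M

def systematic_means(values, k):
    """Return the sample mean of each of the k possible 1-in-k systematic samples."""
    return [sum(values[r::k]) / len(values[r::k]) for r in range(k)]

def design_variance(means, center):
    """Average squared deviation of the equally likely sample means from center."""
    return sum((m - center) ** 2 for m in means) / len(means)

# Case A: interval equal to the period, so every unit in a sample has the same y-value.
var_full_period = design_variance(systematic_means(y, period), mu)

# Case B: interval equal to a half-period, so deviations above and below the mean cancel.
var_half_period = design_variance(systematic_means(y, period // 2), mu)

print(var_full_period, var_half_period)
```

In case A the variance of the sample mean equals the variance of a single randomly chosen unit, while in case B it is zero up to floating-point rounding.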
6.6 Using a Single Systematic Sample
Many studies generate data from a systematic sample based on a single randomly selected
starting unit (i.e., there is only one randomly selected primary unit).
When there is only one primary unit, it is possible to get unbiased estimators $\hat{\mu}_{sys}$ and
$\hat{\tau}_{sys}$ of $\mu$ and $\tau$. It is not possible, however, to get an unbiased estimator of the variances
$var(\hat{\mu}_{sys})$ and $var(\hat{\tau}_{sys})$.
If we can ignore the fact that the $y_{ij}$-values were collected systematically and treat the L
secondary units in the single primary unit as a SRS, then the SRS variance estimator would
be a reasonable substitute only if the units of the population can reasonably be conceived
as being randomly ordered (i.e., there is no systematic pattern in the population such as
a linear trend or a periodic pattern).
If this assumption is reasonable, then
$$\widehat{Var}(\hat{\mu}_{sys}) \approx \widehat{Var}(\hat{\mu}) = \left(\frac{N-n}{N}\right)\frac{s^2}{n}$$
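As a hedged illustration of this approximation, the sketch below treats the first systematic primary unit listed in the Figure 9a data step (Section 6.7) as if it were a SRS. It assumes that N and n in the formula play their SRS roles, that is, the number of units in the population (400) and the number of units in the sample (16); that reading is an interpretation and not something the formula states explicitly.

```python
# Counts for the first systematic primary unit listed in the Figure 9a data step.
sample = [1, 0, 0, 1, 0, 0, 1, 1, 7, 0, 0, 2, 0, 1, 1, 1]

# Assumption: N and n in the formula are the SRS quantities, i.e. the number of
# population units (400) and the number of units in the sample (16).
N_units = 400
n_units = len(sample)

ybar = sum(sample) / n_units
s2 = sum((y - ybar) ** 2 for y in sample) / (n_units - 1)   # SRS sample variance

mu_hat = ybar
var_mu_hat = ((N_units - n_units) / N_units) * s2 / n_units

print(mu_hat, var_mu_hat)
```

The result is only as trustworthy as the random-ordering assumption; with a trend or periodic pattern it should not be interpreted as an estimate of the true design variance.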
With natural populations in which nearby units are similar to each other (spatial
correlation), this procedure tends to provide overestimates of the variances of $\hat{\mu}_{sys}$ and $\hat{\tau}_{sys}$.
Procedures for estimating variances from a single systematic sample are discussed in
Bellhouse (1988), Murthy and Rao (1988), and Wolter (1984).
6.7 Using Proc Surveymeans for Systematic Samples
To use Proc Surveymeans to analyze data from a systematic sample with the goal of
estimating $\mu$ or $\tau$, we need to include a Cluster statement followed by a label representing
the units in the systematic sample. In the first example, the label is start_pt.
The value following total = is the number of starting points for a primary unit in a
systematic sample. That is, it is the number of primary units in the population.
The appropriate weight to use in the weight statement to get the correct estimates for
$\tau$ is M/(nL).
Analysis of the Systematic Sample in Figure 9a
data Sys_9a;
wgt= 400 /(3*16); * wgt = M/(nL) ;
do start_pt = 1 to 3;
do sec_unit = 1 to 16;
input count @@; output;
end; end;
datalines;
1 0 0 1 0 0 1 1 7 0 0 2 0 1 1 1
4 0 0 2 0 6 4 2 0 4 0 2 0 0 1 1
2 2 1 1 0 1 0 1 1 2 1 0 0 1 0 2
;
proc surveymeans data=Sys_9a total=25 mean clm sum clsum;
var count;
cluster start_pt;
weight wgt;
title1 Systematic Sample in Figure 9a --- Estimating mu and tau;
run;
==================================================================
Systematic Sample in Figure 9a --- Estimating mu and tau
The SURVEYMEANS Procedure
Data Summary
Number of Clusters 3
Number of Observations 48
Sum of Weights 400
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
-----------------------------------------------------------------
count 1.187500 0.205902 0.30157311 2.07342689
-----------------------------------------------------------------
Variable Sum Std Dev 95% CL for Sum
-----------------------------------------------------------------
count 475.000000 82.360994 120.629244 829.370756
-----------------------------------------------------------------
Analysis of the Systematic Sample in Figure 9b
data Sys_9b;
wgt= 400 /(6*8); * wgt = M/(nL) ;
do start_pt = 1 to 6;
do sec_unit = 1 to 8;
input count @@; output;
end; end;
datalines;
20 19 26 31 21 31 40 40
16 26 22 28 28 31 38 37
25 26 34 31 26 33 40 43
24 30 29 35 29 36 38 41
18 22 28 34 29 33 46 45
26 27 32 44 37 45 46 46
;
proc surveymeans data=Sys_9b total=50 mean clm sum clsum;
var count;
cluster start_pt;
weight wgt;
title1 Systematic Sample in Figure 9b --- Estimating mu and tau;
run;
==================================================================
Systematic Sample in Figure 9b --- Estimating mu and tau
The SURVEYMEANS Procedure
Data Summary
Number of Clusters 6
Number of Observations 48
Sum of Weights 400
Statistics
Std Error
Variable Mean of Mean 95% CL for Mean
-----------------------------------------------------------------
count 31.916667 1.342334 28.4660867 35.3672466
-----------------------------------------------------------------
Variable Sum Std Dev 95% CL for Sum
-----------------------------------------------------------------
count 12767 536.933681 11386.4347 14146.8986
-----------------------------------------------------------------
6.8 Cluster Sampling with Unequal Cluster Sizes
Suppose the N cluster sizes $M_1, M_2, \ldots, M_N$ are not all equal and that a one-stage cluster
sample of n primary units is taken with the goal of estimating $\mu$ or $\tau$.
Let $M_i$ and $y_i$ ($i = 1, 2, \ldots, n$) be the sizes and totals of the n sampled primary units.
Let $m = \sum_{i=1}^{n} M_i$ be the total number of secondary units in the sample.
We will discuss three methods of calculating estimates of $\mu$ and $\tau$ given the unequal cluster
sizes. These methods are based on two different representations of $\mu$.
(i) as a population ratio:
$$\mu = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} M_i} = \frac{\sum_{i=1}^{N} y_i}{M} \qquad (88)$$
expresses $\mu$ as the ratio of the total of the primary unit values to the total number
of secondary units.
(ii) as a mean cluster total:
$$\mu = \left(\frac{N}{M}\right)\frac{\sum_{i=1}^{N} y_i}{N} \qquad (89)$$
expresses $\mu$ as a multiple of the mean of the cluster $y_i$ values.
Method 1: The sample cluster ratio: Substitution of sample values into (88) provides
the following ratio estimator for $\mu$:
$$\hat{\mu}_{c(a)} = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i} = \frac{\sum_{i=1}^{n} y_i}{m}$$
which is the ratio of the sum of the sampled cluster totals to the sum of the sampled
cluster sizes.
$\hat{\mu}_{c(a)}$ is a special case of the SRS ratio estimator (see Section 4 of the course notes).
Thus, $\hat{\mu}_{c(a)}$ is biased, with the bias $\to 0$ as n increases.
Because $\hat{\mu}_{c(a)}$ is a ratio estimator, there is no closed form for the true variance of $\hat{\mu}_{c(a)}$.
However, an approximation is given in Thompson (2002). A sample-based estimate
of this approximate variance is given by
var(
c(a)
) =
(N n)N
n(n 1)M
2
n

i=1
M
2
i
(y
i

c(a)
)
2
. (90)
If M is not known, it can be estimated from the sample as M Nm/n. Substitution
into (90) yields:
var(
c(a)
) =
(N n)n
N(n 1)
n

i=1
_
M
i
m
_
2
(y
i

c(a)
)
2
. (91)
To estimate $\tau$, multiply $\hat\mu_{c(a)}$ by $M$. To get the estimated variances, multiply $\widehat{\mathrm{var}}(\hat\mu_{c(a)})$ by $M^2$.
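As an illustration of Method 1, here is a minimal Python sketch applied to the eight sampled clusters of Figure 10. The function name `ratio_estimate` and its arguments are hypothetical choices for this example. The squared term is written as $(y_i - \hat\mu_{c(a)} M_i)^2$, which is algebraically the same as $M_i^2(\bar y_i - \hat\mu_{c(a)})^2$.

```python
def ratio_estimate(sizes, totals, N, M=None):
    """Method 1 sketch: ratio estimate of mu and its estimated variance.

    sizes  -- M_i for the n sampled clusters
    totals -- y_i for the n sampled clusters
    N      -- number of clusters in the population
    M      -- secondary units in the population; if None, M is replaced
              by N*m/n, which turns equation (90) into equation (91)
    """
    n = len(sizes)
    m = sum(sizes)
    mu_hat = sum(totals) / m
    if M is None:
        M = N * m / n
    # sum of M_i^2 (ybar_i - mu_hat)^2, computed as (y_i - mu_hat * M_i)^2
    ss = sum((y - mu_hat * Mi) ** 2 for Mi, y in zip(sizes, totals))
    var_hat = (N - n) * N / (n * (n - 1) * M ** 2) * ss
    return mu_hat, var_hat


# Figure 10 sample: cluster sizes (sum 68) and cluster totals (sum 2192)
sizes = [16, 16, 8, 8, 8, 4, 4, 4]
totals = [401, 337, 279, 321, 280, 171, 187, 216]

mu_a, var_a = ratio_estimate(sizes, totals, N=49, M=400)      # M known, (90)
mu_a2, var_a2 = ratio_estimate(sizes, totals, N=49)           # M estimated, (91)
tau_a, var_tau_a = 400 * mu_a, 400 ** 2 * var_a               # estimates for tau
print(mu_a, var_a, var_a2, tau_a)
```

Both calls return the same point estimate; only the variance estimate changes, because $Nm/n = 416.5$ here differs from the true $M = 400$.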
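Method 2: The cluster sample total: Substitution of sample values into (89) provides the following unbiased estimator for $\mu$:
$$\hat\mu_{c(b)} = \frac{N}{M} \, \frac{\sum_{i=1}^{n} y_i}{n} = \frac{N}{nM} \sum_{i=1}^{n} y_i$$
The variance is
$$\mathrm{var}(\hat\mu_{c(b)}) = \frac{(N-n)N}{n(N-1)M^2} \sum_{i=1}^{N} (y_i - \mu_1)^2 = \frac{(N-n)N}{nM^2} \, \sigma_u^2$$
where $\mu_1 = \tau/N$ is the population mean of the cluster totals and $\sigma_u^2$ is their population variance.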
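An estimate of this variance is given by
$$\widehat{\mathrm{var}}(\hat\mu_{c(b)}) = \frac{(N-n)N}{n(n-1)M^2} \sum_{i=1}^{n} (y_i - \bar y)^2 = \frac{(N-n)N}{nM^2} \, s_u^2 \tag{92}$$
If $M$ is not known, we can substitute $M \approx Nm/n$ into (92) and get:
$$\widehat{\mathrm{var}}(\hat\mu_{c(b)}) = \frac{(N-n)n}{(n-1)Nm^2} \sum_{i=1}^{n} (y_i - \bar y)^2 = \frac{(N-n)n}{Nm^2} \, s_u^2 \tag{93}$$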
To estimate $\tau$ using Method 1 or Method 2, multiply $\hat\mu_{c(a)}$ or $\hat\mu_{c(b)}$ by $M$. To get the estimated variances, multiply $\widehat{\mathrm{var}}(\hat\mu_{c(a)})$ and $\widehat{\mathrm{var}}(\hat\mu_{c(b)})$ by $M^2$.
A confidence interval for $\mu$ using either Method 1 (subscript a) or Method 2 (subscript b) is:
$$\hat\mu_{c(k)} \pm t^* \sqrt{\widehat{\mathrm{var}}(\hat\mu_{c(k)})} \quad \text{for } k = a, b \tag{94}$$
where $t^*$ is the upper $\alpha/2$ critical value from the $t(n-1)$ distribution.
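A minimal Python sketch of Method 2 and the interval (94), again using the Figure 10 sample, may help fix the notation. The function names are hypothetical, and the critical value 2.365 is an assumption taken from standard t tables for $n - 1 = 7$ degrees of freedom at the 95% level.

```python
import math


def cluster_total_estimate(totals, N, M):
    """Method 2 sketch: unbiased estimate of mu and equation (92)."""
    n = len(totals)
    ybar = sum(totals) / n                                   # mean sampled cluster total
    s2_u = sum((y - ybar) ** 2 for y in totals) / (n - 1)    # sample variance of totals
    mu_hat = N * ybar / M
    var_hat = (N - n) * N / (n * M ** 2) * s2_u
    return mu_hat, var_hat


def t_interval(estimate, var_hat, t_star):
    """Interval (94): estimate +/- t* times the estimated standard error."""
    half_width = t_star * math.sqrt(var_hat)
    return estimate - half_width, estimate + half_width


# Cluster totals y_i of the Figure 10 sample (they sum to 2192)
totals = [401, 337, 279, 321, 280, 171, 187, 216]
N, M = 49, 400
T_STAR = 2.365   # assumed table value for t(7), alpha = 0.05

mu_b, var_b = cluster_total_estimate(totals, N, M)
ci_mu = t_interval(mu_b, var_b, T_STAR)
ci_tau = t_interval(M * mu_b, M ** 2 * var_b, T_STAR)   # same interval on the total scale
print(mu_b, var_b, ci_mu, ci_tau)
```

In this sample the interval for $\mu$ covers the population mean 33.385 reported with Figure 10.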

Method 3: Primary units selected with pps: Suppose that the primary units are selected with replacement with draw-by-draw selection probabilities ($p_i$) proportional to the sizes of the primary units, $p_i = M_i/M$.
One way to construct the sampling design is to
1. Select $m$ secondary units (say, $u_1, u_2, \ldots, u_m$) from the $M$ in the population using simple random sampling with replacement.
2. Then for each $u_i$ ($i = 1, 2, \ldots, m$), sample all secondary units in the cluster containing $u_i$.
Thus, a primary unit is selected every time any of its secondary units is selected.
Now that we have defined $p_i$, we simply use either the Hansen-Hurwitz estimator or the Horvitz-Thompson estimator (and the associated variance estimators) discussed in Section 3 of the course notes.
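The two-step construction can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pps_with_replacement`, the argument `n_draws`, and the seed are hypothetical, and the cluster sizes are those of the population in Figures 10 to 12 (9 clusters of 16, 24 of 8, 16 of 4).

```python
import random


def pps_with_replacement(sizes, n_draws, rng):
    """Draw clusters with probability M_i / M by sampling secondary units."""
    # owner[k] is the index of the cluster that contains secondary unit k
    owner = [i for i, size in enumerate(sizes) for _ in range(size)]
    # Step 1: simple random sample of secondary units, with replacement
    units = [rng.randrange(len(owner)) for _ in range(n_draws)]
    # Step 2: each selected unit brings in the whole cluster it belongs to
    return [owner[u] for u in units]


sizes = [16] * 9 + [8] * 24 + [4] * 16     # M_i for the 49 clusters, M = 400
p = [size / sum(sizes) for size in sizes]  # draw-by-draw probabilities p_i

rng = random.Random(446)                   # arbitrary seed so the run is repeatable
draws = pps_with_replacement(sizes, 5, rng)
print(draws)                               # cluster indices (0-based); repeats are possible
```

Because a cluster of size 16 owns four times as many secondary units as a cluster of size 4, it is four times as likely to be hit on each draw, which is exactly $p_i = M_i/M$.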
Figure 10: Cluster Sampling with Unequal-Sized Clusters
The mean $\mu$ = 33.385. There are $M$ = 400 secondary units and $N$ = 49 primary units (clusters).
There are 9 clusters of size $M_i = 16$, 24 clusters of size $M_i = 8$, and 16 clusters of size $M_i = 4$.
The boldfaced values represent the units in the sample.
18 20 15 20 20 15 19 18 24 23 20 26 29 28 28 31 31 34 28 32
13 20 16 20 15 23 19 26 21 21 24 30 23 26 25 33 31 28 32 38
16 18 20 24 25 26 22 23 26 26 22 27 25 25 34 28 37 36 38 31
17 17 16 22 21 23 22 27 27 24 28 32 29 33 27 37 37 38 35 33
15 19 23 17 21 23 21 23 24 25 31 26 32 34 32 33 31 31 36 37
21 24 20 21 28 26 30 22 31 25 29 29 27 30 29 37 35 32 38 43
23 17 24 25 24 27 31 29 31 34 27 36 29 29 34 39 37 37 40 36
18 24 21 25 27 22 32 32 31 26 28 34 34 37 35 34 38 38 37 40
22 26 28 26 24 29 33 26 27 27 34 31 39 32 36 38 37 40 44 43
23 27 28 29 26 32 25 31 35 34 32 33 37 32 42 40 40 37 42 44
23 21 31 23 30 27 31 30 32 35 30 40 32 37 37 36 40 44 44 40
26 29 31 26 30 31 34 36 30 38 36 32 38 38 37 42 42 41 40 49
28 24 28 27 26 31 32 29 32 33 38 34 39 38 40 37 41 43 42 43
32 25 31 32 29 29 35 38 38 32 36 35 39 42 39 40 44 42 41 45
27 29 35 28 35 35 31 40 35 37 38 44 40 40 47 39 49 48 51 49
30 29 32 32 33 30 36 38 42 36 35 38 44 47 45 49 41 43 44 51
28 35 35 34 34 33 41 33 34 35 39 44 44 48 44 50 49 48 53 54
29 33 32 36 39 33 33 34 35 42 46 47 48 47 46 45 44 52 54 55
28 37 38 37 33 33 34 37 45 40 39 42 42 46 47 48 52 47 46 53
38 39 39 37 34 38 39 45 39 42 45 41 44 51 46 50 52 51 51 53
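| $M_i$ | $y_i$ | $\bar y_i$ |
|---|---|---|
| 16 | 401 | 25.0625 |
| 16 | 337 | 21.0625 |
| 8 | 279 | 34.8750 |
| 8 | 321 | 40.1250 |
| 8 | 280 | 35.0000 |
| 4 | 171 | 42.7500 |
| 4 | 187 | 46.7500 |
| 4 | 216 | 54.0000 |

$m = 68$, $\sum y_i = 2192$, $\bar y = 274$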
Figure 11: Horvitz-Thompson Estimation with Selection Probabilities Proportional to Cluster Size
The mean $\mu$ = 33.385. There are $M$ = 400 secondary units and $N$ = 49 primary units (clusters).
There are 9 clusters with $M_i = 16$, 24 clusters with $M_i = 8$, and 16 clusters with $M_i = 4$. Five clusters were sampled with replacement. One cluster was sampled twice, so the sample contains four distinct clusters. The boldfaced values are in the sample.
18 20 15 20 20 15 19 18 24 23 20 26 29 28 28 31 31 34 28 32
13 20 16 20 15 23 19 26 21 21 24 30 23 26 25 33 31 28 32 38
16 18 20 24 25 26 22 23 26 26 22 27 25 25 34 28 37 36 38 31
17 17 16 22 21 23 22 27 27 24 28 32 29 33 27 37 37 38 35 33
15 19 23 17 21 23 21 23 24 25 31 26 32 34 32 33 31 31 36 37
21 24 20 21 28 26 30 22 31 25 29 29 27 30 29 37 35 32 38 43
23 17 24 25 24 27 31 29 31 34 27 36 29 29 34 39 37 37 40 36
18 24 21 25 27 22 32 32 31 26 28 34 34 37 35 34 38 38 37 40
22 26 28 26 24 29 33 26 27 27 34 31 39 32 36 38 37 40 44 43
23 27 28 29 26 32 25 31 35 34 32 33 37 32 42 40 40 37 42 44
23 21 31 23 30 27 31 30 32 35 30 40 32 37 37 36 40 44 44 40
26 29 31 26 30 31 34 36 30 38 36 32 38 38 37 42 42 41 40 49
28 24 28 27 26 31 32 29 32 33 38 34 39 38 40 37 41 43 42 43
32 25 31 32 29 29 35 38 38 32 36 35 39 42 39 40 44 42 41 45
27 29 35 28 35 35 31 40 35 37 38 44 40 40 47 39 49 48 51 49
30 29 32 32 33 30 36 38 42 36 35 38 44 47 45 49 41 43 44 51
28 35 35 34 34 33 41 33 34 35 39 44 44 48 44 50 49 48 53 54
29 33 32 36 39 33 33 34 35 42 46 47 48 47 46 45 44 52 54 55
28 37 38 37 33 33 34 37 45 40 39 42 42 46 47 48 52 47 46 53
38 39 39 37 34 38 39 45 39 42 45 41 44 51 46 50 52 51 51 53
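| $i$ | $y_i$ | $M_i$ | $p_i = M_i/M$ | $\pi_i = 1 - (1 - p_i)^5$ |
|---|---|---|---|---|
| 1 | 344 | 16 | 16/400 = .04 | $1 - .96^5 = .184627302$ |
| 2 | 252 | 8 | 8/400 = .02 | $1 - .98^5 = .096079203$ |
| 3 | 278 | 8 | 8/400 = .02 | $1 - .98^5 = .096079203$ |
| 4 | 181 | 4 | 4/400 = .01 | $1 - .99^5 = .049009950$ |

$$\begin{aligned}
\pi_{12} = \pi_{13} &= [1 - (.96^5)] + [1 - (.98^5)] - [1 - (.94^5)] = .0146105270 \\
\pi_{14} &= [1 - (.96^5)] + [1 - (.99^5)] - [1 - (.95^5)] = .00741819 \\
\pi_{24} = \pi_{34} &= [1 - (.98^5)] + [1 - (.99^5)] - [1 - (.97^5)] = .003823179 \\
\pi_{23} &= [1 - (.98^5)] + [1 - (.98^5)] - [1 - (.96^5)] = .007531104
\end{aligned}$$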
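The inclusion probabilities above feed directly into the Horvitz-Thompson estimator. The following Python sketch is an illustration only: it assumes the standard Horvitz-Thompson forms $\hat\tau = \sum y_i/\pi_i$ and $\widehat{\mathrm{var}}(\hat\tau) = \sum (1/\pi_i^2 - 1/\pi_i) y_i^2 + 2\sum_{i<j} (1/(\pi_i\pi_j) - 1/\pi_{ij}) y_i y_j$, which are taken here as the estimators referred to in Section 3 of the course notes.

```python
from itertools import combinations

n_draws = 5
M = 400
y = [344, 252, 278, 181]      # totals of the four distinct sampled clusters
sizes = [16, 8, 8, 4]         # their sizes M_i
p = [Mi / M for Mi in sizes]  # draw-by-draw selection probabilities

# pi_i = 1 - (1 - p_i)^n: chance that cluster i is hit at least once
pi = [1 - (1 - pk) ** n_draws for pk in p]


def joint(i, j):
    """pi_ij = pi_i + pi_j - [1 - (1 - p_i - p_j)^n]."""
    return pi[i] + pi[j] - (1 - (1 - p[i] - p[j]) ** n_draws)


# Horvitz-Thompson estimate of the total, using each distinct cluster once
tau_ht = sum(yk / pik for yk, pik in zip(y, pi))

# Assumed standard Horvitz-Thompson variance estimator
var_ht = sum((1 / pik ** 2 - 1 / pik) * yk ** 2 for yk, pik in zip(y, pi))
var_ht += 2 * sum(
    (1 / (pi[i] * pi[j]) - 1 / joint(i, j)) * y[i] * y[j]
    for i, j in combinations(range(len(y)), 2)
)

mu_ht = tau_ht / M
print(tau_ht, mu_ht, var_ht)
```

Note that the cluster drawn twice enters the estimator only once; repeated selection affects $\pi_i$ through the number of draws, not through the sum.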
Figure 12: Hansen-Hurwitz Estimation with Selection Probabilities Proportional to Cluster Size
In Figure 11, the total abundance is $\tau$ = 13354. There are $M$ = 400 secondary units and $N$ = 49 primary units (clusters). There are 9 clusters with $M_i = 16$, 24 clusters with $M_i = 8$, and 16 clusters with $M_i = 4$. The cluster totals $y_i$ for the clusters in Figure 11 are summarized in the figure below. Also included is a cluster label (1 to 49). Eight clusters were sampled with replacement. The sampled units are 2, 6, 6, 16, 25, 30, 32, and 44. Note that cluster 6 was sampled twice. The boldfaced values are in the sample.
   1      2      3       10     11     12     13
  292   (344)   401      218    243    272    267
   4      5      6       14     15     16     17
  337    418  ((467))    252    273   (279)   307
   7      8      9       18     19     20     21
  419    475    526      285    308    321    346
  22     26     30       34     35     36     37
  227    249   (278)     158    156    170    171
  23     27     31       38     39     40     41
  242    278    305      171    180    181    195
  24     28     32       42     43     44     45
  262    280   (322)     187    185   (193)   216
  25     29     33       46     47     48     49
 (293)   293    333      183    191    202    203
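| Unit $i$ | $y_i$ | $p_i$ | $y_i/p_i$ |
|---|---|---|---|
| 2 | 344 | .04 | 8600 |
| 6 | 467 | .04 | 11675 |
| 6 | 467 | .04 | 11675 |
| 16 | 279 | .02 | 13950 |
| 25 | 293 | .02 | 14650 |
| 30 | 278 | .02 | 13900 |
| 32 | 322 | .02 | 16100 |
| 44 | 193 | .01 | 19300 |
| Total | | | 109850 |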
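The column total 109850 is the only ingredient needed for the Hansen-Hurwitz estimate. The sketch below is a hedged illustration that assumes the standard forms $\hat\tau_{HH} = \frac{1}{n}\sum y_i/p_i$ and $\widehat{\mathrm{var}}(\hat\tau_{HH}) = \frac{1}{n(n-1)}\sum (y_i/p_i - \hat\tau_{HH})^2$, taken here as the estimators of Section 3 of the course notes. The dictionary names are hypothetical.

```python
M = 400
# Sizes and totals of the distinct clusters drawn in Figure 12
sizes = {2: 16, 6: 16, 16: 8, 25: 8, 30: 8, 32: 8, 44: 4}
totals = {2: 344, 6: 467, 16: 279, 25: 293, 30: 278, 32: 322, 44: 193}
draws = [2, 6, 6, 16, 25, 30, 32, 44]   # cluster 6 appears once per draw

n = len(draws)
# y_i / p_i with p_i = M_i / M, written as y_i * M / M_i to stay in exact arithmetic
ratios = [totals[k] * M / sizes[k] for k in draws]

tau_hh = sum(ratios) / n                                        # estimate of the total
var_hh = sum((r - tau_hh) ** 2 for r in ratios) / (n * (n - 1)) # assumed standard HH variance
mu_hh = tau_hh / M                                              # estimate of the mean
print(sum(ratios), tau_hh, mu_hh, var_hh)
```

Unlike the Horvitz-Thompson estimator, a cluster drawn twice is counted twice here, once for each draw.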
6.9 Attribute Proportion Estimation using Cluster Sampling
Instead of studying a quantitative measure associated with sampling units, we often are
interested in an attribute (a qualitative characteristic). Statistically, the goal is to estimate
a proportion. The population proportion p is the proportion of population units having
that attribute.
Examples: the proportion of females (or males) in an animal population, the proportion of
consumers who own motorcycles, the proportion of married couples with at least 1 child. . .
If a one-stage cluster sample is taken, then how do we estimate p?
6.9.1 Estimating p with Equal Cluster Sizes
Statistically, we use an indicator function that assigns a $y_{ij}$ value to secondary unit $j$ in primary unit (cluster) $i$ as follows:
$$y_{ij} = \begin{cases} 1 & \text{if unit } j \text{ in cluster } i \text{ possesses the attribute} \\ 0 & \text{otherwise} \end{cases}$$
Then $\tau = \sum_{i=1}^{N} \sum_{j=1}^{M_i} y_{ij}$ and $p = \dfrac{\tau}{LN} = \dfrac{\tau}{M}$ where $M_i = L$ for each cluster.
The proportion for cluster $i$ is defined as $p_i = \dfrac{1}{L} \sum_{j=1}^{L} y_{ij}$.
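By taking a one-stage cluster sample of $n$ equal-sized clusters, we can estimate $p$ as the average of the sampled cluster proportions (each cluster gets the same weight because the clusters are the same size):
$$\hat p_c = \frac{\sum_{i=1}^{n} p_i}{n}$$
$\hat p_c$ is an unbiased estimator of $p$.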
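The variance of $\hat p_c$ is
$$\mathrm{var}(\hat p_c) = \left(\frac{N-n}{nN}\right) \frac{\sum_{i=1}^{N} (p_i - p)^2}{N-1} = \left(\frac{1-f}{n}\right) \frac{\sum_{i=1}^{N} (p_i - p)^2}{N-1} \tag{95}$$
where $f = n/N$ = the proportion of clusters sampled.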
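Because $p$ is unknown, we use $\hat p_c$ as an estimate of $p$ to get the unbiased estimator of $\mathrm{var}(\hat p_c)$:
$$\widehat{\mathrm{var}}(\hat p_c) = \left(\frac{N-n}{nN}\right) \frac{\sum_{i=1}^{n} (p_i - \hat p_c)^2}{n-1} = \left(\frac{1-f}{n}\right) \frac{\sum_{i=1}^{n} (p_i - \hat p_c)^2}{n-1} \tag{96}$$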
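A small Python sketch of the equal-cluster-size case follows. The data are hypothetical and chosen only to make the arithmetic easy to follow: $N = 100$ clusters of size $L = 5$, of which $n = 4$ are sampled, with 2, 3, 1, and 4 units possessing the attribute. The function name is likewise an illustrative choice.

```python
def cluster_proportion_equal(counts, L, N):
    """Estimate p and equation (96) from equal-sized sampled clusters.

    counts -- number of units with the attribute in each sampled cluster
    L      -- common cluster size
    N      -- number of clusters in the population
    """
    n = len(counts)
    p_i = [c / L for c in counts]                             # cluster proportions
    p_hat = sum(p_i) / n                                      # simple average
    s2 = sum((pi - p_hat) ** 2 for pi in p_i) / (n - 1)       # variance among the p_i
    var_hat = (N - n) / (n * N) * s2                          # (1 - f)/n times s2
    return p_hat, var_hat


# Hypothetical sample, not from the course notes
p_hat, var_hat = cluster_proportion_equal([2, 3, 1, 4], L=5, N=100)
print(p_hat, var_hat)
```

With equal cluster sizes the average of the cluster proportions equals the pooled proportion (10 of the 20 sampled units here), so the two ways of computing $\hat p_c$ agree.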
6.9.2 Estimating p with Unequal Cluster Sizes
Suppose the cluster sizes are not all equal. Let $M_i$ be the number of secondary units in cluster $i$ and $y_i = \sum_{j=1}^{M_i} y_{ij}$ = the cluster $i$ total.
By taking a one-stage cluster sample of $n$ clusters from a population with unequal-sized clusters, we can estimate $p$ as:
$$\hat p_c = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} M_i}$$
Note that $\hat p_c$ is a ratio estimator. Therefore, it is a biased estimator. The bias, however, tends to be small for large $\sum_{i=1}^{n} M_i$.
The $\mathrm{var}(\hat p_c)$ is approximated by:
$$\mathrm{var}(\hat p_c) \approx \left(\frac{1-f}{n\bar M^2}\right) \frac{\sum_{i=1}^{N} (y_i - pM_i)^2}{N-1} \tag{97}$$
where $\bar M = \sum_{i=1}^{N} M_i / N$ = the average number of elements per cluster in the population.
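Because $p$ is unknown, we use $\hat p_c$ as an estimate to get the following estimator of the approximate variance of $\hat p_c$:
$$\widehat{\mathrm{var}}(\hat p_c) \approx \left(\frac{1-f}{n\bar m^2}\right) \frac{\sum_{i=1}^{n} (y_i - \hat p_c M_i)^2}{n-1} = \left(\frac{1-f}{n\bar m^2}\right) \frac{\sum_{i=1}^{n} y_i^2 - 2\hat p_c \sum_{i=1}^{n} y_i M_i + \hat p_c^2 \sum_{i=1}^{n} M_i^2}{n-1} \tag{98}$$
where $\bar m = \sum_{i=1}^{n} M_i / n$ = the average number of elements per cluster in the sample.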
Bellhouse, D.R. (1988) Systematic sampling. Handbook of Statistics, Vol. 6 (Sampling). 125-145.
Eds: Krishnaiah and Rao. Elsevier Science Publishers. Amsterdam.
Murthy, M.N. and Rao, T.J. (1988) Systematic sampling with illustrative examples. Handbook
of Statistics, Vol. 6 (Sampling). 147-185. Eds: Krishnaiah and Rao. Elsevier Science Publishers.
Amsterdam.
Wolter, K.M. (1984) An investigation of some estimators of variance for systematic sampling. J.
of the American Statistical Association. 79 781-790.
Example: A simple random sample of $n = 30$ households (clusters) was drawn from a health district in Baltimore (USA) that contains $N$ = 15,000 households. Using the following data, estimate the proportion $p$ of people in this health district that visited a doctor last year.

| Household Number | Household Size ($M_i$) | Number who visited doctor last year ($y_i$) |
|---|---|---|
| 1 | 5 | 5 |
| 2 | 6 | 0 |
| 3 | 3 | 2 |
| 4 | 3 | 3 |
| 5 | 2 | 0 |
| 6 | 3 | 0 |
| 7 | 3 | 0 |
| 8 | 3 | 0 |
| 9 | 4 | 0 |
| 10 | 4 | 0 |
| 11 | 3 | 0 |
| 12 | 2 | 0 |
| 13 | 7 | 0 |
| 14 | 4 | 4 |
| 15 | 3 | 1 |
| 16 | 5 | 2 |
| 17 | 4 | 0 |
| 18 | 4 | 0 |
| 19 | 3 | 1 |
| 20 | 3 | 3 |
| 21 | 4 | 2 |
| 22 | 3 | 0 |
| 23 | 3 | 0 |
| 24 | 1 | 0 |
| 25 | 2 | 2 |
| 26 | 4 | 2 |
| 27 | 3 | 0 |
| 28 | 4 | 2 |
| 29 | 2 | 0 |
| 30 | 4 | 1 |
| Totals | 104 | 30 |
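One way to work this example is sketched below in Python. It applies the ratio estimator of Section 6.9.2 and the variance estimate (98) to the household data; the variable names are illustrative choices and the printed standard error is simply the square root of that variance estimate.

```python
import math

# Household sizes M_i and numbers who visited a doctor y_i, households 1..30
sizes = [5, 6, 3, 3, 2, 3, 3, 3, 4, 4, 3, 2, 7, 4, 3,
         5, 4, 4, 3, 3, 4, 3, 3, 1, 2, 4, 3, 4, 2, 4]
visits = [5, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 1,
          2, 0, 0, 1, 3, 2, 0, 0, 0, 2, 2, 0, 2, 0, 1]
N = 15000                  # households in the health district
n = len(sizes)             # sampled households

p_hat = sum(visits) / sum(sizes)   # ratio estimate: 30 visits among 104 people
m_bar = sum(sizes) / n             # average household size in the sample
f = n / N                          # sampling fraction

# Numerator of (98): sum of (y_i - p_hat * M_i)^2 over the sampled households
ss = sum((y - p_hat * Mi) ** 2 for Mi, y in zip(sizes, visits))
var_hat = (1 - f) / (n * m_bar ** 2) * ss / (n - 1)
se_hat = math.sqrt(var_hat)

print(p_hat, var_hat, se_hat)
```

The sampling fraction is tiny (30 of 15,000 households), so the finite population correction $1 - f$ has almost no effect on the result.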