Sampling Techniques
WILLIAM G. COCHRAN
Professor of Statistics
Harvard University
PRINTED IN INDIA
BY C. L. BHARGAVA AT G. W. LAWRIE AND CO., LUCKNOW AND
PUBLISHED BY P. S. JAYASINGHE, ASIA PUBLISHING HOUSE, BOMBAY.
To Betty
Preface
1. INTRODUCTION
1.1 Advantages of the sampling method
1.2 The principal steps in a sample survey
1.3 The role of sampling theory
1.4 Probability sampling
1.5 Bias and its effects
1.6 References
2. SIMPLE RANDOM SAMPLING
2.1 Simple random sampling
2.2 Definitions and notation
2.3 Properties of the estimates
2.4 Variances of the estimates
2.5 The finite population correction
2.6 Estimation of the standard error from a sample
2.7 Confidence limits
2.8 Validity of the normal approximation
2.9 Effect of non-normality on the estimated variance
2.10 Exercises
2.11 References
3. SAMPLING FOR PROPORTIONS AND PERCENTAGES
3.1 Qualitative characteristics
3.2 Variances of the sample estimates
3.3 The effect of P on the standard errors
3.4 The binomial distribution
3.5 The general distribution of p
3.6 Confidence limits
3.7 Classification into more than two classes
3.8 Confidence limits when there are more than two classes
3.9 The conditional distribution of p
3.10 Exercises
3.11 References
4. THE ESTIMATION OF SAMPLE SIZE
4.1 A hypothetical example
4.2 Analysis of the problem
4.3 The specification of precision
4.4 The formula for n in sampling for proportions
4.5 The formula for n with continuous data
4.6 Sample size with more than one item
4.7 Stein's method of two-stage sampling
4.8 An attempt at a general solution of the sample size problem
4.9 Exercises
4.10 References
INTRODUCTION
1.3 The role of sampling theory. This list of the steps in a sample
survey has been given in order to emphasize that sampling is a prac-
tical business, which calls for several different types of skill. In some
of the steps-the definition of the population, the determination of
the data to be collected and of the methods of measurement, and the
organization of the field work-sampling theory plays at most a minor
role. Although these topics will not be discussed further in this book,
their importance should be realized. Sampling demands attention to
all phases of the activity: poor work in one phase may ruin a survey
in which everything else is done well.
The purpose of sampling theory is to make sampling more efficient.
It attempts to develop methods of sample selection and of estimation
that provide, at the lowest possible cost, estimates that are precise
enough for our purpose. This principle of specified precision at mini-
mum cost recurs repeatedly in the presentation of theory.
In order to apply this principle, we must be able to predict, for any
sampling procedure that is under consideration, the precision and the
cost to be expected. So far as precision is concerned, we cannot foretell
exactly how large an error will be present in an estimate in any
specific situation, for this would require a knowledge of the true value
for the population. Instead, the precision of a sampling procedure is
judged by examining the frequency distribution which is generated
for the estimate, if the procedure is applied again and again to the
same population. This is, of course, the standard technique by which
precision is judged in statistical theory.
A further simplification is introduced. With samples of the sizes
that are common in practice, there is often good reason to suppose
that the sample estimates are approximately normally distributed.
Consequently the sampling variance of the estimate is used to provide,
in inverse terms, a measure of its precision. A considerable part of
the theory deals with the calculation of formulas for the sampling
variances of estimates obtained by various procedures.
The study of sampling from an infinite population is a relatively old
and well-established discipline. The development of theory specifically
for application to sample surveys is quite recent. Nearly all the ref-
erences in this book are less than 20 years old and the majority are
less than 10 years old. The primary stimulus to sample survey theory
was the increasing use of sample surveys as a means of obtaining infor-
mation. Most of the work in sample survey theory has been done by
persons who are also actively engaged in the conduct of surveys. In
their turn, the advances in theory increased the scope and utility of
1.5 Bias and its effects. For simplicity, it is assumed in the presentation
of theory that any measurement $y_i$ on the ith unit is the correct
value for that unit. Errors of measurement are ignored. This
assumption is of course unrealistic, and in chapter 13 the effects of
errors of measurement on the standard results are examined. For
some types of error, the standard results remain valid with only minor
changes. For other types of error, more drastic changes are needed.
The effects of bias will, however, be discussed in this section, because
the deliberate use of biased estimates is often found to be profitable
in sample surveys.
A sampling procedure is said to be unbiased if the mean of the frequency
distribution of the estimates which it produces is exactly equal
to the population characteristic which is being estimated. In the notation
of the previous section, let $z_i$ be the estimate provided by the
sample $S_i$ ($i = 1, 2, \ldots, \nu$), with probability of selection $\pi_i$, and let
$\theta$ be the population value which is being estimated. The procedure
is unbiased if
$$\sum_{i=1}^{\nu} \pi_i z_i = \theta$$
If the two quantities are not equal, their difference is called the bias
in the sampling procedure:
$$\text{Bias} = \sum_{i=1}^{\nu} \pi_i z_i - \theta$$
Similarly, the lower tail, i.e. the shaded area below P, has an area
$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-1.96 - (B/\sigma)} e^{-t^2/2}\,dt$$
From the form of the integrals it is clear that the amount of dis-
turbance depends solely on the ratio of the bias to the standard devia-
tion. The results are shown in table 1.1.
For the total probability of an error of more than $1.96\sigma$, the bias
has little effect provided that it is less than one-tenth of the standard
deviation. At this point the total probability is 0.0511 instead of the
0.05 which we think it is. As the bias increases further, the disturbance
becomes more serious. At $B = \sigma$, the total probability of error
is 0.17, more than three times the presumed value.
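The two figures quoted above can be checked numerically. The following is a minimal sketch, assuming the estimate is normally distributed with standard deviation σ and bias B; the function names and the parameter `bias_ratio` (standing for B/σ) are illustrative and not part of the text.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tail_probabilities(bias_ratio: float, t: float = 1.96):
    """Probabilities of an error larger than t standard deviations.

    bias_ratio is B / sigma (an illustrative parameter name).
    Returns (lower tail, upper tail, total).
    """
    lower = normal_cdf(-t - bias_ratio)        # underestimate by more than t*sigma
    upper = 1.0 - normal_cdf(t - bias_ratio)   # overestimate by more than t*sigma
    return lower, upper, lower + upper

for ratio in (0.0, 0.1, 0.5, 1.0):
    lower, upper, total = tail_probabilities(ratio)
    print(f"B/sigma = {ratio:3.1f}  lower = {lower:.4f}  "
          f"upper = {upper:.4f}  total = {total:.4f}")
```

With a ratio of 0.1 the total comes out close to 0.0511, and with a ratio of 1 it is close to 0.17, in line with the values in the text.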
The two tails are affected differently. With a positive bias, as in
this example, the probability of an underestimate by more than $1.96\sigma$
shrinks rapidly from the presumed 0.025 to become negligible when
$B = \sigma$. The probability of the corresponding overestimate mounts
1.6 References.
PAYNE, S. L. (1951). The art of asking questions. Princeton University Press.
STEPHAN, F. F. (1948). History of the uses of modern sampling procedures.
Jour. Amer. Stat. Assoc., 43, 12-39.
CHAPTER 2
$$\binom{N}{n} = {}_{N}C_{n} = \frac{N!}{n!\,(N - n)!}$$
For example, if the population contains five units denoted by A, B,
C, D, and E, there are ten different samples of size 3, as follows:
ABC ABD ABE ACD ACE
ADE BCD BCE BDE CDE
Note that the same letter is not allowed to occur twice in the sample.
No attention is paid to the order in which the letters occur in the
sample, the six samples ABC, ACB, BAC, BCA, CAB, and CBA being
considered identical.
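The ten samples listed above can be reproduced mechanically. This is a small sketch using the Python standard library; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

units = ["A", "B", "C", "D", "E"]   # the five population units
n = 3                               # sample size

# combinations() ignores order and never repeats a unit,
# which matches the definition of a sample used here.
samples = ["".join(s) for s in combinations(units, n)]

print(samples)
print(len(samples), comb(len(units), n))   # both counts should agree
```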
Simple random sampling is a method of selecting n units out of the
N such that every one of the ${}_{N}C_{n}$ samples has an equal chance of
being chosen. This type of sampling is sometimes called random
sampling. Since the word random is used in the literature in many
different senses, an extra qualifying adjective is advisable. Some
writers prefer the phrase unrestricted random sampling.
In practice a simple random sample is drawn unit by unit. The
units in the population are numbered from 1 to N. A series of random
numbers between 1 and N is then drawn, either by means of a table
of random numbers or by placing the numbers 1 to N in a bowl and
mixing thoroughly. If the bowl is used, n numbers are drawn out in
succession. The units which bear these numbers constitute the sample.
At any stage in the draw, this process gives an equal chance of selection
to all numbers not previously drawn. It is easy to verify that
all ${}_{N}C_{n}$ possible samples have an equal chance.
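The verification mentioned above can be carried out by brute force for a small population. The sketch below assumes the draw is made unit by unit without replacement, each remaining number being equally likely at every stage; the function name `sample_probabilities` is illustrative.

```python
from fractions import Fraction
from itertools import permutations
from math import comb

def sample_probabilities(N: int, n: int) -> dict:
    """Exact probability of each unordered sample under a unit-by-unit draw."""
    # Probability of one particular ordered sequence of n draws:
    # 1/N for the first draw, 1/(N-1) for the second, and so on.
    p_sequence = Fraction(1)
    for k in range(n):
        p_sequence *= Fraction(1, N - k)

    probs = {}
    for sequence in permutations(range(1, N + 1), n):
        key = frozenset(sequence)          # order of drawing is ignored
        probs[key] = probs.get(key, Fraction(0)) + p_sequence
    return probs

probs = sample_probabilities(N=5, n=3)
print(len(probs), set(probs.values()))     # number of samples and their probabilities
```

Every sample receives the same probability, the reciprocal of the number of possible samples.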
When a number has been drawn from the bowl, it is not replaced,
since this might allow the same unit to enter the sample more than
once. For this reason the sampling is described as without replacement.
Similarly, if a table of random numbers is employed, a number that
has been drawn previously is ignored. Sampling with replacement is
entirely feasible, but except in special circumstances is seldom used,
since there seems little point in having the same unit twice in the
sample.
Other methods of sampling are often preferable to simple random
sampling on the grounds of convenience or of increased precision.
Simple random sampling serves best to introduce sampling theory.
Mean: $\bar{Y} = \dfrac{y_1 + y_2 + \cdots + y_N}{N} = \dfrac{Y}{N}$; $\bar{y} = \dfrac{y_1 + y_2 + \cdots + y_n}{n} = \dfrac{y}{n}$ (2.2)
The second is the average value $\bar{Y}$ per unit (e.g. the average number
of acres of wheat per farm). The third is the proportion or percentage
of units which fall into some defined class (e.g. the percentage of
farms growing no wheat). Estimation of the population total and
mean will be considered in this chapter.
The symbol ^ is used to denote an estimate of a population characteristic
made from a sample. In this chapter only the simplest types
of estimate are considered, as follows:
Estimate
Population mean: $\hat{\bar{Y}} = \bar{y}$ = sample mean
Population total: $\hat{Y} = N\bar{y} = Ny/n$
There is more than one way in which this definition can be adapted so
as to apply to a finite population, and the definition which we have
given may not be the most useful one for studying the properties of
estimates in large samples. However, the idea of consistency does not
play an important part in the subsequent exposition.
As we have seen, a method of estimation is unbiased if the average
value of the estimate, taken over all possible samples of given size n,
is exactly equal to the true population value. If the method is to be
unbiased without qualification, this result must hold for any population
of finite values $y_i$ and for any n. To investigate whether $\bar{y}$ is unbiased
with simple random sampling, we calculate the value of $\bar{y}$ for
all ${}_{N}C_{n}$ samples and find the average of the estimates. The symbol
E denotes this average over all possible samples.
$$E\bar{y} = \frac{\sum \bar{y}}{{}_{N}C_{n}} = \frac{\sum (y_1 + y_2 + \cdots + y_n)}{n \, \dfrac{N!}{n!\,(N - n)!}} \qquad (2.3)$$
where the sum extends over all ${}_{N}C_{n}$ samples. To evaluate this sum,
we find out in how many samples any specific value $y_i$ appears. Since
there are (N - 1) other units available for the rest of the sample and
(n - 1) other places to fill in the sample, the number of samples containing
$y_i$ is
$${}_{(N-1)}C_{(n-1)} = \frac{(N - 1)!}{(n - 1)!\,(N - n)!}$$
Hence
$$\sum (y_1 + y_2 + \cdots + y_n) = \frac{(N - 1)!}{(n - 1)!\,(N - n)!}\,(y_1 + y_2 + \cdots + y_N)$$
From (2.3) this gives
$$E\bar{y} = \frac{(N - 1)!}{(n - 1)!\,(N - n)!} \cdot \frac{n!\,(N - n)!}{n\,N!}\,(y_1 + y_2 + \cdots + y_N) = \frac{y_1 + y_2 + \cdots + y_N}{N} = \bar{Y}$$
V~
IVA ' \ . (2.10)
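$$n^2 E(\bar{y} - \bar{Y})^2 = \frac{n}{N}\Bigl[(y_1 - \bar{Y})^2 + \cdots + (y_N - \bar{Y})^2 + \frac{2(n - 1)}{(N - 1)}\bigl\{(y_1 - \bar{Y})(y_2 - \bar{Y}) + \cdots + (y_{N-1} - \bar{Y})(y_N - \bar{Y})\bigr\}\Bigr]$$
Completing the square on the cross-product term, we have
$$n^2 E(\bar{y} - \bar{Y})^2 = \frac{n}{N}\Bigl[\Bigl(1 - \frac{n - 1}{N - 1}\Bigr)\bigl\{(y_1 - \bar{Y})^2 + \cdots + (y_N - \bar{Y})^2\bigr\} + \frac{(n - 1)}{(N - 1)}\bigl\{(y_1 - \bar{Y}) + \cdots + (y_N - \bar{Y})\bigr\}^2\Bigr]$$
The second term inside the square bracket vanishes, since the sum of
the $y_i$ equals $N\bar{Y}$. Division by $n^2$ gives
$$V(\bar{y}) = E(\bar{y} - \bar{Y})^2 = \frac{(N - n)}{nN(N - 1)} \sum_{i=1}^{N} (y_i - \bar{Y})^2 = \frac{S^2}{n}\,\frac{(N - n)}{N}$$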
Corollary 1 The standard error of $\bar{y}$ is
$$\sigma_{\bar{y}} = \frac{S}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \qquad (2.12)$$
This theorem reduces to theorem 2.2 if the variates $y_i$, $x_i$ are equal on
every unit.
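$$E(\bar{u} - \bar{U})^2 = \frac{(N - n)}{nN}\,\frac{1}{(N - 1)} \sum_{i=1}^{N} (u_i - \bar{U})^2$$
i.e.
$$E\bigl[(\bar{y} - \bar{Y}) + (\bar{x} - \bar{X})\bigr]^2 = \frac{(N - n)}{nN}\,\frac{1}{(N - 1)} \sum_{i=1}^{N} \bigl[(y_i - \bar{Y}) + (x_i - \bar{X})\bigr]^2 \qquad (2.16)$$
By theorem 2.2,
$$E(\bar{y} - \bar{Y})^2 = \frac{(N - n)}{nN}\,\frac{1}{(N - 1)} \sum_{i=1}^{N} (y_i - \bar{Y})^2$$
with a similar relation for $E(\bar{x} - \bar{X})^2$. Hence these two terms cancel
on the left and right sides of equation (2.16). The result of the
theorem, equation (2.15), follows from the cross-product terms.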
$$S^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{Y})^2}{N - 1}$$
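$$s^2 = \frac{1}{(n - 1)}\Bigl[\sum_{i=1}^{n} (y_i - \bar{Y})^2 - n(\bar{y} - \bar{Y})^2\Bigr]$$
Now average over all simple random samples of size n. By the argument
of symmetry used in theorem 2.2,
$$E\Bigl\{\sum_{i=1}^{n} (y_i - \bar{Y})^2\Bigr\} = \frac{n}{N} \sum_{i=1}^{N} (y_i - \bar{Y})^2 = \frac{n(N - 1)}{N}\,S^2$$
by the definition of $S^2$. Further, by theorem 2.2,
$$E\bigl\{n(\bar{y} - \bar{Y})^2\bigr\} = \frac{(N - n)}{N}\,S^2$$
Hence
$$E(s^2) = \frac{S^2}{(n - 1)N}\bigl\{n(N - 1) - (N - n)\bigr\} = S^2 \qquad (2.17)$$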
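The results of this chapter so far (the sample mean is unbiased, its variance is $(S^2/n)(N - n)/N$, and $s^2$ is an unbiased estimate of $S^2$) can be confirmed by enumerating every sample from a small population. The sketch below uses exact fractions; the population values and the function name `srs_check` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from itertools import combinations

def srs_check(population, n):
    """Enumerate every simple random sample of size n (n >= 2) and return
    exact averages over all samples, together with the population values."""
    N = len(population)
    Y_bar = Fraction(sum(population), N)
    S2 = sum((y - Y_bar) ** 2 for y in population) / (N - 1)

    means, s2_values = [], []
    for sample in combinations(population, n):
        y_bar = Fraction(sum(sample), n)
        means.append(y_bar)
        s2_values.append(sum((y - y_bar) ** 2 for y in sample) / (n - 1))

    count = len(means)
    return {
        "Y_bar": Y_bar,
        "S2": S2,
        "E_y_bar": sum(means) / count,                          # average of sample means
        "V_y_bar": sum((m - Y_bar) ** 2 for m in means) / count,  # variance of sample means
        "E_s2": sum(s2_values) / count,                         # average of sample variances
    }

population = [2, 5, 9, 4, 10]   # hypothetical values, not from the text
result = srs_check(population, n=3)
for name, value in result.items():
    print(name, value)
```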
These estimates are slightly biased : for most applications the bias is
unimportant.
The reader should note the symbols employed for true and estimated
variances of the estimates. Thus for $\bar{y}$ we write
True variance: $V(\bar{y}) = \sigma_{\bar{y}}^2$
Estimated variance: $v(\bar{y}) = s_{\bar{y}}^2$
The notation is a little redundant, but it is convenient to have
separate symbols V and v for variance, and $\sigma$ and s for standard error.
Mean:
$$\hat{\bar{Y}}_L = \bar{y} - \frac{ts}{\sqrt{n}} \sqrt{\frac{N - n}{N}}, \qquad \hat{\bar{Y}}_U = \bar{y} + \frac{ts}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \qquad (2.21)$$
Total:
$$\hat{Y}_L = N\bar{y} - \frac{tNs}{\sqrt{n}} \sqrt{\frac{N - n}{N}}, \qquad \hat{Y}_U = N\bar{y} + \frac{tNs}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \qquad (2.22)$$
If the sample size is less than 30, the percentage points may be taken
from Student's t-table with (n - 1) degrees of freedom, these being
the degrees of freedom in the estimated variance $s^2$. (The t-distribution
holds exactly only if the observations $y_i$ are themselves normally
distributed and N is infinite.) Moderate departures from normality do
not affect it greatly. For small samples with very skew distributions,
special methods are needed.
Example. Signatures to a petition were collected on 676 sheets.
Each sheet had enough space for 42 signatures, but on many sheets a
smaller number of signatures had been collected. The numbers of
signatures per sheet were counted on a random sample of 50 sheets
(about a 7 per cent sample), with the results shown in table 2.1.
Estimate the total number of signatures to the petition and the 80
per cent confidence limits.
The sampling unit is a sheet, and the observations $y_i$ are the numbers
of signatures per sheet. Since about half the sheets had the maximum
number of signatures, 42, the data are presented as a frequency
distribution. Notice that the original distribution appears to be far
from normal, the greatest frequency being at the upper end. Nevertheless
there is reason to believe from experience that the means of
samples of 50 are approximately normally distributed.
We find
$n = \sum f_i = 50$; $y = \sum f_i y_i = 1471$; $\sum f_i y_i^2 = 54{,}497$
$$s^2 = \frac{1}{(n - 1)} \sum f_i (y_i - \bar{y})^2 = \frac{1}{(n - 1)}\Bigl\{\sum f_i y_i^2 - \frac{(\sum f_i y_i)^2}{\sum f_i}\Bigr\} = \frac{1}{49}\Bigl\{54497 - \frac{(1471)^2}{50}\Bigr\} = 229.0$$
From equation (2.22) the 80 per cent confidence limits are
$$19{,}888 \pm \frac{tNs}{\sqrt{n}} \sqrt{\frac{N - n}{N}} = 19{,}888 \pm \frac{(1.28)(676)(15.13)\sqrt{1 - 0.0740}}{\sqrt{50}} = 19{,}888 \pm 1781$$
This gives 18,107 and 21,669 for the 80 per cent limits. A complete
count which was made showed 21,045 signatures.
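The arithmetic of this example can be retraced from the three sample sums alone. The sketch below is an illustration only; the variable names are not from the text, and the normal deviate 1.28 for 80 per cent limits is the value used in the example.

```python
import math

N = 676            # sheets in the petition
n = 50             # sheets sampled
sum_fy = 1471      # total signatures on the sampled sheets
sum_fy2 = 54497    # sum of squared counts on the sampled sheets
t = 1.28           # normal deviate used in the example for 80 per cent limits

y_bar = sum_fy / n
total_hat = N * y_bar                                # estimated total
s2 = (sum_fy2 - sum_fy ** 2 / n) / (n - 1)           # sample variance
s = math.sqrt(s2)
half_width = t * N * s / math.sqrt(n) * math.sqrt((N - n) / N)

print(f"estimated total = {total_hat:.0f}")
print(f"s^2 = {s2:.1f}, s = {s:.2f}")
print(f"limits = {total_hat - half_width:.0f} to {total_hat + half_width:.0f}")
```

Carried at full precision the half-width comes to about 1782 rather than 1781; the small difference reflects rounding of intermediate values in the hand calculation.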
FIGURE 2.1 Frequency distribution of sizes of 196 United States cities in 1920
(horizontal axis: city size in thousands).
FIGURE 2.2 Frequency distribution of totals of 200 simple random samples with
n = 49 (horizontal axis: millions).
,,836.7
E (y,) - r - - - - 1.50486
556
G ". 15.411
I - ;; - 7.925 - 1.9
2.10 Exercises.
2.1 In a population with N = 6, the values of $y_i$ are 8, 3, 1, 11, 4, and 7.
Calculate the sample mean $\bar{y}$ for all possible simple random samples of size 2.
Verify that $\bar{y}$ is an unbiased estimate of $\bar{Y}$ and that its variance is as given in
theorem 2.2.
2.2 For the same population, calculate $s^2$ for all simple random samples of
size 3, and verify that $E(s^2) = S^2$.
2.3 If random samples of size 2 are drawn with replacement from this
population, show by finding all possible samples that $V(\bar{y})$ satisfies the equation
$$V(\bar{y}) = \frac{\sigma^2}{n} = \frac{S^2}{n}\,\frac{(N - 1)}{N}$$
Give a general proof of this result.
2.4 A simple random sample of 30 households was drawn from a city area
containing 14,848 households. The numbers of persons per household in the
sample were as follows:
5, 6, 3, 3, 2, 3, 3, 3, 4, 4, 3, 2, 7, 4, 3, 5, 4, 4, 3, 3, 4, 3, 3, 1, 2, 4, 3, 4, 2, 4
Estimate the total number of people in the area and compute the probability
that this estimate is within ±10 per cent of the true value.
2.5 The table below shows the numbers of inhabitants in each of the 197
United States cities which had populations over 50,000 in 1940. Calculate
the standard error of the estimated total number of inhabitants in all 197
cities for the following methods of sampling: (i) a simple random sample of
size 50, (ii) a sample which includes the 5 largest cities and is a simple random
sample of size 45 from the remaining 192 cities, (iii) a sample which includes
the 9 largest cities and is a simple random sample of size 41 from the remaining
cities.
FREQUENCY DISTRIBUTION OF CITY SIZES
2.7 With certain populations it is known that the observations $y_i$ are all
zero on a portion qN of the N units (0 < q < 1). Sometimes, with varying
expenditures of effort, these units can be found and listed, so that they need
not be sampled. If $\sigma^2$ is the variance of $y_i$ in the original population, and $\sigma_0^2$
is the variance when all zeros are excluded, show that
$$\sigma_0^2 = \frac{\sigma^2}{p} - \frac{q}{p^2}\,\bar{Y}^2$$
where p = 1 - q.
If the population total is estimated from a simple random sample of size
n, show that with the exclusion of the "zero" units the fractional reduction
in the variance of the estimate is
$$\frac{q(V^2 + 1)}{V^2}$$
where $V^2 = \sigma^2/\bar{Y}^2$ is the square of the coefficient of variation in the original
population. (For further discussion of this technique, see Jessen and Houseman,
1944.) The fpc may be omitted.
2.11 References.
CORNFIELD, J. (1944). On samples from finite populations. Jour. Amer. Stat. Assoc.,
39, 236-239.
FELLER, W. (1950). An introduction to probability theory and its applications. John
Wiley & Sons, New York.
FISHER, R. A. (1932). Statistical methods for research workers. Oliver and Boyd,
Edinburgh, 4th ed.
JESSEN, R. J., and HOUSEMAN, E. E. (1944). Statistical investigations of farm
sample surveys taken in Iowa, Florida and California. Iowa Agr. Exp. Sta. Res.
Bull. 329.
MADOW, W. G. (1948). On the limiting distributions of estimates based on samples
from finite universes. Ann. Math. Stat., 19, 535-545.
MOLINA, E. C. (1949). Poisson's exponential binomial limit. D. Van Nostrand
Co., New York.
TUKEY, J. W. (1950). Some sampling simplified. Jour. Amer. Stat. Assoc., 45,
501-519.
WEST, Q. M. (1951). The results of applying a simple random sampling process to
farm management data. Agricultural Experiment Station, Cornell University.
WISHART, J. (1952). Moment-coefficients of the k-statistics in samples from a
finite population. Biometrika, 39, 1-13.
CHAPTER 3
$$\bar{Y} = \frac{\sum y_i}{N} = \frac{A}{N} = P \qquad (3.2)$$
$$S^2 = \frac{1}{(N - 1)}(NP - NP^2) = \frac{N}{(N - 1)}\,PQ \qquad (3.4)$$
where Q = 1 - P. Similarly
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1} = \frac{n}{(n - 1)}\,pq \qquad (3.5)$$
Theorem 3.1 The sample proportion p = a/n is an unbiased estimate
of the population proportion P = A/N.
Theorem 3.2 The variance of p is
$$V(p) = \frac{PQ}{n}\,\frac{(N - n)}{(N - 1)}$$
$$s_{Np} = N \sqrt{\frac{pq}{n}} = (3042) \sqrt{\frac{(0.19)(0.81)}{200}} = 84.4$$
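$$\frac{\sigma_{Np}}{NP} = \frac{N\sqrt{PQ}}{\sqrt{n}\,NP} \sqrt{\frac{N - n}{N - 1}} = \frac{1}{\sqrt{n}} \sqrt{\frac{Q}{P}} \sqrt{\frac{N - n}{N - 1}} \qquad (3.12)$$
This quantity is usually called the coefficient of variation of the estimate.
If the fpc is ignored, the coefficient is $\sqrt{Q/nP}$. The ratio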
P (%)     0      0.1    0.5    1      5      10     20
√(Q/P)    ∞      31.6   14.1   9.9    4.4    3.0    2.0
P (%)     30     40     50     60     70     80     90
√(Q/P)    1.5    1.2    1.0    0.8    0.7    0.5    0.3
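The entries in this table follow directly from the definition. The short sketch below recomputes them, assuming P is expressed as a percentage so that Q = 100 - P; the function name is illustrative.

```python
import math

def sqrt_q_over_p(p_percent: float) -> float:
    """Return sqrt(Q/P), with P in per cent and Q = 100 - P."""
    return math.sqrt((100.0 - p_percent) / p_percent)

for p_percent in (0.1, 0.5, 1, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90):
    print(f"P = {p_percent:>4} per cent   sqrt(Q/P) = {sqrt_q_over_p(p_percent):.1f}")
```

The ratio grows without limit as P approaches zero, which is why the first entry of the table is infinite.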
A = 3; A' = 5; N = 8; n = 4
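From formula (3.17) the distribution of the number of males, a, is
as follows:
a    Probability
0    $\dfrac{4!}{0!\,4!} \cdot \dfrac{5 \cdot 4 \cdot 3 \cdot 2}{8 \cdot 7 \cdot 6 \cdot 5} = \dfrac{1}{14}$
1    $\dfrac{4!}{1!\,3!} \cdot \dfrac{3 \cdot 5 \cdot 4 \cdot 3}{8 \cdot 7 \cdot 6 \cdot 5} = \dfrac{6}{14}$
2    $\dfrac{4!}{2!\,2!} \cdot \dfrac{3 \cdot 2 \cdot 5 \cdot 4}{8 \cdot 7 \cdot 6 \cdot 5} = \dfrac{6}{14}$
3    $\dfrac{4!}{3!\,1!} \cdot \dfrac{3 \cdot 2 \cdot 1 \cdot 5}{8 \cdot 7 \cdot 6 \cdot 5} = \dfrac{1}{14}$
4    Impossible = 0
The reader may verify that the mean number of males is 3/2 and the
variance is 15/28. These results agree with the formulas previously
established, section 3.2, which give
$$E(np) = nP = \frac{nA}{N} = \frac{(4)(3)}{8} = \frac{3}{2}$$
$$V(np) = nPQ\,\frac{N - n}{N - 1} = 4 \cdot \frac{3}{8} \cdot \frac{5}{8} \cdot \frac{4}{7} = \frac{15}{28}$$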
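The distribution above, together with its mean and variance, can be recomputed exactly. The sketch below writes the hypergeometric probability in terms of binomial coefficients, an equivalent form of the products shown in the table; the function name `hypergeom_pmf` is illustrative.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(a: int, A: int, N: int, n: int) -> Fraction:
    """Probability that a sample of n from N contains a of the A units in class C."""
    # math.comb returns 0 when more units are requested than exist,
    # so impossible outcomes receive probability 0.
    return Fraction(comb(A, a) * comb(N - A, n - a), comb(N, n))

A, N, n = 3, 8, 4
pmf = {a: hypergeom_pmf(a, A, N, n) for a in range(n + 1)}
mean = sum(a * p for a, p in pmf.items())
variance = sum((a - mean) ** 2 * p for a, p in pmf.items())

for a, p in pmf.items():
    print(a, p)
print("mean =", mean, " variance =", variance)
```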
$$\sigma_p = \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{PQ}{n}}$$
If it is assumed that p is normally distributed, with estimated standard
error
$$s_p' = \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{pq}{n}}$$
we obtain, as a normal approximation to the confidence limits,
$$\hat{P}_L = p - t \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{pq}{n}}, \qquad \hat{P}_U = p + t \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{pq}{n}}$$
where t is the normal deviate corresponding to the confidence probability.*
It is worth while to amend these formulas by inserting a correction
for continuity, whenever this correction has an appreciable effect.
The rationale of the correction may be explained as follows. Suppose
that 30 units out of 70 are observed to fall in class C, and we wish to
approximate $\hat{P}_U$. Using the exact distribution, we would find $\hat{P}_U$
such that the sum of the probabilities that 0, 1, ..., 30 units fall in
C is $\alpha_U$. If the exact distribution is to be approximated by a continuous
normal distribution, it is natural to regard the ordinate at 30
as corresponding to the area of the normal curve between 29½ and 30½.
Thus the sum of terms from 0 to 30 corresponds to the area of the
normal curve below the point 30½. The effect of the correction is to
* In theorem 3.3 it was shown that an unbiased sample estimate of $\sigma_p^2$ is
$$s_p^2 = \frac{(N - n)\,pq}{N(n - 1)}$$
In estimating $\hat{P}_L$ and $\hat{P}_U$, $s_p$ might have been used for the standard error of p
instead of $s_p'$. However, $s_p'$ was preferred because it is more familiar, and both
estimates appear to give about equally good approximations.
$$p \pm \Bigl\{ t \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{pq}{n}} + \frac{1}{2n} \Bigr\} \qquad (3.21)$$
p       np = number observed in the smaller class    n = sample size
0.5     15                                           30
0.4     20                                           50
0.3     24                                           80
0.2     40                                           200
0.1     60                                           600
0.05    70                                           1400
~0*     80                                           ∞
* This means that p is extremely small, so that np follows the Poisson distribution.
The rules in table 3.3 are constructed so that with 95 per cent confi-
dence limits the true frequency with which the limits fail to enclose
P is not greater than 5.5 per cent. Further, the probability that the
upper limit is below P is between 2.5 and 3.5 per cent, and the proba-
bility that the lower limit exceeds P is between 2.5 and 1.5 per cent.
These restrictions on the one-tailed frequencies of error seemed ad-
visable because the binomial distribution is in general skew (see sec-
tion 2.8). The rules are not guaranteed to satisfy these probability
statements in all cases, since exhaustive examination is lengthy, but
I believe that the statements are generally true. The choice of 5.5,
3.5, and 1.5 per cent for the probabilities is of course arbitrary. The
reader who is content with greater error in the normal approximation
can allow lower values of n.
When the situation lies outside the range of validity of the normal
approximation, or when greater accuracy is desired, reference may be
made to charts of the confidence limits of the hypergeometric function
by Chung and DeLury (1950). These give the 90, 95, and 99 per
cent limits for P for population sizes of 500, 2500, and 10,000. Values
for intermediate population sizes may be obtained by interpolation.
An alternative when the normal approximation does not apply is
to use the limits for the binomial distribution, adjusted if necessary
so as to take account of the finite population correction. For n less
than 50, the binomial limits are quickly found from Tables of the binomial
frequency distribution (U. S. National Bureau of Standards).
A convenient table of the limits themselves, constructed by W. L.
Stevens, is given in Fisher and Yates's Statistical tables, 3rd ed., table
VIII, 1. The limits presented by Stevens are those for nP, since this
quantity is more amenable to compact tabulation than P itself. The
method of amending these limits so as to allow for the finite population
correction is illustrated in example 2 below.
Example 1. In a simple random sample of size 100, from a population
of size 500, there are 37 units in class C. Find the 95 per cent
confidence limits for the proportion and for the total number in class
C in the population. In this example,
n = 100; N = 500; p = 0.37
The example lies in the range in which the normal approximation is
recommended. The estimated standard error of p is
$$\sqrt{\frac{(N - n)}{(N - 1)}\,\frac{pq}{n}} = \sqrt{\frac{400}{499}\,\frac{(0.37)(0.63)}{100}} = 0.0432$$
The correction for continuity, 1/2n, equals 0.005. Hence the 95 per
cent limits for P are estimated as
$$p \pm \Bigl( t \sqrt{\frac{(N - n)}{(N - 1)}\,\frac{pq}{n}} + \frac{1}{2n} \Bigr) = 0.37 \pm (1.96 \times 0.0432 + 0.005) = 0.37 \pm 0.090$$
$$\hat{P}_L = 0.280; \qquad \hat{P}_U = 0.460$$
The limits as read from the charts by Chung and DeLury are 0.285
and 0.462, respectively.
To find limits for the total number in class C in the population, we
multiply by N, obtaining 140 and 230, respectively.
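The calculation in example 1 can be written as a small function. This is a sketch of the normal approximation with the fpc and the continuity correction as used in the example; the function name `normal_limits` and its arguments are illustrative.

```python
import math

def normal_limits(a: int, n: int, N: int, t: float = 1.96):
    """Normal-approximation limits for P with fpc and continuity correction.

    a is the number of sample units in class C.
    Returns (lower limit, upper limit, estimated standard error).
    """
    p = a / n
    q = 1.0 - p
    se = math.sqrt((N - n) / (N - 1) * p * q / n)
    half_width = t * se + 1.0 / (2 * n)     # 1/(2n) is the continuity correction
    return p - half_width, p + half_width, se

lower, upper, se = normal_limits(a=37, n=100, N=500)
print(f"standard error = {se:.4f}")
print(f"limits for P   = {lower:.3f}, {upper:.3f}")
print(f"limits for the total = {500 * lower:.0f}, {500 * upper:.0f}")
```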
Example 2. This example shows how binomial limits may be used
as an approximation. Suppose that for another item in the previous
sample 9 units out of the 100 fall in class C. This is outside the range
for the normal approximation.
The 95 per cent binomial limits for the expected number nP in
class C are read from Fisher and Yates's Statistical tables, table VIII,
1, as 4.20 and 16.40. Dividing by n, we obtain approximate limits of
0.042 and 0.164 for P. If the sampling ratio is less than 5 per cent,
limits found in this way are close enough for most purposes. In this
sample, the sampling ratio is 20 per cent, so that the fpc should be
applied. The fpc factor is
$$\sqrt{\frac{400}{499}} = 0.895$$
To apply the correction, we shorten the interval between p and each
limit by this factor. The adjusted limits are as follows:
$$\hat{P}_L = 0.09 - (0.895)(0.09 - 0.042) = 0.047$$
$$\hat{P}_U = 0.09 + (0.895)(0.164 - 0.09) = 0.156$$
The limits obtained from the tables by Chung and DeLury are 0.045
and 0.157, respectively.
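The adjustment in example 2 amounts to shrinking each side of the interval by the fpc factor. The sketch below takes the tabulated binomial limits as given inputs and does not compute them; the function name `apply_fpc` is illustrative.

```python
import math

def apply_fpc(p: float, lower: float, upper: float, N: int, n: int):
    """Shorten the distance from p to each binomial limit by the fpc factor."""
    factor = math.sqrt((N - n) / (N - 1))
    return p - factor * (p - lower), p + factor * (upper - p), factor

# Binomial limits 0.042 and 0.164 are the tabulated values quoted in the example.
adj_lower, adj_upper, factor = apply_fpc(p=0.09, lower=0.042, upper=0.164, N=500, n=100)
print(f"fpc factor = {factor:.3f}")
print(f"adjusted limits = {adj_lower:.3f}, {adj_upper:.3f}")
```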
When the sample size n is small relative to all the $A_i$, the probabilities
$P_i$ may be considered effectively constant throughout the drawing
of the sample. The probability of drawing the observed sample is
given by the multinomial expression
$$\Pr(a_i) = \frac{n!}{a_1!\,a_2!\,a_3!}\,P_1^{a_1} P_2^{a_2} P_3^{a_3} \qquad (3.22)$$
3.8 Confidence limits when there are more than two classes. Two
different cases must be distinguished.
Case I. We calculate
$$p = \frac{\text{Number in any one class in sample}}{n} = \frac{a_1}{n}$$
or
$$p = \frac{\text{Total number in a group of classes}}{n} = \frac{a_1 + a_2 + a_3}{n}, \text{ say}$$
For example, if the answers are classified into "yes," "no," "don't
know," and "no answer," we might take p as the proportion in the
sample answering "yes," or alternatively as the proportion in the
sample giving a definite answer, either "yes" or "no." In either of
these situations, although the original classification contains more
than two classes, p itself is obtained from a subdivision of the n units
into only two classes. The theory already presented applies to this
case. Confidence limits are calculated as described in section 3.6.
Case II. Sometimes certain classes are omitted, p being computed
from a breakdown of the remaining classes into two parts. For ex-
ample, we might omit persons who did not know or gave no answer,
and consider the ratio of number of "yes" answers to "yes" plus "no"
answers. Ratios which are structurally of this type are often of interest
in sample surveys. The denominator of such a ratio is not n,
but some smaller number n'.
The frequency distribution of p is more complicated than in Case I,
because both the numerator and denominator of p vary from one
sample to another, even although all samples have the same total
size n. This presents an obstacle to the calculation of confidence
limits. Most of the complications can be avoided by the device, common
in statistical theory, of calculating confidence limits from the
conditional distribution of p, given n and n'. In this method we construct
confidence statements which will be true, with the assigned
confidence probability, over all samples which have the same n and
n' as the observed sample.
The reason why this device helps is that the conditional distribution
of p is obtained from an ordinary hypergeometric distribution,
as will be shown in the next section. To give the result more precisely,
suppose that
$$p = \frac{a_1}{a_1 + a_2}; \qquad n' = a_1 + a_2; \qquad n = a_1 + a_2 + a_3$$
so that $a_3$ is the number in the sample falling in classes in which we
are not at the moment interested. Then the conditional distribution
of $a_1$ and $a_2$ is the hypergeometric distribution when the sample is
of size n' and the population of size $N' = A_1 + A_2$. In particular,
the conditional standard error of p is
$$\sigma_p = \sqrt{\frac{N' - n'}{N' - 1}} \sqrt{\frac{PQ}{n'}}$$
The normal approximations to conditional confidence limits for
$P = A_1/(A_1 + A_2)$ are
$$p \pm \Bigl( t \sqrt{\frac{N' - n'}{N' - 1}} \sqrt{\frac{pq}{n'}} + \frac{1}{2n'} \Bigr) \qquad (3.24)$$
$$\hat{P}_L = p - t \sqrt{\frac{N - n}{N - (n/n')}} \sqrt{\frac{pq}{n'}} - \frac{1}{2n'}$$
$$\hat{P}_U = p + t \sqrt{\frac{N - n}{N - (n/n')}} \sqrt{\frac{pq}{n'}} + \frac{1}{2n'} \qquad (3.25)$$
The fpc is the only term affected by the approximation used for N'.
(3.26)
(3.28)
Sample                    a_1    a_2    p      Conditional probability    (p - P)
ABD, ABE, ACD, or ACE     1      1      1/2    2/3                        +1/6
BCD or BCE                0      2      0      1/3                        -1/3
The estimate is again unbiased, and its variance is
$$\frac{2}{3}\Bigl(\frac{1}{6}\Bigr)^2 + \frac{1}{3}\Bigl(\frac{1}{3}\Bigr)^2 = \frac{1}{18}$$
which may be verified from the general formula. Note that the variance
is only one-fourth of that obtained when n' = 1. In a conditional
approach, the variance changes with the configuration of the
sample that was drawn.
For n' = 3, there is only one possible sample, ABC. This gives the
correct population fraction, 1/3. The conditional variance of p is zero,
as is indicated by the general formula, which reduces to zero when
N' = n'.
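The conditional behaviour described here can be explored by enumeration. The sketch below rests on an assumption about the example, part of which is missing from this copy: unit A is taken to be in the class of interest, B and C in the second class, and D and E in classes that are set aside, with samples of size 3. The dictionary `classes` and the function name are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

# Assumed class membership (a reconstruction, not stated in the surviving text):
# 1 = class of interest, 2 = second class, 3 = classes that are omitted.
classes = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 3}

def conditional_moments(classes: dict, n: int) -> dict:
    """Mean and variance of p = a1/(a1 + a2), conditional on n' = a1 + a2."""
    groups = defaultdict(list)
    for sample in combinations(sorted(classes), n):
        a1 = sum(classes[u] == 1 for u in sample)
        a2 = sum(classes[u] == 2 for u in sample)
        if a1 + a2 > 0:
            groups[a1 + a2].append(Fraction(a1, a1 + a2))
    moments = {}
    for n_prime, values in groups.items():
        mean = sum(values) / len(values)      # samples are equally likely
        variance = sum((p - mean) ** 2 for p in values) / len(values)
        moments[n_prime] = (mean, variance)
    return moments

moments = conditional_moments(classes, n=3)
P, N_prime = Fraction(1, 3), 3
for n_prime in sorted(moments):
    mean, variance = moments[n_prime]
    formula = Fraction(N_prime - n_prime, N_prime - 1) * P * (1 - P) / n_prime
    print(n_prime, mean, variance, formula)
```

Under this assumption the enumerated conditional variance agrees with the general formula for each value of n', and the value for n' = 2 is one-fourth of that for n' = 1.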
3.10 Exercises.
3.1 For a population with N = 6, A = 4, A' = 2, work out the value of
a for all possible simple random samples of size 3. Verify the theorems given
for the mean and variance of p = a/n. Verify that
$$\frac{(N - n)}{(n - 1)N}\,pq$$
opinion. Estimate 95 per cent confidence limits for the number of colleges
in the population which favored the proposal.
3.3 Do the results of the previous sample furnish conclusive evidence that
the majority of the colleges in the population favored this proposal?
3.4 A population with N = 7 consists of the elements $B_1$, $C_1$, $C_2$, $C_3$, $D_1$,
$D_2$, and $D_3$. A simple random sample of size 4 is taken in order to estimate
the proportion of C's to C's + D's. Work out the conditional distributions
of this proportion, p, and verify the formula for its conditional variance.
3.5 In the previous exercise, what is the probability that a sample of size
4 contains Bl? Hence find the average variance of p over all simple random
samples of size 4, and verify your answer by the general formula.
3.6 A simple random sample of 290 households was chosen from a city
area containing 14,828 households. Each family was asked whether it owned
or rented the house and also whether it had the exclusive use of an indoor
toilet. Results were as follows :
Owned Rented
Exohlllve U8Il of toilet Yes No Yes No Total
141 6 109 84 290
(i) For families which rent, estimate the percentage in the area who have ex-
clusive use of an indoor toilet and give the standard error of your estimate ;
(ii) estimate the total number of renting families in the area who do not have
exolusive indoor toilet facilities and give the standard error of this estimate.
3.7 In a simple random sample of size 5 from a population of size 30, no
units in the sample were in class C. By the hypergeometric distribution, find
the upper limit to the number A of units in class C in the population,
corresponding to a one-tailed confidence probability of 95 per cent.
3.8 In sampling for an attribute that is rare, one method is to continue
drawing a simple random sample until m units which possess the rare attribute
have been found (Haldane, 1945), where m is decided in advance. If the fpc
can be ignored, show that the probability that the total sample required is
of size n is
$$\frac{(n - 1)!}{(m - 1)!\,(n - m)!}\,P^{m}Q^{n-m}$$
where P is the frequency of the rare attribute. Find the average size of the
total sample and show that p = (m - 1)/(n - 1) is an unbiased estimate of
P. (For further discussion see Finney, 1949.)
3.11 References.
CHUNG, J. H., and DELURY, D. B. (1950). Confidence limits for the hypergeometric
distribution. University of Toronto Press.
FINNEY, D. J. (1949). On a method of estimating frequencies. Biometrika, 36,
233-234.
FISHER, R. A., and YATES, F. (1948). Statistical tables for biological, agricultural
and medical research. Oliver and Boyd, Edinburgh, 3rd ed.
$$\sigma_p = \sqrt{\frac{PQ}{n}}$$
Hence, we may put
$$2\sqrt{\frac{PQ}{n}} = 5, \quad \text{or} \quad n = \frac{4PQ}{25}$$
At this point a difficulty appears which is common to all problems
in the estimation of sample size. A formula for n has been obtained,
but n depends on some property of the population which is to be sam-
pled. In this instance the property is the quantity P which we would
like to measure. We must therefore ask the anthropologist if he can
give us some idea of the likely value of P. He replies that from pre-
vious data on other ethnic groups, and from his speculations about
the racial history of this island, he will be surprised if P lies outside
the range 30 to 60 per cent.
This information is sufficient to provide a usable answer. For any
value of P between 30 and 60, the product PQ lies between 2100 and
a maximum of 2500 at P = 50. The corresponding n lies between 336
and 400. To be on the safe side, 400 is taken as the initial estimate
of n.
The assumptions made in this analysis can now be re-examined.
With n = 400 and a P between 30 and 60, the distribution of p should
be close to normal. Whether the fpc is required depends on the number
of people on the island. If the population exceeds 8000, the sampling
fraction is less than 5 per cent and no adjustment for fpc is called
for. The method of applying the readjustment, if it is needed, is
discussed in section 4.4.
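The arithmetic behind these figures can be checked with a short sketch. It assumes, as in the text, that P and Q are expressed in per cent, that twice the standard error is set equal to 5, and that the fpc is ignored; the function name `initial_sample_size` is purely illustrative.

```python
def initial_sample_size(P, margin=5.0, t=2.0):
    """Sample size from t * sqrt(P*Q/n) = margin, with P and Q in per cent.

    With t = 2 and margin = 5 this is n = 4PQ/25, the formula in the text.
    The fpc is ignored.
    """
    Q = 100.0 - P
    return t ** 2 * P * Q / margin ** 2


# Tabulate n over a few plausible values of P (per cent).
for P in (30, 40, 50):
    print(P, initial_sample_size(P))
```

The largest value occurs at P = 50, which is why 400 is the safe initial choice.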
4.4 The formula for n in sampling for proportions. The units are
classified into two classes, C and C'. Some margin of error d in the
estimated proportion p of units in class C has been agreed upon, as
well as a small risk $\alpha$ which we are willing to incur that the actual
error is larger than d. That is, we want
$$\Pr\{|p - P| \ge d\} = \alpha$$
THE ESTIMATION OF SAMPLE SIZE
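$$\sigma_p = \sqrt{\frac{N-n}{N-1}}\,\sqrt{\frac{PQ}{n}}$$
Hence the formula which connects n with the desired degree of precision is
$$d = t\,\sqrt{\frac{N-n}{N-1}}\,\sqrt{\frac{PQ}{n}}$$
where t is the abscissa of the normal curve which cuts off an area $\alpha$ at
the tails. Solving for n, we find
$$n = \frac{\dfrac{t^2PQ}{d^2}}{1 + \dfrac{1}{N}\left(\dfrac{t^2PQ}{d^2} - 1\right)} \tag{4.1}$$
With an advance estimate p substituted for P, and N large, a first
approximation is
$$n_0 = \frac{t^2pq}{d^2} \tag{4.2}$$
Writing
$$V = \frac{d^2}{t^2} = \text{Desired variance of the sample proportion}$$
we have
$$n_0 = \frac{pq}{V}$$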
In practice we first calculate $n_0$. If $n_0/N$ is negligible, $n_0$ is a
satisfactory approximation to the n of equation (4.1). If not, it is apparent
on comparison of (4.1) and (4.2) that n is obtained as
$$n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}} \doteq \frac{n_0}{1 + \dfrac{n_0}{N}} \tag{4.3}$$
THE FORMULA FOR n WITH CONTINUOUS DATA 55
Let us assume that there are only 3200 people on the island. The
fpc is needed, and we find
$$n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}} = \frac{400}{1 + \dfrac{399}{3200}} = 356$$
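As a check on this step, the correction can be written as a small function. This is a sketch only: the name `corrected_sample_size` is an assumption for illustration, and the final value is rounded up to the next whole unit.

```python
import math


def corrected_sample_size(n0, N):
    """Apply the fpc adjustment n = n0 / (1 + (n0 - 1)/N) to a first approximation n0."""
    return n0 / (1 + (n0 - 1) / N)


n_exact = corrected_sample_size(400, 3200)   # about 355.7
n_needed = math.ceil(n_exact)                # round up to a whole number of units
print(n_exact, n_needed)
```

For a large population the adjustment is immaterial: with N = 1,000,000 the same call leaves n within a fraction of a unit of 400.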
The formula for $n_0$ holds also if d, p, and q are all expressed as
percentages instead of proportions. Since the product pq increases as p
moves toward 1/2 or 50 per cent, a conservative estimate of n is obtained
by choosing for p the value nearest to 1/2 in the range in which p is
thought likely to lie. If p seems likely to lie between 5 and 9 per cent,
for instance, we assume 9 per cent for the estimation of n.
A good discussion of sample size for proportions, with a specific
application, is given by Cornfield (1951 ).
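$$\Pr\{|\bar{y} - \bar{Y}| \ge d\} = \alpha$$
This gives
$$d = t\,\sqrt{\frac{N-n}{N}}\,\frac{S}{\sqrt{n}} \tag{4.4}$$
Solving for n,
$$n = \frac{\left(\dfrac{tS}{d}\right)^2}{1 + \dfrac{1}{N}\left(\dfrac{tS}{d}\right)^2}$$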
(4.5)
fN--=-;8
d = t '\}}{ . Vn (4.7)
4.6 Sample size with more than one item. In most surveys, information
is collected on more than one item. One method of determining
sample size is to specify margins of error for those items that are
regarded as most vital to the survey. An estimation of the sample size
needed is first made separately for each of these important items.
When the single-item estimations of n have been completed, it is
time to take stock of the situation. It may happen that the n's re-
quired are all reasonably close. If the largest of the n's falls within
the limits of the budget, this n is selected. More commonly, there is
a sufficient variation among the n's required so that we are reluctant
to choose the largest, either from budgetary considerations or because
of the fact that this will give an overall standard of precision sub-
stantially higher than was originally contemplated. In this event the
desired standard of precision may perhaps be relaxed for certain of the
items, in order to permit the use of a smaller value of n.
In some cases the n's required for different items are so discordant
that certain items must be dropped from the inquiry, because with the
resources available the precision expected for these items is totally
inadequate. The difficulty may be not merely one of sample size.
Some items call for a different type of sampling from others. With
populations that are sampled repeatedly, it is useful to amass information
about those items which can be combined economically in a
general survey, and those which necessitate special methods. As an
example, a classification of items into four types, suggested by experi-
ence in regional agricultural surveys, is shown in table 4.1. In this
classification, a general survey means one in which the units are fairly
evenly distributed over some region, as for example by a simple ran-
dom sample.
STEIN'S METHOD OF TWO-STAGE SAMPLING 59
$$\Pr\{|\bar{y} - \bar{Y}| \ge d\} \le \alpha \tag{4.8}$$
Sketch of proof. The proof assumes that the observations, $y_1, y_2, \ldots, y_n$,
are normally distributed about $\bar{Y}$. Throughout the proof, d, $\alpha$, and
$n_1$ are fixed quantities. The total sample size n is not fixed, but is a
random variate, since its value depends on the value of s that turns up
in the first sample. Nevertheless, for fixed s, n is fixed, and the quantity
$$\sqrt{n}\,(\bar{y} - \bar{Y})$$
is normally distributed with mean zero and variance $\sigma^2$. Hence, this
quantity follows the normal distribution whether s is fixed or not.
Moreover, by a well-known property of the normal distribution, the
distribution is independent of that of s. Consequently,
$$\sqrt{n}\,(\bar{y} - \bar{Y})/s$$
follows the t-distribution with $(n_1 - 1)$ degrees of freedom. By definition
of $t_1$ it follows that
$$\Pr\left\{\left|\frac{\sqrt{n}\,(\bar{y} - \bar{Y})}{s}\right| \ge t_1\right\} = \alpha \tag{4.9}$$
This is the key result in the proof. Further, by the way in which the
value of n was calculated, we always have
$$\frac{\sqrt{n}}{s} \ge \frac{t_1}{d} \tag{4.10}$$
Hence, from (4.9)
$$\Pr\left\{\left|\frac{t_1(\bar{y} - \bar{Y})}{d}\right| \ge t_1\right\} \le \alpha$$
i.e.
$$\Pr\{|\bar{y} - \bar{Y}| \ge d\} \le \alpha$$
$$n = \left(\frac{tS}{d}\right)^2 = \left(\frac{2 \times 60}{10}\right)^2 = 144$$
where t is taken from the normal distribution. Suppose that nl is
taken as 50. This value gives a reasonably large number of degrees of
freedom for estimating S and does not commit us to too large an initial
sample in case S should turn out to be less than we feared.
4.8 THE GENERAL SAMPLE SIZE PR(,~l.EM 61
When the first sample is taken, $s^2$ is found to be 1938. Since $t_1$, for
49 df, is 2.01, we have
$$\frac{t_1s}{\sqrt{n_1}} = \frac{(2.01)(44.02)}{7.0711} = 12.51$$
so that the sample of 50 gives a confidence interval of half-width 12.51
instead of 10. Finally, n is chosen so that
$$n = \frac{t_1^2s^2}{d^2} = \frac{(4.040)(1938)}{100} = 78.3$$
That is, 29 additional observations are taken to make the total n = 79.
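The two-stage calculation in this example can be reproduced with a few lines. This is a sketch under the values quoted in the text ($n_1$ = 50, $s^2$ = 1938, $t_1$ = 2.01, d = 10); the function names are illustrative, and the t value is entered by hand rather than looked up.

```python
import math


def first_stage_half_width(s2, t1, n1):
    """Half-width of the confidence interval given by the first sample alone."""
    return t1 * math.sqrt(s2) / math.sqrt(n1)


def stein_total_sample_size(s2, t1, d, n1):
    """Total n: smallest integer with n >= t1^2 s^2 / d^2, never below n1."""
    return max(n1, math.ceil(t1 ** 2 * s2 / d ** 2))


n1, s2, t1, d = 50, 1938.0, 2.01, 10.0
half_width = first_stage_half_width(s2, t1, n1)
n_total = stein_total_sample_size(s2, t1, d, n1)
print(round(half_width, 2), n_total, n_total - n1)
```

If the first sample already gives a half-width no larger than d, the function returns $n_1$ and no further observations are called for.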
A more general form of this result is given by Yates (1949). The same
analysis applies to any method of sampling and estimation in which
the variance of the estimate is inversely proportional to n and the cost
is a linear function of n.
Blythe (1945) describes the application of this principle to the
estimation of the volume of timber in a lot for selling purposes (see
exercise 4.6) . Nordin (1944) discusses the optimum size of sample for
estimating potential sales in a market which a manufacturer intends
to enter. If the sales can be forecast accurately, the amount of fixed
equipment and the production per unit period can be allocated so as
to maximize the manufacturer's expected profit.
Although the application of this technique is likely to be restricted
to situations in which the sample is taken for a specific purpose, it
seems probable that this approach to the problem of sample size has
a number of fruitful applications which have not yet been realized.
A related problem is the sampling of lots of articles in a mass-pro-
duction process, in order to determine whether the lot is to be accepted
or rejected on the basis of its estimated quality. Since the purpose of
the sample is usually to lead to a single "yes" or "no" decision, the
best sample size can be decided by examining the consequences of
errors in the decision. Good introductory accounts of the techniques
are given by Tippett (1950) and Deming (1950) .
4.9 Exercises.
4.1 A survey is to be made of the prevalence of the common diseases in a
large population. For any disease that affects at least 1 per cent of the
individuals in the population, it is desired to estimate the total number of cases
with a coefficient of variation of not more than 20 per cent. (i) What size of
simple random sample is needed, assuming that the presence of the disease
can be recognized without mistakes? (ii) What size is needed if total cases
are wanted separately for males and females, with the same precision?
4.2 In a wireworm survey, the number of wireworms per acre is to be es-
timated with a limit of error of 30 per cent, at the 95 per cent probability level,
in any field in which wireworm density exceeds 200,000 per acre in the top 5
in. of soil. The sampling tool measures 9 X 9 X 5 in. deep. Assuming that
the number of wireworms in a single sample follows a distribution slightly
more variable than the Poisson, we take $S^2 = 1.2\bar{Y}$. What size of simple
random sample is needed? (1 acre = 43,560 sq ft.)
4.3 The following coefficients of variation per unit were obtained in a
farm survey in Iowa, the unit being an area 1 mile square (data of R. J. Jessen):
Item                          Estimated cv (%)
Acres in farms                       38
Acres in corn                        39
Acres in oats                        44
Number of family workers            100
Number of hired workers             110
Number of unemployed                317
A survey is planned to estimate acreage items with a cv of 21 per cent and
numbers of workers (excluding unemployed) with a cv of 5 per cent. With
simple random sampling, how many units are needed? How well would this
sample be expected to estimate the number of unemployed?
4.4 By experimental sampling, the mean value of a random variate is to
be obtained correct to 0.001, with confidence probability 95 per cent. The
values of the random variate for the first 20 samples drawn are shown below.
How many more samples are needed?
Sample    Value of          Sample    Value of
no.       random variate    no.       random variate
 1        .0725             11        .0712
 2        .0755             12        .0748
 3        .0759             13        .0878
 4        .0739             14        .0710
 5        .0732             15        .0754
 6        .0843             16        .0712
 7        .0721             17        .0757
 8        .0769             18        .0737
 9        .0730             19        .0704
10        .0727             20        .0723
4.5 If the loss function due to an error in $\bar{y}$ is $\lambda|\bar{y} - \bar{Y}|$ and if the cost
$C = c_0 + c_1n$, show that with simple random sampling, ignoring the fpc, the
most economical value of n is
4.6 (Adapted from Blythe, 1945.) The selling price of a lot of standing
timber is UW, where U is the price per unit volume and W is the volume of
timber on the lot. The number N of logs on the lot is counted, and the
average volume per log is estimated from a simple random sample of n logs. The
estimate is made and paid for by the seller and is provisionally accepted by
the buyer. Later, the buyer finds out the exact volume purchased, and the
seller reimburses him if he has paid for more than was delivered. If he has
paid for less than was delivered, the buyer does not mention the fact.
Construct the seller's loss function. Assuming that the cost of measuring
n logs is cn, find the optimum value of n. The standard deviation of the
volume per log may be denoted by S, and the fpc ignored.
4.10 References.
BLYTHE, R. H. (1945). The economics of sample size applied to the scaling of
sawlogs. Biom. Bull., 1, 67-70.
CORNFIELD, J. (1951). The determination of sample size. Amer. Jour. Pub.
Health, 41, 654-661.
DEMING, W. E. (1950). Some theory of sampling. John Wiley & Sons, New York.
JOHNSON, F. A. (1943). A statistical study of sampling methods for tree nursery
inventories. Jour. Forestry, 41, 674-679.
NORDIN, J. A. (1944). Determining sample size. Jour. Amer. Stat. Assoc., 39,
497-506.
STEIN, C. (1945). A two-sample test for a linear hypothesis whose power is
independent of the variance. Ann. Math. Stat., 16, 243-258.
TIPPETT, L. H. C. (1950). Technological applications of statistics. John Wiley &
Sons, New York.
YATES, F. (1949). Sampling methods for censuses and surveys. Charles Griffin
and Co., London.
CHAPTER 5
5.2 Notation. The suffix h denotes the stratum and i the unit within
the stratum. The notation is a natural extension of that previously
used. The following symbols all refer to stratum h:
$N_h$    Total number of units
$n_h$    Number of units in sample
$\bar{Y}_h = \dfrac{\sum_{i=1}^{N_h} y_{hi}}{N_h}$    True mean
$\bar{y}_h = \dfrac{\sum_{i=1}^{n_h} y_{hi}}{n_h}$    Sample mean
$S_h^2 = \dfrac{\sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)^2}{N_h - 1}$    True variance
Note that the divisor for the variance is $(N_h - 1)$.
5.3 Properties of the estimates. For the population mean per unit,
the simplest type of estimate appropriate to stratified sampling is $\bar{y}_{st}$
(st for stratified), where
$$\bar{y}_{st} = \frac{\sum_{h=1}^{L} N_h\bar{y}_h}{N} \tag{5.1}$$
The estimate $\bar{y}_{st}$ is not in general the same as the sample mean, for
the sample mean, $\bar{y}$, can be written as
$$\bar{y} = \frac{\sum_{h=1}^{L} n_h\bar{y}_h}{n} \tag{5.2}$$
The difference is that in $\bar{y}_{st}$ the estimates from the individual strata
receive their correct weights $N_h/N$. It is evident that $\bar{y}$ coincides
with $\bar{y}_{st}$ provided that in every stratum
$$\frac{n_h}{n} = \frac{N_h}{N}, \quad \text{or} \quad \frac{n_h}{N_h} = \frac{n}{N} = \text{Constant}$$
This means that the sampling fraction is the same in all strata. This
stratification is described as stratification with proportional allocation
of the $n_h$. It gives a self-weighting sample. If numerous estimates
have to be made, a self-weighting sample is time-saving.
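The distinction between the two estimates can be illustrated numerically. The sketch below uses invented stratum sizes and stratum sample means (all figures are hypothetical, chosen only for illustration) and shows that the weighted estimate and the plain sample mean agree under proportional allocation and differ otherwise.

```python
def stratified_mean(N_h, ybar_h):
    """Weight each stratum sample mean by its population share N_h / N."""
    N = sum(N_h)
    return sum(Nh * yb for Nh, yb in zip(N_h, ybar_h)) / N


def overall_sample_mean(n_h, ybar_h):
    """Plain sample mean: each stratum mean weighted by its sample share n_h / n."""
    n = sum(n_h)
    return sum(nh * yb for nh, yb in zip(n_h, ybar_h)) / n


# Hypothetical population: two strata of 100 and 300 units.
N_h = [100, 300]
ybar_h = [20.0, 10.0]          # hypothetical stratum sample means

y_st = stratified_mean(N_h, ybar_h)
y_prop = overall_sample_mean([5, 15], ybar_h)     # n_h / N_h = 1/20 in both strata
y_equal = overall_sample_mean([10, 10], ybar_h)   # unequal sampling fractions
print(y_st, y_prop, y_equal)
```

With equal sample sizes from unequal strata, the plain mean over-represents the smaller stratum, which is why the weights $N_h/N$ are needed.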
The principal properties of the estimate $\bar{y}_{st}$ are outlined in the
following theorems. The first two theorems apply to stratified sampling
in general and are not restricted to stratified random sampling: i.e. the
sample from any stratum need not be a simple random sample.
since the estimates are unbiased in the individual strata. But the
population mean $\bar{Y}$ may be written
$$\bar{Y} = \frac{\sum N_h\bar{Y}_h}{N}$$
This completes the proof.
Corollary. Since $\bar{y}_h$ is an unbiased estimate of $\bar{Y}_h$ for simple random
sampling within strata, $\bar{y}_{st}$ is an unbiased estimate of $\bar{Y}$ for stratified
random sampling.
68 STRATIFIED RANDOM SAMPLING
where
$$V(\bar{y}_h) = E(\bar{y}_h - \bar{Y}_h)^2$$
There are two restrictions on the theorem: (i) $\bar{y}_h$ must be an unbiased
estimate of $\bar{Y}_h$, and (ii) the samples must be drawn independently in
different strata.
Proof:
$$\bar{y}_{st} - \bar{Y} = \frac{\sum N_h(\bar{y}_h - \bar{Y}_h)}{N}$$
where the sum extends over all strata. Note that the error $(\bar{y}_{st} - \bar{Y})$
in the estimate is now expressed as a weighted mean of the errors of
estimation which have been made within the individual strata. Hence
$$(\bar{y}_{st} - \bar{Y})^2 = \frac{\sum N_h^2(\bar{y}_h - \bar{Y}_h)^2}{N^2} + \frac{2\sum_{h}\sum_{j>h} N_hN_j(\bar{y}_h - \bar{Y}_h)(\bar{y}_j - \bar{Y}_j)}{N^2}$$
The important point about this result is that the variance of $\bar{y}_{st}$
depends only on the variances of the estimates of the individual stratum
means $\bar{Y}_h$. If it were possible to divide a highly variable population
into strata such that all items had the same value within a stratum, we
could estimate $\bar{Y}$ without any error. Examination of the proof shows
that it is the use of the correct stratum weights $N_h$ which leads to this
result.
Theorem 5.3 For stratified random sampling, the variance of the
estimate $\bar{y}_{st}$ is
$$V(\bar{y}_{st}) = \frac{1}{N^2}\sum N_h(N_h - n_h)\frac{S_h^2}{n_h} \tag{5.4}$$
S2(N-n)
V(g.,) = ~ l { (5.8)
Totals and sums of squares
                  1920                           1930
Stratum   $\sum y_{hi}$   $\sum y_{hi}^2$    $\sum y_{hi}$   $\sum y_{hi}^2$
1             8,349      4,756,619        10,070      7,145,450
2             7,941      1,474,871         9,498      2,141,720
were obtained by taking the cities which ranked from 5th to 68th in
the United States in total number of inhabitants in 1920. The cities
are arranged in 2 strata, the first containing the 16 largest cities and
the second the remaining 48 cities.
The total 1930 number of inhabitants in all 64 cities is to be estimated
from a sample of size 24. Find the standard error of the estimated
total for (i) a simple random sample, (ii) a stratified random
sample with proportional allocation, (iii) a stratified random sample
with 12 units drawn from each stratum.
This population resembles the populations of many types of business
enterprise in that some units, the large cities, contribute very
substantially to the total and display much greater variability than the
remainder.
The stratum totals and sums of squares are given under table 5.1.
For the complete population in 1930, we find $S^2 = 52,448$, so that
$$V(\hat{Y}_{ran}) = \frac{N^2S^2}{n}\,\frac{(N - n)}{N} = \frac{(64)^2(52,448)}{24}\cdot\frac{40}{64} = 5,594,453$$
$$\sigma(\hat{Y}_{ran}) = 2365$$
ii. For the individual strata the variances are
$$S_1^2 = 53,843; \quad S_2^2 = 5581$$
Note that the stratum with the largest cities has a variance nearly
10 times that of the other stratum.
In proportional allocation, we have $n_1 = 6$, $n_2 = 18$. From formula
(5.7), multiplying by $N^2$, we have
$$V(\hat{Y}_{prop}) = \frac{(N - n)}{n}\sum N_hS_h^2 = \frac{40}{24}\,[(16)(53,843) + (48)(5581)] = 1,882,293$$
$$\sigma(\hat{Y}_{prop}) = 1372$$
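The figures in this example can be recomputed directly. The sketch below takes the rounded variances quoted in the text (52,448 for the whole population, 53,843 and 5581 within strata) as given, and applies the variance formula of theorem 5.3 multiplied by $N^2$; the function and variable names are illustrative only.

```python
import math

N_h = [16, 48]            # stratum sizes
S2_h = [53843.0, 5581.0]  # within-stratum variances quoted in the text
S2 = 52448.0              # variance of the complete population, as quoted
N, n = sum(N_h), 24


def var_total_simple_random(N, n, S2):
    """Variance of the estimated total under simple random sampling."""
    return N ** 2 * S2 / n * (N - n) / N


def var_total_stratified(N_h, n_h, S2_h):
    """Variance of the estimated total: sum of N_h (N_h - n_h) S_h^2 / n_h."""
    return sum(Nh * (Nh - nh) * s2 / nh for Nh, nh, s2 in zip(N_h, n_h, S2_h))


v_ran = var_total_simple_random(N, n, S2)
v_prop = var_total_stratified(N_h, [6, 18], S2_h)     # proportional allocation
v_equal = var_total_stratified(N_h, [12, 12], S2_h)   # 12 units from each stratum

for label, v in (("random", v_ran), ("proportional", v_prop), ("12 and 12", v_equal)):
    print(label, round(v), round(math.sqrt(v)))
```

The 12 and 12 allocation gives a variance of about 1,090,827, the value referred to later in the chapter.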
(5.10)
Hence we obtain
Theorem 5.5 With stratified random sampling, an unbiased estimate
of the variance of $\bar{y}_{st}$ is
$$v(\bar{y}_{st}) = s^2(\bar{y}_{st}) = \frac{1}{N^2}\sum_{h=1}^{L} N_h(N_h - n_h)\frac{s_h^2}{n_h} \tag{5.11}$$
An alternative form for computing purposes is
$$s^2(\bar{y}_{st}) = \sum_{h=1}^{L}\frac{W_h^2s_h^2}{n_h} - \sum_{h=1}^{L}\frac{W_hs_h^2}{N} \tag{5.12}$$
The second term on the right represents the reduction due to the fpc.
In order that this estimate can be computed, there must be at least
two units drawn from every stratum. Estimation of the variance
when stratification is carried to the point where only one unit is chosen
per stratum is discussed in section 5.21.
Corollary. In certain applications it is reasonable to suppose that
$S_h^2$ has the same value in all strata. From the analysis of variance of
the sample, a pooled estimate of this common variance is
$$s_w^2 = \frac{\sum_{h=1}^{L}\sum_{i=1}^{n_h}(y_{hi} - \bar{y}_h)^2}{(n - L)}$$
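A short sketch may help to show that the two forms of the estimated variance agree, and how the pooled variance is formed. The sample values below are invented for illustration, and the function names are assumptions; `statistics.variance` uses the divisor $(n_h - 1)$, as required for $s_h^2$.

```python
from statistics import mean, variance


def est_var_form_1(N_h, samples):
    """(1/N^2) * sum of N_h (N_h - n_h) s_h^2 / n_h."""
    N = sum(N_h)
    return sum(Nh * (Nh - len(y)) * variance(y) / len(y)
               for Nh, y in zip(N_h, samples)) / N ** 2


def est_var_form_2(N_h, samples):
    """sum of W_h^2 s_h^2 / n_h minus sum of W_h s_h^2 / N."""
    N = sum(N_h)
    first = sum((Nh / N) ** 2 * variance(y) / len(y) for Nh, y in zip(N_h, samples))
    second = sum((Nh / N) * variance(y) for Nh, y in zip(N_h, samples)) / N
    return first - second


def pooled_variance(samples):
    """Within-stratum sum of squares divided by (n - L)."""
    ss = sum(sum((v - mean(y)) ** 2 for v in y) for y in samples)
    n = sum(len(y) for y in samples)
    return ss / (n - len(samples))


# Hypothetical data: two strata, at least two units drawn from each.
N_h = [16, 48]
samples = [[900.0, 450.0, 620.0, 510.0], [180.0, 140.0, 260.0, 200.0, 150.0, 170.0]]
v1 = est_var_form_1(N_h, samples)
v2 = est_var_form_2(N_h, samples)
s2_w = pooled_variance(samples)
print(v1, v2, s2_w)
```

A stratum with a single sampled unit would make `variance` raise an error, which mirrors the requirement of at least two units per stratum.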
OPTIMUM ALLOCATION 73
To find the actual value of $n_h$, add (5.17) over the strata. Thus
$$\sum n_h = n = \frac{\sum N_hS_h}{N\sqrt{\lambda}} \tag{5.18}$$
~~
nYX = L Nvc" (5.24)
N"S"
~ (5.25)
n
LN"S"
vc"
Thia theorem leads to the following rules of conduct. In a given
rtratum, take a larger sample if
(i) the stratum is larger,
(ii) the stratum is more variable,
(iii) sampling is cheaper in the stratum.
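These rules can be seen at work in a small sketch, which assumes the allocation takes $n_h$ proportional to $N_hS_h/\sqrt{c_h}$. The stratum sizes and variances are those of the city example in section 5.3; the costs in the second call are hypothetical, and the function name is illustrative.

```python
import math


def optimum_allocation(n, N_h, S_h, c_h=None):
    """Split n so that n_h is proportional to N_h * S_h / sqrt(c_h).

    With no costs supplied, all c_h are taken as equal.
    """
    if c_h is None:
        c_h = [1.0] * len(N_h)
    weights = [Nh * Sh / math.sqrt(ch) for Nh, Sh, ch in zip(N_h, S_h, c_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


N_h = [16, 48]
S_h = [math.sqrt(53843.0), math.sqrt(5581.0)]   # city example, 1930 variances

equal_cost = optimum_allocation(24, N_h, S_h)
# Hypothetical costs: sampling in stratum 1 is four times as expensive.
dearer_first = optimum_allocation(24, N_h, S_h, c_h=[4.0, 1.0])
print([round(x, 2) for x in equal_cost], [round(x, 2) for x in dearer_first])
```

The fractional values would be rounded to whole units in practice; raising the cost in a stratum moves sample away from it, as rule (iii) states.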
$$n = \frac{(C - a)\sum (N_hS_h/\sqrt{c_h})}{\sum (N_hS_h\sqrt{c_h})}$$
If V is fixed, we substitute the optimum nil in the formula for V(y.,).
We find
n=
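$$V_{prop} = \frac{\sum N_hS_h^2}{nN} \quad \text{[from equation (5.7), section 5.3]}$$
$$V_{opt} = \frac{(\sum N_hS_h)^2}{nN^2} \quad \text{[from equation (5.20), section 5.5]}$$
$$V_{ran} = \frac{S^2}{n} = \frac{\sum N_hS_h^2}{nN} + \frac{\sum N_h(\bar{Y}_h - \bar{Y})^2}{nN} = V_{prop} + \frac{\sum N_h(\bar{Y}_h - \bar{Y})^2}{nN} \tag{5.28}$$
By the definition of $V_{opt}$, we must have $V_{prop} \ge V_{opt}$. Their
difference is
$$V_{prop} - V_{opt} = \frac{1}{nN}\sum N_h(S_h - \bar{S})^2 \tag{5.29}$$
$$V_{ran} = V_{opt} + \frac{\sum N_h(S_h - \bar{S})^2}{nN} + \frac{\sum N_h(\bar{Y}_h - \bar{Y})^2}{nN} \tag{5.30}$$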
This result shows that there are two components to the decrease in
variance as we change from simple random sampling to optimum
allocation. The first component (term on the extreme right) comes from
(5.31)
It follows that simple random sampling gives a higher variance than
proportional stratification unless
$$\sum N_h(\bar{Y}_h - \bar{Y})^2 \le \frac{1}{N}\sum (N - N_h)S_h^2 \tag{5.32}$$
This case could happen, since the $\bar{Y}_h$ could all be identically equal. If
any differences among the $\bar{Y}_h$ exist, the inequality is unlikely to be
satisfied except with small strata, since the left side is of order $N_h$
while the right is of order unity.
The results reported here for optimum allocation are applicable to
sampling practice only if optimum allocation can be achieved. The
attempt to do so raises a number of problems that are discussed in
succeeding sections.
(5.33)
putting $U = S_1/S_2$.
We want to examine this quantity for a series of values of $n_1$ and
$n_2$ in the neighborhood of the optimum. Departures of $n_1$ and $n_2$
from the optimum values can be expressed in terms of the ratio
Hence
$$RP = \frac{\varphi(N_1U + N_2)^2}{(\varphi N_1U + N_2)(N_1U + \varphi N_2)} \tag{5.34}$$
(in per cent) plotted against $\varphi$ for three different values of $S_1/S_2 =
U = 2, 4, 8$. The scale for $\varphi$ is logarithmic. On each curve the value
of $\varphi$ which represents proportional allocation is shown.
FIGURE 5.1 Loss of precision through departures from optimum allocation.
[Relative precision (in per cent) plotted against $\varphi$ = ratio of $n_1/n_2$ to
optimum value of $n_1/n_2$, with curves for $U = S_1/S_2$ = 2, 4, 8. Points
denoted by P show the relative precisions with proportional allocation.]
                        1920                            1930
Stratum   $N_h$    $S_h$     $N_hS_h$    $n_h$       $S_h$     $N_hS_h$    $n_h$
1          16    163.30    2612.80    11.56     232.04    3712.64    12.21
2          48     58.55    2810.40    12.44      74.71    3586.08    11.79
for the variance of the estimated total. This gives a theoretical mini-
mum variance of 1,090,157. The variance actually attained by a 12,
12 allocation was worked out in the example in section 5.3 and was
found to be 1,090,827. The difference is trivial.
When an allocation is being planned, it is advisable to estimate the
apparent gain in precision relative to proportional allocation. In
section 5.8 we suggested that proportional allocation is to be preferred
in view of its self-weighting properties, unless the apparent gain due to
optimum allocation is at least 20 per cent. For the present data, the
comparison with proportional allocation was made in a previous ex-
ample (section 5.3) . The relative precision turned out to be 1883/1091,
or 173 per cent. Note that a calculation of this kind inevitably over-
estimates the gain in precision relative to proportional allocation, be-
cause it assumes that the S" in the new survey will be in the same
proportions, from stratum to stratum, as in the previous data which
we used to compute the allocation. In the present example this over-
estimation is negligible, because the 1920 allocation happened to be
the same as the 1930 allocation.
$$V_{prop} - V_{opt} = \frac{1}{nN}\left\{\sum N_hS_h^2 - \frac{(\sum N_hS_h)^2}{N}\right\}$$
5.11 Allocation with several items. Since the allocation that is best
for one item will not in general be best for another, some compromise
must be reached in a survey with numerous items. The first step is to
reduce the items that will be considered in the allocation to a relatively
small number that are thought to be most important. If good previous
data are available, we can then compute the optimum allocation
for each item separately, and see to what extent there is disagreement.
In a survey of a specialized type, the correlations among the items
may be high, and the allocations may differ relatively little.
Example. Data given by Jessen (1942) illustrate a farm survey of
this kind. The state of Iowa was divided into five geographic regions,
each denoted by its major agricultural enterprise. Suppose that these
regions are to be used as strata in a survey on dairy farming. The
three items of most interest are the number of cows milked per day,
the number of gallons of milk per day, and the total annual cash receipts
from dairy products. From a survey made in 1938, the estimated
standard deviations $s_h$ within strata are shown in table 5.4. We
shall assume, for the present, that the $s_h$ are the true standard deviations.
The $s_h$ apply to a single farm. In table 5.5 the optimum allocations
are given for the individual items in a sample of 1000 farms.
TABLE 5.4 STANDARD DEVIATIONS WITHIN STRATA
Stratum    $s_h$, cows milked    $s_h$, gallons of milk    $s_h$, receipts for dairy products ($)
Since the state contains over 200,000 farms, the fpc can be ignored.
Thus for any item
$$n_h = \frac{nW_hs_h}{\sum W_hs_h}$$
The individual optimum allocations differ only moderately from
each other. With one exception, all three deviate in the same direction
from a proportional allocation. Thus, in the first stratum, proportional
allocation suggests 197 farms, while the individual allocations
lead to numbers between 236 and 258. The average of the optimum
sample sizes for the three items, shown in the right-hand column, pro-
vides a satisfactory compromise allocation.
Table 5.6 shows the expected sampling variances of $\bar{y}_{st}$, as given by
theorem 5.3, corollary 1; the denominator $n_h$ is the sample size given in
the hth stratum by the compromise allocation.
TABLE 5.6 EXPECTED VARIANCES OF THE ESTIMATED MEAN
Type of allocation     Cows      Gallons    Receipts
Optimum               0.0127     0.0800      76.9
Compromise            0.0128     0.0802      77.6
Proportional          0.0131     0.0837      80.9
The compromise allocation gives almost as precise results as if it
were possible to use separate optimum allocations for each item. What
is more noteworthy is that proportional allocation, though consistently
poorer than the compromise, is only slightly less precise than the com-
promise or the individual optima. Further, table 5.6 overestimates
the precision of the optima and of the compromise, since these alloca-
tions were made from estimated rather than from true variances. This
result illustrates the important principle (previously discussed) that up
to a point the variance of the estimate does not increase much even if
allocation departs quite substantially from the optimum. From the
comparison in table 5.6, proportional allocation would be the recommended
choice. Had the $S_h$ differed more markedly among strata, the
compromise allocation might have been very satisfactory.
In surveys which cover a wider range of items, the allocations may
differ more violently. The best compromise is then not so obvious.
Although proportional allocation is often used in this situation, it may
be possible to find a compromise which is superior. As a preliminary
we need some criterion for a "best" compromise. One possibility is to
minimize the sum of the variances for the items, each divided by its
optimum variance. If the fpc is ignored, this amounts to finding $n_h$
which minimize
$$\sum{}' \; \frac{\sum \dfrac{(W_hs_h)^2}{n_h}}{\dfrac{(\sum W_hs_h)^2}{n}}$$
where $\sum'$ denotes summation over the items. It remains to be seen,
however, whether situations are common, in surveys of wide scope, in
which the gain over proportional allocation is sufficient to offset the
advantage of the "self-weighting" feature in the latter.
5.12 Allocation requiring more than 100 per cent sampling. In the
planning of an allocation it may happen that the formula for the optimum
produces an $n_h$ in some stratum that is larger than the corresponding
$N_h$. Consider the example on city sizes in section 5.9. A sample
of 24 cities, distributed between 2 strata, called for 12 cities out of 16
in the first stratum and 12 out of 48 in the second. Had the sample
size been 48, the allocation would demand 24 cities out of 16 in the
first stratum. The best that can be done is to take all cities in the
stratum, leaving 32 cities for the second stratum instead of the 24
postulated by the formula. This problem arises only when the overall
sampling fraction is substantial, and one stratum is much more variable
than the others. It has occurred in practice on several occasions.
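One way to carry out this adjustment is sketched below. It is an illustration rather than a procedure stated in the text: any stratum whose trial allocation exceeds $N_h$ is sampled completely, and the remaining sample is redistributed among the other strata in proportion to $N_hS_h$, repeating until no stratum is over-allocated. The function name is an assumption.

```python
def capped_optimum_allocation(n, N_h, S_h):
    """Allocate n in proportion to N_h * S_h, taking 100 per cent of any stratum
    whose share would exceed its size and re-allocating the rest."""
    n_h = [0.0] * len(N_h)
    free = list(range(len(N_h)))      # strata not yet fixed at 100 per cent
    remaining = float(n)
    while free:
        total = sum(N_h[h] * S_h[h] for h in free)
        trial = {h: remaining * N_h[h] * S_h[h] / total for h in free}
        over = [h for h in free if trial[h] > N_h[h]]
        if not over:
            for h in free:
                n_h[h] = trial[h]
            break
        for h in over:
            n_h[h] = float(N_h[h])    # take every unit in this stratum
            remaining -= N_h[h]
            free.remove(h)
    return n_h


N_h = [16, 48]
S_h = [232.04, 74.71]                 # 1930 standard deviations for the city strata
print(capped_optimum_allocation(48, N_h, S_h))
print([round(x, 2) for x in capped_optimum_allocation(24, N_h, S_h)])
```

With n = 48 the first stratum is taken in full and 32 cities fall to the second; with n = 24 no cap applies and the ordinary optimum is returned.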
Care must be taken to use the correct formula in predicting the
expected variance from this allocation, or in comparing the allocation
with others. The general formulas in theorem 5.3, section 5.3, are
appropriate if the $n_h$ given by the revised optimum allocation are
substituted. Formula (5.20) for the minimum variance for fixed n,
$$V_{min}(\bar{y}_{st}) = \frac{1}{N^2}\frac{(\sum N_hS_h)^2}{n} - \frac{1}{N^2}\sum N_hS_h^2$$
$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} \tag{5.44}$$
If V is the desired variance for the estimated population total, the
principal formulas become as follows:
General: .
(5.45)
Stratum    $N_h$    $S_h$    $N_hS_h$    $n_h$
1           13     325     4,225       9
2           18     190     3,420       7
3           26     189     4,914      10
4           42      82     3,444       7
5           73      86     6,278      13
6           24     190     4,560      10
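$$p_{st} = \frac{\sum N_hp_h}{N} \tag{5.48}$$
$$V(p_{st}) = \frac{1}{N^2}\sum \frac{N_h^2(N_h - n_h)}{(N_h - 1)}\,\frac{P_hQ_h}{n_h} \tag{5.49}$$
Proof: This is a particular case of the general theorem for the variance
of the estimated mean. From theorem 5.3
$$V(\bar{y}_{st}) = \frac{1}{N^2}\sum N_h(N_h - n_h)\frac{S_h^2}{n_h}$$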
• The arithmetical results differ slightly from those given by Cornell (1947).
5.14 STRATIFIED SAMPLING FOR PROPORTIONS 91
Let $y_{hi}$ be a variate which has the value 1 when the unit is in C, and
zero otherwise. In section 3.2 it was shown that for this variate
$$S_h^2 = \frac{N_h}{(N_h - 1)}\,P_hQ_h$$
This gives the result.
Note. In practically all applications, even if the fpc is not negligible,
terms in $1/N_h$ will be negligible, and the slightly simpler formula
$$V(p_{st}) = \frac{1}{N^2}\sum N_h(N_h - n_h)\frac{P_hQ_h}{n_h} \tag{5.50}$$
can be used.
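Corollary 1 When the fpc can be ignored,
$$V(p_{st}) = \sum \frac{W_h^2P_hQ_h}{n_h}$$
With proportional allocation,
$$V(p_{st}) \doteq \frac{(N - n)}{N}\,\frac{1}{n}\sum W_hP_hQ_h \tag{5.53}$$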
N n
For a sample estimate of the variance, we may substitute $p_h$, $q_h$ for the
unknown $P_h$, $Q_h$ in any of the formulas above. This does not give an
unbiased estimate, but is adequate for the calculation of confidence limits.
The best choice of the $n_h$ in order to minimize $V(p_{st})$ follows from
the general theory in sections 5.5 and 5.6.
Minimum variance for fixed total sample size.
$$n_h \propto N_h\sqrt{\frac{N_h}{N_h - 1}}\,\sqrt{P_hQ_h} \doteq N_h\sqrt{P_hQ_h}$$
Thus
$$n_h \doteq n\,\frac{N_h\sqrt{P_hQ_h}}{\sum N_h\sqrt{P_hQ_h}} \tag{5.54}$$
Minimum variance for fixed cost, where cost $= a + \sum c_hn_h$.
$$n_h \doteq n\,\frac{N_h\sqrt{P_hQ_h/c_h}}{\sum N_h\sqrt{P_hQ_h/c_h}} \tag{5.55}$$
The value of n is found as in section 5.6.
$$V_{prop} = \frac{\sum W_hP_hQ_h}{n}$$
$$\frac{V_{opt}}{V_{prop}} = \frac{(\sum W_h\sqrt{P_hQ_h})^2}{\sum W_hP_hQ_h}$$
If all $P_h$ lie between the two values $P_0$ and $(1 - P_0)$, we are interested
in the smallest value which the relative precision takes. For
simplicity, we consider two strata of equal size ($W_1 = W_2$). The
minimum relative precision is attained when $P_1 = \frac{1}{2}$ and $P_2 = P_0$.
The relative precision then becomes
$$\frac{V_{opt}}{V_{prop}} = \frac{(0.5 + \sqrt{P_0Q_0})^2}{2(0.25 + P_0Q_0)}$$
$P_0$    0.4 or 0.6    0.3 or 0.7    0.2 or 0.8    0.1 or 0.9    0.05 or 0.95
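The relative precision for each value of $P_0$ listed above can be evaluated with a short sketch. It assumes two strata of equal size with $P_1 = 0.5$ and $P_2 = P_0$, ignores the fpc, and takes the ratio of $(\sum W_h\sqrt{P_hQ_h})^2$ to $\sum W_hP_hQ_h$; the function name is illustrative.

```python
import math


def relative_precision(P0, P1=0.5, W=(0.5, 0.5)):
    """Ratio V_opt / V_prop for two strata with proportions P1 and P0."""
    P = (P1, P0)
    v_opt = sum(w * math.sqrt(p * (1 - p)) for w, p in zip(W, P)) ** 2
    v_prop = sum(w * p * (1 - p) for w, p in zip(W, P))
    return v_opt / v_prop


for P0 in (0.4, 0.3, 0.2, 0.1, 0.05):
    print(P0, round(100 * relative_precision(P0), 1))
```

The ratio stays close to 100 per cent until $P_0$ falls to 0.1 or below, so the gain from optimum over proportional allocation is small unless some $P_h$ are extreme.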
where $n_0$ is the first approximation, which ignores the fpc, and n is the
corrected value taking account of the fpc. In the development of
these formulas, the factor $N_h/(N_h - 1)$ has been taken as unity.
All results in this section apply to the estimate of a proportion. If it
is preferable to think in terms of percentages, the same formulas apply
if $P_h$, $Q_h$, V, etc., are expressed as percentages. For the estimation of
the total number in the population which lie in class C, i.e. of NP, all
variances are multiplied by $N^2$.
existence is divided into two new strata. For this stratum the effect
is to replace simple random sampling by stratified sampling, which
should result in a lower variance for the estimated stratum mean, and
hence for the mean of the whole population. Thus the process of
constructing new strata by the subdivision of old strata should result
in a series of decreasing variances in the estimates.
Multiplication of strata is, however, attended by some disadvantages.
Unless allocation is proportional, the number of weighting factors
$W_h$ which appear in the tabulations increases, as do the number
of within-stratum variances to be estimated, both for allocation and
for finding the standard error of $\bar{y}_{st}$. The number of degrees of freedom
available for the estimated standard error decreases until with
one unit per stratum the formula for the standard error cannot be used
at all. If an increase in the number of strata is under consideration,
we want to be assured that the ensuing gain in precision is sufficient
to repay us for these complications.
Consider a process in which each stratum is divided into halves to
form new strata. The number of strata becomes successively 2, 4, 8,
16, etc. If we can use the frequency distribution of $y_i$ itself for the
construction of strata, and if the distribution of $y_i$ is continuous, every
stage in the process produces a substantial reduction in the variance
of the estimate. When the number of strata becomes large, each
doubling reduces the variance to approximately one-quarter of the
previous value.
To show this, suppose that before subdivision a typical stratum
consists of all values of $y_i$ between $a$ and $b$. If there are already many
strata, the distribution of $y_i$ between $a$ and $b$ will be approximately
rectangular. The variance of this rectangular distribution is known
by theory to be $(b - a)^2/12$. When we halve the stratum, we produce
two approximately rectangular distributions, each with range $(b - a)/2$
and variance $(b - a)^2/48$. With proportional allocation, ignoring the
fpc, the contribution of this stratum to $V(\bar{y}_{st})$ is therefore reduced from
$$\frac{W_h (b - a)^2}{12n}$$
to
$$\frac{W_h}{2}\,\frac{(b - a)^2}{48n} + \frac{W_h}{2}\,\frac{(b - a)^2}{48n} = \frac{W_h (b - a)^2}{48n}$$
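The quartering can be checked numerically. The following is only a minimal sketch, assuming that $y_i$ is exactly rectangular on $(a, b)$, that the strata have equal width, and that allocation is proportional with the fpc ignored; the function name is illustrative and is not taken from the text.

```python
def stratified_variance_rectangular(a, b, n_strata, n):
    """V(ybar_st) = sum(W_h * S_h^2) / n for a rectangular variate on (a, b)
    cut into n_strata equal-width strata (proportional allocation, no fpc)."""
    width = (b - a) / n_strata      # range of each stratum
    weight = 1.0 / n_strata         # W_h, equal because the density is flat
    s2 = width ** 2 / 12.0          # variance of a rectangular distribution
    return n_strata * weight * s2 / n


v8 = stratified_variance_rectangular(0.0, 1.0, 8, n=100)
v16 = stratified_variance_rectangular(0.0, 1.0, 16, n=100)
ratio = v16 / v8                    # variance after doubling, relative to before
```

Under these assumptions `ratio` is one-quarter for every doubling, not only when the number of strata is large; for other continuous distributions the quartering holds only approximately.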
5.15 THE CONSTRUCTION OF STRATA
                      Optimum                  Proportional
Number of
strata           nV       Ratio to         nV       Ratio to
                          preceding                 preceding
    1         1                         1
    2         0.27722      0.277        0.40747      0.407
    4         0.06690      0.241        0.11822      0.290
    8         0.01618      0.242        0.03079      0.260
   16         0.00397      0.245        0.00778      0.253
to
$$\frac{W_h (b - a)^2}{48n} + \frac{W_h \sigma_{eh}^2}{n}$$
96 STRATIFIED RANDOM SAMPLING 6.15
As soon as $(b - a)$ is small enough so that the term in $\sigma_{eh}^2$ dominates,
further increase in the number of strata produces only a trivial reduction
in variance.
To summarize, subdivision of strata in practice sooner or later results
in practically no further increase in precision. If the variate used
for stratification has only a moderate correlation with $y_i$, the point
may be reached with a small number of strata.
If strata are formed by cutting up a frequency distribution, there
remains the question: What are the best points of division?
Rules have been developed by Dalenius (1950) and Dalenius and
Gurney (1951) for division under proportional and optimum allocation.
One result will be quoted. If the variate $z_i$ on which the division is
based is linearly related to the variate $y_i$, which is to be measured in
the survey, the division point $z_h'$ between stratum $h$ and stratum
$(h + 1)$ should satisfy the equation
$$2z_h' = \bar{Z}_h + \bar{Z}_{h+1}$$
where $\bar{Z}_h$, $\bar{Z}_{h+1}$ are the mean values of $z_i$ in the two strata. This is
the optimum rule for proportional allocation. The division points $z_1'$,
$z_2'$, ..., $z_{L-1}'$ are found by trial and error, the number of strata $L$
being assumed chosen in advance.
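The trial-and-error search can be organized as a simple iteration: place each division point midway between the current means of the adjacent strata, recompute the means, and repeat until the points stop moving. The sketch below is an illustrative assumption about how one might carry this out on a list of $z$ values; it is not a procedure given in the text, the function name is hypothetical, and it applies the proportional-allocation rule only.

```python
def division_points(values, n_strata, max_iter=200, tol=1e-9):
    """Iterate 2*z'_h = Zbar_h + Zbar_(h+1) on a finite list of z values."""
    if n_strata < 2:
        return []
    z = sorted(values)
    lo, hi = z[0], z[-1]
    # Start from equally spaced division points.
    cuts = [lo + (hi - lo) * k / n_strata for k in range(1, n_strata)]
    for _ in range(max_iter):
        edges = [float("-inf")] + cuts + [float("inf")]
        means = []
        for left, right in zip(edges[:-1], edges[1:]):
            members = [v for v in z if left < v <= right]
            if not members:
                raise ValueError("empty stratum; use fewer strata")
            means.append(sum(members) / len(members))
        # Each division point is the midpoint of the neighbouring stratum means.
        new_cuts = [(m0 + m1) / 2.0 for m0, m1 in zip(means[:-1], means[1:])]
        if max(abs(c - d) for c, d in zip(cuts, new_cuts)) < tol:
            return new_cuts
        cuts = new_cuts
    return cuts


cuts = division_points(list(range(1, 101)), n_strata=2)
```

For the values 1, 2, ..., 100 and two strata the iteration settles at once on the midpoint of the two stratum means. With skew distributions several passes are usually needed, and the starting points may matter.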
A stratification that is effective for one variable may not be so for
another. In surveys which cover a range of items, some compromise
criterion for the construction of strata must be found. Bases of
stratification for economic items have been discussed by Stephan
(1941) and Hagood and Bernert (1945), and for farm items by King
and McCarty (1941). What is wanted is some variable that has high
correlations with all the principal items in the survey.
[Table: relative precision by State and number of items, for strata formed by State, type-of-farming area, county, and township.]
The data shown are the average relative precisions of the estimates
of the means, averaged over the numbers of items shown in the table.
The county is taken as a standard in each case.
As appears to be typical of geographic stratification, the increases
in precision from increased numbers of strata are moderate rather than
large. This indicates that the similarities referred to above are weak.
(5.58)
by an algebraic identity for $S^2$.
There is no difficulty in obtaining an estimate of the first term inside
the bracket. The second term requires some detailed investigation.
For estimating $\sum N_h(\bar{Y}_h - \bar{Y})^2$ it is natural to try $\sum N_h(\bar{y}_h - \bar{y}_{st})^2$.
This quantity turns out to be an overestimate which needs adjustment.
We may write
$$\sum N_h(\bar{y}_h - \bar{y}_{st})^2 = \sum N_h\{(\bar{Y}_h - \bar{Y}) + (\bar{y}_h - \bar{Y}_h) - (\bar{y}_{st} - \bar{Y})\}^2$$
We now expand and take the average over all possible samples. It
may be verified that the average of each of the two cross-product
terms involving $(\bar{Y}_h - \bar{Y})$ vanishes. This gives
$$E\sum N_h(\bar{y}_h - \bar{y}_{st})^2 = \sum N_h(\bar{Y}_h - \bar{Y})^2 + E\sum N_h(\bar{y}_h - \bar{Y}_h)^2 + E\sum N_h(\bar{y}_{st} - \bar{Y})^2 - 2E\sum N_h(\bar{y}_h - \bar{Y}_h)(\bar{y}_{st} - \bar{Y}) \qquad (5.59)$$
But
$$\sum N_h(\bar{y}_h - \bar{Y}_h)(\bar{y}_{st} - \bar{Y}) = N(\bar{y}_{st} - \bar{Y})^2$$
by the definitions of $\bar{y}_{st}$ and $\bar{Y}$. Thus the last two terms in (5.59) coalesce
to give
$$-EN(\bar{y}_{st} - \bar{Y})^2 = -\sum \frac{N_h(N_h - n_h)}{N}\,\frac{S_h^2}{n_h}$$
since this expression is $N$ times the variance of $\bar{y}_{st}$. For the second
term on the right in (5.59),
$$E\sum N_h(\bar{y}_h - \bar{Y}_h)^2 = \sum N_h\,\frac{(N_h - n_h)}{N_h}\,\frac{S_h^2}{n_h} = \sum (N_h - n_h)\frac{S_h^2}{n_h}$$
because within each stratum $\bar{y}_h$ is the mean of a simple random sample.
5.17 ESTIMATION OF GAIN FROM STRATIFICATION 99
Hence,
$$E\sum N_h(\bar{y}_h - \bar{y}_{st})^2 = \sum N_h(\bar{Y}_h - \bar{Y})^2 + \sum (N_h - n_h)\frac{S_h^2}{n_h} - \sum \frac{N_h(N_h - n_h)}{N}\,\frac{S_h^2}{n_h}$$
The sum of the last two terms on the right is always positive, so that
$\sum N_h(\bar{y}_h - \bar{y}_{st})^2$ gives an overestimate. It follows that an unbiased
estimate of $\sum N_h(\bar{Y}_h - \bar{Y})^2$ is
$$Q = \sum N_h(\bar{y}_h - \bar{y}_{st})^2 - \sum (N_h - n_h)\frac{s_h^2}{n_h} + \sum \frac{N_h(N_h - n_h)}{N}\,\frac{s_h^2}{n_h}$$
Stratum     $N_h$    $n_h$    $\bar{y}_h$    $s_h^2$
   1         13       9      2.200     1.615
   2         18       7      1.638     0.063
   3         26      10      0.992     0.077
Totals       57      26
The sample is so small that expression (5.61) for $v_{ran}$ will be used.
The supplementary calculations are given in table 5.12.
$$v_{ran} = \frac{31}{(57)(26)}\,[0.4232 - 0.0473 + 0.0118 + 2.4000 - 2.1655] = 0.0130$$
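The arithmetic can be retraced from the stratum figures tabulated above. The sketch below is illustrative only. It assumes that the five terms in the bracket are, in order, $\sum W_h s_h^2$, $\sum W_h s_h^2/n_h$, $\sum W_h^2 s_h^2/n_h$, $\sum W_h \bar{y}_h^2$ and $(\sum W_h \bar{y}_h)^2$, and that the multiplier is $(N - n)/(Nn)$; this reading reproduces each printed term but should be checked against expression (5.61) itself. The function name is hypothetical.

```python
def v_ran_small_sample(N_h, n_h, ybar_h, s2_h):
    """Estimated variance of the mean of a simple random sample,
    computed from a stratified sample (terms as assumed in the lead-in)."""
    N, n = sum(N_h), sum(n_h)
    W = [Nh / N for Nh in N_h]                                  # stratum weights
    t1 = sum(w * s for w, s in zip(W, s2_h))                    # sum W_h s_h^2
    t2 = sum(w * s / m for w, s, m in zip(W, s2_h, n_h))        # sum W_h s_h^2 / n_h
    t3 = sum(w * w * s / m for w, s, m in zip(W, s2_h, n_h))    # sum W_h^2 s_h^2 / n_h
    t4 = sum(w * y * y for w, y in zip(W, ybar_h))              # sum W_h ybar_h^2
    t5 = sum(w * y for w, y in zip(W, ybar_h)) ** 2             # (sum W_h ybar_h)^2
    return (N - n) / (N * n) * (t1 - t2 + t3 + t4 - t5)


v = v_ran_small_sample(
    N_h=[13, 18, 26],
    n_h=[9, 7, 10],
    ybar_h=[2.200, 1.638, 0.992],
    s2_h=[1.615, 0.063, 0.077],
)
```

With these inputs `v` agrees with the printed 0.0130 to four decimal places.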
$$s^2 = \frac{\sum\sum (y_{hi} - \bar{y})^2}{n - 1} = \frac{1}{(n - 1)}\left[\sum (n_h - 1)s_h^2 + \sum n_h\bar{y}_h^2 - \frac{(\sum n_h\bar{y}_h)^2}{n}\right]$$
by the usual identity in the analysis of variance. If terms in $1/n_h$ are
for $v_{ran}$, we assume that terms in $1/N$ are negligible, but not terms
in $1/n_h$, since sometimes there are only a few units per stratum in the
sample. Thus we start from equation (5.61), inserting, however, the
pooled estimate $s_w^2$ in place of the individual $s_h^2$. This gives
$$v_{ran} = \frac{(N - n)}{Nn}\left[s_w^2\sum\left(W_h - \frac{W_h}{n_h} + \frac{W_h^2}{n_h}\right) + \sum W_h\bar{y}_h^2 - \Bigl(\sum W_h\bar{y}_h\Bigr)^2\right]$$
Since $n_h = nW_h$, the coefficient of $s_w^2$ may be written
$$\sum\left(W_h - \frac{1}{n} + \frac{W_h}{n}\right) = \frac{n - L + 1}{n}$$
where $L$ is the number of strata. Hence
$$v_{ran} = \frac{(N - n)}{N}\,\frac{1}{n^2}\left[(n - L + 1)s_w^2 + \sum n_h(\bar{y}_h - \bar{y}_{st})^2\right]$$
Component          df          ms
Between strata     $(L - 1)$     $B = \sum n_h(\bar{y}_h - \bar{y}_{st})^2/(L - 1)$
Within strata      $(n - L)$     $W = s_w^2$
It follows that
$$v_{ran} = \frac{(N - n)}{N}\,\frac{1}{n^2}\,[(L - 1)B + (n - L + 1)W]$$
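The result lends itself to a direct calculation from the two mean squares of the analysis of variance. The sketch below is illustrative; the function names are hypothetical, and the companion formula $v_{st} = W/n$ assumes proportional allocation. Passing `N=None` drops the fpc.

```python
def v_ran_from_anova(B, W, n, L, N=None):
    """v_ran = fpc * [(L - 1) B + (n - L + 1) W] / n^2, where B and W are the
    between-strata and within-strata mean squares."""
    fpc = 1.0 if N is None else (N - n) / N
    return fpc * ((L - 1) * B + (n - L + 1) * W) / n ** 2


def v_st_proportional(W, n, N=None):
    """Variance of the stratified mean under proportional allocation: fpc * W / n."""
    fpc = 1.0 if N is None else (N - n) / N
    return fpc * W / n


# When B equals W, stratification gains nothing and the two variances coincide.
same = v_ran_from_anova(B=38.44, W=38.44, n=6, L=2)
strat = v_st_proportional(W=38.44, n=6)
```

The estimated gain from stratification is then the ratio of `v_ran_from_anova(...)` to `v_st_proportional(...)`, which exceeds 1 whenever $B > W$.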
Example. In sampling an agricultural field experiment for estimating
the number of wireworms per plot, the 25 plots were divided into
halves. From each half 3 random samples of soil were taken with a
small boring tool. The sample was a volume of soil 9 in. square to a
depth of 5 in. The combined analysis of variance of numbers of wireworms
for the 25 plots was as follows:
df ms
Between 8trata (h&lf-plots) 25 B - 00.76
Within strata 100 W - 38 .44
The conditions are slightly different from those in the theory presented
above. Each plot represents a separate population, divided
into 2 strata. Thus $n = 6$, $n_h = 3$, $L = 2$ for each plot. The analysis
of variance gives the combined results for 25 stratified samples of this
type. The fpc need not be included.
$$v_{st} = \frac{38.44}{6} = 6.41$$
(5.70)
$$n_h \propto \frac{S_h}{\sqrt{c_h}}$$
Alternatively, if the $c_h$ vary little from stratum to stratum but there
are marked variations in the $S_h$, it is sometimes preferable to allocate
so that every mean has the same variance. For this, we choose the
$n_h$ so that
$$\frac{S_1^2}{n_1} = \frac{S_2^2}{n_2} = \cdots = \frac{S_L^2}{n_L}$$
5.23 Exercises.
5.1 The households in a town are to be sampled in order to estimate the
average amount of assets per household that are readily convertible into cash.
The households are stratified into a high-rent and a low-rent stratum. A
house in the high-rent stratum is thought to have about 9 times as much
assets as one in the low-rent stratum, and $S_h$ is expected to be proportional
to the square root of the stratum mean.
There are 4000 households in the high-rent stratum and 20,000 in the low-rent
stratum. (i) How would you distribute a sample of 1000 households
between the two strata? (ii) If the object is to estimate the difference between
assets per household in the two strata, how should the sample be
distributed?
5.2 The following data show the stratification of all the farms in a county
by farm size, and the average acres of corn (maize) per farm in each stratum.
Total or mean 2010 26.3
For a sample of 100 farms, compute the sample sizes in each stratum under
(i) proportional allocation, (ii) optimum allocation. Compare the precisions
of these methods with that of simple random sampling.
5.3 If the 7 strata are to be combined into 2 strata, what is the best point
of division for proportional allocation? What is the relative precision of 2
strata to 7 strata under proportional allocation?
5.4 Prove the result stated in formula (5.31), section 5.7:
5.5 If there are 2 strata and if $\phi$ is the ratio of the actual $n_1/n_2$ to the
optimum $n_1/n_2$ for fixed sample size (as in section 5. ), show that, whatever the
values of $N_1$, $N_2$, $S_1$, and $S_2$, the relative precision of the actual allocation to
the optimum allocation is never less than
5.6 The variate $y_i$ follows the frequency distribution $e^{-y_i}\,dy_i$ $(0 \le y_i)$.
The population is divided into 2 strata at the point $y_0$, and a stratified random
sample of size $n$ is taken with proportional allocation. Find $V(\bar{y}_{st})$ as a function
of $y_0$. Verify that the value of $y_0$ which minimizes $V$ satisfies the rule
given by Dalenius (section 5.15).
5.7 The following data are derived from a stratified sample of tire dealers
taken in March 1945 (Deming and Simmons, 1946). The dealers were assigned
to strata according to the number of new tires held at a previous census.
The sample means $\bar{y}_h$ are the mean numbers of new tires per dealer.
(i) Estimate the gain in precision due to the stratification. (ii) Compare this
result with the gain that would have been attained from proportional allocation.

Stratum
boundaries     $N_h$      $W_h$     $\bar{y}_h$    $s_h^2$     $n_h$
   1-9        19,850    0.8032     4.1     34.8    3,000
  10-19        3,250    0.1315    13.0     92.2      600
  20-29        1,007    0.0407    25.0    174.2      340
  30-39          606    0.0245    38.2    320.4      230
5.24 References.
ARMITAGE, P. (1947). A comparison of stratified with unrestricted random sampling
from a finite population. Biometrika, 34, 273-280.
COCHRAN, W. G. (1939). The use of analysis of variance in enumeration by sampling.
Jour. Amer. Stat. Assoc., 34, 492-510.
CORNELL, F. G. (1947). A stratified random sample of a small finite population.
Jour. Amer. Stat. Assoc., 42, 523-532.
DALENIUS, T. (1950). The problem of optimum stratification. Skand. Akt., 33,
203-213.
DALENIUS, T., and GURNEY, M. (1951). The problem of optimum stratification.
II. Skand. Akt., 34, 133-148.
DEMING, W. E., and SIMMONS, W. R. (1946). On the design of a sample for dealer
inventories. Jour. Amer. Stat. Assoc., 41, 16-33.
EVANS, W. D. (1951). On stratification and optimum allocations. Jour. Amer.
Stat. Assoc., 46, 95-104.
HAGOOD, M. J., and BERNERT, E. H. (1945). Component indexes as a basis for
stratification. Jour. Amer. Stat. Assoc., 40, 330-341.
JESSEN, R. J. (1942). Statistical investigation of a sample survey for obtaining
farm facts. Iowa Agr. Exp. Sta. Res. Bull. 304.
JESSEN, R. J., and HOUSEMAN, E. E. (1944). Statistical investigations of farm
sample surveys taken in Iowa, Florida and California. Iowa Agr. Exp. Sta. Res.
Bull. 329.
KING, A. J., and McCARTY, D. E. (1941). Application of sampling to agricultural
statistics with emphasis on stratified sampling. Jour. Marketing, 5, 462-474.
KING, A. J., McCARTY, D. E., and McPEEK, M. (1942). An objective method of
sampling wheat fields to estimate production and quality of wheat. U. S. Dept.
Agr. Tech. Bull. 814.
MAHALANOBIS, P. C. (1940). A sample survey of the acreage under jute in Bengal.
Sankhya, 4, 511-530.
NEYMAN, J. (1934). On the two different aspects of the representative method:
the method of stratified sampling and the method of purposive selection. Jour.
Roy. Stat. Soc., 97, 558-606.
SATTERTHWAITE, F. E. (1946). An approximate distribution of estimates of variance
components. Biometrics, 2, 110-114.
STEPHAN, F. F. (1941). Stratification in representative sampling. Jour. Marketing,
6, 38-46.
STEPHAN, F. F. (1946). The expected value and variance of the reciprocal and
other negative powers of a positive Bernoullian variate. Ann. Math. Stat., 16,
50-61.
SUKHATME, P. V. (1935). Contribution to the theory of the representative method.
Supp. Jour. Roy. Stat. Soc., 2, 253-268.
TSCHUPROW, A. A. (1923). On the mathematical expectation of the moments of
frequency distributions in the case of correlated observations. Metron, 2, 461-493
and 646-683.
Not cited in text
DALENIUS, T. (1952). The problem of optimum stratification in a special type
of design. Skand. Akt., 35, 61-70. (Gives the best boundary for dividing a
skew population into two strata, given that the upper stratum is to be sampled
100 per cent.)
CHAPTER 6
RATIO ESTIMATES
where $y$, $x$ are the sample totals of the $y_i$ and $x_i$, respectively.
If $x_i$ is the value of $y_i$ at some previous time, the ratio method uses
the sample to estimate the relative change $Y/X$ that has occurred
since the previous time. The estimated relative change, $y/x$, is multiplied
by the known population total $X$ on the previous occasion, to
provide an estimate of the current population total. If the ratio
$y_i/x_i$ is nearly the same on all sampling units, the values of $y/x$ vary
little from one sample to another, and the ratio estimate is of high
precision. In another application, $x_i$ may be the total acreage of a
farm and $y_i$ the number of acres sown to some crop. The ratio estimate
will be successful in this case if all farmers devote about the same
percentage of their total acreage to this crop.
If the quantity to be estimated is $\bar{Y}$, the population mean value of
$y_i$, the ratio estimate is
TABLE 6.1 SIZES OF 49 LARGE UNITED STATES CITIES (in 1000's) IN 1920
($x_i$) AND 1930 ($y_i$)

 $x_i$   $y_i$      $x_i$   $y_i$      $x_i$   $y_i$
  76    80        2    50      243   291
 138   143      507   634       87   105
  67    67      179   260       30   111
  29    50      121   113       71    79
 381   464       50    64      256   288
  23    48       44    58       43    61
  37    63       77    89       25    57
 120   115       64    63       94    85
  61    69       64    77       43    50
 387   459       56   142      298   317
  93   104       40    60       36    46
 172   183       40    64      161   232
  78   106       38    52       74    93
  66    86      136   139       45    53
  60    57      116   130       36    54
  46    65       46    53       50    58
                                48    75
FIGURE 6.1 Experimental comparison of the ratio estimate with the estimate
based on the sample mean. (Horizontal axis: total population, in millions;
the ratio-estimate panel shows 200 ratio estimates; X denotes the population total.)
Consequently the ratio estimate of the 1930 total for all 196 cities is
$$\hat{Y}_R = \frac{y}{x}X = \frac{6262}{5054}(22{,}919) = 28{,}397$$
The corresponding estimate based on the sample mean per city is
$$\hat{Y} = N\bar{y} = \frac{(196)(6262)}{49} = 25{,}048$$
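Both estimates use only the sample totals and the known 1920 population total. As a minimal sketch (the function names are illustrative, not from the text):

```python
def ratio_estimate_total(y_total, x_total, X_population):
    """Ratio estimate of the population total: (y / x) * X."""
    return y_total / x_total * X_population


def expansion_estimate_total(y_total, n, N):
    """Estimate based on the sample mean per unit: N * ybar."""
    return N * y_total / n


# Sample totals for the 49 cities (in 1000's) and the known 1920 total.
y_r = ratio_estimate_total(y_total=6262, x_total=5054, X_population=22919)
y_exp = expansion_estimate_total(y_total=6262, n=49, N=196)
```

Rounded to the nearest thousand inhabitants, `y_r` is 28,397 and `y_exp` is 25,048, as in the text.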
6.3 Approximate variance and bias of the ratio estimate. The distribution
of the ratio estimate has proved annoyingly intractable, because
both $y$ and $x$ vary from sample to sample. The known theoretical
results fall short of what we would like to know for practical applications.
The principal results will be stated first without proof.
The ratio estimate is consistent (this is obvious). It is biased, except
for some special types of population, although the bias is negligible
in large samples. The limiting distribution of the ratio estimate, as $n$
becomes very large, is normal, subject to some mild restrictions on
the type of population from which we are sampling. In samples of
moderate size, the distribution shows a tendency to positive skewness,
at least for the kinds of population for which the method is most often
used. We do not possess exact formulas for the bias and the sampling
variance of the estimate, but only approximations that are valid in
large samples.
These results amount to saying that there is no difficulty if the
sample is large enough so that (i) the ratio is nearly normally distributed,
and (ii) the large-sample formula for its variance is valid. Deficiencies
in the theory are the lack of (i) a well-substantiated rule to
answer the question: When is the sample large enough?, and (ii) a
serviceable method for estimating confidence limits for small samples.
As a working rule, the large-sample results may be used if the sample
size exceeds 30 and is also large enough so that the coefficients of variation
of $\bar{x}$ and $\bar{y}$ are both less than 10 per cent. This rule is rather
poorly documented as yet.
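$$V(\hat{Y}_R) = \frac{N(N - n)}{n(N - 1)}\sum_{i=1}^{N}(y_i - Rx_i)^2 \qquad (6.2)$$
$$\hat{Y}_R - Y = \frac{\bar{y}}{\bar{x}}X - Y = \frac{N\bar{X}}{\bar{x}}(\bar{y} - R\bar{x})$$
since $R = Y/X$.
If the sample is large, $\bar{x}$ should be close to $\bar{X}$. The approximation
consists in replacing the factor $\bar{X}/\bar{x}$ by 1. This gives
$$\hat{Y}_R - Y \approx N(\bar{y} - R\bar{x}) \qquad (6.4)$$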
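$$V(\hat{Y}_R) = N^2V(\bar{u}) = \frac{N(N - n)}{n}\,\frac{\sum_{i=1}^{N}(u_i - \bar{U})^2}{(N - 1)}$$
An equivalent form is
$$V(\hat{Y}_R) = \frac{(N - n)}{Nn}\,Y^2\left\{\frac{S_y^2}{\bar{Y}^2} + \frac{S_x^2}{\bar{X}^2} - \frac{2S_{yx}}{\bar{Y}\bar{X}}\right\} \qquad (6.6)$$
where $S_{yx} = \rho S_yS_x$ is the covariance between $y_i$ and $x_i$. This relation
may also be written as
$$V(\hat{Y}_R) = \frac{N(N - n)}{n(N - 1)}\sum x_i^2\left(\frac{y_i}{x_i} - R\right)^2 = \frac{N(N - n)}{n(N - 1)}\sum x_i^2(r_i - R)^2$$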
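$$\hat{Y}_R - Y = \frac{N\bar{X}}{\bar{x}}(\bar{y} - R\bar{x}) = N(\bar{y} - R\bar{x})\,\frac{\bar{X}}{\bar{x}} \qquad (6.9)$$
$$\frac{\bar{X}}{\bar{x}} = \frac{\bar{X}}{\bar{X}\left(1 + \dfrac{\bar{x} - \bar{X}}{\bar{X}}\right)} \approx 1 - \frac{\bar{x} - \bar{X}}{\bar{X}}$$
retaining the first term in the Taylor series expansion. Substitution
in (6.9) gives
$$\hat{Y}_R - Y \approx N(\bar{y} - R\bar{x})\left(1 - \frac{\bar{x} - \bar{X}}{\bar{X}}\right)$$
Now
$$E\,\bar{x}(\bar{x} - \bar{X}) = E(\bar{x} - \bar{X})^2 = \frac{(N - n)}{N}\,\frac{S_x^2}{n}$$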
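Similarly,
$$E\,\bar{y}(\bar{x} - \bar{X}) = E(\bar{y} - \bar{Y})(\bar{x} - \bar{X}) = \frac{(N - n)}{Nn}\,\rho S_yS_x \qquad (6.10)$$
by theorem 2.3 (p. 17) and the definition of $\rho$. Hence the leading
term in the bias is
$$E(\hat{Y}_R - Y) = \frac{(N - n)}{n\bar{X}}\,(RS_x^2 - \rho S_yS_x) \qquad (6.11)$$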
N -1
n-l
ESTIMATED VARrANCE UG
This estimate can be shown to have a bias of order $1/n$: no method is
available for obtaining an unbiased sample estimate.
For the estimated variance, $v(\hat{Y}_R)$, this gives
$$v(\hat{Y}_R) = \frac{N(N - n)}{n(n - 1)}\sum_{i=1}^{n}(y_i - \hat{R}x_i)^2 = \frac{N(N - n)}{n(n - 1)}\left(\sum y_i^2 + \hat{R}^2\sum x_i^2 - 2\hat{R}\sum y_ix_i\right) \qquad (6.13)$$
To compute the term inside the bracket, the sums of squares and products
are placed on the same row as their multipliers, as follows:

                               Multiplier
$\sum y_i^2$   = 1,527,882        1
$\sum x_i^2$   = 1,044,504        1.535168 = $\hat{R}^2$
$\sum y_ix_i$  = 1,251,630        2.478038 = $2\hat{R}$

Hence
$$v(\hat{Y}_R) = \frac{(196)(147)}{(49)(48)}\,(29{,}784) = 364{,}854$$
$$s(\hat{Y}_R) = 604$$
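The same computation can be written out from the sums of squares and products, with $\hat{R}$ taken as the ratio of the sample totals. This is a sketch for checking the arithmetic; the function name is illustrative.

```python
from math import sqrt


def ratio_variance_estimate(sum_y2, sum_x2, sum_yx, y_total, x_total, n, N):
    """v(Y_R) = N(N - n) / (n(n - 1)) * (sum y^2 + R^2 sum x^2 - 2 R sum yx)."""
    r_hat = y_total / x_total
    bracket = sum_y2 + r_hat ** 2 * sum_x2 - 2.0 * r_hat * sum_yx
    return N * (N - n) / (n * (n - 1)) * bracket


v = ratio_variance_estimate(
    sum_y2=1527882, sum_x2=1044504, sum_yx=1251630,
    y_total=6262, x_total=5054, n=49, N=196,
)
se = sqrt(v)   # estimated standard error of the ratio estimate of the total
```

Because the multipliers are not rounded here, `v` differs from the printed 364,854 by a few units; the standard error still rounds to 604.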
6.6 Confidence limits. If the sample size is large enough so that the
normal approximation applies, confidence limits for $Y$ and $R$ may be
obtained as follows:
$$Y:\quad \hat{Y}_R \pm t\sqrt{v(\hat{Y}_R)} \qquad (6.15)$$
$$R:\quad \hat{R} \pm t\sqrt{v(\hat{R})} \qquad (6.16)$$
where $t$ is the normal deviate corresponding to the chosen confidence
probability.
In section 6.3 it was suggested that the normal approximation applies
reasonably well if the sample size is at least 30 and is large enough
so that the cv's of $\bar{y}$ and $\bar{x}$ are both less than 0.1. When these conditions
do not apply, the formula for $v(\hat{R})$ tends to give values that are
too low and the positive skewness in the distribution of $\hat{R}$ may become
noticeable.
6.7 Comparison of the ratio estimate with the mean per unit. The
type of estimate of $Y$ which was studied in previous chapters is $N\bar{y}$,
where $\bar{y}$ is the mean per unit for the sample (in simple random sampling)
or a weighted mean per unit (in stratified random sampling).
Estimates of this kind will be called estimates based on the mean per
unit or estimates obtained by simple expansion.
Theorem 6.2 In large samples, with simple random sampling, the
ratio estimate $\hat{Y}_R$ has a smaller variance than the estimate $\hat{Y} = N\bar{y}$
obtained by simple expansion, if
$$\rho > \frac{1}{2}\left(\frac{S_x}{\bar{X}}\right)\Big/\left(\frac{S_y}{\bar{Y}}\right) = \frac{\text{Coefficient of variation of } x_i}{2(\text{Coefficient of variation of } y_i)}$$
Proof: For $\hat{Y}$ we have
$$V(\hat{Y}) = \frac{N(N - n)}{n}\,S_y^2$$
For the ratio estimate we have from (6.5)
$$V(\hat{Y}_R) = \frac{N(N - n)}{n}\,[S_y^2 + R^2S_x^2 - 2R\rho S_yS_x]$$
Hence the ratio estimate has the smaller variance if
$$S_y^2 + R^2S_x^2 - 2R\rho S_yS_x < S_y^2$$
i.e. if
$$\rho > \frac{RS_x}{2S_y} = \frac{1}{2}\left(\frac{S_x}{\bar{X}}\right)\Big/\left(\frac{S_y}{\bar{Y}}\right)$$
This theorem shows that the ratio estimate may be either more or
less precise than a simple expansion. The issue depends on the size
of the correlation coefficient between $y_i$ and $x_i$, and on the cv's of
the two variates. The variability of the auxiliary variate $x_i$ is an
important factor: if its cv is more than twice that of $y_i$, the ratio estimate
is always less precise, since $\rho$ cannot exceed 1. When $x_i$ is the
value of $y_i$ at some previous time, the two cv's may be about equal.
In this event the ratio estimate is superior if $\rho$ exceeds 0.5.
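The condition of the theorem is easy to apply when planning a survey. The sketch below is illustrative (the function names are not from the text). It uses the large-sample variances, for which the ratio of the two variances reduces to $1 + k^2 - 2\rho k$ with $k$ the cv of $x_i$ divided by the cv of $y_i$.

```python
def ratio_beats_expansion(rho, cv_x, cv_y):
    """True if rho > cv_x / (2 cv_y), the large-sample condition of the theorem."""
    return rho > cv_x / (2.0 * cv_y)


def variance_ratio(rho, cv_x, cv_y):
    """V(ratio estimate) / V(simple expansion) = 1 + k^2 - 2 rho k, k = cv_x / cv_y."""
    k = cv_x / cv_y
    return 1.0 + k * k - 2.0 * rho * k


equal_cv = ratio_beats_expansion(rho=0.6, cv_x=0.3, cv_y=0.3)   # rho above 0.5
wide_x = ratio_beats_expansion(rho=1.0, cv_x=0.7, cv_y=0.3)     # cv_x above 2 cv_y
```

With equal cv's the ratio estimate wins once $\rho$ exceeds 0.5, and with the cv of $x_i$ more than twice that of $y_i$ it loses even at $\rho = 1$, in line with the remarks above.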
6.8 CONDITIONS FOR RATIO ESTIMATE TO BE OPTIMUM 1:18
Theorem 6.2 applies only for samples large enough so that the approximate
formula for $V(\hat{Y}_R)$ is valid. In smaller samples the ratio
method probably does not compare as favorably as the theorem suggests,
since the approximate formula is usually an underestimate.
where $w_i = \dfrac{1}{\sigma_i^2} = \dfrac{1}{\lambda x_i}$.
This gives
$$b = \frac{\sum y_i}{\sum x_i} = \frac{\bar{y}}{\bar{x}}$$
6.9 The ratio estimate in sampling for proportions. The ratio method
plays an important role in the estimation of proportions. With simple
random sampling, the usual formula for the variance of an estimated
proportion $p$ is
$$V(p) = \frac{PQ}{n}$$
where $P$ is the population proportion. [The factor $(N - n)/(N - 1)$
is inserted if the fpc is needed.]
As was pointed out in section 3.2, this formula is valid only if the
sample is a simple random sample of units, each of which is classified
into the two classes from which the proportion is derived. For instance,
if the proportion of diseased plants in a wheat field is estimated
by sampling, this formula applies if a simple random sample of individual
plants has been taken, each plant being classified as diseased or
healthy. It is unlikely that this method of sampling would be used.
.~ 1/.
Y
P.., - -=-
N X
I: %i
i- I
(N - n) ;.. 2 2
nN(N _ 1)X2 .~ %i (Pi - P)
Tot&ls 104 53 51 30 U
+ 0'""
2 (iJH)2
-
11m2
Consequently M, - Y, M2 - X.
6.11 RATIO ESTIMATES IN STRATIFIED RANDOM SAMPLING 129
where $R_h = Y_h/X_h$ is the true ratio in stratum $h$, and $\rho_h$ is defined as
before in each stratum.
Proof: Write
$$\hat{Y}_{Rh} = \frac{y_h}{x_h}X_h$$
Then
$$(\hat{Y}_{Rs} - Y) = \sum_h(\hat{Y}_{Rh} - Y_h)$$
Hence
$$V(\hat{Y}_{Rs}) = E(\hat{Y}_{Rs} - Y)^2 = \sum_h E(\hat{Y}_{Rh} - Y_h)^2 + 2\sum_h\sum_{j>h}E(\hat{Y}_{Rh} - Y_h)(\hat{Y}_{Rj} - Y_j)$$
Since $\hat{Y}_{Rh}$ is the ratio estimate made from a simple random sample
within stratum $h$, we may use formula (6.5) for the approximate variance
of $\hat{Y}_{Rh}$, i.e.
$$V(\hat{Y}_{Rh}) = \frac{N_h(N_h - n_h)}{n_h}\,[S_{yh}^2 + R_h^2S_{xh}^2 - 2R_h\rho_hS_{yh}S_{xh}]$$
For example, with 50 strata and the cv of $\bar{x}_h$ about 0.1 in each
stratum, the bias in $\hat{Y}_{Rs}$ might be as large as 0.7 times its standard
error.
In the present state of our knowledge, $\hat{Y}_{Rs}$ is to be avoided unless
$\sqrt{L}$ (cv of $\bar{x}_h$) appears to be less than 0.2. This rule is probably too
conservative, because in practice the bias may be much smaller than
its upper bound, particularly if within each stratum the relation between
$y_{hi}$ and $x_{hi}$ is approximately a straight line through the origin.
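The working rule amounts to a one-line calculation. The sketch below is illustrative; it assumes, as the example of 50 strata suggests, that the bound on the ratio of the bias to the standard error is $\sqrt{L}$ times a common cv of $\bar{x}_h$. The function names are hypothetical.

```python
from math import sqrt


def separate_ratio_bias_bound(n_strata, cv_xbar_h):
    """Upper bound on |bias| / (standard error): sqrt(L) * (cv of xbar_h)."""
    return sqrt(n_strata) * cv_xbar_h


def separate_ratio_acceptable(n_strata, cv_xbar_h, limit=0.2):
    """Working rule: use the separate ratio estimate only if the bound is below 0.2."""
    return separate_ratio_bias_bound(n_strata, cv_xbar_h) < limit


bound_50 = separate_ratio_bias_bound(50, 0.1)   # the case quoted in the text
```

Here `bound_50` is about 0.7, matching the figure quoted. With a cv of 0.1 in every stratum the rule admits at most 3 strata, which shows how conservative it is.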
where $\bar{y}_{st} = \hat{Y}_{st}/N$, $\bar{x}_{st} = \hat{X}_{st}/N$ are the estimated population means
from a stratified sample.
The estimate $\hat{Y}_{Rc}$ does not require a knowledge of the $X_h$, but only
of $X$.
Now consider the variate $u_{hi} = y_{hi} - Rx_{hi}$. The right side of equation
(6.23) is $N\bar{u}_{st}$, where $\bar{u}_{st}$ is the weighted mean of the variate $u_{hi}$
in a stratified sample. Further, the population mean $\bar{U}$ of $u_{hi}$ is zero,
since $R = \bar{Y}/\bar{X}$.
Hence we may apply to $\bar{u}_{st}$ theorem 5.3 for the variance of the estimated
mean from a stratified random sample. This gives
N"(N,, - n",) I
n~J-~v~~- E ~
It nit
where
Size
Stra.ta N~ Sp,' Sp. S ..' R,.,
(fa.nn &Cree)
The last three quantities, $Q_h$, $V_h'$, and $V_h''$, are auxiliary quantities
to be used in the computations, the last two being defined later.
We consider five methods of estimating the population mean corn
acres per farm. The fpc will be ignored.
i. Simple random sample: mean per farm estimate.
$$V_1 = \frac{S_y^2}{n} = \frac{620}{100} = 6.20$$
ii. Simple random sample: ratio estimate.
$$V_2 = \frac{1}{n}\,[S_y^2 + R^2S_x^2 - 2RS_{yx}]$$
(6.24)
where $d_{hi} = y_{hi} - R_hx_{hi}$ is the deviation of $y_{hi}$ from $R_hx_{hi}$. By the
methods given in chapter 5 for finding optimum allocation, it follows
that (6.24) is minimized subject to a total cost of the form $\sum c_hn_h$,
when
$$n_h \propto \frac{N_hS_{dh}}{\sqrt{c_h}}$$
With a mean per unit it will be recalled that for minimum variance
$n_h$ is chosen proportional to $N_hS_{yh}/\sqrt{c_h}$.
In the planning of a sample, the allocation with a ratio estimate
may appear a little perplexing, because it seems difficult to speculate
about the likely values of $S_{dh}$. Two rules are helpful. With a population
in which the ratio estimate is a best linear unbiased estimate, $S_{dh}$
will be roughly proportional to $\sqrt{\bar{X}_h}$ (by theorem 6.3). In this case
the $n_h$ should be proportional to $N_h\sqrt{\bar{X}_h}/\sqrt{c_h}$. Sometimes the variance
of $d_{hi}$ may be more nearly proportional to $\bar{X}_h^2$. This leads to
the allocation of $n_h$ proportional to $N_h\bar{X}_h/\sqrt{c_h}$, i.e. to the stratum
total of $x_{hi}$, divided by the square root of the cost per unit. An example
of this type is discussed by Hansen, Hurwitz, and Gurney (1946),
for a sample designed to estimate sales of retail stores.
If the estimate $\hat{Y}_{Rc}$ is to be used, the same general argument applies.
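All of these rules have the same form: $n_h$ proportional to $N_h$ times some measure of within-stratum variability, divided by $\sqrt{c_h}$. A minimal sketch follows, in which the measure passed in may be $S_{dh}$, $\sqrt{\bar{X}_h}$ or $\bar{X}_h$ according to the rule chosen. The function name and the numerical inputs are hypothetical, and the allocations are left unrounded.

```python
from math import sqrt


def allocate(n, N_h, spread_h, cost_h):
    """Allocate n units so that n_h is proportional to N_h * spread_h / sqrt(c_h)."""
    weights = [N * s / sqrt(c) for N, s, c in zip(N_h, spread_h, cost_h)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Illustrative inputs only: two strata of equal size and equal cost per unit,
# the second with three times the spread of the first.
n_h = allocate(40, N_h=[100, 100], spread_h=[1.0, 3.0], cost_h=[1.0, 1.0])
```

With these inputs the second stratum receives three-quarters of the sample. In practice the $n_h$ are rounded to integers and capped at $N_h$.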
Example. The different methods of allocation can be compared
from data collected in a complete enumeration of 257 commercial
peach orchards in North Carolina in June 1946 (Finkner, 1950). The
purpose of this survey was to determine the most efficient sampling
procedure for estimating commercial peach production in this area.
Information was obtained on the number of peach trees and the estimated
total peach production in each orchard. The high correlation
The upper part of the table shows the basic data. The method
employed to calculate the four variances was first to find the $n_h$ for
each type of allocation. These values are shown in the columns headed
(i)-(iv) in the lower part of the table. Thus, with allocation i, $n_h = nN_h/N$,
so that in the first stratum
$$n_1 = \frac{(100)(47)}{256} = 18$$
EXERCISES 187
When the $n_h$ have been obtained, the corresponding $V(\hat{Y}_{Rs})$ is
found by substituting in the formula
$$V(\hat{Y}_{Rs}) = \sum\frac{N_h(N_h - n_h)}{n_h}\,S_{dh}^2$$
where
$$S_{dh}^2 = S_{yh}^2 + R_h^2S_{xh}^2 - 2R_hS_{yxh}$$
The quantities $S_{dh}^2$ are the same for all four allocations and are given
on the extreme right of the top half of table 6.4.
The variances and relative precisions are shown in table 6.5.
Vari&n~
Method of
allocation: "" Strata Relative
proportional preci.ei.on
to Total
1 2 3
6.1 In a field of barley the grain, $y_i$, and the grain plus straw, $x_i$, were
weighed for each of a large number of sampling units located at random over
the field. The total produce (grain plus straw) of the whole field was also
weighed. The following data were obtained:
It requires 20 min to cut, thresh, and weigh the grain on each unit, 2 min
to weigh the straw on each unit, and 2 hr to collect and weigh the total produce
of the field. How many units must be taken per field in order that the ratio
estimate be more economical than the mean per unit?
6.2 For the data in table 6.1, $\hat{Y}_R$ = 28,397 and
.. 3 2 0
..2
1
3 0 3 0
"
Estimate the variance of the proportion of persons who saw a dentist, and
compare this with the binomial estimate of the variance.
6.4 The following data are for a small artificial population with $N = 8$
and two strata of equal size:

   Stratum 1          Stratum 2
 $x_{1i}$   $y_{1i}$      $x_{2i}$   $y_{2i}$
   2      0         10      7
   5      3         18     15
   9      7         21     10
  15     10         25     16
REGRESSION ESTIMATES
7.1 The linear regression estimate. Like the ratio estimate, the linear
regression estimate is designed to increase precision by the use of an
auxiliary variate $x_i$ which is correlated with $y_i$. When the relation
between $y_i$ and $x_i$ is examined, it may be found that, although the relation
is approximately linear, the line does not go through the origin.
This suggests an estimate based on the linear regression of $y_i$
on $x_i$ rather than on the ratio of the two variables.
We suppose that $y_i$ and $x_i$ are each obtained for every unit in the
sample, and that the population mean $\bar{X}$ of the $x_i$ is known. The
sample regression of $y_i$ on $x_i$ is computed. For the present, we assume
that the least squares regression coefficient $b$ is used, where
$$b = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
$$\bar{y} + (\bar{X} - \bar{x}) = \bar{X} + (\bar{y} - \bar{x}) = (\text{Pop. mean of rapid estimate}) + (\text{Adjustment for bias})$$
Our knowledge of the properties of the regression estimate is of
about the same scope as our knowledge for the ratio estimate. The
regression estimate is consistent, although this is in the trivial sense
that, when the sample comprises the whole population, $\bar{x} = \bar{X}$, and
the regression estimate makes no adjustment. As will be shown, the
estimate is in general biased, but the ratio of the bias to the standard
error becomes small when the sample is large. We possess a large-sample
formula for the variance of the estimate, but more information
is needed about the distribution of the estimate in small samples and
about the value of $n$ required for the practical use of large-sample
results.
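For concreteness, the estimate can be computed as follows. The sketch assumes the usual form of the linear regression estimate of the population mean, $\bar{y} + b(\bar{X} - \bar{x})$, with $b$ the least squares coefficient; the function name and the small data set are illustrative only.

```python
def regression_estimate_mean(y, x, X_bar):
    """Return (ybar + b (Xbar - xbar), b), with b the least squares slope of y on x."""
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    s_yx = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_yx / s_xx
    return y_bar + b * (X_bar - x_bar), b


# Artificial data lying exactly on the line y = 2 + 3x, with known Xbar = 5.
estimate, slope = regression_estimate_mean(y=[5, 8, 11, 14], x=[1, 2, 3, 4], X_bar=5.0)
```

When the sample points lie exactly on a line, as here, the estimate returns the height of that line at $\bar{X}$ whatever sample is drawn. A ratio estimate would do so only if the line passed through the origin.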
$$y_i = \bar{Y} + B(x_i - \bar{X}) + e_i \qquad (7.3)$$
(7.4)
L: (Xi - X)2
(7.7)
L: (Yi - y)2
where the sums extend over all units in the population, then
(7.8)
$$v(\bar{y}_{lr}) = \frac{(N - n)}{Nn(n - 1)}\sum_{i=1}^{n}\{(y_i - \bar{y}) - b(x_i - \bar{x})\}^2 \qquad (7.9)$$
Formally, this is the same as equation (7.3), except that an extra subscript
$j$ has been added as a reminder that in standard regression theory
there is a whole frequency distribution or array of values of $y_{ij}$ and
$e_{ij}$ for each value of $x_i$. The theory assumes simple random sampling,
and further that, in every array in which $x_i$ is fixed,
$$E(e_{ij}) = 0; \qquad E(e_{ij}^2) = S_e^2 = \text{Constant}$$
From this model, by the same analysis as in theorem 7.1, we obtain
$$\bar{y}_{lr} - \bar{Y} = \bar{e} + (b - B)(\bar{X} - \bar{x}) \qquad (7.10)$$
Now, if $b$ is the least squares regression coefficient,
$$b = \frac{\sum(y_i - \bar{y})(x_i - \bar{x})}{\sum(x_i - \bar{x})^2}$$
where the extra subscript $j$ has been dropped. Substitution for $y_i$
and $\bar{y}$ from equation (7.3') gives
$$b = B + \frac{\sum e_i(x_i - \bar{x})}{\sum(x_i - \bar{x})^2} \qquad (7.11)$$
Hence
$$b - B = \frac{\sum e_i(x_i - \bar{x})}{\sum(x_i - \bar{x})^2} \qquad (7.12)$$
In repeated samples in which the $x_i$ remain fixed from sample to sample,
it is easy to verify that the covariance of $\bar{e}$ and $(b - B)$ is zero,
and that
$$V(b) = E(b - B)^2 = \frac{S_e^2}{\sum(x_i - \bar{x})^2}$$
Hence, from (7.10),
$$V(\bar{y}_{lr}) = E(\bar{y}_{lr} - \bar{Y})^2 = \frac{S_e^2}{n} + (\bar{X} - \bar{x})^2V(b) = S_e^2\left\{\frac{1}{n} + \frac{(\bar{x} - \bar{X})^2}{\sum(x_i - \bar{x})^2}\right\} \qquad (7.13)$$
The average of the second term in (7.16) turns out to be of order lin
and will not be considered.
Hence, dividing by $(n - 1)S_x^2$, the leading term in the bias of $\bar{y}_{lr}$
is, to terms of order $1/n$,
$$-\frac{(N - n)}{(n - 1)NS_x^2}\left\{\frac{\sum e_i(x_i - \bar{X})^2}{(N - 1)}\right\} \qquad (7.17)$$
The expression inside the brackets is the population covariance between
$e_i$ and $(x_i - \bar{X})^2$; it represents a contribution from the quadratic
regression of $y_i$ on $x_i$, and vanishes if the relation between $y_i$
and $x_i$ is linear. Since the bias of $\bar{y}_{lr}$ is of order $1/n$, while its standard
error is of order $1/\sqrt{n}$, the bias becomes negligible in large samples.
7.6 Comparison with the ratio estimate and the mean per unit. For
these comparisons the sample size $n$ must be sufficiently large so that
the approximate formulas for the variances of the ratio and regression
estimates are valid. The three comparable variances for the estimated
population mean $\bar{Y}$ are as follows:
$$V(\bar{y}_{lr}) = \frac{(N - n)}{Nn}\,S_y^2(1 - \rho^2) \qquad \text{(Regression)}$$
$$V(\bar{y}_R) = \frac{(N - n)}{Nn}\,(S_y^2 + R^2S_x^2 - 2R\rho S_yS_x) \qquad \text{(Ratio)}$$
$$V(\bar{y}) = \frac{(N - n)}{Nn}\,S_y^2 \qquad \text{(Mean per unit)}$$
It is apparent that the variance of the regression estimate is smaller
than that of the mean per unit unless $\rho = 0$, in which case the two
variances are equal.
The variance of the regression estimate is less than that of the ratio
estimate if
$$-\rho^2S_y^2 < R^2S_x^2 - 2R\rho S_yS_x \qquad (7.18)$$
This is equivalent to the inequality
$$(\rho S_y - RS_x)^2 > 0$$
Therefore the regression estimate is more precise than the ratio estimate
unless
$$\rho = \frac{RS_x}{S_y} = \frac{\text{Coefficient of variation of } x_i}{\text{Coefficient of variation of } y_i} \qquad (7.19)$$
since $R = \bar{Y}/\bar{X}$.
g
... - X - 11B
f
S•.2(1 - p2) (1 + D
to
EfJ" - Y + b·E(X - f) - Y
Example. The precision of the regression, ratio, and mean per unit
estimates from a simple random sample can be compared using data
collected in the complete enumeration of peach orchards described on
p. 135. In this example, $y_i$ is the estimated peach production in an
orchard and $x_i$ the number of peach trees in the orchard. We will
compare the estimates of the total production of the 256 orchards,
as made from a sample of 100 orchards. It is doubtful whether the
sample is large enough to make the variance formulas fully valid,
since the cv's of $\bar{y}$ and $\bar{x}$ are both somewhat higher than 10 per cent,
but the example will serve to illustrate the computations. The basic
data are as follows:
$$S_y^2 = 6409, \qquad S_{yx} = 4434, \qquad S_x^2 = 3898$$
$$R = 1.270, \qquad \rho = 0.887, \qquad n = 100, \qquad N = 256$$
$$V(\hat{Y}_{lr}) = \frac{N(N-n)}{n}S_y^2(1-\rho^2)$$
$$V(\hat{Y}_R) = \frac{N(N-n)}{n}(S_y^2 + R^2S_x^2 - 2RS_{yx})$$
$$= \frac{(256)(156)}{100}[6409 + (1.613)(3898) - 2(1.270)(4434)] = 573{,}000$$
$$V(\hat{Y}) = \frac{N(N-n)}{n}S_y^2 = 2{,}559{,}000$$
There is little to choose between the regression and the ratio esti-
mates, as might be expected from the nature of the variables. Both
techniques are greatly superior to the mean per unit.
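The three variances in this example can be recomputed directly from the basic data. The following is a minimal sketch; the variable names are illustrative only, and the regression figure it prints is simply what the stated formula gives with the quoted value of ρ.

```python
# Basic data quoted in the example (population values for the 256 orchards).
S_y2, S_yx, S_x2 = 6409.0, 4434.0, 3898.0
R, rho = 1.270, 0.887
n, N = 100, 256

# Common factor N(N - n)/n for the variance of an estimated population total.
expansion = N * (N - n) / n

v_regression = expansion * S_y2 * (1 - rho ** 2)
v_ratio = expansion * (S_y2 + R ** 2 * S_x2 - 2 * R * S_yx)
v_mean_per_unit = expansion * S_y2

for label, v in [("Regression", v_regression),
                 ("Ratio", v_ratio),
                 ("Mean per unit", v_mean_per_unit)]:
    print(f"{label:14s} {v:12,.0f}")
```

The ratio and mean-per-unit figures agree with the values quoted above after rounding to the nearest thousand.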
There are two types of approach to the sampling theory for $\bar{y}_{lrs}$,
corresponding to the two approaches made with simple random sampling.
On the one hand we may assume that the population size in
each stratum is infinite and that the regression really is linear within
each stratum, so that the results of standard least squares theory may
be applied. These assumptions are not too unrealistic for some
applications (e.g. in agricultural sampling). On the other hand there is
the large-sample theory (as in section 7.2) which does not assume an
infinite population or a linear regression. Since both results may be
useful on occasion, two versions of the theorem for $V(\bar{y}_{lrs})$ will be
given. The elementary theory will be presented first.
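$$V(\bar{y}_{lrs}) = \sum_{h=1}^{L}\left(\frac{N_h}{N}\right)^2 S_{yh}^2(1-\rho_h^2)\left\{\frac{1}{n_h} + \frac{(\bar{x}_h-\bar{X}_h)^2}{\sum_i (x_{hi}-\bar{x}_h)^2}\right\} \tag{7.22}$$
Since
$$\bar{y}_{lrs} = \sum_{h=1}^{L}\frac{N_h}{N}\,\bar{y}_{lrh}$$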
where $S_{e.h}^2 = S_{yh}^2(1-\rho_h^2)$. The most precise estimate of B is,
theoretically, obtained by weighting each $b_h$ inversely as its variance.
This will be found to give
$$b_{opt} = \frac{\sum_h\sum_i g_h(y_{hi}-\bar{y}_h)(x_{hi}-\bar{x}_h)}{\sum_h\sum_i g_h(x_{hi}-\bar{x}_h)^2}$$
where $g_h = 1/S_{e.h}^2$. This estimate reduces to (7.25) only if the
residual variances are the same in all strata. In practice, $b_{opt}$ cannot be
used, because we have to insert sample estimates of $S_{e.h}^2$, with a
resultant loss of precision from errors in these estimates. These errors
are small when the samples within strata are large, but in that event
the sampling error of b makes only a negligible contribution to $V(\bar{y}_{lrc})$.
Consequently any improvement on the customary combined estimate
of B will probably be small unless the total sample size is, say, less
than 50, and there are large differences between the residual variances
in different strata.
In presenting the elementary form of the result for $V(\bar{y}_{lrc})$, we shall
suppose that the sample regression coefficient, b' say, is some weighted
mean of the $b_h$, where the weights depend only on the $x_{hi}$. Such a
function includes, as particular cases, both the customary combined
b and $b_{opt}$, and enables $V(\bar{y}_{lrc})$ to be stated slightly more generally.
Theorem 7.4a Suppose that each stratum may be regarded as infinite
and that
$$y_{hij} = \bar{Y}_h + B(x_{hj}-\bar{X}_h) + e_{hij} \tag{7.26}$$
where, for any fixed $x_{hj}$,
$$\sum_{h=1}^{L}\left(\frac{N_h}{N}\right)^2\frac{1}{n_h}S_{yh}^2(1-\rho_h^2) + (\bar{x}_{st}-\bar{X})^2V(b')$$
Corollary 1 If
$$b' = b = \frac{\sum_h\sum_i (y_{hi}-\bar{y}_h)(x_{hi}-\bar{x}_h)}{\sum_h\sum_i (x_{hi}-\bar{x}_h)^2}$$
(7.28)
Also
The result follows by applying the usual formula for the variance of a
linear function.
Corollary 2 There are various particular forms of this result, according
to the type of allocation adopted. For instance, if $S_{yh}^2(1-\rho_h^2)$
is constant in all strata and proportional allocation is used, formula
(7.28) becomes
$$V(\bar{y}_{lrc}) = S_{yh}^2(1-\rho_h^2)\left\{\frac{1}{n} + \frac{(\bar{x}_{st}-\bar{X})^2}{\sum_h\sum_i (x_{hi}-\bar{x}_h)^2}\right\} \tag{7.29}$$
With simple random sampling, the contribution of the sampling
error of b to $V(\bar{y}_{lr})$ was found to be approximately $1/n$ times the total
variance. Unfortunately this result does not always hold for $V(\bar{y}_{lrc})$.
For equation (7.29) the result is valid, but in the more general expression
(7.28) it sometimes happens that the major contribution to the
variance comes from a single stratum, say stratum h. An examination
of formula (7.28), which will not be presented, shows that in this
situation the contribution from V(b) may be as large as $1/n_h$ times the
total variance. In samples of moderate size it is therefore advisable
to check that the contribution of V(b) is negligible before discarding it.
The more general theory for $\bar{y}_{lrc}$, in which the assumption of a linear
regression is relaxed, becomes quite complicated. We shall carry it
only far enough to exhibit, in a general way, what happens. The
within-stratum regression coefficients are defined as follows:
$$B_h = \frac{\sum_i (y_{hi}-\bar{Y}_h)(x_{hi}-\bar{X}_h)}{\sum_i (x_{hi}-\bar{X}_h)^2}$$
Substitute for $y_{hi}$ from equation (7.30). Hence the error of estimate is
With the combined estimate, $s_{e.h}^2$ may be taken as
7.10 Exercises.
7.1 A population contains 6 units, with the following values of $y_i$ and $x_i$:
Unit
1 2 3 4 5 6
2 4 I) 8 10 12
o 3 4 I) 6
By working out all possible cases, compare the precisions of the ratio and
linear regression estimates for simple random samples of size 2. Compute
the contributions of the bias to the variances.
7.2 From the sample data in table 6.1 (p. 113) compute the regression
estimate of the 1930 total number of inhabitants in the 196 large cities. Find
the standard error of this estimate, and compare its precision with that of
the ratio estimate.
7.3 In the previous exercise, find the estimated total number of inhabitants,
and its standard error, if b is arbitrarily taken as 1.
7.4 By working out all possible cases, compare the precisions of the separate
and combined regression estimates of the total Y of the following population,
when simple random samples of size 2 are drawn from each stratum:
     Stratum 1           Stratum 2
  x_1i     y_1i       x_2i     y_2i
   4        0          5        7
   6        3          6       12
   7        5          8       13
Use the ordinary least squares estimates of the B's, formula (7.25) for b.
7.11 References.
COCHRAN, W. G. (1942). Sampling theory when the sampling units are of unequal
sizes. Jour. Amer. Stat. Assoc., 37, 199-212.
FISHER, R. A. (1928). Moments and product moments of sampling distributions.
Proc. London Math. Soc., 2, 30, 199-238.
WATSON, D. J. (1937). The estimation of leaf areas. Jour. Agr. Sci., 27, 474.
YATES, F. (1949). Sampling methods for censuses and surveys. Charles Griffin and
Co., London.
CHAPTER 8
SYSTEMATIC SAMPLING
[Figure: horizontal axis labelled "Unit number"]
   I    II   III    IV     V
   1     2     3     4     5
   6     7     8     9    10
  11    12    13    14    15
  16    17    18    19    20
  21    22    23
samples have n = 5, while the last two have n = 4. This fact
introduces a disturbance into the theory of systematic sampling. The
disturbance is probably negligible if n exceeds 50, and will be ignored,
for simplicity, in the presentation of theory. The disturbance is
unlikely to be large even when n is small.
8.2 An alternative view. There is another way of looking at systematic
sampling. With N = nk, the k possible systematic samples are
shown in the columns of table 8.2. It is evident from this table that
the population has been divided into k large sampling units, each of
which contains n of the original units. The operation of choosing a
Sample number    1    2    ...    k
Means    $\bar{y}_1$    $\bar{y}_2$    ...    $\bar{y}_k$
is the variance among units which lie within the same systematic
sample. The denominator of this variance, k(n - 1), is constructed
by the usual rules in the analysis of variance: each of the k samples
contributes (n - 1) degrees of freedom to the sum of squares in the
numerator.
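Proof: By the usual identity of the analysis of variance
$$(N-1)S^2 = \sum_i\sum_j (y_{ij}-\bar{Y})^2$$
$$V(\bar{y}_{sy}) = \frac{S^2}{n}\left[\frac{(N-1)}{N} + (n-1)\rho_w\right] \tag{8.3}$$
where
$$\rho_w = \frac{2}{kn(n-1)S^2}\sum_{i=1}^{k}\sum_{j<u}(y_{ij}-\bar{Y})(y_{iu}-\bar{Y})$$
$$= \sum_{i=1}^{k}\left[(y_{i1}-\bar{Y}) + (y_{i2}-\bar{Y}) + \cdots + (y_{in}-\bar{Y})\right]^2$$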
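The relation between $V(\bar{y}_{sy})$, $S^2$ and the within-sample correlation can be checked numerically on a small population. The sketch below assumes the definitions $V(\bar{y}_{sy}) = \frac{1}{k}\sum(\bar{y}_i-\bar{Y})^2$ and $\rho_w = 2\sum\sum_{j<u}(y_{ij}-\bar{Y})(y_{iu}-\bar{Y})/[kn(n-1)S^2]$; the population values and function name are hypothetical.

```python
from itertools import combinations

def check_identity(columns):
    """columns[i] holds the n units of the i-th systematic sample (k samples)."""
    k, n = len(columns), len(columns[0])
    N = k * n
    units = [y for col in columns for y in col]
    Y_bar = sum(units) / N
    S2 = sum((y - Y_bar) ** 2 for y in units) / (N - 1)

    # Direct variance of the systematic-sample mean over the k possible samples.
    v_direct = sum((sum(col) / n - Y_bar) ** 2 for col in columns) / k

    # Cross-products of deviations for pairs of units in the same sample.
    cross = sum((a - Y_bar) * (b - Y_bar)
                for col in columns for a, b in combinations(col, 2))
    rho_w = 2 * cross / (k * n * (n - 1) * S2)

    v_formula = (S2 / n) * ((N - 1) / N + (n - 1) * rho_w)
    return v_direct, v_formula, rho_w

# Hypothetical population of N = 12 units with k = 3, n = 4.
population = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
samples = [population[i::3] for i in range(3)]
v_direct, v_formula, rho_w = check_identity(samples)
print(v_direct, v_formula, rho_w)
```

The two variance values agree up to floating-point rounding for any population with N = nk.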
This is the variance among units that lie in the same stratum. The
divisor n(k - 1) is used because each of the n strata contributes
(k - 1) degrees of freedom. Further
$$\rho_{wst} = \frac{2}{kn(n-1)}\sum_i\sum_{j<u}(y_{ij}-\bar{y}_{.j})(y_{iu}-\bar{y}_{.u})/S_{wst}^2 \tag{8.5}$$
                     Systematic sample numbers               Strata
Strata     1   2   3   4   5   6   7   8   9  10             means
I          0   1   1   2   5   4   7   7   8   6              4.1
II         6   8   9  10  13  12  15  16  16  17             12.2
III       18  19  20  20  24  23  25  28  29  27             23.3
IV        26  30  31  31  33  32  35  37  38  38             33.1
Totals    50  58  61  63  75  71  82  88  91  88             72.7
$$V_{sy} = \frac{1}{160}\left[(50)^2 + (58)^2 + \cdots + (88)^2 - \frac{(727)^2}{10}\right] = 11.63$$
For random and stratified random sampling, we need an analysis
of variance of the population into "between rows" and "within rows."
This is presented in table 8.4. Hence the variances of the estimated
                           df       ss        ms
Between rows (strata)       3     4828.3
Within strata              36      485.5     13.49 = $S_{wst}^2$
Totals                     39     5313.8    136.25 = $S^2$
$$V_{st} = \left(\frac{N-n}{N}\right)\frac{S_{wst}^2}{n} = \frac{9}{10}\cdot\frac{13.49}{4} = 3.04$$
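The three variances for this population can be recomputed from the unit values. The following sketch takes the table 8.3 data as read here (rows are strata, columns are the ten systematic samples) and assumes the usual formulas $V_{ran} = \frac{N-n}{N}\frac{S^2}{n}$ and $V_{st} = \frac{N-n}{N}\frac{S_{wst}^2}{n}$; the variable names are illustrative.

```python
# Table 8.3 as read here: 4 strata (rows) by 10 systematic samples (columns).
strata = [
    [0, 1, 1, 2, 5, 4, 7, 7, 8, 6],
    [6, 8, 9, 10, 13, 12, 15, 16, 16, 17],
    [18, 19, 20, 20, 24, 23, 25, 28, 29, 27],
    [26, 30, 31, 31, 33, 32, 35, 37, 38, 38],
]
n, k = len(strata), len(strata[0])   # n = 4 units per sample, k = 10 samples
N = n * k

units = [y for row in strata for y in row]
Y_bar = sum(units) / N

# Total and within-strata mean squares.
S2 = sum((y - Y_bar) ** 2 for y in units) / (N - 1)
within_ss = sum(sum((y - sum(row) / k) ** 2 for y in row) for row in strata)
S2_wst = within_ss / (n * (k - 1))

# Each column of the table is one systematic sample.
totals = [sum(col) for col in zip(*strata)]
v_sy = sum((t / n - Y_bar) ** 2 for t in totals) / k
v_ran = (N - n) / N * S2 / n
v_st = (N - n) / N * S2_wst / n

print(f"V_sy = {v_sy:.2f}, V_st = {v_st:.2f}, V_ran = {v_ran:.2f}")
```

Small differences in the last decimal place from the printed figures can arise because the text rounds the mean squares before substituting them.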
Both stratified random sampling and systematic sampling are much
more effective than simple random sampling, but, as anticipated,
systematic sampling is less precise than stratified random sampling.
Table 8.5 shows the same data, with the order of the observations
reversed in the second and fourth strata. This has the effect of making
$\rho_{wst}$ negative, because it makes the majority of the cross-products
between deviations from the strata means negative for pairs of
observations that lie in the same systematic sample. In the first systematic
TABLE 8.5 DATA IN TABLE 8.3, WITH THE ORDER REVERSED IN STRATA II AND IV
I          0   1   1   2   5   4   7   7   8   6              4.1
II        17  16  16  15  12  13  10   9   8   6             12.2
III       18  19  20  20  24  23  25  28  29  27             23.3
IV        38  38  37  35  32  33  31  31  30  26             33.1
Totals    73  74  74  72  73  73  73  75  75  65             72.7
sample, for instance, the deviations from the strata means are now
-4.1, +4.8, -5.3, +4.9. Of the six products of pairs of deviations,
four are negative. Roughly the same situation applies in every
systematic sample.
This change does not affect $V_{ran}$ and $V_{st}$. With systematic sampling,
it brings about a dramatic increase in precision, as is seen when
the systematic sample totals in table 8.5 are compared with those in
table 8.3. We now have
$$V_{sy} = \frac{1}{160}\left[(73)^2 + (74)^2 + \cdots + (65)^2 - \frac{(727)^2}{10}\right] = 0.46$$
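The effect of reversing strata II and IV can be verified in a few lines. This sketch uses the table 8.5 values as read here and counts the negative cross-products in the first systematic sample; the variable names are illustrative.

```python
from itertools import combinations

# Table 8.5 as read here: strata II and IV are in reversed order.
strata = [
    [0, 1, 1, 2, 5, 4, 7, 7, 8, 6],
    [17, 16, 16, 15, 12, 13, 10, 9, 8, 6],
    [18, 19, 20, 20, 24, 23, 25, 28, 29, 27],
    [38, 38, 37, 35, 32, 33, 31, 31, 30, 26],
]
n, k = len(strata), len(strata[0])
Y_bar = sum(sum(row) for row in strata) / (n * k)

totals = [sum(col) for col in zip(*strata)]
v_sy = sum((t / n - Y_bar) ** 2 for t in totals) / k

# Deviations from the strata means for the first systematic sample.
stratum_means = [sum(row) / k for row in strata]
deviations = [row[0] - m for row, m in zip(strata, stratum_means)]
negative_products = sum(1 for a, b in combinations(deviations, 2) if a * b < 0)

print(totals)
print(f"V_sy = {v_sy:.3f}, negative cross-products = {negative_products} of 6")
```

The systematic sample totals are now nearly equal, which is why the variance among sample means drops so sharply.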
The crucial conditions are that all $y_i$ have the same mean $\mu$, i.e. there
is no trend, and that no linear correlation exists between the values
$y_i$ and $y_j$ at two different points. The variance $\sigma_i^2$ may change from
point to point in the series.
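Proof: For any specific finite population,
$$V_{ran} = \frac{(N-n)}{Nn}\,\frac{\sum_{i=1}^{N}(y_i-\bar{Y})^2}{(N-1)}$$
Now
$$\sum_{i=1}^{N}(y_i-\bar{Y})^2 = \sum_{i=1}^{N}[(y_i-\mu)-(\bar{Y}-\mu)]^2 = \sum_{i=1}^{N}(y_i-\mu)^2 - N(\bar{Y}-\mu)^2$$
Hence
$$\mathcal{E}V_{ran} = \frac{(N-n)}{Nn(N-1)}\left\{\sum_{i=1}^{N}\sigma_i^2 - \frac{N}{N^2}\sum_{i=1}^{N}\sigma_i^2\right\}$$
This gives
$$\mathcal{E}V_{ran} = \frac{(N-n)}{N^2n}\sum_{i=1}^{N}\sigma_i^2$$
Turning to $V_{sy}$, let $\bar{y}_u$ denote the mean of the uth systematic sample.
For any specific finite population,
$$V_{sy} = \frac{1}{k}\sum_{u=1}^{k}(\bar{y}_u-\bar{Y})^2 = \frac{1}{k}\left\{\sum_{u=1}^{k}(\bar{y}_u-\mu)^2 - k(\bar{Y}-\mu)^2\right\}$$
By the theorem for the variance of the mean of an uncorrelated
sample from an infinite population,
$$\mathcal{E}V_{sy} = \frac{1}{k}\left\{\frac{1}{n^2}\sum_{i=1}^{N}\sigma_i^2 - \frac{k}{N^2}\sum_{i=1}^{N}\sigma_i^2\right\} = \frac{(N-n)}{N^2n}\sum_{i=1}^{N}\sigma_i^2 = \mathcal{E}V_{ran}$$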
(8.7)
To find the variance within strata, $S_{wst}^2$, we need only replace N by
k in (8.6). This gives
$$V_{st} = \frac{(N-n)}{N}\,\frac{S_{wst}^2}{n} = \frac{n(k-1)}{nk}\cdot\frac{k(k+1)}{12n} = \frac{(k^2-1)}{12n} \tag{8.8}$$
For systematic sampling, the mean of the second sample exceeds
that of the first by 1, while the mean of the third exceeds that of the
second by 1, and so on. Thus the means $\bar{y}_u$ may be replaced by the
numbers 1, 2, ..., k. Hence, by a further application of (8.6),
$$\sum_{u=1}^{k}(\bar{y}_u-\bar{Y})^2 = \frac{k(k^2-1)}{12}$$
This gives
$$V_{sy} = \frac{1}{k}\sum(\bar{y}_u-\bar{Y})^2 = \frac{(k^2-1)}{12} \tag{8.9}$$
From the formulas (8.7), (8.8), and (8.9) we deduce, as anticipated,
$$V_{st} = \frac{k^2-1}{12n} \le V_{sy} = \frac{k^2-1}{12} \le V_{ran} = \frac{(k-1)(N+1)}{12}$$
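The ordering of the three variances under a linear trend can be confirmed by brute force. The sketch below builds the population $y_i = i$ (i = 1, ..., N = nk), computes each variance from its definition, and compares it with the closed forms $(k^2-1)/12n$, $(k^2-1)/12$ and $(k-1)(N+1)/12$. The function name and the choice n = 5, k = 4 are illustrative assumptions.

```python
def linear_trend_variances(n, k):
    """Variances of the mean for the population y_i = i, i = 1..nk (k >= 2)."""
    N = n * k
    y = list(range(1, N + 1))
    Y_bar = sum(y) / N

    # Simple random sampling.
    S2 = sum((v - Y_bar) ** 2 for v in y) / (N - 1)
    v_ran = (N - n) / N * S2 / n

    # Stratified: n strata of k consecutive units, one unit drawn per stratum.
    within_ss = 0.0
    for h in range(n):
        block = y[h * k:(h + 1) * k]
        mean_h = sum(block) / k
        within_ss += sum((v - mean_h) ** 2 for v in block)
    S2_wst = within_ss / (n * (k - 1))
    v_st = (N - n) / N * S2_wst / n

    # Systematic: sample u takes units u, u + k, u + 2k, ...
    means = [sum(y[u::k]) / n for u in range(k)]
    v_sy = sum((m - Y_bar) ** 2 for m in means) / k
    return v_st, v_sy, v_ran

n, k = 5, 4
N = n * k
v_st, v_sy, v_ran = linear_trend_variances(n, k)
print(v_st, (k * k - 1) / (12 * n))
print(v_sy, (k * k - 1) / 12)
print(v_ran, (k - 1) * (N + 1) / 12)
```

Each pair of printed values should coincide up to rounding, and the first is the smallest of the three.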
$$\text{First member:}\quad 1 + \frac{n(2i-k-1)}{2(n-1)k}$$
$$\text{Last member:}\quad 1 - \frac{n(2i-k-1)}{2(n-1)k}$$
where $w_j$ is the weight attached to any $y_j$. With $\bar{y}_{sy}$, the unweighted
mean, $\sum w_j^2 = 1/n$. With $\bar{y}'_{sy}$, $\sum w_j^2$ depends on the starting
member i of the sample. The average value of $\sum w_j^2$, taken over the range
i = 1, 2, ..., k, is found to be
$$\frac{1}{n}\left[1 + \frac{n(k^2-1)}{6(n-1)^2k^2}\right]$$
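The average of $\sum w_j^2$ can be checked with exact rational arithmetic. This sketch assumes that the end weights quoted above are expressed in units of 1/n, so that interior members have weight 1/n and the first and last members have weights $(1 \pm a_i)/n$ with $a_i = n(2i-k-1)/[2(n-1)k]$; the function names are illustrative.

```python
from fractions import Fraction

def average_sum_sq_weights(n, k):
    """Average of sum(w_j^2) over starting members i = 1..k (needs n >= 2)."""
    total = Fraction(0)
    for i in range(1, k + 1):
        a = Fraction(n * (2 * i - k - 1), 2 * (n - 1) * k)
        weights = [Fraction(1, n)] * n
        weights[0] = (1 + a) / n      # first member of the sample
        weights[-1] = (1 - a) / n     # last member of the sample
        total += sum(w * w for w in weights)
    return total / k

def closed_form(n, k):
    return Fraction(1, n) * (1 + Fraction(n * (k * k - 1), 6 * (n - 1) ** 2 * k * k))

for n, k in [(2, 2), (5, 4), (10, 7)]:
    print(n, k, average_sum_sq_weights(n, k), closed_form(n, k))
```

Because the weights still sum to 1, the end corrections leave the estimate a weighted mean; the cost is the small increase in $\sum w_j^2$ shown by the second term in the brackets.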
to the first and last members. These differ from the weights given
previously only by a factor (n - 1)/n. In tests of the efficacy of his
end corrections in five natural populations (described in table 8.6)
Yates found a worth-while increase in precision in four of the five cases.
observation $y_i$. The sample points A represent the case least favorable
to the systematic sample. This case holds whenever k is equal
to the period of the sine curve or is an integral multiple of the period.
Every observation within the systematic sample is exactly the same,
so that the sample is no more precise than a single observation taken
at random from the population.
The most favorable case (sample B) occurs when k is an odd multiple
of the half-period. Every systematic sample has a mean exactly
equal to the true population mean, since successive deviations above
and below the middle line cancel. The sampling variance of the mean
is therefore zero. Between these two cases the sample has various
degrees of effectiveness, depending on the relation between k and the
wavelength.
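Both extreme cases are easy to reproduce on an artificial sine-curve population. In this sketch the period (20 units), the population size (N = 80) and the function name are assumptions chosen for illustration.

```python
import math

def variance_of_systematic_mean(y, k):
    """Variance of the sample mean over the k possible systematic samples."""
    n = len(y) // k
    Y_bar = sum(y) / len(y)
    means = [sum(y[u::k]) / n for u in range(k)]
    return sum((m - Y_bar) ** 2 for m in means) / k

period, N = 20, 80
y = [math.sin(2 * math.pi * i / period) for i in range(1, N + 1)]
Y_bar = sum(y) / N

# Variance of a single observation drawn at random from the population.
sigma2 = sum((v - Y_bar) ** 2 for v in y) / N

v_worst = variance_of_systematic_mean(y, k=period)       # k equal to the period
v_best = variance_of_systematic_mean(y, k=period // 2)   # k equal to the half-period

print(f"single observation: {sigma2:.4f}")
print(f"k = period:         {v_worst:.4f}")
print(f"k = half-period:    {v_best:.2e}")
```

With k equal to the period the systematic mean is no better than one random observation; with k equal to the half-period its variance is zero apart from rounding error.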
for any 8i~ of sample. Further, unless 6,.' - 0, i - 2, 3, ... , (Am - 2),
* Theoretically, N is infinite, if lines that are infinitely thin can be envisaged.
† Approximately. The number varied from bed to bed.
The first three studies were made from maps. In the first study, the finite
population consists of 288 altitudes at successive distances of 0.1 mile
in undulating country. In the next two, the data are the fractions of
the lengths of lines drawn on a cover-type map that lie in a certain
type of cover (e.g. grass). These examples might be considered the
closest to continuous variation in the mathematical sense.
The next three studies are based on temperatures for 192 consecutive
days: (i) 12 in. under the soil, (ii) 4 in. under the soil, (iii) in air.
This trio represents a gradation in the direction of greater influence of
erratic day-to-day changes in the weather as compared with slow
seasonal influences.
The remaining studies deal with plant or tree yields in some sequence
that lies along a line. In the study on potatoes, which is typical
of the group, the finite population consists of the total yields of 96
rows in a field. Since no exhaustive search of the literature has been
made, further data may be available.
In some of the studies, $V_{sy}$ is compared with the variance $V_{st2}$ for a
stratified random sample with strata of size 2k and 2 units per stratum.
This comparison is of interest because an unbiased estimate of $V_{st2}$
can be obtained from the sample data. This cannot be done for $V_{st1}$
(with strata of size k and 1 unit per stratum) or for $V_{sy}$. Other writers
report comparisons of $V_{sy}$ with both $V_{st1}$ and $V_{st2}$. The majority of
the sources do not present comparisons with $V_{ran}$ in readily usable
form, but it appears that in general $V_{st2}$ gave gains in precision over
$V_{ran}$.
In the papers by Yates and Finney, comparisons are given for a
range of values of n and k within each finite population. In these
cases the data in table 8.7 are the geometric means of the variance
ratios for the individual values of k. The other writers make computations
for only one value of k per population, but may give data for
different items or for several populations of the same natural type.
Here, again, geometric means of the variance ratios were taken.
Although the data are limited in extent, the results are impressive.
In the studies which permit comparison with $V_{st1}$, systematic sampling
shows a consistent gain in precision which, although modest, is
worth having. The gains in comparison with $V_{st2}$ are substantial.
The internal trend of the results agrees with expectations, although
not too much should be made of this in view of the small number of
studies. The gains are largest for the types of data in which we would
guess that variation would be nearest to continuous. The decline in
$V_{st2}/V_{sy}$ from soil to air temperatures would also be anticipated from
this viewpoint. In the last three items (forest nursery data), the only
one showing no gain is coniferous transplant stock, which is older and
more uniform than seedling stock.
•
{2 "
10
V..,
4.21
V'.11
7.21
v_
10 .29
V.n/V..
1.71
~ 21 3.06 3.00 4.77 O.
16 2.42 2.09 3.62 0.86
14 80 O. 1.90 3.20 2.76
10 42 1.74 1.29 2.26 0. 74
.7 80 O. 0.82 1.61 3. 15
6 S. 1.22 0.60 1.00 0 .41
8.12 Estimation of the variance from a single sample. From the results
of a simple random sample with n > 1, we can calculate an unbiased
estimate of the variance of the sample mean, the estimate being
unbiased whatever the form of the population. Since a systematic
sample can be regarded as a simple random sample with n = 1, this
useful property does not hold for the systematic sample. As an
illustration, consider the "sine curve" example. Let
$$y_i = m + a\sin(\pi i/2)$$
where k = 4 and i = 1, 2, ..., 4n. The successive observations in
the population are
(m + a), m, (m - a), m, (m + a), m, (m - a), m, ...
If i = 1 is chosen as the first member, all members of the systematic
sample have the value (m + a). For the other three possible choices
of i, all members have the values m, (m - a), or m, respectively.
Thus from a single sample we have no means of finding out or estimating
the value of a. But the true sampling variance of the mean of the
systematic sample is $a^2/2$. The illustration shows that it is impossible
to construct an estimated variance that is unbiased if periodic variation
is present.
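The illustration can be reproduced numerically. In this sketch the values of m, a and n are arbitrary assumptions; the point is that every systematic sample shows zero internal variation while the true variance of the sample mean is $a^2/2$.

```python
import math
from statistics import pvariance

m, a, n, k = 10.0, 3.0, 6, 4   # illustrative values; k = 4 matches the period

# y_i = m + a*sin(pi*i/2); rounding removes floating-point noise in the sine.
population = [m + a * round(math.sin(math.pi * i / 2)) for i in range(1, k * n + 1)]

# The k possible systematic samples (starting members i = 1, ..., k).
samples = [population[u::k] for u in range(k)]

within_sample_variances = [pvariance(s) for s in samples]
sample_means = [sum(s) / n for s in samples]
true_variance = pvariance(sample_means)   # variance over the k equally likely samples

print(within_sample_variances)
print(sample_means)
print(true_variance, a * a / 2)
```

Any estimator built only from the spread inside one sample therefore returns zero here, whatever the value of a.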
These results do not mean that nothing can be done. Excluding
the case of periodic variation, we might know enough about the
structure of the population to be able to develop a mathematical model
which adequately represents the type of variation that is present.
We might then be able to manufacture a formula for the estimated
variance that is approximately unbiased for this model, although it
may be badly biased for other models. The decision to use one of
these models must rest on the judgment of the sampler. Unfortunately,
we frequently lack data, as distinct from opinions, about the
structure of the population and are not confident that a given model
will be satisfactory.
Some simple models with their corresponding estimated variances
are illustrated below. No proofs will be given.
A proposed formula $s_{sy}^2$ for the estimated variance is called unbiased if
$$\mathcal{E}(s_{sy}^2) = \mathcal{E}V_{sy}$$
i.e. if it is unbiased over all finite populations that can be drawn from
the super-population.
I. Population in "random" order.
$$\mu_i = \text{Constant} \qquad (i = 1, 2, \ldots, N)$$
$$s_{sy1}^2 = \frac{(N-n)}{Nn}\,\frac{\sum (y_i-\bar{y}_{sy})^2}{(n-1)}$$
This case applies when we are confident that the order is essentially
random with respect to the items being measured. The variance formula
is the same as for a simple random sample and is unbiased if the
model is correct.
II. Stratification effects only.
$$\mu_i = \text{Constant} \qquad (rk+1 \le i \le rk+k)$$
$$s_{sy2}^2 = \frac{(N-n)}{Nn}\,\frac{\sum (y_i-y_{i+k})^2}{2(n-1)}$$
In this case the mean is constant within each stratum of k units. The
estimate $s_{sy2}^2$, which is based on the mean square successive difference,
is not unbiased. It contains an unwanted contribution from the
difference between $\mu$'s in neighboring strata, and the first and last
strata carry too little weight in estimating the random component of
the variance. With a reasonably large sample, this estimate would in
general be too high, assuming that the model is correct.
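The first two estimators are simple to compute from one systematic sample. The sketch below assumes the sample is held as a list in its order of selection and that the formulas are $\frac{N-n}{Nn}\sum(y_i-\bar{y}_{sy})^2/(n-1)$ for the random-order model and $\frac{N-n}{Nn}\sum(y_i-y_{i+k})^2/[2(n-1)]$ for the stratification model; the function names and the sample values are hypothetical.

```python
def s2_random_order(sample, N):
    """Model I: treat the systematic sample as a simple random sample."""
    n = len(sample)
    mean = sum(sample) / n
    return (N - n) / (N * n) * sum((y - mean) ** 2 for y in sample) / (n - 1)

def s2_successive_differences(sample, N):
    """Model II: mean square of differences between successive sample members."""
    n = len(sample)
    diffs = sum((sample[j] - sample[j + 1]) ** 2 for j in range(n - 1))
    return (N - n) / (N * n) * diffs / (2 * (n - 1))

# Hypothetical systematic sample of n = 8 from a population of N = 80.
sample = [12, 15, 11, 14, 18, 17, 21, 20]
print(s2_random_order(sample, N=80))
print(s2_successive_differences(sample, N=80))
```

When the sample values drift steadily, as in this hypothetical case, the successive-difference form is much smaller because it removes most of the slow change in level.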
III. Linear trend.
$$\mu_i = \mu + \beta i$$
$$s_{sy3}^2 = \frac{(N-n)}{N}\,\frac{n'}{n^2}\,\frac{\sum (y_i - 2y_{i+k} + y_{i+2k})^2}{6(n-2)}$$
The estimate is based on successive quadratic terms in the sequence
                  Bed    Actual $V_{sy}$    $s_{sy2}^2$    $s_{sy3}^2$
Silver maple       1        0.91            2.8         2.5
                   2        0.74            3.6         2.9
American elm       1        •.8            28.•        12.6
                   2       15.5            22.6        18.6
White spruce       1        5.5            17.2        11.2
                   2        2.0            11.6         6.•
White pine         1        8.2            21.0        21.9
and third formulas are applied to six forest nursery beds (Johnson,
1943). The quadratic formula is slightly better than that based on
successive differences, but both give serious overestimates.
Various other formulas can be devised. Residuals from a fitted
polynomial of higher degree may be effective if $\mu_i$ varies continuously
and not too rapidly: tables have been provided by DeLury (1950) for
this method.
Formulas developed from simple assumptions about the nature of
the correlogram have been discussed by Osborne (1942), Cochran
(1946), and Matérn (1947). Yates (1949) has investigated an estimate
based on a quantity of the form
$$(y_u + y_{u+2k} + y_{u+4k} + \cdots) - (y_{u+k} + y_{u+3k} + \cdots)$$
The successive items in the sample are given alternately + and -
signs. If this expression is taken over the whole sample, only 1 df
is available. In order to provide more degrees of freedom, the sample
data can be broken into parts, which Yates suggests might contain 9
observations each. If we denote the successive observations in the
systematic sample by $y_1'$, $y_2'$, etc., and give weight $\frac{1}{2}$ to the first and
last terms, we may write
$$d_1 = \tfrac{1}{2}y_1' - y_2' + y_3' - y_4' + \cdots + \tfrac{1}{2}y_9'$$
The next difference, $d_2$, may start with $y_9'$, and so on. Then, for the
estimated variance of $\bar{y}_{sy}$, we take
$$s_{sy4}^2 = \frac{(N-n)}{Nn}\sum_{u=1}^{g}\frac{d_u^2}{7.5g}$$
The factor 7.5 is the sum of squares of the coefficients in any $d_u$, and
g is the number of differences which the sample provides (g is approximately
n/9). In the natural populations which Yates examined, a
formula of this type was superior to the formula $s_{sy2}^2$ based on
successive differences, but it still overestimated the actual variance of $\bar{y}_{sy}$.
In conclusion, there is no dearth of formulas for the estimated variance,
but all appear to have a limited range of applicability.
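An estimate of the type investigated by Yates can be sketched as follows. The code assumes nine-term differences with coefficients ½, -1, +1, ..., -1, ½, each new difference starting at the last observation of the previous one, and the estimate $\frac{N-n}{Nn}\sum d_u^2/(7.5g)$; the function names are illustrative and this is not presented as Yates's own procedure in every detail.

```python
# Alternating signs with weight 1/2 on the first and last of 9 observations.
COEFFS = [0.5, -1, 1, -1, 1, -1, 1, -1, 0.5]

def balanced_differences(sample):
    """Differences d_1, d_2, ...; each starts at the 9th term of the previous one."""
    step = len(COEFFS) - 1
    d, start = [], 0
    while start + len(COEFFS) <= len(sample):
        block = sample[start:start + len(COEFFS)]
        d.append(sum(c * y for c, y in zip(COEFFS, block)))
        start += step
    return d

def variance_from_differences(sample, N):
    """Estimated variance of the systematic mean; needs at least 9 observations."""
    n = len(sample)
    d = balanced_differences(sample)
    g = len(d)
    sum_sq_coeffs = sum(c * c for c in COEFFS)   # equals 7.5
    return (N - n) / (N * n) * sum(x * x for x in d) / (sum_sq_coeffs * g)

# A pure linear trend contributes nothing to any difference.
trend_sample = list(range(17))
print(sum(c * c for c in COEFFS))
print(balanced_differences(trend_sample))
print(variance_from_differences(trend_sample, N=170))
```

The coefficients sum to zero and are symmetric about the middle term, so a constant level or a straight-line trend is eliminated from each difference.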
:E N.g,~
g,,,,, - N
where the sum extends over the pairs of strata, is on the average an
overestimate, even if periodic effects are present within strata. The
amount of overestimation depends on the terms in $(\bar{Y}_j-\bar{Y}_{j'})^2$. So
far as can be predicted, strata in the same pair should therefore have
about the same population means. This device is an application of
the method of "collapsed strata," previously described in section 5.21.
[Figure: (a) Aligned or square grid sample. (b) Unaligned sample.]
ment, with one of the letters chosen at random in each field, gave an
increase in precision of around 25 per cent over stratified random
sampling with rows as strata. The arrangement does not quite satisfy
the latin square property, because each letter appears 3 times in one
column and twice in the other columns, but it approaches this property
as nearly as possible.
8.16 Summary. Systematic samples are convenient to draw and to
execute. In most of the studies reported in this chapter, both on
artificial and on natural populations, they compared favorably in
precision with stratified random samples. Their disadvantages are that
they may give poor precision when unsuspected periodicity is present
and that no trustworthy method for estimating $V(\bar{y}_{sy})$ from the sample
data is known.
In the light of these results, writers on sampling are not in agreement
in their views on the advisability of systematic sampling. It
appears, however, that systematic sampling can safely be recommended
in the following situations:
i. Where the ordering of the population is essentially random, or
contains at most a mild stratification. Here systematic sampling is
used for convenience, with little expectation of a gain in precision.
Sample estimates of error which are reasonably unbiased are available
(section 8.12).
ii. Where a stratification with numerous strata is employed, and
an independent systematic sample is drawn from each stratum. The
effects of any hidden periodicities tend to cancel out in this situation,
and an estimate of error which is known to be an overestimate can be
obtained (section 8.13). Alternatively we can use half the number of
8 20 26 84. 81 24 18 16 36 10 223
6 19 26 21 23 19 13 12 8 35 182
6 2/i 10 27 41 28 7 8 29 7 188
28 11 41 2/i 1 1 9 10 33 9 197
2/i 81 30 32 Iii 29 11 III a 12 211
16 26 IiIi 43 21 24 20 20 13 7 24Ii
28 29 34 33 8 33 16 17 18 6 222
21 19 56 4/i 22 37 9 12 20 14 2/iIi
22 17 89 23 11 32 14 7 13 12 100
18 28 41 27 3 26 15 17 24 15 214
26 16 27 87 . 36 20 21 29 18 234
28 II 20 14 5 .20 21 26 18 . 165
11 22 2/i 14 11 .a 16 16 16 4 177
16 26 89 24 9 27 14 18 20 9 202
7 17 24 18 2/i 20 13 11 6 8 149
2i 2/i 17 16 21 9 19 15 8 191
21 18 14 18
'"
26
31
14
40
44
IiIi 36
13
22
18
19
24
2/i
17
7
27
29
31
"8
8
II
10
Ii
1
227
2/iIi
26 80 89 29 II 30 80 29 10 3 235
ta
t.o1.ala 410 U9 674 Ml 325 528 803 as8 342 206 (1M
Find the variance of the mean of a systematic sample consisting of every
~ foot. Compare this with the variances for (i) a simple random sample,
(ii) a stratified random sample with 2 units per stratum, (iii) a stratified
random sample with 1 unit per stratum. All samples have n = 10. ($\sum (y_i-\bar{Y})^2$
= 23,601.)
8.2 For the population in exercise 8.1, is the precision of systematic
sampling improved by end corrections?
8.3 A two-dimensional population with a linear trend may be represented
by the relation
$$y_{ij} = i + j \qquad (i, j = 1, 2, \ldots, nk)$$
where $y_{ij}$ is the item value in the ith row and jth column. The population
contains $N^2 = n^2k^2$ units.
A systematic square grid sample is selected by drawing at random two
independent starting coordinates $i_0$, $j_0$, each between 1 and k. The sample, of
size $n^2$, contains all units whose coordinates are of the form
$$i_0 + \gamma k, \qquad j_0 + \delta k$$
where $\gamma$, $\delta$ are any two integers between 0 and (n - 1), inclusive.
Show that the mean of this sample has the same precision as the mean of a
simple random sample of size $n^2$.
8.4 If the comparison in exercise 8.3 were made for a three-dimensional
population with linear trend, what result would you expect?
8.5 A population of 360 households (numbered from 1 to 360) in Baltimore
is arranged alphabetically in a file by the surname of the head of the
household. Households in which the head is non-white occur at the following
numbers: 28, 31-33, 36-41, 44, 45, 47, 55, 56, 68, 68, 69, 82, 83, 86, 86, 89-94,
98, 99, 101, 107-110, 114, 154, 156, 178, 223, 224, 200, 298-300, 302-3{)(,
306-323, 32&-331, 333, 335-339, 341, 342. (The non-white households show
some "clumping" because of an association between surname and color.)
Compare the precision of a 1 in 8 systematic sample with a simple random
sample of the same size, for estimating the proportion of households in which
the head is non-white.
8.17 References.
COCHRAN, W. G. (1946). Relative accuracy of systematic and stratified random
samples for a certain class of populations. Ann. Math. Stat., 17, 164-177.
DAS, A. C. (1950). Two-dimensional systematic sampling and the associated
stratified and random sampling. Sankhya, 10, 95-108.
DELURY, D. B. (1950). Values and integrals of the orthogonal polynomials up to
n = 26. University of Toronto Press.
FINNEY, D. J. (1948). Random and systematic sampling in timber surveys.
Forestry, 22, 1-38.
FINNEY, D. J. (1950). An example of periodic variation in forest sampling.
Forestry, 23, 96-111.
FISHER, R. A., and MACKENZIE, W. A. (1922). The correlation of weekly rainfall.
Quart. Jour. Roy. Met. Soc., 48, 234-245.
Nol cil4d in ~
9.2 A simple example. Johnson's data (1941) for a bed of white pine
seedlings provide a simple example of the procedure for comparing
different units. The bed contained 6 rows, each 434 ft long. There
are many ways in which the bed can be divided into sampling units.
Data for four types of unit are shown in table 9.1. Since the bed was
completely counted, the data are correct population values.
Type of unit
Preliminary data
l·ft Z-ft l.ft 2.ft
row row bed bed
.. 6.746
62
23 .094
78
68.668
108
(~ 2.537
6 . 746
~ - At - -- - O.666n]
Type of unit
Suooeeaive ,t.epa in
the caloulAt.ion
1-ft 2-ft 1-ft 2-ft
row row bed bed
l.33On t
- - - - 1.330 -
(44) Ct - 0.944c1
62 62
as shown in table 9.2. All the larger units cost less than the smallest
unit, although the differences are not great. The 1-ft bed unit appears
the best. The last line of table 9.2 shows the reciprocals of the costs,
with the smallest unit taken as 100. In the table these figures have
been called relative net precision, because, if the same comparisons are
made with cost kept constant instead of variance, it will be found that
these figures are inversely proportional to the relative variances of the
estimated population total, and hence measure relative precision.
9.3 General procedure for comparing units. The analysis in this
example can be expressed in more general terms as follows.
Theorem 9.1 For the uth type of unit, let
Relative size of unit = $M_u$
Variance among the item totals on the unit = $S_u^2$
Relative cost per unit = $C_u$
Simple random sampling is assumed, with the fpc ignored, and the
population total is estimated by simple expansion. Then the relative
cost for equal precision is proportional to
$$\frac{C_uS_u^2}{M_u^2} \tag{9.1}$$
Proof: This follows the argument used in the numerical example.
For the uth unit
Number of units in the population $\propto 1/M_u$
Variance of estimated population total $\propto S_u^2/n_uM_u^2$
Sample size ($n_u$) for equal precision $\propto S_u^2/M_u^2$
Relative cost for equal precision $\propto C_uS_u^2/M_u^2$
The definitions of $S_u^2$ and $C_u$ should be noted, because in the
compilation of data it is often convenient to express these quantities
originally in some other form. Thus in the numerical example the
cost data were given in terms of the bulk of sample that could be
counted in a given time.
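The comparison in theorem 9.1 reduces to one line of arithmetic per type of unit. In the sketch below the unit names and the values of $M_u$, $S_u^2$ and $C_u$ are hypothetical, chosen only to show the calculation; relative net precision is taken as the reciprocal of the relative cost, scaled so that the smallest unit is 100.

```python
def relative_cost_for_equal_precision(M, S2, C):
    """Quantity C_u * S_u^2 / M_u^2 from theorem 9.1 (up to a constant factor)."""
    return C * S2 / M ** 2

# Hypothetical data: (relative size M_u, variance of unit totals S_u^2, relative cost C_u).
units = {
    "small unit": (1, 2.0, 1.0),
    "double unit": (2, 6.0, 1.5),
    "large unit": (6, 40.0, 3.0),
}

costs = {name: relative_cost_for_equal_precision(*values)
         for name, values in units.items()}
base = costs["small unit"]
relative_net_precision = {name: 100 * base / cost for name, cost in costs.items()}

for name in units:
    print(f"{name:12s} cost {costs[name]:6.3f}  net precision {relative_net_precision[name]:6.1f}")
```

A unit with net precision above 100 gives the same precision as the smallest unit at lower cost, or equivalently a smaller variance at the same cost.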
Corollary 1 Under the conditions of the theorem, the variance of
the estimated population total, for a fixed cost, is also proportional to
$$\frac{C_uS_u^2}{M_u^2} \tag{9.2}$$
TABLE 9.3 ESTIMATED STANDARD ERRORS (IN PER CENT) FOR FOUR SIZES OF
UNIT, WITH SIMPLE RANDOM SAMPLING
                                                        Best
Item                          S/4    S/2     S     2S   unit
Number of swine               6.0    4.9    6.3    6.2  S/2
Number of horses              8.4    3.3    3.6    4.2  S/2
Number of sheep              17.'   16.7   14.9   14.3  2S
Number of chickens            3.0    3.0    3.3    3.8  S/4, S/2
Number of eggs yesterday      6.7    6.2    4.9    4.7  2S
The data in the table are the relative standard errors (in per cent)
of the estimated means per farm for 18 items. No unit is best for all
items. The half-section and the quarter-section are, however, superior
to the larger units for all except 2 items, with little to choose between
the half- and quarter-sections. The half-section would probably be
preferred, because the problem of identifying the boundaries accurately
is easier.
In order to make any comparison of this kind, we must know the
variability among units for each type of unit that is included. How
are such data obtained? One source, as in the nursery bed example,
is a complete count of the population for each type of unit. Another
is the drawing of separate samples for each type of unit. Such methods
may be employed when the population is compact and it is not too
costly to obtain the data. With extensive populations, however, it is
seldom feasible to make a survey solely for the purpose of comparing
different types of unit. Information about optimum type of unit is
more usually procured as an ingenious by-product of a survey whose
main purpose is to make estimates. Some techniques for doing this
are outlined in succeeding sections.
df 11\1
Between 1&I'Ie unite (n - J) IIJ,I
dI
Betw n lara- unit. (N - 1)
Between IIUIA1I UDit. within lara-
unit. N(M - 1) 8.1
Population IUllple
No. or strata 587 572
No. or sampling unite 72 . 849 1397
No. or (arffi$ 217 ,976 4166
It will be noted that a few strata were not sampled and that the
number of farms per unit was slightly under 3. These discrepancies
will be ignored. The sampling ratio was 1.9 per cent.
From the sample data we can compare the group of 3 farms with the
individual farm as sampling unit. We shall use a slightly simpler
analysis than is strictly required. The fpc can be omitted. Since the
sampling was stratified, the variance of the estimated population total is
$$V(\hat{Y}_{st}) = \sum_h \frac{N_h^2 S_h^2}{n_h}$$
The standard procedure is to compute, within each stratum, an estimate
of $S_h^2$ for the two types of unit, and substitute in this formula.
The strata contained in general between 300 and 450 farms, and either
two or three 3-farm units were taken in each stratum so as to make the
sampling approximately proportional. Assuming proportionality, i.e.
$n_h/N_h = n/N$, we may write
$$V(\hat{Y}_{st}) = \frac{N}{n}\sum_h N_h S_h^2 \doteq \frac{N^2}{n}\,\bar{S}_h^2$$
if we assume further that the $S_h^2$ do not vary greatly among strata, so
that they may be replaced by their average, $\bar{S}_h^2$.
1. TYPE OF SAMPLING UNIT
                                   df        ms
Between units within strata                  6.218
Between farms within units        2788       11.1118
For the group of 3 farms, the mean square $s_b^2 = 6.218$ serves as
the estimate of $\bar{S}_h^2$. To obtain an estimate of the variance within
strata for the individual farm, we construct an analysis of variance
for the whole population (table 9.8). The degrees of freedom come
from table 9.6, and the first two mean squares from table 9.7.
                                   df          ms
Between units within strata       72,262      6.218
Between farms within units        145,127     2.1118
Between farms within strata       217,389
If $N$, the number of large units, is large, this takes the simpler form
$$S_b^2 = MS^2 - (M - 1)AM^g \qquad (9.8)$$
Hendricks (1944) has pointed out that the complete population
might be regarded as a single large sampling unit containing $NM$
elements. If formula (9.6) holds, then $S^2 = A(NM)^g$. The advantage
of this device is that the values of $A$ and $g$ can now be estimated from
the data for a survey in which only one value of $M$ was used. The
two equations which lead to the estimates are
$$\log S_w^2 = \log A + g \log M$$
$$\log S^2 = \log A + g \log (NM)$$
The formula for $S_b^2$ becomes, from (9.7),
$$S_b^2 = \frac{AM^g\,[(NM - 1)N^g - N(M - 1)]}{N - 1}$$
This method furnishes no check on the correctness of formula (9.6).
It might happen that the formula held well enough for small values of
$M$, but failed for a value as large as $NM$. In this event the more
general formulas (9.7) and (9.8) should be employed.
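The two logarithmic equations can be solved directly for $A$ and $g$. The following is a minimal sketch, assuming the variance function $S_w^2 = AM^g$ of formula (9.6); the function names and the numerical inputs are illustrative and do not come from the farm survey.

```python
import math

def fit_variance_function(s2_within, s2_total, M, N):
    """Solve log S_w^2 = log A + g log M and log S^2 = log A + g log(NM)."""
    # Subtracting the first equation from the second leaves g log N.
    g = (math.log(s2_total) - math.log(s2_within)) / math.log(N)
    A = s2_within / M ** g
    return A, g

def between_unit_variance(A, g, M, N):
    """S_b^2 = A M^g [(NM - 1) N^g - N(M - 1)] / (N - 1)."""
    return A * M ** g * ((N * M - 1) * N ** g - N * (M - 1)) / (N - 1)

# Hypothetical inputs: within-unit variance 4, total variance 40, M = 4, N = 100.
A, g = fit_variance_function(s2_within=4.0, s2_total=40.0, M=4, N=100)
S2_b = between_unit_variance(A, g, M=4, N=100)
```

With these inputs the fitted values reproduce both variances, and the resulting $S_b^2$ satisfies the analysis of variance identity $(N-1)S_b^2 + N(M-1)S_w^2 = (NM-1)S^2$.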
Formula (9.6) is presented as an example of the methodology rather
than as a general law. The reader who faces a similar problem should
construct and test whatever type of formula seems most appropriate
to his material. In some cases $\log S_b^2$ might be a simple function of $M$.
Assuming simple random sampling and ignoring the fpc, the variance
of the mean per element $\bar{y}$ is $S_b^2/nM$. From (9.8), this equals
$$V(\bar{y}) = \frac{S^2 - (M - 1)AM^{g-1}}{n} \qquad (9.10)$$
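$$n:\quad c_1 M + \tfrac{1}{2} c_2 n^{-1/2} = -\lambda\frac{\partial V}{\partial n} = \frac{\lambda V}{n} \qquad (9.12)$$
$$M:\quad c_1 n = -\lambda\frac{\partial V}{\partial M} \qquad (9.13)$$
Divide (9.13) by (9.12) so as to eliminate $\lambda$. This leads to
$$-\frac{n}{V}\frac{\partial V}{\partial M} = \frac{c_1 n}{c_1 M + \tfrac{1}{2} c_2 n^{-1/2}}$$
or
$$-\frac{M}{V}\frac{\partial V}{\partial M} = \left(1 + \frac{c_2}{2c_1 M\sqrt{n}}\right)^{-1} \qquad (9.14)$$
If we substitute for $\sqrt{n}$ from (9.11), we obtain, after some
simplification,
$$\frac{M}{V}\frac{\partial V}{\partial M} = \left(1 + \frac{4Cc_1 M}{c_2^2}\right)^{-1/2} - 1 \qquad (9.16)$$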
By writing out the left side of this equation in full and changing
signs on both sides, we find
$$\frac{AM^{g-1}\,[gM - (g - 1)]}{S^2 - (M - 1)AM^{g-1}} = 1 - \left(1 + \frac{4Cc_1 M}{c_2^2}\right)^{-1/2}$$
This equation gives the optimum $M$. The left side does not involve
any of the cost factors, being dependent only on the shape of the
variance function. It follows that for a given population, which will have
some fixed variance function, the optimum $M$ reacts to changes in the
cost factors in such a way that the quantity
$$\frac{Cc_1 M}{c_2^2}$$
remains fixed.
Now $c_1$ increases if the length of interview increases, while $c_2$
decreases if travel becomes cheaper, or if the farms in a given area
become more dense. These facts lead to the conclusion that the optimum
size of unit becomes smaller when:
Length of interview increases.
Travel becomes cheaper.
The elements (farms) become more dense.
Total amount of money used (C) increases.
The conclusions are a consequence of the type of cost function and
would require re-examination with a different function. They
illustrate the fact that the optimum unit is not a fixed characteristic of
the population, but depends also on the type of survey and on the
levels of prices and wages.
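The dependence of the optimum $M$ on the cost factors can also be explored numerically. The sketch below is an illustration under stated assumptions: it takes the cost function to be $C = c_1 nM + c_2\sqrt{n}$ (the form implied by the derivation above, but not printed in this passage), uses the variance formula (9.10), and searches a grid of integer unit sizes. All parameter values are hypothetical.

```python
import math

def units_affordable(M, C, c1, c2):
    """Solve c1*n*M + c2*sqrt(n) = C for n (assumed cost function)."""
    root_n = (-c2 + math.sqrt(c2 ** 2 + 4 * C * c1 * M)) / (2 * c1 * M)
    return root_n ** 2

def variance_of_mean(M, C, c1, c2, S2, A, g):
    """V = [S^2 - (M - 1) A M^(g-1)] / n, with n fixed by the budget C."""
    n = units_affordable(M, C, c1, c2)
    return (S2 - (M - 1) * A * M ** (g - 1)) / n

def optimum_M(C, c1, c2, S2, A, g, candidates=range(1, 51)):
    """Grid search for the unit size with the smallest variance."""
    return min(candidates,
               key=lambda M: variance_of_mean(M, C, c1, c2, S2, A, g))

# Hypothetical population and cost parameters.
params = dict(C=1000.0, c1=1.0, c2=10.0, S2=5.0, A=1.0, g=0.3)
best = optimum_M(**params)
```

Re-running `optimum_M` with a larger `c1`, a smaller `c2`, or a larger `C` gives a quick check of the direction of the conclusions listed above for any particular variance function.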
The ratio
$$\frac{V(p)}{V_{bin}(p)} \doteq \frac{M\sum_{i=1}^{N}(p_i - P)^2}{NPQ} \qquad (N \text{ large}) \qquad (9.18)$$
shows the relative change in the variance due to the use of clusters.
Numerical values of this factor are helpful in making preliminary
estimates of sample size with cluster sampling. The required sample
size is first estimated by the binomial formula, and then multiplied
by the factor to indicate the size that will be necessary with cluster
sampling. For an illustration, see Cornfield (1951).
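As a small illustration of formula (9.18), the factor can be computed from the cluster proportions $p_i$. This is a sketch only; the function name and the toy populations are hypothetical.

```python
def cluster_variance_factor(cluster_proportions, M):
    """M * sum (p_i - P)^2 / (N P Q) for N clusters of equal size M."""
    N = len(cluster_proportions)
    P = sum(cluster_proportions) / N
    Q = 1 - P
    between = sum((p - P) ** 2 for p in cluster_proportions)
    return M * between / (N * P * Q)

# Clusters of size 2 that are internally homogeneous: the factor equals M.
homogeneous = cluster_variance_factor([0, 1, 0, 1], M=2)
# A population in which clustering neither helps nor hurts.
neutral = cluster_variance_factor([0, 0.5, 1, 0.5], M=2)
```

A binomial sample size would then be multiplied by the factor to indicate the number of elements needed with cluster sampling.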
If the cluster sizes $M_i$ are variable, the estimate $p = \sum a_i / \sum M_i$
is a ratio estimate. Its variance is given approximately by the formula
(section 6.9)
$$V(p) \doteq \frac{(N - n)}{Nn\bar{M}^2}\,\frac{\sum_{i=1}^{N} M_i^2 (p_i - P)^2}{N - 1}$$
the value 1 to any unit which falls in class C, and 0 to any other unit,
the fundamental analysis of variance equation for fixed $M$ is
9.9 Measures of the size of a unit. In many surveys the units vary
in size. A house or a dwelling place, suitably defined, is often a
convenient unit in surveys of human populations, but it may contain
anywhere between 0 and 25 or more persons. In examples of this
kind, we can define the size of the cluster as the number $M_i$ of
elements which it contains.
There are other populations in which obvious differences in size
among units exist, although it is less clear how size is to be measured.
Farms, banks, and restaurants are examples. There are large and
small farms. As a measure of the size of a farm, however, we might
propose the total acreage, or the total acreage available for crops, or
the total value of the farm's production, or still other quantities.
What kind of measure of size is useful to the sampler? Suppose that
some item $y_i$ is to be measured on each of a simple random sample of
farms. The sampler fears that $y_i$ will have a high variance, because
there are some farms which year after year give large values of $y_i$,
whereas others consistently yield small values. What is needed is an
auxiliary variable $x_i$ that is obtainable before the survey is taken and
that predicts whether the value of $y_i$ will be large or small. Thus the
problem of finding a measure of size reduces to that of finding an
auxiliary variate which is highly correlated with $y_i$ in some sense of
this term. The choice among total acreage, total tillable acreage, and
total value is made by examining which of the three has the highest
correlation, on the average, with the items that are included in the
survey. We shall not discuss at present how this average correlation
would be calculated, since our interest is in the general concept of a
good measure of size rather than in a specific definition.
For the same survey, the best measure of size may depend on the
item. If the item has been enumerated in a recent census, it is often
found that the best auxiliary variate $x_i$ is the value of this item at the
previous census. In such cases any general measure of size is inferior
to a separate measure for each item. However, available previous
census data may not include the items that are to be in the new
survey, but may provide several general measures of size.
Given a general measure of size, we may utilize it by one of the two
methods discussed in previous chapters. We may stratify by size.
Since the variance of $y_i$ usually increases with $x_i$, the sampling fraction
should ordinarily be changed from stratum to stratum. Complete
enumeration of the strata with the largest units may be advisable.
The second method is to use a ratio or a regression estimate with $x_i$ as
the auxiliary variate. This allows the stratification to be employed to
control some other factor. A combination of the two techniques is
sometimes fruitful.
$$V(\bar{y}_z) = E(\bar{y}_z - \bar{Y}_z)^2 = \frac{1}{n}\sum_{i=1}^{N} z_i (y_i - \bar{Y}_z)^2 \qquad (9.21)$$
Note that all the $y_i$ in the population appear in this expression. In
repeated sampling, the $t$'s are the random variables, whereas the $y_i$ are
a set of fixed numbers. Hence
$$E(\bar{y}_z) = \frac{1}{n}\sum_{i=1}^{N} y_i E(t_i) = \sum_{i=1}^{N} z_i y_i = \bar{Y}_z$$
Further,
Hence
ft ft
since, by the definition of $V(\bar{y}_z)$, the mean value of the second term on
the right is $-nV(\bar{y}_z)$. Introducing the variables $t_i$, we have
$$E\sum_{i=1}^{n}(y_i - \bar{y}_z)^2 = E\sum_{i=1}^{N} t_i (y_i - \bar{Y}_z)^2 - nV(\bar{y}_z)$$
We may now regard the $y_i$ as fixed quantities and the $t_i$ as the random
variables. Since $E(t_i) = nz_i$,
$$E\sum_{i=1}^{n}(y_i - \bar{y}_z)^2 = n\sum_{i=1}^{N} z_i (y_i - \bar{Y}_z)^2 - nV(\bar{y}_z)$$
$$= n^2 V(\bar{y}_z) - nV(\bar{y}_z) = n(n - 1)V(\bar{y}_z)$$
by theorem 9.2. This completes the proof.
Note that, in estimating the variance from the sample, we do not
weight the $y_i$ by the $z_i$, because this weighting has already been
introduced in the selection of the sample.
These results may now be applied to the estimation of a population
total when the units are of unequal size.
Theorem 9.4 A sample of size $n$ is drawn with probability
proportional to measures of size $z_i = M_i/\sum M_i$. The item totals for the
units in the sample are $y_1, y_2, \ldots, y_n$, where the same unit may appear
more than once, since sampling is with replacement. As an estimate
of the population total $Y$ we take $\hat{Y}_{pps}$ (probability proportional to
size), where
$$\hat{Y}_{pps} = \frac{1}{n}\left(\frac{y_1}{z_1} + \frac{y_2}{z_2} + \cdots + \frac{y_n}{z_n}\right) \qquad (9.22)$$
$$E(\hat{Y}_{pps}) = \sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i}\right) = \sum_{i=1}^{N} y_i = Y$$
$$V(\hat{Y}_{pps}) = \frac{1}{n}\sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i} - Y\right)^2 \qquad (9.23)$$
The result is exact for any size of sample. The variance contains no
fpc, because the sampling is with replacement. The estimate $\hat{Y}_{pps}$ is
not suitable for quick computation, since it involves finding the ratio
$y_i/z_i$ on every unit in the sample, and is unlikely to be widely used.
With this estimate, the optimum measures of size are the set of
numbers $M_i$ for which the $z_i$ minimize
$$V(\hat{Y}_{pps}) = \frac{1}{n}\sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i} - Y\right)^2 = \frac{1}{n}\left\{\sum_{i=1}^{N}\frac{y_i^2}{z_i} - Y^2\right\}$$
If the $y_i$ are all positive, it is easy to see that $V(\hat{Y}_{pps})$ is zero when
$M_i \propto y_i$. Consequently the best measures of size are the item totals $y_i$
on the units. This result is not of practical interest, because, if the $y_i$
were known in advance, the sample would be unnecessary. The result
suggests, however, that if the items are relatively stable through time,
the most recently available previous values of the $y_i$ may be the best
measure of size to adopt.
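The pps estimate (9.22) and its variance (9.23) are easy to evaluate on a small artificial population. The sketch below is illustrative only: the three-unit population and the function names are hypothetical. It checks the two facts noted above: the variance vanishes when the $z_i$ are proportional to the $y_i$, and every sample then returns $Y$ exactly.

```python
import random

def pps_total_estimate(sample_y, sample_z):
    """Y_pps = (1/n) * sum of y_i / z_i over the sampled units."""
    n = len(sample_y)
    return sum(y / z for y, z in zip(sample_y, sample_z)) / n

def pps_variance(pop_y, pop_z, n):
    """V(Y_pps) = (1/n) * sum over the population of z_i (y_i/z_i - Y)^2."""
    Y = sum(pop_y)
    return sum(z * (y / z - Y) ** 2 for y, z in zip(pop_y, pop_z)) / n

def draw_pps_sample(pop_y, pop_z, n, rng):
    """Draw n units with replacement, with probabilities z_i."""
    idx = rng.choices(range(len(pop_y)), weights=pop_z, k=n)
    return [pop_y[i] for i in idx], [pop_z[i] for i in idx]

pop_y = [2.0, 4.0, 6.0]                      # hypothetical item totals
z_ideal = [y / sum(pop_y) for y in pop_y]    # sizes proportional to y_i
z_equal = [1 / 3] * 3                        # sizes unrelated to y_i

ys, zs = draw_pps_sample(pop_y, z_ideal, n=2, rng=random.Random(1))
estimate = pps_total_estimate(ys, zs)
```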
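$$V(\hat{Y}_R) \doteq \frac{N(N - n)}{n(N - 1)}\sum_{i=1}^{N}\left(y_i - \frac{M_i}{\sum M_i}Y\right)^2 = \frac{N(N - n)}{n(N - 1)}\sum_{i=1}^{N}(y_i - Yz_i)^2 \qquad (9.24)$$
For the unbiased estimate with pps sampling, we have from (9.23)
$$V(\hat{Y}_{pps}) = \frac{1}{n}\sum_{i=1}^{N} z_i\left(\frac{y_i}{z_i} - Y\right)^2 = \frac{1}{n}\sum_{i=1}^{N}\frac{1}{z_i}(y_i - Yz_i)^2 \qquad (9.25)$$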
Assuming $n/N$ small, the two comparable quantities are
$$nV(\hat{Y}_R) \doteq N\sum (y_i - Yz_i)^2 \qquad (9.26)$$
$$nV(\hat{Y}_{pps}) = \sum \frac{1}{z_i}(y_i - Yz_i)^2 \qquad (9.27)$$
For some populations the ratio estimate is superior, and for others
the pps estimate. I do not know any simple rule for predicting the
better estimate: formulas (9.26) and (9.27) can be used to make the
comparison when population data are available.
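When population data are at hand, the comparison of (9.26) with (9.27) is a short computation. The following sketch uses a hypothetical three-unit population; the function names are not from the text.

```python
def n_var_ratio(pop_y, pop_z):
    """n V(Y_R) = N * sum (y_i - Y z_i)^2, formula (9.26)."""
    N = len(pop_y)
    Y = sum(pop_y)
    return N * sum((y - Y * z) ** 2 for y, z in zip(pop_y, pop_z))

def n_var_pps(pop_y, pop_z):
    """n V(Y_pps) = sum (y_i - Y z_i)^2 / z_i, formula (9.27)."""
    Y = sum(pop_y)
    return sum((y - Y * z) ** 2 / z for y, z in zip(pop_y, pop_z))

pop_y = [1.0, 2.0, 6.0]        # hypothetical item totals
z_equal = [1 / 3] * 3          # all units the same size
z_unequal = [0.2, 0.3, 0.5]    # sizes roughly increasing with y_i

same = (n_var_ratio(pop_y, z_equal), n_var_pps(pop_y, z_equal))
differ = (n_var_ratio(pop_y, z_unequal), n_var_pps(pop_y, z_unequal))
```

With equal sizes the two quantities coincide, as they must since every $z_i = 1/N$. For the unequal sizes chosen here the pps estimate happens to have the smaller variance; other populations can reverse the ordering.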
One result will be presented. Suppose that
$$y_i = Yz_i + e_i$$
where $e_i$ is independent of $z_i$ in the probability sense. In arrays in
which $z_i$ is fixed, we assume that
$$E(e_i) = 0; \qquad E(e_i^2) = az_i^g \quad (g > 0)$$
If $g = 1$, this model satisfies the conditions in which the ratio estimate
is a best linear unbiased estimate (section 6.8).
From (9.26),
$$nV(\hat{Y}_R) = N\sum_{i=1}^{N} e_i^2 \doteq N^2 E(e_i^2) = aN^2 E(z_i^g)$$
and, from (9.27),
$$nV(\hat{Y}_{pps}) \doteq aNE(z_i^{g-1})$$
Hence
$$V(\hat{Y}_R) > V(\hat{Y}_{pps}) \quad \text{if} \quad E(z_i^g) - \frac{1}{N}E(z_i^{g-1}) > 0$$
Since $E(z_i) = 1/N$, because the $z_i$ add to unity, the inequality may be
written
$$E(z_i^g) - E(z_i)E(z_i^{g-1}) > 0 \qquad (9.28)$$
9.1 For the data in table 9.1, compare the relative net precisions of the
four types of unit when the object is to estimate the total number of seedlings
in the bed with a standard error of 200 seedlings. (Note that the fpc is
involved.)
9.2 For the data in table 6.2 (p. l~) estimate the relative precision of the
household to the individual for estimating the sex ratio and the proportion of
people who had seen a doctor in the past 12 months, assuming simple random
sampling.
                                               df      ms
Between strata                                  9    30.8
Between large units within strata             400     3.0
Between elements within large units          2000     1.8
Ignoring the fpc, is the relative precision of the large to the small unit greater
with simple random sampling than with stratified random sampling
(proportional allocation)?
9.4 A population containing $LNM$ elements is divided into $L$ strata, each
having $N$ large units, each of which contains $M$ small units. The following
quantities come from the analysis of variance of the population, on an element
basis:
$S_1^2$ = Mean square between strata
$S_2^2$ = Mean square between large units within strata
$S_3^2$ = Mean square between elements within strata
If $N$ is large and the fpc is ignored, show that the relative precision of the
large to the small unit (element) is improved by stratification if
$$\frac{(M - 1)}{S_1^2} < \frac{M}{S_2^2} - \frac{1}{S_3^2}$$
9.5 The large units in a population arrange themselves into a finite
number of size classes: all units in class $h$ contain $M_h$ small units. (i) Under what
conditions does sampling with pps give, on the average, the same distribution
of the size classes in the sample as stratification by size of unit, with optimum
allocation for fixed sample size? (ii) If the variance among large units in
class $h$ is $kM_h$, where $k$ is a constant for all classes, what system of
probabilities of selection of the units gives a sample in which the sizes have
approximately the same distribution as a stratified random sample with optimum
allocation for fixed sample size?
9.16 References.
CORNFIELD, J. (1951). The determination of sample size. Amer. Jour. Pub.
Health, 41, 654-661.
FINKNER, A. L., MORGAN, J. J., and MONROE, R. J. (1943). Methods of estimating
farm employment from sample data in North Carolina. N. C. Agr. Exp. Sta.
Tech. Bull. 75.
HANSEN, M. H., and HURWITZ, W. N. (1943). On the theory of sampling from finite
populations. Ann. Math. Stat., 14, 333-362.
HENDRICKS, W. A. (1944). The relative efficiencies of groups of farms as sampling
units. Jour. Amer. Stat. Assoc., 39, 367-376.
HOMEYER, P. G., and BLACK, C. A. (1946). Sampling replicated field experiments
on oats for yield determinations. Proc. Soil Sci. Soc. America, 11, 341-344.
HORVITZ, D. G., and THOMPSON, D. J. (1952). A generalization of sampling
without replacement from a finite universe. Jour. Amer. Stat. Assoc., 47, 663-685.
JESSEN, R. J. (1942). Statistical investigation of a sample survey for obtaining
farm facts. Iowa Agr. Exp. Sta. Res. Bull. 304.
JOHNSON, F. A. (1941). A statistical study of sampling methods for tree nursery
inventories. M.S. thesis, Iowa State College.
MAHALANOBIS, P. C. (1944). On large-scale sample surveys. Phil. Trans. Roy. Soc.
London, B231, 329-451.
MCVAY, F. E. (1947). Sampling methods applied to estimating numbers of
commercial orchards in a commercial peach area. Jour. Amer. Stat. Assoc., 42,
533-540.
MIDZUNO, H. (1950). An outline of the theory of sampling systems. Ann. Inst.
Stat. Math. (Japan), 1, 149-156.
SEN, A. R. (1952). Further developments of the theory and application of the
selection of primary sampling units with special reference to the North Carolina
agricultural population. Ph.D. thesis, University of North Carolina.
SUKHATME, P. V. (1947). The problem of plot size in large-scale yield surveys.
Jour. Amer. Stat. Assoc., 42, 297-310.
THOMPSON, D. J. (1952). A theory of sampling finite universes with unequal
probabilities. Ph.D. thesis, Iowa State College.
CHAPTER 10
Sample mean per primary unit $= \bar{y} = \sum y_i/n = y/n = m\bar{\bar{y}}$
Analogous definitions hold for the population means $\bar{Y}_i$, $\bar{\bar{Y}}$, and $\bar{Y}$.
In this notation a single bar denotes an average
over any single stage, a double bar an average over two stages. The
subscripts (if any) indicate what is being held constant. Thus the
average of the $y_{ij}$ for fixed $i$ is $\bar{y}_i$: the average of the $\bar{y}_i$ over the units
is $\bar{\bar{y}}$.
Note also that $N$, $n$ are used for the number of primary units, and
$M$, $m$ for the number of subunits per primary unit. Since some authors
reverse these roles, a careful check of the notation is advised when
reading references cited here.
10.2 Elementary theory. The theory of subsampling was developed
in connection with the sampling of the plots in agricultural field
experiments. In these applications both the sampling fraction $n/N$ and the
subsampling fraction $m/M$ are usually small and fpc's can be ignored.
Since the resulting theory is elegant and is adequate for many
applications, we shall describe it first. Actually, the elementary theory
requires only that $n/N$ be negligible, say less than 0.05, as will be seen
when the exact theory is presented.
The observation $y_{ij}$ on the $j$th element of the $i$th unit is assumed to
be of the form
$$y_{ij} = \bar{\bar{Y}} + u_i + w_{ij} \qquad (10.1)$$
where the term $u_i$ represents a component associated with the unit
and constant for all elements in the unit. The term $w_{ij}$ represents a
component of variation from element to element within the unit. The
variates $u_i$ and $w_{ij}$ are all independent in the probability sense and
have zero means. The variates $u_i$ have variance $S_u^2$ (u for unit), and
the $w_{ij}$ have variance $S_w^2$ (w for within).
The values of $N$ and $M$ are assumed infinite. The units are chosen
at random from the population, and the elements at random from the
units.
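It is easy to show, as a consequence of the model, that the sample
mean per element $\bar{\bar{y}}$ is an unbiased estimate of $\bar{\bar{Y}}$.
Theorem 10.1 With this model, the variance of the sample mean
per element $\bar{\bar{y}}$ is
$$V(\bar{\bar{y}}) = \frac{S_u^2}{n} + \frac{S_w^2}{nm} \qquad (10.2)$$
$$v(\bar{\bar{y}}) = \frac{s_b^2}{mn} = \frac{\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}})^2}{n(n - 1)}$$
The first form on the right is the most convenient for computing.
Proof: By an algebraic identity we have
$$E\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}})^2 = (n - 1)\left(S_u^2 + \frac{S_w^2}{m}\right) = n(n - 1)V(\bar{\bar{y}})$$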
                                   df          ms                                                            Estimate of
Between units                      (n - 1)     $s_b^2 = m\sum(\bar{y}_i - \bar{\bar{y}})^2/(n - 1)$          $S_w^2 + mS_u^2$
Within units between elements      n(m - 1)    $s_w^2 = \sum\sum(y_{ij} - \bar{y}_i)^2/n(m - 1)$             $S_w^2$
the quantities of which the mean squares $s_b^2$ and $s_w^2$ are unbiased
estimates. The result for $s_b^2$ follows at once from theorem 10.2, since
$s_b^2 = nm\,v(\bar{\bar{y}})$. The result for $s_w^2$ is easily verified. Consequently, an
unbiased estimate of $S_u^2$ is
$$s_u^2 = \frac{s_b^2 - s_w^2}{m}$$
Hence an unbiased estimate of $V(\bar{\bar{y}}')$ is
Since the fields were not chosen at random, but by following routes
designed so as to give good coverage of the area, the mean square
between fields may be an overestimate of the variance that would be
obtained from a random sample of fields. This disturbance and the
effects of variation in field size will be ignored.
We will examine how the variance of the sample mean is affected by:
i. Doubling the number of fields, with two subsamples per field.
ii. Keeping the number of fields unchanged, but taking four
subsamples per field.
iii. Keeping the number of fields unchanged, but completely
harvesting the fields.
Let $n$ denote the number of fields in the original sample. From
formula (10.4), the following estimated variances for the sample mean
are obtained (note that $m = 2$):
Original sample: ($n' = n$, $m' = 2$)
Case i: ($n' = 2n$, $m' = 2$)
Case ii: ($n' = n$, $m' = 4$)
Case iii: ($n' = n$, $m' = \infty$)
In case iii complete harvesting is assumed to be equivalent to taking
all possible subunits from every field in the sample. Since the size of
the subunit was very small compared to the size of a field, this implies
that $m' = \infty$.
Cases ii and iii show that increases in the subsampling ratio, keeping
the number of fields constant, produce only modest reductions in the
variance. If a marked increase in precision is wanted, the number of
fields must be increased.
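The effect of changing $n'$ and $m'$ can be sketched in a few lines. The example assumes that the estimated variance for a new design takes the form $(s_u^2 + s_w^2/m')/n'$, which is what theorem 10.1 gives when $s_u^2$ and $s_w^2$ replace the population values. The mean squares used are hypothetical, not those of the wheat survey.

```python
import math

def design_variance(s2_b, s2_w, m, n_new, m_new):
    """Estimated variance of the mean for n_new units and m_new subunits.

    s2_b and s2_w are the mean squares from a pilot sample with m subunits
    per unit; s2_u = (s2_b - s2_w) / m is the estimated unit component.
    """
    s2_u = (s2_b - s2_w) / m
    return (s2_u + s2_w / m_new) / n_new

# Hypothetical mean squares from a sample with m = 2; n is taken as 1
# so that the results are multiples of 1/n.
s2_b, s2_w, m = 10.0, 4.0, 2
original = design_variance(s2_b, s2_w, m, n_new=1, m_new=2)
case_i = design_variance(s2_b, s2_w, m, n_new=2, m_new=2)
case_ii = design_variance(s2_b, s2_w, m, n_new=1, m_new=4)
case_iii = design_variance(s2_b, s2_w, m, n_new=1, m_new=math.inf)
```

With these illustrative values, doubling the number of units halves the variance, whereas even complete enumeration within units cannot reduce it below $s_u^2/n$.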
10.4 General theory. We now drop the assumptions that $n/N$ and
$m/M$ are small and that the mathematical model holds. It is still
convenient to express the variances in terms of the quantities $S_u^2$ and
$S_w^2$, but these must first be defined in terms of the observations $y_{ij}$.
The definitions are constructed from the analysis of variance for the
complete population, shown in table 10.3. Thus, $S_u^2$ and $S_w^2$ are
defined so that the two equations stated in the two lines of the analysis
of variance are valid. These definitions were chosen because they
enable the general theory to be expressed as a natural extension of
the elementary theory.
With some populations the quantity $S_u^2$ may turn out to be
negative; this happens when elements in the same unit are negatively
correlated. Any feeling of discomfort created by the appearance of a
negative variance can be avoided by expressing the results in terms of
$S_w^2$ and $\rho$ (the intra-unit correlation coefficient) instead of $S_w^2$ and
$S_u^2$. However, all formulas remain correct when $S_u^2$ is negative.
In two-stage sampling, expected values must be found not only over
all possible samples of $n$ units that can be drawn, but also over all
possible subsamples that can be drawn from the selected set of units. It
is often helpful to perform the averaging in successive stages. For this
purpose we introduce two symbols:
$E_i$ = Average over all subsamples from the $i$th unit.
$E_n$ = Average over all subsamples from a fixed set of $n$ units.
If the $m$ elements in any chosen unit are selected by simple random
sampling,
$$E_i\,\bar{y}_i = \bar{Y}_i$$
Hence
$$E_n\,\bar{\bar{y}} = \bar{\bar{Y}}_n$$
where $\bar{\bar{Y}}_n$ denotes the mean that would be obtained if the $n$ units in
the sample were enumerated completely. If, further, the $n$ units are
also chosen by simple random sampling,
$$E\,\bar{\bar{Y}}_n = \bar{\bar{Y}}$$
This shows that $\bar{\bar{y}}$ is an unbiased estimate of $\bar{\bar{Y}}$.
SUB8AMPLING WITH UNITS OF EQUAL SIZE IG.4
Theorem 10.3 If the $n$ units and $m$ elements from each chosen unit
are selected by simple random sampling,
$$V(\bar{\bar{y}}) = E(\bar{\bar{y}} - \bar{\bar{Y}})^2 = \frac{(N - n)}{N}\frac{S_u^2}{n} + \frac{(MN - mn)}{MN}\frac{S_w^2}{mn} \qquad (10.5)$$
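Formula (10.5) can be checked against two limiting cases: it should vanish under complete enumeration, and it should reduce to the elementary result $S_u^2/n + S_w^2/nm$ when $N$ and $M$ are very large. The sketch below does this with hypothetical variance components; the function name is illustrative.

```python
def two_stage_variance(S2_u, S2_w, N, M, n, m):
    """V = (N - n)/N * S_u^2/n + (MN - mn)/(MN) * S_w^2/(mn), formula (10.5)."""
    first_stage = (N - n) / N * S2_u / n
    second_stage = (M * N - m * n) / (M * N) * S2_w / (m * n)
    return first_stage + second_stage

# Complete enumeration of every unit and element: no sampling error.
census = two_stage_variance(S2_u=3.0, S2_w=4.0, N=10, M=5, n=10, m=5)

# Very large N and M: the elementary theory applies.
elementary = two_stage_variance(S2_u=3.0, S2_w=4.0,
                                N=10**9, M=10**9, n=10, m=4)
```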
Proof: Write
$$\bar{\bar{y}} - \bar{\bar{Y}} = (\bar{\bar{y}} - \bar{\bar{Y}}_n) + (\bar{\bar{Y}}_n - \bar{\bar{Y}}) \qquad (10.5')$$
When we square both sides and take the average over any fixed set of
$n$ units, there is no contribution from the cross-product term on the
right, since
$$E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n) = 0$$
Consider the first term on the right. Each of the $n$ units may be
regarded as a stratum composed of $M$ elements. The sample from
these units is a proportionally stratified sample, since $m$ elements are
taken from every stratum. Consequently, the formula in theorem 5.3,
p. 69, for the variance of the mean of a stratified random sample may
be applied. This gives
$$E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n)^2 = \frac{1}{(nM)^2}\sum_{i=1}^{n}\frac{M(M - m)}{m}S_{wi}^2$$
where $S_{wi}^2$ is the variance within the $i$th unit. This may be rewritten
$$E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n)^2 = \frac{(M - m)}{M}\cdot\frac{1}{mn}S_{wn}^2 \qquad (10.6)$$
where $S_{wn}^2$ is the average variance within these $n$ units. If we further
average over all possible sets of $n$, it is clear that the average of $S_{wn}^2$
is $S_w^2$, as defined from table 10.3. Hence
10.5 Estimation of the variance in the general case. The first step
is to find sample estimates of $S_u^2$ and $S_w^2$. If the analysis of variance
of the sample is performed as in table 10.1, it turns out that the mean
squares $s_b^2$ and $s_w^2$ still have the expectations given in the table: i.e.
$$E(s_b^2) = S_w^2 + mS_u^2 \qquad (10.9)$$
$$E(s_w^2) = S_w^2 \qquad (10.10)$$
The result for $s_w^2$ is easily verified, but that for $s_b^2$ is less obvious. It
may be proved by straightforward methods as follows.
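Now
$$\frac{s_b^2}{m} = \frac{\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}})^2}{n - 1}, \qquad \sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}})^2 = \sum_{i=1}^{n}\bar{y}_i^2 - n\bar{\bar{y}}^2 \qquad (10.11)$$
Write
$$\bar{y}_i = \bar{Y}_i + (\bar{y}_i - \bar{Y}_i)$$
Since $\bar{y}_i$ is the mean of a random subsample of size $m$,
$$E_i\,\bar{y}_i^2 = \bar{Y}_i^2 + \frac{(M - m)}{M}\frac{S_{wi}^2}{m} \qquad (10.12)$$
Further, write
$$\bar{\bar{y}} = \bar{\bar{Y}}_n + (\bar{\bar{y}} - \bar{\bar{Y}}_n)$$
so that
$$E_n\,\bar{\bar{y}}^2 = \bar{\bar{Y}}_n^2 + E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n)^2$$
where $\bar{\bar{Y}}_n$ is the true mean of all $n$ units in the sample. In equation
(10.6) it was shown that
$$E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n)^2 = \frac{(M - m)}{M}\frac{1}{mn}S_{wn}^2$$
Hence,
$$E_n\,n\bar{\bar{y}}^2 = n\bar{\bar{Y}}_n^2 + \frac{(M - m)}{Mm}S_{wn}^2 \qquad (10.13)$$
From (10.12) and (10.13),
$$E_n\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}})^2 = \sum_{i=1}^{n}(\bar{Y}_i - \bar{\bar{Y}}_n)^2 + \frac{(n - 1)(M - m)}{Mm}S_{wn}^2$$
i.e.
$$\frac{m}{(n - 1)}E_n\sum_{i=1}^{n}(\bar{y}_i - \bar{\bar{y}})^2 = m\,\frac{\sum_{i=1}^{n}(\bar{Y}_i - \bar{\bar{Y}}_n)^2}{n - 1} + \frac{(M - m)}{M}S_{wn}^2$$
$$E(s_b^2) = m\,\frac{\sum_{i=1}^{N}(\bar{Y}_i - \bar{\bar{Y}})^2}{N - 1} + \frac{(M - m)}{M}S_w^2$$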
;_1 N - 1 M
1 {(N - n) .. (M - m) n ,}
tI(fJ) - - I." + I. (10.14)
"'" N M N
10.8 OPTIMUM SAMPLING AND SUB8AMPLING FRACTIONS 226
Proof: Substitute the expected values of .&2, .",2 in 11(71).
"
2 :E (fi. - J1f'
11'1\ 8& '_1
"\I/J - - - ----- (10.15)
mn n(n - I)
This agrees with the result from the elementary theory, theorem 10.2.
When $m = 1$, the sample provides no estimate $s_w^2$. This does not
matter provided that $n/N$ is negligible, since in that event $s_w^2$ does
not appear in $v(\bar{\bar{y}})$. One application of this result occurs when the
subsampling is systematic. Since a systematic subsample is equivalent
to a simple random subsample with a more complex type of element
and $m = 1$, formula (10.15) remains valid with systematic
subsampling unless $n/N$ is substantial. If the first-stage sampling is
systematic, however, the formula holds only if the systematic sample of the
units is equivalent to a simple random sample.
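The estimated variance (10.14) and its simplification (10.15) can be compared numerically. The following sketch uses hypothetical mean squares; the function name is illustrative.

```python
def estimated_two_stage_variance(s2_b, s2_w, N, M, n, m):
    """v = (1/mn) [ (1 - n/N) s_b^2 + (1 - m/M) (n/N) s_w^2 ], formula (10.14)."""
    f1 = n / N   # first-stage sampling fraction
    f2 = m / M   # second-stage sampling fraction
    return ((1 - f1) * s2_b + (1 - f2) * f1 * s2_w) / (m * n)

# n/N negligible: the estimate reduces to s_b^2 / (mn), formula (10.15).
negligible = estimated_two_stage_variance(10.0, 4.0, N=10**9, M=10, n=5, m=2)

# Every unit sampled (n = N): only the within-unit term remains.
all_units = estimated_two_stage_variance(10.0, 4.0, N=5, M=10, n=5, m=2)
```

The second case coincides with the usual estimate for a proportionally stratified sample, $(1 - m/M)\,s_w^2/mn$, in which each unit plays the part of a stratum.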
Since $m$ enters into $V$ and $C$ only in the combination $nm$, put $k = nm$,
and write (10.17) as
(10.18)
(10.19)
$$V(\bar{\bar{y}}) = \frac{S_u^2 c_w}{C}\left(1 + \frac{S_w^2}{mS_u^2}\right)\left(\frac{c_u}{c_w} + m\right) = \frac{S_u^2 c_w}{C}\left(1 + \frac{1.69}{m}\right)(10 + m)$$
Omitting the constant factor, the relative variance can be calculated
for different values of $m$. Table 10.4 shows these variances and the
relative precisions (with the maximum precision for $m = 4$ taken as
the standard).
TABLE 10.4 RELATIVE VARIANCES AND PRECISIONS FOR DIFFERENT VALUES OF m
m =              1      2      3      4      5      6      7      8      9      10
Rel. variance    29.59  22.14  20.32  19.92  20.07  20.51  21.10  21.80  22.56  23.38
Rel. precision   0.67   0.90   0.98   1.00   0.99   0.97   0.94   0.91   0.88   0.86
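The entries of table 10.4 follow from the factor $(1 + 1.69/m)(10 + m)$ and can be recomputed as a check. This is a minimal sketch; the variable names are illustrative, and the two ratios 1.69 and 10 are the ones appearing in the formula above.

```python
def relative_variance(m, variance_ratio=1.69, cost_ratio=10.0):
    """(1 + (S_w^2/S_u^2)/m) * (cost ratio + m), constant factor omitted."""
    return (1 + variance_ratio / m) * (cost_ratio + m)

table = {m: relative_variance(m) for m in range(1, 11)}
best_m = min(table, key=table.get)

# Relative precision, taking the best value of m as the standard.
precision = {m: table[best_m] / v for m, v in table.items()}

for m in range(1, 11):
    print(m, round(table[m], 2), round(precision[m], 2))
```

The minimum falls at $m = 4$, close to the continuous optimum $\sqrt{(1.69)(10)} \approx 4.1$.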
$$\hat{m}_{opt} = \frac{\sqrt{m}\,\sqrt{c_u/c_w}}{\sqrt{(s_b^2/s_w^2) - 1}} = \frac{6.324}{\sqrt{(s_b^2/s_w^2) - 1}}$$
This result provides confidence limits for $m_{opt}$. Take $n = 10$, and
consider 80 per cent limits. The degrees of freedom are 9 and 30.
From the 10 per cent significance levels of $F$ (Merrington and
Thompson, 1943), we find
$$F_{.10}(9, 30) = 1.8490$$
$$F_{.90}(9, 30) = 1/F_{.10}(30, 9) = 1/2.2547 = 0.4435$$
Substitution of these values of $F$ gives
Lower limit: $m_{opt} = 2.8$
Upper limit: $m_{opt} = 9.0$
As we have seen from table 10.4, any $m$ in this range gives a degree of
precision that is fairly close to the optimum. Thus, with $n = 10$, the
chances are 8 in 10 that the loss in precision is small.
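The two limits can be reproduced from the quoted values of $F$. The sketch below rests on assumptions that are not all printed in this passage: a pilot sample with $m = 4$ elements per unit (consistent with 9 and 30 degrees of freedom when $n = 10$), a cost ratio of 10, a true ratio $S_w^2/S_u^2 = 1.69$, and the relation that $s_b^2/s_w^2$ is distributed as $F$ multiplied by $(S_w^2 + mS_u^2)/S_w^2$.

```python
import math

def m_opt_estimate(ms_ratio, m_pilot=4, cost_ratio=10.0):
    """sqrt(m_pilot * cost_ratio) / sqrt(s_b^2/s_w^2 - 1); infinite if ratio <= 1."""
    if ms_ratio <= 1:
        return math.inf
    return math.sqrt(m_pilot * cost_ratio) / math.sqrt(ms_ratio - 1)

def m_opt_limits(F_upper, F_lower, variance_ratio=1.69,
                 m_pilot=4, cost_ratio=10.0):
    # Expected value of s_b^2 / s_w^2 divided by F (assumed relation).
    expected_ratio = 1 + m_pilot / variance_ratio
    lower = m_opt_estimate(F_upper * expected_ratio, m_pilot, cost_ratio)
    upper = m_opt_estimate(F_lower * expected_ratio, m_pilot, cost_ratio)
    return lower, upper

lower, upper = m_opt_limits(F_upper=1.8490, F_lower=1 / 2.2547)
```

A mean-square ratio of 1 or less gives an infinite estimate, which corresponds to single-stage sampling.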
The 80 per cent and 95 per cent confidence limits for $n$ = 5, 10, 20
appear in table 10.5. The upper limits $m = \infty$ which occur in three
cases imply single-stage sampling.
$$s_b^2 = \frac{m\sum_{i=1}^{n}(p_i - \bar{p})^2}{n - 1}$$
$$s_w^2 = \frac{m}{n(m - 1)}\sum_{i=1}^{n} p_i q_i$$
where $\bar{p} = \sum p_i/n$. Consequently, the formula for the estimated
variance in two-stage sampling is (by theorem 10.4)
$$v(\bar{p}) = \frac{(N - n)}{N}\,\frac{1}{n(n - 1)}\sum_{i=1}^{n}(p_i - \bar{p})^2 + \frac{(M - m)}{M}\,\frac{1}{Nn(m - 1)}\sum_{i=1}^{n} p_i q_i$$
where the components $u_i$, $w_{ij}$, $e_{iju}$ are all independently distributed
with means zero and variances $S_u^2$, $S_w^2$, and $S_{ww}^2$, respectively. It
follows that the variance of the sample mean per subsubunit is
$$V(\bar{\bar{\bar{y}}}) = \frac{S_u^2}{n} + \frac{S_w^2}{nm} + \frac{S_{ww}^2}{nmk} \qquad (10.20)$$
                                        df           ms                                                                              Estimates of
Between units                           (n - 1)      $s_u^2 = mk\sum_i(\bar{y}_i - \bar{y})^2/(n - 1)$                               $S_{ww}^2 + kS_w^2 + mkS_u^2$
Between subunits within units           n(m - 1)     $s_w^2 = k\sum_i\sum_j(\bar{y}_{ij} - \bar{y}_i)^2/n(m - 1)$                    $S_{ww}^2 + kS_w^2$
Between subsubunits within subunits     nm(k - 1)    $s_{ww}^2 = \sum_i\sum_j\sum_u(y_{iju} - \bar{y}_{ij})^2/nm(k - 1)$             $S_{ww}^2$
whare '1""", is the population mean for the nm subunits that were
selected, and '1"" the population mean for the n units that were selected.
When we square and take the average, the cross-product terms vanish.
The contributions of the squared terms turn out to be as follows:
" - "2
E(r"", r,,) (M-m)l(l
= \KS.... + S", 2)
~ ;;; :I
10.10 Exercises.
10.1 From a simple random sample of fields of corn, 2 subunits (each
consisting of 10 hills) were chosen in each field. The following mean squares
come from an analysis of variance of the number of ears per hill (on a
single-hill basis):
ms
Between fields '; .89
Between subunits within fields 1 .41
If it takes 1 hr to locate each field, and 10 min to locate and count 1
subunit (after the field is reached), what is the optimum number of subunits per
field? (The fpc's may be ignored.)
10.2 In the same survey, the mean square between hills within the same
subunit was 0.92. Assuming that this mean square would not change
appreciably if the subunit contained 20 hills, estimate the change in precision if one
20-hill unit were taken per field instead of two 10-hill units.
10.3 Verify the rule (section 10.6), that, if $m_{opt}$ lies between the two
integers $m$, $(m + 1)$, we should choose $(m + 1)$ if
$$m_{opt}^2 > m(m + 1)$$
10.4 Show that, if $S_u^2 > 0$, in the notation of section 10.4, a simple
random sample of $n$ primary units, with 1 element chosen per unit, is more
precise than a simple random sample of $n$ elements ($n > 1$, $M > 1$). Show that
the precision of the two methods is equal if $n/N$ is negligible. Would you
expect this intuitively?
10.11 References.
CAMERON, J. M. (1951). Use of variance components in preparing schedules for
the sampling of baled wool. Biometrics, 7, 83-96.
EISENHART, C. (1947). The assumptions underlying the analysis of variance.
Biometrics, 3, 1-21.
GRAY, P. G., and CORLETT, T. (1950). Sampling for the social survey. Jour. Roy.
Stat. Soc., A113, 150-206.
KING, A. J., and JEBE, E. H. (1940). An experiment in the pre-harvest sampling
of wheat fields. Iowa Agr. Exp. Sta. Res. Bull. 273.
MERRINGTON, M., and THOMPSON, C. M. (1943). Tables of percentage points of
the inverted beta (F) distribution. Biometrika, 33, 73-88.
SUKHATME, P. V. (1947). The problem of plot size in large-scale yield surveys.
Jour. Amer. Stat. Assoc., 42, 297-310.
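                            Population                    Sample
Number of elements          $M_i$                         $m_i$
Mean per element            $\bar{Y}_i$                   $\bar{y}_i$
Total                       $Y_i = M_i\bar{Y}_i$          $y_i = m_i\bar{y}_i$
Total number of elements    $M = \sum^N M_i$              $m = \sum^n m_i$
Total                       $Y = \sum^N Y_i$              $y = \sum^n y_i$
Mean per element            $\bar{\bar{Y}} = Y/M$         $\bar{\bar{y}} = y/m$
Mean per primary unit       $\bar{Y} = Y/N$               $\bar{y} = y/n$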
The notation departs from that of chapter 10 in one respect. We
define $M$ and $m$ as the total number of elements in the population and
sample, respectively, whereas in chapter 10 these symbols were the
corresponding totals in any primary unit (all units being of equal size).
To keep the notation consistent, symbols $\bar{M}$ and $\bar{m}$ should have been
used in chapter 10.
11.2 Sampling methods when n = 1. Suppose that the $i$th unit is
selected and that it contains $M_i$ units, of which $m_i$ are sampled at
random. We consider three methods of estimating $\bar{\bar{Y}}$.
I. Units chosen with equal probability.
Estimate $= \bar{y}_I = \bar{y}_i$.
The estimate is the sample mean per element. It is biased. For, in
repeated sampling from the same unit, the average of $\bar{y}_i$ is $\bar{Y}_i$, and
SUBSAMPLING WITH UNITS OF UNEQUAL SIZE 11.2
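since every unit has an equal chance of being selected, the average of
$\bar{Y}_i$ is $\sum \bar{Y}_i/N$.
$$\bar{y}_{II} - \bar{\bar{Y}} = \frac{NM_i\bar{y}_i}{M} - \bar{\bar{Y}}$$
Now $M_i\bar{Y}_i = Y_i$, the total for the unit, and $\bar{\bar{Y}} = N\bar{Y}/M$, where $\bar{Y}$ is
the population mean per unit. This gives
$$\bar{y}_{II} - \bar{\bar{Y}} = \frac{NM_i}{M}(\bar{y}_i - \bar{Y}_i) + \frac{N}{M}(Y_i - \bar{Y})$$
Hence,
$$V(\bar{y}_{II}) = \frac{N}{M^2}\sum_{i=1}^{N} M_i(M_i - m_i)\frac{S_i^2}{m_i} + \frac{N}{M^2}\sum_{i=1}^{N}(Y_i - \bar{Y})^2 \qquad (11.2)$$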
Now average over all possible selections of the unit. Since the $i$th unit
is selected with relative frequency $M_i/M$,
$$V(\bar{y}_{III}) = \frac{1}{M}\left\{\sum_{i=1}^{N}(M_i - m_i)\frac{S_i^2}{m_i} + \sum_{i=1}^{N} M_i(\bar{Y}_i - \bar{\bar{Y}})^2\right\} \qquad (11.3)$$
Totals 12 33
are any set of positive numbers that add to unity. We still assume
$n = 1$.
Method IV. An unbiased estimate of $\bar{\bar{Y}}$ is
$$\bar{y}_{IV} = \frac{M_i\bar{y}_i}{z_i M} \qquad (11.4)$$
This follows because, in repeated sampling, the $i$th unit appears with
relative frequency $z_i$, so that
$$E(\bar{y}_{IV}) = \sum_{i=1}^{N} z_i\left(\frac{M_i\bar{Y}_i}{z_i M}\right) = \sum_{i=1}^{N}\frac{M_i\bar{Y}_i}{M} = \bar{\bar{Y}}$$
With this method it is customary to select $m_i$ so that
$$m_i = \frac{kM_i}{z_i} \qquad (11.5)$$
$$m_i = \frac{kM_i}{z_i} = \frac{(31)(115)}{(20)(23)} = 8, \text{ to the nearest integer}$$
The desired subsampling ratio
$$\frac{m_i}{M_i} = \frac{k}{z_i} = \frac{115}{(20)(23)} = \frac{1}{4}$$
is known before the block is listed. Thus the interviewer can be told
in advance to take 1 household in 4 from this block. This rule is
convenient when the subsample is to be systematic, as is often the case in
practice.
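The variance of $\bar{y}_{IV}$ is obtained in the usual way. Write
$$\bar{y}_{IV} - \bar{\bar{Y}} = \frac{M_i\bar{y}_i}{z_i M} - \bar{\bar{Y}} \qquad [\text{by } (11.4)]$$
$$V(\bar{y}_{IV}) = \frac{1}{M^2}\left\{\sum_{i=1}^{N}\frac{M_i(M_i - m_i)}{z_i}\frac{S_i^2}{m_i} + \sum_{i=1}^{N} z_i\left(\frac{M_i\bar{Y}_i}{z_i} - M\bar{\bar{Y}}\right)^2\right\} \qquad (11.7)$$
$$V(\bar{y}_{IV}) = \frac{1}{M^2}\left\{\sum_{i=1}^{N}\frac{(M_i - m_i)S_i^2}{k} + \sum_{i=1}^{N} z_i\left(\frac{Y_i}{z_i} - Y\right)^2\right\} \qquad (11.7')$$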
Unit   $M_i$   $M_i/M$   $z_i$   $m_i = M_i/6z_i$   $6(M_i - m_i)$   $S_i^2$   $Y_i$   $Y_i/z_i$   $Y_i/z_i - Y$
1      2       0.17      .2      5/3                2                0.500     1       5           -28
2      4       0.33      .4      5/3                14               0.667     8       20          -13
3      6       0.50      .4      5/2                21               0.800     24      60          +27
In practice, the $m_i$ are rounded to integers. This has not been done
in the present illustration. From formula (11.7'), the variance comes
out as follows:
Contribution from "within-units" $= 6\sum (M_i - m_i)S_i^2/M^2 = 0.188$
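The within-unit contribution can be verified from the three-unit illustration. In the sketch below the unit values ($M_i$ = 2, 4, 6; $z_i$ = .2, .4, .4; $S_i^2$ = 0.500, 0.667, 0.800; $Y_i$ = 1, 8, 24) and $k = 1/6$ are read from the table as far as it can be made out, and should be treated as assumptions; the function name is illustrative.

```python
def method_iv_variance(M_i, z_i, S2_i, Y_i, k):
    """Return the within-unit and between-unit terms of formula (11.7')."""
    M = sum(M_i)
    Y = sum(Y_i)
    # Subsample sizes from rule (11.5), left unrounded as in the illustration.
    m_i = [k * Mi / zi for Mi, zi in zip(M_i, z_i)]
    within = sum((Mi - mi) * s2 / k
                 for Mi, mi, s2 in zip(M_i, m_i, S2_i)) / M ** 2
    between = sum(zi * (Yi / zi - Y) ** 2
                  for zi, Yi in zip(z_i, Y_i)) / M ** 2
    return within, between

within, between = method_iv_variance(
    M_i=[2, 4, 6],
    z_i=[0.2, 0.4, 0.4],
    S2_i=[0.500, 0.667, 0.800],
    Y_i=[1, 8, 24],
    k=1 / 6,
)
```

The within-unit term agrees with the value 0.188 quoted above; the between-unit term is computed from the last column of the table.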
Method I, in which the sample mean provides the estimate, has two
analogues. First, we may take the ordinary sample mean,

$$\bar y_I = \frac{\sum y_i}{\sum m_i} = \frac{1904}{40} = 47.6 \text{ years}$$

This estimate is biased. This is easily seen when $m$ is constant, since
in that event

$$E(\bar y_I) = \frac{1}{N}\sum_{i=1}^{N}\bar Y_i$$

whereas the population mean per element is $\bar{\bar Y} = \sum M_i\bar Y_i/M$. The
biased estimate gives too much weight to the smaller units. If there
is no correlation between $M_i$ and $\bar Y_i$, the distorted weighting does not
matter greatly, and the bias is unimportant in large samples. But, if
$M_i$ and $\bar Y_i$ are correlated, as often happens, the bias may not be
negligible in large samples. In the present example we might anticipate
a small negative correlation between $M_i$ and $\bar Y_i$, because the longer
biographies, which cut down the number per page, tend to be those of
the older scientists.
This discussion suggests, as an alternative estimate in method I, the
weighted mean,

$$\bar y_{I'} = \frac{\sum M_i\bar y_i}{\sum M_i} = \frac{17{,}121.5}{359} = 47.7 \text{ years}$$

This is a typical ratio estimate because both the numerator and the
denominator vary from sample to sample. As is characteristic of ratio
estimates, the bias is negligible in large samples. If the subsampling
is proportional, this estimate reduces to the ordinary sample mean and
coincides with that in method I.
When the $m$'s are all equal, the unweighted sample mean sometimes
has a smaller variance than the weighted mean. In view of its greater
liability to bias, however, the unweighted mean is more hazardous.
The unbiased estimate (method II) is given by

$$\bar y_{II} = \frac{N}{nM}\sum_{i=1}^{n}M_i\bar y_i$$

In this estimate the quantity $\sum M_i\bar y_i$, which is an unbiased estimate
of the total of the $n$ units in the sample, is raised by the inverse $N/n$ of
the sampling fraction, and then divided by the total number $M$ of elements.
In the present example the number of pages in the book is
2823, and $M$ (total number of biographies in the book) is given in the
preface as "about 50,000." Accepting this figure for illustration, we
have

$$\bar y_{II} = \frac{2823}{(20)(50{,}000)}(17{,}121.5) = 48.3 \text{ years}$$

As in the case $n = 1$, this estimate often has poor precision if there
is much variation among the $M_i$.
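The three estimates of mean age can be reproduced from the sample totals quoted in the text. The sketch below only repeats that arithmetic; the variable names are illustrative.

```python
# Sample totals quoted in the text (20 pages sampled out of N = 2823).
n = 20                # number of pages (primary units) in the sample
N = 2823              # number of pages in the book
M = 50_000            # approximate total number of biographies
sum_y = 1904          # total of ages over the 40 subsampled biographies
sum_m = 40            # total number of subsampled biographies
sum_M_ybar = 17_121.5 # sum of M_i * ybar_i over the sample
sum_M = 359           # sum of M_i over the sample

ybar_unweighted = sum_y / sum_m              # ordinary sample mean (biased)
ybar_weighted = sum_M_ybar / sum_M           # weighted mean (ratio-type estimate)
ybar_unbiased = N * sum_M_ybar / (n * M)     # method II (unbiased)

print(round(ybar_unweighted, 1), round(ybar_weighted, 1), round(ybar_unbiased, 1))
```

The unbiased estimate depends on the assumed value of $M$, which the other two do not require.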
Sampling with probability proportional to size, or to estimated size,
is unlikely to be adopted in this illustration, because of the work
involved in counting or estimating the numbers of entries on all 2823
pages. The estimates for these methods will be given in algebraic form.
As pointed out in section 9.10, we must sample with replacement in
order to keep the probabilities proportional to size or to estimated size.
If the same unit is drawn twice, the subsample is also taken with
replacement. In examining the bias of an estimate we shall use the same
$$k = \frac{m_i z_i}{M_i} = \frac{\sum_{i=1}^{N} m_i z_i}{M} \qquad (11.8)$$
by summing the numerators and denominators of the series of equal
fractions. But the average number of elements in the sample is
(11.9)
when sampling is with equal probabilities, the symbol $k$ has the same
meaning in table 11.8 for methods I', II, IV, and V.
Where two algebraic expressions for an estimate are shown (as in
methods I', II, IV, and V), the first is the general form of the estimate,
which applies for any choice of the $m_i$: the second is the self-weighting
form which holds only when the $m_i$ are chosen as in the preceding
column.
(11.10)
If $x_{ij}$ is a "dummy" variate that has the value 1 for every element in
the population, so that $\bar x_i = 1$, $\hat R$ reduces to $\bar y_V$ as given in table 11.8,
and therefore includes methods I', III, and V as particular cases.
The ratio estimate is also useful in its own right. If $x_{ij}$ is the value
of $y_{ij}$ at a previous census, we can form ratio estimates of $\bar{\bar Y}$ and $Y$ as

$$\bar y_R = \hat R\bar{\bar X}: \qquad \hat Y_R = \hat R X$$

This type of estimate was found to be more precise than any of the
preceding methods for farm items in North Carolina (L. H. Madow,
1950; Jebe, 1952).
In finding $V(\hat R)$ we shall use one of the variance formulas already
established for the case $n = 1$. As a preliminary, we require a well-known
result for the variance of a mean in sampling with replacement.
The result is deliberately stated in rather general terms.
Lemma. For a specified method of sampling and estimation, the
sample estimate $u$ is an unbiased estimate of some population
characteristic $U$, with variance $S_u^2$. Suppose that such a sample is drawn $n$
times, with replacement after each draw, yielding the estimates $u_1$,
$u_2, \ldots, u_n$. Then, if $\bar u$ is the arithmetic mean of the $u_i$,

$$V(\bar u) = \frac{S_u^2}{n} \qquad (11.11)$$

Proof:

$$(\bar u - U) = \frac{1}{n}\left[(u_1 - U) + (u_2 - U) + \cdots + (u_n - U)\right]$$

Since the draws are independent, the cross-product terms vanish on averaging, so that

$$V(\bar u) = \frac{1}{n^2}\,nS_u^2 = \frac{S_u^2}{n}$$

Note that the lemma does not specify how the sample is drawn: the
drawing may be with either equal or arbitrary probabilities.
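The lemma can be checked exactly on a small case by enumerating every ordered sample drawn with replacement. The sketch below is an illustration under assumed inputs: the three single-draw estimates and their selection probabilities are hypothetical, and the function name is not from the text.

```python
from itertools import product


def check_lemma(estimates, probs, n):
    """Return (S_u^2, V(ubar)) by exact enumeration of all ordered samples
    of n draws made with replacement with the given probabilities."""
    U = sum(p * u for p, u in zip(probs, estimates))             # E(u)
    S2_u = sum(p * (u - U) ** 2 for p, u in zip(probs, estimates))
    var_mean = 0.0
    for draw in product(range(len(estimates)), repeat=n):
        prob = 1.0
        total = 0.0
        for i in draw:
            prob *= probs[i]          # draws are independent
            total += estimates[i]
        var_mean += prob * (total / n - U) ** 2
    return S2_u, var_mean


# Hypothetical single-draw estimates with unequal selection probabilities.
S2_u, var_mean = check_lemma([5.0, 20.0, 60.0], [0.2, 0.4, 0.4], n=3)
print(S2_u, var_mean, S2_u / 3)
```

The enumeration makes no use of equal probabilities, which is the point of the closing remark of the lemma.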
Theorem 11.1 A sample of $n$ units is drawn with replacement, the
probability of selection assigned to the ith unit at any draw being $z_i$.
From each selected unit a subsample of size $m_i$ is drawn by simple random
sampling. The estimate is

$$\hat R = \frac{\sum_{i=1}^{n}\dfrac{M_i}{z_i}\bar y_i}{\sum_{i=1}^{n}\dfrac{M_i}{z_i}\bar x_i}$$

Then, in large samples,*

$$V(\hat R) \doteq \frac{1}{nX^2}\sum_{i=1}^{N}\left[\frac{1}{z_i}(Y_i - RX_i)^2 + \frac{M_i(M_i - m_i)}{z_i}\frac{S_{di}^2}{m_i}\right] \qquad (11.12)$$

where

$$S_{di}^2 = \frac{1}{(M_i - 1)}\sum_{j=1}^{M_i}\left[(y_{ij} - Rx_{ij}) - (\bar Y_i - R\bar X_i)\right]^2$$

* The theorem does not reveal how large the sample must be. As with the ordinary
ratio estimate (chapter 6) the approximation is probably adequate if the
coefficients of variation of the numerator and denominator of $\hat R$ are both less than
0.1, though further research on this point is needed.
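$$V\left(\frac{M_i\bar y_i}{z_i}\right) = \sum_{i=1}^{N} z_i\left(\frac{Y_i}{z_i} - Y\right)^2 + \sum_{i=1}^{N}\frac{M_i(M_i - m_i)}{z_i}\frac{S_i^2}{m_i} \qquad (11.13)$$

Apply this result to the variate $M_i(\bar y_i - R\bar x_i)/z_i$.
This variate equals $M_i\bar d_i/z_i$, where $d_{ij} = y_{ij} - Rx_{ij}$. Substitution in (11.13) gives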
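$$V(\hat R) \doteq \frac{1}{nX^2}\left[\sum_{i=1}^{N}\frac{1}{z_i}(Y_i - RX_i)^2 + \sum_{i=1}^{N}\frac{M_i(M_i - m_i)}{z_i}\frac{S_{di}^2}{m_i}\right]$$

where

$$S_{di}^2 = \frac{1}{(M_i - 1)}\sum_{j=1}^{M_i}\left[(y_{ij} - Rx_{ij}) - (\bar Y_i - R\bar X_i)\right]^2$$

With $m_i = kM_i/z_i$, this becomes

$$V(\hat R) \doteq \frac{1}{nX^2}\sum_{i=1}^{N}\left[\frac{1}{z_i}(Y_i - RX_i)^2 + \frac{(M_i - m_i)}{k}S_{di}^2\right] \qquad (11.12')$$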
As noted in section 11.9, this result provides variance formulas for
methods V, III, and I' as particular cases.
Theorem 11.2 applies to the unbiased methods IV and II.
Theorem 11.2 With the same method of selection as in theorem 11.1,
the estimate of $\bar{\bar Y}$ is

$$\bar y_{IV} = \frac{1}{nM}\sum_{i=1}^{n}\frac{M_i\bar y_i}{z_i}$$

Then

$$V(\bar y_{IV}) = \frac{1}{nM^2}\sum_{i=1}^{N}\left[\frac{1}{z_i}(Y_i - z_iY)^2 + \frac{M_i(M_i - m_i)}{z_i}\frac{S_i^2}{m_i}\right] \qquad (11.14)$$

$$= \frac{1}{nM^2}\sum_{i=1}^{N}\left[\frac{1}{z_i}(Y_i - z_iY)^2 + \frac{M_i(M_i - m_i)}{z_i}\frac{S_i^2}{m_i}\right]$$

by (11.7).
Corollary. If $m_i = kM_i/z_i$, so that $\bar y_{IV} = y/nkM$, then

$$V(\bar y_{IV}) = \frac{1}{nM^2}\sum_{i=1}^{N}\left[\frac{1}{z_i}(Y_i - z_iY)^2 + \frac{(M_i - m_i)}{k}S_i^2\right] \qquad (11.14')$$

Variance formulas (11.12') and (11.14') are structurally rather similar.
Apart from multiplying factors, the principal difference is that in
the ratio estimate the variate $(y_{ij} - Rx_{ij})$ replaces the variate $y_{ij}$
which appears in the unbiased estimate. The formula for the ratio
estimate is approximate; that for the unbiased estimate is exact,
provided that sampling is with replacement.
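The relation between the general formula (11.14) and its self-weighting form (11.14') can be illustrated numerically. The sketch below is not from the text: the function names are hypothetical, and the inputs are the three-unit artificial population used earlier, with $k = 1/6$ assumed.

```python
def var_unbiased_general(n, M_i, z_i, m_i, S2_i, Y_i):
    """Formula (11.14): variance of the unbiased estimate ybar_IV."""
    M, Y = sum(M_i), sum(Y_i)
    total = 0.0
    for Mi, zi, mi, s2, Yi in zip(M_i, z_i, m_i, S2_i, Y_i):
        between = (Yi - zi * Y) ** 2 / zi
        within = Mi * (Mi - mi) * s2 / (zi * mi)
        total += between + within
    return total / (n * M ** 2)


def var_unbiased_self_weighting(n, M_i, z_i, S2_i, Y_i, k):
    """Formula (11.14'): the same variance when m_i = k * M_i / z_i."""
    M, Y = sum(M_i), sum(Y_i)
    total = 0.0
    for Mi, zi, s2, Yi in zip(M_i, z_i, S2_i, Y_i):
        mi = k * Mi / zi
        total += (Yi - zi * Y) ** 2 / zi + (Mi - mi) * s2 / k
    return total / (n * M ** 2)


M_i, z_i = [2, 4, 6], [0.2, 0.4, 0.4]
S2_i, Y_i, k = [0.500, 0.667, 0.800], [1, 8, 24], 1 / 6
m_i = [k * Mi / zi for Mi, zi in zip(M_i, z_i)]

v_general = var_unbiased_general(1, M_i, z_i, m_i, S2_i, Y_i)
v_self = var_unbiased_self_weighting(1, M_i, z_i, S2_i, Y_i, k)
print(round(v_general, 4), round(v_self, 4))
```

Both forms divide by $n$, so for samples of $n$ units drawn with replacement the single-unit value is simply divided by $n$.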
This formula is not adapted for our purpose, because the quantities
$\sum m_i$ and $\sum M_i$ are random variables which depend on the units
that happen to be selected. Instead, we consider the average cost.

$$E\left(\sum_{i=1}^{n}m_i\right) = n\sum_{i=1}^{N}z_im_i = nk\sum_{i=1}^{N}M_i = nkM$$

Similarly,

$$E\left(\sum_{i=1}^{n}M_i\right) = n\sum_{i=1}^{N}z_iM_i$$
,.
Z !· (11.18)
i.e.
M ,2(15,,2_ Sd,2)
, M,
(11 .19)
X(c .. + cIM,)
Since $X$ is the same for all $i$, these equations provide explicit solutions
for the $z_i'$ and hence the $z_i = z_i'/n$.
The numerators of the $z_i'^2$ in (11.19) will be assumed positive. (If
they are negative it is found that subsampling is inefficient,
single-stage sampling of the primary units being superior.) For the optimum
$z_i$ we have
(11.20)
where
D, 2 = IP _ Sd,2
,.. , M,
lation in which all units are of the same size $\bar M$. Perform an analysis of
variance similar to table 9.5 (p. 196), on the variates $d_{ij} = y_{ij} - Rx_{ij}$.
From the definition of $D_i'^2$, its average value over all units (assuming
$M_i = \bar M$) is
(g > 0)
(11.22)
iii. If costs of listing and fixed costs are of the same order of magnitude,
$z_i \propto \sqrt{M_i}$ is a good compromise.
The optimum $k$, found by differentiation, is left as an exercise to
the reader. Its value is
The optimum $n$ is found by solving the cost equation (11.15) for $n$.
As has been mentioned, the discussion in this section assumes that
the sizes, or good estimates of them, are known. No part of the budget
is allotted to obtaining information about the sizes of the N units,
except for any listing needed in the units that comprise the sample.
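$$\bar y_V = \frac{\sum\dfrac{M_i}{z_i}\bar y_i}{\sum\dfrac{M_i}{z_i}} = \bar y \qquad (\text{if } m_i = kM_i/z_i)$$

$$\bar y_{IV} = \frac{1}{nM}\sum\frac{M_i}{z_i}\bar y_i = \frac{y}{nkM}$$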
$$V(\hat R) \doteq \frac{1}{nX^2}\sum_{i=1}^{N}\left[\frac{1}{z_i}(Y_i - RX_i)^2 + \frac{(M_i - m_i)}{k}S_{di}^2\right]$$

$$V(\bar y_{IV}) = \frac{1}{nM^2}\sum_{i=1}^{N}\left[\frac{1}{z_i}(Y_i - z_iY)^2 + \frac{(M_i - m_i)}{k}S_i^2\right]$$
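$$\sum\frac{1}{z_i}(Y_i - M_i\bar{\bar Y})^2 < \sum\frac{1}{z_i}(Y_i - z_iY)^2$$

$$\sum z_i\left(\frac{Y_i}{z_i} - \bar{\bar Y}\frac{M_i}{z_i}\right)^2 < \sum z_i\left(\frac{Y_i}{z_i} - Y\right)^2 \qquad (11.23)$$

$$E\left(\frac{M_i}{z_i}\right) = \sum z_i\frac{M_i}{z_i} = M$$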
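Let $d_{ij} = y_{ij} - Rx_{ij}$, with subsample total $d_i$ and subsample mean $\bar d_i$.
Then

$$\frac{M_i}{z_im_i}(y_i - Rx_i) = \frac{M_i}{z_im_i}d_i = \frac{M_i}{z_i}\bar d_i = \frac{M_i}{z_i}\bar D_i + \frac{M_i}{z_i}(\bar d_i - \bar D_i) = \frac{D_i}{z_i} + \frac{M_i}{z_i}(\bar d_i - \bar D_i)$$

where

$$\bar D_i = \bar Y_i - R\bar X_i \quad \text{and} \quad D_i = M_i\bar D_i = Y_i - RX_i$$

Now square and average over all subsamples from the ith unit.

$$E\left(\frac{M_id_i}{z_im_i}\right)^2 = \left(\frac{D_i}{z_i}\right)^2 + \frac{M_i(M_i - m_i)}{z_i^2}\frac{S_{di}^2}{m_i}$$

Hence, for a fixed selection of $n$ units,

$$E\sum_{i=1}^{n}\left(\frac{M_id_i}{z_im_i}\right)^2 = \sum_{i=1}^{N}t_i\left[\left(\frac{D_i}{z_i}\right)^2 + \frac{M_i(M_i - m_i)}{z_i^2}\frac{S_{di}^2}{m_i}\right]$$

where $t_i$ is the number of times that the ith unit has been drawn in
any specific sample. When we average over all possible sets of $n$ units,
$E(t_i) = nz_i$. Hence

$$E\sum_{i=1}^{n}\left(\frac{M_id_i}{z_im_i}\right)^2 = n\sum_{i=1}^{N}\left[\frac{D_i^2}{z_i} + \frac{M_i(M_i - m_i)}{z_i}\frac{S_{di}^2}{m_i}\right]$$

But by theorem 11.1, formula (11.12), since $D_i = Y_i - RX_i$,

$$V(\hat R) \doteq \frac{1}{nX^2}\sum_{i=1}^{N}\left[\frac{D_i^2}{z_i} + \frac{M_i(M_i - m_i)}{z_i}\frac{S_{di}^2}{m_i}\right]$$

The result follows [note the divisor $n^2X^2$ in $v(\hat R)$].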
Corollary. Theorem 11.3 is not usable as it stands, since $R$, and in
some applications $X$ also, are unknown. In place of $R$ we put the
sample estimate $\hat R$ and replace an $n$ in the denominator by $(n - 1)$, as
was done with the ordinary ratio estimate in chapter 6. An unbiased
estimate of $X$ is

$$\hat x = \frac{1}{n}\sum_{i=1}^{n}\frac{M_i\bar x_i}{z_i}$$

Hence we take

$$v(\hat R) = \frac{1}{n(n - 1)\hat x^2}\sum_{i=1}^{n}\left\{\frac{M_i}{z_im_i}(y_i - \hat Rx_i)\right\}^2 \qquad (11.26)$$
$$\bar y_{I'} = \frac{\sum M_i\bar y_i}{\sum M_i} = \frac{17{,}121.5}{359} = 47.6922 \text{ years}$$

Some extra decimal places are retained to ensure accuracy in later
computations.
To apply formula (11.26), we put

$$z_i = \frac{1}{N}: \quad m_i = 2: \quad \hat R = \bar y_{I'}: \quad x_i = m_i = 2: \quad \hat x = \frac{N}{n}\sum M_i$$

Substitution into formula (11.26) gives

$$v(\hat R) = \frac{n}{(n - 1)(\sum M_i)^2}\sum_{i=1}^{n}\{M_i(\bar y_i - \bar y_{I'})\}^2 \qquad (11.27)$$

The sum of squares is easily computed from table 11.7, p. 244, in the
form

$$\sum(M_i\bar y_i)^2 - 2\bar y_{I'}\sum(M_i\bar y_i)M_i + \bar y_{I'}^2\sum M_i^2$$
$$= 15{,}375{,}020 - (95.3844)(309{,}747.5) + (2274.55)(6481)$$
$$= 571{,}300 \qquad (11.28)$$

$$v(\hat R) = \frac{(20)(571{,}300)}{(19)(359)^2} = 4.67$$
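The computation of (11.27) and (11.28) can be repeated from the quoted sums. The sketch below uses only the totals printed above; the variable names are illustrative, and small differences from the printed figures come from rounding of intermediate values.

```python
# Sums quoted in the text for the sample of n = 20 pages.
n = 20
sum_My = 17_121.5        # sum of M_i * ybar_i
sum_M = 359              # sum of M_i
sum_My_sq = 15_375_020   # sum of (M_i * ybar_i)^2
sum_My_M = 309_747.5     # sum of (M_i * ybar_i) * M_i
sum_M_sq = 6481          # sum of M_i^2

ybar = sum_My / sum_M    # weighted mean, about 47.6922 years

# Expanded sum of squares of M_i * (ybar_i - ybar), as in (11.28).
sum_of_squares = sum_My_sq - 2 * ybar * sum_My_M + ybar ** 2 * sum_M_sq

# Estimated variance, formula (11.27), and the standard error.
v_R = n * sum_of_squares / ((n - 1) * sum_M ** 2)
standard_error = v_R ** 0.5

print(round(sum_of_squares), round(v_R, 2), round(standard_error, 2))
```

Because the sum of squares is a small difference of large numbers, the extra decimal places retained in the mean matter.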
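The following result gives the sample estimate of variance for the
unbiased estimate in method IV.
Theorem 11.4 With the same method of sampling as in theorem
11.3, the estimate of $\bar{\bar Y}$ is

$$\bar y_{IV} = \frac{1}{nM}\sum_{i=1}^{n}\frac{M_i\bar y_i}{z_i}$$

Then an unbiased sample estimate of $V(\bar y_{IV})$ is

$$v(\bar y_{IV}) = \frac{1}{n(n - 1)M^2}\sum_{i=1}^{n}(\hat Y_i - \hat Y_n)^2 \qquad (11.29)$$

where $\hat Y_i = M_i\bar y_i/z_i$ and $\hat Y_n$ is the mean of the $\hat Y_i$ over the sample.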
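TABLE 11.10 SAMPLE VARIANCES FOR ESTIMATES OF THE POPULATION MEAN

Method   Probabilities   $m_i$          Estimate     Estimated variance
I'       Eq.             $kNM_i$        $\bar y$     $\dfrac{n}{(n - 1)m^2}\sum\{m_i(\bar y_i - \bar y)\}^2$
II       Eq.             $kNM_i$        $y/nkM$      $\dfrac{1}{n(n - 1)(kM)^2}\sum(y_i - y/n)^2$
III      $M_i/M$         $\bar m$       $\bar y$     $\dfrac{1}{n(n - 1)\bar m^2}\sum(y_i - y/n)^2$
IV       $z_i$           $kM_i/z_i$     $y/nkM$      $\dfrac{1}{n(n - 1)(kM)^2}\sum(y_i - y/n)^2$
V        $z_i$           $kM_i/z_i$     $\bar y$     $\dfrac{n}{(n - 1)m^2}\sum\{m_i(\bar y_i - \bar y)\}^2$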
The subscript $h$ denotes the stratum: $M_h$ is the total number of elements
in stratum $h$, and $M$ is the total number of elements in the
population. The estimated population mean is

$$\bar y_{st} = \sum_{h=1}^{L}W_h\bar y_h$$

where $\bar y_h$ denotes the estimate of the stratum mean per element $\bar{\bar Y}_h$.
Further,

$$V(\bar y_{st}) = \sum_{h=1}^{L}W_h^2V(\bar y_h) \qquad (11.31)$$

and the estimated variance is

$$v(\bar y_{st}) = \sum_{h=1}^{L}W_h^2v(\bar y_h) \qquad (11.32)$$

Table 11.10, which gives the value of $v(\bar y_h)$ for a single stratum, is
useful in constructing these estimates of variance.
Table 11.11, which is an extension of part of table 11.8 to stratified
sampling, presents the algebraic forms of the three unbiased estimates.
The column headed $m_{hi}$ shows the subsample sizes which make the
estimates self-weighting within strata. The self-weighting forms of the
estimates are given in the right-hand column: the general forms apply
if the $m_{hi}$ have not been chosen so as to make the estimate self-weighting
within strata. In the summations within the table, $h$ goes from 1
to $L$ and $i$ from 1 to $n_h$.
II     Eq.               $k_hN_hM_{hi}$
III    $M_{hi}/M_h$      $\bar m_h$
IV     $z_{hi}$          $k_hM_{hi}/z_{hi}$
with a corresponding definition for $\hat x_h$. The quantities $\hat y_h$, $\hat x_h$ are
unbiased sample estimates of the stratum totals $Y_h$, $X_h$, respectively.
The combined ratio estimate is defined as
where
11.15 Exercises.
11.1 By working out the estimates for all possible samples which can be
drawn from the artificial population in table 11.1, by methods Ia, Ib, II, and
III, verify the total variances given in table 11.2.

$$M(\bar y_{II} - \bar{\bar Y}) = \frac{N}{n}\sum_{i=1}^{n}M_i(\bar y_i - \bar Y_i) + N(\bar Y_n - \bar Y)$$

where $\bar Y_n$ is the true mean per primary unit for the $n$ units in the sample.
Hence find the exact variance of $\bar y_{II}$ and compare it with the variance deduced
from theorem 11.2, which assumes sampling with replacement.
11.6 For the data in table 11.7, estimate from the sample the standard
error of the unbiased estimate which was made of the mean age of entries in
American Men of Science. ($M$ may be taken as 50,000.)
11.16 References.
A chapter in population sampling (1950). U. S. Government Printing Office.
GRAY, P. G., and CORLETT, T. (1950). Sampling for the social survey. Jour. Roy.
Stat. Soc., A113, 150-206.
HANSEN, M. H., and HURWITZ, W. N. (1943). On the theory of sampling from
finite populations. Ann. Math. Stat., 14, 333-362.
HANSEN, M. H., and HURWITZ, W. N. (1949). On the determination of the optimum
probabilities in sampling. Ann. Math. Stat., 20, 426-432.
JEBE, E. H. (1952). Estimation for sub-sampling designs employing the county
as a primary sampling unit. Jour. Amer. Stat. Assoc., 47, 49-70.
JESSEN, R. J., et al. (1947). On a population sample for Greece. Jour. Amer. Stat.
Assoc., 42, 357-384.
MADOW, L. H. (1950). On the use of the county as a primary sampling unit for
state estimates. Jour. Amer. Stat. Assoc., 45, 30-47.
YATES, F. (1949). Sampling methods for censuses and surveys. Charles Griffin and
Co., London.
CHAPTER 12
DOUBLE SAMPLING
12.2 Double sampling for stratification. The theory was first given
by Neyman (1938).
If the middle term in (12.5) is combined with (12.6), the reader may
verify that these together amount to
The term free from $n'$ is the familiar expression for the variance
when the strata sizes are known exactly. The effects of errors in the
first sample are therefore to increase slightly the within-stratum
contribution to the variance, and to introduce a between-stratum
component.
Corollary. If we are estimating a proportion in the second sample,
then
(12.7)
where Ph is the proportion in stratum h.
12.3 Optimum allocation. The values of the $n_h$ and $n'$ that lead to
the minimum variance are rather complicated. It is clear from formula
(12.3) that $n_h$ should be proportional to

$$S_h\sqrt{W_h^2 + \frac{W_h(1 - W_h)}{n'}}$$

Since the second term inside the root is usually small compared with
the first, Neyman (1938) suggests taking $n_h$ proportional to $W_hS_h$.
Thus

$$n_h = \frac{nW_hS_h}{\sum W_hS_h}$$

When these values are substituted into the variance (12.3), with the
term in $W_h(1 - W_h)$ ignored, we obtain

$$V_{opt} \doteq \frac{(\sum W_hS_h)^2}{n} + \frac{\sum W_h(\bar Y_h - \bar Y)^2}{n'} \qquad (12.8)$$

$$= \frac{V_n}{n} + \frac{V_{n'}}{n'} \quad \text{(say)} \qquad (12.8')$$

This approximate expression for the variance is now minimized by
choice of $n$ and $n'$ for a given cost of the form stated previously
• (12.10)
Strata   $W_h$    $S_h$    $\bar Y_h$
1        0.786    17.7     19.404
2        0.214    30.4     51.626

$$\frac{n}{n'} = \sqrt{\frac{417.1}{(175)(10)}} = 0.488$$

From the cost equation (12.1.1) we obtain

$$n' = \frac{100}{0.588} = 170, \qquad n = 170 \times 0.488 = 83$$

At this point the reader may verify from the data in this example
that the neglected term in $W_h(1 - W_h)$ in the variance formula (12.3)
is in fact negligible. From formula (12.8) we then have

$$V_{opt} = \frac{417.1}{83} + \frac{175}{170} = 5.02 + 1.03 = 6.05$$

For a random sample of size 100, with no double sampling, we would
have

$$V = \frac{620}{100} = 6.20$$

It appears that there would be only a trifling gain from double sampling.
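The allocation in this example can be reproduced with a short calculation. The sketch assumes a linear cost $c_nn + c_{n'}n' = C$ with $c_n = 1$, $c_{n'} = 0.1$ and $C = 100$, which is what the figures 10 and 100/0.588 above imply; the function name is hypothetical.

```python
import math


def optimum_double_sampling(V_n, V_n_prime, c_n, c_n_prime, budget):
    """Minimize V_n/n + V_n'/n' subject to c_n*n + c_n'*n' = budget.
    Returns (n, n', minimum variance)."""
    ratio = math.sqrt(V_n * c_n_prime / (V_n_prime * c_n))   # n / n'
    n_prime = budget / (c_n * ratio + c_n_prime)
    n = ratio * n_prime
    return n, n_prime, V_n / n + V_n_prime / n_prime


# V_n = (sum W_h S_h)^2 and V_n' = sum W_h (Ybar_h - Ybar)^2, as quoted in the example.
n, n_prime, v_opt = optimum_double_sampling(
    V_n=417.1, V_n_prime=175.0, c_n=1.0, c_n_prime=0.1, budget=100.0)

v_single = 620 / 100   # single random sample of size 100
print(round(n), round(n_prime), round(v_opt, 2), v_single)
```

With these inputs the gain from double sampling is only about 2 per cent, in line with the remark above.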
$$V(\bar y_{st}) = \sum_{h=1}^{L}\left[\left\{W_h^2 + \frac{W_h(1 - W_h)}{n'}\right\}\frac{S_h^2}{n_h} + \frac{W_h(\bar Y_h - \bar Y)^2}{n'}\right]$$

If estimates from the sample are substituted in this quantity, the
resulting expression turns out to be an overestimate of $V(\bar y_{st})$. An
unbiased estimate can be constructed without difficulty.

$$v(\bar y_{st}) = \frac{n'}{(n' - 1)}\sum_{h}\left[\left\{w_h^2 - \frac{w_h}{n'}\right\}\frac{s_h^2}{n_h} + \frac{w_h(\bar y_h - \bar y_{st})^2}{n'}\right] \qquad (12.12)$$
(12.13)
(12.14)
(12.15)
Adding (12.13), (12.14), and (12.15), we obtain for $(n' - 1)Ev(\bar y_{st})/n'$

$$\sum_{h}\left[\left\{W_h^2 + \frac{W_h(1 - W_h)}{n'}\right\}\frac{S_h^2}{n_h} + \frac{W_h(\bar Y_h - \bar Y)^2}{n'}\right] - \frac{V(\bar y_{st})}{n'} = \frac{(n' - 1)V(\bar y_{st})}{n'}$$

The theorem follows.
If $n'$ is large relative to the $n_h$, $v(\bar y_{st})$ reduces to
(12.16)
$$v(p_{st}) = 0.00248$$
$$s(p_{st}) = 0.049$$

The estimated proportion of rented households is 0.64 ± 0.049. The
reader may verify that there is only a trifling gain in precision over a
single-stage simple random sample of size 92. In view of the relatively
small size of the non-white stratum, a greater difference between
the proportions of rented households for whites and non-whites would
be necessary to make double sampling profitable.
finite and that the relation between $y_i$ and $x_i$ is linear. Write as a
model

$$y_{i\alpha} = \bar Y + B(x_i - \bar X) + e_{i\alpha} \qquad (12.17)$$

where the second subscript $\alpha$ is introduced as a reminder that for
fixed $x_i$ the random variate $e_{i\alpha}$ follows a frequency distribution with
mean 0 and variance $S_e^2 = S_y^2(1 - \rho^2)$.
In the first (large) sample, of size $n'$, we measure only $x_i$: in the
second, of size $n$, we measure both $x_i$ and $y_{i\alpha}$. The estimate of $\bar Y$ is

$$\bar y_{lr} = \bar y + b(\bar x' - \bar x)$$

where $\bar x'$, $\bar x$ are the means of $x_i$ in the first and second samples,
respectively, and $b$ is the least squares regression coefficient of $y_{i\alpha}$ on $x_i$,
computed from the second sample.
We now examine the error of estimate $(\bar y_{lr} - \bar Y)$. From (12.17)
we find

$$\bar y = \bar Y + B(\bar x - \bar X) + \bar e \qquad (12.18)$$

$$b = \frac{\sum_{i=1}^{n}(y_{i\alpha} - \bar y)(x_i - \bar x)}{\sum_{i=1}^{n}(x_i - \bar x)^2} = B + \frac{\sum_{i=1}^{n}e_{i\alpha}(x_i - \bar x)}{\sum_{i=1}^{n}(x_i - \bar x)^2} \qquad (12.19)$$

From (12.18) and (12.19), substitute for $\bar y$ and $b$ in the error of
estimate. This gives

$$\bar y_{lr} - \bar Y = (\bar y - \bar Y) + b(\bar x' - \bar x)$$
$$= B(\bar x - \bar X) + \bar e + B(\bar x' - \bar x) + (\bar x' - \bar x)\frac{\sum e_{i\alpha}(x_i - \bar x)}{\sum(x_i - \bar x)^2}$$
$$= \bar e + (\bar x' - \bar x)\frac{\sum e_{i\alpha}(x_i - \bar x)}{\sum(x_i - \bar x)^2} + B(\bar x' - \bar X) \qquad (12.20)$$
In ordinary regression theory, in which $\bar x' = \bar X$, the standard
practice is to discuss the conditional frequency distribution of the error of
estimate $(\bar y_{lr} - \bar Y)$ in repeated samples in which the $x_i$ values are
fixed. If this approach is adopted in the present problem, keeping
the $x_i$ values fixed in both the first and the second samples, we see
that the estimate is biased in the conditional distribution, since

$$E(\bar y_{lr} - \bar Y) = B(\bar x' - \bar X)$$

If the bias is not too large, we have seen (section 1.5) that it may be
taken into account by adding its square to the variance of $\bar y_{lr}$. Hence
we may regard the conditional variance $V_c$ of $\bar y_{lr}$ as

$$V_c(\bar y_{lr}) = S_y^2(1 - \rho^2)\left[\frac{1}{n} + \frac{(\bar x' - \bar x)^2}{\sum(x_i - \bar x)^2}\right] + B^2(\bar x' - \bar X)^2 \qquad (12.21)$$
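$$V(\bar y_{lr}) = S_y^2(1 - \rho^2)\left[\frac{1}{n} + \left(\frac{1}{n} - \frac{1}{n'}\right)\frac{1}{(n - 3)}\right] + \frac{B^2S_x^2}{n'} \qquad (12.22)$$

$$= \frac{S_y^2(1 - \rho^2)}{n}\left[1 + \frac{(n' - n)}{n'}\frac{1}{(n - 3)}\right] + \frac{\rho^2S_y^2}{n'} \qquad (12.23)$$

since $B^2S_x^2 = \rho^2S_y^2$.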
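A minimal sketch of the regression estimate and of formula (12.23) is given below. The function names and the tiny data set are hypothetical; the variance function takes $S_y^2$ and $\rho$ as known inputs, which in practice would have to be estimated.

```python
def regression_estimate(x_first, x_second, y_second):
    """Double-sampling regression estimate ybar_lr = ybar + b * (xbar' - xbar),
    where b is the least squares slope computed from the second sample."""
    n = len(x_second)
    xbar_first = sum(x_first) / len(x_first)       # xbar', large sample
    xbar = sum(x_second) / n
    ybar = sum(y_second) / n
    sxx = sum((x - xbar) ** 2 for x in x_second)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(x_second, y_second))
    b = sxy / sxx
    return ybar + b * (xbar_first - xbar), b


def variance_12_23(S2_y, rho, n, n_first):
    """Formula (12.23), for a second sample of size n drawn from the
    first sample of size n_first (n must exceed 3)."""
    residual = S2_y * (1 - rho ** 2) / n
    return residual * (1 + (n_first - n) / (n_first * (n - 3))) + rho ** 2 * S2_y / n_first


# Hypothetical data in which y = 2x exactly, so the estimate is 2 * xbar'.
estimate, slope = regression_estimate(
    x_first=[1, 2, 3, 4, 5, 6, 7, 8], x_second=[1, 4, 7], y_second=[2, 8, 14])
print(estimate, slope)
```

When $n' = n$ the formula collapses to $S_y^2/n$, as it should when the first sample adds no information.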
If the $x_i$ are not normally distributed, the only term whose value is
changed is that in $1/(n - 3)$, as discussed in section 7.3. As regards
assumption ii, the small sample might not be drawn at random from
the large sample: it is preferable to select the small sample so as to
obtain a wide spread in the values of $x_i$ and hence reduce the sampling
error of $b$. The effect is to reduce, perhaps considerably, the term in
$1/(n - 3)$.
In some applications the second sample is drawn independently of
the first. In this event the argument given in this section remains
unchanged down to equation (12.21). In equation (12.22) the term

$$\left(\frac{1}{n} - \frac{1}{n'}\right)$$

is replaced by

$$\left(\frac{1}{n} + \frac{1}{n'}\right)$$

This case of two independent samples was first considered by Chameli
Bose (1943).
$$V_{opt} = \frac{\left(\sqrt{V_nc_n} + \sqrt{V_{n'}c_{n'}}\right)^2}{C}$$

$$\frac{c_n}{c_{n'}} > \frac{\rho^2}{\left(1 - \sqrt{1 - \rho^2}\right)^2} = \frac{\left(1 + \sqrt{1 - \rho^2}\right)^2}{\rho^2} \qquad (12.27)$$

or

$$\rho^2 > \frac{4c_nc_{n'}}{(c_n + c_{n'})^2} \qquad (12.28)$$

Equation (12.27) shows that, for a given value of $\rho$, the ratio of the
cost per unit in the second sample to the cost per unit in the first
sample must exceed a critical value before double sampling brings an
[Figure 12.1: horizontal axis, $\rho$, the correlation between $y_i$ and $x_i$; vertical axis, the cost ratio $c_n/c_{n'}$.]
Figure 12.1 Relation between $c_n/c_{n'}$ and $\rho$ for three fixed values of the relative
precision of double and single sampling.
Curve I: double and single sampling equally precise.
Curve II: double sampling gives 25 per cent increase in precision.
Curve III: double sampling gives 50 per cent increase in precision.
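The break-even condition between double and single sampling, conditions (12.27) and (12.28), is easy to tabulate. The sketch below is illustrative; the function names are not from the text.

```python
import math


def critical_cost_ratio(rho):
    """Smallest c_n / c_n' at which double sampling with regression
    breaks even with single sampling, from (12.27)."""
    s = math.sqrt(1 - rho ** 2)
    return (1 + s) ** 2 / rho ** 2


def minimum_rho_squared(c_n, c_n_first):
    """Equivalent condition (12.28): rho^2 must exceed this value."""
    return 4 * c_n * c_n_first / (c_n + c_n_first) ** 2


for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(rho, round(critical_cost_ratio(rho), 1))
```

The two functions express the same boundary: a cost ratio equal to the critical value for a given $\rho$ returns that $\rho^2$ from the second function.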
(12.29)
If the second sample is very small and terms in $1/n$ are not negligible,
a suggested estimate of variance is

$$v(\bar y_{lr}) = s_{y.x}^2\left\{\frac{1}{n} + \frac{(\bar x' - \bar x)^2}{\sum(x_i - \bar x)^2}\right\} + \frac{(s_y^2 - s_{y.x}^2)}{n'}$$
12.8 Ratio estimates. If the first sample is used to obtain $\bar x'$ for a
ratio estimate of $\bar Y$, the estimate is

$$\bar y_R = \frac{\bar y}{\bar x}\bar x' \qquad (12.30)$$

$$\bar y_R - \bar Y = \left(\frac{\bar y}{\bar x}\bar X - \bar Y\right) + \frac{\bar y}{\bar x}(\bar x' - \bar X)$$
$$= \frac{\bar X}{\bar x}(\bar y - R\bar x) + \frac{\bar y}{\bar x}(\bar x' - \bar X)$$

The first component is the error of the ordinary ratio estimate
(section 6.3). In obtaining the approximate error variance in section 6.3,
we replaced the factor $\bar X/\bar x$ by unity in this term. To the same order
of approximation, we replace the factor $\bar y/\bar x$ in the second component
by the population ratio $R = \bar Y/\bar X$. Thus

$$\bar y_R - \bar Y \doteq (\bar y - R\bar x) + R(\bar x' - \bar X) \qquad (12.31)$$

$$V(\bar y_R) = \frac{S_y^2 - 2RS_{yx} + R^2S_x^2}{n} + \frac{2RS_{yx} - R^2S_x^2}{n'} \qquad (12.33)$$

Note that formulas (12.32) and (12.33) are both of the form

$$V(\bar y_R) = \frac{V_n}{n} + \frac{V_{n'}}{n'}$$

Hence the optimum choices of $n$ and $n'$, and the minimum variance for
comparison with single sampling, are found by the same procedure as
for stratification and regression estimates. Details will not be given.
For sample estimates of variance, the quantities $s_y^2$, $s_{yx}$, $s_x^2$, and
$\hat R$ may be substituted in (12.32) and (12.33). The resulting estimates
$v(\bar y_R)$ are not unbiased, but appear to be adequate to the order of
approximation presented in the analysis.
Estimate Variance
Unmatched:
$$\frac{u}{n} = \frac{1}{1 + \sqrt{1 - \rho^2}}: \qquad \frac{m}{n} = \frac{\sqrt{1 - \rho^2}}{1 + \sqrt{1 - \rho^2}} \qquad (12.36)$$

Table 12.2 shows for a series of values of $\rho$ the optimum per cent
which should be matched and the relative gain in precision as compared
with no matching. The best percentage to match never exceeds 50
per cent and decreases steadily as p increases. When p = 1, the for-
mula suggests m = 0, which lies outside the range of our assumptions,
since m has been assumed reasonably large. The correct procedure in
this case is to take m = 2. The two matched units are sufficient to
determine the regression line exactly.
The greatest attainable gain in precision is 100 per cent, when p = 1.
Unless p is high, the gains are modest.
                                                 % gain with
$\rho$    Optimum        % gain in        $m/n = 1/3$    $m/n = 1/4$
          % matched      precision
0.5          46              7                 7              6
0.6          44             11                11              9
0.7          42             17                17             15
0.8          38             25                25             23
0.9          30             39                39             39
0.95         24             52                50             52
1.0           0            100                67             75
$$V(\bar y_{lr}) \doteq \frac{S_y^2(1 - \rho^2)}{n} + \frac{B^2S_x^2}{n'}$$

The translations needed to the present notation are:
When these substitutions are made, the formula shown in table 12.3
is obtained.
On the hth occasion, the two estimates remain as in table 12.3 if
the subscript 3 is replaced by $h$ and the subscript 2 by $(h - 1)$.
We now find the optimum values of $m$ and $u$ on any occasion. It
will be shown that the optimum $m$ increases steadily on the successive
occasions, and rather rapidly approaches a limiting value of $n/2$.
Weighting as before inversely as the variance, the best estimate of
$\bar Y_h$ is
(12.38)
where
Since

$$V(\bar y_h') = \frac{1}{W_{hu} + W_{hm}} \qquad (12.39)$$

we can find the optimum $m$ on occasion $h$ by maximizing
$(W_{hu} + W_{hm})$.
It is helpful to write

$$V(\bar y_h') = g_h\frac{S^2}{n} \qquad (12.40)$$

Since $V(\bar y_1') = S^2/n$, the quantity $g_h$ is the ratio of the variance on
occasion $h$ to that on the first occasion. If the successive estimates
become steadily more precise, $g_h$ will be a decreasing function of $h$.
Now, from table 12.3, with $h$ in place of 3,

$$\frac{1}{W_{hu}} = \frac{S^2}{u}: \qquad \frac{1}{W_{hm}} = \frac{S^2(1 - \rho^2)}{m} + \frac{S^2\rho^2g_{h-1}}{n}$$

This gives

$$W_{hu} + W_{hm} = \frac{1}{S^2}\left[u + \frac{1}{\dfrac{1 - \rho^2}{m} + \dfrac{\rho^2g_{h-1}}{n}}\right] \qquad (12.41)$$

Hence

$$S^2(W_{hu} + W_{hm}) = (n - m) + \frac{1}{\dfrac{1 - \rho^2}{m} + \dfrac{\rho^2g_{h-1}}{n}}$$

By differentiation with respect to $m$, the optimum $m_h$ is found to satisfy
the equation

$$\frac{\sqrt{1 - \rho^2}}{m_h} = \frac{1 - \rho^2}{m_h} + \frac{\rho^2g_{h-1}}{n}$$

This gives

$$\frac{m_h}{n} = \frac{\sqrt{1 - \rho^2}}{g_{h-1}\left(1 + \sqrt{1 - \rho^2}\right)} \qquad (12.42)$$
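$$\frac{1}{g_h} = \frac{S^2}{nV(\bar y_h')} = \frac{S^2}{n}(W_{hu} + W_{hm}) = \frac{1}{n}\left[u_h + \frac{1}{\dfrac{1 - \rho^2}{m_h} + \dfrac{\rho^2g_{h-1}}{n}}\right]$$

from (12.41).
By substituting the optimum $m$ and $u$ from equation (12.42) into
this equation, we obtain a recurrence relation which connects $g_h$ with
$g_{h-1}$. After some algebraic manipulation, the relation is expressible as

$$\frac{1}{g_h} = 1 + \frac{1 - \sqrt{1 - \rho^2}}{g_{h-1}\left(1 + \sqrt{1 - \rho^2}\right)} \qquad (12.43)$$

$$g_\infty = \frac{2\sqrt{1 - \rho^2}}{1 + \sqrt{1 - \rho^2}}$$

Hence the variance of $\bar y_h'$ tends to

$$V(\bar y_\infty') = \frac{S^2}{n}\left(\frac{2\sqrt{1 - \rho^2}}{1 + \sqrt{1 - \rho^2}}\right) \qquad (12.44)$$

$$\frac{m_\infty}{n} = \frac{\sqrt{1 - \rho^2}}{g_\infty\left(1 + \sqrt{1 - \rho^2}\right)} = \frac{1}{2}$$

irrespective of the value of $\rho$.
Table 12.4 shows the optimum percentage matched, $100m_h/n$, as
found from equation (12.42), and the percentage gains in precision,
as computed from equation (12.43), for $\rho$ = 0.7, 0.8, 0.9, and 0.95,
and for a series of values of $h$.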
           Optimum % matched              % gain in precision
 h     $\rho$ = 0.7   0.8   0.9   0.95     $\rho$ = 0.7   0.8   0.9   0.95
 2          42    38    30    24            17    25    39    52
 3          49    47    42    36            19    31    55    80
 4          50    49    47    43            20    33    61    94
 5          50    50    49    46            20    33    63   102
 6          50    50    50    48            20    33    64   106
 ∞          50    50    50    50            20    33    65   110
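The entries of the table can be regenerated by iterating on $g_h$. The sketch below assumes the relations $m_h/n = \sqrt{1-\rho^2}/\{g_{h-1}(1+\sqrt{1-\rho^2})\}$ and $1/g_h = 1 + (1-\sqrt{1-\rho^2})/\{g_{h-1}(1+\sqrt{1-\rho^2})\}$ with $g_1 = 1$, which follow from maximizing (12.41); the function name is hypothetical.

```python
import math


def successive_occasions(rho, last_occasion):
    """For h = 2..last_occasion return (h, optimum % matched, % gain in precision),
    starting from g_1 = 1 on the first occasion."""
    s = math.sqrt(1 - rho ** 2)
    g_prev = 1.0
    rows = []
    for h in range(2, last_occasion + 1):
        matched = s / (g_prev * (1 + s))              # optimum m_h / n
        g = 1 / (1 + (1 - s) / (g_prev * (1 + s)))    # variance ratio g_h
        rows.append((h, 100 * matched, 100 * (1 / g - 1)))
        g_prev = g
    return rows


for h, pct_matched, pct_gain in successive_occasions(0.8, 6):
    print(h, round(pct_matched), round(pct_gain))
```

Running the loop for more occasions shows the matched fraction approaching one-half and the gain approaching its limit, as in the last row of the table.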
12.12 Exercises.
12.1 A population contains $L$ strata of equal size. If $V_{ran}$ denotes the
variance of the mean of a simple random sample, and $V_{st}$, $V_{ds}$ are the
corresponding variances for stratified random sampling with proportional
allocation and for double sampling with stratification, show that, approximately,

$$nV_{ran} \doteq \bar S_h^2 + \frac{\sum(\bar Y_h - \bar Y)^2}{L}$$
$$nV_{st} \doteq \bar S_h^2$$
$$nV_{ds} \doteq \bar S_h^2 + \frac{n}{n'}\frac{\sum(\bar Y_h - \bar Y)^2}{L}$$

where $\bar S_h^2$ is the average variance within strata. ($N$ and $n'$ may both be
assumed large relative to $L$, and the $n_h$ in double sampling may be assumed
equal to $n/L$.)
12.13 References.
BOSE, CHAMELI (1943). Note on the sampling error in the method of double
sampling. Sankhya, 6, 330.
JESSEN, R. J. (1942). Statistical investigation of a sample survey for obtaining
farm facts. Iowa Agr. Exp. Sta. Res. Bull. 304.
NEYMAN, J. (1938). Contribution to the theory of sampling human populations.
Jour. Amer. Stat. Assoc., 33, 101-116.
PATTERSON, H. D. (1950). Sampling on successive occasions with partial
replacement of units. Jour. Roy. Stat. Soc., B12, 241-255.
YATES, F. (1949). Sampling methods for censuses and surveys. Charles Griffin and
Co., London.
CHAPTER 13
The steady decline in the number of fruit trees per grower in the
successive responses is evident, these numbers being 456 for respondents
to the first mailing, 382 in the second mailing, 340 in the third,
and 290 for the refusals to all 3 letters. The total response was poor,
over half the population failing to give data even after 3 attempts.
SOURCES OF ERROR IN SURVEYS 18.2
$$p_1 \pm 2\sqrt{\frac{p_1q_1}{n_1}}$$

where $p_1$ is the sample proportion and the fpc is ignored.
* Occasionally it pays to make no attempt to sample in one stratum. An example
occurs when $\bar Y_2$ is known to be very small. Without any sampling of stratum 2,
we adopt $\bar y_2 = 0$ as the sample estimate in this stratum. Hence the sample estimate
of $\bar Y$ is

$$\hat P_U = W_1\left(p_1 + 2\sqrt{\frac{p_1q_1}{n_1}}\right) + W_2(1) \qquad (13.3)$$
It is easy to verify that these limits are conservative, for the state-
ment
i.e.
(13.4)
Whatever the value of P 2 , the interval (13.4) always includes the
interval
Hence
$$\Pr\{\hat P_L \le P \le \hat P_U\} \ge 0.95$$
Although limits can be found in this way if the percentage $W_2$ of
non-response in the population is known, the limits are distressingly
wide unless $W_2$ is very small. Table 13.2 shows the average limits for
a sample size $n = 1000$ and a series of values of $W_2$ and $p_1$. Since the
limits in equations (13.2) and (13.3) depend on the value of $n_1$ (number
of respondents in the sample), we have taken $n_1 = nW_1$, its average
value, in computing table 13.2.
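A sketch of how such conservative limits might be computed follows. It assumes the lower limit takes the form $W_1(p_1 - 2\sqrt{p_1q_1/n_1}) + W_2(0)$, the mirror image of the upper limit (13.3), and uses $n_1 = nW_1$ as in the text; the function name is hypothetical.

```python
import math


def conservative_limits(p1, n, W2):
    """Conservative 95 per cent limits for P when a known proportion W2
    of the population gives no response (fpc ignored)."""
    W1 = 1 - W2
    n1 = n * W1                                   # average number of respondents
    half_width = 2 * math.sqrt(p1 * (1 - p1) / n1)
    lower = W1 * (p1 - half_width) + W2 * 0.0     # all non-respondents lack the attribute
    upper = W1 * (p1 + half_width) + W2 * 1.0     # all non-respondents have the attribute
    return lower, upper


for W2 in (0.0, 0.05, 0.10, 0.20):
    lower, upper = conservative_limits(p1=0.5, n=1000, W2=W2)
    print(W2, round(100 * lower, 1), round(100 * upper, 1))
```

The width of the interval always exceeds $W_2$, however large the sample, which is why the limits widen so quickly.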
The rapid increase in the width of the confidence interval with
increasing $W_2$ is evident. It is of interest to examine what values of $n$
would be needed to give the same widths of confidence interval if $W_2$
TABLE 13.2 95 PER CENT CONFIDENCE LIMITS FOR P (IN PER CENT)
WHEN n = 1000
For $W_2$ = 10 per cent, 15 per cent, and 20 per cent, the values of $n$
are 155, 90, and 60, respectively. It is evidently worth while to devote
a substantial proportion of the resources to the reduction of
non-response.
It may be objected that the limits in table 13.2 are much too
conservative, since we have supposed that the worst possible cases have
actually happened. Since, moreover, the limits are frequently too
wide to be useful, it is always tempting to make some guesses about
the bounds within which $P_2$ lies and construct much narrower
"confidence" limits for $P$ based on these bounds. There is nothing wrong
with this procedure if the bounds are correct, but we should recognize
that the procedure represents the substitution of guesswork for
objective evidence.
An interesting method of finding sample size when non-response
is present has been given by Birnbaum and Sirken (1950a, 1950b).
The proportion $W_2$ of non-response is assumed known from previous
experience in the particular type of survey. No advance knowledge
of $P_1$, $P_2$, or $P$ is assumed. Thus, if there were no non-response and
13.2 EFFECTS OF NON-RESPONSE
where $t_\alpha$ is the normal deviate corresponding to the risk $\alpha$ that the
error exceeds $d$. With no advance information about $P$, we would
take $P = 0.5$ as the least favorable case, giving

$$n = \frac{t_\alpha^2}{4d^2} \qquad (13.5)$$

By taking the least favorable combination of the bias $W_2(P_1 - P_2)$
and the value of $P_1$, Birnbaum and Sirken show that a value of $n$
which still guarantees an error less than $d$, with risk $\alpha$, is

$$n \doteq \frac{t_\alpha^2}{4d(d - W_2)W_1} - 1 \qquad (13.6)$$

Note that no value of $n$ suffices if $W_2 > d$. If $W_2 = 0$, this equation
reduces to (13.5) apart from the term $-1$, which comes from an
approximation in the analysis. Some values of $n$ given by Birnbaum
and Sirken's method are shown in table 13.3.
TABLE 13.3 SMALLEST VALUE OF n FOR GIVEN LIMIT OF ERROR d, WITH
RISK α = 0.05

$W_2$ (%)      d = 20     d = 15     d = 10     d = 5  (per cent)
   0             24         43         96        384
   2             27         50        122        653
   4             31         60        166       2000
   6             36         75        255
   8             43         99        521
  10             53        142
  15            112
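Formulas (13.5) and (13.6) can be evaluated directly. The sketch below assumes $t_\alpha = 1.96$ for a risk of 0.05 and rounds up to the next whole unit; with those assumptions it reproduces entries such as 653, 2000, 255 and 112. The function name is hypothetical.

```python
import math


def birnbaum_sirken_n(d, W2, t=1.96):
    """Smallest sample size guaranteeing error below d with the stated risk.
    Uses (13.5) when there is no non-response and (13.6) otherwise.
    Returns None when W2 >= d, since no sample size then suffices."""
    if W2 == 0:
        return math.ceil(t ** 2 / (4 * d ** 2))
    if W2 >= d:
        return None
    W1 = 1 - W2
    return math.ceil(t ** 2 / (4 * d * (d - W2) * W1) - 1)


for W2 in (0.0, 0.02, 0.04, 0.06):
    print(W2, [birnbaum_sirken_n(d, W2) for d in (0.20, 0.15, 0.10, 0.05)])
```

The `None` returned when $W_2 \ge d$ corresponds to the blank cells of the table.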
The table tells the same sad story as table 13.2. If we are content
with a crude estimate ($d = 20$), amounts of non-response up to 10
per cent can be handled by doubling the sample size. However, any
sizable percentage of non-response makes it impossible or very costly
to attain a highly guaranteed precision by increasing the sample size
among the respondents.
(13.8)
lU OPTIMUM SAMPLING OF NON-RESPONDENTS
Let $\bar y_1$, $\bar y_{2r}$ be the sample means in the two strata. The subscript $r$
is introduced as a reminder that the sample in the second stratum is of
size $r_2$. As an estimate of the population mean, we take

$$\bar y' = \frac{1}{n}(n_1\bar y_1 + n_2\bar y_{2r}) \qquad (13.9)$$

Note that the second stratum receives a weight $n_2$, although the sample
is only of size $r_2$. This is done in order to obtain an unbiased
estimate.
This procedure is an application of double sampling with stratification.
The first or "large" sample, of size $n$, gives an estimate $n_1/n_2$
of the relative size of the strata. The second or "small" sample is of
size $n_1$ in the first stratum and $r_2$ in the second stratum. Unfortunately,
the variance of $\bar y'$ cannot be derived from the variance formulas
which were given in section 12.2 for double sampling with stratification.
In section 12.2, the sizes $n_h$ in the second sample were assumed
fixed, whereas in the present problem $n_1$ and $r_2$ are random variables.
To find $V(\bar y')$, write

$$\bar y' = \frac{1}{n}(n_1\bar y_1 + n_2\bar y_{2n}) + \frac{n_2}{n}(\bar y_{2r} - \bar y_{2n}) \qquad (13.10)$$
Iiae " from the second stratum, and g, .. is the mean of '" simple random
-.mple of aiM '" from the same stratum. Hence, /07' jiud '" 4114 '"
(N, - ,,) 8,' , (N, - n,) 81'
N
, . ... E(g" - fI, ..)
r, + N
2 n,
where 8l is the variance within the "non-~poD8e" stratum. This
gives
, , ( 1 1) , (ns - '2) , (k - 1)
E('U" - 1)",) ... 8, - - - ... 8, .., 8,
" ~ n,r, ~
aince n, -kr,.
Hence, adding the variances of the two terms in (13.10), we find, for
fixed n"
+ (n,)'
2
(N - n) 8
V(g') ...
N
-
n
-n (k n,- 1) 8,,
(N - n) S' (k - 1) ,
- ' N
n
+ n
,nsS, (13.11)
Since E(n,) ... nW2 , this gives for the expected variance
. (N - n) S' (k - 1) W,
V(g') .. -
N n
+ n
8,' (13.12)
The first term is the variance that would be obtained if all $n_2$ in the
non-response group were sampled. The second term is the increase in
variance from sampling only $r_2$ of the $n_2$.
The quantities n and k are then chosen to minimize average cost
(13.8) for a preassigned value of the expected variance (13.12).
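The two variance formulas can be written as small functions to see how the second term behaves. The sketch below is an illustration under assumed inputs; the function names and the numerical values are hypothetical, and `S2` and `S22` stand for $S^2$ and $S_2^2$.

```python
def conditional_variance(n, n2, k, N, S2, S22):
    """Variance (13.11) of the estimate for a fixed number n2 of non-respondents."""
    return (N - n) / N * S2 / n + (k - 1.0) * n2 * S22 / n ** 2

def expected_variance(n, k, N, S2, W2, S22):
    """Expected variance (13.12), obtained by replacing n2 with n * W2."""
    return (N - n) / N * S2 / n + (k - 1.0) * W2 * S22 / n

# Assumed illustrative values: half the first sample fails to respond.
N, n, W2, S2, S22 = 100_000, 1000, 0.5, 1.0, 1.0

# With k = 1 every non-respondent is followed up and the second term vanishes.
full_follow_up = expected_variance(n, 1.0, N, S2, W2, S22)

# With k = 2 only one non-respondent in two is followed up.
half_follow_up = expected_variance(n, 2.0, N, S2, W2, S22)
print(full_follow_up, half_follow_up)
```

When $n_2$ happens to equal $nW_2$ the two functions agree, which is a convenient consistency check.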
The solutions are:
$$ k_{opt} = \sqrt{\frac{c_2(S^2 - W_2 S_2^2)}{S_2^2(c_0 + c_1 W_1)}} \tag{13.13} $$
$$ n_{opt} = \frac{N[S^2 + (k - 1)W_2 S_2^2]}{NV + S^2} \tag{13.14} $$
where V is the value specified for the variance of the estimated popula-
tion mean.
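A short numerical sketch of (13.13) and (13.14) follows. Several things in it are assumptions rather than statements from the text: the function names are hypothetical; $c_0$, $c_1$, and $c_2$ are read as the unit costs of the first attempt, of processing a response, and of obtaining and processing a follow-up; the value 4.50 for $c_2$ and the choice $S_2^2 = S^2 = 1$ are illustrative inputs only.

```python
import math

def optimum_k(c0, c1, c2, W2, S2, S22):
    """Formula (13.13) for the optimum subsampling ratio k."""
    W1 = 1.0 - W2
    return math.sqrt(c2 * (S2 - W2 * S22) / (S22 * (c0 + c1 * W1)))

def optimum_n(k, N, V, S2, W2, S22):
    """Formula (13.14) for the initial sample size n."""
    return N * (S2 + (k - 1.0) * W2 * S22) / (N * V + S2)

def expected_variance(n, k, N, S2, W2, S22):
    """Expected variance (13.12), used here to check the solution."""
    return (N - n) / N * S2 / n + (k - 1.0) * W2 * S22 / n

# Assumed inputs (illustrative, not taken from the text).
N, S2, S22, W2 = 1_000_000, 1.0, 1.0, 0.5
c0, c1, c2 = 0.10, 0.40, 4.50
# Target: the variance of a simple random sample of 1000 with full response.
V = (N - 1000) / N * S2 / 1000

k = optimum_k(c0, c1, c2, W2, S2, S22)
n = optimum_n(k, N, V, S2, W2, S22)
print(k, n, expected_variance(n, k, N, S2, W2, S22))
```

Substituting the resulting n and k back into (13.12) should return the specified V, which the last line prints for comparison.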
The solutions require a knowledge of $W_2$: often this can be estimated
from previous experience. In addition to $S^2$, whose value must be
estimated in advance in any "sample size" problem, the solutions also
involve $S_2^2$, the variance in the non-response stratum. The value of
$S_2^2$ is naturally harder to predict; it will probably not be the same as
$S^2$. For instance, in surveys made by mail of most kinds of economic
enterprise, the respondents tend to be larger operators, with larger
between-unit variances than the non-respondents.
This technique was first presented for a survey made by mail, fol-
lowed by visits to a subsample of those who do not answer the letters.
With a mail survey, $W_2$ may be large and its value may be difficult to
predict for determining $n_{opt}$. In this event a satisfactory approxima-
tion is to work out the value of $n_{opt}$ for a range of assumed values of
$W_2$ between 0 and a safe upper limit. The maximum $n_{opt}$ in this series
is adopted as the initial sample size n. When the replies to the mail
survey have been received, the value of $n_2$ is known. The variance
formula (13.11) is then solved to find the value of k that gives the de-
sired variance V. The cost for this method is usually only slightly
higher than the optimum cost which would have applied if $W_2$ were
known.
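The two-step procedure just described can be sketched as follows. All names and inputs are hypothetical: the cost values, the grid of assumed $W_2$ values, the observed $n_2$, and the choice $S_2^2 = S^2 = 1$ are assumptions made only to show the mechanics, and `k_for_target_variance` is simply (13.11) solved for k.

```python
import math

def optimum_k(c0, c1, c2, W2, S2, S22):
    """Formula (13.13)."""
    return math.sqrt(c2 * (S2 - W2 * S22) / (S22 * (c0 + c1 * (1.0 - W2))))

def optimum_n(k, N, V, S2, W2, S22):
    """Formula (13.14)."""
    return N * (S2 + (k - 1.0) * W2 * S22) / (N * V + S2)

def conditional_variance(n, n2, k, N, S2, S22):
    """Formula (13.11), for the observed number n2 of non-respondents."""
    return (N - n) / N * S2 / n + (k - 1.0) * n2 * S22 / n ** 2

def k_for_target_variance(n, n2, N, V, S2, S22):
    """Solve (13.11) for the k that attains variance V once n2 is known."""
    first_term = (N - n) / N * S2 / n
    return 1.0 + (V - first_term) * n ** 2 / (n2 * S22)

# Assumed inputs (illustrative only).
N, S2, S22 = 1_000_000, 1.0, 1.0
c0, c1, c2 = 0.10, 0.40, 4.50
V = (N - 1000) / N * S2 / 1000

# Step 1: work out n_opt over a range of assumed W2 and keep the largest.
w2_grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
n_initial = math.ceil(max(
    optimum_n(optimum_k(c0, c1, c2, w, S2, S22), N, V, S2, w, S22)
    for w in w2_grid
))

# Step 2: after the mail replies are in, n2 is known; solve (13.11) for k.
n2_observed = 1000
k = k_for_target_variance(n_initial, n2_observed, N, V, S2, S22)
print(n_initial, k)
```

A k below 1 from the second step would mean that the target variance cannot be met even by following up every non-respondent, so the initial n was too small for the non-response actually encountered.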
Example. This example is condensed from the paper by Hansen
and Hurwitz (1946). The first sample is taken by mail and the re-
sponse rate $W_1$ is expected to be 50 per cent. The precision desired is
that which would be given by a simple random sample of size 1000 if
there were no non-response. The cost of mailing a questionnaire is
10 cents, and the cost of processing the completed questionnaire is
40 cents. To carry out a personal interview costs \$4.10.
How many questionnaires should be sent out, and what percentage
of the non-respondents should be interviewed?
In terms of the cost function (13.8) the unit costs in dollars are as
follows:
There are 3 strata of equal size, with $\pi_h$ = 1, 0.5, and 0.25. An
initial sample size of n = 300 is planned (100 of which would fall in
each stratum, on the average). Owing to the absences from home, the
actual sample sizes $n_h'$ average 100, 50, and 25, respectively.
With the assumed values of $\bar{Y}_h$, the true population mean $\bar{Y}$ is 2.
We have ignored the within-stratum sampling variances, taking $\bar{y}_h
= \bar{Y}_h$. The observed sample total y comes out as 275, and the sample
mean is 275/175 = 1.57. This is negatively biased because the "not-
at-homes" have higher values of $\bar{Y}_h$ than the "at-homes."
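The arithmetic of this illustration can be reproduced in a few lines. The stratum means of 1, 2, and 3 used below are an assumption: they are chosen only because they agree with the stated population mean of 2 and the stated sample total of 275, and the variable names are hypothetical.

```python
# Probabilities of being found at home, one per stratum (from the text).
pi_h = [1.0, 0.5, 0.25]

# Planned sample of 300, falling equally in the three strata on the average.
planned_per_stratum = 100

# ASSUMED stratum means, consistent with a population mean of 2
# and an observed sample total of 275.
Y_h = [1.0, 2.0, 3.0]

# Average numbers actually obtained in each stratum.
obtained = [planned_per_stratum * p for p in pi_h]

sample_total = sum(n_h * y for n_h, y in zip(obtained, Y_h))
sample_mean = sample_total / sum(obtained)
true_mean = sum(Y_h) / len(Y_h)  # strata are of equal size

print(obtained, sample_total, round(sample_mean, 2), true_mean)
```

The unweighted mean falls short of the true mean because the strata that are hardest to find at home carry the largest values and are the most thinly represented.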
The situation is also essentially the same with regression and ratio
estimates. Consider the regression estimate
$$ \bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) $$
where both the $y_i$ and the $x_i$ may be subject to constant biases $\beta_y$ and
$\beta_x$, respectively. Since the least squares estimate b remains unchanged,
and since the bias $\beta_x$ cancels out of the term $(\bar{X} - \bar{x})$, it follows that
$\bar{y}_{lr}$ is subject to a bias $\beta_y$. It is easy to verify that the sample estimate
of $V(\bar{y}_{lr})$ contains no contribution due to the biases.
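The claim about the regression estimate is easy to confirm numerically. The following is a minimal sketch with made-up data; the function name `regression_estimate` is hypothetical, and the population mean of x is assumed to carry the same constant bias as the sample values, which is what allows the bias to cancel.

```python
def regression_estimate(y, x, X_bar):
    """Return (ybar + b * (X_bar - xbar), b) with b the least squares slope."""
    n = len(y)
    ybar = sum(y) / n
    xbar = sum(x) / n
    sxy = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    return ybar + b * (X_bar - xbar), b

# Hypothetical sample and population mean of x.
y = [3.0, 5.0, 4.0, 8.0, 9.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0]
X_bar = 3.5

# Constant biases added to every measurement (illustrative values).
bias_y, bias_x = 0.7, -0.4
y_biased = [yi + bias_y for yi in y]
x_biased = [xi + bias_x for xi in x]

clean, b_clean = regression_estimate(y, x, X_bar)
biased, b_biased = regression_estimate(y_biased, x_biased, X_bar + bias_x)

# The slope is unchanged and the estimate shifts by the bias in y alone.
print(b_clean, b_biased, biased - clean)
```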
With the ratio estimate
$$ \bar{y}_R = \frac{\bar{y}}{\bar{x}}\bar{X} $$
E ('fiR - 1'')2
$$ E\,E\,\epsilon_{ia} = 0 $$
where the inner E denotes an average over all measurements of the ith
unit and the outer E a subsequent average over all units. By hypothesis,
$\epsilon_{ia}$ may be correlated with $\eta_i$, and it may also be correlated
from unit to unit.
Now it happens that, if the errors $\epsilon_{ia}$, $\epsilon_{jb}$ on any two different units
in the sample are independent, in the probability sense, the sampling
error formulas given in previous chapters remain valid, provided that
the population can be regarded as infinite.
This is easily seen for simple random sampling. The independent
drawing of successive units guarantees that the successive values of $\eta_i$
in a sample are mutually independent. It does not guarantee that the
values of the $\epsilon_{ia}$ are mutually independent, but we have assumed that
this is so. Hence, the $y_{ia}$ are independent on successive units, and the
ordinary theory for random sampling from an infinite population is
applicable to the $y_{ia}$.
13.8 EFFECTS OF INDEPENDENT COMPONENTS
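In particular, if $E(\eta_i) = R$ is the true population mean,
$$ V(\bar{y}) = E(\bar{y} - R)^2 = \frac{\sigma_y^2}{n} = \frac{1}{n}(\sigma_\eta^2 + \sigma_\epsilon^2 + 2\rho_{\eta\epsilon}\sigma_\eta\sigma_\epsilon) \tag{13.17} $$
where
$$ \sigma_\eta^2 = E(\eta_i - R)^2, \qquad \sigma_\epsilon^2 = E\,E(\epsilon_{ia}^2) $$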
Further, since the $y_{ia}$ are independent members of a simple random
sample from an infinite population,
$$ v(\bar{y}) = \frac{1}{n}\frac{\sum (y_{ia} - \bar{y})^2}{(n - 1)} \tag{13.18} $$
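$$ S_\eta^2 = \frac{1}{(N - 1)}\sum_{i=1}^{N}(\eta_i - R)^2 $$
$$ \sigma_\epsilon^2 = \frac{1}{N}\sum_{i=1}^{N}E(\epsilon_{ia}^2) $$
$$ \rho_{\eta\epsilon}S_\eta\sigma_\epsilon = \frac{1}{(N - 1)}\sum_{i=1}^{N}\{E\,\epsilon_{ia}(\eta_i - R)\} $$
Note the use of the divisor N in defining $\sigma_\epsilon^2$; this helps to keep the
results simple.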
(13.21)
For the cross-product, $E\,\bar{\epsilon}(\bar{\eta} - R)$, let us suppose first that there is
a fixed error of measurement $\epsilon_{ia'}$ associated with the ith unit. Then,
by the ordinary theory for simple random samples from a finite popu-
lation (theorem 2.3),
$$ E\,\bar{\epsilon}'(\bar{\eta} - R) = \frac{(N - n)}{Nn}\frac{1}{(N - 1)}\sum_{i=1}^{N}\epsilon_{ia'}(\eta_i - R) $$
where this average is over all simple random samples with this fixed
set of errors of measurement. Taking the average over all possible
sets of these errors, we have, by the definition of $\rho_{\eta\epsilon}$,
$$ E\,\bar{\epsilon}(\bar{\eta} - R) = \frac{(N - n)}{Nn}\rho_{\eta\epsilon}S_\eta\sigma_\epsilon \tag{13.22} $$
Finally, from (13.20), (13.21), and (13.22),
$$ V(\bar{y}) = \frac{(N - n)}{Nn}\{S_\eta^2 + 2\rho_{\eta\epsilon}S_\eta\sigma_\epsilon\} + \frac{\sigma_\epsilon^2}{n} \tag{13.23} $$
With this model, the mean square deviation from the sample mean,
$$ s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 1} $$
can be shown to be an unbiased estimate of
$$ S_y^2 = S_\eta^2 + 2\rho_{\eta\epsilon}S_\eta\sigma_\epsilon + \sigma_\epsilon^2 $$
The usual formula (section 2.6) for the estimated variance of $\bar{y}$ is
$$ v(\bar{y}) = \frac{(N - n)}{N}\frac{s^2}{n} $$
Hence
$$ E\,v(\bar{y}) = \frac{(N - n)}{Nn}(S_\eta^2 + 2\rho_{\eta\epsilon}S_\eta\sigma_\epsilon + \sigma_\epsilon^2) $$
By comparison with (13.23) it follows that $v(\bar{y})$ has a negative bias
which amounts to $\sigma_\epsilon^2/N$ and will usually be small. If the fpc is omitted
from $v(\bar{y})$, we obtain an overestimate. An unbiased estimate cannot
be constructed without knowledge of $\sigma_\epsilon$.
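The size of this bias can be checked by evaluating (13.23) and the expected value of the usual estimate side by side. The sketch below uses assumed illustrative parameter values; the function names are hypothetical, and `cross` stands for the product $\rho_{\eta\epsilon}S_\eta\sigma_\epsilon$.

```python
def true_variance(n, N, S_eta2, cross, sigma_e2):
    """Formula (13.23): the fpc applies to all terms except sigma_e^2 / n."""
    return (N - n) / (N * n) * (S_eta2 + 2.0 * cross) + sigma_e2 / n

def expected_usual_estimate(n, N, S_eta2, cross, sigma_e2):
    """Expected value of v(ybar) = (N - n) s^2 / (N n)."""
    return (N - n) / (N * n) * (S_eta2 + 2.0 * cross + sigma_e2)

def expected_estimate_without_fpc(n, S_eta2, cross, sigma_e2):
    """Expected value of s^2 / n, the usual estimate with the fpc omitted."""
    return (S_eta2 + 2.0 * cross + sigma_e2) / n

# Assumed illustrative values.
n, N = 100, 1000
S_eta2, cross, sigma_e2 = 4.0, 0.5, 1.0

V = true_variance(n, N, S_eta2, cross, sigma_e2)
Ev = expected_usual_estimate(n, N, S_eta2, cross, sigma_e2)
Ev_no_fpc = expected_estimate_without_fpc(n, S_eta2, cross, sigma_e2)

# The shortfall of the usual estimate equals sigma_e^2 / N.
print(V - Ev, sigma_e2 / N, Ev_no_fpc - V)
```

With these inputs the estimate without the fpc exceeds the true variance; in general that holds whenever $S_\eta^2 + 2\rho_{\eta\epsilon}S_\eta\sigma_\epsilon$ is positive.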
(13.24)
_ 1 2
Ev(y) = - [CTq
n
+ 2Pq.CT~CT. + CT.2 {I - .II•• }J (13.26)
putting n = mk.
If, as before, the variance of a single measurement of a unit by an
interviewer chosen at random is denoted by
$$ \sigma_y^2 = \sigma_\eta^2 + 2\rho_{\eta\epsilon}\sigma_\eta\sigma_\epsilon + \sigma_\epsilon^2 $$
then (13.27) may be written
(13.28)
(13.30)
(13.31)
If n and k are chosen to minimize V(y) for fixed C, we find that the
optimum size of group (number of interviews per interviewer) is
13.11 Summary. From the point of view of their effects on the for-
mulas given in previous chapters, the additional sources of error may
be classified as follows :
i. Errors of measurement that are independent from unit to unit
and average to zero over the whole population are properly taken into
account in the usual formulas for computing the standard errors of the
estimates, provided that the fpc is negligible. Such errors do, of
course, decrease the precision, and it is worth while to learn something
about their magnitude in order to find out whether the decrease is
serious.
ii. With non-response, the usual formulas for the standard errors, as
computed from the units that were measured, are likely to be under-
estimates since they ignore the bias due to differences between respond-
ents and non-respondents. The sampler has no excuse for not being
aware of this problem: a complete record of all non-response, with
reasons, is an essential part of good practice. If non-response can be
reduced by expenditure of greater effort on a certain segment of the
population, the method of Hansen and Hurwitz (1946) shows how to
allocate resources to this segment.
iii. If errors of measurement are correlated from unit to unit, the
usual formulas for the standard errors are biased. The standard errors
are likely to be too small, since the correlations appear to be mostly
positive in practice. This type of disturbance is harder to detect and
has probably often passed unnoticed. The device of interpenetrating
subsamples gives a measure of the magnitude of the effect as well as
unbiased estimates of the real error variances (a constant bias
excepted).
iv. A constant bias that affects all units alike is hardest of all to
detect. No manipulations of the sample data will reveal this bias.
There is much work to be done on these problems. Perhaps the most
urgent need is for the accumulation of data on the nature and size of
errors of measurement in sample surveys. In many cases it will be
found that the usual formulas and techniques are disturbed to only a
minor degree. In others it will become evident, as has happened in
some types of study, that random sampling errors are the least of our
troubles, and that precise estimates are unattainable until a drastic
reduction in one of the other sources of error is made. More needs to
be learned, also, about what can be accomplished by good training and
supervision and by rechecks of the units from more experienced per-
sonnel.
13.12 References.
BIRNBAUM, Z. W., and SIRKEN, M. G. (1950a). Bias due to non-availability in
sampling surveys. Jour. Amer. Stat. Assoc., 45, 98-111.
BIRNBAUM, Z. W., and SIRKEN, M. G. (1950b). On the total error due to non-
interview and to random sampling. Int. Jour. Opinion and Attitude Res., 4,
179-191.
FINKNER, A. L. (1950). Methods of sampling for estimating commercial peach
production in North Carolina. North Carolina Agr. Exp. Sta. Tech. Bull. 91.
HANSEN, M. H., and HURWITZ, W. N. (1946). The problem of non-response in
sample surveys. Jour. Amer. Stat. Assoc., 41, 517-529.
HANSEN, M. H., et al. (1951). Response errors in surveys. Jour. Amer. Stat.
Assoc., 46, 147-190.
LIENAU, C. C. (1941). Selection, training and performance of the National Health
Survey field staff. Amer. Jour. Hygiene, 34, 110-132.
MAHALANOBIS, P. C. (1946). Recent experiments in statistical sampling in the
Indian Statistical Institute. Jour. Roy. Stat. Soc., 109, 325-370.
POLITZ, A. N., and SIMMONS, W. R. (1949, 1950). An attempt to get the "not at
homes" into the sample without callbacks. Jour. Amer. Stat. Assoc., 44, 9-31,
and 45, 136-137.
ANSWERS TO EXERCISES
u 1064, 1336.
u Nearly conclusive.
u (i) 76.2 ± 3.6 per cent; (ii) 1738 ± 280 families.
a.7 Au - 13.
a.8 Average size of sample - m/P.
1i.7 (i) Gain in precision is about 110 per cent. (ii) Gain from proportional
stratification over simple random sampling is about 90 per cent.
1i.8 (i) 3.733; (ii) 1.111; (iii) 8.222.
'1.1 $V(\bar{y}_{lr})$ = 1.03, $V(\bar{y}_R)$ = 10.3 (one of the samples gives a very poor
estimate with the ratio method). Values of B/σ are 0.32 and 0.27, respectively,
where σ is taken about $\bar{Y}$.
'1.2 r lr - 28,177 ± 570. RP - 113 per cent.
'1.a 27,751 ± 694.
'1.' V(rlr.) - 34.5; V(r lrc) - 10.3.
8.1 Variances are 8.19 (systematic), 11.27 (simple random), 8.25 (stratified, 2),
U6 (stratified, 1).
' .1 No: variance is 8.78 with end corrections.
8.1 V", - 0.00141; V,. .. - 0.00340.
11.1 Relative net precisions are 111, 125, and 128, respectively, for the last three
types of unit relative to the first.
II.. Relative precision of the household is 211 per cent for the sex ratio and 38
per cent for the proportion who had seen a doctor.
lI.a Relative precision of the large unit is 0.566 with simple random sampling
and 0.625 with stratified random sampling.
11.1 (i) If the standard deviation among large units in cl&llll h ex: M". (ii) If
probability ex: ~.
10.1 2.
10.1 Loss in precision is about 8 per cent.
11.1 Contributions to variance from

Method    Within units    Between units    Bias     Total
Ia        0.356           0.250            0.010    0.616
II        0.468           0.640                     1.108
III       0.404           0.240                     0.644
IV        0.386           0.414                     0.800
V         0.350           0.248            0.002    0.600

11.a Total variance: 0.00482 (Ia), 0.02337 (II), 0.00554 (III).
11.1 Exact variance is
V(JIrI) -
nll'.l-
t
~ 1-1 [«NN -- I'll» (Y; - y)2 + M;(M; - m.) s.t]
m;
The variance formula deduced from theorem 11.2 is the same except that the factor
(N - n)/(N - 1) inside the bracket is-replaced by 1.
11.11 To compute II(Vu), use formula (11.29) in theorem 11.4, with .I; - liN.
This give.
N'I,
II(JIU) - n(n _ 1)M'l