Cluster Sampling
It is one of the basic assumptions in any sampling procedure that the population can be divided into a finite
number of distinct and identifiable units, called sampling units. The smallest units into which the
population can be divided are called elements of the population. The groups of such elements are called
clusters.
In many practical situations and many types of populations, a list of elements is not available and so the use
of an element as a sampling unit is not feasible. The method of cluster sampling or area sampling can be
used in such situations.
In cluster sampling:
- Divide the whole population into clusters according to some well-defined rule.
- Treat the clusters as sampling units.
- Choose a sample of clusters according to some procedure.
- Carry out a complete enumeration of the selected clusters, i.e., collect information on all the sampling units available in the selected clusters.
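The four steps above can be sketched in a few lines of code. The following is a minimal illustration, assuming an artificial population of 100 elements grouped into clusters of 10 by a simple blocking rule; all numbers are made up for the example.

```python
# A minimal sketch of the cluster-sampling steps: partition the population
# into clusters, draw a SRS of clusters, and enumerate the selected
# clusters completely (population and sizes are illustrative).
import random

population = list(range(100))           # 100 elements
M = 10                                  # cluster size (rule: blocks of 10)
clusters = [population[i:i + M] for i in range(0, len(population), M)]

random.seed(1)
selected = random.sample(clusters, 3)   # SRSWOR of n = 3 clusters
observations = [x for c in selected for x in c]   # complete enumeration
print(len(observations))                # 30 elements observed in total
```

Every element of the selected clusters is observed; no element outside the selected clusters is touched.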
Area sampling
If the entire area containing the population is subdivided into smaller area segments and each element in the population is associated with one and only one such area segment, the procedure is called area sampling.
Examples:
In a city, a list of all the individual persons staying in the houses may be difficult to obtain or may not even be available, but a list of all the houses in the city may be available. So every individual person is treated as a sampling unit and every house is a cluster.
A list of all the agricultural farms in a village or a district may not be easily available, but the list of villages or districts is generally available. In this case, every farm is a sampling unit and every village or district is a cluster.
In both examples, we draw a sample of clusters (houses or villages) and then collect observations on all the sampling units available in the selected clusters.
In a closed segment, the sum of the characteristic under study, i.e., area, livestock etc. for all the elements
associated with the segment will account for all the area, livestock etc. within the segment.
Construction of clusters:
The clusters are constructed such that the sampling units are heterogeneous within the clusters and homogeneous among the clusters. The reason for this will become clear later. This is the opposite of the construction of the strata in stratified sampling.
There are two options for constructing the clusters: equal size and unequal size. We discuss the estimation of the population mean and its variance in both cases.
Case of equal clusters:
Suppose the population consists of $N$ clusters, each containing $M$ elements. Let
$y_{ij}$: value of the characteristic under study for the $j$th element $(j = 1, 2, \ldots, M)$ in the $i$th cluster $(i = 1, 2, \ldots, N)$,
$\bar{y}_i = \frac{1}{M}\sum_{j=1}^{M} y_{ij}$: mean per element of the $i$th cluster.
Select $n$ clusters out of the $N$ clusters by SRS and estimate the population mean by the mean of the sampled cluster means,
$$\bar{y}_{cl} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i.$$
Bias:
$$E(\bar{y}_{cl}) = \frac{1}{n}\sum_{i=1}^{n} E(\bar{y}_i) = \frac{1}{n}\sum_{i=1}^{n}\bar{Y} \quad \text{(since SRS is used)} = \bar{Y}.$$
Thus $\bar{y}_{cl}$ is an unbiased estimator of $\bar{Y}$.
Variance:
The variance of $\bar{y}_{cl}$ can be derived on the same lines as the variance of the sample mean in SRSWOR. The only difference is that in SRSWOR the sampling units are the $y$'s themselves, whereas in the case of $\bar{y}_{cl}$ the sampling units are the cluster means $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_N$. Note that in the case of SRSWOR, $Var(\bar{y}) = \frac{N-n}{Nn}S^2$ and $\widehat{Var}(\bar{y}) = \frac{N-n}{Nn}s^2$. So
$$Var(\bar{y}_{cl}) = \frac{N-n}{Nn}S_b^2, \qquad \widehat{Var}(\bar{y}_{cl}) = \frac{N-n}{Nn}s_b^2,$$
where $S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}(\bar{y}_i - \bar{Y})^2$ is the mean sum of squares between cluster means in the population and $s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_i - \bar{y}_{cl})^2$ is the mean sum of squares between cluster means in the sample.
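As a numerical check of these two formulas, one can enumerate every possible SRSWOR sample of $n$ clusters in a small artificial population and compare the exact mean and variance of $\bar{y}_{cl}$ with $\bar{Y}$ and $\frac{N-n}{Nn}S_b^2$. The cluster values below are arbitrary illustrative numbers.

```python
# Exact check of unbiasedness and the variance formula for the
# equal-cluster estimator y_cl, by enumerating all SRSWOR samples.
from itertools import combinations

# Artificial population: N = 4 clusters, each of M = 3 elements.
y = [[2.0, 4.0, 6.0],
     [1.0, 3.0, 5.0],
     [8.0, 9.0, 10.0],
     [3.0, 3.0, 6.0]]
N, M, n = len(y), len(y[0]), 2

ybar = [sum(c) / M for c in y]                 # cluster means
Ybar = sum(ybar) / N                           # population mean
Sb2 = sum((m - Ybar) ** 2 for m in ybar) / (N - 1)

# Enumerate every SRSWOR sample of n clusters and compute y_cl for each.
samples = [sum(ybar[i] for i in s) / n for s in combinations(range(N), n)]
E_ycl = sum(samples) / len(samples)
Var_ycl = sum((v - E_ycl) ** 2 for v in samples) / len(samples)

print(E_ycl, Ybar)                        # E(y_cl) equals the population mean
print(Var_ycl, (N - n) / (N * n) * Sb2)   # matches (N-n)/(Nn) * Sb^2
```

Because all samples are equally likely under SRSWOR, the average over the enumeration is the exact expectation, so the agreement is to machine precision.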
The total sum of squares decomposes as
$$\sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{Y})^2 = \sum_{i=1}^{N}\sum_{j=1}^{M}\left[(y_{ij}-\bar{y}_i)+(\bar{y}_i-\bar{Y})\right]^2 = \sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{y}_i)^2 + M\sum_{i=1}^{N}(\bar{y}_i-\bar{Y})^2 = N(M-1)S_w^2 + M(N-1)S_b^2,$$
where
$$S_w^2 = \frac{1}{N}\sum_{i=1}^{N}S_i^2$$
is the mean sum of squares within clusters in the population and
$$S_i^2 = \frac{1}{M-1}\sum_{j=1}^{M}(y_{ij}-\bar{y}_i)^2$$
is the mean sum of squares for the $i$th cluster.
Since $Var(\bar{y}_{cl}) = \frac{N-n}{Nn}S_b^2$, cluster sampling will be efficient if the clusters are so formed that the variation between the cluster means is as small as possible, while the variation within the clusters is as large as possible.
The efficiency of cluster sampling can be expressed through the intraclass correlation coefficient between elements within a cluster,
$$\rho = \frac{E\left[(y_{ij}-\bar{Y})(y_{ik}-\bar{Y})\right]}{E\left[(y_{ij}-\bar{Y})^2\right]} = \frac{\dfrac{1}{MN(M-1)}\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{k(\ne j)=1}^{M}(y_{ij}-\bar{Y})(y_{ik}-\bar{Y})}{\dfrac{1}{MN}\sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{Y})^2} = \frac{\sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{k(\ne j)=1}^{M}(y_{ij}-\bar{Y})(y_{ik}-\bar{Y})}{(MN-1)(M-1)S^2},$$
using $\frac{1}{MN}\sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{Y})^2 = \frac{MN-1}{MN}S^2$.
Consider
$$(N-1)S_b^2 = \sum_{i=1}^{N}(\bar{y}_i-\bar{Y})^2 = \frac{1}{M^2}\sum_{i=1}^{N}\left[\sum_{j=1}^{M}(y_{ij}-\bar{Y})\right]^2 = \frac{1}{M^2}\left[\sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{Y})^2 + \sum_{i=1}^{N}\sum_{j=1}^{M}\sum_{k(\ne j)=1}^{M}(y_{ij}-\bar{Y})(y_{ik}-\bar{Y})\right],$$
so that
$$M^2(N-1)S_b^2 = (NM-1)S^2 + \rho(MN-1)(M-1)S^2,$$
or
$$S_b^2 = \frac{MN-1}{M^2(N-1)}\left[1 + \rho(M-1)\right]S^2.$$
The variance of $\bar{y}_{cl}$ now becomes
$$Var(\bar{y}_{cl}) = \frac{N-n}{Nn}S_b^2 = \frac{N-n}{Nn}\cdot\frac{MN-1}{N-1}\cdot\frac{S^2}{M^2}\left[1+(M-1)\rho\right].$$
For large $N$, $\frac{MN-1}{MN}\approx 1$, $N-1\approx N$, $\frac{N-n}{N}\approx 1$, and so
$$Var(\bar{y}_{cl}) \approx \frac{S^2}{nM}\left[1+(M-1)\rho\right].$$
Comparing this with the variance of the mean of a simple random sample of the same number $nM$ of elements, $Var(\bar{y}_{SRS}) \approx \frac{S^2}{nM}$, the relative efficiency of cluster sampling is
$$E = \frac{Var(\bar{y}_{SRS})}{Var(\bar{y}_{cl})} = \frac{1}{1+(M-1)\rho}.$$
If $M = 1$, then $E = 1$, i.e., SRS and cluster sampling are equally efficient: each cluster consists of one unit only.
If $M > 1$, then cluster sampling is more efficient when $E > 1$, i.e., when $(M-1)\rho < 0$, i.e., when $\rho < 0$.
If $\rho = 0$, then $E = 1$, i.e., the two schemes are equally efficient. This happens when the units in each cluster are arranged randomly, so each cluster is heterogeneous.
In practice, $\rho$ is usually positive and decreases as $M$ increases, but the rate of decrease in $\rho$ is much lower than the rate of increase in $M$. The situation $\rho > 0$ arises when nearby units are grouped together to form clusters which are completely enumerated. There are also situations in which $\rho < 0$.
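The identity $S_b^2 = \frac{MN-1}{M^2(N-1)}[1+(M-1)\rho]S^2$ derived above can be verified numerically on a small artificial population; the cluster values below are arbitrary.

```python
# Numerical check of Sb^2 = (MN-1)/(M^2 (N-1)) * [1 + (M-1) rho] * S^2
# on a small artificial population of N = 4 clusters of M = 3 elements.
y = [[2.0, 4.0, 6.0],
     [1.0, 3.0, 5.0],
     [8.0, 9.0, 10.0],
     [3.0, 3.0, 6.0]]
N, M = len(y), len(y[0])

flat = [v for c in y for v in c]
Ybar = sum(flat) / (N * M)
S2 = sum((v - Ybar) ** 2 for v in flat) / (N * M - 1)

ybar = [sum(c) / M for c in y]
Sb2 = sum((m - Ybar) ** 2 for m in ybar) / (N - 1)

# Intraclass correlation: cross-products over distinct pairs within clusters.
num = sum((c[j] - Ybar) * (c[k] - Ybar)
          for c in y for j in range(M) for k in range(M) if k != j)
rho = num / ((M - 1) * (M * N - 1) * S2)

lhs = Sb2
rhs = (M * N - 1) / (M ** 2 * (N - 1)) * (1 + (M - 1) * rho) * S2
print(lhs, rhs)   # the two sides agree
```

The relation is an algebraic identity, so the two sides match to floating-point precision for any population.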
Estimation of efficiency:
Since $\bar{y}_{cl} = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i$ is the mean of $n$ means $\bar{y}_i$ drawn by SRSWOR from a population of $N$ means $\bar{y}_i$, $i = 1, 2, \ldots, N$, it follows from the theory of SRSWOR that
Sampling Theory| Chapter 9 | Cluster Sampling | Shalabh, IIT Kanpur
$$E(s_b^2) = E\left[\frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_i-\bar{y}_{cl})^2\right] = \frac{1}{N-1}\sum_{i=1}^{N}(\bar{y}_i-\bar{Y})^2 = S_b^2.$$
Since $s_w^2 = \frac{1}{n}\sum_{i=1}^{n}S_i^2$ is the mean of $n$ mean sums of squares $S_i^2$ drawn by SRSWOR from the population of $N$ mean sums of squares,
$$E(s_w^2) = E\left[\frac{1}{n}\sum_{i=1}^{n}S_i^2\right] = \frac{1}{n}\sum_{i=1}^{n}E(S_i^2) = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{1}{N}\sum_{i=1}^{N}S_i^2\right] = \frac{1}{N}\sum_{i=1}^{N}S_i^2 = S_w^2.$$
Consider
$$S^2 = \frac{1}{MN-1}\sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{Y})^2,$$
or
$$(MN-1)S^2 = \sum_{i=1}^{N}\sum_{j=1}^{M}\left[(y_{ij}-\bar{y}_i)+(\bar{y}_i-\bar{Y})\right]^2 = \sum_{i=1}^{N}\sum_{j=1}^{M}(y_{ij}-\bar{y}_i)^2 + M\sum_{i=1}^{N}(\bar{y}_i-\bar{Y})^2 = \sum_{i=1}^{N}(M-1)S_i^2 + M(N-1)S_b^2 = N(M-1)S_w^2 + M(N-1)S_b^2.$$
An estimator of $S^2$ is therefore
$$\hat{S}^2 = \frac{N(M-1)s_w^2 + M(N-1)s_b^2}{MN-1}.$$
Thus an estimator of $Var(\bar{y}_{cl}) = \frac{N-n}{Nn}S_b^2$ is
$$\widehat{Var}(\bar{y}_{cl}) = \frac{N-n}{Nn}s_b^2,$$
where $s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_i-\bar{y}_{cl})^2$, and an estimator of the variance of the mean of an equivalent simple random sample of $nM$ elements is
$$\widehat{Var}(\bar{y}_{nM}) = \frac{N-n}{Nn}\cdot\frac{\hat{S}^2}{M}.$$
An estimator of the relative efficiency of cluster sampling is then
$$\hat{E} = \frac{\widehat{Var}(\bar{y}_{nM})}{\widehat{Var}(\bar{y}_{cl})} = \frac{N(M-1)s_w^2 + M(N-1)s_b^2}{M(NM-1)s_b^2}.$$
For large $N$, the efficiency reduces to
$$E = \frac{1}{M} + \frac{M-1}{M}\cdot\frac{S_w^2}{MS_b^2},$$
and its estimate is
$$\hat{E} = \frac{1}{M} + \frac{M-1}{M}\cdot\frac{s_w^2}{Ms_b^2}.$$
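The estimate $\hat{E}$ can be computed directly from a sample of fully enumerated clusters. The sketch below assumes $N$ and $M$ are known and uses made-up data for $n = 3$ sampled clusters.

```python
# Sketch of the efficiency estimate
#   Ê = [N(M-1) s_w^2 + M(N-1) s_b^2] / [M(NM-1) s_b^2]
# from a cluster sample (all numbers are illustrative).
N, M = 40, 5                           # population: 40 clusters of 5 elements
sample = [[12., 15., 11., 14., 13.],   # n = 3 sampled clusters, fully enumerated
          [22., 25., 21., 24., 23.],
          [17., 18., 16., 19., 20.]]
n = len(sample)

ybar = [sum(c) / M for c in sample]    # sampled cluster means
ycl = sum(ybar) / n                    # cluster-sampling estimate of the mean
sb2 = sum((m - ycl) ** 2 for m in ybar) / (n - 1)

# s_w^2: average of the within-cluster mean squares S_i^2
Si2 = [sum((v - m) ** 2 for v in c) / (M - 1) for c, m in zip(sample, ybar)]
sw2 = sum(Si2) / n

E_hat = (N * (M - 1) * sw2 + M * (N - 1) * sb2) / (M * (N * M - 1) * sb2)
print(ycl, sb2, sw2, E_hat)
```

In this artificial sample the between-cluster spread dominates the within-cluster spread, so $\hat{E}$ comes out well below 1, i.e., cluster sampling would be estimated as less efficient than SRS here.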
Estimation of proportion:
Suppose that a sample of $n$ clusters is drawn from $N$ clusters by SRSWOR. Defining $y_{ij} = 1$ if the $j$th unit in the $i$th cluster belongs to the specified category (i.e., possesses the given attribute) and $y_{ij} = 0$ otherwise, we find that
$$\bar{y}_i = P_i,$$
$$\bar{Y} = \frac{1}{N}\sum_{i=1}^{N}P_i = P,$$
$$S_i^2 = \frac{MP_iQ_i}{M-1},$$
$$S_w^2 = \frac{M\sum_{i=1}^{N}P_iQ_i}{N(M-1)},$$
$$S^2 = \frac{NMPQ}{NM-1},$$
$$S_b^2 = \frac{1}{N-1}\left[\sum_{i=1}^{N}P_i^2 - NP^2\right] = \frac{1}{N-1}\left[\sum_{i=1}^{N}P_i - \sum_{i=1}^{N}P_i(1-P_i) - NP^2\right] = \frac{1}{N-1}\left[NPQ - \sum_{i=1}^{N}P_iQ_i\right],$$
where $P_i$ is the proportion of elements in the $i$th cluster belonging to the specified category, $Q_i = 1-P_i$, $i = 1, 2, \ldots, N$, and $Q = 1-P$. Then, using the result that $\bar{y}_{cl}$ is an unbiased estimator of $\bar{Y}$, we find that
$$\hat{P}_{cl} = \frac{1}{n}\sum_{i=1}^{n}P_i$$
is an unbiased estimator of $P$ and
$$Var(\hat{P}_{cl}) = \frac{N-n}{Nn(N-1)}\left[NPQ - \sum_{i=1}^{N}P_iQ_i\right].$$
This variance can also be written as
$$Var(\hat{P}_{cl}) = \frac{N-n}{N-1}\cdot\frac{PQ}{nM}\left[1+(M-1)\rho\right],$$
where, using $(MN-1)S^2 = N(M-1)S_w^2 + M(N-1)S_b^2$,
$$\rho = \frac{(M-1)\left[M(N-1)S_b^2 - NS_w^2\right]}{(M-1)(MN-1)S^2} = \frac{M(N-1)S_b^2 - NS_w^2}{(MN-1)S^2} = 1 - \frac{M}{M-1}\cdot\frac{\sum_{i=1}^{N}P_iQ_i}{NPQ}.$$
An estimator of the variance is
$$\widehat{Var}(\hat{P}_{cl}) = \frac{N-n}{Nn}s_b^2 = \frac{N-n}{Nn(n-1)}\sum_{i=1}^{n}(P_i-\hat{P}_{cl})^2 = \frac{N-n}{Nn(n-1)}\left[n\hat{P}_{cl}\hat{Q}_{cl} - \sum_{i=1}^{n}P_iQ_i\right],$$
where $\hat{Q}_{cl} = 1-\hat{P}_{cl}$. The efficiency of cluster sampling relative to SRSWOR is given by
$$E = \frac{M(N-1)}{MN-1}\cdot\frac{1}{1+(M-1)\rho} = \frac{N-1}{NM-1}\cdot\frac{NPQ}{NPQ - \sum_{i=1}^{N}P_iQ_i}.$$
If $N$ is large, then
$$E \approx \frac{1}{M}\cdot\frac{NPQ}{NPQ - \sum_{i=1}^{N}P_iQ_i}.$$
An estimator of the total number of elements belonging to the specified category is obtained by multiplying $\hat{P}_{cl}$ by $NM$, i.e., $NM\hat{P}_{cl}$. The expressions for its variance and the variance estimator are obtained by multiplying the corresponding expressions for $\hat{P}_{cl}$ by $(NM)^2$.
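The variance formula for $\hat{P}_{cl}$ can be checked exactly by enumerating all SRSWOR samples of $n$ clusters; the cluster proportions below are an arbitrary example.

```python
# Exact check for the proportion estimator: enumerate all SRSWOR samples of
# n clusters and compare Var(P_cl) with (N-n)/(Nn(N-1)) [NPQ - sum P_i Q_i].
from itertools import combinations

P = [0.2, 0.4, 0.6, 0.8]        # cluster proportions P_i (arbitrary example)
N, n = len(P), 2
Pbar = sum(P) / N               # population proportion P
Q = 1 - Pbar
sumPQ = sum(p * (1 - p) for p in P)

samples = [sum(P[i] for i in s) / n for s in combinations(range(N), n)]
mean_est = sum(samples) / len(samples)
var_est = sum((v - mean_est) ** 2 for v in samples) / len(samples)

var_formula = (N - n) / (N * n * (N - 1)) * (N * Pbar * Q - sumPQ)
print(mean_est, var_est, var_formula)   # mean equals P; variances agree
```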
Case of unequal clusters:
Suppose the population consists of $N$ clusters and the $i$th cluster has $M_i$ elements. Let
$$M_0 = \sum_{i=1}^{N}M_i, \qquad \bar{M} = \frac{1}{N}\sum_{i=1}^{N}M_i = \frac{M_0}{N},$$
$$\bar{y}_i = \frac{1}{M_i}\sum_{j=1}^{M_i}y_{ij}: \text{ mean of the } i\text{th cluster},$$
$$\bar{Y} = \frac{1}{M_0}\sum_{i=1}^{N}\sum_{j=1}^{M_i}y_{ij} = \sum_{i=1}^{N}\frac{M_i}{M_0}\bar{y}_i = \frac{1}{N}\sum_{i=1}^{N}\frac{M_i}{\bar{M}}\bar{y}_i.$$
Suppose that n clusters are selected with SRSWOR and all the elements in these selected clusters are
surveyed. Assume that M i ’s (i 1, 2,..., N ) are known.
Based on this scheme, several estimators can be obtained to estimate the population mean. We consider four types of such estimators.
The first estimator is the simple arithmetic mean of the cluster means,
$$\bar{y}_c = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i.$$
This estimator is biased:
$$Bias(\bar{y}_c) = E(\bar{y}_c) - \bar{Y} = \frac{1}{N}\sum_{i=1}^{N}\bar{y}_i - \sum_{i=1}^{N}\frac{M_i}{M_0}\bar{y}_i = \frac{1}{M_0}\left[\frac{M_0}{N}\sum_{i=1}^{N}\bar{y}_i - \sum_{i=1}^{N}M_i\bar{y}_i\right] = -\frac{1}{M_0}\sum_{i=1}^{N}(M_i-\bar{M})(\bar{y}_i-\bar{Y}) = -\frac{N-1}{M_0}S_{my},$$
where $S_{my} = \frac{1}{N-1}\sum_{i=1}^{N}(M_i-\bar{M})(\bar{y}_i-\bar{Y})$. Thus $Bias(\bar{y}_c) = 0$ if $M_i$ and $\bar{y}_i$ are uncorrelated.
The mean squared error is
$$MSE(\bar{y}_c) = \frac{N-n}{Nn}S_b^2 + \left(\frac{N-1}{M_0}\right)^2 S_{my}^2,$$
where
$$S_b^2 = \frac{1}{N-1}\sum_{i=1}^{N}(\bar{y}_i-\bar{Y})^2, \qquad S_{my} = \frac{1}{N-1}\sum_{i=1}^{N}(M_i-\bar{M})(\bar{y}_i-\bar{Y}).$$
An estimate of $Var(\bar{y}_c)$ is
$$\widehat{Var}(\bar{y}_c) = \frac{N-n}{Nn}s_b^2, \qquad \text{where } s_b^2 = \frac{1}{n-1}\sum_{i=1}^{n}(\bar{y}_i-\bar{y}_c)^2.$$
The second estimator weights the cluster means by the cluster sizes,
$$\bar{y}_c^{*} = \frac{1}{n}\sum_{i=1}^{n}\frac{M_i}{\bar{M}}\bar{y}_i.$$
Its expectation is
$$E(\bar{y}_c^{*}) = E\left[\frac{1}{n}\sum_{i=1}^{n}\frac{M_i}{\bar{M}}\bar{y}_i\right] = \frac{1}{N}\sum_{i=1}^{N}\frac{M_i}{\bar{M}}\bar{y}_i = \frac{1}{M_0}\sum_{i=1}^{N}M_i\bar{y}_i = \frac{1}{M_0}\sum_{i=1}^{N}\sum_{j=1}^{M_i}y_{ij} = \bar{Y}.$$
Thus $\bar{y}_c^{*}$ is an unbiased estimator of $\bar{Y}$. The variance of $\bar{y}_c^{*}$ and its estimate are given by
$$Var(\bar{y}_c^{*}) = Var\left[\frac{1}{n}\sum_{i=1}^{n}\frac{M_i}{\bar{M}}\bar{y}_i\right] = \frac{N-n}{Nn}S_b^{*2}, \qquad \widehat{Var}(\bar{y}_c^{*}) = \frac{N-n}{Nn}s_b^{*2},$$
where
$$S_b^{*2} = \frac{1}{N-1}\sum_{i=1}^{N}\left(\frac{M_i}{\bar{M}}\bar{y}_i - \bar{Y}\right)^2, \qquad s_b^{*2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{M_i}{\bar{M}}\bar{y}_i - \bar{y}_c^{*}\right)^2,$$
and $E(s_b^{*2}) = S_b^{*2}$.
Note that the expressions for the variance of $\bar{y}_c^{*}$ and its estimate can be derived directly from the theory of SRSWOR as follows: let $z_i = \frac{M_i}{\bar{M}}\bar{y}_i$; then $\bar{y}_c^{*} = \frac{1}{n}\sum_{i=1}^{n}z_i = \bar{z}$, the sample mean of the $z_i$ under SRSWOR.
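The contrast between the biased mean-of-means estimator $\bar{y}_c$ and the unbiased size-weighted estimator $\bar{y}_c^{*}$ can be seen exactly by enumerating all SRSWOR samples of $n$ clusters from a small artificial population with unequal cluster sizes.

```python
# Comparing the biased mean-of-means estimator y_c with the unbiased
# size-weighted estimator y_c* by enumerating all SRSWOR samples of
# n clusters (artificial unequal clusters).
from itertools import combinations

clusters = [[2., 4.], [1., 3., 5., 7.], [8., 10., 12.], [6.]]  # unequal sizes
N, n = len(clusters), 2
Mi = [len(c) for c in clusters]
M0 = sum(Mi)
Mbar = M0 / N
ybar = [sum(c) / len(c) for c in clusters]
Ybar = sum(v for c in clusters for v in c) / M0   # true population mean

yc, yc_star = [], []
for s in combinations(range(N), n):
    yc.append(sum(ybar[i] for i in s) / n)
    yc_star.append(sum(Mi[i] * ybar[i] / Mbar for i in s) / n)

print(sum(yc) / len(yc), Ybar)            # y_c is biased in general
print(sum(yc_star) / len(yc_star), Ybar)  # y_c* is unbiased
```

Here $E(\bar{y}_c)$ is the unweighted mean of the cluster means, which differs from $\bar{Y}$ because the cluster sizes and cluster means are correlated in this example.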
The third estimator is the ratio of the sampled cluster totals to the sampled cluster sizes,
$$\bar{y}_c^{**} = \frac{\sum_{i=1}^{n}M_i\bar{y}_i}{\sum_{i=1}^{n}M_i}.$$
It is easy to see that this estimator is a biased estimator of the population mean. Before deriving its bias and mean squared error, we note that this estimator can be derived using the philosophy of the ratio method of estimation. To see this, consider the study variable $U_i$ and auxiliary variable $V_i$ as
$$U_i = \frac{M_i\bar{y}_i}{\bar{M}}, \qquad V_i = \frac{M_i}{\bar{M}}, \qquad i = 1, 2, \ldots, N,$$
so that
$$\bar{V} = \frac{1}{N}\sum_{i=1}^{N}V_i = \frac{1}{N}\sum_{i=1}^{N}\frac{M_i}{\bar{M}} = 1,$$
and let
$$\bar{u} = \frac{1}{n}\sum_{i=1}^{n}u_i, \qquad \bar{v} = \frac{1}{n}\sum_{i=1}^{n}v_i.$$
The corresponding ratio estimator is
$$\hat{\bar{Y}}_R = \frac{\bar{u}}{\bar{v}}\bar{V} = \frac{\sum_{i=1}^{n}u_i}{\sum_{i=1}^{n}v_i} = \frac{\sum_{i=1}^{n}M_i\bar{y}_i/\bar{M}}{\sum_{i=1}^{n}M_i/\bar{M}} = \frac{\sum_{i=1}^{n}M_i\bar{y}_i}{\sum_{i=1}^{n}M_i} = \bar{y}_c^{**}.$$
Since the ratio estimator is biased, $\bar{y}_c^{**}$ is also a biased estimator. The approximate bias and mean squared error of $\bar{y}_c^{**}$ can be derived directly from the bias and MSE of the ratio estimator. Using the results from the ratio method of estimation, the bias up to the second order of approximation is
$$Bias(\bar{y}_c^{**}) = \frac{N-n}{Nn}\bar{U}\left(\frac{S_v^2}{\bar{V}^2} - \frac{S_{uv}}{\bar{U}\bar{V}}\right) = \frac{N-n}{Nn}\bar{U}\left(S_v^2 - \frac{S_{uv}}{\bar{U}}\right),$$
where
$$\bar{U} = \frac{1}{N}\sum_{i=1}^{N}U_i = \frac{1}{N\bar{M}}\sum_{i=1}^{N}M_i\bar{y}_i = \bar{Y},$$
and $S_v^2$ and $S_{uv}$ are the mean sum of squares of the $V_i$ and the mean sum of products of $(U_i, V_i)$, respectively. The mean squared error up to the second order of approximation is
$$MSE(\bar{y}_c^{**}) = \frac{N-n}{Nn}\left(S_u^2 + R^2S_v^2 - 2RS_{uv}\right),$$
where $R = \bar{U}/\bar{V} = \bar{Y}$ and
$$S_u^2 = \frac{1}{N-1}\sum_{i=1}^{N}\left(\frac{M_i\bar{y}_i}{\bar{M}} - \frac{1}{N\bar{M}}\sum_{i=1}^{N}M_i\bar{y}_i\right)^2.$$
Alternatively,
$$MSE(\bar{y}_c^{**}) = \frac{N-n}{Nn}\cdot\frac{1}{N-1}\sum_{i=1}^{N}\left(U_i - RV_i\right)^2 = \frac{N-n}{Nn}\cdot\frac{1}{N-1}\sum_{i=1}^{N}\left(\frac{M_i\bar{y}_i}{\bar{M}} - \frac{M_i}{\bar{M}}\cdot\frac{1}{N\bar{M}}\sum_{k=1}^{N}M_k\bar{y}_k\right)^2 = \frac{N-n}{Nn}\cdot\frac{1}{N-1}\sum_{i=1}^{N}\frac{M_i^2}{\bar{M}^2}\left(\bar{y}_i - \frac{1}{N\bar{M}}\sum_{k=1}^{N}M_k\bar{y}_k\right)^2.$$
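Since $\bar{U} = R\bar{V}$, the two MSE expressions are algebraically identical; a quick numerical check on artificial cluster sizes and means confirms this.

```python
# Check that the two MSE expressions for y_c** agree: with R = Ubar/Vbar,
# S_u^2 + R^2 S_v^2 - 2 R S_uv equals (1/(N-1)) * sum (U_i - R V_i)^2.
Mi = [2, 4, 3, 1]                      # artificial cluster sizes
ybar = [3.0, 4.0, 10.0, 6.0]           # artificial cluster means
N = len(Mi)
Mbar = sum(Mi) / N

U = [m * y / Mbar for m, y in zip(Mi, ybar)]
V = [m / Mbar for m in Mi]
Ub, Vb = sum(U) / N, sum(V) / N        # Vbar = 1 by construction
R = Ub / Vb

Su2 = sum((u - Ub) ** 2 for u in U) / (N - 1)
Sv2 = sum((v - Vb) ** 2 for v in V) / (N - 1)
Suv = sum((u - Ub) * (v - Vb) for u, v in zip(U, V)) / (N - 1)

lhs = Su2 + R ** 2 * Sv2 - 2 * R * Suv
rhs = sum((u - R * v) ** 2 for u, v in zip(U, V)) / (N - 1)
print(lhs, rhs)   # both forms of the MSE factor agree
```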
The fourth estimator corrects $\bar{y}_c$ for its bias. Recall that
$$Bias(\bar{y}_c) = -\frac{N-1}{M_0}S_{my} = -\frac{N-1}{N\bar{M}}S_{my}.$$
Since SRSWOR is used,
$$s_{my} = \frac{1}{n-1}\sum_{i=1}^{n}(M_i-\bar{m})(\bar{y}_i-\bar{y}_c), \qquad \bar{m} = \frac{1}{n}\sum_{i=1}^{n}M_i,$$
is an unbiased estimator of
$$S_{my} = \frac{1}{N-1}\sum_{i=1}^{N}(M_i-\bar{M})(\bar{y}_i-\bar{Y}),$$
i.e., $E(s_{my}) = S_{my}$. So it follows that
$$E(\bar{y}_c) = \bar{Y} - \frac{N-1}{N\bar{M}}E(s_{my}), \qquad \text{or} \qquad E\left[\bar{y}_c + \frac{N-1}{N\bar{M}}s_{my}\right] = \bar{Y}.$$
So
$$\bar{y}_c^{***} = \bar{y}_c + \frac{N-1}{N\bar{M}}s_{my}$$
is an unbiased estimator of the population mean $\bar{Y}$.
This estimator is based on the unbiased ratio-type estimator. It can be obtained by replacing the study variable (earlier $y_i$) by $\frac{M_i}{\bar{M}}\bar{y}_i$ and the auxiliary variable (earlier $x_i$) by $\frac{M_i}{\bar{M}}$. The exact variance of this estimator is complicated and does not reduce to a simple form. The approximate variance up to the first order of approximation is
$$Var(\bar{y}_c^{***}) = \frac{1}{n(N-1)}\sum_{i=1}^{N}\left[\frac{M_i}{\bar{M}}(\bar{y}_i-\bar{Y}) - \frac{1}{N\bar{M}}\sum_{i=1}^{N}\bar{y}_i(M_i-\bar{M})\right]^2.$$
Since the clusters are selected by SRSWOR,
$$E\left[\sum_{i=1}^{n}M_i\right] = n\bar{M}.$$
Now if a sample of size $n\bar{M}$ is drawn from a population of size $N\bar{M}$, then the variance of the corresponding sample mean based on SRSWOR is
$$Var(\bar{y}_{SRS}) = \frac{N\bar{M}-n\bar{M}}{N\bar{M}\cdot n\bar{M}}S^2 = \frac{N-n}{Nn}\cdot\frac{S^2}{\bar{M}}.$$
This variance can be compared with that of any of the four proposed estimators. For example, in the case of
$$\bar{y}_c^{*} = \frac{1}{n\bar{M}}\sum_{i=1}^{n}M_i\bar{y}_i,$$
$$Var(\bar{y}_c^{*}) = \frac{N-n}{Nn}S_b^{*2} = \frac{N-n}{Nn}\cdot\frac{1}{N-1}\sum_{i=1}^{N}\left(\frac{M_i}{\bar{M}}\bar{y}_i - \bar{Y}\right)^2.$$
The relative efficiency of $\bar{y}_c^{*}$ relative to the SRS-based sample mean is
$$E = \frac{Var(\bar{y}_{SRS})}{Var(\bar{y}_c^{*})} = \frac{S^2}{\bar{M}S_b^{*2}}.$$
For $Var(\bar{y}_c^{*}) \le Var(\bar{y}_{SRS})$, the variance between the clusters ($S_b^{*2}$) should be small. So the clusters should be formed in such a way that the variation between them is as small as possible.
Cluster sampling with ppswr:
Suppose that $n$ clusters are selected with ppswr (probability proportional to size, with replacement), the size being the number of units in the cluster. Here $P_i$ is the probability of selection assigned to the $i$th cluster, given by
$$P_i = \frac{M_i}{M_0} = \frac{M_i}{N\bar{M}}, \qquad i = 1, 2, \ldots, N.$$
Consider the following estimator of the population mean:
$$\hat{\bar{Y}}_c = \frac{1}{n}\sum_{i=1}^{n}\bar{y}_i.$$
This estimator can be expressed as
$$\hat{\bar{Y}}_c = \frac{1}{n}\sum_{i=1}^{N}\alpha_i\bar{y}_i,$$
where $\alpha_i$ denotes the number of times the $i$th cluster occurs in the sample. The random variables $\alpha_1, \alpha_2, \ldots, \alpha_N$ follow a multinomial distribution with $E(\alpha_i) = nP_i$, $Var(\alpha_i) = nP_i(1-P_i)$ and $Cov(\alpha_i, \alpha_j) = -nP_iP_j$, $i \ne j$.
Hence,
$$E(\hat{\bar{Y}}_c) = \frac{1}{n}\sum_{i=1}^{N}E(\alpha_i)\bar{y}_i = \frac{1}{n}\sum_{i=1}^{N}nP_i\bar{y}_i = \sum_{i=1}^{N}\frac{M_i}{N\bar{M}}\bar{y}_i = \frac{1}{N\bar{M}}\sum_{i=1}^{N}\sum_{j=1}^{M_i}y_{ij} = \bar{Y},$$
so $\hat{\bar{Y}}_c$ is an unbiased estimator of $\bar{Y}$. Its variance is
$$Var(\hat{\bar{Y}}_c) = \frac{1}{n}\sum_{i=1}^{N}P_i(\bar{y}_i-\bar{Y})^2 = \frac{1}{nN\bar{M}}\sum_{i=1}^{N}M_i(\bar{y}_i-\bar{Y})^2.$$
An estimator of this variance is
$$\widehat{Var}(\hat{\bar{Y}}_c) = \frac{1}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i-\hat{\bar{Y}}_c)^2,$$
which can be seen to satisfy the unbiasedness property as follows. Consider
$$E\left[\frac{1}{n(n-1)}\sum_{i=1}^{n}(\bar{y}_i-\hat{\bar{Y}}_c)^2\right] = \frac{1}{n(n-1)}E\left[\sum_{i=1}^{n}\bar{y}_i^2 - n\hat{\bar{Y}}_c^2\right] = \frac{1}{n(n-1)}\left[E\left(\sum_{i=1}^{N}\alpha_i\bar{y}_i^2\right) - n\left(Var(\hat{\bar{Y}}_c)+\bar{Y}^2\right)\right],$$
where $E(\alpha_i) = nP_i$, $Var(\alpha_i) = nP_i(1-P_i)$ and $Cov(\alpha_i, \alpha_j) = -nP_iP_j$, $i \ne j$. Then
$$E\left[\widehat{Var}(\hat{\bar{Y}}_c)\right] = \frac{1}{n(n-1)}\left[n\sum_{i=1}^{N}P_i\bar{y}_i^2 - \sum_{i=1}^{N}P_i(\bar{y}_i-\bar{Y})^2 - n\bar{Y}^2\right]$$
$$= \frac{1}{n-1}\sum_{i=1}^{N}P_i(\bar{y}_i^2-\bar{Y}^2) - \frac{1}{n(n-1)}\sum_{i=1}^{N}P_i(\bar{y}_i-\bar{Y})^2$$
$$= \frac{1}{n-1}\sum_{i=1}^{N}P_i(\bar{y}_i-\bar{Y})^2 - \frac{1}{n(n-1)}\sum_{i=1}^{N}P_i(\bar{y}_i-\bar{Y})^2$$
$$= \frac{1}{n}\sum_{i=1}^{N}P_i(\bar{y}_i-\bar{Y})^2 = Var(\hat{\bar{Y}}_c).$$
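The ppswr results above can be checked exactly on a small artificial population by enumerating all ordered samples of $n$ draws, weighting each by its multinomial probability, and comparing the exact mean and variance of $\hat{\bar{Y}}_c$ with the formulas.

```python
# Exact check of ppswr: enumerate all N^n ordered samples of n draws,
# weight each by its probability, and compare E and Var of the estimator
# with the formulas (cluster sizes and means are artificial).
from itertools import product

Mi = [2, 4, 3, 1]                      # cluster sizes
ybar = [3.0, 4.0, 10.0, 6.0]           # cluster means
N, n = len(Mi), 3
M0 = sum(Mi)
P = [m / M0 for m in Mi]               # selection probabilities P_i = M_i/M_0
Ybar = sum(p * y for p, y in zip(P, ybar))   # population mean = sum P_i ybar_i

E_hat = Var_hat = 0.0
for draws in product(range(N), repeat=n):
    prob = 1.0
    for i in draws:
        prob *= P[i]                   # probability of this ordered sample
    est = sum(ybar[i] for i in draws) / n
    E_hat += prob * est
    Var_hat += prob * est ** 2
Var_hat -= E_hat ** 2

var_formula = sum(p * (y - Ybar) ** 2 for p, y in zip(P, ybar)) / n
print(E_hat, Ybar)            # unbiased
print(Var_hat, var_formula)   # variance matches (1/n) sum P_i (ybar_i - Ybar)^2
```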