Sampling of Rare Population using Disproportionate Stratified Random Sampling and Adaptive Cluster Sampling
Sushil Kumar
Roll no.:- 20388
M.Sc. (Agricultural Statistics)
Indian Agricultural Statistics Research Institute
Abstract
In a sample survey we estimate some characteristic of a population (a parameter) by observing
only part of the population, i.e. a sample. A suitable sampling design is used for selecting the
sample. In general, conventional sampling designs are efficient when the population is
distributed uniformly over space and time. However, in many social, environmental and
biological surveys the population of interest may be rare. As a result, a conventional
sampling design may not be appropriate and the corresponding estimator is inefficient.
A rare population can have several definitions to cover different situations. In all cases it
denotes a subset of elements whose number is small or which forms a small proportion of the
population. Rarity arises most frequently when the population, or a trait within the
population, is highly clustered in space or time and sampling is done on spatial regions or
time segments. Several sampling designs, based on modifications of classical sampling designs
and on adaptive sampling designs, have been developed to address this issue. This write-up
focuses on disproportionate stratified random sampling and adaptive cluster sampling.
Key words: Adaptive cluster sampling; Conventional sampling designs; Disproportionate
random sampling; Rare population; Sampling.

1. Introduction
In a sample survey we estimate some characteristic of a population (a parameter) by observing
only part of the population, called a sample. The procedure of selecting a sample is called
sampling. Conventional sampling designs include simple random sampling, stratified
random sampling, sampling with probability proportional to size (PPS), systematic
sampling, cluster sampling, multistage sampling, etc. These methods are efficient when the
sampling units are distributed uniformly over space and time. However, in several areas of
research, e.g. in social, environmental and biological fields, surveys are performed to
estimate the parameters of populations that may be rare. A population denotes the set of
elements that is of interest. It is distinct from the sampling frame, which is the set of units
that can be sampled in a probabilistic sampling design.
A rare population can have several definitions. First, a rare population is one in which
the size of the population (N) is very small, as for an endangered species. Here, even if
the sampling frame and the population do not coincide, the number of sampling units
containing rare elements is small. Second, it may be a population in which the presence of a
particular trait is very low, e.g. a genetic disorder that is very infrequent in live births.
Here, rare denotes the rarity of the subpopulation (M out of N) displaying the trait of
interest. Often, the trait is not identifiable before sampling commences. Therefore, some
screening method is used to draw a sample from the whole population and identify members of
the rare subpopulation. Third, population elements are not necessarily rare but are cryptic
or hidden, so that the population appears rare because of low detectability. Here, the
sampling design should allow estimation of both detectability and rarity to obtain sufficient
information about the cryptic population. Finally, the population is not necessarily small or
rare, but the proportion of sampling units containing the trait of interest is very small.
This occurs most frequently when the population, or a trait within the population, is highly
clustered in space or time and the sampling units are spatial regions or time segments. Here,
a large proportion of sampling units do not contain any elements of interest, while those
that do can contain large numbers of them.
Rare populations require sampling designs that provide high observation rates
while controlling sample sizes. The choice of sampling design is also influenced
by some common objectives:
- estimating population size (N) or density (N/A, where A is the area),
- developing probability maps for the presence of a rare trait,
- estimating the proportion of the population carrying a rare trait (M/N),
- comparing parameters among two or more populations,
- monitoring temporal changes within the population, and
- detecting the impacts of interventions.

Techniques for sampling of rare populations are based on
I. Modifications of classical sampling design
- Random sampling
- Disproportionate stratified random sampling

II. Adaptive sampling designs
- Multiplicity or network sampling design
- Link-tracing sampling designs
- Adaptive cluster sampling designs.
This write-up focuses on disproportionate stratified random sampling and
adaptive cluster sampling in detail.

2. Disproportionate Stratified Random Sampling
Ericksen (1976) suggested constructing strata such that rare elements are
concentrated in one or a few strata, and oversampling the strata with rare
elements to obtain a sufficient number of elements for precise estimation. For
this, at least partial information on the spatial distribution should be available
a priori. Kalton and Anderson (1986) described stratification and disproportionate
sampling to estimate the prevalence of a rare trait in a population and to estimate
the mean of a rare population. The efficiency of disproportionate sampling depends on
the effectiveness of stratification in grouping the rare subpopulation into one stratum.
Suppose the population is composed of M rare sampling units and (N - M) non-rare
units, and we wish to estimate the mean
$$\mu = \frac{1}{M}\sum_{i=1}^{M} y_i$$
of a study variable Y over the rare subset of the population using stratified random
sampling. Let the population be divided into two strata such that stratum 1 contains a
proportion $A = M_1/M$ of the rare population and the remaining $(1 - A)$ is in stratum 2.
By simple random sampling, select samples of size $n_1$ and $n_2$ from the two strata, of
which $m_1\ (\le n_1)$ and $m_2\ (\le n_2)$ units respectively are members of the rare
population. Hence, in terms of A, an unbiased estimator of $\mu$ can be written as
$$\hat{\mu}^{A}_{st} = A\,\bar{y}_1 + (1 - A)\,\bar{y}_2, \qquad (1)$$
where
$$\bar{y}_h = \frac{1}{m_h}\sum_{j=1}^{m_h} y_{hj}.$$
If $m_1$ and $m_2$ are sufficiently large, the variance of (1) is approximately
$$\mathrm{var}\left(\hat{\mu}^{A}_{st}\right) \approx \frac{A^2\sigma_1^2}{E[m_1]} + \frac{(1-A)^2\sigma_2^2}{E[m_2]}, \qquad (2)$$
where $\sigma_h^2 = (M_h - 1)^{-1}\sum_{j=1}^{M_h}(y_{hj} - \mu_h)^2$ with
$\mu_h = M_h^{-1}\sum_{j=1}^{M_h} y_{hj}$, and $E[m_h]$ is the expected value of $m_h$.
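As a small numerical illustration of estimator (1) and the variance approximation (2), the following Python sketch may help. The function names, the sample values and the value A = 0.8 are hypothetical and chosen only for illustration; sample variances and the observed counts of rare units are plugged in for the unknown stratum variances and for E[m_h], which is an assumption of this sketch rather than part of the method described above.

```python
from statistics import mean, variance


def stratified_rare_mean(y1, y2, A):
    """Estimator (1): A * ybar_1 + (1 - A) * ybar_2."""
    return A * mean(y1) + (1 - A) * mean(y2)


def plug_in_variance(y1, y2, A):
    """Plug-in version of (2): sample variances replace the stratum
    variances and the observed m_h replace E[m_h] (an assumption)."""
    return (A ** 2 * variance(y1) / len(y1)
            + (1 - A) ** 2 * variance(y2) / len(y2))


# Hypothetical y-values of the rare units found in each stratum sample.
y1 = [4.0, 6.0, 5.0, 7.0]   # m_1 = 4 rare units observed in stratum 1
y2 = [3.0, 5.0]             # m_2 = 2 rare units observed in stratum 2
A = 0.8                     # assumed share of the rare population in stratum 1

estimate = stratified_rare_mean(y1, y2, A)
approx_var = plug_in_variance(y1, y2, A)
print(estimate, approx_var)
```

In practice A is rarely known exactly, so any such calculation is only as good as the assumed value of A.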
To show the efficiency of disproportionate sampling relative to proportional
sampling, let $W_h = N_h/N$ be the proportion of the total population in stratum h,
$P = M/N$ be the proportion of the total population that is rare, $P_h = M_h/N_h$ be the
proportion of units in the h-th stratum that are rare, and $kf_2$ and $f_2$ be the sampling
fractions of strata 1 and 2 respectively. Let c be the ratio of the cost of sampling
a member of the rare subset of the population to that of a member of the non-rare subset.
Assume $\sigma_1^2 = \sigma_2^2$, since estimation of the values of y is confined to the rare
subset of the population only and stratification is not based on the values of y but on
whether the unit is a member of the rare subset or not.
The ratio of the variance of $\hat{\mu}^{A}_{st}$ under disproportional allocation to that
under proportional allocation, for the same total cost, is approximately
$$R \approx \frac{[kP - (k-1)W_1P_1]\,[(c-1)\{P + (k-1)W_1P_1\} + (k-1)W_1 + 1]}{kP\,[(c-1)P + 1]}. \qquad (3)$$
The optimal value of k, the ratio of the sampling fractions in the two strata, is
$$k \approx \sqrt{\frac{P_1\,[(c-1)(P - W_1P_1) + (1 - W_1)]}{(P - W_1P_1)\,[(c-1)P_1 + 1]}},$$
which reduces to $k \approx \sqrt{P_1/P_2}$ when c = 1. The degree to which disproportional
sampling outperforms proportional sampling depends on A and $P_1$: the larger A or $P_1$, the
greater the gain from disproportional sampling. The ideal condition for disproportional
stratified random sampling is when most of the rare subset of the population is confined to a
single stratum (large A) and that stratum has few non-rare elements (large $P_1$).
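The variance ratio (3) and the optimal ratio of sampling fractions can be evaluated numerically. The sketch below is a minimal Python illustration, assuming the forms of R and k given above; the function names and the population values (W1 = 0.10, P1 = 0.40, P = 0.05, c = 1) are hypothetical.

```python
import math


def variance_ratio(k, P, W1, P1, c=1.0):
    """Equation (3): variance under disproportional allocation relative
    to proportional allocation."""
    first = k * P - (k - 1) * W1 * P1
    second = (c - 1) * (P + (k - 1) * W1 * P1) + (k - 1) * W1 + 1
    return first * second / (k * P * ((c - 1) * P + 1))


def optimal_k(P, W1, P1, c=1.0):
    """Ratio of sampling fractions (stratum 1 to stratum 2) minimising R."""
    num = P1 * ((c - 1) * (P - W1 * P1) + (1 - W1))
    den = (P - W1 * P1) * ((c - 1) * P1 + 1)
    return math.sqrt(num / den)


# Hypothetical population: stratum 1 holds 10% of all units, 40% of its
# units are rare, and 5% of the whole population is rare (so A = 0.8).
W1, P1, P = 0.10, 0.40, 0.05
P2 = (P - W1 * P1) / (1 - W1)          # rare proportion in stratum 2

k_opt = optimal_k(P, W1, P1)            # equals sqrt(P1 / P2) when c = 1
R_opt = variance_ratio(k_opt, P, W1, P1)
R_prop = variance_ratio(1.0, P, W1, P1)  # proportional allocation gives R = 1
print(k_opt, R_opt, R_prop)
```

With these assumed values stratum 1 would be sampled at about six times the rate of stratum 2, and the variance would be roughly halved relative to proportional allocation.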
In many situations the distribution of rare elements is not known before sampling. In such
cases stratification is done by (1) two-phase sampling for stratification (Kalton and
Anderson, 1986) or (2) a model-based approach.

2.1 Two-phase sampling:
In the first phase, units are screened using easily measurable and inexpensive variables that
are highly correlated with the rare trait. The selected sample is then divided into two or
more strata based on the probability, estimated from the screening variables, that a sampled
unit is a member of the rare population. A second sample is then drawn from the newly
stratified first-phase sampling units using disproportional allocation (Fink et al., 2004).
Suppose that in the first phase a sample of size $n'$ is selected according to some
probability-based design D that yields an unbiased estimator of the mean of the study
variable Y. Let the population mean, for a variable Y which takes nonzero values only for
elements possessing the rare trait, be
$$\mu_{all} = \frac{1}{N}\sum_{i=1}^{N} y_i.$$
Since the estimator $\hat{\mu}_{all}$ of the mean is design-unbiased,
$E_D[\hat{\mu}_{all}] = \mu_{all}$, with variance $\mathrm{var}_D[\hat{\mu}_{all}]$. Based on
easily measurable auxiliary information, the sampled units of the first phase are classified
into H strata of sizes $n'_h$, $h = 1, 2, \ldots, H$, with $\sum_h n'_h = n'$. The strata are
constructed such that rare units are congregated into one or a few strata. Stratified
random sampling with disproportional allocation is then performed, with $n_h$ units sampled
from the $n'_h$ units of stratum h, $h = 1, 2, \ldots, H$. The estimator
$$\bar{y}_{str} = \sum_{h=1}^{H}\frac{n'_h}{n'}\,\bar{y}_h = \sum_{h=1}^{H}\sum_{j=1}^{n_h}\frac{n'_h}{n'}\,\frac{y_{hj}}{n_h}$$
is conditionally unbiased for $\hat{\mu}_{all}$ and has conditional variance
$$\mathrm{var}(\bar{y}_{str} \mid \mathbf{y}_{n'}) = \sum_{h=1}^{H}\left(\frac{n'_h}{n'}\right)^2\frac{\sigma^2_{y_h \mid \mathbf{y}_{n'}}}{n_h},$$
where $\mathbf{y}_{n'}$ is the vector of observations from the first phase, $\mathbf{y}_h$ is
the subvector of $\mathbf{y}_{n'}$ assigned to stratum h, and
$\sigma^2_{y_h \mid \mathbf{y}_{n'}}$ is the population variance of $\mathbf{y}_h$. The
unconditional mean and variance of $\bar{y}_{str}$ are then given by
$$E_D\left[E_{str/D}[\bar{y}_{str}]\right] = E_D[\hat{\mu}_{all}] = \mu_{all}$$
and
$$\mathrm{var}(\bar{y}_{str}) = \mathrm{var}_D\left(E_{str/D}[\bar{y}_{str}]\right) + E_D\left[\mathrm{var}_{str/D}(\bar{y}_{str})\right] = \mathrm{var}_D(\hat{\mu}_{all}) + E_D\left[\sum_{h=1}^{H}\left(\frac{n'_h}{n'}\right)^2\frac{\sigma^2_{y_h \mid \mathbf{y}_{n'}}}{n_h}\right].$$


2.2 Model based approach:
This approach is used for stratification of a population where at least one stratum is highly
likely to contain rare elements. Suppose we want to estimate the population total
$$\tau = \sum_{h=1}^{H}\sum_{j=1}^{N_h} y_{hj}$$
for some variable Y of a spatially highly clustered population, where $y_{hj}$ is the number
of elements of the population in the j-th sampling unit (PSU) within the h-th stratum.
Here, optimal allocation of samples to strata cannot be applied a priori because of incomplete
knowledge of the variances of the different strata. So an initial stratified random sample
with $n_{h1}$ units in stratum h is selected and the sample stratum variance $s^2_{h1}$,
$h = 1, 2, \ldots, H$, is calculated. For highly clustered populations, $s^2_{h1}$ will be
largest for those strata in which PSUs have large counts. Hence, additional samples should be
assigned to those strata if possible.
To achieve this, Francis (1984) recommended sequential allocation of additional effort. In
sequential allocation, first calculate for each stratum the estimated variance of the
estimator from the first-phase sample. The estimator of the total at the stratum level is
$\hat{\tau}_{h1} = N_h\bar{y}_{h1}$, where $\bar{y}_{h1}$ is the sample mean per PSU and
$N_h$ is the total number of PSUs in stratum h, with estimated variance
$$\widehat{\mathrm{var}}(\hat{\tau}_{h1}) = \frac{N_h^2 s_{h1}^2}{n_{h1}}.$$
Then the reduction in variance if one additional sample is taken is approximated as
$$G_h = N_h^2 s_{h1}^2\left(\frac{1}{n_{h1}} - \frac{1}{n_{h1}+1}\right) = \frac{N_h^2 s_{h1}^2}{n_{h1}(n_{h1}+1)}.$$
An additional sample is then taken in the stratum with the largest value of $G_h$, and a new
$G_h$ is calculated. This sequential sampling is repeated until the desired sample size is
reached. Francis (1984) and Thompson and Seber (1996) used this approach to study
fish biomass based on stratified random tows of different lengths.
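The sequential allocation rule can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions: the full list of PSU counts in each stratum (`strata`) is known to the code solely so that sampling can be simulated, the function name and data are hypothetical, the sample variance is recomputed from all units drawn so far in a stratum, and ties in G_h are broken in favour of the higher stratum index.

```python
import random
from statistics import variance


def sequential_allocation(strata, n_initial, n_total, seed=0):
    """Allocate n_total PSUs across strata, one at a time, by largest G_h.

    strata: list of lists of PSU counts (used only to simulate sampling).
    n_initial: first-phase sample size per stratum (must be at least 2).
    """
    rng = random.Random(seed)
    # Random draw order per stratum (simple random sampling without replacement).
    order = [rng.sample(units, len(units)) for units in strata]
    samples = [units[:n_initial] for units in order]

    while sum(len(s) for s in samples) < n_total:
        gains = []
        for h, s in enumerate(samples):
            N_h, n_h = len(order[h]), len(s)
            if n_h < N_h:  # stratum not yet exhausted
                G_h = N_h ** 2 * variance(s) / (n_h * (n_h + 1))
                gains.append((G_h, h))
        if not gains:
            break
        _, best = max(gains)
        # Take the next unit in the pre-drawn order for the chosen stratum.
        samples[best].append(order[best][len(samples[best])])
    return samples


# Hypothetical population: stratum 0 is empty of the species,
# stratum 1 contains a few PSUs with large counts.
strata = [
    [0] * 20,
    [0, 0, 45, 0, 60, 0, 0, 12, 0, 0, 0, 30, 0, 0, 0, 0, 0, 0, 0, 0],
]
samples = sequential_allocation(strata, n_initial=3, n_total=12)
print([len(s) for s in samples])
```

Because the first stratum has zero sample variance, every additional PSU in this toy example goes to the second stratum, which is the behaviour the rule is designed to produce for clustered populations.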



3. Adaptive Cluster Sampling Designs
Francis (1984) and Thompson (1990) described adaptive cluster sampling as a
single-stage cluster sampling strategy in which the size and distribution of the
clusters are unknown before sampling. They define a cluster as a group of secondary
units that display the trait of interest. The design first selects secondary units
and, if a selected unit displays the trait of interest, the primary unit to which it
belongs is sampled exhaustively. This idea was used earlier in Fisher's sib method
(1934) to estimate the proportion of children with albinism from parents who could
produce children with albinism. Adaptive cluster sampling is used in situations where
a list frame is unavailable.
To conduct adaptive cluster sampling, clusters are constructed based on two things:
- the linkages among sampling units and
- the defining characteristic of a cluster.
For example, in an ecological study the population is distributed in a spatial
setting, so the sampling frame is made of a set of contiguous, non-overlapping
quadrats or grid cells, in which $y_i$ is the number of individuals in the i-th cell.
Linkages among the cells are based on attributes such as common boundaries,
orientation relative to each other, distance between centroids, etc. In regular
grids, the linkages for the i-th cell will be the cells having a common border with
the i-th cell. The set of linkages for a cell is referred to as the cell's
neighborhood. For unbiased estimation the linkages among units within a cluster
should be symmetric, i.e. if the i-th cell is linked to the j-th cell then the
converse is also true.
To constitute a cluster when the linkages are used in sampling, researchers
specify a criterion for initiating adaptive cluster sampling. The criterion is
based on the Y-variable, such as {Y > c}, where c is a constant. If the population
is very rare, the criterion is {Y > 0}. The adaptive component of sampling is
performed when a sampled unit, either in the initial sample or reached via a link
from a sampled unit, meets the criterion {Y > c}.
A network denotes a set of units such that selecting any single one of them in the
initial sample results in all of them being sampled. Every unit in the network is
linked to every other unit in the network. A cluster is composed of the network
plus any additional units that would be sampled but which are not part of the
network. If a unit in the initial sample does not satisfy the criterion for
adaptive sampling, then its network size is 1 (one).
Consider a spatial region divided into square quadrats, with linkages defined only
in two directions (north and south) of a cell. If a cell selected in the initial sample
meets the criterion for adaptive sampling, then the adjacent cells to the north and
south are sampled. If either of these cells meets the condition, the next cell in the
same direction is sampled. Sampling continues until either a boundary is encountered
or a cell which does not meet the condition is measured. These last cells are not within the
network of the cluster but are measured to determine the spatial extent of the cluster. They
are often referred to as edge units. The final sample contains the units of the initial
sample, plus all units belonging to the networks intersected by the initial sample, plus those
edge units which are not part of any network. This can be understood from a simple example.
Consider a small one-dimensional population of 10 units whose y-values are given below.
The actual population mean is 26.2 units, with a variance of 626.622.
2 20 8 60 48 1 9 15 71 28
The neighborhood of each unit includes all adjacent units. Suppose the criterion C for
taking additional samples is defined as C = {y: y >= 10 units}.
1. Suppose an initial random sample of 3 units selects the 1st, 4th and 9th units, with
y-values 2, 60 and 71.
2. The sampled units that satisfy the adaptive cluster sampling criterion are those with
y-values 60 and 71.
3. Sampling is extended to the left and right of each unit meeting the criterion. The new
units are the 3rd, 5th, 8th and 10th, with y-values 8, 48, 15 and 28.
4. Of these, the units meeting the criterion C = {y: y >= 10} are those with y-values 48, 15
and 28.
5. Sampling continues to the left and right of each new unit meeting the criterion. The new
units are the 6th and 7th, with y-values 1 and 9; the 10th unit lies on the boundary and has
no further neighbor.
6. Neither of these new units meets the criterion, so sampling stops.
7. The full sample therefore consists of the 1st unit and the 3rd to 10th units, with
y-values 2, 8, 60, 48, 1, 9, 15, 71 and 28.
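The steps of this one-dimensional example can be traced with a short Python sketch. It assumes, as in the example, that the neighborhood of a unit is its left and right neighbor and that the criterion is y >= 10; the function name and the 0-based indexing are choices made here for illustration.

```python
def adaptive_cluster_sample(y, initial, threshold):
    """Return the sorted 0-based indices of the final sample for a
    one-dimensional population with adjacent-unit neighborhoods."""
    sampled = set(initial)
    to_check = list(initial)
    while to_check:
        i = to_check.pop()
        if y[i] >= threshold:              # unit meets the criterion
            for j in (i - 1, i + 1):       # left and right neighbors
                if 0 <= j < len(y) and j not in sampled:
                    sampled.add(j)
                    to_check.append(j)
    return sorted(sampled)


y = [2, 20, 8, 60, 48, 1, 9, 15, 71, 28]
# 1st, 4th and 9th units correspond to 0-based indices 0, 3 and 8.
final = adaptive_cluster_sample(y, initial=[0, 3, 8], threshold=10)
print(final)                   # indices of all observed units
print([y[i] for i in final])   # their y-values
```

Only the 2nd unit (y = 20) is never observed, because neither it nor a neighbor satisfying the criterion was reached from the initial sample.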







3.1 Estimators for Adaptive Cluster Sampling:
Unbiased estimators of population totals or means in unrestricted adaptive cluster sampling
are obtained by modifying the Horvitz and Thompson (1952) and Hansen and Hurwitz
(1943) estimators. Both estimators weight each sampling unit by the inverse of its selection
or inclusion probability. The Horvitz-Thompson (HT) estimator is used for sampling without
replacement and includes each observation exactly once, while the Hansen-Hurwitz (HH)
estimator is used for with-replacement sampling and includes observations as many times as
they have been observed in the sample. The HH estimator is easier to calculate than the HT
estimator but generally has higher variance.
The usual HH estimator of the population mean is
$$\hat{\mu}_{HH} = \frac{1}{nN}\sum_{i=1}^{n}\frac{Y_i}{\alpha_i},$$
where $\alpha_i$ is the probability that the i-th sampling unit is selected in a
with-replacement sampling design. For adaptive cluster sampling the estimator must be
modified, since $\alpha_i$ is calculable for units in the initial sample and units in the
networks adaptively sampled from the initial sample, but not for the edge units sampled as
part of an adaptively sampled cluster. The modified HH estimator is given by
$$\hat{\mu}_{HH} = \frac{1}{n'N}\sum_{i=1}^{n'}\frac{Y_i}{\alpha_i}, \qquad (4)$$
where $n'$ equals the number of units selected in the initial sample plus the number of
adaptively added units which met the criterion and belonged to the networks, and $\alpha_i$ is
interpreted as the probability that the i-th network is included in the sample. The variance
of (4) is given by
$$\mathrm{var}(\hat{\mu}_{HH}) = \frac{1}{N^2 n'}\sum_{i=1}^{N}\alpha_i\left(\frac{y_i}{\alpha_i} - N\mu\right)^2,$$
with unbiased estimator
$$\widehat{\mathrm{var}}(\hat{\mu}_{HH}) = \frac{1}{N^2 n'(n'-1)}\sum_{i=1}^{n'}\left(\frac{y_i}{\alpha_i} - N\hat{\mu}_{HH}\right)^2.$$
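The HT estimator is similarly modified and is given by
$$\hat{\mu}_{HT} = \frac{1}{N}\sum_{i=1}^{v}\frac{Y_i}{\pi_i},$$
where $v$ is the number of distinct units in the sample which belong to a network, and
$\pi_i$ is the probability that the i-th network is intersected by the initial sample.
An alternative formulation uses the sums of the y-values of the sampled networks, so
that the estimator of the mean is given by
$$\hat{\mu}_{HT} = \frac{1}{N}\sum_{i=1}^{\kappa}\frac{Y_i^*}{\pi_i},$$
where $\kappa$ is the number of sampled networks and $Y_i^*$ is the sum of the y-values
in the i-th network. The variance of this estimator is
$$\mathrm{var}(\hat{\mu}_{HT}) = \frac{1}{N^2}\sum_{j=1}^{K}\sum_{k=1}^{K} Y_j^* Y_k^*\left(\frac{\pi_{jk} - \pi_j\pi_k}{\pi_j\pi_k}\right), \qquad (5)$$
where K is the number of networks in the population, $\pi_{jk}$ is the probability that
networks j and k are both intersected by the initial sample, and $\pi_{jj} = \pi_j$. An
unbiased estimator of (5) is
$$\widehat{\mathrm{var}}(\hat{\mu}_{HT}) = \frac{1}{N^2}\sum_{j=1}^{\kappa}\sum_{k=1}^{\kappa}\frac{Y_j^* Y_k^*}{\pi_{jk}}\left(\frac{\pi_{jk} - \pi_j\pi_k}{\pi_j\pi_k}\right).$$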
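To make the network form of the HT estimator concrete, the sketch below applies it to the one-dimensional example of Section 3 (y-values 2, 20, 8, 60, 48, 1, 9, 15, 71, 28, criterion y >= 10). It assumes that the initial sample is a simple random sample of n units drawn without replacement, in which case the probability that a network of m units is intersected is 1 - C(N - m, n) / C(N, n); that design assumption, and the function names, are not taken from the text above.

```python
from itertools import combinations
from math import comb


def networks_1d(y, threshold):
    """Partition unit indices into networks: maximal runs of adjacent units
    meeting the criterion; every other unit is a network of size 1."""
    nets, run = [], []
    for i, value in enumerate(y):
        if value >= threshold:
            run.append(i)
        else:
            if run:
                nets.append(run)
                run = []
            nets.append([i])
    if run:
        nets.append(run)
    return nets


def ht_mean(y, initial, threshold):
    """Network form of the HT estimator of the mean, assuming an initial
    simple random sample without replacement."""
    N, n = len(y), len(initial)
    chosen = set(initial)
    total = 0.0
    for net in networks_1d(y, threshold):
        if chosen & set(net):                        # network intersected
            pi = 1 - comb(N - len(net), n) / comb(N, n)
            total += sum(y[i] for i in net) / pi
    return total / N


y = [2, 20, 8, 60, 48, 1, 9, 15, 71, 28]
estimate = ht_mean(y, initial=[0, 3, 8], threshold=10)

# Averaging over every possible initial sample of size 3 recovers the
# population mean, which illustrates design-unbiasedness.
all_samples = list(combinations(range(len(y)), 3))
average = sum(ht_mean(y, s, 10) for s in all_samples) / len(all_samples)
print(estimate, average)
```

Edge units (here the units with y-values 8, 1 and 9) contribute nothing to the estimate unless they were themselves selected in the initial sample, which matches the remark that intersection probabilities are not calculable for edge units.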

3.2 Comparison of adaptive cluster sampling with SRS
Consider a square region divided into 100 plots, with 100 objects clustered
in the region as follows:

Figure 1. The left panel shows the clustered distribution of the population and the
right panel shows the number of population units in each plot.
Suppose an initial sample of size n = 3 is selected randomly, with criterion
{y: y > 0} for adding units. The initially selected units are shown as X and
the units within a cluster are shown by *.

To compare adaptive cluster sampling with SRS we need to account for
the total sample size in the adaptive scheme, $\nu$, which is a random variable.
$\nu$ includes all plots that end up in the sample, i.e. the n initial plots and all
neighbors of plots satisfying the condition.
To compute $E(\nu)$, let the random variable $X_i$ be 1 if the i-th plot is included in
the sample and 0 otherwise. Then $\nu = X_1 + X_2 + \cdots + X_N$ and
$E(\nu) = E(X_1) + \cdots + E(X_N)$. To compute $E(X_i)$, note that the probability that
plot i is included in the sample is one minus the probability that neither the plots in
its network nor the plots in networks of which plot i is an edge unit are in the initial
sample. For an initial simple random sample of n plots drawn without replacement this is
$$\pi_i = 1 - \binom{N - m_i - a_i}{n}\Big/\binom{N}{n},$$
where $m_i$ is the number of units in the network to which plot i belongs and $a_i$
is the number of plots in networks for which plot i is an edge unit (note that if
plot i satisfies the condition, then $a_i = 0$, while if it does not satisfy the condition,
then $m_i = 1$). Since $X_i$ is a Bernoulli random variable, $E(X_i) = \pi_i$, so
$E(\nu) = \pi_1 + \cdots + \pi_N$.
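The expected final sample size can be checked numerically on a small case. The Python sketch below uses the one-dimensional example of Section 3 (criterion y >= 10, initial sample size 3) rather than the 100-plot region, and assumes an initial simple random sample without replacement, so that the inclusion probability of plot i is 1 - C(N - m_i - a_i, n) / C(N, n). The function names are hypothetical, and the formula is cross-checked against a brute-force enumeration of all initial samples.

```python
from itertools import combinations
from math import comb


def network_of(y, i, threshold):
    """Indices of the network containing unit i (adjacent-unit neighborhoods)."""
    if y[i] < threshold:
        return [i]
    lo = hi = i
    while lo > 0 and y[lo - 1] >= threshold:
        lo -= 1
    while hi < len(y) - 1 and y[hi + 1] >= threshold:
        hi += 1
    return list(range(lo, hi + 1))


def inclusion_probability(y, i, n, threshold):
    """pi_i = 1 - C(N - m_i - a_i, n) / C(N, n) under initial SRS without replacement."""
    N = len(y)
    m = len(network_of(y, i, threshold))
    a = 0
    if y[i] < threshold:  # unit i can only be an edge unit if it fails the criterion
        for j in (i - 1, i + 1):
            if 0 <= j < N and y[j] >= threshold:
                a += len(network_of(y, j, threshold))
    return 1 - comb(N - m - a, n) / comb(N, n)


def final_sample_size(y, initial, threshold):
    """Number of distinct units observed, including edge units."""
    sampled = set()
    for i in initial:
        net = network_of(y, i, threshold)
        sampled.update(net)
        if y[i] >= threshold:
            for j in (net[0] - 1, net[-1] + 1):   # edge units of the network
                if 0 <= j < len(y):
                    sampled.add(j)
    return len(sampled)


y = [2, 20, 8, 60, 48, 1, 9, 15, 71, 28]
n, threshold = 3, 10

expected = sum(inclusion_probability(y, i, n, threshold) for i in range(len(y)))
samples = list(combinations(range(len(y)), n))
brute_force = sum(final_sample_size(y, s, threshold) for s in samples) / len(samples)
print(expected, brute_force)
```

Agreement between the two numbers is only a sanity check for this toy population; it does not reproduce the values reported for the 100-plot region.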
The estimators are now compared. The table below shows, for several initial sample sizes n,
the expected final sample size E(ν) and the variances of the Hansen-Hurwitz (HH) estimator,
the Horvitz-Thompson (HT) estimator and the SRS estimator of the mean.

n     E(ν)     var(HH)    var(HT)    var(SRS)
1      5.50    3.5697     3.5697     1.0864
2     10.47    1.7668     1.7006     0.5409
3     14.96    1.1659     1.0792     0.3594
5     22.73    0.6851     0.5853     0.2150
10    36.88    0.3245     0.2243     0.1082
20    52.98    0.1442     0.0639     0.0561
30    62.45    0.0841     0.0242     0.0380
50    75.30    0.0361     0.0041     0.0207

The variance of the HH estimator is never smaller than that of the HT estimator; the two
coincide at n = 1. As the initial sample size increases (n >= 30 in the table), the HT
estimator becomes more efficient than the SRS estimator of the mean.


4. Summary
Although several methods are available to study rare populations, the choice
depends on the population under study, the objectives of the study, and the
distribution of the rare elements to be sampled. None of the methods will
perform well at small sample sizes when the elements are very rare and found
in small groups of one or two. Methods for sampling rare populations have the
advantage of a large gain in information from a relatively small initial sample,
but their success depends critically on accurate responses and correct
identification of linkages. When the population is spatially distributed and the
rare elements occur in spatially distinct groupings of reasonable size, a method
such as adaptive cluster sampling is useful for estimating population parameters.
The availability of several different sampling strategies for the initial sample
allows for a variety of approaches to estimate the parameters of interest
accurately and precisely. Overall, the final choice of design will depend on the
cost and objectives of the study.


5. References
Anderson, D.W. and Kalton, G. (1990). Case-finding strategies for studying rare chronic
diseases. Statistica Applicata - Italian Journal of Applied Statistics, 2, 309-321.
Ericksen, E. (1976). Sampling a rare population: a case study. Journal of the American
Statistical Association, 71, 816-822.
Fisher, R.A. (1934). The amount of information supplied by records of families as a
function of the linkage in the population sampled. Annals of Eugenics, 6, 13-25.
Hansen, M.H. and Hurwitz, W.N. (1943). On the theory of sampling from finite
populations. Annals of Mathematical Statistics, 14, 333-362.
Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without
replacement from a finite universe. Journal of the American Statistical
Association, 47, 663-685.
Kalton, G. and Anderson, D.W. (1986). Sampling rare populations. Journal of the Royal
Statistical Society, 149, 65-82.
Kalton, G. (1993). Sampling Rare and Elusive Populations. United Nations Statistics
Division, New York.
www.unstats.un.org/unsd/publication//UNFPA_UN_INT_92_P80_16E.pdf
Kalton, G. (2003). Practical methods for sampling rare and mobile populations. Statistics
in Transition, 6, 491-501.
Pfeffermann, D. and Rao, C.R. (2009). Handbook of Statistics, Sample Surveys: Design,
Methods and Applications. 29(A), North-Holland, UK.
Thompson, S.K. (1990). Adaptive cluster sampling. Journal of the American Statistical
Association, 85, 1050-1059.
Thompson, S.K. (1992). Sampling. John Wiley & Sons, New York, 23, 315-338.
Thompson, S.K. and Seber, G.A.F. (1996). Adaptive Sampling. John Wiley & Sons, New
York, 50, 712-724.
