Vita e Pensiero – Pubblicazioni dell’Università Cattolica del Sacro Cuore is collaborating with
JSTOR to digitize, preserve and extend access to Rivista Internazionale di Scienze Sociali
J.N.K. Rao*
ABSTRACT
Methods for small area estimation have received much attention in recent years due to growing
demand for reliable small area statistics that are needed in formulating policies and programs, allocation
of government funds, making business decisions and so on. Traditional area-specific direct estimation
methods are not suitable in the small area context because of small (or even zero) area-specific
sample sizes. As a result, indirect estimation methods that borrow information across
areas through implicit or explicit linking models and auxiliary information, such as census and
administrative records, are needed. This paper provides an introduction to small area estimation with
emphasis on explicit model-based estimation. Methods covered include «off-the-shelf» re-weighting
methods, simulated census methods used by the World Bank and formal empirical Bayes and hierarchical
Bayes methods, based on explicit models. Formal model-based methods permit the estimation
of mean squared prediction error and the construction of confidence intervals.
I - INTRODUCTION
Sample surveys have long been used to provide timely estimates of parameters of
interest, such as totals, means and ratios, for subpopulations (or domains) of a finite
population. Such estimates are «direct» in the sense of using only domain-specific
sample data, provided the domain sample sizes are large enough to support reliable
direct estimates. That is, the domains are «large» such as large geographical areas
(provinces in Canada) or age-sex groups at the national level. Typically, direct estimates
are design based requiring minimal assumptions, and the associated inferences
(standard errors of estimates, confidence intervals on the parameters, etc.) are based
on the known sampling distribution induced by the sampling design, with the population
item values held fixed. Standard text books on survey sampling theory and
methods (e.g. Kish, 1965; Cochran, 1977; Sarndal et al., 1992; Thompso[...];
Lohr 1999) provide extensive accounts of design-based direct estimation a[...]
* Carleton University, Ottawa, Canada. This paper is based on keynote talks presented at the
workshop on Small Area Estimation and Local Territory, Catholic University of the Sacred Heart,
Piacenza, Italy, May 2005 and at the International Conference on Small Area Estimation (SAE
2007), Università di Pisa, Pisa, Italy, September 2007. Thanks are due to Isabel Molina for a careful
reading of the paper and for constructive comments and suggestions.
= wJ'Jes (2J)
and
$D_{ij} = (g_{ij} - L)\log\{(g_{ij} - L)/(1 - L)\} + (U - g_{ij})\log\{(U - g_{ij})/(U - 1)\}$, [...]
where $g_{ij} = w_{ij}/d_j$, and $L\,(< 1)$ and $U\,(> 1)$ are specified lower and upper bounds on $g_{ij}$. If a
solution $\{g_{ij}, j \in s\}$ does not exist, then the bounds $L$ and $U$ are adjusted to find a
solution, but there is no guarantee that the resulting weights $w_{ij} = g_{ij} d_j$ are all non-negative,
unlike in the first method. Also, the condition (2.1) that ensures [...]
may not be satisfied, but the second method avoids the model assumption [...].
This method of indirect estimation has been used by the National Center for Social
and Economic Modeling (NATSEM) in Australia.
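The distance function above can be evaluated directly, which makes its behaviour easier to see: it is zero when the ratio $g_{ij}$ equals 1 (calibrated weight equal to the design weight) and grows as the ratio approaches either bound. The following is a minimal sketch only; the function name `bounded_distance` and the numerical bounds are illustrative assumptions, not taken from the paper, and the sketch does not solve the calibration problem itself.

```python
import math


def bounded_distance(g, lower, upper):
    """Distance between a weight ratio g = w/d and 1, defined for lower < g < upper.

    Implements D = (g - L) log{(g - L)/(1 - L)} + (U - g) log{(U - g)/(U - 1)}.
    """
    if not (lower < 1.0 < upper):
        raise ValueError("bounds must satisfy lower < 1 < upper")
    if not (lower < g < upper):
        raise ValueError("g must lie strictly between the bounds")
    below = (g - lower) * math.log((g - lower) / (1.0 - lower))
    above = (upper - g) * math.log((upper - g) / (upper - 1.0))
    return below + above


# Hypothetical bounds and ratios, chosen only for illustration.
L_BOUND, U_BOUND = 0.5, 3.0
for ratio in (0.6, 1.0, 1.5, 2.5):
    print(ratio, round(bounded_distance(ratio, L_BOUND, U_BOUND), 4))
```

Because the distance is undefined outside $(L, U)$, any minimizing ratios are forced to stay within the bounds, which is why a solution may fail to exist until the bounds are relaxed.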
The estimator of a small area mean $\bar{Y}_i$ from the simulated census $b$ is taken as
IV - COMPOSITE ESTIMATORS
[...] assumed model. On the other hand, the form (4.1) may be intuitively appealing to the
user because it is the sum of the observed values for the sampled units plus the sum
of the predicted values for the non-sampled units. It is interesting to note that if the
large area sample mean is used as the predicted value in (4.1), then $\hat{Y}_{ip}$ reduces to a composite
estimator: $N_i$ times a weighted average of the small area sample mean and the large area
sample mean, with weights $\phi_i = n_i/N_i$ and $1 - \phi_i$, where $n_i$ is the area sample size. It
follows that more weight is given to the small area sample mean and less weight to
the large area sample mean as the sample size increases.
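The equivalence between the two forms can be checked numerically. The sketch below computes the total once as "observed values plus predicted values for the non-sampled units" and once as $N_i$ times the weighted average with weight $n_i/N_i$. The function names and the data are hypothetical and serve only to illustrate the algebra.

```python
def predicted_total(sample_values, area_pop_size, predicted_value):
    """Sum of observed sample values plus predictions for the non-sampled units."""
    n_i = len(sample_values)
    return sum(sample_values) + (area_pop_size - n_i) * predicted_value


def composite_total(sample_values, area_pop_size, large_area_mean):
    """N_i times a weighted average of the small and large area sample means."""
    n_i = len(sample_values)
    phi_i = n_i / area_pop_size  # weight on the small area sample mean
    small_area_mean = sum(sample_values) / n_i
    return area_pop_size * (phi_i * small_area_mean + (1.0 - phi_i) * large_area_mean)


# Hypothetical area: 4 sampled units out of N_i = 40, large area sample mean 11.
y_sample = [12.0, 15.0, 9.0, 14.0]
N_i = 40
large_mean = 11.0

total_a = predicted_total(y_sample, N_i, large_mean)
total_b = composite_total(y_sample, N_i, large_mean)
print(total_a, total_b)
```

With $n_i = 4$ and $N_i = 40$ the weight on the small area sample mean is only 0.1, so the estimate is dominated by the large area mean; the weight rises toward 1 as the area sample grows.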
Harter et al. (2003) successfully applied the estimator $\hat{Y}_{ip}$ to estimate employment
at the county/industry division level in the state of Illinois, USA, using monthly survey
data (available beginning of the following month) and quarterly administrative
data obtained for all the employers (available five months following the reference
quarter) to construct the predicted values. They showed that the use of the com[...]
administrative data in constructing the proposed estimator leads to significant gains in
efficiency compared to using only a direct estimator. In their application, the unit level
model (3.5) with random small area effects did not reveal significant variation
in the random small area effects.
states using Current Population Survey (CPS) estimates as $\hat{\theta}_{it}$ and Unemployment Insurance
(UI) claims rate as a component of $z_{it}$ along with random month and year effects
to account for seasonal variation in monthly unemployment rates.
The second type of basic small area models, called unit level models, is based on
a sample of unit level response values $y_{ij}$ ($j = 1, \ldots, n_i$; $i = 1, \ldots, m$) and associated
auxiliary variables $x_{ij} = (x_{ij1}, \ldots, x_{ijp})'$ with known area population means $\bar{X}_i$, where
$n_i$ is the sample size in area $i$ and $m$ denotes the number of sampled areas. A nested
error linear regression model is assumed for the population units $j = 1, \ldots, N_i$ and the
same model is assumed to hold for the sample, i.e. sample design is not informative
given the auxiliary variables. The sample model is given by $y_{ij} = x_{ij}'\beta + v_i + e_{ij}$,
where the $v_i$ are IID $N(0, \sigma_v^2)$ variables and independent of $e_{ij}$, which are assumed to
be IID $N(0, \sigma_e^2)$ variables. Here the parameter of interest is the realized value of the
small area population mean $\bar{Y}_i$, but in some applications non-linear functions of the
population values $y_{ij}$ are also of interest. For example, a small-area poverty measure,
used by the World Bank and others, is the mean, $F_{\alpha i}$, of the $N_i$ population values
$F_{\alpha ij} = \{(z - E_{ij})/z\}^{\alpha} I(E_{ij} < z)$, where $E_{ij}$ is a suitable measure of welfare such as income
or expenditure, $z$ is the poverty line, $I(E_{ij} < z) = 1$ if $E_{ij} < z$ and
$I(E_{ij} < z) = 0$ otherwise, and $\alpha$ is taken as 0, 1 or 2. Molina - Rao (2008) used
the above nested error model with $y_{ij} = \log(E_{ij})$, and $F_{\alpha ij}$ can therefore be expressed
as a non-linear function of $y_{ij}$. The objective here is to estimate the realized value of the
small-area poverty measure $F_{\alpha i}$.
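The poverty measure defined above is simple to compute once the welfare values are known for every unit in an area. The sketch below evaluates the unit-level values and their area mean for $\alpha = 0, 1, 2$, and also shows the unit-level value written as a function of $y = \log(E)$. The function names and the welfare figures are hypothetical; in practice the non-sampled values are not observed and must be predicted under the model.

```python
import math


def poverty_unit(welfare, poverty_line, alpha):
    """Unit-level value {(z - E)/z}^alpha if E < z, and 0 otherwise."""
    if welfare < poverty_line:
        return ((poverty_line - welfare) / poverty_line) ** alpha
    return 0.0


def poverty_area(welfare_values, poverty_line, alpha):
    """Area poverty measure: mean of the unit-level values over the area population."""
    units = [poverty_unit(e, poverty_line, alpha) for e in welfare_values]
    return sum(units) / len(units)


def poverty_unit_from_log(y, poverty_line, alpha):
    """Same unit-level value as a non-linear function of y = log(E)."""
    return poverty_unit(math.exp(y), poverty_line, alpha)


# Hypothetical area population of four units with poverty line z = 100.
welfare = [50.0, 80.0, 120.0, 200.0]
z = 100.0
for a in (0, 1, 2):
    print(a, poverty_area(welfare, z, a))
```

With $\alpha = 0$ the measure is the proportion of units below the poverty line, with $\alpha = 1$ it averages the relative shortfall, and with $\alpha = 2$ it gives more weight to larger shortfalls.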
Various extensions of the basic unit level model have been proposed to handle
small area estimation involving binary responses or counts, two-stage sampling within
areas, multivariate responses, panel data and others (see Rao, 2003, Chapt[...]
and 10). Recent applications of unit level models include the estimation of the total area
occupied by olive trees in small areas of a region located in Navarra, Spain, using
sample data and satellite image data as auxiliary information (Militino et al., 2006)
and small area estimation of average household income using panel data (Fa[...] et
al., 2007).
We now turn to "optimal" estimation of small area parameters under the assumed
small area models. In particular, Empirical Bayes or Empirical Best (EB) and Hier-
archical Bayes (HB) methods have been extensively used for this purpose (see Rao,
2003). In this section we focus on EB methods using the basic area level model of
Section V. HB methods are studied in section VII.
Published literature has largely focused on the optimal estimation of realized $\theta_i$
under the basic area level model. Here the objective is to minimize the mean squared
prediction error (MSPE) defined as $E(\tilde{\theta}_i - \theta_i)^2$ for any estimator $\tilde{\theta}_i$, where the expectation
is with respect to the assumed model. It is well-known that the best estimator, [...]
$\hat{\theta}_i^{EB} = \hat{\gamma}_i \hat{\theta}_i + (1 - \hat{\gamma}_i) z_i'\hat{\beta}$ (6.1)
where $\hat{\gamma}_i = \hat{\sigma}_v^2/(\hat{\sigma}_v^2 + \psi_i)$. For example, we can use residual maximum likelihood
estimators (REML) of the model parameters obtained from the marginal distribution
of the direct estimators $\hat{\theta}_i$, namely independent $N(z_i'\beta, \sigma_v^2 + \psi_i)$. The form (6.1) of
the EB estimator shows that it is a weighted average of the direct estimator $\hat{\theta}_i$ and
the «regression synthetic» estimator $z_i'\hat{\beta}$ with weights $\hat{\gamma}_i$ and $1 - \hat{\gamma}_i$ respectively. It
is clear that more weight is given to the direct estimator when the sampling variance
is small relative to the model variance and more weight to the synthetic estimator
when the sampling variance is large relative to the model variance. Note that the
EB estimator takes account of survey weights through the direct estimator and it is
design consistent as the sampling variance goes to zero, provided the direct estimator
is design consistent. For non-sampled areas, we use the synthetic estimator $z_i'\hat{\beta}$
based on the values of $z_i$ for those areas. The EB estimator (6.1) is also an empirical
best linear unbiased prediction (EBLUP) estimator of $\theta_i$ without normality assumption
when the model parameters are estimated by moment methods not requiring
normality. It may be noted that REML estimators of model parameters under normality
remain asymptotically valid without normality assumption.
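The weighting in (6.1) can be illustrated with a few lines of code. This is a minimal sketch that treats the synthetic value $z_i'\hat{\beta}$ and the estimated model variance as given inputs (in practice they come from REML or moment estimation, which is not shown here); the function names and all numbers are hypothetical.

```python
def shrinkage_weight(model_variance, sampling_variance):
    """Weight on the direct estimator: model variance over total variance."""
    return model_variance / (model_variance + sampling_variance)


def eb_estimate(direct, synthetic, model_variance, sampling_variance):
    """Weighted average of the direct and regression synthetic estimates.

    For a non-sampled area (direct is None) the synthetic estimate is returned.
    """
    if direct is None:
        return synthetic
    gamma = shrinkage_weight(model_variance, sampling_variance)
    return gamma * direct + (1.0 - gamma) * synthetic


# Hypothetical inputs: direct estimate 10, synthetic estimate 14, model variance 4.
precise_area = eb_estimate(10.0, 14.0, model_variance=4.0, sampling_variance=1.0)
noisy_area = eb_estimate(10.0, 14.0, model_variance=4.0, sampling_variance=16.0)
unsampled_area = eb_estimate(None, 14.0, model_variance=4.0, sampling_variance=16.0)
print(precise_area, noisy_area, unsampled_area)
```

The area with the small sampling variance stays close to its direct estimate (weight 0.8), while the area with the large sampling variance is pulled toward the synthetic value (weight 0.2).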
The EB estimator (6.1) is model unbiased for $\theta_i$, but the resulting naïve estimator
$\hat{Y}_i^{NEB} = g^{-1}(\hat{\theta}_i^{EB})$ of the total $Y_i$ is model biased and no longer optimal. The proper
EB estimator of $Y_i$ is obtained by evaluating the conditional expectation
[...]; see Rao (2003, p. 182) for a Monte Carlo approximation
to the EB estimator.
A second-order Taylor linearization approximation to MSPE of the EB estimator
$\hat{\theta}_i^{EB}$ under REML estimation of model parameters is given by
(4) The jackknife MSPE estimator is the sum of (6.4) and (6.5):
Note that (6.7) only requires the calculation of EB estimates from each bootstrap
sample.
If B is sufficiently large, then (6.7) is essentially the MSPE of the bootstrap EB
estimator referring to the bootstrap model above. Therefore, appealing to (6.2) a sec-
ond order approximation to the bootstrap MSPE is given by (Gonzalez-Manteiga et
al., 2008)
model assuming normality of the random effects $v_i$ and the sampling errors $e_i$, [...]
not second-order accurate as the number of small areas, $m$, increases. That is, the
error in coverage probability is not of order lower than $O(1/m)$. As a result, recent
research has focused on constructing second-order accurate confidence intervals
using the bootstrap method. Chatterjee - Lahiri - Li (2008) used calibration based
on the parametric bootstrap to obtain values $(t_1, t_2)$ such that the interval
$(\hat{\theta}_i^{EB} - t_1\sqrt{g_{1i}},\ \hat{\theta}_i^{EB} + t_2\sqrt{g_{1i}})$ on the realized $\theta_i$ has coverage error of order lower
than $O(1/m)$, where $g_{1i} = g_{1i}(\hat{\sigma}_v^2)$ is the naïve MSPE estimator. Simulation results
suggested good performance in terms of coverage probability. Their bootstrap calibration
method also works for other small area models based on the linear mixed
model. It would be practically appealing to find a similar second-order accurate
confidence interval using a nearly unbiased MSPE estimator in place of $g_{1i}$ because
one would be using the former MSPE estimator to measure variability in the EB estimator.
It would be useful to obtain similar second-order accurate bootstrap confidence
intervals in the case of mismatched small area models and other complex
models including models for handling binary responses and count data.
Hall - Maiti (2006a, b) obtained different bootstrap intervals but they do not make
use of the point estimator $\hat{\theta}_i^{EB}$, unlike Chatterjee et al. (2008).
Smith (2001) in his unpublished Ph.D. dissertation developed alternative EB in-
tervals under the basic area level model, based on asymptotic expansions of the cov-
erage probability, that are also second-order accurate. The topic of second-order ac-
curate EB intervals is technically complex, and extensions to more complex models
are technically challenging.
need for indirect estimates and improving the efficiency of indirect estimates. Recently,
Longford (2006) addressed sample allocation issues at the design stage for planned
domains. He proposed to minimize a weighted sum of sampling variances of direct estimators
for small areas and a direct estimator of the aggregate over the areas; the
weights are named «inferential priorities». This method may be potentially useful, but
in practice the specification of weights could be problematic. Chowdhry - Ra[...]
proposed an alternative solution to sample allocation that avoids the specification of
weights by minimizing the total sample size subject to desired tolerances on the area
sampling variances and on the aggregate sampling variance.
Various theoretical issues related to model specification have been studied in recent
years. In particular, the following problems among others have been addressed:
(1) Robust estimation in the presence of outliers (Ghosh - Maiti, 2008; Sinha [...],
2009; Tzavidis - Chambers, 2005). (2) Measurement errors in the covariates associated
with unit level models (Ghosh - Sinha - Kim, 2006; Torabi - Datta - [...],
2009). (3) Covariate information in the basic area level model subject to sampling errors
(Ybarra - Lohr, 2008). (4) Sensitivity of inferences to errors in specifying the
sampling variances in the area level model (Bell, 2008; Rivest - Vandal, 2003;
Wang - Fuller, 2003). (5) Replacing parametric regression assumption in the linking
models by weaker non-parametric specifications, in particular using penalized spline
regression models (Opsomer et al., 2008; Ugarte et al., 2008). (6) Pfeffermann -
Sverchkov (2007) studied informative probability sampling of areas and within
sampled areas, thus relaxing the assumption that the model specification holds for
the sample data. Singh - Folsom - Vaish (2008) developed a HB approach to [...]
informative sampling within areas, assuming all the areas are sampled.
We have focused on model-based estimation of small area totals or means, but in
practice we may be interested in ranking the areas or identifying areas that fall below
or above some pre-specified level. In the latter case, estimators designed for means
or totals are not suitable. Shen - Louis (1998) proposed «triple» goal estimation that
can produce good ranks, a good histogram and good area-specific estimators, using
a simple linking model. Extensions of triple goal estimation to cover more complex
small area models would be practically useful. Ganesh - Lahiri (2007) studied
multiple comparisons (in particular, pair-wise comparisons) of small area means
using the hierarchical Bayes (HB) framework.
It is desirable and often necessary to ensure that the small area estimates add up
to a reliable direct estimate at a large area level. This property is called benchmarking
(see Rao, 2003, section 7.2 for some benchmarking methods proposed in the literature).
Wang - Fuller - Qu (2008) proposed to enlarge the linking model to achieve
automatic benchmarking, thus facilitating the estimation of mean squared prediction
error (MSPE) of the benchmarked small area estimators.
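As a simple illustration of the benchmarking property, the sketch below applies a plain ratio adjustment: every model-based area total is multiplied by a common factor so that the adjusted totals add up to the direct large area estimate. This is one common simple approach assumed here for illustration; it is not the enlarged linking model of Wang - Fuller - Qu (2008), and the function name and figures are hypothetical.

```python
def ratio_benchmark(area_totals, direct_large_area_total):
    """Scale model-based area totals by a common factor to match a direct total."""
    model_sum = sum(area_totals)
    if model_sum == 0:
        raise ValueError("model-based totals sum to zero; ratio adjustment undefined")
    factor = direct_large_area_total / model_sum
    return [factor * total for total in area_totals]


# Hypothetical model-based totals for three areas and a direct large area total.
model_totals = [100.0, 250.0, 150.0]
direct_total = 550.0
benchmarked = ratio_benchmark(model_totals, direct_total)
print(benchmarked, sum(benchmarked))
```

A drawback of such after-the-fact adjustments is that the MSPE of the adjusted estimators is harder to estimate, which is part of the motivation for building the benchmark into the model.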
Subject-matter specialists or end users should have influence on the choice of
models, particularly on the choice of auxiliary variables. However, statistical methods
for model selection and validation play a vital role in small area estimation. As
remarked earlier, linking models often used in small area estimation are based on linear
mixed or generalized linear mixed models containing random effects ass[...]
REFERENCES
N. Ganesh - P. Lahiri, Simultaneous credible intervals for small area estimation problems, in
«Technical Report», courtesy of authors, 2007.
W. Gonzalez-Manteiga - M.J. Lo[...], [...] bootstrap mean squared error of a sm[...],
[...] Simulation», 75, 2008, pp. 443-46[...].
R. Harter - M. Macaluso - K. Wo[...], [...] estimator, in «Survey Methodology» [...].
J. Jiang - P. Lahiri - S.-M. Wan, A unified jackknife theory for empirical best prediction with
M-estimation, in «Annals of Statistics», 30, 2002, pp. 1782-1810.
J. Jiang - T. Nguyen - J.S. Rao, Fence methods for small area estimation, «Technical Report»,
courtesy of authors, 2008.
J. Jiang - J.S. Rao - Z. Gu - T. Nguyen, Fence methods for mixed model selection, in «Annals
of Statistics», 36, 2008, pp. 1669-1692.
S.L. Lohr, Sampling: Design and Analysis, Duxbury, Pacific Grove, CA 1999.
S.L. Lohr - J.N.K. Rao, Jackknife estimation of mean squared error of small area predictors
in nonlinear mixed models, in «Biometrika», 96, 2009, pp. 457-468.
N.T. Longford, Sample size calculations for small area estimation, in «Survey Methodology»,
32, 2006, pp. 87-96.
J. Meza - P. Lahiri, A note on the Cp statistic under the nested error regression model, in
«Survey Methodology», 31, 2005, pp. 195-199.
A.F. Militino - M.D. Ugarte - T. Goicoa - M. Gonzalez-Audicana, Using small area models
to estimate the total area occupied by olive trees, in «Journal of Agricultural, Biological, and
Environmental Statistics», 11, 2006, pp. 450-461.
I. Molina - J.N.K. Rao, Small area estimation of poverty indicators, «Technical Repor[...]
National Center for Health Statistics, Synthetic State Estimates for Small Areas, [...]
Monograph 24), U.S. Government Printing Office, Washington, DC 1968.
M.H. Quenouille, Notes on bias in estimation, in «Biometrika», 43, 1956, pp. 353-3[...].
J.N.K. Rao, Jackknife and bootstrap methods for small area estimation, in «Proceedings of the
Section on Survey Research Methods», American Statistical Association, Washington, [...]
J.N.K. Rao - M. Yu, Small area estimation by combining time series and cross-sectional data,
in «Canadian Journal of Statistics», 22, 1994, pp. 511-528.
L.-P. Rivest - N. Vandal, Mean squared error estimation for small areas when the sampling
variances are estimated, in «Proceedings of Conference on Recent Advances in Survey Sampling.
Laboratory for Research in Statistics and Probability», Carleton University, Ottawa, Canada
2003.
C.E. Sarndal - B. Swensson - J.H. Wretman, Model Assisted Survey Sampling, Springer-Verlag,
New York 1992.