Article info
Article history:
Received 13 October 2016
Received in revised form 22 May 2017
Accepted 23 May 2017
Available online 1 June 2017
Keywords:
Small area estimation
Sustainable Development Goals
Complex survey data
Clustered data
Abstract
Poverty mapping uses small area estimation techniques to estimate levels of deprivation (poverty, undernutrition) across small geographic domains within a country. These estimates are then displayed on a poverty map, and used by aid organizations such as the United Nations World Food Programme for the efficient allocation of aid. Current methodology employs unit-level regression modelling of a target variable (household income, child weight-for-age). An alternative modelling technique is proposed, using tree-based methods, that has some practical advantages. Alternative ways of amalgamating the unit-level predictions from classification trees to small area level are explored, adapting the trees to account for the survey design, and resampling strategies are proposed for producing standard errors. The methodology is evaluated using both real data and simulations based on a poverty mapping study in Nepal. The simulations suggest that amalgamation of posterior probabilities from the tree gives approximately unbiased estimates, and standard errors can be calculated using a cluster bootstrap approach with cluster effects included in the predictions. Small area estimates of poverty incidence for a region in Nepal, generated using the proposed tree based method, are comparable to the published results obtained by the standard method.
© 2017 Elsevier B.V. All rights reserved.
1. Introduction
Elimination of poverty and undernutrition, the first two of the United Nations Sustainable Development Goals (United
Nations, 2016b), is addressed through the distribution of billions of dollars in assistance each year to third world countries.
Poverty mapping is promoted by the World Bank (World Bank, 2015) for predicting regional variations in the levels of
deprivation in a particular country, to facilitate efficient allocation of food aid by agencies such as the United Nations World
✩ Specimen R code for classification tree estimates is provided as supplementary material in the electronic version of the paper (see Appendix A).
* Correspondence to: Institute of Fundamental Sciences (Statistics), Massey University, Private Bag 11222, Palmerston North 4100, New Zealand. Fax: +64 6 3557953.
E-mail address: g.jones@massey.ac.nz (G. Jones).
http://dx.doi.org/10.1016/j.csda.2017.05.009
0167-9473/© 2017 Elsevier B.V. All rights reserved.
54 P. Bilton et al. / Computational Statistics and Data Analysis 115 (2017) 53–66
Food Programme (WFP). Statistical techniques are used to generate within-country estimates of deprivation, which can then
be combined with Geographic Information System (GIS) data to produce poverty maps, displaying disaggregated measures
of poverty and other indicators of well being at low geographical levels. Fig. 1, an example of a poverty map for Nepal,
displays small area estimates of the proportion of individuals below a specific expenditure level (the ‘‘poverty line’’) across
geographic domains called ‘‘ilakas’’. A poverty line is usually based on the income or expenditure required to enjoy a minimal
level of goods and services (Ravallion, 1992). The head-count index is more correctly called poverty prevalence; however, the
proportion of poor people is referred to in this paper as poverty incidence, the term generally used in the poverty mapping
literature.
Estimation of poverty status at low geographical level requires the use of small area estimation methodology. Current
methods for poverty mapping are usually based on multiple regression models, relating the variable of interest to a set of
predictors. An alternative approach for modelling poverty incidence is to use classification trees (Breiman et al., 1984),
which offer several advantages. Firstly, a classification tree does not require parametric assumptions (Hastie et al., 2001)
and provides a simple, direct and easily understood method of estimating poverty incidence. Multiple regression typically
uses a stepwise method, at a preliminary stage, for selection of model predictors, but there can be major problems with
this approach (Harrell, 2015); some skill and experience is required on the part of the modeller to avoid over-fitting while
including the important predictors in an appropriate way. The classification tree method in contrast has built-in tools for
automatically selecting variables and avoiding over-fitting, and is better able to cater for possible non-linear relationships
in the data structure (Chambers and Hastie, 1992). Including interactions in multiple regression can be problematic, since
decisions must first be made about which interactions to explore, and then the model attempts to estimate effects for all
possible combinations, some of which may be unimportant. In contrast, the classification tree readily incorporates variable
interactions, selecting only the important combinations.
The next section reviews some basic concepts in poverty mapping and classification trees, after which Section 3 describes
the proposed methodology for adapting the classification tree model for small area estimation of poverty incidence. Results
from applying the methodology to simulated data having a simple random sampling structure, simulated data containing
clusters, and actual Nepal data (Haslett and Jones, 2006) are provided in Section 4. These results are discussed, and
recommendations made, in the final section.
Using classification trees to model poverty incidence requires melding small area estimation methodology and complex
survey design with the technique of classification trees. The basic features of poverty mapping, resampling methods in
complex surveys and classification tree models are reviewed below.
Poverty mapping combines survey data with other information to estimate poverty measures across small domains.
The term ‘‘small area’’ describes a subpopulation for which direct estimates from national survey data cannot be provided
with sufficient precision (Rao and Molina, 2015). Direct estimation for a domain of interest is based only upon the survey
data available for that domain at a specific time (Pfeffermann, 2013). National sample surveys of household income and
expenditure conducted on a small scale generally have inadequate sample sizes, so that direct estimates do not provide
stable predictions across small domains. An alternative approach is to use an indirect method, by incorporating auxiliary
information available at the small area level to improve the precision of the estimates, a technique known as ‘‘borrowing
strength’’ (Ghosh and Rao, 1994). Indirect estimators include model-based methods which relate the variable of interest
to a set of predictor variables. The statistical approach to poverty mapping generally employed by the World Bank, the ELL
method (Elbers et al., 2003), starts by building a unit level multiple regression model from the survey data having the form
Estimation of poverty measures is usually based upon information collected through complex sample surveys, and
so elements of complex design in the survey data (weighting, clustering and stratification) must be accounted for when
producing estimates and their standard errors (Wolter, 2007). With a complex data structure comprising more than one
element of survey design and several sampling stages, the variance of the quantity of interest does not have a tractable
mathematical form (Lohr, 1999). A common approach to variance estimation for complex survey data is to use some type
of replication or resampling method (Rust and Rao, 1996). A key challenge of using classification trees for the modelling is
to develop a methodology for generating valid standard errors for the small area estimates of poverty incidence.
Jackknife resampling (Quenouille, 1949) involves creating replicates by systematically omitting units in the original
sample. If θ̂(1) , . . . , θ̂(N) denote predictions from N jackknife replicates, then the jackknife point estimate, θ̂(·) , is computed as
the average of the θ̂(1) , . . . , θ̂(N) , and jackknife standard error of prediction, σ̂ is defined as
$$\hat{\sigma} = \sqrt{\frac{N-1}{N} \sum_{j=1}^{N} \left( \hat{\theta}_{(j)} - \hat{\theta}_{(\cdot)} \right)^{2}}.$$
The delete-cluster jackknife is used for survey data which is clustered, and when stratification is also present the jackknife
is applied to each stratum.
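As a rough illustration of the jackknife formula above, the following Python sketch computes the point estimate and standard error from a list of replicate predictions. The function name `jackknife_se` and the toy replicate values are illustrative assumptions and are not taken from the study.

```python
import math

def jackknife_se(replicates):
    """Return (point estimate, standard error) from N jackknife replicates.

    Implements sigma = sqrt((N - 1) / N * sum_j (theta_j - theta_bar)^2),
    where theta_bar is the average of the replicate predictions.
    """
    n = len(replicates)
    if n < 2:
        raise ValueError("need at least two jackknife replicates")
    theta_bar = sum(replicates) / n
    sum_sq = sum((t - theta_bar) ** 2 for t in replicates)
    return theta_bar, math.sqrt((n - 1) / n * sum_sq)

# Hypothetical replicate predictions of poverty incidence
point, se = jackknife_se([0.20, 0.22, 0.19, 0.21])
```

For clustered data the replicates would come from omitting one cluster at a time rather than one unit; the arithmetic is unchanged.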
The bootstrap method (Efron, 1979) creates replicates by sampling with replacement from the original sample. If the B
bootstrap predictions are represented by θ̂1∗, . . . , θ̂B∗, then the bootstrap point estimate, θ̂∗, is calculated as the average of
the θ̂1∗, . . . , θ̂B∗, and their standard deviation, σ̂∗, provides a bootstrap standard error, such that
$$\hat{\sigma}^{*} = \sqrt{\frac{1}{B} \sum_{b=1}^{B} \left( \hat{\theta}^{*}_{b} - \hat{\theta}^{*} \right)^{2}}.$$
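The bootstrap calculation above can be sketched in a few lines of Python. This is a minimal example under simple random sampling; the function name `bootstrap_se`, the number of replicates and the toy data are assumptions made for illustration only.

```python
import math
import random

def bootstrap_se(sample, estimator, n_boot=200, seed=0):
    """Return (bootstrap point estimate, bootstrap standard error).

    Each replicate resamples the units with replacement and applies
    `estimator`; the standard error uses divisor B, as in the formula above.
    """
    rng = random.Random(seed)
    n = len(sample)
    reps = []
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        reps.append(estimator(resample))
    theta_star = sum(reps) / n_boot
    variance = sum((r - theta_star) ** 2 for r in reps) / n_boot
    return theta_star, math.sqrt(variance)

# Hypothetical sample: 20 poor households (1) and 80 non-poor households (0)
poor = [1] * 20 + [0] * 80
est, se = bootstrap_se(poor, lambda s: sum(s) / len(s))
```

Here the estimator is a plain sample proportion; in the setting of the paper it would be replaced by the small area prediction obtained from a tree built on each replicate.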
When the survey data contains clustering, the bootstrap should be applied to clusters rather than individual units, to
account for the dependence structure in the data (Field and Welsh, 2007). For a complex survey design comprising clustering
and stratification, cluster bootstrap samples should be drawn independently from each survey stratum, so as to replicate the
original probability sampling design in each stratum (Rao and Wu, 1988). A straightforward application of the bootstrap for
stratified data (Shao, 2003) involves drawing independent bootstrap samples from each stratum, and combining these into
a single bootstrap replicate for model building. The size of each bootstrap sample is recommended to be one less than the
number of clusters in the particular stratum from which it is drawn (McCarthy and Snowden, 1985). Bootstrap sampling
weights should be modified to avoid introducing bias into the variance estimator (Rust and Rao, 1996).
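A minimal sketch of the stratified cluster bootstrap described above might look as follows. It only draws the cluster identifiers, taking one fewer than the number of sampled clusters in each stratum; the data layout, the function name `stratified_cluster_bootstrap` and the cluster labels are hypothetical, and the adjustment of sampling weights is not attempted here.

```python
import random

def stratified_cluster_bootstrap(clusters_by_stratum, rng):
    """Draw one bootstrap replicate of cluster IDs.

    clusters_by_stratum maps a stratum label to the list of cluster IDs
    sampled in that stratum. Within each stratum, n_h - 1 clusters are
    drawn with replacement, independently of the other strata.
    """
    replicate = {}
    for stratum, clusters in clusters_by_stratum.items():
        if len(clusters) < 2:
            raise ValueError(f"stratum {stratum!r} needs at least two clusters")
        n_draw = len(clusters) - 1
        replicate[stratum] = [rng.choice(clusters) for _ in range(n_draw)]
    return replicate

# Hypothetical design with two strata
design = {"urban": ["c1", "c2", "c3"], "rural": ["c4", "c5", "c6", "c7"]}
rep = stratified_cluster_bootstrap(design, random.Random(1))
```

All households belonging to a selected cluster would then be carried into the replicate dataset, so that the within-cluster dependence is preserved.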
ELL methodology uses resampling to estimate prediction error. All three sources of variability (model coefficients, cluster
effects and household level effects) are resampled for each new replicate as follows:
Parametric bootstrapping is used for the regression parameters, specifically β^b ∼ normal[β̂, V̂(β̂)], whilst a non-parametric
bootstrap is usually applied to the cluster and household level effects (Elbers et al., 2003).
The concept of tree-based statistical models was first proposed (Morgan and Sonquist, 1963) to handle variable
interactions in the analysis of weighted survey data, but their paper did not specifically address issues of clustering and
stratification. Decision tree methodology (Quinlan, 1990) was later brought into the statistical arena (Breiman et al., 1984).
Construction of a tree model is a top-down iterative process which recursively partitions the data space into smaller subsets
called nodes, then fits a simple model in each subset (Hastie et al., 2001) by assigning the same value to each unit in the
node. A subset which does not split further is known as a terminal node, or leaf. Trees are usually ‘‘pruned’’ to a suitable size
so as to prevent overfitting, the most common approach using the complexity parameter, cp, which balances tree size with
model error as defined by the misclassification rate. The value of cp is chosen using cross-validation (Breiman et al., 1984).
Another method of pruning is to restrict the depth of the tree (Therneau et al., 2014). A classification tree is built using data
which has a class variable, usually binary, as the response. The kth leaf of the tree is assigned a value which depends upon the
classifications of observations from the training data which end up in that node. These trees can then be used to predict for
new data where the class variable is missing. Each new observation follows a path down the tree dictated by the predictor
values of that observation, and is allocated the value assigned to the leaf of the tree at which its journey terminates. Each
new observation is assigned a class by majority vote, or given a posterior probability of class membership for each class.
Binary classification trees, which partition the data into only two alternatives at each split, are the most commonly used
tree-based models (Venables and Ripley, 2002).
In order to use classification trees for poverty mapping, three issues need to be considered. Firstly, the aim of the modelling
is not simply to classify individuals, but to classify and then aggregate individual estimates. Secondly, standard application
of classification tree models assumes that the data are identically and independently distributed; adjustments to the tree
method are required if it is to be applied to the complex survey data used for poverty mapping. Thirdly, the objective in
poverty estimation is not just to classify and aggregate, but also to generate standard errors for the aggregated predictions,
which requires a resampling method. However, building trees from resampled data is problematic, since trees are inherently
unstable (Li and Belford, 2002). Tree instability is mainly due to the hierarchical nature of the algorithm, since variation in
an upper split continues on down to successive splits and increases (Hastie et al., 2001). The next section outlines how these
three issues have been addressed in developing a methodology for estimating poverty incidence using classification tree
models.
Auxiliary information is incorporated into the proposed poverty mapping methodology by building a classification tree
from the survey data with household poverty status as the response and variables common to the survey and census
as predictors. The tree is then used to predict poverty status using the census data, a procedure similar to the ELL
method. When generating small area estimates of poverty incidence, P0, the natural approach is to obtain a prediction
of poverty incidence for each census household and then aggregate these predictions across the small domains of interest.
However, the poverty status of individuals rather than households is the preferred indicator of poverty incidence, so the
classification tree predictions at household level are weighted by the household sizes before amalgamating to the small area
estimates.
New cases from the census data are classified by the tree according to the leaf at which they terminate. If a definitive
classification is to be made, a majority rule is used so that each leaf is designated ‘‘poor’’ or ‘‘not poor’’ based on the status
of the majority of survey households terminating at that leaf. Alternatively, the proportion of poor households at a leaf can
be taken as the posterior probability of being poor for any household terminating at the leaf. Thus classification trees can
provide two types of estimates, here labelled hard and soft. For the hard type of tree estimates, the ith census household is
assigned the class designation of the leaf at which it terminates; either Yi = 1 if the leaf is classified as ‘‘poor’’, or Yi = 0 if the
leaf designation is ‘‘not poor’’. Let Sk represent the subset of census households which emerge at the kth leaf, and ni denote
the household size, i.e. total number of people in the household, for the ith census household. Then a hard classification tree
estimate of poverty incidence, P0(h), for a given small area is obtained by summing the number of poor people, ni Yi, over all
census households allocated to a particular leaf, summing across the leaves and then dividing by the total number of people
in the small area, as follows:
$$P_0^{(h)} = \frac{\sum_{k} \sum_{i \in S_k} n_i Y_i}{\sum_{k} \sum_{i \in S_k} n_i}. \tag{3}$$
Fig. 2. Weighted classification tree model for poverty incidence with cp = 0, and tree depth of 4, showing splitting rules and with hard and soft estimates
as terminal node labels; diagram created using rpart.plot package (Milborrow, 2015).
A soft tree prediction for the ith census household, where i ∈ Sk , is pk , the posterior probability of being poor for households
in the kth leaf. Then, P0(s) , the soft classification tree estimate of poverty incidence for the small area being considered is
defined as:
$$P_0^{(s)} = \frac{\sum_{k} \sum_{i \in S_k} n_i p_k}{\sum_{k} \sum_{i \in S_k} n_i}. \tag{4}$$
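To make the two aggregation rules in Eqs. (3) and (4) concrete, the sketch below computes a hard and a soft estimate for one small area from household sizes and leaf probabilities. The function name `tree_estimates` and the toy households are hypothetical; the majority rule is represented here by a 0.5 threshold on the leaf probability, with ties treated as ‘‘not poor’’, which is an assumption of this example.

```python
def tree_estimates(households, threshold=0.5):
    """Return (hard, soft) poverty incidence estimates for one small area.

    households is a list of (n_i, p_k) pairs: the household size and the
    posterior probability of being poor at the leaf where it terminates.
    """
    total_people = sum(n for n, _ in households)
    # Hard: every person in a household at a "poor" leaf counts as poor
    hard = sum(n for n, p in households if p > threshold) / total_people
    # Soft: each household contributes its size times the leaf probability
    soft = sum(n * p for n, p in households) / total_people
    return hard, soft

# Hypothetical census households in one small area
area = [(5, 0.8), (3, 0.3), (4, 0.6), (8, 0.1)]
hard, soft = tree_estimates(area)
```

In this toy area the hard estimate counts nine of the twenty people as poor, while the soft estimate spreads the leaf probabilities over all twenty, so the two need not agree.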
At the prediction stage, the poverty status of each household in the census is unknown but is estimated from the tree, so
conditional on the fitted tree the Yi ’s can be considered as Bernoulli random variables. For the ith census household, such
that i ∈ Sk , Yi ∼ Bern(pk ). Consequently, P0(s) is equivalent to the posterior expected value of poverty incidence across the
leaves of the tree, that is the posterior mean of the proportion of poor individuals in the small area, since:
$$E\left[ \frac{\sum_{k} \sum_{i \in S_k} n_i Y_i}{\sum_{k} \sum_{i \in S_k} n_i} \right] = \frac{\sum_{k} \sum_{i \in S_k} n_i p_k}{\sum_{k} \sum_{i \in S_k} n_i}. \tag{5}$$
Note that the designation of the particular small area has been suppressed in (3)–(5). To illustrate, Fig. 2 displays the tree
diagram of a weighted classification tree model from Nepal with poverty status as response, and pruned to a depth of four.
The meanings of the variable labels are given in Table 2. A square box indicates an internal node, with terminal nodes
represented by oval boxes. Each leaf is labelled with a hard estimate, ‘‘poor’’ or ‘‘not poor’’, and a soft estimate, the posterior
probability of being poor for that leaf, represented by the decimal value in the lower half of the box. Fig. 2 suggests that
the three most useful variables provided by the model to predict poverty in Nepal are the proportion of households in a
particular ward owning a television set (tvw), the proportion of households in the ward with no proper toilet (toilet3w) and
the total number of people in the household (hhsize). The ward-level variables were obtained by averaging census variables
at ward level, where ward is the smallest administrative unit in Nepal; each ilaka is made up of several wards (Haslett and
Jones, 2006). The tree model represented by Fig. 2 was pruned to a depth of four for this illustrative example because it is
easier to read and interpret, but a larger tree would be preferred in practice as it should provide more accurate estimates.
Using a weighted classification tree that incorporates the survey weights ensures that the model built from survey data is
representative of the population (Toth and Eltinge, 2011). The rpart function (Therneau et al., 2014), the statistical software
used to build the trees, has a weights argument that incorporates survey weights into the splitting criterion, and into the node
summaries (Therneau, 2015). When the survey data contains clustering and stratification, these design features must also
be incorporated into any resampling procedure for variance estimation, as described in the next section.
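One way to picture how survey weights can enter a node summary is a weighted proportion of poor households at each leaf. The sketch below is only an illustration of that idea, not a description of the internals of rpart; the function name `weighted_leaf_probabilities`, the leaf labels and the weights are hypothetical, and strictly positive weights are assumed.

```python
from collections import defaultdict

def weighted_leaf_probabilities(records):
    """Return {leaf: weighted proportion of poor households}.

    records is an iterable of (leaf id, survey weight, poor indicator),
    with the indicator equal to 1 for poor and 0 for not poor.
    """
    weighted_poor = defaultdict(float)
    weight_total = defaultdict(float)
    for leaf, weight, is_poor in records:
        weighted_poor[leaf] += weight * is_poor
        weight_total[leaf] += weight
    return {leaf: weighted_poor[leaf] / weight_total[leaf] for leaf in weight_total}

# Hypothetical survey households already allocated to two leaves
survey = [("A", 100.0, 1), ("A", 300.0, 0),
          ("B", 50.0, 1), ("B", 150.0, 1), ("B", 200.0, 0)]
probs = weighted_leaf_probabilities(survey)
```

In leaf A one of two households is poor, but the weighted proportion is 0.25 because the poor household represents fewer population units.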
A single classification tree can provide small area estimates of poverty incidence, but makes no provision for estimating
standard errors of those predictions. Variance estimation for complex survey data requires resampling of the survey data,
but this can be problematic due to the inherent instability of trees (Turney, 1995). The proposed methodology of variance
estimation using classification trees is discussed firstly for survey data with a simple random sampling structure, and then
for complex survey data containing clustering and stratification.
with Yi ∼ Bernoulli(πjk) as the true poverty status for the ith survey household, where j and k are such that i ∈ Cj and
i ∈ Sk . That is, πjk denotes the posterior probability of being poor for households in cluster j and leaf k. The linear predictor
ηjk comprises the fixed effect φk , representing the posterior probability of being poor for the kth leaf, corresponding to p̂k on
the logit scale, and the random effect cj representing the variability in the probability of being poor due to cluster effects.
A similar use of GLMMs was made by Sela and Simonoff (2012) to adapt regression trees for clustered
data; they found that the re-estimated tree did not really lead to better predictions. Our use is different since we are using
the GLMM results to add cluster effects to our predictions for variance estimation.
At each bootstrap iteration, cluster effects are randomly selected as described above and incorporated into the prediction
for each household in a cluster. This is similar to the way in which the ELL method uses resampled cluster-level residuals in
producing its replicate estimates. Omitting this step produces standard errors that are too small, unless the areas contain a
large enough number of clusters for the cluster effects to average out, which is rarely the case.
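The logit-scale decomposition described above, a leaf effect plus a cluster effect, can be sketched as follows. This is a minimal example assuming that the cluster effect is drawn with replacement from a list of estimated effects; that selection rule, the function names and all numerical values are assumptions of the sketch rather than details confirmed by the text.

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

def soft_prediction_with_cluster_effect(p_leaf, cluster_effect):
    """Return pi_jk = expit(phi_k + c_j), with phi_k = logit(p_leaf).

    p_leaf must lie strictly between 0 and 1, since the logit is
    undefined for pure leaves.
    """
    return expit(logit(p_leaf) + cluster_effect)

rng = random.Random(42)
estimated_effects = [-0.4, -0.1, 0.0, 0.2, 0.3]  # hypothetical cluster effects c_j
c_j = rng.choice(estimated_effects)              # one draw for a whole cluster
pi_jk = soft_prediction_with_cluster_effect(0.3, c_j)
```

Every household in the same census cluster would share the same drawn effect within a bootstrap iteration, which is what induces the extra between-cluster variability in the replicate estimates.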
Fig. 3. ELL estimates of poverty incidence compared with the equivalent hard and soft classification tree estimates for eighteen ilakas in a district of Nepal.
The dashed line, y = x, represents equal values for ELL and tree estimates.
$$w^{*}_{hij} = w_{hij} \, \frac{\sum_{\text{stratum}} w_{hij}}{\sum_{\text{BSsample}} w_{hij}}. \tag{8}$$
Census predictions are then adjusted to incorporate cluster effects, as outlined in Eq. (6). Results from applying the
methodologies described above are provided in the next section.
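The weight adjustment in Eq. (8) amounts to rescaling the weights in a bootstrap sample so that they sum to the original stratum total. A minimal sketch, with a hypothetical function name and made-up weights, is:

```python
def rescale_bootstrap_weights(stratum_weights, bootstrap_weights):
    """Return bootstrap weights rescaled as in Eq. (8).

    Each weight in the bootstrap sample is multiplied by the ratio of the
    total weight in the original stratum sample to the total weight in the
    bootstrap sample drawn from that stratum.
    """
    factor = sum(stratum_weights) / sum(bootstrap_weights)
    return [w * factor for w in bootstrap_weights]

# Hypothetical weights for one stratum and for one bootstrap sample from it
stratum = [10.0, 20.0, 30.0, 40.0]
boot = [10.0, 10.0, 30.0]
new_weights = rescale_bootstrap_weights(stratum, boot)
```

After rescaling, the bootstrap weights in this example again sum to 100, the total weight of the stratum in the original sample.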
4. Results
The survey data used in the classification tree modelling was derived from the 2003/4 Nepal Living Standards Survey,
which comprises a two-stage stratified cluster sampling design, as described in the Nepal poverty mapping report (Haslett
and Jones, 2006). At the first stage, the primary sampling units (PSUs) were selected with probability proportional to size
independently within six strata. Then, within each chosen cluster, or PSU, twelve households were selected using systematic
sampling. This gave a dataset of 3912 households, which was processed to create the log per capita expenditure variable from
which poverty status is derived, and a set of possible predictors that match variables available in the 2003 Nepal Census.
To apply the proposed poverty tree methodology, a weighted classification tree model was built from the survey dataset
using the same predictor variables as were used in the model for the published estimates (Haslett and Jones, 2006), and
pruned with a complexity parameter cp = 0.005 chosen by cross-validation. The tree was then used to generate hard and
soft small area estimates of poverty incidence for 18 ilakas in one of the districts of Nepal. The classification tree estimates
are compared in Fig. 3 with the published estimates from Nepal generated by the ELL method. The dashed line in Fig. 3, the
diagonal y = x, represents equality for ELL and tree estimates. Hard tree estimates are consistently lower than soft tree
estimates for all eighteen small areas, indicating better agreement with the ELL-based estimates for the soft type of tree
prediction than the hard type. However, both ELL and tree estimates are subject to error, so neither can be regarded as the
true value. If the comparison between methods suggests bias, it is difficult to determine which method is producing the bias.
Fig. 4. Actual coverage of 100 intervals for a nominal level of 95%, from 100 Monte Carlo simulations involving jackknife and bootstrap resampling with hard
and soft tree estimates. True P0 = 0.196 is the actual poverty status of the single census dataset used for all simulations. Percentage empirical coverage is
displayed for the four variance estimation methods. Note that jackknife hard estimates have a much larger X axis range than the other three methods.
To develop a method for generating valid standard errors of prediction of poverty incidence using a classification tree,
variance estimation methods of jackknife and bootstrap resampling, with hard and soft tree predictions, were tested in a
Monte Carlo study. Survey and census datasets with a simple random sampling structure were generated from the same
linear model based on the Nepal survey data. The data for each household are the log-transformed per capita expenditure
Y and a vector of 25 predictors x comprising the variables used in the published Nepal poverty mapping project (Haslett
and Jones, 2006). Since the Nepal census dataset was not available for this project (except for one unspecified District),
the simulation study was based on the survey data only. To allow for variation in the x variables, we generated (Y, x) as a
complete vector, with its mean and covariance structure estimated from the survey data. Denoting this 26 × 1 vector by
D, and its mean vector and covariance matrix for the Nepal survey data by µ and Σ respectively, new observations were
simulated using
$$D^{*} = \mu + \Sigma^{1/2} Z, \tag{9}$$
where Z ∼ N26(0, I). The first component, representing the simulated response variable log per capita expenditure, was
converted to a binary variable indicating poverty status, for building a classification tree using the remaining components as
predictors. For each of 100 Monte Carlo simulations, a survey dataset of 3000 households and a census ilaka of 6000, the typical
size of a Nepal ilaka, were randomly generated as above. Bootstrap and jackknife replicates of the simulated survey data
were used to build multiple classification trees, each pruned to a depth of five, with 100 bootstraps per dataset. These trees
were then applied to the simulated census data to generate multiple bootstrap and jackknife, hard and soft predictions, from
which point estimates of poverty incidence and associated standard errors were computed. Empirical parametric confidence
intervals, of the form θ̂(·) ± zα σ̂ and θ̂ ∗ ± zα σ̂ ∗ , were constructed using both hard and soft tree estimates and a range of
values of α , and the actual coverage compared with nominal coverage. Fig. 4 displays actual coverage for a 95% nominal level
of a hundred intervals from Monte Carlo simulations. The graphs show that hard tree estimates are biased, while soft tree
estimates are approximately unbiased. Bootstrap resampling gives much smaller standard errors than the jackknife. Extreme
overcoverage of jackknife hard and soft intervals was due to very large standard errors. It is known that the jackknife method
can be unstable for estimators which are not smooth (Shao and Tu, 1995), but the delete-k jackknife produced similar results;
instability in the trees occasionally produced very different estimates which inflated the standard error estimates.
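The data-generating step in Eq. (9) can be sketched in plain Python for a small example. The sketch uses a Cholesky factor as the matrix square root, which is one valid choice among several and is an assumption here; the two-dimensional mean vector, the covariance matrix and the poverty line are made-up values, whereas the study used a 26-dimensional vector estimated from the survey data.

```python
import math
import random

def cholesky(sigma):
    """Return lower-triangular L with L L^T = sigma (symmetric positive definite)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L

def simulate(mu, L, rng):
    """Draw one observation D* = mu + L Z with Z standard normal."""
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]

mu = [9.0, 0.5]                        # hypothetical means: log expenditure, one predictor
sigma = [[0.25, 0.05], [0.05, 0.04]]   # hypothetical covariance matrix
L = cholesky(sigma)
rng = random.Random(0)
draws = [simulate(mu, L, rng) for _ in range(5000)]

poverty_line = 8.6                     # hypothetical poverty line on the log scale
status = [1 if d[0] < poverty_line else 0 for d in draws]
```

The binary `status` variable plays the role of the simulated poverty indicator, and the remaining components of each draw would serve as predictors for the tree.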
The tree algorithm is not a smooth process, being a ‘‘greedy’’ top-down binary partitioning method (Hastie et al., 2001).
The granular nature of the hard estimates, classifying each household as poor or not poor, also seems to contribute to
larger bias and less precision with this type of tree prediction, as compared with employing soft estimation, which uses the
posterior probability of being poor. Bootstrap soft estimation is clearly the preferred method for classification trees when the
data has a simple random sampling structure, giving approximately unbiased estimates and standard errors which perform
appropriately for calculating confidence intervals.
Fig. 5. Empirical coverage for a nominal level of 95% when cluster and naive bootstrap methods are applied to clustered data.
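Empirical coverage of the kind reported in the simulations can be computed by checking how often an interval of the form estimate ± z × standard error contains the true value. The sketch below assumes one estimate and one standard error per Monte Carlo run; the function name `empirical_coverage` and the four toy runs are hypothetical, with only the true value 0.196 borrowed from the Fig. 4 caption.

```python
def empirical_coverage(estimates, std_errors, true_value, z=1.96):
    """Return the proportion of intervals est +/- z * se that contain true_value."""
    hits = sum(
        1
        for est, se in zip(estimates, std_errors)
        if est - z * se <= true_value <= est + z * se
    )
    return hits / len(estimates)

# Hypothetical results from four Monte Carlo runs
estimates = [0.19, 0.21, 0.25, 0.18]
std_errors = [0.02, 0.02, 0.02, 0.02]
coverage = empirical_coverage(estimates, std_errors, true_value=0.196)
```

Three of the four toy intervals contain 0.196, so the empirical coverage in this example is 0.75; with valid standard errors and many runs it should approach the nominal level.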
Fig. 6. Empirical coverage for cluster bootstrap soft estimation with cluster effects incorporated into predictions, for three types of intervals: ‘‘Soft’’ refers to
intervals centred about the average of bootstrap soft estimates; ‘‘Full’’ refers to intervals centred about the estimate based on all the survey data; ‘‘Percent’’
refers to bootstrap percentile intervals; ‘‘None’’ indicates that cluster effects were not incorporated into predictions, but only introduced into the data
structure.
to be incorporated into the predictions for valid standard errors of prediction. Incorporating cluster effects into soft tree
predictions has provided standard errors of prediction that give approximately the correct coverage over a plausible range
of cluster variances.
The actual Nepal survey data used stratification as well as clustering, as is usual for these poverty mapping exercises. Thus
to obtain small area estimates for Nepal, the methodology was extended to incorporate stratification into the modelling
process by selecting cluster bootstrap samples independently from each survey stratum, and constructing appropriate
bootstrap weights, as outlined in Section 3.3.3. Classification tree small area estimates of poverty incidence and associated
standard error of prediction were generated for 18 ilakas in a district of Nepal, with cp chosen by cross-validation for the full
survey and then kept fixed in the bootstrapping. The tree estimates are compared in Table 1 with published results obtained
using the ELL method (Haslett and Jones, 2006). The tree estimates represent point estimates of poverty incidence (P0) for
each ilaka computed as the mean of 100 bootstrap soft estimates. The small area estimates of poverty incidence obtained by
the classification tree modelling are graphed against the published ELL estimates in Fig. 7.
Table 1
Comparison of ELL and bootstrap soft classification tree estimates and standard errors for ilakas in one district of Nepal, with the ratio of coefficients of variation (CV ratio) of the tree to the ELL estimates, and the population size.
Ilaka   ELL P0   ELL se   Tree P0   Tree se   CV ratio   Size
1       0.742    0.030    0.653     0.058     2.162      3971
2       0.525    0.033    0.463     0.045     1.555      4622
3       0.619    0.035    0.545     0.061     1.939      4170
4       0.468    0.024    0.397     0.042     2.034      4168
5       0.297    0.028    0.304     0.046     1.637      4320
6       0.472    0.027    0.460     0.035     1.316      4818
7       0.704    0.036    0.611     0.065     2.099      2587
8       0.204    0.031    0.188     0.047     1.657      3666
9       0.201    0.028    0.190     0.057     2.108      3112
10      0.122    0.024    0.135     0.035     1.315      2645
11      0.242    0.020    0.249     0.034     1.672      4569
12      0.202    0.027    0.229     0.035     1.148      5624
13      0.213    0.020    0.238     0.033     1.422      5231
14      0.415    0.031    0.414     0.038     1.249      3184
15      0.311    0.026    0.300     0.033     1.311      3368
16      0.063    0.021    0.073     0.028     1.116      3015
17      0.130    0.040    0.099     0.046     1.501      2255
18      0.137    0.034    0.094     0.043     1.877      5134
Fig. 7. Comparing classification tree and ELL estimates for 18 ilakas in a district of Nepal.
Fully incorporating complex survey design components of clustering and stratification into the methodology has resulted
in point estimates which are closer to their ELL equivalents than those produced by the tree model which only incorporates
survey weightings (Fig. 3), and more consistent in that the distance below the diagonal y = x is similar for poverty levels
greater than 0.5. However, the tree estimates are still typically lower than the ELL estimates for the higher poverty rates.
Standard errors of prediction from the classification tree model are of the same order of magnitude as the ELL standard errors,
but are always larger. This is expected as the tree estimates are allowing for model uncertainty whereas the ELL standard
errors assume that the fitted model is correct. The coefficients of variation for the tree estimates are up to two times those
of the ELL estimates.
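The ratio of coefficients of variation can be checked approximately from the rounded entries of Table 1. The sketch below defines the ratio as the tree CV (se divided by P0) over the ELL CV, which is how the table caption is read here; the function name `cv_ratio` is illustrative.

```python
def cv_ratio(p0_ell, se_ell, p0_tree, se_tree):
    """Return the coefficient of variation of the tree estimate over that of ELL."""
    cv_tree = se_tree / p0_tree
    cv_ell = se_ell / p0_ell
    return cv_tree / cv_ell

# Rounded values for ilaka 2 as printed in Table 1
ratio = cv_ratio(p0_ell=0.525, se_ell=0.033, p0_tree=0.463, se_tree=0.045)
```

For ilaka 2 this gives a ratio of about 1.55, close to the tabulated 1.555; small differences are to be expected because the table entries are rounded to three decimal places.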
If the methods are applied to give estimates for the whole district, ELL gives P0 = 0.351, se = 0.014; the tree estimate
is 0.327 (0.023) and the direct estimate based on the five sampled PSUs is 0.275 (0.149). Again the tree estimate is slightly
lower with a higher standard error; the direct estimate, even at this level, is too imprecise to be useful.
5. Discussion
A methodology has been presented for small area estimation of poverty incidence using a classification tree, as an
alternative approach to the standard ELL approach using a multiple regression model for poverty mapping. Tree-based
models have some advantages, particularly if an automatic method, with minimal user input, is desired, since they handle
non-linearities and interactions easily, and have built-in variable selection procedures that can avoid over-fitting.
Estimating poverty incidence using a classification tree involves predictions for groups rather than for individuals, the
latter being the usual application of tree models. Simulation suggests that soft tree estimates are approximately unbiased,
whereas hard estimates are not. The hard estimate ignores minority cases at each terminal node, which is a probable cause of
the bias. Hard estimates are also much more affected by tree instability, whereby a small change in the data can lead to a
very different tree.
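The difference between the two amalgamation rules can be illustrated with a small numerical sketch. It assumes that each unit in a small area carries the posterior probability of being poor from its terminal node, that the soft estimate averages these probabilities, and that the hard estimate averages the majority-class labels; the function names, the optional weights and the 0.5 threshold are assumptions made for illustration.

```python
def soft_estimate(node_probs, weights=None):
    """Weighted mean of terminal-node posterior probabilities of being poor."""
    weights = weights or [1.0] * len(node_probs)
    return sum(w * p for w, p in zip(weights, node_probs)) / sum(weights)


def hard_estimate(node_probs, weights=None, threshold=0.5):
    """Weighted share of units whose terminal node is classified as poor."""
    weights = weights or [1.0] * len(node_probs)
    labels = [1.0 if p > threshold else 0.0 for p in node_probs]
    return sum(w * y for w, y in zip(weights, labels)) / sum(weights)


# Toy small area: 60 units fall in a node with P(poor) = 0.4,
# 40 units fall in a node with P(poor) = 0.7.
probs = [0.4] * 60 + [0.7] * 40

soft = soft_estimate(probs)   # counts the 40% minority in the first node
hard = hard_estimate(probs)   # treats the whole first node as non-poor

print(f"soft = {soft:.2f}, hard = {hard:.2f}")
```

In this toy area the soft estimate is 0.52 while the hard estimate is 0.40: the hard rule discards the poor minority in the first node and the non-poor minority in the second, and the two omissions need not cancel.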
64 P. Bilton et al. / Computational Statistics and Data Analysis 115 (2017) 53–66
Table 2
Definitions of the main effects variables used for the published Nepal poverty
mapping estimates (Haslett and Jones, 2006). Some variables, as indicated,
are census means or GIS-derived variables at ward or vdc (Village Development
Committee) level.
Label Meaning
group1 Urban Kathmandu
group2 Urban Other
group3 Rural Western Mt + Hills
group4 Rural Eastern Mt + Hills
group5 Rural Western Terai
group6 Rural Eastern Terai
hhsize Household size
skids6 Propn kids 0–6 in hh
skids714 Propn kids 7–14 in hh
samen Propn adult men
hage2 hh head aged 30–44
hethn1 hh head Brahmin/Chhetri (B/C)
hethn2 hh head Terai Middle Caste (TMC)
hethn3 hh head Dalit (Dit)
hethn4 hh head Newar (Nwr)
hethn5 hh head Hill Janajatis (HlJ)
hethn6 hh head Terai Janajatis (TrJ)
hethn7 hh head Other castes (Otm)
hrelig3 hh head Muslim
remtab hh member migrated abroad
hfem Female headed hh
hutype1 House permanent (Perm)
hutype2 House semi permanent (Sm)
hutype3 House temporary (Oth)
huown2 House rented or free
npltry4 Rural with no poultry
nagar8 Rural with land 0.5–1.0 ha
nagar9 Rural with land 1.0–2.0 ha
ckfuel3w Cooking fuel LP/gas, ward
toilet3w Propn hh with no toilet, ward
ftoiletw Propn hh with flush toilet, ward
ltfuel3w Propn hh with lighting fuel other than electricity/kerosene
edulv3w Propn 15+ pop 5–7 yr completed, ward
elecw Propn hh with electricity, ward
tvw Propn hh with tv, ward
pflandv Propn hh with land-owning females, vdc
dmortv mortality rate due to infectious disease, vdc
meanht Mean elevation (’000 m) above sea level, vdc
meanslp Mean slope (as %), vdc
In adapting the standard tree model for a complex survey design structure, a weighted tree model takes account of survey
sampling weights. Variance estimation can be carried out using the cluster bootstrap independently within each stratum,
thus accounting for stratification and clustering in the data structure; however, it is necessary to augment the soft tree
predictions with cluster residuals, derived from a parametric estimate of cluster variance. Jackknife resampling and hard
tree prediction were found to be unsuitable for variance estimation using trees. The jackknife is inconsistent for
non-smooth estimators (Miller, 1974), such as those from a tree model, and it produced large standard errors. The automatic
variable selection used to fit the trees to resampled data facilitates the incorporation of model uncertainty in the variance
estimates.
The form of the tree may be unstable, but the estimates produced from the trees are not necessarily unstable, depending
on the method chosen. Soft predictions combined with the bootstrap give stable estimates, so the advantage of automatic
variable selection can be retained.
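The resampling step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: whole clusters are drawn with replacement independently within each stratum, and a user-supplied estimator is recomputed on each replicate. The record layout and the names `resample_clusters`, `bootstrap_se` and `poor_share` are hypothetical. In the proposed method the estimator would refit the weighted tree and amalgamate soft predictions augmented with cluster residuals; here a simple sample proportion stands in for that step, and the residual augmentation is omitted.

```python
import random
import statistics
from collections import defaultdict


def resample_clusters(records, rng):
    """One bootstrap replicate: within each stratum, draw whole clusters
    with replacement, keeping the number of clusters per stratum fixed."""
    by_stratum = defaultdict(lambda: defaultdict(list))
    for rec in records:
        by_stratum[rec["stratum"]][rec["cluster"]].append(rec)

    replicate = []
    for stratum in sorted(by_stratum):
        clusters = by_stratum[stratum]
        ids = sorted(clusters)
        for new_id, chosen in enumerate(rng.choices(ids, k=len(ids))):
            for rec in clusters[chosen]:
                # Relabel so a cluster drawn twice counts as two clusters.
                replicate.append({**rec, "cluster": (stratum, new_id)})
    return replicate


def bootstrap_se(records, estimator, n_boot=200, seed=1):
    """Standard deviation of the estimator over cluster-bootstrap replicates."""
    rng = random.Random(seed)
    estimates = [estimator(resample_clusters(records, rng)) for _ in range(n_boot)]
    return statistics.stdev(estimates)


def poor_share(records):
    """Stand-in estimator (assumption): unweighted proportion of poor units."""
    return sum(r["poor"] for r in records) / len(records)


# Toy survey: two strata, five clusters, three households per cluster.
toy = [
    ("A", 1, [1, 1, 0]), ("A", 2, [0, 0, 0]), ("A", 3, [1, 0, 0]),
    ("B", 1, [1, 1, 1]), ("B", 2, [0, 1, 0]),
]
records = [
    {"stratum": s, "cluster": c, "poor": flag}
    for s, c, flags in toy
    for flag in flags
]

se = bootstrap_se(records, poor_share)
print(f"cluster-bootstrap se of the toy proportion: {se:.3f}")
```

Because the estimator is passed in as a function, refitting the model inside each replicate is what allows the variable selection to change from one bootstrap sample to the next, which is how model uncertainty enters the standard error.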
The proposed methodology for adapting classification trees for poverty mapping from complex survey data produces
standard errors for the small area poverty estimates that appear to be valid, in that normal parametric intervals based on
these standard errors appear to have approximately nominal coverage, even though the sampling distribution of poverty
incidence is not expected to be normal. Standard errors from the tree method were slightly larger than the ELL
standard errors, which is expected since the ELL method does not incorporate uncertainty in the choice of predictors, whereas each
new bootstrap sample produces a different tree model based on a different selection of covariates. The bootstrap procedure
here is similar to the Random Forest methodology (Breiman, 2001), except that the latter uses a simple bootstrap sample and
forces change in the selected predictors by choosing a random subset at each iteration. Other ensemble methods (Dietterich,
2000) could be explored as ways of overcoming tree instability in poverty mapping.
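For reference, the normal parametric intervals mentioned above take the form of the estimate plus or minus a normal quantile times the standard error. A minimal sketch, assuming a two-sided interval that is not truncated to [0, 1] and using the district-level tree estimate quoted earlier; the function name is illustrative only.

```python
from statistics import NormalDist


def normal_interval(estimate, se, level=0.95):
    """Two-sided normal interval: estimate +/- z * se (no truncation to [0, 1])."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se


# District-level tree estimate from the text: P0 = 0.327, se = 0.023.
lower, upper = normal_interval(0.327, 0.023)
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
```

Coverage is then assessed in simulation as the proportion of such intervals that contain the true small area poverty incidence.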
The noted differences between the point estimates of poverty incidence for the real Nepal data produced by the
classification tree and the ELL method, at the higher poverty rates, could be an important feature in comparing tree based
models with regression based methods. However, since allocation of aid is based upon relative comparisons of levels of
poverty, this may not be of practical importance. It is difficult in the absence of a validation study to determine which type of
model provides the more accurate results. The ELL method applied in Nepal built a single model for the entire dataset, with
regional variations included via specified regional interaction terms. The model assumes linearity in the numerical
covariates and additivity of effects. In contrast, the tree fits a model that is constant within each partition of the
dataset, essentially producing a step function across the data space; it does not assume linearity and naturally investigates
many interactions, including regional ones.
Extension of the methodology to estimate child-level indicators such as rates of stunting and underweight is straightfor-
ward. Random effects at small area level could easily be included if desired by including them in the GLMM and adding the
EBLUPs to the classification tree predictions in addition to, or instead of, the random cluster effects. In future work, we hope
to explore the use of regression tree models, rather than classification trees, for poverty mapping, since this should allow the
estimation of poverty gap and poverty severity in addition to poverty incidence. There are many alternative classification
technologies, such as neural networks and support vector machines, that could also be explored.
Acknowledgements
The authors are grateful to the Associate Editor and four referees for their helpful comments and suggestions. The first
author was supported in this work by a Massey University Doctoral Scholarship. The authors acknowledge the assistance
of the UN World Food Programme and the Nepal Central Bureau of Statistics who provided extensive background for this
study. Any errors or omissions remain the sole responsibility of the authors.
References
Benavent, R., Morales, D., 2016. Multivariate Fay-Herriot models for small area estimation. Comput. Statist. Data Anal. 94 (4), 372–390.
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A., 1984. Classification and Regression Trees. Chapman and Hall.
Chambers, J.M., Hastie, T.J., 1992. Statistical Models in S. Wadsworth & Brooks/Cole.
Dietterich, T.G., 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization.
Mach. Learn. 40 (2), 139–157.
Efron, B., 1979. 1977 Rietz lecture - Bootstrap methods - another look at the jackknife. Ann. Statist. 7 (1), 1–26.
Elbers, C., Lanjouw, J.O., Lanjouw, P., 2003. Micro-level estimation of poverty and inequality. Econometrica 71 (1), 355–364.
Esteban, M., Morales, M., Pérez, A., Santamaría, L., 2012. Small area estimation of poverty proportions under area-level time models. Comput. Statist. Data
Anal. 56 (10), 2840–2855.
Fabrizi, E., Ferrante, M.R., Pacei, S., Trivisano, C., 2011. Hierarchical Bayes multivariate estimation of poverty rates based on increasing thresholds for small
domains. Comput. Stat. Data Anal. 55 (4), 1736–1747.
Field, C.A., Welsh, A.H., 2007. Bootstrapping clustered data. J. R. Stat. Soc. Ser. B - Stat. Methodol. 69 (Part 3), 369–390.
Ghosh, M., Rao, J., 1994. Small-area estimation - an appraisal. Stat. Sci. 9 (1), 55–76.
Harrell, F., 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer.
Haslett, S.J., Jones, G., 2006. ‘Small area estimation of poverty, caloric intake and malnutrition in Nepal’. Tech. rep. Nepal Central Bureau of Statistics/World
Food Programme, United Nations/World Bank, Kathmandu, Nepal. URL https://www.wfp.org/content/nepal-small-area-estimation-poverty-caloric-intake-and-malnutrition-september-2006.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer.
Li, R.-H., Belford, G.G., 2002. Instability of decision tree classification algorithms. In: Proceedings of the Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, pp. 570–575.
Lohr, S.L., 1999. Sampling: Design and Analysis. Duxbury Press.
Marchetti, S., Tzavidis, N., Pratesi, M., 2012. Non-parametric bootstrap mean squared error estimation for M-quantile estimators of small area averages,
quantiles and poverty indicators. Comput. Statist. Data Anal. 56 (10), 2889–2902.
McCarthy, P.J., Snowden, C.B., 1985. The bootstrap and finite population sampling. In: Hyattsville Md US National Center for Health Statistics [NCHS] 1985.
Milborrow, S., 2015. rpart.plot: Plot rpart Models. An Enhanced Version of plot.rpart. R package version 1.5.2. URL http://CRAN.R-project.org/package=rpart.plot.
Miller, R.G., 1974. The jackknife - a review. Biometrika 61 (1), 1–15.
Molina, I., Rao, J., 2010. Small area estimation of poverty indicators. Canad. J. Statist. 38 (3), 369–385.
Morgan, J.N., Sonquist, J.A., 1963. Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc. 58 (302), 415–434.
Pfeffermann, D., 2013. New important developments in small area estimation. Stat. Sci. 28 (1), 40–68.
Pratesi, M., 2016. Analysis of Poverty Data By Small Area Estimation. Wiley.
Quenouille, M.H., 1949. Problems in plane sampling. Ann. Math. Statist. 20 (3), 355–375.
Quinlan, J.R., 1990. Decision trees and decision-making. IEEE Trans. Syst. 20 (2), 339–346.
Rao, J.N., Molina, I., 2015. Small Area Estimation. John Wiley & Sons.
Rao, J., Wu, C., 1988. Resampling inference with complex survey data. J. Amer. Statist. Assoc. 83 (401), 231–241.
Ravallion, M., 1992. ‘Poverty comparisons: a guide to concepts and methods’. Tech. Rep. World Bank, living standards measurement study (LSMS) working
paper; no. LSM 88.
Rust, K.F., Rao, J., 1996. Variance estimation for complex surveys using replication techniques. Stat. Methods Med. Res. 5, 283–310.
Sela, R.J., Simonoff, J.S., 2012. RE-EM trees: a data mining approach for longitudinal and clustered data. Mach. Learn. 86, 169–207.
Shao, J., 2003. Impact of the bootstrap on sample surveys. Stat. Sci. 18 (2), 191–198.
Shao, J., Tu, D., 1995. The Jackknife and the Bootstrap. Springer-Verlag.
Therneau, T., 2015. User written splitting functions for RPART. Mayo Clinic.
Therneau, T., Atkinson, B., Ripley, B., 2014. rpart: Recursive Partitioning and Regression Trees. R package version 4.1–5. URL http://CRAN.R-project.org/package=rpart.
Toth, D., Eltinge, J., 2011. Building consistent regression trees from complex sample data. J. Amer. Statist. Assoc. 106 (496), 1626–1636.
Turney, P., 1995. Bias and the quantification of stability. Mach. Learn. 20 (1–2), 23–33.
United Nations, 2016a. Map of poverty incidence in Nepal. URL http://www.un.org.np/sites/default/files/maps/tid_113/Poverty-Map.pdf.
United Nations, 2016b. Sustainable Development Goals. URL http://www.un.org/sustainabledevelopment/sustainable-development-goals/.
Venables, W.N., Ripley, B.D., 2002. Modern Applied Statistics with S. Springer.
Wolter, K.M., 2007. Introduction To Variance Estimation, second ed. Springer.
World Bank, 2015. Poverty mapping. URL http://go.worldbank.org/9CYUFEUQ30.