
Computational Statistics and Data Analysis 115 (2017) 53–66


Classification trees for poverty mapping✩


Penny Bilton a, Geoff Jones a,*, Siva Ganesh b, Steve Haslett a,c
a Institute of Fundamental Sciences (Statistics), Massey University, Palmerston North, New Zealand
b AgResearch, Bioinformatics Maths & Stats, Palmerston North, New Zealand
c Statistical Consulting Unit, Australian National University, Canberra, Australia

highlights

• Adapts the use of classification trees for small area estimation.


• Produces standard errors for classification tree models fitted to survey data.
• Important applications involving the allocation of millions of dollars of aid to Third World countries.

Article info

Article history:
Received 13 October 2016
Received in revised form 22 May 2017
Accepted 23 May 2017
Available online 1 June 2017

Keywords:
Small area estimation
Sustainable Development Goals
Complex survey data
Clustered data

Abstract

Poverty mapping uses small area estimation techniques to estimate levels of deprivation (poverty, undernutrition) across small geographic domains within a country. These estimates are then displayed on a poverty map, and used by aid organizations such as the United Nations World Food Programme for the efficient allocation of aid. Current methodology employs unit-level regression modelling of a target variable (household income, child weight-for-age). An alternative modelling technique is proposed, using tree-based methods, that has some practical advantages. Alternative ways of amalgamating the unit-level predictions from classification trees to small area level are explored, adapting the trees to account for the survey design, and resampling strategies are proposed for producing standard errors. The methodology is evaluated using both real data and simulations based on a poverty mapping study in Nepal. The simulations suggest that amalgamation of posterior probabilities from the tree gives approximately unbiased estimates, and standard errors can be calculated using a cluster bootstrap approach with cluster effects included in the predictions. Small area estimates of poverty incidence for a region in Nepal, generated using the proposed tree based method, are comparable to the published results obtained by the standard method.
© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Elimination of poverty and undernutrition, the first two of the United Nations Sustainable Development Goals (United
Nations, 2016b), is addressed through the distribution of billions of dollars in assistance each year to third world countries.
Poverty mapping is promoted by the World Bank (World Bank, 2015) for predicting regional variations in the levels of
deprivation in a particular country, to facilitate efficient allocation of food aid by agencies such as the United Nations World

✩ Specimen R code for classification tree estimates is provided as supplementary material in the electronic version of the paper (see Appendix A).
* Correspondence to: Institute of Fundamental Sciences (Statistics), Massey University, Private Bag 11222, Palmerston North 4100, New Zealand. Fax: +64 6 3557953.
E-mail address: g.jones@massey.ac.nz (G. Jones).
http://dx.doi.org/10.1016/j.csda.2017.05.009

Fig. 1. Poverty map of poverty incidence in Nepal (United Nations, 2016a).

Food Programme (WFP). Statistical techniques are used to generate within-country estimates of deprivation, which can then
be combined with Geographic Information System (GIS) data to produce poverty maps, displaying disaggregated measures
of poverty and other indicators of well-being at low geographical levels. Fig. 1, an example of a poverty map for Nepal,
displays small area estimates of the proportion of individuals below a specific expenditure level (the ‘‘poverty line’’) across
geographic domains called ‘‘ilakas’’. A poverty line is usually based on the income or expenditure required to enjoy a minimal
level of goods and services (Ravallion, 1992). The head-count index, the proportion of poor people, is more correctly called poverty prevalence; however, it is referred to in this paper as poverty incidence, the term generally used in the poverty mapping literature.
Estimation of poverty status at low geographical level requires the use of small area estimation methodology. Current
methods for poverty mapping are usually based on multiple regression models, relating the variable of interest to a set of
predictors. An alternative approach for modelling poverty incidence is to use classification trees (Breiman et al., 1984),
which offer several advantages. Firstly, a classification tree does not require parametric assumptions (Hastie et al., 2001)
and provides a simple, direct and easily understood method of estimating poverty incidence. Multiple regression typically
uses a stepwise method, at a preliminary stage, for selection of model predictors, but there can be major problems with
this approach (Harrell, 2015); some skill and experience is required on the part of the modeller to avoid over-fitting while
including the important predictors in an appropriate way. The classification tree method, in contrast, has built-in tools for
automatically selecting variables and avoiding over-fitting, and is better able to cater for possible non-linear relationships
in the data structure (Chambers and Hastie, 1992). Including interactions in multiple regression can be problematic, since
decisions must first be made about which interactions to explore, and then the model attempts to estimate effects for all
possible combinations, some of which may be unimportant. In contrast, the classification tree readily incorporates variable
interactions, selecting only the important combinations.
The next section reviews some basic concepts in poverty mapping and classification trees, after which Section 3 describes
the proposed methodology for adapting the classification tree model for small area estimation of poverty incidence. Results
from applying the methodology to simulated data having a simple random sampling structure, simulated data containing
clusters, and actual Nepal data (Haslett and Jones, 2006) are provided in Section 4. These results are discussed, and
recommendations made, in the final section.

2. Review of current methods

Using classification trees to model poverty incidence requires melding small area estimation methodology and complex
survey design with the technique of classification trees. The basic features of poverty mapping, resampling methods in
complex surveys and classification tree models are reviewed below.

2.1. Poverty mapping

Poverty mapping combines survey data with other information to estimate poverty measures across small domains.
The term ‘‘small area’’ describes a subpopulation for which direct estimates from national survey data cannot be provided
with sufficient precision (Rao and Molina, 2015). Direct estimation for a domain of interest is based only upon the survey
data available for that domain at a specific time (Pfeffermann, 2013). National sample surveys of household income and
expenditure conducted on a small scale generally have inadequate sample sizes, so that direct estimates do not provide
stable predictions across small domains. An alternative approach is to use an indirect method, by incorporating auxiliary
information available at the small area level to improve the precision of the estimates, a technique known as ‘‘borrowing
strength’’ (Ghosh and Rao, 1994). Indirect estimators include model-based methods which relate the variable of interest
to a set of predictor variables. The statistical approach to poverty mapping generally employed by the World Bank, the ELL
method (Elbers et al., 2003), starts by building a unit level multiple regression model from the survey data having the form

\[
Y_{ij} = \mathrm{E}\left[Y_{ij} \mid x_{ij}\right] = x_{ij}^{T}\beta + \gamma_{j} + \epsilon_{ij}, \tag{1}
\]
where Yij denotes the value of log transformed per capita expenditure for the ith household in the jth primary sampling unit
(PSU) of the survey, and γj represents a random cluster effect to account for the clustering in the survey design. The model
in Eq. (1) is then applied to census data to predict values of Yij for each household in the census, which are then compared
with the poverty line to predict the poverty status at household level and amalgamated to generate small area estimates of
poverty incidence. A similar approach is taken for other indicators, for example modelling weight-for-age of children under
five and predicting at small area level the proportion of children more than two standard deviations below the median of a
reference population. ELL methodology uses a common set of predictors for both survey and census, and assumes that the
model built from survey data, Eq. (1), is also applicable to census data.
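For concreteness, a minimal R sketch of this modelling step is given below. It fits a random-intercept version of Eq. (1) with lme4 and applies it to census records; the column names (lexp for log per capita expenditure, predictors x1 and x2, psu, hhsize, ilaka) and the poverty line z are illustrative placeholders rather than those used in any particular study, and the full ELL procedure additionally simulates the cluster and household errors rather than relying on fixed-effect predictions alone.

```r
library(lme4)

# Unit-level model in the spirit of Eq. (1): log per capita expenditure on
# household predictors, with a random intercept for the survey PSU.
fit <- lmer(lexp ~ x1 + x2 + (1 | psu), data = survey)

# Predict log expenditure for census households from the fixed effects only
# (census PSUs are generally not in the survey), then flag predicted poverty.
census$lexp_hat <- predict(fit, newdata = census, re.form = NA)
census$poor_hat <- as.numeric(census$lexp_hat < z)

# Small area estimates of poverty incidence, weighting households by size.
P0 <- with(census, tapply(hhsize * poor_hat, ilaka, sum) /
                   tapply(hhsize, ilaka, sum))
```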
The ELL method is just one of a wide range of small area estimation techniques that have been applied to poverty
estimation. A review of the state-of-the-art is given in Pratesi (2016). Some of these use data structures different from
those under consideration here, such as area-level models (Esteban et al., 2012) and multiple poverty indicators (Fabrizi et
al., 2011; Benavent and Morales, 2016). Of particular relevance are the methods of Molina and Rao (2010) and Marchetti
et al. (2012). The former uses (1) with j denoting the small domains rather than survey PSUs, and identifies the survey
households in the census data so that these do not need to be predicted. While this can lead to improved estimates, most
ELL applications have very little, or no, data in most small areas, and area-level effects are minimized by including PSU-level
contextual predictors (usually derived from census data) in the model, so the improvements, if possible, will generally be
slight. Marchetti et al. (2012) use M-quantile estimators from a unit-level model fitted to the survey data to give robustness
against outliers and non-normality, but their method requires sample data in each small area. In our Nepal example, which
is typical of applications in the Third World, only about 30% of the small areas of interest have been sampled, and for these
the typical sample size is 12.

2.2. Resampling methods for complex survey design

Estimation of poverty measures is usually based upon information collected through complex sample surveys, and
so elements of complex design in the survey data (weighting, clustering and stratification) must be accounted for when
producing estimates and their standard errors (Wolter, 2007). With a complex data structure comprising more than one
element of survey design and several sampling stages, the variance of the quantity of interest does not have a tractable
mathematical form (Lohr, 1999). A common approach to variance estimation for complex survey data is to use some type
of replication or resampling method (Rust and Rao, 1996). A key challenge of using classification trees for the modelling is
to develop a methodology for generating valid standard errors for the small area estimates of poverty incidence.
Jackknife resampling (Quenouille, 1949) involves creating replicates by systematically omitting units in the original sample. If θ̂(1), . . . , θ̂(N) denote predictions from N jackknife replicates, then the jackknife point estimate, θ̂(·), is computed as the average of the θ̂(1), . . . , θ̂(N), and the jackknife standard error of prediction, σ̂, is defined as
\[
\hat{\sigma} = \sqrt{\frac{N-1}{N}\sum_{j=1}^{N}\left(\hat{\theta}_{(j)} - \hat{\theta}_{(\cdot)}\right)^{2}}.
\]

The delete-cluster jackknife is used for survey data which is clustered, and when stratification is also present the jackknife
is applied to each stratum.
The bootstrap method (Efron, 1979) creates replicates by sampling with replacement from the original sample. If the B bootstrap predictions are represented by θ̂1∗, . . . , θ̂B∗, then the bootstrap point estimate, θ̂∗, is calculated as the average of the θ̂1∗, . . . , θ̂B∗, and their standard deviation, σ̂∗, provides a bootstrap standard error, such that
\[
\hat{\sigma}^{*} = \sqrt{\frac{1}{B}\sum_{b=1}^{B}\left(\hat{\theta}^{*}_{b} - \hat{\theta}^{*}\right)^{2}}.
\]
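As a minimal illustration of the two formulas, the helpers below compute the jackknife and bootstrap standard errors from a vector of replicate estimates supplied by the user.

```r
# theta: replicate estimates (length N for the jackknife, B for the bootstrap)
jackknife_se <- function(theta) {
  N <- length(theta)
  sqrt((N - 1) / N * sum((theta - mean(theta))^2))
}
bootstrap_se <- function(theta) {
  B <- length(theta)
  sqrt(sum((theta - mean(theta))^2) / B)
}
```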

When the survey data contains clustering, the bootstrap should be applied to clusters rather than individual units, to
account for the dependence structure in the data (Field and Welsh, 2007). For a complex survey design comprising clustering
and stratification, cluster bootstrap samples should be drawn independently from each survey stratum, so as to replicate the
original probability sampling design in each stratum (Rao and Wu, 1988). A straightforward application of the bootstrap for
stratified data (Shao, 2003) involves drawing independent bootstrap samples from each stratum, and combining these into
a single bootstrap replicate for model building. The size of each bootstrap sample is recommended to be one less than the
number of clusters in the particular stratum from which it is drawn (McCarthy and Snowden, 1985). Bootstrap sampling
weights should be modified to avoid introducing bias into the variance estimator (Rust and Rao, 1996).
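A sketch of one such stratified cluster-bootstrap replicate is given below; it assumes a survey data frame with columns stratum and psu identifying the design, draws one fewer cluster than the stratum contains, with replacement, within each stratum, and keeps all households of every drawn cluster.

```r
# One stratified cluster bootstrap replicate of the survey data frame.
cluster_bootstrap <- function(survey) {
  pieces <- lapply(split(survey, survey$stratum), function(s) {
    ids  <- unique(s$psu)
    draw <- sample(ids, size = length(ids) - 1, replace = TRUE)
    # keep every household of each drawn cluster; clusters drawn more than
    # once contribute multiple copies of their households
    do.call(rbind, lapply(draw, function(id) s[s$psu == id, ]))
  })
  do.call(rbind, pieces)
}

boot_sample <- cluster_bootstrap(survey)   # one replicate for model building
```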
ELL methodology uses resampling to estimate prediction error. All three sources of variability (model coefficients, cluster and household level effects) are resampled for each new replicate as follows:
\[
Y_{ij}^{b} = x_{ij}^{T}\beta^{b} + \gamma_{j}^{b} + \epsilon_{ij}^{b}, \qquad b = 1, \ldots, B. \tag{2}
\]
Parametric bootstrapping is used for the regression parameters, specifically \(\beta^{b} \sim \mathrm{normal}[\hat{\beta}, \hat{V}(\hat{\beta})]\), whilst a non-parametric bootstrap is usually applied to the cluster and household level effects (Elbers et al., 2003).
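A hedged sketch of one such replicate, in the spirit of Eq. (2), is shown below; beta_hat and V_hat denote the estimated coefficients and their covariance matrix, and gamma_hat and e_hat the vectors of estimated cluster and household residuals, all assumed to be available from the fitted survey model.

```r
library(MASS)   # mvrnorm: parametric draw of the regression coefficients

one_ell_replicate <- function(X_census, cluster_census, beta_hat, V_hat,
                              gamma_hat, e_hat) {
  beta_b  <- mvrnorm(1, mu = beta_hat, Sigma = V_hat)           # beta^b
  gamma_b <- sample(gamma_hat, length(unique(cluster_census)),  # gamma_j^b
                    replace = TRUE)
  eps_b   <- sample(e_hat, nrow(X_census), replace = TRUE)      # epsilon_ij^b
  as.vector(X_census %*% beta_b) +
    gamma_b[as.integer(factor(cluster_census))] + eps_b         # Y_ij^b
}
```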

2.3. Classification trees

The concept of tree-based statistical models was first proposed (Morgan and Sonquist, 1963) to handle variable
interactions in the analysis of weighted survey data, but their paper did not specifically address issues of clustering and
stratification. Decision tree methodology (Quinlan, 1990) was later brought into the statistical arena (Breiman et al., 1984).
Construction of a tree model is a top-down iterative process which recursively partitions the data space into smaller subsets
called nodes, then fits a simple model in each subset (Hastie et al., 2001) by assigning the same value to each unit in the
node. A subset which does not split further is known as a terminal node, or leaf. Trees are usually ‘‘pruned’’ to a suitable size
so as to prevent overfitting; the most common approach uses the complexity parameter, cp, which balances tree size with model error as defined by the misclassification rate. The choice of cp is made using cross-validation (Breiman et al., 1984).
Another method of pruning is to restrict the depth of the tree (Therneau et al., 2014). A classification tree is built using data
which has a class variable, usually binary, as the response. The kth leaf of the tree is assigned a value which depends upon the
classifications of observations from the training data which end up in that node. These trees can then be used to predict for
new data where the class variable is missing. Each new observation follows a path down the tree dictated by the predictor
values of that observation, and is allocated the value assigned to the leaf of the tree at which its journey terminates. Each
new observation is assigned a class by majority vote, or given a posterior probability of class membership for each class.
Binary classification trees, which partition the data into only two alternatives at each split, are the most commonly used
tree-based models (Venables and Ripley, 2002).
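As a brief illustration, the sketch below grows and prunes a binary classification tree with the rpart package and produces both types of prediction for new data; poor is assumed to be a binary factor with a level labelled ‘‘poor’’, and x1, x2 and hhsize are illustrative predictor names in the survey and census data frames.

```r
library(rpart)

# Grow a large tree, then prune it back using the cross-validated error
# reported in the complexity parameter (cp) table.
tree    <- rpart(poor ~ x1 + x2 + hhsize, data = survey, method = "class",
                 control = rpart.control(cp = 0.001, xval = 10))
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
tree    <- prune(tree, cp = best_cp)

# Hard classification by majority vote, and posterior class probabilities.
hard <- predict(tree, newdata = census, type = "class")
soft <- predict(tree, newdata = census, type = "prob")[, "poor"]
```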
In order to use classification trees for poverty mapping, three issues need to be considered. Firstly, the aim of the modelling
is not simply to classify individuals, but to classify and then aggregate individual estimates. Secondly, standard application
of classification tree models assumes that the data are independently and identically distributed; adjustments to the tree
method are required if it is to be applied to the complex survey data used for poverty mapping. Thirdly, the objective in
poverty estimation is not just to classify and aggregate, but also to generate standard errors for the aggregated predictions,
which requires a resampling method. However, building trees from resampled data is problematic, since trees are inherently
unstable (Li and Belford, 2002). Tree instability is mainly due to the hierarchical nature of the algorithm, since variation in
an upper split continues on down to successive splits and increases (Hastie et al., 2001). The next section outlines how these
three issues have been addressed in developing a methodology for estimating poverty incidence using classification tree
models.

3. Classification tree modelling of poverty incidence

3.1. Classifying and aggregating

Auxiliary information is incorporated into the proposed poverty mapping methodology by building a classification tree
from the survey data with household poverty status as the response and variables common to the survey and census
as predictors. The tree is then used to predict poverty status using the census data, a procedure similar to the ELL
method. When generating small area estimates of poverty incidence, P0, the natural approach is to obtain a prediction
of poverty incidence for each census household and then aggregate these predictions across the small domains of interest.
However, the poverty status of individuals rather than households is the preferred indicator of poverty incidence, so the
classification tree predictions at household level are weighted by the household sizes before amalgamating to the small area
estimates.
New cases from the census data are classified by the tree according to the leaf at which they terminate. If a definitive
classification is to be made, a majority rule is used so that each leaf is designated ‘‘poor’’ or ‘‘not poor’’ based on the status
of the majority of survey households terminating at that leaf. Alternatively, the proportion of poor households at a leaf can
be taken as the posterior probability of being poor for any household terminating at the leaf. Thus classification trees can
provide two types of estimates, here labelled hard and soft. For the hard type of tree estimates, the ith census household is
assigned the class designation of the leaf at which it terminates; either Yi = 1 if the leaf is classified as ‘‘poor’’, or Yi = 0 if the
leaf designation is ‘‘not poor’’.

Fig. 2. Weighted classification tree model for poverty incidence with cp = 0, and tree depth of 4, showing splitting rules and with hard and soft estimates as terminal node labels; diagram created using rpart.plot package (Milborrow, 2015).

Let Sk represent the subset of census households which emerge at the kth leaf, and ni denote the household size, i.e. total number of people in the household, for the ith census household. Then a hard classification tree
estimate of poverty incidence, P0(h) , for a given small area is obtained by summing the number of poor people, ni Yi , over all
census households allocated to a particular leaf, summing across the leaves and then dividing by the total number of people
in the small area, as follows:
\[
P_{0}^{(h)} = \frac{\sum_{k}\sum_{i \in S_{k}} n_{i} Y_{i}}{\sum_{k}\sum_{i \in S_{k}} n_{i}}. \tag{3}
\]

A soft tree prediction for the ith census household, where i ∈ Sk , is pk , the posterior probability of being poor for households
in the kth leaf. Then, P0(s) , the soft classification tree estimate of poverty incidence for the small area being considered is
defined as:
\[
P_{0}^{(s)} = \frac{\sum_{k}\sum_{i \in S_{k}} n_{i} p_{k}}{\sum_{k}\sum_{i \in S_{k}} n_{i}}. \tag{4}
\]

At the prediction stage, the poverty status of each household in the census is unknown but is estimated from the tree, so
conditional on the fitted tree the Yi ’s can be considered as Bernoulli random variables. For the ith census household, such
that i ∈ Sk , Yi ∼ Bern(pk ). Consequently, P0(s) is equivalent to the posterior expected value of poverty incidence across the
leaves of the tree, that is the posterior mean of the proportion of poor individuals in the small area, since:
\[
\mathrm{E}\left[\frac{\sum_{k}\sum_{i \in S_{k}} n_{i} Y_{i}}{\sum_{k}\sum_{i \in S_{k}} n_{i}}\right] = \frac{\sum_{k}\sum_{i \in S_{k}} n_{i} p_{k}}{\sum_{k}\sum_{i \in S_{k}} n_{i}}. \tag{5}
\]

Note that the designation of the particular small area has been suppressed in (3)–(5). To illustrate, Fig. 2 displays the tree
diagram of a weighted classification tree model from Nepal with poverty status as response, and pruned to a depth of four.
The meanings of the variable labels are given in Table 2. A square box indicates an internal node, with terminal nodes
represented by oval boxes. Each leaf is labelled with a hard estimate, ‘‘poor’’ or ‘‘not poor’’, and a soft estimate, the posterior
probability of being poor for that leaf, represented by the decimal value in the lower half of the box. Fig. 2 suggests that
the three most useful variables provided by the model to predict poverty in Nepal are the proportion of households in a
particular ward owning a television set (tvw), the proportion of households in the ward with no proper toilet (toilet3w) and
the total number of people in the household (hhsize). The ward-level variables were obtained by averaging census variables
at ward level, where ward is the smallest administrative unit in Nepal; each ilaka is made up of several wards (Haslett and
Jones, 2006). The tree model represented by Fig. 2 was pruned to a depth of four for this illustrative example because it is
easier to read and interpret, but a larger tree would be preferred in practice as it should provide more accurate estimates.
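The estimates in Eqs. (3) and (4) can be computed directly from household-level tree predictions; a minimal sketch follows, assuming the census data frame carries hhsize and ilaka columns and that the fitted tree and the ‘‘poor’’ factor level are named as in the earlier sketch.

```r
# Household-level predictions from the fitted classification tree.
census$Y_hard <- as.numeric(predict(tree, census, type = "class") == "poor")
census$p_soft <- predict(tree, census, type = "prob")[, "poor"]

# Aggregate to small areas, weighting each household by its size.
P0_hard <- with(census, tapply(hhsize * Y_hard, ilaka, sum) /
                        tapply(hhsize, ilaka, sum))    # Eq. (3)
P0_soft <- with(census, tapply(hhsize * p_soft, ilaka, sum) /
                        tapply(hhsize, ilaka, sum))    # Eq. (4)
```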

3.2. Adapting the tree for complex survey data

Using a weighted classification tree that incorporates the survey weights ensures that the model built from survey data is
representative of the population (Toth and Eltinge, 2011). The rpart function (Therneau et al., 2014), from the R package used to build the trees, has a weights argument that incorporates survey weights into the splitting criterion and into the node
summaries (Therneau, 2015). When the survey data contains clustering and stratification, these design features must also
be incorporated into any resampling procedure for variance estimation, as described in the next section.
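In rpart this amounts to passing the household sampling weights through the weights argument, as in the short sketch below; svy_wt is an assumed vector of survey weights aligned with the rows of the survey data frame.

```r
# Weighted classification tree: survey weights enter the splitting criterion
# and the node summaries through the 'weights' argument.
wtree <- rpart(poor ~ x1 + x2 + hhsize, data = survey, weights = svy_wt,
               method = "class", control = rpart.control(cp = 0.005))
```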

3.3. Variance estimation using classification trees

A single classification tree can provide small area estimates of poverty incidence, but makes no provision for estimating
standard errors of those predictions. Variance estimation for complex survey data requires resampling of the survey data,
but this can be problematic due to the inherent instability of trees (Turney, 1995). The proposed methodology of variance
estimation using classification trees is discussed firstly for survey data with a simple random sampling structure, and then
for complex survey data containing clustering and stratification.

3.3.1. Simple random sampling


For a survey based on simple random sampling, replicate survey datasets can be created by simple jackknifing or
bootstrapping. Two types of small area estimate can be considered, hard and soft. The variance estimation process using
trees built from simple random sampling data involves creating a replicate survey, building a tree and then producing
estimates for each small area. Repeating this process produces many replicate estimates for each small area, which are used
to compute a mean estimate and associated standard error. A Monte Carlo study, detailed in Section 4.2, shows that under
both the replication based variance estimation methods of jackknifing and bootstrapping, the hard type of tree estimate fails
badly due to tree instability leading to estimates from the replicates varying wildly. When soft estimates are used, bootstrap
resampling produces reasonable standard errors, whereas standard errors from the jackknife method are highly inflated.
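A sketch of this bootstrap-soft procedure for a single small area under simple random sampling is given below, with B = 100 replicates and the same illustrative variable names as in the earlier sketches.

```r
B <- 100
boot_P0 <- replicate(B, {
  # resample households with replacement (simple random sampling design)
  rep_survey <- survey[sample(nrow(survey), replace = TRUE), ]
  rep_tree   <- rpart(poor ~ x1 + x2 + hhsize, data = rep_survey,
                      method = "class", control = rpart.control(cp = 0.005))
  p <- predict(rep_tree, newdata = census, type = "prob")[, "poor"]
  sum(census$hhsize * p) / sum(census$hhsize)    # soft P0 for the small area
})
P0_hat <- mean(boot_P0)   # bootstrap point estimate
se_hat <- sd(boot_P0)     # bootstrap standard error of prediction
```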

3.3.2. Clustered data


If the survey data contains clustering, the replication method needs to reflect this. Thus variance estimation for trees
built from clustered survey data is carried out using the cluster bootstrap, which involves selecting survey clusters by simple
random sampling with replacement (Field and Welsh, 2007). A bootstrap sample of clusters is chosen and, following the ultimate cluster principle (Wolter, 2007), all units in each selected cluster are included. If the jth survey cluster, \(C_{j}^{(s)}\), is selected \(m_{j}\) times, then the ith survey household appears \(m_{j}\) times in the cluster bootstrap sample, for \(i \in C_{j}^{(s)}\). In simulation
studies, the cluster bootstrap was found to be sufficient for standard errors of population level parameters, but at small
area level, predictions of cluster effects need to be incorporated into the small area estimates in a manner analogous to the
resampling method of the ELL method described in Section 2.2. Incorporating cluster effects into tree predictions corresponds
to perturbing the leaves of the tree which generates the predictions of poverty, by adjusting the prediction at each leaf by a
small amount for all households in a particular census cluster. The tree provides a soft prediction, p̂k , the posterior probability
of being poor for each census household allocated to the kth leaf. However, if the census data contains clustering, households
which come from the same census cluster and end up in the kth leaf should have a different posterior probability from
households in that leaf which come from a different census cluster. The survey data can be used to estimate the amount of
between cluster variability which should be included in the predictions. To ensure that the adjusted prediction remains in
the interval [0, 1], cluster effects are derived on the logistic scale. Then, the amended posterior probability for households
from the jth census cluster which arrive at the kth leaf is p̂∗jk , such that

p̂∗jk = logit−1 logit p̂k + cj∗


( ( ) )
(6)
where \(c^{*}_{j}\) is the cluster effect assigned to each household from the jth census cluster. The cluster effects, \(c^{*}_{j}\), are randomly generated by assuming a normal distribution, \(N(0, \sigma_{c}^{2})\), where \(\sigma_{c}^{2}\) is the cluster variance. This is estimated from the survey data using a generalized linear mixed model (GLMM),
\[
\operatorname{logit}(\pi_{jk}) = \eta_{jk} = \phi_{k} + c_{j}, \tag{7}
\]
with \(Y_{i} \sim \mathrm{Bernoulli}(\pi_{jk})\) as the true poverty status for the ith survey household, where j and k are such that \(i \in C_{j}\) and \(i \in S_{k}\). That is, \(\pi_{jk}\) denotes the posterior probability of being poor for households in cluster j and leaf k. The linear predictor \(\eta_{jk}\) comprises the fixed effect \(\phi_{k}\), representing the posterior probability of being poor for the kth leaf, corresponding to \(\hat{p}_{k}\) on the logit scale, and the random effect \(c_{j}\) representing the variability in the probability of being poor due to cluster effects.
A similar use of GLMMs was made by Sela and Simonoff (2012) to adapt regression trees for clustered
data; they found that the re-estimated tree did not really lead to better predictions. Our use is different since we are using
the GLMM results to add cluster effects to our predictions for variance estimation.
At each bootstrap iteration, cluster effects are randomly selected as described above and incorporated into the prediction
for each household in a cluster. This is similar to the way in which the ELL method uses resampled cluster-level residuals in
producing its replicate estimates. Omitting this step produces standard errors that are too small, unless the areas contain a
large enough number of clusters for the cluster effects to average out, which is rarely the case.
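A sketch of this adjustment is given below. It fits the GLMM of Eq. (7) with lme4, using the terminal-node membership of the survey households (available as the where component of the fitted rpart object), and then perturbs the census-level soft predictions on the logit scale as in Eq. (6); svy_poor01 (0/1 poverty status), svy_psu and census$cluster are illustrative assumptions, and wtree is the weighted tree from the earlier sketch.

```r
library(lme4)

# Eq. (7): leaf fixed effects plus a random cluster (PSU) intercept, fitted to
# the observed poverty status of the survey households; the vectors are
# assumed to be aligned with the rows used to fit 'wtree'.
glmm_dat <- data.frame(poor01 = svy_poor01,
                       leaf   = factor(wtree$where),   # terminal node per household
                       psu    = svy_psu)
glmm    <- glmer(poor01 ~ leaf - 1 + (1 | psu), data = glmm_dat, family = binomial)
sigma_c <- sqrt(as.numeric(VarCorr(glmm)$psu))          # estimated cluster s.d.

# Eq. (6): perturb the leaf probabilities for each census cluster.
p_hat  <- predict(wtree, newdata = census, type = "prob")[, "poor"]
p_hat  <- pmin(pmax(p_hat, 1e-6), 1 - 1e-6)             # guard against pure leaves
cl     <- factor(census$cluster)
c_star <- rnorm(nlevels(cl), mean = 0, sd = sigma_c)    # one effect per census cluster
p_star <- plogis(qlogis(p_hat) + c_star[as.integer(cl)])
```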

Fig. 3. ELL estimates of poverty incidence compared with the equivalent hard and soft classification tree estimates for eighteen ilakas in a district of Nepal.
The dashed line, y = x, represents equal values for ELL and tree estimates.

3.3.3. Accounting for stratification


To account for stratification as well as clustering in the data structure, cluster bootstrap samples can be selected
independently within each survey stratum and combined into a single bootstrap replicate for building the weighted
classification tree used to generate predictions. The bootstrap sample for the hth survey stratum is of size nh − 1, where
nh is the number of clusters in that stratum. Bootstrap sampling weights for a particular stratum are constructed to ensure
that the sum of the sampling weights for all households in the bootstrap sample equals the total weight of all units in that
stratum (Rust and Rao, 1996), thus preserving the representativeness of the strata. Let \(w_{hij}\) denote the sampling weight of \(y_{hij}\), the ith sampled household in the jth cluster of the hth stratum. Then the bootstrap weights \(w^{*}_{hij}\) are constructed as follows,
\[
w^{*}_{hij} = w_{hij}\,\frac{\sum_{\text{stratum}} w_{hij}}{\sum_{\text{BS sample}} w_{hij}}. \tag{8}
\]

Census predictions are then adjusted to incorporate cluster effects, as outlined in Eq. (6). Results from applying the
methodologies described above are provided in the next section.
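A sketch of the weight rescaling in Eq. (8) for one bootstrap replicate follows; boot is assumed to be a combined stratified cluster-bootstrap sample (as sketched in Section 2.2) carrying the original stratum and wt columns.

```r
# Sum of the original sampling weights in each stratum (the target totals),
# and of the weights actually drawn into the bootstrap sample.
stratum_total <- tapply(survey$wt, survey$stratum, sum)
boot_total    <- tapply(boot$wt, boot$stratum, sum)

# Eq. (8): rescale so the bootstrap weights in each stratum sum to the
# original stratum total.
boot$wt_star <- boot$wt * stratum_total[as.character(boot$stratum)] /
                          boot_total[as.character(boot$stratum)]
```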

4. Results

4.1. Small area estimates of poverty incidence for a district in Nepal

The survey data used in the classification tree modelling was derived from the 2003/4 Nepal Living Standards Survey,
which comprises a two-stage stratified cluster sampling design, as described in the Nepal poverty mapping report (Haslett
and Jones, 2006). At the first stage, the primary sampling units (PSUs) were selected with probability proportional to size
independently within six strata. Then, within each chosen cluster, or PSU, twelve households were selected using systematic
sampling. This gave a dataset of 3912 households, which was processed to create the log per capita expenditure variable from
which poverty status is derived, and a set of possible predictors that match variables available in the 2003 Nepal Census.
To apply the proposed poverty tree methodology, a weighted classification tree model was built from the survey dataset
using the same predictor variables as were used in the model for the published estimates (Haslett and Jones, 2006), and
pruned with a complexity parameter cp = 0.005 chosen by cross-validation. The tree was then used to generate hard and
soft small area estimates of poverty incidence for 18 ilakas in one of the districts of Nepal. The classification tree estimates
are compared in Fig. 3 with the published estimates from Nepal generated by the ELL method. The dashed line in Fig. 3, the
diagonal y = x, represents equality for ELL and tree estimates. Hard tree estimates are consistently lower than soft tree
estimates for all eighteen small areas, indicating better agreement with the ELL-based estimates for the soft type of tree
prediction than the hard type. However, both ELL and tree estimates are subject to error, so neither can be regarded as the
true value. If the comparison between methods suggests bias, it is difficult to determine which method is producing the bias.

4.2. Bias and variance estimation under simple random sampling

To develop a method for generating valid standard errors of prediction of poverty incidence using a classification tree,
variance estimation methods of jackknife and bootstrap resampling, with hard and soft tree predictions, were tested in a
Monte Carlo study. Survey and census datasets with a simple random sampling structure were generated from the same
linear model based on the Nepal survey data. The data for each household are the log-transformed per capita expenditure
Y and a vector of 25 predictors x comprising the variables used in the published Nepal poverty mapping project (Haslett
and Jones, 2006). Since the Nepal census dataset was not available for this project (except for one unspecified District),
the simulation study was based on the survey data only. To allow for variation in the x variables, we generated (Y , x) as a
complete vector, with its mean and covariance structure estimated from the survey data.

Fig. 4. Actual coverage of 100 intervals for a nominal level of 95%, from 100 Monte Carlo simulations involving jackknife and bootstrap resampling with hard and soft tree estimates. True P0 = 0.196 is the actual poverty status of the single census dataset used for all simulations. Percentage empirical coverage is displayed for the four variance estimation methods. Note that jackknife hard estimates have a much larger X axis range than the other three methods.

Denoting this 26 × 1 vector by D, and its mean vector and covariance matrix for the Nepal survey data by µ and Σ respectively, new observations were simulated using

\[
D^{*} = \mu + \Sigma^{1/2} Z, \tag{9}
\]
where \(Z \sim N_{26}(0, I)\). The first component, representing the simulated response variable log per capita expenditure, was
converted to a binary variable indicating poverty status, for building a classification tree using the remaining components as
predictors. For each of 100 Monte Carlo simulations, a survey dataset of 3000 households and a census ilaka of 6000, the typical
size of a Nepal ilaka, were randomly generated as above. Bootstrap and jackknife replicates of the simulated survey data
were used to build multiple classification trees, each pruned to a depth of five, with 100 bootstraps per dataset. These trees
were then applied to the simulated census data to generate multiple bootstrap and jackknife, hard and soft predictions, from
which point estimates of poverty incidence and associated standard errors were computed. Empirical parametric confidence
intervals, of the form θ̂(·) ± zα σ̂ and θ̂ ∗ ± zα σ̂ ∗ , were constructed using both hard and soft tree estimates and a range of
values of α , and the actual coverage compared with nominal coverage. Fig. 4 displays actual coverage for a 95% nominal level
of a hundred intervals from Monte Carlo simulations. The graphs show that hard tree estimates are biased, while soft tree
estimates are approximately unbiased. Bootstrap resampling gives much smaller standard errors than the jackknife. Extreme
overcoverage of jackknife hard and soft intervals was due to very large standard errors. It is known that the jackknife method
can be unstable for estimators which are not smooth (Shao and Tu, 1995), but the delete-k jackknife produced similar results;
instability in the trees occasionally produced very different estimates which inflated the standard error estimates.
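The data-generating step of Eq. (9) can be reproduced under assumed inputs as below, with MASS::mvrnorm standing in for the explicit µ + Σ^{1/2}Z construction; mu, Sigma and the poverty line z are taken to have been estimated from the survey data.

```r
library(MASS)    # mvrnorm: multivariate normal draws, equivalent to mu + Sigma^{1/2} Z
library(rpart)

simulate_households <- function(n, mu, Sigma, z) {
  D <- mvrnorm(n, mu = mu, Sigma = Sigma)
  # first component is simulated log per capita expenditure; convert it to a
  # binary poverty indicator and keep the remaining components as predictors
  data.frame(poor = factor(ifelse(D[, 1] < z, "poor", "not poor")),
             D[, -1, drop = FALSE])
}

survey_sim <- simulate_households(3000, mu, Sigma, z)
census_sim <- simulate_households(6000, mu, Sigma, z)
tree_sim   <- rpart(poor ~ ., data = survey_sim, method = "class",
                    control = rpart.control(maxdepth = 5))   # depth-limited tree
```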
The tree algorithm is not a smooth process, being a ‘‘greedy’’ top-down binary partitioning method (Hastie et al., 2001).
Fig. 5. Empirical coverage for a nominal level of 95% when cluster and naive bootstrap methods are applied to clustered data.

The granular nature of the hard estimates, classifying each household as poor or not poor, also seems to contribute to
larger bias and less precision with this type of tree prediction, as compared with employing soft estimation, which uses the
posterior probability of being poor. Bootstrap soft estimation is clearly the preferred method for classification trees when the
data has a simple random sampling structure, giving approximately unbiased estimates and standard errors which perform
appropriately for calculating confidence intervals.

4.3. Variance estimation with complex survey data

4.3.1. Employing a cluster bootstrap


A second simulation study compared the behaviour of the standard, naive, bootstrap resampling method with that of
the cluster bootstrap, using soft tree prediction, for the situation where the data are clustered. To simulate clustered data,
random cluster effects were added to survey and census data simulated as in the previous section. Clusters of size 12 were
used in the survey data since the Nepal survey sampled 12 households per PSU, but in the census the typical cluster size of 150
households was used. Random cluster effects were simulated from a normal distribution with mean zero and variance \(\sigma_{c}^{2}\). The same cluster effect, \(c_{j}\), was added to the value of the log expenditure, \(Y_{ij}\), for each household in the jth cluster. Several values of \(\sigma_{c}\), based on the degree of clustering found in real poverty datasets, were used in the simulation process. This simulation
used 1000 datasets, with the number of bootstraps per simulation kept at 100. Fig. 5 displays the coverage properties of the
two bootstrap methods for a nominal coverage of 95%, and indicates that the cluster bootstrap performed better than the
naive bootstrap when the data had cluster effects, but with increasing undercoverage as the level of clustering increased.

4.3.2. Incorporating cluster effects into predictions


The undercoverage seen in Fig. 5 reflects the lack of a strategy to include cluster effects in the predictions, a necessary
step in order that the small area estimates correctly reflect the variability due to clustering in the data. In a further Monte
Carlo study, the method of cluster bootstrap resampling with soft tree prediction was extended by incorporating cluster
effects into the soft tree predictions, as outlined in Section 3.3.2. Coverage properties of the methodology were examined
using three different types of intervals, four nominal coverage levels and various degrees of clustering in the data structure,
σcl, and are displayed in Fig. 6. Bootstrap parametric intervals of the form θ̂∗ ± zα σ̂∗ were built with zα chosen for nominal coverage levels of 0.95, 0.9, 0.8 and 0.68, where the point estimate θ̂∗ is the average of the bootstrap soft predictions, θ̂∗b, and its associated standard error of prediction, σ̂∗, is the standard deviation of the θ̂∗b. These intervals, designated as ‘‘Soft’’ in the graphs in Fig. 6, are represented by solid lines with solid square symbols. A second type of parametric interval, θ̂F ± zα σ̂∗, was centred about θ̂F, the prediction of poverty incidence obtained from the ‘‘Full’’ tree built from all the survey data, rather than being centred about θ̂∗. This alternate parametric interval is represented in Fig. 6 by solid lines with solid circle symbols, and referenced by the term ‘‘Full’’. Bootstrap percentile intervals, of the form (θ̂∗α/2, θ̂∗1−α/2) for α = 0.05, 0.1, 0.2, 0.32, were also constructed, where, for example, θ̂∗0.05 denotes the 5th percentile of the distribution of bootstrap soft estimates. These non-parametric bootstrap intervals are
labelled in Fig. 6 as ‘‘Percent’’ and represented by solid lines with solid triangle symbols. Also displayed on the graphs are
dashed lines representing empirical coverage for the same three interval types, based on simulations which did not include
cluster effects incorporated into predictions but only clustering in the data structure, designated as ‘‘None Soft’’, ‘‘None Full’’
and ‘‘None Percent’’. The focus of the simulation study was not the confidence intervals per se, but whether the bootstrap
soft estimation method provided valid standard errors of prediction. For low to moderate amounts of clustering in the data,
actual coverage is very similar for all three interval types. Empirical coverage for the parametric interval centred about the
‘‘full’’ tree estimate is fairly consistent for all levels of clustering.

Fig. 6. Empirical coverage for cluster bootstrap soft estimation with cluster effects incorporated into predictions, for three types of intervals: ‘‘Soft’’ refers to intervals centred about the average of bootstrap soft estimates; ‘‘Full’’ refers to intervals centred about the estimate based on all the survey data; ‘‘Percent’’ refers to bootstrap percentile intervals; ‘‘None’’ indicates that cluster effects were not incorporated into predictions, but only introduced into the data structure.

Clearly, when the data is clustered, cluster effects need
to be incorporated into the predictions for valid standard errors of prediction. Incorporating cluster effects into soft tree
predictions has provided standard errors of prediction that give approximately the correct coverage over a plausible range
of cluster variances.

4.4. Standard errors of prediction for poverty incidence in Nepal

The actual Nepal survey data used stratification as well as clustering, as is usual for these poverty mapping exercises. Thus
to obtain small area estimates for Nepal, the methodology was extended to incorporate stratification into the modelling
process by selecting cluster bootstrap samples independently from each survey stratum, and constructing appropriate
bootstrap weights, as outlined in Section 3.3.3. Classification tree small area estimates of poverty incidence and associated
standard error of prediction were generated for 18 ilakas in a district of Nepal, with cp chosen by cross-validation for the full
survey and then kept fixed in the bootstrapping. The tree estimates are compared in Table 1 with published results obtained
using the ELL method (Haslett and Jones, 2006). The tree estimates represent point estimates of poverty incidence (P0) for
each ilaka computed as the mean of 100 bootstrap soft estimates. The small area estimates of poverty incidence obtained by
the classification tree modelling are graphed against the published ELL estimates in Fig. 7.
Fully incorporating the complex survey design components of clustering and stratification into the methodology has resulted in point estimates which are closer to their ELL equivalents than those produced by the tree model which only incorporates survey weightings (Fig. 3).

Table 1
Comparison of ELL and bootstrap soft classification tree estimates and stan-
dard errors for ilakas in one district of Nepal, with the ratio of coefficients of
variation (CV Ratio) of the tree to the ELL estimates, and the population size.
Ilaka    ELL P0    ELL se    Tree P0    Tree se    CV ratio    Size
1 0.742 0.030 0.653 0.058 2.162 3971
2 0.525 0.033 0.463 0.045 1.555 4622
3 0.619 0.035 0.545 0.061 1.939 4170
4 0.468 0.024 0.397 0.042 2.034 4168
5 0.297 0.028 0.304 0.046 1.637 4320
6 0.472 0.027 0.460 0.035 1.316 4818
7 0.704 0.036 0.611 0.065 2.099 2587
8 0.204 0.031 0.188 0.047 1.657 3666
9 0.201 0.028 0.190 0.057 2.108 3112
10 0.122 0.024 0.135 0.035 1.315 2645
11 0.242 0.020 0.249 0.034 1.672 4569
12 0.202 0.027 0.229 0.035 1.148 5624
13 0.213 0.020 0.238 0.033 1.422 5231
14 0.415 0.031 0.414 0.038 1.249 3184
15 0.311 0.026 0.300 0.033 1.311 3368
16 0.063 0.021 0.073 0.028 1.116 3015
17 0.130 0.040 0.099 0.046 1.501 2255
18 0.137 0.034 0.094 0.043 1.877 5134

Fig. 7. Comparing classification tree and ELL estimates for 18 ilakas in a district of Nepal.

The estimates are also more consistent, in that the distance below the diagonal y = x is similar for poverty levels greater than 0.5. However, the tree estimates are still typically lower than the ELL estimates for the higher poverty rates.
Standard errors of prediction from the classification tree model are of the same order of magnitude as the ELL standard errors,
but are always larger. This is expected as the tree estimates are allowing for model uncertainty whereas the ELL standard
errors assume that the fitted model is correct. The coefficients of variation for the tree estimates are up to two times those
of the ELL estimates.
If the methods are applied to give estimates for the whole district, ELL gives P0 = 0.351, se = 0.014; the tree estimate
is 0.327 (0.023) and the direct estimate based on the five sampled PSUs is 0.275 (0.149). Again the tree estimate is slightly
lower with a higher standard error; the direct estimate, even at this level, is too imprecise to be useful.

5. Discussion

A methodology has been presented for small area estimation of poverty incidence using a classification tree, as an
alternative approach to the standard ELL approach using a multiple regression model for poverty mapping. Tree-based
models have some advantages, particularly if an automatic method, with minimal user input, is desired, since they handle
non-linearities and interactions easily, and have built-in variable selection procedures that can avoid over-fitting.
Estimating poverty incidence using a classification tree involves predictions for groups rather than for individuals, the
usual application of tree models. Simulation suggests that soft tree estimates are approximately unbiased, whereas hard
estimates are not. The hard estimate ignores minority cases at each terminal node, which is a probable cause of the bias.
Hard estimates are also much more affected by tree instability whereby a small change in the data can lead to a very different
tree.

Table 2
Definitions of the main effects variables used for the published Nepal poverty
mapping estimates (Haslett and Jones, 2006). Some variables, as indicated,
are census means or GIS-derived variables at ward or vdc (Village Development Community) level.
Label Meaning
group1 Urban Kathmandu
group2 Urban Other
group3 Rural Western Mt + Hills
group4 Rural Eastern Mt + Hills
group5 Rural Western Terai
group6 Rural Eastern Terai
hhsize Household size
skids6 Propn kids 0–6 in hh
skids714 Propn kids 7–14 in hh
samen Propn adult men
hage2 hh head aged 30–44
hethn1 hh head Brahmin/Chhetri (B/C)
hethn2 hh head Terai Middle Caste (TMC)
hethn3 hh head Dalit (Dit)
hethn4 hh head Newar (Nwr)
hethn5 hh head Hill Janajatis (HlJ)
hethn6 hh head Terai Janajatis (TrJ)
hethn7 hh head Other castes (Otm)
hrelig3 hh head Muslim
remtab hh member migrated abroad
hfem Female headed hh
hutype1 House permanent (Perm)
hutype2 House semi permanent (Sm)
hutype3 House temporary (Oth)
huown2 House rented or free
npltry4 Rural with no poultry
nagar8 Rural with land 0.5–1.0 ha
nagar9 Rural with land 1.0–2.0 ha
ckfuel3w Cooking fuel LP/gas, ward
toilet3w Propn hh with no toilet, ward
ftoiletw Propn hh with flush toilet, ward
ltfuel3w Propn hh with lighting fuel other than electricity/kerosene
edulv3w Propn 15+ pop 5–7 yr completed, ward
elecw Propn hh with electricity, ward
tvw Propn hh with tv, ward
pflandv Propn hh with land-owning females, vdc
dmortv mortality rate due to infectious disease, vdc
meanht Mean elevation (’000 m) above sea level, vdc
meanslp Mean slope (as %), vdc

In adapting the standard tree model for a complex survey design structure, a weighted tree model takes account of survey
sampling weights. Variance estimation can be carried out using the cluster bootstrap independently within each stratum,
thus accounting for stratification and clustering in the data structure; however, it is necessary to augment the soft tree predictions with cluster residuals, derived from a parametric estimate of the cluster variance. Jackknife resampling and hard tree prediction were found to be unsuitable for variance estimation using trees. The jackknife method is inconsistent for non-smooth estimators (Miller, 1974), such as a tree model, and produced large standard errors. The automatic variable
selection used to fit the trees to resampled data facilitates the incorporation of model uncertainty in the variance estimates.
The form of the tree may be unstable, but the estimates produced from the trees are not necessarily unstable, depending
on the method chosen. Soft predictions combined with the bootstrap give stable estimates, so the advantage of automatic
variable selection can be retained.
The proposed methodology for adapting classification trees for poverty mapping from complex survey data produces
standard errors for the small area poverty estimates that appear to be valid, in that normal parametric intervals based on
these standard errors appear to have approximately nominal coverage, even though the sampling distribution of poverty
incidence is not expected to be normal. Standard errors from the tree method were slightly larger than the ELL
estimates, which is expected since the ELL method does not incorporate uncertainty in the choice of predictors, whereas each
new bootstrap sample produces a different tree model based on a different selection of covariates. The bootstrap procedure
here is similar to the Random Forest methodology (Breiman, 2001), except that the latter uses a simple bootstrap sample and
forces change in the selected predictors by choosing a random subset at each iteration. Other ensemble methods (Dietterich,
2000) could be explored as ways of overcoming tree instability in poverty mapping.
The noted differences between the point estimates of poverty incidence for the real Nepal data produced by the
classification tree and the ELL method, at the higher poverty rates, could be an important feature in comparing tree based
models with regression based methods. However, since allocation of aid is based upon relative comparisons of levels of
poverty, this may not be of practical importance. It is difficult in the absence of a validation study to determine which type of
model provides the most accurate results. The ELL method applied in Nepal built a single model for the entire dataset, with
regional variations included via particular specified regional interaction terms. The model assumes linearity of numerical
covariates, and additivity of effects. In contrast, the tree constructs a model which is constant in each partitioning of the
dataset, essentially producing a step function across the data space; it does not assume linearity and naturally investigates
many interactions, including regional ones.
Extension of the methodology to estimate child-level indicators such as rates of stunting and underweight is straightforward. Random effects at small area level could easily be included if desired by including them in the GLMM and adding the
EBLUPs to the classification tree predictions in addition to, or instead of, the random cluster effects. In future work, we hope
to explore the use of regression tree models, rather than classification trees, for poverty mapping, since this should allow the
estimation of poverty gap and poverty severity in addition to poverty incidence. There are many alternative classification technologies, such as neural networks and support vector machines, that could also be explored.

Acknowledgements

The authors are grateful to the Associate Editor and four referees for their helpful comments and suggestions. The first
author was supported in this work by a Massey University Doctoral Scholarship. The authors acknowledge the assistance
of the UN World Food Programme and the Nepal Central Bureau of Statistics who provided extensive background for this
study. Any errors or omissions remain the sole responsibility of the authors.

Appendix A. Supplementary material

Supplementary material related to this article can be found online at http://dx.doi.org/10.1016/j.csda.2017.05.009.

References

Benavent, R., Morales, D., 2016. Multivariate Fay–Herriot models for small area estimation. Comput. Statist. Data Anal. 94 (4), 372–390.
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A., 1984. Classification and Regression Trees. Chapman and Hall.
Chambers, J.M., Hastie, T.J., 1992. Statistical Models in S. Wadsworth & Brookes/Cole.
Dietterich, T.G., 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization.
Mach. Learn. 40 (2), 139–157.
Efron, B., 1979. Bootstrap methods: another look at the jackknife (1977 Rietz Lecture). Ann. Statist. 7 (1), 1–26.
Elbers, C., Lanjouw, J.O., Lanjouw, P., 2003. Micro-level estimation of poverty and inequality. Econometrica 71 (1), 355–364.
Esteban, M., Morales, D., Pérez, A., Santamaría, L., 2012. Small area estimation of poverty proportions under area-level time models. Comput. Statist. Data Anal. 56 (10), 2840–2855.
Fabrizi, E., Ferrante, M.R., Pacei, S., Trivisano, C., 2011. Hierarchical Bayes multivariate estimation of poverty rates based on increasing thresholds for small
domains. Comput. Statist. Data Anal. 55 (4), 1736–1747.
Field, C.A., Welsh, A.H., 2007. Bootstrapping clustered data. J. R. Stat. Soc. Ser. B - Stat. Methodol. 69 (Part 3), 369–390.
Ghosh, M., Rao, J., 1994. Small-area estimation - an appraisal. Stat. Sci. 9 (1), 55–76.
Harrell, F., 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Springer.
Haslett, S.J., Jones, G., 2006. Small area estimation of poverty, caloric intake and malnutrition in Nepal. Tech. rep., Nepal Central Bureau of Statistics/World Food Programme, United Nations/World Bank, Kathmandu, Nepal. URL https://www.wfp.org/content/nepal-small-area-estimation-poverty-caloric-intake-and-malnutrition-september-2006.
Hastie, T., Tibshirani, R., Friedman, J., 2001. The Elements of Statistical Learning. Springer.
Li, R.-H., Belford, G.G., 2002. Instability of decision tree classification algorithms. In: Proceedings of the Eighth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining. ACM, pp. 570–575.
Lohr, S.L., 1999. Sampling: Design and Analysis. Duxbury Press.
Marchetti, S., Tzavidis, N., Pratesi, M., 2012. Non-parametric bootstrap mean squared error estimation for M-quantile estimators of small area averages,
quantiles and poverty indicators. Comput. Statist. Data Anal. 56 (10), 2889–2902.
McCarthy, P.J., Snowden, C.B., 1985. The bootstrap and finite population sampling. US National Center for Health Statistics (NCHS), Hyattsville, MD.
Milborrow, S., 2015. rpart.plot: Plot rpart Models. An Enhanced Version of plot.rpart. R package version 1.5.2. URL http://CRAN.R-project.org/package=rpart.plot.
Miller, R.G., 1974. The jackknife - a review. Biometrika 61 (1), 1–15.
Molina, I., Rao, J., 2010. Small area estimation of poverty indicators. Canad. J. Statist. 38 (3), 369–385.
Morgan, J.N., Sonquist, J.A., 1963. Problems in the analysis of survey data, and a proposal. J. Amer. Statist. Assoc. 58 (302), 415–434.
Pfeffermann, D., 2013. New important developments in small area estimation. Stat. Sci. 28 (1), 40–68.
Pratesi, M., 2016. Analysis of Poverty Data By Small Area Estimation. Wiley.
Quenouille, M.H., 1949. Problems in plane sampling. Ann. Math. Statist. 20 (3), 355–375.
Quinlan, J.R., 1990. Decision trees and decision-making. IEEE Trans. Syst. Man Cybern. 20 (2), 339–346.
Rao, J.N., Molina, I., 2015. Small Area Estimation. John Wiley & Sons.
Rao, J., Wu, C., 1988. Resampling inference with complex survey data. J. Amer. Statist. Assoc. 83 (401), 231–241.
Ravallion, M., 1992. Poverty comparisons: a guide to concepts and methods. Living Standards Measurement Study (LSMS) Working Paper No. 88. World Bank.
Rust, K.F., Rao, J., 1996. Variance estimation for complex surveys using replication techniques. Stat. Methods Med. Res. 5, 283–310.
Sela, R.J., Simonoff, J.S., 2012. RE-EM trees: a data mining approach for longitudinal and clustered data. Mach. Learn. 86, 169–207.
Shao, J., 2003. Impact of the bootstrap on sample surveys. Stat. Sci. 18 (2), 191–198.
Shao, J., Tu, D., 1995. The Jackknife and the Bootstrap. Springer-Verlag.
Therneau, T., 2015. User written splitting functions for RPART. Mayo Clinic.
Therneau, T., Atkinson, B., Ripley, B., 2014. rpart: Recursive Partitioning and Regression Trees. R package version 4.1–5. URL http://CRAN.R-project.org/package=rpart.
Toth, D., Eltinge, J., 2011. Building consistent regression trees from complex sample data. J. Amer. Statist. Assoc. 106 (496), 1626–1636.
Turney, P., 1995. Bias and the quantification of stability. Mach. Learn. 20 (1–2), 23–33.
United Nations, 2016a. Map of poverty incidence in Nepal. URL http://www.un.org.np/sites/default/files/maps/tid_113/Poverty-Map.pdf.
United Nations, 2016b. Sustainable Development Goals. URL http://www.un.org/sustainabledevelopment/sustainable-development-goals/.
Venables, W.N., Ripley, B.D., 2002. Modern Applied Statistics with S. Springer.
Wolter, K.M., 2007. Introduction To Variance Estimation, second ed. Springer.
World Bank, 2015. Poverty mapping. URL http://go.worldbank.org/9CYUFEUQ30.
