Classification and Regression Trees


G G Moisen, US Forest Service, Ogden, UT, USA
© 2008 Elsevier B.V. All rights reserved.

Introduction
Overview of the Fitting Process
Splitting Rules
Pruning
Costs
Additional Tree-Fitting Issues
Enhancements through Ensemble Methods
Software
Further Reading

Introduction

Frequently, ecologists are interested in exploring ecological relationships, describing patterns and processes, or making spatial or temporal predictions. These purposes often can be addressed by modeling the relationship between some outcome or response and a set of features or explanatory variables. Some examples from ecology include:

• analyzing bioclimatic factors affecting the presence of a species in the landscape,
• mapping forest types from remotely sensed data,
• identifying suitable wildlife habitat,
• predicting forest attributes over large geographic areas,
• making sense of complex ecological data sets with hundreds of variables,
• predicting microhabitat affecting fish species distributions,
• developing screening tests for unwanted plant species,
• monitoring and mapping landcover change through time,
• using environmental variables to model the distribution of vegetation alliances,
• assessing biological indicators of environmental factors affecting fish habitat, and
• identifying fuels characteristics for fire spread models.

Modeling ecological data poses many challenges. The response as well as the explanatory variables may be continuous or discrete. The relationships that need to be deciphered are often nonlinear and involve complex interactions between explanatory variables. Missing values for both explanatory and response variables are not uncommon, and outliers almost always exist. In addition, ecological problems usually demand methods that are both easy to implement and easy to interpret. Frequently, many different statistical tools are employed to handle the unique problems posed by the various scenarios. This diverse set of tools might include multiple or logistic regression, log-linear models, analysis of variance, survival models, and the list continues. Classification and regression trees, however, offer a single tool to work with all these challenges.

This article describes classification and regression trees in general, the major concepts guiding their construction, some of the many issues a modeler may face in their use, and, finally, recent extensions to their methodology. The intent of the article is simply to familiarize the reader with the terminology and general concepts behind this set of tools.

Overview of the Fitting Process

Classification and regression trees are intuitive methods, often described in graphical or biological terms. A tree is typically shown growing upside down, beginning at its root. An observation passes down the tree through a series of splits, or nodes, at which a decision is made as to which direction to proceed based on the value of one of the explanatory variables. Ultimately, a terminal node or leaf is reached and a predicted response is given.

Trees partition the explanatory variables into a series of boxes (the leaves) that contain the most homogeneous collection of outcomes possible. Creating splits is analogous to variable selection in regression. Trees are typically fit via binary recursive partitioning. The term binary refers to the fact that the parent node will always be split into exactly two child nodes. The term recursive is used to indicate that each child node will, in turn, become a parent node, unless it is a terminal node. To start with, a single split is made using one explanatory variable. The variable and the location of the split are chosen to minimize the impurity of the node at that point. There are many ways to minimize the impurity of each node; these are known as splitting rules. Each of the two regions that result from the initial split is then split according to the same criteria, and the tree continues to grow until it is no longer possible to create additional splits or the process is stopped by some user-defined criterion. The tree may then be reduced in size using a process known as pruning.

Assigning a predicted value to the terminal nodes can be done in a number of ways. Typically, in classification trees, terminal nodes are assigned the class that represents the plurality of cases in that node. The rules of class assignment can be altered based on a cost function, to adjust for the consequences of making a mistake for certain classes, or to compensate for unequal sampling of classes. In the case of regression trees, values at the terminal node are assigned using the mean of cases in that node.

As an example, consider the problem of modeling the presence or absence of the tree species Pseudotsuga menziesii (Douglas fir) in the mountains of northern Utah using only information about elevation (ELEV) and aspect (ASP), where the data take the form:

Douglas fir    Elevation (m)    Aspect
Absent         2045             E
Present        2885             SE
Present        2374             NE
Absent         2975             S
...            ...              ...

Figure 1 illustrates a simple classification tree for this problem. Beginning with all 1544 observations at the root, the 393 cases that fall below an elevation of 2202 m are classified as having no Douglas fir. If elevation is greater than 2202 m, as is the case for 1151 observations, then more information is needed. The next split is made at an elevation of 2954 m. These very-high-elevation observations above the cutoff are also classified as having no Douglas fir. Turning now to the remaining 928 moderate-elevation observations, yet more fine-tuning is needed. The third split occurs at an elevation of 2444 m. The 622 moderately high elevation cases above 2444 m are classified as having Douglas fir present. The final split uses aspect to determine whether Douglas fir is likely to grow on the remaining 306 moderately low sites, predicting Douglas fir to be present on the cooler, wetter northerly and easterly slopes, and absent on the hotter, drier exposures.

At a minimum, construction of a tree involves making choices about three major issues. The first choice is how splits are to be made: which explanatory variables will be used and where the split will be imposed. These are defined by splitting rules. The second choice involves determining appropriate tree size, generally using a pruning process. The third choice is to determine how application-specific costs should be incorporated. This might involve decisions about assigning varying misclassification costs and/or accounting for the cost of model complexity.

The structure of the fitted tree, rendered as text:

Is ELEV < 2202 m?  (root: N = 1544; Present 21%, Absent 79%)
  Yes: N = 393 (Present 22%, Absent 78%) -> classify as Absent
  No:  N = 1151 (Present 56%, Absent 43%); Is ELEV < 2954 m?
    No:  N = 223 (Present 21%, Absent 79%) -> classify as Absent
    Yes: N = 928 (Present 62%, Absent 38%); Is ELEV < 2444 m?
      No:  N = 622 (Present 68%, Absent 32%) -> classify as Present
      Yes: N = 306 (Present 48%, Absent 51%); Is ASP = N, NE, or E?
        Yes: N = 204 (Present 69%, Absent 31%) -> classify as Present
        No:  N = 102 (Present 33%, Absent 67%) -> classify as Absent

Figure 1 A simple example of a classification tree describing the relationship between presence/absence of P. menziesii and explanatory factors of elevation (ELEV) and aspect (ASP) in the mountains of northern Utah. In the printed figure, thin-lined boxes indicate a node from which a split emerges and thick-lined boxes indicate a terminal node.
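A tree of this kind can be grown with a few lines of R using the rpart package mentioned in the Software section below. The sketch is illustrative only: the data frame dfir and its columns douglas_fir, elev, and asp are hypothetical stand-ins for the Utah data described above.

# Minimal sketch, assuming a data frame 'dfir' with a factor response
# 'douglas_fir' (Present/Absent) and predictors 'elev' and 'asp'
library(rpart)

fit <- rpart(douglas_fir ~ elev + asp, data = dfir,
             method = "class",                               # classification tree
             control = rpart.control(minsplit = 20, cp = 0.01))

print(fit)                           # text listing of splits, node sizes, and class proportions
plot(fit); text(fit, use.n = TRUE)   # quick sketch of the tree, in the spirit of Figure 1

pred <- predict(fit, newdata = dfir, type = "class")   # predicted Present/Absent

Setting method = "anova" instead would grow a regression tree for a continuous response.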

Splitting Rules

Binary recursive partitioning, as described above, applies to the fitting of both classification and regression trees. However, the criteria for minimizing node impurity (i.e., maximizing homogeneity) are different for the two methods.

For Regression Trees

For regression trees, two common impurity measures are:

Least squares. This method is similar to minimizing least squares in a linear model. Splits are chosen to minimize the sum of squared errors between the observations and the mean in each node.

Least absolute deviations. This method minimizes the mean absolute deviation from the median within a node. The advantage of this over least squares is that it is not as sensitive to outliers and provides a more robust model. The disadvantage is its insensitivity when dealing with data sets containing a large proportion of zeros.

For Classification Trees

There are many criteria by which node impurity is minimized in a classification problem, but four commonly used metrics include:

Misclassification error. The misclassification error is simply the proportion of observations in the node that are not members of the majority class in that node.

Gini index. Suppose there are a total of $K$ classes, each indexed by $k$, and let $\hat{p}_{mk}$ be the proportion of class $k$ observations in node $m$. The Gini index can then be written as $\sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$. This measure is frequently used in practice, and is more sensitive than the misclassification error to changes in node probability.

Entropy index. Also called the cross-entropy or deviance measure of impurity, the entropy index can be written as $-\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$. This too is more sensitive than misclassification error to changes in node probability.

Twoing. Designed for multiclass problems, this approach favors separation between classes rather than node heterogeneity. Every multiclass split is treated as a binary problem, and splits that keep related classes together are favored. The approach offers the advantage of revealing similarities between classes and can be applied to ordered classes as well.
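To make the classification measures concrete, the small R functions below sketch the three impurity formulas for a single node, given its vector of class proportions (these helpers are written here for illustration and are not part of any package). In rpart, the classification splitting rule is selected through the parms argument, e.g. parms = list(split = "gini") or split = "information" for the entropy/deviance rule.

# Impurity of one node, given the vector p of class proportions (sums to 1)
misclass_error <- function(p) 1 - max(p)
gini_index     <- function(p) sum(p * (1 - p))
entropy_index  <- function(p) -sum(p[p > 0] * log(p[p > 0]))

# Example: a node that is 62% Present and 38% Absent
p <- c(Present = 0.62, Absent = 0.38)
misclass_error(p)   # 0.38
gini_index(p)       # 0.4712
entropy_index(p)    # 0.664 (using natural logarithms)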
Pruning

A tree can be grown to be quite large, almost to the point where it fits the training data perfectly, that is, sometimes having just one observation in each leaf. However, this results in overfitting and poor predictions on independent test sets.

A tree may also be constructed that is too small and does not extract all the useful relationships that exist. Appropriate tree size can be determined in a number of ways. One way is to set a threshold for the reduction in the impurity measure, below which no split will be made. A preferred approach is to grow an overly large tree until some minimum node size is reached, and then prune the tree back to an optimal size. Optimal size can be determined using an independent test set or cross-validation (described below). In either case, what results is a tree of optimal size accompanied by an independent measure of its error rate.

Independent Test Set

If the sample size is sufficiently large, the data can be divided randomly into two subsets, one for training and the other for testing. Defining "sufficiently large" is problem specific, but one rule of thumb in classification problems is to allow a minimum of 200 observations for a binary classification model, with an additional 100 observations for each additional class. An overly large tree is grown on the training data. Then, using the test set, error rates are calculated for the full tree as well as for all smaller subtrees (i.e., trees having fewer terminal nodes than the full tree). Error rates for classification trees are typically the overall misclassification rate, while for regression problems, mean squared error or mean absolute deviation from the median are the criteria used to rank trees of different sizes. The subtree with the smallest error rate based on the independent test set is then chosen as the optimal tree.

Cross-Validation

If the sample size is not large, it is necessary to retain all the data for training purposes. However, pruning and testing must be done using independent data. A way around this dilemma is v-fold cross-validation. Here, all the data are used to fit an initial overly large tree. The data are then divided into (usually) v = 10 subgroups, and 10 separate models are fit. The first model uses subgroups 1–9 for training and subgroup 10 for testing. The second model uses subgroups 1–8 and 10 for training and subgroup 9 for testing, and so on. In all cases, an independent test subgroup is available. These 10 test subgroups are then combined to give independent error rates for the initial overly large tree, which was fit using all the data. Pruning of this initial tree proceeds as it did in the case of the independent test set, where error rates are calculated for the full tree as well as all smaller subtrees. The subtree with the smallest cross-validated error rate is then chosen as the optimal tree.

Questions often arise as to whether one should use an independent test set or cross-validated estimates of error rates. One thing to consider is that cross-validated error rates are based on models built with only 90% of the data. Consequently, they will not be as good as a model built with all of the data and will consistently result in slightly higher error rates, providing the modeler a conservative independent estimate of error. However, in regression tree applications in particular, this overestimate of error can be substantially higher than the truth, giving more incentive to the modeler to find an independent test set.

1-SE Rule

Under both the testing and cross-validation approaches above, tree size was based on the minimum error rate. A slight modification of this strategy is often used, in which the smallest tree whose error rate is within one standard error of the minimum is selected. This results in more parsimonious trees, with little sacrifice in error.
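In rpart, for example, the cross-validated error for each subtree in the pruning sequence is stored in the fitted object's cptable, so the 1-SE rule can be applied directly. The sketch below continues the hypothetical Douglas fir example.

printcp(fit)   # cp, tree size, resubstitution and cross-validated error (xerror, xstd)
plotcp(fit)    # cross-validated error plotted against tree size

# 1-SE rule: smallest tree whose cross-validated error is within one
# standard error of the minimum
cp_tab   <- fit$cptable
min_row  <- which.min(cp_tab[, "xerror"])
thresh   <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
best_row <- which(cp_tab[, "xerror"] <= thresh)[1]   # rows are ordered from smallest tree up
pruned   <- prune(fit, cp = cp_tab[best_row, "CP"])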
Costs

The notion of costs is interlaced with the issues of splitting criteria and pruning, and is used in a number of ways in fitting and assessing classification trees.

Costs of Explanatory Variables and Misclassification

In many applications, some explanatory variables are much more expensive to collect or process than others. Preference may be given to choosing less expensive explanatory variables in the splitting process by assigning costs or scalings to be applied when considering splits. This way, the improvement made by splitting on a particular variable is downweighted by its cost in determining the final split.

Other times in practice, the consequences are greater for misclassifying one class over another. Therefore, it is possible to give preference to correctly classifying certain classes, or even to assign specific costs to how an observation is misclassified, that is, which wrong class it falls in.
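Both kinds of cost can be supplied when the tree is grown. In rpart, for instance, a loss matrix attaches unequal penalties to different kinds of misclassification, and the cost argument scales the improvement credited to each explanatory variable; the specific numbers below are purely illustrative and reuse the hypothetical Douglas fir variables.

# loss[i, j] = penalty for predicting class j when the truth is class i;
# row/column order follows the factor levels, assumed here to be (Absent, Present),
# so missing a truly Present site costs 3 times more than a false "Present".
loss <- matrix(c(0, 1,
                 3, 0), nrow = 2, byrow = TRUE)

fit_cost <- rpart(douglas_fir ~ elev + asp, data = dfir, method = "class",
                  parms = list(loss = loss),
                  cost  = c(1, 5))   # treat splits on asp as five times as "expensive" as elev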
Cost of Tree Complexity

As discussed in the pruning section, an overly large tree can easily be grown to some user-defined minimum node size. Often, though, the final tree selected through pruning is substantially smaller than the original overly large tree.

In the case of regression trees, the final tree may be 10 times smaller. This can represent a substantial amount of wasted computing time. Consequently, one can specify a penalty for cost complexity, which is equal to the resubstitution error rate (the error obtained using just the training data) plus some penalty parameter multiplied by the number of terminal nodes. A very large tree will have a low misclassification rate but a high penalty, while a small tree will have a high misclassification rate but a low penalty. Cost complexity can be used to reduce the size of the initial overly large tree grown prior to pruning, which can greatly improve computational efficiency, particularly when cross-validation is being used.
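In notation commonly used in the literature (not introduced in the article itself), this penalized criterion for a tree $T$ may be written as

$$R_\alpha(T) = R(T) + \alpha \, |T|,$$

where $R(T)$ is the resubstitution error rate, $|T|$ is the number of terminal nodes, and $\alpha \ge 0$ is the penalty parameter; decreasing $\alpha$ produces the sequence of progressively larger trees described next.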
One process that combines the cross-validation and cost-complexity ideas is to generate a sequence of trees of increasing size by gradually decreasing the penalty parameter in the cost-complexity approach. Tenfold cross-validation is then applied to this relatively small set of trees to choose the smallest tree whose error falls within one standard error of the minimum. Because a modeler might see a different tree size chosen each time a tenfold cross-validation procedure is run, multiple (e.g., 50) tenfold processes may be run, with the most frequently appearing tree size chosen.

Additional Tree-Fitting Issues

Although the main issues of fitting classification and regression trees revolve around splitting, pruning, and costs, numerous other details remain. Several of these are discussed below.
Heteroscedasticity

In the case of regression trees, heteroscedasticity, or the tendency for higher-value responses to have more variation, can be problematic. Because regression trees seek to minimize within-node impurity, there will be a tendency to split nodes with high variance. Yet, the observations within that node may, in fact, belong together. The remedy is to apply variance-stabilizing transformations to the response, as one would do in a linear regression problem. Although regression trees are invariant to monotonic transformations on explanatory variables, transformations like a natural log or square root may be appropriate for the response variable.
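For example, a regression tree for a strongly right-skewed response (a hypothetical biomass variable is assumed here) might be grown on the log scale with rpart:

# Regression tree on a variance-stabilized (log-transformed) response
fit_reg <- rpart(log(biomass) ~ elev + asp, data = dfir,
                 method = "anova")   # least-squares regression tree

# Back-transform predictions to the original scale (ignoring bias correction)
pred_biomass <- exp(predict(fit_reg, newdata = dfir))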
Linear Structure

Classification and regression trees are not particularly useful when it comes to deciphering linear relationships, having no choice but to produce a long line of splits on the same variable. If the modeler suspects strong linear relationships, small trees can first be fit to the data to partition it into a few more similar groups, and then standard parametric models can be run on these groups. Another alternative, available in some software packages, is creating linear combinations of the explanatory variables, then entering these as new explanatory variables for the tree.

Competitors and Surrogates

It should be noted that when selecting splits, classification and regression trees may track the competitive splits at each decision point along the way. A competitive split is one that results in nearly as pure a node as the chosen split. Classification and regression trees may also keep track of surrogate variables. Use of a surrogate variable at a given split results in a similar node impurity measure (as would a competitor) but also mimics the chosen split itself in terms of which and how many observations go which way in the split.
Missing Values

As mentioned before, one of the advantages of classification and regression trees is their ability to accommodate missing values. If a response variable is missing, that observation can be excluded from the analysis or, in the case of a classification problem, treated as a new class (e.g., "missing") to identify any potential patterns in the loss of information. If explanatory variables are missing, trees can use surrogate variables in their place to determine the split. Alternatively, an observation can be passed to the next node using a variable that is not missing for that observation.
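In rpart, for example, this behavior is governed by na.action and by the surrogate options of rpart.control; the sketch below assumes the hypothetical Douglas fir data now contain missing aspect values. The model summary also lists the competitor splits discussed above.

fit_na <- rpart(douglas_fir ~ elev + asp, data = dfir, method = "class",
                na.action = na.rpart,   # default: drop a row only if y (or every x) is missing
                control = rpart.control(usesurrogate = 2,   # use surrogates, then the majority direction
                                        maxsurrogate = 5,   # keep up to 5 surrogate splits per node
                                        maxcompete   = 4))  # report up to 4 competitor splits

summary(fit_na)   # lists the primary, competitor, and surrogate splits at each node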

Observation Weights

There are a number of instances where it might be desirable to give more weight to certain observations in the training set. Examples include a training sample with a disproportionate number of cases in certain classes, or data collected under a stratified design with some strata sampled more or less intensively than others. In these cases, observations can be weighted to reflect the importance each should bear.
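In rpart this is done through the weights argument; the sketch below up-weights observations from an assumed under-sampled stratum (the column stratum is hypothetical).

# Hypothetical design weights, e.g., the inverse of each stratum's sampling fraction
dfir$w <- ifelse(dfir$stratum == "wilderness", 4, 1)

fit_w <- rpart(douglas_fir ~ elev + asp, data = dfir,
               method = "class", weights = dfir$w)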
Variable Importance

The importance of individual explanatory variables can be determined by measuring the proportion of variability accounted for by splits associated with each explanatory variable. Alternatively, one may address variable importance by determining the effect of excluding variables in turn and assessing the predictive accuracy of the resulting models.
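The first of these measures is returned directly by rpart; permutation-based alternatives in the spirit of the second approach are provided by ensemble implementations such as randomForest (see below).

fit$variable.importance            # improvement attributed to each variable,
                                   # including credit earned as a surrogate
barplot(fit$variable.importance)   # quick visual comparison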

Model Assessment

While a single measure of error may be used to pick the optimum tree size, no single measure of error can capture the adequacy of the model for often diverse applications. Consequently, several measures of error may need to be reported for the final model. In classification problems, these may include the misclassification rate and kappa. Kappa measures the proportion of correctly classified units after accounting for the probability of chance agreement. In classification problems involving only a zero–one response, additional measures include sensitivity, specificity, and receiver operating characteristic (ROC) curves with the associated area under the curve (AUC). In regression problems, measures of interest might include correlation coefficients, root mean squared error, average absolute error, bias, and the list continues. The literature on error assessment is vast. The point here is that an optimal tree size may be determined using one criterion, but it is often necessary to report several measures to assess the applicability of the model for different applications.
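For a classification tree, several of these measures follow directly from a confusion matrix. The base-R sketch below computes the misclassification rate and kappa by hand, assuming a hypothetical held-out data frame dfir_test and the pruned tree from the earlier sketches.

pred <- predict(pruned, newdata = dfir_test, type = "class")
cm   <- table(observed = dfir_test$douglas_fir, predicted = pred)

n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed agreement (overall accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa <- (po - pe) / (1 - pe)

1 - po   # misclassification rate
kappa    # chance-corrected agreement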
Enhancements through Ensemble Methods

While classification and regression trees are powerful methods in and of themselves, much work has been done in the data mining and machine learning fields to improve the predictive ability of these tools by combining separate tree models into what is often called a committee of experts, or ensemble. Following is a very brief description of some of these newer techniques using classification and regression trees as building blocks.

Bagging and Boosting

Two simple enhancements to tree-based methods are called bagging and boosting. These iterative schemes each produce a committee of expert tree models by resampling with replacement from the initial data set. Afterward, the expert tree models are averaged using a plurality voting scheme if the response is discrete, or simple averaging if the response is continuous. The difference between bagging and boosting is the way in which data are resampled. In the former, all observations have equal probability of entering the next bootstrap sample; in the latter, problematic observations (i.e., observations that have been frequently misclassified) have a higher probability of selection.
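Bagging on its own is available in several R packages; one sketch uses ipred, again with the hypothetical Douglas fir data:

library(ipred)

bag <- bagging(douglas_fir ~ elev + asp, data = dfir,
               nbagg = 100,   # number of bootstrap samples / trees in the committee
               coob  = TRUE)  # estimate error from out-of-bag observations
print(bag)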

Random Forests

Another recent ensemble method is called 'random forests'. In this technique, a bootstrap sample of the training data is chosen. At the root node, a small random sample of explanatory variables is selected and the best split made using that limited set of variables. At each subsequent node, another small random sample of the explanatory variables is chosen, and the best split made. The tree continues to be grown in this fashion until it reaches the largest possible size, and is left unpruned. The whole process, starting with a new bootstrap sample, is repeated a large number of times. As in committee models, the final prediction is a (weighted) plurality vote or average over the predictions of all the trees in the collection.
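A sketch with the randomForest package (one of several R implementations; data and column names are the hypothetical ones used throughout):

library(randomForest)

rf <- randomForest(douglas_fir ~ elev + asp, data = dfir,
                   ntree = 500,         # number of bootstrap samples / trees
                   mtry  = 1,           # explanatory variables tried at each split
                   importance = TRUE)

print(rf)        # out-of-bag error estimate and confusion matrix
importance(rf)   # permutation-based variable importance
pred_rf <- predict(rf, newdata = dfir_test)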
Stochastic Gradient Boosting

Yet another ensemble method is known as stochastic gradient boosting. In this technique, many small classification or regression trees are built sequentially from residual-like measures from the previous tree. At each iteration, a tree is built from a random subsample of the data set (selected without replacement), producing an incremental improvement in the model. Ultimately, all the small trees are stacked together as a weighted sum of terms. The overall model accuracy gets progressively better with each additional term.
Software

A wide variety of software packages are available for implementing classification and regression trees. The R rpart library and affiliated packages, part of the R public domain statistical software, are widely used. Popular commercial packages include Salford Systems' CART, RuleQuest's See5 and Cubist, and tree-based models in S-Plus, to name a few.

See also: Data Mining.

Further Reading

Breiman L (2001) Random forests. Machine Learning 45: 5–32.
Breiman L, Friedman JH, Olshen RA, and Stone CJ (1984) Classification and Regression Trees. Pacific Grove, CA: Wadsworth.
Clark LA and Pregibon D (1992) Tree-based models. In: Chambers JM and Hastie TJ (eds.) Statistical Models in S, pp. 377–419. Pacific Grove, CA: Wadsworth and Brooks.
De’ath G and Fabricius KE (2000) Classification and regression trees: A powerful yet simple technique for ecological data analysis. Ecology 81: 3178–3192.
Everitt BS and Hothorn T (2006) A Handbook of Statistical Analyses Using R. Boca Raton, FL: Chapman and Hall/CRC.
Friedman JH (2002) Stochastic gradient boosting. Computational Statistics and Data Analysis 38(4): 367–378.
Hastie T, Tibshirani R, and Friedman J (2001) The Elements of Statistical Learning. New York: Springer.
Murthy SK (1998) Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery 2: 345–389.
Quinlan JR (1993) C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann.
Ripley BD (1996) Pattern Recognition and Neural Networks. Cambridge: Cambridge University Press.
Steinberg D and Colla P (1995) CART: Tree-Structured Nonparametric Data Analysis. San Diego, CA: Salford Systems.
Vayssieres MP, Plant RP, and Allen-Diaz BH (2000) Classification trees: An alternative non-parametric approach for predicting species distributions. Journal of Vegetation Science 11: 679–694.
Venables WN and Ripley BD (1999) Modern Applied Statistics with S-Plus. New York: Springer.

Climate Change 1: Short-Term Dynamics


G Alexandrov, National Institute for Environmental Studies, Tsukuba, Japan
© 2008 Elsevier B.V. All rights reserved.

Introduction
Carbon Budget Components
Human Intervention
Biosphere Response
Concluding Remarks
Further Reading

Introduction

Short-term dynamics of the global carbon cycle is closely related to the concept of the climate system: the totality of the atmosphere, hydrosphere, biosphere, and their interactions. Human activities have been substantially increasing the concentrations of carbon dioxide and other greenhouse gases in the atmosphere and thus inducing potentially adverse changes in the climate system. This tendency has become a matter of public concern and led to the United Nations Framework Convention on Climate Change (UNFCCC). This convention suggests protection of carbon pools, enhancement of carbon sinks, and reduction of emissions from carbon sources.

Carbon Pools

A carbon pool (or reservoir, or storage) is a system that has the capacity to accumulate or release carbon. The absolute quantity of carbon held within it at a specified time is called the carbon stock. Transfer of carbon from one carbon pool to another is called carbon flux. Transfer from the atmosphere to any other carbon pool is said to be carbon sequestration. The addition of carbon to a pool is referred to as uptake.

Carbon Sink

A carbon sink is a process or mechanism that removes carbon dioxide from the atmosphere. A given carbon pool can be a sink, during a given time interval, if carbon inflow exceeds carbon outflow.

Carbon Source

A carbon source is a process or mechanism that releases carbon dioxide to the atmosphere. A given carbon pool can be a source, during a given time interval, if carbon outflow exceeds carbon inflow.

Carbon Budget

The estimates of carbon stocks and carbon fluxes form the carbon budget, which is normally used as a kind of diagnostic tool in studies of the short-term dynamics of the global carbon cycle.

Carbon Budget Components

The components of the global carbon budget may be subdivided into fossil and dynamic categories (Figure 1).

Fossil Components

The fossil components are naturally inert. The stock of fossil organic carbon and mineral carbonates (estimated at 65.5 × 10⁶ PgC) is relatively constant and would not dramatically change within a century. The lithospheric part of the carbon cycle is very slow; all the fluxes are less than 1 PgC yr⁻¹. For example, volcanic emissions are
