
Classification and Regression Trees

Data Mining and Predictive Analytics


Overview
• Classification trees, a method for classification and profiling
– Popular because of simple interpretation
– Creates understandable rules (useful for profiling)
• We will:
– Discuss the main ideas behind classification trees
– Address some of the key methodological questions
– Look at examples
– A separate slide deck looks at implementation in R
Example: Beer Preference
• Hacker Pschorr
– One of the oldest beer brewing companies in Munich
– Collects data on beer preference (light/regular) and demographic info
• Goal: determine demographic factors for preferring light beer
• We will first focus on two predictors: Income and Age
Underlying Idea
• Recursively separating the records into subgroups by creating splits on the predictors
– This splitting of the data set can be visualized as trees
[Figure: the full data set ("All data") is split into Income ≤ 38,562 and Income > 38,562; a "Beer Preferences" scatter plot of Age vs. Income (regular vs. light drinkers) shows the corresponding vertical cut at Income = 38,562]
Underlying Idea - 2
• Recursively separating the records into subgroups by creating splits on the predictors
– This splitting of the data set can be visualized as trees
[Figure: the same tree after a second level of splits: the Income ≤ 38,562 branch is further split into Age ≤ 37.5 and Age > 37.5; the scatter plot shows the added horizontal cut at Age = 37.5]
Underlying Idea - 3
• Recursively separating the records into subgroups by creating splits on the predictors
– This splitting of the data set can be visualized as trees
[Figure: the full tree: the Income ≤ 38,562 branch is split into Age ≤ 37.5 / Age > 37.5 and the Income > 38,562 branch into Age ≤ 48.5 / Age > 48.5; each terminal node is labeled Class = 0 (regular) or Class = 1 (light), and the scatter plot is partitioned into the corresponding rectangles]
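A tree like the one above can be grown in R with the rpart package. The snippet below is a minimal sketch only: the data frame `beer` and its columns `preference` (light/regular), `income`, and `age` are hypothetical stand-ins for the Hacker Pschorr data, the actual split values depend on the data, and the full R treatment is in the separate implementation deck.

```r
library(rpart)

# grow a classification tree for beer preference from the two predictors
fit <- rpart(preference ~ income + age,
             data   = beer,
             method = "class")   # "class" requests a classification tree

print(fit)   # lists the recursive splits (e.g. an income cut, then age cuts)
```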
The Key Questions
• How do we choose the split variable and its value?
• When should we stop?
• What rule do we use for classification/prediction in the end nodes?
• How do we classify a new record?
[These questions are shown as annotations on the example tree built from the Income and Age splits]
Determining the Best Split
• What do we mean by “best”?
– We want to find the split that best discriminates between records with different outcomes
– After the split we want the new sub-nodes to be more homogeneous in the outcome variable
– We need a measure of “homogeneity”
• There are two commonly used impurity measures:
– Entropy
– Gini index
Determining the Best Split - 2
• The CART algorithm (Classification And Regression Trees) evaluates all possible binary splits (sketched in code below)
• For each variable
  – For each possible split value of that variable
    • Calculate the impurity of the resulting sub-nodes
    • Summarize the impurity of the split as the weighted average of the impurities of the sub-nodes
• Select the best “variable-value” split
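As a rough illustration of this exhaustive search (not the internals of any particular CART implementation), the R sketch below scans every predictor and every candidate cut point and keeps the split with the lowest weighted Gini impurity. The function and argument names (`best_split`, `y`) are made up for the example.

```r
# Gini impurity of a vector of class labels (defined formally on a later slide)
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

# exhaustive search over all "variable-value" splits of numeric predictors
best_split <- function(data, predictors, y = "y") {
  best <- list(impurity = Inf)
  for (v in predictors) {
    values <- sort(unique(data[[v]]))
    # candidate cut points lie halfway between consecutive observed values
    cutpoints <- (head(values, -1) + tail(values, -1)) / 2
    for (cutpoint in cutpoints) {
      left  <- data[[y]][data[[v]] <= cutpoint]
      right <- data[[y]][data[[v]] >  cutpoint]
      w   <- length(left) / nrow(data)               # weight by sub-node size
      imp <- w * gini(left) + (1 - w) * gini(right)  # weighted average impurity
      if (imp < best$impurity)
        best <- list(variable = v, value = cutpoint, impurity = imp)
    }
  }
  best
}

# e.g. best_split(beer, c("income", "age"), y = "preference")
```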
Determining the Best Split - 3
• Searching for the best split value for the Income variable
[Figure: the "Beer Preferences" scatter plot of Age vs. Income (regular vs. light), on which candidate split values for Income are evaluated]


Impurity Measures
• Notation
– K: the number of classes
– p_i: the percentage of records belonging to class i
• The Gini index is defined as: $GI = 1 - \sum_{i=1}^{K} p_i^2$
• Entropy is defined as: $E = -\sum_{i=1}^{K} p_i \log_2(p_i)$
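In R the two formulas translate directly. The function names below are my own, and `p` is assumed to be the vector of class proportions (p_1, ..., p_K) at a node.

```r
gini_index <- function(p) 1 - sum(p^2)

entropy <- function(p) {
  p <- p[p > 0]          # by convention 0 * log2(0) = 0
  -sum(p * log2(p))
}

# the worked examples on the next two slides:
gini_index(c(1, 0))      # pure node:  0
entropy(c(1, 0))         # pure node:  0
gini_index(c(0.5, 0.5))  # 50/50 node: 0.5
entropy(c(0.5, 0.5))     # 50/50 node: 1
```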
Impurity Measures - 2
• Assume we have 2 classes, class 1 and class 0, and we are at a node where all the sample is class 1
• We want to calculate the impurity measures at the node
• What is the Gini index? $GI = 1 - [1^2 + 0^2] = 0$
• What is the entropy? $E = -[1 \log_2(1)] = 0$
Impurity Measures - 3
• Assume we have 2 classes, class 1 and class 0, and we are at a node where half the sample is class 1 and half is class 0
• We want to calculate the impurity measures at the node
• What is the Gini index? $GI = 1 - [0.5^2 + 0.5^2] = 0.5$
• What is the entropy? $E = -[0.5 \log_2(0.5) + 0.5 \log_2(0.5)] = 1$
Impurity Measures - 4
• Both impurity measures are zero for pure nodes
• Both are maximized when heterogeneity is greatest
• The measures differ in how they quantify the “relative” impurity of nodes, and therefore the choice of measure may affect the resulting tree
[Figure: Gini index and entropy plotted against the proportion of observations in class 1; both are 0 at proportions 0 and 1 and peak at 0.5, where entropy reaches 1 and the Gini index reaches 0.5]
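The figure can be reproduced for the two-class case with a few lines of base R, plotting both measures against the proportion of observations in class 1.

```r
p <- seq(0, 1, by = 0.01)
gini_curve    <- 2 * p * (1 - p)      # = 1 - p^2 - (1 - p)^2
entropy_curve <- ifelse(p %in% c(0, 1), 0,
                        -(p * log2(p) + (1 - p) * log2(1 - p)))

plot(p, entropy_curve, type = "l", lty = 2, ylim = c(0, 1.2),
     xlab = "proportion of observations in class 1",
     ylab = "impurity measure")
lines(p, gini_curve, lty = 1)
legend("topright", legend = c("Entropy", "Gini index"), lty = c(2, 1))
```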
Optimal Split
• The overall impurity of a split is the weighted average of the impurities of the sub-nodes it creates
– Each sub-node is weighted by its number of observations
• The algorithm picks the split that minimizes overall impurity
• Example: [Figure: a scatter plot over two predictors in which the proposed split makes the Gini coefficient or entropy zero]
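A small worked example with made-up numbers: suppose a split sends 60 of 100 records to a left sub-node with class proportions (0.9, 0.1) and 40 records to a right sub-node with proportions (0.2, 0.8).

```r
# Gini impurity of each sub-node (two-class case)
gi_left  <- 1 - 0.9^2 - 0.1^2    # 0.18
gi_right <- 1 - 0.2^2 - 0.8^2    # 0.32

# overall impurity of the split: sub-nodes weighted by their number of records
overall <- (60 / 100) * gi_left + (40 / 100) * gi_right
overall                          # 0.236; the algorithm keeps the split with the lowest such value
```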
The Key Questions (revisited)
• How do we choose the split variable and its value?
• When should we stop?
• What rule do we use for classification/prediction in the end nodes?
• How do we classify a new record?
When to stop growing the tree?
• One option is to stop when we can no longer find a split that improves the impurity measures
• There is a risk of overfitting if we keep splitting the data until we only have very few points at each node
• The goal is to arrive at a tree that captures the patterns but not the noise in the training data, thereby maximizing prediction accuracy on new data
[Figure: a moderately sized tree and a fully grown tree with a "Stop here" marker on the former; the first models the relationship between the outcome and the predictors, while the second models noise in the training set]
Avoiding Overfitting: Stopping Rules
• There are a number of stopping rules that one can use to avoid overfitting (see the sketch below):
– Set a minimum number of records at a node
– Set a maximum number of splits
– Require statistical significance of the split
• There is no simple, good way to determine the right stopping point (it depends on the dataset)
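In R's rpart package, the first two stopping rules correspond roughly to the `minsplit` and `maxdepth` control parameters, while the complexity parameter `cp` plays a role similar to requiring a minimum improvement from each split. The sketch below is illustrative only; the numeric values are arbitrary, and `beer` is the hypothetical data frame used earlier.

```r
library(rpart)

fit <- rpart(preference ~ income + age, data = beer, method = "class",
             control = rpart.control(
               minsplit = 20,   # do not attempt a split at nodes with fewer than 20 records
               maxdepth = 5,    # limit tree depth, and hence the number of splits on any path
               cp       = 0.01  # keep a split only if it improves the overall fit by at least this factor
             ))
```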
Avoiding Overfitting: Pruning
• Pruning refers to using the validation sample to prune back (cut branches off) the fully grown tree
– Pruning has proven more successful in practice than stopping rules
[Figure: a fully grown tree with a "Prune here" marker at the branch chosen using the validation sample]
• Note: pruning uses the validation sample to select the best tree, so the performance of the pruned tree on the validation data is not fully reflective of its performance on completely new data
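A sketch of pruning with rpart, for illustration: rpart grows a large tree and reports a cross-validated error (`xerror`) for each subtree size in its `cptable`; here cross-validation stands in for the separate validation sample described on the slide.

```r
library(rpart)

# grow a deliberately large tree first
full <- rpart(preference ~ income + age, data = beer, method = "class",
              control = rpart.control(cp = 0, minsplit = 2))

# pick the complexity parameter with the lowest cross-validated error ...
best_cp <- full$cptable[which.min(full$cptable[, "xerror"]), "CP"]

# ... and prune the tree back to that size
pruned <- prune(full, cp = best_cp)
```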
Returning to the example
• Selecting the “right” tree
• Details of pruning will be discussed later
[Figure: % error versus number of decision nodes for the training and validation samples; the chosen tree is the one with the lowest validation error (pick the smallest tree if there is a choice of more than one)]
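For reference, rpart can produce a chart like the one above from the cross-validated errors in its `cptable` (using the `full` tree grown in the pruning sketch earlier):

```r
printcp(full)   # error by tree size, as a table
plotcp(full)    # cross-validated error plotted against tree size / complexity parameter
```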
