
Decision Trees

Wiebke Köpp

Technische Universität München

Reading Material:
"Machine Learning: A Probabilistic Perspective" by Murphy [ch. 16.2]

Further extra reading:


"Pattern Recognition and Machine Learning" by Bishop [ch. 14.4]

Note: These slides are adapted from slides originally by Daniala Korhammer.
Additionally, some of them are inspired by "Understanding Random Forests" (PhD thesis
by Gilles Louppe).
The 20-Questions Game

Does it have fur?
├─ Yes: Does it bark?
│   ├─ Yes: Dog
│   └─ No:  Cat
└─ No:  Does it fly?
    ├─ Yes: Bird
    └─ No:  Fish

(Figure: scatter plot of the data in the (x1, x2) plane, points colored by class.)

Example: data X with two features x1 and x2 and class labels z


Goal: classification of unseen instances (generalization)

(Figure: the same data with the axis-aligned decision boundaries of the tree below overlaid.)

x2 ≤ 3
├─ True:  x2 ≤ 1.7
│   ├─ True:  x1 ≤ 3.7 → ..., ...
│   └─ False: x1 ≤ 1.7 → (10, 0, 2), ...
└─ False: x1 ≤ 7.8
    ├─ True:  (2, 3, 112)
    └─ False: x2 ≤ 3.9
        ├─ True:  (4, 0, 1)
        └─ False: (0, 24, 1)

Distribution of classes in leaf: (red, green, blue)

Interpretation of a Decision Tree

Example subtree:  x2 ≤ 3.9 → True: (4, 0, 1) | False: (0, 24, 1)

• Node ≙ feature test → leads to decision boundaries.
• Branch ≙ different outcome of the preceding feature test.
• Leaf ≙ region in the input space and the distribution of samples in that region.

Decision trees partition the input space into cuboid regions.

To classify a new sample x:
• Test the attributes of x to find the region R that contains it and get
the class distribution nR = (nc1 ,R , nc2 ,R , . . . , nck ,R ) for
C = {c1 , . . . , ck }.
• The probability that a data point x ∈ R should be classified as belonging to
  class c is then:

      p(z = c | R) = nc,R / Σ_{ci ∈ C} nci,R

• A new unseen sample x is simply given the label which is most common in its
  corresponding region:

      y = arg max_c p(z = c | x) = arg max_c p(z = c | R) = arg max_c nc,R

Classification of x = (8.5, 3.5)ᵀ, using the tree from above:

x2 = 3.5 > 3, so we take the False branch to x1 ≤ 7.8; x1 = 8.5 > 7.8, so we continue
to x2 ≤ 3.9; x2 = 3.5 ≤ 3.9, so x falls into the leaf with class counts (4, 0, 1) and
is given the majority class, red.
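As a concrete illustration of the prediction rule from the previous slide, here is a
minimal Python sketch (not the lecture's code) that hard-codes the right-hand part of
the example tree and classifies x = (8.5, 3.5)ᵀ; the tuple/array tree representation
and the function name are my own choices.

```python
import numpy as np

# Inner node: (feature index, threshold, subtree if test is True, subtree if test is False)
# Leaf: class counts (n_red, n_green, n_blue). Only the part of the example tree needed
# for x = (8.5, 3.5) is spelled out; the True branch of the root is omitted here.
tree = (1, 3.0,                        # x2 <= 3 ?
        None,                          # True branch (not needed for this example)
        (0, 7.8,                       # False: x1 <= 7.8 ?
         np.array([2, 3, 112]),        #   True: leaf (2, 3, 112)
         (1, 3.9,                      #   False: x2 <= 3.9 ?
          np.array([4, 0, 1]),         #     True: leaf (4, 0, 1)
          np.array([0, 24, 1]))))      #     False: leaf (0, 24, 1)

def predict_proba(node, x):
    """Descend to the leaf region R that contains x and return p(z = c | R)."""
    while isinstance(node, tuple):
        feature, threshold, true_branch, false_branch = node
        node = true_branch if x[feature] <= threshold else false_branch
    return node / node.sum()

x = np.array([8.5, 3.5])
p = predict_proba(tree, x)             # ends in the leaf with counts (4, 0, 1)
print(p, "-> y =", p.argmax())         # [0.8 0.  0.2] -> y = 0 (red)
```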

What does the perfect decision tree look like?
It performs well on new data. → generalization
How can we test this?

[ Training set DT | Test set Dt ]

• Split training data D into a training set DT = (XT, zT) and a test set Dt = (Xt, zt),
• build a tree from the training set DT,
• predict test set labels yt using the tree,
• and compare predictions yt to true labels zt for evaluation.
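A minimal NumPy sketch of such a split; the function name, the 1/3 test fraction and
the fixed seed are illustrative choices, not part of the slides.

```python
import numpy as np

def train_test_split(X, z, test_fraction=1/3, seed=0):
    """Randomly assign each sample either to the training set or to the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # shuffle the sample indices
    n_test = int(round(test_fraction * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], z[train_idx], X[test_idx], z[test_idx]

# X: (N, 2) feature matrix, z: (N,) class labels
# X_T, z_T, X_t, z_t = train_test_split(X, z)
```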

Idea: Build all possible trees and evaluate how they perform on new data.

All combinations of features and values can serve as tests in the tree:

feature tests:
  x1 ≤ 0.36457631, x1 ≤ 0.50120369, x1 ≤ 0.54139549, ...
  x2 ≤ 0.09652214, x2 ≤ 0.20923062, ...

In our simple example:
2 features × 300 unique values per feature
⇒ 2 features × 299 possible thresholds per feature:
598 possible tests at the root node, slightly fewer at each descendant

Iterating over all possible trees is possible only for very small examples
(→ low number of possible tests) because the number of trees quickly
explodes.

Finding the optimal tree is NP-complete.

Instead: Grow the tree top-down and choose the best split node-by-node
using a greedy heuristic on the training data.

For example: split the node when it improves the misclassification rate iE at node t,

    iE(t) = 1 − max_c p(z = c | t)

The improvement when performing a split s of t into tL and tR, for i(t) = iE(t), is
given by

    ∆i(s, t) = i(t) − pL i(tL) − pR i(tR)

where pL and pR are the fractions of samples of t that fall into tL and tR.

(Figure: an example split of a node t into tL and tR on the training data, with
i(t) = 1 − 0.5 = 0.5, i(tL) = 0 and i(tR) = 0.33.)

(Figure: an example node where neither of the candidate tests x1 ≤ 5 and x2 ≤ 3
improves the misclassification rate on its own.)

The node is not split even though combining the two tests would result in perfect
classification.

Use a criterion i(t) that measures how pure the class distribution at a
node t is. It should be
• maximum if classes are equally distributed in the node
• minimum, usually 0, if the node is pure
• symmetric

Impurity measures

With πc = p(z = c | t):

Misclassification rate:
    iE(t) = 1 − max_c πc

Entropy:
    iH(t) = − Σ_{ci ∈ C} πci log πci
    (Note that lim_{x→0+} x log x = 0.)

Gini index:
    iG(t) = Σ_{ci ∈ C} πci (1 − πci) = 1 − Σ_{ci ∈ C} πci²

(Figure: iE(t), iH(t) and iG(t) for a two-class problem C = {c1, c2}, plotted against
p(c1 | t) = 1 − p(c2 | t).)
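All three measures are simple functions of the class proportions πc. A small sketch
(the base-2 logarithm for the entropy is my choice; the slide leaves the base open):

```python
import numpy as np

def misclassification(pi):
    return 1.0 - np.max(pi)

def entropy(pi):
    pi = pi[pi > 0]                        # convention: 0 * log 0 = 0
    return -np.sum(pi * np.log2(pi))

def gini(pi):
    return 1.0 - np.sum(pi ** 2)           # equals sum_c pi_c * (1 - pi_c)

pi = np.array([67, 65, 168]) / 300.0       # e.g. the class proportions used later at the root node
print(misclassification(pi), entropy(pi), gini(pi))
```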

Detour: Information Theory

Information theory is about encoding and transmitting information

We would like to encode four messages:


• m1 = “You have a lecture.” p(m1 ) = 0.01 → Code 111
• m2 = “You have an exam.” p(m2 ) = 0.02 → Code 110
• m3 = “There is free beer.” p(m3 ) = 0.30 → Code 10
• m4 = “Nothing happening.” p(m4 ) = 0.67 → Code 0
The code above is called a Huffman Code.

On average:

0.01 × 3 bits + 0.02 × 3 bits + 0.3 × 2 bits + 0.67 × 1 bit = 1.36 bits

Shannon Entropy

Shannon entropy gives a lower bound on the average number of bits needed to encode
a set of messages.

It is defined over a discrete random variable X with possible values {x1, . . . , xn}:

    H(X) = − Σ_{i=1}^{n} p(X = xi) log2 p(X = xi)

H(X) measures the uncertainty about the outcome of X.
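As a sanity check on the detour, the following sketch computes both the average Huffman
code length from the previous slide (1.36 bits) and the entropy of the four-message
distribution, which comes out to roughly 1.09 bits, i.e. slightly below the code's
average length:

```python
import numpy as np

p = np.array([0.01, 0.02, 0.30, 0.67])     # probabilities of m1 .. m4
code_lengths = np.array([3, 3, 2, 1])      # lengths of the Huffman codes 111, 110, 10, 0

avg_length = np.sum(p * code_lengths)      # average code length: 1.36 bits
H = -np.sum(p * np.log2(p))                # Shannon entropy: the lower bound

print(f"average code length: {avg_length:.2f} bits, entropy: {H:.2f} bits")
```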

Building a decision tree

Compare all possible tests and choose the one where the improvement ∆i(s, t) for some
splitting criterion i(t) is largest.

At the root node t of the example (300 samples, class counts 67, 65 and 168):

    iG(t) = 1 − (67/300)² − (65/300)² − (168/300)² ≈ 0.5896

After testing x2 ≤ 3 (153 samples in tL, 147 in tR):

    iG(tL) ≈ 0.6548 and iG(tR) ≈ 0.3632

    ⇒ ∆iG(x2 ≤ 3, t) = iG(t) − (153/300) iG(tL) − (147/300) iG(tR) ≈ 0.07768
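A brute-force sketch of this search (not the lecture's implementation): it scores every
feature/threshold pair by its Gini improvement, reusing gini() from the impurity sketch
above. The optional `features` argument is an addition of mine that becomes useful for
random forests later.

```python
import numpy as np

def class_probs(z, n_classes=3):
    """Class proportions pi_c = p(z = c | t) for a node containing the labels z."""
    return np.bincount(z, minlength=n_classes) / len(z)

def best_split(X, z, features=None):
    """Return the (feature, threshold) test with the largest Gini improvement.
    If `features` is given, only tests on that subset of features are considered."""
    parent = gini(class_probs(z))                     # gini() from the impurity sketch
    best = (None, None, -np.inf)                      # (feature, threshold, improvement)
    for j in (features if features is not None else range(X.shape[1])):
        values = np.unique(X[:, j])
        for thr in (values[:-1] + values[1:]) / 2:    # midpoints between consecutive values
            left = X[:, j] <= thr
            p_left = left.mean()
            delta = (parent - p_left * gini(class_probs(z[left]))
                            - (1 - p_left) * gini(class_probs(z[~left])))
            if delta > best[2]:
                best = (j, thr, delta)
    return best

# On the example data this should recover a test close to x2 <= 3 at the root,
# with an improvement of about 0.078.
```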

(Figure: the Gini improvement ∆iG(x1, D) and ∆iG(x2, D) for every candidate threshold
on x1 and x2, plotted alongside the data.)
Decision boundaries at depth 1


Accuracy on the whole data set: 58.3%


Decision boundaries at depth 2


Accuracy on the whole data set: 77%


Decision boundaries at depth 3


Accuracy on the whole data set: 84.3%


Decision boundaries at depth 4


Accuracy on the whole data set: 90.3%


Decision boundaries of a maximally pure tree


Accuracy on the whole data set: 100% → generalisation?


Overfitting

Overfitting typically occurs when we try to model the training data perfectly.

Overfitting means poor generalisation! How can we spot overfitting?

• low training error, possibly 0
• testing error is comparably high

(Figure: training and test accuracy as a function of the number of leaf nodes, with
the underfitting and overfitting regimes marked.)

The training performance monotonically increases with every split.


The test performance tells us how well our model generalises, not the
training performance! → validation set

When to stop growing?

Possible stopping (or pre-pruning) criteria (see the sketch below):

• the distribution in the branch is pure, i.e. i(t) = 0
• the maximum depth is reached
• the number of samples in each branch is below a certain threshold tn
• the benefit of splitting is below a certain threshold, ∆i(s, t) < t∆
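A sketch of how these criteria plug into top-down growing, reusing gini(), class_probs()
and best_split() from the earlier sketches; the concrete values of max_depth, min_samples
and min_improvement stand in for tn and t∆ and are arbitrary.

```python
import numpy as np

def grow_tree(X, z, depth=0, max_depth=5, min_samples=5, min_improvement=1e-3):
    """Grow a tree top-down; the pre-pruning criteria decide when a node becomes a leaf."""
    counts = np.bincount(z, minlength=3)                  # leaf representation: class counts
    # Stopping criteria: pure node, maximum depth reached, or too few samples.
    if gini(class_probs(z)) == 0 or depth >= max_depth or len(z) < min_samples:
        return counts
    feature, threshold, improvement = best_split(X, z)
    if feature is None or improvement < min_improvement:  # benefit of splitting too small
        return counts
    left = X[:, feature] <= threshold
    return (feature, threshold,
            grow_tree(X[left], z[left], depth + 1, max_depth, min_samples, min_improvement),
            grow_tree(X[~left], z[~left], depth + 1, max_depth, min_samples, min_improvement))
```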

Or we can grow a tree maximally and then (post-)prune it.

Creating the validation set

[ Training set DT | Validation set DV | Test set Dt ]
(the former training set is split once more into a training part DT and a validation part DV)

Be sure to have separate training, validation and test sets:


• training set DT to build the tree,
• validation set DV to prune the tree,
• test set Dt to evaluate the final model.
Splits of (2/3, 1/3) are common.
Test data for later evaluation should not be used for pruning, or you will not get an
honest estimate of the model's performance!

Reduced Error Pruning

Let T be our decision tree and t one of its inner nodes.


Pruning T w.r.t. t means deleting all descendant nodes of t (but not t
itself). We denote the pruned tree T \Tt .
(Figure: a tree T, the subtree Tt rooted at an inner node t, and the pruned tree T\Tt
in which all descendants of t have been removed.)

• Use the validation set to get an error estimate errDV(T).
• For each inner node t, calculate errDV(T\Tt).
• Prune the tree at the node that yields the highest error reduction.
• Repeat until for all nodes t: errDV(T) < errDV(T\Tt).

After pruning you may use both training and validation data to update the labels at each leaf.
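A sketch of reduced error pruning on the tuple/array tree representation from the earlier
sketches; the error is the misclassification rate on the validation set, and pruning stops
once no pruned tree lowers it (my reading of the stopping rule above).

```python
import numpy as np

def predict(node, x):
    """Class with the highest count in the leaf that contains x."""
    while isinstance(node, tuple):
        feature, threshold, true_branch, false_branch = node
        node = true_branch if x[feature] <= threshold else false_branch
    return np.argmax(node)

def error(tree, X, z):
    return np.mean([predict(tree, x) != zi for x, zi in zip(X, z)])

def merged_counts(node):
    """Class counts of all samples below a node (the leaf that replaces a pruned subtree)."""
    if not isinstance(node, tuple):
        return node
    return merged_counts(node[2]) + merged_counts(node[3])

def pruned_versions(node):
    """Yield every tree obtained by collapsing exactly one inner node into a leaf."""
    if not isinstance(node, tuple):
        return
    yield merged_counts(node)
    feature, threshold, true_branch, false_branch = node
    for sub in pruned_versions(true_branch):
        yield (feature, threshold, sub, false_branch)
    for sub in pruned_versions(false_branch):
        yield (feature, threshold, true_branch, sub)

def reduced_error_pruning(tree, X_val, z_val):
    while True:
        candidates = list(pruned_versions(tree))
        if not candidates:
            return tree                                   # nothing left to prune
        best = min(candidates, key=lambda t: error(t, X_val, z_val))
        if error(best, X_val, z_val) >= error(tree, X_val, z_val):
            return tree                                   # no pruning reduces the validation error
        tree = best
```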

Random Forests

Train not only one decision tree but an army of them,


also called an ensemble.

To label a new data point, evaluate all trees and let them vote. Label the
data point with the most “popular” class.

But... if we build all the trees in the same way, they will always vote in unison.
→ We need to introduce variance between the trees!

Random Forests: Introducing Variance

Let N denote the number of samples in the training set DT and D the number of
features (attributes, dimensions).
• Instead of using the whole training set to build the tree, we build
each tree Ti from a bootstrap sample:
From DT create DTi by randomly drawing N samples with
replacement.
By sampling with replacement, each tree sees on average only ≈ 63.2 % of the distinct samples in DT.

• Build decision trees Ti top-down with a greedy heuristic and a small variation:
  at each node, instead of picking the best of all possible tests, randomly select
  a subset of d < D features and consider only tests on these features (see the
  sketch after this list).
• There is no need for pruning.
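A sketch that combines both sources of randomness, bootstrap sampling per tree and a
random feature subset per node; it reuses gini(), class_probs(), best_split() and
predict() from the earlier sketches and grows each tree until its nodes are pure
(no pruning).

```python
import numpy as np

def rf_grow_tree(X, z, d, rng):
    """Like grow_tree() above, but every node only considers a random subset of d features."""
    counts = np.bincount(z, minlength=3)
    if gini(class_probs(z)) == 0:
        return counts                                      # grow until pure, no pruning
    features = rng.choice(X.shape[1], size=d, replace=False)
    feature, threshold, _ = best_split(X, z, features=features)
    if feature is None:
        return counts                                      # the chosen features admit no split
    left = X[:, feature] <= threshold
    return (feature, threshold,
            rf_grow_tree(X[left], z[left], d, rng),
            rf_grow_tree(X[~left], z[~left], d, rng))

def fit_random_forest(X, z, n_trees=100, d=1, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, N, size=N)                   # bootstrap: N draws with replacement
        forest.append(rf_grow_tree(X[idx], z[idx], d, rng))
    return forest

def forest_predict(forest, x, n_classes=3):
    votes = np.bincount([predict(tree, x) for tree in forest], minlength=n_classes)
    return votes.argmax()                                  # majority vote over all trees
```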

Random Forests: Setting the hyperparameter d

Choosing a small d creates variance between the trees, but if we choose d too small,
we will build random trees with poor split choices and little individual predictive
power.

Choosing d too large will create an army of very similar trees, such that
there is hardly any advantage over a single tree.

The answer is somewhere in between!

Decision surface for a random forest of 1000 trees

(Figure: the decision boundaries of the forest in the (x1, x2) plane.)
Decision Trees with Categorical Features

Day Outlook Temperature Humidity Wind Play Tennis?


D1 sunny hot high weak No
D2 sunny hot high strong No
...

Outlook = overcast?
├─ True:  Yes
└─ False: Temperature = hot?
    ├─ True:  No
    └─ False: Wind = weak?
        ├─ True:  ...
        └─ False: ...

Different algorithm variants (ID3, C4.5, CART) handle categorical features differently.

Decision Trees for Regression

For regression (if zi is a real value rather than a class):

• At the leaves, compute the mean (instead of the mode) over the outputs.
• Use the mean squared error as the splitting heuristic (see the sketch below).

(Figure: regression trees of depth 2 and depth 5 fitted to noisy one-dimensional data,
z plotted over x.)
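A sketch of the regression variant under the same greedy scheme (the hyperparameter
values are my own): leaves store the mean output, and candidate splits are scored by
the reduction in mean squared error.

```python
import numpy as np

def mse(z):
    return np.mean((z - z.mean()) ** 2)

def grow_regression_tree(X, z, depth=0, max_depth=2, min_samples=5):
    """Leaves predict the mean of their training outputs; splits maximise the MSE reduction."""
    if depth >= max_depth or len(z) < min_samples:
        return z.mean()                                    # leaf: mean output
    best = (None, None, -np.inf)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for thr in (values[:-1] + values[1:]) / 2:         # midpoints between consecutive values
            left = X[:, j] <= thr
            p_left = left.mean()
            delta = mse(z) - p_left * mse(z[left]) - (1 - p_left) * mse(z[~left])
            if delta > best[2]:
                best = (j, thr, delta)
    feature, threshold, _ = best
    if feature is None:
        return z.mean()
    left = X[:, feature] <= threshold
    return (feature, threshold,
            grow_regression_tree(X[left], z[left], depth + 1, max_depth, min_samples),
            grow_regression_tree(X[~left], z[~left], depth + 1, max_depth, min_samples))
```

Prediction works exactly as in the classification sketches: descend to the leaf that
contains x and return the stored mean.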

What we learned

• Interpretation and Building of Decision Trees


• Impurity functions / Splitting heuristics
• Training / Test / Validation Set
• Overfitting
• Random forests

