02 Lecture Slides: Decision Trees
Wiebke Köpp
Reading Material:
“Machine Learning: A Probabilistic Perspective” by Murphy [ch. 16.2]
Note: These slides are adapted from slides originally by Daniela Korhammer.
Additionally, some of them are inspired by Understanding Random Forests (PhD thesis
by Gilles Louppe).
1 / 34
The 20-Questions Game
2 / 34
[Figure: scatter plot of the training samples over features x1 and x2.]
3 / 34
[Figure: the scatter plot over x1 and x2 again, now with the decision boundaries of the tests below drawn in.]
Top of the corresponding decision tree:
x2 ≤ 3
True: x2 ≤ 1.7    False: x1 ≤ 7.8
4 / 34
Interpretation of a Decision Tree
[Figure: fragment of a decision tree showing a feature test x2 ≤ 3.9 with its True and False branches.]
• Node ≙ feature test → leads to decision boundaries.
• Branch ≙ one possible outcome of the preceding feature test.
• Leaf ≙ region in the input space and the distribution of samples in that region.
Decision trees partition the input space into cuboid regions.
5 / 34
To classify a new sample x:
• Test the attributes of x to find the region R that contains it and get the class distribution nR = (nc1,R, nc2,R, ..., nck,R) for C = {c1, ..., ck}.
• The probability that a data point x ∈ R should be classified as belonging to class c is then:

  p(z = c | R) = nc,R / Σ_{ci ∈ C} nci,R
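A minimal Python sketch of this computation; the counts below are the leaf distribution (2, 3, 112) that appears on the next slide:

n_R = {"c1": 2, "c2": 3, "c3": 112}   # class counts in the region R containing x

def class_probabilities(counts):
    # p(z = c | R) = n_{c,R} / sum over all classes ci of n_{ci,R}
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

print(class_probabilities(n_R))   # {'c1': 0.017..., 'c2': 0.026..., 'c3': 0.957...}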
6 / 34
Classification of x = (8.5, 3.5)^T
[Figure: the partitioned scatter plot over x1 and x2 with x marked, next to the decision tree: the root tests x2 ≤ 3, its True/False children test x2 ≤ 1.7 and x1 ≤ 7.8, followed by further tests x1 ≤ 3.7, x1 ≤ 1.7 and x2 ≤ 3.9; the highlighted leaf reached by x has class distribution (2, 3, 112).]
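The exact tree on this slide is only partly recoverable, so the following Python sketch uses a hypothetical tree that shares the root test x2 ≤ 3, its two children x2 ≤ 1.7 and x1 ≤ 7.8, and the leaf distribution (2, 3, 112); the remaining structure and counts are illustrative assumptions. It shows how a sample is routed to its region:

# Hypothetical tree; deeper structure and most leaf distributions are assumptions.
tree = {
    "test": lambda x: x[1] <= 3,            # x2 <= 3
    "true": {
        "test": lambda x: x[1] <= 1.7,      # x2 <= 1.7
        "true": {"leaf": (120, 4, 1)},
        "false": {"leaf": (10, 80, 5)},
    },
    "false": {
        "test": lambda x: x[0] <= 7.8,      # x1 <= 7.8
        "true": {"leaf": (30, 60, 2)},
        "false": {"leaf": (2, 3, 112)},     # distribution shown on the slide
    },
}

def classify(node, x):
    # Follow the feature tests until a leaf is reached; return its class counts.
    while "leaf" not in node:
        node = node["true"] if node["test"](x) else node["false"]
    return node["leaf"]

print(classify(tree, (8.5, 3.5)))   # -> (2, 3, 112)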
7 / 34
What does the perfect decision tree look like?
It performs well on new data. → generalization
How can we test this?
8 / 34
Idea: Build all possible trees and evaluate how they perform on new data.
All combinations of features and values can serve as tests in the tree:
Feature tests (candidate thresholds for each of the features x1, x2, ..., xk, xk+1, ...):
x1 ≤ 0.36457631, x1 ≤ 0.50120369, x1 ≤ 0.54139549, ...
x2 ≤ 0.09652214, x2 ≤ 0.20923062, ...
9 / 34
Iterating over all possible trees is feasible only for very small examples
(→ few possible tests) because the number of trees quickly explodes.
Instead: Grow the tree top-down and choose the best split node-by-node
using a greedy heuristic on the training data.
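A minimal sketch of such a greedy procedure, under illustrative assumptions: candidate tests are thresholds at the observed feature values, split quality is the decrease in Gini impurity (the criterion ∆iG introduced on the following slides), class labels are integers 0, 1, 2, ..., and growth stops when no split improves the impurity:

import numpy as np

def gini(y):
    # Gini impurity of an integer label array: 1 - sum_c p_c^2
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def grow_tree(X, y, min_gain=1e-9):
    # Greedy top-down growing: try every feature and candidate threshold,
    # keep the split with the largest impurity decrease, recurse on the children.
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:              # candidate thresholds
            left = X[:, j] <= thr
            gain = gini(y) - left.mean() * gini(y[left]) - (1 - left.mean()) * gini(y[~left])
            if best is None or gain > best[0]:
                best = (gain, j, thr, left)
    if best is None or best[0] <= min_gain:              # nothing improves -> make a leaf
        return {"counts": np.bincount(y)}
    _, j, thr, left = best
    return {"feature": j, "threshold": thr,
            "true": grow_tree(X[left], y[left]),
            "false": grow_tree(X[~left], y[~left])}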
10 / 34
For example: Split the node when it improves the misclassification rate iE at node t:

iE(t) = 1 − max_c p(z = c | t)

[Figure: the scatter plot over x1 and x2 with node t split into children tL and tR; class counts of 100, 150 and 50 are indicated, and the resulting impurities are i(tL) = 0 and i(tR) = 0.33.]
11 / 34
[Figure: a 200-sample data set in the (x1, x2) plane with mixed regions (counts of 40 and 60 samples); whether we do not split, split on x1 ≤ 5, or split on x2 ≤ 3, the misclassification rate does not improve.]
The node is not split even though combining the two tests would result in perfect classification.
Use a criterion i(t) that measures how pure the class distribution at a
node t is. It should be
• maximum if classes are equally distributed in the node
• minimum, usually 0, if the node is pure
• symmetric
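A minimal Python sketch of three impurity measures that satisfy these requirements and appear in these slides: the misclassification rate iE, the Gini index iG and the Shannon entropy, each computed from a node's class distribution p(z = c | t):

import numpy as np

def misclassification(p):
    # i_E(t) = 1 - max_c p(z = c | t)
    return 1.0 - np.max(p)

def gini(p):
    # i_G(t) = 1 - sum_c p(z = c | t)^2
    return 1.0 - np.sum(np.asarray(p) ** 2)

def entropy(p):
    # i_H(t) = - sum_c p(z = c | t) * log2 p(z = c | t), with 0 * log 0 := 0
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Maximal for a uniform distribution, 0 for a pure node:
print(gini([0.5, 0.5]), gini([1.0, 0.0]))        # 0.5  0.0
print(entropy([0.5, 0.5]), entropy([0.9, 0.1]))  # 1.0  ~0.469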
12 / 34
Impurity measures
13 / 34
Detour: Information Theory
On average:
0.01 × 3 bits + 0.02 × 3 bits + 0.3 × 2 bits + 0.67 × 1 bit = 1.36 bits
14 / 34
Shannon Entropy
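The body of this slide is not preserved; as a minimal sketch, the Shannon entropy H(p) = −Σc pc log2 pc of a distribution lower-bounds the average code length, and for the detour example above it evaluates to roughly 1.09 bits (below the 1.36-bit code from the previous slide):

import math

p = [0.01, 0.02, 0.3, 0.67]                    # symbol probabilities from the detour example
H = -sum(q * math.log2(q) for q in p)
print(H)                                       # ~1.09 bits <= 1.36 bits (the code above)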
15 / 34
Building a decision tree
Compare all possible tests and choose the one where the improvement
∆i(s, t) for some splitting criterion i(t) is largest
Example: the test x2 ≤ 3 at the root t, which contains 300 samples with class counts (67, 65, 168):

iG(t) = 1 − (67/300)² − (65/300)² − (168/300)² ≈ 0.5896

After testing x2 ≤ 3, the children tL (153 samples) and tR (147 samples) have
iG(tL) ≈ 0.6548 and iG(tR) ≈ 0.3632

⇒ ∆iG(x2 ≤ 3, t) = iG(t) − (153/300)·iG(tL) − (147/300)·iG(tR) ≈ 0.07768
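A small Python check of these numbers (the per-class counts of tL and tR are not shown on the slide, so their Gini values are taken as given):

counts_t = [67, 65, 168]                                  # class counts at the root t
n = sum(counts_t)                                         # 300 samples
gini_t = 1 - sum((c / n) ** 2 for c in counts_t)          # ~0.5896

n_L, n_R = 153, 147                                       # samples sent left / right by x2 <= 3
gini_L, gini_R = 0.6548, 0.3632                           # impurities of the children (given)

delta = gini_t - (n_L / n) * gini_L - (n_R / n) * gini_R
print(gini_t, delta)                                      # ~0.5896  ~0.0777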
16 / 34
[Figure: the impurity decrease ∆iG(x1, D) and ∆iG(x2, D) for every candidate threshold on x1 and x2 (values between 0.00 and 0.06), plotted alongside the data D in the (x1, x2) plane; the best test is the one with the largest decrease.]
17 / 34
Decision boundaries at depth 1 (and at greater depths on the following slides)
[Figures (slides 18–23): scatter plots over x1 and x2 showing how the decision boundaries refine as the tree grows deeper.]
23 / 34
[Figure: training and test accuracy as a function of the number of leaf nodes (roughly 5 to 30); with few leaves both accuracies are low (underfitting), with many leaves training accuracy approaches 1.0 while test accuracy drops (overfitting).]
24 / 34
When to stop growing?
25 / 34
Creating the validation set
26 / 34
Reduced Error Pruning
After pruning you may use both training and validation data to update the labels at each leaf.
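The pruning procedure itself is not spelled out above; the following is a minimal sketch of reduced-error pruning in its usual form, assuming trees in the dict format of the growing sketch earlier, extended so that every node stores "counts", the class counts of the training samples that reached it:

import numpy as np

def predict(node, x):
    # Route x to a leaf and return the majority class stored there.
    while "feature" in node:
        node = node["true"] if x[node["feature"]] <= node["threshold"] else node["false"]
    return int(np.argmax(node["counts"]))

def n_correct(node, X, y):
    return sum(predict(node, xi) == yi for xi, yi in zip(X, y))

def prune(node, X_val, y_val):
    # Bottom-up reduced-error pruning: collapse a subtree into a leaf whenever
    # the leaf is at least as accurate on the validation samples reaching it.
    if "feature" not in node:
        return node
    mask = X_val[:, node["feature"]] <= node["threshold"]
    node["true"] = prune(node["true"], X_val[mask], y_val[mask])
    node["false"] = prune(node["false"], X_val[~mask], y_val[~mask])
    as_leaf = {"counts": node["counts"]}                  # majority class from training
    if n_correct(as_leaf, X_val, y_val) >= n_correct(node, X_val, y_val):
        return as_leaf
    return node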
27 / 34
Random Forests
To label a new data point, evaluate all trees and let them vote. Label the
data point with the most “popular” class.
But... if we build all trees in the same way, they will always vote in unison.
→ We need to introduce variance between the trees!
28 / 34
Random Forests: Introducing Variance
29 / 34
Random Forests: Setting the hyperparameter d
Choosing d too large will create an army of very similar trees, such that
there is hardly any advantage over a single tree.
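Neither the construction nor the meaning of d is fully spelled out above; the sketch below assumes d is the number of candidate features per tree (for brevity the d features are drawn once per tree here, whereas random forests typically redraw them at every split), and that grow(X, y) and predict(tree, x) are tree-building and tree-evaluation functions like the earlier sketches:

import numpy as np

def fit_forest(X, y, n_trees, d, grow, rng=np.random.default_rng(0)):
    # Each tree sees a bootstrap sample of the rows and a random subset of
    # d feature columns, which is what makes the trees vote differently.
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), len(X))            # sample with replacement
        cols = rng.choice(X.shape[1], size=d, replace=False)
        forest.append((cols, grow(X[np.ix_(rows, cols)], y[rows])))
    return forest

def forest_predict(forest, x, predict):
    # Evaluate all trees and let them vote; return the most "popular" class.
    votes = [predict(tree, x[cols]) for cols, tree in forest]
    return max(set(votes), key=votes.count)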
30 / 34
Decision surface for a random forest of 1000 trees
[Figure: the decision surface of the 1000-tree forest over features x1 and x2.]
31 / 34
Decision Trees with Categorical Features
[Figure: a tree whose tests are equality tests on categorical features, e.g. Outlook = overcast? and, on its No branch, Wind = weak?]
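For categorical features the tests are equality tests rather than thresholds. A small hypothetical sketch using the attribute names from the figure (the values and leaf labels are illustrative assumptions):

# Hypothetical sample and tree; only the attribute names come from the figure.
sample = {"Outlook": "sunny", "Wind": "weak"}

tree = {
    "test": ("Outlook", "overcast"),                  # Outlook = overcast?
    "yes": {"label": "play"},
    "no": {
        "test": ("Wind", "weak"),                     # Wind = weak?
        "yes": {"label": "play"},
        "no": {"label": "don't play"},
    },
}

def classify_categorical(node, x):
    # Follow equality tests until a leaf is reached; return its label.
    while "label" not in node:
        feature, value = node["test"]
        node = node["yes"] if x[feature] == value else node["no"]
    return node["label"]

print(classify_categorical(tree, sample))   # -> "play"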
32 / 34
Decision Trees for Regression
[Figure: regression trees of depth 2 and depth 5 fitted to noisy one-dimensional data (target z plotted over input x); the deeper tree yields a finer piecewise-constant fit.]
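A minimal scikit-learn sketch producing this kind of fit; the sine-shaped data below is an assumption (any noisy one-dimensional target works), and each leaf of the regression tree predicts the mean target value of the training samples in its region:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)      # 1-D inputs
z = np.sin(x).ravel() + rng.normal(0, 0.1, 80)         # noisy target (assumed shape)

grid = np.linspace(0, 5, 500).reshape(-1, 1)
for depth in (2, 5):                                   # the two depths shown in the figure
    model = DecisionTreeRegressor(max_depth=depth).fit(x, z)
    pred = model.predict(grid)                         # piecewise-constant prediction
    print(depth, len(np.unique(pred)))                 # deeper tree -> more distinct levels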
33 / 34
What we learned
34 / 34