Lecture 2: Decision Trees


DD2421

Atsuto Maki

Autumn, 2020

Lecture 1: Nearest Neighbour Classifier (Memory-based)


Lecture 2: Decision Trees (Logical inference, Rule-based)
Lecture 3: Challenges in Machine Learning

1 Decision Trees
The representation
Training
2 Unpredictability
Entropy
Information gain
Gini impurity
3 Overfitting
Overfitting
Occam’s principle
Training and validation set approach
Extensions


Basic Idea: Test the attributes (features) sequentially


= Ask questions about the target/status sequentially

Example: building a concept of whether someone would like to play tennis.

Useful also (but not limited to) when nominal data are involved, e.g. in medical diagnosis, credit risk analysis etc.

The whole analysis strategy can be seen as a tree.

(T. Mitchell, Machine Learning)

Each leaf node bears a category label, and the test pattern is
assigned the category of the leaf node reached.

What does the tree encode?

(Sunny ∧ Normal Humidity) ∨ (Cloudy) ∨ (Rainy ∧ Weak Wind)

Logical expressions of the conjunctions of decisions along the paths.
Arbitrary boolean functions can be represented!
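Equivalently, the tree can be written as nested tests. A minimal sketch (in Python, not from the slides; the attribute names Outlook, Humidity and Wind are assumed from the example):

```python
def play_tennis(outlook, humidity, wind):
    """Encodes (Sunny ∧ Normal Humidity) ∨ (Cloudy) ∨ (Rainy ∧ Weak Wind)."""
    if outlook == "Sunny":
        return humidity == "Normal"      # Sunny ∧ Normal Humidity
    if outlook == "Cloudy":
        return True                      # Cloudy
    if outlook == "Rainy":
        return wind == "Weak"            # Rainy ∧ Weak Wind
    raise ValueError("unknown outlook value")

print(play_tennis("Rainy", "High", "Weak"))   # True
```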


Training: we need to grow a tree from scratch given a set of labeled training data.

How to grow/construct the tree automatically?

1 Choose the best question (according to the information gain), and split the input data into subsets
2 Terminate: branches whose subsets carry a unique class label become leaves (no need for further questions)
3 Grow: recursively extend the other branches (those with subsets bearing mixtures of labels), as sketched below
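A compact ID3-style sketch of this recursion (a rough illustration assuming the examples are given as dicts of attribute values; not the course's reference implementation):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def grow(examples, labels, attributes):
    """Grow a decision tree recursively; `examples` are dicts of attribute values."""
    if len(set(labels)) == 1:            # step 2: pure subset -> leaf
        return labels[0]
    if not attributes:                   # no questions left -> majority label
        return Counter(labels).most_common(1)[0][0]

    def expected_entropy(a):             # weighted entropy after splitting on a
        total = 0.0
        for v in {x[a] for x in examples}:
            sub = [y for x, y in zip(examples, labels) if x[a] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total

    best = min(attributes, key=expected_entropy)    # step 1: maximal information gain
    tree = {"attr": best, "branches": {}}
    for v in {x[best] for x in examples}:           # step 3: recurse on each branch
        keep = [i for i, x in enumerate(examples) if x[best] == v]
        tree["branches"][v] = grow([examples[i] for i in keep],
                                   [labels[i] for i in keep],
                                   [a for a in attributes if a != best])
    return tree
```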

2 Unpredictability


Quiz time – Game of “sixty-three”

x drawn from {0,1,2,3,4, ... , 63}


I pick a number x from the set.
You ask me yes/no questions.

How many (and what) questions will you ask me to get the
number x as rapidly as possible?
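Since log2 64 = 6, six yes/no questions always suffice. A small sketch (assuming x is an integer in 0..63) that asks about one binary digit at a time:

```python
def guess(answer_yes_no):
    """Recover x in {0,...,63} with 6 questions about its binary digits.
    `answer_yes_no(q)` is assumed to return True/False for a question string q."""
    x = 0
    for bit in range(5, -1, -1):         # 6 questions: one per bit
        if answer_yes_no(f"Is bit {bit} of x equal to 1?"):
            x |= 1 << bit
    return x

secret = 42
print(guess(lambda q: bool(secret & (1 << int(q.split()[2])))))   # 42
```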


Entropy

How to measure information gain?

The Shannon information content of an outcome is:

log2(1 / p_i)    (p_i: probability for event i)

The Entropy — measure of uncertainty (unpredictability)


Entropy = − Σ_i p_i log2 p_i

is a sensible measure of expected information content.
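A small helper (not from the slides) that evaluates this formula for a discrete distribution given as a list of probabilities:

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit — the coin-toss example that follows
```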

Example: tossing a coin

p_head = 0.5; p_tail = 0.5

Entropy = − Σ_i p_i log2 p_i
        = −0.5 log2 0.5 − 0.5 log2 0.5        (log2 0.5 = −1)
        = 1

The result of a coin-toss has 1 bit of information


Example: rolling a die

p_1 = 1/6; p_2 = 1/6; ... ; p_6 = 1/6

Entropy = − Σ_i p_i log2 p_i
        = 6 × (−(1/6) log2(1/6))
        = −log2(1/6) = log2 6 ≈ 2.58

The result of a die-roll has 2.58 bits of information


Example: rolling a fake die

p_1 = 0.1; ... ; p_5 = 0.1; p_6 = 0.5

Entropy = − Σ_i p_i log2 p_i
        = −5 · 0.1 log2 0.1 − 0.5 log2 0.5
        ≈ 2.16

A real die is more unpredictable (2.58 bits) than a fake one (2.16 bits)
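The three examples can be checked numerically with the same formula (a quick sketch):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # coin: 1.0
print(entropy([1/6] * 6))           # fair die: ~2.585
print(entropy([0.1] * 5 + [0.5]))   # fake die: ~2.161
```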

Unpredictability of a dataset (think of a subset at a node)

100 examples, 42 positive:

−(58/100) log2(58/100) − (42/100) log2(42/100) = 0.981

100 examples, 3 positive:

−(97/100) log2(97/100) − (3/100) log2(3/100) = 0.194
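The same computation for these two node subsets, as a quick sketch:

```python
import math

def subset_entropy(pos, n):
    """Entropy of a binary subset with `pos` positive examples out of `n`."""
    p = pos / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

print(subset_entropy(42, 100))   # ~0.981
print(subset_entropy(3, 100))    # ~0.194
```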

Back to the decision trees

Smart idea:
Ask about the attribute which maximizes the expected reduction of the entropy.

Information gain
Ask about attribute A for a data set S that has entropy Ent(S), and get subsets S_v according to the value of A:

Gain(S, A) = Ent(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Ent(S_v)

i.e. the entropy before the split minus the weighted sum of the entropies of the subsets after the split.
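A minimal sketch of this formula in code (assuming the data set is a list of (attribute-dict, label) pairs; all names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(data, attribute):
    """Gain = Ent(S) − Σ_v |S_v|/|S| · Ent(S_v)."""
    labels = [y for _, y in data]
    gain = entropy(labels)
    for v in {x[attribute] for x, _ in data}:
        subset = [y for x, y in data if x[attribute] == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain
```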


What is the entropy of this binary dataset (attributes = {A, B, C, D}, n = 25)? A “+” in the last column marks a positive example.


A B C D
◦ • • ◦ +
• • ◦ ◦ +
◦ ◦ ◦ ◦
◦ ◦ • • +
◦ • ◦ ◦ +
• ◦ • ◦
◦ • • ◦ +
◦ ◦ ◦ ◦
• ◦ ◦ ◦
◦ • • ◦ +
◦ ◦ ◦ • +
◦ ◦ • ◦
• • • ◦ +
◦ • ◦ •
◦ ◦ ◦ ◦
◦ ◦ • ◦
◦ • • •
◦ • ◦ ◦ +
◦ ◦ • ◦
• ◦ ◦ ◦
◦ • ◦ ◦ +
◦ ◦ • • +
• • ◦ ◦ +
◦ ◦ ◦ ◦
◦ ◦ • ◦

Ent = −(12/25) log2(12/25) − (13/25) log2(13/25) ≈ 0.9988

A = •: 3/6 positive → 1.0
A = ◦: 9/19 positive → 0.9980
Expected: (6/25) · 1.0 + (19/25) · 0.9980 ≈ 0.9985

B = •: 9/11 positive → 0.684
B = ◦: 3/14 positive → 0.750
Expected: 0.721

C = •: 6/12 positive → 1.0
C = ◦: 6/13 positive → 0.9957
Expected: 0.9977

D = •: 3/5 positive → 0.9710
D = ◦: 9/20 positive → 0.9928
Expected: 0.9884

Gain(A) = 0.9988 − 0.9985 = 0.0003


Gain(B) = 0.9988 − 0.7210 = 0.2778
Gain(C) = 0.9988 − 0.9977 = 0.0011
Gain(D) = 0.9988 − 0.9884 = 0.0104
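These gains can be reproduced from the per-attribute counts read off the table (a quick sketch; the counts are those listed above, and the printed values match the slide up to rounding):

```python
import math

def H(pos, n):                      # entropy of a subset with pos positives out of n
    return -sum(q * math.log2(q) for q in (pos / n, 1 - pos / n) if q > 0)

ent_S = H(12, 25)                   # ~0.9988
# (positives, subset size) for attribute = • and attribute = ◦
counts = {"A": ((3, 6), (9, 19)), "B": ((9, 11), (3, 14)),
          "C": ((6, 12), (6, 13)), "D": ((3, 5), (9, 20))}
for a, ((p1, n1), (p2, n2)) in counts.items():
    expected = n1 / 25 * H(p1, n1) + n2 / 25 * H(p2, n2)
    print(a, round(ent_S - expected, 4))
# ≈ A 0.0004, B 0.278, C 0.0011, D 0.0104 (small rounding differences)
```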

Attribute B gives the most information gain.

[Figure: the root of the tree tests attribute B, with Yes/No branches]

Examples where B = • (the Yes branch):

A B C D
◦ • • ◦ +
• • ◦ ◦ +
◦ • ◦ ◦ +
◦ • • ◦ +
◦ • • ◦ +
• • • ◦ +
◦ • ◦ •
◦ • • •
◦ • ◦ ◦ +
◦ • ◦ ◦ +
• • ◦ ◦ +

Examples where B = ◦ (the No branch):

A B C D
◦ ◦ ◦ ◦
◦ ◦ • • +
• ◦ • ◦
◦ ◦ ◦ ◦
• ◦ ◦ ◦
◦ ◦ ◦ • +
◦ ◦ • ◦
◦ ◦ ◦ ◦
◦ ◦ • ◦
• ◦ ◦ ◦
◦ ◦ • ◦
◦ ◦ • • +
◦ ◦ ◦ ◦
◦ ◦ • ◦

[Figure: the resulting tree. The root tests B; each branch then tests D.
B = Yes, D = Yes → −;  B = Yes, D = No → +;  B = No, D = Yes → +;  B = No, D = No → −]


Greedy approach to choose a question:
Choose the attribute which tells us most about the answer.

In sum, we need to find good questions to ask.
(More than one attribute could be involved in one question.)


Gini impurity: another definition of unpredictability (impurity).

Σ_i p_i (1 − p_i) = 1 − Σ_i p_i²    (p_i: probability for event i)

The expected error rate at a node N if the category label is randomly selected from the class distribution present at N.

Similar to the entropy but more strongly peaked at equal probabilities.
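A small sketch comparing the two impurity measures for a two-class node with positive-class probability p:

```python
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)    # = Σ_i p_i (1 − p_i)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}")
# Both are zero for a pure node and largest at p = 0.5
# (entropy 1.0, Gini 0.5 for two classes).
```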

3 Overfitting



Overfitting
When the learned model is overly specialized for the training samples.

Good results on the training data, but the model generalizes poorly.

When does this occur?

Non-representative samples
Noisy examples
A model that is too complex

(Figure: T. Mitchell, Machine Learning)

What can be done about it?

Choose a simpler model and accept some errors on the training examples

Which hypothesis should be preferred when several are compatible with the data?

Occam’s principle (Occam’s razor)

William of Ockham, theologian and philosopher (1288–1348)

“Entities should not be multiplied beyond necessity”

The simplest explanation compatible with the data tends to be the right one.

Separate the available data into two sets of examples:

Training set T: to form the learned model
Validation set V: to evaluate the accuracy of this model

The motivations:
The training may be misled by random errors, but the validation set is unlikely to exhibit the same random fluctuations
The validation set provides a safety check against overfitting the spurious characteristics of the training set

(V needs to be large enough to provide statistically meaningful instances)
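As an illustration (a scikit-learn sketch, not part of the lecture code), comparing training and validation accuracy for trees of increasing depth makes overfitting visible:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
# Training set T and validation set V
X_T, X_V, y_T, y_V = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 5, None):                 # None = grow the full tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_T, y_T)
    print(depth, tree.score(X_T, y_T), tree.score(X_V, y_V))
# Training accuracy keeps rising with depth; validation accuracy typically
# stops improving (or drops) once the tree starts to overfit.
```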


Reduced-Error Pruning

Split the data into a training and a validation set.

Do until further pruning is harmful:

Evaluate the impact on the validation set of pruning each possible node (plus those below it)
Greedily remove the one that most improves validation set accuracy

Produces the smallest version of the most accurate subtree.
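A simplified bottom-up sketch of this idea (not the exact greedy loop above): collapse a node into its majority label whenever that is at least as accurate on the validation examples that reach it. Trees are nested dicts with class labels as leaves; the field names are illustrative.

```python
def classify(tree, example):
    while isinstance(tree, dict):
        branch = tree["branches"].get(example[tree["attr"]])
        if branch is None:               # unseen value: fall back to majority label
            return tree["majority"]
        tree = branch
    return tree

def prune(tree, val):
    """`val` is the list of (example, label) pairs reaching this node."""
    if not isinstance(tree, dict) or not val:
        return tree
    for v, sub in tree["branches"].items():          # prune the children first
        tree["branches"][v] = prune(sub, [(x, y) for x, y in val
                                          if x[tree["attr"]] == v])
    as_leaf = sum(y == tree["majority"] for _, y in val)
    as_tree = sum(classify(tree, x) == y for x, y in val)
    return tree["majority"] if as_leaf >= as_tree else tree
```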

Possible ways of improving/extending the decision trees

Avoid overfitting
Stop growing when the data split is not statistically significant
Grow the full tree, then post-prune (e.g. reduced-error pruning)
A collection of trees (ensemble learning: Lecture 10)
Bootstrap aggregating (bagging)
Decision Forests
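As an illustration (a scikit-learn sketch; ensembles are covered in Lecture 10), a bagged forest of trees often generalizes better than a single fully grown tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_T, X_V, y_T, y_V = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_T, y_T)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_T, y_T)
print(single.score(X_V, y_V), forest.score(X_V, y_V))
# The bagged ensemble of trees usually scores higher on the validation set.
```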
