Lecture 2: Decision Trees


DD2421

Atsuto Maki

Autumn, 2020

Lecture 1: Nearest Neighbour Classifier (Memory-based)


Lecture 2: Decision Trees (Logical inference, Rule-based)
Lecture 3: Challenges in Machine Learning

1 Decision Trees
The representation
Training
2 Unpredictability
Entropy
Information gain
Gini impurity
3 Overfitting
Overfitting
Occam’s principle
Training and validation set approach
Extensions


Basic Idea: Test the attributes (features) sequentially


= Ask questions about the target/status sequentially

Example: building a concept of whether someone would like to play tennis.

Useful also (but not limited to) when nominal data are involved, e.g. in medical diagnosis, credit risk analysis etc.

The whole analysis strategy can be seen as a tree.

(T. Mitchell, Machine Learning)

Each leaf node bears a category label, and the test pattern is
assigned the category of the leaf node reached.

What does the tree encode?

(Sunny ∧ Normal Humidity) ∨ (Cloudy) ∨ (Rainy ∧ Weak Wind)

Logical expressions of the conjunctions of decisions along the paths.
Arbitrary boolean functions can be represented!
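Equivalently, the tree can be written as nested tests. A minimal sketch (in Python, not from the slides; the attribute names Outlook, Humidity and Wind are assumed from the example):

```python
def play_tennis(outlook, humidity, wind):
    """Encodes (Sunny ∧ Normal Humidity) ∨ (Cloudy) ∨ (Rainy ∧ Weak Wind)."""
    if outlook == "Sunny":
        return humidity == "Normal"      # Sunny ∧ Normal Humidity
    if outlook == "Cloudy":
        return True                      # Cloudy
    if outlook == "Rainy":
        return wind == "Weak"            # Rainy ∧ Weak Wind
    raise ValueError("unknown outlook value")

print(play_tennis("Rainy", "High", "Weak"))   # True
```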


Training: we need to grow a tree from scratch given a set of labeled training data.

How to grow/construct the tree automatically?

1 Choose the best question (according to the information gain), and split the input data into subsets
2 Terminate: branches whose subsets carry a unique class label become leaves (no need for further questions)
3 Grow: recursively extend the other branches (those with subsets bearing mixtures of labels), as sketched below
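A compact ID3-style sketch of this recursion (a rough illustration assuming the examples are given as dicts of attribute values; not the course's reference implementation):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def grow(examples, labels, attributes):
    """Grow a decision tree recursively; `examples` are dicts of attribute values."""
    if len(set(labels)) == 1:            # step 2: pure subset -> leaf
        return labels[0]
    if not attributes:                   # no questions left -> majority label
        return Counter(labels).most_common(1)[0][0]

    def expected_entropy(a):             # weighted entropy after splitting on a
        total = 0.0
        for v in {x[a] for x in examples}:
            sub = [y for x, y in zip(examples, labels) if x[a] == v]
            total += len(sub) / len(labels) * entropy(sub)
        return total

    best = min(attributes, key=expected_entropy)    # step 1: maximal information gain
    tree = {"attr": best, "branches": {}}
    for v in {x[best] for x in examples}:           # step 3: recurse on each branch
        keep = [i for i, x in enumerate(examples) if x[best] == v]
        tree["branches"][v] = grow([examples[i] for i in keep],
                                   [labels[i] for i in keep],
                                   [a for a in attributes if a != best])
    return tree
```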

2 Unpredictability


Quiz time – Game of “sixty-three”

x drawn from {0,1,2,3,4, ... , 63}


I pick a number x from the set.
You ask me yes/no questions.

How many (and what) questions will you ask me to get the
number x as rapidly as possible?
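Since log2 64 = 6, six yes/no questions always suffice. A small sketch (assuming x is an integer in 0..63) that asks about one binary digit at a time:

```python
def guess(answer_yes_no):
    """Recover x in {0,...,63} with 6 questions about its binary digits.
    `answer_yes_no(q)` is assumed to return True/False for a question string q."""
    x = 0
    for bit in range(5, -1, -1):         # 6 questions: one per bit
        if answer_yes_no(f"Is bit {bit} of x equal to 1?"):
            x |= 1 << bit
    return x

secret = 42
print(guess(lambda q: bool(secret & (1 << int(q.split()[2])))))   # 42
```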


Entropy

How to measure information gain?

The Shannon information content of an outcome is:

log2(1 / p_i)    (p_i: probability for event i)

The Entropy — measure of uncertainty (unpredictability)


Entropy = − Σ_i p_i log2 p_i

is a sensible measure of expected information content.
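A small helper (not from the slides) that evaluates this formula for a discrete distribution given as a list of probabilities:

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit — the coin-toss example that follows
```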

Example: tossing a coin

p_head = 0.5; p_tail = 0.5

Entropy = − Σ_i p_i log2 p_i
        = −0.5 log2 0.5 − 0.5 log2 0.5        (log2 0.5 = −1)
        = 1

The result of a coin-toss has 1 bit of information


Example: rolling a die

p_1 = 1/6; p_2 = 1/6; ... ; p_6 = 1/6

Entropy = − Σ_i p_i log2 p_i
        = 6 × (−(1/6) log2(1/6))
        = −log2(1/6) = log2 6 ≈ 2.58

The result of a die-roll has 2.58 bits of information


Example: rolling a fake die

p_1 = 0.1; ... ; p_5 = 0.1; p_6 = 0.5

Entropy = − Σ_i p_i log2 p_i
        = −5 · 0.1 log2 0.1 − 0.5 log2 0.5
        ≈ 2.16

A real die is more unpredictable (2.58 bits) than a fake one (2.16 bits)
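The three examples can be checked numerically with the same formula (a quick sketch):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))          # coin: 1.0
print(entropy([1/6] * 6))           # fair die: ~2.585
print(entropy([0.1] * 5 + [0.5]))   # fake die: ~2.161
```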

Unpredictability of a dataset (think of a subset at a node)

100 examples, 42 positive:

−(58/100) log2(58/100) − (42/100) log2(42/100) = 0.981

100 examples, 3 positive:

−(97/100) log2(97/100) − (3/100) log2(3/100) = 0.194
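The same computation for these two node subsets, as a quick sketch:

```python
import math

def subset_entropy(pos, n):
    """Entropy of a binary subset with `pos` positive examples out of `n`."""
    p = pos / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

print(subset_entropy(42, 100))   # ~0.981
print(subset_entropy(3, 100))    # ~0.194
```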

Back to the decision trees

Smart idea:
Ask about the attribute which maximizes the expected reduction of the entropy.

Information gain
Ask about attribute A for a data set S that has entropy Ent(S), and get subsets S_v according to the value of A:

Gain(S, A) = Ent(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Ent(S_v)

i.e. the entropy before the split minus the weighted sum of the entropies of the subsets after the split.
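A minimal sketch of this formula in code (assuming the data set is a list of (attribute-dict, label) pairs; all names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(data, attribute):
    """Gain = Ent(S) − Σ_v |S_v|/|S| · Ent(S_v)."""
    labels = [y for _, y in data]
    gain = entropy(labels)
    for v in {x[attribute] for x, _ in data}:
        subset = [y for x, y in data if x[attribute] == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain
```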


What is the entropy of this binary dataset (attributes = {A, B, C, D}, n = 25)? A “+” in the last column marks a positive example.


A B C D
◦ • • ◦ +
• • ◦ ◦ +
◦ ◦ ◦ ◦
◦ ◦ • • +
◦ • ◦ ◦ +
• ◦ • ◦
◦ • • ◦ +
◦ ◦ ◦ ◦
• ◦ ◦ ◦
◦ • • ◦ +
◦ ◦ ◦ • +
◦ ◦ • ◦
• • • ◦ +
◦ • ◦ •
◦ ◦ ◦ ◦
◦ ◦ • ◦
◦ • • •
◦ • ◦ ◦ +
◦ ◦ • ◦
• ◦ ◦ ◦
◦ • ◦ ◦ +
◦ ◦ • • +
• • ◦ ◦ +
◦ ◦ ◦ ◦
◦ ◦ • ◦

Ent = −(12/25) log2(12/25) − (13/25) log2(13/25) ≈ 0.9988

A = •: 3/6 positive → 1.0
A = ◦: 9/19 positive → 0.9980
Expected: (6/25) · 1.0 + (19/25) · 0.9980 ≈ 0.9985

B = •: 9/11 positive → 0.684
B = ◦: 3/14 positive → 0.750
Expected: 0.721

C = •: 6/12 positive → 1.0
C = ◦: 6/13 positive → 0.9957
Expected: 0.9977

D = •: 3/5 positive → 0.9710
D = ◦: 9/20 positive → 0.9928
Expected: 0.9884

Gain(A) = 0.9988 − 0.9985 = 0.0003


Gain(B) = 0.9988 − 0.7210 = 0.2778
Gain(C) = 0.9988 − 0.9977 = 0.0011
Gain(D) = 0.9988 − 0.9884 = 0.0104
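These gains can be reproduced from the per-attribute counts read off the table (a quick sketch; the counts are those listed above, and the printed values match the slide up to rounding):

```python
import math

def H(pos, n):                      # entropy of a subset with pos positives out of n
    return -sum(q * math.log2(q) for q in (pos / n, 1 - pos / n) if q > 0)

ent_S = H(12, 25)                   # ~0.9988
# (positives, subset size) for attribute = • and attribute = ◦
counts = {"A": ((3, 6), (9, 19)), "B": ((9, 11), (3, 14)),
          "C": ((6, 12), (6, 13)), "D": ((3, 5), (9, 20))}
for a, ((p1, n1), (p2, n2)) in counts.items():
    expected = n1 / 25 * H(p1, n1) + n2 / 25 * H(p2, n2)
    print(a, round(ent_S - expected, 4))
# ≈ A 0.0004, B 0.278, C 0.0011, D 0.0104 (small rounding differences)
```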

Attribute B gives the most information gain.

[Figure: the root of the tree tests attribute B, with Yes/No branches]

Examples where B = • (the Yes branch):

A B C D
◦ • • ◦ +
• • ◦ ◦ +
◦ • ◦ ◦ +
◦ • • ◦ +
◦ • • ◦ +
• • • ◦ +
◦ • ◦ •
◦ • • •
◦ • ◦ ◦ +
◦ • ◦ ◦ +
• • ◦ ◦ +

Examples where B = ◦ (the No branch):

A B C D
◦ ◦ ◦ ◦
◦ ◦ • • +
• ◦ • ◦
◦ ◦ ◦ ◦
• ◦ ◦ ◦
◦ ◦ ◦ • +
◦ ◦ • ◦
◦ ◦ ◦ ◦
◦ ◦ • ◦
• ◦ ◦ ◦
◦ ◦ • ◦
◦ ◦ • • +
◦ ◦ ◦ ◦
◦ ◦ • ◦

[Figure: the resulting tree. The root tests B; each branch then tests D.
B = Yes, D = Yes → −;  B = Yes, D = No → +;  B = No, D = Yes → +;  B = No, D = No → −]


Greedy approach to choose a question:
Choose the attribute which tells us most about the answer.

In sum, we need to find good questions to ask.
(More than one attribute could be involved in one question.)


Gini impurity: another definition of unpredictability (impurity).

Σ_i p_i (1 − p_i) = 1 − Σ_i p_i²    (p_i: probability for event i)

The expected error rate at a node N if the category label is randomly selected from the class distribution present at N.

Similar to the entropy but more strongly peaked at equal probabilities.
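A small sketch comparing the two impurity measures for a two-class node with positive-class probability p:

```python
import math

def entropy(p):
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def gini(p):
    return 1 - (p ** 2 + (1 - p) ** 2)    # = Σ_i p_i (1 − p_i)

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}")
# Both are zero for a pure node and largest at p = 0.5
# (entropy 1.0, Gini 0.5 for two classes).
```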

3 Overfitting



Overfitting
When the learned model is overly specialized for the training samples.

Good results on the training data, but the model generalizes poorly.

When does this occur?

Non-representative samples
Noisy examples
A model that is too complex

(Figure: T. Mitchell, Machine Learning)

What can be done about it?

Choose a simpler model and accept some errors on the training examples

Which hypothesis should be preferred when several are compatible with the data?

Occam’s principle (Occam’s razor)

William of Ockham, theologian and philosopher (1288–1348)

“Entities should not be multiplied beyond necessity”

The simplest explanation compatible with the data tends to be the right one.

Separate the available data into two sets of examples:

Training set T: to form the learned model
Validation set V: to evaluate the accuracy of this model

The motivations:
The training may be misled by random errors, but the validation set is unlikely to exhibit the same random fluctuations
The validation set provides a safety check against overfitting the spurious characteristics of the training set

(V needs to be large enough to provide statistically meaningful instances)
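As an illustration (a scikit-learn sketch, not part of the lecture code), comparing training and validation accuracy for trees of increasing depth makes overfitting visible:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
# Training set T and validation set V
X_T, X_V, y_T, y_V = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 5, None):                 # None = grow the full tree
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_T, y_T)
    print(depth, tree.score(X_T, y_T), tree.score(X_V, y_V))
# Training accuracy keeps rising with depth; validation accuracy typically
# stops improving (or drops) once the tree starts to overfit.
```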


Reduced-Error Pruning

Split the data into a training and a validation set.

Do until further pruning is harmful:

Evaluate the impact on the validation set of pruning each possible node (plus those below it)
Greedily remove the one that most improves validation set accuracy

Produces the smallest version of the most accurate subtree.
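A simplified bottom-up sketch of this idea (not the exact greedy loop above): collapse a node into its majority label whenever that is at least as accurate on the validation examples that reach it. Trees are nested dicts with class labels as leaves; the field names are illustrative.

```python
def classify(tree, example):
    while isinstance(tree, dict):
        branch = tree["branches"].get(example[tree["attr"]])
        if branch is None:               # unseen value: fall back to majority label
            return tree["majority"]
        tree = branch
    return tree

def prune(tree, val):
    """`val` is the list of (example, label) pairs reaching this node."""
    if not isinstance(tree, dict) or not val:
        return tree
    for v, sub in tree["branches"].items():          # prune the children first
        tree["branches"][v] = prune(sub, [(x, y) for x, y in val
                                          if x[tree["attr"]] == v])
    as_leaf = sum(y == tree["majority"] for _, y in val)
    as_tree = sum(classify(tree, x) == y for x, y in val)
    return tree["majority"] if as_leaf >= as_tree else tree
```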

Possible ways of improving/extending the decision trees

Avoid overfitting
Stop growing when the data split is not statistically significant
Grow the full tree, then post-prune (e.g. reduced-error pruning)
A collection of trees (ensemble learning: Lecture 10)
Bootstrap aggregating (bagging)
Decision Forests
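As an illustration (a scikit-learn sketch; ensembles are covered in Lecture 10), a bagged forest of trees often generalizes better than a single fully grown tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_T, X_V, y_T, y_V = train_test_split(X, y, random_state=0)

single = DecisionTreeClassifier(random_state=0).fit(X_T, y_T)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_T, y_T)
print(single.score(X_V, y_V), forest.score(X_V, y_V))
# The bagged ensemble of trees usually scores higher on the validation set.
```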
