Lecture 03 (3hrs) Decision Tree and Decision Forest
Decision Tree
Xizhao WANG
Big Data Institute
College of Computer Science
Shenzhen University
March 2021
Outline
Decision Tree Learning
Decision Tree Generation – An Illustration
Uncertainty
Inductive Bias and Partition
Summary
Advanced Topics on Decision Trees
[Figure: decision tree built from samples D1–D14. The root node tests Outlook with branches Sunny, Overcast, and Rain. The Sunny branch tests Humidity (High → No, Normal → Yes), the Overcast branch is a leaf labelled Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes). The subset of samples reaching each branch is shown beside the nodes.]
D1, D2, …, D14 represent samples.
Red and blue indicate whether a sample's class label is “No” or “Yes”.
Samples are split by the most suitable attribute.
Each leaf node is assigned the class label to which most of its samples belong.
The most suitable attribute is chosen by a splitting criterion: information gain / gain ratio / Gini index…
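The slides only name the splitting criteria; as a minimal sketch (the `samples` list of dicts, the attribute names, and the target name "PlayTennis" are illustrative assumptions, not the lecture's notation), information gain for a candidate attribute could be computed as follows:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon (classification) entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(samples, attribute, target="PlayTennis"):
    """Entropy reduction obtained by splitting `samples` on `attribute`.
    `samples` is a list of dicts mapping attribute names to values."""
    labels = [s[target] for s in samples]
    before = entropy(labels)
    after = 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s[target] for s in samples if s[attribute] == value]
        after += len(subset) / len(samples) * entropy(subset)
    return before - after
```

For the 14 samples above, the attribute with the largest gain (Outlook in the illustration) would be chosen as the root, and the procedure is repeated within each branch.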
Decision Tree Generation – An Illustration
1. Difference Between Random-Partition Tree and Attribute-Induced Tree
2. Difference Between Decision Tree and Other Types of Partition of the Sample Space
3. Training Set to Generate a Decision Tree
4. Animation of the Generation
5. For Real-Valued Attributes
If an attribute A is real-valued, then the partition cannot be induced by an equality test of the form “Attribute A = value #1”; instead, the attribute is split by comparing it with a threshold value.
For instance:
[Figure: decision tree for the same samples in which the Rain branch tests the real-valued attribute Temperature against a threshold instead of a categorical test; the Sunny branch tests Humidity, the Overcast branch is a leaf labelled Yes, and the remaining leaves are labelled Yes or No.]
D1, D2, …, D14 represent samples.
Red and blue indicate whether a sample's class label is “No” or “Yes”.
Samples are split by the most suitable attribute and a corresponding value (threshold).
Each leaf node is assigned the class label to which most of its samples belong.
The most suitable attribute and value are chosen by a splitting criterion: information gain / gain ratio / Gini index…
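For a real-valued attribute such as Temperature, a common approach in C4.5/CART-style learners (the slide does not fix a specific rule) is to sort the observed values and evaluate candidate thresholds at midpoints between adjacent values. A sketch reusing the `entropy` helper from the previous snippet:

```python
def best_threshold(samples, attribute, target="PlayTennis"):
    """Pick the cut point t for the test `attribute <= t` that maximizes
    information gain over candidate midpoints between sorted values."""
    values = sorted({s[attribute] for s in samples})
    before = entropy([s[target] for s in samples])
    best_gain, best_t = -1.0, None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [s[target] for s in samples if s[attribute] <= t]
        right = [s[target] for s in samples if s[attribute] > t]
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(samples)
        if before - after > best_gain:
            best_gain, best_t = before - after, t
    return best_t, best_gain
```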
Uncertainty
1. Summary of Uncertainty Definitions
2. Shannon Entropy
3. Classification Entropy
4. Fuzziness
5. Non-Specificity
6. Rough-Degree
7. Relation Between 2 Uncertainties
Classification entropy: the impurity of the class distribution in a (crisp) set.
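For a crisp set S whose samples fall into classes 1, …, K with proportions p_k, the classification (Shannon) entropy referred to here is the standard definition

H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k, \qquad p_k = \frac{|\{x \in S : \mathrm{class}(x) = k\}|}{|S|},

which equals 0 for a pure set and \log_2 K when all classes are equally frequent.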
An example (1)
An example (2)
An example (3)
Splitting Criteria – Expanded Attribute Selection Criteria
Pruning Trees
Recursive algorithm
When do we stop?
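A minimal recursive sketch of the generation procedure with explicit stopping tests (the helpers follow the earlier snippets; the depth limit is one illustrative pre-pruning criterion, not the lecture's specific choice):

```python
from collections import Counter

def majority_class(samples, target="PlayTennis"):
    """Class label held by most samples in the node."""
    return Counter(s[target] for s in samples).most_common(1)[0][0]

def build_tree(samples, attributes, target="PlayTennis", depth=0, max_depth=5):
    labels = {s[target] for s in samples}
    # Stop when the node is pure, no attributes remain, or a depth limit is hit.
    if len(labels) == 1:
        return labels.pop()
    if not attributes or depth >= max_depth:
        return majority_class(samples, target)
    # Otherwise expand on the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(samples, a, target))
    node = {best: {}}
    for value in {s[best] for s in samples}:
        subset = [s for s in samples if s[best] == value]
        node[best][value] = build_tree(
            subset, [a for a in attributes if a != best], target, depth + 1, max_depth)
    return node
```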
Advanced Topics on Decision Trees
1. Splitting Criteria – Expanded Attribute Selection Criteria
2. Pruning Trees
3. Evaluation of Classification Trees
4. Fuzzy Decision Trees
Stopping Criteria
Number of nodes
Pre-pruning
Post-pruning divides the generation of the decision tree into two phases. The first phase is the tree-building process, with the termination condition that the proportion of a certain class in a node reaches 100%; the second phase prunes the tree structure obtained from the first phase.
In this way, post-pruning approaches avoid the problem of a limited visual field, i.e. stopping the growth of the tree too early based on a local view of the data. Accordingly, the accuracy of post-pruning methods is typically superior to that of pre-pruning methods, and post-pruning methods are more commonly used than pre-pruning methods.
There are various post-pruning techniques for decision trees. Most perform a top-down or bottom-up traversal of the nodes; a node is pruned if this operation improves a certain criterion. The following subsections describe the most popular techniques.
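As one concrete instance of the bottom-up traversal just described, here is a hedged sketch of reduced-error pruning over the dict-based trees from the earlier snippets: a subtree is replaced by its majority-class leaf whenever that does not reduce accuracy on a separate pruning set (the function names and data layout are assumptions, not the textbook's API):

```python
def classify(tree, sample, default="No"):
    """Follow the dict-based tree until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches.get(sample[attribute], default)
    return tree

def accuracy(tree, samples, target="PlayTennis"):
    return sum(classify(tree, s) == s[target] for s in samples) / len(samples)

def reduced_error_prune(tree, train, prune_set, target="PlayTennis"):
    """Bottom-up: prune the children first, then try to collapse this node
    into its majority-class leaf if that does not hurt pruning-set accuracy."""
    if not isinstance(tree, dict) or not prune_set:
        return tree
    attribute, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        sub_train = [s for s in train if s[attribute] == value] or train
        sub_prune = [s for s in prune_set if s[attribute] == value]
        branches[value] = reduced_error_prune(subtree, sub_train, sub_prune, target)
    leaf = majority_class(train, target)
    if accuracy(leaf, prune_set, target) >= accuracy(tree, prune_set, target):
        return leaf
    return tree
```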
Evaluation of Classification Trees
7. Comprehensibility
The comprehensibility criterion (also known as interpretability) refers to how well humans grasp the induced classifier. While the generalization error measures how well the classifier fits the data, comprehensibility measures the “mental fit” of that classifier.
8. Scalability to Large Datasets
Scalability refers to the ability of the method to construct the classification model efficiently given large amounts of data. Classical induction algorithms have been applied with practical success to many relatively simple and small-scale problems. However, trying to discover knowledge in real-life, large databases introduces time and memory problems.
9. Robustness
The ability of the model to handle noise or data with missing values and make correct predictions is called robustness. Different decision tree algorithms have different robustness levels. In order to estimate the robustness of a classification tree, it is common to train the tree on a clean training set and then train a different tree on a noisy training set. The noisy training set is usually the clean training set to which some artificial noisy instances have been added. The robustness level is measured as the difference in accuracy between these two situations.
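A hedged sketch of this robustness estimate, using scikit-learn's DecisionTreeClassifier with Iris as a stand-in dataset and random label corruption as one simple form of artificial noise (the 20% noise rate is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Tree trained on the clean training set.
clean_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)

# Tree trained on a noisy training set: 20% of the labels replaced at random.
rng = np.random.default_rng(0)
y_noisy = y_train.copy()
flip = rng.random(len(y_noisy)) < 0.2
y_noisy[flip] = rng.integers(0, 3, size=flip.sum())
noisy_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_noisy).score(X_test, y_test)

# Robustness is measured as the accuracy gap between the two situations.
print(f"robustness gap: {clean_acc - noisy_acc:.3f}")
```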
10. Stability
Formally, stability of a classification algorithm is defined as the degree to which an algorithm generates repeatable results, given different batches of data from the same process. In mathematical terms, stability is the expected agreement between two models on a random sample of the original data, where agreement on a specific example means that both models assign it to the same class.
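Under the same assumptions as the previous sketch (scikit-learn, Iris as a stand-in dataset), the expected agreement can be estimated by training two trees on different batches drawn from the same process and comparing their predictions on a common random sample:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Two batches of data from the same process (here: bootstrap resamples).
Xa, ya = resample(X_train, y_train, random_state=1)
Xb, yb = resample(X_train, y_train, random_state=2)
tree_a = DecisionTreeClassifier(random_state=0).fit(Xa, ya)
tree_b = DecisionTreeClassifier(random_state=0).fit(Xb, yb)

# Agreement: fraction of held-out examples both models assign to the same class.
agreement = (tree_a.predict(X_test) == tree_b.predict(X_test)).mean()
print(f"stability (estimated agreement): {agreement:.3f}")
```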
The “conservation law” [Schaffer (1994)] or “no free lunch theorem” [Wolpert
(1996)]: if one inducer is better than another in some domains, then there are
necessarily other domains in which this relationship is reversed.
The “no free lunch theorem” implies that for a given problem, a certain approach
can yield more information from the same data than other approaches.
The “no free lunch” concept presents a dilemma to the analyst approaching a new task: which inducer should be used?
If the analyst is looking for accuracy only, one solution is to try each one in turn, and by estimating the generalization error, to choose the one that appears to perform best [Schaffer (1994)]. Another approach, known as multistrategy learning [Michalski and Tecuci (1994)], attempts to combine two or more different paradigms in a single algorithm.
Fuzzy Decision Trees
Questions for Thought
Problems (Exercises)