Lecture 03 (3hrs) Decision Tree and Decision Forest
Decision Tree
Xizhao WANG
Big Data Institute
College of Computer Science
Shenzhen University
March 2021
Outline
Decision Tree Learning
Decision Tree Generation – An Illustration
Uncertainty
Inductive Bias and Partition
Summary
Advanced Topics on Decision Trees
[Figure: decision tree built from samples D1–D14. The root node tests Outlook with branches Sunny, Overcast, and Rain. The Sunny branch tests Humidity (High → No, Normal → Yes), the Overcast branch is a leaf labelled Yes, and the Rain branch tests Wind (Strong → No, Weak → Yes). The subset of samples reaching each branch is shown beside the nodes.]
D1, D2, …, D14 represent samples.
Red and blue indicate whether a sample's class label is “No” or “Yes”.
Samples are split by the most suitable attribute.
Each leaf node is assigned the class label to which most of its samples belong.
The most suitable attribute is chosen by a splitting criterion: information gain / gain ratio / Gini index…
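The slides only name the splitting criteria; as a minimal sketch (the `samples` list of dicts, the attribute names, and the target name "PlayTennis" are illustrative assumptions, not the lecture's notation), information gain for a candidate attribute could be computed as follows:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon (classification) entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(samples, attribute, target="PlayTennis"):
    """Entropy reduction obtained by splitting `samples` on `attribute`.
    `samples` is a list of dicts mapping attribute names to values."""
    labels = [s[target] for s in samples]
    before = entropy(labels)
    after = 0.0
    for value in {s[attribute] for s in samples}:
        subset = [s[target] for s in samples if s[attribute] == value]
        after += len(subset) / len(samples) * entropy(subset)
    return before - after
```

For the 14 samples above, the attribute with the largest gain (Outlook in the illustration) would be chosen as the root, and the procedure is repeated within each branch.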
Decision Tree Generation – An Illustration
1. Difference Between Random-Partition Tree and Attribute-Induced Tree
2. Difference Between Decision Tree and Other Types of Partition of the Sample Space
3. Training Set to Generate a Decision Tree
4. Animation of the Generation
5. For Real-Valued Attributes
If an attribute A is real-valued, then the partition cannot be induced by an equality test of the form “Attribute A = value #1”; instead, the attribute is split by comparing it with a threshold value.
For instance:
[Figure: decision tree for the same samples in which the Rain branch tests the real-valued attribute Temperature against a threshold instead of a categorical test; the Sunny branch tests Humidity, the Overcast branch is a leaf labelled Yes, and the remaining leaves are labelled Yes or No.]
D1, D2, …, D14 represent samples.
Red and blue indicate whether a sample's class label is “No” or “Yes”.
Samples are split by the most suitable attribute and a corresponding value (threshold).
Each leaf node is assigned the class label to which most of its samples belong.
The most suitable attribute and value are chosen by a splitting criterion: information gain / gain ratio / Gini index…
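For a real-valued attribute such as Temperature, a common approach in C4.5/CART-style learners (the slide does not fix a specific rule) is to sort the observed values and evaluate candidate thresholds at midpoints between adjacent values. A sketch reusing the `entropy` helper from the previous snippet:

```python
def best_threshold(samples, attribute, target="PlayTennis"):
    """Pick the cut point t for the test `attribute <= t` that maximizes
    information gain over candidate midpoints between sorted values."""
    values = sorted({s[attribute] for s in samples})
    before = entropy([s[target] for s in samples])
    best_gain, best_t = -1.0, None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [s[target] for s in samples if s[attribute] <= t]
        right = [s[target] for s in samples if s[attribute] > t]
        after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(samples)
        if before - after > best_gain:
            best_gain, best_t = before - after, t
    return best_t, best_gain
```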
Uncertainty
1. Summary of Uncertainty Definitions
2. Shannon Entropy
3. Classification Entropy
4. Fuzziness
5. Non-Specificity
6. Rough-Degree
7. Relation Between 2 Uncertainties
Classification entropy: the impurity of the class distribution in a (crisp) set.
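For a crisp set S whose samples fall into classes 1, …, K with proportions p_k, the classification (Shannon) entropy referred to here is the standard definition

H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k, \qquad p_k = \frac{|\{x \in S : \mathrm{class}(x) = k\}|}{|S|},

which equals 0 for a pure set and \log_2 K when all classes are equally frequent.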
An example (1)
An example (2)
An example (3)
Splitting Criteria – Expanded Attribute Selection Criteria
Pruning Trees
Recursive algorithm
When do we stop?
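A minimal recursive sketch of the generation procedure with explicit stopping tests (the helpers follow the earlier snippets; the depth limit is one illustrative pre-pruning criterion, not the lecture's specific choice):

```python
from collections import Counter

def majority_class(samples, target="PlayTennis"):
    """Class label held by most samples in the node."""
    return Counter(s[target] for s in samples).most_common(1)[0][0]

def build_tree(samples, attributes, target="PlayTennis", depth=0, max_depth=5):
    labels = {s[target] for s in samples}
    # Stop when the node is pure, no attributes remain, or a depth limit is hit.
    if len(labels) == 1:
        return labels.pop()
    if not attributes or depth >= max_depth:
        return majority_class(samples, target)
    # Otherwise expand on the attribute with the largest information gain.
    best = max(attributes, key=lambda a: information_gain(samples, a, target))
    node = {best: {}}
    for value in {s[best] for s in samples}:
        subset = [s for s in samples if s[best] == value]
        node[best][value] = build_tree(
            subset, [a for a in attributes if a != best], target, depth + 1, max_depth)
    return node
```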
Advanced Topics on Decision Trees
1. Splitting Criteria – Expanded Attribute Selection Criteria
2. Pruning Trees
3. Evaluation of Classification Trees
4. Fuzzy Decision Trees
Stopping Criteria
Number of nodes
Pre-pruning
Post-pruning divides the generation of the decision tree into two phases. The first phase is the tree-building process, with the termination condition that the proportion of a certain class in a node reaches 100%; the second phase prunes the tree structure obtained from the first phase.
In this way, post-pruning approaches avoid the problem of a limited visual field, i.e. stopping the growth of the tree too early based on a local view of the data. Accordingly, the accuracy of post-pruning methods is typically superior to that of pre-pruning methods, and post-pruning methods are more commonly used than pre-pruning methods.
There are various post-pruning techniques for decision trees. Most perform a top-down or bottom-up traversal of the nodes; a node is pruned if this operation improves a certain criterion. The following subsections describe the most popular techniques.
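As one concrete instance of the bottom-up traversal just described, here is a hedged sketch of reduced-error pruning over the dict-based trees from the earlier snippets: a subtree is replaced by its majority-class leaf whenever that does not reduce accuracy on a separate pruning set (the function names and data layout are assumptions, not the textbook's API):

```python
def classify(tree, sample, default="No"):
    """Follow the dict-based tree until a leaf (class label) is reached."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches.get(sample[attribute], default)
    return tree

def accuracy(tree, samples, target="PlayTennis"):
    return sum(classify(tree, s) == s[target] for s in samples) / len(samples)

def reduced_error_prune(tree, train, prune_set, target="PlayTennis"):
    """Bottom-up: prune the children first, then try to collapse this node
    into its majority-class leaf if that does not hurt pruning-set accuracy."""
    if not isinstance(tree, dict) or not prune_set:
        return tree
    attribute, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        sub_train = [s for s in train if s[attribute] == value] or train
        sub_prune = [s for s in prune_set if s[attribute] == value]
        branches[value] = reduced_error_prune(subtree, sub_train, sub_prune, target)
    leaf = majority_class(train, target)
    if accuracy(leaf, prune_set, target) >= accuracy(tree, prune_set, target):
        return leaf
    return tree
```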
Evaluation of Classification Trees
7. Comprehensibility
The comprehensibility criterion (also known as interpretability) refers to how well humans grasp the induced classifier. While the generalization error measures how well the classifier fits the data, comprehensibility measures the “mental fit” of that classifier.
8. Scalability to Large Datasets
Scalability refers to the ability of the method to construct the classification model efficiently given large amounts of data. Classical induction algorithms have been applied with practical success to many relatively simple and small-scale problems. However, trying to discover knowledge in real-life, large databases introduces time and memory problems.
9. Robustness
The ability of the model to handle noise or data with missing values and make correct predictions is called robustness. Different decision tree algorithms have different robustness levels. In order to estimate the robustness of a classification tree, it is common to train the tree on a clean training set and then train a different tree on a noisy training set. The noisy training set is usually the clean training set to which some artificial noisy instances have been added. The robustness level is measured as the difference in accuracy between these two situations.
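A hedged sketch of this robustness estimate, using scikit-learn's DecisionTreeClassifier with Iris as a stand-in dataset and random label corruption as one simple form of artificial noise (the 20% noise rate is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Tree trained on the clean training set.
clean_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)

# Tree trained on a noisy training set: 20% of the labels replaced at random.
rng = np.random.default_rng(0)
y_noisy = y_train.copy()
flip = rng.random(len(y_noisy)) < 0.2
y_noisy[flip] = rng.integers(0, 3, size=flip.sum())
noisy_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_noisy).score(X_test, y_test)

# Robustness is measured as the accuracy gap between the two situations.
print(f"robustness gap: {clean_acc - noisy_acc:.3f}")
```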
10. Stability
Formally, stability of a classification algorithm is defined as the degree to which an algorithm generates repeatable results, given different batches of data from the same process. In mathematical terms, stability is the expected agreement between two models on a random sample of the original data, where agreement on a specific example means that both models assign it to the same class.
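Under the same assumptions as the previous sketch (scikit-learn, Iris as a stand-in dataset), the expected agreement can be estimated by training two trees on different batches drawn from the same process and comparing their predictions on a common random sample:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Two batches of data from the same process (here: bootstrap resamples).
Xa, ya = resample(X_train, y_train, random_state=1)
Xb, yb = resample(X_train, y_train, random_state=2)
tree_a = DecisionTreeClassifier(random_state=0).fit(Xa, ya)
tree_b = DecisionTreeClassifier(random_state=0).fit(Xb, yb)

# Agreement: fraction of held-out examples both models assign to the same class.
agreement = (tree_a.predict(X_test) == tree_b.predict(X_test)).mean()
print(f"stability (estimated agreement): {agreement:.3f}")
```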
The “conservation law” [Schaffer (1994)] or “no free lunch theorem” [Wolpert
(1996)]: if one inducer is better than another in some domains, then there are
necessarily other domains in which this relationship is reversed.
The “no free lunch theorem” implies that for a given problem, a certain approach
can yield more information from the same data than other approaches.
The “no free lunch” concept presents a dilemma to the analyst approaching a new task: which inducer should be used?
If the analyst is looking for accuracy only, one solution is to try each one in turn, and by estimating the generalization error, to choose the one that appears to perform best [Schaffer (1994)]. Another approach, known as multistrategy learning [Michalski and Tecuci (1994)], attempts to combine two or more different paradigms in a single algorithm.
Fuzzy Decision Trees
Questions for Thought
Problems (Exercises)