CS 586
Prepared by Jugal Kalita
A Possible Decision Tree for PlayTennis
[Figure: a decision tree for PlayTennis with Outlook at the root and leaves labeled Yes or No.]
Decision or Regression Trees
What does a Decision or Regression Tree Represent?
Representing a Decision or Regression Tree in Rules
• We can express the example decision tree in terms of rules.
• The last or default rule may or may not be necessary, depending on how rules are processed.
• Representing knowledge in readable rules is convenient.
If (Outlook = Sunny ∧ Humidity = Normal), then PlayTennis
If (Outlook = Overcast), then PlayTennis
If (Outlook = Rain ∧ Wind = Weak), then PlayTennis
Otherwise, ¬PlayTennis
• The conclusion of each rule obtained from a decision tree is a value of the target attribute from a small set of possible values. The conclusion of each rule obtained from a regression tree is a value of the target attribute or a function to compute the value of the target attribute.
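As a concrete illustration, here is a minimal Python sketch of these rules as a chain of tests; the function name and string values are my own, not from the slides.

def play_tennis(outlook, humidity, wind):
    """Rules read off the PlayTennis decision tree; returns True for PlayTennis."""
    if outlook == "Sunny" and humidity == "Normal":
        return True
    if outlook == "Overcast":
        return True
    if outlook == "Rain" and wind == "Weak":
        return True
    return False  # the default rule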
Basic Issues in Decision or Regression Tree Learning
When do we use Decision Trees?
Entropy in Data: A Measure of Impurity of a Set of Data
Computing Entropy in Data
[Figure: Entropy(S) plotted against the proportion of positive examples; it peaks at 1.0 when the classes are evenly mixed and is 0 for a pure set.]

Entropy(S) = −(p+ log2 p+ + p− log2 p−)

where p+ and p− are the proportions of positive and negative examples in S.
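A minimal sketch of this computation in Python; the function and example below are my own, not from the slides.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy, in bits, of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Example: 9 positive and 5 negative examples, as in the PlayTennis data.
print(entropy(["Yes"] * 9 + ["No"] * 5))   # about 0.940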
Information Gain
Selecting the First Attribute
[Figure: two candidate first splits of S: [9+, 5−] (Entropy = 0.940), one on Humidity and one on Wind.]
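A sketch of the comparison in Python. The per-branch class counts are assumed from the standard PlayTennis example in Mitchell; they do not appear in the text above.

from math import log2

def entropy2(pos, neg):
    """Entropy of a two-class node from its positive/negative counts."""
    total = pos + neg
    return -sum((c / total) * log2(c / total) for c in (pos, neg) if c)

def info_gain(parent, branches):
    """parent and each branch are (pos, neg) pairs of example counts."""
    n = sum(parent)
    return entropy2(*parent) - sum((p + q) / n * entropy2(p, q) for p, q in branches)

s = (9, 5)
print(info_gain(s, [(3, 4), (6, 1)]))   # Humidity (High, Normal): about 0.151
print(info_gain(s, [(6, 2), (3, 3)]))   # Wind (Weak, Strong): about 0.048

Of the two, Humidity therefore provides the greater information gain.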
Selecting the Next Attribute
Choose Humidity as the next attribute under Sunny because it gives the most information gain. From page 61 of Mitchell.
Hypothesis Space Search by a Decision Tree Learner
Inductive Bias in Decision Tree Learning
Some Issues in Decision and Regression Tree Learning
Overfitting the training examples
• A hypothesis h overfits the training examples if there is some alternative hypothesis h′ that fits the training data less well but performs better over the entire distribution of instances.
• In other words, h fits the training data better than h′, but h does worse on the entire distribution of the data.
Overfitting in Decision Tree Learning
[Figure: accuracy plotted against size of tree (number of nodes) as the tree is grown.]
How to Avoid Overfitting
• Stop growing the tree before it fits the data perfectly. It is difficult to determine when to stop.
• Grow the full tree, then post-prune. This is the method normally used.
How to select “best” tree
Reduced-Error Pruning
Note: Some people use three independent sets of data: a training set for training the decision tree, a validation set for pruning, and a test set for testing the pruned tree. Assuming the data elements are randomly distributed, the same spurious regularities are unlikely to occur in all three sets.
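A compact sketch of reduced-error pruning under these assumptions. The Node class and helper functions are illustrative; they are not from the slides.

class Node:
    def __init__(self, attr=None, children=None, label=None):
        self.attr = attr                 # attribute tested here (None for a leaf)
        self.children = children or {}   # attribute value -> child Node
        self.label = label               # majority class of training examples reaching this node

def classify(node, x):
    while node.children:
        child = node.children.get(x.get(node.attr))
        if child is None:                # unseen attribute value: stop here
            break
        node = child
    return node.label

def accuracy(tree, data):
    return sum(classify(tree, x) == y for x, y in data) / len(data)

def internal_nodes(node):
    if node.children:
        yield node
        for child in node.children.values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, validation):
    """Repeatedly replace the subtree whose removal most helps validation accuracy."""
    while True:
        base = accuracy(tree, validation)
        best_gain, best_node = 0.0, None
        for node in list(internal_nodes(tree)):
            saved, node.children = node.children, {}   # tentatively make it a leaf
            gain = accuracy(tree, validation) - base
            node.children = saved                      # undo
            if gain >= best_gain:
                best_gain, best_node = gain, node
        if best_node is None:
            return tree                  # every remaining prune would reduce validation accuracy
        best_node.children = {}          # commit the best prune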
Effect of Reduced-Error Pruning
[Figure: accuracy plotted against size of tree (number of nodes), showing the effect of reduced-error pruning.]
Rule Post-Pruning
Pruning Rules
Why Convert Decision Tree to Rules before Pruning?
Splitting with the Gini Index of Diversity
• The split selected maximizes the difference between the Gini impurity before splitting and the Gini impurity after splitting:

GiniGain(T, A) = Gini(T) − p(Tl) Gini(Tl) − p(Tr) Gini(Tr)

where p(Tl) and p(Tr) are the proportions of examples in the left and right subtrees after the split, and Gini(T) = 1 − Σj pj² with pj the proportion of examples of class j in T.
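A minimal Python sketch of this computation; the example branch memberships below are made up for illustration.

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(parent, left, right):
    """Drop in Gini impurity from splitting `parent` into `left` and `right`."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Hypothetical binary split of 14 examples (9 positive, 5 negative).
parent = ["Yes"] * 9 + ["No"] * 5
left, right = ["Yes"] * 7 + ["No"] * 1, ["Yes"] * 2 + ["No"] * 4
print(gini_gain(parent, left, right))   # about 0.14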
Comparing Entropy and Gini Index of Diversity
As we can see, the two functions are not very different. Here J is the total number of classes. From page 409 of Jang et al.
Other Splitting Rules for Discrete Variables
Continuous Valued Attributes
Discretizing a Continuous Valued Attribute
Temperature: 40 48 60 72 80 90
PlayTennis: No No Yes Yes Yes No
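A short sketch of the usual procedure for this data: sort by the attribute and propose a candidate threshold at each midpoint between adjacent values whose class labels differ (the variable names are mine).

temps = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

# Midpoints between adjacent temperatures where the PlayTennis label changes.
candidates = [(a + b) / 2
              for a, b, ya, yb in zip(temps, temps[1:], labels, labels[1:])
              if ya != yb]
print(candidates)   # [54.0, 85.0]

Each candidate, such as Temperature > 54, then becomes a boolean attribute whose information gain can be compared with that of the other attributes.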
Continuous Valued Attribute and Space Partitioning
Alternative Splitting Rule for Continuous Attribute
Attributes with Many Values
SplitInformation(T, A) ≡ − Σ_{i=1..c} (|Ti| / |T|) log2 (|Ti| / |T|)

where Ti is the subset of T for which A has value vi.
• Gain(T, A) is the gain in entropy due to splitting using attribute A; we used this before.
• SplitInformation(T, A) is the entropy of T with respect to the values of attribute A, i.e. the entropy of the partition produced by splitting on A.
• The gain ratio divides the gain by this term: GainRatio(T, A) = Gain(T, A) / SplitInformation(T, A).
• The value of the denominator is high for attributes such as Date and SSN.
• If an attribute is like Date, where each value is unique to a row of data, then SplitInformation(T, A) = log2 n if there are n examples.
• If an attribute is boolean and divides the set exactly in half, SplitInformation = 1.
• Choose attributes with a high gain ratio; the large denominator makes the gain ratio low for attributes like Date.
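A small sketch of these quantities in Python; the eight-row example is made up to mirror the Date and boolean cases above.

from collections import Counter
from math import log2

def split_information(values):
    """Entropy of the partition induced by an attribute's values over the examples."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(gain, values):
    return gain / split_information(values)

# A Date-like attribute: 8 rows, every value distinct, giving log2(8) = 3 bits.
print(split_information(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]))   # 3.0
# A boolean attribute that splits the 8 rows exactly in half gives 1 bit.
print(split_information(["t"] * 4 + ["f"] * 4))                              # 1.0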
Unknown Attribute Values
Attributes with Differing Costs
Regression Trees
Growing Regression Trees
As the regression tree is grown, the learned function, which is discontinuous, becomes better defined. The output y is a function of one variable only, x. From Alpaydin, Chapter 13 instructor slides.
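A sketch of the core step used to grow such a tree: for a single numeric input, pick the threshold that minimizes the total squared error of the two children, and let each leaf predict the mean of its y values. The names and stopping rule are assumptions, not from the slides.

def sse(ys):
    """Sum of squared errors of ys around their mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Best threshold on a single numeric input, judged by total SSE of the two children."""
    best = None
    for t in sorted(set(xs))[:-1]:          # splitting at the largest x leaves one side empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t)
    return best                             # (total squared error, threshold)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.1, 0.9, 1.0, 3.2, 2.9, 3.1]
print(best_split(xs, ys))   # splits at x <= 3, separating the two level regions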
Growing Regression Trees
As the regression tree is grown, the learned function, which is discontinuous, becomes better defined. The output z is a function of two variables, x and y. The tree has 10 nodes now. From Jang et al., page 413.
Growing Regression Trees
As the regression tree is grown, the learned function, which is discontinuous, becomes better defined. The output z is a function of two variables, x and y. The tree has 20 nodes now. From Jang et al., page 414.
Pruning Regression Trees
Constant Leaves Vs. Linear Equations in Leaves
The input-output surfaces of regression trees with terminal nodes that are constant-valued vs. terminal nodes that store functions of the input. From Jang et al., page 406.
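A minimal sketch contrasting the two kinds of terminal nodes: a constant leaf predicts the mean of its training responses, while a model leaf fits a linear function of the inputs by least squares (numpy is used here; the helper names are mine).

import numpy as np

def constant_leaf(X, y):
    """Leaf that always predicts the mean response of its training examples."""
    mean = float(np.mean(y))
    return lambda x: mean

def linear_leaf(X, y):
    """Leaf that predicts a least-squares linear function of the inputs."""
    A = np.hstack([X, np.ones((len(X), 1))])   # add an intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda x: float(np.append(x, 1.0) @ w)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.1, 1.9, 4.1, 5.9])             # roughly y = 2x
print(constant_leaf(X, y)(np.array([2.5])))    # 3.0, the mean
print(linear_leaf(X, y)(np.array([2.5])))      # about 5.0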
Multivariate Decision/Regression Trees
Omnivariate Decision/Regression Trees
• An omnivariate tree allows both univariate and multivariate nodes, both linear and nonlinear.
• The different types of nodes are compared with statistical tests, and the best one is chosen.
• A simpler node is chosen unless a more complex one gives much better accuracy.
• Usually, more complex nodes occur near the root, where more data is available, and simpler nodes occur near the leaves, where there is less data to work with.
Balancing subtrees
• Many machine learning algorithms do not perform well if the training data is unbalanced in class sizes.
• However, in real life, unbalanced classes are common. One needs a way to avoid being biased toward the larger classes.
• The CART program uses some techniques to keep the classes as balanced as possible.