MODULE 11

Decision Trees
LESSON 24
Learning Decision Trees
Keywords: Learning, Training Data, Axis-Parallel Decision Tree

Money   Has-exams   Weather   Goes-to-movie
25      no          fine      no
200     no          hot       yes
100     no          rainy     no
125     yes         rainy     no
30      yes         rainy     no
300     yes         fine      yes
55      yes         hot       no
140     no          hot       no
20      yes         fine      no
175     yes         fine      yes
110     no          fine      yes
90      yes         fine      no

Table 1: Example training data set for induction of a decision tree

Decision Tree Construction from Training Data - an Example


Let us take a training set and induce a decision tree using the training set.
Table 1 gives a training data set with four patterns having the class
Goes-to-movie=yes and eight patterns having the class Goes-to-movie=no.
The impurity of this set is

$$\mathrm{Im}(n) = -\frac{4}{12}\log_2\frac{4}{12} - \frac{8}{12}\log_2\frac{8}{12} = 0.9183$$
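The root impurity can be checked numerically. The following is a minimal sketch; the function name `entropy` and the convention of passing per-class counts are assumptions made for illustration, not part of the lesson.

```python
import math

def entropy(counts):
    """Entropy impurity (in bits) of a node, given per-class pattern counts."""
    total = sum(counts)
    # Classes with zero patterns are skipped (0 * log2 0 is taken as 0).
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

# Root node of Table 1: 4 patterns with Goes-to-movie=yes, 8 with Goes-to-movie=no.
root_impurity = entropy([4, 8])
print(round(root_impurity, 4))  # 0.9183
```

A node whose patterns all belong to one class, such as `entropy([3, 0])`, has zero impurity.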

We now need to consider all three attributes for the first split and
choose the one with the highest information gain.
Money
Let us divide the feature values of money into three ranges: < 50,
50-150, and > 150.

1. Money < 50 has 3 patterns belonging to goes-to-movie=no and 0 patterns belonging to goes-to-movie=yes. The entropy for money < 50 is
$$\mathrm{Im}(\text{Money} < 50) = 0$$
2. Money 50-150 has 5 patterns belonging to goes-to-movie=no and 1 pattern belonging to goes-to-movie=yes. The entropy for money 50-150 is
$$\mathrm{Im}(\text{Money 50-150}) = -\frac{1}{6}\log_2\frac{1}{6} - \frac{5}{6}\log_2\frac{5}{6} = 0.65$$
3. Money > 150 has 3 patterns belonging to goes-to-movie=yes and 0 patterns belonging to goes-to-movie=no. The entropy for money > 150 is
$$\mathrm{Im}(\text{Money} > 150) = 0$$
4. Gain(Money)
$$\mathrm{Gain}(\text{money}) = 0.9183 - \frac{3}{12}\times 0 - \frac{6}{12}\times 0.65 - \frac{3}{12}\times 0 = 0.5933$$
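The gain for the money split can be reproduced from the class counts alone. This is a sketch under the assumption that each node is described by a `[yes, no]` count pair; the helper names `entropy` and `information_gain` are illustrative.

```python
import math

def entropy(counts):
    """Entropy impurity (in bits) from per-class pattern counts."""
    total = sum(counts)
    return sum(-c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# [yes, no] counts for the root, then for Money < 50, 50-150 and > 150.
gain_money = information_gain([4, 8], [[0, 3], [1, 5], [3, 0]])
print(round(gain_money, 4))  # 0.5933
```

Only the 50-150 branch contributes to the weighted sum, because the other two branches are pure.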

Has-exams
1. (Has-exams=yes)
Has a total of 7 patterns, with 2 patterns belonging to goes-to-movie=yes
and 5 patterns belonging to goes-to-movie=no. The entropy for has-exams=yes is
$$\mathrm{Im}(\text{has-exams} = \text{yes}) = -\frac{2}{7}\log_2\frac{2}{7} - \frac{5}{7}\log_2\frac{5}{7} = 0.8631$$

2. (Has-exams=no)
Has a total of 5 patterns, with 2 patterns belonging to goes-to-movie=yes
and 3 patterns belonging to goes-to-movie=no. The entropy for has-exams=no is
$$\mathrm{Im}(\text{has-exams} = \text{no}) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.9710$$
3. Gain for has-exams
$$\mathrm{Gain}(\text{has-exams}) = 0.9183 - \frac{7}{12}\times 0.8631 - \frac{5}{12}\times 0.9710 = 0.0102$$

Weather
1. (Weather=hot)
Has a total of 3 patterns, with 1 pattern belonging to goes-to-movie=yes
and 2 patterns belonging to goes-to-movie=no. The entropy for weather=hot
is
$$\mathrm{Im}(\text{weather} = \text{hot}) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.9183$$
2. (Weather=fine)
Has a total of 6 patterns, with 3 patterns belonging to goes-to-movie=yes
and 3 patterns belonging to goes-to-movie=no. The entropy for weather=fine
is
$$\mathrm{Im}(\text{weather} = \text{fine}) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1.0$$
3. (Weather=rainy)
Has a total of 3 patterns, with 0 patterns belonging to goes-to-movie=yes
and 3 patterns belonging to goes-to-movie=no. Taking $0 \log_2 0 = 0$, the entropy for weather=rainy
is
$$\mathrm{Im}(\text{weather} = \text{rainy}) = -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0$$
4. Gain for weather
$$\mathrm{Gain}(\text{weather}) = 0.9183 - \frac{3}{12}\times 0.9183 - \frac{6}{12}\times 1.0 - \frac{3}{12}\times 0 = 0.1887$$

All three attributes have been investigated, and the gain values are:
Gain(money) = 0.5933
Gain(has-exams) = 0.0102
Gain(weather) = 0.1887
Since Gain(money) has the maximum value, money is taken as the first
attribute.
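The comparison of the three attributes can also be carried out directly on the rows of Table 1. The sketch below is one way to do this; the names `TABLE_1`, `money_bin`, `ATTRIBUTES`, and `gain` are hypothetical, and the boundary handling in `money_bin` (50 and 150 falling in the middle range) is an assumption, since no pattern in the table sits exactly on a boundary.

```python
import math
from collections import Counter, defaultdict

# Table 1 as (money, has_exams, weather, goes_to_movie) rows.
TABLE_1 = [
    (25, "no", "fine", "no"), (200, "no", "hot", "yes"),
    (100, "no", "rainy", "no"), (125, "yes", "rainy", "no"),
    (30, "yes", "rainy", "no"), (300, "yes", "fine", "yes"),
    (55, "yes", "hot", "no"), (140, "no", "hot", "no"),
    (20, "yes", "fine", "no"), (175, "yes", "fine", "yes"),
    (110, "no", "fine", "yes"), (90, "yes", "fine", "no"),
]

def money_bin(money):
    """Map a money value to one of the three ranges used in the lesson."""
    if money < 50:
        return "<50"
    return "50-150" if money <= 150 else ">150"

# One accessor per attribute; each maps a row to its branch label.
ATTRIBUTES = {
    "money": lambda row: money_bin(row[0]),
    "has-exams": lambda row: row[1],
    "weather": lambda row: row[2],
}

def entropy(labels):
    """Entropy impurity (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute):
    """Information gain of splitting rows on the named attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[ATTRIBUTES[attribute](row)].append(row[3])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[3] for row in rows]) - remainder

gains = {name: gain(TABLE_1, name) for name in ATTRIBUTES}
for name, value in gains.items():
    print(name, round(value, 4))
best = max(gains, key=gains.get)
print(best)  # money
```

Working from the rows rather than from hand-counted class totals makes it easy to recheck each intermediate entropy value.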
When we take money as the first decision node, the training data gets
split into three portions for money < 50, money = 50-150, and money > 150.
There are three patterns along the outcome money < 50, 6 patterns along the
outcome money = 50-150 and 3 patterns along the outcome money > 150.
We will consider each of these three branches and think of the next decision
node as a new decision tree.

Money < 50
Out of the 3 patterns along this branch, all the patterns belong to goes-to-movie=no, so this is a leaf node and need not be investigated further.
Money > 150
Out of the 3 patterns along this branch, all the patterns belong to goes-to-movie=yes, so this is a leaf node and need not be investigated further.
Money = 50-150
This has a total of 6 patterns, with 1 pattern belonging to goes-to-movie=yes
and 5 patterns belonging to goes-to-movie=no. So the information in this
branch is
$$\mathrm{Im}(n) = -\frac{1}{6}\log_2\frac{1}{6} - \frac{5}{6}\log_2\frac{5}{6} = 0.65$$
Now we need to check the attributes has-exams and weather to see which
is the next attribute to be chosen.
Has-exams
1. (has-exams=yes)
There are a total of 3 patterns with has-exams=yes out of the 6 patterns along this branch. Out of these 3 patterns, 3 patterns belong to
goes-to-movie=no and 0 patterns belong to goes-to-movie=yes. So the
entropy of has-exams=yes is
$$\mathrm{Im}(\text{has-exams} = \text{yes}) = -\frac{0}{3}\log_2\frac{0}{3} - \frac{3}{3}\log_2\frac{3}{3} = 0$$
2. (has-exams=no)
There are a total of 3 patterns with has-exams=no out of the 6 patterns
along this branch. Out of these 3 patterns, 2 patterns have goes-to-movie=no and 1 pattern has goes-to-movie=yes. The entropy of has-exams=no is
$$\mathrm{Im}(\text{has-exams} = \text{no}) = -\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3} = 0.9183$$
3. Gain for has-exams
$$\mathrm{Gain}(\text{has-exams}) = 0.65 - \frac{3}{6}\times 0 - \frac{3}{6}\times 0.9183 = 0.1909$$
Weather
1. (weather=hot)
There are two patterns out of six which belong to weather=hot and
both of them belong to goes-to-movie=no. The entropy for weather=hot
is
$$\mathrm{Im}(\text{weather} = \text{hot}) = 0$$
2. (weather=fine)
There are two patterns out of six which belong to weather=fine; 1
pattern belongs to goes-to-movie=yes and 1 pattern belongs to goes-to-movie=no. So, the entropy of weather=fine is
$$\mathrm{Im}(\text{weather} = \text{fine}) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 0.5 + 0.5 = 1.0$$
3. (weather=rainy)
There are two patterns out of six which belong to weather=rainy and
both of them belong to goes-to-movie=no. The entropy for weather=rainy
is
$$\mathrm{Im}(\text{weather} = \text{rainy}) = 0$$
4. Gain for weather is
$$\mathrm{Gain}(\text{weather}) = 0.65 - \frac{2}{6}\times 1.0 = 0.3167$$
The values for gain for has-exams and weather are
Gain(has-exams) = 0.1909
Gain(weather) = 0.3167
Since weather has the higher gain value, it is chosen as the attribute for
this node. For weather=hot and weather=rainy, all the patterns belong to
goes-to-movie=no, so these two branches end in leaf nodes.
Only the node weather=fine needs to be expanded further, using the one remaining attribute, has-exams. The entire decision tree is given in Figure 1.
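The whole procedure (pick the attribute with the highest gain, split, and repeat on each impure branch) can be sketched as a short recursive function. This is an illustrative sketch, not the lesson's own algorithm: the names `ROWS`, `COLUMN`, `split`, `gain`, and `build` are hypothetical, money is pre-binned into the three ranges used above, and falling back to a majority-class leaf when no attributes remain is an added assumption.

```python
import math
from collections import Counter, defaultdict

# Rows of Table 1: (money range, has-exams, weather, goes-to-movie).
ROWS = [
    ("<50", "no", "fine", "no"), (">150", "no", "hot", "yes"),
    ("50-150", "no", "rainy", "no"), ("50-150", "yes", "rainy", "no"),
    ("<50", "yes", "rainy", "no"), (">150", "yes", "fine", "yes"),
    ("50-150", "yes", "hot", "no"), ("50-150", "no", "hot", "no"),
    ("<50", "yes", "fine", "no"), (">150", "yes", "fine", "yes"),
    ("50-150", "no", "fine", "yes"), ("50-150", "yes", "fine", "no"),
]
COLUMN = {"money": 0, "has-exams": 1, "weather": 2}

def entropy(labels):
    """Entropy impurity (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def split(rows, attribute):
    """Group rows by their value of the given attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[COLUMN[attribute]]].append(row)
    return groups

def gain(rows, attribute):
    """Information gain of splitting rows on the given attribute."""
    remainder = sum(len(g) / len(rows) * entropy([r[3] for r in g])
                    for g in split(rows, attribute).values())
    return entropy([r[3] for r in rows]) - remainder

def build(rows, attributes):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [r[3] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        # Pure node, or nothing left to split on: majority-class leaf.
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a))
    rest = [a for a in attributes if a != best]
    return {best: {value: build(group, rest)
                   for value, group in split(rows, best).items()}}

tree = build(ROWS, ["money", "has-exams", "weather"])
print(tree)
```

The nested dictionary mirrors the tree: money at the root, weather under the 50-150 branch, and has-exams under weather=fine.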
The following points may be noted after going through the example:
1. The decision tree can be used effectively to choose among several courses
of action.
2. The way in which the decision tree arrives at a decision can be easily
explained. Each path in the decision tree corresponds to a simple rule.
3. At each node, the attribute chosen to make the split is the one with
the largest drop in impurity, that is, the highest gain.

has money?
  < 50: goes to a movie = false
  50-150: weather?
    hot: goes to a movie = false
    fine: has exams?
      yes: goes to a movie = false
      no: goes to a movie = true
    rainy: goes to a movie = false
  > 150: goes to a movie = true

Figure 1: The decision tree induced from Table 1
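Since each root-to-leaf path is a simple rule, the tree of Figure 1 can be written as a chain of tests. The sketch below assumes a hypothetical function name `goes_to_movie` and a boolean `has_exams` argument; the outcomes of the final has-exams test are taken from the two Table 1 patterns with money in 50-150 and weather=fine.

```python
def goes_to_movie(money, has_exams, weather):
    """Classify one pattern by following a path of Figure 1 from root to leaf."""
    if money < 50:
        return False
    if money > 150:
        return True
    # Money in 50-150: the decision depends on the weather.
    if weather in ("hot", "rainy"):
        return False
    # weather == "fine": the last test is on has-exams.
    return not has_exams

# Two patterns from Table 1 that reach the has-exams node.
print(goes_to_movie(110, False, "fine"))  # True
print(goes_to_movie(90, True, "fine"))    # False
```

Applied to all twelve training patterns, these rules reproduce the Goes-to-movie column of Table 1.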

Assignment
1. There are four coins 1, 2, 3, 4 out of which three coins are of equal
weight and one coin is heavier. Use a decision tree to identify the
heavier coin.
2. Consider the three-class problem characterized by the training data
given in the following table. Obtain the axis-parallel decision tree for
the data.
Professor   Number of Students   Funding   Research Output
Sam         3                    Low       Medium
Sam         5                    Low       Medium
Sam         1                    High      High
Pam         1                    High      Low
Pam         5                    Low       Low
Pam         5                    High      Low
Ram         1                    Low       Low
Ram         3                    High      Medium
Ram         5                    Low       High

3. Consider a data set of 10 patterns which is split into 3 classes at a node
in a decision tree, where 4 patterns are from class 1, 5 from class 2, and
1 from class 3. Compute the entropy impurity. What is the variance
impurity?
4. For the data in problem 3, compute the Gini impurity. How about the
misclassification impurity?
5. Consider the data given in the following table for a three-class problem.
If the set of 100 training patterns is split using a variable X into two
parts represented by the left and right subtrees below the decision node
as shown in the table, compute the entropy impurity.
6. Compute the variance impurity for the data given in problem 5. How
about the misclassification impurity?

Class Label   Total No. in Class   No. in Left Subtree   No. in Right Subtree
1             40                   40                    0
2             30                   10                    20
3             30                   10                    20

7. Consider the two-class problem in a two-dimensional space characterized by the following training data. Obtain an axis-parallel decision
tree for the data.
Class 1: $(1, 1)^t, (2, 2)^t, (6, 7)^t, (7, 7)^t$
Class 2: $(6, 1)^t, (6, 2)^t, (7, 1)^t, (7, 2)^t$
8. Consider the data given in problem 7. Suggest an oblique decision tree.