
Constructing Decision Trees

A Decision Tree Example


The weather data example.
ID code  Outlook   Temperature  Humidity  Windy  Play
a        Sunny     Hot          High      False  No
b        Sunny     Hot          High      True   No
c        Overcast  Hot          High      False  Yes
d        Rainy     Mild         High      False  Yes
e        Rainy     Cool         Normal    False  Yes
f        Rainy     Cool         Normal    True   No
g        Overcast  Cool         Normal    True   Yes
h        Sunny     Mild         High      False  No
i        Sunny     Cool         Normal    False  Yes
j        Rainy     Mild         Normal    False  Yes
k        Sunny     Mild         Normal    True   Yes
l        Overcast  Mild         High      True   Yes
m        Overcast  Hot          Normal    False  Yes
n        Rainy     Mild         High      True   No
~continues
Outlook
  sunny    → Humidity: high → no, normal → yes
  overcast → yes
  rainy    → Windy: false → yes, true → no

Decision tree for the weather data.


The Process of Constructing a
Decision Tree
• Select an attribute to place at the root of the
decision tree and make one branch for every
possible value.
• Repeat the process recursively for each
branch.
Which Attribute Should Be Placed
at a Certain Node
• One common approach is based on the
information gained by placing a certain
attribute at this node.
Information Gained by Knowing
the Result of a Decision
• In the weather data example, there are 9 instances for which the decision to play is “yes” and 5 instances for which it is “no”. The information gained by knowing the result of the decision is therefore
$$-\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits.}$$
The General Form for Calculating
the Information Gain
• Entropy of a decision =
$$-P_1\log_2 P_1 - P_2\log_2 P_2 - \cdots - P_n\log_2 P_n,$$
where $P_1, P_2, \dots, P_n$ are the probabilities of the $n$ possible outcomes.
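As a small illustration (not part of the original slides), the following Python sketch computes this entropy; the function name `entropy` is my own choice.

```python
import math

def entropy(counts):
    """Entropy, in bits, of a class distribution given as a list of counts."""
    total = sum(counts)
    # Zero counts contribute nothing, since 0 * log2(0) is taken as 0.
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# The weather data: 9 "yes" and 5 "no" instances.
print(entropy([9, 5]))   # ~0.940 bits
```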
Information Further Required If
“Outlook” Is Placed at the Root
Split on Outlook: sunny → 2 “yes”, 3 “no”; overcast → 4 “yes”, 0 “no”; rainy → 3 “yes”, 2 “no”.

Information further required
$$= \frac{5}{14}\times 0.971 + \frac{4}{14}\times 0 + \frac{5}{14}\times 0.971 = 0.693 \text{ bits.}$$
Information Gained by Placing
Each of the 4 Attributes
• Gain(outlook) = 0.940 bits – 0.693 bits
= 0.247 bits.
• Gain(temperature) = 0.029 bits.
• Gain(humidity) = 0.152 bits.
• Gain(windy) = 0.048 bits.
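To check these numbers, here is a small sketch (my own code, not from the slides) that computes the information gain of each attribute over the 14 weather instances; the variable and function names are my own.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# (Outlook, Temperature, Humidity, Windy, Play) for the 14 instances a-n.
data = [
    ("Sunny", "Hot", "High", False, "No"),      ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),  ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),  ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),  ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),   ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),("Rainy", "Mild", "High", True, "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Windy"]

def gain(rows, attr_index):
    """Information gain of splitting `rows` on the attribute at `attr_index`."""
    base = entropy(list(Counter(r[-1] for r in rows).values()))
    branches = defaultdict(list)
    for r in rows:
        branches[r[attr_index]].append(r[-1])
    remainder = sum(len(b) / len(rows) * entropy(list(Counter(b).values()))
                    for b in branches.values())
    return base - remainder

for i, name in enumerate(attributes):
    print(name, round(gain(data, i), 3))
# Expected: Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048
```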
The Strategy for Selecting an
Attribute to Place at a Node
• Select the attribute that gives us the largest
information gain.
• In this example, it is the attribute
“Outlook”.
Split on Outlook: sunny → 2 “yes”, 3 “no”; overcast → 4 “yes”, 0 “no”; rainy → 3 “yes”, 2 “no”.
The Recursive Procedure for
Constructing a Decision Tree
• The operation discussed above is applied to each branch recursively to construct the decision tree.
• For example, for the branch “Outlook = sunny”, we evaluate the information gained by applying each of the remaining 3 attributes:
• Gain(Outlook=sunny; Temperature) = 0.971 – 0.4 = 0.571
• Gain(Outlook=sunny; Humidity) = 0.971 – 0 = 0.971
• Gain(Outlook=sunny; Windy) = 0.971 – 0.951 = 0.02
• Similarly, we evaluate the information gained by applying each of the remaining 3 attributes for the branch “Outlook = rainy”:
• Gain(Outlook=rainy; Temperature) = 0.971 – 0.951 = 0.02
• Gain(Outlook=rainy; Humidity) = 0.971 – 0.951 = 0.02
• Gain(Outlook=rainy; Windy) = 0.971 – 0 = 0.971
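A minimal recursive sketch of this procedure (ID3-style, my own illustration) follows; it assumes the `entropy`/`gain` functions and the `data`/`attributes` lists from the previous sketch.

```python
from collections import Counter

def build_tree(rows, attr_indices):
    """Recursively build a tree as nested dicts; leaves are class labels."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:                 # all instances agree: make a leaf
        return labels[0]
    if not attr_indices:                      # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attr_indices, key=lambda i: gain(rows, i))
    subtree = {}
    for value in set(r[best] for r in rows):  # one branch per attribute value
        subset = [r for r in rows if r[best] == value]
        remaining = [i for i in attr_indices if i != best]
        subtree[value] = build_tree(subset, remaining)
    return {attributes[best]: subtree}

print(build_tree(data, [0, 1, 2, 3]))         # the root test is on Outlook
```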
The Over-fitting Issue

• Over-fitting is caused by creating decision rules that work accurately on the training set but are based on an insufficient number of samples.
• As a result, these decision rules may not work well in more general cases.
Example of the Over-fitting Problem
in Decision Tree Construction

Subroot: 11 “Yes” and 9 “No” samples; prediction = “Yes”.
Split on Ai: Ai=0 → 3 “Yes” and 0 “No” samples, prediction = “Yes”; Ai=1 → 8 “Yes” and 9 “No” samples, prediction = “No”.

Entropy at the subroot
$$= -\frac{11}{20}\log_2\frac{11}{20} - \frac{9}{20}\log_2\frac{9}{20} = 0.993 \text{ bits.}$$
Average entropy at the children
$$= \frac{17}{20}\left(-\frac{8}{17}\log_2\frac{8}{17} - \frac{9}{17}\log_2\frac{9}{17}\right) = 0.848 \text{ bits.}$$
• Hence, with the binary split, we gain more information.
• However, if we look at the pessimistic error rate, i.e., the upper bound of the confidence interval of the error rate, we may reach a different conclusion.
• The formula for the pessimistic error rate is
$$e = \frac{r + \dfrac{z^2}{2n} + z\sqrt{\dfrac{r}{n} - \dfrac{r^2}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}},$$
where r is the observed error rate, n is the number of samples, and z is the normal deviate satisfying Prob{N(0,1) > z} = 1 − c, with c the confidence level specified by the user.
• Note that the pessimistic error rate is a function of the confidence level used.
• The pessimistic error rates under 95% confidence (z = 1.645) are
$$e_{9/20} = \frac{0.45 + \dfrac{1.645^2}{40} + 1.645\sqrt{\dfrac{0.45}{20} - \dfrac{0.45^2}{20} + \dfrac{2.706}{1600}}}{1 + \dfrac{1.645^2}{20}} = 0.6278,$$
$$e_{0/3} = \frac{\dfrac{1.645^2}{6} + 1.645\sqrt{\dfrac{1.645^2}{36}}}{1 + \dfrac{1.645^2}{3}} = 0.4742,$$
$$e_{8/17} = \frac{\dfrac{8}{17} + \dfrac{1.645^2}{34} + 1.645\sqrt{\dfrac{8}{17\cdot 17} - \dfrac{(8/17)^2}{17} + \dfrac{2.706}{1156}}}{1 + \dfrac{1.645^2}{17}} = 0.6598.$$
• Therefore, the average pessimistic error rate at the children is
$$\frac{3}{20}\times 0.4742 + \frac{17}{20}\times 0.6598 = 0.632 > 0.6278.$$

• Since the pessimistic error rate increases with the split, we do not want to keep the children. This practice is called “tree pruning”.
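As a check, here is a small sketch (my own code, not from the slides) of the pessimistic error rate formula above; it reproduces the three values and the weighted average at the children.

```python
import math

def pessimistic_error(r, n, z=1.645):
    """Upper confidence bound on the error rate (z = 1.645 for 95% confidence)."""
    num = r + z * z / (2 * n) + z * math.sqrt(r / n - r * r / n + z * z / (4 * n * n))
    return num / (1 + z * z / n)

e_root   = pessimistic_error(9 / 20, 20)   # ~0.628
e_child0 = pessimistic_error(0 / 3, 3)     # ~0.474
e_child1 = pessimistic_error(8 / 17, 17)   # ~0.660
e_avg    = 3 / 20 * e_child0 + 17 / 20 * e_child1
print(e_root, e_child0, e_child1, e_avg)   # ~0.632 > ~0.628, so pruning is preferred
```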
Tree Pruning based on the χ² Test of
Independence
• We construct the corresponding contingency table for the split (subroot: 11 “Yes” and 9 “No” samples; children: Ai=0 → 3 “Yes”, 0 “No”; Ai=1 → 8 “Yes”, 9 “No”).

        Ai=0  Ai=1  Total
Yes        3     8     11
No         0     9      9
Total      3    17     20
The χ² statistic is
$$\chi^2 = \frac{\left(3 - \frac{11\times 3}{20}\right)^2}{\frac{11\times 3}{20}} + \frac{\left(8 - \frac{11\times 17}{20}\right)^2}{\frac{11\times 17}{20}} + \frac{\left(0 - \frac{9\times 3}{20}\right)^2}{\frac{9\times 3}{20}} + \frac{\left(9 - \frac{9\times 17}{20}\right)^2}{\frac{9\times 17}{20}} = 1.15.$$
• Therefore, we should not split the subroot node if we require that the χ² statistic be larger than χ²_{k,0.05}, where k is the degrees of freedom of the corresponding contingency table.
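For reference, here is a sketch using scipy (my choice of library, not part of the slides). Note that scipy's default Yates continuity correction for 2×2 tables yields roughly the 1.15 reported above, while the uncorrected statistic for this table is about 2.89; both are below the critical value 3.841, so the conclusion not to split is the same either way.

```python
from scipy.stats import chi2, chi2_contingency

table = [[3, 8],   # "Yes" counts under Ai=0, Ai=1
         [0, 9]]   # "No"  counts under Ai=0, Ai=1

stat_corr, p_corr, dof, expected = chi2_contingency(table)          # Yates-corrected (default for 2x2)
stat_raw, p_raw, _, _ = chi2_contingency(table, correction=False)   # plain chi-square
critical = chi2.ppf(0.95, df=dof)                                   # chi^2_{1,0.05} = 3.841

print(stat_corr, stat_raw, critical)   # ~1.15, ~2.89, 3.841 -> do not split either way
```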
Constructing Decision Trees based
on the χ² Test of Independence
• Using the following example, we can construct a contingency table accordingly.

Subroot: 75 “Yes”s out of 100 samples; prediction = “Yes”. Children: Ai=0 → 20 “Yes”s out of 25 samples; Ai=1 → 10 “Yes”s out of 25 samples; Ai=2 → 45 “Yes”s out of 50 samples.

        Ai=0  Ai=1  Ai=2  Total
Yes       20    10    45     75
No         5    15     5     25
Total     25    25    50    100
$$\chi^2 = \frac{\left(20 - \frac{1}{4}\cdot\frac{3}{4}\cdot 100\right)^2}{\frac{1}{4}\cdot\frac{3}{4}\cdot 100} + \frac{\left(10 - \frac{1}{4}\cdot\frac{3}{4}\cdot 100\right)^2}{\frac{1}{4}\cdot\frac{3}{4}\cdot 100} + \frac{\left(45 - \frac{1}{2}\cdot\frac{3}{4}\cdot 100\right)^2}{\frac{1}{2}\cdot\frac{3}{4}\cdot 100}$$
$$\qquad + \frac{\left(5 - \frac{1}{4}\cdot\frac{1}{4}\cdot 100\right)^2}{\frac{1}{4}\cdot\frac{1}{4}\cdot 100} + \frac{\left(15 - \frac{1}{4}\cdot\frac{1}{4}\cdot 100\right)^2}{\frac{1}{4}\cdot\frac{1}{4}\cdot 100} + \frac{\left(5 - \frac{1}{2}\cdot\frac{1}{4}\cdot 100\right)^2}{\frac{1}{2}\cdot\frac{1}{4}\cdot 100} = 22.67 > \chi^2_{2,0.05} = 5.991.$$

• Therefore, we may say that the split is statistically robust.
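The same check for the 2×3 table above, again using scipy (my choice of library; the continuity correction only applies to 2×2 tables, so none is used here):

```python
from scipy.stats import chi2, chi2_contingency

table = [[20, 10, 45],   # "Yes" counts under Ai=0, Ai=1, Ai=2
         [ 5, 15,  5]]   # "No"  counts

stat, p, dof, expected = chi2_contingency(table)
print(stat, chi2.ppf(0.95, df=dof))   # ~22.67 > 5.991
```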
Assume that we have another attribute Aj to consider.

Subroot: 75 “Yes” out of 100 samples. Children: Aj=0 → 25 “Yes” out of 25 samples; Aj=1 → 50 “Yes” out of 75 samples.

        Aj=0  Aj=1  Total
Yes       25    50     75
No         0    25     25
Total     25    75    100

$$\chi^2 = \frac{\left(25 - \frac{25\times 75}{100}\right)^2}{\frac{25\times 75}{100}} + \frac{\left(0 - \frac{25\times 25}{100}\right)^2}{\frac{25\times 25}{100}} + \frac{\left(50 - \frac{75\times 75}{100}\right)^2}{\frac{75\times 75}{100}} + \frac{\left(25 - \frac{75\times 25}{100}\right)^2}{\frac{75\times 25}{100}} = 11.11 > \chi^2_{1,0.05} = 3.841.$$
• Now, both Ai and Aj pass our criterion. How should we make our selection?
• We can make our selection based on the significance levels of the two contingency tables.


12, '  11 .11   '  1  F 2 (11 .11)  Prob N 2 (0,1)  11 .11
1

  '  Prob N (0,1)  3.33  Prob N (0,1)  3.33
 2  Prob N (0,1)  3.33  0.0008  8  10 4.
 22, "  22.67   "  1  F 2 (22.67)
2

  ( 22.67 ) 
1
 1  1  e 2   1.19  10 5.
 
 

• Therefore, since α′′ < α′, the split on Ai is more significant, and Ai is preferred over Aj.
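A quick way to reproduce these significance levels is the chi-square survival function in scipy (my choice of library, not part of the slides):

```python
from scipy.stats import chi2

alpha_Aj = chi2.sf(11.11, df=1)   # ~8.6e-4, close to the ~8e-4 above
alpha_Ai = chi2.sf(22.67, df=2)   # ~1.19e-5

# The smaller significance level wins, so Ai is preferred over Aj.
print(alpha_Ai < alpha_Aj)        # True
```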


Termination of Split due to Low
Significance Level
• If a subtree is as follows (subroot: 15 “Yes”s out of 20 samples; children: 4 “Yes”s out of 5 samples, 2 “Yes”s out of 5 samples, and 9 “Yes”s out of 10 samples), then
$$\chi^2 = 4.543 < \chi^2_{2,0.05} = 5.991.$$
• In this case, we do not want to carry out the split.
A More Realistic Example and
Some Remarks
• In the following example, a bank wants to
derive a credit evaluation tree for future use
based on the records of existing customers.
• As the data set shows, it is highly likely that the
training data set contains inconsistencies.
• Furthermore, some values may be missing.
• Therefore, for most cases, it is impossible to
derive perfect decision trees, i.e. decision trees
with 100% accuracy.
~continues
Attributes: Education, Annual Income, Age, Own House, Sex. Class: Credit ranking.

Education     Annual Income  Age     Own House  Sex     Credit ranking
College       High           Old     Yes        Male    Good
High school   -----          Middle  Yes        Male    Good
High school   Middle         Young   No         Female  Good
College       High           Old     Yes        Male    Poor
College       High           Old     Yes        Male    Good
College       Middle         Young   No         Female  Good
High school   High           Old     Yes        Male    Poor
College       Middle         Middle  -----      Female  Good
High school   Middle         Young   No         Male    Poor
~continues
• A quality measure of decision trees can be based on the accuracy. There are alternative measures depending on the nature of applications.
• Overfitting is a problem caused by making the derived decision tree work accurately for the training set. As a result, the decision tree may work less accurately in the real world.
~continues
• There are two situations in which overfitting may occur:
  • insufficient number of samples at the subroot;
  • some attributes are highly branched.
• A conventional practice for handling missing values is to treat them as possible attribute values. That is, each attribute has one additional attribute value corresponding to the missing value.
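A minimal pandas sketch of this convention (my own illustration; the column names and the "Missing" label are hypothetical, not from the slides):

```python
import pandas as pd

df = pd.DataFrame({
    "Education": ["College", "High school", "College"],
    "Annual Income": ["High", None, "Middle"],   # None marks a missing value
})

# Treat a missing value as one more attribute value of its own.
df = df.fillna("Missing")
print(df["Annual Income"].unique())   # ['High' 'Missing' 'Middle']
```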
Alternative Measures of Quality of
Decision Trees
• The recall rate and precision are two widely used measures:
$$\text{Recall rate} = \frac{|C' \cap C|}{|C|}, \qquad \text{Precision} = \frac{|C' \cap C|}{|C'|},$$
• where C is the set of samples in the class and C′ is the set of samples which the decision tree puts into the class.
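A small sketch of these two measures in Python (my own illustration; the sample sets below are hypothetical):

```python
def recall_and_precision(C, C_prime):
    """C: samples truly in the class; C_prime: samples the tree assigns to the class."""
    hits = len(C & C_prime)
    return hits / len(C), hits / len(C_prime)

# Hypothetical example: 10 correctly classified out of 12 actual class members,
# with 15 samples predicted as belonging to the class.
C = set(range(12))                                # actual class members
C_prime = set(range(10)) | {20, 21, 22, 23, 24}   # predicted class members
print(recall_and_precision(C, C_prime))           # (0.833..., 0.666...)
```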
~continues

• A situation in which the recall rate is the main concern: “A bank wants to find all the potential credit card customers.”
• A situation in which precision is the main concern: “A bank wants to find a decision tree for credit approval.”
