

Introduction to Decision Tree Algorithms for

Classification involves assigning unseen data
records to pre-known classes or groups
Important activity from an analytical point of view
Various algorithms have been developed to
achieve the same over the years beginning
Fundamental idea of information gain from the
theory of Information Systems is utilized in all
standing algorithms in one way or the other
Our focus would be to understand
the concept of information gain and utilize it
Decision Tree Induction
Involves two stages
Stage 1: Constructing a classification model
Stage 2: Adapting and applying the model to
classify data records whose classes are unknown
Various terms utilized by practitioners have led
to confusion, however, training set and testing
set are fairly standard over the literature
Some authors use the term validation set
No consensus on how to divide the dataset, but
2/3 and 1/3 is the classical approach; a more
rigorous approach is 1/2 and 1/2
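A minimal Python sketch of the classical 2/3 and 1/3 division into training and testing sets. The function name `split_dataset`, the shuffling step and the fixed seed are assumptions made for illustration; the slides do not prescribe how the records are selected.

```python
import random

def split_dataset(records, train_fraction=2 / 3, seed=0):
    """Shuffle a copy of the records and split it into (training, testing) lists."""
    shuffled = list(records)
    # Assumption: records are shuffled first so the split does not depend on file order.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(1, 15))  # stand-ins for 14 data records
train, test = split_dataset(records)
print(len(train), len(test))

# The half-and-half division only changes the fraction.
train_half, test_half = split_dataset(records, train_fraction=1 / 2)
print(len(train_half), len(test_half))
```

With 14 records this gives 9 training and 5 testing records for the classical split, and 7 and 7 for the half-and-half split.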
Components of a Decision Tree
Leaf nodes : represent the class label
Internal nodes : Name of an attribute
Links : the link from parent node to child node
represents a value of the attribute of the parent node


[Figure: an internal node linked to two leaf nodes]
Characteristics of Decision Trees
More than one decision tree can be constructed
from the same data
Structure of the decision tree impacts its
performance: classification involves moving
from the root to a leaf, and this usually
involves a test at every internal node (breadth
and length of the tree)
Accuracy of classification is an important
parameter, and is usually dependent on the
final application
Constructing a Decision Tree
A lot of work has been done in the last 2-3 decades
in this area
Most methods use the following process
If the training set is empty, create a leaf node
and label it as NULL; there is
nothing to determine the class outcome and
hence the class is unknown
If all examples in the training set are of the same class,
create a leaf node and label it with the class label
If the examples in the training set are from
different classes, the following operations
need to be performed
Select an attribute to be the root of the current tree
Partition the current training set into subsets
according to the values of the chosen attributes
Construct a subtree for each subset
Create a link from the root of the current tree to
the root of each subtree, and label the link with
appropriate value of the root attribute that
separates one subset from the others
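The partitioning step can be sketched in Python as follows. The helper name `partition` and the dictionary representation of a record are assumptions for illustration; the four records are taken from the representative data set used later in these slides.

```python
from collections import defaultdict

def partition(examples, attribute):
    """Group examples into subsets keyed by their value of `attribute`."""
    subsets = defaultdict(list)
    for example in examples:
        subsets[example[attribute]].append(example)
    return dict(subsets)

# Records 1, 3, 6 and 11 of the data set, reduced to two attributes.
examples = [
    {"Outlook": "Sunny", "Windy": "False", "Class": "N"},
    {"Outlook": "Overcast", "Windy": "False", "Class": "P"},
    {"Outlook": "Rain", "Windy": "True", "Class": "N"},
    {"Outlook": "Sunny", "Windy": "True", "Class": "P"},
]

# Each key becomes the label of a link from the root to one subtree.
for value, subset in partition(examples, "Outlook").items():
    print(value, [e["Class"] for e in subset])
```

Each subset would then be handed to the same construction process to build the subtree below the corresponding link.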

Information Gain
Idea originates from information theory and probability
theory, to quantify the amount of information when
random events occur
Information System S is a system around a sample space
comprising a set of events E1, E2, ..., En with associated
probabilities of event occurrence P(E1), P(E2), ..., P(En)
If M is the size of the sample space and Nk is the number of
outcomes that convey the event Ek, then the probability
that Ek occurs is calculated as P(Ek) = Nk/M
Each attribute can be considered as an Information System
For a system S, the self-information of an event Ek of S is defined as
$I(E_k) = \log_q \frac{1}{P(E_k)} = -\log_q P(E_k)$
Some points to ponder..
When P(Ek) = 0, I(Ek) is set to 0
The base q of the logarithm defines the unit of
measurement for the amount of information: if the
base is 2 the unit is bits; if the base is 10 it is digits
If Ek always happens, P(Ek) = 1 and I(Ek) = 0,
usually a fact that does not convey any information
If Ek occurs frequently, P(Ek) ~ 1 and
I(Ek) is close to 0, so not much information
If Ek is rare, P(Ek) ~ 0 and I(Ek) is very large,
conveying a large amount of information
If Ek never occurs, I(Ek) would be infinite, so it is forced
to be set to zero
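The points above can be checked numerically with a small Python sketch. The function name `self_information` is an assumption for illustration; the zero-probability branch implements the convention stated in the slides.

```python
import math

def self_information(p, base=2):
    """Self-information of an event with probability p, in units set by `base`."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if p == 0.0:
        # Convention from the slides: an event that never occurs is assigned 0.
        return 0.0
    return math.log(1.0 / p, base)

for p in (1.0, 0.9, 0.5, 0.01, 0.0):
    print(f"P = {p:<4}  I = {self_information(p):.3f} bits")

# Changing the base changes only the unit (base 10 gives digits).
print(f"P = 0.01  I = {self_information(0.01, base=10):.3f} digits")
```

A certain event carries no information, a frequent event carries little, and a rare event carries a lot.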
Shannon's Entropy
Based on self-information of individual events, the
average information of the whole information system S,
is defined as the weighted sum of self information of all
events in S
Given by $H(S) = \sum_{k=1}^{n} P(E_k)\, I(E_k) = -\sum_{k=1}^{n} P(E_k) \log_q P(E_k)$
Given two information systems S1 and S2, the
conditional self-information of event Ek of S1 given that
event Fj of S2 has occurred is defined as
$I(E_k \mid F_j) = -\log_q P(E_k \mid F_j) = -\log_q \frac{P(E_k \text{ and } F_j)}{P(F_j)}$
The average conditional information (expected
information) of system S1 of n events in the presence of
system S2 of m events is the weighted sum of the
conditional self-information over all pairs of events in S1
and S2
$H(S_1 \mid S_2) = \sum_{i=1}^{n} \sum_{j=1}^{m} P(E_i \text{ and } F_j)\, I(E_i \mid F_j)$
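A minimal Python sketch of both averages, assuming the two systems are described by a table of joint probabilities. The function names and the `joint` dictionary layout are illustrative choices, not part of the original material.

```python
import math

def entropy(probabilities, base=2):
    """Weighted sum of self-information; zero-probability events contribute 0."""
    return sum(p * math.log(1.0 / p, base) for p in probabilities if p > 0)

def conditional_entropy(joint, base=2):
    """H(S1|S2), where joint[(e, f)] = P(E = e and F = f)."""
    # Marginal P(Fj), obtained by summing the joint probabilities over S1.
    marginal_f = {}
    for (_, f), p in joint.items():
        marginal_f[f] = marginal_f.get(f, 0.0) + p
    total = 0.0
    for (_, f), p in joint.items():
        if p > 0:
            # I(Ei|Fj) = -log(P(Ei and Fj) / P(Fj)) = log(P(Fj) / P(Ei and Fj))
            total += p * math.log(marginal_f[f] / p, base)
    return total

# Two independent fair coins: knowing S2 tells nothing about S1.
independent = {(e, f): 0.25 for e in "HT" for f in "HT"}
# S1 is a copy of S2: knowing S2 removes all uncertainty about S1.
copied = {("H", "H"): 0.5, ("T", "T"): 0.5}

print(entropy([0.5, 0.5]))
print(conditional_entropy(independent))
print(conditional_entropy(copied))
```

In the independent case the conditional entropy equals the entropy of S1 (1 bit); in the copied case it drops to 0.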
When a decision tree is being constructed, two
information systems are present, the attribute A and the class
H(Class) represents the average information of the
class system before an attribute is chosen for the root
H(Class|A) represents the expected information of the
class system after attribute A is chosen as root
The Information Gain over attribute A is given
by G(A)=H(Class)-H(Class|A)
G(A) represents reduction in uncertainty

Representative Data Set
Sr  Outlook   Temperature  Humidity  Windy  Class
1   Sunny     Hot          High      False  N
2   Sunny     Hot          High      True   N
3   Overcast  Hot          High      False  P
4   Rain      Mild         High      False  P
5   Rain      Cool         Normal    False  P
6   Rain      Cool         Normal    True   N
7   Overcast  Cool         Normal    True   P
8   Sunny     Mild         High      False  N
9   Sunny     Cool         Normal    False  P
10  Rain      Mild         Normal    False  P
11  Sunny     Mild         Normal    True   P
12  Overcast  Mild         High      True   P
13  Overcast  Hot          Normal    False  P
14  Rain      Mild         High      True   N
(Outlook, Temperature, Humidity and Windy are the attributes)
$I(\text{Class}=P) = -\log_2 P(\text{Class}=P) = -\log_2(9/14) = 0.637$ bits
$H(\text{Class}) = -P(\text{Class}=P) \log_2 P(\text{Class}=P) - P(\text{Class}=N) \log_2 P(\text{Class}=N)$
$= -(9/14) \log_2(9/14) - (5/14) \log_2(5/14) = 0.940$ bits
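The arithmetic can be checked with a short Python sketch. The helper name `entropy_from_counts` is an assumption for illustration; the class counts (9 records of class P, 5 of class N) are read from the data set.

```python
import math

def entropy_from_counts(counts, base=2):
    """Average information of a system described by event counts."""
    total = sum(counts)
    # Zero counts are skipped, matching the convention I(Ek) = 0 when P(Ek) = 0.
    return sum((c / total) * math.log(total / c, base) for c in counts if c > 0)

p_count, n_count = 9, 5
total = p_count + n_count

info_p = math.log2(total / p_count)  # self-information of Class = P
info_n = math.log2(total / n_count)  # self-information of Class = N
h_class = entropy_from_counts([p_count, n_count])

print(f"I(Class=P) = {info_p:.3f} bits")
print(f"I(Class=N) = {info_n:.3f} bits")
print(f"H(Class)   = {h_class:.3f} bits")
```

The rarer class N carries more self-information than class P, and H(Class) is the probability-weighted average of the two.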

$H(\text{Class} \mid A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, H(\text{Class} \mid A = a_i)$
Where $(p_i + n_i)/(p + n)$ is the probability of attribute A = ai, with pi and ni
the numbers of class P and class N examples for which A = ai
H(Class|Outlook) = (5/14*H(Class|Outlook=sunny)) +
(4/14*H(Class|Outlook=overcast)) +
(5/14*H(Class|Outlook=rain))
= 0.694 bits
G(Outlook) = H(Class) - H(Class|Outlook) = 0.246 bits
Corresponding values: G(temp)=0.029,
G(Hum)=0.151, G(wind)=0.048
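The gains for all four attributes can be recomputed with a Python sketch over the 14 records. The function and variable names are assumptions for illustration. Because the sketch does not round intermediate values, the last digit can differ from the slide figures (for example about 0.247 instead of 0.246 for Outlook).

```python
import math
from collections import Counter

COLUMNS = ("Outlook", "Temperature", "Humidity", "Windy", "Class")
ROWS = [
    ("Sunny", "Hot", "High", False, "N"),
    ("Sunny", "Hot", "High", True, "N"),
    ("Overcast", "Hot", "High", False, "P"),
    ("Rain", "Mild", "High", False, "P"),
    ("Rain", "Cool", "Normal", False, "P"),
    ("Rain", "Cool", "Normal", True, "N"),
    ("Overcast", "Cool", "Normal", True, "P"),
    ("Sunny", "Mild", "High", False, "N"),
    ("Sunny", "Cool", "Normal", False, "P"),
    ("Rain", "Mild", "Normal", False, "P"),
    ("Sunny", "Mild", "Normal", True, "P"),
    ("Overcast", "Mild", "High", True, "P"),
    ("Overcast", "Hot", "Normal", False, "P"),
    ("Rain", "Mild", "High", True, "N"),
]

def entropy(labels):
    """H of a list of class labels, in bits."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def conditional_entropy(rows, col):
    """H(Class|A): entropy of each subset A = ai, weighted by the subset's share."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

def gain(rows, col):
    """G(A) = H(Class) - H(Class|A)."""
    return entropy([row[-1] for row in rows]) - conditional_entropy(rows, col)

gains = {name: gain(ROWS, i) for i, name in enumerate(COLUMNS[:-1])}
for name, value in gains.items():
    print(f"G({name}) = {value:.3f}")
print("Best attribute:", max(gains, key=gains.get))
```

Outlook gives the largest reduction in uncertainty, which is why it is selected as the root.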

Algorithm constructTreeID3 (C: training set): decision tree;
begin
  Tree := empty tree;
  if C is empty then
    Tree := a leaf node labeled NULL;
  else if C contains examples of one class only then
    Tree := a leaf node labeled by the class tag;
  else
    for every attribute Ai (1 <= i <= p) do
      calculate information gain Gain(Ai);
    select attribute A where Gain(A) = max(Gain(A1), Gain(A2), ..., Gain(Ap));
    make A the root of Tree;
    partition C into subsets C1, C2, ..., Cw by the values of A;
    for each Ci (1 <= i <= w) do
      ti := constructTreeID3(Ci);
      attach ti as a subtree of Tree;
    label the links from A to the roots of the subtrees with the values of A;
  return (Tree);
end;
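A compact Python sketch of the same recursion, run on the representative data set. Several choices here are assumptions rather than part of the algorithm as stated: the tree is a nested dictionary, attribute values are kept as strings, an attribute already used on a path is not reused, and a majority-class leaf is returned if the attributes run out before the examples are pure.

```python
import math
from collections import Counter

NAMES = ["Outlook", "Temperature", "Humidity", "Windy", "Class"]
RAW = """
Sunny Hot High False N
Sunny Hot High True N
Overcast Hot High False P
Rain Mild High False P
Rain Cool Normal False P
Rain Cool Normal True N
Overcast Cool Normal True P
Sunny Mild High False N
Sunny Cool Normal False P
Rain Mild Normal False P
Sunny Mild Normal True P
Overcast Mild High True P
Overcast Hot Normal False P
Rain Mild High True N
"""
EXAMPLES = [dict(zip(NAMES, line.split())) for line in RAW.strip().splitlines()]

def entropy(labels):
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(examples, attribute, target):
    subsets = {}
    for ex in examples:
        subsets.setdefault(ex[attribute], []).append(ex[target])
    expected = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
    return entropy([ex[target] for ex in examples]) - expected

def construct_tree_id3(examples, attributes, target="Class"):
    if not examples:
        return None  # leaf labeled NULL: class unknown
    classes = {ex[target] for ex in examples}
    if len(classes) == 1:
        return classes.pop()  # leaf labeled with the single class
    if not attributes:
        # Assumption beyond the slides: fall back to the majority class.
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # Only values present in the examples are visited, so no subset is empty here.
    for value in sorted({ex[best] for ex in examples}):
        subset = [ex for ex in examples if ex[best] == value]
        branches[value] = construct_tree_id3(subset, remaining, target)
    return {best: branches}  # link labels are the keys of `branches`

tree = construct_tree_id3(EXAMPLES, NAMES[:-1])
print(tree)
```

On this data set the sketch selects Outlook at the root, Humidity under Sunny, Windy under Rain, and a single P leaf under Overcast.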
Subsets partitioned from the original training set

Outlook = Sunny
Temperature  Humidity  Windy  Class
Hot          High      False  N
Hot          High      True   N
Mild         High      False  N
Cool         Normal    False  P
Mild         Normal    True   P

Outlook = Overcast
Temperature  Humidity  Windy  Class
Hot          High      False  P
Cool         Normal    True   P
Mild         High      True   P
Hot          Normal    False  P

Outlook = Rain
Temperature  Humidity  Windy  Class
Mild         High      False  P
Cool         Normal    False  P
Cool         Normal    True   N
Mild         Normal    False  P
Mild         High      True   N

Outlook has the highest information gain and so is the
root of the tree
The process is repeated for the remaining
subsets; for the first subset Humidity is identified as
having maximum information gain and forms the root of the subtree
Links are labeled as high and normal, the subsets are
partitioned further and two leaf nodes are formed
The second subset has only one class, so one
leaf is formed
The third subset forms a subtree with Windy as its root,
and again two leaves are formed
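The resulting tree can be written down and used for classification with a short Python sketch. The nested-dictionary layout and the name `classify` are assumptions for illustration; the leaf labels are read from the three subsets of the training set.

```python
# Internal nodes are {attribute: {value: subtree}}; leaves are class labels.
TREE = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "N", "Normal": "P"}},
        "Overcast": "P",
        "Rain": {"Windy": {"True": "N", "False": "P"}},
    }
}

def classify(tree, record):
    """Move from the root to a leaf, testing one attribute at each internal node."""
    node = tree
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        # An attribute value with no matching link yields None (class unknown).
        node = branches.get(record[attribute])
    return node

unseen = {"Outlook": "Sunny", "Temperature": "Cool", "Humidity": "High", "Windy": "True"}
print(classify(TREE, unseen))
```

The unseen record follows the Sunny link and then the High link, so it is assigned class N after two tests.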