
Decision Tree Learning

Vineet Padmanabhan Nair

School of Computer and Information Sciences


University of Hyderabad
Hyderabad.

April 12, 2022



Decision Tree Learning

Decision tree representation

ID3 learning algorithm

Entropy, Information gain

Overfitting



Introduction

Decision trees are among the most widely used methods for
inductive inference.

Decision tree learning is a method for approximating discrete-valued
target functions.

It is robust to noisy data and can learn disjunctive expressions.

The learned hypothesis is represented as a decision tree.



Decision Trees

Each node in the tree specifies a test for some attribute of the
instance.
Each branch corresponds to an attribute value.
Each leaf node assigns a classification.
Decision trees represent a disjunction (or) of conjunctions (and)
of constraints on the values. Each root-leaf path is a conjunction.

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)



Decision Tree for PlayTennis

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes



When to Consider Decision Trees

Instances describable by attribute–value pairs


Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data
The training data may contain missing attribute values
Problems in which the task is to classify examples into one of a
discrete set of possible categories are called classification problems



Building a decision tree

Main loop:
1 A ← the “best” decision attribute for next node
2 Assign A as decision attribute for node
3 For each value of A, create new descendant of node
4 Sort training examples to leaf nodes
5 If training examples perfectly classified, Then STOP, Else iterate
over new leaf nodes
This is basically the ID3 algorithm
What do we mean by best?



Choosing the best attribute

Day Outlook Temperature Humidity Wind PlayTennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No



Choosing the best attribute . . .

There are 9 positive and 5 negative examples


Humidity = High has 3 positive and 4 negative examples
Humidity = Normal has 6 positive and 1 negative
Wind = Strong has 3 positive and 3 negative
Which one is better as a root node, Humidity or Wind?



Entropy

[Figure: Entropy(S) plotted against the proportion p⊕ of positive examples;
it is 0 at p⊕ = 0 and at p⊕ = 1, and reaches its maximum of 1.0 at p⊕ = 0.5.]

S is a sample of training examples


p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖



Entropy . . .

If S has 9 positive and 5 negative examples, its entropy is

Entropy([9+, 5−]) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94

This function is 0 for p⊕ = 0 and for p⊕ = 1. It reaches its maximum
of 1 when p⊕ = 0.5.
That is, entropy is maximised when the degree of confusion
(uncertainty about the class) is maximised.
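
As a quick check of the numbers above, here is a minimal Python sketch
(my own, not part of the slides) that computes the entropy of a sample
from its class counts:

    import math

    def entropy(counts):
        """Entropy of a sample given a list of class counts."""
        total = sum(counts)
        ent = 0.0
        for c in counts:
            if c > 0:               # treat 0 * log2(0) as 0
                p = c / total
                ent -= p * math.log2(p)
        return ent

    print(entropy([9, 5]))   # ~0.940
    print(entropy([7, 7]))   # 1.0, maximal confusion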



Entropy as Encoding Length

We can also say that entropy equals the expected number of bits
needed to encode the class (⊕ or ⊖) of a randomly drawn member of S
using the optimal, shortest-length code.
Information theory: the optimal length code assigns − log2 p bits to a
message having probability p.
Imagine I am choosing elements from S at random and telling
you whether they are ⊕ or ⊖. How many bits per element will I
need? (We work out the encoding beforehand.)
If a message has probability 1 then its encoding length is 0.
If its probability is 0.5 then we need 1 bit (the maximum).
So, the expected number of bits to encode ⊕ or ⊖ of a random member
of S is:
p⊕(− log2 p⊕) + p⊖(− log2 p⊖)

Entropy(S) ≡ −p⊕ log2 p⊕ − p⊖ log2 p⊖



Non Boolean Entropy

If the target attribute can take on c different values we can still
define entropy:

Entropy(S) ≡ Σ_{i=1}^{c} −pi log2 pi

pi is the proportion belonging to class i

Now the entropy can be as large as log2 c



Information Gain

The information gain is the expected reduction in entropy
caused by partitioning the examples with respect to an attribute.

Given S is the set of examples, A the attribute, and Sv the subset
of S for which attribute A has value v:

Gain(S, A) ≡ Entropy(S) − Σ_{v∈Values(A)} (|Sv|/|S|) Entropy(Sv)    (1)

The first term of the equation is the entropy of the original
collection S.
The second term is the expected value of the entropy after S is
partitioned using attribute A.



Information Gain. . .

Gain(S, A) ≡ Entropy(S) − Σ_{v∈Values(A)} (|Sv|/|S|) Entropy(Sv)    (2)

The expected entropy described by the second term is simply the
sum of the entropies of each subset Sv, weighted by the fraction
of examples |Sv|/|S| that belong to Sv.
Gain(S, A) is therefore the expected reduction in entropy caused
by knowing the value of attribute A.
Put another way, Gain(S, A) is the information provided about
the target function value, given the value of some other
attribute A.
The value of Gain(S, A) is the number of bits saved when
encoding the target value of an arbitrary member of S, by
knowing the value of attribute A.



Information Gain . . .

Using our set of examples we can calculate that


Original Entropy = 0.94
Humidity = High entropy = 0.985
Humidity = Normal entropy = 0.592
 
Gain(S, Humidity) = .94 − (7/14)(.985) − (7/14)(.592) = .151
Wind = Weak entropy = 0.811
Wind = Strong entropy = 1.0
Gain(S, Wind) = .94 − (8/14)(.811) − (6/14)(1.0) = .048
So Humidity provides a greater information gain.
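
A minimal Python sketch (my own illustration, not from the slides) that
reproduces these numbers directly from the PlayTennis table above:

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(labels, column):
        """Gain(S, A) when attribute A is given as a column of values."""
        g = entropy(labels)
        for v in set(column):
            subset = [lab for lab, val in zip(labels, column) if val == v]
            g -= (len(subset) / len(labels)) * entropy(subset)
        return g

    # PlayTennis labels for days D1..D14, plus the Humidity and Wind columns.
    play     = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
    humidity = ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                "High", "Normal", "Normal", "Normal", "High", "Normal", "High"]
    wind     = ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"]

    print(round(gain(play, humidity), 3))   # 0.152 (.151 in the slides, which round intermediate values)
    print(round(gain(play, wind), 3))       # 0.048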



Selecting the next attribute

Which attribute is the best classifier?

S: [9+, 5−], E = 0.940

Split on Humidity:
  High   → [3+, 4−], E = 0.985
  Normal → [6+, 1−], E = 0.592
  Gain(S, Humidity) = .940 − (7/14)(.985) − (7/14)(.592) = .151

Split on Wind:
  Weak   → [6+, 2−], E = 0.811
  Strong → [3+, 3−], E = 1.00
  Gain(S, Wind) = .940 − (8/14)(.811) − (6/14)(1.0) = .048



ID3
ID3 (Examples, Target, Attributes)
Create a root node
If all Examples have the same Target value, give the root this
label
Else if Attributes is empty, label the root with the most common
Target value among Examples
Else begin
Calculate the information gain of each attribute, according to the
average entropy formula
Select the attribute, A, with the lowest average entropy (highest
information gain) and make A the attribute tested at the root
For each possible value, v, of this attribute
Add a new branch below the root, corresponding to A = v
Let Examples(v) be those examples with A = v
If Examples(v) is empty, make the new branch a leaf node labelled
with the most common Target value among Examples
Else let the new branch be the tree created by
ID3(Examples(v), Target, Attributes − {A})
end
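
The following is a compact Python sketch of the procedure above (my own
illustration, not the course's reference implementation). Examples are
dicts mapping attribute names to values plus a target label, and the
returned tree is a nested dict of the form {attribute: {value: subtree-or-label}}:

    import math
    from collections import Counter

    def entropy(examples, target):
        counts = Counter(ex[target] for ex in examples)
        n = len(examples)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def info_gain(examples, attr, target):
        n = len(examples)
        remainder = 0.0
        for v in set(ex[attr] for ex in examples):
            subset = [ex for ex in examples if ex[attr] == v]
            remainder += (len(subset) / n) * entropy(subset, target)
        return entropy(examples, target) - remainder

    def id3(examples, target, attributes):
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:                   # all examples agree: leaf
            return labels[0]
        if not attributes:                          # no attributes left: majority leaf
            return Counter(labels).most_common(1)[0][0]
        best = max(attributes, key=lambda a: info_gain(examples, a, target))
        tree = {best: {}}
        # Branches are created only for values seen in Examples, so the
        # "Examples(v) is empty" case does not arise in this sketch.
        for v in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == v]
            rest = [a for a in attributes if a != best]
            tree[best][v] = id3(subset, target, rest)
        return tree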
ID3 Example

Again using our examples, ID3 would first calculate


Gain(S, Outlook) = 0.246
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
So Outlook would be the root. The Overcast branch would lead
to a Yes classification.
At the Sunny branch we would recursively apply ID3 to the examples
S′ = {D1, D2, D8, D9, D11}, giving
Gain(S′, Humidity) = .970
Gain(S′, Temperature) = .570
Gain(S′, Wind) = .019



Another Decision Tree Example





Another Decision Tree Example continued . . .

S has 3 positive and 3 negative examples, so its entropy is
Entropy(S) = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1.

Gain(S, color) = Entropy(S) − (|Sred|/|S|) Entropy(Sred)
                 − (|Sgreen|/|S|) Entropy(Sgreen) − (|Sblue|/|S|) Entropy(Sblue)
               = 1 − (3/6)(.9182) − 0 − 0 = 0.5409.




Decision Tree example continued . . .

Gain(S, color) = 0.5409.


Gain(S, shape) = 0.
Gain(S, size) = 0.459.
Select color as the best attribute at the root.



Decision Tree example continued . . .

What is the best attribute for the red child node?

Entropy(Sred) = −(2/3) log2(2/3) − (1/3) log2(1/3) = .9182.

Gain(red, shape) = Entropy(Sred) − (|Sred,square|/|Sred|) Entropy(Sred,square)
                   − (|Sred,round|/|Sred|) Entropy(Sred,round)
                 = .9182 − 0 − (2/3)(1) = 0.2515.

Gain(red, size) = .9182.

Best attribute for the red child node is size.




More Questions

Consider classification of data in which the 4-dimensional feature


vector x contains four Boolean features: A, B, C, D. Furthermore, the
class label y is also Boolean. Give decision trees to represent the
following Boolean functions.
1 y = A ∧ ¬B
2 y = A ∨ (B ∧ C)
3 y = A ⊗ B
4 y = (A ∧ B) ∨ (C ∧ D)



Questions Continued . . .

Consider a classification problem where the class label y can take the values −1 and +1 and there are
two features, x1 and x2, which both have possible values 0, 1, and 2. Let H = {h1, h2, h3} be a
hypothesis space for this problem that contains the following three hypotheses¹:

x1   x2   y
0    2    +1
2    2    −1
1    2    −1
0    0    +1

Table: Classification Data

h1(x) = +1 if x1 · x2 = 0, −1 otherwise.
h2(x) = +1 if x1 ≠ x2, −1 otherwise.
h3(x) = +1 if x1 = 0, −1 otherwise.

1 Give a decision tree that makes the same classification as h1.

2 Give an example of a hypothesis for this classification problem that is consistent with the data
in the table above, but is not a member of H.

¹ x1 · x2 denotes the product of x1 and x2.
The data given in the table below is related to "The Simpsons". Using ID3,
find the best attribute for the root node of the decision tree for the 9
training examples. Based on this root node, construct a tree that
correctly classifies Males/Females so that we know which class Comic
belongs to.

Person   Hair length   Weight   Age   Class

Homer    0"            250      36    M
Marge    10"           150      34    F
Bart     2"            90       10    M
Lisa     6"            78       8     F
Maggie   4"            20       1     F
Abe      1"            170      70    M
Selma    8"            160      41    F
Otto     10"           180      38    M
Krusty   6"            200      45    M

Comic    8"            290      38    ?

Table: The Simpsons



A Different Question
Consider the training examples given in the table below.

Weekend Weather Parents Money Decision


W1 Sunny Yes Rich Cinema
W2 Sunny No Rich Tennis
W3 Windy Yes Rich Cinema
W4 Rainy Yes Poor Cinema
W5 Rainy No Rich Stay-in
W6 Rainy Yes Poor Cinema
W7 Windy No Poor Cinema
W8 Windy No Rich Shopping
W9 Windy Yes Rich Cinema
W10 Sunny No Rich Tennis
Table: Training Examples for ID3

1 Using ID3 find the best attribute for the root node of the
decision tree for the 10 training examples.
2 What is the best attribute for the Parents child node?
Hypothesis Space Search by ID3

ID3 searches the space of possible decision trees: it does
hill-climbing on information gain.
It searches the complete space of all finite discrete-valued
functions. Every such function has at least one tree that
represents it.
It maintains only one hypothesis (unlike Candidate-Elimination).
It cannot tell us how many other viable hypotheses there are.
It does not backtrack, so it can get stuck in local optima.
It uses all training examples at each step, so the results are less
sensitive to errors in individual examples.



Hypothesis Space Search by ID3 . . .

[Figure: ID3's hill-climbing search through the space of decision trees,
starting from the empty tree and repeatedly adding attribute tests
(A1, A2, A3, A4, . . .) to produce progressively larger trees.]


Inductive Bias

Given a set of examples there are many trees that would fit it.
Which one does ID3 pick?

This is the inductive bias.

Approximate ID3 inductive bias: prefer shorter trees. To
actually implement that bias, ID3 would need to do a BFS over tree
sizes.

A closer approximation of ID3's inductive bias: prefer shorter trees
over longer trees, and prefer trees that place higher information gain
attributes near the root.



Restriction and Preference Biases

ID3 searches a complete hypothesis space but does so
incompletely: once it finds a good hypothesis it stops
(it cannot find the others).
Candidate-Elimination searches an incomplete hypothesis space
(it can represent only some hypotheses) but does so completely.
A preference bias is an inductive bias where some hypotheses
are preferred over others (for instance, shorter hypotheses).
A restriction bias is an inductive bias where the set of
hypotheses considered is restricted to a smaller set.



Occam’s Razor

Occam’s Razor: prefer the simplest hypothesis that fits the
data.
Why should we prefer a shorter hypothesis?
There are fewer short hypotheses than long hypotheses, so
a short hypothesis that fits the data is unlikely to be a coincidence,
while a long hypothesis that fits the data might be a coincidence.
But there are many ways to define small sets of hypotheses,
e.g., all trees with a prime number of nodes that use attributes
beginning with Z.
What’s so special about small sets based on the size of the hypothesis?



Issues in Decision Tree Learning

How deep to grow the tree?

How to handle continuous attributes?
How to choose an appropriate attribute-selection measure?
How to handle data with missing attribute values?
How to handle attributes with different costs?
How to improve computational efficiency?
ID3 has been extended to handle most of these issues. The resulting
system is C4.5.



Overfitting

A hypothesis h ∈ H is said to overfit the training data if there
exists some alternative hypothesis h′ ∈ H such that h has
smaller error than h′ over the training examples, but h′ has
smaller error than h over the entire distribution of instances.
That is, if
errortrain(h) < errortrain(h′)
and
errorD(h) > errorD(h′)
This can happen if there are errors (noise) in the training data.
It becomes worse if we let the tree grow too big, as shown
in the next experiment.



Overfitting. . .

[Figure: accuracy (0.5–0.9) versus size of tree (number of nodes, 0–100).
Accuracy on the training data keeps increasing as the tree grows, while
accuracy on the test data first rises and then falls as the tree overfits.]



Dealing with Overfitting

Overfitting is a significant practical difficulty for decision tree


learning and many other learning methods.
In one experimental study of ID3 involving five different learning
tasks with noisy, nondeterministic data, overfitting was found to
decrease the accuracy of learned decision trees by 10-25% on
most problems.
There are several approaches to avoiding overfitting in decision
tree learning. These can be grouped into two classes:
Approaches that stop growing the tree earlier, before it reaches
the point where it perfectly classifies the training data,
Approaches that allow the tree to overfit the data, and then
post-prune the tree.



Dealing with Overfitting

Regardless of whether the correct tree size is found by stopping early
or by post-pruning, a key question is what criterion should be used to
determine the correct final tree size. Approaches include:
Either stop growing the tree early or prune it afterwards; in
practice, pruning has been the more effective of the two.
Use a separate set of examples (not used for training) to evaluate the
utility of post-pruning nodes.
Use a statistical test to estimate whether expanding a node is
likely to improve performance beyond the training set.
Use an explicit measure of the complexity of encoding the training
examples and the decision tree, and stop when this encoding size is
minimized: the Minimum Description Length principle.
MDL: minimize size(tree) + size(misclassifications(tree))



Reduced Error Pruning

Split the data into a training set and a validation set

Do until further pruning is harmful:

1 Evaluate the impact on the validation set of pruning each possible
node (together with the subtree below it)
2 Greedily remove the one whose removal most improves validation-set
accuracy

This produces the smallest version of the most accurate subtree

It requires that a fairly large amount of data is available
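
Below is a minimal Python sketch of reduced-error pruning (my own
illustration, reusing the nested-dict tree representation from the ID3
sketch). It is a simplified bottom-up variant rather than the exact
greedy global loop described above: a subtree is replaced by the
majority training label routed to it whenever that does not hurt
accuracy on the validation examples routed to it.

    from collections import Counter

    def classify(tree, ex):
        while isinstance(tree, dict):
            attr = next(iter(tree))
            tree = tree[attr].get(ex[attr])     # unseen value -> None (counted as wrong)
        return tree

    def majority(examples, target):
        return Counter(ex[target] for ex in examples).most_common(1)[0][0]

    def reduced_error_prune(tree, train, val, target):
        if not isinstance(tree, dict) or not train or not val:
            return tree
        attr = next(iter(tree))
        for v in list(tree[attr]):              # prune the children first
            tr = [ex for ex in train if ex[attr] == v]
            va = [ex for ex in val if ex[attr] == v]
            tree[attr][v] = reduced_error_prune(tree[attr][v], tr, va, target)
        leaf = majority(train, target)
        tree_correct = sum(classify(tree, ex) == ex[target] for ex in val)
        leaf_correct = sum(ex[target] == leaf for ex in val)
        return leaf if leaf_correct >= tree_correct else tree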



Effect of Reduced Error Pruning

[Figure: accuracy (0.5–0.9) versus size of tree (number of nodes, 0–100),
with three curves: accuracy on the training data, on the test data, and
on the test data during pruning. Reduced-error pruning recovers much of
the test-set accuracy that the unpruned tree loses to overfitting.]



Reduced error pruning: Example



Reduced-error pruning continued . . .



Rule Post-Pruning

Infer the tree as well as possible

Convert the tree to an equivalent set of rules
Prune each rule by removing any preconditions whose removal
improves its estimated accuracy
Sort the final rules by their estimated accuracy and consider them in
this sequence when classifying
This is perhaps the most frequently used method (e.g. in C4.5)



Converting a Tree to Rules

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
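
A small Python sketch (my own, again using the nested-dict tree
representation from the ID3 sketch) that enumerates every root-to-leaf
path of the tree above and turns it into an IF-THEN rule:

    def tree_to_rules(tree, conditions=()):
        """Yield (conditions, label) pairs, one per root-to-leaf path."""
        if not isinstance(tree, dict):              # reached a leaf: emit a rule
            yield list(conditions), tree
            return
        attr = next(iter(tree))
        for value, subtree in tree[attr].items():
            yield from tree_to_rules(subtree, conditions + ((attr, value),))

    play_tennis_tree = {"Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }}

    for conds, label in tree_to_rules(play_tennis_tree):
        body = " ∧ ".join(f"({a} = {v})" for a, v in conds)
        print(f"IF {body} THEN PlayTennis = {label}")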
Why Convert Decision Trees to Rules before Pruning?

Converting to rules allows distinguishing among the different
contexts in which a decision node is used. Because each distinct
path through the decision tree node produces a distinct rule, the
pruning decision regarding that attribute test can be made
differently for each path. In contrast, if the tree itself were
pruned, the only two choices would be to remove the decision
node completely, or to retain it in its original form.
Converting to rules removes the distinction between attribute
tests that occur near the root of the tree and those that occur
near the leaves. Thus, we avoid messy bookkeeping issues such as
how to reorganise the tree if the root node is pruned while
retaining part of the subtree below this test.
Converting to rules improves readability. Rules are often easier
for people to understand.



Continuous-Valued Attributes
We might have a Temperature attribute with a continuous value,
Temperature = 82.5.
Create a new Boolean attribute that thresholds the value at some
constant c, e.g.
(Temperature > 72.3) with values t, f
To pick c, sort the examples according to the attribute and identify
adjacent examples that differ in their target classification.
Generate candidate thresholds at the midpoints between them; for the
data below the candidates are (48 + 60)/2 = 54 and (80 + 90)/2 = 85.

Temperature: 40   48   60   72   80   90
PlayTennis:  No   No   Yes  Yes  Yes  No

The candidate thresholds can then be evaluated by computing the
information gain associated with each one.
The new discrete-valued attribute can then compete with the
other attributes.
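
A small Python sketch (my own illustration) of the threshold-generation
step just described: sort by the attribute, then propose a midpoint
wherever the target classification changes between adjacent examples.

    def candidate_thresholds(values, labels):
        """Midpoints between adjacent sorted values whose labels differ."""
        pairs = sorted(zip(values, labels))
        thresholds = []
        for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
            if l1 != l2 and v1 != v2:
                thresholds.append((v1 + v2) / 2)
        return thresholds

    temperature = [40, 48, 60, 72, 80, 90]
    play        = ["No", "No", "Yes", "Yes", "Yes", "No"]
    print(candidate_thresholds(temperature, play))   # [54.0, 85.0]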
Attributes with Many Values

Problem:
If attribute has many values, Gain will select it
Imagine using Date = Jun_3_1996 as attribute

One approach: use GainRatio instead

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|Si|/|S|) log2 (|Si|/|S|)

where Si is the subset of S for which A has value vi.
The SplitInformation term discourages the selection of
attributes with many uniformly distributed values.



Example for Gain Ratio

Day Outlook Temperature Humidity Wind PlayTennis


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No



Gain Ratio Example continued . . .

SplitInformation(S, A) ≡ − Σ_{i=1}^{c} (|Si|/|S|) log2 (|Si|/|S|)

SplitInformation(S, Outlook) = 2 × (−(5/14) log2(5/14)) + (−(4/14) log2(4/14)) = 1.577

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

GainRatio(S, Outlook) = 0.247 / 1.577 = 0.157
GainRatio(S, Temperature) = 0.029 / 1.362 = 0.021
GainRatio(S, Humidity) = 0.152 / 1 = 0.152
GainRatio(S, Wind) = 0.048 / 0.985 = 0.049

1 Gain ratio may choose an attribute just because its intrinsic
information (SplitInformation) is very low
2 First, only consider attributes with greater than average
information gain
3 Then, compare them on gain ratio
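
A short Python check of these numbers (my own sketch; the Outlook
column is taken from the PlayTennis table above):

    import math
    from collections import Counter

    def split_information(column):
        n = len(column)
        return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

    outlook = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
               "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"]

    si = split_information(outlook)
    print(round(si, 3))             # 1.577
    print(round(0.247 / si, 3))     # GainRatio(S, Outlook) ~ 0.157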
Attributes With Costs

Consider
  medical diagnosis: BloodTest has cost $150
  robotics: Width_from_1ft has cost 23 sec.

How can we learn a consistent tree with low expected cost?

One approach: replace the gain criterion, e.g.

Tan and Schlimmer:   Gain²(S, A) / Cost(A)

Nunez:   (2^Gain(S, A) − 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] determines the importance of cost
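
A tiny Python sketch (my own) of the two cost-sensitive scores above;
the gain/cost numbers in the example calls are hypothetical and would
come from the usual information-gain computation and a per-attribute
cost table.

    def tan_schlimmer_score(gain, cost):
        return gain ** 2 / cost

    def nunez_score(gain, cost, w=0.5):
        # w in [0, 1] controls how strongly cost is penalised
        return (2 ** gain - 1) / (cost + 1) ** w

    print(tan_schlimmer_score(0.151, 150))     # hypothetical gain/cost values
    print(nunez_score(0.151, 150, w=1.0))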



Unknown Attribute Values

What if some examples are missing values of A?

Use the training example anyway, and sort it through the tree.
If node n tests A, possible strategies are:
  assign it the most common value of A among the other examples
  sorted to node n,
  assign it the most common value of A among the other examples at
  node n with the same target value, or
  assign a probability pi to each possible value vi of A and
  distribute a fraction pi of the example to each descendant in the tree.



Gini Index
If a data set T contains examples from m classes, the Gini index
gini(T) is defined as

gini(T) = 1 − Σ_{i=1}^{m} pi²

m: the number of classes

pi: the relative frequency of class i in T.
The Gini index considers a binary split for each attribute A, say
into D1 and D2. The Gini index of D given that partitioning is

GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)

i.e. a weighted sum of the impurity of each partition.

The reduction in impurity is given by
∆Gini(A) = gini(D) − GiniA(D)
The attribute that maximises the reduction in impurity is chosen
as the splitting attribute.
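
A minimal Python sketch (my own) of these two formulas: the Gini index
of a set of class labels, and the weighted Gini of a binary partition.

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(d1_labels, d2_labels):
        n = len(d1_labels) + len(d2_labels)
        return (len(d1_labels) / n) * gini(d1_labels) \
             + (len(d2_labels) / n) * gini(d2_labels)

    print(gini(["C0", "C0", "C1", "C1"]))           # 0.5: maximally impure
    print(gini_split(["C0", "C0"], ["C1", "C1"]))   # 0.0: a pure binary split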
Consider the training examples shown in the table below for a binary
classification problem.

CustomerID   Gender   CarType   ShirtSize    Class
1            M        Family    Small        C0
2            M        Sports    Medium       C0
3            M        Sports    Medium       C0
4            M        Sports    Large        C0
5            M        Sports    ExtraLarge   C0
6            M        Sports    ExtraLarge   C0
7            F        Sports    Small        C0
8            F        Sports    Small        C0
9            F        Sports    Medium       C0
10           F        Luxury    Large        C0
11           M        Family    Large        C1
12           M        Family    ExtraLarge   C1
13           M        Family    Medium       C1
14           M        Luxury    ExtraLarge   C1
15           F        Luxury    Small        C1
16           F        Luxury    Small        C1
17           F        Luxury    Medium       C1
18           F        Luxury    Medium       C1
19           F        Luxury    Medium       C1
20           F        Luxury    Large        C1

Table: A Sample Data set

1 Compute the Gini Index for the overall collection of training examples.
2 Compute the Gini Index for the Customer ID attribute.
3 Compute the Gini index for Gender, Car Type and ShirtSize and show which attribute is better.



Another Question

A B Class Label
T F +
T T +
T T +
T F -
T T +
F F -
F F -
F F -
T T -
T F

Calculate the gain in the Gini Index when splitting on A and B.


Which attribute would the decision tree induction algorithm choose?



Class Histogram

Let the training data set be T with class labels {C1, C2, . . . , Ck}.

If T is partitioned based on the value of a non-class attribute X
into sets T1, . . . , Tn, then the class histogram for the partition is a
table of k columns and n rows. The (i, j)-th entry indicates the
number of records of the data set that lie in partition Ti and class Cj.
For a numerical attribute A, suppose we want to compute the
splitting index for the possible split A ≤ v. Then
the class histogram is a table of two rows and k columns, where k
is the number of classes.
The first row gives the class frequency distribution of the set of
records satisfying A ≤ v.
The second row gives the class frequency distribution of
those records which do not satisfy the condition.



Class Histogram

Table: Training Data Set

     Age   Salary   Class
1    30    65       G
2    23    15       B
3    40    75       G
4    55    40       B
5    55    100      G
6    45    60       G

Table: Class Histogram for Salary with splitting criterion Salary ≤ 70

Salary ≤ 70   B   G
left          2   2
right         0   2



Class Histogram continued . . .

For the following generic histogram the Gini index can be given as
follows:

      C1   C2
L     a1   a2
R     b1   b2

gini = ((a1 + a2)/n) [1 − (a1/(a1 + a2))² − (a2/(a1 + a2))²]
     + ((b1 + b2)/n) [1 − (b1/(b1 + b2))² − (b2/(b1 + b2))²]

where n = a1 + a2 + b1 + b2.

This concept can be generalised to cases where the split is non-binary;
in such cases the number of rows is equal to the number of partitions.
Hence one can also define a class histogram for a categorical attribute.



Binary Split: Continuous-Valued Attributes

D: a data partition
Consider attribute A with continuous values
To determine the best binary split on A
What to examine?
Examine each possible split point
The midpoint between each pair of (sorted) adjacent values is
taken as a possible split-point
How to examine?
For each split-point, compute the weighted sum of the impurity of
each of the two resulting partitions (D1: A ≤ split-point,
D2: A > split-point):

GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)

The split-point that gives the minimum Gini index for attribute
A is selected as its split-point.
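
A small Python sketch (my own) of this search: it tries the midpoint
between each pair of adjacent sorted values and returns the split-point
with the lowest weighted Gini (using the Salary data from the
class-histogram example above).

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_split_point(values, labels):
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = (None, float("inf"))                 # (split-point, weighted Gini)
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue                            # no midpoint between equal values
            point = (pairs[i - 1][0] + pairs[i][0]) / 2
            left  = [lab for v, lab in pairs if v <= point]
            right = [lab for v, lab in pairs if v > point]
            score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
            if score < best[1]:
                best = (point, score)
        return best

    salary = [65, 15, 75, 40, 100, 60]
    klass  = ["G", "B", "G", "B", "G", "G"]
    print(best_split_point(salary, klass))   # (50.0, 0.0): Salary <= 50 separates B from G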



Binary Split for Categorical Attributes

Since we cannot impose any ordering on the values of a categorical
attribute, there is no value n such that it splits the
attribute into two.
If S(A) is the set of possible values of the categorical attribute A,
then the split test is of the form A ∈ S′ where S′ ⊆ S(A).
For an attribute with n values there are 2^n possible subsets.
If n is small, the split index value is computed for all possible
combinations.
If n is large, some heuristic is used to find the best split.
The construction of an attribute list is similar to that for
numerical attributes.
Instead of a class histogram, a Count Matrix is maintained
for the categorical attributes.



Count Matrix
The count matrix has n rows (for n distinct values of the
attributes) and k columns(for k classes). Each entry, say (i,
j)-entry, represents the number of records in the data set having
ith value of the attribute and in the jth class.
The count matrix is independent of any particular split, whereas the
class histogram is not: different splitting criteria result in
different class histograms.

Table: Attribute List for a Categorical Attribute

Value    Class   record-id
family   high    1
sports   high    2
sports   high    3
family   low     4
truck    low     5
family   high    6

Table: Count Matrix

         H   L
family   2   1
sports   2   0
truck    0   1



A Question

Table: Attribute List for a Categorical Attribute

Value    Class   record-id
family   high    1
sports   high    2
sports   high    3
family   low     4
truck    low     5
family   high    6

Table: Count Matrix

         H   L
family   2   1
sports   2   0
truck    0   1

1 If we select S′ = {family, truck} as the possible splitting subset,
calculate the Gini index.

2 Determine the best splitting criterion.



Worked out example

S′ = {family, truck}

giniS′(T) = (4/6)[1 − (2/4)² − (2/4)²] + (2/6)[1 − (2/2)² − (0/2)²]
          = (4/6)[1 − 4/16 − 4/16]
          = (4/6)[1 − 8/16]
          = (4/6)(1/2)
          = 4/12 = 1/3
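
A quick Python check of this computation (my own sketch, using the
(value, class) pairs from the count-matrix attribute list above):

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_subset_split(rows, subset):
        """Weighted Gini of the binary split  value ∈ subset  vs. the rest."""
        inside  = [cls for val, cls in rows if val in subset]
        outside = [cls for val, cls in rows if val not in subset]
        n = len(rows)
        return (len(inside) / n) * gini(inside) + (len(outside) / n) * gini(outside)

    rows = [("family", "high"), ("sports", "high"), ("sports", "high"),
            ("family", "low"), ("truck", "low"), ("family", "high")]
    print(gini_subset_split(rows, {"family", "truck"}))   # 0.333... = 1/3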



Binary Split: Discrete-Valued Attributes

D: a data partition
Consider attribute A with v outcomes {a1 , . . . av }
To determine the best binary split on A
What to examine?
Examine the partitions resulting from all possible subsets of
{a1 , . . . av }
Each subset SA defines a binary test of attribute A of the form
A ∈ SA?
There are 2^v possible subsets. Excluding the full set and the empty
set, we have 2^v − 2 candidate subsets.
How to examine?
For each subset, compute the weighted sum of the impurity of
each of the two resulting partitions:

GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)

The subset that gives the minimum Gini index for attribute A is
selected as its splitting subset



Gini(Income)

Rid Age Income student credit-rating class:buy-computer


1 youth high no fair no
2 youth high no excellent no
3 middle-aged high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle-aged low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle-aged medium no excellent yes
13 middle-aged high yes fair yes
14 senior medium no excellent no

Compute the Gini index of the training set D:

Gini(D) = 1 − ((9/14)² + (5/14)²) = 0.459

Using attribute income there are three values: low, medium and high.
Choosing the subset {low, medium} results in two partitions:
D1 (income ∈ {low, medium}): 10 tuples
D2 (income ∈ {high}): 4 tuples



Gini Income

Gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)

  = (10/14)(1 − (7/10)² − (3/10)²) + (4/14)(1 − (2/4)² − (2/4)²)

  = 0.443

  = Gini_{income ∈ {high}}(D)

The Gini index measures for the remaining partitions are:

Gini for {low, high} and {medium}: 0.458
Gini for {medium, high} and {low}: 0.450
Therefore, the best binary split for attribute income is on
{low, medium} and {high}
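
Because these fractions are easy to get wrong, here is a short Python
check (my own) against the buys-computer table above:

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    income = ["high", "high", "high", "medium", "low", "low", "low",
              "medium", "low", "medium", "medium", "medium", "high", "medium"]
    buys   = ["no", "no", "yes", "yes", "yes", "no", "yes",
              "no", "yes", "yes", "yes", "yes", "yes", "no"]

    def gini_split(subset):
        d1 = [b for inc, b in zip(income, buys) if inc in subset]
        d2 = [b for inc, b in zip(income, buys) if inc not in subset]
        n = len(buys)
        return (len(d1) / n) * gini(d1) + (len(d2) / n) * gini(d2)

    for s in [{"low", "medium"}, {"low", "high"}, {"medium", "high"}]:
        print(sorted(s), round(gini_split(s), 3))
    # ['low', 'medium'] 0.443, ['high', 'low'] 0.458, ['high', 'medium'] 0.45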



Comparing attribute selection measures

The three measures, in general, return good results but


Information Gain
Biased towards multivalued attributes
Gain Ratio
Tends to prefer unbalanced splits in which one partition is much
smaller than the other
Gini Index
Biased towards multivalued attributes
Has difficulties when the number of classes is large
Tends to favor tests that result in equal-sized partitions and
purity in both partitions

