
Information Gain

Which test is more informative?


[Figure: two candidate splits of the same applicant data]
Split over whether Balance exceeds 50K: branches "Less or equal 50K" / "Over 50K"
Split over whether applicant is employed: branches "Unemployed" / "Employed"
Information Gain
Impurity/Entropy (informal)
– Measures the level of impurity in a group of examples
Impurity

[Figure: three groups of examples, from a very impure group, to a less impure group, to a group with minimum impurity]
Entropy: a common way to measure impurity

• Entropy = Σi −pi log2(pi)

  pi is the probability of class i


Compute it as the proportion of class i in the set.

16/30 are green circles; 14/30 are pink crosses


log2(16/30) = −.9; log2(14/30) = −1.1
Entropy = −(16/30)(−.9) − (14/30)(−1.1) = .99
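A minimal Python sketch of this calculation (the entropy helper and its name are mine, not from the slides):

```python
import math

def entropy(counts):
    """Entropy of a group, given the number of examples in each class."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# 16 green circles and 14 pink crosses out of 30 examples
print(round(math.log2(16/30), 1), round(math.log2(14/30), 1))  # -0.9 -1.1
print(round(entropy([16, 14]), 3))  # 0.997 (the slide's .99 uses the rounded logs above)
```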

• Entropy comes from information theory. The higher the entropy, the more the information content.

  What does that mean for learning from examples?
2-Class Cases:

• What is the entropy of a group in which all examples belong to the same class? (minimum impurity)
  – entropy = −1 · log2(1) = 0
  – not a good training set for learning

• What is the entropy of a group with 50% in either class? (maximum impurity)
  – entropy = −0.5 · log2(0.5) − 0.5 · log2(0.5) = 1
  – a good training set for learning
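As a quick check, the entropy sketch above reproduces both extremes:

```python
print(entropy([10, 0]))   # all examples in one class -> 0.0 (minimum impurity)
print(entropy([5, 5]))    # a 50/50 group             -> 1.0 (maximum impurity)
```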


Information Gain
• We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.

• Information gain tells us how important a given attribute of the feature vectors is.

• We will use it to decide the ordering of attributes in the nodes of a decision tree.
Calculating Information Gain
Information Gain = entropy(parent) – [average entropy(children)]
13 1 3  4 4
child − ⋅ lo g2  −  ⋅ lo g2  = 0.7 8 7
entropy  1 7 1 7  1 7 1 7

Entire population (30 instances)


17 instances

child 1 1  1 2 1 2
entropy − 1 3⋅ lo g2 1 3 −  1 3 ⋅ lo g2 1 3 = 0.3 9 1
   

parent − 1 4 ⋅ lo g2 1 4  −  1 6 ⋅ lo g2 1 6  = 0.9 9 6


entropy  3 0 3 0  3 0 3 0 13 instances

 17   13 
(Weighted) Average Entropy of Children =  ⋅ 0 . 787  +  ⋅ 0 . 391 = 0.615
 30   30 
Information Gain= 0.996 - 0.615 = 0.38 for this split 7
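A numeric check of this slide as a sketch, reusing the entropy helper from the earlier code; the information_gain helper and its signature are my own:

```python
def information_gain(parent_counts, children_counts):
    """entropy(parent) minus the size-weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# the split above: 30 instances -> one child with 13/4, another with 1/12
print(round(entropy([13, 4]), 3), round(entropy([1, 12]), 3))    # 0.787 0.391
print(round(information_gain([14, 16], [[13, 4], [1, 12]]), 2))  # 0.38
```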
Entropy-Based Automatic
Decision Tree Construction

Training Set S:
x1 = (f11, f12, …, f1m)
x2 = (f21, f22, …, f2m)
  .
  .
xn = (fn1, fn2, …, fnm)

Node 1 (the root): What feature should be used? What values?

Quinlan suggested information gain in his ID3 system and later the gain ratio, both based on entropy.
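The gain ratio is only mentioned here, not defined; a minimal sketch of the usual definition (information gain divided by the entropy of the split sizes, which penalizes many-valued attributes), reusing the helpers sketched above:

```python
def gain_ratio(parent_counts, children_counts):
    """Information gain normalized by the 'split information',
    i.e. the entropy of the partition sizes themselves."""
    split_info = entropy([sum(child) for child in children_counts])
    gain = information_gain(parent_counts, children_counts)
    return gain / split_info if split_info > 0 else 0.0

print(round(gain_ratio([14, 16], [[13, 4], [1, 12]]), 3))   # 0.386
```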

Using Information Gain to Construct a Decision Tree

1. Choose the attribute A with the highest information gain for the full training set S at the root of the tree.

2. Construct child nodes for each value v1, v2, …, vk of A. Each child has an associated subset of vectors in which A has that particular value, e.g. S′ = {s ∈ S | value(A) = v1}.

3. Repeat recursively. Till when? (See the sketch below.)
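A minimal recursive sketch of this procedure, reusing the entropy helper from earlier; attribute_gain and build_tree are my own names, and a real ID3 implementation would also handle ties, unseen attribute values, and extra stopping criteria:

```python
from collections import Counter

def attribute_gain(rows, labels, attr):
    """Information gain of splitting these examples on one attribute."""
    parent = entropy(Counter(labels).values())
    n = len(labels)
    weighted = 0.0
    for value in set(row[attr] for row in rows):
        branch = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        weighted += len(branch) / n * entropy(Counter(branch).values())
    return parent - weighted

def build_tree(rows, labels, attributes):
    """Pick the highest-gain attribute at this node, split the examples by its
    values, and recurse until a node is pure (or no attributes remain, in which
    case return the majority class)."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure node -> leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # majority-class leaf
    best = max(attributes, key=lambda a: attribute_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    subtree = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        subtree[value] = build_tree([rows[i] for i in idx],
                                    [labels[i] for i in idx], remaining)
    return {best: subtree}
```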

Simple Example
Training Set: 3 features and 2 classes

X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II

How would you distinguish class I from class II?
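For checking the splits on the following slides, the same training set as Python data (the rows and labels names are mine), usable with the sketches above:

```python
rows = [{"X": 1, "Y": 1, "Z": 1},
        {"X": 1, "Y": 1, "Z": 0},
        {"X": 0, "Y": 0, "Z": 1},
        {"X": 1, "Y": 0, "Z": 0}]
labels = ["I", "I", "II", "II"]
```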

X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II

Split on attribute X
Parent node: {I, I, II, II}

X=1 branch: {I, I, II}    Echild1 = −(1/3)log2(1/3) − (2/3)log2(2/3) = .5284 + .39 = .9184
                          (If X is the best attribute, this node would be further split.)
X=0 branch: {II}          Echild2 = 0

Eparent = 1
GAIN = 1 − (3/4)(.9184) − (1/4)(0) = .3112
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II

Split on attribute Y

Parent node: {I, I, II, II}

Y=1 branch: {I, I}      Echild1 = 0
Y=0 branch: {II, II}    Echild2 = 0

Eparent = 1
GAIN = 1 − (1/2)(0) − (1/2)(0) = 1; BEST ONE
X Y Z C
1 1 1 I
1 1 0 I
0 0 1 II
1 0 0 II

Split on attribute Z

Parent node: {I, I, II, II}

Z=1 branch: {I, II}    Echild1 = 1
Z=0 branch: {I, II}    Echild2 = 1

Eparent = 1
GAIN = 1 − (1/2)(1) − (1/2)(1) = 0, i.e. NO GAIN; WORST
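Running the sketches from earlier on this training set confirms the three gains and the choice of Y:

```python
for attr in ("X", "Y", "Z"):
    print(attr, round(attribute_gain(rows, labels, attr), 4))
# X 0.3113 (the slide's .3112 uses the rounded .9184), Y 1.0, Z 0.0

print(build_tree(rows, labels, ["X", "Y", "Z"]))   # {'Y': {0: 'II', 1: 'I'}}
```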
