
DECISION TREES

A decision tree represents a function that takes a vector of attribute values as inputs and returns a
‘decision’ – a single output value.

The input and output values can be discrete or continuous.

A decision tree reaches its decision by performing a sequence of tests.

Example: consider the decision of whether to go out to play, depending on the weather conditions. The corresponding decision tree is given below:

Several algorithms are used to build decision trees (e.g. ID3, C4.5, and CART).

Which attribute should be taken as the root node? This question is answered using two measures: 1) entropy and 2) the gini index.

Both entropy and the gini index measure the level of impurity in the dataset.

ENTROPY

Entropy is a measure of the randomness of the data. The more random the data, the more impure it is. When entropy is low, the data is more homogeneous, and the 'information gain' from each split is higher.

Entropy is useful for deciding which attribute should be selected as the root node of the decision tree.

The formula for calculating entropy is:

E(S) = -Σ P(i) log2 P(i)   (summed over all classes i)

where P(i) is the proportion of instances in the dataset that belong to class i.

Example: calculate the entropy. Which attribute should be selected as the root node for the decision to go out to play?

Suppose we take outlook as the root node; then the following decision tree arises.

First, calculate the total entropy E(S) for the dataset.

There are 14 instances (rows): 9 Yes and 5 No.

The formula for entropy:

E(S) = -P(Yes) log2 P(Yes) - P(No) log2 P(No)

E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14)


E(S) = 0.41 + 0.53 = 0.94
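The calculation above can be sketched in Python (the function name `entropy` is just illustrative):

```python
import math

def entropy(counts):
    """Entropy of a class distribution, given the count of each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)  # convention: 0 * log2(0) = 0

# 9 Yes and 5 No out of 14 instances
print(round(entropy([9, 5]), 2))  # 0.94
```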

Next, calculate the entropy for outlook.


Outlook has 3 different values: Sunny, Overcast, Rainy.

In case of Sunny, no. of ‘Yes’ = 2, no. of ‘No’ = 3

In case of Overcast, no. of ‘Yes’ = 4, no. of ‘No’ = 0

In case of Rainy, no. of ‘Yes’ = 3, no. of ‘No’ = 2

So, Entropy(outlook=Sunny) = -2/5 log2 2/5 - 3/5 log2 3/5 = 0.971

Entropy(outlook=Overcast) = -4/4 log2 4/4 - 0/4 log2 0/4 = 0  (using the convention 0 log2 0 = 0)

Entropy(outlook=Rainy) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971

Information from outlook (the weighted average entropy of its branches):

I(outlook) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693

Information gained from outlook:

Gain(outlook) = E(S) – I(outlook) = 0.94 – 0.693 = 0.247
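The gain computation can be sketched as follows (the helper names `entropy` and `info_gain` are illustrative, not from a library):

```python
import math

def entropy(counts):
    """Entropy of a class distribution, given the count of each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

def info_gain(total_counts, splits):
    """Information gain = E(S) minus the weighted entropy of the splits."""
    n = sum(total_counts)
    weighted = sum(sum(s) / n * entropy(s) for s in splits)
    return entropy(total_counts) - weighted

# Outlook branches: Sunny (2 Yes, 3 No), Overcast (4, 0), Rainy (3, 2)
gain = info_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
print(round(gain, 3))  # 0.247
```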

The attribute with the highest gain is selected as the root node. Repeat the same calculation for the other attributes (Temperature, Humidity, and Windy).

The following are the results:

Gain(Outlook) = 0.247
Gain(Temperature) = 0.029
Gain(Humidity) = 0.152
Gain(Windy) = 0.048

Since outlook has the highest information gain, it is selected as the root node.

GINI INDEX

The formula for the gini index is:

Gini = 1 - Σ (P(i))²   (summed over all classes i)

Let us calculate the gini index for outlook. Note down how many 'Yes' and 'No' there are for each value:

Sunny: 2 Yes, 3 No

Overcast: 4 Yes, 0 No

Rainy: 3 Yes, 2 No

Gini(outlook=Sunny) = 1 - (2/5)² - (3/5)² = 0.48

Gini(outlook=Overcast) = 1 - (4/4)² - (0/4)² = 0

Gini(outlook=Rainy) = 1 - (3/5)² - (2/5)² = 0.48

Therefore, the (weighted) gini index for outlook is:

5/14 × 0.48 + 4/14 × 0 + 5/14 × 0.48 = 0.3429
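The weighted gini computation can be sketched as (the function name `gini` is illustrative):

```python
def gini(counts):
    """Gini impurity of a class distribution, given the count of each class."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# Outlook branches: Sunny (2 Yes, 3 No), Overcast (4, 0), Rainy (3, 2)
splits = [[2, 3], [4, 0], [3, 2]]
n = sum(sum(s) for s in splits)
weighted = sum(sum(s) / n * gini(s) for s in splits)
print(round(weighted, 4))  # 0.3429
```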

In the same manner, calculate the gini index for the other attributes as well.

Since the gini index of outlook is the lowest, it produces the least impurity. Hence we select outlook as our root node.
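In practice, this selection is done automatically by a library. A minimal sketch using scikit-learn (assumed available), whose `criterion` parameter switches between the two impurity measures above; the outlook column is encoded as 0=Sunny, 1=Overcast, 2=Rainy, with labels 0=No, 1=Yes, matching the counts in this example:

```python
from sklearn.tree import DecisionTreeClassifier

# Single-feature toy data: outlook (0=Sunny, 1=Overcast, 2=Rainy)
X = [[0], [0], [1], [2], [2], [2], [1],
     [0], [0], [2], [0], [1], [1], [2]]
y = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0]  # 0=No, 1=Yes

clf = DecisionTreeClassifier(criterion="entropy")  # or criterion="gini"
clf.fit(X, y)
print(clf.predict([[1]]))  # Overcast is pure Yes, so the tree predicts 1
```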
