
Decision Tree:

Can work on both classification and regression problems.

Supervised Learning algorithm

ID3 / C4.5 -> Information Gain (entropy-based)

CART -> Gini Impurity.

Information Gain:

Entropy: E = -Σ p_i * log2(p_i)

For a Yes/No target: E = -p(Yes)*log2(p(Yes)) - p(No)*log2(p(No))  => log is of base 2


Information Gain = Entropy(total dataset) - Σ (weight of subset) * Entropy(subset), where each subset is created by splitting on the candidate feature and its weight is the subset size divided by the total number of rows.
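
Not from the notes: a minimal Python sketch of the two formulas above, assuming the class labels are passed as plain lists of values.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E = -Σ p_i * log2(p_i) over the classes present in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, subsets):
    """IG = E(parent) - Σ (subset size / parent size) * E(subset)."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted
```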

Gini Impurity: Gini = 1 - Σ p_i^2
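
A matching sketch for Gini impurity, using the same list-of-labels convention as the entropy sketch above.

```python
from collections import Counter

def gini(labels):
    """Gini = 1 - Σ p_i^2 over the classes present in `labels`."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())
```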


Worked example: the weather / Play Tennis dataset (14 rows; target Play = 9 Yes, 5 No).

E(S) = -9/14*log2(9/14) - 5/14*log2(5/14) = 0.940

E(Outlook=Sunny) = -2/5*log2(2/5) - 3/5*log2(3/5) = 0.971

E(Outlook=Overcast) = -4/4*log2(4/4) - 0*log2(0) = 0

E(Outlook=Rainy) = -3/5*log2(3/5) - 2/5*log2(2/5) = 0.971

Weighted entropy after the split: E(S|Outlook) = 5/14*E(Outlook=Sunny) + 4/14*E(Outlook=Overcast) + 5/14*E(Outlook=Rainy)

= 5/14*0.971 + 4/14*0 + 5/14*0.971 = 0.693

Information Gain if we split on the Outlook column: IG(Outlook) = E(S) - E(S|Outlook) = 0.94 - 0.693 = 0.247
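
The Outlook calculation can be reproduced with the entropy/information_gain sketch above; the class counts per Outlook value are read off the fractions in the calculation (Sunny 2 Yes / 3 No, Overcast 4 Yes, Rainy 3 Yes / 2 No).

```python
sunny    = ["Yes"] * 2 + ["No"] * 3    # E = 0.971
overcast = ["Yes"] * 4                 # E = 0
rainy    = ["Yes"] * 3 + ["No"] * 2    # E = 0.971
parent   = sunny + overcast + rainy    # 14 rows: 9 Yes, 5 No -> E(S) = 0.940

print(round(information_gain(parent, [sunny, overcast, rainy]), 3))   # 0.247
```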

Similarly, we calculate the Information Gain for splits on the Temp, Humidity and Windy columns.

We split on the column with the highest Information Gain. Within that split, the branch with the lowest Entropy is the purest; here Outlook == Overcast has Entropy 0, so its rows can be decided immediately (all Yes).
Entropy means impurity: 0 is a pure node, higher values mean more mixed classes.
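
A tiny sketch of that selection step. Only the Outlook value comes from the calculation above; the gains for the other columns are illustrative placeholders, since the notes do not compute them.

```python
gains = {"Outlook": 0.247, "Temp": 0.029, "Humidity": 0.152, "Windy": 0.048}
best_column = max(gains, key=gains.get)
print(best_column)   # Outlook -> the first split is on Outlook
```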

For a continuous variable we create buckets (bins), e.g. an Age column (a small binning sketch follows this list):

Small Age: <12

Teenage: 13-19

Jr Adult: 20-30

Middle Aged: 30-60

Sr. Citizen: 60-80

Super Sr. Citizen: >80
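
As mentioned above, a minimal binning sketch with pandas; the exact bin edges are an assumption (pd.cut bins are right-inclusive by default).

```python
import pandas as pd

ages = pd.Series([5, 16, 24, 45, 67, 90])
buckets = pd.cut(
    ages,
    bins=[0, 12, 19, 30, 60, 80, 120],
    labels=["Small Age", "Teenage", "Jr Adult", "Middle Aged",
            "Sr. Citizen", "Super Sr. Citizen"],
)
print(buckets.tolist())
# ['Small Age', 'Teenage', 'Jr Adult', 'Middle Aged', 'Sr. Citizen', 'Super Sr. Citizen']
```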

Gini for a pure node (e.g. 50 samples, all of one class): Gini = 1 - Σ p_i^2 = 1 - (50/50)^2 - (0/50)^2 - (0/50)^2 = 1 - 1 = 0
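
Checking the pure-node example with the gini() sketch above, plus a 50/50 split for contrast.

```python
pure_node  = ["Yes"] * 50
mixed_node = ["Yes"] * 25 + ["No"] * 25

print(gini(pure_node))    # 0.0 -> completely pure node
print(gini(mixed_node))   # 0.5 -> maximally mixed for two classes
```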

Overfitting: training accuracy is high but test accuracy is low.

 The model is very complex and unable to generalize.

Underfitting: both training and test accuracy are low.

 The model is too simple.
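
Not from the notes: a small scikit-learn sketch (synthetic data with some label noise) that illustrates both failure modes by varying max_depth.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (1, 3, None):   # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, round(tree.score(X_tr, y_tr), 2), round(tree.score(X_te, y_te), 2))

# Typically: the depth-1 stump scores low on both sets (underfitting), while the
# unlimited-depth tree scores ~1.0 on train but noticeably lower on test (overfitting).
```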
