Decision Tree
U.A. Nuli
Introduction:
Tree-based learning algorithms are considered to be among the best and most widely used
supervised learning methods.
Tree-based methods empower predictive models with high accuracy, stability and ease
of interpretation. Unlike linear models, they map non-linear relationships quite well,
and they are adaptable to a wide range of problems.
Because decision trees are so easy to interpret, they are among the most widely used
data-mining methods in business analysis, medical decision-making, and policymaking.
Often, a decision tree is created automatically, and an expert uses it to understand the
key factors and then refines it to better match her beliefs.
This process allows machines to assist experts and to clearly show the reasoning process
so that individuals can judge the quality of the prediction.
Decision trees have been used in this manner for such wide-ranging applications as
customer profiling, financial risk analysis, assisted diagnosis, and traffic prediction.
What is a decision tree?
➢ A decision tree is a type of supervised learning algorithm mostly used for classification
problems.
➢ It provides rules for classifying data using attributes.
➢ The tree consists of decision nodes and leaf nodes.
➢ A decision node has two or more branches, each representing a value of the
attribute tested.
➢ A leaf node represents a homogeneous result (all records in one class), which does
not require additional classification testing.
► Every leaf (or terminal) node represents a value of the target attribute.
► To make a decision, the flow starts at the root node, navigates through the
arcs/edges until it reaches a leaf node, and then makes the decision
based on the leaf node's value.
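To make that root-to-leaf flow concrete, here is a minimal Python sketch; the nested-dict layout and the classify helper are illustrative choices, and the tree itself is the weather example built later in these notes:

# Minimal sketch: a decision tree as nested dicts (attribute -> {value: subtree}).
# Leaves are plain class labels; internal nodes test one attribute.
tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def classify(node, record):
    """Walk from the root, following the branch that matches the record's
    attribute value, until a leaf (a plain label) is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))               # attribute tested at this node
        node = node[attribute][record[attribute]]  # follow the matching branch
    return node

print(classify(tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # -> Yes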
Decision Tree Classification Task
[Figure: a classification model is induced from a Training Set of labeled records
(columns Tid, Attrib1, Attrib2, Attrib3, Class) and then applied to a Test Set whose
Class labels are unknown, e.g. Tid 11 (No, Small, 55K, ?) and Tid 15 (No, Large, 67K, ?).]
Decision Tree Example:
To create a tree, we need a root node first, and we know that nodes correspond to the
features/attributes (Outlook, Temperature, Humidity and Wind).
Which attribute should be placed at the root?
Answer: determine the attribute that best classifies the training data and use this attribute at
the root of the tree. Repeat this process for each branch.
How do we decide which attribute classifies best?
Answer: in ID3, use the attribute with the highest information gain.
We want to determine which attribute in a given set of attributes is most
useful for discriminating between the classes to be learned.
Information gain tells us how important a given attribute is relative to the other attributes.
We will use it to decide the ordering of attributes in the nodes of a decision tree.
The information gain (or entropy reduction) is the reduction in 'uncertainty' when
choosing an attribute.
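For reference, the standard ID3 definitions used in the calculations below are:

Entropy(S) = - Σk p(k) log2 p(k) over the classes k in S
(for two classes: - p(Yes) log2 p(Yes) - p(No) log2 p(No))

Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) Entropy(Sv),
where Sv is the subset of S for which attribute A has value v.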
All logs are with respect to base 2.
For the Outlook attribute:
E(Sunny) = - (2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
E(Overcast) = - (4/4) log2(4/4) - (0/4) log2(0/4) = 0
Records with outlook = Rain:

day  outlook  temp  humidity  wind    play
D4   Rain     Mild  High      Weak    Yes
D5   Rain     Cool  Normal    Weak    Yes
D6   Rain     Cool  Normal    Strong  No
D10  Rain     Mild  Normal    Weak    Yes
D14  Rain     Mild  High      Strong  No

Total elements with outlook = Rain: 5 (Yes class = 3, No class = 2)

E(Rain) = - (3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
[Figure: partial tree with Outlook at the root. The Overcast branch ends in a Yes leaf
(entropy 0); the Sunny and Rain branches each have entropy 0.971.]
The Sunny and Rain branches need further splitting because their entropy is greater than 0.
The Overcast branch will not be split because its entropy is 0.
The tree grows in this way until every branch reaches a leaf node with entropy 0.
Take the Sunny node as the parent node and the records with outlook = Sunny as the
parent data set (E(Sunny) = 0.971). Select the attribute with the highest gain as the
attribute for the child node of Outlook on the Sunny branch.
IG(Sunny, Temperature) = 0.571
IG(Sunny, Humidity) = 0.971
IG(Sunny, Wind) = 0.020
The highest gain is achieved by the Humidity attribute, hence the Sunny node
should be split on the basis of Humidity.
[Figure: tree after splitting the Sunny branch. Outlook at the root; the Sunny branch
splits on Humidity (High → No, Normal → Yes); the Overcast branch ends in a Yes leaf;
the Rain branch is not yet split.]
Next, take the Rain node as the parent node and the records with outlook = Rain
(tabulated earlier, E(Rain) = 0.971) as the parent data set.
IG(Rain, Temperature) = 0.020
IG(Rain, Humidity) = 0.020
IG(Rain, Wind) = 0.971
The highest gain is achieved by the Wind attribute, hence the Rain branch
should be split on the basis of the Wind attribute.
[Figure: final tree. Outlook at the root; Sunny → Humidity (High → No, Normal → Yes);
Overcast → Yes; Rain → Wind (Weak → Yes, Strong → No).]
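The entropy and gain numbers above can be checked with a short Python sketch; the dataset literal below is the standard 14-record play-tennis table assumed by these slides:

import math

# (outlook, temp, humidity, wind, play) for days D1..D14
data = [
    ("Sunny","Hot","High","Weak","No"),        ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),    ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),     ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),    ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),  ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),  ("Rain","Mild","High","Strong","No"),
]
ATTRS = {"outlook": 0, "temp": 1, "humidity": 2, "wind": 3}  # column indices

def entropy(rows):
    """- sum_k p_k log2 p_k over the class labels (last column)."""
    n = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr):
    """Entropy reduction from splitting rows on attr."""
    i = ATTRS[attr]
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r)
    remainder = sum(len(p) / len(rows) * entropy(p) for p in parts.values())
    return entropy(rows) - remainder

sunny = [r for r in data if r[0] == "Sunny"]
print(round(entropy(sunny), 3))                # 0.971
print(round(info_gain(sunny, "humidity"), 3))  # 0.971 -> split Sunny on Humidity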
The CART algorithm makes use of the Gini index to select the appropriate attribute for a node.
The Gini index (used in CART) measures the impurity of a data partition D:

Gini(D) = 1 - Σk pk², where pk is the probability of class k in D.

For example:
D = {Y,Y,Y,Y,N,N,N}: Gini(D) = 1 - (4/7)² - (3/7)² = 0.4898 (impure partition)
D = {Y,Y,Y,Y,Y,Y,Y}: Gini(D) = 1 - (7/7)² = 0 (pure partition)
We have three values, Sports, Family and Truck, for the CarType attribute, which splits
into three branches. We can also split into two branches by grouping certain values.
For each possible grouping we have to calculate the Gini value and pick the grouping
with the lowest Gini index, as worked out below.
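Taking CarType1 as the remaining binary grouping, {Sports, Family} vs {Truck} (its composition is an inference from the summary table below, where CarType1 = 0.2667):

Car Type1
Risk   Sports,Family   Truck
High   4               0
Low    1               1

Gini(Sp,Fa) = 1 - (4/5)² - (1/5)² = 0.32
Gini(Truck) = 1 - (0/1)² - (1/1)² = 0
Gini(CarType1) = (5/6)*0.32 + (1/6)*0 = 0.2667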
Car Type2
Risk   Sports   Family,Truck
High   2        2
Low    0        2

Gini(Sports) = 1 - (2/2)² - (0/2)² = 0
Gini(Fa,Tr) = 1 - (2/4)² - (2/4)² = 1 - 0.25 - 0.25 = 0.5
Gini(CarType2) = (2/6)*0 + (4/6)*0.5 = 0.3333
Car Type3
Risk   Sports,Truck   Family
High   2              2
Low    1              1

Gini(Sp,Tr) = 1 - (2/3)² - (1/3)² = 1 - 0.4444 - 0.1111 = 0.4445
Gini(Family) = 1 - (2/3)² - (1/3)² = 0.4445
Gini(CarType3) = (3/6)*0.4445 + (3/6)*0.4445 = 0.4445
Split      Gini
CarType1   0.2667
CarType2   0.3333
CarType3   0.4445

Hence, the best split is CarType1 because it has the lowest Gini index.
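A small Python sketch of this grouping search; the counts below are the CarType/Risk figures from the tables above:

# Risk counts per CarType value: (high, low), taken from the tables above.
counts = {"Sports": (2, 0), "Family": (2, 1), "Truck": (0, 1)}

def gini(high, low):
    """Gini impurity of a partition with the given class counts."""
    n = high + low
    return 1 - (high / n) ** 2 - (low / n) ** 2 if n else 0.0

def split_gini(group_a, group_b):
    """Weighted Gini of a binary split into two groups of CarType values."""
    parts = []
    for group in (group_a, group_b):
        h = sum(counts[v][0] for v in group)
        l = sum(counts[v][1] for v in group)
        parts.append((h + l, gini(h, l)))
    total = sum(n for n, _ in parts)
    return sum(n / total * g for n, g in parts)

print(round(split_gini({"Sports", "Family"}, {"Truck"}), 4))   # CarType1: 0.2667
print(round(split_gini({"Sports"}, {"Family", "Truck"}), 4))   # CarType2: 0.3333
print(round(split_gini({"Sports", "Truck"}, {"Family"}), 4))   # CarType3: 0.4444 (slides round to 0.4445)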
Yes/No counts for each feature value over the full play-tennis dataset:

Outlook       Yes   No   Instances
Sunny         2     3    5
Overcast      4     0    4
Rain          3     2    5

Temperature   Yes   No   Instances
Hot           2     2    4
Cool          3     1    4
Mild          4     2    6

Humidity      Yes   No   Instances
High          3     4    7
Normal        6     1    7

Wind          Yes   No   Instances
Weak          6     2    8
Strong        3     3    6

We've calculated the weighted Gini index value for each feature:

Feature       Gini
Outlook       0.342
Temperature   0.439
Humidity      0.367
Wind          0.428

The winner is the Outlook feature because its Gini index is the lowest, so the Outlook
attribute will be at the root of the tree.
Yes/No counts for each feature value when outlook = Sunny:

Temperature   Yes   No   Instances
Hot           0     2    2
Cool          1     0    1
Mild          1     1    2

Humidity      Yes   No   Instances
High          0     3    3
Normal        2     0    2

Wind          Yes   No   Instances
Weak          1     2    3
Strong        1     1    2

We've calculated the Gini index scores for each feature when outlook is Sunny:

Feature       Gini
Temperature   0.2
Humidity      0
Wind          0.466

The winner is Humidity because it has the lowest value.
Yes/No counts for Temperature when outlook = Rain:

Temperature   Yes   No   Instances
Cool          1     1    2
Mild          2     1    3

The winner for the Rain branch is the Wind feature because it has the minimum Gini index
score among the features (its weighted Gini is 0: the Weak records are all Yes and the
Strong records are all No).
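These per-feature Gini values can be reproduced directly from the count tables; a minimal Python sketch (the dictionaries transcribe the tables above, and the prints agree with the slide values up to rounding):

# Yes/No counts per feature value over the full dataset (tables above).
features = {
    "Outlook":     {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)},
    "Temperature": {"Hot": (2, 2), "Cool": (3, 1), "Mild": (4, 2)},
    "Humidity":    {"High": (3, 4), "Normal": (6, 1)},
    "Wind":        {"Weak": (6, 2), "Strong": (3, 3)},
}

def weighted_gini(value_counts):
    """Weighted Gini of a split: sum over values of (size/total) * Gini(value subset)."""
    total = sum(yes + no for yes, no in value_counts.values())
    score = 0.0
    for yes, no in value_counts.values():
        size = yes + no
        score += (size / total) * (1 - (yes / size) ** 2 - (no / size) ** 2)
    return score

for name, counts in features.items():
    print(name, round(weighted_gini(counts), 3))
# Outlook 0.343, Temperature 0.44, Humidity 0.367, Wind 0.429
# (matches the slide values 0.342 / 0.439 / 0.367 / 0.428 up to rounding)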
The procedure used in the CART and ID3 algorithms is the same; the only difference is the
metric used to calculate impurity at every stage.
Although it is possible to create a multiway tree using CART, it is most often used to
build binary trees.
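In a library such as scikit-learn this difference is literally one parameter; a minimal sketch (the toy arrays are an illustrative encoding, not data from these slides):

from sklearn.tree import DecisionTreeClassifier

# Same fitting procedure, different impurity metric:
cart_style = DecisionTreeClassifier(criterion="gini")     # Gini impurity, as in CART
id3_style = DecisionTreeClassifier(criterion="entropy")   # entropy / information gain, as in ID3

X = [[0, 0], [1, 1], [0, 1], [1, 0]]  # toy encoded features (illustrative)
y = [0, 1, 1, 0]
cart_style.fit(X, y)
id3_style.fit(X, y)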
Advantages of Decision Tree
1. Easy to understand: Decision tree output is very easy to understand, even for people from a non-
analytical background. It does not require any statistical knowledge to read and interpret, and
its graphical representation is very intuitive, so users can easily relate it to their hypotheses.
2. Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant
variables and the relations between two or more variables. With the help of decision trees, we can
create new variables/features that have better power to predict the target variable (you can refer
to the article "Trick to enhance power of regression model" for one such trick). It can also be used
in the data exploration stage: for example, when we are working on a problem with information
available in hundreds of variables, a decision tree will help to identify the most significant ones.
3. Less data cleaning required: It requires less data cleaning compared to some other modeling
techniques, and it is not influenced by outliers and missing values to a fair degree.
4. Data type is not a constraint: It can handle both numerical and categorical variables.
5. Non-parametric method: The decision tree is considered a non-parametric method. This
means that decision trees make no assumptions about the space distribution or the classifier
structure.
Disadvantages of Decision Tree
1. Overfitting: Overfitting is one of the most practical difficulties for
decision tree models. This problem can be addressed by setting constraints on
model parameters and by pruning, as sketched below.
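For instance, in scikit-learn these constraints are constructor parameters; a minimal sketch (the parameter values are illustrative, not tuned):

from sklearn.tree import DecisionTreeClassifier

# Constrain tree growth to reduce overfitting:
constrained = DecisionTreeClassifier(
    max_depth=3,           # cap the depth of the tree
    min_samples_leaf=5,    # require at least 5 records per leaf
    ccp_alpha=0.01,        # cost-complexity pruning strength
)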
2. Not fit for continuous variables: While working with continuous
numerical variables, the decision tree loses information when it categorizes
variables into different bins.