Decision Tree Learning
Sheik Benazir Ahmed
The dataset is split on the picked column (Outlook):

[Figure: the weather ("play tennis") dataset split on Outlook. The Sunny branch holds two 'Yes' and three 'No' records, the Rainy branch three 'Yes' and two 'No', and the Overcast branch only 'Yes' records; each branch keeps its Humidity, Windy and Play columns.]
Information gain for the split on Outlook:
  Sunny entropy = 0.971, Overcast entropy = 0.000, Rainy entropy = 0.971
  Weighted entropy = (5/14)(0.971) + (4/14)(0.000) + (5/14)(0.971) ≈ 0.693
  Information gain = 0.940 − 0.693 ≈ 0.247

Information gain for the split on Humidity:
  Normal entropy = 0.592, High entropy = 0.985
  Weighted entropy = (7/14)(0.592) + (7/14)(0.985) ≈ 0.789
  Information gain = 0.940 − 0.789 ≈ 0.152

Information gain for the split on Windy:
  True entropy = 1.000, False entropy = 0.811
  Weighted entropy = (6/14)(1.000) + (8/14)(0.811) ≈ 0.892
  Information gain = 0.940 − 0.892 ≈ 0.048

Outlook gives the largest information gain, so it is the picked column for the first split.
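A minimal Python sketch (mine, not from the slides) that reproduces these numbers; the class counts per branch come from the 14-record weather dataset used throughout the deck:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a node given its class counts, e.g. [yes, no]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in child_counts)
    return entropy(parent_counts) - weighted

parent = [9, 5]                              # 9 'Yes' and 5 'No' in the full dataset
splits = {
    "Outlook":  [[2, 3], [4, 0], [3, 2]],    # Sunny, Overcast, Rainy
    "Humidity": [[6, 1], [3, 4]],            # Normal, High
    "Windy":    [[3, 3], [6, 2]],            # True, False
}
for name, children in splits.items():
    print(f"{name}: gain ≈ {information_gain(parent, children):.3f}")
# Outlook ≈ 0.247, Humidity ≈ 0.152, Windy ≈ 0.048
```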
Chi-Square

• High Chi-square value: the class distribution of the child node differs strongly from that of the parent node, i.e. the split is moving towards purer nodes.
• Works only with a categorical target variable.
• Calculate the Chi-square for every child node using the formula (per class): Chi-square = √((Actual − Expected)² / Expected).
• Calculate the Chi-square for a split as the sum of the Chi-square values of each child node of that split.
[Figure: the Outlook split again, annotated with the actual and expected class counts used in the Chi-square calculation below.]
Chi-square for the split on Outlook:
  Node          Actual Yes  Expected Yes  Actual No  Expected No  Chi-Sq Yes  Chi-Sq No
  Sunny (5)         2           3.21          3          1.79        0.68        0.90
  Overcast (4)      4           2.57          0          1.43        0.89        1.20
  Rainy (5)         3           3.21          2          1.79        0.12        0.16
  Chi-square for the split ≈ 3.95

Chi-square for the split on Humidity:
  Node          Actual Yes  Expected Yes  Actual No  Expected No  Chi-Sq Yes  Chi-Sq No
  Normal (7)        6           4.5           1          2.5         0.71        0.95
  High (7)          3           4.5           4          2.5         0.71        0.95
  Chi-square for the split ≈ 3.31

Chi-square for the split on Windy:
  Node          Actual Yes  Expected Yes  Actual No  Expected No  Chi-Sq Yes  Chi-Sq No
  True (6)          3           3.86          3          2.14        0.44        0.59
  False (8)         6           5.14          2          2.86        0.38        0.51
  Chi-square for the split ≈ 1.91

Outlook again gives the largest value and is chosen for the split.
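The per-class statistic that matches the table values is √((Actual − Expected)² / Expected); assuming that convention (it is not spelled out in the slides), a short Python sketch reproduces the split totals:

```python
from math import sqrt

def node_chi(actual_yes, actual_no, p_yes, p_no):
    """Chi-square contribution of one child node: sqrt((A - E)^2 / E) per class."""
    n = actual_yes + actual_no
    exp_yes, exp_no = n * p_yes, n * p_no
    return (sqrt((actual_yes - exp_yes) ** 2 / exp_yes)
            + sqrt((actual_no - exp_no) ** 2 / exp_no))

p_yes, p_no = 9 / 14, 5 / 14                # parent distribution: 9 'Yes', 5 'No'
splits = {
    "Outlook":  [(2, 3), (4, 0), (3, 2)],   # Sunny, Overcast, Rainy
    "Humidity": [(6, 1), (3, 4)],           # Normal, High
    "Windy":    [(3, 3), (6, 2)],           # True, False
}
for name, children in splits.items():
    total = sum(node_chi(y, n, p_yes, p_no) for y, n in children)
    print(f"{name}: chi-square for split ≈ {total:.2f}")
# Outlook ≈ 3.95, Humidity ≈ 3.31, Windy ≈ 1.91
```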
• If a data set D contains training data from n classes, the Gini index Gini(D) is defined as
  Gini(D) = 1 − Σ_{i=1}^{n} p_i², where p_i is the relative frequency of class i in D.
• If a data set D is split on attribute A into subsets D1 and D2, the Gini index of the split is defined as
  Gini_A(D) = (|D1|/|D|)·Gini(D1) + (|D2|/|D|)·Gini(D2).

Properties:
• Used to create a binary split of the tree.
• The lower the Gini index, the higher the homogeneity of the nodes.
• Works with categorical targets.
• If we want to predict house price, sales, taxi fare or the number of bikes rented, Gini is not the right measure.

Steps:
• Group attribute values into subsets (if the attribute is non-binary).
• Calculate the Gini index of each subset and select the one with the minimum value as the candidate for that attribute.
• Repeat the steps above for all attributes and select the candidate with the lowest Gini index value.
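A minimal Gini sketch (mine, not the slides'), evaluating the binary grouping {Sunny, Rainy} vs {Overcast} that the next slide uses as its example:

```python
def gini(counts):
    """Gini impurity of a node given its class counts, e.g. [yes, no]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini index of a split over its child nodes."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([9, 5]), 4))                  # parent node              ≈ 0.4592
print(round(gini_split([[5, 5], [4, 0]]), 4))  # {Sunny,Rainy}/{Overcast} ≈ 0.3571
```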
• Example: grouping the Outlook values into the binary split {Sunny, Rainy} vs {Overcast}:

  {Sunny, Rainy}           {Overcast}
  Total records = 10       Total records = 4
  Class Yes = 5            Class Yes = 4
  Class No = 5             Class No = 0
  Prob. Yes = 0.50         Prob. Yes = 1
  Prob. No = 0.50          Prob. No = 0
  Weight = 10/14           Weight = 4/14

• Splitting Humidity into {Normal} vs {High}:

  {Normal}                 {High}
  Total records = 7        Total records = 7
  Class Yes = 6            Class Yes = 3
  Class No = 1             Class No = 4
  Prob. Yes = 0.86         Prob. Yes = 0.43
  Prob. No = 0.14          Prob. No = 0.57
  Weight = 7/14            Weight = 7/14
[Figure: the training data with the Play column encoded numerically (Yes = 1, No = 0), as used for the variance-reduction calculation below.]
Variance for the split on Outlook:
  Sunny:    Mean = 0.40  Variance = 0.24
  Overcast: Mean = 1.00  Variance = 0.00
  Rainy:    Mean = 0.60  Variance = 0.24
  Weighted variance of split ≈ 0.171

Variance for the split on Humidity:
  Normal:   Mean = 0.86  Variance = 0.12
  High:     Mean = 0.43  Variance = 0.24
  Weighted variance of split ≈ 0.184

Variance for the split on Windy:
  True:     Mean = 0.50  Variance = 0.25
  False:    Mean = 0.75  Variance = 0.19
  Weighted variance of split ≈ 0.214
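A small Python sketch (not from the slides) reproducing the weighted variances, with Play encoded as 1/0:

```python
def variance(values):
    """Population variance of a list of 0/1-encoded target values."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def weighted_variance(children):
    """Size-weighted variance of a split; lower means a better split."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * variance(c) for c in children)

# Play encoded as 1 = Yes, 0 = No; counts per branch as in the tables above.
outlook  = [[1] * 2 + [0] * 3, [1] * 4, [1] * 3 + [0] * 2]   # Sunny, Overcast, Rainy
humidity = [[1] * 6 + [0] * 1, [1] * 3 + [0] * 4]            # Normal, High
windy    = [[1] * 3 + [0] * 3, [1] * 6 + [0] * 2]            # True, False
for name, split in [("Outlook", outlook), ("Humidity", humidity), ("Windy", windy)]:
    print(f"{name}: weighted variance ≈ {weighted_variance(split):.3f}")
# Outlook ≈ 0.171, Humidity ≈ 0.184, Windy ≈ 0.214
```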
ID3 Algorithm

• Creates a decision tree using the Information Theory (entropy) concept.
• ID3 chooses to split on the attribute that gives the highest information gain.
• Entropy:
  • A measure of the uncertainty associated with a random variable.
  • Calculation: for a discrete random variable Y taking m distinct values {y1, y2, y3, ..., ym}, the entropy is calculated by the formula
    Entropy(Y) = − Σ_{i=1}^{m} p(y_i) · log2 p(y_i)
  • Interpretation:
    • High entropy: higher uncertainty.
    • Low entropy: lower uncertainty.

[Figure: example nodes illustrating the two extremes: a pure node with entropy 0 and mixed nodes with entropy ≈ 0.92.]
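As a quick worked example (using the class counts of the full weather dataset, 9 'Yes' and 5 'No'):
  Entropy = −(9/14)·log2(9/14) − (5/14)·log2(5/14) ≈ 0.940,
which is the parent entropy used in the information-gain tables earlier in the deck.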
[Figure: after the first split on Outlook, the Overcast branch is already a pure 'Yes' leaf, while the Sunny and Rainy branches (marked '?') still contain a mix of classes and keep their Humidity, Windy and Play columns.]

For each impure branch the procedure is repeated:
I.  Compute the entropy Info(D) of the parent node of the left (Sunny) sub-tree.
II. Compute the information gain of all remaining attributes and split on the best one:
  • In the Sunny branch, Humidity gives a pure split (High → No, Normal → Yes).
  • In the Rainy branch, Windy gives a pure split (True → No, False → Yes).

The finished tree (see the sketch below):
  Outlook = Sunny    → Humidity: High → No, Normal → Yes
  Outlook = Overcast → Yes
  Outlook = Rainy    → Windy: True → No, False → Yes
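A compact sketch of the recursive procedure the slide walks through; the data representation (a list of dicts) and the function names are my own choice, not from the deck:

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def best_attribute(rows, attributes, target):
    """Attribute with the highest information gain on this subset."""
    def gain(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r)
        remainder = sum(len(g) / len(rows) * entropy(g, target) for g in groups.values())
        return entropy(rows, target) - remainder
    return max(attributes, key=gain)

def id3(rows, attributes, target="Play"):
    labels = {r[target] for r in rows}
    if len(labels) == 1:                  # pure node -> leaf
        return labels.pop()
    if not attributes:                    # nothing left to split on -> majority class
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    branches = {}
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        branches[value] = id3(subset, [a for a in attributes if a != attr], target)
    return {attr: branches}

# On the 14-record weather data this yields the tree shown above:
# {'Outlook': {'Sunny': {'Humidity': ...}, 'Overcast': 'Yes', 'Rainy': {'Windy': ...}}}
```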
C4.5 Algorithm

• ID3 favors attributes with a large number of values or outcomes; it is biased towards multivalued attributes (here Outlook has three values: Sunny, Overcast, Rainy; Humidity has Normal, High; Windy has True, False).
• C4.5 normalizes the information gain by the split information of the attribute:
  SplitInfo_A(D) = − Σ_j (|D_j| / |D|) · log2(|D_j| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
  Example for Humidity: SplitInfo = −(7/14)·log2(7/14) − (7/14)·log2(7/14) = 1, so GainRatio = 0.152 / 1 = 0.152.
• The attribute with the maximum gain ratio is selected as the splitting attribute.

I.  Gain ratio for the left sub-tree parent node selection (Sunny branch).
II. Gain ratio for the right sub-tree parent node selection (Rainy branch).

[Figure: the resulting tree is the same as before: Outlook at the root, Humidity under Sunny (High → No, Normal → Yes), a 'Yes' leaf under Overcast, and Windy under Rainy (True → No, False → Yes).]
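A tiny sketch (not from the slides) of the gain-ratio correction, reusing the information gains computed earlier:

```python
from math import log2

def split_info(sizes):
    """Intrinsic information of a split, given the child-node sizes."""
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    return gain / split_info(sizes)

print(round(gain_ratio(0.247, [5, 4, 5]), 3))   # Outlook  ≈ 0.157
print(round(gain_ratio(0.152, [7, 7]), 3))      # Humidity ≈ 0.152
print(round(gain_ratio(0.048, [6, 8]), 3))      # Windy    ≈ 0.049
```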
CART Algorithm
• The attribute that maximizes the reduction in impurity is selected as the splitting attribute.
• Equivalently, the attribute with minimum Gini index is selected.
[Figure: evaluating a CART binary split on Windy: Gini(parent) = 0.4591, weighted Gini of the binary Windy split = 0.4286, so the reduction in impurity is 0.4591 − 0.4286 = 0.0305.]
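A two-line check of the figure's numbers (my own sketch, not part of the slides):

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = gini([9, 5])                                   # ≈ 0.4591
windy  = 6 / 14 * gini([3, 3]) + 8 / 14 * gini([6, 2])  # ≈ 0.4286
print(round(parent - windy, 4))                         # ≈ 0.0306 (the slide's 0.0305)
```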
Other Attribute Selection Measures

• CHAID: a popular decision tree algorithm whose measure is based on the Chi-square test for independence.
• C-SEP: performs better than information gain and Gini index in certain cases.
• G-statistic: has a close approximation to the Chi-square distribution.
• MDL (Minimal Description Length) principle (i.e. the simplest solution is preferred):
  • The best tree is the one that requires the fewest number of bits to both
    • encode the tree, and
    • encode the exceptions to the tree.
• Multivariate splits (partitioning based on combinations of multiple variables):
  • e.g. CART.
Handling Missing Attribute Values

• Split information is calculated from the entire dataset, with an extra category for the unknown value: 5/14 (for Sunny), 3/14 (for Overcast), 5/14 (for Rainy), 1/14 (for '?').
• The remaining case (Outlook = '?', Humidity = High, Windy = True, Play = Yes) is assigned to all blocks of the partition from above, with weights proportional to the size of each block:

Sunny branch:
  Outlook   Humidity  Windy  Play  Weight
  Sunny     Normal    True   Yes   1
  Sunny     High      True   No    1
  Sunny     High      False  No    1
  Sunny     High      False  No    1
  Sunny     Normal    False  Yes   1
  ?         High      True   Yes   5/13 ≈ 0.4

Overcast branch:
  Outlook   Humidity  Windy  Play  Weight
  Overcast  High      False  Yes   1
  Overcast  Normal    True   Yes   1
  Overcast  Normal    False  Yes   1
  ?         High      True   Yes   3/13 ≈ 0.2

Rainy branch:
  Outlook   Humidity  Windy  Play  Weight
  Rainy     High      True   No    1
  Rainy     Normal    True   No    1
  Rainy     Normal    False  Yes   1
  Rainy     Normal    False  Yes   1
  Rainy     High      False  Yes   1
  ?         High      True   Yes   5/13 ≈ 0.4
Sunny branch: partitioning this subset further by the same test on Humidity, the class distribution is as follows:
  Humidity = Normal: 2 class Yes, 0 class No
  Humidity = High:   0.4 class Yes, 3 class No
  *Unable to partition into single-class subsets*

Overcast branch: contains only the class 'Yes'.

Rainy branch: partitioning this subset further by the same test on Windy:
  Windy = True:  0.4 class Yes, 2 class No
  Windy = False: 3 class Yes, 0 class No
  *Unable to partition into single-class subsets*
A case that reaches the Sunny/High leaf is therefore classified as No (Play) with probability 3/3.4 (88%) and as Yes (Play) with probability 0.4/3.4 (12%).

  Sunny branch, split on Humidity:   Total   Yes          No
  Normal                             2       2   (100%)   0 (0%)
  High                               3.4     0.4 (12%)    3 (88%)

[Figure: the tree with fractional class counts at its leaves: 0.4 Yes / 3 No under Sunny-High, 0.4 Yes / 2 No under Rainy-True, and pure 'Yes' leaves elsewhere.]

• Final class distribution for the case: Yes ___ % / No ___ % (see the sketch below for one way to combine the leaf distributions).
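The slides leave the final distribution open. As an illustration only (the particular test case is my own assumption, not the deck's), here is how C4.5-style fractional weights would combine the leaf distributions for a case with an unknown Outlook, Humidity = High and Windy = True:

```python
# Leaf class counts (Yes, No) following the worked example above; the Overcast
# leaf is pure 'Yes' (3 known records plus 0.2 of the unknown-Outlook record).
leaf = {
    ("Sunny", "High"):   (0.4, 3.0),
    ("Sunny", "Normal"): (2.0, 0.0),
    ("Overcast", None):  (3.2, 0.0),
    ("Rainy", "True"):   (0.4, 2.0),
    ("Rainy", "False"):  (3.0, 0.0),
}
branch_weight = {"Sunny": 5 / 13, "Overcast": 3 / 13, "Rainy": 5 / 13}
route = {"Sunny": "High", "Overcast": None, "Rainy": "True"}   # where this case falls

p_yes = p_no = 0.0
for outlook, w in branch_weight.items():
    yes, no = leaf[(outlook, route[outlook])]
    p_yes += w * yes / (yes + no)
    p_no  += w * no  / (yes + no)
print(f"Yes ≈ {p_yes:.0%}, No ≈ {p_no:.0%}")   # roughly Yes ≈ 34%, No ≈ 66%
```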
Overfitting

• Drawing conclusions that are too fine-grained from the dataset we have, when the purpose of the tree is to predict the classes of new/unseen data.
• Example: differentiating between fruits based on colour, weight, shape, size, etc.
  • The tree can start to memorize instead of learning:
    • A fruit with a width of 2.87 inches is an apple.
    • A fruit with a width of 2.86 or 2.88 inches is an orange.
  • This assumes there is more precision in the data than we actually have.
• A couple of ways to control overfitting (see the code sketch below):
  • Limit the number of splits, e.g. direct the tree to make no more than 7 splits.
  • Split a branch based on the number of data points in it, e.g. don't split unless it has at least 6 data points.
  • Split a branch based on the number of data points in the child nodes, e.g. split only if every child node gets at least 3 data points.
  • Split a branch only if the tree has not yet reached a certain depth, say 5.

Pruning:
• Two ways to produce simpler trees.
• Prepruning: halt tree construction early; stop splitting if the splitting assessment falls below a threshold.
  • It is difficult to choose an appropriate threshold:
    • Too high a threshold can terminate division too early.
    • Too low a value results in little simplification.
• Postpruning: remove branches from a fully grown tree.
  • Use a set of data different from the training data to decide which pruned tree is best.
  • Growing and then pruning trees is slower but more reliable.
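One way to express these pre- and post-pruning limits in code; scikit-learn is my choice here, the slides do not prescribe any library:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,           # stop splitting once the tree is 5 levels deep
    min_samples_split=6,   # a node needs at least 6 data points before it may be split
    min_samples_leaf=3,    # every child node must keep at least 3 data points
    ccp_alpha=0.01,        # cost-complexity postpruning strength (0 = no pruning)
)
# clf.fit(X_train, y_train) would then grow a tree constrained by these limits.
```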
• Inducing decision trees is one of the most widely used learning methods in practice
• Can out-perform human experts in many problems
• Strengths include:
• Fast
• Simple to implement
• Can convert result to a set of easily interpretable rules
• Empirically validated and used in many commercial products
• Handles noisy data
• Weaknesses include:
• Univariate splits: partitioning uses only one attribute at a time, which limits the types of possible trees
• Large decision trees may be hard to understand
References

• Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
• Hartshorn, S. Machine Learning with Random Forests and Decision Trees.