
Decision Tree Learning

Sheik Benazir Ahmed

Hamburg University of Technology


Information and Communication Systems

18.03.2021
Slide 2

Outline

Title Slide No.


• An Example to Begin With 3-6
• Decision Tree: Introduction 7
• Decision Tree: Picking The Splits 8-10
• Picking The Splits: Information Gain 11-12
• Picking The Splits: Chi-Square 13-14
• Picking The Splits: Gini Index 15-17
• Picking The Splits: Variance 18-19
• ID3 Algorithm 20-21
• C4.5 Algorithm 22-23
• CART Algorithm 24-26
• Comparing Attribute Selection Measures 27
• Other Attribute Selection Measures 28
• Unknown Attribute Values 29-31
• Overfitting 32
• Summary: Decision Tree Learning 33
• References 34

Slide 3

An Example to Begin With: To Play or Not To Play

What?
• Is the weather condition of a random day favorable to play?
• The goal (Play) is placed in the last column.
• The other columns' values are used to reach the goal.

How?
• Construct a tree.
• Columns act as nodes.
• Terminal nodes give the final decision.

Outlook    Humidity  Windy  Play
Sunny      Normal    True   Yes
Sunny      High      True   No
Sunny      High      False  No
Sunny      High      False  No
Sunny      Normal    False  Yes
Overcast   High      True   Yes
Overcast   High      False  Yes
Overcast   Normal    True   Yes
Overcast   Normal    False  Yes
Rainy      High      True   No
Rainy      Normal    True   No
Rainy      Normal    False  Yes
Rainy      Normal    False  Yes
Rainy      High      False  Yes

Slide 4

An Example to Begin With: To Play or Not To Play

• The column with the most sensible information should act as a node of the tree.
• Perform a Q&A session on the node based on the picked column.
• The answers generate new nodes.
• Continue the Q&A session for all nodes.
• Stop when the goal is reached.

[Figure: candidate root nodes (Outlook?, Humidity?, Windy?) all aiming to answer Play?]

Slide 5

An Example to Begin With: To Play or Not To Play

Splitting the dataset on Outlook:

Sunny subset (two 'Yes' and three 'No'):
Humidity  Windy  Play
Normal    True   Yes
High      True   No
High      False  No
High      False  No
Normal    False  Yes

Rainy subset (three 'Yes' and two 'No'):
Humidity  Windy  Play
High      True   No
Normal    True   No
Normal    False  Yes
Normal    False  Yes
High      False  Yes

Overcast subset (only 'Yes'):
Humidity  Windy  Play
High      True   Yes
High      False  Yes
Normal    True   Yes
Normal    False  Yes

The Overcast branch is already pure, so it becomes a 'Yes' leaf; the Sunny and Rainy branches still need further splits.

Slide 6

An Example to Begin With: To Play or Not To Play

The finished tree:

Outlook:
  Sunny    -> Humidity: High -> No, Normal -> Yes
  Overcast -> Yes
  Rainy    -> Windy: True -> No, False -> Yes

Classifying a new case (Outlook = Rainy, Humidity = Normal, Windy = True): follow the Rainy branch to the Windy node; Windy = True gives the answer 'No'.

Slide 7

Decision Tree: Introduction


• Decision trees are a type of supervised machine learning.
• They use known (labelled) training data to build a model that reproduces the results of that data.
• That model can then be used to predict the results for unknown data.
• They process data into groups based on the values of the features they are given.
• At each decision node, the data is split into separate branches.
• Decision trees can be used for regression, to predict a real numeric value,
• or for classification, to split data into distinct categories.

[Figure: a small dataset (Attribute 1, Attribute 2, Class with rows A1.1/A2.1/C1, A1.2/A2.2/C2, A1.3/A2.2/C1) next to a tree diagram labelling the root node, branches, decision nodes and leaf nodes.]
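As a quick illustration of the classification and regression uses above, here is a minimal sketch (not part of the slides) using scikit-learn; the tiny datasets are invented purely for the example.

```python
# Minimal sketch (not from the slides): scikit-learn decision trees for the two
# uses mentioned above. The toy data here is made up for illustration.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: split data into distinct categories.
X_cls = [[0, 0], [1, 1], [1, 0], [0, 1]]
y_cls = ["C1", "C2", "C1", "C2"]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[0.9, 0.9]]))   # predicts a class label

# Regression: predict a real numeric value.
X_reg = [[1], [2], [3], [4]]
y_reg = [1.5, 3.1, 4.4, 6.2]
reg = DecisionTreeRegressor().fit(X_reg, y_reg)
print(reg.predict([[2.5]]))        # predicts a real number
```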
Slide 8

Decision Tree: Picking The Splits


Images taken from: Machine Learning with Random Forests and Decision Trees by Scott Hartshorn

Consider the following example of splitting red circles from orange squares with a vertical line:

• A line in the middle splits off most of the red circles. Is it the best split?
• What if we slide farther right and pick up all the red circles on the left?
• Or move farther left and get more orange squares on the right?
• A vertical line in the middle that separates all the red circles from the orange squares is the best split for this example.

Slide 9

Decision Tree: Picking The Splits

• Decision trees look through every possible split and pick the best split between two adjacent points in a given feature.

• But what is "best"? And how do we figure out the best split?

• From our previous example: how is 'Outlook' better than 'Humidity' or 'Windy'?

Slide 10

Decision Tree: Picking The Splits

• A decision tree evaluates splits of a node on all available variables.
• It selects the split that results in the most homogeneous (pure) sub-nodes.
• If the nodes are entirely pure, each node contains only a single class and is therefore homogeneous.

• There are multiple measures (algorithms) for deciding the best split:
  • Information Gain / Entropy
  • Chi-Square
  • Gini Index
  • Variance

Slide 11

Picking The Splits: Information Gain/Entropy


Entropy = -p1*log2(p1) - p2*log2(p2) - p3*log2(p3) - ... - pn*log2(pn)

where pi is the proportion of class i in the node.

Example: a node with 50% 'Yes' and 50% 'No' has entropy 1; a node with 100% 'Yes' and 0% 'No' has entropy 0.

• Lower entropy means a purer node.
• Works only with categorical target variables.

Information Gain:
  Gain(A) = Info(D) - Info_A(D)
  where Info(D) is the entropy of the parent node (dataset D) and Info_A(D) is the weighted entropy of the child nodes, i.e. the information remaining after using attribute A to split D.

Steps (see the sketch below):
• Calculate the entropy of the parent node.
• Calculate the entropy of each child node.
• Calculate the weighted entropy of the split.
• Information gain = parent entropy - weighted entropy.

• Higher information gain means more homogeneous (purer) child nodes.
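A minimal sketch of these steps in Python (not part of the slides); entropy and information_gain are illustrative helper names, and the example reproduces the Outlook split of this dataset.

```python
# A minimal sketch (not from the slides) of the entropy and information-gain
# formulas above, applied to class-label lists such as ["Yes", "No", ...].
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent_labels, child_label_lists):
    """Parent entropy minus the weighted entropy of the child nodes."""
    total = len(parent_labels)
    weighted = sum(len(child) / total * entropy(child) for child in child_label_lists)
    return entropy(parent_labels) - weighted

# Example: the Outlook split (Sunny 2 Yes / 3 No, Overcast 4 Yes, Rainy 3 Yes / 2 No).
parent = ["Yes"] * 9 + ["No"] * 5
sunny = ["Yes"] * 2 + ["No"] * 3
overcast = ["Yes"] * 4
rainy = ["Yes"] * 3 + ["No"] * 2
print(round(information_gain(parent, [sunny, overcast, rainy]), 2))  # ~0.25
```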

Slide 12

Picking The Splits: Information Gain/Entropy


Parent node: 14 samples, 9 'Yes' (64.29%) and 5 'No' (35.71%).
Entropy of the parent = -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.94

Class distribution per candidate split:
• Outlook:  Sunny 2 Yes / 3 No, Overcast 4 Yes / 0 No, Rainy 3 Yes / 2 No
• Humidity: Normal 6 Yes / 1 No, High 3 Yes / 4 No
• Windy:    True 3 Yes / 3 No, False 6 Yes / 2 No

Split     Node      Entropy   Weighted Entropy   Information Gain
Outlook   Sunny     0.97      0.69               0.25
          Overcast  0
          Rainy     0.97
Humidity  Normal    0.59      0.79               0.15
          High      0.98
Windy     True      1.00      0.90               0.04
          False     0.82

Information gain for the split on Outlook:
  Weighted entropy = (5/14)(0.97) + (4/14)(0) + (5/14)(0.97) ≈ 0.69
  Information gain = 0.94 - 0.69 = 0.25

Information gain for the split on Humidity:
  Weighted entropy = (7/14)(0.59) + (7/14)(0.98) ≈ 0.79
  Information gain = 0.94 - 0.79 = 0.15

Information gain for the split on Windy:
  Weighted entropy = (6/14)(1.00) + (8/14)(0.82) ≈ 0.90
  Information gain = 0.94 - 0.90 = 0.04

Outlook gives the highest information gain, so it is chosen as the first split.

Slide 13

Picking The Splits: Chi-Square


• Measures the statistical significance of the difference between a child node and its parent node.
• It is measured as the sum of squared standardized differences between the actual and expected frequencies of the target variable in each node:
  Chi-square (per class, per node) = sqrt((Actual - Expected)^2 / Expected)
• The expected value is calculated from the class distribution of the parent node.
• Chi-square value 0: expected and actual values are the same (the distribution in parent and child node is identical), so the split brings no improvement in the purity of the nodes.
• High chi-square value: the distribution of the child node differs from the parent node, i.e. the split moves toward purer nodes.
• Works only with categorical target variables.

Example (Humidity split):
Parent: 14 samples, 64.29% 'Yes', 35.71% 'No'.
  Normal (7 samples): expected No. of Yes = 64.29% of 7 = 4.5, actual Yes = 6; expected No. of No = 35.71% of 7 = 2.5, actual No = 1.
  High (7 samples):   expected No. of Yes = 64.29% of 7 = 4.5, actual Yes = 3; expected No. of No = 35.71% of 7 = 2.5, actual No = 4.

Steps (see the sketch below):
• Calculate the expected count of each class for every child node.
• Calculate the chi-square of every child node using the formula above.
• The chi-square for the split is the sum of the chi-square values of its child nodes.
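A minimal sketch of the chi-square split measure in Python (not part of the slides); chi_square_split is an illustrative helper name, and the example uses the Humidity split above.

```python
# A minimal sketch (not from the slides) of the chi-square split measure described
# above: chi = sqrt((actual - expected)^2 / expected), summed over classes and children.
from math import sqrt

def chi_square_split(parent_counts, child_counts_list):
    """parent_counts / child counts are dicts such as {"Yes": 9, "No": 5}."""
    parent_total = sum(parent_counts.values())
    total_chi = 0.0
    for child in child_counts_list:
        child_total = sum(child.values())
        for cls, parent_n in parent_counts.items():
            expected = child_total * parent_n / parent_total
            actual = child.get(cls, 0)
            total_chi += sqrt((actual - expected) ** 2 / expected)
    return total_chi

# Humidity split from the slides: Normal (6 Yes, 1 No), High (3 Yes, 4 No).
parent = {"Yes": 9, "No": 5}
print(round(chi_square_split(parent, [{"Yes": 6, "No": 1}, {"Yes": 3, "No": 4}]), 2))
# ~3.31 (the slide rounds each term first and gets 3.32)
```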
Slide 14

Picking The Splits: Chi-Square


Parent node for every split: 14 samples, 64.29% 'Yes', 35.71% 'No'.

Outlook
Node          Actual Yes  Expected Yes  Actual No  Expected No  Chi-Sq Yes  Chi-Sq No
Sunny (5)     2           3.21          3          1.79         0.68        0.90
Overcast (4)  4           2.57          0          1.43         0.89        1.20
Rainy (5)     3           3.21          2          1.79         0.12        0.16

Humidity
Node          Actual Yes  Expected Yes  Actual No  Expected No  Chi-Sq Yes  Chi-Sq No
Normal (7)    6           4.5           1          2.5          0.71        0.95
High (7)      3           4.5           4          2.5          0.71        0.95

Windy
Node          Actual Yes  Expected Yes  Actual No  Expected No  Chi-Sq Yes  Chi-Sq No
True (6)      3           3.86          3          2.14         0.44        0.59
False (8)     6           5.14          2          2.86         0.38        0.51

Chi-square for the split on Outlook:  0.68 + 0.90 + 0.89 + 1.20 + 0.12 + 0.16 = 3.95
Chi-square for the split on Humidity: 0.71 + 0.95 + 0.71 + 0.95 = 3.32
Chi-square for the split on Windy:    0.44 + 0.59 + 0.38 + 0.51 = 1.92

Outlook gives the highest chi-square value, so it is again the preferred split.

Slide 15

Picking The Splits: Gini Index

• If a data set D contains training data from n classes, the Gini index Gini(D) is defined as:
  Gini(D) = 1 - sum(pi^2)   (pi is the relative frequency of class i in D)
• If a data set D is split on attribute A into subsets D1 and D2, the Gini index of the split is defined as:
  Gini_A(D) = (|D1|/|D|)*Gini(D1) + (|D2|/|D|)*Gini(D2)

Properties:
• Used to create a binary split of the tree.
• The lower the Gini index, the higher the homogeneity of the nodes.
• Works with categorical targets: if we want to predict house price, sales, taxi fare or the number of bikes rented, Gini is not the right measure.

Steps (see the sketch below):
• Group the attribute values into subsets (if the attribute is not binary).
• Calculate the Gini index of each subset split and select the one with the minimum value as the candidate for that attribute.
• Repeat the steps above for all attributes and select the candidate with the lowest Gini index.
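A minimal sketch of the Gini definitions in Python (not part of the slides); gini and gini_split are illustrative helper names, and the example uses the Outlook split worked out on the next slides.

```python
# A minimal sketch (not from the slides) of the Gini index definitions above.
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_i^2) over the classes in D."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_split(child_label_lists):
    """Weighted Gini index of a split: sum(|D_j|/|D| * Gini(D_j))."""
    total = sum(len(child) for child in child_label_lists)
    return sum(len(child) / total * gini(child) for child in child_label_lists)

# Outlook split {Sunny, Rainy} vs. {Overcast}: (5 Yes / 5 No) and (4 Yes / 0 No).
sunny_rainy = ["Yes"] * 5 + ["No"] * 5
overcast = ["Yes"] * 4
print(round(gini_split([sunny_rainy, overcast]), 4))  # 0.3571
```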
Slide 16

Picking The Splits: Gini Index

• The possible splitting subsets of attribute Outlook (Sunny, Overcast, Rainy) are:
  {(Sunny, Overcast), (Sunny, Rainy), (Overcast, Rainy), (Sunny), (Overcast), (Rainy)}
• Consider the binary split {Sunny, Rainy} vs. {Overcast}:

  Parent: 14 samples, 9 'Yes', 5 'No'.

  Sunny, Rainy node: 10 samples, 5 'Yes', 5 'No'
    P(Yes) = 0.50, P(No) = 0.50, weight = 10/14
    Gini = 1 - 0.50^2 - 0.50^2 = 0.50
  Overcast node: 4 samples, 4 'Yes', 0 'No'
    P(Yes) = 1, P(No) = 0, weight = 4/14
    Gini = 1 - 1^2 - 0^2 = 0
  Gini index of the split = (10/14)(0.50) + (4/14)(0) ≈ 0.3571

Slide 17

Picking The Splits: Gini Index


• Binary attribute: Humidity (Normal, High)
  Parent: 14 samples, 9 'Yes', 5 'No'.
  Normal node: 7 samples, 6 'Yes', 1 'No'
    P(Yes) = 0.86, P(No) = 0.14, weight = 7/14
    Gini = 1 - 0.86^2 - 0.14^2 ≈ 0.245
  High node: 7 samples, 3 'Yes', 4 'No'
    P(Yes) = 0.43, P(No) = 0.57, weight = 7/14
    Gini = 1 - 0.43^2 - 0.57^2 ≈ 0.490
  Gini index of the split = (7/14)(0.245) + (7/14)(0.490) ≈ 0.3674

• Binary attribute: Windy (True, False)
  True node: 6 samples, 3 'Yes', 3 'No'
    P(Yes) = 0.50, P(No) = 0.50, weight = 6/14
    Gini = 0.50
  False node: 8 samples, 6 'Yes', 2 'No'
    P(Yes) = 0.75, P(No) = 0.25, weight = 8/14
    Gini = 1 - 0.75^2 - 0.25^2 = 0.375
  Gini index of the split = (6/14)(0.50) + (8/14)(0.375) ≈ 0.4286

Attribute   Split                       Gini Index
Outlook     {Sunny, Rainy}, {Overcast}  0.3571
Humidity    Binary                      0.3674
Windy       Binary                      0.4286

Outlook again gives the lowest (best) Gini index.

Slide 18

Picking The Splits: Variance


Variance = sum((X - µ)^2) / n,  where X is a class sample, µ is the mean and n is the number of samples.

Example: a node whose values differ (e.g. {2, 6, 7, 4, 7, 9}) has a variance of roughly 6, while a node in which all values are the same (e.g. all 1) has a variance of 0.

• A lower variance means a purer node.
• Used when the target is continuous.
• So for our example, let 'Yes' have the numeric value 1 and 'No' the value 0.

Steps (see the sketch below):
• Calculate the variance of each child node.
• Calculate the variance of the split as the weighted average variance of the child nodes.
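A minimal sketch of variance as a split measure in Python (not part of the slides); variance and variance_split are illustrative helper names, and the example encodes the Outlook split with Yes = 1 and No = 0 as suggested above.

```python
# A minimal sketch (not from the slides) of variance as a split measure for a
# numeric target (here "Yes" = 1 and "No" = 0, as suggested above).
def variance(values):
    """Population variance: sum((x - mean)^2) / n."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def variance_split(child_value_lists):
    """Weighted average variance of the child nodes."""
    total = sum(len(child) for child in child_value_lists)
    return sum(len(child) / total * variance(child) for child in child_value_lists)

# Outlook split: Sunny [1,0,0,0,1], Overcast [1,1,1,1], Rainy [0,0,1,1,1].
print(round(variance_split([[1, 0, 0, 0, 1], [1, 1, 1, 1], [0, 0, 1, 1, 1]]), 2))  # ~0.17
```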

Slide 19

Picking The Splits: Variance

With 'Yes' = 1 and 'No' = 0 (class distributions per node as on the information-gain slide):

Split     Node      Mean   Variance   Weighted Variance of Split
Outlook   Sunny     0.40   0.24       0.17
          Overcast  1      0
          Rainy     0.60   0.24
Humidity  Normal    0.86   0.12       0.18
          High      0.43   0.24
Windy     True      0.50   0.25       0.21
          False     0.75   0.19

Variance for the split on Outlook:
  Weighted variance = (5/14)(0.24) + (4/14)(0) + (5/14)(0.24) ≈ 0.17
Variance for the split on Humidity:
  Weighted variance = (7/14)(0.12) + (7/14)(0.24) ≈ 0.18
Variance for the split on Windy:
  Weighted variance = (6/14)(0.25) + (8/14)(0.1875) ≈ 0.21

Outlook gives the lowest weighted variance, so it is again the best split.

Slide 20

ID3 Algorithm
• ID3 creates a decision tree using the information-theory (entropy) concept.
• ID3 chooses to split on the attribute that gives the highest information gain.
• Entropy:
  • A measure of the uncertainty associated with a random variable.
  • Calculation: for a discrete random variable Y taking m distinct values {y1, y2, y3, ..., ym}:
    H(Y) = -sum over i of P(yi)*log2(P(yi))
  • Interpretation:
    • High entropy: higher uncertainty.
    • Lower entropy: lower uncertainty.

[Figure: example label grids illustrating entropies of 0 (all identical labels), 0.92 and 1 (evenly mixed labels).]

Slide 21

Building The Decision Tree: ID3

I.  Entropy/Info(D) of the root node (class attribute Play):
      Info(D) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) ≈ 0.94
II. Information gain of all other attributes:
      Gain(Outlook) ≈ 0.25, Gain(Humidity) ≈ 0.15, Gain(Windy) ≈ 0.04
    → Outlook becomes the root; the Overcast branch is pure ('Yes'), the Sunny and Rainy branches are split further.

Left sub-tree (Outlook = Sunny: 2 'Yes', 3 'No'):
  Humidity  Windy  Play
  Normal    True   Yes
  High      True   No
  High      False  No
  High      False  No
  Normal    False  Yes
I.  Info(D) of this sub-tree ≈ 0.97
II. Information gain of the remaining attributes:
      Gain(Humidity) ≈ 0.97 (Normal is all 'Yes', High is all 'No'), Gain(Windy) ≈ 0.02
    → Humidity is chosen.

Right sub-tree (Outlook = Rainy: 3 'Yes', 2 'No'):
  Humidity  Windy  Play
  High      True   No
  Normal    True   No
  Normal    False  Yes
  Normal    False  Yes
  High      False  Yes
I.  Info(D) of this sub-tree ≈ 0.97
II. Information gain of the remaining attributes:
      Gain(Windy) ≈ 0.97 (True is all 'No', False is all 'Yes'), Gain(Humidity) ≈ 0.02
    → Windy is chosen.

Resulting tree:
  Outlook:
    Sunny    -> Humidity: High -> No, Normal -> Yes
    Overcast -> Yes
    Rainy    -> Windy: True -> No, False -> Yes
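A compact Python sketch of this recursion (not the canonical ID3 pseudocode): pick the attribute with the highest information gain, split the data, and recurse. The function names and the row-as-dict representation are assumptions for illustration.

```python
# A compact sketch (not the canonical ID3 pseudocode) of how ID3 grows a tree.
# Rows are dicts such as {"Outlook": "Sunny", "Humidity": "High", "Windy": "True", "Play": "Yes"}.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def id3(rows, attributes, target="Play"):
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:                 # pure node -> leaf with that class
        return labels[0]
    if not attributes:                        # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]

    def information_gain(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
        return entropy(labels) - weighted

    best = max(attributes, key=information_gain)   # highest information gain wins
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best], target)
    return tree
```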

Slide 22

C4.5 Algorithm

• ID3 favors attributes with a large number of values or outcomes, i.e. it is biased towards multivalued attributes (e.g. Outlook has three values, while Humidity and Windy have only two).

• C4.5 is a successor and improved version of ID3.

• C4.5 uses the gain ratio to overcome this problem of ID3:
  • GainRatio(A) = Gain(A) / SplitInfo_A(D), a normalization of the information gain.
  • SplitInfo_A(D) = -sum over j of (|Dj|/|D|)*log2(|Dj|/|D|); it increases with more attribute values and decreases with fewer attribute values.
  • The attribute with the maximum gain ratio is selected as the splitting attribute.
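A minimal sketch of the gain ratio in Python (not part of the slides); split_info and gain_ratio are illustrative helper names, and the example plugs in the gain values computed earlier for this dataset.

```python
# A minimal sketch (not from the slides) of C4.5's gain ratio: information gain
# normalised by the split information of the attribute.
from math import log2

def split_info(child_sizes):
    """SplitInfo_A(D) = -sum(|D_j|/|D| * log2(|D_j|/|D|))."""
    total = sum(child_sizes)
    return -sum(s / total * log2(s / total) for s in child_sizes)

def gain_ratio(info_gain, child_sizes):
    return info_gain / split_info(child_sizes)

# Outlook from the slides: gain 0.25, partitions of size 5, 4 and 5.
print(round(gain_ratio(0.25, [5, 4, 5]), 2))   # ~0.16
# Windy: gain 0.04, partitions of size 6 and 8.
print(round(gain_ratio(0.04, [6, 8]), 2))      # ~0.04
```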

Slide 23

Building The Decision Tree: C4.5


I. Gain ratio for the root selection:
   • Outlook:  Gain ≈ 0.25, SplitInfo = -(5/14)log2(5/14) - (4/14)log2(4/14) - (5/14)log2(5/14) ≈ 1.58;  GainRatio ≈ 0.25/1.58 ≈ 0.16
   • Humidity: Gain ≈ 0.15, SplitInfo = -(7/14)log2(7/14) - (7/14)log2(7/14) = 1.00;                     GainRatio ≈ 0.15
   • Windy:    Gain ≈ 0.04, SplitInfo = -(6/14)log2(6/14) - (8/14)log2(8/14) ≈ 0.99;                     GainRatio ≈ 0.04
   → Outlook has the highest gain ratio and becomes the root (the Overcast branch is pure 'Yes').

II. Gain ratio for the left sub-tree (Outlook = Sunny) node selection:
   • Humidity: Gain ≈ 0.97, SplitInfo ≈ 0.97;  GainRatio ≈ 1.0
   • Windy:    Gain ≈ 0.02, SplitInfo ≈ 0.97;  GainRatio ≈ 0.02
   → Humidity is chosen.

III. Gain ratio for the right sub-tree (Outlook = Rainy) node selection:
   • Windy:    Gain ≈ 0.97, SplitInfo ≈ 0.97;  GainRatio ≈ 1.0
   • Humidity: Gain ≈ 0.02, SplitInfo ≈ 0.97;  GainRatio ≈ 0.02
   → Windy is chosen.

C4.5 therefore builds the same tree as ID3 for this dataset:
  Outlook:
    Sunny    -> Humidity: High -> No, Normal -> Yes
    Overcast -> Yes
    Rainy    -> Windy: True -> No, False -> Yes

Slide 24

CART Algorithm

• C4.5's gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.

• CART: Classification And Regression Tree
  • Creates a binary tree.
  • Uses the Gini index as its attribute selection measure.
  • Gini index of the dataset:   Gini(D) = 1 - sum(pi^2)
  • Gini index of an attribute:  Gini_A(D) = (|D1|/|D|)*Gini(D1) + (|D2|/|D|)*Gini(D2)
  • Reduction in impurity:       ΔGini(A) = Gini(D) - Gini_A(D)
  • The attribute that maximizes the reduction in impurity is selected as the splitting attribute; equivalently, the attribute with the minimum Gini index is selected.
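For context, scikit-learn's decision trees implement an optimised CART-style algorithm with the Gini index as the default criterion. The sketch below (not part of the slides) fits the weather dataset after one-hot encoding the categorical attributes, which is an assumption of this example rather than something CART itself requires.

```python
# A sketch (not from the slides): fitting the weather data with scikit-learn's
# CART-style DecisionTreeClassifier (criterion="gini" is the default).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Outlook":  ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rainy"] * 5,
    "Humidity": ["Normal", "High", "High", "High", "Normal",
                 "High", "High", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High"],
    "Windy":    ["True", "True", "False", "False", "False",
                 "True", "False", "True", "False",
                 "True", "True", "False", "False", "False"],
    "Play":     ["Yes", "No", "No", "No", "Yes",
                 "Yes", "Yes", "Yes", "Yes",
                 "No", "No", "Yes", "Yes", "Yes"],
})

# One-hot encode the categorical attributes so the estimator gets numeric input.
X = pd.get_dummies(data[["Outlook", "Humidity", "Windy"]])
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, data["Play"])
print(export_text(clf, feature_names=list(X.columns)))  # text rendering of the tree
```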

Slide 25

Building The Decision Tree: CART

• Gini index of the dataset:  Gini(D) = 1 - (9/14)^2 - (5/14)^2 ≈ 0.4591
• Possible splitting subsets of attribute Outlook (Sunny, Overcast, Rainy):
  {(Sunny, Overcast), (Sunny, Rainy), (Overcast, Rainy), (Sunny), (Overcast), (Rainy)}
  The best binary grouping is {Sunny, Rainy} vs. {Overcast}.
• Binary attributes: Humidity (Normal, High) and Windy (True, False).

Attribute   Split                       Gini Index   Reduction in Impurity
Outlook     {Sunny, Rainy}, {Overcast}  0.3571       0.4591 - 0.3571 = 0.1020
Humidity    Binary                      0.3674       0.4591 - 0.3674 = 0.0917
Windy       Binary                      0.4286       0.4591 - 0.4286 = 0.0305

Outlook gives the largest reduction in impurity, so CART splits first on Outlook into {Sunny, Rainy} vs. {Overcast}; the Overcast branch is a pure 'Yes' leaf.

Slide 26

Building The Decision Tree: CART


Splitting the {Sunny, Rainy} branch (10 samples: 5 'Yes', 5 'No', Gini = 0.50):

• Binary attribute Humidity (Normal, High):
    Normal: 4 'Yes', 1 'No';  High: 1 'Yes', 4 'No'
    Gini_Humidity = (5/10)(0.32) + (5/10)(0.32) = 0.32
• Binary attribute Windy (True, False):
    True: 1 'Yes', 3 'No';  False: 4 'Yes', 2 'No'
    Gini_Windy = (4/10)(0.375) + (6/10)(0.44) ≈ 0.42
  → Humidity gives the lower Gini index and is chosen; each Humidity branch is then split on Windy.

Resulting CART tree:
  Outlook in {Sunny, Rainy}:
    Humidity = Normal -> Windy: True -> {Yes, No} (impure), False -> Yes
    Humidity = High   -> Windy: True -> No, False -> {Yes, No, No} (impure, majority No)
  Outlook = Overcast -> Yes

Slide 27

Comparing Attribute Selection Measures

• Information Gain (ID3):
  • Biased towards multivalued attributes.

• Gain Ratio (C4.5):
  • Tends to prefer unbalanced splits in which one partition is much smaller than the others.

• Gini Index (CART):
  • Biased towards multivalued attributes.
  • Has difficulty when the number of classes is large.
  • Tends to favor tests that result in equal-sized partitions and purity in both partitions.

Slide 28

Other Attribute Selection Measures

• CHAID: a popular decision tree algorithm; its measure is based on the Chi-square test for independence.
• C-SEP: performs better than information gain and Gini index in certain cases.
• G-statistic: has a close approximation to the Chi-square distribution.
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  • The best tree is the one that requires the fewest number of bits to both
    • encode the tree, and
    • encode the exceptions to the tree.
• Multivariate splits (partitioning based on combinations of multiple variables):
  • e.g. CART.

• Which attribute selection measure is the best?
  • Most give good results; none is significantly superior to the others.

Slide 29

Unknown Attribute Values


• Let us consider that the value of the attribute Outlook for one training case is unknown:
  Outlook = Overcast is replaced by '?', Humidity = High, Windy = True, Play = Yes (the rest of the dataset is unchanged).

• Gain for the known-valued training data (13 cases):

  Outlook   # of Yes   # of No   Total
  Sunny     2          3         5
  Overcast  3          0         3
  Rainy     3          2         5
  Total     8          5         13

  Info(known cases) = -(8/13)log2(8/13) - (5/13)log2(5/13) ≈ 0.961
  Info_Outlook(known cases) ≈ 0.747
  Gain(Outlook) ≈ (13/14) × (0.961 - 0.747) ≈ 0.199  (weighted by the fraction of cases with a known Outlook value)

• Split information is calculated from the entire dataset, with an extra category for the unknown value:
  SplitInfo(Outlook) = -(5/14)log2(5/14) (for Sunny) - (3/14)log2(3/14) (for Overcast) - (5/14)log2(5/14) (for Rainy) - (1/14)log2(1/14) (for '?') ≈ 1.81

Slide 30

Unknown Attribute Values

• The remaining case (Outlook = ?) is assigned to all blocks of the partition, with weights proportional to the number of known cases in each block: 5/13 for Sunny, 3/13 for Overcast and 5/13 for Rainy.

  Sunny partition (each known case has weight 1):
    Normal True Yes (1), High True No (1), High False No (1), High False No (1), Normal False Yes (1), ? High True Yes (5/13 ≈ 0.4)
  Overcast partition:
    High False Yes (1), Normal True Yes (1), Normal False Yes (1), ? High True Yes (3/13 ≈ 0.2)
  Rainy partition:
    High True No (1), Normal True No (1), Normal False Yes (1), Normal False Yes (1), High False Yes (1), ? High True Yes (5/13 ≈ 0.4)

• The Overcast partition contains only the class 'Yes'.

• Partitioning the Sunny subset further by the same test on Humidity, the class distribution is:
    Humidity = Normal: 2 class Yes, 0 class No
    Humidity = High:   0.4 class Yes, 3 class No
  (unable to partition into single-class subsets)

• Partitioning the Rainy subset further by the same test on Windy:
    Windy = True:  0.4 class Yes, 2 class No
    Windy = False: 3 class Yes, 0 class No
  (unable to partition into single-class subsets)
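A minimal sketch of this fractional weighting in Python (not part of the slides); distribute_unknown is an illustrative helper name.

```python
# A minimal sketch (not from the slides) of the fractional weighting above:
# a case with an unknown attribute value is sent down every branch, carrying a
# weight proportional to the number of known cases in that branch.
def distribute_unknown(branch_sizes, case_weight=1.0):
    """branch_sizes: known-value counts per branch, e.g. {"Sunny": 5, "Overcast": 3, "Rainy": 5}."""
    total_known = sum(branch_sizes.values())
    return {branch: case_weight * size / total_known
            for branch, size in branch_sizes.items()}

# The '?' case from the slides: 13 cases with a known Outlook (5 Sunny, 3 Overcast, 5 Rainy).
print(distribute_unknown({"Sunny": 5, "Overcast": 3, "Rainy": 5}))
# -> weights of roughly 0.4 (Sunny), 0.2 (Overcast) and 0.4 (Rainy)
```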
Slide 31

Unknown Attribute Values

• Classify another case with Outlook = Sunny, Humidity unknown, and Windy = False:
  • If the humidity is normal, the case is classified as Yes (Play).
  • If the humidity is high, the case is classified as No (Don't Play) with probability 3/3.4 (88%) and Yes (Play) with probability 0.4/3.4 (12%).

  Humidity   Total weight of cases   No. of 'Yes'   % of 'Yes'   No. of 'No'   % of 'No'
             in the branch           cases          in branch    cases         in branch
  Normal     2                       2              100%         0             0%
  High       3.4                     0.4            12%          3             88%

• Final class distribution for the case (combining both branches by their weights): Yes ≈ 2.4/5.4 ≈ 44%, No ≈ 3/5.4 ≈ 56%.

Slide 32

Overfitting

• Overfitting means drawing too fine conclusions from the dataset that we have.
  • The purpose of the tree is to predict the classes of new/unseen data.
  • Example: differentiating between fruits based on color, weight, shape, size, etc.
    • The tree can start to memorize instead of learning:
      • A fruit with a width of 2.87 inches is an apple.
      • A fruit with a width of 2.86 inches or 2.88 inches is an orange.
    • This assumes more precision in the data than we actually have.

• A couple of ways to control overfitting (see the sketch below):
  • Limit the number of splits, e.g. direct the tree to make no more than 7 splits.
  • Split a branch only if it holds enough data points, e.g. do not split unless it has at least 6 data points.
  • Split a branch only if all resulting child nodes have at least 3 data points in them.
  • Split a branch only if the tree has not yet reached a certain depth, maybe 5.

Pruning: two ways to produce simpler trees.
  • Prepruning: halt tree construction early; stop splitting if the splitting assessment falls below a threshold.
    • It is difficult to choose an appropriate threshold: too high a threshold can terminate division too early, too low a value results in little simplification.
  • Postpruning: remove branches from a fully grown tree.
    • Use a set of data different from the training data to decide which is the best pruned tree.
    • Growing and then pruning trees is slower but more reliable.
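A sketch of these controls (not part of the slides), expressed as scikit-learn hyperparameters; the threshold values are the illustrative numbers from this slide, not recommendations.

```python
# A sketch (not from the slides) of the overfitting controls listed above as
# scikit-learn hyperparameters. The numeric values mirror the slide's examples.
from sklearn.tree import DecisionTreeClassifier

prepruned = DecisionTreeClassifier(
    max_leaf_nodes=8,        # roughly "no more than 7 splits" for a binary tree
    min_samples_split=6,     # only split a node holding at least 6 data points
    min_samples_leaf=3,      # every child node must keep at least 3 data points
    max_depth=5,             # stop once the tree reaches a certain depth
)

# Postpruning: cost-complexity pruning; a larger ccp_alpha removes more branches
# from the fully grown tree.
postpruned = DecisionTreeClassifier(ccp_alpha=0.01)
```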

Slide 33

Summary: Decision Tree Learning

• Inducing decision trees is one of the most widely used learning methods in practice.
• They can outperform human experts in many problems.

• Strengths include:
  • Fast
  • Simple to implement
  • The result can be converted to a set of easily interpretable rules
  • Empirically validated in many commercial products
  • Handles noisy data

• Weaknesses include:
  • Univariate splits (partitioning uses only one attribute at a time), which limits the types of possible trees
  • Large decision trees may be hard to understand

Slide 34

References

• Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.

• Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.

• Hartshorn, S. Machine Learning with Random Forests and Decision Trees.

• Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning 1: 81-106. Boston: Kluwer Academic Publishers.

