Decision Tree Learning
Sheik Benazir Ahmed
The dataset is split on the picked column (Outlook):

[Figure: the weather ("play tennis") dataset split on Outlook. The Sunny branch holds two 'Yes' and three 'No' records, the Rainy branch three 'Yes' and two 'No', and the Overcast branch only 'Yes' records; each branch keeps its Humidity, Windy and Play columns.]
Information gain for the split on Outlook:
  Sunny entropy = 0.971, Overcast entropy = 0.000, Rainy entropy = 0.971
  Weighted entropy = (5/14)(0.971) + (4/14)(0.000) + (5/14)(0.971) ≈ 0.693
  Information gain = 0.940 − 0.693 ≈ 0.247

Information gain for the split on Humidity:
  Normal entropy = 0.592, High entropy = 0.985
  Weighted entropy = (7/14)(0.592) + (7/14)(0.985) ≈ 0.789
  Information gain = 0.940 − 0.789 ≈ 0.152

Information gain for the split on Windy:
  True entropy = 1.000, False entropy = 0.811
  Weighted entropy = (6/14)(1.000) + (8/14)(0.811) ≈ 0.892
  Information gain = 0.940 − 0.892 ≈ 0.048

Outlook gives the largest information gain, so it is the picked column for the first split.
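A minimal Python sketch (mine, not from the slides) that reproduces these numbers; the class counts per branch come from the 14-record weather dataset used throughout the deck:

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a node given its class counts, e.g. [yes, no]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Parent entropy minus the weighted entropy of the child nodes."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in child_counts)
    return entropy(parent_counts) - weighted

parent = [9, 5]                              # 9 'Yes' and 5 'No' in the full dataset
splits = {
    "Outlook":  [[2, 3], [4, 0], [3, 2]],    # Sunny, Overcast, Rainy
    "Humidity": [[6, 1], [3, 4]],            # Normal, High
    "Windy":    [[3, 3], [6, 2]],            # True, False
}
for name, children in splits.items():
    print(f"{name}: gain ≈ {information_gain(parent, children):.3f}")
# Outlook ≈ 0.247, Humidity ≈ 0.152, Windy ≈ 0.048
```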
Chi-Square

• High Chi-square value: the class distribution of the child node differs strongly from that of the parent node, i.e. the split is moving towards purer nodes.
• Works only with a categorical target variable.
• Calculate the Chi-square for every child node using the formula (per class): Chi-square = √((Actual − Expected)² / Expected).
• Calculate the Chi-square for a split as the sum of the Chi-square values of each child node of that split.
[Figure: the Outlook split again, annotated with the actual and expected class counts used in the Chi-square calculation below.]
Chi-square for the split on Outlook:
  Node          Actual Yes  Expected Yes  Actual No  Expected No  Chi-Sq Yes  Chi-Sq No
  Sunny (5)         2           3.21          3          1.79        0.68        0.90
  Overcast (4)      4           2.57          0          1.43        0.89        1.20
  Rainy (5)         3           3.21          2          1.79        0.12        0.16
  Chi-square for the split ≈ 3.95

Chi-square for the split on Humidity:
  Node          Actual Yes  Expected Yes  Actual No  Expected No  Chi-Sq Yes  Chi-Sq No
  Normal (7)        6           4.5           1          2.5         0.71        0.95
  High (7)          3           4.5           4          2.5         0.71        0.95
  Chi-square for the split ≈ 3.31

Chi-square for the split on Windy:
  Node          Actual Yes  Expected Yes  Actual No  Expected No  Chi-Sq Yes  Chi-Sq No
  True (6)          3           3.86          3          2.14        0.44        0.59
  False (8)         6           5.14          2          2.86        0.38        0.51
  Chi-square for the split ≈ 1.91

Outlook again gives the largest value and is chosen for the split.
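The per-class statistic that matches the table values is √((Actual − Expected)² / Expected); assuming that convention (it is not spelled out in the slides), a short Python sketch reproduces the split totals:

```python
from math import sqrt

def node_chi(actual_yes, actual_no, p_yes, p_no):
    """Chi-square contribution of one child node: sqrt((A - E)^2 / E) per class."""
    n = actual_yes + actual_no
    exp_yes, exp_no = n * p_yes, n * p_no
    return (sqrt((actual_yes - exp_yes) ** 2 / exp_yes)
            + sqrt((actual_no - exp_no) ** 2 / exp_no))

p_yes, p_no = 9 / 14, 5 / 14                # parent distribution: 9 'Yes', 5 'No'
splits = {
    "Outlook":  [(2, 3), (4, 0), (3, 2)],   # Sunny, Overcast, Rainy
    "Humidity": [(6, 1), (3, 4)],           # Normal, High
    "Windy":    [(3, 3), (6, 2)],           # True, False
}
for name, children in splits.items():
    total = sum(node_chi(y, n, p_yes, p_no) for y, n in children)
    print(f"{name}: chi-square for split ≈ {total:.2f}")
# Outlook ≈ 3.95, Humidity ≈ 3.31, Windy ≈ 1.91
```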
• If a data set D contains training data from n classes, the Gini index Gini(D) is defined as
  Gini(D) = 1 − Σ_{i=1}^{n} p_i², where p_i is the relative frequency of class i in D.
• If a data set D is split on attribute A into subsets D1 and D2, the Gini index of the split is defined as
  Gini_A(D) = (|D1|/|D|)·Gini(D1) + (|D2|/|D|)·Gini(D2).

Properties:
• Used to create a binary split of the tree.
• The lower the Gini index, the higher the homogeneity of the nodes.
• Works with categorical targets.
• If we want to predict house price, sales, taxi fare or the number of bikes rented, Gini is not the right measure.

Steps:
• Group attribute values into subsets (if the attribute is non-binary).
• Calculate the Gini index of each subset and select the one with the minimum value as the candidate for that attribute.
• Repeat the steps above for all attributes and select the candidate with the lowest Gini index value.
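A minimal Gini sketch (mine, not the slides'), evaluating the binary grouping {Sunny, Rainy} vs {Overcast} that the next slide uses as its example:

```python
def gini(counts):
    """Gini impurity of a node given its class counts, e.g. [yes, no]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini index of a split over its child nodes."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([9, 5]), 4))                  # parent node              ≈ 0.4592
print(round(gini_split([[5, 5], [4, 0]]), 4))  # {Sunny,Rainy}/{Overcast} ≈ 0.3571
```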
• Example: grouping the Outlook values into the binary split {Sunny, Rainy} vs {Overcast}:

  {Sunny, Rainy}           {Overcast}
  Total records = 10       Total records = 4
  Class Yes = 5            Class Yes = 4
  Class No = 5             Class No = 0
  Prob. Yes = 0.50         Prob. Yes = 1
  Prob. No = 0.50          Prob. No = 0
  Weight = 10/14           Weight = 4/14

• Splitting Humidity into {Normal} vs {High}:

  {Normal}                 {High}
  Total records = 7        Total records = 7
  Class Yes = 6            Class Yes = 3
  Class No = 1             Class No = 4
  Prob. Yes = 0.86         Prob. Yes = 0.43
  Prob. No = 0.14          Prob. No = 0.57
  Weight = 7/14            Weight = 7/14
[Figure: the training data with the Play column encoded numerically (Yes = 1, No = 0), as used for the variance-reduction calculation below.]
Variance for the split on Outlook:
  Sunny:    Mean = 0.40  Variance = 0.24
  Overcast: Mean = 1.00  Variance = 0.00
  Rainy:    Mean = 0.60  Variance = 0.24
  Weighted variance of split ≈ 0.171

Variance for the split on Humidity:
  Normal:   Mean = 0.86  Variance = 0.12
  High:     Mean = 0.43  Variance = 0.24
  Weighted variance of split ≈ 0.184

Variance for the split on Windy:
  True:     Mean = 0.50  Variance = 0.25
  False:    Mean = 0.75  Variance = 0.19
  Weighted variance of split ≈ 0.214
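A small Python sketch (not from the slides) reproducing the weighted variances, with Play encoded as 1/0:

```python
def variance(values):
    """Population variance of a list of 0/1-encoded target values."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def weighted_variance(children):
    """Size-weighted variance of a split; lower means a better split."""
    n = sum(len(c) for c in children)
    return sum(len(c) / n * variance(c) for c in children)

# Play encoded as 1 = Yes, 0 = No; counts per branch as in the tables above.
outlook  = [[1] * 2 + [0] * 3, [1] * 4, [1] * 3 + [0] * 2]   # Sunny, Overcast, Rainy
humidity = [[1] * 6 + [0] * 1, [1] * 3 + [0] * 4]            # Normal, High
windy    = [[1] * 3 + [0] * 3, [1] * 6 + [0] * 2]            # True, False
for name, split in [("Outlook", outlook), ("Humidity", humidity), ("Windy", windy)]:
    print(f"{name}: weighted variance ≈ {weighted_variance(split):.3f}")
# Outlook ≈ 0.171, Humidity ≈ 0.184, Windy ≈ 0.214
```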
ID3 Algorithm

• Creates a decision tree using the Information Theory (entropy) concept.
• ID3 chooses to split on the attribute that gives the highest information gain.
• Entropy:
  • A measure of the uncertainty associated with a random variable.
  • Calculation: for a discrete random variable Y taking m distinct values {y1, y2, y3, ..., ym}, the entropy is calculated by the formula
    Entropy(Y) = − Σ_{i=1}^{m} p(y_i) · log2 p(y_i)
  • Interpretation:
    • High entropy: higher uncertainty.
    • Low entropy: lower uncertainty.

[Figure: example nodes illustrating the two extremes: a pure node with entropy 0 and mixed nodes with entropy ≈ 0.92.]
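As a quick worked example (using the class counts of the full weather dataset, 9 'Yes' and 5 'No'):
  Entropy = −(9/14)·log2(9/14) − (5/14)·log2(5/14) ≈ 0.940,
which is the parent entropy used in the information-gain tables earlier in the deck.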
[Figure: after the first split on Outlook, the Overcast branch is already a pure 'Yes' leaf, while the Sunny and Rainy branches (marked '?') still contain a mix of classes and keep their Humidity, Windy and Play columns.]

For each impure branch the procedure is repeated:
I.  Compute the entropy Info(D) of the parent node of the left (Sunny) sub-tree.
II. Compute the information gain of all remaining attributes and split on the best one:
  • In the Sunny branch, Humidity gives a pure split (High → No, Normal → Yes).
  • In the Rainy branch, Windy gives a pure split (True → No, False → Yes).

The finished tree (see the sketch below):
  Outlook = Sunny    → Humidity: High → No, Normal → Yes
  Outlook = Overcast → Yes
  Outlook = Rainy    → Windy: True → No, False → Yes
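A compact sketch of the recursive procedure the slide walks through; the data representation (a list of dicts) and the function names are my own choice, not from the deck:

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def best_attribute(rows, attributes, target):
    """Attribute with the highest information gain on this subset."""
    def gain(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r)
        remainder = sum(len(g) / len(rows) * entropy(g, target) for g in groups.values())
        return entropy(rows, target) - remainder
    return max(attributes, key=gain)

def id3(rows, attributes, target="Play"):
    labels = {r[target] for r in rows}
    if len(labels) == 1:                  # pure node -> leaf
        return labels.pop()
    if not attributes:                    # nothing left to split on -> majority class
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    branches = {}
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        branches[value] = id3(subset, [a for a in attributes if a != attr], target)
    return {attr: branches}

# On the 14-record weather data this yields the tree shown above:
# {'Outlook': {'Sunny': {'Humidity': ...}, 'Overcast': 'Yes', 'Rainy': {'Windy': ...}}}
```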
C4.5 Algorithm

• ID3 favors attributes with a large number of values or outcomes; it is biased towards multivalued attributes (here Outlook has three values: Sunny, Overcast, Rainy; Humidity has Normal, High; Windy has True, False).
• C4.5 normalizes the information gain by the split information of the attribute:
  SplitInfo_A(D) = − Σ_j (|D_j| / |D|) · log2(|D_j| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
  Example for Humidity: SplitInfo = −(7/14)·log2(7/14) − (7/14)·log2(7/14) = 1, so GainRatio = 0.152 / 1 = 0.152.
• The attribute with the maximum gain ratio is selected as the splitting attribute.

I.  Gain ratio for the left sub-tree parent node selection (Sunny branch).
II. Gain ratio for the right sub-tree parent node selection (Rainy branch).

[Figure: the resulting tree is the same as before: Outlook at the root, Humidity under Sunny (High → No, Normal → Yes), a 'Yes' leaf under Overcast, and Windy under Rainy (True → No, False → Yes).]
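A tiny sketch (not from the slides) of the gain-ratio correction, reusing the information gains computed earlier:

```python
from math import log2

def split_info(sizes):
    """Intrinsic information of a split, given the child-node sizes."""
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    return gain / split_info(sizes)

print(round(gain_ratio(0.247, [5, 4, 5]), 3))   # Outlook  ≈ 0.157
print(round(gain_ratio(0.152, [7, 7]), 3))      # Humidity ≈ 0.152
print(round(gain_ratio(0.048, [6, 8]), 3))      # Windy    ≈ 0.049
```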
CART Algorithm
• The attribute that maximizes the reduction in impurity is selected as the splitting attribute.
• Equivalently, the attribute with minimum Gini index is selected.
[Figure: evaluating a CART binary split on Windy: Gini(parent) = 0.4591, weighted Gini of the binary Windy split = 0.4286, so the reduction in impurity is 0.4591 − 0.4286 = 0.0305.]
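A two-line check of the figure's numbers (my own sketch, not part of the slides):

```python
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = gini([9, 5])                                   # ≈ 0.4591
windy  = 6 / 14 * gini([3, 3]) + 8 / 14 * gini([6, 2])  # ≈ 0.4286
print(round(parent - windy, 4))                         # ≈ 0.0306 (the slide's 0.0305)
```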
Other Attribute Selection Measures

• CHAID: a popular decision tree algorithm whose measure is based on the Chi-square test for independence.
• C-SEP: performs better than information gain and Gini index in certain cases.
• G-statistic: has a close approximation to the Chi-square distribution.
• MDL (Minimal Description Length) principle (i.e. the simplest solution is preferred):
  • The best tree is the one that requires the fewest number of bits to both
    • encode the tree, and
    • encode the exceptions to the tree.
• Multivariate splits (partitioning based on combinations of multiple variables):
  • e.g. CART.
Handling Missing Attribute Values

• Split information is calculated from the entire dataset, with an extra category for the unknown value: 5/14 (for Sunny), 3/14 (for Overcast), 5/14 (for Rainy), 1/14 (for '?').
• The remaining case (Outlook = '?', Humidity = High, Windy = True, Play = Yes) is assigned to all blocks of the partition from above, with weights proportional to the size of each block:

Sunny branch:
  Outlook   Humidity  Windy  Play  Weight
  Sunny     Normal    True   Yes   1
  Sunny     High      True   No    1
  Sunny     High      False  No    1
  Sunny     High      False  No    1
  Sunny     Normal    False  Yes   1
  ?         High      True   Yes   5/13 ≈ 0.4

Overcast branch:
  Outlook   Humidity  Windy  Play  Weight
  Overcast  High      False  Yes   1
  Overcast  Normal    True   Yes   1
  Overcast  Normal    False  Yes   1
  ?         High      True   Yes   3/13 ≈ 0.2

Rainy branch:
  Outlook   Humidity  Windy  Play  Weight
  Rainy     High      True   No    1
  Rainy     Normal    True   No    1
  Rainy     Normal    False  Yes   1
  Rainy     Normal    False  Yes   1
  Rainy     High      False  Yes   1
  ?         High      True   Yes   5/13 ≈ 0.4
Sunny branch: partitioning this subset further by the same test on Humidity, the class distribution is as follows:
  Humidity = Normal: 2 class Yes, 0 class No
  Humidity = High:   0.4 class Yes, 3 class No
  *Unable to partition into single-class subsets*

Overcast branch: contains only the class 'Yes'.

Rainy branch: partitioning this subset further by the same test on Windy:
  Windy = True:  0.4 class Yes, 2 class No
  Windy = False: 3 class Yes, 0 class No
  *Unable to partition into single-class subsets*
A case that reaches the Sunny/High leaf is therefore classified as No (Play) with probability 3/3.4 (88%) and as Yes (Play) with probability 0.4/3.4 (12%).

  Sunny branch, split on Humidity:   Total   Yes          No
  Normal                             2       2   (100%)   0 (0%)
  High                               3.4     0.4 (12%)    3 (88%)

[Figure: the tree with fractional class counts at its leaves: 0.4 Yes / 3 No under Sunny-High, 0.4 Yes / 2 No under Rainy-True, and pure 'Yes' leaves elsewhere.]

• Final class distribution for the case: Yes ___ % / No ___ % (see the sketch below for one way to combine the leaf distributions).
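The slides leave the final distribution open. As an illustration only (the particular test case is my own assumption, not the deck's), here is how C4.5-style fractional weights would combine the leaf distributions for a case with an unknown Outlook, Humidity = High and Windy = True:

```python
# Leaf class counts (Yes, No) following the worked example above; the Overcast
# leaf is pure 'Yes' (3 known records plus 0.2 of the unknown-Outlook record).
leaf = {
    ("Sunny", "High"):   (0.4, 3.0),
    ("Sunny", "Normal"): (2.0, 0.0),
    ("Overcast", None):  (3.2, 0.0),
    ("Rainy", "True"):   (0.4, 2.0),
    ("Rainy", "False"):  (3.0, 0.0),
}
branch_weight = {"Sunny": 5 / 13, "Overcast": 3 / 13, "Rainy": 5 / 13}
route = {"Sunny": "High", "Overcast": None, "Rainy": "True"}   # where this case falls

p_yes = p_no = 0.0
for outlook, w in branch_weight.items():
    yes, no = leaf[(outlook, route[outlook])]
    p_yes += w * yes / (yes + no)
    p_no  += w * no  / (yes + no)
print(f"Yes ≈ {p_yes:.0%}, No ≈ {p_no:.0%}")   # roughly Yes ≈ 34%, No ≈ 66%
```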
Overfitting

• Drawing conclusions that are too fine-grained from the dataset we have, when the purpose of the tree is to predict the classes of new/unseen data.
• Example: differentiating between fruits based on colour, weight, shape, size, etc.
  • The tree can start to memorize instead of learning:
    • A fruit with a width of 2.87 inches is an apple.
    • A fruit with a width of 2.86 or 2.88 inches is an orange.
  • This assumes there is more precision in the data than we actually have.
• A couple of ways to control overfitting (see the code sketch below):
  • Limit the number of splits, e.g. direct the tree to make no more than 7 splits.
  • Split a branch based on the number of data points in it, e.g. don't split unless it has at least 6 data points.
  • Split a branch based on the number of data points in the child nodes, e.g. split only if every child node gets at least 3 data points.
  • Split a branch only if the tree has not yet reached a certain depth, say 5.

Pruning:
• Two ways to produce simpler trees.
• Prepruning: halt tree construction early; stop splitting if the splitting assessment falls below a threshold.
  • It is difficult to choose an appropriate threshold:
    • Too high a threshold can terminate division too early.
    • Too low a value results in little simplification.
• Postpruning: remove branches from a fully grown tree.
  • Use a set of data different from the training data to decide which pruned tree is best.
  • Growing and then pruning trees is slower but more reliable.
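One way to express these pre- and post-pruning limits in code; scikit-learn is my choice here, the slides do not prescribe any library:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,           # stop splitting once the tree is 5 levels deep
    min_samples_split=6,   # a node needs at least 6 data points before it may be split
    min_samples_leaf=3,    # every child node must keep at least 3 data points
    ccp_alpha=0.01,        # cost-complexity postpruning strength (0 = no pruning)
)
# clf.fit(X_train, y_train) would then grow a tree constrained by these limits.
```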
• Inducing decision trees is one of the most widely used learning methods in practice
• Can out-perform human experts in many problems
• Strengths include:
• Fast
• Simple to implement
• Can convert result to a set of easily interpretable rules
• Empirically validated and used in many commercial products
• Handles noisy data
• Weaknesses include:
• Univariate splits: partitioning uses only one attribute at a time, which limits the types of possible trees
• Large decision trees may be hard to understand
References

• Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. (1984). Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
• Hartshorn, S. Machine Learning with Random Forests and Decision Trees.