
Trees

BASAV ROYCHOUDHURY
What do we learn?
Classification Trees
Regression Trees
Classification & Regression Trees
CART (Breiman, Friedman, Olshen, and Stone)
A simple approach that performs well across a range of situations
Results are easily interpretable
Can be used for both classification and estimation
Can result in simple (or complex) rules
◦ If tear production rate = reduced, then recommendation = none
◦ Otherwise, if age = young and astigmatic = no, then recommendation = soft
Tree
A tree structure is drawn upside down, with the root at the top and the leaves at the bottom
From the root, the tree splits from a single trunk into branches
The process continues until the leaves are reached/generated
A rule can be read off by travelling from the root to a leaf
Classification Trees
Predict membership in the classes of a categorical dependent variable based on predictor variables
When the more stringent theoretical and distributional assumptions of traditional methods are met, those traditional methods may be preferable
As an exploratory technique, or as a technique of last resort when traditional methods fail, classification trees work well
Weather Dataset
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
Example Classification Tree
Divide and conquer
Select an attribute to place at the root node
Make one branch for each possible value of that attribute
◦ If the attribute is continuous, decide where to split
The above process is repeated recursively for each branch
◦ Each recursion uses only the instances that reach that branch
If at any time all instances at a node have the same classification, stop developing that part of the tree
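As a rough illustration of this divide-and-conquer recursion, here is a minimal Python sketch (not the CART algorithm itself); the helper choose_best_attribute is a placeholder for the purity-based attribute selection discussed on the following slides.

```python
# Minimal sketch of recursive tree building (divide and conquer).
# `data` is a list of dicts, `target` the class column, `attributes` the candidate splits.
# `choose_best_attribute` is a placeholder for the purity-based selection described later.

def build_tree(data, attributes, target, choose_best_attribute):
    classes = {row[target] for row in data}
    if len(classes) == 1:                 # all instances share one class: stop
        return classes.pop()
    if not attributes:                    # no attributes left: return the majority class
        return max(classes, key=lambda c: sum(r[target] == c for r in data))
    best = choose_best_attribute(data, attributes, target)
    tree = {best: {}}
    for value in {row[best] for row in data}:         # one branch per attribute value
        subset = [row for row in data if row[best] == value]   # only instances reaching the branch
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target, choose_best_attribute)
    return tree
```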
Knowledge Discovery
Counts of Play = Yes / No for each attribute value (the same counts can also be read as fractions of the class totals, 9 Yes and 5 No out of 14):

Outlook        Yes  No
  Sunny          2   3
  Overcast       4   0
  Rainy          3   2
Temperature    Yes  No
  Hot            2   2
  Mild           4   2
  Cool           3   1
Humidity       Yes  No
  High           3   4
  Normal         6   1
Windy          Yes  No
  False          6   2
  True           3   3
Play (total)   Yes 9, No 5
Split? Where?
Choose the attribute that produces the purest daughter nodes
How to define (im)purity?
◦ $Gini\ impurity\ at\ node\ A = 1 - \sum_{k=1}^{m} p_k^2$
◦ $entropy(A) = \sum_{k=1}^{m} -p_k \log_2(p_k)$
where $p_k$ is the proportion of class $k$ among the instances at the node and $m$ is the number of classes

Entropy comes from physics, where it denotes the amount of disorder
Possible Splits
The dataset contains 9 'yes' and 5 'no' instances of Play
◦ $gini(root) = 1 - (p_{yes}^2 + p_{no}^2) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
◦ $entropy(root) = \sum_{k=1}^{m} -p_k \log_2(p_k) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$
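The root-node figures above can be reproduced with a small sketch of the two impurity measures (a minimal illustration in Python; the gini and entropy helpers are defined only for this sketch):

```python
import math

def gini(counts):
    """Gini impurity 1 - sum(p_k^2), from a list of class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy -sum(p_k * log2(p_k)), from a list of class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

print(round(gini([9, 5]), 3))     # 0.459
print(round(entropy([9, 5]), 3))  # 0.94 (i.e. 0.940)
```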
Possible Splits
Splitting on Outlook:
◦ $gini(sunny) = 0.480$, $entropy(sunny) = 0.971$
◦ $gini(overcast) = 0$, $entropy(overcast) = 0$
◦ $gini(rainy) = 0.480$, $entropy(rainy) = 0.971$

Weighted impurity of the daughter nodes:
◦ $gini(node) = \frac{5}{14} \times 0.480 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.480 = 0.342$
◦ $entropy(node) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.693$
Possible Splits
The information gain:
◦ the Gini index is reduced from 0.459 before the split to 0.342 after the split
◦ the entropy is reduced from 0.940 before the split to 0.693 after the split
Why Outlook?
Based on entropy,
◦ $gain(outlook) = 0.247$
◦ $gain(temperature) = 0.029$
◦ $gain(humidity) = 0.152$
◦ $gain(windy) = 0.048$

The next split is chosen by comparing the reduction in impurity across all possible splits on all candidate predictors; here Outlook gives the largest gain
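These gains can be reproduced from the contingency counts on the Knowledge Discovery slide (a minimal Python sketch, reusing the entropy helper from above):

```python
import math

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# (Yes, No) counts for each value of each attribute, taken from the slides.
splits = {
    "outlook":     [(2, 3), (4, 0), (3, 2)],   # sunny, overcast, rainy
    "temperature": [(2, 2), (4, 2), (3, 1)],   # hot, mild, cool
    "humidity":    [(3, 4), (6, 1)],           # high, normal
    "windy":       [(6, 2), (3, 3)],           # false, true
}

root = entropy([9, 5])                          # 0.940
for name, children in splits.items():
    total = sum(sum(c) for c in children)       # 14 instances
    after = sum(sum(c) / total * entropy(c) for c in children)
    print(name, round(root - after, 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048
```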
Overfitting
If a tree is allowed to grow without limit on the training data set, there is a danger of overfitting
The overall error on the training data keeps decreasing as the tree grows deeper
The overall error on validation data eventually increases as we go deeper
Overfitting - Solutions
Stopping Tree Growth
Pruning
Stopping tree growth (pre-pruning)
Parameters are set beforehand to limit growth
CHAID (chi-squared automatic interaction detection)
◦ Uses a chi-squared test to check whether splitting a node improves the fit significantly (statistically)
◦ Depending on the resulting p-value, the node may or may not be split
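As a rough illustration of the chi-squared idea (not the full CHAID algorithm), one can test whether an attribute is significantly associated with the class at a node; the 0.05 threshold below is just an assumed example value:

```python
from scipy.stats import chi2_contingency

# Contingency table of Outlook vs Play (Yes, No) from the weather data.
table = [[2, 3],   # sunny
         [4, 0],   # overcast
         [3, 2]]   # rainy

chi2, p, dof, expected = chi2_contingency(table)
split_here = p < 0.05          # assumed significance threshold
print(round(p, 3), split_here)
```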
Pruning (post-pruning)
Pruning is an alternative to stopping growth early
The tree is allowed to grow to its fullest, and is then pruned
Advantage: the tree grows without any inhibition
Detect the weak branches
◦ those that did not result in significant gain
◦ those that enhanced overfitting
Prune the weak branches
Pruning (post-pruning)
The CART algorithm uses a complexity parameter (cost complexity) for this
The complexity parameter is based on
◦ the misclassification error of the tree
◦ a penalty factor for the tree size
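Cost-complexity pruning is exposed by common implementations; for example, scikit-learn's decision trees support it. The sketch below encodes the weather data with a simple integer coding (one of several possible encodings) and lists the candidate subtrees along the pruning path:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The weather data from the earlier slide, with a simple integer encoding of the categories.
df = pd.DataFrame({
    "outlook":     ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                    "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
                    "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    "humidity":    ["high", "high", "high", "high", "normal", "normal", "normal",
                    "high", "normal", "normal", "normal", "high", "normal", "high"],
    "windy":       [False, True, False, False, False, True, True,
                    False, False, False, True, True, False, True],
    "play":        ["no", "no", "yes", "yes", "yes", "no", "yes",
                    "no", "yes", "yes", "yes", "yes", "yes", "no"],
})
X = df.drop(columns="play").apply(lambda s: s.astype("category").cat.codes)
y = df["play"]

# Each candidate alpha corresponds to a pruned subtree; larger alpha means a smaller tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(round(alpha, 3), pruned.get_n_leaves())
```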
Tuning the model
Minimum split: the minimum number of observations that must exist at a node before it is considered for a split
Minimum bucket size: the minimum number of observations allowed in any leaf node
Complexity parameter: controls the size of the decision tree and is used to select an optimal tree size
Use surrogates: whether to use surrogate splits (e.g. to handle missing values)
Maximum number of surrogates: the maximum number of surrogate splits allowed
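These names follow rpart-style terminology; the closest scikit-learn analogues are sketched below with illustrative values (scikit-learn has no surrogate splits, so that knob has no direct counterpart; X and y are the encoded weather data from the previous sketch):

```python
from sklearn.tree import DecisionTreeClassifier

# Rough scikit-learn analogues of the tuning knobs above (illustrative values).
model = DecisionTreeClassifier(
    min_samples_split=20,   # ~ "minimum split"
    min_samples_leaf=7,     # ~ "minimum bucket size"
    ccp_alpha=0.01,         # ~ "complexity parameter" (cost-complexity pruning)
    random_state=0,
)
model.fit(X, y)             # X, y: encoded weather features/labels from the earlier sketch
```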
Tuning the model, Cont’d
The more complex a model, the more likely it is to be overfitted and the less likely it is to match new data
◦ Cp: the minimum benefit that must be gained at each split
◦ a value of 0 builds the decision tree to its maximum depth
◦ it is useful to look at the Cp value for various tree sizes
◦ choose the point where the sum of the cross-validation error (relative to the root-node error) and the cross-validation standard error is minimum
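One way to act on this in scikit-learn is to cross-validate over the candidate ccp_alpha values from the pruning path (a sketch; X and y as before, and the tiny sample here is for illustration only):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Candidate alphas from the pruning path, then pick the best by cross-validation.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y).ccp_alphas
scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                          X, y, cv=3).mean()
          for a in alphas]
best_alpha = alphas[int(np.argmax(scores))]
print(round(float(best_alpha), 4))
```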
Tuning the model, Cont’d
Priors
◦ the class proportions in a training set may not represent the true proportions in the population
◦ if priors are provided, the resulting model will reflect the true proportions
◦ example: 0.6, 0.4 for binary classification; there must be as many priors as there are classes, and they must add to 1
Tuning the model, Cont’d
Loss matrix
◦ used to weight different kinds of errors differently
◦ example: the matrix 0, 10; 1, 0 is interpreted as "an actual 1 predicted as 0 is ten times more costly than an actual 0 predicted as 1"
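scikit-learn's trees take neither priors nor a loss matrix directly, but class_weight can approximate both effects by re-weighting the classes (a sketch with illustrative weights; X and y as before):

```python
from sklearn.tree import DecisionTreeClassifier

# Approximate priors / asymmetric costs by re-weighting the classes.
# Here misclassifying an actual "no" is treated as ten times more costly.
weighted_tree = DecisionTreeClassifier(class_weight={"no": 10, "yes": 1},
                                       random_state=0)
weighted_tree.fit(X, y)
```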
Ensemble methods
Bagging (Bootstrap Aggregation)
◦ decreases the variance of prediction by generating additional training sets from the original dataset, sampling with replacement to produce m multisets of the same size as the training data
◦ m models are fitted on the m bootstrap samples and combined by averaging the outputs (regression) or voting (classification)

Boosting
◦ builds several models, each paying more attention to the cases earlier models got wrong, and combines their outputs using a weighted-average approach

Stacking
◦ similar to boosting: several models are applied to the original data
◦ rather than just using an empirical formula for the weight function,
◦ a meta-level is introduced and another model/approach is used to estimate the combination

E.g. Random Forest (sketched below)
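Bagging of classification trees and a random forest can be sketched with scikit-learn (X and y as in the earlier sketches):

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Bagging: trees fitted on bootstrap samples, combined by voting
# (BaggingClassifier's default base estimator is a decision tree).
bagged = BaggingClassifier(n_estimators=50, random_state=0).fit(X, y)

# Random forest: bagged trees that also consider a random subset of features at each split.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
```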


Regression Tree
Use CART for a continuous dependent variable
◦ splits are still chosen by purity; the impurity measure is the sum of squared deviations from the mean of the leaf
◦ the value at a leaf is the mean of the observations in that leaf

Errors can be measured using standard metrics such as RMSE
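A minimal regression-tree sketch on synthetic data (the data below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy step function, purely for illustration.
rng = np.random.default_rng(0)
X_reg = rng.uniform(0, 10, size=(200, 1))
y_reg = np.where(X_reg[:, 0] > 5, 3.0, 1.0) + rng.normal(0, 0.2, size=200)

tree = DecisionTreeRegressor(max_depth=3).fit(X_reg, y_reg)
rmse = mean_squared_error(y_reg, tree.predict(X_reg)) ** 0.5   # RMSE on the training data
print(round(rmse, 3))
```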


Closing remarks
Trees are non-linear and non-parametric
◦ they do not require a particular relationship between the independent and dependent variables
◦ they might miss relationships amongst the predictors

Decision tree algorithms handle mixed types of variables and missing values
Robust to outliers and to irrelevant inputs
Might not give the very best predictive performance
Good for variable selection
Good for generating rules
Thank you
