Session XVIII
BASAV ROYCHOUDHURY
What do we learn?
Classification Trees
Regression Trees
Classification & Regression Trees
CART (Breiman, Friedman, Olshen, and Stone)
A simple approach that performs well across a wide range of situations
Results are easily interpretable
Can be used for both classification and estimation (regression)
Can result in simple (or complex) rules
◦ If tear production rate = reduced, then recommendation = none
◦ Otherwise, if age = young and astigmatic = no, then recommendation = soft
Tree
A tree structure is drawn upside down: root at the top, leaves at the bottom
From the root, the tree splits from a single trunk into branches
The process continues until the leaves are reached/generated
A rule can be read off by travelling from the root down to a leaf
Classification Trees
Classification trees predict membership in the classes of a categorical dependent variable based on one or more predictor variables
When the more stringent theoretical and distributional assumptions of more traditional methods are met, the traditional methods may be preferable
As an exploratory technique, or as a technique of last resort when traditional methods fail, classification trees work well
Weather Dataset
Outlook Temperature Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
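For the sketches that follow, the same dataset can be written down directly in Python (standard library only); the variable name weather is simply an illustrative choice reused in later snippets.

# The 14-instance weather dataset from the table above, one dictionary per row.
weather = [
    {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Windy": False, "Play": "No"},
    {"Outlook": "Sunny",    "Temperature": "Hot",  "Humidity": "High",   "Windy": True,  "Play": "No"},
    {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "High",   "Windy": False, "Play": "Yes"},
    {"Outlook": "Rainy",    "Temperature": "Mild", "Humidity": "High",   "Windy": False, "Play": "Yes"},
    {"Outlook": "Rainy",    "Temperature": "Cool", "Humidity": "Normal", "Windy": False, "Play": "Yes"},
    {"Outlook": "Rainy",    "Temperature": "Cool", "Humidity": "Normal", "Windy": True,  "Play": "No"},
    {"Outlook": "Overcast", "Temperature": "Cool", "Humidity": "Normal", "Windy": True,  "Play": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "High",   "Windy": False, "Play": "No"},
    {"Outlook": "Sunny",    "Temperature": "Cool", "Humidity": "Normal", "Windy": False, "Play": "Yes"},
    {"Outlook": "Rainy",    "Temperature": "Mild", "Humidity": "Normal", "Windy": False, "Play": "Yes"},
    {"Outlook": "Sunny",    "Temperature": "Mild", "Humidity": "Normal", "Windy": True,  "Play": "Yes"},
    {"Outlook": "Overcast", "Temperature": "Mild", "Humidity": "High",   "Windy": True,  "Play": "Yes"},
    {"Outlook": "Overcast", "Temperature": "Hot",  "Humidity": "Normal", "Windy": False, "Play": "Yes"},
    {"Outlook": "Rainy",    "Temperature": "Mild", "Humidity": "High",   "Windy": True,  "Play": "No"},
]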
Example Classification Tree
Divide and conquer
Select an attribute to place at the root node
Make one branch for each possible value of that attribute
◦ If it is continuous, decide where to split
If at any time all instances at a node have the same classification, stop developing that part of the tree (a sketch of this recursion follows below)
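A minimal sketch of this divide-and-conquer recursion, assuming categorical attributes, the weather list above, and a helper best_attribute that picks the split with the greatest impurity reduction (the helper name is hypothetical here; the information-gain sketch later in this document could serve as its implementation):

from collections import Counter

def build_tree(rows, attributes, target="Play"):
    """Grow a decision tree by repeatedly splitting on one attribute."""
    labels = [row[target] for row in rows]
    # Stop if every instance at this node has the same class,
    # or if there is nothing left to split on: emit a leaf (majority class).
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)   # assumed helper (see the later sketch)
    remaining = [a for a in attributes if a != attr]
    # One branch for each observed value of the chosen attribute.
    return {attr: {value: build_tree([r for r in rows if r[attr] == value],
                                     remaining, target)
                   for value in set(r[attr] for r in rows)}}

Called as build_tree(weather, ["Outlook", "Temperature", "Humidity", "Windy"]), it returns a nested dictionary whose keys trace the root-to-leaf rules described earlier.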
Knowledge Discovery
Class counts (Yes/No) for each value of each attribute (root node: 9 Yes, 5 No):
Outlook: Sunny 2/3, Overcast 4/0, Rainy 3/2
Temperature: Hot 2/2, Mild 4/2, Cool 3/1
Humidity: High 3/4, Normal 6/1
Windy: False 6/2, True 3/3
Split? Where?
Choose the attribute that produces the purest daughter nodes
How to define (im)purity?
◦ Gini impurity at node A: $\mathrm{gini}(A) = 1 - \sum_{k=1}^{m} p_k^2$
◦ Entropy at node A: $\mathrm{entropy}(A) = \sum_{k=1}^{m} -p_k \log_2(p_k)$
For the root node (9 Yes, 5 No): $-\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$
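As a quick check on these formulas, a small Python sketch (standard library only) that computes both impurity measures from a list of class counts; it reproduces the 0.940 entropy of the 9 Yes / 5 No root node:

from math import log2

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum of p * log2(p), skipping empty classes."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # 0.94  (the root node above)
print(round(gini([9, 5]), 3))      # 0.459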
Possible Splits
For the split on Outlook:
◦ gini(sunny) = 0.480, entropy(sunny) = 0.971
◦ gini(overcast) = 0, entropy(overcast) = 0
◦ gini(rainy) = 0.480, entropy(rainy) = 0.971
Weighted by branch sizes:
$\mathrm{gini}(\mathrm{node}) = \frac{5}{14} \times 0.480 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.480 = 0.342$
$\mathrm{entropy}(\mathrm{node}) = \frac{5}{14} \times 0.971 + \frac{4}{14} \times 0 + \frac{5}{14} \times 0.971 = 0.693$
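These weighted figures can be reproduced with the gini and entropy helpers from the previous sketch; a small illustration, with each branch weighted by the share of instances it receives:

# (Yes, No) counts in each branch of the Outlook split.
branches = {"Sunny": (2, 3), "Overcast": (4, 0), "Rainy": (3, 2)}
total = sum(sum(c) for c in branches.values())            # 14 instances in all

weighted_gini = sum(sum(c) / total * gini(c) for c in branches.values())
weighted_entropy = sum(sum(c) / total * entropy(c) for c in branches.values())
print(round(weighted_gini, 2), round(weighted_entropy, 2))
# ~0.34 and ~0.69, i.e. the 0.342 and 0.693 figures above (up to rounding)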
Possible Splits
The information gain:
◦ gini index reduced from 0.459 before split to 0.342 after the split
◦ entropy reduced from 0.940 before split to 0.693 after the split
Why Outlook?
Based on entropy,
◦ gain(outlook) = 0.247
◦ gain(temperature) = 0.029
◦ gain(humidity) = 0.152
◦ gain(windy) = 0.048
The next split is chosen by comparing the reduction in impurity across all possible splits of all candidate predictors (see the sketch below)
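The four gains can be computed directly from the data; a sketch that assumes the weather list and the entropy helper from the earlier snippets, and which could also serve as the best_attribute helper assumed in the divide-and-conquer sketch:

from collections import Counter

def info_gain(rows, attr, target="Play"):
    """Entropy at the node minus the weighted entropy after splitting on attr."""
    def node_entropy(subset):
        return entropy(list(Counter(r[target] for r in subset).values()))
    gain = node_entropy(rows)
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        gain -= len(subset) / len(rows) * node_entropy(subset)
    return gain

def best_attribute(rows, attributes, target="Play"):
    """Pick the attribute whose split yields the largest information gain."""
    return max(attributes, key=lambda a: info_gain(rows, a, target))

for a in ["Outlook", "Temperature", "Humidity", "Windy"]:
    print(a, round(info_gain(weather, a), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048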
Overfitting
If a tree is allowed to grow unchecked on the training data set, there is a danger of overfitting
On the training data, the overall error keeps decreasing as the tree grows deeper
On the validation data, the overall error eventually starts to increase as the tree grows deeper (see the sketch below)
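This behaviour is easy to observe by growing trees of increasing depth and tracking the error on held-out data; a sketch using scikit-learn on an arbitrary synthetic dataset (assuming the library is installed, purely for illustration):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A noisy synthetic classification problem, split into training and validation parts.
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.1, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [1, 2, 4, 8, 16, None]:          # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = 1 - tree.score(X_tr, y_tr)    # keeps falling as the tree deepens
    valid_err = 1 - tree.score(X_va, y_va)    # eventually rises again: overfitting
    print(depth, round(train_err, 3), round(valid_err, 3))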
Overfitting - Solutions
Stopping Tree Growth
Pruning
Stopping tree growth (pre-pruning)
Use parameters fixed beforehand (e.g., a maximum depth or a minimum node size) to limit growth (see the sketch after this list)
CHAID (chi-squared automatic interaction detection)
◦ Uses a chi-squared test to check whether splitting a node improves the fit significantly (statistically)
◦ Depending on the resulting p-value, the node may or may not be split
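scikit-learn does not implement CHAID, but the general idea of constraining growth with parameters fixed beforehand can be illustrated with its decision tree (a sketch; the particular parameter values are arbitrary):

from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: constrain the tree while it is being grown.
pre_pruned = DecisionTreeClassifier(
    max_depth=4,                  # never grow deeper than four levels
    min_samples_split=20,         # don't split nodes holding fewer than 20 instances
    min_samples_leaf=10,          # every leaf must keep at least 10 instances
    min_impurity_decrease=0.01,   # split only if impurity drops by at least 0.01
)
# pre_pruned.fit(X_train, y_train)   # X_train, y_train: your own training data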
Pruning (post-pruning)
Pruning is an alternative to stopping growth early
The tree is allowed to grow to its fullest and is then pruned back
Advantage: the tree is allowed to grow without any inhibition
Detect and remove the weak branches, i.e. those that
◦ did not result in significant gain
◦ mainly contributed to overfitting
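Post-pruning can be sketched with scikit-learn's cost-complexity pruning, which grows the full tree first and then collapses the weakest branches; X_train, y_train, X_valid and y_valid are assumed to be an existing training/validation split:

from sklearn.tree import DecisionTreeClassifier

# Grow the full tree, then ask for the sequence of candidate pruning strengths (alphas).
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep whichever scores best on validation data.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_valid, y_valid),
)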
Boosting
◦ Compute the output using several different models and then combine the results with a weighted-average approach
Stacking
◦ Similar to boosting: apply several models to the original data
◦ But don't just use an empirical formula for the weights
◦ Instead, introduce a meta-level and use another model/approach to estimate how the base models' outputs should be combined
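Both ideas are available off the shelf; a sketch of boosted and stacked ensembles built on decision trees with scikit-learn (the particular base models and meta-model are illustrative choices):

from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Boosting: many weak trees combined by a weighted vote
# (AdaBoost's default base learner is a depth-1 decision tree).
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)

# Stacking: several base models, plus a meta-level model (here a logistic
# regression) that learns how to combine their predictions.
stacked = StackingClassifier(
    estimators=[("deep_tree", DecisionTreeClassifier(max_depth=5)),
                ("stump", DecisionTreeClassifier(max_depth=1))],
    final_estimator=LogisticRegression(),
)
# boosted.fit(X_train, y_train); stacked.fit(X_train, y_train)   # your own data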
Decision tree algorithms handle mixed types of variables and missing values
Robust to outliers and to irrelevant inputs
Predictive performance may not be as good as that of other methods
Good for variable selection
Good for generating rules
Thank you