
10-701

Recitation: Decision Trees 
& Model Selection (AIC & BIC)
Min Chi
Oct 7th, 2010
NSH 1507
More on Decision Trees 
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.

• Build a decision tree with continuous input features.

• Build a quad decision tree.
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.
Next

• Build a decision tree with continuous input features.

• Build a quad decision tree.
Information Gain
• Advantage of an attribute: the decrease in uncertainty it yields
  – Entropy of Y before the split

  – Entropy of Y after splitting based on Xi
    • Weight each branch by the probability of following it

• Information gain is the difference between the two

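To make the computation concrete, here is a minimal Python sketch of these two quantities (the names entropy and information_gain are illustrative, not from the slides); it is reused informally in the checks further below.

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    # Gain of splitting `rows` (a list of dicts) on `attribute`:
    # parent entropy minus the size-weighted entropies of the children.
    parent = entropy(labels)
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[attribute], []).append(label)
    weighted = sum(len(child) / len(labels) * entropy(child)
                   for child in children.values())
    return parent - weighted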
How to learn a decision tree
• Top‐down induction [ID3, C4.5, CART, …]

Refund?
├─ Yes → NO
└─ No  → MarSt?
         ├─ Single, Divorced → TaxInc?
         │                     ├─ < 80K → NO
         │                     └─ > 80K → YES
         └─ Married → NO

Person   Hair Length   Weight<161   Age<40   Class
Homer    Short         No           Yes      M
Marge    Long          Yes          Yes      F
Bart     Short         Yes          Yes      M
Lisa     Long          Yes          Yes      F
Maggie   Long          Yes          Yes      F
Abe      Short         No           No       M
Selma    Long          Yes          No       F
Otto     Long          No           Yes      M
Krusty   Long          No           No       M

Comic    Long          No           Yes      ?


Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Let us try splitting on Hair Length (Short vs. Long):

Gain(A) = E(current set) − Σ over children (|child| / |current|) · E(child set)

Gain(Hair Length) = 0.9911 − (3/9 · 0 + 6/9 · 0.9183) = 0.3789
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Let us try splitting on Weight (Weight < 161: yes vs. no):

Gain(A) = E(current set) − Σ over children (|child| / |current|) · E(child set)

Gain(Weight < 161) = 0.9911 − (5/9 · 0.7219 + 4/9 · 0) = 0.5900
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Let us try splitting on Age (Age < 40: yes vs. no):

Gain(A) = E(current set) − Σ over children (|child| / |current|) · E(child set)

Gain(Age < 40) = 0.9911 − (6/9 · 1 + 3/9 · 0.9183) = 0.0183
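As a check (a hypothetical snippet reusing the entropy/information_gain helpers sketched earlier, with the table encoded by hand), the three root-level gains come out as above:

people = [
    {"Hair": "Short", "Weight<161": "No",  "Age<40": "Yes"},   # Homer,  M
    {"Hair": "Long",  "Weight<161": "Yes", "Age<40": "Yes"},   # Marge,  F
    {"Hair": "Short", "Weight<161": "Yes", "Age<40": "Yes"},   # Bart,   M
    {"Hair": "Long",  "Weight<161": "Yes", "Age<40": "Yes"},   # Lisa,   F
    {"Hair": "Long",  "Weight<161": "Yes", "Age<40": "Yes"},   # Maggie, F
    {"Hair": "Short", "Weight<161": "No",  "Age<40": "No"},    # Abe,    M
    {"Hair": "Long",  "Weight<161": "Yes", "Age<40": "No"},    # Selma,  F
    {"Hair": "Long",  "Weight<161": "No",  "Age<40": "Yes"},   # Otto,   M
    {"Hair": "Long",  "Weight<161": "No",  "Age<40": "No"},    # Krusty, M
]
classes = ["M", "F", "M", "F", "F", "M", "F", "M", "M"]

for attr in ("Hair", "Weight<161", "Age<40"):
    print(attr, round(information_gain(people, classes, attr), 4))
# Hair 0.3789, Weight<161 0.59, Age<40 0.0183 -- Weight<161 wins at the root.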
Of the three features, Weight was best. But while people who weigh 161 or more are perfectly classified (as males), the under-161 group is not perfectly classified… so we simply recurse!
Weight < 161?
├─ no  → Male
└─ yes → recurse on this subset:

Person   Hair Length   Weight<161   Age<40   Class
Marge    Long          Yes          Yes      F
Bart     Short         Yes          Yes      M
Lisa     Long          Yes          Yes      F
Maggie   Long          Yes          Yes      F
Selma    Long          Yes          No       F
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
Let us try splitting on Hair Length (Short vs. Long) within the Weight < 161 subset:

Gain(A) = E(current set) − Σ over children (|child| / |current|) · E(child set)

Gain(Hair Length | Weight < 161) = 0.7219 − (1/5 · 0 + 4/5 · 0) = 0.7219
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
Let us try splitting on Age (Age < 40: yes vs. no) within the Weight < 161 subset:

Gain(A) = E(current set) − Σ over children (|child| / |current|) · E(child set)

Gain(Age < 40 | Weight < 161) = 0.7219 − (4/5 · 0.8113 + 1/5 · 0) = 0.0729
Of the three features, Weight was best. But while people who weigh 161 or more are perfectly classified (as males), the under-161 group is not perfectly classified… so we simply recurse!

This time we find that we can split on Hair Length, and we are done!

Weight < 161?
├─ no  → Male
└─ yes → Hair Length?
         ├─ Short → Male
         └─ Long  → Female
Person   Hair Length   Weight<161   Age<40   Class
Homer    Short         No           Yes      M
Marge    Long          Yes          Yes      F
Bart     Short         Yes          Yes      M
Lisa     Long          Yes          Yes      F
Maggie   Long          Yes          Yes      F
Abe      Short         No           No       M
Selma    Long          Yes          No       F
Otto     Long          No           Yes      M
Krusty   Long          No           No       M

Comic    Long          No           Yes      ?


Hair Length   Weight<161   Age<40   Class   Count
Short         No           Yes      M       1
Long          Yes          Yes      F       3
Short         Yes          Yes      M       1
Short         No           No       M       1
Long          Yes          No       F       1
Long          No           Yes      M       1
Long          No           No       M       1
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.

• Build a decision tree with continuous input features.
Next

• Build a quad decision tree.
Person   Hair Length   Weight<161   Age<40   Class
Homer    0"            No           Yes      M
Marge    10"           Yes          Yes      F
Bart     2"            Yes          Yes      M
Lisa     6"            Yes          Yes      F
Maggie   4"            Yes          Yes      F
Abe      1"            No           No       M
Selma    8"            Yes          No       F
Otto     10"           No           Yes      M
Krusty   6"            No           No       M
Real-Valued Inputs
What should we do if some of the input features are real-valued?

Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Example
Hair Length   0"   1"   2"   4"   6"   6"   8"   10"   10"
Class         M    M    M    F    F    M    F    F     M

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Example
Hair Length   0"   1"   2"   4"   6"   6"   8"   10"   10"
Class         M    M    M    F    F    M    F    F     M

However,…
• Gain(Hair Length < 4") = 0.9911 − (3/9 · 0 + 6/9 · 0.9183) = 0.3789
• Gain(Weight < 161) = 0.9911 − (5/9 · 0.7219 + 4/9 · 0) = 0.5900
• Gain(Age < 40) = 0.9911 − (6/9 · 1 + 3/9 · 0.9183) = 0.0183
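For a real-valued feature, a common approach is to sort its values and evaluate a binary split at every candidate threshold (for example, the midpoints between adjacent distinct values), keeping the threshold with the highest gain. A minimal sketch reusing the entropy helper from above (best_threshold is an illustrative name):

def best_threshold(values, labels):
    # Best binary split of the form `value < threshold`, scored by information gain.
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    parent, n = entropy(ys), len(ys)
    best = (None, 0.0)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                          # no boundary between equal values
        t = (xs[i] + xs[i - 1]) / 2           # candidate threshold: midpoint
        left, right = ys[:i], ys[i:]
        gain = parent - (len(left) / n * entropy(left)
                         + len(right) / n * entropy(right))
        if gain > best[1]:
            best = (t, gain)
    return best

hair = [0, 10, 2, 6, 4, 1, 8, 10, 6]          # inches, in the table's row order
sex  = ["M", "F", "M", "F", "F", "M", "F", "M", "M"]
print(best_threshold(hair, sex))              # (3.0, ~0.3789): the same split as "Hair Length < 4"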
Of the three features, Weight was best. But while people who weigh 161 or more are perfectly classified (as males), the under-161 group is not perfectly classified… so we simply recurse!
Weight < 161?
├─ no  → Male
└─ yes → recurse on this subset:

Person   Hair Length   Weight<161   Age<40   Class
Marge    10"           Yes          Yes      F
Bart     2"            Yes          Yes      M
Lisa     6"            Yes          Yes      F
Maggie   4"            Yes          Yes      F
Selma    8"            Yes          No       F
Example
Hair Length   2"   4"   6"   8"   10"
Class         M    F    F    F    F

Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219

Gain(Hair Length < 4" | Weight < 161) = 0.7219 − (1/5 · 0 + 4/5 · 0) = 0.7219
Gain(Age < 40 | Weight < 161) = 0.7219 − (4/5 · 0.8113 + 1/5 · 0) = 0.0729
Of the three features, Weight was best. But while people who weigh 161 or more are perfectly classified (as males), the under-161 group is not perfectly classified… so we simply recurse!
This time we find that we can split on Hair Length, and we are done!

Weight < 161?
├─ no  → Male
└─ yes → Hair Length < 4"?
         ├─ yes → Male
         └─ no  → Female
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.

• Build a decision tree with continuous input features.

• Build a quad decision tree.
Next
Decision Tree more generally…
• Features can be discrete, continuous or categorical
• Each internal node: test some set of features {Xi}
• Each branch from a node: selects a set of values for {Xi}
• Each leaf node: predict Y

[Figure: a 2D input space recursively partitioned into rectangles, each region labeled 0 or 1]
Person   Hair Length   Weight   Class
Homer    0"            250      M
Marge    10"           150      F
Bart     2"            90       M
Lisa     6"            78       F
Maggie   4"            20       F
Abe      1"            170      M
Selma    8"            160      F
Otto     10"           180      M
Krusty   6"            200      M

Hair length:  0-2  Short,  3-6  Medium,  ≥7  Long
Weight:       0-100  Light,  100-175  Normal,  ≥175  Heavy
Person   Hair Length   Weight    Class
Homer    Short         Heavy     M
Marge    Long          Normal    F
Bart     Short         Light     M
Lisa     Medium        Light     F
Maggie   Medium        Light     F
Abe      Short         Normal    M
Selma    Long          Normal    F
Otto     Long          Heavy     M
Krusty   Medium        Heavy     M

Hair length:  0-2  Short,  3-6  Medium,  ≥7  Long
Weight:       0-100  Light,  100-175  Normal,  ≥175  Heavy
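A minimal sketch of this discretization step (the bucket boundaries follow the key above; the helper names are illustrative):

def bucket_hair(inches):
    # 0-2 Short, 3-6 Medium, >=7 Long (per the key above)
    if inches <= 2:
        return "Short"
    if inches <= 6:
        return "Medium"
    return "Long"

def bucket_weight(pounds):
    # 0-100 Light, 100-175 Normal, >=175 Heavy
    if pounds < 100:
        return "Light"
    if pounds < 175:
        return "Normal"
    return "Heavy"

print(bucket_hair(6), bucket_weight(200))     # Krusty -> Medium Heavy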
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Let us try splitting on Hair Length: {Short} vs. {Medium, Long}
and Weight: {Light} vs. {Normal, Heavy}.
This gives four branches:

  (Short, Light)    (Short, {Normal, Heavy})    ({Medium, Long}, Light)    ({Medium, Long}, {Normal, Heavy})

Within the ({Medium, Long}, {Normal, Heavy}) branch, let us try re-splitting on
Hair Length: Medium vs. Long and Weight: Normal vs. Heavy, giving four sub-branches:

  (Medium, Normal)    (Medium, Heavy)    (Long, Normal)    (Long, Heavy)
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Splitting on Hair Length: {Short} vs. {Medium, Long} and Weight: {Light} vs. {Normal, Heavy},
then re-splitting the mixed ({Medium, Long}, {Normal, Heavy}) branch, gives:

  (Short, Light)                     → Male
  (Short, {Normal, Heavy})           → Male
  ({Medium, Long}, Light)            → Female
  ({Medium, Long}, {Normal, Heavy})  → re-split:
      (Medium, Normal)  → (no training examples)
      (Medium, Heavy)   → Male
      (Long, Normal)    → Female
      (Long, Heavy)     → Male
Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
[Figure: the quad split drawn again, with each branch labeled by its (Hair Length, Weight) combination, e.g. (Short, Weight ≥ 161) or (Long, Weight < 161), and each leaf labeled Male or Female]
Model Assessment and Selection: AIC & BIC
Bias, Variance, and Model Complexity
• Bias-Variance trade-off again
• Generalization: test-sample vs. training-sample performance
  – Training performance usually increases monotonically with model complexity
Measuring Performance
• target variable Y
• vector of inputs X
• prediction model f̂(X)

• Typical choices of loss function:
  L(Y, f̂(X)) = (Y − f̂(X))²    squared error
  L(Y, f̂(X)) = |Y − f̂(X)|     absolute error
Generalization Error
• Test error, a.k.a. generalization error:
  Err = E[ L(Y, f̂(X)) ]
  – Note: this expectation averages over everything that is random, including the randomness in the training sample that produced f̂

• Training error:
  err = (1/N) Σ_{i=1}^N L(y_i, f̂(x_i))
  – average loss over the training sample
  – not a good estimate of test error (next slide)
Training Error

• Training error and overfitting:
  – not a good estimate of test error
  – consistently decreases with model complexity
  – drops to zero with high enough complexity
Categorical Data
• The same ideas apply to categorical responses:
  p_k(X) = Pr(G = k | X),   Ĝ(X) = argmax_k p̂_k(X)

• Typical choices of loss function:
  L(G, Ĝ(X)) = I(G ≠ Ĝ(X))                                          0-1 loss
  L(G, p̂(X)) = −2 Σ_{k=1}^K I(G = k) log p̂_k(X) = −2 log p̂_G(X)     log-likelihood

• Test error again:      Err = E[ L(G, p̂(X)) ]
• Training error again:  err = −(2/N) Σ_{i=1}^N log p̂_{g_i}(x_i)
Log‐likelihood = cross‐entropy loss = deviance
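As a small illustration (the predicted class probabilities below are made-up numbers, not from the slides), both losses for a single prediction:

import math

p_hat = {"M": 0.2, "F": 0.8}                  # made-up predicted probabilities
true_class = "F"

zero_one = 0 if max(p_hat, key=p_hat.get) == true_class else 1
deviance = -2 * math.log(p_hat[true_class])   # -2 log p_hat_G(x)

print(zero_one, round(deviance, 4))           # 0 and about 0.4463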
Loss Function for General Densities
• For densities parameterized by θ, the log-likelihood can be used as a loss function:

  Pr_{θ(X)}(Y) ... density of Y with predictor X
  L(Y, θ(X)) = −2 log Pr_{θ(X)}(Y)
Two separate goals
• Model selection:
  – estimating the performance of different models in order to choose the (approximately) best one
• Model assessment:
  – having chosen a final model, estimating its prediction error (generalization error) on new data

• Ideal situation: split the data into 3 parts for training, validation (estimate prediction error + select model), and testing (assess the chosen model)
• Typical split: 50% / 25% / 25%

• Remainder of the chapter: the data-poor situation
  => approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample reuse (cross-validation, bootstrap)
Bias‐Variance Decomposition
Y = f(X) + ε,   E(ε) = 0,   Var(ε) = σ_ε²

• Then for an input point X = x₀, using squared-error loss and a regression fit:

  Err(x₀) = E[ (Y − f̂(x₀))² | X = x₀ ]
          = σ_ε² + [E f̂(x₀) − f(x₀)]² + E[ f̂(x₀) − E f̂(x₀) ]²
          = σ_ε² + Bias²[f̂(x₀)] + Var[f̂(x₀)]

  Irreducible error:  the variance of the target around its true mean
  Bias²:              the amount by which the average estimate differs from the true mean
  Variance:           the expected squared deviation of f̂(x₀) around its own mean
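A minimal Monte Carlo sketch of this decomposition (the true function, noise level, sample size, and the deliberately too-simple straight-line fit are all illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)                      # assumed "true" regression function
sigma, x0 = 0.3, 1.0                         # noise std dev and query point

preds = []
for _ in range(2000):                        # repeatedly draw a training set and refit
    x = rng.uniform(0, 3, size=30)
    y = f(x) + rng.normal(0, sigma, size=30)
    coefs = np.polyfit(x, y, deg=1)          # a biased (straight-line) fit
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2          # [E f_hat(x0) - f(x0)]^2
var = preds.var()                            # E[ f_hat(x0) - E f_hat(x0) ]^2
print(sigma**2, round(bias2, 4), round(var, 4))   # Err(x0) ~ sigma^2 + bias2 + var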
Bias‐Variance Decomposition
Err(x₀) = σ_ε² + Bias²[f̂(x₀)] + Var[f̂(x₀)]

kNN:
  Err(x₀) = σ_ε² + [ f(x₀) − (1/k) Σ_{l=1}^k f(x_(l)) ]² + σ_ε²/k

Linear model fit:  f̂_p(x) = β̂ᵀx
  Err(x₀) = σ_ε² + [ f(x₀) − E f̂_p(x₀) ]² + ‖h(x₀)‖² σ_ε²
  where h(x₀) = X(XᵀX)⁻¹x₀ is the N-vector of linear weights producing the fit f̂_p(x₀) = h(x₀)ᵀy
Bias‐Variance Decomposition
Linear model fit:  f̂_p(x) = β̂ᵀx

  Err(x₀) = σ_ε² + [ f(x₀) − E f̂_p(x₀) ]² + ‖h(x₀)‖² σ_ε²
  where h(x₀) = X(XᵀX)⁻¹x₀ ... the N-dimensional weight vector

Averaging over the sample values x_i:

  (1/N) Σ_{i=1}^N Err(x_i) = σ_ε² + (1/N) Σ_{i=1}^N [ f(x_i) − E f̂(x_i) ]² + (p/N) σ_ε²   ... the in-sample error

Model complexity is directly related to the number of parameters p.
Bias‐Variance Decomposition
Err(x₀) = σ_ε² + Bias²[f̂(x₀)] + Var[f̂(x₀)]

For ridge regression and other linear models the variance has the same form as before, but with different weights h(x₀).

Parameters of the best-fitting linear approximation:
  β_* = argmin_β E[ (f(X) − βᵀX)² ]

Further decompose the bias:
  E_{x₀}[ f(x₀) − E f̂_α(x₀) ]² = E_{x₀}[ f(x₀) − β_*ᵀx₀ ]² + E_{x₀}[ β_*ᵀx₀ − E β̂_αᵀx₀ ]²
                                = Ave[Model Bias]² + Ave[Estimation Bias]²

Least squares fits the best linear model → no estimation bias.
Restricted fits (e.g. ridge) → positive estimation bias, traded for reduced variance.
Optimism of the Training Error Rate
• Typically: training error rate < true error
  (the same data is used both to fit the method and to assess its error)

  err = (1/N) Σ_{i=1}^N L(y_i, f̂(x_i))   <   Err = E[ L(Y, f̂(X)) ]

  i.e., the training error is overly optimistic.
Optimism of the Training Error Rate
Err is a kind of extra-sample error: the test features need not coincide with the training feature vectors.

Focus instead on the in-sample error:
  Err_in = (1/N) Σ_{i=1}^N E_Y E_{Y^new}[ L(Y_i^new, f̂(x_i)) ]
  Y^new ... we observe N new response values, one at each of the training points x_i, i = 1, 2, ..., N

Optimism:  op ≡ Err_in − E_y(err)

For squared error, 0-1, and other loss functions:
  op = (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i)

The amount by which err underestimates the true error depends on how strongly y_i affects its own prediction.
Optimism of the Training Error Rate
Summary:
  Err_in = E_y(err) + (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i)

The harder we fit the data, the greater Cov(ŷ_i, y_i) will be, thereby increasing the optimism.

• For a linear fit with d independent inputs / basis functions:
  Err_in = E_y(err) + (2/N) d σ_ε²
  – optimism grows linearly with the number d of basis functions
  – optimism decreases as the training sample size grows
Optimism of the Training Error Rate
• Ways to estimate prediction error:
  – Estimate the optimism and add it to the training error rate err
    • AIC, BIC, and others work this way, for a special class of estimates that are linear in their parameters
  – Direct estimates of the test error Err
    • Cross-validation, bootstrap
    • Can be used with any loss function, and with nonlinear, adaptive fitting techniques
Estimates of In‐Sample Prediction Error
• General form of the in-sample estimate:
  Êrr_in = err + ôp,   i.e. training error plus an estimate of the optimism

• For a linear fit, where Err_in = E_y(err) + (2/N) d σ_ε²:
  C_p = err + (2d/N) σ̂_ε²   (the so-called C_p statistic)

  σ̂_ε² ... estimate of the noise variance, from the mean squared error of a low-bias model
  d    ... number of basis functions
  N    ... training sample size
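A minimal sketch of the C_p computation for nested least-squares fits (the simulated data, the candidate model sizes, and the use of the largest model to estimate the noise variance are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
N, p_max = 100, 10
X = rng.normal(size=(N, p_max))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=N)    # only the first 2 inputs matter

def train_err(d):
    # Training MSE of least squares on the first d columns (plus an intercept).
    Xd = np.column_stack([np.ones(N), X[:, :d]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.mean((y - Xd @ beta) ** 2)

# Low-bias estimate of the noise variance: residuals of the largest model.
Xf = np.column_stack([np.ones(N), X])
beta_f, *_ = np.linalg.lstsq(Xf, y, rcond=None)
sigma2_hat = np.sum((y - Xf @ beta_f) ** 2) / (N - p_max - 1)

for d in range(1, p_max + 1):
    cp = train_err(d) + 2 * (d + 1) / N * sigma2_hat   # d+1 counts the intercept
    print(d, round(cp, 3))
# C_p is typically smallest near d = 2, the number of truly useful inputs.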
Estimates of In‐Sample Prediction Error

• Similarly: the Akaike Information Criterion (AIC)
  – a more generally applicable estimate of Err_in when a log-likelihood loss is used

  For N → ∞:  −2 E[ log Pr_θ̂(Y) ] ≈ −(2/N) E[loglik] + 2 d/N

  Pr_θ(Y) ... a family of densities for Y (containing the true density)
  θ̂       ... the maximum-likelihood estimate of θ
  loglik  = Σ_{i=1}^N log Pr_θ̂(y_i)   ... the maximized log-likelihood
AIC
For N → ∞:  −2 E[ log Pr_θ̂(Y) ] ≈ −(2/N) E[loglik] + 2 d/N

For example, for the logistic regression model, using the binomial log-likelihood:
  AIC = −(2/N) · loglik + 2 · d/N

To use AIC for model selection: choose the model giving the smallest AIC over the set of models considered.

  AIC(α) = err(α) + 2 (d(α)/N) σ̂_ε²

  f̂_α(x) ... a set of models indexed by a tuning parameter α
  err(α) ... training error,   d(α) ... number of parameters
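A minimal sketch of AIC for Gaussian fits with a plug-in ML variance (the toy data, the two candidate fits, and the simplified parameter counts are illustrative assumptions, not the phoneme example that follows):

import numpy as np

def gaussian_aic(y, y_hat, d):
    # AIC = -(2/N) * loglik + 2 d / N, with the ML plug-in variance estimate.
    N = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    return -2 / N * loglik + 2 * d / N

y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])                 # made-up, nearly linear data
x = np.arange(5.0)
fits = {
    "constant": (np.full(5, y.mean()), 1),
    "line":     (np.poly1d(np.polyfit(x, y, 1))(x), 2),
}
for name, (y_hat, d) in fits.items():
    print(name, round(gaussian_aic(y, y_hat, d), 3))
# The line fits this data far better, so it gets the smaller AIC and is selected.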
AIC
• The function AIC(α) estimates the test error curve
• If the basis functions are chosen adaptively with d < p inputs, then
    Σ_{i=1}^N Cov(ŷ_i, y_i) = d σ_ε²
  no longer holds => the optimism exceeds (2d/N) σ_ε²
• i.e., the effective number of parameters fit is greater than d
Using AIC to select the # of basis functions
• Input vector: the log-periodogram of a spoken vowel, quantized to 256 uniformly spaced frequencies
• Linear logistic regression model
• Coefficient function: β(f) = Σ_{m=1}^M h_m(f) θ_m
  – an expansion in M spline basis functions
  – for any M, a basis of natural cubic splines is used for the h_m, with knots chosen uniformly over the range of frequencies, i.e. d(α) = d(M) = M
• AIC approximately minimizes Err(M) for both entropy and 0-1 loss

  (2/N) Σ_{i=1}^N Cov(ŷ_i, y_i) = (2d/N) σ_ε²   ... the simple formula for the linear case
Using AIC to select the # of basis functions

(2/N) Σ_{i=1}^N Cov(ŷ_i, y_i) = (2d/N) σ_ε²

The approximation does not hold in general for the 0-1 case, but it does OK.
(It is exact only for linear models with additive errors and squared-error loss.)
Effective Number of Parameters
y = (y₁, y₂, ..., y_N)ᵀ   ... vector of outcomes; similarly ŷ for the predictions

ŷ = S y   ... a linear fit (e.g. linear regression, quadratic shrinkage such as ridge, splines)

S ... an N × N matrix that depends on the input vectors x_i but not on the y_i

Effective number of parameters:  d(S) = trace(S)   (cf. Σ Cov(ŷ_i, y_i))

d(S) is the correct d to use in C_p = err + (2d/N) σ̂_ε².
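A minimal sketch of d(S) = trace(S) for ridge regression (random illustrative data; as lambda goes to 0 the effective d approaches the number of predictors p):

import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 5
X = rng.normal(size=(N, p))

def ridge_effective_df(X, lam):
    # trace of the ridge smoother S = X (X'X + lam I)^{-1} X'
    S = X @ np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T
    return np.trace(S)

for lam in (0.0, 1.0, 10.0, 100.0):
    print(lam, round(ridge_effective_df(X, lam), 2))
# lam = 0 recovers d(S) = p = 5; increasing lam shrinks the effective number of parameters.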
Bayesian Approach and BIC
• Like AIC, it is used when fitting is carried out by maximizing a log-likelihood

Bayesian Information Criterion (BIC):
  BIC = −2 · loglik + (log N) · d

Assuming a Gaussian model with σ_ε² known:
  −2 · loglik ≈ Σ_i (y_i − f̂(x_i))² / σ_ε² = N · err / σ_ε²
  so BIC = (N/σ_ε²) · [ err + (log N) · (d/N) · σ_ε² ]

BIC is proportional to AIC, but with log(N) in place of the factor 2. For N > e² (≈ 7.4), BIC penalizes complex models more heavily.
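A tiny illustration of that last point, comparing the per-parameter penalties 2 (AIC) and log N (BIC):

import math

for n in (5, 7, 8, 20, 100, 1000):
    aic_pen, bic_pen = 2.0, math.log(n)
    print(n, aic_pen, round(bic_pen, 2), "BIC harsher" if bic_pen > aic_pen else "AIC harsher")
# log(N) exceeds 2 once N > e^2 ~ 7.39, so BIC penalizes extra parameters more for N >= 8.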
BIC Motivation
• Given a set of candidate models M_m, m = 1, ..., M, with model parameters θ_m
• Posterior probability of a given model:  Pr(M_m | Z) ∝ Pr(M_m) · Pr(Z | M_m)
  where Z represents the training data {x_i, y_i}, i = 1, ..., N
• To compare two models, form the posterior odds:
  Pr(M_m | Z) / Pr(M_l | Z) = [ Pr(M_m) / Pr(M_l) ] · [ Pr(Z | M_m) / Pr(Z | M_l) ]
• If the odds are > 1, choose model m. The prior over models (the first factor) is considered constant. The second factor, the contribution of the data Z to the posterior odds, is called the Bayes factor BF(Z).
• We need to approximate Pr(Z | M_m); various chicanery and approximations (pp. 207) get us BIC.
• We can then estimate the posterior from the BIC and compare the relative merits of the models.
BIC: How much better is a model?
• But we may want to know how the various models stack up relative to one another, not just their ranking.
• Once we have the BIC of each model, the denominator below normalizes the result, and we can assess the relative merits of each model:

  estimate of Pr(M_m | Z) ≡ exp(−½ BIC_m) / Σ_{l=1}^M exp(−½ BIC_l)
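A minimal sketch of this normalization (the BIC values are made-up; subtracting the minimum before exponentiating is a standard numerical-stability step):

import numpy as np

bic = np.array([210.3, 204.1, 205.0])          # made-up BIC values for 3 candidate models
w = np.exp(-0.5 * (bic - bic.min()))           # shift by the minimum for stability
posterior = w / w.sum()
print(posterior.round(3))                      # approximate Pr(M_m | Z) for each model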
