Recitation: Decision Trees & Model Selection (AIC & BIC)
Min Chi
Oct 7th, 2010
NSH 1507
More on Decision Trees
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.
• Build a decision tree with continuous input features.
• Build a quad decision tree.
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.   <- Next
• Build a decision tree with continuous input features.
• Build a quad decision tree.
Information Gain
• Advantage of an attribute: decrease in uncertainty
  – Entropy of Y before the split
  – Entropy of Y after splitting based on Xi
    • weighted by the probability of following each branch
• Information gain is the difference
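The quantities on this slide can be computed in a few lines. A minimal sketch (assuming a discrete split feature and a list of class labels; the function names are illustrative, not from the lecture):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of Y (in bits) for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(x_values, labels):
    """Entropy of Y before the split, minus the entropy after splitting on X,
    where each branch is weighted by the probability of following it."""
    n = len(labels)
    branches = {}
    for x, y in zip(x_values, labels):
        branches.setdefault(x, []).append(y)
    after = sum(len(branch) / n * entropy(branch) for branch in branches.values())
    return entropy(labels) - after

# Example: splitting nine M/F labels on a yes/no attribute
print(information_gain(["no", "yes", "yes", "yes", "yes", "no", "yes", "no", "no"],
                       ["M", "F", "M", "F", "F", "M", "F", "M", "M"]))   # about 0.59
```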
How to learn a decision tree
• Top‐down induction [ID3, C4.5, CART, …]
Example tree from the slide:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO
Person   Hair Length   Weight < 161   Age < 40   Class
Homer    0"            No             Yes        M
Marge    10"           Yes            Yes        F
Bart     2"            Yes            Yes        M
Lisa     6"            Yes            Yes        F
Maggie   4"            Yes            Yes        F
Abe      1"            No             No         M
Selma    8"            Yes            No         F
Otto     10"           No             Yes        M
Krusty   6"            No             No         M
Real-Valued Inputs
What should we do if some of the input features are real-valued?

$$\mathrm{Entropy}(S) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$$
Example
Hair Length:  0"  1"  2"  4"  6"  6"  8"  10"  10"
Class:        M   M   M   F   F   M   F   F    M

Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Example
Hair Length:  0"  1"  2"  4"  6"  6"  8"  10"  10"
Class:        M   M   M   F   F   M   F   F    M

However, …

• Gain(Hair Length < 4'') = 0.9911 – (3/9 * 0 + 6/9 * 0.9183) = 0.3789
• Gain(Weight < 161)      = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900
• Gain(Age < 40)          = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183
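These three gains can be checked numerically. A small sketch, with the data transcribed from the table above (everything else, names and structure, is illustrative):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def gain(split, labels):
    """Information gain of a boolean split (a list of True/False, one entry per row)."""
    n = len(labels)
    parts = ([y for s, y in zip(split, labels) if s],
             [y for s, y in zip(split, labels) if not s])
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in parts if p)

# (hair length in inches, weight < 161, age < 40, class), transcribed from the table
data = [(0, False, True, "M"), (10, True, True, "F"), (2, True, True, "M"),
        (6, True, True, "F"), (4, True, True, "F"), (1, False, False, "M"),
        (8, True, False, "F"), (10, False, True, "M"), (6, False, False, "M")]
labels = [row[3] for row in data]

print(gain([row[0] < 4 for row in data], labels))   # Hair Length < 4''  -> ~0.3789
print(gain([row[1] for row in data], labels))       # Weight < 161       -> ~0.5900
print(gain([row[2] for row in data], labels))       # Age < 40           -> ~0.0183
```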
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified… So we simply recurse!

[Partial tree: root node "Weight < 161?" with yes/no branches; the yes branch (the table restricted to Weight < 161) is split further]

• Gain(Hair Length < 4'', Weight < 161) = 0.7219 – (1/5 * 0 + 4/5 * 0) = 0.7219
• Gain(Age < 40, Weight < 161) = 0.7219 – (4/5 * 0.8113 + 1/5 * 0) = 0.0729
Hair Length < 4'' gives the larger gain on this branch, so it becomes the second split, yielding the final tree:
Weight < 161?
  yes -> Hair Length < 4''?
           yes -> Male
           no  -> Female
  no  -> Male
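The "so we simply recurse" step is just greedy top-down induction. A compact sketch of that idea over boolean splits (the data encoding, stopping rules, and names below are illustrative assumptions, not the lecture's code):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def build_tree(rows, labels, splits):
    """rows: feature records; splits: dict mapping a split name to a predicate row -> bool."""
    if not labels:
        return None                                   # no examples reached this branch
    if entropy(labels) == 0:
        return labels[0]                              # pure leaf: predict the single class
    if not splits:
        return max(set(labels), key=labels.count)     # majority-class leaf

    n = len(labels)
    def gain(pred):
        yes = [y for r, y in zip(rows, labels) if pred(r)]
        no = [y for r, y in zip(rows, labels) if not pred(r)]
        return entropy(labels) - sum(len(b) / n * entropy(b) for b in (yes, no) if b)

    best = max(splits, key=lambda name: gain(splits[name]))   # highest information gain
    pred, rest = splits[best], {k: v for k, v in splits.items() if k != best}
    yes = [i for i, r in enumerate(rows) if pred(r)]
    no = [i for i, r in enumerate(rows) if not pred(r)]
    return {best: {
        "yes": build_tree([rows[i] for i in yes], [labels[i] for i in yes], rest),
        "no": build_tree([rows[i] for i in no], [labels[i] for i in no], rest),
    }}

# Simpsons data from the slides: (hair length, weight < 161, age < 40) and class labels
rows = [(0, 0, 1), (10, 1, 1), (2, 1, 1), (6, 1, 1), (4, 1, 1),
        (1, 0, 0), (8, 1, 0), (10, 0, 1), (6, 0, 0)]
labels = ["M", "F", "M", "F", "F", "M", "F", "M", "M"]
splits = {"hair < 4": lambda r: r[0] < 4,
          "weight < 161": lambda r: bool(r[1]),
          "age < 40": lambda r: bool(r[2])}
print(build_tree(rows, labels, splits))
# {'weight < 161': {'yes': {'hair < 4': {'yes': 'M', 'no': 'F'}}, 'no': 'M'}}
```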
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.
• Build a decision tree with continuous input features.
• Build a quad decision tree.   <- Next
Decision Tree more generally…
• Features can be discrete, continuous, or categorical
• Each internal node: test some set of features {Xi}
• Each branch from a node: selects a set of values for {Xi}
• Each leaf node: predict Y

[Figure: a 2-D feature space partitioned into rectangular regions, each region labeled with its predicted class (0 or 1)]
Person   Hair Length   Weight   Class
Homer    0"            250      M
Marge    10"           150      F
Bart     2"            90       M
Lisa     6"            78       F
Maggie   4"            20       F
Abe      1"            170      M
Selma    8"            160      F
Otto     10"           180      M
Krusty   6"            200      M

Hair length:  0-2 Short,   3-6 Medium,     ≥ 7 Long
Weight:       0-100 Light, 100-175 Normal, ≥ 175 Heavy
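The discretization above is a simple binning step. A sketch, with the bin edges taken from the slide (the treatment of the boundary values 100 and 175, and all names, are assumptions):

```python
def hair_category(inches):
    # bins from the slide: 0-2 Short, 3-6 Medium, >= 7 Long
    return "Short" if inches <= 2 else ("Medium" if inches <= 6 else "Long")

def weight_category(pounds):
    # bins from the slide: 0-100 Light, 100-175 Normal, >= 175 Heavy
    # (the slide's ranges meet at 100 and 175; treating the bins as half-open is an assumption)
    return "Light" if pounds < 100 else ("Normal" if pounds < 175 else "Heavy")

people = [("Homer", 0, 250), ("Marge", 10, 150), ("Bart", 2, 90)]
for name, hair, weight in people:
    print(name, hair_category(hair), weight_category(weight))
# Homer Short Heavy / Marge Long Normal / Bart Short Light   (matches the next slide's table)
```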
Person   Hair Length   Weight   Class
Homer    Short         Heavy    M
Marge    Long          Normal   F
Bart     Short         Light    M
(remaining rows discretized the same way)

Hair length:  0-2 Short,   3-6 Medium,     ≥ 7 Long
Weight:       0-100 Light, 100-175 Normal, ≥ 175 Heavy

[Quad decision tree: each node tests Hair Length and Weight together, giving four branches.
 Root: (Short, Light) -> Male; (Short, {Normal, Heavy}) -> Male; ({M, L}, Light) -> Female;
 ({M, L}, {Normal, Heavy}) -> split again: (L, Normal) -> Female, (L, Heavy) -> Male, (M, Heavy) -> Male]
Model Selection: AIC & BIC
Bias, Variance, and Model Complexity
• Bias-variance trade-off again
• Generalization: test-sample vs. training-sample performance
  – Performance on the training data usually increases monotonically with model complexity
Measuring Performance
• Target variable Y
• Vector of inputs X
• Prediction model f̂(X)
• Typical choices of loss function:

$$L\big(Y, \hat{f}(X)\big) = \begin{cases} \big(Y - \hat{f}(X)\big)^2 & \text{squared error} \\ \big|Y - \hat{f}(X)\big| & \text{absolute error} \end{cases}$$
Generalization Error
• Test error, aka generalization error:

$$\mathrm{Err} = E\big[ L\big(Y, \hat{f}(X)\big) \big]$$

• Note: this expectation averages over everything that is random, including the randomness in the training sample that produced f̂
• Training error:

$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat{f}(x_i)\big)$$

  – average loss over the training sample
  – not a good estimate of the test error (next slide)
Training Error
• Training error – overfitting
  – not a good estimate of test error
  – consistently decreases with model complexity
  – drops to zero with high enough complexity
Categorical Data
• Same for categorical responses:
  p_k(X) = Pr(G = k | X),  Ĝ(X) = argmax_k p̂_k(X)
• Typical choices of loss functions:

$$L\big(G, \hat{G}(X)\big) = I\big(G \neq \hat{G}(X)\big) \quad \text{(0-1 loss)}$$

$$L\big(G, \hat{p}(X)\big) = -2\sum_{k=1}^{K} I(G = k)\,\log \hat{p}_k(X) = -2\log \hat{p}_G(X) \quad \text{(log-likelihood)}$$

• Test error again: Err = E[L(G, p̂(X))]
• Training error again:

$$\overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^{N} \log \hat{p}_{g_i}(x_i)$$

• Log-likelihood = cross-entropy loss = deviance
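A small sketch of the two losses for a single categorical prediction (the classes and probabilities are made up for illustration):

```python
import math

def zero_one_loss(g_true, g_pred):
    """0-1 loss: I(G != G_hat)."""
    return float(g_true != g_pred)

def deviance_loss(g_true, p_hat):
    """-2 * log-likelihood loss; p_hat maps each class k to its estimated probability p_hat_k."""
    return -2.0 * math.log(p_hat[g_true])

p_hat = {"spam": 0.7, "ham": 0.3}            # estimated class probabilities for one example
g_hat = max(p_hat, key=p_hat.get)            # G_hat(X) = argmax_k p_hat_k(X)
print(zero_one_loss("ham", g_hat))           # 1.0, since the argmax class is "spam"
print(deviance_loss("ham", p_hat))           # -2 * log(0.3), about 2.41
```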
Loss Function for General Densities
• For densities parameterized by θ, the log-likelihood can be used as a loss function:

$$L\big(Y, \theta(X)\big) = -2\,\log \Pr_{\theta(X)}(Y)$$
• Ideal situation: split the data into three parts for training, validation (estimate prediction error + select model), and testing (assess model)
• Typical split: 50% / 25% / 25%
• Remainder of the chapter: data-poor situation
  => approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample reuse (cross-validation, bootstrap)
Bias‐Variance Decomposition
$$Y = f(X) + \varepsilon, \quad E(\varepsilon) = 0, \quad \mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$$

• Then for an input point X = x_0, using squared-error loss and a regression fit:

$$\begin{aligned}
\mathrm{Err}(x_0) &= E\big[ \big(Y - \hat{f}(x_0)\big)^2 \mid X = x_0 \big] \\
&= \sigma_\varepsilon^2 + \big[ E\hat{f}(x_0) - f(x_0) \big]^2 + E\big[ \hat{f}(x_0) - E\hat{f}(x_0) \big]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2\big[\hat{f}(x_0)\big] + \mathrm{Var}\big[\hat{f}(x_0)\big]
\end{aligned}$$

  – Irreducible error: variance of the target around its true mean
  – Bias²: amount by which the average estimate differs from the true mean
  – Variance: expected squared deviation of f̂(x_0) around its mean
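The decomposition can be checked by simulation at a single point x0. A sketch, where the true function, the noise level, and the k-nearest-neighbour estimator are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                    # "true" regression function, chosen only for illustration
sigma = 0.3                   # noise standard deviation, so the irreducible error is sigma**2
x0, k, n_train, n_reps = 1.0, 5, 50, 20000

preds, errs = [], []
for _ in range(n_reps):
    x = rng.uniform(0.0, 2.0, n_train)
    y = f(x) + rng.normal(0.0, sigma, n_train)
    nearest = np.argsort(np.abs(x - x0))[:k]          # k-NN regression fit at x0
    f_hat_x0 = y[nearest].mean()
    preds.append(f_hat_x0)
    y0 = f(x0) + rng.normal(0.0, sigma)               # a fresh test response at x0
    errs.append((y0 - f_hat_x0) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2                   # squared bias of f_hat(x0)
var = preds.var()                                     # variance of f_hat(x0) over training sets
print(np.mean(errs))                                  # simulated Err(x0)
print(sigma**2 + bias2 + var)                         # sigma^2 + Bias^2 + Var: should roughly agree
```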
Bias‐Variance Decomposition
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big[\hat{f}(x_0)\big] + \mathrm{Var}\big[\hat{f}(x_0)\big]$$

k-NN:

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[ f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f\big(x_{(l)}\big) \Big]^2 + \frac{\sigma_\varepsilon^2}{k}$$

Linear model fit f̂_p(x) = β̂ᵀx:

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \big[ f(x_0) - E\hat{f}_p(x_0) \big]^2 + \| h(x_0) \|^2 \sigma_\varepsilon^2$$

where h(x_0) = X(XᵀX)⁻¹x_0 is the N-vector of linear weights producing the fit f̂_p(x_0) = h(x_0)ᵀy
Bias‐Variance Decomposition
Linear model fit f̂_p(x) = β̂ᵀx:

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \big[ f(x_0) - E\hat{f}_p(x_0) \big]^2 + \| h(x_0) \|^2 \sigma_\varepsilon^2$$

Averaged over the training inputs (in-sample error):

$$\frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^{N} \big[ f(x_i) - E\hat{f}(x_i) \big]^2 + \frac{p}{N}\,\sigma_\varepsilon^2$$

Model complexity is directly related to the number of parameters p.
Bias‐Variance Decomposition
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big[\hat{f}(x_0)\big] + \mathrm{Var}\big[\hat{f}(x_0)\big]$$

For ridge regression and other linear models, the variance has the same form as before, but with different weights.

Parameters of the best-fitting linear approximation:

$$\beta_* = \arg\min_{\beta} E\big( f(X) - \beta^T X \big)^2$$

Further decompose the bias into model bias (between the best linear fit and the true function) and estimation bias (between the average estimate and the best linear fit):
  – Least squares fits the best linear model -> no estimation bias
  – Restricted fits -> positive estimation bias, accepted in favor of reduced variance
Optimism of the Training Error Rate
• Typically: training error rate < true error
  (the same data are being used both to fit the method and to assess its error)

$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat{f}(x_i)\big) \;<\; \mathrm{Err} = E\big[ L\big(Y, \hat{f}(X)\big) \big]$$

  => overly optimistic
Optimism of the Training Error Rate
• Err is a kind of extra-sample error: test features need not coincide with the training feature vectors
• Focus on the in-sample error:

$$\mathrm{Err}_{in} = \frac{1}{N}\sum_{i=1}^{N} E_{Y}\, E_{Y^{new}}\big[ L\big(Y_i^{new}, \hat{f}(x_i)\big) \big]$$

  Y^new … we observe N new response values at each of the training points x_i, i = 1, 2, …, N
• Optimism: op ≡ Err_in − E_y(err)
• For squared error, 0-1, and other loss functions:

$$op = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)$$

• The amount by which err underestimates the true error depends on how strongly y_i affects its own prediction.
Optimism of the Training Error Rate
• Summary:

$$\mathrm{Err}_{in} = E_y(\overline{\mathrm{err}}) + \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)$$

  The harder we fit the data, the greater Cov(ŷ_i, y_i) will be, thereby increasing the optimism.
• For a linear fit with d independent inputs / basis functions:

$$\mathrm{Err}_{in} = E_y(\overline{\mathrm{err}}) + 2\,\frac{d}{N}\,\sigma_\varepsilon^2$$

  – Optimism increases linearly with the number d of basis functions
  – Optimism decreases as the training sample size N increases
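A simulation sketch of this relation for an ordinary least-squares fit: the in-sample error, estimated with fresh responses at the training inputs, should roughly match the average training error plus 2dσ²/N (the design, coefficients, and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 1.0
X = rng.normal(size=(N, d))                   # fixed design with d basis functions
beta = rng.normal(size=d)
f_true = X @ beta                             # true mean response at the training inputs

train_errs, in_sample_errs = [], []
for _ in range(5000):
    y = f_true + rng.normal(0.0, sigma, N)                    # training responses
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]          # least-squares fit
    train_errs.append(np.mean((y - y_hat) ** 2))              # err (training error)
    y_new = f_true + rng.normal(0.0, sigma, N)                # fresh responses at the same x_i
    in_sample_errs.append(np.mean((y_new - y_hat) ** 2))      # contribution to Err_in

print(np.mean(in_sample_errs))                     # simulated Err_in
print(np.mean(train_errs) + 2 * d * sigma**2 / N)  # E_y(err) + (2d/N)*sigma^2: should match
```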
Optimism of the Training Error Rate
• Ways to estimate prediction error:
  – Estimate the optimism and then add it to the training error rate err
    • AIC, BIC, and others work this way, for a special class of estimates that are linear in their parameters
  – Direct estimates of the error Err
    • Cross-validation, bootstrap
    • Can be used with any loss function, and with nonlinear, adaptive fitting techniques
Estimates of In‐Sample Prediction Error
• General form of the in-sample estimate: Êrr_in = err + ôp, with an estimate of the optimism
• For a linear fit and with Err_in = E_y(err) + 2(d/N)σ_ε²:

$$C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat{\sigma}_\varepsilon^2, \quad \text{the so-called } C_p \text{ statistic}$$

• Similarly: Akaike Information Criterion (AIC)
  – a more generally applicable estimate of Err_in, used when a log-likelihood loss function is employed

$$\text{For } N \to \infty: \quad -2\,E\big[ \log \Pr\nolimits_{\hat{\theta}}(Y) \big] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}$$

  Pr_θ(Y) … family of densities for Y (containing the true density)
  θ̂ … ML estimate of θ
  loglik = Σ_{i=1}^{N} log Pr_θ̂(y_i) … maximized log-likelihood under the ML estimate of θ
AIC
$$\text{For } N \to \infty: \quad -2\,E\big[ \log \Pr\nolimits_{\hat{\theta}}(Y) \big] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}$$

For example, for the logistic regression model, using the binomial log-likelihood:

$$\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}$$

To use AIC for model selection: choose the model giving the smallest AIC over the set of models considered.

$$\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat{\sigma}_\varepsilon^2$$

  f̂_α(x) … set of models, α … tuning parameter
  err(α) … training error, d(α) … number of parameters
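A sketch of the C_p / AIC(α) formula above for least-squares fits of increasing complexity (the data-generating process and the polynomial basis are illustrative; σ̂² is estimated from a deliberately flexible, low-bias fit, which is one common choice):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-2.0, 2.0, N)
y = np.sin(x) + rng.normal(0.0, 0.3, N)                 # illustrative data

def design(x, degree):
    # polynomial basis with d = degree + 1 parameters
    return np.vander(x, degree + 1, increasing=True)

# sigma^2 estimated from a deliberately flexible, low-bias fit
X_big = design(x, 15)
resid = y - X_big @ np.linalg.lstsq(X_big, y, rcond=None)[0]
sigma2_hat = resid @ resid / (N - X_big.shape[1])

for degree in range(1, 9):
    X = design(x, degree)
    d = X.shape[1]
    err = np.mean((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)   # training error
    cp = err + 2.0 * d / N * sigma2_hat                                  # err + (2d/N)*sigma^2_hat
    print(degree, round(cp, 4))
# model selection: keep the degree with the smallest C_p / AIC value
```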
AIC
• The function AIC(α) estimates the test error curve
• If the basis functions are chosen adaptively with d < p inputs, then

$$\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = d\,\sigma_\varepsilon^2$$

  no longer holds => the optimism exceeds (2d/N)σ_ε²
• The effective number of parameters fit is then > d
Using AIC to select the # of basis functions
• Input vector: log-periodogram of a vowel, quantized to 256 uniformly spaced frequencies f
• Linear logistic regression model
• Coefficient function:

$$\beta(f) = \sum_{m=1}^{M} h_m(f)\,\theta_m$$

  – an expansion in M spline basis functions
  – for any M, a basis of natural cubic splines is used for the h_m, with knots chosen uniformly over the range of frequencies, so d(α) = d(M) = M
• AIC approximately minimizes Err(M) for both entropy and 0-1 loss

$$\frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = 2\,\frac{d}{N}\,\sigma_\varepsilon^2 \quad \text{… simple formula for the linear case}$$
Using AIC to select the # of basis functions
$$\frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = 2\,\frac{d}{N}\,\sigma_\varepsilon^2$$

The approximation does not hold, in general, for the 0-1 case, but it does OK (it is exact only for linear models with additive errors and squared-error loss).
Effective Number of Parameters
$$y = (y_1, y_2, \ldots, y_N)^T \quad \text{vector of outcomes, similarly for predictions}$$

$$\hat{y} = S\,y \quad \text{linear fit (e.g. linear regression, quadratic shrinkage such as ridge, splines)}$$
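With a linear fit ŷ = Sy, the effective number of parameters is commonly taken to be trace(S) (this definition is not spelled out on the slide, but it is the standard one used with AIC/BIC). A sketch for ridge regression, with an invented design matrix and penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 50, 10, 5.0
X = rng.normal(size=(N, p))                   # invented design matrix

# Ridge regression is a linear fit y_hat = S y with S = X (X^T X + lam I)^(-1) X^T
S_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
S_ols = X @ np.linalg.solve(X.T @ X, X.T)

print(np.trace(S_ridge))   # effective number of parameters, < p for lam > 0
print(np.trace(S_ols))     # plain least squares: trace(S) = p
```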
Bayesian Information Criterion (BIC):
$$\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d$$

Assuming a Gaussian model with σ_ε² known,

$$-2\,\mathrm{loglik} \approx \sum_i \big( y_i - \hat{f}(x_i) \big)^2 / \sigma_\varepsilon^2 = N\cdot\overline{\mathrm{err}}/\sigma_\varepsilon^2,$$

so

$$\mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2}\Big[ \overline{\mathrm{err}} + (\log N)\cdot\frac{d}{N}\cdot\sigma_\varepsilon^2 \Big]$$

BIC is proportional to AIC, except with log(N) in place of the factor 2. For N > e² (approximately 7.4), BIC penalizes complex models more heavily.
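A sketch of this Gaussian-case BIC, with σ_ε² assumed known and illustrative data and candidate models:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2 = 200, 0.25                                   # sigma_eps^2 assumed known here
x = rng.uniform(-2.0, 2.0, N)
y = 1.0 + 0.5 * x - 0.4 * x**2 + rng.normal(0.0, np.sqrt(sigma2), N)

for degree in range(0, 6):
    X = np.vander(x, degree + 1, increasing=True)       # polynomial model with d = degree + 1
    d = X.shape[1]
    err = np.mean((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    # BIC = (N / sigma2) * [err + log(N) * (d / N) * sigma2]   (Gaussian, sigma2 known)
    bic = N / sigma2 * (err + np.log(N) * d / N * sigma2)
    print(degree, round(bic, 1))
# the smallest BIC should typically select the true quadratic model (degree 2)
```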
BIC Motivation
• Given a set of candidate models M_m, m = 1, …, M, with model parameters θ_m
• Posterior probability of a given model: Pr(M_m | Z) ∝ Pr(M_m) · Pr(Z | M_m),
  where Z represents the training data {x_i, y_i}_1^N
• To compare two models, form the posterior odds:

$$\frac{\Pr(M_m \mid Z)}{\Pr(M_l \mid Z)} = \frac{\Pr(M_m)}{\Pr(M_l)} \cdot \frac{\Pr(Z \mid M_m)}{\Pr(Z \mid M_l)}$$

• If the odds are > 1, choose model m. The prior over models (left factor) is considered constant. The right factor, the contribution of the data Z to the posterior odds, is called the Bayes factor BF(Z).
• We need to approximate Pr(Z | M_m). Various chicanery and approximations (pp. 207) get us BIC.
• We can estimate the posterior from BIC and compare the relative merits of models.
BIC: How much better is a model?
• But we may want to know how the various models stack up relative to one another (not just their ranking).
• Once we have the BIC for each model:

$$\text{estimate of } \Pr(M_m \mid Z) \;\equiv\; \frac{e^{-\frac{1}{2}\mathrm{BIC}_m}}{\sum_{l=1}^{M} e^{-\frac{1}{2}\mathrm{BIC}_l}}$$

• The denominator normalizes the result, and now we can assess the relative merits of each model.
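A sketch of turning BIC values into these relative posterior estimates (the BIC values themselves are made up):

```python
import numpy as np

bics = np.array([212.8, 215.4, 230.9])    # illustrative BIC values for three candidate models

# Pr(M_m | Z) is approximated by exp(-BIC_m / 2) / sum_l exp(-BIC_l / 2);
# subtracting the minimum BIC first keeps the exponentials well scaled
w = np.exp(-(bics - bics.min()) / 2.0)
posterior = w / w.sum()
print(posterior)                           # e.g. roughly [0.79, 0.21, 0.00]
```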