Recitation: Decision Trees & Model Selection (AIC & BIC)
Min Chi
Oct 7th, 2010
NSH 1507
More on Decision Trees
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.
• Build a decision tree with continuous input features.
• Build a quad decision tree.
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.   <- Next
• Build a decision tree with continuous input features.
• Build a quad decision tree.
Information Gain
• Advantage of an attribute: decrease in uncertainty
  – Entropy of Y before the split
  – Entropy of Y after splitting based on Xi
    • weighted by the probability of following each branch
• Information gain is the difference
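The quantities on this slide can be computed in a few lines. A minimal sketch (assuming a discrete split feature and a list of class labels; the function names are illustrative, not from the lecture):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of Y (in bits) for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(x_values, labels):
    """Entropy of Y before the split, minus the entropy after splitting on X,
    where each branch is weighted by the probability of following it."""
    n = len(labels)
    branches = {}
    for x, y in zip(x_values, labels):
        branches.setdefault(x, []).append(y)
    after = sum(len(branch) / n * entropy(branch) for branch in branches.values())
    return entropy(labels) - after

# Example: splitting nine M/F labels on a yes/no attribute
print(information_gain(["no", "yes", "yes", "yes", "yes", "no", "yes", "no", "no"],
                       ["M", "F", "M", "F", "F", "M", "F", "M", "M"]))   # about 0.59
```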
How to learn a decision tree
• Top‐down induction [ID3, C4.5, CART, …]
Example tree from the slide:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO
Person   Hair Length   Weight < 161   Age < 40   Class
Homer    0"            No             Yes        M
Marge    10"           Yes            Yes        F
Bart     2"            Yes            Yes        M
Lisa     6"            Yes            Yes        F
Maggie   4"            Yes            Yes        F
Abe      1"            No             No         M
Selma    8"            Yes            No         F
Otto     10"           No             Yes        M
Krusty   6"            No             No         M
Real-Valued Inputs
What should we do if some of the input features are real-valued?

$$\mathrm{Entropy}(S) = -\frac{p}{p+n}\log_2\!\left(\frac{p}{p+n}\right) - \frac{n}{p+n}\log_2\!\left(\frac{n}{p+n}\right)$$
Example
Hair Length:  0"  1"  2"  4"  6"  6"  8"  10"  10"
Class:        M   M   M   F   F   M   F   F    M

Entropy(4F, 5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911
Example
Hair Length:  0"  1"  2"  4"  6"  6"  8"  10"  10"
Class:        M   M   M   F   F   M   F   F    M

However, …

• Gain(Hair Length < 4'') = 0.9911 – (3/9 * 0 + 6/9 * 0.9183) = 0.3789
• Gain(Weight < 161)      = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900
• Gain(Age < 40)          = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183
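These three gains can be checked numerically. A small sketch, with the data transcribed from the table above (everything else, names and structure, is illustrative):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def gain(split, labels):
    """Information gain of a boolean split (a list of True/False, one entry per row)."""
    n = len(labels)
    parts = ([y for s, y in zip(split, labels) if s],
             [y for s, y in zip(split, labels) if not s])
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in parts if p)

# (hair length in inches, weight < 161, age < 40, class), transcribed from the table
data = [(0, False, True, "M"), (10, True, True, "F"), (2, True, True, "M"),
        (6, True, True, "F"), (4, True, True, "F"), (1, False, False, "M"),
        (8, True, False, "F"), (10, False, True, "M"), (6, False, False, "M")]
labels = [row[3] for row in data]

print(gain([row[0] < 4 for row in data], labels))   # Hair Length < 4''  -> ~0.3789
print(gain([row[1] for row in data], labels))       # Weight < 161       -> ~0.5900
print(gain([row[2] for row in data], labels))       # Age < 40           -> ~0.0183
```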
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified… So we simply recurse!

[Partial tree: root node "Weight < 161?" with yes/no branches; the yes branch (the table restricted to Weight < 161) is split further]

• Gain(Hair Length < 4'', Weight < 161) = 0.7219 – (1/5 * 0 + 4/5 * 0) = 0.7219
• Gain(Age < 40, Weight < 161) = 0.7219 – (4/5 * 0.8113 + 1/5 * 0) = 0.0729
Hair Length < 4'' gives the larger gain on this branch, so it becomes the second split, yielding the final tree:
Weight < 161?
  yes -> Hair Length < 4''?
           yes -> Male
           no  -> Female
  no  -> Male
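The "so we simply recurse" step is just greedy top-down induction. A compact sketch of that idea over boolean splits (the data encoding, stopping rules, and names below are illustrative assumptions, not the lecture's code):

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n) for c in set(labels))

def build_tree(rows, labels, splits):
    """rows: feature records; splits: dict mapping a split name to a predicate row -> bool."""
    if not labels:
        return None                                   # no examples reached this branch
    if entropy(labels) == 0:
        return labels[0]                              # pure leaf: predict the single class
    if not splits:
        return max(set(labels), key=labels.count)     # majority-class leaf

    n = len(labels)
    def gain(pred):
        yes = [y for r, y in zip(rows, labels) if pred(r)]
        no = [y for r, y in zip(rows, labels) if not pred(r)]
        return entropy(labels) - sum(len(b) / n * entropy(b) for b in (yes, no) if b)

    best = max(splits, key=lambda name: gain(splits[name]))   # highest information gain
    pred, rest = splits[best], {k: v for k, v in splits.items() if k != best}
    yes = [i for i, r in enumerate(rows) if pred(r)]
    no = [i for i, r in enumerate(rows) if not pred(r)]
    return {best: {
        "yes": build_tree([rows[i] for i in yes], [labels[i] for i in yes], rest),
        "no": build_tree([rows[i] for i in no], [labels[i] for i in no], rest),
    }}

# Simpsons data from the slides: (hair length, weight < 161, age < 40) and class labels
rows = [(0, 0, 1), (10, 1, 1), (2, 1, 1), (6, 1, 1), (4, 1, 1),
        (1, 0, 0), (8, 1, 0), (10, 0, 1), (6, 0, 0)]
labels = ["M", "F", "M", "F", "F", "M", "F", "M", "M"]
splits = {"hair < 4": lambda r: r[0] < 4,
          "weight < 161": lambda r: bool(r[1]),
          "age < 40": lambda r: bool(r[2])}
print(build_tree(rows, labels, splits))
# {'weight < 161': {'yes': {'hair < 4': {'yes': 'M', 'no': 'F'}}, 'no': 'M'}}
```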
Building More General Decision Trees
• Build a decision tree (≥ 2 levels) step by step.
• Build a decision tree with continuous input features.
• Build a quad decision tree.   <- Next
Decision Tree more generally…
• Features can be discrete, continuous, or categorical
• Each internal node: test some set of features {Xi}
• Each branch from a node: selects a set of values for {Xi}
• Each leaf node: predict Y

[Figure: a 2-D feature space partitioned into rectangular regions, each region labeled with its predicted class (0 or 1)]
Person   Hair Length   Weight   Class
Homer    0"            250      M
Marge    10"           150      F
Bart     2"            90       M
Lisa     6"            78       F
Maggie   4"            20       F
Abe      1"            170      M
Selma    8"            160      F
Otto     10"           180      M
Krusty   6"            200      M

Hair length:  0-2 Short,   3-6 Medium,     ≥ 7 Long
Weight:       0-100 Light, 100-175 Normal, ≥ 175 Heavy
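The discretization above is a simple binning step. A sketch, with the bin edges taken from the slide (the treatment of the boundary values 100 and 175, and all names, are assumptions):

```python
def hair_category(inches):
    # bins from the slide: 0-2 Short, 3-6 Medium, >= 7 Long
    return "Short" if inches <= 2 else ("Medium" if inches <= 6 else "Long")

def weight_category(pounds):
    # bins from the slide: 0-100 Light, 100-175 Normal, >= 175 Heavy
    # (the slide's ranges meet at 100 and 175; treating the bins as half-open is an assumption)
    return "Light" if pounds < 100 else ("Normal" if pounds < 175 else "Heavy")

people = [("Homer", 0, 250), ("Marge", 10, 150), ("Bart", 2, 90)]
for name, hair, weight in people:
    print(name, hair_category(hair), weight_category(weight))
# Homer Short Heavy / Marge Long Normal / Bart Short Light   (matches the next slide's table)
```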
Person   Hair Length   Weight   Class
Homer    Short         Heavy    M
Marge    Long          Normal   F
Bart     Short         Light    M
(remaining rows discretized the same way)

Hair length:  0-2 Short,   3-6 Medium,     ≥ 7 Long
Weight:       0-100 Light, 100-175 Normal, ≥ 175 Heavy

[Quad decision tree: each node tests Hair Length and Weight together, giving four branches.
 Root: (Short, Light) -> Male; (Short, {Normal, Heavy}) -> Male; ({M, L}, Light) -> Female;
 ({M, L}, {Normal, Heavy}) -> split again: (L, Normal) -> Female, (L, Heavy) -> Male, (M, Heavy) -> Male]
Model Selection: AIC & BIC
Bias, Variance, and Model Complexity
• Bias-variance trade-off again
• Generalization: test-sample vs. training-sample performance
  – Performance on the training data usually increases monotonically with model complexity
Measuring Performance
• Target variable Y
• Vector of inputs X
• Prediction model f̂(X)
• Typical choices of loss function:

$$L\big(Y, \hat{f}(X)\big) = \begin{cases} \big(Y - \hat{f}(X)\big)^2 & \text{squared error} \\ \big|Y - \hat{f}(X)\big| & \text{absolute error} \end{cases}$$
Generalization Error
• Test error, aka generalization error:

$$\mathrm{Err} = E\big[ L\big(Y, \hat{f}(X)\big) \big]$$

• Note: this expectation averages over everything that is random, including the randomness in the training sample that produced f̂
• Training error:

$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat{f}(x_i)\big)$$

  – average loss over the training sample
  – not a good estimate of the test error (next slide)
Training Error
• Training error – overfitting
  – not a good estimate of test error
  – consistently decreases with model complexity
  – drops to zero with high enough complexity
Categorical Data
• Same for categorical responses:
  p_k(X) = Pr(G = k | X),  Ĝ(X) = argmax_k p̂_k(X)
• Typical choices of loss functions:

$$L\big(G, \hat{G}(X)\big) = I\big(G \neq \hat{G}(X)\big) \quad \text{(0-1 loss)}$$

$$L\big(G, \hat{p}(X)\big) = -2\sum_{k=1}^{K} I(G = k)\,\log \hat{p}_k(X) = -2\log \hat{p}_G(X) \quad \text{(log-likelihood)}$$

• Test error again: Err = E[L(G, p̂(X))]
• Training error again:

$$\overline{\mathrm{err}} = -\frac{2}{N}\sum_{i=1}^{N} \log \hat{p}_{g_i}(x_i)$$

• Log-likelihood = cross-entropy loss = deviance
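A small sketch of the two losses for a single categorical prediction (the classes and probabilities are made up for illustration):

```python
import math

def zero_one_loss(g_true, g_pred):
    """0-1 loss: I(G != G_hat)."""
    return float(g_true != g_pred)

def deviance_loss(g_true, p_hat):
    """-2 * log-likelihood loss; p_hat maps each class k to its estimated probability p_hat_k."""
    return -2.0 * math.log(p_hat[g_true])

p_hat = {"spam": 0.7, "ham": 0.3}            # estimated class probabilities for one example
g_hat = max(p_hat, key=p_hat.get)            # G_hat(X) = argmax_k p_hat_k(X)
print(zero_one_loss("ham", g_hat))           # 1.0, since the argmax class is "spam"
print(deviance_loss("ham", p_hat))           # -2 * log(0.3), about 2.41
```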
Loss Function for General Densities
• For densities parameterized by θ, the log-likelihood can be used as a loss function:

$$L\big(Y, \theta(X)\big) = -2\,\log \Pr_{\theta(X)}(Y)$$
• Ideal situation: split the data into three parts for training, validation (estimate prediction error + select model), and testing (assess model)
• Typical split: 50% / 25% / 25%
• Remainder of the chapter: data-poor situation
  => approximate the validation step either analytically (AIC, BIC, MDL, SRM) or by efficient sample reuse (cross-validation, bootstrap)
Bias‐Variance Decomposition
$$Y = f(X) + \varepsilon, \quad E(\varepsilon) = 0, \quad \mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$$

• Then for an input point X = x_0, using squared-error loss and a regression fit:

$$\begin{aligned}
\mathrm{Err}(x_0) &= E\big[ \big(Y - \hat{f}(x_0)\big)^2 \mid X = x_0 \big] \\
&= \sigma_\varepsilon^2 + \big[ E\hat{f}(x_0) - f(x_0) \big]^2 + E\big[ \hat{f}(x_0) - E\hat{f}(x_0) \big]^2 \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2\big[\hat{f}(x_0)\big] + \mathrm{Var}\big[\hat{f}(x_0)\big]
\end{aligned}$$

  – Irreducible error: variance of the target around its true mean
  – Bias²: amount by which the average estimate differs from the true mean
  – Variance: expected squared deviation of f̂(x_0) around its mean
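The decomposition can be checked by simulation at a single point x0. A sketch, where the true function, the noise level, and the k-nearest-neighbour estimator are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                    # "true" regression function, chosen only for illustration
sigma = 0.3                   # noise standard deviation, so the irreducible error is sigma**2
x0, k, n_train, n_reps = 1.0, 5, 50, 20000

preds, errs = [], []
for _ in range(n_reps):
    x = rng.uniform(0.0, 2.0, n_train)
    y = f(x) + rng.normal(0.0, sigma, n_train)
    nearest = np.argsort(np.abs(x - x0))[:k]          # k-NN regression fit at x0
    f_hat_x0 = y[nearest].mean()
    preds.append(f_hat_x0)
    y0 = f(x0) + rng.normal(0.0, sigma)               # a fresh test response at x0
    errs.append((y0 - f_hat_x0) ** 2)

preds = np.array(preds)
bias2 = (preds.mean() - f(x0)) ** 2                   # squared bias of f_hat(x0)
var = preds.var()                                     # variance of f_hat(x0) over training sets
print(np.mean(errs))                                  # simulated Err(x0)
print(sigma**2 + bias2 + var)                         # sigma^2 + Bias^2 + Var: should roughly agree
```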
Bias‐Variance Decomposition
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big[\hat{f}(x_0)\big] + \mathrm{Var}\big[\hat{f}(x_0)\big]$$

k-NN:

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[ f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f\big(x_{(l)}\big) \Big]^2 + \frac{\sigma_\varepsilon^2}{k}$$

Linear model fit f̂_p(x) = β̂ᵀx:

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \big[ f(x_0) - E\hat{f}_p(x_0) \big]^2 + \| h(x_0) \|^2 \sigma_\varepsilon^2$$

where h(x_0) = X(XᵀX)⁻¹x_0 is the N-vector of linear weights producing the fit f̂_p(x_0) = h(x_0)ᵀy
Bias‐Variance Decomposition
Linear model fit f̂_p(x) = β̂ᵀx:

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \big[ f(x_0) - E\hat{f}_p(x_0) \big]^2 + \| h(x_0) \|^2 \sigma_\varepsilon^2$$

Averaged over the training inputs (in-sample error):

$$\frac{1}{N}\sum_{i=1}^{N} \mathrm{Err}(x_i) = \sigma_\varepsilon^2 + \frac{1}{N}\sum_{i=1}^{N} \big[ f(x_i) - E\hat{f}(x_i) \big]^2 + \frac{p}{N}\,\sigma_\varepsilon^2$$

Model complexity is directly related to the number of parameters p.
Bias‐Variance Decomposition
$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big[\hat{f}(x_0)\big] + \mathrm{Var}\big[\hat{f}(x_0)\big]$$

For ridge regression and other linear models, the variance has the same form as before, but with different weights.

Parameters of the best-fitting linear approximation:

$$\beta_* = \arg\min_{\beta} E\big( f(X) - \beta^T X \big)^2$$

Further decompose the bias into model bias (between the best linear fit and the true function) and estimation bias (between the average estimate and the best linear fit):
  – Least squares fits the best linear model -> no estimation bias
  – Restricted fits -> positive estimation bias, accepted in favor of reduced variance
Optimism of the Training Error Rate
• Typically: training error rate < true error
  (the same data are being used both to fit the method and to assess its error)

$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat{f}(x_i)\big) \;<\; \mathrm{Err} = E\big[ L\big(Y, \hat{f}(X)\big) \big]$$

  => overly optimistic
Optimism of the Training Error Rate
• Err is a kind of extra-sample error: test features need not coincide with the training feature vectors
• Focus on the in-sample error:

$$\mathrm{Err}_{in} = \frac{1}{N}\sum_{i=1}^{N} E_{Y}\, E_{Y^{new}}\big[ L\big(Y_i^{new}, \hat{f}(x_i)\big) \big]$$

  Y^new … we observe N new response values at each of the training points x_i, i = 1, 2, …, N
• Optimism: op ≡ Err_in − E_y(err)
• For squared error, 0-1, and other loss functions:

$$op = \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)$$

• The amount by which err underestimates the true error depends on how strongly y_i affects its own prediction.
Optimism of the Training Error Rate
• Summary:

$$\mathrm{Err}_{in} = E_y(\overline{\mathrm{err}}) + \frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i)$$

  The harder we fit the data, the greater Cov(ŷ_i, y_i) will be, thereby increasing the optimism.
• For a linear fit with d independent inputs / basis functions:

$$\mathrm{Err}_{in} = E_y(\overline{\mathrm{err}}) + 2\,\frac{d}{N}\,\sigma_\varepsilon^2$$

  – Optimism increases linearly with the number d of basis functions
  – Optimism decreases as the training sample size N increases
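A simulation sketch of this relation for an ordinary least-squares fit: the in-sample error, estimated with fresh responses at the training inputs, should roughly match the average training error plus 2dσ²/N (the design, coefficients, and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, sigma = 50, 5, 1.0
X = rng.normal(size=(N, d))                   # fixed design with d basis functions
beta = rng.normal(size=d)
f_true = X @ beta                             # true mean response at the training inputs

train_errs, in_sample_errs = [], []
for _ in range(5000):
    y = f_true + rng.normal(0.0, sigma, N)                    # training responses
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]          # least-squares fit
    train_errs.append(np.mean((y - y_hat) ** 2))              # err (training error)
    y_new = f_true + rng.normal(0.0, sigma, N)                # fresh responses at the same x_i
    in_sample_errs.append(np.mean((y_new - y_hat) ** 2))      # contribution to Err_in

print(np.mean(in_sample_errs))                     # simulated Err_in
print(np.mean(train_errs) + 2 * d * sigma**2 / N)  # E_y(err) + (2d/N)*sigma^2: should match
```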
Optimism of the Training Error Rate
• Ways to estimate prediction error:
  – Estimate the optimism and then add it to the training error rate err
    • AIC, BIC, and others work this way, for a special class of estimates that are linear in their parameters
  – Direct estimates of the error Err
    • Cross-validation, bootstrap
    • Can be used with any loss function, and with nonlinear, adaptive fitting techniques
Estimates of In‐Sample Prediction Error
• General form of the in-sample estimate: Êrr_in = err + ôp, with an estimate of the optimism
• For a linear fit and with Err_in = E_y(err) + 2(d/N)σ_ε²:

$$C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat{\sigma}_\varepsilon^2, \quad \text{the so-called } C_p \text{ statistic}$$

• Similarly: Akaike Information Criterion (AIC)
  – a more generally applicable estimate of Err_in, used when a log-likelihood loss function is employed

$$\text{For } N \to \infty: \quad -2\,E\big[ \log \Pr\nolimits_{\hat{\theta}}(Y) \big] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}$$

  Pr_θ(Y) … family of densities for Y (containing the true density)
  θ̂ … ML estimate of θ
  loglik = Σ_{i=1}^{N} log Pr_θ̂(y_i) … maximized log-likelihood under the ML estimate of θ
AIC
$$\text{For } N \to \infty: \quad -2\,E\big[ \log \Pr\nolimits_{\hat{\theta}}(Y) \big] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}$$

For example, for the logistic regression model, using the binomial log-likelihood:

$$\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}$$

To use AIC for model selection: choose the model giving the smallest AIC over the set of models considered.

$$\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat{\sigma}_\varepsilon^2$$

  f̂_α(x) … set of models, α … tuning parameter
  err(α) … training error, d(α) … number of parameters
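A sketch of the C_p / AIC(α) formula above for least-squares fits of increasing complexity (the data-generating process and the polynomial basis are illustrative; σ̂² is estimated from a deliberately flexible, low-bias fit, which is one common choice):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-2.0, 2.0, N)
y = np.sin(x) + rng.normal(0.0, 0.3, N)                 # illustrative data

def design(x, degree):
    # polynomial basis with d = degree + 1 parameters
    return np.vander(x, degree + 1, increasing=True)

# sigma^2 estimated from a deliberately flexible, low-bias fit
X_big = design(x, 15)
resid = y - X_big @ np.linalg.lstsq(X_big, y, rcond=None)[0]
sigma2_hat = resid @ resid / (N - X_big.shape[1])

for degree in range(1, 9):
    X = design(x, degree)
    d = X.shape[1]
    err = np.mean((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)   # training error
    cp = err + 2.0 * d / N * sigma2_hat                                  # err + (2d/N)*sigma^2_hat
    print(degree, round(cp, 4))
# model selection: keep the degree with the smallest C_p / AIC value
```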
AIC
• The function AIC(α) estimates the test error curve
• If the basis functions are chosen adaptively with d < p inputs, then

$$\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = d\,\sigma_\varepsilon^2$$

  no longer holds => the optimism exceeds (2d/N)σ_ε²
• The effective number of parameters fit is then > d
Using AIC to select the # of basis functions
• Input vector: log-periodogram of a vowel, quantized to 256 uniformly spaced frequencies f
• Linear logistic regression model
• Coefficient function:

$$\beta(f) = \sum_{m=1}^{M} h_m(f)\,\theta_m$$

  – an expansion in M spline basis functions
  – for any M, a basis of natural cubic splines is used for the h_m, with knots chosen uniformly over the range of frequencies, so d(α) = d(M) = M
• AIC approximately minimizes Err(M) for both entropy and 0-1 loss

$$\frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = 2\,\frac{d}{N}\,\sigma_\varepsilon^2 \quad \text{… simple formula for the linear case}$$
Using AIC to select the # of basis functions
$$\frac{2}{N}\sum_{i=1}^{N} \mathrm{Cov}(\hat{y}_i, y_i) = 2\,\frac{d}{N}\,\sigma_\varepsilon^2$$

The approximation does not hold, in general, for the 0-1 case, but it does OK (it is exact only for linear models with additive errors and squared-error loss).
Effective Number of Parameters
$$y = (y_1, y_2, \ldots, y_N)^T \quad \text{vector of outcomes, similarly for predictions}$$

$$\hat{y} = S\,y \quad \text{linear fit (e.g. linear regression, quadratic shrinkage such as ridge, splines)}$$
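With a linear fit ŷ = Sy, the effective number of parameters is commonly taken to be trace(S) (this definition is not spelled out on the slide, but it is the standard one used with AIC/BIC). A sketch for ridge regression, with an invented design matrix and penalty:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 50, 10, 5.0
X = rng.normal(size=(N, p))                   # invented design matrix

# Ridge regression is a linear fit y_hat = S y with S = X (X^T X + lam I)^(-1) X^T
S_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
S_ols = X @ np.linalg.solve(X.T @ X, X.T)

print(np.trace(S_ridge))   # effective number of parameters, < p for lam > 0
print(np.trace(S_ols))     # plain least squares: trace(S) = p
```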
Bayesian Information Criterion (BIC):
$$\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\,d$$

Assuming a Gaussian model with σ_ε² known,

$$-2\,\mathrm{loglik} \approx \sum_i \big( y_i - \hat{f}(x_i) \big)^2 / \sigma_\varepsilon^2 = N\cdot\overline{\mathrm{err}}/\sigma_\varepsilon^2,$$

so

$$\mathrm{BIC} = \frac{N}{\sigma_\varepsilon^2}\Big[ \overline{\mathrm{err}} + (\log N)\cdot\frac{d}{N}\cdot\sigma_\varepsilon^2 \Big]$$

BIC is proportional to AIC, except with log(N) in place of the factor 2. For N > e² (approximately 7.4), BIC penalizes complex models more heavily.
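A sketch of this Gaussian-case BIC, with σ_ε² assumed known and illustrative data and candidate models:

```python
import numpy as np

rng = np.random.default_rng(2)
N, sigma2 = 200, 0.25                                   # sigma_eps^2 assumed known here
x = rng.uniform(-2.0, 2.0, N)
y = 1.0 + 0.5 * x - 0.4 * x**2 + rng.normal(0.0, np.sqrt(sigma2), N)

for degree in range(0, 6):
    X = np.vander(x, degree + 1, increasing=True)       # polynomial model with d = degree + 1
    d = X.shape[1]
    err = np.mean((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    # BIC = (N / sigma2) * [err + log(N) * (d / N) * sigma2]   (Gaussian, sigma2 known)
    bic = N / sigma2 * (err + np.log(N) * d / N * sigma2)
    print(degree, round(bic, 1))
# the smallest BIC should typically select the true quadratic model (degree 2)
```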
BIC Motivation
• Given a set of candidate models M_m, m = 1, …, M, with model parameters θ_m
• Posterior probability of a given model: Pr(M_m | Z) ∝ Pr(M_m) · Pr(Z | M_m),
  where Z represents the training data {x_i, y_i}_1^N
• To compare two models, form the posterior odds:

$$\frac{\Pr(M_m \mid Z)}{\Pr(M_l \mid Z)} = \frac{\Pr(M_m)}{\Pr(M_l)} \cdot \frac{\Pr(Z \mid M_m)}{\Pr(Z \mid M_l)}$$

• If the odds are > 1, choose model m. The prior over models (left factor) is considered constant. The right factor, the contribution of the data Z to the posterior odds, is called the Bayes factor BF(Z).
• We need to approximate Pr(Z | M_m). Various chicanery and approximations (pp. 207) get us BIC.
• We can estimate the posterior from BIC and compare the relative merits of models.
BIC: How much better is a model?
• But we may want to know how the various models stack up relative to one another (not just their ranking).
• Once we have the BIC for each model:

$$\text{estimate of } \Pr(M_m \mid Z) \;\equiv\; \frac{e^{-\frac{1}{2}\mathrm{BIC}_m}}{\sum_{l=1}^{M} e^{-\frac{1}{2}\mathrm{BIC}_l}}$$

• The denominator normalizes the result, and now we can assess the relative merits of each model.
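A sketch of turning BIC values into these relative posterior estimates (the BIC values themselves are made up):

```python
import numpy as np

bics = np.array([212.8, 215.4, 230.9])    # illustrative BIC values for three candidate models

# Pr(M_m | Z) is approximated by exp(-BIC_m / 2) / sum_l exp(-BIC_l / 2);
# subtracting the minimum BIC first keeps the exponentials well scaled
w = np.exp(-(bics - bics.min()) / 2.0)
posterior = w / w.sum()
print(posterior)                           # e.g. roughly [0.79, 0.21, 0.00]
```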