Course Roadmap
Recap of last class
[Figure: the regression function f(x) with its best linear approximation fL(x).]
[Figure: Wage versus Age (two panels) from the Wage data.]
Prediction topics: Part I
Agenda for Part I
• Splines
• Additive models
7/1
Some motivation
[Figure: the regression function f(x) with its best linear approximation fL(x).]
[Figure: the true regression function f(x) with its best quadratic approximation.]
Consistent estimator

An estimator fˆ(x) is consistent if, as our sample size grows, fˆ(x) converges to the true regression function f(x) = E(Y | X = x).

Use splines!
Piecewise polynomials
To understand splines, we first need to understand piecewise
polynomials
• We can think of step functions as piecewise constant models
• It’s easy to generalize this idea to:
◦ Piecewise linear
◦ Piecewise quadratic …
◦ Piecewise polynomial
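As a concrete sketch (in Python rather than the course's R, on made-up sine data with a single break at x = 5), a piecewise cubic model just fits a separate cubic on each side of the break:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.2, 200)

knot = 5.0                                    # the single break point
left, right = x < knot, x >= knot

# Piecewise cubic: an unconstrained cubic polynomial on each side of the break
coef_left = np.polyfit(x[left], y[left], 3)
coef_right = np.polyfit(x[right], y[right], 3)

def piecewise_cubic(t):
    t = np.asarray(t, dtype=float)
    return np.where(t < knot, np.polyval(coef_left, t), np.polyval(coef_right, t))

# Nothing forces the two pieces to agree at the break, so the fit is generally
# discontinuous there; adding continuity constraints is what leads to splines.
```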
[Figure: piecewise cubic (left) and continuous piecewise cubic (right) fits of Wage on Age.]
Piecewise Polynomial vs. Regression Splines
[Figure: piecewise cubic and continuous piecewise cubic fits of Wage on Age.]
Piecewise polynomial vs. Regression splines

[Figure: fits of Wage on Age with one break/knot at Age = 50, including a linear spline and a cubic spline.]
Cubic Splines

[Figure: cubic spline fits of Wage on Age.]
Polynomial regression vs. Cubic splines
• In polynomial regression, we had to choose the degree²

²Technically, p + 1 if you count the intercept. To be consistent with R's df arguments, here we do not count the intercept.
16 / 1
Polynomial regression vs. Cubic splines
Figure: 7.4 from ISLR. Natural cubic splines are cubic splines that extrapolate linearly beyond the boundary knots. A NCS with K knots uses just K + 1 degrees of freedom, the same as a cubic spline with 2 fewer knots.
18 / 1
Natural Cubic Splines vs Polynomial regression
Figure: 7.7 from ISLR. Natural cubic splines are very nicely behaved at the tails of the data. Polynomial regression shows erratic behaviour. (14 degrees of freedom used for both.)
19 / 1
Summary: Cubic Splines vs Polynomial regression
R’s defaults:
The most common way of specifying splines in R is in terms of the
spline’s degrees of freedom (df). R then places knots at suitably chosen
quantiles of the x variable.
• You may also specify the internal knots manually for ns and bs by
specifying the knots = argument directly. E.g.,
lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
21 / 1
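For intuition about what bs constructs, here is a sketch of the same kind of fit in Python using the textbook truncated power basis, on simulated wage-like data (bs itself uses the numerically better-behaved B-spline basis):

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis for a cubic spline: x, x^2, x^3, (x - k)_+^3."""
    cols = [x, x**2, x**3] + [np.clip(x - k, 0, None)**3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(3)
age = rng.uniform(18, 80, 300)                       # toy stand-in for the Wage data
wage = 50 + 2.5 * age - 0.025 * age**2 + rng.normal(0, 5, 300)

B = cubic_spline_basis(age, knots=[25, 40, 60])      # 3 + 3 = 6 columns (6 df)
X = np.column_stack([np.ones(300), B])               # add the intercept, as lm does
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)      # ordinary least squares
fitted = X @ beta
```

With 3 internal knots this basis has 6 degrees of freedom excluding the intercept, matching the df convention noted in the footnote above.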
Smoothing splines
Smoothing splines
• The smoothing spline estimator is the solution ĝ to the problem

    minimize over g:   ∑_{i=1}^n (yi − g(xi))² + λ ∫ g′′(t)² dt        (∗)

  The first term is the RSS; the second is the roughness penalty.
It turns out…
• The solution to (∗) is a natural cubic spline
• The solution has knots at every unique value of x
• The effective degrees of freedom of the solution is calculable
• λ ←→ df
Coding tip: In R with the gam library you can use the syntax
s(x, df) in your regression formula to fit a smoothing spline with df
effective degrees of freedom.
24 / 1
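Outside of R, scipy offers a rough analogue; a sketch on simulated data (scipy's UnivariateSpline is tuned by a smoothing factor s, which plays the role of λ, rather than by effective df):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)

# s = 0 forces interpolation: a knot at every data point and zero training error
interp = UnivariateSpline(x, y, s=0)

# Larger s buys smoothness at the cost of fidelity
# (rule of thumb: s around n times the noise variance)
smooth = UnivariateSpline(x, y, s=200 * 0.3**2)
```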
Smoothing Spline

[Figure: smoothing spline fits of Wage on Age with 16 degrees of freedom, and with 6.8 degrees of freedom chosen by LOOCV.]
Local Regression

Figure: 7.9 from ISLR. Local regression (available as lo(x) and loess(x)).
Local Linear Regression
[Figure: local linear regression fit of Wage on Age.]
Putting everything together: Additive Models
• Recall the Linear Regression Model

    Y = β0 + ∑_{j=1}^p βj Xj + ϵ

• We can now extend this to the far more flexible Additive Model

    Y = β0 + ∑_{j=1}^p fj(Xj) + ϵ
28 / 1
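A bare-bones sketch of how such a model can be fit by backfitting (simulated data; degree-5 polynomial smoothers stand in for the splines and local regression a real implementation such as R's gam would use):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-2, 2, size=(n, 2))
# Truth is additive: f1(x) = x^2, f2(x) = sin(pi*x/2)
y = 1.0 + X[:, 0]**2 + np.sin(np.pi * X[:, 1] / 2) + rng.normal(0, 0.3, n)

beta0 = y.mean()
f = np.zeros((n, 2))                        # current estimates of f_j at the data
for _ in range(20):                         # backfitting sweeps
    for j in range(2):
        partial = y - beta0 - f[:, 1 - j]   # partial residual, leaving out f_j
        coefs = np.polyfit(X[:, j], partial, 5)
        f[:, j] = np.polyval(coefs, X[:, j])
        f[:, j] -= f[:, j].mean()           # center each f_j for identifiability
```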
Additive Models: Boston housing data
We fit an additive model⁴ with terms:
• f1 (lstat) smoothing spline with 5 df
• f2 (ptratio) local linear regression
• f3 (crim) step function with breaks at crim = 1, 10, 25
⁴You can use lm here instead, but gam has built-in plotting routines to help better visualize the model fits.
29 / 1
Additive Models: Boston housing data
[Figure: estimated component functions f1(lstat), f2(ptratio), and f3(crim) from the additive model fit.]
How do we pick the best statistical model?

[Figure: nine candidate model fits of Wage on Age.]
Prediction topics: Part II
How do we pick the best statistical model?
[Figure: nine candidate model fits of Wage on Age, repeated.]
Resampling methods
• Part II of today’s lecture covers several topics that fall into the
category of Resampling methods
• These are approaches for using a single observed data set to answer
questions such as:
◦ Which model generalizes best to unseen data?
◦ What is the prediction error of my model?
◦ What is the uncertainty of my estimate?
◦ How can I form a confidence interval for a complex parameter?
38 / 1
Agenda for Part II
• Resampling methods
◦ Validation set approach (Train/Test split)
◦ Cross-validation
39 / 1
How should we think about model selection?
Let’s remind ourselves of the first Central Theme of this class.
1. Generalizability: We want to construct models that generalize
well to unseen data
• i.e., we want to:
  1. Add variables/flexibility as long as doing so helps capture meaningful trends in the data (avoid underfitting)
  2. Ignore meaningless random fluctuations in the data (avoid overfitting)
40 / 1
Assessing Model Accuracy
41 / 1
Assessing Model Accuracy: Training Error vs. Testing Error
Here are three different models fit to the same small Train data set.
Which of these three is the best model?
[Figure: the observed Train data, and the fits from Model 1, Model 2, and Model 3.]
Assessing Model Accuracy: Training Error vs. Testing Error

[Figure: the observed data and Models 1-3, repeated.]

    Model                 1      2      3
    MSE_Train = RSS/n     23.2   5.2    7.5
Assessing Model Accuracy: Training Error vs. Testing Error

Here are some new observations, which form our Test data. How well do our models fit the test data?

[Figure: the Test data (solid green points) and the Train data (open grey circles), overlaid with each model's fit.]
Assessing Model Accuracy: Training Error vs. Testing Error

[Figure: the Test data and each model's fit, repeated.]

    Model        1      2      3
    MSE_Train    23.2   5.2    7.5
    MSE_Test     24.6   10.3   7.0
Assessing Model Accuracy
• The test set error will decrease as we add flexibility that helps to
capture useful trends
• As we add too much flexibility, the test set error will begin to increase
due to model overfitting
46 / 1
How well could we possibly do?
Y = f(x) + ϵ
47 / 1
Bias-Variance trade-off in prediction
• For any predictor fˆ, one can decompose the expected test MSE⁵ at a new data point x0 as:

    E[(y0 − fˆ(x0))²] = Var(fˆ(x0)) + [Bias(fˆ(x0))]² + Var(ϵ)

⁵For more details, read §2.2.2 of ISL.
48 / 1
Bias-Variance trade-off in prediction

Let's understand what this equation is saying:

    E[(y0 − fˆ(x0))²] = Var(fˆ(x0)) + [Bias(fˆ(x0))]² + Var(ϵ)

    Bias(fˆ(x0)) = E[fˆ(x0)] − E(Y | X = x0)
                 = E[fˆ(x0)] − f(x0)

• Var(ϵ) is the irreducible error
49 / 1
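The decomposition can be checked by simulation; a sketch (a deliberately biased quadratic fit to a sinusoidal truth, with all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)          # true regression function
x0, sigma = 1.0, 0.5                 # evaluation point and noise sd
n, reps = 50, 2000

preds = np.empty(reps)
errs = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 3, n)
    y = f(x) + rng.normal(0, sigma, n)
    coefs = np.polyfit(x, y, 2)              # quadratic fit: a biased model here
    preds[r] = np.polyval(coefs, x0)
    y0 = f(x0) + rng.normal(0, sigma)        # fresh test response at x0
    errs[r] = (y0 - preds[r]) ** 2

mse = errs.mean()                            # E[(y0 - fhat(x0))^2]
var = preds.var()                            # Var(fhat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2        # Bias(fhat(x0))^2
# mse should come out close to var + bias_sq + sigma**2
```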
[Figure: the regression function f(x) and its best linear approximation fL(x), repeated.]
Bias-Variance trade-off in prediction

Let's think about this equation in practical terms:

    E[(y0 − fˆ(x0))²] = Var(fˆ(x0)) + [Bias(fˆ(x0))]² + Var(ϵ)

[Figure: training-sample and test-sample error as a function of model complexity, from low to high.]
Training Error vs Test Error: Example 1
Dashed line on MSE plot shows the irreducible error Var(ϵ)
[Figure: simulated data with fits of varying flexibility (left); training and test MSE as a function of flexibility, with the dashed line at Var(ϵ) (right).]
Truth is wiggly, but noise level is very low. The more flexible fits
perform the best.
56 / 1
Bias-Variance Trade-off for the three examples
[Figure: squared bias (Bias), variance (Var), and test MSE as functions of flexibility for the three simulated examples.]
How do we estimate the Test Error?
• In reality, we only get one set of data to use for model fitting, model
selection, and model assessment
58 / 1
Validation set approach
1. Randomly divide the available data into two parts: a Training set and
a Validation or Hold-out set
Figure: 3.8 from ISL. Linear, Degree 2, and Degree 5 polynomial fits of Y = Miles per gallon on X1 = Horsepower. We want to figure out what degree polynomial to use to model this relationship.
Example: Automobile data
• Want to figure out what degree polynomial to use
• Randomly split the 392 observations into two sets of 196 data points.
Train on the first, calculate errors on the second.
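A sketch of this procedure on simulated data (a Python stand-in; the quadratic "mpg" relationship and all names here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 392
hp = rng.uniform(40, 230, n)                 # stand-in for Horsepower
mpg = 45 - 0.35 * hp + 0.0009 * hp**2 + rng.normal(0, 2, n)  # toy quadratic truth

idx = rng.permutation(n)
train, valid = idx[:196], idx[196:]          # random 196 / 196 split

val_mse = {}
for deg in range(1, 6):
    coefs = np.polyfit(hp[train], mpg[train], deg)     # fit on the training half
    pred = np.polyval(coefs, hp[valid])
    val_mse[deg] = np.mean((mpg[valid] - pred) ** 2)   # error on the held-out half
```

With a quadratic truth, the validation MSE drops sharply from degree 1 to degree 2 and then roughly flattens.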
Figure 5.2 from ISL. Left panel shows MSE for a single split. Right panel shows MSE curves for
10 randomly chosen splits.
61 / 1
Example: Automobile data
[Figure: 5.2 from ISL, repeated.]
• All splits agree that quadratic is much better than linear
• Degree > 2 fits don’t seem to perform considerably better than
quadratic
62 / 1
Problems with the Validation set approach
[Figure: validation MSE curves for 10 random splits, repeated; the error estimate varies considerably from split to split.]
• Error estimates can be used to select the best model, and also reflect
the test error of the final chosen model
• Main idea: K-fold Cross-validation
1. Randomly split the data into K equal-sized parts (“folds”)
2. Give each part the chance to be the Validation set, treating the other
K − 1 parts (combined) as the Training set
3. Average the test error over all of the folds
4. Pick the simplest model among those with the lowest CV-error
5. Final model: Refit model on the entire data set.
64 / 1
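The five steps sketched on simulated data (again a made-up polynomial-degree selection problem in Python; step 4 is simplified to taking the minimizer):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 200, 5
x = rng.uniform(0, 10, n)
y = 2 + 1.5 * x - 0.12 * x**2 + rng.normal(0, 0.5, n)   # toy quadratic truth

fold = rng.permutation(n) % K              # step 1: random equal-sized folds
cv_err = {}
for deg in range(1, 6):
    fold_mse = []
    for k in range(K):                     # step 2: each fold serves as validation set
        tr, va = fold != k, fold == k
        coefs = np.polyfit(x[tr], y[tr], deg)
        fold_mse.append(np.mean((y[va] - np.polyval(coefs, x[va])) ** 2))
    cv_err[deg] = np.mean(fold_mse)        # step 3: average over folds

best_deg = min(cv_err, key=cv_err.get)     # step 4 (simplified): lowest CV error
final_fit = np.polyfit(x, y, best_deg)     # step 5: refit on the entire data set
```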
The Picture for 5-fold Cross-validation

[Figure: schematic of 5-fold cross-validation. The data are split into 5 folds; each fold in turn serves as the validation set while the other 4 folds form the training set.]
Cross-validation standard error

    CV(K) = (1/K) ∑_{k=1}^K MSE_k

Its standard error is typically estimated as⁶

    SE(CV(K)) = sd(MSE_1, …, MSE_K) / √K

⁶This calculation isn't quite right, but it's a widely accepted approach for calculating standard errors for CV error estimates.
67 / 1
Back to the Automobile data
Figure: 3.8 from ISL, repeated. Polynomial fits of Y = Miles per gallon on X1 = Horsepower.
Example: Automobile data
Figure 5.4 from ISL. Left panel shows estimated Test Error for LOOCV. Right panel shows estimated Test Error for 10-fold CV run nine separate times, plotted against the degree of polynomial.
CV error estimate vs. Actual test error
[Figure: for the three simulated examples, the cross-validation error estimate compared with the actual test error, as a function of flexibility.]
[Figure: CV error, with one-standard-error bars, as a function of the tuning parameter λ; the minimum-CV rule selects λ̂ = 3.458.]

• The 1-standard error rule tells us to pick the simplest model whose CV error falls inside the 1-SE error bars of the lowest CV error model.
The 1-standard error rule

[Figure: CV error versus λ. One standard error rule: λ̂ = 6.305. Usual rule: λ̂ = 3.458.]

• The 1-standard error rule tells us to pick the simplest model whose CV error falls inside the 1-SE error bars of the lowest CV error model.
The 1-standard error rule

[Figure: CV error versus λ, repeated. One standard error rule: λ̂ = 6.305. Usual rule: λ̂ = 3.458.]

• Basic idea: We can't be certain that the λ = 6.305 model actually has higher prediction error than the λ = 3.458 model, so let's err on the side of caution and go with the simpler λ = 6.305 model.
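In code, the rule is a one-liner once the per-model CV means and standard errors are in hand. A sketch with illustrative numbers chosen to reproduce the two λ̂ values on the slide (these are not the actual CV results):

```python
import numpy as np

# Hypothetical CV summaries per candidate lambda (larger lambda = simpler model)
lam = np.array([1.0, 2.0, 3.458, 5.0, 6.305, 8.0])
cv = np.array([2.40, 2.10, 1.90, 1.95, 2.00, 2.30])   # mean CV error
se = np.array([0.15, 0.15, 0.15, 0.15, 0.15, 0.15])   # its standard error

best = np.argmin(cv)                        # usual rule: minimize the CV error
threshold = cv[best] + se[best]             # one-SE band above the minimum
# 1-SE rule: simplest (largest-lambda) model whose CV error is inside the band
one_se = max(i for i in range(len(lam)) if cv[i] <= threshold)

# lam[best] -> 3.458, lam[one_se] -> 6.305
```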
Smoothing spline example
These plots show the results of applying 5-fold cross-validation to select the
effective degrees of freedom for a smoothing spline fit to the points in the right
panel.
[Figure: left, CV error as a function of effective degrees of freedom with the minimum-CV choice marked; right, the corresponding smoothing spline fit to the data.]
Summary: Cross-validation
• We started with the question: How do we pick the best model?
• One answer: Pick the model with the lowest prediction error.
• From this we get: Our chosen model fˆ, and an estimate of its
prediction error
79 / 1
Acknowledgements
All of the lecture notes for this class feature content borrowed, with or without modification, from the following sources:
• 36-462/36-662 Lecture notes (Prof. Tibshirani, Prof. G’Sell, Prof. Shalizi)
• Applied Predictive Modeling, (Springer, 2013), Max Kuhn and Kjell Johnson
80 / 1