
What is Statistical Learning?

[Figure: three scatterplots of Sales versus TV, Radio and Newspaper advertising budgets, each with a fitted regression line]
Shown are Sales vs TV, Radio and Newspaper, with a blue linear-regression line fit separately to each.
Can we predict Sales using these three?
Perhaps we can do better using a model

Sales ≈ f(TV, Radio, Newspaper)

Notation
Here Sales is a response or target that we wish to predict. We
generically refer to the response as Y .
TV is a feature, or input, or predictor; we name it X1 .
Likewise name Radio as X2 , and so on.
We can refer to the input vector collectively as
X = (X1, X2, X3)^T   (a column vector)

Now we write our model as

Y = f(X) + ε

where ε captures measurement errors and other discrepancies.

What is f(X) good for?

• With a good f we can make predictions of Y at new points X = x.
• We can understand which components of X = (X1, X2, . . . , Xp) are important in explaining Y, and which are irrelevant. e.g. Seniority and Years of Education have a big impact on Income, but Marital Status typically does not.
• Depending on the complexity of f, we may be able to understand how each component Xj of X affects Y.

[Figure: scatterplot of simulated data, y versus x, with x running from about 1 to 7]
Is there an ideal f(X)? In particular, what is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is

f(4) = E(Y | X = 4)

E(Y | X = 4) means the expected value (average) of Y given X = 4.
This ideal f(x) = E(Y | X = x) is called the regression function.
The regression function f(x)

• Is also defined for vector X; e.g. f(x) = f(x1, x2, x3) = E(Y | X1 = x1, X2 = x2, X3 = x3).
• Is the ideal or optimal predictor of Y with regard to mean-squared prediction error: f(x) = E(Y | X = x) is the function that minimizes E[(Y - g(X))^2 | X = x] over all functions g at all points X = x.
• ε = Y - f(x) is the irreducible error: even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.
• For any estimate f̂(x) of f(x), we have

E[(Y - f̂(X))^2 | X = x] = [f(x) - f̂(x)]^2 + Var(ε)

where the first term is reducible and the second is irreducible.
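A short derivation of this decomposition, assuming E(ε | X = x) = 0, Var(ε) = E(ε^2 | X = x), and treating f̂(x) as fixed:

\begin{aligned}
E\big[(Y-\hat f(X))^2 \mid X=x\big]
&= E\big[(f(x)+\varepsilon-\hat f(x))^2 \mid X=x\big] \\
&= [f(x)-\hat f(x)]^2 + 2\,[f(x)-\hat f(x)]\,E(\varepsilon \mid X=x) + E(\varepsilon^2 \mid X=x) \\
&= \underbrace{[f(x)-\hat f(x)]^2}_{\text{reducible}} + \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible}}.
\end{aligned}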
How to estimate f

• Typically we have few if any data points with X = 4 exactly.
• So we cannot compute E(Y | X = x)!
• Relax the definition and let

f̂(x) = Ave(Y | X ∈ N(x))

where N(x) is some neighborhood of x (a short code sketch follows the figure below).

[Figure: scatterplot of the simulated training data, y versus x]
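A minimal Python sketch of this relaxed estimator; the window half-width h and the simulated data below are illustrative choices, not taken from the slides.

import numpy as np

def nn_average(x0, x, y, h=0.5):
    """Estimate f(x0) = E(Y | X = x0) by averaging y over N(x0) = {i : |x_i - x0| <= h}."""
    mask = np.abs(x - x0) <= h
    if not mask.any():              # no training points in the neighborhood
        return np.nan
    return y[mask].mean()

# Illustrative simulated data
rng = np.random.default_rng(0)
x = rng.uniform(1, 6, size=100)
y = np.sin(x) + rng.normal(scale=0.5, size=100)

print(nn_average(4.0, x, y))        # local average of y near x = 4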
• Nearest neighbor averaging can be pretty good for small p (i.e. p ≤ 4) and large-ish N.
• We will discuss smoother versions, such as kernel and spline smoothing, later in the course.
• Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
• We need to get a reasonable fraction of the N values of yi to average to bring the variance down, e.g. 10%.
• A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging.
The curse of dimensionality

[Figure: left, a 10% neighborhood of a point in two dimensions (x1, x2); right, the radius needed to capture a given fraction of the volume, plotted against that fraction for p = 1, 2, 3, 5, 10]
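One way to see the effect numerically: if the inputs were uniform on the unit hypercube in p dimensions and we used a cubical neighborhood, capturing a fraction of the data equal to frac would require an edge length of frac^(1/p), which approaches 1 as p grows. A small illustration (the uniform-cube setup is an assumption for illustration, not the exact construction behind the figure):

# Edge length of a sub-cube holding 10% of points uniformly
# distributed in the unit hypercube [0, 1]^p.
for p in (1, 2, 3, 5, 10):
    edge = 0.10 ** (1 / p)          # solve edge^p = 0.10
    print(f"p = {p:2d}: edge length for a 10% neighborhood = {edge:.2f}")

# p = 1: 0.10, p = 2: 0.32, p = 5: 0.63, p = 10: 0.79
# A "10% neighborhood" is no longer local in high dimensions.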
Parametric and structured models

The linear model is an important example of a parametric model:

fL(X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp.

• A linear model is specified in terms of p + 1 parameters β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).

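A small sketch of fitting such a parametric model by least squares in Python; the simulated data and coefficient values are purely illustrative.

import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))                  # predictors X1, X2, X3
beta_true = np.array([2.0, 1.0, -0.5, 0.0])  # beta0, beta1, beta2, beta3
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.3, size=n)

# Estimate the p + 1 parameters by least squares
X1 = np.column_stack([np.ones(n), X])        # add an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat)                              # close to beta_true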
A linear model f̂L(X) = β̂0 + β̂1 X gives a reasonable fit here:

[Figure: the simulated data (y versus x) with the fitted straight line]
A quadratic model f̂Q(X) = β̂0 + β̂1 X + β̂2 X^2 fits slightly better:

[Figure: the simulated data with the fitted quadratic curve]
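A sketch of comparing the two fits with numpy.polyfit on illustrative simulated data (the true curve and noise level are assumptions, not the slide's exact data):

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 6, size=60)
y = 1.5 + 0.4 * x - 0.05 * x**2 + rng.normal(scale=0.3, size=60)

for degree in (1, 2):                        # linear vs quadratic model
    coefs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coefs, x)
    mse = np.mean((y - fitted) ** 2)
    print(f"degree {degree}: training MSE = {mse:.3f}")
# The quadratic fit has (at least slightly) lower training MSE, since the models are nested.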
[Figure: 3D plot of Income against Years of Education and Seniority]
Simulated example. Red points are simulated values for income from the model

income = f(education, seniority) + ε

f is the blue surface.
[Figure: least-squares plane fit to the simulated income data]

Linear regression model fit to the simulated data:

f̂L(education, seniority) = β̂0 + β̂1 × education + β̂2 × seniority
[Figure: smooth thin-plate spline surface fit to the simulated income data]

More flexible regression model f̂S(education, seniority) fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. We control the roughness of the fit (chapter 7).
[Figure: very rough thin-plate spline surface that passes through every training point]

Even more flexible spline regression model f̂S(education, seniority) fit to the simulated data. Here the fitted model makes no errors on the training data! Also known as overfitting.
Some trade-offs

• Prediction accuracy versus interpretability.
  - Linear models are easy to interpret; thin-plate splines are not.
• Good fit versus over-fit or under-fit.
  - How do we know when the fit is just right?
• Parsimony versus black-box.
  - We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
[Figure 2.7: methods arranged from high interpretability / low flexibility to low interpretability / high flexibility: Subset Selection, Lasso, Least Squares, Generalized Additive Models, Trees, Bagging, Boosting, Support Vector Machines]

FIGURE 2.7. A representation of the trade-off between flexibility and interpretability, using different statistical learning methods.
Assessing Model Accuracy

Suppose we fit a model f̂(x) to some training data Tr = {xi, yi}, i = 1, . . . , N, and we wish to see how well it performs.

• We could compute the average squared prediction error over Tr:

MSE_Tr = Ave_{i ∈ Tr}[yi - f̂(xi)]^2

This may be biased toward more overfit models.
• Instead we should, if possible, compute it using fresh test data Te = {xi, yi}, i = 1, . . . , M:

MSE_Te = Ave_{i ∈ Te}[yi - f̂(xi)]^2
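A sketch of the two quantities in Python; the data-generating function, the train/test split, and the use of polynomial fits as the "model" are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, size=300)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=300)

train, test = np.arange(200), np.arange(200, 300)   # indices for Tr and Te

for degree in (1, 3, 12):                           # increasing flexibility
    coefs = np.polyfit(x[train], y[train], deg=degree)
    mse_tr = np.mean((y[train] - np.polyval(coefs, x[train])) ** 2)
    mse_te = np.mean((y[test] - np.polyval(coefs, x[test])) ** 2)
    print(f"degree {degree:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
# For nested polynomial fits, MSE_Tr can only decrease with degree; MSE_Te need not.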
[Figure 2.9: left, the simulated data with three fitted curves; right, training and test MSE as a function of flexibility]

Black curve is truth. Red curve on right is MSE_Te, grey curve is MSE_Tr. Orange, blue and green curves/squares correspond to fits of different flexibility.

FIGURE 2.9. Left: Data simulated from f, shown in black. Three estimates of f are shown: the linear regression line (orange curve), and two smoothing spline fits (blue and green curves). Right: Training MSE (grey curve), test MSE (red curve), and minimum possible test MSE over all methods (dashed line). Squares represent the training and test MSEs for the three fits shown in the left-hand panel.
[Figure 2.10: left, data simulated from a nearly linear f with three fits; right, training and test MSE versus flexibility]

Here the truth is smoother, so the smoother fit and linear model do really well.

FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.
[Figure 2.11: left, data simulated from a highly non-linear f with three fits; right, training and test MSE versus flexibility]

Here the truth is wiggly and the noise is low, so the more flexible fits do the best.

FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.
Bias-Variance Trade-off

Suppose we have fit a model f̂(x) to some training data Tr, and let (x0, y0) be a test observation drawn from the population. If the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)), then

E[(y0 - f̂(x0))^2] = Var(f̂(x0)) + [Bias(f̂(x0))]^2 + Var(ε).

The expectation averages over the variability of y0 as well as the variability in Tr. Note that Bias(f̂(x0)) = E[f̂(x0)] - f(x0).

Typically as the flexibility of f̂ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
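A Monte Carlo sketch of the decomposition at a single test point x0; the data-generating model, sample size, and polynomial fits are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
sigma = 0.3                                   # sd of the irreducible error eps
x0 = 0.25                                     # the test point x_0

def f(x):
    """True regression function f(x) = E(Y | X = x)."""
    return np.sin(2 * np.pi * x)

def fit_once(degree):
    """Draw a fresh training set Tr, fit a polynomial, and predict at x0."""
    x = rng.uniform(0, 1, size=50)
    y = f(x) + rng.normal(scale=sigma, size=50)
    return np.polyval(np.polyfit(x, y, deg=degree), x0)

for degree in (1, 3, 10):
    preds = np.array([fit_once(degree) for _ in range(500)])
    bias2 = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    total = bias2 + var + sigma ** 2          # approximate expected test error at x0
    print(f"degree {degree:2d}: bias^2 = {bias2:.3f}, var = {var:.3f}, total = {total:.3f}")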
Bias-variance trade-off for the three examples

[Figure 2.12: squared bias, variance and test MSE plotted against flexibility for each of the three simulated data sets]

FIGURE 2.12. Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11. The vertical dashed line indicates the flexibility level corresponding to the smallest test MSE.
Classification Problems

Here the response variable Y is qualitative; e.g. email is one of C = (spam, ham) (ham = good email), digit class is one of C = {0, 1, . . . , 9}. Our goals are to:

• Build a classifier C(X) that assigns a class label from C to a future unlabeled observation X.
• Assess the uncertainty in each classification.
• Understand the roles of the different predictors among X = (X1, X2, . . . , Xp).
[Figure: the two-class data shown as tick marks at y = 0 and y = 1 over x from 1 to 7, with a small bar at x = 5 indicating the conditional class probabilities there]

Is there an ideal C(X)? Suppose the K elements in C are numbered 1, 2, . . . , K. Let

pk(x) = Pr(Y = k | X = x), k = 1, 2, . . . , K.

These are the conditional class probabilities at x; e.g. see the little barplot at x = 5. Then the Bayes optimal classifier at x is

C(x) = j if pj(x) = max{p1(x), p2(x), . . . , pK(x)}
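A tiny sketch of the Bayes rule at a point x, assuming the true conditional class probabilities pk(x) were known; the numbers below are made up for illustration.

import numpy as np

# Hypothetical conditional class probabilities p_k(x) at some x, k = 1, ..., K
p = np.array([0.15, 0.60, 0.25])      # must sum to 1

bayes_class = np.argmax(p) + 1        # classes numbered 1, ..., K
bayes_error_at_x = 1 - p.max()        # probability of error at this x

print(bayes_class, bayes_error_at_x)  # Bayes class 2, error probability 0.4 at this x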
[Figure: the two-class data again, shown as tick marks at y = 0 and y = 1]

Nearest-neighbor averaging can be used as before. Also breaks down as dimension grows. However, the impact on Ĉ(x) is less than on p̂k(x), k = 1, . . . , K.
Classification: some details

• Typically we measure the performance of Ĉ(x) using the misclassification error rate:

Err_Te = Ave_{i ∈ Te} I[yi ≠ Ĉ(xi)]

• The Bayes classifier (using the true pk(x)) has smallest error (in the population).
• Support-vector machines build structured models for C(x).
• We will also build structured models for representing the pk(x), e.g. logistic regression and generalized additive models.
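The test error rate in code form; the test labels and predictions below are made-up placeholders.

import numpy as np

# Hypothetical test labels and classifier predictions
y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

err_te = np.mean(y_pred != y_test)    # Ave over Te of I[y_i != C_hat(x_i)]
print(err_te)                         # 2 mistakes out of 8 -> 0.25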
Example: K-nearest neighbors in two dimensions

[Figure: simulated two-class training data plotted in the (X1, X2) plane]
KNN: K=10

[Figure: the KNN decision boundary with K = 10 on the two-class training data]

FIGURE 2.15. The black curve indicates the KNN decision boundary on the data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.
KNN: K=1 and K=100

[Figure: KNN decision boundaries with K = 1 (left) and K = 100 (right) on the same data]

FIGURE 2.16. A comparison of the KNN decision boundaries (solid black curves) obtained using K = 1 and K = 100 on the data from Figure 2.13. With K = 1, the decision boundary is overly flexible, while with K = 100 it is not sufficiently flexible. The Bayes decision boundary is shown as a purple dashed line.
[Figure: KNN training and test error rates plotted against 1/K (log scale, from 0.01 to 1)]
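A compact KNN classifier written directly from the definition (NumPy only; the simulated two-class data are an illustrative stand-in for the figure's data, and Euclidean distance is assumed):

import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    preds = []
    for x in X_new:
        dist = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        votes = np.bincount(y_train[nearest])
        preds.append(votes.argmax())
    return np.array(preds)

# Illustrative two-class data in two dimensions
rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1, size=(100, 2)),
               rng.normal([2, 2], 1, size=(100, 2))])
y = np.repeat([0, 1], 100)

for k in (1, 10, 100):
    train_err = np.mean(knn_predict(X, y, X, k) != y)
    print(f"K = {k:3d}: training error = {train_err:.2f}")
# With K = 1 each training point is its own nearest neighbor, so training error is 0;
# test error behaves very differently, as the error-rate figure above suggests.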
