What Is Statistical Learning?
[Figure: scatterplots of Sales against TV, Radio, and Newspaper advertising budgets for the Advertising data.]
Notation
Here Sales is a response or target that we wish to predict. We generically refer to the response as Y.
TV is a feature, or input, or predictor; we name it X1.
Likewise, we name Radio X2, and so on.
We can refer to the input vector collectively as X = (X1, X2, X3)ᵀ.
Now we write our model as
Y = f(X) + ε
where ε is an error term.
What is f(X) good for?
[Figure: scatterplot of simulated (x, y) data, x from 1 to 7, illustrating the conditional mean of Y at X = 4.]
f(4) = E(Y | X = 4)
E(Y | X = 4) means the expected value (average) of Y given X = 4.
This ideal f(x) = E(Y | X = x) is called the regression function.
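A brief added note, not part of the original slide text, on why the conditional mean is the ideal predictor under squared-error loss: for any candidate prediction function g,

```latex
E\left[(Y - g(X))^2 \mid X = x\right]
  = E\left[(Y - f(x))^2 \mid X = x\right] + \bigl(f(x) - g(x)\bigr)^2,
  \qquad \text{where } f(x) = E(Y \mid X = x),
```

since the cross term has conditional mean zero. The first term is irreducible, and the second term vanishes exactly when g(x) = f(x), so the regression function minimizes the expected squared prediction error at every x.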
The regression function f(x)
How to estimate f
• Typically we have few if any data points with X = 4 exactly.
• So we cannot compute E(Y | X = x)!
• Relax the definition and let
  f̂(x) = Ave(Y | X ∈ N(x)),
  where N(x) is some neighborhood of x (see the sketch below).
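A minimal sketch of this neighborhood-averaging idea on simulated data; the function name `nn_average`, the true f used for simulation, and the window half-width `h` are illustrative assumptions, not from the slides:

```python
import numpy as np

def nn_average(x0, x, y, h=0.25):
    """Estimate f(x0) = E(Y | X = x0) by averaging the y's whose x lies
    in the neighborhood N(x0) = [x0 - h, x0 + h]."""
    in_neighborhood = np.abs(x - x0) <= h
    if not in_neighborhood.any():
        return np.nan                      # no points close enough to x0
    return y[in_neighborhood].mean()

# Simulated data: Y = f(X) + eps, with an assumed f(x) = sin(x) + x/2
rng = np.random.default_rng(0)
x = rng.uniform(1, 6, size=200)
y = np.sin(x) + x / 2 + rng.normal(scale=0.3, size=200)

print(nn_average(4.0, x, y))               # local average of Y near x = 4
```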
[Figure: scatterplot of simulated (x, y) data, x from 1 to 6, illustrating the neighborhood average f̂(x).]
• Nearest neighbor averaging can be pretty good for small p (i.e. p ≤ 4) and large-ish N.
• We will discuss smoother versions, such as kernel and spline smoothing, later in the course.
• Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions (see the sketch below).
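A small sketch, not from the slides, of why neighborhoods must grow with dimension; it assumes points spread uniformly over a unit hypercube and a cubical neighborhood:

```python
import numpy as np

# To capture a fraction `frac` of points spread uniformly over the unit
# hypercube [0, 1]^p, a cubical neighborhood needs side length frac**(1/p),
# since its volume (side**p) must equal the target fraction of the total volume.
for p in (1, 2, 3, 5, 10):
    side = 0.10 ** (1 / p)                  # side length for a 10% neighborhood
    print(f"p = {p:2d}: side length needed = {side:.2f}")

# p = 1: 0.10, p = 2: 0.32, p = 3: 0.46, p = 5: 0.63, p = 10: 0.79
# A "local" 10% neighborhood in 10 dimensions spans roughly 80% of each axis.
```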
The curse of dimensionality
[Figure: left, simulated points in the (x1, x2) plane with a 10% neighborhood; right, the radius needed to capture a given fraction of the volume, for p = 1, 2, 3, 5, 10.]
Parametric and structured models
A linear model f̂L(X) = β̂0 + β̂1X gives a reasonable fit here
[Figure: scatterplot of the simulated (x, y) data, x from 1 to 6, with fitted models shown in two panels.]
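A minimal sketch of fitting such a linear model by least squares; the simulated data below is an assumed stand-in for the data in the figure:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 6, size=100)
y = np.sin(x) + x / 2 + rng.normal(scale=0.3, size=100)   # assumed true f plus noise

# Least-squares estimates beta0_hat, beta1_hat for f_L(x) = beta0 + beta1 * x
X = np.column_stack([np.ones_like(x), x])                  # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"f_L(x) = {beta_hat[0]:.2f} + {beta_hat[1]:.2f} x")
```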
[Figures (four slides): three-dimensional plots of Income against Years of Education and Seniority.]
Some trade-offs
[Figure: interpretability (y-axis) versus flexibility (x-axis) for several methods, with subset selection and the lasso at the high-interpretability end, least squares in between, and bagging and boosting at the high-flexibility end.]
[Figure: left panel, Y against X with fitted curves; right panel, mean squared error against flexibility.]
FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is much closer to linear. In this setting, linear regression provides a very good fit to the data.
Here the truth is smoother, so the smoother fit and linear model do really well.
[Figure: left panel, Y against X with fitted curves; right panel, mean squared error against flexibility.]
FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from linear. In this setting, linear regression provides a very poor fit to the data.
Here the truth is wiggly and the noise is low, so the more flexible fits do the best.
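A small sketch of the pattern these figures illustrate, using polynomial degree as a stand-in for flexibility; the simulated data, the true f, and the chosen degrees are assumptions, not the book's exact setup:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n=100):
    x = rng.uniform(0, 1, size=n)
    y = np.sin(6 * x) + rng.normal(scale=0.3, size=n)   # assumed wiggly truth plus noise
    return x, y

x_tr, y_tr = make_data()              # training set
x_te, y_te = make_data()              # independent test set from the same population

for degree in (1, 3, 6, 10):          # polynomial degree plays the role of flexibility
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_tr:.3f}, test MSE {mse_te:.3f}")

# Training MSE falls steadily as flexibility grows; test MSE typically levels
# off and then rises, giving the U shape in the right-hand panels.
```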
Bias-Variance Trade-off
Suppose we have fit a model f̂(x) to some training data Tr, and let (x0, y0) be a test observation drawn from the population. If the true model is Y = f(X) + ε (with f(x) = E(Y | X = x)), then

E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε).
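A small simulation sketch of this decomposition at a single test point x0; the true f, the noise level, and the estimator (a cubic polynomial fit) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * x)           # assumed true regression function
sigma = 0.3                           # noise standard deviation, so Var(eps) = sigma**2
x0 = 1.5                              # test point

preds = []
for _ in range(2000):                 # repeatedly redraw the training set Tr
    x_tr = rng.uniform(0, 3, size=50)
    y_tr = f(x_tr) + rng.normal(scale=sigma, size=50)
    coefs = np.polyfit(x_tr, y_tr, deg=3)        # the fitted model f_hat
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

variance = preds.var()                             # Var(f_hat(x0)) across training sets
bias_sq = (preds.mean() - f(x0)) ** 2              # [Bias(f_hat(x0))]^2
print(f"Var + Bias^2 + Var(eps) ≈ {variance + bias_sq + sigma**2:.3f}")
# This sum approximates the expected test error E[(y0 - f_hat(x0))^2] at x0.
```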
Bias-variance trade-off for the three examples
[Figure: three panels plotting test MSE, squared bias, and variance against flexibility for the three examples.]
FIGURE 2.12. Squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11. The vertical dashed line indicates the flexibility level corresponding to the smallest test MSE.
Classification Problems
[Figure: simulated classification data, with the observations shown as tick marks at y = 0 and y = 1 against x.]
Classification: some details
Example: K-nearest neighbors in two dimensions
[Figure: simulated two-class data plotted in the (X1, X2) plane.]
KNN: K=10
[Figure: the two-class data in the (X1, X2) plane with the KNN decision boundary for K = 10.]
FIGURE 2.15. The black curve indicates the KNN decision boundary on the data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as a purple dashed line. The KNN and Bayes decision boundaries are very similar.
KNN: K=1 and K=100
[Figures: KNN decision boundaries for K = 1 and K = 100 on the same data; and training and test error rates plotted against 1/K.]
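A minimal sketch of KNN classification and the training/test error comparison across K, using scikit-learn's KNeighborsClassifier; the simulated two-class data is an assumption standing in for the data in the figures:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)

def make_data(n=200):
    # Two overlapping Gaussian clouds in the (X1, X2) plane (assumed setup)
    X = np.vstack([rng.normal(0.0, 1.0, size=(n // 2, 2)),
                   rng.normal(1.5, 1.0, size=(n // 2, 2))])
    y = np.repeat([0, 1], n // 2)
    return X, y

X_tr, y_tr = make_data()              # training set
X_te, y_te = make_data()              # independent test set

for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    err_tr = 1 - knn.score(X_tr, y_tr)        # training error rate
    err_te = 1 - knn.score(X_te, y_te)        # test error rate
    print(f"K = {k:3d}: train error {err_tr:.3f}, test error {err_te:.3f}")

# K = 1 gives zero training error but typically a higher test error; very large K
# underfits. Test error is usually lowest at some intermediate K, as in the 1/K plot.
```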