[Figure: scatterplots of Sales against the TV, Radio, and Newspaper advertising budgets for the Advertising data.]
1 / 30
Notation
Here Sales is a response or target that we wish to predict. We
generically refer to the response as Y .
TV is a feature, or input, or predictor; we name it X1 .
Likewise we name Radio X2 , and so on.
We can refer to the input vector collectively as

    X = (X1 , X2 , X3 )^T

Now we write our model as

    Y = f (X) + ε

where ε captures measurement errors and other discrepancies.
2 / 30
What is f (X) good for?
3 / 30
[Figure: scatterplot of y against x for a large simulated data set, with x ranging from 1 to 7 and y from 0 to 6.]
f (4) = E(Y |X = 4)
E(Y |X = 4) means expected value (average) of Y given X = 4.
This ideal f (x) = E(Y |X = x) is called the regression function.
4 / 30
The regression function f (x)
5 / 30
How to estimate f
• Typically we have few if any data points with X = 4 exactly.
• So we cannot compute E(Y |X = x)!
• Relax the definition and let

      f̂(x) = Ave(Y |X ∈ N (x))

  where N (x) is some neighborhood of x.
[Figure: simulated scatterplot of y against x, with x from 1 to 6 and y from −2 to 3, illustrating nearest-neighbor averaging.]
6 / 30
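The relaxed estimator f̂(x) = Ave(Y |X ∈ N (x)) can be sketched in a few lines. This is an illustrative sketch only: the data, the window width, and the function name are made up here, and N (x) is taken to be the interval [x − width, x + width].

```python
import numpy as np

def nn_average(x0, x, y, width=0.5):
    """Estimate f(x0) = Ave(y_i : x_i in N(x0)), where N(x0) is the
    interval [x0 - width, x0 + width]."""
    mask = np.abs(x - x0) <= width
    if not mask.any():
        return np.nan  # no neighbors: the estimate is undefined here
    return y[mask].mean()

# Toy data: y = x + noise, so f(4) = E(Y | X = 4) should be near 4.
rng = np.random.default_rng(0)
x = rng.uniform(1, 7, size=200)
y = x + rng.normal(0, 0.5, size=200)
print(nn_average(4.0, x, y))
```

Shrinking `width` reduces bias but leaves fewer points to average over, which raises variance: the same trade-off discussed later in these slides.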
• Nearest neighbor averaging can be pretty good for small p
— i.e. p ≤ 4 and large-ish N .
• We will discuss smoother versions, such as kernel and
spline smoothing, later in the course.
• Nearest neighbor methods can be lousy when p is large.
Reason: the curse of dimensionality. Nearest neighbors
tend to be far away in high dimensions.
7 / 30
The curse of dimensionality
[Figure: left, a simulated two-dimensional data set in (x1, x2) with a 10% neighborhood around a target point; right, the radius needed to capture a given fraction of the volume, plotted for p = 1, 2, 3, 5, 10. The required radius grows quickly with the dimension p.]
8 / 30
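The right-hand panel of the figure can be reproduced directly. Assuming the data are uniform on a unit hypercube, a cubical neighborhood that contains a fraction f of the observations must extend f^(1/p) along each axis; the function name below is made up for this sketch.

```python
def edge_fraction(fraction, p):
    """Side length (as a fraction of each axis) of a p-dimensional
    hypercube containing the given fraction of a unit cube's volume,
    assuming uniformly distributed data."""
    return fraction ** (1.0 / p)

# To capture 10% of the data, the neighborhood must span this much
# of each axis -- nearly the whole range once p is large:
for p in (1, 2, 3, 5, 10):
    print(p, round(edge_fraction(0.10, p), 3))
```

For p = 1 a 10% neighborhood is genuinely local (10% of the axis), but for p = 10 it must cover about 79% of every coordinate's range, so "nearest" neighbors are no longer near.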
Parametric and structured models
9 / 30
A linear model f̂_L(X) = β̂0 + β̂1 X gives a reasonable fit here
[Figure: the simulated data of the previous slide (y against x, x from 1 to 6, y from −2 to 3), shown in two panels with fitted curves overlaid.]
10 / 30
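Fitting the linear model f̂_L(X) = β̂0 + β̂1 X by least squares can be sketched as follows. The data here are simulated for illustration (not the data in the figure), with true β0 = 1 and β1 = 0.5.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 6, size=100)
y = 1.0 + 0.5 * x + rng.normal(0, 0.4, size=100)  # true beta0=1, beta1=0.5

# Solve min ||y - (beta0 + beta1 * x)||^2 via least squares:
# the design matrix has a column of ones (intercept) and a column of x.
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
beta0_hat, beta1_hat = beta
print(beta0_hat, beta1_hat)  # should be close to (1.0, 0.5)
```

Because the model is parametric, estimating f reduces to estimating just two numbers, regardless of how many observations we have.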
[Figure: Income plotted against Years of Education and Seniority as a three-dimensional surface fit, shown over slides 11–14.]
14 / 30
Some trade-offs
15 / 30
[Figure: interpretability (High to Low) versus flexibility (Low to High) for several fitting methods. Subset Selection and the Lasso sit at high interpretability and low flexibility, Least Squares is in between, and more flexible methods lie toward high flexibility and low interpretability.]
17 / 30
[Figure 2.9: left, simulated data (Y against X, X from 0 to 100) with fitted curves of increasing flexibility; right, training and test mean squared error against flexibility (2 to 20). Test error is smallest at an intermediate flexibility.]
18 / 30
[Figure: as above, for a different true f; left, the data and fits; right, training and test MSE against flexibility.]
Here the truth is smoother, so the smoother fit and linear model do
really well.
FIGURE 2.10. Details are as in Figure 2.9, using a different true f that is
much closer to linear. In this setting, linear regression provides a very good fit to
the data.
19 / 30
[Figure: left, wiggly simulated data with fits; right, training and test mean squared error against flexibility.]
Here the truth is wiggly and the noise is low, so the more flexible fits
do the best.
FIGURE 2.11. Details are as in Figure 2.9, using a different f that is far from
linear. In this setting, linear regression provides a very poor fit to the data.
20 / 30
Bias-Variance Trade-o↵
Suppose we have fit a model f̂(x) to some training data Tr, and
let (x0 , y0 ) be a test observation drawn from the population. If
the true model is Y = f (X) + ε (with f (x) = E(Y |X = x)),
then

    E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε).

The expectation averages over the variability of y0 as well as
the variability in Tr. Note that Bias(f̂(x0)) = E[f̂(x0)] − f (x0).
Typically as the flexibility of f̂ increases, its variance increases
and its bias decreases. So choosing the flexibility based on
average test error amounts to a bias-variance trade-off.
21 / 30
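The decomposition can be checked numerically. The sketch below (an illustration, not part of the slides) repeatedly draws a training set, fits a deliberately simple and biased estimator at x0, namely the overall mean of the training responses (a constant model), and verifies that the expected squared test error splits into variance, squared bias, and irreducible noise.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(x)   # true regression function f(x) = E(Y | X = x)
sigma = 0.3               # standard deviation of the noise eps
x0 = 2.0                  # test point
n, reps = 50, 20000

fhat = np.empty(reps)     # f-hat(x0) over many independent training sets
err = np.empty(reps)      # (y0 - f-hat(x0))^2 over paired test draws
for r in range(reps):
    x = rng.uniform(0, np.pi, size=n)
    y = f(x) + rng.normal(0, sigma, size=n)
    fhat[r] = y.mean()    # constant-model estimate: biased at x0
    y0 = f(x0) + rng.normal(0, sigma)
    err[r] = (y0 - fhat[r]) ** 2

lhs = err.mean()                                            # E[(y0 - fhat)^2]
rhs = fhat.var() + (fhat.mean() - f(x0)) ** 2 + sigma ** 2  # Var + Bias^2 + Var(eps)
print(lhs, rhs)  # the two sides should agree up to Monte Carlo error
```

Here the variance term is tiny (the mean of 50 points is stable) but the squared bias is large, since a constant cannot track sin(x) at x0: exactly the rigid end of the trade-off.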
Bias-variance trade-o↵ for the three examples
[Figure: three panels plotting test MSE, squared bias, and variance against flexibility (2 to 20), one panel per simulated data set.]
FIGURE 2.12. Squared bias (blue curve), variance (orange curve), Var(ε)
(dashed line), and test MSE (red curve) for the three data sets in Figures 2.9–2.11.
The vertical dashed line indicates the flexibility level corresponding to the smallest
test MSE.
22 / 30
Classification Problems
23 / 30
[Figure: simulated binary data; the fitted conditional probability of class 1 plotted against x (x from 2 to 6), with the observations shown as rugs along the top (y = 1) and bottom (y = 0) of the plot.]
25 / 30
Classification: some details
26 / 30
Example: K-nearest neighbors in two dimensions
[Figure: simulated two-class data plotted in (X1, X2); each point belongs to one of two classes.]
27 / 30
KNN: K=10
[Figure: the two-class data in (X1, X2) with the KNN decision boundary for K = 10 shown as a black curve and the Bayes decision boundary as a dashed curve.]
FIGURE 2.15. The black curve indicates the KNN decision boundary on the
data from Figure 2.13, using K = 10. The Bayes decision boundary is shown as
a purple dashed line. The KNN and Bayes decision boundaries are very similar.
28 / 30
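A minimal K-nearest-neighbors classifier with the usual majority-vote rule and Euclidean distance can be sketched as follows; the function name and the toy two-class data are made up for illustration (they are not the figure's data).

```python
import numpy as np

def knn_predict(x0, X, y, k=10):
    """Classify x0 by majority vote among the k training points
    nearest to it in Euclidean distance."""
    d = np.linalg.norm(X - x0, axis=1)    # distance to every training point
    nearest = np.argsort(d)[:k]           # indices of the k closest points
    votes = y[nearest]
    return np.bincount(votes).argmax()    # most common class label

# Two well-separated Gaussian classes in (X1, X2).
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_predict(np.array([0.0, 0.0]), X, y, k=10))  # near class-0 center
print(knn_predict(np.array([4.0, 4.0]), X, y, k=10))  # near class-1 center
```

K controls the flexibility of the decision boundary: K = 1 traces the training data very closely, while very large K approaches a constant (majority-class) classifier.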
KNN: K=1 KNN: K=100
[Figure: the two-class data with KNN decision boundaries for K = 1 (left, highly irregular) and K = 100 (right, overly smooth); the Bayes boundary is dashed in each panel.]
29 / 30
[Figure: training error and test error rates plotted against 1/K on a log scale (0.01 to 1), where K is the number of neighbors; training error keeps falling as K decreases while test error is U-shaped.]
30 / 30