Contents

1 Chapter 1-2
  1.1 Statistical Decision Theory
    1.1.1 Example: Multivariate normally distributed
    1.1.2 Fitting f(x)
    1.1.3 Method of least squares when f(x) is linear
    1.1.4 Exercise: MSE
    1.1.5 K-Nearest Neighbor
    1.1.6 Curse of dimensionality
    1.1.7 Bias-Variance Trade-off
2 Chapter 3
  2.1 Linear Regression
    2.1.1 Assumptions
    2.1.2 Fitted and predicted values
    2.1.3 Exercise: Hat-Matrix
    2.1.4 Residuals
    2.1.5 Exercise: Residuals
1 Chapter 1-2
This chapter introduces linear models, least squares, and k-nearest neighbor methods, and the transition between them. We also discuss the problems that arise when using these methods in higher dimensions.
1.1 Statistical Decision Theory

Finding f(x) by minimizing the expected prediction error (EPE):

min_f EPE(f) = min_f E(Y − f(X))² = min_f E[ E( (Y − f(X))² | X = x ) ]

Solution:

f(x) = E(Y | X = x)

This conditional expectation is also known as the regression function. The best prediction of Y at any point X = x is the conditional mean, when "best" is measured by average squared error.
1.1.2 Fitting f(x)
f(x) = E(Y | X = x) depends on x and the parameters of the joint distribution P(X, Y), which are usually not known. Estimation can be done using MLE, the method of moments, etc. In many cases f(x) cannot be computed analytically.
Thus, we use the sampling approximation of the EPE:

EPE(f) ≈ ÊPE(f) = (1/N) Σ_{i=1}^N (y_i − f(x_i))²
This is called the method of least squares when f(·) is linear, or when f(·) is approximated by a first-order Taylor series expansion.

Nearest-Neighbor Method

Sampling approximation of f(x) = E(Y | X = x). The problem is that we usually do not have enough observations for each possible value of x. The solution is to average over neighbouring data:

f(x) ≈ f̂(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
Solution:

To begin with, let us expand and explain the solution from the lecture notes. We know that:

RSS(β) = Σ_{i=1}^N (y_i − x_i^T β)² = (y − Xβ)^T (y − Xβ)
= y^T y − β^T X^T y − y^T Xβ + β^T X^T Xβ = y^T y − 2β^T X^T y + β^T X^T Xβ

=⇒ ∂RSS(β)/∂β = ∂(y^T y − 2β^T X^T y + β^T X^T Xβ)/∂β
= ∂(y^T y)/∂β − 2 ∂(β^T X^T y)/∂β + ∂(β^T (X^T X)β)/∂β = 0 − 2X^T y + 2X^T Xβ =: 0

We now need to assume that (X^T X)^{-1} exists.
ℓ(β) = Σ_{i=1}^n log φ(y_i; (Xβ)_i, σ) = −log((2π)^{n/2} σ^n) − (1/(2σ²)) (y − Xβ)^T (y − Xβ)

Using ∂ℓ(β)/∂β = 0:

(1/σ²) (y − Xβ)^T X = (1/σ²) (y^T X − β^T X^T X) = 0 ⇔ β̂ = (X^T X)^{-1} X^T y ■
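As a quick numerical sanity check (a sketch with NumPy on simulated data of our own; the variable names are not from the notes), the closed form β̂ = (X^T X)^{-1} X^T y can be compared against a standard least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 3
# design matrix with an intercept column, so X is N x (p + 1)
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta + 0.1 * rng.normal(size=N)

# closed form from the derivation: solve (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# cross-check against NumPy's least-squares solver
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes should agree to machine precision whenever X^T X is invertible, which is exactly the assumption made above.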
1.1.5 K-Nearest Neighbor
Assumption: k-nearest neighbors assumes f(x) is well approximated by a locally
constant function. (Sample) approximation:
f(x) ≈ f̂(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
Nearest neighbor averaging can be pretty good for small p (e.g., p ≤ 4) and large sample size N.
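The averaging formula above can be sketched in a few lines (our own toy data and function name, not from the notes):

```python
import numpy as np

def knn_regress(x, X, y, k):
    """Average the responses of the k training points nearest to x."""
    dist = np.linalg.norm(X - x, axis=1)   # Euclidean distances to x
    neighbours = np.argsort(dist)[:k]      # indices of the k nearest points
    return y[neighbours].mean()            # f_hat(x) = (1/k) sum over N_k(x)

# toy data: responses near x = 1 average to 2, the far point is ignored
X = np.array([[0.9], [1.0], [1.1], [5.0]])
y = np.array([1.8, 2.0, 2.2, 10.0])
knn_regress(np.array([1.0]), X, y, k=3)  # averages 1.8, 2.0, 2.2
```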
Smoother versions of nearest neighbor averaging:
- Kernel smoothing
- Spline smoothing

Nearest neighbor methods can be lousy when p is large.
1.1.6 Curse of dimensionality

Nearest neighbors tend to be far away in high dimensions.
Example: a subcubical neighborhood for uniform data in a unit cube. The right-
hand plot shows the side-length of the subcube needed to capture a fraction r
of the volume of the data for different p. For p = 10: 80% of the range of each
coordinate is needed to capture 10% of the data.
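The side length in this example is e_p(r) = r^{1/p}, since a subcube of side s contains a fraction s^p of a unit cube's volume. A one-line check (our own illustration) confirms the p = 10 figure:

```python
def edge_length(r, p):
    """Side of the subcube capturing a fraction r of uniform data in a unit cube in p dimensions."""
    return r ** (1 / p)

edge_length(0.1, 10)  # ~0.794: ~80% of each coordinate's range for 10% of the data
edge_length(0.1, 1)   # 0.1: in one dimension the neighborhood stays local
```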
1.1.7 Bias-Variance Trade-off
f̂(x) is fitted using some training data Tr = {(x_{Tr,i}, y_{Tr,i})}, i = 1, ..., N. (x₀, y₀) is a test observation where x₀ is a fixed (deterministic) point from the support of X.

Model: y₀ = f(x₀) + ϵ₀, and the expected test MSE decomposes as

MSE(x₀) = E(y₀ − f̂(x₀))² = Var(f̂(x₀)) + [Bias(f̂(x₀))]² + Var(ϵ₀)

with Bias(f̂(x₀)) = E(f̂(x₀)) − f(x₀).
Example: curse of dimensionality and its effect on MSE, bias and variance. The input features x_i are uniformly distributed in [−1, 1]^p for p = 1, ..., 10. The plot shows the MSE, squared bias and variance curves in estimating f(0) with f(x) = exp(−8||x||²) as a function of dimension p.
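This effect is easy to reproduce by simulation. The sketch below (our own, under the assumptions of this example: noiseless responses y_i = f(x_i) and a 1-nearest-neighbor estimate of f(0)) shows the MSE growing with dimension as the nearest neighbour drifts away from the origin:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(X):
    # the example's target function, f(x) = exp(-8 * ||x||^2)
    return np.exp(-8 * np.sum(np.atleast_2d(X) ** 2, axis=1))

def one_nn_at_zero(p, N=500, reps=200):
    """Estimate bias^2, variance and MSE of the 1-NN estimate of f(0)."""
    est = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1, 1, size=(N, p))
        i = np.argmin(np.sum(X ** 2, axis=1))  # nearest training point to 0
        est[r] = f(X[i])[0]                    # noiseless response y_i = f(x_i)
    bias2 = (est.mean() - 1.0) ** 2            # f(0) = exp(0) = 1
    var = est.var()
    return bias2, var, bias2 + var             # MSE = bias^2 + variance here

b1, v1, m1 = one_nn_at_zero(p=1)
b10, v10, m10 = one_nn_at_zero(p=10)
# m10 is far larger than m1: in 10 dimensions the bias dominates
```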
2 Chapter 3
This chapter covers linear methods for regression.
2.1 Linear Regression

The linear regression model is

y = Xβ + ϵ

where
- y = (y₁, ..., y_n)^T is a vector of response variables,
- ϵ = (ϵ₁, ..., ϵ_n)^T is a vector of errors,
- the n × (p + 1) matrix X is the design matrix,
- the parameter vector β = (β₀, β₁, ..., β_p)^T is considered unknown.
2.1.1 Assumptions
A-Assumption
a1 In the model equation, no relevant independent variables are missing, and no irrelevant independent variables are included.
a2 The true relationship between x and y is linear.
a3 The parameter vector β is constant for all N observations (xi , yi ).
B-Assumption
b1 E(ϵ) = 0.
b2 Cov(ϵ) = σ² I_N.
b3 ϵ ∼ N(0, σ² I_N).
C-Assumption
c1 Each element of the (N × (p + 1))-matrix X is deterministic.
c2 rank (X) = p + 1.
2.1.2 Fitted and predicted values

The fitted values are ŷ = X β̂ = Hy with hat matrix H = X(X^T X)^{-1} X^T; the prediction at a new point x₀ is

ŷ₀ = x₀^T β̂
2.1.3 Exercise: Hat-Matrix
a) Prove that H is symmetric.

Solution:

H^T = (X(X^T X)^{-1} X^T)^T = X((X^T X)^{-1})^T X^T = X((X^T X)^T)^{-1} X^T = X(X^T X)^{-1} X^T = H ■
Solution: Since the eigenvalues of H are zero or one, the number of ones equals the rank of H. Because the trace is the sum of the eigenvalues, Rank(H) = Tr(H).
(I_N − H)H = I_N H − HH = H − H = 0

where the second-to-last equality follows from H being idempotent. ■
Solution:
Symmetric: (I_N − H)^T = I_N^T − H^T = I_N − H

Idempotent: (I_N − H)(I_N − H) = I_N² − I_N H − H I_N + H² = I_N − H − H + H = I_N − H
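These properties of H are easy to verify numerically (a sketch with simulated data of our own, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 20, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # full-rank design
H = X @ np.linalg.inv(X.T @ X) @ X.T                        # hat matrix

symmetric = np.allclose(H, H.T)                   # H^T = H
idempotent = np.allclose(H @ H, H)                # HH = H
trace_rank = round(np.trace(H))                   # Tr(H) = rank(X) = p + 1 = 3
annihilates = np.allclose((np.eye(N) - H) @ H, 0) # (I_N - H) H = 0
```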
2.1.4 Residuals
The residuals of a linear regression model are defined by
ϵ̂ = y − ŷ = y − Hy = (I_N − H)y
Estimator for σ²:

σ̂² = (1/(N − p − 1)) ϵ̂^T ϵ̂ = (1/(N − p − 1)) (y − X β̂)^T (y − X β̂)
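Putting the pieces together (our own simulated example, with a true error standard deviation of 0.3), the residuals and σ̂² can be computed directly from the hat matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(scale=0.3, size=N)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
resid = (np.eye(N) - H) @ y               # eps_hat = (I_N - H) y = y - y_hat
sigma2_hat = resid @ resid / (N - p - 1)  # unbiased estimator of sigma^2
# sigma2_hat should land near the true value 0.3^2 = 0.09 for moderate N
```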