
Contents

1 Chapter 1-2
  1.1 Statistical Decision Theory
      1.1.1 Example: Multivariate normally distributed
      1.1.2 Fitting f(x)
      1.1.3 Method of least squares when f(x) is linear
      1.1.4 Exercise: MSE
      1.1.5 K-Nearest Neighbor
      1.1.6 Curse of dimensionality
      1.1.7 Bias-Variance Trade-off

2 Chapter 3
  2.1 Linear Regression
      2.1.1 Assumptions
      2.1.2 Fitted and predicted values
      2.1.3 Exercise: Hat-Matrix
      2.1.4 Residuals
      2.1.5 Exercise: Residuals

1 Chapter 1-2
This chapter introduces linear models, least squares and k-nearest neighbor methods, and the transition between them. We also discuss the problems that arise when using these methods in higher dimensions.

1.1 Statistical Decision Theory


We want to find a function f(X) for predicting Y given values of the input X.

A loss function is required for penalizing errors in prediction; here we use the quadratic loss

L(Y, f(X)) = (Y − f(X))^2

Other loss functions may be used.

Expected (squared) prediction error:

EPE(f) = E[(Y − f(X))^2]

Finding f(x):

min_f EPE(f) = min_f E[(Y − f(X))^2] = min_f E[ E[(Y − f(X))^2 | X] ]

Minimization of EPE point-wise:

argmin_c E[(Y − c)^2 | X = x]

Solution:

f(x) = E(Y | X = x)

This is the conditional expectation, also known as the regression function. The best prediction of Y at any point X = x is the conditional mean, when "best" is measured by average squared error.
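
To see why the conditional mean minimizes this criterion, a short pointwise expansion (a standard argument, included here for completeness): for any constant c,

E[(Y − c)^2 | X = x] = E[(Y − E(Y|X = x))^2 | X = x] + (E(Y|X = x) − c)^2 = Var(Y | X = x) + (E(Y|X = x) − c)^2,

since the cross term 2(E(Y|X = x) − c) E[Y − E(Y|X = x) | X = x] vanishes. The first term does not depend on c, so the expression is minimized by choosing c = E(Y | X = x).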

1.1.1 Example: Multivariate normally distributed


     
(X, Y)^T ~ N( (µ_X, µ_Y)^T , [ Σ_XX  Σ_XY ; Σ_YX  σ_YY ] )

Then,

Y | X ~ N( µ_Y + Σ_YX Σ_XX^{-1} (X − µ_X) ,  σ_YY − Σ_YX Σ_XX^{-1} Σ_XY )

=⇒ f(x) = E(Y | X = x) = µ_Y + Σ_YX Σ_XX^{-1} (x − µ_X)

=⇒ Y = f(x) + ϵ = β_0 + x^T β + ϵ = linear regression model
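
As a quick numerical illustration (a minimal sketch with made-up parameter values, not from the lecture notes), one can simulate a bivariate normal and check that a least-squares fit of Y on X recovers the slope Σ_YX Σ_XX^{-1} and intercept µ_Y − Σ_YX Σ_XX^{-1} µ_X:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical parameters for a bivariate normal (X, Y)
    mu = np.array([1.0, 2.0])                 # (mu_X, mu_Y)
    Sigma = np.array([[2.0, 1.2],
                      [1.2, 3.0]])            # [[Sigma_XX, Sigma_XY], [Sigma_YX, sigma_YY]]

    XY = rng.multivariate_normal(mu, Sigma, size=100_000)
    X, Y = XY[:, 0], XY[:, 1]

    # Theoretical regression function: E(Y|X=x) = mu_Y + Sigma_YX Sigma_XX^{-1} (x - mu_X)
    slope_theory = Sigma[1, 0] / Sigma[0, 0]
    intercept_theory = mu[1] - slope_theory * mu[0]

    # Empirical least-squares fit of Y on X
    slope_hat, intercept_hat = np.polyfit(X, Y, deg=1)

    print(slope_theory, slope_hat)            # should be close
    print(intercept_theory, intercept_hat)    # should be close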

1.1.2 Fitting f(x)
f(x) = E(Y | X = x) depends on x and on the parameters of the joint distribution P(X, Y), which are usually not known. Estimation can be done using MLE, the method of moments, etc. In many cases f(x) cannot be computed analytically. Thus, a sampling approximation of the EPE is used:
EPE(f) ≈ \widehat{EPE}(f) = (1/N) Σ_{i=1}^N (y_i − f(x_i))^2

This is called the method of least squares when f(·) is linear or when f(·) is approximated by a first-order Taylor series expansion.

Nearest-Neighbor Method
- Sampling approximation of f(x) = E(Y | X = x). The problem is that we usually do not have enough observations for each possible value of x. The solution is to average over neighbouring data:

f(x) ≈ f̂(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i

where N_k(x) is the neighborhood containing the k points closest to x (nearest-neighbor methods).
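
A minimal sketch of this local average (plain NumPy; function name and data are illustrative, not from the notes):

    import numpy as np

    def knn_regress(x0, X, y, k=5):
        """Average the responses of the k training points closest to x0."""
        dist = np.linalg.norm(X - x0, axis=1)      # Euclidean distances to x0
        neighbors = np.argsort(dist)[:k]           # indices of the k nearest points
        return y[neighbors].mean()

    # Illustrative data: y = sin(2*pi*x) + noise
    rng = np.random.default_rng(1)
    X = rng.uniform(0, 1, size=(200, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=200)

    print(knn_regress(np.array([0.25]), X, y, k=10))   # roughly sin(pi/2) = 1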

1.1.3 Method of least squares when f(x) is linear


Assumption: least squares assumes f(x) is well approximated globally by a linear function f(x) ≈ x^T β.
Idea: find the β that minimizes the residual sum of squares (RSS).
RSS(β) = Σ_{i=1}^N (y_i − x_i^T β)^2 = (y − Xβ)^T (y − Xβ)

Solution: the least-squares estimator (LSE) of β is:

β̂ = (X^T X)^{-1} X^T y        (1)
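
A minimal sketch of computing the LSE in equation (1) on synthetic data (illustrative names and values, assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic data: y = 1 + 2*x1 - 3*x2 + noise
    N, p = 100, 2
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # design matrix with intercept column
    beta_true = np.array([1.0, 2.0, -3.0])
    y = X @ beta_true + rng.normal(scale=0.5, size=N)

    # Equation (1): beta_hat = (X^T X)^{-1} X^T y
    # (solve the normal equations instead of forming the inverse explicitly)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)        # close to (1, 2, -3)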

1.1.4 Exercise: MSE


Prove that under the assumption

(X, Y)^T ~ N( (µ_X, µ_Y)^T , [ Σ_XX  Σ_XY ; Σ_YX  σ_YY ] )

the LSE for β coincides with the MLE for β.

Solution:
To begin with, let us expand and explain the solution from the lecture notes. We know that:
RSS(β) = Σ_{i=1}^N (y_i − x_i^T β)^2 = (y − Xβ)^T (y − Xβ)

= y^T y − β^T X^T y − y^T Xβ + β^T X^T Xβ = y^T y − 2β^T X^T y + β^T X^T Xβ

=⇒ ∂RSS(β)/∂β = ∂(y^T y − 2β^T X^T y + β^T X^T Xβ)/∂β

= ∂(y^T y)/∂β − 2 ∂(β^T X^T y)/∂β + ∂(β^T (X^T X) β)/∂β = 0 − 2X^T y + 2X^T Xβ =: 0
We now need to assume that (X^T X)^{-1} exists.

=⇒ (X^T X)^{-1}(X^T X)β = (X^T X)^{-1} X^T y  ⇔  β̂ = (X^T X)^{-1} X^T y


Side note: β, X and y are supposed to be typeset in bold, since they are vectors and matrices.
Furthermore, let us show that the MLE for β coincides with the LSE. That is, we need to show that

β̂_ML = argmax_β ℓ(β) = (X^T X)^{-1} X^T y


To begin with, the log-likelihood function is given by:

ℓ(β) = Σ_{i=1}^n log φ(y_i ; (Xβ)_i , σ) = − log((2π)^{n/2} σ^n) − (1/(2σ^2)) (y − Xβ)^T (y − Xβ)

Using

∂(Ax)/∂x = A   and   ∂(f(x)^T g(x))/∂x = f(x)^T ∂g(x)/∂x + g(x)^T ∂f(x)/∂x
and setting the derivative equal to 0 (for maximizing), we get:

(1/σ^2) (y − Xβ)^T X = (1/σ^2) (y^T X − β^T X^T X) = 0  ⇔  β̂ = (X^T X)^{-1} X^T y   ■
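
As a numerical sanity check (a sketch, not part of the notes): maximizing the Gaussian log-likelihood over β should return the same coefficients as the closed-form LSE. The data and helper names below are illustrative, and SciPy's general-purpose minimize is applied to the negative log-likelihood.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(3)
    N, p = 200, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    y = X @ np.array([0.5, 1.0, -2.0, 0.3]) + rng.normal(scale=1.0, size=N)

    # Closed-form LSE
    beta_lse = np.linalg.solve(X.T @ X, X.T @ y)

    # Negative Gaussian log-likelihood in beta (sigma fixed; it does not affect the argmax in beta)
    def neg_loglik(beta, sigma=1.0):
        r = y - X @ beta
        return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + 0.5 * (r @ r) / sigma**2

    beta_mle = minimize(neg_loglik, x0=np.zeros(p + 1)).x

    print(np.allclose(beta_lse, beta_mle, atol=1e-4))   # True (up to optimizer tolerance)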

1.1.5 K-Nearest Neighbor
Assumption: k-nearest neighbors assumes f(x) is well approximated by a locally
constant function. (Sample) approximation:
f(x) ≈ f̂(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i

- Nearest-neighbor averaging can be pretty good for small p (e.g., p ≤ 4) and large sample size N.
- Smoother versions of nearest-neighbor averaging: kernel smoothing, spline smoothing.
- Nearest-neighbor methods can be lousy when p is large.

Curse of dimensionality: nearest neighbors tend to be far away in high dimensions (see the sketch below).
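
A minimal sketch (illustrative, not from the notes) of why nearest neighbors drift far away as p grows: for N uniform points in [−1, 1]^p, the typical distance from the origin to its nearest neighbor increases quickly with p.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 1000   # sample size (illustrative)

    for p in [1, 2, 5, 10, 20]:
        # Median (over repetitions) distance from the origin to the nearest of N uniform points
        d = [np.min(np.linalg.norm(rng.uniform(-1, 1, size=(N, p)), axis=1)) for _ in range(50)]
        print(p, round(float(np.median(d)), 3))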

1.1.6 Curse of dimensionality

Example: a subcubical neighborhood for uniform data in a unit cube. The corresponding figure (not reproduced here) shows the side length of the subcube needed to capture a fraction r of the volume of the data for different p; this side length is e_p(r) = r^{1/p}. For p = 10, 80% of the range of each coordinate is needed to capture 10% of the data, since 0.1^{1/10} ≈ 0.80.
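
A quick check of this arithmetic (illustrative Python):

    # Side length needed to capture a fraction r = 0.1 of the volume in dimension p
    for p in [1, 2, 3, 10]:
        print(p, round(0.1 ** (1 / p), 3))   # p = 10 gives about 0.794, i.e. roughly 80%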

1.1.7 Bias-Variance Trade-off
f̂(x) is fitted using some training data Tr = {(x_{Tr,i}, y_{Tr,i}), i = 1, ..., N}.
(x_0, y_0) is a test observation, where x_0 is a fixed (deterministic) point from the support of X.

Model:

Y = f (X) + ϵ with f (x) = E(Y |X = x) and E(ϵ|X = x) = 0


Mean Square Error (MSE) at point x_0:

MSE(x_0) = E[(y_0 − f̂(x_0))^2] = Var(f̂(x_0)) + [Bias(f̂(x_0))]^2 + Var(ϵ_0)

with

Bias(f̂(x_0)) = E(f̂(x_0)) − f(x_0)

Example: the curse of dimensionality and its effect on MSE, bias and variance. The input features x_i are uniformly distributed in [−1, 1]^p for p = 1, ..., 10. The plot shows the MSE, squared bias and variance curves in estimating f(0) with f(x) = exp(−8||x||^2) as a function of the dimension p.
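
A minimal simulation sketch of this example (assuming the noiseless 1-nearest-neighbor setup of the textbook figure; the sample size and number of repetitions here are made up):

    import numpy as np

    rng = np.random.default_rng(5)
    f = lambda x: np.exp(-8.0 * np.sum(x**2, axis=-1))   # f(x) = exp(-8 ||x||^2)

    N, reps = 1000, 200   # training size and number of simulated training sets (illustrative)

    for p in [1, 2, 5, 10]:
        est = []
        for _ in range(reps):
            X = rng.uniform(-1, 1, size=(N, p))
            y = f(X)                                         # noiseless responses
            nearest = np.argmin(np.linalg.norm(X, axis=1))   # 1-NN of the target point x0 = 0
            est.append(y[nearest])                           # 1-NN estimate of f(0) = 1
        est = np.array(est)
        bias2 = (est.mean() - 1.0) ** 2
        var = est.var()
        print(p, round(bias2 + var, 4), round(bias2, 4), round(var, 4))   # MSE, Bias^2, Variance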

2 Chapter 3
This chapter covers linear methods for regression.

2.1 Linear Regression


Linear regression in matrix form:

y = Xβ + ϵ

- y = (y_1, ..., y_N)^T is the vector of response variables
- ϵ = (ϵ_1, ..., ϵ_N)^T is the vector of errors
- the N × (p + 1) matrix X is the design matrix
- the parameter vector β = (β_0, β_1, ..., β_p)^T is considered unknown

2.1.1 Assumptions
A-Assumption
a1 No relevant independent variables are missing from the model equation, and the included independent variables are not irrelevant.
a2 The true relationship between x and y is linear.
a3 The parameter vector β is constant for all N observations (x_i, y_i).

B-Assumption
b1 E(ϵ) = 0.
b2 Cov(ϵ) = σ^2 I_N.
b3 ϵ ~ N(0, σ^2 I_N).

C-Assumption
c1 Each element of the (N × (p + 1)) matrix X is deterministic.
c2 rank(X) = p + 1.

2.1.2 Fitted and predicted values


Remember from equation (1): β̂ = (X^T X)^{-1} X^T y.
The fitted values at the training data are given by

ŷ = X β̂ = X(X^T X)^{-1} X^T y = Hy,   where H = X(X^T X)^{-1} X^T

H is usually called the hat matrix and is a projection matrix. The predicted value at a test point x_0 = (1, x_{01}, ..., x_{0p})^T is given by:

ŷ_0 = x_0^T β̂
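
A minimal sketch of fitted and predicted values (illustrative data; assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(6)
    N, p = 50, 2
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1) design matrix
    y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # equation (1)
    H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
    y_fitted = H @ y                                  # same as X @ beta_hat

    x0 = np.array([1.0, 0.2, -1.0])                   # test point (with intercept 1)
    y0_pred = x0 @ beta_hat                           # predicted value at x0

    print(np.allclose(y_fitted, X @ beta_hat), y0_pred)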

2.1.3 Exercise: Hat-Matrix
a) Prove that H is symmetric.

Solution: Showing that H is symmetric is the same as showing H = H^T.

H^T = (X(X^T X)^{-1} X^T)^T = (X^T)^T ((X^T X)^{-1})^T X^T = X((X^T X)^T)^{-1} X^T

= X(X^T (X^T)^T)^{-1} X^T = X(X^T X)^{-1} X^T = H   ■

b) Prove that H is idempotent

Solution: Showing that H is idempotent is the same as showing H = H^2:

H^2 = (X(X^T X)^{-1} X^T)(X(X^T X)^{-1} X^T) = X(X^T X)^{-1} (X^T X)(X^T X)^{-1} X^T

= X(X^T X)^{-1} X^T = H   ■

c) Show that Rank(H) = p + 1

Solution: Since H is symmetric and idempotent, its eigenvalues are either zero or one, and the number of ones equals the rank of H. Thus, rank(H) = tr(H).

tr(H) = tr(X(X^T X)^{-1} X^T) = tr((X^T X)(X^T X)^{-1}) = tr(I_{p+1}) = p + 1   ■

d) Show that IN − H is orthogonal to H.

Solution: This is the same as showing that (I_N − H)H = 0.

(I_N − H)H = I_N H − HH = H − H = 0

where the second equality uses that H is idempotent (HH = H).   ■

e) Show that IN − H is symmetric and idempotent.

Solution:
Symmetric: (I_N − H)^T = I_N^T − H^T = I_N − H.
Idempotent: (I_N − H)(I_N − H) = I_N^2 − I_N H − H I_N + H^2 = I_N − H − H + H = I_N − H.   ■

f) Show that rank(I − H) = N − p − 1
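
A quick numerical check of properties a) through f) on a random design matrix (a sketch, assuming NumPy; the tolerance-based checks are illustrative):

    import numpy as np

    rng = np.random.default_rng(7)
    N, p = 30, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    I = np.eye(N)

    print(np.allclose(H, H.T))                                   # a) symmetric
    print(np.allclose(H @ H, H))                                 # b) idempotent
    print(np.linalg.matrix_rank(H), round(np.trace(H)))          # c) both equal p + 1
    print(np.allclose((I - H) @ H, 0))                           # d) I - H orthogonal to H
    print(np.allclose((I - H) @ (I - H), I - H))                 # e) I - H idempotent (and symmetric)
    print(np.linalg.matrix_rank(I - H))                          # f) N - p - 1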

2.1.4 Residuals
The residuals of a linear regression model are defined by

ϵ̂ = y − ŷ = y − Hy = (I_N − H)y

Estimator for σ^2:

σ̂^2 = (1/(N − p − 1)) ϵ̂^T ϵ̂ = (1/(N − p − 1)) (y − X β̂)^T (y − X β̂)
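
A short sketch of computing the residuals and σ̂^2 on illustrative data (not from the notes):

    import numpy as np

    rng = np.random.default_rng(8)
    N, p = 50, 2
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=N)

    H = X @ np.linalg.inv(X.T @ X) @ X.T
    resid = (np.eye(N) - H) @ y                     # residuals: (I_N - H) y
    sigma2_hat = resid @ resid / (N - p - 1)        # unbiased estimator of sigma^2

    print(round(sigma2_hat, 4))                     # close to 0.3^2 = 0.09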

2.1.5 Exercise: Residuals


a) Prove that X T ϵ̂ = 0

Solution: X^T ϵ̂ = X^T (I_N − H)y = X^T I_N y − X^T Hy = X^T y − X^T y = 0, where in the last step we used that X^T H = X^T X(X^T X)^{-1} X^T = X^T.

b) Assuming that ϵ ~ N(0, σ^2 I_N) and X is deterministic, derive the distribution of ϵ̂. What can you say about this distribution? Are the residuals ϵ̂_i independent?
