
Contents

1 Chapter 1-2
  1.1 Statistical Decision Theory
      1.1.1 Example: Multivariate normally distributed
      1.1.2 Fitting f(x)
      1.1.3 Method of least squares when f(x) is linear
      1.1.4 Exercise: MSE
      1.1.5 K-Nearest Neighbor
      1.1.6 Curse of dimensionality
      1.1.7 Bias-Variance Trade-off

2 Chapter 3
  2.1 Linear Regression
      2.1.1 Assumptions
      2.1.2 Fitted and predicted values
      2.1.3 Exercise: Hat-Matrix
      2.1.4 Residuals
      2.1.5 Exercise: Residuals

1 Chapter 1-2
This chapter introduces linear models, least squares and k-nearest neighbor methods, and the transition between them. We also discuss the problems that arise when using these methods in higher dimensions.

1.1 Statistical Decision Theory


We want to find a function f(X) for predicting Y given values of the input X.

A loss function is required for penalizing errors in prediction; here we use the quadratic loss

L(Y, f(X)) = (Y − f(X))^2

Other loss functions may be used.

Expected (squared) prediction error:

EPE(f) = E[(Y − f(X))^2]

Finding f(x):

min_f EPE(f) = min_f E[(Y − f(X))^2] = min_f E[ E[(Y − f(X))^2 | X] ]

Minimization of EPE point-wise:

argmin_c E[(Y − c)^2 | X = x]

Solution:

f(x) = E(Y | X = x)

This is the conditional expectation, also known as the regression function. The best prediction of Y at any point X = x is the conditional mean, when "best" is measured by average squared error.
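
To see why the conditional mean minimizes this criterion, a short pointwise expansion (a standard argument, included here for completeness): for any constant c,

E[(Y − c)^2 | X = x] = E[(Y − E(Y|X = x))^2 | X = x] + (E(Y|X = x) − c)^2 = Var(Y | X = x) + (E(Y|X = x) − c)^2,

since the cross term 2(E(Y|X = x) − c) E[Y − E(Y|X = x) | X = x] vanishes. The first term does not depend on c, so the expression is minimized by choosing c = E(Y | X = x).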

1.1.1 Example: Multivariate normally distributed


     
(X, Y)^T ~ N( (µ_X, µ_Y)^T , [ Σ_XX  Σ_XY ; Σ_YX  σ_YY ] )

Then,

Y | X ~ N( µ_Y + Σ_YX Σ_XX^{-1} (X − µ_X) ,  σ_YY − Σ_YX Σ_XX^{-1} Σ_XY )

=⇒ f(x) = E(Y | X = x) = µ_Y + Σ_YX Σ_XX^{-1} (x − µ_X)

=⇒ Y = f(x) + ϵ = β_0 + x^T β + ϵ = linear regression model
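
As a quick numerical illustration (a minimal sketch with made-up parameter values, not from the lecture notes), one can simulate a bivariate normal and check that a least-squares fit of Y on X recovers the slope Σ_YX Σ_XX^{-1} and intercept µ_Y − Σ_YX Σ_XX^{-1} µ_X:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical parameters for a bivariate normal (X, Y)
    mu = np.array([1.0, 2.0])                 # (mu_X, mu_Y)
    Sigma = np.array([[2.0, 1.2],
                      [1.2, 3.0]])            # [[Sigma_XX, Sigma_XY], [Sigma_YX, sigma_YY]]

    XY = rng.multivariate_normal(mu, Sigma, size=100_000)
    X, Y = XY[:, 0], XY[:, 1]

    # Theoretical regression function: E(Y|X=x) = mu_Y + Sigma_YX Sigma_XX^{-1} (x - mu_X)
    slope_theory = Sigma[1, 0] / Sigma[0, 0]
    intercept_theory = mu[1] - slope_theory * mu[0]

    # Empirical least-squares fit of Y on X
    slope_hat, intercept_hat = np.polyfit(X, Y, deg=1)

    print(slope_theory, slope_hat)            # should be close
    print(intercept_theory, intercept_hat)    # should be close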

1.1.2 Fitting f(x)
f(x) = E(Y | X = x) depends on x and on the parameters of the joint distribution P(X, Y), which are usually not known. Estimation can be done using MLE, the method of moments, etc. In many cases f(x) cannot be computed analytically. Thus, a sampling approximation of the EPE is used:
EPE(f) ≈ \widehat{EPE}(f) = (1/N) Σ_{i=1}^N (y_i − f(x_i))^2

This is called the method of least squares when f(·) is linear or when f(·) is approximated by a first-order Taylor series expansion.

Nearest-Neighbor Method
- Sampling approximation of f(x) = E(Y | X = x). The problem is that we usually do not have enough observations for each possible value of x. The solution is to average over neighbouring data:

f(x) ≈ f̂(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i

where N_k(x) is the neighborhood containing the k points closest to x (nearest-neighbor methods).
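
A minimal sketch of this local average (plain NumPy; function name and data are illustrative, not from the notes):

    import numpy as np

    def knn_regress(x0, X, y, k=5):
        """Average the responses of the k training points closest to x0."""
        dist = np.linalg.norm(X - x0, axis=1)      # Euclidean distances to x0
        neighbors = np.argsort(dist)[:k]           # indices of the k nearest points
        return y[neighbors].mean()

    # Illustrative data: y = sin(2*pi*x) + noise
    rng = np.random.default_rng(1)
    X = rng.uniform(0, 1, size=(200, 1))
    y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.3, size=200)

    print(knn_regress(np.array([0.25]), X, y, k=10))   # roughly sin(pi/2) = 1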

1.1.3 Method of least squares when f(x) is linear


Assumption: least squares assumes f(x) is well approximated globally by a linear function f(x) ≈ x^T β.
Idea: find the β that minimizes the residual sum of squares (RSS).
RSS(β) = Σ_{i=1}^N (y_i − x_i^T β)^2 = (y − Xβ)^T (y − Xβ)

Solution: the least-squares estimator (LSE) of β is:

β̂ = (X^T X)^{-1} X^T y        (1)
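
A minimal sketch of computing the LSE in equation (1) on synthetic data (illustrative names and values, assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(2)

    # Synthetic data: y = 1 + 2*x1 - 3*x2 + noise
    N, p = 100, 2
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # design matrix with intercept column
    beta_true = np.array([1.0, 2.0, -3.0])
    y = X @ beta_true + rng.normal(scale=0.5, size=N)

    # Equation (1): beta_hat = (X^T X)^{-1} X^T y
    # (solve the normal equations instead of forming the inverse explicitly)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(beta_hat)        # close to (1, 2, -3)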

1.1.4 Exercise: MSE


Prove that under the assumption

(X, Y)^T ~ N( (µ_X, µ_Y)^T , [ Σ_XX  Σ_XY ; Σ_YX  σ_YY ] )

the LSE for β coincides with the MLE for β.

Solution:
To begin with, let us expand and explain the solution from the lecture notes. We know that:
RSS(β) = Σ_{i=1}^N (y_i − x_i^T β)^2 = (y − Xβ)^T (y − Xβ)

= y^T y − β^T X^T y − y^T Xβ + β^T X^T Xβ = y^T y − 2β^T X^T y + β^T X^T Xβ

=⇒ ∂RSS(β)/∂β = ∂(y^T y − 2β^T X^T y + β^T X^T Xβ)/∂β

= ∂(y^T y)/∂β − 2 ∂(β^T X^T y)/∂β + ∂(β^T (X^T X) β)/∂β = 0 − 2X^T y + 2X^T Xβ =: 0
We now need to assume that (X^T X)^{-1} exists.

=⇒ (X^T X)^{-1}(X^T X)β = (X^T X)^{-1} X^T y  ⇔  β̂ = (X^T X)^{-1} X^T y


Side note: β, X and y are supposed to be typeset in bold, since they are vectors and matrices.
Furthermore, let us show that the MLE for β coincides with the LSE. That is, we need to show that

β̂_ML = argmax_β ℓ(β) = (X^T X)^{-1} X^T y


To begin with, the log-likelihood function is given by:

ℓ(β) = Σ_{i=1}^n log φ(y_i ; (Xβ)_i , σ) = − log((2π)^{n/2} σ^n) − (1/(2σ^2)) (y − Xβ)^T (y − Xβ)

Using

∂(Ax)/∂x = A   and   ∂(f(x)^T g(x))/∂x = f(x)^T ∂g(x)/∂x + g(x)^T ∂f(x)/∂x
and setting the derivative equal to 0 (for maximizing), we get:

(1/σ^2) (y − Xβ)^T X = (1/σ^2) (y^T X − β^T X^T X) = 0  ⇔  β̂ = (X^T X)^{-1} X^T y   ■
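
As a numerical sanity check (a sketch, not part of the notes): maximizing the Gaussian log-likelihood over β should return the same coefficients as the closed-form LSE. The data and helper names below are illustrative, and SciPy's general-purpose minimize is applied to the negative log-likelihood.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(3)
    N, p = 200, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    y = X @ np.array([0.5, 1.0, -2.0, 0.3]) + rng.normal(scale=1.0, size=N)

    # Closed-form LSE
    beta_lse = np.linalg.solve(X.T @ X, X.T @ y)

    # Negative Gaussian log-likelihood in beta (sigma fixed; it does not affect the argmax in beta)
    def neg_loglik(beta, sigma=1.0):
        r = y - X @ beta
        return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + 0.5 * (r @ r) / sigma**2

    beta_mle = minimize(neg_loglik, x0=np.zeros(p + 1)).x

    print(np.allclose(beta_lse, beta_mle, atol=1e-4))   # True (up to optimizer tolerance)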

1.1.5 K-Nearest Neighbor
Assumption: k-nearest neighbors assumes f(x) is well approximated by a locally
constant function. (Sample) approximation:
f(x) ≈ f̂(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i

- Nearest-neighbor averaging can be pretty good for small p (e.g., p ≤ 4) and large sample size N.
- Smoother versions of nearest-neighbor averaging: kernel smoothing, spline smoothing.
- Nearest-neighbor methods can be lousy when p is large.

Curse of dimensionality: nearest neighbors tend to be far away in high dimensions (see the sketch below).
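
A minimal sketch (illustrative, not from the notes) of why nearest neighbors drift far away as p grows: for N uniform points in [−1, 1]^p, the typical distance from the origin to its nearest neighbor increases quickly with p.

    import numpy as np

    rng = np.random.default_rng(4)
    N = 1000   # sample size (illustrative)

    for p in [1, 2, 5, 10, 20]:
        # Median (over repetitions) distance from the origin to the nearest of N uniform points
        d = [np.min(np.linalg.norm(rng.uniform(-1, 1, size=(N, p)), axis=1)) for _ in range(50)]
        print(p, round(float(np.median(d)), 3))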

1.1.6 Curse of dimensionality

Example: a subcubical neighborhood for uniform data in a unit cube. The corresponding figure (not reproduced here) shows the side length of the subcube needed to capture a fraction r of the volume of the data for different p; this side length is e_p(r) = r^{1/p}. For p = 10, 80% of the range of each coordinate is needed to capture 10% of the data, since 0.1^{1/10} ≈ 0.80.
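
A quick check of this arithmetic (illustrative Python):

    # Side length needed to capture a fraction r = 0.1 of the volume in dimension p
    for p in [1, 2, 3, 10]:
        print(p, round(0.1 ** (1 / p), 3))   # p = 10 gives about 0.794, i.e. roughly 80%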

1.1.7 Bias-Variance Trade-off
f̂(x) is fitted using some training data Tr = {(x_{Tr,i}, y_{Tr,i}), i = 1, ..., N}.
(x_0, y_0) is a test observation, where x_0 is a fixed (deterministic) point from the support of X.

Model:

Y = f (X) + ϵ with f (x) = E(Y |X = x) and E(ϵ|X = x) = 0


Mean Square Error (MSE) at point x_0:

MSE(x_0) = E[(y_0 − f̂(x_0))^2] = Var(f̂(x_0)) + [Bias(f̂(x_0))]^2 + Var(ϵ_0)

with

Bias(f̂(x_0)) = E(f̂(x_0)) − f(x_0)

Example: the curse of dimensionality and its effect on MSE, bias and variance. The input features x_i are uniformly distributed in [−1, 1]^p for p = 1, ..., 10. The plot shows the MSE, squared bias and variance curves in estimating f(0) with f(x) = exp(−8||x||^2) as a function of the dimension p.
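
A minimal simulation sketch of this example (assuming the noiseless 1-nearest-neighbor setup of the textbook figure; the sample size and number of repetitions here are made up):

    import numpy as np

    rng = np.random.default_rng(5)
    f = lambda x: np.exp(-8.0 * np.sum(x**2, axis=-1))   # f(x) = exp(-8 ||x||^2)

    N, reps = 1000, 200   # training size and number of simulated training sets (illustrative)

    for p in [1, 2, 5, 10]:
        est = []
        for _ in range(reps):
            X = rng.uniform(-1, 1, size=(N, p))
            y = f(X)                                         # noiseless responses
            nearest = np.argmin(np.linalg.norm(X, axis=1))   # 1-NN of the target point x0 = 0
            est.append(y[nearest])                           # 1-NN estimate of f(0) = 1
        est = np.array(est)
        bias2 = (est.mean() - 1.0) ** 2
        var = est.var()
        print(p, round(bias2 + var, 4), round(bias2, 4), round(var, 4))   # MSE, Bias^2, Variance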

2 Chapter 3
This chapter covers linear methods for regression.

2.1 Linear Regression


Linear regression in matrix form:

y = Xβ + ϵ

- y = (y_1, ..., y_N)^T is the vector of response variables
- ϵ = (ϵ_1, ..., ϵ_N)^T is the vector of errors
- the N × (p + 1) matrix X is the design matrix
- the parameter vector β = (β_0, β_1, ..., β_p)^T is considered unknown

2.1.1 Assumptions
A-Assumption
a1 No relevant independent variables are missing from the model equation, and the included independent variables are not irrelevant.
a2 The true relationship between x and y is linear.
a3 The parameter vector β is constant for all N observations (x_i, y_i).

B-Assumption
b1 E(ϵ) = 0.
b2 Cov(ϵ) = σ^2 I_N.
b3 ϵ ~ N(0, σ^2 I_N).

C-Assumption
c1 Each element of the (N × (p + 1)) matrix X is deterministic.
c2 rank(X) = p + 1.

2.1.2 Fitted and predicted values


Remember from equation (1): β̂ = (X^T X)^{-1} X^T y.
The fitted values at the training data are given by

ŷ = X β̂ = X(X^T X)^{-1} X^T y = Hy,   where H = X(X^T X)^{-1} X^T

H is usually called the hat matrix and is a projection matrix. The predicted value at a test point x_0 = (1, x_{01}, ..., x_{0p})^T is given by:

ŷ_0 = x_0^T β̂
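
A minimal sketch of fitted and predicted values (illustrative data; assuming NumPy):

    import numpy as np

    rng = np.random.default_rng(6)
    N, p = 50, 2
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # N x (p+1) design matrix
    y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # equation (1)
    H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
    y_fitted = H @ y                                  # same as X @ beta_hat

    x0 = np.array([1.0, 0.2, -1.0])                   # test point (with intercept 1)
    y0_pred = x0 @ beta_hat                           # predicted value at x0

    print(np.allclose(y_fitted, X @ beta_hat), y0_pred)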

2.1.3 Exercise: Hat-Matrix
a) Prove that H is symmetric.

Solution: Showing that H is symmetric is the same as showing H = H^T.

H^T = (X(X^T X)^{-1} X^T)^T = (X^T)^T ((X^T X)^{-1})^T X^T = X((X^T X)^T)^{-1} X^T

= X(X^T (X^T)^T)^{-1} X^T = X(X^T X)^{-1} X^T = H   ■

b) Prove that H is idempotent

Solution: Showing that H is idempotent is the same as showing H = H^2:

H^2 = (X(X^T X)^{-1} X^T)(X(X^T X)^{-1} X^T) = X(X^T X)^{-1} (X^T X)(X^T X)^{-1} X^T

= X(X^T X)^{-1} X^T = H   ■

c) Show that Rank(H) = p + 1

Solution: Since H is symmetric and idempotent, its eigenvalues are either zero or one, and the number of ones equals the rank of H. Thus, rank(H) = tr(H).

tr(H) = tr(X(X^T X)^{-1} X^T) = tr((X^T X)(X^T X)^{-1}) = tr(I_{p+1}) = p + 1   ■

d) Show that IN − H is orthogonal to H.

Solution: This is the same as showing that (I_N − H)H = 0.

(I_N − H)H = I_N H − HH = H − H = 0

where the second equality uses that H is idempotent (HH = H).   ■

e) Show that IN − H is symmetric and idempotent.

Solution:
Symmetric: (I_N − H)^T = I_N^T − H^T = I_N − H.
Idempotent: (I_N − H)(I_N − H) = I_N^2 − I_N H − H I_N + H^2 = I_N − H − H + H = I_N − H.   ■

f) Show that rank(I − H) = N − p − 1
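
A quick numerical check of properties a) through f) on a random design matrix (a sketch, assuming NumPy; the tolerance-based checks are illustrative):

    import numpy as np

    rng = np.random.default_rng(7)
    N, p = 30, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    I = np.eye(N)

    print(np.allclose(H, H.T))                                   # a) symmetric
    print(np.allclose(H @ H, H))                                 # b) idempotent
    print(np.linalg.matrix_rank(H), round(np.trace(H)))          # c) both equal p + 1
    print(np.allclose((I - H) @ H, 0))                           # d) I - H orthogonal to H
    print(np.allclose((I - H) @ (I - H), I - H))                 # e) I - H idempotent (and symmetric)
    print(np.linalg.matrix_rank(I - H))                          # f) N - p - 1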

2.1.4 Residuals
The residuals of a linear regression model are defined by

ϵ̂ = y − ŷ = y − Hy = (I_N − H)y

Estimator for σ^2:

σ̂^2 = (1/(N − p − 1)) ϵ̂^T ϵ̂ = (1/(N − p − 1)) (y − X β̂)^T (y − X β̂)
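
A short sketch of computing the residuals and σ̂^2 on illustrative data (not from the notes):

    import numpy as np

    rng = np.random.default_rng(8)
    N, p = 50, 2
    X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
    y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=N)

    H = X @ np.linalg.inv(X.T @ X) @ X.T
    resid = (np.eye(N) - H) @ y                     # residuals: (I_N - H) y
    sigma2_hat = resid @ resid / (N - p - 1)        # unbiased estimator of sigma^2

    print(round(sigma2_hat, 4))                     # close to 0.3^2 = 0.09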

2.1.5 Exercise: Residuals


a) Prove that X T ϵ̂ = 0

Solution: X^T ϵ̂ = X^T (I_N − H)y = X^T I_N y − X^T Hy = X^T y − X^T y = 0, where in the last step we used that X^T H = X^T X(X^T X)^{-1} X^T = X^T.

b) Assuming that ϵ ~ N(0, σ^2 I_N) and X is deterministic, derive the distribution of ϵ̂. What can you say about this distribution? Are the residuals ϵ̂_i independent?
