You are on page 1of 5

SB2b HT 2017 - Problem Sheet 1 - solutions

1. For a given loss function L, the risk R is given by the expected loss

R(f ) = E [L(Y, f (X))] ,

where f = f (X) is a function of the random predictor variable X.

(a) Consider a regression problem and the squared error loss

L(Y, f (X)) = (Y f (X))2 .

Derive the expression of f = f (X) minimizing the associated risk.

Answer: We have
h i
R(f ) = E (Y f (X))2
Z h i
= E (Y f (X))2 X = x gX (x) dx,

where gX is density of X. Thus, it suffices to for every x, minimize:

h i
E (Y f (X))2 X = x

= E Y 2 X = x 2f (x) E [ Y | X = x] + f (x)2

= Var [Y |X = x] + (E [ Y | X = x] f (x))2 .

This is clearly minimized by the conditional mean:

f (x) = E [ Y | X = x] .

(b) What if we use the absolute (L1 ) loss instead?

L(Y, f (X)) = |Y f (X)|.

Answer: As before, we want to find f (x) to minimize

E |Y f (x)| X = x

Differentiating the expression with respect to f (x),

E sign(Y f (x)) X = x = P(Y < f (x)|X = x) + P(Y > f (x)|X = x) = 0

which occurs when P(Y < f (x)|X = x) = P(Y > f (x)|X = x) = 1/2, i.e. at the conditional
median f (x) = med [ Y | X = x].
2. Let (xi , yi )ni=1 be our dataset, with xi Rp and yi R. Linear regression can be formulated as
empirical risk minimization, where the model is to predict y as x> , and we use the squared loss:
X 1
R emp
() = (yi x>
i )

(a) Show that the optimal parameter is

= (X> X)1 X> Y

where X is a n p matrix with ith row given x>

i , and Y is a n 1 matrix with ith entry yi .

Answer: We can write the empirical risk as

kY Xk22
Differentiating wrt and setting to 0,

(X Y)> X = 0
> (X> X) Y> X = 0
= (X> X)1 X> Y

(b) Consider regularizing our empirical risk by incorporating a L2 regularizer. That is, find mini-
X 1
kk22 + (yi x>
i )
2 2
Show that the optimal parameter is given by the ridge regression estimator

= (I + X> X)1 X> Y

Answer: The objective becomes:

kY Xk22 + kk22
2 2
Again differentiating and setting derivative to 0,

(X Y)> X + > = 0
> (I + X> X) Y> X = 0
= (I + X> X)1 X> Y

(c) Compare the regularized and unregularized estimators of for n = 10 samples generated from
the following model:

y N (X, ) (1)

with the following parameters and a random design matrix:

beta = c(-.1,.3,-.5,.2,-.5,.1,.3,7)
sigma = 0.5
p = 8
n = 10

X = matrix(rnorm(n*p),n) # gives a matrix of size n x p

y = rnorm(n, mean=X%*%beta, sd=sigma)

use = .1.
Which estimator has smaller L2 norm? Investigate the bias / variance tradeoff across values of
[0, 10] as follows. Generate new datasets for training and testing and make a learning curve
plot comparing training and testing error like the one shown in lecture on slide 13 of Lecture 2.
Put on the x-axis and mean square error (prediction error) on the y-axis. Label the points on
the plot corresponding to unregularized linear regression.
Answer: See the R markdown document:
3. Consider two univariate normal distributions N (, 2 ) with known parameters A = 10 and A = 5
for class A and B = 20 and B = 5 for class B. Suppose class A represents the random score X of
a medical test of normal patients and class B represents the score of patients with a certain disease. A
priori there are 100 times more healthy patients than patients carrying the disease.
(a) Find the optimal decision rule in terms of misclassification error (0-1 loss) for allocating a new
observation x to either class A or B.
Answer: The optimal decision for X = x is to allocate to class

argmaxk{A,B} k gk (x).

The patients should be classified as healthy iff

1 (x A )2 1 (x B )2
A exp( 2 ) B exp( 2 ),
2A 2A 2B 2B
that is, using A = B , iff
2A log(A /B ) + (x A )2 (x B )2 .

The decision boundary is attained for equality, that is if x fulfills

2x(B A ) + 2A 2B 2A
log(A /B ) = 0.

For the given values, this implies that the decision boundary is at

x = (50 log 100 100 + 400)/(2 10) 26.51,

that is all patients with a test score above 26.51 are classified as having the disease.
(b) Repeat (a) if the cost of a false negative (allocating a sick patient to group A) is > 1 times that
of a false positive (allocating a healthy person to group B). Describe how the rule changes as
increases. For which value of are 84.1% of all patients with disease correctly classified?
Answer: The optimal decision minimizes E [L(Y, f (x))|X = x]. It is hence optimal to choose
class A (healthy) over class B if and only if

P (Y = A|X = x) P (Y = B|X = x).

Using the same argument as above,

P the patients should be classified as healthy now iff (ignoring
again the common denominator k{A,B} k gk (x)),

1 (x A )2 1 (x B )2
A exp( 2 ) B exp( 2 ).
2A 2A 2B 2B

The decision boundary is now attained if x fulfills

2x(B A ) + 2A 2B 2A
log(A /(B )) = 0.

For increasing values of , patients with decreasingly smaller test scores are classified as having
the disease.
84.1% of all patients carrying the disease are correctly classified if the decision boundary is at
the 15.9%-quantile of the N (B , B2 )-distribution, which is at q = 20 + 51 (0.159) 15.

This decision boundary is attained if

15 = q = (50 log(100/) 100 + 400)/20,

which implies that for  

20q 300
= 100 exp = 100,
approximately 84.1% of all patients are correctly classified as carrying the disease.
4. (Exercise 2.8 in Elements of Statistical Learning) Compare the classification performance of logis-
tic regression, regularized logistic regression (and optionally k-nearest neighbor classification) on
the ZIP code digit image dataset, restricting to only the 2s and 3s. Investigate L1 and L2 regular-
ization alone and in combination (the elastic net). Show both training and testing error for each
choice. The ZIP code data are available from
ElemStatLearn/data.html. You can read more about the data in Section 11.7 of Elements of
Statistical Learning.
Answer: See the R markdown document:
5. OPTIONAL: (Exercise 2.9 in Elements of Statistical Learning) Consider a linear regression model
with p parameters, fit with unregularized linear regression (sometimes called least squares or ordi-
nary least squares or even just OLS) to a set of training data (xi , yi )1iN drawn at random from a
population. Let be the estimator. Suppose we have some test data (xi , yi )1iM drawn at random
from the same population as the training data.
2 2
If Rtr () = N1 N > 1 PM >
i=1 yi xi and Rte () = M i=1 yi xi , prove that

E[Rtr ()] E[Rte ()]

where the expectation is over all that is random in each expression.

For a given set of training data, the corresponding estimator is found by minimizing the risk:
1 X 2
= argmin Rtr () = argmin yi x>

We could also consider corresponding to test data:
1 X >
= argmin Rte () = argmin yi xi


From this statement we see that Rte () Rte (), for any particular test dataset (xi , yi )1iM . If
we now consider the expectation over randomly drawn test datasets we have:

E[Rte ()] E[Rte ()], .

In particular, this holds for .

E[Rte ()] E[Rte ()]. (2)
Since train and test datasets come from the same distribution, we know that:

E[Rte ()] = E[Rtr ()]

Thus we see:

E[Rtr ()] = E[Rte ()] E[Rte ()]
by Eq. 2.