
# SB2b HT 2017 - Problem Sheet 1 - solutions

1. For a given loss function $L$, the risk $R$ is given by the expected loss

$$R(f) = \mathbb{E}\left[L(Y, f(X))\right],$$

where $f = f(X)$ is a function of the random predictor variable $X$.

(a) Consider a regression problem and the squared error loss

$$L(Y, f(X)) = (Y - f(X))^2.$$

Derive the expression of $f = f(X)$ minimizing the associated risk.

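Answer: The risk can be written as

$$R(f) = \mathbb{E}\left[(Y - f(X))^2\right] = \int \mathbb{E}\left[(Y - f(X))^2 \mid X = x\right] g_X(x)\, dx,$$

where $g_X$ is the density of $X$. Thus it suffices, for every $x$, to minimize

$$\mathbb{E}\left[(Y - f(X))^2 \mid X = x\right] = \mathbb{E}\left[Y^2 \mid X = x\right] - 2 f(x)\, \mathbb{E}[Y \mid X = x] + f(x)^2 = \operatorname{Var}[Y \mid X = x] + \left(\mathbb{E}[Y \mid X = x] - f(x)\right)^2.$$

The variance term does not depend on $f$, so this is minimized by the conditional mean:

$$f(x) = \mathbb{E}[Y \mid X = x].$$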

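Answer (absolute error loss): As before, we want to find $f(x)$ to minimize

$$\mathbb{E}\left[\,|Y - f(x)| \mid X = x\right].$$

Differentiating the expression with respect to $f(x)$ gives $-\mathbb{E}[\operatorname{sign}(Y - f(x)) \mid X = x]$. Setting the derivative to zero yields

$$\mathbb{E}\left[\operatorname{sign}(Y - f(x)) \mid X = x\right] = -P(Y < f(x) \mid X = x) + P(Y > f(x) \mid X = x) = 0,$$

which occurs when $P(Y < f(x) \mid X = x) = P(Y > f(x) \mid X = x) = 1/2$, i.e. at the conditional median $f(x) = \operatorname{med}[Y \mid X = x]$.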
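The two results of Problem 1 can be checked numerically. The following is a minimal sketch, assuming a hypothetical skewed sample (exponential draws) as a stand-in for the conditional distribution of $Y$ given $X = x$; the sample size and seed are arbitrary choices, not part of the sheet.

```r
# Illustrative check: the constant minimizing empirical squared error is the
# sample mean, and the constant minimizing empirical absolute error is the
# sample median.
set.seed(1)
y <- rexp(10001, rate = 1)   # hypothetical skewed sample standing in for Y | X = x

sq_risk  <- function(f) mean((y - f)^2)    # empirical squared-error risk
abs_risk <- function(f) mean(abs(y - f))   # empirical absolute-error risk

# One-dimensional numerical minimization over the range of the sample
f_sq  <- optimize(sq_risk,  interval = range(y))$minimum
f_abs <- optimize(abs_risk, interval = range(y))$minimum

c(f_sq = f_sq, mean = mean(y), f_abs = f_abs, median = median(y))
```

For a skewed distribution the two minimizers differ noticeably, which makes the distinction between the mean and the median easy to see.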
2. Let $(x_i, y_i)_{i=1}^n$ be our dataset, with $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. Linear regression can be formulated as empirical risk minimization, where the model is to predict $y$ as $x^\top \beta$, and we use the squared loss:

$$R^{\mathrm{emp}}(\beta) = \sum_{i=1}^n \frac{1}{2}\left(y_i - x_i^\top \beta\right)^2.$$

(a) Show that the optimal parameter is

$$\hat\beta = (X^\top X)^{-1} X^\top Y,$$

where $X$ is an $n \times p$ matrix with $i$th row given by $x_i^\top$, and $Y$ is an $n \times 1$ matrix with $i$th entry $y_i$.

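Answer: We can write the empirical risk as

$$\frac{1}{2}\|Y - X\beta\|_2^2.$$

Differentiating with respect to $\beta$ and setting the derivative to 0,

$$(X\beta - Y)^\top X = 0$$
$$\beta^\top (X^\top X) - Y^\top X = 0$$
$$\hat\beta = (X^\top X)^{-1} X^\top Y$$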
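The closed form above can be sanity-checked against R's built-in least squares routine. This is a minimal sketch on simulated data; the sizes, the coefficient vector `beta_true`, and the noise level are hypothetical values chosen only for illustration.

```r
set.seed(1)
n <- 50; p <- 3                      # illustrative sizes, not from the sheet
X <- matrix(rnorm(n * p), n)         # random n x p design matrix
beta_true <- c(1, -2, 0.5)           # hypothetical coefficients
Y <- as.vector(X %*% beta_true) + rnorm(n, sd = 0.1)

# Solve the normal equations (X'X) beta = X'Y instead of forming the inverse
beta_hat <- as.vector(solve(crossprod(X), crossprod(X, Y)))

# Built-in least squares fit without an intercept, for comparison
beta_lm <- unname(coef(lm(Y ~ X - 1)))

max(abs(beta_hat - beta_lm))         # difference should be at rounding level
```

Using `solve(A, b)` rather than `solve(A) %*% b` avoids computing the explicit inverse, which is the usual numerical preference.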

(b) Consider regularizing our empirical risk by incorporating an $L_2$ regularizer. That is, find $\beta$ minimizing

$$\frac{\lambda}{2}\|\beta\|_2^2 + \sum_{i=1}^n \frac{1}{2}\left(y_i - x_i^\top \beta\right)^2.$$

Show that the optimal parameter is given by the ridge regression estimator

$$\hat\beta = (\lambda I + X^\top X)^{-1} X^\top Y.$$

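Answer: The regularized empirical risk can be written as

$$\frac{1}{2}\|Y - X\beta\|_2^2 + \frac{\lambda}{2}\|\beta\|_2^2.$$

Again differentiating and setting the derivative to 0,

$$(X\beta - Y)^\top X + \lambda \beta^\top = 0$$
$$\beta^\top (\lambda I + X^\top X) - Y^\top X = 0$$
$$\hat\beta = (\lambda I + X^\top X)^{-1} X^\top Y$$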

(c) Compare the regularized and unregularized estimators of $\beta$ for $n = 10$ samples generated from the following model:

$$y \sim \mathcal{N}(X\beta, \sigma^2 I) \tag{1}$$

with the following parameters and a random design matrix:

    beta = c(-.1,.3,-.5,.2,-.5,.1,.3,7)
    sigma = 0.5
    p = 8
    n = 10
    X = matrix(rnorm(n*p),n) # gives a matrix of size n x p
    y = rnorm(n, mean=X%*%beta, sd=sigma)

Use $\lambda = 0.1$.

Which estimator has smaller $L_2$ norm? Investigate the bias / variance tradeoff across values of $\lambda \in [0, 10]$ as follows. Generate new datasets for training and testing and make a learning curve plot comparing training and testing error like the one shown in lecture on slide 13 of Lecture 2. Put $\lambda$ on the x-axis and mean square error (prediction error) on the y-axis. Label the points on the plot corresponding to unregularized linear regression.

Answer: See the R markdown document: http://www.stats.ox.ac.uk/flaxman/sml17/problem_2c_on_regularization.html.
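For the first part of (c), the norm comparison, a minimal sketch is given here. It reuses the parameters stated in the problem, with a regularization strength of 0.1; the seed and the helper `l2` are assumptions added for reproducibility and are not taken from the linked R markdown document.

```r
set.seed(1)                                   # arbitrary seed, for reproducibility only
beta  <- c(-.1, .3, -.5, .2, -.5, .1, .3, 7)  # as given on the sheet
sigma <- 0.5
p <- 8
n <- 10
X <- matrix(rnorm(n * p), n)                  # n x p random design matrix
y <- rnorm(n, mean = X %*% beta, sd = sigma)

lambda <- 0.1
# Unregularized least squares: solve (X'X) b = X'y
beta_ols   <- solve(crossprod(X), crossprod(X, y))
# Ridge regression: solve (lambda I + X'X) b = X'y
beta_ridge <- solve(lambda * diag(p) + crossprod(X), crossprod(X, y))

l2 <- function(b) sqrt(sum(b^2))              # hypothetical helper: Euclidean norm
c(ols = l2(beta_ols), ridge = l2(beta_ridge))
```

When $X^\top X$ is invertible, the ridge estimate shrinks each component of the least squares solution along the singular directions of $X$, so its norm cannot exceed that of the unregularized estimate.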
3. Consider two univariate normal distributions $\mathcal{N}(\mu, \sigma^2)$ with known parameters $\mu_A = 10$ and $\sigma_A = 5$ for class A and $\mu_B = 20$ and $\sigma_B = 5$ for class B. Suppose class A represents the random score $X$ of a medical test of normal patients and class B represents the score of patients with a certain disease. A priori there are 100 times more healthy patients than patients carrying the disease.
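(a) Find the optimal decision rule in terms of misclassification error (0-1 loss) for allocating a new observation $x$ to either class A or B.

Answer: The optimal decision for $X = x$ is to allocate to class

$$\operatorname*{argmax}_{k \in \{A, B\}} \pi_k g_k(x),$$

where $\pi_k$ is the prior probability of class $k$ and $g_k$ is its class-conditional density.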

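The patients should be classified as healthy iff

$$\pi_A \frac{1}{\sqrt{2\pi}\,\sigma_A} \exp\left(-\frac{(x - \mu_A)^2}{2\sigma_A^2}\right) \ge \pi_B \frac{1}{\sqrt{2\pi}\,\sigma_B} \exp\left(-\frac{(x - \mu_B)^2}{2\sigma_B^2}\right),$$

that is, using $\sigma_A = \sigma_B$, iff

$$-2\sigma_A^2 \log(\pi_A / \pi_B) + (x - \mu_A)^2 \le (x - \mu_B)^2.$$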

The decision boundary is attained at equality, that is, if $x$ fulfills

$$2x(\mu_B - \mu_A) + \mu_A^2 - \mu_B^2 - 2\sigma_A^2 \log(\pi_A / \pi_B) = 0.$$

For the given values, this implies that the decision boundary is at

$$x = (50 \log 100 - 100 + 400) / (2 \cdot 10) \approx 26.51,$$

that is, all patients with a test score above 26.51 are classified as having the disease.
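The boundary value can be reproduced numerically. The sketch below evaluates the closed form and cross-checks it by locating the point where the two prior-weighted densities cross; the variable names and the search interval passed to `uniroot` are illustrative choices.

```r
mu_A <- 10; mu_B <- 20; s <- 5        # class means and common standard deviation
pi_A <- 100 / 101; pi_B <- 1 / 101    # priors: 100 healthy patients per diseased one

# Closed form obtained by solving the boundary equation for x
x_star <- (2 * s^2 * log(pi_A / pi_B) + mu_B^2 - mu_A^2) / (2 * (mu_B - mu_A))

# Cross-check: root of the difference of prior-weighted class densities
h <- function(x) pi_A * dnorm(x, mu_A, s) - pi_B * dnorm(x, mu_B, s)
x_root <- uniroot(h, interval = c(mu_A, 40), tol = 1e-10)$root

round(c(closed_form = x_star, numeric = x_root), 2)
```

Both values should agree with the 26.51 stated in the answer; only the ratio of the priors matters, so any normalization of `pi_A` and `pi_B` with ratio 100 gives the same boundary.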
(b) Repeat (a) if the cost of a false negative (allocating a sick patient to group A) is $\kappa > 1$ times that of a false positive (allocating a healthy person to group B). Describe how the rule changes as $\kappa$ increases. For which value of $\kappa$ are 84.1% of all patients with disease correctly classified?

Answer: The optimal decision minimizes $\mathbb{E}[L(Y, f(x)) \mid X = x]$. It is hence optimal to choose class A (healthy) over class B if and only if

$$P(Y = A \mid X = x) \ge \kappa\, P(Y = B \mid X = x).$$

Using the same argument as above, the patients should be classified as healthy now iff (ignoring again the common denominator $\sum_{k \in \{A, B\}} \pi_k g_k(x)$)

$$\pi_A \frac{1}{\sqrt{2\pi}\,\sigma_A} \exp\left(-\frac{(x - \mu_A)^2}{2\sigma_A^2}\right) \ge \kappa\, \pi_B \frac{1}{\sqrt{2\pi}\,\sigma_B} \exp\left(-\frac{(x - \mu_B)^2}{2\sigma_B^2}\right).$$

The decision boundary is now attained if $x$ fulfills

$$2x(\mu_B - \mu_A) + \mu_A^2 - \mu_B^2 - 2\sigma_A^2 \log\left(\pi_A / (\kappa \pi_B)\right) = 0.$$

As $\kappa$ increases, the boundary moves to lower scores, so patients with increasingly smaller test scores are classified as having the disease.

84.1% of all patients carrying the disease are correctly classified if the decision boundary is at the 15.9%-quantile of the $\mathcal{N}(\mu_B, \sigma_B^2)$-distribution, which is at $q = 20 + 5\,\Phi^{-1}(0.159) \approx 15$. Substituting $x = q$ into the boundary equation implies that for

$$\kappa = 100 \exp\left(-\frac{20q - 300}{50}\right) = 100,$$

approximately 84.1% of all patients carrying the disease are correctly classified.
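The arithmetic in (b) can be verified directly. This is a minimal sketch in which `cost_ratio` is a hypothetical name for the cost of a false negative relative to a false positive; it solves the boundary equation at the 15.9% quantile of class B and then confirms the resulting fraction of correctly classified diseased patients.

```r
mu_A <- 10; mu_B <- 20; s <- 5            # class means and common standard deviation
prior_ratio <- 100                         # pi_A / pi_B

# 15.9% quantile of the class B score distribution
q <- qnorm(0.159, mean = mu_B, sd = s)

# Solve the boundary equation at x = q for the cost ratio
cost_ratio <- prior_ratio *
  exp(-(2 * q * (mu_B - mu_A) + mu_A^2 - mu_B^2) / (2 * s^2))

# Fraction of diseased patients scoring above the boundary q
sens <- pnorm(q, mean = mu_B, sd = s, lower.tail = FALSE)

c(q = q, cost_ratio = cost_ratio, sensitivity = sens)
```

Because $q$ is only approximately 15, the computed cost ratio is close to, rather than exactly, 100.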
4. (Exercise 2.8 in Elements of Statistical Learning) Compare the classification performance of logistic regression, regularized logistic regression (and optionally k-nearest neighbor classification) on the ZIP code digit image dataset, restricting to only the 2s and 3s. Investigate $L_1$ and $L_2$ regularization alone and in combination (the elastic net). Show both training and testing error for each choice. The ZIP code data are available from https://statweb.stanford.edu/tibs/ElemStatLearn/data.html. You can read more about the data in Section 11.7 of Elements of Statistical Learning.

Answer: See the R markdown document: http://www.stats.ox.ac.uk/flaxman/sml17/zip.html
5. OPTIONAL: (Exercise 2.9 in Elements of Statistical Learning) Consider a linear regression model with $p$ parameters, fit with unregularized linear regression (sometimes called least squares or ordinary least squares or even just OLS) to a set of training data $(x_i, y_i)_{1 \le i \le N}$ drawn at random from a population. Let $\hat\beta$ be the estimator. Suppose we have some test data $(\tilde x_i, \tilde y_i)_{1 \le i \le M}$ drawn at random from the same population as the training data.

If $R_{tr}(\beta) = \frac{1}{N} \sum_{i=1}^N \left(y_i - x_i^\top \beta\right)^2$ and $R_{te}(\beta) = \frac{1}{M} \sum_{i=1}^M \left(\tilde y_i - \tilde x_i^\top \beta\right)^2$, prove that

$$\mathbb{E}[R_{tr}(\hat\beta)] \le \mathbb{E}[R_{te}(\hat\beta)],$$

where the expectation is over all that is random in each expression.

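For a given set of training data, the corresponding estimator $\hat\beta$ is found by minimizing the risk:

$$\hat\beta = \operatorname*{argmin}_\beta R_{tr}(\beta) = \operatorname*{argmin}_\beta \frac{1}{N} \sum_{i=1}^N \left(y_i - x_i^\top \beta\right)^2.$$

We could also consider $\tilde\beta$ corresponding to test data:

$$\tilde\beta = \operatorname*{argmin}_\beta R_{te}(\beta) = \operatorname*{argmin}_\beta \frac{1}{M} \sum_{i=1}^M \left(\tilde y_i - \tilde x_i^\top \beta\right)^2.$$

From this statement we see that $R_{te}(\tilde\beta) \le R_{te}(\beta)$ for any $\beta$ and any particular test dataset $(\tilde x_i, \tilde y_i)_{1 \le i \le M}$. If we now consider the expectation over randomly drawn test datasets we have:

$$\mathbb{E}[R_{te}(\tilde\beta)] \le \mathbb{E}[R_{te}(\beta)], \quad \forall \beta.$$

In particular, this holds for $\hat\beta$:

$$\mathbb{E}[R_{te}(\tilde\beta)] \le \mathbb{E}[R_{te}(\hat\beta)]. \tag{2}$$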
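Since train and test datasets come from the same distribution, and each estimator minimizes the empirical risk on its own dataset, we know that (the equality is immediate when $M = N$):

$$\mathbb{E}[R_{te}(\tilde\beta)] = \mathbb{E}[R_{tr}(\hat\beta)].$$

Thus we see:

$$\mathbb{E}[R_{tr}(\hat\beta)] = \mathbb{E}[R_{te}(\tilde\beta)] \le \mathbb{E}[R_{te}(\hat\beta)]$$

by Eq. 2. $\square$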
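The inequality between expected training and test error can also be illustrated by simulation. The following is a rough sketch, assuming a hypothetical Gaussian linear model with equal training and test sizes; the dimensions, the number of repetitions `reps`, and the coefficients are arbitrary choices, and a simulation is of course an illustration rather than a proof.

```r
set.seed(1)
p <- 5; N <- 20; M <- 20; reps <- 2000    # hypothetical sizes for illustration
beta <- rnorm(p)                           # hypothetical true coefficients

one_rep <- function() {
  # Draw independent training and test sets from the same population
  X  <- matrix(rnorm(N * p), N); y  <- as.vector(X  %*% beta) + rnorm(N)
  Xt <- matrix(rnorm(M * p), M); yt <- as.vector(Xt %*% beta) + rnorm(M)
  b  <- solve(crossprod(X), crossprod(X, y))   # OLS fit on the training data
  c(train = mean((y - X %*% b)^2),             # training risk at the OLS fit
    test  = mean((yt - Xt %*% b)^2))           # test risk at the same fit
}

# Monte Carlo estimates of the expected training and test risk
err <- rowMeans(replicate(reps, one_rep()))
err
```

The averaged training error should come out below the averaged test error, in line with the result proved above.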