
Introduction to Machine Learning (CS 419M)                                     Midterm Exam

Computer Science and Engineering                                       2022-02-24, Thursday

Indian Institute of Technology Bombay

Instructions
1. It is a CLOSED BOOK examination.

2. This paper has three questions. Each question carries 10 marks. Therefore, the maximum marks
is 30.

3. Write your answers on paper, then scan and submit them at the end of the exam.

4. Write your name, roll number and the subject number (CS 419M) at the top of each of your
answer sheets.

5. There are multiple parts (sub-questions) in each question. Some sub-questions are objective
and some are subjective.

6. There will be partial credit for subjective questions if you have made substantial progress
towards the answer. However, there will be NO credit for rough work.

7. Please keep your answer sheets separate from your rough work. Do not attach the rough work
to the answer sheets. You should ONLY upload the answer sheets.

1. Consider the one-dimensional linear regression problem where we neglect the bias term. Suppose
   that the dataset {(x_i, y_i)}_{i=1}^m is generated by sampling m samples from the following
   distribution D:

       (x, y) = (1, 0)   with probability µ
                (µ, −1)  with probability (1 − µ)                                        (1)

   1.a Find the probability that the dataset contains at least one sample (µ, −1).
   1.a /2
   1.b Find E[x], E[y], Var(x), Var(y), where E[·] denotes expectation and Var(·) denotes the
       variance.
   1.b /4
   1.c Consider the loss function

           l(w) = Σ_{i=1}^m (y_i − w x_i)²                                               (2)

       Find the expected value of this loss function with respect to the distribution D and denote
       it L_D(w). Then find w* that minimizes

           L(w) = L_D(w) + λ|w|²                                                         (3)

   1.c /4
1.a The probability that no sample in the dataset equals (µ, −1) is µ^m, so the required
    probability is 1 − µ^m.
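
    As a quick sanity check (not part of the required answer), the following NumPy sketch
    estimates this probability by simulation; the values of µ and m below are arbitrary
    illustrative choices.

        import numpy as np

        rng = np.random.default_rng(0)
        mu, m, trials = 0.7, 5, 200_000            # arbitrary illustrative values

        # Each row is one dataset of m draws; an entry is True when the sample (mu, -1) is drawn.
        draws = rng.random((trials, m)) >= mu      # P(drawing (mu, -1)) = 1 - mu
        has_neg = draws.any(axis=1).mean()         # fraction of datasets with at least one (mu, -1)

        print(has_neg, 1 - mu**m)                  # the two numbers should be close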

1.b
        E[x] = µ·1 + (1 − µ)·µ = 2µ − µ²                                                 (4)
        E[y] = µ·0 + (1 − µ)·(−1) = µ − 1                                                (5)

        Var(x) = E[x²] − (E[x])²                                                         (6)
               = µ·1 + (1 − µ)·µ² − (2µ − µ²)²                                           (7)
               = µ − 3µ² + 3µ³ − µ⁴                                                      (8)

        Var(y) = E[y²] − (E[y])²                                                         (9)
               = µ·0 + (1 − µ)·1 − (µ − 1)²                                              (10)
               = µ − µ²                                                                  (11)
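
    A hedged numerical check of these moments, again for an arbitrary illustrative µ:

        import numpy as np

        rng = np.random.default_rng(0)
        mu, n = 0.4, 1_000_000                            # arbitrary illustrative values
        pick = rng.random(n) < mu                         # True -> sample (1, 0), False -> (mu, -1)
        x = np.where(pick, 1.0, mu)
        y = np.where(pick, 0.0, -1.0)

        print(x.mean(), 2*mu - mu**2)                     # E[x]
        print(y.mean(), mu - 1)                           # E[y]
        print(x.var(),  mu - 3*mu**2 + 3*mu**3 - mu**4)   # Var(x)
        print(y.var(),  mu - mu**2)                       # Var(y)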

1.c
        L_D(w) = E[ Σ_{i=1}^m (y_i − w x_i)² ]                                           (12)
               = Σ_{i=1}^m ( E[y_i²] + w² E[x_i²] − 2 w E[x_i y_i] )                     (13)
               = m [ (1 − µ) + w² (µ + µ² − µ³) − 2 w (µ² − µ) ]                         (14)

    Hence

        L(w) = L_D(w) + λ|w|²                                                            (15)
             = m [ (1 − µ) + w² (µ + µ² − µ³) − 2 w (µ² − µ) ] + λ w²                    (16)

    Setting the derivative to zero,

        dL/dw = m [ 2 w (µ + µ² − µ³) − 2 (µ² − µ) ] + 2 λ w = 0                         (17)

    Hence

        w* = m (µ² − µ) / ( m (µ + µ² − µ³) + λ )                                        (18)
    This is valid only if it is a minimum, i.e., the second derivative should be positive:

        d²L/dw² = 2 m (µ + µ² − µ³) + 2 λ > 0                                            (19)

    which indeed holds, since µ + µ² − µ³ = µ(1 + µ − µ²) > 0 for 0 < µ < 1 and λ ≥ 0.
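
    A small sketch that checks the closed-form minimizer in equation 18 against a brute-force
    grid search over w; the values of µ, m and λ are arbitrary illustrative choices.

        import numpy as np

        mu, m, lam = 0.3, 50, 2.0                         # arbitrary illustrative values

        def L(w):
            # Expected regularized loss from equation (16).
            return m * ((1 - mu) + w**2 * (mu + mu**2 - mu**3) - 2*w*(mu**2 - mu)) + lam * w**2

        w_star = m * (mu**2 - mu) / (m * (mu + mu**2 - mu**3) + lam)
        grid = np.linspace(-2, 2, 400_001)
        print(w_star, grid[np.argmin(L(grid))])           # the two values should agree closely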

2. Consider a 1-dimensional linear regression problem. The dataset corresponding to this problem
   has n examples D = {(x_i, y_i)}_{i=1}^n, where x_i, y_i ∈ R for all i. We don't have access to
   the inputs or outputs directly and we don't know the exact value of n. However, we have a few
   statistics computed from the data.
   Let w* = [w_0*, w_1*]⊤ be the unique solution that minimizes J(w) given by:

       J(w) = (1/n) Σ_{i=1}^n (y_i − w_0 − w_1 x_i)²                                     (20)

2.a Which of the following are true?

    i.   (1/n) Σ_{i=1}^n (y_i − w_0* − w_1* x_i) y_i = 0
    ii.  (1/n) Σ_{i=1}^n (y_i − w_0* − w_1* x_i) (y_i − ȳ) = 0
    iii. (1/n) Σ_{i=1}^n (y_i − w_0* − w_1* x_i) (x_i − x̄) = 0
    iv.  (1/n) Σ_{i=1}^n (y_i − w_0* − w_1* x_i) (w_0* + w_1* x_i) = 0

    where x̄ and ȳ are the sample means based on the same dataset D. Please provide
    justification for your answer.
2.a /2
2.b Suppose we have the following statistics computed from the dataset D:

        x̄ = (1/n) Σ_{i=1}^n x_i        ȳ = (1/n) Σ_{i=1}^n y_i        C_xx = (1/n) Σ_{i=1}^n (x_i − x̄)²

        C_yy = (1/n) Σ_{i=1}^n (y_i − ȳ)²        C_xy = (1/n) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ)

    Express w_1* and w_0* using (some or all of) these statistics. Show the derivation for full
    credit.
2.b /4
2.c Now suppose that the dataset D has been corrupted with some Gaussian noise, i.e., the new
    dataset will be D̃ = {(x̃_i, y_i)}_{i=1}^n, where x̃_i = x_i + ε_i and ε_i ∼ N(0, σ²). Now
    derive the expressions for w̃_0* and w̃_1* obtained by minimizing J(w) on this noisy dataset,
    in terms of σ and the true statistics (i.e., when n is large) given in part (b) above.
2.c /4

2.a Equating the derivative of J(w) with respect to w_0 and w_1 to zero, we get

        ∂J/∂w_0 = −(2/n) Σ_{i=1}^n (y_i − w_0* − w_1* x_i) = 0                           (21)
        ∂J/∂w_1 = −(2/n) Σ_{i=1}^n (y_i − w_0* − w_1* x_i) x_i = 0                       (22)

    Equation 21 says the residuals sum to zero, and equation 22 says they are orthogonal to the
    x_i. Since (x_i − x̄) and (w_0* + w_1* x_i) are linear combinations of 1 and x_i, statements
    iii and iv follow. Statements i and ii involve y_i, which is in general not a linear
    combination of 1 and x_i, so they need not hold. Hence only iii and iv are true.
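
    The following sketch fits ordinary least squares on synthetic data (all names and values are
    illustrative) and evaluates the four averages; only iii and iv should come out numerically
    close to zero.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 1000
        x = rng.normal(size=n)
        y = 1.5 * x - 0.7 + rng.normal(scale=0.5, size=n)   # synthetic linear data with noise

        w1, w0 = np.polyfit(x, y, 1)                         # least-squares slope and intercept
        r = y - w0 - w1 * x                                  # residuals

        print(np.mean(r * y))                                # i   : generally nonzero
        print(np.mean(r * (y - y.mean())))                   # ii  : generally nonzero
        print(np.mean(r * (x - x.mean())))                   # iii : ~0
        print(np.mean(r * (w0 + w1 * x)))                    # iv  : ~0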

2.b On expanding equations 21 and 22, we get

        w_0* + w_1* x̄ = ȳ                                                               (23)
        x̄ w_0* + (C_xx + x̄²) w_1* = C_xy + x̄ ȳ                                          (24)

    On solving, we get

        w_0* = ȳ − (C_xy / C_xx) x̄                                                       (25)
        w_1* = C_xy / C_xx                                                                (26)
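
    A quick check of equations 25 and 26 on synthetic data (illustrative values only):

        import numpy as np

        rng = np.random.default_rng(1)
        n = 2000
        x = rng.uniform(-3, 3, size=n)
        y = 0.8 * x + 2.0 + rng.normal(scale=0.3, size=n)

        Cxx = np.mean((x - x.mean())**2)
        Cxy = np.mean((x - x.mean()) * (y - y.mean()))

        w1_formula = Cxy / Cxx
        w0_formula = y.mean() - w1_formula * x.mean()
        w1_fit, w0_fit = np.polyfit(x, y, 1)                 # reference least-squares fit
        print(w1_formula, w1_fit)                            # should match
        print(w0_formula, w0_fit)                            # should match
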
2.c The statistics of the noisy dataset, expressed in terms of the old statistics (for large n,
    so that averages involving the noise ε_i converge to their expectations), are:

        (1/n) Σ_{i=1}^n x̃_i = x̄ + (1/n) Σ_{i=1}^n ε_i = x̄,          ȳ is unchanged,
        C̃_xx = (1/n) Σ_{i=1}^n (x̃_i − x̄)² = C_xx + σ²,
        C̃_xy = (1/n) Σ_{i=1}^n (x̃_i − x̄)(y_i − ȳ) = C_xy + (1/n) Σ_{i=1}^n ε_i (y_i − ȳ) = C_xy,
        C̃_yy = C_yy.

    Hence, applying the formulas of part (b) to these noisy statistics, the modified solutions are

        w̃_0* = ȳ − ( C_xy / (C_xx + σ²) ) x̄                                             (27)
        w̃_1* = C_xy / (C_xx + σ²)                                                        (28)
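
    A simulation sketch of the attenuation effect in equations 27 and 28: adding N(0, σ²) noise
    to the inputs shrinks the fitted slope from C_xy/C_xx towards C_xy/(C_xx + σ²). All constants
    below are illustrative.

        import numpy as np

        rng = np.random.default_rng(2)
        n, sigma = 200_000, 1.5                              # arbitrary illustrative values
        x = rng.normal(scale=2.0, size=n)
        y = 0.8 * x + 2.0 + rng.normal(scale=0.3, size=n)
        x_noisy = x + rng.normal(scale=sigma, size=n)        # corrupted inputs

        Cxx = x.var()
        Cxy = np.mean((x - x.mean()) * (y - y.mean()))

        slope_noisy, _ = np.polyfit(x_noisy, y, 1)
        print(slope_noisy, Cxy / (Cxx + sigma**2))           # should be close for large n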

3. Consider the problem of learning a binary classifier on a dataset S using the 0-1 loss, i.e.,

       l_{0-1}(w, (x, y)) = I[y ≠ sign(w⊤x)] = I[y w⊤x ≤ 0]

   where w, x ∈ R^d, y ∈ {−1, +1} and I[·] denotes the indicator function.

   3.a Design a convex surrogate loss function for this 0-1 loss. Note that a convex surrogate
       loss should (1) be convex and (2) always upper bound the original loss function. You will
       get full marks only if you show that the convex surrogate you propose satisfies these two
       properties.
   3.a /2
   3.b Define

           F_S(w) = λ|S| ∥w∥² + Σ_{i∈S} [1 − y_i w⊤x_i]_+                                 (29)

       Show that, if w* minimizes F_S(w), then

           ∥w*∥ = O(1/√λ)                                                                 (30)

       where we write a = O(b) iff

           a ≤ k b                                                                        (31)

       for some constant k.
   3.b /3
   3.c Now define

           F_S(w) = λ|S| ∥w∥² + Σ_{i∈S} (y_i − w⊤x_i)²                                    (32)

       Show that, if w* minimizes F_S(w) and ∥x_i∥₂ ≤ x_max for all i ∈ S, then

           ∥w*∥ = O(1/λ)                                                                  (33)

   3.c /5

3.a The 0-1 loss equals 1 exactly when y w⊤x ≤ 0, and equals 0 otherwise. Note that

        y w⊤x ≤ 0                                                                         (34)
        =⇒ 1 − y w⊤x ≥ 1                                                                  (35)

    Based on inequality 35, we can propose the following hinge loss:

        l(w, (x, y)) = max{0, 1 − y w⊤x}                                                  (36)

    It remains to show that the hinge loss is a valid convex surrogate for the 0-1 loss. We show
    this as follows:

    • Convexity: 1 − y w⊤x is an affine function of w and hence convex. The hinge loss is the
      composition of the convex, non-decreasing function g(u) = max{0, u} with this affine map,
      and composing a convex function with an affine map preserves convexity. (Equivalently, the
      hinge loss is the pointwise maximum of the two convex functions 0 and 1 − y w⊤x.)

    • Hinge loss upper bounds the 0-1 loss: Consider the case y w⊤x ≤ 0. Then
      l_{0-1}(w, (x, y)) = 1 and l(w, (x, y)) = 1 − y w⊤x ≥ 1, so l(w, (x, y)) ≥ l_{0-1}(w, (x, y)).
      Next, consider y w⊤x ≥ 1. Then l_{0-1}(w, (x, y)) = 0 and l(w, (x, y)) = 0. Finally, if
      0 < y w⊤x < 1, then l_{0-1}(w, (x, y)) = 0 and l(w, (x, y)) = 1 − y w⊤x > 0. In every case,
      l(w, (x, y)) ≥ l_{0-1}(w, (x, y)).
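
    A tiny numerical confirmation of the upper-bound property over random margins (purely
    illustrative):

        import numpy as np

        rng = np.random.default_rng(3)
        margin = rng.uniform(-5, 5, size=100_000)            # values of y * w^T x
        loss01 = (margin <= 0).astype(float)                 # 0-1 loss
        hinge = np.maximum(0.0, 1.0 - margin)                # hinge loss
        print(np.all(hinge >= loss01))                       # True: hinge upper bounds 0-1 loss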

3.b As w* minimizes F_S(w), we have

        F_S(w*) ≤ F_S(0)                                                                  (37)
        λ|S| ∥w*∥² + Σ_{i∈S} [1 − y_i w*⊤x_i]_+ ≤ Σ_{i∈S} [1 − 0]_+ = |S|                  (38)

    Now, since every hinge term is non-negative, the left-hand side of inequality 38 has the
    lower bound

        λ|S| ∥w*∥² ≤ λ|S| ∥w*∥² + Σ_{i∈S} [1 − y_i w*⊤x_i]_+                               (39)

    By combining inequalities 39 and 38, we have

        λ|S| ∥w*∥² ≤ |S|                                                                  (40)
        =⇒ ∥w*∥² ≤ 1/λ                                                                    (41)
        =⇒ ∥w*∥ ≤ 1/√λ                                                                    (42)
        =⇒ ∥w*∥ = O(1/√λ)                                                                 (43)
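
    A rough numerical check of the bound (30), not part of the proof: minimize F_S from equation
    29 by plain subgradient descent on random synthetic data and compare ∥w∥ with 1/√λ. The data,
    step sizes and iteration count are arbitrary illustrative choices.

        import numpy as np

        rng = np.random.default_rng(4)
        n, d, lam = 200, 5, 0.5                              # arbitrary illustrative values
        X = rng.normal(size=(n, d))
        y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

        # Subgradient descent on F_S(w) = lam*n*||w||^2 + sum_i [1 - y_i w^T x_i]_+  (equation 29).
        w = np.zeros(d)
        for t in range(1, 20_001):
            active = y * (X @ w) < 1.0                       # examples with a nonzero hinge term
            grad = 2 * lam * n * w - (y[active, None] * X[active]).sum(axis=0)
            w -= grad / (2 * lam * n * t)                    # diminishing step size

        print(np.linalg.norm(w), 1 / np.sqrt(lam))           # ||w|| should (approximately) respect the bound
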
3.c We have

        F_S(w) = λ|S| ∥w∥² + Σ_{i∈S} (y_i − w⊤x_i)²                                       (44)
               = λ|S| ∥w∥² + Σ_{i∈S} [ y_i² − 2 y_i w⊤x_i + (w⊤x_i)² ]                    (45)

    Ignoring the terms independent of w above, define

        G_S(w) = λ|S| ∥w∥² + Σ_{i∈S} [ (w⊤x_i)² − 2 y_i w⊤x_i ]                           (46)

    Since F_S and G_S differ only by the constant Σ_{i∈S} y_i², w* also minimizes G_S(w). Hence

        G_S(w*) ≤ G_S(0) = 0                                                              (47)
        =⇒ λ|S| ∥w*∥² + Σ_{i∈S} [ (w*⊤x_i)² − 2 y_i w*⊤x_i ] ≤ 0                           (48)

    The left-hand side of inequality 48 can be lower bounded using the Cauchy-Schwarz inequality
    together with (w*⊤x_i)² ≥ 0, |y_i| = 1 and ∥x_i∥ ≤ x_max:

        λ|S| ∥w*∥² − 2 |S| ∥w*∥ x_max ≤ λ|S| ∥w*∥² + Σ_{i∈S} [ (w*⊤x_i)² − 2 y_i w*⊤x_i ]   (49)

    By combining inequalities 48 and 49, we have

        λ|S| ∥w*∥² − 2 |S| ∥w*∥ x_max ≤ 0                                                 (50)
        =⇒ ∥w*∥ ≤ 2 x_max / λ                                                             (51)
        =⇒ ∥w*∥ = O(1/λ)                                                                  (52)
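
    For the squared-loss case, F_S from equation 32 has the closed-form minimizer
    w* = (λ|S| I + X⊤X)⁻¹ X⊤y (obtained by setting the gradient to zero), which makes the bound
    in equation 51 easy to check numerically; the data below is illustrative.

        import numpy as np

        rng = np.random.default_rng(5)
        n, d, lam = 300, 4, 0.2                              # arbitrary illustrative values
        X = rng.normal(size=(n, d))
        y = np.sign(rng.normal(size=n))                      # labels in {-1, +1}

        # Closed-form minimizer of equation (32): (lam*|S|*I + X^T X) w = X^T y.
        w_star = np.linalg.solve(lam * n * np.eye(d) + X.T @ X, X.T @ y)
        x_max = np.linalg.norm(X, axis=1).max()              # max_i ||x_i||

        print(np.linalg.norm(w_star), 2 * x_max / lam)       # the first should not exceed the second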

Total: 30
