
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI, K K BIRLA GOA CAMPUS

BITS F464 - Machine Learning 2021-22, Comprehensive Exam
Open Book, Open Laptop, Internet NOT allowed

Subject Name: BITS F464 - Machine Learning, Date: 19 May 2022
Examiner Name: Aditya Challa, Max. Marks: 40
Duration: 2.5 hours (9:00 AM - 11:30 AM)

Instructions
• Attempt all questions.
• Marks corresponding to each question are highlighted in bold within square braces at the end of the question.
• You are allowed to carry and refer to any printed material (.pdf format) on your laptop/iPad. However, you will not be allowed to use any application other than a .pdf reader. You are also not allowed to use the Internet.
• In case of ambiguities in any of the questions, clearly state your assumptions and attempt the question(s).
• You should write the final answers (one word) in the space provided on the question paper. Answers not written in the space provided will not be evaluated.
• You may show the reasoning behind each answer in the answer sheet, clearly indicating the problem number.
• Return both the question paper and the answer sheet at the end of the examination.

Please fill in the following details:

Name:

BITS ID:

BITS Email-ID: (used for Google Classroom)

Problem 1
1. Here we investigate how test error estimates vary with K when using the K-Fold Cross Validation scheme. We consider the following simple problem -
• The ground-truth function is a constant function given by y = α.
• To estimate the ground-truth function we have the dataset \{y_i\}_{i=1}^{n} given by \{\alpha, \alpha, \cdots (\text{repeated } n-1 \text{ times}) \cdots, \alpha, \alpha'\}, with α ≠ α′. That is, we have (n − 1) α values within the dataset and only one α′.
• The model class M (from which we choose our estimate) is given by the set of constant functions \{y = c \mid c \in \mathbb{R}\}.
• So, we need to choose a function y = c from the model class. We use the least squares criterion to obtain this, that is, choose the c which minimizes

\sum_{i=1}^{n} (y_i - c)^2    (1)

• Say we choose the function y = c_0. Recall that the test error is a measure of how far the estimated function is from the ground-truth function. In this case it is given by (c_0 − α)².
However, in general we do not have access to the ground-truth function and would like to estimate the test
error. We use the K-Fold cross-validation approach to obtain these estimates. Recall


• K-Fold cross validation splits the data into K parts; one part is left out and the remaining K − 1 parts are used to estimate the function. The test error is computed as follows. Assume p denotes the size of each fold, that is n = K × p; then the test error estimate for fold j is given by the mean squared error (MSE) of the points within fold j,

\hat{TE}_j = \frac{1}{p} \sum_{i=1}^{p} (y_i - \hat{y}_i)^2    (2)

Here y_i denotes the actual value and \hat{y}_i denotes the predicted value. Thus, we would have K estimates of test error \{\hat{TE}_i\}, one from each left-out fold. The final estimate of test error is given by

\frac{1}{K} \sum_{i=1}^{K} \hat{TE}_i    (3)

• When K = n (n is the size of the data) then this is called leave-one-out-cross-validation (LOOCV).
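As an illustration, here is a minimal numpy sketch of this scheme for the constant model class above, where the least-squares fit on any training subset is simply its mean. The helper name kfold_test_error and the reshape-based fold split are choices made for this sketch; it assumes n = K × p exactly.

import numpy as np

def kfold_test_error(y, K):
    """K-Fold CV estimate of test error for the constant model class
    {y = c}: the least-squares fit on any training subset is its mean."""
    n = len(y)
    p = n // K                                       # fold size, n = K * p
    folds = np.asarray(y, dtype=float).reshape(K, p)
    estimates = []
    for j in range(K):
        train = np.delete(folds, j, axis=0).ravel()  # remaining K - 1 folds
        c = train.mean()                             # least-squares constant fit
        estimates.append(np.mean((folds[j] - c) ** 2))  # eq. (2) on fold j
    return np.mean(estimates)                        # eq. (3)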

(a) What is the best possible test error one can (theoretically) obtain using the model class M above?
Answer Here:
[1 Mark]
(b) Let us now compute the LOOCV estimate for test error. Since out of n data points there are n − 1 data points which take the value α and only 1 data point which takes the value α′, two cases are possible -
i. If the data point with value α′ is left out, then the estimate from the remaining data is given by

In this case the estimate of test-error is

[1 + 1 Marks]
ii. If the data point with value α is left out, then the estimate from the remaining data is given
by

In this case the estimate of test-error is

[1 + 1 Marks]
iii. So, the final estimate of test-error is


[1 Mark]
(c) Let us now compute the K-Fold estimate for test error. Since there is only one data point which takes the value α′, this can belong to only one fold. So, we have two possible cases -
i. If the data point with value α′ is within the left-out portion, then the estimate from the remaining data is given by

In this case the estimate of test-error is

[1 + 1 Marks]
ii. If the data point with value α′ is not within the left-out portion, then the estimate from the remaining data is given by

In this case the estimate of test-error is

[1 + 1 Marks]
iii. So, the final estimate of test-error is

[1 Mark]
(d) State TRUE/FALSE. From the above, we have that the K-Fold estimate of test error is always less than the LOOCV estimate of test error.
Answer Here :
[2 Marks]

Important Note: All the answers should be in terms of α, α′, n, K, p.


Answer of exercise 1

(a) Observe that the best possible test error one can obtain here is 0, since y = α belongs to the model class.
(b) Let us now compute the LOOCV estimate for test error. Since there are n data points, out of which n − 1 take the value α and one takes the value α′, we have two possible cases -
i. If the data point with value α′ is left out, then the estimate from the remaining data is given by α. Hence the MSE is (α − α′)².


ii. If the data point with value α is left out, then the estimate from the remaining data is given by \frac{(n-2)\alpha + \alpha'}{n-1}. The MSE for each such data point is given by

\left( \alpha - \frac{(n-2)\alpha + \alpha'}{n-1} \right)^2 = \frac{(\alpha - \alpha')^2}{(n-1)^2}    (4)

That is, \frac{(\alpha - \alpha')^2}{(n-1)^2}.

iii. Hence the final estimate for the test error is

\frac{1}{n} \left[ (\alpha - \alpha')^2 + (n-1) \frac{(\alpha - \alpha')^2}{(n-1)^2} \right] = \frac{(\alpha - \alpha')^2}{n-1}    (5)

That is, (\alpha - \alpha')^2 \frac{1}{n-1}.

(c) Let us now compute the K-Fold estimate for test error. Since there is only one data point which takes the value α′, this can belong to only one fold. So, we have two possible cases -
i. If the data point with value α′ is within the left-out portion, then the estimate from the remaining data is given by α. The MSE in this case is given by (assume for simplicity n = K × p)

\frac{1}{p} \left[ (p-1)(\alpha - \alpha)^2 + (\alpha' - \alpha)^2 \right] = \frac{(\alpha - \alpha')^2}{p}    (6)

ii. If the data point with value α′ is not within the left-out portion, then the estimate is given by \frac{(n-p-1)\alpha + \alpha'}{n-p}. Then, the MSE is given by

\frac{1}{p} \, p \left( \alpha - \frac{(n-p-1)\alpha + \alpha'}{n-p} \right)^2 = \frac{(\alpha - \alpha')^2}{(n-p)^2}    (7)

iii. So, the final estimate of test error is

\frac{1}{K} \left[ \frac{(\alpha - \alpha')^2}{p} + (K-1) \frac{(\alpha - \alpha')^2}{(n-p)^2} \right]
= \frac{(\alpha - \alpha')^2}{n} + \frac{K-1}{K} \cdot \frac{(\alpha - \alpha')^2}{(n - n/K)^2}
= \frac{(\alpha - \alpha')^2}{n} + \frac{K-1}{K} \cdot \frac{K^2 (\alpha - \alpha')^2}{n^2 (K-1)^2}
= \frac{(\alpha - \alpha')^2}{n} \left( 1 + \frac{K}{n(K-1)} \right)    (8)

(d) Now we know that K ≤ n, and hence

\frac{K}{K-1} \ge \frac{n}{n-1} \;\Rightarrow\; 1 + \frac{K}{n(K-1)} \ge 1 + \frac{1}{n-1}
\;\Rightarrow\; \frac{(\alpha - \alpha')^2}{n} \left( 1 + \frac{K}{n(K-1)} \right) \ge \frac{(\alpha - \alpha')^2}{n} \left( 1 + \frac{1}{n-1} \right) = \frac{(\alpha - \alpha')^2}{n-1}    (9)
\;\Rightarrow\; \text{K-Fold Test Error} \ge \text{LOOCV Test Error}

So the statement in (d) is FALSE.
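The closed forms in (5) and (8) can be checked numerically; the sketch below reuses the kfold_test_error helper defined earlier, with illustrative values α = 0, α′ = 1, n = 12, K = 4.

import numpy as np

alpha, alpha_p, n, K = 0.0, 1.0, 12, 4
y = [alpha] * (n - 1) + [alpha_p]

loocv = kfold_test_error(y, K=n)   # LOOCV is the K = n case
kfold = kfold_test_error(y, K=K)

print(loocv, (alpha - alpha_p) ** 2 / (n - 1))                      # eq. (5)
print(kfold, (alpha - alpha_p) ** 2 / n * (1 + K / (n * (K - 1))))  # eq. (8)
assert kfold >= loocv   # the inequality in (9)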

Problem 2


2. Here we explore the connection between neural networks and regression splines. In particular we shall only be looking at piecewise linear splines and single hidden layer neural networks. Here we are interested in functions whose input is 1-dimensional and output is 1-dimensional.
Recall that a piecewise linear spline tries to fit a function of the form

f(x) =
\begin{cases}
a_0 x + b_0 & \text{if } x < \eta_0 \\
a_1 x + b_1 & \text{if } \eta_0 \le x < \eta_1 \\
\vdots & \vdots \\
a_n x + b_n & \text{if } \eta_{n-1} \le x < \eta_n
\end{cases}    (10)

where η_0, η_1, · · ·, η_n are called cut-points and we fit a simple linear regression line within each interval.
On the other hand, a neural network tries to fit a composition of functions. It first maps the input x into the hidden layer using the ReLU activation function,

h_i = \max(0, w_i^{(1)} x + c_i^{(1)})    (11)

where i denotes a unit in the hidden layer. Then the output is obtained by

o = \sum_{i=1}^{H} w_i^{(2)} h_i + c^{(2)}    (12)

where H denotes the number of neurons/units in the hidden layer.
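To make the composition concrete, here is a minimal numpy sketch of this single-hidden-layer forward pass; the names forward, w1, c1, w2, c2 are choices made for this illustration.

import numpy as np

def forward(x, w1, c1, w2, c2):
    """Single-hidden-layer ReLU network, eqs. (11)-(12)."""
    h = np.maximum(0.0, w1 * x + c1)   # hidden units h_i, eq. (11)
    return np.dot(w2, h) + c2          # output o, eq. (12)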

(a) Let us start simple. Consider the function obtained by a neural network with just one neuron in the hidden layer. Assume the values w_1^{(1)} = 1, c_1^{(1)} = 1, w_1^{(2)} = 2, c^{(2)} = 3. Let the equivalent spline function be obtained by

f(x) =
\begin{cases}
a_0 x + b_0 & \text{if } x < \eta_0 \\
a_1 x + b_1 & \text{if } x \ge \eta_0
\end{cases}    (13)

Compute the values of a_0, b_0, η_0, a_1, b_1.

Answer Here: a_0 =        b_0 =        η_0 =
Answer Here: a_1 =        b_1 =
[3 Marks]
(b) Now consider the function obtained by the neural network with two neurons in the hidden layer. Assume the values w_1^{(1)} = 1, c_1^{(1)} = 3, w_2^{(1)} = 2, c_2^{(1)} = 1, w_1^{(2)} = 2, w_2^{(2)} = 2, c^{(2)} = 3. Let the equivalent spline function be obtained by

f(x) =
\begin{cases}
a_0 x + b_0 & \text{if } x < \eta_0 \\
a_1 x + b_1 & \text{if } \eta_0 \le x < \eta_1 \\
a_2 x + b_2 & \text{if } x \ge \eta_1
\end{cases}    (14)

Compute the values of a_0, b_0, η_0, a_1, b_1, η_1, a_2, b_2.

Answer Here: a_0 =        b_0 =        η_0 =
Answer Here: a_1 =        b_1 =        η_1 =
Answer Here: a_2 =        b_2 =
[4 Marks]


(c) Let the function f̃(x) = |x|, where |x| denotes the absolute value of x. What is the minimum number of neurons in the hidden layer required to represent f̃(x)?
Answer Here:
[2 Marks]
(d) State TRUE/FALSE. The function represented by a single hidden layer network with H (some
arbitrary finite integer) neurons is piecewise linear.
Answer Here:
[2 Marks]
(e) State TRUE/FALSE: Let f denote the piecewise linear function which is equivalent to a neural network with H (some arbitrary finite integer) neurons. The function f is continuous at η, where η denotes a cut-point.
Answer Here:
[2 Marks]

Answer of exercise 2
Observe that each h_i can be written as a linear spline as follows: If w_i^{(1)} > 0,

h_i =
\begin{cases}
0 & \text{if } x < -c_i^{(1)}/w_i^{(1)} \\
w_i^{(1)} x + c_i^{(1)} & \text{if } x \ge -c_i^{(1)}/w_i^{(1)}
\end{cases}    (15)

If w_i^{(1)} < 0,

h_i =
\begin{cases}
0 & \text{if } x > -c_i^{(1)}/w_i^{(1)} \\
w_i^{(1)} x + c_i^{(1)} & \text{if } x \le -c_i^{(1)}/w_i^{(1)}
\end{cases}    (16)

(a) In the case of a single neuron in the hidden layer we have

h_1 = f_1(x) =
\begin{cases}
0 & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(1)} x + c_1^{(1)} & \text{if } x \ge -c_1^{(1)}/w_1^{(1)}
\end{cases}    (17)

and hence,

o = w_1^{(2)} h_1 + c^{(2)} =
\begin{cases}
c^{(2)} & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(2)} w_1^{(1)} x + w_1^{(2)} c_1^{(1)} + c^{(2)} & \text{if } x \ge -c_1^{(1)}/w_1^{(1)}
\end{cases}    (18)

So, we have
Answer Here: a_0 = 0    b_0 = c^{(2)}    η_0 = -c_1^{(1)}/w_1^{(1)}
Answer Here: a_1 = w_1^{(2)} w_1^{(1)}    b_1 = w_1^{(2)} c_1^{(1)} + c^{(2)}
By substituting the given values w_1^{(1)} = 1, c_1^{(1)} = 1, w_1^{(2)} = 2, c^{(2)} = 3:
Answer Here: a_0 = 0    b_0 = 3    η_0 = −1
Answer Here: a_1 = 2    b_1 = 5
(1) (1)
(b) In the case of two neurons in the hidden layer we have (both w_1^{(1)}, w_2^{(1)} > 0)

h_1 =
\begin{cases}
0 & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(1)} x + c_1^{(1)} & \text{if } x \ge -c_1^{(1)}/w_1^{(1)}
\end{cases}
\qquad
h_2 =
\begin{cases}
0 & \text{if } x < -c_2^{(1)}/w_2^{(1)} \\
w_2^{(1)} x + c_2^{(1)} & \text{if } x \ge -c_2^{(1)}/w_2^{(1)}
\end{cases}    (19)


Note that -c_1^{(1)}/w_1^{(1)} < -c_2^{(1)}/w_2^{(1)}. So, the values at the middle layer are going to be

(h_1, h_2) =
\begin{cases}
(0, 0) & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
(w_1^{(1)} x + c_1^{(1)}, \; 0) & \text{if } -c_1^{(1)}/w_1^{(1)} \le x < -c_2^{(1)}/w_2^{(1)} \\
(w_1^{(1)} x + c_1^{(1)}, \; w_2^{(1)} x + c_2^{(1)}) & \text{if } x \ge -c_2^{(1)}/w_2^{(1)}
\end{cases}    (20)

Now, we know that

o = w_1^{(2)} h_1 + w_2^{(2)} h_2 + c^{(2)}    (21)

So, we have

o =
\begin{cases}
c^{(2)} & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(2)} w_1^{(1)} x + w_1^{(2)} c_1^{(1)} + c^{(2)} & \text{if } -c_1^{(1)}/w_1^{(1)} \le x < -c_2^{(1)}/w_2^{(1)} \\
(w_1^{(2)} w_1^{(1)} + w_2^{(2)} w_2^{(1)}) x + w_1^{(2)} c_1^{(1)} + w_2^{(2)} c_2^{(1)} + c^{(2)} & \text{if } x \ge -c_2^{(1)}/w_2^{(1)}
\end{cases}    (22)

So, we have that

Answer Here: a_0 = 0    b_0 = c^{(2)}    η_0 = -c_1^{(1)}/w_1^{(1)}
Answer Here: a_1 = w_1^{(2)} w_1^{(1)}    b_1 = w_1^{(2)} c_1^{(1)} + c^{(2)}    η_1 = -c_2^{(1)}/w_2^{(1)}
Answer Here: a_2 = w_1^{(2)} w_1^{(1)} + w_2^{(2)} w_2^{(1)}    b_2 = w_1^{(2)} c_1^{(1)} + w_2^{(2)} c_2^{(1)} + c^{(2)}
Substituting the values w_1^{(1)} = 1, c_1^{(1)} = 3, w_2^{(1)} = 2, c_2^{(1)} = 1, w_1^{(2)} = 2, w_2^{(2)} = 2, c^{(2)} = 3:
Answer Here: a_0 = 0    b_0 = 3    η_0 = −3
Answer Here: a_1 = 2    b_1 = 9    η_1 = −1/2
Answer Here: a_2 = 6    b_2 = 11
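As a quick numerical check of these values, one can compare the network output against the derived spline on a grid, reusing the forward sketch from above with the part (b) weights:

import numpy as np

w1, c1 = np.array([1.0, 2.0]), np.array([3.0, 1.0])   # w_i^(1), c_i^(1)
w2, c2 = np.array([2.0, 2.0]), 3.0                    # w_i^(2), c^(2)

def spline(x):
    # Piecewise form derived above: cut-points at -3 and -1/2.
    if x < -3.0:
        return 3.0
    if x < -0.5:
        return 2.0 * x + 9.0
    return 6.0 * x + 11.0

for x in np.linspace(-6, 6, 49):
    assert np.isclose(forward(x, w1, c1, w2, c2), spline(x))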

(c) Answer: 2.
Observe that

|x| =
\begin{cases}
-x & \text{if } x < 0 \\
x & \text{if } x \ge 0
\end{cases}    (23)

This can actually be written as max(0, x) + max(0, −x). Hence two neurons in the hidden layer suffice.
This cannot be achieved with a single hidden neuron: either as x → ∞ or as x → −∞ we have h_1 → 0, so the output is constant on that half-line, whereas |x| is unbounded in both directions.
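The two-neuron identity above is easy to check numerically, for instance:

import numpy as np
x = np.linspace(-5, 5, 101)
assert np.allclose(np.abs(x), np.maximum(0, x) + np.maximum(0, -x))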
(d) Answer: TRUE, since each h_i is piecewise linear and a sum of piecewise linear functions is piecewise linear.

(e) Answer: TRUE, since a sum of continuous functions is continuous. In this case the functions at the hidden neurons are continuous, hence their sum is continuous.

Problem 3
3. Here we investigate the relationship between multi-class logistic regression and decision trees. More precisely, we would like to identify the final regions of the decision tree using logistic regression. For simplicity we assume that the input is a 1-dimensional real value and the output is the class label.
Recall the following variant of multi-class logistic regression -
• If Y denotes the class label, we have

P(Y = i) \propto \exp(w_i x + b_i)    (24)

Here w_i, b_i are constants since the input x is a 1-dimensional real value.


• Hence, we have

P(Y = i) = \frac{\exp(w_i x + b_i)}{\sum_j \exp(w_j x + b_j)}    (25)

We slightly extend the above formulation with an additional parameter τ as follows -

P(Y = i) = \frac{\exp(\tau (w_i x + b_i))}{\sum_j \exp(\tau (w_j x + b_j))}    (26)
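The sharpening effect of τ is easy to see numerically; the sketch below uses illustrative scores w_i x_0 + b_i, and the helper name softmax_tau is a choice made for this sketch.

import numpy as np

def softmax_tau(scores, tau):
    # Temperature-scaled softmax, eq. (26); subtract the max for stability.
    z = tau * (scores - scores.max())
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.5])        # w_i x0 + b_i at some x0
for tau in [1.0, 10.0, 100.0]:
    print(tau, softmax_tau(scores, tau))  # approaches one-hot at the argmax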

(a) Suppose, for a given value x = x_0, we have w_k x_0 + b_k > w_i x_0 + b_i for all i ≠ k. Then, P(Y = k) → ______ as τ → ∞.
[2 Marks]
(b) Suppose, for a given value x = x_0, we have w_k x_0 + b_k < \max_{i \neq k} \{w_i x_0 + b_i\}. Then, P(Y = k) → ______ as τ → ∞.
[2 Marks]

So, if we let h_i(x) = w_i x + b_i, then \arg\max_i \{h_i\} gives the label.

Now, observe that since the input is 1-dimensional, the regions obtained as the leaves of a decision tree will be of the form

\{\beta_{-1} < x \le \beta_0\}, \{\beta_0 < x \le \beta_1\}, \cdots, \{\beta_{i-1} < x \le \beta_i\}, \cdots, \{\beta_{n-1} < x \le \beta_n\}    (27)

Assume that β_{−1} = −∞ and β_n = ∞.

The aim is to find the parameters \{(w_i, b_i)\} such that if \arg\max_i \{h_i(x)\} = \arg\max_i \{w_i x + b_i\} = k, then x belongs to the region \{\beta_{k-1} < x \le \beta_k\}.

(c) Let us start simple with two regions (i.e. n = 1) - \{\beta_{-1} < x \le \beta_0\}, \{\beta_0 < x \le \beta_1\}. Note that β_{−1} = −∞ and β_1 = ∞. Define

h_0(x) = x \qquad h_1(x) = 2x - c    (28)

What is the value of c (in terms of β_0) such that

h_0 > h_1 \Rightarrow x < \beta_0 \qquad\text{and}\qquad h_1 > h_0 \Rightarrow x > \beta_0    (29)

Answer Here:
[2 Marks]
(d) Let us consider three regions (i.e. n = 2) - \{\beta_{-1} < x \le \beta_0\}, \{\beta_0 < x \le \beta_1\}, \{\beta_1 < x \le \beta_2\}. Note that β_{−1} = −∞ and β_2 = ∞. Define

h_0(x) = x \qquad h_1(x) = 2x - c_1 \qquad h_2(x) = 3x - c_2    (30)

What are the values of c_1, c_2 (in terms of β_0, β_1) such that

\begin{Bmatrix} h_0 > h_1 \\ h_0 > h_2 \end{Bmatrix} \Rightarrow x < \beta_0
\quad\text{and}\quad
\begin{Bmatrix} h_1 > h_0 \\ h_1 > h_2 \end{Bmatrix} \Rightarrow \beta_0 < x < \beta_1
\quad\text{and}\quad
\begin{Bmatrix} h_2 > h_0 \\ h_2 > h_1 \end{Bmatrix} \Rightarrow x > \beta_1    (31)

c_1 :
c_2 :


[4 Marks]
(e) Extending the above to n regions -

\{\beta_{-1} < x \le \beta_0\}, \{\beta_0 < x \le \beta_1\}, \cdots, \{\beta_{i-1} < x \le \beta_i\}, \cdots, \{\beta_{n-1} < x \le \beta_n\}    (32)

Let
h_i = (i+1)x - c_i    (33)
What is the value of c_i (in terms of \{\beta_i\})?
c_i :
[2 Marks]

Answer of exercise 3

(a) Observe that

\sum_j \exp(\tau x_j) \approx \exp(\tau \max_j \{x_j\}) \quad \text{as } \tau \to \infty    (34)

So, if x_i = \max_j \{x_j\},

\frac{\exp(\tau x_i)}{\sum_j \exp(\tau x_j)} \to 1    (35)

So, the answer to (a) is 1.

(b) If x_i < \max_j \{x_j\},

\frac{\exp(\tau x_i)}{\sum_j \exp(\tau x_j)} \to 0    (36)

and the answer to (b) is 0.


(c) We have from the given equations,

h_0 > h_1 \Leftrightarrow x > 2x - c \Leftrightarrow x < c
\qquad
h_1 > h_0 \Leftrightarrow 2x - c > x \Leftrightarrow x > c    (37)

From the above it is clear that c = β_0.


(d) We have from the given equations,

h_0 > h_1 \Leftrightarrow x > 2x - c_1 \Leftrightarrow x < c_1
\qquad
h_0 > h_2 \Leftrightarrow x > 3x - c_2 \Leftrightarrow 2x < c_2    (38)

The above should imply x < β_0.

h_1 > h_0 \Leftrightarrow x < 2x - c_1 \Leftrightarrow x > c_1
\qquad
h_1 > h_2 \Leftrightarrow 2x - c_1 > 3x - c_2 \Leftrightarrow x < c_2 - c_1    (39)

The above should imply β_0 < x < β_1.

h_2 > h_0 \Leftrightarrow 3x - c_2 > x \Leftrightarrow 2x > c_2
\qquad
h_2 > h_1 \Leftrightarrow 3x - c_2 > 2x - c_1 \Leftrightarrow x > c_2 - c_1    (40)

The above should imply x > β_1.

If c_1 = β_0 and c_2 = β_0 + β_1, then all the above conditions are satisfied.
c_1 : β_0
c_2 : β_0 + β_1


(e) Extending the above logic, we have that

c_i : β_0 + β_1 + · · · + β_{i−1}
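This construction can be verified numerically; the sketch below uses illustrative cut-points β = (−1, 0.5, 2) and checks that the argmax recovers the region index.

import numpy as np

betas = np.array([-1.0, 0.5, 2.0])             # cut-points beta_0, beta_1, beta_2
c = np.concatenate([[0.0], np.cumsum(betas)])  # c_i = beta_0 + ... + beta_{i-1}

def region(x):
    # arg max_i h_i(x) with h_i(x) = (i + 1) * x - c_i, as in eq. (33).
    i = np.arange(len(c))
    return int(np.argmax((i + 1) * x - c))

for x, expected in [(-5.0, 0), (-0.5, 1), (1.0, 2), (3.0, 3)]:
    assert region(x) == expected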

Problem 4
4. Consider the following dataset with two features - X1 , X2 .

S.No X1 X2 Class
1 3 4 -1
2 2 2 -1
3 4 4 -1
4 2 4 -1
5 2 1 1
6 4 3 1
7 4 1 1

(a) Obtain the maximum margin classifier which separates the given dataset. Let the hyperplane be given by −1 + β_1 X_1 + β_2 X_2 = 0. Find the values of β_1, β_2.
β_1 :        β_2 :
[4 Marks]

(b) Which of the data points given constitute the support vectors?
Answer Here :
[4 Marks]

Answer of exercise 4

(a) To get the line, we make a visual inspection of the data points and guess the line. Then we prove that this is indeed the maximum margin hyperplane.
From the data, consider the line which passes through (2, 1.5) and (4, 3.5), which is the line L(X_1, X_2) = X_1 − X_2 = 0.5, or L(X_1, X_2) = −0.5 + X_1 − X_2 = 0. This is chosen since the labels should be consistent with the signs of the coefficients of the hyperplane. The signed distance from a point X is given by \frac{1}{\sqrt{2}} L(X). Ignoring the factor \sqrt{2}, since it is the same across all data points, we have


S.No  X1  X2  Class  Signed distance
1     3   4   -1     −1.5
2     2   2   -1     −0.5
3     4   4   -1     −0.5
4     2   4   -1     −2.5
5     2   1    1      0.5
6     4   3    1      0.5
7     4   1    1      2.5

Observe that the margin for both classes is 0.5. Hence this is the maximum margin hyperplane; any other separating line would decrease the margin to at least one of these points.
L(X_1, X_2) = −0.5 + X_1 − X_2 = 0 is the same as L(X_1, X_2) = −1 + 2X_1 − 2X_2 = 0.
β_1 : +2    β_2 : −2
(b) From above it is also clear that the support vectors are
Answer Here : (2, 2), (4, 4), (2, 1), (4, 3)
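As a cross-check, a linear SVM with a very large C (approximating the hard margin) should recover the same hyperplane and support vectors; this sketch assumes scikit-learn is available.

import numpy as np
from sklearn.svm import SVC

X = np.array([[3, 4], [2, 2], [4, 4], [2, 4], [2, 1], [4, 3], [4, 1]])
y = np.array([-1, -1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard-margin SVM
print(clf.coef_, clf.intercept_)             # approx [[ 2. -2.]] and [-1.]
print(clf.support_vectors_)                  # (2,2), (4,4), (2,1), (4,3)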

