
BIRLA INSTITUTE OF TECHNOLOGY AND SCIENCE, PILANI, K K BIRLA GOA CAMPUS

BITS F464 - Machine Learning 2021-22, Comprehensive Exam
Open Book, Open Laptop, Internet NOT allowed

Subject Name: BITS F464 - Machine Learning, Date: 19 May 2022
Examiner Name: Aditya Challa, Max. Marks: 40
Duration: 2.5 hours (9:00 AM - 11:30 AM)

Instructions
• Attempt all questions.
• Marks corresponding to each question are highlighted in bold within square braces at the end of the question.
• You are allowed to carry and refer to any printed material (.pdf format) on your laptop/iPad. However, you will not be allowed to use any application other than a .pdf reader. You are also not allowed to use the Internet.
• In case of ambiguities in any of the questions, clearly state your assumptions and attempt the question(s).
• You should write the final answers (one word) in the space provided on the question paper. Answers not written in the space provided will not be evaluated.
• You may show the reasoning behind each answer in the answer sheet, clearly indicating the problem number.
• Return both the question paper and the answer sheet at the end of the examination.

Please fill in the following details:

Name:

BITS ID:

BITS Email-ID: (used for Google Classroom)

Problem 1
1. Here we investigate how test error estimates vary with K when using the K-Fold Cross Validation scheme. We consider the following simple problem -
• The ground-truth function is a constant function given by y = α.
• To estimate the ground-truth function we have the dataset \{y_i\}_{i=1}^{n} given by \{\alpha, \alpha, \cdots (\text{repeated } n-1 \text{ times}) \cdots, \alpha, \alpha'\}, with α ≠ α′. That is, we have (n − 1) α values within the dataset and only one α′.
• The model class M (from which we choose our estimate) is given by the set of constant functions \{y = c \mid c \in \mathbb{R}\}.
• So, we need to choose a function y = c from the model class. We use the least squares criterion to obtain this, that is, choose the c which minimizes

\sum_{i=1}^{n} (y_i - c)^2    (1)

• Say we choose the function y = c_0. Recall that the test error is a measure of how far the estimated function is from the ground-truth function. In this case it is given by (c_0 − α)².
However, in general we do not have access to the ground-truth function and would like to estimate the test
error. We use the K-Fold cross-validation approach to obtain these estimates. Recall


• K-Fold cross validation splits the data into K parts; one part is left out and the remaining K − 1 parts are used to estimate the function. The test error is computed as follows. Assume p denotes the size of each fold, that is n = K × p; then the test error estimate for fold j is given by the mean squared error (MSE) of the points within fold j,

\hat{TE}_j = \frac{1}{p} \sum_{i=1}^{p} (y_i - \hat{y}_i)^2    (2)

Here y_i denotes the actual value and \hat{y}_i denotes the predicted value. Thus, we would have K estimates of test error \{\hat{TE}_i\}, one from each left-out fold. The final estimate of test error is given by

\frac{1}{K} \sum_{i=1}^{K} \hat{TE}_i    (3)

• When K = n (n is the size of the data) then this is called leave-one-out-cross-validation (LOOCV).
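As an illustration, here is a minimal numpy sketch of this scheme for the constant model class above, where the least-squares fit on any training subset is simply its mean. The helper name kfold_test_error and the reshape-based fold split are choices made for this sketch; it assumes n = K × p exactly.

import numpy as np

def kfold_test_error(y, K):
    """K-Fold CV estimate of test error for the constant model class
    {y = c}: the least-squares fit on any training subset is its mean."""
    n = len(y)
    p = n // K                                       # fold size, n = K * p
    folds = np.asarray(y, dtype=float).reshape(K, p)
    estimates = []
    for j in range(K):
        train = np.delete(folds, j, axis=0).ravel()  # remaining K - 1 folds
        c = train.mean()                             # least-squares constant fit
        estimates.append(np.mean((folds[j] - c) ** 2))  # eq. (2) on fold j
    return np.mean(estimates)                        # eq. (3)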

(a) What is the best possible test error one can (theoretically) obtain using the model class M above?
Answer Here:
[1 Mark]
(b) Let us now compute the LOOCV estimate for test error. Since out of n data points there are n − 1 data points which take the value α and only 1 data point which takes the value α′, two cases are possible -
i. If the data point with value α′ is left out, then the estimate from the remaining data is given by

In this case the estimate of test-error is

[1 + 1 Marks]
ii. If the data point with value α is left out, then the estimate from the remaining data is given
by

In this case the estimate of test-error is

[1 + 1 Marks]
iii. So, the final estimate of test-error is


[1 Mark]
(c) Let us now compute the K-Fold estimate for test error. Since there is only one data point which takes the value α′, this can belong to only one fold. So, we have two possible cases -
i. If the data point with value α′ is within the left-out portion, then the estimate from the remaining data is given by

In this case the estimate of test-error is

[1 + 1 Marks]
ii. If the data point with value α′ is not within the left-out portion, then the estimate from the remaining data is given by

In this case the estimate of test-error is

[1 + 1 Marks]
iii. So, the final estimate of test-error is

[1 Mark]
(d) State TRUE/FALSE. From the above, we have that the K-Fold estimate of test error is always less than the LOOCV estimate of test error.
Answer Here :
[2 Marks]

Important Note: All the answers should be in terms of α, α′, n, K, p.


Answer of exercise 1

(a) Observe that the best possible test error one can obtain here is 0, since y = α belongs to the model class.
(b) Let us now compute the LOOCV estimate for test error. Since there are n data points, out of which n − 1 take the value α and one takes the value α′, we have two possible cases -
i. If the data point with value α′ is left out, then the estimate from the remaining data is given by α. Hence the MSE is (α − α′)².


ii. If the data point with value α is left out, then the estimate from the remaining data is given by \frac{(n-2)\alpha + \alpha'}{n-1}. The MSE for each such data point is given by

\left( \alpha - \frac{(n-2)\alpha + \alpha'}{n-1} \right)^2 = \frac{(\alpha - \alpha')^2}{(n-1)^2}    (4)

That is, \frac{(\alpha - \alpha')^2}{(n-1)^2}.

iii. Hence the final estimate for the test error is

\frac{1}{n} \left[ (\alpha - \alpha')^2 + (n-1) \frac{(\alpha - \alpha')^2}{(n-1)^2} \right] = \frac{(\alpha - \alpha')^2}{n-1}    (5)

That is, (\alpha - \alpha')^2 \frac{1}{n-1}.

(c) Let us now compute the K-Fold estimate for test error. Since there is only one data point which takes the value α′, this can belong to only one fold. So, we have two possible cases -
i. If the data point with value α′ is within the left-out portion, then the estimate from the remaining data is given by α. The MSE in this case is given by (assume for simplicity n = K × p)

\frac{1}{p} \left[ (p-1)(\alpha - \alpha)^2 + (\alpha' - \alpha)^2 \right] = \frac{(\alpha - \alpha')^2}{p}    (6)

ii. If the data point with value α′ is not within the left-out portion, then the estimate is given by \frac{(n-p-1)\alpha + \alpha'}{n-p}. Then, the MSE is given by

\frac{1}{p} \, p \left( \alpha - \frac{(n-p-1)\alpha + \alpha'}{n-p} \right)^2 = \frac{(\alpha - \alpha')^2}{(n-p)^2}    (7)

iii. So, the final estimate of test error is

\frac{1}{K} \left[ \frac{(\alpha - \alpha')^2}{p} + (K-1) \frac{(\alpha - \alpha')^2}{(n-p)^2} \right]
= \frac{(\alpha - \alpha')^2}{n} + \frac{K-1}{K} \cdot \frac{(\alpha - \alpha')^2}{(n - n/K)^2}
= \frac{(\alpha - \alpha')^2}{n} + \frac{K-1}{K} \cdot \frac{K^2 (\alpha - \alpha')^2}{n^2 (K-1)^2}
= \frac{(\alpha - \alpha')^2}{n} \left( 1 + \frac{K}{n(K-1)} \right)    (8)

(d) Now we know that K ≤ n, and hence

\frac{K}{K-1} \ge \frac{n}{n-1} \;\Rightarrow\; 1 + \frac{K}{n(K-1)} \ge 1 + \frac{1}{n-1}
\;\Rightarrow\; \frac{(\alpha - \alpha')^2}{n} \left( 1 + \frac{K}{n(K-1)} \right) \ge \frac{(\alpha - \alpha')^2}{n} \left( 1 + \frac{1}{n-1} \right) = \frac{(\alpha - \alpha')^2}{n-1}    (9)
\;\Rightarrow\; \text{K-Fold Test Error} \ge \text{LOOCV Test Error}

So the statement in (d) is FALSE.
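The closed forms in (5) and (8) can be checked numerically; the sketch below reuses the kfold_test_error helper defined earlier, with illustrative values α = 0, α′ = 1, n = 12, K = 4.

import numpy as np

alpha, alpha_p, n, K = 0.0, 1.0, 12, 4
y = [alpha] * (n - 1) + [alpha_p]

loocv = kfold_test_error(y, K=n)   # LOOCV is the K = n case
kfold = kfold_test_error(y, K=K)

print(loocv, (alpha - alpha_p) ** 2 / (n - 1))                      # eq. (5)
print(kfold, (alpha - alpha_p) ** 2 / n * (1 + K / (n * (K - 1))))  # eq. (8)
assert kfold >= loocv   # the inequality in (9)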

Problem 2


2. Here we explore the connection between neural networks and regression splines. In particular we shall only be looking at piecewise linear splines and single hidden layer neural networks. Here we are interested in functions whose input is 1-dimensional and output is 1-dimensional.
Recall that a piecewise linear spline tries to fit a function of the form

f(x) =
\begin{cases}
a_0 x + b_0 & \text{if } x < \eta_0 \\
a_1 x + b_1 & \text{if } \eta_0 \le x < \eta_1 \\
\vdots & \vdots \\
a_n x + b_n & \text{if } \eta_{n-1} \le x < \eta_n
\end{cases}    (10)

where η_0, η_1, · · ·, η_n are called cut-points and we fit a simple linear regression line within each interval.
On the other hand, a neural network tries to fit a composition of functions. It first maps the input x into the hidden layer using the ReLU activation function,

h_i = \max(0, w_i^{(1)} x + c_i^{(1)})    (11)

where i denotes a unit in the hidden layer. Then the output is obtained by

o = \sum_{i=1}^{H} w_i^{(2)} h_i + c^{(2)}    (12)

where H denotes the number of neurons/units in the hidden layer.
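To make the composition concrete, here is a minimal numpy sketch of this single-hidden-layer forward pass; the names forward, w1, c1, w2, c2 are choices made for this illustration.

import numpy as np

def forward(x, w1, c1, w2, c2):
    """Single-hidden-layer ReLU network, eqs. (11)-(12)."""
    h = np.maximum(0.0, w1 * x + c1)   # hidden units h_i, eq. (11)
    return np.dot(w2, h) + c2          # output o, eq. (12)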

(a) Let us start simple. Consider the function obtained by a neural network with just one neuron in the hidden layer. Assume the values w_1^{(1)} = 1, c_1^{(1)} = 1, w_1^{(2)} = 2, c^{(2)} = 3. Let the equivalent spline function be obtained by

f(x) =
\begin{cases}
a_0 x + b_0 & \text{if } x < \eta_0 \\
a_1 x + b_1 & \text{if } x \ge \eta_0
\end{cases}    (13)

Compute the values of a_0, b_0, η_0, a_1, b_1.

Answer Here: a_0 =        b_0 =        η_0 =
Answer Here: a_1 =        b_1 =
[3 Marks]
(b) Now consider the function obtained by the neural network with two neurons in the hidden layer. Assume the values w_1^{(1)} = 1, c_1^{(1)} = 3, w_2^{(1)} = 2, c_2^{(1)} = 1, w_1^{(2)} = 2, w_2^{(2)} = 2, c^{(2)} = 3. Let the equivalent spline function be obtained by

f(x) =
\begin{cases}
a_0 x + b_0 & \text{if } x < \eta_0 \\
a_1 x + b_1 & \text{if } \eta_0 \le x < \eta_1 \\
a_2 x + b_2 & \text{if } x \ge \eta_1
\end{cases}    (14)

Compute the values of a_0, b_0, η_0, a_1, b_1, η_1, a_2, b_2.

Answer Here: a_0 =        b_0 =        η_0 =
Answer Here: a_1 =        b_1 =        η_1 =
Answer Here: a_2 =        b_2 =
[4 Marks]


(c) Let the function f̃(x) = |x|, where |x| denotes the absolute value of x. What is the minimum number of neurons in the hidden layer required to represent f̃(x)?
Answer Here:
[2 Marks]
(d) State TRUE/FALSE. The function represented by a single hidden layer network with H (some
arbitrary finite integer) neurons is piecewise linear.
Answer Here:
[2 Marks]
(e) State TRUE/FALSE: Let f denote the piecewise linear function which is equivalent to a neural network with H (some arbitrary finite integer) neurons. The function f is continuous at η, where η denotes a cut-point.
Answer Here:
[2 Marks]

Answer of exercise 2
Observe that each h_i can be written as a linear spline as follows: If w_i^{(1)} > 0,

h_i =
\begin{cases}
0 & \text{if } x < -c_i^{(1)}/w_i^{(1)} \\
w_i^{(1)} x + c_i^{(1)} & \text{if } x \ge -c_i^{(1)}/w_i^{(1)}
\end{cases}    (15)

If w_i^{(1)} < 0,

h_i =
\begin{cases}
0 & \text{if } x > -c_i^{(1)}/w_i^{(1)} \\
w_i^{(1)} x + c_i^{(1)} & \text{if } x \le -c_i^{(1)}/w_i^{(1)}
\end{cases}    (16)

(a) In the case of a single neuron in the hidden layer we have

h_1 = f_1(x) =
\begin{cases}
0 & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(1)} x + c_1^{(1)} & \text{if } x \ge -c_1^{(1)}/w_1^{(1)}
\end{cases}    (17)

and hence,

o = w_1^{(2)} h_1 + c^{(2)} =
\begin{cases}
c^{(2)} & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(2)} w_1^{(1)} x + w_1^{(2)} c_1^{(1)} + c^{(2)} & \text{if } x \ge -c_1^{(1)}/w_1^{(1)}
\end{cases}    (18)

So, we have
Answer Here: a_0 = 0    b_0 = c^{(2)}    η_0 = -c_1^{(1)}/w_1^{(1)}
Answer Here: a_1 = w_1^{(2)} w_1^{(1)}    b_1 = w_1^{(2)} c_1^{(1)} + c^{(2)}
By substituting the given values w_1^{(1)} = 1, c_1^{(1)} = 1, w_1^{(2)} = 2, c^{(2)} = 3:
Answer Here: a_0 = 0    b_0 = 3    η_0 = −1
Answer Here: a_1 = 2    b_1 = 5
(1) (1)
(b) In the case of two neurons in the hidden layer we have (both w_1^{(1)}, w_2^{(1)} > 0)

h_1 =
\begin{cases}
0 & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(1)} x + c_1^{(1)} & \text{if } x \ge -c_1^{(1)}/w_1^{(1)}
\end{cases}
\qquad
h_2 =
\begin{cases}
0 & \text{if } x < -c_2^{(1)}/w_2^{(1)} \\
w_2^{(1)} x + c_2^{(1)} & \text{if } x \ge -c_2^{(1)}/w_2^{(1)}
\end{cases}    (19)


Note that -c_1^{(1)}/w_1^{(1)} < -c_2^{(1)}/w_2^{(1)}. So, the values at the middle layer are going to be

(h_1, h_2) =
\begin{cases}
(0, 0) & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
(w_1^{(1)} x + c_1^{(1)}, \; 0) & \text{if } -c_1^{(1)}/w_1^{(1)} \le x < -c_2^{(1)}/w_2^{(1)} \\
(w_1^{(1)} x + c_1^{(1)}, \; w_2^{(1)} x + c_2^{(1)}) & \text{if } x \ge -c_2^{(1)}/w_2^{(1)}
\end{cases}    (20)

Now, we know that

o = w_1^{(2)} h_1 + w_2^{(2)} h_2 + c^{(2)}    (21)

So, we have

o =
\begin{cases}
c^{(2)} & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(2)} w_1^{(1)} x + w_1^{(2)} c_1^{(1)} + c^{(2)} & \text{if } -c_1^{(1)}/w_1^{(1)} \le x < -c_2^{(1)}/w_2^{(1)} \\
(w_1^{(2)} w_1^{(1)} + w_2^{(2)} w_2^{(1)}) x + w_1^{(2)} c_1^{(1)} + w_2^{(2)} c_2^{(1)} + c^{(2)} & \text{if } x \ge -c_2^{(1)}/w_2^{(1)}
\end{cases}    (22)

So, we have that

Answer Here: a_0 = 0    b_0 = c^{(2)}    η_0 = -c_1^{(1)}/w_1^{(1)}
Answer Here: a_1 = w_1^{(2)} w_1^{(1)}    b_1 = w_1^{(2)} c_1^{(1)} + c^{(2)}    η_1 = -c_2^{(1)}/w_2^{(1)}
Answer Here: a_2 = w_1^{(2)} w_1^{(1)} + w_2^{(2)} w_2^{(1)}    b_2 = w_1^{(2)} c_1^{(1)} + w_2^{(2)} c_2^{(1)} + c^{(2)}
Substituting the values w_1^{(1)} = 1, c_1^{(1)} = 3, w_2^{(1)} = 2, c_2^{(1)} = 1, w_1^{(2)} = 2, w_2^{(2)} = 2, c^{(2)} = 3:
Answer Here: a_0 = 0    b_0 = 3    η_0 = −3
Answer Here: a_1 = 2    b_1 = 9    η_1 = −1/2
Answer Here: a_2 = 6    b_2 = 11
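As a quick numerical check of these values, one can compare the network output against the derived spline on a grid, reusing the forward sketch from above with the part (b) weights:

import numpy as np

w1, c1 = np.array([1.0, 2.0]), np.array([3.0, 1.0])   # w_i^(1), c_i^(1)
w2, c2 = np.array([2.0, 2.0]), 3.0                    # w_i^(2), c^(2)

def spline(x):
    # Piecewise form derived above: cut-points at -3 and -1/2.
    if x < -3.0:
        return 3.0
    if x < -0.5:
        return 2.0 * x + 9.0
    return 6.0 * x + 11.0

for x in np.linspace(-6, 6, 49):
    assert np.isclose(forward(x, w1, c1, w2, c2), spline(x))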

(c) Answer: 2.
Observe that

|x| =
\begin{cases}
-x & \text{if } x < 0 \\
x & \text{if } x \ge 0
\end{cases}    (23)

This can actually be written as max(0, x) + max(0, −x). Hence two neurons in the hidden layer suffice.
This cannot be achieved with a single hidden neuron: either as x → ∞ or as x → −∞ we have h_1 → 0, so the output is constant on that half-line, whereas |x| is unbounded in both directions.
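The two-neuron identity above is easy to check numerically, for instance:

import numpy as np
x = np.linspace(-5, 5, 101)
assert np.allclose(np.abs(x), np.maximum(0, x) + np.maximum(0, -x))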
(d) Answer: TRUE, since each h_i is piecewise linear and a sum of piecewise linear functions is piecewise linear.

(e) Answer: TRUE, since a sum of continuous functions is continuous. In this case the functions at the hidden neurons are continuous, hence their sum is continuous.

Problem 3
3. Here we investigate the relationship between multi-class logistic regression and decision trees. More precisely, we would like to identify the final regions of the decision tree using logistic regression. For simplicity we assume that the input is a 1-dimensional real value and the output is the class label.
Recall the following variant of multi-class logistic regression -
• If Y denotes the class label, we have

P(Y = i) \propto \exp(w_i x + b_i)    (24)

Here w_i, b_i are constants since the input x is a 1-dimensional real value.


• Hence, we have

P(Y = i) = \frac{\exp(w_i x + b_i)}{\sum_j \exp(w_j x + b_j)}    (25)

We slightly extend the above formulation with an additional parameter τ as follows -

P(Y = i) = \frac{\exp(\tau (w_i x + b_i))}{\sum_j \exp(\tau (w_j x + b_j))}    (26)
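The sharpening effect of τ is easy to see numerically; the sketch below uses illustrative scores w_i x_0 + b_i, and the helper name softmax_tau is a choice made for this sketch.

import numpy as np

def softmax_tau(scores, tau):
    # Temperature-scaled softmax, eq. (26); subtract the max for stability.
    z = tau * (scores - scores.max())
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 0.5])        # w_i x0 + b_i at some x0
for tau in [1.0, 10.0, 100.0]:
    print(tau, softmax_tau(scores, tau))  # approaches one-hot at the argmax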

(a) Suppose, for a given value x = x_0, we have w_k x_0 + b_k > w_i x_0 + b_i for all i ≠ k. Then, P(Y = k) → ______ as τ → ∞.
[2 Marks]
(b) Suppose, for a given value x = x_0, we have w_k x_0 + b_k < \max_{i \neq k} \{w_i x_0 + b_i\}. Then, P(Y = k) → ______ as τ → ∞.
[2 Marks]

So, if we let h_i(x) = w_i x + b_i, then \arg\max_i \{h_i\} gives the label.

Now, observe that since the input is 1-dimensional, the regions obtained as the leaves of a decision tree will be of the form

\{\beta_{-1} < x \le \beta_0\}, \{\beta_0 < x \le \beta_1\}, \cdots, \{\beta_{i-1} < x \le \beta_i\}, \cdots, \{\beta_{n-1} < x \le \beta_n\}    (27)

Assume that β_{−1} = −∞ and β_n = ∞.

The aim is to find the parameters \{(w_i, b_i)\} such that if \arg\max_i \{h_i(x)\} = \arg\max_i \{w_i x + b_i\} = k, then x belongs to the region \{\beta_{k-1} < x \le \beta_k\}.

(c) Let us start simple with two regions (i.e. n = 1) - \{\beta_{-1} < x \le \beta_0\}, \{\beta_0 < x \le \beta_1\}. Note that β_{−1} = −∞ and β_1 = ∞. Define

h_0(x) = x \qquad h_1(x) = 2x - c    (28)

What is the value of c (in terms of β_0) such that

h_0 > h_1 \Rightarrow x < \beta_0 \qquad\text{and}\qquad h_1 > h_0 \Rightarrow x > \beta_0    (29)

Answer Here:
[2 Marks]
(d) Let us consider three regions (i.e. n = 2) - \{\beta_{-1} < x \le \beta_0\}, \{\beta_0 < x \le \beta_1\}, \{\beta_1 < x \le \beta_2\}. Note that β_{−1} = −∞ and β_2 = ∞. Define

h_0(x) = x \qquad h_1(x) = 2x - c_1 \qquad h_2(x) = 3x - c_2    (30)

What are the values of c_1, c_2 (in terms of β_0, β_1) such that

\begin{Bmatrix} h_0 > h_1 \\ h_0 > h_2 \end{Bmatrix} \Rightarrow x < \beta_0
\quad\text{and}\quad
\begin{Bmatrix} h_1 > h_0 \\ h_1 > h_2 \end{Bmatrix} \Rightarrow \beta_0 < x < \beta_1
\quad\text{and}\quad
\begin{Bmatrix} h_2 > h_0 \\ h_2 > h_1 \end{Bmatrix} \Rightarrow x > \beta_1    (31)

c_1 :
c_2 :


[4 Marks]
(e) Extending the above to n regions -

\{\beta_{-1} < x \le \beta_0\}, \{\beta_0 < x \le \beta_1\}, \cdots, \{\beta_{i-1} < x \le \beta_i\}, \cdots, \{\beta_{n-1} < x \le \beta_n\}    (32)

Let
h_i = (i+1)x - c_i    (33)
What is the value of c_i (in terms of \{\beta_i\})?
c_i :
[2 Marks]

Answer of exercise 3

(a) Observe that

\sum_j \exp(\tau x_j) \approx \exp(\tau \max_j \{x_j\}) \quad \text{as } \tau \to \infty    (34)

So, if x_i = \max_j \{x_j\},

\frac{\exp(\tau x_i)}{\sum_j \exp(\tau x_j)} \to 1    (35)

So, the answer to (a) is 1.

(b) If x_i < \max_j \{x_j\},

\frac{\exp(\tau x_i)}{\sum_j \exp(\tau x_j)} \to 0    (36)

and the answer to (b) is 0.


(c) We have from the given equations,

h_0 > h_1 \Leftrightarrow x > 2x - c \Leftrightarrow x < c
\qquad
h_1 > h_0 \Leftrightarrow 2x - c > x \Leftrightarrow x > c    (37)

From the above it is clear that c = β_0.


(d) We have from the given equations,

h_0 > h_1 \Leftrightarrow x > 2x - c_1 \Leftrightarrow x < c_1
\qquad
h_0 > h_2 \Leftrightarrow x > 3x - c_2 \Leftrightarrow 2x < c_2    (38)

The above should imply x < β_0.

h_1 > h_0 \Leftrightarrow x < 2x - c_1 \Leftrightarrow x > c_1
\qquad
h_1 > h_2 \Leftrightarrow 2x - c_1 > 3x - c_2 \Leftrightarrow x < c_2 - c_1    (39)

The above should imply β_0 < x < β_1.

h_2 > h_0 \Leftrightarrow 3x - c_2 > x \Leftrightarrow 2x > c_2
\qquad
h_2 > h_1 \Leftrightarrow 3x - c_2 > 2x - c_1 \Leftrightarrow x > c_2 - c_1    (40)

The above should imply x > β_1.

If c_1 = β_0 and c_2 = β_0 + β_1, then all the above conditions are satisfied.
c_1 : β_0
c_2 : β_0 + β_1


(e) Extending the above logic, we have that

c_i : β_0 + β_1 + · · · + β_{i−1}
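This construction can be verified numerically; the sketch below uses illustrative cut-points β = (−1, 0.5, 2) and checks that the argmax recovers the region index.

import numpy as np

betas = np.array([-1.0, 0.5, 2.0])             # cut-points beta_0, beta_1, beta_2
c = np.concatenate([[0.0], np.cumsum(betas)])  # c_i = beta_0 + ... + beta_{i-1}

def region(x):
    # arg max_i h_i(x) with h_i(x) = (i + 1) * x - c_i, as in eq. (33).
    i = np.arange(len(c))
    return int(np.argmax((i + 1) * x - c))

for x, expected in [(-5.0, 0), (-0.5, 1), (1.0, 2), (3.0, 3)]:
    assert region(x) == expected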

Problem 4
4. Consider the following dataset with two features - X1 , X2 .

S.No X1 X2 Class
1 3 4 -1
2 2 2 -1
3 4 4 -1
4 2 4 -1
5 2 1 1
6 4 3 1
7 4 1 1

(a) Obtain the maximum margin classifier which separates the given dataset. Let the hyperplane be given by −1 + β_1 X_1 + β_2 X_2 = 0. Find the values of β_1, β_2.
β_1 :        β_2 :
[4 Marks]

(b) Which of the data points given constitute the support vectors?
Answer Here :
[4 Marks]

Answer of exercise 4

(a) To get the line, we make a visual inspection of the data points and guess the line. Then we prove that this is indeed the maximum margin hyperplane.
From the data, consider the line which passes through (2, 1.5) and (4, 3.5), which is the line L(X_1, X_2) = X_1 − X_2 = 0.5, or L(X_1, X_2) = −0.5 + X_1 − X_2 = 0. This is chosen since the labels should be consistent with the signs of the coefficients of the hyperplane. The signed distance from a point X is given by \frac{1}{\sqrt{2}} L(X). Ignoring the factor \sqrt{2}, since it is the same across all data points, we have


S.No  X1  X2  Class  Signed distance
1     3   4   -1     −1.5
2     2   2   -1     −0.5
3     4   4   -1     −0.5
4     2   4   -1     −2.5
5     2   1    1      0.5
6     4   3    1      0.5
7     4   1    1      2.5

Observe that the margin for both classes is 0.5. Hence this is the maximum margin hyperplane; any other separating line would decrease the margin to at least one of these points.
L(X_1, X_2) = −0.5 + X_1 − X_2 = 0 is the same as L(X_1, X_2) = −1 + 2X_1 − 2X_2 = 0.
β_1 : +2    β_2 : −2
(b) From above it is also clear that the support vectors are
Answer Here : (2, 2), (4, 4), (2, 1), (4, 3)
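As a cross-check, a linear SVM with a very large C (approximating the hard margin) should recover the same hyperplane and support vectors; this sketch assumes scikit-learn is available.

import numpy as np
from sklearn.svm import SVC

X = np.array([[3, 4], [2, 2], [4, 4], [2, 4], [2, 1], [4, 3], [4, 1]])
y = np.array([-1, -1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard-margin SVM
print(clf.coef_, clf.intercept_)             # approx [[ 2. -2.]] and [-1.]
print(clf.support_vectors_)                  # (2,2), (4,4), (2,1), (4,3)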

