Instructions
• Attempt all questions.
• Marks corresponding to each question are highlighted in bold within square brackets at the end of
the question.
• You are allowed to carry and refer to any printed material (.pdf format) on your laptops/iPad.
However, you will not be allowed to use any applications other than a .pdf reader. You
are also not allowed to use the Internet.
• In case of ambiguities in any of the questions, clearly state your assumptions and attempt the
question(s).
• You should write the final answers (one word) in the space provided on the question paper.
Answers not written in the space provided will not be evaluated.
• You may show the reasoning behind each answer in the answer sheet clearly indicating the
problem number.
• Return both the question paper and the answer-sheet at the end of the examination.
Name:
BITS ID:
Problem 1
1. Here we investigate how test error estimates vary with K when using K-Fold Cross Validation Scheme. We
consider the following simple problem
• The ground-truth function is a constant function given by y = α.
• To estimate the ground-truth function we have the dataset $\{y_i\}_{i=1}^{n}$ given by $\{\alpha, \alpha, \ldots, \alpha, \alpha'\}$, where $\alpha$ is repeated $n-1$ times
and $\alpha \neq \alpha'$. That is, we have $(n-1)$ $\alpha$ values within the dataset and only one $\alpha'$.
• The model class M (from which we choose our estimate) is given by a set of constant functions
{y = c | c ∈ R}.
• So, we need to choose a function y = c from the model class. We use the least-squares criterion
to obtain this, that is, we choose the c which minimizes
$$\sum_{i=1}^{n} (y_i - c)^2 \tag{1}$$
• Say we choose the function $y = c_0$. Recall that the test error is a measure of how far the
estimated function is from the ground-truth function. In this case it is given by $(c_0 - \alpha)^2$.
However, in general we do not have access to the ground-truth function and would like to estimate the test
error. We use the K-Fold cross-validation approach to obtain these estimates. Recall
1 of 11
BITS F464
Machine Learning 2021-22
Aditya Challa Compre Exam
• K-Fold cross validation splits the data into K parts; one part is left out and the remaining
K − 1 parts are used to estimate the function. The test error is computed as follows. Assume p denotes
the size of each fold, that is, n = K × p. Then the test error estimate for fold j is given by the mean squared
error (MSE) of the points within fold j,
$$\widehat{TE}_j = \frac{1}{p} \sum_{i=1}^{p} (y_i - \hat{y}_i)^2 \tag{2}$$
Here $y_i$ denotes the actual value and $\hat{y}_i$ denotes the predicted value. Thus, we would have K
estimates of test error $\{\widehat{TE}_i\}$, one from each left-out fold. The final estimate of test error is given
by
$$\frac{1}{K} \sum_{i=1}^{K} \widehat{TE}_i \tag{3}$$
• When K = n (n is the size of the data) then this is called leave-one-out-cross-validation (LOOCV).
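The procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kfold_test_error` is hypothetical, folds are taken as contiguous blocks of the data, K is assumed to be at least 2 and to divide n, and the constant model is fitted by the mean of the training points (the minimizer of the least-squares criterion (1)).

```python
from statistics import mean

def kfold_test_error(ys, K):
    """K-fold estimate of test error for the constant model y = c.

    Assumes K >= 2, len(ys) divisible by K, and contiguous folds
    (an illustrative choice, not something fixed by the exam).
    """
    n = len(ys)
    p = n // K  # fold size, so that n = K * p
    fold_errors = []
    for j in range(K):
        held_out = ys[j * p:(j + 1) * p]
        training = ys[:j * p] + ys[(j + 1) * p:]
        c = mean(training)  # least-squares constant fit
        fold_errors.append(mean((y - c) ** 2 for y in held_out))  # equation (2)
    return mean(fold_errors)  # equation (3)
```

Calling `kfold_test_error(ys, len(ys))` corresponds to LOOCV, since each fold then holds a single point.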
(a) What is the best possible test error one can (theoretically) obtain using the model class M above?
Answer Here:
[1 Marks]
(b) Let us now compute the LOOCV estimate for test error. Since out of n data points there are n − 1
data points which take the value α and only 1 data point which takes the value α′, two cases are possible -
i. If the data point with value α′ is left out, then the estimate from the remaining data is given
by
[1 + 1 Marks]
ii. If the data point with value α is left out, then the estimate from the remaining data is given
by
[1 + 1 Marks]
iii. So, the final estimate of test-error is
[1 Marks]
(c) Let us now compute the K-Fold estimate for test error. Since there is only one data point which
takes the value α′, this can belong to only one fold. So, we have two possible cases
i. If the data point with value α′ is within the left-out portion, then the estimate from the
remaining data is given by
[1 + 1 Marks]
ii. If the data point with value α′ is not within the left-out portion, then the estimate from
the remaining data is given by
[1 + 1 Marks]
iii. So, the final estimate of test-error is
[1 Marks]
(d) State TRUE/FALSE. From above, we have that the K-Fold estimate of test error is always less
than the LOOCV estimate of test error.
Answer Here :
[2 Marks]
(a) Observe that the best possible test error one can obtain here is 0, since
y = α belongs to the model class.
(b) Let us now compute the LOOCV estimate for test error. Since there are n data points out of
which n − 1 take the value α and one takes the value α′, we have two possible cases
i. If the data point with value α′ is left out, then the estimate from the remaining data is given
by α. Hence the MSE is $(\alpha - \alpha')^2$.
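ii. If the data point with value α is left out, then the estimate from the remaining data is given
by $\frac{(n-2)\alpha + \alpha'}{n-1}$. The MSE for each such data point is given by
$$\left(\alpha - \frac{(n-2)\alpha + \alpha'}{n-1}\right)^2 = \frac{(\alpha - \alpha')^2}{(n-1)^2} \tag{4}$$
That is, $\frac{(\alpha - \alpha')^2}{(n-1)^2}$.
iii. Averaging the n leave-one-out errors (one fold of the first kind and n − 1 folds of the second), the final estimate of test error is
$$\frac{1}{n}\left[(\alpha - \alpha')^2 + (n-1)\,\frac{(\alpha - \alpha')^2}{(n-1)^2}\right] = \frac{(\alpha - \alpha')^2}{n-1} \tag{5}$$
That is, $(\alpha - \alpha')^2 \, \frac{1}{n-1}$.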
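As a sanity check on the LOOCV answer, the sketch below averages the n leave-one-out squared errors with exact rational arithmetic and compares the result to the closed form (α − α′)²/(n − 1). The helper name `loocv_estimate` and the numeric values are illustrative assumptions, not part of the exam.

```python
from fractions import Fraction

def loocv_estimate(alpha, alpha_prime, n):
    """Average the n leave-one-out squared errors for the constant model."""
    ys = [Fraction(alpha)] * (n - 1) + [Fraction(alpha_prime)]
    errors = []
    for i in range(n):
        rest = ys[:i] + ys[i + 1:]
        c = sum(rest) / len(rest)        # least-squares fit = mean of the rest
        errors.append((ys[i] - c) ** 2)  # one held-out point per fold (p = 1)
    return sum(errors) / n

alpha, alpha_prime, n = 2, 5, 6  # arbitrary illustrative values
estimate = loocv_estimate(alpha, alpha_prime, n)
closed_form = Fraction((alpha - alpha_prime) ** 2, n - 1)
print(estimate, closed_form)
```

Using `Fraction` avoids floating-point round-off, so the two printed values can be compared for exact equality.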
(c) Let us now compute the K-Fold estimate for test error. Since there is only one data point which
takes the value α′, this can belong to only one fold. So, we have two possible cases
i. If the data point with value α′ is within the left-out portion, then the estimate from the
remaining data is given by α. The MSE in this case is given by (assume
for simplicity n = K × p)
$$\frac{1}{p}\left[(p-1)(\alpha - \alpha)^2 + (\alpha' - \alpha)^2\right] = \frac{(\alpha - \alpha')^2}{p} \tag{6}$$
ii. If the data point with value α′ is not within the left-out portion, then the estimate is given
by $\frac{(n-p-1)\alpha + \alpha'}{n-p}$. Then, the MSE is given by
$$\frac{1}{p}\,(p)\left(\alpha - \frac{(n-p-1)\alpha + \alpha'}{n-p}\right)^2 = \frac{(\alpha - \alpha')^2}{(n-p)^2} \tag{7}$$
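iii. So, the final estimate of test error is
$$\begin{aligned}
\frac{1}{K}\left[\frac{(\alpha - \alpha')^2}{p} + (K-1)\,\frac{(\alpha - \alpha')^2}{(n-p)^2}\right]
&= \frac{(\alpha - \alpha')^2}{n} + \frac{(K-1)}{K}\,\frac{(\alpha - \alpha')^2}{(n - n/K)^2} \\
&= \frac{(\alpha - \alpha')^2}{n} + \frac{(K-1)}{K}\,\frac{K^2 (\alpha - \alpha')^2}{n^2 (K-1)^2} \\
&= \frac{(\alpha - \alpha')^2}{n}\left[1 + \frac{K}{n(K-1)}\right]
\end{aligned} \tag{8}$$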
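To see how the final expression in (8) behaves as K varies, one can simply evaluate it for several values of K. The sketch below does this with exact fractions; the function name `kfold_closed_form` and the choice n = 12 are illustrative assumptions, and `diff_sq` stands for (α − α′)².

```python
from fractions import Fraction

def kfold_closed_form(diff_sq, n, K):
    """Last line of (8): (diff_sq / n) * (1 + K / (n * (K - 1))), for K >= 2 dividing n."""
    return Fraction(diff_sq, n) * (1 + Fraction(K, n * (K - 1)))

n, diff_sq = 12, 1  # diff_sq stands for (alpha - alpha')^2
for K in (2, 3, 4, 6, 12):
    print(K, kfold_closed_form(diff_sq, n, K))
```

Setting K = n gives (α − α′)²/(n − 1), which agrees with the LOOCV estimate. Because K/(K − 1) shrinks as K grows, the printed values decrease with K, which suggests that for this dataset the K-Fold estimate with K < n is larger than the LOOCV estimate rather than smaller.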
Problem 2
2. Here we explore the connection between neural networks and regression splines. In particular we shall only
be looking at piecewise linear splines and single hidden layer neural networks. Here we are interested in
functions whose input is 1-dimensional and output is 1-dimensional.
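Recall that a piecewise linear spline tries to fit a function of the form
$$f(x) = \begin{cases}
a_0 x + b_0 & \text{if } x < \eta_0 \\
a_1 x + b_1 & \text{if } \eta_0 \le x < \eta_1 \\
\quad\vdots & \quad\vdots \\
a_n x + b_n & \text{if } \eta_{n-1} \le x < \eta_n
\end{cases} \tag{10}$$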
where η0 , η1 , · · · , ηn are called cut-points and we fit a simple linear regression line within each interval.
On the other hand a neural network tries to fit a composition of functions. It first maps the input x into
the hidden layer using the ReLU activation function,
$$h_i = \max\left(0,\; w_i^{(1)} x + c_i^{(1)}\right) \tag{11}$$
where i denotes a unit in the hidden layer. Then the output is obtained by
$$o = \sum_{i=1}^{H} w_i^{(2)} h_i + c^{(2)} \tag{12}$$
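Equations (11) and (12) translate directly into a short forward pass. The following is a minimal sketch for scalar input and output; the function name `relu_network` and the use of plain Python lists for the weights are assumptions made for illustration.

```python
def relu_network(x, w1, c1, w2, c2):
    """Single-hidden-layer ReLU network with scalar input and output.

    w1, c1: hidden-layer weights and biases (lists of length H), equation (11).
    w2: output weights (list of length H); c2: output bias, equation (12).
    """
    hidden = [max(0.0, w * x + c) for w, c in zip(w1, c1)]
    return sum(v * h for v, h in zip(w2, hidden)) + c2
```

Evaluating this function on a grid of x values is a quick way to see where the slope of the output changes.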
(a) Let us start simple. Consider the function obtained by a neural network with just one neuron in
the hidden layer. Assume the values $w_1^{(1)} = 1$, $c_1^{(1)} = 1$, $w_1^{(2)} = 2$, $c^{(2)} = 3$. Let the equivalent
spline function be given by
$$f(x) = \begin{cases}
a_0 x + b_0 & \text{if } x < \eta_0 \\
a_1 x + b_1 & \text{if } x \ge \eta_0
\end{cases} \tag{13}$$
(c) Let the function $\tilde{f}(x) = |x|$, where |x| denotes the absolute value of x. What is the minimum
number of neurons in the hidden layer required to represent $\tilde{f}(x)$?
Answer Here:
[2 Marks]
(d) State TRUE/FALSE. The function represented by a single hidden layer network with H (some
arbitrary finite integer) neurons is piecewise linear.
Answer Here:
[2 Marks]
(e) State TRUE/FALSE: Let f denote the piecewise linear function which is equivalent to a neural
network with H (some arbitrary finite integer) neurons in the hidden layer. The function f is continuous at η, where η denotes
a cut-point.
Answer Here:
[2 Marks]
Answer of exercise 2
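Observe that each $h_i$ can be written as a linear spline as follows: If $w_i^{(1)} > 0$,
$$h_i = \begin{cases}
0 & \text{if } x < -c_i^{(1)}/w_i^{(1)} \\
w_i^{(1)} x + c_i^{(1)} & \text{if } x \ge -c_i^{(1)}/w_i^{(1)}
\end{cases} \tag{15}$$
If $w_i^{(1)} < 0$,
$$h_i = \begin{cases}
0 & \text{if } x > -c_i^{(1)}/w_i^{(1)} \\
w_i^{(1)} x + c_i^{(1)} & \text{if } x \le -c_i^{(1)}/w_i^{(1)}
\end{cases} \tag{16}$$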
and hence,
$$o = w_1^{(2)} h_1 + c^{(2)} = \begin{cases}
c^{(2)} & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(2)} w_1^{(1)} x + w_1^{(2)} c_1^{(1)} + c^{(2)} & \text{if } x \ge -c_1^{(1)}/w_1^{(1)}
\end{cases} \tag{18}$$
So, we have
Answer Here: $a_0 = 0$, $b_0 = c^{(2)}$, $\eta_0 = -c_1^{(1)}/w_1^{(1)}$
Answer Here: $a_1 = w_1^{(2)} w_1^{(1)}$, $b_1 = w_1^{(2)} c_1^{(1)} + c^{(2)}$
By substituting the given values $w_1^{(1)} = 1$, $c_1^{(1)} = 1$, $w_1^{(2)} = 2$, $c^{(2)} = 3$,
Answer Here: $a_0 = 0$, $b_0 = 3$, $\eta_0 = -1$
Answer Here: $a_1 = 2$, $b_1 = 5$
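The substitution above can be reproduced mechanically. The sketch below maps the four network parameters of a one-neuron network to the spline coefficients, assuming a positive hidden weight as in the first case of the derivation; the function name `one_neuron_spline` is a hypothetical helper.

```python
def one_neuron_spline(w1, c1, w2, c2):
    """Return (a0, b0, a1, b1, eta0) for a one-neuron ReLU network, assuming w1 > 0."""
    if w1 <= 0:
        raise ValueError("this sketch only covers w1 > 0")
    eta0 = -c1 / w1
    a0, b0 = 0.0, c2                # neuron inactive for x < eta0
    a1, b1 = w2 * w1, w2 * c1 + c2  # neuron active for x >= eta0
    return a0, b0, a1, b1, eta0

print(one_neuron_spline(1.0, 1.0, 2.0, 3.0))  # (0.0, 3.0, 2.0, 5.0, -1.0)
```

Both pieces take the value 3 at x = −1, so the resulting spline is continuous at the cut-point.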
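(b) In the case of two neurons in the hidden layer we have (both $w_1^{(1)}, w_2^{(1)} > 0$)
$$h_1 = \begin{cases}
0 & \text{if } x < -c_1^{(1)}/w_1^{(1)} \\
w_1^{(1)} x + c_1^{(1)} & \text{if } x \ge -c_1^{(1)}/w_1^{(1)}
\end{cases}
\qquad
h_2 = \begin{cases}
0 & \text{if } x < -c_2^{(1)}/w_2^{(1)} \\
w_2^{(1)} x + c_2^{(1)} & \text{if } x \ge -c_2^{(1)}/w_2^{(1)}
\end{cases} \tag{19}$$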
Answer Here: a2 = 6 b2 = 11
(c) Answer: 2.
Observe that
$$|x| = \begin{cases}
-x & \text{if } x < 0 \\
x & \text{if } x \ge 0
\end{cases} \tag{23}$$
This can be written as max(0, x) + max(0, −x). Hence two neurons in the hidden layer
suffice.
This cannot be achieved with a single hidden neuron, since either when x → ∞ or when x → −∞,
$h_1 \to 0$, so the output stays constant on that side while |x| grows without bound.
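The two-neuron construction is easy to verify numerically. In the sketch below the weights are one possible choice, hidden weights (1, −1), hidden biases (0, 0), output weights (1, 1) and output bias 0; the helper names are illustrative.

```python
def relu(z):
    return max(0.0, z)

def abs_via_relu(x):
    """Two hidden neurons: w^(1) = (1, -1), c^(1) = (0, 0), w^(2) = (1, 1), c^(2) = 0."""
    return relu(x) + relu(-x)

# Spot-check the identity |x| = max(0, x) + max(0, -x) on a few points
for x in (-2.5, -1.0, 0.0, 0.5, 3.0):
    assert abs_via_relu(x) == abs(x)
```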
(d) Answer: TRUE, since the sum of piecewise linear functions is piecewise linear.
(e) Answer: TRUE, since the sum of continuous functions is continuous. In this case the
functions at the hidden neurons are continuous, hence the sum of these functions is continuous.
Problem 3
3. Here we investigate the relationship between multi-class logistic regression and decision trees. More precisely,
we would like to identify the final regions of the decision tree using logistic regression. For simplicity we
assume that the input is 1-dimensional real value and the output is the class label.
Recall the following variant of multi-class logistic regression -
• If Y denotes the class label, we have
P (Y = i) ∝ exp(wi x + bi ) (24)
• Hence, we have
$$P(Y = i) = \frac{\exp(w_i x + b_i)}{\sum_j \exp(w_j x + b_j)} \tag{25}$$
$$P(Y = i) = \frac{\exp(\tau (w_i x + b_i))}{\sum_j \exp(\tau (w_j x + b_j))} \tag{26}$$
(a) Let, for a given value of $x = x_0$, we have $w_k x_0 + b_k > w_i x_0 + b_i$ for all $i \neq k$. Then, P(Y = k) → ______
as τ → ∞
[2 Marks]
(b) Let, for a given value of $x = x_0$, we have $w_k x_0 + b_k < \max_{i \neq k} \{w_i x_0 + b_i\}$. Then, P(Y = k) → ______
as τ → ∞.
[2 Marks]
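(c) Let us start simple with two regions (i.e., n = 1): $\{\beta_{-1} < x \le \beta_0\}$, $\{\beta_0 < x \le \beta_1\}$. Note that
$\beta_{-1} = -\infty$ and $\beta_1 = \infty$. Define
$$\begin{aligned}
h_0 > h_1 &\Rightarrow x < \beta_0 \\
h_1 > h_0 &\Rightarrow x > \beta_0
\end{aligned} \tag{29}$$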
Answer Here:
[2 Marks]
(d) Let us consider three regions (i.e., n = 2): $\{\beta_{-1} < x \le \beta_0\}$, $\{\beta_0 < x \le \beta_1\}$, $\{\beta_1 < x \le \beta_2\}$. Note
that $\beta_{-1} = -\infty$ and $\beta_2 = \infty$. Define
c1 :
c2 :
[4 Marks]
(e) Extending the above to n regions -
Let
$$h_i = (i + 1)x - c_i \tag{33}$$
What is the value of $c_i$ (in terms of $\{\beta_i\}$)?
$c_i$ :
[2 Marks]
Answer of exercise 3
So, if $x_i = \max_j \{x_j\}$ (attained by $x_i$ alone), then as $\tau \to \infty$
$$\frac{\exp(\tau x_i)}{\sum_j \exp(\tau x_j)} \to 1 \tag{35}$$
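The limit can be observed numerically by evaluating the scaled softmax for increasing τ. The sketch below is an illustration only: the function name `tempered_softmax` and the score values are assumptions, τ is taken to be positive, and the maximum score is subtracted before exponentiating purely for numerical stability (it does not change the probabilities).

```python
import math

def tempered_softmax(scores, tau):
    """P(Y = i) proportional to exp(tau * score_i), computed stably for tau > 0."""
    m = max(scores)
    weights = [math.exp(tau * (s - m)) for s in scores]  # subtract max to avoid overflow
    total = sum(weights)
    return [w / total for w in weights]

scores = [0.2, 1.0, 0.7]  # illustrative values of w_i * x_0 + b_i
for tau in (1, 10, 100):
    print(tau, [round(p, 4) for p in tempered_softmax(scores, tau)])
```

As τ grows, the probability of the class with the largest score approaches 1 and the probabilities of all other classes approach 0.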
c2 : β0 + β1
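The value of c2 above is consistent with taking each $c_i$ as a running sum of the cut-points, since then $h_i - h_{i-1} = x - \beta_{i-1}$, so $h_i$ overtakes $h_{i-1}$ exactly when x passes $\beta_{i-1}$. The sketch below checks this on an example; the normalization $c_0 = 0$, the helper name `region_by_argmax`, and the cut-point values are assumptions made for illustration.

```python
from itertools import accumulate

def region_by_argmax(x, betas):
    """Index i maximizing h_i(x) = (i + 1) * x - c_i.

    betas: finite cut-points beta_0 < beta_1 < ... in increasing order.
    Assumption: c_0 = 0 and c_i = beta_0 + ... + beta_{i-1}.
    """
    cs = [0.0] + list(accumulate(betas))
    hs = [(i + 1) * x - c for i, c in enumerate(cs)]
    return max(range(len(hs)), key=hs.__getitem__)

betas = [-1.0, 2.0]  # illustrative cut-points beta_0, beta_1
print([region_by_argmax(x, betas) for x in (-3.0, 0.0, 5.0)])  # [0, 1, 2]
```

With these cut-points the three test inputs fall in the first, second and third regions respectively, matching the index of the largest $h_i$.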
Problem 4
4. Consider the following dataset with two features, X1 and X2.
S.No   X1   X2   Class
1      3    4    -1
2      2    2    -1
3      4    4    -1
4      2    4    -1
5      2    1     1
6      4    3     1
7      4    1     1
(a) Obtain the maximum margin classifier which separates the given dataset. Let the hyperplane be
given by −1 + β1 X1 + β2 X2 = 0. Find the values of β1 , β2 .
β1 : β2 :
[4 Marks]
(b) Which of the data points given constitute the support vectors?
Answer Here :
[4 Marks]
Answer of exercise 4
(a) To get the line, we make a visual inspection of the data points and guess the line. Then we prove
that this is indeed the maximum margin hyperplane.
From above, consider the line which passes through (2, 1.5), (4, 3.5), which is the line
$X_1 - X_2 = 0.5$, that is, $L(X_1, X_2) = -0.5 + X_1 - X_2 = 0$. This is chosen since the labels should be
consistent with the signs of the coefficients of the hyperplane. The signed distance from a point
X is given by $\frac{1}{\sqrt{2}} L(X)$. Ignoring $\sqrt{2}$, since it is the same across all data points, we have
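The values of L(X) at each data point can be tabulated with a short script. The sketch below multiplies L(X) by the class label, so a positive value means the point lies on the side matching its label, and reports which points are closest to the line. The variable names are illustrative, and the script only checks the guessed line; it is not a proof of optimality.

```python
import math

points = [  # (X1, X2, class) from the dataset in the question
    (3, 4, -1), (2, 2, -1), (4, 4, -1), (2, 4, -1),
    (2, 1, 1), (4, 3, 1), (4, 1, 1),
]

def L(x1, x2):
    return -0.5 + x1 - x2

# y * L(X) > 0 means the point lies on the side matching its label
margins = [y * L(x1, x2) for x1, x2, y in points]
smallest = min(margins)
closest = [i + 1 for i, m in enumerate(margins) if math.isclose(m, smallest)]
print(margins)
print(closest, smallest / math.sqrt(2))
```

All seven values are positive, and the smallest value, 0.5, is attained by points 2, 3, 5 and 6. Rescaling L by a factor of 2 so that the constant term is −1, as in the form requested in part (a), gives −1 + 2 X1 − 2 X2.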