
INDIAN INSTITUTE OF TECHNOLOGY (ISM) DHANBAD

End-Term Examination
Machine Learning: MSD527
(Academic Year 2021-22)

Course: Machine Learning Max Marks: 100

Date: 29/4/22 Duration: 3 hours

Instructions: Scientific Calculators are allowed

Solution 1a.): −0.5 log2 0.5 − 0.5 log2 0.5 = 1 (2 Marks)
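
For reference, the same entropy value can be checked with a short Python sketch (the label list
below simply mirrors the 3/3 class split at the root node):

    import math

    def entropy(labels):
        """Shannon entropy (base 2) of a list of class labels."""
        n = len(labels)
        return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                    for c in set(labels))

    # Root node: half of the six examples are class 0, half are class 1.
    print(entropy([0, 0, 0, 1, 1, 1]))   # 1.0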


Solution 1b.): If we sort by f1, the feature values and the corresponding labels are
    f1:     1  4  5  6  7  9
    label:  0  1  0  0  1  1      (2 marks)
If we sort by f2, we have
    f2:     1  2  3  4  5  6
    label:  1  0  1  0  1  0
The best split is f1 ≥ 7. (2 Marks)
Solution 1c.): For the tree node with labels (1, 1), there’s no need to split again.
For the tree node with labels (0, 1, 0, 0), if we sort by f1, we have
    f1:     1  4  5  6
    label:  0  1  0  0      (2 Marks)
If we sort using f2, we get
    f2:     1  2  4  6
    label:  1  0  0  0
We easily see we should choose f2 ≥ 2. (2 Marks)
Solution 1d.): Yes. For example, the splits f1 ≥ 5, f2 ≥ 3, and f2 ≥ 5 would all fail to reduce the
weighted average entropy below 1, the entropy of the root node from part a.). (2 marks)
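
The split comparisons in parts b.)–d.) can be verified numerically. A minimal Python sketch,
using the child-node label sets given above (entropy computed as in the sketch after 1a.):

    import math

    def entropy(labels):
        n = len(labels)
        return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                    for c in set(labels))

    def weighted_entropy(left, right):
        """Weighted average entropy of the two children of a split."""
        n = len(left) + len(right)
        return len(left) / n * entropy(left) + len(right) / n * entropy(right)

    # Split f1 >= 7 from part b.): child labels (0,1,0,0) and (1,1).
    print(weighted_entropy([0, 1, 0, 0], [1, 1]))   # ~0.54, well below 1
    # Split f2 >= 2 on the impure node from part c.): child labels (1) and (0,0,0).
    print(weighted_entropy([1], [0, 0, 0]))         # 0.0, a pure split
    # Split f2 >= 3 on the full data from part d.): both children stay 50/50.
    print(weighted_entropy([1, 0], [1, 0, 1, 0]))   # 1.0, no reduction in entropy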
Solution 2a.): Starting with all weights equal to 0.5 is a reasonable idea because, for logistic
regression, the cost function being optimized is convex. (1.5 marks)
With a convex cost function there is only a single optimal point, so it does not matter where we
start; gradient descent will still converge to it. The starting point only changes the number of
epochs needed to reach that optimum. (1.5 marks)
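
A small Python sketch of this point on toy, made-up data: two very different weight
initializations for logistic regression are driven by gradient descent to essentially the same
minimum of the (convex) cost:

    import numpy as np

    # Hypothetical, non-separable 1-D data, used only to illustrate convexity.
    X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
    y = np.array([0, 0, 1, 0, 1, 1])
    Xb = np.hstack([np.ones((len(X), 1)), X])        # prepend a bias column

    def loss_and_grad(w):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))            # sigmoid predictions
        loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        grad = Xb.T @ (p - y) / len(y)
        return loss, grad

    for w in (np.full(2, 0.5), np.array([3.0, -2.0])):   # two different starts
        for _ in range(5000):
            _, g = loss_and_grad(w)
            w -= 0.1 * g
        print(loss_and_grad(w)[0])   # both runs end at (about) the same minimum loss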
Solution 2b.): No, initializing all weights to 0.5 is not a good idea for a neural network.
It creates the symmetry problem, under which every hidden unit receives exactly the same signal. (1.5
marks)
Because of this, all hidden units have an identical influence on the cost and therefore receive
identical gradients. This effectively prevents the neurons from learning different things, since
they evolve symmetrically throughout training. (1.5 marks)
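
A minimal Python sketch of the symmetry problem on a tiny made-up network (2 inputs, 2 sigmoid
hidden units, 1 sigmoid output), with every weight initialized to 0.5:

    import numpy as np

    x = np.array([1.0, 2.0])                # a single training example
    t = 1.0                                 # its target
    W1 = np.full((2, 2), 0.5)               # hidden-layer weights: identical rows
    w2 = np.full(2, 0.5)                    # output weights

    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = sig(W1 @ x)                         # both hidden activations are equal
    y = sig(w2 @ h)

    # Backpropagation for the squared error 0.5 * (y - t)^2
    delta_out = (y - t) * y * (1 - y)
    delta_hidden = delta_out * w2 * h * (1 - h)
    grad_W1 = np.outer(delta_hidden, x)

    print(h)          # identical activations for the two hidden units
    print(grad_W1)    # identical rows: both hidden units get the same update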
Solution 2c.): It is computationally less expensive and gives faster convergence. (2 marks)
Solution 2d.): Faster convergence and less risk of divergence. (2 marks)
Solution 3a.): Increasing the number of hidden nodes increases the network's representational
ability, but at the cost of an increased risk of overtraining (an overfitting issue might crop up). (2
marks)
Whether increasing the number of hidden nodes improves accuracy therefore depends on the
complexity of the problem. If the initial number of hidden nodes was too small to solve the
problem at all, then increasing the number should improve generalisation. (2 marks)
Solution 3b.): One mark for each of the three lines shown in the diagram (3 Marks) and 1 mark for
identifying the support vectors.

Solution 3c.): For a new user who comes in, we have little information about them, and thus the
matrix factorization method cannot learn many associations between the new user and the
existing users. ……(2 Marks).
We should use the demographic information of the user to bridge its associations with existing
users. Many ideas can be applied here. The most intuitive way is perhaps to do regression based
on the demographic features, compute a similarity between the new user and the existing users,
and then approximate v_u with a linear combination of the existing users' vectors. ……(2 Marks).
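
One way to sketch this in Python (the function and variable names are illustrative, not a
prescribed method): estimate the new user's latent vector v_u as a similarity-weighted
combination of existing users' latent vectors, with similarities computed from demographic
features:

    import numpy as np

    def estimate_new_user_vector(new_demo, existing_demo, existing_V):
        """new_demo: (d,), existing_demo: (n, d), existing_V: (n, k)."""
        # cosine similarity between the new user and every existing user
        sims = existing_demo @ new_demo
        sims = sims / (np.linalg.norm(existing_demo, axis=1)
                       * np.linalg.norm(new_demo) + 1e-12)
        weights = np.clip(sims, 0, None)          # keep only positive similarities
        weights = weights / (weights.sum() + 1e-12)
        return weights @ existing_V               # (k,) latent vector for the new user

    # Toy shapes only: 5 existing users, 3 latent factors, 4 demographic features.
    V = np.random.rand(5, 3)
    D = np.random.rand(5, 4)
    print(estimate_new_user_vector(np.random.rand(4), D, V))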
Solution 3d.): Use the metadata of the movie as additional information to encode the similarity.
(1 mark)
Perhaps approximate the corresponding w_m as a linear combination of existing movies' vectors,
based on their similarities in terms of the meta information. (1 mark)
This can be encoded in the objective function. (1 Mark)
Solution 3e.): If we want to map sample points to a very high-dimensional feature space, the
kernel trick can save us from having to compute those features explicitly, thereby saving a lot
of time. (2 Marks)
The kernel trick enables the use of infinite-dimensional feature spaces. (1 Mark)
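
A quick Python check of this idea for the degree-2 polynomial kernel k(x, z) = (x · z)^2 in two
dimensions: the kernel value equals an inner product of explicit quadratic features without ever
constructing those features.

    import numpy as np

    def phi(x):
        # explicit degree-2 feature map for a 2-D input
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def poly_kernel(x, z):
        return (x @ z) ** 2                # computed entirely in the original 2-D space

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 4.0])
    print(phi(x) @ phi(z))                 # 121.0
    print(poly_kernel(x, z))               # 121.0  (same value, far cheaper in high dimensions)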
Solution 3f.): Such a large number of centroids means that most centroids will likely end up
identifying individual data points. (1.5 Marks)
In a real sense there is no learning: the whole dataset is memorized with no generalization. (1.5
Marks)
Processing of new data will therefore likely be unreliable. (1 mark)
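
A quick illustration in Python (using scikit-learn on made-up data, purely to show the effect):
with as many centroids as data points, k-means can drive its objective to zero by giving every
point its own centroid, i.e. by memorizing the data.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(10, 2)              # 10 hypothetical points
    km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

    print(sorted(km.labels_))              # ten distinct clusters, one per point
    print(km.inertia_)                     # ~0: the data has simply been memorized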
Solution 6a.): The data points can be separated linearly, e.g. by the line 𝑥1 + 𝑥2 = 2.5 (2
marks)
Solution 6b.): For A, we get 𝑧 = 𝑤0𝑥0 + 𝑤1𝑥1 + 𝑤2𝑥2 = 0(−1) − 1(1) + 1(2) = 1. Since 𝑧 > 0,
we get the prediction 𝑦 = 1. Since 𝑦 = 𝑡, there is no change to 𝒘. (2 marks)
For B, we get 𝑧 = 𝑤0𝑥0 + 𝑤1𝑥1 + 𝑤2𝑥2 = 0(−1) − 1(2) + 1(1) = −1. Since 𝑧 < 0, we get the
prediction 𝑦 = 0. (2 marks)
The update 𝒘 = 𝒘 − 𝜂(𝑦 − 𝑡)𝒙 = (0, −1,1) − 0.1(0 − 1)(−1,2,1) = (−0.1, −0.8,1.1) (2 marks)
Solution 6c.): Point A. Since 𝑦 = 𝑧 = 1 = 𝑡, there is no update.
Data point B. Since 𝑦 = 𝑧 = −1, the update is 𝒘 = 𝒘 − 𝜂(𝑦 − 𝑡)𝒙 = (0, −1,1) − 0.1(−1 − 1)(−1,2,1)
= (−0.2, −0.6,1.2) (2 marks)
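
The arithmetic of parts b.) and c.) can be reproduced with a short Python sketch, assuming the
encoding used above (bias input x0 = −1, learning rate η = 0.1, initial weights (0, −1, 1), and
target t = 1 for both points):

    import numpy as np

    eta = 0.1
    w0 = np.array([0.0, -1.0, 1.0])
    A = np.array([-1.0, 1.0, 2.0])          # (x0, x1, x2) for point A
    B = np.array([-1.0, 2.0, 1.0])          # (x0, x1, x2) for point B
    t = 1.0

    def perceptron_step(w, x, t):
        y = 1.0 if w @ x > 0 else 0.0        # step activation, as in part b.)
        return w - eta * (y - t) * x

    def linear_step(w, x, t):
        y = w @ x                            # linear unit y = z, as in part c.)
        return w - eta * (y - t) * x

    print(perceptron_step(w0, A, t))   # [ 0.  -1.   1. ]  (correct, no change)
    print(perceptron_step(w0, B, t))   # [-0.1 -0.8  1.1]
    print(linear_step(w0, A, t))       # [ 0.  -1.   1. ]  (y = t, no change)
    print(linear_step(w0, B, t))       # [-0.2 -0.6  1.2]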
Solution 7a.): Assume values read off the graph and give the answer accordingly; for example, we may
assume that P(1 | x=0) = 0.2 and P(0 | x=0) = 0.8. (3 Marks)
Solution 7b.): You can never be 100% sure with a logistic regression model. (3 Marks)
Solution 7c.): This is normally done by choosing class 1 if P(1 | x) > 0.5. Based on the
assumption considered in part a.), the logistic classifier classifies 4 spams correctly and 3 spams
incorrectly, and 5 no-spams correctly and 2 incorrectly. Altogether 9 out of 14 are classified
correctly, yielding an accuracy of 9/14. (3 Marks)
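
A small Python sketch of the thresholding step, with hypothetical probabilities chosen only so
that the counts match the assumption carried over from part a.):

    import numpy as np

    # P(spam | x) for the 7 true spams and the 7 true no-spams (illustrative values).
    p_spam    = np.array([0.9, 0.8, 0.75, 0.6, 0.4, 0.3, 0.2])
    p_no_spam = np.array([0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05])

    def accuracy(threshold):
        correct = np.sum(p_spam > threshold) + np.sum(p_no_spam <= threshold)
        return correct / (len(p_spam) + len(p_no_spam))

    print(accuracy(0.5))   # 9/14 ~= 0.64, matching part c.)
    print(accuracy(0.7))   # the higher threshold removes the false spam predictions here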
Solution 7d.): We could either say that the goal is to get a good precision for spam, or a good
recall for no-spam. This can be achieved by raising the threshold from 0.5. For the training
data, a threshold of 0.7 would suffice. If we want to be prepared for more variation in the test
data, we could set the threshold even higher. (3 Marks)
