
Computer Science & Engineering Department, IIT Kharagpur

CS60050 Machine Learning
Endterm Examination, Spring 2013
Time: 3 hours Full Marks: 95

1. (a) Suppose that you train a classifier with training sets of size m. As m → ∞, what [16]
do you expect will be the behavior of the training error? What would you expect
for the behavior of the test error? Draw a picture to illustrate.
(b) Suppose that you have a linear SVM binary classifier. Consider a point that is
currently classified correctly, and is far away from the decision boundary. If you
remove the point from the training set, and re-train the classifier, will the decision
boundary change or stay the same? Explain your answer in one sentence.
(c) Suppose that you have a decision tree binary classifier. Consider a point that is
currently classified correctly, and is far away from the decision boundary. If you
remove the point from the training set, and re-train the classifier, will the decision
boundary change or stay the same? Explain your answer in one sentence.
(d) True or false: Given enough training data, feed-forward neural networks can
learn to solve any binary classification problem. Explain.

2. A publisher has decided to run a marketing campaign and send free samples of their [6+9]
newly published books to people who are likely to be very interested in
them. For each customer, they know the age, gender, occupation, education level,
salary, city and state. Each book that they publish has a title, keywords describing it
(e.g. fantasy, science fiction, historical, biography, etc.), author and year of publication,
as well as a unique ISBN code. Some of the customers have provided ratings in the past
for books they have bought. The company has roughly 1000 past ratings available.

(a) Suppose you have to set this up as a supervised learning problem. Explain how
you would construct the data set:
What attributes would you use?
What would you aim to predict?
What would be the training data?
(b) Suppose that you decided to phrase this as a classification problem. For each of
the methods below, explain in at most 2 sentences if it is appropriate or not. If
yes, describe any data preprocessing and other choices that you would need (in
at most 2 other sentences).
i. Support vector machines
ii. Neural networks
iii. 1-nearest neighbour
3. (a) Let F be a set of classifiers whose VC-dimension is 5. Suppose we have four
training examples and labels, {(x1, y1), (x2, y2), (x3, y3), (x4, y4)}, and select a
classifier f from F by minimizing classification error on the training set. In the
absence of any other information about the set of classifiers F, can we say that
the prediction f(x5) for a new example x5 has any relation to the training set?
Briefly justify your answer.
(b) Consider the space of points in the plane. Consider the class of hypotheses
defined by conjunctions of two perceptrons (each with two inputs). An example
of such a hypothesis is shown in the figure below.


i. Show a set of 3 points in the plane that can be shattered by this hypothesis
ii. Show a set of points in the plane that cannot be shattered by this hypothesis
iii. What is the exact VC-dimension of this hypothesis class? Show your reasoning.
(c) We learned that if a consistent learning algorithm for a finite hypothesis space
H is provided with

m ≥ (1/ε) (ln |H| + ln(1/δ))

randomly drawn training instances, then we can state a certain guarantee. What
is that guarantee? Make sure to clearly indicate the roles of ε and δ.
4. (a) Define what you mean by the support vectors of a linear SVM classifier [3+4+6]
when using a hard margin SVM, assuming that the input instances are linearly
separable.
(b) Define a kernel function. Give an example of a kernel function.
(c) Given the following dataset in 1-d space, which consists of 4 positive data points
{0, 1, 2, 3} and 3 negative data points {4, 5, 6}, suppose that we want to learn
a soft-margin linear SVM for this data set. Remember that the soft-margin
linear SVM can be formalized as the following constrained quadratic optimization
problem. In this formulation, C is the regularization parameter, which balances
the size of the margin vs. the violation of the margin (i.e., smaller Σ_{i=1}^m ε_i):

argmin_{w,b} (1/2) wᵀw + C Σ_{i=1}^m ε_i

subject to
y_i (wᵀx_i + b) ≥ 1 − ε_i
ε_i ≥ 0 ∀i

i. If C = 0, which means that we only care about the size of the margin, how
many support vectors do we have? What is the margin in this case?
ii. If C → ∞, which means that we only care about the violation of the margin,
how many support vectors do we have?
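As an aside (not part of the exam), the QP above is equivalent to minimising the unconstrained hinge-loss objective (1/2)w² + C Σ_i max(0, 1 − y_i(w·x_i + b)). A minimal subgradient-descent sketch on the 1-d dataset from part (c); the learning rate and step count are arbitrary choices:

```python
# Soft-margin linear SVM in 1-d via subgradient descent on the hinge-loss
# objective  (1/2)w^2 + C * sum_i max(0, 1 - y_i (w*x_i + b)).
def train_soft_margin_svm_1d(xs, ys, C=1.0, lr=0.001, steps=20000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw, gb = w, 0.0                  # gradient of the (1/2)w^2 term
        for x, y in zip(xs, ys):
            if y * (w * x + b) < 1:      # margin violated: hinge term is active
                gw -= C * y * x
                gb -= C * y
        w -= lr * gw
        b -= lr * gb
    return w, b

xs = [0, 1, 2, 3, 4, 5, 6]
ys = [+1, +1, +1, +1, -1, -1, -1]        # 4 positive, 3 negative points
w, b = train_soft_margin_svm_1d(xs, ys)
# With C = 1 the minimiser is approximately w = -1, b = 3.5, i.e. the
# decision boundary sits near x = 3.5 and every point is classified correctly.
```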

5. Consider building an ensemble of decision stumps (decision boundaries) Gm with the [6]
AdaBoost algorithm,

f(x) = sign( Σ_{m=1}^M α_m G_m(x) ).

The figure below displays a few labeled points in two dimensions as well as the first
classifier boundary we have chosen. A boundary predicts binary ±1 values, and
depends only on one coordinate value (the split point). The little arrow in the figure is
the normal to the decision boundary indicating the positive side where the boundary
line predicts +1. All the points start with uniform weights.

[Figure: points labeled +1 and −1 in the plane, with the first stump's decision boundary and its positive-direction arrow; not fully recoverable from the scan]


(a) Circle all the point(s) in the figure whose weight will increase as a result of
incorporating the first stump (the weight update due to the first stump).
(b) :Praw in the same figure a possible stump (boundary) that we could select at the
next boosting iteration. You need to draw both the decision boundary and its
positive orientation.
(c) Will the second stump receive a higher coefficient in the ensemble than the first?
In other words, will α2 > α1? Briefly explain your answer. (No calculation should
be necessary).
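A minimal sketch of one AdaBoost round, on hypothetical points and a hypothetical stump (not the figure from the question), showing how the weight of a misclassified point increases:

```python
import math

# One AdaBoost round: weighted error, stump coefficient alpha, weight update.
# The stump h(x) = +1 iff x[0] < 2 misclassifies exactly one of the points.
def adaboost_round(points, labels, weights, stump):
    preds = [stump(p) for p in points]
    err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
    alpha = 0.5 * math.log((1 - err) / err)   # coefficient of this stump
    new_w = [w * math.exp(-alpha * y * p)     # up-weight mistakes, down-weight hits
             for w, p, y in zip(weights, preds, labels)]
    z = sum(new_w)                            # normalise back to a distribution
    return alpha, [w / z for w in new_w]

points = [(1, 1), (1, 3), (3, 1), (3, 3)]
labels = [+1, +1, -1, +1]                     # (3, 3) is the one mistake
stump = lambda p: +1 if p[0] < 2 else -1
alpha, new_weights = adaboost_round(points, labels, [0.25] * 4, stump)
# After the update the misclassified point carries half the total weight, so the
# stump just added would score only 0.5 weighted error on the new distribution.
```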
6. (a) Let H be a hidden Markov model with state space S and observation space O. [6+3]
Suppose we are given a sequence of observations (y1, y2, ..., yn) and we would
like to find the MAP estimate of the hidden states (x1, x2, ..., xn). The Viterbi
algorithm can be used to compute the MAP estimate in O(nk²) time where
k = |S|.
This algorithm uses the following: δ_{t,i} is defined as the probability of the most
likely state sequence that emits o1, o2, ..., ot and ends in state si.
State the formula by which the δ values are defined recursively using dynamic
programming. State briefly how the most likely path is found using this algorithm.
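A generic Viterbi sketch in the δ notation of part (a); the transition matrix M, emission table B, and initial distribution pi below are hypothetical inputs:

```python
# delta[i] holds the probability of the most likely state sequence that emits
# the observations so far and ends in state i; back stores the argmax pointers.
def viterbi(obs, pi, M, B):
    k = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(k)]
    back = []
    for o in obs[1:]:
        prev = delta
        delta, ptr = [], []
        for i in range(k):
            # recursion: delta_t(i) = max_j delta_{t-1}(j) * M[j][i] * B[i][o]
            j = max(range(k), key=lambda j: prev[j] * M[j][i])
            ptr.append(j)
            delta.append(prev[j] * M[j][i] * B[i][o])
        back.append(ptr)
    # backtrack from the best final state to recover the MAP path
    path = [max(range(k), key=lambda i: delta[i])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Two-state example: state 0 tends to emit symbol 0, state 1 to emit symbol 1.
M = [[0.9, 0.1], [0.1, 0.9]]
B = [[0.8, 0.2], [0.2, 0.8]]
print(viterbi([0, 0, 1, 1], [0.5, 0.5], M, B))  # -> [0, 0, 1, 1]
```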

(b) Suppose the transition matrix M has the following special structure: M(i, i) = a
and M(i, j) = b when j ≠ i. Suppose b < a. Show how the Viterbi algorithm
works in this case. Try to find an efficient algorithm that runs in O(nk) time in
this case.

7. Consider the following deterministic Markov Decision Process (MDP), describing a [14]
simple robot grid world. Notice the values of the immediate rewards are written next
to transitions. Transitions with no value have an immediate reward of 0. Assume the
discount factor γ = 0.8.

[Figure: 2×3 grid world with states s1, s2, s3 (top row) and s4, s5, s6 (bottom row); immediate rewards are written next to the transitions]

(a) For each state s, write the value for V*(s) inside the corresponding square in the figure.
(b) Mark the state-action transition arrows that correspond to one optimal policy.
If there is a tie, always choose the state with the smallest index.
(c) Give a different value for γ which results in a different optimal policy, such that
the number of changed policy actions is minimal. Give your new value for
γ, and describe the resulting policy by indicating which π(s) values (i.e., which
policy actions) change.
For the remainder of this question, assume again that γ = 0.8.
(d) How many complete loops (iterations) of value iteration are sufficient to guarantee
finding the optimal policy for this MDP? Assume that values are initialized
to zero, and that states are considered in an arbitrary order on each iteration.
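A minimal value-iteration sketch on a hypothetical 3-state chain MDP (not the grid world in the figure): action 0 stays put, action 1 moves right, entering state 2 from state 1 pays reward 100, and everything else pays 0:

```python
gamma = 0.8
next_state = [[0, 1], [1, 2], [2, 2]]   # next_state[s][a] (deterministic)
reward = [[0, 0], [0, 100], [0, 0]]     # reward[s][a]

V = [0.0, 0.0, 0.0]                     # values initialised to zero
for _ in range(50):                     # Bellman backups until (near) convergence
    V = [max(reward[s][a] + gamma * V[next_state[s][a]] for a in (0, 1))
         for s in range(3)]
# Here V* = [80, 100, 0]: state 1 collects 100 immediately, state 0 collects it
# one step later (discounted by 0.8), and state 2 is absorbing with no reward.
```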
(e) Is it possible to change the immediate reward function so that V* changes but the
optimal policy π* remains unchanged? If yes, give such a change, and describe
the resulting change to V*. Otherwise, explain in at most 2 sentences why this
is impossible.
(f) Unfortunately for our robot, in January, a patch of ice has appeared in its world,
making one of its actions non-deterministic. The resulting MDP is shown below.
Note that now the result of the action "go north" from state s6 results in one
of two outcomes. With probability p the robot succeeds in transitioning to state
s3 and receives immediate reward 100. However, with probability (1- p) it slips
on the ice, and remains in state s6 with zero immediate reward. Assume the
discount factor γ = 0.8.

[Figure: the modified grid world with the ice patch; "go north" from s6 succeeds with probability p]

Assume p = 0.4. Write the value of V* for each state, and circle the actions in
the optimal policy.
8. In this problem two linear dimensionality reduction methods will be considered: [6]
principal component analysis (PCA) and Fisher linear discriminant analysis (LDA).
LDA reduces the dimensionality given labels by maximizing the overall interclass
variance relative to intraclass variance. Plot the approximate directions of the first
PCA and LDA components in the following figure.

[Figure: 2-d scatter plot of labeled points from two classes; not recoverable from the scan]
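As a hypothetical aside (the exam's figure is not reproduced above), the first PCA direction of 2-d data has a closed form via the leading eigenvector of the 2×2 covariance matrix; a minimal sketch on made-up points spread along the line y = x:

```python
import math

def first_pca_direction(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = sum((x - mx) ** 2 for x, _ in points) / n        # var(x)
    b = sum((y - my) ** 2 for _, y in points) / n        # var(y)
    c = sum((x - mx) * (y - my) for x, y in points) / n  # cov(x, y)
    # leading eigenvalue of the covariance matrix [[a, c], [c, b]]
    lam = (a + b) / 2 + math.sqrt(((a - b) / 2) ** 2 + c ** 2)
    if abs(c) < 1e-12:                 # data already axis-aligned
        return (1.0, 0.0) if a >= b else (0.0, 1.0)
    vx, vy = c, lam - a                # (unnormalised) eigenvector for lam
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# Points along y = x with a little jitter: the first principal direction
# comes out close to (1/sqrt(2), 1/sqrt(2)).
pts = [(t, t + 0.01 * ((t % 3) - 1)) for t in range(10)]
direction = first_pca_direction(pts)
```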