1. a. What is the difference between maximal margin classifier and support vector classifier (5)
The maximal margin classifier assumes that the two classes of the response are
perfectly separable by a hyperplane. A natural choice is the maximal margin hyperplane (also
known as the optimal separating hyperplane), which is the separating hyperplane that is
farthest from the training observations.
In many cases no separating hyperplane exists, and so there is no maximal margin classifier.
The support vector classifier, sometimes called a soft margin classifier, allows some
observations to be on the incorrect side of the margin, or even the incorrect side of the
hyperplane, rather than seeking the largest possible margin so that every observation is on the
correct side of both the hyperplane and the margin.
b. Explain the model form used for support vector classifier. How is it different from
maximal margin classifier? (5)
If ϵi = 0 then the ith observation is on the correct side of the margin, as we saw in Section 9.1.4.
If ϵi > 0 then the ith observation is on the wrong side of the margin, and we say that the ith
observation has violated the margin. If ϵi > 1 then it is on the wrong side of the hyperplane.
In (9.15), C bounds the sum of the ϵi’s, and so it determines the number and severity of the
violations to the margin (and to the hyperplane) that we will tolerate. We can think of C as a
budget for the amount that the margin can be violated by the n observations. If C = 0 then there
is no budget for violations to the margin, and it must be the case that ϵ1 = · · · = ϵn = 0, in
which case (9.12)–(9.15) simply amounts to the maximal margin hyperplane optimization problem.
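For reference, the optimization problem labelled (9.12)–(9.15) in ISLR (the support vector classifier) can be written as:

```latex
\begin{aligned}
& \underset{\beta_0,\beta_1,\ldots,\beta_p,\;\epsilon_1,\ldots,\epsilon_n}{\text{maximize}} \quad M \\
& \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 = 1, \\
& y_i\bigl(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\bigr) \ge M(1 - \epsilon_i) \quad \forall\, i = 1,\ldots,n, \\
& \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C,
\end{aligned}
```

where M is the width of the margin and C is the nonnegative budget discussed above. Setting C = 0 forces ϵ1 = · · · = ϵn = 0 and recovers the maximal margin classifier.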
2. What is bagging in context of decision trees? Explain how it can be applied in regression
and classification problems. (5)
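The aggregation step of bagging can be illustrated with a minimal NumPy sketch (the per-tree predictions below are made-up numbers, purely for illustration): in regression the B bootstrap-tree predictions are averaged, while in classification the majority vote is taken.

```python
import numpy as np

# Suppose B = 4 trees, each fit on a different bootstrap sample,
# have produced these predictions for a single test observation.
preds_reg = np.array([2.1, 1.9, 2.4, 2.0])   # regression: average the predictions
bagged_reg = preds_reg.mean()

preds_clf = np.array(["A", "B", "A", "A"])   # classification: take the majority vote
vals, counts = np.unique(preds_clf, return_counts=True)
bagged_clf = vals[counts.argmax()]
```

Averaging over many bootstrapped trees reduces the variance of a single (unpruned, high-variance) tree without changing its low bias.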
3. From the dataset below, determine what proportion of variability is explained by the first
principal component? (10)
X1 4 8 13 7
X2 11 4 5 14
Xᵢ' = Xᵢ − μᵢ  (center each variable at its mean: μ₁ = 8, μ₂ = 8.5)
X1'  −4    0    5   −1
X2'  2.5  −4.5 −3.5  5.5
σ²(X1') = Σᵢ (X1ᵢ')² / (n − 1),   Cov(X1', X2') = Σᵢ X1ᵢ' X2ᵢ' / (n − 1),   where n = 4, so n − 1 = 3
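The remaining arithmetic (the covariance matrix, its eigenvalues, and the proportion of variance explained by the first principal component) can be checked with a short NumPy sketch of the same computation:

```python
import numpy as np

# The four observations of (X1, X2) from the question.
X = np.array([[4, 11], [8, 4], [13, 5], [7, 14]], dtype=float)

Xc = X - X.mean(axis=0)               # center each variable at its mean
S = Xc.T @ Xc / (X.shape[0] - 1)      # sample covariance matrix (n - 1 = 3)

eigvals = np.linalg.eigvalsh(S)       # eigenvalues in ascending order
prop = eigvals[-1] / eigvals.sum()    # variance explained by PC1
```

The eigenvalues sum to the total variance σ²(X1') + σ²(X2'), so the largest eigenvalue divided by that sum is the proportion explained by the first principal component.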
4. What are the problems of Gradient Descent algorithm in the context of neural networks? (5)
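One of the commonly cited problems, sensitivity to a fixed learning rate, can be shown with a toy sketch (gradient descent on f(x) = x², chosen here purely for illustration): too small a step converges slowly, too large a step diverges.

```python
# Gradient descent on f(x) = x^2, whose gradient is 2x,
# with a fixed step size lr.
def gd(lr, steps=20, x=1.0):
    for _ in range(steps):
        x -= lr * 2 * x
    return x

small = gd(0.1)   # shrinks toward the minimum at 0
large = gd(1.1)   # step too big: the iterates overshoot and grow
```

Other standard issues in the neural-network setting include getting stuck in local minima or saddle points of the non-convex loss, and vanishing or exploding gradients in deep networks.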
5. How does Adagrad algorithm mitigate the issues in Stochastic Gradient Descent? (5)
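The core of Adagrad's answer, a per-parameter step size that shrinks with the accumulated squared gradients, can be sketched as follows (variable names and the lr/eps values are illustrative, not from the original):

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    # Accumulate squared gradients per coordinate, then scale each
    # coordinate's step by 1 / sqrt(accumulated squared gradients):
    # frequently-updated parameters automatically take smaller steps.
    accum += grad ** 2
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

w = np.array([1.0, 1.0])
accum = np.zeros(2)
# One coordinate sees a large gradient, the other a small one,
# yet both receive an effective step of roughly lr after normalization.
w, accum = adagrad_step(w, np.array([10.0, 0.1]), accum)
```

This removes SGD's need to hand-tune a single global learning rate for parameters whose gradients differ by orders of magnitude, though the ever-growing accumulator eventually makes the steps very small.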