
CS 189 Introduction to Machine Learning
Spring 2013 Midterm


• You have 1 hour 20 minutes for the exam.
• The exam is closed book, closed notes except your one-page crib sheet.

• Please use non-programmable calculators only.


• Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a
brief explanation. All short answer sections can be successfully answered in a few sentences AT MOST.
• For true/false questions, fill in the True/False bubble.

• For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be
more than one). For a question with p points and k choices, every false positive will incur a penalty of p/(k − 1)
points.

First name

Last name

SID

For staff use only:


Q1. True/False /14
Q2. Multiple Choice Questions /21
Q3. Short Answers /15
Total /50

Q1. [14 pts] True/False
(a) [1 pt] In Support Vector Machines, we maximize ‖w‖²/2 subject to the margin constraints.
True False

(b) [1 pt] In kernelized SVMs, the kernel matrix K has to be positive definite.
True False

(c) [1 pt] If two random variables are independent, then they have to be uncorrelated.
True False

(d) [1 pt] Isocontours of Gaussian distributions have axes whose lengths are proportional to the eigenvalues of the
covariance matrix.
True False

(e) [1 pt] The RBF kernel (K(xi, xj) = exp(−γ‖xi − xj‖²)) corresponds to an infinite dimensional mapping of the
feature vectors.
True False

(f ) [1 pt] If (X, Y ) are jointly Gaussian, then X and Y are also Gaussian distributed.
True False

(g) [1 pt] A function f (x, y, z) is convex if the Hessian of f is positive semi-definite.


True False

(h) [1 pt] In a least-squares linear regression problem, adding an L2 regularization penalty cannot decrease the L2
error of the solution w on the training data.
True False

(i) [1 pt] In linear SVMs, the optimal weight vector w is a linear combination of training data points.
True False

(j) [1 pt] In stochastic gradient descent, we take steps in the exact direction of the gradient vector.
True False

(k) [1 pt] In a two class problem when the class conditionals P (x|y = 0) and P (x|y = 1) are modelled as Gaussians
with different covariance matrices, the posterior probabilities turn out to be logistic functions.
True False

(l) [1 pt] The perceptron training procedure is guaranteed to converge if the two classes are linearly separable.
True False

(m) [1 pt] The maximum likelihood estimate for the variance of a univariate Gaussian is unbiased.
True False

(n) [1 pt] In linear regression, using an L1 regularization penalty term results in sparser solutions than using an
L2 regularization penalty term.
True False

Q2. [21 pts] Multiple Choice Questions
(a) [2 pts] If X ∼ N(µ, σ²) and Y = aX + b, then the variance of Y is:

    aσ² + b        a²σ² + b        aσ²        a²σ²

(b) [2 pts] In soft margin SVMs, the slack variables ξi defined in the constraints yi(wᵀxi + b) ≥ 1 − ξi have to be

    < 0        ≤ 0        > 0        ≥ 0

(c) [4 pts] Which of the following transformations, when applied to X ∼ N(µ, Σ), transforms it into an axis-aligned
Gaussian? (Σ = UDUᵀ is the spectral decomposition of Σ)

    U⁻¹(X − µ)        (UD)⁻¹(X − µ)        UD(X − µ)
    (UD^(1/2))⁻¹(X − µ)        U(X − µ)        Σ⁻¹(X − µ)

(d) [2 pts] Consider the sigmoid function f(x) = 1/(1 + e⁻ˣ). The derivative f′(x) is

    f(x) ln f(x) + (1 − f(x)) ln(1 − f(x))        f(x)(1 − f(x))
    f(x) ln(1 − f(x))        f(x)(1 + f(x))

(e) [2 pts] In regression, using an L2 regularizer is equivalent to using a ______ prior.

    Laplace, (1/(2β)) exp(−|x|/β)        Exponential, (1/β) exp(−x/β), for x > 0
    Gaussian with Σ = cI, c ∈ R        Gaussian with diagonal covariance (Σ ≠ cI, c ∈ R)

(f) [2 pts] Consider a two-class classification problem with the loss matrix given as

    [ λ11  λ12 ]
    [ λ21  λ22 ]

Note that λij is the loss for classifying an instance from class j as class i. At the decision boundary, the ratio
P(ω2|x)/P(ω1|x) is equal to:

    (λ11 − λ22)/(λ21 − λ12)        (λ11 − λ21)/(λ22 − λ12)        (λ11 + λ22)/(λ21 + λ12)        (λ11 − λ12)/(λ22 − λ21)

(g) [2 pts] Consider the L2-regularized loss function for linear regression L(w) = ½‖Y − Xw‖² + λ‖w‖², where λ
is the regularization parameter. The Hessian matrix ∇²_w L(w) is

    XᵀX        2λXᵀX        XᵀX + 2λI        (XᵀX)⁻¹

(h) [2 pts] The geometric margin in a hard margin Support Vector Machine is

    ‖w‖²/2        1/‖w‖²        2/‖w‖        2/‖w‖²

(i) [3 pts] Which of the following functions are convex?

    sin(x)        |x|        min(f1(x), f2(x)), where f1 and f2 are convex        max(f1(x), f2(x)), where f1 and f2 are convex

Q3. [15 pts] Short Answers
(a) [4 pts] For a hard margin SVM, give an expression to calculate b given the solutions for w and the Lagrange
multipliers {αi}, i = 1, . . . , N.

(b) Consider a Bernoulli random variable X with parameter p (P (X = 1) = p). We observe the following samples
of X: (1, 1, 0, 1).
(i) [2 pts] Give an expression for the likelihood as a function of p.

(ii) [2 pts] Give an expression for the derivative of the negative log likelihood.

(iii) [1 pt] What is the maximum likelihood estimate of p?

(c) [6 pts] Consider the weighted least squares problem in which you are given a dataset {(x̃i, yi, wi)}, i = 1, . . . , N,
where wi is an importance weight attached to the ith data point. The loss is defined as L(β) = Σ_{i=1}^{N} wi (yi − βᵀxi)².
Give an expression to calculate the coefficients β̃ in closed form.
Hint: You might need to use a matrix W such that diag(W) = [w1 w2 . . . wN]ᵀ.
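For intuition only (not part of the original exam), here is a minimal NumPy sketch of the standard weighted least-squares closed form that the hint points toward, β̂ = (XᵀWX)⁻¹XᵀWy; all data and weights below are made up.

```python
import numpy as np

# Made-up data: N samples with d features and one importance weight per sample.
rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)
w = rng.uniform(0.5, 2.0, size=N)        # importance weights w_i

W = np.diag(w)                           # diag(W) = [w_1 ... w_N]^T, as in the hint
# Weighted least-squares closed form: beta = (X^T W X)^{-1} X^T W y
beta_hat = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_hat)                          # close to beta_true
```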

CS 189 Introduction to Machine Learning
Spring 2014 Midterm
• You have 2 hours for the exam.
• The exam is closed book, closed notes except your one-page crib sheet.

• Please use non-programmable calculators only.


• Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a
brief explanation.
• For true/false questions, fill in the True/False bubble.

• For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be
more than one). We have introduced a negative penalty for false positives for the multiple choice questions
such that the expected value of randomly guessing is 0. Don’t worry, for this section, your score will be the
maximum of your score and 0, thus you cannot incur a negative score for this section.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

For staff use only:


Q1. True or False /10
Q2. Multiple Choice /24
Q3. Decision Theory /8
Q4. Kernels /14
Q5. L2-Regularized Linear Regression with Newton’s Method /8
Q6. Maximum Likelihood Estimation /8
Q7. Affine Transformations of Random Variables /13
Q8. Generative Models /15
Total /100

Q1. [10 pts] True or False
(a) [1 pt] The hyperparameters in the regularized logistic regression model are η (learning rate) and λ (regularization
term).
True False

(b) [1 pt] The objective function used in L2 regularized logistic regression is convex.
True False

(c) [1 pt] In SVMs, the values of αi for non-support vectors are 0.


True False

(d) [1 pt] As the number of data points approaches ∞, the error rate of a 1-NN classifier approaches 0.
True False

(e) [1 pt] Cross validation will guarantee that our model does not overfit.
True False

(f) [1 pt] As the number of dimensions increases, the percentage of the unit ball's volume that lies in a shell of
thickness ε grows.
True False

(g) [1 pt] In logistic regression, the Hessian of the (non regularized) log likelihood is positive definite.
True False

(h) [1 pt] Given a binary classification scenario with Gaussian class conditionals and equal prior probabilities, the
optimal decision boundary will be linear.
True False

(i) [1 pt] In the primal version of SVM, we are minimizing the Lagrangian with respect to w and in the dual
version, we are minimizing the Lagrangian with respect to α.
True False

(j) [1 pt] For the dual version of soft margin SVM, the αi ’s for support vectors satisfy αi > C.
True False

Q2. [24 pts] Multiple Choice
(a) [3 pts] Consider the binary classification problem where y ∈ {0, 1} is the label and we have prior probability
P (y = 0) = π0 . If we model P (x|y = 1) to be the following distributions, which one(s) will cause the posterior
P (y = 1|x) to have a logistic function form?

Gaussian Uniform

Poisson None of the above

(b) [3 pts] Given the following data samples (square and triangle belong to two different classes), which one(s) of
the following algorithms can produce zero training error?

1-nearest neighbor Logistic regression

Support vector machine Linear discriminant analysis

(c) [3 pts] The following diagrams show the iso-probability contours for two different 2D Gaussian distributions. On
the left side, the data ∼ N(0, I) where I is the identity matrix. The right side has the same set of contour levels
as the left side. What is the mean and covariance matrix for the right side’s multivariate Gaussian distribution?

[Figure: two contour plots on (x, y) axes ranging from −5 to 5; the left shows the iso-probability contours of N(0, I), the right shows the distribution in question at the same contour levels.]

    µ = [0, 0]ᵀ, Σ = [[1, 0], [0, 1]]        µ = [0, 1]ᵀ, Σ = [[4, 0], [0, 0.25]]
    µ = [0, 1]ᵀ, Σ = [[1, 0], [0, 1]]        µ = [0, 1]ᵀ, Σ = [[2, 0], [0, 0.5]]

(d) [3 pts] Given the following data samples (square and triangle mean two classes), which one(s) of the following
kernels can we use in SVM to separate the two classes?

Linear kernel Gaussian RBF (radial basis function) kernel

Polynomial kernel None of the above

(e) [3 pts] Consider the following plots of the contours of the unregularized error function along with the constraint
region. What regularization term is used in this case?

L2 L∞

L1 None of the above

(f) [3 pts] Suppose we have a covariance matrix

    Σ = [[5, a], [a, 4]]

What is the set of values that a can take on such that Σ is a valid covariance matrix?

    a ∈ ℝ        a ≥ 0
    −√20 ≤ a ≤ √20        −√20 < a < √20

(g) [3 pts] The soft margin SVM formulation is as follows:

    min (1/2) wᵀw + C Σ_{i=1}^{N} ξi
    subject to  yi(wᵀxi + b) ≥ 1 − ξi   ∀i
                ξi ≥ 0   ∀i

What is the behavior of the width of the margin (2/‖w‖) as C → 0?

Behaves like hard margin Goes to zero

Goes to infinity None of the above

(h) [3 pts] In Homework 4, you fit a logistic regression model on spam and ham data for a Kaggle Competition.
Assume you had a very good score on the public test set, but when the GSIs ran your model on a private test
set, your score dropped a lot. This is likely because you overfitted by submitting multiple times and changing
the following between submissions:

λ, your penalty term        ε, your convergence criterion

η, your step size        Fixing a random bug

(i) [0 pts] BONUS QUESTION (Answer this only if you have time and are confident of your other answers
because this is not extra points.)
We have constructed the multiple choice problems such that every false positive will incur some negative
penalty. For one of these multiple choice problems, given that there are p points, r correct answers, and k
choices, what is the formula for the penalty such that the expected value of random guessing is equal to 0?
(You may assume k > r)

Q3. [8 pts] Decision Theory
Consider the following generative model for a 2-class classification problem, in which the class conditionals are
Bernoulli distributions:

p(ω1) = π
p(ω2) = 1 − π

x | ω1 = 1 with probability 0.5,  0 with probability 0.5
x | ω2 = 1 with probability 0.5,  0 with probability 0.5

Assume the loss matrix

                           true class = 1    true class = 2
    predicted class = 1         0                 λ12
    predicted class = 2         λ21               0

(a) [8 pts] Give a condition in terms of λ12 , λ21 , and π that determines when class 1 should always be chosen as
the minimum-risk class.

Q4. [14 pts] Kernels
(a) [6 pts] Let k1 and k2 be (valid) kernels; that is, k1 (x, y) = Φ1 (x)T Φ1 (y) and k2 (x, y) = Φ2 (x)T Φ2 (y).
Show that k = k1 + k2 is a valid kernel by explicitly constructing a corresponding feature mapping Φ(z).

(b) [8 pts] The polynomial kernel is defined to be

k(x, y) = (xT y + c)d

where x, y ∈ Rn , and c ≥ 0. When we take d = 2, this kernel is called the quadratic kernel. Find the feature
mapping Φ(z) that corresponds to the quadratic kernel.
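As an informal sanity check on these kernel facts (not part of the original exam), the sketch below numerically verifies, for random vectors, that stacking two feature maps realizes k1 + k2, and that the quadratic kernel (xᵀy + c)² matches one standard explicit feature map built from all ordered products xi·xj, the terms √(2c)·xi, and the constant c; the variable names and numbers are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 4, 1.5
x, y = rng.normal(size=n), rng.normal(size=n)

# (a) If k1, k2 have feature maps Phi1, Phi2, stacking them gives k1 + k2.
phi1 = lambda v: v                       # linear kernel k1(x, y) = x.y
phi2 = lambda v: np.outer(v, v).ravel()  # kernel k2(x, y) = (x.y)^2
phi_sum = lambda v: np.concatenate([phi1(v), phi2(v)])
k1 = x @ y
k2 = (x @ y) ** 2
print(np.isclose(phi_sum(x) @ phi_sum(y), k1 + k2))            # True

# (b) Quadratic kernel (x.y + c)^2 against an explicit feature map:
# all ordered products x_i x_j, the terms sqrt(2c) x_i, and the constant c.
phi_quad = lambda v: np.concatenate([np.outer(v, v).ravel(),
                                     np.sqrt(2 * c) * v,
                                     [c]])
print(np.isclose(phi_quad(x) @ phi_quad(y), (x @ y + c) ** 2))  # True
```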

Q5. [8 pts] L2-Regularized Linear Regression with Newton's Method
Recall that the objective function for L2-regularized linear regression is

    J(w) = ‖Xw − y‖₂² + λ‖w‖₂²

where X is the design matrix (the rows of X are the data points).

The global minimizer of J is given by:

    w* = (XᵀX + λI)⁻¹ Xᵀy

(a) [8 pts] Consider running Newton’s method to minimize J.


Let w0 be an arbitrary initial guess for Newton’s method. Show that w1 , the value of the weights after one
Newton step, is equal to w∗ .
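For intuition (not an official solution): because J is quadratic, its Hessian is constant, so a single Newton step from any w0 should land exactly on w*. The NumPy sketch below checks this numerically on made-up data, using the gradient 2Xᵀ(Xw − y) + 2λw and Hessian 2XᵀX + 2λI.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 30, 4, 0.7
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_star = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)   # closed form

w0 = rng.normal(size=d)                          # arbitrary starting point
grad = 2 * X.T @ (X @ w0 - y) + 2 * lam * w0     # gradient of J at w0
hess = 2 * X.T @ X + 2 * lam * np.eye(d)         # constant Hessian of J
w1 = w0 - np.linalg.solve(hess, grad)            # one Newton step

print(np.allclose(w1, w_star))                   # True: one step suffices
```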

Q6. [8 pts] Maximum Likelihood Estimation
(a) [8 pts] Let x1 , x2 , . . . , xn be independent samples from the following distribution:

    P(x | θ) = θ x^(−θ−1),   where θ > 1, x ≥ 1

Find the maximum likelihood estimator of θ.
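If you want to check your derivation numerically (not required by the exam), the sketch below draws samples from this density by inverse-transform sampling (its CDF is 1 − x^(−θ) for x ≥ 1) and maximizes the log-likelihood over a grid of θ values; the grid, sample size, and true θ are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, n = 3.0, 5000
u = rng.uniform(size=n)
x = (1 - u) ** (-1 / theta_true)        # inverse-transform samples, CDF = 1 - x^(-theta)

def log_likelihood(theta):
    # sum_i [ log(theta) - (theta + 1) * log(x_i) ]
    return n * np.log(theta) - (theta + 1) * np.sum(np.log(x))

grid = np.linspace(1.01, 10, 2000)      # theta > 1
theta_hat = grid[np.argmax([log_likelihood(t) for t in grid])]
print(theta_hat)                        # should be close to theta_true
```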

Q7. [13 pts] Affine Transformations of Random Variables
Let X be a d-dimensional random vector with mean µ and covariance matrix Σ. Let Y = AX + b, where A is an
n × d matrix and b is an n-dimensional vector.

(a) [6 pts] Show that the mean of Y is Aµ + b.

(b) [7 pts] Show that the covariance matrix of Y is AΣAT .
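As a quick empirical check (not a proof), the sketch below draws many samples of X, applies Y = AX + b, and compares the sample mean and covariance of Y against Aµ + b and AΣAᵀ; the Gaussian sampling distribution and all numbers are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_out, N = 3, 2, 200_000
mu = np.array([1.0, -1.0, 2.0])
L = rng.normal(size=(d, d))
Sigma = L @ L.T                         # a valid (PSD) covariance matrix
A = rng.normal(size=(n_out, d))
b = np.array([0.5, -2.0])

X = rng.multivariate_normal(mu, Sigma, size=N)   # samples of X (Gaussian for convenience)
Y = X @ A.T + b                                  # Y = A X + b, applied row-wise

print(np.abs(Y.mean(axis=0) - (A @ mu + b)).max())               # ~ 0
print(np.abs(np.cov(Y, rowvar=False) - A @ Sigma @ A.T).max())   # small sampling error
```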

Q8. [15 pts] Generative Models
Consider a generative classification model for K classes defined by the following:

• Prior class probabilities: P (Ck ) = πk k = 1, . . . , K


• General class-conditional densities: P (x|Ck ) k = 1, . . . , K

Suppose we are given training data {(xn, yn)}, n = 1, . . . , N, drawn independently from this model.

The labels yi are “one-of-K” vectors; that is, K-dimensional vectors of all 0’s except for a single 1 at the element
corresponding to the class. For example, if K = 4 and the true label of xi is class 2, then

    yi = [0 1 0 0]ᵀ

(a) [5 pts] Write the log likelihood of the data set. You may use yij to denote the j th element of yi .

(b) [10 pts] What are the maximum likelihood estimates of the prior probabilities?
(Hint: Remember to use Lagrange multipliers!)
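For a concrete feel for the quantity in part (a) (our own toy setup, not exam material), the sketch below builds a tiny dataset with K = 2 classes, 1-D Gaussian class conditionals, and one-of-K labels, and evaluates the data log-likelihood Σᵢ Σⱼ yᵢⱼ (log πⱼ + log P(xᵢ | Cⱼ)) under a candidate model.

```python
import numpy as np
from scipy.stats import norm

# Tiny synthetic dataset (made up): K = 2 classes, 1-D features,
# Gaussian class conditionals, labels stored as one-of-K vectors.
rng = np.random.default_rng(4)
pis = np.array([0.3, 0.7])                       # class priors pi_k
means, sds = np.array([-1.0, 2.0]), np.array([1.0, 0.5])
classes = rng.choice(2, size=100, p=pis)
x = rng.normal(means[classes], sds[classes])
Y = np.eye(2)[classes]                           # one-of-K label matrix, Y[i, j] = y_ij

# Data log-likelihood: sum_i sum_j y_ij * ( log pi_j + log P(x_i | C_j) )
log_pxj = norm.logpdf(x[:, None], loc=means, scale=sds)   # shape (N, K)
log_lik = np.sum(Y * (np.log(pis) + log_pxj))
print(log_lik)
```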

Midterm exam CS 189/289, Fall 2015
• You have 80 minutes for the exam.
• Total 100 points:
  1. True/False: 36 points (18 questions, 2 points each).
  2. Multiple-choice questions: 24 points (8 questions, 3 points each).
  3. Three descriptive questions worth 10, 15, 15 points.
• The exam is closed book, closed notes except your one-page crib sheet.
• No calculators or electronic items.
• For true/false questions, fill in the True/False bubble.
• For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be more than one). NO PARTIAL CREDIT: all correct answers must be checked and no incorrect answers should be checked.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

For staff only


T/F /36
Multiple choice /24
Problem I /15
Problem II /15
Problem III /10
Total /100
Notation:
X: the training data matrix of dimension (N, d), of N rows representing samples
and d columns representing features.

x: an input data vector of dimension (1, d) of components xi, i=1:d.

xk: a training example of dimension (1, d) is a row of X, k=1:N.

w: weight vector of a linear model of dimension (1, d) such that

f(x) = w xᵀ = x wᵀ = Σ_{i=1:d} wi xi


y: target vector of dimension (N, 1) of components yk.

α: weight vector of dimension (N, 1) of a kernel method, f(x) = Σ_{k=1:N} αk k(x, xk)
k(u, v): a kernel function (a similarity measure between two samples u and v).
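To make the two prediction forms above concrete (our own illustrative sketch, not exam material), here is a minimal implementation of the linear model f(x) = x wᵀ and of a kernel expansion f(x) = Σₖ αₖ k(x, xₖ) with an RBF kernel; all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 20, 4
X = rng.normal(size=(N, d))       # training matrix, rows are samples x_k
w = rng.normal(size=d)            # weights of a linear model
alpha = rng.normal(size=N)        # weights of a kernel expansion
x = rng.normal(size=d)            # a new input vector

def f_linear(x):
    # f(x) = x w^T = sum_i w_i x_i
    return x @ w

def rbf(u, v, gamma=0.5):
    # k(u, v) = exp(-gamma * ||u - v||^2), a similarity between samples u and v
    return np.exp(-gamma * np.sum((u - v) ** 2))

def f_kernel(x):
    # f(x) = sum_{k=1..N} alpha_k k(x, x_k)
    return sum(alpha[k] * rbf(x, X[k]) for k in range(N))

print(f_linear(x), f_kernel(x))
```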

True/False (36 points):


1. Stochastic gradient descent performs less computation per update than batch
gradient descent.
TRUE FALSE

2. A function is convex if its Hessian is negative semidefinite.


TRUE FALSE

3. If N < d, the solution to Xwᵀ = y is unique.
TRUE FALSE

4. A support vector machine computes P(y|x).
TRUE FALSE

5. Adding a ridge to XᵀX guarantees that it is invertible.
TRUE FALSE

6. Grid search is less prone to being trapped in a local minimum than other
heuristic search methods.
TRUE FALSE

7. The bootstrap method involves sampling without replacement.


TRUE FALSE

8. A non linearly-separable training set in a given feature space can always be


made linearly-separable in another space.
TRUE FALSE

9. Using the kernel trick, one can get non-linear decision boundaries using
algorithms designed originally for linear models.
TRUE FALSE

10. Logistic regression cannot be kernelized.


TRUE FALSE

11. Ridge regression, weight decay, and Gaussian processes use the same
regularizer: ‖w‖².
TRUE FALSE

12. Hebb’s rule computes the centroid method solution, if the target values are
+1/N1 and -1/N0 (N1 and N0 are the number of examples of each class)
TRUE FALSE
13. Any kernel method can be thought of as a parametric method in a possibly
infinite dimensional space.
TRUE FALSE

14. Nearest neighbors is a parametric method.


TRUE FALSE

15. A symmetric matrix is positive semidefinite if all its eigenvalues are positive or
zero.
TRUE FALSE

16. Zero correlation between any two random variables implies that the two
random variables are independent.
TRUE FALSE

17. The Linear Discriminant Analysis (LDA) classifier computes the direction
maximizing the ratio of between-class variance over within-class variance.
TRUE FALSE

18. If we repeat an experiment twice and get p-values p1 and p2, the minimum of
the two p-values is the p-value of the overall experiment.
TRUE FALSE
Multiple choice questions (24 points)

1. You trained a binary classifier model which gives very high accuracy on the
training data, but much lower accuracy on validation data. The following may
be true:
o This is an instance of overfitting.
o This is an instance of underfitting.
o The training was not well regularized.
o The training and testing examples are sampled from different
distributions.

2. Ockham, in the 14th century, is credited with having stated that one should “shave
off unnecessary parameters of a model”. Which of the following implement
that principle:
o Regularization.
o Maximum likelihood estimation.
o Shrinkage.
o Empirical risk minimization.
o Feature selection.
3. Good practices to avoid overfitting include:
o Using a two part cost function which includes a regularizer to penalize
model complexity.
o Using a good optimizer to minimize error on training data.
o Building a structure of nested subsets of models and train learning
machines in each subset, starting from the inner subset, and stopping
when the cross-validation error starts increasing.
o Discarding 50% of randomly chosen samples.
4. Wrapper methods are hyper-parameter selection methods that:
o Should be used whenever possible because they are computationally
efficient.
o Should be avoided unless there are no other options because they are
always prone to overfitting.
o Are useful mainly when the learning machines are “black boxes”.
o Should be avoided altogether.
5. Three different classifiers are trained on the same data. Their decision
boundaries are shown below. Which of the following statements are true?

o The leftmost classifier has high robustness, poor fit.


o The leftmost classifier has poor robustness, high fit.
o The rightmost classifier has poor robustness, high fit.
o The rightmost classifier has high robustness, poor fit.

6. What are support vectors:


o The examples farthest from the decision boundary.
o The only examples necessary to compute f(x) in an SVM.
o The class centroids.
o All the examples that have a non-zero weight αk in an SVM.
7. Which of the following does not converge to a solution if the training samples
are not linearly separable?
o Linear Logistic Regression.
o Linear Soft margin SVM.
o Linear hard-margin SVM.
o The centroid method.
o Parzen windows.
8. The number of test examples needed to get statistically significant results
should be:
o Larger if the error rate is larger.
o Larger if the error rate is smaller.
o It does not matter.
Three descriptive problems
Problem I: Gradient descent (15 points).

Given N training data points {(xk, yk)}, k = 1:N, with xk in R^d and labels yk in {−1, 1}, we
seek a linear discriminant function f(x) = w·x optimizing the loss function L(z) = e^(−z),
for z = y f(x).

Question I.1 (3 points) Is L(z) a large margin loss function? Justify your answer (a
graphical justification may be useful).
Question I.2 (4 points) Derive the stochastic gradient descent update w ← w + Δw
for L(z), where Δw is the difference between two consecutive values of w:

Question I.3 (3 points) We call Remp(w) = Σ_{k=1:N} L(zk), where zk = yk f(xk), the
“empirical risk”. Derive the batch gradient update for the empirical risk:
Question I.4 (3 points) Suppose you also want to include a penalty term λ ‖w‖² in
the risk functional that you wish to minimize. Derive the batch gradient update
for the regularized risk Rreg(w) = Remp(w) + λ ‖w‖²:

Question I.5 (2 points) How do you estimate λ (answer in at most 3 words)?
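As a companion to Problem I (our own sketch, not an official solution), here is what a stochastic gradient descent loop looks like for the exponential loss L(z) = e^(−z) with z = y w·x, using the per-sample gradient ∂L/∂w = −y x e^(−y w·x); the data, learning rate, and epoch count are made up.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d, eta = 100, 3, 0.1
X = rng.normal(size=(N, d))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N))  # labels in {-1, +1}

w = np.zeros(d)
for epoch in range(20):
    for k in rng.permutation(N):
        z = y[k] * (w @ X[k])              # z = y f(x) with f(x) = w . x
        # dL/dw = -y x e^{-z}, so the SGD step is w <- w + eta * y x e^{-z}
        w += eta * y[k] * X[k] * np.exp(-z)

print(np.mean(np.sign(X @ w) == y))        # training accuracy of the learned w
```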


Problem II. Classification concept review (15 points).

Question II.1. Centroid method. Now consider a 2-class classification problem in a 2-


dimensional feature space x=[x1, x2] with target variable y=±1. The training data comprises 7
samples as shown in Figure 1 (4 black diamonds for the positive class and 3 white diamonds for
the negative class). The 7 samples are also numbered for your reference.

[Figure 1: Data for Problem II.1 (centroid method) — the 7 numbered training samples plotted in the (x1, x2) plane: 4 black diamonds (positive class) and 3 white diamonds (negative class).]

Question II.1.A (2 points): Draw on Figure 1 the centroids of the two classes (mark them with a
circled “+” for the positive class and a circled “-“ for the negative class). Join the centroids with
a thick dashed line. Draw the decision boundary of the centroid method with a thick solid line.

Question II.1.B (1 point) What is the training error rate?

Question II.1.C (2 points) Is there any sample such that upon its removal, the decision boundary
changes in a manner that the removed sample goes to the other side (Answer “yes” or “no”)?

Question II.1.D (2 points) What is the leave-one-out error rate?


Question II. 2: Support Vector Machine (SVM). Consider again the same training data as in
Question II.1, replicated in Figure 2, for your convenience. The “maximum margin classifier”
(also called linear “hard margin” SVM) is a classifier that leaves the largest possible margin on
either side of the decision boundary. The samples lying on the margin are called support
vectors.
[Figure 2: Data for Problem II.2 (SVM method) — the same 7 training samples as in Figure 1.]

Question II.2.A (2 points): Draw on Figure 2 the decision boundary obtained by the linear hard
margin SVM method with a thick solid line. Draw the margins on either side with thinner
dashed lines. Circle the support vectors.

Question II.2.B (1 point) What is the training error rate?

Question II.2.C (1 point) The removal of which sample will change the decision boundary?

Question II.2.D (2 points) What is the leave-one-out error rate?

Question II.2.E (1 point) A method is more robust if the difference between training error and
leave-one-out error is smaller. Which of the two methods (centroid or SVM) is more robust?

Question II.2.F (1 point) A method has a better fit if it has a smaller training error. Which of the
two methods has the better fit?
Problem III. Newton-Raphson for least-square regression (10 points)
[Hard problem, attempt only if you have time.]

In this problem, we will derive an optimization algorithm which we did not study
in class, called the Newton-Raphson algorithm. The algorithm makes updates in a
manner that often allows reaching the solution faster than regular gradient
descent.

Suppose we start with an initial value of a (1, d) vector w that we call w(0). We
know that the first order Taylor approximation of ∇wR(w(1)), at the point w(0) is:

∇wR(w(1)) = ∇wR(w(0)) + (w(1) − w(0)) ∇w²R(w(0))

Question III.1 (3 points). We want to minimize R(w(1)) using this approximation of
∇wR(w(1)). Find the update equation for the value of w(1). This is called the
Newton-Raphson update. Notes: This is not a trick question; you just have to
solve for w(1) after setting ∇wR(w(1)) to 0. You can assume that the (d, d) Hessian
matrix ∇w²R(w(0)) is invertible.
Question III.2 (4 points). Consider now the linear regression problem: We are
given a training data matrix X of dim (N, d) and a target vector y of dim (N, 1) and
want to find a weight vector w of dim (1, d) such that f(x) = x wᵀ approximates y
best, in the least-squares sense. The risk functional is: R(w) = (Xwᵀ − y)ᵀ(Xwᵀ − y). We
will assume that we are in the “regression case” N > d and that the Hessian is
invertible. Find the Newton-Raphson update for w(1).

Question III.3 (3 points). Recall the solution to the problem we found in class
using the normal equations or the solution found by solving for ∇wR(w) = 0
directly. Compare with the solution obtained in question (2). How many iterations
of the Newton-Raphson update do we need to perform for linear regression?
CS 189 Introduction to Machine Learning
Spring 2015 Midterm
• You have 80 minutes for the exam.
• The exam is closed book, closed notes except your one-page crib sheet.

• No calculators or electronic items.


• Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a
brief explanation.
• For true/false questions, fill in the True/False bubble.

• For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be
more than one). We have introduced a negative penalty for false positives for the multiple choice questions
such that the expected value of randomly guessing is 0. Don’t worry, for this section, your score will be the
maximum of your score and 0, thus you cannot incur a negative score for this section.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

For staff use only:


Q1. True or False /26
Q2. Multiple Choice /36
Q3. Parameter Estimation /10
Q4. Dual Solution for Ridge Regression /8
Q5. Regularization and Priors for Linear Regression /8
Total /88

Q1. [26 pts] True or False
(a) [2 pts] If the data is not linearly separable, then there is no solution to the hard-margin SVM.
True False

(b) [2 pts] Logistic Regression can be used for classification.


True False

(c) [2 pts] In logistic regression, two ways to prevent β vectors from getting too large are using a small step size
and using a small regularization value.
True False

(d) [2 pts] The L2 norm is often used because it produces sparse results, as opposed to the L1 norm which does
not.
True False

(e) [2 pts] For a Multivariate Gaussian, the eigenvalues of the covariance matrix are inversely proportional to the
lengths of the ellipsoid axes that determine the isocontours of the density.
True False

(f ) [2 pts] In a generative binary classification model where we assume the class conditionals are distributed as
Poisson, and the class priors are Bernoulli, the posterior assumes a logistic form.
True False

(g) [2 pts] Maximum likelihood estimation gives us not only a point estimate, but a distribution over the parameters
that we are estimating.
True False

(h) [2 pts] Penalized maximum likelihood estimators and Bayesian estimators for parameters are better used in
the setting of low-dimensional data with many training examples as opposed to the setting of high-dimensional
data with few training examples.
True False

(i) [2 pts] It is not a good machine learning practice to use the test set to help adjust the hyperparameters of your
learning algorithm.
True False

(j) [2 pts] A symmetric positive semi-definite matrix always has nonnegative elements.
True False

(k) [2 pts] For a valid kernel function K, the corresponding feature mapping φ can map a finite dimensional vector
into an infinite dimensional vector.
True False

(l) [2 pts] The more features that we use to represent our data, the better the learning algorithm will generalize
to new data points.
True False

(m) [2 pts] A discriminative classifier explicitly models P (Y |X)


True False

Q2. [36 pts] Multiple Choice
(a) [3 pts] Which of the following algorithms can you use kernels with?

Support Vector Machines None of the above

Perceptrons

(b) [3 pts] Cross validation:

Is often used to select hyperparameters Does nothing to prevent overfitting

Is guaranteed to prevent overfitting None of the above

(c) [3 pts] In linear regression, L2 regularisation is equivalent to imposing a:

Logistic prior Laplace prior

Gaussian prior Gaussian class-conditional

(d) [3 pts] Say we have two 2-dimensional Gaussian distributions representing two different classes. Which of the
following conditions will result in a linear decision boundary:

Same mean for both classes Different covariance matrix for each class

Same covariance matrix for both classes Linearly separable data

(e) [3 pts] The normal equations can be derived from:

Minimizing empirical risk

Assuming that Y = βᵀx + ε, where ε ∼ N(0, σ²).

Assuming that P(Y|X = x) is distributed normally with mean βᵀx and variance σ²

Finding a linear combination of the rows of the design matrix that minimizes the distance to our
vector of labels Y

(f ) [3 pts] Logistic regression can be motivated from:

Generative models with uniform class conditionals        Log odds being equated to an affine function of x

Generative models with Gaussian class conditionals        None of the above

(g) [3 pts] The perceptron algorithm will converge:

If the data is linearly separable As long as you initialize θ to all 0’s

Even if the data is linearly inseparable Always

(h) [3 pts] Which of the following is true:

Newton’s Method typically is more expensive to calculate than gradient descent, per iteration

For quadratic equations, Newton’s Method typically requires fewer iterations than gradient descent

Gradient descent can be viewed as iteratively reweighted least squares

None of the above

(i) [3 pts] Which of the following statements about duality and SVMs is (are) true?

Complementary slackness implies that every training point that is misclassified by a soft-margin SVM
is a support vector.

When we solve the SVM with the dual problem, we need only the dot product of xi , xj for all i, j,
and no other information about the xi .

We use Lagrange multipliers in an optimization problem with inequality (≤) constraints.

None of the above

(j) [3 pts] Which of the following distance metrics can be computed exclusively with inner products, assuming
Φ(x) and Φ(y) are feature mappings of x and y, respectively?

Φ(x) − Φ(y)        ‖Φ(x) − Φ(y)‖₂².

‖Φ(x) − Φ(y)‖₁        None of the above

(k) [3 pts] Strong duality holds for:

Hard Margin SVM Constrained optimization problems in general

Soft Margin SVM None of the above

(l) [3 pts] Which of the following facts about the ’C’ in SVMs is (are) true?

As C approaches 0, the soft margin SVM is equal to the hard margin SVM        A larger C tends to create a larger margin

C can be negative, as long as each of the slack variables is nonnegative        None of the above

Q3. [10 pts] Parameter Estimation
In this problem, we have n trials with k possible types of outcomes {1, 2, ..., k}. Suppose we observe X1 , ..., Xk where
each Xi is the number of outcomes of type i. If pi refers to the probability that a trial has outcome i, then (X1 , ..., Xk )
is said to have a multinomial distribution with parameters p1 , ..., pk , denoted (X1 , ..., Xk ) ∼ Multinomial(p1 , ..., pk ).
It may be useful to know that the probability mass function of the multinomial distribution is given as follows:

    P(X1 = x1, . . . , Xk = xk) = [n! / (x1! x2! · · · xk!)] · p1^(x1) · p2^(x2) · · · pk^(xk)
We want to find the maximum likelihood estimators for p1 , ..., pk . You may assume that pi > 0 for all i.

(a) [4 pts] What is the log-likelihood function, l(p1 , ..., pk |X1 , ..., Xk )?

(b) [6 pts] You might notice that unconstrained maximization of this function leads to an answer in which we set
each pi = ∞. But this is wrong. We must add a constraint such that the probabilities sum up to 1. Now, we
have the following optimization problem.

    max_{p1,...,pk}  l(p1, . . . , pk | X1, . . . , Xk)
    s.t.  Σ_{i=1}^{k} pi = 1

Recall that we can use the method of Lagrange multipliers to solve an optimization problem with equality
constraints. Using this method, find the maximum likelihood estimators for p1 , ..., pk .
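If you want to sanity-check a candidate estimator numerically (not part of the exam), the sketch below evaluates the multinomial log-likelihood with NumPy/SciPy for some made-up observed counts and compares a few candidate probability vectors.

```python
import numpy as np
from scipy.special import gammaln

x = np.array([12, 30, 8])                 # observed counts X_1, ..., X_k (made up)
n = x.sum()

def log_likelihood(p):
    # log of  n!/(x_1! ... x_k!) * prod_i p_i^{x_i}, using gammaln(m+1) = log(m!)
    return gammaln(n + 1) - gammaln(x + 1).sum() + np.sum(x * np.log(p))

candidates = [np.array([1/3, 1/3, 1/3]),
              x / n,                      # empirical frequencies
              np.array([0.5, 0.4, 0.1])]
for p in candidates:
    print(p, log_likelihood(p))
```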

Q4. [8 pts] Dual Solution for Ridge Regression
Recall that ridge regression minimizes the objective function:

    L(w) = ‖Xw − y‖₂² + λ‖w‖₂²

where X is an n-by-d design matrix, w is a d-dimensional vector and y is a n-dimensional vector. We already know
that the function L(w) is minimized by

    w* = (XᵀX + λI)⁻¹ Xᵀy.
Alternatively, the minimizer can be represented by a linear combination of the design matrix rows. That is, there
exists a n-dimensional vector α∗ such that the objective function L(w) is minimized by w∗ = X T α∗ . The vector α∗
is called the dual solution to the linear regression problem.

(a) [2 pts] Using the relation w = X T α, define the objective function L in terms of α.

(b) [3 pts] Show that α∗ = (XX T + λI)−1 y is a dual solution.

(c) [3 pts] To make the solution in question (b) well-defined, the matrix XX T + λI has to be an invertible matrix.
Assuming λ > 0, show that XX T + λI is an invertible matrix. (Hint: positive definite matrices are invertible)
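As an empirical check of the claim in (b) (illustration only, not a proof), the sketch below verifies on random data that the dual form w = Xᵀ(XXᵀ + λI)⁻¹y coincides with the primal solution (XᵀX + λI)⁻¹Xᵀy.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, lam = 15, 6, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)     # dual solution candidate
w_dual = X.T @ alpha                                      # w = X^T alpha

print(np.allclose(w_primal, w_dual))                      # True
```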

Q5. [8 pts] Regularization and Priors for Linear Regression
Linear regression is a model of the form P(y|x) ∼ N(wᵀx, σ²), where w is a d-dimensional vector. Recall that in
ridge regression, we add an ℓ2 regularization term to our least squares objective function to prevent overfitting, so
that our loss function becomes:

    J(w) = Σ_{i=1}^{n} (Yi − wᵀXi)² + λwᵀw        (*)

We can arrive at the same objective function in a Bayesian setting, if we consider a MAP (maximum a posteriori
probability) estimate, where w has the prior distribution N (0, f (λ, σ)I).

(a) [3 pts] What is the conditional density of w given the data?

(b) [5 pts] What f (λ, σ) makes this MAP estimate the same as the solution to (*)?

CS 189 Introduction to Machine Learning
Spring 2016 Midterm
• Please do not open the exam before you are instructed to do so.
• The exam is closed book, closed notes except your one-page cheat sheet.

• Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops.
Turn your cell phone off and leave all electronics at the front of the room, or risk getting a zero on
the exam.
• You have 1 hour and 20 minutes.

• Please write your initials at the top right of each odd-numbered page (e.g., write “JS” if you are Jonathan
Shewchuk). Finish this by the end of your 1 hour and 20 minutes.
• Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.
• The total number of points is 100. There are 20 multiple choice questions worth 3 points each, and 3 written
questions worth a total of 40 points.

• For multiple-choice questions, fill in the bubbles for ALL correct choices: there may be more than one correct
choice, but there is always at least one correct choice. NO partial credit on multiple-choice questions: the
set of all correct answers must be checked.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

Q1. [60 pts] Multiple Choice
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at
least one correct choice. NO partial credit: the set of all correct answers must be checked.

(a) [3 pts] Which of the following learning algorithms will return a classifier if the training data is not linearly
separable?

Hard-margin SVM Perceptron

Soft-margin SVM Linear Discriminant Analysis (LDA)

(b) [3 pts] With a soft-margin SVM, which samples will have non-zero slack variables ξi ?

All misclassified samples All samples lying on the margin boundary

All samples inside the margin All samples outside the margin

(c) [3 pts] Recall the soft-margin SVM objective function |w|² + C Σᵢ ξi. Which value of C is most likely to overfit
the training data?

C = 0.01 C = 0.00001

C=1 C = 100

(d) [3 pts] There are several ways to formulate the hard-margin SVM. Consider a formulation in which we try to
directly maximize the margin β. The training samples are X1 , X2 , . . . , Xn and their labels are y1 , y2 , . . . , yn .
Which constraints should we impose to get a correct SVM? (Hint: Recall the formula for the distance from a
point to a hyperplane.) Maximize β subject to . . .

yi Xiᵀw ≤ β  ∀i ∈ [1, n].        |w| ≥ 1.

yi Xiᵀw ≥ β  ∀i ∈ [1, n].        |w| = 1.

(e) [3 pts] In the homework, you trained classifiers on the digits dataset. The features were the pixels in each
image. What features could you add that would improve the performance of your classifier?

Maximum pixel intensity Number of enclosed regions

Average pixel intensity Presence of a long horizontal line

(f ) [3 pts] The Bayes risk for a decision problem is zero when

the class distributions P(X|Y) do not overlap.        the loss function L(z, y) is symmetrical.

the training data is linearly separable.        the Bayes decision rule perfectly classifies the training data.

(g) [3 pts] Let L(z, y) be a loss function (where y is the true class and z is the predicted class). Which of the
following loss functions will always lead to the same Bayes decision rule as L?

L1 (z, y) = aL(z, y), a > 0 L3 (z, y) = L(z, y) + b, b > 0

L2 (z, y) = aL(z, y), a < 0 L4 (z, y) = L(z, y) + b, b < 0

(h) [3 pts] Gaussian discriminant analysis

models P(Y = y|X) as a Gaussian.        is an example of a generative model.

models P(Y = y|X) as a logistic function.        can be used to classify points without ever computing an exponential.

(i) [3 pts] Which of the following are valid covariance matrices?

    A = [[1, 1], [−1, 1]]        C = [[0, 1], [1, 2]]
    B = [[1, −1], [−1, 2]]        D = [[1, 1], [1, 1]]

(j) [3 pts] Consider a d-dimensional multivariate normal distribution that is isotropic (i.e., its isosurfaces are
spheres). Let Σ be its d × d covariance matrix. Let I be the d × d identity matrix. Let σ be the standard
deviation of any one component (feature). Then

    Σ = σI.        Σ = (1/σ) I.
    Σ = σ² I.        Σ = (1/σ²) I.
    None of the above.

(k) [3 pts] In least-squares linear regression, imposing a Gaussian prior on the weights is equivalent to

logistic regression L2 regularization

adding a Laplace-distributed penalty term L1 regularization

(l) [3 pts] Logistic regression

    assumes that we impose a Gaussian prior on the weights.        has a closed-form solution.

    minimizes a convex cost function.        can be used with a polynomial kernel.

(m) [3 pts] Ridge regression

    is more sensitive to outliers than ordinary least-squares.        adds an L1-norm penalty to the cost function.

    reduces variance at the expense of higher bias.        often sets several of the weights to zero.

(n) [3 pts] Given a design matrix X ∈ R^(n×d) and labels y ∈ R^n, which of the following techniques could potentially
decrease the empirical risk on the training data (assuming the loss is the squared error)?

    Adding the feature “1” to each data point.        Centering the vector y by subtracting the mean ȳ from each component yi.

    Adding polynomial features to each data point.        Penalizing the model weights with L2 regularization.

(o) [3 pts] In terms of the bias-variance trade-off, which of the following is/are substantially more harmful to the
test error than the training error?

Bias Loss

Variance Risk

(p) [3 pts] Consider the bias-variance trade-off in fitting least-squares surfaces to two data sets. The first is US
census data, in which we want to estimate household income from the other variables. The second is synthetic
data we generated by writing a program that randomly creates samples from a known normal distribution, and
assigns them y-values on a known smooth surface y = f (x) plus noise drawn from a known normal distribution.
We can compute or estimate with high accuracy

    the bias component of the empirical risk for the US census data.        the bias component of the empirical risk for the synthetic data.

    the variance component of the empirical risk for the US census data.        the variance component of the empirical risk for the synthetic data.

(q) [3 pts] The kernel trick

    is necessary if we want to add polynomial features to a learning algorithm.        can improve the speed of high-degree polynomial regression.

    can be applied to any learning algorithm.        can improve the speed of learning algorithms when the number of samples is very large.

(r) [3 pts] In the usual formulation of soft-margin SVMs, each training sample has a slack variable ξi ≥ 0 and
we impose a regularization cost C Σᵢ ξi. Consider an alternative formulation where we impose the additional
constraints ξi = ξj for all i, j. How does the minimum objective value |w|² + C Σᵢ ξi obtained by the new
method compare to the one obtained by the original soft-margin SVM?

    They are always equal.        Original SVM minimum ≥ new minimum.

    New minimum ≥ original SVM minimum.        New minimum is sometimes larger and sometimes smaller.

(s) [3 pts] In Gaussian discriminant analysis, if two classes come from Gaussian distributions that have different
means, may or may not have different covariance matrices, and may or may not have different priors, which
decision boundary shapes are possible?

    a hyperplane        a surface that is not a quadric

    a nonlinear quadric surface (quadric = the isosurface of a quadratic function)        the empty set (the classifier always returns the same class)

(t) [3 pts] Let the class conditionals be given by P(X|Y = i) ∼ N(0, Σi), where i ∈ {0, 1}, Σ0 = [[a, 0], [0, b]]
and Σ1 = [[b, 0], [0, a]] with a, b > 0, a ≠ b. Both conditionals have mean zero, and both classes have the prior
probability P(Y = 0) = P(Y = 1) = 0.5. What is the shape of the decision boundary?

    a line        multiple lines

    a nonlinear quadratic curve        not defined

Q2. [15 pts] Quadratics and Gaussian Isocontours
(a) [4 pts] Write the 2 × 2 matrix Σ whose unit eigenvectors are [1/√5, 2/√5]ᵀ with eigenvalue 1 and [−2/√5, 1/√5]ᵀ
with eigenvalue 4. Write out both the eigendecomposition of Σ and the final 2 × 2 matrix Σ.

(b) [3 pts] Write the symmetric square root Σ^(1/2) of Σ. (The eigendecomposition is optional, but it might earn you
partial credit if you get Σ^(1/2) wrong.)

(c) [3 pts] Consider the bivariate Gaussian distribution X ∼ N(µ, Σ). Let P(X = x) be its probability distribution
function (PDF). Write the formula for the isocontour P(x) = e^(−5/2)/(4π), substitute in the value of the
determinant |Σ| from part (a) (but leave µ and Σ⁻¹ as variables), and simplify the formula as much as you can.

(d) [5 pts] Draw the isocontour P(x) = e^(−5/2)/(4π), where µ = [0, 2]ᵀ and Σ is given in part (a).

[A blank grid is provided on the exam, with x and y axes ranging from −6 to 6.]

Q3. [15 pts] Linear Regression
Recall that if we model our input data as linear plus Gaussian noise in the y-values, Y | x ∼ N(wᵀx, σ²), then the
maximum likelihood estimator is the w that minimizes the residual sum of squares Σ_{i=1}^{n} (Xiᵀw − yi)², where the
training samples are X1, X2, . . . , Xn and their labels are y1, y2, . . . , yn.

Let’s model noise with a Laplace distribution instead of a normal distribution. The probability density function
(PDF) of Laplace(µ, b) is

    P(y) = (1/(2b)) exp(−|y − µ|/b).

(a) [6 pts] Show that if we model our input data as a line plus Laplacian noise in the y-values, i.e.

    Y | x ∼ Laplace(wᵀx, b),

then the maximum likelihood estimator is the w that minimizes the sum of absolute residuals

    Σ_{i=1}^{n} |Xiᵀw − yi|.

(b) [6 pts] Derive the batch gradient descent rule for minimizing the sum of absolute residuals. (Hint: You will
probably need “if” statements or equations with conditionals because of the absolute value operators in the
cost function. Don’t worry about points where the gradient is undefined.)

(c) [3 pts] Why might we prefer to minimize the sum of absolute residuals instead of the residual sum of squares
for some data sets? (Hint: What is one of the flaws of least-squares regression?)
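For a concrete feel for part (b) (our own sketch, not an official solution), here is a batch (sub)gradient descent loop for the sum of absolute residuals; away from the kinks, the gradient of Σᵢ |Xᵢᵀw − yᵢ| is Xᵀ sign(Xw − y), which is what the loop uses. The step size and data are made up.

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, eta = 200, 3, 1e-3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.laplace(scale=0.3, size=n)   # Laplacian noise in the labels

w = np.zeros(d)
for it in range(2000):
    residuals = X @ w - y
    grad = X.T @ np.sign(residuals)               # subgradient of sum_i |X_i.w - y_i|
    w -= eta * grad                               # batch (sub)gradient step

print(w)                                          # roughly approaches w_true
```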

Q4. [10 pts] Discriminant Analysis
Let’s derive the decision boundary when one class is Gaussian and the other class is exponential. Our feature space
is one-dimensional (d = 1), so the decision boundary is a small set of points.

We have two classes, named N for normal and E for exponential. For the former class (Y = N), the prior probability
is πN = P(Y = N) = √(2π)/(1 + √(2π)) and the class conditional P(X|Y = N) has the normal distribution N(0, σ²). For the
latter, the prior probability is πE = P(Y = E) = 1/(1 + √(2π)) and the class conditional has the exponential distribution

    P(X = x|Y = E) = λe^(−λx)  if x ≥ 0,   and 0 if x < 0.

Write an equation in x for the decision boundary. (Only the positive solutions of your equation will be relevant;
ignore all x < 0.) Use the 0-1 loss function. Simplify the equation until it is quadratic in x. (You don’t need to solve
the quadratic equation. It should contain the constants σ and λ. Ignore the fact that 0 might or might not also be
a point in the decision boundary.) Show your work, starting from the posterior probabilities.

CS 189 Introduction to Machine Learning
Spring 2017 Midterm
• Please do not open the exam before you are instructed to do so.
• The exam is closed book, closed notes except your one-page cheat sheet.

• Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops. Turn your
cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
• You have 1 hour and 20 minutes.
• Please write your initials at the top right of each page after this one (e.g., write “JS” if you are Jonathan Shewchuk).
Finish this by the end of your 1 hour and 20 minutes.
• Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.
• The total number of points is 100. There are 20 multiple choice questions worth 3 points each, and 4 written questions
worth a total of 40 points.

• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice,
but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct
answers must be checked.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

Q1. [60 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.

(a) [3 pts] For a nonconvex cost function J, which of the following step sizes guarantee that batch gradient descent will
converge to the global optimum? Let i denote the ith iteration.

    ε = 10⁻²        ε = 1/∇²J
    ε = 10⁻ⁱ        None of the above

(b) [3 pts] Which of the following optimization algorithms attains the optimum of an unconstrained, quadratic, convex cost
function in the fewest iterations?

Batch gradient descent Newton’s method

Stochastic gradient descent The simplex method

(c) [3 pts] You train a linear classifier on 10,000 training points and discover that the training accuracy is only 67%. Which
of the following, done in isolation, has a good chance of improving your training accuracy?

Add novel features Use linear regression

Train on more data Train on less data

(d) [3 pts] You train a classifier on 10,000 training points and obtain a training accuracy of 99%. However, when you submit
to Kaggle, your accuracy is only 67%. Which of the following, done in isolation, has a good chance of improving your
performance on Kaggle?

Set your regularization value (λ) to 0 Use validation to tune your hyperparameters

Train on more data Train on less data

(e) [3 pts] You are trying to improve your Kaggle score for the spam dataset, but you must use logistic regression with no
regularization. So, you decide to extract some additional features from the emails, but you forget to normalize your new
features. You find that your Kaggle score goes down. Why might this happen?

    The new features make the sample points linearly separable        The new features have significantly more noise and larger variances than the old features

    The new features are uncorrelated with the emails being HAM or SPAM        The new features are linear combinations of the old features

(f) [3 pts] In a soft-margin support vector machine, if we increase C, which of the following are likely to happen?

The margin will grow wider Most nonzero slack variables will shrink

There will be more points inside the margin The norm |w| will grow larger

(g) [3 pts] If a hard-margin support vector machine tries to minimize |w|² subject to yi(Xi · w + α) ≥ 2 instead, what will be
the width of the slab (the point-free region bracketing the decision boundary)?

    1/‖w‖        4/‖w‖
    2/‖w‖        1/(2‖w‖)

(h) [3 pts] There is a 50% chance of rain on Saturday and a 30% chance of rain on Sunday. However, it is twice as likely to
rain on Sunday if it rains on Saturday than if it does not rain on Saturday. What is the probability it rains on neither of
the days?

15% 40%

25% 45%

(i) [3 pts] The Bayes risk for a decision problem is zero when

    the training data is linearly separable after lifting it to a higher-dimensional space.        the Bayes decision rule perfectly classifies the training data.

    the class distributions P(X|Y) do not overlap.        the prior probability for one class is 1.

(j) [3 pts] Consider using a Bayes decision rule classifier in a preliminary screen for cancer patients, as in Lecture 6. We
want to reduce the probability that someone is classified as cancer-free when they do, in fact, have cancer. On the ROC
curve for the classifier, an asymmetric loss function that implements this strategy

    Picks a point on the curve with higher sensitivity than the 0-1 loss function.        Picks a point on the curve that’s closer to the y-axis than the 0-1 loss function.

    Picks a point on the curve with higher specificity than the 0-1 loss function.        Picks a point on the curve that’s further from the x-axis than the 0-1 loss function.

(k) [3 pts] For which of the following cases are the scalar random variables X1 and X2 guaranteed to be independent?

    X1 ∼ N(0, 1) and X2 ∼ N(0, 1).        E[(X1 − E[X1])(X2 − E[X2])] = −1

    Cov(X1, X2) = 0 and [X1 X2]ᵀ has a multivariate normal distribution.        [X1 X2]ᵀ ∼ N( [1, 3]ᵀ, [[3, 0], [0, 7]] )

(l) [3 pts] Given X ∼ N(0, Σ) where the precision matrix Σ⁻¹ has eigenvalues λi for i = 1, . . . , d, the isocontours of the
probability density function for X are ellipsoids whose relative axis lengths are

    λi        √λi
    1/λi        1/√λi

(m) [3 pts] In LDA/QDA, what are the effects of modifying the sample covariance matrix as Σ̃ = (1 − λ)Σ + λI, where
0 < λ < 1?

    Σ̃ is positive definite        Σ̃ is invertible

    Increases the eigenvalues of Σ by λ        The isocontours of the quadratic form of Σ̃ are closer to spherical

(n) [3 pts] Let w* be the solution you obtain in standard least-squares linear regression. What solution do you obtain if you
scale all the input features (but not the labels y) by a factor of c before doing the regression?

    (1/c) w*        c w*
    (1/c²) w*        c² w*

(o) [3 pts] In least-squares linear regression, adding a regularization term can

increase training error. increase validation error.

decrease training error. decrease validation error.

(p) [3 pts] You have a design matrix X ∈ R^(n×d) with d = 100,000 features and a vector y ∈ R^n of binary 0-1 labels. When
you fit a logistic regression model to your design matrix, your test error is much worse than your training error. You
suspect that many of the features are useless and are therefore causing overfitting. What are some ways to eliminate the
useless features?

    Use ℓ1 regularization.        Use ℓ2 regularization.

    Iterate over features; check if removing feature i increases validation error; remove it if not.        If the ith eigenvalue λi of the sample covariance matrix is 0, remove the ith feature/column.

(q) [3 pts] Recall the data model, yi = f(Xi) + εi, that justifies the least-squares cost function in regression. The statistical
assumptions of this model are, for all i,

    εi comes from a Gaussian distribution.        all εi have the same mean
    all yi have the same mean        all yi have the same variance

(r) [3 pts] How does ridge regression compare to linear regression with respect to the bias-variance tradeoff?

Ridge regression usually has higher bias. Ridge regression usually has higher variance.

Ridge regression usually has higher irreducible Ridge regression’s variance approaches zero as the
error. regularization parameter λ → ∞.

(s) [3 pts] Which of the following quantities affect the bias-variance tradeoff?

    λ, the regularization coefficient in ridge regression        ε, the learning rate in gradient descent

    C, the slack parameter in soft-margin SVM        d, the polynomial degree in least-squares regression

(t) [3 pts] Which of the following statements about maximum likelihood estimation are true?

    MLE, applied to estimate the mean parameter µ of a normal distribution N(µ, Σ) with a known covariance matrix Σ, returns the mean of the sample points

    MLE, applied to estimate the covariance parameter Σ of a normal distribution N(µ, Σ), returns Σ̂ = (1/n) XᵀX, where X is the design matrix

    For a sample drawn from a normal distribution, the likelihood L(µ, σ; X1, . . . , Xn) is equal to the probability of drawing exactly the points X1, . . . , Xn (in that order) when you draw n random points from N(µ, σ)

    Maximizing the log likelihood is equivalent to maximizing the likelihood

Q2. [10 pts] Logistic Posterior for Poisson Distributions
Consider two classes C and D whose class conditionals are discrete Poisson distributions with means λC > 0 and λD > 0. Their
probability mass functions are

    P(K = k | Y = C) = λC^k e^(−λC) / k!,        P(K = k | Y = D) = λD^k e^(−λD) / k!,        k ∈ {0, 1, 2, . . .}.
Their prior probabilities are P(Y = C) = πC and P(Y = D) = πD = 1 − πC . We use the standard 0-1 loss function.

(a) [7 pts] Derive the posterior probability and show that it can be written in the form P(Y = C|K = k) = s( f (k, λC , λD , πC )),
where s is the logistic function and f is another function.

(b) [3 pts] What is the maximum number of points in the Bayes optimal decision boundary? (Note: as the distribution is
discrete, we are really asking for the maximum number of integral values of k where the classifier makes a transition
from predicting one class to the other.)
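As a numerical illustration of part (a) (our own sketch; the values of λC, λD, and πC are made up), the code below compares the posterior computed directly from Bayes' rule with the logistic form s(f(k)), where f(k) = k ln(λC/λD) + (λD − λC) + ln(πC/πD) is one way to write the inner function.

```python
import numpy as np
from scipy.stats import poisson

lamC, lamD, piC = 3.0, 5.0, 0.4          # made-up parameters
piD = 1 - piC
s = lambda t: 1 / (1 + np.exp(-t))       # logistic function

k = np.arange(0, 15)
# Posterior directly from Bayes' rule
num = piC * poisson.pmf(k, lamC)
post_direct = num / (num + piD * poisson.pmf(k, lamD))
# Logistic form: f(k) = k ln(lamC/lamD) + (lamD - lamC) + ln(piC/piD)
f = k * np.log(lamC / lamD) + (lamD - lamC) + np.log(piC / piD)
print(np.allclose(post_direct, s(f)))    # True
```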

Q3. [10 pts] Error-Prone Sensors
We want to perform linear regression on the outputs of d building sensors measured at n different times, to predict the building’s
energy use. Unfortunately, some of the sensors are inaccurate and prone to large errors and, occasionally, complete failure.
Fortunately, we have some knowledge of the relative accuracy and magnitudes of the sensors.

Let X be a n × (d + 1) design matrix whose first d columns represent the sensor measurements and whose last column is all 1’s.
(Each sensor column has been normalized to have variance 1.) Let y be a vector of n target values, and let w be a vector of d + 1
weights (the last being a bias term α). We decide to minimize the cost function

    J(w) = ‖Xw − y‖₁ + λwᵀDw,

where D is a diagonal matrix with diagonal elements Dii (with Dd+1,d+1 = 0 so we don’t penalize the bias term).

(a) [2 pts] Why might we choose to minimize the ℓ1-norm ‖Xw − y‖₁ as opposed to the ℓ2-norm |Xw − y|² in this scenario?

(b) [2 pts] Why might we choose to minimize wᵀDw as opposed to |w′|²? What could the values Dii in D represent?

(c) [6 pts] Derive the batch gradient descent rule to minimize our cost function. Hint: let p be a vector with components
pi = sign(Xiᵀw − yi), and observe that ‖Xw − y‖₁ = (Xw − y)ᵀp. For simplicity, assume that no Xiᵀw − yi is ever exactly
zero.

Q4. [10 pts] Gaussian Discriminant Analysis
Consider a two-class classification problem in d = 2 dimensions. Points from these classes come from multivariate Gaussian
distributions with a common mean but different covariance matrices.

    XC ∼ N( µ = [1, 1]ᵀ, ΣC = [[4, 0], [0, 4]] ),        XD ∼ N( µ = [1, 1]ᵀ, ΣD = [[1, 0], [0, 1]] ).

(a) [5 pts] Plot some isocontours of the probability distribution function P(µ, ΣC ) of XC on the left graph. (The particular
isovalues don’t matter much, so long as we get a sense of the isocontour shapes.) Plot the isocontours of P(µ, ΣD ) for the
same isovalues (so we can compare the relative spacing) on the right graph.
[Two blank grids are provided on the exam, each with x and y axes from −6 to 6: the left for the isocontours of P(µ, ΣC), the right for P(µ, ΣD).]

(b) [5 pts] Suppose that the priors for the two classes are πC = πD = 1/2 and we use the 0-1 loss function. Derive an equation
for the points x in the Bayes optimal decision boundary and simplify it as much as possible. What is the geometric shape
of this boundary? (Hint: try to get your equations to include the term |x − µ|² early, then keep it that way.) (Hint 2: you
can get half of these points by guessing the geometric shape.)

7
Q5. [10 pts] Quadratic Functions
(a) [4 pts] Derive the 2 × 2 symmetric matrix whose eigenvalues are 7 and 1, such that (1, 1) is an eigenvector with eigenvalue
7.

(b) [4 pts] Is the function f (x₁, x₂) = x₁⁴ + 2x₁² + 3x₁x₂ + 2x₂² − 7x₁ − 12x₂ − 18 convex? Justify your answer.

(c) [2 pts] Consider the cost function J(w) for least-squares linear regression. Can J(w) ever be unbounded below? In other
words, is there a set of input sample points X and labels y such that we can walk along a path in weight space for which
the cost function J(w) approaches −∞? Explain your answer.
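
For part (b), a quick numerical spot-check: the Hessian of f works out to [[12x₁² + 4, 3], [3, 4]], and the sketch below samples its eigenvalues on a grid. A check like this is only suggestive; it is not a proof of convexity.

import numpy as np

# f(x1, x2) = x1^4 + 2 x1^2 + 3 x1 x2 + 2 x2^2 - 7 x1 - 12 x2 - 18
def hessian(x1, x2):
    # Second partial derivatives, worked out by hand.
    return np.array([[12 * x1**2 + 4.0, 3.0],
                     [3.0,              4.0]])

# Spot-check the Hessian's eigenvalues on a grid of points.
for x1 in np.linspace(-3, 3, 13):
    for x2 in np.linspace(-3, 3, 13):
        evals = np.linalg.eigvalsh(hessian(x1, x2))
        assert evals.min() >= 0   # positive semidefinite at every sampled point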

8
CS 189 Introduction to
Spring 2019 Machine Learning Midterm
• Please do not open the exam before you are instructed to do so.
• The exam is closed book, closed notes except your one-page cheat sheet.

• Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops. Turn your
cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
• You have 1 hour and 20 minutes.
• Please write your initials at the top right of each page after this one (e.g., write “JS” if you are Jonathan Shewchuk).
Finish this by the end of your 1 hour and 20 minutes.
• Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.
• The total number of points is 100. There are 20 multiple choice questions worth 3 points each, and 4 written questions
worth a total of 40 points.

• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice,
but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct
answers must be checked.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

1
Q1. [60 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.

(a) [3 pts] Let A be a real, symmetric n × n matrix. Which of the following are true about A’s eigenvectors and eigenvalues?

A can have no more than n distinct eigenvalues

The vector 0 is an eigenvector, because A·0 = λ·0

A can have no more than 2n distinct unit-length eigenvectors

We can find n mutually orthogonal eigenvectors of A

(b) [3 pts] The matrix that has eigenvector [1, 2]> with eigenvalue 2 and eigenvector [−2, 1]> with eigenvalue 1 (note that
these are not unit eigenvectors!) is
" # " #
9 −2 6 2

−2 6 2 9
" # " #
9/5 −2/5 6/5 2/5

−2/5 6/5 2/5 9/5

(c) [3 pts] Consider a binary classification problem where we know both of the class conditional distributions exactly. To
compute the risk,

we need to know all the sample points we need to know the class prior probabilities

we need to know the loss function we need to use gradient descent

(d) [3 pts] Assuming we can find algorithms to minimize them, which of the following cost functions will encourage sparse
solutions (i.e., solutions where many components of w are zero)?

‖Xw − y‖₂² + λ‖w‖₁        ‖Xw − y‖₂² + λ · (# of nonzero components of w)

‖Xw − y‖₂² + λ‖w‖₁²        ‖Xw − y‖₂² + λ‖w‖₂²

(e) [3 pts] Which of the following statements about logistic regression are correct?

The cost function of logistic regression is convex The cost function of logistic regression is concave

Logistic regression uses the squared error as the Logistic regression assumes that each class’s
loss function points are generated from a Gaussian distribution

(f) [3 pts] Which of the following statements about stochastic gradient descent and Newton’s method are correct?

Newton’s method often converges faster than stochastic gradient descent, especially when the dimension is small

If the function is continuous with continuous derivatives, Newton’s method always finds a local minimum

Newton’s method converges in one iteration when the cost function is exactly quadratic with one unique minimum.

Stochastic gradient descent reduces the cost function at every iteration.

2
(g) [3 pts] Let X ∈ Rn×d be a design matrix containing n sample points with d features each. Let y ∈ Rn be the corresponding
real-valued labels. What is always true about every solution w∗ that locally minimizes the linear least squares objective
function kXw − yk22 , no matter what the value of X is?

w∗ = X + y (where X + is the pseudoinverse) w∗ satisfies the normal equations

w∗ is in the null space of X All of the local minima are global minima

(h) [3 pts] We are using linear discriminant analysis to classify points x ∈ Rd into three different classes. Let S be the set
of points in Rd that our trained model classifies as belonging to the first class. Which of the following are true?

The decision boundary of S is always a hyperplane S can be the whole space Rd

The decision boundary of S is always a subset of S is always connected (that is, every pair of points
a union of hyperplanes in S is connected by a path in S )

(i) [3 pts] Which of the following apply to linear discriminant analysis?

You calculate the sample mean for each class It approximates the Bayes decision rule

You calculate the sample covariance matrix using The model produced by LDA is never the same as
the mean of all the data points the model produced by QDA

(j) [3 pts] Which of the following are reasons why you might adjust your model in ways that increase the bias?

You observe high training error and high validation You observe low training error and high validation
error error

You have few data points Your data are not linearly separable

(k) [3 pts] Suppose you are given the one-dimensional data {x1 , x2 , . . . , x25 } illustrated below and you have only a hard-
margin support vector machine (with a fictitious dimension) at your disposal. Which of the following modifications
can give you 100% training accuracy?

Centering the data        Add a feature xᵢ²

Add a feature that is 1 if x ≤ 50, or −1 if x > 50        Add two features, xᵢ² and xᵢ³

(l) [3 pts] You are performing least-squares polynomial regression. As the degree of your polynomials increases, which
of the following is commonly seen to go down at first but then go up?

Training error Variance

Validation error Bias

3
(m) [3 pts] Let f : R → R be a continuous, smooth function whose derivative f ′(x) is also continuous. Suppose f has a
unique global minimum x∗ ∈ (−∞, ∞), and you are using gradient descent to find x∗ . You fix some x(0) ∈ R and ε > 0,
and run x(t) = x(t−1) − ε f ′(x(t−1) ) repeatedly. Which of the following statements are true?

Gradient descent is sure to converge, to some value, for any step size ε > 0

Assuming gradient descent converges, it converges to x∗ if and only if f is convex

If f has a local minimum x0 different from the global one, i.e., x0 ≠ x∗ , and x(t) = x0 for some t, gradient descent will not converge to x∗

If, additionally, f is the objective function of logistic regression, and gradient descent converges, then it converges to x∗

(n) [3 pts] Suppose you are trying to choose a good subset of features for a least-squares linear regression model. Let
algorithm A be forward stepwise selection, where we start with zero features and at each step add the new feature that
most decreases validation error, stopping only when validation error starts increasing. Let algorithm B be similar, but
at each step we include the new feature that most decreases training error (measured by the usual cost function, mean
squared error), stopping only when training error starts increasing. Which of the following is true?

Algorithm B will select no more features than Al- The first feature chosen by the two algorithms will
gorithm A does be the same

Algorithm B will select at least as many features Algorithm A sometimes selects features that Al-
as Algorithm A does gorithm B does not select

(o) [3 pts] Suppose you have a multivariate normal distribution with a positive definite covariance matrix Σ. Consider a
second multivariate Gaussian distribution whose covariance matrix is κΣ, where κ = cos θ > 0. Which of the following
statements are true about the ellipsoidal isocontours of the second distribution, compared to the first distribution?

The principal axes of the ellipsoids would be rotated by θ

The principal axes (radii) of the ellipsoids will be scaled by 1/κ

The principal axes (radii) of the ellipsoids will be scaled by √κ

The principal axes (radii) of the ellipsoids will be scaled by κ

(p) [3 pts] Suppose M and N are positive semidefinite matrices. Under what conditions is M − N certain to be positive
semidefinite?

Never If M and N share all the same eigenvectors

The smallest eigenvalue of M is greater than the The largest eigenvalue of M is greater than the
largest eigenvalue of N largest eigenvalue of N

(q) [3 pts] You are given four sample points X1 = [−1, −1]> , X2 = [−1, 1]> , X3 = [1, −1]> , and X4 = [1, 1]> . Each of them
is in class C or class D. For what feature representations are the lifted points Φ(Xi ) guaranteed to be linearly separable
(with no point lying exactly on the decision boundary) for every possible class labeling?

Φ(x) = [x₁, x₂, 1]        Φ(x) = [x₁², x₂², x₁, x₂, 1]

Φ(x) = [x₁, x₂, x₁² + x₂², 1]        Φ(x) = [x₁², x₂², x₁x₂, x₁, x₂, 1]

(r) [3 pts] Let Li (w) be the loss corresponding to a sample point Xi with label yi . The update rule for stochastic gradient
descent with step size ε is

wnew ← w − ε ∇_{Xi} Li (w)        wnew ← w − ε ∇w Li (w)

wnew ← w − ε ∑_{i=1}^n ∇_{Xi} Li (w)        wnew ← w − ε ∑_{i=1}^n ∇w Li (w)

4
(s) [3 pts] Suppose you have a sample in which each point has d features and comes from class C or class D. The class
conditional distributions are (Xi |yi = C) ∼ N(µC , σC²) and (Xi |yi = D) ∼ N(µD , σD²) for unknown values µC , µD ∈ Rd and
σC², σD² ∈ R. The class priors are πC and πD . We use 0-1 loss.

If πC = πD and σC = σD , then the Bayes decision rule assigns a test point z to the class whose mean is closest to z.

If σC = σD , then the Bayes decision boundary is always linear.

If πC = πD , then the Bayes decision rule is r∗ (z) = argmin_{A∈{C,D}} ( |z − µA |²/(2σA²) + d ln σA ).

If σC = σD , then QDA will always produce a linear decision boundary when you fit it to your sample.

(t) [3 pts] Let f ∈ [0, 1] be the unknown, fixed probability that a person in a certain population owns a dog (how cute!). We
model f with a hypothesis h ∈ [0, 1]. Before we observe any data at all, we can’t even guess what f might be, so we set
our prior probability for f to be the uniform distribution, i.e., P( f = h) = 1 over h ∈ [0, 1]. Now, we pick one person from
the population, and it turns out that they have a cute little labradoodle named Dr. Frankenstein. Which of the following
is true about the posterior probability that f = h given this one sample point?

The posterior is uniform over h ∈ [0, 1]. The posterior increases nonlinearly over h ∈ [0, 1].

The posterior increases linearly over h ∈ [0, 1]. The posterior is a delta function at 1.

5
Q2. [10 pts] The Perceptron Learning Algorithm
The table below is a list of sample points in R2 . Suppose that we run the perceptron algorithm, with a fictitious dimension, on
these sample points. We record the total number of times each point participates in a stochastic gradient descent step because it
is misclassified, throughout the run of the algorithm.
x1 x2 y times misclassified
−3 2 +1 0
−1 1 +1 0
−1 −1 −1 2
2 2 −1 1
1 −1 −1 0

(a) [5 pts] Suppose that the learning rate is ε = 1 and the initial weight vector is w(0) = (−3, 2, 1), where the last component
is the bias term. What is the equation of the separating line found by the algorithm, in terms of the features x1 and x2 ?

(b) [2 pts] In some cases, removing even a single point can change the decision boundary learned by the perceptron algorithm.
For which, if any, point(s) in our dataset would the learned decision boundary change if we removed it? Explain your
answer.

(c) [3 pts] How would our result differ if we were to add the additional training point (2, −2) with label +1?
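
A small NumPy sketch related to part (a): each misclassification step of the perceptron adds ε yi [xi1, xi2, 1]ᵀ to w, so the recorded counts determine the final weight vector, and the sketch below reconstructs it and verifies that every training point ends up correctly classified.

import numpy as np

# Training data from the table: columns are x1, x2, label y, times misclassified.
data = np.array([
    [-3,  2, +1, 0],
    [-1,  1, +1, 0],
    [-1, -1, -1, 2],
    [ 2,  2, -1, 1],
    [ 1, -1, -1, 0],
], dtype=float)

X_aug = np.hstack([data[:, :2], np.ones((5, 1))])   # append the fictitious dimension
y = data[:, 2]
counts = data[:, 3]

eps = 1.0
w = np.array([-3.0, 2.0, 1.0])            # w(0); the last component is the bias
w += eps * (counts * y) @ X_aug           # each misclassification adds eps * y_i * X_i

print(w)                                  # final weight vector
assert np.all(y * (X_aug @ w) > 0)        # every training point is now correctly classified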

6
Q3. [10 pts] Quadratic Discriminant Analysis
(a) [4 pts] Consider 12 labeled data points sampled from three distinct classes:
Class 0: [0, 2]ᵀ, [−2, 0]ᵀ, [5, 3]ᵀ, [−3, −5]ᵀ        Class 1: [√2, √2]ᵀ, [−√2, −√2]ᵀ, [−4√2, 4√2]ᵀ, [4√2, −4√2]ᵀ        Class 2: [3, 5]ᵀ, [1, 3]ᵀ, [8, 6]ᵀ, [0, −2]ᵀ

For each class C ∈ {0, 1, 2}, compute the class sample mean µC , the class sample covariance matrix ΣC , and the estimate
of the prior probability πC that a point belongs to class C. (Hint: µ1 = µ0 and Σ2 = Σ0 .)

(b) [4 pts] Sketch one or more isocontours of the QDA-produced normal distribution or quadratic discriminant function (they
each have the same contours) for each class. The isovalues are not important; the important aspects are the centers, axis
directions, and relative axis lengths of the isocontours. Clearly label the centers of the isocontours and to which class
they correspond.

(c) [2 pts] Suppose that we apply LDA to classify the data given in part (a). Why will this give a poor decision boundary?

7
Q4. [10 pts] Ridge Regression with One Feature
We are given a sample in which each point has only one feature. Therefore, our design matrix is a column vector, which we
will write x ∈ Rn (instead of X). Consider the scalar data generation model

yi = ωxi + ei

where xi ∈ R is point i’s sole input feature, yi ∈ R is its scalar label (a noisy measurement), ei ∼ N(0, 1) is standard unit-
variance zero-mean Gaussian noise, and ω ∈ R is the true, fixed linear relationship that we would like to estimate. The ei ’s are
independent and identically distributed random variables, and the sole source of randomness. We will treat the design vector x
as fixed (not random).

Our goal is to fit a linear model and get an estimate wλ for the true parameter ω. The ridge regression estimate for ω is
wλ = argmin_{w∈R} ( λw² + ∑_{i=1}^n (yi − xi w)² ),        where λ ≥ 0.

(a) [4 pts] Express wλ in terms of λ, Sxx and Sxy , where Sxx = ∑_{i=1}^n xi² and Sxy = ∑_{i=1}^n xi yi .

(b) [5 pts] Compute the squared bias of the ridge estimate wλ z at a test point z ∈ R, defined to be

bias2 (wλ , z) = (E[wλ z] − ωz)2 ,

where the expectation is taken with respect to the yi ’s. Express your result in terms of ω, λ, S xx , and z. (Hint: simplify
the expectation first.)

(c) [1 pt] What will the bias be if we are using ordinary least squares, i.e., λ = 0?
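
A small Monte Carlo sketch (with a hypothetical ω, design vector x, and λ) that simulates the yi's and compares the average of wλ z against the closed-form estimate wλ = Sxy/(Sxx + λ) that part (a) leads to:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup, for illustration only.
omega, lam = 1.5, 2.0
x = np.array([0.5, -1.0, 2.0, 1.5])       # fixed design vector
S_xx = np.sum(x**2)

# Draw many datasets y = omega*x + e, e_i ~ N(0, 1), and compute the ridge
# estimate w_lambda = S_xy / (S_xx + lambda) for each one.
Y = omega * x + rng.normal(size=(100_000, x.size))
w_hats = Y @ x / (S_xx + lam)

z = 1.0   # test point
print("simulated E[w_lambda * z]:          ", w_hats.mean() * z)
print("omega * z * S_xx / (S_xx + lambda): ", omega * z * S_xx / (S_xx + lam))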

8
Q5. [10 pts] Logistic Regression with One Feature
We are given another sample in which each point has only one feature. Consider a binary classification problem in which
sample values x ∈ R are drawn randomly from two different class distributions. The first class, with label y = 0, has its mean to
the left of the mean of the second class, with label y = 1. We will use a modified version of logistic regression to classify these
data points. We model the posterior probability at a test point z ∈ R as

P(y = 1|z) = s(z − α),

where α ∈ R is the sole parameter we are trying to learn and s(γ) = 1/(1 + e^{−γ}) is the logistic function. The decision boundary
is z = α (because s(z − α) = 1/2 there).

We will learn the parameter α by performing gradient descent on the logistic loss function (a.k.a. cross-entropy). That is, for a
data point x with label y ∈ {0, 1}, we find the α that minimizes

J(α) = −y ln s(x − α) − (1 − y) ln(1 − s(x − α)).

(a) [5 pts] Derive the stochastic gradient descent update for J with step size ε > 0, given a sample value x and a label y. Hint:
feel free to use s as an abbreviation for s(x − α).

(b) [3 pts] Is J(α) convex over α ∈ R? Justify your answer.

(c) [2 pts] Now we consider multiple sample points. As d = 1, we are given an n × 1 design matrix X and a vector y ∈ Rn of
labels. Consider batch gradient descent on the cost function ∑_{i=1}^n J(α; Xi , yi ). There are circumstances in which this cost
function does not have a minimum over α ∈ R at all. What is an example of such a circumstance?
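
A short NumPy sketch (synthetic one-dimensional data, hypothetical class means) that checks the derivative of J(α) by finite differences and then runs the stochastic gradient descent update of part (a):

import numpy as np

rng = np.random.default_rng(0)

def s(g):
    return 1.0 / (1.0 + np.exp(-g))

def J(alpha, x, y):
    p = s(x - alpha)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# Check the gradient dJ/dalpha = y - s(x - alpha) against a finite difference.
alpha, x, y = 0.3, 1.2, 1
grad = y - s(x - alpha)
fd = (J(alpha + 1e-6, x, y) - J(alpha - 1e-6, x, y)) / 2e-6
assert np.isclose(grad, fd, atol=1e-5)

# SGD on synthetic one-dimensional data: class 0 centered at -2, class 1 at +2.
X = np.concatenate([rng.normal(-2, 1, 500), rng.normal(+2, 1, 500)])
Y = np.concatenate([np.zeros(500), np.ones(500)])
alpha, eps = 0.0, 0.1
perm = rng.permutation(len(X))
for xi, yi in zip(X[perm], Y[perm]):
    alpha -= eps * (yi - s(xi - alpha))   # the stochastic gradient step
print(alpha)                              # typically hovers between the two class means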

9
CS 189 Introduction to
Spring 2020 Machine Learning Midterm A
• Please do not open the exam before you are instructed to do so.
• The exam is closed book, closed notes except your cheat sheets.

• Please write your name at the top of each page of the Answer Sheet. (You may do this before the exam.)
• You have 80 minutes to complete the midterm exam (6:40–8:00 PM). (If you are in the DSP program and have an
allowance of 150% or 200% time, that comes to 120 minutes or 160 minutes, respectively.)
• When the exam ends (8:00 PM), stop writing. You have 15 minutes to scan the exam and turn it into Gradescope. You
must remain visible on camera while you scan your exam and turn it in (unless the scanning device is your only self-
monitoring device). Most of you will use your cellphone and a third-party scanning app. If you have a physical scanner
in your workspace that you can make visible from your camera, you may use that. Late exams will be penalized at a rate
of 10 points per minute after 8:15 PM. (The midterm has 100 points total.) Continuing to work on the exam after 8:00
PM (or not being visible prior to submission) may incur a score of zero.

• Mark your answers on the Answer Sheet. If you absolutely must use overflow space for a written question, use the space
for “Written Question #5” (but please try hard not to overflow). Do not attach any extra sheets.
• The total number of points is 100. There are 10 multiple choice questions worth 4 points each, and three written questions
worth 20 points each.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice,
but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct
answers must be checked.
• For written questions, please write your full answer in the space provided and clearly label all subparts of each
written question. Again, do not attach extra sheets.

First name

Last name

SID

1
Q1. [40 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.

(a) [4 pts] Let X be an m × n matrix. Which of the following are always equal to rank(X)?

A: rank(X T ) C: m − dimension(nullspace(X))

B: rank(X T X) D: dimension(rowspace(X))

(b) [4 pts] Which of the following types of square matrices can have negative eigenvalues?

A: a symmetric matrix C: an orthonormal matrix (M such that M > M = I)

B: I − uuT where u is a unit vector D: ∇2 f (x) where f (x) is a Gaussian PDF

(c) [4 pts] Choose the correct statement(s) about Support Vector Machines (SVMs).

A: if a finite set of training points from two classes is linearly separable, a hard-margin SVM will always find
a decision boundary correctly classifying every training point

B: if a finite set of training points from two classes is linearly separable, a soft-margin SVM will always find a
decision boundary correctly classifying every training point

C: every trained two-class hard-margin SVM model has at least one point of each class at a distance of exactly
1/kwk (the margin width) from the decision boundary

D: every trained two-class soft-margin SVM model has at least one point of each class at a distance of exactly
1/kwk (the margin width) from the decision boundary

(d) [4 pts] Suppose we perform least-squares linear regression, but we don’t assume that all weight vectors are equally
reasonable; instead, we use the maximum a posteriori method to impose a normally-distributed prior probability on the
weights. Then we are doing

A: L2 regularization C: logistic regression

B: Lasso regression D: ridge regression

(e) [4 pts] Which of the following statements regarding ROC curves are true?

A: the ROC curve is monotonically increasing C: the ROC curve is concave

B: for a logistic regression classifier, the ROC D: if the ROC curve passes through (0, 1), the clas-
curve’s horizontal axis is the posterior probability sifier is always correct (on the test data used to make
used as a threshold for the decision rule the ROC curve)

(f) [4 pts] One way to understand regularization is to ask which vectors minimize the regularization term. Consider the set
of unit vectors in the plane: {x ∈ R2 : kxk22 = 1}. Which of the following regularization terms are minimized solely by the
four unit vectors {(0, 1), (1, 0), (−1, 0), (0, −1)} and no other unit vector?

A: f (x) = kxk0 = the # of nonzero entries of x C: f (x) = kxk22

B: f (x) = kxk1 D: f (x) = kxk∞ = max{|x1 |, |x2 |}

2
(g) [4 pts] Suppose we train a soft-margin SVM classifier on data with d-dimensional features and binary labels. Below we
have written four pairs of the form “modification → effect.” For which ones would a model trained on the modified data
always have the corresponding effect relative to the original model?

A: augment the data with polynomial features → optimal value of the objective function (on the training
points) decreases or stays the same

B: multiply each data point by a fixed invertible d × d matrix A; i.e., Xi ← AXi → all training points are
classified the same as before

C: multiply each data point by a fixed orthonormal d×d matrix U and add a fixed vector z ∈ Rd ; i.e., Xi ← UXi +z
→ all training points are classified the same as before.

D: normalize each feature so that its mean is 0 and variance is 1 → all training points are classified the same
as before

(h) [4 pts] A real-valued n×n matrix P is called a projection matrix if P2 = P. Select all the true statements about eigenvalues
of P.

A: P can have an eigenvalue of 0 C: P can have an eigenvalue of −1

B: P can have an eigenvalue of 1 D: P can have an eigenvalue that isn’t 0, 1, or −1

(i) [4 pts] Let X be a real-valued n × d matrix. Let Ω be a diagonal, real-valued n × n matrix whose diagonal entries are all
positive. Which of the following are true of the matrix product M = X T ΩX?

A: M could have negative eigenvalues        C: M could have positive eigenvalues

B: M could have eigenvalues equal to zero        D: the eigenvalues of M are the values on the diagonal of Ω

(j) [4 pts] Which of the following regression methods always have just one unique optimum, regardless of the data?

A: least-squares regression        C: Lasso regression

B: ridge regression        D: logistic regression

3
Q2. [20 pts] Gradient Descent
Let’s use gradient descent to solve the optimization problem of finding the value of x ∈ R2 that minimizes the objective function
" #
1 1 0
J(x) = xT Ax, A= .
2 0 2

(a) [7 pts] Let x(t) represent the value of x after t iterations of gradient descent from some arbitrary starting point x(0) . Write
the standard gradient descent update equation in the form x(t+1) ← f (x(t) ) (you tell us what the function f is) with a step
size of ε = 1/4. Then manipulate it into the form x(t+1) = Bx(t) where B is a matrix (you tell us what B is). Show your work.

(b) [4 pts] The minimum of J(x) is at x∗ = 0, so we hope that our algorithm will converge: that is, limt→∞ x(t) = 0. Show that
for any starting point x(0) , your gradient descent algorithm converges to x∗ .

(c) [3 pts] Suppose we change the step size to ε = 1. What is B? How does gradient descent behave with this step size?

(d) [3 pts] Suppose we replace A with another diagonal matrix with positive diagonal entries. What is the optimal step size
for fastest convergence, expressed in terms of the diagonal entries A11 and A22 ?

(e) [3 pts] Your argument in part (b) can be adapted to prove convergence for any diagonal A with positive diagonal entries,
so long as we choose a suitably small step size ε as derived in part (d). Suppose we replace A with another matrix that
is symmetric and positive definite but not diagonal. Suppose we choose a suitably small step size ε. Without writing
any equations, give a mathematical explanation (in English) why your argument in part (b) applies here and gives us
confidence that gradient descent will converge to the minimum, even though A is not diagonal. Hint: One approach is to
change the coordinate system.
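
A tiny NumPy sketch of the iteration described in part (a), x(t+1) = (I − εA)x(t), to observe the convergence behavior numerically:

import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 2.0]])
eps = 0.25
B = np.eye(2) - eps * A       # x(t+1) = (I - eps*A) x(t), since grad J(x) = A x

x = np.array([3.0, -4.0])     # arbitrary starting point
for _ in range(200):
    x = B @ x
print(x)                      # approaches the minimizer x* = 0

# With eps = 1, the second coordinate is multiplied by (1 - 2) = -1 at each step,
# so it oscillates forever instead of converging.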

4
Q3. [20 pts] Gaussians and Linear Discriminant Analysis
Suppose that the training and test points for a class come from an anisotropic multivariate normal distribution N(µ, Σ), where
µ ∈ Rd and Σ ∈ Rd×d is symmetric and positive definite. Recall that (for x ∈ Rd ) its probability density function (PDF) is
f (x) = ( 1 / ( (√(2π))^d √|Σ| ) ) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ).

(a) [7 pts] In lecture, I claimed that if Σ is diagonal, you can write this PDF as a product of d univariate Gaussian PDFs, one
for each feature. What if Σ is not diagonal? Show that if you substitute Σ’s eigendecomposition for Σ, you can write the
PDF above as a product of d univariate Gaussian PDFs, one aligned with each eigenvector of Σ. For simplicity, please
set µ = 0 (prove it just for the mean-zero case).
Hints: Use the shorthand τ = 1 / ( (√(2π))^d √|Σ| ). Write the eigendecomposition as a summation with one term per
eigenvalue/vector. The determinant |Σ| is the product of Σ’s eigenvalues (all d of them).

(b) [2 pts] When you express the multivariate PDF as a product of univariate PDFs, what is the variance of the univariate
distribution along the direction of the ith eigenvector vi ?

(c) [7 pts] Consider performing linear discriminant analysis (LDA) with two classes. Class C has the class-conditional
distribution N(µC , Σ), and class D has the class-conditional distribution N(µD , Σ). Note that they both have the same
covariance matrix but different means. Recall that we define a quadratic function
QC (x) = ln( (√(2π))^d fC (x) πC ),

where fC (x) is the PDF for class C and πC is the prior probability for class C. For class D, we define QD (x) likewise. For
simplicity, assume πC = πD = 1/2.
Write down the LDA decision boundary as an equation in terms of QC (x) and QD (x). Then substitute the definition
above and show that the decision boundary has the form {x : w · x + α = 0} for some w ∈ Rd and α ∈ R. What is the
value of w?

(d) [2 pts] What is the relationship between w and the decision boundary?

(e) [2 pts] Is w always an eigenvector of Σ? (That is, is it always true that w = ωvi for some scalar ω and unit eigenvector vi
of Σ?) Why or why not?
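
A small SciPy check of the factorization in parts (a) and (b), using a hypothetical non-diagonal Σ: the joint PDF matches the product of univariate Gaussian PDFs along Σ's eigenvectors, each with variance equal to the corresponding eigenvalue.

import numpy as np
from scipy.stats import multivariate_normal, norm

# A hypothetical non-diagonal, symmetric, positive definite covariance matrix (mean zero).
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eigh(Sigma)
joint = multivariate_normal(mean=np.zeros(2), cov=Sigma)

rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=2)
    # Product of univariate Gaussian PDFs, one along each eigenvector v_i,
    # with variance equal to the corresponding eigenvalue lambda_i.
    product = np.prod([norm(0, np.sqrt(lam)).pdf(v @ x)
                       for lam, v in zip(eigvals, eigvecs.T)])
    assert np.isclose(joint.pdf(x), product)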

5
Q4. [20 pts] Double Regression
Let’s work out a two-way least-squares linear regression method. The input is n observations, recorded in two vectors s, t ∈ Rn .
The ith observation is the ordered pair (si , ti ). We’re going to view these data in two ways: (1) si is a sample point in one
dimension with label ti ; or (2) ti is a sample point in one dimension with label si . We will use least-squares linear regression to
(1) take a test point sT ∈ R and predict its label tT , with a hypothesis tˆ(sT ) = βsT , and (2) take a test point tT ∈ R and predict its
label sT , with a hypothesis ŝ(tT ) = γtT .

We do not use bias terms, so both regression functions will pass through the origin. Our optimization problems are
Find β that minimizes ∑_{i=1}^n (βsi − ti )²        Find γ that minimizes ∑_{i=1}^n (γti − si )²
A natural question, which we will explore now, is whether both regressions find the same relationship between sT and tT .

(a) [7 pts] Derive a closed-form expression for the optimal regression coefficient β. Write your final answer in terms of
vector operations, not summations. Show all your work.

(b) [2 pts] What is a closed-form expression for the optimal regression coefficient γ? (This follows from symmetry; you
don’t need to repeat the derivation, unless you want to.)

(c) [4 pts] The hypotheses tT = βsT and sT = γtT represent the same equation if and only if βγ = 1. Prove that βγ ≤ 1 and
determine under what condition equality holds. Hint: remember the Cauchy–Schwarz inequality.

(d) [5 pts] We might want to compute these coefficients with ℓ1-regularization. For some regularization parameter λ > 0,
consider the optimization problem

Find β that minimizes λ|β| + ∑_{i=1}^n (βsi − ti )²
In Homework 4, we analyzed this optimization problem and concluded that there is at most one point where the
derivative is zero. If such a point exists, it is the minimum; otherwise, the minimum is at the discontinuity β = 0. For
simplicity, let’s consider only the case where the solution happens to be positive (β > 0).
Derive a closed-form expression for the optimal regression coefficient β in the case β > 0. Write your final answer in
terms of vector operations, not summations. Show all your work.

(e) [2 pts] What necessary and sufficient condition (inequality) should s, t, and λ satisfy to assure us that the optimal β
is indeed positive?
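
A short NumPy illustration on synthetic data of the two regression coefficients this question leads to, β = sᵀt/(sᵀs) and γ = sᵀt/(tᵀt), and of the inequality βγ ≤ 1 from part (c):

import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=100)
t = 0.8 * s + rng.normal(scale=0.5, size=100)   # hypothetical noisy linear relation

beta = (s @ t) / (s @ s)    # argmin_beta  sum_i (beta*s_i - t_i)^2
gamma = (s @ t) / (t @ t)   # argmin_gamma sum_i (gamma*t_i - s_i)^2

print(beta, gamma, beta * gamma)   # beta*gamma <= 1, with equality only if t is a multiple of s
assert beta * gamma <= 1.0 + 1e-12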

6
CS 189 Introduction to
Spring 2020 Machine Learning Midterm B
• The exam is closed book, closed notes except your self-made cheat sheets.
• You will submit your answers to the multiple-choice questions through Gradescope via the assignment “Midterm B –
Multiple Choice”; please do not submit your multiple-choice answers on paper. By contrast, you will submit your an-
swers to the written questions by writing them on paper by hand, scanning them, and submitting them through Gradescope
via the assignment “Midterm B – Writeup.”
• Please write your name at the top of each page of your written answers. (You may do this before the exam.)
• You have 80 minutes to complete the midterm exam (6:40–8:00 PM). (If you are in the DSP program and have an
allowance of 150% or 200% time, that comes to 120 minutes or 160 minutes, respectively.)
• When the exam ends (8:00 PM), stop writing. You must submit your multiple-choice answers before 8:00 PM sharp.
Late multiple-choice submissions will be penalized at a rate of 5 points per minute after 8:00 PM. (The multiple-choice
questions are worth 40 points total.)

• From 8:00 PM, you have 15 minutes to scan the written portion of your exam and turn it into Gradescope via the
assignment “Midterm B Writeup.” Most of you will use your cellphone and a third-party scanning app. If you have a
physical scanner, you may use that. You do not need to scan the title page or the multiple-choice page. Late written
submissions will be penalized at a rate of 10 points per minute after 8:15 PM. (The written portion is worth 60 points
total.)

• Mark your answers to multiple-choice questions directly into Gradescope. Write your answers to written questions on
the corresponding pages of the Answer Sheet or on blank paper. If you need overflow space for a written question, use
additional sheets of blank paper. Clearly label all written questions and all subparts of each written question.
• Following the exam, you must use Gradescope’s page selection mechanism to mark which questions are on which pages
of your exam (as you do for the homeworks).

• The total number of points is 100. There are 10 multiple choice questions worth 4 points each, and three written questions
worth 20 points each.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice,
but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct
answers must be checked.

First name

Last name

SID

1
Q1. [40 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.

(a) [4 pts] Recall that in subset selection, we attempt to identify poorly predictive features and ignore them. Which of the
following are reasons why we may seek to drop features available to our model?

A: To reduce model bias C: To increase speed of prediction on test points

B: To reduce model variance D: To improve model interpretability

(b) [4 pts] Consider a random variable X ∼ N(µ, Σ) ∈ Rd , where the multivariate Gaussian probability density function
(PDF) is axis-aligned, Σ is positive definite, and the standard deviation along coordinate axis i is σi . Select all that apply.

A: The d features of X are uncorrelated but not necessarily independent

B: Σ = diag(√σ1 , √σ2 , . . . , √σd )

C: Σ has a symmetric square root Σ^{1/2} with eigenvalues σ1 , σ2 , . . . , σd

D: (X − µ)ᵀ Σ⁻¹ (X − µ) ≥ 0

(c) [4 pts] Given a design matrix X ∈ Rn×d representing n sample points with d features, you compute the sample covariance
matrix M of your dataset and find that its determinant is det M = 0. What do you know to be true?

A: There is some direction in Rd along which the sample points have zero variance

B: The covariance matrix M is positive definite

C: The columns of the centered design matrix Ẋ are linearly dependent

D: The rows of the centered design matrix Ẋ are linearly dependent

(d) [4 pts] For classification problems with two features (d = 2, test point z ∈ R²), which of the following methods have
posterior probability distributions of the form P(Y|X = z) = s(Az₁² + Bz₂² + Cz₁z₂ + Dz₁ + Ez₂ + F), where s is the logistic
function s(γ) = 1/(1 + e^{−γ}) and A, B, C, D, E, F ∈ R can all be nonzero?

A: Logistic regression with linear features C: Logistic regression with quadratic features

B: Linear discriminant analysis (LDA) with D: Quadratic discriminant analysis (QDA) with
quadratic features linear features

(e) [4 pts] Which of the following statements regarding Bayes decision theory are true?

A: If the Bayes optimal classifier r∗ (x) correctly classifies all points in the design matrix X with 100% accuracy, its Bayes risk must be zero

B: With 0-1 loss, the two-class Bayes optimal classifier r∗ (x) classifies points in a way that minimizes the probability of misclassification

C: If you have a design matrix X and you are given the Bayes optimal classifier r∗ (x), and then you sample a different design matrix X′ from the same distribution(s), the original r∗ (x) is no longer optimal

D: None of the above

2
(f) [4 pts] We want to minimize the function

f (β) = β² if β ≤ 0,    and    f (β) = β otherwise.

A: Starting from β = 1, Newton’s method will take us to the minimum of f in one step

B: Starting from β = −1, Newton’s method will take us to the minimum of f in one step

C: There exists a learning rate ε such that starting from β = 1, gradient descent will take us to the minimum of
f in one step

D: There exists a learning rate ε such that starting from β = −1, gradient descent will take us to the minimum of
f in one step

(g) [4 pts] In each of the following two figures, there are exactly three training points drawn from an unknown distribution,
and the dashed line is a decision boundary.

[Figure 1]        [Figure 2]

A: The Bayes optimal decision boundary always appears as drawn in Figure 1

B: Both hard-margin and soft-margin (for some choice of C) SVMs could produce the decision boundary in
Figure 1 (using only the features x1 and x2 , plus the fictitious dimension where appropriate)

C: Both logistic regression and QDA could produce the decision boundary in Figure 2 (using only the features
x1 and x2 , plus the fictitious dimension where appropriate)

D: A hard-margin SVM augmented with the parabolic lifting map Φ(x) = [x₁  x₂  x₁² + x₂²]ᵀ could produce the
decision boundary in Figure 2

(h) [4 pts] Consider least-squares linear regression with a design matrix X ∈ Rn×d and labels y ∈ Rn . If the solution to the
least-squares problem is unique, which of the following must be true?

A: rank(X) = d C: n ≤ d

B: rank(X > ) = d D: d ≤ n

3
(i) [4 pts] Suppose we are performing linear regression on a design matrix X and a label vector y. Recall that the conventional
least-squares formulation finds the linear function h(·) that minimizes the empirical risk R(h) = (1/n) ∑_{i=1}^n L(h(Xi ), yi ), where
L(ζ, γ) = (ζ − γ)² is the squared-error loss for a prediction ζ ∈ R and a label γ ∈ R. However, you are afraid that your
training data may have outliers. Which of the following changes will help mitigate this issue if it exists?

A: Change the loss function from L(ζ, γ) = (ζ − γ)2 (squared error) to L(ζ, γ) = |ζ − γ| (absolute error)

B: Change the loss function from L(ζ, γ) = (ζ − γ)2 (squared error) to L(ζ, γ) = −γ ln ζ − (1 − γ) ln (1 − ζ)
(logistic loss)

C: Change the cost function from R(h) = (1/n) ∑_{i=1}^n L(h(Xi ), γi ) (mean loss) to R(h) = ∑_{i=1}^n L(h(Xi ), γi ) (total loss)

D: Change the cost function from R(h) = (1/n) ∑_{i=1}^n L(h(Xi ), γi ) (mean loss) to R(h) = max_{i∈[1,...,n]} L(h(Xi ), γi ) (maximum loss)

(j) [4 pts] The following chart depicts the class-conditional distributions P(X|Y) for a classification problem with three
classes, A, B, and C. Classes A and B are normally distributed over the domain (−∞, ∞); Class C is defined only over the
finite domain depicted below. All three classes have prior probabilities πA , πB , πC strictly greater than zero; the chart
does not show the influence of these priors. We use the 0-1 loss function.

A: The Bayes risk is the area of the shaded region in the chart (including the area not depicted off the sides of
the chart, going to x = ±∞)

B: Depending on the priors, it is possible that the Bayes rule r∗ (x) will classify all inputs as class B

C: Depending on the priors, it is possible that the Bayes rule r∗ (x) will classify all inputs as class C

D: Depending on the priors, it is possible that the Bayes risk is zero

4
Q2. [20 pts] Hard-Margin Support Vector Machines
Recall that a maximum margin classifier, also known as a hard-margin support vector machine (SVM), takes n training points
X1 , X2 , . . . , Xn ∈ Rd with labels y1 , y2 , . . . , yn ∈ {+1, −1}, and finds parameters w ∈ Rd and α ∈ R that satisfy a certain objective
function subject to the constraints
yi (Xi · w + α) ≥ 1, ∀i ∈ {1, . . . , n}.
For parts (a) and (b), consider the following training points. Circles are classified as positive examples with label +1 and
triangles are classified as negative examples with label −1.

" # " #
horizontal 3
(a) [3 pts] Which points are the support vectors? Write it as . E.g., the bottom right circle is .
vertical 1
" #
5
(b) [4 pts] If we add the sample point x = with label −1 (triangle) to the training set, which points are the support vectors?
1

For parts (c)–(f), forget about the figure above, but assume that there is at least one sample point in each class and that the
sample points are linearly separable.

(c) [2 pts] Describe the geometric relationship between w and the decision boundary.

(d) [2 pts] Describe the relationship between w and the margin. (For the purposes of this question, the margin is just a
number.)

(e) [4 pts] Knowing what you know about the hard-margin SVM objective function, explain why for the optimal (w, α), there
must be at least one sample point for which Xi · w + α = 1 and one sample point for which Xi · w + α = −1.

(f) [5 pts] If we add new features to the sample points (while retaining all the original features), can the optimal kwnew k in the
enlarged SVM be greater than the optimal kwold k in the original SVM? Can it be smaller? Can it be the same? Explain
why! (Most of the points will be for your explanation.)

5
Q3. [20 pts] Regression with Varying Noise
We derived a cost function for regression problems by assuming that sample points and their labels arise from the following
process, and applying maximum likelihood estimation (MLE).

• Sample points come from an unknown distribution, Xi ∼ D.


• Labels yi are the sum of a deterministic function g plus random noise: ∀i, yi = g(Xi ) + εi , where εi ∼ N(0, σ²).

For this problem, we will assume that εi ∼ N(0, σi²)—that is, the variance σi² of the noise is different for each sample point—
and we will examine how our cost function changes as a result. We assume that (magically) we know the value of each σi². You
are given an n × d design matrix X, an n-vector y of labels, such that the label yi of sample point Xi is generated as described
above, and a list of the noise variances σi².

(a) [8 pts] Apply MLE to derive the optimization problem that will yield the maximum likelihood estimate of the distribution
parameter g. (Note: g is a function, but we can still treat it as the parameter of an optimization problem.) Express your
cost function as a summation of loss functions (where you decide what the loss function is), one per sample point.

(b) [4 pts] We decide to do linear regression, so we parameterize g(Xi ) as g(Xi ) = w · Xi , where w is a d-vector of weights.
Write an equivalent optimization problem where your optimization variable is w and the cost function is a function of X,
y, w, and the variances σ2i . Find a way to express your cost function in matrix notation, with no summations. (This may
entail defining a new matrix.)

(c) [4 pts] Write the solution to your optimization problem as the solution of a linear system of equations. (Again, in matrix
notation, with no summations.)

(d) [2 pts] Does your solution resemble that of a similar method you know? What is its name?

(e) [2 pts] Compare your solution to the case in which we assume that every sample point has the same noise distribution.
In simple terms, how does the amount of noise affect the optimization, and why does this seem like the intuitively right
thing to do? Answer in 3 sentences or fewer.
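
As a rough sketch of where this derivation ends up, the following NumPy snippet (synthetic data) solves the weighted least-squares system in which each sample point is weighted by 1/σi²; the matrix named Omega below is this sketch's own notational choice.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma = rng.uniform(0.1, 3.0, size=n)             # known, per-point noise standard deviations
y = X @ w_true + sigma * rng.normal(size=n)

Omega = np.diag(1.0 / sigma**2)                   # weight each point by 1/sigma_i^2
w_hat = np.linalg.solve(X.T @ Omega @ X, X.T @ Omega @ y)
print(w_hat)                                      # close to w_true, with noisy points downweighted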

6
Q4. [20 pts] Finding Bias, Variance, and Risk
For z ∈ R, you are trying to estimate a true function g(z) = 2z² with least-squares regression, where the regression function is
a line h(z) = wz that goes through the origin and w ∈ R. Each sample point x ∈ R is drawn from the uniform distribution on
[−1, 1] and has a corresponding label y = g(x) ∈ R. There is no noise in the labels. We train the model with just one sample
point! Call it x, and assume x ≠ 0. We want to apply the bias-variance decomposition to this model.

(a) [3 pts] In one sentence, why do we expect the bias to be large?

(b) [6 pts] What is the bias of your model h(z) as a function of a test point z ∈ R? (Hint: start by working out the value of the
least-squares weight w.) Your final bias should not include an x; work out the expectation.

(c) [6 pts] What is the variance of your model h(z) as a function of a test point z ∈ R? Your final variance should not include
an x; work out the expectation.

(d) [5 pts] Let R(h, z) be the risk (expected loss) for a fixed, arbitrary test point z ∈ R with the noise-free label g(z) (where
the expectation is taken over the distribution of values of (x, y)). What is the mathematical relationship between the risk
R(h, z), the bias of h(z) at z, and the variance of h(z) at z? What are the values (as numbers) of these three quantities for
z = 1?
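
A Monte Carlo sketch (an illustration, not a derivation) of the quantities in parts (b)–(d) at the test point z = 1, using the fact that one training point (x, y) gives the least-squares weight w = xy/x² = 2x:

import numpy as np

rng = np.random.default_rng(0)
g = lambda z: 2 * z**2

# Each "training set" is a single point x ~ Uniform[-1, 1] with label y = g(x);
# least squares through the origin then gives w = x*y / x^2 = 2*x.
x = rng.uniform(-1, 1, size=1_000_000)
w = 2 * x

z = 1.0
h = w * z                              # predictions at the test point z
bias = h.mean() - g(z)                 # estimate of E[h(z)] - g(z)
variance = h.var()                     # estimate of Var(h(z))
risk = np.mean((h - g(z))**2)          # squared-error risk, with no label noise
print(bias, variance, risk, bias**2 + variance)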

7
CS 189/289A Introduction to Machine Learning
Spring 2021 Jonathan Shewchuk Midterm
• The exam is open book, open notes for material on paper. On your computer screen, you may have only this exam,
Zoom (if you are running it on your computer instead of a mobile device), and four browser windows/tabs: Gradescope,
the exam instructions, clarifications on Piazza, and the form for submitting clarification requests.

• You will submit your answers to the multiple-choice questions directly into Gradescope via the assignment “Midterm –
Multiple Choice”; please do not submit your multiple-choice answers on paper. If you are in the DSP program and have
been granted extra time, select the “DSP, 150%” or “DSP, 200%” option. By contrast, you will submit your answers to
the written questions by writing them on paper by hand, scanning them, and submitting them through Gradescope via the
assignment “Midterm – Free Response.”

• Please write your name at the top of each page of your written answers. (You may do this before the exam.) Please start
each top-level question (Q2, Q3, etc.) on a new sheet of paper. Clearly label all written questions and all subparts
of each written question.
• You have 80 minutes to complete the midterm exam (7:40–9:00 PM). (If you are in the DSP program and have an
allowance of 150% or 200% time, that comes to 120 minutes or 160 minutes, respectively.)
• When the exam ends (9:00 PM), stop writing. You must submit your multiple-choice answers before 9:00 PM sharp.
Late multiple-choice submissions will be penalized at a rate of 5 points per minute after 9:00 PM. (The multiple-
choice questions are worth 40 points total.)
• From 9:00 PM, you have 15 minutes to scan the written portion of your exam and turn it into Gradescope via the
assignment “Midterm – Free Response.” Most of you will use your cellphone/pad and a third-party scanning app. If
you have a physical scanner, you may use that. Late written submissions will be penalized at a rate of 10 points per
minute after 9:15 PM. (The written portion is worth 60 points total.)
• Following the exam, you must use Gradescope’s page selection mechanism to mark which questions are on which pages
of your exam (as you do for the homeworks). Please get this done before 2:00 AM. This can be done on a computer
different than the device you submitted with.
• The total number of points is 100. There are 10 multiple choice questions worth 4 points each, and four written questions
worth a total of 60 points.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice,
but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct
answers must be checked.

1
Q1. [40 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.

(a) [4 pts] Which of the following cost functions are smooth—i.e., having continuous gradients everywhere?

A: the perceptron risk function        C: least squares with ℓ2 regularization

B: the sum (over sample points) of logistic losses        D: least squares with ℓ1 regularization

(b) [4 pts] Which of the following changes would commonly cause an SVM’s margin 1/kwk to shrink?

A: Soft margin SVM: increasing the value of C C: Soft margin SVM: decreasing the value of C

B: Hard margin SVM: adding a sample point that D: Hard margin SVM: adding a new feature to
violates the margin each sample point

(c) [4 pts] Recall the logistic function s(γ) and its derivative s′(γ) = (d/dγ) s(γ). Let γ∗ be the value of γ that maximizes s′(γ).

A: γ∗ = 0.25        C: s′(γ∗ ) = 0.5

B: s(γ∗ ) = 0.5        D: s′(γ∗ ) = 0.25

(d) [4 pts] You are running logistic regression to classify two-dimensional sample points Xi ∈ R2 into two classes yi ∈ {0, 1}
with the regression function h(z) = s(w> z + α), where s is the logistic function. Unfortunately, regular logistic regression
isn’t fitting the data very well. To remedy this, you try appending an extra feature, kXi k2 , to the end of each sample
point Xi . After you run logistic regression again with the new feature, the decision boundary in R2 could be

A: a line. C: an ellipse.

B: a circle. D: an S-shaped logistic curve.

(e) [4 pts] We are performing least-squares linear regression, with the use of a fictitious dimension (so the regression function
isn’t restricted to satisfy h(0) = 0). Which of the following will never increase the training error, as measured by the
mean squared-error cost function?

A: Adding polynomial features

B: Using backward stepwise selection to remove some features, thereby reducing validation error

C: Using Lasso to encourage sparse weights

D: Centering the sample points

(f) [4 pts] Given a design matrix X ∈ Rn×d , labels y ∈ Rn , and λ > 0, we find the weight vector w∗ that minimizes
‖Xw − y‖² + λ‖w‖². Suppose that w∗ ≠ 0.

A: The variance of the method decreases if λ in- C: The bias of the method increases if λ increases
creases enough. enough.

B: There may be multiple solutions for w∗ . D: w∗ = X + y, where X + is the pseudoinverse of X.

2
(g) [4 pts] The following two questions use the following assumptions. You want to train a dog identifier with Gaussian
discriminant analysis. Your classifier takes an image vector as its input and outputs 1 if it thinks it is a dog, and 0
otherwise. You use the CIFAR10 dataset, modified so all the classes that are not “dog” have the label 0. Your training
set has 5,000 dog images and 45,000 non-dog (“other”) images. Which of the following statements seem likely to be
correct?

A: LDA has an advantage over QDA because the C: LDA has an advantage over QDA because the
two classes have different numbers of training exam- two classes are expected to have very different covari-
ples. ance matrices.

B: QDA has an advantage over LDA because the D: QDA has an advantage over LDA because the
two classes have different numbers of training exam- two classes are expected to have very different covari-
ples. ance matrices.

(h) [4 pts] This question is a continuation of the previous question. You train your classifier with LDA and the 0-1 loss.
You observe that at test time, your classifier always predicts “other” and never predicts “dog.” What is a likely reason for
this and how can we solve it? (Check all that apply.)

A: Reason: The prior for the “other” class is very C: Solve it by using a loss function that penalizes
large, so predicting “other” on every test point mini- dogs misclassified as “other” more than “others” mis-
mizes the (estimated) risk. classified as dogs.

B: Reason: As LDA fits the same covariance ma- D: Solve it by learning an isotropic pooled covari-
trix to both classes, the class with more examples will ance instead of an anisotropic one; that is, the covari-
be predicted for all points in Rd . ance matrix computed by LDA has the form σ2 I.

(i) [4 pts] We do an ROC analysis of 5 binary classifiers C1 , C2 , C3 , C4 , C5 trained on the training points Xtrain and labels ytrain .
We compute their true positive and false positive rates on the validation points Xval and labels yval and plot them in the
ROC space, illustrated below. In Xval and yval , there are n p points in class “positive” and nn points in class “negative.” We
use a 0-1 loss.

ROC analysis of five classifiers. FPR = false positive rate; TPR = true positive rate.

A: If n p = nn , C2 is the classifier with the highest C: There exists some n p and nn such that C1 is the
validation accuracy. classifier with the highest validation accuracy.

B: If n p = nn , all five classifiers have higher vali- D: There exists some n p and nn such that C3 is the
dation accuracy than any random classifier. classifier with the highest validation accuracy.

3
(j) [4 pts] Tell us about feature subset selection.

A: Ridge regression is more effective for feature C: Stepwise subset selection uses the accuracy on
subset selection than Lasso. the training data to decide which features to include.

B: If the best model uses only features 2 and 4 (i.e., D: Backward stepwise selection could train a
the second and fourth columns of the design matrix), model with only features 1 and 3. It could train a
forward stepwise selection is guaranteed to find that model with only features 2 and 4. But it will never
model. train both models.

4
Q2. [14 pts] Eigendecompositions
(a) [5 pts] Consider a symmetric, square, real matrix A ∈ Rd×d . Let A = VΛV > be its eigendecomposition. Let vi denote the
ith column of V. Let λi denote Λii , the scalar component on the ith row and ith column of Λ.
Consider the matrix M = αA − A2 , where α ∈ R. What are the eigenvalues and eigenvectors of M? (Expressed in terms
of parts of A’s eigendecomposition and α. No proof required.)

(b) [4 pts] Suppose that A is a sample covariance matrix for a set of n sample points stored in a design matrix X ∈ Rn×d , and
that α ∈ R is a fixed constant. Is it always true (for any such A and α) that there exists another design matrix Z ∈ Rn×d
such that M = αA − A2 is the sample covariance matrix for Z? Explain your answer.

(c) [5 pts] In lecture, we talked about decorrelating a centered design matrix Ẋ. We used an eigendecomposition to do that.
Explain (in English, not math) what the eigendecomposition tells us about the sample points, and how that information
helps us decorrelate a design matrix.
The eigenvectors of __________________ tell us __________________________________________.

With this information, we decorrelate the centered design matrix by __________________________________________.

5
Q3. [10 pts] Maximum Likelihood Estimation
There are 5 balls in a bag. Each ball is either red or blue. Let θ (an integer) be the number of blue balls. We want to estimate θ,
so we draw 4 balls with replacement out of the bag, replacing each one before drawing the next. We get “blue,” “red,” “blue,”
and “blue” (in that order).

(a) [5 pts] Assuming θ is fixed, what is the likelihood of getting exactly that sequence of colors (expressed as a function
of θ)?

(b) [3 pts] Draw a table showing (as a fraction) the likelihood of getting exactly that sequence of colors, for every value of θ
from zero to 5 inclusive.

θ        L(θ; ⟨blue, red, blue, blue⟩)


0 ?
1 ?
2 ?
3 ?
4 ?
5 ?

(c) [2 pts] What is the maximum likelihood estimate for θ? (Chosen among all integers; not among all real numbers.)
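
A tiny Python sketch that tabulates the likelihood (θ/5)³(1 − θ/5) of the observed sequence for each integer θ and picks the maximizer:

from fractions import Fraction

# Likelihood of observing <blue, red, blue, blue> when theta of the 5 balls are blue.
def likelihood(theta):
    p_blue = Fraction(theta, 5)
    return p_blue**3 * (1 - p_blue)

for theta in range(6):
    print(theta, likelihood(theta))

best = max(range(6), key=likelihood)
print("MLE:", best)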

6
Q4. [20 pts] Tikhonov Regularization
Let’s take a look at a more complicated version of ridge regression called Tikhonov regularization. We use a regularization
parameter similar to λ, but instead of a scalar, we use a real, square matrix Γ ∈ Rd×d (called the Tikhonov matrix). Given a
design matrix X ∈ Rn×d and a vector of labels y ∈ Rn , our regression algorithm finds the weight vector w∗ ∈ Rd that minimizes
the cost function
J(w) = ‖Xw − y‖₂² + ‖Γw‖₂².

(a) [7 pts] Derive the normal equations for this minimization problem—that is, a linear system of equations whose solution(s)
is the optimal weight vector w∗ . Show your work. (If you prefer, you can write an explicit closed formula for w∗ .)

(b) [3 pts] Give a simple, sufficient and necessary condition on Γ (involving only Γ; not X nor y) that guarantees that J(w)
has only one unique minimum w∗ . (To be precise, the uniqueness guarantee must hold for all values of X and y, although
the unique w∗ will be different for different values of X and y.) (A sufficient but not necessary condition will receive part
marks.)

(c) [5 pts] Recall the Bayesian justification of ridge regression. We impose an isotropic normal prior distribution on the
weight vector—that is, we assume that w ∼ N(0, σ2 I). (This encodes our suspicion that small weights are more likely to
be correct than large ones.) Bayes’ Theorem gives us a posterior distribution f (w|X, y). We apply maximum likelihood
estimation (MLE) to estimate w in that posterior distribution, and it tells us to find w by minimizing ‖Xw − y‖₂² + λ‖w‖₂²
for some constant λ.
Suppose we change the prior distribution to an anisotropic normal distribution: w ∼ N(0, Σ) for some symmetric,
positive definite covariance matrix Σ. Then MLE on the new posterior tells us to do Tikhonov regularization! What value
of Γ does MLE tells us to use when we minimize J(w)?
Give a one-sentence explanation of your answer.

(d) [5 pts] Suppose you solve a Tikhonov regularization problem in a two-dimensional feature space (d = 2) and obtain
a weight vector w∗ that minimizes J(w). The solution w∗ lies on an isocontour of ‖Xw − y‖₂² and on an isocontour of
‖Γw‖₂². Draw a diagram that plausibly depicts both of these two isocontours, in a case where Γ is not diagonal and y ≠ 0.
(You don’t need to choose specific values of X, y, or Γ; your diagram just needs to look plausible.)
Your diagram must contain the following elements:
• The two axes (coordinate system) of the space you are optimizing in, with both axes labeled.
• The specified isocontour of ‖Xw − y‖², labeled.
• The specified isocontour of ‖Γw‖², labeled.
• The point w∗ .
These elements must be in a plausible geometric relationship to each other.
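
For part (a), a small NumPy sketch with hypothetical X, y, and Γ: it solves the linear system (XᵀX + ΓᵀΓ)w = Xᵀy that setting ∇J(w) = 0 leads to, and then checks that the gradient of J vanishes at the solution.

import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
Gamma = np.array([[2.0, 1.0],
                  [0.0, 1.0]])   # a hypothetical, non-diagonal Tikhonov matrix

# Setting the gradient of J(w) = ||Xw - y||^2 + ||Gamma w||^2 to zero gives
# (X^T X + Gamma^T Gamma) w = X^T y, which we solve directly.
w_star = np.linalg.solve(X.T @ X + Gamma.T @ Gamma, X.T @ y)

# Check: the gradient of J at w_star is (numerically) zero.
grad = 2 * X.T @ (X @ w_star - y) + 2 * Gamma.T @ Gamma @ w_star
assert np.allclose(grad, 0, atol=1e-9)
print(w_star)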

7
Q5. [16 pts] Multiclass Bayes Decision Theory
Let’s apply Bayes decision theory to three-class classification. Consider a weather station that constantly receives data from its
radar systems and must predict what the weather will be on the next day. Concretely:

• The input X is a scalar value representing the level of cloud cover, with only four discrete levels: 25, 50, 75, and 100 (the
percentage of cloud cover).
• The station must predict one of three classes Y corresponding to the weather tomorrow. Y = y0 means sunny, y1 means
cloudy, and y2 means rain.
• The priors for each class are as follows: P(Y = y0 ) = 0.5, P(Y = y1 ) = 0.3, and P(Y = y2 ) = 0.2.
• The station has measured the cloud cover on the days prior to 100 sunny days, 100 cloudy days, and 100 rainy days.
From these numbers they estimated the class-conditional probability mass functions P(X|Y):

Prior-Day Cloud Cover (X) Sunny, P(X|Y = y0 ) Cloudy, P(X|Y = y1 ) Rain, P(X|Y = y2 )
25 0.7 0.3 0.1
50 0.2 0.3 0.1
75 0.1 0.3 0.3
100 0 0.1 0.5

• We use an asymmetric loss. Let z be the predicted class and y the true class (label).

L(z, y) = 0 if z = y;        1 if y = y0 and z ≠ y0;        2 if y = y1 and z ≠ y1;        4 if y = y2 and z ≠ y2.

(a) [8 pts] Consider the constant decision rule r0 (x) = y0 , which always predicts y0 (sunny). What is the risk R(r0 ) of the
decision rule r0 ? Your answer should be a number, but show all your work.

(b) [8 pts] Derive the Bayes optimal decision rule r∗ (x)—the rule that minimizes the risk R(r∗ ).
Hint: Write down a table calculating L(z, yi ) P(X|Y = yi ) P(Y = yi ), for each class yi and each possible value of X (12
entries total), in the cases where the prediction z is wrong. Then figure out how to use it to minimize R. This problem
can be solved without wasting time computing P(X).
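For instance, the tabulation the hint describes can be organized in a few lines of Python; the numbers below are copied from the statement above, and the loop is just one convenient way to lay out the 12 products.

```python
# Sketch of the tabulation the hint suggests (values copied from the problem statement).
priors = {"y0": 0.5, "y1": 0.3, "y2": 0.2}
losses = {"y0": 1, "y1": 2, "y2": 4}          # loss when true class y_i is misclassified
cond = {                                       # P(X = x | Y = y_i)
    25:  {"y0": 0.7, "y1": 0.3, "y2": 0.1},
    50:  {"y0": 0.2, "y1": 0.3, "y2": 0.1},
    75:  {"y0": 0.1, "y1": 0.3, "y2": 0.3},
    100: {"y0": 0.0, "y1": 0.1, "y2": 0.5},
}

# Picking the class z that minimizes sum over y != z of L(z, y) P(X = x | Y = y) P(Y = y)
# at each x gives the Bayes-optimal rule.
for x, p in cond.items():
    expected_loss = {
        z: sum(losses[y] * p[y] * priors[y] for y in p if y != z) for z in p
    }
    print(x, min(expected_loss, key=expected_loss.get))
```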

8
CS 189/289A Introduction to Machine Learning
Spring 2022 Jonathan Shewchuk Midterm
• Please do not open the exam before you are instructed to do so. Fill out the blanks below now.
• Electronic devices are forbidden on your person, including phones, laptops, tablet computers, headphones, and calcu-
lators. Turn your cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
Exceptions are made for car keys and devices needed because of disabilities.
• When you start, the first thing you should do is check that you have all 7 pages and all 4 questions. The second
thing is to please write your initials at the top right of every page after this one (e.g., write “JS” if you are Jonathan
Shewchuk).

• The exam is closed book, closed notes except your one cheat sheet.
• You have 80 minutes. (If you are in the Disabled Students’ Program and have an allowance of 150% or 200% time, that
comes to 120 minutes or 160 minutes, respectively.)
• Mark your answers on the exam itself in the space provided. Do not attach any extra sheets. If you run out of space for
an answer, write a note that your answer is continued on the back of the page.
• The total number of points is 100. There are 12 multiple choice questions worth 4 points each, and 3 written questions
worth a total of 52 points.
• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice,
but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct
answers must be checked.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

1
Q1. [48 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.

(a) [4 pts] Select the true statements about Bayes decision theory.

A: The risk for a decision rule is the average loss over the training points that are in class C.
B: The Bayes decision boundary between two classes, if you’re using the 0-1 loss, is the set of points x where P(X = x|Y = 0) = P(X = x|Y = 1).
C: If the Bayes risk is nonzero in a two-class classification problem, then the distributions for each class (i.e., P(X|Y = C) and P(X|Y ≠ C)) must overlap.
D: There exists a loss function for which the Bayes decision rule might select the class with lower posterior probability.

(b) [4 pts] Select the true statements about least-squares linear regression.

A: The problem of minimizing ‖Xw − y‖₁ often yields a “sparse” solution, where some of the components of w are exactly zero.
B: There is always at least one solution to the normal equations.
C: There are problems for which the normal equations have exactly two distinct solutions.
D: When the normal equations have multiple solutions, all the solutions have the same loss on test points.

(c) [4 pts] Select the true statements about ROC curves.

A: The horizontal axis represents posterior probability thresholds and the vertical axis represents test set accuracy.
B: The ROC curve is a better guide for choosing a threshold (separating negative from positive classifications) on real-world data than the threshold suggested by decision theory.
C: A ROC curve closer to the diagonal line y = x implies that your classifier’s risk is closer to Bayes optimal.
D: There are (at least) two points on a ROC curve that are not affected by changes in the model. (Note: we are not counting the specific choice of threshold between positive and negative as part of the model.)

(d) [4 pts] Ridge regression is

A: a way to perform feature selection, as ridge regression encourages weights to be exactly zero.
B: a method in which bias tends to increase, and variance tends to decrease, as we increase the regularization parameter λ.
C: motivated by imposing a Gaussian prior probability on the weight vector.
D: a method whose cost function has a unique minimum (assuming λ > 0).

(e) [4 pts] Select the statements that are true for every real symmetric matrix X ∈ Rn×n .

A: X can be factored as X = UDUᵀ, where U is an orthogonal matrix and D is a diagonal matrix.
B: X can be factored as X = UUᵀ, where U is an orthogonal matrix.
C: λmax(X) ≥ 0, where λmax(X) denotes the greatest eigenvalue of X.
D: aᵀXa ≤ λmax(X) ‖a‖₂² for all a ∈ Rn.

2
(f) [4 pts] Below are 1,000 sample points drawn from a two-dimensional multivariate normal distribution. Which of the
following matrices could (without extreme improbability) be the covariance matrix of the distribution? (Pay attention to
the numbers on the axes!)

[Scatter plot of the 1,000 sample points; the horizontal axis runs from −30 to 30 and the vertical axis from −0.3 to 0.3.]

A: Σ = [ 100  0 ;  0  0.01 ]
B: Σ = [ 1  0 ;  0  1 ]
C: Σ = [ 10  0 ;  0  0.1 ]
D: Σ = [ −10  0 ;  0  −0.1 ]

(g) [4 pts] You are training a soft-margin SVM on a binary classification problem. You find that your model’s training
accuracy is very high, while your validation accuracy is very low. Which of the following are likely to improve your
model’s performance on the validation data?

A: Training your model on more data.
B: Adding a quadratic feature to each sample point.
C: Increasing the hyperparameter C.
D: Decreasing the hyperparameter C.

(h) [4 pts] Select the true statements about Gaussian Discriminant Analysis.

A: If a class-conditional covariance matrix is anisotropic (the eigenvalues are not equal), the decision boundary is guaranteed to be nonlinear.
B: The QDA posterior probability is a logistic function composed with (applied to) a quadratic function of the feature space.
C: QDA is more prone to overfitting than LDA.
D: The Bayes decision boundary arising from two normally distributed classes can split the feature space into at most two regions.

(i) [4 pts] Select the true statements about finding a minimum of a cost function f (x).

A: Newton’s method always converges to a globally minimum solution for any twice-differentiable function f.
B: For the cost function f (x) = δ‖x − b‖² + γ with δ > 0, Newton’s method always converges to a globally minimum solution.
C: If f is convex, is differentiable, and has exactly one local minimum, then (batch) gradient descent always converges to that minimum for any choice of learning rate.
D: It is not possible to execute an iteration of Newton’s method on the perceptron risk function.

3
(j) [4 pts] In the following statements, the word “bias” is referring to the bias-variance decomposition. Select the true ones.

A: A model trained with n training points is likely to have lower variance than a model trained with 2n training points.
B: If my model is underfitting, it is more likely to have high bias than high variance.
C: Increasing the number of parameters (weights) in a model usually improves the test set accuracy.
D: Adding ℓ2-regularization usually reduces variance in linear regression.

(k) [4 pts] Which of the following statements are true regarding Lasso regression?

A: Lasso’s optimization problem can be stated as a quadratic program.
B: The cost function minimized by Lasso has points where its gradient is not well-defined, and the solution (minimum) is often at such a point.
C: Lasso often produces sparser results (more zero weights) than ridge regression.
D: A version of Lasso using a penalty term of λ‖w‖_{0.5} (that is, the ℓ0.5-norm) will be more inclined to produce sparse solutions than Lasso.

(l) [4 pts] Let X be an n × d design matrix where n = 10 and d = 12, representing information about various loan borrowers.
Let y ∈ Rn be a vector of labels such that yi represents the time (in days) between when borrower i took a loan and when
it was fully repaid. We would like to train a regression model on this data. Which of the following methods would be
reasonable choices for this task?

A: Least squares linear regression with the solution w∗ = (XᵀX)⁻¹Xᵀy
B: Logistic regression
C: Least squares linear regression using the Moore–Penrose pseudoinverse, w∗ = X†y
D: Ridge regression

4
Q2. [17 pts] Gaussian Discriminant Analysis
You want to create a model to predict student performance on the CS 189/289A Midterm. You survey several past students and
record how many hours they studied for the exam, and whether or not they passed, yielding the two classes.

Passed: [4, 5, 5.5, 6.5, 7, 8]


Failed: [0, 1, 2, 3, 4]

The hours spent studying is the only feature we have for each student (d = 1). Assume that the number of hours is normally
distributed for both the passing and failing students. Consider two ways of modeling this data: Linear Discriminant Analysis
(LDA) and Quadratic Discriminant Analysis (QDA). Use the 0-1 loss function to define risk.

(a) [8 pts] Calculate the sample means µp, µf and the variances σp², σf² computed for QDA. (The subscripts mean “pass” and
“fail.”) Express your answers as the simplest fractions (not decimals) possible.

(b) [4 pts] Calculate the sample means and variances used by LDA. Express your answers as the simplest fractions (not
decimals) possible.

(c) [5 pts] Calculate the decision boundary for LDA. Use fractions, not decimals, and express the answer in as simple a form
as possible (but expect it to have a logarithm in it).
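A minimal NumPy sketch for checking the arithmetic in parts (a) and (b); the pooled-variance convention here (dividing the total within-class squared deviation by the total number of points) is the usual MLE one and is an assumption of this sketch.

```python
import numpy as np

# Sketch: compute the QDA/LDA sample statistics for the study-hours data above.
passed = np.array([4, 5, 5.5, 6.5, 7, 8])
failed = np.array([0, 1, 2, 3, 4])

mu_p, mu_f = passed.mean(), failed.mean()
# MLE (biased) per-class variances, as used in QDA.
var_p = ((passed - mu_p) ** 2).mean()
var_f = ((failed - mu_f) ** 2).mean()
# LDA pools the within-class variance across both classes.
pooled = (((passed - mu_p) ** 2).sum() + ((failed - mu_f) ** 2).sum()) / (len(passed) + len(failed))

print(mu_p, mu_f, var_p, var_f, pooled)
```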

5
Q3. [15 pts] Symmetric Matrices
(a) [6 pts] Derive the 2 × 2 symmetric matrix whose eigenvalues are 5 and 2, such that (2, −1) is an eigenvector with
eigenvalue 5.
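A short NumPy sketch of the general construction via the spectral decomposition; taking the second eigenvector orthogonal to (2, −1) is one convenient choice assumed here.

```python
import numpy as np

# Sketch: build a symmetric matrix with prescribed eigenvalues/eigenvectors
# via the spectral decomposition A = V diag(lambda) V^T with orthonormal V.
v1 = np.array([2.0, -1.0])
v2 = np.array([1.0, 2.0])          # any vector orthogonal to v1 works here
V = np.column_stack([v1 / np.linalg.norm(v1), v2 / np.linalg.norm(v2)])
A = V @ np.diag([5.0, 2.0]) @ V.T

print(A)                            # symmetric, with the requested eigenpairs
print(np.linalg.eigvalsh(A))        # [2., 5.]
```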

(b) [6 pts] Consider the two-dimensional bivariate normal distribution N(0, Σ) where the covariance matrix Σ is the matrix
you derived in part (a) and the mean is µ = 0. Let f (x) be the PDF of that normal distribution, where x ∈ R2 . What are
the lengths of the major and minor axes of the ellipse f(x) = 1/(4π√10)?
Justify your answer.

(c) [3 pts] Consider a cost function J(w) over a weight vector w, and suppose that at every point w ∈ Rd , the Hessian matrix
∇2 J is positive definite. Is it always true that J(w) has exactly one unique local minimum w∗ ∈ Rd ? Why or why not?

6
Q4. [20 pts] Linear Regression with Laplacian Noise
In lecture, we saw how least-squares regression is motivated by maximum likelihood estimation if we think our data obeys
a linear relationship but has added noise that is normally distributed. But what if the noise is better modeled by the Laplace
distribution (which you reviewed in Homework 4)?

Let ε ∼ Laplace(µ, β) indicate a random variable ε drawn from a univariate Laplace distribution with mean µ and scale parameter β. The PDF of this distribution is

f(ε; µ, β) = (1/(2β)) exp(−|ε − µ|/β).

Following our customary notation, the input is an n × d design matrix X and a vector y such that yi is the label for sample
point Xi , where Xi> is the ith row of X. To keep things simple, we will do linear regression through the origin (no bias term α),
so the regression function is h(x) = w· x. Our model is that each label yi comes from a linear relationship perturbed by Laplacian
noise,
yi ∼ Laplace(w · Xi , β),
where w ∈ Rd is the true linear relationship. We will use maximum likelihood estimation to try to estimate w.

(a) [5 pts] Write the likelihood function L(w; X, y) for the parameter w, given the fixed data X and y.
(b) [3 pts] Write the log likelihood function `(w; X, y) for the parameter w, given the fixed data X and y, in as simple a form
as you can. (Make sure your logarithms have the correct base.)
(c) [3 pts] What is the simplest cost function we can minimize that gives us the same value of w as maximizing the likelihood?
(d) [4 pts] How is the cost function you just derived different from standard least-squares regression? Is it more or less
sensitive to outliers? Why?
(e) [5 pts] Write the batch gradient descent rule for minimizing your cost function, using η for the step size (aka learning
rate). You may omit training points whose losses have undefined gradients. Hint: Recall that d|α|/dα is 1 for α > 0, −1 for
α < 0, and undefined for α = 0.
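As a sketch of how such an update could be coded, the function below takes one batch (sub)gradient step on the cost Σᵢ |yᵢ − w · Xᵢ| (one natural candidate for part (c)) and skips zero-residual points, as the hint allows; the function name and data layout are illustrative.

```python
import numpy as np

# Sketch: one batch (sub)gradient step for the cost sum_i |y_i - X_i . w|,
# skipping points whose residual is exactly zero (undefined gradient).
def l1_gradient_step(w, X, y, eta):
    residuals = y - X @ w
    mask = residuals != 0
    # d/dw |y_i - X_i.w| = -sign(y_i - X_i.w) * X_i
    grad = -(np.sign(residuals[mask])[:, None] * X[mask]).sum(axis=0)
    return w - eta * grad
```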

7
CS 189 Introduction to
Summer 2019 Machine Learning Midterm
• Please do not open the exam before you are instructed to do so.
• The exam is closed book, closed notes except your two-page cheat sheet.

• Electronic devices are forbidden on your person, including cell phones, iPods, headphones, and laptops. Turn your
cell phone off and leave all electronics at the front of the room, or risk getting a zero on the exam.
• You have 3 hours.
• Please write your initials at the top right of each page after this one (e.g., write “MK” if you are Marc Khoury). Finish
this by the end of your 3 hours.
• Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.
• The total number of points is 150. There are 26 multiple choice questions worth 3 points each, and 5 written questions
worth a total of 72 points.

• For multiple answer questions, fill in the bubbles for ALL correct choices: there may be more than one correct choice,
but there is always at least one correct choice. NO partial credit on multiple answer questions: the set of all correct
answers must be checked.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

1
Q1. [60 pts] Multiple Answer
Fill in the bubbles for ALL correct choices: there may be more than one correct choice, but there is always at least one correct
choice. NO partial credit: the set of all correct answers must be checked.

(a) [3 pts] Let X ∼ Bernoulli(1/(1 + exp θ)) for some θ ∈ R. What is the MLE estimator of θ?

X        0        1        Does not exist.

(b) [3 pts] Let Y ∼ N(Xθ, In ) for some unknown θ ∈ Rd and some known X ∈ Rn×d that has full column rank and d < n.
What is the MLE estimator of θ?

(XᵀX)⁻¹XᵀY        Y + Z, ∀Z ∈ Null(X)

Xᵀ(XXᵀ)⁻¹Y        Does not exist.

(c) [3 pts] Let f(x) = −Σᵢ₌₁ⁿ xᵢ log xᵢ. For some x such that Σᵢ₌₁ⁿ xᵢ = 1 and xᵢ > 0, the Hessian of f is:

positive definite
negative definite
positive semidefinite
negative semidefinite
indefinite (neither positive semidefinite nor negative semidefinite)
invertible
nonexistent
None of the above.

(d) [3 pts] Which of the following statements about optimization algorithms are correct?

Newton’s method always requires fewer iterations than gradient descent.

Stochastic gradient descent always requires fewer iterations than gradient descent.

Stochastic gradient descent, even with small step size, sometimes increases the loss in some iteration for convex
problems.

Gradient descent, regardless of the step size, decreases the loss in every iteration for convex problems.

(e) [3 pts] Assume we run the hard-margin SVM algorithm on 100 d-dimensional points from 2 different classes. The
algorithm outputs a solution. After which transformation to the training data would the algorithm still output a solution?

Centering the data points
Transforming each data point from x to Ax for some matrix A ∈ Rd×d
Dividing all entries of each data point by some negative constant c
Adding an additional feature

(f) [3 pts] Which of the following holds true when running an SVM algorithm?

Increasing or decreasing the α value only allows the decision boundary to translate.
Given n-dimensional points, the SVM algorithm finds a hyperplane passing through the origin in the (n + 1)-dimensional space that separates the points by their class.
The decision boundary rotates if we change the constraint to wᵀx + α ≥ 3.

2
The set of weights that fulfill the constraints of the SVM algorithm is convex.

(g) [3 pts] Consider the set {x ∈ Rd : (x − µ)ᵀΣ(x − µ) = 1} given some vector µ ∈ Rd and matrix Σ ∈ Rd×d. Which of the
following are true?

If Σ is the identity matrix scaled by some constant c, then the set is isotropic.
Increasing the eigenvalues of Σ increases the radii of the ellipsoid.
Increasing the eigenvalues of Σ decreases the radii of the ellipsoid.
A singular Σ produces an ellipsoid with an infinite radius.

(h) [3 pts] Consider the linear regression problem with a full-rank design matrix. Which of the following regularizers, in
general, encourage more sparsity than the non-regularized objective?

L0 regularization (number of the non-zero coordinates)
L1 regularization
L2 (Tikhonov) regularization
L3 regularization
L4 regularization
L∞ regularization (the maximum absolute value across all coordinates)

(i) [3 pts] Which of the following statements are correct?

In ridge regression, the regularization parameter λ is usually set as 0.1.

SVM in general does not enforce sparsity over the parameters w and α.

In binary linear classification, the support vectors of SVM might contain samples from only one class even if
training data has both classes.

In binary linear classification, suppose 1{wᵀx + α ≥ 0} is one maximum margin linear classifier; then the margin
only depends on w but not α.

(j) [3 pts] In binary classification (+1 and −1), suppose our data is linearly separable and the data matrix has full column
rank (n > d). Which of the following formulations is guaranteed to find a linear classifier that achieves 0 training error?
Note that in the regression options, the prediction rule would still be 1{wᵀx + α ≥ 0}.

Logistic regression Linear regression with square loss

SVM Perceptron

Lasso None of the above

(k) [3 pts] Analogous to positive semi-definiteness, an n × n real symmetric matrix B is called negative semi-definite if
xᵀBx ≤ 0 for all vectors x ∈ Rn. Which of the following conditions guarantee B is negative semi-definite?

B has all negative entries
The largest eigenvalue of B is ≤ 0
B = A⁻¹, where A is positive semi-definite
B = −AᵀA for some matrix A

(l) [3 pts] Consider two classes whose class conditionals are the scalar normal distributions N(µ1 , σ2 ) and N(µ2 , σ2 ) re-
spectively, where µ1 < µ2 . Given some non-zero priors π1 and π2 , recall the Bayes’ optimal decision boundary will be a
single point, x∗ . Which of the following changes, holding everything else constant, would cause x∗ to increase?

3
Decreasing µ1 Increasing µ2

Increasing σ Increasing π1 while decreasing π2

(m) [3 pts] Let Σ be a positive definite matrix with eigenvalues λ1 , . . . , λd . Consider the quadratic function g(x) = xᵀ(cΣ⁻²)x,
for some constant c > 0. What are the lengths of the radii of the ellipsoid at which g(x) = 1?

c^(−1/2) · λᵢ
c^(1/2) · λᵢ
c · λᵢ^(−1)
c^(1/2) · λᵢ^(−1/2)

(n) [3 pts] Let X ∼ N(µ, Σ) be a multivariate normal random variable. Which of the following statements about linear
functions of X are always true, where A is some square matrix and b a vector?

Var(AX) = A²Σ
AX + b is also multivariate normal
AX is isotropic if A = Σ⁻¹
E[AX + b] = Aµ + b

(o) [3 pts] You have trained four binary classifiers A, B, C, and D, observing the following ROC curves when evaluating
them:

We say that a classifier G strictly dominates a classifier H if G’s true positive rate is always greater than H’s true positive rate
for all possible false positive rates in (0, 1).

Mark all of the below relations between (A, B, C, and D) which are true under this definition.

C strictly dominates D A strictly dominates B

D strictly dominates C B strictly dominates C

B strictly dominates A D strictly dominates A

(p) [3 pts] We are doing binary classification on classes {1, 2}. We have a single dataset of size N of which a fraction α of
the elements are in class 1. To construct a test set, we randomly choose a fraction β of the dataset to put in the test set,
keeping the remaining elements in a training set.
We would like to avoid the situation where in the training or test sets, either class appears less than 0.1N times. In which
of the following situations does this occur, in expectation?

4
α = 50%, β = 50% α = 70%, β = 20%

α = 20%, β = 70% α = 30%, β = 60%

α = 40%, β = 60% α = 60%, β = 80%

(q) [3 pts] Assume that for a k-class problem, all classes have the same prior probability, i.e. π = [1/k, . . . , 1/k]. You build two
different models:
• (Model A) You train QDA once for all k classes, and to classify a data point you return the class with the highest
posterior probability.

• (Model B) You train QDA pairwise, k(k − 1)/2 times, restricting the training data each time to only the data points from
two of the k classes. To classify a test point, you return the class that has the higher posterior probability most often
from the k(k − 1)/2 independent models.

Mark all of the following which are true in general, in the comparison of bias and variance between models A and B:

B has higher bias than A B has higher variance than A

B has the same bias as A B has the same variance as A

B has lower bias than A B has lower variance than A

(r) [3 pts] You observe the following train and test error as a function of model complexity p for three different models:

Mark the values of p and models where the test and train error indicate overfitting.

model A at p = 10 model A at p = 20 model A at p = 30

model B at p = 10 model B at p = 20 model B at p = 30

model C at p = 10 model C at p = 20 model C at p = 30

(s) [3 pts] Which models, if any, appear to be underfit for all settings of p?

A B C

(t) [3 pts] Consider the minimum possible bias for each model over all settings of p for 0 ≤ p ≤ 30. Which of the following
are true in comparing the minimum bias between the three models?

5
Model A has higher minimum bias than B Model B has higher minimum bias than C

Model A has the same minimum bias as B Model B has the same minimum bias as C

Model A has lower minimum bias than C Model B has lower minimum bias than C

6
Q2. [10 pts] Comparing Classification Algorithms
Find the decision boundary given by the following algorithms. Provide a range of values if the algorithm allows for multiple
feasible decision boundaries. If there exists no feasible decision boundary, state “None.”

(a) [1 pt] Perceptron: X1 =

(b) [2 pts] Hard-Margin SVM: X1 =

(c) [2 pts] Linear Discriminant Analysis: X1 =

(d) [1 pt] Perceptron: X1 =

(e) [2 pts] Hard-Margin SVM: X1 =

(f) [2 pts] Linear Discriminant Analysis: X1 =

7
Q3. [15 pts] Binary Image Classification
A binary image is a digital image where each pixel has only two possible values: zero (white) or one (black). A binary image,
which consists of a grid of pixels, can therefore naturally be represented as a vector with entries in {0, 1}.

[Figure: a binary image, shown as a grid of pixels, flattened into a vector in {0, 1}^d.]

In this problem, we consider a classification scheme based on a simple generative model. Let X be a random binary image,
represented as a d-dimensional binary vector, drawn from one of two classes: P or Q. Assume every pixel Xi is an independent
Bernoulli random variable with parameter pi and qi when drawn from classes P and Q respectively.

Xi | Y = P ∼ Bernoulli(pi ) independently for all 1 ≤ i ≤ d


Xi | Y = Q ∼ Bernoulli(qi ) independently for all 1 ≤ i ≤ d

(a) [1 pt] Of course, when working with real data, the true parameters pi and qi will be unknown and therefore must be
estimated from the data. Given the following 5 2-dimensional training points from class P, find the maximum likelihood
estimates of p1 and p2 .

(0, 0), (0, 1), (0, 1), (1, 1), (0, 1)

p̂1^MLE =                    p̂2^MLE =
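A tiny NumPy check of part (a): for independent Bernoulli pixels, maximizing the likelihood reduces to taking per-pixel sample means, which the sketch below computes for the five points given (the variable names are illustrative).

```python
import numpy as np

# Points copied from part (a); rows are training images, columns are pixels.
P_train = np.array([[0, 0], [0, 1], [0, 1], [1, 1], [0, 1]])

# The per-pixel sample means are the MLE estimates of (p1, p2).
p_hat = P_train.mean(axis=0)
print(p_hat)
```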

Important. The other parameters and priors could similarly be estimated. For the remainder of the problem, however, we focus
on the ideal case, where the true values of pi and qi , along with priors π p and πq , are known.

(b) [1 pt] Fill in the blanks in the statement below.

To minimize risk with the (symmetric) 0-1 loss function, we should pick the class with the
probability, which gives the Bayes’ optimal classifier.

8
(c) [4 pts] Given an image x ∈ {0, 1}d , compute the probabilities Pr(X = x|Y = P) and Pr(X = x|Y = Q) in terms of the priors,
image pixels and/or class parameters. Your answer must be a single expression for each probability.

(d) [2 pts] In terms of the probabilities above, write an equation which holds if and only if x is at the decision boundary of
the Bayes’ optimal classifier, assuming a (symmetric) 0-1 loss function. No simplification is necessary for full credit.

(e) [7 pts] It turns out that the decision boundary derived above is actually linear in the features of x, so for some vectors w
and scalar b, it can be succinctly expressed as:

{x ∈ {0, 1}d : wᵀx + b = 0}

Find the entries of the vector w and value of b in terms of class priors and parameters, using them to fill in the blanks on
the line below.

wi =                    b =

9
Q4. [15 pts] Gaussian Mean Estimation
Suppose Y ∈ Rd is a random variable distributed as N(θ, Id×d ) for some unknown θ ∈ Rd . We observe a sample y ∈ Rd of Y and
want to estimate θ.

(a) [2 pts] What is the maximum likelihood estimator(MLE) of θ? Write down the answer in the box.

Now we are going to use ridge regression to solve this problem. Namely, solve min_θ { ‖y − θ‖² + (1/2)λ‖θ‖² } to get an estimate of θ.

(b) [2 pts] What is the closed form of estimator θ̂(y) from ridge regression with regularization parameter λ? Write down the
answer in the box.

(c) [4 pts] Derive the population risk E‖Y − θ̂(y)‖² for the ridge regression estimator (expectation is taken with respect to all the
randomness, including the test-time Y and the training sample y).

10
(d) [3 pts] What is the population risk for the MLE estimator? Write down the answer in the box.

(e) [1 pt] Suppose we choose λ = d. Find the condition on θ under which the ridge regression estimator has a lower risk than
the MLE estimator.

(f) [2 pts] This implies that the MLE estimator, although it seems to be the most natural estimator, does not always achieve the
lowest risk. Briefly explain the reasons behind this fact.

(g) [1 pt] Based on the previous parts, write down one potential advantage of ridge regression over ordinary least squares
regression (namely, why we sometimes add the regularization term).
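The phenomenon behind parts (e)–(g) is easy to see by simulation. In the sketch below, the shrinkage factor c stands in for whatever closed form you derive in part (b); the particular θ, dimension, and factor are illustrative assumptions.

```python
import numpy as np

# Sketch: Monte Carlo comparison of the MLE (theta_hat = y) against a shrinkage
# estimator theta_hat = c * y, which is the general form the ridge solution takes.
rng = np.random.default_rng(0)
d, trials = 10, 20000
theta = np.full(d, 0.3)                 # a "small" true mean; try larger values too
c = 0.5

Y_test = theta + rng.standard_normal((trials, d))    # fresh draws of Y
y_train = theta + rng.standard_normal((trials, d))   # training samples y

risk_mle = ((Y_test - y_train) ** 2).sum(axis=1).mean()
risk_shrink = ((Y_test - c * y_train) ** 2).sum(axis=1).mean()
print(risk_mle, risk_shrink)   # shrinkage wins when ||theta|| is small enough
```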

11
Q5. [12 pts] Estimation of Linear Models
In all of the following parts, write your answer as the solution to a norm minimization problem, potentially with a regularization
term. You do not need to solve the optimization problem. Simplify any sums using matrix notation for full credit.
Hint: Recall that the MAP estimator maximizes P(θ|Y): θ̂ = arg max P(Y|θ)P(θ)/P(Y) = arg maxθ∈Rd P(Y|θ)P(θ). The differ-
ence between MAP and MLE is the inclusion of a prior distribution on θ in the objective function.

For the following problems assume you are given X ∈ Rn×d and y ∈ Rn as training data.

(a) [3 pts] Let y = Xθ + ε where ε ∼ N(0, Σ) for some positive definite, diagonal Σ. Write the MLE estimator of θ as the
solution to a weighted least squares problem, potentially with a regularization term.

θ̂ = arg minθ∈Rd

(b) [3 pts] Let y|θ ∼ N(Xθ, Σ) for some positive definite, diagonal Σ. Let θ ∼ N(0, λId ) for some λ > 0 be the prior on
θ. Write the MAP estimator of θ as the solution to a weighted least squares minimization problem, potentially with a
regularization term.

θ̂ = arg minθ∈Rd

(c) [3 pts] Let y = Xθ + ε where εi ∼ Laplace(0, 1) i.i.d. Recall that the pdf for Laplace(µ, b) is p(x) = (1/(2b)) exp(−|x − µ|/b). Write
down the MLE estimator of θ as the solution to a norm minimization optimization problem.

θ̂ = arg minθ∈Rd

12
(d) [3 pts] Let y|θ ∼ N(Xθ, Σ) for some positive definite, diagonal Σ. Let θi ∼ Laplace(0, λ) i.i.d. for some positive scalar
λ. Write the MAP estimator of θ as the solution to a weighted least squares minimization problem, potentially with a
regularization term.

θ̂ = arg minθ∈Rd
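For the parts whose answer takes the form argmin_θ ‖W(y − Xθ)‖² + λ‖θ‖² (a diagonal weighting plus a quadratic penalty), the estimator can be solved or numerically checked in a few lines of NumPy; W, λ, and the random data below are illustrative placeholders, not the answer to any specific part.

```python
import numpy as np

# Generic template: solve argmin_theta ||W (y - X theta)||^2 + lam * ||theta||^2
# for a diagonal weighting W (e.g. Sigma^{-1/2} when Sigma is diagonal).
rng = np.random.default_rng(0)
n, d = 30, 4
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)
W = np.diag(1.0 / rng.uniform(0.5, 2.0, size=n))   # illustrative weights
lam = 0.1

A = W @ X                       # whitened design matrix
b = W @ y                       # whitened targets
theta_hat = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
print(theta_hat)
```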

13
Q6. [9 pts] Bias-Variance for Least Squares
For this problem, we would like to analyze the performance of linear regression on our given data (X, Ỹ), where X ∈ Rn×d and
Ỹ ∈ Rn .

The original labels Y perfectly fit a linear model of the original data, i.e. Y = Xβ. However, we do not know the original labels; we only
know Ỹ = Y + ε, where ε ∼ N(0, In ), i.e. we only observe ỹs that are distorted from the actual ys by mean-zero, independent,
variance-one Gaussian noise.

Recall that via least squares, the predicted regression coefficients are β̃ = (XᵀX)⁻¹XᵀỸ.

(a) [1 pt] For a test data point (z, y) ∈ (Rd , R), call the predicted value from our model f˜(z). What is f˜(z)?

(b) [1 pt] Note that for our test data point z, we do not observe the true value y ever, only ỹ = y + εy where εy ∼ N(0, 1). We
would like to calculate the expected squared-error E[(ỹ − f˜(z))2 ]. Apply the Bias-Variance decomposition to decompose
this expected squared error into three pieces; you do not have to simplify further. Clearly label what the three pieces
correspond to.

(c) [2 pts] Derive the bias. Show your work.

(d) [3 pts] Show that the variance is zᵀ(XᵀX)⁻¹z.

(e) [2 pts] Argue why in this case the variance is always at least the bias, for any potential test data point z.
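A small Monte Carlo sketch that checks parts (c) and (d) empirically; the random X, β, and z here are illustrative, and the noise is drawn fresh in each trial as in the setup above.

```python
import numpy as np

# Sketch: empirically check the bias and variance of the least-squares prediction
# at a test point z, under the model described above.
rng = np.random.default_rng(0)
n, d, trials = 50, 3, 5000
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
z = rng.standard_normal(d)

preds = np.empty(trials)
for t in range(trials):
    Y_tilde = X @ beta + rng.standard_normal(n)            # Y + eps
    beta_tilde = np.linalg.lstsq(X, Y_tilde, rcond=None)[0]
    preds[t] = z @ beta_tilde

print(preds.mean() - z @ beta)                              # ~0: unbiased prediction
print(preds.var(), z @ np.linalg.solve(X.T @ X, z))         # close to z^T (X^T X)^{-1} z
```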

14
Q7. [5 pts] Estimates of Variance
Assume that data points X1 , . . . , Xn are sampled i.i.d from a normal distribution N(0, σ2 ). You know that the mean of this
distribution is 0, but you do not know the variance σ2 .

Recall that if a variable X ∼ N(µ, σ²), its probability density function is:

fX(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))

(a) [3 pts] Write the expression for the log-likelihood ln P(X1 , . . . , Xn | σ2 ). Simplify as much as possible.

(b) [2 pts] Find the σ̂2 that maximizes this expression, i.e. the MLE.
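One way to check whatever closed form you derive in (b) is to evaluate the log-likelihood from (a) on a grid and locate its maximizer numerically; the sample data and grid below are illustrative.

```python
import numpy as np

# Sketch: numerically locate the sigma^2 that maximizes the log-likelihood above.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 2.0, size=200)          # illustrative data with true sigma^2 = 4

def log_likelihood(sigma2):
    return -0.5 * len(x) * np.log(2 * np.pi * sigma2) - (x ** 2).sum() / (2 * sigma2)

grid = np.linspace(0.5, 10, 2000)
print(grid[np.argmax(log_likelihood(grid))])   # compare with your answer to (b)
```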

15
Q8. [6 pts] Train and Test Error
Assume a general setting for regression with arbitrary loss function L(y, ŷ) ≥ 0. We have devised a family of models rθ : Rd → R
parameterized by θ ∈ Θ.

Let {(x1 , y1 ), . . . , (xn , yn )} be a training set and {( x̃1 , ỹ1 ), . . . , ( x̃m , ỹm )} be a test set, both sampled from the same joint distribution
(X, Y) ∈ (Rd , R). Then we have R_tr(θ) = (1/n) Σⁿᵢ₌₁ L(rθ (xi ), yi ) and R_te^(m)(θ) = (1/m) Σᵐᵢ₌₁ L(rθ ( x̃i ), ỹi ) as the train and test error,
respectively, for a given setting of θ.

We have found the optimal θ̂ = argmin_{θ∈Θ} R_tr(θ), and would like to show that

E[R_tr(θ̂)] ≤ E[R_te^(m)(θ̂)]

(a) [2 pts] Show that E[R_te^(m)(θ̂)] is the same regardless of the size of the test set m.

(b) [2 pts] Due to the previous part, we can work with a test set that is the same size as the training set. Argue that

E[R_tr(θ̂)] = E[min_{θ∈Θ} R_te^(n)(θ)]

(c) [1 pt] Argue that E[min_{θ∈Θ} R_te^(n)(θ)] ≤ E[R_te^(n)(θ̂)], completing the proof.

(d) [1 pt] True or False: For all training and test datasets,

R_tr(θ̂) ≤ R_te(θ̂)

True False
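A small simulation sketch of the whole statement, using a deliberately tiny model family (predict a single constant, squared loss) as an illustrative stand-in for rθ: the expected gap is positive, yet individual datasets can violate the inequality in (d).

```python
import numpy as np

# Sketch: compare training and test error of the risk-minimizing constant predictor.
rng = np.random.default_rng(0)
n, trials = 10, 5000
tr_wins = 0
gap = []
for _ in range(trials):
    train = rng.standard_normal(n)
    test = rng.standard_normal(n)
    c_hat = train.mean()                      # argmin of the training risk
    r_tr = ((train - c_hat) ** 2).mean()
    r_te = ((test - c_hat) ** 2).mean()
    gap.append(r_te - r_tr)
    tr_wins += r_tr > r_te

print(np.mean(gap) > 0)     # True: expected test error exceeds expected training error
print(tr_wins / trials)     # but nonzero: (d) is False for some particular datasets
```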

16
