
CS 189 Final Note Sheet: Rishi Sharma, Peter Gao, et al.

Support Vector Machines
In the strictly separable case, the goal is to find a separating hyperplane (as in logistic regression), except now we don't want just any hyperplane: we want the one with the largest margin.

Error Functions
Classify y ∈ {0, 1}; model p(y) = μ^y (1 − μ)^{1−y}.
Cross-entropy loss: −Σ_{i=1}^{n_out} [ y log h_θ(x) + (1 − y) log(1 − h_θ(x)) ]

LDA/QDA shared-covariance MLE: Σ_MLE = (1/m) Σ_{i=1}^m (x^{(i)} − μ_{y^{(i)}})(x^{(i)} − μ_{y^{(i)}})^T

Neural Network Notation:
1. w_{ij}^{(l)} is the weight from neuron i in layer l − 1 to neuron j in layer l. There are d^{(l)} nodes in the l-th layer.
2. There are L layers, where layer L is the output layer and the data is the 0th layer.
3. x_j^{(l)} = θ(s_j^{(l)}) is the output of a neuron: the activation function applied to the input signal s_j^{(l)} = Σ_i w_{ij}^{(l)} x_i^{(l−1)}.
4. e(w) is the error as a function of the weights.

Bayesian Decision Theory
Bayes rule: P(ω|x) = P(x|ω)P(ω)/P(x), where P(x) = Σ_i P(x|ω_i)P(ω_i)
P(x, ω) = P(x|ω)P(ω) = P(ω|x)P(x)
P(error) = ∫ P(error|x)P(x) dx
P(error|x) = P(ω_1|x) if we decide ω_2;  P(ω_2|x) if we decide ω_1
0-1 loss: λ(α_i|ω_j) = 0 if i = j (correct), 1 if i ≠ j (mismatch)

SVM (separable case, continued): H = {x : ω^T x + b = 0}. Since scaling ω and b together does not change the hyperplane, our optimization function should have scaling invariance built into it. We therefore define the closest points to the hyperplane x_sv (the support vectors) to satisfy |ω^T x_sv + b| = 1. The distance from any support vector to the hyperplane is then 1/||ω||_2, so maximizing the distance to the hyperplane is the same as minimizing ||ω||_2.

Expected loss (risk): R(α_i|x) = Σ_{j=1}^c λ(α_i|ω_j) P(ω_j|x)
0-1 risk: R(α_i|x) = Σ_{j≠i} P(ω_j|x) = 1 − P(ω_i|x)

SVM optimization: the final optimization problem is
  min_{ω,b} ½||ω||²  s.t.  y^{(i)}(ω^T x^{(i)} + b) ≥ 1,  i = 1, ..., m
Primal Lagrangian: L_p(ω, b, α) = ½||ω||² − Σ_{i=1}^m α_i ( y^{(i)}(ω^T x^{(i)} + b) − 1 )
∂L_p/∂ω = ω − Σ_i α_i y^{(i)} x^{(i)} = 0  =>  ω = Σ_i α_i y^{(i)} x^{(i)}

Generative vs. Discriminative Models
Generative: model the class-conditional density p(x|y) and use p(y|x) ∝ p(x|y)p(y), or model the joint density p(x, y) and compute the posterior p(y = k|x) = p(x, y = k) / Σ_j p(x, y = j).
Discriminative: model the conditional p(y|x) directly.
Terminology: class conditional P(X|Y), prior P(Y), posterior P(Y|X), evidence P(X)

Probabilistic Motivation for Least Squares
y^{(i)} = θ^T x^{(i)} + ε^{(i)} with noise ε^{(i)} ~ N(0, σ²). Note: the intercept term x_0 = 1 is accounted for in x.
=> p(y^{(i)}|x^{(i)}; θ) = (1/√(2πσ²)) exp( −(y^{(i)} − θ^T x^{(i)})² / (2σ²) )
=> L(θ) = Π_{i=1}^m (1/√(2πσ²)) exp( −(y^{(i)} − θ^T x^{(i)})² / (2σ²) )
=> ℓ(θ) = m log(1/√(2πσ²)) − (1/(2σ²)) Σ_{i=1}^m (y^{(i)} − θ^T x^{(i)})²
=> max_θ ℓ(θ) is the same as min_θ Σ_{i=1}^m (y^{(i)} − h_θ(x^{(i)}))²
So Gaussian noise in our data set {x^{(i)}, y^{(i)}}_{i=1}^m gives us least squares.
min_θ ||Xθ − y||²_2 = min_θ θ^T X^T X θ − 2θ^T X^T y + y^T y
∇ℓ(θ) = X^T X θ − X^T y = 0  =>  θ = (X^T X)^{-1} X^T y
Gradient descent: θ_{t+1} = θ_t + α (y^{(i)} − h_θ(x^{(i)})) x^{(i)},  h_θ(x) = θ^T x
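A minimal numpy sketch of both routes above (the closed-form normal equations and the per-example gradient-descent update); the toy data, step size, and iteration counts are illustrative assumptions, not from the note sheet:

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # x_0 = 1 intercept column
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + rng.normal(scale=0.1, size=100)           # Gaussian noise, as assumed above

# Closed form: theta = (X^T X)^{-1} X^T y
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Per-example (LMS) gradient descent on the squared error
theta, alpha = np.zeros(3), 0.01
for _ in range(50):
    for i in range(len(y)):
        theta += alpha * (y[i] - X[i] @ theta) * X[i]

print(theta_closed, theta)  # both should be close to theta_true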

Probability & Matrix Review
Multivariate Gaussian X ~ N(μ, Σ). Gaussian class conditionals lead to a logistic posterior.
f(x; μ, Σ) = 1/((2π)^{n/2} |Σ|^{1/2}) exp( −½ (x − μ)^T Σ^{-1} (x − μ) )
Σ = E[(X − μ)(X − μ)^T] = E[XX^T] − μμ^T
Σ is PSD => x^T Σ x ≥ 0; if the inverse exists, Σ must be PD.
If X ~ N(μ, Σ), then AX + b ~ N(Aμ + b, AΣA^T).
Σ^{-1/2}(X − μ) ~ N(0, I), where Σ^{1/2} = UΛ^{1/2}.
The distribution is the result of a linear transformation of a vector of univariate Gaussians Z ~ N(0, I) such that X = AZ + μ, where Σ = AA^T. From the pdf, the level curves of the distribution are determined by x^T Σ^{-1} x (assume μ = 0):
c-level set of f: {x : x^T Σ^{-1} x = c}
x^T Σ^{-1} x = c  <=>  x^T U Λ^{-1} U^T x = c  <=>  λ_1^{-1}(u_1^T x)² + ... + λ_n^{-1}(u_n^T x)² = c
Thus the level curves form an ellipsoid with axis lengths equal to the square roots of the eigenvalues of the covariance matrix (axis i has length proportional to √λ_i).

Loss Functions
Binomial deviance = log[1 + e^{−y f(x)}]; minimizing function f(x) = log( P[Y = +1|x] / P[Y = −1|x] )
SVM hinge loss = [1 − y f(x)]_+; minimizing function f(x) = sign( P[Y = +1|x] − ½ )
Squared error = [y − f(x)]² = [1 − y f(x)]²; minimizing function f(x) = 2P[Y = +1|x] − 1
Huberized square hinge loss = −4y f(x) if y f(x) < −1, else [1 − y f(x)]²_+; minimizing function f(x) = 2P[Y = +1|x] − 1

Optimization
Newton's method: θ_{t+1} = θ_t − [∇²_θ f(θ_t)]^{-1} ∇_θ f(θ_t)
Gradient descent: θ_{t+1} = θ_t − α ∇_θ f(θ_t), for minimizing f
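A tiny sketch contrasting the two updates on a made-up 1-D objective (the function, step size, and iteration count are my own choices):

f = lambda t: (t - 3) ** 2 + t ** 4          # toy objective
g = lambda t: 2 * (t - 3) + 4 * t ** 3       # its gradient
h = lambda t: 2 + 12 * t ** 2                # its second derivative (the 1-D Hessian)

t_newton, t_gd, alpha = 0.0, 0.0, 0.05
for _ in range(20):
    t_newton -= g(t_newton) / h(t_newton)    # Newton step: divide by the Hessian
    t_gd -= alpha * g(t_gd)                  # gradient descent step: fixed step size alpha

print(t_newton, t_gd)  # both approach the same minimizer; Newton typically needs fewer steps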

Gradients
∂y/∂x is the matrix of partial derivatives with (i, j) entry ∂y_i/∂x_j (for y ∈ R^m, x ∈ R^n it is m × n).
∂(x^T x)/∂x = 2x,  ∂(x^T A x)/∂x = (A + A^T)x,  ∂(x^T A)/∂x = A,  ∂(Ax)/∂x = A^T,  ∂ tr(BA)/∂A = B^T

SVM (continued): ∂L_p/∂b = −Σ_i α_i y^{(i)} = 0  =>  Σ_i α_i y^{(i)} = 0. Note: α_i ≠ 0 only for support vectors.
Substitute the derivatives into the primal to get the dual.

LDA and QDA
Notice the covariance matrix is the same for all classes in LDA. If p(x|y) is multivariate Gaussian (with shared Σ), then p(y|x) is a logistic function. The converse is NOT true: LDA makes stronger assumptions about the data than logistic regression does.
h(x) = argmax_k [ −½ (x − μ_k)^T Σ^{-1} (x − μ_k) + log π_k ],  where π_k = p(y = k)
For QDA, the model is the same as LDA except that each class has its own covariance matrix Σ_k:
h(x) = argmax_k [ −½ log|Σ_k| − ½ (x − μ_k)^T Σ_k^{-1} (x − μ_k) + log π_k ]
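A hedged numpy sketch of the LDA rule above (the helper names are my own; QDA would be identical except for a per-class covariance and the extra −½ log|Σ_k| term):

import numpy as np

def lda_fit(X, y):
    # Per-class means, class priors, and the shared covariance from the MLE formula above
    classes = np.unique(y)
    mus = np.array([X[y == k].mean(axis=0) for k in classes])
    pis = np.array([(y == k).mean() for k in classes])
    diffs = X - mus[np.searchsorted(classes, y)]
    sigma = diffs.T @ diffs / len(y)
    return classes, mus, pis, np.linalg.inv(sigma)

def lda_predict(X, classes, mus, pis, sigma_inv):
    # Discriminant: -1/2 (x - mu_k)^T Sigma^{-1} (x - mu_k) + log pi_k
    scores = []
    for mu, pi in zip(mus, pis):
        d = X - mu
        scores.append(-0.5 * np.einsum('ij,jk,ik->i', d, sigma_inv, d) + np.log(pi))
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]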

Dual: L_d(α) = Σ_{i=1}^m α_i − ½ Σ_{i=1}^m Σ_{j=1}^m y^{(i)} y^{(j)} α_i α_j (x^{(i)})^T x^{(j)}
KKT says α_n ( y_n(ω^T x_n + b) − 1 ) = 0, with α_n ≥ 0.
In the non-separable case we allow points to cross the margin boundary by some amount ξ_i and penalize it:
  min_{ω,b} ½||ω||² + C Σ_{i=1}^m ξ_i  s.t.  y^{(i)}(ω^T x^{(i)} + b) ≥ 1 − ξ_i
The dual for the non-separable case doesn't change much, except that each α_i now has an upper bound of C: 0 ≤ α_i ≤ C.
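The dual is usually handed to a QP solver; purely as an illustration of the same soft-margin objective, here is a subgradient-descent sketch on the equivalent primal hinge-loss form min ½||ω||² + C Σ_i [1 − y^{(i)}(ω^T x^{(i)} + b)]_+ (all names, the learning rate, and the epoch count are assumptions):

import numpy as np

def svm_primal_sgd(X, y, C=1.0, epochs=200, lr=0.01):
    # y must be in {-1, +1}
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                              # points inside the margin or misclassified
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b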

Lagrangian
L(x, λ) = f_0(x) + Σ_{i=1}^m λ_i f_i(x)
Think of the λ_i as the cost of violating the constraint f_i(x) ≤ 0.
L defines a saddle-point game: one player (MIN) chooses x to minimize L; the other player (MAX) chooses λ to maximize L. If MIN violates a constraint, f_i(x) > 0, then MAX can drive L to infinity.
We call the original optimization problem the primal problem. It has value p* = min_x max_{λ≥0} L(x, λ). (For an infeasible x, L(x, λ) can be made infinite, and for a feasible x, the λ_i f_i(x) terms will become zero.)
Define g(λ) := min_x L(x, λ), and define the dual problem as d* = max_{λ≥0} g(λ) = max_{λ≥0} min_x L(x, λ).
In a zero-sum game it's always better to play second: p* = min_x max_{λ≥0} L(x, λ) ≥ max_{λ≥0} min_x L(x, λ) = d*. This is called weak duality.
If there is a saddle point (x*, λ*), so that for all x and λ ≥ 0, L(x*, λ) ≤ L(x*, λ*) ≤ L(x, λ*), then we have strong duality: the primal and dual have the same value, p* = min_x max_{λ≥0} L(x, λ) = max_{λ≥0} min_x L(x, λ) = d*.
Using notation from Peter's notes: given min_x f(x) s.t. g_i(x) = 0, h_i(x) ≤ 0, the corresponding Lagrangian is L(x, λ, α) = f(x) + Σ_{i=1}^k λ_i g_i(x) + Σ_{i=1}^l α_i h_i(x). We min over x and max over the Lagrange multipliers λ and α.

Regression
L2 regularization results in ridge regression. Used when A has a non-trivial null space (A^T A not invertible). L2 regularization falls out of the MLE when we add a Gaussian prior on x with Σ = cI (i.e. MAP estimation).
  min_x ||Ax − y||²_2 + λ||x||²_2  =>  x = (A^T A + λI)^{-1} A^T y
L1 regularization results in lasso regression. Used when x has a Laplace prior. Gives sparse results.
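A one-line sketch of the ridge closed form above (names are assumptions):

import numpy as np

def ridge(A, y, lam):
    # x = (A^T A + lam I)^{-1} A^T y
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)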

Logistic Regression
Classify y ∈ {0, 1} => model p(y = 1|x) = σ(θ^T x) = 1/(1 + e^{−θ^T x}) = h_θ(x)
1 − h_θ(x) = e^{−θ^T x}/(1 + e^{−θ^T x});  dσ(z)/dz = σ(z)(1 − σ(z)), so dh_θ/dθ = h_θ(1 − h_θ) x
p(y|x; θ) = (h_θ(x))^y (1 − h_θ(x))^{1−y} =>
L(θ) = Π_{i=1}^m (h_θ(x^{(i)}))^{y^{(i)}} (1 − h_θ(x^{(i)}))^{1−y^{(i)}} =>
ℓ(θ) = Σ_{i=1}^m [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ] =>
∇ℓ = Σ_i (y^{(i)} − h_θ(x^{(i)})) x^{(i)} = X^T (y − h_θ(X))   (we want max ℓ(θ))
Stochastic: θ_{t+1} = θ_t + α (y^{(j)} − h_θ(x^{(j)})) x^{(j)}
Batch: θ_{t+1} = θ_t + α X^T (y − h_θ(X))
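A short numpy sketch of the batch gradient-ascent update above (the step size and iteration count are arbitrary assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_batch_ascent(X, y, alpha=0.1, iters=1000):
    # theta_{t+1} = theta_t + alpha * X^T (y - h_theta(X)); y in {0, 1}
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta += alpha * (X.T @ (y - sigmoid(X @ theta)))
    return theta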

k-Nearest Neighbors
kNN: in the limit N, K → ∞ with K/N → 0, ε_kNN → ε* (the Bayes error).
Curse of dimensionality: as the number of dimensions increases, everything becomes farther apart and our low-dimensional intuition falls apart. Consider the hypersphere/hypercube volume ratio: it is close to zero by d = 10. How to deal with this curse:
1. Get more data to fill all of that empty space.
2. Get better features, reducing the dimensionality and packing the data closer together. Ex: bag-of-words, histograms, ...
3. Use a better distance metric.
Minkowski: Dis_p(x, y) = (Σ_{i=1}^d |x_i − y_i|^p)^{1/p} = ||x − y||_p
0-norm: Dis_0(x, y) = Σ_{i=1}^d I[x_i ≠ y_i]
Mahalanobis: Dis_M(x, y|Σ) = √((x − y)^T Σ^{-1} (x − y))
In high dimensions we get hubs such that most points identify the hubs as their NN. These hubs are usually near the means (e.g. dull gray images, sky and clouds). To avoid having everything classified as these hubs, we can use cosine similarity.
K-d trees increase the efficiency of nearest-neighbor lookup.
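A minimal k-NN sketch under the default l2 metric (function names are my own):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Majority vote among the k nearest training points
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]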

Decision Trees
Given a set of points and classes {x_i, y_i}_{i=1}^n, test features x_j and branch on the feature which best separates the data. Recursively split on the new subsets of data. Growing the tree to max depth tends to overfit (training data gets cut quickly => subtrees train on small sets). Mistakes high up in the tree propagate to their subtrees. To reduce overfitting, we can prune using a validation set, and we can limit the depth.
DTs are prone to label noise. Building the optimal tree is hard.
Heuristic: for classification, choose the feature j that maximizes the information gain
  max_j [ H(D) − Σ_{x_j ∈ X_j} P(X_j = x_j) H(D | X_j = x_j) ]
where H(D) = −Σ_{c∈C} P(y = c) log P(y = c) is the entropy of the data set, C is the set of classes each data point can take, and P(y = c) is the fraction of data points with class c.
For REGRESSION, minimize the variance: same optimization problem as above, except H is replaced with var. Pure leaves correspond to low variance, and the prediction is the mean of the current leaf.
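A small sketch of the entropy and information-gain computations above for one categorical feature (names are assumptions):

import numpy as np

def entropy(labels):
    # H(D) = -sum_c P(y=c) log P(y=c), with P(y=c) the class fraction
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def information_gain(labels, feature_values):
    # H(D) - sum_v P(X_j = v) H(D | X_j = v)
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain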

Other Classifiers
Nearest Neighbor
Key idea: store all training examples (x_i, f(x_i)).
NN: find the closest training point under some distance metric and take its label.
k-NN: find the closest k training points and take the most likely label under some voting scheme (majority vote; mean or median for regression).
Behavior in the limit: for 1-NN, lim_{N→∞} ε_NN ≤ 2ε*, where ε* = error of the optimal (Bayes) prediction and ε_NN = error of the 1-NN classifier.

In general the objective function consists of two parts, the loss term and the regularization term: J(θ) = Σ_i Loss_i + λ R(θ).

Mean squared error: Σ_{i=1}^{n_out} (y − h_θ(x))²
The goal is to learn the weights w_{ij}^{(l)}. We use gradient descent, but the error function is non-convex, so we tend to end up in local minima. The naive gradient computation takes O(W²) time (W = number of weights); back-propagation, an algorithm for efficient computation of the gradient ∂e(w)/∂w, takes O(W).

GDA maximum likelihood: ℓ(φ, μ_0, μ_1, Σ) = log Π_{i=1}^m p(x^{(i)}|y^{(i)}; μ_0, μ_1, Σ) p(y^{(i)}; φ) gives us
  φ_MLE = (1/m) Σ_{i=1}^m 1{y^{(i)} = 1},  μ_{k,MLE} = average of the x^{(i)} labeled k  (and Σ_MLE as above).

Random Forests
Problem: DTs are unstable; small changes in the input data have a large effect on tree structure => DTs are high-variance estimators.
Solution: random forests train M different trees on randomly sampled subsets of the data (called bagging), and sometimes with randomly sampled subsets of the features, to de-correlate the trees. A new point is tested on all M trees and we take the majority vote as our output class (for regression we take the average of the outputs).

Boosting
Weak learner: can classify with at least 50% accuracy.
Train a weak learner to get a weak classifier. Test it on the training data, up-weight misclassified data, down-weight correctly classified data. Train a new weak learner on the weighted data. Repeat. A new point is classified by every weak learner and the output class is the sign of a weighted average of the weak learners' outputs. Boosting generally overfits: if there is label noise, boosting keeps up-weighting the mislabeled data.
AdaBoost is a boosting algorithm. The weak-learner weights are given by α_t = ½ ln((1 − ε_t)/ε_t), where ε_t = Pr_{D_t}(h_t(x_i) ≠ y_i) (probability of misclassification). The data weights are updated as D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t, where Z_t is a normalization factor.
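A rough AdaBoost sketch with brute-force decision stumps as the weak learners, following the α_t and D_{t+1} formulas above (the stump search, labels in {-1, +1}, and all names are assumptions):

import numpy as np

def adaboost(X, y, T=20):
    n = len(y)
    D = np.full(n, 1.0 / n)                                 # data weights D_t
    stumps, alphas = [], []
    for _ in range(T):
        best = None
        for j in range(X.shape[1]):                         # pick the stump with lowest weighted error
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] >= thr, 1, -1)
                    err = D[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))   # alpha_t = 1/2 ln((1 - eps_t)/eps_t)
        pred = s * np.where(X[:, j] >= thr, 1, -1)
        D *= np.exp(-alpha * y * pred)                      # up-weight mistakes, down-weight correct
        D /= D.sum()                                        # Z_t normalization
        stumps.append((j, thr, s))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    agg = sum(a * s * np.where(X[:, j] >= thr, 1, -1) for (j, thr, s), a in zip(stumps, alphas))
    return np.sign(agg)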

Neural Networks
Neural Nets explore what you can do by combining perceptrons,
each of which is a simple linear classifier. We use a soft threshold
for each activation function because it is twice differentiable.

∂e(w)/∂w_{ij}^{(l)} = (∂e(w)/∂s_j^{(l)}) (∂s_j^{(l)}/∂w_{ij}^{(l)}) = δ_j^{(l)} x_i^{(l−1)}
Final layer: δ_j^{(L)} = ∂e(w)/∂s_j^{(L)} = (∂e(w)/∂x_j^{(L)}) (∂x_j^{(L)}/∂s_j^{(L)}) = e'(x_j^{(L)}) θ'(s_j^{(L)})
General: δ_i^{(l−1)} = ∂e(w)/∂s_i^{(l−1)} = Σ_{j=1}^{d^{(l)}} (∂e(w)/∂s_j^{(l)}) (∂s_j^{(l)}/∂x_i^{(l−1)}) (∂x_i^{(l−1)}/∂s_i^{(l−1)}) = Σ_{j=1}^{d^{(l)}} δ_j^{(l)} w_{ij}^{(l)} θ'(s_i^{(l−1)})
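A tiny end-to-end check of these δ recursions on a one-hidden-layer network with sigmoid activations and squared error (the sizes, data, and learning rate are arbitrary assumptions):

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))   # layer-1 and layer-2 weights
x = rng.normal(size=(1, 3))
y = np.array([[1.0]])

for _ in range(200):
    # Forward: s^(l) = x^(l-1) W^(l), x^(l) = sigma(s^(l))
    s1 = x @ W1;  x1 = sigmoid(s1)
    s2 = x1 @ W2; x2 = sigmoid(s2)
    # Final-layer delta: e'(x^(L)) * sigma'(s^(L)), with e = (x2 - y)^2
    d2 = 2 * (x2 - y) * x2 * (1 - x2)
    # General delta: delta^(l-1) = (delta^(l) W^(l)^T) * sigma'(s^(l-1))
    d1 = (d2 @ W2.T) * x1 * (1 - x1)
    # Gradients: de/dW^(l) = x^(l-1)^T delta^(l)
    W2 -= 0.5 * (x1.T @ d2)
    W1 -= 0.5 * (x.T @ d1)

print(x2)  # approaches the target y = 1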

Unsupervised Learning
Clustering
Unsupervised learning (no labels).
Distance functions between two sets of points:
  Single linkage: minimum distance between members.
  Complete linkage: maximum distance between members.
  Centroid linkage: distance between centroids.
  Average linkage: average distance between all pairs.
Hierarchical:
  Agglomerative: start with n points, merge the 2 closest clusters using some measure, such as single-link (closest pair), complete-link (furthest pair), average-link (average over all pairs), or centroid (centroid distance). Note: SL and CL are sensitive to outliers.
  Divisive: start with a single cluster, recursively divide clusters into 2 subclusters.
Partitioning: partition the data into K mutually exclusive, exhaustive groups (i.e. encode k = C(i)). Iteratively reallocate points to minimize some loss function. Finding the optimal partition is hard, so we use a greedy algorithm called K-means (coordinate descent). The loss function is non-convex, so we find local minima.
  K-means: choose clusters at random, calculate the centroid of each cluster, reallocate objects to the nearest centroid, repeat. Works with spherical, well-separated clusters of similar volume and count.
  K-means++: initialize cluster centers one by one. D(x) = distance of point x to the nearest existing center; Pr(x is the next center) ∝ D(x)².
  K-medians: works with an arbitrary distance/dissimilarity metric; the centers μ_k are data points. More restrictive, thus higher loss.
General loss: Σ_{n=1}^N Σ_{k=1}^K d(x_n, μ_k) r_nk, where r_nk = 1 if x_n is in cluster k, and 0 otherwise.
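A plain K-means sketch (coordinate descent on the loss above); the initialization and iteration count are arbitrary assumptions:

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]    # random initial centers
    for _ in range(iters):
        # Assignment step: r_nk = 1 for the nearest center
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        # Update step: recompute each centroid (keep the old one if a cluster empties)
        centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                            for k in range(K)])
    return centers, labels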

Vector Quantization
Use clustering to find representative prototype vectors, which are
used to simplify representations of signals.

Parametric Density Estimation
Mixture models: assume the PDF is made up of multiple Gaussians with different centers, P(x) = Σ_{i=1}^{n_c} P(c_i) P(x|c_i), with the log-likelihood of the data as the objective function. Use EM to estimate this model.
E step: P(ω_i|x_k) = P(ω_i) P(x_k|ω_i) / Σ_j P(ω_j) P(x_k|ω_j)
M step: P(c_i) = (1/n) Σ_{k=1}^n P(ω_i|x_k)
  μ_i = Σ_k x_k P(ω_i|x_k) / Σ_k P(ω_i|x_k)
  σ_i² = Σ_k (x_k − μ_i)² P(ω_i|x_k) / Σ_k P(ω_i|x_k)

Non-parametric Density Estimation
Can use a histogram or kernel density estimation (KDE).
KDE: P(x) = (1/n) Σ_i K(x − x_i) is a function of the data.
The kernel K has the following properties: symmetric; normalized, ∫_{R^d} K(x) dx = 1; and lim_{||x||→∞} ||x||^d K(x) = 0.
The bandwidth is the width of the kernel function. Too small => jagged results; too large => overly smoothed results.
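A 1-D Gaussian-kernel KDE sketch illustrating P(x) = (1/n) Σ_i K(x − x_i) with bandwidth h (names and the default h are assumptions):

import numpy as np

def kde(x_query, data, h=0.5):
    # Normalized Gaussian kernel of width h, averaged over the data points
    diffs = (x_query[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * diffs ** 2) / (h * np.sqrt(2 * np.pi))
    return K.mean(axis=1)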

Principal Component Analysis
First run singular value decomposition on the pattern matrix X:
1. Subtract the mean from each point.
2. (Sometimes) scale each dimension by its variance.
3. Compute the covariance Σ = X^T X (symmetric).
4. Compute eigenvectors/values Σ = V S V^T (spectral theorem).
5. Get back X: X = (XV)V^T = USV^T (the SVD of X).
S contains the eigenvalues of the transformed features. The larger S_ii, the larger the variance of that feature. We want the k largest features, so we find the indices of the k largest items in S and keep only these entries (columns) in U and V.

Activation Functions (for neural networks):
θ(s) = tanh(s) = (e^s − e^{−s})/(e^s + e^{−s})  =>  θ'(s) = 1 − θ²(s)
θ(s) = σ(s) = 1/(1 + e^{−s})  =>  θ'(s) = σ(s)(1 − σ(s))
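A compact PCA-via-SVD sketch mirroring the steps above (the optional variance-scaling step is skipped; names are my own):

import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                              # 1. subtract the mean from each point
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)    # X = U S V^T
    # Rows of Vt (columns of V) are the principal directions; larger S_ii means more variance
    components = Vt[:k]                                  # keep the k largest
    return components, Xc @ components.T                 # directions and the projected data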

CS 189 ALL OF IT: Che Yeon, Chloe, Dhruv, Li, Sean

Past Exam Questions (Spring 2013 Midterm, Spring 2013 Final, Spring 2014 Final, Spring 2015 Midterm)

Spring 2013 Midterm
(a) False: In SVMs, we maximize ||w||²/2 subject to the margin constraints.
(b) False: In kernelized SVMs, the kernel matrix K has to be
positive definite.
(c) True: If two random variables are independent, then they have
to be uncorrelated.
(d) False: Isocontours of Gaussian distributions have axes whose
lengths are proportional to the eigenvalues of the covariance
matrix.





(e) True: The RBF kernel K(x_i, x_j) = exp(−||x_i − x_j||²) corresponds to an infinite dimensional mapping of the feature vectors.
(f) True: If (X, Y) are jointly Gaussian, then X and Y are also Gaussian distributed.
(g) True: A function f(x,y,z) is convex if the Hessian of f is
positive semi-definite.
(h) True: In a least-squares linear regression problem, adding an
L2 regularization penalty cannot decrease the L2 error of the
solution w on the training data.
(i) True: In linear SVMs, the optimal weight vector w is a linear
combination of training data points.
(j) False: In stochastic gradient descent, we take steps in the
exact direction of the gradient vector.
(k) False: In a two class problem when the class conditionals
P[x | y = 0] and P[x | y = 1] are modeled as Gaussians with
different covariance matrices, the posterior probabilities turn
out to be logistic functions.
(l) True: The perceptron training procedure is guaranteed to
converge if the two classes are linearly separable.
(m) False: The maximum likelihood estimate for the variance of a
univariate Gaussian is unbiased.
(n) True: In linear regression, using an L1 regularization penalty
term results in sparser solutions than using an L2
regularization penalty term.

Spring 2013 Final
(a) True: Solving a non-linear separation problem with a hard margin kernelized SVM (Gaussian RBF kernel) might lead to overfitting.
(b) True: In SVMs, the sum of the Lagrange multipliers corresponding to the positive examples is equal to the sum of the Lagrange multipliers corresponding to the negative examples.
(c) False: SVMs directly give us the posterior probabilities P(y = 1 | x) and P(y = −1 | x).
(d) False: V(X) = E[X]² − E[X²]
(e) True: In the discriminative approach to solving classification problems, we model the conditional probability of the labels given the observations.
(f) False: In a two-class classification problem, a point x* on the Bayes optimal decision boundary always satisfies P[y = 1 | x*] = P[y = 0 | x*].
(g) True: Any linear combination of the components of a multivariate Gaussian is a univariate Gaussian.
(h) False: For any two random variables X ~ N(μ_1, σ_1²) and Y ~ N(μ_2, σ_2²), X + Y ~ N(μ_1 + μ_2, σ_1² + σ_2²).
(i) False: For a logistic regression problem, differing initialization points can lead to a much better optimum.
(j) False: In logistic regression, we model the odds ratio p/(1 − p) as a linear function.
(k) True: Random forests can be used to classify infinite dimensional data.
(l) False: In boosting we start with a Gaussian weight distribution over the training samples.
(m) False: In AdaBoost, the error of each hypothesis is calculated by the ratio of misclassified examples to the total number of examples.
(n) True: When k = 1 and N → ∞, the kNN classification error rate is bounded above by twice the Bayes error rate.
(o) True: A single layer neural network with a sigmoid activation for binary classification with the cross entropy loss is exactly equivalent to logistic regression.
(p) True: Convolution is a linear operation, i.e. (f_1 + f_2) ∗ g = f_1 ∗ g + f_2 ∗ g.
(q) True: The k-means algorithm does coordinate descent on a non-convex objective function.
(r) True: A 1-NN classifier has higher variance than a 3-NN classifier.
(s) False: The single link agglomerative clustering algorithm groups two clusters on the basis of the maximum distance between points in the two clusters.
(t) False: The largest eigenvector of the covariance matrix is the direction of minimum variance in the data.
(u) False: The eigenvectors of AA^T and A^T A are the same.
(v) True: The non-zero eigenvalues of AA^T and A^T A are the same.

(a) In linear regression, the irreducible error is σ² = E[(y − E(y | x))²].
(b) Let S1 and S2 be the support vectors for w1 (hard margin) and w2 (soft margin). Then S1 may not be a subset of S2, and w1 may not be equal to w2.
(c) Ordinary least squares regression assumes each data point is generated according to a linear function of the input plus N(0, σ) noise. In many systems the noise variance is a positive linear function of the input; in this case the probability model that describes the situation is P(y|x) = (1/√(2πσ²x)) exp(−(y − (w_0 + w_1 x))² / (2σ²x)).
(d) Averaging the outputs of multiple decision trees helps reduce variance.
(e) The following loss functions are convex: logistic, hinge, exponential. Misclassification loss is not.
(f) Bias will be smaller and variance will be larger for trees of smaller depth.
(g) If making a tree with k-ary splits, the algorithm will prefer high values of k, and there will be k − 1 thresholds for a k-ary split.

(c) False: The singular value decomposition of a real matrix is unique.
(d) True: A multiple-layer neural network with linear activation functions is equivalent to one single-layer perceptron that uses the same error function on the output layer and has the same number of inputs.
(e) False: The maximum likelihood estimator for the parameter θ of a uniform distribution over [0, θ] is unbiased.
(f) True: The k-means algorithm for clustering is guaranteed to converge to a local optimum.
(g) True: Increasing the depth of a decision tree cannot increase its training error.
(h) False: There exists a one-to-one feature mapping φ for every valid kernel k.
(i) True: For high-dimensional data, k-d trees can be slower than brute-force nearest neighbor search.
(j) True: If we had infinite data and infinitely fast computers, kNN would be the only algorithm we would study in CS 189.
(k) True: For datasets with high label noise (many data points with incorrect labels), random forests would generally perform better than boosted decision trees.

(a)

In Homework 4, you fit a logistic regression model on spam


and ham data for a Kaggle Comp. Assume you had a very
good score on the public test set, but when the GSIs ran your
model on a private test set, your score dropped a lot. This is
likely because you overfitted by submitting multiple times and
changing the following between submissions: your penalty term; your convergence criterion; your step size; fixing a random bug.
(b) Given d-dimensional data {x_i}_{i=1}^N, you run principal
component analysis and pick P principal components. Can
you always reconstruct any data point xi for i from 1 to N
from the P principal components with zero reconstruction
error? Yes, if P = d.
(c) Putting a standard Gaussian prior on the weights for linear
regression (w N(0, I)) will result in what type of posterior
distribution on the weights? Gaussian.
(d) Suppose we have N instances of d-dimensional data. Let h be
the amount of data storage necessary for a histogram with a
fixed number of ticks per axis, and let k be the amount of data
storage necessary for kernel density estimation. Which of the
following is true about h and k? h grows exponentially with d,
and k grows linearly with N.
(e) John just trained a decision tree for digit recognition. He notices an extremely low training error, but an abnormally large test error. He also notices that an SVM with a linear kernel performs much better than his tree. What could be the cause of his problem? The decision tree is too deep; the decision tree is overfitting.
(f) John has now switched to multilayer neural networks and notices that the training error is going down and converges to a local minimum. Then, when he tests on new data, the test error is abnormally high. What is probably going wrong, and what do you recommend he do? The training data size is not large enough, so collect more training data and retrain; play with the learning rate and add a regularization term to the objective function; use a different initialization, train the network several times, and average the predictions from all nets on the test data; use the same training data but fewer hidden layers.

Discussion Problems

Discussion 9: Entropy
2. Maximum Entropy Distribution
Suppose we have a discrete random variable that has a Categorical distribution described by the parameters p_1, p_2, ..., p_d. Recall that the definition of entropy of a discrete random variable is H(X) = E[−log p(X)] = −Σ_{i=1}^d p_i log p_i. Find the distribution (values of the p_i) that maximizes entropy. (Hint: remember that Σ_{i=1}^d p_i = 1. Don't forget to include that in the optimization as a constraint!)
Solution: For simplicity, assume that log has base e, i.e. log = ln (the solution is the same no matter what base we assume). The optimization problem we are trying to solve is
  argmin_p Σ_{i=1}^d p_i log p_i  s.t.  Σ_{i=1}^d p_i = 1
Formulating the Lagrangian, we get
  L(p, λ) = Σ_{i=1}^d p_i log p_i + λ( Σ_{i=1}^d p_i − 1 )
Taking the derivative w.r.t. p_i and λ:
  ∂L/∂p_i = log p_i + 1 + λ = 0,   ∂L/∂λ = Σ_{i=1}^d p_i − 1 = 0
This says that log p_i = log p_j for all i, j, which implies that p_i = p_j for all i, j. Combining this with the constraint, we get that p_i = 1/d, which is the uniform distribution.

Minicards
Gaussian distribution [7, 8]
1-var (normal): p(x) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) )
Multivar: p(x) = 1/((2π)^{d/2} |Σ|^{1/2}) exp( −½ (x − μ)^T Σ^{-1} (x − μ) )
The covariance of variables X is a matrix Σ such that each entry Σ_ij = Cov(X_i, X_j). This means the diagonal entries Σ_ii = Var(X_i). If the matrix is diagonal, then the off-diagonal entries are zero, which means all the variables X_i are independent. It's nice to have independent variables, so we try to diagonalize non-diagonal covariances.

Modifying neural networks for fun and profit
(a) How could we modify a neural network to perform regression instead of classification?
Solution: Change the output function of the final layer to be a linear function rather than the normal non-linear function.
Consider a neural network with the addition that the input layer is also fully connected to the output layer. This type of neural network is also called skip-layer.
(b) How many weights would this model require? (Let d_0 be the dimensionality of the input vector, and d_1, ..., d_L be the number of nodes in the L following layers. Don't worry about the bias term. Also, you may want to try drawing out the NN.)
Solution: Σ_{i=0}^{L−1} d_i d_{i+1} + d_0 d_L
(c) What sort of problems could this sort of neural network introduce? How do we compensate for these problems?
Solution: We've increased the number of weights (parameters), so now we have a much higher chance of overfitting the data. We could fix this by reducing the number of nodes at each layer and/or reducing the number of layers to accommodate for the increased connectivity.

Minicards (continued)
Spectral Theorem [7:23]
1. Take the definition of eigenvalue/eigenvector: Ax = λx.
2. Pack multiple eigenvalues into Λ = diag(λ_1, λ_2, ..., λ_n); n eigenvalues exist iff A is symmetric.
3. Pack multiple eigenvectors into U = [x_1 x_2 ... x_n].
4. Rewrite the equation using these: AU = UΛ => A = UΛU^{-1}.
We can use this to diagonalize a symmetric A.

Discussion 11: Skip-Layer NN
(d) Consider the simplest skip-layer neural network pictured below (Input → Hidden → Output). The weights are w = [w_xh, w_hy, w_xy]^T. Given some non-linear function g, calculate ∇_w y. Don't forget to use δ's.

Solution: The output y is given by the function
  y = g(s_y) = g(w_hy h + w_xy x) = g(w_hy g(s_h) + w_xy x) = g(w_hy g(w_xh x) + w_xy x)
To calculate ∇_w y we need all the partial derivatives ∂y/∂w_xh, ∂y/∂w_hy, ∂y/∂w_xy. We'll start with the ones closest to the output:
  ∂y/∂w_hy = (∂y/∂s_y)(∂s_y/∂w_hy) = g'(s_y) h = δ_y h
  ∂y/∂w_xy = (∂y/∂s_y)(∂s_y/∂w_xy) = g'(s_y) x = δ_y x
  ∂y/∂w_xh = (∂y/∂s_y)(∂s_y/∂w_xh) = g'(s_y) ∂(w_hy h + w_xy x)/∂w_xh = g'(s_y) w_hy (∂g(s_h)/∂s_h)(∂s_h/∂w_xh) = w_hy g'(s_y) g'(s_h) ∂(w_xh x)/∂w_xh = w_hy g'(s_y) g'(s_h) x = w_hy δ_y g'(s_h) x

Minicards (continued)
SVM-like classifiers work with a boundary, a hyperplane (a line for 2D data) that separates two classes. Support vectors are the point(s) closest to the boundary. γ is the margin, the distance between the boundary and the support vector(s). The parameter θ is a vector; θ·x gives predictions. About θ:
  The direction of θ defines the boundary. We can choose this.
  ||θ|| must be 1/γ, as restricted by ∀i: y_i θ·x_i ≥ 1. We cannot explicitly choose this; it depends on the boundary. This restriction is turned into a cost in soft-margin SVM.
Perceptron [2:11, 3:6] picks a misclassified point and updates θ just enough to classify it correctly: θ ← θ + y_i x_i (a step on −∇J(θ)). Overfits when outliers skew the boundary. Converges iff separable.
  Batch eqn: θ·x = Σ_i α_i y_i x_i·x, where α_i = # times point i was misclassified.
Hard-margin SVM [3:36] maximizes the margin around the boundary; technically, it minimizes ||θ||² subject to the constraint that the vectors closest to it (the support vectors) sit at margin 1:
  min ||θ||²  such that  ∀i: y_i θ·x_i ≥ 1
Sometimes removing a few outliers lets us find a much larger margin, or a margin at all; hard-margin overfits by not seeing this. Converges iff separable.
  Batch eqn: θ = Σ_i α_i y_i x_i, where α_i = 1 iff i is a support vector.
Soft-margin SVM [3:37] is like hard-margin SVM but penalizes misclassifications:
  min ||θ||² + C Σ_{i=1}^n [1 − y_i θ·x_i]_+
Hyperparameter C is the hardness of the margin. Lower C means more misclassifications but a larger soft margin. Overfits with less data, more features, higher C.

Discussion 12: PCA
Derivation of PCA
In this question we will derive PCA. PCA aims to find the direction of maximum variance among a dataset. You want the line such that projecting your data onto this line will retain the maximum amount of information. Thus, the optimization problem is
  max_{u:||u||_2=1} (1/n) Σ_{i=1}^n (u^T x_i − u^T x̄)²
where n is the number of data points and x̄ is the sample average of the data points.
(a) Show that this optimization problem can be massaged into the format
  max_{u:||u||_2=1} u^T Σ u
where Σ = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T.
Solution: We can massage the objective function (let's call it f_0(u)) in this way (the algebra is carried out below).

(a) You can use kernels with both SVMs and the perceptron.
(b) Cross-validation is used to select hyperparameters. It helps prevent overfitting, but is not guaranteed to prevent it.
(c) L2 regularization is equivalent to imposing a Gaussian prior in linear regression.
(d) If we have 2 two-dimensional Gaussians, the same covariance matrix for both will result in a linear decision boundary.
(e) The normal equations can be derived from minimizing empirical risk, assuming normally distributed noise, and assuming P(Y | X) is distributed normally with mean β^T x and variance σ².
(f) Logistic regression can be motivated from log odds equated to an affine function of x, and from generative models with Gaussian class conditionals.
(g) The perceptron algorithm will converge only if the data is linearly separable.
(h) True: Newton's method is typically more expensive to calculate than gradient descent per iteration. True: for quadratic objectives, Newton's method typically requires fewer iterations than gradient descent. False: Gradient descent can be viewed as iteratively reweighted least squares.
(i) True: Complementary slackness implies that every training point that is misclassified by a soft margin SVM is a support vector. True: When we solve the SVM with the dual problem, we need only the dot products x_i · x_j for all i, j. True: we use Lagrange multipliers in an optimization problem with inequality constraints.
(j) ||φ(x) − φ(y)||²_2 can be computed exclusively with inner products, but not ||φ(x) − φ(y)||_1 or φ(x) − φ(y).
(k) Strong duality holds for hard and soft margin SVM, but not for constrained optimization problems in general.

Spring 2015 Midterm
(a) True: If the data is not linearly separable, there is no solution to the hard margin SVM.
(b) True: Logistic regression can be used for classification.
(c) False: Two ways to prevent beta vectors from getting too large are to use a small step size and to use a small regularization value.
(d) False: The L2 norm is often used because it produces sparse results, as opposed to the L1 norm, which does not.
(e) False: For a multivariate Gaussian, the eigenvalues of the covariance matrix are inversely proportional to the lengths of the ellipsoid axes that determine the isocontours of the density.
(f) True: In a generative binary classification model where we assume the class conditionals are distributed as Poisson and the class priors are Bernoulli, the posterior assumes a logistic form.
(g) False: MLE gives us not only a point estimate, but a distribution over the parameters we are estimating.
(h) False: Penalized MLE and Bayesian estimators for parameters are better used in the setting of low-dimensional data with many training examples.
(i) True: It is not good machine learning practice to use the test set to help adjust the hyperparameters.
(j) False: A symmetric positive semidefinite matrix always has nonnegative elements.
(k) True: For a valid kernel function k, the corresponding feature mapping can map a finite dimensional vector to an infinite dimensional vector.
(l) False: The more features we use, the better our learning algorithm will generalize to new data points.
(m) True: A discriminative classifier explicitly models P(Y | X).

Derivation for part (a) of the PCA problem above:
  f_0(u) = (1/n) Σ_{i=1}^n (u^T x_i − u^T x̄)²
         = (1/n) Σ_{i=1}^n (u^T (x_i − x̄)) ((x_i − x̄)^T u)
         = u^T [ (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)^T ] u
         = u^T Σ u

(b) Show that the maximizer for this problem is equal to v_1, where v_1 is the eigenvector corresponding to the largest eigenvalue λ_1. Also show that the optimal value of this problem is equal to λ_1.
Solution: We start by invoking the spectral decomposition Σ = VΛV^T, which exists because Σ is a symmetric positive semi-definite matrix.
  max_{u:||u||_2=1} u^T Σ u = max_{u:||u||_2=1} u^T V Λ V^T u = max_{u:||u||_2=1} (V^T u)^T Λ (V^T u)
Here is an aside: note through this one-line proof that left-multiplying a vector by an orthogonal (rotation) matrix preserves the length of the vector:
  ||V^T u||_2 = √((V^T u)^T (V^T u)) = √(u^T V V^T u) = √(u^T u) = ||u||_2
Define a new variable z = V^T u and maximize over this variable. Because V is invertible, there is a one-to-one mapping between u and z, and the constraint is the same because the length of u does not change when multiplied by an orthogonal matrix:
  max_{z:||z||_2=1} z^T Λ z = max_z Σ_{i=1}^d λ_i z_i²  s.t.  Σ_{i=1}^d z_i² = 1
From this formulation, it is obvious that we can maximize by throwing all of our eggs into one basket: set z_i = 1 if i is the index of the largest eigenvalue, and z_i = 0 otherwise. Thus
  z = V^T u  =>  u = Vz = v_1
where v_1 is the principal eigenvector, corresponding to λ_1. Plugging this into the objective function, we see that the optimal value is λ_1.

3. Deriving the second principal component
(a) Let J(v_2, z_2) = (1/n) Σ_{i=1}^n (x_i − z_{i1} v_1 − z_{i2} v_2)^T (x_i − z_{i1} v_1 − z_{i2} v_2), given the constraints v_1^T v_2 = 0 and v_2^T v_2 = 1. Show that ∂J/∂z_{i2} = 0 yields z_{i2} = v_2^T x_i.
Solution: Expanding,
  J(v_2, z_2) = (1/n) Σ_{i=1}^n ( x_i^T x_i − z_{i1} x_i^T v_1 − z_{i2} x_i^T v_2 − z_{i1} v_1^T x_i + z_{i1}² v_1^T v_1 − z_{i1} z_{i2} v_1^T v_2 − z_{i2} v_2^T x_i + z_{i1} z_{i2} v_2^T v_1 + z_{i2}² v_2^T v_2 )
Taking the derivative with respect to z_{i2}:
  ∂J/∂z_{i2} = (1/n)( −x_i^T v_2 + z_{i1} v_1^T v_2 − v_2^T x_i + z_{i1} v_2^T v_1 + 2 z_{i2} v_2^T v_2 ) = (1/n)( −2 x_i^T v_2 + 2 z_{i2} v_2^T v_2 )
Setting the derivative to 0 gives z_{i2} v_2^T v_2 = x_i^T v_2, and since v_2^T v_2 = 1, we have z_{i2} = x_i^T v_2.
(b) We have shown that z_{i2} = v_2^T x_i, so the second principal encoding is obtained by projecting onto the second principal direction. Show that the value of v_2 that minimizes J is given by the eigenvector of C = (1/n) Σ_{i=1}^n x_i x_i^T with the second largest eigenvalue. Assume we have already proved that v_1 is the eigenvector of C with the largest eigenvalue.
Solution: Plugging z_{i2} into J(v_2, z_2), we have
  J(v_2) = (1/n) Σ_{i=1}^n ( const − 2 z_{i2} x_i^T v_2 + z_{i2}² ) = (1/n) Σ_{i=1}^n ( −2 v_2^T x_i x_i^T v_2 + v_2^T x_i x_i^T v_2 + const ) = −v_2^T C v_2 + const
To minimize J with the constraint v_2^T v_2 = 1, form the Lagrangian L = −v_2^T C v_2 + λ(v_2^T v_2 − 1) and take the derivative with respect to v_2:
  ∂L/∂v_2 = −2 C v_2 + 2λ v_2 = 0
Then we have C v_2 = λ v_2, so v_2 is an eigenvector of C; together with the constraint v_1^T v_2 = 0, the minimizer is the eigenvector with the second largest eigenvalue.

More classifiers
KNN [14:4] Given an item x, find the k training items closest to x and return the result of a vote.
  Hyperparameter: k, the number of neighbors. "Closest" can be defined by some norm (l2 by default). Overfits when k is really small.
Decision trees: recursively split on features that yield the best split. Each tree has many nodes, which either split on a feature at a threshold or (at a leaf) label all data the same way.
  Hyperparameters typically restrict complexity (max tree depth, min points at a node) or penalize it. One of particular interest is d, the max number of nodes. Overfits when the tree is deep or when we are allowed to split on a very small number of items.
Bagging: make multiple trees, each with a random subset of training items. To predict, take a vote from the trees.
  Hyperparameters: # trees, proportion of items to subset.
Random forests is bagging, except, for each node, we consider only a random subset of features to split on.
  Hyperparameter: proportion of features to consider.
AdaBoost [dtrees3:34] Use any algorithm (e.g. decision trees) to train a weak learner, take all the errors, and train a new learner with the errors emphasized*. To predict, predict with the first learner, then add on the prediction of the second learner, and so on.
  * For regression, train the new learner on the errors. For classification, give misclassified items more weight.
  Hyperparameters: B, the number of weak learners, and the learning rate.
