Error Functions:
Cross Entropy Loss: $-\sum_{i=1}^{n_{out}} \left[ y \log(h_\theta(x)) + (1 - y) \log(1 - h_\theta(x)) \right]$
$\Sigma_{MLE} = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} - \mu_{y^{(i)}})(x^{(i)} - \mu_{y^{(i)}})^T$
Notation:
1. $w_{ij}^{(l)}$ is the weight from neuron i in layer l-1 to neuron j in layer l. There are $d^{(l)}$ nodes in the l-th layer.
2. There are L layers, where layer L is the output layer and the data is the 0th layer.
3. $x_j^{(l)} = \theta(s_j^{(l)})$ is the output of a neuron: the activation function applied to the input signal $s_j^{(l)} = \sum_i w_{ij}^{(l)} x_i^{(l-1)}$.
4. $e(w)$ is the error as a function of the weights.
Bayes Rule: $P(\omega \mid x) = \frac{P(x \mid \omega) P(\omega)}{P(x)}$, where $P(x) = \sum_i P(x \mid \omega_i) P(\omega_i)$
$P(x, \omega) = P(x \mid \omega) P(\omega) = P(\omega \mid x) P(x)$
$P(\text{error}) = \int P(\text{error} \mid x) P(x)\, dx$
$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$
0-1 Loss: $\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \text{ (correct)} \\ 1 & i \neq j \text{ (mismatch)} \end{cases}$
Primal: $L_p(\omega, b, \alpha) = \frac{1}{2}\|\omega\|^2 - \sum_{i=1}^m \alpha_i \left( y^{(i)} (w^T x^{(i)} + b) - 1 \right)$
$\frac{\partial L_p}{\partial \omega} = \omega - \sum_i \alpha_i y^{(i)} x^{(i)} = 0 \;\Rightarrow\; \omega = \sum_i \alpha_i y^{(i)} x^{(i)}$
Multivariate Gaussian: $X \sim N(\mu, \Sigma)$
Gaussian class conditionals lead to a logistic posterior.
$f(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$
$\Sigma = E[(X - \mu)(X - \mu)^T] = E[XX^T] - \mu\mu^T$
$\Sigma$ is PSD $\Rightarrow x^T \Sigma x \geq 0$; if the inverse exists, $\Sigma$ must be PD.
If $X \sim N(\mu, \Sigma)$, then $AX + b \sim N(A\mu + b, A\Sigma A^T)$.
$\Sigma^{-1/2}(X - \mu) \sim N(0, I)$, where $\Sigma^{1/2} = U \Lambda^{1/2}$.
Thus the level curves form an ellipsoid with axis lengths equal to the square roots of the eigenvalues of the covariance matrix (axis lengths $\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}$).
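As a quick numeric illustration (a numpy sketch; the mean, covariance, and transform below are made-up values), we can check that a whitened sample has identity covariance and that an affine map $AX + b$ of Gaussian samples has covariance close to $A \Sigma A^T$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary (made-up) mean and PSD covariance matrix.
mu = np.array([1.0, -2.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Eigendecomposition Sigma = U diag(lam) U^T; level-curve axis lengths ~ sqrt(lam).
lam, U = np.linalg.eigh(Sigma)
print("level-curve axis lengths:", np.sqrt(lam))

# Sample X ~ N(mu, Sigma) via X = mu + U diag(sqrt(lam)) Z with Z ~ N(0, I).
Z = rng.standard_normal((100_000, 2))
X = mu + Z @ (U * np.sqrt(lam)).T

# Whitening (Lambda^{-1/2} U^T applied to X - mu) should give covariance close to I.
W = (X - mu) @ (U / np.sqrt(lam))
print("whitened cov (should be close to I):\n", np.cov(W.T).round(2))

# Affine transform: AX + b ~ N(A mu + b, A Sigma A^T).
A = np.array([[2.0, 0.5], [0.0, 1.0]]); b = np.array([1.0, 1.0])
Y = X @ A.T + b
print("empirical cov of AX+b:\n", np.cov(Y.T).round(2))
print("A Sigma A^T:\n", (A @ Sigma @ A.T).round(2))
```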
Loss Functions
Binomial deviance $= \log\left[ 1 + e^{-y f(x)} \right]$; minimizing function $f(x) = \log \frac{P[Y = +1 \mid x]}{P[Y = -1 \mid x]}$
SVM hinge loss $= [1 - y f(x)]_+$; minimizing function $f(x) = \text{sign}\left( P[Y = +1 \mid x] - \frac{1}{2} \right)$
Squared error $= [y - f(x)]^2 = [1 - y f(x)]^2$; minimizing function $f(x) = 2P[Y = +1 \mid x] - 1$
Huberized square hinge loss $= \begin{cases} -4 y f(x) & \text{if } y f(x) < -1 \\ [1 - y f(x)]_+^2 & \text{otherwise} \end{cases}$; minimizing function $f(x) = 2P[Y = +1 \mid x] - 1$
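A small numpy sketch (the function names are mine) that evaluates these four margin-based losses as functions of the margin $m = y f(x)$, which makes their shapes easy to compare:

```python
import numpy as np

def binomial_deviance(m):          # log(1 + e^{-m})
    return np.log1p(np.exp(-m))

def hinge(m):                      # [1 - m]_+
    return np.maximum(0.0, 1.0 - m)

def squared_error(m):              # (1 - m)^2 for y in {-1, +1}
    return (1.0 - m) ** 2

def huberized_sq_hinge(m):         # -4m if m < -1, else [1 - m]_+^2
    return np.where(m < -1.0, -4.0 * m, np.maximum(0.0, 1.0 - m) ** 2)

margins = np.linspace(-2, 2, 9)
for name, fn in [("deviance", binomial_deviance), ("hinge", hinge),
                 ("squared", squared_error), ("huber-hinge", huberized_sq_hinge)]:
    print(f"{name:12s}", np.round(fn(margins), 2))
```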
Optimization
Newton's Method: $\theta_{t+1} = \theta_t - [\nabla^2 f(\theta_t)]^{-1} \nabla f(\theta_t)$
Gradient Descent: $\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$, for minimizing $f$.
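A minimal sketch (numpy; the test function and step size are my own choices) comparing the two updates on the simple convex function $f(\theta) = \log(1 + e^{-\theta}) + \theta^2$, whose gradient and Hessian are easy to write down:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad(theta):                     # f(theta) = log(1 + e^{-theta}) + theta^2
    return sigmoid(theta) - 1.0 + 2.0 * theta

def hess(theta):
    s = sigmoid(theta)
    return s * (1.0 - s) + 2.0

theta_gd, theta_nt, alpha = 3.0, 3.0, 0.1
for t in range(20):
    theta_gd -= alpha * grad(theta_gd)            # gradient descent step
    theta_nt -= grad(theta_nt) / hess(theta_nt)   # Newton step (1-D Hessian)

print("gradient descent:", round(theta_gd, 4))
print("Newton's method :", round(theta_nt, 4))    # both approach the same minimizer; Newton gets there faster
```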
Gradients
For $y \in \mathbb{R}^m$ a function of $x \in \mathbb{R}^n$, $\frac{\partial y}{\partial x}$ is the $m \times n$ Jacobian $\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$
$\frac{\partial (x^T x)}{\partial x} = 2x, \quad \frac{\partial (x^T A x)}{\partial x} = (A + A^T)x, \quad \frac{\partial (Ax)}{\partial x} = A, \quad \frac{\partial (x^T A)}{\partial x} = A, \quad \frac{\partial\, \text{tr}(BA)}{\partial A} = B^T$
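These identities are easy to sanity-check numerically. A small sketch (numpy; the random test vectors and matrices are my own) compares the closed-form gradients of $x^T x$ and $x^T A x$ against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
x = rng.standard_normal(n)
A = rng.standard_normal((n, n))

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# d(x^T x)/dx = 2x
print(np.allclose(num_grad(lambda v: v @ v, x), 2 * x))

# d(x^T A x)/dx = (A + A^T) x
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x))
```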
$\min_{\omega, b} \frac{1}{2}\|\omega\|^2 \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \geq 1, \; i = 1, \ldots, m$
$\frac{\partial L_p}{\partial b} = \sum_i \alpha_i y^{(i)} = 0$. Note: $\alpha_i \neq 0$ only for support vectors.
Substitute the derivatives into the primal to get the dual.
Notice the covariance matrix is the same for all classes in LDA.
If $p(x \mid y)$ is multivariate Gaussian (with shared $\Sigma$), then $p(y \mid x)$ is a logistic function. The converse is NOT true: LDA makes stronger assumptions about the data than logistic regression does.
$h(x) = \arg\max_k \; -\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) + \log(\pi_k)$, where $\pi_k = p(y = k)$.
For QDA, the model is the same as LDA except that each class has a unique covariance matrix $\Sigma_k$.
$h(x) = \arg\max_k \; -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log(\pi_k)$
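A compact sketch of the QDA decision rule as written above (numpy; the class means, covariances, and priors below are placeholder values); setting all $\Sigma_k$ equal recovers LDA:

```python
import numpy as np

def qda_predict(x, mus, Sigmas, priors):
    """Return argmax_k of the QDA discriminant for a single point x."""
    scores = []
    for mu, Sigma, pi in zip(mus, Sigmas, priors):
        diff = x - mu
        score = (-0.5 * np.log(np.linalg.det(Sigma))
                 - 0.5 * diff @ np.linalg.solve(Sigma, diff)
                 + np.log(pi))
        scores.append(score)
    return int(np.argmax(scores))

# Toy two-class example (made-up parameters).
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
priors = [0.5, 0.5]
print(qda_predict(np.array([1.8, 1.9]), mus, Sigmas, priors))  # expect class 1
```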
Dual: $L_d(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)}$
KKT says $\alpha_n \left( y_n (w^T x_n + b) - 1 \right) = 0$, where $\alpha_n \geq 0$.
In the non-separable case we allow points to cross the margin boundary by some amount $\xi_i$ and penalize it:
$\min_{\omega, b} \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^m \xi_i \quad \text{s.t.} \quad y^{(i)}(w^T x^{(i)} + b) \geq 1 - \xi_i$
The dual for the non-separable case doesn't change much, except that each $\alpha_i$ now has an upper bound of C: $0 \leq \alpha_i \leq C$.
Lagrangian (used to form the dual): $L(x, \lambda) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x)$
Regression
$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}, \qquad 1 - h_\theta(x) = \frac{e^{-\theta^T x}}{1 + e^{-\theta^T x}}$
$\frac{d h_\theta}{d\theta} = \frac{1}{1 + e^{-\theta^T x}} \left( 1 - \frac{1}{1 + e^{-\theta^T x}} \right) x = h_\theta (1 - h_\theta)\, x$
$L(\theta) = \prod_{i=1}^m h_\theta(x^{(i)})^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1 - y^{(i)}}$
$l(\theta) = \sum_{i=1}^m \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$
$\nabla_\theta l = \sum_i (y^{(i)} - h_\theta(x^{(i)}))\, x^{(i)} = X^T (y - h_\theta(X))$ (want to maximize $l(\theta)$)
Stochastic: $\theta_{t+1} = \theta_t + \alpha\, (y^{(j)} - h_\theta(x^{(j)}))\, x^{(j)}$
Batch: $\theta_{t+1} = \theta_t + \alpha\, X^T (y - h_\theta(X))$
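A short sketch of the batch update above doing gradient ascent on $l(\theta)$ (numpy; the synthetic data, learning rate, and iteration count are placeholders of my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m points, 2 features plus a bias column.
m = 200
X = np.column_stack([np.ones(m), rng.standard_normal((m, 2))])
true_theta = np.array([-0.5, 2.0, -1.0])
y = (rng.random(m) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

def h(theta, X):
    return 1.0 / (1.0 + np.exp(-X @ theta))

theta, alpha = np.zeros(3), 0.01
for t in range(5000):
    theta += alpha * X.T @ (y - h(theta, X))   # batch gradient ascent on l(theta)

print("estimated theta:", np.round(theta, 2))  # should land roughly near true_theta
```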
k-NN at the limit: as $N, K \to \infty$ with $K/N \to 0$, $\varepsilon_{kNN} \to \varepsilon^*$ (the Bayes error).
Curse of dimensionality: As the number of dimensions increases, everything becomes farther apart. Our low-dimension intuition falls apart. Consider the hypersphere/hypercube volume ratio: it is close to zero at d = 10. How to deal with this curse:
1. Get more data to fill all of that empty space.
2. Get better features, reducing the dimensionality and packing the data closer together. Ex: Bag-of-words, Histograms, ...
3. Use a better distance metric.
Minkowski: $Dis_p(x, y) = \left( \sum_{i=1}^d |x_i - y_i|^p \right)^{1/p} = \|x - y\|_p$
0-norm: $Dis_0(x, y) = \sum_{i=1}^d \mathbb{1}[x_i \neq y_i]$
Mahalanobis: $\sqrt{(x - y)^T \Sigma^{-1} (x - y)}$
In high dimensions we get "hubs" s.t. most points identify the hubs as their NN. These hubs are usually near the means (e.g., dull gray images, sky and clouds). To avoid having everything classified as these hubs, we can use cosine similarity.
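A few of these distances in code (a numpy sketch; the function names and test vectors are mine):

```python
import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def hamming(x, y):                       # the "0-norm" above: count of mismatches
    return int(np.sum(x != y))

def mahalanobis(x, y, Sigma):
    d = x - y
    return float(np.sqrt(d @ np.linalg.solve(Sigma, d)))

def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 0.0])
print(minkowski(x, y, 2), hamming(x, y), cosine_similarity(x, y))
print(mahalanobis(x, y, np.diag([1.0, 4.0, 9.0])))
```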
K-d trees increase the efficiency of nearest neighbor lookup.
Decision Trees
Given a set of points and classes $\{x_i, y_i\}_{i=1}^n$, test features $x_j$ and branch on the feature which best separates the data. Recursively split on the new subsets of data. Growing the tree to max depth tends to overfit (training data gets cut quickly ⇒ subtrees train on small sets). Mistakes high up in the tree propagate to the corresponding subtrees. To reduce overfitting, we can prune using a validation set, and we can limit the depth.
DTs are prone to label noise. Building the optimal tree is hard.
Heuristic: For classification, maximize the information gain
$IG(X_j) = H(D) - \sum_{x_j \in X_j} P(X_j = x_j)\, H(D \mid X_j = x_j)$
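A minimal sketch of that heuristic (numpy; the toy labels and candidate split are made up): compute $H(D)$ and the information gain of a binary split:

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, split_mask):
    """H(D) - sum_v P(v) H(D | split = v) for a boolean split."""
    n = len(labels)
    gain = entropy(labels)
    for v in (True, False):
        subset = labels[split_mask == v]
        if len(subset):
            gain -= (len(subset) / n) * entropy(subset)
    return gain

y = np.array([0, 0, 0, 1, 1, 1, 1, 0])
split = np.array([True, True, True, False, False, False, False, True])
print(round(information_gain(y, split), 3))   # perfect split -> gain = H(D) = 1.0
```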
Logistic Regression
Classify $y \in \{0, 1\}$ ⇒ model $p(y = 1 \mid x) = h_\theta(x)$, the sigmoid given under Regression above.
Key Idea (Nearest Neighbor): Store all training examples $(x_i, f(x_i))$.
NN: Find the closest training point using some distance metric and take its label.
k-NN: Find the closest k training points and take the most likely label based on some voting scheme (majority vote; mean or median for regression).
Behavior at the limit: for 1-NN, $\lim_{N \to \infty} \varepsilon_{NN} \leq 2\varepsilon^*$, where $\varepsilon^*$ = error of the optimal (Bayes) prediction and $\varepsilon_{NN}$ = error of the 1-NN classifier.
In general the loss function consists of two parts, the loss term and the regularization term: $J(\theta) = \sum_i \text{Loss}_i + \lambda R(\theta)$
Mean Squared Error: $\sum_{i=1}^{n_{out}} (y - h_\theta(x))^2$
(l)
The goal is to learn the weights wi j . We use gradient descent, but
error function is non-convex so we tend to local minima. The naive
version takes O(w2 ). Back propagation, an algorithm for efficient
computation of the gradient, takes O(w).
e(w)
Other Classifiers
$l(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^m p(x^{(i)} \mid y^{(i)}; \mu_0, \mu_1, \Sigma)\, p(y^{(i)}; \phi)$ gives us
$\phi_{MLE} = \frac{1}{m} \sum_{i=1}^m 1\{y^{(i)} = 1\}$, $\mu_{k, MLE}$ = average of the $x^{(i)}$ labeled class k, and $\Sigma_{MLE}$ as given under Error Functions above.
Random Forests
Problem: DTs are unstable: small changes in the input data have a large effect on tree structure ⇒ DTs are high-variance estimators.
Solution: Random Forests train M different trees with randomly sampled subsets of the data (called bagging), and sometimes with randomly sampled subsets of the features to de-correlate the trees. A new point is tested on all M trees and we take the majority as our output class (for regression we take the average of the outputs).
Boosting
Weak Learner: Can classify with at least 50% accuracy.
Train a weak learner to get a weak classifier. Test it on the training data, up-weigh misclassified data, down-weigh correctly classified data. Train a new weak learner on the weighted data. Repeat. A new point is classified by every weak learner and the output class is the sign of a weighted average of the weak learner outputs. Boosting can overfit: if there is label noise, boosting keeps upweighing the mislabeled data.
AdaBoost is a boosting algorithm. The weak learner weights are given by $\alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$, where $\epsilon_t = \Pr_{D_t}(h_t(x_i) \neq y_i)$ (probability of misclassification). The data weights are updated as $D_{t+1}(i) = \frac{D_t(i) \exp(-\alpha_t y_i h_t(x_i))}{Z_t}$, where $Z_t$ is a normalization factor.
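A compact sketch of these updates (numpy; the 1-D dataset and threshold-stump learner are my own simplifications) running a few rounds of AdaBoost:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 40))
y = np.where(x > 0.2, 1, -1)                 # toy 1-D labels in {-1, +1}

def best_stump(x, y, D):
    """Weighted-error-minimizing threshold stump: pred = sign * (+1 if x > thr else -1)."""
    best = None
    for thr in x:
        for sign in (1, -1):
            pred = sign * np.where(x > thr, 1, -1)
            err = np.sum(D[pred != y])
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best

D = np.full(len(x), 1.0 / len(x))
ensemble = []
for t in range(5):
    eps, thr, sign = best_stump(x, y, D)
    eps = max(eps, 1e-12)                    # avoid division by zero
    alpha = 0.5 * np.log((1 - eps) / eps)
    pred = sign * np.where(x > thr, 1, -1)
    D *= np.exp(-alpha * y * pred)           # up-weigh mistakes, down-weigh correct
    D /= D.sum()                             # Z_t normalization
    ensemble.append((alpha, thr, sign))

F = sum(a * s * np.where(x > th, 1, -1) for a, th, s in ensemble)
print("training accuracy:", np.mean(np.sign(F) == y))
```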
Neural Networks
Neural Nets explore what you can do by combining perceptrons,
each of which is a simple linear classifier. We use a soft threshold
for each activation function because it is twice differentiable.
$\frac{\partial e(w)}{\partial w_{ij}^{(l)}} = \frac{\partial e(w)}{\partial s_j^{(l)}} \cdot \frac{\partial s_j^{(l)}}{\partial w_{ij}^{(l)}} = \delta_j^{(l)} x_i^{(l-1)}$
Final Layer: $\delta_j^{(L)} = \frac{\partial e(w)}{\partial s_j^{(L)}} = \frac{\partial e(w)}{\partial x_j^{(L)}} \cdot \frac{\partial x_j^{(L)}}{\partial s_j^{(L)}} = e'(x_j^{(L)})\, \theta'(s_j^{(L)})$
General: $\delta_i^{(l-1)} = \frac{\partial e(w)}{\partial s_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \frac{\partial e(w)}{\partial s_j^{(l)}} \cdot \frac{\partial s_j^{(l)}}{\partial x_i^{(l-1)}} \cdot \frac{\partial x_i^{(l-1)}}{\partial s_i^{(l-1)}} = \sum_{j=1}^{d^{(l)}} \delta_j^{(l)} w_{ij}^{(l)}\, \theta'(s_i^{(l-1)})$
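These recursions in code (a numpy sketch for one hidden layer with tanh activations and squared error; the shapes, data, and target are toy choices of mine), with a finite-difference check of one weight gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)           # input  (layer 0)
t = 0.7                              # scalar target
W1 = rng.standard_normal((3, 4))     # layer-1 weights w_ij
W2 = rng.standard_normal((4, 1))     # layer-2 (output) weights

def forward(W1, W2, x):
    s1 = x @ W1;  x1 = np.tanh(s1)   # s_j^(1), x_j^(1)
    s2 = x1 @ W2; x2 = np.tanh(s2)   # s^(2),   output x^(2)
    return s1, x1, s2, x2

def error(W1, W2):
    *_, x2 = forward(W1, W2, x)
    return 0.5 * (x2[0] - t) ** 2

# Backprop: delta^(L) = e'(x^(L)) * theta'(s^(L)); delta^(l-1) = (W delta^(l)) * theta'(s^(l-1)).
s1, x1, s2, x2 = forward(W1, W2, x)
delta2 = (x2 - t) * (1 - np.tanh(s2) ** 2)
delta1 = (W2 @ delta2) * (1 - np.tanh(s1) ** 2)
grad_W1 = np.outer(x, delta1)        # de/dw_ij^(1) = delta_j^(1) * x_i^(0)
grad_W2 = np.outer(x1, delta2)

# Finite-difference check of one entry of grad_W1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
print(grad_W1[0, 0], (error(W1p, W2) - error(W1m, W2)) / (2 * eps))
```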
Unsupervised Learning
Clustering
Unsupervised learning (no labels).
Distance functions: suppose we have two sets (clusters) of points; we need a distance between them.
Hierarchical:
Agglomerative: Start with n points, merge 2 closest clusters
using some measure, such as: Single-link (closest pair),
Complete-link (furthest pair), Average-link (average of all
pairs), Centroid (centroid distance).
Note: SL and CL are sensitive to outliers.
Divisive: Start with single cluster, recursively divide clusters
into 2 subclusters.
Partitioning: Partition the data into K mutually exclusive, exhaustive groups (i.e., encode k = C(i)). Iteratively reallocate points to minimize some loss function. Finding the optimal partition is hard, so we use a greedy algorithm called K-means (coordinate descent). The loss function is non-convex, thus we find local minima.
K-means: Choose clusters at random, calculate centroid of
each cluster, reallocate objects to nearest centroid, repeat.
Works with: spherical, well-separated clusters of similar
volumes and count.
K-means++: Initialize cluster centers one by one. D(x) = distance of point x to the nearest existing center. Pr(x is the new cluster center) $\propto D(x)^2$.
K-medians: Works with an arbitrary distance/dissimilarity metric; the centers $\mu_k$ are represented by data points. It is more restrictive and thus has higher loss.
General Loss: $\sum_{n=1}^N \sum_{k=1}^K d(x_n, \mu_k)\, r_{nk}$, where $r_{nk} = 1$ if $x_n$ is in cluster k, and 0 otherwise.
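A short sketch of K-means with the K-means++ initialization described above (numpy; the synthetic data and K are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])

def kmeans(X, K, iters=50):
    # K-means++ init: Pr(x chosen as next center) ~ D(x)^2.
    centers = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        D2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=D2 / D2.sum())])
    mu = np.array(centers)
    for _ in range(iters):                                                 # coordinate descent
        r = np.argmin(((X[:, None, :] - mu[None]) ** 2).sum(-1), axis=1)   # assign step
        mu = np.array([X[r == k].mean(axis=0) for k in range(K)])          # update step
    return mu, r

mu, r = kmeans(X, K=3)
print(np.round(mu, 2))   # should land near the three true cluster centers
```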
Vector Quantization
Use clustering to find representative prototype vectors, which are
used to simplify representations of signals.
Activation Functions:
$\theta(s) = \tanh(s) = \frac{e^s - e^{-s}}{e^s + e^{-s}} \;\Rightarrow\; \theta'(s) = 1 - \theta^2(s)$
$\theta(s) = \sigma(s) = \frac{1}{1 + e^{-s}} \;\Rightarrow\; \theta'(s) = \sigma(s)(1 - \sigma(s))$
2. Maximum Entropy Distribution
Suppose we have a discrete random variable that has a Categorical distribution described by the parameters $p_1, p_2, \ldots, p_d$. Recall that the definition of entropy of a discrete random variable is $H(p) = -\sum_{i=1}^d p_i \log p_i$.
True/False:
(a) False: In SVMs, we maximize $\frac{\|w\|_2}{2}$ subject to the margin constraints.
(b) False: In kernelized SVMs, the kernel matrix K has to be positive definite.
(c) True: If two random variables are independent, then they have to be uncorrelated.
(d) False: Isocontours of Gaussian distributions have axes whose lengths are proportional to the eigenvalues of the covariance matrix.
(e) True: The RBF kernel $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2)$ ... a linear function.
- True: Random forests can be used to classify infinite dimensional data.
- False: In boosting we start with a Gaussian weight distribution over the training samples.
- False: In Adaboost, the error of each hypothesis is calculated by the ratio of misclassified examples to the total number of examples.
- True: When k = 1 and N → ∞, the kNN classification rate is bounded above by twice the Bayes error rate.
- True: A single layer neural network with a sigmoid activation for binary classification with the cross entropy loss is exactly equivalent to logistic regression.
- True: Convolution is a linear operation, i.e. $(f_1 + f_2) * g = f_1 * g + f_2 * g$.
- True: The k-means algorithm does coordinate descent on a non-convex objective function.
- True: A 1-NN classifier has higher variance than a 3-NN classifier.
- False: The single link agglomerative clustering algorithm groups two clusters on the basis of the maximum distance between points in the two clusters.
- False: The largest eigenvector of the covariance matrix is the direction of minimum variance in the data.
- False: The eigenvectors of $AA^T$ and $A^T A$ are the same.
- True: The non-zero eigenvalues of $AA^T$ and $A^T A$ are the same.
- In linear regression, the irreducible error is $\sigma^2 = E[(y - E(y \mid x))^2]$.
(a) Find the distribution (values of the $p_i$) that maximizes entropy. (Hint: remember that $\sum_{i=1}^d p_i = 1$. Don't forget to include that in the optimization as a constraint!)
Discussion 9 Entropy
For simplicity, assume that log has base e, i.e. log = ln (the solution is the same no matter what base we assume). The optimization problem we are trying to solve is:
$\arg\min_{\vec{p}} \sum_{i=1}^d p_i \log p_i \quad \text{s.t.} \quad \sum_{i=1}^d p_i = 1$
Solution: Form the Lagrangian
$L(\vec{p}, \lambda) = \sum_{i=1}^d p_i \log p_i + \lambda \left( \sum_{i=1}^d p_i - 1 \right)$
$\frac{\partial}{\partial p_i} L(\vec{p}, \lambda) = \log p_i + \frac{p_i}{p_i} + \lambda = 0 \;\Rightarrow\; -1 - \lambda = \log p_i$
$\frac{\partial}{\partial \lambda} L(\vec{p}, \lambda) = \sum_{i=1}^d p_i - 1 = 0$
This says that $\log p_i = \log p_j\ \forall i, j$, which implies that $p_i = p_j\ \forall i, j$. Combining this with the constraint, we get that $p_i = \frac{1}{d}$, which is the uniform distribution.
Minicards
Gaussian distribution [7, 8]
1-var (normal): $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
Multivar: $p(x) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$
(a) How could we modify a neural network to perform regression instead of classification?
Solution: Change the output function of the final layer to be a linear function rather than the normal non-linear function.
(b) Consider a neural network with the addition that the input layer is also fully connected to the output layer. This type of neural network is also called skip-layer. How many weights does it have?
Solution: $\sum_{i=0}^{L-1} d_i d_{i+1} + d_0 d_L$
(c) What sort of problems could this sort of neural network introduce? How do we compensate for these problems?
Solution: We've increased the number of weights (parameters), so now we have a much higher chance of overfitting the data. We could fix this by reducing the number of nodes at each layer and/or reducing the number of layers.
(d) Consider the simplest skip-layer neural network: one input x, one hidden unit h, one output y, where the skip weight $w_{xy}$ connects the input directly to the output. The weights are $w = [w_{xh}, w_{hy}, w_{xy}]^T$. Given some non-linear function g, calculate $\nabla_w y$. Don't forget to use $\delta$'s.
Solution: With $s_h = w_{xh} x$, $h = g(s_h)$, $s_y = w_{hy} h + w_{xy} x$, and $y = g(s_y)$:
$\frac{\partial y}{\partial w_{hy}} = \frac{\partial y}{\partial s_y} \frac{\partial s_y}{\partial w_{hy}} = g'(s_y)\, h = \delta_y h$
$\frac{\partial y}{\partial w_{xy}} = \frac{\partial y}{\partial s_y} \frac{\partial s_y}{\partial w_{xy}} = g'(s_y)\, x = \delta_y x$
$\frac{\partial y}{\partial w_{xh}} = \frac{\partial y}{\partial s_y} \frac{\partial s_y}{\partial w_{xh}} = g'(s_y) \frac{\partial}{\partial w_{xh}} \left( w_{hy}\, g(s_h) + w_{xy} x \right) = w_{hy}\, g'(s_y)\, g'(s_h) \frac{\partial (w_{xh} x)}{\partial w_{xh}} = w_{hy}\, g'(s_y)\, g'(s_h)\, x$
Sometimes removing a few outliers lets us find a much higher margin, or a margin at all; hard-margin SVM overfits by not seeing this. Converges iff separable.
Batch eqn: $w = \sum_i \alpha_i y_i x_i$, where $\alpha_i \neq 0$ only if $x_i$ is a support vector.
Hyperparameter C is the hardness of the margin. Lower C means more misclassifications but a larger soft margin. Overfits with less data, more features, or higher C.
Discussion 12: PCA
Derivation of PCA
In this question we will derive PCA. PCA aims to find the direction of maximum variance in a dataset: you want the line such that projecting your data onto this line will retain the maximum amount of information. Thus, the optimization problem is
$\max_{u: \|u\|_2 = 1} \frac{1}{n} \sum_{i=1}^n \left( u^T x_i - u^T \bar{x} \right)^2$
where n is the number of data points and $\bar{x}$ is the sample average of the data points.
(a) Show that this problem is equivalent to $\max_{u: \|u\|_2 = 1} u^T \Sigma u$, where $\Sigma = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$.
Solution: We can massage the objective function (let's call it $f_0(u)$) in this way:
$f_0(u) = \frac{1}{n} \sum_{i=1}^n \left( u^T x_i - u^T \bar{x} \right)^2 = \frac{1}{n} \sum_{i=1}^n \left( u^T (x_i - \bar{x}) \right) \left( (x_i - \bar{x})^T u \right) = u^T \left( \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T \right) u = u^T \Sigma u$
(b) Show that the maximizer for this problem is equal to $v_1$, where $v_1$ is the eigenvector corresponding to the largest eigenvalue $\lambda_1$. Also show that the optimal value of this problem is equal to $\lambda_1$.
Solution: We start by invoking the spectral decomposition $\Sigma = V \Lambda V^T$, which exists because $\Sigma$ is a symmetric positive semi-definite matrix.
$\max_{u: \|u\|_2 = 1} u^T \Sigma u = \max_{u: \|u\|_2 = 1} u^T V \Lambda V^T u = \max_{u: \|u\|_2 = 1} (V^T u)^T \Lambda (V^T u)$
Here is an aside: note through this one-line proof that left-multiplying a vector by an orthogonal (rotation) matrix preserves the length of the vector:
$\|V^T u\|_2 = \sqrt{(V^T u)^T (V^T u)} = \sqrt{u^T V V^T u} = \sqrt{u^T u} = \|u\|_2$
Define a new variable $z = V^T u$, and maximize over this variable. Because V is invertible, there is a one-to-one mapping between u and z. Also note that the constraint is the same, because the length of the vector u does not change when multiplied by an orthogonal matrix.
$\max_{z: \|z\|_2 = 1} z^T \Lambda z = \max_z \sum_{i=1}^d \lambda_i z_i^2 \quad \text{s.t.} \quad \sum_{i=1}^d z_i^2 = 1$
From this formulation, it is obvious that we can maximize by throwing all of our eggs into one basket: set $z_i = 1$ if i is the index of the largest eigenvalue, and $z_i = 0$ otherwise. Thus $z = V^T u \Rightarrow u = V z = v_1$, where $v_1$ is the principal eigenvector. Plugging this into the objective function, we see that the optimal value is $\lambda_1$.
3. Deriving the second principal component
(a) Let $J(v_2, z_2) = \frac{1}{n} \sum_{i=1}^n (x_i - z_{i1} v_1 - z_{i2} v_2)^T (x_i - z_{i1} v_1 - z_{i2} v_2)$, given the constraints $v_1^T v_2 = 0$ and $v_2^T v_2 = 1$. Show that $\frac{\partial J}{\partial z_2} = 0$ yields $z_{i2} = v_2^T x_i$. Show that the value of $v_2$ that minimizes J is given by the eigenvector of $C = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$ with the second largest eigenvalue. Assume we have already proved that $v_1$ is the eigenvector of C with the largest eigenvalue.
Solution: (a) We have
$J(v_2, z_2) = \frac{1}{n} \sum_{i=1}^n \left( x_i^T x_i - z_{i1} x_i^T v_1 - z_{i2} x_i^T v_2 - z_{i1} v_1^T x_i + z_{i1}^2 v_1^T v_1 - z_{i2} v_2^T x_i + z_{i1} z_{i2} v_2^T v_1 + z_{i2}^2 v_2^T v_2 \right)$
Using the constraints, $\frac{\partial J}{\partial z_{i2}} = -2 v_2^T x_i + 2 z_{i2} = 0$ yields $z_{i2} = v_2^T x_i$. Substituting back,
$J(v_2) = \frac{1}{n} \sum_{i=1}^n \left( \text{const} - 2 z_{i2} x_i^T v_2 + z_{i2}^2 \right) = \frac{1}{n} \sum_{i=1}^n \left( -2 v_2^T x_i x_i^T v_2 + v_2^T x_i x_i^T v_2 + \text{const} \right)$
Minimizing subject to $v_2^T v_2 = 1$ with a Lagrange multiplier $\lambda$ gives
$-2 C v_2 + 2 \lambda v_2 = 0$
Then, we have
$C v_2 = \lambda v_2$
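To tie the derivation together, a small sketch (numpy; the random data is made up) that computes the top two principal components from the eigendecomposition of the covariance matrix and checks that the variance of the data projected onto $v_1$ equals $\lambda_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data (arbitrary mixing matrix for illustration).
X = rng.standard_normal((500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])

xbar = X.mean(axis=0)
Sigma = (X - xbar).T @ (X - xbar) / len(X)      # (1/n) sum (x_i - xbar)(x_i - xbar)^T

lam, V = np.linalg.eigh(Sigma)                  # ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
v1, v2 = V[:, 0], V[:, 1]

proj_var = np.var((X - xbar) @ v1)              # variance of data projected onto v1
print(round(proj_var, 4), round(lam[0], 4))     # these should match (lambda_1)
print(round(float(v1 @ v2), 8))                 # orthogonality of v1 and v2
```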