

Cheat Sheet: Algorithms for Supervised and Unsupervised Learning


Each algorithm is summarised under the same headings: Description, Model, Objective, Training, Regularisation, Complexity, Non-linear, and Online learning.


k-nearest neighbour

Description: The label of a new point $\hat{x}$ is classified with the most frequent label $\hat{t}$ of the $k$ nearest training instances.

Model:
$\hat{t} = \arg\max_C \sum_{i : x_i \in N_k(\hat{x}, \mathbf{x})} \delta(t_i, C)$
. . . where:
$N_k(\hat{x}, \mathbf{x})$ are the $k$ points in $\mathbf{x}$ closest to $\hat{x}$;
Euclidean distance formula: $\sqrt{\sum_{i=1}^{D} (\hat{x}_i - x_i)^2}$;
$\delta(a, b) = 1$ if $a = b$; $0$ o/w.

Objective: No optimisation needed.

Training: Use cross-validation to learn the appropriate $k$; otherwise no training, classification is based on the existing points.

Regularisation: $k$ acts to regularise the classifier: as $k \to N$ the boundary becomes smoother.

Complexity: $O(NM)$ space complexity, since all training instances and all their features need to be kept in memory.

Non-linear: Natively finds non-linear boundaries.

Online learning: To be added.
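A minimal NumPy sketch of the classifier above; the function name, the choice $k = 3$ and the toy data are illustrative assumptions, not part of the original sheet:

import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x_new, k=3):
    """Classify x_new with the most frequent label among its k nearest
    training instances, using Euclidean distance."""
    # Euclidean distances from x_new to every training instance
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest points, i.e. N_k(x_new, X_train)
    nearest = np.argsort(dists)[:k]
    # Most frequent label among the neighbours
    return Counter(t_train[nearest]).most_common(1)[0][0]

# Toy usage: two 2-D clusters labelled 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t = np.array([0, 0, 1, 1])
print(knn_predict(X, t, np.array([0.95, 0.9]), k=3))  # expected: 1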

Naive Bayes

Description: Learn $p(C_k|x)$ by modelling $p(x|C_k)$ and $p(C_k)$, using Bayes' rule to infer the class conditional probability. Assumes each feature is independent of all others, ergo "naive".

Model:
$y(x) = \arg\max_k p(C_k|x) = \arg\max_k p(x|C_k)\,p(C_k) = \arg\max_k \prod_{i=1}^{D} p(x_i|C_k)\,p(C_k) = \arg\max_k \sum_{i=1}^{D} \log p(x_i|C_k) + \log p(C_k)$

Objective: No optimisation needed.

Training:
Multivariate likelihood $p(x|C_k) = \prod_{i=1}^{D} p(x_i|C_k)$:
$p_{\mathrm{MLE}}(x_i = v|C_k) = \frac{\sum_{j=1}^{N} \delta(t_j = C_k \wedge x_{ji} = v)}{\sum_{j=1}^{N} \delta(t_j = C_k)}$
Multinomial likelihood $p(x|C_k) = \prod_{i=1}^{D} p(\mathrm{word}_i|C_k)^{x_i}$:
$p_{\mathrm{MLE}}(\mathrm{word}_i = v|C_k) = \frac{\sum_{j=1}^{N} \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^{N} \sum_{d=1}^{D} \delta(t_j = C_k)\,x_{jd}}$
. . . where:
$x_{ji}$ is the count of word $i$ in training example $j$;
$x_{jd}$ is the count of feature $d$ in training example $j$.
Gaussian likelihood: $p(x|C_k) = \prod_{i=1}^{D} \mathcal{N}(x_i; \mu_{ik}, \sigma_{ik})$.

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate.
Multivariate likelihood:
$p_{\mathrm{MAP}}(x_i = v|C_k) = \frac{(\beta_i - 1) + \sum_{j=1}^{N} \delta(t_j = C_k \wedge x_{ji} = v)}{|x_i|(\beta_i - 1) + \sum_{j=1}^{N} \delta(t_j = C_k)}$
Multinomial likelihood:
$p_{\mathrm{MAP}}(\mathrm{word}_i = v|C_k) = \frac{(\alpha_i - 1) + \sum_{j=1}^{N} \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^{N} \sum_{d=1}^{D} \delta(t_j = C_k)\,x_{jd} - D + \sum_{d=1}^{D} \alpha_d}$

Complexity: $O(NM)$, since each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.
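As a concrete illustration, here is a small NumPy sketch of the multinomial variant with symmetric add-alpha (Dirichlet-style) smoothing; the function names, the smoothing parameter and the toy word counts are assumptions made for this example, not taken from the sheet:

import numpy as np

def train_multinomial_nb(X, t, n_classes, alpha=1.0):
    """MAP-style estimates for multinomial Naive Bayes.
    X[j, i] is the count of word i in training example j; alpha is a
    symmetric pseudo-count (alpha=1 gives Laplace smoothing)."""
    N, D = X.shape
    log_prior = np.log(np.bincount(t, minlength=n_classes) / N)
    log_lik = np.zeros((n_classes, D))
    for k in range(n_classes):
        counts = X[t == k].sum(axis=0)             # per-word counts in class k
        probs = (counts + alpha) / (counts.sum() + alpha * D)
        log_lik[k] = np.log(probs)
    return log_prior, log_lik

def predict_nb(x, log_prior, log_lik):
    # arg max_k  sum_i x_i log p(word_i | C_k) + log p(C_k)
    return int(np.argmax(log_lik @ x + log_prior))

# Toy usage: 3-word vocabulary, two classes
X = np.array([[3, 0, 1], [2, 1, 0], [0, 2, 3], [1, 3, 2]])
t = np.array([0, 0, 1, 1])
log_prior, log_lik = train_multinomial_nb(X, t, n_classes=2)
print(predict_nb(np.array([2, 0, 1]), log_prior, log_lik))  # expected: 0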

Log-linear

Description: Estimate $p(C_k|x)$ directly, by assuming a maximum entropy distribution and optimising an objective function over the conditional entropy distribution.

Model:
$y(x) = \arg\max_k p(C_k|x) = \arg\max_k \sum_m \lambda_m \phi_m(x, C_k)$
$p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\sum_m \lambda_m \phi_m(x, C_k)}$
. . . where $Z_\lambda(x) = \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)}$.

Objective: Minimise the negative log-likelihood:
$L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = -\sum_{(x,t)\in\mathcal{D}} \log p(t|x) = \sum_{(x,t)\in\mathcal{D}} \Big( \log Z_\lambda(x) - \sum_m \lambda_m \phi_m(x, t) \Big)$

Training: Gradient descent (or gradient ascent if maximising the objective):
$\lambda^{n+1} = \lambda^n - \eta \nabla L$
. . . where $\eta$ is the step parameter.
$\nabla L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = \sum_{(x,t)\in\mathcal{D}} E_{p(C_k|x)}[\phi(x, \cdot)] - \sum_{(x,t)\in\mathcal{D}} \phi(x, t)$
$\nabla L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \frac{\lambda}{\sigma^2} + \sum_{(x,t)\in\mathcal{D}} E_{p(C_k|x)}[\phi(x, \cdot)] - \sum_{(x,t)\in\mathcal{D}} \phi(x, t)$
. . . where $E_{p(C_k|x)}[\phi(x, \cdot)] = \sum_k \phi(x, C_k)\,p(C_k|x)$, and $\sum_{(x,t)\in\mathcal{D}} \phi(x, t)$ are the empirical counts.

Regularisation: Penalise large values for the $\lambda$ parameters by introducing a prior distribution over them (typically a Gaussian):
$L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \arg\min_\lambda \Big( -\log p(\lambda) - \sum_{(x,t)\in\mathcal{D}} \log p(t|x) \Big) = \arg\min_\lambda \Big( -\log \prod_m e^{-\frac{\lambda_m^2}{2\sigma^2}} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x) \Big) = \arg\min_\lambda \Big( \sum_m \frac{\lambda_m^2}{2\sigma^2} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x) \Big)$

Complexity: $O(INMK)$, since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class conditional distribution in terms of a kernel $K(x, x')$, and use a non-linear kernel (for example $K(x, x') = (1 + w^T x)^2$). By the Representer Theorem:
$p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^{N} \sum_{i=1}^{K} \alpha_{nk}\,\phi(x_n, C_i)^T \phi(x, C_k)} = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^{N} \sum_{i=1}^{K} \alpha_{nk}\,K((x_n, C_i), (x, C_k))} = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^{N} \alpha_{nk}\,K(x_n, x)}$

Online learning: Online Gradient Descent: update the parameters $\lambda$ using GD after seeing each training instance.
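The sketch below trains this model by batch gradient descent with a Gaussian prior, in the common special case where $\phi(x, C_k)$ places one copy of the input features in the block belonging to class $k$ (i.e. softmax regression); the learning rate, prior variance and toy data are assumptions made for the example:

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_loglinear(X, t, n_classes, eta=0.1, sigma2=10.0, iters=500):
    """Batch gradient descent on the MAP objective, with one weight
    vector per class; the Gaussian prior contributes the W / sigma2 term."""
    N, D = X.shape
    W = np.zeros((n_classes, D))
    T = np.eye(n_classes)[t]                  # one-hot targets (empirical counts)
    for _ in range(iters):
        P = softmax(X @ W.T)                  # p(C_k | x) for every instance
        # gradient: expected features minus empirical features, plus prior term
        grad = (P - T).T @ X + W / sigma2
        W -= eta * grad / N
    return W

# Toy usage: three well-separated 2-D classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in ([0, 0], [3, 0], [0, 3])])
t = np.repeat([0, 1, 2], 20)
W = train_loglinear(X, t, n_classes=3)
print((np.argmax(softmax(X @ W.T), axis=1) == t).mean())   # training accuracy, ~1.0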

Perceptron

Description: Directly estimate the linear function $y(x)$ by iteratively updating the weight vector when incorrectly classifying a training instance.

Model: Binary, linear classifier:
$y(x) = \mathrm{sign}(w^T x)$
. . . where $\mathrm{sign}(x) = +1$ if $x \geq 0$ and $-1$ if $x < 0$.
Multiclass perceptron:
$y(x) = \arg\max_{C_k} w^T \phi(x, C_k)$

Objective: Tries to minimise the error function, i.e. the number of incorrectly classified input vectors:
$\arg\min_w E_P(w) = \arg\min_w -\sum_{n\in\mathcal{M}} w^T x_n t_n$
. . . where $\mathcal{M}$ is the set of misclassified training vectors.

Training: Iterate over each training example $x_n$, and update the weight vector if it is misclassified:
$w^{i+1} = w^i - \eta\nabla E_P(w) = w^i + \eta\,x_n t_n$
. . . where typically $\eta = 1$.
For the multiclass perceptron:
$w^{i+1} = w^i + \eta\big(\phi(x, t) - \phi(x, y(x))\big)$

Regularisation: The Voted Perceptron: run the perceptron $i$ times and store each iteration's weight vector. Then:
$y(x) = \mathrm{sign}\Big(\sum_i c_i\,\mathrm{sign}(w_i^T x)\Big)$
. . . where $c_i$ is the number of correctly classified training instances for $w_i$.

Complexity: $O(INML)$, since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel $K(x, x')$, and one weight per training instance:
$y(x) = \mathrm{sign}\Big(\sum_{n=1}^{N} w_n t_n K(x, x_n)\Big)$
. . . and the update: $w_n^{i+1} = w_n^i + 1$.

Online learning: The perceptron is an online algorithm by default.
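A small NumPy sketch of the binary training loop above; the appended bias column, the epoch limit and the toy data are assumptions made for this example:

import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Binary perceptron with t in {-1, +1}. The weight vector is updated
    only on misclassified instances: w <- w + eta * x_n * t_n."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(X, t):          # online: one instance at a time
            if t_n * (w @ x_n) <= 0:        # misclassified (or on the boundary)
                w += eta * x_n * t_n
                errors += 1
        if errors == 0:                     # converged on linearly separable data
            break
    return w

# Toy usage: a constant bias feature is appended so the hyperplane need not pass through 0
X = np.array([[1, 2.0, 1], [1, 3.0, 2], [1, -1.0, -2], [1, -2.0, -1]], dtype=float)
t = np.array([1, 1, -1, -1])
w = train_perceptron(X, t)
print(np.sign(X @ w))   # expected: [ 1.  1. -1. -1.]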

Support vector machines

Description: A maximum margin classifier: finds the separating hyperplane with the maximum margin to its closest data points.

Model:
$y(x) = \sum_{n=1}^{N} \lambda_n t_n x^T x_n + w_0$

Objective:
Primal:
$\arg\min_{w, w_0} \tfrac{1}{2}\|w\|^2$   s.t. $t_n(w^T x_n + w_0) \geq 1\ \forall n$
Dual:
$L(\Lambda) = \sum_{n=1}^{N} \lambda_n - \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m$   s.t. $\lambda_n \geq 0$, $\sum_{n=1}^{N} \lambda_n t_n = 0\ \forall n$

Training: Quadratic Programming (QP); or SMO, Sequential Minimal Optimisation (chunking).

Regularisation: The soft margin SVM: penalise a hyperplane by the number and distance of misclassified points.
Primal:
$\arg\min_{w, w_0} \tfrac{1}{2}\|w\|^2 + C\sum_{n=1}^{N} \xi_n$   s.t. $t_n(w^T x_n + w_0) \geq 1 - \xi_n$, $\xi_n > 0\ \forall n$
Dual:
$L(\Lambda) = \sum_{n=1}^{N} \lambda_n - \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m$   s.t. $0 \leq \lambda_n \leq C$, $\sum_{n=1}^{N} \lambda_n t_n = 0\ \forall n$

Complexity: QP: $O(n^3)$; SMO: much more efficient than QP, since computation is based only on the support vectors.

Non-linear: Use a non-linear kernel $K(x, x')$:
$y(x) = \sum_{n=1}^{N} \lambda_n t_n K(x, x_n) + w_0$
$L(\Lambda) = \sum_{n=1}^{N} \lambda_n - \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m K(x_n, x_m)$

Online learning: Online SVM. See, for example: "The Huller: A Simple and Efficient Online SVM", Bordes & Bottou (2005); "Pegasos: Primal Estimated sub-Gradient Solver for SVM", Shalev-Shwartz et al. (2007).
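Rather than the QP/SMO solvers named in the Training column, the sketch below fits the linear soft-margin primal by stochastic sub-gradient descent, in the spirit of the Pegasos reference above; the regularisation constant lam (playing the role of 1/C), the epoch count and the toy data are assumptions made for this example:

import numpy as np

def train_linear_svm(X, t, lam=0.1, epochs=200, seed=0):
    """Stochastic sub-gradient descent on the soft-margin primal
    (hinge loss + (lam/2)||w||^2). t must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, step = np.zeros(D), 1
    for _ in range(epochs):
        for n in rng.permutation(N):
            eta = 1.0 / (lam * step)                  # decaying step size
            if t[n] * (w @ X[n]) < 1:                 # margin violated: hinge term active
                w = (1 - eta * lam) * w + eta * t[n] * X[n]
            else:                                     # only the regulariser contributes
                w = (1 - eta * lam) * w
            step += 1
    return w

# Toy usage: a constant bias feature is appended as the first column
X = np.array([[1, 2.0, 2.0], [1, 2.5, 1.5], [1, -2.0, -1.5], [1, -1.5, -2.5]])
t = np.array([1, 1, -1, -1])
w = train_linear_svm(X, t)
print(np.sign(X @ w))   # expected: [ 1.  1. -1. -1.]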

k-means

Description: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid.

Model: Hard assignments $r_{nk} \in \{0, 1\}$ s.t. $\forall n\ \sum_k r_{nk} = 1$, i.e. each data point is assigned to exactly one cluster $k$.
Geometric distance: the Euclidean distance, $l_2$ norm:
$\|x_n - \mu_k\|_2 = \sqrt{\sum_{i=1}^{D} (x_{ni} - \mu_{ki})^2}$

Objective:
$\arg\min_{r, \mu} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|x_n - \mu_k\|_2^2$
. . . i.e. minimise the distance from each cluster centre to each of its points.

Training:
Expectation:
$r_{nk} = 1$ if $\|x_n - \mu_k\|^2$ is minimal for $k$; $0$ o/w.
Maximisation:
$\mu_{\mathrm{MLE}}^{(k)} = \frac{\sum_n r_{nk}\,x_n}{\sum_n r_{nk}}$
. . . where $\mu^{(k)}$ is the centroid of cluster $k$.

Regularisation: Only hard-margin assignment to clusters.

Complexity: To be added.

Non-linear: For non-linearly separable data, use kernel k-means as suggested in "Kernel k-means, Spectral Clustering and Normalized Cuts", Dhillon et al. (2004).

Online learning: Sequential k-means: update the centroids after processing one point at a time.
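A NumPy sketch of the hard E/M loop above; the initialisation scheme (random data points as seeds), the convergence test and the toy blobs are assumptions made for this example:

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Alternate hard E and M steps: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]    # initial centroids
    for _ in range(iters):
        # E-step: r_nk = 1 for the closest centroid, 0 otherwise
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = dists.argmin(axis=1)
        # M-step: each centroid becomes the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                      # converged
            break
        mu = new_mu
    return mu, r

# Toy usage: two well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2))])
mu, r = kmeans(X, K=2)
print(np.round(mu, 1))   # roughly [[0. 0.] [3. 3.]] (cluster order may vary)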

Mixture of Gaussians

Description: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian.

Model: Assignments to clusters are made by specifying probabilities:
$p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)})\,p(z^{(i)})$
. . . with $z^{(i)} \sim \mathrm{Multinomial}(\phi)$, and $\gamma_{nk} \equiv p(k|x_n)$ s.t. $\sum_{j=1}^{K} \phi_j = 1$. I.e. we want to maximise the probability of the observed data $x$.

Objective:
$L(x, \pi, \mu, \Sigma) = -\log p(x|\pi, \mu, \Sigma) = -\sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)$

Training:
Expectation: for each $n, k$ set
$\gamma_{nk} = p(z^{(i)} = k\,|\,x^{(i)}; \phi, \mu, \Sigma)\ (= p(k|x_n)) = \frac{p(x^{(i)}|z^{(i)} = k; \mu, \Sigma)\,p(z^{(i)} = k; \phi)}{\sum_{j=1}^{K} p(x^{(i)}|z^{(i)} = j; \mu, \Sigma)\,p(z^{(i)} = j; \phi)} = \frac{\pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x_n|\mu_j, \Sigma_j)}$
Maximisation:
$\pi_k = \frac{1}{N}\sum_{n=1}^{N} \gamma_{nk}$
$\Sigma_k = \frac{\sum_{n=1}^{N} \gamma_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma_{nk}}$
$\mu_k = \frac{\sum_{n=1}^{N} \gamma_{nk}\,x_n}{\sum_{n=1}^{N} \gamma_{nk}}$

Regularisation: The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

Complexity: To be added.

Non-linear: Not applicable.

Online learning: Online Gaussian Mixture Models. A good start is "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants", Neal & Hinton (1998).
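A compact EM sketch for this model, using NumPy and scipy.stats for the Gaussian density; the initialisation, the fixed iteration count and the toy data are assumptions made for this example:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, iters=100, seed=0):
    """EM for a mixture of Gaussians: the E-step computes the
    responsibilities gamma_nk, the M-step re-estimates pi, mu, Sigma."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: gamma_nk proportional to pi_k * N(x_n | mu_k, Sigma_k)
        gamma = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                 for k in range(K)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted MLE updates
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma

# Toy usage: two 2-D Gaussian blobs
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
pi, mu, Sigma = gmm_em(X, K=2)
print(np.round(mu, 1))   # roughly [[0. 0.] [4. 4.]] (component order may vary)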

Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.
