

Cheat Sheet: Algorithms for Supervised and Unsupervised Learning


Each algorithm is summarised under the same headings: Description, Model, Objective, Training, Regularisation, Complexity, Non-linear, and Online learning.


k-nearest neighbour

Description: The label of a new point $\hat{x}$ is classified with the most frequent label $\hat{t}$ of the $k$ nearest training instances.

Model:
$\hat{t} = \arg\max_C \sum_{i : x_i \in N_k(\hat{x}, \mathbf{x})} \delta(t_i, C)$
. . . where:
$N_k(\hat{x}, \mathbf{x})$ are the $k$ points in $\mathbf{x}$ closest to $\hat{x}$;
Euclidean distance formula: $\sqrt{\sum_{i=1}^{D} (\hat{x}_i - x_i)^2}$;
$\delta(a, b) = 1$ if $a = b$; $0$ o/w.

Objective: No optimisation needed.

Training: Use cross-validation to learn the appropriate $k$; otherwise no training, classification is based on the existing points.

Regularisation: $k$ acts to regularise the classifier: as $k \to N$ the boundary becomes smoother.

Complexity: $O(NM)$ space complexity, since all training instances and all their features need to be kept in memory.

Non-linear: Natively finds non-linear boundaries.

Online learning: To be added.
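A minimal NumPy sketch of the classifier above; the function name, the choice $k = 3$ and the toy data are illustrative assumptions, not part of the original sheet:

import numpy as np
from collections import Counter

def knn_predict(X_train, t_train, x_new, k=3):
    """Classify x_new with the most frequent label among its k nearest
    training instances, using Euclidean distance."""
    # Euclidean distances from x_new to every training instance
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest points, i.e. N_k(x_new, X_train)
    nearest = np.argsort(dists)[:k]
    # Most frequent label among the neighbours
    return Counter(t_train[nearest]).most_common(1)[0][0]

# Toy usage: two 2-D clusters labelled 0 and 1
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t = np.array([0, 0, 1, 1])
print(knn_predict(X, t, np.array([0.95, 0.9]), k=3))  # expected: 1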

Naive Bayes

Description: Learn $p(C_k|x)$ by modelling $p(x|C_k)$ and $p(C_k)$, using Bayes' rule to infer the class conditional probability. Assumes each feature is independent of all others, ergo "naive".

Model:
$y(x) = \arg\max_k p(C_k|x) = \arg\max_k p(x|C_k)\,p(C_k) = \arg\max_k \prod_{i=1}^{D} p(x_i|C_k)\,p(C_k) = \arg\max_k \sum_{i=1}^{D} \log p(x_i|C_k) + \log p(C_k)$

Objective: No optimisation needed.

Training:
Multivariate likelihood $p(x|C_k) = \prod_{i=1}^{D} p(x_i|C_k)$:
$p_{\mathrm{MLE}}(x_i = v|C_k) = \frac{\sum_{j=1}^{N} \delta(t_j = C_k \wedge x_{ji} = v)}{\sum_{j=1}^{N} \delta(t_j = C_k)}$
Multinomial likelihood $p(x|C_k) = \prod_{i=1}^{D} p(\mathrm{word}_i|C_k)^{x_i}$:
$p_{\mathrm{MLE}}(\mathrm{word}_i = v|C_k) = \frac{\sum_{j=1}^{N} \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^{N} \sum_{d=1}^{D} \delta(t_j = C_k)\,x_{jd}}$
. . . where:
$x_{ji}$ is the count of word $i$ in training example $j$;
$x_{jd}$ is the count of feature $d$ in training example $j$.
Gaussian likelihood: $p(x|C_k) = \prod_{i=1}^{D} \mathcal{N}(x_i; \mu_{ik}, \sigma_{ik})$.

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate.
Multivariate likelihood:
$p_{\mathrm{MAP}}(x_i = v|C_k) = \frac{(\beta_i - 1) + \sum_{j=1}^{N} \delta(t_j = C_k \wedge x_{ji} = v)}{|x_i|(\beta_i - 1) + \sum_{j=1}^{N} \delta(t_j = C_k)}$
Multinomial likelihood:
$p_{\mathrm{MAP}}(\mathrm{word}_i = v|C_k) = \frac{(\alpha_i - 1) + \sum_{j=1}^{N} \delta(t_j = C_k)\,x_{ji}}{\sum_{j=1}^{N} \sum_{d=1}^{D} \delta(t_j = C_k)\,x_{jd} - D + \sum_{d=1}^{D} \alpha_d}$

Complexity: $O(NM)$, since each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.
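As a concrete illustration, here is a small NumPy sketch of the multinomial variant with symmetric add-alpha (Dirichlet-style) smoothing; the function names, the smoothing parameter and the toy word counts are assumptions made for this example, not taken from the sheet:

import numpy as np

def train_multinomial_nb(X, t, n_classes, alpha=1.0):
    """MAP-style estimates for multinomial Naive Bayes.
    X[j, i] is the count of word i in training example j; alpha is a
    symmetric pseudo-count (alpha=1 gives Laplace smoothing)."""
    N, D = X.shape
    log_prior = np.log(np.bincount(t, minlength=n_classes) / N)
    log_lik = np.zeros((n_classes, D))
    for k in range(n_classes):
        counts = X[t == k].sum(axis=0)             # per-word counts in class k
        probs = (counts + alpha) / (counts.sum() + alpha * D)
        log_lik[k] = np.log(probs)
    return log_prior, log_lik

def predict_nb(x, log_prior, log_lik):
    # arg max_k  sum_i x_i log p(word_i | C_k) + log p(C_k)
    return int(np.argmax(log_lik @ x + log_prior))

# Toy usage: 3-word vocabulary, two classes
X = np.array([[3, 0, 1], [2, 1, 0], [0, 2, 3], [1, 3, 2]])
t = np.array([0, 0, 1, 1])
log_prior, log_lik = train_multinomial_nb(X, t, n_classes=2)
print(predict_nb(np.array([2, 0, 1]), log_prior, log_lik))  # expected: 0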

Log-linear

Description: Estimate $p(C_k|x)$ directly, by assuming a maximum entropy distribution and optimising an objective function over the conditional entropy distribution.

Model:
$y(x) = \arg\max_k p(C_k|x) = \arg\max_k \sum_m \lambda_m \phi_m(x, C_k)$
$p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\sum_m \lambda_m \phi_m(x, C_k)}$
. . . where $Z_\lambda(x) = \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)}$.

Objective: Minimise the negative log-likelihood:
$L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = -\sum_{(x,t)\in\mathcal{D}} \log p(t|x) = \sum_{(x,t)\in\mathcal{D}} \Big( \log Z_\lambda(x) - \sum_m \lambda_m \phi_m(x, t) \Big)$

Training: Gradient descent (or gradient ascent if maximising the objective):
$\lambda^{n+1} = \lambda^n - \eta \nabla L$
. . . where $\eta$ is the step parameter.
$\nabla L_{\mathrm{MLE}}(\lambda, \mathcal{D}) = \sum_{(x,t)\in\mathcal{D}} E_{p(C_k|x)}[\phi(x, \cdot)] - \sum_{(x,t)\in\mathcal{D}} \phi(x, t)$
$\nabla L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \frac{\lambda}{\sigma^2} + \sum_{(x,t)\in\mathcal{D}} E_{p(C_k|x)}[\phi(x, \cdot)] - \sum_{(x,t)\in\mathcal{D}} \phi(x, t)$
. . . where $E_{p(C_k|x)}[\phi(x, \cdot)] = \sum_k \phi(x, C_k)\,p(C_k|x)$, and $\sum_{(x,t)\in\mathcal{D}} \phi(x, t)$ are the empirical counts.

Regularisation: Penalise large values for the $\lambda$ parameters by introducing a prior distribution over them (typically a Gaussian):
$L_{\mathrm{MAP}}(\lambda, \mathcal{D}, \sigma) = \arg\min_\lambda \Big( -\log p(\lambda) - \sum_{(x,t)\in\mathcal{D}} \log p(t|x) \Big) = \arg\min_\lambda \Big( -\log \prod_m e^{-\frac{\lambda_m^2}{2\sigma^2}} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x) \Big) = \arg\min_\lambda \Big( \sum_m \frac{\lambda_m^2}{2\sigma^2} - \sum_{(x,t)\in\mathcal{D}} \log p(t|x) \Big)$

Complexity: $O(INMK)$, since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class conditional distribution in terms of a kernel $K(x, x')$, and use a non-linear kernel (for example $K(x, x') = (1 + w^T x)^2$). By the Representer Theorem:
$p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^{N} \sum_{i=1}^{K} \alpha_{nk}\,\phi(x_n, C_i)^T \phi(x, C_k)} = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^{N} \sum_{i=1}^{K} \alpha_{nk}\,K((x_n, C_i), (x, C_k))} = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^{N} \alpha_{nk}\,K(x_n, x)}$

Online learning: Online Gradient Descent: update the parameters $\lambda$ using GD after seeing each training instance.
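The sketch below trains this model by batch gradient descent with a Gaussian prior, in the common special case where $\phi(x, C_k)$ places one copy of the input features in the block belonging to class $k$ (i.e. softmax regression); the learning rate, prior variance and toy data are assumptions made for the example:

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_loglinear(X, t, n_classes, eta=0.1, sigma2=10.0, iters=500):
    """Batch gradient descent on the MAP objective, with one weight
    vector per class; the Gaussian prior contributes the W / sigma2 term."""
    N, D = X.shape
    W = np.zeros((n_classes, D))
    T = np.eye(n_classes)[t]                  # one-hot targets (empirical counts)
    for _ in range(iters):
        P = softmax(X @ W.T)                  # p(C_k | x) for every instance
        # gradient: expected features minus empirical features, plus prior term
        grad = (P - T).T @ X + W / sigma2
        W -= eta * grad / N
    return W

# Toy usage: three well-separated 2-D classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in ([0, 0], [3, 0], [0, 3])])
t = np.repeat([0, 1, 2], 20)
W = train_loglinear(X, t, n_classes=3)
print((np.argmax(softmax(X @ W.T), axis=1) == t).mean())   # training accuracy, ~1.0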

Perceptron

Description: Directly estimate the linear function $y(x)$ by iteratively updating the weight vector when incorrectly classifying a training instance.

Model: Binary, linear classifier:
$y(x) = \mathrm{sign}(w^T x)$
. . . where $\mathrm{sign}(x) = +1$ if $x \geq 0$ and $-1$ if $x < 0$.
Multiclass perceptron:
$y(x) = \arg\max_{C_k} w^T \phi(x, C_k)$

Objective: Tries to minimise the error function, i.e. the number of incorrectly classified input vectors:
$\arg\min_w E_P(w) = \arg\min_w -\sum_{n\in\mathcal{M}} w^T x_n t_n$
. . . where $\mathcal{M}$ is the set of misclassified training vectors.

Training: Iterate over each training example $x_n$, and update the weight vector if it is misclassified:
$w^{i+1} = w^i - \eta\nabla E_P(w) = w^i + \eta\,x_n t_n$
. . . where typically $\eta = 1$.
For the multiclass perceptron:
$w^{i+1} = w^i + \eta\big(\phi(x, t) - \phi(x, y(x))\big)$

Regularisation: The Voted Perceptron: run the perceptron $i$ times and store each iteration's weight vector. Then:
$y(x) = \mathrm{sign}\Big(\sum_i c_i\,\mathrm{sign}(w_i^T x)\Big)$
. . . where $c_i$ is the number of correctly classified training instances for $w_i$.

Complexity: $O(INML)$, since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel $K(x, x')$, and one weight per training instance:
$y(x) = \mathrm{sign}\Big(\sum_{n=1}^{N} w_n t_n K(x, x_n)\Big)$
. . . and the update: $w_n^{i+1} = w_n^i + 1$.

Online learning: The perceptron is an online algorithm by default.
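A small NumPy sketch of the binary training loop above; the appended bias column, the epoch limit and the toy data are assumptions made for this example:

import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100):
    """Binary perceptron with t in {-1, +1}. The weight vector is updated
    only on misclassified instances: w <- w + eta * x_n * t_n."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x_n, t_n in zip(X, t):          # online: one instance at a time
            if t_n * (w @ x_n) <= 0:        # misclassified (or on the boundary)
                w += eta * x_n * t_n
                errors += 1
        if errors == 0:                     # converged on linearly separable data
            break
    return w

# Toy usage: a constant bias feature is appended so the hyperplane need not pass through 0
X = np.array([[1, 2.0, 1], [1, 3.0, 2], [1, -1.0, -2], [1, -2.0, -1]], dtype=float)
t = np.array([1, 1, -1, -1])
w = train_perceptron(X, t)
print(np.sign(X @ w))   # expected: [ 1.  1. -1. -1.]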

Support vector machines

Description: A maximum margin classifier: finds the separating hyperplane with the maximum margin to its closest data points.

Model:
$y(x) = \sum_{n=1}^{N} \lambda_n t_n x^T x_n + w_0$

Objective:
Primal:
$\arg\min_{w, w_0} \tfrac{1}{2}\|w\|^2$   s.t. $t_n(w^T x_n + w_0) \geq 1\ \forall n$
Dual:
$L(\Lambda) = \sum_{n=1}^{N} \lambda_n - \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m$   s.t. $\lambda_n \geq 0$, $\sum_{n=1}^{N} \lambda_n t_n = 0\ \forall n$

Training: Quadratic Programming (QP); or SMO, Sequential Minimal Optimisation (chunking).

Regularisation: The soft margin SVM: penalise a hyperplane by the number and distance of misclassified points.
Primal:
$\arg\min_{w, w_0} \tfrac{1}{2}\|w\|^2 + C\sum_{n=1}^{N} \xi_n$   s.t. $t_n(w^T x_n + w_0) \geq 1 - \xi_n$, $\xi_n > 0\ \forall n$
Dual:
$L(\Lambda) = \sum_{n=1}^{N} \lambda_n - \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m x_n^T x_m$   s.t. $0 \leq \lambda_n \leq C$, $\sum_{n=1}^{N} \lambda_n t_n = 0\ \forall n$

Complexity: QP: $O(n^3)$; SMO: much more efficient than QP, since computation is based only on the support vectors.

Non-linear: Use a non-linear kernel $K(x, x')$:
$y(x) = \sum_{n=1}^{N} \lambda_n t_n K(x, x_n) + w_0$
$L(\Lambda) = \sum_{n=1}^{N} \lambda_n - \tfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} \lambda_n \lambda_m t_n t_m K(x_n, x_m)$

Online learning: Online SVM. See, for example: "The Huller: A Simple and Efficient Online SVM", Bordes & Bottou (2005); "Pegasos: Primal Estimated sub-Gradient Solver for SVM", Shalev-Shwartz et al. (2007).
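Rather than the QP/SMO solvers named in the Training column, the sketch below fits the linear soft-margin primal by stochastic sub-gradient descent, in the spirit of the Pegasos reference above; the regularisation constant lam (playing the role of 1/C), the epoch count and the toy data are assumptions made for this example:

import numpy as np

def train_linear_svm(X, t, lam=0.1, epochs=200, seed=0):
    """Stochastic sub-gradient descent on the soft-margin primal
    (hinge loss + (lam/2)||w||^2). t must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w, step = np.zeros(D), 1
    for _ in range(epochs):
        for n in rng.permutation(N):
            eta = 1.0 / (lam * step)                  # decaying step size
            if t[n] * (w @ X[n]) < 1:                 # margin violated: hinge term active
                w = (1 - eta * lam) * w + eta * t[n] * X[n]
            else:                                     # only the regulariser contributes
                w = (1 - eta * lam) * w
            step += 1
    return w

# Toy usage: a constant bias feature is appended as the first column
X = np.array([[1, 2.0, 2.0], [1, 2.5, 1.5], [1, -2.0, -1.5], [1, -1.5, -2.5]])
t = np.array([1, 1, -1, -1])
w = train_linear_svm(X, t)
print(np.sign(X @ w))   # expected: [ 1.  1. -1. -1.]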

k-means

Description: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid.

Model: Hard assignments $r_{nk} \in \{0, 1\}$ s.t. $\forall n\ \sum_k r_{nk} = 1$, i.e. each data point is assigned to exactly one cluster $k$.
Geometric distance: the Euclidean distance, $l_2$ norm:
$\|x_n - \mu_k\|_2 = \sqrt{\sum_{i=1}^{D} (x_{ni} - \mu_{ki})^2}$

Objective:
$\arg\min_{r, \mu} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\|x_n - \mu_k\|_2^2$
. . . i.e. minimise the distance from each cluster centre to each of its points.

Training:
Expectation:
$r_{nk} = 1$ if $\|x_n - \mu_k\|^2$ is minimal for $k$; $0$ o/w.
Maximisation:
$\mu_{\mathrm{MLE}}^{(k)} = \frac{\sum_n r_{nk}\,x_n}{\sum_n r_{nk}}$
. . . where $\mu^{(k)}$ is the centroid of cluster $k$.

Regularisation: Only hard-margin assignment to clusters.

Complexity: To be added.

Non-linear: For non-linearly separable data, use kernel k-means as suggested in "Kernel k-means, Spectral Clustering and Normalized Cuts", Dhillon et al. (2004).

Online learning: Sequential k-means: update the centroids after processing one point at a time.
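A NumPy sketch of the hard E/M loop above; the initialisation scheme (random data points as seeds), the convergence test and the toy blobs are assumptions made for this example:

import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Alternate hard E and M steps: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]    # initial centroids
    for _ in range(iters):
        # E-step: r_nk = 1 for the closest centroid, 0 otherwise
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = dists.argmin(axis=1)
        # M-step: each centroid becomes the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                      # converged
            break
        mu = new_mu
    return mu, r

# Toy usage: two well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2))])
mu, r = kmeans(X, K=2)
print(np.round(mu, 1))   # roughly [[0. 0.] [3. 3.]] (cluster order may vary)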

Mixture of Gaussians

Description: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian.

Model: Assignments to clusters are made by specifying probabilities:
$p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)})\,p(z^{(i)})$
. . . with $z^{(i)} \sim \mathrm{Multinomial}(\phi)$, and $\gamma_{nk} \equiv p(k|x_n)$ s.t. $\sum_{j=1}^{K} \phi_j = 1$. I.e. we want to maximise the probability of the observed data $x$.

Objective:
$L(x, \pi, \mu, \Sigma) = -\log p(x|\pi, \mu, \Sigma) = -\sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)$

Training:
Expectation: for each $n, k$ set
$\gamma_{nk} = p(z^{(i)} = k\,|\,x^{(i)}; \phi, \mu, \Sigma)\ (= p(k|x_n)) = \frac{p(x^{(i)}|z^{(i)} = k; \mu, \Sigma)\,p(z^{(i)} = k; \phi)}{\sum_{j=1}^{K} p(x^{(i)}|z^{(i)} = j; \mu, \Sigma)\,p(z^{(i)} = j; \phi)} = \frac{\pi_k\,\mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x_n|\mu_j, \Sigma_j)}$
Maximisation:
$\pi_k = \frac{1}{N}\sum_{n=1}^{N} \gamma_{nk}$
$\Sigma_k = \frac{\sum_{n=1}^{N} \gamma_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^{N} \gamma_{nk}}$
$\mu_k = \frac{\sum_{n=1}^{N} \gamma_{nk}\,x_n}{\sum_{n=1}^{N} \gamma_{nk}}$

Regularisation: The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

Complexity: To be added.

Non-linear: Not applicable.

Online learning: Online Gaussian Mixture Models. A good start is "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants", Neal & Hinton (1998).
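A compact EM sketch for this model, using NumPy and scipy.stats for the Gaussian density; the initialisation, the fixed iteration count and the toy data are assumptions made for this example:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, iters=100, seed=0):
    """EM for a mixture of Gaussians: the E-step computes the
    responsibilities gamma_nk, the M-step re-estimates pi, mu, Sigma."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: gamma_nk proportional to pi_k * N(x_n | mu_k, Sigma_k)
        gamma = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                 for k in range(K)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted MLE updates
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma

# Toy usage: two 2-D Gaussian blobs
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
pi, mu, Sigma = gmm_em(X, K=2)
print(np.round(mu, 1))   # roughly [[0. 0.] [4. 4.]] (component order may vary)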

Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.
