
Cheat Sheet: Algorithms for Supervised and Unsupervised Learning


Each algorithm is summarised under the headings: Description, Model, Objective, Training, Regularisation, Complexity, Non-linear, and Online learning.


k-nearest neighbour

Description: The label of a new point x̂ is classified with the most frequent label t̂ of the k nearest training instances.

Model:
    \hat{t} = \arg\max_C \sum_{i : x_i \in N_k(x, \hat{x})} \delta(t_i, C)
. . . where:
• N_k(x, \hat{x}) ← the k points in x closest to \hat{x}
• Euclidean distance: \sqrt{\sum_{i=1}^D (x_i - \hat{x}_i)^2}
• \delta(a, b) ← 1 if a = b; 0 otherwise

Objective: No optimisation needed.

Training: Use cross-validation to learn the appropriate k; otherwise no training, classification is based on the existing points.

Regularisation: k acts to regularise the classifier: as k → N the boundary becomes smoother.

Complexity: O(NM) space complexity, since all training instances and all their features need to be kept in memory.

Non-linear: Natively finds non-linear boundaries.

Online learning: To be added.
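
As a concrete illustration of the classification rule above, here is a minimal NumPy sketch; the function name knn_predict and the toy data are illustrative assumptions, not taken from the cheat sheet.

import numpy as np

def knn_predict(X_train, t_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distances sqrt(sum_i (x_i - x̂_i)^2) to every training instance
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # N_k(x, x̂): indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # arg max_C sum_i delta(t_i, C): most frequent label among the neighbours
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two 2-D classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t = np.array([0, 0, 1, 1])
print(knn_predict(X, t, np.array([0.8, 0.9]), k=3))  # -> 1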

Naive Bayes

Description: Learn p(C_k|x) by modelling p(x|C_k) and p(C_k), using Bayes' rule to infer the class conditional probability. Assumes each feature is independent of all others, ergo 'Naive'.

Model:
    y(x) = \arg\max_k p(C_k|x)
         = \arg\max_k p(x|C_k) \, p(C_k)
         = \arg\max_k \prod_{i=1}^D p(x_i|C_k) \, p(C_k)
         = \arg\max_k \sum_{i=1}^D \log p(x_i|C_k) + \log p(C_k)

Objective:
• Multivariate likelihood: p(x|C_k) = \prod_{i=1}^D p(x_i|C_k)
• Multinomial likelihood: p(x|C_k) = \prod_{i=1}^D p(\mathrm{word}_i|C_k)^{x_i}
• Gaussian likelihood: p(x|C_k) = \prod_{i=1}^D \mathcal{N}(x_i; \mu_{ik}, \sigma_{ik})

Training: No optimisation needed; the maximum-likelihood estimates are relative counts.

Multivariate likelihood:
    p_{MLE}(x_i = v|C_k) = \frac{\sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{\sum_{j=1}^N \delta(t_j = C_k)}

Multinomial likelihood:
    p_{MLE}(\mathrm{word}_i = v|C_k) = \frac{\sum_{j=1}^N \delta(t_j = C_k) \, x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k) \, x_{jd}}
. . . where x_{ji} is the count of word i in training example j, and x_{jd} is the count of word d in training example j.

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate.

Multivariate likelihood:
    p_{MAP}(x_i = v|C_k) = \frac{(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{|x_i|(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k)}
. . . where |x_i| is the number of values feature x_i can take.

Multinomial likelihood:
    p_{MAP}(\mathrm{word}_i = v|C_k) = \frac{(\alpha_i - 1) + \sum_{j=1}^N \delta(t_j = C_k) \, x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k) \, x_{jd} - D + \sum_{d=1}^D \alpha_d}

Complexity: O(NM); each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.
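
The multinomial estimates above map directly onto counting code. Below is a minimal NumPy sketch that smooths the counts with an additive pseudo-count alpha (playing the role of the (\alpha_i - 1) Dirichlet term); the function names and toy data are illustrative assumptions.

import numpy as np

def train_multinomial_nb(X, t, alpha=1.0):
    """Estimate log p(C_k) and log p(word_i | C_k) from a count matrix X (N x D)
    and labels t in {0, ..., K-1}, with additive pseudo-count alpha."""
    K, D = len(np.unique(t)), X.shape[1]
    log_prior = np.zeros(K)
    log_like = np.zeros((K, D))
    for k in range(K):
        Xk = X[t == k]
        log_prior[k] = np.log(len(Xk) / len(X))       # p(C_k) by relative frequency
        counts = Xk.sum(axis=0) + alpha               # smoothed word counts in class k
        log_like[k] = np.log(counts / counts.sum())   # p(word_i | C_k)
    return log_prior, log_like

def predict_nb(X, log_prior, log_like):
    # arg max_k  log p(C_k) + sum_i x_i log p(word_i | C_k)
    return np.argmax(log_prior + X @ log_like.T, axis=1)

# Toy usage: 3-word vocabulary, two classes
X = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 2], [1, 4, 1]])
t = np.array([0, 0, 1, 1])
print(predict_nb(X, *train_multinomial_nb(X, t)))  # -> [0 0 1 1]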

Log-linear

Description: Estimate p(C_k|x) directly, by assuming a maximum-entropy distribution and optimising an objective function over the conditional entropy distribution.

Model:
    y(x) = \arg\max_k p(C_k|x) = \arg\max_k \sum_m \lambda_m \phi_m(x, C_k)
. . . where:
    p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\sum_m \lambda_m \phi_m(x, C_k)}
    Z_\lambda(x) = \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)}

Objective: Minimise the negative log-likelihood:
    L_{MLE}(\lambda, \mathcal{D}) = -\log \prod_{(x,t) \in \mathcal{D}} p(t|x) = -\sum_{(x,t) \in \mathcal{D}} \log p(t|x)
                                  = \sum_{(x,t) \in \mathcal{D}} \Big( \log Z_\lambda(x) - \sum_m \lambda_m \phi_m(x, t) \Big)

Training: Gradient descent (or gradient ascent if maximising the objective):
    \lambda^{n+1} = \lambda^n - \eta \nabla L
. . . where \eta is the step parameter.
    \nabla L_{MLE}(\lambda, \mathcal{D}) = \sum_{(x,t) \in \mathcal{D}} E[\phi(x, \cdot)] - \sum_{(x,t) \in \mathcal{D}} \phi(x, t)
    \nabla L_{MAP}(\lambda, \mathcal{D}, \sigma) = \frac{\lambda}{\sigma^2} + \sum_{(x,t) \in \mathcal{D}} E[\phi(x, \cdot)] - \sum_{(x,t) \in \mathcal{D}} \phi(x, t)
. . . where \sum_{(x,t) \in \mathcal{D}} \phi(x, t) are the empirical counts, and for each class C_k:
    E[\phi(x, \cdot)] = \sum_{(x,t) \in \mathcal{D}} \phi(x, \cdot) \, p(C_k|x)

Regularisation: Penalise large values of the \lambda parameters by introducing a prior distribution over them (typically a Gaussian):
    L_{MAP}(\lambda, \mathcal{D}, \sigma) = \arg\min_\lambda \Big( -\log p(\lambda) - \sum_{(x,t) \in \mathcal{D}} \log p(t|x) \Big)
        = \arg\min_\lambda \Big( -\log e^{-\frac{(0 - \lambda)^2}{2\sigma^2}} - \sum_{(x,t) \in \mathcal{D}} \log p(t|x) \Big)
        = \arg\min_\lambda \Big( \sum_m \frac{\lambda_m^2}{2\sigma^2} - \sum_{(x,t) \in \mathcal{D}} \log p(t|x) \Big)

Complexity: O(INMK), since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class-conditional distribution in terms of a kernel K(x, x'), and use a non-linear kernel (for example K(x, x') = (1 + w^T x)^2). By the Representer Theorem:
    p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\lambda^T \phi(x, C_k)}
             = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{ni} \phi(x_n, C_i)^T \phi(x, C_k)}
             = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{ni} K((x_n, C_i), (x, C_k))}
             = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^N \alpha_{nk} K(x_n, x)}

Online learning: Online Gradient Descent: update the parameters using GD after seeing each training instance.
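
A minimal gradient-descent sketch of the MAP objective above. It assumes the simple feature map \phi(x, C_k) that places x in the block belonging to class k, so the parameters form a K x D matrix; the function names, step size and toy data are illustrative assumptions.

import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def train_loglinear(X, t, K, eta=0.1, sigma2=10.0, iters=500):
    """Gradient descent on the MAP objective with a Gaussian prior (variance sigma2)."""
    N, D = X.shape
    Lam = np.zeros((K, D))
    T = np.eye(K)[t]                      # one-hot targets
    for _ in range(iters):
        P = softmax(X @ Lam.T)            # p(C_k | x) for every training instance
        # gradient: expected counts minus empirical counts, plus the prior term Lam / sigma^2
        grad = (P - T).T @ X + Lam / sigma2
        Lam -= eta * grad / N             # lambda^{n+1} = lambda^n - eta * grad
    return Lam

# Toy usage
X = np.array([[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.0]])
t = np.array([0, 0, 1, 1])
Lam = train_loglinear(X, t, K=2)
print(np.argmax(X @ Lam.T, axis=1))  # -> [0 0 1 1]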

Perceptron

Description: Directly estimate the linear function y(x) by iteratively updating the weight vector whenever it incorrectly classifies a training instance.

Model: Binary, linear classifier:
    y(x) = \mathrm{sign}(w^T x)
. . . where:
    \mathrm{sign}(x) = +1 if x \ge 0; -1 if x < 0
Multiclass perceptron:
    y(x) = \arg\max_{C_k} w^T \phi(x, C_k)

Objective: Tries to minimise the error function over the incorrectly classified input vectors:
    \arg\min_w E_P(w) = \arg\min_w -\sum_{n \in \mathcal{M}} w^T x_n t_n
. . . where \mathcal{M} is the set of misclassified training vectors.

Training: Iterate over each training example x_n, and update the weight vector if it is misclassified:
    w^{i+1} = w^i - \eta \nabla E_P(w) = w^i + \eta x_n t_n
. . . where typically \eta = 1.
For the multiclass perceptron:
    w^{i+1} = w^i + \phi(x, t) - \phi(x, y(x))

Regularisation: The Voted Perceptron: run the perceptron i times and store each iteration's weight vector. Then:
    y(x) = \mathrm{sign}\Big( \sum_i c_i \, \mathrm{sign}(w_i^T x) \Big)
. . . where c_i is the number of correctly classified training instances for w_i.

Complexity: O(INML), since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel K(x, x'), and one weight per training instance:
    y(x) = \mathrm{sign}\Big( \sum_{n=1}^N w_n t_n K(x, x_n) \Big)
. . . and the update:
    w_n^{i+1} = w_n^i + 1

Online learning: The perceptron is an online algorithm by default.
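
A minimal NumPy sketch of the binary perceptron update above, with the bias handled by an appended constant feature; the function names and toy data are illustrative assumptions.

import numpy as np

def train_perceptron(X, t, epochs=10, eta=1.0):
    """Binary perceptron; labels t in {-1, +1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append a constant bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(Xb, t):
            if np.sign(x_n @ w) != t_n:          # misclassified (a zero score also triggers an update)
                w = w + eta * t_n * x_n          # w^{i+1} = w^i + eta * x_n * t_n
    return w

def predict_perceptron(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.where(Xb @ w >= 0, 1, -1)          # sign(x) with sign(0) = +1, as above

# Toy usage: linearly separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w = train_perceptron(X, t)
print(predict_perceptron(X, w))  # -> [ 1  1 -1 -1]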

Support vector machines

Description: A maximum margin classifier: finds the separating hyperplane with the maximum margin to its closest data points.

Model:
    y(x) = \sum_{n=1}^N \lambda_n t_n x^T x_n + w_0

Objective:
Primal:
    \arg\min_{w, w_0} \frac{1}{2} \|w\|^2
    s.t. \; t_n(w^T x_n + w_0) \ge 1 \quad \forall n
Dual:
    \tilde{L}(\Lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m
    s.t. \; \lambda_n \ge 0, \quad \sum_{n=1}^N \lambda_n t_n = 0 \quad \forall n

Training:
• Quadratic Programming (QP)
• SMO, Sequential Minimal Optimisation (chunking)

Regularisation: The soft-margin SVM: penalise a hyperplane by the number and distance of misclassified points.
Primal:
    \arg\min_{w, w_0} \frac{1}{2} \|w\|^2 + C \sum_{n=1}^N \xi_n
    s.t. \; t_n(w^T x_n + w_0) \ge 1 - \xi_n, \quad \xi_n \ge 0 \quad \forall n
Dual:
    \tilde{L}(\Lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m
    s.t. \; 0 \le \lambda_n \le C, \quad \sum_{n=1}^N \lambda_n t_n = 0 \quad \forall n

Complexity:
• QP: O(n^3)
• SMO: much more efficient than QP, since computation is based only on the support vectors.

Non-linear: Use a non-linear kernel K(x, x'):
    y(x) = \sum_{n=1}^N \lambda_n t_n K(x, x_n) + w_0
    \tilde{L}(\Lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m K(x_n, x_m)

Online learning: Online SVM. See, for example:
• The Huller: A Simple and Efficient Online SVM, Bordes & Bottou (2005)
• Pegasos: Primal Estimated sub-Gradient Solver for SVM, Shalev-Shwartz et al. (2007)
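
The sketch below is a deliberate simplification rather than one of the solvers named above: it runs projected gradient ascent on the soft-margin dual, and folds w_0 into the kernel by adding a constant so that only the box constraint 0 <= lambda_n <= C remains. Names, step size and toy data are illustrative assumptions.

import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def train_svm_dual(X, t, C=1.0, kernel=linear_kernel, eta=0.01, iters=2000):
    """Projected gradient ascent on the soft-margin dual
    max sum_n lambda_n - 1/2 sum_nm lambda_n lambda_m t_n t_m K(x_n, x_m)."""
    K = kernel(X, X) + 1.0                   # constant term absorbs the bias w_0
    Q = (t[:, None] * t[None, :]) * K
    lam = np.zeros(len(t))
    for _ in range(iters):
        grad = 1.0 - Q @ lam                 # gradient of the dual objective
        lam = np.clip(lam + eta * grad, 0.0, C)
    return lam

def predict_svm(X_train, t_train, lam, X_new, kernel=linear_kernel):
    # y(x) = sign( sum_n lambda_n t_n K(x, x_n) ), bias absorbed into the kernel
    K = kernel(X_new, X_train) + 1.0
    return np.sign(K @ (lam * t_train))

# Toy usage
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
lam = train_svm_dual(X, t)
print(predict_svm(X, t, lam, X))  # -> [ 1.  1. -1. -1.]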

k-means

Description: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid.

Model: Hard assignments r_{nk} \in \{0, 1\} s.t. \forall n: \sum_k r_{nk} = 1, i.e. each data point is assigned to exactly one cluster k.
Geometric distance: the Euclidean distance, \ell_2 norm:
    \|x_n - \mu_k\|_2 = \sqrt{\sum_{i=1}^D (x_{ni} - \mu_{ki})^2}

Objective:
    \arg\min_{r, \mu} \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|_2^2
. . . i.e. minimise the distance from each cluster centre to each of its points.

Training:
Expectation:
    r_{nk} = 1 if \|x_n - \mu_k\|^2 is minimal for k; 0 o/w
Maximisation:
    \mu^{(k)}_{MLE} = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}
. . . where \mu^{(k)} is the centroid of cluster k.

Regularisation: Only hard-margin assignment to clusters.

Complexity: To be added.

Non-linear: For non-linearly separable data, use kernel k-means as suggested in: Kernel k-means, Spectral Clustering and Normalized Cuts, Dhillon et al. (2004).

Online learning: Sequential k-means: update the centroids after processing one point at a time.
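
A minimal NumPy sketch of the two alternating steps above; the initialisation, function name and toy data are illustrative assumptions.

import numpy as np

def kmeans(X, K, iters=100):
    """Alternate hard assignments (Expectation) and centroid updates (Maximisation)."""
    N = len(X)
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()   # K initial centroids drawn from the data
    for _ in range(iters):
        # Expectation: r_nk = 1 for the closest centroid, 0 otherwise
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = np.argmin(d2, axis=1)
        # Maximisation: each centroid becomes the mean of its assigned points
        mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                       for k in range(K)])
    return r, mu

# Toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (10, 2)), rng.normal(3, 0.2, (10, 2))])
r, mu = kmeans(X, K=2)
print(r)            # cluster index per point
print(mu.round(2))  # centroids near (0, 0) and (3, 3)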

Mixture of Gaussians

Description: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian.

Model: Assignments to clusters are made by specifying probabilities:
    p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)}) \, p(z^{(i)})
. . . with z^{(i)} \sim \mathrm{Multinomial}(\gamma), and \gamma_{nk} \equiv p(k|x_n) s.t. \sum_{j=1}^K \gamma_{nj} = 1. I.e. we want to maximise the probability of the observed data x.

Objective:
    L(x, \pi, \mu, \Sigma) = \log p(x|\pi, \mu, \Sigma) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)

Training:
Expectation: for each n, k set
    \gamma_{nk} = p(z^{(i)} = k|x^{(i)}; \gamma, \mu, \Sigma) \; (= p(k|x_n))
                = \frac{p(x^{(i)}|z^{(i)} = k; \mu, \Sigma) \, p(z^{(i)} = k; \pi)}{\sum_{j=1}^K p(x^{(i)}|z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \pi)}
                = \frac{\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n|\mu_j, \Sigma_j)}
Maximisation:
    \pi_k = \frac{1}{N} \sum_{n=1}^N \gamma_{nk}
    \mu_k = \frac{\sum_{n=1}^N \gamma_{nk} x_n}{\sum_{n=1}^N \gamma_{nk}}
    \Sigma_k = \frac{\sum_{n=1}^N \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N \gamma_{nk}}

Regularisation: The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

Complexity: To be added.

Non-linear: Not applicable.

Online learning: Online Gaussian Mixture Models. A good start is: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, Neal & Hinton (1998).
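
A minimal NumPy sketch of the EM updates above; the density helper, initialisation and toy data are illustrative assumptions, and a small ridge is added to each covariance for numerical stability.

import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma), evaluated row-wise on X."""
    D = len(mu)
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)) / norm

def em_gmm(X, K, iters=50):
    """EM for a mixture of Gaussians, following the E- and M-steps above."""
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()   # initial means drawn from the data
    Sigma = np.array([np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: gamma_nk = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)
        gamma = np.column_stack([pi[k] * gauss_pdf(X, mu[k], Sigma[k]) for k in range(K)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma

# Toy usage: two 2-D blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
pi, mu, Sigma, gamma = em_gmm(X, K=2)
print(pi.round(2), mu.round(2))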

Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.
