
Cheat Sheet: Algorithms for Supervised and Unsupervised Learning


Each algorithm is summarised under the headings: Description, Model, Objective, Training, Regularisation, Complexity, Non-linear, and Online learning.


k-nearest neighbour

Description: The label of a new point x̂ is classified with the most frequent label t̂ of the k nearest training instances.

Model:
    \hat{t} = \arg\max_C \sum_{i : x_i \in N_k(x, \hat{x})} \delta(t_i, C)
. . . where:
• N_k(x, \hat{x}) ← the k points in x closest to \hat{x}
• Euclidean distance: \sqrt{\sum_{i=1}^D (x_i - \hat{x}_i)^2}
• \delta(a, b) ← 1 if a = b; 0 otherwise

Objective: No optimisation needed.

Training: Use cross-validation to learn the appropriate k; otherwise no training, classification is based on the existing points.

Regularisation: k acts to regularise the classifier: as k → N the boundary becomes smoother.

Complexity: O(NM) space complexity, since all training instances and all their features need to be kept in memory.

Non-linear: Natively finds non-linear boundaries.

Online learning: To be added.
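
As a concrete illustration of the classification rule above, here is a minimal NumPy sketch; the function name knn_predict and the toy data are illustrative assumptions, not taken from the cheat sheet.

import numpy as np

def knn_predict(X_train, t_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distances sqrt(sum_i (x_i - x̂_i)^2) to every training instance
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # N_k(x, x̂): indices of the k closest training instances
    nearest = np.argsort(dists)[:k]
    # arg max_C sum_i delta(t_i, C): most frequent label among the neighbours
    labels, counts = np.unique(t_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two 2-D classes
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
t = np.array([0, 0, 1, 1])
print(knn_predict(X, t, np.array([0.8, 0.9]), k=3))  # -> 1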

Naive Bayes

Description: Learn p(C_k|x) by modelling p(x|C_k) and p(C_k), using Bayes' rule to infer the class conditional probability. Assumes each feature is independent of all others, ergo 'Naive'.

Model:
    y(x) = \arg\max_k p(C_k|x)
         = \arg\max_k p(x|C_k) \, p(C_k)
         = \arg\max_k \prod_{i=1}^D p(x_i|C_k) \, p(C_k)
         = \arg\max_k \sum_{i=1}^D \log p(x_i|C_k) + \log p(C_k)

Objective:
• Multivariate likelihood: p(x|C_k) = \prod_{i=1}^D p(x_i|C_k)
• Multinomial likelihood: p(x|C_k) = \prod_{i=1}^D p(\mathrm{word}_i|C_k)^{x_i}
• Gaussian likelihood: p(x|C_k) = \prod_{i=1}^D \mathcal{N}(x_i; \mu_{ik}, \sigma_{ik})

Training: No optimisation needed; the maximum-likelihood estimates are relative counts.

Multivariate likelihood:
    p_{MLE}(x_i = v|C_k) = \frac{\sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{\sum_{j=1}^N \delta(t_j = C_k)}

Multinomial likelihood:
    p_{MLE}(\mathrm{word}_i = v|C_k) = \frac{\sum_{j=1}^N \delta(t_j = C_k) \, x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k) \, x_{jd}}
. . . where x_{ji} is the count of word i in training example j, and x_{jd} is the count of word d in training example j.

Regularisation: Use a Dirichlet prior on the parameters to obtain a MAP estimate.

Multivariate likelihood:
    p_{MAP}(x_i = v|C_k) = \frac{(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k \wedge x_{ji} = v)}{|x_i|(\beta_i - 1) + \sum_{j=1}^N \delta(t_j = C_k)}
. . . where |x_i| is the number of values feature x_i can take.

Multinomial likelihood:
    p_{MAP}(\mathrm{word}_i = v|C_k) = \frac{(\alpha_i - 1) + \sum_{j=1}^N \delta(t_j = C_k) \, x_{ji}}{\sum_{j=1}^N \sum_{d=1}^D \delta(t_j = C_k) \, x_{jd} - D + \sum_{d=1}^D \alpha_d}

Complexity: O(NM); each training instance must be visited and each of its features counted.

Non-linear: Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions.

Online learning: To be added.
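
The multinomial estimates above map directly onto counting code. Below is a minimal NumPy sketch that smooths the counts with an additive pseudo-count alpha (playing the role of the (\alpha_i - 1) Dirichlet term); the function names and toy data are illustrative assumptions.

import numpy as np

def train_multinomial_nb(X, t, alpha=1.0):
    """Estimate log p(C_k) and log p(word_i | C_k) from a count matrix X (N x D)
    and labels t in {0, ..., K-1}, with additive pseudo-count alpha."""
    K, D = len(np.unique(t)), X.shape[1]
    log_prior = np.zeros(K)
    log_like = np.zeros((K, D))
    for k in range(K):
        Xk = X[t == k]
        log_prior[k] = np.log(len(Xk) / len(X))       # p(C_k) by relative frequency
        counts = Xk.sum(axis=0) + alpha               # smoothed word counts in class k
        log_like[k] = np.log(counts / counts.sum())   # p(word_i | C_k)
    return log_prior, log_like

def predict_nb(X, log_prior, log_like):
    # arg max_k  log p(C_k) + sum_i x_i log p(word_i | C_k)
    return np.argmax(log_prior + X @ log_like.T, axis=1)

# Toy usage: 3-word vocabulary, two classes
X = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 2], [1, 4, 1]])
t = np.array([0, 0, 1, 1])
print(predict_nb(X, *train_multinomial_nb(X, t)))  # -> [0 0 1 1]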

Log-linear

Description: Estimate p(C_k|x) directly, by assuming a maximum-entropy distribution and optimising an objective function over the conditional entropy distribution.

Model:
    y(x) = \arg\max_k p(C_k|x) = \arg\max_k \sum_m \lambda_m \phi_m(x, C_k)
. . . where:
    p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\sum_m \lambda_m \phi_m(x, C_k)}
    Z_\lambda(x) = \sum_k e^{\sum_m \lambda_m \phi_m(x, C_k)}

Objective: Minimise the negative log-likelihood:
    L_{MLE}(\lambda, \mathcal{D}) = -\log \prod_{(x,t) \in \mathcal{D}} p(t|x) = -\sum_{(x,t) \in \mathcal{D}} \log p(t|x)
                                  = \sum_{(x,t) \in \mathcal{D}} \Big( \log Z_\lambda(x) - \sum_m \lambda_m \phi_m(x, t) \Big)

Training: Gradient descent (or gradient ascent if maximising the objective):
    \lambda^{n+1} = \lambda^n - \eta \nabla L
. . . where \eta is the step parameter.
    \nabla L_{MLE}(\lambda, \mathcal{D}) = \sum_{(x,t) \in \mathcal{D}} E[\phi(x, \cdot)] - \sum_{(x,t) \in \mathcal{D}} \phi(x, t)
    \nabla L_{MAP}(\lambda, \mathcal{D}, \sigma) = \frac{\lambda}{\sigma^2} + \sum_{(x,t) \in \mathcal{D}} E[\phi(x, \cdot)] - \sum_{(x,t) \in \mathcal{D}} \phi(x, t)
. . . where \sum_{(x,t) \in \mathcal{D}} \phi(x, t) are the empirical counts, and for each class C_k:
    E[\phi(x, \cdot)] = \sum_{(x,t) \in \mathcal{D}} \phi(x, \cdot) \, p(C_k|x)

Regularisation: Penalise large values of the \lambda parameters by introducing a prior distribution over them (typically a Gaussian):
    L_{MAP}(\lambda, \mathcal{D}, \sigma) = \arg\min_\lambda \Big( -\log p(\lambda) - \sum_{(x,t) \in \mathcal{D}} \log p(t|x) \Big)
        = \arg\min_\lambda \Big( -\log e^{-\frac{(0 - \lambda)^2}{2\sigma^2}} - \sum_{(x,t) \in \mathcal{D}} \log p(t|x) \Big)
        = \arg\min_\lambda \Big( \sum_m \frac{\lambda_m^2}{2\sigma^2} - \sum_{(x,t) \in \mathcal{D}} \log p(t|x) \Big)

Complexity: O(INMK), since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping.

Non-linear: Reformulate the class-conditional distribution in terms of a kernel K(x, x'), and use a non-linear kernel (for example K(x, x') = (1 + w^T x)^2). By the Representer Theorem:
    p(C_k|x) = \frac{1}{Z_\lambda(x)} e^{\lambda^T \phi(x, C_k)}
             = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{ni} \phi(x_n, C_i)^T \phi(x, C_k)}
             = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^N \sum_{i=1}^K \alpha_{ni} K((x_n, C_i), (x, C_k))}
             = \frac{1}{Z_\lambda(x)} e^{\sum_{n=1}^N \alpha_{nk} K(x_n, x)}

Online learning: Online Gradient Descent: update the parameters using GD after seeing each training instance.
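
A minimal gradient-descent sketch of the MAP objective above. It assumes the simple feature map \phi(x, C_k) that places x in the block belonging to class k, so the parameters form a K x D matrix; the function names, step size and toy data are illustrative assumptions.

import numpy as np

def softmax(scores):
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def train_loglinear(X, t, K, eta=0.1, sigma2=10.0, iters=500):
    """Gradient descent on the MAP objective with a Gaussian prior (variance sigma2)."""
    N, D = X.shape
    Lam = np.zeros((K, D))
    T = np.eye(K)[t]                      # one-hot targets
    for _ in range(iters):
        P = softmax(X @ Lam.T)            # p(C_k | x) for every training instance
        # gradient: expected counts minus empirical counts, plus the prior term Lam / sigma^2
        grad = (P - T).T @ X + Lam / sigma2
        Lam -= eta * grad / N             # lambda^{n+1} = lambda^n - eta * grad
    return Lam

# Toy usage
X = np.array([[0.0, 1.0], [0.2, 0.8], [1.0, 0.1], [0.9, 0.0]])
t = np.array([0, 0, 1, 1])
Lam = train_loglinear(X, t, K=2)
print(np.argmax(X @ Lam.T, axis=1))  # -> [0 0 1 1]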

Perceptron

Description: Directly estimate the linear function y(x) by iteratively updating the weight vector whenever it incorrectly classifies a training instance.

Model: Binary, linear classifier:
    y(x) = \mathrm{sign}(w^T x)
. . . where:
    \mathrm{sign}(x) = +1 if x \ge 0; -1 if x < 0
Multiclass perceptron:
    y(x) = \arg\max_{C_k} w^T \phi(x, C_k)

Objective: Tries to minimise the error function over the incorrectly classified input vectors:
    \arg\min_w E_P(w) = \arg\min_w -\sum_{n \in \mathcal{M}} w^T x_n t_n
. . . where \mathcal{M} is the set of misclassified training vectors.

Training: Iterate over each training example x_n, and update the weight vector if it is misclassified:
    w^{i+1} = w^i - \eta \nabla E_P(w) = w^i + \eta x_n t_n
. . . where typically \eta = 1.
For the multiclass perceptron:
    w^{i+1} = w^i + \phi(x, t) - \phi(x, y(x))

Regularisation: The Voted Perceptron: run the perceptron i times and store each iteration's weight vector. Then:
    y(x) = \mathrm{sign}\Big( \sum_i c_i \, \mathrm{sign}(w_i^T x) \Big)
. . . where c_i is the number of correctly classified training instances for w_i.

Complexity: O(INML), since each combination of instance, class and features must be calculated (see log-linear).

Non-linear: Use a kernel K(x, x'), and one weight per training instance:
    y(x) = \mathrm{sign}\Big( \sum_{n=1}^N w_n t_n K(x, x_n) \Big)
. . . and the update:
    w_n^{i+1} = w_n^i + 1

Online learning: The perceptron is an online algorithm by default.
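
A minimal NumPy sketch of the binary perceptron update above, with the bias handled by an appended constant feature; the function names and toy data are illustrative assumptions.

import numpy as np

def train_perceptron(X, t, epochs=10, eta=1.0):
    """Binary perceptron; labels t in {-1, +1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append a constant bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_n, t_n in zip(Xb, t):
            if np.sign(x_n @ w) != t_n:          # misclassified (a zero score also triggers an update)
                w = w + eta * t_n * x_n          # w^{i+1} = w^i + eta * x_n * t_n
    return w

def predict_perceptron(X, w):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.where(Xb @ w >= 0, 1, -1)          # sign(x) with sign(0) = +1, as above

# Toy usage: linearly separable data
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
t = np.array([1, 1, -1, -1])
w = train_perceptron(X, t)
print(predict_perceptron(X, w))  # -> [ 1  1 -1 -1]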

Support vector machines

Description: A maximum margin classifier: finds the separating hyperplane with the maximum margin to its closest data points.

Model:
    y(x) = \sum_{n=1}^N \lambda_n t_n x^T x_n + w_0

Objective:
Primal:
    \arg\min_{w, w_0} \frac{1}{2} \|w\|^2
    s.t. \; t_n(w^T x_n + w_0) \ge 1 \quad \forall n
Dual:
    \tilde{L}(\Lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m
    s.t. \; \lambda_n \ge 0, \quad \sum_{n=1}^N \lambda_n t_n = 0 \quad \forall n

Training:
• Quadratic Programming (QP)
• SMO, Sequential Minimal Optimisation (chunking)

Regularisation: The soft-margin SVM: penalise a hyperplane by the number and distance of misclassified points.
Primal:
    \arg\min_{w, w_0} \frac{1}{2} \|w\|^2 + C \sum_{n=1}^N \xi_n
    s.t. \; t_n(w^T x_n + w_0) \ge 1 - \xi_n, \quad \xi_n \ge 0 \quad \forall n
Dual:
    \tilde{L}(\Lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m x_n^T x_m
    s.t. \; 0 \le \lambda_n \le C, \quad \sum_{n=1}^N \lambda_n t_n = 0 \quad \forall n

Complexity:
• QP: O(n^3)
• SMO: much more efficient than QP, since computation is based only on the support vectors.

Non-linear: Use a non-linear kernel K(x, x'):
    y(x) = \sum_{n=1}^N \lambda_n t_n K(x, x_n) + w_0
    \tilde{L}(\Lambda) = \sum_{n=1}^N \lambda_n - \frac{1}{2} \sum_{n=1}^N \sum_{m=1}^N \lambda_n \lambda_m t_n t_m K(x_n, x_m)

Online learning: Online SVM. See, for example:
• The Huller: A Simple and Efficient Online SVM, Bordes & Bottou (2005)
• Pegasos: Primal Estimated sub-Gradient Solver for SVM, Shalev-Shwartz et al. (2007)
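
The sketch below is a deliberate simplification rather than one of the solvers named above: it runs projected gradient ascent on the soft-margin dual, and folds w_0 into the kernel by adding a constant so that only the box constraint 0 <= lambda_n <= C remains. Names, step size and toy data are illustrative assumptions.

import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def train_svm_dual(X, t, C=1.0, kernel=linear_kernel, eta=0.01, iters=2000):
    """Projected gradient ascent on the soft-margin dual
    max sum_n lambda_n - 1/2 sum_nm lambda_n lambda_m t_n t_m K(x_n, x_m)."""
    K = kernel(X, X) + 1.0                   # constant term absorbs the bias w_0
    Q = (t[:, None] * t[None, :]) * K
    lam = np.zeros(len(t))
    for _ in range(iters):
        grad = 1.0 - Q @ lam                 # gradient of the dual objective
        lam = np.clip(lam + eta * grad, 0.0, C)
    return lam

def predict_svm(X_train, t_train, lam, X_new, kernel=linear_kernel):
    # y(x) = sign( sum_n lambda_n t_n K(x, x_n) ), bias absorbed into the kernel
    K = kernel(X_new, X_train) + 1.0
    return np.sign(K @ (lam * t_train))

# Toy usage
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
t = np.array([1.0, 1.0, -1.0, -1.0])
lam = train_svm_dual(X, t)
print(predict_svm(X, t, lam, X))  # -> [ 1.  1. -1. -1.]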

k-means

Description: A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid.

Model: Hard assignments r_{nk} \in \{0, 1\} s.t. \forall n: \sum_k r_{nk} = 1, i.e. each data point is assigned to exactly one cluster k.
Geometric distance: the Euclidean distance, \ell_2 norm:
    \|x_n - \mu_k\|_2 = \sqrt{\sum_{i=1}^D (x_{ni} - \mu_{ki})^2}

Objective:
    \arg\min_{r, \mu} \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|x_n - \mu_k\|_2^2
. . . i.e. minimise the distance from each cluster centre to each of its points.

Training:
Expectation:
    r_{nk} = 1 if \|x_n - \mu_k\|^2 is minimal for k; 0 o/w
Maximisation:
    \mu^{(k)}_{MLE} = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}
. . . where \mu^{(k)} is the centroid of cluster k.

Regularisation: Only hard-margin assignment to clusters.

Complexity: To be added.

Non-linear: For non-linearly separable data, use kernel k-means as suggested in: Kernel k-means, Spectral Clustering and Normalized Cuts, Dhillon et al. (2004).

Online learning: Sequential k-means: update the centroids after processing one point at a time.
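
A minimal NumPy sketch of the two alternating steps above; the initialisation, function name and toy data are illustrative assumptions.

import numpy as np

def kmeans(X, K, iters=100):
    """Alternate hard assignments (Expectation) and centroid updates (Maximisation)."""
    N = len(X)
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()   # K initial centroids drawn from the data
    for _ in range(iters):
        # Expectation: r_nk = 1 for the closest centroid, 0 otherwise
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        r = np.argmin(d2, axis=1)
        # Maximisation: each centroid becomes the mean of its assigned points
        mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                       for k in range(K)])
    return r, mu

# Toy usage: two well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (10, 2)), rng.normal(3, 0.2, (10, 2))])
r, mu = kmeans(X, K=2)
print(r)            # cluster index per point
print(mu.round(2))  # centroids near (0, 0) and (3, 3)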

Mixture of Gaussians

Description: A probabilistic clustering algorithm, where clusters are modelled as latent Gaussians and each data point is assigned the probability of being drawn from a particular Gaussian.

Model: Assignments to clusters are made by specifying probabilities:
    p(x^{(i)}, z^{(i)}) = p(x^{(i)}|z^{(i)}) \, p(z^{(i)})
. . . with z^{(i)} \sim \mathrm{Multinomial}(\gamma), and \gamma_{nk} \equiv p(k|x_n) s.t. \sum_{j=1}^K \gamma_{nj} = 1. I.e. we want to maximise the probability of the observed data x.

Objective:
    L(x, \pi, \mu, \Sigma) = \log p(x|\pi, \mu, \Sigma) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)

Training:
Expectation: for each n, k set
    \gamma_{nk} = p(z^{(i)} = k|x^{(i)}; \gamma, \mu, \Sigma) \; (= p(k|x_n))
                = \frac{p(x^{(i)}|z^{(i)} = k; \mu, \Sigma) \, p(z^{(i)} = k; \pi)}{\sum_{j=1}^K p(x^{(i)}|z^{(i)} = j; \mu, \Sigma) \, p(z^{(i)} = j; \pi)}
                = \frac{\pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(x_n|\mu_j, \Sigma_j)}
Maximisation:
    \pi_k = \frac{1}{N} \sum_{n=1}^N \gamma_{nk}
    \mu_k = \frac{\sum_{n=1}^N \gamma_{nk} x_n}{\sum_{n=1}^N \gamma_{nk}}
    \Sigma_k = \frac{\sum_{n=1}^N \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T}{\sum_{n=1}^N \gamma_{nk}}

Regularisation: The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.

Complexity: To be added.

Non-linear: Not applicable.

Online learning: Online Gaussian Mixture Models. A good start is: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, Neal & Hinton (1998).
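
A minimal NumPy sketch of the EM updates above; the density helper, initialisation and toy data are illustrative assumptions, and a small ridge is added to each covariance for numerical stability.

import numpy as np

def gauss_pdf(X, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma), evaluated row-wise on X."""
    D = len(mu)
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * np.einsum('nd,de,ne->n', diff, inv, diff)) / norm

def em_gmm(X, K, iters=50):
    """EM for a mixture of Gaussians, following the E- and M-steps above."""
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()   # initial means drawn from the data
    Sigma = np.array([np.eye(D) for _ in range(K)])
    for _ in range(iters):
        # E-step: gamma_nk = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)
        gamma = np.column_stack([pi[k] * gauss_pdf(X, mu[k], Sigma[k]) for k in range(K)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma

# Toy usage: two 2-D blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(2, 0.3, (20, 2))])
pi, mu, Sigma, gamma = em_gmm(X, K=2)
print(pi.round(2), mu.round(2))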

Created by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.
