Cheat 3

Cheat Sheet: Algorithms for Supervised- and Unsupervised Learning
Algorithm Description Model Objective Training Regularisation = arg max t

C
1
Complexity Non-linear Online learning
i : x i Nk ( x , x )
(t i , C ) Use cross-validation to learn the appropriate k; otherwise no training, classication based on existing points. k acts as to regularise the classier: as k N the boundary becomes smoother. O (N M ) space complexity, since all training instances and all their features need to be kept in memory.
k-nearest neighbour
The label of a new point x is classied of the k with the most frequent label t nearest training instances.
N k (x , x ) k points in x closest to x Euclidean distance formula: D i )2 i=1 (xi x (a, b) 1 if a = b; 0 o/w
No optimisation needed.
Natively nds non-linear boundaries.
To be added.
Multivariate likelihood p(x|Ck ) = y (x) = arg max p(Ck |x)

k k
pMLE (xi
v |Ck )
j =1
Naive Bayes
Learn p(Ck |x) by modelling p(x|Ck ) and p(Ck ), using Bayes rule to infer the class conditional probability. Assumes each feature independent of all others, ergo Naive.
= arg max p(x|Ck ) p(Ck ) = arg max

k D
Multinomial likelihood p(x|Ck ) = No optimisation needed. pMLE (wordi . . . where:
(tj = Ck xji = v ) N j =1 (tj = Ck ) D

xi i=1 p(wordi |Ck )
i=1
log p(xi |Ck )
Use a Dirichlet prior on the parameters to obtain a MAP estimate. Multivariate likelihood pMAP (xi = v |Ck ) = (i 1) + N Can only learn linear boundaries for multivariate/multinomial attributes. With Gaussian attributes, quadratic boundaries can be learned with uni-modal distributions. To be added.
i=1 D i=1
p(xi |Ck ) p(Ck ) log p(xi |Ck ) + log p(Ck )
= arg max
k
j =1 (tj = Ck ) xji = v |Ck ) = N D j =1 d=1 (tj = Ck ) xdi
Multinomial likelihood
(tj = Ck xji = v ) |xi |(i 1) + N j =1 (tj = Ck )

j =1
O (N M ), each training instance must be visited and each of its features counted.
xji is the count of word i in test example j ; xdi is the count of feature d in test example j . Gaussian likelihood p(x|Ck ) = D i=1 N (v ; ik , ik ) Gradient descent (or gradient ascent if maximising objective): n+1 = n L . . . where is the step parameter. log p(t|x) m m (x, t) LMLE (, D ) =
pMAP (wordi = v |Ck ) = (i 1) + N j =1 (tj = Ck ) xji N D ( ( t = Ck ) xdi ) D + D j j =1 d=1 d=1 d Penalise large values for the parameters, by introducing a prior distribution over them (typically a Gaussian). Objective function Reformulate the class conditional distribution in terms of a kernel K (x, x ), and use a non-linear kernel (for example K (x, x ) = (1 + wT x)2 ). By the Representer Theorem: O (IN M K ), since each training instance must be visited and each combination of class and features must be calculated for the appropriate feature mapping. p ( C k | x) =
T 1 e (x,Ck ) Z ( x) N K T 1 = e n=1 i=1 nk (xn ,Ci ) (x,Ck ) Z ( x) N K 1 = e n=1 i=1 nk K ((xn ,Ci ),(x,Ck )) Z ( x) N 1 = e n=1 nk K (xn ,x) Z ( x)
y (x) = arg max p(Ck |x)

k k
Log-linear
Estimate p(Ck |x) directly, by assuming a maximum entropy distribution and optimising an objective function over the conditional entropy distribution.
= arg max . . . where:
m m (x, Ck )
Minimise the negative log-likelihood: LMLE (, D ) = p ( t | x) =

(x,t)D
1 e m m m (x,Ck ) Z ( x) Z ( x) = e m m m (x,Ck )
p ( C k |x) =
(x,t)D
(x,t)D
log Z (x) log

k
m m (x, t)
(x,t)D
E[(x, )] (x, t) E[(x, )] (x, t)
LMAP (, D , ) = 2 + . . . where
LMAP (, D , ) = arg min log p()
(x,t)D
m m (x,Ck )
(x,t)D
(x,t)D
(x,t)D (x, t) are the empirical counts.
= arg min log e
(x,t)D
(0)2 2 2
For each class Ck :

(x,t)D
E[(x, )] =
(x,t)D
(x, )p(Ck |x)
= arg min
(x,t)D
log p(t|x)
Online Gradient Descent: Update the parameters using GD after seeing each training instance.
2 m
2 2
(x,t)D
log p(t|x)
log p(t|x)
Binary, linear classier: y (x) = sign(wT x) Directly estimate the linear function y (x) by iteratively updating the weight vector when incorrectly classifying a training instance. . . . where: sign(x) = Tries to minimise the Error function; the number of incorrectly classied input vectors: arg min EP (w) = arg min w T xn t n
w w n M
Iterate over each training example xn , and update the weight vector if misclassication: w
i+1
= w + E P (w ) = w i + xn t n
Perceptron
+1 1
if x 0 if x < 0
. . . where typically = 1. For the multiclass perceptron: wi+1 = wi + (x, t) (x, y (x))
Multiclass perceptron: y (x) = arg max wT (x, Ck )

Ck
. . . where M is the set of misclassied training vectors.
The Voted Perceptron: run the perceptron i times and store each iterations weight vector. Then: y (x) = sign ci sign(wT x ) i
i
O (IN M L), since each combination of instance, class and features must be calculated (see log-linear).
Use a kernel K (x, x ), and 1 weight per training instance: N y (x) = sign wn tn K (x, xn )
n=1
. . . and the update:

i+1 i wn = wn +1
The perceptron is an online algorithm per default.
. . . where ci is the number of correctly classied training instances for wi .
Primal arg min

w,w0
The soft margin SVM: penalise a hyperplane by the number and distance of misclassied points. 1 ||w||2 2 n
Use a non-linear kernel K (x, x ):

N N
Primal
N 1 arg min ||w||2 + C n 2 w,w0 n=1
y ( x) =
n t n xT xn + w 0
n=1
Online SVM. See, for example:
s.t. Support vector machines A maximum margin classier: nds the separating hyperplane with the maximum margin to its closest data points. y ( x) =
N
t n ( w T xn + w 0 ) 1
N N
Dual n t n xT xn + w 0 () = L
N
Quadratic Programming (QP) SMO, Sequential Minimal Optimisation (chunking).
s.t. Dual
n=1
n=1
n m t n t m xT n xm
tn (wT xn + w0 ) 1 n ,
N N N
n > 0
QP: O (n3 );
SMO: much more ecient than QP, since computation based only on support vectors. () = L =
n tn K (x, xn ) + w0
n=1 N N N N N N
The Huller: A Simple and Ecient Online SVM, Bordes & Bottou (2005) Pegasos: Primal Estimated sub-Gradient Solver for SVM, Shalev-Shwartz et al. (2007)
n=1 m=1 N
() = L n s.t.
n=1
n m t n t m xT n xm
n=1 m=1 N
n=1
n n
n m tn tm xT n xm
n=1 m=1
s.t.
n 0,
n t n = 0,
n=1
n m t n t m K ( xn , x m )
0 n C,
n t n = 0,
n=1
n=1
n=1 m=1
Hard assignments rnk {0, 1} s.t. n k rnk = 1, i.e. each data point is assigned to one cluster k. k-means A hard-margin, geometric clustering algorithm, where each data point is assigned to its closest centroid. Geometric distance: The Euclidean distance, l2 norm: D ||xn k ||2 = (xni ki )2
i=1
Expectation:
N K
rnk = rnk ||xn k ||2 2
arg min
r ,
1 0
if ||xn k ||2 minimal for k o/w n rnk xn n rnk Only hard-margin assignment to clusters. To be added.
For non-linearly separable data, use kernel k-means as suggested in: Kernel k-means, Spectral Clustering and Normalized Cuts, Dhillon et al. (2004).
n=1 k=1
Maximisation: MLE =
(k )
Sequential k-means: update the centroids after processing one point at a time.
. . . i.e. minimise the distance from each cluster centre to each of its points.
. . . where (k) is the centroid of cluster k. Expectation: For each n, k set: nk = p(z (i) = k|x(i) ; , , ) = K L(x, , , ) = log p(x| , , ) K N = log k Nk ( xn | k , k )
n=1 k=1 j =1
p(x(i) |z (i) = k; , )p(z (i) = k; )
(= p(k|xn ))
Mixture of Gaussians
A probabilistic clustering algorithm, where clusters are modelled as latent Guassians and each data point is assigned the probability of being drawn from a particular Gaussian.
Assignments to clusters by specifying probabilities p ( x( i ) , z ( i ) ) = p ( x( i ) |z ( i ) ) p ( z ( i ) ) . . . with z (i) Multinomial( ), and k nk p(k|xn ) s.t. j =1 nj = 1. I.e. want to maximise the probability of the observed data x.
Maximisation:
k N ( xn | k , k ) = K j =1 j N (xn |j , j )
N
p(x(i) |z (i) = l; , )p(z (i) = l; ) Online Gaussian Mixture Models. A good start is: To be added. Not applicable. A View of the EM Algorithm that Justies Incremental, Sparse, and Other Variants, Neal & Hinton (1998).
1 k = nk N n=1 N T n=1 nk (xn k )(xn k ) k = N n=1 nk N x n=1 nk n k = N n=1 nk
The mixture of Gaussians assigns probabilities for each cluster to each data point, and as such is capable of capturing ambiguities in the data set.
1 Created
by Emanuel Ferm, HT2011, for semi-procrastinational reasons while studying for a Machine Learning exam. Last updated May 5, 2011.

Cheat 3

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Cheat 3

Uploaded by

Copyright:

Available Formats

Cheat Sheet: Algorithms for Supervised- and Unsupervised Learning

Algorithm Description Model Objective Training Regularisation = arg max t

N k (x , x ) k points in x closest to x Euclidean distance formula: D i )2 i=1 (xi x (a, b) 1 if a = b; 0 o/w

Natively nds non-linear boundaries.

Multivariate likelihood p(x|Ck ) = y (x) = arg max p(Ck |x)

= arg max p(x|Ck ) p(Ck ) = arg max

Multinomial likelihood p(x|Ck ) = No optimisation needed. pMLE (wordi . . . where:

(tj = Ck xji = v ) N j =1 (tj = Ck ) D

log p(xi |Ck )

p(xi |Ck ) p(Ck ) log p(xi |Ck ) + log p(Ck )

j =1 (tj = Ck ) xji = v |Ck ) = N D j =1 d=1 (tj = Ck ) xdi

(tj = Ck xji = v ) |xi |(i 1) + N j =1 (tj = Ck )

y (x) = arg max p(Ck |x)

= arg max . . . where:

Minimise the negative log-likelihood: LMLE (, D ) = p ( t | x) =

log Z (x) log

E[(x, )] (x, t) E[(x, )] (x, t)

LMAP (, D , ) = arg min log p()

(x,t)D (x, t) are the empirical counts.

= arg min log e

For each class Ck :

(x, )p(Ck |x)

Multiclass perceptron: y (x) = arg max wT (x, Ck )

. . . where M is the set of misclassied training vectors.

. . . and the update:

The perceptron is an online algorithm per default.

. . . where ci is the number of correctly classied training instances for wi .

Primal arg min

Use a non-linear kernel K (x, x ):

Online SVM. See, for example:

Quadratic Programming (QP) SMO, Sequential Minimal Optimisation (chunking).

rnk = rnk ||xn k ||2 2

p(x(i) |z (i) = k; , )p(z (i) = k; )

1 k = nk N n=1 N T n=1 nk (xn k )(xn k ) k = N n=1 nk N x n=1 nk n k = N n=1 nk

You might also like