
Unit 2 - SVM

SUPPORT VECTOR MACHINES


Main Idea
• Max-Margin Classifier
– Formalize notion of the best linear separator
• Lagrangian Multipliers
– Way to convert a constrained optimization problem to one
that is easier to solve
• Kernels
– Projecting data into higher-dimensional space makes it
linearly separable
• Complexity
– Depends only on the number of training examples, not on
dimensionality of the kernel space!
Text categorization
Tennis example

[Scatter plot: Temperature vs. Humidity; markers distinguish "play tennis" from "do not play tennis".]
Linear Support Vector Machines
Data: <xi, yi>, i = 1, ..., l
xi ∈ R^d
yi ∈ {-1, +1}
[Scatter plot with axes x1, x2; points labeled +1 and -1.]
Linear SVM 2

Data: <xi, yi>, i = 1, ..., l
xi ∈ R^d
yi ∈ {-1, +1}
[Figure: a separating hyperplane with the regions f(x) = +1 and f(x) = -1.]
All hyperplanes in R^d are parameterized by a vector w and a constant b.
They can be expressed as w•x + b = 0 (remember the equation for a hyperplane from algebra!).
Our aim is to find such a hyperplane f(x) = sign(w•x + b) that correctly classifies our data.
Definitions
Define the hyperplane H such that:
xi•w + b ≥ +1 when yi = +1
xi•w + b ≤ -1 when yi = -1
H1 and H2 are the planes:
H1: xi•w + b = +1
H2: xi•w + b = -1
The points on the planes H1 and H2 are the support vectors.
d+ = the shortest distance to the closest positive point
d- = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d-.
[Figure: hyperplane H between the parallel planes H1 and H2, at distances d+ and d-.]


Definition
SVM
Maximizing the margin
We want a classifier with as large a margin as possible.
[Figure: hyperplane H between H1 and H2, with distances d+ and d-.]
Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |A x0 + B y0 + c| / sqrt(A² + B²).
The distance between H and H1 is: |w•x + b| / ||w|| = 1 / ||w||
The distance between H1 and H2 is: 2 / ||w||
In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi•w + b ≥ +1 when yi = +1
xi•w + b ≤ -1 when yi = -1
These can be combined into yi(xi•w + b) ≥ 1.
Hard margin
Hard margin – Primal problem
Kernel function
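Illustrative sketch (not from the slides): fitting a maximum-margin SVM with scikit-learn, once with a linear kernel and once with an RBF kernel. The six training points and the parameter values (C, gamma) are made up for the example; a very large C approximates the hard-margin case.

# A minimal sketch (assumed data; a large C approximates a hard margin).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],      # class -1
              [4, 4], [5, 4], [4, 5]])     # class +1
y = np.array([-1, -1, -1, +1, +1, +1])

# Linear SVM: f(x) = sign(w.x + b); large C ~ hard margin.
linear_svm = SVC(kernel="linear", C=1e6).fit(X, y)
print("w =", linear_svm.coef_, "b =", linear_svm.intercept_)
print("support vectors:", linear_svm.support_vectors_)

# RBF kernel: implicitly projects the data into a higher-dimensional space.
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print("predictions:", rbf_svm.predict([[1.5, 1.5], [4.5, 4.5]]))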
ENSEMBLE OF CLASSIFIERS
Combining classifiers
• So far, we have only discussed
individual classifiers, i.e., how to
build them and use them.
• Can we combine multiple
classifiers to produce a better
classifier?
• Yes, sometimes
• Two approaches:
– Bagging
– Boosting
Bagging
• Breiman, 1996
• Bootstrap Aggregating = Bagging
• An application of bootstrap sampling
• Given: a set D containing m training examples
• Create a sample S[i] of D by drawing m examples at random with replacement from D
• S[i] is of size m and is expected to leave out about 0.37 (≈ 1/e) of the examples in D
Bagging (cont…)
• Training
  – Create k bootstrap samples S[1], S[2], …, S[k]
  – Build a distinct classifier on each S[i] to produce k classifiers, using the same learning algorithm.

• Testing
  – Classify each new instance by a vote of the k classifiers (equal weights)
Bootstrap distribution
• The bootstrap does not replace or add to the
original data.

• We use the bootstrap distribution as a way to estimate the variation in a statistic based on the original data.
Sampling distribution vs.
bootstrap distribution
• The population: certain unknown quantities of
interest (e.g., mean)

• Multiple samples → sampling distribution

• Bootstrapping:
  – One original sample → B bootstrap samples
  – B bootstrap samples → bootstrap distribution
• Bootstrap distributions usually approximate the shape, spread, and bias of the actual sampling distribution.
• Bootstrap distributions are centered at the value of the statistic from the original sample, plus any bias.
• The sampling distribution is centered at the value of the parameter in the population, plus any bias.
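Illustrative sketch (assumed example): building a bootstrap distribution for a sample mean with numpy and reading off its spread. The number of bootstrap samples B and the random seed are arbitrary choices.

# Sketch: bootstrap distribution of the sample mean (made-up sample).
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([55.5, 67, 87, 48, 63])     # one original sample
B = 2000                                      # number of bootstrap samples

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(B)
])

print("original sample mean:", sample.mean())
print("bootstrap estimate of its spread (std. error):", boot_means.std(ddof=1))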
Cases where bootstrap
does not apply
• Small data sets: the original sample is not a good
approximation of the population

• Dirty data: outliers add variability to our estimates.

• Dependence structures (e.g., time series, spatial problems): the bootstrap is based on the assumption of independence.

• …
Bagging Example
Original data:   1 2 3 4 5 6 7 8
Training set 1:  2 7 8 3 7 6 3 1
Training set 2:  7 8 5 6 4 2 7 1
Training set 3:  3 6 2 7 5 6 2 2
Training set 4:  4 5 1 4 6 4 3 8
Bagging (cont …)
• When does it help?
– When learner is unstable
• Small change to training set causes large
change in the output classifier
• True for decision trees, neural networks; not
true for k-nearest neighbor, naïve Bayesian,
class association rules
– Experimentally, bagging can help
substantially for unstable learners, may
somewhat degrade results for stable
learners
Bagging

• For i = 1 .. M
  – Draw n* < n samples from D with replacement
  – Learn classifier Ci
• Final classifier is a vote of C1 .. CM
• Increases classifier stability / reduces variance
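A hedged sketch of the bagging loop above, using decision trees (an unstable learner) on a synthetic dataset; M, the data generator, and the seeds are arbitrary choices for the example.

# Sketch of bagging: M bootstrap samples, one tree per sample, equal-weight vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)   # synthetic 2-class data
rng = np.random.default_rng(0)
M = 25
classifiers = []
for _ in range(M):
    idx = rng.integers(0, len(X), size=len(X))               # draw with replacement
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

votes = np.array([c.predict(X) for c in classifiers])        # shape (M, n)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)              # majority vote (labels 0/1)
print("training accuracy of the vote:", (y_pred == y).mean())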
Boosting
• A family of methods:
– We only study AdaBoost (Freund &
Schapire, 1996)
• Training
– Produce a sequence of classifiers (the
same base learner)
– Each classifier is dependent on the
previous one, and focuses on the
previous one’s errors
– Examples that are incorrectly predicted
in previous classifiers are given higher
weights
• Testing
– For a test case, the results of the series
of classifiers are combined to determine
the final class of the test case.
AdaBoost
[Diagram] Weighted training set: (x1, y1, w1), (x2, y2, w2), …, (xn, yn, wn), with non-negative weights that sum to 1.
Build a classifier ht (called a weak classifier) whose accuracy on the training set is > ½ (better than random).
Then change the weights.
AdaBoost algorithm
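Illustrative sketch (not the slides' own code): running AdaBoost via scikit-learn's AdaBoostClassifier, whose default base learner is a depth-1 decision tree (a stump); the dataset and n_estimators are made up.

# Sketch: AdaBoost on synthetic data; the default base learner is a depth-1 tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=1)
ada = AdaBoostClassifier(n_estimators=50).fit(X, y)   # 50 weak classifiers in sequence
print("training accuracy:", ada.score(X, y))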
Bagging, Boosting and C4.5
[Plots: C4.5's mean error rate over 10-fold cross-validation; Bagged C4.5 vs. C4.5; Boosted C4.5 vs. C4.5; Boosting vs. Bagging.]
Does AdaBoost always work?
• The actual performance of boosting
depends on the data and the base
learner.
– It requires the base learner to be unstable, as bagging does.
• Boosting seems to be susceptible to noise.
  – When the number of outliers is very large, the emphasis placed on the hard examples can hurt the performance.
What is Clustering?
• Attach a label to each observation (data point) in a set
• This can be called "unsupervised classification"
• Clustering is also called "grouping"
• Intuitively, we want to assign the same label to data points that are "close" to each other
• Thus, clustering algorithms rely on a distance metric between data points
• It is sometimes said that for clustering, the distance metric is more important than the clustering algorithm
Distances: Quantitative Variables

Data point: xi = [xi1 … xip]^T
Some examples
Types of clustering:
1. Hierarchical algorithms: these find successive clusters
using previously established clusters.
1. Agglomerative ("bottom-up"): Agglomerative algorithms
begin with each element as a separate cluster and merge them
into successively larger clusters.
2. Divisive ("top-down"): Divisive algorithms begin with the
whole set and proceed to divide it into successively smaller
clusters.
2. Partitional clustering: Partitional algorithms determine all clusters at
once. They include:
– K-means and derivatives
– Fuzzy c-means clustering
– QT clustering algorithm
Common Distance measures:

• The distance measure will determine how the similarity of two elements is calculated, and it will influence the shape of the clusters.
They include:
1. The Euclidean distance (also called 2-norm distance) is given by:
2. The Manhattan distance (also called taxicab norm or 1-norm) is given by:
3. The maximum norm is given by:
4. The Mahalanobis distance corrects data for different scales and correlations in the variables.
5. Inner product space: The angle between two
vectors can be used as a distance measure when
clustering high dimensional data
6. Hamming distance (sometimes edit distance)
measures the minimum number of substitutions
required to change one member into another.
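Illustrative sketch: computing the distance measures above with numpy for two made-up points; the small data matrix used for the Mahalanobis covariance estimate, and the strings for the Hamming distance, are also made up.

# Sketch: the distance measures above for two made-up points x and y.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))          # 1. 2-norm
manhattan = np.sum(np.abs(x - y))                  # 2. 1-norm / taxicab
max_norm  = np.max(np.abs(x - y))                  # 3. maximum norm
cos_angle = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))   # 5. inner product / angle

# 4. Mahalanobis distance needs a covariance estimate from a (made-up) data set.
data = np.array([[1.0, 2.0, 3.0], [2.0, 1.0, 4.0], [0.5, 2.5, 2.0], [3.0, 0.5, 5.0]])
S_inv = np.linalg.pinv(np.cov(data, rowvar=False))  # pseudo-inverse guards against singularity
mahalanobis = np.sqrt((x - y) @ S_inv @ (x - y))

# 6. Hamming distance between equal-length strings: number of differing positions.
hamming = sum(a != b for a, b in zip("karolin", "kathrin"))

print(euclidean, manhattan, max_norm, cos_angle, mahalanobis, hamming)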
K-MEANS CLUSTERING
• The k-means algorithm is an algorithm to cluster n
objects based on attributes into k partitions, where
k < n.
• It is similar to the expectation-maximization
algorithm for mixtures of Gaussians in that they
both attempt to find the centers of natural clusters
in the data.
• It assumes that the object attributes form a vector
space.
• An algorithm for partitioning (or clustering) N data points into K disjoint subsets Sj containing the data points so as to minimize the sum-of-squares criterion

  J = Σ_{j=1..K} Σ_{n ∈ Sj} ||xn − μj||²

  where xn is a vector representing the nth data point and μj is the geometric centroid of the data points in Sj.
• Simply speaking, k-means clustering is an algorithm to classify or group objects, based on attributes/features, into K groups.
• K is a positive integer.
• The grouping is done by minimizing the sum of squared distances between the data points and the corresponding cluster centroid.
How the K-Mean Clustering algorithm
works?
• Step 1: Begin with a decision on the value of k = number of clusters.
• Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
  1. Take the first k training samples as single-element clusters.
  2. Assign each of the remaining (N − k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
• Step 3: Take each sample in sequence and
compute its distance from the centroid of each of the
clusters. If a sample is not currently in the cluster with the
closest centroid, switch this sample to that cluster and
update the centroid of the cluster gaining the new sample
and the cluster losing the sample.
• Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
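A minimal sketch of steps 1-4 above; initialization follows Step 2, option 1 (first k samples as centroids), and the seven 2-D points are assumed, chosen to be consistent with the centroids shown in the worked example that follows.

# Sketch of steps 1-4 (K=2; initial centroids = first k samples, per Step 2 option 1).
import numpy as np

def kmeans(X, k, n_iter=100):
    centroids = X[:k].copy()                          # Steps 1-2: k initial centroids
    for _ in range(n_iter):
        # Step 3: assign every sample to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):     # Step 4: stop when nothing changes
            break
        centroids = new_centroids
    return labels, centroids

# Assumed points, consistent with the worked example below (centroids (1.25,1.5), (3.9,5.1)).
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
labels, centroids = kmeans(X, k=2)
print(labels)      # cluster index for each point
print(centroids)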
A Simple example showing the implementation of
k-means algorithm
(using K=2)
Step 1:
Initialization: Randomly we choose following two centroids
(k=2) for two clusters.
In this case the 2 centroids are: m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
• Thus, we obtain two clusters
containing:
{1,2,3} and {4,5,6,7}.
• Their new centroids are:
Step 3:
• Now using these centroids
we compute the Euclidean
distance of each object, as
shown in table.

• Therefore, the new clusters are:
  {1,2} and {3,4,5,6,7}
• The next centroids are:
  m1 = (1.25, 1.5) and m2 = (3.9, 5.1)
• Step 4:
  The clusters obtained are:
  {1,2} and {3,4,5,6,7}
• Therefore, there is no change in the clusters.
• Thus, the algorithm comes to a halt here, and the final result consists of 2 clusters: {1,2} and {3,4,5,6,7}.
[Plot of the resulting clusters.]
(with K=3)
[Plots: Step 1 and Step 2.]
How K-means partitions?

• When K centroids are set/fixed, they partition the whole data space into K mutually exclusive subspaces to form a partition.
• A partition amounts to a Voronoi diagram.
• Changing the positions of the centroids leads to a new partitioning.
K-means: Pros & Cons
• Algorithmically, very simple to implement
• K-means converges, but it finds only a local minimum of the cost function
• Works only for numerical observations
• K is a user input; alternatively, BIC (Bayesian information criterion) or MDL (minimum description length) can be used to estimate K
• Outliers can cause considerable trouble for K-means

Convergence
• Why should the K-means algorithm ever reach a
fixed point?
– A state in which clusters don’t change.
• K-means is a special case of a general procedure
known as the Expectation Maximization (EM)
algorithm.
– EM is known to converge.
– Theoretically, number of iterations could be large.
– Typically converges quickly
Time Complexity
• Computing distance between doc and cluster
is O(m) where m is the dimensionality of the
vectors.
• Reassigning clusters: O(Kn) distance
computations, or O(Knm).
• Computing centroids: Each doc gets added
once to some centroid: O(nm).
• Assume these two steps are each done once
for I iterations: O(IKnm).
Seed Choice
• Results can vary based on
random seed selection.
• Some seeds can result in poor
convergence rate, or
convergence to sub-optimal
clusterings.
– Select good seeds using a
heuristic (e.g., doc least similar
to any existing mean)
– Try out multiple starting points
– Initialize with the results of
another method.
How Many Clusters?
• Number of clusters K is given
– Partition n docs into predetermined number of
clusters
• Finding the “right” number of clusters is part of
the problem
– Given data, partition into an “appropriate” number of
subsets.
– E.g., for query results - ideal value of K not known up
front - though UI may impose limits.
• Can usually take an algorithm for one flavor and
convert to the other.
K not specified in advance
• Say, the results of a query.
• Solve an optimization problem: penalize having
lots of clusters
– application dependent, e.g., compressed summary
of search results list.
• Tradeoff between having more clusters (better
focus within each cluster) and having too many
clusters
K not specified in advance
• Given a clustering, define the Benefit for a
doc to be some inverse distance to its
centroid
• Define the Total Benefit to be the sum of
the individual doc Benefits.
Penalize lots of clusters
• For each cluster, we have a Cost C.
• Thus for a clustering with K clusters, the Total
Cost is KC.
• Define the Value of a clustering to be =
Total Benefit - Total Cost.
• Find the clustering of highest value, over all
choices of K.
– Total benefit increases with increasing K. But can stop
when it doesn’t increase by “much”. The Cost term
enforces this.
K-medoids Clustering
• K-means is appropriate when we can work with Euclidean
distances
• Thus, K-means can work only with numerical, quantitative
variable types
• Euclidean distances do not work well in at least two situations
– Some variables are categorical
– Outliers can be potential threats
• A general version of K-means algorithm called K-medoids can
work with any distance measure
• K-medoids clustering is computationally more intensive
K-medoids Algorithm
• Step 1: For a given cluster assignment C, find the observation
in the cluster minimizing the total distance to other points in
that cluster:
i_k* = argmin_{i: C(i) = k} Σ_{C(j) = k} d(xi, xj)

• Step 2: Assign mk = x_{i_k*}, k = 1, 2, …, K

• Step 3: Given a set of cluster centers {m1, …, mK}, minimize the


total error by assigning each observation to the closest
(current) cluster center:
C(i) = argmin_{1 ≤ k ≤ K} d(xi, mk), i = 1, …, N
• Iterate steps 1 to 3
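A hedged sketch of the K-medoids iteration above, working directly from a pairwise distance matrix; the random 2-D points and the Euclidean distances below are only for demonstration (any distance measure would do).

# Sketch of the K-medoids steps, working directly from a pairwise distance matrix D.
import numpy as np

def kmedoids(D, k, n_iter=50, seed=0):
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)        # initial medoid indices
    labels = D[:, medoids].argmin(axis=1)                 # Step 3: nearest medoid
    for _ in range(n_iter):
        # Steps 1-2: in each cluster, pick the point minimizing total distance to the rest
        new_medoids = np.array([
            np.where(labels == j)[0][
                D[np.ix_(labels == j, labels == j)].sum(axis=1).argmin()
            ]
            for j in range(k)
        ])
        new_labels = D[:, new_medoids].argmin(axis=1)      # Step 3 again
        if np.array_equal(new_medoids, medoids):
            break
        medoids, labels = new_medoids, new_labels
    return medoids, labels

# Demo: Euclidean distances on random 2-D points.
X = np.random.default_rng(1).normal(size=(20, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
print(kmedoids(D, k=3))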
K-medoids Summary

• Generalized K-means
• Computationally much costlier than K-means
• Apply when dealing with categorical data
• Apply when data points are not available, but only
pair-wise distances are available
• Converges to local minimum
Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of documents.
animal

vertebrate invertebrate

fish reptile amphib. mammal worm insect crustacean

How could you do this with k-means?


Hierarchical Clustering algorithms

• Agglomerative (bottom-up):
– Start with each document being a single cluster.
– Eventually all documents belong to the same cluster.

• Divisive (top-down):
– Start with all documents belonging to the same cluster.
– Eventually each node forms a cluster on its own.
– Could be a recursive application of k-means like algorithms

• Does not require the number of clusters k in advance


• Needs a termination/readout condition
Hierarchical Agglomerative Clustering (HAC)

• Assumes a similarity function for determining


the similarity of two instances.
• Starts with all instances in a separate cluster
and then repeatedly joins the two clusters
that are most similar until there is only one
cluster.
• The history of merging forms a binary tree or
hierarchy.
Dendrogram: Hierarchical Clustering

• Clustering obtained
by cutting the
dendrogram at a
desired level: each
connected
component forms a
cluster.
Hierarchical Agglomerative Clustering
(HAC)
• Starts with each doc in a separate cluster
– then repeatedly joins the closest pair of
clusters, until there is only one cluster.
• The history of merging forms a binary tree
or hierarchy.
How to measure distance of clusters??
Closest pair of clusters
Many variants to defining closest pair of clusters
• Single-link
– Distance of the “closest” points (single-link)
• Complete-link
– Distance of the “furthest” points
• Centroid
– Distance of the centroids (centers of gravity)
• (Average-link)
– Average distance between pairs of elements
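Illustrative sketch of these linkage variants: agglomerative clustering via scipy, reusing the point coordinates from the single-link example that follows; the choice of cutting the tree into 2 clusters is arbitrary.

# Sketch: HAC with different linkage criteria via scipy (points from the example below).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],    # P1, P2, P3
              [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])   # P4, P5, P6

for method in ["single", "complete", "centroid", "average"]:
    Z = linkage(X, method=method)                     # merge history (the dendrogram)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)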
Single Link Agglomerative Clustering
• Use maximum similarity of pairs:

sim(ci, cj) = max_{x ∈ ci, y ∈ cj} sim(x, y)

• Can result in "straggly" (long and thin) clusters due to the chaining effect.
• After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))


Single Link Example
Example Single link
X Y
P1 0.4 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
P1 P2 P3 P4 P5 P6

P1 0

P2 0.23 0

P3 0.22 0.15 0

P4 0.37 0.20 0.15 0

P5 0.34 0.14 0.28 0.29 0

P6 0.23 0.25 0.11 0.22 0.39 0


Merge
• The distance between P3 and P6 (0.11) is the smallest, so merge them into one cluster
• The distance matrix is updated as
  – Min(Dist(P3,P1), Dist(P6,P1))
  – = 0.22
P1 P2 P3,P6 P4 P5

P1 0

P2 0.23 0

P3 P6 0.22 0.15 0

P4 0.37 0.20 0.15 0

P5 0.34 0.14 0.28 0.29 0


Merge
• The distance between P2 and P5 (0.14) is the smallest, so merge them into one cluster
• The distance matrix is updated as
  – Min(Dist(P2,P1), Dist(P5,P1))
  – = 0.23
P1 P2, P5 P3,P6 P4

P1 0

P2, P5 0.23 0

P3 P6 0.22 0.15 0

P4 0.37 0.20 0.15 0


Merge
• The distance between P2P5 and P3P6 (0.15) is the smallest, so merge them into one cluster
• The distance matrix is updated as
  – Min(Dist(P2P5,P1), Dist(P3P6,P1))
  – = 0.22
              P1    P2,P5,P3,P6   P4
P1            0
P2,P5,P3,P6   0.22  0
P4            0.37  0.15          0
Merge
• The distance between P2P5P3P6 and P4 (0.15) is the smallest, so merge them into one cluster

                 P1    P2,P5,P3,P6,P4
P1               0
P2,P5,P3,P6,P4   0.22  0
Complete Link Agglomerative
Clustering
• Use minimum similarity of pairs:
sim(ci, cj) = min_{x ∈ ci, y ∈ cj} sim(x, y)

• Makes "tighter," spherical clusters that are typically preferable.
• After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
Complete Link Example
Example Complete Link
P1 P2 P3 P4 P5 P6

P1 0

P2 0.23 0

P3 0.22 0.15 0

P4 0.37 0.20 0.15 0

P5 0.34 0.14 0.28 0.29 0

P6 0.23 0.25 0.11 0.22 0.39 0


Merge
• Max(Dist(P3,P1), Dist(P6,P1)) = 0.23
P1 P2 P3P6 P4 P5

P1 0

P2 0.23 0

P3P6 0.23 0.25 0

P4 0.37 0.20 0.22 0

P5 0.34 0.14 0.39 0.29 0


Merge

P1 P2P5 P3P6 P4

P1 0

P2P5 0.34 0

P3P6 0.23 0.39 0

P4 0.37 0.29 0.22 0


Merge

P1 P2P5 P3P6P4

P1 0

P2P5 0.34 0

P3P6P4 0.37 0.39 0


Merge

P1P2P5 P3P6P4

P1P2P5 0

P3P6P4 0.39 0
Key notion: cluster representative
• We want a notion of a representative point in
a cluster
• Representative should be some sort of
“typical” or central point in the cluster, e.g.,
– point inducing smallest radii to docs in cluster
– smallest squared distances, etc.
– point that is the “average” of all docs in the cluster
• Centroid or center of gravity
Centroid-based Similarity

• Always maintain the average of the vectors in each cluster:

  s(cj) = (1 / |cj|) Σ_{x ∈ cj} x

• Compute the similarity of clusters by:

  sim(ci, cj) = sim(s(ci), s(cj))

• For non-vector data, can't always make a centroid


Computational Complexity
• In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(mn²).
• In each of the subsequent n−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
• Maintaining a heap of distances allows this to be O(mn² log n).
Major issue - labeling
• After the clustering algorithm finds clusters, how can they be useful to the end user?
• Need pithy label for each cluster
– In search results, say “Animal” or “Car” in the
jaguar example.
– In topic trees, need navigational cues.
• Often done by hand, a posteriori.

How would you do this?


How to Label Clusters
• Show titles of typical documents
– Titles are easy to scan
– Authors create them for quick scanning!
– But you can only show a few titles which may not fully
represent cluster
• Show words/phrases prominent in cluster
– More likely to fully represent cluster
– Use distinguishing words/phrases
• Differential labeling
– But harder to scan
Labeling
• Common heuristics - list 5-10 most frequent
terms in the centroid vector.
– Drop stop-words; stem.
• Differential labeling by frequent terms
– Within a collection “Computers”, clusters all have
the word computer as frequent term.
– Discriminant analysis of centroids.

• Perhaps better: distinctive noun phrase


Expensive Distance Metric for Text

• String edit distance
• Compute with dynamic programming
• Costs for character:
  – insertion
  – deletion
  – substitution
  – ...

[DP table of edit costs, rows "S c o t t" vs. columns "S e c a t":]
       S    e    c    a    t
  0.0  0.7  1.4  2.1  2.8  3.5
S 0.7  0.0  0.7  1.1  1.4  1.8
c 1.4  0.7  1.0  0.7  1.4  1.8
o 2.1  1.1  1.7  1.4  1.7  2.4
t 2.8  1.4  2.1  1.8  2.4  1.7
t 3.5  1.8  2.4  2.1  2.8  2.4
String edit (Levenshtein) distance
• Distance is shortest sequence of edit
commands that transform s to t.
• Simplest set of operations:
– Copy character from s over to t
– Delete a character in s (cost 1)
– Insert a character in t (cost 1)
– Substitute one character for another (cost 1)
Levenshtein distance - example
• distance(“William Cohen”, “Willliam Cohon”)

s W I L L I A M _ C O H E N

t W I L L L I A M _ C O H O N

op C C C C I C C C C C C C S C

cost 0 0 0 0 1 1 1 1 1 1 1 1 2 2
Levenshtein distance - example
• distance(“William Cohen”, “Willliam Cohon”)

s W I L L gap I A M _ C O H E N

t W I L L L I A M _ C O H O N

op C C C C I C C C C C C C S C

cost 0 0 0 0 1 1 1 1 1 1 1 1 2 2
Computing Levenshtein distance

D(i,j) = score of the best alignment from s1..si to t1..tj

D(i,j) = min of:
  D(i-1, j-1),     if si = tj   // copy
  D(i-1, j-1) + 1, if si != tj  // substitute
  D(i-1, j) + 1                 // insert
  D(i, j-1) + 1                 // delete
Algorithm
// Initialize: transforming to/from the empty prefix costs i deletions / j insertions.
for i = 0 to strlen(s)
    distance[i][0] = i
for j = 0 to strlen(t)
    distance[0][j] = j

for i = 0 to strlen(s) - 1
{ for j = 0 to strlen(t) - 1
  { edit = 1;
    if (s[i] == t[j])
        edit = 0;
    distance[i+1][j+1] =
        min{ distance[i][j+1] + 1,    // delete s[i]
             distance[i+1][j] + 1,    // insert t[j]
             distance[i][j] + edit }  // copy or substitute
  }
}
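A runnable Python version of the same DP (an assumed translation of the pseudocode above), checked against the "William Cohen" / "Willliam Cohon" example, whose edit distance is 2.

# Runnable Python version of the DP above; verified on the "William Cohen" example.
def levenshtein(s: str, t: str) -> int:
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # deleting i characters
    for j in range(n + 1):
        d[0][j] = j                      # inserting j characters
    for i in range(m):
        for j in range(n):
            edit = 0 if s[i] == t[j] else 1
            d[i + 1][j + 1] = min(d[i][j + 1] + 1,   # delete s[i]
                                  d[i + 1][j] + 1,   # insert t[j]
                                  d[i][j] + edit)    # copy or substitute
    return d[m][n]

print(levenshtein("William Cohen", "Willliam Cohon"))   # 2, matching the alignment above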
Computing Levenshtein distance

C O H E N
M 1 2 3 4 5
C 1 2 3 4 5
C 2 2 3 4 5
O 3 2 3 4 5
H 4 3 2 3 4
N 5 4 3 3 3
= D(s,t)
Computing Levenshtein distance

A trace indicates where the min value came from, and can be used to find the edit operations and/or a best alignment (there may be more than one).

[Same D matrix as above, with arrows tracing the source of each minimum.]
Large Clustering Problems
• Many examples
• Many clusters
• Many dimensions

Example Domains
• Text
• Images
• Protein structure
EXPECTATION MAXIMIZATION
ALGORITHM (EM)
Model Space
• The choice of the model space is plentiful but
not unlimited.
• There is a bit of “art” in selecting the
appropriate model space.
• Typically the model space is assumed to be a
linear combination of known probability
distribution functions.
Model based clustering
• Algorithm optimizes a probabilistic model criterion
• Clustering is usually done by the Expectation Maximization (EM)
algorithm
– Gives a soft variant of the K-means algorithm
– Assume k clusters: {c1, c2,… ck}
– Assume a probabilistic model of categories that allows
computing P(ci | E) for each category, ci, for a given
example, E.
– For text, typically assume a naïve Bayes category
model.
– Parameters θ = {P(ci), P(wj | ci): i ∈ {1,…,k}, j ∈ {1,…,|V|}}
Expectation Maximization (EM) Algorithm

• Iterative method for learning probabilistic categorization


model from unsupervised data.
• Initially assume random assignment of examples to
categories.
• Learn an initial probabilistic model by estimating the model parameters θ from this randomly labeled data.
• Iterate following two steps until convergence:
– Expectation (E-step): Compute P(ci | E) for each example given the
current model, and probabilistically re-label the examples based on
these posterior probability estimates.
– Maximization (M-step): Re-estimate the model parameters, θ, from the probabilistically re-labeled data.
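Illustrative sketch: EM-based soft clustering with a Gaussian mixture in scikit-learn. This is not the naïve Bayes text model from the slides, but .fit runs the same E-step / M-step loop, and predict_proba returns the P(ci | E) posteriors; the data are synthetic.

# Sketch: EM soft clustering with a Gaussian mixture (same E-step/M-step structure,
# but a Gaussian model rather than the naive Bayes text model of the slides).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),   # synthetic: two overlapping blobs
               rng.normal(3.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM runs inside .fit
print(np.round(gmm.predict_proba(X[:5]), 3))   # E-step output: P(ci | E) per example
print(gmm.means_)                              # M-step output: re-estimated parameters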
Examples
• Suppose we have the following data
– 0,1,1,0,0,1,1,0
• In this case it is sensible to choose the
Bernoulli distribution (B(p)) as the model
space.

• Now we want to choose the best p, i.e.,


Examples
Suppose the following are marks in a course
55.5, 67, 87, 48, 63
Marks typically follow a Normal distribution
whose density function is

Now, we want to find the best μ and σ such that

Examples
• Suppose we have data about heights of
people (in cm)
– 185,140,134,150,170
• Heights follow a normal (log normal)
distribution but men on average are taller
than women. This suggests a mixture of two
distributions
Maximum Likelihood Estimation
• We have reduced the problem of selecting the best
model to that of selecting the best parameter.
• We want to select a parameter p which will
maximize the probability that the data was
generated from the model with the parameter p
plugged-in.
• The parameter p is called the maximum likelihood
estimator.
• The maximum of the function can be obtained by setting the derivative of the function to 0 and solving for p.
Two Important Facts
• If A1, …, An are independent, then
  P(A1, …, An) = P(A1) · P(A2) ··· P(An)

• The log function is monotonically increasing:
  x ≤ y  ⟹  log(x) ≤ log(y)

• Therefore, if a function f(x) >= 0 achieves a maximum at x1, then log(f(x)) also achieves a maximum at x1.
Example of MLE

• Now, choose p which maximizes L(p). Instead, we will maximize l(p) = log L(p).
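Illustrative sketch: the Bernoulli MLE for the data 0,1,1,0,0,1,1,0 from the earlier example, comparing the closed-form estimate p̂ = (#ones)/n with a grid search over the log-likelihood l(p).

# Sketch: Bernoulli MLE for the data 0,1,1,0,0,1,1,0.
import numpy as np

data = np.array([0, 1, 1, 0, 0, 1, 1, 0])

def log_likelihood(p, x):
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))

p_hat = data.mean()          # closed form: setting dl/dp = 0 gives (#ones)/n
print("p_hat =", p_hat)      # 0.5 for this data

grid = np.linspace(0.01, 0.99, 99)              # numerical check over a grid
print("argmax of l(p):", grid[np.argmax([log_likelihood(p, data) for p in grid])])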
Properties of MLE
• There are several technical properties of the estimator, but let's look at the most intuitive one:
  – As the number of data points increases, we become more sure about the parameter p.
Properties of MLE

r is the number of data points. As the number of data points increases, the confidence of the estimator increases.
MLE for Mixture Distributions
• When we proceed to calculate the MLE for a
mixture, the presence of the sum of the
distributions prevents a “neat” factorization
using the log function.
• A completely new rethink is required to
estimate the parameter.
• The new rethink also provides a solution to
the clustering problem.
A Mixture Distribution
Missing Data
• We think of clustering as a problem of
estimating missing data.
• The missing data are the cluster labels.
• Clustering is only one example of a missing
data problem. Several other problems can be
formulated as missing data problems.
Missing Data Problem
• Let D = {x(1),x(2),…x(n)} be a set of n
observations.
• Let H = {z(1),z(2),..z(n)} be a set of n values of
a hidden variable Z.
– z(i) corresponds to x(i)
• Assume Z is discrete.
EM Algorithm
• The EM Algorithm alternates between
maximizing F with respect to Q (theta fixed)
and then maximizing F with respect to theta
(Q fixed).
EM and K-means
• Notice the similarity between EM for Normal
mixtures and K-means.

• The expectation step is the assignment.


• The maximization step is the update of
centers.
EM intuition
• What if we were given only results of the coin
tosses?
• Can we guess the % of heads that each coin
yields?
• Can we guess which coin was picked for each set of 10 coin tosses?
EM algorithm
• Assign random averages to both coins
• For each of the 5 rounds of 10 coin tosses
– Check the percentage of heads
– Find the probability of it coming from each coin
– Compute the expected number of heads attributed to each coin by multiplying that probability by the number of heads
– Record and re-compute new mean for coin A and
B
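A hedged sketch of the two-coin EM loop above; the head counts per round and the initial guesses are assumed values, since the slides do not give the actual tosses.

# Sketch of two-coin EM: only the number of heads per round is observed.
import numpy as np
from scipy.stats import binom

heads = np.array([5, 9, 8, 4, 7])     # assumed: heads in each of 5 rounds of 10 tosses
n = 10
theta_A, theta_B = 0.6, 0.5           # initial guesses for P(heads) of coins A and B

for _ in range(20):
    # E-step: probability that each round was generated by coin A vs. coin B
    lik_A = binom.pmf(heads, n, theta_A)
    lik_B = binom.pmf(heads, n, theta_B)
    w_A = lik_A / (lik_A + lik_B)
    w_B = 1.0 - w_A
    # M-step: expected heads / expected tosses attributed to each coin
    theta_A = np.sum(w_A * heads) / np.sum(w_A * n)
    theta_B = np.sum(w_B * heads) / np.sum(w_B * n)

print("estimated P(heads):", round(theta_A, 3), round(theta_B, 3))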
The GMM assumption
• There are k components. The i'th component is called wi
• Component wi has an associated mean vector μi
• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I
Assume that each datapoint is generated according to the following recipe:
[Figure: component means μ1, μ2, μ3.]
The GMM assumption
• There are k components. The i'th component is called wi
• Component wi has an associated mean vector μi
• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(wi).
The GMM assumption
• There are k components. The i'th component is called wi
• Component wi has an associated mean vector μi
• Each component generates data from a Gaussian with mean μi and covariance matrix σ²I
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(wi).
2. Datapoint ~ N(μi, σ²I)
The General GMM assumption
• There are k components. The i'th component is called wi
• Component wi has an associated mean vector μi
• Each component generates data from a Gaussian with mean μi and covariance matrix Σi
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(wi).
2. Datapoint ~ N(μi, Σi)
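Illustrative sketch of the generative recipe above: pick a component with probability P(wi), then draw from N(μi, σ²I); the mixing weights, means, and σ are assumed values.

# Sketch of the GMM recipe: pick a component, then sample from its Gaussian.
import numpy as np

rng = np.random.default_rng(0)
k = 3
P_w   = np.array([0.5, 0.3, 0.2])                        # assumed mixing weights P(wi)
mu    = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])   # assumed means mu_i
sigma = 1.0                                              # spherical covariance sigma^2 I

def sample_point():
    i = rng.choice(k, p=P_w)                        # 1. choose component i with prob P(wi)
    return rng.normal(loc=mu[i], scale=sigma)       # 2. datapoint ~ N(mu_i, sigma^2 I)

X = np.array([sample_point() for _ in range(500)])
print("sample mean:", X.mean(axis=0))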
Mixtures of Gaussians
• K-means algorithm
– Assigned each example to exactly one cluster
– What if clusters are overlapping?
• Hard to tell which cluster is right
• Maybe we should try to remain uncertain
– Used Euclidean distance
– What if cluster has a non-circular shape?

• Gaussian mixture models


– Clusters modeled as multivariate Gaussians
• Not just by their mean
– EM algorithm: assign data to
cluster with some probability
The old trick…

• Inference engine: inputs → P(E1|E2). Learn with: Joint DE, Bayes Net Structure Learning
• Classifier: inputs → predict category. Learn with: Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation, GMM-BC
• Density estimator: inputs → probability. Learn with: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
• Regressor: inputs → predict real no. Learn with: Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS