
Artificial Intelligence:

Representation and Problem Solving


15-381
April 26, 2007
Michael S. Lewicki, Carnegie Mellon

Clustering
(including k-nearest neighbor classification, k-means
clustering, cross-validation, and EM, with a brief foray
into dimensionality reduction with PCA)

A different approach to classification

Nearest neighbor classification on the iris dataset

Nearby points are likely to be members of the same class.

What if we used the points themselves to classify?
- classify x in Ck if x is similar to a point we already know is in Ck.
- E.g.: an unclassified point x is more similar to Class 2 than to Class 1.

Issue: how do we define "similar"?

Simplest is Euclidean distance:

  d(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 }

Could define other metrics depending on the application, e.g. for text documents, images, etc.

[Figure: iris data plotted as x1 vs x2, with three labeled groups: Class 1, Class 2, and Class 3.]

Potential advantages:
- don't need an explicit model
- the more examples the better
- might handle more complex classes
- easy to implement
- no brain required on the part of the designer
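A minimal sketch of the nearest-neighbor idea in Python (the function name and toy data are illustrative, not from the slides):

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=1):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance), as described above."""
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))   # d(x, y) to every training point
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # most frequent class among neighbors

# toy usage: two 2-d classes
train_X = np.array([[1.0, 0.2], [1.2, 0.3], [4.0, 1.5], [4.2, 1.4]])
train_y = np.array([1, 1, 2, 2])
print(knn_classify(np.array([3.9, 1.3]), train_X, train_y, k=3))  # -> 2
```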

A complex, non-parametric decision boundary

How do we control the complexity of this model? Difficult.

How many parameters? Every data point is a parameter!

This is an example of a non-parametric model, i.e. one where the model is defined by the data. (Also called instance-based.)

Can get very complex decision boundaries.

[Figure: a highly irregular nearest-neighbor decision boundary; example from Martial Hebert.]

Error Bounds for NN

(Nonparametric, instance-based methods; from Sam Roweis' notes.)

Amazing fact: asymptotically, err(1-NN) < 2 err(Bayes):

  e_B \le e_{1NN} \le 2 e_B - \frac{M}{M-1} e_B^2

This is a tight upper bound, achieved in the zero-information case when the classes have identical densities.

Q: What are the parameters in K-NN?
A: the scalar K and the entire training set. Models which need the entire training set at test time but (hopefully) have very few other parameters are called nonparametric, instance-based, or case-based.

For K-NN there are also bounds, e.g. for two classes and odd K:

  e_B \le e_{KNN} \le \sum_{i=0}^{(K-1)/2} \binom{K}{i} \left[ e_B^{i+1} (1 - e_B)^{K-i} + e_B^{K-i} (1 - e_B)^{i+1} \right]

What if we want a classifier that uses only a few parameters at test time (e.g. for speed or memory)?
- Idea 1: a single linear boundary, of arbitrary orientation.
- Idea 2: many boundaries, but axis-parallel and tree-structured.
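The two-class bound above can be evaluated numerically; a small Python sketch (the helper name and test values are illustrative):

```python
from math import comb

def knn_error_upper_bound(e_bayes, k):
    """Asymptotic upper bound on the K-NN error for two classes and odd K,
    following the bound quoted above."""
    assert k % 2 == 1
    total = 0.0
    for i in range((k - 1) // 2 + 1):
        total += comb(k, i) * (e_bayes ** (i + 1) * (1 - e_bayes) ** (k - i)
                               + e_bayes ** (k - i) * (1 - e_bayes) ** (i + 1))
    return total

print(knn_error_upper_bound(0.1, 1))   # 1-NN: 2 * e_B * (1 - e_B) = 0.18
print(knn_error_upper_bound(0.1, 7))   # the bound tightens toward e_B as K grows
```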

Example: Handwritten digits

For more on these bounds, see the book A Probabilistic Theory of Pattern Recognition, by L. Devroye, L. Gyorfi & G. Lugosi (1996).

Use Euclidean distance to see which known digit is closest: look at the k nearest neighbors and choose the most frequent class. Digits are just represented as a vector.

Example: USPS digits (example from Sam Roweis)
- Take 16x16 grayscale images (8 bit) of handwritten digits.
- Use Euclidean distance in raw pixel space (dumb!) and 7-NN.
- Classification error (leave-one-out): 4.85%.
- But not all of the 7 nearest neighbours are the same digit.

[Figure: an example digit and its 7 nearest neighbours; digit examples from LeCun et al., 1998. Digit data available at: http://yann.lecun.com/exdb/mnist/]

Caution: k-nearest neighbors can get expensive, since classifying a point means finding its neighbors in the entire training set.

[Clipped slide, "Linear Classification for Binary..." (Sam Roweis): the goal is to find the line (or hyperplane) which best separates the classes, c(x) = sign[x^T w - w_0], where the weight vector w is perpendicular to the decision boundary and w_0 is a threshold. This is the opposite of non-parametric: only a few parameters are kept. Typically we augment x with a constant term and then absorb w_0 into w.]
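A sketch of the leave-one-out error measurement quoted above, assuming the digit images have already been loaded and flattened into an (N, D) array X with labels y (the loading step is not shown):

```python
import numpy as np

def loo_knn_error(X, y, k=7):
    """Leave-one-out classification error for k-NN with Euclidean distance."""
    # all pairwise squared distances; fine for small N, chunk for a full dataset
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)          # never count a point as its own neighbor
    errors = 0
    for n in range(len(X)):
        nearest = np.argsort(sq[n])[:k]
        labels, counts = np.unique(y[nearest], return_counts=True)
        if labels[np.argmax(counts)] != y[n]:
            errors += 1
    return errors / len(X)
```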

The problem of using templates (i.e. Euclidean distance)

Which of these is more like the example? A or B?
[Figure: an example digit image and two candidate digits, A and B.]

Euclidean distance only cares about how many pixels overlap.
Could try to define a distance metric that is insensitive to small deviations in position, scale, rotation, etc.

Digit example: 60,000 training images, 10,000 test images, no preprocessing.

Performance results of various classifiers (from http://yann.lecun.com/exdb/mnist/ and Simard et al., 1998):

  Classifier                                       error rate on test data (%)
  linear                                           12.0
  k=3 nearest neighbor (Euclidean distance)         5.0
  2-layer neural network (300 hidden units)         4.7
  nearest neighbor (Euclidean distance)             3.1
  k-nearest neighbor (improved distance metric)     1.1
  convolutional neural net                          0.95
  best (the conv. net with elastic distortions)     0.4
  humans                                            0.2 - 2.5

Clustering

Clustering: Classification without labels:

In many situations we don't have labeled training data, only unlabeled data.
E.g., in the iris data set, what if we were just starting and didn't know any classes?

[Figure: unlabeled iris data, petal length (cm) vs petal width (cm).]

Types of learning
[Figure: three panels contrasting types of learning.
- supervised: the model (parameters θ1, ..., θn) produces a model output that is compared with a desired output {y1, ..., yn} supplied along with the world (or data).
- unsupervised: the model (parameters θ1, ..., θn) sees only the world (or data); there is no desired output.
- reinforcement: the model (parameters θ1, ..., θn) interacts with the world (or data) and receives a reinforcement signal (covered next week).]

A real example: clustering electrical signals from neurons

An application of PCA: Spike sorting

[Figure: the recording chain (electrode, amplifier, filters, A/D, oscilloscope, software analysis) and an extracellular waveform, roughly 25 msec long, containing many different spikes.]

How do we sort the different spikes?

Basic problem: the only information we have is the signal itself. The true classes are always unknown.

Sorting with level detection

[Figure: spike waveforms (roughly -0.5 to 1.5 msec) detected with a simple amplitude threshold.]

Level detection doesn't always work.

Why level detection doesn't work

[Figure: distribution of amplitudes showing the background amplitude and the peak amplitudes of neuron 1 and neuron 2 overlapping along a single amplitude axis.]

One dimension is not sufficient to separate the spikes.

Idea: try more features

Using multiple features: e.g. the maximum amplitude and the minimum amplitude of each spike.

[Figure: a spike waveform (-0.5 to 1.5 msec) with its max amplitude and min amplitude marked.]

What other features could we use?

Maximum vs minimum

[Figure: scatter plot of spike maximum (µV) vs spike minimum (µV).]

This allows better discrimination than the maximum alone, but is it optimal?

Try different features

Using multiple features: e.g. the width and the height of each spike.

[Figure: a spike waveform (-0.5 to 1.5 msec) with its height and width marked.]

What other features could we use?

Height vs width

[Figure: scatter plot of spike height (µV) vs spike width (msec).]

This allows better discrimination. How can we choose features more objectively?

Brief foray: dimensionality reduction
(in this case, modeling the data with a normal distribution)

Can we model the signal with a better set of features?

Idea: model the distribution of the data. This is density estimation (unsupervised learning): a model {θ1, ..., θn} is fit to the data, here the neural signals.

We've done this before:

Binomial distribution of coin flips:

  p(y \mid \theta, n) = \binom{n}{y} \theta^y (1 - \theta)^{n - y}

  [Figure: p(y | θ = 0.25, n = 10) plotted against y.]

Gaussian distribution of the iris data:

  p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)

  [Figure: iris data, petal length (cm) vs petal width (cm).]

Modeling data with a Gaussian

A Gaussian (or normal) distribution is fit to data with things you're already familiar with:

  p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)

  mean:      \mu = \frac{1}{N} \sum_n x_n
  variance:  \sigma^2 = \frac{1}{N} \sum_n (x_n - \mu)^2

Multidimensional Gaussian (from Sam Roweis' "gaussian identities", revised July 1999):

A multivariate normal is the same, but in d dimensions. The d-dimensional Gaussian (normal) density for x is

  N(\mu, \Sigma) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)        (1)

with

  \mu = \frac{1}{N} \sum_n x_n,    \Sigma_{ij} = \frac{1}{N} \sum_n (x_{i,n} - \mu_i)(x_{j,n} - \mu_j)

It has entropy

  S = \frac{1}{2} \log_2 \left[ (2\pi e)^d |\Sigma| \right] + \text{const}  bits        (2)

where Σ is a symmetric positive semi-definite covariance matrix and the (unfortunate) constant is the log of the units in which x is measured over the natural units.
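A small numpy sketch of fitting these quantities and evaluating density (1); the function names are illustrative:

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood mean and covariance, as in the formulas above."""
    mu = X.mean(axis=0)                      # mu = (1/N) sum_n x_n
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)               # Sigma_ij = (1/N) sum_n (x_i,n - mu_i)(x_j,n - mu_j)
    return mu, Sigma

def gaussian_density(x, mu, Sigma):
    """Evaluate the d-dimensional normal density N(mu, Sigma) at x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)
```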

Linear functions of a normal vector

No matter how x is distributed,

  E[Ax + y] = A\,E[x] + y                                  (3a)
  Covar[Ax + y] = A\,Covar[x]\,A^T                         (3b)

In particular, this means that for normally distributed quantities:

  x \sim N(\mu, \Sigma)  \Rightarrow  Ax + y \sim N(A\mu + y,\; A \Sigma A^T)        (4a)
  x \sim N(\mu, \Sigma)  \Rightarrow  \Sigma^{-1/2} (x - \mu) \sim N(0, I)           (4b)
  x \sim N(\mu, \Sigma)  \Rightarrow  (x - \mu)^T \Sigma^{-1} (x - \mu) \sim \chi^2_n        (4c)

Recall the example from the probability theory lecture.

[Embedded figure: covariance intuition, from Andrew W. Moore's "Probability Densities" slides (copyright 2001).]
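Property (4b) is easy to check numerically; a sketch with assumed example values for μ and Σ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# draw samples x ~ N(mu, Sigma)
X = rng.multivariate_normal(mu, Sigma, size=100000)

# whitening: z = Sigma^(-1/2) (x - mu), using the symmetric matrix square root
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
Z = (X - mu) @ Sigma_inv_sqrt.T

print(np.cov(Z, rowvar=False))   # close to the 2x2 identity matrix
```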

The correlational structure is described by the covariance

[Embedded figures: covariance intuition and the principal eigenvector of the covariance, from Andrew W. Moore's "Probability Densities" slides (copyright 2001).]

What about distributions in higher dimensions?

Multivariate covariance matrices and principal components


An application of PCA: head shape

Head measurements on two college-age groups (Bryce and Barker):
  1) football players (30 subjects)
  2) non-football players (30 subjects)

Use six different measurements:

  variable   measurement
  WDMI       head width at widest dimension
  CIRCUM     head circumference
  FBEYE      front to back at eye level
  EYEHD      eye to top of head
  EARHD      ear to top of head
  JAW        jaw width

Are these measures independent?

The covariance matrix

       .370   .602   .149   .044   .107   .209
       .602  2.629   .801   .666   .103   .377
  S =  .149   .801   .458   .012   .013   .120
       .044   .666   .011  1.474   .252   .054
       .107   .103   .013   .252   .488   .036
       .209   .377   .120   .054   .036   .324

  S_{ij} = \frac{1}{N - 1} \sum_{n=1}^{N} (x_{i,n} - \bar{x}_i)(x_{j,n} - \bar{x}_j)

The eigenvalues

  Eigenvalue   Proportion of Variance   Cumulative Proportion
  3.323        .579                     .579
  1.374        .239                     .818
   .476        .083                     .901
   .325        .057                     .957
   .157        .027                     .985
   .088        .015                     1.000

How many PCs should we select?
The first two principal components capture 81.8% of the variance.

The corresponding eigenvectors:

  variable   a1      a2
  WDMI       .207   -.142
  CIRCUM     .873   -.219
  FBEYE      .261   -.231
  EYEHD      .326    .891
  EARHD      .066    .222
  JAW        .128   -.187
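A sketch of how the eigenvalues, eigenvectors, and proportions of variance can be computed from data with numpy (function and variable names are illustrative):

```python
import numpy as np

def pca(X):
    """Eigenvalues (descending), eigenvectors (columns), and the proportion of
    variance explained, from the sample covariance of X (N x d)."""
    S = np.cov(X, rowvar=False)                  # sample covariance, 1/(N-1) normalization
    eigvals, eigvecs = np.linalg.eigh(S)         # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    proportion = eigvals / eigvals.sum()
    return eigvals, eigvecs, proportion

# e.g. keep enough components to capture ~80% of the variance:
# eigvals, eigvecs, prop = pca(X)
# n_pcs = np.searchsorted(np.cumsum(prop), 0.80) + 1
```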

Using principal components to characterize the data

Use principal components to characterize a distribution:
- What are the data?
- What is the dimensionality of the data?
- How many components are there?
- What will the components look like?

[Figure: overlaid raw spike waveforms, amplitude (µV) vs time (about -0.5 to 0.5 msec).]

The first three principal components of the waveform data

[Figure: PC1, PC2, and PC3 of the spike data plotted as waveforms (magnitude vs msec).]

What do you expect when we plot PC1 vs PC2?

Scatter plot of the first two principal component scores

[Figure: 2nd component score vs 1st component score for the spike data.]

Now the data are much better separated.
Could we use more PCs? How many?

The eigenspectrum of the waveform data

[Figure: sqrt(λ) (µV) vs component number for the spike data.]

The first two PCs account for most of the variance.

Recall example from last lecture: waveform data

The waveform x is modeled with a multivariate Gaussian,

  x \sim p(x \mid \mu, \Sigma),

where μ and Σ are the mean and covariance of the distribution.

Principal components can be used to form a low-dimensional approximation:

  x^{(n)} = \sum_{i=1}^{T} c_i^{(n)} \phi_i

The vectors {φ1, ..., φT} are the eigenvectors of Σ.
Keeping only the first two terms in the sum is an adequate approximation of the full T-dimensional density.

[Figure: raw spike waveform data, amplitude (µV) vs msec.]

Now we can go back to clustering.
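A sketch of this low-dimensional approximation (the mean is added back explicitly here; names are illustrative):

```python
import numpy as np

def pca_approximation(X, m=2):
    """Approximate each waveform by its first m principal components:
    x^(n) ~= mu + sum_i c_i^(n) phi_i."""
    mu = X.mean(axis=0)
    Xc = X - mu
    S = np.cov(X, rowvar=False)
    eigvals, Phi = np.linalg.eigh(S)
    Phi = Phi[:, np.argsort(eigvals)[::-1][:m]]    # first m eigenvectors of Sigma
    C = Xc @ Phi                                   # component scores c_i^(n)
    X_hat = mu + C @ Phi.T                         # low-dimensional reconstruction
    return C, X_hat
```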

Now try clustering our 2d data (could also do it in n-d)

$""

#./'()'%*+,-

#""
!""
"
!!""
!#""
!!""

"

!""
!%&'()'%*+,-

Artificial Intelligence: Clustering

33

#""

$""

Michael S. Lewicki ! Carnegie Mellon

k-means clustering

Idea: try to estimate k cluster centers by minimizing distortion.

Define distortion as:

  D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

where r_{nk} = 1 if x_n belongs to cluster k, and 0 otherwise; i.e. r_{nk} is 1 for the closest cluster mean to x_n.

How do we learn the cluster means?
Each point x_n should be at minimum distance from its closest center.
Use EM = Estimate, Maximize.

Using EM to estimate the cluster means

Our objective function is:

  D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

Differentiate with respect to the mean (the parameter we want to estimate):

  \frac{\partial D}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)

We know the optimum is when

  \frac{\partial D}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0

Solving for the mean, we have:

  \mu_k = \frac{ \sum_n r_{nk} x_n }{ \sum_n r_{nk} }

This is simply a weighted mean for each cluster.

Thus we have a simple estimation algorithm (k-means clustering):
1. select k points at random
2. estimate (update) the means
3. repeat until converged

Convergence (to a local minimum) is guaranteed.
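A compact sketch of the resulting algorithm in Python (random initialization from the data, then alternating assignment and mean updates until nothing changes):

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0), max_iter=100):
    """Basic k-means: X is (N, d); returns the cluster means and assignments."""
    mu = X[rng.choice(len(X), size=k, replace=False)]          # 1. select k points at random
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distance to each mean
        r = d2.argmin(axis=1)                                   # r_nk: closest mean for each x_n
        new_mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                           for j in range(k)])                  # 2. update the means
        if np.allclose(new_mu, mu):                             # 3. repeat until converged
            break
        mu = new_mu
    return mu, r
```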

k-means clustering example

- Select 3 points at random for the cluster means.  [Figure: initial means on the PC-score scatter plot.]
- Then update them using the estimate.  [Figure: means after one update.]
- And iterate...  [Figures: means after successive iterations.]
- Stop when converged, i.e. no change.  [Figure: final cluster means.]

An example of a local minimum

There can be multiple local minima.

[Figure: a k-means solution that has converged to a poor local minimum.]

How do we choose k?

Increasing k will always decrease our distortion, so we will overfit.
How can we avoid this? Or how do we choose the best k?

We can use cross-validation again. Instead of classification error, however, we use our distortion metric, as before:

  D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

Then just measure the distortion on a test data set, and stop when we reach a minimum.

[Figures: distortion vs k (the number of clusters), illustrating overfitting, and a k = 10 cluster solution on the spike data.]
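A sketch of this procedure, reusing the kmeans sketch above and an assumed random train/test split:

```python
import numpy as np

def distortion(X, mu):
    """D = sum_n min_k ||x_n - mu_k||^2: each point charged to its closest mean."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def choose_k(X, k_values, test_fraction=0.3, rng=np.random.default_rng(0)):
    idx = rng.permutation(len(X))
    n_test = int(test_fraction * len(X))
    test, train = X[idx[:n_test]], X[idx[n_test:]]
    scores = {}
    for k in k_values:
        mu, _ = kmeans(train, k, rng=rng)       # fit means on training data only
        scores[k] = distortion(test, mu)        # measure distortion on held-out data
    return min(scores, key=scores.get), scores
```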

A nice illustration of cross-validation from Andrew Moore

Example: construct a predictor of y from x given this training data.
[Figure: scatter plot of the (x, y) training data.]

Which model is best for predicting y from x: linear, quadratic, or piecewise linear?
[Figures: the training data fit with a linear, a quadratic, and a piecewise-linear model.]

We want the model that generates the best predictions on future data, not necessarily the one with the lowest error on the training data.

Using a Test Set

1. Use a portion (e.g., 30%) of the data as test data.
2. Fit a model to the remaining training data.
3. Evaluate the error on the test data.

[Figures: test-set errors for the three models. Linear: error = 2.4; Quadratic: error = 0.9; Piecewise Linear: error = 2.2.]

Using a Test Set:
+ Simple
- Wastes a large % of the data
- May get lucky with one particular subset of the data

Leave One Out Cross-Validation

For k = 1 to R:
  Train on all the data, leaving out (xk, yk).
  Evaluate the error on (xk, yk).
Report the average error after trying all the data points.

[Figures: leave-one-out errors for the three models. Linear: error = 2.12; Quadratic: error = 0.962; Piecewise Linear: error = 3.33.]

Note: numerical examples in this and subsequent slides are from A. Moore.

Leave One Out Cross-Validation:
+ Does not waste data
+ Averages over a large number of trials
- Expensive
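A generic sketch of the procedure; `fit` and `squared_error` are placeholders for whatever model is being evaluated:

```python
import numpy as np

def leave_one_out_error(x, y, fit, squared_error):
    """For k = 1..R: train on all data except (x_k, y_k), evaluate on (x_k, y_k),
    and report the average error."""
    errors = []
    for k in range(len(x)):
        keep = np.arange(len(x)) != k
        model = fit(x[keep], y[keep])                      # train leaving out (x_k, y_k)
        errors.append(squared_error(model, x[k], y[k]))    # evaluate on the held-out point
    return np.mean(errors)

# example: a linear predictor of y from x
# fit = lambda xs, ys: np.polyfit(xs, ys, deg=1)
# squared_error = lambda coeffs, xk, yk: (np.polyval(coeffs, xk) - yk) ** 2
```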

K-Fold Cross-Validation

Randomly divide the data set into K subsets.
For each subset S:
  Train on the data not in S.
  Test on the data in S.
Return the average error over the K subsets.

Example: K = 3, each color corresponds to a subset.
[Figures: 3-fold errors for the three models. Linear: error = 2.05; Quadratic: error = 1.11; Piecewise Linear: error = 2.93.]
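A sketch of K-fold cross-validation in the same style, again with `fit` and `squared_error` standing in for the model under comparison:

```python
import numpy as np

def k_fold_error(x, y, fit, squared_error, K=3, rng=np.random.default_rng(0)):
    """Randomly divide the data into K subsets; for each subset S, train on the
    data not in S and test on the data in S; return the average error."""
    folds = np.array_split(rng.permutation(len(x)), K)
    errors = []
    for S in folds:
        train = np.setdiff1d(np.arange(len(x)), S)
        model = fit(x[train], y[train])
        errors.extend(squared_error(model, x[i], y[i]) for i in S)
    return np.mean(errors)
```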

Cross-Validation Summary

Test Set:
- Wastes a lot of data
- Poor predictor of future performance
+ Simple / efficient

Leave One Out:
- Inefficient
+ Does not waste data

K-Fold:
- Wastes 1/K of the data
- K times slower than Test Set
+ Wastes only 1/K of the data!
+ Only K times slower than Test Set!

A probabilistic interpretation: Gaussian mixture models

We've already seen a one-dimensional version:
- This example has three classes: neuron 1, neuron 2, and background noise.
- Each can be modeled as a Gaussian.
- Any given data point comes from just one Gaussian.
- The whole set of data is modeled by a mixture of three Gaussians.

How do we model this?

[Figure (figure 4 of M. S. Lewicki's spike-sorting review): the distribution of amplitudes for the background activity and the peak amplitudes of the spikes from two units, with amplitude along the horizontal axis and two candidate threshold levels A and B. Setting the threshold at A detects all of the spikes from unit 1 but introduces many false positives; increasing the threshold to B reduces the number of spikes that are misclassified, but at the expense of many missed spikes.]

The Gaussian mixture model density

The likelihood of the data given a particular class ck is given by

  p(x \mid c_k, \mu_k, \Sigma_k)

where x is the spike waveform, and μk and Σk are the mean and covariance for class ck.

The marginal likelihood is computed by summing over the likelihoods of the K classes:

  p(x \mid \theta_{1:K}) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k) \, p(c_k)

θ_{1:K} defines the parameters for all of the classes, θ_{1:K} = {μ1, Σ1, ..., μK, ΣK}.
p(ck) is the prior probability of the kth class, with \sum_k p(c_k) = 1.

What does this mean in this example?

[The slide overlays an excerpt from the spike-sorting review on types of detection errors: the threshold level determines the trade-off between missed spikes (false negatives) and background events that cross threshold (false positives), and overlaps between spikes cause additional misclassification errors.]
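A sketch of the mixture density, reusing the gaussian_density helper sketched earlier (the class priors, means, and covariances are assumed given):

```python
def mixture_density(x, priors, mus, Sigmas):
    """p(x | theta_1:K) = sum_k p(x | c_k, theta_k) p(c_k)."""
    return sum(p_k * gaussian_density(x, mu_k, Sigma_k)
               for p_k, mu_k, Sigma_k in zip(priors, mus, Sigmas))
```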

Bayesian classification

How do we determine the class ck from the data x? Again use Bayes' rule:

  p(c_k \mid x^{(n)}, \theta_{1:K}) = p_{k,n} = \frac{ p(x^{(n)} \mid c_k, \theta_k) \, p(c_k) }{ \sum_{k'} p(x^{(n)} \mid c_{k'}, \theta_{k'}) \, p(c_{k'}) }

This gives the probability that waveform x^{(n)} came from class ck.

Let's review the process:
1. define a model of the problem ✓
2. derive posterior distributions and estimators ✓
3. estimate parameters from data ?? (How do we do this?)
4. evaluate model accuracy

Estimating the parameters: fitting the model density to the data

The objective of density estimation is to maximize the likelihood of the data.
If we assume the samples are independent, the data likelihood is just the product of the marginal likelihoods:

  p(x_{1:N} \mid \theta_{1:K}) = \prod_{n=1}^{N} p(x_n \mid \theta_{1:K})

The class parameters are determined by optimization.
It is far more practical to optimize the log-likelihood.
One elegant approach to this is the EM algorithm.

The EM algorithm

EM stands for Expectation-Maximization, and involves two steps that are iterated. For the case of a Gaussian mixture model:

1. E-step: compute p_{n,k} = p(c_k \mid x^{(n)}, \theta_{1:K}). Let p_k = \sum_n p_{n,k}.
2. M-step: compute the new mean, covariance, and class prior for each class:

  \mu_k \leftarrow \sum_n p_{n,k} \, x^{(n)} / p_k
  \Sigma_k \leftarrow \sum_n p_{n,k} \, (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T / p_k
  p(c_k) \leftarrow p_k / N

Something can go bad here: what if the p_k are zero?

This is just the sample mean and covariance, weighted by the class conditional probabilities p_{n,k}.
Derived by setting the log-likelihood gradient to zero (i.e. at the maximum).
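A minimal sketch of one EM iteration for the mixture model, reusing gaussian_density from earlier; a small floor on p_k guards against the zero-weight problem noted above:

```python
import numpy as np

def em_step(X, priors, mus, Sigmas):
    """One E-step + M-step for a Gaussian mixture model."""
    N, K = len(X), len(priors)
    # E-step: responsibilities p_{n,k} = p(c_k | x^(n), theta_1:K)
    P = np.array([[priors[k] * gaussian_density(x, mus[k], Sigmas[k]) for k in range(K)]
                  for x in X])
    P /= P.sum(axis=1, keepdims=True)
    # M-step: weighted means, covariances, and class priors
    pk = P.sum(axis=0) + 1e-12            # guard against empty (zero-weight) classes
    new_mus = (P.T @ X) / pk[:, None]
    new_Sigmas = []
    for k in range(K):
        D = X - new_mus[k]
        new_Sigmas.append((P[:, k, None] * D).T @ D / pk[k])
    return pk / N, new_mus, new_Sigmas
```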

Four cluster solution with decision boundaries

[Figure: the spike data in PC-score space modeled with four Gaussian clusters; the lines show the Bayesian decision boundaries.]

But wait! Here's a nine cluster solution

Uh oh... How many clusters are there really?

Fortunately, probability theory solves this problem too.

[Figure: the same data modeled with nine clusters, plotted as 2nd component score vs 1st component score.]

Bayesian model comparison

Let M_K represent a model with K classes. (Here we will assume that we can choose the best among all such models, but this assumption is not necessary.) How do we evaluate the probability of model M_K? Bayes' rule again.

We start with our existing model, but add a term to represent the model itself. Also, we marginalize out the dependency on the parameters, because we want the result to be independent of any specific value. Letting X = x_{(1:N)}, we have

  p(M_K \mid X) = \frac{ p(X \mid M_K) \, p(M_K) }{ p(X) }

The denominator is constant across models, and if we assume all models are equally probable a priori, the only data-dependent term is p(X \mid M_K).

How do we compute p(X \mid M_K)? We've encountered this before.

Evaluating the model evidence

p(X \mid M_K) is just the normalizing constant for the posterior over the parameters:

  p(\theta_K \mid X, M_K) = \frac{ p(X \mid \theta_K, M_K) \, p(\theta_K \mid M_K) }{ p(X \mid M_K) }

(slight change of notation: θ_K represents all parameters for model K)

The normalizing constant here is evaluated, just like before, by marginalization:

  p(X \mid M_K) = \int p(X \mid \theta_K, M_K) \, p(\theta_K \mid M_K) \, d\theta_K

Evaluating this term is practically a whole subfield of probability theory: Laplace's method, Monte Carlo integration, variational approximation, etc. We will learn about some of these techniques in future lectures.

Back to the clusters

[Figure 9 (from the spike-sorting review): application of Gaussian clustering to spike sorting. (a) The four-cluster solution: the ellipses show the three-sigma error contours of the four clusters, and the lines show the Bayesian decision boundaries separating the larger clusters. (b) The same data modelled with nine clusters; the elliptical line extending across the bottom is the three-sigma error contour of the largest cluster. Both panels are plotted as 2nd component score vs 1st component score.]

Which model is more probable?

P(M9 | X) is exp(160) times greater than P(M4 | X).
Why might this not agree with our intuitions?
The conclusions are always only as valid as the model.
But P(M9 | X) is exp(16) times greater than P(M11 | X).
This embodies Occam's Razor.

Comparison to cross-validation

- does not waste any data
- correct, if the model is correct
- can be expensive or difficult to evaluate for complex models
- harder to implement (definitely requires a brain)

Summary

k-nearest neighbor: a simple, non-parametric method for classification
- easy to implement, but hard to control the complexity
- can require a lot of data, since there's no model to generalize from

clustering:
- is unsupervised learning
- we have to infer classes without labels

dimensionality reduction and PCA:
- another example of unsupervised learning: fitting a multivariate normal

cross-validation:
- a general way to control complexity: test set, leave one out, k-fold

Gaussian mixture models:
- a probabilistic version of k-means
- can assume arbitrary covariance matrices
- can choose the most probable # of clusters (with Bayesian model comparison)