
Artificial Intelligence:

Representation and Problem Solving


15-381
April 26, 2007
Michael S. Lewicki, Carnegie Mellon

Clustering
(including k-nearest neighbor classification, k-means
clustering, cross-validation, and EM, with a brief foray
into dimensionality reduction with PCA)

A different approach to classification

Nearest neighbor classification on the iris dataset

Nearby points are likely to be members of the same class.

What if we used the points themselves to classify?
- classify x in Ck if x is similar to a point we already know is in Ck.
- E.g.: an unclassified point x is more similar to Class 2 than to Class 1.

Issue: how do we define "similar"?

Simplest is Euclidean distance:

  d(x, y) = \sqrt{ \sum_i (x_i - y_i)^2 }

Could define other metrics depending on the application, e.g. for text documents, images, etc.

[Figure: iris data plotted as x1 vs x2, with three labeled groups: Class 1, Class 2, and Class 3.]

Potential advantages:
- don't need an explicit model
- the more examples the better
- might handle more complex classes
- easy to implement
- no brain required on the part of the designer
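A minimal sketch of the nearest-neighbor idea in Python (the function name and toy data are illustrative, not from the slides):

```python
import numpy as np

def knn_classify(x, train_X, train_y, k=1):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance), as described above."""
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))   # d(x, y) to every training point
    nearest = np.argsort(dists)[:k]                      # indices of the k closest points
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # most frequent class among neighbors

# toy usage: two 2-d classes
train_X = np.array([[1.0, 0.2], [1.2, 0.3], [4.0, 1.5], [4.2, 1.4]])
train_y = np.array([1, 1, 2, 2])
print(knn_classify(np.array([3.9, 1.3]), train_X, train_y, k=3))  # -> 2
```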

A complex, non-parametric decision boundary

How do we control the complexity of this model? Difficult.

How many parameters? Every data point is a parameter!

This is an example of a non-parametric model, i.e. one where the model is defined by the data. (Also called instance-based.)

Can get very complex decision boundaries.

[Figure: a highly irregular nearest-neighbor decision boundary; example from Martial Hebert.]

Error Bounds for NN

(Nonparametric, instance-based methods; from Sam Roweis' notes.)

Amazing fact: asymptotically, err(1-NN) < 2 err(Bayes):

  e_B \le e_{1NN} \le 2 e_B - \frac{M}{M-1} e_B^2

This is a tight upper bound, achieved in the zero-information case when the classes have identical densities.

Q: What are the parameters in K-NN?
A: the scalar K and the entire training set. Models which need the entire training set at test time but (hopefully) have very few other parameters are called nonparametric, instance-based, or case-based.

For K-NN there are also bounds, e.g. for two classes and odd K:

  e_B \le e_{KNN} \le \sum_{i=0}^{(K-1)/2} \binom{K}{i} \left[ e_B^{i+1} (1 - e_B)^{K-i} + e_B^{K-i} (1 - e_B)^{i+1} \right]

What if we want a classifier that uses only a few parameters at test time (e.g. for speed or memory)?
- Idea 1: a single linear boundary, of arbitrary orientation.
- Idea 2: many boundaries, but axis-parallel and tree-structured.
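The two-class bound above can be evaluated numerically; a small Python sketch (the helper name and test values are illustrative):

```python
from math import comb

def knn_error_upper_bound(e_bayes, k):
    """Asymptotic upper bound on the K-NN error for two classes and odd K,
    following the bound quoted above."""
    assert k % 2 == 1
    total = 0.0
    for i in range((k - 1) // 2 + 1):
        total += comb(k, i) * (e_bayes ** (i + 1) * (1 - e_bayes) ** (k - i)
                               + e_bayes ** (k - i) * (1 - e_bayes) ** (i + 1))
    return total

print(knn_error_upper_bound(0.1, 1))   # 1-NN: 2 * e_B * (1 - e_B) = 0.18
print(knn_error_upper_bound(0.1, 7))   # the bound tightens toward e_B as K grows
```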

Example: Handwritten digits

For more on these bounds, see the book A Probabilistic Theory of Pattern Recognition, by L. Devroye, L. Gyorfi & G. Lugosi (1996).

Use Euclidean distance to see which known digit is closest: look at the k nearest neighbors and choose the most frequent class. Digits are just represented as a vector.

Example: USPS digits (example from Sam Roweis)
- Take 16x16 grayscale images (8 bit) of handwritten digits.
- Use Euclidean distance in raw pixel space (dumb!) and 7-NN.
- Classification error (leave-one-out): 4.85%.
- But not all of the 7 nearest neighbours are the same digit.

[Figure: an example digit and its 7 nearest neighbours; digit examples from LeCun et al., 1998. Digit data available at: http://yann.lecun.com/exdb/mnist/]

Caution: k-nearest neighbors can get expensive, since classifying a point means finding its neighbors in the entire training set.

[Clipped slide, "Linear Classification for Binary..." (Sam Roweis): the goal is to find the line (or hyperplane) which best separates the classes, c(x) = sign[x^T w - w_0], where the weight vector w is perpendicular to the decision boundary and w_0 is a threshold. This is the opposite of non-parametric: only a few parameters are kept. Typically we augment x with a constant term and then absorb w_0 into w.]
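A sketch of the leave-one-out error measurement quoted above, assuming the digit images have already been loaded and flattened into an (N, D) array X with labels y (the loading step is not shown):

```python
import numpy as np

def loo_knn_error(X, y, k=7):
    """Leave-one-out classification error for k-NN with Euclidean distance."""
    # all pairwise squared distances; fine for small N, chunk for a full dataset
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)          # never count a point as its own neighbor
    errors = 0
    for n in range(len(X)):
        nearest = np.argsort(sq[n])[:k]
        labels, counts = np.unique(y[nearest], return_counts=True)
        if labels[np.argmax(counts)] != y[n]:
            errors += 1
    return errors / len(X)
```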

The problem of using templates (i.e. Euclidean distance)

Which of these is more like the example? A or B?
[Figure: an example digit image and two candidate digits, A and B.]

Euclidean distance only cares about how many pixels overlap.
Could try to define a distance metric that is insensitive to small deviations in position, scale, rotation, etc.

Digit example: 60,000 training images, 10,000 test images, no preprocessing.

Performance results of various classifiers (from http://yann.lecun.com/exdb/mnist/ and Simard et al., 1998):

  Classifier                                       error rate on test data (%)
  linear                                           12.0
  k=3 nearest neighbor (Euclidean distance)         5.0
  2-layer neural network (300 hidden units)         4.7
  nearest neighbor (Euclidean distance)             3.1
  k-nearest neighbor (improved distance metric)     1.1
  convolutional neural net                          0.95
  best (the conv. net with elastic distortions)     0.4
  humans                                            0.2 - 2.5

Clustering

Clustering: Classification without labels:

In many situations we don't have labeled training data, only unlabeled data.
E.g., in the iris data set, what if we were just starting and didn't know any classes?

[Figure: unlabeled iris data, petal length (cm) vs petal width (cm).]

Types of learning
[Figure: three panels contrasting types of learning.
- supervised: the model (parameters θ1, ..., θn) produces a model output that is compared with a desired output {y1, ..., yn} supplied along with the world (or data).
- unsupervised: the model (parameters θ1, ..., θn) sees only the world (or data); there is no desired output.
- reinforcement: the model (parameters θ1, ..., θn) interacts with the world (or data) and receives a reinforcement signal (covered next week).]

A real example: clustering electrical signals from neurons

An application of PCA: Spike sorting

[Figure: the recording chain (electrode, amplifier, filters, A/D, oscilloscope, software analysis) and an extracellular waveform, roughly 25 msec long, containing many different spikes.]

How do we sort the different spikes?

Basic problem: the only information we have is the signal itself. The true classes are always unknown.

Sorting with level detection

[Figure: spike waveforms (roughly -0.5 to 1.5 msec) detected with a simple amplitude threshold.]

Level detection doesn't always work.

Why level detection doesn't work

[Figure: distribution of amplitudes showing the background amplitude and the peak amplitudes of neuron 1 and neuron 2 overlapping along a single amplitude axis.]

One dimension is not sufficient to separate the spikes.

Idea: try more features

Using multiple features: e.g. the maximum amplitude and the minimum amplitude of each spike.

[Figure: a spike waveform (-0.5 to 1.5 msec) with its max amplitude and min amplitude marked.]

What other features could we use?

Maximum vs minimum

[Figure: scatter plot of spike maximum (µV) vs spike minimum (µV).]

This allows better discrimination than the maximum alone, but is it optimal?

Try different features

Using multiple features: e.g. the width and the height of each spike.

[Figure: a spike waveform (-0.5 to 1.5 msec) with its height and width marked.]

What other features could we use?

Height vs width

[Figure: scatter plot of spike height (µV) vs spike width (msec).]

This allows better discrimination. How can we choose features more objectively?

Brief foray: dimensionality reduction
(in this case, modeling the data with a normal distribution)

Can we model the signal with a better set of features?

Idea: model the distribution of the data. This is density estimation (unsupervised learning): a model {θ1, ..., θn} is fit to the data, here the neural signals.

We've done this before:

Binomial distribution of coin flips:

  p(y \mid \theta, n) = \binom{n}{y} \theta^y (1 - \theta)^{n - y}

  [Figure: p(y | θ = 0.25, n = 10) plotted against y.]

Gaussian distribution of the iris data:

  p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)

  [Figure: iris data, petal length (cm) vs petal width (cm).]

Modeling data with a Gaussian

A Gaussian (or normal) distribution is fit to data with things you're already familiar with:

  p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} (x - \mu)^2 \right)

  mean:      \mu = \frac{1}{N} \sum_n x_n
  variance:  \sigma^2 = \frac{1}{N} \sum_n (x_n - \mu)^2

Multidimensional Gaussian (from Sam Roweis' "gaussian identities", revised July 1999):

A multivariate normal is the same, but in d dimensions. The d-dimensional Gaussian (normal) density for x is

  N(\mu, \Sigma) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)        (1)

with

  \mu = \frac{1}{N} \sum_n x_n,    \Sigma_{ij} = \frac{1}{N} \sum_n (x_{i,n} - \mu_i)(x_{j,n} - \mu_j)

It has entropy

  S = \frac{1}{2} \log_2 \left[ (2\pi e)^d |\Sigma| \right] + \text{const}  bits        (2)

where Σ is a symmetric positive semi-definite covariance matrix and the (unfortunate) constant is the log of the units in which x is measured over the natural units.
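A small numpy sketch of fitting these quantities and evaluating density (1); the function names are illustrative:

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood mean and covariance, as in the formulas above."""
    mu = X.mean(axis=0)                      # mu = (1/N) sum_n x_n
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)               # Sigma_ij = (1/N) sum_n (x_i,n - mu_i)(x_j,n - mu_j)
    return mu, Sigma

def gaussian_density(x, mu, Sigma):
    """Evaluate the d-dimensional normal density N(mu, Sigma) at x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)          # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * quad)
```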

Linear functions of a normal vector

No matter how x is distributed,

  E[Ax + y] = A\,E[x] + y                                  (3a)
  Covar[Ax + y] = A\,Covar[x]\,A^T                         (3b)

In particular, this means that for normally distributed quantities:

  x \sim N(\mu, \Sigma)  \Rightarrow  Ax + y \sim N(A\mu + y,\; A \Sigma A^T)        (4a)
  x \sim N(\mu, \Sigma)  \Rightarrow  \Sigma^{-1/2} (x - \mu) \sim N(0, I)           (4b)
  x \sim N(\mu, \Sigma)  \Rightarrow  (x - \mu)^T \Sigma^{-1} (x - \mu) \sim \chi^2_n        (4c)

Recall the example from the probability theory lecture.

[Embedded figure: covariance intuition, from Andrew W. Moore's "Probability Densities" slides (copyright 2001).]
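Property (4b) is easy to check numerically; a sketch with assumed example values for μ and Σ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# draw samples x ~ N(mu, Sigma)
X = rng.multivariate_normal(mu, Sigma, size=100000)

# whitening: z = Sigma^(-1/2) (x - mu), using the symmetric matrix square root
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
Z = (X - mu) @ Sigma_inv_sqrt.T

print(np.cov(Z, rowvar=False))   # close to the 2x2 identity matrix
```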

The correlational structure is described by the covariance

[Embedded figures: covariance intuition and the principal eigenvector of the covariance, from Andrew W. Moore's "Probability Densities" slides (copyright 2001).]

What about distributions in higher dimensions?

Multivariate covariance matrices and principal components


An application of PCA: head shape

Head measurements on two college-age groups (Bryce and Barker):
  1) football players (30 subjects)
  2) non-football players (30 subjects)

Use six different measurements:

  variable   measurement
  WDMI       head width at widest dimension
  CIRCUM     head circumference
  FBEYE      front to back at eye level
  EYEHD      eye to top of head
  EARHD      ear to top of head
  JAW        jaw width

Are these measures independent?

The covariance matrix

       .370   .602   .149   .044   .107   .209
       .602  2.629   .801   .666   .103   .377
  S =  .149   .801   .458   .012   .013   .120
       .044   .666   .011  1.474   .252   .054
       .107   .103   .013   .252   .488   .036
       .209   .377   .120   .054   .036   .324

  S_{ij} = \frac{1}{N - 1} \sum_{n=1}^{N} (x_{i,n} - \bar{x}_i)(x_{j,n} - \bar{x}_j)

The eigenvalues

  Eigenvalue   Proportion of Variance   Cumulative Proportion
  3.323        .579                     .579
  1.374        .239                     .818
   .476        .083                     .901
   .325        .057                     .957
   .157        .027                     .985
   .088        .015                     1.000

How many PCs should we select?
The first two principal components capture 81.8% of the variance.

The corresponding eigenvectors:

  variable   a1      a2
  WDMI       .207   -.142
  CIRCUM     .873   -.219
  FBEYE      .261   -.231
  EYEHD      .326    .891
  EARHD      .066    .222
  JAW        .128   -.187
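A sketch of how the eigenvalues, eigenvectors, and proportions of variance can be computed from data with numpy (function and variable names are illustrative):

```python
import numpy as np

def pca(X):
    """Eigenvalues (descending), eigenvectors (columns), and the proportion of
    variance explained, from the sample covariance of X (N x d)."""
    S = np.cov(X, rowvar=False)                  # sample covariance, 1/(N-1) normalization
    eigvals, eigvecs = np.linalg.eigh(S)         # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]            # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    proportion = eigvals / eigvals.sum()
    return eigvals, eigvecs, proportion

# e.g. keep enough components to capture ~80% of the variance:
# eigvals, eigvecs, prop = pca(X)
# n_pcs = np.searchsorted(np.cumsum(prop), 0.80) + 1
```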

Using principal components to characterize the data

Use principal components to characterize a distribution:
- What are the data?
- What is the dimensionality of the data?
- How many components are there?
- What will the components look like?

[Figure: overlaid raw spike waveforms, amplitude (µV) vs time (about -0.5 to 0.5 msec).]

The first three principal components of the waveform data

[Figure: PC1, PC2, and PC3 of the spike data plotted as waveforms (magnitude vs msec).]

What do you expect when we plot PC1 vs PC2?

Scatter plot of the first two principal component scores

[Figure: 2nd component score vs 1st component score for the spike data.]

Now the data are much better separated.
Could we use more PCs? How many?

The eigenspectrum of the waveform data

[Figure: sqrt(λ) (µV) vs component number for the spike data.]

The first two PCs account for most of the variance.

Recall example from last lecture: waveform data

The waveform x is modeled with a multivariate Gaussian,

  x \sim p(x \mid \mu, \Sigma),

where μ and Σ are the mean and covariance of the distribution.

Principal components can be used to form a low-dimensional approximation:

  x^{(n)} = \sum_{i=1}^{T} c_i^{(n)} \phi_i

The vectors {φ1, ..., φT} are the eigenvectors of Σ.
Keeping only the first two terms in the sum is an adequate approximation of the full T-dimensional density.

[Figure: raw spike waveform data, amplitude (µV) vs msec.]

Now we can go back to clustering.
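A sketch of this low-dimensional approximation (the mean is added back explicitly here; names are illustrative):

```python
import numpy as np

def pca_approximation(X, m=2):
    """Approximate each waveform by its first m principal components:
    x^(n) ~= mu + sum_i c_i^(n) phi_i."""
    mu = X.mean(axis=0)
    Xc = X - mu
    S = np.cov(X, rowvar=False)
    eigvals, Phi = np.linalg.eigh(S)
    Phi = Phi[:, np.argsort(eigvals)[::-1][:m]]    # first m eigenvectors of Sigma
    C = Xc @ Phi                                   # component scores c_i^(n)
    X_hat = mu + C @ Phi.T                         # low-dimensional reconstruction
    return C, X_hat
```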

Now try clustering our 2d data (could also do it in n-d)

$""

#./'()'%*+,-

#""
!""
"
!!""
!#""
!!""

"

!""
!%&'()'%*+,-

Artificial Intelligence: Clustering

33

#""

$""

Michael S. Lewicki ! Carnegie Mellon

k-means clustering

Idea: try to estimate k cluster centers by minimizing distortion.

Define distortion as:

  D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

where r_{nk} = 1 if x_n belongs to cluster k, and 0 otherwise; i.e. r_{nk} is 1 for the closest cluster mean to x_n.

How do we learn the cluster means?
Each point x_n should be at minimum distance from its closest center.
Use EM = Estimate, Maximize.

Using EM to estimate the cluster means

Our objective function is:

  D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

Differentiate with respect to the mean (the parameter we want to estimate):

  \frac{\partial D}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k)

We know the optimum is when

  \frac{\partial D}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0

Solving for the mean, we have:

  \mu_k = \frac{ \sum_n r_{nk} x_n }{ \sum_n r_{nk} }

This is simply a weighted mean for each cluster.

Thus we have a simple estimation algorithm (k-means clustering):
1. select k points at random
2. estimate (update) the means
3. repeat until converged

Convergence (to a local minimum) is guaranteed.
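A compact sketch of the resulting algorithm in Python (random initialization from the data, then alternating assignment and mean updates until nothing changes):

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0), max_iter=100):
    """Basic k-means: X is (N, d); returns the cluster means and assignments."""
    mu = X[rng.choice(len(X), size=k, replace=False)]          # 1. select k points at random
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # squared distance to each mean
        r = d2.argmin(axis=1)                                   # r_nk: closest mean for each x_n
        new_mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                           for j in range(k)])                  # 2. update the means
        if np.allclose(new_mu, mu):                             # 3. repeat until converged
            break
        mu = new_mu
    return mu, r
```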

k-means clustering example

- Select 3 points at random for the cluster means.  [Figure: initial means on the PC-score scatter plot.]
- Then update them using the estimate.  [Figure: means after one update.]
- And iterate...  [Figures: means after successive iterations.]
- Stop when converged, i.e. no change.  [Figure: final cluster means.]

An example of a local minimum

There can be multiple local minima.

[Figure: a k-means solution that has converged to a poor local minimum.]

How do we choose k?

Increasing k will always decrease our distortion, so we will overfit.
How can we avoid this? Or how do we choose the best k?

We can use cross-validation again. Instead of classification error, however, we use our distortion metric, as before:

  D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

Then just measure the distortion on a test data set, and stop when we reach a minimum.

[Figures: distortion vs k (the number of clusters), illustrating overfitting, and a k = 10 cluster solution on the spike data.]
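A sketch of this procedure, reusing the kmeans sketch above and an assumed random train/test split:

```python
import numpy as np

def distortion(X, mu):
    """D = sum_n min_k ||x_n - mu_k||^2: each point charged to its closest mean."""
    d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def choose_k(X, k_values, test_fraction=0.3, rng=np.random.default_rng(0)):
    idx = rng.permutation(len(X))
    n_test = int(test_fraction * len(X))
    test, train = X[idx[:n_test]], X[idx[n_test:]]
    scores = {}
    for k in k_values:
        mu, _ = kmeans(train, k, rng=rng)       # fit means on training data only
        scores[k] = distortion(test, mu)        # measure distortion on held-out data
    return min(scores, key=scores.get), scores
```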

A nice illustration of cross-validation from Andrew Moore

Example: construct a predictor of y from x given this training data.
[Figure: scatter plot of the (x, y) training data.]

Which model is best for predicting y from x: linear, quadratic, or piecewise linear?
[Figures: the training data fit with a linear, a quadratic, and a piecewise-linear model.]

We want the model that generates the best predictions on future data, not necessarily the one with the lowest error on the training data.

Using a Test Set

1. Use a portion (e.g., 30%) of the data as test data.
2. Fit a model to the remaining training data.
3. Evaluate the error on the test data.

[Figures: test-set errors for the three models. Linear: error = 2.4; Quadratic: error = 0.9; Piecewise Linear: error = 2.2.]

Using a Test Set:
+ Simple
- Wastes a large % of the data
- May get lucky with one particular subset of the data

Leave One Out Cross-Validation

For k = 1 to R:
  Train on all the data, leaving out (xk, yk).
  Evaluate the error on (xk, yk).
Report the average error after trying all the data points.

[Figures: leave-one-out errors for the three models. Linear: error = 2.12; Quadratic: error = 0.962; Piecewise Linear: error = 3.33.]

Note: numerical examples in this and subsequent slides are from A. Moore.

Leave One Out Cross-Validation:
+ Does not waste data
+ Averages over a large number of trials
- Expensive
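A generic sketch of the procedure; `fit` and `squared_error` are placeholders for whatever model is being evaluated:

```python
import numpy as np

def leave_one_out_error(x, y, fit, squared_error):
    """For k = 1..R: train on all data except (x_k, y_k), evaluate on (x_k, y_k),
    and report the average error."""
    errors = []
    for k in range(len(x)):
        keep = np.arange(len(x)) != k
        model = fit(x[keep], y[keep])                      # train leaving out (x_k, y_k)
        errors.append(squared_error(model, x[k], y[k]))    # evaluate on the held-out point
    return np.mean(errors)

# example: a linear predictor of y from x
# fit = lambda xs, ys: np.polyfit(xs, ys, deg=1)
# squared_error = lambda coeffs, xk, yk: (np.polyval(coeffs, xk) - yk) ** 2
```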

K-Fold Cross-Validation

Randomly divide the data set into K subsets.
For each subset S:
  Train on the data not in S.
  Test on the data in S.
Return the average error over the K subsets.

Example: K = 3, each color corresponds to a subset.
[Figures: 3-fold errors for the three models. Linear: error = 2.05; Quadratic: error = 1.11; Piecewise Linear: error = 2.93.]
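A sketch of K-fold cross-validation in the same style, again with `fit` and `squared_error` standing in for the model under comparison:

```python
import numpy as np

def k_fold_error(x, y, fit, squared_error, K=3, rng=np.random.default_rng(0)):
    """Randomly divide the data into K subsets; for each subset S, train on the
    data not in S and test on the data in S; return the average error."""
    folds = np.array_split(rng.permutation(len(x)), K)
    errors = []
    for S in folds:
        train = np.setdiff1d(np.arange(len(x)), S)
        model = fit(x[train], y[train])
        errors.extend(squared_error(model, x[i], y[i]) for i in S)
    return np.mean(errors)
```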

Cross-Validation Summary

Test Set:
- Wastes a lot of data
- Poor predictor of future performance
+ Simple / efficient

Leave One Out:
- Inefficient
+ Does not waste data

K-Fold:
- Wastes 1/K of the data
- K times slower than Test Set
+ Wastes only 1/K of the data!
+ Only K times slower than Test Set!

A probabilistic interpretation: Gaussian mixture models

We've already seen a one-dimensional version:
- This example has three classes: neuron 1, neuron 2, and background noise.
- Each can be modeled as a Gaussian.
- Any given data point comes from just one Gaussian.
- The whole set of data is modeled by a mixture of three Gaussians.

How do we model this?

[Figure (figure 4 of M. S. Lewicki's spike-sorting review): the distribution of amplitudes for the background activity and the peak amplitudes of the spikes from two units, with amplitude along the horizontal axis and two candidate threshold levels A and B. Setting the threshold at A detects all of the spikes from unit 1 but introduces many false positives; increasing the threshold to B reduces the number of spikes that are misclassified, but at the expense of many missed spikes.]

The Gaussian mixture model density

The likelihood of the data given a particular class ck is given by

  p(x \mid c_k, \mu_k, \Sigma_k)

where x is the spike waveform, and μk and Σk are the mean and covariance for class ck.

The marginal likelihood is computed by summing over the likelihoods of the K classes:

  p(x \mid \theta_{1:K}) = \sum_{k=1}^{K} p(x \mid c_k, \theta_k) \, p(c_k)

θ_{1:K} defines the parameters for all of the classes, θ_{1:K} = {μ1, Σ1, ..., μK, ΣK}.
p(ck) is the prior probability of the kth class, with \sum_k p(c_k) = 1.

What does this mean in this example?

[The slide overlays an excerpt from the spike-sorting review on types of detection errors: the threshold level determines the trade-off between missed spikes (false negatives) and background events that cross threshold (false positives), and overlaps between spikes cause additional misclassification errors.]
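A sketch of the mixture density, reusing the gaussian_density helper sketched earlier (the class priors, means, and covariances are assumed given):

```python
def mixture_density(x, priors, mus, Sigmas):
    """p(x | theta_1:K) = sum_k p(x | c_k, theta_k) p(c_k)."""
    return sum(p_k * gaussian_density(x, mu_k, Sigma_k)
               for p_k, mu_k, Sigma_k in zip(priors, mus, Sigmas))
```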

Bayesian classification

How do we determine the class ck from the data x? Again use Bayes' rule:

  p(c_k \mid x^{(n)}, \theta_{1:K}) = p_{k,n} = \frac{ p(x^{(n)} \mid c_k, \theta_k) \, p(c_k) }{ \sum_{k'} p(x^{(n)} \mid c_{k'}, \theta_{k'}) \, p(c_{k'}) }

This gives the probability that waveform x^{(n)} came from class ck.

Let's review the process:
1. define a model of the problem ✓
2. derive posterior distributions and estimators ✓
3. estimate parameters from data ?? (How do we do this?)
4. evaluate model accuracy

Estimating the parameters: fitting the model density to the data

The objective of density estimation is to maximize the likelihood of the data.
If we assume the samples are independent, the data likelihood is just the product of the marginal likelihoods:

  p(x_{1:N} \mid \theta_{1:K}) = \prod_{n=1}^{N} p(x_n \mid \theta_{1:K})

The class parameters are determined by optimization.
It is far more practical to optimize the log-likelihood.
One elegant approach to this is the EM algorithm.

The EM algorithm

EM stands for Expectation-Maximization, and involves two steps that are iterated. For the case of a Gaussian mixture model:

1. E-step: compute p_{n,k} = p(c_k \mid x^{(n)}, \theta_{1:K}). Let p_k = \sum_n p_{n,k}.
2. M-step: compute the new mean, covariance, and class prior for each class:

  \mu_k \leftarrow \sum_n p_{n,k} \, x^{(n)} / p_k
  \Sigma_k \leftarrow \sum_n p_{n,k} \, (x^{(n)} - \mu_k)(x^{(n)} - \mu_k)^T / p_k
  p(c_k) \leftarrow p_k / N

Something can go bad here: what if the p_k are zero?

This is just the sample mean and covariance, weighted by the class conditional probabilities p_{n,k}.
Derived by setting the log-likelihood gradient to zero (i.e. at the maximum).
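A minimal sketch of one EM iteration for the mixture model, reusing gaussian_density from earlier; a small floor on p_k guards against the zero-weight problem noted above:

```python
import numpy as np

def em_step(X, priors, mus, Sigmas):
    """One E-step + M-step for a Gaussian mixture model."""
    N, K = len(X), len(priors)
    # E-step: responsibilities p_{n,k} = p(c_k | x^(n), theta_1:K)
    P = np.array([[priors[k] * gaussian_density(x, mus[k], Sigmas[k]) for k in range(K)]
                  for x in X])
    P /= P.sum(axis=1, keepdims=True)
    # M-step: weighted means, covariances, and class priors
    pk = P.sum(axis=0) + 1e-12            # guard against empty (zero-weight) classes
    new_mus = (P.T @ X) / pk[:, None]
    new_Sigmas = []
    for k in range(K):
        D = X - new_mus[k]
        new_Sigmas.append((P[:, k, None] * D).T @ D / pk[k])
    return pk / N, new_mus, new_Sigmas
```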

Four cluster solution with decision boundaries

[Figure: the spike data in PC-score space modeled with four Gaussian clusters; the lines show the Bayesian decision boundaries.]

But wait! Here's a nine cluster solution

Uh oh... How many clusters are there really?

Fortunately, probability theory solves this problem too.

[Figure: the same data modeled with nine clusters, plotted as 2nd component score vs 1st component score.]

Bayesian model comparison

Let M_K represent a model with K classes. (Here we will assume that we can choose the best among all such models, but this assumption is not necessary.) How do we evaluate the probability of model M_K? Bayes' rule again.

We start with our existing model, but add a term to represent the model itself. Also, we marginalize out the dependency on the parameters, because we want the result to be independent of any specific value. Letting X = x_{(1:N)}, we have

  p(M_K \mid X) = \frac{ p(X \mid M_K) \, p(M_K) }{ p(X) }

The denominator is constant across models, and if we assume all models are equally probable a priori, the only data-dependent term is p(X \mid M_K).

How do we compute p(X \mid M_K)? We've encountered this before.

Evaluating the model evidence

p(X \mid M_K) is just the normalizing constant for the posterior over the parameters:

  p(\theta_K \mid X, M_K) = \frac{ p(X \mid \theta_K, M_K) \, p(\theta_K \mid M_K) }{ p(X \mid M_K) }

(slight change of notation: θ_K represents all parameters for model K)

The normalizing constant here is evaluated, just like before, by marginalization:

  p(X \mid M_K) = \int p(X \mid \theta_K, M_K) \, p(\theta_K \mid M_K) \, d\theta_K

Evaluating this term is practically a whole subfield of probability theory: Laplace's method, Monte Carlo integration, variational approximation, etc. We will learn about some of these techniques in future lectures.

Back to the clusters

[Figure 9 (from the spike-sorting review): application of Gaussian clustering to spike sorting. (a) The four-cluster solution: the ellipses show the three-sigma error contours of the four clusters, and the lines show the Bayesian decision boundaries separating the larger clusters. (b) The same data modelled with nine clusters; the elliptical line extending across the bottom is the three-sigma error contour of the largest cluster. Both panels are plotted as 2nd component score vs 1st component score.]

Which model is more probable?

P(M9 | X) is exp(160) times greater than P(M4 | X).
Why might this not agree with our intuitions?
The conclusions are always only as valid as the model.
But P(M9 | X) is exp(16) times greater than P(M11 | X).
This embodies Occam's Razor.

Comparison to cross-validation

- does not waste any data
- correct, if the model is correct
- can be expensive or difficult to evaluate for complex models
- harder to implement (definitely requires a brain)

Summary

k-nearest neighbor: a simple, non-parametric method for classification
- easy to implement, but hard to control the complexity
- can require a lot of data, since there's no model to generalize from

clustering:
- is unsupervised learning
- we have to infer classes without labels

dimensionality reduction and PCA:
- another example of unsupervised learning: fitting a multivariate normal

cross-validation:
- a general way to control complexity: test set, leave one out, k-fold

Gaussian mixture models:
- a probabilistic version of k-means
- can assume arbitrary covariance matrices
- can choose the most probable # of clusters (with Bayesian model comparison)