Clustering
(including k-nearest neighbor classification, k-means
clustering, cross-validation, and EM, with a brief foray
into dimensionality reduction with PCA)
[Figure: three classes (Class 1, Class 2, Class 3) plotted in the (x1, x2) plane]
Potential advantages:
- don't need an explicit model
- the more examples the better
- might handle more complex classes
- easy to implement
- requires no cleverness from the designer
Nonparametric (Instance-Based) Methods
For K-NN there are also bounds, e.g. for two classes and odd K:

$$ e_B \;\le\; e_{KNN} \;\le\; \sum_{i=0}^{(K-1)/2} \binom{K}{i} \left[ e_B^{\,i+1} (1 - e_B)^{K-i} + e_B^{\,K-i} (1 - e_B)^{i+1} \right] $$

where $e_B$ is the Bayes error.
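The bound above is easy to evaluate numerically. A minimal sketch (our own helper name, assuming the reconstruction of the formula above is correct):

```python
from math import comb

def knn_error_bound(e_b, k):
    """Upper bound on the asymptotic K-NN error for two classes and odd K,
    as a function of the Bayes error e_b."""
    assert k % 2 == 1, "K must be odd"
    total = 0.0
    for i in range((k - 1) // 2 + 1):
        total += comb(k, i) * (e_b ** (i + 1) * (1 - e_b) ** (k - i)
                               + e_b ** (k - i) * (1 - e_b) ** (i + 1))
    return total

# For K = 1 this reduces to the classic 2*e_B*(1 - e_B); the bound
# tightens toward the Bayes error as K grows.
print(knn_error_bound(0.1, 1), knn_error_bound(0.1, 7))
```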
Example: 7 nearest neighbours

Use all the data, with Euclidean distance in raw pixel space (dumb!). Classification error (leave-one-out): 4.85%.

[Figures: a query digit with its 7 nearest neighbors (similar, but not the same); example points and their nearest neighbors in the (x1, x2) plane]
k-nearest neighbors

Digit example, error rate on test data (%):

  Classifier   error
  linear       12.0
  …            5.0
  …            4.7
  …            3.1
  …            1.1
  …            0.95
  …            0.4
  humans       0.2 - 2.5
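A minimal k-NN classifier with leave-one-out evaluation, sketched on synthetic 2-D data (not the digit data; names and numbers here are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=7):
    """Classify x by majority vote among its k nearest training points,
    using Euclidean distance in raw feature space."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    vals, counts = np.unique(nearest, return_counts=True)
    return vals[np.argmax(counts)]

# Toy data: two well-separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(3.0, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Leave-one-out error, as reported on the slide for the digit example.
errors = sum(knn_predict(np.delete(X, i, 0), np.delete(y, i), X[i]) != y[i]
             for i in range(len(X)))
print(f"leave-one-out error: {100 * errors / len(X):.2f}%")
```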
Clustering
In many situations we don't have labeled training data, only unlabeled data.
E.g., in the iris data set, what if we were just starting and didn't know any classes?
[Figure: the iris data, unlabeled, plotted against petal length (cm)]
Michael S. Lewicki, Carnegie Mellon
Types of learning

- supervised: the world (or data) provides inputs plus the desired outputs {y1, ..., yn}; the model (parameters {θ1, ..., θn}) tries to match them
- unsupervised: the world provides only the data; the model fits it
- reinforcement: the model acts and receives feedback on its outputs from the world

[Diagram: world (or data) and model, for each of the three settings]
Example: neural spike data

[Diagram: electrode, amplifier, filters, A/D, software analysis, monitored on an oscilloscope]
[Figure: an extracellular waveform with many different spikes (msec timescale)]
Sorting with level detection

[Figure: spike waveforms detected by level (threshold) crossing, overlaid (msec timescale)]

Principal Component Analysis, Apr 23, 2001 / Michael S. Lewicki, CMU
Why level detection doesn't work

Level detection doesn't always work: the peak amplitudes of neuron 1 and neuron 2 overlap each other and the background amplitude.

[Figure: overlaid waveforms with the peak amplitudes of neurons 1 and 2, the background amplitude, and the max and min amplitude of each spike marked (msec timescale)]
Maximum vs minimum

[Figure: scatter of spike maximum vs spike minimum (µV)]
Height vs width

[Figure: measuring height and width on the waveform; scatter of spike height vs spike width (msec)]
A simple probabilistic model: the binomial distribution

$$ p(y|\theta, n) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y} $$

[Figure: bar plot of p(y | θ = 0.25, n = 10)]

[Diagram: unsupervised learning, data (neural signals) fit by a model with parameters {θ1, ..., θn}]
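The binomial pmf from the slide can be checked directly (a small sketch; the helper name is ours):

```python
from math import comb

def binom_pmf(y, n, theta):
    # p(y | theta, n) = C(n, y) * theta^y * (1 - theta)^(n - y)
    return comb(n, y) * theta ** y * (1 - theta) ** (n - y)

# The distribution plotted on the slide: theta = 0.25, n = 10.
probs = [binom_pmf(y, 10, 0.25) for y in range(11)]
print(max(range(11), key=lambda y: probs[y]))  # mode of p(y | 0.25, 10)
```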
[Figure: histogram of petal length (cm) with a fitted Gaussian]

$$ p(x|\mu,\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$
Artificial Intelligence: Clustering
A Gaussian (or normal) distribution is fit to data with things you're already familiar with:

$$ \mu = \frac{1}{N}\sum_n x_n \quad \text{(mean)} \qquad \sigma^2 = \frac{1}{N}\sum_n (x_n - \mu)^2 \quad \text{(variance)} $$

For the multidimensional Gaussian (see Sam Roweis, "gaussian identities", revised July 1999):

$$ \mu = \frac{1}{N}\sum_n x_n \qquad \Sigma_{ij} = \frac{1}{N}\sum_n (x_{i,n} - \mu_i)(x_{j,n} - \mu_j) $$

It has entropy $\frac{1}{2}\log_2\!\left[(2\pi e)^d |\Sigma|\right]$ bits, and if $x \sim \mathcal{N}(\mu, \Sigma)$ then $\Sigma^{-1/2}(x-\mu) \sim \mathcal{N}(0, I)$ and $(x-\mu)^T \Sigma^{-1} (x-\mu) \sim \chi^2_d$.
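Fitting the 1-D Gaussian above is a one-liner per parameter. A sketch on synthetic data (the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, size=10_000)   # synthetic 1-D data

mu = x.mean()                 # mu = (1/N) sum_n x_n
var = ((x - mu) ** 2).mean()  # sigma^2 = (1/N) sum_n (x_n - mu)^2

def gaussian_pdf(t, mu, var):
    """Density of the fitted Gaussian."""
    return np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(mu, np.sqrt(var))  # should be close to the true 5 and 2
```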
What about distributions in higher dimensions?
The sample covariance matrix:

S =
  .370   .602   .149   .044   .107   .209
  .602  2.629   .801   .666   .103   .377
  .149   .801   .458   .012   .013   .120
  .044   .666   .011  1.474   .252   .054
   …      …      …      …      …      …
   …      …      …      …      …      …
The eigenvalues

  Eigenvalue   Proportion of Variance   Cumulative Proportion
  3.323        .579                     .579
  1.374        .239                     .818
   .476        .083                     .901
   .325        .057                     .957
   .157        .027                     .985
   .088        .015                    1.000
The corresponding eigenvectors:

  variable    a1      a2
  WDMI        .207   -.142
  CIRCUM      .873   -.219
  FBEYE       .261   -.231
  EYEHD       .326    .891
  EARHD       .066    .222
  JAW         .128   -.187
JAW .128
200
150
amplitude (V)
100
50
0
!50
!100
!150
!0.5
msec
0.5
! !
27
[Figure: the first three principal components PC1, PC2, PC3, magnitude vs time (msec)]
[Figure: each spike plotted by its 2nd component score vs 1st component score]

Now the data are much better separated.

Could we use more PCs? How many?
[Figure: $\sqrt{\lambda_i}$ (µV) vs component number, 1 to 50]
Each waveform can be expressed as a weighted sum of the principal components:

$$ x(n) = \sum_{i=1}^{T} c_i v_i $$

[Figure: a waveform and its reconstruction from the leading components (msec timescale)]
[Figure: 2nd component score vs 1st component score]
k-means clustering

Minimize the distortion

$$ D = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2 $$

where $r_{nk} = 1$ if $x_n$ belongs to cluster $k$, 0 otherwise. Setting the gradient with respect to the means to zero,

$$ \frac{\partial D}{\partial \mu_k} = -2 \sum_{n=1}^{N} r_{nk} (x_n - \mu_k) = 0 \quad\Rightarrow\quad \mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}} $$
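The update above alternates with the nearest-mean assignment. A minimal sketch of the algorithm (synthetic blobs stand in for the component scores; names are ours):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Basic k-means: assign each point to its nearest mean (r_nk), then move
    each mean to the average of its points; this never increases D."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    r = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        r = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        new_mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    D = ((X - mu[r]) ** 2).sum()  # final distortion
    return mu, r, D

# Three well-separated blobs.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (0.0, 3.0, 6.0)])
mu, r, D = kmeans(X, k=3)
print(D)
```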
And iterate...

[Figures: successive k-means iterations on the spike scores (2nd component score vs 1st component score), converging on the clusters]
How do we choose k?

The distortion $D = \sum_{n=1}^{N}\sum_{k=1}^{K} r_{nk} \|x_n - \mu_k\|^2$ keeps decreasing as k grows: overfitting.

[Figure: distortion vs k = number of clusters, 1 to 10]
[Figure: the k = 10 cluster solution, 2nd component score vs 1st component score]
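The distortion-vs-k curve can be traced with a basic k-means loop (a self-contained sketch on synthetic blobs; function and data names are ours):

```python
import numpy as np

def kmeans_distortion(X, k, iters=100, seed=0):
    """Run a basic k-means and return the final distortion D(k)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        r = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1), axis=1)
        mu = np.array([X[r == j].mean(axis=0) if np.any(r == j) else mu[j]
                       for j in range(k)])
    return ((X - mu[r]) ** 2).sum()

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 0.3, size=(40, 2)) for m in (0.0, 3.0, 6.0)])
Ds = [kmeans_distortion(X, k) for k in range(1, 8)]
# D typically drops steeply up to the true number of clusters, then slowly.
print(np.round(Ds, 1))
```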
Example

Construct a predictor of y from x given this training data.
[Figure: scatter of the training data, y vs x]

Which model is best for predicting y from x: linear, quadratic, or piecewise linear?
[Figures: linear, quadratic, and piecewise-linear fits]
We want the model that generates the best predictions on future data, not necessarily the one with the lowest error on the training data.

[Figures: linear, quadratic, and piecewise-linear fits. Which model is best for predicting y from x?]
Using a Test Set

1. Use a portion (e.g., 30%) of the data as test data.
2. Fit a model to the remaining training data.
3. Evaluate the error on the test data.

[Figure: training points and held-out test points, y vs x]
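The three steps above can be sketched directly (synthetic data, not the numbers from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 5.0, size=40)
y = 1.0 + 0.5 * x ** 2 + rng.normal(0.0, 0.5, size=40)  # truly quadratic data

# 1. Use a portion (here 30%) of the data as test data.
test = rng.choice(40, size=12, replace=False)
train = np.setdiff1d(np.arange(40), test)

# 2. Fit each model to the remaining training data.
# 3. Evaluate the error on the test data.
errors = {}
for name, degree in [("linear", 1), ("quadratic", 2)]:
    coeffs = np.polyfit(x[train], y[train], degree)
    errors[name] = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
print(errors)
```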
Test-set errors:

Linear: Error = 2.4
Quadratic: Error = 0.9
Piecewise Linear: Error = 2.2

[Figures: each fit evaluated on the test points, y vs x]
Leave-One-Out Cross-Validation

For each data point (xk, yk): train on all the data leaving out (xk, yk), evaluate the error on (xk, yk), and report the average error after trying all the data points.

Linear: Error = 2.12

Note: Numerical examples in this and subsequent slides from A. Moore
Quadratic: Error = 0.962
Piecewise Linear: Error = 3.33

[Figures: leave-one-out fits for each model; each point (xk, yk) is predicted from a model trained on all the other points]
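The leave-one-out procedure can be sketched as a loop over the data (synthetic data; the helper name is ours):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 5.0, size=25)
y = 1.0 + 0.5 * x ** 2 + rng.normal(0.0, 0.5, size=25)  # truly quadratic data

def loocv_error(x, y, degree):
    """For each (xk, yk): fit on all the other points, test on (xk, yk);
    report the average squared error over all points."""
    errs = []
    for k in range(len(x)):
        keep = np.arange(len(x)) != k
        coeffs = np.polyfit(x[keep], y[keep], degree)
        errs.append((np.polyval(coeffs, x[k]) - y[k]) ** 2)
    return np.mean(errs)

print(loocv_error(x, y, 1), loocv_error(x, y, 2))
```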
K-Fold Cross-Validation

Randomly divide the data set into K subsets. For each subset S: train on the data not in S, then test on the data in S. Return the average error over the K subsets.

Example: K = 3, each color corresponds to a subset.
[Figure: the data divided into three folds, y vs x]
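The K-fold procedure above, as a self-contained sketch (synthetic data; names are illustrative):

```python
import numpy as np

def kfold_error(x, y, degree, K=3, seed=8):
    """Randomly split into K subsets; for each subset S, train on the data
    not in S and test on S; return the average error over the K subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    errs = []
    for S in np.array_split(idx, K):
        train = np.setdiff1d(idx, S)
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coeffs, x[S]) - y[S]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(9)
x = rng.uniform(0.0, 5.0, size=30)
y = 1.0 + 0.5 * x ** 2 + rng.normal(0.0, 0.5, size=30)  # truly quadratic data
print(kfold_error(x, y, 1), kfold_error(x, y, 2))
```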
Linear: Error = 2.05
Quadratic: Error = 1.11
Piecewise Linear: Error = 2.93
Cross-Validation Summary

  Method          -                                       +
  Test Set        Wastes a lot of data; poor predictor    Simple/efficient
                  of future performance
  Leave One Out   Inefficient                             Does not waste data
  K-Fold          Wastes 1/K of the data; K times         Wastes only 1/K of the data!
                  slower than Test Set                    Only K times slower than Test Set!
[Figure: amplitude distributions of the background activity and the peak amplitudes of neurons 1 and 2, with threshold levels A and B marked along the amplitude axis]
Figure 4. The figure illustrates the distribution of amplitudes for the background activity and the
peak amplitudes of the spikes from two units. Amplitude is along the horizontal axis. Setting the
threshold level to the position at A introduces a large number of spikes from unit 1. Increasing
the threshold to B reduces the number of spikes that are misclassified, but at the expense of
many missed spikes.
3.2. Types of detection errors

Very often it is not possible to separate the desired spikes from the background noise with perfect accuracy. The threshold level determines the trade-off between missed spikes (false negatives) and the number of background events that cross threshold (false positives), which is illustrated in figure 4. If the threshold is set to the level at A, all of the spikes from unit 1 are detected, but there is a very large number of false positives due to the contamination of spikes from unit 2. If the threshold is increased to the level at B, only spikes from unit 1 are detected, but a large number fall below threshold. Ideally, the threshold should be set to optimize the desired ratio of false positives to false negatives. If the background noise level is small compared to the amplitude of the spikes and the amplitude distributions are well separated, then both of these errors will be close to zero and the position of the threshold hardly matters.

In addition to the background noise, which, to first approximation, is Gaussian in nature (we will have more to say about that below), the spike height can vary greatly if there are other neurons in the local region that generate action potentials of significant size. If the peak of the desired unit and the dip of a background unit line up, a spike will be missed. This is illustrated in figure 5. How frequently this will occur depends on the firing rates of the units involved. A rough estimate for the percentage of error due to overlaps can be calculated as follows. The percentage of missed spikes, like the one shown in figure 5(b), is determined by the probability that the peak of the isolated spike will occur during the negative phase of the background spike, which is expressed as

%missed spikes = 100rd/1000    (1)

The Gaussian mixture model density

The likelihood of the data given a particular class ck is given by

$$ p(x \,|\, c_k, \mu_k, \Sigma_k) $$

where x is the spike waveform, and $\mu_k$ and $\Sigma_k$ are the mean and covariance for class $c_k$. The marginal likelihood is computed by summing over the likelihood of the K classes:

$$ p(x \,|\, \theta_{1:K}) = \sum_{k=1}^{K} p(x \,|\, c_k, \theta_k)\, p(c_k) $$

$\theta_{1:K}$ defines the parameters for all of the classes, $\theta_{1:K} = \{\mu_1, \Sigma_1, \ldots, \mu_K, \Sigma_K\}$. $p(c_k)$ is the probability of the kth class, with $\sum_k p(c_k) = 1$.

What does this mean in this example?
Bayesian classification

How do we determine the class $c_k$ from the data x? Again use Bayes' rule:

$$ p(c_k \,|\, x(n), \theta_{1:K}) = p_{k,n} = \frac{p(x(n) \,|\, c_k, \theta_k)\, p(c_k)}{\sum_{k'} p(x(n) \,|\, c_{k'}, \theta_{k'})\, p(c_{k'})} $$

This tells us the probability that waveform x(n) came from class $c_k$.

How do we do this? Fit the parameters by maximizing the likelihood of the data:

$$ \prod_{n=1}^{N} p(x_n \,|\, \theta_{1:K}) $$
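The Bayes-rule posterior above is easy to compute once the class densities are known. A sketch with hand-picked Gaussian classes (all names and numbers are illustrative):

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Multivariate Gaussian density N(x; mu_k, Sigma_k)."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def class_posterior(x, mus, covs, priors):
    """Bayes' rule: p(ck|x) = p(x|ck) p(ck) / sum_k' p(x|ck') p(ck')."""
    lik = np.array([gaussian_density(x, m, c) for m, c in zip(mus, covs)])
    joint = lik * priors
    return joint / joint.sum()

mus = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]
priors = np.array([0.5, 0.5])
print(class_posterior(np.array([0.2, -0.1]), mus, covs, priors))
```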
The EM algorithm

EM stands for Expectation-Maximization, and involves two steps that are iterated. For the case of a Gaussian mixture model:

1. E-step: compute $p_{n,k} = p(c_k \,|\, x(n), \theta_{1:K})$. Let $p_k = \sum_n p_{n,k}$.
2. M-step: compute the new mean, covariance, and class prior for each class:

$$ \mu_k \leftarrow \sum_n p_{n,k}\, x(n) \,/\, p_k $$
$$ \Sigma_k \leftarrow \sum_n p_{n,k}\, (x(n) - \mu_k)(x(n) - \mu_k)^T \,/\, p_k $$
$$ p(c_k) \leftarrow p_k / N $$

This is just the sample mean and covariance, weighted by the class conditional probabilities $p_{n,k}$. It is derived by setting the log-likelihood gradient to zero (i.e. at the maximum).

Something can go bad here: a covariance can shrink onto a single data point, and the likelihood blows up.
Four cluster solution with decision boundaries

[Figure: four clusters with Bayesian decision boundaries, 2nd component score vs 1st component score]
But wait! Here's a nine cluster solution. Uh oh...

[Figure: the same data modelled with nine clusters, 2nd component score vs 1st component score]

Fortunately, probability theory solves this problem too.
Bayesian model comparison

Let $M_K$ represent a model with K classes. (Here we will assume that we can choose the best among all such models, but this assumption is not necessary.) How do we evaluate the probability of model $M_K$? Bayes' rule again.

We start with our existing model, but add a term to represent the model itself. Also, we marginalize out the dependency on the parameters, because we want the result to be independent of any specific value. Letting $X = x^{(1:N)}$, we have

$$ p(M_K \,|\, X) = \frac{p(X \,|\, M_K)\, p(M_K)}{p(X)}, \qquad p(X \,|\, M_K) = \int p(X \,|\, \theta_K, M_K)\, p(\theta_K \,|\, M_K)\, d\theta_K $$

The denominator is constant across models, and if we assume all models are equally probable a priori, the only data-dependent term is $p(X \,|\, M_K)$.
Back to the clusters

[Figures: (a) the four-cluster solution; (b) the nine-cluster solution. Axes: 1st component score vs 2nd component score]
(b)
P(M9Figure
| X) is9.exp(160)
times
P(M4 |toX).
Application
of greater
Gaussianthan
clustering
spike sorting. (a) The ellipses show the threecontours
the our
fourintuitions?
clusters. The lines show the Bayesian decision boundaries
Whysigma
mighterror
this not
agreeofwith
separating the larger clusters. (b) The same data modelled with nine clusters. The elliptical line
The extending
conclusions
are the
always
onlyisasthe
valid
as the model.
across
bottom
three-sigma
error contour of the largest cluster.
But P(M9 | X) is exp(16) times greater than P(M11 | X).
Razor.
This embodies
Classification
is Occams
performed
by calculating the probability that a data point belongs to
each of the classes, which is obtained with Bayes rule
p(x|ck , k )p(c
66 k )
p(ck |x, 1:K ) = !
.
k p(x|ck , k )p(ck )
(5)
This implicitly defines the Bayesian decision boundaries for the model. Because the cluster
membership is probabilistic, the cluster boundaries can be computed as a function of
Comparison to cross-validation
Summary

- clustering is unsupervised learning: we have to infer the classes without labels
- cross-validation is a general way to control complexity: test set, leave one out, k-fold