
Unsupervised Learning

Clustering Algorithms

Outline

Part 1: Basic Concepts of Data Clustering
- Unsupervised learning and clustering
- Problem formulation: cluster analysis
- Taxonomies of clustering techniques
- Data types and proximity measures
- Difficulties and open problems

Part 2: Clustering Algorithms
- Hierarchical methods
  - Single-link
  - Complete-link
  - Clustering based on dissimilarity increments criteria


Hierarchical Clustering

- Uses an n×n proximity matrix, where D(i,j) is the proximity (similarity or distance) between patterns i and j.

[Figure: agglomerative clustering proceeds bottom-up (step 0 to step 4: a, b, c, d, e → {a,b}, {d,e} → {c,d,e} → {a,b,c,d,e}); divisive clustering traverses the same hierarchy top-down (steps 4 to 0).]


Hierarchical Clustering: Agglomerative Methods

1. Start with n clusters, each containing one object.
2. Find the most similar pair of clusters Ci and Cj in the proximity matrix and merge them into a single cluster.
3. Update the proximity matrix (reduce its order by one by replacing the two individual clusters with the merged cluster).
4. Repeat steps (2) and (3) until a single cluster is obtained (i.e., n−1 times).
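This loop is what standard library implementations of agglomerative clustering perform. A minimal sketch, assuming SciPy is available (the toy data and the cut into 3 clusters are illustrative, not from the slides):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    X = np.random.rand(20, 2)           # 20 toy 2-D patterns (illustrative)
    D = pdist(X, metric='euclidean')    # condensed proximity matrix (step 1)
    Z = linkage(D, method='single')     # n-1 merges with matrix updates (steps 2-4)
    labels = fcluster(Z, t=3, criterion='maxclust')  # cut the hierarchy into 3 clusters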


Hierarchical Clustering: Agglomerative Methods

Similarity measures between clusters:

Well-known inter-cluster measures can be written using the Lance-Williams formula, which expresses the distance between cluster k and the cluster i∪j obtained by merging clusters i and j:

  d(i∪j, k) = a_i d(i,k) + a_j d(j,k) + b d(i,j) + c |d(i,k) − d(j,k)|

Single-link:     a_i = a_j = 0.5;  b = 0;  c = −0.5
                 d(i∪j, k) = min{ d(i,k), d(j,k) }

Complete-link:   a_i = a_j = 0.5;  b = 0;  c = 0.5
                 d(i∪j, k) = max{ d(i,k), d(j,k) }

Centroid:        a_i = n_i/(n_i+n_j);  a_j = n_j/(n_i+n_j);
                 b = −n_i n_j/(n_i+n_j)²;  c = 0

Median:          a_i = a_j = 0.5;  b = −0.25;  c = 0

Average link:    a_i = n_i/(n_i+n_j);  a_j = n_j/(n_i+n_j);  b = c = 0
                 d(C_i, C_j) = (1/(n_i n_j)) Σ_{a∈C_i, b∈C_j} d(a, b)

Ward's method (minimum variance):
                 a_i = (n_k+n_i)/(n_k+n_i+n_j);  a_j = (n_k+n_j)/(n_k+n_i+n_j);
                 b = −n_k/(n_k+n_i+n_j);  c = 0
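As a small sanity check of the formula, the following sketch (plain Python, illustrative names) implements one Lance-Williams update and verifies that the single- and complete-link coefficients reduce to min and max:

    def lance_williams(d_ik, d_jk, d_ij, a_i, a_j, b, c):
        """Distance from the merged cluster (i u j) to cluster k."""
        return a_i * d_ik + a_j * d_jk + b * d_ij + c * abs(d_ik - d_jk)

    # single link (c = -0.5) reduces to the minimum:
    assert lance_williams(2.0, 5.0, 3.0, 0.5, 0.5, 0.0, -0.5) == 2.0
    # complete link (c = +0.5) reduces to the maximum:
    assert lance_williams(2.0, 5.0, 3.0, 0.5, 0.5, 0.0, 0.5) == 5.0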


Each of the classical linkage rules is an instance of the Lance-Williams formula above:

- Single link: the distance between two clusters is the distance between their closest points (also called nearest-neighbor or neighbor joining): d(i∪j, k) = min{ d(i,k), d(j,k) }.

- Complete link: the distance between two clusters is the distance between their farthest pair of points: d(i∪j, k) = max{ d(i,k), d(j,k) }.

- Centroid: the distance between two clusters is the distance between their centroids.

- Average link: the distance between two clusters is the average distance between their points: d(C_i, C_j) = (1/(n_i n_j)) Σ_{a∈C_i, b∈C_j} d(a, b).

- Ward's link (minimum variance): merges the pair of clusters that minimizes the increase of the sum-of-squares criterion (a measure of heterogeneity).


Single linkage:  d(C_i, C_j) = min_{a∈C_i, b∈C_j} d(a, b)

[Worked example on five patterns: starting from the 5×5 distance matrix, the closest pair, {1,2}, is merged first; the matrix is reduced with the min rule, {4,5} is merged next, and pattern 3 then joins {1,2} at distance 8.1.]

Single linkage (continued): the final merge joins {1,2,3} and {4,5} at distance d({1,2,3}, {4,5}) = 9.8.

[Figure: dendrogram of the single-link example.]


Complete link:  d(C_i, C_j) = max_{a∈C_i, b∈C_j} d(a, b)

[Worked example on the same five patterns: the matrices are reduced with the max rule; {1,2} and {4,5} are formed first, then pattern 3 joins {1,2} at distance 11.7.]


Complete link (continued): the final merge joins {1,2,3} and {4,5} at distance d({1,2,3}, {4,5}) = 21.5, a much larger merge height than under single link.

[Figure: dendrogram of the complete-link example.]


Single-link and Complete-link

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

[Figure: the two dendrograms cut at one level. Single-link favours connectedness; complete-link favours compactness.]


Single-link and Complete-link

SL algorithm:
- Favors connectedness
- Equivalent to building a minimum spanning tree (MST) and cutting it at weak links

[Figure: a point set, its MST, and the clusters obtained by cutting the longest MST edges.]
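The MST equivalence gives a direct implementation of single link. A sketch assuming SciPy and distinct edge lengths (the n_clusters parameter and the helper name are illustrative):

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
    from scipy.spatial.distance import pdist, squareform

    def single_link_labels(X, n_clusters):
        D = squareform(pdist(X))                  # full distance matrix
        mst = minimum_spanning_tree(D).toarray()  # the n-1 MST edges
        edges = np.sort(mst[mst > 0])
        # cut the (n_clusters - 1) weakest (longest) links
        threshold = edges[-(n_clusters - 1)] if n_clusters > 1 else np.inf
        mst[mst >= threshold] = 0
        _, labels = connected_components(mst, directed=False)
        return labels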


Single-link and Complete-link

SL algorithm:
- Favors connectedness
- Detects arbitrary-shaped clusters with even densities
- Cannot handle clusters of distinct densities

[Figure: single-link method on ring-shaped data, threshold th = 0.49.]


The SL algorithm is also sensitive to in-between patterns:

[Figure: 2-D data set in which a few in-between patterns bridge two clusters, chaining them together under single-link.]

The SL algorithm also needs a criterion to set the final number of clusters.

CL algorithm:
- Favors compactness

[Figure: the same 2-D data set clustered by the complete-link method.]


The CL algorithm imposes spherical-shaped clusters on the data:

[Figure: ring-shaped data broken into compact groups by the complete-link method.]


Single-link and Complete-link: summary

SL algorithm:
- Favors connectedness
- Detects arbitrary-shaped clusters with even densities
- Cannot handle clusters of distinct densities
- Is sensitive to in-between patterns
- Needs a criterion to set the final number of clusters

CL algorithm:
- Favors compactness
- Imposes spherical-shaped clusters on the data
- Needs a criterion to set the final number of clusters


Hierarchical Clustering

Weaknesses:
- Does not scale well: time complexity of O(n²), where n is the total number of objects
- Can never undo what was done previously (merges are irreversible)

Integration of hierarchical with distance-based clustering:
- BIRCH (Zhang, Ramakrishnan & Livny, 1996): uses a Clustering-Feature (CF) tree and incrementally adjusts the quality of sub-clusters
- CURE (Guha, Rastogi & Shim, 1998): selects well-scattered points from each cluster and then shrinks them towards the center of the cluster by a specified fraction
- CHAMELEON (Karypis, Han & Kumar, 1999): hierarchical clustering using dynamic modeling:
  1. Use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
  2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters

Clustering Based on Dissimilarity Increments Criteria

Smoothness hypothesis:
- A cluster is a set of patterns sharing important characteristics in a given context
- A dissimilarity measure encapsulates the notion of pattern resemblance
- Patterns with higher resemblance are more likely to belong to the same cluster and should be associated first
- Dissimilarity between neighboring patterns within a cluster should not change abruptly
- The merging of well-separated clusters results in abrupt changes in dissimilarity values

A. Fred and J. Leitão, "A New Cluster Isolation Criterion Based on Dissimilarity Increments", IEEE TPAMI, 2003.


Clustering Based on Dissimilarity Increments Criteria

Dissimilarity increments: given a pattern x_i, let (x_i, x_j, x_k) be a triplet of nearest neighbors:

  x_j :  j = argmin_l { d(x_l, x_i) },  l ≠ i
  x_k :  k = argmin_l { d(x_l, x_j) },  l ≠ i, j

Dissimilarity increment:

  d_inc(x_i, x_j, x_k) = | d(x_i, x_j) − d(x_j, x_k) |

[Figure: a triplet x_i, x_j, x_k with the two consecutive nearest-neighbor distances d(x_i, x_j) and d(x_j, x_k).]
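A sketch of the triplet definition above in plain NumPy (Euclidean distances; illustrative, not the slides' code):

    import numpy as np

    def dissimilarity_increment(X, i):
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)      # exclude self-distances
        j = int(np.argmin(D[i]))         # x_j: nearest neighbor of x_i
        row = D[j].copy()
        row[i] = np.inf                  # enforce l != i, j
        k = int(np.argmin(row))          # x_k: nearest neighbor of x_j
        return abs(D[i, j] - D[j, k])    # d_inc(x_i, x_j, x_k)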


Clustering Based on Dissimilarity Increments Criteria

Distribution of dissimilarity increments:

[Figures: histograms of the dissimilarity increments for uniformly distributed data, 2-D Gaussian data, ring-shaped data, and directionally expanding data.]

Clustering Based on Dissimilarity Increments Criteria

Distribution of dissimilarity increments — exponential distribution:

  p(x) = λ exp(−λx)


Clustering Based on Dissimilarity Increments Criteria

Exponential distribution:
- Higher-density patterns -> higher λ
- Well-separated clusters -> d_inc values on the tail of p(x)


Clustering Based on Dissimilarity Increments Criteria

Gap between clusters:  gap_i = d(C_i, C_j) − d_t(C_i)

[Figure: two clusters C_i and C_j; d_t(C_i) and d_t(C_j) are the dissimilarity levels reached within each cluster, d(C_i, C_j) is the inter-cluster distance, and gap_i, gap_j are the corresponding gaps.]


The same gap, seen on a dendrogram:

[Figure: dendrogram view of d_t(C_i), d_t(C_j), d(C_i, C_j), and the gaps gap_i and gap_j.]


Clustering Based on Dissimilarity Increments Criteria

Cluster isolation criterion:
Let C_i, C_k be two clusters that are candidates for merging, and let μ_i, μ_k be the respective mean values of the dissimilarity increments in each cluster. Compute the gap for each cluster, gap_i and gap_k. If gap_i ≥ α μ_i (respectively gap_k ≥ α μ_k), isolate cluster C_i (C_k) and proceed with the clustering strategy on the remaining patterns. If neither cluster exceeds its gap limit, merge them.
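The decision rule can be sketched as follows (assuming the threshold has the α·μ form used in the parameter-setting slide below; names and the default α are illustrative):

    def isolation_decision(gap_i, mu_i, gap_k, mu_k, alpha=3.0):
        """Return the clusters to isolate, or 'merge' if neither gap is abrupt."""
        isolated = []
        if gap_i >= alpha * mu_i:
            isolated.append('C_i')
        if gap_k >= alpha * mu_k:
            isolated.append('C_k')
        return isolated if isolated else 'merge'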


Clustering Based on Dissimilarity Increments Criteria

Setting the isolation criterion parameter α:

Result: the tangent to the exponential density at a point that is a multiple α of the distribution mean, α/λ, crosses the x-axis at (α+1)/λ.

[Figure: exponential distribution with λ = 0.05 and tangent lines at points that are multiples of the mean; values of α around 3 to 5 cover the significant part of the distribution.]


Clustering Based on Dissimilarity Increments Criteria

Hierarchical clustering algorithm:
- A statistic of the dissimilarity increments within each cluster is maintained and updated during cluster merging
- Clusters are obtained by comparing dissimilarity increments with a dynamic threshold, α μ_i, based on the cluster statistics


Clustering Based on Dissimilarity Increments Criteria

Results: ring-shaped clusters

[Figure: data set with ring-shaped clusters.]


Results: ring-shaped clusters — single-link method, th = 0.49

[Figure: single-link result on the ring data.]


Results: ring-shaped clusters — dissimilarity-increments-based method

[Figures: the increments-based method isolates the rings; annotations mark the gaps (g1, g2) and the increments (d1, d2) against the per-cluster isolation thresholds.]


Results: 2-D patterns with complex structure

[Figures: dissimilarity-increments-based method, stable over 2 < α < 9, versus single-link with threshold th = 1.1.]


Clustering of Contour Images

- The data set is composed of 634 contour images of 15 types of hardware tools, t1 to t15.
- Counting each pose as a distinct sub-class within the object type gives a total of 24 classes.


Clustering of Contour Images

Contour extraction pipeline: image → contour extraction (based on a thresholding method) → string contour description (8-directional differential chain code):
- the object boundary is sampled at 50 equally spaced points
- the angle between consecutive segments is quantized into 8 levels


Clustering of Contour Images

[Figure: dendrograms over tool classes t1-t5, computed with the string-edit distance: single-link versus the dissimilarity-increments-based method.]


Clustering Based on Dissimilarity Increments Criteria

Description:
- Hierarchical agglomerative algorithm adopting a cluster isolation criterion based on dissimilarity increments

Strengths:
- Not tied to a particular dissimilarity measure (the examples used Euclidean and string-edit distances)
- Able to identify clusters of arbitrary shape and size
- The number of clusters is found intrinsically

Weakness:
- Sensitive to in-between points connecting touching clusters


Outline

Partitional methods:
- K-Means
- Spectral clustering
- EM-based Gaussian mixture decomposition

Part 3: Validation of Clustering Solutions
- Cluster validity measures

Part 4: Ensemble Methods
- Evidence accumulation clustering


Partitional Methods
K-Means

Minimizes the squared-error cost function:

  J = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − μ_i||²

Algorithm (input: k, the number of clusters, and the data set):
1. Randomly select k seed points from the data set and take them as initial centroids.
2. Partition the data into k clusters by assigning each object to the cluster with the nearest centroid.
3. Compute the centroids of the clusters of the current partition (the centroid is the mean point of the cluster).
4. Go back to step 2; stop when there are no new assignments.
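A minimal NumPy sketch of steps 1-4 (random seeding, no empty-cluster handling; the data and parameters are illustrative):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1
        for _ in range(n_iter):
            # step 2: assign each object to the nearest centroid
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            labels = np.argmin(d2, axis=1)
            # step 3: recompute each centroid as its cluster mean
            new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centroids, centroids):              # step 4: stop
                break
            centroids = new_centroids
        return labels, centroids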



Partitional Methods: K-Means

Favors compactness.

[Figure: K-means partition of the 2-D data set used earlier.]


Partitional Methods: K-Means

Favors compactness.

Strengths:
- Fast algorithm: O(tkn), where t is the number of iterations; normally k, t << n
- Scalability
- Often terminates at a local optimum

Weakness:
- Imposes spherical-shaped clusters

[Figure: K-means clustering of uniform data (k = 4).]


Partitional Methods: K-Means

Additional weakness: K-means is sensitive to the number of objects in the clusters.

[Figure: 2-D example in which unbalanced cluster sizes distort the K-means partition.]


Partitional Methods: K-Means

Weaknesses:
- Imposes spherical-shaped clusters
- Is sensitive to the number of objects in clusters
- Depends on initialization

[Figure: K-means dependence on initialization; different seeds yield different partitions of the same data.]

Partitional Methods: K-Means — summary

Favors compactness.

Strengths:
- Fast algorithm: O(tkn), where t is the number of iterations; normally k, t << n
- Scalability
- Often terminates at a local optimum

Weaknesses:
- Imposes spherical-shaped clusters
- Is sensitive to the number of objects in clusters
- Depends on initialization
- Needs a criterion to set the final number of clusters
- Applicable only when a mean is defined (what about categorical data?)


Variations of the K-Means Method

- K-Means (MacQueen, 1967): each cluster is represented by the center of the cluster.
- The k-means variants differ in:
  - Selection of the initial k means
  - Dissimilarity calculations (e.g., Mahalanobis distance -> elliptic clusters)
  - Strategies to calculate cluster means
- Medoid-based methods: each cluster is represented by one of the objects in the cluster.
- Fuzzy version: fuzzy k-means.
- Handling categorical data: k-modes (Huang, 1998):
  - Replaces the means of clusters with modes
  - Uses new dissimilarity measures to deal with categorical objects
  - Uses a frequency-based method to update the modes of clusters


Spectral Clustering

For a given data set X, spectral clustering finds a set of data clusters on the basis of the spectral analysis of a similarity graph.

- The clustering problem is defined in terms of a complete graph G with vertices V = {1, ..., N}, corresponding to the data points, where each edge between two vertices is weighted by the similarity between them.
- The weight matrix is also called the affinity matrix or the similarity matrix. A common choice is the Gaussian kernel:

  A_ij = exp( −||x_i − x_j||² / (2σ²) )
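A sketch of this affinity matrix in NumPy (sigma is the user-chosen scale parameter; zeroing the diagonal follows a common convention and is an assumption here):

    import numpy as np

    def affinity_matrix(X, sigma):
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
        A = np.exp(-sq / (2.0 * sigma ** 2))                 # Gaussian kernel
        np.fill_diagonal(A, 0.0)                             # no self-similarity
        return A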


Spectral Clustering

- Cutting edges of G yields disjoint subgraphs of G, which are taken as the clusters of X.
- The goal of clustering is to organize the data set into disjoint subsets with high intra-cluster similarity and low inter-cluster similarity: the resulting clusters should be as compact and isolated as possible.


Spectral Clustering

The graph partitioning for data clustering can be interpreted as the minimization of an objective function in which compactness and isolation are quantified by subset sums of edge weights.

Common objective functions:

  Ratio cut:       Rcut(C_1, ..., C_k) = Σ_{l=1}^{k} cut(C_l, X \ C_l) / card(C_l)

  Normalised cut:  Ncut(C_1, ..., C_k) = Σ_{l=1}^{k} cut(C_l, X \ C_l) / cut(C_l, X)

  Min-max cut:     Mcut(C_1, ..., C_k) = Σ_{l=1}^{k} cut(C_l, X \ C_l) / cut(C_l, C_l)

where cut(A, B) is the sum of the edge weights between points p ∈ A and q ∈ B, X \ C_l is the complement of C_l in X, and card(C_l) denotes the number of points in C_l.


Spectral Clustering

- The solution of the minimization problem for any of the previous objective functions is obtained from the matrix of the first k eigenvectors of a matrix derived from the affinity matrix (a Laplacian matrix).
- The eigenvectors for Ncut and Mcut are identical and are obtained from the symmetric Laplacian

    L = L_sym := D^{−1/2} A D^{−1/2}

  where D is a diagonal matrix whose i-th diagonal entry is the sum of the i-th row of A.
- Another common choice is L = L_rw := D^{−1}(D − A).
- Distinct algorithms differ in how they produce and use the eigenvectors and how they derive clusters from them:
  - some use each eigenvector one at a time;
  - others use the top k eigenvectors simultaneously.
- Spectral clustering is closely related to spectral graph partitioning, in which the second eigenvector of a graph's Laplacian is used to define a semi-optimal cut; the second eigenvector solves a relaxation of an NP-hard discrete graph-partitioning problem, giving an approximation to the optimal cut.


Spectral Clustering (Ng et al., 2001)

- Maps the feature space into a new space, Y, based on the eigenvectors of a matrix derived from an affinity matrix associated with the data set.
- The data partition is obtained by applying the K-means algorithm in the new space.

[Figure: original feature space → affinity matrix A_ij = exp(−||x_i − x_j||²/(2σ²)) → eigenvector feature space.]

A. Y. Ng, M. I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm", NIPS 2001.


Spectral Clustering (Ng et al., 2001)

Algorithm: form the affinity matrix A, compute L = D^{−1/2} A D^{−1/2}, stack the top k eigenvectors of L as columns of a matrix, normalize its rows to unit length, cluster the rows with K-means, and assign each original point the label of its row.
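A compact sketch of these steps (NumPy; it reuses the affinity_matrix and kmeans helpers sketched earlier, so it is illustrative rather than the paper's reference code):

    import numpy as np

    def spectral_clustering(X, k, sigma):
        A = affinity_matrix(X, sigma)                    # Gaussian-kernel affinities
        d = A.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = D_inv_sqrt @ A @ D_inv_sqrt                  # L = D^(-1/2) A D^(-1/2)
        _, vecs = np.linalg.eigh(L)                      # ascending eigenvalues
        Y = vecs[:, -k:]                                 # top-k eigenvectors
        Y = Y / np.linalg.norm(Y, axis=1, keepdims=True) # row normalization
        labels, _ = kmeans(Y, k)                         # K-means in the new space
        return labels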


Spectral Clustering (Ng et al., 2001)

Results strongly depend on the parameter values k and σ:

[Figures: the same data sets partitioned with (K = 2, σ = 0.1) versus (K = 2, σ = 0.4), and with (K = 3, σ = 0.1) versus (K = 3, σ = 0.45); the resulting clusters change qualitatively with σ.]

Spectral Clustering

Selection of parameter values — candidate criteria include the mean squared error of the partition in the eigenvector space (MSE_Y), the eigengap of the affinity matrix, and the ratio-cut (Rcut) value of the resulting partition.

Selection of parameters: global results on selecting σ and K

- None of the studied methods is suitable for automatic selection of the spectral-clustering parameters.
- A majority-voting decision did not significantly improve the results.

[Table: percentage of correct classification for each selection criterion.]


Spectral Clustering

Strengths:
- Detects arbitrary-shaped clusters
- With an adequate similarity measure between patterns, it can be applied to all types of data

Weaknesses:
- Computationally heavy
- Needs criteria to set the final number of clusters and the scaling factor σ


Model-Based Clustering: Finite Mixtures

Consider k random sources with probability density functions f_i(x), i = 1, ..., k; a source is chosen at random (source i with probability α_i) and the observation x is then drawn from it:

  Conditional:    f(x | source i) = f_i(x)
  Joint:          f(x and source i) = α_i f_i(x)
  Unconditional:  f(x) = Σ_{i=1}^{k} α_i f_i(x)   (sum over all sources)


Model-Based Clustering: Finite Mixtures

Each component models one cluster; clustering = mixture fitting:

  f(x | θ) = Σ_{i=1}^{k} α_i f(x | θ_i)

[Figure: three component densities f_1(x), f_2(x), f_3(x) and their mixture.]


Gaussian Mixture Decomposition

Mixture model:

  f(x | θ) = Σ_{i=1}^{k} α_i f(x | θ_i)

- Component densities: f(x | θ_i)
- Mixing probabilities: α_i ≥ 0 and Σ_{i=1}^{k} α_i = 1

Gaussian components:
- Arbitrary covariances: f(x | θ_i) = N(x | μ_i, C_i), with
  θ = {μ_1, ..., μ_k, C_1, ..., C_k, α_1, ..., α_{k−1}}
- Common covariance: f(x | θ_i) = N(x | μ_i, C), with
  θ = {μ_1, ..., μ_k, C, α_1, ..., α_{k−1}}

Mixture Model Fitting

Given n independent observations x = {x^(1), x^(2), ..., x^(n)} and the mixture density model f(x | θ) = Σ_{i=1}^{k} α_i f(x | θ_i), estimate the θ that maximizes the (log-)likelihood (the ML estimate of θ):

  θ̂ = argmax_θ L(x, θ)

  L(x, θ) = log Π_{j=1}^{n} f(x^(j) | θ) = Σ_{j=1}^{n} log Σ_{i=1}^{k} α_i f(x^(j) | θ_i)


Gaussian Mixture Model Fitting

Problem: the likelihood function is unbounded as det(C_i) → 0:
- There is no global maximum
- Unusual goal: a good local maximum

Example: a two-component Gaussian mixture on data points {x_1, x_2, ..., x_n},

  f(x | μ_1, μ_2, σ_1, σ_2) = (1/2) N(x | μ_1, σ_1²) + (1/2) N(x | μ_2, σ_2²)

Setting μ_1 = x_1, the log-likelihood L(x, θ) → ∞ as σ_1 → 0: one component collapses onto a single data point while the remaining points are still explained by the other component.


Mixture Model Fitting

- The ML estimate has no closed-form solution.
- Standard alternative: the expectation-maximization (EM) algorithm, which formulates fitting as a missing-data problem.

Observed data: x = {x^(1), x^(2), ..., x^(n)}


Mixture Model Fitting

The ML estimate has no closed-form solution; the standard alternative is the EM algorithm, formulated as a missing-data problem:
- Observed data: x = {x^(1), x^(2), ..., x^(n)}
- Missing data: z = {z^(1), z^(2), ..., z^(n)}, the missing labels (colors):

    z^(j) = [z_1^(j), z_2^(j), ..., z_k^(j)]^T = [0 ... 0 1 0 ... 0]^T

  with the 1 at position i when x^(j) was generated by component i.

Complete log-likelihood function:

  L_c(x, z, θ) = Σ_{j=1}^{n} Σ_{i=1}^{k} z_i^(j) log [ α_i f_i(x^(j) | θ_i) ]
               = Σ_{j=1}^{n} log f(x^(j), z^(j) | θ)

The EM Algorithm

E-step: compute the expected value of L_c(x, z, θ):

  E[ L_c(x, z, θ) | x, θ̂(t) ] = Q(θ, θ̂(t))

M-step: update the parameter estimates:

  θ̂(t+1) = argmax_θ Q(θ, θ̂(t))


EM Algorithm for a Mixture of Gaussians

Iterative procedure: θ̂(0), θ̂(1), ..., θ̂(t), θ̂(t+1), ...

E-step: w_i^(j,t) is the estimate, at iteration t, of the probability that x^(j) was produced by component i:

  w_i^(j,t) = α_i^(t) f(x^(j) | θ_i^(t)) / Σ_{m=1}^{k} α_m^(t) f(x^(j) | θ_m^(t))

M-step:

  α_i^(t+1) = (1/n) Σ_{j=1}^{n} w_i^(j,t)

  μ_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) x^(j) / Σ_{j=1}^{n} w_i^(j,t)

  C_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) (x^(j) − μ_i^(t+1)) (x^(j) − μ_i^(t+1))^T / Σ_{j=1}^{n} w_i^(j,t)
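One EM pass following these updates, as a sketch (NumPy + scipy.stats; the small ridge added to each covariance is my own guard against the det(C_i) → 0 degeneracy discussed earlier, not part of the slides):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, alphas, means, covs):
        n, d = X.shape
        k = len(alphas)
        # E-step: responsibilities w[j, i] = P(component i | x_j)
        w = np.column_stack([
            alphas[i] * multivariate_normal.pdf(X, means[i], covs[i])
            for i in range(k)
        ])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and covariances
        Nk = w.sum(axis=0)
        alphas = Nk / n
        means = (w.T @ X) / Nk[:, None]
        covs = [((w[:, i, None] * (X - means[i])).T @ (X - means[i])) / Nk[i]
                + 1e-6 * np.eye(d) for i in range(k)]
        return alphas, means, covs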


Mixture Gaussian Decomposition: Model Selection

How many components?
- The maximized likelihood never decreases when k increases.
- Usually one minimizes a penalized criterion:

  k̂ = argmin_k C(θ̂(k), k),  k = k_min, ..., k_max
  C(θ̂(k), k) = −L(x, θ̂(k)) + P(θ̂(k))

Criteria in this category:
- Minimum description length (MDL), Rissanen and Ristad, 1992
- Akaike's information criterion (AIC), Windham and Cutler, 1992
- Schwarz's Bayesian inference criterion (BIC), Fraley and Raftery, 1998

Resampling-based techniques:
- Bootstrap for clustering, Jain and Moreau, 1987
- Bootstrap for Gaussian mixtures, McLachlan, 1987
- Cross-validation, Smyth, 1998
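In practice, a penalized criterion of this form can be evaluated over a range of k; a sketch using scikit-learn's GaussianMixture and its BIC score (the data and the k range are illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 6])
    ks = range(1, 7)
    bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in ks]
    k_hat = ks[int(np.argmin(bics))]   # k = argmin_k C(theta_hat(k), k)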


Mixture Gaussian Decomposition: Model Selection — MDL

Given θ̂(k), the shortest code-length for x (Shannon's) is:

  L(x | θ̂(k)) = −log f(x | θ̂(k))

Total code-length (two-part code), adding the parameter code-length:

  L(x, θ̂(k)) = −log f(x | θ̂(k)) + L(θ̂(k))

MDL criterion:

  θ̂(k) = argmin_{θ(k)} [ −log f(x | θ(k)) + L(θ(k)) ]

with L(each component of θ(k)) = (1/2) log(n′), where n′ is the amount of data from which the parameter is estimated.


Mixture Gaussian Decomposition: Model Selection

Classical MDL, with n′ = n:

  θ̂(k) = argmin_{θ(k)} [ −log f(x | θ(k)) + (k/2) log n ]

Mixtures MDL (MMDL) (Figueiredo, 2002):

  θ̂(k) = argmin_{θ(k)} [ −log f(x | θ(k)) + (k(N_p + 1)/2) log n + (N_p/2) Σ_i log α_i ]

Using EM and redefining the M-step for the mixing probabilities:

  α_i^(t+1) ∝ max{ 0, Σ_{j=1}^{n} w_i^(j,t) − N_p/2 }

N_p is the number of parameters of each component:
- Gaussian, arbitrary covariances: N_p = d + d(d+1)/2
- Gaussian, common covariance: N_p = d

This M-step may annihilate components: a component whose effective support falls below N_p/2 points receives zero weight.

M. Figueiredo and A. K. Jain, "Unsupervised Learning of Finite Mixture Models", IEEE TPAMI, 2002.

Gaussian Mixture Decomposition — EM + MMDL

Examples:

[Figures: EM with the MMDL criterion fitting Gaussian mixtures on example data sets.]

Gaussian Mixture Decomposition

Strengths:
- Model-based approach
- Good for Gaussian data
- Handles touching clusters

Weaknesses:
- Unable to detect arbitrary-shaped clusters
- Depends on initialization


Gaussian Mixture Decomposition

EM is a local (greedy) algorithm (the likelihood never decreases), so the result depends on the initialization.

[Figures: two runs from different initializations, converging after 74 and 270 iterations to different solutions.]


References

A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
A. K. Jain, M. N. Murty, and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
A. Y. Ng, M. I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm", in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani (eds.), MIT Press, 2002.
M. Figueiredo and A. K. Jain, "Unsupervised Learning of Finite Mixture Models", IEEE Trans. on Pattern Analysis and Machine Intelligence, 2002.
A. L. N. Fred and J. M. N. Leitão, "A New Cluster Isolation Criterion Based on Dissimilarity Increments", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 25, no. 8, 2003.
