
Clustering

Ramon Baldrich
Universitat Autònoma de Barcelona
Another ML introduction:
Supervised learning

In supervised learning, the “right answers” (ground truth) are given.
Training Set: {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))}
Unsupervised learning

In unsupervised learning, the “right answers” (ground truth) are not given.
Training Set: {x(1), x(2), …, x(m)}
– Market segmentation
– Social network analysis
– Astronomical data analysis
– Medical image analysis


What is Cluster Analysis?
Finding groups of objects in data such that the
objects in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups
Intra-cluster distances are minimized; inter-cluster distances are maximized.

Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical
and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

Partitional Clustering

[Figure: original points and a partitional clustering of them]

Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1–p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]
K-MEANS CLUSTERING ALGORITHM
K-Means Algorithm
Step 1  Initialize K “means” μk, one for each class

        Hint: we can use random starting points, or choose K points
        randomly from the set of samples

Step 2  Assign each point to the closest mean μk

Step 3  Update the means μk of the new clusters

Step 4  If the means do not change, terminate; else go back to Step 2

[Figures: successive iterations on a toy 2-D data set with K = 2, showing the
assignments and the means μ1, μ2 until convergence]
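A minimal NumPy sketch of these four steps (an illustrative implementation rather than the course's reference code; the toy data and K = 2 are placeholders):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=None):
    """Minimal K-means: X is an (N, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the K means by picking K random samples
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the closest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Step 3: update each mean as the centroid of its assigned points
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        # Step 4: stop when the means no longer change
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu

# Toy usage with two Gaussian blobs (K = 2, as in the slides)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
labels, mu = kmeans(X, K=2, seed=0)
```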
Optimization Objective
Let's define some notation to describe the assignment of data points to
clusters. For each data point x(n) we introduce a corresponding set of binary
indicator variables

    r_{nk} \in \{0, 1\}, \qquad k = 1, \dots, K

describing which of the K clusters the data point x(n) is assigned to.

We define an objective function (cost function, distortion measure) as

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \rVert^2

which represents the sum of the squared distances of each data point to its
assigned mean vector μk.
Optimization Objective
Goal: find the values for the {rnk} and the {μk} so as to minimize J

Intuition: we can do this through an iterative procedure in which each
iteration involves two successive steps, corresponding to successive
optimizations with respect to the {rnk} and the {μk}:

E Step  Minimize J with respect to the {rnk}, keeping the {μk} fixed

M Step  Minimize J with respect to the {μk}, keeping the {rnk} fixed

Note: we will see later that these two steps correspond to the E (expectation)
and M (maximization) steps of the EM algorithm.
Expectation step
E Step  Minimize J with respect to the {rnk}, keeping the {μk} fixed

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \rVert^2

M Step  Minimize J with respect to the {μk}, keeping the {rnk} fixed

Observe:
• J is a linear function of rnk -> closed-form solution
• The terms involving different n are independent -> we can optimise for
each n separately by choosing rnk to be 1 for whichever value of k gives
the minimum value of ‖x(n) − μk‖²

    r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}
Maximisation Step
E Step  Minimize J with respect to the {rnk}, keeping the {μk} fixed

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \rVert^2

M Step  Minimize J with respect to the {μk}, keeping the {rnk} fixed

Observe:
• J is a quadratic function of μk -> minimise by setting the derivative with
respect to μk to zero

    \frac{\partial J}{\partial \boldsymbol{\mu}_k} = 2 \sum_{n=1}^{N} r_{nk} \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \right) = 0
    \quad\Rightarrow\quad
    \boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\, \mathbf{x}^{(n)}}{\sum_n r_{nk}}
K-Means Algorithm Revisited
Step 1  Initialize K “means” μk, one for each class
        ↔ Initialization: initialize K “means” μk, one for each class

Step 2  Assign each point to the closest mean μk
        ↔ Expectation: minimize J with respect to the {rnk}, keeping the {μk} fixed

Step 3  Update the means μk of the new clusters
        ↔ Maximisation: minimize J with respect to the {μk}, keeping the {rnk} fixed

Step 4  If the means do not change, terminate; else go back to Step 2
        ↔ Terminate: terminate if J has converged, otherwise go back to the E-step
Alternative
Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different
clusters,

2. no (or minimum) change of centroids, or

3. minimum decrease in the sum of squared error (SSE),

    \mathrm{SSE} = \sum_{k=1}^{K} \sum_{\mathbf{x}:\, r_{nk} = 1} \mathrm{dist}(\mathbf{x}, \boldsymbol{\mu}_k)^2

where dist(x, μk) is the distance between data point x and centroid μk.

‘Closeness’ could be measured by Euclidean distance, cosine similarity,
correlation, etc.
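A small sketch of computing this SSE for a given assignment under different notions of ‘closeness’ (an illustration assuming SciPy; the metric names are those accepted by scipy.spatial.distance.cdist, and X, labels, centroids are assumed to come from a clustering such as the K-means sketch above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def sse(X, labels, centroids, metric="euclidean"):
    """Sum of squared 'distances' of each point to its assigned centroid."""
    # cdist supports e.g. 'euclidean', 'cosine', 'correlation'
    d = cdist(X, centroids, metric=metric)           # (N, K) distances
    return np.sum(d[np.arange(len(X)), labels] ** 2)

# Example: compare the criterion under two different metrics
# sse(X, labels, mu)                    # Euclidean SSE
# sse(X, labels, mu, metric="cosine")   # cosine-distance-based variant
```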
Random Initialization
Convergence of the K-means algorithm is assured, but it may converge to a
local minimum.

Random initialization sets the centers to randomly chosen existing data points.


Problems with Selecting Initial Points
If there are K ‘real’ clusters then the chance of selecting one centroid from
each cluster is small.

The chance is relatively small when K is large: if the clusters are all the
same size, n, then

    P = \frac{\text{ways to select one centroid from each cluster}}{\text{ways to select } K \text{ centroids}} = \frac{K!\, n^K}{(Kn)^K} = \frac{K!}{K^K}

For example, if K = 10, then the probability is 10!/10^10 ≈ 0.00036.

Sometimes the initial centroids will readjust themselves in the ‘right’ way,
and sometimes they don’t.
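A quick numerical check of the K = 10 figure (a hypothetical simulation under the equal-sized-cluster assumption):

```python
import math
import numpy as np

K = 10
analytic = math.factorial(K) / K**K              # K!/K^K for equal-sized clusters
print(f"analytic probability:  {analytic:.5f}")  # ~0.00036

# Monte Carlo check: each random centroid lands in a uniformly random cluster
rng = np.random.default_rng(0)
trials = 200_000
draws = rng.integers(0, K, size=(trials, K))                 # cluster of each of the K picks
hits = (np.sort(draws, axis=1) == np.arange(K)).all(axis=1)  # every cluster covered exactly once?
print(f"simulated probability: {hits.mean():.5f}")
```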
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters

[Figure: K-means iterations 1–4 under this initialization]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while
others have only one.

[Figure: K-means iterations 1–4 under this initialization]
Solutions to Initial Centroids
Problem
• Multiple runs
– Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial
centroids
• Select more than k initial centroids and then select among
these initial centroids
– Select most widely separated
• Postprocessing

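As a practical illustration of the ‘multiple runs’ idea, libraries such as scikit-learn restart K-means several times and keep the run with the lowest SSE; a sketch, assuming scikit-learn is available (the data here is a random placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

# n_init independent runs with different random initializations;
# the fit with the lowest inertia (SSE) is kept.
km = KMeans(n_clusters=3, init="random", n_init=20, random_state=0).fit(X)
print(km.inertia_)          # SSE of the best run
print(km.cluster_centers_)  # its centroids

# init="k-means++" (the default) spreads the initial centroids out,
# which usually needs fewer restarts.
```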
Pre-processing and Post-processing
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE
– Can use these steps during the clustering process
• ISODATA

Choosing the number of Clusters
What is the right number (K) of classes?
Choosing the number of Clusters
[Figure: plots of the cost function J versus K (no. of clusters), one of them showing a clear “elbow”]

The first clusters add much information (they explain a lot of variance /
greatly reduce the value of the cost function), but at some point the marginal
gain will drop, giving an angle in the graph. The number of clusters is chosen
at this point, hence the "elbow criterion". This "elbow" cannot always be
unambiguously identified.
What will k-means do?
Choosing the number of Clusters

[Figure: cost-function curves J versus K for the preceding example data sets]
Limitations of K-means
K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes

K-means has problems when the data contains outliers.

Limitations of K-means: Differing Sizes

[Figure: original points vs. K-means with 3 clusters]

Limitations of K-means: Differing Density

[Figure: original points vs. K-means with 3 clusters]

Limitations of K-means: Non-globular Shapes

[Figure: original points vs. K-means with 2 clusters]
Limitations of K-means
• A point can only belong to one class (hard classification)
• Clusters have the same “shape” (the only parameters are the means μk)

[Figure: a K-means result] How do we go about this?
GAUSSIAN MIXTURE MODELS
A Gaussian Density

    p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})

with mean μ and (co)variance Σ.

A Gaussian Mixture

    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) + \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})

A Gaussian Mixture: different variances

    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)

A Gaussian Mixture: mixing coefficients

    p(\mathbf{x}) = \pi_1\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \pi_2\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2),
    \qquad \sum_{k=1}^{K} \pi_k = 1

Example mixing coefficients: π1 = 0.7, π2 = 0.3.
Gaussian Mixture Distribution

    p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
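A small sketch of evaluating such a mixture density, assuming SciPy is available; the two components and their parameter values are made up purely for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-component mixture in 2-D
pis    = np.array([0.7, 0.3])                           # mixing coefficients, sum to 1
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # means
sigmas = [np.eye(2), 0.5 * np.eye(2)]                   # covariances

def mixture_pdf(x, pis, mus, sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal(mu, S).pdf(x)
               for pi, mu, S in zip(pis, mus, sigmas))

print(mixture_pdf(np.array([1.0, 1.0]), pis, mus, sigmas))
```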
A Different Formulation
Given a mixture of Gaussians as the underlying probability distribution, each
sample is produced in two steps: first choosing one of the Gaussian mixture
components, and then sampling according to the selected component’s
probability density.

Intuition: We do not directly observe which


component was responsible for each sample – this
“responsible component” can be represented by a
latent (hidden) variable z
A Different Formulation
The latent variable z is a K-dimensional binary random variable
having a 1-of-K representation:

    z_k = \begin{cases} 1 & \text{for one particular element, } k = i \\ 0 & \text{for all other elements, } k \neq i \end{cases}

so that z_k \in \{0, 1\} and \sum_k z_k = 1.

We define: p(z_k = 1) = \pi_k


A Different Formulation
As z uses a 1-of-K representation we can write:

    p(z_k = 1) = \pi_k \;\rightarrow\; p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}

Similarly, for the conditional distribution of x given a particular value of z:

    p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
    \;\rightarrow\;
    p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}

Finally, the marginal distribution of x over all possible states of z:

    p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
A Different Formulation
Using Bayes' theorem:

    \gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x})
    = \frac{p(z_k = 1)\, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\mathbf{x} \mid z_j = 1)}
    = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

γ(zk) can be seen as the “responsibility” that component k takes for
“explaining” the observation x.
K-means vs Mixture of Gaussians

K-Means
• K-means is a (“hard”) classifier
• Parameters to estimate:
  – Means {μk}

Mixture of Gaussians
• A mixture of Gaussians is a probability model
• We can USE it as a “soft” classifier
• Parameters to estimate:
  – Means {μk}
  – Covariances {Σk}
  – Mixing coefficients {πk}
Maximum Likelihood
Goal: given a data set of observations {x(1), x(2), ..., x(N)} we wish to
model the data using a mixture of Gaussians

Intuition: maximise the (log) likelihood of the model parameters

    \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}

Problem: unlike estimating the parameters of a single Gaussian, the presence
of singularities makes this not a well-posed problem
EXPECTATION MAXIMISATION FOR
GAUSSIAN MIXTURES
K-Means Algorithm Reminder
Initialization  Initialize K “means” μk, one for each class

Expectation     Minimize J with respect to the {rnk}, keeping the {μk} fixed

Maximisation    Minimize J with respect to the {μk}, keeping the {rnk} fixed

Terminate       Terminate if J has converged, otherwise go back to the E-step
Expectation Maximization (EM) for
Gaussian Mixtures

Initialization  Initialize the parameters of K Gaussians (μk, Σk, πk)

Expectation     Give each point x(n) an assignment score γ(znk) for each cluster k

Maximisation    Given the scores, adjust {μk, πk, Σk} for each cluster k

Terminate       Evaluate the log likelihood; stop when converged, else go back to the E-step
Expectation Maximization (EM) for
Gaussian Mixtures

Initialization  Initialize the parameters of K Gaussians (μk, Σk, πk)

Hint: we can use the K-means result to initialize:

    μk ← the K-means centroid of cluster k
    Σk ← cov(cluster(k))
    πk ← (# points in cluster k) / (total number of points)
Expectation Maximization (EM) for
Gaussian Mixtures

Expectation     Give each point x(n) an assignment score γ(znk) for each cluster k

    \gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

γ(znk) is called the “responsibility”: how much is Gaussian k responsible for
the point x(n)
Expectation Maximization (EM) for
Gaussian Mixtures

Maximisation    Given the scores, adjust {μk, πk, Σk} for each cluster k

    \boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}^{(n)},
    \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})
Expectation Maximization (EM) for
Gaussian Mixtures

Maximisation    Given the scores, adjust {μk, πk, Σk} for each cluster k

    \boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}} \right) \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}} \right)^{\mathsf{T}},
    \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})
Expectation Maximization (EM) for
Gaussian Mixtures

Maximisation    Given the scores, adjust {μk, πk, Σk} for each cluster k

    \pi_k^{\text{new}} = \frac{N_k}{N},
    \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})
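A compact NumPy sketch of this EM loop (an illustrative implementation under simplifying assumptions; in particular there are no safeguards against the singularities mentioned earlier, and X is assumed to be an (N, d) data matrix):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialization: random means, identity covariances, uniform mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # Weighted densities pi_k * N(x_n | mu_k, Sigma_k) under the current parameters
        dens = np.column_stack([pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
                                for k in range(K)])            # (N, K)
        # Evaluate the log likelihood and check convergence
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # E-step: responsibilities gamma(z_nk)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mu_k, Sigma_k, pi_k
        Nk = gamma.sum(axis=0)                                 # (K,)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return pi, mu, Sigma, gamma
```

For example, pi, mu, Sigma, gamma = em_gmm(X, 2) returns soft assignments gamma in place of the hard labels that K-means would give on the same data.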
THE GENERAL EXPECTATION
MAXIMISATION ALGORITHM
General EM Algorithm                (see sections 9.3, 9.4 for more details)

Initialization  Initialize the parameters θold

Expectation     Evaluate p(Z | X, θold)
                (Z: latent (hidden) variables, X: observed variables)

Maximisation    θnew = argmax_θ Q(θ, θold), where

    Q(\theta, \theta^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \theta^{\text{old}})\, \ln p(\mathbf{X}, \mathbf{Z} \mid \theta)

Terminate       Evaluate the log likelihood; if it has converged, stop. Otherwise set
                θold ← θnew and go back to the E-step
Summary
• K-Means is an unsupervised method for data
clustering
• K-means yields “hard” cluster assignments
• Gaussian Mixture Models are useful for modelling
data with “soft” cluster assignments
• Expectation Maximization is a method used when
we have a model with latent (hidden) variables
• K-Means can be seen as a special case of the EM
algorithm (Σk = εI, ε → 0)
HIERARCHICAL CLUSTERING
Hierarchical Clustering
Produces a set of nested clusters organized as a
hierarchical tree
Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of
merges or splits

[Figure: a nested clustering of points 1–6 and the corresponding dendrogram]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level

[Figure: dendrogram (vertical axis: dissimilarity) cut at different levels to obtain K = 2, 3, 4, 5, or 6 clusters]

They may correspond to meaningful taxonomies


– Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point
(or there are k clusters)

• Traditional hierarchical algorithms use a similarity or distance


matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique
The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
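As a usage sketch, this procedure (with the inter-cluster proximity definitions discussed on the following slides) is available in scipy.cluster.hierarchy, assuming SciPy and matplotlib are installed:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))   # placeholder data

# method = 'single' (MIN), 'complete' (MAX), 'average' (group average) or 'ward'
Z = linkage(X, method="single")      # the sequence of merges

dendrogram(Z)                        # visualize the nested clusters
plt.show()

labels = fcluster(Z, t=4, criterion="maxclust")   # 'cut' the dendrogram into 4 clusters
```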
Agglomerative Clustering Algorithm

Key operation is the computation of the proximity of two


clusters
– Different approaches to defining the distance between
clusters distinguish the different algorithms
Starting Situation
Start with clusters of individual points and a
proximity matrix

[Figure: initial proximity matrix with one row and one column per point (p1, p2, p3, …)]
Intermediate Situation
After some merging steps, we have some clusters
[Figure: proximity matrix over the current clusters C1–C5 and the corresponding grouping of the points]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.

[Figure: proximity matrix over clusters C1–C5, with the entry for the closest pair, C2 and C5, highlighted]
After Merging
The question is “How do we update the proximity matrix?”
[Figure: after merging C2 and C5, the row and column for the new cluster C2 ∪ C5 in the proximity matrix are marked “?” and must be recomputed]
How to Define Inter-Cluster Similarity

[Figure: two clusters of points p1–p5 and their proximity matrix; each of the definitions below measures the proximity between the two clusters differently]

⚫ MIN
⚫ MAX
⚫ Group Average
⚫ Distance Between Centroids
⚫ Other methods driven by an objective function
  - Ward’s Method uses squared error
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar
(closest) points in the different clusters
– Determined by one pair of points, i.e., by one link in the
proximity graph.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

[Figure: the corresponding proximity graph over points 1–5]
Hierarchical Clustering: MIN

[Figure: nested clusters produced by single-link (MIN) clustering of points 1–6 and the corresponding dendrogram]


Strength of MIN

[Figure: original points vs. the two clusters found by MIN (single link)]

• Can handle non-elliptical shapes

Limitations of MIN

[Figure: original points vs. the two clusters found by MIN (single link)]

• Sensitive to noise and outliers
Cluster Similarity: MAX or
Complete Linkage
Similarity of two clusters is based on the two least similar (most
distant) points in the different clusters
– Determined by all pairs of points in the two clusters

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

[Figure: the corresponding proximity graph over points 1–5]
Strength of MAX

[Figure: original points vs. the two clusters found by MAX (complete link)]

• Less susceptible to noise and outliers

Limitations of MAX

[Figure: original points vs. the two clusters found by MAX (complete link)]

• Tends to break large clusters

• Biased towards globular clusters
Cluster Similarity: Group Average
Proximity of two clusters is the average of pairwise proximity
between points in the two clusters.
- Need to use average connectivity for scalability since total proximity
favors large clusters

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

[Figure: the corresponding proximity graph over points 1–5]
Hierarchical Clustering: Group Average
[Figure: nested clusters produced by group-average clustering of points 1–6 and the corresponding dendrogram]


Hierarchical Clustering: Group Average
Compromise between Single and Complete Link

Strengths
– Less susceptible to noise and outliers

Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
Similarity of two clusters is based on the increase in squared
error when two clusters are merged
– Similar to group average if distance between points is
distance squared
– Hierarchical analogue of K-means
• Can be used to initialize K-means
Cluster Similarity: Ward’s Method
Given a set of data points C, the sum of squared errors (ESS) is defined as

    \mathrm{ESS}(C) = \sum_{\mathbf{x} \in C} \lVert \mathbf{x} - \boldsymbol{\mu}_C \rVert^2,
    \qquad \boldsymbol{\mu}_C = \frac{1}{|C|} \sum_{\mathbf{x} \in C} \mathbf{x}

Join those clusters Ci and Cj to form cluster Ct with the minimum increase in
squared error.

[Figure: nested clusters produced by Ward’s method on points 1–6]
Cluster Similarity: Ward’s Method
Limitations
– O(N³) time in many cases
– There are N steps, and at each step the proximity matrix, of size O(N²),
must be updated and searched
– Complexity can be reduced to O(N2 log(N) ) time for some
approaches
Summary
• Once a decision is made to combine two clusters, it cannot be
undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the
following:
• Sensitivity to noise and outliers
• Difficulty handling different sized clusters and convex
shapes
• Breaking large clusters
GRAPH BASED CLUSTERING

A Tutorial on Spectral Clustering, Arik Azran


Graph based Clustering
Spectral Clustering Example – 2 Spirals

[Figure: two interleaved spirals in the original 2-D space, and the same points in the embedded space given by the two leading eigenvectors]

The dataset exhibits complex cluster shapes. K-means performs very poorly in
this space due to its bias toward dense spherical clusters.

In the embedded space given by the two leading eigenvectors, the clusters are
trivial to separate.
Spectral Clustering - Derek Greene
[Slides: Matthias Hein and Ulrike von Luxburg]

Ng et al., On Spectral Clustering: Analysis and an Algorithm
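A compact sketch of normalized spectral clustering in the spirit of Ng et al. (an illustrative implementation; the k-nearest-neighbour affinity, the parameter values, and the use of scikit-learn's KMeans for the final step are choices made here, not taken from the slides):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=2, n_neighbors=10):
    # 1. Affinity matrix W: symmetric k-nearest-neighbour graph
    D = cdist(X, X)
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]   # skip self (column 0)
    W = np.zeros_like(D)
    rows = np.repeat(np.arange(len(X)), n_neighbors)
    W[rows, idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                               # symmetrize
    # 2. Normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # 3. Embedding: eigenvectors of L for the smallest eigenvalues
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :n_clusters]
    # 4. Row-normalize and run K-means in the embedded space
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(U)

# Two interleaved spirals, as in the example above
t = np.linspace(0.5, 3 * np.pi, 200)
s1 = np.column_stack([t * np.cos(t), t * np.sin(t)])
X = np.vstack([s1, -s1]) + np.random.default_rng(0).normal(scale=0.05, size=(400, 2))
labels = spectral_clustering(X, n_clusters=2)
```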
Cluster Validity
For supervised classification we have a variety of measures to
evaluate how good our model is
– Accuracy, precision, recall

For cluster analysis, the analogous question is: how do we evaluate the
“goodness” of the resulting clusters?

But “clusters are in the eye of the beholder”!

Then why do we want to evaluate them?


– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing
whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results,
e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without
reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to
determine which is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate


the entire clustering or just individual clusters.
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of
cluster validity are classified into the following three types.
– External Index: Used to measure the extent to which cluster labels
match externally supplied class labels.
• Entropy
– Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
• Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or
clusters.
• Often an external or internal index is used for this function, e.g., SSE or
entropy
Measuring Cluster Validity Via Correlation
Two matrices
– Proximity Matrix
– “Incidence” Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belong to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
• High correlation indicates that points that belong to the
same cluster are close to each other.

Not a good measure for some density-based clusters.
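A sketch of this measure (an illustrative implementation; proximity is taken here to be Euclidean distance, so a good clustering shows up as a strongly negative correlation, as in the example that follows):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cluster_label_correlation(X, labels):
    """Correlation between the proximity matrix and the cluster incidence matrix."""
    labels = np.asarray(labels)
    n = len(X)
    proximity = squareform(pdist(X))                              # (n, n) distances
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    # Only the n(n-1)/2 entries above the diagonal are needed (symmetric matrices)
    iu = np.triu_indices(n, k=1)
    return np.corrcoef(proximity[iu], incidence[iu])[0, 1]

# Usage: corr = cluster_label_correlation(X, labels)  with labels from e.g. K-means
```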


Measuring Cluster Validity Via Correlation

• Correlation of incidence and proximity matrices for the K-means clusterings
of the following two data sets.

[Figure: scatter plots of the two data sets with their K-means clusterings; Corr = -0.9235 (left), Corr = -0.5810 (right)]


Using Similarity Matrix for Cluster
Validation
Order the similarity matrix with respect to cluster labels and
inspect visually.

[Figure: the data points and the corresponding similarity matrix, ordered by cluster label]
Using Similarity Matrix for Cluster Validation

[Figure: similarity matrix, ordered by cluster label, for the clusters found by DBSCAN]
Internal Measures: Cohesion and
Separation
Cluster Cohesion: measures how closely related the objects in a cluster are.
Example: SSE

Cluster Separation: measures how distinct or well-separated a cluster is from
the other clusters. Example: squared error

– Cohesion is measured by the within-cluster sum of squares (SSE):

    \mathrm{WSS} = \sum_{i} \sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{m}_i)^2

– Separation is measured by the between-cluster sum of squares:

    \mathrm{BSS} = \sum_{i} |C_i|\, (\mathbf{m} - \mathbf{m}_i)^2

where m is the overall mean of the data and m_i is the mean of cluster C_i.

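A small sketch of these two quantities (an illustrative implementation; X is the data matrix and labels a hard cluster assignment such as the one produced by K-means):

```python
import numpy as np

def cohesion_separation(X, labels):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squares."""
    labels = np.asarray(labels)
    m = X.mean(axis=0)                           # overall mean
    wss = bss = 0.0
    for k in np.unique(labels):
        Ck = X[labels == k]
        mk = Ck.mean(axis=0)                     # cluster mean m_i
        wss += np.sum((Ck - mk) ** 2)            # cohesion
        bss += len(Ck) * np.sum((m - mk) ** 2)   # separation
    return wss, bss

# Note: WSS + BSS equals the total sum of squares, so minimizing SSE (= WSS)
# is equivalent to maximizing BSS.
```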