
Clustering

Ramon Baldrich
Universitat Autònoma de Barcelona
Another ML introduction:
Supervised learning

In supervised learning, the “right answers” (ground truth) are given.
Training Set: {(x(1), y(1)), (x(2), y(2)), …, (x(m), y(m))}
Unsupervised learning

In unsupervised learning, the “right answers” (ground truth) are not given.
Training Set: {x(1), x(2), …, x(m)}
– Market segmentation
– Social network analysis
– Astronomical data analysis
– Medical image analysis


What is Cluster Analysis?
Finding groups of objects in data such that the
objects in a group will be similar (or related) to one
another and different from (or unrelated to) the
objects in other groups
Intra-cluster distances are minimized; inter-cluster distances are maximized.

Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical
and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

Partitional Clustering

[Figure: original points and a partitional clustering of them]

Hierarchical Clustering

[Figure: a traditional hierarchical clustering of points p1–p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]
K-MEANS CLUSTERING ALGORITHM
K-Means Algorithm
Step 1  Initialize K “means” μk, one for each class

        Hint: we can use random starting points, or choose K points
        randomly from the set of samples

Step 2  Assign each point to the closest mean μk

Step 3  Update the means μk of the new clusters

Step 4  If the means do not change, terminate; else go back to Step 2

[Figures: successive iterations on a toy 2-D data set with K = 2, showing the
assignments and the means μ1, μ2 until convergence]
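A minimal NumPy sketch of these four steps (an illustrative implementation rather than the course's reference code; the toy data and K = 2 are placeholders):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=None):
    """Minimal K-means: X is an (N, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize the K means by picking K random samples
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the closest mean
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        labels = dists.argmin(axis=1)
        # Step 3: update each mean as the centroid of its assigned points
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        # Step 4: stop when the means no longer change
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu

# Toy usage with two Gaussian blobs (K = 2, as in the slides)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
labels, mu = kmeans(X, K=2, seed=0)
```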
Optimization Objective
Let's define some notation to describe the assignment of data points to
clusters. For each data point x(n) we introduce a corresponding set of binary
indicator variables

    r_{nk} \in \{0, 1\}, \qquad k = 1, \dots, K

describing which of the K clusters the data point x(n) is assigned to.

We define an objective function (cost function, distortion measure) as

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \rVert^2

which represents the sum of the squared distances of each data point to its
assigned mean vector μk.
Optimization Objective
Goal: find the values for the {rnk} and the {μk} so as to minimize J

Intuition: we can do this through an iterative procedure in which each
iteration involves two successive steps, corresponding to successive
optimizations with respect to the {rnk} and the {μk}:

E Step  Minimize J with respect to the {rnk}, keeping the {μk} fixed

M Step  Minimize J with respect to the {μk}, keeping the {rnk} fixed

Note: we will see later that these two steps correspond to the E (expectation)
and M (maximization) steps of the EM algorithm.
Expectation step
E Step  Minimize J with respect to the {rnk}, keeping the {μk} fixed

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \rVert^2

M Step  Minimize J with respect to the {μk}, keeping the {rnk} fixed

Observe:
• J is a linear function of rnk -> closed-form solution
• The terms involving different n are independent -> we can optimise for
each n separately by choosing rnk to be 1 for whichever value of k gives
the minimum value of ‖x(n) − μk‖²

    r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \rVert^2 \\ 0 & \text{otherwise} \end{cases}
Maximisation Step
E Step  Minimize J with respect to the {rnk}, keeping the {μk} fixed

    J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\, \lVert \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \rVert^2

M Step  Minimize J with respect to the {μk}, keeping the {rnk} fixed

Observe:
• J is a quadratic function of μk -> minimise by setting the derivative with
respect to μk to zero

    \frac{\partial J}{\partial \boldsymbol{\mu}_k} = 2 \sum_{n=1}^{N} r_{nk} \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \right) = 0
    \quad\Rightarrow\quad
    \boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\, \mathbf{x}^{(n)}}{\sum_n r_{nk}}
K-Means Algorithm Revisited
Step 1  Initialize K “means” μk, one for each class
        ↔ Initialization: initialize K “means” μk, one for each class

Step 2  Assign each point to the closest mean μk
        ↔ Expectation: minimize J with respect to the {rnk}, keeping the {μk} fixed

Step 3  Update the means μk of the new clusters
        ↔ Maximisation: minimize J with respect to the {μk}, keeping the {rnk} fixed

Step 4  If the means do not change, terminate; else go back to Step 2
        ↔ Terminate: terminate if J has converged, otherwise go back to the E-step
Alternative
Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different
clusters,

2. no (or minimum) change of centroids, or

3. minimum decrease in the sum of squared error (SSE),

    \mathrm{SSE} = \sum_{k=1}^{K} \sum_{\mathbf{x}:\, r_{nk} = 1} \mathrm{dist}(\mathbf{x}, \boldsymbol{\mu}_k)^2

where dist(x, μk) is the distance between data point x and centroid μk.

‘Closeness’ could be measured by Euclidean distance, cosine similarity,
correlation, etc.
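A small sketch of computing this SSE for a given assignment under different notions of ‘closeness’ (an illustration assuming SciPy; the metric names are those accepted by scipy.spatial.distance.cdist, and X, labels, centroids are assumed to come from a clustering such as the K-means sketch above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def sse(X, labels, centroids, metric="euclidean"):
    """Sum of squared 'distances' of each point to its assigned centroid."""
    # cdist supports e.g. 'euclidean', 'cosine', 'correlation'
    d = cdist(X, centroids, metric=metric)           # (N, K) distances
    return np.sum(d[np.arange(len(X)), labels] ** 2)

# Example: compare the criterion under two different metrics
# sse(X, labels, mu)                    # Euclidean SSE
# sse(X, labels, mu, metric="cosine")   # cosine-distance-based variant
```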
Random Initialization
Convergence of the K-means algorithm is assured, but it may converge to a
local minimum.

Random initialization sets the centers to randomly chosen existing data points.


Problems with Selecting Initial Points
If there are K ‘real’ clusters then the chance of selecting one centroid from
each cluster is small.

The chance is relatively small when K is large: if the clusters are all the
same size, n, then

    P = \frac{\text{ways to select one centroid from each cluster}}{\text{ways to select } K \text{ centroids}} = \frac{K!\, n^K}{(Kn)^K} = \frac{K!}{K^K}

For example, if K = 10, then the probability is 10!/10^10 ≈ 0.00036.

Sometimes the initial centroids will readjust themselves in the ‘right’ way,
and sometimes they don’t.
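A quick numerical check of the K = 10 figure (a hypothetical simulation under the equal-sized-cluster assumption):

```python
import math
import numpy as np

K = 10
analytic = math.factorial(K) / K**K              # K!/K^K for equal-sized clusters
print(f"analytic probability:  {analytic:.5f}")  # ~0.00036

# Monte Carlo check: each random centroid lands in a uniformly random cluster
rng = np.random.default_rng(0)
trials = 200_000
draws = rng.integers(0, K, size=(trials, K))                 # cluster of each of the K picks
hits = (np.sort(draws, axis=1) == np.arange(K)).all(axis=1)  # every cluster covered exactly once?
print(f"simulated probability: {hits.mean():.5f}")
```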
10 Clusters Example
Starting with two initial centroids in one cluster of each pair of clusters

[Figure: K-means iterations 1–4 under this initialization]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while
others have only one.

[Figure: K-means iterations 1–4 under this initialization]
Solutions to Initial Centroids
Problem
• Multiple runs
– Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial
centroids
• Select more than k initial centroids and then select among
these initial centroids
– Select most widely separated
• Postprocessing

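As a practical illustration of the ‘multiple runs’ idea, libraries such as scikit-learn restart K-means several times and keep the run with the lowest SSE; a sketch, assuming scikit-learn is available (the data here is a random placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # placeholder data

# n_init independent runs with different random initializations;
# the fit with the lowest inertia (SSE) is kept.
km = KMeans(n_clusters=3, init="random", n_init=20, random_state=0).fit(X)
print(km.inertia_)          # SSE of the best run
print(km.cluster_centers_)  # its centroids

# init="k-means++" (the default) spreads the initial centroids out,
# which usually needs fewer restarts.
```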
Pre-processing and Post-processing
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE
– Can use these steps during the clustering process
• ISODATA

Choosing the number of Clusters
What is the right number (K) of classes?
Choosing the number of Clusters
[Figure: plots of the cost function J versus K (no. of clusters), one of them showing a clear “elbow”]

The first clusters add much information (they explain a lot of variance /
greatly reduce the value of the cost function), but at some point the marginal
gain will drop, giving an angle in the graph. The number of clusters is chosen
at this point, hence the "elbow criterion". This "elbow" cannot always be
unambiguously identified.
What will k-means do?
Choosing the number of Clusters

[Figure: cost-function curves J versus K for the preceding example data sets]
Limitations of K-means
K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes

K-means has problems when the data contains outliers.

Limitations of K-means: Differing Sizes

[Figure: original points vs. K-means with 3 clusters]

Limitations of K-means: Differing Density

[Figure: original points vs. K-means with 3 clusters]

Limitations of K-means: Non-globular Shapes

[Figure: original points vs. K-means with 2 clusters]
Limitations of K-means
• A point can only belong to one class (hard classification)
• Clusters have the same “shape” (the only parameters are the means μk)

[Figure: a K-means result] How do we go about this?
GAUSSIAN MIXTURE MODELS
A Gaussian Density

    p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})

with mean μ and (co)variance Σ.

A Gaussian Mixture

    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) + \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})

A Gaussian Mixture: different variances

    \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)

A Gaussian Mixture: mixing coefficients

    p(\mathbf{x}) = \pi_1\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \pi_2\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2),
    \qquad \sum_{k=1}^{K} \pi_k = 1

Example mixing coefficients: π1 = 0.7, π2 = 0.3.
Gaussian Mixture Distribution

    p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
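A small sketch of evaluating such a mixture density, assuming SciPy is available; the two components and their parameter values are made up purely for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-component mixture in 2-D
pis    = np.array([0.7, 0.3])                           # mixing coefficients, sum to 1
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # means
sigmas = [np.eye(2), 0.5 * np.eye(2)]                   # covariances

def mixture_pdf(x, pis, mus, sigmas):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal(mu, S).pdf(x)
               for pi, mu, S in zip(pis, mus, sigmas))

print(mixture_pdf(np.array([1.0, 1.0]), pis, mus, sigmas))
```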
A Different Formulation
Given a mixture of Gaussians as the underlying probability distribution, each
sample is produced in two steps: first choosing one of the Gaussian mixture
components, and then sampling according to the selected component’s
probability density.

Intuition: We do not directly observe which


component was responsible for each sample – this
“responsible component” can be represented by a
latent (hidden) variable z
A Different Formulation
The latent variable z is a K-dimensional binary random variable
having a 1-of-K representation:

    z_k = \begin{cases} 1 & \text{for one particular element, } k = i \\ 0 & \text{for all other elements, } k \neq i \end{cases}

so that z_k \in \{0, 1\} and \sum_k z_k = 1.

We define: p(z_k = 1) = \pi_k


A Different Formulation
As z uses a 1-of-K representation we can write:

    p(z_k = 1) = \pi_k \;\rightarrow\; p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}

Similarly, for the conditional distribution of x given a particular value of z:

    p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
    \;\rightarrow\;
    p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}

Finally, the marginal distribution of x over all possible states of z:

    p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{x}, \mathbf{z}) = \sum_{\mathbf{z}} p(\mathbf{z})\, p(\mathbf{x} \mid \mathbf{z}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
A Different Formulation
Using Bayes' theorem:

    \gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x})
    = \frac{p(z_k = 1)\, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\mathbf{x} \mid z_j = 1)}
    = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

γ(zk) can be seen as the “responsibility” that component k takes for
“explaining” the observation x.
K-means vs Mixture of Gaussians

K-Means
• K-means is a (“hard”) classifier
• Parameters to estimate:
  – Means {μk}

Mixture of Gaussians
• A mixture of Gaussians is a probability model
• We can USE it as a “soft” classifier
• Parameters to estimate:
  – Means {μk}
  – Covariances {Σk}
  – Mixing coefficients {πk}
Maximum Likelihood
Goal: given a data set of observations {x(1), x(2), ..., x(N)} we wish to
model the data using a mixture of Gaussians

Intuition: maximise the (log) likelihood of the model parameters

    \ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}

Problem: unlike estimating the parameters of a single Gaussian, the presence
of singularities makes this not a well-posed problem
EXPECTATION MAXIMISATION FOR
GAUSSIAN MIXTURES
K-Means Algorithm Reminder
Initialization  Initialize K “means” μk, one for each class

Expectation     Minimize J with respect to the {rnk}, keeping the {μk} fixed

Maximisation    Minimize J with respect to the {μk}, keeping the {rnk} fixed

Terminate       Terminate if J has converged, otherwise go back to the E-step
Expectation Maximization (EM) for
Gaussian Mixtures

Initialization  Initialize the parameters of K Gaussians (μk, Σk, πk)

Expectation     Give each point x(n) an assignment score γ(znk) for each cluster k

Maximisation    Given the scores, adjust {μk, πk, Σk} for each cluster k

Terminate       Evaluate the log likelihood; stop when converged, else go back to the E-step
Expectation Maximization (EM) for
Gaussian Mixtures

Initialization  Initialize the parameters of K Gaussians (μk, Σk, πk)

Hint: we can use the K-means result to initialize:

    μk ← the K-means centroid of cluster k
    Σk ← cov(cluster(k))
    πk ← (# points in cluster k) / (total number of points)
Expectation Maximization (EM) for
Gaussian Mixtures

Expectation     Give each point x(n) an assignment score γ(znk) for each cluster k

    \gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

γ(znk) is called the “responsibility”: how much is Gaussian k responsible for
the point x(n)
Expectation Maximization (EM) for
Gaussian Mixtures

Maximisation    Given the scores, adjust {μk, πk, Σk} for each cluster k

    \boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}^{(n)},
    \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})
Expectation Maximization (EM) for
Gaussian Mixtures

Maximisation    Given the scores, adjust {μk, πk, Σk} for each cluster k

    \boldsymbol{\Sigma}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}} \right) \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{\text{new}} \right)^{\mathsf{T}},
    \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})
Expectation Maximization (EM) for
Gaussian Mixtures

Maximisation    Given the scores, adjust {μk, πk, Σk} for each cluster k

    \pi_k^{\text{new}} = \frac{N_k}{N},
    \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})
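A compact NumPy sketch of this EM loop (an illustrative implementation under simplifying assumptions; in particular there are no safeguards against the singularities mentioned earlier, and X is assumed to be an (N, d) data matrix):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Initialization: random means, identity covariances, uniform mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iters):
        # Weighted densities pi_k * N(x_n | mu_k, Sigma_k) under the current parameters
        dens = np.column_stack([pi[k] * multivariate_normal(mu[k], Sigma[k]).pdf(X)
                                for k in range(K)])            # (N, K)
        # Evaluate the log likelihood and check convergence
        ll = np.log(dens.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
        # E-step: responsibilities gamma(z_nk)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mu_k, Sigma_k, pi_k
        Nk = gamma.sum(axis=0)                                 # (K,)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return pi, mu, Sigma, gamma
```

For example, pi, mu, Sigma, gamma = em_gmm(X, 2) returns soft assignments gamma in place of the hard labels that K-means would give on the same data.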
THE GENERAL EXPECTATION
MAXIMISATION ALGORITHM
General EM Algorithm                (see sections 9.3, 9.4 for more details)

Initialization  Initialize the parameters θold

Expectation     Evaluate p(Z | X, θold)
                (Z: latent (hidden) variables, X: observed variables)

Maximisation    θnew = argmax_θ Q(θ, θold), where

    Q(\theta, \theta^{\text{old}}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \theta^{\text{old}})\, \ln p(\mathbf{X}, \mathbf{Z} \mid \theta)

Terminate       Evaluate the log likelihood; if it has converged, stop. Otherwise set
                θold ← θnew and go back to the E-step
Summary
• K-Means is an unsupervised method for data
clustering
• K-means yields “hard” cluster assignments
• Gaussian Mixture Models are useful for modelling
data with “soft” cluster assignments
• Expectation Maximization is a method used when
we have a model with latent (hidden) variables
• K-Means can be seen as a special case of the EM
algorithm (Σk = εI, ε → 0)
HIERARCHICAL CLUSTERING
Hierarchical Clustering
Produces a set of nested clusters organized as a
hierarchical tree
Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of
merges or splits

[Figure: a nested clustering of points 1–6 and the corresponding dendrogram]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendrogram at the proper level

[Figure: dendrogram (vertical axis: dissimilarity) cut at different levels to obtain K = 2, 3, 4, 5, or 6 clusters]

They may correspond to meaningful taxonomies


– Example in biological sciences (e.g., animal kingdom,
phylogeny reconstruction, …)
Hierarchical Clustering
• Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster
(or k clusters) left

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point
(or there are k clusters)

• Traditional hierarchical algorithms use a similarity or distance


matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique
The basic algorithm is straightforward:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
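As a usage sketch, this procedure (with the inter-cluster proximity definitions discussed on the following slides) is available in scipy.cluster.hierarchy, assuming SciPy and matplotlib are installed:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).normal(size=(30, 2))   # placeholder data

# method = 'single' (MIN), 'complete' (MAX), 'average' (group average) or 'ward'
Z = linkage(X, method="single")      # the sequence of merges

dendrogram(Z)                        # visualize the nested clusters
plt.show()

labels = fcluster(Z, t=4, criterion="maxclust")   # 'cut' the dendrogram into 4 clusters
```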
Agglomerative Clustering Algorithm

Key operation is the computation of the proximity of two


clusters
– Different approaches to defining the distance between
clusters distinguish the different algorithms
Starting Situation
Start with clusters of individual points and a
proximity matrix

[Figure: initial proximity matrix with one row and one column per point (p1, p2, p3, …)]
Intermediate Situation
After some merging steps, we have some clusters
[Figure: proximity matrix over the current clusters C1–C5 and the corresponding grouping of the points]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.

[Figure: proximity matrix over clusters C1–C5, with the entry for the closest pair, C2 and C5, highlighted]
After Merging
The question is “How do we update the proximity matrix?”
[Figure: after merging C2 and C5, the row and column for the new cluster C2 ∪ C5 in the proximity matrix are marked “?” and must be recomputed]
How to Define Inter-Cluster Similarity

[Figure: two clusters of points p1–p5 and their proximity matrix; each of the definitions below measures the proximity between the two clusters differently]

⚫ MIN
⚫ MAX
⚫ Group Average
⚫ Distance Between Centroids
⚫ Other methods driven by an objective function
  - Ward’s Method uses squared error
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar
(closest) points in the different clusters
– Determined by one pair of points, i.e., by one link in the
proximity graph.

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

[Figure: the corresponding proximity graph over points 1–5]
Hierarchical Clustering: MIN

[Figure: nested clusters produced by single-link (MIN) clustering of points 1–6 and the corresponding dendrogram]


Strength of MIN

[Figure: original points vs. the two clusters found by MIN (single link)]

• Can handle non-elliptical shapes

Limitations of MIN

[Figure: original points vs. the two clusters found by MIN (single link)]

• Sensitive to noise and outliers
Cluster Similarity: MAX or
Complete Linkage
Similarity of two clusters is based on the two least similar (most
distant) points in the different clusters
– Determined by all pairs of points in the two clusters

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

[Figure: the corresponding proximity graph over points 1–5]
Strength of MAX

[Figure: original points vs. the two clusters found by MAX (complete link)]

• Less susceptible to noise and outliers

Limitations of MAX

[Figure: original points vs. the two clusters found by MAX (complete link)]

• Tends to break large clusters

• Biased towards globular clusters
Cluster Similarity: Group Average
Proximity of two clusters is the average of pairwise proximity
between points in the two clusters.
- Need to use average connectivity for scalability since total proximity
favors large clusters

      I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00

[Figure: the corresponding proximity graph over points 1–5]
Hierarchical Clustering: Group Average
[Figure: nested clusters produced by group-average clustering of points 1–6 and the corresponding dendrogram]


Hierarchical Clustering: Group Average
Compromise between Single and Complete Link

Strengths
– Less susceptible to noise and outliers

Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
Similarity of two clusters is based on the increase in squared
error when two clusters are merged
– Similar to group average if distance between points is
distance squared
– Hierarchical analogue of K-means
• Can be used to initialize K-means
Cluster Similarity: Ward’s Method
Given a set of data points C, the sum of squared errors (ESS) is defined as

    \mathrm{ESS}(C) = \sum_{\mathbf{x} \in C} \lVert \mathbf{x} - \boldsymbol{\mu}_C \rVert^2,
    \qquad \boldsymbol{\mu}_C = \frac{1}{|C|} \sum_{\mathbf{x} \in C} \mathbf{x}

Join those clusters Ci and Cj to form cluster Ct with the minimum increase in
squared error.

[Figure: nested clusters produced by Ward’s method on points 1–6]
Cluster Similarity: Ward’s Method
Limitations
– O(N³) time in many cases
– There are N steps, and at each step the proximity matrix, of size O(N²),
must be updated and searched
– Complexity can be reduced to O(N2 log(N) ) time for some
approaches
Summary
• Once a decision is made to combine two clusters, it cannot be
undone
• No objective function is directly minimized
• Different schemes have problems with one or more of the
following:
• Sensitivity to noise and outliers
• Difficulty handling different sized clusters and convex
shapes
• Breaking large clusters
GRAPH BASED CLUSTERING

A Tutorial on Spectral Clustering, Arik Azran


Graph based Clustering
Spectral Clustering Example – 2 Spirals

[Figure: two interleaved spirals in the original 2-D space, and the same points in the embedded space given by the two leading eigenvectors]

The dataset exhibits complex cluster shapes. K-means performs very poorly in
this space due to its bias toward dense spherical clusters.

In the embedded space given by the two leading eigenvectors, the clusters are
trivial to separate.
Spectral Clustering - Derek Greene
[Slides: Matthias Hein and Ulrike von Luxburg]

Ng et al., On Spectral Clustering: Analysis and an Algorithm
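A compact sketch of normalized spectral clustering in the spirit of Ng et al. (an illustrative implementation; the k-nearest-neighbour affinity, the parameter values, and the use of scikit-learn's KMeans for the final step are choices made here, not taken from the slides):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters=2, n_neighbors=10):
    # 1. Affinity matrix W: symmetric k-nearest-neighbour graph
    D = cdist(X, X)
    idx = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]   # skip self (column 0)
    W = np.zeros_like(D)
    rows = np.repeat(np.arange(len(X)), n_neighbors)
    W[rows, idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                               # symmetrize
    # 2. Normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}
    deg = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt
    # 3. Embedding: eigenvectors of L for the smallest eigenvalues
    vals, vecs = np.linalg.eigh(L)
    U = vecs[:, :n_clusters]
    # 4. Row-normalize and run K-means in the embedded space
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(U)

# Two interleaved spirals, as in the example above
t = np.linspace(0.5, 3 * np.pi, 200)
s1 = np.column_stack([t * np.cos(t), t * np.sin(t)])
X = np.vstack([s1, -s1]) + np.random.default_rng(0).normal(scale=0.05, size=(400, 2))
labels = spectral_clustering(X, n_clusters=2)
```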
Cluster Validity
For supervised classification we have a variety of measures to
evaluate how good our model is
– Accuracy, precision, recall

For cluster analysis, the analogous question is: how do we evaluate the
“goodness” of the resulting clusters?

But “clusters are in the eye of the beholder”!

Then why do we want to evaluate them?


– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing
whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results,
e.g., to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without
reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to
determine which is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate


the entire clustering or just individual clusters.
Measures of Cluster Validity
• Numerical measures that are applied to judge various aspects of
cluster validity are classified into the following three types.
– External Index: Used to measure the extent to which cluster labels
match externally supplied class labels.
• Entropy
– Internal Index: Used to measure the goodness of a clustering
structure without respect to external information.
• Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or
clusters.
• Often an external or internal index is used for this function, e.g., SSE or
entropy
Measuring Cluster Validity Via Correlation
Two matrices
– Proximity Matrix
– “Incidence” Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belong to the same cluster
• An entry is 0 if the associated pair of points belongs to different clusters
• Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
• High correlation indicates that points that belong to the
same cluster are close to each other.

Not a good measure for some density-based clusters.
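A sketch of this measure (an illustrative implementation; proximity is taken here to be Euclidean distance, so a good clustering shows up as a strongly negative correlation, as in the example that follows):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def cluster_label_correlation(X, labels):
    """Correlation between the proximity matrix and the cluster incidence matrix."""
    labels = np.asarray(labels)
    n = len(X)
    proximity = squareform(pdist(X))                              # (n, n) distances
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    # Only the n(n-1)/2 entries above the diagonal are needed (symmetric matrices)
    iu = np.triu_indices(n, k=1)
    return np.corrcoef(proximity[iu], incidence[iu])[0, 1]

# Usage: corr = cluster_label_correlation(X, labels)  with labels from e.g. K-means
```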


Measuring Cluster Validity Via Correlation

• Correlation of incidence and proximity matrices for the K-means clusterings
of the following two data sets.

[Figure: scatter plots of the two data sets with their K-means clusterings; Corr = -0.9235 (left), Corr = -0.5810 (right)]


Using Similarity Matrix for Cluster
Validation
Order the similarity matrix with respect to cluster labels and
inspect visually.

[Figure: the data points and the corresponding similarity matrix, ordered by cluster label]
Using Similarity Matrix for Cluster Validation

[Figure: similarity matrix, ordered by cluster label, for the clusters found by DBSCAN]
Internal Measures: Cohesion and
Separation
Cluster Cohesion: measures how closely related the objects in a cluster are.
Example: SSE

Cluster Separation: measures how distinct or well-separated a cluster is from
the other clusters. Example: squared error

– Cohesion is measured by the within-cluster sum of squares (SSE):

    \mathrm{WSS} = \sum_{i} \sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{m}_i)^2

– Separation is measured by the between-cluster sum of squares:

    \mathrm{BSS} = \sum_{i} |C_i|\, (\mathbf{m} - \mathbf{m}_i)^2

where m is the overall mean of the data and m_i is the mean of cluster C_i.

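A small sketch of these two quantities (an illustrative implementation; X is the data matrix and labels a hard cluster assignment such as the one produced by K-means):

```python
import numpy as np

def cohesion_separation(X, labels):
    """Within-cluster (WSS) and between-cluster (BSS) sums of squares."""
    labels = np.asarray(labels)
    m = X.mean(axis=0)                           # overall mean
    wss = bss = 0.0
    for k in np.unique(labels):
        Ck = X[labels == k]
        mk = Ck.mean(axis=0)                     # cluster mean m_i
        wss += np.sum((Ck - mk) ** 2)            # cohesion
        bss += len(Ck) * np.sum((m - mk) ** 2)   # separation
    return wss, bss

# Note: WSS + BSS equals the total sum of squares, so minimizing SSE (= WSS)
# is equivalent to maximizing BSS.
```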