Ramon Baldrich
Universitat Autònoma de Barcelona
Another ML introduction:
Unsupervised learning
Types of Clusterings
A clustering is a set of clusters
Important distinction between hierarchical
and partitional sets of clusters
• Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
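The distinction can be made concrete in code. A minimal sketch, assuming scikit-learn (not part of these slides), that computes one flat partition and one hierarchy-derived labelling of the same data:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.random.RandomState(0).randn(100, 2)

# Partitional: each data object lands in exactly one of the K subsets.
flat_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: a tree of nested clusters, flattened here to 3 clusters.
tree_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
```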
Partitional Clustering
Hierarchical Clustering
[Figure: the same points p1–p4 grouped by a traditional hierarchical clustering with its traditional dendrogram, and by a non-traditional hierarchical clustering with its non-traditional dendrogram]
K-MEANS CLUSTERING ALGORITHM
K-Means Algorithm

Step 1: Initialize K "means" μ_k, one for each class (K = 2 in the running example).
E Step: Minimize J with respect to the {r_nk}, keeping the {μ_k} fixed, where

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left\| \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \right\|^2$$

Observe:
• J is a linear function of r_nk → closed-form solution
• The terms involving different n are independent → we can optimise for each n separately by choosing r_nk to be 1 for whichever value of k gives the minimum value of ‖x^(n) − μ_k‖²

$$r_{nk} = \begin{cases} 1 & \text{if } k = \underset{j}{\operatorname{argmin}} \left\| \mathbf{x}^{(n)} - \boldsymbol{\mu}_j \right\|^2 \\ 0 & \text{otherwise} \end{cases}$$

M Step: Minimize J with respect to the {μ_k}, keeping the {r_nk} fixed.

Observe:
• J is a quadratic function of μ_k → minimise by setting the derivative with respect to μ_k to zero

$$\frac{\partial J}{\partial \boldsymbol{\mu}_k} = 2 \sum_{n=1}^{N} r_{nk} \left( \mathbf{x}^{(n)} - \boldsymbol{\mu}_k \right) = 0 \quad\Rightarrow\quad \boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\, \mathbf{x}^{(n)}}{\sum_n r_{nk}}$$
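The two alternating steps above fit in a short NumPy sketch (variable and function names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # Step 1: initialize K means
    for _ in range(n_iters):
        # E step: set r_nk = 1 for the closest mean (minimizes J over r_nk)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # ||x^(n) - mu_k||^2
        r = np.argmin(d2, axis=1)
        # M step: mu_k = mean of the points assigned to k (sets dJ/dmu_k = 0)
        mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                       for k in range(K)])
    return mu, r
```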
K-Means Algorithm Revisited

Step 1 (Initialization): Initialize K "means" μ_k, one for each class.
Step 2 (Expectation): Assign each point to the closest mean μ_k, i.e., minimize J with respect to the {r_nk} keeping the {μ_k} fixed:

$$\mathrm{SSE} = \sum_{k=1}^{K} \sum_{\mathbf{x}:\, r_{nk}=1} \mathrm{dist}(\mathbf{x}, \boldsymbol{\mu}_k)^2$$
[Figure: K-means point assignments and centroid positions over Iterations 1–4]
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others have
only one.
[Figure: K-means on the ten-cluster data over Iterations 1–4]
Solutions to the Initial Centroids Problem
• Multiple runs
– Helps, but probability is not on your side
• Sample and use hierarchical clustering to determine initial
centroids
• Select more than k initial centroids and then select among
these initial centroids
– Select the most widely separated (see the sketch after this list)
• Postprocessing
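A hedged sketch of the "more than k, most widely separated" option above, as a greedy farthest-point choice (the function name and oversampling factor are illustrative):

```python
import numpy as np

def spread_out_centroids(X, k, oversample=5, seed=0):
    rng = np.random.default_rng(seed)
    cand = X[rng.choice(len(X), size=k * oversample, replace=False)]
    chosen = [cand[0]]
    while len(chosen) < k:
        # distance from each candidate to its nearest already-chosen centroid
        d = ((cand[:, None, :] - np.array(chosen)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        chosen.append(cand[np.argmax(d)])  # keep the most widely separated one
    return np.array(chosen)
```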
Pre-processing and Post-processing
• Pre-processing
– Normalize the data
– Eliminate outliers
• Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high
SSE
– Merge clusters that are ‘close’ and that have relatively
low SSE
– Can use these steps during the clustering process
• e.g., ISODATA (see the sketch below)
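A minimal sketch of one post-processing step from the list above, eliminating small clusters that may represent outliers (the helper name and the -1 outlier label are illustrative assumptions):

```python
import numpy as np

def drop_small_clusters(labels, min_size):
    labels = labels.copy()
    for k in np.unique(labels):
        if np.sum(labels == k) < min_size:
            labels[labels == k] = -1  # mark members of tiny clusters as outliers
    return labels
```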
Choosing the number of Clusters
What is the right number (K) of classes?
Choosing the number of Clusters

[Figure: cost function J versus the number of clusters K (1–6); the bend in the curve is the "elbow"]
The first clusters will add much information (explain a lot of variance / reduce the value of
the cost function a lot), but at some point the marginal gain will drop, giving
an angle in the graph. The number of clusters is chosen at this point, hence the
"elbow criterion". This "elbow" cannot always be unambiguously identified.
What will k-means do?
[Figures: a sequence of example datasets on which this question is posed]
Choosing the number of Clusters

[Figure: cost-function curves J versus K for the preceding examples]
Limitations of K-means
K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes
Limitations of K-means: Differing Sizes
Limitations of K-means: Differing Density
Limitations of K-means: Non-globular Shapes
Limitations of K-means
• A point can only belong to one class (hard classification)
• All clusters have the same "shape" (the only parameters are the means μ_k)
How do we go about this?
GAUSSIAN MIXTURE MODELS
A Gaussian Density

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$$

with mean μ and covariance Σ.
A Gaussian Mixture

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}) + \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma})$$

A Gaussian Mixture: Different Variances

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$$
A Gaussian Mixture: Mixing Coefficients

$$p(\mathbf{x}) = \pi_1\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1) + \pi_2\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2), \qquad \sum_{k=1}^{K} \pi_k = 1$$

[Figure: two components with mixing coefficients π = 0.7 and π = 0.3]
Gaussian Mixture Distribution

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
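A minimal sketch of evaluating this mixture density, assuming SciPy; the parameter values are illustrative only:

```python
import numpy as np
from scipy.stats import multivariate_normal

pis = [0.7, 0.3]                                  # mixing coefficients, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def mixture_pdf(x):
    # p(x) = sum_k pi_k N(x | mu_k, Sigma_k)
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))
```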
A Different Formulation
Given a mixture of Gaussians as the underlying probability
distribution, each sample is produced in two steps: first
choosing one of the Gaussian mixture components, and then
drawing the sample according to the selected component's probability density.
$$z_k \in \{0, 1\}, \qquad \sum_{k=1}^{K} z_k = 1$$

$$p(z_k = 1) = \pi_k \;\rightarrow\; p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}$$

$$p(\mathbf{x} \mid z_k = 1) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \;\rightarrow\; p(\mathbf{x} \mid \mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}$$
A Different Formulation
Using Bayes' theorem:

$$\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x}) = \frac{p(z_k = 1)\, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\mathbf{x} \mid z_j = 1)} = \frac{\pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$
The log likelihood of the data is

$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right) \right)$$

Intuitively, for each cluster k:
μ_k ← mean(cluster(k))
Σ_k ← cov(cluster(k))
π_k ← (# points in cluster k) / (total number of points)
Expectation Maximization (EM) for Gaussian Mixtures

Initialization: Initialize the parameters of the K Gaussians (μ_k, Σ_k, π_k).

E step: Evaluate the responsibilities with the current parameters:

$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}\!\left(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}\!\left(\mathbf{x}^{(n)} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j\right)}$$

M step: Re-estimate the parameters with the current responsibilities, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$:

$$\boldsymbol{\mu}_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}^{(n)}$$

$$\boldsymbol{\Sigma}_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{new}\right) \left(\mathbf{x}^{(n)} - \boldsymbol{\mu}_k^{new}\right)^{T}$$

$$\pi_k^{new} = \frac{N_k}{N}$$

Iterate the E and M steps until convergence, where

$$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{old}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{old})\, \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$$
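The whole loop fits in a compact NumPy/SciPy sketch (illustrative names, deliberately naive initialization, and a small ridge on the covariances for numerical safety):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, size=K, replace=False)]            # means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)                                # mixing coefficients
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)        # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate mu_k, Sigma_k, pi_k
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma, gamma
```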
[Figure: example dendrogram over points 1, 3, 2, 5, 4, 6 with a dissimilarity axis]
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by
'cutting' the dendrogram at the proper level (see the sketch below)
[Figure: cutting the same dendrogram at different dissimilarity levels yields K = 2, 3, 4, 5, or 6 clusters]
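A minimal SciPy sketch of this: build the hierarchy once, then read off any desired K by cutting:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).randn(20, 2)
Z = linkage(X, method='average')   # build the full dendrogram once

# Any number of clusters comes from the same tree, with no re-clustering:
labels_k2 = fcluster(Z, t=2, criterion='maxclust')
labels_k4 = fcluster(Z, t=4, criterion='maxclust')
```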
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or
there are k clusters)
[Figure: starting proximity matrix over the individual points p1, p2, …, p12]
Intermediate Situation
After some merging steps, we have some clusters
[Figure: current clusters C1–C5 and their proximity matrix]
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
[Figure: clusters C1–C5 with the closest pair, C2 and C5, highlighted in the proximity matrix]
After Merging
The question is “How do we update the proximity matrix?”
[Figure: after merging, the proximities between the new cluster C2 ∪ C5 and C1, C3, C4 are unknown ("?") in the updated matrix]
How to Define Inter-Cluster Similarity

[Figure: two clusters over points p1–p5 and their proximity matrix; each option below highlights different entries]

⚫ MIN
⚫ MAX
⚫ Group Average
⚫ Distance Between Centroids
⚫ Other methods driven by an objective
function
- Ward's Method uses squared error
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar
(closest) points in the different clusters
– Determined by one pair of points, i.e., by one link in the
proximity graph.
     I1   I2   I3   I4   I5
I1  1.00 0.90 0.10 0.65 0.20
I2  0.90 1.00 0.70 0.60 0.50
I3  0.10 0.70 1.00 0.40 0.30
I4  0.65 0.60 0.40 1.00 0.80
I5  0.20 0.50 0.30 0.80 1.00
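A minimal sketch of single link on this table, assuming SciPy and converting similarities to dissimilarities as 1 − s (an assumption, not from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

S = np.array([[1.00, 0.90, 0.10, 0.65, 0.20],
              [0.90, 1.00, 0.70, 0.60, 0.50],
              [0.10, 0.70, 1.00, 0.40, 0.30],
              [0.65, 0.60, 0.40, 1.00, 0.80],
              [0.20, 0.50, 0.30, 0.80, 1.00]])
D = 1.0 - S                                  # dissimilarities; diagonal becomes 0
Z = linkage(squareform(D), method='single')  # MIN / single-link merges
```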
Hierarchical Clustering: MIN
[Figure: nested single-link (MIN) clusters over points 1–6 and the corresponding dendrogram]
Strength of MAX
[Figure: complete-link (MAX) clustering on the same five-point similarity table]
Hierarchical Clustering: Group Average
[Figure: group-average clusters over points 1–6 and the corresponding dendrogram]
Strengths
– Less susceptible to noise and outliers
Limitations
– Biased towards globular clusters
Cluster Similarity: Ward’s Method
Similarity of two clusters is based on the increase in squared
error when two clusters are merged
– Similar to group average if distance between points is
distance squared
– Hierarchical analogue of K-means
• Can be used to initialize K-means
Cluster Similarity: Ward's Method
Given a set of data points C, the sum of squared errors (ESS) is
defined as:

$$\mathrm{ESS}(C) = \sum_{\mathbf{x} \in C} \left\| \mathbf{x} - \boldsymbol{\mu}_C \right\|^2, \qquad \boldsymbol{\mu}_C = \frac{1}{|C|} \sum_{\mathbf{x} \in C} \mathbf{x}$$
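A minimal NumPy sketch of the criterion: merge the pair of clusters whose union increases ESS the least (function names are illustrative):

```python
import numpy as np

def ess(C):
    # sum of squared errors of one cluster (rows are points)
    return ((C - C.mean(axis=0)) ** 2).sum()

def ward_increase(A, B):
    # increase in squared error if clusters A and B are merged
    return ess(np.vstack([A, B])) - ess(A) - ess(B)
```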
Spectral Clustering Example – 2 Spirals
[Figure: two interleaved spirals; K-means performs poorly in this space due to its bias toward dense spherical clusters]
[Figure: points scattered in the unit square and their 100×100 pairwise similarity matrix, with a Similarity colorbar]
Using Similarity Matrix for Cluster Validation
[Figure: similarity matrix, sorted by cluster label, for the clusters found by DBSCAN]
Internal Measures: Cohesion and Separation
Cluster Cohesion: Measures how closely related the objects in a
cluster are. Example: SSE (see the sketch below)
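A minimal NumPy sketch of SSE as a cohesion measure (lower within-cluster SSE means higher cohesion; the helper name is illustrative):

```python
import numpy as np

def total_sse(X, labels):
    # sum of squared distances of points to their own cluster mean
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))
```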