
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki (1)    Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 13: Representative-based Clustering

Representative-based Clustering
Given a dataset D = {x_i}_{i=1}^n with n points in a d-dimensional space, and given the desired number of clusters k, the goal of representative-based clustering is to partition the dataset into k groups or clusters, called a clustering and denoted C = {C_1, C_2, ..., C_k}.

For each cluster C_i there exists a representative point that summarizes the cluster, a common choice being the mean (also called the centroid) µ_i of all points in the cluster, that is,

$$\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$$

where n_i = |C_i| is the number of points in cluster C_i.


A brute-force or exhaustive algorithm for finding a good clustering is simply to generate all possible partitions of the n points into k clusters, evaluate some optimization score for each of them, and retain the clustering that yields the best score. However, this is clearly infeasible, since there are O(k^n / k!) clusterings of n points into k groups.
K-means Algorithm: Objective

The sum of squared errors (SSE) scoring function is defined as

$$SSE(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$$

The goal is to find the clustering that minimizes the SSE score:

$$\mathcal{C}^* = \arg\min_{\mathcal{C}} \{SSE(\mathcal{C})\}$$

K-means employs a greedy iterative approach to find a clustering that minimizes the SSE objective. As such, it may converge to a local optimum instead of the globally optimal clustering.
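As a concrete illustration of the objective, the following minimal NumPy sketch (the function name and toy data are hypothetical, not from the book) evaluates the SSE score for a given assignment of points to clusters:

import numpy as np

def sse(D, labels, means):
    """Sum of squared errors: for each cluster, add the squared
    distances from its points to the cluster mean."""
    total = 0.0
    for i, mu in enumerate(means):
        pts = D[labels == i]                # points assigned to cluster i
        total += np.sum((pts - mu) ** 2)    # squared Euclidean distances
    return total

# toy example: five 2D points partitioned into two clusters
D = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 0, 1, 1])
means = np.array([D[labels == 0].mean(axis=0), D[labels == 1].mean(axis=0)])
print(sse(D, labels, means))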

K-means Algorithm: Objective

K-means initializes the cluster means by randomly generating k points in the data
space. Each iteration of K-means consists of two steps: (1) cluster assignment,
and (2) centroid update.
Given the k cluster means, in the cluster assignment step, each point x_j ∈ D is assigned to the closest mean, which induces a clustering, with each cluster C_i comprising points that are closer to µ_i than to any other cluster mean. That is, each point x_j is assigned to cluster C_{i*}, where

$$i^* = \arg\min_{i=1}^{k} \left\{ \|x_j - \mu_i\|^2 \right\}$$

Given a set of clusters C_i, i = 1, ..., k, in the centroid update step new mean values are computed for each cluster from the points in C_i.

The cluster assignment and centroid update steps are carried out iteratively until we reach a fixed point or local minimum.

K-Means Algorithm

K-means (D, k, ε):
1   t ← 0
2   Randomly initialize k centroids: µ_1^t, µ_2^t, ..., µ_k^t ∈ R^d
3   repeat
4       t ← t + 1
5       C_i ← ∅ for all i = 1, ..., k
        // Cluster assignment step
6       foreach x_j ∈ D do
7           i* ← arg min_i { ||x_j − µ_i^{t−1}||² }   // assign x_j to the closest centroid
8           C_{i*} ← C_{i*} ∪ {x_j}
        // Centroid update step
9       foreach i = 1 to k do
10          µ_i^t ← (1 / |C_i|) Σ_{x_j ∈ C_i} x_j
11  until Σ_{i=1}^{k} ||µ_i^t − µ_i^{t−1}||² ≤ ε
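The following is a minimal NumPy sketch of this procedure (the function name is illustrative, and the centroids are initialized from randomly chosen data points rather than random points in the data space as in the pseudocode above):

import numpy as np

def kmeans(D, k, eps=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    # initialize centroids by picking k distinct data points at random
    mu = D[rng.choice(len(D), size=k, replace=False)].astype(float)
    while True:
        # cluster assignment step: index of the closest centroid for each point
        dists = ((D[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # shape (n, k)
        labels = dists.argmin(axis=1)
        # centroid update step: mean of the points assigned to each cluster
        new_mu = np.array([D[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        if ((new_mu - mu) ** 2).sum() <= eps:                         # convergence test
            return new_mu, labels
        mu = new_mu

# usage on the one-dimensional example shown on the next slide
D = np.array([[2.], [3.], [4.], [10.], [11.], [12.], [20.], [25.], [30.]])
centroids, labels = kmeans(D, k=2, rng=0)
print(centroids.ravel(), labels)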

K-means in One Dimension

[Figure: K-means with k = 2 on the one-dimensional dataset {2, 3, 4, 10, 11, 12, 20, 25, 30}.
(a) Initial dataset
(b) Iteration t = 1: µ1 = 2, µ2 = 4
(c) Iteration t = 2: µ1 = 2.5, µ2 = 16
(d) Iteration t = 3: µ1 = 3, µ2 = 18
(e) Iteration t = 4: µ1 = 4.75, µ2 = 19.60
(f) Iteration t = 5 (converged): µ1 = 7, µ2 = 25]
K-means in 2D: Iris Principal Components

[Figure: K-means on the Iris dataset projected onto the first two principal components (u1, u2).
(a) Random initialization: t = 0
(b) Iteration: t = 1
(c) Iteration: t = 8 (converged)]
Kernel K-means

In K-means, the separating boundary between clusters is linear. Kernel K-means allows one to extract nonlinear boundaries between clusters via the kernel trick, i.e., we show that all operations involve only kernel values between pairs of points.

Let x_i ∈ D be mapped to φ(x_i) in feature space, and let K = {K(x_i, x_j)}_{i,j=1,...,n} denote the n × n symmetric kernel matrix, where K(x_i, x_j) = φ(x_i)^T φ(x_j).

The cluster means in feature space are {µ_1^φ, ..., µ_k^φ}, where µ_i^φ = (1/n_i) Σ_{x_j ∈ C_i} φ(x_j).

The sum of squared errors in feature space is

$$\min_{\mathcal{C}} SSE(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \left\|\phi(x_j) - \mu_i^\phi\right\|^2 = \sum_{j=1}^{n} K(x_j, x_j) - \sum_{i=1}^{k} \frac{1}{n_i} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b)$$

Thus, the kernel K-means SSE objective function can be expressed purely in terms of the kernel function.

Kernel K-means: Cluster Reassignment

Consider the distance of a point φ(x_j) to the mean µ_i^φ in feature space:

$$\left\|\phi(x_j) - \mu_i^\phi\right\|^2 = \|\phi(x_j)\|^2 - 2\,\phi(x_j)^T \mu_i^\phi + \left\|\mu_i^\phi\right\|^2 = K(x_j, x_j) - \frac{2}{n_i} \sum_{x_a \in C_i} K(x_a, x_j) + \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b)$$

Kernel K-means assigns a point to the closest cluster mean as follows:

$$C^*(x_j) = \arg\min_i \left\{ \left\|\phi(x_j) - \mu_i^\phi\right\|^2 \right\} = \arg\min_i \left\{ \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b) - \frac{2}{n_i} \sum_{x_a \in C_i} K(x_a, x_j) \right\}$$

where the K(x_j, x_j) term is dropped since it is the same for every cluster.

Kernel-Kmeans Algorithm
Kernel-Kmeans (K, k, ε):
1   t ← 0
2   C^t ← {C_1^t, ..., C_k^t}   // randomly partition the points into k clusters
3   repeat
4       t ← t + 1
5       foreach C_i ∈ C^{t−1} do
            // squared norm of the cluster mean in feature space
6           sqnorm_i ← (1 / n_i²) Σ_{x_a ∈ C_i} Σ_{x_b ∈ C_i} K(x_a, x_b)
7       foreach x_j ∈ D do
8           foreach C_i ∈ C^{t−1} do
                // average kernel value between x_j and cluster C_i
9               avg_{ji} ← (1 / n_i) Σ_{x_a ∈ C_i} K(x_a, x_j)
        // Find the closest cluster for each point
10      C_i^t ← ∅ for all i = 1, ..., k
11      foreach x_j ∈ D do
12          foreach C_i ∈ C^{t−1} do
13              d(x_j, C_i) ← sqnorm_i − 2 · avg_{ji}
14          i* ← arg min_i { d(x_j, C_i) }
15          C_{i*}^t ← C_{i*}^t ∪ {x_j}   // cluster reassignment
16      C^t ← {C_1^t, ..., C_k^t}
17  until 1 − (1/n) Σ_{i=1}^{k} |C_i^t ∩ C_i^{t−1}| ≤ ε
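Below is a compact NumPy sketch of kernel K-means following these update rules (the function name is illustrative); it works directly on a precomputed kernel matrix K and reassigns each point using the sqnorm and avg terms from the pseudocode. The convergence test is the fraction of points that changed cluster, as in line 17.

import numpy as np

def kernel_kmeans(K, k, eps=1e-6, max_iter=100, rng=None):
    """Kernel K-means on a precomputed n x n kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(rng)
    labels = rng.integers(0, k, size=n)           # random initial partition
    for _ in range(max_iter):
        d = np.zeros((n, k))
        for i in range(k):
            members = labels == i
            ni = max(int(members.sum()), 1)
            # sqnorm_i: squared norm of the cluster mean in feature space
            sqnorm = K[np.ix_(members, members)].sum() / ni ** 2
            # avg_ji: average kernel value between each point and cluster i
            avg = K[:, members].sum(axis=1) / ni
            d[:, i] = sqnorm - 2 * avg
        new_labels = d.argmin(axis=1)             # cluster reassignment
        changed = np.mean(new_labels != labels)   # fraction of points that moved
        labels = new_labels
        if changed <= eps:
            break
    return labels

# usage with a Gaussian kernel (sigma = 1.5, as in the example that follows)
X = np.random.default_rng(0).normal(size=(60, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / (2 * 1.5 ** 2))
print(kernel_kmeans(K, k=2, rng=0))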
K-Means or Kernel K-means with Linear Kernel

Using the linear kernel K(x_i, x_j) = x_i^T x_j is equivalent to the K-means algorithm.


[Figure: Clustering of the two-dimensional dataset obtained with the linear kernel.
(a) Linear kernel: t = 5 iterations]
Kernel K-means: Gaussian Kernel
Using the Gaussian kernel K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) ), with σ = 1.5.
[Figure: Clustering of the same dataset obtained with the Gaussian kernel; the cluster boundaries are nonlinear.
(b) Gaussian kernel: t = 4 iterations]
Expectation-Maximization Clustering
Gaussian Mixture Model

Let X_a denote the random variable corresponding to the a-th attribute, and let X = (X_1, X_2, ..., X_d) denote the vector random variable across the d attributes, with x_j being a data sample from X.

We assume that each cluster C_i is characterized by a multivariate normal distribution

$$f_i(x) = f(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left\{ -\frac{(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}{2} \right\}$$

where the cluster mean µ_i ∈ R^d and covariance matrix Σ_i ∈ R^{d×d} are both unknown parameters.

The probability density function of X is given as a Gaussian mixture model over all the k clusters:

$$f(x) = \sum_{i=1}^{k} f_i(x)\, P(C_i) = \sum_{i=1}^{k} f(x \mid \mu_i, \Sigma_i)\, P(C_i)$$

where the prior probabilities P(C_i) are called the mixture parameters, which must satisfy the condition Σ_{i=1}^{k} P(C_i) = 1.
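As a minimal sketch of evaluating this mixture density (the component parameters below are made up, and SciPy's multivariate normal is used for the component densities):

import numpy as np
from scipy.stats import multivariate_normal

# hypothetical 2-component mixture in 2 dimensions
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]
priors = [0.6, 0.4]                                   # must sum to 1

def mixture_pdf(x):
    """f(x) = sum_i f(x | mu_i, Sigma_i) P(C_i)"""
    return sum(p * multivariate_normal(mu, S).pdf(x)
               for mu, S, p in zip(mus, Sigmas, priors))

print(mixture_pdf(np.array([1.0, 1.0])))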
Expectation-Maximization Clustering
Maximum Likelihood Estimation

We write the set of all the model parameters compactly as

$$\theta = \{\mu_1, \Sigma_1, P(C_1), \dots, \mu_k, \Sigma_k, P(C_k)\}$$

Given the dataset D, we define the likelihood of θ as the conditional probability of the data D given the model parameters θ:

$$P(D \mid \theta) = \prod_{j=1}^{n} f(x_j)$$

The goal of maximum likelihood estimation (MLE) is to choose the parameters θ that maximize the likelihood. We do this by maximizing the log of the likelihood function:

$$\theta^* = \arg\max_{\theta} \{\ln P(D \mid \theta)\}$$

where the log-likelihood function is given as

$$\ln P(D \mid \theta) = \sum_{j=1}^{n} \ln f(x_j) = \sum_{j=1}^{n} \ln\left( \sum_{i=1}^{k} f(x_j \mid \mu_i, \Sigma_i)\, P(C_i) \right)$$

Expectation-Maximization Clustering

Directly maximizing the log-likelihood over θ is hard. Instead, we can use the
expectation-maximization (EM) approach for finding the maximum likelihood
estimates for the parameters θ.
EM is a two-step iterative approach that starts from an initial guess for the parameters θ. Given the current estimates for θ, in the expectation step EM computes the cluster posterior probabilities P(C_i | x_j) via Bayes' theorem:

$$P(C_i \mid x_j) = \frac{P(C_i \text{ and } x_j)}{P(x_j)} = \frac{P(x_j \mid C_i)\, P(C_i)}{\sum_{a=1}^{k} P(x_j \mid C_a)\, P(C_a)} = \frac{f_i(x_j) \cdot P(C_i)}{\sum_{a=1}^{k} f_a(x_j) \cdot P(C_a)}$$

In the maximization step, using the weights P(Ci |x j ) EM re-estimates θ, that is,
it re-estimates the parameters µi , Σi , and P(Ci ) for each cluster Ci . The
re-estimated mean is given as the weighted average of all the points, the
re-estimated covariance matrix is given as the weighted covariance over all pairs of
dimensions, and the re-estimated prior probability for each cluster is given as the
fraction of weights that contribute to that cluster.

EM in One Dimension: Expectation Step

Let D comprise a single attribute X, with each point x_j ∈ R a random sample from X. For the mixture model, we use univariate normals for each cluster:

$$f_i(x) = f(x \mid \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left\{ -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right\}$$

with the cluster parameters µ_i, σ_i², and P(C_i).

Initialization: For each cluster C_i, with i = 1, 2, ..., k, we can randomly initialize the cluster parameters µ_i, σ_i², and P(C_i).

Expectation Step: Given the mean µ_i, variance σ_i², and prior probability P(C_i) for each cluster, the cluster posterior probability is computed as

$$w_{ij} = P(C_i \mid x_j) = \frac{f(x_j \mid \mu_i, \sigma_i^2) \cdot P(C_i)}{\sum_{a=1}^{k} f(x_j \mid \mu_a, \sigma_a^2) \cdot P(C_a)}$$

EM in One Dimension: Maximization Step

Given the w_ij values, the re-estimated cluster mean is

$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij} \cdot x_j}{\sum_{j=1}^{n} w_{ij}}$$

The re-estimated value of the cluster variance is computed as the weighted variance across all the points:

$$\sigma_i^2 = \frac{\sum_{j=1}^{n} w_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n} w_{ij}}$$

The prior probability of cluster C_i is re-estimated as

$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n}$$
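Putting the expectation and maximization steps together, here is a minimal NumPy sketch of EM for a one-dimensional Gaussian mixture (the function name, initialization, and fixed iteration count are illustrative choices):

import numpy as np

def em_1d(x, k, n_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    n = len(x)
    mu = rng.choice(x, size=k, replace=False)   # random initial means
    var = np.full(k, x.var())                   # initial variances
    prior = np.full(k, 1.0 / k)                 # equal initial priors
    for _ in range(n_iter):
        # expectation step: posterior w[i, j] = P(C_i | x_j)
        dens = (np.exp(-(x - mu[:, None]) ** 2 / (2 * var[:, None]))
                / np.sqrt(2 * np.pi * var[:, None]))
        w = prior[:, None] * dens
        w /= w.sum(axis=0)
        # maximization step: weighted mean, weighted variance, and prior
        wsum = w.sum(axis=1)
        mu = (w * x).sum(axis=1) / wsum
        var = (w * (x - mu[:, None]) ** 2).sum(axis=1) / wsum
        prior = wsum / n
    return mu, var, prior

# usage on the one-dimensional dataset from the K-means example
x = np.array([2., 3., 4., 10., 11., 12., 20., 25., 30.])
print(em_1d(x, k=2, rng=0))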

EM in One Dimension

[Figure: EM clustering in one dimension with k = 2, showing the two cluster densities at successive iterations.
(a) Initialization (t = 0): µ1 = 6.63, µ2 = 7.57
(b) Iteration t = 1: µ1 = 3.72, µ2 = 7.4
(c) Iteration t = 5 (converged): µ1 = 2.48, µ2 = 7.56]
EM in d Dimensions

For each cluster C_i we have to re-estimate the d × d covariance matrix:

$$\Sigma_i = \begin{pmatrix} (\sigma_1^i)^2 & \sigma_{12}^i & \cdots & \sigma_{1d}^i \\ \sigma_{21}^i & (\sigma_2^i)^2 & \cdots & \sigma_{2d}^i \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1}^i & \sigma_{d2}^i & \cdots & (\sigma_d^i)^2 \end{pmatrix}$$

This requires O(d²) parameters, which may be too many for reliable estimation. A simplification is to assume that all dimensions are independent, which leads to a diagonal covariance matrix:

$$\Sigma_i = \begin{pmatrix} (\sigma_1^i)^2 & 0 & \cdots & 0 \\ 0 & (\sigma_2^i)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (\sigma_d^i)^2 \end{pmatrix}$$

EM in d Dimensions
Expectation Step: Given µ_i, Σ_i, and P(C_i), the posterior probability is given as

$$w_{ij} = P(C_i \mid x_j) = \frac{f_i(x_j) \cdot P(C_i)}{\sum_{a=1}^{k} f_a(x_j) \cdot P(C_a)}$$

Maximization Step: Given the weights w_ij, we re-estimate Σ_i, µ_i, and P(C_i).

The mean µ_i for cluster C_i is re-estimated as

$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij} \cdot x_j}{\sum_{j=1}^{n} w_{ij}}$$

The covariance matrix Σ_i is re-estimated via the outer-product form

$$\Sigma_i = \frac{\sum_{j=1}^{n} w_{ij} (x_j - \mu_i)(x_j - \mu_i)^T}{\sum_{j=1}^{n} w_{ij}}$$

The prior probability P(C_i) for each cluster is re-estimated as

$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n}$$
Expectation-Maximization Clustering Algorithm
Expectation-Maximization (D, k, ε):
1   t ← 0
2   Randomly initialize µ_1^t, ..., µ_k^t
3   Σ_i^t ← I and P^t(C_i) ← 1/k, for all i = 1, ..., k
4   repeat
5       t ← t + 1
        // Expectation step
6       for i = 1, ..., k and j = 1, ..., n do
7           w_ij ← f(x_j | µ_i, Σ_i) · P(C_i) / Σ_{a=1}^{k} f(x_j | µ_a, Σ_a) · P(C_a)   // posterior P^t(C_i | x_j)
        // Maximization step
8       for i = 1, ..., k do
9           µ_i^t ← Σ_{j=1}^{n} w_ij · x_j / Σ_{j=1}^{n} w_ij   // re-estimate mean
10          Σ_i^t ← Σ_{j=1}^{n} w_ij (x_j − µ_i^t)(x_j − µ_i^t)^T / Σ_{j=1}^{n} w_ij   // re-estimate covariance matrix
11          P^t(C_i) ← Σ_{j=1}^{n} w_ij / n   // re-estimate priors
12  until Σ_{i=1}^{k} ||µ_i^t − µ_i^{t−1}||² ≤ ε
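The following NumPy/SciPy sketch mirrors the algorithm above for the multivariate case (function names are illustrative, and a small ridge is added to the covariances purely for numerical stability):

import numpy as np
from scipy.stats import multivariate_normal

def em_cluster(D, k, eps=1e-6, max_iter=200, rng=None):
    rng = np.random.default_rng(rng)
    n, d = D.shape
    mu = D[rng.choice(n, size=k, replace=False)].astype(float)   # random initial means
    Sigma = np.array([np.eye(d) for _ in range(k)])
    prior = np.full(k, 1.0 / k)
    for _ in range(max_iter):
        # expectation step: w[i, j] = P(C_i | x_j)
        dens = np.array([multivariate_normal(mu[i], Sigma[i]).pdf(D) for i in range(k)])
        w = prior[:, None] * dens
        w /= w.sum(axis=0)
        # maximization step
        old_mu = mu.copy()
        wsum = w.sum(axis=1)                       # total weight per cluster
        mu = (w @ D) / wsum[:, None]               # weighted means
        for i in range(k):
            diff = D - mu[i]
            Sigma[i] = (w[i, :, None] * diff).T @ diff / wsum[i] + 1e-8 * np.eye(d)
        prior = wsum / n
        if ((mu - old_mu) ** 2).sum() <= eps:      # convergence test on the means
            break
    return mu, Sigma, prior, w

# usage: soft-cluster a toy 2D dataset into k = 3 clusters
D = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2))
               for c in ([0, 0], [4, 4], [0, 5])])
mu, Sigma, prior, w = em_cluster(D, k=3, rng=0)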

EM Clustering in 2D: Mixture of k = 3 Gaussians

[Figure: EM clustering in two dimensions for a mixture of k = 3 Gaussians, showing the mixture density f(x) over (X1, X2).
(a) Iteration: t = 0
(b) Iteration: t = 1
(c) Iteration: t = 36]
Iris Principal Components Data
Full vs. Diagonal Covariance Matrix

The diagonal assumption leads to axis-parallel contours for the normal density, in contrast with the rotated contours for the full covariance matrix.

The full covariance matrix yields a much better clustering: it results in only 3 wrongly clustered points, whereas the diagonal covariance matrix results in 25.

[Figure: EM clustering of the Iris principal components data.
(a) Full covariance matrix (t = 36)
(b) Diagonal covariance matrix (t = 29)]
K-means as Specialization of EM
K-means can be considered a special case of the EM algorithm, obtained by setting

$$P(x_j \mid C_i) = \begin{cases} 1 & \text{if } C_i = \arg\min_{C_a} \left\{ \|x_j - \mu_a\|^2 \right\} \\ 0 & \text{otherwise} \end{cases}$$

The posterior probability P(C_i | x_j) is given as

$$P(C_i \mid x_j) = \frac{P(x_j \mid C_i)\, P(C_i)}{\sum_{a=1}^{k} P(x_j \mid C_a)\, P(C_a)}$$

If P(x_j | C_i) = 0, then P(C_i | x_j) = 0. Otherwise, if P(x_j | C_i) = 1, then P(x_j | C_a) = 0 for all a ≠ i, and thus P(C_i | x_j) = (1 · P(C_i)) / (1 · P(C_i)) = 1. In other words,

$$P(C_i \mid x_j) = \begin{cases} 1 & \text{if } x_j \in C_i, \text{ i.e., if } C_i = \arg\min_{C_a} \left\{ \|x_j - \mu_a\|^2 \right\} \\ 0 & \text{otherwise} \end{cases}$$

The parameters to be estimated are µ_i and P(C_i); the covariance matrix is not used.
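As a small numeric illustration of this hard assignment (the point and the means below are made up), the K-means posterior puts all of its weight on the closest mean, whereas the soft EM posterior merely concentrates most of it there:

import numpy as np

x = 2.0                                   # a hypothetical 1D point
mus = np.array([1.0, 6.0])                # two cluster means
d2 = (x - mus) ** 2                       # squared distances to each mean

# K-means (hard) posterior: 1 for the closest mean, 0 otherwise
hard = (d2 == d2.min()).astype(float)

# EM (soft) posterior, assuming unit variances and equal priors
lik = np.exp(-d2 / 2) / np.sqrt(2 * np.pi)
soft = lik / lik.sum()

print(hard)   # [1. 0.]
print(soft)   # most of the weight on the first cluster, but not all of it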