1. Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
2. Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chap 13: Rep-based Clustering 1 / 31
Representative-based Clustering
Given a dataset D = {x_j}_{j=1}^n with n points in a d-dimensional space, and given
the number of desired clusters k, the goal of representative-based clustering is to
partition the dataset into k groups or clusters, called a clustering, denoted
C = {C_1, C_2, ..., C_k}.
For each cluster C_i there exists a representative point that summarizes the cluster,
a common choice being the mean (also called the centroid) μ_i of all points in the
cluster, that is,

    μ_i = (1/n_i) Σ_{x_j ∈ C_i} x_j

The goal is to find the clustering that minimizes the sum of squared errors (SSE):

    SSE(C) = Σ_{i=1}^k Σ_{x_j ∈ C_i} ‖x_j − μ_i‖²
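As a concrete illustration, the SSE score can be computed directly from a clustering and its means; the following is a minimal sketch (the function name is illustrative, not from the text):

```python
import numpy as np

def sse(clusters, means):
    """Sum of squared errors: for each cluster, add up the squared
    distances from its points to its cluster mean."""
    return sum(
        float(np.sum((np.asarray(C) - np.asarray(mu)) ** 2))
        for C, mu in zip(clusters, means)
    )

# Two 1-D clusters: {0, 2} with mean 1, and {10} with mean 10.
print(sse([[[0.0], [2.0]], [[10.0]]], [[1.0], [10.0]]))  # 2.0
```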
K-means Algorithm: Objective
K-means initializes the cluster means by randomly generating k points in the data
space. Each iteration of K-means consists of two steps: (1) cluster assignment,
and (2) centroid update.
Given the k cluster means, in the cluster assignment step each point x_j ∈ D is
assigned to the closest mean, which induces a clustering, with each cluster C_i
comprising points that are closer to μ_i than to any other cluster mean. That is,
each point x_j is assigned to cluster C_{i*}, where

    i* = arg min_{i=1,...,k} { ‖x_j − μ_i‖² }
K-Means Algorithm
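A minimal Python sketch of the two-step iteration described above, assuming the initial means are given (this is an illustrative implementation, not the book's pseudocode; the convergence threshold eps is a chosen parameter):

```python
import numpy as np

def kmeans(X, mu, eps=1e-6):
    """Lloyd's K-means: alternate (1) cluster assignment and
    (2) centroid update until the means stop moving."""
    X, mu = np.asarray(X, float), np.asarray(mu, float)
    while True:
        # (1) Cluster assignment: each point goes to its closest mean.
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # (2) Centroid update: each mean becomes its cluster's average.
        new_mu = np.array([X[labels == i].mean(axis=0)
                           for i in range(len(mu))])
        if ((new_mu - mu) ** 2).sum() <= eps:
            return new_mu, labels
        mu = new_mu

# The one-dimensional example from the following slides, initialized
# with means 2 and 4:
X = [[2], [3], [4], [10], [11], [12], [20], [25], [30]]
mu, labels = kmeans(X, [[2], [4]])
print(mu.ravel())  # [ 7. 25.]
```

Note the sketch assumes no cluster ever becomes empty; a production implementation would re-seed empty clusters.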
K-means in One Dimension
Dataset: 2, 3, 4, 10, 11, 12, 20, 25, 30
(a) Initial dataset
(b) Iteration t = 1: μ1 = 2, μ2 = 4, giving clusters {2, 3} and {4, 10, 11, 12, 20, 25, 30}
(c) Iteration t = 2: μ1 = 2.5, μ2 = 16, giving clusters {2, 3, 4} and {10, 11, 12, 20, 25, 30}
(d) Iteration t = 3: μ1 = 3, μ2 = 18, giving clusters {2, 3, 4, 10} and {11, 12, 20, 25, 30}
(e) Iteration t = 4: μ1 = 4.75, μ2 = 19.60, giving clusters {2, 3, 4, 10, 11, 12} and {20, 25, 30}
(f) Iteration t = 5: μ1 = 7, μ2 = 25, same clusters (converged)
K-means in 2D: Iris Principal Components
[Scatter plots of the Iris data projected onto the first two principal components (u1, u2), showing the cluster means and point assignments:]
(a) Random initialization: t = 0
(b) Iteration: t = 1
(c) Iteration: t = 8 (converged)
Kernel K-means
Applying the kernel trick, the squared norms and dot products in the SSE
objective can be rewritten via the kernel function K(x_a, x_b) = φ(x_a)^T φ(x_b):

    SSE(C) = Σ_{j=1}^n K(x_j, x_j) − Σ_{i=1}^k (1/n_i) Σ_{x_a ∈ C_i} Σ_{x_b ∈ C_i} K(x_a, x_b)

Thus, the kernel K-means SSE objective function can be expressed purely in terms
of the kernel function.
Kernel K-means: Cluster Reassignment
Consider the distance of a point φ(x_j) to the mean μ_i^φ in feature space:

    ‖φ(x_j) − μ_i^φ‖² = ‖φ(x_j)‖² − 2 φ(x_j)^T μ_i^φ + ‖μ_i^φ‖²
                      = K(x_j, x_j) − (2/n_i) Σ_{x_a ∈ C_i} K(x_a, x_j)
                        + (1/n_i²) Σ_{x_a ∈ C_i} Σ_{x_b ∈ C_i} K(x_a, x_b)

Since K(x_j, x_j) is the same for every cluster, it can be dropped, and each
point is reassigned to the cluster

    i* = arg min_i { (1/n_i²) Σ_{x_a ∈ C_i} Σ_{x_b ∈ C_i} K(x_a, x_b) − (2/n_i) Σ_{x_a ∈ C_i} K(x_a, x_j) }
Kernel-Kmeans Algorithm
Kernel-Kmeans(K, k, ε):
1  t ← 0
2  C^t ← {C_1^t, ..., C_k^t}  // Randomly partition points into k clusters
3  repeat
4      t ← t + 1
5      foreach C_i ∈ C^{t−1} do  // Squared norm of cluster means
6          sqnorm_i ← (1/n_i²) Σ_{x_a ∈ C_i} Σ_{x_b ∈ C_i} K(x_a, x_b)
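The full iteration, including the average-kernel computation and the reassignment rule from the previous slide, can be sketched as follows (an illustrative implementation, not the book's pseudocode; the random initial partition and iteration cap are assumptions):

```python
import numpy as np

def kernel_kmeans(K, k, seed=0, iters=100):
    """Kernel K-means sketch: reassign each point j by the kernel-only
    distance  sqnorm_i - (2/n_i) * sum_{a in C_i} K(a, j)."""
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, n)  # random initial partition
    for _ in range(iters):
        d = np.empty((n, k))
        for i in range(k):
            idx = labels == i
            ni = max(idx.sum(), 1)
            # ||mu_i^phi||^2, computed purely from kernel values.
            sqnorm = K[np.ix_(idx, idx)].sum() / ni**2
            d[:, i] = sqnorm - 2.0 * K[:, idx].sum(axis=1) / ni
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: no point changed cluster
        labels = new_labels
    return labels

# With the linear kernel K = X X^T this reduces to ordinary K-means.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = kernel_kmeans(X @ X.T, 2)
```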
[Scatter plot of the kernel K-means clustering (axes X1, X2):]
(a) Linear kernel: t = 5 iterations
Kernel K-means: Gaussian Kernel
Using the Gaussian kernel

    K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )

with σ = 1.5.
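A short sketch of computing the Gaussian kernel matrix for a dataset (the function name is illustrative):

```python
import numpy as np

def gaussian_kernel(X, sigma=1.5):
    """Gaussian (RBF) kernel matrix:
    K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    X = np.asarray(X, float)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2 * sigma**2))

K = gaussian_kernel([[0.0, 0.0], [3.0, 4.0]])
print(K[0, 0])  # 1.0 (every point has kernel value 1 with itself)
```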
[Scatter plot of the kernel K-means clustering with the Gaussian kernel (axes x, y):]
(b) Gaussian kernel: t = 4 iterations
Expectation-Maximization Clustering
Gaussian Mixture Model
Let X_a denote the random variable corresponding to the ath attribute. Let
X = (X_1, X_2, ..., X_d) denote the vector random variable across the d attributes,
with x_j being a data sample from X.
We assume that each cluster Ci is characterized by a multivariate normal distribution
    f_i(x) = f(x | μ_i, Σ_i) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) exp{ −(x − μ_i)^T Σ_i^{−1} (x − μ_i) / 2 }
where the cluster mean µi ∈ Rd and covariance matrix Σi ∈ Rd ×d are both unknown
parameters.
The probability density function of X is given as a Gaussian mixture model over all the k
clusters
    f(x) = Σ_{i=1}^k f_i(x) P(C_i) = Σ_{i=1}^k f(x | μ_i, Σ_i) P(C_i)
where the prior probabilities P(C_i) are called the mixture parameters, which must
satisfy the condition Σ_{i=1}^k P(C_i) = 1.
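The mixture density above can be evaluated directly; the following is a minimal sketch (the two-component mixture at the end is a hypothetical example, not from the text):

```python
import numpy as np

def normal_pdf(x, mu, Sigma):
    """Multivariate normal density f(x | mu, Sigma)."""
    d = len(mu)
    diff = np.asarray(x, float) - np.asarray(mu, float)
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x-mu)^T Sigma^-1 (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-quad / 2) / norm

def mixture_pdf(x, mus, Sigmas, priors):
    """Gaussian mixture density f(x) = sum_i f(x | mu_i, Sigma_i) P(C_i)."""
    return sum(p * normal_pdf(x, m, S)
               for m, S, p in zip(mus, Sigmas, priors))

# Hypothetical 2-component mixture in 2 dimensions:
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
print(mixture_pdf([0.0, 0.0], mus, Sigmas, [0.5, 0.5]))
```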
Expectation-Maximization Clustering
Maximum Likelihood Estimation
Given the dataset D, we define the likelihood of θ as the conditional probability of the
data D given the model parameters θ
    P(D | θ) = Π_{j=1}^n f(x_j)
The goal of maximum likelihood estimation (MLE) is to choose the parameters θ that
maximize the likelihood. We do this by maximizing the log of the likelihood function:

    ln P(D | θ) = Σ_{j=1}^n ln f(x_j)
Expectation-Maximization Clustering
Directly maximizing the log-likelihood over θ is hard. Instead, we can use the
expectation-maximization (EM) approach for finding the maximum likelihood
estimates for the parameters θ.
EM is a two-step iterative approach that starts from an initial guess for the
parameters θ. Given the current estimates for θ, in the expectation step EM
computes the cluster posterior probabilities P(C_i | x_j) via the Bayes theorem:

    P(C_i | x_j) = f_i(x_j) P(C_i) / Σ_{a=1}^k f_a(x_j) P(C_a)
In the maximization step, using the weights P(Ci |x j ) EM re-estimates θ, that is,
it re-estimates the parameters µi , Σi , and P(Ci ) for each cluster Ci . The
re-estimated mean is given as the weighted average of all the points, the
re-estimated covariance matrix is given as the weighted covariance over all pairs of
dimensions, and the re-estimated prior probability for each cluster is given as the
fraction of weights that contribute to that cluster.
EM in One Dimension: Expectation Step
    f_i(x) = f(x | μ_i, σ_i²) = (1 / (√(2π) σ_i)) exp{ −(x − μ_i)² / (2σ_i²) }
EM in One Dimension: Maximization Step
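The maximization step re-estimates the parameters from the weights w_ij = P(C_i | x_j); in one dimension the standard weighted updates, consistent with the description on the earlier slide, are:

```latex
\mu_i = \frac{\sum_{j=1}^{n} w_{ij}\, x_j}{\sum_{j=1}^{n} w_{ij}}, \qquad
\sigma_i^2 = \frac{\sum_{j=1}^{n} w_{ij}\,(x_j - \mu_i)^2}{\sum_{j=1}^{n} w_{ij}}, \qquad
P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n}
```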
EM in One Dimension
[Plots of the two estimated normal densities over the 1-D dataset of 11 points:]
(a) Initialization (t = 0): μ1 = 6.63, μ2 = 7.57
(b) Iteration (t = 1): μ1 = 3.72, μ2 = 7.4
(c) Iteration (t = 5, converged): μ1 = 2.48, μ2 = 7.56
EM in d Dimensions
The full covariance matrix for cluster C_i is

    Σ_i = [ (σ_1^i)²   σ_12^i   ...   σ_1d^i
            σ_21^i    (σ_2^i)²  ...   σ_2d^i
              ...        ...    ...     ...
            σ_d1^i     σ_d2^i   ...  (σ_d^i)² ]

It requires O(d²) parameters, which may be too many for reliable estimation. A
simplification is to assume that all dimensions are independent, which leads to a
diagonal covariance matrix:

    Σ_i = [ (σ_1^i)²     0      ...     0
               0      (σ_2^i)²  ...     0
              ...        ...    ...    ...
               0         0      ...  (σ_d^i)² ]
EM in d Dimensions
Expectation Step: Given µi , Σi , and P(Ci ), the posterior probability is given as
    w_ij = P(C_i | x_j) = f_i(x_j) P(C_i) / Σ_{a=1}^k f_a(x_j) P(C_a)
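One full EM iteration (expectation step as above, then the weighted re-estimates of mean, variance, and prior) can be sketched for the diagonal-covariance simplification; this is an illustrative implementation, and the toy initialization at the end is a hypothetical example:

```python
import numpy as np

def em_step(X, mus, vars_, priors):
    """One EM iteration for a Gaussian mixture with diagonal covariances.
    E-step: w[i, j] = f_i(x_j) P(C_i) / sum_a f_a(x_j) P(C_a).
    M-step: weighted re-estimates of mean, variance, and prior."""
    n, d = X.shape
    k = len(priors)
    w = np.empty((k, n))
    for i in range(k):
        diff = X - mus[i]
        # Diagonal-covariance normal density, evaluated per point.
        f = np.exp(-0.5 * (diff**2 / vars_[i]).sum(axis=1)) / \
            np.sqrt((2 * np.pi) ** d * np.prod(vars_[i]))
        w[i] = f * priors[i]
    w /= w.sum(axis=0)  # posteriors P(C_i | x_j); columns sum to 1
    new_mus, new_vars, new_priors = [], [], []
    for i in range(k):
        wi = w[i].sum()
        mu = (w[i][:, None] * X).sum(axis=0) / wi
        var = (w[i][:, None] * (X - mu) ** 2).sum(axis=0) / wi
        new_mus.append(mu)
        new_vars.append(var)
        new_priors.append(wi / n)
    return new_mus, new_vars, new_priors, w

# Hypothetical 1-D data with two well-separated groups:
X = np.array([[0.0], [0.5], [5.0], [5.5]])
mus, vars_, priors, w = em_step(
    X, [np.array([0.0]), np.array([5.0])],
    [np.array([1.0]), np.array([1.0])], [0.5, 0.5])
# The re-estimated means are pulled toward 0.25 and 5.25.
```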
EM Clustering in 2D
Mixture of k = 3 Gaussians
[Surface plots of the mixture density f(x) over the attributes (X1, X2), with the cluster means marked:]
(a) Iteration: t = 0
(b) Iteration: t = 1
(c) Iteration: t = 36 (converged)
Iris Principal Components Data
Full vs. Diagonal Covariance Matrix
The diagonal assumption leads to axis-parallel contours for the normal density,
in contrast with the rotated contours for the full covariance matrix.
The full covariance matrix yields a much better clustering: it results in only
3 wrongly clustered points, whereas the diagonal covariance matrix results in 25.
[Side-by-side scatter plots of the Iris principal components data (axes X1, X2) showing the two clusterings:]
(a) Full covariance matrix (t = 36)
(b) Diagonal covariance matrix (t = 29)
K-means as Specialization of EM
K-means can be considered as a special case of the EM algorithm, as follows:
    P(x_j | C_i) = 1 if C_i = arg min_{C_a} { ‖x_j − μ_a‖² }, and 0 otherwise

If P(x_j | C_i) = 0, then P(C_i | x_j) = 0. Otherwise, if P(x_j | C_i) = 1, then
P(x_j | C_a) = 0 for all a ≠ i, and thus

    P(C_i | x_j) = (1 · P(C_i)) / (1 · P(C_i)) = 1

That is,

    P(C_i | x_j) = 1 if x_j ∈ C_i, i.e., if C_i = arg min_{C_a} { ‖x_j − μ_a‖² }, and 0 otherwise
The only parameters are μ_i and P(C_i); the covariance matrix is not used.
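With these hard 0/1 posteriors, the EM weighted-mean update reduces to the plain K-means centroid update; a small numerical sketch (the data and function name are illustrative):

```python
import numpy as np

def hard_posteriors(X, mus):
    """K-means as EM with hard assignments: P(C_i | x_j) = 1 iff
    mu_i is the closest mean to x_j, else 0."""
    d = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    w = np.zeros_like(d)
    w[np.arange(len(X)), d.argmin(axis=1)] = 1.0
    return w  # shape (n, k)

X = np.array([[0.0], [1.0], [9.0], [10.0]])
mus = np.array([[0.0], [10.0]])
w = hard_posteriors(X, mus)
# The EM mean update with 0/1 weights is exactly the average of each
# cluster's points, i.e. the K-means centroid update:
em_means = (w.T @ X) / w.sum(axis=0)[:, None]
print(em_means.ravel())  # [0.5 9.5]
```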
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info