
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki (1)    Wagner Meira Jr. (2)

(1) Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
(2) Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 13: Representative-based Clustering

Representative-based Clustering
Given a dataset D = {x_i}_{i=1}^n with n points in a d-dimensional space, and given the desired number of clusters k, the goal of representative-based clustering is to partition the dataset into k groups or clusters, called a clustering and denoted C = {C_1, C_2, ..., C_k}.

For each cluster C_i there exists a representative point that summarizes the cluster, a common choice being the mean (also called the centroid) µ_i of all points in the cluster, that is,

$$\mu_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$$

where n_i = |C_i| is the number of points in cluster C_i.


A brute-force or exhaustive algorithm for finding a good clustering is simply to generate all possible partitions of the n points into k clusters, evaluate some optimization score for each of them, and retain the clustering that yields the best score. However, this is clearly infeasible, since there are O(k^n / k!) clusterings of n points into k groups.
K-means Algorithm: Objective

The sum of squared errors (SSE) scoring function is defined as

$$SSE(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2$$

The goal is to find the clustering that minimizes the SSE score:

$$\mathcal{C}^* = \arg\min_{\mathcal{C}} \{SSE(\mathcal{C})\}$$

K-means employs a greedy iterative approach to find a clustering that minimizes the SSE objective. As such, it may converge to a local optimum instead of the globally optimal clustering.
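As a concrete illustration of the objective, the following minimal NumPy sketch (the function name and toy data are hypothetical, not from the book) evaluates the SSE score for a given assignment of points to clusters:

import numpy as np

def sse(D, labels, means):
    """Sum of squared errors: for each cluster, add the squared
    distances from its points to the cluster mean."""
    total = 0.0
    for i, mu in enumerate(means):
        pts = D[labels == i]                # points assigned to cluster i
        total += np.sum((pts - mu) ** 2)    # squared Euclidean distances
    return total

# toy example: five 2D points partitioned into two clusters
D = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 0, 1, 1])
means = np.array([D[labels == 0].mean(axis=0), D[labels == 1].mean(axis=0)])
print(sse(D, labels, means))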

K-means Algorithm: Objective

K-means initializes the cluster means by randomly generating k points in the data
space. Each iteration of K-means consists of two steps: (1) cluster assignment,
and (2) centroid update.
Given the k cluster means, in the cluster assignment step, each point x_j ∈ D is assigned to the closest mean, which induces a clustering, with each cluster C_i comprising points that are closer to µ_i than to any other cluster mean. That is, each point x_j is assigned to cluster C_{i*}, where

$$i^* = \arg\min_{i=1}^{k} \left\{ \|x_j - \mu_i\|^2 \right\}$$

Given a set of clusters C_i, i = 1, ..., k, in the centroid update step new mean values are computed for each cluster from the points in C_i.

The cluster assignment and centroid update steps are carried out iteratively until we reach a fixed point or local minimum.

K-Means Algorithm

K-means (D, k, ε):
1   t ← 0
2   Randomly initialize k centroids: µ_1^t, µ_2^t, ..., µ_k^t ∈ R^d
3   repeat
4       t ← t + 1
5       C_i ← ∅ for all i = 1, ..., k
        // Cluster assignment step
6       foreach x_j ∈ D do
7           i* ← arg min_i { ||x_j − µ_i^{t−1}||² }   // assign x_j to the closest centroid
8           C_{i*} ← C_{i*} ∪ {x_j}
        // Centroid update step
9       foreach i = 1 to k do
10          µ_i^t ← (1 / |C_i|) Σ_{x_j ∈ C_i} x_j
11  until Σ_{i=1}^{k} ||µ_i^t − µ_i^{t−1}||² ≤ ε
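The following is a minimal NumPy sketch of this procedure (the function name is illustrative, and the centroids are initialized from randomly chosen data points rather than random points in the data space as in the pseudocode above):

import numpy as np

def kmeans(D, k, eps=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    # initialize centroids by picking k distinct data points at random
    mu = D[rng.choice(len(D), size=k, replace=False)].astype(float)
    while True:
        # cluster assignment step: index of the closest centroid for each point
        dists = ((D[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # shape (n, k)
        labels = dists.argmin(axis=1)
        # centroid update step: mean of the points assigned to each cluster
        new_mu = np.array([D[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        if ((new_mu - mu) ** 2).sum() <= eps:                         # convergence test
            return new_mu, labels
        mu = new_mu

# usage on the one-dimensional example shown on the next slide
D = np.array([[2.], [3.], [4.], [10.], [11.], [12.], [20.], [25.], [30.]])
centroids, labels = kmeans(D, k=2, rng=0)
print(centroids.ravel(), labels)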

K-means in One Dimension

[Figure: K-means with k = 2 on the one-dimensional dataset {2, 3, 4, 10, 11, 12, 20, 25, 30}.
(a) Initial dataset
(b) Iteration t = 1: µ1 = 2, µ2 = 4
(c) Iteration t = 2: µ1 = 2.5, µ2 = 16
(d) Iteration t = 3: µ1 = 3, µ2 = 18
(e) Iteration t = 4: µ1 = 4.75, µ2 = 19.60
(f) Iteration t = 5 (converged): µ1 = 7, µ2 = 25]
K-means in 2D: Iris Principal Components

[Figure: K-means on the Iris dataset projected onto the first two principal components (u1, u2).
(a) Random initialization: t = 0
(b) Iteration: t = 1
(c) Iteration: t = 8 (converged)]
Kernel K-means

In K-means, the separating boundary between clusters is linear. Kernel K-means allows one to extract nonlinear boundaries between clusters via the kernel trick, i.e., we show that all operations involve only kernel values between pairs of points.

Let x_i ∈ D be mapped to φ(x_i) in feature space, and let K = {K(x_i, x_j)}_{i,j=1,...,n} denote the n × n symmetric kernel matrix, where K(x_i, x_j) = φ(x_i)^T φ(x_j).

The cluster means in feature space are {µ_1^φ, ..., µ_k^φ}, where µ_i^φ = (1/n_i) Σ_{x_j ∈ C_i} φ(x_j).

The sum of squared errors in feature space is

$$\min_{\mathcal{C}} SSE(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \left\|\phi(x_j) - \mu_i^\phi\right\|^2 = \sum_{j=1}^{n} K(x_j, x_j) - \sum_{i=1}^{k} \frac{1}{n_i} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b)$$

Thus, the kernel K-means SSE objective function can be expressed purely in terms of the kernel function.

Kernel K-means: Cluster Reassignment

Consider the distance of a point φ(x_j) to the mean µ_i^φ in feature space:

$$\left\|\phi(x_j) - \mu_i^\phi\right\|^2 = \|\phi(x_j)\|^2 - 2\,\phi(x_j)^T \mu_i^\phi + \left\|\mu_i^\phi\right\|^2 = K(x_j, x_j) - \frac{2}{n_i} \sum_{x_a \in C_i} K(x_a, x_j) + \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b)$$

Kernel K-means assigns a point to the closest cluster mean as follows:

$$C^*(x_j) = \arg\min_i \left\{ \left\|\phi(x_j) - \mu_i^\phi\right\|^2 \right\} = \arg\min_i \left\{ \frac{1}{n_i^2} \sum_{x_a \in C_i} \sum_{x_b \in C_i} K(x_a, x_b) - \frac{2}{n_i} \sum_{x_a \in C_i} K(x_a, x_j) \right\}$$

where the K(x_j, x_j) term is dropped since it is the same for every cluster.

Kernel-Kmeans Algorithm
Kernel-Kmeans (K, k, ε):
1   t ← 0
2   C^t ← {C_1^t, ..., C_k^t}   // randomly partition the points into k clusters
3   repeat
4       t ← t + 1
5       foreach C_i ∈ C^{t−1} do
            // squared norm of the cluster mean in feature space
6           sqnorm_i ← (1 / n_i²) Σ_{x_a ∈ C_i} Σ_{x_b ∈ C_i} K(x_a, x_b)
7       foreach x_j ∈ D do
8           foreach C_i ∈ C^{t−1} do
                // average kernel value between x_j and cluster C_i
9               avg_{ji} ← (1 / n_i) Σ_{x_a ∈ C_i} K(x_a, x_j)
        // Find the closest cluster for each point
10      C_i^t ← ∅ for all i = 1, ..., k
11      foreach x_j ∈ D do
12          foreach C_i ∈ C^{t−1} do
13              d(x_j, C_i) ← sqnorm_i − 2 · avg_{ji}
14          i* ← arg min_i { d(x_j, C_i) }
15          C_{i*}^t ← C_{i*}^t ∪ {x_j}   // cluster reassignment
16      C^t ← {C_1^t, ..., C_k^t}
17  until 1 − (1/n) Σ_{i=1}^{k} |C_i^t ∩ C_i^{t−1}| ≤ ε
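Below is a compact NumPy sketch of kernel K-means following these update rules (the function name is illustrative); it works directly on a precomputed kernel matrix K and reassigns each point using the sqnorm and avg terms from the pseudocode. The convergence test is the fraction of points that changed cluster, as in line 17.

import numpy as np

def kernel_kmeans(K, k, eps=1e-6, max_iter=100, rng=None):
    """Kernel K-means on a precomputed n x n kernel matrix K."""
    n = K.shape[0]
    rng = np.random.default_rng(rng)
    labels = rng.integers(0, k, size=n)           # random initial partition
    for _ in range(max_iter):
        d = np.zeros((n, k))
        for i in range(k):
            members = labels == i
            ni = max(int(members.sum()), 1)
            # sqnorm_i: squared norm of the cluster mean in feature space
            sqnorm = K[np.ix_(members, members)].sum() / ni ** 2
            # avg_ji: average kernel value between each point and cluster i
            avg = K[:, members].sum(axis=1) / ni
            d[:, i] = sqnorm - 2 * avg
        new_labels = d.argmin(axis=1)             # cluster reassignment
        changed = np.mean(new_labels != labels)   # fraction of points that moved
        labels = new_labels
        if changed <= eps:
            break
    return labels

# usage with a Gaussian kernel (sigma = 1.5, as in the example that follows)
X = np.random.default_rng(0).normal(size=(60, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-sq / (2 * 1.5 ** 2))
print(kernel_kmeans(K, k=2, rng=0))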
K-Means or Kernel K-means with Linear Kernel

Using the linear kernel K(x_i, x_j) = x_i^T x_j is equivalent to the K-means algorithm.


[Figure: Clustering of the two-dimensional dataset obtained with the linear kernel.
(a) Linear kernel: t = 5 iterations]
Kernel K-means: Gaussian Kernel
Using the Gaussian kernel K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) ), with σ = 1.5.
[Figure: Clustering of the same dataset obtained with the Gaussian kernel; the cluster boundaries are nonlinear.
(b) Gaussian kernel: t = 4 iterations]
Expectation-Maximization Clustering
Gaussian Mixture Model

Let X_a denote the random variable corresponding to the a-th attribute, and let X = (X_1, X_2, ..., X_d) denote the vector random variable across the d attributes, with x_j being a data sample from X.

We assume that each cluster C_i is characterized by a multivariate normal distribution

$$f_i(x) = f(x \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\left\{ -\frac{(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}{2} \right\}$$

where the cluster mean µ_i ∈ R^d and covariance matrix Σ_i ∈ R^{d×d} are both unknown parameters.

The probability density function of X is given as a Gaussian mixture model over all the k clusters:

$$f(x) = \sum_{i=1}^{k} f_i(x)\, P(C_i) = \sum_{i=1}^{k} f(x \mid \mu_i, \Sigma_i)\, P(C_i)$$

where the prior probabilities P(C_i) are called the mixture parameters, which must satisfy the condition Σ_{i=1}^{k} P(C_i) = 1.
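As a minimal sketch of evaluating this mixture density (the component parameters below are made up, and SciPy's multivariate normal is used for the component densities):

import numpy as np
from scipy.stats import multivariate_normal

# hypothetical 2-component mixture in 2 dimensions
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 0.5]])]
priors = [0.6, 0.4]                                   # must sum to 1

def mixture_pdf(x):
    """f(x) = sum_i f(x | mu_i, Sigma_i) P(C_i)"""
    return sum(p * multivariate_normal(mu, S).pdf(x)
               for mu, S, p in zip(mus, Sigmas, priors))

print(mixture_pdf(np.array([1.0, 1.0])))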
Expectation-Maximization Clustering
Maximum Likelihood Estimation

We write the set of all the model parameters compactly as

$$\theta = \{\mu_1, \Sigma_1, P(C_1), \dots, \mu_k, \Sigma_k, P(C_k)\}$$

Given the dataset D, we define the likelihood of θ as the conditional probability of the data D given the model parameters θ:

$$P(D \mid \theta) = \prod_{j=1}^{n} f(x_j)$$

The goal of maximum likelihood estimation (MLE) is to choose the parameters θ that maximize the likelihood. We do this by maximizing the log of the likelihood function:

$$\theta^* = \arg\max_{\theta} \{\ln P(D \mid \theta)\}$$

where the log-likelihood function is given as

$$\ln P(D \mid \theta) = \sum_{j=1}^{n} \ln f(x_j) = \sum_{j=1}^{n} \ln\left( \sum_{i=1}^{k} f(x_j \mid \mu_i, \Sigma_i)\, P(C_i) \right)$$

Expectation-Maximization Clustering

Directly maximizing the log-likelihood over θ is hard. Instead, we can use the
expectation-maximization (EM) approach for finding the maximum likelihood
estimates for the parameters θ.
EM is a two-step iterative approach that starts from an initial guess for the parameters θ. Given the current estimates for θ, in the expectation step EM computes the cluster posterior probabilities P(C_i | x_j) via Bayes' theorem:

$$P(C_i \mid x_j) = \frac{P(C_i \text{ and } x_j)}{P(x_j)} = \frac{P(x_j \mid C_i)\, P(C_i)}{\sum_{a=1}^{k} P(x_j \mid C_a)\, P(C_a)} = \frac{f_i(x_j) \cdot P(C_i)}{\sum_{a=1}^{k} f_a(x_j) \cdot P(C_a)}$$

In the maximization step, using the weights P(Ci |x j ) EM re-estimates θ, that is,
it re-estimates the parameters µi , Σi , and P(Ci ) for each cluster Ci . The
re-estimated mean is given as the weighted average of all the points, the
re-estimated covariance matrix is given as the weighted covariance over all pairs of
dimensions, and the re-estimated prior probability for each cluster is given as the
fraction of weights that contribute to that cluster.

EM in One Dimension: Expectation Step

Let D comprise a single attribute X, with each point x_j ∈ R a random sample from X. For the mixture model, we use univariate normals for each cluster:

$$f_i(x) = f(x \mid \mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left\{ -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right\}$$

with the cluster parameters µ_i, σ_i², and P(C_i).

Initialization: For each cluster C_i, with i = 1, 2, ..., k, we can randomly initialize the cluster parameters µ_i, σ_i², and P(C_i).

Expectation Step: Given the mean µ_i, variance σ_i², and prior probability P(C_i) for each cluster, the cluster posterior probability is computed as

$$w_{ij} = P(C_i \mid x_j) = \frac{f(x_j \mid \mu_i, \sigma_i^2) \cdot P(C_i)}{\sum_{a=1}^{k} f(x_j \mid \mu_a, \sigma_a^2) \cdot P(C_a)}$$

EM in One Dimension: Maximization Step

Given the w_ij values, the re-estimated cluster mean is

$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij} \cdot x_j}{\sum_{j=1}^{n} w_{ij}}$$

The re-estimated value of the cluster variance is computed as the weighted variance across all the points:

$$\sigma_i^2 = \frac{\sum_{j=1}^{n} w_{ij} (x_j - \mu_i)^2}{\sum_{j=1}^{n} w_{ij}}$$

The prior probability of cluster C_i is re-estimated as

$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n}$$
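Putting the expectation and maximization steps together, here is a minimal NumPy sketch of EM for a one-dimensional Gaussian mixture (the function name, initialization, and fixed iteration count are illustrative choices):

import numpy as np

def em_1d(x, k, n_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    n = len(x)
    mu = rng.choice(x, size=k, replace=False)   # random initial means
    var = np.full(k, x.var())                   # initial variances
    prior = np.full(k, 1.0 / k)                 # equal initial priors
    for _ in range(n_iter):
        # expectation step: posterior w[i, j] = P(C_i | x_j)
        dens = (np.exp(-(x - mu[:, None]) ** 2 / (2 * var[:, None]))
                / np.sqrt(2 * np.pi * var[:, None]))
        w = prior[:, None] * dens
        w /= w.sum(axis=0)
        # maximization step: weighted mean, weighted variance, and prior
        wsum = w.sum(axis=1)
        mu = (w * x).sum(axis=1) / wsum
        var = (w * (x - mu[:, None]) ** 2).sum(axis=1) / wsum
        prior = wsum / n
    return mu, var, prior

# usage on the one-dimensional dataset from the K-means example
x = np.array([2., 3., 4., 10., 11., 12., 20., 25., 30.])
print(em_1d(x, k=2, rng=0))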

EM in One Dimension

[Figure: EM clustering in one dimension with k = 2, showing the two cluster densities at successive iterations.
(a) Initialization (t = 0): µ1 = 6.63, µ2 = 7.57
(b) Iteration t = 1: µ1 = 3.72, µ2 = 7.4
(c) Iteration t = 5 (converged): µ1 = 2.48, µ2 = 7.56]
EM in d Dimensions

For each cluster C_i we have to re-estimate the d × d covariance matrix:

$$\Sigma_i = \begin{pmatrix} (\sigma_1^i)^2 & \sigma_{12}^i & \cdots & \sigma_{1d}^i \\ \sigma_{21}^i & (\sigma_2^i)^2 & \cdots & \sigma_{2d}^i \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1}^i & \sigma_{d2}^i & \cdots & (\sigma_d^i)^2 \end{pmatrix}$$

This requires O(d²) parameters, which may be too many for reliable estimation. A simplification is to assume that all dimensions are independent, which leads to a diagonal covariance matrix:

$$\Sigma_i = \begin{pmatrix} (\sigma_1^i)^2 & 0 & \cdots & 0 \\ 0 & (\sigma_2^i)^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & (\sigma_d^i)^2 \end{pmatrix}$$

EM in d Dimensions
Expectation Step: Given µ_i, Σ_i, and P(C_i), the posterior probability is given as

$$w_{ij} = P(C_i \mid x_j) = \frac{f_i(x_j) \cdot P(C_i)}{\sum_{a=1}^{k} f_a(x_j) \cdot P(C_a)}$$

Maximization Step: Given the weights w_ij, we re-estimate Σ_i, µ_i, and P(C_i).

The mean µ_i for cluster C_i is re-estimated as

$$\mu_i = \frac{\sum_{j=1}^{n} w_{ij} \cdot x_j}{\sum_{j=1}^{n} w_{ij}}$$

The covariance matrix Σ_i is re-estimated via the outer-product form

$$\Sigma_i = \frac{\sum_{j=1}^{n} w_{ij} (x_j - \mu_i)(x_j - \mu_i)^T}{\sum_{j=1}^{n} w_{ij}}$$

The prior probability P(C_i) for each cluster is re-estimated as

$$P(C_i) = \frac{\sum_{j=1}^{n} w_{ij}}{n}$$
Expectation-Maximization Clustering Algorithm
Expectation-Maximization (D, k, ε):
1   t ← 0
2   Randomly initialize µ_1^t, ..., µ_k^t
3   Σ_i^t ← I and P^t(C_i) ← 1/k, for all i = 1, ..., k
4   repeat
5       t ← t + 1
        // Expectation step
6       for i = 1, ..., k and j = 1, ..., n do
7           w_ij ← f(x_j | µ_i, Σ_i) · P(C_i) / Σ_{a=1}^{k} f(x_j | µ_a, Σ_a) · P(C_a)   // posterior P^t(C_i | x_j)
        // Maximization step
8       for i = 1, ..., k do
9           µ_i^t ← Σ_{j=1}^{n} w_ij · x_j / Σ_{j=1}^{n} w_ij   // re-estimate mean
10          Σ_i^t ← Σ_{j=1}^{n} w_ij (x_j − µ_i^t)(x_j − µ_i^t)^T / Σ_{j=1}^{n} w_ij   // re-estimate covariance matrix
11          P^t(C_i) ← Σ_{j=1}^{n} w_ij / n   // re-estimate priors
12  until Σ_{i=1}^{k} ||µ_i^t − µ_i^{t−1}||² ≤ ε
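The following NumPy/SciPy sketch mirrors the algorithm above for the multivariate case (function names are illustrative, and a small ridge is added to the covariances purely for numerical stability):

import numpy as np
from scipy.stats import multivariate_normal

def em_cluster(D, k, eps=1e-6, max_iter=200, rng=None):
    rng = np.random.default_rng(rng)
    n, d = D.shape
    mu = D[rng.choice(n, size=k, replace=False)].astype(float)   # random initial means
    Sigma = np.array([np.eye(d) for _ in range(k)])
    prior = np.full(k, 1.0 / k)
    for _ in range(max_iter):
        # expectation step: w[i, j] = P(C_i | x_j)
        dens = np.array([multivariate_normal(mu[i], Sigma[i]).pdf(D) for i in range(k)])
        w = prior[:, None] * dens
        w /= w.sum(axis=0)
        # maximization step
        old_mu = mu.copy()
        wsum = w.sum(axis=1)                       # total weight per cluster
        mu = (w @ D) / wsum[:, None]               # weighted means
        for i in range(k):
            diff = D - mu[i]
            Sigma[i] = (w[i, :, None] * diff).T @ diff / wsum[i] + 1e-8 * np.eye(d)
        prior = wsum / n
        if ((mu - old_mu) ** 2).sum() <= eps:      # convergence test on the means
            break
    return mu, Sigma, prior, w

# usage: soft-cluster a toy 2D dataset into k = 3 clusters
D = np.vstack([np.random.default_rng(1).normal(loc=c, size=(50, 2))
               for c in ([0, 0], [4, 4], [0, 5])])
mu, Sigma, prior, w = em_cluster(D, k=3, rng=0)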

EM Clustering in 2D: Mixture of k = 3 Gaussians

[Figure: EM clustering in two dimensions for a mixture of k = 3 Gaussians, showing the mixture density f(x) over (X1, X2).
(a) Iteration: t = 0
(b) Iteration: t = 1
(c) Iteration: t = 36]
Iris Principal Components Data
Full vs. Diagonal Covariance Matrix

The diagonal assumption leads to axis-parallel contours for the normal density, in contrast with the rotated contours for the full covariance matrix.

The full covariance matrix yields a much better clustering: it results in only 3 wrongly clustered points, whereas the diagonal covariance matrix results in 25.

[Figure: EM clustering of the Iris principal components data.
(a) Full covariance matrix (t = 36)
(b) Diagonal covariance matrix (t = 29)]
K-means as Specialization of EM
K-means can be considered a special case of the EM algorithm, obtained by setting

$$P(x_j \mid C_i) = \begin{cases} 1 & \text{if } C_i = \arg\min_{C_a} \left\{ \|x_j - \mu_a\|^2 \right\} \\ 0 & \text{otherwise} \end{cases}$$

The posterior probability P(C_i | x_j) is given as

$$P(C_i \mid x_j) = \frac{P(x_j \mid C_i)\, P(C_i)}{\sum_{a=1}^{k} P(x_j \mid C_a)\, P(C_a)}$$

If P(x_j | C_i) = 0, then P(C_i | x_j) = 0. Otherwise, if P(x_j | C_i) = 1, then P(x_j | C_a) = 0 for all a ≠ i, and thus P(C_i | x_j) = (1 · P(C_i)) / (1 · P(C_i)) = 1. In other words,

$$P(C_i \mid x_j) = \begin{cases} 1 & \text{if } x_j \in C_i, \text{ i.e., if } C_i = \arg\min_{C_a} \left\{ \|x_j - \mu_a\|^2 \right\} \\ 0 & \text{otherwise} \end{cases}$$

The parameters to be estimated are µ_i and P(C_i); the covariance matrix is not used.
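As a small numeric illustration of this hard assignment (the point and the means below are made up), the K-means posterior puts all of its weight on the closest mean, whereas the soft EM posterior merely concentrates most of it there:

import numpy as np

x = 2.0                                   # a hypothetical 1D point
mus = np.array([1.0, 6.0])                # two cluster means
d2 = (x - mus) ** 2                       # squared distances to each mean

# K-means (hard) posterior: 1 for the closest mean, 0 otherwise
hard = (d2 == d2.min()).astype(float)

# EM (soft) posterior, assuming unit variances and equal priors
lik = np.exp(-d2 / 2) / np.sqrt(2 * np.pi)
soft = lik / lik.sum()

print(hard)   # [1. 0.]
print(soft)   # most of the weight on the first cluster, but not all of it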