
Finite Mixture Models: Clustering & Classification

Mohamed Nadif

Université de Paris, Centre Borelli UMR 9010, France

M. Nadif (Faculté des Sciences) 2022-2023 Course 1 / 65


Notation

X = (xij ) of size (n × d) and zik ∈ {0, 1}


a b c z zi1 zi2 zi3
i1 x x x 1 1 0 0
i4 x x x 1 1 0 0
i8 x x x 1 1 0 0
i2 x x x 2 0 1 0
i5 x x x 2 0 1 0
i6 x x x 2 0 1 0
i10 x x x 2 0 1 0
i3 x x x 3 0 0 1
i7 x x x 3 0 0 1
i9 x x x 3 0 0 1

i denotes the indices of rows, j the indices of columns


k denotes the indices of clusters
We denote by z1 , . . . , zg the g clusters
∑_{i=1}^{n} zik = #zk denotes the cardinality of the kth cluster zk
∑i ∑k zik . . . = ∑k ∑i zik . . . = ∑k ∑i∈zk . . .

M. Nadif (Faculté des Sciences) 2022-2023 Course 2 / 65


Family of the k-means algorithm
Types         Algorithms      Criteria                               Dissimilarity/similarity measures
Continuous    k-means         ∑i,k zik D(xi , µk )                   D(xi , µk ) = ∑j (xij − µkj )² ;  xi , µk ∈ R^d
Contingency   k-means-χ²      ∑i,k zik Dχ²(xi , µk )                 Dχ²(xi , µk ) = ∑j (1/x.j )(xij /xi. − µkj )² ;  xi = (xi1 /xi. , . . . , xid /xi. )ᵀ ;  xi , µk ∈ [0, 1]^d
Categorical   k-means-χ²      ∑i,k zik Dχ²(xi , µk )                 Dχ²(xi , µk ) = ∑j ∑c (1/x.jc )(xijc /xi. − µkjc )² ;  xi = (xi1 /xi. , . . . , xim /xi. )ᵀ  (complete disjunctive coding)
Binary        k-modes         ∑i,k zik D(xi , ak )                   D(xi , ak ) = ∑j |xij − akj | ;  xi , ak ∈ {0, 1}^d
Categorical   k-modes         ∑i,k zik D(xi , λk )                   D(xi , λk ) = ∑j δ(xij , λkj ) with δ(xij , λkj ) = 0 if xij = λkj and 1 otherwise ;  λkj ∈ {1, . . . , mj }
Directional   Sk-means        ∑i,k zik cos(xi , µk ) (to maximize)   cosine similarity ;  xi , µk ∈ [0, 1]^d
Directional   Axial k-means   ∑i,k zik D(xi , µk )                   Hellinger distance ;  xi ∈ [0, 1]^d
Mixed data    k-means         ∑i,k zik D(xi , µk )                   Gower distance

Several other extensions and connections exist, for instance with kernel k-means and symmetric NMF
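As an illustration of the first row of the table, a minimal R sketch (synthetic data, variable names ours) checking that kmeans() minimizes exactly the criterion ∑i,k zik D(xi , µk ):

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2), matrix(rnorm(100, mean = 4), ncol = 2))
res <- kmeans(X, centers = 2, nstart = 20)
# recompute sum_{i,k} z_ik * D(x_i, mu_k) from the hard assignment and the centers
W <- sum(sapply(1:nrow(X), function(i) sum((X[i, ] - res$centers[res$cluster[i], ])^2)))
all.equal(W, res$tot.withinss)   # TRUE: same criterion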

M. Nadif (Faculté des Sciences) 2022-2023 Course 3 / 65


Classical clustering methods
Clustering methods, hierarchical and nonhierarchical, have advantages and
disadvantages
Disadvantages. They are for the most part heuristic techniques derived from
empirical methods
They have difficulty taking into account the characteristics of clusters (shapes,
proportions, volume, etc.)
Geometrical approach: clustering with "adaptive" distances: dMk (x, y) = ||x − y||Mk
In fact, the principal question is: "is there an underlying model?"

Mixture Approach
The mixture approach has attracted much attention since the 1990s.
It is undoubtedly a very useful contribution to clustering
1 It offers considerable flexibility
2 It provides solutions to the problem of the number of clusters
3 Its estimated posterior probabilities give rise to a fuzzy or hard clustering using the
MAP rule
4 It gives a meaning to certain classical criteria
Finite Mixture Models (McLachlan and Peel, 2000)

M. Nadif (Faculté des Sciences) 2022-2023 Course 4 / 65


Finite Mixture Model

Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion

M. Nadif (Faculté des Sciences) 2022-2023 Course 5 / 65


Finite Mixture Model Definition of the model

Definition of the model


With model-based clustering it is assumed that the data are generated by a mixture
of underlying probability distributions, where each component k of the mixture
represents a cluster. Thus, the data matrix is assumed to be an i.i.d sample
x 1 , . . . , x n where x i = (xi1 , . . . , xid ) ∈ Rd from a probability distribution with density
f (xi ; Θ) = ∑_{k=1}^{g} πk ϕ(xi ; αk ),

- ϕ(. ; αk ) is the density of an observation x i from the k-th component


- αk ’s are the corresponding class parameters. These densities belong to the same
parametric family
- The parameter πk corresponds to the probability of choosing the k-th component
- g , which is assumed to be known, is the number of components in the mixture
M. Nadif (Faculté des Sciences) 2022-2023 Course 6 / 65
Finite Mixture Model Example

Gaussian mixture model in R1


n=9000, p=1, g=3
ϕ(., αk ) a Gaussian density αk = (mk , σk )
π1 = π2 = π3 = 1/3

The mixture density of the observed data x1 , . . . , x9000 can be written as

f (X; Θ) = ∏_{i=1}^{n} f (xi ; Θ) = ∏_{i=1}^{n} ∑_{k=1}^{g} πk (1/(σk √(2π))) exp( −(1/2) ((xi − mk )/σk )² )

Mixture of 3 densities
[Figure: two histograms of the simulated data ("Histogramme des données", i.e. histogram of the data; x-axis: data, y-axis: density), the second one with the fitted mixture density overlaid]

M. Nadif (Faculté des Sciences) 2022-2023 Course 7 / 65


Finite Mixture Model Example

Gaussian mixture model in R2 : N ((2, 1); 1) and N ((8, 7); 0.6)


X=matrix(nrow=1000,ncol=2)
for (i in 1:1000)
{
  Z=rbinom(1,1,2/3)
  if (Z==1){
    X[i,1]=rnorm(1,2,1)
    X[i,2]=rnorm(1,1,1)
  }
  else
  {
    X[i,1]=rnorm(1,8,0.6)
    X[i,2]=rnorm(1,7,0.6)
  }
}
plot(X,pch="+")

[Figure: scatter plot of the 1000 simulated points, X[,1] vs X[,2], showing the two Gaussian clusters around (2,1) and (8,7)]

library(mclust)
res.mclust=Mclust(X)
plot(X,col=res.mclust$classification,pch="+")

M. Nadif (Faculté des Sciences) 2022-2023 Course 8 / 65


Finite Mixture Model Example

Likelihood of observed data X


The parameter of this model is the vector θ = (π, α) containing the mixing
proportions π = (π1 , ..., πg ) and the vector α = (α1 , ..., αg ) of parameters of each
component. The mixture density of the observed data X can be expressed as
f (X; Θ) = ∏i f (xi ; Θ) = ∏_{i=1}^{n} ∑_{k=1}^{g} πk ϕ(xi ; αk )

Bernoulli mixture model


For instance, for binary data with x i ∈ {0, 1}d , using multivariate Bernoulli
distributions for each component, the mixture density of observed data X can be
written as
f (X; Θ) = ∏_{i=1}^{n} ∑_{k=1}^{g} πk ∏_{j=1}^{d} αkj^xij (1 − αkj )^(1−xij)

where Θ = {π1 , . . . , πg , α1 , . . . , αg } with αk = (αk1 , . . . , αkd ) and αkj ∈ [0, 1]
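As a complement, a minimal R sketch (hypothetical parameter values, function name ours) that evaluates this observed-data log-likelihood directly from the formula above:

loglik_bernoulli_mix <- function(X, pi_k, alpha) {
  # X: n x d binary matrix; pi_k: length-g proportions; alpha: g x d matrix of alpha_kj
  g <- length(pi_k)
  logphi <- sapply(1:g, function(k)
    X %*% log(alpha[k, ]) + (1 - X) %*% log(1 - alpha[k, ]))   # n x g: log phi(x_i; alpha_k)
  sum(log(rowSums(sweep(exp(logphi), 2, pi_k, "*"))))
}
X <- matrix(rbinom(50, 1, 0.5), ncol = 5)
loglik_bernoulli_mix(X, pi_k = c(0.5, 0.5), alpha = matrix(c(0.8, 0.2), nrow = 2, ncol = 5))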

M. Nadif (Faculté des Sciences) 2022-2023 Course 9 / 65


Finite Mixture Model Example

Complete data (X, z)


a b c z zi1 zi2 zi3
i1 x x x 1 1 0 0
i4 x x x 1 1 0 0
i8 x x x 1 1 0 0
i2 x x x 2 0 1 0
i5 x x x 2 0 1 0
i6 x x x 2 0 1 0
i10 x x x 2 0 1 0
i3 x x x 3 0 0 1
i7 x x x 3 0 0 1
i9 x x x 3 0 0 1

Likelihood of (X, z)
The parameter of this model is the vector Θ = (π, α) containing the mixing
proportions π = (π1 , ..., πg ) and the vector α = (α1 , ..., αg ) of parameters of each
component. The mixture density of complete data (X, z) can be expressed as
f (X, z; Θ) = ∏_{i=1}^{n} ∑_{k=1}^{g} zik πk ϕ(xi ; αk ) = ∏_{i=1}^{n} ∏_{k=1}^{g} (πk ϕ(xi ; αk ))^zik .

Since zik ∈ {0, 1} we have ∑_{k=1}^{g} zik πk ϕ(xi ; αk ) = ∏_{k=1}^{g} (πk ϕ(xi ; αk ))^zik ; for example, if x1 belongs to the first cluster,

∑_{k=1}^{g} z1k πk ϕ(x1 ; αk ) = π1 ϕ(x1 ; α1 ) + 0 + 0   and   ∏_{k=1}^{g} (πk ϕ(x1 ; αk ))^z1k = π1 ϕ(x1 ; α1 ) × 1 × 1

M. Nadif (Faculté des Sciences) 2022-2023 Course 10 / 65


Finite Mixture Model Example

Bernoulli mixture model


For instance, for binary data with x i ∈ {0, 1}d , using multivariate Bernoulli
distributions for each component, the mixture density of complete data (X, z) can be
written as
f (X, z; Θ) = ∏_{i=1}^{n} ∏_{k=1}^{g} ( πk ∏_{j=1}^{d} αkj^xij (1 − αkj )^(1−xij) )^zik

where Θ = {π1 , . . . , πg , α1 , . . . , αg } with αk = (αk1 , . . . , αkd ) and αkj ∈ [0, 1]

Binary data matrix and reorganized data matrix


a b c a b c
i1 1 0 1 i1 α11 1 − α12 α13
i4 1 0 1 i4 α11 1 − α12 α13
i8 1 0 1 i8 α11 1 − α12 α13
i2 0 1 0 i2 1 − α21 α22 1 − α23
i5 0 1 0 i5 1 − α21 α22 1 − α23
i6 0 1 0 i6 1 − α21 α22 1 − α23
i10 0 1 0 i10 1 − α21 α22 1 − α23
i3 1 0 0 i3 α31 1 − α32 1 − α33
i7 0 1 0 i7 1 − α31 α32 1 − α33
i9 1 0 0 i9 α31 1 − α32 1 − α33

M. Nadif (Faculté des Sciences) 2022-2023 Course 11 / 65


Finite Mixture Model Example

Maximum likelihood (ML) estimation


Reminder. Given a sample x1 , . . . , xn and a probability distribution Pθ , the
likelihood quantifies how plausible it is that the observations actually come
from a sample drawn from Pθ .
We have n observations x1 , . . . , xn , regarded as realizations of n independent and
identically distributed random variables (X1 , . . . , Xn ). The likelihood of the sample is
the joint density

L(x1 , . . . , xn ; θ) = ∏_{i=1}^{n} f (xi ; θ)

The ML estimator of θ is θ̂ = argmaxθ L(x1 , . . . , xn ; θ), the value of the parameter
vector θ that makes the observed data as likely as possible.
Example. We toss a possibly biased coin. Out of 30 tosses, we obtain 20 "heads" and 10 "tails".
We consider the 30 tosses x1 , . . . , x30 as i.i.d realizations of a Bernoulli distribution with
unknown parameter θ, Pθ (Xi = "heads") = θ.

L(x1 , . . . , x30 ; θ) = ∏_{i=1}^{30} Pθ (Xi = xi ) = ∏_{i=1}^{30} θ^xi (1 − θ)^(1−xi) = θ^(∑_{i=1}^{30} xi) (1 − θ)^(30 − ∑_{i=1}^{30} xi)

∂ log(L)/∂θ = 0  ⇒  θ̂ = (∑_{i=1}^{30} xi) / 30 = 20/30
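A one-line numerical check of this result (a minimal sketch, base R only):

x <- c(rep(1, 20), rep(0, 10))                         # 20 heads, 10 tails
loglik <- function(theta) sum(x * log(theta) + (1 - x) * log(1 - theta))
optimize(loglik, interval = c(0.01, 0.99), maximum = TRUE)$maximum   # close to 20/30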

M. Nadif (Faculté des Sciences) 2022-2023 Course 12 / 65


Finite Mixture Model Different approaches

ML and CML approaches


The problem of clustering can be studied in the mixture model using two different
approaches: the maximum likelihood approach (ML) and the classification likelihood
approach (CML)

1 The ML approach (Day, 1969): It estimates the parameters of the mixture, and the
partition on the objects is derived from these parameters using the maximum a
posteriori principle (MAP). The maximum likelihood estimation of the parameters
results in an optimization of the log-likelihood of the observed sample
LM (Θ) = L(Θ; X) = ∑_{i=1}^{n} log ( ∑_{k=1}^{g} πk ϕ(xi ; αk ) )

2 The CML approach (Symons, 1981): It estimates the parameters of the mixture and
the partition simultaneously by optimizing the classification log-likelihood

LC (z; Θ) = L(Θ; X, z) = log f (X, z; Θ) = ∑_{i=1}^{n} ∑_{k=1}^{g} zik log (πk ϕ(xi ; αk ))

or

LC (z; Θ) = ∑i,k zik log(πk ) + ∑i,k zik log(ϕ(xi ; αk ))

M. Nadif (Faculté des Sciences) 2022-2023 Course 13 / 65


ML and CML approaches

Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion

M. Nadif (Faculté des Sciences) 2022-2023 Course 14 / 65


ML and CML approaches EM algorithm

Introduction of EM
Much effort has been devoted to the estimation of parameters for the mixture model
Pearson used the method of moments to estimate Θ = (m1 , m2 , σ12 , σ22 , π) of a
unidimensional Gaussian mixture model with two components

f (x i ; θ) = πϕ(x i ; m1 , σ12 ) + (1 − π)ϕ(x i ; m2 , σ22 )

which required solving a polynomial equation of degree nine


Generally, the appropriate method used in this context is the EM algorithm
(Dempster et al., 1977). Two steps Expectation and Maximization
This algorithm can be applied in different contexts where the model depends on
unobserved latent variables. In the mixture context, z represents this latent variable: it
indicates which component each xi comes from. We then denote by Y = (X, z) the complete data.
Starting from the relation between the densities

f (Y, Θ) = f ((X, z); Θ) = f (Y|X; Θ)f (X; Θ)

we have
log(f (X; Θ)) = log(f (Y, Θ)) − log(f (Y|X; Θ))
or
LM (Θ) = LC (z; Θ) − log f (Y|X; Θ)
M. Nadif (Faculté des Sciences) 2022-2023 Course 15 / 65
ML and CML approaches EM algorithm

Principle of EM (Missing data)


Objective: Maximization of LM (Θ)
EM relies on an iterative procedure based on the conditional expectation of the
complete-data log-likelihood for a current value Θ0 of the parameter:

LM (Θ) = Q(Θ|Θ0 ) − H(Θ|Θ0 )

where Q(Θ|Θ0 ) = E(LC (z; Θ)|X, Θ0 ) and H(Θ|Θ0 ) = E(log f (Y|X; Θ)|X, Θ0 )
Using the Jensen inequality (Dempster et al., 1977), for fixed Θ0 we have
∀Θ, H(Θ|Θ0 ) ≤ H(Θ0 |Θ0 ). This inequality can be proved as follows:

H(Θ|Θ0 ) − H(Θ0 |Θ0 ) = ∑z∈Z f (z|X; Θ0 ) log( f (z|X; Θ) / f (z|X; Θ0 ) )

As log(x) ≤ x − 1, we have log( f (z|X; Θ)/f (z|X; Θ0 ) ) ≤ f (z|X; Θ)/f (z|X; Θ0 ) − 1, then

H(Θ|Θ0 ) − H(Θ0 |Θ0 ) ≤ ∑z∈Z f (z|X; Θ) − ∑z∈Z f (z|X; Θ0 ) = 1 − 1 = 0

M. Nadif (Faculté des Sciences) 2022-2023 Course 16 / 65


ML and CML approaches EM algorithm

Q(Θ|Θ0 )
The value Θ maximizing Q(Θ|Θ0 ) satisfies the relation Q(Θ|Θ0 ) ≥ Q(Θ0 |Θ0 ) and,

LM (Θ) = Q(Θ|Θ0 ) − H(Θ|Θ0 ) ≥ Q(Θ0 |Θ0 ) − H(Θ0 |Θ0 ) = LM (Θ0 )

In the mixture context


Q(Θ|Θ0 ) = E(LC (z; Θ)|X, Θ0 ) = ∑i,k E(zik |X, Θ0 ) log(πk ϕ(xi ; αk ))

Note that E(zik |X, Θ0 ) = p(zik = 1|X, Θ0 )


As the conditional distribution of the missing data z given the observed values is

f (z|X; Θ) = f (X, z; Θ) / f (X; Θ) = f (X|z; Θ) f (z; Θ) / f (X; Θ)

we have

p(zik = 1|X, Θ0 ) = z̃ik = πk ϕ(xi ; αk ) / f (xi ; Θ) = πk ϕ(xi ; αk ) / ∑ℓ πℓ ϕ(xi ; αℓ ) ∝ πk ϕ(xi ; αk )

M. Nadif (Faculté des Sciences) 2022-2023 Course 17 / 65


ML and CML approaches EM algorithm

The steps of EM
The EM algorithm involves constructing, from an initial Θ(0) , a sequence Θ(c)
satisfying

Θ(c+1) = argmaxΘ Q(Θ|Θ(c) )

and this sequence makes the criterion LM (Θ) increase. The EM algorithm takes the
following form
Initialize by selecting an initial solution Θ(0)
Repeat the two steps until convergence

1 E-step: compute Q(Θ|Θ(c) ). Note that in the mixture case this step reduces to the
computation of the conditional probabilities z̃ik^(c)
2 M-step: compute Θ(c+1) maximizing Q(Θ, Θ(c) ). This leads to πk^(c+1) = (1/n) ∑i z̃ik^(c) , and
the exact formula for the αk^(c+1) will depend on the involved parametric family of
distribution probabilities
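As an illustration, a minimal R sketch (not the course's reference implementation; the function name and crude initialization are ours) of these two steps for a univariate Gaussian mixture, in the spirit of the example of slide 7:

em_gauss1d <- function(x, g, n_iter = 100) {
  n    <- length(x)
  pi_k <- rep(1 / g, g)
  mu   <- sample(x, g)                   # crude initialization of the means
  sd_k <- rep(sd(x), g)
  for (it in 1:n_iter) {
    # E-step: posterior probabilities z_tilde_ik (n x g matrix)
    dens  <- sapply(1:g, function(k) pi_k[k] * dnorm(x, mu[k], sd_k[k]))
    z_til <- dens / rowSums(dens)
    # M-step: update proportions, means and standard deviations
    nk   <- colSums(z_til)
    pi_k <- nk / n
    mu   <- colSums(z_til * x) / nk
    sd_k <- sqrt(colSums(z_til * outer(x, mu, "-")^2) / nk)
  }
  list(pi = pi_k, mu = mu, sd = sd_k, posterior = z_til)
}
x <- c(rnorm(3000, 0, 1), rnorm(3000, 5, 1), rnorm(3000, 10, 1))   # n = 9000, g = 3
em_gauss1d(x, g = 3)$mu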

Properties of EM
Under certain conditions, it has been established that EM always converges to a
local likelihood maximum
Simple to implement and it has good behavior in clustering and estimation contexts
Slow in some situations
M. Nadif (Faculté des Sciences) 2022-2023 Course 18 / 65
ML and CML approaches EM algorithm

Hathaway interpretation of EM: classical mixture model context


EM = alternated maximization of the fuzzy clustering criterion
FC (Z̃ , Θ) = LC (Z̃ ; Θ) + H(Z̃ )

Z̃ = (z̃ik ): fuzzy partition


LC (Z̃ ; Θ) = ∑i,k z̃ik log(πk ϕ(xi ; αk )): fuzzy classification log-likelihood
H(Z̃ ) = − ∑i,k z̃ik log z̃ik : entropy function

Algorithm
Maximizing FC (Z̃ , Θ) w.r.t. Z̃ yields the E-step
Maximizing FC (Z̃ , Θ) w.r.t. Θ yields the M-step

Fuzzy clustering to hard clustering


a b c z̃i1 z̃i2 z̃i3 zi1 zi2 zi3 z
i1 x x x 0.7 0.1 0.2 1 0 0 1
i2 x x x 0.1 0.6 0.3 0 1 0 2
i3 x x x 0.1 0.1 0.8 0 0 1 3
i4 x x x 0.6 0.2 0.2 1 0 0 1
i5 x x x 0.2 0.6 0.2 0 1 0 2
i6 x x x 0.1 0.7 0.2 0 1 0 2
i7 x x x 0.2 0.1 0.7 0 0 1 3
i8 x x x 0.8 0.1 0.1 1 0 0 1
i9 x x x 0.2 0.2 0.6 0 0 1 3
i10 x x x 0.1 0.8 0.1 0 1 0 2
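In R, the MAP conversion above is one line (posterior values taken from the first three rows of the table):

z_tilde <- matrix(c(0.7, 0.1, 0.2,
                    0.1, 0.6, 0.3,
                    0.1, 0.1, 0.8), ncol = 3, byrow = TRUE)
apply(z_tilde, 1, which.max)   # hard clusters 1 2 3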
M. Nadif (Faculté des Sciences) 2022-2023 Course 19 / 65
ML and CML approaches EM algorithm

Example
library(mixtools)
attach(faithful)
dim(faithful)
waiting
hist(waiting)
d=density(waiting)
plot(d)
wait1 <- normalmixEM(waiting, lambda = .5, mu = c(50, 60), sigma = 5)
plot(wait1, density = TRUE, cex.axis = 1.4, cex.lab = 1.4, cex.main = 1.8,
     main2 = "Time between Old Faithful eruptions", xlab2 = "Minutes")

Mixture of 2 densities
[Figure: three panels, "Histogram of waiting", the kernel density estimate "density.default(x = waiting)" (N = 272, bandwidth = 3.988), and "Time between Old Faithful eruptions" showing the fitted two-component Gaussian mixture density over Minutes]

M. Nadif (Faculté des Sciences) 2022-2023 Course 20 / 65


ML and CML approaches EM algorithm

Data embedding and clustering: two different aims

Dataset (1000 × 15): we often perform a PCA and then apply a clustering method on the
first axes (components). Caution: a cluster structure can be obvious in the factorial
plane (1,15) and completely invisible in the plane (1,2).
[Figure: two "Individuals factor map (PCA)" scatter plots of the 1000 individuals, one on the plane Dim 1 (11.86%) × Dim 2 (7.99%) and one on the plane Dim 1 (11.86%) × Dim 15 (1.21%)]

Evaluation
kmeans, PCA-kmeans, Autoencoder-kmeans, UMAP-kmeans, Deep k-means
Model-based clustering GMM

M. Nadif (Faculté des Sciences) 2022-2023 Course 21 / 65


ML and CML approaches CEM algorithm

CEM algorithm
In the CML approach the partition is added to the parameters to be estimated. The
maximum likelihood estimation of these new parameters results in an optimization of
the complete data log-likelihood. This optimization can be performed using the
following Classification EM (CEM) algorithm (Celeux and Govaert, 1992), a variant
of EM, which converts the z̃ik ’s to a discrete classification in a C-step before
performing the M-step:
E-step: compute the posterior probabilities z̃ik^(c) .
C-step: the partition z^(c+1) is defined by assigning each observation xi to the cluster
which provides the maximum current posterior probability.
M-step: compute the maximum likelihood estimate (πk^(c+1) , αk^(c+1) ) using the k-th
cluster. This leads to πk^(c+1) = (1/n) ∑i zik^(c+1) = #zk /n, and the exact formula for the
αk^(c+1) will depend on the involved parametric family of distribution probabilities

Properties of CEM
Simple to implement and it has good practical behavior in clustering context
Faster than EM and scalable
Some difficulties when the clusters are not well separated

M. Nadif (Faculté des Sciences) 2022-2023 Course 22 / 65


ML and CML approaches CEM algorithm

Link between CEM and the dynamical clustering methods


Dynamical clustering method:
- Assignment step: zk = {i ; D(xi , ak ) ≤ D(xi , ak' ), ∀k' ≠ k}
- Representation step: compute the center ak of each cluster
The CEM algorithm:
- E-step: compute z̃ik ∝ πk ϕ(xi ; αk )
- C-step: zk = {i ; z̃ik ≥ z̃ik' , ∀k' ≠ k}, i.e.
  zk = {i ; − log(πk ϕ(xi ; αk )) ≤ − log(πk' ϕ(xi ; αk' )), ∀k' ≠ k}
- M-step: compute the πk ’s and the αk ’s

Density and distance


When the proportions are supposed equal we can propose a "distance" or a
dissimilarity measure D by taking ϕ(x i , αk ) = exp(−D(x i , ak )) then

D(x i , ak ) = − log(ϕ(x i , αk ))

and the criterion to minimize is

∑i ∑k zik D(xi , ak )

Classical k-means algorithms

M. Nadif (Faculté des Sciences) 2022-2023 Course 23 / 65


Applications

Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion

M. Nadif (Faculté des Sciences) 2022-2023 Course 24 / 65


Applications Bernoulli mixture

Binary data
For binary data, considering the conditional independence model (independence for
each component), the mixture density of the observed data X can be written as

f (X; Θ) = ∏i f (xi ; Θ) = ∏i ∑k πk ∏j αkj^xij (1 − αkj )^(1−xij)

where xij ∈ {0, 1}, αk = (αk1 , . . . , αkd ) and αkj ∈ [0, 1]


Latent Class Model
The different steps of EM algorithm

1 E-step: compute z̃ik
2 M-step: αkj = ∑i z̃ik xij / ∑i z̃ik and πk = ∑i z̃ik / n

The different steps of CEM algorithm

1 E-step: compute z̃ik
2 C-step: compute z
3 M-step: αkj = ∑i zik xij / ∑i zik (the proportion of 1's for variable j in cluster zk ) and πk = #zk / n
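As an illustration, a minimal R sketch of these CEM steps for the Bernoulli (latent class) model; the function cem_bernoulli and its random initialization are ours, not a reference implementation:

cem_bernoulli <- function(X, g, n_iter = 50) {
  n <- nrow(X); d <- ncol(X)
  pi_k  <- rep(1 / g, g)
  alpha <- matrix(runif(g * d, 0.3, 0.7), g, d)          # random initial alpha_kj
  for (it in 1:n_iter) {
    # E-step (up to a constant): log z_tilde_ik = log pi_k + log phi(x_i; alpha_k)
    logpost <- sapply(1:g, function(k)
      log(pi_k[k]) + X %*% log(alpha[k, ]) + (1 - X) %*% log(1 - alpha[k, ]))
    # C-step: assign each row to the most probable component
    z <- apply(logpost, 1, which.max)
    # M-step: proportions and within-cluster frequencies of 1's
    for (k in 1:g) {
      nk <- sum(z == k)
      if (nk == 0) next                                  # keep previous values if empty
      pi_k[k]    <- nk / n
      alpha[k, ] <- pmin(pmax(colMeans(X[z == k, , drop = FALSE]), 1e-3), 1 - 1e-3)
    }
  }
  list(z = z, pi = pi_k, alpha = alpha)
}
X   <- matrix(rbinom(200, 1, 0.4), ncol = 5)
res <- cem_bernoulli(X, g = 2)
table(res$z)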

M. Nadif (Faculté des Sciences) 2022-2023 Course 25 / 65


Applications Bernoulli mixture

Parsimonious model
Several parsimonious models can be proposed by imposing constraints on the
parameters

f (xi ; Θ) = ∑k πk ∏j εkj^|xij − akj| (1 − εkj )^(1 − |xij − akj|)

where
akj = 0, εkj = αkj if αkj < 0.5
akj = 1, εkj = 1 − αkj if αkj > 0.5
The parameter αk is replaced by the two parameters ak and εk
Example:
αk = (0.7, 0.3, 0.4, 0.6)
then
ak = (1, 0, 0, 1) and εk = (0.3, 0.3, 0.4, 0.4)
- The binary vector ak represents the center of the kth cluster; each akj indicates the
most frequent binary value
- The vector εk ∈ ]0, 1/2[^d represents the degrees of heterogeneity of the kth
cluster; each εkj is the probability that variable j takes a value different from that of
the center
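A two-line R sketch of this recoding, using the example values above:

alpha_k <- c(0.7, 0.3, 0.4, 0.6)
a_k   <- as.numeric(alpha_k > 0.5)                      # center:        1 0 0 1
eps_k <- ifelse(alpha_k > 0.5, 1 - alpha_k, alpha_k)    # heterogeneity: 0.3 0.3 0.4 0.4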

M. Nadif (Faculté des Sciences) 2022-2023 Course 26 / 65


Applications Bernoulli mixture

p(xij = 1|akj = 0) = p(xij = 0|akj = 1) = εkj


p(xij = 0|akj = 0) = p(xij = 1|akj = 1) = 1 − εkj
j1 j2 j3
i1 1 0 1
i4 1 0 1
i8 1 0 1
i2 0 1 0
i5 0 1 0
i6 0 1 0
i10 0 1 0
i3 1 0 0
i7 0 1 0
i9 1 0 1

Probabilities in terms of αk :
      j1         j2         j3
i1    α11        1 − α12    α13
i4    α11        1 − α12    α13
i8    α11        1 − α12    α13
i2    1 − α11    α12        1 − α13
i5    1 − α21    α22        1 − α23
i6    1 − α21    α22        1 − α23
i10   1 − α21    α22        1 − α23
i3    α21        1 − α22    1 − α23
i7    1 − α21    α22        1 − α23
i9    α21        1 − α22    α23

Probabilities in terms of the centers ak and the degrees of heterogeneity εk :
      j1         j2         j3
i1    1 − ε11    1 − ε12    1 − ε13
i4    1 − ε11    1 − ε12    1 − ε13
i8    1 − ε11    1 − ε12    1 − ε13
i2    ε11        ε12        ε13
a1    1          0          1
i5    1 − ε21    1 − ε22    1 − ε23
i6    1 − ε21    1 − ε22    1 − ε23
i10   1 − ε21    1 − ε22    1 − ε23
i3    ε21        ε22        1 − ε23
i7    1 − ε21    1 − ε22    1 − ε23
i9    ε21        ε22        ε23
a2    0          1          0

M. Nadif (Faculté des Sciences) 2022-2023 Course 27 / 65


Applications Bernoulli mixture

Example: Binary data matrix and reorganized data matrix


a b c d e a b c d e
1 1 0 1 0 1 1 1 0 1 0 1
2 0 1 0 1 0 4 1 0 1 0 0
3 1 0 0 0 0 8 1 0 1 0 1
4 1 0 1 0 0 2 0 1 0 1 0
5 0 1 0 1 1 5 0 1 0 1 1
6 0 1 0 0 1 6 0 1 0 0 1
7 0 1 0 0 0 10 0 1 0 1 0
8 1 0 1 0 1 3 1 0 0 0 0
9 1 0 0 1 0 7 0 1 0 0 0
10 0 1 0 1 0 9 1 0 0 1 0

Centers ak and Degree of heterogeneity εk


a b c d e a b c d e
a1 1 0 1 0 1 ε1 0 0 0 0 0.33
a2 0 1 0 1 0 ε2 0 0 0 0.25 0.5
a3 1 0 0 0 0 ε3 0.33 0.33 0 0.33 0

8 models assuming proportions equal or not: [εkj ], [εk ], [εj ], [ε]

[εkj ] (heterogeneity depends on k and j), [εk ] (depends only on k),
[εj ] (depends only on j), [ε] (does not depend on k or j)

M. Nadif (Faculté des Sciences) 2022-2023 Course 28 / 65


Applications Bernoulli mixture

CEM for the simplest model [ε] where ε does not depend on k or j
Exercise: When the proportions are supposed equal, the classification log-likelihood
to maximize is

LC (z; Θ) = L(Θ; X, z) = log(ε/(1 − ε)) ∑_{i=1}^{n} ∑_{k=1}^{g} zik D(xi , ak ) + nd log(1 − ε)

where D(xi , ak ) = ∑_{j=1}^{d} |xij − akj |

The parameter ε is fixed for each cluster and for each variable; as log(ε/(1 − ε)) ≤ 0
(since ε < 1/2), this maximization leads to the minimization of

W (z, a) = ∑_{i=1}^{n} ∑_{k=1}^{g} zik D(xi , ak ),   a = (a1 , . . . , ag )

Exercise: The CEM algorithm is equivalent to the dynamical clustering method

CEM and EM for the other models


Exercise: Describe the different steps of CEM for the models [εj ], [εk ] and [εkj ]
Exercise: Deduce the different steps of EM for these models
M. Nadif (Faculté des Sciences) 2022-2023 Course 29 / 65
Applications Multinomial Mixture

Nominal categorical data


Categorical data are a generalization of binary data
Generally this kind of data is represented by a complete disjunctive table where the
categories are represented by their indicators
A variable j with mj categories is represented by a binary vector of indicators such that

xijh = 1 if i takes the category h for variable j
xijh = 0 otherwise

The probability of the mixture can be written

f (xi ; Θ) = ∑k πk ( ∏j,h (αkjh )^xijh )

where αkjh is the probability that the variable j takes the category h when an object
belongs to the cluster k.

M. Nadif (Faculté des Sciences) 2022-2023 Course 30 / 65


Applications Multinomial Mixture

Notation
ykjh = ∑i zik xijh
yjh = ∑i xijh
yk = ∑j,h ykjh
y = ∑k yk = ∑i,j,h xijh = nd

Example
a b a1 a2 a3 b1 b2 b3 a1 a2 a3 b1 b2 b3
i1 1 2 i1 1 0 0 0 1 0 i3 0 1 0 0 0 1
i2 3 2 i2 0 0 1 0 1 0 i7 0 0 1 0 0 1
i3 2 3 i3 0 1 0 0 0 1 i9 0 1 0 0 1 0
i4 1 1 i4 1 0 0 1 0 0 i10 0 1 0 0 0 1
i5 1 2 i5 1 0 0 0 1 0 i1 1 0 0 0 1 0
i6 3 2 i6 0 0 1 0 1 0 i4 1 0 0 1 0 0
i7 3 3 i7 0 0 1 0 0 1 i5 1 0 0 0 1 0
i8 1 1 i8 1 0 0 1 0 0 i8 1 0 0 1 0 0
i9 2 2 i9 0 1 0 0 1 0 i2 0 0 1 0 1 0
i10 2 3 i10 0 1 0 0 0 1 i6 0 0 1 0 1 0

- y1a1 = 0,y1a2 = 3, y1a3 = 1, y1b1 = 0,y1b2 = 1, y1b3 = 3


- y1 = 0 + 3 + 1 + 0 + 1 + 3 = 8, y2 = 8, y3 = 4
- y = 8 + 8 + 4 = 10 × 2
M. Nadif (Faculté des Sciences) 2022-2023 Course 31 / 65
Applications Multinomial Mixture

Interpretation of the model


The different steps of EM algorithm

1 E-step: compute z̃ik
2 M-step: αkjh = ∑i z̃ik xijh / ∑i z̃ik and πk = ∑i z̃ik / n

The different steps of CEM algorithm

1 E-step: compute z̃ik
2 C-step: compute z
3 M-step (Exercise): αkjh = ∑i zik xijh / ∑i zik = ykjh / #zk and πk = #zk / n

M. Nadif (Faculté des Sciences) 2022-2023 Course 32 / 65


Applications Multinomial Mixture

Interpretation of the model


The classification log-likelihood can be written as

LC (z; Θ) = ∑k #zk log(πk ) + ∑k,j,h ykjh log(αkjh )

When the proportions are supposed equal, the restricted likelihood is

LCR (z; Θ) = ∑k,j,h ykjh log(αkjh )

Given αkjh = ykjh / #zk , it can be shown that CEM maximizes the mutual information

I(z, J) = ∑k,j,h (ykjh / y) log( ykjh y / (yk yjh ) )

This expression is very close to χ²(z, J) = ∑k,j,h (ykjh y − yk yjh )² / (yk yjh y)

Assuming that X derives from the latent class model with equal proportions, the
maximization of LC (z; Θ) is approximately equivalent to using k-means with the χ²
metric (course 2).
M. Nadif (Faculté des Sciences) 2022-2023 Course 33 / 65
Applications Multinomial Mixture

Parsimonious model
The number of parameters in the latent class model is (g − 1) + g ∑j (mj − 1),
where mj is the number of categories of variable j
This number is smaller than the ∏j mj cells required by the complete log-linear model;
for example (d = 10, g = 5, mj = 4 for each j), this number is equal to
(5 − 1) + 5 × (40 − 10) = 154 (a short numeric check is sketched at the end of this slide)
This number can be reduced further with a parsimonious model obtained by imposing
constraints on the parameters αkj . Instead of having a probability for each category,
we give the category of j that matches the center value for j the probability (1 − εkj ),
and every other category the probability εkj /(mj − 1)
Then the distribution depends on ak and εk defined by

(1 − εkj ) for xij = akj
εkj /(mj − 1) for xij ≠ akj

The parametrization concerns only the variables instead of all the categories
This model is an extension of the Bernoulli model
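A minimal numeric check of the parameter counts quoted above:

g <- 5; m_j <- rep(4, 10)
(g - 1) + g * sum(m_j - 1)   # 154 free parameters for the latent class model
prod(m_j)                    # 4^10 = 1048576 cells for the complete log-linear model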

M. Nadif (Faculté des Sciences) 2022-2023 Course 34 / 65


Applications Multinomial Mixture

birds dataset

Categorical variables: the birds dataset (Bretagnolle, 2007) provides details on the
morphology of birds of different subspecies (puffins). Each bird is described by five
qualitative variables: one variable for the gender and four variables giving a
morphological description. There are 69 puffins divided into two subclasses:
lherminieri and subalaris (34 and 35 individuals respectively).

library(Rmixmod)
data(birds)
dim(birds)
birds
xem.birds <- mixmodCluster(birds, 2)
summary(xem.birds)

M. Nadif (Faculté des Sciences) 2022-2023 Course 35 / 65


Applications Multinomial Mixture

Model and parameters [πk , εkjh ]


****************************************
* nbCluster = 2
* criterion = BIC
****************************************
*** MIXMOD Models:
* list = Binary-pk-Ekjh
* This list includes only models with free proportions.
****************************************
* data (limited to a 10x10 matrix) =
gender eyebrow collar sub-caudal border
112212
221221
323111
413211
513211
613211
723211
822211
922211
10 2 2 2 1 1
* ... ...
****************************************
*** MIXMOD Strategy:
* algorithm = EM
* number of tries = 1
* number of iterations = 200
* epsilon = 0.001
*** Initialization strategy:
* algorithm = smallEM
* number of tries = 10
* number of iterations = 5
* epsilon = 0.001
* seed = NULL
M. Nadif (Faculté des Sciences) 2022-2023 Course 36 / 65
Applications Multinomial Mixture

Example
****************************************
* number of modalities = 2 4 5 5 3
****************************************
*** Cluster 1
* proportion = 0.6544
* center = 1.0000 3.0000 1.0000 1.0000 1.0000
* scatter =
| 0.4937 0.4937 |
| 0.0761 0.0063 0.1741 0.0917 |
| 0.1521 0.1391 0.0043 0.0043 0.0043 |
| 0.0390 0.0045 0.0043 0.0259 0.0043 |
| 0.0577 0.0288 0.0289 |
****************************************
*** Cluster 2
* proportion = 0.3456
* center = 2.0000 2.0000 2.0000 2.0000 1.0000
* scatter =
| 0.4280 0.4280 |
| 0.1203 0.1463 0.0153 0.0107 |
| 0.0509 0.0751 0.0080 0.0080 0.0080 |
| 0.3641 0.5495 0.1288 0.0485 0.0080 |
| 0.1074 0.0940 0.0134 |
****************************************

M. Nadif (Faculté des Sciences) 2022-2023 Course 37 / 65


Applications Multinomial Mixture

Visualisation and Description of clusters

plot(xem.birds)
# Bigger symbol means that observations are similar
barplot(xem.birds)
# Description
[Figure: "Multiple Correspondence Analysis" factor map of the observations (Axis 1 × Axis 2; bigger symbols mean more similar observations) and barplots of the conditional vs. unconditional category frequencies per cluster: "Barplot of gender", "Barplot of eyebrow", "Barplot of collar", "Barplot of sub-caudal", "Barplot of border"]

M. Nadif (Faculté des Sciences) 2022-2023 Course 38 / 65


Applications Multinomial Mixture

Model selection

****************************************
*** BEST MODEL OUTPUT:
*** According to the BIC criterion
****************************************
* nbCluster = 2
* model name = Binary-pk-Ekjh
* criterion = BIC(518.9159)
* likelihood = -198.0634
****************************************

M. Nadif (Faculté des Sciences) 2022-2023 Course 39 / 65


Applications Multinomial Mixture

The simplest model


We assume that εkj does not depend on the cluster k nor on the variable j, i.e. εkj = ε:

(1 − ε) for xij = akj
ε/(mj − 1) for xij ≠ akj

The restricted classification log-likelihood takes the following form

LCR (z; Θ) = L(Θ; X, z) = ∑i,k zik ∑j log( ε / ((1 − ε)(mj − 1)) ) δ(xij , akj ) + nd log(1 − ε)

or, LCR (z; Θ) = − ∑k ∑i∈zk D(xi , ak ) + nd log(1 − ε) where

D(xi , ak ) = ∑j log( ((1 − ε)/ε)(mj − 1) ) δ(xij , akj )

If all variables have the same number of categories, the criterion to minimize is
∑i,k zik D(xi , ak ), why?
The CEM algorithm is then an extension of k-modes

M. Nadif (Faculté des Sciences) 2022-2023 Course 40 / 65


Applications Multinomial Mixture

Contingency table
We can associate a multinomial model (Govaert and Nadif, 2007); the density of the
model is then f (xi ; Θ) = B ∑k πk αk1^xi1 · · · αkd^xid (B does not depend on Θ)
Dropping log(B), we have LC (z; Θ) = ∑i ∑k zik ( log πk + ∑j xij log(αkj ) )
The mutual information quantifying the information shared between z and J:

I(z, J) = ∑k,j fkj log( fkj / (fk. f.j ) )

We have the relation ∑k,j (fkj − fk. f.j )² / (fk. f.j ) = ∑k,j fkj² / (fk. f.j ) − 1. Using the
approximation x² − 1 ≈ 2x log(x), excellent in the neighborhood of 1 and good in
[0, 3], we have ∑k,j fkj² / (fk. f.j ) − 1 = ∑k,j fk. f.j (fkj / (fk. f.j ))² − 1 ≈ 2 ∑k,j fkj log( fkj / (fk. f.j ) ).
Then we have

I(z, J) ≈ (1 / (2N)) χ²(z, J)

When the proportions are assumed equal, the maximization of LC (z; Θ) is equivalent
to the maximization of I(z, J) and approximately equivalent to the maximization of
χ²(z, J)

M. Nadif (Faculté des Sciences) 2022-2023 Course 41 / 65


Applications Gaussian mixture model

The Gaussian model


The density can be written as: f (xi ; Θ) = ∑k πk ϕ(xi ; µk , Σk ) where

ϕ(xi ; µk , Σk ) = (1 / ((2π)^(d/2) |Σk |^(1/2))) exp{ −(1/2) (xi − µk )ᵀ Σk⁻¹ (xi − µk ) }

Spectral decomposition of the variance matrix

Σk = λk Dk Ak Dkᵀ

- λk = |Σk |^(1/d) , a positive real, represents the volume of the kth component
- Ak = Diag(ak1 , . . . , akd ) whose elements are proportional to the eigenvalues of Σk ; it
defines the shape of the kth cluster
- Dk is formed by the eigenvectors; it defines the direction (orientation) of the kth cluster

Remark: number of parameters to estimate: (g − 1) + g × d + g × d(d + 1)/2
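A minimal R sketch (arbitrary covariance values) recovering this decomposition with eigen():

Sigma  <- matrix(c(2, 0.8, 0.8, 1), 2, 2)
e      <- eigen(Sigma)
lambda <- det(Sigma)^(1 / 2)                  # |Sigma|^(1/d), the volume (d = 2 here)
D      <- e$vectors                           # directions (eigenvectors)
A      <- diag(e$values / lambda)             # normalized shape matrix, det(A) = 1
all.equal(Sigma, lambda * D %*% A %*% t(D))   # TRUE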

M. Nadif (Faculté des Sciences) 2022-2023 Course 42 / 65


Applications Gaussian mixture model

Different Gaussian models


The Gaussian mixture depends on: proportions, centers, volumes, shapes and
directions; hence different models can be proposed
In the following models proportions can be assumed equal or not

1 Spherical models: Ak = I then Σk = λk I . Two models [λI ] and [λk I ]


2 Diagonal models: Four models [λA], [λk A], [λAk ] and [λk Ak ]
3 General models: the eight models assuming equal or not volumes, shapes and
directions [λDAD > ], [λk DAD > ], [λDAk D > ], [λk DAk D > ], [λDk ADk> ],[λk Dk ADk> ],
[λDk Ak Dk> ] and [λk Dk Ak Dk> ]

Finally we have 28 models, we will study the problem of the choice of the models
See for instance mclust and Rmixmod.
mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite
Mixture Models. by Luca Scrucca, Michael Fop, T. Brendan Murphy and Adrian E.
Raftery, 2016

M. Nadif (Faculté des Sciences) 2022-2023 Course 43 / 65


Applications Gaussian mixture model

Library mclust
Spherical models: By fixing Ak = I , we place ourselves in the case where the classes
are of spherical shapes, that is to say that the variances of all the variables are equal
inside of the same class.
Diagonal models: Considering that the matrices Dk are diagonal, we force the classes
to be aligned on the axes. It is in fact the hypothesis of conditional independence in
which the variables are independent of each other within the same class.
General models: By fixing equality constraints on the Ak , the Dk or the λk , we can
generate 8 different models.

M. Nadif (Faculté des Sciences) 2022-2023 Course 44 / 65


Applications Gaussian mixture model

library {mclust}
Example 1
library(mclust)
data(diabetes)
class <- diabetes$class
table(class)
# class
# Chemical Normal Overt
# 36 76 33
X <- diabetes[,-1]
head(X)
res.pca=PCA(X)
clPairs(X, class)
res.mclust <- Mclust(X,3)
summary(res.mclust)
table(res.mclust$class,diabetes$class)
res.kmeans=kmeans(X,3,nstart=100)
table(res.kmeans$cluster,diabetes$class)

Example 2
data(wine, package = "gclus")
Class <- factor(wine$Class, levels = 1:3,labels = c("Barolo", "Grignolino", "Barbera"))
X <- data.matrix(wine[,-1])
mod <- Mclust(X)
summary(mod$BIC)
summary(mod)
table(Class, mod$classification)
adjustedRandIndex(Class, mod$classification)

M. Nadif (Faculté des Sciences) 2022-2023 Course 45 / 65


Applications Gaussian mixture model

CEM
In the clustering step, each xi is assigned to the cluster maximizing
z̃ik ∝ πk ϕ(xi ; µk , Σk ), or equivalently the cluster that minimizes

−2 log(πk ϕ(xi ; µk , Σk )) = (xi − µk )ᵀ Σk⁻¹ (xi − µk ) + log |Σk | − 2 log(πk ) + cste

From density to distance (or dissimilarity): xi is assigned to a cluster according to
the following dissimilarity

D²_{Σk⁻¹}(xi ; µk ) + log |Σk | − 2 log(πk )

where D²_{Σk⁻¹}(xi ; µk ) = (xi − µk )ᵀ Σk⁻¹ (xi − µk ) is the Mahalanobis distance

Note that when the proportions are supposed equal and the variance matrices identical, the
assignment is based only on

D²_{Σ⁻¹}(xi ; µk )

When the proportions are supposed equal and for the spherical model [λI ] (Σk = I ),
one uses the usual Euclidean distance

D²(xi ; µk )
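A minimal R sketch (hypothetical values) of this assignment cost for one observation and one general Gaussian component, using mahalanobis():

assign_cost <- function(x, mu_k, Sigma_k, pi_k)
  mahalanobis(x, mu_k, Sigma_k) + log(det(Sigma_k)) - 2 * log(pi_k)
assign_cost(c(1, 2), mu_k = c(0, 0), Sigma_k = diag(2), pi_k = 0.5)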

M. Nadif (Faculté des Sciences) 2022-2023 Course 46 / 65


Applications Gaussian mixture model

Description of CEM
E-step: classical. C-step: each cluster zk is obtained by using D²(xi ; µk )
M-step: given the partition z, we have to determine the parameter Θ maximizing

LC (z; Θ) = L(Θ; X, z) = ∑i,k zik log (πk ϕ(xi ; αk ))

For the Gaussian model this is, up to a constant,

− (1/2) ∑k ( ∑i zik (xi − µk )ᵀ Σk⁻¹ (xi − µk ) + #zk log |Σk | − 2 #zk log(πk ) )

- The parameter µk is thus necessarily the center µk = ∑i zik xi / #zk
- The proportions satisfy πk = #zk / n
- The parameters Σk must then minimize, for the general model,

F (Σ1 , . . . , Σg ) = ∑k ( trace(Wk Σk⁻¹ ) + #zk log |Σk | )

where Wk = ∑i zik (xi − µk )(xi − µk )ᵀ

M. Nadif (Faculté des Sciences) 2022-2023 Course 47 / 65


Applications Gaussian mixture model

Consequence for the spherical model [λI ]


The function to minimize for the model [λI ] becomes

F (λ) = (1/λ) trace(W ) + nd log(λ)

where W = ∑k Wk
With λ = trace(W )/(nd) minimizing F (λ), the classification log-likelihood becomes

LC (z; Θ) = − (nd/2) log(trace(W )) + cste = − (nd/2) log W (z) + cste

Maximizing LC is thus equivalent to minimizing the SSQ criterion W (z) minimized by the
k-means algorithm
Interpretation
- The use of the model [λI ] assumes that the clusters are spherical, with the same
proportion and the same volume
- CEM is therefore an extension of k-means

M. Nadif (Faculté des Sciences) 2022-2023 Course 48 / 65


Applications Gaussian mixture model

Description of EM
E-step: classical
M-step: we have to determine the parameter Θ maximizing Q(Θ, Θ0 ), which takes the
following form

Q(Θ, Θ0 ) = ∑i,k z̃ik log (πk ϕ(xi ; αk ))

For the Gaussian model this is, up to a constant,

− (1/2) ∑i,k z̃ik ( (xi − µk )ᵀ Σk⁻¹ (xi − µk ) + log |Σk | − 2 log(πk ) )

- The parameter µk is thus necessarily the center µk = ∑i z̃ik xi / ∑i z̃ik
- The proportions satisfy πk = ∑i z̃ik / n
- The parameters Σk must then minimize

F (Σ1 , . . . , Σg ) = ∑k ( trace(Wk Σk⁻¹ ) + ñk log |Σk | ),  with ñk = ∑i z̃ik

where Wk = ∑i z̃ik (xi − µk )(xi − µk )ᵀ

M. Nadif (Faculté des Sciences) 2022-2023 Course 49 / 65


Applications Gaussian mixture model

Example: https://sandipanweb.wordpress.com/2016/07/30/image-clustering-with-gmm-em-soft-clustering-in-r/

M. Nadif (Faculté des Sciences) 2022-2023 Course 50 / 65


Applications Gaussian mixture model

Von-Mises Fisher Mixture model


The von Mises-Fisher distribution (vMF)

Let xi ∈ S^(d−1) be a data point following a vMF distribution; then its pdf is

f (xi |µ, κ) = cd (κ) exp(κ µᵀ xi ),                    (1)

µ: centroid parameter, κ: concentration parameter, such that ||µ|| = 1 and κ ≥ 0.
cd (κ) = κ^(d/2 − 1) / ( (2π)^(d/2) I_(d/2 − 1)(κ) ), where Ir (κ) is the modified
Bessel function of the first kind and order r .

[Figure: impact of κ on the concentration of points around the mean direction. blue: κ = 1, green: κ = 10, red: κ = 100]

The Mixture of vMF distributions (movMFs)

The data points x1 , . . . , xn are supposed to be i.i.d and generated from a mixture of g
vMF distributions, with pdf:

f (xi |Θ) = ∑_{k=1}^{g} πk ϕ(xi |µk , κk ),                    (2)

where Θ = {µ1 , . . . , µg , π1 , . . . , πg , κ1 , . . . , κg }

M. Nadif (Faculté des Sciences) 2022-2023 Course 51 / 65


Applications Directional data

Algorithms

Log-likelihood

L(Θ; X) = ∑i log( ∑k πk ϕ(xi |µk , κk ) )

Complete data log-likelihood

LC (z; Θ) = ∑i,k zik log πk + ∑i,k zik log cd (κk ) + ∑i,k zik κk µkᵀ xi
          = ∑i,k zik log πk + ∑i,k zik log cd (κk ) + ∑i,k zik κk cos(µk , xi )

EM
E-step: finds the conditional expectation z̃ik = E(zik |xi , Θ(t) ) = p(zik = 1|xi , Θ(t) )
M-step: finds the new parameters Θ(t+1) maximizing
Q(Θ, Θ(t) ) = E( L(Θ; X, z)|X, Θ(t) ) s.t. ∑k πk = 1, ||µk || = 1 and κk > 0

Hypotheses: if ∀k, πk = 1/g and κk = κ, the maximization of LC (z; Θ) and of
∑i,k zik cos(xi , µk ) are equivalent
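As an illustration, a minimal R sketch of spherical k-means, i.e. CEM under these hypotheses (equal proportions, common κ); the function name and random initialization are ours:

skmeans_sketch <- function(X, g, n_iter = 30) {
  X  <- X / sqrt(rowSums(X^2))                   # project the rows onto the unit sphere
  mu <- X[sample(nrow(X), g), , drop = FALSE]    # random initial centroids
  for (it in 1:n_iter) {
    sim <- X %*% t(mu)                           # cosine similarities (n x g)
    z   <- apply(sim, 1, which.max)              # assignment step
    for (k in 1:g) {                             # centroid step: normalized within-cluster mean
      m <- colSums(X[z == k, , drop = FALSE])
      if (all(m == 0)) next
      mu[k, ] <- m / sqrt(sum(m^2))
    }
  }
  list(z = z, mu = mu)
}
X <- matrix(abs(rnorm(400)), ncol = 4)           # nonnegative rows, as for document data
skmeans_sketch(X, g = 3)$z[1:10]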

M. Nadif (Faculté des Sciences) 2022-2023 Course 52 / 65


Applications Other variants of EM

Stochastic EM "SEM", (Celeux and Diebolt, 1985)


Steps of SEM
An S-step is inserted between the E-step and the M-step
(in CEM it is a C-step, in SEM an S-step)
E-step: compute the posterior probabilities
S-step: this stochastic step draws a partition z̄: each object i is assigned to a
component k drawn according to the multinomial distribution (z̃i1 , . . . , z̃ig )
M-step: as in the CEM algorithm, this step is based on z̄

Advantages and Disadvantages of SEM

It gives good results when the size of the data is large enough
It can be used even if the number of clusters is unknown. It suffices to fix g to gmax ,
the maximum number of clusters; this number is reduced whenever a cluster
contains so few objects that the estimation of its parameters is no longer possible.
For example, when the cardinality of a cluster is less than a threshold, we rerun SEM
with (g − 1) clusters
It can avoid the initialization problem and other problems of EM
Instability of the results. Solution: use SEM (for the estimation of the parameters and of
the number of clusters), then use the obtained results to initialize EM
M. Nadif (Faculté des Sciences) 2022-2023 Course 53 / 65
Applications Other variants of EM

Stochastic Annealing EM "SAEM" (Celeux and Diebolt, 1992)

Steps of SAEM
The aim of SAEM is to reduce the randomness in the estimation of the parameters
SAEM is based on SEM and EM
Solution
E-step: as in EM, SEM and CEM
S-step: as in SEM
M-step: the parameters are computed as the combination

Θ(t+1) = γ(t+1) ΘSEM(t+1) + (1 − γ(t+1) ) ΘEM(t+1)

The initial value of γ is 1 and it decreases to 0.

M. Nadif (Faculté des Sciences) 2022-2023 Course 54 / 65


Model Selection

Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion

M. Nadif (Faculté des Sciences) 2022-2023 Course 55 / 65


Model Selection

Different approaches
In the finite mixture model, the problem of the choice of the model includes the problem
of the number of clusters
We distinguish the two problems and here consider the model fixed while the number of
clusters g is unknown.
Let MA and MB be two models; Θ(MA ) and Θ(MB ) denote the "domains" of free
parameters. If Lmax (M) = L(θ̂M ) where θ̂M = argmax L(θ), then we have

Θ(MA ) ⊂ Θ(MB ) ⇒ Lmax (MA ) ≤ Lmax (MB )

For example Lmax [πk λk I ]g=2 ≤ Lmax [πk λk I ]g=3 . Generally the likelihood increases
with the number of clusters.
First solution: plot the likelihood against the number of clusters and look for an elbow
Second solution: minimize (or maximize the opposite of) one of the classical penalized
criteria taking the form

C (M) = −2Lmax (M) + τC np (M)

where np (M) indicates the number of free parameters of the model M; it represents the
complexity of the model
Different variants: AIC with τ = 2, AIC3 with τ = 3 and the famous

BIC (M) = −2Lmax (M) + log(n) np (M)
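A minimal R sketch computing these criteria by hand from a maximized log-likelihood (here taken from an Mclust fit on the faithful data; any Lmax and np would do):

library(mclust)
fit  <- Mclust(faithful, G = 2)
Lmax <- fit$loglik        # maximized log-likelihood
np   <- fit$df            # number of free parameters of the selected model
AIC  <- -2 * Lmax + 2 * np
BIC  <- -2 * Lmax + log(fit$n) * np
# note: mclust itself reports BIC with the opposite sign (2*loglik - np*log(n))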


M. Nadif (Faculté des Sciences) 2022-2023 Course 56 / 65
Model Selection

Example

library(mclust)
res=Mclust(X)
plot(res)
summary(res)

[Figure: left, the data X[,1] vs X[,2] with the Mclust classification; right, BIC curves of the 14 covariance models (EII, VII, EEI, VEI, EVI, VVI, EEE, EVE, VEE, VVE, EEV, VEV, EVV, VVV) against the number of components (1 to 9)]

M. Nadif (Faculté des Sciences) 2022-2023 Course 57 / 65


Mixture model for classification

Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion

M. Nadif (Faculté des Sciences) 2022-2023 Course 58 / 65


Mixture model for classification

Mixture-based discriminant analysis models assume that the density for each class follows
a Gaussian mixture distribution. A Gaussian mixture model for the kth class
(k = 1, . . . , K ) has density
fk (xi ; Θ) = ∑_{g=1}^{Gk} πgk ϕ(xi ; µgk , Σgk )

EDDA: Eigenvalue Decomposition Discriminant Analysis assumes that the density for
each class can be described by a single Gaussian component (Bensmail and Celeux,
1996), i.e. Gk = 1 for all k, with the component covariance structure factorised as

Σk = λk Dk Ak Dkᵀ

1 Assuming Σk = λDAD > (model EEE) corresponds to LDA


2 Assuming Σk = λk Dk Ak Dk> (model VVV) corresponds to QDA
Example: Consider the UCI Wisconsin breast cancer diagnostic data available at
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). This dataset provides data
for 569 patients on 30 features of the cell nuclei obtained from a digitized image of a fine needle aspirate (FNA)
of a breast mass (Mangasarian et al., 1995). For each patient the cancer was diagnosed as malignant or benign.
Following Fraley and Raftery (2002) we considered only three attributes: extreme area, extreme smoothness,
and mean texture. The dataset can be downloaded from the UCI repository using the following commands:

M. Nadif (Faculté des Sciences) 2022-2023 Course 59 / 65


Mixture model for classification

data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data",
                 header = FALSE)
dim(data)
summary(data)
X <- data[,c(4, 26, 27)]
colnames(X) <- c("texture.mean", "area.extreme", "smoothness.extreme")
Class <- data[,2]

#Then, we may randomly assign approximately 2/3 of the observations to the training
#set, and the remaining ones to the test set:
set.seed(123)
train <- sample(1:nrow(X), size = round(nrow(X)*2/3), replace = FALSE)
X.train <- X[train,]
dim(X.train)
summary(X.train)
Class.train <- Class[train]
table(Class.train)

#Class.train B M
X.test <- X[-train,]
Class.test <- Class[-train]
table(Class.test)

M. Nadif (Faculté des Sciences) 2022-2023 Course 60 / 65


Mixture model for classification

MclustDA

#The function MclustDA() provides fitting capabilities for the EDDA model, but we must specify the optional
argument modelType = "EDDA". The function call is thus the following:

Single component by class Gk = 1 for all k

mod1 <- MclustDA(X.train, Class.train, modelType = "EDDA")


summary(mod1, newdata = X.test, newclass = Class.test)

What is the selected model ?

A cross-validation error can also be computed using the cvMclustDA() function, which by default uses nfold = 10 for
a 10-fold cross-validation:

cv <- cvMclustDA(mod1)
unlist(cv[c("error", "se")])

M. Nadif (Faculté des Sciences) 2022-2023 Course 61 / 65


Mixture model for classification

MclustDA

Components by class Gk
EDDA imposes a single mixture component for each group. However, in certain
circumstances more complexity may improve performance. A more general approach,
called MclustDA, has been proposed by Fraley and Raftery (2002) , where a finite
mixture of Gaussian distributions is used within each class, with number of components
and covariance matrix (expressed following the usual decomposition) which may be
different within any class. This is the default model fitted by MclustDA:

mod2 <- MclustDA(X.train, Class.train)


summary(mod2, newdata = X.test, newclass = Class.test)
plot(mod2, what ="scatterplot", dimens = c(1,2))
plot(mod2, what = "scatterplot", dimens = c(2,3))
plot(mod2, what = "scatterplot", dimens = c(3,1))

[Figure: pairwise scatter plots of the training data and fitted MclustDA components for texture.mean, area.extreme and smoothness.extreme]

M. Nadif (Faculté des Sciences) 2022-2023 Course 62 / 65


Mixture model for classification

MclustDR
Another interesting graph can be obtained by projecting the data on a dimension reduced
subspace (Scrucca, 2014) with the commands:

drmod2 <- MclustDR(mod2)


summary(drmod2)
plot(drmod2, what = "boundaries", ngrid = 200)
[Figure: data projected onto the reduced subspace (Dir1, Dir2) with the MclustDA classification boundaries]
M. Nadif (Faculté des Sciences) 2022-2023 Course 63 / 65
Conclusion

Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion

M. Nadif (Faculté des Sciences) 2022-2023 Course 64 / 65


Conclusion

Conclusion
The finite mixture approach is well suited to both clustering and classification
The CML approach gives interesting criteria and generalizes the classical criteria
The different variants of EM offer good solutions
The CEM algorithm is an extension of k-means and of its variants
The choice of the model is performed using the maximum likelihood penalized by
the number of parameters
See mclust and Rmixmod
There are other mixture models adapted to the nature of the data

Next
Co-clustering
Factorization, modularity and latent block models
Course-4

M. Nadif (Faculté des Sciences) 2022-2023 Course 65 / 65
