Cours Finite Mixture
Mohamed Nadif
Mixture Approach
The mixture approach (MA) has attracted much attention since the 1990s.
It is undoubtedly a very useful contribution to clustering:
1 It offers considerable flexibility
2 It provides solutions to the problem of the number of clusters
3 Its associated estimators of posterior probabilities give rise to a fuzzy or hard clustering using the maximum a posteriori (MAP) principle
4 It gives a meaning to certain classical criteria
Finite Mixture Models (McLachlan and Peel, 2000)
Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion
Mixture of 3 densities
[Figure: two histograms of the data ("Histogramme des données"), with the density ("Densité") on the vertical axis and the data ("Données") on the horizontal axis; the second panel overlays the fitted mixture density.]
[Figure: scatter plot of the simulated two-dimensional data (X[,1] vs X[,2]), drawn with plot(X,pch="+"); the simulation loop uses draws such as X[i,2]=rnorm(1,7,0.6). A second panel shows the same points colored by the partition found by Mclust.]

library(mclust)
res.mclust=Mclust(X)
plot(X,col=res.mclust$classification,pch="+")
Likelihood of (X, z)
The parameter of this model is the vector Θ = (π, α) containing the mixing
proportions π = (π1 , ..., πg ) and the vector α = (α1 , ..., αg ) of parameters of each
component. The mixture density of complete data (X, z) can be expressed as
$$f(\mathbf{X}, \mathbf{z}; \Theta) = \prod_{i=1}^{n} \sum_{k=1}^{g} z_{ik}\, \pi_k \varphi(x_i; \alpha_k) = \prod_{i=1}^{n} \prod_{k=1}^{g} \left(\pi_k \varphi(x_i; \alpha_k)\right)^{z_{ik}}.$$
Since $z_{ik} \in \{0,1\}$, we have $\sum_{k=1}^{g} z_{ik}\, \pi_k \varphi(x_i; \alpha_k) = \prod_{k=1}^{g} (\pi_k \varphi(x_i; \alpha_k))^{z_{ik}}$. For example, if $z_1 = (1, 0, 0)$ then
$$\sum_{k=1}^{g} z_{1k}\, \pi_k \varphi(x_1; \alpha_k) = \pi_1 \varphi(x_1; \alpha_1) + 0 + 0 \quad\text{and}\quad \prod_{k=1}^{g} (\pi_k \varphi(x_1; \alpha_k))^{z_{1k}} = \pi_1 \varphi(x_1; \alpha_1) \times 1 \times 1.$$
$$L(x_1, \ldots, x_{30}; \theta) = \prod_{i=1}^{30} P_\theta(X_i = x_i) = \prod_{i=1}^{30} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_{i=1}^{30} x_i}\,(1-\theta)^{30 - \sum_{i=1}^{30} x_i}$$
$$\frac{\partial \log L}{\partial \theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{\sum_{i=1}^{30} x_i}{30}$$
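The closed form $\hat{\theta} = \sum_i x_i / 30$ can be checked numerically. The following sketch (illustrative Python, not part of the course material) compares the closed-form MLE with a grid search over $\theta$:

```python
import math
import random

def bernoulli_loglik(theta, xs):
    """Log-likelihood of i.i.d. Bernoulli(theta) observations xs."""
    s = sum(xs)
    n = len(xs)
    return s * math.log(theta) + (n - s) * math.log(1 - theta)

random.seed(0)
xs = [1 if random.random() < 0.3 else 0 for _ in range(30)]

theta_hat = sum(xs) / len(xs)  # closed-form MLE: the sample mean

# A grid search over (0, 1) finds the same maximizer, up to grid resolution.
grid = [t / 1000 for t in range(1, 1000)]
best = max(grid, key=lambda t: bernoulli_loglik(t, xs))
print(round(theta_hat, 3), round(best, 3))
```

The log-likelihood is concave in $\theta$, so the grid maximizer is always one of the two grid points bracketing $\hat{\theta}$.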
1 The ML approach (Day, 1969): It estimates the parameters of the mixture, and the
partition on the objects is derived from these parameters using the maximum a
posteriori principle (MAP). The maximum likelihood estimation of the parameters
results in an optimization of the log-likelihood of the observed sample
$$L_M(\Theta) = L(\Theta; \mathbf{X}) = \sum_{i=1}^{n} \log\left(\sum_{k=1}^{g} \pi_k \varphi(x_i; \alpha_k)\right)$$
2 The CML approach (Symons, 1981): It estimates the parameters of the mixture and
the partition simultaneously by optimizing the classification log-likelihood
$$L_C(\mathbf{z}; \Theta) = L(\Theta; \mathbf{X}, \mathbf{z}) = \log f(\mathbf{X}, \mathbf{z}; \Theta) = \sum_{i=1}^{n}\sum_{k=1}^{g} z_{ik} \log\left(\pi_k \varphi(x_i; \alpha_k)\right)$$
or
$$L_C(\mathbf{z}; \Theta) = \sum_{i=1}^{n}\sum_{k=1}^{g} z_{ik} \log(\pi_k) + \sum_{i=1}^{n}\sum_{k=1}^{g} z_{ik} \log(\varphi(x_i; \alpha_k))$$
Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion
Introduction of EM
Much effort has been devoted to the estimation of parameters for the mixture model
Pearson used the method of moments to estimate Θ = (m1 , m2 , σ12 , σ22 , π) of a
unidimensional Gaussian mixture model with two components
we have
$$\log f(\mathbf{X}; \Theta) = \log f(\mathbf{X}, \mathbf{z}; \Theta) - \log f(\mathbf{z}|\mathbf{X}; \Theta)$$
or
$$L_M(\Theta) = L_C(\mathbf{z}; \Theta) - \log f(\mathbf{z}|\mathbf{X}; \Theta)$$
M. Nadif (Faculté des Sciences) 2022-2023 Course 15 / 65
where $Q(\Theta|\Theta') = E(L_C(\mathbf{z}; \Theta)\,|\,\mathbf{X}, \Theta')$ and $H(\Theta|\Theta') = E(\log f(\mathbf{z}|\mathbf{X}; \Theta)\,|\,\mathbf{X}, \Theta')$
Using the Jensen inequality (Dempster et al., 1977), for fixed $\Theta'$ we have $\forall \Theta,\; H(\Theta|\Theta') \leq H(\Theta'|\Theta')$. This inequality can be proved as follows:
$$H(\Theta|\Theta') - H(\Theta'|\Theta') = \sum_{\mathbf{z} \in \mathcal{Z}} f(\mathbf{z}|\mathbf{X}; \Theta') \log \frac{f(\mathbf{z}|\mathbf{X}; \Theta)}{f(\mathbf{z}|\mathbf{X}; \Theta')}$$
As $\log(x) \leq x - 1$, we have $\log \frac{f(\mathbf{z}|\mathbf{X};\Theta)}{f(\mathbf{z}|\mathbf{X};\Theta')} \leq \frac{f(\mathbf{z}|\mathbf{X};\Theta)}{f(\mathbf{z}|\mathbf{X};\Theta')} - 1$, then
$$H(\Theta|\Theta') - H(\Theta'|\Theta') \leq \sum_{\mathbf{z} \in \mathcal{Z}} f(\mathbf{z}|\mathbf{X}; \Theta) - \sum_{\mathbf{z} \in \mathcal{Z}} f(\mathbf{z}|\mathbf{X}; \Theta') = 1 - 1 = 0$$
The value $\Theta$ maximizing $Q(\Theta|\Theta')$ satisfies $Q(\Theta|\Theta') \geq Q(\Theta'|\Theta')$ and, since $L_M(\Theta) = Q(\Theta|\Theta') - H(\Theta|\Theta')$, it follows that $L_M(\Theta) \geq L_M(\Theta')$.
The steps of EM
The EM algorithm involves constructing, from an initial $\Theta^{(0)}$, a sequence $\Theta^{(c)}$ satisfying
$$\Theta^{(c+1)} = \operatorname{argmax}_{\Theta} Q(\Theta|\Theta^{(c)})$$
and this sequence makes the criterion $L_M(\Theta)$ increase. The EM algorithm takes the following form:
Initialize by selecting an initial solution $\Theta^{(0)}$
Repeat the two steps until convergence:
1 E-step: compute $Q(\Theta|\Theta^{(c)})$. Note that in the mixture case this step reduces to the computation of the conditional probabilities $\tilde{z}_{ik}^{(c)}$
2 M-step: compute $\Theta^{(c+1)}$ maximizing $Q(\Theta|\Theta^{(c)})$. This leads to $\pi_k^{(c+1)} = \frac{1}{n}\sum_i \tilde{z}_{ik}^{(c)}$, and the exact formula for $\alpha_k^{(c+1)}$ will depend on the involved parametric family of distribution probabilities
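For a concrete instance of these two steps, here is a minimal EM for a two-component univariate Gaussian mixture, written as an illustrative Python sketch (pure standard library; an assumption-free complement to the course's R examples, not the course's code):

```python
import math
import random

def em_gaussian_mixture(xs, n_iter=200):
    """EM for a two-component univariate Gaussian mixture.

    Returns (pi, mu, sigma). Illustrative sketch: crude initialization,
    fixed number of iterations, no convergence test.
    """
    pi = [0.5, 0.5]
    mu = [min(xs), max(xs)]          # crude but well-spread starting means
    sigma = [1.0, 1.0]
    n = len(xs)
    for _ in range(n_iter):
        # E-step: posterior probabilities z_ik proportional to pi_k * phi(x_i; mu_k, sigma_k)
        post = []
        for x in xs:
            dens = [pi[k] * math.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
                    / (sigma[k] * math.sqrt(2 * math.pi)) for k in range(2)]
            s = sum(dens)
            post.append([d / s for d in dens])
        # M-step: closed-form updates of proportions, means and variances
        for k in range(2):
            nk = sum(p[k] for p in post)
            pi[k] = nk / n
            mu[k] = sum(p[k] * x for p, x in zip(post, xs)) / nk
            var = sum(p[k] * (x - mu[k]) ** 2 for p, x in zip(post, xs)) / nk
            sigma[k] = math.sqrt(max(var, 1e-12))
    return pi, mu, sigma

random.seed(1)
xs = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(7, 0.6) for _ in range(200)]
pi, mu, sigma = em_gaussian_mixture(xs)
print(sorted(mu))  # the two estimated means, close to 0 and 7
```

Only the M-step formula for the means and variances is specific to the Gaussian family; the E-step and the proportion update are the same for any mixture.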
Properties of EM
Under certain conditions, it has been established that EM always converges to a
local likelihood maximum
Simple to implement and it has good behavior in clustering and estimation contexts
Slow in some situations
Algorithm
Maximizing $F_C(\tilde{\mathbf{z}}, \Theta)$ with respect to $\tilde{\mathbf{z}}$ yields the E-step
Maximizing $F_C(\tilde{\mathbf{z}}, \Theta)$ with respect to $\Theta$ yields the M-step
Example
library(mixtools)
attach(faithful)
dim(faithful)
waiting
hist(waiting)
d=density(waiting)
plot(d)
wait1 <- normalmixEM(waiting, lambda = .5, mu = c(50, 60), sigma = 5)
plot(wait1, density = TRUE, cex.axis = 1.4, cex.lab = 1.4, cex.main = 1.8, main2 = "Time between Old Faithful eruptions", xlab2 = "Minutes")
Mixture of 2 densities
[Figure: three panels — "Histogram of waiting", the kernel density estimate "density.default(x = waiting)", and "Time between Old Faithful eruptions" showing the fitted two-component mixture density over the data.]
Dataset (1000 × 15): often one performs a PCA and then applies a clustering method on the first axes (components). Caution: a cluster structure can be obvious in the factorial plane (1,15) and completely non-existent in the plane (1,2).
[Figure: two individuals factor maps (PCA) of the 1000 × 15 dataset — in the plane (Dim 1, Dim 2 (7.99%)) no cluster structure is visible, while the plane (Dim 1, Dim 15 (1.21%)) reveals it.]
Evaluation
kmeans, PCA-kmeans, Autoencoder-kmeans, UMAP-kmeans, Deep k-means
Model-based clustering GMM
CEM algorithm
In the CML approach the partition is added to the parameters to be estimated. The
maximum likelihood estimation of these new parameters results in an optimization of
the complete data log-likelihood. This optimization can be performed using the
following Classification EM (CEM) algorithm (Celeux and Govaert, 1992), a variant
of EM, which converts the z̃ik ’s to a discrete classification in a C-step before
performing the M-step:
E-step: compute the posterior probabilities $\tilde{z}_{ik}^{(c)}$.
C-step: the partition $\mathbf{z}^{(c+1)}$ is defined by assigning each observation $x_i$ to the cluster which provides the maximum current posterior probability.
M-step: compute the maximum likelihood estimate $(\pi_k^{(c+1)}, \alpha_k^{(c+1)})$ using the k-th cluster. This leads to $\pi_k^{(c+1)} = \frac{\#z_k^{(c+1)}}{n} = \frac{1}{n}\sum_i z_{ik}^{(c+1)}$, and the exact formula for $\alpha_k^{(c+1)}$ will depend on the involved parametric family of distribution probabilities
Properties of CEM
Simple to implement and it has good practical behavior in clustering context
Faster than EM and scalable
Some difficulties when the clusters are not well separated
With equal proportions, the C-step amounts to assigning each $x_i$ to the cluster minimizing $D(x_i, a_k) = -\log(\varphi(x_i; \alpha_k))$
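For intuition, the C-step can be written in code for the simplest Gaussian case (equal proportions, unit variances), where maximizing the posterior reduces to choosing the nearest mean. An illustrative Python sketch, not from the course:

```python
import random

def cem_unit_gaussians(xs, mus, n_iter=50):
    """CEM for g univariate Gaussians with equal proportions and unit variance.

    E/C-step: assign each x to the component maximizing its posterior, which
    here means minimizing -log phi(x; mu_k) up to constants, i.e. the nearest mean.
    M-step: each mean becomes the average of its cluster.
    """
    mus = list(mus)
    for _ in range(n_iter):
        clusters = [[] for _ in mus]
        for x in xs:
            k = min(range(len(mus)), key=lambda k: (x - mus[k]) ** 2)  # C-step
            clusters[k].append(x)
        for k, c in enumerate(clusters):                                # M-step
            if c:
                mus[k] = sum(c) / len(c)
    return mus

random.seed(2)
xs = [random.gauss(0, 1) for _ in range(150)] + [random.gauss(6, 1) for _ in range(150)]
print(sorted(cem_unit_gaussians(xs, [min(xs), max(xs)])))  # means close to 0 and 6
```

Compared with the EM sketch earlier, the only change is the hard assignment in the C-step; this is exactly the k-means iteration, which anticipates the equivalence shown later for the model $[\lambda I]$.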
Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion
Binary data
For binary data, considering the conditional independence model (independence for
each component), the mixture density of the observed data X can be written as
$$f(\mathbf{X}; \Theta) = \prod_i f(x_i; \Theta) = \prod_i \sum_k \pi_k \prod_j \alpha_{kj}^{x_{ij}} (1-\alpha_{kj})^{1-x_{ij}}$$
1 E-step: compute the posterior probabilities $\tilde{z}_{ik}$
2 M-step: $\alpha_{kj} = \frac{\sum_i \tilde{z}_{ik}\, x_{ij}}{\sum_i \tilde{z}_{ik}}$ and $\pi_k = \frac{\sum_i \tilde{z}_{ik}}{n}$
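These E and M formulas translate directly into code. The following Python sketch (illustrative, not from the course) implements EM for the Bernoulli mixture under conditional independence:

```python
import random

def em_bernoulli_mixture(X, g, n_iter=100, seed=0):
    """EM for a mixture of multivariate Bernoulli distributions
    (conditional independence model). X is a list of 0/1 lists.
    Returns (pi, alpha) with alpha[k][j] = P(x_j = 1 | cluster k)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    pi = [1.0 / g] * g
    alpha = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(g)]
    for _ in range(n_iter):
        # E-step: z_ik proportional to pi_k * prod_j alpha_kj^x_ij (1-alpha_kj)^(1-x_ij)
        post = []
        for x in X:
            w = []
            for k in range(g):
                p = pi[k]
                for j in range(d):
                    p *= alpha[k][j] if x[j] else (1 - alpha[k][j])
                w.append(p)
            s = sum(w)
            post.append([v / s for v in w])
        # M-step: alpha_kj = sum_i z_ik x_ij / sum_i z_ik, pi_k = sum_i z_ik / n
        for k in range(g):
            nk = sum(p[k] for p in post)
            pi[k] = nk / n
            for j in range(d):
                alpha[k][j] = sum(p[k] * x[j] for p, x in zip(post, X)) / nk
    return pi, alpha

# Two simulated clusters with opposite Bernoulli profiles.
random.seed(3)
proto = [[0.9] * 4 + [0.1] * 4, [0.1] * 4 + [0.9] * 4]
X = [[1 if random.random() < proto[k][j] else 0 for j in range(8)]
     for k in (0, 1) for _ in range(200)]
pi, alpha = em_bernoulli_mixture(X, 2)
```

With well-separated profiles, the estimated $\alpha_{kj}$ recover the generating probabilities up to label switching.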
Parsimonious model
Several parsimonious models can be proposed by imposing constraints on the
parameters
$$f(x_i; \Theta) = \sum_k \pi_k \prod_j \varepsilon_{kj}^{|x_{ij} - a_{kj}|} (1-\varepsilon_{kj})^{1-|x_{ij} - a_{kj}|}$$
where
akj = 0, εkj = αkj if αkj < 0.5
akj = 1, εkj = 1 − αkj if αkj > 0.5
The parameter αk is replaced by the two parameters ak and εk
Example:
αk = (0.7, 0.3, 0.4, 0.6)
then
ak = (1, 0, 0, 1) and εk = (0.3, 0.3, 0.4, 0.4)
- The binary vector $a_k$ represents the center of the k-th cluster; each $a_{kj}$ indicates the most frequent binary value
- The vector $\varepsilon_k \in\, ]0, 1/2[^d$ represents the degrees of heterogeneity of the k-th cluster; each $\varepsilon_{kj}$ is the probability that variable j takes a value different from that of the center
Probabilities of observing 1, expressed with the $\alpha_{kj}$ (left) and with the centers $a_k$ and heterogeneities $\varepsilon_{kj}$ (right):

      j1       j2       j3     |       j1       j2       j3
i1    α11      1−α12    α13    | i1    1−ε11    1−ε12    1−ε13
i4    α11      1−α12    α13    | i4    1−ε11    1−ε12    1−ε13
i8    α11      1−α12    α13    | i8    1−ε11    1−ε12    1−ε13
i2    1−α11    α12      1−α13  | i2    ε11      ε12      ε13
                               | a1    1        0        1
i5    1−α21    α22      1−α23  | i5    1−ε21    1−ε22    1−ε23
i6    1−α21    α22      1−α23  | i6    1−ε21    1−ε22    1−ε23
i10   1−α21    α22      1−α23  | i10   1−ε21    1−ε22    1−ε23
i3    α21      1−α22    1−α23  | i3    ε21      ε22      1−ε23
i7    1−α21    α22      1−α23  | i7    1−ε21    1−ε22    1−ε23
i9    α21      1−α22    α23    | i9    ε21      ε22      ε23
                               | a2    0        1        0
CEM for the simplest model [ε] where ε does not depend on k or j
Exercise: when the proportions are assumed equal, the classification log-likelihood to maximize is
$$L_C(\mathbf{z}; \Theta) = L(\Theta; \mathbf{X}, \mathbf{z}) = \log\left(\frac{\varepsilon}{1-\varepsilon}\right) \sum_{i=1}^{n}\sum_{k=1}^{g} z_{ik}\, D(x_i, a_k) + nd \log(1-\varepsilon)$$
where $D(x_i, a_k) = \sum_{j=1}^{d} |x_{ij} - a_{kj}|$
The parameter ε is fixed for each cluster and for each variable; as $\log(\frac{\varepsilon}{1-\varepsilon}) \leq 0$, this maximization leads to the minimization of
$$W(\mathbf{z}, \mathbf{a}) = \sum_{i=1}^{n}\sum_{k=1}^{g} z_{ik}\, D(x_i, a_k), \qquad \mathbf{a} = (a_1, \ldots, a_g)$$
where $\alpha_{kjh}$ is the probability that the variable j takes the category h when an object belongs to the cluster k.
Notation
- $y_k^{jh} = \sum_i z_{ik}\, x_{ijh}$
- $y^{jh} = \sum_i x_{ijh}$
- $y_k = \sum_{j,h} y_k^{jh}$
- $y = \sum_k y_k = \sum_{i,j,h} x_{ijh} = nd$
Example
Original data, its disjunctive (one-hot) coding, and the rows reordered by cluster:

      a  b  |      a1 a2 a3 b1 b2 b3  |       a1 a2 a3 b1 b2 b3
i1    1  2  | i1    1  0  0  0  1  0  | i3    0  1  0  0  0  1
i2    3  2  | i2    0  0  1  0  1  0  | i7    0  0  1  0  0  1
i3    2  3  | i3    0  1  0  0  0  1  | i9    0  1  0  0  1  0
i4    1  1  | i4    1  0  0  1  0  0  | i10   0  1  0  0  0  1
i5    1  2  | i5    1  0  0  0  1  0  | i1    1  0  0  0  1  0
i6    3  2  | i6    0  0  1  0  1  0  | i4    1  0  0  1  0  0
i7    3  3  | i7    0  0  1  0  0  1  | i5    1  0  0  0  1  0
i8    1  1  | i8    1  0  0  1  0  0  | i8    1  0  0  1  0  0
i9    2  2  | i9    0  1  0  0  1  0  | i2    0  0  1  0  1  0
i10   2  3  | i10   0  1  0  0  0  1  | i6    0  0  1  0  1  0
Given $\alpha_{kjh} = \frac{y_k^{jh}}{\#z_k}$, it can be shown that CEM maximizes the mutual information
$$I(\mathbf{z}, J) = \sum_{k,j,h} \frac{y_k^{jh}}{y} \log \frac{y_k^{jh}\, y}{y_k\, y^{jh}}$$
This expression is very close to
$$\chi^2(\mathbf{z}, J) = \sum_{k,j,h} \frac{(y_k^{jh}\, y - y_k\, y^{jh})^2}{y_k\, y^{jh}\, y}$$
Assuming that X derives from the latent class model with equal proportions, the maximization of $L_C(\mathbf{z}; \Theta)$ is approximately equivalent to using k-means with the $\chi^2$ metric (course 2).
Parsimonious model
The number of parameters in the latent class model is equal to $(g-1) + g \sum_j (m_j - 1)$, where $m_j$ is the number of categories of variable j.
This number is much smaller than the $\prod_j m_j$ required by the complete log-linear model.
birds dataset
library(Rmixmod)
data(birds)
dim(birds)
birds
xem.birds <- mixmodCluster(birds, 2)
summary(xem.birds)
Example
****************************************
* number of modalities = 2 4 5 5 3
****************************************
*** Cluster 1
* proportion = 0.6544
* center = 1.0000 3.0000 1.0000 1.0000 1.0000
* scatter =
| 0.4937 0.4937 |
| 0.0761 0.0063 0.1741 0.0917 |
| 0.1521 0.1391 0.0043 0.0043 0.0043 |
| 0.0390 0.0045 0.0043 0.0259 0.0043 |
| 0.0577 0.0288 0.0289 |
****************************************
*** Cluster 2
* proportion = 0.3456
* center = 2.0000 2.0000 2.0000 2.0000 1.0000
* scatter =
| 0.4280 0.4280 |
| 0.1203 0.1463 0.0153 0.0107 |
| 0.0509 0.0751 0.0080 0.0080 0.0080 |
| 0.3641 0.5495 0.1288 0.0485 0.0080 |
| 0.1074 0.0940 0.0134 |
****************************************
plot(xem.birds)
# Bigger symbol means that observations are similar
barplot(xem.birds)
# Description
[Figure: barplots of the conditional frequencies per cluster (C1, C2) against the unconditional frequencies for each variable (gender, eyebrow, collar, ...), and the Multiple Correspondence Analysis factor map (Axis 1 vs Axis 2).]
Model selection
****************************************
*** BEST MODEL OUTPUT:
*** According to the BIC criterion
****************************************
* nbCluster = 2
* model name = Binary-pk-Ekjh
* criterion = BIC(518.9159)
* likelihood = -198.0634
****************************************
$$D(x_i, a_k) = \sum_j \log\left(\frac{1-\varepsilon_j}{\varepsilon_j}\,(m_j - 1)\right)\delta(x_{ij}, a_{kj})$$
If all variables have the same number of categories, the criterion to minimize is $\sum_{i,k} z_{ik}\, D(x_i, a_k)$ — why?
The CEM algorithm is then an extension of k-modes
Contingency table
We can associate a multinomial model (Govaert and Nadif, 2007); the mixture density is then
$$f(x_i; \Theta) = B \sum_k \pi_k\, \alpha_{k1}^{x_{i1}} \cdots \alpha_{kd}^{x_{id}} \qquad (B \text{ does not depend on } \Theta)$$
Dropping $\log(B)$, we have $L_C(\mathbf{z}, \Theta) = \sum_i \sum_k z_{ik} \left(\log \pi_k + \sum_j x_{ij} \log(\alpha_{kj})\right)$
The mutual information quantifying the information shared between $\mathbf{z}$ and $J$:
$$I(\mathbf{z}, J) = \sum_{k,j} f_{kj} \log\left(\frac{f_{kj}}{f_{k.}\, f_{.j}}\right)$$
Then we have
$$I(\mathbf{z}, J) \approx \frac{1}{2N} \chi^2(\mathbf{z}, J)$$
When the proportions are assumed equal, the maximization of $L_C(\mathbf{z}, \Theta)$ is equivalent to the maximization of $I(\mathbf{z}, J)$ and approximately equivalent to the maximization of $\chi^2(\mathbf{z}, J)$
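The approximation $I(\mathbf{z},J) \approx \chi^2(\mathbf{z},J)/(2N)$ can be checked on a small contingency table. The sketch below (illustrative Python, assuming counts $n_{kj}$ in a clusters-by-categories table) computes both sides:

```python
import math

def mutual_info_and_chi2(T):
    """Given a clusters-by-categories contingency table T (list of lists of
    counts), return (I, chi2/(2N)): the mutual information and its
    chi-squared approximation."""
    N = sum(sum(row) for row in T)
    row_tot = [sum(row) for row in T]
    col_tot = [sum(T[k][j] for k in range(len(T))) for j in range(len(T[0]))]
    I = chi2 = 0.0
    for k, row in enumerate(T):
        for j, n_kj in enumerate(row):
            e = row_tot[k] * col_tot[j] / N   # expected count under independence
            if n_kj > 0:
                I += (n_kj / N) * math.log(n_kj / e)
            chi2 += (n_kj - e) ** 2 / e
    return I, chi2 / (2 * N)

# A table with mild dependence: the two quantities are close.
T = [[30, 20, 10], [10, 20, 30]]
I, approx = mutual_info_and_chi2(T)
print(round(I, 4), round(approx, 4))  # 0.0872 0.0833
```

The approximation comes from the second-order Taylor expansion of $x \log x$ around the independence counts, so it is tight when the deviations from independence are moderate.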
$$\Sigma_k = \lambda_k D_k A_k D_k^{\top}$$
- $\lambda_k = |\Sigma_k|^{1/d}$, a positive real, represents the volume of the k-th component
- $A_k = \operatorname{Diag}(a_{k1}, \ldots, a_{kd})$, whose elements are proportional to the eigenvalues of $\Sigma_k$, defines the shape of the k-th cluster
- $D_k$, formed by the eigenvectors, defines the direction of the k-th cluster
Remark: number of parameters to estimate: $(g-1) + g \times d + g \times \frac{d(d+1)}{2}$
Finally we obtain 28 models; we will study the problem of the choice among these models.
See for instance mclust and Rmixmod.
mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models, by Luca Scrucca, Michael Fop, T. Brendan Murphy and Adrian E. Raftery, 2016
Library mclust
Spherical models: by fixing $A_k = I$, the clusters have spherical shapes, that is, the variances of all the variables are equal within the same class.
Diagonal models: by taking the matrices $D_k$ diagonal, we force the clusters to be aligned with the axes. This is in fact the hypothesis of conditional independence, in which the variables are independent of each other within the same class.
General models: by imposing equality constraints on the $A_k$, the $D_k$ or the $\lambda_k$, we can generate 8 different models.
Example 1
library(mclust)
data(diabetes)
class <- diabetes$class
table(class)
# class
# Chemical Normal Overt
# 36 76 33
X <- diabetes[,-1]
head(X)
library(FactoMineR) # PCA() comes from FactoMineR
res.pca=PCA(X)
clPairs(X, class)
res.mclust <- Mclust(X,3)
summary(res.mclust)
table(res.mclust$class,diabetes$class)
res.kmeans=kmeans(X,3,nstart=100)
table(res.kmeans$cluster,diabetes$class)
Example 2
data(wine, package = "gclus")
Class <- factor(wine$Class, levels = 1:3,labels = c("Barolo", "Grignolino", "Barbera"))
X <- data.matrix(wine[,-1])
mod <- Mclust(X)
summary(mod$BIC)
summary(mod)
table(Class, mod$classification)
adjustedRandIndex(Class, mod$classification)
CEM
In the clustering step, each $x_i$ is assigned to the cluster maximizing $\tilde{z}_{ik} \propto \pi_k \varphi(x_i; \mu_k, \Sigma_k)$, or equivalently the cluster that minimizes $-\log(\pi_k \varphi(x_i; \mu_k, \Sigma_k))$.
Note that when the proportions are supposed equal and the covariance matrices identical ($\Sigma_k = \Sigma$), the assignment is based only on the Mahalanobis distance $D^2_{\Sigma^{-1}}(x_i; \mu_k)$.
When the proportions are supposed equal and for the spherical model $[\lambda I]$ ($\Sigma_k = I$), one uses the usual Euclidean distance $D^2(x_i; \mu_k)$.
Description of CEM
E-step: classical. C-step: each cluster $z_k$ is obtained by using $D^2(x_i; \mu_k)$.
M-step: given the partition $\mathbf{z}$, we have to determine the parameter $\Theta$ maximizing
$$L_C(\mathbf{z}, \Theta) = L(\Theta; \mathbf{X}, \mathbf{z}) = \sum_{i,k} z_{ik} \log\left(\pi_k \varphi(x_i; \alpha_k)\right)$$
- The parameter $\mu_k$ is thus necessarily the center $\mu_k = \frac{\sum_i z_{ik}\, x_i}{\#z_k}$
- The proportions satisfy $\pi_k = \frac{\#z_k}{n}$
- The matrices $\Sigma_k$ must then minimize, for the general model,
$$F(\Sigma_1, \ldots, \Sigma_g) = \sum_k \left(\operatorname{trace}(W_k \Sigma_k^{-1}) + \#z_k \log|\Sigma_k|\right)$$
For the spherical model with equal proportions, the classification log-likelihood reduces to
$$L_C(\mathbf{z}; \Theta) = -\frac{nd}{2} \operatorname{trace}(W) + \text{cste} = -\frac{nd}{2} W(\mathbf{z}) + \text{cste}$$
Maximizing $L_C$ is then equivalent to minimizing the SSQ criterion minimized by the k-means algorithm
Interpretation
- The use of the model $[\lambda I]$ assumes that the clusters are spherical, with the same proportion and the same volume
- CEM is therefore an extension of the k-means algorithm
Description of EM
E-step: classical
M-step: we have to determine the parameter $\Theta$ maximizing $Q(\Theta, \Theta')$, which takes the following form
$$Q(\Theta, \Theta') = \sum_{i,k} \tilde{z}_{ik} \log\left(\pi_k \varphi(x_i; \alpha_k)\right)$$
- The parameter $\mu_k$ is thus necessarily the center $\mu_k = \frac{\sum_i \tilde{z}_{ik}\, x_i}{\sum_i \tilde{z}_{ik}}$
- The proportions satisfy $\pi_k = \frac{\sum_i \tilde{z}_{ik}}{n}$
- The matrices $\Sigma_k$ must then minimize
$$F(\Sigma_1, \ldots, \Sigma_g) = \sum_k \left(\operatorname{trace}(W_k \Sigma_k^{-1}) + \#z_k \log|\Sigma_k|\right)$$
Example: https://sandipanweb.wordpress.com/2016/07/30/image-clustering-with-gmm-em-soft-clustering-in-r/
Algorithms
Log-likelihood:
$$L(\Theta; \mathbf{X}) = \sum_i \log\left(\sum_k \pi_k \varphi(x_i | \mu_k, \kappa_k)\right)$$
EM
E-step: finds the conditional expectation $\tilde{z}_{ik} = E(z_{ik} = 1 | x_i, \Theta^{(t)})$
M-step: finds the new parameters $\Theta^{(t+1)}$ maximizing $Q(\Theta, \Theta^{(t)}) = E\left(L(\Theta; \mathbf{X}, \mathbf{z}) | \mathbf{X}, \Theta^{(t)}\right)$ s.t. $\sum_k \pi_k = 1$, $\|\mu_k\| = 1$ and $\kappa_k > 0$
Hypotheses: if $\forall k,\; \pi_k = 1/g$ and $\kappa_k = \kappa$, the maximization of $L_C(\mathbf{z}; \Theta)$ and of $\sum_{i,k} z_{ik} \cos(x_i, \mu_k)$ are equivalent
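Under these hypotheses, the hard-assignment (CEM) version is a spherical k-means: assign each unit vector to the most-aligned center by cosine similarity, then renormalize the mean direction of each cluster. An illustrative Python sketch (not the course's code; data simulated on the unit circle):

```python
import math
import random

def spherical_kmeans(X, init_mus, n_iter=30):
    """CEM for a von Mises-Fisher mixture with equal proportions and a common
    concentration kappa: the C-step maximizes cos(x_i, mu_k), the M-step
    replaces each center by the normalized mean direction of its cluster."""
    def normalize(v):
        n = math.sqrt(sum(c * c for c in v)) or 1.0
        return [c / n for c in v]
    X = [normalize(x) for x in X]
    mus = [normalize(m) for m in init_mus]
    for _ in range(n_iter):
        clusters = [[] for _ in mus]
        for x in X:  # C-step: argmax_k <x, mu_k>, i.e. maximum cosine similarity
            k = max(range(len(mus)),
                    key=lambda k: sum(a * b for a, b in zip(x, mus[k])))
            clusters[k].append(x)
        for k, c in enumerate(clusters):  # M-step: normalized mean direction
            if c:
                mus[k] = normalize([sum(col) / len(c) for col in zip(*c)])
    return mus

# Two directional clusters on the unit circle, around angles 0 and pi/2.
random.seed(4)
angles = ([random.gauss(0, 0.2) for _ in range(100)]
          + [random.gauss(math.pi / 2, 0.2) for _ in range(100)])
X = [[math.cos(a), math.sin(a)] for a in angles]
mus = spherical_kmeans(X, [X[0], X[-1]])  # seeds taken from each cluster
```

The centers converge to unit vectors near the directions (1, 0) and (0, 1).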
Steps of SAEM
The aim of SAEM is to reduce the "random" part in the estimation of the parameters
SAEM is based on SEM and EM
Solution:
E-step: as in EM, SEM and CEM
S-step: as in SEM
M-step: the computation of the parameters depends on the expression
$$\Theta^{(t+1)} = \gamma^{(t+1)} \Theta_{SEM}^{(t+1)} + (1 - \gamma^{(t+1)}) \Theta_{EM}^{(t+1)}$$
The initial value of $\gamma$ is 1, and it decreases to 0.
Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion
Different approaches
In the finite mixture model, the problem of the choice of the model includes the problem of the number of clusters
We distinguish the two problems: here we consider the model fixed while the number of clusters g is unknown.
Let $M_A$ and $M_B$ be two models; $\Theta(M_A)$ and $\Theta(M_B)$ denote the "domains" of free parameters. If $L_{max}(M) = L(\hat{\theta}_M)$ where $\hat{\theta}_M = \operatorname{argmax}_\theta L(\theta)$, then $\Theta(M_A) \subseteq \Theta(M_B)$ implies $L_{max}(M_A) \leq L_{max}(M_B)$.
For example $L_{max}[\pi_k \lambda_k I]_{g=2} \leq L_{max}[\pi_k \lambda_k I]_{g=3}$. Generally the likelihood increases with the number of clusters.
First solution: plot the likelihood against the number of clusters and use the elbow
Second solution: minimize (or maximize its opposite) one of the classical penalized-likelihood criteria (criteria in competition)
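The penalized criteria referred to here are not written out on the slide; as one standard instance (my assumption, consistent with the BIC output shown in the mclust examples), BIC penalizes the maximized log-likelihood by the number of free parameters:

```python
import math

def bic(loglik_max, n_free_params, n_obs):
    """BIC in the 'smaller is better' convention:
    BIC = -2 * max log-likelihood + (number of free parameters) * log(n).
    (mclust reports the opposite sign, so there 'larger is better'.)"""
    return -2 * loglik_max + n_free_params * math.log(n_obs)

def n_params_full_gaussian_mixture(g, d):
    """Free parameters of a g-component, d-dimensional Gaussian mixture with
    unconstrained covariances: (g-1) proportions + g*d means + g*d(d+1)/2."""
    return (g - 1) + g * d + g * d * (d + 1) // 2

# Comparing two candidate numbers of clusters on n = 500 points in d = 2:
# the larger model always reaches a higher likelihood (hypothetical values
# here), but BIC penalizes its extra parameters.
print(n_params_full_gaussian_mixture(2, 2))          # 11
print(bic(-1000.0, 11, 500) < bic(-995.0, 17, 500))  # True: g = 2 is preferred
```

The log-likelihood values above are hypothetical; in practice they come from the EM fit of each candidate model.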
Example
library(mclust)
res=Mclust(X)
plot(res)
summary(res)
[Figure: left, the data X[,1] vs X[,2] colored by the selected clustering; right, BIC curves against the number of clusters (1 to 9) for the 14 covariance models (EII, VII, EEI, VEI, EVI, VVI, EEE, EVE, VEE, VVE, EEV, VEV, EVV, VVV).]
Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion
Mixture-based discriminant analysis models assume that the density for each class follows
a Gaussian mixture distribution. A Gaussian mixture model for the kth class
(k = 1, . . . , K ) has density
$$f_k(x_i; \Theta) = \sum_{g=1}^{G_k} \pi_{gk}\, \varphi(x_i; \mu_{gk}, \Sigma_{gk})$$
EDDA: Eigenvalue Decomposition Discriminant Analysis assumes that the density for each class can be described by a single Gaussian component (Bensmail and Celeux, 1996), i.e. $G_k = 1$ for all k, with the component covariance structure factorised as
$$\Sigma_k = \lambda_k D_k A_k D_k^{\top}$$
#Then, we may randomly assign approximately 2/3 of the observations to the training
#set, and the remaining ones to the test set:
set.seed(123)
train <- sample(1:nrow(X), size = round(nrow(X)*2/3), replace = FALSE)
X.train <- X[train,]
dim(X.train)
summary(X.train)
Class.train <- Class[train]
table(Class.train)
#Class.train B M
X.test <- X[-train,]
Class.test <- Class[-train]
table(Class.test)
MclustDA
# The function MclustDA() provides fitting capabilities for the EDDA model,
# but we must specify the optional argument modelType = "EDDA".
# The function call is thus the following:
mod1 <- MclustDA(X.train, Class.train, modelType = "EDDA")
# A cross-validation error can also be computed using the cvMclustDA() function,
# which by default uses nfold = 10 for a 10-fold cross-validation:
cv <- cvMclustDA(mod1)
unlist(cv[c("error", "se")])
MclustDA
Components by class $G_k$
EDDA imposes a single mixture component for each group. However, in certain circumstances more complexity may improve performance. A more general approach, called MclustDA, has been proposed by Fraley and Raftery (2002), where a finite mixture of Gaussian distributions is used within each class, with the number of components and the covariance matrices (expressed following the usual decomposition) possibly different across classes. This is the default model fitted by MclustDA:
[Figure: pairwise scatter plots of texture.mean, area.extreme and smoothness.extreme, with the classes and the fitted MclustDA model.]
MclustDR
Another interesting graph can be obtained by projecting the data on a dimension reduced
subspace (Scrucca, 2014) with the commands:
[Figure: data projected on the reduced subspace estimated by MclustDR (Dir1 on the horizontal axis).]
Conclusion
Outline
1 Finite Mixture Model
Definition of the model
Example
Different approaches
2 ML and CML approaches
EM algorithm
CEM algorithm
3 Applications
Bernoulli mixture
Multinomial Mixture
Gaussian mixture model
Directional data
Other variants of EM
4 Model Selection
5 Mixture model for classification
6 Conclusion
Conclusion
The finite mixture approach is well suited to clustering and classification
The CML approach gives interesting criteria and generalizes the classical ones
The different variants of EM offer good solutions
The CEM algorithm is an extension of k-means and other variants
The choice of the model is performed by using the maximum likelihood penalized by the number of parameters
See mclust and Rmixmod
There are other mixture models adapted to the nature of the data
Next
Co-clustering
Factorization, modularity and latent block models
Course-4