
Cluster analysis

Prof : Xavier Boute


boute@hec.fr

The 5 keys for decision


I : Data structuring

II : Data integration

III : Basic descriptive statistics

IV : Multidimensional analysis

V : Visualization
Geometrical approach

[Figure: axes X2, X5 and X6]
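Close?

Distance?

Cabbages and carrots? (comparing apples and oranges: variables measured in different units)

Z score: standardize each variable, z = (x − mean) / standard deviation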
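The standardization step can be sketched as follows. This is a minimal illustration, not the course's own code: the function name `z_scores` and the two data columns are hypothetical, and the sample standard deviation (division by n − 1) is an assumption chosen because it agrees with the relation Total = p × (n − 1) used later in the deck.

```python
from statistics import mean, stdev


def z_scores(column):
    """Standardize one variable: subtract its mean, divide by its standard deviation."""
    m = mean(column)
    s = stdev(column)  # sample SD (n - 1 denominator); an assumption, see the text above
    return [(x - m) / s for x in column]


# Hypothetical data: two variables measured in very different units.
income_eur = [20000.0, 35000.0, 50000.0, 65000.0]
age_years = [25.0, 35.0, 45.0, 55.0]

z_income = z_scores(income_eur)
z_age = z_scores(age_years)

# After standardization the two variables are on the same scale,
# and the sum of squared z scores of each variable equals n - 1.
print(z_income)
print(sum(z * z for z in z_income))
```

Once every variable is expressed in z scores, distances between cases no longer depend on the units of measurement.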

Ward Distance (distance between clusters)

[Figure: clusters C1, C2 and C3 with gravity centers g1, g2 and g3, on axes Z2, Z5 and Z6]

$$D^2(C_i; C_j) = \frac{n_i n_j}{n_i + n_j} \, d^2(g_i; g_j)$$

$g_i$ : gravity center of cluster $C_i$
$n_i$ : number of cases in cluster $C_i$
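A minimal sketch of the Ward distance defined above, assuming each cluster is a list of coordinate tuples. The function names and the two small clusters are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def gravity_center(cluster):
    """Mean point of a cluster given as a list of equal-length coordinate tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_distance(ci, cj):
    """n_i * n_j / (n_i + n_j) times the squared distance between gravity centers."""
    ni, nj = len(ci), len(cj)
    return ni * nj / (ni + nj) * squared_distance(gravity_center(ci), gravity_center(cj))


# Hypothetical clusters on a single axis.
c1 = [(0.0, 0.0), (2.0, 0.0)]  # gravity center (1, 0)
c2 = [(4.0, 0.0), (6.0, 0.0)]  # gravity center (5, 0)

print(ward_distance(c1, c2))  # 2 * 2 / 4 * 16 = 16.0
```

The value 16 is also the increase in within-cluster sum of squares caused by merging the two clusters: 20 for the merged cluster, minus 2 + 2 for the separate ones.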
Hierarchical ascending clustering

Agglomeration Schedule

       Cluster Combined                    Stage Cluster First Appears
Stage  Cluster 1  Cluster 2  Coefficients  Cluster 1  Cluster 2  Next Stage
1      1          4          .013          0          0          13
2      21         24         .067          0          0          8
3      6          20         .154          0          0          8
4      8          10         .255          0          0          5
5      8          11         .377          4          0          9
6      15         16         .506          0          0          11
7      7          14         .772          0          0          13
8      6          21         1.056         3          2          15
9      8          9          1.460         5          0          16
10     12         13         1.988         0          0          16
11     5          15         2.567         0          6          12
12     3          5          3.373         0          11         17
13     1          7          4.384         1          7          18
14     17         18         5.650         0          0          20
15     6          22         7.170         8          0          17
16     8          12         10.798        9          10         19
17     3          6          15.117        12         15         18
18     1          3          20.448        13         17         21
19     8          23         25.850        16         0          22
20     17         19         36.511        14         0          22
21     1          2          47.523        18         0          23
22     8          17         73.816        19         20         23
23     1          8          138.000       21         22         0
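How a schedule of this kind is built can be sketched with a naive implementation of hierarchical ascending clustering. This is an illustration under stated assumptions, not the software that produced the table: the four one-dimensional cases are hypothetical, only the Stage, Cluster Combined and Coefficients columns are produced, a merged cluster keeps the smaller of the two labels (as the table appears to do), and the coefficient is taken to be the cumulative sum of the Ward distances of the merges performed so far.

```python
def gravity_center(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ward_distance(ci, cj):
    ni, nj = len(ci), len(cj)
    return ni * nj / (ni + nj) * squared_distance(gravity_center(ci), gravity_center(cj))


def agglomeration_schedule(points):
    """Return rows (stage, cluster 1, cluster 2, coefficient), one per merge."""
    # Start with one cluster per case; cases are numbered from 1.
    clusters = {i + 1: [p] for i, p in enumerate(points)}
    coefficient = 0.0
    schedule = []
    while len(clusters) > 1:
        labels = sorted(clusters)
        # Pick the pair of clusters with the smallest Ward distance.
        cost, a, b = min(
            (ward_distance(clusters[a], clusters[b]), a, b)
            for idx, a in enumerate(labels)
            for b in labels[idx + 1:]
        )
        clusters[a] = clusters[a] + clusters.pop(b)  # merged cluster keeps label a
        coefficient += cost  # cumulative within-cluster sum of squares
        schedule.append((len(schedule) + 1, a, b, coefficient))
    return schedule


# Hypothetical cases on one axis (not standardized, for readability).
cases = [(0.0,), (1.0,), (5.0,), (6.0,)]
for row in agglomeration_schedule(cases):
    print(row)
```

With these four cases the last coefficient is 26, which is the total sum of squares around the overall mean, just as the last coefficient of the table (138.000) is the total.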
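ANOVA

[Figure: clusters C1, C2 and C3 with gravity centers g1, g2 and g3, on axes Z2, Z5 and Z6]

$$\sum_{i=1}^{n} d^2(z_i; 0) = \sum_{k=1}^{K} \sum_{i \in C_k} d^2(z_i; g_k) + \sum_{k=1}^{K} n_k \, d^2(g_k; 0)$$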
Quality of a partition in K clusters
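$$\underbrace{\sum_{i=1}^{n} d^2(z_i; 0)}_{\text{TOTAL}} = \underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} d^2(z_i; g_k)}_{\text{WITHIN}} + \underbrace{\sum_{k=1}^{K} n_k \, d^2(g_k; 0)}_{\text{BETWEEN}}$$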

Start (n clusters, before the first merge) : TOTAL = 0 (WITHIN) + TOTAL (BETWEEN), i.e. 100% of the information

Last stage (1 cluster) : TOTAL = TOTAL (WITHIN) + 0 (BETWEEN), i.e. 0% of the information

Quality = BETWEEN / TOTAL

Coefficient = WITHIN
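The decomposition TOTAL = WITHIN + BETWEEN can be checked numerically. The sketch below assumes centered data (each coordinate sums to 0 over all cases, as z scores do), so that the origin 0 is the overall gravity center. The function names and the four cases are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gravity_center(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def decompose(clusters):
    """Return (total, within, between) sums of squares for a partition of centered data."""
    origin = (0.0,) * len(clusters[0][0])
    total = within = between = 0.0
    for cluster in clusters:
        g = gravity_center(cluster)
        total += sum(squared_distance(z, origin) for z in cluster)
        within += sum(squared_distance(z, g) for z in cluster)
        between += len(cluster) * squared_distance(g, origin)
    return total, within, between


# Hypothetical centered data split into K = 2 clusters.
partition = [
    [(-3.0, -1.0), (-2.0, -1.0)],
    [(2.0, 1.0), (3.0, 1.0)],
]

total, within, between = decompose(partition)
print(total, within, between)  # 30.0 1.0 29.0
print(between / total)         # quality of this partition
```

Here the two clusters are tight and far apart, so almost all of the total sum of squares is between clusters and the quality is close to 1.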
Quality of a partition in K clusters

$$\text{Quality}(K) = \frac{138 - \text{Coefficient}(n - K)}{138}$$

where Coefficient(n − K) is the value in the Coefficients column at stage n − K of the Agglomeration Schedule.

Quality of a partition in 2 clusters

$$\text{Quality}(2) = \frac{138 - 73.816}{138} = 46.5\,\%$$

Total = p × (n − 1) = 138 (p = 6 standardized variables, n = 24 cases)
Within with 2 clusters = 73.816 (coefficient at stage 22)
Between with 2 clusters = 138 − 73.816
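The same computation can be applied to any number of clusters K by reading the Coefficients column of the Agglomeration Schedule. A minimal sketch: the list below copies that column, while the function name `quality` and its handling of K = n are assumptions made for illustration.

```python
# Coefficients column of the Agglomeration Schedule, stage 1 to stage 23.
coefficients = [
    0.013, 0.067, 0.154, 0.255, 0.377, 0.506, 0.772, 1.056,
    1.460, 1.988, 2.567, 3.373, 4.384, 5.650, 7.170, 10.798,
    15.117, 20.448, 25.850, 36.511, 47.523, 73.816, 138.000,
]


def quality(k, coefficients):
    """Quality of the partition in k clusters: BETWEEN / TOTAL."""
    n = len(coefficients) + 1   # n cases give n - 1 merge stages
    total = coefficients[-1]    # the last coefficient is the total sum of squares
    if k == n:
        within = 0.0            # one case per cluster: nothing within
    else:
        within = coefficients[n - k - 1]  # stage n - k (the list is 0-indexed)
    return (total - within) / total


for k in (2, 3, 4, 5):
    print(k, f"{100 * quality(k, coefficients):.1f} %")
```

Listing the quality for several values of K side by side shows how much information each additional cluster recovers, which helps in choosing where to cut the hierarchy.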
Partition in 5 clusters

$$\text{Quality}(5) = \frac{138 - 25.850}{138} \approx 81.3\,\%$$

where 25.850 is the coefficient at stage 19 (n − K = 24 − 5) of the Agglomeration Schedule.
Interpretation of the clusters

Interpretation of a qualitative variable (cluster membership) with p quantitative variables

Interpretation of the 5 clusters with original variables

Interpretation of the 5 clusters with Z Score

Interpretation of the 5 clusters with box plot

Interpretation of the 5 clusters with factors F1 F2
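One common way to read clusters from z scores is to compare the mean of each variable within each cluster: a cluster mean far from 0 marks a variable on which that cluster differs from the average case. The sketch below is an assumption about how such a profile table could be computed, not a reconstruction of the slides' figures. The function name, the data and the cluster labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def cluster_profiles(rows, labels):
    """Mean of every variable within each cluster, keyed by cluster label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: [mean(column) for column in zip(*members)]
        for label, members in sorted(groups.items())
    }


# Hypothetical z scores of four cases on two variables, with their cluster labels.
z_rows = [(-1.2, 0.4), (-0.8, 0.6), (1.1, -0.3), (0.9, -0.7)]
labels = [1, 1, 2, 2]

for label, profile in cluster_profiles(z_rows, labels).items():
    print(label, [round(value, 2) for value in profile])
```

The same function applies to the original variables; the z-score version is easier to compare across variables because all of them share the same scale.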

"You cannot teach a man anything.


You can only help him discover what he has within himself."
