Multistat PV CA

Multivariate Statistics
Statistics: Module 8
Cluster Analysis
• Arrange the objects in a number of groups, called clusters
• Objects in the same group are close together or similar
• Objects in different groups are far apart or dissimilar
Clustering methods
• Hierarchical methods
→ Agglomerative
→ Divisive
• Partitioning methods
→ Nonparametric
→ Model-based
Scope of application
• Some methods only work on data matrices, with Euclidean (k-means) or Mahalanobis (model-based) distances
• Other methods can be used on any distance/dissimilarity matrix (hierarchical, k-medoids)
Agglomerative hierarchical clustering
• Does not lead to a particular number of clusters
• Starts with each object in its own cluster
• At each step, joins two clusters
• Ends with all objects in one cluster
Intercluster dissimilarity
• At each step, join the two closest or most similar clusters
• How to measure distance or dissimilarity between clusters?
→ Several methods exist
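The agglomerative procedure can be sketched as a naive loop over a dissimilarity matrix. This is only an illustration under stated assumptions, not the algorithm behind agnes: the names `agglomerate` and `closest_members` and the toy matrix are made up for the example, and the intercluster dissimilarity is passed in as a function because several choices exist.

```python
from itertools import combinations


def agglomerate(d, linkage):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix d.

    linkage(d, q, r) must return the dissimilarity between clusters q and r,
    each given as a tuple of object indices. Returns the fusions in order,
    as (cluster_q, cluster_r, height) triples.
    """
    clusters = [(i,) for i in range(len(d))]  # start: each object on its own
    merges = []
    while len(clusters) > 1:                  # end: all objects in one cluster
        # Join the two closest clusters under the chosen linkage.
        q, r = min(combinations(clusters, 2),
                   key=lambda pair: linkage(d, pair[0], pair[1]))
        merges.append((q, r, linkage(d, q, r)))
        clusters.remove(q)
        clusters.remove(r)
        clusters.append(q + r)
    return merges


def closest_members(d, q, r):
    """One possible choice: dissimilarity of the closest pair of members."""
    return min(d[i][j] for i in q for j in r)


# Toy dissimilarity matrix for four objects (made-up numbers).
d = [[0, 1, 6, 8],
     [1, 0, 5, 7],
     [6, 5, 0, 2],
     [8, 7, 2, 0]]

merges = agglomerate(d, closest_members)
for q, r, height in merges:
    print(q, r, height)  # fusion heights are 1, 2 and 5 for this matrix
```

The recorded heights are what a dendrogram or banner displays. The loop recomputes every pairwise dissimilarity at each step, which is fine for a handful of objects but far slower than what statistical software does.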
Single linkage
Consider two clusters Q and R.
The single linkage intercluster dissimilarity between clusters Q and R:
$$d(Q, R) = \min_{i \in Q,\, j \in R} d(i, j)$$
Complete linkage
The complete linkage intercluster dissimilarity between clusters Q and R:
$$d(Q, R) = \max_{i \in Q,\, j \in R} d(i, j)$$
Group average
The group average intercluster dissimilarity between clusters Q and R, where n_Q and n_R are the cluster sizes:
$$d(Q, R) = \frac{1}{n_Q n_R} \sum_{i \in Q,\, j \in R} d(i, j)$$
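A small worked example may help compare single linkage, complete linkage and group average. The function names and the toy dissimilarity matrix below are illustrative assumptions, not part of the slides.

```python
def single_linkage(d, q, r):
    """Smallest dissimilarity between a member of q and a member of r."""
    return min(d[i][j] for i in q for j in r)


def complete_linkage(d, q, r):
    """Largest dissimilarity between a member of q and a member of r."""
    return max(d[i][j] for i in q for j in r)


def group_average(d, q, r):
    """Mean of all n_Q * n_R between-cluster dissimilarities."""
    return sum(d[i][j] for i in q for j in r) / (len(q) * len(r))


# Toy dissimilarity matrix for four objects (made-up numbers).
d = [[0, 1, 6, 8],
     [1, 0, 5, 7],
     [6, 5, 0, 2],
     [8, 7, 2, 0]]
q, r = (0, 1), (2, 3)  # clusters given as tuples of object indices

print(single_linkage(d, q, r))    # 5   (objects 1 and 2)
print(complete_linkage(d, q, r))  # 8   (objects 0 and 3)
print(group_average(d, q, r))     # 6.5 = (6 + 8 + 5 + 7) / 4
```

All three need only the dissimilarity matrix, so they apply to any distance or dissimilarity measure.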
Centroid linkage
• Based on Euclidean distances
• Consider two clusters Q and R.
The centroid linkage intercluster dissimilarity between clusters Q and R:
$$d(Q, R) = d_E(\bar{x}(Q), \bar{x}(R))$$
Ward’s method
• Consider two clusters Q and R.
Ward's intercluster dissimilarity between clusters Q and R:
$$d^2(Q, R) = \frac{2 n_Q n_R}{n_Q + n_R} \, d_E^2(\bar{x}(Q), \bar{x}(R))$$
• Takes cluster sizes into account
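To see how Ward's dissimilarity differs from centroid linkage, the sketch below evaluates both on made-up points. The helper names (`centroid`, `sq_euclid`, `centroid_linkage`, `ward_sq`) are assumptions for illustration; unlike the linkages that need only a dissimilarity matrix, both functions work on the coordinates themselves.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def sq_euclid(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid_linkage(Q, R):
    """Euclidean distance between the two cluster centroids."""
    return sq_euclid(centroid(Q), centroid(R)) ** 0.5


def ward_sq(Q, R):
    """Ward's squared dissimilarity: 2 nQ nR / (nQ + nR) times the
    squared Euclidean distance between the centroids."""
    nQ, nR = len(Q), len(R)
    return 2 * nQ * nR / (nQ + nR) * sq_euclid(centroid(Q), centroid(R))


Q = [(0.0, 0.0), (2.0, 0.0)]  # centroid (1, 0)
R = [(4.0, 4.0)]              # centroid (4, 4)

print(centroid_linkage(Q, R))      # 5.0
print(ward_sq(Q, R))               # 2*2*1/3 * 25, about 33.3
print(centroid_linkage(Q, R * 3))  # still 5.0: same centroids
print(ward_sq(Q, R * 3))           # 2*2*3/5 * 25, about 60
```

Tripling the size of R leaves the centroid linkage unchanged but nearly doubles Ward's dissimilarity, which is the sense in which Ward's method takes cluster sizes into account.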
Displaying a hierarchical clustering
• Dendrogram: clustering tree
• Banner
Dendrogram
• Root: all objects in one cluster
• Leaves: each object is its own cluster
• Branches: fusion of two clusters
• Heights: intercluster dissimilarity at each fusion stage
Example: Life expectancy
Single linkage
[Figure: dendrogram of the life data, agnes (*, "single"). Height axis: 0 to 20.]
Complete linkage
[Figure: dendrogram of the life data, agnes (*, "complete"). Height axis: 0 to 60.]
Group average
[Figure: dendrogram of the life data, agnes (*, "average"). Height axis: 0 to 40.]
Ward
[Figure: dendrogram of the life data, agnes (*, "ward"). Height axis: 0 to 80.]
Banner
• Left: each object is its own cluster
• Right: all objects in one cluster
• Fusions are shown from left to right
• Length: intercluster dissimilarity at each fusion stage
Single linkage
[Figure: banner of the life data. Height axis: 0 to 21.9.]
Complete linkage
[Figure: banner of the life data. Height axis: 0 to 60.]
Group average
[Figure: banner of the life data. Height axis: 0 to 46.]
Ward
[Figure: banner of the life data. Height axis: 0 to 88.]
K-means
• Partitions the observations into k groups
→ Standardize the variables
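Standardizing matters because Euclidean distances are otherwise dominated by the variables with the largest scale. The slides do not say which standardization is meant; the sketch below assumes z-scores (subtract the column mean, divide by the sample standard deviation), and the function name `standardize` and the data are made up.

```python
from statistics import mean, stdev


def standardize(rows):
    """Return z-scores column by column (assumed choice: mean 0, sample SD 1)."""
    cols = list(zip(*rows))
    centers = [mean(c) for c in cols]
    scales = [stdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, centers, scales)]
            for row in rows]


# Two variables on very different scales (made-up numbers).
rows = [[1.0, 100.0],
        [2.0, 300.0],
        [3.0, 500.0]]
z = standardize(rows)
print(z)  # both columns become -1, 0, 1
```

After standardization the two variables contribute equally to the distances, whereas on the raw data the second variable would decide almost everything.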
K-means objective
Minimize the sum of squared Euclidean distances between the objects and their group mean over all possible partitions of the data into k groups C_1, ..., C_k:
$$\min_{C_1, \dots, C_k} \sum_{j=1}^{k} \sum_{i \in C_j} d_E^2(x_i, \bar{x}(C_j))$$
−→ Complex optimization problem
−→ Only approximate solution possible
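One widely used heuristic for this objective is Lloyd's algorithm, which alternates between assigning each object to its nearest group mean and recomputing the means. The slides do not say which algorithm is used, so the sketch below is an assumption for illustration; the function names, starting centers and data are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans(points, centers, max_iter=100):
    """Lloyd's algorithm from the given starting centers.

    Returns (labels, centers, wss), where wss is the value of the
    k-means objective (within-group sum of squares) at the solution.
    """
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest center.
        labels = [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its group.
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(mean_point(members) if members else centers[j])
        if new_centers == centers:  # nothing moved: local optimum reached
            break
        centers = new_centers
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels, centers, wss = kmeans(points, centers=[(1.0, 1.0), (8.0, 8.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(wss)     # about 2.67
```

The result is only a local optimum that depends on the starting centers, which is why the procedure is usually repeated from several random starts.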
Example: Pottery
Data on the chemical composition of 45 specimens of Romano-British pottery, determined by atomic absorption spectrophotometry for nine oxides.
Example: Number of clusters
[Figure: within-group sum of squares (axis 50 to 250) against number of groups (1 to 6).]
−→ 3 clusters
Example: Three clusters
[Figure: CLUSPLOT( pottery.data ), Component 2 against Component 1.]
These two components explain 74.75 % of the point variability.
Example: cluster means
         Clus 1   Clus 2   Clus 3
AL2O3     17.75    12.44    16.92
FE2O3      1.61     6.21     7.43
MGO        0.64     4.78     1.84
CAO        0.04     0.21     0.94
NA2O       0.05     0.23     0.35
K2O        2.02     4.19     3.10
TIO2       1.02     0.68     0.94
MNO        0.00     0.12     0.07
BAO        0.02     0.02     0.02
K-medoids
• Partitions the observations into k groups
• Works for any dissimilarity matrix
• A medoid is the most central object in a cluster
K-medoid objective
Minimize the sum of dissimilarities between the objects and their closest medoid over all possible choices of k medoids m_1, ..., m_k from the data:
$$\min_{m_1, \dots, m_k} \sum_{i=1}^{n} \min_{t = 1, \dots, k} d(x_i, m_t)$$
−→ Complex optimization problem
−→ Only approximate solution possible
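The objective itself is easy to evaluate; the difficulty lies in the number of candidate medoid sets. The sketch below computes the objective and, for a tiny made-up data set, simply tries every choice of k medoids. This brute-force search is an illustration only and is not how PAM works; the function names and the data are assumptions.

```python
from itertools import combinations


def kmedoid_objective(d, medoids):
    """Sum over all objects of the dissimilarity to the closest medoid."""
    return sum(min(d[i][m] for m in medoids) for i in range(len(d)))


def best_medoids(d, k):
    """Exhaustive search over all choices of k objects as medoids.

    Only feasible for very small n: the number of candidate sets is
    'n choose k', which is why heuristics are used in practice.
    """
    return min(combinations(range(len(d)), k),
               key=lambda ms: kmedoid_objective(d, ms))


# Six objects on a line (made-up positions); any dissimilarity would do.
positions = [0, 1, 2, 10, 11, 12]
d = [[abs(a - b) for b in positions] for a in positions]

medoids = best_medoids(d, 2)
print(medoids)                        # (1, 4): the middle object of each group
print(kmedoid_objective(d, medoids))  # 4
```

Because only the matrix d is used, the same code works for any dissimilarity, which is the main practical difference from k-means.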
Example: Pottery
[Figure: PAM objective function (axis 1.8 to 2.8) against number of groups (2 to 6).]
−→ 3 clusters
Example: Three clusters
[Figure: CLUSPLOT( pottery.data ), Component 2 against Component 1.]
These two components explain 74.75 % of the point variability.
Example: cluster medoids
         Medoid 1   Medoid 2   Medoid 3
AL2O3       16.90      18.00      11.10
FE2O3        7.33       1.50       5.49
MGO          1.65       0.67       4.52
CAO          0.84       0.01       0.29
NA2O         0.40       0.06       0.30
K2O          3.05       2.11       4.03
TIO2         0.99       0.92       0.63
MNO          0.07       0.00       0.08
BAO          0.02       0.02       0.02
Displaying a partitioning clustering
Silhouette plot
• Silhouette value s(i) of an object reflects how well it is clustered
→ s(i) ≈ 1: well classified
→ s(i) ≈ 0: intermediate between two clusters
→ s(i) ≈ −1: badly classified
• Plot silhouette values
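The slides do not spell out how s(i) is computed. The sketch below uses the standard definition s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the average dissimilarity of object i to the other members of its own cluster and b(i) is the smallest average dissimilarity to the members of any other cluster. The function name and the data are made up, and the code assumes at least two clusters.

```python
def silhouette(d, labels):
    """Silhouette value s(i) for every object, from a dissimilarity matrix.

    Assumes at least two clusters; an object alone in its cluster gets 0.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            values.append(0.0)
            continue
        a = sum(d[i][j] for j in own) / len(own)
        b = min(sum(d[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        values.append((b - a) / max(a, b))
    return values


# Six objects on a line (made-up positions).
positions = [0, 1, 2, 10, 11, 12]
d = [[abs(x - y) for y in positions] for x in positions]

good = silhouette(d, [0, 0, 0, 1, 1, 1])
bad = silhouette(d, [0, 0, 1, 1, 1, 1])  # object at position 2 misplaced
print([round(v, 2) for v in good])  # all values above 0.8: well classified
print(round(bad[2], 2))             # about -0.83: badly classified
```

Averaging s(i) within each cluster and over all objects gives the summary numbers printed next to a silhouette plot.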
Example: Pottery
[Figure: silhouette plot of pam(x = pottery.data, k = 3, stand = T); n = 45, 3 clusters. Horizontal axis: silhouette width s(i), 0.0 to 1.0.]

Cluster j   Size n_j   Average s(i) in C_j
1           20         0.53
2           11         0.42
3           14         0.38

Average silhouette width: 0.46
Model-based clustering
• Assume the data come from a mixture of k subpopulations
• Usually a mixture of normal distributions
• Several choices about the shape of the normal distributions are possible
• Optimization problem (EM algorithm)
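To make the EM idea concrete, here is a deliberately simplified sketch for a mixture of univariate normal distributions with unrestricted variances. It is an illustration under strong assumptions, not the procedure behind the examples in these slides, which are multivariate and compare several covariance shapes. The function names, starting values and data are made up.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_mixture(xs, mus, variances, weights, n_iter=50):
    """EM for a univariate normal mixture, run for a fixed number of steps."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each object.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, mus[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([v / total for v in dens])
        # M-step: weighted proportion, mean and variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - mus[j]) ** 2
                               for r, x in zip(resp, xs)) / nj
    return mus, variances, weights, resp


# Two well-separated groups (made-up numbers).
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]
mus, variances, weights, resp = em_mixture(
    xs, mus=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])

# Assign each object to the component with the highest posterior probability.
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
print([round(m, 2) for m in mus])  # close to [1.0, 5.0]
print(labels)                      # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Unlike k-means, the result is a set of posterior probabilities per object, and the fitted likelihood can be compared across numbers of components and covariance shapes with a criterion such as BIC.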
Example: Planet data
Measurements on 3 variables for 101 exoplanets
(outside the solar system) discovered up to October
2002.
Example: Planets
[Figure: scatterplot matrix of Mass, Period and Eccentricity.]
Example: Planets
[Figure: BIC (axis −4500 to −2000) against number of components (2 to 8) for the models EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV and VVV.]
−→ 3 clusters
Example: Planets
[Figure: scatterplot matrix of Mass, Period and Eccentricity.]
Example: cluster centers
               Clus 1    Clus 2    Clus 3
Mass             1.17      6.08      1.58
Period           6.47   1325.53    313.41
Eccentricity     0.04      0.37      0.31
Determining the number of clusters
• Compare results for different values of k
• Several selection criteria exist
• Subjective!