
Multivariate Statistics

Statistics: Module 8

Cluster Analysis

• Arrange the objects in a number of groups, called clusters

• Objects in the same group are close together or similar

• Objects in different groups are far apart or dissimilar

Clustering methods
• Hierarchical methods
→ Agglomerative
→ Divisive

• Partitioning methods
→ Nonparametric
→ Model-based

Scope of application
• Some methods only work on data matrices with
Euclidean (k-means) or Mahalanobis distances
(model-based)

• Other methods can be used on any
distance/dissimilarity matrix (hierarchical,
k-medoids)

Agglomerative hierarchical clustering
• Does not lead to a particular number of clusters

• Starts with each object in its own cluster

• At each step joins two clusters

• Ends with all objects in 1 cluster
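
The steps above can be sketched as a short loop. This is a naive illustration rather than the algorithm behind any particular package: the function names, the 1-D data, and the rule used to compare two clusters (smallest pairwise distance) are all assumptions made for the example.

```python
def agglomerate(objects, cluster_dissimilarity):
    """Naive agglomerative clustering; returns the fusions in order."""
    # Start with every object in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(objects))]
    fusions = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest dissimilarity.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = cluster_dissimilarity(clusters[a], clusters[b], objects)
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        # Record the fusion and replace the two clusters by their union.
        fusions.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return fusions


def closest_pair(q, r, objects):
    # Placeholder rule (an assumption): smallest pairwise distance.
    return min(abs(objects[i] - objects[j]) for i in q for j in r)


values = [1.0, 2.0, 6.0, 7.0, 15.0]  # hypothetical 1-D data
fusions = agglomerate(values, closest_pair)
heights = [h for _, _, h in fusions]
print(heights)  # one height per fusion, n - 1 fusions in total
```

With n objects there are always n − 1 fusions, and the final fusion contains every object. No number of clusters is chosen along the way.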

Intercluster dissimilarity
• At each step join the two closest or most similar
clusters

• How to measure distance or dissimilarity
between clusters?

→ Several methods exist

Single linkage
Consider two clusters Q and R.
The single linkage intercluster dissimilarity between
clusters Q and R:

\[ d(Q, R) = \min_{i \in Q,\, j \in R} d(i, j) \]

Complete linkage
Consider two clusters Q and R.
The complete linkage intercluster dissimilarity
between clusters Q and R:

\[ d(Q, R) = \max_{i \in Q,\, j \in R} d(i, j) \]

Group average
Consider two clusters Q and R.
The group average intercluster dissimilarity between
clusters Q and R:

\[ d(Q, R) = \frac{1}{n_Q n_R} \sum_{i \in Q,\, j \in R} d(i, j) \]
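
To compare the single, complete and group average definitions side by side, here is a minimal sketch that evaluates all three on the same pair of clusters. The function names, the absolute-difference dissimilarity `d`, and the two small clusters are assumptions chosen for illustration.

```python
def single_linkage(Q, R, d):
    # Smallest dissimilarity over all pairs (i in Q, j in R).
    return min(d(i, j) for i in Q for j in R)


def complete_linkage(Q, R, d):
    # Largest dissimilarity over all pairs (i in Q, j in R).
    return max(d(i, j) for i in Q for j in R)


def group_average(Q, R, d):
    # Mean dissimilarity over all n_Q * n_R pairs.
    return sum(d(i, j) for i in Q for j in R) / (len(Q) * len(R))


def d(a, b):
    # Hypothetical dissimilarity between two 1-D objects.
    return abs(a - b)


Q = [1.0, 2.0]
R = [5.0, 9.0]
print(single_linkage(Q, R, d))    # 3.0
print(complete_linkage(Q, R, d))  # 8.0
print(group_average(Q, R, d))     # 5.5
```

The four pairwise dissimilarities here are 4, 8, 3 and 7, so the group average always lies between the single and complete linkage values.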

Centroid linkage
• Based on Euclidean distances

• Consider two clusters Q and R.
The centroid linkage intercluster dissimilarity
between clusters Q and R:

\[ d(Q, R) = d_E(\bar{x}(Q), \bar{x}(R)) \]

Ward’s method
• Based on Euclidean distances

• Consider two clusters Q and R.
Ward’s intercluster dissimilarity between clusters
Q and R:

\[ d^2(Q, R) = \frac{2 n_Q n_R}{n_Q + n_R} \, d_E^2(\bar{x}(Q), \bar{x}(R)) \]

• Takes cluster sizes into account
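
The centroid and Ward definitions can be checked numerically with a small sketch. The helper names and the two clusters below are assumptions for illustration. The last function computes the within-cluster sum of squares, so that the effect of the size factor in Ward's formula can be seen directly.

```python
def centroid(cluster):
    # Coordinate-wise mean of the points in a cluster.
    n = len(cluster)
    return [sum(x[k] for x in cluster) / n for k in range(len(cluster[0]))]


def sq_euclid(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def centroid_linkage(Q, R):
    # Euclidean distance between the two cluster means.
    return sq_euclid(centroid(Q), centroid(R)) ** 0.5


def ward_squared(Q, R):
    # Squared Ward dissimilarity as defined on the slide.
    nq, nr = len(Q), len(R)
    return 2 * nq * nr / (nq + nr) * sq_euclid(centroid(Q), centroid(R))


def within_ss(cluster):
    # Sum of squared distances to the cluster mean.
    c = centroid(cluster)
    return sum(sq_euclid(x, c) for x in cluster)


Q = [(0.0, 0.0), (2.0, 0.0)]  # hypothetical clusters
R = [(4.0, 4.0)]
increase = within_ss(Q + R) - within_ss(Q) - within_ss(R)
print(centroid_linkage(Q, R))  # 5.0
print(ward_squared(Q, R), 2 * increase)
```

In this example the squared Ward dissimilarity equals twice the increase in within-cluster sum of squares caused by the fusion, which is one way to read "takes cluster sizes into account".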

Displaying a hierarchical clustering
• Dendrogram: clustering tree

• Banner

Dendrogram
• Root: all objects in one cluster

• Leaves: Each object is its own cluster

• Branches: Fusion of two clusters

• Heights: intercluster dissimilarity at each fusion
stage

Example: Life expectancy
Single linkage
[Figure: dendrogram of the life data, agnes (*, "single"); leaves are countries; vertical axis: Height, 0 to 20]

Example: Life expectancy
Complete linkage
[Figure: dendrogram of the life data, agnes (*, "complete"); leaves are countries; vertical axis: Height, 0 to 60]

Example: Life expectancy
Group average
[Figure: dendrogram of the life data, agnes (*, "average"); leaves are countries; vertical axis: Height, 0 to 40]

Example: Life expectancy
Ward
[Figure: dendrogram of the life data, agnes (*, "ward"); leaves are countries; vertical axis: Height, 0 to 80]

Banner
• Left: Each object is its own cluster

• Right: all objects in one cluster

• Fusions are shown from left to right

• Length: intercluster dissimilarity at each fusion
stage

Example: Life expectancy
Single linkage
[Figure: banner plot, one row per country; horizontal axis: Height, 0 to 21.9]

Example: Life expectancy
Complete linkage
[Figure: banner plot, one row per country; horizontal axis: Height, 0 to 60]

Example: Life expectancy
Group average
[Figure: banner plot, one row per country; horizontal axis: Height, 0 to 46]

Example: Life expectancy
Ward
[Figure: banner plot, one row per country; horizontal axis: Height, 0 to 88]

K-means
• Based on Euclidean distances
→ Standardize the variables

• Partitions the observations into k groups
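
Standardizing matters because Euclidean distance is dominated by variables measured on large scales. A minimal sketch follows, assuming z-scores (subtract the column mean, divide by the sample standard deviation); the slides do not say which scaling is intended, and the data are hypothetical.

```python
import statistics


def standardize(rows):
    """Return rows with each column centered and scaled to unit sample sd."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # n - 1 denominator
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


# Hypothetical data: the second variable has a much larger scale.
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
z = standardize(rows)
print(z)
```

After standardization both columns have mean 0 and standard deviation 1, so neither dominates the distance computation.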

K-means objective
Minimize the sum of squared Euclidean distances
between the objects and their group means over all
possible partitions of the data into k groups
C_1, ..., C_k:

\[ \min_{C_1,\dots,C_k} \sum_{j=1}^{k} \sum_{i \in C_j} d_E^2(x_i, \bar{x}(C_j)) \]

→ Complex optimization problem

→ In general only an approximate solution is attainable
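
One common heuristic for this objective alternates between assigning objects to the nearest mean and recomputing the means. The sketch below assumes that scheme; the slides do not say which algorithm was used. The function names, the four data points, and both sets of starting centers are hypothetical.

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean_point(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def assign(X, centers):
    # Label each object with the index of its nearest center.
    return [min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
            for x in X]


def kmeans(X, start_centers, max_iter=100):
    centers = [list(map(float, c)) for c in start_centers]
    for _ in range(max_iter):
        labels = assign(X, centers)
        new_centers = []
        for j in range(len(centers)):
            members = [x for x, lab in zip(X, labels) if lab == j]
            # Keep the old center if a group happens to be empty.
            new_centers.append(mean_point(members) if members else centers[j])
        if new_centers == centers:  # no change: converged
            break
        centers = new_centers
    labels = assign(X, centers)
    wss = sum(sq_dist(x, centers[lab]) for x, lab in zip(X, labels))
    return labels, centers, wss


X = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = kmeans(X, [(0, 0), (10, 0)])
poor = kmeans(X, [(0, 0), (0, 2)])
print(good[2], poor[2])  # within-group sums of squares of the two runs
```

The two runs converge to different partitions with different objective values, which illustrates why the result is only a local solution and why several starting values are usually tried.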

Example: Pottery
Data on the chemical composition of 45 specimens of
Romano-British pottery, determined by atomic
absorption spectrophotometry for nine oxides.

Example: Number of clusters
[Figure: within group sum of squares (vertical axis, about 50 to 250) against number of groups (horizontal axis, 1 to 6)]

→ 3 clusters

Example: Three clusters
[Figure: CLUSPLOT( pottery.data ); horizontal axis: Component 1, vertical axis: Component 2]
These two components explain 74.75 % of the point variability.

Example: cluster means
          Cluster 1   Cluster 2   Cluster 3
AL2O3         17.75       12.44       16.92
FE2O3          1.61        6.21        7.43
MGO            0.64        4.78        1.84
CAO            0.04        0.21        0.94
NA2O           0.05        0.23        0.35
K2O            2.02        4.19        3.10
TIO2           1.02        0.68        0.94
MNO            0.00        0.12        0.07
BAO            0.02        0.02        0.02

K-medoids
• Partitions the observations into k groups

• Works for any dissimilarity matrix

• A medoid is the most central object in a cluster

K-medoid objective
Minimize the sum of dissimilarities between the
objects and the closest medoid over all possible
choices of k medoids m_1, ..., m_k from the data:

\[ \min_{m_1,\dots,m_k} \sum_{i=1}^{n} \min_{t=1,\dots,k} d(x_i, m_t) \]

→ Complex optimization problem

→ In general only an approximate solution is attainable
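
The objective needs only a dissimilarity matrix, which the following sketch makes explicit. It evaluates the objective for a given set of medoids and, purely for illustration, finds the best set by trying every subset of size k. That exhaustive search is an assumption of this example and is practical only for very small data; it is not the PAM heuristic. The function names and the six 1-D objects are hypothetical.

```python
from itertools import combinations


def medoid_objective(D, medoids):
    # Sum over objects of the dissimilarity to the closest medoid.
    return sum(min(D[i][m] for m in medoids) for i in range(len(D)))


def best_medoids_exhaustive(D, k):
    # Try every subset of k objects as medoids (tiny data sets only).
    return min(combinations(range(len(D)), k),
               key=lambda ms: medoid_objective(D, ms))


# Hypothetical objects on a line; D is their dissimilarity matrix.
points = [1, 2, 3, 10, 11, 12]
D = [[abs(a - b) for b in points] for a in points]

medoids = best_medoids_exhaustive(D, 2)
print(medoids, medoid_objective(D, medoids))
```

The chosen medoids are the middle objects of the two groups, matching the description of a medoid as the most central object in its cluster.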

Example: Pottery
[Figure: PAM objective function (vertical axis, about 1.8 to 2.8) against number of groups (horizontal axis, 2 to 6)]

→ 3 clusters

Example: Three clusters
[Figure: CLUSPLOT( pottery.data ); horizontal axis: Component 1, vertical axis: Component 2]
These two components explain 74.75 % of the point variability.

Example: cluster medoids
          Cluster 1   Cluster 2   Cluster 3
AL2O3         16.90       18.00       11.10
FE2O3          7.33        1.50        5.49
MGO            1.65        0.67        4.52
CAO            0.84        0.01        0.29
NA2O           0.40        0.06        0.30
K2O            3.05        2.11        4.03
TIO2           0.99        0.92        0.63
MNO            0.07        0.00        0.08
BAO            0.02        0.02        0.02

Displaying a partitioning clustering
Silhouette plot

• Silhouette value s(i) of an object reflects how
well it is clustered
→ s(i) ≈ 1: well classified
→ s(i) ≈ 0: intermediate between two clusters
→ s(i) ≈ −1: badly classified

• Plot silhouette values
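
The slides do not give the formula for s(i). The sketch below uses the standard definition s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the average dissimilarity of object i to the other members of its own cluster and b(i) is the smallest average dissimilarity to any other cluster. The function name, the convention s(i) = 0 for singleton clusters, and the data are assumptions of this example.

```python
def silhouette_values(D, labels):
    """Silhouette value of each object from a dissimilarity matrix D."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    values = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # convention for a singleton cluster
            continue
        # a: average dissimilarity to the other members of its own cluster
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest average dissimilarity to any other cluster
        b = min(sum(D[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        values.append((b - a) / max(a, b))
    return values


points = [0, 1, 10, 11]  # hypothetical 1-D objects
D = [[abs(p - q) for q in points] for p in points]
s = silhouette_values(D, [0, 0, 1, 1])
print([round(v, 3) for v in s])
print(round(sum(s) / len(s), 3))  # average silhouette width
```

Averaging s(i) within each cluster and over all objects gives the summary numbers usually reported next to a silhouette plot.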

Example: Pottery
[Figure: silhouette plot of pam(x = pottery.data, k = 3, stand = T); horizontal axis: silhouette width s(i), 0.0 to 1.0]
n = 45 objects, 3 clusters

Cluster j   Size n_j   Average s(i) in C_j
1           20         0.53
2           11         0.42
3           14         0.38

Average silhouette width: 0.46

Model-based clustering
• Assume the data come from a mixture of k
subpopulations

• Usually a mixture of normal distributions

• Several choices about the shape of the normal
distributions are possible

• Optimization problem (EM algorithm)
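
To make the EM step concrete, here is a minimal sketch for the simplest case: one variable and a mixture of normal components with unequal variances. The function names, the starting values, the variance floor, and the data are assumptions for illustration; the multivariate shape choices mentioned above are not covered.

```python
import math


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_mixture_1d(x, mus, variances, weights, n_iter=50):
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    loglik_trace = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each object.
        resp, loglik = [], 0.0
        for xi in x:
            dens = [weights[j] * normal_pdf(xi, mus[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([dj / total for dj in dens])
        loglik_trace.append(loglik)
        # M-step: weighted proportions, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(x)
            mus[j] = sum(r[j] * xi for r, xi in zip(resp, x)) / nj
            var = sum(r[j] * (xi - mus[j]) ** 2 for r, xi in zip(resp, x)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a zero variance
    return mus, variances, weights, loglik_trace


# Hypothetical data from two well separated groups.
x = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 4.8, 5.1]
mus, variances, weights, trace = em_mixture_1d(
    x, mus=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print([round(m, 3) for m in mus], [round(w, 3) for w in weights])
```

Each object receives a probability of belonging to each component, and a hard clustering is obtained by assigning it to the component with the largest probability.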

Example: Planet data
Measurements on 3 variables for 101 exoplanets
(outside the solar system) discovered up to October
2002.

Example: Planets
[Figure: scatterplot matrix of Mass (0 to 15), Period (0 to 5000) and Eccentricity (0.0 to 0.8)]

Example: Planets
[Figure: BIC (vertical axis, about −4500 to −2000) against number of components (horizontal axis, 2 to 8), one curve per covariance model: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV]

→ 3 clusters
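
A short sketch of how a BIC comparison works. It assumes the sign convention BIC = 2 log L − p log n, under which larger values are better; this matches the negative values on the plot, but the convention is an assumption here. The log-likelihoods and parameter counts are hypothetical numbers, not results from the planet data.

```python
import math


def bic(loglik, n_params, n_obs):
    # Assumed convention: larger BIC indicates a better model.
    return 2.0 * loglik - n_params * math.log(n_obs)


n_obs = 100
# Hypothetical fits: number of components -> (log-likelihood, free parameters)
fits = {1: (-120.0, 2), 2: (-95.0, 5), 3: (-93.0, 8)}

scores = {k: bic(ll, p, n_obs) for k, (ll, p) in fits.items()}
best_k = max(scores, key=scores.get)
print({k: round(v, 2) for k, v in scores.items()}, best_k)
```

Adding a component always raises the log-likelihood, so the penalty term is what stops the criterion from favouring ever larger models.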
Example: Planets
[Figure: scatterplot matrix of Mass (0 to 15), Period (0 to 5000) and Eccentricity (0.0 to 0.8)]

Example: cluster centers
               Cluster 1   Cluster 2   Cluster 3
Mass                1.17        6.08        1.58
Period              6.47     1325.53      313.41
Eccentricity        0.04        0.37        0.31

Determining the number of clusters
• Compare results for different values of k

• Several selection criteria exist

• Subjective!
