
Clustering

Advanced Statistics

Josu Najera Zuloaga


jnajera@deusto.es

Universidad de Deusto

Ciencia de Datos e Inteligencia Artificial (+ Ingeniería Informática)



Contents

1 Introduction

2 Ward’s algorithm

3 K-means algorithm

4 Real practice

5 Implementation in R



1 Introduction



Multidimensional data analysis

The most typical method for representing a set of objects is a cloud of points (each point
is an object), evolving in a Euclidean space.
Euclidean refers to the fact that the distances between points are interpreted in terms of
similarities for the individuals (PCA) or categories (CA).
Another way of representing a set of objects and illustrating the links between them
(similarities) is with a hierarchical tree.

It is also called an indexed hierarchy (or dendrogram).


Example of a tree of living beings: the first node separates the animal kingdom from
plant life.
Nodes are the points where branches join.

In this unit (as in PCA) we try to analyse a data table without prior judgements.
The aim is to construct a hierarchical tree (rather than a principal component map) to
visualise the links between objects (i.e., to study the variability within the table).
The algorithms used to construct trees such as this one are known as hierarchical clustering.



Agglomerative Hierarchical Algorithms

There are many hierarchical algorithms.

The most common ones work in an agglomerative manner:

I they are called Agglomerative Hierarchical Clustering (AHC).
I they first group together the most similar objects and then successively group the resulting groups.

We will study one of the most widely used of these AHC algorithms: Ward’s algorithm.



Partitioning Algorithms

Another type of representation of the links between objects is the partition obtained by
dividing the objects into groups (each object belongs to exactly one group).

The aim is to divide the objects so that

I the individuals within each group are similar to one another,
I the individuals differ from one group to the next.

We will study the most widely used of these partitioning algorithms: the K-means algorithm.



Defining the notion of similarities

In PCA, when describing I individuals with K variables, the similarity between individuals i
and j is defined with the usual (Euclidean) distance in R^K:

d(i, j) = \sqrt{\sum_{k=1}^{K} (x_{ik} - x_{jk})^2}

However, many other distances exist apart from the Euclidean one, e.g. the Manhattan
distance (city block distance):

d(i, j) = \sum_{k=1}^{K} |x_{ik} - x_{jk}|

I Distance is measured with vertical and horizontal lines.

Data table:           Euclidean distances:        Manhattan distances:
    V1  V2  V3            a    b    c                 a   b   c
 a   1   1   3         a  0                        a  0
 b   1   1   1         b  2    0                   b  2   0
 c   2   2   2         c  √3   √3   0              c  3   3   0

Unless required by the data, we will use the Euclidean distance.
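As a quick check, the two distance matrices above can be reproduced in R with the dist() function; a minimal sketch using the toy data from this slide (object names are illustrative):

# Toy data from the slide: three individuals a, b, c measured on V1, V2, V3
X <- matrix(c(1, 1, 3,
              1, 1, 1,
              2, 2, 2),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("V1", "V2", "V3")))

dist(X, method = "euclidean")   # d(a,b) = 2, d(a,c) = d(b,c) = sqrt(3)
dist(X, method = "manhattan")   # d(a,b) = 2, d(a,c) = d(b,c) = 3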



Distance between groups

To construct hierarchical trees, the distance between groups of individuals must be defined.
There are a number of options, but we will describe those that are the most widely used.
Let A and B be two groups of individuals.
The single linkage between A and B is the distance between the two closest elements in
the two clusters.
The complete linkage between A and B is the distance between the two furthest elements
in the two clusters.



Distance between groups

For Euclidean distances:

Consider GA and GB , the centres of gravity of groups A and B.
We can measure the inertia, accounting for the group weights (the size of each group).
Apply Huygens’ theorem to A and B, where G is the centre of gravity of A ∪ B:

Total inertia = Between-cluster inertia + Within-cluster inertia

Total inertia ⇒ inertia of A ∪ B with respect to G.
Between-cluster inertia ⇒ inertia of GA and GB with respect to G.
Within-cluster inertia ⇒ inertia of A with respect to GA plus the same for B.

This decomposition suggests using the between-cluster inertia as a measure of the dissimilarity
between A and B.

Ward’s method is based on this methodology.
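A minimal numerical check of this decomposition in R (artificial data and unit weights for all individuals; only the identity itself comes from the slide):

# Two artificial groups of individuals described by two variables
set.seed(1)
A <- matrix(rnorm(20), ncol = 2)             # group A: 10 individuals
B <- matrix(rnorm(12, mean = 3), ncol = 2)   # group B: 6 individuals

G  <- colMeans(rbind(A, B))                  # centre of gravity of A ∪ B
GA <- colMeans(A); GB <- colMeans(B)         # centres of gravity of A and B

inertia <- function(M, centre) sum(sweep(M, 2, centre)^2)

total   <- inertia(rbind(A, B), G)                       # total inertia
within  <- inertia(A, GA) + inertia(B, GB)               # within-cluster inertia
between <- nrow(A) * sum((GA - G)^2) +
           nrow(B) * sum((GB - G)^2)                     # between-cluster inertia

all.equal(total, between + within)                       # TRUE (Huygens' theorem)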



2 Ward’s algorithm



Classic Agglomerative Algorithm

1- Create the matrix D with the general term d(i, l) indicating the dissimilarity between
individuals i and l.
I It is a symmetrical matrix with 0s on the diagonal.

2- Agglomerate the most similar (closest) individuals i and l.


I in case of a tie (ex aequo), select one pair at random.
I d(i, l) is the agglomeration criterion between i and l.
I d(i, l) determines the height at which the branches of the tree connect.

3- Update D to D(1) deleting the rows and columns of i and l and adding new ones for the
pair (i, l).
4- Look for the closest elements in D(1) and agglomerate them, and so on.



Classic Agglomerative Algorithm: Example
We use the Manhattan distance and the complete-linkage agglomerative rule.
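This example can be reproduced with base R's hclust function; a minimal sketch, reusing the toy matrix X from the earlier distance sketch:

# Manhattan distance + complete-linkage agglomeration
d.man <- dist(X, method = "manhattan")
tree  <- hclust(d.man, method = "complete")
plot(tree)   # dendrogram; branch heights are the complete-linkage distances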



Hierarchy and Partitions

The points where the branches corresponding to the elements being grouped coincide are
known as nodes.
The individuals being classified are referred to as leaf nodes.
By tracing a horizontal line at a given index, a partition is defined.

The cut at A defines a partition into two clusters.
The cut at B defines a more precise partition into four clusters.
These partitions are always nested: each B-level cluster is included in a single cluster at
level A.



Ward’s method

This agglomerative method consists, at each stage of the process, of regrouping two elements by
maximizing the quality of the obtained partition.
A partition is said to be of high quality when
I individuals within a cluster are homogeneous (small within-cluster variability).
I individuals differ from one cluster to the next (high between-cluster variability).

The already mentioned Huygens’ theorem provides the framework for this analysis:

Total inertia = Between-cluster inertia + Within-cluster inertia

If we use this decomposition as a framework for the analysis, then, when assessing the quality
of a partition, maximizing the between-cluster variability is equivalent to minimizing the
within-cluster variability (the total variability is fixed).
Partition quality can be measured by the ratio

Between-cluster inertia / Total inertia

(the percentage of variability attributed to the partition).



Ward’s method

Defining a methodology that, during the agglomerative process, tries to minimize the
within-cluster inertia means that we will tend to agglomerate

Clusters whose centres of gravity are close together (lower variability).


Clusters with small sample sizes (lower inertia).

Tree obtained with Ward’s method for the data of the earlier example (a small R sketch
follows below).
The shape of the tree is identical even though the distance (Euclidean) and the
agglomerative criterion differ.
When a structure is strong, it is emphasised whatever the selected method.
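A minimal sketch of Ward's method in base R: hclust with method = "ward.D2" applies Ward's criterion to Euclidean distances (X is again the toy matrix from the earlier sketch):

d.euc  <- dist(X, method = "euclidean")
tree.w <- hclust(d.euc, method = "ward.D2")   # Ward's agglomeration criterion
plot(tree.w)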



Ward’s method

1- At step 0, each individual represents a cluster and within-cluster inertia is 0.


2- Throughout the algorithm, the number of clusters decreases and the within-cluster inertia increases.
3- At the end of the algorithm, all the individuals are in the same cluster and the within-cluster
inertia is equal to the total inertia.

Thus an indexed hierarchy proposes a decomposition of the total inertia (the variability of the data)
and fits into the overall approach of principal component methods.

The difference is that the decomposition is conducted by clusters in one case and by
components in the other.



Ward’s method: Choosing Partitions

A hierarchy is extremely useful for justifying the choice of a partition, as we can account for the
percentage of variability explained by the clusters.

In order to select the optimal partition, we should consider:

The overall appearance of the tree.

The levels of the nodes, which quantify the above; each irregularity in their decrease suggests
another division (they can be represented with a bar chart; see the sketch below).
The number of clusters, which must not be too high so as not to impede the concise
nature of the approach.
Cluster interpretability: even if it corresponds to a substantial increase in between-cluster
inertia, we do not retain subdivisions that we do not know how to interpret.
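A minimal sketch of how these elements can be inspected for any tree returned by hclust (here tree.w from the previous sketch; with only three toy individuals the output is trivial, but the same calls apply to real data):

barplot(rev(tree.w$height))        # bar chart of the node levels, in decreasing order
clusters <- cutree(tree.w, k = 2)  # partition obtained by cutting the tree into 2 clusters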



3 K-means algorithm



Partitioning algorithms

The data are the same as for principal components methods: individuals × variables table and a
Euclidean distance.

Partitioning algorithms can be motivated, relative to hierarchical clustering, by two questions:

Indexed hierarchies are often used as tools for obtaining partitions. Would there not be a
number of advantages in searching for a partition directly?
When dealing with a great number of individuals, the calculation time required to
construct an indexed hierarchy can be very long. Might we not achieve shorter calculation
times with algorithms that search for a partition directly?

Although there are many partitioning algorithms, we will only explain one, the
K-means algorithm.



K-means algorithm

Select the number of clusters, q.


Consider a given partition P0 where individuals are randomly divided into the q clusters.
Calculate ρ0 , the ratio [between-cluster inertia]/[total inertia].
At step n of the algorithm:
1- The centres of gravity are calculated for each of the clusters.
2- Reassign each individual to the cluster whose centre of gravity is closest. We obtain a new
partition Pn+1 and calculate the ratio ρn+1 .
3- As long as ρn+1 > ρn (Pn+1 is better than Pn ), we return to step 1; otherwise Pn+1 is
the partition we are looking for.
Example of the algorithm for two clusters (a small R sketch is given below):
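A minimal sketch of the algorithm for two clusters in R (artificial data; kmeans implements the iterations described above):

set.seed(123)
Y  <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),   # 20 individuals around (0, 0)
            matrix(rnorm(40, mean = 4), ncol = 2))   # 20 individuals around (4, 4)

km <- kmeans(Y, centers = 2)     # q = 2 clusters
km$betweenss / km$totss          # the ratio rho = between-cluster / total inertia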



K-means algorithm

The algorithm converges, but not necessarily toward an optimum.

In practice, the algorithm is conducted many times using different initial partitions P0 (see
the sketch below).
Sets of individuals which remain in the same cluster whatever the initial partition are called
strong shapes (they highlight dense areas).
This methodology also gives rise to some very small clusters (often only one individual),
made up of individuals situated between high-density areas. The two main solutions are
I assign them to the closest strong shape.
I create a ’residual’ cluster grouping together all of the isolated individuals.
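In R this is conveniently done with the nstart argument of kmeans, which runs the algorithm from several random initial partitions and keeps the best result; a sketch, continuing with the data Y from above:

km.best <- kmeans(Y, centers = 2, nstart = 25)   # 25 random initial partitions
km.best$betweenss / km.best$totss                # quality of the retained partition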



4 Real practice



Partitioning and Hierarchical Clustering

Compared to hierarchical methods, partitioning strategies present two main advantages:

They optimise a criterion. In Agglomerative Hierarchical Clustering, a criterion is
optimised at each step, but it is not an overall criterion referring to the whole tree.
They can deal with a much greater number of individuals.

However, the number of groups needs to be defined beforehand.

This is the origin of the idea of combining the two approaches to obtain a methodology that
includes the advantages of each.




When there are too many individuals to conduct an Agglomerative Hierarchical Clustering
(AHC) directly, the following two-phase methodology can be implemented:

1. We partition the individuals into a high number of groups (100, for example).

2. We implement the AHC by taking the groups of individuals from phase one as the elements to
be classified (see the sketch below).
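A sketch of this two-phase methodology in base R (Z stands for a large individuals × variables matrix, not defined in these slides; 100 is the illustrative number of preliminary groups):

pre   <- kmeans(Z, centers = 100, nstart = 10)      # phase 1: coarse K-means partition
tree2 <- hclust(dist(pre$centers),                  # phase 2: AHC on the group centres,
                method = "ward.D2",
                members = pre$size)                 #          weighted by the group sizes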



Clustering and Principal Components Methods

Clustering and principal component methods use similar approaches: exploratory analysis
of the same data table.
But they differ in terms of representation methods (Euclidean clouds, indexed hierarchies
or partitions).
However, we can combine both approaches to obtain a richer methodology.
Let us consider a table X (of dimensions I × K) whose rows we want to classify (a sketch of
the pipeline follows below):
1. We perform a principal component method on X (PCA or CA).
2. Retain the components that are responsible for a high percentage of the inertia (80% or
90%) and that we know how to interpret.
3. Create table F with the coordinates of the individuals on those components (if we had
included all the components, F and X would be equivalent, as they define the same
distances between individuals).
4. Apply the AHC to table F .
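A minimal base-R sketch of this pipeline (dat stands for the individuals × variables table; retaining 5 components is illustrative and should correspond to 80-90% of the inertia):

pc    <- prcomp(dat, scale. = TRUE)              # principal component method (PCA)
Ftab  <- pc$x[, 1:5]                             # table F: coordinates on the retained components
treeF <- hclust(dist(Ftab), method = "ward.D2")  # AHC (Ward) applied to table F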



5 Implementation in R



Violent Crime Rates by US State Example

We will be using the Violent Crime Rates by US State dataset analysed in the PCA unit.

The objective is to group the states into comprehensive clusters.

Once the clusters have been defined, it is important to describe them using variables or specific
individuals.

We will perform a hierarchical clustering using Ward’s criterion as the agglomerative criterion.

We will use Euclidean distances, so the most suitable approach (in most cases), as done in
PCA, is to standardise the variables.

We will use the results of the PCA in order to build the clusters of states.



Violent Crime Rates by US State Example

We import the dataset and perform the PCA; we use ncp = Inf in order to specify that we will
retain all the components for the clustering analysis.

Then we perform an agglomerative hierarchical clustering with the function HCPC.

> library(FactoMineR)
> library(tidyverse)
> data("USArrests") #Read the data

Perform the PCA.


> pca.USA <- PCA(USArrests, scale.unit = TRUE, ncp = Inf, graph = FALSE)

Perform the agglomerative hierarchical clustering.


> hcpc.USA <- HCPC(pca.USA,nb.clust = 4,graph = FALSE)

Note: As commented in the unit, if the number of individuals is large, it is possible to
create clusters with the K-means algorithm before constructing the agglomerative hierarchical
clustering (kk parameter in the HCPC function); a small sketch is shown below.
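For instance (the value kk = 10 is purely illustrative; with only 50 states this pre-partition is not really needed):

> hcpc.kk <- HCPC(pca.USA, nb.clust = 4, kk = 10, graph = FALSE)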



Violent Crime Rates by US State Example

The element data.clust returns the original table together with the cluster assignment of each
state.
> head(hcpc.USA$data.clust,n = 10)
Murder Assault UrbanPop Rape clust
Alabama 13.2 236 58 21.2 3
Alaska 10.0 263 48 44.5 4
Arizona 8.1 294 80 31.0 4
Arkansas 8.8 190 50 19.5 3
California 9.0 276 91 40.6 4
Colorado 7.9 204 78 38.7 4
Connecticut 3.3 110 77 11.1 2
Delaware 5.9 238 72 15.8 2
Florida 15.4 335 80 31.9 4
Georgia 17.4 211 60 25.8 3



Violent Crime Rates by US State Example
We can obtain several pieces of information from the hcpc.USA object:

> hcpc.USA$call$t$tree

Call:
flashClust::hclust(d = dissi, method = method, members = weight)

Cluster method : ward


Distance : euclidean
Number of objects: 50

Optimal number of clusters: the ratio between two successive within-cluster inertias
(hcpc.USA$call$t$quot) is minimal.
> hcpc.USA$call$t$nb.clust
Within-cluster inertia: for one cluster, the inertia is equal to the number of variables
(the total inertia of the standardised data).
> hcpc.USA$call$t$within
Between-cluster inertia: the inertia gained when moving from n to n + 1 clusters.
> hcpc.USA$call$t$inert.gain



Violent Crime Rates by US State Example

Remember, the first dimension is a scale of criminality and the second dimension is defined as a
scale of population.



Violent Crime Rates by US State Example

The HCPC function generates several graphs. We can obtain each of them with the choice
parameter:

choice="tree"

choice="bar"

choice="3D.map"

choice="map"
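These graphs can be produced from the hcpc.USA object with the plot method; a sketch, where choice selects the type of graph:

> plot(hcpc.USA, choice = "tree")    # dendrogram
> plot(hcpc.USA, choice = "bar")     # bar chart of the gains in within-cluster inertia
> plot(hcpc.USA, choice = "map")     # factor map with the clusters
> plot(hcpc.USA, choice = "3D.map")  # tree drawn on the factor map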



Violent Crime Rates by US State Example

It can also be interesting to illustrate each cluster by the individuals specific to that cluster.

We can calculate:

paragon individuals: those that are closest to the centre of their cluster.
Individuals are sorted by cluster, together with the distance between each individual and the
centre of its class.
> hcpc.USA$desc.ind$para
South Dakota is the state that best represents the states in cluster 1, while Oklahoma, Alabama
and Michigan are the paragons of clusters 2, 3 and 4, respectively.
specific individuals: those furthest from the centres of the other clusters.
Individuals are sorted by cluster, together with the distance between each individual and the
closest centre of the other clusters.

> hcpc.USA$desc.ind$dist

Vermont is specific to cluster 1 because it is the state furthest from the centres of clusters
2, 3 and 4, so we can consider it to be the most specific to cluster 1. Rhode Island,
Mississippi and Nevada are specific to clusters 2, 3 and 4, respectively.

