Abstract
A key issue in cluster analysis is the choice of an appropriate clus-
tering method and the determination of the best number of clusters.
Different clusterings are optimal on the same data set according to
different criteria, and the choice of such criteria depends on the con-
text and aim of clustering. Therefore, researchers need to consider
what data analytic characteristics the clusters they are aiming at are
supposed to have, among others within-cluster homogeneity, between-
clusters separation, and stability. Here, a set of internal clustering
validity indexes measuring different aspects of clustering quality is pro-
posed, including some indexes from the literature. Users can choose
the indexes that are relevant in the application at hand. In order to
measure the overall quality of a clustering (for comparing clusterings
from different methods and/or different numbers of clusters), the in-
dex values are calibrated for aggregation. Calibration is relative to a
set of random clusterings on the same data. Two specific aggregated
indexes are proposed and compared with existing indexes on simulated
and real data.
This version has been accepted by Statistics and Computing for publication.

1 Introduction
Cluster validation, which is the evaluation of the quality of clusterings, is
crucial in cluster analysis in order to make sure that a given clustering makes
sense, but also in order to compare different clusterings. These may stem
from different clustering methods, but may also have different numbers of
clusters, and in fact optimizing measurements of clustering validity is a main
approach to estimating the number of clusters, see Halkidi et al. (2015).
In much literature on cluster analysis it is assumed, implicitly or explic-
itly, that there is only a single “true” clustering for a given data set, and that
the aim of cluster analysis is to find that clustering. According to this logic,
clusterings are better or worse depending on how close they are to the “true”
clustering. If a true grouping is known (which does not necessarily have to
be unique), “external” cluster validation compares a clustering to the true
clustering (or more generally to existing external information). A popular
external clustering validity index is the Adjusted Rand Index (Hubert and
Arabie (1985)). Here we are concerned with “internal” cluster validation
(sometimes referred to as “relative cluster validation” when used to compare
different clusterings, see Jain and Dubes (1988)), evaluating the cluster qual-
ity without reference to an external “truth”, which in most applications of
course is not known.
Hennig (2015a,b) argued that the “best” clustering depends on back-
ground information and the aim of clustering, and that different clusterings
can be optimal in different relevant respects on the same data. As an ex-
ample, within-cluster homogeneity and between-clusters separation are often
mentioned as major aims of clustering, but these two aims can be conflict-
ing. There may be widespread groups of points in the data without "gaps", which can therefore not be split up into separated subgroups, but which may nevertheless contain very large within-group distances. If clustering is done for shape recognition, separation is most important and such groups should not be split up. If clustering is used for database organisation, for example to allow finding a set of very similar images for a given image, homogeneity is most important, and a data subset containing too large distances needs to be split up into two or more clusters. Other aspects may also matter, such as approximation of the dissimilarity relations in the data by the clustering structure, or distributional shapes (e.g., linearity or normality). Therefore the data
cannot decide about the “optimal” clustering on their own, and user input
is needed in any case.
Many existing clustering validity indexes attempt to measure the quality
of a clustering by a single number, see Halkidi et al. (2015) for a review of
such indexes. Such indexes are sometimes called “objective”, because they do
not require decisions or tuning by the users. This is certainly popular among
users who do not want to make such decisions, be it for lack of understanding
of the implications, or be it for the desire to “let the data speak on their
own”. But in any case such users will need to decide which index to trust
for what reason, and given that requirements for a good clustering depend
on the application at hand, it makes sense that there is a choice between
various criteria and approaches. But the literature is rarely explicit about
this and tends to suggest that the clustering problem can and should be
solved without critical user input.
There is a tension in statistics between the idea that analyses should be
very closely adapted to the specifics of the situation, necessarily strongly
involving the researchers’ perspective, and the idea that analyses should be
as “objective” and independent of a personal point of view as possible, see
Gelman and Hennig (2017). A heavy focus on user input will give the user
optimal flexibility to take into account background knowledge and the aim
of clustering, but the user may not feel able to make all the required choices,
some of which may be very subtle and may be connected at best very indi-
rectly to the available information. Furthermore, it is hard to systematically
investigate the quality and reliability of such an approach, because every sit-
uation is different and it may be unclear how to generalise from one situation
to another. On the other hand, a heavy focus on “objective” unified criteria
and evaluation over many situations will make it hard or even impossible to
do the individual circumstances justice. In the present paper we try to bal-
ance these two aspects by presenting a framework that allows for very flexible
customisation, while at the same time proposing two specific aggregated in-
dexes as possible starter tools for a good number of situations that allow us
to systematically evaluate the approach on simulated and benchmark data.
Many validity indexes balance small within-cluster heterogeneity against large between-clusters heterogeneity in a certain way, such as the Average Silhouette Width (ASW; Kaufman and Rousseeuw (1990)) or the Calinski and Harabasz index (CH; Caliński and Harabasz (1974)), whereas others have different goals; for example, Hubert's Γ index (Hubert and Schultz (1976)) emphasises good representation of the dissimilarity structure by the clustering. In most applications, various desirable characteristics need to be
balanced against each other. It is clear that it is easier to achieve homoge-
neous clusters if the number of clusters is high, and better cluster separation
if the number of clusters is low, but in different applications these objectives
may be weighted differently, which cannot be expressed by a single index.
The approach taken here, first introduced in Hennig (2019), is to consider
a collection of validity indexes that measure various aspects of cluster quality
in order to allow the user to weight and aggregate them to a quality mea-
surement adapted to their specific clustering aim. This can then be used to
decide between different clusterings with different numbers of clusters or also
from different clustering methods. Particular attention is paid to the issue
of making the values of the different indexes comparable when aggregating
them, allowing for an interpretation of weights in terms of relative impor-
tance. This is done by generating random clusterings over the given data set
and by using the distribution of the resulting index values for calibration.
Some authors have already become aware of the benefits of looking at
several criteria for comparing clusterings, and there is some related work on
multi-objective clustering, mostly about finding the set of Pareto optimal
solutions, see, e.g., Delattre and Hansen (1980) and the overview in Handl
and Knowles (2015).
Section 2 introduces the notation. Section 3 is devoted to clustering va-
lidity indexes. It has three subsections introducing clustering validity indexes
from the literature, indexes measuring specific aspects of cluster validity to be
used for aggregation, and resampling-based indexes measuring cluster stabil-
ity. Section 4 describes how an aggregated index can be defined from several
indexes measuring specific characteristics, including calibration by random
clusterings. Section 5 proposes two specific aggregated indexes for somewhat
general purposes, presents a simulation study comparing these to indexes
from the literature, and uses these indexes to analyse three real data sets
with and one without given classes. Section 6 concludes the paper.
2 General notation
Given a data set, i.e., a set of distinguishable objects $\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$, the aim of cluster analysis is to group them into subsets of $\mathcal{X}$. A clustering is denoted by $\mathcal{C} = \{C_1, C_2, \ldots, C_K\}$, $C_k \subseteq \mathcal{X}$, with cluster size $n_k = |C_k|$, $k = 1, \ldots, K$. We require $\mathcal{C}$ to be a partition, i.e., $k \neq g \Rightarrow C_k \cap C_g = \emptyset$ and $\bigcup_{k=1}^{K} C_k = \mathcal{X}$. Clusters are assumed to be crisp rather than fuzzy, i.e., an object is either a full member of a cluster or not a member of this cluster at all. An alternative way to write $x_i \in C_k$ is $l_i = k$, i.e., $l_i \in \{1, \ldots, K\}$ is the cluster label of $x_i$.
The approach presented here is defined for general dissimilarity data. A dissimilarity is a function $d: \mathcal{X}^2 \to \mathbb{R}_0^+$ such that $d(x_i, x_j) = d(x_j, x_i) \ge 0$ and $d(x_i, x_i) = 0$ for $x_i, x_j \in \mathcal{X}$. Many dissimilarities are distances, i.e., they also fulfill the triangle inequality, but this is not necessarily required here.
Dissimilarities are extremely flexible. They can be defined for all kinds of
data, such as functions, time series, categorical data, image data, text data
etc. If data are Euclidean, obviously the Euclidean distance can be used,
which is what will be done in the later experiments. See Hennig (2015a) for
a more general overview of dissimilarity measures used in cluster analysis.
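As a small illustration of the input used throughout (and assumed by the code sketches given below), a Euclidean dissimilarity matrix can be computed in base R; the data set used here is only illustrative.

```r
## Euclidean dissimilarities for a numeric data matrix (illustrative data);
## all later sketches assume D is such a symmetric matrix with zero diagonal.
X <- as.matrix(iris[, 1:4])
D <- as.matrix(dist(X, method = "euclidean"))
```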
composite index, it is preferable to use indexes that measure different aspects
separately (as introduced in Section 3.2), because this improves the clarity of
interpretation of the composite index. Unless indicated otherwise, for these
indexes a better clustering is indicated by a larger value, and the best number
of clusters can be chosen by maximizing any of them over K, i.e., comparing
solutions from the same clustering method with different fixed values of K.
For observation $x_i$ with cluster label $l_i$, the silhouette width is
$$ s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}} \in [-1, 1], $$
where
$$ a_i = \frac{1}{n_{l_i} - 1} \sum_{x_j \in C_{l_i}} d(x_i, x_j), \qquad b_i = \min_{h \neq l_i} \frac{1}{n_h} \sum_{x_j \in C_h} d(x_i, x_j). $$
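As an illustration, a direct implementation of these formulas might look as follows. This is a minimal sketch (not the routine from the cluster package), assuming D is a dissimilarity matrix and labels a vector of cluster memberships; averaging the silhouette widths over all observations gives the ASW.

```r
## Silhouette widths and ASW from a dissimilarity matrix D and cluster labels.
asw <- function(D, labels) {
  n <- nrow(D)
  si <- numeric(n)
  for (i in seq_len(n)) {
    own <- labels == labels[i]
    nk  <- sum(own)
    if (nk == 1) { si[i] <- 0; next }                    # convention: 0 for singleton clusters
    ai <- sum(D[i, own]) / (nk - 1)                      # average distance to own cluster
    bi <- min(tapply(D[i, !own], labels[!own], mean))    # average distance to nearest other cluster
    si[i] <- (bi - ai) / max(ai, bi)
  }
  mean(si)                                               # ASW: average silhouette width
}
```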
The Calinski and Harabasz index is
$$ I_{CH}(\mathcal{C}) = \frac{B(n - K)}{W(K - 1)}, $$
where
$$ W = \sum_{k=1}^{K} \frac{1}{n_k} \sum_{x_i, x_j \in C_k} d(x_i, x_j)^2, \qquad B = \frac{1}{n} \sum_{i,j=1}^{n} d(x_i, x_j)^2 - W. $$
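A corresponding sketch of this dissimilarity-based version of the CH index, under the same assumptions on D and labels (the function name is ours):

```r
## Dissimilarity-based CH index as defined above.
ch_index <- function(D, labels) {
  n <- nrow(D); K <- length(unique(labels))
  W <- 0
  for (k in unique(labels)) {
    idx <- which(labels == k)
    W <- W + sum(D[idx, idx]^2) / length(idx)   # within-cluster term
  }
  B <- sum(D^2) / n - W                         # between-cluster term
  (B * (n - K)) / (W * (K - 1))
}
```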
The Dunn index (Dunn (1974)) compares the minimum distance between any two clusters with the maximum distance within a cluster:
$$ I_{Dunn}(\mathcal{C}) = \frac{\min_{k \neq g} \min_{x_i \in C_k,\, x_j \in C_g} d(x_i, x_j)}{\max_{k} \max_{x_i, x_j \in C_k} d(x_i, x_j)}. $$
Pearson Γ: This index considers the "clustering induced dissimilarity" $c = \mathrm{vec}\left([c_{ij}]_{i<j}\right)$, where $c_{ij} = 1(l_i \neq l_j)$, i.e., the indicator of whether $x_i$ and $x_j$ are in different clusters, as a "fit" of the given data dissimilarity $d = \mathrm{vec}\left([d(x_i, x_j)]_{i<j}\right)$, and measures its quality by the Pearson correlation
$$ I_{Pearson\Gamma}(\mathcal{C}) = \rho(d, c). $$
This index can be used on its own to measure clustering quality. It can
also be used as part of a composite index, measuring a specific aspect of
clustering quality, namely the approximation of the dissimilarity struc-
ture by the clustering. In some applications clusterings are computed
to summarise dissimilarity information, potentially for use of the clus-
ter indicator as explanatory factor in an analysis of variance or similar,
in which case the representation of the dissimilarity information is the
central clustering aim.
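A sketch of PearsonΓ under the same conventions as above; the function name is ours, not from any package.

```r
## PearsonGamma: correlation between the dissimilarities and the binary
## "different cluster" indicator, each pair i < j taken once.
pearson_gamma <- function(D, labels) {
  lower <- lower.tri(D)
  d_vec <- D[lower]
  c_vec <- outer(labels, labels, "!=")[lower]   # 1 if x_i and x_j are in different clusters
  cor(d_vec, as.numeric(c_vec))                 # rho(d, c)
}
```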
Average within-cluster dissimilarities:
$$ I_{ave.wit}(\mathcal{C}) = \frac{1}{n} \sum_{k=1}^{K} \frac{1}{n_k - 1} \sum_{x_i \neq x_j \in C_k} d(x_i, x_j). $$
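A minimal sketch of this index under the same conventions; note that D is symmetric with zero diagonal, so the matrix sum below runs over all ordered pairs $x_i \neq x_j$ within a cluster, as in the formula.

```r
## Average within-cluster dissimilarity index I_ave.wit (smaller is better).
ave_wit <- function(D, labels) {
  total <- 0
  for (k in unique(labels)) {
    idx <- which(labels == k)
    if (length(idx) > 1)
      total <- total + sum(D[idx, idx]) / (length(idx) - 1)
  }
  total / nrow(D)
}
```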
Separation index: Most informal descriptions of what makes a cluster men-
tion between-cluster separation besides within-cluster homogeneity. Sep-
aration measurement should optimally focus on objects on the “border”
of clusters. It would be possible to consider the minimum between-
clusters dissimilarity (as done by the Dunn index), but this might be
inappropriate, because in the case of there being more than two clusters
the computation only depends on the two closest clusters, and repre-
sents a cluster only by a single point, which may be atypical. On the
other hand, looking at the distance between cluster means as done by
the CH index is not very informative about what goes on “between”
the clusters. Thus, we propose another index that takes into account a
portion, p, of objects in each cluster that are closest to another cluster.
For every object $x_i \in C_k$, $i = 1, \ldots, n$, $k \in \{1, \ldots, K\}$, let $d_{k:i} = \min_{x_j \notin C_k} d(x_i, x_j)$. Let $d_{k:(1)} \le \ldots \le d_{k:(n_k)}$ be the values of $d_{k:i}$ for $x_i \in C_k$ ordered from the smallest to the largest, and let $[pn_k]$ be the largest integer $\le pn_k$. Then, the separation index with parameter $p$ is defined as
$$ I_{sep.index}(\mathcal{C}; p) = \frac{1}{\sum_{k=1}^{K} [pn_k]} \sum_{k=1}^{K} \sum_{i=1}^{[pn_k]} d_{k:(i)}. $$
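A sketch of the separation index under the same conventions; as a small deviation from the formula, at least one point per cluster is always used so that very small clusters are not skipped entirely.

```r
## Separation index with proportion p (e.g. p = 0.1).
sep_index <- function(D, labels, p = 0.1) {
  d_small <- c()                                       # smallest cluster-border distances
  for (k in unique(labels)) {
    idx <- which(labels == k)
    dk <- apply(D[idx, -idx, drop = FALSE], 1, min)    # distance of each point to the nearest other cluster
    m  <- max(1, floor(p * length(idx)))               # [p * n_k], with a guard of at least one point
    d_small <- c(d_small, sort(dk)[seq_len(m)])
  }
  mean(d_small)                                        # equals the sum divided by sum of [p * n_k]
}
```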
to measure the quality of representation in some way. For this aim here
we use PearsonΓ as defined above.
Hennig (2019) proposed some more indexes, particularly for measuring within-
cluster density decline from the density mean, similarity to a within-cluster
uniform or Gaussian distributional shape, and quality of the representation
of clusters by their centroids.
3.3 Stability
Clusterings are often interpreted as meaningful if they can be generalised as
stable substantive patterns. Stability means that they can be replicated on
different data sets of the same kind. Without requiring that new indepen-
dent data are available, this can be assessed by resampling methods such as
cross-validation and bootstrap. We review two approaches that have been
proposed in the literature to measure stability. There they were proposed
for estimating the number of clusters on their own, but this is problematic.
Whereas it makes sense to require a good clustering to be stable, it cannot
be ruled out that an undesirable clustering is also stable. For example, in
a data set with four clearly separated clusters, two well separated pairs of
clusters may give rise to a potentially even more stable two-cluster solution.
We therefore consider these indexes as measuring an isolated aspect of clus-
ter quality to be used in composite indexes. Stability is often of interest on
top of whatever criterion characterises the cluster shapes of interest. For ex-
ample, in applications that require high within-cluster homogeneity, adding a stability criterion can prevent the data set from being split up into clusters that are too small.
Prediction strength (Tibshirani and Walther (2005)): The data set is randomly split into two halves (the approach would work with splits into more than two parts as well, but we settle for halves eventually), say X[1] and X[2]. Two clusterings are obtained on these two parts
separately with the selected clustering technique and a fixed number
of clusters K. Then the points of X[2] are classified to the clusters of
X[1] in some way. The same is done with the points of X[1] relative to
the clustering on X[2] . For any pair of observations in the same cluster
in the same part, it is then checked whether or not they are predicted
to be in the same cluster by the clustering on the other half. This
can be repeated for various (A) splits of the data set. The prediction
strength is then defined as the average proportion of correctly pre-
dicted co-memberships for the cluster that minimises this proportion.
Formally,
$$ I_{PS}(\mathcal{C}) = \frac{1}{2A} \sum_{a=1}^{A} \sum_{t=1}^{2} \min_{1 \le k \le K} \frac{m_{kat}}{n_{kat}(n_{kat} - 1)}, \qquad m_{kat} = \sum_{x_i \neq x_{i'} \in C_{kat}} 1\left(l^*_{i'at} = l^*_{iat}\right), \quad 1 \le k \le K, $$
where $C_{kat}$ is cluster $k$ computed on the data half X[t] in the $a$th split, $n_{kat} = |C_{kat}|$ is its number of observations, and $L_{at} = \{l_{g(1)at}, \ldots, l_{g(n/2)at}\}$ are the cluster indicators of the clustering of X[t] in the $a$th split; $g(1), \ldots, g(n/2)$ denote the indexes of the observations belonging to that half, assuming for ease of notation that $n$ is even. $L^*_{at} = \{l^*_{g(1)at}, \ldots, l^*_{g(n/2)at}\}$ are the cluster labels from the clustering of the other half X[2-t] in the $a$th split, to which the observations of X[t] are classified.
Unlike the indexes listed in Sections 3.1 and 3.2, IPS depends on the clustering method applied to arrive at the clusterings C, because stability is evaluated by comparing clusterings computed using the same method. Furthermore, IPS requires a supervised classification method
to classify the observations in one half of the data set to the clusters
computed on the other half. Tibshirani and Walther (2005) propose
classification of observations in one half to the closest cluster centroid
in the clustered other half of the data set. This is the same classifica-
tion rule that is implicitly used by K-means and PAM clustering, and
therefore it is suitable for use together with these clustering methods.
But it is inappropriate for some other clustering methods such as Sin-
gle Linkage or Gaussian mixture model-based clustering with flexible
covariance matrices, in which observations can be assigned to clusters
with far away centroids in case of either existence of linking points
(Single Linkage) or a within-cluster covariance matrix with large variation in the direction between the cluster centroid and the point to be classified (Gaussian model-based clustering). The classification method used for the prediction strength should be chosen based on the cluster concept formalised by the clustering method in use. Table 1 lists some classification methods that are associated with certain clustering methods, and we use them accordingly.

Table 1: Methods for supervised classification associated with clustering methods (columns: classification method $l^*_{iat}$, clustering method). Notation: $a$ refers to the data split, $t$ refers to the data set half (observations of X[t] are classified to clusters of X[2-t]), $m_{ka(2-t)}$ is the centroid (which may depend on the clustering method) and $n_{ka(2-t)}$ the number of points of cluster $k$ in the data set X[2-t]. For K-means and Ward the centroid is the cluster mean, for PAM the medoid minimising the sum of distances to the other points in the cluster. For QDA and LDA, $\delta_{ka(2-t)}(x) = \pi_{ka(2-t)} - \frac{1}{2}\log(|\Sigma_{ka(2-t)}|) - (x - \mu_{ka(2-t)})'\Sigma_{ka(2-t)}^{-1}(x - \mu_{ka(2-t)})$, where $\pi_{ka(2-t)}$ is the relative frequency, $\Sigma_{ka(2-t)}$ is the sample covariance matrix (for LDA pooled over all $k$), and $\mu_{ka(2-t)}$ is the sample mean of cluster $k$.
Realising that high values of the prediction strength are easier to achieve
for smaller numbers of clusters, Tibshirani and Walther (2005) recom-
mend as estimator for the number of clusters the largest k so that
the prediction strength is above 0.8 or 0.9. For using the prediction
strength as one of the contributors to a composite index, such a cutoff
is not needed.
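To make the mechanics concrete, here is a rough sketch of one split of the prediction strength computation for K-means with nearest-centroid classification (the rule associated with K-means in Table 1). This is an illustration under simplifying assumptions, not the implementation of Tibshirani and Walther (2005) or of any package; in practice the result would be averaged over A random splits.

```r
## One split of the prediction strength for K-means; X is a numeric data matrix.
prediction_strength_once <- function(X, K) {
  n <- nrow(X)
  half <- sample(n, floor(n / 2))
  parts <- list(X[half, , drop = FALSE], X[-half, , drop = FALSE])
  fits <- lapply(parts, function(Z) kmeans(Z, centers = K, nstart = 10))
  ps_t <- numeric(2)
  for (t in 1:2) {
    other <- 3 - t
    ## classify points of part t to the nearest centroid of the clustering on the other part
    d2 <- as.matrix(dist(rbind(fits[[other]]$centers, parts[[t]])))[-(1:K), 1:K, drop = FALSE]
    lstar <- apply(d2, 1, which.min)
    l     <- fits[[t]]$cluster
    prop <- sapply(1:K, function(k) {
      idx <- which(l == k)
      if (length(idx) < 2) return(1)                       # convention for tiny clusters
      same <- outer(lstar[idx], lstar[idx], "==")
      (sum(same) - length(idx)) / (length(idx) * (length(idx) - 1))
    })
    ps_t[t] <- min(prop)                                   # worst cluster counts
  }
  mean(ps_t)
}
```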
Bootstab (Fang and Wang (2012)): Bootstrap samples are drawn from the data set and clustered with the selected clustering method with fixed number of clusters K. The points in the
data set that were not resampled are classified to the clusters com-
puted on the resampled data set by a supervised classification method
as listed in Table 1, and for various resampled data sets the resulting
classifications are compared.
Here, A times two bootstrap samples are drawn from the data with replacement. Let X[1], X[2] be the two bootstrap samples in the $a$th bootstrap iteration. For $t = 1, 2$, let $L_a^{(t)} = \left(l_{1a}^{(t)}, \ldots, l_{na}^{(t)}\right)$ be cluster labels based on the clustering of X[t]. This means that for points $x_i$ that are resampled as members of X[t], $l_{ia}^{(t)}$ is just the cluster membership indicator, whereas for points $x_i$ not resampled as members of X[t], $l_{ia}^{(t)}$ indicates the cluster on X[t] to which $x_i$ is classified using a suitable method from Table 1. The Bootstab index is
$$ I_{Boot}(\mathcal{C}) = \frac{1}{A} \sum_{a=1}^{A} \left\{ \frac{1}{n^2} \sum_{i, i'} \left| f_{ii'a}^{(1)} - f_{ii'a}^{(2)} \right| \right\}, $$
where for $t = 1, 2$,
$$ f_{ii'a}^{(t)} = 1\left(l_{i'a}^{(t)} = l_{ia}^{(t)}\right), $$
indicating whether $x_i$ and $x_{i'}$ are in, or classified to, the same cluster based on the clustering of X[t]. $I_{Boot}$ is the percentage of pairs that have different "co-membership" status based on the clusterings on the two bootstrap samples. Small values of $I_{Boot}$ are better. Fang and Wang (2012) suggest choosing the number of clusters by minimising $I_{Boot}$. Without proof they imply that this method is not systematically biased in favour of smaller numbers of clusters.
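Analogously, a rough sketch of Bootstab for K-means, again not the authors' implementation: for simplicity every original point is assigned to the nearest centroid of each bootstrap clustering, which for K-means agrees with the cluster membership of points that were resampled.

```r
## Bootstab instability index for K-means; X is a numeric data matrix.
bootstab <- function(X, K, A = 50) {
  n <- nrow(X)
  inst <- numeric(A)
  for (a in seq_len(A)) {
    f <- list()
    for (t in 1:2) {
      idx <- sample(n, n, replace = TRUE)              # bootstrap sample t
      fit <- kmeans(X[idx, , drop = FALSE], centers = K, nstart = 10)
      ## assign every original point to the nearest centroid of this clustering
      d2 <- as.matrix(dist(rbind(fit$centers, X)))[-(1:K), 1:K, drop = FALSE]
      lab <- apply(d2, 1, which.min)
      f[[t]] <- outer(lab, lab, "==")                  # co-membership matrix f^(t)
    }
    inst[a] <- mean(abs(f[[1]] - f[[2]]))              # proportion of pairs with different status
  }
  mean(inst)                                           # small values indicate stability
}
```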
An aggregated index can then be defined as a weighted mean of the (calibrated) indexes $I_1, \ldots, I_s$ with weights $w_1, \ldots, w_s > 0$:
$$ A(\mathcal{C}) = \frac{\sum_{j=1}^{s} w_j I_j(\mathcal{C})}{\sum_{j=1}^{s} w_j}. \qquad (2) $$
and to what extent certain indexes are systematically biased in favour of small
or large K.
4.3 Calibration
The random clusterings can be used in different ways to calibrate the clustering validity indexes.

Algorithm 2: Random K-single / complete / average linkage algorithms

input: X = {x_1, . . . , x_n} (objects), D = (d(x_i, x_j))_{i,j=1,...,n} (dissimilarities), K (number of clusters)
output: C = {C_1, . . . , C_K} (set of clusters)

Initialisation:
  Choose K random initial points S <- {s_1, s_2, . . . , s_K} according to the uniform distribution over subsets of size K from X.
  Initialise clusters C(S) = {C_1, . . . , C_K} <- {{s_1}, . . . , {s_K}}.
  t <- 1; R <- X \ S;
  D^(t) = (d^(t)(x, C))_{x in R, C in C}, with d^(t)(x, C_j) = d(x, s_j), j = 1, . . . , K.
repeat
  Step 1: (g, h) <- arg min_{x_g in R, C_h in C} d^(t)(x_g, C_h).
  Step 2: C_h <- C_h ∪ {x_g}, with C updated accordingly; R <- R \ {x_g}.
  Step 3: for each x in R, C_j in C do
    update d^(t+1)(x, C_j) <- d^(t)(x, C_j) for j ≠ h, and
      Random K-single:   d^(t+1)(x, C_h) <- min_{y in C_h} d^(t)(x, y),
      Random K-complete: d^(t+1)(x, C_h) <- max_{y in C_h} d^(t)(x, y),
      Random K-average:  d^(t+1)(x, C_h) <- (1/|C_h|) Σ_{y in C_h} d^(t)(x, y).
  t <- t + 1
until R = ∅
return C, denoted CrKsin(S), CrKcom(S), or CrKave(S).

For given B and any value of the number of clusters K ∈ {2, . . . , K_max} of interest, 4B + R_K clusterings and corresponding index values are computed, where R_K is the number of "genuine" clusterings for given K, C_1, . . . , C_{R_K}, to be validated originally, i.e., those generated by "proper" clustering methods (as opposed to the random clusterings generated for calibration), and S_1, . . . , S_{4B} are initialisation sets of size K:
• Indexes for all clusterings can be calibrated relative to genuine and random clusterings for all values of K together. Here, raw index values are compared over different values of K. This cannot correct systematic biases of indexes, but may be suitable if the raw index values appropriately formalise what is required in the application of interest, and if indexes that systematically improve for larger K (such as average within-cluster distances) are balanced by indexes that favour a smaller number of clusters (such as separation or prediction strength).
where
$$ m(\mathcal{C}_{Kcol}) = \frac{1}{|\mathcal{C}_{Kcol}|} \sum_{\mathcal{C}^* \in \mathcal{C}_{Kcol}} I(\mathcal{C}^*). $$
Further options are for example standardisation to range (0, 1), or trans-
formation of all values in the collection to ranks, which would lose the infor-
mation in the precise values, but is less affected by outliers.
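The following sketch ties the pieces together under simplifying assumptions: random clusterings are generated by random K-centroids only (the linkage-based variants of Algorithm 2 are analogous), genuine and random clusterings are pooled into one collection, index values are Z-score calibrated against that collection, and the calibrated values are aggregated as in (2). Function and argument names are ours; the index functions are the sketches given earlier, and indexes for which small values are good (such as Iave.wit and IBoot) would need a sign flip before aggregation.

```r
## Random K-centroids: K points drawn at random act as centroids,
## all other points are assigned to the nearest centroid.
random_kcentroids <- function(D, K) {
  n <- nrow(D)
  S <- sample(n, K)
  labels <- apply(D[, S, drop = FALSE], 1, which.min)
  labels[S] <- seq_len(K)
  labels
}

## Calibrate and aggregate: genuine is a list of label vectors from "proper"
## clustering methods, indexes a named list of functions f(D, labels),
## weights a vector of the same length as indexes.
aggregate_calibrated <- function(D, genuine, indexes, weights, B = 100, K = 3) {
  random <- replicate(B, random_kcentroids(D, K), simplify = FALSE)
  all_cl <- c(genuine, random)
  vals <- sapply(indexes, function(f) sapply(all_cl, f, D = D))  # rows: clusterings, cols: indexes
  z <- scale(vals)                                   # Z-score calibration over the whole collection
  z_genuine <- z[seq_along(genuine), , drop = FALSE]
  apply(z_genuine, 1, function(v) sum(weights * v) / sum(weights))  # weighted mean, equation (2)
}

## Hypothetical usage: compare two genuine clusterings of a data matrix X.
# D <- as.matrix(dist(X))
# genuine <- list(kmeans(X, 3)$cluster,
#                 cutree(hclust(as.dist(D), method = "average"), 3))
# aggregate_calibrated(D, genuine,
#                      indexes = list(asw = asw, pg = pearson_gamma),
#                      weights = c(1, 1), B = 100, K = 3)
```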
Figure 1: Values of the PearsonΓ index for eight clustering methods on Movement data for numbers of clusters 2-8 (left side) and 9-20 (right side), together with values achieved by random clusterings, denoted by grey letters "c" (random K-centroids), "f" (random furthest neighbour/complete), "a" (random average neighbour), and "n" (random nearest neighbour/single), from left to right for each number of clusters.
It can be seen in Figure 1, and also later in Figures 4, 5, 7, 9, and 10, that the random clusterings often achieve better index values than at least the weakest "proper" clusterings, if not even the majority of them; note that only in some of these figures random clustering results are actually shown, but more can be inferred from some Z-score transformed values not being significantly larger than 0. This also happens for validation indexes already in the literature.
and it is obviously evidence for the usefulness of the approach if it is possible
to achieve good results recovering such known true clusters.
We simulated scenarios and applied methods to some real data sets using
several combinations of indexes, in order to find combinations that have a
good overall performance. One should not expect that a single composite
index works well for all data sets, because “true” clusters can have very
different characteristics in different situations. We found, however, that there
are two composite indexes that could be used as some kind of basic toolbox,
namely A1, made up of Iave.wit, IPearsonΓ, and IBoot, and A2, made up of Isep.index, Iwidest.gap, and IBoot (all with equal weights). Calibration was done as explained in Section 4.3 with B = 100.
A1 emphasises cluster homogeneity by use of Iave.wit. IPearsonΓ supports small within-cluster distances as well, but will also prefer distances between clusters to be large, adding some protection against splitting already homogeneous clusters, which could easily happen if Iave.wit were used without a corrective. Stability as measured by IBoot is of interest in most clustering tasks, and is another corrective against producing spurious clusters that are too small.
A2 emphasises cluster separation by use of Isep.index. Whereas Isep.index looks at what goes on between clusters, Iwidest.gap makes sure that gaps within clusters are avoided; otherwise one could achieve strong separation by isolating only the most separated clusters and leaving together clusters that intuitively should better be split. Once more IBoot is added for stability. At least
one of A1 and A2 worked well in all situations, although neither of these
(and none of the indexes already in the literature) works well universally, as
expected.
In our experiments we found that overall IBoot worked slightly better than IPS for incorporating stability in a composite index, and that aggregating over all numbers of clusters K together worked better, at least for A1 and A2, than aggregating separately for each K (this is different for some other composite indexes; also, separate aggregation may be useful where aggregating over all K leads to "degenerate" solutions at the upper or lower bound of K).
Z-score standardisation looked overall slightly preferable to other standardisation schemes, but there is much variation and little difference between them
overall. We only present selected results here, particularly not showing com-
posite indexes other than A1 and A2 , standardisation other than Z-score,
and aggregation other than over all K together. Full results are available
from the authors upon request. There may be a certain selection bias in our
results given that A1 and A2 were selected based on the results of our exper-
iments from a large set of possible composite indexes. Our aim here is not to argue that these indexes are generally superior to what already exists in the literature, but rather to make some well-founded recommendations for practice and to demonstrate their performance characteristics. A1 and A2 were not selected by formal optimisation over experimental results, but rather for the good interpretability of the composite indexes (so that in a practical situation a user can make a decision without much effort), and for representing the basic tension in clustering between within-cluster homogeneity and between-clusters separation.
Note that composite indexes involving entropy could have performed even better for the simulated and benchmark data sets with given true clusters below, because the entropy of the given true clusters is in most cases perfect or at least very high. But involving entropy here seemed unfair to us, since the true clusters' entropy is already known for the simulated and benchmark data sets, whereas in reality a high entropy cannot be taken for granted. Where roughly similar cluster sizes are indeed desirable in reality, we recommend involving entropy in the composite indexes.
Results for cluster validation and comparison of different numbers of clus-
ters generally depend on the clustering algorithm that is used for a fixed
number of clusters. Here we applied 8 clustering algorithms (Partitioning
Around Medoids (PAM), K-means, Single Linkage, Complete Linkage, Av-
erage Linkage, Ward's method, Gaussian model-based clustering (mclust), and Spectral Clustering; for all of these, standard R functions with default set-
tings were used). All were combined with the validity indexes CH, ASW,
Dunn, PearsonΓ, CVNN (with κ = 10), and the stability statistics PS and
Bootstab with A = 50. PS was maximised, which is different from what
is proposed in Tibshirani and Walther (2005), where the largest number of
clusters is selected for which PS is larger than some cutoff value. For the
recommended choices of the cutoff, in our simulations many data sets would
have produced an estimator of 1 for the number of clusters due to the lack of
any solution with K ≥ 2 and large enough PS, and overall results would not
have been better. One popular method that is not included is the BIC for
mixture models (Fraley and Raftery (2002)). This may have performed well
together with mclust in a number of scenarios, but is tied to mixture models
and does not provide a more general validity assessment.
(2011). Figure 2 shows data from the six scenarios.
• Scenario 5 (Two ring-shaped clusters in two dimensions): Gen-
erated by function shapes.circles2 of the R-package clusterSim.
For each point a random radius r is generated (see below), then a
random angle α ∼ U [0, 2π]. The point is then (r cos(α), r sin(α)). De-
fault parameters are used so that each cluster has 180 points. r for
the first cluster is from Uniform[0.75, 0.9], for the second cluster from
Uniform[0.35, 0.5].
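For concreteness, the following sketch reproduces this generating mechanism directly from the description (rather than calling clusterSim::shapes.circles2); parameter values follow the text.

```r
## Two ring-shaped clusters: random radius from the given ranges, random angle.
ring_cluster <- function(n, r_min, r_max) {
  r     <- runif(n, r_min, r_max)          # random radius
  alpha <- runif(n, 0, 2 * pi)             # random angle
  cbind(r * cos(alpha), r * sin(alpha))    # (r cos(alpha), r sin(alpha))
}
set.seed(1)
X2rings <- rbind(ring_cluster(180, 0.75, 0.9),   # cluster 1 (outer ring)
                 ring_cluster(180, 0.35, 0.5))   # cluster 2 (inner ring)
labels_true <- rep(1:2, each = 180)
```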
Results are given in Tables 2 and 3. All these results are based on the
clustering method that achieved highest Adjusted Rand Index (ARI) for
the true number of clusters. This was decided because of the large number
of results, and because the clustering method chosen in this way gave the
validation methods the best chance to find a good clustering at the true
number of clusters. Figure 2 shows these clusterings.
Tables 2 and 3 give two kinds of results, namely the distribution of the
estimated numbers of clusters, and the average ARI (the maximum ARI is
1 for perfect recovery of the true clusters; a value of 0 is the expected value
for comparing two unrelated random clusterings, negative values can occur
as well). If the number of clusters is estimated wrongly, arguably
finding a clustering with high ARI and therefore similar to the true one is
more important than having the number of clusters close to the true one, and
in general it is not necessarily the case that a “better” number of clusters
estimate also yields a “better” clustering in the sense of higher ARI.
Scenario 1 was rather easy, with many indexes getting the number of
clusters always right. The clusters here are rather compact, and A1 is among
the best validation methods, with A2 lagging somewhat behind.
In Scenario 2, clusters are still spherical. CVNN does the best job here
and finds the correct number of clusters 45 times. Although A1 manages
this only 36 times, its average ARI of 0.930 is almost the same as what CVNN achieves, and both are better by some distance than all the other methods.
A2 once more performs weakly. This should have been a good scenario for
CH, because clusters are still spherical Gaussian, but compared to scenario
1 it loses quality considerably, probably due to the higher dimensionality.
(a) PAM, K = 3 (ARI = 0.990) (b) mclust, K = 4 (ARI = 0.955)
(e) Single linkage, K = 2 (ARI = 1.000) (f) Spectral clustering, K = 2 (ARI = 1.000)
Figure 2: Data sets produced by Scenarios 1-6 (for the more than 2-
dimensional data sets from Scenarios 2-4 principal components are shown)
with the respective clusterings that achieve highest ARI on the true number
of clusters.
Table 2: Results of Simulation Study. Numbers are counts out of 50 trials.
Counts for estimates larger than 10 are not displayed. “*” indicates true
number of clusters.
Validity Index | ARI | Estimate of the number of clusters: 2 3 4 5 6 7 8 9 10
Scenario 1 - Three clusters in 2-d - PAM clustering
CH 0.990 0 50∗ 0 0 0 0 0 0 0
ASW 0.961 6 44∗ 0 0 0 0 0 0 0
Dunn 0.937 11 39∗ 0 0 0 0 0 0 0
Pearson Γ 0.990 0 50∗ 0 0 0 0 0 0 0
Prediction strength 0.966 5 45∗ 0 0 0 0 0 0 0
Bootstab 0.990 0 50∗ 0 0 0 0 0 0 0
CVNN 0.990 0 50∗ 0 0 0 0 0 0 0
A1 0.990 0 50∗ 0 0 0 0 0 0 0
A2 0.942 10 40∗ 0 0 0 0 0 0 0
Scenario 2 - Four clusters in 10-d - model-based (mclust) clustering
CH 0.879 6 6 38∗ 0 0 0 0 0 0
ASW 0.815 9 9 32∗ 0 0 0 0 0 0
Dunn 0.796 12 5 32∗ 1 0 0 0 0 0
Pearson Γ 0.799 7 15 28∗ 0 0 0 0 0 0
Prediction strength 0.633 28 8 14∗ 0 0 0 0 0 0
Bootstab 0.749 13 5 9∗ 23 0 0 0 0 0
CVNN 0.934 1 4 45∗ 0 0 0 0 0 0
A1 0.930 0 4 36∗ 10 0 0 0 0 0
A2 0.709 18 11 20∗ 1 0 0 0 0 0
Scenario 3 - Four or six clusters in 6-d
with mixed distribution types - model-based (mclust) clustering.
CH 0.567 30 11 3∗ 2 1∗ 1 0 2 0
ASW 0.454 35 2 10∗ 1 2∗ 0 0 0 0
Dunn 0.571 23 12 4∗ 5 4∗ 1 0 0 1
Pearson Γ 0.587 15 3 18∗ 8 5∗ 0 1 0 0
Prediction strength 0.418 39 11 0∗ 0 0∗ 0 0 0 0
Bootstab 0.807 2 12 36∗ 0 0∗ 0 0 0 0
CVNN 0.568 32 5 3∗ 2 2∗ 2 1 1 2
A1 0.788 2 1 37∗ 2 4∗ 0 0 2 2
A2 0.739 1 5 26∗ 4 0∗ 0 0 3 11
Table 3: Continuation of Table 2.

Validity Index | ARI | Estimate of the number of clusters: 2 3 4 5 6 7 8 9 10
Scenario 4 - Two elongated clusters in 3-d - Complete linkage
CH 0.755 23∗ 9 8 4 6 0 0 0 0
ASW 1.000 50∗ 0 0 0 0 0 0 0 0
Dunn 1.000 50∗ 0 0 0 0 0 0 0 0
Pearson Γ 0.995 49∗ 1 0 0 0 0 0 0 0
Prediction strength 0.995 49∗ 1 0 0 0 0 0 0 0
Bootstab 0.975 45∗ 4 1 0 0 0 0 0 0
CVNN 0.516 1∗ 6 24 9 7 3 0 0 0
A1 0.965 43∗ 6 1 0 0 0 0 0 0
A2 1.000 50∗ 0 0 0 0 0 0 0 0
Scenario 5 - Two ring-shaped clusters in 2-d - Single linkage
CH 0.646 0∗ 4 8 10 11 6 5 3 0
ASW 0.711 2∗ 17 13 8 6 2 1 1 0
Dunn 1.000 50∗ 0 0 0 0 0 0 0 0
Pearson Γ 0.617 0∗ 0 0 3 7 4 11 15 0
Prediction strength 1.000 50∗ 0 0 0 0 0 0 0 0
Bootstab 0.901 37∗ 0 0 0 0 3 2 5 0
CVNN 0.736 5∗ 15 16 7 5 2 0 0 0
A1 0.602 0∗ 0 0 0 0 2 4 11 33
A2 0.982 47∗ 0 0 1 1 0 1 0 0
Scenario 6 - Two moon-shaped clusters in 2-d - Spectral clustering
CH 0.296 0∗ 0 3 3 8 5 7 11 13
ASW 0.349 0∗ 0 6 10 12 6 6 7 3
Dunn 1.000 50∗ 0 0 0 0 0 0 0 0
Pearson Γ 0.452 0∗ 0 14 21 10 2 3 0 0
Prediction strength 1.000 50∗ 0 0 0 0 0 0 0 0
Bootstab 0.245 0∗ 0 0 0 0 0 0 6 44
CVNN 0.338 0∗ 0 5 9 11 7 5 8 5
A1 0.321 0∗ 0 0 3 10 9 9 11 8
A2 1.000 50∗ 0 0 0 0 0 0 0 0
In scenario 3, the ARI was computed based on 6 clusters, but 4 clusters
are seen as a sensible estimate of K. Bootstab achieves the best ARI result,
followed closely by A1 , which gets the number of clusters right most often
(and, as the only method, estimated K = 6 more often than K = 5), and A2. The other methods are some distance behind.
In scenario 4, where elongated clusters mean that some within-cluster
distances are quite large, A2 performs better than A1 , which puts more
emphasis on within-cluster homogeneity. Apart from A2 , also ASW and the
Dunn index deliver a perfect performance. CH and particularly CVNN are
weak here; the other methods are good with the occasional miss.
In Scenario 5, within-cluster homogeneity is no longer a key feature of the
clusters. A2 does an almost perfect job (Dunn and PS achieve ARI = 1),
whereas A1 is much worse, as are CH and PearsonΓ, with ASW and CVNN
somewhat but not much better.
Scenario 6 produces similar results to Scenario 5 with once more A2 ,
Dunn and PS performing flawlessly. The rest is much worse, with PearsonΓ
here best of the weaker methods and Bootstab in last position.
Overall these simulations demonstrate convincingly that different clus-
tering problems require different cluster characteristics, as are measured by
different indexes. One of A1 and A2 was always among the best methods,
depending on whether the scenario was characterised by a high degree of
within-cluster homogeneity, in which case A1 did well, whereas A2 was the
best method where between-clusters separation dominated, and for nonlinear
clusters. The results also show that no method is universally good. A1 per-
formed very weakly in Scenarios 5 and 6; A2 failed particularly in Scenario 2, and ended up at some distance from the best methods in Scenarios 1 and 3.
CH was weak in scenarios 3, 5, and 6 and suboptimal elsewhere, ASW failed
in scenarios 3 and 6, and was suboptimal in some others, the Dunn index
did not perform well in scenarios 1-3, PearsonΓ was behind in scenarios 2, 3,
5, and 6, PS failed in scenarios 2 and 3, Bootstab in Scenarios 2 and 6, and
CVNN in Scenarios 3-6. In any case, when splitting up the scenarios into
two groups, namely scenario 1, 2, and 4, where homogeneity and dissimilar-
ity representation are more important, and 3, 5, and 6, where separation is
more important, A1 on average is clearly the best in the first group with an
average ARI of 0.962, with PearsonΓ achieving 0.928 in second place, and A2
is clearly the best in the second group with an average ARI of 0.907 followed
by Dunn achieving 0.857. The assignment of scenarios 3 and 4 may be con-
troversial. If scenario 3 is assigned to the homogeneity group, A1 is best by
an even larger margin. If in exchange scenario 4 is assigned to the separation
group, Dunn, PS, and A2 all achieve an average ARI better than 0.99. A2
also has the best overall mean of 0.895, although this is less relevant because
all overall means are affected by bad results in at least some scenarios, and no method should be recommended for universal use (a point that is ignored in almost all papers introducing the already existing methods).
The composite indexes A1 and A2 have a clear interpretation in terms
of the features that a good clustering should have, and the results show
that they perform in line with this interpretation. The researcher needs to
decide what characteristics are required, and if this is decided correctly, a
good result can be achieved. Obviously in reality the researcher does not
know, without having clustered the data already, what features the “true”
clusters have. However, in many real applications there is either no such
thing as “true clusters”, or the situation is ambiguous, and depending on the
cluster concept several different clusterings could count as “true”, see Hennig
(2015b). In a certain sense, by choosing the cluster concept of interest, the
researcher “defines” the “true clusters”.
Figure 3: Discriminant coordinates plots for Wine data. Left side: True
classes. Right side: Clustering solution by spectral clustering with K = 7
(DCs are computed separately for the different clusterings).
Figure 4: Results for Wine data. First row left: A1 index. First row right:
ASW index (Z-score calibrated). Second row left: A1 index with calibration
based on same K only. Second row right: A1 index based on aggregation
without calibration.
the random clusterings produce better results than all the proper clustering
methods. Looking at the lower left side plot, in which results are shown
calibrated separately for separate K, it can be seen that for two and three
clusters the best proper clustering methods (which are also the best overall,
so the 3-means solution with ARI 0.9 comes out as the best here) are still
superior to the random clusterings, meaning that calibrated index values
are substantially larger than zero. For K ≥ 5 they all drop below zero. A
possible explanation is that the proper clustering methods for larger K fail
by trying to adapt to a structure that does not exist, whereas the random
clusterings are less affected by the fact that their K does not correspond
to a meaningful number of clusters. Looking at Figure 5 in more detail, it
can be seen that even for larger K the random clusterings are beaten by
the best proper clustering methods regarding Iave.wit (the first row shows
the Z-calibrated values and the raw values with random clusterings added;
note that in these plots as well as for Bootstab smaller values are better, as
opposed to A1 , PearsonΓ, and ASW), and also for Bootstab (lower left plot
in Figure 5). Regarding PearsonΓ (lower right plot in Figure 5), the best
proper clusterings are slightly below the average of the random clusterings
for larger K (Z-calibrated values below 0), but different methods are best
for different criteria, and the random clusterings are better for large K after
Figure 5: Results for Wine data. First row left: Iave.wit (Z-score calibrated).
First row right: Iave.wit without calibration, with random clusterings. Second
row left: Bootstab (Z-score calibrated). Second row right: PearsonΓ (Z-score
calibrated).
aggregation. Figure 4 (upper right) shows that for the ASW only the two
best methods, Ward and spectral clustering, have Z-calibrated values larger
than zero for K ≥ 5, so the issue does not only affect A1 and its constituents.
Furthermore, ASW suggests the smallest possible K = 2 as optimal, which
can be observed in many applications (e.g., Hennig and Liao (2013)) and
may benefit from calibration for separate K.
The lower right plot in Figure 4 shows that if the indexes are aggregated without calibration, the result is entirely dominated by Iave.wit, which in its uncalibrated state has by far the largest value range and variance (compare the upper plots in Figure 5, taking into account the sign change for aggregation). This is not sensible.
Figure 6: Discriminant coordinates plot for Seeds data. Left side: True
classes. Right side: Clustering solution by PAM with K = 3.
that was picked as the best by A1 , and has the second best ARI (0.75) of all
clusterings; only 3-means is slightly better (0.77).
Results are shown in Table 4. A1 picks the two best clusterings as its best two, albeit in reverse order. The clusterings picked as best by all other indexes are worse; however, CH and CVNN achieve slightly better ARI averages over the best three clusterings because of better clusterings in third place. The results of ASW, Dunn, PearsonΓ, PS and Bootstab are clearly weaker.
Figure 7 shows the full results for A1 , A2 , Bootstab and CVNN (the latter
two uncalibrated, and smaller values are better). A2 is once more dominated
by single linkage solutions that isolate outliers. The Bootstab plot shows that
spectral clustering and Ward with small K are substantially less stable than
the other methods; Bootstab's comparison of methods is somewhat similar between the Seeds and Wine data sets. The CVNN plot gives a good indication
for 3 clusters for a number of clustering methods.
Figure 7: Results for Seeds data. First row left: A1 . First row right: A2 .
Second row left: Bootstab (uncalibrated; smaller values are better). Second
row right: CVNN (uncalibrated; smaller values are better).
Table 4: Results of real data examples.
Validity Index | ARI (first, second, third, average) | Best clusterings in order (K): first, second, third
WINE data set
CH 0.17 0.19 0.22 0.19 Ward (10) Ward (9) Spectral (7)
ASW 0.33 0.37 0.31 0.34 Ward (2) Spectral (2) Spectral (3)
Dunn 0.33 0.19 0.18 0.23 Ward (2) Ward (9) Spectral (9)
Pearson Γ 0.33 0.37 0.31 0.34 Ward (2) Spectral (2) Spectral (3)
PS -0.01 -0.01 -0.01 -0.01 Single (3) Single (2) Single (4)
Bootstab -0.01 0.36 -0.01 0.11 Single (2) PAM (2) Single (3)
CVNN 0.31 0.25 0.31 0.29 Spectral (4) Spectral (6) Spectral (3)
A1 0.22 0.90 0.18 0.43 Spectral (7) 3-means Spectral (9)
SEED data set
CH 0.63 0.71 0.77 0.70 Mclust (3) Ward (3) 3-means
ASW 0.44 0.48 0.50 0.48 PAM (2) 2-means Spectral (2)
Dunn 0.44 0.43 0.77 0.55 Ward (8) Spectral (5) 3-means
Pearson Γ 0.63 0.44 0.71 0.60 Mclust (3) PAM (2) Ward (3)
PS 0.00 0.49 0.48 0.32 Single (2) Average (2) 2-means
Bootstab 0.00 0.75 0.77 0.51 Single (2) PAM (3) 3-means
CVNN 0.63 0.69 0.75 0.69 Mclust (3) Complete (3) PAM (3)
A1 0.75 0.77 0.50 0.67 PAM (3) 3-means Average (9)
MOVEMENT data set
CH 0.07 0.05 0.04 0.05 2-means Spectral (2) Average (2)
ASW 0.30 0.31 0.24 0.28 20-means 19-means Ward (10)
Dunn 0.00 0.00 0.00 0.00 Single (2) Single (3) Single (4)
Pearson Γ 0.23 0.23 0.25 0.23 Average (14) Average (15) Average (16)
PS 0.00 0.00 0.00 0.00 Single (2) Single (3) Single (4)
Bootstab 0.00 0.00 0.00 0.00 Single (2) Single (3) Single (4)
CVNN 0.34 0.16 0.19 0.26 Spectral (11) Spectral (4) 10-means
A1 0.32 0.30 0.34 0.32 Average (20) 20-means Average (18)
K-means over average linkage for large K, which is slightly worse regarding
the ARI. Both methods rule out single linkage clusterings as bad, which is
correct here, as opposed to the results of Dunn and A2 .
Figure 8: Discriminant coordinates plot for Movement data. Left side: True
classes. Right side: Clustering solution by average linkage with K = 20.
[Figure: index values plotted against the number of clusters (2 to 12); legend entries: kmeans, mclust, single, average, complete, clara, spectral, A1, A2.]
0.1 0.3 0.5 0.06 0.10 0.14 0.1 0.3 0.5 0.06 0.10 0.14
0.5
44 44 4 44 4444 4444 444 444 11 11 1 11 1111 1 111 111
4444111141
4444 4 4
1141 141 411444 44 4 44 4
41 1111111111
1111 1 111111 111111 1 1 11 1
11
111111111
1111111
11 111 44 1 11 114 1111
1 11 44 1
11 1111 111111111
1111111
11 111 11
11 11111 1 11 111 1111
1 11 11 1
11 1111
1111 111111 111 1
11121 111 111
111
11 1111 1111111 11 11111 11 1111
11111 1111 1 1 11
11111 1 1111 111111 111 1
11111 111 111
111
11 11
11 1111 11 11111 11 1111
11111 11 1 1
11 11
111 11 1
11111111 111111111
11 11
11
1111 111111
1 11 11 1
2111111 1111
111 111
1 11 11111
11
1 1 11 1
11 1 11111111 111111111
11 11
11
11111 1111111111 1 1111
111 11111
1111111
1 1 111 111 1
333333331111
3
3333 311313
1
13 1
1
1 1122222
111222222
222
222222 2 UNION 333 33 3
221
33112222
1
1
2222 121211
21 21 111111111
2122211
333 21223
222322232
1
33
23
2
22
11
222
11131
31
2212 11111
11 131111
111 111 1
1 12
2
11
2
1
12
1
12
1 2
1
1
1
21
1
2112
21
3
3333333
333 222222221111
2
2222 211212
1
12 1
1
1
1111
111
11111 11
111111 111111 1 UNION 222 2222111111
2
11
11111
1111111
11211
11
111111111111111
11 222 11112
111211121
11
22
12
11
111
1121 111
11
111111121111
21 111 111 1
1 11
1
1
11
11
1
1111
1
11
11
1
11 2
2222222
222
0.3
0.3
222 33 2232 3 22 2
1313 2212222222
2
2222
22 2 33 22
3333233
33 2 22 222222
2 211 1112 1 323
3
3222 3 2
2
2
12
333
1
23 22
2222231
3
222213 1 11 2 212
1
22
222
12
2 2
1121 33333 33 3 1212 111 1111111
1
1111
11 1 22 11
2222126
22 1 11 111111
1 111 1111 1 212
2111
2 2 1
1
11
222
1
12 11
1111121
2
111112 1 11 11 111
11
11 1 22222 22 2
333333 11
3333333
33 3 222222 22 333333 223222 2 2 22 22212 1 32 3332
3
33 23
2323 3 32 1 22221
22
2222 33 33 3333
33 333 226222 11
2222222
22 2 111111 11 1 222222 112111 1 1 11 11111 1 21 2221
2
22 12
1212 2 21 1 111 1111
1
111 62 22 2222
22 222
3333333
3 333322 2 22222 33
333333333 2
3 322 2
3 2 3
2
2
33 2
3
33 332
3333 3
3 33
32 3 2
22 333333333 33 22222
[Figure: scatterplot panels of the data with points labelled by cluster number; the only labels recoverable from the extracted text are the variable/panel names GRUENE and LINKE.]
6 Discussion
The general idea of this paper is that cluster validity can be measured by composite indexes, composed of individual indexes measuring the various characteristics of a clustering that are important to a user in a given application. Meaningful aggregation requires calibration, which can be done against a set of random clusterings generated on the data set. A user may not feel comfortable making detailed decisions about which indexes to include and how to weight them. For this reason we suggest two specific composite indexes, focusing on within-cluster homogeneity and between-clusters separation, respectively. These give substantially different results, and our experiments have shown them to be advantageous in complementary situations. We nevertheless recommend taking background information and the specifics of the situation into account in order to define a composite index that is better adapted to the problem at hand. Graphical displays such as principal components analysis may also help. Generally we have seen that no index can be expected to work universally, and that different cluster characteristics are important in different situations. The present work enables the user to adapt cluster validation to what counts in the specific application, and it shows that, given an appropriate decision about the basic characteristics required of the clusters, even the proposed basic composite indexes can outperform existing indexes from the literature.
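As a rough illustration of this calibration-and-aggregation scheme, the following R sketch computes two simple quality indexes for a set of candidate clusterings and a set of random clusterings, standardises (calibrates) each index over the pooled collection, and aggregates the calibrated values with user-chosen weights. It is a minimal sketch rather than the paper's implementation: the index functions homog and separ, the plain random label assignments, the equal weights, and the use of the iris data are all illustrative assumptions.

## Illustrative only: calibrate index values against random clusterings
## and aggregate them with user-chosen weights.
set.seed(1)
x <- scale(iris[, 1:4])
d <- as.matrix(dist(x))

## Two simple indexes, oriented so that larger values are better:
## negative average within-cluster dissimilarity (homogeneity) and
## minimum dissimilarity between points of different clusters (separation).
homog <- function(d, cl) {
  same <- outer(cl, cl, "==") & upper.tri(d)
  -mean(d[same])
}
separ <- function(d, cl) {
  min(d[outer(cl, cl, "!=")])
}

## "Proper" candidate clusterings: k-means with K = 2, ..., 5.
cand <- lapply(2:5, function(k) kmeans(x, centers = k, nstart = 10)$cluster)

## Random clusterings on the same data, used only for calibration.
## (The paper uses more structured random clustering schemes; plain
## random label assignments are used here to keep the sketch short.)
rand <- replicate(50, {
  k <- sample(2:5, 1)
  sample(k, nrow(x), replace = TRUE)
}, simplify = FALSE)

## Calibrate: standardise each index over all clusterings (candidates and
## random ones pooled), then aggregate with user-chosen weights.
allcl <- c(cand, rand)
vals  <- sapply(allcl, function(cl) c(homog = homog(d, cl), separ = separ(d, cl)))
calib <- t(scale(t(vals)))                 # z-score each index across clusterings
w     <- c(homog = 0.5, separ = 0.5)       # weights reflect the aims of clustering
agg   <- colSums(w * calib)[seq_along(cand)]
names(agg) <- paste0("K=", 2:5)
round(agg, 2)   # larger aggregated value = better under this composite index

In the paper's approach the calibration is against specifically designed random clustering schemes and the weights reflect the cluster characteristics relevant in the application; the sketch only conveys the structure of the computation.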
These basic composite indexes are not recommended for general automatic use; rather, the user is encouraged to employ all available background knowledge to design their own composite index for a given situation, and to examine the behaviour of the individual indexes of interest before aggregation when deciding on one or more optimal clusterings. Such decisions can be made before results are known, which keeps the result free from the influence of selective perception, and even from the suspicion that the user simply cherry-picks a clustering that confirms their prior opinion. On the other hand (once more reflecting the tension, mentioned earlier, between objectivity and flexible adaptation to the specifics of the situation), results may point the user to unanticipated issues and may lead them to revise their ideas. For example, for the Wine data the user can conclude from the difficulty of finding any proper clustering with A1 substantially larger than 0 that calibration based on separate K (i.e., calibrating separately for each number of clusters) can be more suitable here, even without taking the "true" clustering (which in reality is not available) into account.
In many situations background information and knowledge of the aim of clustering may not be enough for the user to confidently choose individual indexes and fix weights. In such situations we recommend the basic composite indexes as starting points for an analysis, keeping in mind at least the rough distinction of whether homogeneity and dissimilarity representation on the one hand, or separation and the avoidance of within-cluster gaps on the other, are the focus in the situation at hand.
The present paper should be seen as a starting point rather than an end point for research. The simulations and examples are illustrative rather than comprehensive. Among the various things that could be tried out, non-Euclidean data with other distances would be a worthwhile topic of investigation. Additional indexes and random clustering methods are conceivable (Hennig (2019) proposes some more), and there are various possibilities for theoretical investigation of the indexes and aggregation strategies. One issue with the approach is computational cost. Whereas the random clusterings can be generated quickly, computing all index values for all proper and random clusterings takes time. In particular, the stability indexes PS and Bootstab involve resampling and are expensive to run for a large number of clusterings. Shortcuts, and investigations of how small the numbers of random clusterings and resampling runs can be chosen while still keeping results reasonably stable, would be worthwhile; we believe, and some experience suggests, that good results can already be achieved for, say, B = 20 and A = 25. With the computing power and parallel computing available nowadays, data sets clearly larger than those treated in this paper (although probably not with millions of observations and thousands of variables) can be analysed.
The methodology presented here is implemented in the function "clusterbenchstats" in the R package "fpc".
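As a hedged pointer to the software, a call might look roughly like the sketch below. The argument names shown (clustermethod, methodname, and the *runs arguments controlling the numbers of random clusterings) are given to the best of our knowledge and may differ between fpc versions, so the help page ?clusterbenchstats should be consulted; the data set, clustering methods, and range of numbers of clusters are illustrative.

## Hedged usage sketch for clusterbenchstats() in the fpc package;
## check ?clusterbenchstats for the exact arguments in your fpc version.
library(fpc)

x <- scale(iris[, 1:4])                               # illustrative data

cbs <- clusterbenchstats(x, G = 2:5,                  # numbers of clusters to compare
         clustermethod = c("kmeansCBI", "claraCBI"),  # fpc clustering interface functions
         methodname = c("kmeans", "clara"),           # method labels (assumed argument name)
         nnruns = 25, fnruns = 25,                    # assumed: random clusterings per scheme
         avenruns = 25, kmruns = 25)

## cbs collects raw and calibrated index values for the proper and the random
## clusterings, from which aggregated composite indexes can be derived.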
Acknowledgements: The work of the second author was supported by
EPSRC grant EP/K033972/1.
References
Arbelaitz, O., I. Gurrutxaga, J. Muguerza, J. M. Perez, and I. Perona (2012).
An extensive comparative study of cluster validity indices. Pattern Recog-
nition 46, 243–256.
Delattre, M. and P. Hansen (1980). Bicriterion cluster analysis. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 4, 277–291.
Dheeru, D. and E. Karra Taniskidou (2017). UCI Machine Learning Repository.
Dias, D. B., R. C. Madeo, T. Rocha, H. H. Bíscaro, and S. M. Peres (2009). Hand movement recognition for Brazilian Sign Language: a study using distance-based neural networks. In 2009 International Joint Conference on Neural Networks (IJCNN 2009), pp. 697–704. IEEE.
Dunn, J. C. (1974). Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4 (1), 95–104.
Fang, Y. and J. Wang (2012). Selection of the number of clusters via the
bootstrap method. Computational Statistics & Data Analysis 56 (3), 468–
477.
Forina, M., R. Leardi, C. Armanino, S. Lanteri, P. Conti, and P. Princi
(1990). Parvus: An extendable package of programs for data exploration,
classification and correlation. Journal of Chemometrics 4 (2), 191–193.
Fraley, C. and A. E. Raftery (2002). Model-based clustering, discriminant
analysis and density estimation. Journal of the American Statistical As-
sociation 97 (4), 611–631.
Gelman, A. and C. Hennig (2017). Beyond subjective and objective in statis-
tics. Journal of the Royal Statistical Society: Series A (Statistics in Soci-
ety) 180 (4), 967–1033.
Halkidi, M., M. Vazirgiannis, and C. Hennig (2015). Method-independent
indices for cluster validation and estimating the number of clusters. In
C. Hennig, M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Clus-
ter Analysis, pp. 595–618. CRC Press.
Handl, J. and J. Knowles (2015). Nature-inspired clustering. In C. Hennig,
M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis,
pp. 419–439. CRC Press.
Hennig, C. (2007). Cluster-wise assessment of cluster stability. Computa-
tional Statistics and Data Analysis 52, 258–271.
Hennig, C. (2015a). Clustering strategy and method selection. In C. Hennig,
M. Meila, F. Murtagh, and R. Rocci (Eds.), Handbook of Cluster Analysis,
pp. 703–730. CRC Press.
Hennig, C. (2015b). What are the true clusters? Pattern Recognition Letters 64, 53–62. Philosophical Aspects of Pattern Recognition.
Liu, Y., Z. Li, H. Xiong, X. Gao, J. Wu, and S. Wu (2013, June). Un-
derstanding and enhancement of internal clustering validation measures.
IEEE Transactions on Cybernetics 43 (3), 982–994.
Tibshirani, R. and G. Walther (2005). Cluster validation by prediction
strength. Journal of Computational and Graphical Statistics 14 (3), 511–
528.