
Agglomerative Hierarchical Clustering Using Asymmetric Similarity

Paper:
Agglomerative Hierarchical Clustering Without Reversals on
Dendrograms Using Asymmetric Similarity Measures
Satoshi Takumi and Sadaaki Miyamoto
Department of Risk Engineering, School of Systems and Information Engineering, University of Tsukuba
1-1-1 Tennodai, Tsukuba, Ibaraki 305-8577, Japan
E-mail: miyamoto@risk.tsukuba.ac.jp
[Received November 29, 2011; accepted September 25, 2012]
Algorithms of agglomerative hierarchical clustering
using asymmetric similarity measures are studied.
Two different measures between two clusters are pro-
posed, one of which generalizes the average linkage
for symmetric similarity measures. Asymmetric den-
drogram representation is considered after foregoing
studies. It is proved that the proposed linkage meth-
ods for asymmetric measures have no reversals in the
dendrograms. Examples based on real data show how
the methods work.
Keywords: agglomerative clustering, asymmetric simi-
larity, asymmetric dendrogram
1. Introduction
Cluster analysis, also called clustering, has now become a standard
tool in modern data mining and data analysis. Clustering techniques
are divided into two classes: hierarchical and non-hierarchical
methods. The major technique in the first class is the well-known
agglomerative hierarchical clustering [1, 2], which is old but has
been found useful in a variety of applications.
Agglomerative hierarchical clustering uses a similarity or
dissimilarity measure between a pair of objects to be clustered, and
the similarity/dissimilarity is assumed to be symmetric. In some real
applications, however, relations between objects are asymmetric, e.g.,
citation counts between journals and imports of goods between two
countries. In such cases we are motivated to analyze asymmetric
measures and obtain clusters having asymmetric features.
Several studies, though not many, have been done on clustering based
on asymmetric similarity measures. Hubert [3] defined clusters using
the concept of the connectivity of asymmetric weighted graphs. Okada
and Teramoto [4] used the mean of asymmetric measures with an
asymmetric dendrogram. Yadohisa [5] studied the generalized linkage
method of asymmetric measures with a variation of the asymmetric
dendrogram representing two levels on a branch.
We propose two new linkage methods for asymmetric similarity measures
in this paper. One method is a generalization of the average linkage
for symmetric similarity, and the other is a model-dependent method
based on the concept of average citation probability from one cluster
to another. As the asymmetric dendrogram herein, we use a variation of
that by Yadohisa [5]. We also prove that the proposed methods have no
reversals in the dendrogram. To observe how the proposed methods work,
we show examples based on three real data sets.
2. Agglomerative Hierarchical Clustering
We first review the general procedure of agglomerative hierarchical
clustering and then introduce asymmetric similarity measures.
Let the set of objects for clustering be $X = \{x_1, \ldots, x_N\}$.
Generally a cluster denoted by $G_i$ is a subset of $X$. The family of
clusters is denoted by
$$\mathcal{G} = \{G_1, G_2, \ldots, G_K\},$$
where the clusters form a crisp partition of $X$:
$$\bigcup_{i=1}^{K} G_i = X, \qquad G_i \cap G_j = \emptyset \quad (i \neq j). \tag{1}$$
Moreover the number of objects in $G$ is denoted by $|G|$.
Agglomerative hierarchical clustering uses a similarity or
dissimilarity measure. We use similarity here: the similarity between
two objects $x, y \in X$ is assumed to be given and is denoted by
$s(x, y)$. Similarity between two clusters, denoted by $s(G, G')$
($G, G' \in \mathcal{G}$), is also used; it is also called an
inter-cluster similarity.
In the classical setting a similarity measure is assumed to be
symmetric:
$$s(G, G') = s(G', G).$$
Let us first describe a general procedure of agglomerative
hierarchical clustering [6, 7].
AHC (Agglomerative Hierarchical Clustering) Algorithm:
AHC1: Assume that initial clusters are given by
$\mathcal{G} = \{\hat{G}_1, \hat{G}_2, \ldots, \hat{G}_{N_0}\}$, where
$\hat{G}_1, \hat{G}_2, \ldots, \hat{G}_{N_0}$ are given initial clusters.
Generally $\hat{G}_j = \{x_j\} \subset X$, hence $N_0 = N$.
Vol.16 No.7, 2012 Journal of Advanced Computational Intelligence 807
and Intelligent Informatics
Set $K = N_0$.
($K$ is the number of clusters and $N_0$ is the initial number of
clusters.)
$G_i = \hat{G}_i$ ($i = 1, \ldots, K$).
Calculate $s(G, G')$ for all pairs $G, G' \in \mathcal{G}$.
AHC2: Search the pair of maximum similarity:
$$(G_p, G_q) = \arg\max_{G_i, G_j \in \mathcal{G}} s(G_i, G_j), \tag{2}$$
and let
$$m_K = s(G_p, G_q) = \max_{G_i, G_j \in \mathcal{G}} s(G_i, G_j). \tag{3}$$
Merge: $G_r = G_p \cup G_q$.
Add $G_r$ to $\mathcal{G}$ and delete $G_p, G_q$ from $\mathcal{G}$.
$K = K - 1$.
If $K = 1$ then stop and output the dendrogram.
AHC3: Update similarity $s(G_r, G')$ and $s(G', G_r)$ for all
$G' \in \mathcal{G}$.
Go to AHC2.
End AHC.
Note: The calculation of $s(G', G_r)$ in AHC3 is unnecessary when the
measure is symmetric: $s(G_r, G') = s(G', G_r)$.
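The AHC procedure above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `ahc`, the dictionary used to store similarities, and the `update` hook with its argument order are all hypothetical. Both directions $s(G_r, G')$ and $s(G', G_r)$ are updated as in AHC3, so the input matrix need not be symmetric, but applying the classical single-link and average-link rules per direction is only an illustration and is not one of the linkage methods proposed in this paper.

```python
from itertools import permutations

def ahc(s, update):
    """Sketch of AHC1-AHC3 for an N x N similarity matrix `s` (list of lists).

    `update(sim_p, sim_q, size_p, size_q)` is a hypothetical hook that
    combines the similarities of the merged clusters G_p and G_q to a
    third cluster. Returns a list of (G_p, G_q, level) merges.
    """
    n = len(s)
    # AHC1: every object starts as its own cluster, keyed by an integer id.
    clusters = {i: frozenset([i]) for i in range(n)}
    sim = {(i, j): s[i][j] for i, j in permutations(range(n), 2)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # AHC2: ordered pair of maximum similarity, as in Eqs. (2) and (3).
        p, q = max(permutations(clusters, 2), key=lambda pair: sim[pair])
        level = sim[(p, q)]
        gp, gq = clusters.pop(p), clusters.pop(q)
        merges.append((gp, gq, level))
        r = next_id
        next_id += 1
        # AHC3: update both directions against every remaining cluster.
        for g in clusters:
            sim[(r, g)] = update(sim[(p, g)], sim[(q, g)], len(gp), len(gq))
            sim[(g, r)] = update(sim[(g, p)], sim[(g, q)], len(gp), len(gq))
        clusters[r] = gp | gq
    return merges

def single_link(sim_p, sim_q, size_p, size_q):
    # Single-link update: the larger of the two similarities.
    return max(sim_p, sim_q)

def average_link(sim_p, sim_q, size_p, size_q):
    # Average-link update: size-weighted mean of the two similarities.
    return (size_p * sim_p + size_q * sim_q) / (size_p + size_q)
```

Calling `ahc(s, single_link)` on a small matrix returns the merge history, whose third components are the levels $m_K$ in the order they are produced.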
Well-known linkage methods such as the single link, complete link, and
average link all assume symmetric dissimilarity measures [1, 2, 6]. In
particular, the single link uses the following inter-cluster
similarity definition:
$$s(G, G') = \max_{x \in G,\, y \in G'} s(x, y). \tag{4}$$
When $G_p$ and $G_q$ are merged into $G_r$, the updating formula in
AHC3 by the single link is:
$$s(G_r, G') = s(G_p \cup G_q, G') = \max\{s(G_p, G'), s(G_q, G')\}. \tag{5}$$
The average link defines the next inter-cluster similarity:
$$s(G, G') = \frac{1}{|G||G'|} \sum_{x \in G,\, y \in G'} s(x, y), \tag{6}$$
and the updating formula in AHC3 by the average link is:
$$s(G_r, G') = s(G_p \cup G_q, G') = \frac{|G_p|}{|G_r|} s(G_p, G') + \frac{|G_q|}{|G_r|} s(G_q, G'). \tag{7}$$
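As a quick numerical check, the updating formulas (5) and (7) can be compared against the direct definitions (4) and (6). The sketch below assumes a small hypothetical symmetric similarity matrix over four objects; the matrix values, the cluster choices, and the function names are illustrative only.

```python
from itertools import product

def single_link(s, G, H):
    """Eq. (4): maximum pairwise similarity between clusters G and H."""
    return max(s[x][y] for x, y in product(G, H))

def average_link(s, G, H):
    """Eq. (6): mean pairwise similarity between clusters G and H."""
    return sum(s[x][y] for x, y in product(G, H)) / (len(G) * len(H))

# Hypothetical symmetric similarity matrix over objects 0..3.
s = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.5, 0.2],
    [0.3, 0.5, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
]
Gp, Gq, H = [0, 1], [2], [3]
Gr = Gp + Gq  # merged cluster G_r = G_p U G_q

# Eq. (5): single-link update from the two previous similarities.
single_updated = max(single_link(s, Gp, H), single_link(s, Gq, H))

# Eq. (7): average-link update weighted by cluster sizes.
average_updated = (
    len(Gp) / len(Gr) * average_link(s, Gp, H)
    + len(Gq) / len(Gr) * average_link(s, Gq, H)
)

# Direct evaluation of Eqs. (4) and (6) on the merged cluster.
single_direct = single_link(s, Gr, H)
average_direct = average_link(s, Gr, H)
```

Up to floating-point rounding, the updated and direct values agree, which is why AHC3 can work from stored inter-cluster similarities instead of revisiting individual objects.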
There are other well-known linkage methods, the centroid link and the
Ward method, that assume objects are points in Euclidean space. They
use dissimilarity measures related to the Euclidean distance. For
example, the centroid link uses the square of the Euclidean distance
between the two centroids of the clusters. In any case, the five
linkage methods mentioned above all assume the symmetric property of
similarity and dissimilarity measures.
For the single link, complete link, and average link, it is known that
we have the monotonicity of $m_K$:
$$m_N \ge \cdots \ge m_{K-1} \ge \cdots \ge m_2 \ge m_1. \tag{8}$$
Fig. 1. Three points on a plane.
Fig. 2. A simple example of reversal.
If the monotonicity does not hold, we have a reversal in a dendrogram:
it means that $G$ and $G'$ are merged into $\hat{G} = G \cup G'$ at
level $m = s(G, G')$, and after that $\hat{G}$ and $G''$ are merged at
the level $\hat{m} = s(\hat{G}, G'')$, and $\hat{m} > m$ occurs.
Reversals in a dendrogram are observed for the centroid method.
Consider the next example [6, 7]:
Example 1. Consider three points A, B, C on a plane in Fig. 1. The two
points A and B are the nearest pair, and these two are made into a
cluster. Then the distance between the mid-point (centroid) of AB and
C will be smaller than the distance between A and B. We thus have a
reversal in Fig. 2.
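The reversal in Example 1 can be reproduced numerically. The coordinates below are hypothetical (the paper gives none); they are chosen so that A and B are the nearest pair and the triangle is close to equilateral. Squared Euclidean distance is used, as the centroid link does. Because this is a dissimilarity, a reversal shows up as a later merge level that is smaller than an earlier one.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points in the plane."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

# Hypothetical coordinates: A and B are slightly closer than either is to C.
A, B, C = (0.0, 0.0), (1.0, 0.0), (0.5, 0.9)

pair_dists = {
    ("A", "B"): sq_dist(A, B),
    ("A", "C"): sq_dist(A, C),
    ("B", "C"): sq_dist(B, C),
}

# First merge: the pair with the smallest dissimilarity.
first_pair = min(pair_dists, key=pair_dists.get)
first_level = pair_dists[first_pair]

# Second merge: centroid of {A, B} against the remaining point C.
centroid_ab = ((A[0] + B[0]) / 2, (A[1] + B[1]) / 2)
second_level = sq_dist(centroid_ab, C)

# A reversal occurs when the later merge happens at a smaller dissimilarity.
reversal = second_level < first_level
```

With these coordinates the first merge joins A and B, and `reversal` evaluates to `True`, so the second branch of the dendrogram would sit below the first.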
Evidently, if the monotonicity always holds for a linkage method, no
reversals in the dendrogram will occur.
Building on the above review, the next section describes how we
calculate asymmetric similarity.
3. Asymmetric Similarity Measures
We assume hereafter that similarity measures are asymmetric in
general:
$$s(G, G') \neq s(G', G).$$