
Agglomerative Clustering

Using hclust()
Unlike partition clustering, which requires the user to specify the number of clusters k and then builds k homogeneous groups, hierarchical clustering defines clusters without user intervention, from the distances in the data, and builds a tree of clusters from them. Hierarchical clustering is particularly useful when the data is suspected to be hierarchical (leaves nested in nodes, themselves nested in higher-level nodes). It can also be used to determine the number of clusters to use in k-means clustering. Hierarchical clustering can be agglomerative or divisive. Agglomerative clustering usually yields a higher number of clusters, with fewer leaf nodes per cluster.

Agglomerative clustering refers to algorithms that start with as many clusters as there are cases (each case being its own cluster) and merge clusters iteratively, one pair at a time, until only one cluster remains, corresponding to the entire dataset. Divisive clustering is the opposite: it starts with one cluster, which is then divided in two as a function of the similarities or distances in the data. These new clusters are divided in turn, and so on, until each case is its own cluster. In this chapter, we will discuss agglomerative clustering. The reader might be interested in consulting the paper Hierarchical clustering schemes by Johnson (1967).

In this chapter, we will cover the following:

• The inner working of agglomerative clustering
• The use of hclust() for agglomerative clustering with numerical attributes
• The use of hclust() with binary attributes


The inner working of agglomerative clustering
As briefly mentioned, agglomerative clustering refers to a family of algorithms that build the cluster hierarchy bottom-up. Let's start with the data we last used in the previous chapter:
1 rownames(life.scaled) = life$country
2 a=hclust(dist(life.scaled))
3 par(mfrow=c(1,2))
4 plot(a, hang=-1, xlab="Case number", main = "Euclidean")

We started by adding the name of each country as the row name of the related case (line 1), in order to display it on the graph. The hclust() function was then used to generate a hierarchical agglomerative clustering solution from the data (line 2). The algorithm uses a distance matrix, provided as an argument (here the default, the Euclidean distance), to determine how to build the hierarchy of clusters. We discussed measures of distance in the previous chapter; please refer to that explanation if in doubt. At line 3, we set the plotting area to include two plots. Finally, the hclust object a created at line 2 was plotted as a dendrogram (line 4).
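The distance matrix can also be computed separately and inspected before clustering, which makes explicit what hclust() receives as input. The following lines are only a sketch, continuing with the same life.scaled data (the [1:4, 1:4] indexing and the variable name life.dist are illustrative choices, not part of the original example):

life.dist = dist(life.scaled)               # Euclidean distances by default
round(as.matrix(life.dist)[1:4, 1:4], 2)    # peek at a few pairwise distances
a = hclust(life.dist)                       # same solution as hclust(dist(life.scaled))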

The following figure (left panel) displays the clustering tree. At the beginning of the creation of the cluster hierarchy (or tree), each case is a cluster. Clusters are then joined one by one on the basis of the distance between them (represented by the vertical lines); the clusters with the smallest distance are selected for merging. As can be seen from the distances plotted in the figure, hclust() first joined Panama and Costa Rica in a new cluster. It then aggregated Canada and US (White) in a cluster, as they were the next cases with the smallest Euclidean distance, and it did the same for Grenada and Trinidad, Mexico and Colombia, and El Salvador and Ecuador. Next, it joined the cluster formed by Grenada and Trinidad with Jamaica. The clusters with the next smallest distance were US (Nonwhite) and Chile, which were merged together. Then the cluster formed by Canada and US (White) was merged with Argentina. The algorithm continued merging clusters until only one remained. It is quite interesting that hclust() produces life expectancy clusters that are closely related to geography; countries that are close to each other geographically generally have similar life expectancies.
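The merge sequence described above does not have to be read off the figure alone; it is stored in the hclust object itself. A short sketch, assuming a still holds the Euclidean solution computed earlier (merge, height, and labels are standard components of an hclust object, and the choice of five clusters in cutree() is arbitrary):

a$labels                          # country names, in the original case order
head(cbind(a$merge, a$height))    # first merges and the distances at which they occur;
                                  # negative numbers refer to single cases, positive
                                  # numbers to previously formed clusters
cutree(a, k = 5)                  # cluster memberships if the tree is cut into 5 clusters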


Dendrograms of remaining life expectancy in several countries (Euclidean distance)

As we have mentioned, we used hclust() with the default Euclidean distance matrix in the previous example. Results can differ, although usually not dramatically, with other distance metrics. We have no reason to believe that the Euclidean distance is not the best measure here, but to illustrate the potentially different results obtained with different distance measures, we will use the Manhattan distance instead. On line 1 of the following code, we assign to a the hclust object built from a distance matrix computed by dist() with the method argument set to manhattan. We then simply plot the resulting tree. The hang argument is again set to a negative value so that the labels hang from 0:
1 a = hclust(dist(life.scaled, method= "manhattan"))
2 plot(a, hang=-1, xlab="Case number", main = "Manhattan")


The preceding figure (right panel) displays the result of our agglomerative cluster analysis with hclust() using the Manhattan distance. We can notice that, although there are many similarities, the clustering solution is not the same as the one obtained with the Euclidean distance. We can see that Madagascar and Cameroon form a cluster that is not joined to the other clusters until the last step in both solutions. Also, in both solutions, US (White) and Canada are very similar to each other (they form a cluster early), as are Costa Rica and Panama. But let's have a look at Algeria. In the Euclidean distance solution, it forms a cluster with Nicaragua, but in the Manhattan distance solution, it clusters with Nicaragua and Dominican Republic. Seychelles forms a cluster with Tunisia in the Euclidean distance solution, but it clusters with South Africa, Argentina, Canada, and US in the Manhattan distance solution.
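One way to quantify how far the two solutions agree is to cut both trees into the same number of clusters and cross-tabulate the memberships. The following is only a sketch; the object names a.euc and a.man and the choice of five clusters are illustrative, not prescribed by the analysis above:

a.euc = hclust(dist(life.scaled))
a.man = hclust(dist(life.scaled, method = "manhattan"))
table(Euclidean = cutree(a.euc, k = 5),    # counts of cases falling in each
      Manhattan = cutree(a.man, k = 5))    # combination of cluster memberships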

Indeed, the choice of the distance matrix does affect the results. It has to be chosen as a function of the data (what they represent). Choosing the method used to determine cluster proximity also plays an important role in agglomerative clustering. hclust() provides several methods, including single linkage, complete linkage, average linkage, and Ward's minimum variance. The single linkage method computes the proximity of two clusters as the smallest distance between their points. The complete linkage method computes it as the maximum distance between their points; average linkage uses the average distance between the points of one cluster and the points of the other. Finally, Ward's minimum variance method agglomerates clusters by minimizing the variance around the centroids of the resulting clusters. Note that Ward's minimum variance method requires a squared Euclidean distance matrix.

These methods have different properties (The R Core Team, 2013, p. 1301):

"Ward's minimum variance method aims at finding compact, spherical clusters. The complete linkage method finds similar clusters. The single linkage method (which is closely related to the minimal spanning tree) adopts a 'friends of friends' clustering strategy. The other methods can be regarded as aiming for clusters with characteristics somewhere between the single and complete link methods."

We will examine differences in results later in the chapter.
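For now, note that the linkage criterion is selected through the method argument of hclust(); the default is complete linkage. The lines below are only a sketch comparing a few methods on the same Euclidean distance matrix (the 2 x 2 plotting layout is arbitrary, and "ward.D2" is the name used in recent versions of R; older versions use "ward"):

d = dist(life.scaled)
par(mfrow = c(2, 2))
plot(hclust(d, method = "single"),   hang = -1, main = "Single linkage")
plot(hclust(d, method = "complete"), hang = -1, main = "Complete linkage")
plot(hclust(d, method = "average"),  hang = -1, main = "Average linkage")
plot(hclust(d, method = "ward.D2"),  hang = -1, main = "Ward's minimum variance")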


Before we proceed, let's examine, this time visually, how the algorithm proceeds, using a simple fictitious dataset. As visual representations are limited to three dimensions, we will only use three attributes, but the computation is similar with more attributes. We will display the data using the scatterplot3d() function of the scatterplot3d package, which we will install and load after creating the attributes. We will then examine the clustering solution provided by hclust() in order to assess whether it confirms the impressions we get from visual inspection:
A1 = c(2,3,5,7,8,10,20,21,23)    # three identical attributes: the points lie
A2 = A1                          # on a straight line in 3D space
A3 = A1

install.packages("scatterplot3d")
library(scatterplot3d)
scatterplot3d(A1, A2, A3, angle = 25, type = "h")

demo = hclust(dist(cbind(A1, A2, A3)))    # Euclidean distances, complete linkage (defaults)
plot(demo)

A 3D scatterplot and dendrogram exemplifying agglomerative clustering


As can be noticed in the left panel of the previous figure, there are three groups of two points that are very close to each other. Another point lies quite close to each of these groups of two. Each group of two, together with the point that lies closest to it, can therefore be considered a group of three. Finally, the two groups on the left are closer to each other than they are to the group of three on the right. If we have a look at the dendrogram, we can see the very same pattern.
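To check this reading of the dendrogram programmatically, we can cut the tree into three clusters with cutree(). This is only a quick confirmation sketch on the demo object created above:

cutree(demo, k = 3)
# Expected output: three groups of three cases each, matching the
# groups that are visible in the 3D scatterplot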

Agglomerative clustering with hclust()


In what follows, we are going to explore the use of agglomerative clustering with
hclust() using numerical and binary data in two datasets.

Exploring the results of votes in Switzerland


In this section, we will examine another dataset. It contains the percentages of acceptance of the topics of federal (national) votes in Switzerland in 2001. The first rows of data are shown in the following table. The rows represent the cantons (the Swiss equivalent of states). The columns (except the first) represent the topics put to vote, and the values are the percentages of acceptance of each topic. The data has been retrieved from the Swiss Statistics Office (www.bfs.admin.ch) and is provided in the folder for this chapter (file swiss_votes.dat).

The first five rows of the dataset

To load the data, save the file in your working directory, or change the working directory (using setwd() with the path to the file), and type the following line of code. Here, the sep argument indicates that tab characters are used as separators, and the header argument is set to T (TRUE), meaning that the provided data has column headers (attribute names):
swiss_votes = read.table("swiss_votes.dat", sep = "\t",
header = T)
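Once the file is loaded, a quick inspection helps confirm that the import went as intended. This is only a sanity-check sketch; the exact output depends on the contents of the file:

dim(swiss_votes)     # number of cantons (rows) and attributes (columns)
head(swiss_votes)    # first rows, as in the table shown above
str(swiss_votes)     # attribute names and types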
