[Figure 5.1: A dendrogram of the samples 1 through 5; the labels Animals, Dogs, Cats, Large, and Small illustrate grouping at different levels.]
In cases where there are only two features, clusters can be found through visual inspection by looking for dense regions in a scatterplot of the data, provided the subgroups or classes are well separated in the feature space. If, for example, there are two bivariate normally distributed classes and their means are separated by more than two standard deviations, two distinct peaks form if there is enough data. In Figure 4.20, at least one of the three classes forms a distinct cluster, which could be found even if the classes were unknown. However, distinct clusters may exist in a high-dimensional feature space and still not be apparent in any of the projections of the data onto a plane defined by a pair of the feature axes. One general way to find candidates for the centers of clusters is to form an n-dimensional histogram of the data and find the peaks in the histogram. However, if the number of features is large, the histogram may have to be very coarse to have a significant number of samples in any cell, and the locations of the boundaries between these cells are specified arbitrarily in advance, rather than depending on the data.
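As a rough illustration of the histogram idea, the sketch below (not from the text; the data and bin count are made up for the example) forms a two-dimensional histogram with NumPy and reports the fullest cell as a candidate cluster center.

```python
import numpy as np

# Two made-up, well-separated bivariate normal classes.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=(2.0, 2.0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(7.0, 7.0), scale=0.5, size=(200, 2)),
])

# Coarse 2-D histogram; each cell counts the samples falling in it.
counts, xedges, yedges = np.histogram2d(data[:, 0], data[:, 1], bins=10)

# The fullest cell is a candidate cluster center; report its midpoint.
i, j = np.unravel_index(np.argmax(counts), counts.shape)
center = ((xedges[i] + xedges[i + 1]) / 2, (yedges[j] + yedges[j + 1]) / 2)
print("candidate cluster center near", center)
```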
The coarsest grouping is at the top of the dendrogram, where all samples are grouped into one cluster. In between, there are various numbers of clusters. For example, in the hierarchical clustering of Figure 5.1, at level 0 the clusters are
{1}, {2}, {3}, {4}, {5},

each consisting of an individual sample. At level 1, the clusters are

{1,2}, {3}, {4}, {5}.

At level 2, the clusters are

{1,2}, {3}, {4,5}.

At level 3, the clusters are

{1,2,3}, {4,5}.

At level 4, the single cluster

{1,2,3,4,5}

consists of all the samples.
In a hierarchical clustering, if at some level two samples belong to the same cluster, they belong to the same cluster at all higher levels. For example, in Figure 5.1, at level 2 samples 4 and 5 belong to the same cluster; samples 4 and 5 also belong to the same cluster at levels 3 and 4.
Hierarchical clustering algorithms are called agglomerative if they build the dendrogram from the bottom up, and they are called divisive if they build the dendrogram from the top down.
The general agglomerative clustering algorithm is straightforward to describe. The total number of samples will be denoted by n.
Agglomerative Clustering Algorithm
1. Begin with n clusters, each consisting of one sample.

2. Repeat step 3 a total of n - 1 times.

3. Find the most similar clusters $C_i$ and $C_j$ and merge them into one cluster. If there is a tie, merge the first pair found.
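A minimal Python sketch of these steps (not from the text; the names agglomerative_clustering and cluster_distance are illustrative) keeps a list of clusters and repeatedly merges the closest pair under a supplied cluster-distance function, which stands in for the linkage methods described next.

```python
from itertools import combinations

def agglomerative_clustering(samples, cluster_distance):
    """Merge n singleton clusters n - 1 times, always joining the closest pair.

    samples          -- list of feature vectors (tuples)
    cluster_distance -- function taking two clusters (lists of samples) and
                        returning a distance; ties are broken by the first
                        pair found, as in step 3.
    """
    clusters = [[s] for s in samples]          # step 1: one cluster per sample
    merges = []
    while len(clusters) > 1:                   # step 2: repeat n - 1 times
        # Step 3: find the most similar (closest) pair of clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]]))
        d = cluster_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```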
Different hierarchical clustering algorithms are obtained by using different methods to determine the similarity of clusters. One way to measure the similarity between clusters is to define a function that measures the distance between clusters. This function typically is induced by an underlying function that measures the distance between pairs of samples. In cluster analysis, as in nearest neighbor techniques (Section 4.4), the most popular distance measures are Euclidean distance and city block distance.
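For concreteness, the two sample-to-sample distance measures can be written as follows (a small sketch, not from the text).

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between feature vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block_distance(a, b):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```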
[Figure 5.2: Scatterplot of the five samples, with Feature 1 on the horizontal axis and Feature 2 on the vertical axis.]
For the single-sample clusters {a} and {b}, $D_{SL}(\{a\}, \{b\}) = d(a, b)$.

The algorithm begins with five clusters, each consisting of one sample. The two nearest clusters are then merged. The smallest number in (5.1) is 4, which is the distance between samples 1 and 2, so the clusters {1} and {2} are merged. At this point there are four clusters:

{1,2}, {3}, {4}, {5}.
Next obtain the matrix that gives the distances between these clusters:

          {1,2}     3       4       5
{1,2}              8.1    16.0    17.9
3          8.1             9.8     9.8
4         16.0     9.8             8.0
5         17.9     9.8     8.0
The value 8.1 in row {1,2} and column 3 gives the distance between the clusters {1,2} and {3} and is computed in the following way. Matrix (5.1) shows that d(1,3) = 11.7 and d(2,3) = 8.1. In the single-linkage algorithm, the distance between clusters is the minimum of these values, 8.1. The other values in the first row are computed in a similar way. The values in other than the first row or first column are simply copied from the previous table (5.1). Since the minimum value in this matrix is 8.0, the clusters {4} and {5} are merged. At this point there are three clusters:

{1,2}, {3}, {4,5}.
Next obtain the matrix that gives the distances between these clusters:

          {1,2}     3     {4,5}
{1,2}              8.1    16.0
3          8.1             9.8
{4,5}     16.0     9.8
Since the minimum value in this matrix is 8.1, the clusters {1,2} and {3} are merged. At this point there are two clusters:

{1,2,3}, {4,5}.

The next step will merge the two remaining clusters at a distance of 9.8. The hierarchical clustering is complete. The dendrogram is shown in Figure 5.3.
Figure 5.3: Hierarchical clustering using the single-linkage algorithm. The distance
$D_{SL}$ between clusters that merge is shown on the vertical axis.
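The same sequence of merges can be cross-checked with SciPy (not part of the text; the five sample coordinates are assumed here from Figure 5.2 and the later examples to be (4,4), (8,4), (15,8), (24,4), and (24,12)).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Samples 1-5, assumed from Figure 5.2.
X = np.array([(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)], dtype=float)

# Single-linkage (nearest-neighbor) clustering on Euclidean distances.
Z = linkage(pdist(X), method='single')

# Each row of Z records one merge: the two cluster indices, the merge
# distance, and the size of the new cluster.  The merge distances come
# out as 4.0, 8.0, 8.1, and 9.8, matching the example.
print(np.round(Z, 1))
```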
The value 11.7 in row {1,2} and column 3 gives the distance between the clusters {1,2} and {3} and is computed in the following way. Matrix (5.1) shows that d(1,3) = 11.7 and d(2,3) = 8.1. In the complete-linkage algorithm, the distance between clusters is the maximum of these values, 11.7. The other values in the first row are computed in a similar way. The values in other than the first row or first column are simply copied from (5.1). Since the minimum value in this matrix is 8.0, the clusters {4} and {5} are merged. At this point the clusters are

{1,2}, {3}, {4,5}.
Next obtain the matrix that gives the distances between these clusters:

          {1,2}     3     {4,5}
{1,2}             11.7    21.5
3         11.7             9.8
{4,5}     21.5     9.8
Since the minimum value in this matrix is 9.8, the clusters {3} and {4, 5} are merged.
At this point the clusters are
{1,2}, {3,4, 5}.
Notice that these clusters are different from those obtained at the corresponding point
of the single-linkage algorithm.
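These complete-linkage values can be verified directly with a short sketch (not from the text; the coordinates are again the assumed Figure 5.2 points): the distance between two clusters is the maximum pairwise distance between their members.

```python
import math

def complete_linkage(C1, C2):
    """Furthest-neighbor distance between clusters C1 and C2."""
    return max(math.dist(a, b) for a in C1 for b in C2)

# Samples 1-5, assumed to be the points of Figure 5.2.
s = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

print(round(complete_linkage([s[1], s[2]], [s[3]]), 1))        # 11.7
print(round(complete_linkage([s[1], s[2]], [s[4], s[5]]), 1))  # 21.5
print(round(complete_linkage([s[3]], [s[4], s[5]]), 1))        # 9.8
```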
At the next step, the two remaining clusters will be merged. The hierarchical clustering is complete. The dendrogram is shown in Figure 5.4.
[Figure 5.4: Hierarchical clustering using the complete-linkage algorithm. The furthest-neighbor distance between clusters that merge is shown on the vertical axis.]
In the average-linkage algorithm, the distance between two clusters is the average of the distances between all pairs of samples, one from each cluster:

$$D_{AL}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{a \in C_i,\, b \in C_j} d(a, b),$$

where $n_i$ and $n_j$ are the numbers of samples in $C_i$ and $C_j$.
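A direct implementation of this definition (a sketch, not from the text), with a spot check using the assumed Figure 5.2 coordinates:

```python
import math

def average_linkage(C1, C2):
    """Average of the distances between all pairs of samples, one from each cluster."""
    return sum(math.dist(a, b) for a in C1 for b in C2) / (len(C1) * len(C2))

# Samples 1-5, assumed to be the points of Figure 5.2.
s = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}
print(round(average_linkage([s[1], s[2]], [s[3]]), 1))  # 9.9, as in the example below
```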
Perform a hierarchical clustering using the average-linkage algorithm on the data shown in Figure 5.2. Use Euclidean distance (4.1) for the distance between samples.

The algorithm begins with five clusters, each consisting of one sample. The nearest clusters {1} and {2} are then merged to form the clusters

{1,2}, {3}, {4}, {5}.
The clusters {4} and {5} are then merged. At this point the clusters are

{1,2}, {3}, {4,5}.

Next obtain the matrix that gives the distances between these clusters:
          {1,2}     3     {4,5}
{1,2}              9.9    18.9
3          9.9             9.8
{4,5}     18.9     9.8
Since the minimum value in this matrix is 9.8, the clusters {3} and {4,5} are merged. At this point the clusters are

{1,2}, {3,4,5}.

At the next step, the two remaining clusters are merged and the hierarchical clustering is complete.
Ward's Method
Ward's method is also called the minimum-variance method. Like the other
t algorithms, Ward's method begins with one cluster for each individual sample. At
each iteration, among all pairs of clusters, it merges the pair that produces the smallest
squared error for the resulting set of clusters. The squared error for each cluster is
defned as follows. If a cluster contains m samples x1,...,Xm where x; is the feature
vector $(x_{i1}, \ldots, x_{id})$, the squared error for sample $x_i$, which is the squared Euclidean distance from the mean, is

$$\sum_{j=1}^{d} (x_{ij} - \mu_j)^2,$$

where $\mu_j$ is the mean value of feature $j$ for the samples in the cluster:

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_{ij}.$$

The squared error $E$ for the entire cluster is the sum of the squared errors of the samples:

$$E = \sum_{i=1}^{m} \sum_{j=1}^{d} (x_{ij} - \mu_j)^2 = m\sigma^2.$$
The vector composed of the means of each feature, $(\mu_1, \ldots, \mu_d) = \mu$, is called the mean vector or centroid of the cluster. The squared error for a cluster is the sum of the squared distances in each feature from the cluster members to their mean. The squared error is thus equal to the total variance $\sigma^2$ of the cluster times the number of samples $m$ in the cluster, where the total variance is defined to be $\sigma^2 = \sigma_1^2 + \cdots + \sigma_d^2$, the sum of the variances for each feature. The squared error for a set of clusters is defined to be the sum of the squared errors for the individual clusters.
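A minimal sketch of these definitions (not from the text): the squared error of a cluster as the sum of squared per-feature deviations from the cluster mean, and the squared error of a set of clusters as the sum over its clusters.

```python
import numpy as np

def squared_error(cluster):
    """Sum over samples and features of squared deviations from the cluster mean.

    Equals m * sigma^2, where m is the cluster size and sigma^2 the total variance.
    """
    X = np.asarray(cluster, dtype=float)
    mu = X.mean(axis=0)                      # centroid (mean vector) of the cluster
    return float(((X - mu) ** 2).sum())

def total_squared_error(clusters):
    """Squared error of a set of clusters: the sum of the individual errors."""
    return sum(squared_error(c) for c in clusters)
```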
Clusters                     Squared Error, E
{1,2}, {3}, {4}, {5}               8.0
{1,3}, {2}, {4}, {5}              68.5
{1,4}, {2}, {3}, {5}             200.0
{1,5}, {2}, {3}, {4}             232.0
{2,3}, {1}, {4}, {5}              32.5
{2,4}, {1}, {3}, {5}             128.0
{2,5}, {1}, {3}, {4}             160.0
{3,4}, {1}, {2}, {5}              48.5
{3,5}, {1}, {2}, {4}              48.5
{4,5}, {1}, {2}, {3}              32.0

Figure 5.5: Squared errors for each way of creating four clusters.

Since the smallest squared error in Figure 5.5 is 8.0, the clusters {1} and {2} are merged to form the clusters {1,2}, {3}, {4}, {5}.
Clusters                  Squared Error, E
{1,2,3}, {4}, {5}               72.7
{1,2,4}, {3}, {5}              224.0
{1,2,5}, {3}, {4}              266.7
{1,2}, {3,4}, {5}               56.5
{1,2}, {3,5}, {4}               56.5
{1,2}, {4,5}, {3}               40.0

Figure 5.6: Squared errors for each way of creating three clusters.
Figure 5.6 shows the squared error for all possible sets of clusters that result from merging two of {1,2}, {3}, {4}, {5}. Since the smallest squared error in Figure 5.6 is 40, the clusters {4} and {5} are merged to form the clusters

{1,2}, {3}, {4,5}.
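This greedy choice can be reproduced with a short sketch (not from the text; the coordinates are assumed from Figure 5.2): try every pair of current clusters and keep the merge that gives the smallest total squared error.

```python
import numpy as np
from itertools import combinations

def squared_error(cluster):
    X = np.asarray(cluster, dtype=float)
    return float(((X - X.mean(axis=0)) ** 2).sum())

def total_squared_error(clusters):
    return sum(squared_error(c) for c in clusters)

def best_ward_merge(clusters):
    """Among all pairs of clusters, find the merge giving the smallest squared error."""
    best = None
    for i, j in combinations(range(len(clusters)), 2):
        trial = [c for k, c in enumerate(clusters) if k not in (i, j)]
        trial.append(clusters[i] + clusters[j])
        E = total_squared_error(trial)
        if best is None or E < best[0]:
            best = (E, i, j)
    return best

# Current clusters {1,2}, {3}, {4}, {5} with the assumed Figure 5.2 coordinates.
clusters = [[(4, 4), (8, 4)], [(15, 8)], [(24, 4)], [(24, 12)]]
E, i, j = best_ward_merge(clusters)
print(E, i, j)   # 40.0 2 3: merging {4} and {5} gives E = 40.0, as in Figure 5.6
```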
Figure 5.7 shows the squared error for all possible sets of clusters that result from merging two of {1,2}, {3}, {4,5}. Since the smallest squared error in Figure 5.7 is 94, the clusters {3} and {4,5} are merged to give the clusters

{1,2}, {3,4,5}.
Clusters               Squared Error, E
{1,2,3}, {4,5}              104.7
{1,2,4,5}, {3}              380.0
{1,2}, {3,4,5}               94.0

Figure 5.7: Squared errors for each way of creating two clusters.
[Figure 5.8: Hierarchical clustering using Ward's method. The sum of squared errors is shown on the vertical axis.]
At the next step, the two remaining clusters are merged and the hierarchical clustering is complete. The resulting dendrogram is shown in Figure 5.8.
Forgy's Algorithm
One of the simplest partitional clustering algorithms is Forgy's algorithm [Forgy].
Besides the data, input to the algorithm consists of k, the number of clusters to be
constructed, and k samples called seed points. The seed points could be chosen
randomly, or some knowledge of the desired cluster structure could be used to guide
their selection.
Forgy's Algorithm
1. Initialize the cluster centroids to the seed points.
2. For each sample, find the cluster centroid nearest it. Put the sample in the cluster identified with this nearest cluster centroid.

3. If no samples changed clusters in step 2, stop.

4. Compute the centroids of the resulting clusters and go to step 2.
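A compact sketch of these steps (not from the text; Euclidean distance and the step numbering above are assumed):

```python
import math

def forgy(samples, seeds):
    """Forgy's algorithm: assign samples to the nearest centroid, recompute
    centroids, and repeat until no sample changes clusters."""
    centroids = list(seeds)                           # step 1
    assignment = [None] * len(samples)
    while True:
        changed = False
        for idx, x in enumerate(samples):             # step 2
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(x, centroids[c]))
            if assignment[idx] != nearest:
                assignment[idx] = nearest
                changed = True
        if not changed:                               # step 3
            break
        for c in range(len(centroids)):               # step 4
            members = [x for x, a in zip(samples, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(f) / len(members) for f in zip(*members))
    return assignment, centroids

# The Figure 5.2 samples with the first two samples as seed points; the result
# is the assignment [0, 0, 1, 1, 1] with centroids (6, 4) and (21, 8).
samples = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
print(forgy(samples, seeds=[(4, 4), (8, 4)]))
```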
With k = 2 and the first two samples (4,4) and (8,4) as the seed points, step 2 assigns each sample to the cluster with the nearest centroid:

Sample      Nearest Cluster Centroid
(4,4)       (4,4)
(8,4)       (8,4)
(15,8)      (8,4)
(24,4)      (8,4)
(24,12)     (8,4)
For step 4, compute the centroids of the two clusters: the centroid of the first cluster is (4,4), and the centroid of the second cluster is (17.75, 7) since

(8 + 15 + 24 + 24)/4 = 17.75 and (4 + 8 + 4 + 12)/4 = 7.

Since some samples changed clusters (there were initially no clusters), return to step 2.
Find the cluster centroid nearest each sample. Figure 5.10 shows the results. The clusters {(4,4), (8,4)} and {(15,8), (24,4), (24,12)} are produced. Since the sample (8,4) changed clusters, continue. For step 4, compute the centroids (6,4) and (21,8) of the clusters and return to step 2.
Find the cluster centroid nearest each sample. Figure 5.11 shows the results. The clusters {(4,4), (8,4)} and {(15,8), (24,4), (24,12)} are obtained. For step 4, compute the centroids (6,4) and (21,8) of the clusters. Since no sample will change clusters, the algorithm terminates.
In this version of Forgy's algorithm, the seed points are chosen arbitrarily as the first two samples; however, other possibilities have been suggested. One alternative is to begin with k clusters generated by one of the hierarchical clustering algorithms and use their centroids as initial seed points.
Sample      Nearest Cluster Centroid
(4,4)       (6,4)
(8,4)       (6,4)
(15,8)      (21,8)
(24,4)      (21,8)
(24,12)     (21,8)

Figure 5.11: Nearest cluster centroid for each sample, using the centroids (6,4) and (21,8).
It has been proved [Selim] that Forgy's algorithm terminates; that is,
eventually no
samples change clusters. However, if the number of samples is large, it may take the
algorithm considerable time to produce stable clusters. For this reason, some versions
of Forgy's algorithm allow the user to restrict the number of iterations. Other versions
of Forgy's algorithm [Dubes] permit the user to supply parameters that allow new
clusters to be created and to establish a minimum cluster size.
k-means Algorithm
1. Begin with k clusters, each consisting of one of the first k samples. For each of the remaining n - k samples, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. After each sample is assigned, recompute the centroid of the altered cluster.

2. Go through the data a second time. For each sample, find the centroid nearest it. Put the sample in the cluster identified with this nearest centroid. (During this step, do not recompute any centroid.)
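A sketch of this two-pass procedure (not from the text; Euclidean distance is assumed). Step 1 assigns the remaining samples one at a time, updating the affected centroid after each assignment; step 2 makes one more pass without updating any centroid.

```python
import math

def kmeans_two_pass(samples, k):
    """Two-pass k-means: incremental assignment with centroid updates,
    followed by a single reassignment pass with fixed centroids."""
    clusters = [[s] for s in samples[:k]]             # step 1: first k samples
    centroids = [s for s in samples[:k]]
    for x in samples[k:]:
        c = min(range(k), key=lambda i: math.dist(x, centroids[i]))
        clusters[c].append(x)
        centroids[c] = tuple(sum(f) / len(clusters[c]) for f in zip(*clusters[c]))
    # Step 2: reassign every sample to its nearest centroid, centroids fixed.
    final = [[] for _ in range(k)]
    for x in samples:
        c = min(range(k), key=lambda i: math.dist(x, centroids[i]))
        final[c].append(x)
    return final, centroids

# Figure 5.2 samples ordered so that (8,4) and (24,4) come first, as in the
# example: the result is the clusters {(8,4),(15,8),(4,4)} and {(24,4),(24,12)}
# with centroids (9, 5.3) and (24, 8).
samples = [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]
print(kmeans_two_pass(samples, k=2))
```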
Sample      Distance to Centroid (9, 5.3)    Distance to Centroid (24, 8)
(8,4)                   1.6                              16.5
(24,4)                 15.1                               4.0
(15,8)                  6.6                               9.0
(4,4)                   5.2                              20.4
(24,12)                16.4                               4.0

Figure 5.12: Distances for use by step 2 of the k-means algorithm.
Perform a partitional clustering using the k-means algorithm on the data in Figure 5.2. Set k = 2 and assume that the data are ordered so that the first two samples are (8,4) and (24,4).

For step 1, begin with two clusters {(8,4)} and {(24,4)}, which have centroids at (8,4) and (24,4). For each of the remaining three samples, find the centroid nearest it, put the sample in this cluster, and recompute the centroid of this cluster.
The next sample (15,8) is nearest the centroid (8,4), so it joins cluster {(8,4)}. At this point, the clusters are {(8,4), (15,8)} and {(24,4)}. The centroid of the first cluster is updated to (11.5, 6) since

(8 + 15)/2 = 11.5 and (4 + 8)/2 = 6.

The next sample (4,4) is nearest the centroid (11.5, 6), so it joins cluster {(8,4), (15,8)}. At this point, the clusters are {(8,4), (15,8), (4,4)} and {(24,4)}. The centroid of the first cluster is updated to (9, 5.3).

The next sample (24,12) is nearest the centroid (24,4), so it joins cluster {(24,4)}. At this point, the clusters are {(8,4), (15,8), (4,4)} and {(24,12), (24,4)}. The centroid of the second cluster is updated to (24, 8). At this point, step 1 of the algorithm is complete.
For step 2, examine the samples one by one and put each one in the cluster identified with the nearest centroid. As Figure 5.12 shows, in this case no sample changes clusters. The resulting clusters are

{(8,4), (15,8), (4,4)} and {(24,4), (24,12)}.