A Decision Criterion for the Optimal Number of Clusters in Hierarchical Clustering

Yunjae Jung (yunjae@cs.umn.edu)
Qwest Communications, 600 Stinson Blvd., Minneapolis, MN 55413

Haesun Park (hpark@cs.umn.edu)
Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455
and Korea Institute for Advanced Study, 207-43 Cheongryangri-dong, Dongdaemun-gu, Seoul 130-012, KOREA

Ding-Zhu Du (dzd@cs.umn.edu)
Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455

Barry L. Drake (bldrake1@yahoo.com)
CDT, Inc., Minneapolis, MN 55454

Jan. 21, 2002

Abstract. Clustering has been widely used to partition data into groups so that the degree of association is high among members of the same group and low among members of different groups. Though many effective and efficient clustering algorithms have been developed and deployed, most of them still suffer from the lack of an automatic or online decision for the optimal number of clusters. In this paper, we define clustering gain as a measure for clustering optimality, which is based on the squared error sum as a clustering algorithm proceeds. When the measure is applied to a hierarchical clustering algorithm, an optimal number of clusters can be found. Our clustering measure shows good performance, producing intuitively reasonable clustering configurations in Euclidean space, according to the evidence from experimental results. Furthermore, the measure can be utilized to estimate the desired number of clusters for partitional clustering methods as well. Therefore, the clustering gain measure provides a promising technique for achieving a higher level of quality for a wide range of clustering methods.

1. Introduction
Clustering refers to the process of grouping patterns so that the patterns are similar within each group and remote between different groups [1]. The distribution of groups can be defined as a cluster configuration.
* A part of this work was carried out while H. Park was visiting the Korea Institute for Advanced Study, Seoul, Korea, for her sabbatical leave, from September 2001 to July 2002. The work of this author was supported in part by the National Science Foundation grant CCR-9901992.
† The work of this author was supported in part by the National Science Foundation grant CCR-9901992.

© 2002 Kluwer Academic Publishers. Printed in the Netherlands.

cluster.tex; 23/01/2002; 12:40; p.1

The cluster configuration is valid if clusters cannot reasonably occur by chance or as a beneficial artifact of a clustering algorithm [2]. An optimal cluster configuration is defined as an outcome of all possible combinations of groupings, which presents a set of the most "meaningful" associations. Even if the definition of clustering is that simple, evaluation of clustering performance is well known as a fundamental but difficult problem. One reason is that clustering should be performed without a priori understanding of the internal structure of the data. In addition, it is impossible to determine which distribution of clusters is best given certain input patterns without an objective measure for clustering optimality. Thus, there have been many attempts to formulate a measure of optimal clustering in the past. However, only a small number of independent clustering criteria can be understood both mathematically and intuitively [12, 2]. Consequently, the hundreds of criterion functions proposed in the literature are related, and the same criterion appears in several disguises [3, 4, 5, 6, 7]. Even though an objective measure is given, the difficulty of optimal clustering stems from the astounding number of possible combinations of cluster configurations [8]. The number of ways of generating k clusters from n patterns is a Stirling number of the second kind [9, 10, 11]:
S_n^{(k)} = \frac{1}{k!} \sum_{i=1}^{k} (-1)^{k-i} \binom{k}{i} i^n.
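As a quick sanity check on how fast this count grows, the formula can be evaluated directly with exact integer arithmetic. The sketch below is ours for illustration (the function name `stirling2` does not appear in the paper):

```python
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Stirling number of the second kind: the number of ways to partition
    n patterns into k non-empty clusters, via the explicit formula
    S(n, k) = (1/k!) * sum_{i=1}^{k} (-1)^(k-i) * C(k, i) * i^n."""
    total = sum((-1) ** (k - i) * comb(k, i) * i ** n for i in range(1, k + 1))
    return total // factorial(k)  # the alternating sum is always divisible by k!
```

Even ten patterns already admit S(10, 3) = 9330 distinct 3-cluster partitions, which illustrates why exhaustive search over configurations is hopeless.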

In particular, the huge volume of data and the potentially high dimensionality of the patterns increase the difficulty of achieving a measure for optimal clustering. Furthermore, it is hard to select a criterion that translates into an intuitive notion of a "cluster" from a reasonable mathematical formula [12]. Feature selection before clustering and cluster labeling after clustering are also challenging problems. As a result, many clustering algorithms to date have been heuristic or ad hoc [12, 2]. Since no ideal solution to the optimal clustering problem has existed from early clustering research [13], recently proposed algorithms have focused mostly on efficiency [14, 15, 16, 17, 18, 19] and scalability [20, 21, 22, 23, 24, 25] to reduce the computational cost and increase processing capability, respectively. It may be possible to produce a cluster configuration very quickly and process huge amounts of data at once. However, often there is no guarantee of achieving an optimal or close-to-optimal clustering configuration. We propose a method to measure clustering optimality quantitatively, with a purpose to use it to determine an optimal number of clusters in various clustering algorithms. The method has been designed


based on the assumption that the optimal cluster configuration can be recognized only by the intuitive and subjective interpretation of a human [12]. In order to quantify clustering optimality, we introduce clustering gain, which has been designed to have a maximum value when intra-cluster similarity is maximized and inter-cluster similarity is minimized [26, 8]. This measure can be directly used to explore an optimal configuration for all hierarchical clustering algorithms as they proceed: the optimal cluster configuration can be identified by the maximum of the clustering gain curve. Since discovering all possible combinations of cluster configurations is computationally prohibitive [8], and since intuitive validation of clustering optimality is easiest in two dimensional feature space, it is useful to consider two dimensional Euclidean space for the sake of an objective decision, as depicted in Figure 1. The measure can also be useful for performance comparison among clustering algorithms, since clustering performance is likewise measured by clustering gain; the best cluster configuration will be the one that can be produced by any specific clustering algorithm. Moreover, while most partitional clustering algorithms depend on users to determine the target number of clusters, we show how the desired number of clusters can be estimated based on the data using the clustering gain measure.

Figure 1. A series of snapshots of clustering configurations in Euclidean distance. (a) Raw cluster configuration. (b) Under-clustered configuration. (c) Optimal cluster configuration. (d) Over-clustered configuration.

According to the experimental results, most commonly used hierarchical clustering algorithms are able to produce intuitively reasonable cluster configurations using our clustering measure when the input patterns are distributed in a well isolated fashion. Moreover, the desired number of clusters for partitional clustering methods, e.g. k-means, has been successfully estimated experimentally.

The rest of this paper is organized as follows. In Section 2, some background information on previous work is presented to derive optimal clustering measures. The design scheme of the optimal clustering gain measure is discussed in Section 3. Section 4 discusses how the proposed method can be used to evaluate the performance of clustering algorithms. Finally, we discuss how to estimate the optimal number of clusters for partitional clustering algorithms using our new scheme in Section 5.

2. Optimal Clustering

Stopping criteria for optimal clustering have been the topic of considerable past research effort [27]. In general, currently used stopping criteria for hierarchical clustering methods are based on predefined thresholds, including the number of iterations, the number of clusters, the maximum distance between patterns, the average dissimilarity within a cluster [32], and relative inter-connectivity and relative closeness [33]. However, these functions are based on strong assumptions, heuristics, and experimental dependency. For hierarchical agglomerative clustering, some decision rules have been provided by Milligan and Cooper [28] to determine the appropriate level of the dendrogram [29, 30] for optimal clustering; Milligan compared and described objective functions for optimal agglomerative clustering. This approach relies on experimental observations. Recently, a stopping rule for the hierarchical divisive clustering method has been suggested in the Principal Direction Divisive Partitioning (PDDP) algorithm [31]. While the PDDP algorithm is proceeding, a dynamic threshold based on a so-called centroid scatter value is calculated. The rule is to stop partitioning when the centroid scatter value exceeds the maximum cluster scatter value at any particular point. Consequently, deciding the optimal level of a dendrogram and estimating the number of target clusters remains a challenging and fundamental problem.

For non-hierarchical partitional algorithms, Dubes [27] provided a separation index

S(k) = (|f(k+1, k)| − |f(k, k−1)|) / (|f(k+1, k)| + |f(k, k−1)|),    (1)

where f(k+1, k) = MH(k+1) − MH(k). The value MH is the point serial correlation coefficient between the matrix of Euclidean distances for patterns and a "model" matrix. The model matrix sets the distance between two patterns to be the distance between the centers of the clusters to which the patterns belong. A stopping rule is adopted to search for a significant knee in the curve of MH(k) as k varies from kmax to 2, where kmax is the maximum possible number of clusters. However, a threshold that distinguishes the knee from other anomalies is difficult to determine. In addition, the rule is not able to avoid premature stopping.

Similarly, Davies and Bouldin [34] introduced a cluster separation measure

R_ij = (S_i + S_j) / M_ij,    (2)

where S_i is a dispersion measure of cluster i, such as the squared error sum, and M_ij is the distance between two centroids. The separation measure will be that which minimizes the average similarity as follows:

R = (1/n) Σ_{i=1}^{k} n_i R_i,    (3)

where R_i is the maximum of R_ij, i ≠ j, n_i is the number of patterns in cluster i, n is the total number of patterns to be clustered, and k is the number of clusters. However, there is no theoretical basis for the feasibility of the measure, and no reasonable separation measure for partitional clustering seems to exist at present [19]. Furthermore, such measures suffer from the problem of convergence to local minima: according to the experimental results, there are many minimum points in the hierarchical system [34], so a unique optimal clustering condition cannot be detected by the separation measure.
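To make Eqs. (2)–(3) concrete, here is a minimal sketch of the separation measure. The function name and the choice of dispersion S_i (mean distance of a cluster's patterns to its centroid) are our illustrative assumptions, not prescribed by the paper:

```python
import numpy as np

def separation_measure(clusters):
    """Cluster separation in the style of Eqs. (2)-(3):
    R_ij = (S_i + S_j) / M_ij,  R_i = max_{j != i} R_ij,
    R = (1/n) * sum_i n_i * R_i.
    `clusters` is a list of (n_i, m) arrays of patterns; smaller is better."""
    cents = [c.mean(axis=0) for c in clusters]
    # dispersion S_i: mean distance of the cluster's patterns to its centroid
    S = [np.linalg.norm(c - mu, axis=1).mean() for c, mu in zip(clusters, cents)]
    k = len(clusters)
    n = sum(len(c) for c in clusters)
    R = 0.0
    for i in range(k):
        Ri = max((S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                 for j in range(k) if j != i)
        R += len(clusters[i]) * Ri
    return R / n
```

Widely separated, tight clusters yield a small R; overlapping clusters a large one — but, as noted above, the curve of R over merge steps typically has many local minima.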

3. Design of a Measure for Optimal Clustering

3.1. Clustering Balance

The clustering problem is to partition the given input patterns into a specific number of groups (clusters) so that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized in a particular metric space [26, 35]. A measure of the similarity between two patterns drawn from the same feature space plays an essential role in these clustering procedures [12, 2]. The most popular metric for measuring similarity between patterns is the Euclidean distance, since it is intuitive and widely applicable, especially in two dimensional feature space [13, 2].

Throughout the paper, we will use the following notations. Pattern i is a feature vector in an m dimensional space, denoted as p_i = [p_i1, p_i2, ..., p_im]^T. A cluster C_j is a set of patterns grouped together by a clustering algorithm, expressed by C_j = {p_1^(j), ..., p_{n_j}^(j)}, where n_j is the number of patterns in cluster C_j. We will assume that there is a total of n vectors to be clustered and that the total number of clusters is k, so that Σ_{j=1}^{k} n_j = n. The centroid of cluster j, denoted p_0^(j), is defined as

p_0^(j) = (1/n_j) Σ_{i=1}^{n_j} p_i^(j).

The centroid is the mean vector of the cluster and provides a compressed representation of the cluster in a simpler form; it is often used for cluster data compression.

A cluster configuration is a random variable whose possible outcome is a particular assignment of input pattern sets. The problem of optimal clustering is to find a cluster configuration that is optimized according to some evaluation criterion. However, as mentioned before, the number of ways of clustering n observations into k groups is enormously large [35]. Accordingly, a combinatorial search of the set of possible configurations for optimal clustering is clearly computationally prohibitive [27] and, in fact, is NP-complete [36, 37]. Consequently, currently used agglomerative clustering algorithms take an approximation approach, merging more similar patterns prior to grouping less similar patterns to construct a cluster hierarchy. The most intuitive

and frequently used criterion function in clustering techniques is the squared error criterion, which is the sum of squared distances from the centroid of a group to every pattern in the group [38, 39, 40, 41, 42, 43, 44], and which can be expressed using the Euclidean distance [2, 13, 14, 19]. Ward used the error sum of squares to quantify the loss of information incurred by grouping [38].

The intra-cluster error sum is defined by the squared error e as

Λ = Σ_{j=1}^{k} Σ_{i=1}^{n_j} e(p_i^(j), p_0^(j)),

which, using the Euclidean distance, can be denoted as

Λ = Σ_{j=1}^{k} Σ_{i=1}^{n_j} ||p_i^(j) − p_0^(j)||².    (4)

It is also called the within-group error sum [12]. The inter-cluster error sum takes into account error sums between clusters by considering the collection of cluster centroids to be a global pattern set, which also has a global centroid. The inter-cluster error sum, in the case of Euclidean space, is defined as

Γ = Σ_{j=1}^{k} e(p_0^(j), p_0) = Σ_{j=1}^{k} ||p_0^(j) − p_0||²,    (5)

where p_0 is the global centroid defined as

p_0 = (1/n) Σ_{i=1}^{n} p_i.

Now, we present some characteristics of these two conflicting error sums, to be utilized in designing a measure for optimal cluster configuration as well as a stopping criterion in hierarchical clustering algorithms. We will assume that the hierarchical algorithm we are considering is agglomerative; in the case of divisive algorithms, analogous but opposite trends can easily be proved. We can assume that in the initial state of any agglomerative clustering algorithm, each pattern is the only pattern in its own cluster. It is clear that singleton clusters have no contribution to the intra-cluster error sum, so the minimum value that Λ can take is zero, while Λ is maximized when there is only one cluster that contains all patterns. A more interesting fact is that as the clustering process proceeds, the value of Λ cannot decrease. Suppose two clusters C_i and C_j are merged in a step of agglomerative clustering, and let C_ij denote the cluster obtained by merging them.

Then, the centroid c_ij of the new cluster C_ij is

c_ij = (n_i p_0^(i) + n_j p_0^(j)) / (n_i + n_j).

Let Λ_b and Λ_a be the intra-cluster error sums of the items that belong to the clusters C_i and C_j only, before and after merging, respectively:

Λ_b = Σ_{l=1}^{n_i} ||p_l^(i) − p_0^(i)||² + Σ_{l=1}^{n_j} ||p_l^(j) − p_0^(j)||²,

Λ_a = Σ_{l=1}^{n_i} ||p_l^(i) − c_ij||² + Σ_{l=1}^{n_j} ||p_l^(j) − c_ij||².

The intra-cluster error sum is nondecreasing as the clustering proceeds if Λ_a − Λ_b ≥ 0. Expanding the squared norms and using Σ_{l=1}^{n_i} p_l^(i) = n_i p_0^(i) and Σ_{l=1}^{n_j} p_l^(j) = n_j p_0^(j), we have the desired result:

Λ_a − Λ_b = n_i (||p_0^(i)||² − 2 (p_0^(i))^T c_ij + ||c_ij||²) + n_j (||p_0^(j)||² − 2 (p_0^(j))^T c_ij + ||c_ij||²)
          = n_i ||p_0^(i) − c_ij||² + n_j ||p_0^(j) − c_ij||² ≥ 0.

Similarly, the inter-cluster error sum satisfies the following characteristics, which show the opposite trend to that of the intra-cluster error sum. Note that the global centroid p_0 does not change throughout the clustering process, and that there is no split of a cluster in a path of agglomerative clustering. The inter-cluster error sum Γ is maximized when there are n singleton clusters, which occurs at the beginning of clustering.
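The two error sums of Eqs. (4) and (5), and their opposite endpoint behavior, can be sketched directly. This is our illustrative code (the function name `error_sums` is not from the paper); a configuration is a list of arrays, one per cluster:

```python
import numpy as np

def error_sums(clusters):
    """Intra-cluster error sum Lambda (Eq. 4) and inter-cluster error sum
    Gamma (Eq. 5) for a configuration given as a list of (n_j, m) arrays."""
    all_patterns = np.vstack(clusters)
    p0 = all_patterns.mean(axis=0)          # global centroid, fixed for the data
    intra = 0.0
    inter = 0.0
    for c in clusters:
        c0 = c.mean(axis=0)                 # cluster centroid p_0^(j)
        intra += ((c - c0) ** 2).sum()      # sum of ||p_i^(j) - p_0^(j)||^2
        inter += ((c0 - p0) ** 2).sum()     # ||p_0^(j) - p_0||^2
    return intra, inter
```

At the two extremes, n singleton clusters give Λ = 0 with Γ maximal, while a single all-inclusive cluster gives Γ = 0 with Λ maximal, matching the monotonicity argument above.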

It is easy to show, using the triangular property of the L2 norm, that the value of Γ is nonincreasing as the clustering proceeds; Γ is then minimized when all n patterns belong to one cluster at the end of clustering. When the clustering algorithm is divisive, the trends are the other way around: the intra-cluster error sum is nonincreasing and the inter-cluster error sum is nondecreasing as the divisive clustering algorithm proceeds.

Our design scheme is thus based on the fact that the intra-cluster error sum is nondecreasing and the inter-cluster error sum is nonincreasing as the agglomerative clustering algorithm proceeds. We transformed the optimal clustering problem into the problem of finding the point where the two error sums are balanced, by representing intra-cluster and inter-cluster similarity by the squared error sum in Euclidean space. Therefore, what follows will use the trade-off between the inter-cluster and intra-cluster error sums to define a measure for the optimal clustering configuration. The clustering balance E(ζ) has been formulated with the idea that intuitively optimal clustering is achieved when the error sums have reached equilibrium. We define the clustering balance as

E(ζ) = α Λ + (1 − α) Γ,    (6)

where Λ and Γ denote the intra-cluster and inter-cluster error sums for a specific clustering configuration ζ, and 0 ≤ α ≤ 1 is a scalar that determines the weight between these two sums. We will concentrate on the special case α = 1/2, which provides an even balance, and accordingly assume that

E(ζ) = Λ + Γ.    (7)

Thus clustering behavior can be interpreted as a procedure seeking the global minimum of clustering balance.

3.2. Clustering Gain

The clustering balance can be computed in each step of a hierarchical clustering algorithm to determine the optimal number of clusters. However, a major disadvantage is the high computational cost of computing the clustering balance. In this section, we introduce clustering gain, which has an interesting relation to clustering balance and, in contrast, is cheap to compute: it can be evaluated at each step of the clustering process to determine the optimal number of clusters without increasing the computational complexity. With the definitions of clustering balance based on the error sums, the clustering gain Δ_j for C_j is defined as the difference between the decreased inter-cluster error sum Γ_j and the increased intra-cluster error sum Λ_j, each relative to the initial stage.
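A minimal sketch of Eq. (6), under our own choice of test data (an even-weight `balance` function; the names are illustrative), shows the equilibrium idea: for two well-separated pairs of points, the 2-cluster configuration balances the error sums and attains a lower E than either extreme:

```python
import numpy as np

def balance(clusters, alpha=0.5):
    """Clustering balance E = alpha * Lambda + (1 - alpha) * Gamma (Eq. 6).
    `clusters` is a list of (n_j, m) arrays forming one configuration."""
    p0 = np.vstack(clusters).mean(axis=0)
    lam = sum(((c - c.mean(axis=0)) ** 2).sum() for c in clusters)   # Eq. (4)
    gam = sum(((c.mean(axis=0) - p0) ** 2).sum() for c in clusters)  # Eq. (5)
    return alpha * lam + (1 - alpha) * gam
```

With four points forming two tight, distant pairs, E is largest at the all-singleton and one-cluster extremes and smallest at the intuitive 2-cluster configuration, which is exactly the equilibrium the measure is designed to find.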

Specifically, the decreased portion of the inter-cluster error sum compared to the initial stage is denoted by

Γ_j = Σ_{i=1}^{n_j} e(p_i^(j), p_0) − e(p_0^(j), p_0) = Σ_{i=1}^{n_j} ||p_i^(j) − p_0||² − ||p_0^(j) − p_0||².

In addition, the increased portion of the intra-cluster error sum compared to the initial stage is defined by

Λ_j = Σ_{i=1}^{n_j} e(p_i^(j), p_0^(j)) = Σ_{i=1}^{n_j} ||p_i^(j) − p_0^(j)||².    (8)

Then the gain is defined as

Δ_j = Γ_j − Λ_j.

In the above equation, an equal weighting factor has been assigned to both error sums. Note that the number of patterns in the final configuration of cluster C_j can vary from 1 to n. Clustering gain is graphically illustrated in Figure 2 using cluster configurations.

Figure 2. Clustering gain defined by the difference between error sums. (a) Initial configuration: no clustering has been conducted. (b) Final configuration of cluster C_j: clustering has been finished.

Expanding the gain for cluster C_j gives

Δ_j = Γ_j − Λ_j = Σ_{i=1}^{n_j} ||p_i^(j) − p_0||² − ||p_0^(j) − p_0||² − Σ_{i=1}^{n_j} ||p_i^(j) − p_0^(j)||² = (n_j − 1) ||p_0^(j) − p_0||²,

since Σ_{i=1}^{n_j} p_i^(j) = n_j p_0^(j). The clustering gain Δ_j is therefore always greater than or equal to zero. Eventually, the total clustering gain can be computed from

Δ = Σ_{j=1}^{k} Δ_j.    (9)

We would like to emphasize that this clustering gain is very cheap to compute, since it involves only the centroids and the global centroid, and not the individual data items. Thus clustering balance can be alternatively expressed using clustering gain. It is interesting to note that the sum of clustering balance and clustering gain is a constant for a given data set, since

E + Δ = Λ + Γ + Δ
      = Σ_{j=1}^{k} Σ_{i=1}^{n_j} ||p_i^(j) − p_0^(j)||² + Σ_{j=1}^{k} ||p_0^(j) − p_0||² + Σ_{j=1}^{k} [Σ_{i=1}^{n_j} ||p_i^(j) − p_0||² − ||p_0^(j) − p_0||² − Σ_{i=1}^{n_j} ||p_i^(j) − p_0^(j)||²]
      = Σ_{j=1}^{k} Σ_{i=1}^{n_j} ||p_i^(j) − p_0||²,

which is determined completely by the data and is not changed by the clustering result. Since the clustering gain is minimum at the initial and final clustering stages, an optimal configuration should be found during the middle of the clustering procedure; assuming the initial clustering configuration is not optimal, the clustering gain will be positive there. In order to determine the maximum clustering gain during the middle of the clustering procedure, we propose the clustering gain as an effectiveness criterion. Apparent from Figure 3 is the fact that the optimal clustering configuration discovered by a hierarchical clustering algorithm has maximum clustering gain. Note that clustering gain is analogous to the E value suggested by Jardine and van Rijsbergen [45] for clustering effectiveness.
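The closed form of Eq. (9) and the invariance E + Δ = Σ_j Σ_i ||p_i^(j) − p_0||² can be checked numerically. The sketch below (our code; names illustrative) computes the gain from centroids only, as emphasized above:

```python
import numpy as np

def gain(clusters):
    """Total clustering gain (Eq. 9): sum_j (n_j - 1) * ||p_0^(j) - p_0||^2.
    Only centroids and the global centroid are needed, not individual items."""
    p0 = np.vstack(clusters).mean(axis=0)
    return sum((len(c) - 1) * ((c.mean(axis=0) - p0) ** 2).sum()
               for c in clusters)

def balance(clusters):
    """Clustering balance E = Lambda + Gamma (Eq. 7)."""
    p0 = np.vstack(clusters).mean(axis=0)
    lam = sum(((c - c.mean(axis=0)) ** 2).sum() for c in clusters)
    gam = sum(((c.mean(axis=0) - p0) ** 2).sum() for c in clusters)
    return lam + gam
```

For any partition of the same data set, balance plus gain equals the total squared scatter about the global centroid, so tracing the (cheap) gain and tracing the balance locate the same optimal configuration.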

For visual demonstration, clustering gain and the constant E + Δ are compared in Figure 3. Given the same input patterns, the same optimal configuration has been obtained by popular agglomerative clustering algorithms, including the single-link, the average-link, and Ward's method. Clearly, the configuration containing the lowest value of clustering balance coincides with the optimal configuration produced by a human being. Since the sum of clustering balance and clustering gain is constant, we are able to find an optimal cluster configuration by tracing clustering gain instead of clustering balance.

Figure 3. Clustering gain: the opposite concept of clustering balance. (a) Initial configuration of patterns. (b) Intuitively well clustered configuration, captured when clustering gain is maximized. (c) Clustering gain. (d) The sum of clustering balance and clustering gain.

Now, in the Tracking Algorithm below, we summarize how we can obtain the optimal cluster configuration in a given hierarchical agglomerative clustering algorithm while keeping track of the clustering gain value. Note that we need to keep track of the clustering gain Δ(ζ), since the global maximum value of the clustering gain can be discovered only after the clustering is completed. To demonstrate the performance of the Tracking Algorithm, the optimal configuration detected by the complete-link is visualized in Figure 4. Application of this method to hierarchical divisive clustering methods is straightforward.
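As a concrete illustration of this tracking idea, here is a minimal runnable sketch. It is our own code, assuming the complete-link fusion rule as the HACM (the paper allows any agglomerative method), and it uses the naive O(n³) pairwise search, which is fine for small demonstrations:

```python
import numpy as np

def track_optimal(points):
    """Run a complete-link agglomeration from singletons to one cluster and
    return the configuration (list of index lists) with maximal clustering
    gain, i.e. minimal clustering balance."""
    p0 = points.mean(axis=0)

    def gain(config):  # Eq. (9): sum_j (n_j - 1) * ||p_0^(j) - p_0||^2
        return sum((len(c) - 1) * ((points[c].mean(axis=0) - p0) ** 2).sum()
                   for c in config)

    config = [[i] for i in range(len(points))]
    best, best_gain = [list(c) for c in config], gain(config)
    while len(config) > 1:
        # complete-link fusion rule: merge the pair of clusters with the
        # smallest maximum inter-pattern distance
        k = len(config)
        pair, score = None, np.inf
        for a in range(k):
            for b in range(a + 1, k):
                d = max(np.linalg.norm(points[i] - points[j])
                        for i in config[a] for j in config[b])
                if d < score:
                    pair, score = (a, b), d
        a, b = pair
        config[a] = config[a] + config[b]
        del config[b]
        g = gain(config)
        if g > best_gain:  # keep tracking: the maximum is known only at the end
            best, best_gain = [list(c) for c in config], g
    return best
```

On well-isolated groups of points, the configuration saved at the gain maximum coincides with the intuitive grouping, mirroring the behavior shown in Figure 4.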

Tracking Algorithm:
1. Choose a hierarchical agglomerative clustering method (HACM)
2. Do while clustering is not complete in the HACM
3.   merge two clusters according to the fusion rule of the HACM
4.   keep track of the maximum value of Δ(ζ) and save it
5. end while
6. recover the optimal configuration

Figure 4. Visual demonstration of the optimal configuration discovered by the Tracking Algorithm using the complete-link. (a) Initial configuration. (b) Optimal configuration. (c) Intra-cluster versus inter-cluster error sum. (d) Clustering balance.

4. Performance Evaluation of Clustering Algorithms

Given a hierarchical clustering algorithm, either clustering balance or clustering gain can be used to find an optimal configuration. Since these measures represent clustering optimality as an absolute value, they also can be used to compare various clustering algorithms in terms of clustering performance. To give an example, we estimated the practical distribution in high dimensionality using the method proposed by Bennett [46]. The method is based on the observation that the variance

of the distance between patterns chosen randomly in a hyper-sphere is inversely proportional to the dimensionality m, as follows:

m × variance(distance between a pair of patterns) ≈ constant.

According to this equation, the higher the dimensionality, the smaller the variance. Thus clustering in high dimensional space can be simulated in two dimensional space if the patterns are randomly and uniformly distributed. Typically used agglomerative clustering algorithms have been applied to the Tracking Algorithm, and their optimal configurations and clustering balances are represented in Figure 5 and Figure 6. According to the results, the complete-link produces the configuration with the lowest clustering balance. In the experiment, the complete-link outperformed the other three clustering algorithms, since it produces the best configuration given the same input patterns.

Figure 5. Comparison of the single-link and the average-link. (a) Optimal configuration found by single-link. (b) Clustering balance versus the number of clusters in single-link. (c) Optimal configuration found by average-link. (d) Clustering balance versus the number of clusters in average-link.

To extend our approach to a practical domain, we conducted a simple experiment with practical document vectors. The documents have been downloaded from the MEDLINE on-line library. They are divided into eight cancer categories, including breast, colon, glycolic, heart attack, oral, prostate, tooth-decay, and weightless. Each category contains 500 documents. After filtering the document set using the stoplist of the SMART IR system and the stemming algorithm proposed by Porter [48], we applied the Tracking Algorithm to the combination of the Colon and Tooth categories. The results in Euclidean space are graphically illustrated in Figure 7. According to the results, optimal cluster configurations can be found by our measure in Euclidean space.

Figure 6. Comparison of the complete-link and Ward's method. (a) Optimal configuration found by the complete-link. (b) Clustering balance versus the number of clusters in the complete-link. (c) Optimal configuration found by Ward's method. (d) Clustering balance versus the number of clusters in Ward's method.

5. Estimation of the Optimal Number of Clusters for Partitional Clustering Algorithms

A major problem accompanying the use of a partitional clustering algorithm is the choice of the number of desired output clusters [27] and order dependency [13]. The sensitivity to the selection of the initial partition is a problem in most partitional algorithms [2], since partitional clustering algorithms may converge to a local minimum of the criterion function if the initial partition is not properly chosen [2]. For handling large data sets and avoiding such sensitivity, many efficient algorithms have been proposed, including CLARANS [19], CURE [23], BIRCH [47], and ROCK [24].

Many clustering algorithms have efficiency and capacity, but most partitional clustering algorithms depend on users to determine the desired number of clusters. As we previously described, the best configuration can be selected among the optimal configurations produced by hierarchical clustering algorithms. A hierarchical clustering algorithm is considered the best if it produces the lowest clustering balance given particular data patterns. The basic assumption of this approach is that the best cluster configuration, the winner among the configurations produced by hierarchical clustering algorithms, will be an approximation of the ideal cluster configuration for partitional clustering. Consequently, the desired number of clusters can be estimated from the best configuration. Also, the centroids of the best configuration can be fed to partitional clustering algorithms to avoid random initial assignment of centroids.

Figure 7. Clustering balance, intra-cluster and inter-cluster squared error sums in Euclidean space (clustering by complete-link in Euclidean metric space; 3154 × 100).

Even though exhaustive enumeration of all possible assignments is not computationally feasible even for small numbers of patterns [12], we can generate all possible cases with a very small number of patterns, such as ten. In Figure 8, optimal configurations and their clustering balances are compared with respect to the ideal configuration. For each possible number of clusters, the lowest balance is presented in part (a) of the figure. According to the experimental results, the complete-link discovered the optimal configuration as closely as the ideal. However, it is risky to generalize this result to conclude that the complete-link is superior to all other algorithms for all input patterns, since clustering performance may change in accordance with the distribution of the input patterns.
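The estimation idea — pick the k whose converged partitional configuration has minimal balance — can be sketched end-to-end. This is our illustrative code, not the paper's experiment: it uses plain Lloyd iterations for k-means and a deterministic farthest-first seeding (our choice, to keep the sketch reproducible) rather than centroids taken from a hierarchical pass:

```python
import numpy as np

def farthest_first(points, k):
    """Deterministic greedy max-min seeding (a k-center heuristic)."""
    idx = [0]
    d = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(d.argmax())
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    return points[idx].copy()

def lloyd(points, cents):
    """Plain k-means (Lloyd) iterations from the given initial centroids."""
    for _ in range(100):
        labels = ((points[:, None] - cents[None]) ** 2).sum(-1).argmin(1)
        new = np.vstack([points[labels == j].mean(0) if (labels == j).any()
                         else cents[j] for j in range(len(cents))])
        if np.allclose(new, cents):
            break
        cents = new
    return labels

def balance(points, labels):
    """Clustering balance E = Lambda + Gamma (Eq. 7) of a labeling."""
    p0 = points.mean(0)
    out = 0.0
    for j in set(labels.tolist()):
        c = points[labels == j]
        out += ((c - c.mean(0)) ** 2).sum() + ((c.mean(0) - p0) ** 2).sum()
    return out

def estimate_k(points, k_max):
    """Desired number of clusters = argmin over k of the clustering balance."""
    scores = {1: balance(points, np.zeros(len(points), dtype=int))}
    for k in range(2, k_max + 1):
        scores[k] = balance(points, lloyd(points, farthest_first(points, k)))
    return min(scores, key=scores.get)
```

On well-separated blobs, the balance curve dips at the true number of groups, which is the behavior Figure 9 reports for the averaged k-means balances.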

Figure 8. Comparison of currently used agglomerative clustering algorithms to the optimal clustering, using cluster configurations and clustering balances. (a) The optimal. (b) The single-link. (c) The average-link. (d) The complete-link. (e) Ward's method.

To verify this assumption experimentally, we applied the k-means algorithm with all possible numbers of clusters. In this experiment, the clustering balance is the average of five trials. The averaged clustering balance produced by k-means is depicted in Figure 9, along with the number of clusters. According to the experimental results, the desired number of clusters for the given distribution is nine. When we apply the Tracking Algorithm to four popular algorithms, including the single-link, average-link, complete-link, and Ward's method, the corresponding optimal configurations are found as in Table I. Surprisingly, the number of clusters produced by the complete-link is equivalent to the desired number of clusters obtained by k-means using all possible k values. It is clear that the estimated number is not the true value; however, the estimated number can contribute to deciding the range of the true number of optimal clusters. This result convincingly illustrates that our clustering measure can be used for partitional algorithms to estimate the number of desired clusters. When the number of desired clusters and the initial centroids are estimated, k-means is able to converge quickly. In addition, a more stable configuration and improved performance are demonstrated in Figure 10.

Figure 9. The optimal configuration and clustering balance traced by k-means: (a) optimal configuration; (b) averaged clustering balance over all possible numbers of clusters using k-means.

Table I. Comparison of lowest balances

  Algorithm        Lowest balance   Highest gain   Number of clusters
  Single-link      122.2292         27.139651      21
  Average-link     120.8603         29.2123        10
  Complete-link    120.0904         29.909583       9
  Ward's method    120.787653       29.770844      10

For the k-medoid algorithm, the results are almost the same as for the k-means algorithm, except for some fluctuations of the clustering balance before convergence. This result is normal, since centroids and medoids are located differently in the same feature space. As a result, the best cluster configuration found by hierarchical clustering algorithms contributes to determining the desired number of clusters and the initial centroids for partitional clustering algorithms.

6. Conclusion

Clustering is not a new technique in computer-related disciplines; a huge demand for clustering techniques, represented by a variety of clustering applications, demonstrates its importance.

Recently, much effort has been devoted to achieving clustering efficiency and scalability. In this paper, we proposed a measure for optimal clustering. Our approach is quite different from other traditional approaches: we evaluate clustering optimality using only internal properties of clusters, and the measure successfully achieves intuitive agreement on clustering optimality. We defined clustering balance using the squared error sums. By searching for the compromise point between the intra-cluster and inter-cluster error sums, we are able to detect the optimal clustering configuration for any hierarchical clustering algorithm; for the purpose of finding an optimal configuration, an agglomerative clustering recovers the cluster configuration with the minimum clustering balance. In addition, the number of desired clusters and the initial centroids can be estimated from the optimal cluster configuration. As a result, partitional clustering algorithms are able to converge more quickly and to give a lower clustering balance than those without our clustering measure.

Figure 10. Additional clustering optimization and quick convergence: (a) optimal configuration found by a hierarchical clustering algorithm; (b) improved clustering performance.

When it comes to classification, we believe that classification performance will be enhanced if we exploit our clustering measure to find optimal sub-centroids in each class. In particular, multiple centroids in a class can be found using our clustering measure, since each class is in turn a cluster. It is natural to assume that those centroids provide us with more accurate information describing the internal structure of a class than that represented by only one centroid. The basic rationale of the improved classification is that classification performance benefits from comparing test data to multiple centroids instead of a single centroid. Therefore, we expect that many unknown contributions of our approach will be discovered in various applications of clustering, while our clustering measure consistently gives feasible solutions for optimal clustering.
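The sub-centroid idea above amounts to nearest-sub-centroid classification. A hedged sketch follows: here the sub-centroids per class are simply given, whereas in the proposal above they would be found by applying the clustering measure inside each class; the function names are ours.

```python
# Hedged sketch: nearest-sub-centroid classification. A test pattern
# takes the label of the class owning the nearest sub-centroid.

def sq_dist(a, b):
    """Squared Euclidean distance between two pattern tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(x, sub_centroids):
    """sub_centroids: {class_label: [centroid, ...]} -- several centroids
    per class, e.g. found by clustering each class on its own."""
    return min(sub_centroids,
               key=lambda lab: min(sq_dist(x, c) for c in sub_centroids[lab]))
```

With class "a" represented by two sub-centroids at (0, 0) and (10, 10), a test point near (10, 10) is correctly labeled "a", even though the single overall mean of class "a" would sit at (5, 5), right on top of class "b": this is the multi-centroid advantage described above.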

References

1. E. Gose, R. Johnsonbaugh and S. Jost, Pattern Recognition and Image Analysis, Prentice Hall, Upper Saddle River, NJ, 1996.
2. A.K. Jain, M.N. Murty and P.J. Flynn, 'Data Clustering: A Review', ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, 1999.
3. E. Shaffer, R. Dubes and A.K. Jain, 'Single-link Characteristics of a Mode-seeking Algorithm', Pattern Recognition, Vol. 11, pp. 65-73, 1979.
4. J. Kittler, 'A Locally Sensitive Method for Cluster Analysis', Pattern Recognition, Vol. 8, pp. 22-33, 1976.
5. C.T. Zahn, 'Graph-theoretical Methods for Detecting and Describing Gestalt Clusters', IEEE Transactions on Computers, Vol. C-20, No. 1, pp. 68-86, 1971.
6. R. Urquhart, 'Graph Theoretical Clustering Based on Limited Neighborhood Sets', Pattern Recognition, Vol. 15, pp. 173-187, 1982.
7. K.C. Gowda and G. Krishna, 'Agglomerative Clustering Using the Concept of Mutual Nearest Neighborhood', Pattern Recognition, Vol. 10, pp. 105-112, 1978.
8. M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
9. M. Abramowitz and I.A. Stegun, Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, US Govt. Printing Office, Washington, DC, 1968.
10. R.C. Jancey, 'Multidimensional Group Analysis', Austral. J. Botany, Vol. 14, pp. 127-130, 1966.
11. J. MacQueen, 'Some Methods for Classification and Analysis of Multivariate Observations', in the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Univ. of California Press, Berkeley, pp. 281-297, 1967.
12. A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
13. P. Willett, 'Recent Trends in Hierarchic Document Clustering: A Critical Review', Information Processing and Management, Vol. 24, No. 5, pp. 577-597, 1988.
14. J.C. Fortier and H. Solomon, 'Clustering Procedures', in P.R. Krishnaiah (ed.), Multivariate Analysis, Academic Press, New York, pp. 493-506, 1966.
15. R.E. Jensen, 'A Dynamic Programming Algorithm for Cluster Analysis', Operations Research, Vol. 17, pp. 1034-1057, 1969.
16. R. Sibson, 'SLINK: An Optimally Efficient Algorithm for the Single-link Cluster Method', Computer Journal, Vol. 16, pp. 30-34, 1973.
17. D. Defays, 'An Efficient Algorithm for a Complete Link Method', Computer Journal, Vol. 20, pp. 364-366, 1977.

18. W.H.E. Day and H. Edelsbrunner, 'Efficient Algorithms for Agglomerative Hierarchical Clustering Methods', Journal of Classification, Vol. 1, No. 1, pp. 7-24, 1984.
19. R.T. Ng and J. Han, 'Efficient and Effective Clustering Methods for Spatial Data Mining', in Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
20. E.M. Voorhees, 'Implementing Agglomerative Hierarchical Clustering Algorithms for Use in Document Retrieval', Information Processing and Management, Vol. 22, No. 6, pp. 465-476, 1986.
21. X. Li, 'Parallel Algorithms for Hierarchical Clustering and Cluster Validity', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 11, pp. 1088-1092, 1990.
22. S. Guha, R. Rastogi and K. Shim, 'CURE: An Efficient Clustering Algorithm for Large Databases', in ACM-SIGMOD Int. Conf. on Management of Data, Seattle, WA, pp. 73-84, 1998.
23. P.S. Bradley, U. Fayyad and C. Reina, 'Scaling Clustering Algorithms to Large Databases', in Knowledge Discovery and Data Mining, 1998.
24. S. Guha, R. Rastogi and K. Shim, 'ROCK: A Robust Clustering Algorithm for Categorical Attributes', in the 15th Int. Conf. on Data Eng., 1999.
25. G. Karypis, E.H. Han and V. Kumar, 'CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling', IEEE Computer: Special Issue on Data Analysis and Mining, Vol. 32, No. 8, pp. 68-75, 1999.
26. M. Charikar, C. Chekuri, T. Feder and R. Motwani, 'Incremental Clustering and Dynamic Information Retrieval', in STOC'97, El Paso, Texas, 1997.
27. R.C. Dubes, 'How Many Clusters Are Best? An Experiment', Pattern Recognition, Vol. 20, No. 6, pp. 645-663, 1987.
28. G.W. Milligan and M.C. Cooper, 'An Examination of Procedures for Determining the Number of Clusters in a Data Set', Psychometrika, Vol. 50, pp. 159-179, 1985.
29. B. Everitt, Cluster Analysis, 1974.
30. B.S. Duran and P.L. Odell, Cluster Analysis: A Survey, Springer-Verlag, New York, 1975.
31. B. Mirkin and I. Muchnik, 'Combinatorial Optimization in Clustering', in D.-Z. Du and P. Pardalos (eds.), Handbook of Combinatorial Optimization, Kluwer Academic Publishers, pp. 261-329, 1998.
32. D. Boley, 'Principal Direction Divisive Partitioning', Data Mining and Knowledge Discovery, Vol. 2, No. 4, pp. 325-344, 1998.
33. V. Ganti, R. Ramakrishnan and J. Gehrke, 'Clustering Large Datasets in Arbitrary Metric Spaces', in the 15th Int. Conf. on Data Eng., 1999.
34. J.T. Tou and R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, 1974.
35. E. Diday and J.C. Simon, 'Clustering Analysis', in K.S. Fu (ed.), Digital Pattern Recognition, Springer-Verlag, Berlin, pp. 47-94, 1976.
36. M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-completeness, W.H. Freeman and Company, San Francisco, CA, 1979.
37. P. Crescenzi and V. Kann, A Compendium of NP Optimization Problems, URL: http://www.nada.kth.se/~iggo/problemlist/compendium2, 1995.
38. J.H. Ward, Jr., 'Hierarchical Grouping to Optimize an Objective Function', Journal of the American Statistical Association, Vol. 58, pp. 236-244, 1963.
39. E. Forgy, 'Cluster Analysis of Multivariate Data: Efficiency Versus Interpretability of Classification', in Biometric Society Meetings, Riverside, CA (abstract in Biometrics, Vol. 21, No. 3, p. 768), 1965.
40. G. Sebestyen, 'Pattern Recognition by an Adaptive Process of Sample Set Construction', IRE Trans. on Info. Theory, Vol. IT-8, 1962.
41. J. MacQueen, 'Some Methods for Classification and Analysis of Multivariate Observations', Western Management Science Inst., University of California, Working Paper 96, 1966.
42. G. Ball and D. Hall, 'Some Fundamental Concepts and Synthesis Procedures for Pattern Recognition Preprocessors', in International Conference on Microwaves, Circuit Theory, and Information Theory, Sep. 1964.
43. L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, Inc., New York, 1990.
44. E. Rasmussen, 'Clustering Algorithms', in W.B. Frakes and R. Baeza-Yates (eds.), Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Upper Saddle River, NJ, pp. 419-442, 1992.
45. N. Jardine and C.J. van Rijsbergen, 'The Use of Hierarchical Clustering in Information Retrieval', Information Storage and Retrieval, Vol. 7, pp. 217-240, 1971.
46. R.S. Bennett, 'The Intrinsic Dimensionality of Signal Collections', IEEE Transactions on Information Theory, Vol. IT-15, pp. 517-525, 1969.
47. T. Zhang, R. Ramakrishnan and M. Livny, 'BIRCH: An Efficient Data Clustering Method for Very Large Databases', SIGMOD Rec., Vol. 25, No. 2, pp. 103-114, 1996.
48. M.F. Porter, 'An Algorithm for Suffix Stripping', Program, Vol. 14, No. 3, pp. 130-137, 1980.