Using CVI for Understanding Class Topology in Unsupervised Scenarios

Beatriz Sevilla-Villanueva, Karina Gibert and Miquel Sànchez-Marrè

1 Knowledge Engineering and Machine Learning Group (KEMLG), Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Spain
2 Department of Computer Science, Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Spain
3 Department of Statistics and Operations Research, Universitat Politècnica de Catalunya-BarcelonaTech, Barcelona, Spain

Abstract. Cluster validation in clustering is an open problem. The most exploited possibility is validation through cluster validity indexes (CVIs). However, there are many indexes available, and they perform inconsistently, scoring different partitions over a given dataset. The aim of this study is the analysis of seventeen CVIs to reach a common understanding of their nature, and to propose an efficient strategy for validating a given clustering. A deep understanding of what CVIs are measuring has been achieved by rewriting all of them under a common notation. This exercise revealed that the indexes measure different structural properties of the clusters. A Principal Component Analysis (PCA) confirmed this conceptual classification. Our methodology proposes to perform a multivariate joint analysis of the indexes to learn about the cluster topology, instead of using them for simple ranking in a competitive way.

1 Introduction

Clustering is an approach of unsupervised learning that finds some hidden structure in unlabeled data. The aim of clustering is to group a set of objects into distinguishable classes, groups, or clusters of similar subjects [1]. These groups can characterize a set of different profiles, which is one of the main applications of clustering for understanding the data.

The different clustering approaches, the different configurations and the selection of the number of clusters (when required) lead to different solutions for the same dataset [1]. Therefore, evaluating which partition is correct, or better than others, becomes a crucial task. Usually, the real partition of the data is unknown and, therefore, the result from a clustering process cannot be compared with a reference partition by computing misclassification indexes, as in the case of supervised learning.

In the literature, most of the techniques used for evaluating clustering results are based on numerical indexes which evaluate the validity of the resulting partition from different points of view, known as Cluster Validity Indexes (CVIs). A wide number of CVIs can be found in the literature, as well as some reviews comparing them [2-10]. However, there are currently no clear guidelines for deciding which is the most suitable index for a given dataset [3-6]. In fact, there is no agreement among those indexes: each one can give some information about a different property of the partition, such as homogeneity, compactness of clusters, variability, etc. All these CVIs refer to structural properties of the partition, which are context-independent, and the evaluation based on them is mainly made in terms of the clusters' topology.

In this work, some of these indexes are analyzed together with some additional indicators of the structure of the partitions provided by statistical packages.
A new methodology is proposed based on the joint multivariate interpretation of all these indexes, which provides valuable information about the topology of the clusters and constitutes a richer method of evaluation. This work differs from those found in the literature because it does not perform a simple comparison among several indexes to see which is the best one. Instead, a multivariate analysis of indexes is performed to better understand the nature of those indexes and which of them have similar behavior. Since, in most real clustering applications, one has no idea about the structure of the best partition fitting the data, we think that, at least, the proposed approach brings valuable information about the properties of the clustering recognized. Concretely, a principal component analysis (PCA) is performed in order to analyze the relationships among the different indexes and, in this way, to establish the methodology for further analysis of a real dataset.

The structure of this paper is the following: first, Sect. 2 contains the background on CVIs. In the methodology (Sect. 3), the PCA is explained and a definition of the indexes with a common notation is provided. Then, a classification and analysis of these indexes is presented in Sect. 4. Section 5 proposes how to use the indexes. Finally, the discussion, conclusions and future work of this study are presented.

2 Background

Cluster validity methods aim at the quantitative evaluation of the results of clustering algorithms. These methods try to cope with questions such as "how many clusters are in the dataset?" or "is there a better partitioning for our dataset?" [8]. Most of the work done on cluster validation in an unsupervised context is centered on the internal validation of the clusters [2,4-11], using what is known as a cluster validity index (CVI). Previous works have shown that there is no single CVI outperforming the rest [3-6]; there are few works that compare several CVIs in order to draw general conclusions [2-4,6], and no general guidelines exist to help the analyst choose the best CVI for a real case.

A reference in this area is [6], which compares 30 CVIs using artificial datasets and Monte Carlo simulation. Another popular study is [7], which compares the Davies-Bouldin index and a modification of Hubert's Γ statistic using the Monte Carlo method.

In [11], they evaluate two clustering methods (hard c-means and single linkage) and compare the results using Davies-Bouldin, Hubert's statistics, Dunn and variants of the Dunn index. They suggest in this study that there is no index which provides consistent results across different clustering algorithms and data structures.

The performance of 15 indexes for determining the number of clusters in 162 synthetic binary datasets is analyzed in [4]. Based on the ability to recommend the correct number of clusters, they proposed to use Ratkowsky-Lance and Davies-Bouldin, followed by Calinski-Harabasz and Xu indexes, regarding pointing out the correct number of classes.

A comparison of a proposed CVI against others is performed in [5]. In this work, they conclude that there is no unique index good enough to determine the number of clusters.
In [3], they compare different cluster indexes of different types: some evaluating the properties of the partition itself (internal criteria), some comparing with a reference partition (external criteria, which usually correspond to accuracy error measurements) and some comparing several partitions among them (relative criteria). The authors claim that the external criteria are better when a reference partition for comparison is available. In other cases, they conclude that the Silhouettes index outperforms the rest in their experiments.

A comparison of the most popular CVIs is performed in [2]. In this work, CVIs are compared using 720 synthetic datasets and 20 datasets from the UCI repository. The synthetic datasets are all possible combinations of the following 5 factors: number of clusters (2, 4, 8), dimensionality (200, 400, 800), cluster overlap (yes, no), cluster density (equal, asymmetric) and noise level (without, with). For each dataset, 3 clustering methods (k-means, Ward and average-linkage) are run using k from 2 to n, n being the size of the dataset. This work concludes that the indexes with better performance seem to be Silhouettes, Davies-Bouldin and Calinski-Harabasz.

All these works run a sort of competition among indexes to search for a winner. However, our belief is that most of these indexes are measuring different properties of the clusters, and that all of them are related to structural characteristics of the classes. It is very probable that for certain structures some indexes perform better than others. In most real clustering applications, the main limitation is that one has no idea about the sphericity of the classes, whether they are tangent, or other features that could help to select the best index. Instead, we think that an interesting reverse reading of this scenario is possible: using all those indexes to get knowledge about the structure of the classes based on their joint performance. This is the contribution of this work: the idea of making a joint interpretation of all indexes to get structural knowledge from the partition.

3 Methodology

In this work, the 17 most commonly used CVIs and indicators found in the literature are evaluated over 17 UCI datasets which contain their real classification. The multivariate relationships among indexes are analyzed by means of a PCA. The first 2D and 3D factorial subspaces are analyzed to identify groups of indexes behaving similarly over several datasets, and conclusions are extracted about how to use those indexes to get information about the topology of the classes.

3.1 Principal Component Analysis

Although Principal Component Analysis (PCA) [12] appeared at the beginning of the 20th century, it became popular in the late 1950s, when computers had sufficient capacity. PCA is a multivariate statistical method that finds a reduced set of factors keeping as much information as possible from the original dataset. The factors are orthogonal and they are linear combinations of the original attributes. The quantity of information from the original dataset conserved in each factor coincides with the eigenvalues of the covariance matrix of the original dataset. The factors are defined by the eigenvectors. From an algebraic point of view, a change of vector basis is found by means of the diagonalization of the covariance matrix built over X, to get the most informative projection of the original dataset. From a geometrical point of view, the most informative orthonormal rotation of the original attributes is found.

Given the matrix $X = \{X_1, \ldots, X_K\}$ of $K$ attributes and $n$ objects, and being $W$ a diagonal matrix of individual weights, the matrix $Cov(X) = X^T W X$ is diagonalized:

$$X^T W X \, u = \lambda u$$

The solutions provide the eigenvectors $u_\alpha$, $\alpha \in \{1, \ldots, K\}$, and the corresponding eigenvalues $\lambda_\alpha$, $\alpha \in \{1, \ldots, K\}$. The Principal Components are linear combinations of the original attributes: $\Psi_\alpha = \sum_{k=1}^{K} u_{\alpha k} X_k$. Sorting both $u_\alpha$ and $\lambda_\alpha$ by decreasing $\lambda_\alpha$, the first $r$ Principal Components such that $\frac{\sum_{\alpha=1}^{r} \lambda_\alpha}{\sum_{\alpha=1}^{K} \lambda_\alpha} > 0.80$ are conserved. However, in most real applications the first factorial plane, the one determined by $(\Psi_1, \Psi_2)$, is analyzed, as it is the one conserving as much information as possible from the original dataset [13]. In this work, PCA has been used to analyze synergies and oppositions of CVIs.
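As an illustration of this computation, a minimal sketch follows (assuming NumPy; uniform individual weights when none are given; the function and parameter names are ours, not the paper's): it diagonalizes the weighted covariance matrix and keeps the factors that jointly conserve 80% of the inertia.

```python
import numpy as np

def pca_factors(X, weights=None, inertia_threshold=0.80):
    """Diagonalize the weighted covariance matrix of X (n objects x K
    attributes) and return the projections on the first r principal
    components conserving at least `inertia_threshold` of the inertia."""
    n, K = X.shape
    w = np.full(n, 1.0 / n) if weights is None else weights / weights.sum()
    Xc = X - np.average(X, axis=0, weights=w)        # center the cloud
    cov = Xc.T @ np.diag(w) @ Xc                     # Cov(X) = X^T W X
    eigvals, eigvecs = np.linalg.eigh(cov)           # symmetric matrix
    order = np.argsort(eigvals)[::-1]                # decreasing lambda
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum_inertia = np.cumsum(eigvals) / eigvals.sum()
    r = int(np.searchsorted(cum_inertia, inertia_threshold) + 1)
    Psi = Xc @ eigvecs[:, :r]                        # factorial coordinates
    return Psi, eigvals, cum_inertia[r - 1]
```

Applied to a table of index values such as the one built in Sect. 4, the returned coordinates would correspond to the factorial projections analyzed there.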
3.2 Cluster Validity Indexes

The evaluation of a resulting clustering is commonly assessed with internal CVIs and indicators, because they do not need additional information other than the data and the clusters themselves. Internal validation evaluates the resulting clusters on the basis of their topology or structure. This evaluation is mostly based on the compactness (cohesion) of the clusters and the separation between clusters. Literature on CVIs is abundant (see Sect. 2). Most of the indexes estimate cluster cohesion (within or intra-variance), cluster separation (between or inter-variance) or combine both to compute a quality measure [14].

The formulation of the 17 indexes is provided under a common notation. Given:

- A dataset composed of $n$ individuals $I = \{i_1, \ldots, i_n\}$ and $K$ attributes $X = \{X_1, \ldots, X_K\}$.
- A partition $P = \{C_1, \ldots, C_\xi\}$ containing $\xi$ clusters, where $n_c = card(C)$, $C \in P$, and $C \cap C' = \emptyset$ for $C \neq C' \in P$.
- $d(i, i')$, the distance between two individuals.

Entropy index measures the entropy associated with partition P [15]. Entropy is always non-negative and it takes value 0 only when there is no uncertainty (when there is only one cluster). So, it measures chaos; values range in [0, 1] and it is better to minimize it. Note that the uncertainty does not depend on the number of objects in I but on the relative proportions of the clusters.

$$Entropy = -\sum_{C \in P} \frac{n_c}{n} \log\left(\frac{n_c}{n}\right) \quad (1)$$

Maximum Cluster Diameter (Δ) is the maximum distance between any two points that belong to the same cluster [16]. In other words, it is defined by the largest diameter among all the clusters belonging to P (see Fig. 1). Therefore, it measures compactness; values range in [0, ∞) and it is better to minimize it.

$$\Delta = \max_{C \in P} \Delta_C, \qquad \Delta_C = \max_{i, i' \in C} d(i, i') \quad (2)$$

Fig. 1. Maximum cluster diameter.

Widest Gap (wg) is the maximum within-cluster gap over all clusters. The widest within-cluster gap is defined as the largest link in the within-cluster minimum spanning tree [17] (see Fig. 2). Thus, it measures compactness; values range in [0, ∞) and it is better to minimize it.

$$wg = \max_{C \in P} wg_C, \qquad wg_C = \max_{\substack{C = C' \cup C'' \\ C' \cap C'' = \emptyset}} \; \min_{i \in C', i' \in C''} d(i, i') \quad (3)$$

Fig. 2. Widest gap.

Average Within-Cluster Distance (W) is the average of the distances between all pairs of objects within the same cluster. It measures compactness (better to minimize) and values range in [0, ∞).

$$W = \frac{\sum_{C \in P} W_C}{\sum_{C \in P} \binom{n_c}{2}}, \qquad W_C = \sum_{\substack{i, i' \in C \\ i < i'}} d(i, i') \quad (4)$$

Within Cluster Sum of Squares (WSS) is the sum of squared distances between the objects of a cluster and its barycenter [10]. It measures compactness (better to minimize) and values range in [0, ∞). Let $\bar{i}_C$ be the barycenter of cluster C:

$$WSS = \sum_{C \in P} WSS_C, \qquad WSS_C = \sum_{i \in C} d(i, \bar{i}_C)^2 \quad (5)$$
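The compactness family lends itself to a direct computation from a precomputed distance matrix. A minimal sketch (assuming NumPy and SciPy; `labels` is an integer vector; strictly positive off-diagonal distances, since the MST routine ignores zero entries; function names are ours):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def compactness_indexes(D, labels):
    """Sketch of the compactness indexes, Eqs. (2)-(4), from a distance
    matrix D (n x n) and a vector of cluster labels."""
    labels = np.asarray(labels)
    diam, gaps, within = [], [], []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        Dc = D[np.ix_(idx, idx)]
        diam.append(Dc.max())                     # Delta_C, Eq. (2)
        # widest within-cluster gap = longest MST edge, Eq. (3)
        gaps.append(minimum_spanning_tree(Dc).toarray().max())
        iu = np.triu_indices(len(idx), k=1)
        within.extend(Dc[iu])                     # within pairs for Eq. (4)
    # WSS (Eq. 5) needs the raw data matrix to form barycenters, so it
    # is omitted from this distance-only sketch.
    return {"Delta": max(diam), "wg": max(gaps),
            "W": float(np.mean(within))}
```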
Average Between-Cluster Distance (B) is the average of all distances between pairs of objects which do not belong to the same cluster. It measures separation (better to maximize) and values range in [0, ∞). Given a pair of classes $C \neq C' \in P$:

$$dist(C, C') = \frac{1}{n_c \, n_{c'}} \sum_{i \in C} \sum_{i' \in C'} d(i, i'), \qquad B = \frac{\sum_{C \neq C' \in P} \sum_{i \in C, i' \in C'} d(i, i')}{\sum_{C \neq C' \in P} n_c \, n_{c'}} \quad (6)$$

Minimum Cluster Separation (δ) is the minimum distance between any two objects that do not belong to the same cluster. In other words, it is defined by the smallest separation among all the clusters (see Fig. 3). It measures separation (better to maximize) and values range in [0, ∞). $\delta_{C,C'}$ is highly related to the $wg_C$ measure: $\delta_{C,C'}$ finds gaps between clusters, whereas $wg_C$ finds gaps inside each cluster.

$$\delta = \min_{C \neq C' \in P} \delta_{C,C'}, \qquad \delta_{C,C'} = \min_{i \in C, i' \in C'} d(i, i') \quad (7)$$

Fig. 3. Minimum cluster separation.

Separation Index (Sindex) is based on the distance from every point to the closest point not in the same cluster. The separation index is the mean of the S smallest such separations [16,17], S being a certain proportion of the dataset. This formalizes separation in a way less sensitive to a single or a few ambiguous points. It measures separation (better to maximize) and values range in [0, ∞). Given $s \in [0, 1]$, let $S = \lceil ns \rceil$; for an object $i \in I$, $c(i)$ is the class of object $i$:

$$\forall i \in I, \quad sep(i) = \min_{i' : \, c(i') \neq c(i)} d(i, i')$$

Let $\{sep(i)_m\}_{m=1..n}$ be the sorted sequence, where $sep(i)_m \leq sep(i)_{m+1}$. Then:

$$Sindex = \frac{1}{S} \sum_{m=1}^{S} sep(i)_m \quad (8)$$

Dunn Index (D) is a cluster validity index for crisp clustering proposed in Dunn (1974) [18]. It attempts to identify "compact and well separated clusters" [19]. If a dataset contains well-separated clusters, the distances among the clusters are usually larger than the diameters of the clusters. We present a formulation from [4] (see $\delta_{C,C'}$ in (7) and $\Delta_C$ in (2)). It measures separation vs compactness, hence it is better to maximize it, and values range in [0, ∞).

$$D = \frac{\min_{C \neq C' \in P} \delta_{C,C'}}{\max_{C \in P} \Delta_C} \quad (9)$$

Dunn-like index is one of the generalizations of the Dunn index [18] proposed by Bezdek and Pal [11]. It attempts to identify compact and well separated clusters. From all the generalizations proposed in [11,20], the one available in the fpc R package [21] substitutes the point-to-point distance of the original Dunn index by the average inter-class distance in the numerator and the average point-to-point intra-class distance in the denominator. This version is more robust than the original Dunn index [8,11]. It also measures separation vs compactness, it is better to maximize and values range in [0, ∞).

$$D' = \frac{\min_{C \neq C' \in P} dist(C, C')}{\max_{C \in P} \frac{\sum_{i \neq i' \in C} d(i, i')}{n_c (n_c - 1)}} \quad (10)$$

Calinski-Harabasz Index (CH) [22] is based on a compromise between the between-cluster distances (separation) and the within-cluster distances (compactness). Thus, it is better to maximize and values range in [0, ∞). CH is defined as follows:

$$CH = \frac{BSS / (\xi - 1)}{WSS / (n - \xi)}, \qquad BSS = \sum_{C \in P} n_c \, d(\bar{i}_C, \bar{i})^2 \quad (11)$$

where $\bar{i}_C$ is the barycenter of cluster C, $\bar{i}$ is the barycenter of I, and WSS is already defined in (5). Higher values of this index indicate a better clustering partition.

Normalized Hubert Gamma Coefficient (Γ̂) is a Pearson version of Hubert's gamma coefficient [8]. It gives information on how good the clustering is as an approximation of the dissimilarity matrix. It is especially useful when clustering is used for dimensionality reduction. This index takes values between -1 and 1, it expresses a compromise between separation and compactness, and it is better to maximize it. It introduces an auxiliary indicator Y that evaluates to 0 for pairs of objects in the same cluster and to 1 otherwise. Being D the matrix of distances between subjects, the Γ̂ index is defined as the correlation between D and Y:

$$\hat{\Gamma} = \frac{\sum_{i < i'} \left(d(i, i') - \bar{d}\right)\left(Y(i, i') - \bar{Y}\right)}{\sqrt{\sum_{i < i'} \left(d(i, i') - \bar{d}\right)^2 \; \sum_{i < i'} \left(Y(i, i') - \bar{Y}\right)^2}}, \qquad Y(i, i') = \begin{cases} 0 & c(i) = c(i') \\ 1 & c(i) \neq c(i') \end{cases} \quad (12)$$

being $c(i)$ the class of $i$, $\bar{d}$ the mean of all distances and $\bar{Y}$ the mean of Y.
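For illustration, a minimal sketch of two of these indexes computed from the distance matrix (NumPy assumed; `labels` as an integer vector; names are ours). For CH, scikit-learn already ships `calinski_harabasz_score`, so no sketch is needed there:

```python
import numpy as np

def dunn_index(D, labels):
    """Sketch of the original Dunn index, Eq. (9): smallest between-cluster
    separation delta over the largest cluster diameter Delta."""
    labels = np.asarray(labels)
    clusters = [np.where(labels == c)[0] for c in np.unique(labels)]
    delta = min(D[np.ix_(a, b)].min()
                for k, a in enumerate(clusters)
                for b in clusters[k + 1:])                # Eq. (7)
    Delta = max(D[np.ix_(a, a)].max() for a in clusters)  # Eq. (2)
    return delta / Delta

def hubert_gamma(D, labels):
    """Sketch of the normalized Hubert gamma, Eq. (12): Pearson correlation
    between pairwise distances and the different-cluster indicator Y."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)
    d = D[iu]
    y = (labels[iu[0]] != labels[iu[1]]).astype(float)    # Y: 0 same, 1 diff
    return np.corrcoef(d, y)[0, 1]
```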
Silhouettes Index [23] provides a succinct graphical representation of how well each object lies within its cluster. It measures separation vs compactness with respect to the nearest cluster. In principle, this index is assessed for each object but, in order to be used, it is reduced to the average over the whole dataset or the average for each cluster.

For each object $i \in I$, let

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \quad (13)$$

$a(i)$: the average dissimilarity of $i$ with all other data within the same cluster. $a(i)$ shows us how well $i$ is matched to the cluster it is assigned to (a smaller value means a better matching):

$$a(i) = \frac{\sum_{i' \in c(i), \, i' \neq i} d(i, i')}{n_{c(i)} - 1}$$

$b(i)$: the lowest average dissimilarity of $i$ with the data of another single cluster. The cluster with this lowest average dissimilarity is said to be the "neighbouring cluster" of $i$ since, aside from the cluster $i$ is assigned to, it is the cluster in which $i$ fits best:

$$b(i) = \min_{C' \neq c(i)} \frac{\sum_{i' \in C'} d(i, i')}{n_{c'}}$$

Then, the value of $s(i)$ belongs to [-1, 1] and a higher value indicates that $i$ is better clustered. The Silhouettes index of a cluster C is

$$s_C = \frac{\sum_{i \in C} s(i)}{n_c}$$

The overall Silhouettes index for a given dataset with a partition is

$$Silhouettes = \frac{\sum_{i \in I} s(i)}{n} = \frac{\sum_{C \in P} n_c \, s_C}{\sum_{C \in P} n_c} \quad (14)$$

Baker and Hubert Index (BH) [24] is a variant of Goodman and Kruskal's Gamma [25]. Comparisons are made between all within-cluster distances (compactness) and all between-cluster distances (separation). A comparison is considered concordant if a within-cluster distance is strictly less than a between-cluster distance. Values are in [-1, 1] and it is better to maximize it.

$$BH = \frac{S^+ - S^-}{S^+ + S^-} \quad (15)$$

$S^+$: number of concordant quadruples; $S^-$: number of discordant quadruples. For this index, all possible quadruples (q, r, s, t) of input objects are considered.

A quadruple (q, r, s, t) is called concordant if one of the following two conditions is true:

- $d(q, r) < d(s, t)$, q and r are in the same cluster, and s and t are in different clusters.
- $d(q, r) > d(s, t)$, q and r are in different clusters, and s and t are in the same cluster.

By contrast, a quadruple is called discordant if one of the following two conditions is true:

- $d(q, r) > d(s, t)$, q and r are in the same cluster, and s and t are in different clusters.
- $d(q, r) < d(s, t)$, q and r are in different clusters, and s and t are in the same cluster.

Within Between Ratio (WBR) index is the ratio between the average within-cluster distance (compactness) and the average between-cluster distance (separation). Hence, a lower value means that the partition P is more compact and more separated. Values range in [0, ∞). See the previous definitions of W and B.

$$WBR = \frac{W}{B} \quad (16)$$

C-Index [26,27] is computed using the within-cluster distances:

$$C = \frac{\bar{W} - W_{min}}{W_{max} - W_{min}} \quad (17)$$

where $\bar{W}$ is the sum of the distances over all pairs of objects from the same cluster:

$$\bar{W} = \sum_{C \in P} \sum_{\substack{i, i' \in C \\ i < i'}} d(i, i')$$

Let $\{d_m\}_{m=1..M}$, $d_m \leq d_{m+1}$, be the ordered list of distances between all possible pairs of objects, and let $n_w$ be the number of within-cluster pairs. Then $W_{min}$ is the sum of the $n_w$ smallest distances out of all pairs of objects (whether or not the objects belong to the same cluster); similarly, $W_{max}$ is the sum of the $n_w$ largest distances out of all pairs:

$$W_{min} = \sum_{m=1}^{n_w} d_m, \qquad W_{max} = \sum_{m=M-n_w+1}^{M} d_m$$

This index measures compactness and separation, it is better to minimize it and values range in [0, 1].
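The Silhouettes index is available directly in scikit-learn; the C-Index takes only a few lines. A minimal sketch (NumPy and scikit-learn assumed; the function name is ours):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def c_index(D, labels):
    """Sketch of the C-Index, Eq. (17), from a distance matrix and labels."""
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)
    pairs = D[iu]
    same = labels[iu[0]] == labels[iu[1]]
    n_w = int(same.sum())                 # number of within-cluster pairs
    d = np.sort(pairs)                    # ordered list {d_m}
    W_bar = pairs[same].sum()             # sum of within-cluster distances
    W_min, W_max = d[:n_w].sum(), d[-n_w:].sum()
    return (W_bar - W_min) / (W_max - W_min)

# Silhouettes (Eq. 14) comes directly from scikit-learn:
# silhouette_score(D, labels, metric="precomputed")
```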
Davies-Bouldin Index (DB) [28] is a cluster separation measure. The overall index is defined as the average of the indexes computed for each individual cluster. An individual cluster index is taken as the maximum pairwise comparison involving the cluster and the other clusters in the solution. Values range in [0, ∞) and lower values are better.

$$DB = \frac{1}{\xi} \sum_{C \in P} \max_{C' \neq C} R_p(C, C'), \qquad R_p(C, C') = \frac{S_{p,C} + S_{p,C'}}{d_p(\bar{i}_C, \bar{i}_{C'})} \quad (18)$$

where $\bar{i}_C$ is the barycenter of cluster C, defined as $\bar{i}_C = \frac{1}{n_c} \sum_{i \in C} i$; $p$ is the Minkowski factor ($p = 1$ Manhattan distance, $p = 2$ Euclidean distance); and $S_{p,C}$ is the dispersion measure of cluster C (for $p = 1$, the average distance of the objects in cluster C to the barycenter of C; for $p = 2$, the standard deviation of the distances of the objects in cluster C to the barycenter of C).

4 Results

The first contribution of this paper is that, after rewriting all the indexes under a common notation, we can observe that these indexes evaluate a reduced set of characteristics of a partition and, according to their definitions, all indexes can be grouped around 4 basic concepts:

(A) Indexes measuring compactness of clusters: Δ, wg, W and WSS.
(B) Indexes measuring separation between clusters: B, δ, Sindex.
(C) Indexes measuring relationships between compactness and separation: Dunn, Dunn-like, CH, Γ̂, Silhouettes, BH, WBR, C-Index and DB.
(D) Indexes measuring chaos in the clusters: Entropy.

In this work, the same 17 datasets from the UCI Machine Learning Repository used in [2] are considered. Table 1 shows the main characteristics of each dataset; each one contains an attribute with the real partition, and this is used as a class variable for our purpose. The 17 CVIs and indicators mentioned have been computed for every dataset using its real partition, and the behavior of the indexes is analyzed with multivariate techniques to understand both the relationships among indexes and how they perform in front of certain class topologies.

A Principal Component Analysis (PCA) has been performed over Table 2 in order to understand the relationships among the different indexes. The eigenvalues recommend keeping 3 factors, with a total conserved inertia of 80.2% (see Fig. 4).

Fig. 4. Histogram of the inertia of the principal components.

Table 1. The 17 datasets from the UCI repository and their main characteristics.

| Dataset          | Num. clusters | Data type | Num. attributes | Num. instances |
|------------------|---------------|-----------|-----------------|----------------|
| breast.w         | 2             | Numerical | 30              | 569            |
| Ionosphere       | 2             | Numerical | 34              | 351            |
| Parkinsons       | 2             | Numerical | 22              | 195            |
| sonar.all        | 2             | Numerical | 60              | 208            |
| Transfusion      | 2             | Numerical | 4               | 748            |
| Haberman         | 2             | Numerical | 3               | 306            |
| Musk             | 2             | Numerical | 166             | 476            |
| Spectf           | 2             | Numerical | 44              | 267            |
| Iris             | 3             | Numerical | 4               | 150            |
| Wine             | 3             | Numerical | 13              | 178            |
| vertebral.column | 3             | Numerical | 6               | 310            |
| Vehicle          | 4             | Numerical | 18              | 846            |
| breast.tissue    | 6             | Numerical | 9               | 106            |
| Glass            | 6             | Numerical | 9               | 214            |
| Ecoli            | 8             | Numerical | 7               | 336            |
| vowel.context    | 11            | Numerical | 10              | 990            |
| movement.libras  | 15            | Numerical | 90              | 360            |

Table 2. The 17 indexes computed for the 17 UCI datasets.

Figure 5a shows the first factorial plane. Figure 5b shows the 3-D projection over the first three factorial axes.
The indexes are represented as vectors projected over the factorial space. The analyzed datasets are represented by points.

Fig. 5. Projection of the first factorial axes: (a) first factorial plane; (b) first factorial cube.

It can be seen that the indexes group along 5 directions, both over the 1st factorial plane and the 1st factorial cube. By interpreting the PCA results under the classical approach, structural properties of the indexes can be elicited by understanding what is common among the indexes related to each of these 5 directions of projection. On the one hand, all indexes of group A (compactness) place together, orthogonally to most of the relational indexes (group C); on a third direction we find the separation indexes (group B), except for B, which behaves as extremely related to W (this is expected because of Huygens' theorem); Entropy follows its own behavior; also, the Dunn index behaves orthogonally to the other families, probably because it works with minima and maxima. Finally, Dunn-like seems to follow an inverse association with the compactness indexes (group A).

Thus, by projecting any dataset on the factorial map, one can determine whether the clusters of this dataset are more compact (vertebral) or have bigger gaps (breast.tissue), and whether the classes are more separated (vowel.context) or seem to be overlapped (Transfusion).

In fact, as some packs of indexes behave similarly, grouping over the factorial space according to the 5 projection directions, we propose to choose one representative index for every family and use the resulting reduced battery to evaluate new datasets in a more efficient way, to learn about their class topology.

5 Class Evaluation Proposal

Thus, given a new dataset partitioned by an attribute P (either an expert-based or an induced clustering), the topology of the classes may be understood by computing the following set of indexes: Diameter (Δ), Separation (δ), Calinski-Harabasz (CH), Dunn and Entropy. From this battery, one can understand the following characteristics of the dataset:

1. Δ: compact classes, without big gaps.
2. δ: separated and non-overlapped classes.
3. CH: compromise between compact and separated classes, and whether individuals are well clustered.
4. Dunn: compact and well separated clusters.
5. Entropy: chaos.

Note that each index is interchangeable with any index of the same group (direction), since they measure the same characteristics. A sketch of how this reduced battery can be assembled is given below.
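As an illustration only (assuming NumPy, SciPy and scikit-learn, with Euclidean distances; the function name is ours), the proposed reduced battery could be assembled as follows:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import calinski_harabasz_score

def cvi_battery(X, labels):
    """Sketch of the reduced battery of Sect. 5: one representative index
    per group (compactness, separation, ratio, Dunn, chaos)."""
    labels = np.asarray(labels)
    D = squareform(pdist(X))
    clusters = [np.where(labels == c)[0] for c in np.unique(labels)]
    Delta = max(D[np.ix_(a, a)].max() for a in clusters)   # compactness
    delta = min(D[np.ix_(a, b)].min()
                for k, a in enumerate(clusters)
                for b in clusters[k + 1:])                 # separation
    p = np.array([len(a) for a in clusters], float) / len(labels)
    entropy = -np.sum(p * np.log(p))   # Eq. (1); divide by log(xi) for [0,1]
    return {"Delta": Delta, "delta": delta,
            "CH": calinski_harabasz_score(X, labels),
            "Dunn": delta / Delta, "Entropy": entropy}
```

Per the interchangeability noted above, any index of a group could be swapped in for its representative without changing the spirit of the battery.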
6 Discussion

In this study, cluster validation through CVIs and indicators is addressed. A joint multivariate evaluation of all indexes is proposed as a richer methodology than the traditional simple ranking according to indexes. The proposal provides information about the topology as well as the structure of the resulting partition.

Efforts have been invested in expressing all the considered indexes by means of a common notation (Sect. 3.2); this permitted a deep understanding of the indexes themselves, and made it possible to realize that most of them refer to some upper category representing different characteristics of the clusters from a structural point of view: whether clusters are more compact or more sparse, whether there are at least two classes that are too close, or whether there seems to be more or less overlapping among clusters. In fact, from this conceptual analysis we identified 4 categories of indexes: those measuring compactness, separation, the relation between compactness and separation, or chaos.

With the analysis of the relationships among those indexes by means of PCA, the indexes group depending on their behavior in front of the different datasets analyzed, and they group according to the conceptual classification provided in the previous section, except for the Dunn index, which has its own behavior, and the average between-cluster distance (B), which seems to approach the compactness indexes. For reasons of efficiency, this allows choosing a reduced battery of 5 indexes to evaluate the structure of a partition, one referred to each relevant characteristic.

7 Conclusions

This work has been motivated by the inherent difficulties of the cluster validation process. In most real applications, there is no prior information about the number of clusters or about the structural characteristics of the existing clusters.

The main contribution of this research is that a joint multivariate vision of a complete set of indexes provides richer information about the topology and structure of the clusters than traditional ranking analysis. A deep understanding of what the 17 CVIs and indicators found in the literature are really measuring is achieved through the analysis of their expressions under a common notation. A higher conceptual hierarchy of indexes emerged from this analysis, based on the cluster characteristics targeted by each index. This explains why the rankings obtained over a set of datasets differ depending on the index. The PCA analysis confirms the homogeneous behavior of indexes of the same category in general trends. Hence, from both the conceptual and the PCA analysis it is possible to classify the indexes into 5 groups. Eventually, a dataset can be assessed using all of them or a reduced battery containing a representative of each category. Then, for a new dataset with one or more partitions, the global set of indexes can be used to learn about the characteristics of each partition, and this information supports the selection of the best one.

Acknowledgements. This work has been supported by the project Diet4You (TIN2014-60557-R), funded by the Spanish Government.

References

1. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. Wiley, Hoboken (2012)
2. Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., Perona, I.: An extensive comparative study of cluster validity indices. Pattern Recognit. 46(1), 243-256 (2013)
3. Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., Dougherty, E.R.: Model-based evaluation of clustering validation measures. Pattern Recognit. 40(3), 807-824 (2007)
4. Dimitriadou, E., Dolnicar, S., Weingessel, A.: An examination of indexes for determining the number of clusters in binary datasets. Psychometrika 67, 137-159 (2002)
5. Maulik, U., Bandyopadhyay, S.: Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell. 24, 1650-1654 (2002)
6. Milligan, G.W., Cooper, M.C.: An examination of procedures for determining the number of clusters in a dataset. Psychometrika 50, 159-179 (1985)
7. Dubes, R.C.: How many clusters are best? - an experiment. Pattern Recognit. 20(6), 645-663 (1987)
8. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: On clustering validation techniques. J. Intell. Inf. Syst. 17(2), 107-145 (2001)
9. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Cluster validity methods: part I. ACM SIGMOD Rec. 31(2), 40-45 (2002)
10. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Clustering validity checking methods: part II. ACM SIGMOD Rec. 31(3), 19-27 (2002)
11. Bezdek, J.C., Pal, N.R.: Some new indexes of cluster validity. IEEE Trans. Syst. Man Cybern. Part B Cybern. 28(3), 301-315 (1998)
12. Pearson, K.: On lines and planes of closest fit to systems of points in space. London Edinb. Dublin Philos. Mag. J. Sci. 2(11), 559-572 (1901)
13. Benzécri, J.P.: L'analyse des données, 1st edn. Dunod, Paris (1973). Tome 1: La Taxinomie, Tome 2: L'analyse des correspondances
14. Kim, M., Ramakrishna, R.S.: New indices for cluster validity assessment. Pattern Recognit. Lett. 26(15), 2353-2363 (2005)
15. Meilă, M.: Comparing clusterings - an information based distance. J. Multivar. Anal. 98(5), 873-895 (2007)
16. Hennig, C., Liao, T.F.: Comparing latent class and dissimilarity based clustering for mixed type variables with application to social stratification. Technical report (2010)
17. Hennig, C.: How many bee species? A case study in determining the number of clusters. In: Proceedings of GfKl-2012, Hildesheim (2013)
18. Dunn, J.C.: Well separated clusters and optimal fuzzy partitions. J. Cybern. 4(1), 95-104 (1974)
19. Halkidi, M., Vazirgiannis, M.: Clustering validity assessment using multi representatives. In: Proceedings of the SETN Conference (2002)
20. Pal, N.R., Biswas, J.: Cluster validation using graph theoretic concepts. Pattern Recognit. 30(6), 847-857 (1997)
21. Hennig, C.: fpc: Flexible procedures for clustering. R package version 2.1-5 (2013)
22. Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat.-Theor. Methods 3(1), 1-27 (1974)
23. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53-65 (1987)
24. Baker, F.B., Hubert, L.J.: Measuring the power of hierarchical cluster analysis. J. Am. Stat. Assoc. 70(349), 31-38 (1975)
25. Goodman, L.A., Kruskal, W.H.: Measures of association for cross classifications. J. Am. Stat. Assoc. 49(268), 732-764 (1954)
26. Hubert, L.J., Levin, J.R.: A general statistical framework for assessing categorical clustering in free recall. Psychol. Bull. 83(6), 1072-1080 (1976)
27. Gordon, A.D.: Classification, 2nd edn. Chapman and Hall/CRC, Boca Raton (1999)
28. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1(2), 224-227 (1979)
