
Procedia Technology 10 (2013) 443 – 449

International Conference on Computational Intelligence: Modeling, Techniques and Applications (CIMTA) 2013

Clustering Ensemble: A Multiobjective Genetic Algorithm based Approach
Sujoy Chatterjee∗, Anirban Mukhopadhyay
Department of Computer Science and Engineering, University of Kalyani, Kalyani - 741235, India

Abstract
Clustering ensemble refers to the problem of obtaining a final clustering of a data set from a set of input clustering solutions. In this article, the clustering ensemble problem is modeled as a multiobjective optimization problem, and a multiobjective evolutionary algorithm is used to solve it. The proposed multiobjective evolutionary clustering ensemble algorithm (MOECEA) evolves a clustering solution from the input clusterings by optimizing two criteria simultaneously. The first objective is to maximize the similarity of the resultant clustering with all the input clusterings, where the similarity between two clustering solutions is computed using the adjusted Rand index. The second objective is to minimize the standard deviation of these similarity scores, in order to prevent the evolved clustering solution from being very similar to one of the input clusterings and very dissimilar to the others. The performance of the proposed algorithm has been compared with that of other well-known cluster ensemble algorithms on a number of artificial and real-life data sets.

© 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license.
Selection and peer-review under responsibility of the University of Kalyani, Department of Computer Science & Engineering.
Keywords: Clustering Ensemble; Validity indices; Multiobjective Genetic Algorithm; Pareto optimality.

1. Introduction

Unsupervised classification has drawn considerable research attention in the fields of data mining, image processing and pattern recognition. Clustering [1] is used to group the elements of a data set in accordance with their similarities. A good clustering is therefore one in which the elements lying in the same cluster are highly similar to each other in some sense, while elements from different clusters are dissimilar. When several clustering algorithms are applied to the same data set, they may generate different clustering results. These differences arise because each algorithm emphasizes different aspects of the input data: every clustering algorithm implicitly or explicitly assumes a particular data model, and a mismatched model may lead to poor clustering results. Clustering ensemble [2–4] algorithms integrate such clustering solutions to obtain a single stable solution.

∗ Corresponding author
E-mail address: sujoy.2611@gmail.com

doi:10.1016/j.protcy.2013.12.381

It is very hard to find the optimal clustering solution from a set of clustering solutions. When different clustering solutions are generated from the same data set, complete prior knowledge of the data distribution is usually not available. Moreover, since different elements have different characteristics, different clustering algorithms group them in different ways. The grouping criteria themselves also differ: for example, the K-means algorithm partitions the data so that the total squared error to the center of each cluster is minimized, while graph-based partitioning divides the graph into K parts based on minimum edge-weight cuts. It is therefore very hard to conclude which clustering result is better.
The objective of an ensemble method is thus to combine the strengths of many individual clustering algorithms. This is the focus of research on clustering ensembles: seeking a combination of multiple partitions that provides an improved overall clustering of the given data. Clustering ensembles can go beyond what is typically achieved by a single clustering algorithm in several respects, such as robustness, novelty, stability and confidence estimation. It is therefore useful to obtain a final clustering solution as a consensus among the input clusterings through a clustering ensemble method.
In this article, we pose the clustering ensemble problem as an optimization problem whose goal is to obtain a suitable clustering solution that is roughly similar to all the input clustering solutions, and thus is expected to reflect a good consensus among them. The problem can readily be modeled as a multiobjective optimization (MOO) problem [5], in which two objectives are optimized simultaneously. The first objective is to maximize the similarity of the resultant clustering with all the input clusterings, where the similarity between two clustering solutions is computed using the Adjusted Rand Index. The second objective is to minimize the standard deviation of these similarity scores, in order to prevent the evolved clustering solution from being very similar to one of the input clusterings and very dissimilar to the others.
In MOO, the search is performed over a number of, often conflicting, objective functions. Single-objective optimization usually yields a single best solution, whereas in MOO the final solution set contains a number of Pareto-optimal solutions, none of which can be further improved on any one objective without degrading another. Non-dominated Sorting Genetic Algorithm-II (NSGA-II) [6], a popular elitist MOO algorithm, is used as the underlying optimization strategy. The Adjusted Rand Index (ARI) [7] and the standard deviation measure are used as the objective functions.
The proposed multiobjective evolutionary clustering ensemble algorithm (MOECEA) has been applied to a number of artificial and real-life data sets, and its performance has been compared with that of different well-known clustering ensemble techniques to establish its superiority.

2. Proposed Multiobjective Clustering Ensemble Technique

This section describes the use of NSGA-II [6] for evolving a set of near-Pareto-optimal ensemble clustering solutions. The proposed technique is described below in detail.

2.1. Encoding of Chromosomes

Each chromosome is a sequence of integers representing the class labels of the n data points, drawn from at most K distinct labels. For example, a chromosome is denoted as {r1, r2, ..., rn}, where ri denotes the class label of the ith data point. The same class label may appear at several positions of a chromosome, i.e., several data points may share the same label. Let two chromosomes be chromosome1: {1,1,1,2,2,2,3,3} and chromosome2: {1,2,2,2,2,3,3,3}. Under this encoding scheme, for the first chromosome the objects {1, 2, 3} are in the first cluster, objects {4, 5, 6} are in the second cluster and objects {7, 8} are in the third cluster, so the number of clusters used is 3.
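As a concrete illustration of this encoding, the sketch below represents a chromosome as a plain vector of integer labels. It is illustrative Python with names of our own choosing; the paper's experiments were carried out in MATLAB.

```python
import numpy as np

# A chromosome is simply a length-n vector of integer cluster labels.
chromosome1 = np.array([1, 1, 1, 2, 2, 2, 3, 3])  # objects 1-3, 4-6, 7-8
chromosome2 = np.array([1, 2, 2, 2, 2, 3, 3, 3])

def num_clusters(chrom):
    """Number of distinct cluster labels used by a chromosome."""
    return len(np.unique(chrom))

def members(chrom, label):
    """1-based indices of the objects assigned to a given cluster label."""
    return np.where(chrom == label)[0] + 1

print(num_clusters(chromosome1))   # 3
print(members(chromosome1, 1))     # [1 2 3]
```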

2.2. Initial Population

The initial population contains the whole set of input clusterings for which an ensemble is to be generated. In addition, some random clustering solutions are included to avoid any bias.
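A minimal sketch of this seeding step is given below, assuming the input clusterings are label vectors as in Section 2.1; the function name and the random-label scheme are our own illustrative choices.

```python
import numpy as np

def initial_population(input_clusterings, n_random, k_max,
                       rng=np.random.default_rng(0)):
    """Seed the population with all input clusterings plus some random label
    vectors (labels drawn uniformly from 1..k_max) to avoid bias."""
    n_points = len(input_clusterings[0])
    random_solutions = [rng.integers(1, k_max + 1, size=n_points)
                        for _ in range(n_random)]
    return [np.asarray(c) for c in input_clusterings] + random_solutions
```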

2.3. Selection

The selection process selects chromosomes for the subsequent breeding, guided by the survival-of-the-fittest principle of natural genetic systems. In binary tournament selection, several “tournaments” are run among a few individuals chosen randomly from the population. In the context of multiobjective ensemble clustering, however, selection is based on the crowded binary tournament selection strategy used in NSGA-II.

2.4. Crossover

Crossover is a probabilistic process that exchanges information between two parent chromosomes to generate two child chromosomes. In this article, crossover is applied with a fixed crossover probability kc. In traditional GA-based algorithms, crossover is generally single-point or multi-point. In a cluster ensemble technique, however, this type of crossover may distort the population and affect convergence to the optimal solution. For example, let the two chromosomes be chromosome1: {1,1,1,2,2,3} and chromosome2: {3,3,3,1,1,2}. These two chromosomes represent the same clustering: in both cases the first three objects are in one cluster, the 4th and 5th objects are in another cluster and the 6th object is in the last cluster. Therefore they have the same fitness value. But if a single-point crossover is performed at the 5th position, the resulting labelings are chromosome1: {1,1,1,2,1,2} and chromosome2: {3,3,3,1,2,3}. As can be seen, in the first child the number of clusters has changed. To avoid this, our approach uses a bipartite graph-based relabeling method, as follows.
Let the two chromosomes be chromosome1: {1,3,1,2,1,2,3,1,3} and chromosome2: {2,2,2,3,3,3,1,1,1}. We calculate the dissimilarity between two clusters of the two chromosomes using Eq. (1):

 
\[
\mathrm{dis}(C_i, C_j) \;=\; \frac{1}{2}\left(\frac{|C_i| - |C_i \cap C_j|}{|C_i|} + \frac{|C_j| - |C_i \cap C_j|}{|C_j|}\right) \qquad (1)
\]

Here the dissimilarity between cluster i of chromosome1 and cluster j of chromosome2 is calculated, and |Ci| denotes the size of cluster i. For example, the dissimilarity between cluster 1 of chromosome1 and cluster 1 of chromosome2 is dis(c1, c1) = (3/4 + 2/3)/2 = 0.708. In chromosome1, objects {1,3,5,8} are in the first cluster, whereas in chromosome2, objects {7,8,9} are in the first cluster, so only object 8 is common to both. The size of cluster 1 of chromosome1 is 4, while the size of cluster 1 of chromosome2 is 3. After computing the dissimilarity matrix, we construct a bipartite graph based on these dissimilarity scores. For our example, the bipartite graph has 3 vertices in each set: the left-hand set (for chromosome1) and the right-hand set (for chromosome2), each vertex representing a cluster encoded in the corresponding chromosome. From this bipartite graph we derive a replacement matrix, which stores the final relabeling of chromosome2 in terms of chromosome1. The steps for constructing the replacement matrix are as follows:

1. Search for the edge with minimum weight in the graph.
2. Store the two vertices corresponding to that edge.
3. Remove all the edges incident on the selected vertex of the right-hand-side set.
4. Repeat the above steps for the remaining edges.

The replacement matrix thus records, for each label of chromosome2, the label of chromosome1 that replaces it. Therefore, when two chromosomes participate in crossover, only the labeling of chromosome2 is changed before the genetic material is exchanged.
This crossover operation also ensures that two parent chromosomes representing the same solution are not affected by the crossover. In the other case, i.e., when the two parents do not represent the same solution, they are relabeled by the same procedure and two new child chromosomes are generated after exchanging their information.
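The dissimilarity of Eq. (1) and the greedy construction of the replacement mapping (steps 1–4 above) can be sketched as follows. This is an illustrative Python reading of the procedure, not the authors' implementation; in particular, the tie-breaking when edges have equal weight is our own choice.

```python
import numpy as np

def dis(ci_members, cj_members):
    """Dissimilarity between two clusters (Eq. 1): average fraction of each
    cluster's members that are NOT shared with the other cluster."""
    ci, cj = set(ci_members), set(cj_members)
    common = len(ci & cj)
    return ((len(ci) - common) / len(ci) + (len(cj) - common) / len(cj)) / 2

def replacement_map(chrom1, chrom2):
    """Greedy bipartite matching (steps 1-4): repeatedly take the smallest
    remaining dissimilarity edge and fix the pairing of that chromosome2
    cluster (right-hand vertex) to a chromosome1 cluster (left-hand vertex)."""
    labels1, labels2 = np.unique(chrom1), np.unique(chrom2)
    edges = [(dis(np.where(chrom1 == l1)[0], np.where(chrom2 == l2)[0]), l1, l2)
             for l1 in labels1 for l2 in labels2]
    mapping = {}
    for _, l1, l2 in sorted(edges):
        if l2 not in mapping:      # this right-hand vertex is not yet matched
            mapping[l2] = l1
    return mapping

def relabel(chrom2, mapping):
    """Rewrite chromosome2 with chromosome1's labels before crossover."""
    return np.array([mapping[label] for label in chrom2])

c1 = np.array([1, 3, 1, 2, 1, 2, 3, 1, 3])
c2 = np.array([2, 2, 2, 3, 3, 3, 1, 1, 1])
print(relabel(c2, replacement_map(c1, c2)))   # -> [1 1 1 2 2 2 3 3 3]
```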

2.5. Mutation

Each chromosome undergoes mutation with a very small mutation probability Mp. The mutation operator adds or subtracts a small quantity to the label of a gene in the chromosome; the float value obtained after the addition or subtraction is then rounded to the nearest integer label.
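A possible sketch of this mutation operator is given below. The magnitude of the perturbation is not specified in the text, so the offset range used here is purely illustrative.

```python
import numpy as np

def mutate(chrom, k_max, p_mut=0.01, rng=np.random.default_rng(0)):
    """With small probability, perturb a label by a random float offset and
    round to the nearest valid integer label."""
    chrom = chrom.copy()
    for i in range(len(chrom)):
        if rng.random() < p_mut:
            offset = rng.uniform(-1.0, 1.0)           # illustrative range
            new_label = int(round(chrom[i] + offset))
            chrom[i] = min(max(new_label, 1), k_max)  # clip to valid labels
    return chrom
```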

2.6. Choice of Objectives

Our algorithm uses two objective functions. First, the similarities between the encoded (reference) clustering and each of the input clustering solutions are calculated, and their sum divided by the number of clustering solutions, i.e., the average similarity, is taken as the first objective. The similarity between two clustering solutions is computed using the Adjusted Rand Index. To avoid the situation where the encoded clustering solution is too similar to one of the input clusterings, thereby making the first objective value very high, the standard deviation of the similarity scores between the encoded solution and the input clusterings is used as the second objective. The first objective is therefore maximized, whereas the second one is minimized.
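Assuming label vectors for the encoded solution and the input clusterings, the two objectives can be computed as in the following sketch; scikit-learn's `adjusted_rand_score` is used here for the ARI, whereas the paper's experiments were carried out in MATLAB.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def objectives(candidate, input_clusterings):
    """First objective: mean ARI between the candidate and all input
    clusterings (to be maximized). Second objective: standard deviation of
    those ARI values (to be minimized)."""
    scores = np.array([adjusted_rand_score(ref, candidate)
                       for ref in input_clusterings])
    return scores.mean(), scores.std()
```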

2.7. Selecting a Solution from the Non-dominated Set

In the final generation, the multiobjective clustering method produces a near-Pareto-optimal non-dominated set of solutions. Hence, it is necessary to choose a particular solution from this set.
The solution is chosen by analyzing the knee region of the non-dominated front. The idea is to favor a solution lying in the “knee region”: the “knee” is formed by those solutions of the Pareto-optimal front where a small improvement in one objective would lead to a large deterioration in another. In our algorithm, therefore, the most promising solution is chosen from those solutions of the Pareto front where a small improvement in the first objective would lead to a large deterioration in the second objective value.
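One simple way to realize such a knee-based choice is sketched below. It uses a common geometric heuristic (the point farthest from the line joining the two extreme solutions of the normalized front), which is one possible reading of the rule above rather than the authors' exact procedure.

```python
import numpy as np

def knee_solution(front):
    """Pick a knee point from a 2-objective non-dominated front.

    `front` is an (m, 2) array of (mean ARI, std of ARI) pairs; the mean ARI
    is negated so that both objectives are minimized."""
    pts = np.asarray(front, dtype=float).copy()
    pts[:, 0] = -pts[:, 0]
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = np.where(hi - lo > 0, hi - lo, 1.0)
    norm = (pts - lo) / span               # both objectives scaled to [0, 1]
    a = norm[np.argmin(norm[:, 0])]        # extreme solution for objective 1
    b = norm[np.argmin(norm[:, 1])]        # extreme solution for objective 2
    d = b - a
    if np.allclose(d, 0.0):
        return 0
    # Perpendicular distance of every point from the line through a and b.
    dist = np.abs(d[0] * (norm[:, 1] - a[1])
                  - d[1] * (norm[:, 0] - a[0])) / np.linalg.norm(d)
    return int(np.argmax(dist))

# e.g. knee_solution([(0.70, 0.05), (0.68, 0.02), (0.60, 0.01)]) -> 1
```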

3. Experimental Design and Results

In this section, we first describe the data sets used in our experiments. We present experiments on various real-life as well as artificial data sets to evaluate the performance of the proposed algorithm. The algorithm is compared with three well-known existing cluster ensemble algorithms, namely CSPA, HGPA and MCLA, and with the single-objective version of the proposed algorithm. The adopted performance metrics are the Adjusted Rand index (ARI), Minkowski score (MS), Silhouette index and other cluster validity indices. The single-objective clustering ensemble algorithm uses only the first objective of MOECEA divided by the number of input clusterings, i.e., the average similarity of the encoded solution to the input clustering solutions. The input clustering solutions are generated using random-subspace clustering with K-means and other clustering algorithms.
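For illustration, input clusterings of this kind could be produced as in the following Python sketch (random feature subspaces with K-means). The subspace fraction and the number of runs are arbitrary choices, and the paper's own experiments were carried out in MATLAB.

```python
import numpy as np
from sklearn.cluster import KMeans

def random_subspace_clusterings(X, k, n_solutions=10, subspace_frac=0.5,
                                rng=np.random.default_rng(0)):
    """Run K-means several times, each time on a random subset of the
    features (random subspace), to obtain diverse input clusterings."""
    n_features = X.shape[1]
    n_sub = max(1, int(subspace_frac * n_features))
    clusterings = []
    for seed in range(n_solutions):
        features = rng.choice(n_features, size=n_sub, replace=False)
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X[:, features])
        clusterings.append(labels + 1)   # 1-based labels to match the encoding
    return clusterings
```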
Experiments are performed in MATLAB 2008a and the running environment is an Intel (R) CPU 1.6 GHz machine
with 1 GB of RAM running Windows XP Professional.

3.1. Data sets

Three real-life data sets are used in the experiments. A short description of the data sets in terms of the number of data points, dimensionality and the number of clusters is provided in Table 1. The three real-life data sets are obtained from the UCI Machine Learning Repository. A description of the artificial data set is given in Table 2.

3.2. Performance Metrics

Some external and internal cluster validity indices are used as performance metrics. External indices, such as the Adjusted Rand Index (ARI) [7], Rand Index (RI) [8], Minkowski Score (MS) [9], classification accuracy (CA) [10], Mirkin's index (MI) and Hubert's index (HI) [11], are used to evaluate the performance of the algorithms with respect to the true clustering of the data sets. Larger values of ARI, RI, CA and HI, and smaller values of MS and MI, indicate better clustering.

On the other hand, internal validity indices, such as the DB index [12], Dunn index [13] and Silhouette index [14], are used to evaluate the clustering solutions objectively. Among these indices, a smaller value of the DB index and larger values of the Dunn and Silhouette indices indicate better clustering.
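Several of these indices have standard implementations; the sketch below computes the subset available in scikit-learn (ARI, RI, Silhouette and DB), while MS, MI, HI, CA and the Dunn index would require custom code.

```python
from sklearn.metrics import (adjusted_rand_score, rand_score,
                             silhouette_score, davies_bouldin_score)

def evaluate(X, labels, true_labels):
    """Evaluate a clustering with a subset of the reported validity indices."""
    return {
        "ARI": adjusted_rand_score(true_labels, labels),   # higher is better
        "RI": rand_score(true_labels, labels),             # higher is better
        "Silhouette": silhouette_score(X, labels),         # higher is better
        "DB": davies_bouldin_score(X, labels),             # lower is better
    }
```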

Table 1. Description of Real-life data sets


Data set    Instances    Number of classes    Number of attributes
Wine 178 3 13
Iris 150 3 3
Seed 210 3 7

Table 2. Description of Artificial data sets


Data set    Instances    Number of classes    Number of attributes
R15 600 15 2

3.3. Parameter Settings

The number-of-clusters parameter is fixed for each data set. For the proposed algorithm, the crossover rate is 0.9, the mutation rate is 0.01 and the population size is 40.

3.4. Results

Tables 3–6 report the performance metric values obtained by the different cluster ensemble algorithms on the four data sets. It is evident from the tables that although in a few cases some of the other algorithms perform slightly better, in most cases the proposed multiobjective algorithm provides consistently good performance. Moreover, it also performs better than the single-objective version, which demonstrates the utility of the multiobjective framework used to develop the proposed algorithm.

Table 3. Performance metric values for the Wine data set

Algorithm Silhouette Adj Rand Rand MI HI CA DB DUNN Minkowski


Proposed 0.1775 0.7040 0.8681 0.1319 0.7362 0.8933 1.3639 0.5500 0.6301
CSPA 0.1385 0.5481 0.7990 0.2010 0.5980 0.8258 1.6390 0.4667 0.7809
HGPA -0.0721 0.2040 0.6459 0.3541 0.2918 0.6067 5.5503 0.1586 1.0361
MCLA 0.1383 0.6360 0.8378 0.1622 0.6756 0.8652 1.8400 0.3902 0.6987
Single Obj -0.4496 0.7161 0.8742 0.1258 0.7485 0.9045 0.4869 0 0.6229

3.5. Pareto-Fronts for Different Data Sets

For the purpose of illustration, Figs. 1 and 2 show the final non-dominated fronts obtained by the proposed multiobjective cluster ensemble algorithm for the Wine, Iris, Seed and R15 data sets, respectively. The figures also show the solutions obtained by the other algorithms in the same objective space. Moreover, the selected knee solutions (denoted as Multi-objective) are also marked in the figures.

Table 4. Different validity measures for the Iris data set

Algorithm Silhouette Adj Rand Rand MI HI CA DB DUNN Minkowski


Proposed 0.7345 0.7592 0.8923 0.1077 0.7845 0.9067 0.5805 2.3270 0.5577
CSPA 0.6961 0.7415 0.8859 0.1141 0.7718 0.9000 0.6206 2.1600 0.5889
HGPA 0.6320 0.6808 0.8590 0.1410 0.7179 0.8733 0.6988 1.6246 0.6538
MCLA 0.7127 0.7156 0.8737 0.1263 0.7475 0.8867 0.6102 2.2973 0.6129
Single Obj 0.5642 0.7079 0.8709 0.1291 0.7417 0.8867 0.1851 0 0.6247

Table 5. Different validity measures for the Seed data set

Algorithm Silhouette Adj Rand Rand MI HI CA DB DUNN Minkowski


Proposed 0.5877 0.6348 0.8383 0.1617 0.6766 0.8619 0.5637 2.0373 0.6984
CSPA 0.5769 0.6687 0.8535 0.1465 0.7069 0.8762 0.8297 1.9937 0.6663
HGPA 0.1349 0.2589 0.6612 0.3388 0.3224 0.6333 2.3407 0.5560 0.9519
MCLA 0.6233 0.6303 0.8352 0.1648 0.6704 0.8524 0.7763 1.9978 0.6957
Single Obj 0.6261 0.6409 0.8398 0.1602 0.6796 0.8571 0.6350 2.2266 0.6848

Table 6. Different validity measures for the R15 data set

Algorithm Silhouette Adj Rand Rand MI HI CA DB DUNN Minkowski


Proposed 0.6267 0.7610 0.9635 0.0365 0.9270 0.7317 0.4753 0 0.7487
CSPA 0.8993 0.9964 0.9996 4.3962e-004 0.9991 0.9983 0.2912 3.9081 0.0822
HGPA 0.4503 0.7469 0.9692 0.0308 0.9384 0.8350 2.6545 0.0723 0.6880
MCLA 0.4922 0.9928 0.9991 8.7924e-004 0.9982 0.9967 0.2827 3.9081 0.1162
Single Obj 0.6841 0.8638 0.9813 0.0187 0.9626 0.8633 0.3701 0 0.5357

Fig. 1. (a) Pareto front for the Wine data set; (b) Pareto front for the Iris data set

Fig. 2. (a) Pareto front for the Seed data set; (b) Pareto front for the R15 data set

4. Conclusion

In this article, a multiobjective evolutionary cluster ensemble algorithm (MOECEA) has been proposed within the framework of a popular multiobjective genetic algorithm, NSGA-II. The objectives are to maximize the similarity of the evolved ensemble clustering solution with the input clustering solutions while minimizing the standard deviation of these similarities, in order to avoid any bias. The performance of the proposed algorithm has been compared with that of other existing clustering ensemble algorithms on some real-life and artificial data sets. The results demonstrate the utility of the proposed technique over the other existing approaches.

References

[1] Jain, A.K., Dubes, R.C.. Data clustering: A review. ACM Computing Surveys 1999;31.
[2] Strehl, A., Ghosh, J.. Cluster ensembles - a knowledge reuse framework for combining partitionings. In: Proc. 11th National Conference on Artificial Intelligence; 2002, p. 93–98.
[3] Ghaemi, R., Sulaiman, M., Ibrahim, H., Mustapha, N.. A review: accuracy optimization in clustering ensembles using genetic algorithms.
Artificial Intelligence Review, Springer 2011;35(4):287–318.
[4] Fred, A.L.N., Jain, A.K.. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine
Intelligence 2005;27(6):835–850.
[5] Coello, C.. A comprehensive survey of evolutionary-based multiobjective optimization techniques. Knowledge and Information Systems
1999;1(3):129–156.
[6] Deb, K., Pratap, A., Agrawal, S., Meyarivan, T.. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on
Evolutionary Computation 2002;6(2):182–197.
[7] Yeung, K.Y., Ruzzo, W.L.. An empirical study on principal component analysis for clustering gene expression data. Bioinformatics
2001;17(9):763–774.
[8] Rand, W.M.. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 1971;66(336):846–850.
[9] Ben-Hur, A., Isabelle, G.. Detecting stable clusters using principal component analysis. Methods Mol Biol 2003;224:159–182.
[10] Bandyopadhyay, S., Maulik, U., Mukhopadhyay, A.. Multiobjective genetic clustering for pixel classification in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 2007;45(5):1506–1511.
[11] Hubert, L., Arabie, P.. Comparing partitions. Journal of Classification 1985;2(1):193–218.
[12] Davies, D.L., Bouldin, D.W.. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1979;1:224–227.
[13] Dunn, J.C.. Well separated clusters and optimal fuzzy partitions. J Cyberns 1974;4:95–104.
[14] Rousseeuw, P.. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comp App Math 1987;20:53–65.
