A K-means based Genetic Algorithm for Data Clustering

Clara Pizzuti and Nicola Procopio
Italian National Research Council
Abstract A genetic algorithm that exploits the K-means principles for dividing objects into groups of high similarity is proposed. The method evolves a population of chromosomes, each representing a division of the objects into a different number of clusters. A group-based crossover, enriched with the one-step K-means operator, and a mutation strategy that reassigns objects to clusters on the basis of their distance from the clusters computed so far, allow the approach to determine the best number of groups present in the dataset. The method has been tested with four different fitness functions on both synthetic and real-world datasets, for which the ground-truth division is known, and compared with the K-means method. Results show that the approach obtains higher values of the evaluation indexes than those obtained by the K-means method.
1 Introduction
K-means method finds a partition that minimizes the total within-cluster variance, also known as the sum of squared errors, defined as

    CVW = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − m_i||²    (1)

where m_i is the centroid of cluster C_i and ||·|| denotes the Euclidean distance.
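Formula (1) is straightforward to compute; the following is an illustrative NumPy sketch (the function name and the toy data are ours, not from the paper, whose implementation is in MATLAB):

```python
import numpy as np

def within_cluster_variance(X, labels):
    """Total within-cluster variance CVW of formula (1): sum over clusters
    of the squared Euclidean distances of the objects to their centroid."""
    cvw = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        centroid = cluster.mean(axis=0)           # m_i
        cvw += ((cluster - centroid) ** 2).sum()  # adds ||x - m_i||^2 terms
    return cvw

# Two tight clusters: each contributes 2 * 0.5^2 = 0.5
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])
print(within_cluster_variance(X, labels))  # 1.0
```

Lower CVW means more compact clusters, which is why (as discussed later) optimizing it directly biases a variable-k method towards many small clusters.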
In the following, a brief overview of evolutionary approaches combined with the K-means method is reported.
minimum k_min and maximum k_max value, thus the chromosome length is k_max − k_min, and the presence of don't-care symbols allows a variable number of clusters. The fitness function of this algorithm is the Davies-Bouldin index [5]. Though the method works well on artificial data, on the well-known Iris dataset it merges two of the three clusters, which have overlapping objects.
In the next section a method that exploits K-means principles inside a genetic algorithm is presented.
Fig. 2 Group-based crossover where the random position h = 2 is selected, thus l1 = 1 and l2 = 2. The gene values of the parents, changed to generate the offspring, are highlighted in bold. (Renumbered labels at positions 1–10: p1 = 1 1 1 2 1 3 3 1 2 1, p2 = 1 1 1 1 1 1 2 2 2 1.)
Initialization. The simple strategy of generating k random cluster labels and assigning each object to a cluster at random can produce bad initial clusters that may overlap, thus increasing the convergence time. The initialization process employed by CluGA uses the approach of K-means++ [1] to select k centers, and then assigns each object to the closest center. The K-means++ seeding strategy has been shown by its authors to improve both the accuracy and the speed of K-means, thus it can be beneficial also for CluGA. For each element of the population, a random number k between 2 and k_max is generated. Then the following steps are performed: 1) choose the first cluster center c_1 = x uniformly at random from the dataset X; 2) choose the next cluster center c_i = x′ ∈ X − {x} from the remaining data objects with probability proportional to its distance from the closest center already chosen; 3) repeat step 2) until k centers have been chosen.
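The seeding steps above can be sketched as follows (Python for illustration; all names are ours). Note that the original K-means++ selects with probability proportional to the squared distance; the text says distance, and the sketch follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def seed_centers(X, k_max):
    """CluGA-style initialization sketch for one chromosome: draw k in
    [2, k_max], seed k centers K-means++-style, label objects by closest center."""
    n = len(X)
    k = rng.integers(2, k_max + 1)       # random number of clusters for this chromosome
    centers = [X[rng.integers(n)]]       # step 1: first center uniform at random
    while len(centers) < k:              # step 3: repeat until k centers chosen
        # step 2: distance of each object from its closest already-chosen center
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        p = d / d.sum()                  # probability proportional to that distance
        centers.append(X[rng.choice(n, p=p)])
    centers = np.array(centers)
    # assign each object to the closest center -> initial cluster labels
    labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
    return labels, centers
```

Objects already chosen as centers get probability 0 (their distance is 0), so duplicate centers are naturally avoided.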
Crossover. Standard one-point and two-point crossover have been shown to present drawbacks when applied to grouping problems [6]: the offspring may encode solutions too far from those of their parents. A cluster-oriented crossover operator can avoid these problems. CluGA performs a group-based crossover followed by the application of the K-means operator KMO. Given two individuals p1 and p2 of the population, the crossover operator performs the following steps: 1) choose a random position h between 1 and n; 2) take the cluster labels l1 = p1(h) and l2 = p2(h); 3) generate the first offspring child1 by substituting the value of all those genes of p1 having cluster label equal to l1 with l2; 4) generate the second offspring child2 by substituting the value of all those genes of p2 having cluster label equal to l2 with l1; 5) apply KMO to child1 and child2; 6) if any singleton cluster is present, assign its object to one of the existing clusters at random.
An example of group-based crossover is shown in Figure 2. If the random position
h = 2 is selected, the members {1, 2, 10} of the cluster in p1 with label value 1 are
assigned cluster label 2, while the members {2, 3, 10} of the cluster in p2 with label
value 2 are assigned cluster label 1.
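Steps 1)–4) amount to a label swap between the two parents. A minimal sketch, using the two parent encodings of Fig. 2 (here h is 0-based and passed explicitly instead of drawn at random, and the KMO and singleton-repair steps 5)–6) are omitted):

```python
import numpy as np

def group_crossover(p1, p2, h):
    """Group-based crossover, steps 1)-4) only."""
    l1, l2 = p1[h], p2[h]                # 2) cluster labels at position h
    child1 = np.where(p1 == l1, l2, p1)  # 3) relabel cluster l1 of p1 as l2
    child2 = np.where(p2 == l2, l1, p2)  # 4) relabel cluster l2 of p2 as l1
    return child1, child2

p1 = np.array([1, 1, 1, 2, 1, 3, 3, 1, 2, 1])
p2 = np.array([1, 1, 1, 1, 1, 1, 2, 2, 2, 1])
c1, c2 = group_crossover(p1, p2, h=6)    # h chosen so that l1=3 differs from l2=2
print(c1.tolist())  # [1, 1, 1, 2, 1, 2, 2, 1, 2, 1]
print(c2.tolist())  # [1, 1, 1, 1, 1, 1, 3, 3, 3, 1]
```

Because whole clusters (not arbitrary gene segments) are exchanged, the offspring stay close to their parents' groupings, which is the point of a grouping-aware operator.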
Mutation. The mutation operator performs a local search that reassigns objects to closer clusters. For each object x_i ∈ X, it checks whether the distance from x_i to the cluster C_a it belongs to is higher than the distance from x_i to another cluster C_b. In such a case, x_i is removed from C_a and added to C_b.
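A one-pass sketch of this local search, under the assumption (ours, not spelled out above) that the distance from an object to a cluster is the distance to the cluster's centroid:

```python
import numpy as np

def mutate(X, labels):
    """Reassign each object to the cluster whose centroid is closest
    (distance to a cluster taken as distance to its centroid, an assumption)."""
    new = labels.copy()
    ks = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ks])
    for i, x in enumerate(X):
        d = np.linalg.norm(centroids - x, axis=1)
        new[i] = ks[np.argmin(d)]   # move x_i if another cluster is closer
    return new

# The last point is mislabeled as cluster 0 but sits on cluster 1:
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [10.0, 10.5]])
labels = np.array([0, 0, 1, 1, 0])
print(mutate(X, labels).tolist())  # [0, 0, 1, 1, 1]
```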
where m_i is the centroid of cluster i, n_i is the size of cluster i, m is the overall mean of the dataset, and ||·|| denotes the Euclidean distance.
The Silhouette index, for each object, measures how similar that object is to ele-
ments in its own cluster, when compared to objects in other clusters. The silhouette
criterion is defined as:
    S = (1/k) Σ_{i=1}^{k} (1/n_i) Σ_{x ∈ C_i} [b(x) − a(x)] / max(a(x), b(x))    (4)
where a(x) is the average distance from x to the other objects in the same cluster as
x, and b(x) is the minimum average distance from x to points in a different cluster,
minimized over all clusters.
The silhouette value ranges from −1 to +1. A high silhouette value indicates that objects are well matched to their own cluster and poorly matched to neighboring clusters.
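Formula (4) can be transcribed directly (an unoptimized illustration in Python; in practice a library routine would be used):

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette criterion of formula (4): the per-object score is averaged
    within each cluster, then over the k clusters."""
    ks = np.unique(labels)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)  # pairwise distance matrix
    s = 0.0
    for c in ks:
        idx = np.where(labels == c)[0]
        cluster_sum = 0.0
        for i in idx:
            # a(x): average distance to the other objects of the same cluster
            a = D[i, idx[idx != i]].mean() if len(idx) > 1 else 0.0
            # b(x): minimum average distance to the objects of another cluster
            b = min(D[i, labels == o].mean() for o in ks if o != c)
            cluster_sum += (b - a) / max(a, b)
        s += cluster_sum / len(idx)
    return s / len(ks)
```

On two tight, well-separated clusters the value approaches +1, as the text describes.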
The Davies-Bouldin criterion is based on a ratio of within-cluster and between-
cluster distances. It is defined as:
    DB = (1/k) Σ_{i=1}^{k} max_{j ≠ i} { (d_i + d_j) / d_{i,j} }    (5)
where d_i and d_j are the average distances between each object in the i-th and j-th cluster, respectively, and the centroid of its own cluster, and d_{i,j} is the Euclidean distance between the centroids of the i-th and j-th clusters.
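Formula (5) likewise admits a direct transcription (again an illustrative sketch; lower DB values indicate compact, well-separated clusters):

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin criterion of formula (5)."""
    ks = np.unique(labels)
    cents = np.array([X[labels == c].mean(axis=0) for c in ks])
    # d_i: average distance of the objects of cluster i to its centroid
    d = np.array([np.linalg.norm(X[labels == c] - cents[i], axis=1).mean()
                  for i, c in enumerate(ks)])
    db = 0.0
    for i in range(len(ks)):
        # worst-case (largest) within/between ratio for cluster i
        db += max((d[i] + d[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ks)) if j != i)
    return db / len(ks)
```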
The pseudo-code of the algorithm is shown in Figure 3. In the next section the CluGA method is executed on synthetically generated and real-world datasets, and the clustering structures obtained by employing as fitness functions the cluster-within variance, Calinski-Harabasz, Silhouette, and Davies-Bouldin criteria are compared.
5 Experimental Results
Table 1 Comparison between CluGA and the K-means algorithm on synthetic datasets. R: Recall, P: Precision, F: F-measure. The best values, excluding the CVW index, are highlighted in bold.
Table 2 Comparison between CluGA and the K-means algorithm on the IRIS and CANCER datasets. R: Recall, P: Precision, F: F-measure. The best values, excluding the CVW index, are highlighted in bold.

IRIS                  CluGA fitness                                 K-Means
                      CVW        S          VRC        DB
ARI  mean             0.3238     0.5418     0.7223     0.5681       0.7005
     best             0.3417     0.5583     0.7455     0.5681
     st. dev.         (0.0128)   (0.0058)   (0.0148)   (0.0000)     (0.0941)
R    mean             0.9237     0.5255     0.7457     0.5794       0.7298
     best             0.9423     0.5537     0.7869     0.5793
     st. dev.         (0.0148)   (0.0099)   (0.0190)   (0.0000)     (0.0686)
P    mean             0.4212     0.8901     0.7589     1.0000       0.7542
     best             0.4353     0.9490     0.8093     1
     st. dev.         (0.0094)   (0.0207)   (0.0229)   (0.0000)     (0.0341)
F    mean             0.5785     0.6608     0.7522     0.7337       0.7413
     best             0.5955     0.6994     0.7979     0.7336
     st. dev.         (0.0115)   (0.0136)   (0.0209)   (0.0000)     (0.0533)
CVW  mean             21.6335    152.4457   79.0029    154.9470     85.2417
     best             21.23349   152.348    78.85144   154.947
     st. dev.         (0.3481)   (0.3092)   (0.2705)   (0.0000)     (20.2076)
k    mean             13.0       2          3          2            3
     best             13         2          3          2
     st. dev.         (0.0000)   (0.0000)   (0.0000)   (0.0000)

CANCER                CluGA fitness                                 K-Means
                      CVW        S          VRC        DB
ARI  mean             0.2583     0.7825     0.8364     0.8226       0.8337
     best             0.3053     0.8337     0.8551     0.8337
     st. dev.         (0.043)    (0.0478)   (0.0102)   (0.0171)     (0)
R    mean             0.8719     0.6665     0.7289     0.7115       0.7252
     best             0.8811     0.7252     0.7543     0.7252
     st. dev.         (0.0066)   (0.0482)   (0.0135)   (0.0211)     (0)
P    mean             0.2294     0.6914     0.7373     0.7227       0.7339
     best             0.2390     0.7339     0.7586     0.7339
     st. dev.         (0.0088)   (0.0335)   (0.011)    (0.0175)     (0)
F    mean             0.3631     0.6786     0.7331     0.717        0.7295
     best             0.3746     0.7295     0.7563     0.7295
     st. dev.         (0.011)    (0.0414)   (0.0122)   (0.0192)     (0)
CVW  mean             72.2422    200.1433   196.9676   197.4081     196.9366
     best             70.8324    196.9366   196.9366   196.9366
     st. dev.         (1.0418)   (6.6177)   (0.0767)   (0.9443)     (0)
k    mean             26.6       2          2          2            2
     best             25         2          2          2
     st. dev.         (0.6992)   (0)        (0)        (0)
Table 3 Comparison between CluGA and the K-means algorithm on the GLASS and ECOLI datasets. R: Recall, P: Precision, F: F-measure. The best values, excluding the CVW index, are highlighted in bold.

GLASS                 CluGA fitness                                 K-Means
                      CVW        S          VRC        DB
ARI  mean             0.2196     0.2062     0.5803     0.2087       0.5195
     best             0.2446     0.6762     0.6064     0.6086
     st. dev.         (0.0309)   (0.2945)   (0.0316)   (0.2706)     (0.1411)
R    mean             0.8405     0.1932     0.4381     0.1682       0.3575
     best             0.8568     0.7039     0.5358     0.5336
     st. dev.         (0.0146)   (0.2912)   (0.0726)   (0.2316)     (0.1009)
P    mean             0.2321     0.2224     0.4054     0.2313       0.3900
     best             0.2463     0.4094     0.4310     0.3933
     st. dev.         (0.0124)   (0.1653)   (0.0253)   (0.1382)     (0.0277)
F    mean             0.3636     0.1628     0.4192     0.1578       0.3660
     best             0.3809     0.5177     0.4642     0.4528
     st. dev.         (0.0161)   (0.2218)   (0.0412)   (0.1909)     (0.0873)
CVW  mean             178.5231   932.6661   752.9121   929.1354     860.4198
     best             173.3125   358.5322   589.0314   495.5036
     st. dev.         (4.6974)   (361.2018) (109.3977) (324.6504)   (126.0378)
k    mean             13.9000    3.0000     2.3000     2.8000       2
     best             13         2          2          2
     st. dev.         (0.3162)   (1.7638)   (0.4830)   (0.9189)

ECOLI                 CluGA fitness                                 K-Means
                      CVW        S          VRC        DB
ARI  mean             0.257      0.6189     0.573      0.6805       0.3876
     best             0.2785     0.6925     0.7065     0.7277
     st. dev.         (0.0128)   (0.1222)   (0.1397)   (0.0591)     (0.1599)
R    mean             0.7654     0.5005     0.4573     0.6127       0.6021
     best             0.7867     0.578      0.5667     0.6877
     st. dev.         (0.0104)   (0.0995)   (0.1255)   (0.0717)     (0.2547)
P    mean             0.4277     0.7684     0.7271     0.7215       0.5537
     best             0.4421     0.7859     0.812      0.7672
     st. dev.         (0.008)    (0.0183)   (0.0826)   (0.0631)     (0.0242)
F    mean             0.5487     0.6007     0.5558     0.6598       0.6158
     best             0.5661     0.6544     0.6675     0.6993
     st. dev.         (0.0085)   (0.0836)   (0.1191)   (0.0562)     (0.0235)
CVW  mean             8.7751     25.3293    26.9246    19.2687      14.7483
     best             8.5466     21.1775    23.261     13.5761
     st. dev.         (0.2116)   (4.7125)   (5.6102)   (3.1077)     (0.6440)
k    mean             18.8       2.9        2.7        8.1          8
     best             18         4          3          8
     st. dev.         (0.4216)   (0.5676)   (0.483)    (2.7669)     (0)
Fig. 4 Synthetic datasets with 500 objects: (a) k = 3, (b) k = 4, (c) k = 6.
been written in MATLAB 8.6 R2015b, by using the Genetic Algorithm Solver of the Global Optimization Toolbox. A trial-and-error procedure has been adopted for fixing the parameter values. Thus the crossover rate p_c has been fixed to 0.8, the mutation rate p_m to 0.2, the population size to 100, the elite reproduction to 10% of the population size, and the number of generations to 50. In order to evaluate the method, since the ground-truth clusterings are known, we computed the well-known measures Adjusted Rand Index [9], Precision, Recall, and F-measure [18], adopted in the literature to assess the capability of an algorithm to find partitions similar to the true data division.
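For Precision, Recall and F-measure the paper cites [18]; one common pair-counting formulation of these measures is sketched below (an assumption on our part, since [18] admits several variants: two partitions are compared through the object pairs they place together):

```python
from itertools import combinations

def pairwise_prf(truth, pred):
    """Pair-counting Precision/Recall/F-measure of a predicted partition
    against a ground-truth partition."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t, same_p = truth[i] == truth[j], pred[i] == pred[j]
        tp += same_t and same_p          # pair together in both partitions
        fp += (not same_t) and same_p    # together only in the prediction
        fn += same_t and (not same_p)    # together only in the ground truth
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(pairwise_prf([0, 0, 1, 1], [0, 0, 1, 1]))  # (1.0, 1.0, 1.0)
```

A perfect recovery of the ground truth yields P = R = F = 1, as in the example.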
We first present the results obtained by CluGA on randomly generated synthetic datasets. We consider three datasets of 500 objects each. The first one, named Syn 3, contains three Gaussian clusters, distinct and well separated, with standard deviations of the centroids equal to (0.2, 0.2, 0.35) (Figure 4(a)). In the second one, named Syn 4, there are four clusters close to each other, with standard deviations of the centroids equal to (0.2, 0.35, 0.45, 0.3) (Figure 4(b)). The third one, named Syn 6, consists of six mixed clusters with standard deviations of the centroids equal to (0.2, 0.2, 0.35, 0.45, 0.1, 0.1) (Figure 4(c)).
Table 1 shows the execution of the method on the synthetic datasets for the four fitness functions, along with the results of the K-means algorithm. Each method has been executed 10 times, and the average and best values of the ten runs, along with the standard deviation, are reported. Notice that the K-means has been executed with input parameter k equal to the true number of clusters, while CluGA has been executed by fixing the maximum number of clusters to √n, which, in the literature, is considered a rule of thumb [16]. The table highlights the very good capability of CluGA of partitioning the datasets into a number of clusters very close to the ground truth, even if the algorithm does not know the number of clusters to find, when using as fitness functions VRC (formula (2)), S (formula (4)), and DB (formula (5)). In fact, on Syn 3 it reaches the same results as the K-means method. On Syn 4 and Syn 6, though the average values of the measures are slightly lower, the best values, out of the ten runs, are always better. As regards the cluster-within variance, the selection bias towards a high number of clusters is evident from the table, which confirms the experiments of Liu et al. [13]. With this fitness function the method obtains, for the three datasets, an average of 22, 22, and 21 clusters and average CVW values of 13.2, 24.8, and 12.6, respectively.
Tables 2-3 compare CluGA and K-means on four well-known real-world datasets from the UCI Machine Learning Repository. The Iris dataset consists of 150 samples of Iris flowers, categorized into three species and described by four features. The Breast Cancer Wisconsin dataset contains 683 objects with 9 features, divided into two classes. The Glass dataset has 214 instances of glass with 9 features, divided into 6 classes. The Ecoli dataset consists of 336 instances, described by seven attributes, grouped with respect to 8 protein localization sites. The tables clearly point out that the
optimization of the Calinski-Harabasz criterion obtains the best evaluation measure
values on all the datasets, except for Ecoli. This fitness function allows the algo-
rithm to outperform the K-means method on all the datasets with higher ARI value
and lower within-cluster variance. For instance, on the Iris dataset CluGA finds
the three clusters with the mean Adjusted Rand Index value equal to 0.7223, the
best ARI equal to 0.7455, and cluster-within variance 79.0029, while the K-means
obtains ARI=0.7005 and CVW =85.2417. For the Cancer dataset the mean and best
ARI values of CluGA are 0.8364 and 0.8551, respectively, while for the K-means
ARI=0.8337. Analogously for the Glass dataset, CluGA obtains ARI=0.5803 and
CVW =752.9121, while K-means finds ARI=0.5195 and CVW = 860.4198. As regards
the Ecoli dataset, the Calinski-Harabasz and the Silhouette indexes merge some clusters, while the Davies-Bouldin criterion finds all the 8 clusters in almost all the executions. However, it must be noted that this dataset contains two clusters with only two objects, and one cluster with 5 objects, which are not easy to find. In fact, the K-means finds 8 clusters with a precision value around 0.5, which is rather low, while the precision of the three indexes is above 0.7. Also for these real-world datasets it can
be observed that the CVW criterion prefers a number of clusters much higher than
the ground-truth.
6 Conclusions
The paper proposed a genetic algorithm able to divide a dataset into a number of groups not known in advance. The method employs a label-based representation, and exploits the K-means strategy to improve the offspring generated by the group-based crossover. Though the idea of combining genetic algorithms and K-means is not new, CluGA significantly differs from the existing proposals. With respect to GKA [12], FGKA [14] and IGKA [15], it applies the one-step operator of the K-means after having actually performed the group-based crossover. The above methods, instead, substitute the crossover with the K-means operator. Moreover, CluGA does not need to fix the number of clusters. Each chromosome, in fact, is initialized with a random number in the interval {2, . . . , k_max}, where k_max could also be n.
As regards the KGA method [2], CluGA has a different representation and completely different operators. Experiments on synthetic and real-world datasets using four different fitness functions show that, when the three popular evaluation criteria of Calinski-Harabasz, Silhouette and Davies-Bouldin are employed as fitness functions, the method obtains very good solutions and outperforms the K-means method. The cluster-within variance, instead, though used by many authors, is not apt for genetic algorithms since its optimization generates a bias towards a high number of clusters.
The experiments also pointed out that the Calinski-Harabasz criterion obtains the best results for all the datasets, except for Ecoli, where the 8 clusters are merged into three. Future work will evaluate the method on datasets coming from real-world applications.
Acknowledgment: This work has been partially supported by MIUR D.D. n. 0001542, under the project BA2KNOW − PON03PE 00001 1.
References
1. David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA
’07, pages 1027–1035, 2007.
2. Sanghamitra Bandyopadhyay and Ujjwal Maulik. An evolutionary technique based on K-means algorithm for optimal clustering in R^n. Inf. Sci. Appl., 146(1-4):221–237, 2002.
3. Sanghamitra Bandyopadhyay and Ujjwal Maulik. Genetic clustering for automatic evolution of clusters and application to image classification. Pattern Recognition, 35:1197–1208, 2002.
4. T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in
Statistics, 3(1):1–27, 1974.
5. D. Davies and D. Bouldin. A cluster separation measure. IEEE Trans. Pattern Analysis and
Machine Intelligence, 1(2):224–227, 1979.
6. Emanuel Falkenauer. Genetic Algorithms and Grouping Problems. John Wiley & Sons, Inc.,
New York, NY, USA, 1998.
7. Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On clustering validation tech-
niques. J. Intell. Inf. Syst., 17(2-3):107–145, December 2001.
8. Eduardo Raul Hruschka, Ricardo J. G. B. Campello, Alex A. Freitas, and André C. Ponce
Leon F. De Carvalho. A survey of evolutionary algorithms for clustering. IEEE Trans. Sys.
Man Cybernetics-Part C, 39(2):133–155, March 2009.
9. Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2:193–
218, 1985.
10. Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc.,
Upper Saddle River, NJ, USA, 1988.
11. Adán José-García and Wilfrido Gómez-Flores. Automatic clustering using nature-inspired metaheuristics: A survey. Applied Soft Computing, 41:192–213, 2016.
12. K. Krishna and M. N. Murty. Genetic K-means algorithm. IEEE Trans. Sys. Man Cybernetics-Part B, 29(3):433–439, 1999.
13. Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. Understanding of inter-
nal clustering validation measures. In Proceedings of the 2010 IEEE International Conference
on Data Mining, ICDM ’10, pages 911–916, 2010.
14. Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng, and Susan J. Brown. FGKA: A fast genetic K-means clustering algorithm. In Proceedings of the 2004 ACM Symposium on Applied Computing, SAC '04, pages 622–623, 2004.
15. Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng, and Susan J. Brown. Performance eval-
uation of some clustering algorithms and validity indices. BMC Bioinformatics, 5(172):1–10,
2004.
16. N. R. Pal and J. C. Bezdek. On cluster validity for the fuzzy c-means model. IEEE Transac-
tions on Fuzzy Systems, 3(3):370–379, 1995.
17. Peter Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster
analysis. J. Comput. Appl. Math., 20(1):53–65, 1987.
18. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison-
Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.