

A K-means based Genetic Algorithm for Data
Clustering

Clara Pizzuti, Nicola Procopio

Abstract A genetic algorithm that exploits the K-means principles for dividing
objects into groups of high similarity is proposed. The method evolves a popula-
tion of chromosomes, each representing a division of the objects into a different
number of clusters. A group-based crossover, enriched with the one-step K-means
operator, and a mutation strategy that reassigns objects to clusters on the basis of
their distance from the clusters computed so far, allow the approach to determine
the best number of groups present in the dataset. The method has been experimented
with four different fitness functions on both synthetic and real-world datasets, for
which the ground-truth division is known, and compared with the K-means method.
Results show that the approach obtains higher values of the evaluation indexes than
those obtained by the K-means method.

1 Introduction

Clustering is an unsupervised data analysis technique whose goal is to categorize
data objects having similar characteristics into clusters. The applicability of this
technique has been recognized in diverse fields such as biology, sociology, image
processing, and information retrieval, and many different methods have been
proposed [18]. Among the numerous existing clustering techniques, the K-means
method [10] is one of the most popular because of its efficiency and efficacy in
grouping data. Plenty of heuristics have been presented to overcome some drawbacks
inherent in this method, such as the need to provide the number of clusters as an
input parameter, or the choice of the initial centroids. In this context, evolutionary
computation based approaches have played a central role because of their capability
of exploring the search space and escaping from local minima during the
optimization process.

National Research Council of Italy (CNR)


Institute for High Performance Computing and Networking (ICAR)
Via P. Bucci 7/11C, 87036 Rende (CS), Italy
e-mail: {clara.pizzuti,nicola.procopio}@icar.cnr.it


Several clustering algorithms based on Genetic Algorithms have been proposed in
recent years. Recent surveys describing the most significant ones can be found
in [8, 11]. Among them, a number of approaches combine the ability of K-means
in partitioning data with the ability of genetic algorithms to perform an adaptive
search and find near-optimal solutions to an optimization problem [12].
In this paper, a genetic algorithm that integrates the local search K-means prin-
ciple into the genetic operators of crossover and mutation is proposed. The method,
named CluGA, evolves a population of chromosomes, each representing a division
of objects into a number of clusters not fixed a priori. Each individual is a string of
length equal to the number of dataset objects. The value of the i-th gene is the
label of the cluster to which the i-th object belongs. Each individual is initialized
with a number of clusters chosen at random in the interval {2, . . . , kmax}, where kmax
is the maximum number of allowed clusters. The crossover operator first obtains the
offspring by adopting a
group-based strategy, then refines the offspring by performing the one-step K-means
operator, introduced in [12]. Moreover, the mutation operator reassigns an object
x belonging to a cluster Ci to another cluster C j if x is closer to C j . The method
has been experimented with four different fitness functions on both synthetic and
real-world datasets, and compared with the K-means method. Results show that the
approach is very competitive with respect to K-means. It is able to find solutions
having the exact number of clusters, without any prior knowledge, with high values
of the evaluation indexes used to assess the performance of the method.
The paper is organized as follows. In the next section the clustering problem is
introduced. In Section 3 an overview of the existing approaches that combine Ge-
netic Algorithms and the K-means principles is reported. In Section 4 the algorithm
CluGA is described in detail. Section 5 presents the results of the approach on real-
world and synthetic datasets. Finally, Section 6 concludes the paper and outlines
future developments.

2 The clustering problem


Let X = {x1, . . . , xn} be a set of n data objects, also called points, where each xj,
j = 1, . . . , n, is a d-dimensional feature vector representing a single data item. A
partition, or clustering, of X is a collection C = {C1, . . . , Ck} of k non-overlapping
subsets of X such that C1 ∪ . . . ∪ Ck = X and Ci ∩ Cj = ∅ for i ≠ j. The centroid of a
cluster Ci is defined as the mean of the objects it contains. Independently of the
clustering approach adopted, the main objective of the clustering problem is to
partition the data objects into a number of clusters such that both within-cluster
similarity and between-cluster heterogeneity are high [10].
The K-means clustering method is a partitional clustering algorithm that groups
a set of objects into k clusters by optimizing a criterion function. The technique
performs three main steps: 1) selection of k objects as cluster centroids, 2) assign-
ment of objects to the closest cluster, 3) updating of centroids on the basis of the
assigned data. Steps 2 and 3 are repeated until no object changes its membership to
a cluster, or the criterion function does not improve for a number of iterations. The
K-means method finds a partition that minimizes the total within-cluster variance,
also known as sum of squared error, defined as
CV_W = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - m_i \|^2    (1)

where mi is the centroid of cluster Ci and ||.|| denotes the Euclidean distance. In the
following a brief overview of evolutionary approaches combined with the K-means
method is reported.
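
As an illustration, formula (1) can be computed directly from a label assignment. The following minimal Python/NumPy sketch assumes the data are stored in an n × d array X and that labels holds the cluster of each object; all names are illustrative and it is not the authors' implementation:

```python
import numpy as np

def within_cluster_variance(X, labels):
    """Total within-cluster variance CV_W (sum of squared errors), as in formula (1)."""
    labels = np.asarray(labels)
    cvw = 0.0
    for c in np.unique(labels):
        members = X[labels == c]                   # objects assigned to cluster C_i
        centroid = members.mean(axis=0)            # m_i: mean of the cluster's objects
        cvw += np.sum((members - centroid) ** 2)   # sum of squared Euclidean distances
    return cvw
```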

3 K-means based Genetic Algorithms


In recent years many methods that integrate Genetic Algorithms and K-means
in several different ways have been presented. One of the first proposals that hy-
bridizes a genetic algorithm with the K-means technique is the GKA method of
Krishna and Murty [12]. This method adopts an integer encoding scheme where
a chromosome is a vector of n elements, with n the number of data objects, and each
gene has a value in the alphabet {1, . . . , k}, with k the fixed number of clusters to find.
At the beginning each gene receives a random number between 1 and k. The authors
define a one-step K-means algorithm, called K-means operator KMO, that, given
an individual, computes the centroids and reassigns each data object to the closest
cluster. This operator is used as crossover in the algorithm to improve convergence.
Moreover, a distance-based mutation that changes an allele value to a cluster num-
ber with probability proportional to the distance from the object to the cluster center
is introduced. The fitness function minimizes the within-cluster variance. Because
of the fixed number of clusters to find, both mutation and KMO can generate empty
clusters, leading to what the authors call illegal string (i.e. chromosome). In these
cases singleton clusters are introduced to re-obtain the fixed number of clusters.
Inspired by GKA, Lu et al. proposed FGKA (Fast GKA) [14] and IGKA (Incre-
mental) [15]. The main difference between FGKA and GKA consists in a different
mutation operator, while IGKA performs an incremental computation of centroids
that makes the method faster. These methods have been experimented on gene ex-
pression data. It is worth pointing out that GKA, FGKA and IGKA could better be
classified as memetic evolutionary methods. In fact, they substitute the crossover
operator with a local search which, in this case, is the one-step K-means operator.
Bandyopadhyay and Maulik [2] presented a method, named KGA, that uses the
principles of K-means. Differently from GKA, a clustering solution is represented
with a vector of k centroids, with k fixed a priori. The authors motivate this choice
by efficiency requirements. The fitness function is defined as the inverse of the sum
of the distances of each object from its centroid. The population is initialized by
randomly selecting k points, then one-point crossover and a mutation operator based
on the fitness value are applied. An extension of the approach that tries to avoid
the problem of the parameter k is presented in [3]. Chromosomes are constituted by
real numbers, representing the centroid coordinates, and a special symbol don’t care
meaning that a gene does not contain a centroid. The value of k is assumed to lie
between a minimum kmin and a maximum kmax value, thus the chromosome length is
kmax − kmin, and the presence of don't care symbols allows a variable number of
clusters. The fitness function of this new algorithm is the Davies-Bouldin index [5].
Though the method works well on artificial data, for the well-known Iris dataset
it merges two of the three clusters, which have overlapping objects.
In the next section a method that exploits the K-means principles inside a genetic
algorithm is presented.

4 The CluGA method


In this section a detailed description of CluGA is given, along with the genetic rep-
resentation and operators adopted.
Encoding. The method adopts the label-based encoding of a clustering solution
where a chromosome consists of a string of length equal to the number n of objects.
Each object is located in a position i of the chromosome and is associated with an
integer in the alphabet {2, . . . , kmax }, where kmax is the maximum number of possi-
ble clusters and each integer is the label of a cluster. It is known that this coding is
redundant [8] because the same partitioning can be represented in (kmax − 1)! differ-
ent ways, thus the search space can be very large. In order to improve the efficiency
of the algorithm, we apply a renumbering procedure [6] after each application of
genetic operators, avoiding, in this way, the presence of different strings represent-
ing the same solution. Figure 1 shows ten objects grouped in k = 3 clusters with the
corresponding representation and the one obtained after renumbering it.

Fig. 1 Encoding of a clustering solution and its renumbering.
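
The renumbering procedure is not spelled out here; one natural way to implement it is to relabel clusters in order of first appearance, so that all equivalent label permutations collapse to a single canonical string. A minimal Python sketch (illustrative names only):

```python
def renumber(chromosome):
    """Canonical relabeling: clusters are renumbered 1, 2, ... in order of first appearance."""
    mapping, renumbered = {}, []
    for label in chromosome:
        if label not in mapping:
            mapping[label] = len(mapping) + 1   # next unused canonical label
        renumbered.append(mapping[label])
    return renumbered

# Two different encodings of the same partition map to the same canonical string:
print(renumber([2, 2, 2, 3, 2, 4, 4, 2, 3, 2]))   # [1, 1, 1, 2, 1, 3, 3, 1, 2, 1]
print(renumber([5, 5, 5, 1, 5, 2, 2, 5, 1, 5]))   # [1, 1, 1, 2, 1, 3, 3, 1, 2, 1]
```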
Position:    1 2 3 4 5 6 7 8 9 10

p1 labels:   1 1 2 3 2 4 4 2 3 1
child1:      2 2 2 3 2 4 4 2 3 2
renumbered:  1 1 1 2 1 3 3 1 2 1

p2 labels:   1 2 2 1 1 1 3 3 3 2
child2:      1 1 1 1 1 1 3 3 3 1
renumbered:  1 1 1 1 1 1 2 2 2 1

Fig. 2 Group-based crossover where the random position h = 2 is selected, thus l1 = 1 and l2 = 2.
The gene values changed to generate the offspring are those of the cluster with label l1 in p1 and
of the cluster with label l2 in p2.

Initialization. The simple strategy of generating k random cluster labels and as-
signing at random an object to a cluster can produce bad initial clusters that could
overlap, thus increasing the convergence times. The initialization process employed
by CluGA uses the approach of K-means++ [1] to select k centers, and then as-
signs each object to the closest center. The K-means++ seeding strategy has been
shown by the authors to improve both accuracy and speed of K-means, thus it can
be beneficial also for CluGA. For each element of the population, a random num-
ber k between 2 and kmax is generated. Then the following steps are performed. 1)
Choose the first cluster center c1 = x uniformly at random from the dataset X. 2)
Choose the next cluster center ci = x′ among the remaining data objects, with
probability proportional to its distance from the closest center already chosen.
3) Repeat step 2) until the k centers have been chosen.
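
A minimal Python sketch of this initialization of one individual is given below. It is not the authors' MATLAB implementation; the names are illustrative, and it uses the squared-distance (D²) sampling of standard K-means++ [1]:

```python
import numpy as np

def init_individual(X, k_max, rng=None):
    """Initialize one chromosome: draw k in {2, ..., k_max}, pick k centers by
    K-means++-style D^2 sampling, then label each object with its closest center."""
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    k = int(rng.integers(2, k_max + 1))          # random number of clusters for this individual
    centers = [X[rng.integers(n)]]               # first center chosen uniformly at random
    while len(centers) < k:
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # proportional to squared distance
    centers = np.array(centers)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(dists, axis=1) + 1          # cluster labels 1, ..., k
```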
Crossover. Standard one-point and two-point crossover have been shown to
present drawbacks when applied to group problems [6]. In fact, it can happen
that the offspring encode solutions too far from those of their parents. A cluster-
oriented crossover operator can avoid these problems. CluGA performs a group-
based crossover followed by the application of the K-means operator KMO. Given
two individuals p1 and p2 of the population, the crossover operator performs the
following steps: 1) Choose a random position h between 1 and n, 2) take the cluster
labels l1 = p1 (h) and l2 = p2 (h), 3) generate the first offspring child1 by substituting
the value of all those genes of p1 having cluster label equal to l1 with l2 , 4) generate
the second offspring child2 by substituting the value of all those genes of p2 having
cluster label equal to l2 with l1 . 5) Apply KMO to child1 and child2 . 6) If any sin-
gleton cluster is present, assign its object to one of the existing clusters at random.
An example of group-based crossover is shown in Figure 2. If the random position
h = 2 is selected, the members {1, 2, 10} of the cluster in p1 with label value 1 are
assigned cluster label 2, while the members {2, 3, 10} of the cluster in p2 with label
value 2 are assigned cluster label 1.
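
A minimal Python sketch of the group-based step and of the one-step K-means operator KMO, under the simplifying assumptions that parents are NumPy label vectors and that renumbering and singleton handling are done separately (illustrative names, not the authors' code):

```python
import numpy as np

def group_crossover(p1, p2, h):
    """Group-based crossover at position h: exchange the two clusters containing object h."""
    l1, l2 = p1[h], p2[h]                 # cluster labels of object h in the two parents
    child1 = np.where(p1 == l1, l2, p1)   # in p1, relabel the whole cluster l1 as l2
    child2 = np.where(p2 == l2, l1, p2)   # in p2, relabel the whole cluster l2 as l1
    return child1, child2

def kmo(X, labels):
    """One-step K-means operator: recompute centroids, reassign every object once."""
    uniq = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in uniq])
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return uniq[np.argmin(dists, axis=1)]

# With h = 1 (position 2, counted from 0) the call below reproduces the offspring of Figure 2:
c1, c2 = group_crossover(np.array([1, 1, 2, 3, 2, 4, 4, 2, 3, 1]),
                         np.array([1, 2, 2, 1, 1, 1, 3, 3, 3, 2]), h=1)
```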
Mutation. The mutation operator performs a local search that reassigns objects to
closer clusters. For each object xi ∈ X, it checks whether the distance from xi to the
cluster Ca it belongs to is higher than the distance from xi to another cluster Cb. In
such a case xi is removed from Ca and added to Cb.
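
A possible sketch of this reassignment, assuming that the distance from an object to a cluster is measured with respect to the cluster centroid (the text does not state this explicitly; names are illustrative):

```python
import numpy as np

def mutate(X, labels):
    """Local-search mutation: move each object to the cluster with the closest centroid,
    whenever that cluster differs from the one it currently belongs to."""
    labels = np.asarray(labels).copy()
    centroids = {c: X[labels == c].mean(axis=0) for c in np.unique(labels)}
    for i, x in enumerate(X):
        dist = {c: float(np.sum((x - m) ** 2)) for c, m in centroids.items()}
        closest = min(dist, key=dist.get)
        if closest != labels[i]:
            labels[i] = closest               # remove x_i from C_a, add it to C_b
    return labels
```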

Fitness Function. The choice of an appropriate fitness function is a key point to


obtain a good solution for the problem to solve. In the literature many indexes have
been defined to evaluate clustering results [7]. These indexes have been mainly used
for cluster evaluation, in order to choose the best clustering method, and to de-
termine the optimal number of clusters, when this information is not known, as it
generally happens for real-world datasets. However, they have also been used by a
number of clustering methods as functions to optimize when partitioning data ob-
jects. Some authors, like Bandyopadhyay and Maulik [2], proposed to minimize the
within-cluster variance, analogously to the K-means method. However, this criterion
is not apt when using a genetic algorithm since, as reported in [13], it is biased
towards a high number of clusters, i.e. the higher the number of clusters k found by a
method, the lower the within-cluster variance value. Thus a genetic algorithm that
optimizes this index has the tendency to prefer solutions with many clusters. The
main motivation of this behavior is that CVW takes into account only the closeness
of points to centroids. However, as pointed out by Tan et al. [18], to measure the
goodness of a clustering structure, the concepts of cluster cohesion and separation
must both be considered. The former suggests that clusters should be compact, that
is, objects in the same cluster must stay close. The latter defines how well clusters
are separated. While compactness is generally measured through data variance (the
lower the variance, the better the cohesion), separation relies on the distances
between cluster centroids or between objects in different clusters.
Actually, minimizing CVW corresponds to optimizing compactness. In order to
take into account both compactness and separation, we consider three popular
criteria as fitness functions: the Calinski-Harabasz [4], the Silhouette [17] and the
Davies-Bouldin [5] indices. We show, in the experimental results section, that CluGA,
endowed with one of these indices, is capable of obtaining clusterings very similar
to the ground-truth ones. The optimization of CVW, instead, splits clusters into many
small groups. In the following we give the definitions of these criteria.
The Calinski-Harabasz criterion, also called the variance ratio criterion (VRC),
is defined as

VRC = \frac{CV_B}{CV_W} \times \frac{n - k}{k - 1}    (2)
where CVB is the total between-cluster variance, CVW is the overall within-cluster
variance, as defined by formula (1), k is the number of clusters, and n is the number
of objects. CVB is defined as:
CV_B = \sum_{i=1}^{k} n_i \| m_i - m \|^2    (3)

where mi is the centroid of cluster i, ni is the size of cluster i, m is the overall mean
of the dataset, and ||.|| denotes the Euclidean distance.
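
As an illustration, and assuming the within_cluster_variance helper sketched for formula (1), the criterion could be computed in Python as follows (an illustrative sketch, not the authors' code):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Variance ratio criterion VRC = (CV_B / CV_W) * (n - k) / (k - 1), formulas (2)-(3)."""
    labels = np.asarray(labels)
    n, k = X.shape[0], len(np.unique(labels))
    m = X.mean(axis=0)                                 # overall mean of the dataset
    cvb = sum(np.sum(labels == c) * np.sum((X[labels == c].mean(axis=0) - m) ** 2)
              for c in np.unique(labels))              # between-cluster variance CV_B
    cvw = within_cluster_variance(X, labels)           # CV_W as in formula (1)
    return (cvb / cvw) * (n - k) / (k - 1)
```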
The Silhouette index, for each object, measures how similar that object is to ele-
ments in its own cluster, when compared to objects in other clusters. The silhouette
criterion is defined as:

Input: a set X = {x1, . . . , xn} of n data objects, crossover probability pc, mutation probability pm
Output: a partition C = {C1, . . . , Ck} of X in k groups
Method: perform the following steps:
1   Create an initial population of individuals by applying the K-means++ approach
2   Evaluate all the individuals
3   while the termination condition is not satisfied do
4       for each individual I = {g1, . . . , gn} in the population
5           Perform the group-based crossover and the KMO operator with probability pc, renumber the individual
6           Perform the local-search based mutation with probability pm, renumber the individual
7           Evaluate the fitness function
8       end for
9       Select the solutions with the best fitness values
10  end while
11  return the individual having the best fitness value

Fig. 3 The pseudo-code of the CluGA algorithm.

S = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{n_i} \sum_{x \in C_i} \frac{b(x) - a(x)}{\max(a(x), b(x))}    (4)

where a(x) is the average distance from x to the other objects in the same cluster as
x, and b(x) is the minimum average distance from x to points in a different cluster,
minimized over all clusters.
The silhouette value ranges from -1 to +1. A high silhouette value indicates that
objects are well-matched in their own cluster, and poorly-matched to neighboring
clusters.
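
A direct, unoptimized Python sketch of formula (4) is shown below (illustrative only; a library routine such as scikit-learn's silhouette_score averages over objects rather than over clusters, so it is not exactly the same quantity):

```python
import numpy as np

def silhouette(X, labels):
    """Silhouette criterion of formula (4): per-cluster mean of (b(x)-a(x))/max(a(x),b(x)),
    averaged over the k clusters."""
    labels = np.asarray(labels)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))    # pairwise distances
    uniq = np.unique(labels)
    total = 0.0
    for c in uniq:
        idx = np.where(labels == c)[0]
        s_c = 0.0
        for i in idx:
            others = idx[idx != i]
            a = D[i, others].mean() if len(others) > 0 else 0.0        # a(x): same-cluster distance
            b = min(D[i, labels == o].mean() for o in uniq if o != c)  # b(x): closest other cluster
            s_c += (b - a) / max(a, b)
        total += s_c / len(idx)
    return total / len(uniq)
```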
The Davies-Bouldin criterion is based on a ratio of within-cluster and between-
cluster distances. It is defined as:

DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left\{ \frac{d_i + d_j}{d_{i,j}} \right\}    (5)

where di and dj are the average distances of the objects in the i-th and j-th cluster,
respectively, from the centroid of their own cluster, and di,j is the Euclidean
distance between the centroids of the i-th and j-th clusters.
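
A corresponding Python sketch of formula (5), under the same illustrative assumptions as the previous snippets:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index of formula (5); lower values indicate better clusterings."""
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in uniq])
    # d_i: average distance of the objects of cluster i from their own centroid
    d = np.array([np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
                  for i, c in enumerate(uniq)])
    db = 0.0
    for i in range(len(uniq)):
        db += max((d[i] + d[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(uniq)) if j != i)
    return db / len(uniq)
```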
The pseudo-code of the algorithm is shown in Figure 3. In the next section the
CluGA method is executed on synthetically generated and real-world datasets, and
the clustering structures obtained by employing as fitness functions the within-cluster
variance, Calinski-Harabasz, Silhouette, and Davies-Bouldin criteria are compared.

5 Experimental Results

This section provides a thorough experimentation for assessing the capability of
CluGA in partitioning artificial and real-world datasets.

Table 1 Comparison between CluGA and the K-means algorithm on synthetic datasets. R: Recall,
P: Precision, F: F-measure. For each dataset, columns CVW, S, VRC and DB refer to CluGA run with
the corresponding fitness function; the last column refers to K-means. The best values, excluding
the CVW index, are highlighted in bold. A dash denotes a value not reported.

Syn3                  CVW        S          VRC        DB         K-means
ARI   mean            0.2110     0.9994     0.9994     0.9988     0.9994
      best            0.2257     1          1          1          -
      st. dev.        (0.0088)   (0.0019)   (0.0019)   (0.0025)   (0.0019)
R     mean            0.9996     0.9989     0.9989     0.9978     0.9989
      best            1          1          1          1          -
      st. dev.        (0.0013)   (0.0035)   (0.0035)   (0.0047)   (0.0035)
P     mean            0.3706     0.9989     0.9989     0.9978     0.9989
      best            0.3786     1          1          1          -
      st. dev.        (0.0038)   (0.0035)   (0.0035)   (0.0047)   (0.0035)
F     mean            0.5407     0.9989     0.9989     0.9978     0.9989
      best            0.5492     1          1          1          -
      st. dev.        (0.0041)   (0.0035)   (0.0035)   (0.0047)   (0.0035)
CVW   mean            13.2836    67.2093    67.2093    67.2291    67.2093
      best            11.5204    63.9562    63.9562    63.9562    -
      st. dev.        (0.9133)   (2.4072)   (2.4072)   (2.4305)   (2.4071)
k     mean            22.9000    3.0000     3.0000     3.0000     3
      best            22         3          3          3          -
      st. dev.        (0.3162)   (0)        (0)        (0)        -

Syn4                  CVW        S          VRC        DB         K-means
ARI   mean            0.2612     0.8891     0.8654     0.8743     0.8972
      best            0.2791     0.9161     0.9161     0.9174     -
      st. dev.        (0.0118)   (0.0299)   (0.0468)   (0.0435)   (0.0201)
R     mean            0.8937     0.8591     0.8519     0.8442     0.8630
      best            0.9169     0.8832     0.8832     0.8855     -
      st. dev.        (0.0141)   (0.0264)   (0.0247)   (0.0395)   (0.0233)
P     mean            0.4126     0.8519     0.8320     0.8431     0.8630
      best            0.4305     0.8833     0.8833     0.8859     -
      st. dev.        (0.0086)   (0.0381)   (0.0497)   (0.0413)   (0.0230)
F     mean            0.5646     0.8554     0.8413     0.8436     0.8630
      best            0.5859     0.8832     0.8832     0.8857     -
      st. dev.        (0.0106)   (0.0310)   (0.0341)   (0.0403)   (0.0232)
CVW   mean            24.8088    102.1898   102.4346   106.6969   103.0274
      best            23.4774    90.8306    91.4968    100.0645   -
      st. dev.        (0.9599)   (4.5316)   (6.3204)   (7.2930)   (2.3689)
k     mean            22.8000    4.1000     4.2000     4.1000     4
      best            22         4          4          4          -
      st. dev.        (0.4216)   (0.3162)   (0.4216)   (0.3162)   -

Syn6                  CVW        S          VRC        DB         K-means
ARI   mean            0.2100     0.2907     0.3428     0.2871     0.3650
      best            0.2423     0.3625     0.4277     0.3387     -
      st. dev.        (0.0142)   (0.0484)   (0.0519)   (0.0277)   (0.0457)
R     mean            0.5413     0.4645     0.4003     0.4907     0.4407
      best            0.5591     0.5504     0.5446     0.5308     -
      st. dev.        (0.0163)   (0.0926)   (0.0712)   (0.0724)   (0.0358)
P     mean            0.3289     0.4117     0.4778     0.3703     0.4480
      best            0.3484     0.6020     0.5366     0.4423     -
      st. dev.        (0.0129)   (0.1004)   (0.0510)   (0.0303)   (0.0364)
F     mean            0.4092     0.4186     0.4284     0.4170     0.4443
      best            0.4281     0.4583     0.4710     0.4509     -
      st. dev.        (0.0144)   (0.0254)   (0.0251)   (0.0280)   (0.0360)
CVW   mean            12.6606    36.1794    58.3065    24.4930    44.9482
      best            11.2092    13.0521    14.6163    15.3501    -
      st. dev.        (0.9003)   (32.3026)  (21.0439)  (17.2006)  (3.5852)
k     mean            22.7000    14.1000    5.8000     17.3000    6
      best            21         3          6          5          -
      st. dev.        (0.6749)   (7.8095)   (5.0947)   (4.7152)   -

Table 2 Comparison between CluGA and the K-means algorithm on the IRIS and CANCER datasets.
R: Recall, P: Precision, F: F-measure. For each dataset, columns CVW, S, VRC and DB refer to CluGA
run with the corresponding fitness function; the last column refers to K-means. The best values,
excluding the CVW index, are highlighted in bold. A dash denotes a value not reported.

IRIS                  CVW        S          VRC        DB         K-means
ARI   mean            0.3238     0.5418     0.7223     0.5681     0.7005
      best            0.3417     0.5583     0.7455     0.5681     -
      st. dev.        (0.0128)   (0.0058)   (0.0148)   (0.0000)   (0.0941)
R     mean            0.9237     0.5255     0.7457     0.5794     0.7298
      best            0.9423     0.5537     0.7869     0.5793     -
      st. dev.        (0.0148)   (0.0099)   (0.0190)   (0.0000)   (0.0686)
P     mean            0.4212     0.8901     0.7589     1.0000     0.7542
      best            0.4353     0.9490     0.8093     1          -
      st. dev.        (0.0094)   (0.0207)   (0.0229)   (0.0000)   (0.0341)
F     mean            0.5785     0.6608     0.7522     0.7337     0.7413
      best            0.5955     0.6994     0.7979     0.7336     -
      st. dev.        (0.0115)   (0.0136)   (0.0209)   (0.0000)   (0.0533)
CVW   mean            21.6335    152.4457   79.0029    154.9470   85.2417
      best            21.23349   152.348    78.85144   154.947    -
      st. dev.        (0.3481)   (0.3092)   (0.2705)   (0.0000)   (20.2076)
k     mean            13.0       2          3          2          3
      best            13         2          3          2          -
      st. dev.        (0.0000)   (0.0000)   (0.0000)   (0.0000)   -

CANCER                CVW        S          VRC        DB         K-means
ARI   mean            0.2583     0.7825     0.8364     0.8226     0.8337
      best            0.3053     0.8337     0.8551     0.8337     -
      st. dev.        (0.043)    (0.0478)   (0.0102)   (0.0171)   (0)
R     mean            0.8719     0.6665     0.7289     0.7115     0.7252
      best            0.8811     0.7252     0.7543     0.7252     -
      st. dev.        (0.0066)   (0.0482)   (0.0135)   (0.0211)   (0)
P     mean            0.2294     0.6914     0.7373     0.7227     0.7339
      best            0.2390     0.7339     0.7586     0.7339     -
      st. dev.        (0.0088)   (0.0335)   (0.011)    (0.0175)   (0)
F     mean            0.3631     0.6786     0.7331     0.717      0.7295
      best            0.3746     0.7295     0.7563     0.7295     -
      st. dev.        (0.011)    (0.0414)   (0.0122)   (0.0192)   (0)
CVW   mean            72.2422    200.1433   196.9676   197.4081   196.9366
      best            70.8324    196.9366   196.9366   196.9366   -
      st. dev.        (1.0418)   (6.6177)   (0.0767)   (0.9443)   (0)
k     mean            26.6       2          2          2          2
      best            25         2          2          2          -
      st. dev.        (0.6992)   (0)        (0)        (0)        -

Table 3 Comparison between CluGA and the K-means algorithm on the GLASS and ECOLI datasets.
R: Recall, P: Precision, F: F-measure. For each dataset, columns CVW, S, VRC and DB refer to CluGA
run with the corresponding fitness function; the last column refers to K-means. The best values,
excluding the CVW index, are highlighted in bold. A dash denotes a value not reported.

GLASS                 CVW        S          VRC        DB         K-means
ARI   mean            0.2196     0.2062     0.5803     0.2087     0.5195
      best            0.2446     0.6762     0.6064     0.6086     -
      st. dev.        (0.0309)   (0.2945)   (0.0316)   (0.2706)   (0.1411)
R     mean            0.8405     0.1932     0.4381     0.1682     0.3575
      best            0.8568     0.7039     0.5358     0.5336     -
      st. dev.        (0.0146)   (0.2912)   (0.0726)   (0.2316)   (0.1009)
P     mean            0.2321     0.2224     0.4054     0.2313     0.3900
      best            0.2463     0.4094     0.4310     0.3933     -
      st. dev.        (0.0124)   (0.1653)   (0.0253)   (0.1382)   (0.0277)
F     mean            0.3636     0.1628     0.4192     0.1578     0.3660
      best            0.3809     0.5177     0.4642     0.4528     -
      st. dev.        (0.0161)   (0.2218)   (0.0412)   (0.1909)   (0.0873)
CVW   mean            178.5231   932.6661   752.9121   929.1354   860.4198
      best            173.3125   358.5322   589.0314   495.5036   -
      st. dev.        (4.6974)   (361.2018) (109.3977) (324.6504) (126.0378)
k     mean            13.9000    3.0000     2.3000     2.8000     2
      best            13         2          2          2          -
      st. dev.        (0.3162)   (1.7638)   (0.4830)   (0.9189)   -

ECOLI                 CVW        S          VRC        DB         K-means
ARI   mean            0.257      0.6189     0.573      0.6805     0.3876
      best            0.2785     0.6925     0.7065     0.7277     -
      st. dev.        (0.0128)   (0.1222)   (0.1397)   (0.0591)   (0.1599)
R     mean            0.7654     0.5005     0.4573     0.6127     0.6021
      best            0.7867     0.578      0.5667     0.6877     -
      st. dev.        (0.0104)   (0.0995)   (0.1255)   (0.0717)   (0.2547)
P     mean            0.4277     0.7684     0.7271     0.7215     0.5537
      best            0.4421     0.7859     0.812      0.7672     -
      st. dev.        (0.008)    (0.0183)   (0.0826)   (0.0631)   (0.0242)
F     mean            0.5487     0.6007     0.5558     0.6598     0.6158
      best            0.5661     0.6544     0.6675     0.6993     -
      st. dev.        (0.0085)   (0.0836)   (0.1191)   (0.0562)   (0.0235)
CVW   mean            8.7751     25.3293    26.9246    19.2687    14.7483
      best            8.5466     21.1775    23.261     13.5761    -
      st. dev.        (0.2116)   (4.7125)   (5.6102)   (3.1077)   (0.6440)
k     mean            18.8       2.9        2.7        8.1        8
      best            18         4          3          8          -
      st. dev.        (0.4216)   (0.5676)   (0.483)    (2.7669)   (0)

Fig. 4 Synthetic datasets with 500 objects: (a) k = 3, (b) k = 4, (c) k = 6.

The CluGA algorithm has been written in MATLAB 8.6 R2015b, using the Genetic
Algorithm Solver of the Global Optimization Toolbox. A trial-and-error procedure
has been adopted for fixing the parameter values: the crossover rate pc has been fixed
to 0.8, the mutation rate pm to 0.2, the population size to 100, the elite reproduction
to 10% of the population size, and the number of generations to 50. In order to
evaluate the method, since the ground-
truth clusterings are known, we computed the well-known measures Adjusted Rand
Index [9], Precision, Recall, and F-measure [18], adopted in the literature to assess
the capability of an algorithm in finding partitions similar to the true data division.
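
For instance, the Adjusted Rand Index between a computed partition and the ground truth can be obtained with scikit-learn; the label values themselves do not matter, only the induced grouping (the data below are illustrative):

```python
from sklearn.metrics import adjusted_rand_score

ground_truth = [0, 0, 0, 1, 1, 2, 2, 2]              # true cluster of each object
predicted    = [1, 1, 1, 3, 3, 2, 2, 1]              # labels found by a clustering method
print(adjusted_rand_score(ground_truth, predicted))  # 1.0 only for identical partitions
```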
We first present the results obtained by CluGA on randomly generated synthetic
datasets. We consider three datasets of 500 objects each. The first one, named Syn3,
contains three distinct and well-separated Gaussian clusters, with standard deviations
of the three clusters equal to (0.2, 0.2, 0.35) (Figure 4(a)). In the second one, named
Syn4, there are four clusters close to each other, with standard deviations equal to
(0.2, 0.35, 0.45, 0.3) (Figure 4(b)). The third one, named Syn6, consists of six mixed
clusters with standard deviations equal to (0.2, 0.2, 0.35, 0.45, 0.1, 0.1) (Figure 4(c)).
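
The paper does not specify the generator used; a dataset with comparable characteristics can be produced, for example, with scikit-learn (illustrative parameters):

```python
from sklearn.datasets import make_blobs

# 500 two-dimensional points in 6 Gaussian clusters, one standard deviation per cluster (as for Syn6)
X, y = make_blobs(n_samples=500, centers=6,
                  cluster_std=[0.2, 0.2, 0.35, 0.45, 0.1, 0.1], random_state=42)
```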
Table 1 shows the execution of the method on the synthetic data sets for the four
fitness functions, along with the results of the K-means algorithm. Each method has
been executed 10 times and average and best values of the ten runs, along with stan-
dard deviation, are reported. Notice that K-means has been executed with input
parameter k equal to the true number of clusters, while CluGA has been executed by
fixing the maximum number of clusters to √n, which, in the literature, is considered
a rule of thumb [16]. The table highlights the very good capabilities of CluGA in
partitioning the datasets into a number of clusters very close to the ground truth,
even if the algorithm does not know the number of clusters to find, when using as
fitness functions VRC (formula (2)), S (formula (4)), and DB (formula (5)). In fact,
on Syn3 it reaches the same results as the K-means method. On Syn4 and Syn6,
though the average values of the measures are slightly lower, the best values, out of
the ten runs, are always better. As regards the within-cluster variance, the table makes
evident its selection bias towards a high number of clusters, which confirms the
experimentation of Liu et al. [13]. With this fitness function the method obtains, for
the three datasets, an average of 22, 22, and 21 clusters and average CVW values of
13.2, 24.8, and 12.6, respectively.

Tables 2-3 compare CluGA and K-means on four well-known real-world datasets
of the UCI Machine Learning Repository. The Iris dataset consists of 150 samples
of Iris flowers, described by four features and categorized into three species. The
Breast Cancer Wisconsin dataset contains 683 objects with 9 features, divided into
two classes. The Glass dataset contains 214 glass instances with 9 features, divided
into 6 classes. The Ecoli dataset consists of 336 instances, described by seven
attributes and grouped with respect to 8 protein localization sites. The tables clearly
point out that the
optimization of the Calinski-Harabasz criterion obtains the best evaluation measure
values on all the datasets, except for Ecoli. This fitness function allows the algo-
rithm to outperform the K-means method on all the datasets with higher ARI value
and lower within-cluster variance. For instance, on the Iris dataset CluGA finds
the three clusters with the mean Adjusted Rand Index value equal to 0.7223, the
best ARI equal to 0.7455, and within-cluster variance 79.0029, while K-means
obtains ARI=0.7005 and CVW=85.2417. For the Cancer dataset the mean and best
ARI values of CluGA are 0.8364 and 0.8551, respectively, while for K-means
ARI=0.8337. Analogously, for the Glass dataset CluGA obtains ARI=0.5803 and
CVW=752.9121, while K-means finds ARI=0.5195 and CVW=860.4198. As regards
the Ecoli dataset, the Calinski-Harabasz and the Silhouette indexes merge some
clusters, while the Davies-Bouldin criterion finds all the 8 clusters in almost all the
executions. However, it must be noted that this dataset contains two clusters with
only two objects, and one cluster with 5 objects, which are not easy to find. In fact,
K-means finds 8 clusters with a precision value around 0.5, which is rather low,
while that of the three indexes is above 0.7. Also for these real-world datasets it can
be observed that the CVW criterion prefers a number of clusters much higher than
in the ground truth.

6 Conclusions
The paper proposed a genetic algorithm able to divide a dataset into a number of
groups not known in advance. The method employs a label-based representation, and
exploits the K-means strategy to improve the offspring generated by the group-based
crossover. Though the idea of combining genetic algorithms and K-means is not
new, CluGA differs significantly from the existing proposals. With respect to GKA [12],
FGKA [14] and IGKA [15], it applies the one-step K-means operator after having
actually performed the group-based crossover. The above methods, instead, substitute
the crossover with the K-means operator. Moreover, CluGA does not need to fix the
number of clusters. Each chromosome, in fact, is initialized with a number of clusters
chosen at random in the interval {2, . . . , kmax}, where kmax could also be n. As regards
the KGA method [2], CluGA has a different representation and completely different
operators. Experiments on synthetic and real-world datasets using four different
fitness functions show that when the three popular evaluation criteria of
Calinski-Harabasz, Silhouette and Davies-Bouldin are employed as fitness functions,
the method obtains very good solutions and outperforms the K-means method. The
within-cluster variance, instead, though used by many authors, is not apt for genetic
algorithms since its optimization generates a bias towards a high number of clusters.

The experiments also pointed out that the Calinski-Harabasz criterion obtains the
best results for all the datasets, except for Ecoli, where the 8 clusters are merged into
three. Future work will evaluate the method on datasets coming from real-world
applications.
Acknowledgment: This work has been partially supported by MIUR D.D. n
0001542, under the project BA2KNOW − PON03PE 00001 1.

References

1. David Arthur and Sergei Vassilvitskii. K-means++: The advantages of careful seeding. In
Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA
’07, pages 1027–1035, 2007.
2. Sanghamitra Bandyopadhyay and Ujjwal Maulik. An evolutionary technique based on k-
means algorithm for optimal clustering in R^N. Information Sciences, 146(1-4):221–237, 2002.
3. Sanghamitra Bandyopadhyay and Ujjwal Maulik. Genetic clustering for automatic evolution
of clusters and application to image classification. Pattern Rec., 35:1197 – 1208, 2004.
4. T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in
Statistics, 3(1):1–27, 1974.
5. D. Davies and D. Bouldin. A cluster separation measure. IEEE Trans. Pattern Analysis and
Machine Intelligence, 1(2):224–227, 1979.
6. Emanuel Falkenauer. Genetic Algorithms and Grouping Problems. John Wiley & Sons, Inc.,
New York, NY, USA, 1998.
7. Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. On clustering validation tech-
niques. J. Intell. Inf. Syst., 17(2-3):107–145, December 2001.
8. Eduardo Raul Hruschka, Ricardo J. G. B. Campello, Alex A. Freitas, and André C. Ponce
Leon F. De Carvalho. A survey of evolutionary algorithms for clustering. IEEE Trans. Sys.
Man Cybernetics-Part C, 39(2):133–155, March 2009.
9. Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2:193–
218, 1985.
10. Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc.,
Upper Saddle River, NJ, USA, 1988.
11. Adán José-Garcı́a and Wilfrido Gómez-Flores. Automatic clustering using nature-inspired
metaheuristics: A survey. Applied Soft Computing, 41:192–213, 2016.
12. K. Krishna and M. N. Murty. Genetic k-means algorithm. IEEE Trans. Sys. Man Cybernetics-
Part B, 29(3):433–439, 1999.
13. Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, and Junjie Wu. Understanding of inter-
nal clustering validation measures. In Proceedings of the 2010 IEEE International Conference
on Data Mining, ICDM ’10, pages 911–916, 2010.
14. Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng, and Susan J. Brown. Fgka: A fast ge-
netic k-means clustering algorithm. In Proceedings of the 2004 ACM Symposium on Applied
Computing, SAC ’04, pages 622–623, 2004.
15. Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng, and Susan J. Brown. Performance eval-
uation of some clustering algorithms and validity indices. BMC Bioinformatics, 5(172):1–10,
2004.
16. N. R. Pal and J. C. Bezdek. On cluster validity for the fuzzy c-means model. IEEE Transac-
tions on Fuzzy Systems, 3(3):370–379, 1995.
17. Peter Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster
analysis. J. Comput. Appl. Math., 20(1):53–65, 1987.
18. Pang-Ning Tan, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Addison-
Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005.
