
J Intell Inf Syst (2012) 38:321–341 DOI 10.1007/s10844-011-0158-3

Data clustering using bacterial foraging optimization

Miao Wan · Lixiang Li · Jinghua Xiao · Cong Wang · Yixian Yang

Received: 10 May 2010 / Revised: 16 March 2011 / Accepted: 17 March 2011 / Published online: 9 April 2011 © Springer Science+Business Media, LLC 2011

Abstract Clustering divides data into meaningful or useful groups (clusters) without any prior knowledge. It is a key technique in data mining and has become an important issue in many fields. This article presents a new clustering algorithm based on a mechanism analysis of Bacterial Foraging (BF). It is an optimization methodology for the clustering problem in which a group of bacteria forage and converge to certain positions, taken as the final cluster centers, by minimizing a fitness function. The quality of this approach is evaluated on several well-known benchmark data sets. Experimental results comparing it with the popular k-means algorithm, an ACO-based algorithm and a PSO-based clustering technique show that the proposed algorithm is an effective clustering technique and can be used to handle data sets with various cluster sizes, densities and multiple dimensions.

Keywords Data mining · Data clustering · Bacterial foraging optimization · Optimization based clustering

1 Introduction

Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters) (Jain et al. 1999). In the past fifty years, many

M. Wan (B) · L. Li · C. Wang · Y. Yang

Information Security Center, State Key Laboratory of Networking and Switching Technology,

Beijing University of Posts and Telecommunications, P.O. Box 145, Beijing 100876, China e-mail: wanmiao120@163.com

M. Wan · L. Li · C. Wang · Y. Yang

Key Laboratory of Network and Information Attack & Defence Technology of MOE,

Beijing University of Posts and Telecommunications, Beijing 100876, China

J. Xiao School of Science, Beijing University of Posts and Telecommunications, Beijing 100876, China



efforts have been devoted to the problem of clustering, from both the theoretical and the practical point of view. The problem has been addressed in diverse areas such as pattern recognition, data analysis, image processing, economic science (especially market research) and biology, so the study of new clustering algorithms is an important issue in research fields including data mining, machine learning, statistics, and biology.

In recent years, different clustering algorithms have been proposed, such as

partitioning (MacQueen 1967; Ng and Han 1994), hierarchical (Guha et al. 1998), density-based (Hinneburg and Keim 1998), grid-based (Sheikholeslami et al. 1998) and model-based (Dempster et al. 1977) methods. The partitioning approach constructs different partitions based on some criterion. For hard partitional clustering, each pattern belongs to one and only one cluster. Fuzzy clustering (Bezdek 1981; Zhang and Leung 2004) extends this notion so that each pattern may belong to all clusters with a degree of membership. Apart from the above techniques, kernel k-means and spectral clustering have both been used to identify clusters that are non-linearly separable in input space (Dhillon et al. 2005, 2007; Filippone et al. 2008).

The k-means algorithm (MacQueen 1967) is the most popular approach because of its simplicity, efficiency and low computational cost. However, since criterion functions for clustering are usually non-convex and nonlinear, traditional approaches, especially the standard k-means algorithm, are sensitive to initialization and easily trapped in local optima. As the size and dimensionality of data sets increase, finding solutions to the criterion functions becomes an NP-hard problem. Some variants of the standard k-means method provide a fast, local search strategy to address this problem (Arthur and Vassilvitskii 2007; Kanungo et al. 2004). Because of the importance of clustering strategies in many fields, global optimization methods (Hruschka et al. 2006; Shelokar et al. 2004; van der Merwe and Engelbrecht 2003; Li et al. 2006), such as genetic algorithms (GA), ant colony optimization (ACO) and particle swarm optimization (PSO), have been applied to clustering problems (Hruschka et al. 2006; Handl et al. 2006; Shelokar et al. 2004; van der Merwe and Engelbrecht 2003; Wan et al. 2010). When solving clustering problems, these algorithms start from an initial population or position and explore the solution space through a number of iterations to reach a near-optimal solution.

The behaviors of social insects, such as finding the best food source, building optimal nest structures, brooding, protecting the larvae and guarding, show intelligence at the swarm level (Englebrecht 2002). Foraging is one such behavior and can be modelled as an optimization process in which an animal seeks to maximize the energy intake per unit time spent foraging. This view led Passino to develop a new optimization algorithm inspired by the social foraging behavior of Escherichia coli (E. coli) bacteria, named Bacterial Foraging (BF) (Passino 2002). This optimization algorithm is gaining importance in optimization problems and has been successfully applied to engineering problems such as optimal controller design (Passino 2002; Kim et al. 2007), antenna array systems (Guney and Basbug 2008), active power filter synthesis (Mishra and Bhende 2007), and the learning of artificial neural networks (Kim and Cho 2005). Mathematical modelling, modification and adaptation of the algorithm are likely to be a major part of future research on BF. Since data clustering can be seen as a process of function optimization, BF may be applied to solve clustering problems with its global search capability.



In this paper we propose a new clustering algorithm (called BF-C) for grouping data by exploiting the optimization property of bacterial foraging behavior. Instead of a high-speed local search, BF-C is a global optimization-based algorithm which provides a new point of view for solving the NP-hard clustering problem. Meanwhile, it is a brand-new application of Bacterial Foraging. In our algorithm, no centroid or center needs to be selected in the initialization step. Moreover, in order to overcome the drawbacks of traditional algorithms, the proposed algorithm tries to achieve a tripartite objective: (a) find a high-quality approximation to the optimal clustering solution; (b) perform well on high-dimensional data; (c) be insensitive to clusters with different sizes and densities.

The rest of this paper is organized as follows. Section 2 gives the background of optimization-based clustering and the BF algorithm. Section 3 describes the whole process of the proposed BF-C algorithm in detail. In Section 4 we give a brief introduction to three other clustering algorithms used for comparison and present four measures for evaluating algorithm performance. Section 5 presents the experimental comparisons and discusses the results. Finally, conclusions and future work are given in Section 6.

2 Background

2.1 Optimization based clustering

Clustering is a data mining technique which classifies objects into groups (clusters) without any prior knowledge. The common clustering problem can be formally stated as follows. Given a sample data set X = {x_1, x_2, ..., x_n}, determine a partition of the objects into K clusters C_1, C_2, ..., C_K which satisfies:

$$\begin{cases} \bigcup_{i=1}^{K} C_i = X; \\ C_i \cap C_j = \emptyset, & i, j = 1, 2, \ldots, K,\; i \neq j; \\ C_i \neq \emptyset, & i = 1, 2, \ldots, K. \end{cases} \quad (1)$$

From the viewpoint of mathematics, cluster C_i can be determined by:

$$C_i = \left\{ x_j \;\middle|\; \|x_j - z_i\| \le \|x_j - z_p\|,\; p \neq i,\; p = 1, 2, \ldots, K,\; i = 1, 2, \ldots, K,\; x_j \in X \right\}, \qquad z_i = \frac{1}{|C_i|} \sum_{x_j \in C_i} x_j, \quad (2)$$

where \(\|\cdot\|\) denotes the distance between any two data points in the sample set, and z_i is the center of cluster C_i, represented by the average (mean) of all the points in the cluster.

A clustering criterion must be adopted. The most commonly used criterion in clustering tasks is the Sum of Squared Error (SSE) (Tan et al. 2006):

$$SSE = \sum_{i=1}^{K} \sum_{x_j \in C_i} \|x_j - z_i\|^2. \quad (3)$$



For each data point in the given set, the error is the distance to the nearest cluster center. The general objective of clustering is to obtain the partition which, for a fixed number of clusters, minimizes the squared error. Thus, the clustering problem is converted into a process of searching for K centers z_1, z_2, ..., z_K that minimize the sum of distances between every sample point x_i and its closest center. This can be considered as a function optimization problem with SSE as the objective function.
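As a small illustration of this formulation (not part of the original paper), the following Python sketch evaluates the SSE objective of (3) for a candidate set of centers; the function name and the toy data are our own choices.

```python
import numpy as np

def sse(X, centers):
    """Sum of Squared Error (3): each point contributes the squared
    Euclidean distance to its nearest center."""
    # Pairwise squared distances, shape (n, K).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

# Toy usage: six 2-dimensional points and two candidate centers.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centers = np.array([[0.1, 0.1], [5.0, 5.0]])
print(sse(X, centers))  # small value: both clusters are tight around the centers
```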

2.2 The bacterial foraging (BF) algorithm

The BF algorithm (Passino 2002) is a new stochastic global search technique based on the foraging behavior of E. coli bacteria present in the human intestine. Ideas from bacterial foraging can be used to solve non-gradient optimization problems through three processes, namely chemotaxis, reproduction, and elimination and dispersal. Generally, as a group, the E. coli bacteria try to find food and avoid harmful phenomena during foraging and, after a certain time period, recover and return to some standard behavior in a homogeneous medium. An E. coli bacterium can move in two different ways, tumbling and swimming, and it alternates between these two modes of operation throughout its entire lifetime. This alternation between the two modes, called chemotactic steps, moves the bacterium, but in random directions, and this enables it to "search" for nutrients. After the bacterium has collected a given amount of nutrients, it can self-reproduce and divide into two. The bacteria population can also change (e.g., be killed or dispersed) by the local environment.

A BF optimization algorithm can be explained as follows:

Given a D-dimensional search space ℝ^D, try to find the minimum of an objective function J(θ), θ ∈ ℝ^D, where we have neither measurements nor an analytical description of the gradient ∇J(θ). Here, we use ideas from bacterial foraging to solve this non-gradient optimization problem. Let {θ^i(j) | i = 1, 2, ..., S} represent the positions of the members of the population of S bacteria at the jth chemotactic step. Choose C(i) > 0 (i = 1, 2, ..., S) to denote a basic chemotactic step size taken in the random direction specified by the tumble. To represent a tumble, a unit-length random direction, say φ(j), is generated; this direction is used in the swim phase that follows the tumble. Therefore, the position of bacterium i in one step is updated as:

$$\theta^i(j+1) = \theta^i(j) + C(i)\,\varphi(j). \quad (4)$$

If J(θ^i(j+1)) < J(θ^i(j)), another step in this same direction is taken. This swimming iteration continues as long as it keeps reducing the objective function, but only up to a maximum number of steps, N_s. After N_c chemotactic steps, a reproduction step is taken: the S_r (half of the population) healthiest bacteria each split into two bacteria, which are placed at the same location. Finally, each bacterium in the population is subjected to an elimination–dispersal process with probability p_ed.
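To make the update rule (4) concrete, here is a minimal Python sketch (ours, not Passino's implementation) of a single tumble-and-swim chemotactic step; the helper name and the toy objective are assumptions chosen for illustration.

```python
import numpy as np

def chemotactic_step(theta, J, C, Ns, rng=np.random.default_rng()):
    """One tumble followed by up to Ns swim steps along the same direction,
    as in update rule (4); a sketch of the BF chemotaxis idea."""
    delta = rng.uniform(-1.0, 1.0, size=theta.shape)  # random tumble vector
    phi = delta / np.sqrt(delta @ delta)              # unit-length direction
    J_last = J(theta)
    theta = theta + C * phi                           # tumble move (always taken)
    for _ in range(Ns):                               # swim while the cost keeps dropping
        if J(theta) < J_last:
            J_last = J(theta)
            theta = theta + C * phi
        else:
            break
    return theta

# Toy usage on a quadratic bowl whose minimum is at the origin.
J = lambda t: float((t ** 2).sum())
theta = np.array([2.0, -1.5])
for _ in range(50):                                   # 50 chemotactic steps
    theta = chemotactic_step(theta, J, C=0.1, Ns=4)
print(theta, J(theta))                                # typically close to the origin
```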

3 Proposed methodology: the BF-C algorithm

In this section we explain in detail how bacterial foraging optimization solves the general clustering problem.



3.1 The BF based clustering (BF-C) algorithm

As we have just mentioned in Section 2.1, clustering tasks can be considered as optimization problems. Firstly, the fitness function should be specified. Here we choose the SSE in (3) as the required function J in BF-C:

$$J(w, z) = \sum_{c=1}^{K} \sum_{t=1}^{n} \sum_{d=1}^{D} w_{tc} \, \| x_{td} - z_{cd} \|^2, \quad (5)$$

where D is the dimension of the search space, w is a weight matrix of size n × K, and w_{tc} is the weight associating data point x_t with cluster c, assigned as

$$w_{tc} = \begin{cases} 1 & \text{if } x_t \text{ is labelled to cluster } c \\ 0 & \text{otherwise} \end{cases}, \qquad t = 1, \ldots, n, \quad c = 1, \ldots, K.$$

Algorithm 1 introduces the proposed BF-C algorithm. In BF-C, an S-size population of bacteria is generated for each center, so there are S × K bacteria changing positions in search of the minimum cost through foraging behaviors. A virtual bacterium is actually one trial solution (it may be called a search agent) that moves on the functional surface to locate the global optimum. Initially, S data points are randomly drawn from X as bacteria for each center z_c (line 1 in Algorithm 1). Then, for every bacterium i, the chemotaxis process starts (lines 4–20 in Algorithm 1). All the bacteria update their positions for N_c iteration steps. The agents first perform a tumble in a unit-length random direction Δ(i)/√(Δᵀ(i)Δ(i)), where Δ(i) ∈ ℝ^D is a random vector whose elements are random numbers on [−1, 1], with a basic chemotaxis step size C(i) (line 6 in Algorithm 1), and then swim to minimize the objective function J up to a maximum number of steps N_s (lines 9–18 in Algorithm 1). The chemotaxis process sits inside a combined N_re-step reproduction loop (lines 3, 21–23 in Algorithm 1) and is encapsulated in an N_ed-length elimination–dispersal phase during which a fraction p_ed of the bacteria are dispersed at random (lines 2, 24–25 in Algorithm 1). All the bacteria converge to certain places in the search space after the iteration process, and their final positions are taken as the required centers. All the data are then allocated according to (2) into the clusters represented by these final centers, and every data object is assigned a corresponding cluster label (lines 26–33 in Algorithm 1).
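A minimal sketch, assuming NumPy and our own helper names, of the two ingredients just described: drawing S bacteria from X for each center (line 1 of Algorithm 1) and generating the unit-length tumble direction Δ(i)/√(Δᵀ(i)Δ(i)):

```python
import numpy as np

def init_bacteria(X, K, S, rng=np.random.default_rng()):
    """Draw S data points from X as the initial bacterium positions
    for each of the K cluster centers (line 1 of Algorithm 1)."""
    idx = rng.choice(len(X), size=(K, S), replace=True)
    return X[idx]                      # shape (K, S, D)

def tumble_direction(D, rng=np.random.default_rng()):
    """Unit-length random direction Delta / sqrt(Delta^T Delta),
    with each element of Delta drawn uniformly from [-1, 1]."""
    delta = rng.uniform(-1.0, 1.0, size=D)
    return delta / np.sqrt(delta @ delta)
```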

3.2 Guidelines for algorithm parameter setting

The bacterial foraging optimization algorithm requires the initialization of a variety of parameters, and the author of Passino (2002) gave a set of guidelines for parameter choices in BF. Since our methodology builds on the basic idea of BF, these guidelines work for BF-C as well. In BF-C, the size of the bacteria population S should be picked first. Enlarging S apparently increases the computing time but makes it easier to find the optimum. Next, there is a three-layer optimization loop in BF-C of size N_iter = N_c × N_re × N_ed. The larger N_iter is, the better the optimization progress, but also the more



Algorithm 1 The BF-C Algorithm
Require:
    Data set, X = {x_1, x_2, ..., x_n};
    Cluster number, K.
Ensure:
    Clusters: {C_1, C_2, ..., C_K}.
 1: Initialize K centers for C_1, C_2, ..., C_K: generate S data points {b_c^1, b_c^2, ..., b_c^S} from X randomly as the positions of the bacteria for each cluster center z_c (c = 1, 2, ..., K).
 2: for l = 1 : N_ed do
 3:   for k = 1 : N_re do
 4:     for j = 1 : N_c do
 5:       for i = 1 : S do
 6:         b_c^i(j+1, k, l) = b_c^i(j, k, l) + C(i) · Δ(i)/√(Δᵀ(i)Δ(i))
 7:         Calculate J(i, j, k, l) with the current b_c^i(j, k, l)
 8:         J_last = J(i, j, k, l)
 9:         while m < N_s do
10:           m = m + 1
11:           if J(i, j+1, k, l) < J_last then
12:             b_c^i(j+1, k, l) = b_c^i(j+1, k, l) + C(i) · Δ(i)/√(Δᵀ(i)Δ(i))
13:             J_last = J(i, j+1, k, l)
14:             z_c(j, k, l) = b_c^i(j+1, k, l)
15:           else
16:             m = N_s
17:           end if
18:         end while
19:       end for
20:     end for
21:     J_health^i = Σ_{j=1}^{N_c+1} J(i, j, k, l)
22:     Reproduce(X, J_health)
23:   end for
24:   Elimination–dispersal(X, p_ed)
25: end for
26: for t = 1 : n do
27:   for c = 1 : K do
28:     Calculate distance d_c = ||x_t − z_c||
29:   end for
30:   d = {d_1, d_2, ..., d_K}
31:   Find the position p of min(d)
32:   C_p.add(x_t)
33: end for
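The following compact Python sketch (our illustration, not the authors' code) shows how the nested elimination–dispersal, reproduction and chemotaxis loops of Algorithm 1 can be organized, with the SSE of (3) as the fitness. It simplifies the bookkeeping: each search agent here encodes a full set of K centers rather than the S bacteria per center of Algorithm 1, the per-step center update z_c(j, k, l) is omitted, and S is assumed even.

```python
import numpy as np

def bfc(X, K, S=50, Nc=100, Ns=4, Nre=4, Ned=2, ped=0.25, C=0.1, seed=0):
    """Simplified BF-C sketch: each bacterium is a candidate set of K centers."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, D = X.shape

    def sse(centers):
        return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()

    bacteria = X[rng.choice(n, size=(S, K))]             # shape (S, K, D)
    cost = np.array([sse(b) for b in bacteria])

    for _ in range(Ned):                                  # elimination–dispersal loop
        for _ in range(Nre):                              # reproduction loop
            health = np.zeros(S)
            for _ in range(Nc):                           # chemotaxis loop
                for i in range(S):
                    delta = rng.uniform(-1, 1, size=(K, D))
                    phi = delta / np.sqrt((delta ** 2).sum())
                    bacteria[i] += C * phi                # tumble
                    cost[i] = sse(bacteria[i])
                    for _ in range(Ns):                   # swim while improving
                        trial = bacteria[i] + C * phi
                        if sse(trial) < cost[i]:
                            bacteria[i], cost[i] = trial, sse(trial)
                        else:
                            break
                    health[i] += cost[i]
            # Reproduction: the healthier half splits, the other half is discarded.
            keep = np.argsort(health)[: S // 2]
            bacteria = np.concatenate([bacteria[keep], bacteria[keep].copy()])
            cost = np.array([sse(b) for b in bacteria])
        # Elimination–dispersal: re-scatter each bacterium with probability ped.
        for i in range(S):
            if rng.random() < ped:
                bacteria[i] = X[rng.choice(n, size=K)]
                cost[i] = sse(bacteria[i])

    centers = bacteria[np.argmin(cost)]
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return centers, labels
```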

computational complexity there is. If N_iter is too small, the algorithm may more easily get trapped in a local minimum. During chemotaxis, the bacteria swim in random directions for up to N_s steps. Large values of N_s tend to make the bacteria explore more directions and obtain better results, but of course incur more computational complexity. And if p_ed is large, the algorithm can degrade to a random exhaustive search. If, however, it



is chosen appropriately, it can help the algorithm jump out of local optima toward a global optimum. Finally, C(i) is the only parameter that appears in the update rule (4) and can be seen as a kind of "step size" for the BF optimization algorithm. One can choose a biologically motivated value; however, such values may not be the best for an engineering application (Passino 2002). If the C(i) values are too large and the optimum lies in a valley with steep edges, the search will tend to jump out of the valley, or it may simply miss local minima by swimming through them without stopping. On the other hand, if the C(i) values are too small, convergence can be slow, but once the search finds a local minimum it will typically not deviate far from it. In Section 5, we set up experiments to investigate the parameters of BF-C.

4 Cluster validity and compared methods

One of the most important issues of cluster analysis is the evaluation of clustering results to find the partitioning that best fits the underlying data. The procedure of evaluating the results of a clustering algorithm is known as cluster validity. Furthermore, in order to show the superiority of the proposed clustering algorithm, some existing methods are selected for comparison with it during cluster validation.

4.1 Cluster validity

Two kinds of cluster validity approaches are chosen in this article. The first is based on external criteria, which evaluate the results of the proposed BF-C algorithm by comparison with the pre-specified class label information of the data set. The second is based on internal criteria, with which we evaluate the clustering results of the BF-C algorithm without any prior knowledge of the data sets. Two external validity measures, Rand and Jaccard (Theodoridis and Koutroumbas 2006), as well as two internal validity measures, Beta (Pal et al. 2000) and the Distance index, are utilized for performance evaluation of the BF-C algorithm and the methods it is compared with.

Rand coefficient (R): It determines the degree of similarity between the known correct cluster structure and the results obtained by a clustering algorithm (Theodoridis and Koutroumbas 2006). It is defined as

$$R = \frac{SS + DD}{SS + SD + DS + DD}. \quad (6)$$

Here SS, SD, DS and DD represent the numbers of possible pairs of data points for which:

SS: both data points belong to the same cluster and the same group.
SD: both data points belong to the same cluster but different groups.
DS: both data points belong to different clusters but the same group.
DD: both data points belong to different clusters and different groups.

Note that if there are N data points in a data set, M = SS + SD + DS + DD, where M is the total number of possible data pairs and equals N(N − 1)/2.



The value of R lies in the range [0, 1]; the higher the value of R, the better the clustering.

Jaccard coefficient (J): It is the same as the Rand coefficient except that it excludes DD, and is defined as

$$J = \frac{SS}{SS + SD + DS}. \quad (7)$$

The value of J lies in the interval [0, 1]; the higher the value of J, the better the clustering performance.

Beta index (β): It computes the ratio of total variation and within class variation (Pal et al. 2000), and is defined as

$$\beta = \frac{\sum_{i=1}^{C} \sum_{j=1}^{n_i} (X_{ij} - \bar{X})^2}{\sum_{i=1}^{C} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2}, \quad (8)$$

where X̄ is the mean of all the data points and X̄_i is the mean of the data points belonging to cluster C_i; X_ij is the jth data point of the ith cluster and n_i is the number of data points in cluster C_i. Since the numerator of β is constant for a given data set, the value of β depends on the denominator only. The denominator decreases as the formed clusters become more homogeneous. Therefore, for a given data set, the higher the value of β, the better the clustering (Pal et al. 2000). Note that (X_ij − X̄) can be calculated as the Euclidean distance between the two vectors X_ij and X̄.

Distance index (Dis = Intra/Inter): It computes the ratio of the average intra-cluster distance to the average inter-cluster distance. The intra-cluster distance measure is the distance between a point and its cluster center. We take the average of all of these distances and call it Intra, defined as

$$\text{Intra} = \frac{1}{n} \sum_{i=1}^{K} \sum_{x_j \in C_i} \|x_j - z_i\|^2, \quad (9)$$

where n is the total number of objects in the data set. The inter-cluster distance between two clusters is defined as the distance between their centers. We calculate the average of all of these distances as follows:

$$\text{Inter} = \frac{1}{K} \sum \|z_i - z_j\|^2, \qquad i = 1, 2, \ldots, K-1, \quad j = i+1, \ldots, K. \quad (10)$$

A good clustering method should produce clusters with high intra-class similarity and low inter-class similarity, so clustering results can be measured by combining the average intra-cluster distance (Intra) and the average inter-cluster distance (Inter) as a ratio:

$$Dis = \frac{\text{Intra}}{\text{Inter}}. \quad (11)$$

Therefore, we want to minimize the value of the measure Dis.
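For illustration only (these helpers are our own, not from the paper), the external pair-counting measures (6)–(7) and the Dis index (9)–(11) can be computed as follows:

```python
import numpy as np
from itertools import combinations

def rand_jaccard(labels_true, labels_pred):
    """Pair-counting Rand (6) and Jaccard (7) indices."""
    SS = SD = DS = DD = 0
    for a, b in combinations(range(len(labels_true)), 2):
        same_cluster = labels_pred[a] == labels_pred[b]
        same_group = labels_true[a] == labels_true[b]
        if same_cluster and same_group:
            SS += 1
        elif same_cluster:
            SD += 1
        elif same_group:
            DS += 1
        else:
            DD += 1
    rand = (SS + DD) / (SS + SD + DS + DD)
    jaccard = SS / (SS + SD + DS)
    return rand, jaccard

def dis_index(X, labels, centers):
    """Dis = Intra / Inter of (9)-(11), with squared Euclidean distances
    and the 1/K averaging of (10) as given above."""
    K = len(centers)
    intra = sum(((X[labels == i] - centers[i]) ** 2).sum() for i in range(K)) / len(X)
    inter = sum(((centers[i] - centers[j]) ** 2).sum()
                for i in range(K - 1) for j in range(i + 1, K)) / K
    return intra / inter
```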



4.2 Methods for comparison

To present the superiority of the proposed BF-C algorithm, we select some previous clustering techniques for comparison. Firstly, we choose the k-means algorithm (MacQueen 1967) because it is the most famous conventional clustering technique. The k-means algorithm is a partition-based clustering approach (see Algorithm 2) and has been widely applied for decades.

Algorithm 2 The k-means Clustering Algorithm
Require:
    Data set, X = {x_1, x_2, ..., x_n};
    Cluster number, K.
Ensure:
    Clusters: {C_1, C_2, ..., C_K}.
1: Initialize K centers for C_1, C_2, ..., C_K: randomly select K data points from X as the initial centroid vectors.
2: repeat
3:   Assign each data point to its closest centroid and form K clusters by (2).
4:   Recompute the centroid of each cluster.
5: until the centroid vectors do not change.
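A minimal runnable sketch of Algorithm 2 (ours; NumPy-based, with assumed function and parameter names such as max_iter):

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, then
    recompute centroids, until they stop changing."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```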

Moreover, as a global optimization-based methodology, the BF-C algorithm is also compared with ant-based clustering (Handl et al. 2006) and a PSO-based clustering technique (van der Merwe and Engelbrecht 2003). Ant colony optimization (ACO) (Dorigo and Maniezzo 1996) was designed to emulate the behavior of ants laying pheromone on the ground while moving, in order to solve optimization problems. Handl et al. (2006) presented an instance of ACO for clustering which returns an explicit partitioning of the data by an automatic process. The ACO algorithm imitates these mechanisms by choosing solutions based on pheromones and updating the pheromones based on solution quality (shown in Algorithm 3). Particle swarm optimization (PSO) (Kennedy and Eberhart 1995) is a population-based algorithm. It is a global optimization method that simulates bird flocking or fish schooling behavior to achieve a self-evolving system. The PSO-based clustering approach automatically searches for the centers of K clusters in a data set by optimizing the objective function (see Algorithm 4). In Section 5, we set up a series of experiments comparing the BF-C, k-means, ACO-based and PSO-based clustering algorithms.

5 Experiments

In this section, we present several simulation experiments on the Matlab platform to give a detailed illustration of the superiority and feasibility of the proposed approach.



Algorithm 3 The ACO-based Clustering Algorithm
Require:
    Data set, X = {x_1, x_2, ..., x_n};
    Cluster number, K.
Ensure:
    Clusters: {C_1, C_2, ..., C_K}.
 1: Initialize pheromones. Randomly scatter data items on the toroidal grid, and generate positions of R ants randomly from the data space for each center.
 2: for j = 1 : iter_max do
 3:   for i = 1 : R do
 4:     Let each data item belong to one cluster with the probability threshold q
 5:     Calculate the objective function J(i, j) with the current centers
 6:     J_last = J(i, j)
 7:     Construct solution S_i using the pheromone trail
 8:     Calculate the new cluster centers; calculate J(i, j+1) with the current centers
 9:     if J(i, j+1) < J_last then
10:       S_i(j+1) = S_i(j)   // Save the best solution among the R solutions found.
11:     end if
12:   end for
13:   Update the pheromone level on all data according to the best solution.
14:   {z_1, z_2, ..., z_K} = S_b   // Update the cluster centers by the cluster center values of the best solution.
15: end for
16: for t = 1 : n do
17:   for c = 1 : K do
18:     Calculate distance d_c = ||x_t − z_c||
19:   end for
20:   d = {d_1, d_2, ..., d_K}
21:   Find the position p of min(d)
22:   C_p.add(x_t)
23: end for

5.1 Data source

Two different types of benchmark data sets are used: two synthetic data sets (Handl and Knowles 2008) that permit the modulation of specific data properties, and five real data sets provided by the UCI Machine Learning Repository (UCI Machine Learning Repository 2007). Both synthetic data sets follow x-dimensional normal distributions N(μ, σ), from which the data items are drawn for the y different clusters. The sample size s of each cluster, the mean vector μ and the vector of standard deviations σ are themselves randomly determined using uniform distributions over fixed ranges (with s ∈ [50, 450], μ_i ∈ [−10, 10] and σ_i ∈ [0, 5]). Consequently, the clusters in each data set have different sizes and different densities. The first set, which we call 2D-4C, is a 2-dimensional data set arranged in ([−20, 20], [−12, 8]) and contains 4 clusters with 528, 348, 272 and 424 instances each



Algorithm 4 The PSO-based Clustering Algorithm
Require:
    Data set, X = {x_1, x_2, ..., x_n};
    Cluster number, K.
Ensure:
    Clusters: {C_1, C_2, ..., C_K}.
 1: Initialize the positions M and velocities v of S particles randomly, where each particle M_i (i = 1, 2, ..., S) contains K randomly generated centroid vectors: M_i = {m_i1, m_i2, ..., m_iK}.
 2: for j = 1 : iter_max do
 3:   for i = 1 : S do
 4:     Calculate the objective function J(i, j) with the current M_i(j)
 5:     J_last = J(i, j)
 6:     v_i(j+1) = w · v_i(j) + c_1 · rand() · (P_i(j) − M_i(j)) + c_2 · rand() · (P_g − M_i(j))
 7:     M_i(j+1) = M_i(j) + v_i(j+1)
 8:     Calculate J(i, j+1) with the current M_i(j+1)
 9:     if J(i, j+1) < J_last then
10:       P_i(j+1) = M_i(j+1)   // P_i represents the local best position, the best position found so far by particle i.
11:     else
12:       P_i(j+1) = P_i(j)
13:     end if
14:   end for
15:   Update the global best position P_g: select the best P_i from {P_1, P_2, ..., P_S} as P_g.   // P_g represents the global best position in the neighborhood of each particle.
16:   {z_1, z_2, ..., z_K} = P_g
17: end for
18: for t = 1 : n do
19:   for c = 1 : K do
20:     Calculate distance d_c = ||x_t − z_c||
21:   end for
22:   d = {d_1, d_2, ..., d_K}
23:   Find the position p of min(d)
24:   C_p.add(x_t)
25: end for
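A compact sketch of the PSO-based clustering of Algorithm 4 (our illustration, with assumed names; each particle encodes K centroids and the fitness is the SSE objective):

```python
import numpy as np

def pso_clustering(X, K, S=50, iters=1000, w=0.72, c1=1.49, c2=1.49, seed=0):
    """PSO-based clustering sketch: velocity/position updates on particles
    that each encode a full set of K candidate centroids."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, D = X.shape

    def sse(centers):
        return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()

    M = X[rng.choice(n, size=(S, K))]          # particle positions, shape (S, K, D)
    V = np.zeros_like(M)                       # particle velocities
    P = M.copy()                               # personal best positions
    p_cost = np.array([sse(m) for m in M])
    g = P[p_cost.argmin()].copy()              # global best position
    for _ in range(iters):
        for i in range(S):
            r1, r2 = rng.random(), rng.random()
            V[i] = w * V[i] + c1 * r1 * (P[i] - M[i]) + c2 * r2 * (g - M[i])
            M[i] = M[i] + V[i]
            cost = sse(M[i])
            if cost < p_cost[i]:               # update personal best
                P[i], p_cost[i] = M[i].copy(), cost
        g = P[p_cost.argmin()].copy()          # update global best
    labels = ((X[:, None, :] - g[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return g, labels
```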

(see Fig. 1). The second data set, named 10D-4C, contains a total of 1,289 items spread over 4 clusters based on 10 different features. All five UCI data sets employed in our experiments are well-known databases that can easily be found in the data mining and pattern recognition literature. The Iris data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant and can be treated as a cluster in the experiments. Each instance has 4 features representing sepal length, sepal width, petal length and petal width, respectively. The Wine data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. This set contains 3 clusters and has 59,



Fig. 1 The original 2-dimensional data distribution in space

71 and 48 instances in each cluster, respectively. The Glass data set has 214 instances describing 6 classes of glass based on 9 features. The Zoo data set is a simple database containing 101 animal instances with 16 Boolean-valued attributes which are classified into 7 categories. Ionosphere contains 351 radar data points with 34 continuous features and was collected by a system in Goose Bay, Labrador. This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. "Good" radar returns are those showing evidence of some type of structure in the ionosphere; "bad" returns are those that do not, their signals passing through the ionosphere. The data points in all five data sets are scattered in high-dimensional spaces. The description of all the data sets used in our study is summarized in Table 1.

Table 1  Summarization of data sets

Data sets    Instances   Features/dimensions   Clusters
2D-4C        1,572       2                     4
10D-4C       1,289       10                    4
Iris         150         4                     3
Wine         178         13                    3
Glass        214         9                     6
Zoo          101         16                    7
Ionosphere   351         34                    2


5.2 Parameter investigation

Parameter selection is an important part of optimization-based approaches. In this subsection we present results from our investigations on the impacts of some key parameters based on the guidelines in Section 3.2, and assign initial values for them.

5.2.1 Chemotaxis step size C(i)

In BF, C(i) is the size of the chemotaxis step and can be initialized with biologically motivated values. However, a biologically motivated value may not be the best for an engineering application (Passino 2002), so it should be chosen according to our data clustering tasks. Figure 2 illustrates the relationship between the objective function and the number of chemotactic steps N_c for different C(i). From Fig. 2 we can see that when the chemotaxis step size C(i) is smaller, the objective function converges faster. Since SSE reaches its smallest value at C(i) = 0.1, we select 0.1 as the value of C(i) for the proposed BF-C algorithm in the subsequent tasks.

5.2.2 Chemotactic step N c and swim step N s

Next, large values of N_c result in many chemotactic steps and, hopefully, more optimization progress, but of course more computational complexity. Figure 3 presents the relationship between the objective function and the number of chemotactic steps

Fig. 2 Performance of BF-C for Iris data with different C(i) (SSE versus N_c)


Fig. 3 Performance values for the five different swim step sizes, for N_c from 1 to 100 (SSE versus N_c)

N_c for different lifetimes N_s of the bacteria. As is evident, when the swim step N_s is smaller, the objective function converges faster. From Fig. 3 we can also see that BF-C converges to the smallest SSE at N_s = 4 and 6; however, the objective function converges faster at N_s = 4. We thus choose N_s = 4 and N_c = 100 in our data clustering tasks.

5.2.3 Reproduction step N re and elimination–dispersal step N ed

If N_c is large enough, the value of N_re affects how the algorithm ignores bad regions and focuses on good ones. If N_re is too small, the algorithm may converge prematurely; however, larger values of N_re clearly increase the computational complexity. A low value of N_ed dictates that the algorithm will not rely on random elimination–dispersal events to find favorable regions, while a high value increases the computational complexity but allows the bacteria to look in more regions for good nutrient concentrations. Figures 4 and 5 depict the values of the objective function (SSE) and the corresponding elapsed times for experiments with N_re from 2 to 6 and N_ed from 1 to 5. It is easy to see in Figs. 4 and 5 that the larger N_re or N_ed is, the more slowly BF-C converges. Moreover, SSE changes only slightly beyond N_re = 4 and N_ed = 2, while the elapsed times increase significantly. Based on these results, we choose N_re = 4 and N_ed = 2 in our applications.



Fig. 4 SSE and the computing time of BF-C for Iris data with different N_re

5.2.4 Elimination–dispersal probability p ed

In BF, if p_ed is large, the algorithm can degrade to a random exhaustive search. However, an appropriate choice of p_ed can help the algorithm jump out of local optima toward a global optimum. Figure 6 shows the relationship between the objective

Fig. 5 SSE and the computing time of BF-C for Iris data with different N_ed


Fig. 6 Performance values for the eight different elimination–dispersal probabilities p_ed (SSE versus p_ed)

function values and different p_ed. Apparently, BF-C obtains the smallest SSE value at p_ed = 0.25.

5.2.5 Other parameters

For PSO, we use 50 particles, and set w = 0.72 and c_1 = c_2 = 1.49; these values were chosen to ensure good convergence (van den Bergh 2002). For ACO, the authors designed techniques to set the parameters for optimal performance (Handl et al. 2006); in our implementation we choose 10 ants and 1,000 iteration steps, following the same settings. For BF-C, based on the investigations in the previous subsections, we choose S = 50, N_c = 100, N_s = 4, N_re = 4, N_ed = 2 and p_ed = 0.25.
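For reference, the parameter values just listed can be collected into a small configuration; the dictionary names below are our own shorthand, not from the paper:

```python
# Parameter settings reported in Section 5.2 (names are illustrative).
BFC_PARAMS = dict(S=50, Nc=100, Ns=4, Nre=4, Ned=2, ped=0.25, C=0.1)
PSO_PARAMS = dict(particles=50, w=0.72, c1=1.49, c2=1.49)
ACO_PARAMS = dict(ants=10, iterations=1000)
```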

5.3 Results and analysis

For all the results reported, average values of the different performance indices over 30 simulations and their corresponding standard deviations (shown in brackets) are given for each data set. The Euclidean distance is used to measure the distance between data points in our work. Each algorithm is ranked from 1 to 3 on each performance measure, and the rank is indicated next to the corresponding value. Table 2 summarizes the clustering results obtained by the k-means, ACO, PSO and the proposed BF-C algorithms for the different data sets. From the clustering results shown in Table 2, and according to the properties of the data sets described in Table 1, the following conclusions can be drawn:

(1) It is apparent that in terms of the external validity measures (the Rand and Jaccard indices), the performance of the proposed BF-C algorithm is better for most of the



Table 2  Values of performance measures by the k-means, ACO, PSO and the proposed BF-C algorithms

Data sets    Method    Rand            Jaccard         β               Dis = Intra/Inter   Time (s)
2D-4C        k-means   0.8636          0.8021          11.1319^c       0.01079             0.23058^a
                       (0.010365)      (0.017803)      (0.684979)      (0.289137)          (0.011049)
             PSO       0.9941^a        0.9778^a        12.565^b        0.00998^a           10.87969
                       (0.0061857)     (0.0094362)     (0.669946)      (0.598628)          (0.419201)
             ACO       0.9916^c        0.9558^c        1.3874          0.01012^c           10.19844^c
                       (0.031725)      (0.0107)        (0.051462)      (0.004398)          (1.74367)
             BF-C      0.9920^b        0.9702^b        13.249^a        0.01006^b           6.7922^b
                       (0.0027578)     (0.0101823)     (0.329451)      (0.002997)          (0.27472)
10D-4C       k-means   0.8946^c        0.7203^c        2.264^b         0.0973              0.069018^a
                       (0.03401184)    (0.0138685)     (0.052886)      (0.080985)          (0.023524)
             PSO       0.8763          0.6924          2.1989^c        0.09693^c           27.71563
                       (0.0393151)     (0.0893076)     (0.06987)       (0.434701)          (0.919202)
             ACO       0.9239^b        0.76142^b       1.1102          0.096711^b          19.2719^c
                       (0.050278)      (0.035419)      (0.034022)      (0.062165)          (5.57154)
             BF-C      0.9319^a        0.8187^a        2.2968^a        0.09202^a           18.28282^b
                       (0.011764)      (0.00219203)    (0.040214)      (0.044709)          (0.902037)
Iris         k-means   0.8737^c        0.6823^c        7.8405^c        0.02422^c           0.00625^a
                       (0.135340)      (0.096661)      (0.602076)      (0.02267)           (0.004941)
             PSO       0.9195^b        0.7828^b        8.3579^b        0.02243^b           6.753125
                       (0.0427095)     (0.0717713)     (0.374798)      (0.005974)          (0.07683)
             ACO       0.8254          0.6547          1.6159          0.03104             5.5256^c
                       (0.008045)      (0.046406)      (0.53276)       (0.02067)           (0.48719)
             BF-C      0.9341^a        0.8180^a        9.1295^a        0.02111^a           2.9344^b
                       (0.0103238)     (0.0248901)     (0.369183)      (0.004837)          (0.058962)
Wine         k-means   0.7170^c        0.4127          7.3745^c        0.02652^b           0.008235^a
                       (0.00675452)    (0.00306349)    (0.398942)      (0.002768)          (0.077548)
             PSO       0.7307^b        0.4312^b        7.6108^b        0.02727^c           19.92031
                       (0.0118794)     (0.01004092)    (0.358704)      (0.003944)          (0.180339)
             ACO       0.683959        0.424734^c      1.012184        0.030469            11.0644^c
                       (0.0107)        (0.007082)      (0.0061833)     (0.008435)          (0.3288)
             BF-C      0.7516^a        0.4494^a        7.9366^a        0.0259^a            7.86721^b
                       (0.00289913)    (0.00282842)    (0.270582)      (0.001703)          (0.140304)
Glass        k-means   0.7047^b        0.2676^b        3.1188^b        0.03647^b           0.034375^a
                       (0.0122868)     (0.029821)      (0.211435)      (0.018903)          (0.07548)
             PSO       0.5409          0.1902          2.4245^c        0.04097             15.29531
                       (0.0636369)     (0.0543058)     (0.127317)      (0.03715)           (0.699914)
             ACO       0.6353^c        0.2699^c        1.009839        0.037285^c          13.9375^c
                       (0.0395196)     (0.0358161)     (0.004898)      (0.026172)          (0.40873)
             BF-C      0.7376^a        0.2765^a        3.5644^a        0.03171^a           11.62502^b
                       (0.0127279)     (0.0208597)     (0.072933)      (0.007819)          (0.203714)
Zoo          k-means   0.7998          0.3758          4.1048^c        0.01821             0.01875^a
                       (0.0484368)     (0.116199)      (0.151947)      (0.003507)          (0.0010546)
             PSO       0.8525^c        0.4768^c        4.6966^b        0.01757^c           44.23438
                       (0.372645)      (0.0714178)     (0.26326)       (0.009615)          (3.770243)
             ACO       0.8829^b        0.6867^b        1.02699         0.01691^b           17.6563^c
                       (0.0231966)     (0.0541532)     (0.0168503)     (0.0034002)         (0.75973)
             BF-C      0.9210^a        0.6977^a        5.9665^a        0.01538^a           11.38752^b
                       (0.0311569)     (0.0448654)     (0.093465)      (0.003154)          (0.552599)
Ionosphere   k-means   0.5877^c        0.4323^c        1.3405^c        0.75862^c           0.046875^a
                       (0.0012882)     (0.0043657)     (0.07535433)    (0.0738499)         (0.02578)
             PSO       0.5921^b        0.4261          1.3516^b        0.75771^b           45.1^c
                       (0.0013718)     (0.0067175)     (0.04071)       (0.081232)          (1.094375)
             ACO       0.5398          0.5384^a        1.3114          0.76174             53.6571
                       (0.000967)      (0.0015278)     (0.0034079)     (0.036454)          (4.6563)
             BF-C      0.5989^a        0.44390^b       1.3528^a        0.75413^a           36.39376^b
                       (0.00120208)    (0.00314662)    (0.042282)      (0.023309)          (0.476011)

^a Rank 1   ^b Rank 2   ^c Rank 3   (standard deviations over 30 runs in parentheses)

data sets (namely 10D-4C, Iris, Wine, Glass, Zoo, and Rand for Ionosphere), whereas PSO gives better results for the 2D-4C data set and ACO achieves the best Jaccard value for Ionosphere. These results show that, with the help of global and chaotic search, the proposed BF-C methodology can reach globally optimal solutions, which overcomes this shortcoming of the k-means algorithm. Meanwhile, as an optimization-based clustering algorithm, BF-C approaches the optimal points more closely and exhibits better convergence than the ACO and PSO techniques.

(2) Furthermore, for the Glass and Zoo data, the proposed BF-C approach achieves an evident improvement in the Rand and Jaccard values over the other three algorithms. Combined with the description of the Glass and Zoo data sets, we can conclude that the BF-C algorithm is much more effective for multi-cluster data sets and is superior for clusters of different scales.

(3) The proposed BF-C algorithm has the best performance on the β index, while the ACO-based clustering algorithm gets the smallest (worst) β values for all data sets. It is clear that the BF-C algorithm is well suited to data sets without any prior information, which will be helpful in real-life clustering applications.

(4) When considering intra-cluster and inter-cluster distances, the former ensures compact clusters with little deviation from the cluster centers, while the latter ensures larger separation between the different clusters. The Dis index is the ratio of intra-cluster to inter-cluster distance, which should be minimized. With reference to this criterion, the BF-C algorithm succeeds most in finding clusters with smaller Dis values than the k-means algorithm and the ant-based and PSO-based approaches, although the PSO algorithm performs best for the 2D-4C data set.

(5) The standard deviations of the different measures obtained by the different methods are shown in brackets. The stability of BF-C over different data sets can be seen from the smaller standard deviations of the Rand, Jaccard and Dis indices, although ACO has smaller standard deviations of β owing to its poor β performance, and the k-means algorithm gives a smaller standard deviation of computing time because of its low computational complexity. These comparisons show that the results of the BF-C algorithm vary less across experiments, and BF-C is a more stable clustering technique than the k-means, PSO and ACO-based clustering algorithms.



(6) The CPU (execution) times, in seconds, needed by the algorithms are also given in the table for comparison. All the experiments are performed on a Dell terminal with an Intel Core(TM)2 Duo CPU (2.53 GHz clock speed, 2 GB memory) in a Windows XP environment. Implementation of the algorithms is done in Matlab.

(7) It is apparent that the