
International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 2, Issue 8, August - 2015. ISSN 2348 4853, Impact Factor 1.317

Finding Efficient Initial Clusters Centers for K-Means


Aarti Chaudhary* , Dharmveer Singh Rajpoot
Department of Computer Science and Engineering / Information Technology ,
Jaypee Institute of Information Technology , Noida , India
chaudhary.aarti111@gmail.com*, dharmveer.rajpoot@jiit.ac.in
ABSTRACT
The K-Means algorithm is one of the most popular partitioning clustering algorithms. It is well
known for its simplicity but has several drawbacks: the random selection of initial centers leads
to local convergence and rarely yields the optimal clustering result. In this paper, we propose an
efficient initialization of centers through a high-density and maximum-distance approach, with
the aim of obtaining optimal clustering results. Experiments performed on three standard datasets
and verified with five validity measures show that the clustering results are better in terms of
minimizing compactness within clusters and maximizing separation between clusters. The
convergence speed of the proposed method is considerably faster than that of traditional methods
such as K-Means, K-Means++ and K-Medoids.
Index Terms: Clustering; Dataset; K-Means; Initial cluster center and Optimization.

I. INTRODUCTION
In today's scenario, large amounts of data such as web commerce records, credit/debit card
transactions, and web server logs are collected and stored in databases and warehouses every
second. The applications range from scientific discovery to business, and include various
techniques to analyze data in order to find hidden patterns that are not readily evident, and to
build models for all types of data [2]. There are many techniques to find such patterns:
supervised learning includes classification and regression; unsupervised learning consists of
clustering; dependency modeling covers associations, summarization and causality; and outlier and
deviation detection [1].
Clustering is defined as partitioning the data into groups or subsets called clusters. Clustering
methods include partitioning and hierarchical approaches. Hierarchical clustering is broadly
categorized into two groups: the divisive approach and the agglomerative approach. The main
drawback of this method is that the division or merging performed on clusters is permanent, and
the time complexity is quite high, i.e. O(M²), where M is the number of data points. Partitioning
clustering divides the dataset into a given number of clusters. There are many methods through
which partitioning clustering can be done, e.g. graph-theoretic, density-based, model-based and
grid-based. The aim of clustering is to maximize the homogeneity within clusters and the
heterogeneity between clusters. Since in unsupervised learning the initial knowledge of where to
start the clustering is missing, dividing the dataset into clusters becomes more complex, so the
selection of initial centers is quite difficult. K-Means and K-Medoids are two well-known
partitioning algorithms. In K-Means the initial centers are randomly selected from the dataset.
K-Means mainly consists of two steps: 1) assigning each data point to its nearest center using a
distance measure; 2) updating each center by taking the mean of all data points assigned to its
cluster. These two steps are repeated until the centers converge. K-Medoids, on the other hand,
also selects its initial medoids randomly from the dataset. In the first step, every medoid is
swapped with each non-medoid and the cost of that configuration is calculated. In the second step,
if the cost is lower than the previous one, the swap between the medoid and non-medoid is kept.
This step is repeated until no data points or medoids change cluster. The main
difference between K-Means and K-Medoids is that the final centers in K-Means may be points that
do not occur in the dataset, whereas in K-Medoids the final centers are always data points from
the dataset.
The main advantage of K-Means and K-Medoids is that they are easy to implement and understand.
The problem, however, is that a wrong selection of initial centers leads to locally optimal
clustering results and cannot reach the globally optimal configuration.
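The two-step K-Means loop described above can be sketched as follows (a minimal illustration with random initialization, not the paper's implementation):

```python
import random

def kmeans(points, k, max_iter=100):
    """Basic K-Means: random initial centers, then assign/update until stable."""
    centers = random.sample(points, k)
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[j].append(p)
        # Step 2: update each center to the mean of its assigned points.
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers no longer change
            break
        centers = new_centers
    return centers, clusters
```

Because the initial centers are drawn at random, two runs on the same data may converge to different local optima, which is exactly the weakness the rest of the paper addresses.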
The rest of the paper is organized as follows: Section II describes the literature survey related
to the initialization of cluster centers. Section III gives a description of the proposed
algorithm with its pseudo code. Section IV contains the results and discussion of the experiments
performed with the traditional methods and the proposed method. Section V summarizes the
conclusion and future scope of the paper.
II. RELATED WORK
The initial center problem is one of the most desirable to solve, and many solutions have been
proposed in the past. Juifang Chang [3] uses a re-centroid technique to eliminate the unstable
clustering results that arise when identifying initial centers from patterns of different shape
and size. Yunming Ye [4] proposed NK-Means, a combination of NBC (Neighborhood-Based Clustering)
and K-Means. Identifying whether a data point is located in a dense, sparse or evenly distributed
space is very time consuming, so the data space is divided into hypercube cells of identical side
length in every dimension, and the density-aware approach is applied only to dense cells, which
are then divided further. Partha Sarathi Bishnu [5] uses a quad-tree data structure to find the
initial cluster centers; the main drawback is that if the dataset is not well scattered in the
data space, all the data points go to either the left or the right side of the tree. Kai Lei [6]
proposes an algorithm in which each cluster contains a fixed range of objects. Samuel Sangkon
Lee [7] uses the maximum average distance between the selected initial centers, which yields more
accurate results because the centers are evenly scattered in the data space. Xin Wang [8] uses
supervised learning to identify the initial centers in the case of CSK (Complete Seed K-Means);
in ISK (Incomplete Seed K-Means), on the other hand, some of the initial centers are given and
the remaining ones are selected using the maximum distance approach or on a random basis. Young
Jun Zhang [9] proposed an algorithm to choose initial centers according to a similarity density
calculated using the similarity degree; the points with minimum similarity density are selected
as initial centers.
III. PROPOSED ALGORITHM
In this section, a detailed description of the proposed algorithm is given. The algorithm consists
of three parts. In the first part, the maximum distance approach, k potential points are selected
from the dataset such that they have maximum distance between them after replacement. The next
part finds k high-density points using the closest-neighbor approach. In the last part, the
resulting 2k data points are merged until the desired number of clusters is reached. The steps of
the algorithm are organized into three parts as follows:
Maximum Distance Replacement Approach: At the start, k consecutive data points are selected from
the dataset. Then the maximum distance replacement approach is applied to the selected data
points. In the end, k potential initial centers result from it. These k potential centers go to
the merging step together with the k high-density centers.

High-Density Approach: In this part, k data points are selected by the high-density approach. The
approach calculates the CCR (c-th closest radius) for every data point, and the points with
minimum CCR are taken as the k potential centers for K-Means initialization.
Merge Stage: Collect all 2k potential centers and merge close centers until the input number of
clusters is reached.
A. Maximum Distance Approach: Let S be the dataset, n the number of data points and k the number
of clusters. The dataset is represented as

$$S = \bigcup_{i=1}^{n} Y_i$$

and M, the set of potential initial centers, as $(Y_1, Y_2, \ldots, Y_k)$.

Now calculate the Euclidean distances between $Y_{k+1}$ and the members of M:

$$dis(Y_{k+1}, Y_1),\; dis(Y_{k+1}, Y_2),\; \ldots,\; dis(Y_{k+1}, Y_k)$$

Now select the maximum-distance point and replace it by $Y_{k+1}$. Repeat this process for all
data points from $k+1$ to $n$.
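One plausible implementation of this replacement pass is sketched below. The swap rule used here — replace a current center only when the candidate widens the minimum pairwise separation of the center set — is my reading of the description above, not code from the paper:

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

def min_pairwise(centers):
    """Smallest Euclidean distance between any two centers."""
    return min(dist(a, b) for a, b in combinations(centers, 2))

def max_distance_centers(points, k):
    """Start with the first k points; scan Y_{k+1}..Y_n and swap a candidate in
    whenever doing so widens the minimum pairwise separation of the centers."""
    centers = list(points[:k])
    for p in points[k:]:
        best = min_pairwise(centers)
        best_swap = None
        for i in range(k):
            trial = centers[:i] + [p] + centers[i + 1:]
            sep = min_pairwise(trial)
            if sep > best:  # candidate spreads the centers out further
                best, best_swap = sep, i
        if best_swap is not None:
            centers[best_swap] = p
    return centers
```

On a small 2-D set the pass pulls the initial run of near-duplicate points apart toward well-separated candidates, which is the intended effect of the approach.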


B. High-Density Approach: Let the c-th closest point to $Y_j$ be $CCP(j, c)$ and the set of all c
closest neighbors of $Y_j$ be $CCN(j, c)$, where

$$CCN(j, c) = \bigcup_{i=1}^{c} CCP(j, i)$$

Then we calculate the c-closest-points radius (CCR), which indicates whether a data point lies in
a dense area. It is the mean distance from the point to its c closest neighbors, calculated as
follows:

$$CCR(j, c) = \frac{1}{c} \sum_{Y_i \in CCN(j, c)} dis(Y_j, Y_i) \qquad (1)$$

$$dis(Y_j, Y_i) = \sqrt{\sum_{d=1}^{m} \left(Y_j^{d} - Y_i^{d}\right)^2}$$

where m indicates the number of dimensions of a data point and c indicates the division of the
data points into classes done by experts.
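The CCR of equation (1) and the resulting selection loop can be sketched as follows (an illustration of the technique, with CCR restricted to the not-yet-removed points as the selection proceeds; this restriction is an assumption, not stated explicitly in the paper):

```python
from math import dist

def mean_ccr(points, j, remaining, c):
    """CCR(j, c): mean distance from point j to its c closest neighbors,
    restricted to the still-remaining points (RL)."""
    d = sorted(dist(points[j], points[i]) for i in remaining if i != j)
    return sum(d[:c]) / min(c, len(d)) if d else float("inf")

def high_density_centers(points, k, c):
    """Repeatedly pick the remaining point with minimum CCR (densest
    neighborhood), then drop it and its c closest neighbors from RL."""
    remaining = list(range(len(points)))
    selected = []
    while remaining and len(selected) < k:
        best = min(remaining, key=lambda j: mean_ccr(points, j, remaining, c))
        selected.append(points[best])
        # remove the chosen point and its c closest neighbors from RL
        nbrs = sorted((i for i in remaining if i != best),
                      key=lambda i: dist(points[i], points[best]))[:c]
        remaining = [i for i in remaining if i != best and i not in nbrs]
    return selected
```

A point sitting inside a tight group has a small mean distance to its c nearest neighbors, so it is selected first, as the approach intends.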
C. Merge Stage: Merge the clusters having the minimum Euclidean distance and update the clusters
until the desired number of clusters is reached:

$$\min \{ dis(C_1, C_2),\; dis(C_1, C_3),\; \ldots,\; dis(C_{k-1}, C_k) \}$$
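The merge stage can be sketched as repeated closest-pair merging over the candidate centers. Replacing the merged pair with its midpoint is an assumption about the update rule, which the paper does not spell out:

```python
from itertools import combinations
from math import dist

def merge_centers(candidates, k):
    """Merge the closest pair of candidate centers (replacing them with
    their midpoint) until exactly k centers remain."""
    centers = [tuple(map(float, c)) for c in candidates]
    while len(centers) > k:
        # find the pair with minimum Euclidean distance
        i, j = min(combinations(range(len(centers)), 2),
                   key=lambda p: dist(centers[p[0]], centers[p[1]]))
        merged = tuple((a + b) / 2 for a, b in zip(centers[i], centers[j]))
        centers = [c for idx, c in enumerate(centers) if idx not in (i, j)]
        centers.append(merged)
    return centers
```

Applied to the 2k candidates produced by the two previous stages, this shrinks near-duplicate centers into single representatives while leaving well-separated ones untouched.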

D. Initial Centres Selection Algorithm Description

In the selection process we choose centres from the high-density points and from the maximum
distance replacement approach. Both steps give potential initial centres, from which the final
initial centres are selected. The merging step shrinks the set of potential centres by merging
those at minimum distance from each other.

Input: the number of data points n, the number of clusters k, the dataset
$S = \bigcup_{i=1}^{n} Y_i$, and c, the division of the data points into k clusters given by
experts.

Output: initial cluster centers $IC = (Y_1, Y_2, \ldots, Y_k)$.

Begin (Proposed Method)


Step 1: Max-distance replacement potential initial points.
Step 1.1: Select the first k points consecutively, i.e. $M = (Y_1, Y_2, \ldots, Y_k)$.
Step 1.2: For every $i \in (k+1, n)$, apply max-distance searching between M and $Y_i$.
Step 2: High-density points.
Step 2.1: Compute the distance matrix D and initialize RL (the set of remaining points, not yet
removed or selected) with S.
Step 2.2: For each point in RL, calculate CCR(j, c) using equation (1).
Step 2.3: Find the data point with the minimum CCR, add it to IL (the set of selected initial
centers), and delete it and all of its c closest neighbors from RL.
Step 2.4: Go to Step 2.2 until RL is empty.
Step 3: Merge potential centers.
Step 3.1: Place all potential points acquired from Step 1 and Step 2 in IC.
Step 3.2: Calculate the distances between all points in IC, select the minimum-distance pair, and
merge them into an updated center in IC.
Step 3.3: Go to Step 3.2 until |IC| = k.
Step 4: Feed IC to K-Means as the initial centers and run until the centers do not change.
End
IV. EXPERIMENT AND DISCUSSION
In this study, the proposed algorithm was developed in Java using the Eclipse IDE and run on a
computer with a 2.5 GHz Intel Core i5 processor and 4 GB of 1600 MHz DDR3 memory (MacBook Pro,
OS X Yosemite 10.10).

A. Description of Datasets
Our experiments use three real datasets taken from the UCI dataset repository. The Iris dataset
contains 150 objects in 5 dimensions; all values are preprocessed and real, with no missing
values. The classification of the Iris data into 3 classes of 50 objects each was given by
experts. The Climate Model Simulation Crashes dataset contains 540 objects in 21 dimensions; all
values are preprocessed and real, with no missing values. The Seeds dataset consists of 210
objects in 8 dimensions; all values are preprocessed and real, with no missing values [11]. All
the datasets are described in Table 1.

Table 1: Description of Datasets
S. NO.  DATASETS                          INSTANCES  ATTRIBUTES
1       IRIS PLANT                        150        5
2       CLIMATE MODEL SIMULATION CRASHES  540        21
3       SEEDS                             210        8

B. Description of Traditional Methods


To compare our method, we use three other popular techniques. The first is one of the well-known
initial-center techniques, K-Means. The second is the simple technique of selecting initial
centers from data points only, called K-Medoids. The third is the extension of K-Means known as
K-Means++.
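The K-Means++ seeding mentioned here can be sketched as standard D² sampling: each further center is drawn with probability proportional to its squared distance to the nearest already-chosen center (a generic illustration, not the code used in the paper's experiments):

```python
import random
from math import dist

def kmeans_pp_init(points, k, rng=random):
    """K-Means++ seeding: first center uniform at random; each further center
    drawn with probability proportional to D^2, the squared distance to the
    nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(dist(p, c) ** 2 for c in centers) for p in points]
        # weighted draw proportional to D^2
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

Points far from every chosen center get large weights, so the seeds tend to spread across the data, which is why K-Means++ typically converges in fewer iterations than plain random seeding.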
C. Validity Measures
We use five cluster validity measures: Root Mean Square Error (RMSE), Sum of Square Error (SSE),
SD Validity Index (SD index), Davies-Bouldin Index (DB index), and the number of iterations
needed, after the selection of initial centers, to reach the optimal clustering structure.

1. RMSE: The Root Mean Square Error is the square root of the mean squared error. A lower RMSE
value indicates a better cluster structure.

$$RMSE = \sqrt{\frac{1}{n\,m}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(x_i^{j} - c^{j}\right)^2}$$

where X is the set of data points, C the set of k cluster centers, c the center nearest to
$x_i$, k the number of clusters and m the number of dimensions.
2. SSE: The SSE is the sum of squared distances of all data points to their nearest centers. A
smaller value of SSE indicates a result close to the optimal clustering structure.

$$SSE = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - c_i \rVert^2$$

where X is the set of data points, C the set of k cluster centers and k the number of clusters.
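Both error measures can be sketched together, assuming Euclidean distance and nearest-center assignment; the RMSE normalization by n used here is one common choice and may differ from the paper's exact scaling:

```python
from math import dist, sqrt

def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(dist(p, c) ** 2 for c in centers) for p in points)

def rmse(points, centers):
    """Square root of the mean squared point-to-nearest-center distance."""
    return sqrt(sse(points, centers) / len(points))
```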


3. SD Validity Index: The SD index accounts not only for the homogeneity between members of the
same cluster, but also for the heterogeneity among clusters. A low SD index indicates an efficient
clustering configuration.

$$SDIndex = \frac{1}{k}\sum_{i=1}^{k}\frac{\lVert\sigma(C_i)\rVert}{\lVert\sigma(X)\rVert}
+ \frac{D_{max}}{D_{min}}\sum_{i=1}^{k}\left(\sum_{j=1,\,j\neq i}^{k}
\lVert C_i - C_j\rVert\right)^{-1}$$

where $\sigma(C_i)$ is the variance of the i-th cluster, $\sigma(X)$ the variance of the whole
dataset, $D_{max}$ the maximum distance between clusters and $D_{min}$ the minimum distance
between clusters.
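Under these definitions the SD index can be sketched as follows (an illustrative reading of the formula above, with the per-dimension variance vector and its Euclidean norm; not the paper's evaluation code):

```python
from itertools import combinations
from math import dist

def variance_norm(pts):
    """Euclidean norm of the per-dimension variance vector of a point set."""
    n, m = len(pts), len(pts[0])
    mean = [sum(p[d] for p in pts) / n for d in range(m)]
    var = [sum((p[d] - mean[d]) ** 2 for p in pts) / n for d in range(m)]
    return sum(v * v for v in var) ** 0.5

def sd_index(clusters, centers, all_points):
    """SD index: average relative cluster scatter plus a separation term
    built from the max/min distances between cluster centers."""
    k = len(centers)
    scat = sum(variance_norm(c) for c in clusters) / (k * variance_norm(all_points))
    dists = [dist(a, b) for a, b in combinations(centers, 2)]
    dmax, dmin = max(dists), min(dists)
    dis = (dmax / dmin) * sum(
        1.0 / sum(dist(centers[i], centers[j]) for j in range(k) if j != i)
        for i in range(k))
    return scat + dis
```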

4. DB Index: The DB index averages, over all clusters, the worst-case similarity of each cluster
to any other cluster. A lower value of the DB index indicates a preferred cluster configuration.

$$DBIndex = \frac{1}{k}\sum_{j=1}^{k}\max_{l\neq j} S_{jl}, \qquad
S_{jl} = \frac{disp_j + disp_l}{dis_{jl}}$$

where $S_{jl}$ is the similarity of clusters j and l,
$disp_j = \frac{1}{n_j}\sum_{x \in C_j} d(x, v_j)$ is the dispersion of cluster j, and
$dis_{jl} = d(v_j, v_l)$ is the dissimilarity of clusters j and l, with $v_j$ the center and
$n_j$ the size of cluster j.
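A sketch of the DB index computation under these definitions (Euclidean distance assumed; illustrative, not the paper's evaluation code):

```python
from math import dist

def db_index(clusters, centers):
    """Davies-Bouldin index: mean over clusters of the maximum ratio of
    summed dispersions to center-to-center distance."""
    k = len(centers)
    # dispersion disp_j: mean distance of cluster members to their center
    disp = [sum(dist(p, centers[j]) for p in clusters[j]) / len(clusters[j])
            for j in range(k)]
    total = 0.0
    for j in range(k):
        # S_jl = (disp_j + disp_l) / dis_jl; take the worst (largest) over l != j
        total += max((disp[j] + disp[l]) / dist(centers[j], centers[l])
                     for l in range(k) if l != j)
    return total / k
```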

The number of iterations is also compared with the existing techniques; fewer iterations to
acquire the final cluster centers is better.
D. Results and Discussion

In this section, we show the results of the experiments performed with the proposed method and
compare it with three well-known existing methods using three standard datasets and five validity
measures. All results are given in Table 2, Table 3, Table 4 and Figures 1-3 below, where we
assume there are three clusters (k = 3) in each dataset. In the figures, all RMSE values are
divided by 100 and all DB values multiplied by 100 for effective visualization. In all comparison
tables, bold values indicate the better results.
Table 2: Comparison of Results on Iris Plant Dataset

QUALITY MEASURE  K-MEANS  K-MEDOIDS  K-MEANS++  PROPOSED METHOD
RMSE             122.27   113.34     97.34      97.32
SSE              3.191    2.180      1.419      1.414
SD INDEX         16.5     8.1        5.9        3.1
DB INDEX         0.09     0.08       0.01       0.08
ITERATION        29700
Table 3: Comparison of Results on Seeds Dataset

QUALITY MEASURE  K-MEANS  K-MEDOIDS  K-MEANS++  PROPOSED METHOD
RMSE             313.73   381.90     313.21     313.21
SSE              7.54     12.79      8.79       8.97
SD INDEX         2.76     4.11       3.06       2.56
DB INDEX         0.021    0.022      0.012      0.025
ITERATION        33390

Table 4: Comparison of Results on CMSE Dataset

QUALITY MEASURE  K-MEANS  K-MEDOIDS  K-MEANS++  PROPOSED METHOD
RMSE             644.65   744.84     645.07     644.64
SSE              2.65     4.18       2.70       2.33
SD INDEX         4.12     1.28       4.19       4.06
DB INDEX         0.020    0.010      0.020      0.017
ITERATION        21       200340     21         13

[Figure: grouped bar chart of RMSE, SSE, SD Index, DB Index and iteration count for K-Means,
K-Medoids, K-Means++ and the proposed method]

Fig 1: Comparison of Results on Iris Plant Dataset


[Figure: grouped bar chart of RMSE, SSE, SD Index, DB Index and iteration count for K-Means,
K-Medoids, K-Means++ and the proposed method]

Fig 2: Comparison of Results on Seeds Dataset

[Figure: grouped bar chart of RMSE, SSE, SD Index, DB Index and iteration count for K-Means,
K-Medoids, K-Means++ and the proposed method]

Fig 3: Comparison of Results on CMSE Dataset


In the tables above, K-Means selects its initial centers randomly from the dataset, and K-Medoids
also selects its initial medoids randomly. The difference between them is that K-Medoids selects
final centers from the dataset only, whereas K-Means always takes the mean of the data points
present in each cluster. K-Means++ is a refinement of K-Means that selects the initial centers
using a probability function.
We measured cluster validity on the Iris, Seeds and Climate Model Simulation Crashes datasets.
Table 2 shows that the proposed algorithm gives optimal results with respect to RMSE, SSE and the
SD Validity Index compared to the existing techniques. The number of iterations needed to acquire
the final clusters is considerably lower than for K-Means and K-Medoids, and the same as for
K-Means++. For the DB index, the proposed algorithm shows better results than K-Means and
K-Medoids, but K-Means++ shows the best result. So, in terms of error minimization and of
compactness and separation of clusters, the proposed algorithm produces clusters more efficiently.
The experiments thus support the assumption that homogeneity increases within the clusters and
heterogeneity increases between the clusters. In Table 3, the proposed method shows optimal
results for SSE and the SD index compared with K-Medoids, and equals K-Means++; for SSE, however,
K-Means shows the best configuration of all the methods. RMSE and the DB index are more effective
than K-Medoids and K-Means, and equal to K-Means++. The iteration count is lower than for all the
existing methods. Table 4 shows effective error minimization and iteration count for the proposed
V. CONCLUSION AND FUTURE WORK
For the clustering technique to produce effective clustering results, the selection of initial
centers must be effective, as random selection of initial centers may lead to a wrong clustering
configuration, maximizing the intra-cluster distance and minimizing the inter-cluster distance.
The existing partitioning clustering selections are not effective enough, as measured by the
evaluation metrics, compared with the proposed algorithm. In this paper an initial-center
selection method is proposed, based on dense areas and the maximum distance replacement approach,
which leads to an effective clustering configuration on several datasets in terms of error
minimization and the homogeneity and heterogeneity of the clusters. The iteration count is lower
than or equal to that of the existing techniques. The evaluation-metric tests used to assess our
proposed method show that it achieves more efficient clustering results.
In the future, we shall test our proposed algorithm with high-dimensional datasets. We fixed the
value of k in the tests because the classification into classes via labels was known initially.
When the algorithm is applied to data without class labels, finding the labels is quite a complex
issue.
VI. REFERENCES

[1] A. K. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys
(CSUR), Volume 31, Issue 3, pp. 264-323, 1999.

[2] K. P. Soman, S. Diwakar and V. Ajay, Insight into Data Mining: Theory and Practice,
pp. 17-18, 2006.

[3] J. Chang, "SDCC: A New Stable Double-Centroid Clustering Technique Based on K-Means for
Non-spherical Patterns", Advances in Neural Networks, Springer Berlin Heidelberg, pp. 794-801,
2009.

[4] Ye, Y., Huang, J. Z., Chen, X., Zhou, S., Williams, G., and Xu, X., "Neighborhood density
method for selecting initial cluster centers in K-means clustering", Advances in Knowledge
Discovery and Data Mining, Springer Berlin Heidelberg, pp. 189-198, 2006.

[5] Bishnu, P. S., and Bhattacherjee, V., "Software fault prediction using quad tree-based
k-means clustering algorithm", IEEE Transactions on Knowledge and Data Engineering, Volume 24,
Issue 6, pp. 1146-1150, 2012.

[6] Lei, K., Wang, S., Song, W., and Li, Q., "Size-Constrained Clustering Using an Initial
Points Selection Method", Knowledge Science, Engineering and Management, Springer Berlin
Heidelberg, pp. 195-205, 2013.

[7] Lee, S. S., and Han, C. Y., "Finding Good Initial Cluster Center by Using Maximum Average
Distance", Advances in Natural Language Processing, Springer Berlin Heidelberg, pp. 228-238,
2012.

[8] Wang, X., Wang, C., and Shen, J., "Semisupervised K-Means Clustering by Optimizing Initial
Cluster Centers", Web Information Systems and Mining, Springer Berlin Heidelberg, pp. 178-187,
2011.

[9] Zhang, Y., and Cheng, E., "An optimized method for selection of the initial centers of
k-means clustering", Integrated Uncertainty in Knowledge Modeling and Decision Making, Springer
Berlin Heidelberg, pp. 149-156, 2013.

[10] Kovács, F., Legány, C., and Babos, A., "Cluster validity measurement techniques", 6th
International Symposium of Hungarian Researchers on Computational Intelligence, 2005.

[11] Dataset Collection: http://archive.ics.uci.edu/ml/datasets.html
