
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 3, May-June (2013), pp. 204-219, IAEME: www.iaeme.com/ijcet.asp, Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com


DYNAMIC APPROACH TO k-Means CLUSTERING ALGORITHM


Deepika Khurana1 and Dr. M.P.S Bhatia2
1 (Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, India)
2 (Department of Computer Engineering, Netaji Subhas Institute of Technology, University of Delhi, New Delhi, India)

ABSTRACT

The k-Means clustering algorithm is a heuristic algorithm that partitions a dataset into k clusters by minimizing the sum of squared distances within each cluster. It has, however, a number of weaknesses. First, it requires prior knowledge of the number of clusters k. Second, it is sensitive to initialization, so different runs can give different solutions. This paper presents a new approach to k-Means clustering that provides a method for selecting the initial cluster centroids and a dynamic approach based on the silhouette validity index. Instead of running the algorithm for different values of k, the user needs to supply only an initial value of k, denoted ko, and the algorithm itself determines the right number of clusters for the given dataset. The algorithm is implemented in MATLAB R2009b and the results are compared with the original k-Means algorithm and other modified k-Means clustering algorithms. The experimental results demonstrate that the proposed scheme improves the initial center selection and the overall computation time.

Keywords: Clustering, Data mining, Dynamic, k-Means, Silhouette validity index.

I. INTRODUCTION

Data mining is the extraction of knowledge from large amounts of data. Using data mining we can predict the nature and behaviour of many kinds of data. It has been recognized that information is at the heart of business operations and that decision makers can use stored data to gain valuable insight into the business. A DBMS gives access to the stored data, but this is only a small part of what can be gained from the data.

Analyzing data can provide further knowledge about the business by going beyond the data explicitly stored. The ability to learn valuable information from data has led to clustering techniques being widely applied in artificial intelligence, customer relationship management, data compression, data mining, image processing, machine learning, pattern recognition, market analysis, fraud detection and so on.

Cluster analysis is an important task in Knowledge Discovery and Data Mining. Clustering is the process of grouping objects on the basis of the similarities and dissimilarities among them, such that the objects in one group are similar to one another and different from the objects in the other groups. Clustering is an unsupervised technique, and algorithms such as k-Means require a parameter that specifies the number of clusters k. Setting this parameter requires either detailed knowledge of the dataset or running the algorithm for different values of k to determine the correct number of clusters. For large and multidimensional data, however, clustering becomes time consuming and determining the correct number of clusters becomes difficult.

The k-Means clustering algorithm is an old algorithm that has been intensely researched owing to its simplicity and ease of implementation. However, its performance has also been criticized, in particular for requiring the value of k in advance. Previous research indicates that providing the number of clusters in advance does not by itself assist in producing good-quality clusters. The Original k-Means also chooses the initial centers randomly in each run, which leads to different solutions.
To validate the clustering results we have chosen the Silhouette validity index as the validity measure. The Silhouette validity index is particularly useful for finding the number of clusters that are compact and well separated. The index is computed after clustering to check the validity of the clusters produced.

This paper presents a new method for selecting the initial k centers and a dynamic approach to k-Means clustering. The user provides an initial value of k, denoted ko. The algorithm then partitions the whole data space into segments and calculates the frequency of data points in each segment. The ko highest-frequency segments are chosen as the initial ko clusters. To determine the initial centers, the algorithm calculates, for each chosen segment, the distance of every point from the origin, sorts these distances, and takes the coordinates corresponding to the middle distance as the center for that segment. The cluster assignment process is then carried out and the Silhouette validity index is calculated for the initial ko clusters. This step is repeated for (ko + 2) and (ko - 2) clusters. The algorithm then iterates again under specified conditions and stops at the maximum value of the silhouette index, yielding the appropriate number of clusters k.

The proposed approach is dynamic in the sense that the user need not run the algorithm for different values of k. Instead the algorithm itself stops at the best value of k, giving compact and well-separated clusters. In the experiments of Section VIII the proposed algorithm takes less execution time than the Original k-Means and the modified approaches to k-Means clustering on the largest dataset tested.

The paper is organised as follows: Section II presents related work. The Silhouette validity index is discussed in Section III. Section IV describes the Original k-Means. Sections V and VI detail the approaches discussed in [1] and [2] respectively. Section VII describes the proposed algorithm. Section VIII shows implementation results. Conclusion and future work are presented in Section IX.

II. RELATED WORK

In [1] an improved k-Means algorithm is proposed that reduces the sensitivity to the initial centers. The algorithm partitions the whole data space into segments and calculates the frequency of points in each segment. The segments with the highest frequency are considered for the initial centroids, depending on the value of k.

In [2] another method of finding initial cluster centers is discussed. It first finds the closest pair of data points and, starting from these points, forms a subset of the dataset; this process is repeated k times to obtain k small subsets, from which the k initial centroids are found.

The author of [3] uses Principal Component Analysis for dimension reduction and to find initial cluster centers.

In [4] the data set is first pre-processed by transforming all data values to positive space; the data is then sorted and divided into k equal sets, and the middle value of each set is taken as the initial center point.

In [5] a dynamic solution to k-Means is proposed in which the algorithm is designed with a pre-processor that uses the silhouette validity index to determine the appropriate number of clusters automatically, which increases the efficiency of clustering to a great extent.

In [6] a method is proposed that makes the algorithm independent of the number of iterations by avoiding repeated computation of the distance from each data point to the cluster centers, saving running time and reducing computational complexity.

In [7] a dynamic means algorithm is proposed to improve cluster quality and optimize the number of clusters. The user has the flexibility either to fix the number of clusters or to input the minimum number of clusters required. In the former case it works the same as the k-Means algorithm. In the latter case the algorithm computes new cluster centers by incrementing the cluster count by one in every iteration until the cluster-quality validity criterion is satisfied.

In [8] the main purpose is to optimize the initial centroids for the k-Means algorithm. The authors propose a Hierarchical k-Means algorithm. It uses all the clustering results of k-Means over a certain number of runs, even though some of them reach local optima, and then combines the centroids of these results using a hierarchical algorithm to determine the initial centroids for k-Means. This algorithm is better suited to complex clustering cases with large data sets and many dimensions.

III. SILHOUETTE VALIDITY INDEX

The Silhouette value for each point is a measure of how similar that point is to the points in its own cluster compared with the points in other clusters. This technique computes the silhouette width for each data point, the silhouette width for each cluster and the overall average silhouette width. The silhouette width for the ith point of the mth cluster is given by equation 1:

$$ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \qquad (1) $$

Where a_i is the average distance from the ith point to the other points in its cluster and b_i is the minimum of the average distances from point i to the points in each of the other k-1 clusters. It ranges from -1 to +1. A point i with a silhouette width close to 1 belongs to the cluster to which it has been assigned. A value of zero indicates that the object could also be assigned to the closest neighbouring cluster. A value close to -1 indicates that the object is wrongly clustered or lies somewhere between clusters. The silhouette width for each cluster is given by equation 2:

$$ S_m = \frac{1}{n_m} \sum_{i=1}^{n_m} s_i \qquad (2) $$

Where n_m is the number of points in the mth cluster. The overall average silhouette width is given by equation 3:

$$ \bar{S} = \frac{1}{k} \sum_{m=1}^{k} S_m \qquad (3) $$
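As a rough illustration of equations 1 to 3, the following Python sketch computes the silhouette width of every point, of every cluster and of the whole partition. It is not the MATLAB implementation used in the paper: the function names `euclidean` and `silhouette_widths` are ours, Euclidean distance is assumed, the overall width is taken as the mean of the cluster widths, and giving a width of 0 to a point that is alone in its cluster is an assumed convention.

```python
import math
from collections import defaultdict


def euclidean(x, y):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def silhouette_widths(points, labels):
    """Return (per-point widths, per-cluster widths, overall width)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    s = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own or len(clusters) < 2:
            s.append(0.0)  # assumed convention: singleton or single cluster
            continue
        # a_i: average distance to the other points of the same cluster.
        a_i = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b_i: smallest average distance to the points of any other cluster.
        b_i = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a_i, b_i)
        s.append((b_i - a_i) / denom if denom > 0 else 0.0)

    # Cluster width: mean of the point widths in that cluster.
    per_cluster = {
        label: sum(s[i] for i in members) / len(members)
        for label, members in clusters.items()
    }
    # Overall width: mean of the cluster widths.
    overall = sum(per_cluster.values()) / len(per_cluster)
    return s, per_cluster, overall
```

For two tight, well-separated groups such as (0, 0), (0, 1) and (10, 0), (10, 1) the overall width is close to 1, whereas labelling the points across the two groups makes it negative.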

We have used this silhouette validity index as a measure of cluster validity in the implementation of Original k-Means, modified approach I and modified approach II. We have also used this measure as the basis for making the proposed algorithm work dynamically.

IV. ORIGINAL K-MEANS ALGORITHM

The k-Means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid or center of gravity. The k-Means algorithm proceeds as follows:

1. Randomly select k of the objects, each of which initially represents a cluster mean or center.
2. Assign each of the remaining objects to the cluster to which it is most similar, based on the distance between the object and the cluster mean.
3. Compute the new mean for each cluster using equation 4:

$$ M_j = \frac{1}{n_j} \sum_{Z_p \in C_j} Z_p \qquad (4) $$

Where M_j is the centroid of cluster j, n_j is the number of data points in cluster j and Z_p ranges over the data points of cluster C_j.
4. Iterate until the criterion function converges. Typically the square-error criterion is used, defined by equation 5:

$$ E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2 \qquad (5) $$

Where p is a data point and m_i is the center of cluster C_i. E is the sum of squared errors over all points in the dataset. The distance used in the criterion function is the Euclidean distance between a data point and a cluster center. The Euclidean distance between two vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn) is given by equation 6:

$$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (6) $$
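The steps above and the square-error criterion of equation 5 can be sketched in Python as follows. This is a minimal illustration, not the MATLAB code used for the experiments; the names `k_means`, `mean_point`, `seed` and `max_iter` are ours, and leaving the centre of an empty cluster unchanged is an assumed convention that the text does not specify.

```python
import math
import random


def euclidean(x, y):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def k_means(data, k, seed=None, max_iter=100):
    """Plain k-Means on a list of points; returns (centres, labels, sse)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct objects at random as the initial centres.
    centres = [tuple(p) for p in rng.sample(data, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every object to its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centres[j])) for p in data
        ]
        if new_labels == labels:
            break  # Step 4: no assignment changed, so the criterion has converged.
        labels = new_labels
        # Step 3: recompute each centre as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[j] = mean_point(members)
    # Sum of squared distances of all points to their cluster centre.
    sse = sum(euclidean(p, centres[lab]) ** 2 for p, lab in zip(data, labels))
    return centres, labels, sse
```

Because the initial centres are drawn at random, two calls with different `seed` values may start from different centres, which is the sensitivity to initialization that the modified approaches try to remove.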

Algorithm: The k-Means algorithm for partitioning, where each cluster's center is represented by the mean value of the objects in the cluster.
Input: k: the number of clusters; D: a data set containing n objects.
Output: A set of k clusters.
Method:
1. arbitrarily choose k objects from D as the initial cluster centers;
2. repeat
3. (re)assign each object to the cluster to which the object is most similar, based on the mean value of the objects in the cluster;
4. update the cluster means, i.e., calculate the mean value of the objects for each cluster;
5. until no change;

V. MODIFIED APPROACH I

The first approach, discussed in [1], optimizes the Original k-Means algorithm by proposing a method for choosing the initial clusters. The author proposes a method that partitions the given input data space into k * k segments, where k is the desired number of clusters. After partitioning the data space, the frequency of each segment is calculated and the k highest-frequency segments are chosen to represent the initial clusters. If some segments have the same frequency, the adjacent segments with the same least frequency are merged until k segments are obtained. The initial k centers are then calculated by taking the mean of the data points in the respective segments. This process always yields the same initial centers for a given dataset, whereas the Original k-Means algorithm selects the initial centers randomly.

Next, a threshold distance is defined for each centroid: the distances between all pairs of cluster centroids are calculated, and for each centroid the threshold is half of its minimum distance from the remaining centroids. The threshold distance is denoted by dc(i) for the cluster Ci. To assign a data point to a cluster, take a point p in the dataset, calculate its distance from the centroid of cluster i and compare it with dc(i). If it is less than or equal to dc(i), assign the data point p to cluster i; otherwise calculate its distance from the other centroids. This process is repeated until the data point p is assigned to one of the clusters. If the data point p is not assigned to any cluster in this way, it is assigned to the cluster whose centroid is at the minimum distance from p. The centroid is then updated by calculating the mean of the data points in the cluster.
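The two ideas of this approach, segment-based initial centers and threshold-based assignment, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation of [1]: the function names are ours, the grid uses k groups along every dimension, equal frequencies are simply ordered by cell index (the merging of equal-frequency segments is not reproduced), fewer than k centres are returned if fewer than k cells are occupied, and k >= 2 is required.

```python
import math
from collections import defaultdict


def euclidean(x, y):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def grid_initial_centres(data, k):
    """Means of the points in the k most populated cells of a k-per-axis grid."""
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]
    # Width of a group along each dimension; 1.0 avoids a zero width.
    widths = [((highs[d] - lows[d]) / k) or 1.0 for d in range(dims)]

    cells = defaultdict(list)
    for p in data:
        # min(..., k - 1) keeps the largest value inside the last group.
        key = tuple(
            min(int((p[d] - lows[d]) / widths[d]), k - 1) for d in range(dims)
        )
        cells[key].append(p)

    # Assumption: ties in frequency are ordered by cell index.
    busiest = sorted(cells.items(), key=lambda kv: (-len(kv[1]), kv[0]))[:k]
    return [tuple(sum(c) / len(pts) for c in zip(*pts)) for _, pts in busiest]


def threshold_assign(data, centres):
    """Assign each point using the half-minimum-distance threshold dc(i)."""
    k = len(centres)
    # dc[i]: half the distance from centre i to its nearest other centre.
    dc = [
        0.5 * min(euclidean(centres[i], centres[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    labels = []
    for p in data:
        chosen = None
        dists = []
        for j in range(k):
            d = euclidean(p, centres[j])
            dists.append(d)
            if d <= dc[j]:
                chosen = j  # within the threshold: no further centres needed
                break
        if chosen is None:
            # No threshold matched: fall back to the nearest centre.
            chosen = min(range(k), key=dists.__getitem__)
        labels.append(chosen)
    return labels
```

A point that lies within dc(i) of centre i cannot be closer to any other centre, which is why the inner loop may stop at the first match.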


Pseudo code for the Modified k-Means algorithm is as follows:
Input: Dataset of N data points D (i = 1 to N); desired number of clusters = k
Output: N data points clustered into k clusters.
Steps:
1. Input the data set and value of k.
2. If the value of k is 1 then Exit.
3. Else
4. /* divide the data point space into k*k segments, i.e. k vertically and k horizontally */
5. For each dimension {
6. Calculate the minimum and maximum value of data points.
7. Calculate the range of group (RG) using equation 7:
   $$ R_G = \frac{\max - \min}{k} \qquad (7) $$
8. Divide the data point space into k groups of width RG
9. }
10. Calculate the frequency of data points in each partitioned space.
11. Choose the k highest-frequency groups.
12. Calculate the mean of each selected group. /* These will be the initial centroids of the k clusters. */
13. Calculate the distance between each pair of clusters using equation 8:
   $$ d(C_i, C_j) = \{ d(M_i, M_j) : (i, j) \in [1, k] \ \&\ i \neq j \} \qquad (8) $$
   Where d(Ci, Cj) is the distance between centroids i and j.
14. Take the minimum distance for each cluster and halve it using equation 9:
   $$ dc(i) = \frac{1}{2} \min_{j \neq i} \{ d(C_i, C_j) \} \qquad (9) $$
   Where dc(i) is half of the minimum distance of the ith cluster from the other remaining clusters.
15. For each data point Zp, p = 1 to N {
16. For each cluster j = 1 to k {
17. Calculate d(Zp, Mj) using equation 10:
   $$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (10) $$
   where d(x, y) is the distance between vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).
18. If (d(Zp, Mj) <= dc(j)) {
19. Then assign Zp to cluster Cj.
20. Break;
21. }
22. Else
23. Continue;
24. }
25. If Zp does not belong to any cluster then
26. assign Zp to the cluster Ci with min(d(Zp, Mi)), where i ∈ [1, k]
27. }
28. Check the termination condition of the algorithm; if satisfied
29. Exit.
30. Else
31. Calculate the centroid of each cluster using equation 11:
   $$ M_j = \frac{1}{n_j} \sum_{Z_p \in C_j} Z_p \qquad (11) $$
   Where Mj is the centroid of cluster j and nj is the number of data points in cluster j.
32. Go to step 13.

VI. MODIFIED APPROACH II

In the work of [2], the author calculates the distance between every pair of data points, selects the pair with the minimum distance and removes it from the dataset. The data point closest to this initial set is then added to it, and this is repeated until a threshold size is reached. If the number of subsets formed is less than k, the distances among the remaining data points are calculated again and the process is repeated until k subsets are formed.

The first phase determines the initial centroids. For this, compute the distance between each data point and all other data points in the set D. Then find the closest pair of data points, form a set A1 consisting of these two data points, and delete them from the set D. Then determine the data point in D which is closest to the set A1, add it to A1 and delete it from D. Repeat this procedure until the number of elements in the set A1 reaches a threshold. Then form another data-point set A2 in the same way. Repeat this until k such sets of data points are obtained. Finally, the initial centroids are obtained by averaging all the vectors in each data-point set. The Euclidean distance is used for determining the closeness of each data point to the cluster centroids.

The next phase assigns points to the clusters. The main idea is to maintain two simple data structures that retain, for every data object, the label of its cluster and its distance to the nearest cluster center during each iteration, so that they can be used in the next iteration. In the next iteration we calculate the distance between the current data object and the new center of its cluster; if this distance is smaller than or equal to the distance to the old center, the data object stays in the cluster to which it was assigned in the previous iteration. Therefore there is no need to calculate the distance from this data object to the other k-1 cluster centers, which saves the time of computing those k-1 distances. Otherwise, we must calculate the distance from the current data object to all k cluster centers, find the nearest cluster center and assign the point to it, and then record the label of the nearest cluster center and the distance to it. Because in each iteration some data points remain in their original cluster, the distances for those points are not recalculated, which reduces the total distance-computation time and thereby enhances the efficiency of the algorithm.
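The first phase described above can be sketched in Python as follows. This is a minimal illustration, not the code of [2]: the function name `closest_pair_centroids` and the `fraction` parameter are ours, the threshold 0.75*(N/k) is the one used in the pseudo code of this section, and the closeness of a point to a set is taken to be its distance to the nearest member of the set, which is an assumption because the text does not define it.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def closest_pair_centroids(data, k, fraction=0.75):
    """Initial centroids from k subsets grown around successive closest pairs."""
    if len(data) < 2 * k:
        raise ValueError("need at least two points per subset")
    remaining = list(data)
    # Subset size threshold; at least the two points of the seed pair.
    target = max(2, int(fraction * len(data) / k))
    centroids = []
    for _ in range(k):
        # Closest pair among the points that are still unassigned.
        i, j = min(
            (
                (a, b)
                for a in range(len(remaining))
                for b in range(a + 1, len(remaining))
            ),
            key=lambda ab: euclidean(remaining[ab[0]], remaining[ab[1]]),
        )
        subset = [remaining[i], remaining[j]]
        remaining = [p for idx, p in enumerate(remaining) if idx not in (i, j)]
        # Grow the subset with the point nearest to any of its members.
        while len(subset) < target and remaining:
            nxt = min(
                range(len(remaining)),
                key=lambda r: min(euclidean(remaining[r], q) for q in subset),
            )
            subset.append(remaining.pop(nxt))
        # The centroid is the arithmetic mean of the subset.
        centroids.append(tuple(sum(c) / len(subset) for c in zip(*subset)))
    return centroids
```

The search for the closest pair examines every pair of remaining points, so this phase is quadratic in the number of points, which is worth keeping in mind when reading the execution times reported for this approach.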


Pseudo code for the modified k-Means algorithm is as follows:
Input: Dataset D of N data points (i = 1 to N); desired number of clusters = k
Output: N data points clustered into k clusters.
Phase 1:
Steps:
1. Set m = 1;
2. Compute the distance between each data point and all other data points in the dataset D using equation 12:
   $$ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (12) $$
   where d(x, y) is the distance between vectors x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).
3. Find the closest pair of data points from the set D and form a data-point set Am (1 <= m <= k) which contains these two data points; delete these two data points from the set D;
4. Find the data point in D that is closest to the data-point set Am; add it to Am and delete it from D;
5. Repeat step 4 until the number of data points in Am reaches 0.75*(N/k);
6. If m < k, then m = m+1; find another pair of data points from D between which the distance is the shortest, form another data-point set Am and delete them from D; go to step 4;
7. For each data-point set Am (1 <= m <= k) find the arithmetic mean of the vectors of data points in Am; these means will be the initial centroids.
Phase 2:
Steps:
1. Compute the distance of each data point di (1 <= i <= N) to all the centroids Cj (1 <= j <= k) as d(di, Cj) using equation 12;
2. For each data point di, find the closest centroid Cj and assign di to cluster j;
3. Set ClusterId[i] = j; /* j: Id of the closest cluster for point i */
4. Set Nearest_Dist[i] = d(di, Cj);
5. For each cluster j (1 <= j <= k), recalculate the centroids;
6. Repeat
7. For each data point di,
   a. Compute its distance from the centroid of the present nearest cluster;
   b. If this distance is less than or equal to the present nearest distance, the data point stays in the cluster;
   c. Else, for every centroid Cj (1 <= j <= k) compute the distance d(di, Cj);
   d. End for;
8. Assign the data point di to the cluster with the nearest centroid Cj;
9. Set ClusterId[i] = j;
10. Set Nearest_Dist[i] = d(di, Cj);
11. End for;
12. For each cluster j (1 <= j <= k), recalculate the centroids;
until the convergence criterion is met, i.e. either no center is updated or no point moves to another cluster.
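Phase 2 above can be sketched in Python as follows; the two lists play the role of ClusterId and Nearest_Dist. This is an illustration under stated assumptions, not the code of [2]: the names `cached_reassignment` and `max_iter` are ours, the loop stops when no point changes cluster, Nearest_Dist is left unchanged for a point that stays (the pseudo code does not say whether it is refreshed), and an empty cluster keeps its old centre.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cached_reassignment(data, centres, max_iter=100):
    """k-Means assignment that skips the full search when a point got closer."""
    centres = [tuple(c) for c in centres]
    k = len(centres)

    def nearest(p):
        # Full search over all k centres.
        j = min(range(k), key=lambda c: euclidean(p, centres[c]))
        return j, euclidean(p, centres[j])

    def recompute():
        for j in range(k):
            members = [p for p, lab in zip(data, cluster_id) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centre
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))

    # Steps 1-5: initial assignment, cached labels and distances, new centres.
    cluster_id, nearest_dist = (list(t) for t in zip(*(nearest(p) for p in data)))
    recompute()

    for _ in range(max_iter):
        moved = False
        for i, p in enumerate(data):
            # Step 7a-b: one distance to the point's own (updated) centre.
            if euclidean(p, centres[cluster_id[i]]) <= nearest_dist[i]:
                continue  # the point stays; the other k-1 distances are skipped
            # Step 7c and 8-10: full search, then refresh the cached values.
            j, d = nearest(p)
            moved = moved or j != cluster_id[i]
            cluster_id[i], nearest_dist[i] = j, d
        recompute()
        if not moved:
            break
    return centres, cluster_id
```

The saving comes from the `continue` branch: a point whose own centre has not moved further away costs one distance computation per iteration instead of k.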

VII. PROPOSED APPROACH

The changes are based on the selection of the initial k centers and on making the algorithm work dynamically, i.e. instead of running the algorithm for different values of k, we design the algorithm so that it decides by itself how many clusters there are in a given dataset. The two modifications are as follows:
1. A method to select the initial k centers.
2. Making the algorithm dynamic.

Proposed Algorithm: The algorithm consists of three phases.

The first phase determines the initial k centroids. In this phase the user inputs the dataset and the value of k. The data space is divided into k*k segments as discussed in [1]. After dividing the data space we choose the k segments with the highest frequency of points. If some segments have the same frequency, the adjacent segments with the same least frequency are merged until we get k segments. We then find the distance of each point in each selected segment from the origin; these distances are sorted for each segment and the middle point is selected as the center for that segment. This step is repeated for each of the k selected segments. These points represent the initial k centroids.

The second phase assigns points to clusters based on the minimum distance between the point and the cluster centers. The distance measure used is the Euclidean distance. It then computes the mean of each cluster formed as the next center. This process is repeated until there are no more center updates.

The third phase is where the algorithm iterates dynamically to determine the right number of clusters. To choose the right number of clusters we use the Silhouette validity index.

Pseudo code for the proposed algorithm:
Input: Dataset of N points; desired number of clusters k.
Output: N points grouped into k clusters.
Phase 1: Finding initial centroids
Steps:
1. Input the dataset and value of k.
2. Divide the data point set into k*k segments /* k vertically and k horizontally */
3. For each dimension {
4. Calculate the minimum and maximum value of data points.
5. Calculate the width (Rg) using equation 13:
   $$ R_g = \frac{\max - \min}{k} \qquad (13) $$
   }
6. Calculate the frequency of data points in each segment.
7. Choose the k highest-frequency segments.
8. For each segment i = 1 to k {
9. For each point j in the segment i {
10. Calculate the distance of point j from the origin }
11. Sort these distances in ascending order in matrix D
12. Select the middle point distance.
13. The co-ordinates corresponding to the distance in step 12 are chosen as the initial center for the ith cluster. }
14. These k co-ordinates are stored in matrix C, which represents the initial centroids.
Phase 2: Assigning points to the clusters
Steps:
1. Repeat
2. For each data point p = 1 to N {
3. For each cluster j = 1 to k {
4. Calculate the distance between point p and the centroid cj of cluster Cj using equation 14:
   $$ d(p, c_j) = \sqrt{\sum_{i=1}^{n} (p_i - c_{j,i})^2} \qquad (14) $$
   } }
5. Assign p to the cluster with min{d(p, cj)}, where j ∈ [1, k].
6. Check the termination condition of the algorithm; if satisfied
7. Exit
8. Else
9. Calculate the new centroids of the clusters using equation 15:
   $$ c_j = \frac{1}{n_j} \sum_{p \in C_j} p \qquad (15) $$
   Where nj is the number of points in cluster Cj.
10. Go to step 1.
Phase 3: Determining the appropriate number of clusters
For the given value of k, phases 1 and 2 are run three times, using k-2, k and k+2 clusters. The three corresponding Silhouette values are calculated as discussed in Section III and are denoted by Sk-2, Sk and Sk+2. The appropriate number of clusters is then found using the following steps.
Steps:
1. If Sk-2 < Sk and Sk > Sk+2, then run phase 1 and phase 2 using k+1 and k-1 to find the corresponding Sk+1 and Sk-1. The maximum of the three values Sk-1, Sk and Sk+1 then determines the appropriate number of clusters. For example, if Sk+1 is the maximum, then the number of clusters formed by the algorithm is k+1.


2. Else if Sk+2 > Sk and Sk+2 > Sk-2, then run phase 1 and phase 2 using k+1, k+3 and k+4 to find the corresponding Sk+1, Sk+3 and Sk+4. The value of k corresponding to the maximum of Sk+1, Sk+2, Sk+3 and Sk+4 is returned.
3. Else if Sk+2 < Sk-2 and Sk < Sk-2, then run phase 1 and phase 2 using k-1, k-2, k-3 and k-4 to find the corresponding Sk-1, Sk-2, Sk-3 and Sk-4. The value of k corresponding to the maximum of Sk-1, Sk-2, Sk-3 and Sk-4 is returned.
4. Stop.
Thus the algorithm terminates by itself where the best value of k is found. This value of k gives the appropriate number of clusters for the given data set.

VIII. RESULT ANALYSIS

The proposed algorithm is implemented and its results are compared with those of the modified approaches [1] and [2] in terms of execution time and initial centers chosen.
1. The total time taken by the proposed algorithm to form clusters and dynamically determine the appropriate number of clusters is less than the total time taken by the algorithm in [1] when it is run separately for different values of k, such as k = 2, 3, 4, 5, 6, 7, etc., even though the proposed algorithm itself runs for several values of k.
2. We define a new method to determine the initial centers that is based on the middle value rather than the mean value. The reason is that the middle value better represents the distribution: the mean is influenced by very large and very small values, whereas the middle value is not.
The results show that the algorithm works dynamically and is also an improvement over the Original k-Means. Table I shows the results of running the algorithm in [1] on the wine dataset from the UCI repository for k = 3, 4, 5, 6, 7 and 9. The algorithm is run for these values of k because the proposed algorithm, when initially fed k = 7, automatically runs for these same values of k, so the total execution times of both algorithms can be compared. The results show that the proposed algorithm takes less time than running the algorithm in [1] individually for the different values of k.

TABLE I: RESULTS OF ALGORITHM [1] FOR WINE DATASET
Sr. no.   Value of k   Silhouette validity index   Execution time (s)
1.        3            0.4437                      3.92
2.        4            0.3691                      4.98
3.        5            0.3223                      2.89
4.        6            0.2923                      7.51
5.        7            0.2712                      3.56
6.        9            0.2082                      11.96
TOTAL EXECUTION TIME                               34.82

The proposed algorithm performs its different runs and stops at:
maximum silhouette value = 0.443641 for best value of k = 3
Elapsed time is 30.551549 seconds.
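The way Phase 3 of Section VII arrives at k = 3 from the initial value k = 7 can be traced with a short Python sketch. The function `choose_k` and its `score` argument are hypothetical names: `score(k)` stands for running phases 1 and 2 with k clusters and returning the silhouette value. Here the silhouette values of Table I (obtained with the algorithm in [1]) are used only as a stand-in lookup for that call. Two further assumptions are labelled in the code: the handling of ties, which the steps do not cover, and the restriction to k >= 2, which means the sketch expects an initial k of at least 4.

```python
def choose_k(k0, score):
    """Select the number of clusters following the three cases of Phase 3."""
    cache = {}

    def s(k):
        # Each value of k is clustered and scored at most once.
        if k not in cache:
            cache[k] = score(k)
        return cache[k]

    low, mid, high = s(k0 - 2), s(k0), s(k0 + 2)
    if low < mid and mid > high:
        # Step 1: the peak is around k0.
        candidates = [k0 - 1, k0, k0 + 1]
    elif high > mid and high > low:
        # Step 2: the silhouette is still rising towards larger k.
        candidates = [k0 + 1, k0 + 2, k0 + 3, k0 + 4]
    elif high < low and mid < low:
        # Step 3: the silhouette is larger for smaller k.
        candidates = [k0 - 1, k0 - 2, k0 - 3, k0 - 4]
    else:
        # Assumption: ties are not covered by the steps; keep the best so far.
        candidates = [k0 - 2, k0, k0 + 2]
    # Assumption: fewer than 2 clusters is not meaningful for the silhouette.
    candidates = [k for k in candidates if k >= 2]
    return max(candidates, key=s)


# Silhouette values of Table I, used as a stand-in for score(k).
table_1 = {3: 0.4437, 4: 0.3691, 5: 0.3223, 6: 0.2923, 7: 0.2712, 9: 0.2082}
best_k = choose_k(7, table_1.__getitem__)
```

With k0 = 7 the first three evaluations are k = 5, 7 and 9. Since S5 is the largest of the three, step 3 applies and k = 6, 5, 4 and 3 are compared, which explains why exactly the values k = 3, 4, 5, 6, 7 and 9 appear in Table I and why the search ends at k = 3.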

Thus, from the results it is clear that the algorithm stops by itself when it finds the right number of clusters, and that the proposed algorithm takes less time than the algorithm in [1]. Table II shows the execution times of both algorithms. It also shows the initial value of k fed to our algorithm and the best value of k at which the algorithm stops. Experiments are performed on random datasets of 50, 178 and 500 points.

TABLE II: COMPARING RESULTS OF PROPOSED ALGORITHM & ALGORITHM IN [1]
Sr. no.   Dataset      Initial value of k   Best value of k   Proposed algorithm time (s)   Algorithm [1] time (s)
1.        50 points    4                    6                 18.6084                       28.0096
2.        178 points   9                    10                39.1941                       50.2726
3.        500 points   5                    4                 66.6134                       91.9400

Table III shows the comparison between the Original k-Means, modified approach II and the proposed algorithm. Comparing execution times, the proposed algorithm takes less time than the Original k-Means on the 178-point and 500-point datasets, although not on the 50-point dataset, and for the largest dataset of 500 points the proposed algorithm also outperforms modified approach II.

TABLE III: EXECUTION TIME (s) COMPARISON
Sr. no.   Dataset      Original k-Means   Modified approach II   Proposed algorithm
1.        50 points    15.1727            11.4979                18.6084
2.        178 points   74.5168            21.6497                39.1941
3.        500 points   86.7619            87.2461                66.6134

From all the results we can conclude that, although the procedure of the proposed algorithm is long, it spares the user from running the algorithm for different values of k, as is necessary with the other three algorithms discussed in the previous sections. The proposed algorithm dynamically iterates and stops at the best value of k. Figures 1 to 3 show the silhouette plots for the three datasets of random points discussed above, depicting how close each point is to the other members of its own cluster. The plots also reveal any point placed in an incorrect cluster, since the silhouette index value for such a point is negative. Figure 4 shows the execution time comparison of all the algorithms discussed in the paper.


Figure 1 Silhouette plot for 50 points

Figure 2 Silhouette plot for 178 points

Figure 3 Silhouette plot for 500 points


Figure 4 Execution time (s) comparison of different algorithms

Table IV shows a comparison of all four algorithms discussed in the paper on the basis of the experimental results.

TABLE IV: COMPARISON OF ALGORITHMS

Initial centers
  Original k-Means: Always random, and thus different clusters for the same value of k for given data.
  Modified Approach I: The way of selecting initial centers is fixed, by always choosing the initial centers in the highest-frequency segments.
  Modified Approach II: Initial centers are selected by always choosing points based on the similarity between points.
  Proposed Algorithm: Initial centers are fixed by choosing the centers in the highest-frequency segments; each center is the middle point of that segment's points, ordered by distance from the origin.

Redundant data
  Original k-Means: Can work with data redundancies.
  Modified Approach I: Suitable.
  Modified Approach II: Not suitable for data with redundant points.
  Proposed Algorithm: Can work with redundant data.

Dead unit problem
  Original k-Means: Yes.
  Modified Approach I: No.
  Modified Approach II: No.
  Proposed Algorithm: No.

Value of k
  Original k-Means: Fixed input parameter.
  Modified Approach I: Fixed input parameter.
  Modified Approach II: Fixed input parameter.
  Proposed Algorithm: Initial value given as input; the algorithm dynamically iterates and determines the best value of k for the given data.

Execution time
  Original k-Means: More.
  Modified Approach I: Less than Original k-Means, but more than the other two algorithms.
  Modified Approach II: Less than Original k-Means and Modified Approach I, but more than the Proposed Algorithm.
  Proposed Algorithm: Less than all other three algorithms.


IX. CONCLUSION

In this paper we presented different approaches to k-Means clustering and compared them, stressing their pros and cons. Another issue discussed in the paper is clustering validity, which we measured using the silhouette validity index. For a given dataset this index shows which value of k produces compact and well-separated clusters. The paper presents a new method for selecting initial centers and a dynamic approach to k-Means clustering, so that the user need not check the clusters for different values of k. Instead the user inputs an initial value of k and the algorithm stops after it finds the best value of k, i.e. when it attains the maximum silhouette value. Experiments also show that the proposed dynamic algorithm takes less computation time than the algorithm in [1] on all datasets tested, and less than all three other algorithms on the largest dataset tested (500 points).

X. ACKNOWLEDGEMENTS

I am eternally grateful to my research supervisor Dr. M.P.S Bhatia for his invigorating support, valuable suggestions and guidance. I thank him for his supervision and for patiently correcting my work. It has been a great experience and I have gained a lot. I am thankful to almighty God, who has given me the power, good sense and confidence to complete my research analysis successfully. I also thank my parents and my friends, who were a constant source of encouragement. I would also like to thank Navneet Singh for his appreciation.

REFERENCES

Proceedings Papers
[1] Ran Vijay Singh and M.P.S Bhatia, Data Clustering with Modified K-means Algorithm, IEEE International Conference on Recent Trends in Information Technology, ICRTIT 2011, pp 717-721.
[2] D. Napoleon and P. Ganga lakshmi, An Efficient K-Means Clustering Algorithm for Reducing Time Complexity using Uniform Distribution Data Points, IEEE 2010.
Journal Papers
[3] Tajunisha and Saravanan, Performance Analysis of k-means with different initialization methods for high dimensional data, International Journal of Artificial Intelligence & Applications (IJAIA), Vol.1, No.4, October 2010.
[4] Neha Aggarwal and Kriti Aggarwal, A Mid-point based k-mean Clustering Algorithm for Data Mining, International Journal on Computer Science and Engineering (IJCSE), 2012.
[5] Barile Barisi Baridam, More work on k-means Clustering algorithm: The Dimensionality Problem, International Journal of Computer Applications (0975-8887), Volume 44, No.2, April 2012.
Proceedings Papers
[6] Shi Na, Li Xumin, Guan Yong, Research on K-means clustering algorithm, Proc. of Third International Symposium on Intelligent Information Technology and Security Informatics, IEEE 2010.

[7] Ahamad Shafeeq and Hareesha, Dynamic clustering of data with modified K-mean algorithm, Proc. International Conference on Information and Computer Networks (ICICN 2012), IPCSIT vol. 27 (2012), IACSIT Press, Singapore, 2012.
Research
[8] Kohei Arai and Ali Ridho Barakbah, Hierarchical K-means: an algorithm for centroids initialization for k-Means, Reports of the Faculty of Science and Engineering, Saga University, Vol. 26, No. 1, 2007.
Books
[9] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (Second Edition).
