
**An Improved Clustering Algorithm**

Xin Rui, HeBei Electric Power Research Institute, Shijiazhuang 050021, China
Duo Chunhong, Institute of Computer, North China Electric Power University, Baoding 071003, China, duochunhong@163.com

Abstract

The K-means algorithm based on partition and the DBSCAN algorithm based on density are analyzed. Combining the advantages and disadvantages of the two algorithms, the improved algorithm DBSK is proposed. Because the data set is partitioned, DBSK reduces the memory requirement; a method of computing the variable values is put forward; for uneven data sets, because different variable values are adopted in each local data set, the dependence on global parameters is reduced, so the clustering result is better. A simulative experiment is carried out, which proves the algorithm's feasibility and validity.

**1. Introduction**

Data mining can find useful information, extract the concerned patterns from a great deal of data, and dig out the changing rules and the dependency relations in data. Via data mining, people can discover data's hidden value and obtain powerful support for decision-making and scientific discovery [1]. Clustering analysis is one of the important research domains in data mining. Via clustering, people can distinguish dense areas from sparse areas and find the overall distribution pattern. Because there are no obvious variables serving as attributes of the data, hidden attributes need to be detected. The database is divided into several groups, each group contains several similar records, and rules or knowledge useful to the decision-maker are then obtained [2]. In many domains there are no prior rules or proven patterns, and it is unknown in advance how many clusters should be formed, so clustering is a promising technique for many fields including data mining, pattern recognition, image processing, market research and other applications.

**2. Related clustering algorithms**

2.1. K-means algorithm

The K-means algorithm randomly selects K points as initial cluster centers, then calculates the distance between each sample and every cluster center; each point is assigned to the cluster of the nearest center, and new cluster centers are then calculated. If two successive sets of cluster centers show no change, the adjustment of the samples is finished and the convergence function Jc has converged. The main merits of the K-means algorithm are that it is simple, fast and effective when dealing with large databases. The demerits are that different initial values may lead to different clustering results, and that the hill-climbing technique can easily get trapped in a local minimum. These two limitations restrict its application scope [3-5]. Many researchers have proposed methods for choosing the initial cluster centers: ① partition the whole mixed sample set into K clusters intuitively, and compute the average value of each cluster as an initial cluster center; ② select representative points through density-based methods as initial cluster centers; ③ calculate the representative points of K clusters from (K-1) clusters; ④ find the initial cluster centers through the max-min cluster method; ⑤ select initial values and cluster many times, then keep the best clustering result; ⑥ combine clustering with a genetic algorithm or an immune programming method. The samples need to be adjusted constantly in order to obtain new cluster centers. The time complexity should be analyzed in order to widen the application scope of the K-means algorithm.
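The iteration described in Section 2.1 can be sketched in a few lines. This is a minimal illustrative implementation for 2-D points, not the paper's code: the function name `k_means`, the optional `init` argument and the tolerance `tol` are our own.

```python
import random

def k_means(points, k, init=None, tol=1e-9, max_iter=100):
    """Minimal 2-D K-means sketch: points is a list of (x, y) tuples."""
    centers = list(init) if init is not None else random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # assign every point to the cluster of the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                  + (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        # recompute each center as the mean of its cluster
        new_centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        # stop when the centers no longer move (Jc has converged)
        moved = max(abs(a[0] - b[0]) + abs(a[1] - b[1])
                    for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved < tol:
            break
    return centers, clusters
```

Passing explicit initial centers via `init` makes the sensitivity to the starting points, noted above, easy to reproduce: different choices can converge to different partitions.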

978-0-7695-3311-7/08 $25.00 © 2008 IEEE DOI 10.1109/ISCID.2008.218



2.2. DBSCAN algorithm

The DBSCAN algorithm first selects an arbitrary object o from the database. If the number of data points in the Eps-neighborhood of o is larger than MinPts, then o is marked as a core object, and the data in its Eps-neighborhood become the objects to be extended next; otherwise o is marked as "noise" and will not be expanded. If o is a core object, there is a cluster C about Eps and MinPts, determined by any core object in it, and the algorithm searches all objects that are density-reachable from o about Eps and MinPts. Area querying and cluster extension are executed repeatedly until an integrated cluster is obtained [6]. DBSCAN treats each high-density area as a cluster; it discovers clusters of arbitrary shape and effectively handles noise in spatial databases [7-8]. However, it has the following disadvantages: ① It requires large volumes of memory and incurs substantial I/O costs when dealing with large-scale databases, because it operates directly on the entire database; the larger the data set is, the more running time is spent. ② It is sensitive to the input parameters, and different parameter values may lead to different clustering results. ③ Because Eps and MinPts are global variables in the DBSCAN algorithm, clustering quality degrades when the cluster density and the distance between clusters are uneven. When the data distribution is uneven, if we choose a smaller value of Eps according to the dense clusters, the number of data points in the neighborhoods of sparse clusters may be smaller than MinPts, and these objects will be recognized as "noise"; if we choose a larger value of Eps according to the sparse clusters, clusters that have high density and are near each other may be combined into one cluster, and their differences will be ignored. It is difficult to choose a right value under these two circumstances.

**3. The improved DBSK algorithm**

DBSCAN operates directly on the whole data set; when the data set is very large, it requires volumes of memory and incurs substantial I/O costs [9]. If the data set is partitioned according to some rules before processing, the I/O costs and the consumed time will be reduced. DBSK therefore first optimizes the K-means algorithm by a sampling technique: the choice of initial points and the adjustment after each iteration operate on the sample data, so the convergence speed is greatly improved. DBSK partitions the data set with this optimized K-means, so several local data sets are obtained; it computes the value of MinPtsi for each local data set according to its characteristics, applies the DBSCAN algorithm to each local data set with these local parameter values, and finally merges the clustering results of the local data sets to obtain the clustering result of the entire data set. However, one large cluster may be split across two different local data sets, or a data point that belongs to one local data set may be placed into another, so the clustering results must be post-processed in order to clear up the influence of the partition. Because different parameter values are adopted in each local data set, the dependence on global parameters is reduced, and in this way the clustering result is better.

3.1. Determining the value of the local parameter MinPtsi

The following method is used to determine each local parameter: choose a fixed value of Eps, then calculate MinPtsi for each local data set:

MinPtsi = (EpsVolumei / totalVolumei) × Ni

where Ni is the number of data points in the local data set whose cluster center is Ii, and totalVolumei represents the volume of the super-cuboid whose cluster center is Ii [10]. According to the dimension of the data, EpsVolumei takes different values:

One-dimensional data: EpsVolumei = 2 × Eps
Two-dimensional data: EpsVolumei = π × Eps²
Three-dimensional data: EpsVolumei = (4/3) × π × Eps³

This paper mainly discusses the two-dimensional case:

MinPtsi = (π × Eps² / totalVolumei) × Ni
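For the two-dimensional case, the local-parameter formula above can be sketched as follows. The bounding-rectangle area of the local data set stands in for the super-cuboid volume totalVolumei, and the fallback for a degenerate (zero-area) partition is our own safeguard, not from the paper.

```python
import math

def local_min_pts(points, eps):
    """MinPts_i = (EpsVolume_i / totalVolume_i) * N_i for one 2-D local data set,
    with EpsVolume_i = pi * eps^2."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # totalVolume_i: area of the bounding rectangle of the local data set
    total_volume = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if total_volume == 0:
        return 1  # degenerate partition: fall back to the minimum
    eps_volume = math.pi * eps ** 2  # EpsVolume_i in two dimensions
    return max(1, round(eps_volume / total_volume * n))
```

The estimate scales with local density, which is the point of the method: a dense partition yields a large MinPtsi, a sparse one a small MinPtsi, instead of one global value for both.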

3.2. Merging the clustering results of the local data sets

Suppose points a and b lie respectively in cluster A and cluster B. Cluster A and cluster B are united if they satisfy any of the following conditions:
(1) b is density-connected from a about Eps and MinPtsA, and a is not density-connected from b about Eps and MinPtsB; then unite cluster A and cluster B, and MinPts = MinPtsA.
(2) a is density-connected from b about Eps and MinPtsB, and b is not density-connected from a about Eps and MinPtsA; then unite cluster A and cluster B, and MinPts = MinPtsB.
(3) b is density-connected from a about Eps and MinPtsA, and a is density-connected from b about Eps and MinPtsB; then unite cluster A and cluster B, and MinPts = min{MinPtsA, MinPtsB}.
This is shown in Figure 1 and Figure 2.

Figure 1: b is density-connected from a about Eps and MinPtsA
Figure 2: a is not density-connected from b about Eps and MinPtsB

3.3. The basic framework of the DBSK algorithm

Input: control parameters, including the sampling rate, k, K and Eps; the data set to be clustered.
Output: clustering results.
Step 1: initialize the control parameters.
Step 2: randomly select a subset of the data set according to the sampling rate.
Step 3: suppose the size of the subset is n, and do the following operations:
(1) select K initial cluster centers Zj(I), j = 1, 2, ..., K;
(2) calculate the distance between each data object in the subset and every cluster center, D(xi, Zj(I)), i = 1, 2, ..., n, j = 1, 2, ..., K; if D(xi, Zk(I)) = min{D(xi, Zj(I)), j = 1, 2, ..., K} is satisfied, then xi ∈ wk;
(3) calculate the new cluster centers Zj(I+1) = (1/nj) Σ (i = 1..nj) xi(j), j = 1, 2, ..., K, and the convergence function Jc(I) = Σ (j = 1..K) Σ (k = 1..nj) ||xk(j) − Zj(I)||²;
(4) judgment: if |Jc(I) − Jc(I−1)| < ξ, the iteration is finished; otherwise set I = I + 1 and return to (2);
(5) join the remaining points to the nearest cluster; each cluster is a local data set.
Step 4: for each local data set, calculate the local parameter MinPtsi and apply the DBSCAN clustering algorithm according to Eps and MinPtsi.
Step 5: merge the clustering results of the local data sets.
Step 6: output the clustering results.

**4. Time complexity of the DBSK algorithm**

The time complexity of the K-means algorithm is O(nkd), where n is the total number of sample points, k is the appointed cluster number, and d is the dimension of the sample points. The average running time complexity of DBSCAN is O(n log n); if no spatial index is used, the computational complexity is O(n²). The DBSK algorithm first applies the sampling technique to get a sample data set of size m (suppose m = n/2), then runs the K-means algorithm with time complexity O(mkd); the time complexity of DBSCAN in each local data set is O((n/k) log(n/k)), and the time for uniting the data sets is O(n). Therefore, the total time complexity of the DBSK algorithm is O(mkd) + k × O((n/k) log(n/k)) + O(n). The time complexity of the three algorithms is shown in Figure 3.
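The cost comparison above can be checked with a small back-of-the-envelope calculator. Constant factors are dropped, as in the big-O analysis; `m = n // 2` follows the paper's sampling assumption, and the function names are our own.

```python
import math

def kmeans_cost(n, k, d):
    return n * k * d  # O(nkd)

def dbscan_cost(n, indexed=True):
    # O(n log n) with a spatial index, O(n^2) without
    return n * math.log2(n) if indexed else n * n

def dbsk_cost(n, k, d):
    m = n // 2                                     # sample size m = n/2
    per_set = (n // k) * math.log2(n // k)         # O((n/k) log(n/k)) per local data set
    return kmeans_cost(m, k, d) + k * per_set + n  # plus O(n) for merging
```

For n = 2^20 points, k = 8 and d = 2, the DBSK estimate is several orders of magnitude below the cost of running DBSCAN without a spatial index on the whole data set.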

Figure 3: The time complexity of the three algorithms

**5. The simulative experiment**

In order to test the feasibility and effectiveness of the DBSK algorithm, a simulative experiment was carried out. In Figure 4 and Figure 5, different shapes represent different clusters, and hollow circles indicate noise points. From the clustering results we can see that the DBSCAN algorithm does not handle the uneven data set very well: it is not good enough for the sparse data points on the left, and a few data points that should belong to some cluster are treated as noise. The DBSK algorithm recognizes more of the clusters that have few data points and are relatively sparse than the DBSCAN algorithm does; for the sparse data points on the left, the number of data points treated as noise is distinctly reduced.

Figure 4: The clustering result of the DBSCAN algorithm, Eps = 18, MinPts = 4
Figure 5: The clustering result of the DBSK algorithm, K = 3, Eps = 18, sampling rate = 30%

**6. Conclusions**

Combining the sampling technique, the K-means algorithm and the DBSCAN algorithm, an improved algorithm, DBSK, has been proposed. The DBSK algorithm partitions the data set and selects different parameter values according to each local data set. How to determine the size of the sample (also called the sampling complexity) is a question that needs further study. In this paper the DBSK algorithm is applied only to two-dimensional data sets, but massive data sets often have multi-dimensional attributes; three-dimensional and multi-dimensional complex data have not been discussed, so more work should be carried out on this aspect of data mining.

**7. References**

[1] Chen Lei-da, "Data mining method, application, tools", Information Systems Management, 2001, 65-70.
[2] David Hand, Heikki Mannila, and Padhraic Smyth, "Principles of Data Mining", The MIT Press, 2001.
[3] D. T. Pham, S. S. Dimov, and C. D. Nguyen, "Selection of K in K-means clustering", Mechanical Engineering Science, 2004, 219(C), 103-119.
[4] Kuo R. J., Ho L. M., and Hu C. M., "Integration of Self-organizing Feature Map and K-means Algorithm for Market Segmentation", Computers and Operations Research, 2002, 29(11), 1475-1493.
[5] D. S. Modha and W. S. Spangler, "Feature Weighting in K-means Clustering", 2003, 217-237.
[6] Martin Ester, Hans-Peter Kriegel, et al., "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining.
[7] Chen Ning, Chen An, and Zhou Longxiang, "An Effective Clustering Algorithm in Large Transaction Databases", Journal of Software, 2001, 12(7), 476-484.
[8] Rong Qiusheng, "Research and Implementation of Clustering Algorithm Based on DBSCAN", Computer Applications, 2004, 24(4), 45-47.
[9] HE Zhong-sheng, LIU Zong-tian, and ZHUANG Yan-bin, "Study on data partition DBSCAN using genetic algorithm", MINI-MICRO System, 2005, 27(1), 114-116.
[10] Sun Si, Yan Junbiao, and Guo Guoqiang, "Data-partition-based Parallel DBSCAN Algorithm", College of Computer Science, Chongqing University, 2006, 38-43.
