2008 International Symposium on Computational Intelligence and Design

An Improved Clustering Algorithm

Xin Rui
HeBei Electric Power Research Institute, Shijiazhuang 050021, China

Duo Chunhong
Institute of Computer, North China Electric Power University, Baoding 071003, China
duochunhong@163.com

978-0-7695-3311-7/08 $25.00 © 2008 IEEE
DOI 10.1109/ISCID.2008.218
Authorized licensed use limited to: Jabalpur Engineering College. Downloaded on December 25, 2008 at 06:17 from IEEE Xplore. Restrictions apply.

Abstract

The partition-based K-means algorithm and the density-based DBSCAN algorithm are analyzed. Combining the strengths of the two algorithms to offset their weaknesses, an improved algorithm, DBSK, is proposed. Because it partitions the data set, DBSK reduces the memory requirement. A method for computing the parameter value is put forward. For unevenly distributed data sets, DBSK adopts a different parameter value in each local data set, which reduces the dependence on global parameters and improves the clustering result. A simulation experiment is carried out, which demonstrates the feasibility and validity of the algorithm.

1. Introduction

Data mining can find useful information, extract patterns of interest from large amounts of data, and uncover the changing rules and dependency relations within the data. Through data mining, people can discover the hidden value of data and obtain strong support for decision-making and scientific discovery [1].

Cluster analysis is one of the important research areas in data mining. Through clustering, people can distinguish dense regions from sparse regions and find the overall distribution pattern. Because the data have no obvious variables that serve as attributes, hidden attributes need to be detected. The database is divided into several groups, each containing similar records, and rules or knowledge useful to the decision-maker are then obtained [2]. In many domains today there are no a priori rules or proven patterns, and the number of clusters to be formed is unknown, so clustering is a promising technique for many fields, including data mining, pattern recognition, image processing, market research and other applications.

2. Related clustering algorithms

2.1. K-means algorithm

The K-means algorithm randomly selects K points as initial cluster centers and then calculates the distance between each sample and each cluster center. Each point is assigned to the cluster of its nearest center, and the new cluster centers are then calculated. If the cluster centers do not change between two successive iterations, the adjustment of the samples has finished and the convergence function Jc has converged.

The main merits of the K-means algorithm are that it is simple, fast and effective when dealing with large databases. Its demerits are that different initial values may lead to different clustering results, and that its hill-climbing search easily becomes trapped in a local minimum. These two limitations restrict its scope of application [3-5]. Researchers have proposed various methods for choosing the initial cluster centers:
① partition the whole mixed sample set into k clusters intuitively, and use the mean of each cluster as its initial center;
② select representative points through density-based methods as initial cluster centers;
③ calculate the representative points of k clusters from those of (k-1) clusters;
④ find the initial cluster centers through the max-min cluster method;
⑤ select initial values and cluster many times, then keep the best clustering result;
⑥ combine clustering with a genetic algorithm or an immunity programming method.

The samples need to be adjusted constantly in order to obtain new cluster centers. The larger the data set, the longer the running time. The time complexity should therefore be analyzed in order to widen the scope of application of the K-means algorithm. This paper improves the K-means algorithm through sampling: the choice of initial points and the adjustment after each iteration are performed on the sample data, so the convergence speed is greatly improved.

2.2. DBSCAN Algorithm

First, the DBSCAN algorithm selects an arbitrary object o from the database and searches for all objects that are density-reachable from o with respect to Eps and MinPts. If the number of objects in the Eps-neighborhood of o is at least MinPts, o is marked as a core object and the objects in its Eps-neighborhood become the objects to be expanded next; otherwise, o is marked as “noise”. If o is a core object, there is a cluster C with respect to Eps and MinPts, which is determined by any core object in it. Region queries and cluster expansion are executed repeatedly until a complete cluster is obtained [6].

The DBSCAN algorithm treats high-density regions as clusters, discovers clusters of arbitrary shape, and handles noise in spatial databases effectively [7-8]. However, it has the following disadvantages:
① It requires a large amount of memory and incurs heavy I/O costs when dealing with large-scale databases, because it operates directly on the entire database.
② It is sensitive to its input parameters; different parameter values may lead to different clustering results.
③ Because Eps and MinPts are global parameters in the DBSCAN algorithm, clustering quality degrades when the cluster densities and the distances between clusters are uneven. If a smaller value of Eps is chosen to suit the dense clusters, the number of objects in the neighborhood of an object in a sparse cluster may be smaller than MinPts; such objects are recognized as “noise” and are not expanded. As a result, a sparse cluster is split into several clusters with similar characteristics. If a larger value of Eps is chosen to suit the sparse clusters, dense clusters that lie close to each other may be merged into one cluster and their differences ignored. Clearly, it is difficult to choose a suitable value in either case.

3. The improved DBSK Algorithm

DBSCAN operates directly on the whole data set. When the data set is very large, it requires a large amount of memory and incurs substantial I/O costs [9]. If the data set is partitioned according to some rule before processing, the I/O costs and the running time are reduced. Because the data distribution may be uneven, partitioning yields several local data sets, each smaller than the original data set. Local parameter values are chosen according to the characteristics of each local data set, which improves the clustering result. However, one large cluster may be split across two local data sets, or a data point that belongs to one local data set may be assigned to another and marked as “noise”, so the clustering results must be post-processed to remove the effects of partitioning.

DBSK proceeds in three stages. First, it optimizes the K-means algorithm through sampling and partitions the data set. Second, it computes the value of MinPts_i and applies the DBSCAN algorithm to each local data set. Third, it merges the clustering results of the local data sets to obtain the clustering result of the entire data set.

3.1. Determining the value of the local parameter MinPts_i

The following method is used to determine each local parameter: choose a fixed value of Eps, then calculate MinPts_i for each local data set:

$$\mathit{MinPts}_i = \frac{\mathit{EpsVolume}_i}{\mathit{totalVolume}_i} \times N_i$$

Here N_i is the number of data points in the local data set whose cluster center is I_i, and totalVolume_i is the volume of the hyper-cuboid whose cluster center is I_i [10]. EpsVolume_i takes different values depending on the dimension of the data:

One-dimensional data: $\mathit{EpsVolume}_i = 2 \times \mathit{Eps}$
Two-dimensional data: $\mathit{EpsVolume}_i = \pi \times \mathit{Eps}^2$
Three-dimensional data: $\mathit{EpsVolume}_i = \frac{4}{3} \times \pi \times \mathit{Eps}^3$

This paper mainly considers the two-dimensional case:

$$\mathit{MinPts}_i = \frac{\pi \times \mathit{Eps}^2}{\mathit{totalVolume}_i} \times N_i$$
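As a hedged illustration of the two-dimensional MinPts_i formula of Section 3.1, the sketch below computes the value for one local data set. It assumes that totalVolume_i is the area of the axis-aligned bounding rectangle of the local data set and that the result is rounded up to an integer of at least 1; the paper states neither convention. The function name `local_min_pts` is hypothetical.

```python
import math

def local_min_pts(points, eps):
    """Estimate MinPts_i for one two-dimensional local data set.

    points: list of (x, y) tuples in the local data set (N_i = len(points)).
    eps:    the fixed Eps radius shared by all local data sets.
    """
    if not points:
        raise ValueError("local data set is empty")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Assumption: totalVolume_i is the area of the bounding rectangle.
    total_volume = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if total_volume == 0:
        raise ValueError("bounding rectangle has zero area")
    eps_volume = math.pi * eps ** 2              # EpsVolume_i in two dimensions
    expected = eps_volume / total_volume * len(points)
    # Assumption: round up, and never return less than 1.
    return max(1, math.ceil(expected))
```

Under these assumptions a denser local data set (more points in the same rectangle) receives a larger MinPts_i, and a sparser one a smaller value, which is the behavior the section aims for.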

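The second DBSK stage applies DBSCAN to each local data set with its own MinPts_i. The following is a minimal sketch of plain DBSCAN for two-dimensional points, not the authors' implementation: it uses a brute-force region query (no spatial index), treats an object as a core object when its Eps-neighborhood, including the object itself, holds at least `min_pts` points, and uses hypothetical names throughout.

```python
NOISE = -1

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx], including idx itself."""
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points)
            if (px - qx) ** 2 + (py - qy) ** 2 <= eps ** 2]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE             # may later be relabeled as a border point
            continue
        labels[i] = cluster_id            # i is a core object: start a new cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id    # border point reached from a core object
            if labels[j] is not None:
                continue                  # already assigned; do not expand again
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors) # j is also a core object: keep expanding
        cluster_id += 1
    return labels
```

In DBSK this routine would be called once per local data set, passing the shared Eps and that data set's own MinPts_i in place of a single global MinPts.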
3.2. Merging the clustering results of the local data sets

Suppose a and b belong to cluster A and cluster B respectively. Clusters A and B are united if they satisfy any one of the following conditions:
(1) b is density-connected from a with respect to Eps and MinPts_A, and a is not density-connected from b with respect to Eps and MinPts_B; then MinPts = MinPts_A and clusters A and B are united. This case is shown in Figure 1 and Figure 2;
(2) a is density-connected from b with respect to Eps and MinPts_B, and b is not density-connected from a with respect to Eps and MinPts_A; then MinPts = MinPts_B and clusters A and B are united;
(3) b is density-connected from a with respect to Eps and MinPts_A, and a is density-connected from b with respect to Eps and MinPts_B; then MinPts = min{MinPts_A, MinPts_B} and clusters A and B are united.

Figure 1 b is density-connected from a with respect to Eps and MinPts_A

Figure 2 a is not density-connected from b with respect to Eps and MinPts_B

3.3. The basic framework of DBSK algorithm

Input: control parameters, data set to be clustered.
Output: clustering results.
Step1 initialize the control parameters, including the sampling rate, K and Eps;
Step2 randomly select a subset of the data set according to the sampling rate;
Step3 suppose the size of the subset is n, and do the following operations:
(1) select K initial cluster centers $Z_j(I)$, $j = 1, 2, 3, \ldots, k$;
(2) calculate the distance between each data object in the subset and each cluster center, $D(x_i, Z_j(I))$, $i = 1, 2, 3, \ldots, n$, $j = 1, 2, 3, \ldots, k$; if $D(x_i, Z_k(I)) = \min\{D(x_i, Z_j(I)),\ j = 1, 2, 3, \ldots, k\}$ is satisfied, then $x_i \in w_k$;
(3) compute the convergence function Jc:

$$J_c(I) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - Z_j(I) \right\|^2$$

(4) judgment: if $|J_c(I) - J_c(I-1)| < \xi$, the K-means iteration is finished; otherwise set $I = I + 1$, calculate the k new cluster centers

$$Z_j(I) = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}, \quad j = 1, 2, 3, \ldots, k$$

and return to (2);
(5) join the remaining points to the nearest cluster; each cluster is a local data set;
Step4 do the following operations for each local data set:
(1) calculate the local parameter MinPts_i:

$$\mathit{MinPts}_i = \frac{\pi \times \mathit{Eps}^2}{\mathit{totalVolume}_i} \times N_i$$

(2) apply the DBSCAN clustering algorithm with Eps and MinPts_i;
Step5 merge the clustering results of the local data sets;
Step6 output the clustering results; the algorithm is finished.

4. Time complexity of the DBSK algorithm

The time complexity of the K-means algorithm is O(nkd), where n is the total number of sample points, k is the specified number of clusters, and d is the dimension of the sample points. The average running time complexity of DBSCAN is O(n log n); if no spatial index is used, the computational complexity is O(n²). The DBSK algorithm first applies the sampling technique to obtain a sample data set of size m (suppose m = n/2) and then runs the K-means algorithm on it, with time complexity O(mkd). The time complexity of DBSCAN on each local data set is O((n/k) log(n/k)), and the time for uniting the local data sets is O(n). Therefore, the total time complexity of the DBSK algorithm is O(mkd) + k × O((n/k) log(n/k)) + O(n). The time complexity of the three algorithms is shown in Figure 3.
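To make the cost expressions of Section 4 easier to compare, the sketch below evaluates them as rough operation counts. It assumes unit constant factors and base-2 logarithms, neither of which is fixed by the O-notation, so only the growth trends are meaningful and the numbers are not running times. The function names are hypothetical; the default `sampling_rate=0.5` mirrors the paper's supposition m = n/2.

```python
import math

def kmeans_cost(n, k, d):
    """O(nkd) estimate for K-means on all n points."""
    return n * k * d

def dbscan_cost(n, indexed=True):
    """O(n log n) with a spatial index, O(n^2) without one."""
    return n * math.log2(n) if indexed else n ** 2

def dbsk_cost(n, k, d, sampling_rate=0.5):
    """O(mkd) + k * O((n/k) log(n/k)) + O(n), with m = sampling_rate * n."""
    m = n * sampling_rate                  # size of the sample fed to K-means
    local = n / k                          # assumed size of each local data set
    return m * k * d + k * (local * math.log2(local)) + n

if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, kmeans_cost(n, 4, 2), dbscan_cost(n),
              dbscan_cost(n, indexed=False), dbsk_cost(n, 4, 2))
```

The estimate treats the k local data sets as equal in size (n/k points each), as the expression in the text does; uneven partitions would change the middle term.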

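The O(mkd) term arises because K-means runs only on the sample (Steps 2 and 3 of Section 3.3) before all points are assigned to their nearest center. A minimal sketch of that stage follows, under stated assumptions: initial centers are drawn at random from the sample, an empty cluster keeps its previous center, and a `max_iter` cap guards against non-termination. None of these details comes from the paper, and all names are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))

def sampled_kmeans_partition(points, k, sampling_rate, xi=1e-6, seed=0, max_iter=100):
    """Run K-means on a random sample, then split all points into k local data sets."""
    if k > len(points):
        raise ValueError("k exceeds the number of points")
    rng = random.Random(seed)
    size = min(len(points), max(k, int(len(points) * sampling_rate)))
    sample = rng.sample(points, size)              # Step2: random subset
    centers = rng.sample(sample, k)                # Step3 (1): initial centers
    prev_jc = None
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in sample:                           # Step3 (2): nearest-center assignment
            groups[nearest(p, centers)].append(p)
        # Step3 (3): convergence function Jc on the sample.
        jc = sum(sq_dist(p, centers[j]) for j in range(k) for p in groups[j])
        if prev_jc is not None and abs(jc - prev_jc) < xi:
            break                                  # Step3 (4): converged
        prev_jc = jc
        # New center = mean of its group; an empty group keeps its old center (assumption).
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    local_sets = [[] for _ in range(k)]
    for p in points:                               # Step3 (5): assign every point
        local_sets[nearest(p, centers)].append(p)
    return centers, local_sets
```

Because the initial centers are random, the partition can differ between seeds and may settle in a local minimum, which is the K-means limitation the paper itself notes in Section 2.1.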
Figure 3 The time complexity of the three algorithms

5. The simulation experiment

To test the feasibility and effectiveness of the DBSK algorithm, a simulation experiment is carried out.

Figure 4 The clustering result of DBSCAN algorithm (Eps=18, MinPts=4)

Figure 5 The clustering result of DBSK algorithm (K=3, Eps=18, sampling rate=30%)

In Figure 4 and Figure 5, different shapes represent different clusters, and hollow circles indicate noise points.

The clustering results show that the DBSCAN algorithm does not handle the uneven data set well: it performs poorly on the sparse data points on the left, and a few data points that should belong to a cluster are treated as noise. The DBSK algorithm partitions the data set and selects different parameter values for each local data set. Compared with the DBSCAN algorithm, the DBSK algorithm recognizes more clusters that contain few data points and are relatively sparse. For the sparse data points on the left, the number of points treated as noise is distinctly reduced.

6. Conclusions

Combining the sampling technique, the K-means algorithm and the DBSCAN algorithm, an improved algorithm, DBSK, has been proposed. How to determine the size of the sample (also called the sampling complexity) is a question that needs further study. In this paper the DBSK algorithm is applied only to two-dimensional data sets; three-dimensional and higher-dimensional complex data are not discussed. Because massive data sets often have multi-dimensional attributes, more work on such data remains to be done.

7. References
[1] Chen Lei-da, “Data mining method, application, tools”, Information Systems Management, 2001, 7(1), 65-70.
[2] David Hand, Heikki Mannila, and Padhraic Smyth, “Principles of Data Mining”, The MIT Press, 2001, 162-195.
[3] D T Pham, S S Dimov, and C D Nguyen, “Selection of K in K-means clustering”, Mechanical Engineering Science, 2004, 219(C), 103-119.
[4] Kuo R.J., Ho L.M., and Hu C.M., “Integration of Self-organizing Feature Map and K-means Algorithm for Market Segmentation”, Computers and Operations Research, 2002, 29(11), 1475-1493.
[5] D.S. Modha, W.S. Spangler, “Feature Weighting in K-means Clustering”, 2003, 217-237.
[6] Martin Ester, Hans-Peter Kriegel, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”, Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, 2002, 17-25.
[7] Chen Ning, Chen An, and Zhou Longxiang, “An Effective Clustering Algorithm in Large Transaction Databases”, Journal of Software, 2001, 12(7), 476-484.
[8] Rong Qiusheng, Yan Junbiao, and Guo Guoqiang, “Research and Implementation of Clustering Algorithm Based on DBSCAN”, Computer Applications, 2004, 24(4), 45-47.
[9] He Zhong-sheng, Liu Zong-tian, and Zhuang Yan-bin, “Data-partition-based Parallel DBSCAN Algorithm”, MINI-MICRO System, 2006, 27(1), 114-116.
[10] Sun Si, “Study on data partition DBSCAN using genetic algorithm”, College of Computer Science, Chongqing University, 2005, 38-43.
