
Xin Rui
HeBei Electric Power Research Institute, Shijiazhuang 050021, China

Duo Chunhong
Institute of Computer, North China Electric Power University, Baoding 071003, China
duochunhong@163.com

Abstract

The K-means algorithm, which is based on partitioning, and the DBSCAN algorithm, which is based on density, are analyzed. Combining the advantages and disadvantages of the two algorithms, the improved algorithm DBSK is proposed. Because the data set is partitioned, DBSK reduces the memory requirement. A method for computing the parameter values is put forward. For uneven data sets, adopting different parameter values in each local data set reduces the dependence on global parameters, so the clustering result is better. A simulation experiment is carried out, which demonstrates the algorithm's feasibility and validity.

1. Introduction

Data mining can find useful information, extract patterns of interest from a large amount of data, and uncover the changing rules and dependency relations in the data. Through data mining, people can discover the hidden value of data and obtain powerful support for decision-making and scientific discovery [1].

Clustering analysis is one of the important research domains in data mining. Through clustering, people can distinguish dense areas from sparse areas and find the overall distribution pattern. Because the data have no obvious variables as attributes, hidden attributes need to be detected. The database is divided into several groups, each containing several similar records, and from these groups rules or knowledge useful to decision-makers are obtained [2]. In many domains today there are no prior rules or established patterns, and it is unknown how many clusters the data should be partitioned into, so clustering has promising applications in pattern recognition, image processing, market research and other fields.

2. Related clustering algorithms

2.1. K-means algorithm

The K-means algorithm randomly selects K points as initial cluster centers and then calculates the distance between each sample and each cluster center. Each point is assigned to the cluster of the nearest center, and the new cluster centers are then calculated. If the cluster centers do not change between two successive iterations, the adjustment of the samples has finished and the convergence function $J_c$ has converged.

The main merits of the K-means algorithm are that it is simple, fast and effective when dealing with large databases. Its demerits are that different initial values may lead to different clustering results and that its hill-climbing search easily falls into a local minimum. These two limitations restrict its scope of application [3-5]. Many researchers have proposed methods for choosing the initial cluster centers:

① partition the whole mixed sample into k clusters intuitively, and take the mean of each cluster as its initial cluster center;
② select representative points through density-based methods as initial cluster centers;
③ calculate the representative points of k clusters from those of (k-1) clusters;
④ find the initial cluster centers through the max-min clustering method;
⑤ select initial values and cluster many times, then keep the best clustering result;
⑥ combine clustering with a genetic algorithm or an immune programming method.
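The K-means procedure of Section 2.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tolerance `tol`, the iteration cap `max_iter`, the fixed `seed`, and the use of squared Euclidean distance for the convergence function are all choices made here for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, tol=1e-9, max_iter=100, seed=0):
    """Return (centers, labels, jc); labels and jc come from the last assignment step.

    `tol`, `max_iter` and `seed` are illustrative parameters (max_iter >= 1).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial cluster centers
    prev_jc = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [min(range(k), key=lambda j: squared_distance(p, centers[j]))
                  for p in points]
        # Convergence function: sum of squared distances to the assigned centers.
        jc = sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))
        if prev_jc is not None and abs(prev_jc - jc) < tol:
            break  # no meaningful change between two successive iterations
        prev_jc = jc
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster became empty
                centers[j] = mean_point(members)
    return centers, labels, jc


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centers, labels, jc = kmeans(data, k=2)
    print(centers, labels, jc)
```

Because the initial centers are drawn at random, a different `seed` can in general lead to a different result, which is the sensitivity to initial values that the listed initialization methods try to reduce.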

The samples need to be adjusted constantly in order to obtain new cluster centers. The larger the data set, the longer the running time. The time complexity should therefore be analyzed in order to broaden the scope of application of the K-means algorithm. This paper improves the K-means algorithm by sampling: the choice of initial points and the adjustment after each iteration are performed on the sample data, which greatly improves the convergence speed.

DOI 10.1109/ISCID.2008.218

Authorized licensed use limited to: Jabalpur Engineering College. Downloaded on December 25, 2008 at 06:17 from IEEE Xplore. Restrictions apply.

2.2. DBSCAN Algorithm

First, the DBSCAN algorithm selects an arbitrary object o from the database and then searches for all objects that are density-reachable from o with respect to Eps and MinPts. If the number of data points in the Eps-neighborhood of o is larger than MinPts, o is marked as a core object and the data in its Eps-neighborhood become the objects to be expanded next; if not, o is marked as "noise". If o is a core object, there is a cluster C with respect to Eps and MinPts, which is determined by any core object in it. Region queries and cluster expansion are executed repeatedly until a complete cluster is obtained [6].

The DBSCAN algorithm treats high-density areas as clusters, discovers clusters of arbitrary shape and effectively handles noise in spatial databases [7-8]. However, it has the following disadvantages:

① It requires a large amount of memory and incurs heavy I/O costs when dealing with large-scale databases, because it operates directly on the entire database.
② It is sensitive to its input parameters, and different parameter values may lead to different clustering results.
③ Because DBSCAN uses the global parameters Eps and MinPts, clustering quality degrades when the cluster densities and the distances between clusters are uneven. If we choose a smaller value of Eps suited to dense clusters, the data points in the neighborhoods of sparse clusters may be recognized as "noise" and will not be expanded. As a result, a sparse cluster will be partitioned into several clusters with similar characteristics. If we choose a larger value of Eps suited to sparse clusters, clusters that have high density and are close to each other may be merged into one cluster, and their differences will be ignored. Clearly, it is difficult to choose a suitable value under these two circumstances.

3. The improved DBSK Algorithm

DBSCAN operates directly on the whole data set. When the data set is very large, it requires a large amount of memory and incurs substantial I/O costs [9]. If the data set is partitioned before processing according to some rules, the I/O costs and the running time are reduced. The data distribution may be uneven, so several local data sets are obtained. Each local data set is smaller than the initial data set, and local parameter values are chosen according to its characteristics; in this way the clustering result is better. However, one large cluster may be split across two different local data sets, or a data point that belongs to one local data set may be assigned to another and marked as "noise", so the clustering results should be post-processed in order to remove the influence of the partition.

DBSK proceeds in three stages. First, it speeds up the K-means algorithm by sampling and uses it to partition the data set. Second, it computes the value of $\mathrm{MinPts}_i$ and applies the DBSCAN algorithm to each local data set. Third, it merges the clustering results of the local data sets to obtain the clustering result of the entire data set.

3.1. Determining the value of the local parameter $\mathrm{MinPts}_i$

The following method is used to determine each local parameter: choose a fixed value of Eps and calculate $\mathrm{MinPts}_i$ for each local data set:

$$\mathrm{MinPts}_i = \frac{\mathrm{EpsVolume}_i}{\mathrm{totalVolume}_i} \times N_i$$

$N_i$ is the number of data points in the local data set whose cluster center is $I_i$; $\mathrm{totalVolume}_i$ is the volume of the super-cuboid whose cluster center is $I_i$ [10]. $\mathrm{EpsVolume}_i$ takes different values according to the dimension of the data:

One-dimensional data: $\mathrm{EpsVolume}_i = 2 \times \mathrm{Eps}$
Two-dimensional data: $\mathrm{EpsVolume}_i = \pi \times \mathrm{Eps}^2$
Three-dimensional data: $\mathrm{EpsVolume}_i = \frac{4}{3} \times \pi \times \mathrm{Eps}^3$

This paper mainly considers two-dimensional space:

$$\mathrm{MinPts}_i = \frac{\pi \times \mathrm{Eps}^2}{\mathrm{totalVolume}_i} \times N_i$$
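The local parameter formula of Section 3.1 can be illustrated with a short Python sketch. Two points are assumptions of this example rather than statements from the paper: the super-cuboid volume is taken to be the volume of the axis-aligned bounding box of the local data set, and the result is returned as an unrounded float because the text does not say how it is rounded. The function names are hypothetical.

```python
import math


def eps_volume(eps, dim):
    """Volume of the Eps-neighborhood for the 1-D, 2-D and 3-D cases listed in the text."""
    if dim == 1:
        return 2.0 * eps
    if dim == 2:
        return math.pi * eps ** 2
    if dim == 3:
        return 4.0 / 3.0 * math.pi * eps ** 3
    raise ValueError("only dimensions 1, 2 and 3 are covered")


def bounding_volume(points):
    """Volume of the axis-aligned box enclosing the points.

    Assumption: this stands in for totalVolume_i, the super-cuboid volume.
    """
    volume = 1.0
    for coords in zip(*points):
        volume *= max(coords) - min(coords)
    return volume


def local_min_pts(points, eps):
    """MinPts_i = EpsVolume_i / totalVolume_i * N_i for one local data set."""
    dim = len(points[0])
    total = bounding_volume(points)
    if total == 0:
        raise ValueError("degenerate bounding box: volume is zero")
    return eps_volume(eps, dim) / total * len(points)


if __name__ == "__main__":
    dense = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
    sparse = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 5.0)]
    print(local_min_pts(dense, eps=0.5), local_min_pts(sparse, eps=0.5))
```

With the same Eps, the dense set yields a larger value than the sparse one, which is the intended effect of replacing one global MinPts with a per-partition value.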


3.2. Merging the clustering results of each local data set

Given a point a in cluster A and a point b in cluster B, unite cluster A and cluster B if they satisfy any of the following conditions:

(1) b is density-connected from a with respect to Eps and $\mathrm{MinPts}_A$, and a is not density-connected from b with respect to Eps and $\mathrm{MinPts}_B$; then set $\mathrm{MinPts} = \mathrm{MinPts}_A$ and unite cluster A and cluster B. This case is shown in Figure 1 and Figure 2;
(2) a is density-connected from b with respect to Eps and $\mathrm{MinPts}_B$, and b is not density-connected from a with respect to Eps and $\mathrm{MinPts}_A$; then set $\mathrm{MinPts} = \mathrm{MinPts}_B$ and unite cluster A and cluster B;
(3) b is density-connected from a with respect to Eps and $\mathrm{MinPts}_A$, and a is density-connected from b with respect to Eps and $\mathrm{MinPts}_B$; then set $\mathrm{MinPts} = \min\{\mathrm{MinPts}_B, \mathrm{MinPts}_A\}$ and unite cluster A and cluster B.

Figure 1 b is density-connected from a with respect to Eps and $\mathrm{MinPts}_A$

Figure 2 a is not density-connected from b with respect to Eps and $\mathrm{MinPts}_B$

3.3. The basic framework of DBSK algorithm

Input: control parameters, clustering data sets.
Output: cluster results.

Step1 initialize the control parameters, including the sampling rate, K and Eps;

Step2 randomly select a subset of the data set according to the sampling rate;

Step3 suppose the size of the subset is n, and do the following operations:
(1) select K initial cluster centers $Z_j(I)$, j = 1, 2, 3, …, k;
(2) calculate the distance between each data object in the subset and each cluster center, $D(x_i, Z_j(I))$, i = 1, 2, 3, …, n, j = 1, 2, 3, …, k; if $D(x_i, Z_k(I)) = \min\{D(x_i, Z_j(I)),\ j = 1, 2, 3, \ldots, k\}$ is satisfied, then $x_i \in w_k$;
(3) calculate the convergence function
$$J_c(I) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - Z_j(I) \right\|^2$$
where $n_j$ is the number of points in cluster j and $x_i^{(j)}$ is the i-th point of cluster j;
(4) judgment: if $|J_c(I) - J_c(I-1)| < \xi$, the algorithm is finished; otherwise set I = I + 1, calculate the k new cluster centers
$$Z_j(I) = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}, \quad j = 1, 2, 3, \ldots, k,$$
and return to (2);
(5) join the remaining points to the nearest cluster; each cluster is a local data set;

Step4 do the following operations for each local data set:
(1) calculate the local parameter $\mathrm{MinPts}_i = \frac{\pi \times \mathrm{Eps}^2}{\mathrm{totalVolume}_i} \times N_i$;
(2) apply the DBSCAN clustering algorithm with Eps and $\mathrm{MinPts}_i$;

Step5 merge the clustering results of the local data sets;

Step6 output the clustering results; the algorithm is then finished.

4. Time complexity of the DBSK algorithm

The time complexity of the K-means algorithm is O(nkd), where n is the total number of sample points, k is the specified number of clusters, and d is the dimension of the sample points. The average running time complexity of DBSCAN is O(n log n); if no spatial index is used, the computational complexity is O(n²). The DBSK algorithm first applies the sampling technique to obtain a sample data set of size m (suppose m = n / 2) and then runs the K-means algorithm on it, with time complexity O(mkd). The time complexity for each local data set is O((n / k) log (n / k)), and the time for uniting the data sets is O(n). Therefore, the total time complexity of the DBSK algorithm is O(mkd) + k × O((n / k) log (n / k)) + O(n). The time complexity of the three algorithms is shown in Figure 3.

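The total time complexity stated in Section 4 can be simplified in a few lines. The derivation below is a worked sketch that keeps the paper's own assumptions (k local data sets of roughly equal size n / k, and m = n / 2) and adds none beyond k ≥ 1 and d ≥ 1.

```latex
\begin{aligned}
T_{\mathrm{DBSK}}(n)
  &= O(mkd) + k \cdot O\!\left(\tfrac{n}{k}\log\tfrac{n}{k}\right) + O(n) \\
  &= O(mkd) + O\!\left(n\log\tfrac{n}{k}\right) + O(n) \\
  &= O\!\left(nkd + n\log\tfrac{n}{k}\right) \qquad \text{for } m = n/2 .
\end{aligned}
```

The last step uses the fact that O(mkd) = O(nkd) when m is a constant fraction of n, and that the O(n) merge cost is absorbed by O(nkd).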

Figure 3 The time complexity of the three algorithms

5. The simulative experiment

In order to test the feasibility and effectiveness of the DBSK algorithm, a simulation experiment is carried out.

Figure 4 The clustering result of the DBSCAN algorithm (Eps = 18, MinPts = 4)

Figure 5 The clustering result of the DBSK algorithm (K = 3, Eps = 18, sampling rate = 30%)

In Figure 4 and Figure 5, different shapes represent different clusters and hollow circles indicate noise points. The clustering results show that the DBSCAN algorithm does not handle the uneven data set well: it performs poorly on the sparse data points on the left, and a few data points that should belong to some cluster are treated as noise. The DBSK algorithm partitions the data set and selects different parameter values for each local data set. Compared with the DBSCAN algorithm, the DBSK algorithm recognizes more clusters that contain few data points and are relatively sparse. For the sparse data points on the left, the number of data points treated as noise is distinctly reduced.

6. Conclusions

Combining the sampling technique, the K-means algorithm and the DBSCAN algorithm, an improved algorithm, DBSK, has been proposed. How to determine the size of the sample (also called the sampling complexity) is a question that needs further study. In this paper the DBSK algorithm is applied only to two-dimensional data sets; three-dimensional and higher-dimensional complex data are not discussed. Since massive data sets often have multi-dimensional attributes, more work on such data remains to be done.

7. References

[1] Chen Lei-da, "Data mining method, application, tools", Information Systems Management, 2001, 7(1), 65-70.

[2] David Hand, Heikki Mannila, and Padhraic Smyth, "Principles of Data Mining", The MIT Press, 2001, 162-195.

[3] D T Pham, S S Dimov, and C D Nguyen, "Selection of K in K-means clustering", Mechanical Engineering Science, 2004, 219(C), 103-119.

[4] Kuo R.J., Ho L.M., and Hu C.M., "Integration of Self-organizing Feature Map and K-means Algorithm for Market Segmentation", Computers and Operations Research, 2002, 29(11), 1475-1493.

[5] D.S. Modha, W.S. Spangler, "Feature Weighting in K-means Clustering", 2003, 217-237.

[6] Martin Ester, Hans-Peter Kriegel, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proceedings of 2nd international conference on knowledge discovery and data mining, 2002, 17-25.

[7] Chen Ning, Chen An, and Zhou Longxiang, "An Effective Clustering Algorithm in Large Transaction Databases", Journal of software, 2001, 12(7), 476-484.

[8] Rong Qiusheng, Yan Junbiao, and Guo Guoqiang, "Research and Implementation of Clustering Algorithm Based on DBSCAN", Computer Applications, 2004, 24(4), 45-47.

[9] HE Zhong-sheng, LIU Zong-tian, and ZHUANG Yan-bin, "Data-partition-based Parallel DBSCAN Algorithm", MINI-MICRO System, 2006, 27(1), 114-116.

[10] Sun Si, "Study on data partition DBSCAN using genetic algorithm", College of computer science Chongqing University, 2005, 38-43.
