You are on page 1of 8

Journal of Information Engineering and Applications ISSN 2224-5782 print! ISSN 2225-"5"# online! $ol.%& No.

''& 2"'%

www.iiste.org

The Improved K-Means with Particle Swarm Optimization


(rs. Nid)i Sing)& *r.*i+a,ar Sing) *epartment of -omputer Science . Engg& /0I1&/0&/)opal.India. (.2! nnid)i3sng'445a)oo.com di+a,ar3sing)4rediffmail.com Abstract In toda56s world data mining )as 7ecome a large field of researc). As t)e time increases a large amount of data is accumulated. -lustering is an important data mining tas, and )as 7een used e8tensi+el5 75 a num7er of researc)ers for different application areas suc) as finding similarities in images& te8t data and 7io-informatics data. -luster anal5sis is one of t)e primar5 data anal5sis met)ods. -lustering defines as t)e process of organi9ing data o7:ects into a set of dis:oint classes called clusters. -lustering is an e8ample of unsuper+ised classification. In clustering& ;-(eans (ac<ueen! is one of t)e most well ,nown popular clustering algorit)m. ;-(eans is a partitioning algorit)m follows some draw7ac,s= num7er of clusters , must 7e ,nown in ad+anced& it is sensiti+e to random selection of initial cluster centre& and it is sensiti+e to outliers. In t)is paper& we tried to impro+e some draw7ac,s of ;-(eans algorit)m and an efficient algorit)m is proposed to en)ance t)e ;-(eans clustering wit) 2article Swarm >ptimi9ation. In recent 5ears& 2article Swarm >ptimi9ation 2S>! )as 7een successfull5 applied to a num7er of real world clustering pro7lems wit) t)e fast con+ergence and t)e effecti+el5 for )ig)-dimensional data. Keywords: -lustering& ;-(eans clustering& 2S> 2article Swarm >ptimi9ation!& ?ierarc)ical clustering. I. Introd ction A. !l sterin" *ata mining is t)e process of e8tracting patterns from large amount of data repositories or data 7ases. *ata mining is seen as an increasingl5 important tool 75 modern 7usiness to transform data into 7usiness intelligence gi+ing an informational ad+antage. -luster anal5sis or -lustering is t)e assignment of a set of o7:ects into su7sets called clusters! so t)at o7:ects in t)e same cluster are similar in some sense. -lustering is a met)od of unsuper+ised learning& and a common tec)ni<ue used in +ariet5 of fields including mac)ine learning& data mining& pattern recognition& image anal5sis and 7ioinformatics @'A. A good clustering met)od will produce )ig) <ualit5 of clusters wit) )ig) intra-cluster similarit5 and low inter-cluster similarit5. #. Types o$ !l sterin" *ata -lustering algorit)ms mainl5 di+ided into two categories= ?ierarc)ical and 2artition algorit)ms. ?ierarc)ical algorit)m usuall5 consists of eit)er Agglomerati+e B7ottom-upC! or *i+isi+e Btop-downC!. Agglomerati+e algorit)ms as we ,now is a 7ottom-up approac) t)at is it 7egin wit) eac) element as a separate cluster and merge t)em into successi+el5 larger clusters. *i+isi+e algorit)ms is a top-down approac) and 7egin wit) t)e w)ole set and proceed to di+ide it into successi+el5 smaller clusters@'A.2artition clustering algorit)m partitions t)e data set into desired num7er of sets in a single step. (an5 met)ods )a+e 7een proposed to sol+e clustering pro7lem. >ne of t)e most popular and simple clustering met)od is ;-(eans clustering algorit)m de+eloped 75 (ac Dueen in 'E#7.;-(eans -lustering algorit)m is an iterati+e approac)F it partitions t)e dataset into , groups. Efficienc5 of ;-(eans algorit)m )ea+il5 depends on selection of initial centroids 7ecause randoml5 c)oosing initial centroid also )as an influence on num7er of iteration and also produces different cluster results for different cluster centroid @2A. !. Particle Swarm Intelli"ence Swarm Intelligence SI!& was inspired 75 t)e 7iological 7e)a+iour of animals& and is an inno+ati+e distri7uted intelligent paradigm for sol+ing optimi9ation pro7lems @%A. 1)e two main Swarm Intelligence algorit)ms are '! Ant -olon5 >ptimi9ation A->! and 2! 2article Swarm >ptimi9ation 2S>!. 1)is paper mainl5 deals wit) 2S>. It is an optimi9ation tec)ni<ue originall5 proposed 75 ;enned5 and E7er)art @4A and was inspired 75 t)e swarm 7e)a+iour of 7irds& fis) and 7ees w)en t)e5 searc) for food or communicate wit) eac) ot)er.

'

Journal of Information Engineering and Applications ISSN 2224-5782 print! ISSN 2225-"5"# "5"# online! $ol.%& No.''& 2"'%

www.iiste.org

Higure '.'. /asic flow diagram of 2S>. 2S> approac) is 7ased upon cooperation and communication among agents called particles. 2articles are t)e agents t)at represent indi+idual solutions w)ile t)e swarm is t)e collection collection of particles w)ic) represent t)e solution space. 1)e particles t)en start mo+ing t)roug) t)e solution space 75 maintaining a +elocit5 +alue V and ,eeping trac, of its 7est pre+ious position ac)ie+ed so far. 1)is position +alue is ,nown as its personal 7est 7e position. Glo7al 7est is anot)er 7est solution w)ic) is t)e 7est fitness +alue w)ic) is ac)ie+ed 75 an5 of t)e particles. 1)e fitness of eac) particle or t)e entire swarm is e+aluated 75 a fitness function @7A. 1)e flow c)art of 7asic 2S> is s)own in Higure '.'= %. K-Means Means !l sterin" Al"orithm 1)is part 7riefl5 descri7es t)e standard ;-(eans ; algorit)m. ;-(eans (eans is a t5pical clustering algorit)m in *ata (ining w)ic) is widel5 used for clustering large set of data6s. In 'E#7& (ac Dueen firstl5 proposed t)e ;-(eans ; algorit)m& it was one of t)e most simple& unsuper+ised learning algorit)m& w)ic) was applied to sol+e t)e pro7lem of t)e well-,nown ,nown cluster @EA.1)e most widel5 used 2artitioning -lustering algorit)m is ;-(eans. ; ;(eans algorit)m clusters t)e input data data into , clusters& w)ere , is gi+en as an input parameter.

Higure '.2 2rocess of ;-(eans ; algorit)m

Journal of Information Engineering and Applications ISSN 2224-5782 print! ISSN 2225-"5"# online! $ol.%& No.''& 2"'%

www.iiste.org

;-(eans algorit)m finds a partition t)at )ig)l5 depends on t)e minimum s<uared error 7etween t)e centroid of a cluster and its data points. 1)e algorit)m first ta,es , random data points as initial centroids and assigns eac) data point to t)e nearest cluster until con+ergence criteria is met. Alt)oug) ;-(eans algorit)m is simple and eas5 to implement& it suffers from t)ree ma:or draw7ac,s= '! 1)e num7er of clusters& , )as to 7e specified as input. 2! 1)e algorit)m con+erges to local optima. %! 1)e final clusters depend on randoml5 c)osen initial centroids. 4! ?ig) computational comple8it5 1)e flow diagram for simple ;-(eans algorit)m is s)own in Higure '.2@''A. II. &elated 'or( $arious researc)es )a+e 7een carried out to impro+e t)e efficienc5 of ;-(eans algorit)m wit) 2article Swarm >ptimi9ation. 2article Swarm >ptimi9ation gi+es t)e optimal initial seed and using t)is 7est seed ;-(eans algorit)m produces 7etter clusters and produces muc) accurate results t)an traditional ;-(eans algorit)m. A. (. Ha)im et al. @5A proposed an en)anced met)od for assigning data points to t)e suita7le clusters. In t)e original ;-(eans algorit)m in eac) iteration t)e distance is calculated 7etween eac) data element to all centroids and t)e re<uired computational time of t)is algorit)m is depends on t)e num7er of data elements& num7er of clusters and num7er of iterations& so it is computationall5 e8pensi+e. ;. A. A7dul Na9eer et al. @#A proposed an en)anced algorit)m to impro+e t)e accurac5 and efficienc5 of t)e ;(eans clustering algorit)m. In t)is algorit)m two met)ods are used& one met)od for finding t)e 7etter initial centroids. And anot)er met)od for an efficient wa5 for assigning data points to appropriate clusters wit) reduced time comple8it5. 1)is algorit)m produces good clusters in less amount of computational time. S)afi< Alam @7A proposed a no+el algorit)m for clustering called E+olutionar5 2article Swarm >ptimi9ation E2S>!-clustering algorit)m w)ic) is 7ased on 2S>. 1)e proposed algorit)m is 7ased on t)e e+olution of swarm generations w)ere t)e particles are initiall5 uniforml5 distri7uted in t)e input data space and after a specified num7er of iterationsF a new generation of t)e swarm e+ol+es. 1)e swarm tries to d5namicall5 ad:ust itself after eac) generation to optimal positions. 1)e paper descri7es t)e new algorit)m t)e initial implementation and presents tests performed on real clustering 7enc)mar, data. 1)e proposed met)od is compared wit) ;-(eans clustering- a 7enc)mar, clustering tec)ni<ue and simple particle swarm clustering algorit)m. 1)e results s)ow t)at t)e algorit)m is efficient and produces compact clusters. In t)is paper& Ie,s)m5 2 -)andran et al. @8A descri7es a recentl5 de+eloped (eta )euristic optimi9ation algorit)m named )armon5 searc) )elps to find out near glo7al optimal solutions 75 searc)ing t)e entire solution space. ;-(eans performs a locali9ed searc)ing. Studies )a+e s)own t)at )57rid algorit)m t)at com7ines t)e two ideas will produce a 7etter solution. In t)is paper& a new approac) t)at com7ines t)e impro+ed )armon5 searc) optimi9ation tec)ni<ue and an en)anced ;-(eans algorit)m is proposed. III. Proposed 'or( 1)e proposed algorit)m wor,s in two p)ases. 2)ase I is Algorit)m %.' descri7es t)e 2article Swarm >ptimi9ation and 2)ase II is Algorit)m %.2 descri7es t)e original ;-(eans Algorit)m. 1)e Algorit)m %.' w)ic) gi+es 7etter seed selection 75 calculating forces on eac) particle due to anot)er in eac) direction and t)e total force on an indi+idual particle. 1)e output of Algorit)m %.' is gi+en as input to Algorit)m %.2 w)ic) generates t)e final clusters. 1)e cluster generated 75 t)is proposed algorit)m is muc) accurate and of good <ualit5 in comparison to ;-(eans algorit)m. Al"orithm ).* Particle Swarm Optimization Step '.initiali9ation of parametersJJ num7er of particles& +elocit5. Step 2.select randoml5 t)ree particles goals 2.' Initial goal 2.2 A+erage goal 2.% Hinal goal Step %. Kepeat step 2 until finds w)ic) particle goal is optimal. Step 4 .*o 4.' 1)e total force on one particle due to anot)er& in L direction due to distance factor will 7e H*La&i suc) FDxa &i = min Q. ' d a &i cos t)at=

[ (

M)ere& %

Journal of Information Engineering and Applications ISSN 2224-5782 print! ISSN 2225-"5"# online! $ol.%& No.''& 2"'%

www.iiste.org

d a& i & is distance 7etween t)e two particles& N is t)e angle wit) L-a8is and D is a constant. 4.2 In t)e same manner force in O direction is calculated. 4.% Now total force acted on a particle can 7e calculated as H*8a P

FDx
i =' ia

a &i

M)ere n is t)e num7er of particles in t)e s5stem. Step 5.Similarit5 measure is calculated. A low +alue of s<uared error of corresponding +alues of data +ector is considered as a measure of similarit5.

= (x j &' x j &i )2
m j ='

p v i &' i =' p v i & 2 i =' r .' n Step #.distance matri8 $c is calculated as v c = . . p v i & j i ='

Step 7. -lusters are formed w)en one clusters com7ines wit) anot)er. If t)e mean +alue of cluster )a+ing )ig)er cluster I* is wit)in 'Qsigma limit of t)e ot)er cluster& t)e two clusters are merged toget)er. Sigma is t)e standard de+iation of distances of t)e particles from t)e mean +alue. Step 8. Stop Al"orithm ).+ K-Means !l sterin" Al"orithm ,-. Ke<uire= D = {d1, d2, d3, ..., di, ..., dn } JJ Set of n data points. Initial cluster centroids calculated from Algorit)m %.'. Ensure= A set of k clusters. Steps= Step '. Ar7itraril5 c)oose k data points from D as initial centroidsF Step 2. Kepeat Assign eac) point di to t)e cluster w)ic) )as t)e closest centroidF -alculate t)e new mean for eac) clusterF Step %.0ntil con+ergence criteria are met. 1)e s<uared error can 7e calculated using E< %.'!= EP

xx
i =' xCi

2 i

RRRRR. %.'!

1)e accurac5 can 7e calculated using 2recision and Kecall= 2recision is t)e fraction of retrie+ed documents t)at are rele+ant to t)e searc). It can 7e calculated using E< %.2!. Kecall is t)e fraction of documents t)at are rele+ant to t)e <uer5 t)at are successfull5 retrie+ed. It can 7e calculated using E< %.%!. 2recisionP

tp tp + fp tp

RRRRR. %.2!

KecallP

t p + fn

RRRRRRRR. %.%!

M)ere tpP true positi+e correct result! 4

Journal of Information Engineering and Applications ISSN 2224-5782 print! ISSN 2225-"5"# online! $ol.%& No.''& 2"'%

www.iiste.org

tnP true negati+e correct a7sence of result! fpP false positi+e une8pected result! fnP false negati+e (issing result! >+erall accurac5 can 7e calculated using E< %.4!. Accurac5P

t p + tn t p + tn + f p + f n

RRR. %.4!

I/. &es lt Analysis 1)e tec)ni<ue proposed in t)is paper is tested on different data sets suc) as 7reast cancer @'"A& 1)5roid @'"A& Ecoli @'"A& against different criteria suc) as accurac5& time& error rate& num7er of iterations and num7er of clusters. 1)e same data sets are used as an input for t)e ;-(eans algorit)m. /ot) t)e algorit)ms do not need to ,now num7er of clusters in ad+ance. Hor t)e ;-(eans algorit)m t)e set of initial centroids also re<uired. 1)e proposed met)od finds initial centroids s5stematicall5. 1)e proposed met)ods ta,e additional inputs li,e t)res)old +alues. 1)e description of datasets is s)own in 1a7le 4.'. Table0.*. %escription o$ %atasets

1)e comparati+e anal5sis for different attri7utes li,e time& accurac5& error rate and num7er of iterations are ta7ulated in 1a7le 4.2. /ased on t)ese attri7utes t)e performance of t)e ;-(eans and proposed algorit)m are calculated for a particular t)res)old. In t)is proposed algorit)m t)ere is no need to e8plicitl5 define num7er of clusters ;. Table0.+. Per$ormance !omparison o$ the Al"orithms $or %i$$erent %atasets $or Threshold 1.02

1)e grap)ical result 7ased on comparison s)own in ta7le 4.2 is s)own in Higure 4.'& Higure 4.2& Higure 4.%& and Higure 4.4& wit) t)res)old +alue ".45. Higure 4.' s)ows t)e comparison of different datasets wit) respect to 1ime. Kesults )a+e s)own t)at t)e proposed algorit)m ta,es comparati+e less time as compared to ;-(eans.

Journal of Information Engineering and Applications ISSN 2224-5782 print! ISSN 2225-"5"# online! $ol.%& No.''& 2"'%

www.iiste.org

K-Means v/s Proposed algorithm


10 Time (in sec) 0 K-Means Proposed algorithm

Datasets

3i" re 0.*. !omparison o$ di$$erent dataset with respect to Time at threshold 1.02 Higure 4.2 s)ows t)e error rate for different datasets& and t)e error rate for ;-(eans muc) larger t)an t)e proposed algorit)m.

K-Means v/s Proposed algorithm


Error 5 Rate 0 (%)
10 k-means proposed algorithm

Datasets
3i" re 0.+ !omparison o$ di$$erent dataset with respect to 4rror rate at threshold 1.02 Higure 4.% s)ows num7er of iterations for ;-(eans and proposed algorit)m& t)e proposed algorit)m ta,e muc) num7er of iterations.

K-Means v/s Proposed algorithm


7 6 Number 5 of 4 iteratio 3 2 ns 1 0 k-means Proposed algorithm

Datasets
3i" re 0.) !omparison o$ di$$erent dataset with respect to n mber o$ iterations at threshold 1.02 Higure 4.4 s)ows t)e accurac5 of ;-(eans and proposed algorit)m& result s)ows t)at t)e proposed algorit)m is muc) accurate t)an ;-(eans.

Journal of Information Engineering and Applications ISSN 2224-5782 print! ISSN 2225-"5"# online! $ol.%& No.''& 2"'%

www.iiste.org

K-Means v/s Proposed algorithm


Accurac y (%)
92 90 88 86 84 82 80 k-means Proposed algorithm

Data sets
3i" re 0.0 !omparison o$ di$$erent dataset with respect to acc racy at threshold 1.02 (atla7 7.8." is used for programming in e8periment. -omputer configuration is Intel 2entium 2.8"G?9 -20& 25#(/ KA(. /. %isc ssion In t)is paper we )a+e discussed t)e impro+ed ;-(eans -lustering wit) 2article Swarm >ptimi9ation. >ne of t)e ma:or draw7ac, of ;-(eans -lustering is t)e random selection of seed& t)e random selection of initial seed results in different cluster w)ic) are not good in <ualit5. 1)e ;-(eans -lustering algorit)m needs t)e steps '! declaration of , clusters 2! initial seed selection %! similarit5 matri8 4! cluster generation. 1)e 2S> algorit)m is applied in step 2! t)e 2S> gi+es t)e optimal solution for seed selection. In t)is paper& t)e standard ;-(eans is applied wit) different 2S> to produce results w)ic) are more accurate and efficient t)an ;-(eans algorit)m. In t)is paper t)ere is no need to gi+en , num7er of clusters in ad+ance onl5 a t)res)old +alue is re<uired. &e$erences @'A. (.$./.1.Sant)i& $.K.N.S.S.$. Sai Ieela& 2.0. Anit)a& *. Nagamalleswari& BEn)ancing ;-(eans -lustering Algorit)mC& IJ-S1 $ol 2& Issue 4& >ct- *ec 2"''. @2A. (ad)u Oedla& Srini+asa Kao 2at)o,ota& 1 ( Srini+asa& BEn)ancing ;-(eans -lustering Algorit)m wit) Impro+ed Initial -enterC& IJ-SI1! International Journal of -omputer Science and Information 1ec)nologies& $ol. ' 2!& pp='2'-'25& 2"'". @%A. A:it) A7ra)am& ?e Guo& and Iiu ?ong7o& BSwarm Intelligence= Houndations& 2erspecti+es and ApplicationsC& Swarm Intelligent S5stems& Ned:a) N& (ourelle I eds.!&No+a 2u7lis)ers& 0SA& 2""#. @4A. J. ;enned5& and K. -. E7er)art& B2article Swarm >ptimi9ation&C 2roc. >f IEEE International -onference on Neural Networ,s I-NN!& $ol. I$& 2ert)& Australia& 'E42- 'E48& 'EE5 @5A. A. (. Ha)im& A. (. Salem& H. A. 1or,e5 and (. A. Kamadan& BAn Efficient en)anced ;-(eans clustering algorit)m&C :ournal of S)e:iang 0ni+ersit5& '" 7!= '#2#'#%%& 2""#. @#A. ;. A. A7dul Na9eer and (. 2. Se7astian& BImpro+ing t)e accurac5 and efficienc5 of t)e ;-(eans clustering algorit)m&C in International -onference on *ata (ining and ;nowledge Engineering I-*(;E!& 2roceedings of t)e Morld -ongress on Engineering M-E-2""E!&$ol '& Jul5& Iondon& 0;& 2""E. @7A. S)afi< Alam& Gillian *o77ie& 2atricia Kiddle& TAn E+olutionar5 2article Swarm >ptimi9ation Algorit)m for *ata -lusteringT& Swarm Intelligence S5mposium St. Iouis (> 0SA& Septem7er 2'-2%& IEEE 2""8. @8A. Ie,s)m5 2 -)andran&; A A7dul Na9eer& BAn Impro+ed -lustering Algorit)m 7ased on ;-(eans and ?armon5 Searc) >ptimi9ationC& IEEE 2"''. @EA. S)i Na& Lumin Iiu& Guan Oong& BKesearc) on ;-(eans -lustering Algorit)m -An Impro+ed ;-(eans -lustering Algorit)mC& 1)ird International S5mposium on Intelligent Information 1ec)nolog5 and Securit5 Informatics& pp= #%-#7& IEEE& 2"'". @'"A. 1)e 0-I Kepositor5 we7site. @>nlineA.A+aila7le= )ttp=JJarc)i+e.ics.uci.eduJ&2"'". @''A. Juntao Mang&Liaolong Su&TAn impro+ed ;-(eans clustering algorit)mT&IEEE&2"''.

This academic article was published by The International Institute for Science, Technology and Education (IISTE). The IISTE is a pioneer in the Open Access Publishing service based in the U.S. and Europe. The aim of the institute is Accelerating Global Knowledge Sharing. More information about the publisher can be found in the IISTEs homepage: http://www.iiste.org CALL FOR JOURNAL PAPERS The IISTE is currently hosting more than 30 peer-reviewed academic journals and collaborating with academic institutions around the world. Theres no deadline for submission. Prospective authors of IISTE journals can find the submission instruction on the following page: http://www.iiste.org/journals/ The IISTE editorial team promises to the review and publish all the qualified submissions in a fast manner. All the journals articles are available online to the readers all over the world without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. Printed version of the journals is also available upon request of readers and authors. MORE RESOURCES Book publication information: http://www.iiste.org/book/ Recent conferences: http://www.iiste.org/conference/ IISTE Knowledge Sharing Partners EBSCO, Index Copernicus, Ulrich's Periodicals Directory, JournalTOCS, PKP Open Archives Harvester, Bielefeld Academic Search Engine, Elektronische Zeitschriftenbibliothek EZB, Open J-Gate, OCLC WorldCat, Universe Digtial Library , NewJour, Google Scholar