View of Advanced Methods To Improve Performance of K-Means Algorithm - A Review

Global Journals LaTeX JournalKaleidoscope™: Artificial Intelligence formulated this projection for compatibility purposes from the original article published at Global Journals. However, this technology is currently in beta. Therefore, kindly ignore odd layouts, missed formulae, text, tables, or figures.

Advanced Methods to Improve Performance of K-Means Algorithm: A Review

Dr. Ritu Yadav and Anuradha Sharma

Received: 8 April 2012 Accepted: 8 May 2012 Published: 16 May 2012

Abstract

Clustering is an unsupervised classification, that is, the partitioning of a data set into a set of meaningful subsets. Each object in the data set shares some common property, often proximity according to some defined distance measure. Among the various types of clustering techniques, K-means is one of the most popular algorithms. The objective of the K-means algorithm is to make the distances between objects in the same cluster as small as possible. Algorithms, systems and frameworks that address clustering challenges have become more elaborate over the past years. In this review paper, we present the K-means algorithm and its improved techniques.

Index terms: classification, clustering, k-means clustering, partitioning clustering.

1 Introduction

Clustering is a type of categorization that imposes rules on a group of data points or objects. A broad definition of clustering could be "the process of categorizing a finite number of data points into groups where all members in the group are similar in some manner". As a result, a cluster is an aggregation of objects. All data points in the same cluster have common properties (e.g. distance) which differ from those of the data points lying in other clusters.

Cluster analysis is an iterated process of knowledge discovery. It is a multivariate statistical technique which identifies groupings of the data objects based on the inter-object similarities computed by a chosen distance metric. Clustering algorithms can be classified into two categories: hierarchical clustering and partitional clustering [1]. Partitional clustering algorithms, unlike hierarchical clustering algorithms, usually create some sets of clusters at the start and partition the data into similar groups after each iteration. Partitional clustering is used more often than hierarchical clustering because the data set can be divided into more than two subgroups in a single step, whereas a hierarchical method always merges or divides into two subgroups; partitional clustering also does not need to complete the dendrogram [2].

Cluster analysis of data is an important task in knowledge discovery and data mining. Cluster analysis aims to group data on the basis of similarities and dissimilarities among the data elements. The process can be performed in a supervised, semi-supervised or unsupervised manner. Different algorithms have been proposed which take into account the nature of the data and the input parameters in order to partition the data. Data vectors are clustered around centroid vectors. The cluster a data vector belongs to is determined by its distance to the centroid vector. Depending on the nature of the algorithm, the number of centroids is either defined in advance by the user or determined automatically by the algorithm.
Discovering the optimum number of clusters, or natural groups, in the data is not a trivial task. The popular clustering techniques suggested so far are either partition based or hierarchy based, but both approaches have their own advantages and limitations in terms of the number of clusters, the shape of clusters, and cluster overlapping [3]. Some other approaches are designed using different clustering techniques and involve optimization in the process. The involvement of intelligent optimization techniques has been found effective in enhancing the complex, real-time, and costly data mining process.

3 K-means algorithm

The conventional K-means algorithm is based on decomposition and is the most popular technique in the data mining field. The K-means algorithm takes K as a parameter and divides n objects into K clusters so as to create relatively high similarity within a cluster and relatively low similarity between clusters, minimizing the total distance between the values in each cluster and the cluster center. The center of each cluster is the mean value of the cluster, and similarity is calculated from the mean value of the cluster objects. The similarity measure used by the algorithm is the reciprocal of the Euclidean distance. That is to say, the smaller the distance between two objects, the greater their similarity, and vice versa.

a) Procedure of K-means Algorithm

K-means first distributes all objects to K clusters at random, then:
1) Calculate the mean value of each cluster and use this mean value to represent the cluster;
2) Redistribute the objects to the closest cluster according to their distance to the cluster center;
3) Update the mean value of each cluster, that is, calculate the mean value of the objects in each cluster;
4) Calculate the criterion function E, and repeat steps 2) to 4) until the criterion function converges.

Usually, the K-means algorithm adopts the square-error criterion, defined as

$$E = \sum_{i=1}^{K} \sum_{p \in C_i} \lvert p - m_i \rvert^2$$

in which $E$ is the total square error of all the objects in the data set, $p$ is a given data object, and $m_i$ is the mean value of cluster $C_i$ ($p$ and $m_i$ are both multi-dimensional).
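The procedure and the square-error criterion above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the names `k_means`, `cluster_means`, `max_iter`, `tol` and `seed` are assumptions introduced here, and the handling of an empty cluster (reseeding it with a random object) is likewise an assumption, since the text does not cover that case.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cluster_means(points, labels, k, rng):
    """Steps 1 and 3: the mean value of each cluster represents the cluster."""
    means = []
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            means.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            # Assumption of this sketch: reseed an empty cluster with a random object.
            means.append(rng.choice(points))
    return means


def square_error(points, labels, means):
    """Step 4: criterion E, the total squared distance of objects to their cluster mean."""
    return sum(squared_distance(p, means[label]) for p, label in zip(points, labels))


def k_means(points, k, max_iter=100, tol=1e-12, seed=0):
    rng = random.Random(seed)
    # Initial step: distribute all objects to K clusters at random.
    labels = [rng.randrange(k) for _ in points]
    previous_error = None
    for _ in range(max_iter):
        means = cluster_means(points, labels, k, rng)
        error = square_error(points, labels, means)
        # Stop once the criterion function no longer changes.
        if previous_error is not None and abs(previous_error - error) <= tol:
            break
        previous_error = error
        # Step 2: redistribute each object to the cluster with the closest center.
        labels = [min(range(k), key=lambda c: squared_distance(p, means[c]))
                  for p in points]
    else:
        # Iteration budget exhausted: make means and E match the final labels.
        means = cluster_means(points, labels, k, rng)
        error = square_error(points, labels, means)
    return labels, means, error


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    labels, means, error = k_means(data, k=2)
    print(labels, means, error)
```

Comparing squared distances is equivalent to comparing Euclidean distances when picking the closest center, so the square root is skipped.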
The purpose of this criterion is to make the generated clusters as compact and as independent as possible.

b) Analysis of the Performance of the K-means Algorithm

4 Advantages

1) It is a classic algorithm for solving clustering problems; the algorithm is simple and fast;
2) For large data collections, the algorithm is relatively flexible and highly efficient, because its complexity is O(nkt), where n is the number of all objects, k is the number of clusters, and t is the number of iterations. Usually k << n and t << n. The algorithm usually ends at a local optimum;
3) It provides relatively good results for convex clusters;
4) Because of the limitation of the Euclidean distance [5], it can only process numerical values, which have good geometrical and statistical meaning.

Disadvantages

1) It is sensitive to the selection of the initial cluster centers, and usually ends not with a globally optimal solution but with a suboptimal one;
2) There is no applicable evidence for deciding the value of K (the number of clusters to generate), and the algorithm is sensitive to the initial value: different initial values may generate different clusters;
3) The algorithm is easily disturbed by abnormal points; a few such abnormal data points have an extreme influence on the mean value;
4) Sometimes the clustering result may lose balance.

5 Advanced k-means clustering algorithms

The K-means algorithm and its related algorithms belong to the family of center-based clustering algorithms. This family has several methods: expectation maximization, fuzzy K-means and harmonic K-means. A brief review of these algorithms is given in the following sub-sections.

IV Methods to Improve k-means algorithm's performance

a) Methods for Initial Point Selection

i. 6 Refining Initial Points Algorithm

In a partitional clustering algorithm, the first step is to obtain the initial seed points (cluster centers). Choosing good initial points improves the solutions and reduces execution time. A refining initial points algorithm has been proposed for this purpose. First, some subsets with an equal number of samples are chosen at random from the large data set.
Second, the partitional algorithm is applied to each subset to obtain the center set of each subset. Third, those center sets are gathered and the partitional algorithm is applied again to obtain the most suitable center set. In total, the partitional algorithm is thus run twice on smaller sample sets in order to obtain good initial seed points. Finally, the partitional algorithm is run with the most feasible center set as the initial seeds on the original large data set. The algorithm steps are:
1) Randomly build J sample subsets. Si is a random subset of the data (i = 1...J) and the size of Si is S_size.
2) Use the modified algorithm to find the center set Ci of each Si. Gather all Ci (i = 1...J) into C_Total.
3) For each set Ci (i = 1...J), run the partitional algorithm with initial points Ci and data set C_Total to get another center set FCi.
4) For each FCi (i = 1...J), calculate the sum Sumi of the distances between each point in C_Total and the closest center point in FCi.
5) Find the minimum of Sumi (i = 1...J). If Sump is the minimum, take FCp as the final initial points.

7 Cluster Centroid Decision Method

This method proposes a technique to assign each data point to the appropriate cluster centroid. We calculate the distance between each pair of cluster centroids and, for each centroid, take the minimum distance to the remaining centroids and halve it. The result is denoted by $d_c(i)$, i.e., half of the minimum distance from the i-th cluster's centroid to the remaining clusters' centroids. Now take any data point, calculate its distance from the i-th centroid and compare it with $d_c(i)$. If the distance is less than or equal to $d_c(i)$, the data point is assigned to the i-th cluster; otherwise calculate the distance from the next centroid. Repeat this process until the data point is assigned to one of the remaining clusters. If the data point is not assigned to any cluster in this way, the centroid with the minimum distance to the data point becomes the cluster for that data point. Repeat this process for each data point. Then take the mean of each cluster separately and update the cluster centroids as in traditional k-means.
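The assignment rule of the cluster centroid decision method described above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function name `assign_with_centroid_decision` and the variable names are assumptions, points and centroids are assumed to be tuples of numbers, and at least two centroids are required.

```python
import math


def assign_with_centroid_decision(points, centroids):
    """Assign each point to a cluster using the half-minimum-distance test."""
    k = len(centroids)
    # dc[i]: half of the minimum distance from centroid i to any other centroid.
    dc = [
        0.5 * min(math.dist(centroids[i], centroids[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    labels = []
    for p in points:
        chosen = None
        nearest, nearest_dist = 0, float("inf")
        for i, c in enumerate(centroids):
            d = math.dist(p, c)
            if d <= dc[i]:
                # Within dc(i) of centroid i: assign immediately, skip the rest.
                chosen = i
                break
            if d < nearest_dist:
                nearest, nearest_dist = i, d
        # No centroid passed the test: fall back to the closest centroid.
        labels.append(chosen if chosen is not None else nearest)
    return labels


if __name__ == "__main__":
    centroids = [(0.0, 0.0), (10.0, 0.0)]
    points = [(1.0, 0.0), (9.0, 1.0), (5.5, 0.0), (4.0, 8.0)]
    print(assign_with_centroid_decision(points, centroids))
```

The early exit is safe because of the triangle inequality: a point no farther than dc(i) from centroid i cannot be strictly closer to any other centroid, so the remaining distance computations can be skipped.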
This process is repeated until the termination condition is achieved [8].

N: number of data points
K: number of cluster centroids
C_i: the i-th cluster

Some equations used in the algorithm:

$$|C_i, C_j| = \{\, d(m_i, m_j) : (i, j) \in [1, K] \ \&\ i \neq j \,\}$$

where $|C_i, C_j|$ is the distance between cluster $C_i$ and cluster $C_j$, and

$$d_c(i) = \frac{1}{2} \min \{\, |C_i, C_j| \,\}$$

where $d_c(i)$ is half of the minimum distance from the i-th cluster to any other remaining cluster.

8 Cluster Seed Selection

When calculating the clustering seeds for the k-th round with the improved algorithm, the data in each cluster that have great similarity to the seeds of round k-1 are used to calculate their mean point (geometrical center), which becomes the clustering seed of the k-th round. The specific calculation method is as follows [13]:
1) For the cluster Ci(k-1) obtained through round k-1 of clustering, the minimum similarity sim_min_i(k-1) of the data in the cluster to the clustering seed Si(k-1) of the cluster is calculated.
2) The data in the cluster Ci(k-1) whose similarity to the clustering seed Si(k-1) is more than 1 - β(1 - sim_min_i(k-1)) are selected (β is a constant between 0 and 1), and this data set is recorded as cn_i(k-1).
3) The mean point of the data in cn_i(k-1) is calculated as the clustering seed of the k-th round.

b) Methods to Define No. of Clusters

9 Initialization Method

This method depends on the data and works well for finding the best number of clusters and their centroid values. It starts by reading the data as a 2D matrix, and then calculates the mean of the first frame of size F1=300x300, F2=150x150, F3=100x100, F4=50x50, F5=30x30, F6=10x10 or F7=5x5. It keeps the mean values in an array called the means array, continuing until the end of the data matrix. After that, it sorts the values in the means array in ascending order. Where values are similar, the duplicates are removed to avoid an overlap.
In other words, only one value is kept. The method then counts the number of elements in the means array: this number is the number of clusters, and the values are the centroid values, as indicated in the steps below:
1) Read the data set as a matrix.

10 The Encoding Method

According to the characteristics of the K-means clustering algorithm, finding the optimum clustering requires finding the optimum value of K. The value of K is the learning object of the genetic algorithm, and the encoding is based on the value of K. In general, for a classification problem there is always a maximum number of classes, "MAXClassnum", for the clustering; this value is input by the user. K is therefore an integer between 1 and MAXClassnum and can be represented as a binary string. In this experiment one byte is used to express the value of K, which allows a maximum of 255 classes. This value is enough for a normal clustering problem [4]. The genetic operations are:
1) Choose a number of chromosomes from the original n chromosomes using the roulette wheel selection of the traditional genetic algorithm.
2) Apply the crossover method to the selected chromosomes in the mating pool.
3) Apply mutation to the chromosomes in the mating pool.

11 Tentative Clustering

This clustering uses principal components analysis to determine a tentative value for the count of classes and to provide changeable labels for the objects. The approach performs principal component analysis on the standard scores of the given matrix and thereafter projects the matrix to the space of the selected principal vectors. The count of the selected principal vectors depends on the given number of classes (the K-means algorithm requires "K" to proceed). In order to avoid depending on the number of classes "K" and to find the maximum possible number of classes, we project the matrix to the space of all principal vectors. We then calculate a probability matrix (P) from the resulting projection matrix (C) such that P[i,j] shows the probability of connectivity of the i-th object to the j-th object. By refining matrix C according to the probability values of matrix P we find a block matrix that represents groups of objects [2].

12 Conclusion

This paper presents an overview of the k-means clustering algorithm. K-means clustering is a common way to define classes of jobs within a data set. The selection of the initial starting points may have a significant effect on the results of the algorithm, both in the number of clusters found and in their centroids. Methods to improve the performance of k-means clustering are discussed in this paper. These methods fall into two categories: initial point selection and definition of the number of clusters. Six of these methods, three from each category, are presented. These methods have been implemented in data mining systems and can give better results for some practical programs such as character recognition, image processing, and text searching.

© 2012 Global Journals Inc. (US). Global Journal of Computer Science and Technology, Volume XII, Issue IX, Version I.

References

[Gu et al. ()] 'An Enhancement of K-means Clustering Algorithm'. Jirong Gu, Jieming Zhou, Xianwei Chen. Proceeding of Business Intelligence and Financial Engineering, (Chengdu, China) 2009. Key Lab. of the Southwestern Land Resources Monitoring, Sichuan Normal Univ.
[Wang and Su ()] 'An improved K-means Clustering Algorithm'. Juntao Wang, Xiaolong Su. Proceeding of Communication Software and Networks (ICCSN), 2011 IEEE 3rd International Conference, (Xuzhou, China) 2011.
[Samma and Salam ()] 'Adaptation of K-Means Algorithm for Image Segmentation'. Ali Salem Bin Samma, Rosalina Abdul Salam. In proceeding of International Journal of Signal Processing 2009.
[Zhang ()] Generalized k-harmonic means: Boosting in unsupervised learning, B Zhang. HPL-2000-137. 2000. Hewlett-Packard Labs. (Technical Report)
[Han and Kamber ()] J Han, M Kamber. Data Mining: Concepts and Techniques, (San Francisco, CA) 2001. Morgan Kaufmann Publishers.
[Pérez et al. ()] 'Improving the Efficiency and Efficacy of the K-means Clustering Algorithm Through a New Convergence Condition'. J Pérez, R Pazos, L Cruz, G Reyes, R Basave, H Fraire. ICCSA 2007, Gervasi, M Gavrilova (ed.) (Part III; Berlin Heidelberg) 2007. Springer-Verlag. 4707 p.
[Kearney and Patton ()] 'Information gain ranking'. Colm Kearney, Andrew J Patton. Financial Review 2000. 41 p.
[Dashti et al. ()] 'MK-means: Modified K-means clustering algorithm'. H T Dashti, T Simas, R A Ribeiro, A Assadi, A Moitinho. The 2010 International Joint Conference, (Madison, WI, USA) 2010. Univ. of Wisconsin. (proceeding of Neural Networks (IJCNN))
[Li ()] Modified K-means clustering algorithm, W Li. 10.1109/CISP.2008.249. 2008. IEEE.
[Bishop ()] Neural networks for pattern recognition, C M Bishop. 1998. Oxford: Clarendon Press.
[Xinwu ()] 'Research on Text Clustering Algorithm Based on Improved K-means'. Li Xinwu. 2010 International Conference, (Nanchang, China) 2010. Jiangxi Univ. of Finance & Econ. (in proceeding of Computer Design and Applications (ICCDA))
[Singh and Bhatia ()] R V Singh, M P S Bhatia. Data Clustering with Modified K-means Algorithm, in proceeding of Recent Trends in Information Technology (ICRTIT), 2011 International Conference, (New Delhi, India) 2011. Dept. of Comput. Sci. & Eng., Univ. of Delhi.
[Małyszko and Wierzchoń ()] Standard and Genetic K-means Clustering Techniques in Image Segmentation, D Małyszko, S T Wierzchoń. (CISIM'07) 0-7695-2894-5/07. IEEE 2007.
