
1. Install R. For clustering you need the following packages: cluster (usually installed by default), fpc, pvclust, mcclust.

2. Install all these packages with the command >install.packages("package_name", lib="path_of_lib"), e.g. >install.packages("fpc", lib="C:/Program Files/R/R-2.15.1/library") (">" is the R prompt).

3. An installed package can be loaded with the command >library(package_name), e.g. >library(fpc).

4. Copy the data file into the working directory. You can find your working directory with the command >getwd(), or you can set the path with >setwd("path_of_wd").

5. Load the data into an R data frame with the command read.csv (for csv files), e.g. >mydata<-read.csv("1-total_vanz_client_engros_num.csv"). Of course, you can change this overly long Romanian file name. You can load any other file, but remember: for this type of clustering the file can contain only numerical data.
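A minimal sketch of the loading step, assuming a small made-up csv file (the name demo.csv and its two columns are hypothetical, written to a temporary directory only so that the example is self-contained). It also checks that every column is numeric, which the clustering functions used in these notes expect.

```r
# Hypothetical demo file; replace it with your own csv in the working directory.
demo_path <- file.path(tempdir(), "demo.csv")
write.csv(data.frame(x = c(1.0, 2.5, 3.1), y = c(10, 20, 30)),
          demo_path, row.names = FALSE)

mydata <- read.csv(demo_path)          # read.csv returns a data frame

# TRUE only if every column is numeric
all_numeric <- all(sapply(mydata, is.numeric))
print(all_numeric)

# Keep only the numeric columns, in case some are not
mydata_num <- mydata[sapply(mydata, is.numeric)]
str(mydata_num)
```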

6. The first type of clustering is a "classical" clustering using the k-means algorithm. Just type >kmeans.result<-kmeans(mydata, 3) ("3" is the number of clusters you want; theoretically it could be any number). You can try the algorithm with 2, 4, 5, 6 etc. clusters.

If you type >kmeans.result you can see the result of the clustering at any time. You can see particular details of the clustering through its components: 'cluster', 'centers', 'totss', 'withinss', 'tot.withinss', 'betweenss', 'size', e.g. >kmeans.result$cluster (to see only the clusters) or >kmeans.result$centers (to see the centroid of every cluster), etc. We can plot a graph for 2 or 3 variables, but I will not go into too many details.
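A short sketch of the k-means step on the iris data set that ships with R (used here only as a stand-in for the real data file; the seed value is an arbitrary choice to make the run repeatable). It shows a few of the result components and a two-variable plot.

```r
# iris ships with R; drop the non-numeric Species column (column 5)
iris2 <- iris[-5]

set.seed(42)                           # k-means starts from random centers
kmeans.result <- kmeans(iris2, 3)      # 3 clusters

kmeans.result$size                     # number of points in each cluster
kmeans.result$centers                  # one row per centroid

# The total sum of squares splits into a within part and a between part
all.equal(kmeans.result$totss,
          kmeans.result$tot.withinss + kmeans.result$betweenss)

# Compare the clusters with the known species
table(iris$Species, kmeans.result$cluster)

# Plot two of the variables, colored by cluster, with the centroids marked
plot(iris2[c("Sepal.Length", "Sepal.Width")], col = kmeans.result$cluster)
points(kmeans.result$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 2)
```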

7. For k-medoids clustering (more robust than k-means if we have outliers in the data) we need to load the fpc package with the command >library(fpc). There are two main algorithms, PAM and CLARA, implemented in the pam() and clara() R functions of the cluster package; the pamk() function from fpc does not require the user to choose the number of clusters: it calls pam() (or clara()) and estimates the number of clusters, e.g. >pamk.result<-pamk(mydata). Type >pamk.result and you will see the result.

To use pam() (from the cluster package, loaded with >library(cluster)) you have to choose the number of clusters (for example 3): >pam.result<-pam(mydata, 3). Type >pam.result to see the result. You will observe that pamk() takes more time than kmeans() or pam(). The major difference between kmeans and pam/pamk is that while in k-means a cluster is represented by its center, in k-medoids (the pam/pamk algorithms) the cluster is represented by the object closest to the center of the cluster.
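To make the center-versus-medoid difference concrete, here is a small sketch on the iris data (again only a stand-in for the real file). It assumes the cluster package is available, which is normally the case in a standard R installation; the pamk() part runs only if fpc is installed.

```r
library(cluster)                       # provides pam()

iris2 <- iris[-5]
pam.result <- pam(iris2, 3)

pam.result$medoids                     # each medoid is an actual row of the data
pam.result$id.med                      # row numbers of the medoids in iris2

# k-means centers, by contrast, are averages and need not be data points
set.seed(42)
kmeans(iris2, 3)$centers

# pamk() estimates the number of clusters itself
if (requireNamespace("fpc", quietly = TRUE)) {
  pamk.result <- fpc::pamk(iris2)
  print(pamk.result$nc)                # estimated number of clusters
}
```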

8. We can do hierarchical clustering with the hclust() function. Type >hc<-hclust(dist(mydata), method="ave"). This method is more complicated: for plotting we need a variable as a label, which could be an index of the initial data. If we apply this, I'll give more details.
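One possible way to plot the tree, sketched on the iris data with the row index used as the label. Cutting the tree into 3 groups is an arbitrary choice made for the illustration, not something these notes prescribe.

```r
iris2 <- iris[-5]

hc <- hclust(dist(iris2), method = "ave")   # "ave" abbreviates "average" linkage

# Use the row index of the original data as the label of each leaf
plot(hc, labels = seq_len(nrow(iris2)), hang = -1, cex = 0.5)

# Cut the tree into 3 groups and outline them on the dendrogram
groups <- cutree(hc, k = 3)
rect.hclust(hc, k = 3)
table(groups)
```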

9. For density-based clustering we can use the DBSCAN algorithm from the fpc package. The main idea is to group objects into one cluster if they are connected to one another by a densely populated area. There are 2 parameters: 'eps' - the reachability distance, which defines the size of the neighborhood (if it is too small you can get zero clusters!), and 'MinPts' - the minimum number of points. Most of the time you can try different values of these parameters. For example, if you try >ds<-dbscan(mydata, eps=0.41, MinPts=5) you get zero clusters (not enough dense points). If the number of points in the neighborhood of a point is no less than MinPts, then this point is a "dense point". The strength of density-based clustering is that it can discover clusters of various shapes and sizes and it is insensitive to noise (k-means finds clusters with a spherical shape and approximately similar sizes).

Unfortunately, the file I found seems to be insensitive to density-based clustering (it seems to have no relevant variance in point density - you can check this by trying different values for eps and MinPts; with >ds<-dbscan(mydata, eps=1, MinPts=1) you will find... 1990 clusters). But if you want to see how this algorithm works with some results, you can try it with a very small data file (which is included by default in R). Type >iris2<-iris[-5] #to remove the nonnumeric column, then >ds<-dbscan(iris2, eps=0.42, MinPts=5). See the result with >ds and the clusters with >ds$cluster. You can change eps and MinPts to see what happens (the data file is about flower species).
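The iris experiment can be sketched as a short script. It assumes the fpc package is installed; the second eps value (0.2) is just an arbitrary smaller value chosen to show the effect of shrinking the neighborhood.

```r
library(fpc)                           # assumes fpc is installed

iris2 <- iris[-5]                      # remove the non-numeric column
ds <- dbscan(iris2, eps = 0.42, MinPts = 5)

ds                                     # summary of the points per cluster
table(ds$cluster, iris$Species)        # cluster 0 holds the noise points

# A smaller eps shrinks each neighborhood, so more points end up as noise
ds_small <- dbscan(iris2, eps = 0.2, MinPts = 5)
sum(ds$cluster == 0)
sum(ds_small$cluster == 0)
```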

For what we have, I think it is good to test kmeans, pam and pamk. Once we decide what kind of algorithms we'll use, we can write an R function to simplify this entire manual job.
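As a rough idea of what such a helper might look like, here is a sketch. The function name run_clustering and its arguments are hypothetical, and it covers only kmeans and pam; it is shown on iris rather than on the real data.

```r
library(cluster)                       # for pam()

# Hypothetical helper: run one algorithm and return the cluster labels
run_clustering <- function(data, k, method = c("kmeans", "pam")) {
  method <- match.arg(method)
  data <- data[sapply(data, is.numeric)]    # keep numeric columns only
  switch(method,
         kmeans = kmeans(data, k)$cluster,
         pam    = pam(data, k)$clustering)
}

set.seed(42)
labels_km  <- run_clustering(iris, 3, "kmeans")
labels_pam <- run_clustering(iris, 3, "pam")
table(labels_km, labels_pam)           # how the two partitions overlap
```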

I hope there is no "fatal" typing error in the syntax of the R commands. Sorry for the English errors.