
K-Means (Clustering)

1. Clustering Analysis

Clustering analysis groups data so that objects/individuals in one group are more similar to each other than to objects/individuals in other groups. For example, from a ticket booking engine database we can identify clients with similar booking activities and group them together (called clusters). Later, these identified clusters can be targeted for business improvement by issuing special offers, etc. Cluster analysis falls under unsupervised learning algorithms, wherein the data to be analyzed is provided to a cluster analysis algorithm to identify hidden patterns within it, as shown in the figure below.

In the image above, the clustering algorithm has grouped the input data into two groups. There are three popular clustering algorithms: Hierarchical Cluster Analysis, K-Means Cluster Analysis, and Two-Step Cluster Analysis. This article deals with K-Means Clustering.

K-means clustering is an unsupervised learning algorithm that tries to cluster data based on similarity. Unsupervised learning means that there is no outcome to be predicted; the algorithm just tries to find patterns in the data. In k-means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the algorithm iterates through two steps:

1. Reassign each data point to the cluster whose centroid is closest to it.
2. Recalculate the centroid of each cluster as the mean of the data points assigned to it.

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids.
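This within-cluster variation can be computed directly from a cluster assignment; here is a minimal sketch in R (the names `withinss`, `pts`, `assign_vec`, and `centers` are illustrative, not from any package):

```r
# Within-cluster variation: total squared Euclidean distance from each
# data point to the centroid of the cluster it is assigned to.
withinss <- function(pts, assign_vec, centers) {
  sum(sapply(seq_len(nrow(pts)), function(i) {
    sum((pts[i, ] - centers[assign_vec[i], ])^2)
  }))
}

pts <- rbind(c(0, 0), c(0, 2), c(10, 0), c(10, 2))
assign_vec <- c(1, 1, 2, 2)
centers <- rbind(c(0, 1), c(10, 1))
withinss(pts, assign_vec, centers)  # 4: each point is distance 1 from its centroid
```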


Explaining the k-Means Clustering Algorithm

In the k-means algorithm, k stands for the number of clusters (groups) to be formed; hence this algorithm can be used to group a known number of groups within the analyzed data. K-means is an iterative algorithm with two steps: first a cluster assignment step, and second a move centroid step.

CLUSTER ASSIGNMENT STEP: In this step, we randomly choose two cluster points (red dot & green dot) and assign each data point to whichever of the two cluster points is closer to it (top part of the image below).

MOVE CENTROID STEP: In this step, we take the average of all the examples in each group and move the centroid to this new position, i.e. the calculated mean position (bottom part of the image below).

The above steps are repeated until every data point has been assigned to one of the two groups and the means computed at the end of the move centroid step no longer change. At that point, the final grouping of the input data has been obtained.
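These two alternating steps can be written out from scratch; the sketch below (all names are my own, not from any package) repeats them until the assignments stop changing. An empty cluster is not handled here, so this is an illustration rather than production code:

```r
# Minimal k-means sketch: alternate the cluster assignment step and the
# move centroid step until no point changes its cluster.
simple_kmeans <- function(x, k, iter.max = 100) {
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial cluster points
  cl <- rep(0L, nrow(x))
  for (iter in seq_len(iter.max)) {
    # Cluster assignment step: each point joins its nearest centroid
    new_cl <- apply(x, 1, function(p) which.min(colSums((t(centers) - p)^2)))
    if (all(new_cl == cl)) break  # assignments no longer change: converged
    cl <- new_cl
    # Move centroid step: each centroid moves to the mean of its group
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(x[cl == j, , drop = FALSE])
    }
  }
  list(cluster = cl, centers = centers)
}

# Two well-separated synthetic blobs of 10 points each
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))
res <- simple_kmeans(x, 2)
table(res$cluster)
```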

Example 1: Cluster Analysis of Accidental Deaths by Natural Causes in India using the R package cluster.

Using this package, we shall do a cluster analysis of accidental deaths by natural causes in India between 2001 and 2012. The steps implemented are discussed below, and the input data is displayed as below:


For any cluster analysis, all the features have to be converted to numeric values, and the large values in the Year column are converted to z-scores for better results.
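As a side note, base R's scale() performs the same z-score conversion (subtract the mean, divide by the standard deviation); a toy sketch on an illustrative year vector:

```r
# z-score standardization via base R's scale(); as.numeric() drops the
# matrix shape scale() returns.
year <- c(2001, 2004, 2007, 2010)
z <- as.numeric(scale(year))
round(mean(z), 10)  # 0: standardized values are centered
round(sd(z), 10)    # 1: and have unit standard deviation
```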

The elbow method (code available below) is run to find the optimal number of clusters present within the data points. Then the k-means cluster method from the R package is run and the results are visualized as below:


Code:

#Fetch data
data = read.csv("Cluster Analysis.csv")
APStats = data[which(data$STATE == "ANDHRA PRADESH"), ]
APMale = rowSums(APStats[, 4:8])
APFemale = rowSums(APStats[, 9:13])
APStats[, "APMale"] = APMale
APStats[, "APFemale"] = APFemale
data = APStats[c(2, 3, 14, 15)]

library(cluster)
library(graphics)
library(ggplot2)

#Factor the categorical fields
cause = as.numeric(factor(data$CAUSE))
data$CAUSE = cause

#Z-score for the Year column
z = numeric(length(data$Year))
m = mean(data$Year)
sd = sd(data$Year)
year = data$Year
for (i in 1:length(data$Year)) {
  z[i] = (year[i] - m) / sd
}
data$Year = as.numeric(z)

#Calculate the K-means cost (cluster assignment & move centroid steps) for k = 1..100
cost_df <- data.frame()
for (i in 1:100) {
  kmeans_fit <- kmeans(x = data, centers = i, iter.max = 100)
  cost_df <- rbind(cost_df, cbind(i, kmeans_fit$tot.withinss))
}
names(cost_df) <- c("cluster", "cost")

#Elbow method to identify the ideal number of clusters
#Cost plot
ggplot(data = cost_df, aes(x = cluster, y = cost, group = 1)) +
  theme_bw(base_family = "Garamond") +
  geom_line(colour = "darkgreen") +
  theme(text = element_text(size = 20)) +
  ggtitle("Reduction In Cost For Values of k\n") +
  xlab("\nClusters") +
  ylab("Within-Cluster Sum of Squares\n")

clust = kmeans(data, 5)
clusplot(data, clust$cluster, color = TRUE, shade = TRUE, labels = 13, lines = 0)
data[, "cluster"] = clust$cluster
head(data[which(data$cluster == 5), ])

Example 2: Clustering the iris dataset.

The iris dataset contains data about the sepal length, sepal width, petal length, and petal width of flowers of different species. Let us see what it looks like:

library(datasets)

head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3.0 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

4 4.6 3.1 1.5 0.2 setosa

5 5.0 3.6 1.4 0.2 setosa

6 5.4 3.9 1.7 0.4 setosa

After a little bit of exploration, I found that Petal.Length and Petal.Width were similar

among the same species but varied considerably between different species, as demonstrated

below:

library(ggplot2)

ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()


Clustering

Now that we have seen the data, let us try to cluster it. Since the initial cluster assignments are

random, let us set the seed to ensure reproducibility.

set.seed(20)

irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)

irisCluster

K-means clustering with 3 clusters of sizes 46, 54, 50

Cluster means:

Petal.Length Petal.Width

1 5.626087 2.047826

2 4.292593 1.359259

3 1.462000 0.246000

Clustering vector:

[1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

[35] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

[69] 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1

[103] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 1

[137] 1 1 2 1 1 1 1 1 1 1 1 1 1 1

Within cluster sum of squares by cluster:
[1] 15.16348 14.22741  2.02200
 (between_SS / total_SS =  94.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"         "iter"
[9] "ifault"

Since we know that there are 3 species involved, we ask the algorithm to group the data into 3 clusters, and since the starting assignments are random, we specify nstart = 20. This means that R will try 20 different random starting assignments and then select the one with the lowest within-cluster variation.

We can see the cluster centroids, the cluster that each data point was assigned to, and the within-cluster variation.
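Each of these pieces can be read off the fitted object directly; a short sketch repeating the same seeded call as above:

```r
# Refit with the same seed and call, then inspect individual components
# of the object returned by kmeans().
library(datasets)
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)

irisCluster$centers       # the 3 cluster centroids (Petal.Length, Petal.Width)
irisCluster$size          # number of data points in each cluster
irisCluster$tot.withinss  # total within-cluster sum of squares
```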

table(irisCluster$cluster, iris$Species)

setosa versicolor virginica

1 0 2 44

2 0 48 6

3 50 0 0

As we can see, the data belonging to the setosa species got grouped into cluster 3, versicolor

into cluster 2, and virginica into cluster 1. The algorithm wrongly classified two data points

belonging to versicolor and six data points belonging to virginica.
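The misclassification count can also be computed from that table programmatically; a small sketch (repeating the same seeded call, and assuming each cluster's majority species is its intended label):

```r
# Refit with the same seed and call, then cross-tabulate clusters vs species
library(datasets)
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
tab <- table(irisCluster$cluster, iris$Species)

# Points that fall outside their cluster's majority species
misclassified <- sum(tab) - sum(apply(tab, 1, max))
misclassified  # 8: the 2 versicolor + 6 virginica points noted above
```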


We can also plot the data to see the clusters:

irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) +
  geom_point()
