
K-Means Clustering

https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/

Learning Outcome
By the end of this lecture, you should be able to understand,
explain and apply K-Means Clustering.
• Similarity is measured as the distance between the “points” to be clustered
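As a concrete illustration, the distance between two points is commonly measured with the Euclidean norm; a minimal sketch with NumPy (the two points are made-up examples):

```python
import numpy as np

# two hypothetical points in 2-D feature space
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences
dist = np.linalg.norm(a - b)
print(dist)  # 5.0
```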
K-Means Clustering is one of the simplest unsupervised machine learning algorithms, and it is fast and efficient in terms of computational cost.
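The algorithm itself alternates two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A simplified from-scratch sketch (for illustration only, not the scikit-learn implementation used below):

```python
import numpy as np

def kmeans_sketch(x, k, n_iter=10, seed=0):
    """Toy K-means: x is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # initialise centroids by picking k random data points
    centroids = x[rng.choice(len(x), k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every centroid
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        # assignment step: nearest centroid wins
        labels = dists.argmin(axis=1)
        # update step: centroid becomes the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    return labels, centroids
```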
K-Means Clustering in Python
# K-Means Clustering on the iris flower dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# import the dataset
df = pd.read_csv('iris.csv')
# df.head(10)

# 4 columns of features
x = df.iloc[:, [0, 1, 2, 3]].values

kmeans = KMeans(n_clusters=5)
y_kmeans = kmeans.fit_predict(x)
print(y_kmeans)
print(kmeans.cluster_centers_)
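If iris.csv is not at hand, the same steps can be run against scikit-learn's built-in copy of the dataset (a self-contained variant of the code above, here already using K=3 as the elbow method later suggests):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# built-in iris data: 150 samples, 4 feature columns
x = load_iris().data

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(x)
print(y_kmeans[:10])
print(kmeans.cluster_centers_.shape)  # (3, 4)
```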
# to find the optimum number of clusters
Error = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i).fit(x)
    Error.append(kmeans.inertia_)

# the elbow indicates the optimal value of K;
# edit and run again using the new K
plt.plot(range(1, 11), Error)
plt.title('Elbow method')
plt.xlabel('No of clusters')
plt.ylabel('Error')
plt.show()
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared errors (SSE) between data points and their assigned clusters’ centroids.

(Figure: elbow plot for the iris data, estimated K = 3)

We pick k at the spot where the SSE starts to flatten out, forming an elbow.

https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a

inertia measures how well a dataset was clustered: the sum of squared distances of samples to their closest centroid.

homogeneity score measures how close each cluster comes to containing only members of a single class.

completeness score measures how close the clustering comes to assigning all members of a given class to the same cluster.

V-measure is the harmonic mean of homogeneity and completeness.

adjusted Rand index computes a similarity measure between two clusterings, corrected for chance.

adjusted mutual information computes a similarity measure between two clusterings, adjusted for agreement due to chance.

silhouette coefficient measures the degree of separation between clusters.
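All of the scores above are available in sklearn.metrics; a sketch evaluating a K=3 clustering of the iris data against the true species labels (the first five scores compare to ground truth, while silhouette uses only the data and the predicted labels):

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
x, y_true = iris.data, iris.target

y_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(x)

# supervised scores: compare the clustering to the true class labels
print('homogeneity  :', metrics.homogeneity_score(y_true, y_pred))
print('completeness :', metrics.completeness_score(y_true, y_pred))
print('V-measure    :', metrics.v_measure_score(y_true, y_pred))
print('adjusted Rand:', metrics.adjusted_rand_score(y_true, y_pred))
print('adjusted MI  :', metrics.adjusted_mutual_info_score(y_true, y_pred))

# unsupervised score: needs only the data and the predicted labels
print('silhouette   :', metrics.silhouette_score(x, y_pred))
```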
