
K-MEANS ALGORITHM

• The K-means clustering algorithm computes the centroids and iterates until
the centroids stop changing. It assumes that the number of clusters is
already known.
• It is also called a flat clustering algorithm. The number of clusters to be
identified in the data is represented by ‘K’ in K-means.
• In this algorithm, the data points are assigned to clusters in such a
manner that the sum of the squared distances between the data points
and their cluster centroids is minimized.
• Less variation within a cluster means that the data points in that
cluster are more similar to one another.
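As a rough illustration of the quantity being minimized, the sketch below computes the sum of squared distances for two candidate assignments of the same four points. The function name within_cluster_ss and the toy data are assumptions made for this example; they are not part of any library.

```python
import numpy as np

def within_cluster_ss(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    diffs = X - centroids[labels]          # one difference vector per point
    return float(np.sum(diffs ** 2))

# Toy data (assumed): two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

good = np.array([0, 0, 1, 1])   # each point assigned to its nearby centroid
bad = np.array([0, 1, 0, 1])    # two points assigned to the far centroid

wcss_good = within_cluster_ss(X, good, centroids)
wcss_bad = within_cluster_ss(X, bad, centroids)
print(wcss_good, wcss_bad)
```

The assignment with the smaller value groups similar points together, which is the sense in which less variation within clusters is better.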
Working of K-Means Algorithm
• Step 1 − First, specify the number of clusters, K, to be generated by
the algorithm.
• Step 2 − Next, randomly select K data points as the initial cluster
centers and assign every data point to a cluster.
• Step 3 − Now compute the cluster centroids.
• Step 4 − Next, keep iterating the following until the centroids stop
moving, that is, until the assignment of data points to
clusters no longer changes −
• 4.1 − First, compute the squared distance between each data point
and each centroid.
• 4.2 − Now, assign each data point to the cluster whose centroid
is closest.
• 4.3 − Finally, recompute the centroid of each cluster by taking the
average of all data points in that cluster.
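The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name kmeans_sketch, the n_iter cap, the seed parameter, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-means loop following Steps 1 to 4 (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Steps 2-3: use K distinct, randomly chosen data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iter):
        # 4.1: squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # 4.2: assign each point to its nearest centroid.
        new_labels = d2.argmin(axis=1)
        # Stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # 4.3: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:   # assumption: an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Toy data (assumed): two well-separated groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = kmeans_sketch(X, k=2)
print(centroids)
print(labels)
```

Which cluster gets index 0 or 1 depends on the random initial choice, so only the grouping of the points is meaningful, not the label numbers.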
Implementation in Python
• Example 1

This is a simple example to show how k-means works. We first generate
a 2D dataset containing 4 different blobs and then apply the k-means
algorithm to see the result.
• First, we import the necessary packages −

    # The next line is for Jupyter/IPython notebooks only.
    %matplotlib inline
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set()
    import numpy as np
    from sklearn.cluster import KMeans

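• The following code generates the 2D dataset containing four blobs −

    # make_blobs is imported from sklearn.datasets; the older
    # sklearn.datasets.samples_generator path was removed from scikit-learn.
    from sklearn.datasets import make_blobs
    X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60,
                           random_state=0)
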
• Next, the following code visualizes the dataset −

    plt.scatter(X[:, 0], X[:, 1], s=20)
    plt.show()

• Next, create a KMeans object with the number of clusters, train the
model, and predict the cluster of each point as follows −

    kmeans = KMeans(n_clusters=4)
    kmeans.fit(X)
    y_kmeans = kmeans.predict(X)

• Now, with the help of the following code, we can plot the clustered
points together with the cluster centers found by the k-means estimator −

    # Color each point by its predicted cluster.
    plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=20, cmap='summer')
    # Overlay the learned cluster centers.
    centers = kmeans.cluster_centers_
    plt.scatter(centers[:, 0], centers[:, 1], c='blue', s=100, alpha=0.9)
    plt.show()

Advantages of K-Means
1. Simple and easy to implement: The k-means algorithm is easy to
understand and implement, making it a popular choice for clustering
tasks.
2. Fast and efficient: The cost of each iteration grows only linearly with
the number of data points, clusters, and dimensions, so k-means stays
practical on large datasets.
3. Scalability: K-means can handle a large number of data points and
can be easily scaled to handle even larger datasets.
4. Flexibility: K-means can be easily adapted to different applications
and can be used with different distance metrics and initialization
methods.
Disadvantages of K-Means
1. Sensitivity to initial centroids: K-means is sensitive to the initial
selection of centroids and can converge to a suboptimal solution.
2. Requires specifying the number of clusters: The number of clusters K
needs to be specified before running the algorithm, which can be
challenging in some applications.
3. Sensitive to outliers: K-means is sensitive to outliers, which can have
a significant impact on the resulting clusters.
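The first disadvantage can be seen on a very small dataset. The sketch below runs the same assign-and-update loop from two different initial centroid choices on four points at the corners of a wide rectangle. The function name lloyd and the toy data are assumptions for this illustration, and the loop assumes no cluster ever becomes empty (true for this data).

```python
import numpy as np

def lloyd(X, init_centroids, n_iter=100):
    """Run K-means updates from the given initial centroids (illustrative only)."""
    centroids = np.array(init_centroids, dtype=float)
    k = len(centroids)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Assumes every cluster keeps at least one point (holds for this toy data).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and sum of squared distances for the returned centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = float(d2[np.arange(len(X)), labels].sum())
    return centroids, labels, wcss

# Four corners of a 4-by-1 rectangle (assumed toy data).
X = np.array([[0, 0], [0, 1], [4, 0], [4, 1]], dtype=float)

# Start A: one initial centroid on each short side.
_, labels_a, wcss_a = lloyd(X, [[0, 0], [4, 0]])
# Start B: both initial centroids on the same short side.
_, labels_b, wcss_b = lloyd(X, [[0, 0], [0, 1]])

print(wcss_a, wcss_b)
```

Both runs stop because the assignments no longer change, yet start B settles on a split with a much larger sum of squared distances. Running the algorithm from several initializations and keeping the best result is a common way to reduce this risk.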
Applications of K-Means Clustering
K-Means clustering is used in a variety of real-life applications and
business cases, such as:
• Academic performance
• Diagnostic systems
• Search engines
• Wireless sensor networks
Problem 01
• Cluster the following eight points (with (x, y) representing locations) into three
clusters:
• A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5),
A6(6, 4), A7(1, 2), A8(4, 9)
• Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is the
Manhattan distance, defined as-
• Ρ(a, b) = |x2 – x1| + |y2 – y1|

• Use K-Means Algorithm to find the three cluster centers after the
second iteration.
Solution-
• We follow the K-Means Clustering Algorithm discussed above-
• Iteration-01:

• We calculate the distance of each point from each of the three cluster
centers.
• The distance is calculated by using the given distance function.
• The following illustration shows the calculation of the distance between point
A1(2, 10) and each of the three cluster centers-
Calculating Distance Between A1(2, 10) and C1(2, 10)-

Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
= 0 + 0
= 0

Calculating Distance Between A1(2, 10) and C2(5, 8)-

Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
= 3 + 2
= 5

Calculating Distance Between A1(2, 10) and C3(1, 2)-

Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
= 1 + 8
= 9

In a similar manner, we calculate the distance of the other points from
each of the three cluster centers.
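The three hand calculations for A1 can be checked with a short Python sketch. The helper name manhattan is an assumption for illustration; it simply implements the distance function Ρ given in the problem.

```python
def manhattan(a, b):
    """Distance from the problem statement: |x2 - x1| + |y2 - y1|."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

A1 = (2, 10)
C1, C2, C3 = (2, 10), (5, 8), (1, 2)   # initial cluster centers

d = [manhattan(A1, c) for c in (C1, C2, C3)]
print(d)   # [0, 5, 9]
```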

• We draw a table showing all the results.
• Using the table, we decide which point belongs to which cluster.
• Each point belongs to the cluster whose center is nearest to it.

Given Point   Distance from center    Distance from center    Distance from center    Point belongs
              (2, 10) of Cluster-01   (5, 8) of Cluster-02    (1, 2) of Cluster-03    to Cluster

A1(2, 10)     0                       5                       9                       C1
A2(2, 5)      5                       6                       4                       C3
A3(8, 4)      12                      7                       9                       C2
A4(5, 8)      5                       0                       10                      C2
A5(7, 5)      10                      5                       9                       C2
A6(6, 4)      10                      5                       7                       C2
A7(1, 2)      9                       10                      0                       C3
A8(4, 9)      3                       2                       10                      C2

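The whole table can be reproduced with a few lines of Python. This is only a sketch for checking the hand calculations; the names points, centers, manhattan, table, and assignment are assumptions chosen for this example.

```python
points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
centers = {"C1": (2, 10), "C2": (5, 8), "C3": (1, 2)}   # initial centers

def manhattan(a, b):
    """Distance from the problem statement: |x2 - x1| + |y2 - y1|."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])

table = {}        # point name -> distances to C1, C2, C3
assignment = {}   # point name -> nearest cluster
for name, p in points.items():
    dists = {c: manhattan(p, xy) for c, xy in centers.items()}
    table[name] = list(dists.values())
    assignment[name] = min(dists, key=dists.get)
    print(name, table[name], "->", assignment[name])
```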
• From here, the new clusters are-
• Cluster-01 contains the point: A1(2, 10)
• Cluster-02 contains the points: A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A8(4, 9)
• Cluster-03 contains the points: A2(2, 5), A7(1, 2)
• Now, we re-compute the cluster centers.
• The new center of a cluster is computed by taking the mean of all the points
contained in that cluster.

• For Cluster-01:

• We have only one point A1(2, 10) in Cluster-01.
• So, cluster center remains the same.
• For Cluster-02:

• Center of Cluster-02
• = ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
• = (6, 6)
• For Cluster-03:

• Center of Cluster-03
• = ((2 + 1)/2, (5 + 2)/2)
• = (1.5, 3.5)
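The center update can be verified in the same way. The sketch below takes the cluster memberships found in Iteration-01 and averages the coordinates of each cluster; the names clusters, mean_point, and new_centers are assumptions for this example.

```python
clusters = {
    "Cluster-01": [(2, 10)],
    "Cluster-02": [(8, 4), (5, 8), (7, 5), (6, 4), (4, 9)],
    "Cluster-03": [(2, 5), (1, 2)],
}

def mean_point(pts):
    """Coordinate-wise mean of a list of (x, y) points."""
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

new_centers = {name: mean_point(pts) for name, pts in clusters.items()}
print(new_centers)
```

These new centers are the starting point for the next iteration, in which the distances and assignments are computed again.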

• This completes Iteration-01.
