
K-means clustering

-> It is an unsupervised learning algorithm.
-> It is an iterative algorithm that divides the unlabelled data into K different clusters, in such a way that each data point belongs to only one group of points with similar properties.

If K = 2, there will be two clusters; if K = 3, three clusters; and so on.

-> It is a centroid-based algorithm, where each cluster is associated with a centroid.
-> Main aim: minimize the sum of distances between the data points and their corresponding cluster centroids.
How does the K-Means algorithm work??
Step 1. Select the number K to decide the no. of clusters.

Step 2. Select K random points as initial centroids (they can be points other than the input data).

Step 3. Assign each data point to its closest centroid.

Step 4. Calculate the variance and place a new centroid for each cluster.

Step 5. Repeat the 3rd step: reassign each data point to its new closest centroid.

Step 6. If any reassignment occurs, go to step 4; else go to FINISH.

Step 7. Model is ready.
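The steps above can be sketched in plain NumPy. This is a minimal illustration of the loop, not an optimized implementation; the function name `kmeans`, the centroid-stability stopping test, and the empty-cluster guard are my own choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal sketch of the steps above (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop when no centroid moves, i.e. no reassignments occur.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

At convergence every point is closer to its own centroid than to any other, which is exactly the "no reassignment" condition of step 6.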

Let us visualize this!!!

Suppose we have two variables M1 and M2. The x-y scatter plot:

Let us take K = 2 and put those centroids at random positions.

Now we will assign each data point to its closest centroid. [We do it using some mathematical techniques.] So we will draw a median between both the centroids.

From this picture we can say that all the points on the top side of the line are close to the blue centroid: they all belong to the blue cluster, whereas all the points on the downside of the line belong to the orange cluster.
But the steps do not end here, as we want to find the optimal clusters.

Now we will bring the centroids to the centre of their clusters again and repeat all the steps.

Perpendicular bisector: here we can say we got 3 points as errors, so we repeat again. Reassign the points to the relocated centroids.

Now draw the perpendicular bisector again. But this time we got no errors. This will be our final cluster.
# Properties of Clusters

1st: All the data points in a cluster should be similar to each other.

Suppose a bank wants to segment its customers. If the customers in a particular cluster are not similar to each other, then their requirements vary. If the bank gives them the same interest offer, they may not like it, and their trust in the bank might reduce.

2nd: The data points from different clusters should be as different as possible.

Suppose, let us take an example.

[Scatter plot: Income vs Debt, showing two possible clustering schemes.]

What do you think, which cluster scheme is best??

In case I, we can clearly see that the peach and blue clusters are totally different from each other. The peach cluster has high income and high debt, whereas the blue cluster has high income and low debt.

Hence, data points from different clusters should be as different from each other as possible to have more meaningful clusters.

K-means uses an iterative approach to find the optimal cluster assignments by minimizing the sum of squared distances between data points and their assigned cluster centroids.
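As a concrete sketch of this (assuming scikit-learn is available; the two-blob data here is made up for illustration), `sklearn.cluster.KMeans` performs exactly this iterative minimization of the sum of squared distances:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated 2-D blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the final centroids
print(km.inertia_)          # sum of squared distances to assigned centroids
```

`n_init=10` restarts the algorithm from 10 different random initializations and keeps the best run, which guards against the bad local optima the iterative procedure can fall into.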
# Understanding different Evaluation metrics for clustering

The primary aim of clustering is not just to make clusters, but to make good and meaningful ones.

Inertia
-> It tells us how far apart the points within a cluster are.

So inertia calculates the sum of distances of all the points within a cluster from the centroid of that cluster.

Normally we use Euclidean distance as the distance metric, as long as most of the features are numeric; otherwise, we use Manhattan distance in case most of the features are categorical.

The distance within a cluster is known as the intracluster distance. So inertia gives us the sum of intracluster distances.

The lesser the inertia value, the better our clusters are.
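Inertia can be computed directly from this definition. A small sketch (the helper name `inertia` is my own), using squared Euclidean distances, which is the convention scikit-learn's `KMeans.inertia_` also follows:

```python
import numpy as np

def inertia(X, labels, centroids):
    # Sum of squared distances of every point to its own cluster's centroid.
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))
```

A lower value means tighter clusters; note that inertia only looks at intracluster distances, which is why it cannot judge the 2nd property on its own.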
Dunn Index
-> Inertia takes care of the 1st property: it ensures the distance between the points and their centroid decreases. But what about the 2nd property??

This is where the Dunn index comes into action.

[Diagram: intracluster distance within each cluster vs intercluster distance between clusters.]

Along with the distance between the centroid and the points, the Dunn index also takes into account the distance between two clusters. The distance between the centroids of two different clusters is known as the intercluster distance.

The Dunn index is the ratio of the minimum of intercluster distances and the maximum of intracluster distances.

Dunn Index = min(intercluster distance) / max(intracluster distance)

The more the value of the Dunn index, the better the clusters will be.
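A minimal sketch of this ratio (the function name `dunn_index` is mine; following the definitions in these notes, intercluster distance is taken centroid-to-centroid and intracluster distance as the largest point-to-own-centroid distance — other variants of the Dunn index exist):

```python
import numpy as np

def dunn_index(X, labels, centroids):
    k = len(centroids)
    # Minimum distance between any pair of cluster centroids.
    inter = min(np.linalg.norm(centroids[i] - centroids[j])
                for i in range(k) for j in range(i + 1, k))
    # Maximum distance from any point to its own cluster's centroid.
    intra = max(np.linalg.norm(X[labels == j] - centroids[j], axis=1).max()
                for j in range(k))
    return inter / intra
```

Widely separated, tight clusters give a large numerator and a small denominator, so a higher Dunn index is better, as stated above.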
Silhouette Score

The silhouette score measures the similarity of each point to its own cluster compared to other clusters.

A high sil. score indicates that the clusters are well separated, and each sample is more similar to the samples in its own cluster than to samples in other clusters.

Sil. score ≈ 0 -> overlapping clusters
Sil. score < 0 -> poor clustering

How to pick the optimal number of clusters??

① Elbow method
② Silhouette coefficient
③ Gap statistics
④ Information criteria -> BIC and AIC
⑤ Domain knowledge
[Elbow plot: within-cluster sum of squares (WCSS) vs number of clusters K.]
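The elbow curve behind such a plot can be reproduced with a short loop (assuming scikit-learn; the three-blob toy data is my own, chosen so the elbow should appear at K = 3): WCSS is just `KMeans.inertia_` at each K, and we look for the K where its decrease flattens out.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the elbow should appear at K = 3.
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 5, 10)])

# Within-cluster sum of squares (inertia) for K = 1..7.
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 8)]
```

WCSS always shrinks as K grows (more centroids can only fit tighter), which is why we look for the bend in the curve rather than its minimum.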

Silhouette Analysis

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the average distance from point i to the other points in its own cluster, and b(i) is the average distance from point i to the points in the nearest other cluster.

We will then calculate the average silhouette = mean[s(i)]. Then plot the graph of avg sil. score vs each K.

Note points:
1. The value will come between [-1, 1].
2. 1 -> best; 0 -> overlapping; -1 -> worst.

We can see the sil. score is maximized at K = 3, so we can take 3 clusters.

The sil. score is used in combination with the elbow method for a more confident decision.
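Putting this into code (assuming scikit-learn; the same kind of three-blob toy data as before, so the "right" answer is K = 3), a sketch that scores each candidate K and picks the best:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated blobs of 2-D points.
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 5, 10)])

# Average silhouette score for each candidate K.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

`silhouette_score` returns the mean s(i) over all samples; in practice you would look at this curve alongside the elbow plot before committing to a K.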
