
K-means clustering

-> It is an unsupervised learning algorithm.
-> It is an iterative algorithm that divides the unlabelled data into K different clusters, in such a way that each data point belongs to only one group of points with similar properties.

If K = 2, there will be two clusters; if K = 3, three clusters; and so on.

-> It is a centroid-based algorithm, where each cluster is associated with a centroid.
-> Main aim: minimize the sum of distances between the data points and their corresponding cluster centroids.
How does the K-Means algorithm work??
Step 1. Select the number K to decide the no. of clusters.

Step 2. Select K random points as initial centroids (they can be points other than the input data).

Step 3. Assign each data point to its closest centroid.

Step 4. Calculate the variance and place a new centroid for each cluster.

Step 5. Repeat the 3rd step: reassign each data point to its new closest centroid.

Step 6. If any reassignment occurs, go to step 4; else go to FINISH.

Step 7. Model is ready.
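The steps above can be sketched in plain NumPy. This is a minimal illustration of the loop, not an optimized implementation; the function name `kmeans`, the centroid-stability stopping test, and the empty-cluster guard are my own choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal sketch of the steps above (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop when no centroid moves, i.e. no reassignments occur.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

At convergence every point is closer to its own centroid than to any other, which is exactly the "no reassignment" condition of step 6.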

Let us visualize this!!!

Suppose we have two variables M1 and M2. The x-y scatter plot:

Let us take K = 2 and put those centroids at random positions.

Now we will assign each data point to its closest centroid. [We do it using some mathematical techniques.] So we will draw a median between both the centroids.

From this picture we can say that all the points on the top side of the line are close to the blue centroid: they all belong to the blue cluster, whereas all the points on the downside of the line belong to the orange cluster.
But the steps do not end here, as we want to find the optimal clusters.

Now we will bring the centroids to the centre of their clusters again and repeat all the steps.

Perpendicular bisector: here we can say we got 3 points as errors, so we repeat again. Reassign the points to the relocated centroids.

Now draw the perpendicular bisector again. But this time we got no errors. This will be our final cluster.
# Properties of Clusters

1st: All the data points in a cluster should be similar to each other.

Suppose a bank wants to segment its customers. If the customers in a particular cluster are not similar to each other, then their requirements vary. If the bank gives them the same interest offer, they may not like it, and their trust in the bank might reduce.

2nd: The data points from different clusters should be as different as possible.

Suppose, let us take an example.

[Scatter plot: Income vs Debt, showing two possible clustering schemes.]

What do you think, which cluster scheme is best??

In case I, we can clearly see that the peach and blue clusters are totally different from each other. The peach cluster has high income and high debt, whereas the blue cluster has high income and low debt.

Hence, data points from different clusters should be as different from each other as possible to have more meaningful clusters.

K-means uses an iterative approach to find the optimal cluster assignments by minimizing the sum of squared distances between data points and their assigned cluster centroids.
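As a concrete sketch of this (assuming scikit-learn is available; the two-blob data here is made up for illustration), `sklearn.cluster.KMeans` performs exactly this iterative minimization of the sum of squared distances:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated 2-D blobs.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the final centroids
print(km.inertia_)          # sum of squared distances to assigned centroids
```

`n_init=10` restarts the algorithm from 10 different random initializations and keeps the best run, which guards against the bad local optima the iterative procedure can fall into.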
# Understanding different Evaluation metrics for clustering

The primary aim of clustering is not just to make clusters, but to make good and meaningful ones.

Inertia
-> It tells us how far apart the points within a cluster are.

So inertia calculates the sum of distances of all the points within a cluster from the centroid of that cluster.

Normally we use Euclidean distance as the distance metric, as long as most of the features are numeric; otherwise, we use Manhattan distance in case most of the features are categorical.

The distance within a cluster is known as the intracluster distance. So inertia gives us the sum of intracluster distances.

The lesser the inertia value, the better our clusters are.
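Inertia can be computed directly from this definition. A small sketch (the helper name `inertia` is my own), using squared Euclidean distances, which is the convention scikit-learn's `KMeans.inertia_` also follows:

```python
import numpy as np

def inertia(X, labels, centroids):
    # Sum of squared distances of every point to its own cluster's centroid.
    return sum(((X[labels == j] - c) ** 2).sum()
               for j, c in enumerate(centroids))
```

A lower value means tighter clusters; note that inertia only looks at intracluster distances, which is why it cannot judge the 2nd property on its own.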
Dunn Index
-> Inertia takes care of the 1st property: it ensures the distance between the points and their centroid decreases. But what about the 2nd property??

This is where the Dunn index comes into action.

[Diagram: intracluster distance within each cluster vs intercluster distance between clusters.]

Along with the distance between the centroid and the points, the Dunn index also takes into account the distance between two clusters. The distance between the centroids of two different clusters is known as the intercluster distance.

The Dunn index is the ratio of the minimum of intercluster distances and the maximum of intracluster distances.

Dunn Index = min(intercluster distance) / max(intracluster distance)

The more the value of the Dunn index, the better the clusters will be.
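A minimal sketch of this ratio (the function name `dunn_index` is mine; following the definitions in these notes, intercluster distance is taken centroid-to-centroid and intracluster distance as the largest point-to-own-centroid distance — other variants of the Dunn index exist):

```python
import numpy as np

def dunn_index(X, labels, centroids):
    k = len(centroids)
    # Minimum distance between any pair of cluster centroids.
    inter = min(np.linalg.norm(centroids[i] - centroids[j])
                for i in range(k) for j in range(i + 1, k))
    # Maximum distance from any point to its own cluster's centroid.
    intra = max(np.linalg.norm(X[labels == j] - centroids[j], axis=1).max()
                for j in range(k))
    return inter / intra
```

Widely separated, tight clusters give a large numerator and a small denominator, so a higher Dunn index is better, as stated above.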
Silhouette Score

The silhouette score measures the similarity of each point to its own cluster compared to other clusters.

A high sil. score indicates that the clusters are well separated, and each sample is more similar to the samples in its own cluster than to samples in other clusters.

Sil. score ≈ 0 -> overlapping clusters
Sil. score < 0 -> poor clustering

How to pick the optimal number of clusters??

① Elbow method
② Silhouette coefficient
③ Gap statistics
④ Information criteria -> BIC and AIC
⑤ Domain knowledge
[Elbow plot: within-cluster sum of squares (WCSS) vs number of clusters K.]
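The elbow curve behind such a plot can be reproduced with a short loop (assuming scikit-learn; the three-blob toy data is my own, chosen so the elbow should appear at K = 3): WCSS is just `KMeans.inertia_` at each K, and we look for the K where its decrease flattens out.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the elbow should appear at K = 3.
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 5, 10)])

# Within-cluster sum of squares (inertia) for K = 1..7.
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 8)]
```

WCSS always shrinks as K grows (more centroids can only fit tighter), which is why we look for the bend in the curve rather than its minimum.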

Silhouette Analysis

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the average distance from point i to the other points in its own cluster, and b(i) is the average distance from point i to the points in the nearest other cluster.

We will then calculate the average silhouette = mean[s(i)]. Then plot the graph of avg sil. score vs each K.

Note points:
1. The value will come between [-1, 1].
2. 1 -> best; 0 -> overlapping; -1 -> worst.

We can see the sil. score is maximized at K = 3, so we can take 3 clusters.

The sil. score is used in combination with the elbow method for a more confident decision.
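Putting this into code (assuming scikit-learn; the same kind of three-blob toy data as before, so the "right" answer is K = 3), a sketch that scores each candidate K and picks the best:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three well-separated blobs of 2-D points.
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 5, 10)])

# Average silhouette score for each candidate K.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

`silhouette_score` returns the mean s(i) over all samples; in practice you would look at this curve alongside the elbow plot before committing to a K.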
