K-Means Clustering
-> It is an unsupervised learning algorithm.
-> It is an iterative algorithm that divides the unlabelled data into K different clusters: K = 3 gives 3 clusters, and so on.
-> Each data point is assigned to its corresponding cluster.
How does the K-Means Algorithm work?
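Step 1. Select the number K to decide the number of clusters.
Step 2. Place K centroids at random positions.
Step 3. Assign each point to its closest centroid.
Step 4. Calculate the variance and place a new centroid of each cluster.
Step 5. Repeat the 3rd step, i.e. reassign each point to the new closest centroid.
Step 6. If any reassignment occurs, then go back and update the centroids again; otherwise stop.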
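The steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the function name `kmeans`, its parameters `init`, `max_iter` and `seed`, and the toy data are assumptions made for this example. It also assumes that the "new centroid" of a cluster is placed at the mean of the points assigned to it, which is the usual K-means update.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Illustrative K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: start from K random data points unless starting centroids are given
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop as soon as no point changes its cluster
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical toy data: two visibly separate groups
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, 2, init=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

Passing `init` makes the run reproducible; leaving it out falls back to random starting centroids as in Step 2.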
Let us visualize this!
Suppose we have two variables M1 and M2.
The x-y scatter plot:
Let us take K = 2 and put those centroids at random positions.
[We do it using some mathematical techniques.]
So we will draw a median between both the centroids.
From this picture we can say that all the points on top of the line are close to the blue centroid, whereas all points on the downside of the line belong to cluster orange.
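Drawing the line between the two centroids gives the same result as comparing distances: a point lies on a centroid's side of the perpendicular bisector exactly when that centroid is the closer one. A small sketch of this check follows; the coordinates and the helper names `side_of_bisector` and `closest` are made up for illustration.

```python
import math

def side_of_bisector(p, c1, c2):
    """Positive if p lies on c1's side of the perpendicular bisector of c1-c2."""
    mid = ((c1[0] + c2[0]) / 2, (c1[1] + c2[1]) / 2)
    # Project (p - midpoint) onto the direction pointing from c2 to c1
    return (p[0] - mid[0]) * (c1[0] - c2[0]) + (p[1] - mid[1]) * (c1[1] - c2[1])

def closest(p, c1, c2):
    """Name of the nearer centroid, by plain Euclidean distance."""
    return "blue" if math.dist(p, c1) < math.dist(p, c2) else "orange"

# Hypothetical centroid positions and data points
blue, orange = (2.0, 6.0), (5.0, 1.0)
points = [(1, 5), (3, 7), (4, 2), (6, 0), (2, 3)]

assignments = []
for p in points:
    by_line = "blue" if side_of_bisector(p, blue, orange) > 0 else "orange"
    assert by_line == closest(p, blue, orange)  # both views agree
    assignments.append(by_line)
print(assignments)
```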
But it does not end here, as we want to bring each centroid to the centre of its cluster.
Now we will repeat all the steps again.
Perpendicular bisector: here we can say we got 3 points as error, so we repeat again.
Reassign the centroids based on the relocated points.
Now we draw the perpendicular bisector again.
But no error we got this time. This will be our final cluster.
# Properties of Clusters
1. Points within a cluster should be similar to each other.
Suppose a bank wants to segment its customers. If the customers in a particular cluster are not similar to each other, then their requirements may vary. If the bank gives them the same offer, they may not like it, and their interest in the bank might reduce.
[Figure: Debt vs. Income scatter plots of the candidate cluster schemes]
What do you think, which cluster scheme is best?
In case I we can clearly see that the peach and blue clusters are totally different from each other. Cluster peach has high income and high debt, whereas cluster blue has high income and low debt.
for clustering .
The primary aim of clustering is not just to make clusters but to make good and meaningful ones.
Inertia tells how far the points within a cluster are from the centroid of that cluster.
features are
categorical .
clusters are
Dunn Index -> Inertia takes care of only the 1st property, that the distance between points and centroid decreases.
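Dunn Index = min(Inter cluster distance) / max(Intra cluster distance)
The distance between the centroids of two different clusters is known as the inter cluster distance.
The Dunn index is the ratio of the minimum of the inter cluster distances to the maximum of the intra cluster distances.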
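Both scores can be computed by hand on a tiny example. The sketch below is only illustrative and rests on two assumptions that the notes do not spell out: inertia is taken as the sum of squared point-to-centroid distances (the convention scikit-learn uses), and the intra cluster distance is taken as the largest distance between two points of the same cluster. The function names and the data are hypothetical.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Mean position of the points in one cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def inertia(clusters):
    """Sum of squared point-to-centroid distances over all clusters (assumed definition)."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(math.dist(p, c) ** 2 for p in cluster)
    return total

def dunn_index(clusters):
    """min(inter cluster distance) / max(intra cluster distance)."""
    centroids = [centroid(c) for c in clusters]
    # Inter cluster distance: distance between the centroids of two clusters
    min_inter = min(math.dist(a, b) for a, b in combinations(centroids, 2))
    # Intra cluster distance: largest point-to-point distance inside a cluster (assumed)
    max_intra = max(math.dist(p, q) for c in clusters for p, q in combinations(c, 2))
    return min_inter / max_intra

# Hypothetical data: two compact squares, 10 units apart
clusters = [
    [(0, 0), (0, 2), (2, 0), (2, 2)],
    [(10, 0), (10, 2), (12, 0), (12, 2)],
]
inertia_value = inertia(clusters)
dunn_value = dunn_index(clusters)
print(inertia_value, dunn_value)
```

Lower inertia means tighter clusters, while a higher Dunn index means clusters that are both tight and far apart.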
A high silhouette score indicates that the clusters are well separated, and each sample is more similar to the samples in its own cluster than to samples in other clusters.
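As a rough sketch of how such a score can be computed: for each sample, let a be its mean distance to the other samples of its own cluster and b its mean distance to the samples of the nearest other cluster; the sample's score is (b - a) / max(a, b), and the overall score is the average. The function name `silhouette_score` below is just a local helper for this example (it is not a call into any library), the data is made up, and at least two clusters are assumed.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette value over all samples (needs at least two clusters)."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # usual convention for a single-point cluster
            continue
        a = sum(same) / len(same)  # mean distance inside the own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two tight groups far apart
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # sensible grouping
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # groups mixed up
print(good, bad)
```

The sensible grouping scores close to 1, while the mixed-up labelling scores far lower.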
How to pick the optimal number of clusters?
① Elbow method
② Silhouette coefficient
③ Gap statistics
④ Domain knowledge
Silhouette Analysis
s(i) ranges from -1 to 1:
1 - well separated clusters
0 - overlapping clusters
-1 - worse
We can see the silhouette score is maximized at 3, so we take K = 3.
The silhouette score is used in combination with the Elbow method for a more confident decision.
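A minimal sketch of this selection procedure is given below: it runs K-means for several values of K, records the within-cluster sum of squares (the quantity plotted in an elbow chart) and the silhouette score, and keeps the K with the highest silhouette. Everything here is an assumption made for illustration: the data set, the helper names, and in particular the deterministic farthest-point initialisation, which is swapped in for random starting centroids only so that the result is reproducible.

```python
import math

def farthest_point_init(points, k):
    """Deterministic stand-in for random starting centroids (assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, max_iter=100):
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # no reassignment: finished
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def silhouette(points, labels):
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # single-point cluster
            continue
        a = sum(same) / len(same)
        b = min(sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
                / labels.count(other)
                for other in set(labels) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: three compact groups
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 10), (5, 11), (6, 10), (6, 11)]

results = {}
for k in range(2, 6):
    centroids, labels = kmeans(points, k)
    wcss = sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))
    results[k] = (wcss, silhouette(points, labels))

best_k = max(results, key=lambda k: results[k][1])
print(results)
print("best K by silhouette:", best_k)
```

On this toy data the silhouette score peaks at the number of visible groups; on real data the peak is usually less sharp, which is why the notes suggest reading it together with the elbow plot.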