
Non-Hierarchical Clustering Analysis

K-Means Clustering

Dr. Chaouki BOUFENAR
Computer Science Department
College of Science
University of Algiers 1 – Benyouçef Benkhedda

Email: boufenarc@gmail.com
K-means Clustering

[Diagram: unlabeled data → K-means algorithm → clusters (centroids or labels)]


K-means Uses
o Behavioral segmentation (by purchase history, website activity, profiles, income…)

o Geostatistics

o Inventory categorization (by sales activity, manufacturing metrics)

K-means clustering is used for this task because the number of clusters required to categorize the items is already set.

o Anomaly detection



K-means Algorithm
 Non-hierarchical clustering partitions a set of N objects into k distinct groups based on some
distance measure.
 K-means clustering is one of the simplest and most popular unsupervised machine learning
algorithms; it makes inferences from datasets using only input vectors, without labeled outcomes.
 The number of clusters K can be known a priori or can be estimated as part of the procedure.

Algorithm KMeans
Begin
    Initialisation step
    Repeat
        For each i = 1, …, N
            Assignment step: compute the distance from object i to each
            centroid and assign it to the cluster with the minimum distance
        End for
        Update step: recompute the centroid of each cluster
    Until the stop criterion is met
End

[Flowchart: initialization → distance of objects to centroids → grouping based on
minimum distance → centroid update → stop criterion? (No: repeat / Yes: end)]
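As a concrete illustration, here is a minimal NumPy sketch of the algorithm above; it is an
assumption-laden sketch (Euclidean distance, no cluster ever becomes empty), not the exact
implementation used in these slides:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (N, d) data matrix X; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Initialisation step: pick k distinct objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: distance of every object to every centroid,
        # then group each object with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster
        # (assumes no cluster becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop criterion: centroids no longer move, i.e. no object is relocated.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids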



K-means Algorithm
 Initialisation Step
• Define a distance between observations or groups of observations.
• Partition the initial objects into K distinct clusters (generally around randomly selected
centroids or mean vectors).

 Assignment Step
Assign each object Xi to the cluster whose centroid (mean) is closest, based on the distance measure.

 Update Step
Update the cluster centroids whenever an object is reassigned to a new cluster.

 Stop Criteria
• All objects remain in the same cluster (no change in the values of the centroids).
• If the k-means algorithm does not converge to a final solution, we stop it after a pre-chosen
maximum number of iterations.
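For reference, scikit-learn's KMeans (assuming that library is available) exposes both stop
criteria described above as parameters:

from sklearn.cluster import KMeans

# `tol` stops the algorithm once the centroids barely move (convergence);
# `max_iter` caps the iterations if convergence is never reached.
km = KMeans(n_clusters=2, max_iter=300, tol=1e-4, n_init=10, random_state=0)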



K-means Example
Consider the following data set, consisting of the scores of two variables X and Y on each of eight
objects:

[Scatter plot of the eight objects A–H in the (X, Y) plane; X ranges from 0 to 9, Y from 0 to 14]



K-means Example

The Manhattan distance matrix between the different objects is given in the table below.

      A     B     C     D     E     F     G     H
A   0.0   5.0  12.0   5.0  10.0   9.0   3.0   6.0
B         0.0   7.0   6.0   5.0   4.0   8.0  11.0
C               0.0   7.0   2.0   9.0  15.0   6.0
D                     0.0   5.0  10.0   8.0   5.0
E                           0.0   9.0  13.0   6.0
F                                 0.0  10.0  15.0
G                                       0.0   9.0
H                                             0.0
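Such a matrix can be reproduced with SciPy's cdist; a minimal sketch, using the only two
coordinates these slides state explicitly (C and G) to check one entry of the table:

import numpy as np
from scipy.spatial.distance import cdist

# Coordinates of C and G as given later in the slides.
coords = np.array([[8.0, 4.0],    # C
                   [1.0, 12.0]])  # G

# Pairwise Manhattan ("cityblock") distances.
D = cdist(coords, coords, metric="cityblock")
print(D[0, 1])  # 15.0, matching the C–G entry of the table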



K-means Example
Apply K-means starting from two clusters (centroids, k = 2).

Initialisation Step

It consists of finding a sensible initial partition.
● Using the Manhattan distance measure (in our case), we choose the two most distant objects
((C, G) or (F, H)).

            Object   Centroid
Cluster 1     C      (8.0 ; 4.0)
Cluster 2     G      (1.0 ; 12.0)
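A small sketch of this initialisation step: rebuild the distance matrix from the table above and
pick the most distant pair (the labels list is an assumption matching the table's ordering):

import numpy as np

labels = list("ABCDEFGH")
# Upper triangle of the Manhattan distance table, row by row.
upper = np.array([
    [0,  5, 12,  5, 10,  9,  3,  6],
    [0,  0,  7,  6,  5,  4,  8, 11],
    [0,  0,  0,  7,  2,  9, 15,  6],
    [0,  0,  0,  0,  5, 10,  8,  5],
    [0,  0,  0,  0,  0,  9, 13,  6],
    [0,  0,  0,  0,  0,  0, 10, 15],
    [0,  0,  0,  0,  0,  0,  0,  9],
    [0,  0,  0,  0,  0,  0,  0,  0],
], dtype=float)
D = upper + upper.T  # symmetrise the matrix

# Initialisation: the two most distant objects become the first centroids.
i, j = np.unravel_index(np.argmax(D), D.shape)
print(labels[i], labels[j], D[i, j])  # -> C G 15.0 (F and H tie at 15.0)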



K-means Example
Iteration 01

Assignment Step

• We calculate the distances from each object to the two clusters.

• The remaining objects are examined one by one and assigned to the nearest cluster (in terms
of minimum Manhattan distance).

Update Step

The centroid (mean vector) is recalculated for each cluster.
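The assignment step can be read directly off the distance table; a minimal sketch using the C
and G rows of that table (the grouping it prints follows from the table, not from the slides):

import numpy as np

# Manhattan distances of every object to the two initial centroids,
# copied from the C and G rows of the distance table.
labels = list("ABCDEFGH")
to_C = np.array([12.0, 7.0,  0.0, 7.0,  2.0,  9.0, 15.0, 6.0])
to_G = np.array([ 3.0, 8.0, 15.0, 8.0, 13.0, 10.0,  0.0, 9.0])

# Each object joins the nearer of the two centroids.
for obj, dc, dg in zip(labels, to_C, to_G):
    print(obj, "-> Cluster 1 (C)" if dc <= dg else "-> Cluster 2 (G)")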



K-means Example
Iteration 02

Assignment Step
Update Step

Stop: no object is relocated.



How to Evaluate a Clustering Algorithm

➔ Inertia (or WCSS: Within-Cluster Sum of Squares) tells us how spread out the points within a
cluster are. It is calculated as the sum of squared distances of all the points within a cluster
from the centroid of that cluster:

$$\mathit{Inertia} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

The idea behind good clustering is having a small value of inertia together with a small number
of clusters.
The lower the inertia value, the better our clusters are.

➔ Distortion is calculated as the average of the squared distances from the centroid of the
respective cluster. Typically, the Euclidean distance metric is used:

$$\mathit{Distortion} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \frac{\lVert x_i - \mu_j \rVert^2}{|C_j|}$$
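A minimal NumPy sketch of both measures, assuming labels and centroids come from a fitted
k-means (for instance, the kmeans function sketched earlier) and no cluster is empty; note that
scikit-learn exposes the inertia of a fitted model as KMeans.inertia_:

import numpy as np

def inertia_and_distortion(X, labels, centroids):
    """Inertia and distortion as defined above (squared Euclidean distance)."""
    inertia, distortion = 0.0, 0.0
    for j, mu in enumerate(centroids):
        cluster = X[labels == j]
        sq_dists = ((cluster - mu) ** 2).sum(axis=1)   # ||x_i - mu_j||^2
        inertia += sq_dists.sum()                      # summed per cluster
        distortion += sq_dists.sum() / len(cluster)    # averaged per cluster
    return inertia, distortion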



How to Evaluate a Clustering Algorithm

Silhouette score: it is calculated for each instance, and the formula goes like this:

Silhouette Coefficient = (x − y) / max(x, y)

y: the mean intra-cluster distance, i.e., the mean distance to the other instances in the same cluster.

x: the mean nearest-cluster distance, i.e., the mean distance to the instances of the next closest cluster.
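In practice, the mean silhouette coefficient is a single function call in scikit-learn; a sketch
on hypothetical random data, just to show the API:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data; X is any (N, d) array.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Mean silhouette coefficient over all instances, in [-1, 1].
print(silhouette_score(X, km.labels_))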



How to Evaluate a Clustering Algorithm

 The coefficient varies between -1 and 1.

 A value close to 1 implies that the instance is close to its own cluster.
 A value close to -1 means that the instance is assigned to the wrong cluster.

 It is a better method, as it makes the decision regarding the optimal number of clusters more
meaningful and clear.

 But it is computationally expensive, as the coefficient is calculated for every instance.


The Dunn index is the ratio of the minimum of the inter-cluster distances to the maximum of
the intra-cluster distances.
The Dunn index has a value between zero and infinity, and should be maximized.
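A minimal sketch of one common point-to-point variant of the Dunn index (other variants use
centroid distances instead); it assumes at least two non-empty clusters:

import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Min inter-cluster distance over max intra-cluster distance."""
    clusters = [X[labels == j] for j in np.unique(labels)]
    # Largest distance between two points of the same cluster.
    max_intra = max(cdist(c, c).max() for c in clusters)
    # Smallest distance between points of two different clusters.
    min_inter = min(cdist(a, b).min()
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:])
    return min_inter / max_intra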



How to Choose the Best K Value?
The elbow method is one of the most popular methods to determine this optimal value of k.

 We iterate over the values of k in a given range and calculate the value of inertia (or
distortion) for each value of k.

 The elbow point in the inertia (or distortion) graph is a good choice, because after that point
the change in the value of inertia (or distortion) is no longer significant.

 To determine the optimal number of clusters, we select the value of k at the "elbow": the point
after which the distortion/inertia starts decreasing in a linear way (see the code sketch below).

[Plots of inertia and distortion versus k, each with an elbow point at k = 3]
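A sketch of the loop behind such plots, on hypothetical data, using scikit-learn's inertia_
attribute of a fitted model:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data; in practice X would be the data set being clustered.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Elbow method: compute the inertia for a range of k values,
# then plot them and pick the k at the elbow of the curve.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)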



Complete Example
We keep the same example as above and calculate the inertia and the distortion while varying
the value of k over the range {1, …, 8}.

K=3

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 3]



K=4

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 4]



K=5

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 5]



K=6

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 6]



K=7

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 7]



K=8

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 8]



Complete Example
Distortion & Inertia

[Tables of the inertia and distortion values computed for k = 1, …, 8]





Complete Example
Elbow Method Result

The best value of k is 4, so we will have four clusters grouping the objects as follows:

[Scatter plot of the eight objects A–H grouped into the four final clusters]



Strengths & Weaknesses
 Strengths
o Easy to implement

o Relatively fast and efficient

o Has only one parameter to tune, and we can easily see the direct impact of adjusting the value
of the parameter k

 Weaknesses
o The K-means algorithm is sensitive to outliers

o The K-means algorithm suffers from the problem of convergence to a local optimum

o Different initial centroids result in different clusters

o May run indefinitely if the stopping criterion is not satisfied

o Clusters are assumed to be spherical: to anticipate how K-means will behave on a particular
data set, we imagine spherical clusters. For data with complex geometrical shapes, K-means
cannot identify clear clusters

