
Non-Hierarchical Clustering Analysis

K-Means Clustering

Dr. Chaouki BOUFENAR
Computer Science Department
College of Science
University of Algiers 1 – Benyouçef Benkhedda

Email: boufenarc@gmail.com
K-means Clustering

[Diagram: unlabeled data → K-means algorithm → clusters (centroids or labels)]


K-means Uses
o Behavioral segmentation (by purchase history, website activity, profiles, income…)

o Geostatistics

o Inventory categorization (by sales activity, manufacturing metrics)

K-means clustering is used for this task because the number of clusters required to categorize the items is already set.

o Anomaly detection



K-means Algorithm
 Non-hierarchical clustering partitions a set of N objects into k distinct groups based on some
distance measure.
 K-means clustering is one of the simplest and most popular unsupervised machine learning
algorithms; it makes inferences from datasets using only input vectors, without labeled outcomes.
 The number of clusters K can be known a priori or can be estimated as part of the procedure.

Algorithm KMeans
Begin
    Initialisation step
    Repeat
        For each i = 1, …, N
            Assignment step: compute the distance from object i to each
            centroid and assign it to the cluster with the minimum distance
        End for
        Update step: recompute the centroid of each cluster
    Until the stop criterion is met
End

[Flowchart: initialization → distance of objects to centroids → grouping based on
minimum distance → centroid update → stop criterion? (No: repeat / Yes: end)]
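As a concrete illustration, here is a minimal NumPy sketch of the algorithm above; it is an
assumption-laden sketch (Euclidean distance, no cluster ever becomes empty), not the exact
implementation used in these slides:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (N, d) data matrix X; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    # Initialisation step: pick k distinct objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: distance of every object to every centroid,
        # then group each object with its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster
        # (assumes no cluster becomes empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop criterion: centroids no longer move, i.e. no object is relocated.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids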



K-means Algorithm
 Initialisation Step
• Define a distance between observations or groups of observations.
• Partition the initial objects into K distinct clusters (generally around randomly selected
centroids or mean vectors).

 Assignment Step
Assign each object Xi to the cluster whose centroid (mean) is closest, based on the distance measure.

 Update Step
Update the cluster centroids whenever an object is reassigned to a new cluster.

 Stop Criteria
• All objects remain in the same cluster (no change in the values of the centroids).
• If the k-means algorithm does not converge to a final solution, we stop it after a pre-chosen
maximum number of iterations.
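For reference, scikit-learn's KMeans (assuming that library is available) exposes both stop
criteria described above as parameters:

from sklearn.cluster import KMeans

# `tol` stops the algorithm once the centroids barely move (convergence);
# `max_iter` caps the iterations if convergence is never reached.
km = KMeans(n_clusters=2, max_iter=300, tol=1e-4, n_init=10, random_state=0)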



K-means Example
Consider the following data set, consisting of the scores of two variables X and Y on each of eight
objects:

[Scatter plot of the eight objects A–H in the (X, Y) plane; X ranges from 0 to 9, Y from 0 to 14]



K-means Example

The Manhattan distance matrix between the different objects is given in the table below.

      A     B     C     D     E     F     G     H
A   0.0   5.0  12.0   5.0  10.0   9.0   3.0   6.0
B         0.0   7.0   6.0   5.0   4.0   8.0  11.0
C               0.0   7.0   2.0   9.0  15.0   6.0
D                     0.0   5.0  10.0   8.0   5.0
E                           0.0   9.0  13.0   6.0
F                                 0.0  10.0  15.0
G                                       0.0   9.0
H                                             0.0
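Such a matrix can be reproduced with SciPy's cdist; a minimal sketch, using the only two
coordinates these slides state explicitly (C and G) to check one entry of the table:

import numpy as np
from scipy.spatial.distance import cdist

# Coordinates of C and G as given later in the slides.
coords = np.array([[8.0, 4.0],    # C
                   [1.0, 12.0]])  # G

# Pairwise Manhattan ("cityblock") distances.
D = cdist(coords, coords, metric="cityblock")
print(D[0, 1])  # 15.0, matching the C–G entry of the table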



K-means Example
Apply K-means starting from two clusters (centroids, k = 2).

Initialisation Step

It consists of finding a sensible initial partition.
● Using the Manhattan distance measure (in our case), we choose the two most distant objects
((C, G) or (F, H)).

            Object   Centroid
Cluster 1     C      (8.0 ; 4.0)
Cluster 2     G      (1.0 ; 12.0)
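A small sketch of this initialisation step: rebuild the distance matrix from the table above and
pick the most distant pair (the labels list is an assumption matching the table's ordering):

import numpy as np

labels = list("ABCDEFGH")
# Upper triangle of the Manhattan distance table, row by row.
upper = np.array([
    [0,  5, 12,  5, 10,  9,  3,  6],
    [0,  0,  7,  6,  5,  4,  8, 11],
    [0,  0,  0,  7,  2,  9, 15,  6],
    [0,  0,  0,  0,  5, 10,  8,  5],
    [0,  0,  0,  0,  0,  9, 13,  6],
    [0,  0,  0,  0,  0,  0, 10, 15],
    [0,  0,  0,  0,  0,  0,  0,  9],
    [0,  0,  0,  0,  0,  0,  0,  0],
], dtype=float)
D = upper + upper.T  # symmetrise the matrix

# Initialisation: the two most distant objects become the first centroids.
i, j = np.unravel_index(np.argmax(D), D.shape)
print(labels[i], labels[j], D[i, j])  # -> C G 15.0 (F and H tie at 15.0)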



K-means Example
Iteration 01

Assignment Step

• We calculate the distances from each object to the two clusters.

• The remaining objects are examined one by one and assigned to the nearest cluster (in terms
of minimum Manhattan distance).

Update Step

The centroid (mean vector) is recalculated for each cluster.
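The assignment step can be read directly off the distance table; a minimal sketch using the C
and G rows of that table (the grouping it prints follows from the table, not from the slides):

import numpy as np

# Manhattan distances of every object to the two initial centroids,
# copied from the C and G rows of the distance table.
labels = list("ABCDEFGH")
to_C = np.array([12.0, 7.0,  0.0, 7.0,  2.0,  9.0, 15.0, 6.0])
to_G = np.array([ 3.0, 8.0, 15.0, 8.0, 13.0, 10.0,  0.0, 9.0])

# Each object joins the nearer of the two centroids.
for obj, dc, dg in zip(labels, to_C, to_G):
    print(obj, "-> Cluster 1 (C)" if dc <= dg else "-> Cluster 2 (G)")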



K-means Example
Iteration 02

Assignment Step
Update Step

Stop: no object is relocated.



How to Evaluate a Clustering Algorithm

➔ Inertia (or WCSS: Within-Cluster Sum of Squares) tells us how spread out the points within a
cluster are. It is calculated as the sum of squared distances of all the points within a cluster
from the centroid of that cluster:

$$\mathit{Inertia} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

The idea behind good clustering is having a small value of inertia together with a small number
of clusters.
The lower the inertia value, the better our clusters are.

➔ Distortion is calculated as the average of the squared distances from the centroid of the
respective cluster. Typically, the Euclidean distance metric is used:

$$\mathit{Distortion} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \frac{\lVert x_i - \mu_j \rVert^2}{|C_j|}$$
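A minimal NumPy sketch of both measures, assuming labels and centroids come from a fitted
k-means (for instance, the kmeans function sketched earlier) and no cluster is empty; note that
scikit-learn exposes the inertia of a fitted model as KMeans.inertia_:

import numpy as np

def inertia_and_distortion(X, labels, centroids):
    """Inertia and distortion as defined above (squared Euclidean distance)."""
    inertia, distortion = 0.0, 0.0
    for j, mu in enumerate(centroids):
        cluster = X[labels == j]
        sq_dists = ((cluster - mu) ** 2).sum(axis=1)   # ||x_i - mu_j||^2
        inertia += sq_dists.sum()                      # summed per cluster
        distortion += sq_dists.sum() / len(cluster)    # averaged per cluster
    return inertia, distortion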



How to Evaluate a Clustering Algorithm

Silhouette score: it is calculated for each instance, and the formula goes like this:

Silhouette Coefficient = (x − y) / max(x, y)

y: the mean intra-cluster distance, i.e., the mean distance to the other instances in the same cluster.

x: the mean nearest-cluster distance, i.e., the mean distance to the instances of the next closest cluster.
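In practice, the mean silhouette coefficient is a single function call in scikit-learn; a sketch
on hypothetical random data, just to show the API:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data; X is any (N, d) array.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Mean silhouette coefficient over all instances, in [-1, 1].
print(silhouette_score(X, km.labels_))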



How to Evaluate a Clustering Algorithm

 The coefficient varies between -1 and 1.

 A value close to 1 implies that the instance is close to its own cluster.
 A value close to -1 means that the instance is assigned to the wrong cluster.

 It is a better method, as it makes the decision regarding the optimal number of clusters more
meaningful and clear.

 But it is computationally expensive, as the coefficient is calculated for every instance.


The Dunn index is the ratio of the minimum of the inter-cluster distances to the maximum of
the intra-cluster distances.
The Dunn index has a value between zero and infinity, and should be maximized.
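A minimal sketch of one common point-to-point variant of the Dunn index (other variants use
centroid distances instead); it assumes at least two non-empty clusters:

import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Min inter-cluster distance over max intra-cluster distance."""
    clusters = [X[labels == j] for j in np.unique(labels)]
    # Largest distance between two points of the same cluster.
    max_intra = max(cdist(c, c).max() for c in clusters)
    # Smallest distance between points of two different clusters.
    min_inter = min(cdist(a, b).min()
                    for i, a in enumerate(clusters)
                    for b in clusters[i + 1:])
    return min_inter / max_intra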



How to Choose the Best K Value?
The elbow method is one of the most popular methods to determine this optimal value of k.

 We iterate over the values of k in a given range and calculate the value of inertia (or
distortion) for each value of k.

 The elbow point in the inertia (or distortion) graph is a good choice, because after that point
the change in the value of inertia (or distortion) is no longer significant.

 To determine the optimal number of clusters, we select the value of k at the "elbow": the point
after which the distortion/inertia starts decreasing in a linear way (see the code sketch below).

[Plots of inertia and distortion versus k, each with an elbow point at k = 3]
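A sketch of the loop behind such plots, on hypothetical data, using scikit-learn's inertia_
attribute of a fitted model:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data; in practice X would be the data set being clustered.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Elbow method: compute the inertia for a range of k values,
# then plot them and pick the k at the elbow of the curve.
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)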



Complete Example
We keep the same example as above and calculate the inertia and the distortion while varying
the value of k over the range {1, …, 8}.

K=3

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 3]



K=4

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 4]



K=5

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 5]



K=6

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 6]



K=7

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 7]



K=8

[Scatter plot of objects A–H showing the initialisation step, the last assignment step, and the
last update step for k = 8]



Complete Example
Distortion & Inertia

[Tables of the inertia and distortion values computed for k = 1, …, 8]





Complete Example
Elbow Method Result

The best value of k is 4, so we will have four clusters grouping the objects as follows:

[Scatter plot of the eight objects A–H grouped into the four final clusters]



Strengths & Weaknesses
 Strengths
o Easy to implement

o Relatively fast and efficient

o Has only one parameter to tune, and we can easily see the direct impact of adjusting the value
of the parameter k

 Weaknesses
o The K-means algorithm is sensitive to outliers

o The K-means algorithm suffers from the problem of convergence to a local optimum

o Different initial centroids result in different clusters

o May run indefinitely if the stopping criterion is not satisfied

o Clusters are assumed to be spherical: to anticipate how K-means will behave on a particular
data set, we imagine spherical clusters. For data with complex geometrical shapes, K-means
cannot identify clear clusters

