K-Means Clustering
Dr. Chaouki BOUFENAR
Computer Science Department
College of Science
University of Algiers 1 – Benyouçef Benkhedda
Email: boufenarc@gmail.com
K-means Clustering
o Anomaly detection (one application of K-means)
K-means Algorithm
Algorithm K-means
Begin
  Initialisation step: choose the initial centroids
  Repeat
    For each i = 1, …, N
      Assignment step: assign object i to the cluster with the closest centroid
      (based on the distances from the objects to the centroids)
    End for
    Update step: recompute the centroid of each cluster
  Until the stop criteria are met
End
Assignment Step
Assign each object Xi to the cluster Ck whose centroid (mean) is the closest, based on a distance measure.
Update Step
Recompute the cluster centroids whenever an object is reassigned to a new cluster.
Stop Criteria
• All objects remain in the same cluster (the centroid values no longer change).
• If the K-means algorithm does not converge to a final solution, we stop it after a pre-chosen maximum number of iterations.
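A minimal Python/NumPy sketch of these three steps, as a rough illustration (it assumes Euclidean distance, random initial centroids, and that no cluster becomes empty; the names are ours, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means on an (N, d) data array X with k clusters."""
    rng = np.random.default_rng(seed)
    # Initialisation step: pick k distinct objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each object goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (assumes no cluster ends up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop criterion: no centroid changed, i.e. no object was reassigned.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```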
[Figure: scatter plot of the objects A–H in the (X, Y) plane.]
Pairwise Manhattan distances between the objects (upper triangle shown):

      A     B     C     D     E     F     G     H
A   0.0   5.0  12.0   5.0  10.0   9.0   3.0   6.0
B         0.0   7.0   6.0   5.0   4.0   8.0  11.0
C               0.0   7.0   2.0   9.0  15.0   6.0
D                     0.0   5.0  10.0   8.0   5.0
E                           0.0   9.0  13.0   6.0
F                                 0.0  10.0  15.0
G                                       0.0   9.0
H                                             0.0
Initialisation Step
● It consists of finding a sensible initial partition.
● Using the Manhattan distance measure (in our case), we choose the two most distant objects: (C, G) or (F, H).

Cluster     Object   Centroid
Cluster 1   C        (8.0, 4.0)
Cluster 2   G        (1.0, 12.0)
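A small sketch of this initialisation in pure NumPy (the function name is illustrative): compute the full Manhattan distance matrix and pick the most distant pair.

```python
import numpy as np

def most_distant_pair(X):
    """Indices of the two most distant objects under the Manhattan distance."""
    # Pairwise city-block distances: sum of absolute coordinate differences.
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    return np.unravel_index(D.argmax(), D.shape)

# For example, with C = (8, 4) and G = (1, 12) as on the slides:
# |8 - 1| + |4 - 12| = 15, the largest entry of the matrix above.
```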
Update Step
● The centroid (mean vector) is recalculated for each cluster.
● The assignment and update steps then alternate, pass after pass, until a stop criterion is met.
➔ Inertia (or WCSS: Within-Cluster Sum of Squares) measures how far the points within a cluster are spread. It is calculated as the sum of the squared distances of all the points in a cluster from the centroid of that cluster:

$$\mathbf{Inertia} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \left\lVert x_i - \mu_j \right\rVert^2$$

The idea behind good clustering is having a small value of inertia together with a small number of clusters: the lower the inertia value, the more compact our clusters are.
➔ Distortion is calculated as the average of the squared distances from the centroid of the respective cluster, typically using the Euclidean distance metric:

$$\mathbf{Distortion} = \sum_{j=1}^{k} \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} \left\lVert x_i - \mu_j \right\rVert^2$$
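A NumPy sketch of both formulas, given the data, the labels, and the centroids produced by a K-means run (the names are illustrative, and every cluster is assumed non-empty):

```python
import numpy as np

def inertia_and_distortion(X, labels, centroids):
    """Inertia: total WCSS. Distortion: per-cluster mean squared distance, summed."""
    # Squared Euclidean distance of each point to its own cluster centroid.
    sq_dists = np.sum((X - centroids[labels]) ** 2, axis=1)
    inertia = sq_dists.sum()
    # Average within each cluster, then sum over clusters (assumes no empty cluster).
    distortion = sum(sq_dists[labels == j].mean() for j in range(len(centroids)))
    return inertia, distortion
```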
➔ Silhouette score: for each instance, compare a, the mean intra-cluster distance, with b, the mean nearest-cluster distance, i.e. the mean distance to the instances of the next closest cluster; the silhouette is (b − a) / max(a, b). It is a better method, as it makes the decision regarding the optimal number of clusters more meaningful and clear.
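scikit-learn exposes this score directly; a minimal sketch on toy data (the blob data and its parameters are an assumption for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # in [-1, 1]; higher is better
```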
➔ Dunn index is the ratio of the minimum of the inter-cluster distances to the maximum of the intra-cluster distances.
The Dunn index takes values between zero and infinity and should be maximized.
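A compact NumPy sketch of this ratio, assuming Euclidean distances and at least two clusters (the function names are illustrative):

```python
import numpy as np

def pairwise(A, B):
    # Euclidean distances between all rows of A and all rows of B.
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def dunn_index(X, labels):
    clusters = [X[labels == j] for j in np.unique(labels)]
    # Maximum intra-cluster distance (the largest cluster "diameter").
    max_intra = max(pairwise(c, c).max() for c in clusters)
    # Minimum inter-cluster distance over all pairs of distinct clusters.
    min_inter = min(
        pairwise(ci, cj).min()
        for a, ci in enumerate(clusters)
        for cj in clusters[a + 1:]
    )
    return min_inter / max_intra
```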
We iterate over the values of k in a given range and calculate the value of inertia (or distortion) for each value of k.
The elbow point in the inertia (or distortion) graph is a good choice because, beyond it, the change in the value of inertia (or distortion) is no longer significant.
To determine the optimal number of clusters, we select the value of k at the “elbow”: the point after which the distortion/inertia starts decreasing in a roughly linear way.
[Figure: elbow plots of inertia and distortion versus k; the elbow point here is at k = 3.]
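A minimal scikit-learn sketch of this loop, using its built-in `inertia_` attribute (the blob data and its parameters are an assumption for illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)  # toy data

inertias = []
ks = range(1, 9)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)  # WCSS for this value of k

plt.plot(ks, inertias, marker="o")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.show()  # read off the elbow where the curve flattens
```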
[Figures: scatter plots of the objects A–H after the initialisation step, the last assignment step, and the last update step of the example run.]
The best value of k is 4. So, we will have four clusters grouping the objects as follows:
[Figure: scatter plot of the objects A–H partitioned into the four final clusters.]
Strengths
o K-means has only one parameter to tune, and we can easily see the direct impact of adjusting the value of the parameter k
Weaknesses
o The K-means algorithm is sensitive to outliers
o Clusters are assumed to be spherical: to anticipate how K-means will behave on a particular data set, imagine the data as spherical blobs. For data with complex geometrical shapes, K-means cannot identify clear clusters (see the sketch below)
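One way to see this limitation is scikit-learn's two-moons toy data, whose two groups are clearly separated but not spherical (a sketch under that assumption; the parameters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-moons: clearly two groups, but not spherical blobs.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Fraction of points whose K-means label disagrees with the true moon
# (up to a label swap) -- typically far from 0: K-means splits each moon.
err = min((labels != y_true).mean(), (labels == y_true).mean())
print(f"misassigned fraction: {err:.2f}")
```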