
K-MEANS CLUSTERING ALGORITHM

K-Means Clustering Algorithm

Clustering Concept

Clustering is the classification of objects into different groups, or more
precisely, the partitioning of a data set into subsets (clusters), so that the
data in each subset (ideally) share some common trait, often according to
some defined distance measure.


Clustering Concept
o Group a set of data points or objects into clusters (groups) so that
each cluster contains data that are as similar as possible.
o K-Means is an unsupervised learning method.
o The data used in clustering has no known output (target or label).
o Method for measuring cluster quality: sum of squared error (SSE):
\[ SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2 \]

where p ∈ C_i is each data point in cluster i, m_i is the centroid of cluster i,
and d is the distance from a point to its closest centroid; the sum reflects the
variance within each cluster i.
o The SSE value depends on the number of clusters and how the data is
grouped into clusters. The smaller the SSE value, the better the
clustering results.
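
The SSE definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming Euclidean distance for d; the function name `sse` and the two small clusters are hypothetical and not taken from the slides.

```python
import math

def sse(clusters, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points, centroid in zip(clusters, centroids):
        for p in points:
            total += math.dist(p, centroid) ** 2
    return total

# Two hypothetical clusters and their centroids (the mean of each cluster)
clusters = [[(1, 1), (1, 3)], [(4, 2), (6, 2)]]
centroids = [(1, 2), (5, 2)]
print(sse(clusters, centroids))  # 4.0
```

Every point here lies exactly one unit from its centroid, so each contributes 1 to the sum.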


K-Means Clustering Algorithm


o K-Means is a partitioning clustering method.
o The objects are grouped into k clusters.
o Before clustering, the value of k must be determined.
o Each cluster has a central value called the centroid.
o Similarity measures are used to group objects.
o Similarity is expressed through the concept of distance (d).
o The smaller the distance between two objects or data points, the
higher their similarity.
o The purpose of k-Means is to minimize the total distance between
the elements and the centroid of their own cluster.
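
To make the link between distance and similarity concrete, here is a small sketch, assuming Euclidean distance as the measure. The helper name `euclidean` and the three points are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

a, b, c = (1, 1), (2, 1), (5, 3)   # hypothetical points

print(euclidean(a, b))                    # 1.0
print(euclidean(a, c) > euclidean(a, b))  # True: b is more similar to a than c is
```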


K-Means Clustering Algorithm


1. Select the desired number of k clusters.
2. Randomly initialize the k cluster centres (centroids).
3. Place each data or object in the nearest cluster. The distance determines
the proximity of two objects. The distance used in the k-Means algorithm
is Euclidean distance (d).
\[ d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]

where x = (x_1, x_2, ..., x_n) and y = (y_1, y_2, ..., y_n), and n is the number of
attributes (dimensions).

4. Recalculate the cluster centres from the current cluster membership.
The cluster centre is the average (mean) of all data points or objects in
a particular cluster.
5. Repeat steps 3 and 4 until cluster membership no longer changes.
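
The steps above can be sketched as a short Python function. This is an illustrative sketch, not the code used in the slides: the function name `kmeans`, the `seed` parameter, and the choice to draw the initial centroids from the data points are assumptions.

```python
import math
import random

def kmeans(points, k, seed=0):
    """Plain k-means; returns (centroids, labels). `seed` is an illustrative parameter."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 2: random initial centroids
    labels = None
    while True:
        # step 3: assign each point to the nearest centroid (Euclidean distance)
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:               # membership unchanged: stop
            return centroids, labels
        labels = new_labels
        # step 4: recompute each centroid as the mean of its members
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:                        # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))

# Hypothetical data: two visually separate groups of points
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, the cluster numbering (and, on harder data, the grouping itself) can differ from run to run unless the seed is fixed.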


Example 1
Table 1. Data points

Instance  X  Y
A         1  3
B         3  3
C         4  3
D         5  3
E         1  2
F         4  2
G         1  1
H         2  1

1. Determine the number of clusters, k = 2.
2. Determine the initial centroids randomly, for example from the data in
Table 1: m1 = (1, 1), m2 = (2, 1).
3. Place each object in the nearest cluster, that is, the cluster whose
centroid is at the smallest distance. The results are shown in Table 2:
cluster 1 = {A, E, G}, cluster 2 = {B, C, D, F, H}. The SSE value is
computed as:

\[ SSE = \sum_{i=1}^{k} \sum_{p \in C_i} d(p, m_i)^2 \]

Figure: initial data display.
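
The first assignment step can be checked with a short script. This is a sketch assuming Euclidean distance; the data points and initial centroids come from the example, while the variable names are illustrative. The SSE it prints is computed by the script, not quoted from the slides.

```python
import math

# Data points from Table 1 and the initial centroids chosen in step 2
data = {"A": (1, 3), "B": (3, 3), "C": (4, 3), "D": (5, 3),
        "E": (1, 2), "F": (4, 2), "G": (1, 1), "H": (2, 1)}
centroids = [(1, 1), (2, 1)]  # m1, m2

clusters = [[] for _ in centroids]
sse = 0.0
for name, p in data.items():
    dists = [math.dist(p, m) for m in centroids]
    nearest = dists.index(min(dists))      # ties go to the lower-numbered cluster
    clusters[nearest].append(name)
    sse += min(dists) ** 2

print(clusters)        # [['A', 'E', 'G'], ['B', 'C', 'D', 'F', 'H']]
print(round(sse, 2))   # 36.0
```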


Example 1
Table 2
4. Calculate the new centroid values:
\[ m_1 = \left( \frac{1+1+1}{3}, \frac{3+2+1}{3} \right) = (1, 2) \]
\[ m_2 = \left( \frac{3+4+5+4+2}{5}, \frac{3+3+3+2+1}{5} \right) = (3.6, 2.4) \]

5. Reassign each object using the new cluster centres. The new SSE value
is shown in Table 3.

Figure: clusters and centroids after the first stage.
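
The centroid update in step 4 is just a per-coordinate mean, which the following sketch reproduces. The helper name `mean_point` is hypothetical; the data and cluster membership are those of the example.

```python
data = {"A": (1, 3), "B": (3, 3), "C": (4, 3), "D": (5, 3),
        "E": (1, 2), "F": (4, 2), "G": (1, 1), "H": (2, 1)}
clusters = [["A", "E", "G"], ["B", "C", "D", "F", "H"]]  # result of step 3

def mean_point(names):
    """Mean of the X and Y coordinates of the named instances."""
    xs, ys = zip(*(data[n] for n in names))
    return (sum(xs) / len(xs), sum(ys) / len(ys))

centroids = [mean_point(c) for c in clusters]
print(centroids)  # [(1.0, 2.0), (3.6, 2.4)]
```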


Example 1
Table 3
• cluster 1 = {A, E, G, H}, cluster 2 = {B, C, D, F}. The new centroid
values are then m1 = (1.25, 1.75) and m2 = (4, 2.75).

• Reassign each object using the new cluster centres. The new SSE value
is shown in Table 4.

Figure: clusters and centroids after the second stage.


Example 1
Table 4

• Table 4 shows that cluster membership no longer changes.
• The final results are: cluster 1 = {A, E, G, H} and cluster
2 = {B, C, D, F}, with SSE = 6.25 after 3 iterations.
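
The whole of Example 1 can be replayed with a short trace. This is a sketch assuming Euclidean distance and the initial centroids m1 = (1, 1), m2 = (2, 1) from the example; the variable names are illustrative. It prints the SSE after each assignment step and stops when membership no longer changes.

```python
import math

data = {"A": (1, 3), "B": (3, 3), "C": (4, 3), "D": (5, 3),
        "E": (1, 2), "F": (4, 2), "G": (1, 1), "H": (2, 1)}
names = list(data)
centroids = [(1.0, 1.0), (2.0, 1.0)]   # initial m1, m2
labels, iteration = None, 0

while True:
    iteration += 1
    # assignment step: index of the nearest centroid for every point
    new_labels = [min(range(2), key=lambda j: math.dist(data[n], centroids[j]))
                  for n in names]
    sse = sum(math.dist(data[n], centroids[l]) ** 2
              for n, l in zip(names, new_labels))
    print(iteration, round(sse, 2))
    if new_labels == labels:            # no membership change: converged
        break
    labels = new_labels
    # update step: each centroid becomes the mean of its members
    for j in range(2):
        pts = [data[n] for n, l in zip(names, labels) if l == j]
        centroids[j] = tuple(sum(c) / len(pts) for c in zip(*pts))

final = [[n for n, l in zip(names, labels) if l == j] for j in range(2)]
print(final)  # [['A', 'E', 'G', 'H'], ['B', 'C', 'D', 'F']]
```

The third pass changes no assignments, which agrees with the 3 iterations and the final SSE of 6.25 reported above.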

K-Means Clustering Visual Basic Code


Sub kMeanCluster(Data() As Variant, numCluster As Integer)
    ' main function to cluster data into k number of clusters
    ' input:
    '   + Data matrix (0 to 2, 1 to TotalData);
    '     Row 0 = cluster, 1 = X, 2 = Y; data in columns
    '   + numCluster: number of clusters the user wants the data to be clustered into
    '   + private variables: Centroid, TotalData
    ' output:
    '   o) update centroid
    '   o) assign cluster number to the Data (= row 0 of Data)
    ' note: dist() is a distance helper that is not shown in this listing

    Dim i As Integer
    Dim j As Integer
    Dim X As Single
    Dim Y As Single
    Dim min As Single
    Dim cluster As Integer
    Dim d As Single
    Dim sumXY()
    Dim isStillMoving As Boolean

    isStillMoving = True
    If totalData <= numCluster Then
        ' only the last data is put here because it is designed to be interactive
        Data(0, totalData) = totalData ' cluster No = total data
        Centroid(1, totalData) = Data(1, totalData) ' X
        Centroid(2, totalData) = Data(2, totalData) ' Y
    Else
        ' calculate minimum distance to assign the new data
        min = 10 ^ 10 ' big number
        X = Data(1, totalData)
        Y = Data(2, totalData)
        For i = 1 To numCluster
            d = dist(X, Y, Centroid(1, i), Centroid(2, i))
            If d < min Then
                min = d
                cluster = i
            End If
        Next i
        Data(0, totalData) = cluster

        Do While isStillMoving
            ' this loop is guaranteed to converge
            ' calculate new centroids
            ' 1 = X, 2 = Y, 3 = count number of data
            ReDim sumXY(1 To 3, 1 To numCluster)
            For i = 1 To totalData
                sumXY(1, Data(0, i)) = Data(1, i) + sumXY(1, Data(0, i))
                sumXY(2, Data(0, i)) = Data(2, i) + sumXY(2, Data(0, i))
                sumXY(3, Data(0, i)) = 1 + sumXY(3, Data(0, i))
            Next i
            For i = 1 To numCluster
                Centroid(1, i) = sumXY(1, i) / sumXY(3, i)
                Centroid(2, i) = sumXY(2, i) / sumXY(3, i)
            Next i

            ' assign all data to the new centroids
            isStillMoving = False
            For i = 1 To totalData
                min = 10 ^ 10 ' big number
                X = Data(1, i)
                Y = Data(2, i)
                For j = 1 To numCluster
                    d = dist(X, Y, Centroid(1, j), Centroid(2, j))
                    If d < min Then
                        min = d
                        cluster = j
                    End If
                Next j
                If Data(0, i) <> cluster Then
                    Data(0, i) = cluster
                    isStillMoving = True
                End If
            Next i
        Loop
    End If
End Sub


Exercise 1
The following table is a dataset of 15 students taking a Data Mining course. The 15
students are to be grouped into three clusters, namely the smart, normal and poor
groups. Calculate the SSE value.
No  Student name  Midterm (UTS)  Assignment (Tugas)  Final (UAS)
1   Roy           89             90                  75
2   Sintia        90             71                  95
3   Iqbal         70             75                  80
4   Dilan         45             65                  59
5   Ratna         65             75                  53
6   Merry         80             70                  75
7   Rudi          90             85                  81
8   Hafiz         70             70                  73
9   Gede          96             93                  85
10  Christian     60             55                  48
11  Justin        45             60                  58
12  Jesika        60             70                  72
13  Ayu           85             90                  88
14  Siska         52             68                  55
15  Reitama       40             60                  7


Exercise 2
Perform the clustering process on the following data. Experiment with several
values of k (the number of clusters) to determine the optimal k based on the
minimum SSE value, and draw a scatter plot for each value of k. Implement the
clustering process using MATLAB.
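
As a rough guide to the experiment, the sketch below tabulates the final SSE for several values of k. It is written in Python rather than MATLAB, uses the eight points of Example 1 as stand-in data (the exercise data is not reproduced here), and starts from the first k points as centroids; the function name `kmeans_sse` and that deterministic start are assumptions of this sketch. Note that SSE generally shrinks as k grows and reaches 0 when every point is its own cluster, so it is usually more informative to look for the k beyond which the improvement levels off.

```python
import math

def kmeans_sse(points, k):
    """Run k-means from the first k points as initial centroids; return the final SSE."""
    centroids = list(points[:k])           # deterministic start, an assumption of this sketch
    labels = None
    while True:
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:           # membership unchanged: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

# Stand-in data: the eight points of Example 1
points = [(1, 3), (3, 3), (4, 3), (5, 3), (1, 2), (4, 2), (1, 1), (2, 1)]
for k in range(1, 5):
    print(k, round(kmeans_sse(points, k), 2))
```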
