
Chapter 5

Cluster
• K-Means Clustering-

2/26/2022 1
• Outline
• Basic Concepts and Methods
• Cluster Analysis
• Goal of Cluster Analysis
• Types of clustering
o Partitioning Methods
o K-means Algorithm
▪ How does the K-means algorithm work?
▪ Distance measures
• Simple Clustering: K-means with numeric data
• What is clustering?

• Definition:
◼ Clustering: the process of grouping a set of objects into classes of similar objects.
• It groups data points by their similarity or closeness to each other: the algorithm finds the data points whose values are similar to each other and places them in the same cluster.
◼ It is the most common form of unsupervised learning.
◼ Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given.
⚫ Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.
• Goal of Cluster Analysis
• The objects within a group should be similar to one another and different from the objects in other groups.
Notion of a Cluster can be Ambiguous

How many clusters?
[Figure: the same set of points grouped as two clusters, four clusters, and six clusters]
• Types of Clustering
◼ Partitioning Methods
- K-means Algorithm
- Kernel K-means
- Expectation-Maximization Clustering
◼ Hierarchical Clustering
- Agglomerative Hierarchical Clustering
◼ Density-based Clustering
- The DBSCAN Algorithm
- Kernel Density Estimation
- Density-based Clustering: DENCLUE
• K-Means Clustering-
• K-Means clustering is an unsupervised iterative clustering technique.
• It partitions the given data set into k predefined distinct clusters.
• A cluster is defined as a collection of data points exhibiting certain
similarities.

• Steps in K-Means:
• Step 1: Choose a value for k, e.g. k = 2.
• Step 2: Initialize the k centroids randomly.
• Step 3: Calculate the Euclidean distance from each data point to each centroid, and assign each point to the cluster of its closest centroid.
• Step 4: Find the centroid of each cluster and update the centroids.
• Step 5: Repeat steps 3 and 4.
• Each time the clusters are re-formed, the centroids are updated.
• The updated centroid is the center (mean) of all points that fall in the cluster.
• This process continues until the centroids no longer change, i.e. the solution converges.
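
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, its parameters (`init_centroids`, `max_iter`, `seed`), and the choice to keep the old centroid when a cluster becomes empty are assumptions made for the example. It stops when the assignments stop changing, which implies the centroids have stopped changing as well.

```python
import math
import random

def kmeans(points, k, init_centroids=None, max_iter=100, seed=0):
    """Plain k-means on a list of numeric tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: start from k randomly chosen data points unless centroids are given.
    centroids = [tuple(c) for c in (init_centroids or rng.sample(points, k))]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points each.
toy = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
toy_centroids, toy_labels = kmeans(toy, k=2, init_centroids=[(1, 1), (8, 8)])
print(toy_labels)
```

Passing `init_centroids` makes the run deterministic; leaving it out falls back to a seeded random choice of k data points.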

• Distance measures

• Rules
• 1- The distance function (Manhattan distance) between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
• 2- The Euclidean distance between the same two points is defined as-
Ρ(a, b) = sqrt((x2 – x1)^2 + (y2 – y1)^2)
• 3- The k-means algorithm uses the concept of a centroid to create k clusters.
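
As a small sketch, the two distance measures can be written as follows. The function names `manhattan` and `euclidean` are chosen for this example only.

```python
import math

def manhattan(a, b):
    """Sum of absolute coordinate differences: |x2 - x1| + |y2 - y1|."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance: sqrt((x2 - x1)^2 + (y2 - y1)^2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(manhattan((2, 2), (1, 1)))  # 2
print(euclidean((0, 0), (3, 4)))  # 5.0
```

Both functions work for points of any dimension, as long as the two points have the same number of coordinates.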

• How does the K-means algorithm work?
Simple Clustering: K-means

Works with numeric data only


1) Pick a number (K) of cluster centers (at random)
2) Assign every item to its nearest cluster center (e.g. using Euclidean distance)
3) Move each cluster center to the mean of its assigned items
4) Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold)
• Draw the data as well as

K-means example, step 1

Pick 3 initial cluster centers (randomly): k1, k2, k3.
Note: the best initial cluster centers are those located at the borders.
[Figure: data points in the X-Y plane with the initial centers k1, k2, k3]
K-means example, step 2

Assign each point to the closest cluster center (Euclidean distance).
[Figure: points in the X-Y plane grouped around the centers k1, k2, k3]
K-means example, step 3

Move each cluster center to the mean of each cluster:
C = ((x1 + x2 + … + xn)/n, (y1 + y2 + … + yn)/n)
[Figure: the centers k1, k2, k3 moved to their new positions in the X-Y plane]
K-means example, step 4

Reassign the points that are now closest to a different new cluster center. Now we have new clusters.
Q: Which points are reassigned?
[Figure: points in the X-Y plane with the updated centers k1, k2, k3]
K-means example, step 4 …

A: three points (shown with animation).
[Figure: the three reassigned points in the X-Y plane with the centers k1, k2, k3]
K-means example, step 4b

Re-compute the cluster means.
[Figure: points in the X-Y plane with the centers k1, k2, k3]
K-means example, step 5

Move the cluster centers to the cluster means.
[Figure: final positions of the centers k1, k2, k3 in the X-Y plane]
[Figure: Semi-Supervised Clustering Example]
• Steps of K-Means Clustering Algorithm-
• K-Means Clustering Algorithm involves the following steps-
• Step-01:
• Choose the number of clusters K.
• Step-02:
• Randomly select any K data points as cluster centers.
• Select cluster centers in such a way that they are as far as possible from each other.
• Step-03:
• Calculate the distance between each data point and each cluster center.
• The distance may be calculated either by using given distance function or by
using Euclidean distance formula.

• Step-04:
• Assign each data point to some cluster.
• A data point is assigned to that cluster whose center is nearest to that data
point
• Step-05:
• Re-compute the center of newly formed clusters.
• The center of a cluster is computed by taking mean of all the data points
contained in that cluster.
• Step-06:
• Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria is met-
• Centers of the newly formed clusters do not change
• Data points remain in the same cluster
• Maximum number of iterations is reached
K-means Clustering

• Partitional clustering approach


• Number of clusters, K, must be specified
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• The basic algorithm is very simple
• Simple Clustering: K-means with numeric data

• Example
• k-means algorithm with numerical Data

• Problem-01:
• Cluster the following five points (with (x, y) representing locations) into two clusters:
• A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5)
• The initial cluster centers are A(2, 2) and C(1, 1).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as Ρ(a, b) = |x2 – x1| + |y2 – y1|.
• Step 1: Fill in the distance table for Iteration-01, assigning each point to the cluster with the smallest distance:

Given Points | Distance from center C1(2, 2) of Cluster-01 | Distance from center C2(1, 1) of Cluster-02 | Point belongs to Cluster
A(2, 2)      |   |   |
B(3, 2)      |   |   |
C(1, 1)      |   |   |
D(3, 1)      |   |   |
E(1.5, 0.5)  |   |   |

• Iteration-01:
• Calculating Distance Between A(2, 2) and C1(2, 2)-
• Ρ(A, C1)
• = |x2 – x1| + |y2 – y1|
• = |2 – 2| + |2 – 2|
• =0
• Calculating Distance Between A(2, 2) and C2(1, 1)-
• Ρ(A, C2)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |1 – 2| = 2
• In a similar manner, we calculate the distance of the other points from the center of each of the two clusters.
• Calculating Distance Between B(3, 2) and C1(2, 2)-
• Ρ(B, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 3| + |2 – 2| = 1
• Calculating Distance Between B(3, 2) and C2(1, 1)-
• Ρ(B, C2)
= |x2 – x1| + |y2 – y1|
= |1 – 3| + |1 – 2| = 3

• Calculating Distance Between C(1, 1) and C1(2, 2)-
• Ρ(C, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 1| + |2 – 1| = 2
• Calculating Distance Between C(1, 1) and C2(1, 1)-
• Ρ(C, C2)
= |x2 – x1| + |y2 – y1|
= |1 – 1| + |1 – 1| = 0
• Calculating Distance Between D(3, 1) and C1(2, 2)-
• Ρ(D, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 3| + |2 – 1| = 2
• Calculating Distance Between D(3, 1) and C2(1, 1)-
• Ρ(D, C2)
= |x2 – x1| + |y2 – y1|
= |1 – 3| + |1 – 1| = 2
• D is equally distant from both centers; the tie is resolved here by assigning D to Cluster-02.
• Calculating Distance Between E(1.5, 0.5) and C1(2, 2)-
• Ρ(E, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 1.5| + |2 – 0.5| = 2
• Calculating Distance Between E(1.5, 0.5) and C2(1, 1)-
• Ρ(E, C2)
= |x2 – x1| + |y2 – y1|
= |1 – 1.5| + |1 – 0.5| = 1

• Result of Iteration-01:

Given Points | Distance from center C1(2, 2) of Cluster-01 | Distance from center C2(1, 1) of Cluster-02 | Point belongs to Cluster
A(2, 2)      | 0 | 2 | C1
B(3, 2)      | 1 | 3 | C1
C(1, 1)      | 2 | 0 | C2
D(3, 1)      | 2 | 2 | C2 (tie, assigned to C2)
E(1.5, 0.5)  | 2 | 1 | C2

• From here, the new clusters are-
Cluster-01: A(2, 2), B(3, 2)
Cluster-02: C(1, 1), D(3, 1), E(1.5, 0.5)
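
The Iteration-01 distances can be checked with a short script. This is only a sketch: it assumes D = (3, 1), as in the table and in the later center computation, and the names `points`, `centers`, and `table` are chosen for this example. It computes the distances only and does not decide the assignment, because D turns out to be equally far from both centers.

```python
def manhattan(a, b):
    """Distance function from the problem: |x2 - x1| + |y2 - y1|."""
    return sum(abs(x - y) for x, y in zip(a, b))

points = {"A": (2, 2), "B": (3, 2), "C": (1, 1), "D": (3, 1), "E": (1.5, 0.5)}
centers = {"C1": (2, 2), "C2": (1, 1)}  # initial centers A and C

# One row per point, one distance per cluster center.
table = {
    name: {c: manhattan(p, ctr) for c, ctr in centers.items()}
    for name, p in points.items()
}
for name, row in table.items():
    print(name, row)
```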
• Now, we re-compute the cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
• Step: Find the cluster centers
Cluster-01: A(2, 2), B(3, 2)

Center of Cluster-01 = ((2 + 3)/2, (2 + 2)/2) = (2.5, 2)

Cluster-02: C(1, 1), D(3, 1), E(1.5, 0.5)

Center of Cluster-02 = ((1 + 3 + 1.5)/3, (1 + 1 + 0.5)/3) ≈ (1.8, 0.8)
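
The center computation is a coordinate-wise mean, which can be sketched as below. The helper name `centroid` is an assumption for the example. Note that the exact center of Cluster-02 is (1.8333…, 0.8333…); the slides use the approximation (1.8, 0.8).

```python
def centroid(pts):
    """Coordinate-wise mean of a list of points."""
    n = len(pts)
    return tuple(sum(coord) / n for coord in zip(*pts))

center_1 = centroid([(2, 2), (3, 2)])                # Cluster-01: A, B
center_2 = centroid([(1, 1), (3, 1), (1.5, 0.5)])    # Cluster-02: C, D, E
print(center_1)
print(tuple(round(v, 2) for v in center_2))
```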

• Step: Update the centers to C1(2.5, 2) and C2(1.8, 0.8), then fill in the distance table again:

Given Points | Distance from center C1(2.5, 2) of Cluster-01 | Distance from center C2(1.8, 0.8) of Cluster-02 | Point belongs to Cluster
A(2, 2)      |   |   |
B(3, 2)      |   |   |
C(1, 1)      |   |   |
D(3, 1)      |   |   |
E(1.5, 0.5)  |   |   |

• Iteration-02:
• Calculating Distance Between A(2, 2) and C1(2.5, 2)-
• Ρ(A, C1)
• = |x2 – x1| + |y2 – y1|
• = |2.5 – 2| + |2 – 2|=0.5
• Calculating Distance Between A(2, 2) and C2(1.8, 0.8)-
• Ρ(A, C2)
= |x2 – x1| + |y2 – y1|
= |1.8 – 2| + |0.8 – 2| = 0.2 + 1.2 = 1.4
• Calculating Distance Between B(3, 2) and C1(2.5, 2)-
• Ρ(B, C1)
= |x2 – x1| + |y2 – y1|
= |2.5 – 3| + |2 – 2| = 0.5
• Calculating Distance Between B(3, 2) and C2(1.8, 0.8)-
• Ρ(B, C2)
= |x2 – x1| + |y2 – y1|
= |1.8 – 3| + |0.8 – 2| = 1.2 + 1.2 = 2.4
• Calculating Distance Between C(1, 1) and C1(2.5, 2)-
• Ρ(C, C1)
= |x2 – x1| + |y2 – y1|
= |2.5 – 1| + |2 – 1| = 1.5 + 1 = 2.5

• Calculating Distance Between C(1, 1) and C2(1.8, 0.8)-
• Ρ(C, C2)
= |x2 – x1| + |y2 – y1|
= |1.8 – 1| + |0.8 – 1| = 0.8 + 0.2 = 1
• Calculating Distance Between D(3, 1) and C1(2.5, 2)-
• Ρ(D, C1)
= |x2 – x1| + |y2 – y1|
= |2.5 – 3| + |2 – 1| = 0.5 + 1 = 1.5
• Calculating Distance Between D(3, 1) and C2(1.8, 0.8)-
• Ρ(D, C2)
= |x2 – x1| + |y2 – y1|
= |1.8 – 3| + |0.8 – 1| = 1.2 + 0.2 = 1.4

• Calculating Distance Between E(1.5, 0.5) and C1(2.5, 2)-
• Ρ(E, C1)
= |x2 – x1| + |y2 – y1|
= |2.5 – 1.5| + |2– 0.5|=1+1.5=2.5

• Calculating Distance Between E(1.5, 0.5) and C2(1.8, 0.8)-


• Ρ(E, C2)
= |x2 – x1| + |y2 – y1|
= |1.8– 1.5| + |0.8– 0.5|=0.3+0.3= 0.6

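• Result of Iteration-02:

Given Points | Distance from center C1(2.5, 2) of Cluster-01 | Distance from center C2(1.8, 0.8) of Cluster-02 | Point belongs to Cluster
A(2, 2)      | 0.5 | 1.4 | C1
B(3, 2)      | 0.5 | 2.4 | C1
C(1, 1)      | 2.5 | 1   | C2
D(3, 1)      | 1.5 | 1.4 | C2
E(1.5, 0.5)  | 2.5 | 0.6 | C2

• No point changes its cluster, so the centers stay the same and the algorithm has converged.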
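
As a hedged check of Iteration-02, the sketch below recomputes the distances to the rounded centers (2.5, 2) and (1.8, 0.8) and compares the resulting assignment with the clusters from Iteration-01. The variable names are chosen for the example, and distances are rounded to one decimal to hide floating-point noise.

```python
def manhattan(a, b):
    """Distance function from the problem: |x2 - x1| + |y2 - y1|."""
    return sum(abs(x - y) for x, y in zip(a, b))

points = {"A": (2, 2), "B": (3, 2), "C": (1, 1), "D": (3, 1), "E": (1.5, 0.5)}
centers = {"C1": (2.5, 2), "C2": (1.8, 0.8)}  # rounded centers from the slides
previous = {"A": "C1", "B": "C1", "C": "C2", "D": "C2", "E": "C2"}

distances = {
    name: {c: round(manhattan(p, ctr), 1) for c, ctr in centers.items()}
    for name, p in points.items()
}
# Each point goes to the center with the smallest distance.
current = {name: min(row, key=row.get) for name, row in distances.items()}
converged = current == previous
print(distances)
print(converged)
```

With these centers there are no ties, and the assignment matches the previous iteration, so `converged` is `True`.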

• Problem-02:
• Cluster the following eight points (with (x, y) representing locations) into three clusters:
• A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
• Ρ(a, b) = |x2 – x1| + |y2 – y1|

• Solution-
• k = 3
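
The slides give only the setup for this problem. As a sketch under that setup, the first iteration (assignment with the given distance function, then re-computation of the centers) can be worked out as follows. The labels C1, C2, C3 for the three initial centers and all variable names are assumptions for the example.

```python
def manhattan(a, b):
    """Distance function from the problem: |x2 - x1| + |y2 - y1|."""
    return sum(abs(x - y) for x, y in zip(a, b))

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
centers = {"C1": (2, 10), "C2": (5, 8), "C3": (1, 2)}  # A1, A4, A7

# Assign every point to the center with the smallest distance.
assignment = {
    name: min(centers, key=lambda c: manhattan(p, centers[c]))
    for name, p in points.items()
}

# Re-compute each center as the mean of its assigned points.
new_centers = {}
for c in centers:
    members = [points[n] for n, a in assignment.items() if a == c]
    new_centers[c] = tuple(sum(v) / len(members) for v in zip(*members))

print(assignment)
print(new_centers)
```

Further iterations would repeat the same two steps with `new_centers` until the assignment stops changing.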

K-Means Clustering: Example
• Given:
◼ Mean of cluster Ki: mi = (ti1 + ti2 + … + tim)/m
◼ Data: {2, 4, 10, 12, 3, 20, 30, 11, 25}
◼ K = 2
• Solution:
◼ m1 = 2, m2 = 4
• K1 = {2, 3} and K2 = {4, 10, 12, 20, 30, 11, 25}
◼ m1 = 2.5, m2 = 16
• K1 = {2, 3, 4} and K2 = {10, 12, 20, 30, 11, 25}
◼ m1 = 3, m2 = 18
• K1 = {2, 3, 4, 10} and K2 = {12, 20, 30, 11, 25}
◼ m1 = 4.75, m2 = 19.6
• K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 30, 25}
◼ m1 = 7, m2 = 25
• K1 = {2, 3, 4, 10, 11, 12} and K2 = {20, 30, 25}
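
The one-dimensional example can be replayed with a few lines of Python. This is a sketch with names chosen for illustration; it assumes that a point equally distant from both means (here the value 3 in the first pass) goes to K1, which matches the solution above.

```python
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 2.0, 4.0   # initial means
history = []

while True:
    # Assign each value to the nearer mean; ties go to K1 (assumption).
    k1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    k2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    new_m1, new_m2 = sum(k1) / len(k1), sum(k2) / len(k2)
    history.append((sorted(k1), sorted(k2), new_m1, new_m2))
    if (new_m1, new_m2) == (m1, m2):   # means unchanged: converged
        break
    m1, m2 = new_m1, new_m2

for k1, k2, a, b in history:
    print(k1, k2, a, b)
```

The loop runs five passes; the last pass reproduces the clusters of the fourth, which is how convergence is detected.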
• Advantages-

• K-Means Clustering Algorithm offers the following advantages-


• Point-01:
• It is relatively efficient with time complexity O(nkt) where-
• n = number of instances
• k = number of clusters
• t = number of iterations

• Disadvantages-

• K-Means Clustering Algorithm has the following disadvantages-
• It requires the number of clusters (k) to be specified in advance.
• It is sensitive to noisy data and outliers, because a single extreme point can pull a cluster mean away from the rest of the cluster.
• It is not suitable to identify clusters with non-convex shapes.

• END
