Cluster
• K-Means Clustering-
2/26/2022 1
• Outline
• Basic Concepts and Methods
• Cluster Analysis
• Goal of Cluster Analysis
• Types of clustering
o Partitioning Methods
o K-means Algorithm
▪ How does the K-means algorithm work?
▪ Distance measures
• Simple Clustering: K-means with numeric data
• What is clustering?
• Definition:
⚫ Clustering is the classification of objects into different
groups, or more precisely,
⚫ the partitioning of a data set into subsets (clusters), so that
the data in each subset (ideally) share some common trait
- often according to some defined distance measure.
• Goal of Cluster Analysis
• The objects within a group should be similar to one another and
different from the objects in other groups.
Notion of a Cluster can be Ambiguous
Clustering
• Types of Clustering
◼ Partitioning Methods
- K-means Algorithm
- Kernel K-means
- Expectation-Maximization Clustering
- Further Reading
◼ Hierarchical Clustering
- Preliminaries
- Agglomerative Hierarchical Clustering
◼ Density-based Clustering
- The DBSCAN Algorithm
- Kernel Density Estimation
- Density-based Clustering: DENCLUE
• K-Means Clustering-
• K-Means clustering is an unsupervised iterative clustering technique.
• It partitions the given data set into k predefined distinct clusters.
• A cluster is defined as a collection of data points exhibiting certain
similarities.
• Steps in K-Means:
• Step 1: Choose a value for k, for example k = 2.
• Step 2: Initialize the k centroids randomly.
• Step 3: Calculate the Euclidean distance from each centroid to each data
point, and assign each point to its closest centroid to form clusters.
• Step 4: Find the centroid of each cluster and update the centroids.
• Step 5: Repeat from step 3.
• Each time clusters are formed, the centroids are updated.
• The updated centroid is the center (mean) of all points that fall in the
cluster.
• This process continues until the centroids no longer change, i.e. the
solution converges.
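• The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name kmeans, the max_iter and seed parameters, and the sample points are assumptions introduced here.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Minimal K-means sketch following steps 1-5 (Euclidean distance)."""
    rng = random.Random(seed)
    # Step 2: pick k of the data points at random as the initial centroids.
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical sample data: two well-separated groups of points.
sample = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(sample, k=2)
```

• The max_iter cap is a safety net in case the centroids keep changing; on small, well-separated data the loop usually stops after a few passes.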
• Distance measures
• Rules
• 1- Manhattan distance: the distance function between two points a = (x1, y1) and b = (x2, y2) is
defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
• 2- Euclidean distance: for the same two points,
d(a, b) = √((x2 – x1)² + (y2 – y1)²)
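• Both rules are easy to express in code. The sketch below assumes 2-D points stored as (x, y) tuples; the function names manhattan and euclidean and the sample points are illustrative choices, not part of the slides.

```python
import math


def manhattan(a, b):
    """Rule 1: |x2 - x1| + |y2 - y1|."""
    (x1, y1), (x2, y2) = a, b
    return abs(x2 - x1) + abs(y2 - y1)


def euclidean(a, b):
    """Rule 2: straight-line distance between a and b."""
    (x1, y1), (x2, y2) = a, b
    return math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)


# Hypothetical points chosen so that both results are whole numbers.
a, b = (0, 0), (3, 4)
d_manhattan = manhattan(a, b)   # 3 + 4 = 7
d_euclidean = euclidean(a, b)   # sqrt(9 + 16) = 5.0
```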
• How does the K-means algorithm work?
Simple Clustering: K-means
• Draw the data as well as
K-means example, step 1
• Pick 3 initial cluster centers k1, k2, k3 (randomly).
• The best initial cluster centers are those located on the borders.
• [Figure: data points on X and Y axes with the initial centers k1, k2, k3]
K-means example, step 2
• Assign each point to the closest cluster center (Euclidean distance).
• [Figure: data points on X and Y axes, each assigned to k1, k2 or k3]
K-means example, step 3
• Move each cluster center to the mean of each cluster:
C = ((x1 + x2 + … + xn)/n, (y1 + y2 + … + yn)/n)
• [Figure: centers k1, k2, k3 on X and Y axes moving to the cluster means]
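• The mean used in step 3 can be computed as shown below. This is a small sketch; the function name centroid and the three sample points are assumptions made for illustration.

```python
def centroid(cluster):
    """Mean of the x values and mean of the y values of a cluster."""
    n = len(cluster)
    return (sum(x for x, _ in cluster) / n, sum(y for _, y in cluster) / n)


# Hypothetical cluster of three points.
new_center = centroid([(1, 1), (2, 3), (3, 5)])   # (2.0, 3.0)
```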
K-means example, step 4
• A: three points (with animation)
• [Figure: data points on X and Y axes with cluster centers k1, k2, k3]
K-means example, step 4b
• Re-compute cluster means.
• [Figure: data points on X and Y axes with cluster centers k1, k2, k3]
K-means example, step 5
• Move cluster centers to cluster means.
• [Figure: data points on X and Y axes with cluster centers k1, k2, k3]
Semi-Supervised Clustering Example
• Steps of K-Means Clustering Algorithm-
• K-Means Clustering Algorithm involves the following steps-
• Step-01:
• Choose the number of clusters K.
• Step-02:
• Randomly select any K data points as cluster centers.
• Select the cluster centers so that they are as far as possible from each other.
• Step-03:
• Calculate the distance between each data point and each cluster center.
• The distance may be calculated using either the given distance function or the
Euclidean distance formula.
• Step-04:
• Assign each data point to some cluster.
• A data point is assigned to the cluster whose center is nearest to that data
point.
• Step-05:
• Re-compute the centers of the newly formed clusters.
• The center of a cluster is computed by taking the mean of all the data points
contained in that cluster.
• Step-06:
• Keep repeating the procedure from Step-03 to Step-05 until any of the
following stopping criteria is met-
• The centers of the newly formed clusters do not change
• The data points remain in the same clusters
• The maximum number of iterations is reached
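• Step-02 asks for centers that are as far apart as possible but does not say how to find them. One common heuristic is farthest-first selection, sketched below. The heuristic itself, the function name farthest_first_centers, the first parameter (index of the starting point, fixed here instead of random) and the sample data are all assumptions, not something the slides prescribe.

```python
import math


def farthest_first_centers(points, k, first=0):
    """Pick k centers that are spread out (farthest-first heuristic)."""
    centers = [points[first]]
    while len(centers) < k:
        # Choose the point whose nearest already-chosen center is farthest away.
        candidate = max(
            points, key=lambda p: min(math.dist(p, c) for c in centers)
        )
        centers.append(candidate)
    return centers


# Hypothetical data: three loose groups of points.
data = [(0, 0), (1, 1), (9, 9), (10, 10), (0, 10)]
initial_centers = farthest_first_centers(data, k=3)
```

• The chosen centers can then be passed to Step-03 in place of purely random ones.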
K-means Clustering
• Example
• k-means algorithm with numerical Data
• Problem-01:
• Cluster the following five points (with (x, y) representing locations) into two clusters:
• A(2, 2), B(3, 2), C(1, 1), D(3, 1), E(1.5, 0.5)
• Assume the initial cluster centers are A(2, 2) and C(1, 1).
• Step 1: Iteration-01: for each given point, find the distance from each cluster center and put the point in the cluster with the smallest value.
Given Points    Distance from center C1(2, 2) of Cluster-01    Distance from center C2(1, 1) of Cluster-02    Point belongs to Cluster
A(2, 2)
B(3, 2)
C(1, 1)
D(3, 1)
E(1.5, 0.5)
• Iteration-01:
• Calculating Distance Between A(2, 2) and C1(2, 2)-
• Ρ(A, C1)
• = |x2 – x1| + |y2 – y1|
• = |2 – 2| + |2 – 2|
• =0
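• Calculating Distance Between A(2, 2) and C2(1, 1)-
• Ρ(A, C2)
• = |x2 – x1| + |y2 – y1|
• = |1 – 2| + |1 – 2|
• =2
• Calculating Distance Between C(1, 1) and C1(2, 2)-
• Ρ(C, C1)
• = |x2 – x1| + |y2 – y1|
• = |2 – 1| + |2 – 1|
• =2
• Calculating Distance Between C(1, 1) and C2(1, 1)-
• Ρ(C, C2)
• = |x2 – x1| + |y2 – y1|
• = |1 – 1| + |1 – 1|
• =0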
• Step: result of Iteration-01
Given Points    Distance from center C1(2, 2) of Cluster-01    Distance from center C2(1, 1) of Cluster-02    Point belongs to Cluster
A(2, 2)         0                                              2                                              C1
B(3, 2)         1                                              3                                              C1
C(1, 1)         2                                              0                                              C2
D(3, 1)         2                                              2                                              C2 (tie, assigned to C2)
E(1.5, 0.5)     2                                              1                                              C2
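• Step: updating of centers
• C1 = mean of A and B = ((2 + 3)/2, (2 + 2)/2) = (2.5, 2)
• C2 = mean of C, D and E = ((1 + 3 + 1.5)/3, (1 + 1 + 0.5)/3) ≈ (1.8, 0.8)
Given Points    Distance from center C1(2.5, 2) of Cluster-01    Distance from center C2(1.8, 0.8) of Cluster-02    Point belongs to Cluster
A(2, 2)
B(3, 2)
C(1, 1)
D(3, 1)
E(1.5, 0.5)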
• Iteration-02:
• Calculating Distance Between A(2, 2) and C1(2.5, 2)-
• Ρ(A, C1)
• = |x2 – x1| + |y2 – y1|
• = |2.5 – 2| + |2 – 2|=0.5
• Calculating Distance Between A(2, 2) and C2(1.8, 0.8)-
• Ρ(A, C2)
= |x2 – x1| + |y2 – y1|
= |1.8 – 2| + |0.8 – 2|=0.2+1.2=1.4
• Calculating Distance Between B(3, 2) and C1(2.5, 2)-
• Ρ(B, C1)
= |x2 – x1| + |y2 – y1|
= |2.5 – 3| + |2– 2|=0.5
• Calculating Distance Between B(3, 2) and C2(1.8, 0.8)-
• Ρ(B, C2)
= |x2 – x1| + |y2 – y1|
= |1.8 – 3| + |0.8– 2|=1.2+1.2=2.4
• Calculating Distance Between C(1, 1) and C1(2.5, 2)-
• Ρ(C, C1)
= |x2 – x1| + |y2 – y1|
= |2.5– 1| + |2– 1|=1.5+1=2.5
• Calculating Distance Between C(1, 1) and C2(1.8, 0.8)-
• Ρ(C, C2)
= |x2 – x1| + |y2 – y1|
= |1.8 – 1| + |0.8 – 1|=0.8+0.2=1
• Calculating Distance Between D(3, 1) and C1(2.5, 2)-
• Ρ(D, C1)
= |x2 – x1| + |y2 – y1|
= |2.5 – 3| + |2 – 1|=0.5+1=1.5
• Calculating Distance Between D(3, 1) and C2(1.8, 0.8)-
• Ρ(D, C2)
= |x2 – x1| + |y2 – y1|
= |1.8 – 3| + |0.8 – 1|=1.2+0.2=1.4
• Calculating Distance Between E(1.5, 0.5) and C1(2.5, 2)-
• Ρ(E, C1)
= |x2 – x1| + |y2 – y1|
= |2.5 – 1.5| + |2 – 0.5|=1+1.5=2.5
• Calculating Distance Between E(1.5, 0.5) and C2(1.8, 0.8)-
• Ρ(E, C2)
= |x2 – x1| + |y2 – y1|
= |1.8 – 1.5| + |0.8 – 0.5|=0.3+0.3=0.6
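• Step: result of Iteration-02
Given Points    Distance from center C1(2.5, 2) of Cluster-01    Distance from center C2(1.8, 0.8) of Cluster-02    Point belongs to Cluster
A(2, 2)         0.5                                              1.4                                                C1
B(3, 2)         0.5                                              2.4                                                C1
C(1, 1)         2.5                                              1                                                  C2
D(3, 1)         1.5                                              1.4                                                C2
E(1.5, 0.5)     2.5                                              0.6                                                C2
• No point changes cluster between Iteration-01 and Iteration-02, so the algorithm stops with Cluster-01 = {A, B} and Cluster-02 = {C, D, E}.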
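• The distances for both iterations of Problem-01 can be re-computed with a short script, which is a useful check on the hand arithmetic. The helper names (manhattan, distance_table), the rounding to two decimals and the rule that a tie goes to C2 are assumptions made for this sketch. The tie rule was chosen only because it matches the slide's assignment of D.

```python
def manhattan(a, b):
    """Distance function given in the problem: |x2 - x1| + |y2 - y1|."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])


points = {"A": (2, 2), "B": (3, 2), "C": (1, 1), "D": (3, 1), "E": (1.5, 0.5)}


def distance_table(c1, c2):
    """Return {point: (distance to C1, distance to C2, assigned cluster)}."""
    table = {}
    for name, p in points.items():
        d1 = round(manhattan(p, c1), 2)
        d2 = round(manhattan(p, c2), 2)
        # Assumption: on a tie the point goes to C2.
        table[name] = (d1, d2, "C1" if d1 < d2 else "C2")
    return table


iteration_1 = distance_table((2, 2), (1, 1))
iteration_2 = distance_table((2.5, 2), (1.8, 0.8))

for name in points:
    print(name, iteration_1[name], iteration_2[name])
```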
• Problem-02:
• Cluster the following eight points (with (x, y) representing locations) into three
clusters:
• A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
• Assume the initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and b = (x2, y2) is
defined as-
• Ρ(a, b) = |x2 – x1| + |y2 – y1|
• Solution-
• Here k = 3.
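• The slides stop before working Problem-02 through. As a hedged starting point, the sketch below carries out only the first iteration (assign, then re-compute centers) with the given distance function. The variable names are illustrative, and further iterations would repeat the same two steps until the assignment stops changing.

```python
def manhattan(a, b):
    """Distance function given in the problem: |x2 - x1| + |y2 - y1|."""
    return abs(b[0] - a[0]) + abs(b[1] - a[1])


points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
centers = [(2, 10), (5, 8), (1, 2)]   # initial centers A1, A4, A7

# Assign each point to the index of its nearest center.
assignment = {
    name: min(range(len(centers)), key=lambda i: manhattan(p, centers[i]))
    for name, p in points.items()
}

# Group the point names by cluster and re-compute each center as the mean.
clusters = [
    [name for name, c in assignment.items() if c == i]
    for i in range(len(centers))
]
new_centers = [
    tuple(sum(points[n][axis] for n in members) / len(members) for axis in (0, 1))
    for members in clusters
]

print(clusters)
print(new_centers)
```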
K-Means Clustering: Example
Given:
◼ Mean of cluster Ki: mi = (ti1 + ti2 + … + tim)/m
◼ Data {2, 4, 10, 12, 3, 20, 30, 11, 25}
◼ K = 2
Solution:
◼ m1 = 2, m2 = 4
K1 = {2, 3}, and K2 = {4, 10, 12, 20, 30, 11, 25}
◼ m1 = 2.5, m2 = 16
K1 = {2, 3, 4}, and K2 = {10, 12, 20, 30, 11, 25}
◼ m1 = 3, m2 = 18
K1 = {2, 3, 4, 10}, and K2 = {12, 20, 30, 11, 25}
◼ m1 = 4.75, m2 = 19.6
K1 = {2, 3, 4, 10, 11, 12}, and K2 = {20, 30, 25}
◼ m1 = 7, m2 = 25
K1 = {2, 3, 4, 10, 11, 12}, and K2 = {20, 30, 25}
◼ The clusters no longer change, so the algorithm stops.
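• The one-dimensional example can be replayed in code. The sketch below assumes absolute difference as the distance and sends a tie to the first cluster, which matches how the value 3 is placed in K1 in the first pass. The function name kmeans_1d is an illustrative choice.

```python
def kmeans_1d(data, means, max_iter=100):
    """Repeat assign/update on 1-D data until the means stop changing."""
    history = []
    clusters = []
    for _ in range(max_iter):
        # Assign each value to the cluster with the nearest mean
        # (on a tie, min() keeps the first cluster).
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda i: abs(x - means[i]))
            clusters[nearest].append(x)
        # Update each mean; an empty cluster keeps its previous mean.
        new_means = [
            sum(c) / len(c) if c else means[i] for i, c in enumerate(clusters)
        ]
        history.append(new_means)
        if new_means == means:
            break
        means = new_means
    return clusters, means, history


data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means, history = kmeans_1d(data, means=[2, 4])
print(history)   # the successive (m1, m2) pairs
```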
• Advantages-
• Disadvantages-
• K-Means Clustering Algorithm has the following disadvantages-
• It requires the number of clusters (k) to be specified in advance.
• It cannot handle noisy data and outliers well.
• It is not suitable for identifying clusters with non-convex shapes.
• END