
TOPIC 6 – PART C

CLUSTERING: PARTITIONING
APPROACH
OBJECTIVES

• To introduce the basic concepts of clustering


• To discuss how to compute the dissimilarity between objects of
different attribute types
• To examine several clustering techniques
• Partitioning approach ✅
• Hierarchical approach

MAJOR CLUSTERING APPROACHES

• Partitioning approach:
• Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
• Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
• Create a hierarchical decomposition of the set of data (or objects) using some criterion
• Typical methods: Diana, Agnes, BIRCH, CAMELEON
• Density-based approach:
• Based on connectivity and density functions
• Typical methods: DBSCAN, OPTICS, DenClue
• Grid-based approach:
• Based on a multiple-level granularity structure (a finite number of cells)
• Typical methods: STING, WaveCluster, CLIQUE
PARTITIONING ALGORITHMS: BASIC CONCEPT

• Partitioning method: partition a database D of n objects into a set of k
clusters so as to minimize the sum of squared distances

E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2

• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
• Global optimum: requires exhaustively enumerating all possible partitions
• Heuristic methods: the k-means and k-medoids algorithms
• k-means (MacQueen’67): Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each
cluster is represented by one of the objects in the cluster
THE k-Means CLUSTERING ALGORITHM

Given k, the k-means algorithm is implemented in four steps:
1. Partition objects into k nonempty subsets
2. Compute seed points as the centroids of the clusters of the current
partitioning (the centroid is the center, i.e., mean point, of the cluster)
3. Assign each object to the cluster with the nearest seed point
4. Go back to Step 2; stop when the assignment does not change
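A minimal sketch of these four steps in Python, assuming NumPy is available (the function name kmeans and the random choice of initial seeds are our own illustration, not part of the slides):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k objects as the initial seed points (one simple option).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster
        # (keep the old centroid if a cluster happens to become empty).
        new_centroids = np.array([X[labels == i].mean(axis=0) if (labels == i).any()
                                  else centroids[i] for i in range(k)])
        # Step 4: stop when the assignment (hence the centroids) no longer changes.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids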
EXAMPLE OF k-Means CLUSTERING

[Figure: six scatter plots of y against x, labelled Iteration 1 through
Iteration 6, showing the cluster assignments converging as k-means iterates.]
EXAMPLE OF k-Means CLUSTERING

[Figure: worked example with K = 2. Arbitrarily choose K objects as the initial
cluster centers; assign each object to the most similar center; update the
cluster means; reassign and update again, repeating until the assignment is
stable.]
THE k-Means CLUSTERING APPROACH

Point   x   y
X1      2   6
X2      3   4
X3      3   8
X4      4   7
X5      6   2
X6      6   4
X7      7   3
X8      7   4
X9      8   5
X10     7   6

[Figure: scatter plot of the 10 points X1–X10.]

You are given 10 points with variables x and y. Find 2 clusters using the
k-means algorithm, with the given initial centroids c1 = (3,4) and c2 = (7,4).

Tips:
• The variables are numeric
• Use Euclidean distance for d(i,j)
THE k-Means CLUSTERING APPROACH

Example: d(X1, c1) and d(X1, c2)

d(X1, c1) = \sqrt{(2-3)^2 + (6-4)^2} = \sqrt{5} \approx 2.24
d(X1, c2) = \sqrt{(2-7)^2 + (6-4)^2} = \sqrt{29} \approx 5.39

So X1 is nearer to centroid 1 and is assigned to CLUSTER 1.
THE k-Means CLUSTERING APPROACH

Example: d(X2, c1) and d(X2, c2)

d(X2, c1) = \sqrt{(3-3)^2 + (4-4)^2} = 0
d(X2, c2) = \sqrt{(3-7)^2 + (4-4)^2} = 4

So X2 is nearer to centroid 1 and is assigned to CLUSTER 1.
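A quick check of these worked distances, assuming NumPy:

import numpy as np

c1, c2 = np.array([3, 4]), np.array([7, 4])
x1, x2 = np.array([2, 6]), np.array([3, 4])

print(np.linalg.norm(x1 - c1))  # 2.236... -> X1 goes to cluster 1
print(np.linalg.norm(x1 - c2))  # 5.385...
print(np.linalg.norm(x2 - c1))  # 0.0     -> X2 goes to cluster 1
print(np.linalg.norm(x2 - c2))  # 4.0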
THE k-Means CLUSTERING APPROACH

Centroids: c1 = (3,4) and c2 = (7,4)

i    x   y   d to c1 (3,4)   d to c2 (7,4)   Cluster
1    2   6   2.24            5.39            1
2    3   4   0               4               1
3    3   8   4.00            5.66            1
4    4   7   3.16            4.24            1
5    6   2   3.61            2.24            2
6    6   4   3.00            1               2
7    7   3   4.12            1               2
8    7   4   4.00            0               2
9    8   5   5.10            1.41            2
10   7   6   4.47            2               2

ITERATION 1
• Cluster 1: 1, 2, 3, 4
• Cluster 2: 5, 6, 7, 8, 9, 10
• Update the new centroid of each cluster.
• Mean of Cluster 1:
  • x = (2+3+3+4)/4 = 3
  • y = (6+4+8+7)/4 = 6.25
• Mean of Cluster 2:
  • x = (6+6+7+7+8+7)/6 = 6.83
  • y = (2+4+3+4+5+6)/6 = 4
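Reproducing the Iteration 1 centroid update in NumPy (the arrays below are simply the table above):

import numpy as np

points = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
                   [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])
labels = np.array([1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

print(points[labels == 1].mean(axis=0))  # [3.   6.25] -> new c1
print(points[labels == 2].mean(axis=0))  # [6.83 4.  ] -> new c2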
THE k-Means CLUSTERING APPROACH

New centroids: c1 = (3, 6.25) and c2 = (6.83, 4)

i    x   y   d to c1 (3,6.25)   d to c2 (6.83,4)   Cluster
1    2   6
2    3   4
3    3   8
4    4   7
5    6   2
6    6   4
7    7   3
8    7   4
9    8   5
10   7   6

• Use Euclidean distance for d(i,j)
• Assign each object to the cluster with the nearest centroid
• Repeat the steps; stop when the assignment does not change.
EVALUATING CLUSTER ANALYSIS: How many clusters?

• There is no “truly optimal” way to calculate it; heuristics are often used:
• Look at the sparseness of clusters
• Choose a number of clusters such that adding another cluster would not give
much better modeling of the data
• If one graphs the percentage of variance explained by the clusters, there is
a point at which the marginal gain drops (the “elbow”), indicating the number
of clusters to choose
• Rule of thumb: number of clusters k \approx \sqrt{n/2} (n: no. of data points)
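A sketch of the “elbow” heuristic just described, assuming scikit-learn and matplotlib are available (the random dataset X is only a placeholder):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((100, 2))  # placeholder data

# SSE (inertia) for a range of candidate k values.
ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in ks]

# Look for the k where the curve's marginal improvement drops.
plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("SSE (within-cluster sum of squares)")
plt.show()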
Evaluating k-means clusters
• Given two clusterings, we can choose the one with the smallest error
• One easy way to reduce the sum of squared distances is to increase k, the
number of clusters
• However, a good clustering with a smaller k can have a lower sum of squared
errors than a poor clustering with a higher k

E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2
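The criterion E written out directly, assuming NumPy, with labels holding integer cluster indices 0..k-1 so that centroids[labels] picks each point's cluster mean m_i:

import numpy as np

def sse(X, labels, centroids):
    # Sum of squared Euclidean distances of each point to its cluster mean.
    return float(((X - centroids[labels]) ** 2).sum())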
VARIATIONS OF THE k-Means APPROACH

• Most variants of k-means differ in:
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means

• Handling categorical data: k-modes (Huang’98), sketched below
• Replaces means of clusters with modes (the most frequent values)
• Uses new dissimilarity measures to deal with categorical objects
• Uses a frequency-based method to update the modes of clusters
• For a mixture of categorical and numerical data: the k-prototypes method
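A sketch of the two k-modes ingredients named above, simple-matching dissimilarity and a frequency-based mode update; this is illustration only under our own toy data, not Huang's full algorithm:

from collections import Counter

def matching_dissim(a, b):
    # Number of attributes on which the two objects disagree.
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    # Most frequent category in each attribute position.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

records = [("red", "S"), ("red", "M"), ("blue", "M")]
print(cluster_mode(records))                     # ('red', 'M')
print(matching_dissim(records[0], records[2]))   # 2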
OVERCOMING k-Means LIMITATIONS

• k-means has problems when clusters differ in size or density, or have
non-globular shapes
• The k-means algorithm is sensitive to outliers, since an object with an
extremely large value may substantially distort the distribution of the data
• k-medoids: instead of taking the mean value of the objects in a cluster as a
reference point, a medoid can be used: the most centrally located object in
the cluster. Medoids are similar in concept to means or centroids, but medoids
are always members of the data set (see the sketch below).
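A small illustration of the medoid idea, assuming NumPy: the medoid is the member with the smallest total distance to the rest of the cluster, so unlike the mean it cannot be dragged away by an outlier:

import numpy as np

def medoid(points):
    # Pairwise Euclidean distances; pick the row with the smallest sum.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[1, 1], [2, 1], [2, 2], [100, 100]])  # one outlier
print(cluster.mean(axis=0))  # [26.25 26.  ] -> mean dragged toward the outlier
print(medoid(cluster))       # [2 2]         -> stays an actual data point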
OVERCOMING k-Means LIMITATIONS

• Randomly selected initial centroids may be poor
• To limit the effect of random initialization:
• Perform multiple runs and then select the set of clusters with the minimum SSE
• Alternatively, select the first point at random (or take the centroid of all
points); then, for each successive initial centroid, select the point that is
farthest from any of the initial centroids already selected (sketched below)
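A sketch of this farthest-point initialization, assuming NumPy (the function name is our own):

import numpy as np

def farthest_first_init(X, k, first=0):
    centroids = [X[first]]  # start from one chosen point
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen centroid ...
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        # ... and the point maximizing that distance becomes the next seed.
        centroids.append(X[d.argmax()])
    return np.array(centroids)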
References

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd
Edition, Morgan Kaufmann, 2012.

2. Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data
Mining, 2nd Edition, Pearson, 2019.

3. Lloyd, Stuart P. (1957). “Least squares quantization in PCM”. Bell Telephone
Laboratories Paper. Published later as: Lloyd, Stuart P. (1982). “Least squares
quantization in PCM”. IEEE Transactions on Information Theory, 28(2), 129–137.
THANK YOU
Shuzlina Abdul Rahman | Sofianita Mutalib | Siti Nur Kamaliah Kamarudin
