Understanding Cluster Analysis Techniques
Dissimilarity matrix (lower triangular):

  |   0                            |
  | d(2,1)    0                    |
  | d(3,1)  d(3,2)    0            |
  |   :       :       :            |
  | d(n,1)  d(n,2)   ...   ...   0 |
Standardize data
Calculate the mean absolute deviation:
  s_f = (1/n) (|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)
where m_f is the mean of feature f.
Calculate the standardized measurement (z-score):
  z_if = (x_if - m_f) / s_f
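As a concrete sketch of this standardization for one numeric feature (the function name and data are illustrative, not from the slides):

```python
def standardize(values):
    """Standardize one feature: z_if = (x_if - m_f) / s_f, where s_f is
    the mean absolute deviation (more robust to outliers than the
    standard deviation)."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the feature
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]

scores = standardize([2.0, 4.0, 6.0, 8.0])
# m_f = 5, s_f = (3 + 1 + 1 + 3) / 4 = 2, so scores = [-1.5, -0.5, 0.5, 1.5]
```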
If q = 2, d is Euclidean distance:
  d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)
Properties
d(i, j) >= 0
d(i, i) = 0
d(i, j) = d(j, i)
d(i, j) <= d(i, k) + d(k, j)
Also, one can use weighted distance, parametric
Pearson product moment correlation, or other
dissimilarity measures
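The distance family above can be sketched in a few lines of Python (the function name is illustrative; q = 1 gives Manhattan distance, q = 2 Euclidean):

```python
def minkowski(xi, xj, q=2):
    """Minkowski distance between two p-dimensional objects;
    q = 1 is Manhattan distance, q = 2 is Euclidean distance."""
    return sum(abs(a - b) ** q for a, b in zip(xi, xj)) ** (1.0 / q)

d = minkowski((0, 0), (3, 4))   # Euclidean: sqrt(3^2 + 4^2) = 5.0
```

Note that the symmetry property d(i, j) = d(j, i) holds because |a - b| = |b - a| term by term.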
Partitioning approach:
Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of squared errors
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
Example (K = 2)

[Figure: K-means iterations. Arbitrarily choose K objects as the initial
cluster means; assign each object to the most similar center; update the
cluster means; then reassign and repeat.]

Courtesy of "https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/"

March 26, 2025 Data Mining: Concepts and Techniques 21
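The assign/update loop pictured above can be sketched as a toy Lloyd's-algorithm implementation (function and variable names are mine, and the tiny dataset is only for illustration):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm: assign each point to the nearest mean,
    update each mean, and repeat until the means stop changing."""
    rng = random.Random(seed)
    means = rng.sample(points, k)          # arbitrarily choose K initial means
    for _ in range(iters):
        # Assignment step: nearest mean for every point.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, means[c]))
            clusters[nearest].append(p)
        # Update step: new mean of each non-empty cluster.
        new_means = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else means[i]
            for i, c in enumerate(clusters)
        ]
        if new_means == means:             # converged: means stopped changing
            break
        means = new_means
    return means, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
means, clusters = kmeans(pts, k=2)
```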
Challenges
K-means struggles when the densities of the original points differ
across clusters.
Variations of the K-Means Method

K-means++ initialization:
1. Choose the first cluster centroid at random from the data points.
2. Calculate the distance D(x) of each point x from its nearest chosen centroid.
3. Choose the next cluster centroid from the data points with the probability
   of x being proportional to (D(x))^2, so the farthest points are the most
   likely to be chosen.
Repeat steps 2-3 until K centroids have been chosen.
K-Means++ Clustering
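A minimal sketch of the k-means++ seeding step described above (the helper name and the fallback handling are my own):

```python
import math
import random

def kmeans_pp_init(points, k, seed=0):
    """K-means++ seeding: first center uniform at random; each further
    center is drawn with probability proportional to D(x)^2, where D(x)
    is the distance from x to its nearest already-chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(math.dist(p, c) for c in centers) ** 2 for p in points]
        r = rng.uniform(0, sum(d2))        # spin a D(x)^2-weighted roulette
        acc = 0.0
        chosen = points[-1]                # fallback for float edge cases
        for p, w in zip(points, d2):
            acc += w
            if w > 0 and acc >= r:
                chosen = p
                break
        centers.append(chosen)
    return centers
```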
Choosing the Right Number of Clusters

Elbow method: plot an elbow curve, where the x-axis represents the
number of clusters and the y-axis is an evaluation metric.
Choose one metric, such as inertia or the Dunn index, and plot its
value against the number of clusters.
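The elbow values can be computed with any small k-means routine; a self-contained sketch using inertia (sum of squared distances to the nearest mean) as the metric, with illustrative data:

```python
import math
import random

def inertia(points, means):
    """Sum of squared distances from each point to its nearest mean."""
    return sum(min(math.dist(p, m) ** 2 for m in means) for p in points)

def kmeans_means(points, k, iters=50, seed=0):
    """Tiny Lloyd's loop that returns only the final means."""
    rng = random.Random(seed)
    means = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda c: math.dist(p, means[c]))].append(p)
        means = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else means[i]
                 for i, c in enumerate(clusters)]
    return means

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
curve = [(k, inertia(pts, kmeans_means(pts, k))) for k in range(1, 5)]
# Inertia drops sharply up to the "elbow" (here k = 2), then flattens.
```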
https://codinginfinite.com/k-means-clustering-explained-with-numerical-example/
[Figure: K-means numerical example, cluster assignments before and after
updating the means.]

Repeat until the cluster means stop changing.
Hierarchical Clustering
Produces a set of nested clusters
organized as a hierarchical tree
Can be visualized as a dendrogram
A tree-like diagram that records the sequences of merges or splits
[Figure: six nested clusters and the corresponding dendrogram, with merge
heights on the vertical axis.]
Hierarchical Clustering
Given a set of N items to be clustered, and an NxN distance
(or similarity) matrix, the basic process of hierarchical
clustering is as follows:
– Agglomerative:
• Start with each item as its own cluster
• At each step, merge the two closest clusters until only one cluster remains
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a single point
[Figure: starting state, with every point p1 ... p12 as its own cluster and
the initial distance/proximity matrix.]
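The agglomerative process can be sketched directly from a pairwise-distance view (a naive O(N^3) toy, with `min` as the single-link rule; the names are illustrative):

```python
import math

def agglomerative(points, linkage=min):
    """Agglomerative hierarchical clustering sketch: start with every
    point as its own cluster, repeatedly merge the two closest clusters.
    `linkage` combines cross-cluster point distances (min = single-link)."""
    clusters = [[p] for p in points]
    merges = []                                  # (cluster_a, cluster_b, distance)
    while len(clusters) > 1:
        best = None                              # (distance, i, j) of closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(x, y)
                            for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return merges
```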
Intermediate State
After some merging steps, we have some clusters
[Figure: clusters C1-C5 with the current distance/proximity matrix.]
Intermediate State
Merge the two closest clusters (C2 and C5) and update
the distance matrix.
[Figure: the distance/proximity matrix over C1-C5, with the rows and
columns for the closest pair C2 and C5 about to be combined.]
After Merging
“How do we update the distance matrix?”
[Figure: distance matrix after the merge, with the entries for the new
cluster C2 U C5 marked "?".]
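One answer to the update question, sketched for a dict-based distance matrix: the merged cluster's row is obtained by combining the two old rows, with `min` for single-link or `max` for complete-link (the `C2+C5` label and helper name are mine):

```python
def merge_rows(dist, a, b, combine=min):
    """Update a dict-of-dicts distance matrix after merging clusters a, b.
    combine=min gives single-link, combine=max gives complete-link."""
    merged = a + "+" + b
    others = [c for c in dist if c not in (a, b)]
    new = {c: {} for c in others + [merged]}
    for c in others:
        for d in others:
            if c != d:
                new[c][d] = dist[c][d]          # untouched entries carry over
        new[c][merged] = combine(dist[c][a], dist[c][b])
        new[merged][c] = new[c][merged]         # keep the matrix symmetric
    return new

D = {
    "C1": {"C2": 4.0, "C5": 9.0},
    "C2": {"C1": 4.0, "C5": 5.0},
    "C5": {"C1": 9.0, "C2": 5.0},
}
D2 = merge_rows(D, "C2", "C5")   # single-link: d(C1, C2+C5) = min(4, 9) = 4
```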
Hierarchical Clustering
D_sl(Ci, Cj) = min { d(x, y) : x in Ci, y in Cj }
Single-link clustering: example
[Figure: single-link clustering of six points and the resulting dendrogram;
leaf order 3, 6, 2, 5, 4, 1, with merge heights between 0.05 and 0.2.]
D_cl(Ci, Cj) = max { d(x, y) : x in Ci, y in Cj }
Complete-link clustering: example
[Figure: complete-link clustering of the same six points and the resulting
dendrogram; leaf order 3, 6, 4, 1, 2, 5, with merge heights up to 0.4.]
D_avg(Ci, Cj) = (1 / (|Ci| |Cj|)) Σ_{x in Ci, y in Cj} d(x, y)
Average-link clustering: example
[Figure: average-link clustering of the same six points and the resulting
dendrogram; leaf order 3, 6, 4, 1, 2, 5, with merge heights up to 0.25.]
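The three linkage definitions can be compared on a toy pair of clusters (a sketch; `cluster_distance` is an illustrative name, not a library call):

```python
import math

def cluster_distance(Ci, Cj, how):
    """Single-, complete-, and average-link distances between clusters."""
    ds = [math.dist(x, y) for x in Ci for y in Cj]
    if how == "single":
        return min(ds)             # D_sl: closest cross-cluster pair
    if how == "complete":
        return max(ds)             # D_cl: farthest cross-cluster pair
    return sum(ds) / len(ds)      # D_avg: mean over all cross-cluster pairs

Ci = [(0, 0), (0, 2)]
Cj = [(3, 0), (5, 0)]
# single = 3.0, complete = sqrt(29), average = mean of the four distances
```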
Strengths
Less susceptible to noise and outliers
Limitations
Biased towards globular clusters
Distance between two clusters
Centroid distance between clusters Ci
and Cj is the distance between the centroid
ri of Ci and the centroid rj of Cj
• ri: centroid of Ci
• rj: centroid of Cj
• rij: centroid of Cij
Ward’s distance for clusters
Ward’s distance between Ci and Cj is the increase in total squared
error (SSE) when the two clusters are merged; it behaves similarly
to group average and centroid distance.
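A sketch of both measures, using the common closed form for Ward's distance, delta-SSE = (|Ci| |Cj| / (|Ci| + |Cj|)) * ||r_i - r_j||^2 (names illustrative):

```python
import math

def centroid(C):
    """Centroid of a cluster of tuples."""
    return tuple(sum(xs) / len(C) for xs in zip(*C))

def centroid_distance(Ci, Cj):
    """Distance between the centroids r_i of Ci and r_j of Cj."""
    return math.dist(centroid(Ci), centroid(Cj))

def ward_distance(Ci, Cj):
    """Increase in total squared error if Ci and Cj are merged:
    (|Ci||Cj| / (|Ci| + |Cj|)) * ||r_i - r_j||^2."""
    ni, nj = len(Ci), len(Cj)
    return ni * nj / (ni + nj) * centroid_distance(Ci, Cj) ** 2

Ci = [(0, 0), (0, 2)]      # centroid r_i = (0, 1)
Cj = [(4, 0), (4, 2)]      # centroid r_j = (4, 1)
```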
[Figure: clusterings of the same six points produced by Ward’s Method and
by Group Average.]
Hierarchical Clustering
In a dendrogram, the length of each tree branch represents
the distance between clusters it joins.
[Figure: example point sets at successive steps of the hierarchical
clustering, plotted on 0-10 axes.]
• Continue recursively
https://people.revoledu.com/kardi/tutorial/Clustering/Numerical%20Example.htm
https://online.stat.psu.edu/stat555/node/86/
[Figure: five example points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on
a 0-10 grid.]
Non-leaf node:
  CF1      CF2      CF3      CF5
  child1   child2   child3   child5
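For context, each CF entry in such a BIRCH node summarizes a subcluster as a clustering feature CF = (N, LS, SS): the point count, the per-dimension linear sum, and the sum of squared values. A sketch using the five example points shown earlier, (3,4), (2,6), (4,5), (4,7), (3,8):

```python
def clustering_feature(points):
    """BIRCH clustering feature CF = (N, LS, SS). CFs are additive, so a
    non-leaf node's CF is the sum of its children's CFs."""
    n = len(points)
    ls = tuple(sum(xs) for xs in zip(*points))     # linear sum per dimension
    ss = sum(x * x for p in points for x in p)     # sum of squared values
    return n, ls, ss

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
n, ls, ss = clustering_feature(pts)
# The subcluster centroid can be recovered from the CF alone: LS / N.
center = tuple(c / n for c in ls)
```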