Clustering
• Clustering is a data mining technique that finds similarities
between data according to the characteristics found in the data &
groups similar data objects into one cluster
• Given a set of points, with a notion of distance between points, group
  the points into some number of clusters, so that members of a cluster
  are in some sense as close to each other as possible.
• While data points in the same cluster are similar, those in separate
  clusters are dissimilar to one another.
  [Figure: points in the plane grouped into clusters.]
Example: clustering
• The example below demonstrates the clustering of padlocks of the same
  kind. There are a total of 10 padlocks, which vary in color, size,
  shape, etc.
• Cosine Similarity
  – If X and Y are two vector attributes of data objects, then the
    cosine similarity measure is given by:
      cos(X, Y) = Σi xi·yi / ( √(Σi xi²) · √(Σi yi²) )
||d2|| = (3·3 + 0·0 + 2·2 + 0·0 + 1·1 + 1·1 + 0·0 + 1·1 + 0·0 + 1·1)^½ = (17)^½ = 4.12
cos(d1, d2) = 0.94
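In Python, the measure can be sketched as follows. The components of d1 are not listed above, so the d1 below is an assumption (a common textbook companion to the d2 whose norm is computed); only d2 and the results 4.12 and 0.94 come from the slide:

```python
import math

def cosine_similarity(x, y):
    """cos(X, Y) = sum_i x_i*y_i / (||X|| * ||Y||)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi * xi for xi in x)) *
                  math.sqrt(sum(yi * yi for yi in y)))

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)  # assumed; not given on the slide
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(math.sqrt(sum(v * v for v in d2)), 2))  # ||d2|| = 4.12
print(round(cosine_similarity(d1, d2), 2))          # 0.94
```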
Major Clustering Approaches
• Partitioning clustering approach:
– Construct various partitions and then evaluate them by some
criterion
– Typical methods:
• distance-based: K-means clustering
• model-based: expectation maximization (EM) clustering.
Partitioning Algorithms: Basic Concept
• Partitioning method: Construct a partition of a
  database D of n objects into a set of k clusters,
  such that the sum of squared distances is minimized
• Given a k, find a partition of k clusters that
optimizes the chosen partitioning criterion
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of
the cluster
– k-medoids: Each cluster is represented by one of the
objects in the cluster
The K-Means Clustering Method
• Algorithm:
  – Select k cluster points as initial centroids (the initial
    centroids are selected randomly)
  – Given k, the k-means algorithm is implemented as follows:
    • Repeat
      – Assign each object to the cluster with the nearest seed
        point, partitioning the objects into k nonempty subsets
      – Recompute the centroid of each of the k clusters of the
        current partition (the centroid is the center, i.e., mean
        point, of the cluster)
    • Until the centroids do not change
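The loop above can be sketched in a few lines of Python (a minimal illustration assuming Euclidean distance; real implementations such as scikit-learn's KMeans add smarter initialization and empty-cluster handling):

```python
import random

def kmeans(points, k, seed=0):
    """Minimal k-means sketch (Euclidean distance assumed)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # select k points as initial centroids
    while True:
        # assign each object to the cluster with the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        # recompute the centroid (mean point) of each cluster
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:  # stop when the centroids do not change
            return centroids, clusters
        centroids = new

centroids, clusters = kmeans([(0, 0), (0, 1), (1, 0),
                              (10, 10), (10, 11), (11, 10)], k=2)
print(sorted(len(cl) for cl in clusters))  # [3, 3]
```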
The K-Means Clustering Method
• Example (K = 2):
  [Figure: three scatter plots. Arbitrarily choose K objects as the
  initial cluster centers; assign each object to the most similar
  center; update the cluster means; then reassign objects and update
  the means again until assignments no longer change.]
Example Problem
• Cluster the following eight points (with (x, y) representing
  locations) into three clusters:
  A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4),
  A7(1, 2), A8(4, 9).
  – Assume that the initial cluster centers are A1(2, 10),
    A4(5, 8) and A7(1, 2).
• The distance function between two points a = (x1, y1) and
  b = (x2, y2) is the Manhattan distance:
  dis(a, b) = |x2 – x1| + |y2 – y1|
• Use the k-means algorithm to find the optimal centroids that
  group the given data into three clusters.
Iteration 1
First we list all points in the first column of the table below. The initial
cluster centers (centroids) are (2, 10), (5, 8) and (1, 2), chosen
randomly.
(2,10) (5, 8) (1, 2)
Point Mean 1 Mean 2 Mean 3 Cluster
A1 (2, 10) 0 5 9 1
A2 (2, 5) 5 6 4 3
A3 (8, 4) 12 7 9 2
A4 (5, 8) 5 0 10 2
A5 (7, 5) 10 5 9 2
A6 (6, 4) 10 5 7 2
A7 (1, 2) 9 10 0 3
A8 (4, 9) 3 2 10 2
Next, we calculate the distance from each point to each of the
three centroids, using the distance function:
dis(point i, mean j) = |x2 – x1| + |y2 – y1|
Iteration 1
• Starting from point A1 calculate the distance to each of the three
means, by using the distance function:
dis (A1, mean1) = |2 – 2| + |10 – 10| = 0 + 0 = 0
dis(A1, mean2) = |5 – 2| + |8 – 10| = 3 + 2 = 5
dis(A1, mean3) = |1 – 2| + |2 – 10| = 1 + 8 = 9
– Fill these values in the table and decide which cluster the point (2,
  10) should be placed in: the one where the point has the shortest
  distance to the mean, i.e. mean 1 (cluster 1), since the distance is 0.
• Next go to the second point A2 and calculate the distance:
dis(A2, mean1) = |2 – 2| + |10 – 5| = 0 + 5 = 5
dis(A2, mean2) = |5 – 2| + |8 – 5| = 3 + 3 = 6
dis(A2, mean3) = |1 – 2| + |2 – 5| = 1 + 3 = 4
– So, we fill in these values in the table and assign the point (2, 5) to
  cluster 3, since mean 3 is the shortest distance from A2.
• Analogously, we fill in the rest of the table and place each point in
  one of the clusters.
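The distance table above can be reproduced mechanically; a short Python sketch using the Manhattan distance from the problem statement:

```python
points = {"A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
          "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9)}
means = [(2, 10), (5, 8), (1, 2)]  # initial centroids A1, A4, A7

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# reproduces the distance columns and the Cluster column of the table
for name, p in points.items():
    d = [manhattan(p, m) for m in means]
    print(name, d, "-> cluster", d.index(min(d)) + 1)
```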
Iteration 1
• Next, we need to re-compute the new cluster centers (means). We
  do so by taking the mean of all points in each cluster.
• For Cluster 1, we only have one point A1(2, 10), which was the old
mean, so the cluster center remains the same.
• For Cluster 2, we have five points and need to take their average
  as the new centroid, i.e.
  ( (8+5+7+6+4)/5, (4+8+5+4+9)/5 ) = (6, 6)
• For Cluster 3, we have two points. The new centroid is:
( (2+1)/2, (5+2)/2 ) = (1.5, 3.5)
• That was Iteration 1 (epoch 1). Next, we go to Iteration 2 (epoch 2),
  Iteration 3, and so on until the centroids do not change anymore.
  – In Iteration 2, we basically repeat the process from Iteration 1,
    this time using the new means we computed.
Second epoch
• Using the new centroids, we compute the cluster memberships.
(2,10) (6, 6) (1.5, 3.5)
Point Mean 1 Mean 2 Mean 3 Cluster
A1 (2, 10) 0 8 7 1
A2 (2, 5) 5 5 2 3
A3 (8, 4) 12 4 7 2
A4 (5, 8) 5 3 8 2
A5 (7, 5) 10 2 7 2
A6 (6, 4) 10 2 5 2
A7 (1, 2) 9 9 2 3
A8 (4, 9) 3 5 8 1
• After the 2nd epoch the results would be:
cluster 1: {A1,A8} with new centroid=(3,9.5);
cluster 2: {A3,A4,A5,A6} with new centroid=(6.5,5.25);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Third epoch
• Using the new centroids, we compute the cluster memberships.
(3,9.5) (6.5, 5.25) (1.5, 3.5)
Point Mean 1 Mean 2 Mean 3 Cluster
A1 (2, 10) 1.5 9.25 7 1
A2 (2, 5) 5.5 4.75 2 3
A3 (8, 4) 10.5 2.75 7 2
A4 (5, 8) 3.5 4.25 8 1
A5 (7, 5) 8.5 0.75 7 2
A6 (6, 4) 8.5 1.75 5 2
A7 (1, 2) 9.5 8.75 2 3
A8 (4, 9) 1.5 6.25 8 1
• After the 3rd epoch the results would be:
cluster 1: {A1,A4,A8} with new centroid=(3.66,9);
cluster 2: {A3,A5,A6} with new centroid=(7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Fourth epoch
• Using the new centroids, we compute the cluster memberships.
(3.66,9) (7, 4.33) (1.5, 3.5)
Point Mean 1 Mean 2 Mean 3 Cluster
A1 (2, 10) 2.66 10.67 7 1
A2 (2, 5) 5.66 5.67 2 3
A3 (8, 4) 9.34 1.33 7 2
A4 (5, 8) 2.34 5.67 8 1
A5 (7, 5) 7.34 0.67 7 2
A6 (6, 4) 7.34 1.33 5 2
A7 (1, 2) 9.66 8.33 2 3
A8 (4, 9) 0.34 7.67 8 1
• After the 4th epoch the results would be:
cluster 1: {A1,A4,A8} with new centroid=(3.66,9);
cluster 2: {A3,A5,A6} with new centroid=(7,4.33);
cluster 3: {A2,A7} with new centroid=(1.5,3.5)
Final results
• Finally, in the 4th epoch there is no change in the membership of the
  clusters or in the centroids, so the algorithm stops.
• The result of the clustering is shown in the following figure.
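The whole sequence of epochs can be reproduced with a short script (a sketch; it assumes no cluster ever becomes empty, which holds for this data):

```python
points = [("A1", (2, 10)), ("A2", (2, 5)), ("A3", (8, 4)), ("A4", (5, 8)),
          ("A5", (7, 5)), ("A6", (6, 4)), ("A7", (1, 2)), ("A8", (4, 9))]
centroids = [(2, 10), (5, 8), (1, 2)]  # initial centers A1, A4, A7

while True:
    # assign every point to the nearest centroid (Manhattan distance)
    clusters = [[] for _ in centroids]
    for name, (x, y) in points:
        d = [abs(x - cx) + abs(y - cy) for cx, cy in centroids]
        clusters[d.index(min(d))].append((name, (x, y)))
    # recompute each centroid as the mean of its members
    new = [(sum(x for _, (x, _) in cl) / len(cl),
            sum(y for _, (_, y) in cl) / len(cl)) for cl in clusters]
    if new == centroids:  # converged: centroids did not change
        break
    centroids = new

print([[name for name, _ in cl] for cl in clusters])
# -> [['A1', 'A4', 'A8'], ['A3', 'A5', 'A6'], ['A2', 'A7']]
```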
Comments on the K-Means Method
• Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
• Weaknesses
  – Applicable only when the mean is defined; what about
    categorical data? Use hierarchical clustering
• Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers, since an object with
    an extremely large value may substantially distort the
    distribution of the data.
• K-Medoids: Instead of taking the mean value of the objects in a
  cluster as a reference point, a medoid can be used: the most
  centrally located object in the cluster.
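The k-medoids idea can be illustrated with a small helper (a sketch only, not the full PAM algorithm; the cluster used below is cluster 1 from the worked example, A1, A4, A8):

```python
def medoid(cluster):
    """Most centrally located object: smallest total Manhattan distance."""
    def total_dist(p):
        return sum(abs(p[0] - q[0]) + abs(p[1] - q[1]) for q in cluster)
    return min(cluster, key=total_dist)

# cluster 1 from the worked example: A1(2, 10), A4(5, 8), A8(4, 9)
print(medoid([(2, 10), (5, 8), (4, 9)]))  # (4, 9) is the medoid
```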
Hierarchical Clustering
• Produces a set of nested clusters organized
as a hierarchical tree.
[Figure: dendrogram over points 1, 3, 2, 5, 4, 6; agglomerative
merging proceeds Step 0 → Step 4, while divisive splitting runs in
the reverse direction, Step 4 → Step 0.]
Two main types of hierarchical clustering
• Agglomerative: a bottom-up clustering technique
  – Start with all sample units in n clusters of size 1.
  – Then, at each step of the algorithm, the pair of clusters with the shortest
    distance is combined into a single cluster.
  – The algorithm stops when all sample units are combined into a single
    cluster of size n.
• Divisive: a top-down clustering technique
  – Start with all sample units in a single cluster of size n.
  – Then, at each step of the algorithm, a cluster is partitioned into a
    pair of daughter clusters, selected to maximize the distance
    between the daughters.
  – The algorithm stops when the sample units are partitioned into n
    clusters of size 1.
Dendrogram: Shows How the Clusters are Merged
Agglomerative Clustering Algorithm
• More popular hierarchical clustering technique
• Basic algorithm is straightforward
1. Let each data point be a cluster
2. Compute the proximity matrix
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
• Key operation is the computation of the proximity of two clusters
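Steps 1 to 6 above can be sketched directly (single link is assumed for the cluster-to-cluster proximity; this is a naive O(n³) illustration, not an efficient implementation):

```python
def agglomerative(points, dist):
    """Bottom-up clustering; returns the merge history (single link)."""
    clusters = [[p] for p in points]   # 1. let each data point be a cluster
    merges = []
    while len(clusters) > 1:           # 3./6. repeat until one cluster remains
        # 4. find the two closest clusters (single link: min pairwise distance)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]   # merge the two clusters
        del clusters[j]                           # 5. update the cluster list
    return merges

history = agglomerative([1, 2, 10], lambda a, b: abs(a - b))
print([d for _, _, d in history])  # merge distances: [1, 8]
```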
Example
• Perform an agglomerative clustering of five
  samples using two features X and Y. Calculate
  the Manhattan distance between each pair of
  samples to measure their similarity.
Data item X Y
1 4 4
2 8 4
3 15 8
4 24 4
5 24 12
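A natural first step is to tabulate the pairwise Manhattan distances for the five samples; the smallest entry gives the first agglomerative merge (a short sketch):

```python
pts = {1: (4, 4), 2: (8, 4), 3: (15, 8), 4: (24, 4), 5: (24, 12)}

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

# pairwise Manhattan distance matrix (upper triangle)
dist = {(i, j): manhattan(pts[i], pts[j])
        for i in pts for j in pts if i < j}
for (i, j), d in sorted(dist.items()):
    print(i, j, d)

# the closest pair is merged first by agglomerative clustering
print(min(dist, key=dist.get))  # (1, 2), at distance 4
```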
Strengths of Hierarchical Clustering