Clustering Part-1
CSE -4213
Tamanna Tabassum
Lecturer, Dept. of CSE
Contents
Cluster Analysis
Application of Clustering
Major clustering approach
Clustering Algorithm
K-means Algorithm
Nearest Neighbor Algorithm
Agglomerative Algorithm
Divisive Algorithm
Conclusion
References
What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
[Figure: two goals of clustering — intra-cluster distances are minimized, inter-cluster distances are maximized]
Cluster Analysis
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to
the characteristics found in the data and
grouping similar data objects into clusters
Unsupervised learning: no predefined
classes
What Is Good Clustering?
A good clustering method will produce high-quality clusters with:
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation.
The quality of a clustering method is also
measured by its ability to discover some or all of
the hidden patterns.
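These two quality criteria can be made concrete with a small sketch (hypothetical 2-D points and Euclidean distance, names my own): the average distance of points to their own cluster centroid should be small, while the distance between centroids should be large.

```python
import math

# Hypothetical 2-D points already assigned to two clusters.
clusters = {
    "C1": [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8)],
    "C2": [(8.0, 8.0), (8.5, 7.5), (9.0, 8.2)],
}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Intra-cluster: average distance of each point to its own centroid.
for name, pts in clusters.items():
    c = centroid(pts)
    intra = sum(dist(p, c) for p in pts) / len(pts)
    print(name, "intra:", round(intra, 2))

# Inter-cluster: distance between the two centroids.
inter = dist(centroid(clusters["C1"]), centroid(clusters["C2"]))
print("inter:", round(inter, 2))
```

For this data the intra-cluster averages are well under 1 while the inter-cluster distance is nearly 10, which is what a good clustering should look like.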
Application of Clustering
Applications of clustering algorithms include:
Pattern Recognition
Spatial Data Analysis
Image Processing
Economic Science (especially market research)
Web analysis and classification of documents
Classification of astronomical data and
classification of objects found in an
archaeological study
Medical science
Outliers
Outliers are objects that do not belong to any cluster or form clusters of very small cardinality.
[Figure: a scatter plot with one dense cluster and a few isolated outlier points]
Major Clustering Approaches
Partitioning methods
Hierarchical methods
Grid-based methods
Model-based methods
K-means
Example: cluster 10 students (age and marks in three subjects) into k = 3 clusters.

Table 1: The student data
Student Age Mark1 Mark2 Mark3
S1      18  73    75    57
S2      18  79    85    75
S3      23  70    70    52
S4      20  55    55    55
S5      22  85    86    87
S6      19  91    90    89
S7      20  70    65    60
S8      21  53    56    59
S9      19  82    82    60
S10     47  75    76    77
K-means Example (cont.)
Steps 1 and 2: Let the three seeds be the first three students.

Table 2: The three seeds
Seed Age Mark1 Mark2 Mark3
C1   18  73    75    57
C2   18  79    85    75
C3   23  70    70    52

Step 3: Compute the Manhattan distance of each student from each seed, and allocate the student to the nearest seed.
For S1 (18, 73, 75, 57):
Dist(S1, C1) = 0
Dist(S1, C2) = 0 + 6 + 10 + 18 = 34
Dist(S1, C3) = 5 + 3 + 5 + 5 = 18
S1 is nearest to C1, so S1 is allocated to C1.
For S2 (18, 79, 85, 75):
Dist(S2, C1) = 0 + 6 + 10 + 18 = 34
Dist(S2, C2) = 0
Dist(S2, C3) = 5 + 9 + 15 + 23 = 52
S2 is allocated to C2.
K-means Example (cont.)
Repeating this for all ten students gives the following distances and cluster allocations:

Student Age Mark1 Mark2 Mark3 Dist-C1 Dist-C2 Dist-C3 Cluster
S1      18  73    75    57    0       34      18      C1
S2      18  79    85    75    34      0       52      C2
S3      23  70    70    52    18      52      0       C3
S4      20  55    55    55    42      76      36      C3
S5      22  85    86    87    57      23      67      C2
S6      19  91    90    89    66      32      82      C2
S7      20  70    65    60    18      46      16      C3
S8      21  53    56    59    44      74      40      C3
S9      19  82    82    60    20      22      36      C1
S10     47  75    76    77    52      44      60      C2
K-means Example (cont.)
Step 4: Use the students allocated to each cluster to compute new cluster means:

Cluster Members           Age  Mark1 Mark2 Mark3
C1      S1, S9            18.5 77.5  78.5  58.5
C2      S2, S5, S6, S10   26.5 82.5  84.3  82.0
C3      S3, S4, S7, S8    21.0 62.0  61.5  56.5

Cluster membership
Cluster-1: S1, S9
Cluster-2: S2, S5, S6, S10
Cluster-3: S3, S4, S7, S8
K-means Example (cont.)
Use the new cluster means to recompute the distance of each object to each of the means, again allocating each object to the nearest cluster.
K-means Example (cont.)
The cluster membership does not change, so the algorithm stops:
Cluster-1: S1, S9
Cluster-2: S2, S5, S6, S10
Cluster-3: S3, S4, S7, S8
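The worked example can be reproduced with a short sketch (plain Python; Manhattan distance and the first three students as seeds, as in the tables above; variable names are my own):

```python
students = {
    "S1": (18, 73, 75, 57), "S2": (18, 79, 85, 75), "S3": (23, 70, 70, 52),
    "S4": (20, 55, 55, 55), "S5": (22, 85, 86, 87), "S6": (19, 91, 90, 89),
    "S7": (20, 70, 65, 60), "S8": (21, 53, 56, 59), "S9": (19, 82, 82, 60),
    "S10": (47, 75, 76, 77),
}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def mean(rows):
    # Attribute-wise average of a list of tuples.
    return tuple(sum(col) / len(col) for col in zip(*rows))

# Steps 1-2: the first three students are the seeds.
seeds = [students["S1"], students["S2"], students["S3"]]
while True:
    # Step 3: allocate every student to the nearest seed.
    members = {0: [], 1: [], 2: []}
    labels = {}
    for name, row in students.items():
        j = min(range(3), key=lambda j: manhattan(row, seeds[j]))
        members[j].append(row)
        labels[name] = "C%d" % (j + 1)
    # Step 4: recompute means; stop when the seeds no longer move.
    new_seeds = [mean(members[j]) for j in range(3)]
    if new_seeds == seeds:
        break
    seeds = new_seeds

print(labels)
print("C1 mean:", seeds[0])
```

Running this reproduces the membership above (C1: S1, S9; C2: S2, S5, S6, S10; C3: S3, S4, S7, S8) and the C1 mean (18.5, 77.5, 78.5, 58.5); the loop terminates after the second pass because no student changes cluster.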
K-means Example (cont.)
[Figure: four 2-D scatter plots (axes 0-10) illustrating successive k-means iterations: objects are reassigned and the centroids move until the clusters stabilize]
K-Means
Strengths
Relatively efficient: O(tkn), where n is # objects,
k is # clusters, and t is # iterations. Normally,
k, t << n.
Often terminates at a local optimum.
Weaknesses
Applicable only when mean is defined (what about
categorical data?)
Need to specify k, the number of clusters, in
advance
Trouble with noisy data and outliers
Not suitable to discover clusters with non-convex
shapes
K-means Example (cont.)
The results of the k-means method depend strongly on the initial
guesses of the seeds.
The k-means method can be sensitive to outliers. If an outlier is
picked as a starting seed, it may end up in a cluster of its own.
Also if an outlier moves from one cluster to another during
iterations, it can have a major impact on the clusters because the
means of the two clusters are likely to change significantly.
Although some local optimum solutions discovered by the K-means method are satisfactory, often the local optimum is not as good as the global optimum.
The K-means method does not consider the size of the clusters.
Some clusters may be large and some very small.
The K-means does not deal with overlapping clusters.
Nearest Neighbor Algorithm
An algorithm similar to the single link technique
is called the nearest neighbor algorithm.
With this serial algorithm, items are iteratively merged into the existing cluster that is closest.
In this algorithm a threshold t is used to determine whether an item will be added to an existing cluster or a new cluster is created.
Nearest Neighbor Algorithm
Algorithm for Nearest Neighbor clustering
Input:
D = {t1, t2, ..., tn} // Set of elements
A // Adjacency matrix showing distance between elements
t // Threshold
Output: K // Set of clusters
1. K1 = {t1};
2. K = {K1};
3. k = 1;
4. for i = 2 to n do
   1. find the tm in some cluster Km such that dis(ti, tm) is the smallest;
   2. if dis(ti, tm) ≤ t then
      1. Km = Km ∪ {ti}
   3. else
      1. k = k + 1;
      2. Kk = {ti};
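The pseudocode above can be sketched in plain Python. Here `dist` is any distance function (the adjacency matrix A is just a precomputed form of it), and the demo data and threshold are hypothetical:

```python
def nearest_neighbor_clustering(items, dist, t):
    clusters = [[items[0]]]                  # K1 = {t1}
    for item in items[1:]:                   # for i = 2 to n
        # find the already-placed item nearest to the new item
        best_d, best_c = min(((dist(item, p), c)
                              for c in clusters for p in c),
                             key=lambda x: x[0])
        if best_d <= t:
            best_c.append(item)              # Km = Km ∪ {ti}
        else:
            clusters.append([item])          # new cluster Kk = {ti}
    return clusters

# Hypothetical 1-D demo: with threshold 2, the points split into two groups.
pts = [0.0, 1.0, 1.5, 9.0, 10.0]
print(nearest_neighbor_clustering(pts, lambda a, b: abs(a - b), 2))
# → [[0.0, 1.0, 1.5], [9.0, 10.0]]
```

Note the algorithm is order-dependent: items are processed in the given sequence, so a different input order can produce different clusters.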
Algorithm-Nearest Neighbor
Derive a similarity matrix from the items
in the dataset.
This matrix, referred to as the distance matrix,
will hold the similarity values for each and
every item in the data set. (These values are
elaborated in detail in the next example.)
With the matrix in place, compare each
item in the dataset to every other item
and compute the similarity value.
Algorithm-Nearest Neighbor
Using the distance matrix, examine every
item to see whether the distance to its
neighbors is less than a value that you have
defined.
This value is called the threshold.
The algorithm puts each element in a
separate cluster, analyzes the items, and
decides which items are similar, and adds
similar items to the same cluster.
The algorithm stops when all items have
been examined.
Example: a dataset of eight geographical locations where individuals live, collected at a specific point in time.

Individual X Y
1          2 10
2          2 5
3          8 4
4          5 8
5          7 5
6          6 4
7          1 2
8          4 9

One of the resulting clusters: C2 = {Individual 2, Individual 7}
Source: https://www.dummies.com/programming/big-data/data-science/how-to-cluster-by-nearest-neighbors-in-predictive-analysis/
Nearest Neighbor Algorithm Example
Table : Distance among A, B, C, D, E data
Item A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Nearest Neighbor Algorithm Example
A is placed in a cluster by itself
K1={A}
Nearest Neighbor Algorithm Example
Consider B: should it be added to K1 or form a new cluster?
Dist(A, B) = 1, which is less than the threshold value t = 2.
So K1 = {A, B}
Nearest Neighbor Algorithm Example
For C we calculate the distance from both A and B:
Dist(AB, C) = min{Dist(A, C), Dist(B, C)} = min{2, 2} = 2
This equals the threshold, so K1 = {A, B, C}
Nearest Neighbor Algorithm Example
Dist(ABC, D)= min{Dist(A, D), Dist(B, D),Dist(C, D)}
=min{2,4,1}
=1
So K1={A, B, C, D}
Nearest Neighbor Algorithm Example
Dist(ABCD, E) = min{Dist(A, E), Dist(B, E), Dist(C, E), Dist(D, E)}
= min{3, 3, 5, 3}
= 3, which is greater than the threshold value t = 2.
So K1 = {A, B, C, D}
and K2 = {E}
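The walk-through above can be checked with a short self-contained sketch (plain Python; the distance matrix and threshold t = 2 are taken from the example):

```python
# Distance matrix among A, B, C, D, E from the table above.
names = ["A", "B", "C", "D", "E"]
D = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
    "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
    "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
    "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
    "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0},
}
t = 2

clusters = [["A"]]                       # A forms the first cluster
for item in names[1:]:
    # distance to the nearest already-placed item
    best_d, best_c = min(((D[item][p], c) for c in clusters for p in c),
                         key=lambda x: x[0])
    if best_d <= t:
        best_c.append(item)              # within threshold: merge
    else:
        clusters.append([item])          # beyond threshold: new cluster

print(clusters)                          # → [['A', 'B', 'C', 'D'], ['E']]
```

This reproduces the result above: B, C, and D each fall within the threshold of an existing member, while E (minimum distance 3 > 2) starts its own cluster.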
Thank you