Clustering Part-1
CSE -4213
Tamanna Tabassum
Lecturer, Dept. of CSE
Contents
Cluster Analysis
Application of Clustering
Major clustering approach
Clustering Algorithm
K-means Algorithm
Nearest Neighbor Algorithm
Agglomerative Algorithm
Divisive Algorithm
Conclusion
References
What is Cluster Analysis?
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
[Figure: two goals of clustering — intra-cluster distances are minimized, inter-cluster distances are maximized]
Cluster Analysis
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Finding similarities between data according to
the characteristics found in the data and
grouping similar data objects into clusters
Unsupervised learning: no predefined
classes
What Is Good Clustering?
A good clustering method will produce high-quality clusters with:
high intra-class similarity
low inter-class similarity
The quality of a clustering result depends on both
the similarity measure used by the method and its
implementation.
The quality of a clustering method is also
measured by its ability to discover some or all of
the hidden patterns.
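These two quality criteria can be made concrete with a small sketch (hypothetical 2-D points and Euclidean distance, names my own): the average distance of points to their own cluster centroid should be small, while the distance between centroids should be large.

```python
import math

# Hypothetical 2-D points already assigned to two clusters.
clusters = {
    "C1": [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8)],
    "C2": [(8.0, 8.0), (8.5, 7.5), (9.0, 8.2)],
}

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Intra-cluster: average distance of each point to its own centroid.
for name, pts in clusters.items():
    c = centroid(pts)
    intra = sum(dist(p, c) for p in pts) / len(pts)
    print(name, "intra:", round(intra, 2))

# Inter-cluster: distance between the two centroids.
inter = dist(centroid(clusters["C1"]), centroid(clusters["C2"]))
print("inter:", round(inter, 2))
```

For this data the intra-cluster averages are well under 1 while the inter-cluster distance is nearly 10, which is what a good clustering should look like.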
Application of Clustering
Applications of clustering algorithms include:
Pattern Recognition
Spatial Data Analysis
Image Processing
Economic Science (especially market research)
Web analysis and classification of documents
Classification of astronomical data and
classification of objects found in an
archaeological study
Medical science
Outliers
Outliers are objects that do not belong to any cluster or form clusters of very small cardinality.
[Figure: a scatter plot with one dense cluster and a few isolated outlier points]
Major Clustering Approaches
Partitioning methods
Hierarchical methods
Grid-based methods
Model-based methods
K-means
Example: cluster 10 students (age and marks in three subjects) into k = 3 clusters.

Table 1: The student data
Student Age Mark1 Mark2 Mark3
S1      18  73    75    57
S2      18  79    85    75
S3      23  70    70    52
S4      20  55    55    55
S5      22  85    86    87
S6      19  91    90    89
S7      20  70    65    60
S8      21  53    56    59
S9      19  82    82    60
S10     47  75    76    77
K-means Example (cont.)
Steps 1 and 2: Let the three seeds be the first three students.

Table 2: The three seeds
Seed Age Mark1 Mark2 Mark3
C1   18  73    75    57
C2   18  79    85    75
C3   23  70    70    52

Step 3: Compute the Manhattan distance of each student from each seed, and allocate the student to the nearest seed.
For S1 (18, 73, 75, 57):
Dist(S1, C1) = 0
Dist(S1, C2) = 0 + 6 + 10 + 18 = 34
Dist(S1, C3) = 5 + 3 + 5 + 5 = 18
S1 is nearest to C1, so S1 is allocated to C1.
For S2 (18, 79, 85, 75):
Dist(S2, C1) = 0 + 6 + 10 + 18 = 34
Dist(S2, C2) = 0
Dist(S2, C3) = 5 + 9 + 15 + 23 = 52
S2 is allocated to C2.
K-means Example (cont.)
Repeating this for all ten students gives the following distances and cluster allocations:

Student Age Mark1 Mark2 Mark3 Dist-C1 Dist-C2 Dist-C3 Cluster
S1      18  73    75    57    0       34      18      C1
S2      18  79    85    75    34      0       52      C2
S3      23  70    70    52    18      52      0       C3
S4      20  55    55    55    42      76      36      C3
S5      22  85    86    87    57      23      67      C2
S6      19  91    90    89    66      32      82      C2
S7      20  70    65    60    18      46      16      C3
S8      21  53    56    59    44      74      40      C3
S9      19  82    82    60    20      22      36      C1
S10     47  75    76    77    52      44      60      C2
K-means Example (cont.)
Step 4: Use the students allocated to each cluster to compute new cluster means:

Cluster Members           Age  Mark1 Mark2 Mark3
C1      S1, S9            18.5 77.5  78.5  58.5
C2      S2, S5, S6, S10   26.5 82.5  84.3  82.0
C3      S3, S4, S7, S8    21.0 62.0  61.5  56.5

Cluster membership
Cluster-1: S1, S9
Cluster-2: S2, S5, S6, S10
Cluster-3: S3, S4, S7, S8
K-means Example (cont.)
Use the new cluster means to recompute the distance of each object to each of the means, again allocating each object to the nearest cluster.
K-means Example (cont.)
The cluster membership does not change, so the algorithm stops:
Cluster-1: S1, S9
Cluster-2: S2, S5, S6, S10
Cluster-3: S3, S4, S7, S8
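The worked example can be reproduced with a short sketch (plain Python; Manhattan distance and the first three students as seeds, as in the tables above; variable names are my own):

```python
students = {
    "S1": (18, 73, 75, 57), "S2": (18, 79, 85, 75), "S3": (23, 70, 70, 52),
    "S4": (20, 55, 55, 55), "S5": (22, 85, 86, 87), "S6": (19, 91, 90, 89),
    "S7": (20, 70, 65, 60), "S8": (21, 53, 56, 59), "S9": (19, 82, 82, 60),
    "S10": (47, 75, 76, 77),
}

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def mean(rows):
    # Attribute-wise average of a list of tuples.
    return tuple(sum(col) / len(col) for col in zip(*rows))

# Steps 1-2: the first three students are the seeds.
seeds = [students["S1"], students["S2"], students["S3"]]
while True:
    # Step 3: allocate every student to the nearest seed.
    members = {0: [], 1: [], 2: []}
    labels = {}
    for name, row in students.items():
        j = min(range(3), key=lambda j: manhattan(row, seeds[j]))
        members[j].append(row)
        labels[name] = "C%d" % (j + 1)
    # Step 4: recompute means; stop when the seeds no longer move.
    new_seeds = [mean(members[j]) for j in range(3)]
    if new_seeds == seeds:
        break
    seeds = new_seeds

print(labels)
print("C1 mean:", seeds[0])
```

Running this reproduces the membership above (C1: S1, S9; C2: S2, S5, S6, S10; C3: S3, S4, S7, S8) and the C1 mean (18.5, 77.5, 78.5, 58.5); the loop terminates after the second pass because no student changes cluster.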
K-means Example (cont.)
[Figure: four 2-D scatter plots (axes 0-10) illustrating successive k-means iterations: objects are reassigned and the centroids move until the clusters stabilize]
K-Means
Strengths
Relatively efficient: O(tkn), where n is # objects,
k is # clusters, and t is # iterations. Normally,
k, t << n.
Often terminates at a local optimum.
Weaknesses
Applicable only when mean is defined (what about
categorical data?)
Need to specify k, the number of clusters, in
advance
Trouble with noisy data and outliers
Not suitable to discover clusters with non-convex
shapes
K-means Example (cont.)
The results of the k-means method depend strongly on the initial
guesses of the seeds.
The k-means method can be sensitive to outliers. If an outlier is
picked as a starting seed, it may end up in a cluster of its own.
Also if an outlier moves from one cluster to another during
iterations, it can have a major impact on the clusters because the
means of the two clusters are likely to change significantly.
Although some local optimum solutions discovered by the K-means method are satisfactory, often the local optimum is not as good as the global optimum.
The K-means method does not consider the size of the clusters.
Some clusters may be large and some very small.
The K-means does not deal with overlapping clusters.
Nearest Neighbor Algorithm
An algorithm similar to the single link technique
is called the nearest neighbor algorithm.
With this serial algorithm, items are iteratively merged into the existing cluster that is closest.
In this algorithm a threshold t is used to determine whether an item will be added to an existing cluster or a new cluster is created.
Nearest Neighbor Algorithm
Algorithm for Nearest Neighbor clustering
Input:
D = {t1, t2, ..., tn} // Set of elements
A // Adjacency matrix showing distance between elements
t // Threshold
Output: K // Set of clusters
1. K1 = {t1};
2. K = {K1};
3. k = 1;
4. for i = 2 to n do
   1. find the tm in some cluster Km such that dis(ti, tm) is the smallest;
   2. if dis(ti, tm) ≤ t then
      1. Km = Km ∪ {ti}
   3. else
      1. k = k + 1;
      2. Kk = {ti};
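The pseudocode above can be sketched in plain Python. Here `dist` is any distance function (the adjacency matrix A is just a precomputed form of it), and the demo data and threshold are hypothetical:

```python
def nearest_neighbor_clustering(items, dist, t):
    clusters = [[items[0]]]                  # K1 = {t1}
    for item in items[1:]:                   # for i = 2 to n
        # find the already-placed item nearest to the new item
        best_d, best_c = min(((dist(item, p), c)
                              for c in clusters for p in c),
                             key=lambda x: x[0])
        if best_d <= t:
            best_c.append(item)              # Km = Km ∪ {ti}
        else:
            clusters.append([item])          # new cluster Kk = {ti}
    return clusters

# Hypothetical 1-D demo: with threshold 2, the points split into two groups.
pts = [0.0, 1.0, 1.5, 9.0, 10.0]
print(nearest_neighbor_clustering(pts, lambda a, b: abs(a - b), 2))
# → [[0.0, 1.0, 1.5], [9.0, 10.0]]
```

Note the algorithm is order-dependent: items are processed in the given sequence, so a different input order can produce different clusters.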
Algorithm-Nearest Neighbor
Derive a similarity matrix from the items
in the dataset.
This matrix, referred to as the distance matrix,
will hold the similarity values for each and
every item in the data set. (These values are
elaborated in detail in the next example.)
With the matrix in place, compare each
item in the dataset to every other item
and compute the similarity value.
Algorithm-Nearest Neighbor
Using the distance matrix, examine every
item to see whether the distance to its
neighbors is less than a value that you have
defined.
This value is called the threshold.
The algorithm puts each element in a
separate cluster, analyzes the items, and
decides which items are similar, and adds
similar items to the same cluster.
The algorithm stops when all items have
been examined.
Example: a dataset of eight geographical locations where individuals live, collected at a specific point in time.

Individual X Y
1          2 10
2          2 5
3          8 4
4          5 8
5          7 5
6          6 4
7          1 2
8          4 9

One of the resulting clusters: C2 = {Individual 2, Individual 7}
Source: https://www.dummies.com/programming/big-data/data-science/how-to-cluster-by-nearest-neighbors-in-predictive-analysis/
Nearest Neighbor Algorithm Example
Table : Distance among A, B, C, D, E data
Item A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Nearest Neighbor Algorithm Example
A is placed in a cluster by itself
K1={A}
Nearest Neighbor Algorithm Example
Consider B: should it be added to K1 or form a new cluster?
Dist(A, B) = 1, which is less than the threshold value t = 2.
So K1 = {A, B}
Nearest Neighbor Algorithm Example
For C we calculate the distance from both A and B:
Dist(AB, C) = min{Dist(A, C), Dist(B, C)} = min{2, 2} = 2
This equals the threshold, so K1 = {A, B, C}
Nearest Neighbor Algorithm Example
Dist(ABC, D)= min{Dist(A, D), Dist(B, D),Dist(C, D)}
=min{2,4,1}
=1
So K1={A, B, C, D}
Nearest Neighbor Algorithm Example
Dist(ABCD, E) = min{Dist(A, E), Dist(B, E), Dist(C, E), Dist(D, E)}
= min{3, 3, 5, 3}
= 3, which is greater than the threshold value t = 2.
So K1 = {A, B, C, D}
and K2 = {E}
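The walk-through above can be checked with a short self-contained sketch (plain Python; the distance matrix and threshold t = 2 are taken from the example):

```python
# Distance matrix among A, B, C, D, E from the table above.
names = ["A", "B", "C", "D", "E"]
D = {
    "A": {"A": 0, "B": 1, "C": 2, "D": 2, "E": 3},
    "B": {"A": 1, "B": 0, "C": 2, "D": 4, "E": 3},
    "C": {"A": 2, "B": 2, "C": 0, "D": 1, "E": 5},
    "D": {"A": 2, "B": 4, "C": 1, "D": 0, "E": 3},
    "E": {"A": 3, "B": 3, "C": 5, "D": 3, "E": 0},
}
t = 2

clusters = [["A"]]                       # A forms the first cluster
for item in names[1:]:
    # distance to the nearest already-placed item
    best_d, best_c = min(((D[item][p], c) for c in clusters for p in c),
                         key=lambda x: x[0])
    if best_d <= t:
        best_c.append(item)              # within threshold: merge
    else:
        clusters.append([item])          # beyond threshold: new cluster

print(clusters)                          # → [['A', 'B', 'C', 'D'], ['E']]
```

This reproduces the result above: B, C, and D each fall within the threshold of an existing member, while E (minimum distance 3 > 2) starts its own cluster.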
Thank you