
Clustering

Created: 2023-11-12 20:49


Lecture No.: 6

1. Clustering is also known as unsupervised learning.


2. Organizing data into classes such that there is:
1. High Intra-Class Similarity
2. Low Inter-Class Similarity
3. Each Clustering problem is based on some kind of distance between points
4. Distance is a measure of dissimilarity. The intuitions behind the distance-measure properties are:
1. Symmetry
2. Constancy of Self-Similarity
3. Positivity
4. Triangle Inequality
5. Euclidean Distance:
1. L2 Norm: Square root of the sum of the squared differences between x and y
2. L1 Norm: Sum of the absolute differences in each dimension (Manhattan distance)
6. Non-Euclidean Distance:
1. Jaccard Distance: 1 − (|Intersection| / |Union|) of the two sets
2. Cosine Distance: based on the angle between vectors P and Q, where cos θ = (P · Q) / (|P| |Q|) (a code sketch of these distance measures follows below)
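A minimal sketch of these distance measures in Python (NumPy for the vector math; the function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def euclidean_l2(x, y):
    # L2 norm: square root of the sum of squared differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_l1(x, y):
    # L1 norm: sum of absolute differences in each dimension
    return np.sum(np.abs(x - y))

def jaccard_distance(a, b):
    # 1 - |intersection| / |union| for two sets
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def cosine_distance(p, q):
    # 1 - cos(theta), where cos(theta) = (p . q) / (|p| |q|)
    cos_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return 1 - cos_sim

x, y = np.array([1.0, 2.0, 3.0]), np.array([4.0, 0.0, 3.0])
print(euclidean_l2(x, y), manhattan_l1(x, y))
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 0.5
print(cosine_distance(x, y))
```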

7. Types of Clustering:
1. Partitional Algorithms
1. Centroid Based Method
2. Most Common Algorithm: K-Means Clustering
3. Algorithm:
1. Decide the number of centroids, i.e., K
2. Initialize the K cluster centroids
3. Assign each point to the cluster whose centroid is closest to it
4. Re-calculate each centroid by averaging all the points assigned to it
5. If the centroids do not change between two successive iterations, the algorithm has converged (a code sketch of this loop appears after item 7)
2. Hierarchical Algorithms
1. Bottom Up or Hierarchical Agglomerative Clustering:
1. Start with each item in its own cluster and repeatedly find the best pair of clusters to merge into a new cluster
2. Doesn't require us to prespecify the number of clusters
3. Algorithm:
1. Compute all pairwise pattern-pattern similarity coefficients
2. Place each of the n patterns into a class of its own
3. Merge the two most similar clusters into one. Re-compute the inter-cluster similarity scores with respect to the new cluster.
4. Repeat the previous step until k clusters are left (k can be 1)
2. Top Down or Divisive:
1. Starting with all data in a single cluster, consider every possible way to divide
the cluster into two.
2. Cluster is split using a flat clustering algorithm
3. More Complex as a flat clustering algorithm is required as a subroutine.
4. More Efficient: Linear in the number of patterns & Clusters
5. More Accurate: Considers the global distribution, whereas bottom-up methods make merge decisions based only on local information
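A minimal NumPy sketch of the k-means loop described in item 7 (random initialization from the data points and a fixed iteration cap are assumptions, not part of the lecture):

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K initial centroids from the data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the average of its assigned points
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 5: stop when the centroids no longer change between iterations
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centers = k_means(data, k=2)
print(centers)
```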
8. Properties of a Clustering Algorithm:
1. Scalability
2. Ability to deal with different types of data
3. Minimal requirements for Domain Knowledge
4. Able to deal with noise and outliers
5. Insensitive to order of inputs
6. Incorporation of user specified constraints
7. Interpretability & Usability
9. Computing the Distance Matrix (inter-cluster distances used when merging clusters; see the code sketch below):
1. Min Distance (single linkage)
2. Max Distance (complete linkage)
3. Group Average (average linkage)
4. Ward's Method: Increase in squared error when two clusters are merged
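A hedged sketch of bottom-up (agglomerative) clustering with these inter-cluster distance options, using SciPy's hierarchical-clustering routines (the toy data and the choice of Ward's method are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data: two loose groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])

# 'single' = min distance, 'complete' = max distance,
# 'average' = group average, 'ward' = Ward's method
Z = linkage(X, method="ward")

# Cut the merge tree so that k = 2 clusters remain
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```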
10. k-Means Method: A Partitional Clustering Approach
1. Strength:
1. Efficient: O(tkn), where t = number of iterations, k = number of clusters, n = number of data objects
2. Often terminates at a local optimum
2. Weakness:
1. Applicable only when a mean is defined; a problem for categorical data
2. Need to prespecify the number of clusters
3. Unable to handle noisy data
4. Not suitable for clusters with non-convex shapes
11. Birch Algorithm:
1. Use an in-memory R-tree to store the points as they are clustered
2. Insert points one at a time into the tree, merging a new point into an existing cluster if it is closer than an allowable threshold
3. If there are more leaf nodes than fit in memory, merge existing clusters that are close
to each other
4. At the end of the first pass, we get a large number of clusters at the leaves of the R-tree (a usage sketch follows).
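For practical use, scikit-learn provides a Birch estimator (internally built on a clustering-feature tree rather than raw points); a minimal usage sketch with illustrative parameter values:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])

# threshold plays the role of the allowable merge distance described above;
# n_clusters controls an optional final global clustering of the leaf clusters
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(np.bincount(labels))
```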
12. Applications of Clustering:
1. Identification of Cancer Cells
2. Search Engines
3. Customer Segmentation
4. Biology: Different Species Classification
5. Land Use: GIS (identifying areas of similar land use)
