Unit 08:
Unsupervised Learning:
Clustering
Clustering
Outline
• Unsupervised Learning - Clustering
• Anomaly Detection
What is Clustering?
▪ Clustering: the process of grouping data samples that are
similar in some way into classes of similar objects
▪ Clustering is a form of unsupervised learning – class labels are
not known in advance (i.e., y is not provided)
▪ It is a method of data exploration – a way of looking for
patterns or structure in the data that are of interest
Popular Clustering Algorithms
K-means Clustering
K-means Clustering Problems
Agglomerative Clustering
DBSCAN
▪ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
▪ Core point: a data point that has at least MinPts points within
a radius Eps around it (by the usual convention the point itself is counted)
▪ Border point: a data point that has fewer than MinPts points within Eps,
but lies in the Eps-neighborhood of a core point
▪ Noise point: any point that is neither a core point nor a border point
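The three point types above can be illustrated with a minimal sketch. It is not a full DBSCAN implementation (it does not assign cluster ids), and it assumes that a point counts as its own neighbor and that "core" means at least MinPts neighbors. The function name classify_points and the sample data are made up for illustration.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' (DBSCAN point types)."""
    # Eps-neighborhood of every point; a point is counted as its own neighbor.
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Assumed convention: core means at least min_pts points within eps.
    is_core = [len(n) >= min_pts for n in neighbors]
    labels = []
    for i in range(len(points)):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")  # not core, but near a core point
        else:
            labels.append("noise")   # neither core nor border
    return labels

# Hypothetical data: a small square, one point hanging off it, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.1, min_pts=3)
print(labels)
```

With these made-up values the four corners of the square come out as core points, (2, 1) as a border point, and (8, 8) as noise.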
▪ Strengths:
• Can handle noise (outliers) very well
• Can handle clusters of different shapes and sizes
▪ Weaknesses:
• Does not work well when dealing with clusters of
varying densities
• Sensitive to the hyperparameters (Eps and MinPts)
• May not work well on high-dimensional data
Anomaly Detection
Anomaly Detection with Clustering
▪ The underlying assumption is that if we cluster the data,
normal data will belong to clusters while anomalies will
not belong to any cluster or will belong to small clusters.
▪ A data point is considered an anomaly if its distance
to the nearest known large cluster exceeds a chosen threshold.
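The distance rule above can be sketched as follows. This is a minimal illustration, assuming each large cluster is summarized by its centroid and that plain Euclidean distance is used. The function name flag_anomalies, the centroids, and the cut-off max_dist are hypothetical values chosen for the example.

```python
import math

def flag_anomalies(points, centroids, max_dist):
    """Return True for each point farther than max_dist from every centroid."""
    return [
        min(math.dist(p, c) for c in centroids) > max_dist
        for p in points
    ]

# Hypothetical centers of two large clusters found by some clustering step.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, -0.2), (9.0, 10.5), (5.0, 5.0)]
flags = flag_anomalies(points, centroids, max_dist=2.0)
print(flags)
```

Here only (5, 5), which lies far from both centroids, is flagged as an anomaly.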
Problem with Euclidean Distance
▪ Euclidean distance can be a misleading distance measure here: it is
the ordinary straight-line distance, so it ignores how much each feature
varies and how the features are correlated.
Mahalanobis Distance
▪ A solution is the Mahalanobis distance, which takes the covariance of
the data (the direction and size of its variance) into account so that
distances are normalized properly.
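To make the contrast concrete, here is a minimal 2-D sketch of the Mahalanobis distance, sqrt((p - mean)^T S^-1 (p - mean)), where S is the sample covariance matrix. The helper names and the toy data are assumptions made for illustration. A real project would more likely rely on a numerical library than on a hand-written 2x2 inverse.

```python
import math

def mean_and_cov(data):
    """Sample mean and 2x2 sample covariance of a list of (x, y) pairs."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis(p, mean, cov):
    """Mahalanobis distance of a 2-D point p from mean under covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # The inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)

# Toy data stretched along the line y = x (strongly correlated features).
data = [(-2.0, -2.2), (-1.0, -0.8), (0.0, 0.1), (1.0, 0.9), (2.0, 2.0)]
mean, cov = mean_and_cov(data)

along = (1.5, 1.5)    # lies in the direction the data varies
across = (1.5, -1.5)  # same straight-line distance, against that direction
print(math.dist(along, mean), math.dist(across, mean))
print(mahalanobis(along, mean, cov), mahalanobis(across, mean, cov))
```

Both test points are equally far from the mean in Euclidean terms, but the Mahalanobis distance of the point lying across the data's main direction is much larger, which is the behavior wanted for outlier detection.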
Multivariate Outlier Detection With Mahalanobis Distance
▪ To detect outliers, we must specify a distance threshold.
▪ The distance threshold is set by multiplying the standard deviation
(std) of the Mahalanobis distances of the samples by the extremeness
degree k, such that:
thresh = k * std
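The rule thresh = k * std can be sketched as follows. The distances are invented values standing in for precomputed Mahalanobis distances, and the use of the population standard deviation (statistics.pstdev) is an assumption, since the slide does not say which estimator is meant.

```python
import statistics

def outlier_flags(distances, k):
    """Flag every distance above thresh = k * std of the distances."""
    # Assumption: population standard deviation; the source does not specify.
    thresh = k * statistics.pstdev(distances)
    return thresh, [d > thresh for d in distances]

# Hypothetical Mahalanobis distances of six samples from the data mean.
distances = [0.5, 0.8, 1.0, 1.2, 0.9, 6.0]
thresh, flags = outlier_flags(distances, k=2.0)
print(thresh, flags)
```

With k = 2 only the last sample exceeds the threshold. A larger k makes the detector stricter about what counts as extreme, a smaller k flags more points.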
Next:
Search