
UCCD2063

Artificial Intelligence Techniques

Unit 08:
Unsupervised Learning:
Clustering

Outline
• Unsupervised Learning - Clustering
• Anomaly Detection

What is Clustering?
▪ Clustering: the process of grouping data samples that are similar in some way into classes of similar objects
▪ Clustering is a form of unsupervised learning – class labels are not known in advance (i.e. y is not provided)
▪ It is a method of data exploration – a way of looking for interesting patterns or structure in the data

Popular Clustering Algorithms

▪ K-means clustering – tries to separate samples into k groups of equal variance.

▪ Agglomerative clustering – a bottom-up approach in which each observation starts in its own cluster, and clusters are successively merged together.

▪ DBSCAN – views clusters as areas of high density separated by areas of low density; clusters found by DBSCAN can be any shape.
K-means Clustering

1. Randomly initialize k (e.g. 3) cluster centers
2. Assign each data point to the closest cluster center (using some distance measure)
3. Re-compute each cluster center as the mean of the data points in its cluster
4. Repeat steps 2-3 and stop when there are no new re-assignments
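A minimal sketch of these steps using scikit-learn's KMeans (assuming scikit-learn is available; the synthetic data and k = 3 are illustrative, not part of the slides):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data with 3 natural groups; labels are discarded (unsupervised)
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    km = KMeans(n_clusters=3, n_init=10, random_state=42)  # k = 3 cluster centers
    labels = km.fit_predict(X)   # steps 2-4: assign, re-compute means, repeat until stable
    print(km.cluster_centers_)   # the final re-computed cluster means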
K-means Clustering

Problems with k-means:
• The number of clusters k must be selected in advance, which can be difficult
• Starts with a random choice of cluster centers and may yield different clustering results on different runs (see the sketch below)
• Assumes clusters are convex-shaped, so it cannot deal with complex cluster shapes
K-means Clustering Problems

(a) Different starting cluster centers
(b) Incorrect number of clusters
(c) Non-convex clusters
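A small sketch of the initialization problem (assumptions: scikit-learn, a single initialization per run via n_init=1, and overlapping synthetic blobs); different seeds can converge to different local optima:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.5, random_state=7)

    # One random initialization per run: the final inertia (and hence the
    # resulting clustering) can differ from seed to seed.
    for seed in range(3):
        km = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
        print(f"seed={seed}: inertia={km.inertia_:.1f}")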
Agglomerative Clustering

▪ Each point is initialized as its own cluster
▪ Compute the linkage between clusters
▪ Merge the two clusters with the smallest linkage
▪ Repeat the process until the desired number of clusters is obtained
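A minimal sketch using scikit-learn's AgglomerativeClustering (the synthetic data and the ward linkage criterion are illustrative choices):

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Each point starts as its own cluster; the pair of clusters with the
    # smallest linkage is merged repeatedly until n_clusters remain.
    agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
    labels = agg.fit_predict(X)
    print(labels[:10])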
DBSCAN
▪ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
▪ Core point: a data point that has more than a specified number of points (MinPts) within a radius Eps around it
▪ Border point: a data point that has fewer than MinPts points within Eps, but is in the neighborhood of a core point
▪ Noise point: any point that is neither a core point nor a border point

(Hyperparameters: MinPts and Eps)
DBSCAN

▪ Randomly select a core point to start a new cluster
▪ Iteratively add all points (core and border) within the Eps distance to the cluster
▪ Stop when no more points are within the Eps neighborhood
▪ Repeat the procedure with an unvisited core point, and stop when there are no more unvisited core points
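A minimal sketch using scikit-learn's DBSCAN (the two-moons data and the Eps / MinPts values are illustrative):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaved half-circles: non-convex clusters that k-means cannot separate
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

    db = DBSCAN(eps=0.2, min_samples=5)  # eps is Eps, min_samples is MinPts
    labels = db.fit_predict(X)           # noise points receive the label -1
    print(set(labels))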
DBSCAN

▪ Strengths:
• Can handle noise (outliers) very well
• Can handle clusters of different shapes and sizes

▪ Weaknesses:
• Does not work well when dealing with clusters of varying densities
• Sensitive to the hyperparameters
• May not work well on high-dimensional data
Anomaly Detection

▪ Anomaly detection finds data points that do not fit well with the rest of the data. It has a wide range of applications such as fraud detection, surveillance, diagnosis, data cleanup, etc.
▪ Anomaly detection can be approached in many ways depending on the nature of the data (labeled or not labeled, ordered or unordered, ...)
▪ Today we will focus on anomaly detection on multivariate unordered data using clustering and the Mahalanobis distance.
Anomaly Detection with Clustering
▪ The underlying assumption is that if we cluster the data, normal data points will belong to large clusters while anomalies will not belong to any cluster or will belong to small clusters.
▪ A data point is considered an anomaly if its distance to the known large clusters is too large.
Problem with Euclidean Distance
▪ Euclidean distance fails as an anomaly measure here because it only captures the ordinary straight-line distance from the cluster center and ignores the shape of the cluster: a point at a given distance along a direction of high variance is far less unusual than a point at the same distance along a direction of low variance.
Mahalanobis Distance
▪ The solution is the Mahalanobis distance, which takes the direction of the variance into account in order to normalize the distance properly:

    D_M(x) = sqrt( (x − μ)ᵀ S⁻¹ (x − μ) )

  where μ is the mean of the cluster and S is the covariance matrix of the cluster
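A minimal NumPy sketch of this formula for a single elongated cluster (the synthetic mean and covariance are illustrative):

    import numpy as np

    rng = np.random.default_rng(42)
    # An elongated Gaussian cluster: the variance differs by direction
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.0], [1.0, 1.0]], size=300)

    mu = X.mean(axis=0)          # cluster mean
    S = np.cov(X, rowvar=False)  # covariance matrix S of the cluster
    S_inv = np.linalg.inv(S)

    def mahalanobis(x):
        d = x - mu
        return float(np.sqrt(d @ S_inv @ d))

    # Equal Euclidean distance from the center, different Mahalanobis distances:
    print(mahalanobis(np.array([2.0, 0.0])))  # along the high-variance direction
    print(mahalanobis(np.array([0.0, 2.0])))  # along the low-variance direction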
Multivariate Outlier Detection with Mahalanobis Distance
▪ To detect outliers, we must specify a distance threshold
▪ The threshold is set by multiplying the standard deviation (std) of the Mahalanobis distances by an extremeness degree k:

    thresh = k * std
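A minimal NumPy sketch of the thresholding rule (the extremeness degree k = 3 and the synthetic cluster are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.0], [1.0, 1.0]], size=300)

    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # Mahalanobis distance of every point to the cluster center
    dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))

    k = 3                    # extremeness degree (illustrative choice)
    thresh = k * dist.std()  # thresh = k * std
    outliers = X[dist > thresh]
    print(f"{len(outliers)} of {len(X)} points flagged as anomalies")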
Next:

Search Problems
Search Algorithms
