
UCCD2063

Artificial Intelligence Techniques

Unit 08:
Unsupervised Learning:
Clustering

Outline
• Unsupervised Learning - Clustering
• Anomaly Detection

What is Clustering?
▪ Clustering: the process of grouping data samples that are similar in some way into classes of similar objects
▪ Clustering is a form of unsupervised learning – class labels are not known in advance (i.e. y is not provided)
▪ It is a method of data exploration – a way of looking for interesting patterns or structure in the data

Popular Clustering Algorithms

▪ K-means clustering – tries to separate samples into k groups of equal variance.

▪ Agglomerative clustering – a bottom-up approach in which each observation starts in its own cluster, and clusters are successively merged together.

▪ DBSCAN – views clusters as areas of high density separated by areas of low density; clusters found by DBSCAN can be any shape.
K-means Clustering

1. Randomly initialize k (e.g. 3) cluster centers
2. Assign each data point to the closest cluster center (using some distance measure)
3. Re-compute each cluster center as the mean of the data points in its cluster
4. Repeat steps 2-3 and stop when there are no new re-assignments
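A minimal sketch of these steps using scikit-learn's KMeans (assuming scikit-learn is available; the synthetic data and k = 3 are illustrative, not part of the slides):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data with 3 natural groups; labels are discarded (unsupervised)
    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    km = KMeans(n_clusters=3, n_init=10, random_state=42)  # k = 3 cluster centers
    labels = km.fit_predict(X)   # steps 2-4: assign, re-compute means, repeat until stable
    print(km.cluster_centers_)   # the final re-computed cluster means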
K-means Clustering

Problems with k-means:
• The number of clusters k must be selected in advance, which can be difficult
• Starts with a random choice of cluster centers and may yield different clustering results on different runs (see the sketch below)
• Assumes clusters are convex-shaped, so it cannot deal with complex cluster shapes
K-means Clustering Problems

(a) Different starting cluster centers
(b) Incorrect number of clusters
(c) Non-convex clusters
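A small sketch of the initialization problem (assumptions: scikit-learn, a single initialization per run via n_init=1, and overlapping synthetic blobs); different seeds can converge to different local optima:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=5, cluster_std=2.5, random_state=7)

    # One random initialization per run: the final inertia (and hence the
    # resulting clustering) can differ from seed to seed.
    for seed in range(3):
        km = KMeans(n_clusters=5, n_init=1, random_state=seed).fit(X)
        print(f"seed={seed}: inertia={km.inertia_:.1f}")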
Agglomerative Clustering

▪ Each point is initialized as its own cluster
▪ Compute the linkage between clusters
▪ Merge the two clusters with the smallest linkage
▪ Repeat the process until the desired number of clusters is obtained
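A minimal sketch using scikit-learn's AgglomerativeClustering (the synthetic data and the ward linkage criterion are illustrative choices):

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

    # Each point starts as its own cluster; the pair of clusters with the
    # smallest linkage is merged repeatedly until n_clusters remain.
    agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
    labels = agg.fit_predict(X)
    print(labels[:10])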
DBSCAN
▪ DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
▪ Core point: a data point that has more than a specified number of points (MinPts) within a radius Eps around it
▪ Border point: a data point that has fewer than MinPts points within Eps, but is in the neighborhood of a core point
▪ Noise point: any point that is neither a core point nor a border point

(Hyperparameters: MinPts and Eps)
DBSCAN

▪ Randomly select a core point to start a new cluster
▪ Iteratively add all points (core and border) within the Eps distance to the cluster
▪ Stop when no more points are within the Eps neighborhood
▪ Repeat the procedure with an unvisited core point, and stop when there are no more unvisited core points
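A minimal sketch using scikit-learn's DBSCAN (the two-moons data and the Eps / MinPts values are illustrative):

    from sklearn.cluster import DBSCAN
    from sklearn.datasets import make_moons

    # Two interleaved half-circles: non-convex clusters that k-means cannot separate
    X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

    db = DBSCAN(eps=0.2, min_samples=5)  # eps is Eps, min_samples is MinPts
    labels = db.fit_predict(X)           # noise points receive the label -1
    print(set(labels))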
DBSCAN

▪ Strengths:
• Can handle noise (outliers) very well
• Can handle clusters of different shapes and sizes

▪ Weaknesses:
• Does not work well when dealing with clusters of varying densities
• Sensitive to the hyperparameters
• May not work well on high-dimensional data
Anomaly Detection

▪ Anomaly detection finds data points that do not fit well with the rest of the data. It has a wide range of applications such as fraud detection, surveillance, diagnosis, data cleanup, etc.
▪ Anomaly detection can be approached in many ways depending on the nature of the data (labeled or not labeled, ordered or unordered, ...)
▪ Today we will focus on anomaly detection on multivariate unordered data using clustering and the Mahalanobis distance.
Anomaly Detection with Clustering
▪ The underlying assumption is that if we cluster the data, normal data points will belong to large clusters while anomalies will not belong to any cluster or will belong to small clusters.
▪ A data point is considered an anomaly if its distance to the known large clusters is too large.
Problem with Euclidean Distance
▪ Euclidean distance fails as an anomaly measure here because it only captures the ordinary straight-line distance from the cluster center and ignores the shape of the cluster: a point at a given distance along a direction of high variance is far less unusual than a point at the same distance along a direction of low variance.
Mahalanobis Distance
▪ The solution is the Mahalanobis distance, which takes the direction of the variance into account in order to normalize the distance properly:

    D_M(x) = sqrt( (x − μ)ᵀ S⁻¹ (x − μ) )

  where μ is the mean of the cluster and S is the covariance matrix of the cluster
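A minimal NumPy sketch of this formula for a single elongated cluster (the synthetic mean and covariance are illustrative):

    import numpy as np

    rng = np.random.default_rng(42)
    # An elongated Gaussian cluster: the variance differs by direction
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.0], [1.0, 1.0]], size=300)

    mu = X.mean(axis=0)          # cluster mean
    S = np.cov(X, rowvar=False)  # covariance matrix S of the cluster
    S_inv = np.linalg.inv(S)

    def mahalanobis(x):
        d = x - mu
        return float(np.sqrt(d @ S_inv @ d))

    # Equal Euclidean distance from the center, different Mahalanobis distances:
    print(mahalanobis(np.array([2.0, 0.0])))  # along the high-variance direction
    print(mahalanobis(np.array([0.0, 2.0])))  # along the low-variance direction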
Multivariate Outlier Detection with Mahalanobis Distance
▪ To detect outliers, we must specify a distance threshold
▪ The threshold is set by multiplying the standard deviation (std) of the Mahalanobis distances by an extremeness degree k:

    thresh = k * std
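A minimal NumPy sketch of the thresholding rule (the extremeness degree k = 3 and the synthetic cluster are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[3.0, 1.0], [1.0, 1.0]], size=300)

    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    # Mahalanobis distance of every point to the cluster center
    dist = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))

    k = 3                    # extremeness degree (illustrative choice)
    thresh = k * dist.std()  # thresh = k * std
    outliers = X[dist > thresh]
    print(f"{len(outliers)} of {len(X)} points flagged as anomalies")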
Next:

Search Problems
Search Algorithms
