
Machine Learning

Tools and Techniques

Week 3: Unsupervised Learning


Unsupervised Learning

Clustering Taxonomy

Introduction to Cluster Analysis
K-Means Clustering

Hierarchical Clustering



Unsupervised Learning

• The goal is to find structure/patterns in data by exploring the relationships between data points in terms of their attributes/features.



Unsupervised Learning

• Clustering is an exploratory data analysis technique that can be used to
• Group data (build a taxonomy of things)
• Find homogeneous subgroups, i.e. data points within each cluster are as similar as possible to one another and as dissimilar as possible from those in other groups.



Unsupervised Learning

• Used mainly in
• Market Segmentation
• Social Network Analysis
• Image Compression and Segmentation
• Document Clustering



Clustering Taxonomy



Partitional Clustering

• Density Based - This category clusters objects based on a local density criterion: objects that are densely packed together form a cluster, and clusters are separated by subspaces of low density. Examples are DBSCAN and OPTICS.
• Model Based - The idea is to build a statistical model for each cluster and find the one that best fits the data. The user specifies the model in the form of parameters, allowing the model to change during the learning phase. Examples are COBWEB and AutoClass.
• Distance Based - These algorithms are generally easy to implement due to their simplicity and can be applied in numerous scenarios. Popular distance-based algorithms include K-means (contrasted with a density-based method in the sketch below).
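As a rough illustration of the difference, here is a minimal sketch contrasting a density-based algorithm (DBSCAN) with a distance-based one (K-means), assuming scikit-learn is available; the toy dataset and parameter values are arbitrary examples, not from the slides.

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

# Toy data: two interleaved half-moon shapes
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Density based: clusters are dense regions separated by low-density space
density_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Distance based: each point is assigned to the nearest of k centroids
distance_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(set(density_labels), set(distance_labels))

On data like this, DBSCAN typically recovers the two moon shapes, whereas K-means splits the points with a roughly linear boundary between its two centroids.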



K-Means

• The most popular and widely used cluster analysis algorithm; it partitions the dataset into k distinct (pre-defined), non-overlapping clusters.
• The K-means optimization objective is to minimize the sum of squared distances between each cluster centroid and the objects assigned to it, as written out below.
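In symbols (a standard formulation of this objective, not taken from the slides), with clusters C_1, …, C_k and centroids mu_1, …, mu_k:

J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2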



K-means

Input: Data points x1, x2, …, xn and the number of clusters k


Output: k clusters
procedure K-means
{
Randomly select k initial centroids, C1, C2, …, Ck
Repeat
Assign each point to the closest centroid to form a cluster
For i = 1 to k
Recalculate the mean of all the samples in cluster i
Replace Ci with this mean
End for
Until the convergence criterion is met (e.g. the centroids no longer change)
}
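A minimal NumPy sketch of this procedure (an illustrative implementation using "centroids stop moving" as the convergence criterion; the function and variable names are my own, not from the course):

import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly select k initial centroids from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace each centroid with the mean of the samples assigned to it
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Convergence criterion: centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids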
K-Means

Randomly select k initial centroids, C1, C2, …, Ck

Assign each point to the closest centroid to form a cluster
K-Means

For i = 1 to k
Recalculate the mean of all the samples in cluster i
Replace Ci with this mean
End for

The average/mean of the data points is assigned to the cluster centroid.
K-Means

• Convergence of K-means is strongly affected by the random initialization of the k cluster centroids.
• This can be corrected by
• computing the cost/distortion function, i.e. the sum of squared distances between the data points and their assigned centroids, for each run, and
• repeating the random initialization process multiple times and keeping the run with the lowest cost, i.e. the best initial clusters (see the sketch below).
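A minimal sketch of this multiple-restart strategy, reusing the k_means function sketched above (scikit-learn's KMeans offers the same behaviour through its n_init parameter):

import numpy as np

def k_means_best_of(X, k, n_init=10):
    best_labels, best_centroids, best_cost = None, None, np.inf
    for seed in range(n_init):
        labels, centroids = k_means(X, k, seed=seed)
        # Cost/distortion: sum of squared distances to the assigned centroids
        cost = ((X - centroids[labels]) ** 2).sum()
        if cost < best_cost:
            best_labels, best_centroids, best_cost = labels, centroids, cost
    return best_labels, best_centroids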



Initialization of k







Number of clusters k

If we know the context of the problem, e.g. we need to divide the dataset into new, returning, and continuing customers, then we can safely choose k = 3.



Distance Measures

Manhattan → (6 − 0) + (6 − 0) = 12

Euclidean → sqrt(6² + 6²) ≈ 8.49

The yellow, red, and blue lines all have the same Manhattan distance of 12.
The green line has a Euclidean distance of 8.49.
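A quick check of these two numbers in Python (purely illustrative):

import math

p, q = (0, 0), (6, 6)

manhattan = abs(q[0] - p[0]) + abs(q[1] - p[1])   # (6-0) + (6-0) = 12
euclidean = math.hypot(q[0] - p[0], q[1] - p[1])  # sqrt(6^2 + 6^2) ≈ 8.49

print(manhattan, round(euclidean, 2))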



Hierarchical Clustering
• Divisive - This is a top-down approach that begins with one root cluster containing all the data points. The root is then recursively examined to see whether it can be split further based on some dissimilarity distance. This process is repeated until singleton clusters are obtained.
• Agglomerative - This is a bottom-up approach where all data points start as individual clusters at the bottom of a binary tree. Their pairwise distances are recorded in a dissimilarity matrix and the closest pair of clusters is merged. The dissimilarity matrix is then updated and the process is repeated, merging the least dissimilar pairs bottom-up until one cluster remains that contains all the data points (see the sketch below).
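A minimal sketch of the agglomerative (bottom-up) procedure using SciPy, assuming scipy is installed; the toy data and the cut at 3 clusters are arbitrary choices, not from the slides:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)  # toy data

# Bottom-up: repeatedly merge the two closest clusters until one remains
Z = linkage(X, method='average')

# Cut the resulting tree to obtain a flat clustering with 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)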



Distance Measures

• Single Linkage Clustering (SLC) – the distance between 2 clusters is defined as the shortest distance between a point in one cluster and a point in the other.

• Complete Linkage Clustering (CLC) – the distance between 2 clusters is defined as the furthest distance between a point in one cluster and a point in the other.
Distance Measures

• Average Linkage Clustering (ALC) – the distance between 2 clusters is defined as the average distance between all pairs of points, one from each cluster.
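These three criteria correspond directly to the method argument of SciPy's linkage function (a sketch under the same scipy assumption as above):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.random.rand(10, 2)
D = pdist(X)  # condensed matrix of pairwise distances

Z_single   = linkage(D, method='single')    # SLC: shortest distance between clusters
Z_complete = linkage(D, method='complete')  # CLC: furthest distance between clusters
Z_average  = linkage(D, method='average')   # ALC: average distance over all pairs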



Hierarchical Clustering



Challenges of Hierarchical Clustering

• If an object is assigned to the wrong cluster, it is very difficult to reassign it later.
• Merge/split decisions, once made, are difficult to undo.
• These methods tend to have a higher computational complexity and are therefore not suitable for large datasets.



Thank you!

• Any questions?



