Unsupervised Learning - Clustering

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Agenda - Clustering

• Fundamentals of unsupervised learning
• Case study 1 walkthrough
• Fundamentals of clustering
• Hierarchical clustering
• Case study 2 walkthrough
• K-means clustering
• Data pre-processing for clustering
• Clustering – hands on

What is unsupervised learning?

• No defined dependent and independent variables.
• Patterns in the data are used to identify / group similar observations

Supervised vs unsupervised learning

Supervised learning
• Clearly defined X and Y variables
• Predict a continuous response (regression)
• Predict a categorical response (classification)

Unsupervised learning
• Unlabelled data
• Emerging patterns based on similarity identified
• Clustering
• Association rules (market basket analysis)

What is clustering?
• Grouping objects
• Heterogeneity between groups
• Homogeneity within groups
• SSB > SSW (the sum of squares between groups exceeds the sum of squares within groups)
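As a rough illustration of the SSB > SSW idea, the sketch below computes both quantities for two groups of one-dimensional values. The data and the function name `ssw_ssb` are made up for this example and are not part of the original material.

```python
from statistics import mean

def ssw_ssb(groups):
    """Return (SSW, SSB) for a list of groups of 1-D values."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # SSW: squared deviations of each value from its own group mean
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # SSB: squared deviations of each group mean from the grand mean,
    # weighted by the number of values in the group
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    return ssw, ssb

# Hypothetical data: two tight, well-separated groups
groups = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
ssw, ssb = ssw_ssb(groups)
print("SSW:", ssw, "SSB:", ssb)
```

For groups this well separated, SSB comes out far larger than SSW, which is the situation a good clustering aims for.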

What is clustering?
• Intra-cluster distances are minimized
• Inter-cluster distances are maximized

This is a classic case of unsupervised learning

Why do we cluster?

Group records such that they are:
• Similar to one another within the same cluster
• Dissimilar to the objects in other clusters

Clustering results are used:
• As a stand-alone tool to get insight into data distribution
• Visualization of clusters may unveil important information
• As a preprocessing step for other algorithms

Cluster Analysis – Use cases

Image processing
• Cluster images based on their visual content

Web
• Cluster groups of users based on their access patterns on webpages
• Cluster webpages based on their content

Market segmentation
• Customers are segmented based on demographic and transaction history information, and a marketing strategy is tailored for each segment

Market structure analysis
• Identifying groups of similar products according to competitive measures of similarity

Finance
• Cluster analysis can be used for creating balanced portfolios

Clustering vs PCA
• PCA – grouping variables that relate to each other
• Clustering – segmenting observations according to the distance between them; grouping of similar rows

Role of cluster analysis

Data Reduction
• Classify observations into manageable groups

Taxonomy description
• Exploratory
• Confirmatory

Influence of cluster on Y variable
• What is the average sales from each customer segment?
• How does churn % vary for each customer segment?
The clustering task

Group observations so that observations in the same group are similar, whereas observations in different groups are dissimilar.

Observations to cluster - dimensions

• Real-value attributes: salary, height
• Binary attributes: gender (M/F), has_cancer (T/F)
• Nominal (categorical) attributes: religion (Christian, Muslim, Buddhist, Hindu, etc.)
• Ordinal/Ranked attributes: military rank (soldier, sergeant, lieutenant, captain, etc.)

Measuring similarity - Distances
• Euclidean distance
• Manhattan distance
• Chebyshev distance

Euclidean distance

d(i, j) = sqrt( (xi1 - xj1)^2 + (xi2 - xj2)^2 + … + (xip - xjp)^2 )
where x1 to xp are the values of the p independent variables for records i and j

Manhattan distance (city-block distance)
• Sum of the absolute differences between the projections of the points on each axis: |x1-x2| + |y1-y2| + |z1-z2| + …

Chebyshev Distance (chessboard distance)
Max ( |x1-x2|, |y1-y2|, |z1-z2|, ….)

Minkowski Distance
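The Minkowski distance of order p between two records is the p-th root of the sum of the p-th powers of the absolute coordinate differences. The sketch below is a minimal illustration using made-up points and hypothetical helper names. It shows that p = 1 gives the Manhattan distance, p = 2 gives the Euclidean distance, and a large p approaches the Chebyshev distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical records with two attributes each
i, j = (1.0, 2.0), (4.0, 6.0)

print(manhattan(i, j), minkowski(i, j, 1))    # both give the p = 1 distance
print(euclidean(i, j), minkowski(i, j, 2))    # both give the p = 2 distance
print(chebyshev(i, j), minkowski(i, j, 50))   # a large p approaches the maximum difference
```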

Types of clustering

• Hierarchical clustering: Agglomerative, Divisive
• Partitioning clustering: K-means

Clustering types

Agglomerative clustering
• Bottom up approach
• Start with each object forming a separate group
• It keeps on merging the objects or groups that are close to one another

Divisive approach
• Top down approach
• Start with all of the objects in the same cluster
• A cluster is split up into smaller clusters

Partitioning
• Constructs ‘k’ partitions of data
• Each partition will represent a cluster and k ≤ n

Hierarchical clustering
• Records are sequentially grouped to create clusters,
based on distances between records and distances
between clusters.
• Hierarchical clustering also produces a useful graphical
display of the clustering process and results, called a
dendrogram.

Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
• Any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level
• Hierarchical clustering may correspond to meaningful
taxonomies

Disadvantages of Hierarchical clustering
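• Time complexity: the distances between all pairs of records are needed, so time and memory grow at least quadratically with the number of records; not suitable for larger data sets.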
• Very sensitive to outliers

Hierarchical clustering - Steps

• Start with n clusters, where each record will be a cluster
• Two closest records are merged into a cluster
• At every step, two closest clusters with smallest distance are merged:
  • Either single records are added to existing clusters or
  • Two existing clusters are combined
• Repeat till there is a single cluster that includes all the records
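The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation. It assumes Euclidean distance and single linkage (shortest distance between clusters), and the points and function names are made up for the example.

```python
from itertools import combinations

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_link(c1, c2, points):
    """Shortest distance between a record in c1 and a record in c2."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerate(points):
    """Merge the two closest clusters until one cluster remains.

    Returns the list of merges as (cluster_a, cluster_b, distance),
    where clusters are tuples of record indices.
    """
    clusters = [(i,) for i in range(len(points))]  # start with n clusters
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest distance
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(pair[0], pair[1], points))
        dist = single_link(a, b, points)
        # Replace the two clusters by their union
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, dist))
    return merges

# Hypothetical records with two attributes each
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0), (20.0, 0.0)]
merges = agglomerate(points)
for a, b, dist in merges:
    print(a, "+", b, "at distance", round(dist, 3))
```

The recorded merge distances are exactly the heights at which a dendrogram would join the corresponding branches.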

Dendrograms
• A dendrogram is a treelike diagram that summarizes the
process of clustering
• On the x-axis are the records
• Similar records are joined by lines whose vertical length
reflects the distance between the records
• The greater the difference in height, the greater the dissimilarity
• By choosing a cutoff distance on the y-axis, a set of
clusters is created
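To make the cutoff idea concrete, here is a small sketch that assumes single linkage and Euclidean distance. Under that assumption, cutting the dendrogram at a cutoff distance gives the same clusters as linking every pair of records that lie within the cutoff of each other and collecting the connected groups. The function name `cut_single_linkage` and the points are hypothetical.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cut_single_linkage(points, cutoff):
    """Clusters from cutting a single-linkage dendrogram at `cutoff`."""
    parent = list(range(len(points)))  # each record starts as its own group

    def find(i):
        # Follow parent links to the representative of i's group
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of records that is no farther apart than the cutoff
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if euclidean(points[i], points[j]) <= cutoff:
                parent[find(i)] = find(j)

    # Collect record indices by group representative
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical records with two attributes each
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 2.0), (20.0, 0.0)]
print(cut_single_linkage(points, cutoff=3.0))
print(cut_single_linkage(points, cutoff=6.0))
```

A lower cutoff yields more, smaller clusters; a higher cutoff yields fewer, larger ones.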
Distance between two clusters
• Each cluster is a set of points
• How do we define the distance between two sets of points?

Hierarchical clustering – Distance between clusters
Linkage types
• Single linkage
• Complete linkage
• Average linkage
• Centroid linkage
• Ward’s Method

Single Linkage
• Distance between two clusters is defined as the shortest distance between a point in one cluster and a point in the other cluster.

Complete linkage
• Distance between two clusters is defined as the longest distance between a point in one cluster and a point in the other cluster.

Average linkage
• Distance between two clusters is defined as the average of the distances from each point in one cluster to every point in the other cluster.

Centroid linkage
• Based on centroid distance. Clusters are represented by their mean values for each variable, which form a vector of means.
• Distance between 2 clusters is the distance between the 2 vectors of means

Ward’s linkage
• Similar to group average and centroid distance
• Joins records and clusters together progressively to produce larger and larger clusters, but at each step it merges the pair of clusters whose merger gives the smallest increase in the total within-cluster sum of squares
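The linkage types described above can be compared side by side on two small clusters. The sketch below is illustrative only. It assumes Euclidean distance, the clusters are made up, and the function names are hypothetical. For Ward's method it reports the increase in within-cluster sum of squares that a merge would cause, using the standard identity n1·n2/(n1+n2) times the squared centroid distance.

```python
from itertools import product

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(cluster):
    """Vector of per-variable means of a cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def single(c1, c2):
    """Shortest distance between a point in c1 and a point in c2."""
    return min(euclidean(a, b) for a, b in product(c1, c2))

def complete(c1, c2):
    """Longest distance between a point in c1 and a point in c2."""
    return max(euclidean(a, b) for a, b in product(c1, c2))

def average(c1, c2):
    """Average distance over all cross-cluster pairs of points."""
    return sum(euclidean(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))

def centroid_linkage(c1, c2):
    """Distance between the two vectors of means."""
    return euclidean(centroid(c1), centroid(c2))

def ward_increase(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 are merged."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * centroid_linkage(c1, c2) ** 2

# Hypothetical clusters of two records each
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (3.0, 2.0)]

for name, fn in [("single", single), ("complete", complete),
                 ("average", average), ("centroid", centroid_linkage),
                 ("ward increase", ward_increase)]:
    print(name, round(fn(c1, c2), 4))
```

Single linkage never exceeds average linkage, which never exceeds complete linkage, since they are the minimum, mean and maximum of the same set of pairwise distances.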

Hierarchical Clustering: Comparison
[Figure: the same points clustered with MIN, MAX and Group Average linkage]

K-MEANS CLUSTERING

K-means Clustering
• A non-hierarchical approach to forming good clusters is to pre-specify a desired number of clusters, k
• Assign each record to one of the k clusters according to its distance from each cluster centroid, so as to minimize a measure of dispersion within the clusters
• The ‘means’ in K-means refers to averaging of the data; that is, finding the centroid
• K-means clustering is widely used in large dataset applications
How does k-means clustering work?

• Specify value of K (derive # of clusters using scree plot)
• Partition dataset to K initial clusters
• Assign each record to cluster with nearest centroid
• Recalculate centroids for losing and receiving clusters
• Do reassignments occur?
  • Yes: go back to the assignment step
  • No: finalise clusters
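The loop described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation. The initial partition simply sends record i to cluster i mod k (an arbitrary choice that needs at least k records), all centroids are recomputed on every pass rather than only those of the losing and receiving clusters (the result is the same), and the data and names are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(rows):
    """Vector of per-variable means of a list of points."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def kmeans(points, k, max_iter=100):
    # Partition the dataset into k initial clusters (arbitrary: i mod k)
    labels = [i % k for i in range(len(points))]
    centroids = [centroid([p for p, lab in zip(points, labels) if lab == c])
                 for c in range(k)]
    for _ in range(max_iter):
        # Assign each record to the cluster with the nearest centroid
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:   # no reassignments occurred: finalise
            break
        labels = new_labels
        # Recalculate centroids; an empty cluster keeps its previous centroid
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = centroid(members)
    return labels, centroids

# Hypothetical records with two attributes each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Different initial partitions can lead to different final clusters, so in practice the procedure is often repeated from several starting points.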

Scaling – Z scaling & Min-max scaling

Z scaling
• Features are rescaled to have mean μ = 0 and standard deviation σ = 1, the mean and standard deviation of a standard normal distribution (the shape of the distribution itself is unchanged)

Min-max scaling
• The data is scaled to a fixed range, 0 to 1
• The cost of having this bounded range: smaller standard deviations, which can suppress the effect of outliers
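Both kinds of scaling are simple to write down. The sketch below is a minimal illustration using the standard library. It assumes a single numeric feature that is not constant, it uses the population standard deviation (a choice made for this example), and the salary values are made up.

```python
from statistics import mean, pstdev

def z_scale(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature with one large value
salary = [30000.0, 40000.0, 50000.0, 60000.0, 120000.0]

z = z_scale(salary)
mm = min_max_scale(salary)
print([round(v, 3) for v in z])
print([round(v, 3) for v in mm])
```

Scaling matters for distance-based methods because an unscaled feature with a large numeric range, such as salary, would otherwise dominate the distance calculation.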

Where is scaling used?

• k-means
• k-nearest neighbors
• Perceptron, neural networks
• Principal component analysis

Validating Clusters
• The resulting clusters should be valid to generate insights
• Cluster interpretability
• Cluster stability
• Cluster separation
• Number of clusters

Questions?