Unsupervised Learning - Clustering

Unsupervised Learning - Clustering
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
Agenda - Clustering
Fundamentals of
unsupervised Case study1 Fundamentals Hierarchical
learning walkthrough of clustering clustering
Data pre-
Casestudy2 K-means processing for Clustering –
walkthrough clustering clustering hands on
2
What is unsupervisedlearning?
No defined Patterns in the data

dependent and are used to identify
independent
/ group similar
variables.
observations
3
Supervised vs unsupervisedlearning
Supervised learning Unsupervised learning
• Clearly defined X and Y • Unlabelled data

variables • Emerging patterns based on
• Predict a continuous similarity identified
response (Regression) • Clustering
• categorical response • Association rules (market
(classification) basket analysis)
6
What isclustering?
Heterogeneity
Grouping between groups
objects
Homogeneity
SSB > SSW
within groups
9
What is clustering?
Inter-cluster
Intra-cluster distances are
distances are maximized
minimized
This is classic case of Unsupervised learning
Why do we cluster?
Group records such that Clustering results are used:
• Similar to one another • As a stand-alone tool to get

within the same cluster insight into data distribution
• Dissimilar to the objects in • Visualization of clusters
other clusters may unveil important
information
• As a preprocessing step for
other algorithms
7
Cluster Analysis – Use cases
Image Market Market

Web Finance
processing Segmentation structure
analysis
• cluster images • Cluster groups • customers are • identifying • cluster analysis
based on their of users based segmented groups of can be used for
visual content on their access based on similar creating
patterns on demographic products balanced
webpages and transaction according to portfolios
• Cluster history competitive
webpages information, measures of
based on their and a similarity
content marketing
strategy is
tailored for
each segment
8
Clustering vsPCA PCA – grouping variables
that relate to each
other
Clustering – Segment
Clustering
variables according to the
distance between them.
Grouping of similar
rows
14
Role of clusteranalysis
Taxonomy Influence of cluster

Data Reduction
description on Y variable
• Classify • Exploratory • What is the
observations into • Confirmatory average sales
manageable from each
groups customer
segment?
• How does churn
% vary for each
customer
segment
15
The clustering task
Group observations so that the observations belonging

in the same group are similar, whereas observations in
different groups are dissimilar.
Observations to cluster - dimensions
Real-value attributes • salary, height
Binary attributes • gender (M/F), has_cancer(T/F)
Nominal (categorical)
• religion (Christian, Muslim, Buddhist, Hindu, etc.)
attributes
Ordinal/Ranked • military rank (soldier, sergeant, lutenant, captain,

attributes etc.)
12
Measuring similarity -Distances
• Eucledian distance
• Manhattan distance
• Chebyshev distance
13
Euclidean distance
Where x1 to xp are the independent variables of i and j
14
Manhattan distance (city –block distance)
• Distance between the projection of points on the axis.
15
Chebyshev Distance (chessboard distance)
Max ( |x1-x2|, |y1-y2|, |z1-z2|, ….)
Minkowski Distance
Types of clustering
Agglomerative
Hierarchical
Divisive
Clustering
Partitioning K-means
18
Clustering types
Agglomerative
Divisive approach Partitioning
clustering
• Bottom up approach • Top down approach • constructs ‘k’ partition

• start with each object • start with all of the of data
forming a separate objects in the same • Each partition will
group cluster represent a cluster
• It keeps on merging • a cluster is split up and k ≤ n
the objects or groups into smaller clusters
that are close to one
another
19
Hierarchical clustering
• Records are sequentially grouped to create clusters,
based on distances between records and distances
between clusters.
• Hierarchical clustering also produces a useful graphical
display of the clustering process and results, called a
dendrogram.
20
Strengths of Hierarchical Clustering
• No assumptions on the number of clusters
• Any desired number of clusters can be obtained by
‘cutting’
the dendrogram at the proper level
• Hierarchical clustering may correspond to meaningful
taxonomies
Disadvantages of Hierarchical clustering
• Time complexity: not suitable for larger data sets.
• Very sensitive to outliers
22
Hierarchical clustering -Steps
Start with n clusters Two closest records

where each record will are merged into a
be a cluster cluster
At every step, two closest

clusters with smallest
distance are merged Repeat till there is a
single cluster that
•Either single records are added to
existing clusters or
includes all the records
•Two existing clusters are
combined
23
Dendrograms
• A dendrogram is a treelike diagram that summarizes the
process of clustering
• On the x-axis are the records
• Similar records are joined by lines whose vertical length
reflects the distance between the records
• the greater the difference in height, the more dissimilarity
• By choosing a cutoff distance on the y-axis, a set of
clusters is created
24
Dendrograms
Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited 31
Distance between twoclusters
• Each cluster is a set of points
• How do we define distance between two sets of points
Hierarchical clustering – Distance betweenclusters
Linkage types
• Single linkage
• Complete linkage
• Average linkage
• Centroid linkage
• Ward’s Method
27
Single Linkage
• Distance between two clusters is defined as
the shortest distance between two points in each cluster.
28
Complete linkage
• Distance between two clusters is defined as
the longest distance between two points in each cluster.
29
Average linkage
• Distance between two clusters
is defined as the average
distance between each point in
one cluster to every point in the
other cluster.
30
Centroid linkage
• Based on centroid distance. clusters
are represented by their mean
values for each variable, which forms
a vector of means.
• Distance between 2 clusters is
distance between the 2 vectors
31
Ward’s linkage
• Similar to group average and centroid distance
• joins records and clusters together progressively to produce larger and larger
clusters, but operates slightly differently from the general approach.
32
Hierarchical Clustering:Comparison
5
1 4 1
3
2 5
5 5
2 1 2
MIN MAX
2 3 6 3 6
3
1
4 4
4
5
1
2
5
2
3 6 Group Average
3
4 1
4
K-MEANS CLUSTERING
K-means Clustering
• A non-hierarchical approach to forming good clusters is to pre-
specify a desired number of clusters, k
• Assign each record to one of the k clusters, according to their
distance from each cluster
• So as to minimize a measure of dispersion within the clusters
• The ‘means’ in the K-means refers to averaging of the data;
that is, finding the centroid
• K-means clustering is widely used in large dataset
applications Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited
35
How does k-means clustering work?
Derive # of clusters
Specify value of K
using scree plot
Partition dataset to K
initial clusters
Assign each record to

cluster with nearest
centroid
Yes Recalculate centroids

for losing and receiving
clusters
Do reassignments occur?
No
Finalise clusters
36
Scaling –Z scaling & Min-max scaling
ZScaling Min Max scaling
• Features will be rescaled • The data is scaled to a fixed

• Have the properties of a range - 0 to 1.
standard normal distribution • The cost of having this
• μ=0 and σ=1 bounded range - smaller
standard deviations, which
can suppress the effect of
outliers
37
Where is scalingused?
k-nearest
k-means
neighbors
Perceptron , principal
Neural componen
networks t analysis
38
Validating Clusters
• The resulting clusters should be valid to generate insights
• Cluster interpretability
• Cluster stability
• Cluster separation
• Number of clusters
39
Questions?
40

Unsupervised Learning - Clustering

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Unsupervised Learning - Clustering

Uploaded by

Copyright:

Available Formats

Unsupervised Learning - Clustering

No defined Patterns in the data

Supervised learning Unsupervised learning

• Clearly defined X and Y • Unlabelled data

This is classic case of Unsupervised learning

Group records such that Clustering results are used:

• Similar to one another • As a stand-alone tool to get

Image Market Market

Taxonomy Influence of cluster

Group observations so that the observations belonging

Real-value attributes • salary, height

Binary attributes • gender (M/F), has_cancer(T/F)

Ordinal/Ranked • military rank (soldier, sergeant, lutenant, captain,

Where x1 to xp are the independent variables of i and j

• Bottom up approach • Top down approach • constructs ‘k’ partition

Start with n clusters Two closest records

At every step, two closest

Assign each record to

Yes Recalculate centroids

• Features will be rescaled • The data is scaled to a fixed

You might also like