Distance Measure
A distance measure is a very important aspect of clustering. Knowing how close or how
far apart items are from one another helps in grouping them.
Jaccard Distance
The Jaccard index compares the elements of two sets to identify which members are
shared and which are not. The Jaccard distance is a measure of how different the two
given sets are.
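For instance, a minimal sketch in R (the sets a and b are made-up examples):
a <- c("apple", "banana", "cherry")
b <- c("banana", "cherry", "date", "fig")
# Jaccard distance = 1 - |intersection| / |union|
1 - length(intersect(a, b)) / length(union(a, b))   # 0.6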
Euclidean Distance
Euclidean distance is the shortest (straight-line) distance between two given points in
Euclidean space.
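As a quick sketch in R (the points p and q are hypothetical values):
p <- c(1, 2)
q <- c(4, 6)
sqrt(sum((p - q)^2))                        # 5
dist(rbind(p, q), method = "euclidean")     # same result via dist()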
Cosine Distance
The cosine distance between two given vectors u and v is one minus the cosine of the
angle between them, i.e., 1 minus their cosine similarity.
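A minimal sketch in R (u and v are hypothetical vectors):
u <- c(1, 0, 1)
v <- c(0, 1, 1)
cos_sim <- sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))   # cosine similarity
1 - cos_sim                                                 # cosine distance = 0.5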
Module Summary
Learnings at a high level:
A few prominent distance measures used in clustering data
Hierarchical Clustering
K Means
Hierarchical
1. Begin by assigning each item to its own cluster. If you have N items, you now have
N clusters, each containing a single item. Let the distances (similarities) between
the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single
cluster, so that you now have one cluster fewer.
3. Compute the distances (similarities) between the new cluster and each of the old
clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Source:
https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
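To make the procedure concrete, here is a minimal sketch in R using the built-in dist()
and hclust() functions on a small made-up data set; the merge order in the result
mirrors the repeated merge-and-update steps described above (the complete-linkage
choice is only an example):
x <- matrix(c(1, 1,
              1.5, 1,
              5, 5,
              5, 5.5,
              9, 9), ncol = 2, byrow = TRUE)   # five toy items
d <- dist(x)                          # pairwise distances between the items
hc <- hclust(d, method = "complete")  # agglomerative hierarchical clustering
hc$merge                              # which clusters were merged at each step
hc$height                             # distance at which each merge happened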
Dendrogram
A dendrogram is a branching diagram that represents the relationships of similarity
among a group of entities
Each branch is called a clade
The terminal end of each clade is called a leaf
There is no limit to the number of leaves in a clade
The arrangement of the clades tells us which leaves are most similar to each other
The height of the branch points indicates how similar or different they are from
each other
The greater the height, the greater the difference between the points
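As an illustrative sketch (reusing the hc object from the example above), the
dendrogram can be drawn directly in R; the leaves are the individual items and higher
branch points indicate greater dissimilarity:
plot(hc, main = "Dendrogram", xlab = "Items", sub = "")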
If data points are grouped incorrectly at the start, they cannot be reassigned to
another cluster later.
Using different similarity measures to compute the similarity between clusters can
lead to entirely different clusterings.
K Means Clustering
Common ways of choosing the number of clusters k include:
By rule of thumb
Elbow method (see the sketch after this list)
Information Criterion Approach
An Information Theoretic Approach
Choosing k using the Silhouette
Cross-validation
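The elbow method, for example, can be sketched in R as follows (using the iris petal
measurements from the snippet below as an assumed example): the total within-cluster
sum of squares is computed for a range of k values, and the "elbow" of the resulting
curve suggests a reasonable k.
set.seed(20)
wss <- sapply(1:10, function(k) kmeans(iris[, 3:4], centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")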
Code Snippet
K Means Clustering in R
library(datasets)
head(iris)
Visualizing the data
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Setting the seed and creating the cluster
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
Comparing the clusters with the species
table(irisCluster$cluster, iris$Species)
Plotting the dataset to view the clusters
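A plausible completion of this step (assuming ggplot2 as loaded above), coloring the
points by their assigned cluster:
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()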
Code Snippet
Hierarchical Clustering in R
library(datasets)
head(iris)
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
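The snippet above only loads and visualizes the data; a possible continuation on the
same petal measurements is sketched below (the average-linkage method and the cut into
3 clusters are illustrative choices, not prescribed by the original):
distMatrix <- dist(iris[, 3:4])                # pairwise Euclidean distances
hClusters <- hclust(distMatrix, method = "average")
plot(hClusters)                                # dendrogram of the 150 observations
clusterCut <- cutree(hClusters, 3)             # cut the tree into 3 clusters
table(clusterCut, iris$Species)                # compare clusters with the species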