


Shrey Parth

Jan 27, 2020 · 7 min read


Understanding K-Means and Hierarchical Clustering in R



Clustering is an unsupervised learning technique: there is no outcome to be predicted; instead, the algorithm tries to find patterns in the data. It deals with identifying subgroups within the data such that the objects in one subgroup (cluster) are more similar to each other than to objects in another cluster.

How do we define similarity?


Similarity is a measure that reflects the strength of the relationship between two data objects. It is the mathematical representation of how similar or dissimilar the data points are to each other. Every algorithm follows a different set of rules for calculating similarity, and the right choice depends on the problem at hand.

Getting started
We will use the iris flower dataset for this tutorial. The dataset consists of 50 samples from each of three species of Iris (setosa, versicolor, virginica). Four features are measured from each sample: the length and the width of the sepals and petals, in centimeters.

Getting the required libraries
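The original code appeared as screenshots that didn't survive extraction, so the snippets in this article are minimal sketches. The exact library set is an assumption; ggplot2 and tidyr cover the plotting and reshaping done below:

```r
library(ggplot2)  # plotting (scatter plots, histograms)
library(tidyr)    # reshaping the data for visualization
```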

Loading the dataset
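The iris dataset ships with base R, so loading it is one line:

```r
data(iris)
```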

Let’s take a look at our data
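A few standard inspection calls (which ones the original used is an assumption):

```r
head(iris)     # first six rows: four numeric features plus Species
str(iris)      # column types and dimensions (150 rows, 5 columns)
summary(iris)  # per-column ranges and quartiles, species counts
```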


Exploratory Data Analysis

Restructuring the data for better visualization.
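One plausible restructuring, assuming tidyr: melt the four measurement columns into a long metric/value format so all of them can share a single plot:

```r
# One row per (flower, measurement) pair instead of one row per flower
iris_long <- pivot_longer(iris,
                          cols      = -Species,
                          names_to  = "Metric",
                          values_to = "Value")
head(iris_long)
```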


Creating a scatter plot of the length and width measurements for the three species of flowers in our data. The overall values of length are greater than those of width, and compared to versicolor and virginica, setosa has smaller petal length and width.
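A sketch of such a scatter plot (whether the original showed the petal measurements, the sepal measurements, or both is an assumption):

```r
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  labs(title = "Petal length vs. width by species")
```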


Visualizing the metrics on a histogram, it's fairly obvious that the lengths of both petal and sepal are larger than the widths.
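A sketch of this histogram, using the long-format data from above:

```r
# One panel per metric; the length panels sit visibly to the right of the widths
ggplot(iris_long, aes(x = Value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ Metric)
```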


Plotting the variable sepal length for all three species. We see that the sepal length of versicolor ranges from 2.0 to 3.4 and that of setosa from 3.0 to 4.3. The y-axis shows the count of samples of each species in the data.
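A per-species histogram along these lines (binning and overlay choices are mine):

```r
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  labs(y = "Count")
```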


K-Means Clustering
K-means clustering is an unsupervised machine learning algorithm. It groups similar data points together and discovers underlying patterns by identifying a fixed number (K) of clusters in the dataset. 'Means' refers to the averaging of the data, i.e., finding the centre of each cluster.

K is defined as the number of centroids we need in the dataset. A centroid is an imaginary or real location representing the centre of a cluster.

Process:
1) Start with a first group of randomly selected centroids, which are the beginning points.

2) Perform iterative (repetitive) calculations to optimize the positions of the centroids.

3) The process stops when the centroids are stabilized and the values don't change with further iterations, or when the defined number of iterations is reached.

The bigger the value of K, the lower the variance within the groups. If K equals the number of observations, then each point is its own group and the variance is 0, so it's necessary to find an optimum number of clusters. Variance within a group measures how different the members of the group are: a large variance shows that there's more dissimilarity within the groups.
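A sketch of running k-means with K = 3 on the iris measurements (the seed and nstart values are my choices; the 88.4% figure discussed below matches clustering on the four unscaled features):

```r
set.seed(42)                              # k-means starts from random centroids
km3 <- kmeans(iris[, 1:4], centers = 3,   # cluster on the numeric features only
              nstart = 20)                # keep the best of 20 random starts
km3                                       # prints cluster means and between_SS / total_SS
```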

The kmeans() function outputs the results of the clustering. The output includes the mean of each cluster and a percentage (here 88.4%, printed as between_SS / total_SS) that represents the compactness of the clustering, i.e., how similar the members within the same group are. If all the observations within a group were at the exact same point in the n-dimensional space, we would achieve 100% compactness.
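To check the clusters against the known species, cross-tabulate the two labelings:

```r
table(km3$cluster, iris$Species)
```

Note that which species ends up as cluster 1, 2, or 3 depends on the random initialization.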


As we can see, the data belonging to the setosa species were grouped into cluster 3, versicolor into cluster 2, and virginica into cluster 1. The algorithm wrongly assigned 2 data points belonging to versicolor to the virginica cluster and 14 data points belonging to virginica to the versicolor cluster.

Let's plot a chart showing the "within sum of squares" against the number of groups (the K value). The within sum of squares is a metric of the dissimilarity among members of a group: the greater the sum, the greater the dissimilarity.
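A common sketch of this elbow plot, computing the total within sum of squares for a range of K values:

```r
wss <- sapply(1:10, function(k) {
  kmeans(iris[, 1:4], centers = k, nstart = 20)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "Number of clusters (K)",
     ylab = "Total within sum of squares")
```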


We can see that going from K = 3 to K = 4 there's a decrease in the sum of squares, which means dissimilarity will decrease and compactness will increase if we take K = 4. So, let's choose K = 4 and run K-means again.
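Re-running with four centroids:

```r
set.seed(42)
km4 <- kmeans(iris[, 1:4], centers = 4, nstart = 20)
km4   # the printed between_SS / total_SS ratio rises to roughly 91.6%
```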


Using 3 groups (K = 3) we had 88.4% of well-grouped data. Using 4 groups (K = 4) that value rose to 91.6%, which is a good value for us.

Hierarchical Clustering
Hierarchical clustering is an alternative approach to clustering which builds a hierarchy of clusters and doesn't require us to specify the number of clusters beforehand. There are two types of hierarchical clustering:

a) Agglomerative - Each data point is initially considered a separate cluster, and at each iteration similar clusters merge until one cluster, or the desired number of clusters, is formed.

b) Divisive - The opposite of agglomerative clustering: all data points start in a single cluster and are then split apart until we get the desired number of clusters.

Process:
1) Compute the proximity/dissimilarity/distance matrix. This is the backbone of our clustering: a mathematical expression of how different or distant the data points are from each other.

2) There are many ways to calculate dissimilarity between clusters. These are the linkage methods:
a) MIN
b) MAX
c) Group Average
d) Ward's method

3) Let each data point be a cluster.

4) Merge the 2 closest clusters based on the distances from the distance matrix; as a result, the number of clusters goes down by 1.

5) Update the proximity/distance matrix and repeat step 4 until the desired number of clusters remains.

Let us see how well the hierarchical clustering algorithm performs on our dataset. We will use hclust() for this, which requires us to provide the data in the form of a distance matrix. We will create this by using dist().
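A minimal sketch (dist() computes Euclidean distances by default; the article doesn't name the first linkage method, so hclust()'s default, complete linkage, is an assumption):

```r
d  <- dist(iris[, 1:4])   # pairwise Euclidean distance matrix
hc <- hclust(d)           # agglomerative clustering, method = "complete"
```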


In hierarchical clustering, we organize the objects into a hierarchy shown as a tree-like diagram called a dendrogram. It indicates both the similarity between clusters and the order in which they were formed.
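Plotting the dendrogram (hiding the 150 individual labels for readability):

```r
plot(hc, labels = FALSE, hang = -1,
     main = "Cluster dendrogram (complete linkage)")
```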


We'll cut our dendrogram to get 3 clusters and check how it performs.
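Cutting the tree and comparing against the species labels:

```r
groups <- cutree(hc, k = 3)
table(groups, iris$Species)
```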

It looks like the algorithm successfully classified all the flowers of species setosa
into cluster 1, and virginica into cluster 2, but had trouble with versicolor.

Let us see if we can do better by using a different linkage method. This time, we will use the mean (average) linkage method.
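Refitting on the same distance matrix with average linkage:

```r
hc_avg <- hclust(d, method = "average")
plot(hc_avg, labels = FALSE, hang = -1,
     main = "Cluster dendrogram (average linkage)")
```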

Next, we'll cut the dendrogram in order to create the desired number of clusters. Since in this case we already know that there are three species, we will choose the number of clusters to be k = 3. We will use the cutree() function.
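Cutting the average-linkage tree the same way:

```r
groups_avg <- cutree(hc_avg, k = 3)
table(groups_avg, iris$Species)
```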


We can see that this time the algorithm did a little better, but it still has trouble classifying virginica properly.

This brings us to the end of this article. I hope you learned something about how to approach a clustering problem in a data science project.
