
Assignment questions on clustering and PCA

Question 1: Assignment Summary

The clustering and PCA assignment is about clustering countries on the basis of their socio-economic
status, so that we can identify the countries most in need of support and funds
from the HELP organisation.

The analysis steps are as follows:

● The data is first checked for missing values and outliers. No missing values are
found, and a few outliers are removed so that clustering can be carried out effectively.
● Principal Component Analysis (PCA) is performed on the data, and we find that reducing the
number of columns from 9 to 5 still preserves about 94% of the variance in the data. The number
of columns is therefore reduced to 5, with the data scaled using StandardScaler.
● Next, we perform K-means clustering on this dataset. Using the elbow curve and the
silhouette score, we identify that forming 3 clusters is optimal for further
analysis (a minimal sketch of this workflow follows the list).
● The countries are thereby clustered into rich, moderate and poor groups.
● Next, we perform hierarchical clustering on the dataset. The single linkage method does not
give satisfactory results, but with complete linkage hierarchical clustering
we are able to divide the countries into 3 clusters.
● The lists of the top 10 poorest countries obtained using K-means and hierarchical clustering
are found to be exactly the same.
● Based on these results, recommendations on how the funds should be
distributed are provided.
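
Below is a minimal sketch of this kind of workflow (scaling, PCA down to 5 components, then K-means with 3 clusters). It is illustrative rather than the exact assignment code; the file name "Country-data.csv" and the column selection are assumptions.

```python
# Minimal sketch of the workflow described above (not the exact assignment code).
# The file name "Country-data.csv" is an assumption for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("Country-data.csv")
features = df.select_dtypes("number")          # keep only the numeric indicators

# Scale so that every indicator contributes on a comparable magnitude
scaled = StandardScaler().fit_transform(features)

# Reduce the 9 indicator columns to 5 principal components
pca = PCA(n_components=5)
reduced = pca.fit_transform(scaled)
print("Variance retained:", pca.explained_variance_ratio_.sum())   # ~0.94 reported above

# Cluster the reduced data into the 3 groups chosen via elbow/silhouette analysis
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df["cluster"] = kmeans.fit_predict(reduced)
```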
Question 2: Clustering

a) Compare and contrast K-means Clustering and Hierarchical Clustering.

K-means and hierarchical clustering are clustering algorithms that help us create
subgroups of a larger dataset, such that the data points within each subgroup are similar
to each other.

In K-means clustering, the number of subgroups to be created from the dataset needs to
be decided before running the algorithm, whereas for hierarchical clustering this is not
required.

K-means clustering forms clusters faster than hierarchical clustering, since it
performs fewer calculations, and it is therefore more suitable when the number of data points is very high.
Hierarchical clustering is not suitable for clustering very large datasets.

In K-means clustering, the initially selected centroid points have an impact on the final
outcome of the clusters formed. We do not face this issue in hierarchical clustering.

b) Briefly explain the steps of the K-means clustering algorithm.

Below are the steps involved in the K-means clustering algorithm (a compact sketch follows the list):

● By analysing the data points, or based on business requirements, we first decide the
value of K, i.e. the number of clusters to be created.
● K data points are selected at random as the initial cluster centroids. For example,
if K is 4, then 4 points are randomly chosen as cluster centroids.
● For each data point, calculate the Euclidean distance to every centroid.
● Assign each data point to the centroid it is closest to; this forms the clusters.
● Calculate a new centroid for each cluster as the mean of the data points in that
cluster.
● These new centroids replace the previously selected random cluster
centroids.
● Repeat steps 3 to 6 until the centroids and the cluster assignments no longer
change.
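
The following NumPy sketch mirrors these steps. The function name `kmeans`, the fixed iteration cap and the stopping rule are illustrative choices, and the sketch assumes no cluster ever ends up empty.

```python
# Compact NumPy sketch of the steps listed above; illustrative, not production code.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3-4: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 5-6: recompute each centroid as the mean of its assigned points
        # (assumes no cluster ends up empty, which is fine for a sketch)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 7: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```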
c) How is the value of ‘k’ chosen in K-means clustering? Explain both the statistical as
well as the business aspect of it.

The value of k in K-means clustering defines how many clusters are formed from the
existing dataset: the algorithm takes the available data points and groups them into k clusters.
The value of k has to be given explicitly as input to the algorithm so that that many
clusters can be formed, so it is the first thing to decide before running the
algorithm.

Statistically, the clusters should be formed such that data points within the same cluster
are similar to each other, while data points in different clusters are different from each
other; measures such as the elbow curve and the silhouette score (used earlier in this
assignment) help identify a value of k that achieves this. We must also consider the business
aspect before forming the k clusters. Having too many or too few clusters is not preferred,
since it will not help us reach the final goal. Inputs from the business therefore also need
to be taken to decide the optimal value of k, and to understand which dimensions should be
prioritised so that the business goal is best met.
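
A sketch of the statistical side, assuming `X` is the scaled (or PCA-reduced) feature matrix: it computes the inertia for the elbow curve and the silhouette score across a range of candidate values of k.

```python
# Sketch of the statistical side of choosing k; X is assumed to be the
# scaled (or PCA-reduced) feature matrix from the earlier steps.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

k_values = range(2, 11)
inertias, silhouettes = [], []
for k in k_values:
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(model.inertia_)                         # within-cluster SSE (elbow)
    silhouettes.append(silhouette_score(X, model.labels_))  # cohesion vs. separation

plt.plot(k_values, inertias, marker="o")
plt.xlabel("k"); plt.ylabel("Inertia (elbow curve)")
plt.show()
# Choose the k where the elbow bends and the silhouette score stays high,
# then validate that number of clusters against the business requirement.
```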

d) Explain the necessity for scaling/standardisation before performing Clustering.

In this assignment the data is scaled before PCA and clustering, and both steps are sensitive
to scale. K-means relies on Euclidean distances, and PCA uses Singular Value Decomposition
(SVD) to obtain the principal components; SVD is likewise sensitive to scaling. If one
dimension of the data has a much higher magnitude than the others, it will have more impact
on the outcome, because it will be treated as the prominent dimension. By scaling the data,
all dimensions end up with a similar magnitude, and the variance in every dimension is
given equal prominence. Scaling is therefore necessary before performing clustering and PCA.
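
A small illustration of the point with made-up numbers: one feature with a large magnitude dominates Euclidean distances until the data is standardised. The units in the comment are hypothetical.

```python
# Made-up numbers to illustrate the point: the large-magnitude feature
# dominates Euclidean distance until the data is standardised.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1000.0, 1.0],    # e.g. income vs. child mortality (hypothetical units)
              [2000.0, 2.0],
              [1500.0, 9.0]])

# Unscaled: distances are driven almost entirely by the first column
print(np.linalg.norm(X[0] - X[2]), np.linalg.norm(X[1] - X[2]))

# Scaled: both columns now contribute comparable amounts
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[2]), np.linalg.norm(X_scaled[1] - X_scaled[2]))
```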

e) Explain the different linkages used in Hierarchical Clustering.

There are 3 types of linkage that can be used in hierarchical clustering (a short comparison sketch follows the list).

● Single linkage: the distance between 2 clusters is defined as the shortest distance
between any 2 points, one from each cluster.
● Complete linkage: the distance between 2 clusters is defined as the longest distance
between any 2 points, one from each cluster.
● Average linkage: the distance between 2 clusters is defined as the average distance
between every point in one cluster and every point in the other cluster.
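
A short SciPy sketch comparing the three linkages on the same data; `X` is assumed to be a scaled feature matrix, and cutting the tree into 3 clusters mirrors the assignment.

```python
# SciPy sketch comparing the three linkages; X is assumed to be a scaled feature matrix.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                      # build the merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")    # cut it into 3 clusters
    print(method, "cluster sizes:", [(labels == c).sum() for c in set(labels)])

# A dendrogram makes it easy to see where to cut the tree
dendrogram(linkage(X, method="complete"))
plt.show()
```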
Question 3: Principal Component Analysis

a) Give at least three applications of using PCA.

PCA is a very effective technique for reducing the dimensionality of data. Below are a few
applications of PCA (a small visualisation sketch follows the list).

● It can be used in image recognition. By performing PCA on image data, we can reduce
the number of dimensions, since neighbouring pixels are highly correlated and the
low-variance components can be dropped.
● PCA can be used to visualise multi-dimensional data on a 2-D plane by reducing the
number of dimensions to 2 while keeping as much of the variation in the information as
possible.
● PCA is also used in the banking and health sectors to analyse data effectively and to
draw the right inferences from it.
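As an illustration of the visualisation use case, the sketch below projects an assumed scaled feature matrix `X` onto the first two principal components and plots it.

```python
# Projecting an assumed scaled feature matrix X onto the first two principal
# components so that multi-dimensional data can be plotted on a 2-D plane.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1])
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.title(f"Variance shown: {pca.explained_variance_ratio_.sum():.0%}")
plt.show()
```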

b) Briefly discuss the 2 important building blocks of PCA - Basis transformation and
variance as information.

A basis is the set of units used to represent the vectors of a matrix. In PCA, by changing the
basis vectors we check whether the data can be represented in a new basis without losing much
of the original information.

Example: a length of 12 cm can also be represented as 4.72 inches. Here the basis is changed
from centimetres to inches without losing the original information.

In PCA, we change the basis to a new set of vectors (the principal components) such that the
maximum amount of variance in the data is captured by them. If the variance in the data is high,
the amount of information in the data is also high; if the variance is low, the amount of
information is also low. That is why information is measured as variance in the data.
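
A brief sketch of both ideas, assuming a scaled feature matrix `X`: the rows of `pca.components_` are the new basis vectors, and `explained_variance_ratio_` shows how much information (variance) each of them carries.

```python
# X is assumed to be a scaled feature matrix. The rows of pca.components_ are the
# new basis vectors; explained_variance_ratio_ is the "variance as information" part.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X)
print(pca.components_[0])                           # first new basis vector (PC1)
print(np.round(pca.explained_variance_ratio_, 3))   # variance carried by each component
print(np.cumsum(pca.explained_variance_ratio_))     # cumulative: keep enough components
                                                    # to cross a target such as ~0.95
```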

c) State at least three shortcomings of using Principal Component Analysis.

Shortcomings of PCA are:

● PCA is effective in reducing the number of dimensions of a dataset when the features
are highly correlated. If the correlations are weak, we lose a substantial
amount of information by reducing the number of columns.
● The data must be scaled and centred before performing PCA. The results can vary if
different dimensions have very different magnitudes.
● PCA is most suitable for linear data and is not recommended for non-linear datasets.
