Problem 1: Clustering
The dataset given is about the health and economic conditions in different states of a
country. The goal is to group the states based on how similar their situations are and to
provide these groups to the government, so that appropriate measures can be taken to improve
their health and economic conditions.
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA, etc.)
1.2. Do you think scaling is necessary for clustering in this case? Justify
1.3. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters
using Dendrogram and briefly describe them.
1.4. Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and find the silhouette score.
1.5. Describe cluster profiles for the clusters defined. Recommend different priority-based
actions that need to be taken for different clusters on the basis of their vulnerability
according to their economic and health conditions.
• The box plot for Health indices 1 shows two outliers; the data range from 0 to 1000
and are positively skewed (skewness 0.715371).
• The box plot for Health indices 2 shows no outliers; the data range from 0 to 2000
and are negatively skewed (skewness -0.173803).
• The box plot for per capita income shows one outlier; the data range from 5 to 8000
and are positively skewed (skewness 0.823113).
• The box plot for GDP shows no outliers; the data are positively skewed
(skewness 0.829665).
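The skewness values and outlier counts quoted above can be obtained along the following lines. This is only a sketch: the column names and the toy values are assumptions made for illustration, not the actual dataset, and the outlier check uses the usual 1.5 × IQR box-plot whisker rule.

```python
import pandas as pd

# Hypothetical toy values and column names; the real dataset is not reproduced here.
df = pd.DataFrame({
    "Health_indices1": [120, 340, 560, 410, 980, 230],
    "GDP": [800, 1500, 2100, 1700, 5200, 1100],
})

# Sample skewness per column: a value above 0 means a longer right tail.
skewness = df.skew()

# Box-plot rule: a point is an outlier if it lies beyond 1.5 * IQR from the quartiles.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(skewness)
print(outlier_counts)
```

A positive skewness together with outliers only above the upper whisker is the pattern described for Health indices 1 and per capita income.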
Pair plot
Per capita income shows a weaker correlation with Health indices 1 and GDP than the
strong correlation that GDP has with Health indices 1 and Health indices 2.
1.2. Do you think scaling is necessary for clustering in this case? Justify
Yes, scaling is necessary. Clustering relies on distance computations, so variables left
unscaled would contribute to the distances in proportion to their magnitude.
The variables here are on different scales. Health indices, per capita income and GDP are
composites of different values, and those with larger magnitudes would otherwise get more
weight. Scaling brings all the values into a comparable range.
I have used StandardScaler for scaling. Below is the snapshot of the scaled data.
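For reference, standard scaling replaces each value x in a column by z = (x - mean) / std, so every column ends up with mean 0 and standard deviation 1. A minimal sketch with scikit-learn's StandardScaler follows; the numbers are made up for illustration and the column order is an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows; assumed column order:
# Health indices 1, Health indices 2, per capita income, GDP.
X = np.array([
    [120.0,  450.0,  900.0,  1500.0],
    [480.0, 1200.0, 3100.0, 22000.0],
    [310.0,  800.0, 1700.0,  9000.0],
    [ 60.0,  150.0,  500.0,   400.0],
    [900.0, 1900.0, 7600.0, 61000.0],
])

# Fit on the data and transform it: each column becomes (x - mean) / std.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.round(3))
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds: ", X_scaled.std(axis=0).round(6))
```

After the transform, GDP no longer dominates the Euclidean distances simply because its raw values are the largest.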
Now we can see that all the data points have been grouped into 3 clusters. Next, to map these
clusters to our dataset, we can use fcluster with the criterion “maxclust”.
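The mapping step can be sketched with SciPy as shown here. The Ward linkage method and the synthetic, already scaled data are assumptions made for illustration; the report does not state which linkage was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the scaled data: three well-separated groups of 10 points.
rng = np.random.default_rng(0)
centers = np.array([[-3.0, -3.0], [0.0, 0.0], [3.0, 3.0]])
X_scaled = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centers])

# Build the merge tree that a dendrogram would display (Ward linkage is an assumption).
Z = linkage(X_scaled, method="ward")

# Cut the tree so that at most 3 flat clusters remain; labels start at 1.
labels = fcluster(Z, t=3, criterion="maxclust")

print(labels)
```

The resulting label array has one entry per row, so it can be attached to the original dataset as a new column.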
Observation
Both methods give almost the same result, with only the minor variation that is to be
expected.
Based on the dendrogram and on further analysis of the dataset, I went for a three-cluster
solution, which gives a pattern of Low/Medium/High.
1.4. Apply K-Means clustering on scaled data and determine optimum clusters.
Apply elbow curve and find the silhouette score.
For K-means clustering, we start by arbitrarily setting the number of clusters to 3 and look
at how the states are distributed across the clusters. We apply the K-means technique to the
scaled data.
The Elbow Method is more of a decision rule, while the silhouette score is a metric used to
validate a clustering. The two can therefore be used in combination.
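One way to combine the two is to compute the within-cluster sum of squares (for the elbow curve) and the silhouette score over a range of k. The sketch below uses scikit-learn on synthetic data; the data, the range of k and the random_state are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: three well-separated groups of 10 points.
rng = np.random.default_rng(0)
centers = np.array([[-3.0, -3.0], [0.0, 0.0], [3.0, 3.0]])
X_scaled = np.vstack([c + 0.3 * rng.standard_normal((10, 2)) for c in centers])

wss = {}  # within-cluster sum of squares per k, the values plotted in an elbow curve
sil = {}  # mean silhouette score per k
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_scaled)
    wss[k] = km.inertia_
    sil[k] = silhouette_score(X_scaled, km.labels_)

# The elbow is where wss stops dropping sharply; the silhouette score peaks at the best k.
best_k = max(sil, key=sil.get)
for k in wss:
    print(k, round(wss[k], 2), round(sil[k], 3))
print("k with the highest silhouette score:", best_k)
```

Plotting wss against k gives the elbow curve; the silhouette score then serves as a check on the k suggested by the elbow.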
• States that fall in Cluster 0 (Medium) have sufficient infrastructure, and their
quality of life is in line with living standards.
• States that fall in Cluster 1 (Low) have very poor per capita income, which leads to
a very poor standard of living because individuals do not have many sources of
income, and that in turn affects gross domestic product.
• States that fall in Cluster 2 (High) have a very robust economy, so all economic
indicators show an upward trend.
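Profiles like these are typically read off the per-cluster means of the original (unscaled) variables. A minimal sketch, assuming a hypothetical "cluster" column holding the K-means labels and made-up values:

```python
import pandas as pd

# Made-up values; the column names and the "cluster" label column are assumptions.
df = pd.DataFrame({
    "Per_capita_income": [900, 1100, 2500, 2700, 5200, 5600],
    "GDP": [3000, 4200, 15000, 17000, 60000, 66000],
    "cluster": [1, 1, 0, 0, 2, 2],
})

# Mean of each variable per cluster, plus the number of states in each cluster.
profile = df.groupby("cluster").mean()
profile["n_states"] = df.groupby("cluster").size()

# Rank the clusters by mean per capita income to attach Low/Medium/High tiers.
rank = profile["Per_capita_income"].rank().astype(int)
profile["tier"] = rank.map({1: "Low", 2: "Medium", 3: "High"})

print(profile)
```

With these toy values the tiers match the labelling used above: cluster 1 is Low, cluster 0 is Medium and cluster 2 is High.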
Considering the above results, there are several ways to increase GDP:
• Education and training. Greater education and job skills allow individuals to
produce more goods and services, start businesses and earn higher incomes. That
leads to a higher GDP.
• Restrict population. Lowering the population can increase the GDP per capita, but
forcing families to do so is a ruthless solution to the problem.