
CLUSTERING ANALYSIS

Problem 1: Clustering
The dataset given describes the Health and economic conditions in different States of a
country. Group the States based on how similar their situations are, so that these groups
can be provided to the government and appropriate measures can be taken to improve their
Health and Economic conditions.
1.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA, etc.)
1.2. Do you think scaling is necessary for clustering in this case? Justify
1.3. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters
using Dendrogram and briefly describe them.
1.4. Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and find the silhouette score.
1.5. Describe cluster profiles for the clusters defined. Recommend different priority-based
actions that need to be taken for different clusters on the basis of their vulnerability
situations according to their Economic and Health Conditions.

Data Dictionary for State_wise_Health_income:


1. States: names of States
2. Health_indeces1: A composite index rolls several related measures (indicators) into
a single score that provides a summary of how the health system is performing in the
State.
3. Health_indeces2: A composite index rolls several related measures (indicators) into a
single score that provides a summary of how the health system is performing in
certain areas of the States.
4. Per_capita_income: Per capita income (PCI) measures the average income earned per
person in a given area (city, region, country, etc.) in a specified year. It is calculated by
dividing the area's total income by its total population.
5. GDP: GDP provides an economic snapshot of a country/state, used to estimate the
size of an economy and growth rate.
1.1 Read the data and do exploratory data analysis. Describe the data briefly.
Importing all the necessary libraries:

Reading the data:


• The shape of the data is (297, 6)
• The info of the data shows that most of the columns have an integer data type
• No null or missing values in the data
• No duplicate rows are present
• We have a total of 5 variables, excluding the Unnamed column
• The total number of States is 269
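
The checks listed above can be reproduced with a few pandas calls. The sketch below is only illustrative: it builds a small synthetic frame with the column names from the data dictionary (the values are made up, and the real notebook would read the actual file instead), and the helper name `summarize` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset; values are invented for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "States": [f"State_{i}" for i in range(8)],
    "Health_indeces1": rng.integers(0, 1000, size=8),
    "Health_indeces2": rng.integers(0, 2000, size=8),
    "Per_capita_income": rng.integers(500, 8000, size=8),
    "GDP": rng.integers(20, 3000, size=8),
})


def summarize(frame):
    """Collect the basic EDA checks (hypothetical helper, not from the report)."""
    return {
        "shape": frame.shape,
        "dtypes": frame.dtypes.astype(str).to_dict(),
        "null_count": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "unique_states": int(frame["States"].nunique()),
    }


summary = summarize(df)
print(summary)
```

On the real data, `df.info()` and `df.describe()` would give the same information in more detail.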

Exploratory Data Analysis: Univariate / Bivariate analysis


Univariate analysis helps us understand the distribution of each variable in the dataset.
With it we can find patterns, summarize the data, and build the understanding of the
data that we need to solve our business problem.
Observation

• The box plot for Health indices 1 shows two outliers; the data ranges from 0 to 1000
and is positively skewed (skewness 0.715371)
• The box plot for Health indices 2 shows no outliers; the data ranges from 0 to 2000
and is negatively skewed (skewness -0.173803)
• The box plot for per capita income shows one outlier; the data ranges from 5 to 8000
and is positively skewed (skewness 0.823113)
• The box plot for GDP shows no outliers, and the data is positively skewed
(skewness 0.829665)
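
As a rough sketch of how the skewness values and box plot outliers above can be obtained, the snippet below uses `Series.skew()` and the usual 1.5 × IQR box plot rule. The numbers in `values` are a toy series, not the report's data, and `outlier_count` is a hypothetical helper.

```python
import pandas as pd


def outlier_count(series):
    """Count points outside the 1.5 * IQR whiskers, as a box plot would flag them."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())


# Toy values with one large point that stretches the right tail.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])

skew = values.skew()          # positive value means a longer right tail
n_out = outlier_count(values)
print(f"skewness={skew:.3f}, outliers={n_out}")
```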
Pair plot
Per capita income shows a weaker correlation with Health indices 1 and GDP than the
strong correlation that GDP has with Health indices 1 and Health indices 2.
1.2. Do you think scaling is necessary for clustering in this case? Justify
Yes, scaling is necessary. Clustering works on distance-based computations, so if the data
is left unscaled, the variables with larger values dominate the distances.
The variables here are on different scales: the Health indices, Per capita income and GDP
are composites of different values, and the ones with larger ranges would get more weightage.
Scaling brings all the values into a comparable range.
I have used the standard scaler for scaling. Below is the snapshot of scaled data.
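
A minimal sketch of the scaling step, assuming scikit-learn's `StandardScaler` is the standard scaler meant here. The feature values are invented for illustration; only the column names come from the data dictionary.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric features on very different scales (values are made up).
features = pd.DataFrame({
    "Health_indeces1": [120, 300, 450, 800, 950],
    "Health_indeces2": [200, 900, 1100, 1500, 1900],
    "Per_capita_income": [600, 2100, 3500, 5200, 7800],
    "GDP": [30, 400, 900, 1800, 2900],
})

# StandardScaler subtracts the column mean and divides by the column
# standard deviation, so every column ends up with mean 0 and std 1.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
print(scaled.round(3))
```

After this step each variable contributes to the distance computation on an equal footing.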

1.3. Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them

Hierarchical clustering – Ward's method & average method


Ward's method is applied first to the scaled data.
For visualization purposes I have used a dendrogram.
The above dendrogram shows how all the data points are merged into clusters by
Ward's method.
To find the optimal number of clusters with which we can solve our business objective, we
truncate the dendrogram using truncate_mode = 'lastp',
where we give p = 10, a commonly used base value, so that only the last 10 merged
clusters are shown.

From the above we can see that the data points have been clustered into 3 sections.
The fcluster function forms the flat clusters. To map these 3 clusters to our dataset
we call fcluster with criterion = "maxclust".
Cluster Frequency
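
The hierarchical steps described in this section can be sketched with SciPy as follows. This is an illustration under stated assumptions rather than the report's actual notebook: `X` is a synthetic array of three well-separated groups standing in for the scaled data, and `no_plot=True` is used so that the dendrogram structure is computed without drawing it.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for the scaled data: three tight groups of 20 points.
rng = np.random.default_rng(42)
centers = np.array([[-5.0, -5.0], [0.0, 0.0], [5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Build the merge tree with Ward's method.
Z = linkage(X, method="ward")

# Truncate the dendrogram to the last p = 10 merged clusters.
# Drop no_plot=True (with matplotlib installed) to actually draw it.
tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)

# Cut the tree into at most 3 flat clusters and count members per cluster.
labels = fcluster(Z, t=3, criterion="maxclust")
sizes = np.bincount(labels)[1:]  # fcluster labels start at 1
print("cluster frequency:", sizes)
```

Passing `method="average"` to `linkage` gives the average-linkage variant with the rest of the code unchanged.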

Average method to the scaled data

Cluster profiling to understand the business problem.


The above dendrogram shows how all the data points are merged into clusters by the
average method.
To find the optimal number of clusters with which we can solve our business objective, we
again truncate the dendrogram using truncate_mode = 'lastp',
where we give p = 10, a commonly used base value.

Now we can see that all the data points have been clustered into 3 clusters. To map these
clusters to our dataset we call fcluster with criterion = "maxclust".

Observation
Both methods give almost the same cluster means, with only the minor variation that is
expected between linkage methods.
Based on the dendrograms and on further analysis of the dataset, I went with a 3-cluster
solution, which gives a pattern of Low / Medium / High.
1.4. Apply K-Means clustering on scaled data and determine optimum clusters.
Apply elbow curve and find the silhouette score.

For K-means clustering, we start with clusters = 3 and look at how the data points are
distributed across the clusters. We apply the K-means technique to the scaled data.

Plotting the Within-Cluster Sum of Squared Errors (WSS) for different values of k helps
choose the k at which the drop in WSS first starts to diminish. In the plot of WSS versus k,
this is visible as an elbow.
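
A small sketch of the elbow computation, assuming scikit-learn's `KMeans`, whose `inertia_` attribute is the WSS. The array `X` is synthetic (three separated groups) and only stands in for the scaled data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: three tight groups of 20 points.
rng = np.random.default_rng(42)
centers = np.array([[-5.0, -5.0], [0.0, 0.0], [5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Fit K-means for k = 1..6 and record the within-cluster sum of squares.
wss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss.append(km.inertia_)

# Drop in WSS when moving from k to k + 1; the elbow is where it flattens.
drops = -np.diff(wss)
for k, w in zip(range(1, 7), wss):
    print(f"k={k}: WSS={w:.2f}")
```

Plotting `wss` against `range(1, 7)` gives the elbow curve; on this toy data the drops become small after k = 3.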
The Silhouette value ranges between -1 and +1. A high value is desirable and
indicates that the point is well matched to its own cluster and far from the neighbouring
clusters. If many points have a negative Silhouette value, it may indicate that we have
created too many or too few clusters.

The Elbow Method is more of a decision rule, while the Silhouette score is a metric used for
validating a clustering. Thus, it can be used in combination with the Elbow Method.
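
To illustrate how the silhouette score can validate the choice of k, the sketch below compares k = 2 to 5 on synthetic data, assuming scikit-learn's `silhouette_score` and `silhouette_samples`. The data and the resulting scores are not the report's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the scaled data: three tight groups of 20 points.
rng = np.random.default_rng(42)
centers = np.array([[-5.0, -5.0], [0.0, 0.0], [5.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Average silhouette score for each candidate k (needs at least 2 clusters).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)

# Per-point silhouette widths for the chosen k; negative widths would
# flag points that sit closer to another cluster than to their own.
labels_best = KMeans(n_clusters=best_k, n_init=10, random_state=1).fit_predict(X)
min_width = silhouette_samples(X, labels_best).min()
print(scores, "best k:", best_k)
```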

3-group clusters via K-Means


Cluster 0 – Medium, Cluster 1 – Low, Cluster 2 – High
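
One way to arrive at Low / Medium / High labels like these is to average each variable per cluster and rank the clusters. The sketch below assumes that approach; the data is synthetic, the ranking by `Per_capita_income` is an illustrative choice, and the cluster numbers it produces will not necessarily match the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data with three economic tiers (all values are invented).
rng = np.random.default_rng(7)
tiers = [(1000, 100), (4000, 800), (7000, 2500)]  # (income, GDP) centres
parts = [
    pd.DataFrame({
        "Per_capita_income": rng.normal(income, 100, size=15),
        "GDP": rng.normal(gdp, 30, size=15),
    })
    for income, gdp in tiers
]
df = pd.concat(parts, ignore_index=True)

# Cluster on the scaled features, then attach the labels to the raw data.
scaled = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(scaled)

# Profile: mean of each variable per cluster, plus the cluster size.
profile = df.groupby("cluster").mean()
profile["size"] = df["cluster"].value_counts()

# Rank clusters by mean income and name them Low / Medium / High.
rank = profile["Per_capita_income"].rank().astype(int)
profile["level"] = rank.map({1: "Low", 2: "Medium", 3: "High"})
print(profile)
```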
1.5. Describe cluster profiles for the clusters defined. Recommend different
priority-based actions that need to be taken for different clusters on the basis
of their vulnerability situations according to their Economic and Health
Conditions

Based on the cluster profiles we interpret the following:

• States that fall in Cluster 0 – Medium have sufficient infrastructure, and their
quality of life is in line with living standards.

• States that fall in Cluster 1 – Low have very poor per capita income, which leads to
a very poor standard of living: individuals do not have many sources of
income, and that in turn affects gross domestic product.

• States that fall in Cluster 2 – High have a very robust economy, so all economic
indicators show an upward trend.

Considering the above results, there are several ways to increase GDP:

• Education and training. Greater education and job skills allow individuals to
produce more goods and services, start businesses and earn higher incomes. That
leads to a higher GDP.

• Good infrastructure. Without a functioning power system or good roads, a nation
has limited ability to make or ship goods, and businesses have limited ability to
provide services. Building a good infrastructure, including telecommunications,
makes it possible to massively expand the economy and increase per capita income.

• Restrict population. Lowering the population can increase the GDP per capita, but
forcing families to do so is a ruthless solution to the problem.
