
Decision Making

Submitted by: Ankita Mishra


Part 1: PCA:

1. Perform Exploratory Data Analysis [both univariate and multivariate analysis to be

performed]. The inferences drawn from this should be properly documented.

Univariate Analysis

We will start by examining the distribution of each variable in the data set using histograms:

From the histograms, we can see that most of the variables are roughly normally distributed. We can also
compute summary statistics for each variable using the describe method:
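A minimal sketch of both steps is given below (the file name hair_data.csv is only a placeholder, not the actual case-study file):

import pandas as pd
import matplotlib.pyplot as plt

# Load the data (the file name here is a placeholder)
df = pd.read_csv("hair_data.csv")

# Histograms for every numeric variable
df.hist(figsize=(12, 8), bins=20)
plt.tight_layout()
plt.show()

# Summary statistics for each variable
print(df.describe())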
Multivariate Analysis
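For the multivariate view, a correlation heatmap and a pair plot are a common starting point; a minimal sketch, assuming the same DataFrame df and that seaborn is available:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric variables
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()

# Pair plot to inspect pairwise relationships
sns.pairplot(df)
plt.show()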

2. Scale the variables and write the inference for using the type of scaling function for this

case study.
There are different types of scaling methods that can be used in PCA, including standardization,
normalization, and min-max scaling. The choice of scaling method depends on the nature of the data and
the objective of the analysis.

In this case study, we will use standardization to scale the variables. Standardization involves subtracting
the mean of each variable and dividing by its standard deviation. This method ensures that each variable
has a mean of zero and a standard deviation of one.

To apply standardization to our data set, we can use the StandardScaler class from the
sklearn.preprocessing module:
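A minimal sketch of this step, assuming the numeric features are in the DataFrame df from above:

from sklearn.preprocessing import StandardScaler

# Keep only the numeric columns before scaling
features = df.select_dtypes(include="number")

# Standardize each variable to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)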

3. Write the explicit form of the first PC (in terms of Eigen Vectors).

Given the covariance matrix and its computed eigenvectors, the first principal component (PC) can be
expressed as a linear combination of the original features, where the coefficients are given by the first
eigenvector of the covariance matrix.

The first eigenvector corresponds to the largest eigenvalue, which represents the direction of the
maximum variance in the data. To obtain the explicit form of the first PC in terms of the eigenvectors, we
can take the dot product of the original features with the first eigenvector. This can be expressed as:

PC1 = a[0]*X1 + a[1]*X2 + ... + a[n-1]*Xn

where a is the first eigenvector and X1, X2, ..., Xn are the original features.

In Python code, this can be computed as follows (assuming features holds the scaled data matrix and
eigenvectors are sorted by descending eigenvalue):

import numpy as np

# Extract the first eigenvector (the column associated with the largest eigenvalue)
first_eigenvector = eigenvectors[:, 0]

# Compute the first principal component scores
first_pc = np.dot(features, first_eigenvector)

4. Discuss the cumulative values of the eigenvalues. How does it help you to decide on the
optimum number of principal components? What do the eigenvectors indicate? Perform PCA and
export the data of the Principal Component scores into a data frame.

The eigenvalues obtained from the covariance matrix represent the amount of variance explained by
each principal component. The cumulative sum of the eigenvalues can be used to determine the total
amount of variance explained by the first k principal components.

To decide on the optimum number of principal components to retain, one can look at the cumulative sum
of the eigenvalues and choose a number of components that explain a high percentage of the total
variance in the data. A common threshold is to retain enough components to explain at least 80% or 90%
of the total variance.

The eigenvectors indicate the directions in feature space along which the data varies the most.
Specifically, each eigenvector represents a linear combination of the original features that maximizes the
variance in that direction. The magnitude of each element of an eigenvector (its loading) indicates how
strongly the corresponding original feature contributes to that direction.

To perform PCA and export the data of the principal component scores into a data frame, we can use the
following Python code:
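A sketch of such code is given below (scaled_features and df come from the earlier steps; the 80% threshold and the output file name are illustrative):

import numpy as np
import pandas as pd

# Covariance matrix of the scaled features and its eigen-decomposition
cov_matrix = np.cov(scaled_features, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

# Sort eigenvalues (and their eigenvectors) in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Cumulative proportion of variance explained
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Retain enough components to explain at least 80% of the total variance
num_components = int(np.argmax(cumulative >= 0.80)) + 1

# Project the data onto the retained principal components
scores = np.dot(scaled_features, eigenvectors[:, :num_components])
scores_df = pd.DataFrame(scores, columns=["PC" + str(i + 1) for i in range(num_components)])

# Concatenate with the original data and export to CSV (file name is a placeholder)
result = pd.concat([df.reset_index(drop=True), scores_df], axis=1)
result.to_csv("pca_component_scores.csv", index=False)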

In this code, we first compute the covariance matrix and its eigenvalues and eigenvectors using NumPy's
cov and linalg.eig functions. We then sort the eigenvalues in descending order and compute the
cumulative sum to determine the number of principal components to retain. We choose to retain enough
components to explain at least 80% of the total variance in the data.

We then project the original data onto the first num_components principal components using NumPy's
dot function and create a data frame of the component scores using pandas. Finally, we concatenate the
component scores with the original data and export the resulting data frame to a CSV file.

5. Mention the business implication of using the Principal Component Analysis for this case
study.

Principal Component Analysis (PCA) can have several business implications for the hair salon in this case
study.

Firstly, PCA can help identify the underlying factors that contribute to customer satisfaction, such as the
quality of service, the skill level of the stylists, or the convenience of the salon location. By reducing the
number of variables to a smaller set of principal components, PCA can make it easier for the salon to
identify the most important factors and focus on improving them to enhance customer satisfaction.

Secondly, PCA can help the salon optimize its marketing strategy by identifying the most important
factors that influence customer behaviour, such as the frequency of visits or the types of services
requested. By understanding these factors, the salon can tailor its marketing campaigns to target
customers with specific needs or preferences.

Thirdly, PCA can help the salon optimize its operations by identifying the most important factors that
contribute to efficiency and profitability, such as the utilization rate of stylists or the average revenue per
customer. By monitoring these factors, the salon can identify areas for improvement and implement
changes to increase efficiency and profitability.

Overall, PCA can help the hair salon in this case study gain a deeper understanding of its customers,
operations, and market, and make data-driven decisions to improve performance and competitiveness.

Part 2: Clustering:

The State_wise_Health_income.csv dataset given is about the Health and economic conditions in
different States of a country. Group the States based on how similar their situations are, so that these
groups can be provided to the government and appropriate measures can be taken to improve their
Health and Economic conditions.

2.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA, etc.)

2.2. Do you think scaling is necessary for clustering in this case? Justify

2.3. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.

2.4. Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow
curve and find the silhouette score.

2.5. Describe cluster profiles for the clusters defined. Recommend different priority-based actions
that need to be taken for different clusters on the basis of their vulnerability situations according
to their Economic and Health Conditions.
Data Dictionary for State_wise_Health_income Dataset:
1. States: names of States
2. Health_indeces1: A composite index that rolls several related measures (indicators) into a single
score that provides a summary of how the health system is performing in the State.
3. Health_indeces2: A composite index that rolls several related measures (indicators) into a single
score that provides a summary of how the health system is performing in certain areas of the
States.
4. Per_capita_income: Per capita income (PCI) measures the average income earned per person
in a given area (city, region, country, etc.) in a specified year. It is calculated by dividing the
area's total income by its total population.

2.1. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null
values, data types, shape, EDA, etc.)

Importing all the necessary libraries and reading the data, the following checks were performed:

• Null values check
• Duplicates in data
• Percentage of missing values

The shape of the data is (297, 6).
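A minimal sketch of these checks (pandas is assumed; the file name matches the dataset given in the problem statement):

import pandas as pd

# Read the dataset
df = pd.read_csv("State_wise_Health_income.csv")

# Shape and basic information
print(df.shape)
print(df.info())

# Null values, percentage of missing values, and duplicates
print(df.isnull().sum())
print(df.isnull().mean() * 100)
print(df.duplicated().sum())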


The info of the data shows that most of the data types are integer.
There are no null values in the data.
There are no missing values in the data.
There is no duplicate data present.

We have a total of 5 variables excluding the Unnamed column.

The total number of states is 269.

Exploratory Data Analysis: Univariate / Bivariate analysis

Univariate analysis helps us understand the distribution of the data in the dataset. With univariate
analysis we can find patterns, summarize the data, and build an understanding of the data to solve our
business problem.
Observation:

The Health indices 1 box plot shows two outliers; the data ranges from 0 to 1000 and is positively
skewed (skewness 0.715371).

The box plot for Health indices 2 shows no outliers; the data ranges from 0 to 2000 and is negatively
skewed (skewness -0.173803).

The box plot for per capita income shows the presence of an outlier; the data ranges from 5 to 8000
and is positively skewed (skewness 0.823113).

The box plot for GDP shows no outliers; the distribution of the data is positively skewed (skewness
0.829665).
Pair plot:
Per capita income shows a weaker correlation with Health indices 1 and GDP, compared to the strong
correlation that GDP has with Health indices 1 and Health indices 2.
2.2. Do you think scaling is necessary for clustering in this case? Justify

Yes, scaling is very important: since clustering works on distance-based computations, scaling is
necessary for unscaled data.
Scaling needs to be done because the values of the variables are on different scales. Health indices, Per
Capita income and GDP are composites of different values, and variables on larger scales may get more
weightage. Scaling brings all the values into the same relative range.
I have used the standard scaler for scaling. Below is the snapshot of the scaled data.
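A minimal sketch of this step (the non-numeric State column is dropped by selecting only numeric columns):

from sklearn.preprocessing import StandardScaler

# Keep only the numeric columns (health indices, per capita income, GDP)
features = df.select_dtypes(include="number")

# Standardize each variable to mean 0 and standard deviation 1
scaler = StandardScaler()
scaled_data = scaler.fit_transform(features)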

2.3. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using
Dendrogram and briefly describe them.

Hierarchical clustering – Ward’s method & average method

Ward’s method is applied to the scaled data, and a dendrogram is used for visualization.

The above dendrogram indicates that all the data points have been grouped into different clusters by
Ward’s method. To find the optimal number of clusters through which we can solve our business
objective, we use truncate_mode = 'lastp', where we give p = 10 as a commonly used base value.

From the above we can see that the data points have been clustered into 3 sections. The fcluster
function helps in forming the clusters; the data points have been grouped into 3 clusters. To map these
clusters back to our dataset we use fcluster with criterion = 'maxclust', as sketched below.
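A minimal sketch of these steps with SciPy (scaled_data is assumed from the scaling step above):

from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Ward linkage on the scaled data
wardlink = linkage(scaled_data, method="ward")

# Truncated dendrogram showing only the last p = 10 merged clusters
dendrogram(wardlink, truncate_mode="lastp", p=10)
plt.show()

# Map every data point to one of 3 clusters
ward_clusters = fcluster(wardlink, 3, criterion="maxclust")
df["Ward_Cluster"] = ward_clusters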
Cluster Frequency:

Average method applied to the scaled data:

Cluster profiling is used to understand the business problem.

The above dendrogram indicates that all the data points have been grouped into different clusters by the
average method. As before, to find the optimal number of clusters we use truncate_mode = 'lastp' with
p = 10. All the data points again fall into 3 clusters, and we map these clusters back to our dataset with
fcluster and criterion = 'maxclust'.
Observation:

Both methods give almost similar cluster means, with only minor variation, which is expected. There was
not much variation between the two methods. Based on the dendrogram and further analysis of the
dataset, a 3-cluster solution was chosen; the three-cluster solution gives a pattern of Low / Medium /
High.

2.4. Apply K-Means clustering on scaled data and determine optimum clusters. Apply
elbow curve and find the silhouette score.

For K-means clustering, we initially decide on clusters = 3 and look at how the data points are distributed
across the clusters. We apply the K-means technique to the scaled data.

The Within-Cluster Sum of Squared Errors (WSS) for different values of k helps choose the k at which the
decrease in WSS first starts to level off. In the plot of WSS versus k, this is visible as an elbow.
The range of the Silhouette value is between +1 and -1. A high value is desirable and indicates that
the point is placed in the correct cluster. If many points have a negative Silhouette value, it may
indicate that we have created too many or too few clusters.

The Elbow Method is more of a decision rule, while the Silhouette is a metric used for validation
while clustering. Thus, it can be used in combination with the Elbow Method.
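A minimal sketch of the elbow curve and silhouette check (scaled_data is assumed from the scaling step; the range of k values is illustrative):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Elbow curve: WSS (inertia) for k = 2..10
wss = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(scaled_data)
    wss.append(km.inertia_)

plt.plot(range(2, 11), wss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WSS (inertia)")
plt.show()

# Silhouette score for the chosen solution, e.g. k = 3
km3 = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans_labels = km3.fit_predict(scaled_data)
print(silhouette_score(scaled_data, kmeans_labels))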
2.5. Describe cluster profiles for the clusters defined. Recommend different priority-based
actions that need to be taken for different clusters on the basis of their vulnerability
situations according to their Economic and Health Conditions.

Based on the cluster profiles we interpret the following:

• States that fall in Cluster 0 – Medium have sufficient infrastructure, and their quality of life is as
per the living standards.

• States that fall in Cluster 1 – Low have very poor per capita income, which leads to very poor
living standards, as individuals do not have many sources of income, and that affects the gross
domestic product.

• States that fall in Cluster 2 – High have a very robust economy, so all economic indicators
show an upward trend.
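A minimal sketch of how these cluster profiles can be computed (assuming the K-means labels kmeans_labels from the sketch above):

# Attach the K-means labels and profile each cluster by its mean values
df["KMeans_Cluster"] = kmeans_labels
cluster_profile = df.groupby("KMeans_Cluster").mean(numeric_only=True)
print(cluster_profile)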

Considering the above result there are several ways to increase GDP:

• Education and training. Greater education and job skills allow individuals to produce more
goods and services, start businesses and earn higher incomes. That leads to a higher GDP.

• Good infrastructure. Without a functioning power system or good roads, a nation has
limited ability to make or ship goods, and businesses have limited ability to provide services.
Building a good infrastructure, including telecommunications, makes it possible to massively
expand the economy and increase per capita income.

• Restrict population. Lowering the population can increase the GDP per capita, but forcing
families to do so is a ruthless solution to the problem.

END OF PROJECT
