
Topic: Dimension Reduction With PCA

Instructions:
Please share your answers filled inline in the Word document. Submit code separately
wherever applicable.
Please ensure you update all the details:
Name: yangmiso yangya Batch ID: DSWDMCON 180122
Topic: Principal Component Analysis

Grading Guidelines:
1. An assignment submission is considered complete only when correct and executable code(s) are submitted along
with the documentation explaining the method and results. Failing to submit either of those will be considered an
invalid submission and will not be considered for evaluation.
2. Assignments submitted after the deadline will affect your grades.

Grading:
Grade  Score  Ans (On time)         Ans (Late)
A      100    Correct               -
B      85     80% & above           Correct
C      75     50% & above           80% & above
D      65     50% & below           50% & above
E      55     -                     50% & below
F      45     Copied/No Submission  -

● Grade A (>= 90): When all assignments are submitted on or before the given deadline.

● Grade B (>= 80 and < 90):
o When assignments are submitted on time but less than 80% of the problems are completed, (OR)
o All assignments are submitted after the deadline.

● Grade C (>= 70 and < 80):
o When assignments are submitted on time but less than 50% of the problems are completed, (OR)
o Less than 80% of the problems in the assignments are submitted after the deadline.

● Grade D (>= 60 and < 70):
o Assignments are submitted after the deadline with 50% or fewer of the problems completed.

● Grade E (>= 50 and < 60):
o Less than 30% of the problems in the assignments are submitted after the deadline, (OR)
o Less than 30% of the problems in the assignments are submitted before the deadline.

● Grade F (< 50): No submission (or) malpractice.



Hints:
1. Business Problem
1.1. What is the business objective?
1.2. Are there any constraints?

2. Work on each feature of the dataset to create a data dictionary as displayed in the below
image:

2.1 Make a table as shown in the image and provide information about the features, such as their
data types and their relevance to model building. If a feature is not relevant, provide reasons and a
description of the feature.

3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Exploratory Data Analysis (EDA):
4.1. Summary.
4.2. Univariate analysis.
4.3. Bivariate analysis.

5. Model Building
5.1 Build the model on the scaled data (try multiple options).
5.2 Perform PCA and obtain the components that explain the maximum variance.
5.3 Perform clustering before and after applying PCA to cross-check the number of clusters
formed (a minimal code sketch of this pipeline follows these hints).
5.4 Briefly explain the model output in the documentation.

6. Write about the benefits/impact of the solution - in what way does the business (client)
benefit from the solution provided?
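
A minimal sketch of the scale → PCA → cluster pipeline described in these hints (df is an assumed,
illustrative numeric DataFrame; all names here are placeholders, not part of the assignment code):

from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = scale(df)                                       # standardize each feature (mean 0, variance 1)
scores = PCA(n_components = 3).fit_transform(X)     # project onto the first 3 principal components
labels = KMeans(n_clusters = 3).fit(scores).labels_ # cluster the PCA scores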

Problem Statement: -
Perform hierarchical and K-means clustering on the dataset. After that, perform PCA on the
dataset and extract the first 3 principal components and make a new dataset with these 3
principal components as the columns. Now, on this new dataset, perform hierarchical and K-
means clustering. Compare the results of clustering on the original dataset and clustering on
the principal components dataset (use the scree plot technique to obtain the optimum
number of clusters in K-means clustering and check if you’re getting similar results with and
without PCA).



Problem Statement: -

A pharmaceuticals manufacturing company is conducting a study on a new medicine to treat
heart disease. The company has gathered data from its secondary sources and would like you to
provide high-level analytical insights on the data. Its aim is to segregate patients depending on
their age group and other factors given in the data. Perform PCA and clustering algorithms on
the dataset and check if the clusters formed before and after PCA are the same and provide a
brief report on your model. You can also explore more ways to improve your model.
Note: This is just a snapshot of the data. The datasets can be downloaded from AiSpry LMS in
the Hands-On Material section.

1. Perform hierarchical and K-means clustering on the dataset. After that, perform PCA on the
dataset and extract the first 3 principal components and make a new dataset with these 3
principal components as the columns. Now, on this new dataset, perform hierarchical and K-
means clustering. Compare the results of clustering on the original dataset and clustering on the
principal components dataset (use the scree plot technique to obtain the optimum number of
clusters in K-means clustering and check if you’re getting similar results with and without PCA).
Solution:
1. Business Problem

1.1. What is the business objective?


Answer:
Maximize: sales.
Minimize: losses.

1.2. Are there any constraints?


Answer: No constraints.

Name of the feature  Description                                                   Type    Relevance

Type                 Class/category of the wine                                    NUMBER  YES
Alcohol              Amount of alcohol in that particular wine type                NUMBER  YES
Malic                Amount of malic acid in that particular wine type             NUMBER  YES
Ash                  Amount of ash in that particular wine type                    NUMBER  YES
Alcalinity           Alcalinity of ash in that particular wine type                NUMBER  YES
Magnesium            Amount of magnesium in that particular wine type              NUMBER  YES
Phenols              Amount of phenols in that particular wine type                NUMBER  YES
Flavanoids           Amount of flavanoid phenols in that particular wine type      NUMBER  YES
Nonflavanoids        Amount of nonflavanoid phenols in that particular wine type   NUMBER  YES
Proanthocyanins      Amount of proanthocyanins in that particular wine type        NUMBER  YES
Color                Color intensity of that particular wine type                  NUMBER  YES
Hue                  Hue of that particular wine type                              NUMBER  YES
Dilution             Degree of dilution of that particular wine type               NUMBER  YES
Proline              Amount of proline in that particular wine type                NUMBER  YES

Hierarchical clustering
import pandas as pd
import matplotlib.pylab as plt
wine1= pd.read_csv(r"E:\yangmiso yangya\data science\pca\\Wine.csv")

wine1.describe()
wine1.info()

wine = wine1.drop(["Type"], axis=1)

# Normalization function
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return (x)

# Normalized data frame (considering the numerical part of data)
df_norm = norm_func(wine.iloc[:, 0:])  # "Type" is already dropped, so all remaining columns are numeric
df_norm.describe()
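
For reference (a sketch, not part of the original script), sklearn's MinMaxScaler implements the
same min-max normalization as norm_func:

from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# MinMaxScaler applies the same (x - min) / (max - min) transform column-wise
df_norm_alt = pd.DataFrame(MinMaxScaler().fit_transform(wine), columns = wine.columns)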

# For creating the dendrogram
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch

z = linkage(df_norm, method = "complete", metric = "euclidean")

# Dendrogram
plt.figure(figsize = (15, 8))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
sch.dendrogram(z,
               leaf_rotation = 0,   # rotates the x axis labels
               leaf_font_size = 10  # font size for the x axis labels
               )
plt.show()
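
As a cross-check (a sketch using the linkage matrix z computed above), scipy's fcluster can cut the
same tree into three flat clusters, which should agree with the agglomerative labels obtained below:

from scipy.cluster.hierarchy import fcluster

# Cut the dendrogram so that exactly 3 flat clusters remain (labels run 1..3)
flat_labels = fcluster(z, t = 3, criterion = 'maxclust')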
# Now applying AgglomerativeClustering, choosing 3 clusters from the above dendrogram
from sklearn.cluster import AgglomerativeClustering

h_complete = AgglomerativeClustering(n_clusters = 3, linkage = 'complete', affinity = "euclidean").fit(df_norm)
h_complete.labels_

cluster_labels = pd.Series(h_complete.labels_)

wine['clust'] = cluster_labels  # creating a new column with the cluster labels

wine1 = wine.iloc[:, [13,0,1,2,3,4,5,6,7,8,9,10,11,12]]  # move 'clust' to the front
wine1.head()

# Aggregate mean of each cluster
wine1.iloc[:, 1:].groupby(wine1.clust).mean()
(values rounded to two decimals)

clust  Alcohol  Malic  Ash   Alcalinity  Magnesium  Phenols  Flavanoids  Nonflavanoids  Proanthocyanins  Color  Hue   Dilution  Proline
0      13.52    2.06   2.43  17.61       108.04     2.82     2.93        0.29           1.98             5.22   1.06  3.14      1051.71
1      12.29    1.74   2.23  20.11       89.78      2.24     2.06        0.38           1.52             3.02   1.09  2.81      485.13
2      13.06    3.30   2.42  21.27       99.28      1.67     0.84        0.44           1.17             6.93   0.70  1.74      624.02
K-means clustering
wine1 = pd.read_csv(r"E:\yangmiso yangya\data science\pca\\Wine.csv")
wine1.describe()
wine = wine1.drop(["Type"], axis = 1)

from sklearn.cluster import KMeans  # required for the elbow curve and clustering below

# Normalization function
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return (x)

# Normalized data frame (considering the numerical part of data)
df_norm = norm_func(wine.iloc[:, 0:])

###### Scree plot or elbow curve ############
TWSS = []
k = list(range(2, 9))

for i in k:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df_norm)
    TWSS.append(kmeans.inertia_)

TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")
# Selecting 3 clusters from the above scree plot which is the optimum number of clusters
model = KMeans(n_clusters = 3)
model.fit(df_norm)

model.labels_ # getting the labels of clusters assigned to each row


mb = pd.Series(model.labels_) # converting numpy array into pandas series object
wine['clust'] = mb  # creating a new column with the cluster labels

wine.head()
df_norm.head()

wine = wine.iloc[:,[13,0,1,2,3,4,5,6,7,8,9,10,11,12]]
wine.head()

total = wine.iloc[:, 1:].groupby(wine.clust).mean()

(values rounded to two decimals)

clust  Alcohol  Malic  Ash   Alcalinity  Magnesium  Phenols  Flavanoids  Nonflavanoids  Proanthocyanins  Color  Hue   Dilution  Proline
0      13.47    1.99   2.43  17.65       106.73     2.87     2.97        0.28           1.98             5.21   1.06  3.16      1037.28
1      13.12    3.27   2.41  21.23       98.75      1.67     0.82        0.45           1.15             7.15   0.70  1.70      623.88
2      12.28    1.90   2.25  20.23       91.65      2.15     1.95        0.38           1.51             2.88   1.08  2.77      488.33

After applying hierarchical clustering and K-means clustering, we find that the dataset is divided into
three clusters in both cases, and the resulting cluster profiles are almost identical to each other.
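
To make this comparison concrete (a sketch, assuming cluster_labels from the hierarchical section and
mb from the K-means section above), a cross-tabulation shows how the two labelings overlap:

import pandas as pd

# Rows: hierarchical labels; columns: K-means labels. Counts concentrated in one
# cell per row indicate that the two methods found essentially the same groups.
pd.crosstab(cluster_labels, mb)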

Principal component analysis (PCA)


import pandas as pd
import numpy as np

wine1= pd.read_csv(r"E:\yangmiso yangya\data science\pca\\Wine.csv")


wine1.describe()

wine1.info()
wine = wine1.drop(["Type"], axis = 1)
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

# Considering only numerical data
wine_data = wine.iloc[:, 0:]

# Normalizing the numerical data
wine_normal = scale(wine_data)
wine_normal

pca = PCA(n_components = 6)
pca_values = pca.fit_transform(wine_normal)

# The amount of variance that each principal component explains
var = pca.explained_variance_ratio_
var

# PCA weights
pca.components_
pca.components_[0]

# Cumulative variance
var1 = np.cumsum(np.round(var, decimals = 4) * 100)
var1
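
If we wanted to choose the number of components automatically (an illustrative sketch using var1
from above, not part of the original script), we could take the smallest count that explains, say,
90% of the variance:

import numpy as np

# Index of the first cumulative-variance value >= 90, converted to a component count
n_components_90 = int(np.argmax(var1 >= 90)) + 1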

# Variance plot for the PCA components obtained
plt.plot(var1, color = "red")
# PCA scores
pca_values

pca_data = pd.DataFrame(pca_values)
pca_data.columns = "comp0", "comp1", "comp2", "comp3", "comp4", "comp5"
final = pd.concat([wine1.Type, pca_data.iloc[:, 0:3]], axis = 1)

# Scatter diagram
import matplotlib.pylab as plt
ax = final.plot(x='comp0', y='comp1', kind='scatter',figsize=(12,8))
K-means clustering after applying PCA
final1 = final.drop(["Type"], axis = 1)

from sklearn.cluster import KMeans

###### Scree plot or elbow curve ############
TWSS = []
k = list(range(2, 9))

for i in k:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(final1)
    TWSS.append(kmeans.inertia_)

TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")
Fig. Elbow curve after applying PCA to the dataset

Fig. Elbow curve from K-means clustering on the original data

From the above elbow graphs we see that the slope of both curves changes at k = 3, so the optimal
number of clusters obtained with PCA and without PCA is identical.
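
To quantify this agreement (a sketch, assuming mb holds the K-means labels on the original scaled
data and final1 is the 3-component dataset from above):

from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans

# Refit K-means on the PCA scores and compare the two labelings; an adjusted
# Rand index near 1 means the two partitions are essentially the same
model_pca = KMeans(n_clusters = 3).fit(final1)
adjusted_rand_score(mb, model_pca.labels_)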
2. A pharmaceuticals manufacturing company is conducting a study on a new medicine to treat heart
disease. The company has gathered data from its secondary sources and would like you to provide
high-level analytical insights on the data. Its aim is to segregate patients depending on their age
group and other factors given in the data. Perform PCA and clustering algorithms on the dataset and
check if the clusters formed before and after PCA are the same and provide a brief report on your
model. You can also explore more ways to improve your model.
Solution:
Business Problem
1. What is the business objective?

Answer:
Maximize: prevention/prediction of heart disease.
Minimize: heart disease.

2. Are there any constraints?

Answer: No constraints.

Name of the feature  Description                                                    Type         Relevance

age                  The age of the patient.                                        NUMBER       YES
sex                  The gender of the patient (1 = male, 0 = female).              CATEGORICAL  YES
cp                   Type of chest pain (1 = typical angina, 2 = atypical angina,   CATEGORICAL  YES
                     3 = non-anginal pain, 4 = asymptomatic).
trestbps             Resting blood pressure in mmHg.                                NUMBER       YES
chol                 Serum cholesterol in mg/dl.                                    NUMBER       YES
fbs                  Fasting blood sugar (1 = fasting blood sugar is more than      CATEGORICAL  YES
                     120 mg/dl, 0 = otherwise).
restecg              Resting electrocardiographic results (0 = normal, 1 = ST-T     CATEGORICAL  YES
                     wave abnormality, 2 = left ventricular hypertrophy).
thalach              Max heart rate achieved.                                       NUMBER       YES
exang                Exercise-induced angina (1 = yes, 0 = no).                     CATEGORICAL  YES
oldpeak              ST depression induced by exercise relative to rest.            NUMBER       YES
slope                Peak exercise ST segment (1 = upsloping, 2 = flat,             CATEGORICAL  YES
                     3 = downsloping).
ca                   Number of major vessels (0-3) colored by fluoroscopy.          CATEGORICAL  YES
thal                 Thalassemia (1 = normal, 2 = fixed defect, 3 = reversible      CATEGORICAL  YES
                     defect).
target               Diagnosis of heart disease (0 = absence, 1 = present).         CATEGORICAL  YES
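
Several of these features are categorical, which Euclidean-distance clustering does not handle
directly; one illustrative option (a sketch, not part of the original script, assuming the heart
disease data is loaded into heart1 as in the code below) is to one-hot encode the multi-level
categorical columns before scaling:

import pandas as pd

# One-hot encode the multi-level categoricals; the binary flags
# (sex, fbs, exang, target) can stay as 0/1
heart_encoded = pd.get_dummies(heart1, columns = ["cp", "restecg", "slope", "ca", "thal"], drop_first = True)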

Hierarchical clustering

import pandas as pd
import matplotlib.pylab as plt

heart1 = pd.read_csv(r"E:\yangmiso yangya\data science\pca\\heart disease.csv")

heart1.describe()
heart1.info()
# Normalization function
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return (x)

# Normalized data frame (considering the numerical part of data)
df_norm = norm_func(heart1.iloc[:, 0:])  # include all columns; column 0 (age) is numeric too
df_norm.describe()

# For creating the dendrogram
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch

z = linkage(df_norm, method = "complete", metric = "euclidean")

# Dendrogram
plt.figure(figsize = (15, 8))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
sch.dendrogram(z,
               leaf_rotation = 0,   # rotates the x axis labels
               leaf_font_size = 10  # font size for the x axis labels
               )
plt.show()

# Now applying AgglomerativeClustering, choosing 3 clusters from the above dendrogram
from sklearn.cluster import AgglomerativeClustering

h_complete = AgglomerativeClustering(n_clusters = 3, linkage = 'complete', affinity = "euclidean").fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)

heart1['clust'] = cluster_labels  # creating a new column with the cluster labels

heart1 = heart1.iloc[:, [14,0,1,2,3,4,5,6,7,8,9,10,11,12,13]]  # move 'clust' to the front


heart1.head()

# Aggregate mean of each cluster
total = heart1.iloc[:, 2:].groupby(heart1.clust).mean()
total
Out[1] (values rounded to two decimals):

clust  sex   cp    trestbps  chol    fbs   restecg  exang  oldpeak  slope  ca    thal  target
0      0.64  1.29  129.47    243.47  0.07  0.55     0.12   0.72     1.54   0.47  2.21  0.78
1      0.74  0.11  135.96    256.19  0.06  0.50     0.92   2.01     1.01   1.21  2.60  0.00
2      0.84  0.90  135.45    241.23  0.90  0.45     0.32   0.87     1.42   1.29  2.32  0.32

K-means clustering
# Kmeans on heart disease Data set
heart1 = pd.read_csv(r"E:\yangmiso yangya\data science\pca\\heart disease.csv")

heart1.describe()

# Normalization function
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return (x)

# Normalized data frame (considering the numerical part of data)
df_norm = norm_func(heart1.iloc[:, 0:])
###### Scree plot or elbow curve ############
TWSS = []
k = list(range(2, 9))

for i in k:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df_norm)
    TWSS.append(kmeans.inertia_)

TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")

# Selecting 4 clusters from the above scree plot, which is the optimum number of clusters
model = KMeans(n_clusters = 4)
model.fit(df_norm)  # fit on the normalized data rather than the raw values

model.labels_  # getting the labels of clusters assigned to each row

mb = pd.Series(model.labels_)  # converting numpy array into pandas series object
heart1['clust'] = mb  # creating a new column with the cluster labels

heart1 = heart1.iloc[:,[14,0,1,2,3,4,5,6,7,8,9,10,11,12,13]]
heart1.head()

heart1.iloc[:, 0:].groupby(heart1.clust).mean()
Out[2] (values rounded to two decimals):

clust  age    sex   cp    trestbps  chol    fbs   restecg  thalach  exang  oldpeak  slope  ca    thal  target
0      56.53  0.64  0.84  138.27    302.03  0.17  0.47     145.08   0.43   1.16     1.40   0.88  2.39  0.40
1      51.98  0.72  1.01  128.27    193.28  0.15  0.64     148.71   0.30   1.09     1.41   0.57  2.20  0.60
2      62.60  0.00  0.80  135.80    438.20  0.20  0.00     155.60   0.20   1.90     1.20   1.20  2.60  0.60
3      54.50  0.71  1.02  129.91    244.12  0.13  0.50     152.87   0.29   0.90     1.39   0.73  2.34  0.59

Principal component analysis (PCA)

heart = heart1.drop(["clust"], axis = 1)

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import scale

# Considering only numerical data
heart_data = heart.iloc[:, 0:]

# Normalizing the numerical data
heart_normal = scale(heart_data)
heart_normal

pca = PCA(n_components = 6)
pca_values = pca.fit_transform(heart_normal)

# The amount of variance that each principal component explains
var = pca.explained_variance_ratio_
var

# PCA weights
pca.components_
pca.components_[0]

# Cumulative variance
var1 = np.cumsum(np.round(var, decimals = 4) * 100)
var1

# Variance plot for the PCA components obtained
plt.plot(var1, color = "red")

# PCA scores
pca_values

pca_data = pd.DataFrame(pca_values)
pca_data.columns = "comp0", "comp1", "comp2", "comp3", "comp4", "comp5"
final = pca_data.iloc[:, 0:3]  # keep only the first 3 principal components

# Scatter diagram
import matplotlib.pylab as plt
ax = final.plot(x='comp0', y='comp1', kind='scatter',figsize=(12,8))
As we observe from the scatter plot, there is no correlation between the two components; this is expected, since principal components are orthogonal by construction.
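
We can verify this numerically (a quick sketch using final from above); the off-diagonal entries of
the correlation matrix should be close to zero:

import numpy as np

# Correlation matrix of the first two principal components
np.corrcoef(final["comp0"], final["comp1"])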
K-means clustering after applying PCA
model1 = KMeans(n_clusters = 3).fit(final)

final.plot(x = "comp0", y = "comp1", c = model1.labels_, kind="scatter", s = 10, cmap = plt.cm.coolwarm)

###### Scree plot or elbow curve ############
TWSS = []
k = list(range(2, 9))

for i in k:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(final)
    TWSS.append(kmeans.inertia_)

TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")

Fig. Elbow curve after applying PCA to the dataset

Fig. Elbow curve from K-means clustering on the original data

The slopes of the two elbow curves differ slightly from one another, but overall they are essentially identical: clustering before and after PCA suggests the same optimal number of clusters.
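
As with the wine data, this agreement can be quantified (a sketch, assuming mb holds the K-means
labels on the original heart data and model1 is the 3-cluster model fitted on final above):

from sklearn.metrics import adjusted_rand_score

# Values near 1 indicate that clustering before and after PCA groups
# the patients in essentially the same way
adjusted_rand_score(mb, model1.labels_)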
