You are on page 1of 25

Topic: Dimension Reduction With PCA

Instructions:
Please share your answers filled in-line in the word document. Submit code separately
wherever applicable.
Please ensure you update all the details:
Name: yangmiso yangya Batch ID: DSWDMCON 180122
Topic: Principal Component Analysis

Grading Guidelines:
1. An assignment submission is considered complete only when correct and executable code(s) are submitted along
with the documentation explaining the method and results. Failing to submit either of those will be considered an
invalid submission and will not be considered for evaluation.
2. Assignments submitted after the deadline will affect your grades.

Grading:
Ans Date Ans Date
Correct On time A 100
80% & above On time B 85 Correct Late
50% & above On time C 75 80% & above Late
50% & below On time D 65 50% & above Late
E 55 50% & below
Copied/No Submission F 45

● Grade A: (>= 90): When all assignments are submitted on or before the given deadline
● Grade B: (>= 80 and < 90):
o When assignments are submitted on time but less than 80% of problems are completed.
(OR)
o All assignments are submitted after the deadline.

● Grade C: (>= 70 and < 80):


o When assignments are submitted on time but less than 50% of the problems are completed.
(OR)
o Less than 80% of problems in the assignments are submitted after the deadline

● Grade D: (>= 60 and < 70):


o Assignments submitted after the deadline and with 50% or less problems.

● Grade E: (>= 50 and < 60):


o Less than 30% of problems in the assignments are submitted after the deadline
(OR)
o Less than 30% of problems in the assignments are submitted before deadline

● Grade F: (< 50): No submission (or) malpractice.

© 2013 - 2021 360DigiTMG. All Rights Reserved.


Hints:
1. Business Problem
1.1. What is the business objective?
1.1. Are there any constraints?

2. Work on each feature of the dataset to create a data dictionary as displayed in the below
image:

2.1 Make a table as shown above and provide information about the features such as its data type
and its relevance to the model building. And if not relevant, provide reasons and a description of the
feature.

3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Exploratory Data Analysis (EDA):
4.1. Summary.
4.2. Univariate analysis.
4.3. Bivariate analysis.

5. Model Building
5.1 Build the model on the scaled data (try multiple options).
5.2 Perform PCA analysis and get the maximum variance between components.
5.3 Perform clustering before and after applying PCA to cross the number of clusters
formed.
5.4 Briefly explain the model output in the documentation.

6. Write about the benefits/impact of the solution - in what way does the business (client)
benefit from the solution provided?

Problem Statement: -
© 2013 - 2021 360DigiTMG. All Rights Reserved.
Perform hierarchical and K-means clustering on the dataset. After that, perform PCA on the
dataset and extract the first 3 principal components and make a new dataset with these 3
principal components as the columns. Now, on this new dataset, perform hierarchical and K-
means clustering. Compare the results of clustering on the original dataset and clustering on
the principal components dataset (use the scree plot technique to obtain the optimum
number of clusters in K-means clustering and check if you’re getting similar results with and
without PCA).

© 2013 - 2021 360DigiTMG. All Rights Reserved.


Problem Statement: -

A pharmaceuticals manufacturing company is conducting a study on a new medicine to treat


heart diseases. The company has gathered data from its secondary sources and would like you to
provide high level analytical insights on the data. Its aim is to segregate patients depending on
their age group and other factors given in the data. Perform PCA and clustering algorithms on
the dataset and check if the clusters formed before and after PCA are the same and provide a
brief report on your model. You can also explore more ways to improve your model.
Note: This is just a snapshot of the data. The datasets can be downloaded from AiSpry LMS in
the Hands-On Material section.

1. Perform hierarchical and K-means clustering on the dataset. After that, perform PCA on the
dataset and extract the first 3 principal components and make a new dataset with these 3
principal components as the columns. Now, on this new dataset, perform hierarchical and K-
means clustering. Compare the results of clustering on the original dataset and clustering on the
principal components dataset (use the scree plot technique to obtain the optimum number of
clusters in K-means clustering and check if you’re getting similar results with and without PCA).
Solution:
1. Business Problem

1.2. What is the business objective?


Answer:
Maximize : sales
Minimize : Loss

1.2. Are there any constraints?


Answer: No constrain

Name of the features Descriptions Types Relevance


Type Class Category of the Wine NUMBER YES
Alcohol Amount of Alcholol in that perticular wine type NUMBER YES
Malic Amount of Malic Acid in that perticular wine type NUMBER YES
Ash Amount of Ash in that perticular wine type NUMBER YES
Alcalinity Amount of Alcalinity of Ash in that perticular wine type NUMBER YES
Magnesium Amount of Magnesium in that perticular wine type NUMBER YES
Phenols Amount of phenol in that perticular wine type NUMBER YES
Flavanoids Amount of phenol in that perticular wine type NUMBER YES
Nonflavanoids Amount of Nonflavanoid phenols in that perticular wine type NUMBER YES
Proanthocyanins Amount of Proanthocyanins in that perticular wine type NUMBER YES
Color Amount of Color intensity for that perticular wine type NUMBER YES
Hue Amount of Hue for that perticular wine type NUMBER YES
Dilution Amount of diluted in that perticular wine type NUMBER YES
Proline Amount of Proline in that perticular wine type NUMBER YES

Hierarchical clustering
import pandas as pd
import matplotlib.pylab as plt
wine1= pd.read_csv(r"E:\yangmiso yangya\data science\pca\\Wine.csv")

wine1.describe()
wine1.info()

wine = wine1.drop(["Type"], axis=1)

# Normalization function
def norm_func(i):
x = (i-i.min()) / (i.max()-i.min())
return (x)

# Normalized data frame (considering the numerical part of data)


df_norm = norm_func(wine.iloc[:, 1:])
df_norm.describe()

# for creating dendrogram


from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch

z = linkage(df_norm, method = "complete", metric = "euclidean")

# Dendrogram
plt.figure(figsize=(15, 8));plt.title('Hierarchical Clustering
Dendrogram');plt.xlabel('Index');plt.ylabel('Distance')
sch.dendrogram(z,
leaf_rotation = 0, # rotates the x axis labels
leaf_font_size = 10 # font size for the x axis labels
)
plt.show()
# Now applying AgglomerativeClustering choosing 5 as clusters from the above dendrogram
from sklearn.cluster import AgglomerativeClustering

h_complete = AgglomerativeClustering(n_clusters = 3, linkage = 'complete', affinity =


"euclidean").fit(df_norm)
h_complete.labels_

cluster_labels = pd.Series(h_complete.labels_)

wine['clust'] = cluster_labels # creating a new column and assigning it to new column

wine1 = wine.iloc[:, [13,0,1,2,3,4,5,6,7,8,9,10,11,12]]


wine1.head()

# Aggregate mean of each cluster


wine1.iloc[:, 1:].groupby(wine1.clust).mean()
clust Alcohol Malic Ash Alcalinity Magnesium
Phenols Flavanoids Nonflavanoids Proanthocyanins
Color Hue Dilution Proline
0 13.524782608695652 2.060869565217391
2.4320289855072463 17.61449275362319
108.04347826086956 2.822608695652174
2.9330434782608696 0.2876811594202899
1.9805797101449274 5.224347826086957
1.0552173913043479 3.137536231884058
1051.7101449275362
1 12.285818181818183 1.7352727272727273
2.228909090909091 20.114545454545453 89.78181818181818
2.2434545454545454 2.061818181818182 0.376 1.516
3.0152727272727273 1.0854545454545454
2.805090909090909 485.1272727272727
2 13.058888888888887 3.3005555555555555
2.422962962962963 21.26666666666667 99.27777777777777
1.6737037037037037 0.8412962962962963
0.4422222222222222 1.1692592592592592
6.9262962777777775 0.7021481481481481
1.7427777777777778 624.0185185185185
k-mean clustering
wine1= pd.read_csv(r"E:\yangmiso yangya\data science\pca\\Wine.csv")
wine1.describe()
wine = wine1.drop(["Type"], axis = 1)

# Normalization function
def norm_func(i):
x = (i - i.min()) / (i.max() - i.min())
return (x)

# Normalized data frame (considering the numerical part of data)


df_norm = norm_func(wine.iloc[:, 1:])

###### scree plot or elbow curve ############


TWSS = []
k = list(range(2, 9))

for i in k:
kmeans = KMeans(n_clusters = i)
kmeans.fit(df_norm)
TWSS.append(kmeans.inertia_)

TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")
# Selecting 3 clusters from the above scree plot which is the optimum number of clusters
model = KMeans(n_clusters = 3)
model.fit(df_norm)

model.labels_ # getting the labels of clusters assigned to each row


mb = pd.Series(model.labels_) # converting numpy array into pandas series object
wine['clust'] = mb # creating a new column and assigning it to new column

wine.head()
df_norm.head()

wine = wine.iloc[:,[13,0,1,2,3,4,5,6,7,8,9,10,11,12]]
wine.head()

total = wine.iloc[:, 1:].groupby(wine.clust).mean()

clust Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids


Nonflavanoids Proanthocyanins Color Hue Dilution Proline
0 13.47 1.988169014084507 2.426056338028169 17.654929577464788 106.73239436619718
2.8656338028169017 2.9714084507042253 0.28352112676056335 1.9750704225352111
5.21169014084507 1.0559154929577466 3.158450704225352 1037.281690140845
1 13.117884615384614 3.2746153846153847 2.413653846153846 21.225 98.75
1.6726923076923077 0.8226923076923077 0.45038461538461544 1.1519230769230768
7.15423075 0.696076923076923 1.6990384615384615 623.8846153846154
2 12.283818181818182 1.8987272727272728 2.245090909090909 20.234545454545458
91.65454545454546 2.147090909090909 1.9538181818181817 0.37927272727272726
1.51 2.8779999999999997 1.0774545454545454 2.768727272727273 488.3272727272727

After applying hierarchical clustering and k mean clustering we find out that the data set we have are
divided into equal clusters and almost equivalent or identical to each other.

Principal component analysis(pca)


import pandas as pd
import numpy as np

wine1= pd.read_csv(r"E:\yangmiso yangya\data science\pca\\Wine.csv")


wine1.describe()

wine1.info()
wine = wine1.drop(["Type"], axis = 1)
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

# Considering only numerical data


wine.data = wine.iloc[:, 0:]

# Normalizing the numerical data


wine_normal = scale(wine.data)
wine_normal

pca = PCA(n_components = 6)
pca_values = pca.fit_transform(wine_normal)

# The amount of variance that each PCA explains is


var = pca.explained_variance_ratio_
var

# PCA weights
pca.components_
pca.components_[0]

# Cumulative variance
var1 = np.cumsum(np.round(var, decimals = 4) * 100)
var1

# Variance plot for PCA components obtained


plt.plot(var1, color = "red")
# PCA scores
pca_values

pca_data = pd.DataFrame(pca_values)
pca_data.columns = "comp0", "comp1", "comp2", "comp3", "comp4", "comp5"
final = pd.concat([wine1.Type, pca_data.iloc[:, 0:3]], axis = 1)

# Scatter diagram
import matplotlib.pylab as plt
ax = final.plot(x='comp0', y='comp1', kind='scatter',figsize=(12,8))
From k- mean clustering
final1 = final.drop(["Type"], axis = 1)

from sklearn.cluster import KMeans

###### scree plot or elbow curve ############


TWSS = []
k = list(range(2, 9))

for i in k:
kmeans = KMeans(n_clusters = i)
kmeans.fit(final1)
TWSS.append(kmeans.inertia_)

TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")
Fig.Elbow curve after applying PCA in the data set

Fig. Elbow curve from k-mean clustering

From the above elbow graph we see that the both the graph of the slopes changes at k=3 ,so as
we see the optimum value after applying pca and without pca are identical to each other.
2. A pharmaceuticals manufacturing company is conducting a study on a new medicine to treat heart
diseases. The company has gathered data from its secondary sources and would like you to provide
high level analytical insights on the data. Its aim is to segregate patients depending on their age
group and other factors given in the data. Perform PCA and clustering algorithms on the dataset and
check if the clusters formed before and after PCA are the same and provide a brief report on your
model. You can also explore more ways to improve your model.
Solution:
Business Problem
1.What is the business objective?
Answer:
Maximize : Prevention/prediction
Minimize : Heart disease

2.Are there any constraints?


Answer: No constrain

Name of the features Description Types Relevance

Age The age of the patient. CHAR YES

sex The gender of the patient. (1 = male, 0 = female). CATEGORICAL YES

cp Type of chest pain. (1 = typical angina, 2 = atypical CATEGORICAL YES


angina, 3 = non — anginal pain, 4 = asymptotic).

trestbps Resting blood pressure in mmHg. NUMBER YES

chol Serum Cholestero in mg/dl. NUMBER YES

fbs Fasting Blood Sugar. (1 = fasting blood sugar is CATEGORICAL YES


more than 120mg/dl, 0 = otherwise).

restecg Resting ElectroCardioGraphic results (0 = normal, 1 CATEGORICAL YES


= ST-T wave abnormality, 2 = left ventricular
hyperthrophy).

Thalach Max heart rate achieved. NUMBER YES

exang Exercise induced angina (1 = yes, 0 = no) CATEGORICAL YES

oldpeak ST depression induced by exercise relative to rest. NUMBER YES

slope Peak exercise ST segment (1 = upsloping, 2 = flat, 3 CATEGORICAL YES


= downsloping).
ca Number of major vessels (0–3) colored by CATEGORICAL YES
flourosopy.

Thal Thalassemia (1 = normal, 2 = fixed defect, 3 = CATEGORICAL YES


reversible defect).

Target Diagnosis of heart disease (0 = absence, 1 = CATEGORICAL YES


present).

Hierarchical clustering

import pandas as pd
import matplotlib.pylab as plt

heart1 = pd.read_csv(r"E:\yangmiso yangya\data science\pca\\heart disease.csv")

heart1.describe()
heart1.info()
# Normalization function
def norm_func(i):
x = (i-i.min()) / (i.max()-i.min())
return (x)

# Normalized data frame (considering the numerical part of data)


df_norm = norm_func(heart1.iloc[:, 1:])
df_norm.describe()

# for creating dendrogram


from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch
z = linkage(df_norm, method = "complete", metric = "euclidean")

# Dendrogram
plt.figure(figsize=(15, 8));plt.title('Hierarchical Clustering
Dendrogram');plt.xlabel('Index');plt.ylabel('Distance')
sch.dendrogram(z,
leaf_rotation = 0, # rotates the x axis labels
leaf_font_size = 10 # font size for the x axis labels
)
plt.show()

# Now applying AgglomerativeClustering choosing 5 as clusters from the above dendrogram


from sklearn.cluster import AgglomerativeClustering

h_complete = AgglomerativeClustering(n_clusters = 3, linkage = 'complete', affinity =


"euclidean").fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)

heart1['clust'] = cluster_labels # creating a new column and assigning it to new column

heart1 = heart1.iloc[:, [14,0,1,2,3,4,5,6,8,9,10,11,12,13]]


heart1.head()

# Aggregate mean of each cluster


total = heart1.iloc[:, 2:].groupby(heart1.clust).mean()
out[1]:
clust sex cp trestbps chol fbs restecgexang oldpeak slope ca thal
target
0 0.64 1.285 129.47 243.47 0.065 0.55 0.115 0.715 1.535 0.47 2.21 0.775
1 0.7361111111111112 0.1111111111111111 135.95833333333334 256.19444444444446
0.05555555555555555 0.5 0.9166666666666666 2.0125 1.0138888888888888
1.2083333333333333 2.5972222222222223 0.0
2 0.8387096774193549 0.9032258064516129 135.4516129032258
241.2258064516129 0.9032258064516129 0.45161290322580644 0.3225806451612903
0.8741935483870968 1.4193548387096775 1.2903225806451613 2.3225806451612905
0.3225806451612903

k-mean clustering
# Kmeans on heart disease Data set
heart1 = pd.read_csv(r"E:\yangmiso yangya\data science\pca\\heart disease.csv")

heart1.describe()

# Normalization function
def norm_func(i):
x = (i - i.min()) / (i.max() - i.min())
return (x)

# Normalized data frame (considering the numerical part of data)


df_norm = norm_func(heart1.iloc[:, 0:])
###### scree plot or elbow curve ############
TWSS = []
k = list(range(2, 9))

for i in k:
kmeans = KMeans(n_clusters = i)
kmeans.fit(df_norm)
TWSS.append(kmeans.inertia_)

TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")

# Selecting 4 clusters from the above scree plot which is the optimum number of clusters
model = KMeans(n_clusters = 4)
model.fit(heart1)

model.labels_ # getting the labels of clusters assigned to each row


mb = pd.Series(model.labels_) # converting numpy array into pandas series object
heart1['clust'] = mb # creating a new column and assigning it to new column

heart1 = heart1.iloc[:,[14,0,1,2,3,4,5,6,7,8,9,10,11,12,13]]
heart1.head()

heart1.iloc[:, 0:].groupby(heart1.clust).mean()
out[2]:

clust agesex cp trestbps chol fbs restecg thalach exang oldpeak


slope ca thal target clust
0 56.532467532467535 0.6363636363636364 0.8441558441558441
138.27272727272728 302.02597402597405 0.16883116883116883
0.4675324675324675 145.07792207792207 0.42857142857142855
1.164935064935065 1.4025974025974026 0.8831168831168831
2.3896103896103895 0.4025974025974026 0.0
1 51.97872340425532 0.723404255319149 1.0106382978723405
128.2659574468085 193.27659574468086 0.14893617021276595
0.6382978723404256 148.7127659574468 0.2978723404255319
1.0851063829787233 1.4148936170212767 0.574468085106383
2.202127659574468 0.5957446808510638 1.0
2 62.6 0.0 0.8 135.8 438.2 0.2 0.0 155.6 0.2 1.9
1.2 1.2 2.6 0.6 2.0
3 54.496062992125985 0.7086614173228346 1.015748031496063
129.91338582677164 244.11811023622047 0.13385826771653545
0.5039370078740157 152.8740157480315 0.29133858267716534
0.8960629921259843 1.3937007874015748 0.7322834645669292
2.338582677165354 0.5905511811023622 3.0

Principal component analysis


heart = heart1.drop(["clust"], axis = 1)

from sklearn.decomposition import PCA


import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

# Considering only numerical data


heart.data = heart.iloc[:, 0:]

# Normalizing the numerical data


heart_normal = scale(heart.data)
heart_normal

pca = PCA(n_components = 6)
pca_values = pca.fit_transform(heart_normal)

# The amount of variance that each PCA explains is


var = pca.explained_variance_ratio_
var

# PCA weights
pca.components_
pca.components_[0]

# Cumulative variance
var1 = np.cumsum(np.round(var, decimals = 4) * 100)
var1

# Variance plot for PCA components obtained


plt.plot(var1, color = "red")

# PCA scores
pca_values

pca_data = pd.DataFrame(pca_values)
pca_data.columns = "comp0", "comp1", "comp2", "comp3", "comp4", "comp5"
final = pd.concat([ pca_data.iloc[:, 0:3]], axis = 1)

# Scatter diagram
import matplotlib.pylab as plt
ax = final.plot(x='comp0', y='comp1', kind='scatter',figsize=(12,8))
There is no corelationship as we observe from the scatter plot.
K-mean clustering after applying PCA
model1 = KMeans(n_clusters = 3).fit(final)

final.plot(x = "comp0", y = "comp1", c = model1.labels_, kind="scatter", s = 10, cmap = plt.cm.coolwarm)

###### scree plot or elbow curve ############


TWSS = []
k = list(range(2, 9))

for i in k:
kmeans = KMeans(n_clusters = i)
kmeans.fit(final)
TWSS.append(kmeans.inertia_)

TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")

Fig.Elbow curve after applying PCA in the data set

Fig. Elbow curve from k-mean clustering

From the graph we see the slope of both is slide changing to one another. But overall it is identical.

You might also like