PCA Problem Statement
Instructions:
Please share your answers filled in inline in the Word document. Submit code separately
wherever applicable.
Please ensure you update all the details:
Name: yangmiso yangya Batch ID: DSWDMCON 180122
Topic: Principal Component Analysis
Grading Guidelines:
1. An assignment submission is considered complete only when correct and executable code(s) are submitted along
with the documentation explaining the method and results. Failing to submit either of those will be considered an
invalid submission and will not be considered for evaluation.
2. Assignments submitted after the deadline will affect your grades.
Grading:
Answers                Submission    Grade    Marks
Correct                On time       A        100
80% & above            On time       B        85
Correct                Late          B        85
50% & above            On time      C        75
80% & above            Late          C        75
50% & below            On time       D        65
50% & above            Late          D        65
50% & below            Late          E        55
Copied/No Submission   (any)         F        45
● Grade A (>= 90): all assignments are submitted correctly on or before the given deadline.
● Grade B (>= 80 and < 90): either
o assignments are submitted on time with at least 80% of the problems completed,
(OR)
o all assignments are submitted after the deadline.
2. Work on each feature of the dataset to create a data dictionary as displayed in the below
image:
2.1 Make a table as shown above and provide information about each feature, such as its data type
and its relevance to model building. If a feature is not relevant, provide reasons and a description
of the feature.
3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Exploratory Data Analysis (EDA):
4.1. Summary.
4.2. Univariate analysis.
4.3. Bivariate analysis.
5. Model Building
5.1 Build the model on the scaled data (try multiple options).
5.2 Perform PCA and obtain the components that capture the maximum variance.
5.3 Perform clustering before and after applying PCA to cross-check the number of clusters
formed.
5.4 Briefly explain the model output in the documentation.
6. Write about the benefits/impact of the solution - in what way does the business (client)
benefit from the solution provided?
Problem Statement:
© 2013 - 2021 360DigiTMG. All Rights Reserved.
1. Perform hierarchical and K-means clustering on the dataset. After that, perform PCA on the
dataset and extract the first 3 principal components and make a new dataset with these 3
principal components as the columns. Now, on this new dataset, perform hierarchical and K-
means clustering. Compare the results of clustering on the original dataset and clustering on the
principal components dataset (use the scree plot technique to obtain the optimum number of
clusters in K-means clustering and check if you’re getting similar results with and without PCA).
Solution:
1. Business Problem
Hierarchical clustering
import pandas as pd
import matplotlib.pylab as plt

wine1 = pd.read_csv(r"E:\yangmiso yangya\data science\pca\\Wine.csv")
wine1.describe()
wine1.info()

# Normalization function (min-max scaling to [0, 1])
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x
import scipy.cluster.hierarchy as sch

# Normalize the features (drop the Type label first) and build the linkage matrix
df_norm = norm_func(wine1.iloc[:, 1:])
z = sch.linkage(df_norm, method = "complete", metric = "euclidean")

# Dendrogram
plt.figure(figsize=(15, 8));plt.title('Hierarchical Clustering Dendrogram');plt.xlabel('Index');plt.ylabel('Distance')
sch.dendrogram(z,
               leaf_rotation = 0,  # rotates the x axis labels
               leaf_font_size = 10  # font size for the x axis labels
)
plt.show()
# Now applying AgglomerativeClustering, choosing 5 clusters from the above dendrogram
from sklearn.cluster import AgglomerativeClustering
h_complete = AgglomerativeClustering(n_clusters = 5, linkage = "complete").fit(df_norm)
cluster_labels = pd.Series(h_complete.labels_)
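As a cross-check on the five-cluster cut, the dendrogram can also be cut directly in scipy with `fcluster`; a minimal sketch on synthetic blobs (the `make_blobs` stand-in data and its parameters are illustrative, not the actual wine features):

```python
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Five well-separated synthetic blobs stand in for the normalized wine features
X, _ = make_blobs(n_samples=60, centers=5, cluster_std=0.5, random_state=0)

# Complete-linkage tree, then cut it into at most 5 flat clusters
z = sch.linkage(X, method="complete")
flat = sch.fcluster(z, t=5, criterion="maxclust")
print(sorted(set(flat)))
```

With `criterion="maxclust"`, scipy picks the smallest distance threshold that yields at most `t` flat clusters, which should agree with `AgglomerativeClustering(n_clusters=5, linkage="complete")` on the same data.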
# K-means clustering on the wine data
from sklearn.cluster import KMeans

# Normalization function (min-max scaling to [0, 1])
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x

df_norm = norm_func(wine1.iloc[:, 1:])

# Elbow method: total within-cluster sum of squares (TWSS) for k = 2..8
TWSS = []
k = list(range(2, 9))
for i in k:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df_norm)
    TWSS.append(kmeans.inertia_)
TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")
# Selecting 3 clusters from the above scree plot, which is the optimum number of clusters
model = KMeans(n_clusters = 3)
model.fit(df_norm)

# Attach the cluster labels to the original features
wine = wine1.drop(["Type"], axis = 1)
wine["clust"] = pd.Series(model.labels_)
wine.head()
df_norm.head()

# Move the cluster column to the front
wine = wine.iloc[:,[13,0,1,2,3,4,5,6,7,8,9,10,11,12]]
wine.head()
After applying hierarchical and K-means clustering, we find that both methods divide the dataset
into clusters of similar sizes, and the resulting cluster assignments are almost identical to each other.
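The "almost identical" claim can be made quantitative with the adjusted Rand index, which scores agreement between two partitions regardless of label numbering. A sketch on synthetic blobs standing in for the normalized wine data (all data-generation parameters here are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Three well-separated blobs as a stand-in for the normalized wine features
X, _ = make_blobs(n_samples=178, centers=3, cluster_std=0.5, random_state=42)

hier = AgglomerativeClustering(n_clusters=3, linkage="complete").fit_predict(X)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# ARI is 1.0 when the partitions agree exactly (up to relabeling), ~0 for random agreement
ari = adjusted_rand_score(hier, km)
print(round(ari, 3))
```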
wine1.info()
wine = wine1.drop(["Type"], axis = 1)

import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale

# Standardize the features before PCA
wine_normal = scale(wine)

pca = PCA(n_components = 6)
pca_values = pca.fit_transform(wine_normal)
# PCA weights (loadings)
pca.components_
pca.components_[0]

# Proportion of variance explained by each component
var = pca.explained_variance_ratio_

# Cumulative variance
var1 = np.cumsum(np.round(var, decimals = 4) * 100)
var1
pca_data = pd.DataFrame(pca_values)
pca_data.columns = ["comp0", "comp1", "comp2", "comp3", "comp4", "comp5"]

# Keep the Type label and the first three principal components
final = pd.concat([wine1.Type, pca_data.iloc[:, 0:3]], axis = 1)
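Retaining the first three components is usually justified by the cumulative explained-variance curve computed in `var1` above. A self-contained sketch on synthetic correlated data (the latent-factor construction is purely illustrative, not the wine data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
# Synthetic data driven by 3 latent factors plus small noise,
# standing in for the 13 scaled wine features
latent = rng.normal(size=(178, 3))
X = scale(latent @ rng.normal(size=(3, 13)) + 0.1 * rng.normal(size=(178, 13)))

pca = PCA(n_components=6).fit(X)
# Cumulative percentage of variance explained by the first 1..6 components
cum_var = np.cumsum(pca.explained_variance_ratio_) * 100
print(np.round(cum_var, 1))
```

On data like this, the curve flattens sharply after the third component, which is the usual criterion for keeping three.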
# Scatter diagram
import matplotlib.pylab as plt
ax = final.plot(x='comp0', y='comp1', kind='scatter',figsize=(12,8))
K-means clustering after applying PCA
final1 = final.drop(["Type"], axis = 1)

# Elbow method on the PCA scores
TWSS = []
k = list(range(2, 9))
for i in k:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(final1)
    TWSS.append(kmeans.inertia_)
TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")
Fig. Elbow curve after applying PCA to the dataset.
From the elbow graph we see that the slope of both curves changes at k = 3, so the optimum
number of clusters is the same with and without PCA.
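The whole comparison can be reproduced end to end: compute the TWSS curve on the raw features and again on their first three principal components, and check that the elbow lands at the same k. A sketch on synthetic three-cluster data (the blob parameters are illustrative, not the wine features):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Three-cluster synthetic data standing in for the 13 wine features
X, _ = make_blobs(n_samples=178, centers=3, n_features=13,
                  cluster_std=1.0, random_state=1)
X_pca = PCA(n_components=3).fit_transform(X)

def twss_curve(data, ks):
    # Total within-cluster sum of squares (KMeans inertia) for each candidate k
    return [KMeans(n_clusters=kk, n_init=10, random_state=1).fit(data).inertia_
            for kk in ks]

ks = range(2, 8)
raw_twss = twss_curve(X, ks)
pca_twss = twss_curve(X_pca, ks)
print([round(v) for v in raw_twss])
print([round(v) for v in pca_twss])
```

On both curves the drop from k=2 to k=3 dwarfs the drop from k=3 to k=4, i.e. the elbow sits at k=3 with and without PCA.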
2. A pharmaceuticals manufacturing company is conducting a study on a new medicine to treat heart
diseases. The company has gathered data from its secondary sources and would like you to provide
high level analytical insights on the data. Its aim is to segregate patients depending on their age
group and other factors given in the data. Perform PCA and clustering algorithms on the dataset and
check if the clusters formed before and after PCA are the same and provide a brief report on your
model. You can also explore more ways to improve your model.
Solution:
Business Problem
1. What is the business objective?
Answer:
Maximize: prevention and early prediction of heart disease.
Minimize: the incidence of heart disease among patients.
Hierarchical clustering
import pandas as pd
import matplotlib.pylab as plt

heart1 = pd.read_csv(r"E:\yangmiso yangya\data science\pca\\heart disease.csv")
heart1.describe()
heart1.info()

# Normalization function (min-max scaling to [0, 1])
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x
import scipy.cluster.hierarchy as sch

# Normalize the features and build the linkage matrix
df_norm = norm_func(heart1)
z = sch.linkage(df_norm, method = "complete", metric = "euclidean")

# Dendrogram
plt.figure(figsize=(15, 8));plt.title('Hierarchical Clustering Dendrogram');plt.xlabel('Index');plt.ylabel('Distance')
sch.dendrogram(z,
               leaf_rotation = 0,  # rotates the x axis labels
               leaf_font_size = 10  # font size for the x axis labels
)
plt.show()
K-means clustering
# K-means on the heart disease dataset
from sklearn.cluster import KMeans

heart1 = pd.read_csv(r"E:\yangmiso yangya\data science\pca\\heart disease.csv")
heart1.describe()

# Normalization function (min-max scaling to [0, 1])
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x

df_norm = norm_func(heart1)

# Elbow method: total within-cluster sum of squares (TWSS) for k = 2..8
TWSS = []
k = list(range(2, 9))
for i in k:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df_norm)
    TWSS.append(kmeans.inertia_)
TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")
# Selecting 4 clusters from the above scree plot, which is the optimum number of clusters
model = KMeans(n_clusters = 4)
model.fit(df_norm)

# Attach the cluster labels and move the cluster column to the front
heart1["clust"] = pd.Series(model.labels_)
heart1 = heart1.iloc[:,[14,0,1,2,3,4,5,6,7,8,9,10,11,12,13]]
heart1.head()

# Mean of each feature per cluster
heart1.iloc[:, 0:].groupby(heart1.clust).mean()
# PCA on the heart disease data
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Standardize the features before PCA
heart_normal = scale(heart1)

pca = PCA(n_components = 6)
pca_values = pca.fit_transform(heart_normal)

# PCA weights (loadings)
pca.components_
pca.components_[0]

# Proportion of variance explained by each component
var = pca.explained_variance_ratio_

# Cumulative variance
var1 = np.cumsum(np.round(var, decimals = 4) * 100)
var1
# PCA scores
pca_values
pca_data = pd.DataFrame(pca_values)
pca_data.columns = ["comp0", "comp1", "comp2", "comp3", "comp4", "comp5"]

# Keep the first three principal components
final = pca_data.iloc[:, 0:3]
# Scatter diagram
import matplotlib.pylab as plt
ax = final.plot(x='comp0', y='comp1', kind='scatter',figsize=(12,8))
There is no correlation between the first two components, as we observe from the scatter plot.
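This is expected rather than accidental: principal-component scores are uncorrelated by construction, so the comp0-vs-comp1 scatter should never show a linear trend. A quick sketch verifying this on random correlated data (the generated matrix is illustrative, not the heart-disease data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Correlated synthetic features standing in for the scaled heart-disease data
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

scores = PCA(n_components=3).fit_transform(X)
# Correlation matrix of the component scores: off-diagonals are ~0 by construction
corr = np.corrcoef(scores, rowvar=False)
print(np.round(corr, 6))
```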
K-means clustering after applying PCA
model1 = KMeans(n_clusters = 3).fit(final)

# Elbow method on the PCA scores
TWSS = []
k = list(range(2, 9))
for i in k:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(final)
    TWSS.append(kmeans.inertia_)
TWSS
# Scree plot
plt.plot(k, TWSS, 'ro-');plt.xlabel("No_of_Clusters");plt.ylabel("total_within_SS")
From the graph we see that the slopes of the two elbow curves differ slightly from one another, but overall they are identical: the elbow occurs at the same number of clusters with and without PCA.
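As with the wine data, this agreement can be quantified: cluster once on the raw features and once on the first three principal components, then compare the two label vectors with the adjusted Rand index. A sketch on synthetic data standing in for the heart-disease features (all generation parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Four-cluster synthetic data standing in for the heart-disease features
X, _ = make_blobs(n_samples=300, centers=4, n_features=14,
                  cluster_std=1.0, random_state=3)

labels_raw = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(X)
labels_pca = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(
    PCA(n_components=3).fit_transform(X))

# ARI close to 1 means the two partitions agree up to relabeling
ari = adjusted_rand_score(labels_raw, labels_pca)
print(round(ari, 3))
```

When the cluster structure is well captured by the first few components, as here, the two partitions coincide almost exactly.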