Day12 Hierarchical Clustering
Instructions:
Please share your answers wherever applicable in-line with the word document.
Submit code separately wherever applicable.
Guidelines:
1. An assignment submission is considered complete only when correct, executable code is
submitted along with documentation explaining the method and results. A submission that is
missing either of these is invalid and will not be marked as correct.
2. Ensure that you submit your assignments correctly and in full. Resubmission is not allowed.
3. After submission, you can evaluate your work against the answer keys provided (the keys
become available only after submission).
Hints:
1. Business Problem
1.1. What is the business objective?
1.2. What are the constraints?
2. Work on each feature of the dataset to create a data dictionary as displayed in the image
below:
3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Exploratory Data Analysis (EDA):
4.1. Summary.
This study source was downloaded by 100000866404070 from CourseHero.com on 07-26-2023 06:18:46 GMT -05:00
5. Model Building
5.1 Build the model on the scaled data (try multiple options).
5.2 Perform the hierarchical clustering and visualize the clusters using dendrogram.
5.3 Validate the clusters (try different numbers of clusters), label the clusters, and
derive insights (compare the results from multiple approaches).
6. Write about the benefits/impact of the solution - in what way does the business (client)
benefit from the solution provided?
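Hint 5.3 asks for validation with different numbers of clusters. One possible way to compare candidates is the silhouette score; this is an assumption, since the hints do not name a validation metric. The sketch below runs on synthetic data (three artificial blobs standing in for a scaled dataset) and is only meant to show the shape of the loop.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a scaled dataset: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# Fit complete-linkage clustering for several cluster counts and score each one.
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="complete").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# A higher silhouette score indicates tighter, better-separated clusters.
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On real data the scores are usually much closer together, so the silhouette value is better read alongside the dendrogram and the business context than as a final answer.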
Problem Statements:
1. Perform clustering for the airlines data to obtain the optimum number of clusters.
Draw inferences from the clusters obtained. Refer to the EastWestAirlines.xlsx
dataset.
CODE –
import numpy as np
import matplotlib.pylab as plt
import pandas as pd
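# Agglomerative clustering
# df_norm is the normalized data frame; the code that builds it is not shown in this section.
from sklearn.cluster import AgglomerativeClustering
# scikit-learn 1.2 renamed the `affinity` parameter to `metric`, and 1.4 removed `affinity`;
# on older versions, pass affinity="euclidean" instead.
h_complete = AgglomerativeClustering(n_clusters=3, linkage="complete", metric="euclidean").fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)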
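Once the labels are available, they can be attached to the data to profile each cluster, which is what "draw inferences" usually amounts to in practice. A minimal sketch, using a hypothetical toy frame in place of the airlines data (the column names and label values are illustrative only, not taken from EastWestAirlines.xlsx):

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "balance": [100, 120, 900, 950, 5000, 5200],
    "flights": [1, 2, 10, 12, 40, 42],
})
# Hypothetical labels, standing in for pd.Series(h_complete.labels_).
cluster_labels = pd.Series([0, 0, 1, 1, 2, 2])

# Attach the labels, then summarize each cluster by its feature means and size.
df["cluster"] = cluster_labels
profile = df.groupby("cluster").mean()
sizes = df["cluster"].value_counts().sort_index()
print(profile)
print(sizes)
```

Comparing the per-cluster means on the original (unscaled) columns tends to be easier to interpret than comparing them on normalized values.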
2. Perform clustering for the crime data, identify the number of clusters
formed, and draw inferences. Refer to the crime_data.csv dataset.
CODE-
import pandas as pd
import matplotlib.pylab as plt
crime = pd.read_csv("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/crime_data.csv")
# Data cleansing
crime.describe()
crime.info()
# Remove the unnamed index column from the dataset
crime.rename({"Unnamed: 0": "a"}, axis="columns", inplace=True)
crimenew = crime.drop(["a"], axis=1)
# Normalization function (min-max scaling, applied column by column)
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x
# Normalized data frame (considering the numerical part of data)
df_norm = norm_func(crimenew.iloc[:, 0:])
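norm_func applies min-max scaling to each column, so every feature ends up in the range [0, 1] and no single large-valued feature dominates the Euclidean distances. A small check on toy values (hypothetical numbers, not from crime_data.csv):

```python
import pandas as pd

def norm_func(i):
    # Min-max scaling: subtract the column minimum, divide by the column range.
    return (i - i.min()) / (i.max() - i.min())

# Hypothetical toy frame with two columns on very different scales.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})
scaled = norm_func(toy)
print(scaled)
```

One caveat: a column whose values are all identical has a range of zero, so this formula yields NaN for it. Such columns carry no information for clustering and can be dropped beforehand.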
CODE-
import pandas as pd
import matplotlib.pylab as plt
telco = pd.read_excel("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/Telco_customer_churn.xlsx")
telco.describe()
telco.info()
telconew = telco.drop(["Customer ID", "Count", "Quarter", "Phone Service", "Offer",
                       "Payment Method", "Internet Type", "Paperless Billing", "Contract",
                       "Device Protection Plan", "Referred a Friend"], axis=1)
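# Create dummy variables
# pandas 2.0+ returns boolean dummy columns by default; dtype=int keeps them
# numeric for the arithmetic in norm_func.
df_new = pd.get_dummies(telconew, dtype=int)
df_new_1 = pd.get_dummies(telconew, drop_first=True, dtype=int)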
# Normalization
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x
df_norm = norm_func(df_new_1.iloc[:, 1:])
df_norm.describe()
# Dendrogram
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch
z = linkage(df_norm, method="complete", metric="euclidean")
plt.figure(figsize=(15, 8))
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Index")
plt.ylabel("Distance")
sch.dendrogram(z, leaf_rotation=0, leaf_font_size=10)
plt.show()
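The dendrogram only visualizes the merge tree. To obtain cluster labels from the same kind of linkage matrix, the tree can be cut with SciPy's fcluster. A sketch on synthetic data (the two artificial groups and the choice of two clusters are assumptions made for illustration, not values from the Telco data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for df_norm: two tight, well-separated groups of 10 points.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 2)),
    rng.normal(3.0, 0.1, size=(10, 2)),
])

# Same linkage settings as in the dendrogram code.
z = linkage(X, method="complete", metric="euclidean")

# Cut the tree so that at most two flat clusters remain; labels start at 1.
labels = fcluster(z, t=2, criterion="maxclust")
print(labels)
```

Passing criterion="distance" instead cuts the tree at a height read off the dendrogram's distance axis, which is often the more natural choice after inspecting the plot.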
import pandas as pd