
Hierarchical Clustering

Instructions:
Please share your answers in-line in the Word document wherever applicable.
Submit code separately wherever applicable.

Please ensure you update all the details:


Name: hari machavarapu
Batch ID: dswdcmb 150622h
Topic: Hierarchical Clustering

Guidelines:
1. An assignment submission is considered complete only when correct and executable code is
submitted along with documentation explaining the method and results. A submission missing
either of these is invalid and will not be counted as correct.

2. Ensure that you submit your assignments correctly and in full. Resubmission is not allowed.

3. After submission, you can evaluate your work by referring to the keys provided (available
only after submission).

Hints:
1. Business Problem
1.1. What is the business objective?
1.2. What are the constraints?

2. Work on each feature of the dataset to create a data dictionary as shown in the image
below:

3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Exploratory Data Analysis (EDA):
4.1. Summary.

This study source was downloaded by 100000866404070 from CourseHero.com on 07-26-2023 06:18:46 GMT -05:00

© 2013 - 2022 360DigiTMG. All Rights Reserved.


https://www.coursehero.com/file/159682543/Day12-Hierarchical-Clusteringdocx/
4.2. Univariate analysis.
4.3. Bivariate analysis.

5. Model Building
5.1 Build the model on the scaled data (try multiple options).
5.2 Perform the hierarchical clustering and visualize the clusters using dendrogram.
5.3 Validate the clusters (try with different number of clusters), label the clusters, and
derive insights (compare the results from multiple approaches).
6. Write about the benefits/impact of the solution - in what way does the business (client)
benefit from the solution provided?
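
Hints 5.1 and 5.3 ask for multiple options and different numbers of clusters to be tried. One possible way to compare them is the silhouette score, sketched below. This is an illustration only: the data is synthetic (make_blobs stands in for a real dataset), and the silhouette score is one validation measure among several, not a method prescribed by the assignment.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): 150 rows, 4 numeric features
X, _ = make_blobs(n_samples=150, centers=3, n_features=4,
                  cluster_std=0.8, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)  # scale every feature to [0, 1]

# Try several linkage options and several cluster counts
results = {}
for linkage_name in ("complete", "average", "ward"):
    for k in range(2, 7):
        model = AgglomerativeClustering(n_clusters=k, linkage=linkage_name)
        labels = model.fit_predict(X_scaled)
        results[(linkage_name, k)] = silhouette_score(X_scaled, labels)

# Higher silhouette means tighter, better separated clusters
best_linkage, best_k = max(results, key=results.get)
print(f"best: linkage={best_linkage}, k={best_k}, "
      f"silhouette={results[(best_linkage, best_k)]:.3f}")
```

The score should be read together with the dendrogram and the business context, since the numerically best k is not always the most useful segmentation.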

Problem Statements:
1. Perform clustering for the airlines data to obtain the optimum number of clusters.
Draw inferences from the clusters obtained. Refer to the EastWestAirlines.xlsx
dataset.

CODE –

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.preprocessing import MinMaxScaler

# Load the data (path as in the original submission)
flight = pd.read_excel("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/EastWestAirlines.xlsx", sheet_name='data')
flight.describe()
flight.info()

# NORMALIZING: min-max scale each column to the range [0, 1]
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x

# Normalize every column except the first
df_norm = norm_func(flight.iloc[:, 1:])
df_norm.describe()

# DENDROGRAM
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch

z = linkage(df_norm, method="complete", metric="euclidean")
plt.figure(figsize=(15, 8))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
sch.dendrogram(z,
    leaf_rotation=0,    # rotates the x axis labels
    leaf_font_size=10   # font size for the x axis labels
)
plt.show()

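# AGGLOMERATIVE CLUSTERING
from sklearn.cluster import AgglomerativeClustering

# Euclidean distance is the default, so it is left implicit here; the original
# `affinity` argument was renamed `metric` in newer scikit-learn releases.
h_complete = AgglomerativeClustering(n_clusters=3, linkage='complete').fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)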

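flight['clust'] = cluster_labels  # add the cluster label as a new column
# Move 'clust' to the front by name; hard-coded positions can silently drop it
# when the column count differs from what the index list assumes
flight = flight[['clust'] + [c for c in flight.columns if c != 'clust']]
flight.head()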
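
The script above stops at labelling the rows, while the problem also asks for inferences. A minimal sketch of one way to get there is to cut the same kind of linkage matrix with SciPy's fcluster and summarise each cluster. The table below is random stand-in data with hypothetical column names, not the airline dataset, so the numbers carry no meaning.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Hypothetical stand-in for a normalized numeric table (values already in [0, 1])
df_demo = pd.DataFrame(rng.random((60, 3)),
                       columns=["balance", "bonus_miles", "flight_miles"])

# Same linkage settings as the script above
z_demo = linkage(df_demo.to_numpy(), method="complete", metric="euclidean")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1)
labels = fcluster(z_demo, t=3, criterion="maxclust")
df_demo["clust"] = labels

sizes = df_demo["clust"].value_counts().sort_index()  # rows per cluster
profile = df_demo.groupby("clust").mean()             # mean of each feature per cluster
print(sizes)
print(profile)
```

Comparing the rows of the profile table is what turns cluster labels into statements such as "cluster 2 has the highest average balance".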

2. Perform clustering for the crime data and identify the number of clusters
formed and draw inferences. Refer to crime_data.csv dataset.

CODE-
import pandas as pd
import matplotlib.pyplot as plt

crime = pd.read_csv("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/crime_data.csv")
crime.describe()
crime.info()

# Data cleansing: rename the unnamed first column and leave it out of the clustering input
crime.rename({"Unnamed: 0": "a"}, axis="columns", inplace=True)
crimenew = crime.drop(["a"], axis=1)

# Normalization function: min-max scale each column to the range [0, 1]
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x

# Normalized data frame (considering the numerical part of the data)
df_norm = norm_func(crimenew)

df_norm.describe()

# Linkage matrix for the dendrogram
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch

z = linkage(df_norm, method="complete", metric="euclidean")

# Dendrogram
plt.figure(figsize=(15, 8))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
sch.dendrogram(z,
    leaf_rotation=0,    # rotates the x axis labels
    leaf_font_size=10   # font size for the x axis labels
)
plt.show()

# Apply AgglomerativeClustering with 3 clusters, chosen from the dendrogram above
from sklearn.cluster import AgglomerativeClustering

# Euclidean distance is the default, so it is left implicit here; the original
# `affinity` argument was renamed `metric` in newer scikit-learn releases.
h_complete = AgglomerativeClustering(n_clusters=3, linkage='complete').fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)
crime['clust'] = cluster_labels  # add the cluster label as a new column
# Move 'clust' to the front without hard-coding column positions
crime = crime[['clust'] + [c for c in crime.columns if c != 'clust']]
crime.head()
# Inferences
# Count the rows in each cluster
crime['clust'].value_counts()
# Mean of each variable per cluster
crime.iloc[:, 2:].groupby(crime['clust']).mean()
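
The script above uses complete linkage only, while hint 5.3 asks for results from multiple approaches to be compared. One common diagnostic is the cophenetic correlation, which measures how faithfully a dendrogram preserves the original pairwise distances. The sketch below runs on random stand-in data (an assumption, not the crime dataset) and is meant only to show the calls involved.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(7)
X_demo = rng.random((50, 4))  # stand-in for a normalized 50 x 4 table

# Condensed matrix of the original pairwise Euclidean distances
dists = pdist(X_demo, metric="euclidean")

scores = {}
for method in ("single", "complete", "average", "ward"):
    z_m = linkage(X_demo, method=method, metric="euclidean")
    # cophenet returns the correlation and the cophenetic distances
    c, _ = cophenet(z_m, dists)
    scores[method] = c

for method, c in scores.items():
    print(f"{method:>8}: cophenetic correlation = {c:.3f}")
```

A value closer to 1 means the tree distorts the distances less. It is a guide for choosing a linkage method, not a substitute for inspecting the resulting clusters.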

3. Perform clustering analysis on the telecom data set. The data is a mixture of both
categorical and numerical data and covers customers who have churned. Derive insights
and identify factors that may affect the churn decision. Refer to the
Telco_customer_churn.xlsx dataset.

CODE-
import pandas as pd
import matplotlib.pyplot as plt

telco = pd.read_excel("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/Telco_customer_churn.xlsx")
telco.describe()
telco.info()

# Drop identifier columns and the categorical columns left out of this analysis
telconew = telco.drop(["Customer ID", "Count", "Quarter", "Phone Service", "Offer",
                       "Payment Method", "Internet Type", "Paperless Billing", "Contract",
                       "Device Protection Plan", "Referred a Friend"], axis=1)

# Create dummy variables; dtype=float keeps them numeric, because recent pandas
# versions return boolean dummies by default and booleans cannot be subtracted
df_new = pd.get_dummies(telconew, dtype=float)                      # every level kept
df_new_1 = pd.get_dummies(telconew, drop_first=True, dtype=float)   # used below

# Normalization: min-max scale each column to the range [0, 1]
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x

# As in the original, the first column of df_new_1 is skipped
df_norm = norm_func(df_new_1.iloc[:, 1:])
df_norm.describe()

# Dendrogram
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch

z = linkage(df_norm, method="complete", metric="euclidean")
plt.figure(figsize=(15, 8))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
sch.dendrogram(z,
    leaf_rotation=0,
    leaf_font_size=10
)
plt.show()

# AgglomerativeClustering
from sklearn.cluster import AgglomerativeClustering

# Euclidean distance is the default, so it is left implicit here; the original
# `affinity` argument was renamed `metric` in newer scikit-learn releases.
h_complete = AgglomerativeClustering(n_clusters=3, linkage='complete').fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)
telco['clust'] = cluster_labels  # add the cluster label as a new column
# Move 'clust' to the front without hard-coding column positions
telco = telco[['clust'] + [c for c in telco.columns if c != 'clust']]
telco.head()
# Count the rows in each cluster
telco['clust'].value_counts()
# Mean of each numeric column per cluster (text columns are skipped)
total = telco.iloc[:, 2:].groupby(telco['clust']).mean(numeric_only=True)
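
Cluster means only describe the numeric columns, while the problem also asks about categorical factors that may affect churn. A minimal sketch of one way to profile a categorical column per cluster is a row-normalized cross-tabulation. The tiny table and its column names (contract, monthly_charge) are invented for illustration and are not taken from the telecom file.

```python
import pandas as pd

# Hypothetical labelled data: cluster id, one categorical and one numeric column
demo = pd.DataFrame({
    "clust": [0, 0, 0, 1, 1, 2, 2, 2],
    "contract": ["Monthly", "Monthly", "Yearly", "Yearly", "Yearly",
                 "Monthly", "Two-year", "Two-year"],
    "monthly_charge": [70.0, 80.0, 65.0, 40.0, 45.0, 95.0, 20.0, 25.0],
})

# Share of each contract type within every cluster (each row sums to 1)
share = pd.crosstab(demo["clust"], demo["contract"], normalize="index")

# Numeric profile for comparison; text columns are skipped
numeric_profile = demo.groupby("clust").mean(numeric_only=True)

print(share.round(2))
print(numeric_profile)
```

Reading the two tables side by side shows, for example, whether a cluster with high charges is also dominated by one contract type.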

4. Perform clustering on mixed data. Convert the categorical variables to numeric by
using dummies or label encoding and perform normalization techniques. The data
set consists of details of customers related to their auto insurance. Refer to
Autoinsurance.csv dataset.
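
The problem statement above names two ways of converting categorical variables: dummies and label encoding. The sketch below applies both to a small invented table, followed by min-max normalization. The column names and values are hypothetical and chosen only to make the difference visible.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical mixed data: two categorical columns and one numeric column
toy = pd.DataFrame({
    "coverage": ["Basic", "Extended", "Premium", "Basic"],
    "gender": ["F", "M", "M", "F"],
    "premium": [60.0, 95.0, 120.0, 70.0],
})

# Option 1: dummy (one-hot) variables, dropping the first level of each column
onehot = pd.get_dummies(toy, drop_first=True, dtype=float)

# Option 2: label encoding, one integer code per category (alphabetical order)
labelled = toy.copy()
for col in ["coverage", "gender"]:
    labelled[col] = LabelEncoder().fit_transform(labelled[col])

# Min-max normalization brings every column into the range [0, 1]
onehot_scaled = pd.DataFrame(MinMaxScaler().fit_transform(onehot),
                             columns=onehot.columns)
labelled_scaled = pd.DataFrame(MinMaxScaler().fit_transform(labelled),
                               columns=labelled.columns)

print(onehot_scaled)
print(labelled_scaled)
```

Label encoding imposes an order on the categories (here Basic < Extended < Premium), which affects Euclidean distances. Dummies avoid that implied order at the cost of more columns.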

CODE –

import pandas as pd
import matplotlib.pyplot as plt

auto = pd.read_csv("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/AutoInsurance.csv")
auto.describe()
auto.info()

# Drop the columns left out of this analysis
auto1 = auto.drop(["Customer", "State", "Effective To Date"], axis=1)

# Create dummy variables; dtype=float keeps them numeric, because recent pandas
# versions return boolean dummies by default and booleans cannot be subtracted
df_new = pd.get_dummies(auto1, dtype=float)                      # every level kept
df_new_1 = pd.get_dummies(auto1, drop_first=True, dtype=float)   # used below

# Normalization: min-max scale each column to the range [0, 1]
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x

df_norm = norm_func(df_new_1)
df_norm.describe()

# Dendrogram
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch

z = linkage(df_norm, method="complete", metric="euclidean")
plt.figure(figsize=(15, 8))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Index')
plt.ylabel('Distance')
sch.dendrogram(z,
    leaf_rotation=0,    # rotates the x axis labels
    leaf_font_size=10   # font size for the x axis labels
)
plt.show()

# AgglomerativeClustering
from sklearn.cluster import AgglomerativeClustering

# Euclidean distance is the default, so it is left implicit here; the original
# `affinity` argument was renamed `metric` in newer scikit-learn releases.
h_complete = AgglomerativeClustering(n_clusters=3, linkage='complete').fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)
auto['clust'] = cluster_labels  # add the cluster label as a new column
# Move 'clust' to the front without hard-coding column positions
auto = auto[['clust'] + [c for c in auto.columns if c != 'clust']]
auto.head()
# Mean of each numeric column per cluster (text columns are skipped)
total = auto.iloc[:, 3:].groupby(auto['clust']).mean(numeric_only=True)
# Count the rows in each cluster
auto['clust'].value_counts()
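
With thousands of rows, a dendrogram that draws every observation as a leaf becomes unreadable. A possible remedy is SciPy's truncated dendrogram, which shows only the last merges. The sketch below uses random stand-in data (an assumption, not the insurance dataset), and the choice of 20 leaves is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(1)
X_big = rng.random((500, 5))  # stand-in for a large normalized table

z_big = linkage(X_big, method="complete", metric="euclidean")

fig, ax = plt.subplots(figsize=(15, 8))
# Show only the last 20 merged clusters instead of all 500 leaves
info = dendrogram(z_big,
                  truncate_mode="lastp",
                  p=20,
                  show_contracted=True,  # marks the heights of hidden merges
                  leaf_rotation=90,
                  leaf_font_size=10,
                  ax=ax)
ax.set_title("Hierarchical Clustering Dendrogram (truncated)")
ax.set_xlabel("Cluster size in parentheses, or row index")
ax.set_ylabel("Distance")
# Call plt.show() in an interactive session to display the figure
```

Leaf labels in parentheses give the number of original rows merged into that branch, which makes large gaps between merge heights easier to spot when choosing the number of clusters.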
