Day12 Hierarchical Clustering

Hierarchical Clustering
Instructions:
Please share your answers wherever applicable in-line with the word document.
Submit code separately wherever applicable.
Please ensure you update all the details:

Name: hari machavarapu    Batch ID: dswdcmb 150622h
Topic: Hierarchical Clustering
Guidelines:
1. An assignment submission is considered complete only when correct and executable code is
submitted along with documentation explaining the method and results. A submission missing either
of these is invalid.
2. Ensure that you submit your assignments correctly and in full. Resubmission is not allowed.
3. After submission, you can evaluate your work by referring to the keys provided (available
only after submission).
Hints:
1. Business Problem
1.1. What is the business objective?
1.2. What are the constraints?
2. Work on each feature of the dataset to create a data dictionary as displayed in the below
image:
3. Data Pre-processing
3.1 Data Cleaning, Feature Engineering, etc.
4. Exploratory Data Analysis (EDA):
4.1. Summary.
© 2013 - 2022 360DigiTMG. All Rights Reserved.

4.2. Univariate analysis.
4.3. Bivariate analysis.
5. Model Building
5.1 Build the model on the scaled data (try multiple options).
5.2 Perform the hierarchical clustering and visualize the clusters using dendrogram.
5.3 Validate the clusters (try with different number of clusters), label the clusters, and
derive insights (compare the results from multiple approaches).
6. Write about the benefits/impact of the solution - in what way does the business (client)
benefit from the solution provided?
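Hints 5.1 to 5.3 (scale the data, try multiple options, validate with different numbers of clusters) can be sketched as a small grid search. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_blobs`), and the silhouette score is one possible validation metric that the assignment itself does not prescribe.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a real dataset (assumption for illustration only).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)
X_scaled = MinMaxScaler().fit_transform(X)  # hint 5.1: build the model on scaled data

# Hints 5.1 and 5.3: try several linkage methods and several cluster counts.
results = {}
for method in ("complete", "average", "ward"):
    for k in range(2, 6):
        labels = AgglomerativeClustering(n_clusters=k, linkage=method).fit_predict(X_scaled)
        results[(method, k)] = silhouette_score(X_scaled, labels)

# Pick the combination with the highest silhouette score.
best_method, best_k = max(results, key=results.get)
print(best_method, best_k, round(results[(best_method, best_k)], 3))
```

A higher silhouette score suggests tighter, better separated clusters, but the final choice of cluster count should still make sense for the business problem.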
Problem Statements:
1. Perform clustering on the airlines data to obtain the optimum number of clusters.
Draw inferences from the clusters obtained. Refer to the EastWestAirlines.xlsx
dataset.
CODE –
import numpy as np
import matplotlib.pylab as plt
import pandas as pd

from sklearn.preprocessing import MinMaxScaler
flight = pd.read_excel("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/EastWestAirlines.xlsx", sheet_name='data')
flight.describe()
flight.info()
# NORMALIZING (min-max scaling of each column to the range [0, 1])
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x
df_norm = norm_func(flight.iloc[:, 1:])
df_norm.describe()
# DENDROGRAM
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch
z = linkage(df_norm, method="complete", metric="euclidean")
plt.figure(figsize=(15, 8)); plt.title('Hierarchical Clustering Dendrogram'); plt.xlabel('Index'); plt.ylabel('Distance')
sch.dendrogram(z,
    leaf_rotation=0,   # rotates the x axis labels
    leaf_font_size=10  # font size for the x axis labels
)
plt.show()
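# AGGLOMERATIVE CLUSTERING
from sklearn.cluster import AgglomerativeClustering
# scikit-learn 1.2 renamed `affinity` to `metric` (`affinity` was removed in 1.4);
# on older versions pass affinity="euclidean" instead.
h_complete = AgglomerativeClustering(n_clusters=3, linkage='complete', metric="euclidean").fit(df_norm)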
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)

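flight['clust'] = cluster_labels # creating a new column and assigning the cluster labels to it
# Move the 'clust' column to the front (selected by name, so it works for any column count)
flight = flight[['clust'] + [c for c in flight.columns if c != 'clust']]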
flight.head()
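The problem asks for the optimum number of clusters, while the code above fixes it at 3 by reading the dendrogram. One common heuristic, sketched here on synthetic data rather than the airlines dataset, is to cut the tree inside the largest gap between successive merge distances in the linkage matrix. The data, the seed, and the variable names `heights`, `gaps` and `best_k` are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data: three tight groups of 20 points each (assumption for illustration).
rng = np.random.default_rng(0)
centers = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
X = np.vstack([rng.normal(loc=c, scale=0.03, size=(20, 2)) for c in centers])

z = linkage(X, method="complete", metric="euclidean")

# Column 2 of the linkage matrix holds the merge distances in ascending order.
heights = z[:, 2]
gaps = np.diff(heights)

# Cutting after merge i leaves len(X) - 1 - i clusters; choose the widest gap.
i = int(np.argmax(gaps))
best_k = len(X) - 1 - i

# Flat cluster labels (numbered from 1) for the chosen cut.
labels = fcluster(z, t=best_k, criterion="maxclust")
print(best_k, np.bincount(labels)[1:])
```

The largest-gap rule is only a heuristic; on real data the gaps are usually less clear, so it is worth comparing the result against the dendrogram and a validation score.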
2. Perform clustering for the crime data and identify the number of clusters
formed and draw inferences. Refer to crime_data.csv dataset.
CODE-
import pandas as pd
crime = pd.read_csv("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/crime_data.csv")
#Data cleansing
# Removing the unnamed column from dataset
crime.describe()
crime.info()
crime.rename({"Unnamed: 0":"a"}, axis="columns", inplace=True)
crimenew = crime.drop(["a"], axis=1)
# Normalization function (min-max scaling of each column to the range [0, 1])
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x
# Normalized data frame (considering the numerical part of data)
df_norm = norm_func(crimenew.iloc[:, 0:])

df_norm.describe()
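# for creating dendrogram
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch
import matplotlib.pylab as plt
# complete linkage with Euclidean distance, matching the clustering step below
z = linkage(df_norm, method="complete", metric="euclidean")
# Dendrogram
sch.dendrogram(z)
plt.show()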
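# Now applying AgglomerativeClustering choosing 3 as clusters from the above dendrogram
from sklearn.cluster import AgglomerativeClustering
# scikit-learn 1.2 renamed `affinity` to `metric`; on older versions pass affinity="euclidean"
h_complete = AgglomerativeClustering(n_clusters=3, linkage='complete', metric="euclidean").fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)
crime['clust'] = cluster_labels # creating a new column and assigning the cluster labels to it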
crime = crime.iloc[:, [5,0,1,2,3,4]]
crime.head()
Inferences
# count the number of records in each cluster
crime['clust'].value_counts()
# Aggregate mean of each cluster
crime.iloc[:, 2:].groupby(crime.clust).mean()
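Once the per-cluster counts and means are available, the clusters still need readable labels before inferences can be drawn. The following is a minimal sketch of one way to do that, ranking clusters by their average normalized level. The toy values, the column names `rate_a` and `rate_b`, and the labels "low", "medium" and "high" are all hypothetical and do not come from crime_data.csv.

```python
import pandas as pd

# Hypothetical toy data standing in for a clustered dataset.
df = pd.DataFrame({
    "rate_a": [1.0, 1.2, 5.0, 5.5, 9.8, 10.1],
    "rate_b": [10, 12, 50, 48, 95, 99],
    "clust":  [2, 2, 0, 0, 1, 1],
})

sizes = df["clust"].value_counts().sort_index()  # records per cluster
profile = df.groupby("clust").mean()             # mean of each feature per cluster

# Rank clusters by their average min-max normalized level, lowest first.
normed = (profile - profile.min()) / (profile.max() - profile.min())
order = normed.mean(axis=1).sort_values().index
names = dict(zip(order, ["low", "medium", "high"]))

# Attach the readable label to every record and to the profile table.
df["segment"] = df["clust"].map(names)
print(profile.assign(size=sizes, segment=pd.Series(names)))
```

Ranking by the mean of all features treats every feature as equally important; a real analysis might weight or pick features based on the business question.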

3. Perform clustering analysis on the telecom data set. The data is a mixture of both
categorical and numerical data. It consists of the number of customers who churn
out. Derive insights and get possible information on factors that may affect the
churn decision. Refer to Telco_customer_churn.xlsx dataset.
CODE-
import pandas as pd
telco = pd.read_excel("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/Telco_customer_churn.xlsx")
telco.describe()
telco.info()
telconew = telco.drop(["Customer ID", "Count", "Quarter", "Phone Service", "Offer", "Payment Method",
                       "Internet Type", "Paperless Billing", "Contract", "Device Protection Plan",
                       "Referred a Friend"], axis=1)
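# Create dummy variables
df_new = pd.get_dummies(telconew)
# dtype=float keeps the dummy columns numeric so the min-max arithmetic below works
df_new_1 = pd.get_dummies(telconew, drop_first=True, dtype=float)
# Normalization (min-max scaling of each column to the range [0, 1])
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x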
df_norm = norm_func(df_new_1.iloc[:, 1:])
df_norm.describe()
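# Dendrogram
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch
import matplotlib.pylab as plt
# complete linkage with Euclidean distance is assumed here, as in the earlier problems
z = linkage(df_norm, method="complete", metric="euclidean")
sch.dendrogram(z, leaf_rotation=0,
    leaf_font_size=10
)
plt.show()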

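# AgglomerativeClustering
from sklearn.cluster import AgglomerativeClustering
# n_clusters=3 is an assumption carried over from the earlier problems; choose it from the dendrogram
h_complete = AgglomerativeClustering(n_clusters=3, linkage='complete', metric="euclidean").fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)
telco['clust'] = cluster_labels # creating a new column and assigning the cluster labels to it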
telco = telco.iloc[:, [30] + list(range(30))]  # move column 30 ('clust') to the front
telco.head()
# count the number in each cluster
telco['clust'].value_counts()
# Aggregate mean of each cluster
# numeric_only=True skips the remaining text columns, which cannot be averaged
total = telco.iloc[:, 2:].groupby(telco.clust).mean(numeric_only=True)
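The per-cluster means cover only the numeric columns, while the problem also asks which factors may affect the churn decision, and several of those are categorical. A minimal sketch of one way to inspect a categorical factor per cluster is a row-normalized cross-tabulation. The toy rows and the category values "Monthly" and "Yearly" are hypothetical; only the column name `Contract` appears in the code above.

```python
import pandas as pd

# Hypothetical toy data: a cluster label and one categorical factor per customer.
toy = pd.DataFrame({
    "clust": [0, 0, 0, 1, 1, 1, 1, 2],
    "Contract": ["Monthly", "Monthly", "Yearly", "Yearly",
                 "Yearly", "Yearly", "Monthly", "Monthly"],
})

# Share of each contract type within each cluster (every row sums to 1).
share = pd.crosstab(toy["clust"], toy["Contract"], normalize="index")
print(share.round(2))
```

A cluster dominated by one category is a hint worth investigating, not proof that the category drives churn.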
4. Perform clustering on mixed data. Convert the categorical variables to numeric by
using dummies or label encoding and perform normalization techniques. The data
set consists of details of customers related to their auto insurance. Refer to
Autoinsurance.csv dataset.

CODE –
import pandas as pd

auto = pd.read_csv("C:/Users/hudso/Downloads/Dataset_Assignment Clustering/AutoInsurance.csv")
auto.describe()
auto.info()
auto1 = auto.drop(["Customer","State","Effective To Date"],axis=1)
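df_new = pd.get_dummies(auto1)
# dtype=float keeps the dummy columns numeric so the min-max arithmetic below works
df_new_1 = pd.get_dummies(auto1, drop_first=True, dtype=float)
# Normalization (min-max scaling of each column to the range [0, 1])
def norm_func(i):
    x = (i - i.min()) / (i.max() - i.min())
    return x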
df_norm = norm_func(df_new_1.iloc[:, 0:])
df_norm.describe()
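# Dendrogram
from scipy.cluster.hierarchy import linkage
import scipy.cluster.hierarchy as sch
import matplotlib.pylab as plt
# complete linkage with Euclidean distance is assumed here, as in the earlier problems
z = linkage(df_norm, method="complete", metric="euclidean")
sch.dendrogram(z)
plt.show()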

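# AgglomerativeClustering
from sklearn.cluster import AgglomerativeClustering
# n_clusters=3 is an assumption carried over from the earlier problems; choose it from the dendrogram
h_complete = AgglomerativeClustering(n_clusters=3, linkage='complete', metric="euclidean").fit(df_norm)
h_complete.labels_
cluster_labels = pd.Series(h_complete.labels_)
auto['clust'] = cluster_labels # creating a new column and assigning the cluster labels to it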
auto = auto.iloc[:, [24,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]]
auto.head()
# Aggregate mean
# numeric_only=True skips the remaining text columns, which cannot be averaged
total = auto.iloc[:, 3:].groupby(auto.clust).mean(numeric_only=True)
# count
auto['clust'].value_counts()
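The problem statement allows either dummies or label encoding, and the code above uses dummies. For comparison, here is a minimal sketch of the label-encoding route with scikit-learn's `LabelEncoder` and `MinMaxScaler`. The toy rows and the column names `coverage`, `gender` and `income` are hypothetical and are not taken from AutoInsurance.csv, and `n_clusters=2` is chosen only to suit the toy data.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy data mixing categorical and numerical columns.
toy = pd.DataFrame({
    "coverage": ["Basic", "Premium", "Basic", "Extended", "Premium", "Basic"],
    "gender": ["F", "M", "M", "F", "F", "M"],
    "income": [30000, 90000, 32000, 60000, 88000, 29000],
})

# Label encoding: replace each category with an integer code (alphabetical order).
encoded = toy.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# Min-max normalization so every column lies in [0, 1].
scaled = pd.DataFrame(MinMaxScaler().fit_transform(encoded), columns=encoded.columns)

# Complete-linkage agglomerative clustering on the scaled data.
labels = AgglomerativeClustering(n_clusters=2, linkage="complete").fit_predict(scaled)
toy["clust"] = labels
print(toy)
```

Label encoding imposes an arbitrary order on the categories (here "Basic" < "Extended" < "Premium" only because of alphabetical sorting), which distance-based clustering will treat as meaningful. Dummies avoid that at the cost of more columns.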
