This document discusses hierarchical clustering in Python. It explains how to perform hierarchical clustering using SciPy by creating a distance matrix with linkage(), specifying different linkage methods like ward and single, and using fcluster() to create cluster labels. It also discusses visualizing clusters with matplotlib and seaborn, determining the optimal number of clusters using dendrograms, and the limitations of hierarchical clustering like its quadratic time complexity for large datasets.

Basics of hierarchical clustering
CLUSTER ANALYSIS IN PYTHON

Shaumik Daityari
Business Analyst
Creating a distance matrix using linkage
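scipy.cluster.hierarchy.linkage(observations,
                                method='single',
                                metric='euclidean',
                                optimal_ordering=False
)

method : how to calculate the proximity of two clusters

metric : distance metric used between observations

optimal_ordering : whether to reorder the result so that the distance between successive leaves is minimal (can be slow on large datasets)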
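
To make the call concrete, here is a minimal sketch that runs linkage() on the five toy points used in the visualization slides later in this chapter. The variable name observations and the use of a NumPy array are assumptions for illustration. The returned matrix has one row per merge, which is what the later steps (fcluster, dendrogram) consume.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: the x/y values from the visualization slides, as a NumPy array
# (the array form and the name `observations` are assumptions for this sketch)
observations = np.array([[2.0, 1.0],
                         [3.0, 1.0],
                         [5.0, 5.0],
                         [6.0, 5.0],
                         [2.0, 2.0]])

Z = linkage(observations,
            method='single',
            metric='euclidean',
            optimal_ordering=False)

# With n observations, Z has n - 1 rows, one per merge. Each row holds:
# [index of first cluster, index of second cluster, merge distance, size of new cluster]
print(Z.shape)  # (4, 4)
print(Z)
```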

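Which method should you use?
'single' : based on the two closest objects

'complete' : based on the two farthest objects

'average' : based on the arithmetic mean of the distances between all pairs of objects

'centroid' : based on the distance between the centroids (mean positions) of the clusters

'median' : based on the distance between cluster centers, where a merged cluster's center is the midpoint of the two centers it was formed from

'ward' : based on the increase in the within-cluster sum of squares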
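
One way to get a feel for how the methods differ is to run each of them on the same data and compare the distance at which the last two clusters are merged. The sketch below does this for the toy points from the visualization slides; the variable names are assumptions for illustration, and the numbers it prints describe this toy data only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data (x/y values from the visualization slides); names are illustrative
observations = np.array([[2.0, 1.0],
                         [3.0, 1.0],
                         [5.0, 5.0],
                         [6.0, 5.0],
                         [2.0, 2.0]])

methods = ['single', 'complete', 'average', 'centroid', 'median', 'ward']

# The last row of the linkage matrix is the final merge;
# column 2 is the distance at which that merge happens
final_heights = {
    m: linkage(observations, method=m, metric='euclidean')[-1, 2]
    for m in methods
}

for m, height in final_heights.items():
    print(f"{m:>8}: final merge distance = {height:.3f}")
```

On this data, 'single' reports the smallest final distance of the first three methods (closest pair across the two groups) and 'complete' the largest (farthest pair), with 'average' in between.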

Create cluster labels with fcluster
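scipy.cluster.hierarchy.fcluster(distance_matrix,
                                 num_clusters,
                                 criterion
)

distance_matrix : output of the linkage() function (SciPy names this argument Z)

num_clusters : number of clusters (SciPy names this argument t; it is read as a cluster count only when criterion='maxclust')

criterion : how to decide thresholds to form clusters, for example 'maxclust' or 'distance'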
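
As a minimal sketch of the two steps together, the code below builds the linkage matrix for the toy points from the visualization slides and asks fcluster() for two clusters with criterion='maxclust'. The variable names are assumptions for illustration. The labels are integers starting at 1; which group receives which integer is not something to rely on.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (x/y values from the visualization slides); names are illustrative
observations = np.array([[2.0, 1.0],
                         [3.0, 1.0],
                         [5.0, 5.0],
                         [6.0, 5.0],
                         [2.0, 2.0]])

# Step 1: build the linkage matrix
distance_matrix = linkage(observations, method='ward', metric='euclidean')

# Step 2: flatten the hierarchy into at most 2 clusters
labels = fcluster(distance_matrix, 2, criterion='maxclust')

# One integer label per observation, in the same order as the input rows
print(labels)
```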

[Figure: Hierarchical clustering with ward method]

[Figure: Hierarchical clustering with single method]

[Figure: Hierarchical clustering with complete method]

Final thoughts on selecting a method
No single method is right for all datasets

Understand the distribution of your data carefully before choosing one

Let's try some exercises

Visualize clusters

Why visualize clusters?
Try to make sense of the clusters formed

An additional step in validation of clusters

Spot trends in data

An introduction to seaborn
seaborn : a Python data visualization library based on matplotlib

Has better, easily modifiable aesthetics than matplotlib!

Contains functions that make data visualization tasks easy in the context of data analytics

Use case for clustering: hue parameter for plots

Visualize clusters with matplotlib
import pandas as pd
from matplotlib import pyplot as plt

# Toy data: two features and a cluster label per point
df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2],
                   'labels': ['A', 'A', 'B', 'B', 'A']})
# Map each cluster label to a plot color
colors = {'A': 'red', 'B': 'blue'}
df.plot.scatter(x='x',
                y='y',
                c=df['labels'].apply(lambda x: colors[x]))
plt.show()

Visualize clusters with seaborn
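import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Toy data: two features and a cluster label per point
df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2],
                   'labels': ['A', 'A', 'B', 'B', 'A']})
# hue colors each point by its cluster label; no manual color map is needed
sns.scatterplot(x='x',
                y='y',
                hue='labels',
                data=df)
plt.show()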

Comparison of both methods of visualization
[Figure: two panels, MATPLOTLIB PLOT and SEABORN PLOT]

Next up: Try some visualizations

How many clusters?

Introduction to dendrograms
Strategy till now: decide clusters on visual inspection

Dendrograms help in showing progressions as clusters are merged

A dendrogram is a branching diagram that demonstrates how each cluster is composed by branching out into its child nodes

Create a dendrogram in SciPy
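from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# df is assumed to already hold the standardized columns 'x_whiten' and 'y_whiten'
Z = linkage(df[['x_whiten', 'y_whiten']],
            method='ward',
            metric='euclidean')
dn = dendrogram(Z)
plt.show()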
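
Reading a cluster count off a dendrogram amounts to drawing a horizontal cut at some height and counting the branches it crosses. The sketch below mimics that numerically on the toy points from the visualization slides. Note the assumption: placing the cut in the middle of the largest gap between successive merge distances is a common heuristic, not a rule stated in these slides, and the variable names are illustrative. fcluster() with criterion='distance' performs the cut.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (x/y values from the visualization slides); names are illustrative
observations = np.array([[2.0, 1.0],
                         [3.0, 1.0],
                         [5.0, 5.0],
                         [6.0, 5.0],
                         [2.0, 2.0]])

Z = linkage(observations, method='ward', metric='euclidean')

# Column 2 of Z holds the merge distances, i.e. the heights in the dendrogram
heights = Z[:, 2]

# Heuristic (an assumption): cut halfway through the largest jump in height
gaps = np.diff(heights)
i = int(np.argmax(gaps))
threshold = (heights[i] + heights[i + 1]) / 2

# Observations whose merges all happen below the threshold share a cluster
labels = fcluster(Z, threshold, criterion='distance')
n_clusters = len(np.unique(labels))

print(f"cut at {threshold:.3f} gives {n_clusters} clusters")
```

Treat the result as a starting point and check it against the plotted dendrogram; a different linkage method or data scaling can move the largest gap.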

Next up - try some exercises

Limitations of hierarchical clustering

Measuring speed in hierarchical clustering
timeit module

Measure the speed of the linkage() function

Use randomly generated points

Run various iterations to extrapolate

Use of timeit module
from scipy.cluster.hierarchy import linkage
import pandas as pd
import random, timeit

# Generate 100 random 2-D points with integer coordinates
points = 100
df = pd.DataFrame({'x': random.sample(range(0, points), points),
                   'y': random.sample(range(0, points), points)})
# %timeit is an IPython/Jupyter magic command, so run this line in a notebook or IPython shell
%timeit linkage(df[['x', 'y']], method='ward', metric='euclidean')

1.02 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Comparison of runtime of linkage method
Increasing runtime with data points

Quadratic increase of runtime

Not feasible for large datasets
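
To see the growth yourself outside a notebook, the timing can be repeated for several dataset sizes with the standard timeit module. This is a rough sketch: the helper name time_linkage, the chosen sizes, and the use of NumPy random points are assumptions for illustration, and the absolute timings depend entirely on your machine.

```python
import timeit

import numpy as np
from scipy.cluster.hierarchy import linkage


def time_linkage(points, repeats=3, seed=0):
    """Return the average seconds per linkage() call on `points` random 2-D points.

    Hypothetical helper for this sketch, not part of SciPy.
    """
    rng = np.random.default_rng(seed)
    data = rng.random((points, 2))
    total = timeit.timeit(
        lambda: linkage(data, method='ward', metric='euclidean'),
        number=repeats,
    )
    return total / repeats


# Double the number of points each time and record the average runtime
timings = {n: time_linkage(n) for n in (100, 200, 400)}

for n, seconds in timings.items():
    print(f"{n:>4} points: {seconds * 1000:.3f} ms per call")
```

If the runtime is roughly quadratic, each doubling of the number of points should multiply the time by about four once the dataset is large enough for fixed overheads to stop dominating.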

Next up - exercises
