
Credits

Python libraries

pandas: Used for efficient data manipulation. Link to pandas
seaborn: Used for statistical data visualization. Link to seaborn
scipy: Library for scientific computing and statistical functions. Link to scipy
matplotlib: Versatile plotting library for creating visualizations. Link to matplotlib
numpy: Used for numerical operations. Link to numpy

Reference

Spearman's rank correlation coefficient: Wikipedia link

Pearson correlation coefficient: Wikipedia link

Exploring Insights through Bivariate Analysis in Machine Learning and Data Analysis
Introduction: This notebook serves as a comprehensive guide to understanding and implementing bivariate
analysis in the realm of machine learning and data analysis. Bivariate analysis is a statistical method that explores
the relationship between two variables, shedding light on patterns, correlations, and dependencies. Through
practical examples and real-life scenarios, this notebook aims to demonstrate how bivariate analysis can be a
powerful tool for extracting meaningful insights from data.

Key Concepts:

1. Bivariate Analysis Fundamentals:

Definition: Understand the basics of bivariate analysis, focusing on the examination of two variables
simultaneously.
Scatter Plots: Visualize the relationship between two variables using scatter plots to identify trends and
patterns.
2. Correlation Coefficients:

Pearson, Spearman, and Kendall: Explore different correlation coefficients to quantify the strength and
direction of relationships.
Interpretation: Learn how to interpret correlation values and understand their significance.
3. Use Cases in Daily Life:

Finance: Analyze the correlation between income and expenditure to make informed budgeting decisions.
Health: Investigate the relationship between exercise frequency and overall well-being.
Education: Explore the correlation between study hours and academic performance.
4. Machine Learning Applications:

Feature Selection: Utilize bivariate analysis to identify key features that impact the target variable in
machine learning models.
Model Validation: Assess the relationship between predicted and actual outcomes through bivariate
analysis of model residuals.
5. Real-Life Example: Predicting Housing Prices:

Data Exploration: Use bivariate analysis to understand the correlation between various features and
housing prices.
Feature Engineering: Identify and create new features based on bivariate insights.
Model Improvement: Implement findings from bivariate analysis to enhance the predictive accuracy of a
housing price prediction model.

Practical Implementation:

1. Data Preparation:

Data Cleaning: Address missing values, outliers, and inconsistencies for a robust analysis.
Data Transformation: Normalize or scale variables to ensure accurate interpretation.
2. Python Code Examples:

Utilize popular libraries such as Pandas, Matplotlib, and Seaborn for efficient bivariate analysis.
Code snippets for calculating correlation coefficients, creating scatter plots, and interpreting results.

Conclusion: This notebook aims to equip users with the skills to perform insightful bivariate analysis in both
personal and machine learning contexts. By mastering these techniques, practitioners can make informed decisions,
extract valuable patterns from data, and enhance the performance of machine learning models.

Let's go through each topic one by one.

Correlation:
Definition: Correlation is a statistical measure that quantifies the degree to which two variables change together. It
indicates the strength and direction of a linear relationship between two variables. The correlation coefficient is a
numerical value that ranges from -1 to 1, where:

1 indicates a perfect positive correlation (as one variable increases, the other also increases linearly).
-1 indicates a perfect negative correlation (as one variable increases, the other decreases linearly).
0 indicates no linear correlation.

Real-life Example: Consider a dataset containing the number of hours spent studying for an exam and the
corresponding exam scores. If there is a positive correlation, it suggests that as study hours increase, exam scores
tend to increase. Conversely, a negative correlation would imply that as study hours increase, exam scores tend to
decrease.
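As a quick numerical check of this intuition, the correlation can be computed directly with pandas; a minimal sketch using made-up study-hours data (the numbers are illustrative, not from a real dataset):

import pandas as pd

# Illustrative data: hours studied vs. exam score
df = pd.DataFrame({'study_hours': [2, 4, 6, 8, 10],
                   'exam_score': [55, 65, 72, 80, 88]})

# Pearson correlation between the two columns (value lies between -1 and 1)
print(df['study_hours'].corr(df['exam_score']))  # close to +1, i.e. a strong positive correlation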

Where to Use in Machine Learning and Real-Life Projects:

1. Feature Selection:

In machine learning, understanding the correlation between features helps identify and select the most relevant features for training models. Highly correlated features might not provide additional information, and it can be beneficial to keep only one of them (a small sketch of this idea follows this list).
2. Model Performance:

Correlation analysis can help assess the relationships between input features and the target variable. This
information is valuable for selecting appropriate models and understanding how changes in input features
impact the model's predictions.
3. Data Preprocessing:

Detecting and handling multicollinearity (high correlation between independent variables) is essential in
preprocessing data. Multicollinearity can lead to unstable coefficient estimates in regression models.
4. Real-Life Projects:

In finance, correlation is used to understand the relationship between different assets. For instance,
knowing the correlation between stocks helps in portfolio diversification.
In healthcare, correlation analysis can be applied to understand the relationships between variables like
patient age, lifestyle, and the occurrence of specific health conditions.

In marketing, analyzing the correlation between advertising spending and sales can guide budget
allocation for different channels.

In climate science, understanding the correlation between various environmental factors can help predict
and mitigate the impact of climate change.

In social sciences, correlation analysis may be used to study the relationship between variables such as
income and education levels.
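To make the feature-selection and multicollinearity points concrete, here is a minimal sketch that flags feature pairs whose absolute Pearson correlation exceeds a threshold so that one of each pair can be dropped. The column names, the simulated values, and the 0.9 threshold are illustrative assumptions:

import numpy as np
import pandas as pd

# Simulated feature matrix: 'spending' is nearly a copy of 'income'
rng = np.random.default_rng(0)
income = rng.normal(size=100)
X = pd.DataFrame({'income': income,
                  'spending': 0.9 * income + rng.normal(scale=0.1, size=100),
                  'age': rng.normal(size=100)})

# Absolute correlation matrix, keeping only the upper triangle so each pair is counted once
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Candidate columns to drop: correlated above 0.9 with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # expected: ['spending']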

Monotonic Relationship:
Definition: A monotonic relationship is a mathematical concept that describes the consistent trend between two
variables. In a monotonic relationship, as one variable increases (or decreases), the other variable also consistently
increases (or decreases), but not necessarily at a constant rate. It doesn't have to be a straight line; it could be any
pattern that consistently moves in one direction.

Real-life Example: Consider the relationship between the number of hours spent studying for an exam and the
exam scores. A monotonic relationship in this context would mean that as study hours increase, exam scores
consistently increase, even if the relationship is not strictly linear. For example, the more time a student spends
studying, the higher the chances of achieving a higher score.

Use in ML/DL/Data Science:

Feature Engineering: Monotonic relationships are crucial when creating features for machine learning models.
Understanding how variables change in a consistent direction helps identify meaningful features that can
improve model performance.

Correlation Analysis: Monotonic correlation, as measured by Spearman correlation, can provide insights into
the relationships between variables. This is valuable when dealing with non-linear associations in the data.

Monotonic Relationship Example:


Real-life Scenario: Consider the relationship between the number of hours spent studying (independent variable)
and the exam scores obtained (dependent variable). The assumption is that as study hours increase, exam scores
tend to increase, but not necessarily in a strictly linear fashion.

Python Code with Plot:

In [1]: import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Simulating data for a monotonic relationship
study_hours = np.array([2, 4, 6, 8, 10])
exam_scores = np.array([60, 70, 80, 85, 90])

# Plotting the monotonic relationship
plt.figure(figsize=(8, 6))
plt.scatter(study_hours, exam_scores, color='blue',
            label='Monotonic Relationship')
plt.plot(study_hours, exam_scores, linestyle='--',
         color='red')  # Adding a line to represent the trend

# Labels and title
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.title('Monotonic Relationship Example')
plt.legend()
plt.grid(True)

# Show the plot
plt.show()

In this plot, the blue dots represent data points, and the red dashed line indicates the monotonic trend. As study
hours increase, exam scores tend to increase, demonstrating a monotonic relationship.

Linear Relationship:
Definition: A linear relationship is a mathematical concept where two variables have a constant rate of change,
forming a straight line when plotted on a graph. In a linear relationship, a change in one variable is proportional to a
change in the other variable.

Real-life Example: Consider the relationship between the distance traveled and time taken during a constant-speed
journey. The linear relationship here means that the distance covered is directly proportional to the time elapsed.
The graph of this relationship would be a straight line.

Use in ML/DL/Data Science:

Linear Regression: Linear relationships are fundamental in linear regression models, where the goal is to
model the relationship between the independent and dependent variables as a straight line. This is widely used
in various machine learning applications.

Feature Importance: Understanding linear relationships helps in identifying which features have a significant
impact on the target variable, aiding in feature selection for model training.

Linear Relationship Example:


Real-life Scenario: Consider the relationship between the time elapsed during a constant-speed journey (independent variable) and the distance traveled (dependent variable). The assumption is that the relationship is linear, with a constant rate of change.

Python Code with Plot:

In [2]: import numpy as np
import matplotlib.pyplot as plt

# Simulating data for a linear relationship
distance_traveled = np.array([50, 100, 150, 200, 250])
time_elapsed = np.array([1, 2, 3, 4, 5])

# Plotting the linear relationship
plt.figure(figsize=(8, 6))
plt.scatter(time_elapsed, distance_traveled, color='green',
            label='Linear Relationship')
plt.plot(time_elapsed, distance_traveled, linestyle='-',
         color='purple')  # Adding a line to represent the trend

# Labels and title
plt.xlabel('Time Elapsed (hours)')
plt.ylabel('Distance Traveled (km)')
plt.title('Linear Relationship Example')
plt.legend()
plt.grid(True)

# Show the plot
plt.show()

In this plot, the green dots represent data points, and the purple line indicates the linear relationship. As time
elapses, the distance traveled increases at a constant rate, demonstrating a linear relationship.

Relevance in ML/DL/Data Science:


Model Selection: Depending on the nature of the relationship in the data, different models may be more
appropriate. Linear relationships may be modeled using linear regression, while non-linear or monotonic
relationships may require more complex models or transformations.

Interpretability: Linear relationships are often more interpretable and easier to explain. In certain situations, a
linear model might be preferred for its simplicity and ease of interpretation.

Feature Engineering: Both monotonic and linear relationships guide feature engineering efforts. Feature transformations or the creation of interaction terms may be applied based on the identified relationships to improve model performance; a short sketch of such a transformation follows this list.
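As a small illustration of a relationship-driven transformation, the sketch below builds a monotonic but non-linear series (an exponential curve, chosen purely for illustration) and shows how a log transform turns it into a linear one, raising the Pearson coefficient to match the Spearman value:

import numpy as np
from scipy.stats import pearsonr, spearmanr

# Illustrative monotonic but non-linear data: y grows exponentially with x
x = np.arange(1, 21)
y = np.exp(0.3 * x)

raw_pearson, _ = pearsonr(x, y)          # below 1: the curvature weakens the linear fit
raw_spearman, _ = spearmanr(x, y)        # exactly 1: the trend is perfectly monotonic
log_pearson, _ = pearsonr(x, np.log(y))  # 1.00: the log transform makes the relationship linear

print(f"Pearson (raw): {raw_pearson:.2f}, "
      f"Spearman (raw): {raw_spearman:.2f}, "
      f"Pearson (log y): {log_pearson:.2f}")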

Reference
Pearson correlation coefficient: Wikipedia link

Correlation Coefficients:
Correlation coefficients measure the strength and direction of a linear relationship between two variables. There are
different types of correlation coefficients, including Pearson, Spearman, and Kendall.

1. Pearson Correlation Coefficient:


Definition: The Pearson correlation coefficient measures the linear relationship between two continuous
variables. It ranges from -1 to 1, where:

1 indicates a perfect positive linear relationship,


-1 indicates a perfect negative linear relationship,
0 indicates no linear relationship.
Real-life Example: Consider a dataset that contains information about the number of hours students spend
studying and their exam scores. A positive Pearson correlation coefficient suggests that as study hours increase,
exam scores tend to increase as well.

Python Code:

In [3]: import pandas as pd
from scipy.stats import pearsonr

# Example dataset
data = {'Study Hours': [3, 5, 7, 4, 6],
        'Exam Scores': [60, 75, 85, 70, 80]}
df = pd.DataFrame(data)

# Calculate the Pearson correlation coefficient
pearson_corr, _ = pearsonr(df['Study Hours'], df['Exam Scores'])
print(f"Pearson Correlation Coefficient: {pearson_corr:.2f}")

Pearson Correlation Coefficient: 0.99

In [4]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

# Example dataset
data = {'Study Hours': [3, 5, 7, 4, 6],
        'Exam Scores': [60, 75, 85, 70, 80]}
df = pd.DataFrame(data)

# Calculate the Pearson correlation coefficient
pearson_corr, _ = pearsonr(df['Study Hours'], df['Exam Scores'])

# Create scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Study Hours', y='Exam Scores', data=df,
                color='blue', s=100)

# Add a line connecting the points, labelled with the Pearson coefficient
sns.lineplot(x='Study Hours', y='Exam Scores', data=df, color='red',
             linestyle='--', label=f'Pearson ({pearson_corr:.2f})')

# Set plot labels and title
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.title('Pearson Correlation Example')

# Show legend
plt.legend()

# Show the plot
plt.show()

Reference

Spearman's rank correlation coefficient: Wikipedia link

2. Spearman Correlation Coefficient:


Definition: The Spearman correlation coefficient is a non-parametric measure of the strength and direction of the
monotonic relationship between two variables. Monotonicity refers to a consistent trend in the relationship, but it
does not have to be strictly linear. This makes Spearman correlation suitable for assessing relationships in both
continuous and ordinal variables.

Unlike the Pearson correlation coefficient, Spearman does not assume that the relationship between variables is
linear. Instead, it focuses on whether, as one variable increases, the other tends to consistently increase or decrease.
Real-life Example: Let's consider a practical scenario where you are analyzing the relationship between study hours
and exam scores. In a traditional linear relationship, more study hours might result in higher exam scores, but this
may not always be the case. The Spearman correlation is valuable in situations where the relationship is not strictly
linear but still follows a consistent trend.

For instance:

Linear Scenario: If there's a linear relationship, more study hours might always lead to higher scores, and the
Spearman correlation would capture this.
Non-linear Scenario: If the relationship is not strictly linear, but as study hours increase, scores tend to
consistently increase or decrease (i.e., a monotonic relationship), Spearman correlation still effectively quantifies
this trend.

Python Code Example: Let's use Python to calculate the Spearman correlation coefficient for the given dataset:

In [5]: import pandas as pd
from scipy.stats import spearmanr

# Example dataset
data = {'Study Hours': [3, 5, 7, 4, 6],
        'Exam Scores': [60, 75, 85, 70, 80]}
df = pd.DataFrame(data)

# Calculate the Spearman correlation coefficient
spearman_corr, _ = spearmanr(df['Study Hours'], df['Exam Scores'])
print(f"Spearman Correlation Coefficient: {spearman_corr:.2f}")

Spearman Correlation Coefficient: 1.00

This code calculates the Spearman correlation coefficient for the 'Study Hours' and 'Exam Scores' variables in the
provided dataset. The resulting coefficient helps quantify the monotonic relationship between study hours and
exam scores, providing insights into the consistency of the trend.

In [6]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import spearmanr, pearsonr

# Example dataset
data = {'Study Hours': [3, 5, 7, 4, 6],
        'Exam Scores': [60, 75, 85, 70, 80]}
df = pd.DataFrame(data)

# Calculate correlation coefficients
spearman_corr, _ = spearmanr(df['Study Hours'], df['Exam Scores'])
pearson_corr, _ = pearsonr(df['Study Hours'], df['Exam Scores'])

# Create scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Study Hours', y='Exam Scores', data=df,
                color='blue', s=100)

# Add a best-fit line representing the linear (Pearson) relationship
sns.regplot(x='Study Hours', y='Exam Scores', data=df, color='green',
            scatter=False, label=f'Pearson ({pearson_corr:.2f})')

# Set plot labels and title
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.title('Spearman and Pearson Correlation Example')

# Show legend
plt.legend()

# Show the plot
plt.show()

Spearman and Pearson Correlation Example with plots


In [7]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import spearmanr, pearsonr

# Example dataset
data = {'Study Hours': [3, 5, 7, 4, 6],
        'Exam Scores': [60, 75, 85, 70, 80]}
df = pd.DataFrame(data)

# Calculate correlation coefficients
spearman_corr, _ = spearmanr(df['Study Hours'], df['Exam Scores'])
pearson_corr, _ = pearsonr(df['Study Hours'], df['Exam Scores'])

# Create scatter plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Study Hours', y='Exam Scores', data=df,
                color='blue', s=100)

# Add lines representing the monotonic and linear relationships
sns.lineplot(x='Study Hours', y='Exam Scores', data=df, color='red',
             linestyle='--', label=f'Spearman ({spearman_corr:.2f})')
sns.regplot(x='Study Hours', y='Exam Scores', data=df, color='green',
            scatter=False, label=f'Pearson ({pearson_corr:.2f})')

# Set plot labels and title
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.title('Spearman and Pearson Correlation Example')

# Show legend
plt.legend()

# Show the plot
plt.show()

Plot Explanation:
1. Scatter Plot:

The blue dots represent individual data points, where each dot corresponds to a combination of study
hours and exam scores.
2. Spearman Correlation Line (Red Dashed Line):

The red dashed line represents the monotonic trend captured by the Spearman correlation. It shows the
general direction of the relationship between study hours and exam scores. The legend includes the
Spearman correlation coefficient, indicating the strength of the monotonic relationship.
3. Pearson Correlation Line (Green Line):

The green line represents the linear trend captured by the Pearson correlation. It is a best-fit line that
represents the overall linear relationship between study hours and exam scores. The legend includes the
Pearson correlation coefficient, indicating the strength of the linear relationship.
4. Legend:

The legend provides information about the correlation coefficients associated with each method
(Spearman and Pearson). The coefficients are numerical values that quantify the strength and direction of
the respective relationships.
5. Axes and Labels:

The x-axis represents the 'Study Hours,' and the y-axis represents the 'Exam Scores.'
The plot includes labels for the x-axis, y-axis, and a title for better clarity.

Interpretation:
Spearman Correlation:

The red dashed line captures the overall monotonic trend between study hours and exam scores.
The Spearman correlation coefficient (included in the legend) quantifies the strength and direction of the
monotonic relationship.
Pearson Correlation:

The green line represents the linear relationship between study hours and exam scores.
The Pearson correlation coefficient (included in the legend) quantifies the strength and direction of the
linear relationship.

By visually inspecting the scatter plot and considering the correlation lines, you can gain insights into how well each
method captures the relationship between study hours and exam scores. The legend provides numerical values for
the correlation coefficients, helping you understand the degree of association in each case.
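The Kendall coefficient listed under Key Concepts is not computed in the cells above; for completeness, here is a minimal sketch on the same toy data using scipy.stats.kendalltau:

import pandas as pd
from scipy.stats import kendalltau

# Same toy dataset used in the Pearson and Spearman cells
data = {'Study Hours': [3, 5, 7, 4, 6],
        'Exam Scores': [60, 75, 85, 70, 80]}
df = pd.DataFrame(data)

# Kendall's tau is rank-based like Spearman, but built from concordant/discordant pairs
kendall_corr, _ = kendalltau(df['Study Hours'], df['Exam Scores'])
print(f"Kendall Correlation Coefficient: {kendall_corr:.2f}")  # 1.00 here: the data are perfectly monotonic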

Bivariate Analysis Plots:


Bivariate Analysis: Scatterplot with Iris Dataset
Objective: To explore the relationship between sepal length and sepal width in the Iris dataset using a scatterplot.

Python Code:

In [8]: from sklearn import datasets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = datasets.load_iris()
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# Scatterplot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)',
                hue='target', data=iris_df, palette='viridis', s=80)

# Labels and title
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Scatterplot: Sepal Length vs. Sepal Width (Iris Dataset)')

# Show the plot
plt.show()
Explanation:

1. Import Libraries:

Import necessary libraries, including datasets from sklearn , pandas , seaborn , and
matplotlib.pyplot .
2. Load Iris Dataset:

Load the Iris dataset using datasets.load_iris() and create a DataFrame ( iris_df ) from it.
3. Scatterplot:

Use sns.scatterplot to create a scatterplot.


'Sepal Length' is plotted on the x-axis, 'Sepal Width' on the y-axis.
hue='target' colors the points based on the species of iris flowers.
palette='viridis' sets the color palette, and s=80 adjusts the size of the points.
4. Labels and Title:

Add labels to the x and y axes using plt.xlabel and plt.ylabel .


Set the title of the plot using plt.title .
5. Show the Plot:

Finally, use plt.show() to display the scatterplot.

Interpretation:

Each point on the scatterplot represents an iris flower.


The x-axis shows the sepal length, and the y-axis shows the sepal width.
Different colors represent different species of iris flowers (target 0 = setosa, 1 = versicolor, 2 = virginica).

Bivariate Analysis Insights:

This scatterplot helps visualize the relationship between sepal length and sepal width for different species of iris
flowers.
Insights into the distribution and separation of species can be gained.
It's possible to observe patterns and potential correlations between sepal length and sepal width.

2. Bar Plot (Numerical - Categorical)

Bivariate Analysis: Bar Plot (Numerical - Categorical)


Objective: To explore the relationship between survival status and passenger class in the Titanic dataset using a bar
plot.

Python Code:

In [9]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Bar plot
plt.figure(figsize=(8, 6))
sns.barplot(x='pclass', y='survived', data=titanic, ci=None, palette='muted')

# Labels and title
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
plt.title('Bar Plot: Survival Rate by Passenger Class (Titanic Dataset)')

# Show the plot
plt.show()

Explanation:
1. Import Libraries:

Import necessary libraries, including pandas , seaborn , and matplotlib.pyplot .


2. Load Titanic Dataset:

Load the Titanic dataset using sns.load_dataset('titanic') .


3. Bar Plot:

Use sns.barplot to create a bar plot.


'Passenger Class' is plotted on the x-axis, and 'Survival Rate' is plotted on the y-axis.
ci=None removes confidence intervals from the bars.
palette='muted' sets the color palette.
4. Labels and Title:

Add labels to the x and y axes using plt.xlabel and plt.ylabel .


Set the title of the plot using plt.title .
5. Show the Plot:

Finally, use plt.show() to display the bar plot.

Interpretation:

Each bar represents the survival rate for a specific passenger class.
'Passenger Class' is a categorical variable with three levels (1st, 2nd, and 3rd class).
'Survival Rate' is a numerical variable representing the proportion of passengers who survived.

Bivariate Analysis Insights:

The bar plot visually compares the survival rates across different passenger classes.
It helps to identify patterns and differences in survival rates based on the passenger class.
This type of analysis is valuable for understanding how certain factors may influence outcomes in categorical
variables.

This bivariate analysis using a bar plot provides insights into the relationship between a numerical variable (survival
rate) and a categorical variable (passenger class) in the context of the Titanic dataset.
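Since sns.barplot defaults to plotting the mean of the y variable per category, and 'survived' is coded 0/1, the bar heights are survival rates; they can be cross-checked numerically with a groupby. A short sketch:

import seaborn as sns

titanic = sns.load_dataset('titanic')

# Mean of the 0/1 'survived' column per class = the survival rate shown by each bar
survival_rate = titanic.groupby('pclass')['survived'].mean()
print(survival_rate.round(2))  # roughly 0.63, 0.47, 0.24 for classes 1, 2, 3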

3. Box Plot (Numerical - Categorical)


In [10]: import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Box plot
plt.figure(figsize=(10, 8))
sns.boxplot(x='pclass', y='age', hue='sex', data=titanic, palette='muted')

# Labels and title
plt.xlabel('Passenger Class')
plt.ylabel('Age')
plt.title('Box Plot: Age Distribution by Passenger Class and Gender')

# Show the plot
plt.show()
1. Load Titanic Dataset:

titanic = sns.load_dataset('titanic') : Loads the Titanic dataset from Seaborn.


2. Box Plot:

plt.figure(figsize=(10, 8)) : Sets the size of the figure.


sns.boxplot(x='pclass', y='age', hue='sex', data=titanic, palette='muted') : Creates a
box plot.
x='pclass' : Categorical variable on the x-axis (Passenger Class).
y='age' : Numerical variable on the y-axis (Age).
hue='sex' : Separates the data by gender.
data=titanic : Specifies the dataset.
palette='muted' : Sets the color palette.
3. Labels and Title:

plt.xlabel('Passenger Class') : Adds a label to the x-axis.


plt.ylabel('Age') : Adds a label to the y-axis.
plt.title('Box Plot: Age Distribution by Passenger Class and Gender') : Sets the title of the plot.
4. Show the Plot:

plt.show() : Displays the box plot.

Interpretation:

The box plot visualizes the distribution of ages for each passenger class and gender.
Each box represents the interquartile range (IQR) of ages within a specific category.
The horizontal line inside the box represents the median age.
Whiskers extend to show the range of ages within 1.5 times the IQR, and individual points beyond the whiskers
are considered outliers.

Insights:

This box plot allows you to compare the distribution of ages across different passenger classes and genders.
You can observe the median age, the spread of ages (IQR), and identify potential outliers (these quantities are computed numerically in the sketch below).
It provides insights into the age demographics of passengers in each class and gender category.
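For exact numbers behind the boxes, the quartiles can be computed directly with pandas; a minimal sketch using the same grouping as the plot:

import seaborn as sns

titanic = sns.load_dataset('titanic')

# 25th, 50th (median), and 75th percentiles of age per passenger class and gender
age_quartiles = titanic.groupby(['pclass', 'sex'])['age'].quantile([0.25, 0.5, 0.75]).unstack()
print(age_quartiles)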

4. Distplot (Numerical - Categorical)


In [11]: import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset
titanic = sns.load_dataset('titanic')

# Create a distribution plot for age categorized by survival status
sns.distplot(titanic[titanic['survived'] == 0]['age'],
             hist=False, label='Not Survived')
sns.distplot(titanic[titanic['survived'] == 1]['age'],
             hist=False, label='Survived')

# Set labels, title, and legend
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Age Distribution by Survival Status (Titanic)')
plt.legend()

# Show the plot
plt.show()

Explanation:

1. Import Libraries:

Import necessary libraries, including seaborn and matplotlib.pyplot .


2. Data Filtering:
Use boolean indexing to filter passengers who did not survive ( survived == 0 ) and those who survived ( survived == 1 ) from the Titanic dataset.
3. Distplot:

Use sns.distplot to create a distribution plot.


Pass the filtered age data for passengers who did not survive and those who survived separately.
hist=False parameter removes the histogram, leaving only the kernel density estimate (KDE) for a
smooth distribution curve.
label is added for each category to distinguish them in the legend.
4. Labels and Title:

Add labels to the x and y axes using plt.xlabel and plt.ylabel .


Set the title of the plot using plt.title .
Add a legend to distinguish between the age distributions of survived and not survived passengers.
5. Show the Plot:

Finally, use plt.show() to display the distribution plot.

Interpretation:

The plot visualizes the distribution of ages for passengers who survived and those who did not survive the
Titanic disaster.
The x-axis represents age, and the y-axis represents the density (probability) of observing a particular age.
The smooth curves (KDE) help in visualizing the probability distribution of ages for each category.

Use of Distplot:

Comparison of Distributions: Distplots are useful for comparing the distributions of numerical variables across
different categories.
Identifying Patterns: Distplots provide insights into the patterns and shapes of distributions within different
groups.

Where to Use This Plot:

Survival Analysis: In this case, the plot is used to analyze the age distribution of passengers based on their
survival status.
Customer Behavior Analysis: For instance, comparing the age distribution of customers who made a purchase
versus those who didn't in an e-commerce dataset.

Real-life Example: In the context of the Titanic dataset, this plot helps answer questions like "Were there any age-
related patterns in survival rates?" For instance, it might reveal if certain age groups were more likely to survive or
not. This type of analysis can provide valuable insights for understanding the factors that influenced survival on the
Titanic.
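Note that sns.distplot is deprecated in recent Seaborn releases. Assuming Seaborn 0.11 or later, the same comparison can be drawn with sns.kdeplot and its hue parameter; a sketch:

import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

# One KDE curve per survival status, each normalised on its own
sns.kdeplot(data=titanic, x='age', hue='survived', common_norm=False)

plt.xlabel('Age')
plt.ylabel('Density')
plt.title('Age Distribution by Survival Status (Titanic)')
plt.show()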

5. HeatMap (Categorical - Categorical)


Objective: To create a heatmap that visualizes the relationship between two categorical variables in the Titanic
dataset — specifically, the relationship between passenger class ( pclass ) and survival status ( survived ).

Python Code:

In [12]: import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load the Titanic dataset (as in the earlier cells)
titanic = sns.load_dataset('titanic')

# Heatmap using pd.crosstab and sns.heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(pd.crosstab(titanic['pclass'], titanic['survived']),
            annot=True, fmt='d', cmap='Blues')

# Labels and title
plt.xlabel('Survival Status')
plt.ylabel('Passenger Class')
plt.title('Heatmap: Passenger Class vs. Survival Status')

# Show the plot
plt.show()

Explanation:

1. Import Libraries:

Import necessary libraries, including seaborn , pandas , and matplotlib.pyplot .


2. Load Titanic Dataset:

Load the Titanic dataset into a DataFrame named 'titanic' using sns.load_dataset('titanic') , as in the earlier cells.
3. Create Heatmap:

Use pd.crosstab to create a contingency table between 'pclass' and 'survived'.


sns.heatmap visualizes the contingency table as a heatmap.
annot=True adds annotations (numeric values) to each cell.
fmt='d' specifies the format of the annotations as integers.
cmap='Blues' sets the color map for the heatmap.
4. Labels and Title:

Add labels to the x and y axes using plt.xlabel and plt.ylabel .


Set the title of the plot using plt.title .
5. Show the Plot:

Finally, use plt.show() to display the heatmap.

Interpretation:
The heatmap visually represents the relationship between passenger class and survival status.
Rows represent passenger class, and columns represent survival status.
The color intensity in each cell indicates the count of passengers falling into a specific combination of class and
survival status.

Use of the Plot:

Visualizing Associations: Heatmaps are effective for visually assessing associations between two categorical
variables. In this case, it helps understand the distribution of survivors and non-survivors across different
passenger classes.

Where Can We Use This Plot:

Exploratory Data Analysis (EDA): Use heatmaps during EDA to uncover patterns and relationships between
categorical variables.
Feature Importance: In machine learning, understanding the relationships between categorical features and
the target variable is essential for feature selection.

Real-life Example:

The heatmap could be used to analyze whether there is a correlation between passenger class and survival
status on the Titanic. It might help identify patterns such as a higher survival rate in certain passenger classes.
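A useful variant is to normalise the crosstab so that each cell shows a proportion instead of a raw count; pd.crosstab supports this through its normalize parameter. A sketch with row-wise normalisation:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

titanic = sns.load_dataset('titanic')

# Row-normalised table: each row (passenger class) sums to 1,
# so cells show the share of non-survivors/survivors within that class
rates = pd.crosstab(titanic['pclass'], titanic['survived'], normalize='index')

plt.figure(figsize=(8, 6))
sns.heatmap(rates, annot=True, fmt='.2f', cmap='Blues')
plt.xlabel('Survival Status')
plt.ylabel('Passenger Class')
plt.title('Heatmap: Survival Proportion by Passenger Class')
plt.show()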

6. ClusterMap (Categorical - Categorical)


In [13]: sns.clustermap(pd.crosstab(titanic['parch'],titanic['survived']))

Out[13]: <seaborn.matrix.ClusterGrid at 0x1dda2c3edc0>


Explanation:
1. Import Libraries:

seaborn ( sns ), pandas ( pd ), and matplotlib.pyplot ( plt ) are already imported in the earlier cells.


2. Load Titanic Dataset:

The cell reuses the titanic DataFrame loaded earlier with sns.load_dataset('titanic') .
3. Create Crosstab:

Use pd.crosstab to create a contingency table between the 'parch' (number of parents/children aboard)
and 'survived' columns.
4. sns.clustermap :

Apply sns.clustermap to create a clustered heatmap of the contingency table.


This function hierarchically clusters both rows and columns based on similarity in values, resulting in a
visually organized heatmap.
5. Show the Plot:

In a notebook the clustermap renders as the cell output; in a script, call plt.show() to display it.


Interpretation:
The clustermap visualizes the relationship between the number of parents/children aboard ( parch ) and survival
( survived ) in the Titanic dataset. Rows and columns are clustered based on their similarity in the distribution of
values.

Rows represent different values of 'parch' (number of parents/children).


Columns represent different values of 'survived' (0 or 1 indicating not survived or survived).

Use of sns.clustermap :
1. Identifying Patterns:

The clustermap helps identify patterns or relationships between two categorical variables. In this case, it
explores the distribution of survival outcomes based on the number of parents/children aboard.
2. Hierarchical Clustering:

The hierarchical clustering visually groups similar rows and columns, providing insights into potential
associations or trends within the data.
3. Data Exploration:

Useful for exploratory data analysis to reveal hidden structures or dependencies between categorical
variables.

Real-life Example:
In the Titanic dataset, the clustermap can help identify if there are specific patterns in survival based on the number
of parents/children aboard. It might reveal whether certain family sizes had higher or lower survival rates,
contributing to a deeper understanding of the dataset.

For example, the heatmap may show clusters where survival rates are higher for specific family sizes, indicating
potential trends in passenger survival based on family relationships.

This type of analysis is valuable for understanding the dynamics of survival in different family structures and can
guide further investigations or feature engineering in a machine learning context.

7. Pairplot
The pairplot is a powerful visualization tool provided by the Seaborn library in Python. It creates a grid of
scatterplots, histograms, and density plots for multiple variables in a dataset. Each combination of variables is
visualized, making it particularly useful for exploring relationships between multiple numerical variables.

In [14]: import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset as a DataFrame with a 'species' column
iris_df = sns.load_dataset('iris')

# Create a pairplot coloured by species
sns.pairplot(iris_df, hue='species')

# Show the plot
plt.show()
Explanation:

1. Import Libraries:

Import necessary libraries, including seaborn and matplotlib.pyplot .
2. Load Iris Dataset:

Load the Iris dataset as a DataFrame ( iris_df ) using sns.load_dataset('iris') .
3. Create Pairplot:

Use sns.pairplot to create a grid of scatterplots and histograms.


hue='species' colors the data points based on the species of iris flowers.
4. Show the Plot:

Finally, use plt.show() to display the pairplot.

Interpretation:

The pairplot shows scatterplots for all combinations of numerical variables (sepal length, sepal width, petal
length, petal width) in the Iris dataset.
Diagonal elements display histograms or kernel density plots for each variable.

Use of Pairplot:

Exploratory Data Analysis (EDA):


The pairplot is a powerful tool for initial data exploration, providing insights into the relationships and
distributions of variables.
Feature Relationships:

It helps in understanding how different features relate to each other. For example, in the context of the Iris
dataset, you can observe how petal length and petal width correlate.
Identifying Patterns:

Patterns, clusters, and potential outliers in the data can be visually identified.
Correlation Assessment:

Pairplots are useful for assessing the correlation between variables. Strong correlations may indicate multicollinearity (the correlation matrix behind the plot is printed in the sketch after this list).
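To complement the visual inspection, the pairwise Pearson correlations that the pairplot hints at can be printed as a matrix for the numeric columns; a short sketch:

import seaborn as sns

iris_df = sns.load_dataset('iris')

# Pairwise Pearson correlations for the four numeric columns of the Iris dataset
print(iris_df.drop(columns='species').corr().round(2))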

Real-life Example: Consider a dataset containing information about various physical and physiological attributes of
individuals, such as height, weight, blood pressure, and cholesterol levels. Using a pairplot, you can visualize the
relationships between these variables, identify patterns, and assess potential correlations. This can be valuable for
understanding factors that may influence overall health.

Note:

The hue parameter is particularly useful when dealing with datasets with a categorical variable. It colors the
data points based on the values of that variable, providing additional insights into patterns related to the
categorical variable (e.g., species of iris flowers in the case of the Iris dataset).

8. Lineplot (Numerical - Numerical)


In [ ]: # Compact version (assumes the 'flights' DataFrame is already loaded, as in the next cell)
new = flights.groupby('year')['passengers'].sum().reset_index()
sns.lineplot(x='year', y='passengers', data=new)

In [ ]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the 'flights' dataset from Seaborn
flights = sns.load_dataset('flights')

# Convert 'passengers' column to numeric (non-numeric values become NaN)
flights['passengers'] = pd.to_numeric(flights['passengers'], errors='coerce')

# Grouping by 'year' and summing the 'passengers'
new = flights.groupby('year')['passengers'].sum().reset_index()

# Creating a line plot
sns.lineplot(x='year', y='passengers', data=new)

# Display the plot
plt.show()

Explanation:
1. Import Libraries:

Import necessary libraries, including pandas , seaborn , and matplotlib.pyplot .


2. Load Dataset:

Load the 'flights' dataset from Seaborn using sns.load_dataset('flights') . This dataset contains
information about the number of passengers on different flights over multiple years.
3. Convert to Numeric:

Convert the 'passengers' column to numeric using pd.to_numeric() . The errors='coerce' parameter is used to handle any non-numeric values by converting them to NaN.
4. Grouping and Summing:

Group the DataFrame by the 'year' column using groupby('year') .


Sum the 'passengers' values for each year using .sum() .
Reset the index to create a new DataFrame with 'year' and 'passengers' columns.
5. Line Plot:

Create a line plot using sns.lineplot() . The 'year' values are plotted on the x-axis, and the
corresponding total 'passengers' values are plotted on the y-axis.
6. Display the Plot:

Use plt.show() to display the line plot.

Use of Line Plot:


1. Time Series Analysis:

Line plots are commonly used in time series analysis to visualize trends and patterns over time. In the given
example, the plot shows how the total number of passengers changes over different years.

2. Trend Identification:

Line plots help identify trends, cycles, or seasonality in data. An upward or downward trend in the line can
indicate an increase or decrease in the variable of interest.

3. Comparative Analysis:

Line plots allow for the comparison of trends between different categories or groups. In this case, the plot
could reveal how the total number of passengers varies across different years.

Real-life Example:
Let's consider an airline dataset where 'flights' contains information about the number of passengers each year. The
line plot visualizes the trend in the total number of passengers over several years.

In [ ]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Example dataset
data = {'year': [2010, 2011, 2012, 2013, 2014],
        'passengers': [500000, 550000, 600000, 620000, 650000]}
flights = pd.DataFrame(data)

# Grouping by 'year' and summing the 'passengers'
new = flights.groupby('year').sum().reset_index()
print(new)

# Creating a line plot
sns.lineplot(x='year', y='passengers', data=new)

# Labels, title, and display
plt.xlabel('Year')
plt.ylabel('Total Passengers')
plt.title('Trend of Total Passengers Over Years')
plt.show()

In this example, the line plot shows how the total number of passengers has increased over the years, providing
insights into the overall trend in passenger traffic.
