Bivariate Analysis
Key Concepts:
1. Introduction:
Definition: Understand the basics of bivariate analysis, focusing on the examination of two variables
simultaneously.
Scatter Plots: Visualize the relationship between two variables using scatter plots to identify trends and
patterns.
2. Correlation Coefficients:
Pearson, Spearman, and Kendall: Explore different correlation coefficients to quantify the strength and
direction of relationships.
Interpretation: Learn how to interpret correlation values and understand their significance.
3. Use Cases in Daily Life:
Finance: Analyze the correlation between income and expenditure to make informed budgeting decisions.
Health: Investigate the relationship between exercise frequency and overall well-being.
Education: Explore the correlation between study hours and academic performance.
4. Machine Learning Applications:
Feature Selection: Utilize bivariate analysis to identify key features that impact the target variable in
machine learning models.
Model Validation: Assess the relationship between predicted and actual outcomes through bivariate
analysis of model residuals.
5. Real-Life Example: Predicting Housing Prices:
Data Exploration: Use bivariate analysis to understand the correlation between various features and
housing prices.
Feature Engineering: Identify and create new features based on bivariate insights.
Model Improvement: Implement findings from bivariate analysis to enhance the predictive accuracy of a
housing price prediction model.
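As a sketch of the data-exploration step, correlations with the target can be screened in a few lines (the numbers below are hypothetical, not real housing data):

```python
import pandas as pd

# Hypothetical housing data (illustrative numbers only)
df = pd.DataFrame({
    'sqft':      [800, 1200, 1500, 1800, 2400],
    'age_years': [40, 25, 30, 10, 5],
    'price':     [150000, 210000, 250000, 320000, 420000],
})

# Correlation of each feature with the target: a first screen for feature selection
feature_corr = df.corr()['price'].drop('price').sort_values(ascending=False)
print(feature_corr)
```

Here 'sqft' correlates strongly and positively with price, while 'age_years' correlates negatively, which is the kind of insight that guides feature engineering.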
Practical Implementation:
1. Data Preparation:
Data Cleaning: Address missing values, outliers, and inconsistencies for a robust analysis.
Data Transformation: Normalize or scale variables to ensure accurate interpretation.
2. Python Code Examples:
Utilize popular libraries such as Pandas, Matplotlib, and Seaborn for efficient bivariate analysis.
Code snippets for calculating correlation coefficients, creating scatter plots, and interpreting results.
Conclusion: This notebook aims to equip users with the skills to perform insightful bivariate analysis in both
personal and machine learning contexts. By mastering these techniques, practitioners can make informed decisions,
extract valuable patterns from data, and enhance the performance of machine learning models.
Correlation:
Definition: Correlation is a statistical measure that quantifies the degree to which two variables change together. It
indicates the strength and direction of a linear relationship between two variables. The correlation coefficient is a
numerical value that ranges from -1 to 1, where:
1 indicates a perfect positive correlation (as one variable increases, the other also increases linearly).
-1 indicates a perfect negative correlation (as one variable increases, the other decreases linearly).
0 indicates no linear correlation.
Real-life Example: Consider a dataset containing the number of hours spent studying for an exam and the
corresponding exam scores. If there is a positive correlation, it suggests that as study hours increase, exam scores
tend to increase. Conversely, a negative correlation would imply that as study hours increase, exam scores tend to
decrease.
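This can be computed directly with pandas (a minimal sketch with hypothetical numbers):

```python
import pandas as pd

# Illustrative data: study hours vs exam scores (hypothetical numbers)
df = pd.DataFrame({'study_hours': [1, 2, 3, 4, 5],
                   'exam_score': [52, 58, 65, 70, 78]})

# Pearson correlation: a value near +1 means scores rise almost linearly with hours
corr = df['study_hours'].corr(df['exam_score'])
print(round(corr, 3))
```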
1. Feature Selection:
In machine learning, understanding the correlation between features helps identify and select the most
relevant features for training models. Highly correlated features might not provide additional information,
and it could be beneficial to keep only one of them.
2. Model Performance:
Correlation analysis can help assess the relationships between input features and the target variable. This
information is valuable for selecting appropriate models and understanding how changes in input features
impact the model's predictions.
3. Data Preprocessing:
Detecting and handling multicollinearity (high correlation between independent variables) is essential in
preprocessing data. Multicollinearity can lead to unstable coefficient estimates in regression models.
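One common screen for multicollinearity drops one feature from each highly correlated pair (a sketch, assuming a 0.9 threshold and illustrative data):

```python
import numpy as np
import pandas as pd

# Illustrative data: x2 is almost a multiple of x1, x3 is unrelated
df = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                   'x2': [2, 4, 6, 8, 11],
                   'x3': [5, 3, 8, 1, 9]})

# Absolute correlation matrix, keeping only the upper triangle to avoid double-counting
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Columns whose correlation with an earlier column exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)
```

In this toy example only 'x2' is flagged, since it is nearly redundant with 'x1'.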
4. Real-Life Projects:
In finance, correlation is used to understand the relationship between different assets. For instance,
knowing the correlation between stocks helps in portfolio diversification.
In healthcare, correlation analysis can be applied to understand the relationships between variables like
patient age, lifestyle, and the occurrence of specific health conditions.
In marketing, analyzing the correlation between advertising spending and sales can guide budget
allocation for different channels.
In climate science, understanding the correlation between various environmental factors can help predict
and mitigate the impact of climate change.
In social sciences, correlation analysis may be used to study the relationship between variables such as
income and education levels.
Monotonic Relationship:
Definition: A monotonic relationship is a mathematical concept that describes the consistent trend between two
variables. In a monotonic relationship, as one variable increases (or decreases), the other variable also consistently
increases (or decreases), but not necessarily at a constant rate. It doesn't have to be a straight line; it could be any
pattern that consistently moves in one direction.
Real-life Example: Consider the relationship between the number of hours spent studying for an exam and the
exam scores. A monotonic relationship in this context would mean that as study hours increase, exam scores
consistently increase, even if the relationship is not strictly linear. For example, the more time a student spends
studying, the higher the chances of achieving a higher score.
Feature Engineering: Monotonic relationships are crucial when creating features for machine learning models.
Understanding how variables change in a consistent direction helps identify meaningful features that can
improve model performance.
Correlation Analysis: Monotonic correlation, as measured by Spearman correlation, can provide insights into
the relationships between variables. This is valuable when dealing with non-linear associations in the data.
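A quick illustration with hypothetical study-hours data shows the difference, and generates a plot like the one described below (blue dots, red dashed trend):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative monotonic but non-linear data: scores rise with hours at a diminishing rate
hours = np.arange(1, 9)
scores = 40 + 20 * np.sqrt(hours)
df = pd.DataFrame({'hours': hours, 'scores': scores})

# Spearman sees a perfect monotonic trend; Pearson is high but below 1 because the curve bends
spearman = df['hours'].corr(df['scores'], method='spearman')
pearson = df['hours'].corr(df['scores'], method='pearson')
print(spearman, pearson)

# Blue dots for the data, red dashed line for the monotonic trend
plt.scatter(hours, scores, color='blue', label='Data points')
plt.plot(hours, scores, 'r--', label='Monotonic trend')
plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.legend()
plt.show()
```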
In this plot, the blue dots represent data points, and the red dashed line indicates the monotonic trend. As study
hours increase, exam scores tend to increase, demonstrating a monotonic relationship.
Linear Relationship:
Definition: A linear relationship is a mathematical concept where two variables have a constant rate of change,
forming a straight line when plotted on a graph. In a linear relationship, a change in one variable is proportional to a
change in the other variable.
Real-life Example: Consider the relationship between the distance traveled and time taken during a constant-speed
journey. The linear relationship here means that the distance covered is directly proportional to the time elapsed.
The graph of this relationship would be a straight line.
Linear Regression: Linear relationships are fundamental in linear regression models, where the goal is to
model the relationship between the independent and dependent variables as a straight line. This is widely used
in various machine learning applications.
Feature Importance: Understanding linear relationships helps in identifying which features have a significant
impact on the target variable, aiding in feature selection for model training.
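A sketch that generates such a plot, using hypothetical constant-speed data (green dots, purple line, as described below):

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative constant-speed journey: distance grows linearly with time
time_h = np.array([0, 1, 2, 3, 4, 5])
speed_kmh = 60
distance_km = speed_kmh * time_h

# Green dots for the data, purple line for the linear relationship
plt.scatter(time_h, distance_km, color='green', label='Data points')
plt.plot(time_h, distance_km, color='purple', label='Linear relationship')
plt.xlabel('Time (hours)')
plt.ylabel('Distance (km)')
plt.legend()
plt.show()
```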
In this plot, the green dots represent data points, and the purple line indicates the linear relationship. As time
elapses, the distance traveled increases at a constant rate, demonstrating a linear relationship.
Interpretability: Linear relationships are often more interpretable and easier to explain. In certain situations, a
linear model might be preferred for its simplicity and ease of interpretation.
Feature Engineering: Both monotonic and linear relationships guide feature engineering efforts. Feature
transformations or the creation of interaction terms may be applied based on the identified relationships to
improve model performance.
Reference
Pearson correlation coefficient (Wikipedia)
Correlation Coefficients:
Correlation coefficients measure the strength and direction of the relationship between two variables. There are
different types: Pearson captures linear relationships, while Spearman and Kendall capture monotonic ones.
Python Code:
import pandas as pd

# Example dataset
data = {'Study Hours': [3, 5, 7, 4, 6],
        'Exam Scores': [60, 75, 85, 70, 80]}
df = pd.DataFrame(data)

# Calculate all three correlation coefficients
print(df['Study Hours'].corr(df['Exam Scores'], method='pearson'))
print(df['Study Hours'].corr(df['Exam Scores'], method='spearman'))
print(df['Study Hours'].corr(df['Exam Scores'], method='kendall'))
Spearman Correlation:
Unlike the Pearson correlation coefficient, Spearman does not assume that the relationship between variables is
linear. Instead, it focuses on whether, as one variable increases, the other tends to consistently increase or decrease.
Real-life Example: Let's consider a practical scenario where you are analyzing the relationship between study hours
and exam scores. In a traditional linear relationship, more study hours might result in higher exam scores, but this
may not always be the case. The Spearman correlation is valuable in situations where the relationship is not strictly
linear but still follows a consistent trend.
For instance:
Linear Scenario: If there's a linear relationship, more study hours might always lead to higher scores, and the
Spearman correlation would capture this.
Non-linear Scenario: If the relationship is not strictly linear, but as study hours increase, scores tend to
consistently increase or decrease (i.e., a monotonic relationship), Spearman correlation still effectively quantifies
this trend.
Python Code Example: Let's use Python to calculate the Spearman correlation coefficient for the given dataset:
import pandas as pd

# Example dataset
data = {'Study Hours': [3, 5, 7, 4, 6],
        'Exam Scores': [60, 75, 85, 70, 80]}
df = pd.DataFrame(data)

# Calculate the Spearman correlation coefficient
spearman_corr = df['Study Hours'].corr(df['Exam Scores'], method='spearman')
print(spearman_corr)
This code calculates the Spearman correlation coefficient for the 'Study Hours' and 'Exam Scores' variables in the
provided dataset. The resulting coefficient helps quantify the monotonic relationship between study hours and
exam scores, providing insights into the consistency of the trend.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Example dataset
data = {'Study Hours': [3, 5, 7, 4, 6],
        'Exam Scores': [60, 75, 85, 70, 80]}
df = pd.DataFrame(data)

# Correlation coefficients for the legend
spearman_corr = df['Study Hours'].corr(df['Exam Scores'], method='spearman')
pearson_corr = df['Study Hours'].corr(df['Exam Scores'], method='pearson')

# Scatter plot of the data points
plt.scatter(df['Study Hours'], df['Exam Scores'], color='blue', label='Data points')

# Sort by study hours so the trend lines are drawn left to right
sorted_df = df.sort_values('Study Hours')

# Monotonic trend (red dashed line), labelled with the Spearman coefficient
plt.plot(sorted_df['Study Hours'], sorted_df['Exam Scores'], 'r--',
         label=f'Spearman: {spearman_corr:.2f}')

# Best-fit line (green), labelled with the Pearson coefficient
slope, intercept = np.polyfit(df['Study Hours'], df['Exam Scores'], 1)
plt.plot(sorted_df['Study Hours'], slope * sorted_df['Study Hours'] + intercept,
         'g-', label=f'Pearson: {pearson_corr:.2f}')

plt.xlabel('Study Hours')
plt.ylabel('Exam Scores')
plt.title('Study Hours vs Exam Scores')

# Show legend
plt.legend()

# Show the plot
plt.show()
Plot Explanation:
1. Scatter Plot:
The blue dots represent individual data points, where each dot corresponds to a combination of study
hours and exam scores.
2. Spearman Correlation Line (Red Dashed Line):
The red dashed line represents the monotonic trend captured by the Spearman correlation. It shows the
general direction of the relationship between study hours and exam scores. The legend includes the
Spearman correlation coefficient, indicating the strength of the monotonic relationship.
3. Pearson Correlation Line (Green Line):
The green line represents the linear trend captured by the Pearson correlation. It is a best-fit line that
represents the overall linear relationship between study hours and exam scores. The legend includes the
Pearson correlation coefficient, indicating the strength of the linear relationship.
4. Legend:
The legend provides information about the correlation coefficients associated with each method
(Spearman and Pearson). The coefficients are numerical values that quantify the strength and direction of
the respective relationships.
5. Axes and Labels:
The x-axis represents the 'Study Hours,' and the y-axis represents the 'Exam Scores.'
The plot includes labels for the x-axis, y-axis, and a title for better clarity.
Interpretation:
Spearman Correlation:
The red dashed line captures the overall monotonic trend between study hours and exam scores.
The Spearman correlation coefficient (included in the legend) quantifies the strength and direction of the
monotonic relationship.
Pearson Correlation:
The green line represents the linear relationship between study hours and exam scores.
The Pearson correlation coefficient (included in the legend) quantifies the strength and direction of the
linear relationship.
By visually inspecting the scatter plot and considering the correlation lines, you can gain insights into how well each
method captures the relationship between study hours and exam scores. The legend provides numerical values for
the correlation coefficients, helping you understand the degree of association in each case.
Python Code:
from sklearn import datasets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset into a DataFrame
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

# Scatterplot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='sepal length (cm)', y='sepal width (cm)',
                hue='target', data=iris_df, palette='viridis', s=80)
plt.show()
1. Import Libraries:
Import necessary libraries, including datasets from sklearn , pandas , seaborn , and
matplotlib.pyplot .
2. Load Iris Dataset:
Load the Iris dataset using datasets.load_iris() and create a DataFrame ( iris_df ) from it.
3. Scatterplot:
Use sns.scatterplot to plot sepal width against sepal length, coloring points by target class.
Interpretation:
This scatterplot helps visualize the relationship between sepal length and sepal width for different species of iris
flowers.
Insights into the distribution and separation of species can be gained.
It's possible to observe patterns and potential correlations between sepal length and sepal width.
Python Code:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset bundled with Seaborn
titanic = sns.load_dataset('titanic')

# Bar plot of survival rate by passenger class
plt.figure(figsize=(8, 6))
sns.barplot(x='pclass', y='survived', data=titanic, ci=None, palette='muted')
plt.show()
Explanation:
1. Import Libraries:
Import seaborn and matplotlib.pyplot, then load the Titanic dataset with sns.load_dataset('titanic').
Interpretation:
Each bar represents the survival rate for a specific passenger class.
'Passenger Class' is a categorical variable with three levels (1st, 2nd, and 3rd class).
'Survival Rate' is a numerical variable representing the proportion of passengers who survived.
The bar plot visually compares the survival rates across different passenger classes.
It helps to identify patterns and differences in survival rates based on the passenger class.
This type of analysis is valuable for understanding how certain factors may influence outcomes in categorical
variables.
This bivariate analysis using a bar plot provides insights into the relationship between a numerical variable (survival
rate) and a categorical variable (passenger class) in the context of the Titanic dataset.
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Titanic dataset bundled with Seaborn
titanic = sns.load_dataset('titanic')

# Box plot of age by passenger class, split by sex
plt.figure(figsize=(10, 8))
sns.boxplot(x='pclass', y='age', hue='sex', data=titanic, palette='muted')
plt.show()
Interpretation:
The box plot visualizes the distribution of ages for each passenger class and gender.
Each box represents the interquartile range (IQR) of ages within a specific category.
The horizontal line inside the box represents the median age.
Whiskers extend to show the range of ages within 1.5 times the IQR, and individual points beyond the whiskers
are considered outliers.
Insights:
This box plot allows you to compare the distribution of ages across different passenger classes and genders.
You can observe the median age, the spread of ages (IQR), and identify potential outliers.
It provides insights into the age demographics of passengers in each class and gender category.
Explanation:
1. Import Libraries:
Import seaborn and matplotlib.pyplot, and load the Titanic dataset.
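The distribution plot discussed below can be sketched as follows (a minimal sketch; seaborn's distplot is deprecated, so kdeplot is used here, and Seaborn's bundled Titanic dataset is assumed):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Seaborn's bundled Titanic dataset (downloaded on first use)
titanic = sns.load_dataset('titanic')

# KDE curves of age for survivors vs non-survivors
plt.figure(figsize=(8, 6))
sns.kdeplot(data=titanic, x='age', hue='survived', fill=True)
plt.xlabel('Age')
plt.ylabel('Density')
plt.show()
```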
Interpretation:
The plot visualizes the distribution of ages for passengers who survived and those who did not survive the
Titanic disaster.
The x-axis represents age, and the y-axis represents the density (probability) of observing a particular age.
The smooth curves (KDE) help in visualizing the probability distribution of ages for each category.
Use of Distplot:
Comparison of Distributions: Distplots are useful for comparing the distributions of numerical variables across
different categories.
Identifying Patterns: Distplots provide insights into the patterns and shapes of distributions within different
groups.
Survival Analysis: In this case, the plot is used to analyze the age distribution of passengers based on their
survival status.
Customer Behavior Analysis: For instance, comparing the age distribution of customers who made a purchase
versus those who didn't in an e-commerce dataset.
Real-life Example: In the context of the Titanic dataset, this plot helps answer questions like "Were there any age-
related patterns in survival rates?" For instance, it might reveal if certain age groups were more likely to survive or
not. This type of analysis can provide valuable insights for understanding the factors that influenced survival on the
Titanic.
Python Code:
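A minimal sketch of the heatmap (assuming Seaborn's bundled Titanic dataset, with its 'pclass' and 'survived' columns):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Seaborn's bundled Titanic dataset (downloaded on first use)
titanic = sns.load_dataset('titanic')

# Contingency table: passenger class (rows) vs survival status (columns)
cross_tab = pd.crosstab(titanic['pclass'], titanic['survived'])

# Heatmap of the counts; annot writes the count in each cell
plt.figure(figsize=(8, 6))
sns.heatmap(cross_tab, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Survived')
plt.ylabel('Passenger Class')
plt.show()
```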
Explanation:
1. Import Libraries:
Import pandas, seaborn, and matplotlib.pyplot.
2. Load Data:
Assuming you have a DataFrame named 'titanic' containing the Titanic dataset.
3. Create Heatmap:
Build a contingency table of passenger class versus survival status with pd.crosstab and pass it to sns.heatmap.
Interpretation:
The heatmap visually represents the relationship between passenger class and survival status.
Rows represent passenger class, and columns represent survival status.
The color intensity in each cell indicates the count of passengers falling into a specific combination of class and
survival status.
Visualizing Associations: Heatmaps are effective for visually assessing associations between two categorical
variables. In this case, it helps understand the distribution of survivors and non-survivors across different
passenger classes.
Exploratory Data Analysis (EDA): Use heatmaps during EDA to uncover patterns and relationships between
categorical variables.
Feature Importance: In machine learning, understanding the relationships between categorical features and
the target variable is essential for feature selection.
Real-life Example:
The heatmap could be used to analyze whether there is a correlation between passenger class and survival
status on the Titanic. It might help identify patterns such as a higher survival rate in certain passenger classes.
2. Load Data:
Load the Titanic dataset into a DataFrame ( titanic ). In this example, we assume that the 'titanic.csv' file
contains the dataset.
3. Create Crosstab:
Use pd.crosstab to create a contingency table between the 'parch' (number of parents/children aboard)
and 'survived' columns.
4. sns.clustermap :
Pass the crosstab to sns.clustermap to draw a hierarchically clustered heatmap.
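A minimal sketch of steps 3 and 4 (assuming Seaborn's bundled Titanic dataset rather than a local 'titanic.csv'):

```python
import pandas as pd
import seaborn as sns

# Seaborn's bundled Titanic dataset (downloaded on first use)
titanic = sns.load_dataset('titanic')

# Contingency table: parents/children aboard (rows) vs survival status (columns)
cross_tab = pd.crosstab(titanic['parch'], titanic['survived'])

# Clustered heatmap: rows and columns are reordered by hierarchical clustering
grid = sns.clustermap(cross_tab, annot=True, fmt='d', cmap='coolwarm')
```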
Use of sns.clustermap :
1. Identifying Patterns:
The clustermap helps identify patterns or relationships between two categorical variables. In this case, it
explores the distribution of survival outcomes based on the number of parents/children aboard.
2. Hierarchical Clustering:
The hierarchical clustering visually groups similar rows and columns, providing insights into potential
associations or trends within the data.
3. Data Exploration:
Useful for exploratory data analysis to reveal hidden structures or dependencies between categorical
variables.
Real-life Example:
In the Titanic dataset, the clustermap can help identify if there are specific patterns in survival based on the number
of parents/children aboard. It might reveal whether certain family sizes had higher or lower survival rates,
contributing to a deeper understanding of the dataset.
For example, the heatmap may show clusters where survival rates are higher for specific family sizes, indicating
potential trends in passenger survival based on family relationships.
This type of analysis is valuable for understanding the dynamics of survival in different family structures and can
guide further investigations or feature engineering in a machine learning context.
7. Pairplot
The pairplot is a powerful visualization tool provided by the Seaborn library in Python. It creates a grid of
scatterplots, histograms, and density plots for multiple variables in a dataset. Each combination of variables is
visualized, making it particularly useful for exploring relationships between multiple numerical variables.
from sklearn import datasets
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset and add a 'species' column
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Create a pairplot colored by species
sns.pairplot(iris_df, hue='species')
plt.show()
1. Import Libraries:
Import necessary libraries, including seaborn , matplotlib.pyplot , and datasets from sklearn .
2. Load Iris Dataset:
Load the Iris dataset using datasets.load_iris() and create a DataFrame ( iris_df ) from it.
3. Create Pairplot:
Call sns.pairplot(iris_df, hue='species') to draw the grid of plots.
Interpretation:
The pairplot shows scatterplots for all combinations of numerical variables (sepal length, sepal width, petal
length, petal width) in the Iris dataset.
Diagonal elements display histograms or kernel density plots for each variable.
Use of Pairplot:
It helps in understanding how different features relate to each other. For example, in the context of the Iris
dataset, you can observe how petal length and petal width correlate.
Identifying Patterns:
Patterns, clusters, and potential outliers in the data can be visually identified.
Correlation Assessment:
Pairplots are useful for assessing the correlation between variables. Strong correlations may indicate
multicollinearity.
Real-life Example: Consider a dataset containing information about various physical and physiological attributes of
individuals, such as height, weight, blood pressure, and cholesterol levels. Using a pairplot, you can visualize the
relationships between these variables, identify patterns, and assess potential correlations. This can be valuable for
understanding factors that may influence overall health.
Note:
The hue parameter is particularly useful when dealing with datasets with a categorical variable. It colors the
data points based on the values of that variable, providing additional insights into patterns related to the
categorical variable (e.g., species of iris flowers in the case of the Iris dataset).
Explanation:
1. Import Libraries:
Import pandas, seaborn, and matplotlib.pyplot.
2. Load Data:
Load the 'flights' dataset from Seaborn using sns.load_dataset('flights') . This dataset contains
information about the number of passengers on different flights over multiple years.
3. Aggregate:
Group the data by 'year' and sum the 'passengers' column to get the total passengers per year.
4. Create the Line Plot:
Create a line plot using sns.lineplot() . The 'year' values are plotted on the x-axis, and the
corresponding total 'passengers' values are plotted on the y-axis.
5. Display the Plot:
Call plt.show() to render the figure.
Use of Line Plots:
1. Time Series Analysis:
Line plots are commonly used in time series analysis to visualize trends and patterns over time. In the given
example, the plot shows how the total number of passengers changes over different years.
2. Trend Identification:
Line plots help identify trends, cycles, or seasonality in data. An upward or downward trend in the line can
indicate an increase or decrease in the variable of interest.
3. Comparative Analysis:
Line plots allow for the comparison of trends between different categories or groups. In this case, the plot
could reveal how the total number of passengers varies across different years.
Real-life Example:
Let's consider an airline dataset where 'flights' contains information about the number of passengers each year. The
line plot visualizes the trend in the total number of passengers over several years.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Example dataset
data = {'year': [2010, 2011, 2012, 2013, 2014],
        'passengers': [500000, 550000, 600000, 620000, 650000]}
flights = pd.DataFrame(data)

# Line plot of total passengers per year
sns.lineplot(x='year', y='passengers', data=flights)
plt.xlabel('Year')
plt.ylabel('Passengers')
plt.show()
In this example, the line plot shows how the total number of passengers has increased over the years, providing
insights into the overall trend in passenger traffic.