
Mastering Exploratory Data Analysis with Python: A Comprehensive Guide to Unveiling Hidden Insights

Datasans
https://datasans.medium.com/

Table of Contents

Chapter 1: Introduction to Exploratory Data Analysis with Python.................................. 7


1.1 The Importance of Exploratory Data Analysis.............................................................. 7
1.2 The Role of Python in EDA..............................................................................................7
1.3 Steps in the EDA Process............................................................................................... 8
1.4 Descriptive Statistics in EDA.......................................................................................... 8
1.5 Data Visualization in EDA................................................................................................9
1.6 Handling Missing Data and Outliers...............................................................................9
1.7 Summary.........................................................................................................................10
Chapter 2: Data Cleaning and Preprocessing Techniques...............................................11
2.1 Introduction.................................................................................................................... 11
2.2 Removing Unnecessary Data........................................................................................11
2.3 Dealing with Duplicate Records................................................................................... 12
2.4 Correcting Data Entry Errors........................................................................................ 12
2.5 Data Transformation...................................................................................................... 12
2.6 Data Validation............................................................................................................... 14
2.7 Summary.........................................................................................................................14
Chapter 3: Essential Python Libraries for EDA: NumPy, pandas, and matplotlib......... 15
3.1 Introduction.................................................................................................................... 15
3.2 NumPy.............................................................................................................................15
3.2.1 Creating and Manipulating NumPy Arrays.............................................................. 15
3.2.2 Mathematical Operations with NumPy.................................................................... 16
3.3 pandas.............................................................................................................................16
3.3.1 Creating and Manipulating pandas DataFrames.....................................................16
3.3.2 Working with Data in pandas...................................................................................17
3.4 matplotlib........................................................................................................................ 18
3.4.1 Basic Plotting with matplotlib...................................................................................18
3.4.2 Customizing Plots with matplotlib............................................................................20
3.5 Summary.........................................................................................................................22
Chapter 4: Descriptive Statistics and Data Visualization with Python............................23
4.1 Introduction.................................................................................................................... 23
4.2 Descriptive Statistics.....................................................................................................23
4.2.1 Measures of Central Tendency................................................................................23
4.2.2 Measures of Dispersion...........................................................................................24
4.2.3 Measures of Shape................................................................................................. 25
4.3 Data Visualization...........................................................................................................26
4.3.1 Univariate Data Visualization.................................................................................. 26
4.3.2 Bivariate Data Visualization.....................................................................................29
4.3.3 Multivariate Data Visualization................................................................................ 33
4.4 Summary.........................................................................................................................35
Chapter 5: Advanced Data Visualization Techniques with seaborn and Plotly............. 36
5.1 Introduction.................................................................................................................... 36
5.2 Seaborn: Statistical Data Visualization........................................................................ 36

5.2.1 FacetGrid: Multi-plot Grids for Conditional Relationships........................................36
5.2.2 PairGrid: Pairwise Relationships in a Dataset.........................................................37
5.2.3 Clustermap: Hierarchical Clustering and Heatmap................................................. 38
5.3 Plotly: Interactive Data Visualization............................................................................40
5.3.1 Interactive Line and Scatter Plots............................................................................40
5.3.2 Interactive Bar Plots and Pie Charts....................................................................... 41
5.3.3 Interactive Heatmaps.............................................................................................. 41
5.4 Summary.........................................................................................................................43
Chapter 6: Handling Missing Data and Outliers in Python.............................................. 44
6.1 Introduction.................................................................................................................... 44
6.2 Handling Missing Data...................................................................................................44
6.2.1 Identifying Missing Data.......................................................................................... 44
6.2.2 Deleting Missing Data............................................................................................. 45
6.2.3 Imputing Missing Data.............................................................................................45
6.3 Handling Outliers........................................................................................................... 46
6.3.1 Identifying Outliers...................................................................................................46
6.3.1.1 Z-score Method.............................................................................................. 46
6.3.1.2 IQR Method.................................................................................................... 47
6.3.2 Handling Outliers.....................................................................................................47
6.3.2.1 Deleting Outliers............................................................................................. 47
6.3.2.2 Capping Outliers.............................................................................................47
6.3.2.3 Transforming Data..........................................................................................48
6.4 Summary.........................................................................................................................48
Chapter 7: Dimensionality Reduction Techniques: PCA and t-SNE................................49
7.1 Introduction to Dimensionality Reduction...................................................................49
7.2 Principal Component Analysis (PCA).......................................................................... 49
7.2.1 Theory behind PCA................................................................................................. 49
7.2.2 Implementing PCA in Python.................................................................................. 49
7.3 t-Distributed Stochastic Neighbor Embedding (t-SNE).............................................. 51
7.3.1 Theory behind t-SNE...............................................................................................51
7.3.2 Implementing t-SNE in Python................................................................................ 51
7.4 Comparing PCA and t-SNE........................................................................................... 52
7.5 Practical Tips for Dimensionality Reduction............................................................... 53
7.6 Summary.........................................................................................................................53
Chapter 8: Time Series Analysis and Forecasting with Python...................................... 54
8.1 Introduction to Time Series Analysis........................................................................... 54
8.2 Components of a Time Series.......................................................................................54
8.3 Stationarity and Differencing........................................................................................ 54
8.4 Autocorrelation and Partial Autocorrelation............................................................... 55
8.5 Time Series Analysis Techniques in Python............................................................... 55
8.5.1 Decomposition.........................................................................................................55
8.5.2 Smoothing............................................................................................................... 56
8.6 Time Series Forecasting Techniques in Python..........................................................56
8.6.1 Autoregressive Integrated Moving Average (ARIMA)............................................. 56

8.6.2 Seasonal Decomposition of Time Series (STL) and ARIMA................................... 57
8.7 Model Evaluation and Selection................................................................................... 58
8.8 Summary.........................................................................................................................58
Chapter 9: Text Data Exploration and Natural Language Processing.............................59
9.1 Introduction.................................................................................................................... 59
9.2 Text Preprocessing........................................................................................................ 59
9.3 Feature Extraction..........................................................................................................60
9.4 Natural Language Processing Techniques..................................................................61
9.4.1 Sentiment Analysis..................................................................................................61
9.4.2 Text Classification....................................................................................................61
9.4.3 Topic Modeling........................................................................................................ 62
9.5 Summary.........................................................................................................................63
Chapter 10: Case Studies: Real-World Applications of EDA with Python......................64
10.1 Introduction.................................................................................................................. 64
10.2 Case Study 1: Customer Segmentation for a Retail Company................................64
10.3 Case Study 2: Sales Forecasting for a Manufacturing Company............................67
10.4 Case Study 3: Sentiment Analysis for a Movie Streaming Platform....................... 70

Introduction

"Mastering Exploratory Data Analysis with Python: A Comprehensive Guide to Unveiling


Hidden Insights" is a practical, hands-on guide that takes you through the process of
understanding, cleaning, visualizing, and analyzing data using Python. This book will equip
you with the necessary skills to perform effective data analysis and extract valuable insights
from your data, providing you with a solid foundation in exploratory data analysis (EDA)
with Python.

In today's data-driven world, making sense of vast amounts of data is crucial for making
informed decisions. EDA is an essential step in the data analysis process, enabling you to
explore, summarize, and visualize your data to generate insights and inform
decision-making. With Python being one of the most popular programming languages for
data analysis, mastering EDA with Python is an invaluable skill for data analysts, data
scientists, and professionals in various industries.

This book is divided into 10 chapters, each focusing on a specific aspect of EDA with Python.

Chapter 1 provides an introduction to exploratory data analysis, its importance, and the role
of Python in EDA. This chapter sets the stage for understanding the subsequent chapters
and helps you develop a mindset for effective data exploration.

Chapter 2 focuses on data cleaning and preprocessing techniques, an essential step in the
EDA process to ensure that your data is accurate and ready for analysis. This chapter covers
data wrangling, data transformation, and feature engineering.

Chapter 3 introduces essential Python libraries for EDA, including NumPy, pandas, and
matplotlib. You will learn how to use these libraries to manipulate, analyze, and visualize
your data effectively.

Chapter 4 delves into descriptive statistics and data visualization with Python. You will learn
how to use Python to compute summary statistics and create various types of visualizations
to explore your data further.

Chapter 5 covers advanced data visualization techniques using seaborn and Plotly, two
powerful Python libraries that enable you to create stunning and interactive visualizations
with ease.

Chapter 6 focuses on handling missing data and outliers in Python, teaching you techniques
to detect, analyze, and address these common data issues that can significantly impact your
analysis.

Chapter 7 introduces dimensionality reduction techniques, such as Principal Component
Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE), to help you better
understand complex, high-dimensional data.

Chapter 8 covers time series analysis and forecasting with Python, equipping you with the
skills to analyze, model, and forecast data with a time component.

Chapter 9 dives into text data exploration and natural language processing, enabling you to
analyze and extract insights from textual data using Python.

Finally, Chapter 10 presents case studies illustrating real-world applications of EDA with
Python. These case studies provide practical examples of how the concepts and techniques
discussed in the previous chapters can be applied to real-world problems, demonstrating
the power and versatility of EDA with Python.

By the end of this book, you will have a thorough understanding of the EDA process and be
proficient in using Python to explore, visualize, and analyze your data. With the practical
knowledge and experience gained from working through the examples and case studies, you
will be well-equipped to tackle various data analysis tasks and uncover hidden insights from
your data.

Whether you are a beginner looking to learn the fundamentals of EDA with Python or an
experienced professional seeking to enhance your data analysis skills, "Mastering
Exploratory Data Analysis with Python: A Comprehensive Guide to Unveiling Hidden
Insights" is an invaluable resource that will help you become a more effective data analyst or
data scientist.

Embark on this exciting journey through the world of data exploration and unlock the
power of Python to transform raw data into valuable insights, enabling you to make
data-driven decisions and contribute to the success of your organization. With a strong
foundation in EDA and Python, you will be well-prepared to face the challenges of an
increasingly data-driven world and excel in your chosen field.

Chapter 1: Introduction to Exploratory
Data Analysis with Python

1.1 The Importance of Exploratory Data Analysis


In the era of big data and data-driven decision-making, understanding and making sense of
data is more important than ever. Exploratory Data Analysis (EDA) is a crucial step in the
data analysis process that allows you to investigate, summarize, and visualize your data to
generate insights and inform decision-making. EDA helps you to identify patterns, detect
anomalies, and uncover relationships between variables, ultimately leading to the
formulation of hypotheses and the selection of appropriate statistical models.

EDA is an iterative and interactive process, combining both quantitative and qualitative
methods. By exploring your data visually and statistically, you can develop a better
understanding of the underlying structure and characteristics of the data, which can guide
you in identifying potential problems and opportunities for further analysis.

1.2 The Role of Python in EDA


Python has emerged as one of the most popular programming languages for data analysis,
thanks to its simplicity, versatility, and the wide range of powerful libraries available for
handling and processing data. Python offers a rich ecosystem of tools and libraries that
make it easy to perform EDA, including:

pandas: A library for data manipulation and analysis, providing data structures like
DataFrames and Series, which enable you to work with structured data efficiently.

NumPy: A library for numerical computing in Python, offering support for
multi-dimensional arrays and various mathematical functions.

matplotlib: A plotting library that allows you to create static, animated, and interactive
visualizations in Python.

seaborn: A statistical data visualization library based on matplotlib, providing a high-level
interface for creating informative and attractive visualizations.

Plotly: A library for creating interactive, web-based visualizations that can be used to
explore and analyze data in a more intuitive way.

By using these libraries in combination, you can perform EDA with Python quickly and
effectively, making it an invaluable tool for data analysts and data scientists.
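
Throughout this book, these libraries are imported under their conventional aliases. A minimal
sketch, assuming the libraries are already installed:
import numpy as np               # numerical arrays and mathematics
import pandas as pd              # tabular data structures (Series, DataFrame)
import matplotlib.pyplot as plt  # static plotting
import seaborn as sns            # statistical visualization built on matplotlib
import plotly.graph_objs as go   # interactive, web-based visualization

print("pandas", pd.__version__, "| NumPy", np.__version__)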

1.3 Steps in the EDA Process
The EDA process typically consists of the following steps:

Data Collection: The first step in any data analysis process is gathering the data. Data can be
collected from various sources, such as databases, APIs, web scraping, or even manual
entry.

Data Cleaning and Preprocessing: Once the data is collected, it needs to be cleaned and
preprocessed to ensure that it is accurate and ready for analysis. This step involves handling
missing values, removing duplicate records, correcting data entry errors, and transforming
data into appropriate formats.

Data Exploration: The main part of EDA is exploring the data to gain insights and identify
patterns. This can be done using descriptive statistics, such as measures of central tendency
(mean, median, mode) and dispersion (range, variance, standard deviation), as well as data
visualizations, such as histograms, box plots, scatter plots, and heatmaps.

Feature Engineering: Based on the insights gained from data exploration, you may need to
create new features or modify existing ones to better represent the underlying patterns in
the data. This step involves techniques such as encoding categorical variables, scaling
numerical variables, and creating interaction features.

Model Selection and Validation: After exploring the data and engineering features, you can
use the insights gained to select appropriate statistical models for further analysis or
prediction. Model validation is then performed to assess the performance of the selected
models and ensure their reliability.
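
The short sketch below strings these steps together at a very high level; the file name and the
specific cleaning and encoding choices are placeholders for whatever your own dataset requires:
import pandas as pd
import matplotlib.pyplot as plt

# 1. Data collection: load a dataset (placeholder file name)
data = pd.read_csv("your_dataset.csv")

# 2. Data cleaning: drop duplicates and fill missing numeric values with the median
data = data.drop_duplicates()
data = data.fillna(data.median(numeric_only=True))

# 3. Data exploration: summary statistics and quick histograms
print(data.describe())
data.hist(figsize=(10, 6))
plt.show()

# 4. Feature engineering: one-hot encode the categorical columns
data = pd.get_dummies(data)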

1.4 Descriptive Statistics in EDA


Descriptive statistics are used to summarize and describe the main features of a dataset,
providing a snapshot of its overall structure and characteristics. The most common
descriptive statistics used in EDA include:

Measures of Central Tendency: These statistics describe the center or average value of a
dataset. The most common measures of central tendency are the mean (arithmetic average),
median (middle value), and mode (most frequently occurring value).

Measures of Dispersion: These statistics describe the spread or variability of a dataset. The
most common measures of dispersion are the range (difference between the minimum and
maximum values), variance (average of the squared differences from the mean), and
standard deviation (square root of the variance).

Measures of Shape: These statistics describe the shape of the distribution of a dataset. The
most common measures of shape are skewness (a measure of the asymmetry of the
distribution) and kurtosis (a measure of the "tailedness" or peakiness of the distribution).
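
As a brief preview of Chapter 4, all of these statistics can be computed directly with pandas
on a small example Series:
import pandas as pd

values = pd.Series([2, 4, 4, 5, 7, 9, 30])

print(values.mean(), values.median(), values.mode().iloc[0])    # central tendency
print(values.max() - values.min(), values.var(), values.std())  # dispersion
print(values.skew(), values.kurt())                             # shape: skewness and kurtosis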

1.5 Data Visualization in EDA


Data visualization is a key component of EDA, as it allows you to explore your data visually
and identify patterns, trends, and relationships that may not be apparent from descriptive
statistics alone. Some common types of data visualizations used in EDA include:

Histograms: A histogram is a graphical representation of the distribution of a dataset,
showing the frequency of values within specific intervals (bins). Histograms can help you
identify the overall shape, central tendency, and dispersion of your data.

Box Plots: A box plot (also known as a box-and-whisker plot) is a graphical representation of
the distribution of a dataset, showing the median, quartiles, and outliers. Box plots can help
you identify the central tendency, dispersion, and potential outliers in your data.

Scatter Plots: A scatter plot is a graphical representation of the relationship between two
variables, with each data point represented as a point in a Cartesian coordinate system.
Scatter plots can help you identify correlations, trends, and clusters in your data.

Heatmaps: A heatmap is a graphical representation of data where individual values are
represented as colors, with one color representing high values and another color
representing low values. Heatmaps can help you identify patterns, clusters, and outliers in
your data, especially when dealing with large datasets or high-dimensional data.
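
As a quick preview (Chapter 4 covers each of these plots in detail), the sketch below draws all
four plot types on a small synthetic dataset:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].hist(x, bins=20)         # histogram
axes[0, 0].set_title('Histogram')
axes[0, 1].boxplot(x)               # box plot
axes[0, 1].set_title('Box Plot')
axes[1, 0].scatter(x, y)            # scatter plot
axes[1, 0].set_title('Scatter Plot')
axes[1, 1].imshow(rng.random((5, 5)), cmap='viridis')  # simple heatmap
axes[1, 1].set_title('Heatmap')
plt.tight_layout()
plt.show()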

1.6 Handling Missing Data and Outliers


Missing data and outliers are common issues in real-world datasets that can significantly
impact your analysis. Handling these issues appropriately is an important part of the EDA
process.

Missing Data: Missing data can result from various factors, such as data entry errors, data
collection problems, or data processing errors. Handling missing data typically involves
either imputing the missing values (using techniques such as mean, median, or mode
imputation, or more advanced methods like k-Nearest Neighbors imputation) or removing
the records with missing values (listwise deletion).
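
A minimal sketch of both options, assuming data is a pandas DataFrame with a numeric "age"
column (Chapter 6 covers these techniques in detail):
# Count missing values per column
print(data.isna().sum())

# Option 1: listwise deletion of rows that contain any missing value
data_dropped = data.dropna()

# Option 2: impute a numeric column with its median
data["age"] = data["age"].fillna(data["age"].median())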

Outliers: Outliers are data points that are significantly different from the rest of the data,
either due to errors in data collection or processing, or because they represent genuine
extreme values. Identifying and handling outliers is important to ensure that your analysis
is not unduly influenced by these extreme values. Outliers can be detected using various
techniques, such as Z-scores, IQR (interquartile range), or visual inspection of box plots and
scatter plots. Handling outliers may involve transforming the data (e.g., log transformation)
or removing the outliers from the analysis.
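
A minimal sketch of the Z-score and IQR rules, again assuming data is a pandas DataFrame with
a numeric "age" column:
import numpy as np

col = data["age"]

# Z-score rule: flag values more than 3 standard deviations from the mean
z_scores = (col - col.mean()) / col.std()
z_outliers = data[np.abs(z_scores) > 3]

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1
iqr_outliers = data[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]

print(len(z_outliers), len(iqr_outliers))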

1.7 Summary
Exploratory Data Analysis is a critical step in the data analysis process that allows you to
gain a deeper understanding of your data and identify potential problems and opportunities
for further analysis. By mastering EDA with Python, you will be well-equipped to tackle a
wide range of data analysis tasks and uncover hidden insights from your data. In the
following chapters, you will learn about the various tools and techniques used in EDA with
Python, starting with data cleaning and preprocessing techniques in Chapter 2.

Chapter 2: Data Cleaning and
Preprocessing Techniques

2.1 Introduction
Data cleaning and preprocessing are essential steps in the data analysis process, as they
ensure that your data is accurate, consistent, and ready for analysis. In this chapter, we will
cover various data cleaning and preprocessing techniques, such as handling missing values,
dealing with duplicate records, correcting data entry errors, and transforming data into
appropriate formats. We will focus on techniques that are not covered in other chapters to
provide a comprehensive understanding of data cleaning and preprocessing.

2.2 Removing Unnecessary Data


In real-world datasets, you might come across columns or rows that do not provide any
useful information for your analysis. These may include columns with the same value for all
records, irrelevant columns, or rows that do not meet specific criteria. Removing
unnecessary data can help reduce the size of your dataset, making it easier to analyze and
visualize.

To remove unnecessary columns in pandas, you can use the drop function:
import pandas as pd

# Load the dataset


data = pd.read_csv("example.csv")

# Drop unnecessary columns


data = data.drop(["column_to_remove1", "column_to_remove2"], axis=1)

To remove rows based on specific criteria, you can use boolean indexing:
# Remove rows where the value in the "age" column is less than 18
data = data[data["age"] >= 18]

2.3 Dealing with Duplicate Records
Duplicate records can occur in datasets due to data entry errors, data collection problems,
or data processing issues. Identifying and handling duplicate records is important to ensure
the accuracy and consistency of your analysis.

To identify duplicate records in pandas, you can use the duplicated function:
# Find duplicate records
duplicates = data.duplicated()

# Print the number of duplicate records


print("Number of duplicate records:", duplicates.sum())

To remove duplicate records, you can use the drop_duplicates function:


# Remove duplicate records
data = data.drop_duplicates()

2.4 Correcting Data Entry Errors


Data entry errors can introduce inconsistencies and inaccuracies in your dataset. Common
data entry errors include typos, inconsistent capitalization, and incorrect formatting. To
correct data entry errors, you can use various string manipulation functions available in
pandas.

For example, to correct inconsistent capitalization in a column, you can use the str.lower or
str.upper functions:
# Convert all values in the "name" column to lowercase
data["name"] = data["name"].str.lower()

To correct typos or replace specific values, you can use the replace function:
# Replace incorrect values in the "gender" column
data["gender"] = data["gender"].replace({"Malee": "Male", "Femalle":
"Female"})

2.5 Data Transformation


Data transformation involves converting data into a format that is suitable for analysis. This
may include converting categorical variables into numerical variables, normalizing
numerical variables, or creating new variables based on existing variables.

Converting categorical variables: Categorical variables can be converted into numerical
variables using various encoding techniques, such as one-hot encoding or ordinal encoding.

One-hot encoding creates binary columns for each category, while ordinal encoding assigns
an integer value to each category. In pandas, you can use the get_dummies function for
one-hot encoding and the replace function for ordinal encoding:

# One-hot encoding
one_hot_encoded_data = pd.get_dummies(data,
columns=["categorical_column"])

# Ordinal encoding
ordinal_mapping = {"low": 1, "medium": 2, "high": 3}
data["ordinal_column"] = data["ordinal_column"].replace(ordinal_mapping)

Normalizing numerical variables: Normalizing numerical variables involves scaling the


values to a specific range, typically [0, 1] or [-1, 1]. This can help improve the performance of
certain machine learning algorithms and make it easier to compare variables with different
scales. In Python, you can use the `MinMaxScaler` from the `sklearn.preprocessing`
module to normalize numerical variables:
from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler object


scaler = MinMaxScaler()

# Fit the scaler to the data and transform the values


data["numerical_column"] =
scaler.fit_transform(data[["numerical_column"]])

Creating new variables: Creating new variables based on existing variables can help you
extract valuable insights from your data. For example, you can create a new variable that
represents the ratio between two variables, or a variable that captures the interaction
between two variables. To create new variables in pandas, you can use the assignment
operator (=) and perform arithmetic operations on existing columns:
# Create a new variable representing the ratio between two variables
data["ratio"] = data["column1"] / data["column2"]

# Create a new variable representing the interaction between two variables
data["interaction"] = data["column1"] * data["column2"]

2.6 Data Validation
Data validation involves checking your dataset for errors or inconsistencies and correcting
them as necessary. This may include checking for outliers, validating data against external
sources, or verifying that data meets specific constraints.

Checking for outliers: Outliers are data points that are significantly different from the rest of
the data. Identifying and handling outliers is important to ensure that your analysis is not
biased by extreme values. In Python, you can use various visualization techniques, such as
box plots or scatter plots, to identify outliers in your data. Once identified, you can decide
whether to remove or transform the outliers, depending on the nature of the data and the
goals of your analysis.
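
For example, a quick box plot of a numeric column (using the DataFrame loaded earlier in this
chapter) makes extreme values easy to spot:
import matplotlib.pyplot as plt

# Visual outlier check for the "age" column
data.boxplot(column="age")
plt.title("Box plot of age")
plt.show()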

Validating data against external sources: If your dataset contains information that can be
validated against external sources, such as geolocation data, dates, or addresses, it is a good
practice to perform data validation to ensure the accuracy of your data. This may involve
comparing your data to external databases, APIs, or other reliable sources of information.

Verifying data constraints: Ensuring that your data meets specific constraints, such as
minimum or maximum values, unique values, or specific data types, can help you catch
errors and inconsistencies in your dataset. In pandas, you can use various functions, such as
describe, value_counts, or dtypes, to verify that your data meets the desired constraints.
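
A minimal sketch of these checks, reusing the "age" and "gender" columns from the earlier
examples in this chapter (the uniqueness check uses a hypothetical "customer_id" column):
# Overview of numeric columns; min and max can reveal impossible values
print(data.describe())

# Distinct values of a categorical column
print(data["gender"].value_counts())

# Column data types
print(data.dtypes)

# Verify a range constraint, e.g. ages must lie between 18 and 120
assert data["age"].between(18, 120).all(), "Found ages outside the expected range"

# Verify a uniqueness constraint on a hypothetical identifier column
assert data["customer_id"].is_unique, "Duplicate customer IDs found"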

2.7 Summary
Data cleaning and preprocessing are crucial steps in the data analysis process, ensuring that
your dataset is accurate, consistent, and ready for analysis. In this chapter, we covered
various data cleaning and preprocessing techniques, such as handling missing values,
dealing with duplicate records, correcting data entry errors, transforming data into
appropriate formats, and validating data. By applying these techniques to your dataset, you
can improve the quality of your data and enhance the reliability of your analysis.

In the next chapter, we will explore essential Python libraries for Exploratory Data Analysis,
including NumPy, pandas, and matplotlib, and demonstrate how to use these libraries to
perform data manipulation, analysis, and visualization.

Chapter 3: Essential Python Libraries
for EDA: NumPy, pandas, and matplotlib

3.1 Introduction
Exploratory Data Analysis (EDA) is a critical step in the data analysis process that helps you
understand your data, identify patterns, and uncover insights. To perform EDA effectively in
Python, it's essential to be familiar with some of the core libraries, such as NumPy, pandas,
and matplotlib. In this chapter, we will introduce these libraries and demonstrate how to
use them for data manipulation, analysis, and visualization.

3.2 NumPy
NumPy (short for Numerical Python) is a foundational library for numerical computing in
Python. It provides a high-performance, multidimensional array object called ndarray,
along with a collection of mathematical functions to perform operations on these arrays.
NumPy is the foundation for many other scientific libraries in Python, including pandas and
matplotlib.

3.2.1 Creating and Manipulating NumPy Arrays


To create a NumPy array, you can use the numpy.array function:
import numpy as np

# Create a 1D NumPy array


arr1 = np.array([1, 2, 3, 4])

# Create a 2D NumPy array


arr2 = np.array([[1, 2], [3, 4]])

NumPy provides a range of functions to manipulate arrays, such as reshaping, slicing, and
concatenating:
# Reshape the array
arr2 = arr2.reshape(1, 4)

# Slice the array

arr_slice = arr1[1:3]

# Concatenate two arrays


arr_concat = np.concatenate((arr1, arr2), axis=None)

3.2.2 Mathematical Operations with NumPy


NumPy offers a comprehensive set of mathematical functions that can be applied to arrays,
such as arithmetic operations, linear algebra, and statistical operations:
# Redefine arr2 as a 1D array so that its shape matches arr1
arr2 = np.array([5, 6, 7, 8])

# Element-wise addition
sum_arr = arr1 + arr2

# Dot product
dot_product = np.dot(arr1, arr2)

# Mean of an array
mean = np.mean(arr1)

# Standard deviation of an array
std_dev = np.std(arr1)

3.3 pandas
pandas is a powerful data manipulation and analysis library that builds on top of NumPy. It
provides two main data structures: the Series for 1D data and the DataFrame for 2D data.
pandas makes it easy to perform common data manipulation tasks, such as filtering,
sorting, and aggregating data, and integrates well with other Python libraries for data
analysis and visualization.

3.3.1 Creating and Manipulating pandas DataFrames


To create a pandas DataFrame, you can use the pandas.DataFrame function:
import pandas as pd

# Create a DataFrame from a dictionary


data = {'column1': [1, 2, 3], 'column2': [4, 5, 6]}
df = pd.DataFrame(data)

pandas provides various functions to manipulate DataFrames, such as filtering, sorting, and
aggregating data:

# Filter data based on a condition
filtered_data = df[df['column1'] > 1]

# Sort data by a specific column


sorted_data = df.sort_values(by='column2')

# Aggregate data using the groupby function


grouped_data = df.groupby('column1').mean()

3.3.2 Working with Data in pandas


pandas offers a range of functions to read and write data from various file formats, such as
CSV, Excel, and SQL databases:
# Read data from a CSV file
data = pd.read_csv('data.csv')

# Write data to a CSV file


data.to_csv('output.csv', index=False)

# Read data from an Excel file


data = pd.read_excel('data.xlsx')

# Write data to an Excel file


data.to_excel('output.xlsx', index=False)

pandas also provides functionality to handle missing data, merge and join datasets, and
reshape data (here, data1, data2, and the column names are placeholders for your own
DataFrames and columns):
# Fill missing values with a specified value
data_filled = data.fillna(value=0)

# Merge two DataFrames on a specific column


merged_data = pd.merge(data1, data2, on='key_column')

# Pivot data to reshape the DataFrame


pivoted_data = data.pivot(index='row', columns='column', values='value')

3.4 matplotlib
matplotlib is a widely-used data visualization library in Python that provides a flexible
interface for creating static, animated, and interactive visualizations. It is built on top of
NumPy and works well with pandas, making it a popular choice for EDA.

3.4.1 Basic Plotting with matplotlib


To create a basic plot with matplotlib, you can use the pyplot module:
import matplotlib.pyplot as plt

# Create a simple line plot


x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
plt.plot(x, y)
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Simple Line Plot')
plt.show()

Output:

matplotlib supports a wide range of plot types, such as bar plots, scatter plots, and
histograms:
# Create a bar plot
plt.bar(x, y)
plt.show()

# Create a scatter plot


plt.scatter(x, y)
plt.show()

# Create a histogram
plt.hist(y, bins=4)
plt.show()

Output:

3.4.2 Customizing Plots with matplotlib
matplotlib offers many options to customize the appearance of your plots, such as changing
the colors, markers, and line styles:
# Create a line plot with custom style
plt.plot(x, y, color='red', linestyle='dashed', marker='o')
plt.show()

You can also create multiple subplots within a single figure, adjust the axis limits, and add
legends and annotations:
# Create multiple subplots
fig, axes = plt.subplots(1, 2)
axes[0].plot(x, y)
axes[1].scatter(x, y)
plt.show()

# Set axis limits


plt.plot(x, y)
plt.xlim(0, 5)
plt.ylim(0, 20)
plt.show()

# Add a legend and annotations


plt.plot(x, y, label='Line')
plt.legend()
plt.annotate('Max Value', xy=(4, 16), xytext=(2, 12),
arrowprops=dict(facecolor='black', arrowstyle='->'))
plt.show()

Output:

3.5 Summary
In this chapter, we introduced the essential Python libraries for EDA, including NumPy,
pandas, and matplotlib. We explored the core features and functionality of these libraries,
demonstrating how to use them for data manipulation, analysis, and visualization.
Familiarity with these libraries is crucial for performing effective EDA in Python.

In the next chapter, we will delve deeper into descriptive statistics and data visualization
with Python, showing how to use these libraries to summarize and visualize your data in
various ways, and uncover insights and patterns within your dataset.

Chapter 4: Descriptive Statistics and
Data Visualization with Python

4.1 Introduction
Descriptive statistics and data visualization are fundamental aspects of Exploratory Data
Analysis (EDA) that help you understand the underlying structure, distribution, and
relationships within your data. In this chapter, we will explore various techniques for
summarizing and visualizing your data using Python and its powerful libraries, such as
pandas, matplotlib, and seaborn. We will cover common descriptive statistics measures,
distribution plots, and techniques for visualizing relationships between variables.

4.2 Descriptive Statistics


Descriptive statistics summarize the main features of a dataset, providing a quantitative
description of the data's central tendency, dispersion, and shape. pandas, NumPy, and the
Python standard library offer various functions to calculate descriptive statistics for your
data.

4.2.1 Measures of Central Tendency


Measures of central tendency describe the center of the data distribution. The most
common measures are the mean, median, and mode.

Mean: The mean is the average value of the dataset and is calculated by adding up all the
values and dividing by the total number of values. You can use pandas or NumPy to
calculate the mean of a dataset:
import pandas as pd
import numpy as np

data = pd.Series([1, 2, 3, 4, 5])

# Calculate the mean using pandas


mean_pandas = data.mean()

# Calculate the mean using NumPy


mean_numpy = np.mean(data)

Median: The median is the middle value of the dataset when the data is sorted in ascending
order. If the dataset has an odd number of values, the median is the middle value. If the
dataset has an even number of values, the median is the average of the two middle values.
You can use pandas or NumPy to calculate the median of a dataset:
# Calculate the median using pandas
median_pandas = data.median()

# Calculate the median using NumPy


median_numpy = np.median(data)

Mode: The mode is the value that occurs most frequently in the dataset. A dataset can have
no mode, one mode (unimodal), or multiple modes (multimodal). You can use pandas to
calculate the mode of a dataset:
# Calculate the mode using pandas
mode_pandas = data.mode()

4.2.2 Measures of Dispersion


Measures of dispersion describe the spread or variability of the data. The most common
measures are the range, interquartile range (IQR), variance, and standard deviation.

Range: The range is the difference between the maximum and minimum values in the
dataset. You can use pandas or NumPy to calculate the range of a dataset:
# Calculate the range using pandas
range_pandas = data.max() - data.min()

# Calculate the range using NumPy


range_numpy = np.ptp(data)

Interquartile Range (IQR): The IQR is the range between the first quartile (25th percentile)
and the third quartile (75th percentile) of the dataset. The IQR is a robust measure of
dispersion that is less sensitive to outliers than the range. You can use pandas or NumPy to
calculate the IQR of a dataset:
# Calculate the IQR using pandas
iqr_pandas = data.quantile(0.75) - data.quantile(0.25)

# Calculate the IQR using NumPy


iqr_numpy = np.percentile(data, 75) - np.percentile(data, 25)

Variance: The variance is a measure of how much the data values deviate from the mean. It
is calculated as the average of the squared differences between each value and the mean.
You can use pandas or NumPy to calculate the variance of a dataset:
# Calculate the variance using pandas
variance_pandas = data.var()

# Calculate the variance using NumPy


variance_numpy = np.var(data)

Standard Deviation: The standard deviation is the square root of the variance and is a
measure of the average distance between each data value and the mean. The standard
deviation is expressed in the same unit as the data values. You can use pandas or NumPy to
calculate the standard deviation of a dataset:
# Calculate the standard deviation using pandas
std_dev_pandas = data.std()

# Calculate the standard deviation using NumPy


std_dev_numpy = np.std(data)

4.2.3 Measures of Shape


Measures of shape describe the distribution's shape, such as its skewness and kurtosis.

Skewness: Skewness is a measure of the asymmetry of the data distribution. A positive skew
indicates that the distribution has a long right tail, while a negative skew indicates a long left
tail. A skewness value close to zero indicates a symmetric distribution. You can use pandas
or scipy.stats to calculate the skewness of a dataset:
import scipy.stats

# Calculate the skewness using pandas


skewness_pandas = data.skew()

# Calculate the skewness using scipy.stats


skewness_scipy = scipy.stats.skew(data)

Kurtosis: Kurtosis is a measure of the "tailedness" of the data distribution. A high kurtosis
value indicates a distribution with heavy tails and a high peak, while a low kurtosis value
indicates a distribution with light tails and a low peak. You can use pandas or scipy.stats to
calculate the kurtosis of a dataset:

# Calculate the kurtosis using pandas
kurtosis_pandas = data.kurt()

# Calculate the kurtosis using scipy.stats


kurtosis_scipy = scipy.stats.kurtosis(data)

4.3 Data Visualization


Data visualization is a powerful technique for understanding your data, uncovering patterns
and relationships, and communicating your findings. In this section, we will explore various
data visualization techniques using Python libraries such as matplotlib, seaborn, and
pandas.

4.3.1 Univariate Data Visualization


Univariate data visualization techniques focus on a single variable and help you understand
its distribution, central tendency, and dispersion.

First, we will generate some dummy data for the visualizations that follow. Let's consider a
dataset that represents the weights and heights of individuals in three different age groups:
import numpy as np
import pandas as pd

np.random.seed(42)

# Generate random data for weights and heights in three age groups
age_groups = ['18-29', '30-49', '50-65']
weights = np.concatenate([np.random.normal(loc=60, scale=8, size=1000),
                          np.random.normal(loc=70, scale=10, size=1000),
                          np.random.normal(loc=75, scale=8, size=1000)])
heights = np.concatenate([np.random.normal(loc=170, scale=8, size=1000),
                          np.random.normal(loc=175, scale=6, size=1000),
                          np.random.normal(loc=168, scale=6, size=1000)])
ages = np.repeat(age_groups, 1000)

# Create a pandas DataFrame
data = pd.DataFrame({'Age Group': ages,
                     'Weight (kg)': weights,
                     'Height (cm)': heights})

Histogram: A histogram is a graphical representation of the distribution of a dataset. It
divides the data into a set of intervals (bins) and displays the frequency of observations
within each bin. You can use matplotlib or seaborn to create a histogram:
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
plt.figure(figsize=(10, 5))
sns.histplot(data=data, x='Height (cm)', hue='Age Group', kde=False,
bins=30, alpha=0.7)
plt.title('Histogram')
plt.show()

Output:

Density Plot: A density plot, also known as a Kernel Density Estimate (KDE) plot, is a
smoothed version of the histogram that estimates the probability density function of the
dataset. You can use seaborn or pandas to create a density plot:
# Density Plot
plt.figure(figsize=(10, 5))
sns.kdeplot(data=data, x='Height (cm)', hue='Age Group', fill=True)  # fill replaces the deprecated shade argument
plt.title('Density Plot')
plt.show()

Output:

Box Plot: A box plot, also known as a box-and-whisker plot, is a graphical representation of
the five-number summary of a dataset (minimum, first quartile, median, third quartile, and
maximum) and helps visualize the central tendency, dispersion, and outliers of the data.
You can use matplotlib or seaborn to create a box plot:
# Box Plot
plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x='Age Group', y='Height (cm)')
plt.title('Box Plot')
plt.show()

Output:

Violin Plot: A violin plot is a combination of a box plot and a density plot that shows the
distribution of the data across different categories. You can use seaborn to create a violin
plot:
# Violin Plot
plt.figure(figsize=(10, 5))
sns.violinplot(data=data, x='Age Group', y='Height (cm)')
plt.title('Violin Plot')
plt.show()

Output:

4.3.2 Bivariate Data Visualization


Bivariate data visualization techniques focus on the relationship between two variables and
help you identify trends, correlations, and patterns in the data.

Scatter Plot: A scatter plot displays the relationship between two continuous variables by
plotting their values as points in a Cartesian coordinate system. You can use matplotlib or
seaborn to create a scatter plot:
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

# Create a scatter plot using matplotlib


plt.scatter(x, y)
plt.show()

# Create a scatter plot using seaborn


sns.scatterplot(x=x, y=y)
plt.show()

Output:

Line Plot: A line plot displays the relationship between two continuous variables by
connecting their data points with line segments. You can use matplotlib or seaborn to create
a line plot:
# Create a line plot using matplotlib
plt.plot(x, y)
plt.show()

# Create a line plot using seaborn


sns.lineplot(x=x, y=y)
plt.show()

Output:

Bar Plot: A bar plot displays the relationship between a categorical variable and a
continuous variable by representing the continuous variable's values as bars. You can use
matplotlib or seaborn to create a bar plot:
categories = ['A', 'B', 'C', 'D', 'E']
values = [10, 15, 20, 25, 30]

# Create a bar plot using matplotlib


plt.bar(categories, values)
plt.show()

# Create a bar plot using seaborn


sns.barplot(x=categories, y=values)
plt.show()

Output:

Heatmap: A heatmap displays the relationship between two categorical variables by
representing their intersection's values as colors in a grid. You can use seaborn to create a
heatmap:
import numpy as np

data_matrix = np.array([[10, 15, 20],
                        [25, 30, 35],
                        [40, 45, 50]])

# Create a heatmap using seaborn


sns.heatmap(data_matrix, annot=True, cmap='coolwarm')
plt.show()

Output:

4.3.3 Multivariate Data Visualization
Multivariate data visualization techniques focus on the relationship between multiple
variables and help you identify interactions, trends, and correlations in the data.

Pair Plot: A pair plot, also known as a scatterplot matrix, is a grid of scatter plots that display
the pairwise relationships between multiple continuous variables. You can use seaborn to
create a pair plot:
import seaborn as sns
import pandas as pd

data_dict = {'A': [1, 2, 3, 4, 5],
             'B': [2, 4, 6, 8, 10],
             'C': [1, 4, 9, 16, 25],
             'D': [5, 4, 3, 2, 1]}
data_frame = pd.DataFrame(data_dict)

# Create a pair plot using seaborn


sns.pairplot(data_frame)
plt.show()

Output:

Correlation Matrix: A correlation matrix is a table that shows the pairwise correlations
between multiple continuous variables. A correlation matrix can be visualized as a heatmap
to identify trends and patterns in the data. You can use pandas and seaborn to create a
correlation matrix heatmap:
# Calculate the correlation matrix using pandas
corr_matrix = data_frame.corr()

# Create a correlation matrix heatmap using seaborn


sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

Output:

Parallel Coordinates Plot: A parallel coordinates plot displays the relationships between
multiple continuous variables by connecting their values with line segments across parallel
axes. You can use pandas.plotting to create a parallel coordinates plot:
from pandas.plotting import parallel_coordinates

# Create a parallel coordinates plot using pandas.plotting


parallel_coordinates(data_frame, 'A', colormap='viridis')
plt.show()

Output:

4.4 Summary
In this chapter, we have covered essential techniques for descriptive statistics and data
visualization in Python. These techniques can help you understand your data's structure,
distribution, and relationships, which are crucial steps in any data analysis process. By
leveraging the power of Python libraries such as pandas, matplotlib, and seaborn, you can
effectively explore and analyze your data to make informed decisions and communicate
your findings.

Chapter 5: Advanced Data Visualization
Techniques with seaborn and Plotly

5.1 Introduction
In the previous chapter, we discussed basic data visualization techniques using Python
libraries such as matplotlib, pandas, and seaborn. This chapter will focus on advanced data
visualization techniques using seaborn and Plotly, two powerful Python libraries for
creating beautiful, interactive, and informative visualizations. We will cover various
advanced visualization techniques that can help you uncover hidden patterns, trends, and
relationships in your data.

5.2 Seaborn: Statistical Data Visualization


Seaborn is a Python data visualization library based on matplotlib that provides a high-level
interface for creating informative and attractive statistical graphics. Seaborn offers various
built-in themes, color palettes, and functions for visualizing complex datasets, making it
easy to create professional-quality visualizations with minimal effort. In this section, we will
explore some of seaborn's advanced visualization techniques.

5.2.1 FacetGrid: Multi-plot Grids for Conditional Relationships


FacetGrid is a powerful seaborn class for visualizing conditional relationships in your
dataset across multiple dimensions. It allows you to create a grid of subplots based on the
values of one or more categorical variables, enabling you to explore the interactions
between these variables and a continuous variable. To use FacetGrid, first, create an
instance of the class, specifying the data and categorical variables to be used, then map a
plotting function to the grid:
import seaborn as sns
import matplotlib.pyplot as plt

# Load the example tips dataset


data = sns.load_dataset('tips')

# Create a FacetGrid instance


g = sns.FacetGrid(data, col='time', row='smoker')

# Map a plotting function to the grid
g.map(sns.scatterplot, 'total_bill', 'tip')

# Display the plot


plt.show()

Output:

5.2.2 PairGrid: Pairwise Relationships in a Dataset


PairGrid is another powerful seaborn class for visualizing pairwise relationships in a
dataset, similar to Pairplot but offering more customization. It creates a grid of subplots
where each subplot displays a scatter plot between two continuous variables. To use
PairGrid, first, create an instance of the class, specifying the data, then map a plotting
function to the grid:
# Load the example iris dataset
data = sns.load_dataset('iris')

# Create a PairGrid instance


g = sns.PairGrid(data)

# Map a plotting function to the grid


g.map(sns.scatterplot)

# Display the plot
plt.show()

Output:

5.2.3 Clustermap: Hierarchical Clustering and Heatmap


A clustermap is a combination of a heatmap and hierarchical clustering, which can reveal
patterns and relationships in your data by grouping similar data points together. Seaborn's
clustermap function creates a clustered heatmap, allowing you to visualize complex
multivariate data:
import numpy as np

# Create a random dataset


data = np.random.rand(10, 12)

# Create a clustermap

sns.clustermap(data, cmap='viridis')

# Display the plot


plt.show()

Output:

5.3 Plotly: Interactive Data Visualization
Plotly is a Python graphing library that makes interactive, publication-quality graphs. It
provides numerous high-level plotting functions for creating various types of interactive
plots, including line plots, scatter plots, bar plots, pie charts, and more. In this section, we
will explore some of Plotly's advanced visualization techniques.

5.3.1 Interactive Line and Scatter Plots


Plotly allows you to create interactive line and scatter plots with tooltips, zooming, and
panning capabilities. You can use the plotly.graph_objs module to create these interactive
plots:
import plotly.graph_objs as go
# Create a line plot
trace_line = go.Scatter(x=[1, 2, 3, 4, 5],
y=[1, 4, 9, 16, 25],
mode='lines',
name='Line Plot')

# Create a scatter plot


trace_scatter = go.Scatter(x=[1, 2, 3, 4, 5],
y=[5, 4, 3, 2, 1],
mode='markers',
name='Scatter Plot')

# Combine the plots and display them


data = [trace_line, trace_scatter]
layout = go.Layout(title='Interactive Line and Scatter Plots')
fig = go.Figure(data=data, layout=layout)
fig.show()

Output:

5.3.2 Interactive Bar Plots and Pie Charts
Plotly also allows you to create interactive bar plots and pie charts. Here's how you can
create a bar plot and a pie chart using the plotly.graph_objs module:
# Create a bar plot
trace_bar = go.Bar(x=['A', 'B', 'C', 'D', 'E'],
                   y=[10, 15, 20, 25, 30],
                   name='Bar Plot')

# Create a pie chart, giving it its own domain on the right half of the
# figure so that it does not overlap the bar plot
trace_pie = go.Pie(labels=['A', 'B', 'C', 'D', 'E'],
                   values=[10, 15, 20, 25, 30],
                   name='Pie Chart',
                   domain=dict(x=[0.55, 1.0]))

# Combine the plots and display them
data = [trace_bar, trace_pie]
layout = go.Layout(title='Interactive Bar Plot and Pie Chart',
                   xaxis=dict(domain=[0.0, 0.45]))
fig = go.Figure(data=data, layout=layout)
fig.show()

Output:

5.3.3 Interactive Heatmaps


Plotly provides an easy way to create interactive heatmaps with tooltips, allowing you to
visualize complex data matrices:
import numpy as np

# Create a random dataset


data_matrix = np.random.rand(10, 12)

# Create a heatmap
trace_heatmap = go.Heatmap(z=data_matrix,
colorscale='Viridis')

# Display the heatmap
data = [trace_heatmap]
layout = go.Layout(title='Interactive Heatmap')
fig = go.Figure(data=data, layout=layout)
fig.show()

Output:

5.3.4 Interactive 3D Plots

Plotly also supports interactive 3D plots, such as scatter plots and surface plots, which can
help you visualize the relationships between three continuous variables:
# Create a 3D scatter plot
trace_scatter3d = go.Scatter3d(x=[1, 2, 3, 4, 5],
y=[1, 4, 9, 16, 25],
z=[5, 4, 3, 2, 1],
mode='markers',
marker=dict(size=8))

# Create a 3D surface plot


trace_surface = go.Surface(z=[[1, 2, 3, 4, 5],
[2, 4, 6, 8, 10],
[3, 6, 9, 12, 15]])

# Combine the plots and display them


data = [trace_scatter3d, trace_surface]
layout = go.Layout(title='Interactive 3D Plots')
fig = go.Figure(data=data, layout=layout)
fig.show()

Output:

5.4 Summary
In this chapter, we have explored various advanced data visualization techniques using
seaborn and Plotly, two powerful Python libraries for creating beautiful, interactive, and
informative visualizations. By incorporating these techniques into your exploratory data
analysis workflow, you can gain deeper insights into your data and uncover hidden patterns,
trends, and relationships. Mastering these advanced visualization techniques can help you
make better-informed decisions and communicate your findings more effectively.

Seaborn offers a high-level interface for creating complex statistical graphics, such as
FacetGrid, PairGrid, and clustermaps, which can help you visualize conditional
relationships and multivariate data. On the other hand, Plotly provides numerous high-level
plotting functions for creating interactive plots, such as line plots, scatter plots, bar plots,
pie charts, heatmaps, and 3D plots. These interactive visualizations enable you to explore
your data more intuitively and share your findings with others easily.

Chapter 6: Handling Missing Data and
Outliers in Python

6.1 Introduction
Missing data and outliers are common issues in real-world datasets that can significantly
affect the quality of your exploratory data analysis and the performance of your machine
learning models. In this chapter, we will discuss various techniques for handling missing
data and outliers in Python, using the pandas, NumPy, and scipy libraries. By learning these
techniques, you will be better equipped to clean and preprocess your data, ensuring that
your analysis is robust and accurate.

6.2 Handling Missing Data


Missing data occurs when some observations or values are not available in your dataset.
This can be due to various reasons, such as data entry errors, data collection problems, or
data corruption. Handling missing data is an essential step in the data cleaning process
because missing values can lead to biased or incomplete analysis results.

6.2.1 Identifying Missing Data


The first step in handling missing data is to identify the missing values in your dataset. In
pandas, missing data is usually represented by the NaN (Not a Number) value. You can use
the isna() or isnull() methods to check for missing values in your DataFrame:
import pandas as pd

# Load the example Titanic dataset


data = pd.read_csv('titanic.csv')

# Check for missing values


missing_values = data.isna()

# Count the number of missing values per column


missing_values_count = missing_values.sum()
print(missing_values_count)

6.2.2 Deleting Missing Data
One approach to handling missing data is to delete the rows or columns containing missing
values. This can be done using the dropna() method in pandas. However, this method
should be used with caution, as it may lead to loss of information if too many rows or
columns are removed.
# Drop rows with missing values
data_no_missing_rows = data.dropna()

# Drop columns with missing values


data_no_missing_columns = data.dropna(axis=1)

6.2.3 Imputing Missing Data


Another approach to handling missing data is to replace the missing values with an
estimate, such as the mean, median, or mode of the available data. This process is called
imputation. Pandas provides the fillna() method for imputing missing values:
# Impute missing numeric values with the mean
data_mean_imputed = data.fillna(data.mean(numeric_only=True))

# Impute missing numeric values with the median
data_median_imputed = data.fillna(data.median(numeric_only=True))

# Impute missing values with the mode (works for numeric and categorical columns)
data_mode_imputed = data.fillna(data.mode().iloc[0])

You can also use more advanced imputation techniques, such as k-Nearest Neighbors (KNN) imputation or regression-based imputation, by employing the sklearn.impute module. Note that these imputers work on numeric data, so we select the numeric columns first:
from sklearn.experimental import enable_iterative_imputer  # required before importing IterativeImputer
from sklearn.impute import KNNImputer, IterativeImputer

# Select the numeric columns only
numeric_data = data.select_dtypes(include='number')

# KNN imputation
knn_imputer = KNNImputer(n_neighbors=5)
data_knn_imputed = knn_imputer.fit_transform(numeric_data)

# Regression-based (iterative) imputation
reg_imputer = IterativeImputer(random_state=0)
data_reg_imputed = reg_imputer.fit_transform(numeric_data)

6.3 Handling Outliers
Outliers are data points that are significantly different from the majority of the data. They
can be caused by data entry errors, data collection problems, or natural variations in the
data. Outliers can have a significant impact on your data analysis and the performance of
your machine learning models, as they can skew the distribution of your data and affect the
results of statistical tests.

6.3.1 Identifying Outliers


There are several methods for identifying outliers in your data, such as using box plots,
histograms, or scatter plots to visually inspect the distribution of your data. Additionally,
you can use statistical methods like the Z-score, IQR (Interquartile Range), or Tukey's fences
to detect outliers numerically.
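As a quick illustration of the visual approach, a box plot makes extreme values immediately visible. The following is a minimal sketch using matplotlib on an assumed one-dimensional sample:
import numpy as np
import matplotlib.pyplot as plt

# Assumed sample data containing one obvious extreme value
sample = np.array([1, 2, 3, 4, 5, 100])

# Points beyond the whiskers (1.5 * IQR by default) are drawn as individual markers
plt.boxplot(sample, vert=False)
plt.xlabel('Value')
plt.title('Box Plot for Visual Outlier Inspection')
plt.show()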

6.3.1.1 Z-score Method


The Z-score is a measure of how many standard deviations a data point is from the mean of
the dataset. Data points with a high absolute Z-score (typically above 2 or 3) are considered
outliers. You can calculate the Z-score using the following formula:
Z = (X − μ) / σ

Where X is the data point, μ is the mean of the dataset, and σ is the standard deviation of the
dataset. You can use the scipy.stats module to compute the Z-score for your data:
import numpy as np
from scipy import stats

# Create a sample dataset


data = np.array([1, 2, 3, 4, 5, 100])

# Calculate the Z-scores


z_scores = stats.zscore(data)

# Find outliers
outliers = data[np.abs(z_scores) > 2]
print(outliers)

6.3.1.2 IQR Method

The IQR is the range between the first quartile (25th percentile) and the third quartile (75th
percentile) of the data. Data points that fall below the first quartile minus 1.5 times the IQR
or above the third quartile plus 1.5 times the IQR are considered outliers. You can use the
following code to detect outliers using the IQR method:
# Calculate the IQR
q1 = np.percentile(data, 25)
q3 = np.percentile(data, 75)
iqr = q3 - q1

# Find outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = data[(data < lower_bound) | (data > upper_bound)]
print(outliers)

6.3.2 Handling Outliers


After identifying outliers, you can handle them using various techniques, such as deleting,
capping, or transforming the data.

6.3.2.1 Deleting Outliers


One approach to handling outliers is to simply remove them from the dataset. However, this
method should be used with caution, as it may lead to loss of information if too many data
points are removed.
# Remove outliers using the Z-score method
data_no_outliers = data[np.abs(z_scores) <= 2]

6.3.2.2 Capping Outliers


Another approach to handling outliers is to cap them at a certain value, such as the lower
and upper bounds calculated using the IQR method. This method preserves the original
structure of the data but reduces the impact of extreme values.
# Cap outliers using the IQR method
data_capped = np.clip(data, lower_bound, upper_bound)

6.3.2.3 Transforming Data
Transforming the data can help reduce the impact of outliers by compressing the range of the data. Common transformations include the log, square root, and inverse transformations (note that the log and square root require non-negative values, and the inverse requires non-zero values). You can apply these transformations using the NumPy library:
# Apply a log transformation
data_log_transformed = np.log(data)

# Apply a square root transformation


data_sqrt_transformed = np.sqrt(data)

# Apply an inverse transformation


data_inverse_transformed = 1 / data

6.4 Summary
Handling missing data and outliers is a critical aspect of exploratory data analysis and data
preprocessing. In this chapter, we have discussed various techniques for identifying and
handling missing data, such as deleting, imputing, or using advanced imputation methods
like KNN imputation and regression imputation. We have also explored different methods
for identifying and handling outliers, including the Z-score, IQR, deletion, capping, and data
transformation techniques.

By applying these techniques to your datasets, you can ensure that your data is clean,
accurate, and ready for further analysis or modeling. Mastering these data cleaning and
preprocessing techniques will not only help improve the quality of your exploratory data
analysis but also enhance the performance of your machine learning models, leading to
more accurate and reliable predictions.

In the next chapter, we will delve into dimensionality reduction techniques, such as PCA
(Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding),
which can help you reduce the complexity of your data and extract meaningful features for
further analysis or modeling.

Chapter 7: Dimensionality Reduction
Techniques: PCA and t-SNE

7.1 Introduction to Dimensionality Reduction


In many real-world scenarios, data sets consist of a large number of variables or features,
which can make the process of data analysis and model building cumbersome and
time-consuming. Moreover, the presence of a large number of features can lead to
overfitting and make interpretation difficult. This is where dimensionality reduction
techniques come into play. Dimensionality reduction is the process of reducing the number
of features in a dataset while preserving the essential information as much as possible.

In this chapter, we will discuss two popular dimensionality reduction techniques: Principal
Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). We
will cover the theory behind each technique, its implementation in Python using popular
libraries, and practical applications.

7.2 Principal Component Analysis (PCA)

7.2.1 Theory behind PCA


Principal Component Analysis (PCA) is a linear dimensionality reduction technique that
seeks to project the data onto a lower-dimensional space while preserving as much variance
as possible. It achieves this by identifying new features called principal components, which
are linear combinations of the original features. The principal components are orthogonal
(uncorrelated) and are ranked according to the amount of variance they explain in the data.
By selecting a subset of the principal components, we can reduce the dimensionality of the
data while retaining most of its variance.
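To make the idea concrete, here is a minimal NumPy sketch that derives principal components directly from the covariance matrix of a small, assumed toy dataset; scikit-learn's PCA class, used in the next section, wraps the same linear algebra:
import numpy as np

# Assumed toy dataset: 5 samples, 3 features
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 0.4]])

# Center the data and eigendecompose its covariance matrix
X_centered = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Sort the components by explained variance, largest first
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the top two principal components
X_projected = X_centered @ eigenvectors[:, :2]
print(eigenvalues / eigenvalues.sum())  # proportion of variance explained by each component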

7.2.2 Implementing PCA in Python


To implement PCA in Python, we will use the PCA class from the popular scikit-learn
library. Let's start by importing the necessary libraries and loading a sample dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X = iris.data
y = iris.target

Now that we have our dataset loaded, we can apply PCA. In this example, we will reduce the
dimensionality of the iris dataset from four to two:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

Let's visualize the transformed data:
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

Output:

In the resulting plot, we can observe that PCA has successfully reduced the dimensionality
of the data while preserving the separation between the three classes.

7.3 t-Distributed Stochastic Neighbor Embedding
(t-SNE)

7.3.1 Theory behind t-SNE


t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality
reduction technique particularly suited for visualizing high-dimensional data in a
low-dimensional space (typically 2D or 3D). Unlike PCA, t-SNE does not attempt to preserve
global structure or variance. Instead, it focuses on preserving the local structure of the data,
making it an excellent tool for visualizing complex data structures.

t-SNE works by minimizing the divergence between two probability distributions: one
representing pairwise similarities in the high-dimensional space and the other representing
pairwise similarities in the low-dimensional space. It uses a t-distribution to model the
pairwise similarities in the low-dimensional space, which alleviates the "crowding problem"
commonly encountered in other dimensionality reduction techniques.
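Concretely, the objective that t-SNE minimizes is the Kullback-Leibler divergence between the high-dimensional similarity distribution P and the low-dimensional distribution Q:

C = KL(P || Q) = Σᵢ Σⱼ pᵢⱼ log(pᵢⱼ / qᵢⱼ)

where pᵢⱼ and qᵢⱼ are the pairwise similarity probabilities between points i and j in the original and embedded spaces, respectively.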

7.3.2 Implementing t-SNE in Python


To implement t-SNE in Python, we will make use of the TSNE class available in the
scikit-learn library. We'll begin by importing the necessary libraries and utilizing the same
iris dataset from the PCA example:
from sklearn.manifold import TSNE

t-SNE is sensitive to the scale of features; therefore, it is crucial to standardize the dataset
before applying the technique:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)

Now, we are ready to apply t-SNE to the standardized dataset. Here, we'll set the number of
components to 2, perplexity to 30, and the number of iterations to 1000:
tsne = TSNE(n_components=2, perplexity=30, n_iter=1000)
X_tsne = tsne.fit_transform(X_standardized)

Next, we'll visualize the transformed data:


plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='viridis')

plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.show()

Output:

The resulting plot shows that t-SNE has also successfully reduced the dimensionality of the
data while preserving the separation among the three classes. However, the plot may appear
different each time t-SNE is run due to the stochastic nature of the algorithm.

7.4 Comparing PCA and t-SNE


Both PCA and t-SNE are popular dimensionality reduction techniques, but they have
different properties and use cases. Some key differences include:

PCA is a linear technique, while t-SNE is non-linear.


PCA preserves global structure and variance, while t-SNE preserves local structure.
PCA is deterministic, while t-SNE is stochastic, which means that the results of t-SNE can
vary between runs.
PCA is faster and more scalable to large datasets, while t-SNE can be computationally
expensive, especially for large datasets.
In general, PCA is a good starting point for dimensionality reduction, as it is
computationally efficient and easy to interpret. However, for datasets with complex
non-linear structures or when the primary goal is visualization, t-SNE may be more
appropriate.

7.5 Practical Tips for Dimensionality Reduction
Here are some practical tips when applying dimensionality reduction techniques:

Always standardize the data before applying dimensionality reduction, as both PCA and
t-SNE are sensitive to feature scales.
Choose the right technique based on your goals: use PCA for preserving global structure and
variance, and t-SNE for visualizing complex data structures.
For PCA, decide on the number of principal components to retain based on the explained variance ratio. You can use the explained_variance_ratio_ attribute of the PCA class to help you make this decision (a short sketch follows this list).
For t-SNE, experiment with the perplexity and n_iter parameters to find the best settings for
your dataset. Perplexity balances the focus on local and global aspects of the data, while the
number of iterations affects the optimization process.
Be cautious when interpreting the results of dimensionality reduction, as information is
inevitably lost during the process. Always validate your findings with other techniques or
domain knowledge.
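For example, here is a minimal sketch of the explained-variance check on the iris data used earlier in this chapter:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# Fit PCA with all components and inspect how much variance each one explains
pca = PCA()
pca.fit(X)
print(pca.explained_variance_ratio_)

# The cumulative ratio helps choose the smallest number of components
# that explains, say, at least 95% of the variance
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative >= 0.95) + 1
print(n_components)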

7.6 Summary
In this chapter, we covered two popular dimensionality reduction techniques: Principal
Component Analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE). We
discussed the theory behind each technique, their implementation in Python using popular
libraries, and practical applications.

By reducing the dimensionality of your datasets, you can not only simplify your data
analysis and model building process but also improve the interpretability of your results.
Remember to choose the right technique based on your goals and always validate your
findings with other techniques or domain knowledge.

Chapter 8: Time Series Analysis and
Forecasting with Python

8.1 Introduction to Time Series Analysis


Time series analysis is a set of techniques used to analyze data points collected over time. It
aims to extract meaningful patterns, trends, and relationships within the data, as well as
make predictions about future values. Time series analysis is commonly employed in
various domains, including finance, economics, weather forecasting, and sales forecasting.

In this chapter, we will discuss the fundamental concepts of time series analysis, including
components of a time series, stationarity, and autocorrelation. We will then explore various
techniques for time series analysis and forecasting in Python, such as decomposition,
smoothing, and ARIMA models.

8.2 Components of a Time Series


A time series can be broken down into four components:

Trend: The long-term movement in the data, usually indicating an increase or decrease over
time.
Seasonality: Regular, periodic fluctuations in the data, often related to the time of year,
month, or week.
Cyclic: Oscillatory patterns that do not follow a fixed period, often caused by external
factors or the underlying system dynamics.
Irregular: Random fluctuations in the data not explained by the other components.
These components can be combined either additively or multiplicatively to form the time
series.
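In the additive model the series is the sum of its components, while in the multiplicative model it is their product:

Y(t) = Trend + Seasonality + Cyclic + Irregular (additive)
Y(t) = Trend × Seasonality × Cyclic × Irregular (multiplicative)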

8.3 Stationarity and Differencing


A time series is said to be stationary if its statistical properties, such as mean and variance,
remain constant over time. Stationarity is an essential assumption for many time series
analysis and forecasting techniques, as it simplifies the modeling process and improves
prediction accuracy.

To achieve stationarity, we can apply a process called differencing, which computes the difference between consecutive observations in the time series. Differencing can be applied multiple times; the number of times it is applied is called the order of differencing, and the order required to achieve stationarity is denoted by "d."
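Here is a minimal sketch of first- and second-order differencing with pandas, together with a stationarity check using the Augmented Dickey-Fuller test from statsmodels; the file name and 'value' column follow the examples used later in this chapter:
import pandas as pd
from statsmodels.tsa.stattools import adfuller

# Load the assumed time series with a datetime index and a 'value' column
data = pd.read_csv('sample_data.csv', index_col='date', parse_dates=True)

# First-order (d=1) and second-order (d=2) differencing
diff_1 = data['value'].diff().dropna()
diff_2 = data['value'].diff().diff().dropna()

# Augmented Dickey-Fuller test: a small p-value suggests the differenced series is stationary
adf_stat, p_value, *_ = adfuller(diff_1)
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.3f}')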

8.4 Autocorrelation and Partial Autocorrelation


Autocorrelation is a measure of the correlation between a time series and a lagged version
of itself. It helps identify the presence of repeating patterns or seasonality in the data.
Partial autocorrelation is similar to autocorrelation but accounts for the correlation of
intermediate lags. Autocorrelation and partial autocorrelation can be visualized using
autocorrelation function (ACF) and partial autocorrelation function (PACF) plots,
respectively.
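A minimal sketch of these plots using statsmodels, again assuming the data loaded in the previous sketch:
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# Plot the autocorrelation and partial autocorrelation functions
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
plot_acf(data['value'], lags=24, ax=ax1)
plot_pacf(data['value'], lags=24, ax=ax2)
plt.tight_layout()
plt.show()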

8.5 Time Series Analysis Techniques in Python

8.5.1 Decomposition
Decomposition is a technique for separating the different components of a time series. We
can use the seasonal_decompose function from the statsmodels library to perform
decomposition in Python.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Load time series data


data = pd.read_csv('sample_data.csv', index_col='date',
parse_dates=True)

# Perform decomposition
result = seasonal_decompose(data, model='additive')

# Plot decomposition components


result.plot()
plt.show()

8.5.2 Smoothing
Smoothing techniques are used to reduce noise and highlight trends in time series data.
Common smoothing techniques include moving averages, exponential smoothing, and
Holt-Winters method.
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

# Moving average
data['moving_average'] = data['value'].rolling(window=12).mean()

# Simple exponential smoothing


ses_model = SimpleExpSmoothing(data['value'])
ses_fit = ses_model.fit(smoothing_level=0.2)
data['SES'] = ses_fit.fittedvalues

# Holt-Winters method
hw_model = ExponentialSmoothing(data['value'], trend='add',
seasonal='add', seasonal_periods=12)
hw_fit = hw_model.fit()
data['Holt_Winters'] = hw_fit.fittedvalues

# Plot original data and smoothed data


data[['value', 'moving_average', 'SES', 'Holt_Winters']].plot()
plt.show()

8.6 Time Series Forecasting Techniques in Python

8.6.1 Autoregressive Integrated Moving Average (ARIMA)


ARIMA is a popular forecasting method that combines the concepts of autoregression,
moving average, and differencing. It is represented by the notation ARIMA(p, d, q), where p
is the order of the autoregressive term, d is the order of differencing, and q is the order of
the moving average term.

To fit an ARIMA model in Python, we can use the `ARIMA` class from the `statsmodels`
library:
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Load time series data
data = pd.read_csv('sample_data.csv', index_col='date',
parse_dates=True)

# Fit ARIMA model


arima_model = ARIMA(data['value'], order=(1, 1, 1))
arima_fit = arima_model.fit()

# Make predictions
forecast = arima_fit.forecast(steps=12)

8.6.2 Seasonal Decomposition of Time Series (STL) and ARIMA


When dealing with seasonal time series data, we can use the Seasonal Decomposition of
Time Series (STL) to decompose the time series into its components, then apply ARIMA to
the residual component.
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import STL

# Load time series data


data = pd.read_csv('sample_data.csv', index_col='date',
parse_dates=True)

# Perform STL decomposition


stl = STL(data['value'], period=12)
stl_fit = stl.fit()
data['residual'] = stl_fit.resid

# Fit ARIMA model on residual component


arima_model = ARIMA(data['residual'].dropna(), order=(1, 1, 1))
arima_fit = arima_model.fit()

# Make predictions
residual_forecast = arima_fit.forecast(steps=12)

# Reconstruct the forecasted values by adding back the last seasonal cycle
# (a full reconstruction would also extrapolate and add the trend component)
forecast = residual_forecast.values + stl_fit.seasonal[-12:].values

8.7 Model Evaluation and Selection
To evaluate the performance of a time series forecasting model, we can use metrics such as
Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error
(RMSE). We can also use cross-validation techniques, such as time series cross-validation or
walk-forward validation, to assess model performance on different portions of the data.
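Here is a minimal sketch of these metrics with scikit-learn, plus time series cross-validation using TimeSeriesSplit; the y_true and y_pred arrays are assumed placeholders for observed and forecasted values:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Assumed observed and forecasted values
y_true = np.array([112, 118, 132, 129, 121])
y_pred = np.array([110, 120, 130, 131, 119])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
print(f'MAE: {mae:.2f}, MSE: {mse:.2f}, RMSE: {rmse:.2f}')

# Time series cross-validation: each fold trains on the past and tests on the future
tscv = TimeSeriesSplit(n_splits=3)
series = np.arange(20)
for train_idx, test_idx in tscv.split(series):
    print('train up to index', train_idx[-1], '-> test indices', test_idx)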

8.8 Summary
In this chapter, we explored the fundamentals of time series analysis and various techniques
for time series analysis and forecasting in Python. We covered decomposition, smoothing,
ARIMA models, and STL decomposition combined with ARIMA. Additionally, we discussed
the importance of stationarity, differencing, autocorrelation, and partial autocorrelation.

By understanding and applying these techniques, you can extract meaningful insights from
time series data and make accurate predictions about future values. Always remember to
evaluate your models using appropriate metrics and cross-validation techniques to ensure
reliable forecasting performance.

Chapter 9: Text Data Exploration and
Natural Language Processing

9.1 Introduction
Text data exploration and natural language processing (NLP) involve the analysis and
manipulation of textual data to extract insights, understand patterns, and build predictive
models. In this chapter, we will cover essential concepts and techniques for text data
exploration and NLP using Python. We will explore text preprocessing, feature extraction,
and several NLP techniques and algorithms, such as tokenization, stemming, and topic
modeling.

9.2 Text Preprocessing


Text preprocessing is an essential step in NLP that involves cleaning and transforming raw
text data into a structured format suitable for analysis. Common text preprocessing
techniques include:

Lowercasing
Removing punctuation and special characters
Removing stopwords
Tokenization
Stemming and lemmatization
We can use the Natural Language Toolkit (nltk) library in Python to perform most of these
preprocessing tasks:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load raw text data


text = "This is an example sentence for text preprocessing in Python."

# Lowercase

text = text.lower()

# Remove punctuation and special characters


text = ''.join(c for c in text if c.isalnum() or c.isspace())

# Remove stopwords
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)
text = ' '.join(w for w in word_tokens if w not in stop_words)

# Stemming
stemmer = PorterStemmer()
text = ' '.join(stemmer.stem(w) for w in word_tokenize(text))

# Lemmatization (in practice, you would typically apply either stemming or lemmatization, not both)
lemmatizer = WordNetLemmatizer()
text = ' '.join(lemmatizer.lemmatize(w) for w in word_tokenize(text))

9.3 Feature Extraction


Feature extraction is the process of converting text data into a numerical representation
suitable for machine learning algorithms. Common feature extraction techniques for text
data include:

Bag of Words (BoW)


Term Frequency-Inverse Document Frequency (TF-IDF)
Word embeddings (e.g., Word2Vec, GloVe)
We can use the scikit-learn library in Python to perform BoW and TF-IDF feature extraction:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Load text data


documents = ['This is the first document.', 'This is the second document.']

# Bag of Words
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(documents)

# TF-IDF
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(documents)

For word embeddings, we can use the Gensim library in Python:
import gensim.downloader as api

# Load pre-trained Word2Vec model


model = api.load('word2vec-google-news-300')

# Get word embeddings


word_embedding = model['example']

9.4 Natural Language Processing Techniques


In this section, we will explore various NLP techniques and algorithms, such as sentiment
analysis, text classification, and topic modeling.

9.4.1 Sentiment Analysis


Sentiment analysis is the process of determining the sentiment or emotion expressed in a
piece of text. The most common sentiment analysis task is binary classification, where the
text is classified as either positive or negative. We can use pre-trained models like VADER
(Valence Aware Dictionary and sEntiment Reasoner) from the nltk library to perform
sentiment analysis:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# Load VADER sentiment analyzer


sia = SentimentIntensityAnalyzer()

# Analyze sentiment
text = "I love this product. It's amazing!"
sentiment_scores = sia.polarity_scores(text)

9.4.2 Text Classification


Text classification involves assigning predefined categories or labels to text data. A common
application of text classification is spam detection in emails. In this example, we'll use the
Naive Bayes algorithm from the scikit-learn library to perform text classification:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load text data


data = pd.read_csv('sample_data.csv')
X = data['text']
y = data['label']

# Feature extraction
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(X)

# Split data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y,
test_size=0.2, random_state=42)

# Train Naive Bayes classifier


classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Make predictions
y_pred = classifier.predict(X_test)

# Evaluate the model


accuracy = accuracy_score(y_test, y_pred)

9.4.3 Topic Modeling


Topic modeling is an unsupervised learning technique used to identify topics or themes
within a collection of documents. One popular topic modeling algorithm is Latent Dirichlet
Allocation (LDA). We can use the Gensim library to perform LDA topic modeling in Python:
import gensim
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

# Load text data


documents = ['This is the first document.', 'This is the second document.']

# Tokenize and create a dictionary


tokens = [simple_preprocess(doc) for doc in documents]
dictionary = Dictionary(tokens)

# Create a bag of words corpus
corpus = [dictionary.doc2bow(doc) for doc in tokens]

# Train LDA model


lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary,
random_state=42)

# Print topics
topics = lda_model.print_topics(num_words=5)

9.5 Summary
In this chapter, we discussed essential concepts and techniques for text data exploration and
natural language processing using Python. We covered text preprocessing, feature
extraction, and various NLP techniques such as sentiment analysis, text classification, and
topic modeling.

By understanding and applying these techniques, you can analyze and process large
amounts of textual data, derive insights, and build effective predictive models. Remember to
preprocess your text data and choose the appropriate feature extraction method for your
specific problem to ensure the best results.

Chapter 10: Case Studies: Real-World
Applications of EDA with Python

10.1 Introduction
In this chapter, we will explore real-world case studies that demonstrate how Exploratory
Data Analysis (EDA) with Python can be applied across various industries and domains.
These case studies will provide a practical understanding of how EDA techniques, combined
with the power of Python, can be used to derive insights from data, inform decision-making,
and solve complex problems.

10.2 Case Study 1: Customer Segmentation for a Retail Company
A retail company wants to segment its customers to create targeted marketing campaigns
and improve customer satisfaction. The company has collected transactional data, including
customer demographics, purchase history, and revenue.

Objective: Perform EDA to identify customer segments based on their shopping behavior.

Approach:

Load and clean the dataset: Handle missing values, remove outliers, and preprocess
categorical variables.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# Generate synthetic data


np.random.seed(42)
n_samples = 500
age = np.random.randint(18, 65, n_samples)
gender = np.random.choice(['M', 'F'], n_samples)
country = np.random.choice(['USA', 'Canada', 'UK'], n_samples)
purchase_frequency = np.random.randint(1, 20, n_samples)
total_revenue = np.random.randint(100, 10000, n_samples)

data = pd.DataFrame({
'age': age,

'gender': gender,
'country': country,
'purchase_frequency': purchase_frequency,
'total_revenue': total_revenue
})

# Clean data
data = data.dropna()
data = data.drop_duplicates()

# Preprocess categorical variables


data = pd.get_dummies(data, columns=['gender', 'country'])

# Standardize data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

Compute relevant features (e.g., average purchase value, purchase frequency) and visualize
the distribution of these features.
import matplotlib.pyplot as plt
import seaborn as sns

# Compute relevant features


data['average_purchase_value'] = data['total_revenue'] / data['purchase_frequency']

# Visualize the distribution of features


plt.figure(figsize=(10, 5))
sns.histplot(data['average_purchase_value'], kde=True)
plt.xlabel('Average Purchase Value')
plt.show()

plt.figure(figsize=(10, 5))
sns.histplot(data['purchase_frequency'], kde=True)
plt.xlabel('Purchase Frequency')
plt.show()

Output:

Apply clustering algorithms (e.g., K-means, DBSCAN) to segment customers.


from sklearn.cluster import KMeans

# Apply K-means clustering


kmeans = KMeans(n_clusters=3, random_state=42)
data['cluster'] = kmeans.fit_predict(data_scaled)

# Visualize the clusters


plt.figure(figsize=(10, 5))
sns.scatterplot(data=data, x='average_purchase_value',
y='purchase_frequency', hue='cluster', palette='viridis')

plt.xlabel('Average Purchase Value')
plt.ylabel('Purchase Frequency')
plt.show()

Output:

Analyze the customer segments and provide insights for targeted marketing campaigns.

10.3 Case Study 2: Sales Forecasting for a Manufacturing Company
A manufacturing company wants to forecast its sales to optimize production planning and
inventory management. The company has collected historical sales data, including
timestamps, product information, and sales volume.

Objective: Perform EDA to forecast future sales and understand the factors affecting sales
performance.

Approach:

Load and preprocess the dataset: Convert timestamps to datetime objects, handle missing values, and remove outliers.
import pandas as pd
import numpy as np

# Generate synthetic sales data


np.random.seed(42)

n_samples = 365
dates = pd.date_range('2020-01-01', periods=n_samples)
product_ids = np.random.choice(np.arange(1, 6), n_samples)
sales_volume = np.random.randint(10, 200, n_samples)

data = pd.DataFrame({
'date': dates,
'product_id': product_ids,
'sales_volume': sales_volume
})

# Convert timestamps to datetime objects


data['date'] = pd.to_datetime(data['date'])

# Handle missing values and remove outliers


data = data.dropna()
data = data[(data['sales_volume'] >= 10) & (data['sales_volume'] <=
200)]

Perform time series decomposition to analyze trends, seasonality, and residuals.


import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the time series


data = data.set_index('date')
sales_decomposition = seasonal_decompose(data['sales_volume'],
model='additive')

# Plot the decomposition components


fig, (ax1, ax2, ax3, ax4) = plt.subplots(4, 1, figsize=(15, 8))
sales_decomposition.observed.plot(ax=ax1)
ax1.set_title('Observed')
sales_decomposition.trend.plot(ax=ax2)
ax2.set_title('Trend')
sales_decomposition.seasonal.plot(ax=ax3)
ax3.set_title('Seasonal')
sales_decomposition.resid.plot(ax=ax4)
ax4.set_title('Residual')
plt.tight_layout()
plt.show()

Output:

Apply time series forecasting techniques, such as ARIMA or Prophet, to predict future sales (a brief Prophet sketch follows the ARIMA example below).
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# Prepare the data for forecasting


data = data.reset_index()
train_data = data[data['date'] < '2020-11-01']
test_data = data[data['date'] >= '2020-11-01']

# Fit the ARIMA model


arima_model = ARIMA(train_data['sales_volume'], order=(1, 1, 1))
arima_model_fit = arima_model.fit()

# Forecast future sales
forecast = arima_model_fit.forecast(steps=len(test_data))

# Evaluate the model


mse = mean_squared_error(test_data['sales_volume'], forecast)
print(f'Mean Squared Error: {mse}')
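As an alternative to ARIMA, the Prophet library mentioned above can be used for the same task. Here is a minimal sketch, assuming the prophet package is installed and reusing the train/test split from the ARIMA example; Prophet expects a DataFrame with columns named 'ds' (dates) and 'y' (values):
from prophet import Prophet

# Rename the columns to Prophet's expected 'ds'/'y' format
prophet_train = train_data.rename(columns={'date': 'ds', 'sales_volume': 'y'})[['ds', 'y']]

# Fit the model and forecast over the test period
prophet_model = Prophet()
prophet_model.fit(prophet_train)
future = prophet_model.make_future_dataframe(periods=len(test_data))
prophet_forecast = prophet_model.predict(future)[['ds', 'yhat']].tail(len(test_data))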

Analyze the forecast results and provide insights for production planning and inventory
management.

10.4 Case Study 3: Sentiment Analysis for a Movie
Streaming Platform
A movie streaming platform wants to analyze customer sentiment to inform their content
recommendations and improve user satisfaction. The platform has collected user reviews,
including timestamps, movie information, and review text.

Objective: Perform EDA to understand customer sentiment and identify factors affecting user
satisfaction.

Approach:

Load and preprocess the dataset: Convert timestamps to datetime objects, clean the review
text, and remove stopwords.
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Generate synthetic review data


n_samples = 500
dates = pd.date_range('2020-01-01', periods=n_samples)
movie_ids = np.random.choice(np.arange(1, 11), n_samples)
reviews = [
'Great movie! I loved the acting and the storyline.',
'This movie was terrible. The plot made no sense.',
'An average movie, but the special effects were impressive.',
'I really enjoyed this film. The characters were well developed.',
'One of the best movies I have seen in a long time.',
'Poor acting and a weak plot made this movie a disappointment.',
'A fun, entertaining movie for the whole family.',
'This film was a waste of time. I would not recommend it.',
'An engaging story with excellent performances by the cast.',
'The movie was slow-paced and did not hold my interest.'
]

data = pd.DataFrame({
'date': np.random.choice(dates, n_samples),
'movie_id': movie_ids,
'review': np.random.choice(reviews, n_samples)
})

# Convert timestamps to datetime objects

data['date'] = pd.to_datetime(data['date'])

# Clean review text and remove stopwords


stop_words = set(stopwords.words('english'))

def clean_review(review):
review = re.sub(r'\W', ' ', review)
review = re.sub(r'\s+[a-zA-Z]\s+', ' ', review)
review = re.sub(r'\s+', ' ', review)
review = review.lower().split()
return ' '.join([word for word in review if word not in stop_words])

data['cleaned_review'] = data['review'].apply(clean_review)

Perform sentiment analysis using techniques such as VADER or TextBlob to understand the sentiment of each review.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# Initialize the sentiment analyzer


sentiment_analyzer = SentimentIntensityAnalyzer()

# Compute sentiment scores


data['sentiment_score'] = data['cleaned_review'].apply(lambda x:
sentiment_analyzer.polarity_scores(x)['compound'])

# Label reviews as positive, neutral, or negative


def label_sentiment(score):
if score > 0.05:
return 'positive'
elif score < -0.05:
return 'negative'
else:
return 'neutral'

data['sentiment'] = data['sentiment_score'].apply(label_sentiment)

Visualize the distribution of sentiment scores and analyze the relationship between
sentiment and movie information.
import seaborn as sns
import matplotlib.pyplot as plt

# Visualize the distribution of sentiment scores


plt.figure(figsize=(10, 5))
sns.histplot(data['sentiment_score'], kde=True)
plt.xlabel('Sentiment Score')

plt.show()

# Analyze the relationship between sentiment and movie information


plt.figure(figsize=(10, 5))
sns.boxplot(data=data, x='movie_id', y='sentiment_score',
palette='viridis')
plt.xlabel('Movie ID')
plt.ylabel('Sentiment Score')
plt.show()

Output:

Provide insights for content recommendations and user satisfaction improvement based on
the sentiment analysis results.

These case studies demonstrate the versatility of EDA with Python and its applicability to
various real-world problems. By mastering the techniques and tools covered in this book,
you will be well-equipped to tackle data-driven challenges and derive valuable insights from
data across industries and domains.

With these case studies as a foundation, you can now continue exploring the world of
exploratory data analysis with Python. Here are some additional topics and directions for
further study:

Advanced machine learning techniques: Dive deeper into machine learning by exploring
advanced algorithms such as Random Forests, Gradient Boosting, and neural networks.
Apply these techniques to the preprocessed data from your EDA to build more accurate and
efficient predictive models.

Big Data processing: As datasets grow in size, you may need to use distributed computing
frameworks like Apache Spark to handle large-scale data processing tasks. Learn how to
integrate Spark with your EDA workflow to scale your data analysis capabilities.

Data visualization libraries: Expand your data visualization toolkit by learning about other
Python libraries like Bokeh, Altair, and Plotly Express. These libraries offer additional
customization options and interactive features that can help you create more engaging
visualizations.

Automating EDA: Investigate automated EDA tools like pandas-profiling, sweetviz, or D-Tale,
which can help you streamline your EDA workflow and generate comprehensive reports with
minimal manual effort.

Geospatial data analysis: Explore the world of geospatial data and learn how to work with
geographic information systems (GIS) using Python libraries like GeoPandas, Shapely, and
Folium. Apply your EDA skills to analyze spatial patterns, relationships, and trends in
geographic data.

Network analysis: Learn how to analyze and visualize complex networks, such as social
networks, transportation networks, or biological networks, using Python libraries like
NetworkX and graph-tool.

By expanding your knowledge and skillset in these areas, you will become a more versatile
and effective data analyst or data scientist. Remember that the key to mastering EDA is
practice—apply the techniques and tools you've learned to new datasets and real-world
problems to continue honing your skills and uncovering valuable insights from data.

As you continue to develop your expertise in EDA and related fields, remember that
collaboration and knowledge sharing are essential for growth and innovation. Engage with
the data science community by attending conferences, participating in online forums, and
contributing to open-source projects. By learning from others and sharing your own
experiences, you can help advance the field of data analysis and create new opportunities
for discovery and insight.
