
Semi-Automated Exploratory Data Analysis (EDA) in Python
Comprehensive Data Exploration Process with One Click

Destin Gong Mar 1 · 10 min read

EDA overview (image by author from www.visual-design.net)

Exploratory Data Analysis, also known as EDA, has become an increasingly hot topic in data science. Just as the name suggests, it is a process of trial and error in an uncertain space, with the goal of finding insights. It usually happens at an early stage of the data science lifecycle. Although there is no clear-cut boundary between data exploration, data cleaning, and feature engineering, EDA generally sits right after the data cleaning phase and before feature engineering or model building. EDA assists in setting the overall direction of model selection, and it helps to check whether the data meets the model assumptions. As a result, carrying out this preliminary analysis may save you a large amount of time in the following steps.

In this article, I have created a semi-automated EDA process that can be broken down into the following steps:

1. Know Your Data

2. Data Manipulation and Feature Engineering

3. Univariate Analysis

4. Multivariate Analysis

Feel free to jump to the part that you are interested in, or grab the full code at the end of the article published on my website if you find it helpful.

1. Know Your Data


Firstly, we need to load the Python libraries and the dataset. For this exercise, I am using several public datasets from the Kaggle community. Feel free to explore these amazing datasets using the links below:

Restaurant Business Rankings 2020

Reddit WallStreetBets Posts

Import Libraries
I will be using four main libraries: NumPy, to work with arrays; pandas, to manipulate data in the spreadsheet format that we are familiar with; and Seaborn and Matplotlib, to create data visualizations.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.api.types import is_string_dtype, is_numeric_dtype

Import Data
Create a data frame from the imported dataset by copying the path of the dataset, and use df.head(5) to take a peek at the first 5 rows of the data.
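A minimal sketch of this step, assuming the dataset has been downloaded as a local CSV file (the file name here is illustrative):

# Create a data frame from the CSV file (path is illustrative)
df = pd.read_csv('reddit_wsb.csv')

# Take a peek at the first 5 rows
df.head(5)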

“reddit_wsb” dataset result (image by author)

“restaurant” dataset output (image by author)

Before zooming into each field, let’s first take a bird’s eye view of the overall dataset characteristics.

info()
It gives the count of non-null values for each column and its data type.
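For a data frame df, this is a one-liner:

df.info()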

“reddit_wsb” dataset result (image by author)

“restaurant” dataset output (image by author)

describe()
This function provides basic statistics for each column. By passing the parameter “include = ‘all’”, it outputs the value count, unique count, and top-frequency value of the categorical variables, as well as the count, mean, standard deviation, min, max, and percentiles of the numeric variables.
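For example:

df.describe(include='all')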

“reddit_wsb” dataset result (image by author)

“restaurant” dataset output (image by author)

If we leave the parameter empty, it only shows numeric variables. As you can see, only columns identified as “int64” in the info() output are shown below.

describe() result for “restaurant” dataset (image by author)

Missing Value
Handling missing values is a rabbit hole that cannot be covered in one or two sentences. If you would love to know how to address missing values in the model lifecycle and understand different types of missing data, here are some articles that may help:

Simple Logistic Regression using Python scikit-learn
Step-by-Step Guide from Data Preprocessing to Model Evaluation
towardsdatascience.com

How to Address Common Data Quality Issues Without Code
Use Tableau to Solve Inconsistent Values
medium.com

In this article, we will focus on identifying the number of missing values.


isnull().sum() returns the number of missing values for each column.
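Applied to the whole data frame, this is simply:

df.isnull().sum()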

“reddit_wsb” dataset result (image by author)

We can also do some simple manipulations to make the output more insightful. Firstly, calculate the percentage of missing values.
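A sketch of that calculation, stored in the “missing_df” data frame used below (the column names are my own choice):

# Percentage of missing values per column
missing_df = df.isnull().sum().reset_index()
missing_df.columns = ['column', 'missing_count']
missing_df['missing_pct'] = missing_df['missing_count'] / len(df) * 100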

“reddit_wsb” dataset result (image by author)

Then, visualize the percentage of the missing value based on the data frame
“missing_df”. The for loop is basically a handy way to add labels to the bars.
As we can see from the chart, nearly half of the “body” values from the
“reddit_wsb” dataset are missing, which leads us to the next step “feature
engineering”.
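The bar chart and its labels could be produced along these lines (figure size and label formatting are illustrative):

# Bar chart of missing-value percentages, with a label above each bar
fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(missing_df['column'], missing_df['missing_pct'])
for i, pct in enumerate(missing_df['missing_pct']):
    ax.text(i, pct + 0.5, f'{pct:.1f}%', ha='center')
ax.set_ylabel('% missing')
plt.xticks(rotation=45)
plt.show()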

“reddit_wsb” dataset result (image by author)

2. Feature Engineering
This is the only part that requires some human judgment, thus it cannot be easily automated. Don’t be afraid of this terminology. I think of feature engineering as a fancy way of saying: transform the data at hand to make it more insightful. There are several common techniques, e.g. changing date of birth into age, decomposing a date into year, month, and day, and binning numeric values. But the general rule is that this process should be tailored to both the data at hand and the objectives to achieve. If you would like to know more about these techniques, I found that the article “Fundamental Techniques of Feature Engineering for Machine Learning” brings a holistic view of feature engineering in practice.

For the “reddit_wsb” dataset, I simply did three manipulations on the existing data.

1. title → title_length

df['title_length'] = df['title'].apply(len)

As a result, the high-cardinality column “title” has been transformed into a numeric variable which can be further used in the correlation analysis.

2. body → with_body

df['with_body'] = np.where(df['body'].isnull(), 'No', 'Yes')

Since there is a large portion of missing values, the “body” field is transformed into either with_body = “Yes” or with_body = “No”, so it can be easily analyzed as a categorical variable.

3. timestamp → month

df['month'] = pd.to_datetime(df['timestamp']).dt.month.apply(str)

Since most of the data was gathered in the year 2021, there is no point in comparing years. Therefore I kept the month section of the timestamp, which also helps to group the data into larger subsets.

In order to streamline the further analysis, I drop the columns that won’t
contribute to the EDA.

df = df.drop(['id', 'url', 'timestamp', 'title', 'body'], axis=1)

For the “restaurant” dataset, the data is already clean enough, therefore I
simply trimmed out the columns with high cardinality.

df = df.drop(['Restaurant', 'City'], axis=1)

Furthermore, the remaining variables are categorized into numerical and categorical, since univariate analysis and multivariate analysis require different approaches to handle different data types. “is_string_dtype” and “is_numeric_dtype” are handy functions to identify the data type of each field.
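A sketch of this split (the list names numerical and categorical are reused in the loops below):

# Categorize columns into numerical and categorical lists by data type
numerical = [col for col in df.columns if is_numeric_dtype(df[col])]
categorical = [col for col in df.columns if is_string_dtype(df[col])]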

“reddit_wsb” dataset result (image by author)

“restaurant” dataset output (image by author)

After finalizing the numerical and categorical variable lists, the univariate and multivariate analysis can be automated.

3. Univariate Analysis
The describe() function mentioned in the first section has already provided a univariate analysis in a non-graphical way. In this section, we will generate more insights by visualizing the data and spotting hidden patterns through graphical analysis.

Have a read of my article on “How to Choose the Most Appropriate Chart” if you are interested in knowing which chart types are most suitable for which data type.

How to Choose the Most Appropriate Chart
Line chart, bar chart, pie chart … they tell different stories
towardsdatascience.com

Categorical Variables → Bar chart

The easiest yet most intuitive way to visualize the property of a categorical
variable is to use a bar chart to plot the frequency of each categorical value.

Numerical Variables → histogram

To graph the numeric variable distribution, we can use a histogram, which is very similar to a bar chart. It splits continuous numbers into equal-size bins and plots the frequency of records falling within each interval.

create histogram and bar chart (image by author)

I use a for loop to iterate through the columns in the data frame and create a plot for each column: a histogram if the variable is numerical, and a bar chart if it is categorical.
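A sketch of that loop (figure size is illustrative):

# One chart per column: histogram if numerical, bar chart if categorical
for col in df.columns:
    fig, ax = plt.subplots(figsize=(6, 4))
    if is_numeric_dtype(df[col]):
        df[col].plot(kind='hist', ax=ax)
    else:
        df[col].value_counts().plot(kind='bar', ax=ax)
    ax.set_title(col)
    plt.show()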

“reddit_wsb” dataset result (image by author)

“restaurant” dataset output (image by author)

4. Multivariate Analysis
Multivariate analysis is categorized into these three conditions to address
various combinations of numerical variables and categorical variables.

1. Numerical vs. Numerical → heat map or pairplot


Firstly, let’s use the correlation matrix to find the correlation of all numeric data type columns, then use a heat map to visualize the result. The annotation inside each cell indicates the correlation coefficient of the relationship.
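A sketch of these two steps (the color map is my own choice):

# Correlation matrix of the numeric columns, shown as an annotated heat map
corr = df[numerical].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()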

“reddit_wsb” dataset result (image by author)

“restaurant” dataset output (image by author)

Secondly, since the correlation matrix only indicates the strength of a linear relationship, it is better to plot the numerical variables using the Seaborn function sns.pairplot(). Notice that both the sns.heatmap() and sns.pairplot() functions ignore non-numeric data types.
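For example:

# Scatterplot matrix of all numeric columns
sns.pairplot(df)
plt.show()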

“reddit_wsb” dataset result (image by author)

“restaurant” dataset output (image by author)

A pair plot or scatterplot is a good complement to the correlation matrix, especially when nonlinear relationships (e.g. exponential or inverse relationships) might exist. For example, the inverse relationship between “Rank” and “Sales” observed in the restaurant dataset may be mistaken for a strong linear relationship if we simply look at the number “-0.92” in the correlation matrix.

2. Categorical vs. Categorical → countplot with hue

sample output set (image by author from the website)

The relationship between two categorical variables can be visualized using grouped bar charts. The frequency of the primary categorical variable is broken down by the secondary category. This can be achieved using sns.countplot().

I use a nested for loop, where the outer loop iterates through all categorical variables and assigns them as the primary category, and the inner loop iterates through the list again to pair the primary category with a different secondary category.

grouped bar chart code (image by author)
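A sketch of that nested loop (figure size is illustrative):

# Grouped bar chart for every ordered pair of categorical variables
for primary in categorical:
    for secondary in categorical:
        if primary == secondary:
            continue
        fig, ax = plt.subplots(figsize=(6, 4))
        sns.countplot(x=primary, hue=secondary, data=df, ax=ax)
        plt.show()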

Within one grouped bar chart, if the frequency distribution always follows the same pattern across different groups, it suggests that there is no dependency between the primary and secondary categories. However, if the distribution is different, it indicates that there is likely a dependency between the two variables.

“reddit_wsb” dataset result (image by author)

Since there is only one categorical variable in the “restaurant” dataset, no graph is generated.

3. Categorical vs. Numerical → boxplot or pairplot with hue

boxplot sample output set (image by author from the website)

The box plot is usually adopted when we need to compare how numerical data varies across groups. It is an intuitive way to graphically depict whether variation in the categorical features contributes to the difference in values, which can be additionally quantified using ANOVA analysis. In this process, I pair each column in the categorical list with all columns in the numerical list and plot the box plots accordingly.

boxplot code (image by author)
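A sketch of that pairing loop (figure size is illustrative):

# Box plot of each numerical variable, grouped by each categorical variable
for cat in categorical:
    for num in numerical:
        fig, ax = plt.subplots(figsize=(6, 4))
        sns.boxplot(x=cat, y=num, data=df, ax=ax)
        plt.show()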

In the “reddit_wsb” dataset, no significant difference is observed across the different categories.

“reddit_wsb” dataset result (image by author)

On the other hand, the “restaurant” dataset gives us some interesting output. Some states (e.g. “Mich.”) seem to jump all around the plots. Is it just because of the relatively smaller sample size for these states? That might be worth further investigation.

“restaurant” dataset output (image by author)

Another approach builds upon the pair plot that we performed earlier for numerical vs. numerical. To introduce the categorical variable, we can use different hues to represent it, just like what we did for the countplot. To do this, we can simply loop through the categorical list and add each element as the hue of the pairplot.

pairplot with hue code (image by author)
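A sketch of that loop:

# Pair plot recolored by each categorical variable in turn
for cat in categorical:
    sns.pairplot(df, hue=cat)
    plt.show()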

Consequently, it is easy to visualize whether each group forms clusters in the scatterplot.

“reddit_wsb” dataset result (image by author)

“restaurant” dataset output (image by author)

Take-Home Message
This article covers several steps to perform EDA:

1. Know Your Data: have a bird’s eye view of the characteristics of the dataset.

2. Feature Engineering: transform variables into something more insightful.

3. Univariate Analysis: 1) histogram to visualize numerical data; 2) bar chart to visualize categorical data.

4. Multivariate Analysis: 1) Numerical vs. Numerical: correlation matrix, scatterplot (pairplot); 2) Categorical vs. Categorical: grouped bar chart; 3) Numerical vs. Categorical: pairplot with hue, box plot.

Feel free to grab the code from the end of the article on my website. As mentioned earlier, other than the feature engineering part, the rest of the analysis can be automated. However, it is always better when the automation process is accompanied by some human touch, for example, experimenting with the bin size to optimize the histogram distribution. As always, I hope you find this article helpful and I encourage you to give it a go with your own dataset :)

More Related Articles

Level Up 7 Data Science Skills Through YouTube
We are all familiar with the modern game design, that champions or heroes are always equipped with certain attributes…
link.medium.com

Simple Logistic Regression using Python scikit-learn
Step-by-Step Guide from Data Preprocessing to Model Evaluation
towardsdatascience.com

Top 15 Websites for Data Scientists to Follow in 2021
Sites and Blogs that Inspire Learning
medium.com

Originally published at https://www.visual-design.net on February 28, 2021.

