Semi-Automated Exploratory Data Analysis (EDA) in Python
Comprehensive Data Exploration Process with One Click
1. Know Your Data
2. Feature Engineering
3. Univariate Analysis
4. Multivariate Analysis
Feel free to jump to the part that you are interested in, or grab the full code at the end of the article published on my website if you find it helpful.
Import Libraries
I will be using four main libraries: NumPy to work with arrays; Pandas to manipulate data in the spreadsheet-like format that we are familiar with; and Seaborn and Matplotlib to create data visualizations.
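A minimal import block for this setup (the aliases are the conventional ones, not shown in the original):

import numpy as np                 # arrays and numeric helpers
import pandas as pd                # tabular data manipulation
import seaborn as sns              # statistical visualization
import matplotlib.pyplot as plt    # plotting backend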
Import Data
Create a data frame from the imported dataset by copying the path of the dataset, then use df.head(5) to take a peek at the first 5 rows of the data.
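A sketch of this step; the file path below is a placeholder for wherever your copy of the dataset lives:

df = pd.read_csv('reddit_wsb.csv')  # placeholder path; point it at your own file
df.head(5)                          # peek at the first 5 rows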
Before zooming into each >eld, let’s >rst take a bird’s eye view of the overall
dataset characteristics.
info()
It gives the count of non-null values for each column and its data type.
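For example:

df.info()  # non-null counts and dtypes for every column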
describe()
This function provides basic statistics for each column. By passing the parameter include='all', it outputs the value count, unique count, and top-frequency value of the categorical variables, plus the count, mean, standard deviation, min, max, and percentiles of the numeric variables. If we leave it empty, it only shows numeric variables. As you can see, only the columns identified as "int64" in the info() output are shown below.
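For instance:

df.describe(include='all')   # categorical and numeric columns alike
df.describe()                # numeric columns only (the default)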
Missing Values
Handling missing values is a rabbit hole that cannot be covered in one or two sentences. If you would love to know how to address missing values in the model lifecycle and understand the different types of missing data, here are some articles that may help:
Then, visualize the percentage of missing values based on the data frame "missing_df". The for loop is basically a handy way to add labels to the bars.
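A sketch of how "missing_df" and the labelled bar chart might be built (the exact gist isn't reproduced here):

missing_df = (df.isnull().sum() / len(df) * 100).reset_index()
missing_df.columns = ['column', 'missing_pct']
ax = sns.barplot(x='column', y='missing_pct', data=missing_df)
for i, pct in enumerate(missing_df['missing_pct']):
    ax.text(i, pct, f'{pct:.1f}%', ha='center', va='bottom')  # label each bar
plt.show()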
As we can see from the chart, nearly half of the "body" values in the "reddit_wsb" dataset are missing, which leads us to the next step: feature engineering.
2. Feature Engineering
This is the only part that requires some human judgment, and thus cannot be easily automated. Don't be afraid of this terminology. I think of feature engineering as a fancy way of saying "transforming the data at hand to make it more insightful". There are several common techniques, e.g. changing a date of birth into an age, decomposing a date into year, month, and day, or binning numeric values. But the general rule is that this process should be tailored to both the data at hand and the objectives to achieve. If you would like to know more about these techniques, I found that the article "Fundamental Techniques of Feature Engineering for Machine Learning" brings a holistic view of feature engineering in practice.
1. title → title_length
df['title_length'] = df['title'].apply(len)
2. body → with_body
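The snippet for this step isn't shown in the extract; a minimal sketch, assuming with_body simply flags whether a post has body text:

df['with_body'] = np.where(df['body'].isnull(), 'No', 'Yes')  # assumption: flag posts with body text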
3. timestamp → month
df['month'] = pd.to_datetime(df['timestamp']).dt.month.apply(str)
Since most of the data was gathered in the year 2021, there is no point comparing the year. Therefore I kept the month section of the date, which also helps to group the data into larger subsets.
To streamline the further analysis, I drop the columns that won't contribute to the EDA.
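For instance (the column names below are illustrative, not the author's exact list):

df = df.drop(columns=['id', 'url', 'timestamp', 'title', 'body'])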
For the "restaurant" dataset, the data is already clean enough, so I simply trimmed out the columns with high cardinality.
After finalizing the numerical and categorical variable lists, the univariate and multivariate analysis can be automated.
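One way to build those lists (num_list and cat_list are names I'm assuming for the sketches that follow):

num_list = df.select_dtypes(include='number').columns.tolist()   # numerical variables
cat_list = df.select_dtypes(exclude='number').columns.tolist()   # categorical variables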
3. Univariate Analysis
The describe() function mentioned in the first section has already provided a univariate analysis in a non-graphical way. In this section, we will generate more insights by visualizing the data and spotting hidden patterns through graphical analysis.
The easiest yet most intuitive way to visualize the property of a categorical
variable is to use a bar chart to plot the frequency of each categorical value.
To graph out the distribution of a numeric variable, we can use a histogram, which is very similar to a bar chart. It splits continuous numbers into equal-size bins and plots the frequency of records falling within each interval.
I use a for loop to iterate through the columns in the data frame and create a plot for each column: a histogram if the variable is numerical, and a bar chart if it is categorical.
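A sketch of that loop, reusing the num_list/cat_list split from above:

for col in df.columns:
    plt.figure(figsize=(8, 4))
    if col in num_list:
        sns.histplot(df[col])            # histogram for numerical variables
    else:
        sns.countplot(x=col, data=df)    # bar chart for categorical variables
    plt.title(col)
    plt.show()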
4. Multivariate Analysis
Multivariate analysis is categorized into three conditions to address the various combinations of numerical and categorical variables: numerical vs. numerical, categorical vs. categorical, and categorical vs. numerical.
First, a correlation matrix plotted with sns.heatmap() summarizes the pairwise relationships among the numerical variables. Secondly, since the correlation matrix only indicates the strength of the linear relationship, it is better to also plot the numerical variables using the Seaborn function sns.pairplot(). Notice that both the sns.heatmap() and sns.pairplot() functions ignore non-numeric data types.
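A sketch of both plots over the numerical columns:

sns.heatmap(df[num_list].corr(), annot=True, cmap='coolwarm')  # linear correlations
plt.show()
sns.pairplot(df[num_list])   # pairwise scatter plots and per-variable distributions
plt.show()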
I use a nested for loop, where the outer loop iterates through all the categorical variables and assigns them as the primary category, then the inner loop iterates through the list again to pair the primary category with a different secondary category.
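A sketch of that nested loop, using sns.countplot with a hue to get the grouped bars:

for primary in cat_list:
    for secondary in cat_list:
        if primary == secondary:
            continue                    # skip pairing a variable with itself
        plt.figure(figsize=(8, 4))
        sns.countplot(x=primary, hue=secondary, data=df)
        plt.show()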
Within one grouped bar chart, if the frequency distribution always follows the same pattern across the different groups, it suggests that there is no dependency between the primary and secondary categories. However, if the distribution differs, it indicates that there is likely a dependency between the two variables.
A box plot is usually adopted when we need to compare how numerical data varies across groups. It is an intuitive way to graphically depict whether the variation in categorical features contributes to the difference in values, which can additionally be quantified using an ANOVA analysis. In this process, I pair each column in the categorical list with all the columns in the numerical list and plot the box plots accordingly.
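A sketch of the pairing loop:

for cat in cat_list:
    for num in num_list:
        plt.figure(figsize=(8, 4))
        sns.boxplot(x=cat, y=num, data=df)   # spread of num within each level of cat
        plt.show()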
Another approach builds upon the pairplot that we performed earlier for numerical vs. numerical variables. To introduce the categorical variable, we can use different hues to represent it, just like what we did for the countplot. To do this, we can simply loop through the categorical list and add each element as the hue of the pairplot.
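A sketch of that loop:

for cat in cat_list:
    sns.pairplot(df, hue=cat)   # color the numeric pairplot by each categorical variable
    plt.show()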
Take-Home Message
This article covers several steps to perform EDA:
1. Know Your Data: take a bird's-eye view of the characteristics of the dataset.
2. Feature Engineering: transform the data at hand to make it more insightful.
3. Univariate Analysis: visualize the variables one at a time with bar charts and histograms.
4. Multivariate Analysis: explore the relationships between variable pairs with heatmaps, pair plots, grouped bar charts, and box plots.
Feel free to grab the code from the end of the article on my website. As mentioned earlier, other than the feature engineering part, the rest of the analysis can be automated. However, it is always better when the automation process is accompanied by some human touch, for example, experimenting with the bin size to optimize the histogram distribution. As always, I hope you find this article helpful and I encourage you to give it a go with your own dataset :)