Preprocessing Techniques
First, you have to dig deep into the problem and understand which
clues you are missing and what information you can extract.
Step 1: Exploratory Data Analysis
The first step in a data science project is the exploratory
analysis, which helps you understand the problem and make
decisions in the next steps.
It tends to be skipped, but skipping it is a serious mistake: you’ll
lose a lot of time later finding out why the model gives
errors or doesn’t perform as expected.
Exploratory analysis can be split into three parts:
1. Check the structure of the dataset, the statistics, the missing
values, the duplicates, the unique values of the categorical
variables
2. Understand the meaning and the distribution of the variables
3. Study the relationships between variables
To analyse how the dataset is organised, the following
Pandas methods can help you:
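df.head()              # first rows of the dataset
df.info()              # column types and non-null counts
df.isnull().sum()      # missing values per column
df.duplicated().sum()  # number of duplicated rows
df.describe(percentiles=[x*0.1 for x in range(10)])  # statistics at each decile
for c in list(df):     # frequency of each value, column by column
    print(df[c].value_counts())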
When trying to understand the variables, it’s useful to split the
analysis into two further parts: numerical features and
categorical features.
First, we can focus on the numerical features, which can be
visualized through histograms and boxplots.
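As a minimal sketch of this step, the snippet below draws one histogram and one boxplot per numerical column. The toy dataset and its column names (`age`, `income`) are hypothetical and not part of the original material, and Matplotlib is assumed to be installed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "age": [22, 25, 31, 35, 41, 48, 52, 90],
    "income": [1800, 2100, 2500, 2700, 3200, 3900, 4100, 4000],
})

# Keep only the numerical features
numeric_cols = df.select_dtypes(include="number").columns

# One column of plots per feature: histogram on top, boxplot below
fig, axes = plt.subplots(nrows=2, ncols=len(numeric_cols), figsize=(8, 6))
for i, col in enumerate(numeric_cols):
    axes[0, i].hist(df[col], bins=5)
    axes[0, i].set_title(f"Histogram of {col}")
    axes[1, i].boxplot(df[col])
    axes[1, i].set_title(f"Boxplot of {col}")
fig.tight_layout()
```

The histogram shows the shape of the distribution, while the boxplot makes extreme values easier to spot.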
Next, it’s the turn of the categorical variables.
If it’s a binary classification problem, it’s better to start by checking
whether the classes are balanced.
Then focus on the remaining categorical variables using bar
plots.
Finally, check the correlation between each pair of numerical
variables.
Other useful data visualizations are scatter plots, to observe the
relation between two numerical variables, and boxplots, to observe
the relation between a numerical and a categorical variable.
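The checks above can be sketched numerically before plotting anything. The dataset below is hypothetical (the column names `churn`, `contract`, `age` and `income` are assumptions made for illustration); each resulting table can then be passed to a plotting call such as a bar plot.

```python
import pandas as pd

# Hypothetical toy dataset; all column names are illustrative assumptions
df = pd.DataFrame({
    "churn": [0, 0, 0, 1, 0, 1, 0, 0],  # binary target
    "contract": ["monthly", "yearly", "monthly", "monthly",
                 "yearly", "monthly", "yearly", "monthly"],
    "age": [22, 25, 31, 35, 41, 48, 52, 60],
    "income": [1800, 2100, 2500, 2700, 3200, 3900, 4100, 4000],
})

# 1. Class balance of the binary target, as proportions
class_balance = df["churn"].value_counts(normalize=True)

# 2. Frequencies of a categorical variable (the data behind a bar plot)
contract_counts = df["contract"].value_counts()

# 3. Pairwise correlation between the numerical variables
corr = df[["age", "income"]].corr()

# 4. Numerical vs categorical: summary statistics per category
#    (the same information a grouped boxplot would show)
income_by_contract = df.groupby("contract")["income"].describe()

print(class_balance, contract_counts, corr, income_by_contract, sep="\n\n")
```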
Step 2: Deal with Missing values
First, investigate the missing values in each variable.
If there are missing values, we need to decide how to
handle them.
The easiest way would be to remove the variables or the rows
that contain NaN values,
but we would prefer to avoid it because we risk losing useful
information that can help our machine learning model
solve the problem.
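To make the cost of removal concrete, here is a small sketch on a hypothetical dataset (the columns `age`, `income` and `city` are assumptions made for illustration). It counts the missing values and then compares what is left after dropping rows versus dropping columns.

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy dataset with some missing values
df = pd.DataFrame({
    "age": [25, nan, 40, 33, nan],
    "income": [2000, 2500, nan, 3100, 2800],
    "city": ["Rome", "Milan", "Turin", "Rome", "Milan"],
})

# Missing values per variable
missing_per_column = df.isnull().sum()

# Option A: drop every row that contains at least one NaN
rows_dropped = df.dropna()

# Option B: drop every column that contains at least one NaN
cols_dropped = df.dropna(axis=1)

print(missing_per_column)
print("original:", df.shape)
print("after dropping rows:", rows_dropped.shape)
print("after dropping columns:", cols_dropped.shape)
```

In this toy case, dropping rows discards three of the five observations and dropping columns discards both numerical features, which illustrates why filling the values is often preferred.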
If we are dealing with a numerical variable, there are several approaches
to fill it.
The most popular method is to fill the missing values with the
mean or median of that feature; for a categorical variable, the mode
can be used instead:
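# Numerical variable: fill with the mean (fillna returns a new Series, so assign it back)
df['age'] = df['age'].fillna(df['age'].mean())
# Alternative: fill with the median instead
# df['age'] = df['age'].fillna(df['age'].median())
# Categorical variable: fill with the most frequent value (mode)
df['type_building'] = df['type_building'].fillna(df['type_building'].mode()[0])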
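A short self-contained sketch of these filling strategies follows. The values are hypothetical and chosen so that `age` contains one extreme value, which shows how the mean and the median can lead to quite different imputed values.

```python
import pandas as pd

nan = float("nan")

# Hypothetical toy data: one extreme age (94) and one missing value per column
df = pd.DataFrame({
    "age": [20, 22, nan, 24, 94],
    "type_building": ["flat", "house", None, "flat", "flat"],
})

# fillna returns a new Series; the original column is left untouched
age_mean_filled = df["age"].fillna(df["age"].mean())
age_median_filled = df["age"].fillna(df["age"].median())

# Categorical variable: use the most frequent value
building_filled = df["type_building"].fillna(df["type_building"].mode()[0])

print("mean-imputed value:  ", age_mean_filled[2])
print("median-imputed value:", age_median_filled[2])
print("mode-imputed value:  ", building_filled[2])
```

The mean is pulled towards the extreme value, so the median is often the safer choice when the distribution is skewed.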
Step 3: Deal with Duplicates and Outliers