
Data Cleaning and Preprocessing Techniques
 First, dig deep into the problem: understand which clues you
are missing and what information you can extract.

 After understanding the problem, prepare the dataset for your
machine learning model, since data in its raw state is rarely
ready for modeling.

Step 1: Exploratory Data Analysis
 The first step in a data science project is exploratory
analysis, which helps you understand the problem and make
decisions in the later steps.
 It tends to be skipped, but that is a serious mistake: you will
lose a lot of time later finding out why the model gives errors
or does not perform as expected.

 Exploratory analysis can be split into three parts:
1. Check the structure of the dataset: the statistics, the missing
values, the duplicates, and the unique values of the categorical
variables
2. Understand the meaning and the distribution of the variables
3. Study the relationships between variables

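 To analyse how the dataset is organised, the following Pandas
methods can help:
df.head()                                 # first rows
df.info()                                 # column types and non-null counts
df.isnull().sum()                         # missing values per column
df.duplicated().sum()                     # number of duplicated rows
df.describe([x / 10 for x in range(10)])  # statistics at every tenth percentile
for c in list(df):
    print(df[c].value_counts())           # frequency of each unique value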
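To make the missing-value and duplicate checks concrete, here is a minimal sketch on a toy dataset. The DataFrame and its values are hypothetical and only serve to show what the two counts look like.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: one missing age and one fully duplicated row.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 32, 47],
    "type_building": ["flat", "house", "flat", "house", "flat"],
})

missing_per_column = df.isnull().sum()  # NaN count for each column
n_duplicates = df.duplicated().sum()    # rows identical to an earlier row

print(missing_per_column)
print("duplicated rows:", n_duplicates)
```

Here the fourth row repeats the second one, so it is the only row counted as a duplicate.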

 When trying to understand the variables, it’s useful to split the
analysis into two further parts: numerical features and
categorical features.
 First, we can focus on the numerical features, which can be
visualized through
 histograms
 boxplots
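As a rough illustration, the numbers behind a histogram and a boxplot can be computed directly with pandas. The `age` values below are hypothetical, and the 1.5 * IQR whisker rule is the common boxplot convention, assumed here rather than taken from the slides.

```python
import pandas as pd

# Hypothetical numerical feature.
age = pd.Series([22, 25, 25, 31, 38, 40, 41, 58, 95], name="age")

# Histogram: count the values falling into 4 equal-width bins.
hist_counts = pd.cut(age, bins=4).value_counts().sort_index()

# Boxplot: quartiles, interquartile range, and 1.5 * IQR whisker limits.
q1, median, q3 = age.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points a boxplot would draw individually, beyond the whiskers.
outside_whiskers = age[(age < lower_fence) | (age > upper_fence)]

print(hist_counts)
print(outside_whiskers)
```

If matplotlib is installed, `age.plot.hist()` and `age.plot.box()` draw the corresponding charts.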

 Next, turn to the categorical variables.
 If it is a binary classification problem, start by checking
whether the classes are balanced.
 Then examine the remaining categorical variables using bar
plots.
 Finally, check the correlation between each pair of numerical
variables.
 Scatter plots and boxplots are other useful visualizations, for
example to observe the relationship between a numerical and a
categorical variable.
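These checks can be sketched with plain pandas. The column names `target`, `surface` and the values are hypothetical; `type_building` and `price` follow the names used elsewhere in these slides.

```python
import pandas as pd

# Hypothetical dataset: binary target, one categorical and two numerical features.
df = pd.DataFrame({
    "target": [0, 0, 0, 1, 0, 1, 0, 0],
    "type_building": ["flat", "house", "flat", "flat",
                      "house", "flat", "villa", "flat"],
    "surface": [50, 120, 65, 70, 140, 55, 300, 60],
    "price": [100, 250, 130, 150, 290, 115, 640, 125],
})

# Class balance: share of each class in the target.
class_share = df["target"].value_counts(normalize=True)

# Frequencies of a categorical variable (what a bar plot would display).
category_counts = df["type_building"].value_counts()

# Pairwise Pearson correlation between the numerical features.
corr = df[["surface", "price"]].corr()

# Numerical vs categorical: summary of price within each building type.
price_by_type = df.groupby("type_building")["price"].describe()

print(class_share, category_counts, corr, price_by_type, sep="\n\n")
```

In this toy data one class covers 75% of the rows, which is the kind of imbalance worth knowing about before modeling.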
Step 2: Deal with Missing Values
 In the first step, we investigated the missing values in each
variable.
 If there are missing values, we need to decide how to handle
them.
 The easiest way would be to remove the variables or the rows
that contain NaN values,
 but we would prefer to avoid this because we risk losing useful
information that could help our machine learning model solve
the problem.
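A small sketch of how much data plain removal can cost, using a hypothetical five-row dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with NaN values scattered across two columns.
df = pd.DataFrame({
    "age": [25, np.nan, 41, 33, np.nan],
    "price": [100.0, 150.0, np.nan, 120.0, 90.0],
    "type_building": ["flat", "house", "flat", "flat", "house"],
})

rows_kept = df.dropna()        # drop every row that contains a NaN
cols_kept = df.dropna(axis=1)  # drop every column that contains a NaN

print(f"rows: {len(df)} -> {len(rows_kept)}")
print(f"columns: {df.shape[1]} -> {cols_kept.shape[1]}")
```

Only three cells are missing, yet dropping rows discards three of the five records and dropping columns discards both numerical features.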

 If we are dealing with a numerical variable, there are several approaches
to fill it.
 The most popular method consists of filling the missing values with the
mean or median of that feature (pick one):
df['age'] = df['age'].fillna(df['age'].mean())
df['age'] = df['age'].fillna(df['age'].median())

 Another way is to fill the blanks with a group-by imputation:
df['price'] = df['price'].fillna(df.groupby('type_building')['price'].transform('mean'))
 This can be a better option when there is a strong relationship between a
numerical feature and a categorical feature.
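The difference between the two strategies can be seen on a toy example. The prices below are hypothetical and chosen so that the building type clearly drives the price.

```python
import numpy as np
import pandas as pd

# Hypothetical data: flats are cheap, villas are expensive.
df = pd.DataFrame({
    "type_building": ["flat", "flat", "flat", "villa", "villa", "villa"],
    "price": [100.0, 120.0, np.nan, 500.0, np.nan, 540.0],
})

# Global imputation: every gap receives the overall mean.
global_fill = df["price"].fillna(df["price"].mean())

# Group-by imputation: each gap receives the mean of its own building type.
group_mean = df.groupby("type_building")["price"].transform("mean")
group_fill = df["price"].fillna(group_mean)

print(pd.DataFrame({"global": global_fill, "by_group": group_fill}))
```

The global mean assigns the same value to the missing flat and the missing villa, while the group means keep the two building types apart.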
 For a categorical variable, fill the missing values with the mode of that
variable:

df['type_building'] = df['type_building'].fillna(df['type_building'].mode()[0])
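A minimal sketch of mode imputation on a hypothetical categorical column:

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature with two missing entries.
type_building = pd.Series(
    ["flat", "house", np.nan, "flat", None], name="type_building"
)

# mode() ignores missing values and returns a Series; [0] takes its first entry.
mode_value = type_building.mode()[0]
filled = type_building.fillna(mode_value)

print(filled)
```

When several categories are tied for most frequent, `mode()` returns all of them in sorted order, so `[0]` picks the first of the tied values.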

Step 3: Deal with Duplicates and Outliers
