
When dealing with a machine learning project, real-world data is typically not ready to be used. There might be missing values or incorrect types in the dataset we get. This rawness of the data needs to be dealt with first so that an ML algorithm can be applied to it. This is a common problem that all data-related professionals have to face.

The process of dealing with unclean data and transforming it into a form more appropriate for modeling is called data pre-processing. This step can be considered mandatory in the machine learning process for several reasons, such as:

• data errors: Statistical noise or missing data need to be corrected.

• data types: Most machine learning algorithms require input data in the form of numbers.

• data complexity: Some data might be so complex that an algorithm cannot perform well on it. Complexity can be a reason for overfitting in a model.

While data pre-processing can be different for every case, there are some common tasks that can be used:

• data cleansing

• feature selection

• data scaling

• feature engineering

• dimensionality reduction

We will explore these steps and implement them on a sample dataset using Python libraries.

Data Cleansing: Handling missing values

One of the most common data cleansing processes is dealing with missing values. Basically, there are two ways to handle missing values:

1. Remove rows with missing values

2. Impute missing values

Removing rows is the simplest strategy and easy to execute. In contrast, imputing missing values is more complicated. We can impute values using some rules, such as:

• A constant value that has meaning within the domain and is different from other data, like 0 or -1.

• A measure of central tendency, which is the mean, median, or mode.

• Predictive values estimated from other data.

Even though most ML algorithms require a complete dataset, not all of them fail when there is missing data. There are algorithms that are robust to missing values, like KNN and Naive Bayes, while other algorithms can use missing values as a unique value, like Decision Trees. Nevertheless, the scikit-learn implementations of those algorithms are not robust to missing values.

We are going to use the SimpleImputer class to replace all missing values, marked with NaN, with the mean value of the corresponding column. You can download the dataset here: Melbourne Housing Snapshot.

Four features have missing values. We will work on the features ‘Age’, ‘BuildingArea’, and ‘YearBuilt’.
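
A minimal sketch of the imputation step, assuming the snapshot was saved locally as melb_data.csv; ‘BuildingArea’ and ‘YearBuilt’ are columns of the dataset, and any other column with NaN values can be added to the list:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Assumes the snapshot was downloaded as "melb_data.csv"
df = pd.read_csv("melb_data.csv")

# Columns with missing values to fill (adjust to your dataset)
cols = ["BuildingArea", "YearBuilt"]

# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[cols] = imputer.fit_transform(df[cols])

print(df[cols].isnull().sum())  # all counts should now be 0
```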

Feature Selection

In a nutshell, feature selection means removing irrelevant features. The reasons we need to do this are to:

• reduce complexity

• produce easy to understand model

• reduce computational cost

• prevent overfitting

• improve model performance

These are the feature selection techniques, grouped by their basic algorithm:

credit: machinelearningmastery.com

When using statistics-based feature selection, it is important to choose the method based on the data types of the input and output variables. This is a decision tree to decide which statistics-based method is suitable for our data:

credit: machinelearningmastery.com

We are going to use the RFE method to select the most important features from our dataset. Recursive Feature Elimination (RFE) is popular due to its flexibility and ease of use. It reduces model complexity by removing features one by one until the selected number of features is left.

The scikit-learn Python machine learning library provides an implementation of RFE for machine learning. To use it, the class is first configured with the chosen algorithm specified via the “estimator” argument and the number of features to select via the “n_features_to_select” argument.
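
A minimal sketch, assuming X is a DataFrame of numeric predictors and y the target (e.g. price) already prepared from the dataset; the choice of DecisionTreeRegressor as the estimator is an assumption for illustration:

```python
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Assumes X (numeric features, no missing values) and y (target) exist
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=6)
rfe.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for every kept feature
for name, kept, rank in zip(X.columns, rfe.support_, rfe.ranking_):
    print(f"{name}: Selected={kept}, Rank={rank}")
```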

The six most relevant features based on RFE are indicated by “Selected=True”

Feature Scaling

Many machine learning algorithms perform better when numerical input variables are scaled. This includes algorithms that use a weighted sum of the input, like linear regression; algorithms that use distance measures, like k-nearest neighbors; and gradient descent-based algorithms.

There are two common methods for scaling: normalization, which rescales values into the range 0 to 1, and standardization, which rescales values to have a mean of 0 and a standard deviation of 1.

For the Melbourne Housing dataset, we are going to implement normalization using the scikit-learn object called MinMaxScaler.
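
A minimal sketch, assuming df is the Melbourne Housing DataFrame with missing values already handled; ‘Landsize’ and ‘BuildingArea’ are used here only as example columns:

```python
from sklearn.preprocessing import MinMaxScaler

# Assumes df is the Melbourne Housing DataFrame, NaNs already imputed
cols = ["Landsize", "BuildingArea"]

# Rescale each column into the range [0, 1]
scaler = MinMaxScaler()
df[cols] = scaler.fit_transform(df[cols])

print(df[cols].max())  # each maximum should now be 1.0
```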

All maximum values have been scaled to 1

Feature Engineering

Feature engineering is the process of transforming data to better represent the underlying problem to the predictive models. It is an iterative process that interplays with data selection and model evaluation, again and again.

The general process of feature engineering is commonly divided between numerical and categorical features.

Examples of feature engineering for numerical features include:

• Feature Generation: feature 1 + feature 2, feature 1 x feature 2, feature 1 / feature 2, etc.

• Decomposing Categorical Attributes: item_color -> is_red, is_blue; gender -> is_male, is_female (one-hot encoding)

• Decomposing a Date-Time: datetime -> hour_of_day; hour -> morning, night

• Reframing Numerical Quantities: weight -> above_70, below_70

• etc.

Tips for doing numerical feature engineering effectively (a short sketch applying a few of these ideas follows the list):

1. Ask the expert

2. Discretization

3. Combinations of 2 features or more

4. Using simple descriptive statistics
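
A minimal sketch under these assumptions: df is the Melbourne Housing DataFrame, and the ‘Rooms’, ‘Bathroom’, ‘Landsize’, and ‘Date’ columns are used only to illustrate the transformations; the specific combinations and bins are not from the article:

```python
import pandas as pd

# Assumes df is the Melbourne Housing DataFrame
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

# Feature generation: combine two existing features
df["RoomsPlusBathroom"] = df["Rooms"] + df["Bathroom"]

# Decomposing a date-time
df["SaleYear"] = df["Date"].dt.year
df["SaleMonth"] = df["Date"].dt.month

# Reframing a numerical quantity (discretization into bands)
df["LandsizeBand"] = pd.cut(
    df["Landsize"],
    bins=[-1, 200, 600, float("inf")],
    labels=["small", "medium", "large"],
)
```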

Next, for handling categorical features, there are several methods called encoding. These are three common encoding techniques, each followed by a short sample.

Label Encoding

• Give every categorical variable a numerical ID.

• Useful for non-linear and tree-based algorithms.

• Does not increase dimensionality.

• Useful for ordinal data types.
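
A minimal label encoding sketch, assuming df is the Melbourne Housing DataFrame; the categorical ‘Type’ column is used only as an example:

```python
from sklearn.preprocessing import LabelEncoder

# Assumes df has a categorical "Type" column (e.g. "h", "t", "u")
encoder = LabelEncoder()
df["Type_label"] = encoder.fit_transform(df["Type"])

# Mapping of category -> numerical ID
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
```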

One-Hot Encoding

• Create new feature for every unique value.

• Memory usage depends on the number of unique categories.

• Similar to dummy encoding, which generates n-1 new columns, while OHE generates n new columns, where n is the count of unique values in the encoded feature.

Binary Encoding

• Variables -> numerical label (label encoding) -> binary number -> split every digit into different columns.

• Useful for features with a large number of unique values. Increases dataset dimensionality only logarithmically.

• Only needs to create about log2(n) new columns, where n is the count of unique values in the encoded feature.
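
The article does not name a library for binary encoding; one option is the third-party category_encoders package (assumed to be installed), sketched here on the high-cardinality ‘Suburb’ column:

```python
import category_encoders as ce

# Assumes df is the Melbourne Housing DataFrame; "Suburb" has many unique
# values, so binary encoding adds only ~log2(n) columns instead of n
encoder = ce.BinaryEncoder(cols=["Suburb"])
df = encoder.fit_transform(df)

print([c for c in df.columns if c.startswith("Suburb")])
```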

The following code shows how to implement one-hot encoding in the pandas Python library via the get_dummies function.
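
A minimal sketch; the ‘Regionname’ column is an assumption for illustration, since the article does not show which column it encodes:

```python
import pandas as pd

# Assumes df is the Melbourne Housing DataFrame with a categorical
# "Regionname" column; get_dummies creates one new column per unique value
dummies = pd.get_dummies(df["Regionname"], prefix="Region")
df = pd.concat([df.drop(columns="Regionname"), dummies], axis=1)

print(dummies.columns.tolist())
```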

Dimensionality Reduction

More input features often make a predictive modeling task more challenging to model, a problem generally referred to as the curse of dimensionality.

Dimensionality reduction techniques are often used for data
visualization. Nevertheless, these techniques can be used in applied
machine learning to simplify a classification or regression
dataset in order to better fit a predictive model.

One of the most popular techniques for dimensionality reduction in machine learning is Principal Component Analysis (PCA).
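
A minimal PCA sketch, assuming X is a DataFrame of numeric features with no missing values; keeping two components is an arbitrary choice for illustration:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to scale, so standardize the features first
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)              # keep the two strongest components
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)   # variance retained by each component
```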

Handling Outliers

Many datasets have outliers that can heavily affect model training results. In Python, outliers can be easily detected using a boxplot visualization.
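
A minimal sketch, assuming df is the Melbourne Housing DataFrame; ‘Landsize’ and ‘BuildingArea’ are the two features inspected below:

```python
import matplotlib.pyplot as plt

# Draw one boxplot per column to spot extreme values
df[["Landsize", "BuildingArea"]].plot(kind="box", subplots=True, figsize=(10, 4))
plt.show()
```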

Both the Landsize and BuildingArea features have outliers

We can adjust the outliers without any additional library using the winsorization method. Outlier values are replaced by certain values called the upper and lower bounds.
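
A minimal winsorization sketch; the 1st and 99th percentiles used as the lower and upper bounds are an assumption, since the article does not state which bounds it uses:

```python
# Assumes df is the Melbourne Housing DataFrame
for col in ["Landsize", "BuildingArea"]:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    # Values below the lower bound are raised to it; values above the
    # upper bound are lowered to it (winsorization via clipping)
    df[col] = df[col].clip(lower=lower, upper=upper)
```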

Those are several common methods for data preparation. Every project is unique and may need a different approach for data pre-processing and cleansing.

References

• https://www.kaggle.com/alexisbcook/missing-values

• https://machinelearningmastery.com/data-preparation-for-machine-learning-7-day-mini-course/

• https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/

• https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

