
When dealing with a machine learning project, real-world data is typically not ready to be used. There might be missing values or incorrect types in the dataset we get. This rawness of the data needs to be dealt with first so that an ML algorithm can be applied to it. This is a common problem that all data-related professionals have to face.

The process of dealing with unclean data and transforming it into a form more appropriate for modeling is called data pre-processing. This step can be considered mandatory in the machine learning process for several reasons, such as:

• data errors: Statistical noise or missing data need to be corrected.

• data types: Most machine learning algorithms require input data in the form of numbers.

• data complexity: Some data might be so complex that an algorithm cannot perform well on it. Complexity can be a reason for overfitting in a model.

While data pre-processing can be different for every case, there are some common tasks that can be used:

• data cleansing

• feature selection

• data scaling

• feature engineering

• dimensionality reduction

We will explore these steps and implement them on a sample dataset using Python libraries.

Data Cleansing: Handling missing values

One of the most common data cleansing processes is dealing with missing values. Basically, there are two ways to handle missing values:

1. Remove rows with missing values

2. Impute missing values

Removing rows is the simplest strategy and easy to execute. In contrast, imputing missing values is more complicated. We can impute values using some rules, such as:

• A constant value that has meaning within the domain and is different from other data, like 0 or -1.

• A measure of central tendency, which is the mean, median, or mode.

• Predictive values estimated from other data.

Even though most ML algorithms require a complete dataset, not all of them fail when there is missing data. There are algorithms that are robust to missing values, like KNN and Naive Bayes, while other algorithms can use missing values as a unique value, like Decision Trees. Nevertheless, the scikit-learn implementations of those algorithms are not robust to missing values.

We are going to use the SimpleImputer class to replace all missing values, marked with NaN, with the mean value of the corresponding column. You can download the dataset here: Melbourne Housing Snapshot.

Four features have missing values. We will work on the features ‘Age’, ‘BuildingArea’, and ‘YearBuilt’.
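
A minimal sketch of the imputation step, assuming the snapshot was saved locally as melb_data.csv; ‘BuildingArea’ and ‘YearBuilt’ are columns of the dataset, and any other column with NaN values can be added to the list:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Assumes the snapshot was downloaded as "melb_data.csv"
df = pd.read_csv("melb_data.csv")

# Columns with missing values to fill (adjust to your dataset)
cols = ["BuildingArea", "YearBuilt"]

# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[cols] = imputer.fit_transform(df[cols])

print(df[cols].isnull().sum())  # all counts should now be 0
```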

Feature Selection

In a nutshell, feature selection means removing irrelevant features. The reasons we need to do this are to:

• reduce complexity

• produce easy to understand model

• reduce computational cost

• prevent overfitting

• improve model performance

These are the feature selection techniques, grouped by their basic algorithm:

credit: machinelearningmastery.com

When using statistics-based feature selection, it is important to choose the method based on the data types of the input and output variables. This is a decision tree to decide which statistics-based method is suitable for our data:

credit: machinelearningmastery.com

We are going to use the RFE method to select the most important features from our dataset. Recursive Feature Elimination (RFE) is popular due to its flexibility and ease of use. It reduces model complexity by removing features one by one until the selected number of features is left.

The scikit-learn Python machine learning library provides an implementation of RFE for machine learning. To use it, the class is first configured with the chosen algorithm specified via the “estimator” argument and the number of features to select via the “n_features_to_select” argument.
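
A minimal sketch, assuming X is a DataFrame of numeric predictors and y the target (e.g. price) already prepared from the dataset; the choice of DecisionTreeRegressor as the estimator is an assumption for illustration:

```python
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Assumes X (numeric features, no missing values) and y (target) exist
rfe = RFE(estimator=DecisionTreeRegressor(), n_features_to_select=6)
rfe.fit(X, y)

# support_ marks the kept features; ranking_ is 1 for every kept feature
for name, kept, rank in zip(X.columns, rfe.support_, rfe.ranking_):
    print(f"{name}: Selected={kept}, Rank={rank}")
```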

The six most relevant features based on RFE are indicated by “Selected=True”

Feature Scaling

Many machine learning algorithms perform better when numerical input variables are scaled. This includes algorithms that use a weighted sum of the input, like linear regression; algorithms that use distance measures, like k-nearest neighbors; and gradient descent-based algorithms.

There are two common methods for scaling: normalization, which rescales values into the range 0 to 1, and standardization, which rescales values to have a mean of 0 and a standard deviation of 1.

For the Melbourne Housing dataset, we are going to implement normalization using the scikit-learn object called MinMaxScaler.
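
A minimal sketch, assuming df is the Melbourne Housing DataFrame with missing values already handled; ‘Landsize’ and ‘BuildingArea’ are used here only as example columns:

```python
from sklearn.preprocessing import MinMaxScaler

# Assumes df is the Melbourne Housing DataFrame, NaNs already imputed
cols = ["Landsize", "BuildingArea"]

# Rescale each column into the range [0, 1]
scaler = MinMaxScaler()
df[cols] = scaler.fit_transform(df[cols])

print(df[cols].max())  # each maximum should now be 1.0
```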

All maximum values have been scaled to 1

Feature Engineering

Feature engineering is the process of transforming data to better represent the underlying problem to the predictive models. It is an iterative process that interplays with data selection and model evaluation, again and again.

The general process of feature engineering is commonly divided between numerical and categorical features.

Examples of feature engineering for numerical features include:

• Feature Generation: feature 1 + feature 2, feature 1 x feature 2, feature 1 / feature 2, etc.

• Decomposing Categorical Attributes: item_color -> is_red, is_blue; gender -> is_male, is_female (one-hot encoding)

• Decomposing a Date-Time: datetime -> hour_of_day; hour -> morning, night

• Reframing Numerical Quantities: weight -> above_70, below_70

• etc.

Tips for doing numerical feature engineering effectively (a short sketch applying a few of these ideas follows the list):

1. Ask the expert

2. Discretization

3. Combinations of 2 features or more

4. Using simple descriptive statistics
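
A minimal sketch under these assumptions: df is the Melbourne Housing DataFrame, and the ‘Rooms’, ‘Bathroom’, ‘Landsize’, and ‘Date’ columns are used only to illustrate the transformations; the specific combinations and bins are not from the article:

```python
import pandas as pd

# Assumes df is the Melbourne Housing DataFrame
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)

# Feature generation: combine two existing features
df["RoomsPlusBathroom"] = df["Rooms"] + df["Bathroom"]

# Decomposing a date-time
df["SaleYear"] = df["Date"].dt.year
df["SaleMonth"] = df["Date"].dt.month

# Reframing a numerical quantity (discretization into bands)
df["LandsizeBand"] = pd.cut(
    df["Landsize"],
    bins=[-1, 200, 600, float("inf")],
    labels=["small", "medium", "large"],
)
```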

Next, for handling categorical features, there are several methods called encoding. These are three common encoding techniques, each followed by a short sample.

Label Encoding

• Give every categorical variable a numerical ID.

• Useful for non-linear and tree-based algorithms.

• Does not increase dimensionality.

• Useful for ordinal data types.
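
A minimal label encoding sketch, assuming df is the Melbourne Housing DataFrame; the categorical ‘Type’ column is used only as an example:

```python
from sklearn.preprocessing import LabelEncoder

# Assumes df has a categorical "Type" column (e.g. "h", "t", "u")
encoder = LabelEncoder()
df["Type_label"] = encoder.fit_transform(df["Type"])

# Mapping of category -> numerical ID
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
```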

One-Hot Encoding

• Create new feature for every unique value.

• Memory usage depends on the number of unique categories.

• Similar to dummy encoding, which generates n-1 new columns, while OHE generates n new columns, where n is the count of unique values in the encoded feature.

Binary Encoding

• Variables -> numerical label (label encoding) -> binary number -> split every digit into different columns.

• Useful for features with a large number of unique values. Increases dataset dimensionality only logarithmically.

• Only needs to create about log2(n) new columns, where n is the count of unique values in the encoded feature.
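
The article does not name a library for binary encoding; one option is the third-party category_encoders package (assumed to be installed), sketched here on the high-cardinality ‘Suburb’ column:

```python
import category_encoders as ce

# Assumes df is the Melbourne Housing DataFrame; "Suburb" has many unique
# values, so binary encoding adds only ~log2(n) columns instead of n
encoder = ce.BinaryEncoder(cols=["Suburb"])
df = encoder.fit_transform(df)

print([c for c in df.columns if c.startswith("Suburb")])
```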

The following code shows how to implement one-hot encoding in the pandas Python library via the get_dummies function.
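
A minimal sketch; the ‘Regionname’ column is an assumption for illustration, since the article does not show which column it encodes:

```python
import pandas as pd

# Assumes df is the Melbourne Housing DataFrame with a categorical
# "Regionname" column; get_dummies creates one new column per unique value
dummies = pd.get_dummies(df["Regionname"], prefix="Region")
df = pd.concat([df.drop(columns="Regionname"), dummies], axis=1)

print(dummies.columns.tolist())
```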

Dimensionality Reduction

More input features often make a predictive modeling task more challenging to model, a problem generally referred to as the curse of dimensionality.

Dimensionality reduction techniques are often used for data
visualization. Nevertheless, these techniques can be used in applied
machine learning to simplify a classification or regression
dataset in order to better fit a predictive model.

One of the most popular techniques for dimensionality reduction in machine learning is Principal Component Analysis (PCA).
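
A minimal PCA sketch, assuming X is a DataFrame of numeric features with no missing values; keeping two components is an arbitrary choice for illustration:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is sensitive to scale, so standardize the features first
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)              # keep the two strongest components
X_reduced = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)   # variance retained by each component
```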

Handling Outliers

Many datasets have outliers that can heavily affect model training results. In Python, outliers can be easily detected using a boxplot visualization.
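
A minimal sketch, assuming df is the Melbourne Housing DataFrame; ‘Landsize’ and ‘BuildingArea’ are the two features inspected below:

```python
import matplotlib.pyplot as plt

# Draw one boxplot per column to spot extreme values
df[["Landsize", "BuildingArea"]].plot(kind="box", subplots=True, figsize=(10, 4))
plt.show()
```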

Both the Landsize and BuildingArea features have outliers

We can adjust the outliers without any additional library using the winsorization method. Outlier values are replaced by certain values called the upper and lower bounds.
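
A minimal winsorization sketch; the 1st and 99th percentiles used as the lower and upper bounds are an assumption, since the article does not state which bounds it uses:

```python
# Assumes df is the Melbourne Housing DataFrame
for col in ["Landsize", "BuildingArea"]:
    lower = df[col].quantile(0.01)
    upper = df[col].quantile(0.99)
    # Values below the lower bound are raised to it; values above the
    # upper bound are lowered to it (winsorization via clipping)
    df[col] = df[col].clip(lower=lower, upper=upper)
```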

Those are several common methods for data preparation. Every project is unique and may need a different approach for data pre-processing and cleansing.

References

• https://www.kaggle.com/alexisbcook/missing-values

• https://machinelearningmastery.com/data-preparation-for-machine-learning-7-day-mini-course/

• https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/

• https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd

