
The Complete Guide to Data

Preprocessing (Part 1)

https://pub.towardsai.net/the-complete-guide-to-data-preprocessing-3e0092b74016

Dr. Roi Yehoshua
Published in Towards AI

Data preprocessing is the process of cleaning, transforming, and


organizing your data set in order to prepare it for data analysis and
modeling. It aims to improve the quality, integrity, and reliability of the
data, and addresses issues such as missing values, noisy data, outliers,
and incompatible data formats.

“Garbage in, garbage out” is a well-known phrase in data science, which expresses the idea that the quality of a model's results is determined by the quality of its inputs. The more informative and less noisy your data is, the better the model will be able to learn the underlying patterns or relationships in the data and generalize to new, unseen data.

Data preprocessing is often the most important phase of a machine


learning project, and also the one that takes the longest time to
complete.

This article discusses the main steps involved in data preprocessing,


when to apply each one, and the classes in Scikit-Learn that can help
you implement them.

Data Preprocessing Tasks


The main tasks involved in data preprocessing are:

1. Data cleaning

2. Handling missing data

3. Encoding categorical data

4. Detecting and handling outliers

5. Handling skewed data

6. Discretization

7. Scaling and normalization


Feature selection and extraction are considered separate steps from
data preprocessing, although there can be some overlap between them.

In this part of the article, we will focus on steps 1–5, and in the second
part we will discuss steps 6–7.

The order of the steps outlined above may change according to the
model requirements and characteristics of the data set. Nonetheless,
some of the steps depend on each other. For example, the data should
be normalized only after it is cleaned and the missing values and
outliers have been handled.

In addition, some of the steps may be iterative, i.e., they might be


performed multiple times throughout the data preprocessing process.
For example, after detecting outliers (step 4), you might find out that
some of them are actually errors and then go back to data cleaning
(step 1) to fix them.

In supervised learning tasks, data preprocessing is usually performed only on the input features and not on the target labels. Changing the labels in the data set should be done with the utmost caution, as it impacts the labels that will be generated by the model for unseen data.

The main modules in Scikit-Learn that are used for data preprocessing
are:
• sklearn.preprocessing provides various transformers for
scaling, normalization, encoding features, and
discretization.

• sklearn.impute provides transformers for imputing


missing values.

Data Cleaning
Data cleaning (or cleansing) involves correcting or removing incorrect,
inaccurate, inconsistent, irrelevant, or duplicate data from the data set.
These issues can arise from various sources, such as:

• Data entry errors (e.g., an invalid postal code,


typographical errors)

• Out-of-range values (e.g., a negative product price)

• Corruption in transmitting or storage of the data

• Merging ambiguous data from different sources (e.g., the


same customer was stored in two systems with two
different addresses)

• Using inconsistent formats for dates, phone numbers,


names of states, etc.

• Using inconsistent unit measures (e.g., using both


centimeters and feet to measure length)
• Including features that are irrelevant to the analysis, such
as user id

Data cleaning is performed using a combination of manual correction


operations with automatic processing tools, and often requires domain
expertise in order to identify and resolve the inaccuracies and
inconsistencies in the data.

Handling Missing Values


Missing data is one of the most common issues in real-world data sets.
It can occur due to various reasons, such as data entry errors, null
values in a database, private information, etc. In Python, missing
values are typically represented by NaN (Not a Number) or None
values.

Many machine learning algorithms cannot deal with missing values (some implementations of k-NN, Naive Bayes, and decision trees are exceptions), thus this issue needs to be resolved during the data preparation phase.

Common approaches for dealing with missing data include:

1. Remove the samples with missing values. This option is


recommended only if there is a small number of such
samples.

2. Remove features that have a high percentage of missing


values.
3. Impute the missing values, i.e., replace them with some
appropriate fill value, such as the mean or median of the
corresponding feature.

Scikit-Learn provides three types of imputers:

1. SimpleImputer imputes the missing values using the


statistics (e.g., mean, median, or mode) of the feature with
the missing values or using a constant value.
Its important parameters are:

• missing_values — which values are considered to be


missing values (defaults to np.nan).

• strategy — the statistic to use for the imputation. The


options are “mean” (the default), “median”,
“most_frequent” and “constant”. For categorical features,
only the options “most_frequent” and “constant” can be
used.

• fill_value — which constant to use for replacing the


missing values (when the chosen strategy is “constant”).

For example, imputing missing values using the mean of each feature:

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X = [[np.nan, 5, np.nan], [2, 4, 10], [3, np.nan, 5]]
imputer.fit_transform(X)

array([[ 2.5,  5. ,  7.5],
       [ 2. ,  4. , 10. ],
       [ 3. ,  4.5,  5. ]])
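SimpleImputer also works for categorical features, as long as the strategy is 'most_frequent' or 'constant'. A minimal sketch (the category values below are made up for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

# Replace each missing category with the most frequent category in its column
cat_imputer = SimpleImputer(strategy='most_frequent')
X_cat = [['Low', 'BA'], ['High', np.nan], [np.nan, 'BA'], ['Low', 'PhD']]
cat_imputer.fit_transform(X_cat)
# -> the missing entries become 'Low' and 'BA', the modes of their columns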

2. IterativeImputer models each feature with missing values as a


function of the other features, in a round-robin fashion. In each
iteration, one of the features with missing values is designated as the
output y, and the other features are treated as the inputs X. Then, a
regression model is trained on (X, y), and used to predict the missing
values of y. This process is repeated for max_iter imputation rounds.

Important parameters of this transformer:

• estimator — the estimator to use for the imputation (the


default is BayesianRidge).

• max_iter — maximum number of imputation rounds


(defaults to 10).

• initial_strategy — which strategy to use to initialize the


missing values (same as the strategy parameter in
SimpleImputer).

• imputation_order — the order in which the features will


be imputed. Defaults to ‘ascending’, i.e., from features
with the fewest missing values to the most.
Since this transformer is still experimental, before using it you need to
explicitly import enable_iterative_imputer:

import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(max_iter=10)
X = [[1, 2], [2, 4], [4, 8], [np.nan, 3], [5, np.nan]]
imputer.fit_transform(X)

array([[ 1.        ,  2.        ],
       [ 2.        ,  4.        ],
       [ 4.        ,  8.        ],
       [ 1.50000846,  3.        ],
       [ 5.        , 10.00000145]])

We can see that the imputer has learned that the second feature is equal to twice the first one.

3. KNNImputer imputes the missing values by using the mean value of


the k-nearest neighbors that have a value for the missing feature.

Important parameters of this transformer:

• n_neighbors — the number of neighbors to use for the


imputation (defaults to 5)

• weights — whether to weight the neighbors uniformly (the


default) or by the inverse of their distance.
• metric — the metric to use for computing the distances. Possible values are ‘nan_euclidean’ (a Euclidean distance metric that supports missing values) or a custom function.

The following example replaces the missing values with the mean
feature value of the two nearest neighbors:

import numpy as np
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)
X = [[1, 2, np.nan], [3, 2, 3], [6, np.nan, 5], [7, 8, 10]]
imputer.fit_transform(X)

array([[ 1.,  2.,  4.],
       [ 3.,  2.,  3.],
       [ 6.,  5.,  5.],
       [ 7.,  8., 10.]])

In general, the simple imputer performs worse than the more complex
imputers on weak models, while it works as well as or better than them
on powerful models.

Encoding Categorical Data


Most machine learning models cannot handle categorical features
directly, thus these features need to be converted into a numerical
format. There are three main approaches for encoding categorical data:
1. Ordinal encoding assigns a unique integer value to
each category based on the order or ranking of the
categories. For example, a categorical feature of
“IncomeLevel” with three categories: Low, Medium and
High, could be encoded as 0, 1, and 2, respectively.

In Scikit-Learn, you can use the OrdinalEncoder class to perform this


type of encoding. Its important parameters are:

• categories — a list of categories sorted according to the


order in which to assign them the integers. Defaults to
‘auto’, i.e., the categories are automatically determined
from the training set.

• handle_unknown — how to handle an unknown category


that was not encountered in the training set. The options
are ‘error’ for raising an error (the default), or
‘use_encoded_value’ for setting the value of the unknown
category to the one specified in the
parameter unknown_value.

• unknown_value — the encoded value of unknown


categories.

• encoded_missing_value — the encoded value of missing


categories (defaults to np.nan).
For example, let’s encode the following data set that contains two
categorical features using ordinal encoding:

from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()

X = [['LowIncome', 'BA'], ['HighIncome', 'PhD'], ['MediumIncome', 'BA']]


encoder.fit_transform(X)

array([[1., 0.],
[0., 1.],
[2., 0.]])

In this case, the integers assigned to the categories were determined by


the alphabetical order of their names (e.g., ‘HighIncome’ <
‘LowIncome’ < ‘MediumIncome’).
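If the categories do have a natural order that differs from the alphabetical one, you can pass it explicitly through the categories parameter. A minimal sketch for the same data set (the chosen ordering is an assumption about the intended ranking):

from sklearn.preprocessing import OrdinalEncoder

# One list of categories per feature, in the desired order
encoder = OrdinalEncoder(categories=[
    ['LowIncome', 'MediumIncome', 'HighIncome'],
    ['BA', 'PhD']
])
X = [['LowIncome', 'BA'], ['HighIncome', 'PhD'], ['MediumIncome', 'BA']]
encoder.fit_transform(X)
# -> LowIncome=0, MediumIncome=1, HighIncome=2 and BA=0, PhD=1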

Pros of ordinal encoding:

• Preserves the ordinal relationship between the categories.

• Does not add dimensions to the data set.

Cons:

• When there is no inherent order among the categories, the


machine learning model would interpret the categories as
ordered.
• The model’s prediction may be influenced by the magnitude of the integers assigned to the categories. For example, in linear regression, these integers are multiplied by the weight associated with the categorical feature, thus an integer value of 2 will have twice the impact on the model’s prediction as an integer value of 1.

2. One-hot encoding converts a categorical variable


with n categories into n binary features, with one of them 1, and all the
others 0. For example, one-hot encoding of the “IncomeLevel” variable
would create three binary features: “LowIncome”, “MediumIncome”,
and “HighIncome”. For a sample that belongs to the “HighIncome”
category, the “HighIncome” feature would be 1, and the other features
would be 0.

In Scikit-Learn, you can use the transformer OneHotEncoder to


perform one-hot encoding. Its important parameters are:

• categories — a list of categories sorted according to the


order in which to assign them the binary features.
Defaults to ‘auto’, i.e., the categories are automatically
determined from the training set.

• drop — specifies whether to drop one of the categories per


feature (in which case this category will be encoded as an
all-zeros vector). The options are ‘first’ to drop the first
category in each feature, ‘if_binary’ to drop the first
category only in features with two categories, or None to
retain all the features (the default).

• max_categories — specifies an upper limit to the number


of output features for each categorical feature. All the
infrequent categories will be aggregated into a single
output.

• handle_unknown — specifies how to handle unknown


categories during the transform. It can be one of the
following options:

• ‘error’: raise an error if an unknown category is encountered


(the default).

• ‘ignore’: encode the unknown category as an all-zeros vector.

• ‘infrequent_if_exist’: map the unknown category to the infrequent category if it exists. If infrequent category support was not configured (by specifying the max_categories or min_frequency parameters), or no infrequent category was found in the training set, the unknown category will be encoded as an all-zeros vector. If the test set might contain categories that were not seen in the training set, it is better to use this option.

• sparse_output — returns a sparse matrix if True (the


default), else returns an array.
For example, let’s encode the same data set from the previous example
using one-hot encoding:

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

X = [['LowIncome', 'BA'], ['HighIncome', 'PhD'], ['MediumIncome', 'BA']]


encoder.fit_transform(X).toarray()

array([[0., 1., 0., 1., 0.],


[1., 0., 0., 0., 1.],
[0., 0., 1., 1., 0.]])

In the transformed matrix, the first three columns are the encoding of
the first feature with categories
LowIncome/MediumIncome/HighIncome, and the last two columns
are the encoding of the second feature with categories BA/PhD.

Note that the encoder by default returns a SciPy sparse matrix (where
only nonzero values are stored), thus in order to display it we had to
convert it to a dense NumPy array by calling the toarray() method.
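To see which output column corresponds to which category, recent versions of Scikit-Learn let you call the encoder's get_feature_names_out() method (a minimal sketch continuing the example above):

# Names of the one-hot encoded columns, in the same order as in the output matrix
encoder.get_feature_names_out()
# -> ['x0_HighIncome', 'x0_LowIncome', 'x0_MediumIncome', 'x1_BA', 'x1_PhD']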

Pros of one-hot encoding:

• Does not impose any order among the categories.

• Allows the model to learn independent mappings between


each category and the target label. For example, in linear
regression, a separate coefficient will be associated with
each binary feature.

Cons:

• The feature space can blow up quickly if the categorical


variables have a large number of unique categories. In this
case, dimensionality reduction techniques can be used to
reduce the number of dimensions of the transformed
feature matrix.

3. Hash encoding applies a hash function to the categories and


converts them into a fixed number of dimensions. It is more memory
efficient than one-hot encoding, but different categories may be
mapped to the same hash value. In Scikit-Learn, you can use the
transformer FeatureHasher to perform this type of encoding.
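Note that FeatureHasher lives in the sklearn.feature_extraction module rather than sklearn.preprocessing. A minimal sketch, assuming each sample is passed as a list of category strings:

from sklearn.feature_extraction import FeatureHasher

# Hash every category string into one of n_features output columns
hasher = FeatureHasher(n_features=8, input_type='string')
X = [['LowIncome', 'BA'], ['HighIncome', 'PhD'], ['MediumIncome', 'BA']]
hasher.transform(X).toarray()

Since the hasher is stateless, there is nothing to fit; collisions (two categories landing in the same column) become more likely as n_features decreases.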

Additional category encoders are available in the package category-


encoders, which is part of scikit-learn-contrib (a collection of high-
quality scikit-learn compatible utilities).

Detecting and Handling Outliers


Outliers are data points that significantly deviate from the majority of
the data. They can be caused by data entry errors or measurement
errors, but they can also represent real anomalous observations.
There are various methods to detect outliers, such as:

1. Using statistical measures such as the z-score, which represents the number of standard deviations a data point is away from the mean:

z = (x − μ) / σ

where μ is the mean of the data points and σ is their standard deviation. Data points whose z-score is above or below a specified threshold (e.g., greater than 3 or less than −3) are considered outliers (see the sketch after this list).

2. Using plots such as percentile plots or box plots. A box plot visually displays the quartiles, and any data points outside a specified range (e.g., more than 1.5 times the IQR above the upper quartile or below the lower quartile) are considered outliers.

A box plot (image by author)

3. Density-based clustering methods, such as DBSCAN, can identify


outliers based on the density of the data points.

4. Isolation forest is an ensemble-based approach for anomaly


detection.
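A minimal sketch of methods 1, 2, and 4 using NumPy and Scikit-Learn (the data and thresholds are illustrative; 3 standard deviations and 1.5 × IQR are the conventional choices mentioned above):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(loc=50, scale=5, size=1000), [120, -10]])  # two injected outliers

# 1. z-score method: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])

# 2. IQR method (the rule that box plots visualize)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])

# 4. Isolation forest: fit_predict returns -1 for predicted anomalies
labels = IsolationForest(random_state=0).fit_predict(x.reshape(-1, 1))
print(x[labels == -1])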

There are also different ways to handle outliers, depending on the


nature and extent of the outliers:

1. Remove the outliers. This is the simplest approach, but it


should be done judiciously, as it can potentially lead to
information loss.
2. Treat the outliers as missing values and then use one of
the aforementioned imputation methods to replace them.

3. Capping sets a predefined threshold for extreme values. Any data point that exceeds the threshold is replaced with the threshold value (see the sketch after this list).

4. Winsorization sets all the outliers to a specified percentile


of the data. For example, a 90% winsorization replaces all
the data points above the 95th percentile with the 95th
percentile, and all the points below the 5th percentile with
the 5th percentile. This approach limits the impact of
outliers without completely removing them.

5. Use discretization to group the data points into bins, and


assign the outliers to a separate bin or to the nearest bin.
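And a minimal sketch of the capping and winsorization options described above (the threshold and percentile limits are illustrative):

import numpy as np
from scipy.stats.mstats import winsorize

x = np.array([1, 2, 2, 3, 3, 4, 4, 5, 60, 80], dtype=float)

# Capping: clip every value above a predefined threshold down to the threshold
capped = np.clip(x, a_min=None, a_max=10)

# 80% winsorization: the bottom 10% and top 10% of the values are replaced
# by the closest values that are kept
winsorized = winsorize(x, limits=(0.1, 0.1))

print(capped)      # the extreme values 60 and 80 become 10
print(winsorized)  # 1 -> 2 and 80 -> 60 (the nearest kept values)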

Handling Skewed Data


Skewed data is data that is not symmetrically distributed around the
mean and has a long tail toward one direction. Skewness can impact
the performance of some machine learning models, which assume a
symmetric or even normal distribution of the data (e.g., Gaussian
Naive Bayes assumes that the features are normally distributed).

There are several methods to deal with skewed data:


1. Logarithmic transformation: taking the logarithm can help reduce right skewness, since it compresses larger values while maintaining the order of the data (see the sketch after this list).

2. Exponential transformation: exponentiating the data (y = eˣ) or raising it to a power greater than 1 (e.g., y = x²) can help reduce left skewness and make the distribution more symmetric.

3. Winsorization: as in outlier handling, winsorization can


handle skewness by replacing extreme values with values
at a specified percentile.

4. Power transformations involve raising the data to a


power, which is determined through maximum likelihood
estimation. They can transform the data to a more
symmetric and approximately normal distribution.
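As a quick illustration of the logarithmic transformation (item 1 above), here is a minimal sketch that generates right-skewed data and compresses it with np.log1p, i.e., log(1 + x), which also handles zeros:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x = rng.lognormal(size=500)      # right-skewed data

x_log = np.log1p(x)              # compresses the long right tail

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(x, bins=50)
axes[0].set_title('Original (right-skewed)')
axes[1].hist(x_log, bins=50)
axes[1].set_title('After log1p')
plt.show()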

For power transformations, you can use the class PowerTransformer, which currently provides two transformations:

1. The Box-Cox transform, which works only with strictly positive values:

x_new = (x^λ − 1) / λ   if λ ≠ 0
x_new = ln(x)           if λ = 0

2. The Yeo-Johnson transform, which works with any real value:

x_new = ((x + 1)^λ − 1) / λ                 if λ ≠ 0, x ≥ 0
x_new = ln(x + 1)                           if λ = 0, x ≥ 0
x_new = −((−x + 1)^(2−λ) − 1) / (2 − λ)     if λ ≠ 2, x < 0
x_new = −ln(−x + 1)                         if λ = 2, x < 0

In both cases, the power parameter λ is estimated by maximum likelihood.

The following example uses the Box-Cox transformation to map
samples drawn from a lognormal distribution into a normal
distribution. The data before the transformation is:

import numpy as np
import matplotlib.pyplot as plt

X = np.random.RandomState(0).lognormal(size=500)
plt.hist(X, bins=50)

And the data after the transformation is:

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer('box-cox')

X_new = pt.fit_transform(X.reshape(-1, 1))


plt.hist(X_new, bins=50)
The log normal distribution after a power transformation

Note that power transformations are not effective with every type
of distribution. For example, they do not work well with uniform
or bimodal distributions.
The Complete Guide to Data
Preprocessing (Part 2)

https://pub.towardsai.net/the-complete-guide-to-data-preprocessing-part-2-96cbcd1b6d90

In the first part of this article, we described the data


preprocessing process and showed how to handle missing values,
categorical data, outliers and skewed data. In this part of the
article, we will describe the discretization and normalization
activities, and then demonstrate the entire process on a sample
data set.

Discretization
Discretization transforms a continuous-valued feature into a discrete
one by partitioning the range of its values into a set of intervals or bins.
The two main methods for discretization are:

1. Equal-width binning: the range of the variable is divided into equal-width bins. For example, if the range of the variable is 0–20 and we want 5 bins, then each bin will cover a range of 4 units (0–4, 4–8, 8–12, 12–16, 16–20).

2. Equal frequency: each bin contains the same number of


data points.
The discretized values are usually one-hot encoded. For example, if the
number of bins is 5, the result of the discretization will be 5 new binary
features, where each feature indicates whether the given sample
belongs to the corresponding bin.

Use cases for discretization:

1. Some machine learning algorithms cannot handle


continuous values directly, such as some variants of Naive
Bayes and the Apriori algorithm for association rule
mining.

2. Discretization can make the model more expressive since


it allows the model to find a mapping between each
interval and the target label. For example, imagine that we
need to predict the price of a house given its location,
represented by its latitude and longitude. If we use a
linear regression model, it can only find a linear
correlation between the exact location of the house and its
price. However, if we discretize the latitude and longitude
into 10 bins each, the model can find a linear correlation
between each one of the 100 areas and the price of the
house.

3. Handle outliers or extreme values by placing them in their


own category.
The drawback of discretization is that it can lead to a loss of
information, and may introduce bias if the number of bins is too small
or their edges are not properly chosen.

In Scikit-Learn, you can use KBinsDiscretizer to perform


discretization. Its important parameters are:

• n_bins defines the number of bins (defaults to 5).

• encode specifies the method used to encode the


discretized result. The options are:

• ‘onehot’ (default) — encode the discretized values with


one-hot encoding and return a sparse matrix

• ‘onehot-dense’ — encode the discretized values with one-


hot encoding and return a dense array

• ‘ordinal’ — return the bin identifier of the sample as an


integer value

• strategy specifies how to define the widths of the bins, can


be one of the following options:

• ‘uniform’: all the bins have the same width.

• ‘quantile’ (the default): all the bins have the same number
of points.
• ‘kmeans’: the bins are determined by running a K-Means
clustering on the data points.

Example for discretizing three continuous features into three equal-


width bins:

from sklearn.preprocessing import KBinsDiscretizer

discretizer = KBinsDiscretizer(n_bins=3, strategy='uniform',

encode='ordinal')

X = [[-1, 2, 3], [0.5, 6, 10], [0, 1, 8], [0.2, 3, 15]]

discretizer.fit_transform(X)

array([[0., 0., 0.],

[2., 2., 1.],

[2., 0., 1.],

[2., 1., 2.]])

The bin_edges_ attribute of the discretizer contains the edges of the


bins:

discretizer.bin_edges_

array([array([-1. , -0.5, 0. , 0.5]),

array([1. , 2.66666667, 4.33333333, 6. ]),

array([ 3., 7., 11., 15.])], dtype=object)


You can read more about discretization in this article.

Scaling and Normalization


Many machine learning algorithms do not perform well when the
features have different scales. These include distance-based algorithms
such as KNN and KMeans (since the distances are dominated by
features with larger ranges), and algorithms that use gradient descent
for optimization, such as neural networks (as different ranges induce
different step sizes for each feature).

The goal of feature scaling is to bring all the features to a common scale
or range. The most common approaches for feature scaling are:

1. Min-max scaling scales all the features to the same range [min, max], where the typical range is [0, 1]. For the range [0, 1], this transformation can be expressed as:

x_scaled = (x − x_min) / (x_max − x_min)
In Scikit-Learn, this transformation is performed by a MinMaxScaler.


For example:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X = [[-1, 2, 3], [0.5, 6, 10], [0, 1, 8]]

scaler.fit_transform(X)

array([[0. , 0.2 , 0. ],

[1. , 1. , 1. ],

[0.66666667, 0. , 0.71428571]])

Pros:

• Brings all the features to the same range.

Cons:

• Sensitive to outliers. Since the range of the data is


determined by the minimum and the maximum, outliers
can cause the scaling to compress the majority of the data
into a small range.

2. Standardization (also known as z-score normalization) subtracts from each feature its mean and scales it to unit variance. The z-score of a sample x is calculated as:

z = (x − μ) / σ

where μ is the mean of the data points and σ is their standard deviation.
In Scikit-Learn, standardization is performed by a StandardScaler. For
example:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = [[-1, 2, 3], [0.5, 6, 10], [0, 1, 8]]

scaler.fit_transform(X)

array([[-1.33630621, -0.46291005, -1.35873244],

[ 1.06904497, 1.38873015, 1.01904933],

[ 0.26726124, -0.9258201 , 0.33968311]])

Pros:

• Transforms the features to have 0 mean and unit


variance, which is useful for algorithms that assume
standardized features (e.g., PCA assumes that the features
are centered around 0).

Cons:

• The transformed features may have different ranges.

• Sensitive to outliers, since these can significantly impact


the mean and standard deviation (but it is less sensitive to
outliers than min-max scaling).
3. Robust scaling — similar to standardization, but uses statistics that are more robust to outliers: from each feature, it subtracts its median and divides it by its interquartile range (IQR, the range between the first quartile and the third quartile). Mathematically, the transformation can be written as follows:

x_scaled = (x − median(x)) / IQR(x)
In Scikit-Learn, this transformation is performed by


a RobustScaler. For example:

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X = [[-1, 2, 3], [0.5, 6, 10], [0, 1, 8]]
scaler.fit_transform(X)

array([[-1.33333333, 0. , -1.42857143],
[ 0.66666667, 1.6 , 0.57142857],
[ 0. , -0.4 , 0. ]])

Pros:

• Less affected by outliers than standard scaling.

Cons:

• Does not normalize the data to have 0 mean and unit


variance.
• The transformed features may not have an intuitive
interpretation.

Note that all three scalers preserve the shape of the original
distribution of the feature, since they are all linear transformations
(i.e., transformations of the form f(x) = ax + b).

Example: The Titanic Data Set


We will now demonstrate the entire data preprocessing process on the
titanic data set, available from Scikit-Learn. This data set describes the
survival status of passengers on the Titanic. It contains 1,309 rows and
has the following 14 features (including the label):

• pclass — the passenger class (1 = 1st, 2 = 2nd, 3 = 3rd),


indicates a socio-economic status (1st ~ Upper, 2nd ~
Middle, 3rd ~ Lower)

• name — name of the passenger

• sex — male or female

• age — age in years (can be a fraction if age is less than 1)

• sibsp — number of siblings/spouses aboard

• parch — number of parents/children aboard


• ticket — ticket number

• fare — passenger’s fare (in British pounds)

• cabin — cabin number

• embarked — port of embarkation (C = Cherbourg, Q =


Queenstown, S = Southampton)

• boat — the lifeboat number (if the passenger survived)

• body — the body identification number (if the passenger did not survive and the body was recovered)

• home.dest — home/destination address

• survived — the target label (0 = No, 1 = Yes)

Loading the Data Set

We first import the required libraries and classes:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_openml


from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
Let’s also fix the random seed in order to have reproducible
results:

np.random.seed(0)

Next, we fetch the data set using the fetch_openml() function:

X, y = fetch_openml('titanic', version=1, return_X_y=True,


as_frame=True)

Let’s examine the first rows of the data using the


DataFrame’s head() method:

X.head()

Let’s also check the data types of the features and whether there
are any missing values:

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null float64
1 name 1309 non-null object
2 sex 1309 non-null category
3 age 1046 non-null float64
4 sibsp 1309 non-null float64
5 parch 1309 non-null float64
6 ticket 1309 non-null object
7 fare 1308 non-null float64
8 cabin 295 non-null object
9 embarked 1307 non-null category
10 boat 486 non-null object
11 body 121 non-null float64
12 home.dest 745 non-null object
dtypes: category(2), float64(6), object(5)
memory usage: 115.4+ KB

We can see that out of the 13 features, 6 of them have a ‘float64’ data
type and 7 have a categorical type (‘category’ or ‘object’). However, the
data type by itself is not enough to indicate whether a feature is
numerical or not. For example, the ‘pclass’ feature, despite having a
float64 type, is in fact an ordinal feature with discrete values (1.0, 2.0
or 3.0).

We can also observe that seven of the features contain missing values: ‘age’, ‘fare’, ‘cabin’, ‘embarked’, ‘boat’, ‘body’, and ‘home.dest’ (‘fare’ and ‘embarked’ have only one and two missing values, respectively).

Data Cleaning

Before we move to more advanced data exploration, let’s remove the


following features, which are not relevant to the prediction task:

• ‘name’ and ‘ticket’ are unique per passenger.

• ‘cabin’, ‘boat’, and ‘body’ contain a high percentage of


missing values. In addition, ‘boat’ and ‘body’ are really
part of the target, since we know that any passenger with a
lifeboat number survived and any passenger with a body
identification number did not survive.
• ‘home.dest’ contains a high percentage of unique values
and does not seem to be relevant to the survival of the
passenger.

We can easily drop these columns by calling the drop() method of the DataFrame:

X.drop(['name', 'ticket', 'cabin', 'boat', 'body', 'home.dest'], axis=1,
       inplace=True)

Let’s examine the data set after dropping these columns:

X.head()

Exploratory Data Analysis (EDA)

We can now do more advanced data exploration. First, let’s check the
correlations between the features, including the label:
# Merge the features and the label into one DataFrame
df = pd.concat([X, y.astype('float')], axis=1)

plt.figure(figsize=(6, 4))
# numeric_only=True skips the non-numeric columns ('sex' and 'embarked')
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

The correlations between the features

We can see that the features ‘sibsp’ and ‘parch’ are weakly correlated
with the target, which suggests that some feature engineering may be
needed to extract more useful information from them (e.g., combine
them into a single feature of ‘family_size’).
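A minimal sketch of this suggested feature engineering step (the 'family_size' column is an illustration and is not used in the rest of this example):

# Combine siblings/spouses and parents/children into a single family-size feature
# (+1 counts the passenger themselves)
X['family_size'] = X['sibsp'] + X['parch'] + 1

If you add such a feature, remember to include it in the list of numerical features that the preprocessing pipeline scales below.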

Let’s also find if there are any outliers in the data by drawing a box
plot:
sns.boxplot(data=X)
We can see that there are many outliers in the ‘fare’ column. This
suggests that using a robust scaler to normalize the data would be a
better choice than a standard scaler.

Finally, since this is a classification problem, let’s check if it is balanced


in terms of the class distribution:
y.value_counts() / y.value_counts().sum()

0 0.618029
1 0.381971
Name: survived, dtype: float64

The classes are fairly balanced (61.8% of the passengers did not
survive, and 38.2% survived).

Data Preprocessing
Let’s identify the tasks we need to perform in order to prepare the data
set for modeling:

1. Impute the missing values in the columns ‘age’ (263


missing values), ‘fare’ (one missing value), and ‘embarked’
(two missing values). Note that ‘embarked’ is a categorical
variable, therefore it requires a different imputation
strategy. We will use a SimpleImputer with
strategy=’most_frequent’ for the categorical feature, and a
KNNImputer (with k = 5) for the numerical features.

2. Encode the categorical features ‘pclass’, ‘sex’ and


‘embarked’ using one-hot encoding.

3. Scale the numerical features ‘age’, ‘sibsp’, ‘parch’ and ‘fare’


using a robust scaler.

Notice that we need to apply different transformations on the


categorical and the numerical features. The basic pipeline in Scikit-
Learn does not allow you to apply a transformer to only a subset of the
features (you can read more about Scikit-Learn pipelines in this
article). However, you can combine it with another class
called ColumnTransformer, which allows different subsets of features
to be transformed separately, and then it concatenates them to form a
single feature set.

We first define a pipeline to transform the categorical features:


cat_features = ['pclass', 'sex', 'embarked']

cat_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])

Next, we define a pipeline to transform the numerical features:

num_features = ['age', 'sibsp', 'parch', 'fare']

num_transformer = Pipeline([
('imputer', KNNImputer(n_neighbors=5)),
('scaler', RobustScaler())
])

We now combine the two pipelines using a ColumnTransformer,


which associates each pipeline with its corresponding set of
features:

preprocessor = ColumnTransformer([
('num', num_transformer, num_features),
('cat', cat_transformer, cat_features)
])

Lastly, we build a pipeline that combines the column transformer


and our classification model. In this example, we will use a
random forest classifier with its default settings:

model = Pipeline([
('pre', preprocessor),
('clf', RandomForestClassifier())
])

Train-Test Split

Before training the model, we split the data set into 80% training and
20% test sets:

X_train, X_test, y_train, y_test = train_test_split(X, y,


test_size=0.2, random_state=0)
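Since this is a classification problem, you could also pass stratify=y to keep the class proportions identical in both splits (a minor, optional variation on the split above):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)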
Training the Model

We can now simply fit the model to the training set:


model.fit(X_train, y_train)

Model Evaluation

Let’s now evaluate the model both on the training and the test sets:
train_acc = model.score(X_train, y_train)
print(f'Train accuracy: {train_acc:.4f}')

test_acc = model.score(X_test, y_test)


print(f'Test accuracy: {test_acc:.4f}')

Train accuracy: 0.9713


Test accuracy: 0.7977

It seems that the model is overfitting the training set to some


degree. At this point, you might want to experiment with different
transformers and imputers, try to extract new features from the
data (as suggested above), and tune the hyperparameters of the
model.
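For instance, here is a sketch of hyperparameter tuning with GridSearchCV over the whole pipeline (the parameter grid below is only an illustration; the 'clf__' prefix routes each parameter to the random forest step of the pipeline):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'clf__n_estimators': [100, 300],
    'clf__max_depth': [None, 5, 10],
    'clf__min_samples_leaf': [1, 3, 5],
}

search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)

print(search.best_params_)
print(f'Test accuracy: {search.score(X_test, y_test):.4f}')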

Final Notes
You can find the code examples of this article on my
github: https://github.com/roiyeho/medium/tree/main/data_preproc
essing

Thanks for reading!


Discretization and when to use
it

https://medium.com/@roiyeho/discretization-and-when-to-use-it-649db24e59d1

Discretization is an operation that transforms a continuous-valued


feature into a discrete one. Many data scientists are not aware of the
power of this transformation and how it can boost the performance of
their models on certain data sets.

This article explains what discretization is, when to use it, and how to apply it to your own data sets using Scikit-Learn.

Discretization Definition

Mathematically speaking, discretization takes a feature whose values lie in the range [x₀, xₙ] and splits it into a set of n intervals (bins):

[x₀, x₁), [x₁, x₂), …, [xₙ₋₁, xₙ]
There are two main approaches for discretization:

1. Equal width: all the bins have the same width.


2. Equal frequency (equal depth): all the bins have the
same number of points.

For example, let’s say that we have an age feature with the following
values: 1, 3, 4, 7, 11, 12, 15, 17, 24, 31, 35, 36, 40, 41, 43, 46, 50, 74, 77,
86 and we want to discretize it into 5 bins.

In the equal-width approach, we take the range of the values (in this
case 86 - 1 = 85) and divide it by the number of bins (5), such that the
width of each bin is 85 / 5 = 17. Therefore, the bins in this case would
be [1, 18], (18, 35], (35, 52], (52, 69], and (69, 86].

On the other hand, in the equal-depth approach, each bin should have
the same number of values. Since we have a total of 20 age values, each
bin should have 4 values. Therefore, the bins in this case are [1, 7], (7,
17], (17, 36], (36, 46], and (46, 86].
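A minimal sketch that reproduces these two binnings with pandas, using pd.cut for equal width and pd.qcut for equal frequency (the computed bin edges may differ slightly from the hand calculation above, since pandas extends the range marginally to include the endpoints):

import pandas as pd

ages = pd.Series([1, 3, 4, 7, 11, 12, 15, 17, 24, 31,
                  35, 36, 40, 41, 43, 46, 50, 74, 77, 86])

equal_width = pd.cut(ages, bins=5)    # 5 bins of (roughly) equal width
equal_depth = pd.qcut(ages, q=5)      # 5 bins with 4 values in each

print(equal_width.value_counts().sort_index())
print(equal_depth.value_counts().sort_index())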

When do we need to use discretization?

There are machine learning algorithms that cannot handle continuous values directly and require the data to be discretized. Examples of such algorithms include Naive Bayes for classification and the Apriori algorithm for association rule mining.

Even when the machine learning model can handle continuous values, discretization can often help it make better use of the continuous-valued feature.


For example, imagine that we need to predict the price of a house given
its location, represented by its latitude and longitude. Let’s call these
two features x₀ and x₁. If we use these two features directly, our model
may learn an incorrect correlation between the house location and its
price.

For example, if we use linear regression for the prediction, our model's hypothesis is:

ŷ = w₁x₀ + w₂x₁

This means that the price prediction is linearly dependent on the specific values of the latitude and longitude. For example, a house located at (20, 20) will always have a predicted price twice as high as a house located at (10, 10), regardless of the weights that the model has learned!
Instead, we would like the model to learn the correlation
between different areas (neighborhoods) and the house price, rather
than the correlation between the exact location of the house and its
price. To that end, we can discretize the longitude and the latitude into
bins, such that each bin becomes a new feature in our data set.

Discretizing the longitude variable into bins

For example, if we discretize the latitude and longitude into 5 bins each, instead of having two variables in our model we will have 10 binary variables:

ŷ = w₁·lat_bin₁ + … + w₅·lat_bin₅ + w₆·lon_bin₁ + … + w₁₀·lon_bin₅

where lat_binᵢ and lon_binⱼ indicate whether the house falls into the corresponding latitude or longitude bin. So now the linear regression model will be able to learn the correlation between each bin (area) and the house price.
Discretization in Scikit-Learn

Scikit-Learn provides the KBinsDiscretizer transformer that can


discretize your data into intervals.

This transformer has a few important parameters:

1. n_bins defines the number of bins (the default is 5 bins)

2. encode specifies the method used to encode the


discretized result. The default is ‘onehot’, which means
that the result will be encoded using one-hot encoding.
For example, if the number of bins is 5, the result of the
transformation will be 5 new binary features (each feature
will indicate whether the given sample belongs to one of
the intervals).

3. strategy specifies how to define the widths of the bins,


can be one of the following options:

• ‘uniform’: all the bins will have the same width.

• ‘quantile’: all the bins will have the same number of


points.

• ‘kmeans’: the bins will be determined by running a K-


Means clustering on the points.

Discretization Example
For example, let’s build a regression model for the California housing
dataset available at Scikit-Learn. The goal in this data set is to predict
the median house value of a given district (house block) in California,
based on 8 different features of that district (such as the median
income or the average number of rooms per household).

We first fetch the data set:

from sklearn.datasets import fetch_california_housing

data = fetch_california_housing()
X, y = data.data, data.target
feature_names = data.feature_names

To explore the data set, we merge the features (X) and the labels (y)
into a pandas DataFrame and display the first rows from the table:

import numpy as np
import pandas as pd

mat = np.column_stack((X, y))
df = pd.DataFrame(mat, columns=np.append(feature_names, 'MedValue'))
df.head()

As a baseline estimation, let’s examine which results we can get by


running a simple linear regression on this data set.
First, we split our data set into 80% training set and 20% test set:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Then, we fit a simple LinearRegression model to the training set:

from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)

Let’s evaluate the model on the training and test sets:

train_score = reg.score(X_train, y_train)


print('R2 score on the training set:', np.round(train_score, 5))

test_score = reg.score(X_test, y_test)


print('R2 score on the test set:', np.round(test_score, 5))

R2 score on the training set: 0.6089


R2 score on the test set: 0.59432

Now, let’s discretize the longitude and latitude columns into 10


intervals.

We first create the KBinsDiscretizer transformer. In order to see the


results of the discretization, we will set the encode strategy to ‘onehot-
dense’ instead of the default ‘onehot’ encoding (which returns a
sparse matrix that cannot be printed out).
Eventually, when you integrate the discretizer into a pipeline of
transformers (see this article on how to build a pipeline), this won’t be
necessary since you won’t need to examine the result of each
transformer.
from sklearn.preprocessing import KBinsDiscretizer

encoder = KBinsDiscretizer(n_bins=10, encode='onehot-dense')

We now call the fit_transform() method of the discretizer on the


Longitude column:

longitude_bins = encoder.fit_transform(df[['Longitude']])
print(longitude_bins)
[[0. 1. 0. ... 0. 0. 0.]
[0. 1. 0. ... 0. 0. 0.]
[0. 1. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]

The result of the discretization is a matrix, where each row represents


one of the sample points, and the columns represent the 10 bins.

To merge the result of the discretization back into our original


DataFrame, we will create another DataFrame from this matrix, and
then use pd.concat() to concat the two DataFrames together.

longitude_labels = [f'Longitude{i}' for i in range(10)]

longitude_df = pd.DataFrame(longitude_bins,

columns=longitude_labels)

df2 = pd.concat([df, longitude_df], axis=1)

df2
Let’s do the same for the Latitude column:

latitude_bins = encoder.fit_transform(df[['Latitude']])

latitude_labels = [f'Latitude{i}' for i in range(10)]

latitude_df = pd.DataFrame(latitude_bins,

columns=latitude_labels)

df3 = pd.concat([df2, latitude_df], axis=1)

df3.head()

But wait… we forgot to drop the original Longitude and Latitude


columns. So let’s do it now:

df3 = df3.drop(['Longitude', 'Latitude'], axis=1)


df3.head()
We are now ready to run our linear regression model on the
transformed data set. First we extract the features from the DataFrame
into a variable X and the labels into a variable y:

X = df3.drop('MedValue', axis=1)
y = df3['MedValue']

We split the data set again into training and test sets, and then fit our
model to the training set:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

reg.fit(X_train, y_train)

Let’s evaluate the model on the training and test sets:

train_score = reg.score(X_train, y_train)

print('R2 score on the training set:', np.round(train_score, 5))

test_score = reg.score(X_test, y_test)

print('R2 score on the test set:', np.round(test_score, 5))

R2 score on the training set: 0.6405


R2 score on the test set: 0.60771

Our R² score has significantly improved both on the training and the
test sets!
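As mentioned earlier, all of the manual steps above (discretize, one-hot encode, concatenate, drop the original columns) can be wrapped in a single Scikit-Learn pipeline. A sketch of what that could look like (an illustration under the same assumptions as this example, not the exact code used above):

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer

# Column indices of the two location features in the raw feature matrix
geo_cols = [feature_names.index('Latitude'), feature_names.index('Longitude')]

preprocessor = ColumnTransformer(
    [('geo_bins', KBinsDiscretizer(n_bins=10, encode='onehot-dense'), geo_cols)],
    remainder='passthrough'   # keep the other features unchanged
)

pipe = Pipeline([
    ('pre', preprocessor),
    ('reg', LinearRegression())
])

X_raw, y_raw = data.data, data.target
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.2)

pipe.fit(X_tr, y_tr)
print('R2 score on the test set:', np.round(pipe.score(X_te, y_te), 5))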

Final Notes
You can find the code example of this article on my github
repository: https://github.com/roiyeho/medium/tree/main/discretiza
tion

Feel free to follow me to get more content like this:)
