
Data Preprocessing
MODERN DATA ANALYTICS
[G0Z39A]
PROF. DR. IR. JAN DE SPIEGELEER
Contents
◦ Pandas
◦ Sklearn
◦ Scalers
◦ Outlier detection
◦ One-hot encoding
◦ Pipes
This chapter
Lesson 3
◦ Pandas: Missing Values, Transforming Data, Normalisation
◦ Sklearn: Scalers, One-Hot Encoding, Pipes

3
This chapter
Lesson 3
PART I (Pandas): Missing Values, Transforming Data, Normalisation
Notebook: Data Preprocessing with Pandas.ipynb
PART II (Sklearn): Scalers, One-Hot Encoding, Pipes

4
Missing Values
◦ Case Study : Hepatitis data set
◦ Concepts
◦ Removing missing values
◦ Replace Missing Values
◦ Caveat: “Inplace = True”
◦ Using bfill and ffill

5
Data Transformation
◦ Case Study: Tweets
◦ Concepts
◦ Lambda functions
◦ Regular Expressions

6
Data Normalisation
◦ Case Study : Covid 19 Data
◦ Concepts : Normalisation using
◦ Z-score
◦ Min-Max
◦ Clipping

7
Scikit-Learn

8
Advantages:
◦ Simple, easy to use, and effective.
◦ Under rapid development and constantly being improved.
◦ Wide range of algorithms, including clustering, factor analysis, principal component analysis, and more.
◦ Can extract features from images and text.
◦ Can be used for NLP.
◦ Provides sample data sets.
Disadvantages:
◦ Limited support for deep learning.
◦ Under rapid development.

9
Scikit-Learn
Supervised learning
Data comes with additional attributes that we want to predict:
◦ classification: samples belong to two or more classes and we want to learn from already labeled data how to predict
the class of unlabeled data. An example of a classification problem would be handwritten digit recognition, in which
the aim is to assign each input vector to one of a finite number of discrete categories.
◦ regression: if the desired output consists of one or more continuous variables, then the task is called regression. An
example of a regression problem would be the prediction of the length of a salmon as a function of its age and
weight.

Unsupervised learning
The goal in such problems may be to discover groups of similar examples within the data, a task called clustering; to determine the distribution of data within the input space, known as density estimation; or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization (dimension reduction).

10
Both are powerful tools but in R, switching between different models usually means learning a new package written by a
different author. The interface may be completely different, the documentation may or may not be helpful in learning
the package, and the package may or may not be under active development.

When starting to work with ML or data analysis, Scikit-learn is going to be easier to use because of its consistency. In a
nutshell you are going to use the following steps for most of the models defined in Scikit-Learn

1. Feature matrix
◦ Row = observation
◦ Column = feature

2. API : Every machine learning algorithm in Scikit-Learn is implemented via the API, which provides a consistent
interface for a wide range of machine learning applications.
Reference : https://arxiv.org/abs/1309.0238

11
Feature Matrix
In supervised learning
$\hat{Y} = f(X)$, where $f()$ stands for the model, $\hat{Y}$ is the forecast of the target $y$, and $X$ is the feature matrix.
The feature matrix (𝑋) is assumed to be two-dimensional, with shape [n_samples, n_features],
and is most often contained in a NumPy array or a Pandas DataFrame, though some Scikit-Learn
models also accept SciPy sparse matrices.
We also generally work with a label or target array (𝑦) . The target array is usually one
dimensional, with length n_samples, and is generally contained in a NumPy array or Pandas
Series.

12
Feature Matrix

13
One API serving different models
Most commonly, the steps in using the Scikit-Learn estimator API are as follows (a minimal code sketch follows the list):
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
2. Choose the model’s hyperparameters by instantiating this class with desired values.
3. Arrange data into a features matrix and target vector following the discussion on previous slide.
4. Fit the model to your data by calling the fit() method of the model instance.
5. Apply the Model to new data:
1. For supervised learning, often we predict numbers/classes for unknown data using the predict() method.
2. For unsupervised learning, we often transform or infer properties of the data using the transform() or
predict() method.
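A minimal sketch of these five steps, using the Iris data set introduced a few slides further and a logistic regression classifier as an arbitrary model choice:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression      # 1. choose a model class
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                        # 3. feature matrix X, target vector y
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=200)                  # 2. instantiate with hyperparameters
model.fit(X_train, y_train)                               # 4. fit on the training data
y_pred = model.predict(X_test)                            # 5. apply the model to new data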

14
Example
The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems".
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica
and Iris versicolor).
Four features were measured from each sample: the length and the width of the sepals and
petals, in centimeters.

15
Example

introduction sklearn.ipynb
Feature Matrix of the Iris DataSet

16
Plug and Play ?

17
Plug and Play ?

Two case studies will illustrate the importance of preprocessing:


1. Ridge-Regression
2. Clustering

18
Case Study : Ridge Regression

19
Case Study : Ridge Regression
Futures are derivative financial contracts obligating the buyer to purchase an asset or the seller to sell
an asset at a predetermined future date and set price.
A futures contract allows an investor to speculate on the direction of a security, commodity, or
financial instrument.
Futures are used to hedge the price movement of the underlying asset to help prevent losses from
unfavorable price changes.
Example:
- Last week purchased 10 March 21 Futures on the SX5E @ 3680
- 5 days later sold 10 March 21 Futures @ 3700
- Profit = 10 x (3700-3680) x 10 EUR/pt = 10 x 20 x 10 = 2000 EUR

20
Case Study : Ridge Regression
A portfolio manager is managing a portfolio of 75 shares (“equities”) and has assembled data covering the period from October 2014 until
August 2015. The manager seeks to hedge the portfolio based on the observed price returns of the portfolio.

Question : What mix of Futures can be used to hedge this portfolio ?

1. X1,i : DAX (Deutscher Aktienindex)

2. X2,i : NKY (Nikkei 225)

3. X3,i : SPX (S&P 500)

4. X4,i : SX5E (EURO STOXX 50)

5. X5,i : SX7E (EURO STOXX Banks)

6. X6,i : SXAP (STOXX Europe 600 Automobiles & Parts)

7. X7,i : SXKP (STOXX Europe 600 Telecommunications)

21
Case Study : Ridge Regression
$\hat{Y}_i = \beta_0 + \sum_{k=1}^{7} \beta_k \, r_{k,i}$

where $r_{k,i}$ = price return of Future $k$ on day $i$.

Using the ordinary least squares model of sklearn (http://bit.ly/34FjeTB).

The results show an impressive R², but what is wrong with the outcome? What do you think about the size of the coefficients?

LSExample-OLS.ipynb
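A hedged sketch of the OLS fit (the notebook LSExample-OLS.ipynb works on the real Futures data; the synthetic returns below are only a stand-in):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
R = rng.normal(0, 0.01, size=(250, 7))                   # daily returns of the 7 Futures
y = R @ np.array([0.2, 0.1, 0.3, 0.5, 0.4, 0.1, 0.2]) + rng.normal(0, 0.002, 250)

ols = LinearRegression().fit(R, y)
print(ols.intercept_, ols.coef_)                         # beta_0 and the hedge ratios beta_1..beta_7
print(ols.score(R, y))                                   # in-sample R^2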

22
Case Study : Ridge Regression
Take a step back and imagine that we are only going to use two Futures to hedge the portfolio. In this case, there are only 2 coefficients ($\beta_1$ and $\beta_2$). Let's also ignore the need for an intercept $\beta_0$. The returns of the portfolio can be modeled as

$y = \beta_1 r_1 + \beta_2 r_2$, with $r_i$ the price return of Future $i$.

There are a lot of combinations possible. The optimal combination is the one minimizing the loss function L:

The optimal values are

LSExample-OLS-simple.ipynb

23
Case Study : Ridge Regression
This figure illustrates the importance of the SXKP Future. Any change in the SXKP and SX5E weights increases the error, but the penalty (increase in the loss function) is much more severe for a modification of the SXKP weight than for a similar change in SX5E.

Question we need to solve:


How can we reduce the weights of (SX5E, SXKP) BUT keep the loss as small as possible?

This is a CONSTRAINED OPTIMISATION that will be solved using one of the following regularisation techniques:

1. Ridge Regression

2. Lasso

3. Elastic Net

24
Case Study : Ridge Regression
The problem can be restated as:

$\hat{\beta}^{ridge} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \alpha \sum_{j=1}^{p} \beta_j^2 \right\}$

A smaller $\alpha$ ($\alpha\downarrow$) moves the solution back toward ordinary least squares; a larger $\alpha$ ($\alpha\uparrow$) shrinks the coefficients further.
LSExample-Ridge.ipynb
Ref: https://bit.ly/34GSAtE

25
Case Study: Ridge Regression

𝛼 = 5.0

26
Case Study: Ridge Regression
Ridge Regression solves the problem of over-fitting BUT introduces two separate issues:

1. The optimal value for $\alpha$ => can be found with cross-validation (see notebook).

2. The need for normalisation (rescaling is a better word). Normalisation is very important for methods with regularisation, because the scale of a variable affects how much regularisation is applied to that variable. Ridge regression regularises the linear regression by imposing a penalty on the size of the coefficients, so the coefficients are shrunk toward zero and toward each other. But when this happens and the independent variables do not have the same scale, the shrinking is not fair: two independent variables with different scales will have different contributions to the penalty term, because the penalty term is the sum of squares of all the coefficients.

Fortunately this is a built-in feature in Scikit-Learn :


class sklearn.linear_model.Ridge(alpha=1.0, *, fit_intercept=True, normalize=False,
copy_X=True, max_iter=None, tol=0.001, solver='auto', random_state=None)[source]
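In recent scikit-learn releases the normalize argument has been removed, so the rescaling is typically done explicitly in a pipeline. A sketch on stand-in data:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, RidgeCV

rng = np.random.default_rng(1)
X = rng.normal(size=(250, 7))                            # stand-in for the Futures returns
y = X @ rng.normal(size=7) + rng.normal(0, 0.1, 250)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=5.0)).fit(X, y)

# issue 1 above: pick alpha by cross-validation
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0, 5.0, 10.0])).fit(X, y)
print(ridge_cv[-1].alpha_)                               # selected regularisation strength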

27
Case Study: Ridge Regression

https://bit.ly/3o19wm8

28
ALWAYS READ THE MANUAL

29
Case Study: Clustering
Clustering is the process of grouping data into groups such that
◦ Data in the same group are similar
◦ Data in different groups are dissimilar

A popular algorithm is K-means clustering. Widely used and capable of handling large datasets, the algorithm requires the user to specify the number of clusters K into which the partitioning is to be done.

Most clustering algorithms are based on some distance measure between two arbitrary data points. The idea is that points whose mutual distance is small should end up in the same cluster. The most common distance metric is the Euclidean Distance.

30
https://bit.ly/2M0H6e0 31
Case Study: Clustering

Some popular distance measures for a feature matrix X with p features (a short SciPy sketch follows the list):

• Euclidean Distance: $D_{Euclidean}(x_a, x_b) = \sqrt{\sum_{j=1}^{p} (x_{aj} - x_{bj})^2}$

• Manhattan (City Block) Distance: $D_{Manhattan}(x_a, x_b) = \sum_{j=1}^{p} | x_{aj} - x_{bj} |$

• Chebychev Distance: $D_{Chebychev}(x_a, x_b) = \max(| x_{a1} - x_{b1} |, \ldots, | x_{ap} - x_{bp} |)$

• Minkowski Distance: $D_{Minkowski}(x_a, x_b) = \Big( \sum_{j=1}^{p} | x_{aj} - x_{bj} |^h \Big)^{1/h}$

Particular values for h:

1. h = 1 : Manhattan Distance
2. h = 2 : Euclidean Distance
3. h → ∞ : Chebychev Distance
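A short sketch computing these distances for two feature vectors with SciPy:

import numpy as np
from scipy.spatial.distance import euclidean, cityblock, chebyshev, minkowski

xa = np.array([1.0, 2.0, 3.0])
xb = np.array([4.0, 0.0, 3.5])

print(euclidean(xa, xb))            # sqrt of the sum of squared differences
print(cityblock(xa, xb))            # Manhattan: sum of absolute differences
print(chebyshev(xa, xb))            # largest absolute difference
print(minkowski(xa, xb, p=3))       # general form; SciPy calls the exponent p rather than h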

32
Case Study: Clustering
K-MEANS algorithm
1. Choose the number of clusters K.

2. Choose K (random) data points as centroids for these clusters.

3. For each of the N data points, calculate the distance to each of the centroids. Assign each point to the cluster C(i) that is closest:

$C(i) = \underset{1 \le k \le K}{\arg\min} \; \lVert x_i - m_k \rVert^2 \quad (1)$

4. Recalculate the K centroids of these clusters: $m_1, m_2, \ldots, m_K$.

5. As long as the stopping criterion is not met, go to step 3.

Given enough time, K-means will always converge; however, this may be to a local minimum. This is highly dependent on the initialisation of the centroids. As a result, the computation is often done several times, with different initialisations of the centroids. One method to help address this issue is the k-means++ initialisation scheme, which has been implemented in scikit-learn (use the init='k-means++' parameter). This initialises the centroids to be (generally) distant from each other, leading to provably better results than random initialisation, as shown in the reference.
Reference : https://bit.ly/2M00SX0
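A sketch of the algorithm via scikit-learn, with the k-means++ initialisation mentioned above (synthetic two-dimensional data):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)           # the K centroids m_1 .. m_K
print(km.labels_[:5])                # cluster assignment C(i) of the first points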

33
Case Study: Clustering

34
Case Study: Clustering
Equity and debt are two distinct classes of securities in terms
of investing risks and potential return, but their value depends
on the same underlying assets of the firm and therefore the
risk-return tradeoff of each security should be systematically
related.

The value of debt (corporate bonds, convertible bonds,


perpetuals, ...) depends on the credit spread (CS). The credit spread is expressed in basis points (bps) and is added to the risk-free interest rate when discounting cashflows.

$CS = CS_0 \times \left( \dfrac{S_0}{S} \right)^{E}$

The equation above maps the value of the credit spread to the
share price S using the so-called “equity-to-credit” link (E)

35
Case Study: Clustering
◦ Making a profit from a mismatch in the behaviour of equity prices vs. bond prices is called "capital structure arbitrage"
◦ Info : https://bit.ly/3dzRCos
◦ Example: a firm surprises the market and reports disappointing earnings; the company's stock may immediately fall 10 percent, but that same information may not be reflected in the company's bond price until several days later and may cause a drop in the bond's price of only 2 percent.

This is not a risk-free exercise! In 2007 the share of Chrysler spiked and the company's debt was downgraded:
◦ Share Price ↑
◦ Bond Price ↓

Protagonists:
1. Mr Kerkorian made a take-over bid on Chrysler's shares.
2. Moody's downgraded the corporate debt of Chrysler.
3. Hedge funds lost money.

36
Case Study : Clustering
The relationship illustrates that our model is flawed; there are different "regimes" at work:
- distressed market: low share price / high credit spread
- normal markets

Cluster_Intro.ipynb

37
Case Study: Clustering
• Identify the clusters
• Set up an equity-to-credit model in each of the clusters

38
Case Study: Clustering

Classifying a new datapoint


Share price = 8.0 and Credit Spread=1.0

39
Case Study: Clustering
The new data point is categorised in the wrong cluster. What went wrong?

In this example we used the Euclidean distance, which is the standard metric for most clustering algorithms (including K-Means in scikit-learn). If one of your features has a range of values much larger than the others, clustering will be completely dominated by that one feature. In this case, the share price dominates the distance calculations and is responsible for the wrong clustering.

This can be dealt with by standardising the data.

KMeans_Cluster_Scaling.ipynb
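A sketch of the fix (the notebook KMeans_Cluster_Scaling.ipynb works on the real data; the numbers below are stand-ins): standardise both features, cluster on the rescaled data, and pass the new point through the same fitted scaler.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
share  = np.concatenate([rng.normal(60, 5, 100), rng.normal(10, 2, 100)])
spread = np.concatenate([rng.normal(0.5, 0.1, 100), rng.normal(3.0, 0.5, 100)])
X = np.column_stack([share, spread])

scaler = StandardScaler().fit(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaler.transform(X))

new_point = [[8.0, 1.0]]                                  # share price 8.0, credit spread 1.0
print(km.predict(scaler.transform(new_point)))            # classified on the rescaled axes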

40
Case Study : Clustering

Problem Solved !

41
Scaler-Objects

42
Scaling
Scikit-learn provides a library of dataset transformations, which may:

1. Clean Data

2. Reduce the dimension of a dataset

3. Extract Features into a format supported by machine learning algorithms from raw datasets consisting of formats such as text and
images.

Note : If these data transforms are used when training a model, they also must be used on subsequent datasets, whether it’s test data or
data in a production system. Otherwise, the feature space will change, and the model will not be able to perform effectively.

For scaling, Sklearn provides Scaler objects. The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. In general, learning algorithms benefit from standardisation of the data set.
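A minimal sketch of the note above: the scaler is fitted on the training data only, and the same fitted object is re-used on later (test or production) data.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test  = np.array([[1.5, 250.0]])

scaler = StandardScaler().fit(X_train)     # mean and std are learned on the training data
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)       # the same mean/std are applied to the test data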

43
Scaling

Preprocessing.ipynb

44
Case Study: Classifying Wine
In this example, we will work with a wine data set (https://bit.ly/2KBbsU8). These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The 13 attributes (features) are:
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline

Don't be too focused on our choice to construct a classifier using a logistic regression model. The goal of the notebook is only to illustrate the use of the Scaler object.

Preprocessing.ipynb

45
Other Scalers in Sklearn
◦ StandardScaler()
◦ MinMaxScaler()
◦ MaxAbsScaler()
◦ RobustScaler()

46
Outlier Detection

47
Introduction
The presence of outliers in a classification or regression dataset can result
in a poor fit and lower predictive modeling performance.
Identifying and removing outliers is challenging with simple statistical
methods for most machine learning datasets given the large number of
input variables. Instead, automatic outlier detection methods can be used
in the modeling pipeline and compared, just like other data preparation
transforms that may be applied to the dataset. (cfr Scalers). Reference :
https://bit.ly/3rvXhAb

Outliers are observations in a dataset that don’t fit in some way. Perhaps
the most common or familiar type of outlier is the observations that are far
from the rest of the observations or the center of mass of observations.
This is easy to understand when we have one or two variables and we can
visualize the data as a histogram or scatter plot, although it becomes very
challenging when we have many input variables defining a high-
dimensional input feature space.

We want to remove the outliers before building our model.

48
Contamination of Data
Distance calculations are very important in ML. Sklearn provides a lot of different distance metrics. The metrics to use depend on the
nature of the feature space:
◦ Real-valued feature spaces:
◦ Manhattan Distance
◦ Euclidean Distance
◦ Mahalanobis Distance (case study)
◦ ...
◦ Integer-valued vector spaces:
◦ HammingDistance
◦ ...
◦ boolean-valued vector spaces:
◦ Jaccard-Distance

49
Distance Metrics
The Mahalanobis distance can be seen as the multivariate version of the Z-score. In case x and y are two points on a line, the Mahalanobis distance between these two points reduces to $| x - y | / \sigma$.

The fact that we are introducing covariance makes the connection with the density function of a multivariate Gaussian distribution:

$f_X(x_1, \ldots, x_p) = \dfrac{1}{\sqrt{(2\pi)^p |\Sigma|}} \exp\!\Big( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \Big)$

IDENTIFIER NAME Function
Euclidean EuclideanDistance $\sqrt{\sum (x - y)^2}$
Manhattan ManhattanDistance $\sum | x - y |$
Chebyshev ChebyshevDistance $\max(| x - y |)$
Minkowski MinkowskiDistance $\big( \sum | x - y |^p \big)^{1/p}$
Wminkowski WMinkowskiDistance $\big( \sum (w \, | x - y |)^p \big)^{1/p}$
Seuclidean SEuclideanDistance $\sqrt{\sum (x - y)^2 / V}$
Mahalanobis MahalanobisDistance $\sqrt{(x - y)' \, S_X^{-1} \, (x - y)}$
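A sketch computing the Mahalanobis distance of a point to the centre of a sample, using the (classical) sample covariance matrix:

import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=500)

mu = X.mean(axis=0)
VI = np.linalg.inv(np.cov(X, rowvar=False))      # inverse covariance matrix S_X^-1

point = np.array([2.0, -2.0])                    # lies against the correlation direction
print(mahalanobis(point, mu, VI))                # large Mahalanobis distance
print(np.linalg.norm(point - mu))                # Euclidean distance, for comparison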

50
Distance Metrics

In both situations, the two points have the same Euclidean distance but a different Mahalanobis distance. In terms of the Mahalanobis metric, Point 2 is clearly an outlier.

51
Distance Metrics
Aren't we opening Pandora's box? We solved the problem of having to face correlated data in a Euclidean context, but we only have an estimate of the covariance matrix, based on sample data.

Any contamination of this sample data results in a wrong estimate of the covariance matrix and a flawed outlier detection procedure.

We need a robust estimate of our covariance matrix.

52
Distance Metrics
Illustration of a contaminated dataset: the point (★) is no longer considered an outlier because of an outlier (▲) in the dataset.

The (Mahalanobis) distance of the observation (★) vs. the mid-point of the dataset decreases because of the outlier. Anomalies in the data can distort the parameter estimates, with the effect of making some points appear less anomalous than they really are.

53
Distance Metrics

Getting a better intuition on the determinant of a covariance matrix: the determinant illustrates how much "space" the data occupies in the feature space X. The presence of outliers increases $|S_X|$.

54
Distance Metrics
◦ The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast
algorithm is available. And it also serves as a convenient and efficient tool for outlier detection.
◦ MCD is a method for estimating the mean and covariance matrix in a way that tries to minimize the influence of anomalies. The idea is to estimate these parameters from a subset of the data that has been chosen to (hopefully) not contain anomalies (see the sketch after this list).
◦ More specifically, imagine taking all possible subsets of the data, of a specified size. Estimate the mean and covariance matrix for each
subset. Then, keep the estimates for the subset whose covariance matrix has the smallest determinant.
◦ The idea behind minimizing the determinant is that the determinant of a covariance matrix measures how broad the distribution is.
MCD therefore selects the subset of the data that is most tightly distributed. This is to exclude anomalies, which are likely to lie
further away from the rest of the data
◦ In practice, one can't actually perform a brute force search over all possible subsets of the data, because there are too many. So,
practical MCD algorithms are concerned with how to perform this procedure in a computationally efficient way.
◦ Reference : P. J. Rousseeuw. Least median of squares regression. J. Am Stat Ass, 79:871, 1984.
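A sketch contrasting the robust MCD estimate with the classical covariance estimate on a contaminated sample:

import numpy as np
from sklearn.covariance import MinCovDet, EmpiricalCovariance

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[1.0, 0.7], [0.7, 1.0]], size=300)
X[:10] += 8                                      # contaminate the sample with 10 outliers

mcd = MinCovDet(random_state=0).fit(X)           # robust location and scatter
emp = EmpiricalCovariance().fit(X)               # classical estimate, distorted by the outliers

print(mcd.mahalanobis(X)[:10].mean())            # squared Mahalanobis distances of the outliers
print(emp.mahalanobis(X)[:10].mean())            # noticeably smaller under the classical estimate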

55
Case Study: Breast Cancer Research
Let’s see an example implementation on the BreastCancer dataset, where the objective is to determine if a tumour is benign or
malignant. `One Class classification` is a type of algorithm where the training dataset contains observations belonging to only
one class.
With only that information known, the objective is to figure out if a given observation in a new (or test) dataset belongs to that
class.

Those observations in the test set that are “very distant” from the malignant-cases in the training set, can be considered benign
(which is good news for the patients involved).

MinCovDet.ipynb

Reference : https://www.machinelearningplus.com/statistics/mahalanobis-distance/

56
Automated Outlier Detection
◦ Isolation Forest
◦ MinCovDet
◦ Local Outlier Factor
◦ One-Class SVM

Ref: https://bit.ly/3rvXhAb

housing.ipynb

57
Automated Outlier Detection
To illustrate the different methods, we will use the housing data set. This is a well-known dataset in many case studies. The data are available at:
https://goz39a.s3.eu-central-1.amazonaws.com/housing.csv

The dataset has 506 observations of housing prices in the suburbs of Boston. Each data point has 13 continuous attributes (including the "class" attribute "MEDV") and 1 binary-valued attribute.

There are no missing values in the dataset.

1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B proportion of immigrants by town
13. LSTAT % lower status of the population
14. MEDV Median value of owner-occupied homes in $1000's

58
Case Study: Housing Data
An ordinary least squares model (OLS) was constructed on 2/3 of the data (= training) and tested on the remainder.

The MAE error was 3.668, which is our baseline performance.

Let's see what the impact is of a cleaning procedure where outliers are removed.
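A hedged sketch of the baseline (housing.ipynb is the reference; the read_csv arguments and the column name MEDV are assumptions about the file layout):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

df = pd.read_csv('https://goz39a.s3.eu-central-1.amazonaws.com/housing.csv')
X, y = df.drop(columns='MEDV'), df['MEDV']        # MEDV is the target (assumed column name)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=1)
model = LinearRegression().fit(X_tr, y_tr)
print(mean_absolute_error(y_te, model.predict(X_te)))   # baseline MAE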

59
Case Study : Housing Data – Isolation Forest
Isolation Forest, or iForest for short, is a tree-based anomaly
detection algorithm. It is based on modeling the data in such a
way as to isolate anomalies that are both few in number and
different in the feature space. These anomalies have attribute-
values that are very different from those of the normal
instances.

Reference : https://bit.ly/2KH2wMS

In Scikit learn, the Isolation Forest Class can be used for this
(https://bit.ly/38C9Lhl)

Perhaps the most important hyperparameter in the model is


the “contamination” argument, which is used to help estimate
the number of outliers in the dataset. This is a value between
0.0 and 0.5 and by default is set to 0.1.
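A sketch of the IsolationForest API: fit_predict labels outliers with -1, and only the remaining rows are kept for training (synthetic data; housing.ipynb applies the same idea to the real dataset):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_tr = rng.normal(0, 1, (200, 13))
X_tr[:5] += 10                                    # plant a few extreme rows
y_tr = rng.normal(22, 5, 200)

iso = IsolationForest(contamination=0.1, random_state=1)
mask = iso.fit_predict(X_tr) != -1                # True for inliers
X_clean, y_clean = X_tr[mask], y_tr[mask]
print(X_tr.shape, X_clean.shape)                  # roughly 10% of the rows are dropped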

60
Case Study : Housing Data – MinCovDet
If the input variables have a Gaussian distribution, then
MinCovDet (=MCD) can be used to detect outliers.

This MCD approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data; data that falls outside this shape is considered an outlier.

The scikit-learn library provides access to this method via the


EllipticEnvelope class.

It provides the “contamination” argument that defines the


expected ratio of outliers to be observed in practice.
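A sketch of the EllipticEnvelope API on synthetic data with a few planted outliers:

import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(6, 1, (5, 2))])

ee = EllipticEnvelope(contamination=0.05)
labels = ee.fit_predict(X)                        # -1 = outlier, +1 = inlier
print((labels == -1).sum())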

61
Case Study : Housing Data – Local Outlier Factor
A simple approach to identifying outliers is to locate those
examples that are far from the other examples in the feature
space.

This can work well for feature spaces with low dimensionality
(few features), although it can become less reliable as the
number of features is increased, referred to as the curse of
dimensionality.

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Those examples with the largest score are more likely to be outliers.
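A sketch of the LocalOutlierFactor API; note that fit_predict is applied directly to the data of interest (LOF only scores unseen data when novelty=True):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(6, 1, (5, 2))])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                       # -1 = outlier, +1 = inlier
print(lof.negative_outlier_factor_[:5])           # the underlying LOF scores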

62
Case Study : Housing Data – One-Class SVM
Although the SVM is a classification algorithm, its one-class variant (One-Class SVM) can be used to discover outliers in input data for both regression and classification datasets.

The scikit-learn library provides an implementation of one-


class SVM in the OneClassSVM class.

The class provides the “nu” argument that specifies the


approximate ratio of outliers in the dataset, which defaults to
0.1.
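A sketch of the OneClassSVM API with the nu argument mentioned above:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(6, 1, (5, 2))])

ocsvm = OneClassSVM(nu=0.05)                      # approximate fraction of outliers
labels = ocsvm.fit_predict(X)                     # -1 = outlier, +1 = inlier
print((labels == -1).sum())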

63
Case Study : Discuss

64
OneHotEncoders

65
Different Approaches
Very often you will need to transform classes into numbers before feeding the data into a machine learning algorithm.

This can be done using the get_dummies() function of pandas; Scikit-Learn has a OneHotEncoder class.

Using a OneHotEncoder has the advantage of being able to fit on some training data and then transform some other data using the same instance. This is similar to the Scaler objects used earlier.
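A minimal sketch of both options, with a toy colour column:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'colour': ['red', 'green', 'red']})
test  = pd.DataFrame({'colour': ['green']})

print(pd.get_dummies(train['colour']))            # one-off encoding with pandas

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train[['colour']])                        # category list learned on the training data
print(enc.transform(test[['colour']]).toarray())  # same columns applied to new data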

onehot.ipynb

66
Building Pipelines

67
Using Pipelines in ML

Read Data → Clean Data → Preprocessing → Create Model(s)

In the previous examples, the same workflow was always followed: a real pipeline consisting of a sequence of API calls to the Scikit-Learn library.

68
Pipeline Ingredients
◦ SimpleImputer
◦ OneHotEncoder
◦ ColumnTransformer
◦ Scaler

69
SimpleImputer
This is an important class to deal with missing data (unless you fanatically want to use the dropna() function).

For example, you may want to impute missing


numerical values with a median value, then scale
the values and impute missing categorical values
using the most frequent value and one hot
encode the categories.

70
SimpleImputer
import numpy as np
from sklearn.impute import SimpleImputer

# Fit the imputer on the training data: it learns the column means,
# (7+4+10)/3 = 7.0 for column 0 and (2+5)/2 = 3.5 for column 1.
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])

# Transform new data: every NaN is replaced by the corresponding column mean.
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
X_preprocessed = imp_mean.transform(X)

Training data (used in fit):
     0    1    2
0  7.0  2.0  3.0
1  4.0  NaN  6.0
2 10.0  5.0  9.0

X (to be transformed):
      0    1  2
0   NaN  2.0  3
1   4.0  NaN  6
2  10.0  NaN  9

X_preprocessed (NaN replaced by the column means 7.0 and 3.5):
      0    1    2
0   7.0  2.0  3.0
1   4.0  3.5  6.0
2  10.0  3.5  9.0

71
ColumnTransformers
Applying data transforms like scaling or encoding categorical variables is straightforward when all input variables are the same type. It can be challenging when you have a dataset with mixed types and you want to selectively apply data transforms to some, but not all, input features.

Thankfully, the scikit-learn Python machine learning library provides the ColumnTransformer, a class that allows you to selectively apply data preparation transforms to different columns in your dataset. For example, it allows you to apply a specific transform or sequence of transforms to just the numerical columns, and a separate sequence of transforms to just the categorical columns.

Each transformer is a three-element tuple that defines the name of the transformer, the transform to apply, and the column indices to apply it to. For example:

(Name, Transformer, Columns)

◦ Name : string
◦ Transformer : This class must support the fit() or transform() function
◦ Columns : Indexes the data on its second axis.
◦ Integers are interpreted as positional columns
◦ strings can reference DataFrame columns by name

72
ColumnTransformers

Figure: a ColumnTransformer with three transformers, ('cat', OneHotEncoder(), [0, 1]), ('num', StandardScaler(), [2]) and ('num2', Imputer, [4]), applied to the columns of the incoming data.

Data is passed into the ColumnTransformer and the columns are processed one after the other. Column [3] is not processed and will be removed from the dataset. This can be changed by setting the "remainder" argument.
Setting remainder='passthrough' will mean that all columns not specified in the list of "transformers" will be passed through without transformation, instead of being dropped.
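A sketch of the ColumnTransformer drawn above (the transformer names and column indices follow the figure; SimpleImputer replaces the older Imputer class):

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

X = np.array([['a', 'x', 1.0, 99, 5.0],
              ['b', 'y', 2.0, 98, np.nan]], dtype=object)

ct = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), [0, 1]),
        ('num', StandardScaler(), [2]),
        ('num2', SimpleImputer(strategy='mean'), [4]),
    ],
    remainder='passthrough',       # keep column [3] instead of dropping it
    sparse_threshold=0.0)          # force a dense output so it prints nicely

print(ct.fit_transform(X))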

73
ColumnTransformers
...
t = [('num', SimpleImputer(strategy='median'), [0, 1]), ('cat', SimpleImputer(strategy='most_frequent'), [2, 3])]
transformer = ColumnTransformer(transformers=t)
...

train_X = transformer.fit_transform(train_X)

74
ColumnTransformers (case study)
Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem. From the original data, examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values have been scaled.

Data comes from an original (non-machine-learning) study: Warwick J Nash, Tracy L Sellers, Simon R Talbot, Andrew J Cawthorn and Wes B Ford (1994) "The Population Biology of Abalone (_Haliotis_ species) in Tasmania. I. Blacklip Abalone (_H. rubra_) from the North Coast and Islands of Bass Strait", Sea Fisheries Division, Technical Report No. 48 (ISSN 1034-3288).

Ref : https://bit.ly/3pra84V

abalone.ipynb

75
ColumnTransformers (case study)
Name Data
Sex nominal
Length continuous
Diameter continuous
Height continuous
Whole weight continuous
Shucked weight continuous
Viscera weight continuous
Shell weight continuous
Rings integer

The number of rings is the value to predict: either as


a continuous value or as a classification problem.

76
Case Study 1: Building a Pipeline
Reference : http://bit.ly/37QyNKq

boston.ipynb

On every object within the pipeline, the fit_transform methods are invoked during training, while transform (or predict) is called during test. So far, using pipelines is just a matter of code cleanliness and minimization.
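A sketch of such a pipeline, chaining imputation, scaling and a model; fit() runs fit_transform on every intermediate step and predict() runs transform followed by predict:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[::10, 0] = np.nan                               # a few missing values
y = rng.normal(size=100)

pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', Ridge(alpha=1.0)),
])
pipe.fit(X, y)
print(pipe.predict(X[:3]))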

77
Case Study 2: Building a Pipeline
In order to make this case study intuitive, we will learn all the concepts while simultaneously working on real-world data: a sales prediction problem.
As a part of the problem, we are provided with the information
about the stores (location, size, etc), products (weight, category,
price, etc) and historical sales data. Using this information, we have
to forecast the sales of the products in the stores.

The data scientists at BigMart have collected 2013 sales data for
1559 products across 10 stores in different cities. Also, certain
attributes of each product and store have been defined. The aim is
to build a predictive model and predict the sales of each product at
a particular outlet.

Please note that the data may have missing values as some stores
might not report all the data due to technical glitches. Hence, it will
be required to treat them accordingly.

Ref : https://bit.ly/3pxB2s9 The target variable in the dataset is Item_Outlet_Sales.

bigstore.ipynb

78
Case Study 2: Building a Pipeline
There is a train (8523) and a test (5681) data set. Only the train data set has both input and output variable(s). You need to predict the sales for the test data set.

Location of the training & test data:
https://goz39a.s3.eu-central-1.amazonaws.com/bigstore_test.csv
https://goz39a.s3.eu-central-1.amazonaws.com/bigstore_train.csv

Variable : Description
Item_Identifier : Unique product ID
Item_Weight : Weight of product
Item_Fat_Content : Whether the product is low fat or not
Item_Visibility : The % of total display area of all products in a store allocated to the particular product
Item_Type : The category to which the product belongs
Item_MRP : Maximum Retail Price (list price) of the product
Outlet_Identifier : Unique store ID
Outlet_Establishment_Year : The year in which the store was established
Outlet_Size : The size of the store in terms of ground area covered
Outlet_Location_Type : The type of city in which the store is located
Outlet_Type : Whether the outlet is just a grocery store or some sort of supermarket
Item_Outlet_Sales : Sales of the product in the particular store. This is the outcome variable to be predicted.

79
Case Study 2: Building a Pipeline
Only 7 features are going to be used in this model; the rest of the columns will be dropped (a pipeline sketch follows the list).
◦ Item_MRP: It holds the price of the products. During the preprocessing step we used a standard scaler to scale these values.
◦ Outlet_Type_Grocery_Store: A binary column which indicates if the outlet type is a grocery store or not. To use this information in the model building process, we will add a binary feature to the existing data that contains 1 (if the outlet type is a grocery store) and 0 (if the outlet type is something else).
◦ Item_Visibility: Denotes the visibility of products in the store. Since this variable had a small value range and no missing values, we didn't apply any preprocessing steps to this variable.
◦ Outlet_Type_Supermarket_Type3: Another binary column indicating if the outlet type is "supermarket_type_3" or not. To capture this information we will create a binary feature that stores 1 (if the outlet type is supermarket_type_3) and 0 (otherwise).
◦ Outlet_Identifier_OUT027: This feature specifies whether the outlet identifier is "OUT027" or not. Similar to the previous examples, we will create a separate column that carries 1 (if the outlet identifier is OUT027) and 0 (otherwise).
◦ Outlet_Establishment_Year: The Outlet_Establishment_Year describes the year of establishment of the stores. Since we did not perform any transformation on the values in this column, we will not preprocess it in the pipeline either.
◦ Item_Weight: During the preprocessing steps we observed that Item_Weight had missing values. These missing values were imputed using the average of the column. This has to be taken into account while building the machine learning pipeline.
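A hedged sketch of how these steps could be assembled (bigstore.ipynb is the reference; the exact category spellings and the choice of regressor are assumptions, and the OneHotEncoder generates all dummy columns rather than only the three hand-picked ones):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression

train = pd.read_csv('https://goz39a.s3.eu-central-1.amazonaws.com/bigstore_train.csv')
X = train.drop(columns='Item_Outlet_Sales')
y = train['Item_Outlet_Sales']

preprocess = ColumnTransformer(transformers=[
    ('mrp', StandardScaler(), ['Item_MRP']),                      # scaled price
    ('weight', SimpleImputer(strategy='mean'), ['Item_Weight']),  # impute missing weights
    ('cats', OneHotEncoder(handle_unknown='ignore'),
             ['Outlet_Type', 'Outlet_Identifier']),               # yields the binary columns
    ('rest', 'passthrough', ['Item_Visibility', 'Outlet_Establishment_Year']),
])

model = Pipeline([('prep', preprocess), ('reg', LinearRegression())])
model.fit(X, y)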

80
Case Study 2: Building a Pipeline

81
Case Study 2: Building a Pipeline

82
