
Preprocessing

Instructor: Hrant Davtyan


Course: Business Analytics, Fall, 2021
Content
1. Cleansing
2. Missing values
3. Outliers
a. Univariate methods
b. Multivariate methods
c. Digit-based methods
4. Scaling
Cleansing actions

1. Uninformative variable removal


2. Unobserved variable removal
3. Duplicate elimination
4. Categorical variable transformation
5. Validity checks
6. Missing value handling
7. Outlier/anomaly detection
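The first five actions can be sketched with pandas as below. This is only an illustration under assumptions: the toy data, the column names and the valid age range of 0 to 120 are invented here and do not come from the course material.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "country": ["AM", "US", "US", "AM"],
    "constant": [7, 7, 7, 7],            # uninformative: one value everywhere
    "empty": [None, None, None, None],   # unobserved: no data at all
    "age": [25, 31, 31, -4],             # -4 fails a validity check
})

def cleanse(frame):
    # 2. unobserved variable removal: drop columns with no data
    out = frame.dropna(axis=1, how="all")
    # 1. uninformative variable removal: drop columns with a single value
    constant_cols = [c for c in out.columns if out[c].nunique(dropna=False) <= 1]
    out = out.drop(columns=constant_cols)
    # 3. duplicate elimination
    out = out.drop_duplicates()
    # 5. validity check (the 0..120 range is an assumption)
    out = out[out["age"].between(0, 120)]
    # 4. categorical variable transformation into dummy columns
    out = pd.get_dummies(out, columns=["country"])
    return out.reset_index(drop=True)

clean = cleanse(df)
print(clean)
```

Missing value handling and outlier detection (actions 6 and 7) are left out here because they are covered separately.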
Missing Values
Motivation

just because they are odd does not mean they are
unimportant

John Foreman, author of Data Smart


Introduction

1. A missing value is an observation of a variable that does not store any data
2. Some algorithms handle missing values automatically, others do not
3. It is recommended to deal with them at the beginning of the analysis

to drop or not to drop,


that is the question...
Actions

Drop if:

• an observation has "a lot of" missing values


• a variable has "a lot of" missing values
• values are missing at random

Do not drop (fill/impute) if:

• an observation has "only a few" missing values


• a variable has "only a few" missing values
• values are not missing at random

Note:

• The recommended actions above apply only to the training set
• Observations cannot be dropped from the test set (which mimics the real prediction set), so its missing values have to be filled instead
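A minimal sketch of the drop rules for a training set follows. The slides do not quantify "a lot of", so the 50% cut-off and the toy data below are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical training data
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 2.0, 3.0, 4.0],
})

MAX_MISSING_SHARE = 0.5  # assumed cut-off for "a lot of"

# Drop variables with too many missing values
col_share = df.isnull().mean()
kept_cols = df.loc[:, col_share <= MAX_MISSING_SHARE]

# Then drop observations with too many missing values
row_share = kept_cols.isnull().mean(axis=1)
train = kept_cols.loc[row_share <= MAX_MISSING_SHARE]
print(train)
```

Here variable b (80% missing) and the second observation (all remaining values missing) are dropped; anything left with only a few missing values would be imputed rather than dropped.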
Imputation

1. single value - easy and fast


• mean, mode, median or some other value
2. replication - only useful when observations have intrinsic order
• backward fill, forward fill
3. distribution specific - more sophisticated version of method 1
• e.g. single value per group
4. regression - very slow but more precise
• estimate the missing value
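The four imputation options can be sketched as follows. The data, the column names and the use of a simple linear fit for the regression option are assumptions for illustration, not the course's prescribed implementation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "size" is a fully observed predictor of "value"
df = pd.DataFrame({
    "group": ["x", "x", "x", "y", "y", "y"],
    "size": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "value": [1.0, np.nan, 3.0, 10.0, np.nan, 30.0],
})

# 1. single value: overall mean of the variable
single = df["value"].fillna(df["value"].mean())

# 2. replication: forward fill (assumes the rows have an intrinsic order)
forward = df["value"].ffill()

# 3. distribution specific: one value (here the median) per group
group_median = df.groupby("group")["value"].transform("median")
per_group = df["value"].fillna(group_median)

# 4. regression: fit value ~ size on the observed rows, predict the rest
known = df["value"].notna()
slope, intercept = np.polyfit(df.loc[known, "size"], df.loc[known, "value"], deg=1)
regressed = df["value"].fillna(slope * df["size"] + intercept)

print(pd.DataFrame({"single": single, "forward": forward,
                    "per_group": per_group, "regressed": regressed}))
```

On this toy data the overall mean (11) is a poor guess for both gaps, while the per-group and regression fills stay close to the neighbouring values.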
Sample code (Python)

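# import pandas and read the data
import pandas as pd
data = pd.read_csv("data.csv")

# check for missing values (NA/NaN): count them per column
print(data.isnull().sum())

# remove rows that contain NA/NaN values (returns a new DataFrame)
dropped = data.dropna()

# fill NA/NaN values, here with the mean of each numeric column
filled = data.fillna(data.mean(numeric_only=True))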
Outliers

Outliers:
• Affect both data description and prediction results
• Some methods are robust, others not
• It is recommended to handle them after observing distortion in the results

Methods to detect:
• Mean and Std
• InterQuartile Range (IQR)
• Median Absolute Deviation (MAD)
Mean and Std

1. Assumes normal distribution
2. Uses the mean, which is a measure not robust to outliers
3. Is easy and proven to be useful when the data is big enough
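A minimal sketch of the mean and Std rule is given below. The slide does not state a cut-off, so the threshold of 3 standard deviations and the sample data are assumptions.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag points lying more than `threshold` sample Stds from the mean."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > threshold

# Hypothetical data: 19 ordinary values and one extreme value
data = [9, 10, 11] * 6 + [10, 100]
mask = zscore_outliers(data)
print(np.asarray(data)[mask])
```

Because the extreme value inflates both the mean and the Std, this rule needs enough data to work: with very few observations no point can reach a z-score of 3.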
InterQuartile Range (IQR)

• Formula: IQR = Q3 - Q1
• Where: Q1 = 1st quartile, Q3 = 3rd quartile
• Outliers are outside of the following range:
[Q1 - α × IQR, Q3 + α × IQR]
• α = 1.5 is the classical value
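The IQR rule above can be sketched as follows. The sample data is invented, and the quartiles use NumPy's default linear interpolation, which is only one of several quartile conventions.

```python
import numpy as np

def iqr_outliers(values, alpha=1.5):
    """Flag points outside [Q1 - alpha * IQR, Q3 + alpha * IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - alpha * iqr, q3 + alpha * iqr
    return (x < lower) | (x > upper)

data = [2, 3, 4, 5, 6, 7, 8, 50]   # hypothetical sample
mask = iqr_outliers(data)
print(np.asarray(data)[mask])
```

For this sample Q1 = 3.75 and Q3 = 7.25, so the range is [-1.5, 12.5] and only the value 50 falls outside it.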
Median Absolute Deviation (MAD)

• MAD=Median(|X−Median(X)|)
• Steps:
• Calculate Median
• Calculate deviation between each data point and median
• Make all those deviations nonnegative using their absolute value
• Calculate the median of the above sequence
• MAD plays the same role for the median that Std plays for the mean
• Outliers are outside of [Median - α × MAD, Median + α × MAD]
• α = 1.5 is the classical value
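The MAD steps above can be sketched as follows. The function names and the sample data are illustrative assumptions.

```python
import numpy as np

def mad(values):
    """Median of the absolute deviations from the median."""
    x = np.asarray(values, dtype=float)
    return np.median(np.abs(x - np.median(x)))

def mad_outliers(values, alpha=1.5):
    """Flag points outside [Median - alpha * MAD, Median + alpha * MAD]."""
    x = np.asarray(values, dtype=float)
    return np.abs(x - np.median(x)) > alpha * mad(x)

data = [2, 3, 4, 5, 6, 7, 8, 50]          # hypothetical sample
strict = mad_outliers(data, alpha=1.5)
loose = mad_outliers(data, alpha=3.0)
print(mad(data), np.asarray(data)[strict], np.asarray(data)[loose])
```

For this sample the median is 5.5 and MAD is 2. With α = 1.5 both 2 and 50 are flagged, while α = 3 flags only 50, so the choice of α noticeably changes how strict the rule is.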
Benford’s law
Cook’s distance

• Univariate outlier detection methods may lead to the elimination of observations that are outliers for one feature but inliers for others.
• Cook’s distance is a method that identifies influential points (usually outliers) based on all features in the model and is calculated as follows:
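• D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p \, s^2}
• where \hat{y}_j is the fitted value for observation j from the full model, \hat{y}_{j(i)} is the fitted value for observation j when the model is refit without observation i, s^2 is the MSE of the model and p is the number of coefficients.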
• The higher D_i, the larger the influence. Usually, an observation with D_i > 1 is considered an outlier.
• Cook’s distance is usually used for analyzing small datasets.
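A minimal NumPy sketch of Cook's distance for an ordinary least squares model is shown below. It uses the leverage form e_i^2 / (p s^2) × h_ii / (1 - h_ii)^2, which is algebraically equivalent to refitting the model without each observation. The function name and the data are assumptions for illustration; the MSE is taken as the residual sum of squares divided by n - p.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of every observation in an OLS fit with intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xd = np.column_stack([np.ones(len(y)), X])   # add the intercept column
    n, p = Xd.shape                              # p = number of coefficients
    H = Xd @ np.linalg.pinv(Xd.T @ Xd) @ Xd.T    # hat matrix
    h = np.diag(H)                               # leverage of each point
    resid = y - H @ y                            # residuals of the full model
    s2 = resid @ resid / (n - p)                 # MSE of the model
    return resid**2 / (p * s2) * h / (1 - h) ** 2

# Hypothetical data: roughly y = 2x + 1, except for the last observation
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 20.0]

d = cooks_distance(x, y)
print(np.round(d, 3))
print("most influential observation:", int(d.argmax()))
```

The last observation combines a far-away x value with a y value well off the trend, so its distance is by far the largest and exceeds the usual cut-off of 1.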
Scaling

• In regression/classification analysis, different variables often have different scales
• As a result, the coefficients of those variables will be on different scales as well
• Example: Price measured in AMD (larger numbers) gets a smaller coefficient than the same Price measured in USD (smaller numbers)
• Scaling is the process of bringing different variables to a common scale
• It ensures that the coefficients of variables become comparable to each other
• Typically, scaling is required for parametric models fitted with numerical optimization, yet it reduces model explainability (values are hard to interpret after scaling)

Methods to scale:
• Standardization (removing mean and scaling to unit variance)
• Min-Max scaling (bringing variable values to a certain range, e.g. making min=0, max=1)
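Both methods can be sketched in a few lines of NumPy. The prices and the exchange rate of 400 AMD per USD are invented for illustration, and the standardization here uses the population Std (ddof=0), which is one common convention.

```python
import numpy as np

def standardize(values):
    """Remove the mean and scale to unit variance."""
    x = np.asarray(values, dtype=float)
    return (x - x.mean()) / x.std()

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Bring values into the range [new_min, new_max]."""
    x = np.asarray(values, dtype=float)
    unit = (x - x.min()) / (x.max() - x.min())
    return new_min + unit * (new_max - new_min)

price_usd = [100, 200, 300, 400]            # hypothetical prices
price_amd = [p * 400 for p in price_usd]    # assumed exchange rate

print(standardize(price_usd))
print(standardize(price_amd))
print(min_max_scale(price_usd))
```

After either transformation the USD and AMD versions of Price become identical, which is why their coefficients become comparable; the cost is that a scaled value no longer reads as an amount of money.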
Thank you
