2) Business Analytics. Preprocessing

Preprocessing
Instructor: Hrant Davtyan

Course: Business Analytics, Fall, 2021
Content
1. Cleansing
2. Missing values
3. Outliers
a. Univariate methods
b. Multivariate methods
c. Digit-based methods
4. Scaling
Cleansing actions
1. Uninformative variable removal

2. Unobserved variable removal
3. Duplicate elimination
4. Categorical variable transformation
5. Validity checks
6. Missing value handling
7. Outlier/anomaly detection
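The first four cleansing actions can be sketched in pandas as below. This is only an illustration on a made-up DataFrame: the column names, the data, and the rule used to call a variable "uninformative" (a single distinct value) are assumptions, not part of the course material.

```python
import pandas as pd

# Hypothetical toy data; all column names and values are made up.
df = pd.DataFrame({
    "country": ["AM", "AM", "AM", "AM"],   # constant, hence uninformative
    "notes": [None, None, None, None],     # never observed
    "segment": ["retail", "corporate", "retail", "retail"],
    "price": [10.0, 25.0, 10.0, 12.5],
})

# 2. Unobserved variable removal: drop columns that contain no data at all.
df = df.dropna(axis=1, how="all")

# 1. Uninformative variable removal: drop columns with one distinct value
#    (assumed definition of "uninformative").
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)

# 3. Duplicate elimination: keep the first copy of identical rows.
df = df.drop_duplicates().reset_index(drop=True)

# 4. Categorical variable transformation: one-hot encode the text column.
df = pd.get_dummies(df, columns=["segment"])

print(df)
```

Unobserved columns are removed before constant ones here only so that each step reports what it is named for; the order does not change the result on this data.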
Missing Values
Motivation
just because they are odd does not mean they are
unimportant
John Foreman, author of Data Smart

Introduction
1. A missing value is an observation of a variable for which no data is stored
2. Some algorithms handle missing values automatically, others do not
3. It is recommended to deal with them at the beginning of the analysis
to drop or not to drop,

that is the question...
Actions
Drop if:
• an observation has "a lot of" missing values

• a variable has "a lot of" missing values
• values are missing at random
Do not drop (fill/impute) if:
• an observation has "only a few" missing values

• a variable has "only a few" missing values
• values are not missing at random
Note:
• The recommended actions above apply only to the training set
• Observations cannot be dropped from the test set (which mimics the real prediction set), so missing values there have to be filled
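A minimal sketch of the drop rules on a training set follows. The slides leave "a lot of" undefined, so the 50% cut-off, the DataFrame, and its column names are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical training data; names and values are made up.
train = pd.DataFrame({
    "age": [25, np.nan, 40, 31, 58],
    "income": [np.nan, np.nan, np.nan, 1200.0, np.nan],
    "city": ["Yerevan", None, "Gyumri", "Yerevan", "Vanadzor"],
})

# Assumed meaning of "a lot of": more than half of the values are missing.
MAX_MISSING_SHARE = 0.5

# Drop variables (columns) whose share of missing values is too high.
col_missing = train.isnull().mean()
train = train[col_missing[col_missing <= MAX_MISSING_SHARE].index]

# Then drop observations (rows) whose share of missing values is too high.
row_missing = train.isnull().mean(axis=1)
train = train[row_missing <= MAX_MISSING_SHARE]

print(train)
```

Rows and columns that fall below the cut-off but still contain gaps would be passed on to imputation rather than dropped.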
Imputation
1. single value - easy and fast

• mean, mode, median or some other value
2. replication - only useful when observations have intrinsic order
• backward fill, forward fill
3. distribution specific - a more sophisticated version of method 1
• e.g. single value per group
4. regression - very slow but more precise
• estimate the missing value
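The four imputation approaches above can be compared side by side on a small example. The data, the column names, and the choice of a straight-line fit for the regression case are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical data: all names and values are made up for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, np.nan, 6.0, 8.0, np.nan, 12.0],
})

# 1. Single value: replace every gap with the overall median of y.
single = df["y"].fillna(df["y"].median())

# 2. Replication: carry the previous observation forward (rows must be ordered).
forward = df["y"].ffill()

# 3. Distribution specific: one value per group, here the group mean.
per_group = df["y"].fillna(df.groupby("group")["y"].transform("mean"))

# 4. Regression: fit y on x using the complete rows, then predict the gaps.
known = df["y"].notnull()
slope, intercept = np.polyfit(df.loc[known, "x"], df.loc[known, "y"], deg=1)
regression = df["y"].fillna(intercept + slope * df["x"])

print(pd.DataFrame({"single": single, "forward": forward,
                    "per_group": per_group, "regression": regression}))
```

On this toy data the per-group and regression fills recover the underlying pattern, while the single-value fill puts the same number in both gaps.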
Sample code (Python)
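# importing pandas and reading data
import pandas as pd
data = pd.read_csv("data.csv")
# counting missing values (NA/NaN) in each column
data.isnull().sum()
# removing rows that contain NA/NaN values (returns a new DataFrame)
data_dropped = data.dropna()
# filling NA/NaN values in numeric columns with the column median
data_filled = data.fillna(data.median(numeric_only=True))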
Outliers
Outliers:
• Affect both data description and prediction results
• Some methods are robust to outliers, others are not
• Recommended to handle them after observing distortion in the results
Methods to detect:
• Mean and Std
• InterQuartile Range (IQR)
• Median Absolute Deviation (MAD)
Mean and Std
1. Assumes a normal distribution
2. Uses the mean, which is a measure not robust to outliers
3. Is easy and has proven useful when the data set is big enough
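A minimal sketch of the Mean and Std method is shown below. The slides do not state a threshold, so the cut-off of 3 standard deviations (parameter k) is an assumption, as are the function name and the sample data.

```python
import numpy as np

def zscore_outliers(values, k=3.0):
    """Return a boolean mask that is True where |x - mean| > k * std.

    k=3.0 is an assumed threshold; the slides do not specify one.
    """
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)  # sample standard deviation
    return np.abs(z) > k

# Hypothetical sample: 30 ordinary values plus one extreme value.
sample = [10, 11, 12] * 10 + [100]
mask = zscore_outliers(sample)
print(np.asarray(sample)[mask])
```

Because the extreme value itself inflates both the mean and the standard deviation, this method can miss outliers in small samples, which is the lack of robustness noted in point 2.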
InterQuartile Range (IQR)
• Formula: IQR = Q3 - Q1
• Where: Q1 = 1st quartile, Q3 = 3rd quartile
• Outliers are outside of the following range:
[Q1 - α × IQR, Q3 + α × IQR]
• α = 1.5 is the classical value
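The IQR rule can be sketched as follows. The function name and sample are made up, and the quartiles use NumPy's default linear interpolation, which is one of several conventions and may give slightly different Q1 and Q3 than other tools.

```python
import numpy as np

def iqr_outliers(values, alpha=1.5):
    """Return a boolean mask for points outside [Q1 - a*IQR, Q3 + a*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])  # linear interpolation by default
    iqr = q3 - q1
    lower, upper = q1 - alpha * iqr, q3 + alpha * iqr
    return (x < lower) | (x > upper)

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
mask = iqr_outliers(sample)
print(np.asarray(sample)[mask])
```

For this sample Q1 = 3.25 and Q3 = 7.75, so the accepted range is [-3.5, 14.5] and only the value 50 falls outside it.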
Median Absolute Deviation (MAD)
• MAD=Median(|X−Median(X)|)
• Steps:
• Calculate Median
• Calculate deviation between each data point and median
• Make all those deviations nonnegative using their absolute value
• Calculate the median of the above sequence
• MAD plays the same role for the median as Std does for the mean: it measures the spread around it
• Outliers are outside of [Median - α × MAD, Median + α × MAD]
• α = 1.5 is the classical value
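The steps above translate almost line by line into code. The function name and the sample are assumptions made for this sketch; alpha defaults to the value given on the slide.

```python
import numpy as np

def mad_outliers(values, alpha=1.5):
    """Return (MAD, mask) where mask marks points outside Median +/- alpha*MAD."""
    x = np.asarray(values, dtype=float)
    median = np.median(x)            # step 1: median
    abs_dev = np.abs(x - median)     # steps 2-3: absolute deviations
    mad = np.median(abs_dev)         # step 4: median of the deviations
    return mad, abs_dev > alpha * mad

# Hypothetical sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
mad, mask = mad_outliers(sample)
print(mad, np.asarray(sample)[mask])
```

On this sample the median is 5.5 and MAD is 2.5, so with alpha = 1.5 both 1 and 50 are flagged, while calling `mad_outliers(sample, alpha=3.0)` flags only 50. The threshold is therefore worth tuning to the data rather than treating as fixed.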
Benford’s law
Cook’s distance
• Univariate outlier detection methods may lead to the elimination of observations that are outliers for one feature but inliers for the others.
• Cook’s distance is a method for identifying influential points (usually outliers) based on all features in the model. It is calculated as follows:
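• $D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p \, s^2}$
• where $\hat{y}_j$ is the fitted value of observation j from the model estimated on all n observations, $\hat{y}_{j(i)}$ is the fitted value of observation j from the model estimated without observation i, $s^2$ is the MSE of the model and p is the number of coefficients.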
• The higher Di, the larger the influence of observation i. Usually, an observation with Di > 1 is considered an outlier.
• Cook’s distance is usually used for analyzing small datasets.
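A minimal sketch of Cook's distance for an ordinary least squares fit is given below, computed directly from its leave-one-out definition: the model is refitted n times, each time without one observation. The function name, the use of plain NumPy, the inclusion of an intercept in p, and the data are assumptions for illustration; the repeated refitting is also why the approach suits small datasets.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for an OLS fit with intercept, via leave-one-out refits."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    n, p = X.shape                              # p counts the intercept too
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta
    s2 = np.sum((y - fitted) ** 2) / (n - p)    # MSE of the full model
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                # drop observation i
        beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        d[i] = np.sum((fitted - X @ beta_i) ** 2) / (p * s2)
    return d

# Hypothetical data: points on the line y = 2x + 1, with the last one shifted.
x = np.arange(10, dtype=float)
y = 2 * x + 1
y[-1] += 15
d = cooks_distance(x, y)
print(np.round(d, 3))
```

Only the shifted last point exceeds the Di > 1 rule of thumb here; the other points stay well below it even though the shifted point pulls the fitted line away from them.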
Scaling
• In regression/classification analysis, different variables often have different scales
• As a result, the coefficients of those variables will be on different scales as well
• Example: Price measured in AMD (larger scale) gets a smaller coefficient than the same Price measured in USD (smaller scale)
• Scaling is the process of bringing different variables to a common scale
• It ensures that coefficients of variables become comparable to each other.
• Typically, scaling is required for parametric models with numerical optimization, yet it reduces model explainability, because values are hard to interpret after scaling
Methods to scale:
• Standardization (removing mean and scaling to unit variance)
• Min-Max scaling (bringing variable values to a certain range, e.g. making min=0, max=1)
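Both scaling methods can be sketched in a few lines of pandas. The column names, the prices, and the exchange rate of 400 are made up for illustration.

```python
import pandas as pd

# Hypothetical prices in two currencies; the exchange rate is made up.
df = pd.DataFrame({"price_usd": [10.0, 20.0, 30.0, 40.0]})
df["price_amd"] = df["price_usd"] * 400

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (df - df.mean()) / df.std(ddof=0)

# Min-Max scaling: map the minimum to 0 and the maximum to 1.
min_max = (df - df.min()) / (df.max() - df.min())

print(standardized)
print(min_max)
```

After either transformation the USD and AMD columns are identical, which is the sense in which scaling makes coefficients comparable. The cost is that a scaled value no longer reads as a price.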
Thank you
