Professional Documents
Culture Documents
just because they are odd does not mean they are
unimportant
1. Missing value is an observation of a variable, which does not store any data
2. Some algorithms handle them automatically, others do not
3. Recommended to deal with them in the beginning
Note:
Outliers:
• Affect both data description and prediction results
• Some methods are robust, others not
• Recommended to handle them in after observing distortion in results
Methods to detect:
• Mean and Std
• InterQuartile Range (IQR)
• Median Absolute Deviation (MAD)
Mean and Std
1. Assumes normal
distribution
2. Uses mean, which is a
measure not robust to
outliers
3. Is easy and proven to be
useful when data is big
enough
InterQuartile Range (IQR)
• Formula: IQR = Q3 - Q1
• Where: Q1 = 1st quartile, Q3 = 3rd quartile
• Outliers are outside of the following range:
[Q1- α × IQR, Q3+ α × IQR]
• α = 1.5 is the classical value
Median Absolute Deviation (MAD)
• MAD=Median(|X−Median(X)|)
• Steps:
• Calculate Median
• Calculate deviation between each data point and median
• Make all those deviations nonnegative using their absolute value
• Calculate the median of the above sequence
• MAD is like Std for median
• Outliers are outside of [Median-α × MAD,Median+α × MAD]
• α = 1.5 is the classical value
Benford’s law
Cook’s distance
Methods to scale:
• Standardization (removing mean and scaling to unit variance)
• Min-Max scaling (brining variable values to a certain range, e.g. making min=0, max=1)
Thank you