
ADVANCED QUANTITATIVE RESEARCH METHODS
Dr. SATHIAMOORTHY KANNAN
LECTURE 6
DATA CLEANING
DATA CLEANING involves dealing with:
• Unusual data
• Missing data
• Outliers
• Items that need reverse coding
UNUSUAL DATA
• Normally this is a value that results from human error
• It is either much smaller or much larger than the other values in the same variable
How to correct it?
• Go to the case and make the appropriate correction
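
A minimal Python sketch of how such a value might be located before going to the case. The function name `find_out_of_range`, the age values, and the plausible range of 17 to 65 are assumptions made up for illustration; they are not part of the lecture.

```python
def find_out_of_range(values, low, high):
    """Return (case_number, value) pairs for values outside [low, high].

    Case numbers start at 1, as rows do in a typical data sheet.
    """
    return [
        (case, value)
        for case, value in enumerate(values, start=1)
        if not (low <= value <= high)
    ]


# Hypothetical ages of respondents; 230 and 2 look like typing errors.
ages = [21, 23, 230, 25, 22, 2, 24]

# The plausible range is an assumption the researcher has to supply.
flagged = find_out_of_range(ages, low=17, high=65)
print(flagged)  # [(3, 230), (6, 2)]
```

The sketch only flags cases. Deciding on the correct value still means going back to the questionnaire or source record for that case.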
Let’s do it
MISSING DATA

• Data that are missing at some points in a variable
HOW TO CORRECT MISSING DATA?

• Use a new category if the variable is categorical
• Replace with the mean of the variable if it is interval/ratio
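
The two corrections above could be sketched in Python as follows. This is an illustration only: missing values are assumed to be stored as `None`, and the helper names `fill_categorical` and `fill_with_mean`, the label "Missing", and the sample data are all hypothetical.

```python
from statistics import mean


def fill_categorical(values, label="Missing"):
    """Put missing entries (None) into a new category of their own."""
    return [label if v is None else v for v in values]


def fill_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    observed_mean = mean(observed)
    return [observed_mean if v is None else v for v in values]


# Hypothetical data with one missing value in each variable.
gender = ["F", None, "M", "F"]          # categorical
scores = [10.0, None, 14.0, 12.0]       # interval/ratio

filled_gender = fill_categorical(gender)
filled_scores = fill_with_mean(scores)
print(filled_gender)  # ['F', 'Missing', 'M', 'F']
print(filled_scores)  # [10.0, 12.0, 14.0, 12.0]
```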
Let’s do it
OUTLIERS
What are Outliers?
Outliers are…
• Data points that are far away from the main distribution
• There are 2 types:
• Mild outliers: data points that lie between 1.5 x IQR and 3 x IQR below Q1 or above Q3
• Extreme outliers: data points that lie more than 3 x IQR below Q1 or above Q3
• IQR stands for Interquartile Range = Q3 - Q1
How to detect Outliers?
• Outliers are detected using boxplots
• Extreme Outlier (*) = More than 3 x IQR below Q1 or above Q3
• Mild Outlier = Between 1.5 x IQR and 3 x IQR below Q1 or above Q3

Another criterion for outlier detection:
• Extreme Outlier (*) : More than 3 x IQR
• Mild Outlier : Between 2.2 x IQR and 3 x IQR
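
A rough Python sketch of both criteria, applied to the data set used in the winsorizing example later in these slides. The function name `classify_outliers` and its parameters are hypothetical. Quartiles are computed with the standard library's "inclusive" method, which is an assumption: statistical packages use slightly different quartile rules, so a boxplot may place its fences a little differently.

```python
from statistics import quantiles


def classify_outliers(values, mild_k=1.5, extreme_k=3.0):
    """Split outliers into (mild, extreme) using IQR multiples.

    A value counts as an outlier when its distance below Q1 or above Q3
    exceeds mild_k x IQR; beyond extreme_k x IQR it counts as extreme.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        # Distance outside the box; zero or negative means inside the box.
        distance = max(q1 - v, v - q3)
        if distance > extreme_k * iqr:
            extreme.append(v)
        elif distance > mild_k * iqr:
            mild.append(v)
    return mild, extreme


data = [0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26, 29,
        32, 33, 35, 39, 40, 41, 44, 99, 125]

print(classify_outliers(data))              # ([99], [125])
print(classify_outliers(data, mild_k=2.2))  # ([99], [125])
```

With these quartiles (Q1 = 17.5, Q3 = 39.25, IQR = 21.75), 99 is a mild outlier and 125 an extreme one under both criteria. The low values 0.1 and 1 fall inside the lower fence.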
The Winsorizing Process
• Winsorizing is a way to minimize the influence of outliers in the data by either:
• Assigning the outlier a lower weight, or
• Changing the value so that it is closer to other values in the set.
Why winsorize?
• The purpose of winsorization is to reduce the impact of extreme observations on the mean and standard deviation of the data
How to winsorize?
• Before winsorizing, make sure the outlier is not a
result of measurement error or some other fixable
error.

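• Original data: {0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, 99, 125}
• Existing Mean = 33.405
• Winsorized data (the two smallest values are replaced by 12 and the two largest by 44): {12, 12, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, 44, 44}
• New Mean = 27.75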
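
The arithmetic of this example can be checked with a short Python sketch. The helper `winsorize` and its parameter `k` (the number of values replaced at each end) are hypothetical names for illustration. Setting `k = 2` is inferred from the example, in which two values at each end of the 20 values are changed.

```python
from statistics import mean


def winsorize(values, k):
    """Replace the k smallest and k largest values with the nearest kept value.

    Returns the winsorized values in ascending order.
    """
    ordered = sorted(values)
    n = len(ordered)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must be non-negative and smaller than half the sample size")
    low = ordered[k]           # smallest value that is kept
    high = ordered[n - k - 1]  # largest value that is kept
    return [min(max(v, low), high) for v in ordered]


data = [0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26, 29,
        32, 33, 35, 39, 40, 41, 44, 99, 125]

winsorized = winsorize(data, k=2)
print(winsorized)
print(round(mean(data), 3))        # 33.405
print(round(mean(winsorized), 3))  # 27.75
```

Replacing values changes the data, so it is worth reporting how many values were winsorized alongside the new mean.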


Let’s do it
