Applied Statistics Outliers Chapter 2

APPLIED STATISTICS
Chapter 2
The beast of bias
Instructor: Ms. Umme Siddiqa

OUTLIER
 An outlier is a score very different from the rest of the data.
 When we analyze data we have to be aware of such values because they can bias the statistics
we compute from the data.
 The mean is an example of a statistic that an outlier can bias. For instance, suppose you have
data with four values of 10, 10, 10, 10 and one extreme value of 1. Although almost all the
values are 10, the mean drops to 8.2 because of the one extreme value of 1.
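The effect of the extreme value on the mean can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative only.

```python
from statistics import mean

scores = [10, 10, 10, 10, 1]   # four typical scores and one extreme score

mean_with_outlier = mean(scores)          # (10 + 10 + 10 + 10 + 1) / 5
mean_without_outlier = mean(scores[:-1])  # the same data with the extreme score left out

print(mean_with_outlier)     # 8.2
print(mean_without_outlier)  # 10
```

A single extreme score pulls the mean from 10 down to 8.2, even though four of the five scores are exactly 10.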
DEALING WITH OUTLIERS
 If you detect outliers in the data there are several options for reducing the impact of these
values.
 However, before you do any of these things, it’s worth checking that the data have been
entered correctly for the problem cases.
 If the data are correct then there are three main options you can use to deal with outliers.
1. REMOVING THE CASE
 This entails deleting the data from the person who contributed the outlier.
 However, this should be done only if you have good reason to believe that
this case is not from the population that you intended to sample.
 For example, you are taking data from Pakistanis about the spices in their
food and one respondent has never eaten Pakistani food because he is
not from Pakistan. You can delete this case because it is not from
the targeted population.
2. TRANSFORMING THE DATA
 Outliers tend to skew the distribution and this skew (and, therefore, the impact of the outliers)
can sometimes be reduced by applying transformations to the data.
 A transformation is a mathematical procedure (such as taking the square root) applied to each
score in a sample, usually to make the sample distribution closer to normal.
 Once a transformation makes the scores in the sample appear to meet the normality
assumption (and if the other assumptions are met), you can go ahead with the usual t test,
analysis of variance or significance test of a correlation or regression.
 Some robust versions of these procedures use a trimmed mean. A trimmed mean is simply a mean
based on the distribution of scores after some percentage of scores has been removed from each
extreme of the distribution. So, a 10% trimmed mean will remove 10% of scores from the top and
10% from the bottom before the mean is calculated.
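A trimmed mean can be sketched in a few lines. The helper `trimmed_mean` and the example scores below are hypothetical, written only to illustrate the definition above. The sketch assumes the proportion is below 0.5, so that some scores remain after trimming.

```python
from statistics import mean

def trimmed_mean(scores, proportion):
    """Mean after dropping `proportion` of the scores from each end."""
    ordered = sorted(scores)
    k = int(len(ordered) * proportion)      # number of scores cut from each tail
    trimmed = ordered[k:len(ordered) - k]   # keep only the middle scores
    return mean(trimmed)

# Hypothetical data: ten scores with one extreme value at each end
scores = [1, 8, 9, 9, 10, 10, 10, 11, 11, 40]

ordinary = mean(scores)                # uses all ten scores
trimmed = trimmed_mean(scores, 0.10)   # drops the lowest (1) and highest (40) score

print(ordinary, trimmed)
```

With ten scores, a 10% trim removes one score from each tail, so the two extreme values no longer influence the result.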
A. Log transformation (log(Xi)): Taking the logarithm of a set of numbers squashes the right
tail of the distribution. As such it’s a good way to reduce positive skew.
 However, the logarithm of zero or of a negative number is undefined, so if your data include
zeros or negative numbers you need to add a constant to all of the data before you do the
transformation.
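As a rough illustration, the log transformation with an added constant might look like this. The scores and the constant of 1 are assumptions chosen for the example; any constant that makes every score positive would serve.

```python
import math

# Hypothetical positively skewed scores, including a zero
scores = [0, 1, 2, 3, 5, 8, 40]

# log(0) is undefined, so shift every score by a constant first.
# The constant 1 is an assumption made for this sketch.
constant = 1
logged = [math.log(x + constant) for x in scores]

print([round(v, 2) for v in logged])
```

The order of the scores is preserved, but the gap between the largest score and the rest shrinks, which is what reduces the positive skew.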
B. Square root transformation (√Xi): Taking the square root of large values has more of an
effect than taking the square root of small values.
 Consequently, taking the square root of each of your scores will bring any large scores closer
to the centre – rather like the log transformation. As such, this can be a useful way to reduce
positive skew; however, you still have the same problem with negative numbers (negative
numbers don’t have a real square root).
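A small sketch shows how the square root affects large scores more than small ones. The scores are hypothetical and chosen to be perfect squares so the arithmetic is easy to follow.

```python
import math

scores = [1, 4, 9, 16, 100]               # hypothetical scores; 100 is far from the rest
rooted = [math.sqrt(x) for x in scores]   # every score must be zero or positive

print(rooted)   # [1.0, 2.0, 3.0, 4.0, 10.0]
```

The gap between the two largest scores falls from 84 (100 versus 16) to 6 (10 versus 4), while the gap between the two smallest falls only from 3 to 1.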
C. Reciprocal transformation (1/Xi): Dividing 1 by each score also reduces the impact of large
scores. The transformed variable will have a lower limit of 0 (very large numbers will become
close to 0).
 One thing to bear in mind with this transformation is that it reverses the scores: scores that
were originally large in the data set become small (close to zero) after the transformation, but
scores that were originally small become big after the transformation. For example, imagine
two scores of 1 and 10; after the transformation they become 1/1 = 1, and 1/10 = 0.1: the small
score becomes bigger than the large score after the transformation.
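The reversal of order can be seen directly in a short sketch. The scores are hypothetical and must all be non-zero, since division by zero is undefined.

```python
scores = [1, 2, 5, 10]                  # hypothetical scores in ascending order
reciprocals = [1 / x for x in scores]   # each score must be non-zero

print(reciprocals)   # [1.0, 0.5, 0.2, 0.1]
```

The original scores run from small to large, whereas the transformed scores run from large to small.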
D. Reverse score transformations: Any one of the above transformations can be used to
correct negatively skewed data, but first you have to reverse the scores. To do this, subtract
each score from the highest score obtained (or from the highest score plus 1, so that no
reversed score is 0).
 If you do this, don’t forget to reverse the scores back afterwards, or to remember that the
interpretation of the variable is reversed: big scores have become small and small scores have
become big!
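The reversal step followed by a transformation can be sketched as follows. The scores are hypothetical. The square root is used here as one possible choice because it tolerates the zeros that reversal produces for the highest scores.

```python
import math

scores = [1, 7, 8, 9, 9, 10, 10]   # hypothetical negatively skewed scores

highest = max(scores)
reversed_scores = [highest - x for x in scores]     # big scores become small, and vice versa
rooted = [math.sqrt(r) for r in reversed_scores]    # the skew is now positive, so sqrt can reduce it

print(reversed_scores)   # [9, 3, 2, 1, 1, 0, 0]
print(rooted)
```

After this step a high value of the transformed variable corresponds to a low original score, so the interpretation of the variable is flipped.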
3. CHANGING THE SCORE
 If transformation fails, then you can consider replacing the score.
 This on the face of it may seem like cheating (you’re changing the data from what was
actually collected); however, if the score you’re changing is very unrepresentative and biases
your data anyway then changing the score is the lesser of two evils!
There are several options for how to change the score:
A. The next highest score plus one: Change the score to
be one unit above the next highest score in the data
set.
B. Convert back from a z-score: Choose a z-score that marks the
boundary of an outlier (for example 3), convert it back to a
raw score using X = (z × standard deviation) + mean, and
replace the outlier with that score.
C. The mean plus two standard deviations: A variation
on the above method is to use the mean plus two
times the standard deviation.
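The three options can be compared on a small hypothetical data set. Two choices in this sketch are assumptions rather than anything the slides specify: the z-score cut-off of 3, and computing the mean and standard deviation from the scores without the outlier so that the outlier does not inflate them.

```python
from statistics import mean, stdev

scores = [10, 11, 12, 12, 13, 14, 45]   # hypothetical data; 45 is the outlier
outlier = max(scores)
rest = [x for x in scores if x != outlier]

# Assumption: mean and SD are taken from the remaining scores only
m = mean(rest)
s = stdev(rest)   # sample standard deviation

# A. The next highest score plus one
option_a = max(rest) + 1

# B. Convert back from a z-score: X = (z * s) + mean, with z = 3 as an assumed cut-off
z = 3
option_b = z * s + m

# C. The mean plus two standard deviations
option_c = m + 2 * s

print(option_a, round(option_b, 2), round(option_c, 2))
```

All three replacements keep the score as the highest in the data set while bringing it much closer to the other scores.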
