
Outlier

In statistics, an outlier is a data point that differs significantly from other observations.[1][2] An outlier may be due to variability in the measurement, an indication of novel data, or the result of experimental error; the latter are sometimes excluded from the data set.[3][4] An outlier can be an indication of an exciting possibility, but it can also cause serious problems in statistical analyses.

Figure 1. Box plot of data from the Michelson–Morley experiment displaying four outliers in the middle column, as well as one outlier in the first column.

Outliers can occur by chance in any distribution, but they can indicate novel behaviour or structures in the data set, measurement error, or that the population has a heavy-tailed distribution. In the case of measurement error, one wishes to discard them or use statistics that are robust to outliers, while in the case of heavy-tailed distributions, they indicate that the distribution has high skewness and that one should be very cautious in using tools or intuitions that assume a normal distribution. A frequent cause of outliers is a mixture of two distributions, which may be two distinct sub-populations, or may indicate 'correct trial' versus 'measurement error'; this is modeled by a mixture model.

In most larger samplings of data, some data points will be further away from the sample mean than what is deemed reasonable. This can be due to incidental systematic error or flaws in the theory that generated an assumed family of probability distributions, or it may be that some observations are far from the center of the data. Outlier points can therefore indicate faulty data, erroneous procedures, or areas where a certain theory might not be valid. However, in large samples, a small number of outliers is to be expected (and not due to any anomalous condition).

Outliers, being the most extreme observations, may include the sample maximum or sample minimum, or both, depending on whether they are extremely high or low. However, the sample maximum and minimum are not always outliers because they may not be unusually far from other observations.

Naive interpretation of statistics derived from data sets that include outliers may be misleading. For example, if one is calculating the average temperature of 10 objects in a room, and nine of them are between 20 and 25 degrees Celsius, but an oven is at 175 °C, the median of the data will be between 20 and 25 °C but the mean temperature will be between 35.5 and 40 °C. In this case, the median better reflects the temperature of a randomly sampled object (but not the temperature in the room) than the mean; naively interpreting the mean as "a typical sample", equivalent to the median, is incorrect. As illustrated in this case, outliers may indicate data points that belong to a different population than the rest of the sample set.

Estimators capable of coping with outliers are said to be robust: the median is a robust statistic of central tendency, while the mean is not.[5] However, the mean is generally a more precise estimator.[6]
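
As a quick numerical check of the example above (the nine "ordinary" temperatures below are illustrative values chosen within the 20–25 °C range), the median stays with the bulk of the data while the mean is pulled toward the oven:

    # Illustration of the temperature example; any nine values in 20-25 °C behave similarly.
    import statistics

    temps = [20, 21, 21, 22, 23, 23, 24, 24, 25, 175]  # nine objects plus an oven at 175 °C
    print(statistics.median(temps))  # 23.0 -- stays between 20 and 25 °C
    print(statistics.mean(temps))    # 37.8 -- in the 35.5-40 °C range stated above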

Occurrence and causes

Relative probabilities in a normal distribution

In the case of normally distributed data, the three sigma rule means that roughly 1 in 22 observations will differ by twice the standard deviation or more from the mean, and 1 in 370 will deviate by three times the standard deviation.[7] In a sample of 1000 observations, the presence of up to five observations deviating from the mean by more than three times the standard deviation is within the range of what can be expected, being less than twice the expected number and hence within 1 standard deviation of the expected number – see Poisson distribution – and not indicative of an anomaly. If the sample size is only 100, however, just three such outliers are already reason for concern, being more than 11 times the expected number.

In general, if the nature of the population distribution is known a priori, it is possible to test whether the number of outliers deviates significantly from what can be expected: for a given cutoff (so samples fall beyond the cutoff with probability p) of a given distribution, the number of outliers will follow a binomial distribution with parameter p, which can generally be well approximated by the Poisson distribution with λ = pn. Thus if one takes a normal distribution with a cutoff 3 standard deviations from the mean, p is approximately 0.3%, and thus for 1000 trials one can approximate the number of samples whose deviation exceeds 3 sigmas by a Poisson distribution with λ = 3.
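
A minimal sketch of this calculation (SciPy is an assumed dependency): the two-sided tail probability beyond three standard deviations, the Poisson-approximated expected count, and the chance of seeing three or more such outliers for n = 1000 versus n = 100:

    # Expected count of 3-sigma outliers and how surprising 3 of them would be.
    from scipy.stats import norm, poisson

    p = 2 * norm.sf(3)              # two-sided tail probability beyond 3 sigma, ~0.0027
    for n in (1000, 100):
        lam = p * n                 # Poisson approximation to the binomial count
        print(n, lam, poisson.sf(2, lam))   # P(3 or more such outliers)
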
Causes

Outliers can have many anomalous causes. A physical apparatus for taking measurements may have suffered a transient malfunction. There may have been an error in data transmission or transcription. Outliers arise due to changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. A sample may have been contaminated with elements from outside the population being examined. Alternatively, an outlier could be the result of a flaw in the assumed theory, calling for further investigation by the researcher. Additionally, the pathological appearance of outliers of a certain form appears in a variety of datasets, indicating that the causative mechanism for the data might differ at the extreme end (King effect).

Definitions and detection

There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.[8] There are various methods of outlier detection, some of which are treated as synonymous with novelty detection.[9][10][11][12][13] Some are graphical, such as normal probability plots. Others are model-based. Box plots are a hybrid.

Model-based methods which are commonly used for identification assume that the data are from a normal distribution and identify observations which are deemed "unlikely" based on mean and standard deviation (a rough sketch of one such test follows the list):

Chauvenet's criterion
Grubbs's test for outliers
Dixon's Q test
ASTM E178: Standard Practice for Dealing With Outlying Observations[14]
Mahalanobis distance and leverage are often used to detect outliers, especially in the development of linear regression models.
Subspace and correlation based techniques for high-dimensional numerical data[13]
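
As a rough sketch of the normal-assumption approach referred to above, the following implements Grubbs's test for a single (two-sided) outlier; the critical value follows the usual textbook formulation and SciPy is an assumed dependency:

    # Grubbs's test for one outlier under a normality assumption.
    import numpy as np
    from scipy import stats

    def grubbs_single_outlier(x, alpha=0.05):
        x = np.asarray(x, dtype=float)
        n = len(x)
        g = np.max(np.abs(x - x.mean())) / x.std(ddof=1)    # Grubbs statistic
        t = stats.t.ppf(1 - alpha / (2 * n), n - 2)          # critical t value
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
        return g, g_crit, g > g_crit                         # flag if G exceeds the critical value

    print(grubbs_single_outlier([20, 21, 21, 22, 23, 23, 24, 24, 25, 175]))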

Peirce's criterion

It is proposed to determine in a series of observations the limit of error, beyond which all observations involving so great an error may be rejected, provided there are as many as such observations. The principle upon which it is proposed to solve this problem is, that the proposed observations should be rejected when the probability of the system of errors obtained by retaining them is less than that of the system of errors obtained by their rejection multiplied by the probability of making so many, and no more, abnormal observations. (Quoted in the editorial note on page 516 to Peirce (1982 edition) from A Manual of Astronomy 2:558 by Chauvenet.)[15][16][17][18]
Tukey's fences

Other methods flag observations based on measures such as the interquartile range. For example, if $Q_1$ and $Q_3$ are the lower and upper quartiles respectively, then one could define an outlier to be any observation outside the range

$$[\,Q_1 - k(Q_3 - Q_1),\; Q_3 + k(Q_3 - Q_1)\,]$$

for some nonnegative constant $k$. John Tukey proposed this test, where $k = 1.5$ indicates an "outlier", and $k = 3$ indicates data that is "far out".[19]
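
A minimal sketch of Tukey's fences as defined above, with the data set being an arbitrary illustration:

    # Flag observations outside [Q1 - k*IQR, Q3 + k*IQR].
    import numpy as np

    def tukey_fences(x, k=1.5):
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        lo, hi = q1 - k * iqr, q3 + k * iqr
        return [v for v in x if v < lo or v > hi]

    data = [20, 21, 21, 22, 23, 23, 24, 24, 25, 175]
    print(tukey_fences(data, k=1.5))   # "outliers"
    print(tukey_fences(data, k=3.0))   # "far out" points
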
In anomaly detection

In various domains such as, but not limited to, statistics, signal processing, finance, econometrics, manufacturing, networking and data mining, the task of anomaly detection may take other approaches. Some of these may be distance-based[20][21] and density-based such as Local Outlier Factor (LOF).[22] Some approaches may use the distance to the k-nearest neighbors to label observations as outliers or non-outliers.[23]
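
A short sketch of density-based detection with the Local Outlier Factor, assuming scikit-learn's LocalOutlierFactor is available; the two-dimensional data here are synthetic:

    # Density-based outlier labelling with LOF.
    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # inliers
                   [[8.0, 8.0]]])                     # one far-away point
    labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
    print(np.where(labels == -1)[0])                  # indices flagged as outliers (-1)
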
Modified Thompson Tau test

The modified Thompson Tau test is a method used to determine if an outlier exists in a data set. The strength of this method lies in the fact that it takes into account a data set's standard deviation and average, and provides a statistically determined rejection zone, thus providing an objective method to determine if a data point is an outlier.[24] How it works: First, a data set's average is determined. Next, the absolute deviation between each data point and the average is determined. Thirdly, a rejection region is determined using the formula:

$$\text{Rejection Region} = \frac{t_{\alpha/2}\,(n-1)}{\sqrt{n}\,\sqrt{n-2+t_{\alpha/2}^{2}}}\; s\,;$$

where $t_{\alpha/2}$ is the critical value from the Student $t$ distribution with $n-2$ degrees of freedom, $n$ is the sample size, and $s$ is the sample standard deviation. To determine if a value is an outlier, calculate $\delta = |x_i - \bar{x}|$. If $\delta >$ Rejection Region, the data point is an outlier. If $\delta \le$ Rejection Region, the data point is not an outlier.

The modified Thompson Tau test is used to find one outlier at a time (the largest value of δ is removed if it is an outlier). That is, if a data point is found to be an outlier, it is removed from the data set and the test is applied again with a new average and rejection region. This process is continued until no outliers remain in the data set.
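
A hedged sketch of the iterative procedure just described; the tau expression follows the reconstruction given above and should be treated as an illustration rather than a canonical reference implementation:

    # Iterative modified Thompson tau screening: test the most extreme
    # point, remove it if flagged, and repeat with the reduced data set.
    import numpy as np
    from scipy import stats

    def modified_thompson_tau(x, alpha=0.05):
        x = list(map(float, x))
        outliers = []
        while len(x) > 2:
            n = len(x)
            mean, s = np.mean(x), np.std(x, ddof=1)
            t = stats.t.ppf(1 - alpha / 2, n - 2)
            tau = t * (n - 1) / (np.sqrt(n) * np.sqrt(n - 2 + t**2))
            deltas = np.abs(np.array(x) - mean)
            i = int(np.argmax(deltas))          # only the most extreme point is tested
            if deltas[i] > tau * s:
                outliers.append(x.pop(i))       # remove and recompute mean, s, and the region
            else:
                break
        return outliers

    print(modified_thompson_tau([20, 21, 21, 22, 23, 23, 24, 24, 25, 175]))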

Some work has also examined outliers for nominal (or categorical) data. In the context of a set of examples (or instances) in a data set, instance hardness measures the probability that an instance will be misclassified ($1 - p(y \mid x)$, where $y$ is the assigned class label and $x$ represents the input attribute values for an instance in the training set $t$).[25] Ideally, instance hardness would be calculated by summing over the set of all possible hypotheses $H$:

$$IH(\langle x, y \rangle) = \sum_{h \in H} \bigl(1 - p(y \mid x, h)\bigr)\, p(h \mid t).$$

Practically, this formulation is unfeasible as $H$ is potentially infinite and calculating $p(h \mid t)$ is unknown for many algorithms. Thus, instance hardness can be approximated using a diverse subset $L \subset H$:

$$IH_L(\langle x, y \rangle) = \frac{1}{|L|} \sum_{j=1}^{|L|} \bigl(1 - p(y \mid x, g_j(t, \alpha))\bigr),$$

where $g_j(t, \alpha)$ is the hypothesis induced by learning algorithm $g_j$ trained on training set $t$ with hyperparameters $\alpha$. Instance hardness provides a continuous value for determining if an instance is an outlier instance.
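
A rough sketch of approximating instance hardness with a small, diverse set of learners, assuming scikit-learn; the data set and the choice of learners are illustrative and are not those used in the cited work:

    # Average (1 - p(true class)) over a few fitted hypotheses.
    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    learners = [LogisticRegression(max_iter=1000), GaussianNB(),
                DecisionTreeClassifier(max_depth=3, random_state=0)]

    per_learner = []
    for clf in learners:
        clf.fit(X, y)
        p_true = clf.predict_proba(X)[np.arange(len(y)), y]   # p(y | x, g_j(t))
        per_learner.append(1.0 - p_true)
    hardness = np.mean(per_learner, axis=0)
    print(np.argsort(hardness)[-5:])   # indices of the hardest (most outlier-like) instances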

Working with outliers

The choice of how to deal with an outlier should depend on the cause. Some estimators are highly sensitive to outliers, notably estimation of covariance matrices.
Retention

Even when a normal distribution model is appropriate to the data being analyzed, outliers are expected for large sample sizes and should not automatically be discarded if that is the case.[26] Instead, one should use a method that is robust to outliers to model or analyze data with naturally occurring outliers.[26]

Exclusion

When deciding whether to remove an outlier, the cause has to be considered. As mentioned earlier, if the outlier's origin can be attributed to an experimental error, or if it can be otherwise determined that the outlying data point is erroneous, it is generally recommended to remove it.[26][27] However, it is more desirable to correct the erroneous value, if possible.

Removing a data point solely because it is an outlier, on the other hand, is a controversial practice, often frowned upon by many scientists and science instructors, as it typically invalidates statistical results.[26][27] While mathematical criteria provide an objective and quantitative method for data rejection, they do not make the practice more scientifically or methodologically sound, especially in small sets or where a normal distribution cannot be assumed. Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known.

The two common approaches to exclude outliers are truncation (or trimming) and Winsorising. Trimming discards the outliers whereas Winsorising replaces the outliers with the nearest "nonsuspect" data.[28] Exclusion can also be a consequence of the measurement process, such as when an experiment is not entirely capable of measuring such extreme values, resulting in censored data.[29]
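
A small sketch contrasting trimming with a percentile-clipping form of Winsorising; the 5th/95th percentile cut points are an illustrative choice, and strict Winsorising replaces outliers with the nearest retained observation rather than the percentile value:

    # Trim vs. clip the extremes of a small sample.
    import numpy as np

    x = np.array([20, 21, 21, 22, 23, 23, 24, 24, 25, 175], dtype=float)
    lo, hi = np.percentile(x, [5, 95])

    trimmed = x[(x >= lo) & (x <= hi)]   # discard the extreme values
    winsorised = np.clip(x, lo, hi)      # cap extreme values at the chosen percentiles

    print(trimmed.mean(), winsorised.mean())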

In regression problems, an alternative approach may be to only exclude points which exhibit a large degree of influence on the estimated coefficients, using a measure such as Cook's distance.[30]
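
A hedged sketch of screening regression points by Cook's distance using statsmodels; the 4/n cutoff is a common rule of thumb rather than something prescribed by the text above:

    # Fit an OLS model and list points with unusually large Cook's distance.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 2 * x + rng.normal(scale=0.5, size=50)
    y[0] += 10                                   # plant one influential point

    model = sm.OLS(y, sm.add_constant(x)).fit()
    cooks_d = model.get_influence().cooks_distance[0]
    print(np.where(cooks_d > 4 / len(x))[0])     # candidate points to review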

If a data point (or points) is excluded from the data analysis, this should be clearly stated in any subsequent report.

Non-normal distributions

The possibility should be considered that the underlying distribution of the data is not approximately normal, having "fat tails". For instance, when sampling from a Cauchy distribution,[31] the sample variance increases with the sample size, the sample mean fails to converge as the sample size increases, and outliers are expected at far larger rates than for a normal distribution. Even a slight difference in the fatness of the tails can make a large difference in the expected number of extreme values.
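
A quick numerical illustration of this behaviour: for samples drawn from a standard Cauchy distribution, the running sample mean does not settle down as the sample size grows (the seed and sizes are arbitrary):

    # The Cauchy sample mean fails to converge.
    import numpy as np

    rng = np.random.default_rng(1)
    samples = rng.standard_cauchy(100_000)
    for n in (100, 1_000, 10_000, 100_000):
        print(n, samples[:n].mean())    # means jump around instead of converging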

Set-membership uncertainties

A set membership approach considers that the uncertainty corresponding to the ith measurement of an unknown random vector x is represented by a set Xi (instead of a probability density function). If no outliers occur, x should belong to the intersection of all Xi's. When outliers occur, this intersection could be empty, and we should relax a small number of the sets Xi (as small as possible) in order to avoid any inconsistency.[32] This can be done using the notion of q-relaxed intersection. As illustrated by the figure, the q-relaxed intersection corresponds to the set of all x which belong to all sets except q of them. Sets Xi that do not intersect the q-relaxed intersection could be suspected to be outliers.

Figure 5. q-relaxed intersection of 6 sets for q=2 (red), q=3 (green), q=4 (blue), q=5 (yellow).
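
A minimal sketch of a q-relaxed intersection for scalar interval measurements: the set of values contained in at least N − q of the N intervals, computed by a simple endpoint sweep (the intervals here are illustrative):

    # Regions covered by at least (N - q) of the given intervals.
    def q_relaxed_intersection(intervals, q):
        events = []
        for lo, hi in intervals:
            events += [(lo, 1), (hi, -1)]
        events.sort()
        need = len(intervals) - q
        regions, count, start = [], 0, None
        for point, step in events:
            count += step
            if count >= need and start is None:
                start = point                     # coverage rises to the threshold
            elif count < need and start is not None:
                regions.append((start, point))    # coverage falls below it
                start = None
        return regions

    intervals = [(0, 2), (1, 3), (1.5, 4), (10, 12)]   # the last set behaves as an outlier
    print(q_relaxed_intersection(intervals, q=0))      # [] -- the plain intersection is empty
    print(q_relaxed_intersection(intervals, q=1))      # [(1.5, 2)]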

Alternative models

In cases where the cause of the outliers is known, it may be possible to incorporate this effect into the model structure, for example by using a hierarchical Bayes model, or a mixture model.[33][34]

See also
Anomaly (natural sciences)
Novelty detection
Anscombe's quartet
Data transformation (statistics)
Extreme value theory
Influential observation
Random sample consensus
Robust regression
Studentized residual
Winsorizing

References
1. Grubbs, F. E. (February 1969). "Procedures for detecting outlying observations in samples". Technometrics. 11 (1): 1–21. doi:10.1080/00401706.1969.10490657. "An outlying observation, or 'outlier,' is one that appears to deviate markedly from other members of the sample in which it occurs."

2. Maddala, G. S. (1992). "Outliers" (https://books.google.com/books?id=nBS3AAAAIAAJ&pg=PA89). Introduction to Econometrics (2nd ed.). New York: MacMillan. p. 89 (https://archive.org/details/introductiontoec00madd/page/89). ISBN 978-0-02-374545-4. "An outlier is an observation that is far removed from the rest of the observations."

3. Pimentel, M. A.; Clifton, D. A.; Clifton, L.; Tarassenko, L. (2014). "A review of novelty detection". Signal Processing. 99: 215–249.

4. Grubbs 1969, p. 1, stating "An outlying observation may be merely an extreme manifestation of the random variability inherent in the data. ... On the other hand, an outlying observation may be the result of gross deviation from prescribed experimental procedure or an error in calculating or recording the numerical value."

5. Ripley, Brian D. (2004). Robust statistics (http://www.stats.ox.ac.uk/pub/StatMeth/Robust.pdf). Archived (https://web.archive.org/web/20121021081319/http://www.stats.ox.ac.uk/pub/StatMeth/Robust.pdf) 2012-10-21 at the Wayback Machine.

6. Mukherjee, Chandan; White, Howard; Wuyts, Marc (1998). Econometrics and Data Analysis for Developing Countries, Vol. 1 (https://books.google.com/books?id=H-lkYmatYtAC&dq=median+is+less+precise+than+mean&pg=PA60).

7. Ruan, Da; Chen, Guoqing; Kerre, Etienne (2005). Wets, G. (ed.). Intelligent Data Mining: Techniques and Applications (https://archive.org/details/intelligentdatam00ruan_742). Studies in Computational Intelligence Vol. 5. Springer. p. 318. ISBN 978-3-540-26256-5.
8. Zimek, Arthur; Filzmoser, Peter (2018). "There and back again: Outlier detection between statistical reasoning and data mining algorithms" (https://web.archive.org/web/20211114121638/https://findresearcher.sdu.dk:8443/ws/files/153197807/There_and_Back_Again.pdf) (PDF). Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 8 (6): e1280. doi:10.1002/widm.1280. ISSN 1942-4787. S2CID 53305944. Archived from the original (https://findresearcher.sdu.dk:8443/ws/files/153197807/There_and_Back_Again.pdf) (PDF) on 2021-11-14. Retrieved 2019-12-11.

9. Pimentel, M. A.; Clifton, D. A.; Clifton, L.; Tarassenko, L. (2014). "A review of novelty detection". Signal Processing. 99: 215–249.

10. Rousseeuw, P.; Leroy, A. (1996). Robust Regression and Outlier Detection (3rd ed.). John Wiley & Sons.

11. Hodge, Victoria J.; Austin, Jim (2004). "A Survey of Outlier Detection Methodologies". Artificial Intelligence Review. 22 (2): 85–126. CiteSeerX 10.1.1.109.1943. doi:10.1023/B:AIRE.0000045502.10941.a9. S2CID 3330313.

12. Barnett, Vic; Lewis, Toby (1994) [1978]. Outliers in Statistical Data (3rd ed.). Wiley. ISBN 978-0-471-93094-5.

13. Zimek, A.; Schubert, E.; Kriegel, H.-P. (2012). "A survey on unsupervised outlier detection in high-dimensional numerical data". Statistical Analysis and Data Mining. 5 (5): 363–387. doi:10.1002/sam.11161. S2CID 6724536.

14. E178: Standard Practice for Dealing With Outlying Observations (https://www.nrc.gov/docs/ML1023/ML102371244.pdf).

15. Peirce, Benjamin (1852). "Criterion for the Rejection of Doubtful Observations" (http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1852AJ......2..161P;data_type=PDF_HIGH). Astronomical Journal II 45, and Errata to the original paper (http://articles.adsabs.harvard.edu/cgi-bin/nph-iarticle_query?1852AJ......2..176P;data_type=PDF_HIGH).

16. Peirce, Benjamin (May 1877 – May 1878). "On Peirce's criterion". Proceedings of the American Academy of Arts and Sciences. 13: 348–351. doi:10.2307/25138498. JSTOR 25138498.

17. Peirce, Charles Sanders (1873) [1870]. "Appendix No. 21. On the Theory of Errors of Observation". Report of the Superintendent of the United States Coast Survey Showing the Progress of the Survey During the Year 1870: 200–224. NOAA PDF Eprint (http://docs.lib.noaa.gov/rescue/cgs/001_pdf/CSC-0019.PDF#page=215) (goes to Report p. 200, PDF's p. 215).

18. Peirce, Charles Sanders (1986) [1982]. "On the Theory of Errors of Observation". In Kloesel, Christian J. W.; et al. (eds.). Writings of Charles S. Peirce: A Chronological Edition (https://archive.org/details/writingsofcharle0002peir/page/140). Vol. 3, 1872–1878. Bloomington, Indiana: Indiana University Press. pp. 140–160. ISBN 978-0-253-37201-7. – Appendix 21, according to the editorial note on page 515.

19. Tukey, John W. (1977). Exploratory Data Analysis (https://archive.org/details/exploratorydataa00tuke_0). Addison-Wesley. ISBN 978-0-201-07616-5. OCLC 3058187.

20. Knorr, E. M.; Ng, R. T.; Tucakov, V. (2000). "Distance-based outliers: Algorithms and applications". The VLDB Journal – The International Journal on Very Large Data Bases. 8 (3–4): 237. CiteSeerX 10.1.1.43.1842. doi:10.1007/s007780050006. S2CID 11707259.

21. Ramaswamy, S.; Rastogi, R.; Shim, K. (2000). "Efficient algorithms for mining outliers from large data sets". Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data – SIGMOD '00. p. 427. doi:10.1145/342009.335437. ISBN 1581132174.

22. Breunig, M. M.; Kriegel, H.-P.; Ng, R. T.; Sander, J. (2000). "LOF: Identifying Density-based Local Outliers" (http://www.dbs.ifi.lmu.de/Publikationen/Papers/LOF.pdf) (PDF). Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. SIGMOD. pp. 93–104. doi:10.1145/335191.335388. ISBN 1-58113-217-4.

23. Schubert, E.; Zimek, A.; Kriegel, H.-P. (2012). "Local outlier detection reconsidered: A generalized view on locality with applications to spatial, video, and network outlier detection". Data Mining and Knowledge Discovery. 28: 190–237. doi:10.1007/s10618-012-0300-z. S2CID 19036098.

24. Thompson, R. (1985). "A Note on Restricted Maximum Likelihood Estimation with an Alternative Outlier Model" (https://www.jstor.org/stable/2345543?seq=1#page_scan_tab_contents). Journal of the Royal Statistical Society. Series B (Methodological). 47 (1): 53–55.

25. Smith, M. R.; Martinez, T.; Giraud-Carrier, C. (2014). "An Instance Level Analysis of Data Complexity" (https://link.springer.com/article/10.1007%2Fs10994-013-5422-z). Machine Learning. 95 (2): 225–256.

26. Karch, Julian D. (2023). "Outliers may not be automatically removed" (https://psyarxiv.com/47ezg/). Journal of Experimental Psychology: General. 152 (6): 1735–1753. doi:10.1037/xge0001357. PMID 37104797. S2CID 258376426.

27. Bakker, Marjan; Wicherts, Jelte M. (2014). "Outlier removal, sum scores, and the inflation of the type I error rate in independent samples t tests: The power of alternatives and recommendations". Psychological Methods. 19 (3): 409–427. doi:10.1037/met0000014. PMID 24773354.

28. Wike, Edward L. (2006). Data Analysis: A Statistical Primer for Psychology Students. Transaction Publishers. pp. 24–25. ISBN 9780202365350.

29. Dixon, W. J. (June 1960). "Simplified estimation from censored normal samples" (http://projecteuclid.org/download/pdf_1/euclid.aoms/1177705900). The Annals of Mathematical Statistics. 31 (2): 385–391. doi:10.1214/aoms/1177705900.

30. Cook, R. Dennis (February 1977). "Detection of Influential Observations in Linear Regression". Technometrics (American Statistical Association). 19 (1): 15–18.

31. Weisstein, Eric W. "Cauchy Distribution". MathWorld – A Wolfram Web Resource (http://mathworld.wolfram.com/CauchyDistribution.html).

32. Jaulin, L. (2010). "Probabilistic set-membership approach for robust regression" (http://www.ensta-bretagne.fr/jaulin/paper_probint_0.pdf) (PDF). Journal of Statistical Theory and Practice. 4: 155–167. doi:10.1080/15598608.2010.10411978. S2CID 16500768.

33. Roberts, S.; Tarassenko, L. (1995). "A probabilistic resource allocating network for novelty detection". Neural Computation. 6: 270–284.

34. Bishop, C. M. (August 1994). "Novelty detection and Neural Network validation". IEE Proceedings – Vision, Image, and Signal Processing. 141 (4): 217–222. doi:10.1049/ip-vis:19941330.

External links
Wikimedia Commons has media related to Outliers.

Renze, John. "Outlier" (https://mathworld.wolfram.com/Outlier.html). MathWorld.

Balakrishnan, N.; Childs, A. (2001) [1994]. "Outlier" (https://www.encyclopediaofmath.org/index.php?title=Outlier). Encyclopedia of Mathematics. EMS Press.

Grubbs test (http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm) described by the NIST manual.
