Outlier Detection: Univariate and Multivariate


By
Subhasis Dasgupta
Asst Professor
Praxis Business School, Kolkata
What is an Outlier
In analytics, an outlier is a data point whose characteristics are significantly different from the rest of the data points
Parameters estimated in the presence of outliers could be heavily biased and unrealistic
An outlier can be univariate (a single numeric variable) or multivariate (a data point consisting of multiple variables)
Univariate outliers are usually detected by observing the distribution of the data points
In multivariate situations, there are several ways of detecting outliers, e.g. Mahalanobis distance, LOF, clustering of data points etc.
Univariate and bivariate outlier
[Figure: two panels, "Univariate Outliers" and "Bivariate Outliers"]
Detecting Univariate Outliers
If the data is normally distributed, outliers can be identified by how far they lie from the mean, using an interval around the mean that covers 95% or more of the distribution
If the data is not normally distributed, percentile values can be used to detect the outliers
Percentile values are non-parametric in nature and hence considered appropriate in almost all cases
The Inter Quartile Range (IQR) plays an important role in detecting outliers in univariate data
The Box and Whisker plot is the most popular tool for detecting univariate outliers
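The normal-distribution rule can be sketched as a z-score check. This is a minimal illustration, not a prescribed procedure: the function name `zscore_outliers`, the sample `data`, and the cut-off of 1.96 standard deviations (a two-sided 95% interval) are all assumptions made for the example.

```python
import statistics

def zscore_outliers(values, z_cut=1.96):
    """Return the values lying more than z_cut standard deviations from the mean.

    z_cut=1.96 corresponds to a two-sided 95% interval under normality
    (an assumed cut-off; the slides only say "95% or higher").
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - mean) / sd > z_cut]

# Hypothetical sample: ten ordinary readings and one extreme reading.
data = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 45]
flagged = zscore_outliers(data)
print(flagged)
```

Note that the mean and standard deviation are themselves pulled towards the extreme value, which is one reason percentile-based rules are often preferred.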
Box and Whisker Plot
A Box and Whisker plot (popularly known as a Box Plot) is a way of portraying the distribution of univariate data based on percentile values
It uses the Inter Quartile Range (IQR) to determine lower and upper cut-off values
Data points lying outside these cut-off values are termed outliers
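The cut-off values can be sketched as follows. The slides do not state the multiplier, so this example assumes the common convention of 1.5 times the IQR beyond the first and third quartiles; the function names and the sample `data` are hypothetical.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) cut-offs: Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is an assumed convention, not a value taken from the slides.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def boxplot_outliers(values, k=1.5):
    """Return the values lying outside the IQR-based cut-offs."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: ten ordinary readings and one extreme reading.
data = [10, 11, 12, 12, 13, 13, 14, 14, 15, 16, 45]
print(iqr_fences(data))
print(boxplot_outliers(data))
```

Different quartile conventions give slightly different cut-offs on small samples; `method="inclusive"` is just one choice.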
Multivariate Outliers
It is impossible for a human to see an outlying point if the number of variables is more than 3
We need to rely on mathematical procedures to locate multivariate outliers
Some of the methods are:
• Local Outlier Factor (LOF)
• Clustering of data points
• Mahalanobis Distance measure
• Local Correlation Integral (not yet implemented in R)
We shall focus on the first three methods

Local Outlier Factor
This method assigns each data point in the dataset a score expressing its ‘degree’ of being an outlier, which can be used to detect outliers
It is a density-based approach and closely resembles density-based clustering
However, unlike density-based clustering, it does not require a neighborhood distance (radius) parameter
LOF is calculated on the basis of a few quantities such as:
• Local Reachability Density (LRD)
• MinPts (k)
LRD is in turn calculated from other concepts, described next
LOF Basic Concepts
k-distance of an object p: For any positive integer k, the k-distance of object p, denoted as k-distance(p), is defined as the distance d(p, o) between p and an object o ∈ D such that:
(i) for at least k objects o’ ∈ D \ {p} it holds that d(p, o’) ≤ d(p, o), and
(ii) for at most k-1 objects o’ ∈ D \ {p} it holds that d(p, o’) < d(p, o).
k-distance neighborhood of an object p: Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance, i.e.
N_{k-distance(p)}(p) = { q ∈ D \ {p} | d(p, q) ≤ k-distance(p) }.
Source: LOF: Identifying Density-Based Local Outliers by Markus M. Breunig et al. (2000), Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000
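The two definitions above can be sketched in a few lines. This is an illustrative sketch assuming Euclidean distance and points stored as tuples; the function names and the sample `points` are hypothetical.

```python
import math

def k_distance(points, p_index, k):
    """Distance from points[p_index] to its k-th nearest other point."""
    p = points[p_index]
    dists = sorted(math.dist(p, q) for i, q in enumerate(points) if i != p_index)
    return dists[k - 1]

def k_neighborhood(points, p_index, k):
    """Indices of all other points within k-distance of points[p_index].

    Ties at the k-distance are all included, so the neighborhood can
    contain more than k points.
    """
    p = points[p_index]
    kd = k_distance(points, p_index, k)
    return [i for i, q in enumerate(points)
            if i != p_index and math.dist(p, q) <= kd]

# Hypothetical data: three points at distance 1 from the origin, one at distance 3.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (3, 0)]
print(k_distance(points, 0, 2))
print(k_neighborhood(points, 0, 2))
```

With k = 2, the origin has three neighbors because three points tie at the k-distance.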
LOF Basic Concepts
Reachability distance of an object p w.r.t. object o: The reachability distance of object p with respect to object o is defined as
$$\text{reach-dist}_k(p, o) = \max\{\, k\text{-distance}(o),\ d(p, o) \,\}$$
[Figure: reach-dist_k(p1, o) and reach-dist_k(p2, o) for two points p1 and p2 around an object o, with MinPts (k) = 4]
Local reachability density of an object p: It is defined as
$$\mathrm{lrd}_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right)$$
It is the inverse of the average reachability distance based on the MinPts nearest neighbors of p
LOF Basic Concepts
The local outlier factor of p is defined as
$$\mathrm{LOF}_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{\mathrm{lrd}_{MinPts}(o)}{\mathrm{lrd}_{MinPts}(p)}}{|N_{MinPts}(p)|}$$
It is the average of the ratio of the local reachability density of p and those of p’s MinPts-nearest neighbors
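Putting the LOF definitions together, a brute-force sketch might look like the following. It assumes Euclidean distance and no group of duplicated points large enough to make a reachability sum zero. The function name `lof_scores` and the sample `points` are hypothetical, and the code is meant to mirror the formulas rather than reproduce any R package.

```python
import math

def lof_scores(points, min_pts):
    """Return the LOF score of every point, following the definitions above."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # k-distance and k-distance neighborhood of every point (ties included).
    k_dist, neighbors = [], []
    for i in range(n):
        others = sorted(dist[i][j] for j in range(n) if j != i)
        kd = others[min_pts - 1]
        k_dist.append(kd)
        neighbors.append([j for j in range(n) if j != i and dist[i][j] <= kd])

    def reach_dist(p, o):
        # reach-dist(p, o) = max(k-distance(o), d(p, o))
        return max(k_dist[o], dist[p][o])

    # lrd(p) = inverse of the average reachability distance to p's neighbors.
    lrd = []
    for i in range(n):
        total = sum(reach_dist(i, o) for o in neighbors[i])
        lrd.append(len(neighbors[i]) / total)

    # LOF(p) = average of lrd(o) / lrd(p) over p's neighbors.
    return [
        sum(lrd[o] / lrd[i] for o in neighbors[i]) / len(neighbors[i])
        for i in range(n)
    ]

# Hypothetical data: four corners of a unit square and one isolated point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
scores = lof_scores(points, min_pts=2)
print(scores)
```

Points inside a uniform cluster score close to 1, while the isolated point scores well above 1. The pairwise distance matrix makes this sketch quadratic in the number of points, so it is suitable only for small datasets.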
Mahalanobis Distance Measure
Mahalanobis distance measures the distance between a point and a distribution
It is a multivariate generalization of the idea of measuring how many standard deviations a data point lies from the mean of the distribution
Mathematically,
$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$
Mahalanobis Distance Measure
“The Mahalanobis distance is simply the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point” (Source: Wikipedia)
D_M^2 is approximately chi-square distributed (with degrees of freedom equal to the number of variables, when the data are multivariate normal)
Mahalanobis distance is used in clustering as well as in classification modeling
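The chi-square property suggests a simple flagging rule, sketched here for the two-variable case so that the covariance matrix can be inverted by hand without external libraries. The function names, the sample `data`, and the 0.975 quantile are assumptions for illustration. For 2 degrees of freedom the chi-square quantile has the closed form -2 ln(1 - prob), which is what the code uses.

```python
import math
import statistics

def mahalanobis_sq_2d(data):
    """Squared Mahalanobis distance of each (x, y) row from the sample mean."""
    xs = [row[0] for row in data]
    ys = [row[1] for row in data]
    n = len(data)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]] (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    result = []
    for x, y in data:
        dx, dy = x - mx, y - my
        # (dx, dy) S^{-1} (dx, dy)^T using the closed-form 2x2 inverse.
        result.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return result

def flag_outliers_2d(data, prob=0.975):
    """Indices whose squared distance exceeds the chi-square(2) quantile."""
    cutoff = -2.0 * math.log(1.0 - prob)  # exact only for 2 degrees of freedom
    return [i for i, d2 in enumerate(mahalanobis_sq_2d(data)) if d2 > cutoff]

# Hypothetical data: ten points close to the line y = x, and one point off it.
data = [(1, 1.5), (2, 1.5), (3, 3.5), (4, 3.5), (5, 5.5), (6, 5.5),
        (7, 7.5), (8, 7.5), (9, 9.5), (10, 9.5), (3, 9.0)]
print(flag_outliers_2d(data))
```

The last point is unremarkable on either variable alone but breaks the correlation between them, which is what the distance picks up. Because the mean and covariance are estimated from the same data, an outlier inflates them and can partly mask itself, and the cut-off is only approximate for small samples.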
R Packages
• Rlof
• mvoutlier
• MVN
