
Outlier Detection

Univariate and Multivariate


By
Subhasis Dasgupta
Asst Professor
Praxis Business School, Kolkata
What is an Outlier
In analytics, an outlier is a data point whose characteristics differ significantly from those of the rest of the data points.

Parameters estimated in the presence of outliers can be heavily biased and unrealistic.

An outlier can be univariate (when dealing with a single numeric variable) or multivariate (a data point consisting of multiple variables).

Univariate outliers are usually detected by observing the distribution of the data points.

In multivariate situations, there are several ways of detecting outliers, e.g. Mahalanobis distance, Local Outlier Factor (LOF), clustering of data points, etc.
Univariate and Bivariate Outliers

[Figure: univariate outliers and bivariate outliers; axis label: X]
Detecting Univariate Outliers
If the data are normally distributed, outliers can be identified by their distance from the mean relative to the spread, i.e. as points falling outside an interval around the mean that covers a chosen share of the distribution (usually 95% or higher).

If the data are not normally distributed, percentile values can be used to detect the outliers.

Percentile values are non-parametric in nature and are hence considered appropriate in almost all cases.

The Inter Quartile Range (IQR) plays an important role in detecting outliers in univariate data.

The Box and Whisker plot is the most popular tool for detecting univariate outliers.
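
The normal-distribution approach can be sketched in Python as follows. This is a minimal illustration, not a method taken from the slides: the cut-off of 1.96 standard deviations (which covers roughly 95% of a normal distribution), the function name `zscore_outliers` and the sample data are all assumptions made for the example.

```python
from statistics import mean, stdev

def zscore_outliers(values, z_cutoff=1.96):
    """Return values lying more than z_cutoff sample standard deviations from the mean.

    z_cutoff=1.96 is an assumed threshold (about 95% coverage under normality).
    """
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - m) / s > z_cutoff]

# Hypothetical sample: eleven similar readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]
print(zscore_outliers(data))
```

Because the extreme value itself inflates both the mean and the standard deviation, this rule can miss outliers in small samples, which is one reason percentile-based rules are often preferred.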
Box and Whisker Plot

The Box and Whisker plot (popularly known as the Box Plot) is a way of portraying the distribution of univariate data based on percentile values.

It uses the Inter Quartile Range (IQR) to determine lower and upper cut-off values.

Data points lying outside these cut-off values are termed outliers.
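
The cut-off rule can be sketched in Python as below. The slides do not state the multiplier, so the commonly used 1.5 × IQR convention is assumed here; the function names, the quartile interpolation method and the sample data are likewise illustrative choices.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower, upper) cut-offs as Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the usual box-plot convention (an assumption, not from the slides).
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the IQR-based cut-offs."""
    lower, upper = iqr_fences(values, k)
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: eleven similar readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]
print(iqr_fences(data))
print(iqr_outliers(data))
```

Different software packages interpolate quartiles slightly differently, so the exact cut-offs may vary a little between tools.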
Multivariate Outliers
It is impossible for a human to spot an outlying point visually if the number of variables is more than 3.

Mathematical procedures are needed to locate multivariate outliers.

Some of the methods are:

• Local Outlier Factor (LOF)
• Clustering of data points
• Mahalanobis Distance measure
• Local Correlation Integral (not yet implemented in R)

We shall focus on the first three methods.


Local Outlier Factor
This method assigns each data point in the dataset a score expressing its 'degree' of being an outlier, which can be used to detect outliers.

It is a density-based approach and closely resembles density-based clustering.

However, unlike density-based clustering, it does not require an epsilon (ε) neighborhood distance parameter.

LOF is calculated on the basis of a few quantities, such as:

• Local Reachability Density (LRD)
• MinPts (the number of nearest neighbors, k)

However, LRD is itself calculated on the basis of other concepts.
LOF Basic Concepts
k-distance of an object p: For any positive integer k, the k-distance of object p, denoted as k-distance(p), is defined as the distance d(p, o) between p and an object o ∈ D such that:

(i) for at least k objects o' ∈ D \ {p} it holds that d(p, o') ≤ d(p, o), and
(ii) for at most k-1 objects o' ∈ D \ {p} it holds that d(p, o') < d(p, o).

k-distance neighborhood of an object p: Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance, i.e.

$$N_{k\text{-distance}(p)}(p) = \{\, q \in D \setminus \{p\} \mid d(p, q) \le k\text{-distance}(p) \,\}$$

Source: Markus M. Breunig et al. (2000), "LOF: Identifying Density-Based Local Outliers", Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000
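
The two definitions above can be illustrated with a short Python sketch. Euclidean distance is assumed for d, and the function names and the five sample points are hypothetical. Taking the k-th smallest distance to the other objects satisfies conditions (i) and (ii), and the example shows that ties can make the neighborhood larger than k.

```python
from math import dist  # Euclidean distance (assumed choice for d)

def k_distance(points, p_idx, k):
    """Distance from points[p_idx] to its k-th nearest other point."""
    p = points[p_idx]
    others = sorted(dist(p, q) for i, q in enumerate(points) if i != p_idx)
    return others[k - 1]

def k_neighborhood(points, p_idx, k):
    """Indices of all other points no farther from points[p_idx] than its k-distance."""
    p = points[p_idx]
    kd = k_distance(points, p_idx, k)
    return [i for i, q in enumerate(points) if i != p_idx and dist(p, q) <= kd]

# Hypothetical data: four corners of a unit square and one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]

print(k_distance(points, 0, 1))      # (1, 0) and (0, 1) are tied as nearest
print(k_neighborhood(points, 0, 1))  # both tied points are included
```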
LOF Basic Concepts
Reachability distance of an object p w.r.t. object o: The reachability distance of object p with respect to object o is defined as

$$\text{reach-dist}_k(p, o) = \max\{\, k\text{-distance}(o),\ d(p, o) \,\}$$

[Figure: reach-dist_k(P1, O) and reach-dist_k(P2, O) for two points P1 and P2 relative to an object O, with MinPts (k) = 4]

Local reachability density of an object p: It is defined as

$$\text{lrd}_{MinPts}(p) = 1 \Big/ \left( \frac{\sum_{o \in N_{MinPts}(p)} \text{reach-dist}_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right)$$

It is the inverse of the average reachability distance based on the MinPts nearest neighbors of p.
Source: Markus M. Breunig et al. (2000), "LOF: Identifying Density-Based Local Outliers", Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000
LOF Basic Concepts
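Local outlier factor of an object p: It is defined as

$$\text{LOF}_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \dfrac{\text{lrd}_{MinPts}(o)}{\text{lrd}_{MinPts}(p)}}{|N_{MinPts}(p)|}$$

It is the average of the ratios of the local reachability densities of p's MinPts-nearest neighbors to the local reachability density of p. A value close to 1 means p lies in a region about as dense as its neighbors' regions; a value well above 1 marks p as a likely outlier.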
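
Putting the definitions together, the following is a minimal, unoptimized Python sketch of the LOF computation. It assumes Euclidean distance, and the function names and sample points are hypothetical; it is meant to mirror the formulas rather than to replace a tested library implementation.

```python
from math import dist  # Euclidean distance (assumed choice for d)

def k_neighborhood(points, i, k):
    """Return (k-distance, neighbor indices) for points[i]; tied points are kept."""
    d = [dist(points[i], q) for q in points]
    kd = sorted(d[j] for j in range(len(points)) if j != i)[k - 1]
    return kd, [j for j in range(len(points)) if j != i and d[j] <= kd]

def lof_scores(points, min_pts):
    """Local outlier factor of every point, following the definitions above."""
    n = len(points)
    info = [k_neighborhood(points, i, min_pts) for i in range(n)]
    kdist = [kd for kd, _ in info]
    nbrs = [nb for _, nb in info]

    # lrd(p): inverse of the mean reachability distance from p to its neighbors.
    # Caveat: this divides by zero if p has min_pts or more exact duplicates.
    lrd = []
    for p in range(n):
        reach = [max(kdist[o], dist(points[p], points[o])) for o in nbrs[p]]
        lrd.append(len(reach) / sum(reach))

    # LOF(p): mean of lrd(o) / lrd(p) over the neighbors o of p.
    return [sum(lrd[o] / lrd[p] for o in nbrs[p]) / len(nbrs[p]) for p in range(n)]

# Hypothetical data: four corners of a unit square and one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
scores = lof_scores(points, min_pts=2)
for pt, s in zip(points, scores):
    print(pt, round(s, 3))
```

In this example the four corner points score 1, since each is as dense as its neighbors, while the distant point (5, 5) scores well above 1.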
Mahalanobis Distance Measure
Mahalanobis distance measures the distance between a point and a distribution.

It is a multivariate generalization of the idea of measuring how many standard deviations away a data point lies from the mean of the distribution.

Mathematically,

$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$

where μ is the mean vector and S is the covariance matrix of the distribution.
Mahalanobis Distance Measure
“The Mahalanobis distance is simply the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point” (Source: Wikipedia)

$D_M^2$ is approximately chi-square distributed when the data are multivariate normal, with degrees of freedom equal to the number of variables.

Mahalanobis distance is used in clustering as well as in classification modeling.
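
As a rough illustration, the sketch below computes Mahalanobis distances for two-variable data in plain Python and flags points whose squared distance exceeds a chi-square cut-off. Several things here are assumptions made for the example: the restriction to two variables (so the 2×2 covariance matrix can be inverted by hand), the 0.975 quantile as cut-off, the function names and the sample data. For 2 degrees of freedom the chi-square quantile has the closed form -2·ln(1 - prob), which avoids any external library.

```python
from math import log, sqrt
from statistics import mean

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    n = len(points)
    # Entries of the sample covariance matrix S (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # must be non-zero for S to be invertible
    result = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (x - mu)^T S^{-1} (x - mu), with S^{-1} = [[syy, -sxy], [-sxy, sxx]] / det
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        result.append(sqrt(d2))
    return result

def chi2_cutoff_2df(prob=0.975):
    """Chi-square quantile for 2 degrees of freedom (closed form)."""
    return -2.0 * log(1.0 - prob)

# Hypothetical data: points scattered around the line y = x, plus one point far off it.
points = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5), (7, 8),
          (8, 7), (9, 10), (10, 9), (11, 12), (12, 11), (13, 14), (14, 13),
          (3, 12)]
distances = mahalanobis_2d(points)
cutoff = chi2_cutoff_2df()
flagged = [p for p, d in zip(points, distances) if d * d > cutoff]
print(flagged)
```

Note that (3, 12) is unremarkable on either variable alone; it stands out only because it breaks the correlation between the two, which is what the covariance matrix captures.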
R Packages
• Rlof
• mvoutlier
• MVN
