Causes of Anomalies
Natural variation
– Unusually tall people
Data errors
– A 200-pound, 2-year-old child
Model-based Approaches
◆ Model can be parametric or non-parametric
◆ Anomalies are those points that do not fit the model well
Model-free Approaches
◆ Anomalies are identified directly from the data, without building a model
Often the underlying assumption is that most of the points in the data are normal.
Statistical Approaches
Proximity-based
– Anomalies are points far away from other points
Clustering-based
– Points far away from cluster centers are outliers
– Small clusters are outliers
Reconstruction-based
Introduction to Data Mining, 2nd Edition. Tan, Steinbach, Karpatne, Kumar. 4/12/2021.
Statistical Approaches
Probabilistic definition of an outlier: An outlier is an object that
has a low probability with respect to a probability distribution
model of the data.
Usually assume a parametric model describing the distribution
of the data (e.g., normal distribution)
Apply a statistical test that depends on
– Data distribution
– Parameters of distribution (e.g., mean, variance)
– Number of expected outliers (confidence limit)
Issues
– Identifying the distribution of a data set
◆ Heavy tailed distribution
– Number of attributes
– Is the data a mixture of distributions?
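As a concrete instance of the parametric approach above, here is a minimal sketch (my own illustration, not from the slides) that fits a normal distribution and flags points whose z-score exceeds a chosen limit:

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    """Flag points whose |z-score| exceeds the threshold, assuming the
    data follow a single normal distribution (the parametric model
    discussed above). Function name and threshold are illustrative."""
    data = np.asarray(data, dtype=float)
    mu, sigma = data.mean(), data.std()
    return np.abs(data - mu) / sigma > threshold

# With small samples, a large outlier inflates the estimated sigma
# ("masking"), so a looser threshold is used here for illustration.
flags = zscore_outliers([9.8, 10.1, 10.0, 9.9, 10.2, 25.0], threshold=2.0)
```

Note how the point at 25.0 pulls both the estimated mean and standard deviation toward itself, one facet of the issues the slide raises about identifying the distribution of a data set.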
Normal Distributions
[Figure: probability density of a one-dimensional Gaussian and a two-dimensional Gaussian over x and y]
Likelihood Approach (Mixture Model)
Data distribution: $D = (1 - \lambda)\,M + \lambda\,A$, where $\lambda$ is the expected fraction of outliers
M is a probability distribution estimated from the data
– Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
A is initially assumed to be a uniform distribution
Likelihood at time t:

$$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left[(1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i)\right]\left[\lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i)\right]$$

$$LL_t(D) = |M_t|\log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t|\log\lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$
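The likelihood equations above can be sketched in code. This is my own illustration: M is a Gaussian refit at each step, A is uniform with a fixed density, and lam, the threshold c, and a_density are assumed values, not from the text:

```python
import math

def gaussian_loglik(points):
    # Log-likelihood of points under a Gaussian with ML-estimated mean/variance
    n = len(points)
    mu = sum(points) / n
    var = max(sum((p - mu) ** 2 for p in points) / n, 1e-12)
    return sum(-0.5 * math.log(2 * math.pi * var) - (p - mu) ** 2 / (2 * var)
               for p in points)

def total_loglik(normal, anomalies, lam, a_density):
    # LL_t(D) from the slide: |M_t| log(1-lam) + sum log P_M
    #                        + |A_t| log(lam) + sum log P_A
    ll = len(normal) * math.log(1 - lam) + gaussian_loglik(normal)
    if anomalies:
        ll += len(anomalies) * (math.log(lam) + math.log(a_density))
    return ll

def mixture_anomalies(data, lam=0.05, c=1.0, a_density=0.01):
    # Move a point from M to A when doing so raises the total
    # log-likelihood by more than the threshold c.
    normal, anomalies = list(data), []
    for x in list(normal):
        candidate_m = [p for p in normal if p is not x]
        gain = (total_loglik(candidate_m, anomalies + [x], lam, a_density)
                - total_loglik(normal, anomalies, lam, a_density))
        if gain > c:
            normal, anomalies = candidate_m, anomalies + [x]
    return anomalies
```

Moving an inlier to A barely changes M's fit, so the gain is negative; moving a far-away point sharply reduces M's variance, so the likelihood gain is large and the point is flagged.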
One Nearest Neighbor - One Outlier
[Figure: data set D with outlier scores; colorbar from 0.4 to 2]
One Nearest Neighbor - Two Outliers
[Figure: data set D with outlier scores; colorbar from 0.05 to 0.55]
Five Nearest Neighbors - Small Cluster
[Figure: data set D with outlier scores; colorbar from 0.4 to 2]
Five Nearest Neighbors - Differing Density
[Figure: data set D with outlier scores; colorbar from 0.2 to 1.8]
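The nearest-neighbor outlier scores visualized in the preceding figures can be computed with a brute-force sketch like the following (my own illustration; the full distance matrix reflects the O(n²) cost of these methods):

```python
import numpy as np

def knn_outlier_scores(points, k=1):
    # Score = distance to the k-th nearest neighbor (brute force, O(n^2));
    # the function name and default k are illustrative choices.
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    return np.sort(dist, axis=1)[:, k - 1]  # k-th smallest distance per point

scores = knn_outlier_scores([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], k=1)
```

The four clustered points all score 1.0, while the isolated point scores its distance to the nearest cluster member.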
Strengths/Weaknesses of Distance-Based Approaches
Simple
Expensive – O(n²)
Sensitive to parameters
[Figure: distance-based outlier scores: C = 6.85, D = 1.40, A = 1.33]
Relative Density-based: LOF Approach
In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
[Figure: points p1 and p2]
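A simplified version of the relative-density idea can be sketched as follows (my own illustration, not the full LOF definition, which additionally uses reachability distances):

```python
import numpy as np

def relative_density_scores(points, k=3):
    # density(x) = 1 / mean distance to x's k nearest neighbors
    # score(x)   = mean density of those neighbors / density(x)
    # Scores well above 1 mark points lying in sparser regions than
    # their neighbors. Names and default k are illustrative.
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    knn_dist = np.take_along_axis(dist, nn_idx, axis=1)
    density = 1.0 / knn_dist.mean(axis=1)
    return density[nn_idx].mean(axis=1) / density
```

Because the score is relative to the neighbors' density, a point just outside a dense cluster can score higher than a point the same absolute distance from a sparse cluster, which is how LOF catches both p1 and p2.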
Strengths/Weaknesses of Density-Based Approaches
Simple
Expensive – O(n²)
Sensitive to parameters
Clustering-Based Approaches
An object is a cluster-based outlier if it does not strongly belong to any cluster
– For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center
◆ Outliers can impact the clustering produced
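A minimal sketch of the prototype-based definition, assuming k-means as the clustering method (all names and parameter choices are mine; centroids are seeded with the first k points for determinism):

```python
import numpy as np

def kmeans_outlier_scores(points, k=2, iters=20):
    # Minimal Lloyd's k-means, then score each point by its distance
    # to the closest centroid -- the prototype-based definition above.
    pts = np.asarray(points, dtype=float)
    centroids = pts[:k].copy()
    for _ in range(iters):
        d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # skip empty clusters
                centroids[j] = pts[labels == j].mean(axis=0)
    d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1)
```

Note that a distant point both receives the highest score and drags its cluster's centroid toward itself, illustrating the caveat that outliers can impact the clustering produced.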
Distance of Points from Closest Centroid
[Figure: outlier scores for points A (1.2), C, and D (0.17); colorbar from 0.5 to 4]
Relative Distance of Points from Closest Centroid
[Figure: relative outlier scores; colorbar from 0.5 to 4]
Strengths/Weaknesses of Clustering-Based Approaches
Simple
Reconstruction-Based Approaches
Reconstruction Error$(x) = \lVert x - \hat{x} \rVert$
Objects with large reconstruction errors are anomalies
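As one common instance of reconstruction-based detection (my own sketch, assuming PCA as the reconstruction model), project onto the top principal components, map back, and score by the reconstruction error:

```python
import numpy as np

def pca_reconstruction_errors(X, n_components=1):
    # Project onto the top principal components, map back, and measure
    # ||x - x_hat||; the function name and n_components are illustrative.
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = principal directions
    V = Vt[:n_components]
    X_hat = Xc @ V.T @ V + mu                          # reconstruction
    return np.linalg.norm(X - X_hat, axis=1)
```

Points that lie along the data's dominant direction reconstruct almost exactly; a point off that direction keeps a large residual and stands out as an anomaly.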
One-Class SVM
Equation of the hyperplane: $w \cdot \phi(x) - \rho = 0$
$\phi$ is the mapping to a high-dimensional space
$w$ is the weight vector
$\nu$ is the fraction of outliers
The optimization condition is the standard one-class SVM formulation:
$$\min_{w,\,\xi,\,\rho}\ \frac{1}{2}\lVert w \rVert^2 + \frac{1}{\nu N}\sum_{i=1}^{N} \xi_i - \rho \quad \text{subject to} \quad w \cdot \phi(x_i) \ge \rho - \xi_i,\quad \xi_i \ge 0$$
Choice of $\nu$ is difficult
Computationally expensive