Statistics 202: Data Mining
Jonathan Taylor
December 2, 2012
Outliers

Concepts

What is an outlier? The set of data points that are
considerably different from the remainder of the data . . .
When do they appear in data mining tasks?
Given a data matrix X, find all the cases x_i ∈ X with
anomaly/outlier scores greater than some threshold t. Or,
find the top n outlier scores.
Given a data matrix X, containing mostly normal (but
unlabeled) data points, and a test case x_new, compute an
anomaly/outlier score of x_new with respect to X.
Applications
Credit card fraud detection;
Network intrusion detection;
Misspecification of a model.
What is an outlier?
Issues
How many outliers are there in the data?
The method is unsupervised, similar to clustering or to
finding clusters with only 1 point in them.
Usual assumption: there are considerably more “normal”
observations than “abnormal” observations
(outliers/anomalies) in the data.
General steps
Build a profile of the “normal” behavior. The profile
generally consists of summary statistics of this “normal”
population.
Use these summary statistics to detect anomalies, i.e.
points whose characteristics are very far from the normal
profile.
General types of schemes involve a statistical model of
“normal”, and “far” is measured in terms of likelihood.
Other schemes based on distances can be quasi-motivated
by such statistical techniques . . .
Statistical approach
Assume a parametric model describing the distribution of
the data (e.g., normal distribution).
Apply a statistical test that depends on:
the data distribution (e.g., normal);
the parameters of the distribution (e.g., mean, variance);
the number of expected outliers (confidence limit, α or Type I
error).
Grubbs’ Test
Suppose we have a sample of n numbers
Z = {Z_1, . . . , Z_n}, i.e. an n × 1 data matrix.
Assuming the data are from a normal distribution, Grubbs'
test uses the distribution of

max_{1 ≤ i ≤ n} (Z_i − Z̄) / SD(Z)
Grubbs’ Test
Lower tail variant:

min_{1 ≤ i ≤ n} (Z_i − Z̄) / SD(Z)

Two-sided variant:

max_{1 ≤ i ≤ n} |Z_i − Z̄| / SD(Z)
Grubbs' Test

Having chosen a test statistic, we must determine a
threshold that sets our decision rule.
Often this is set via a hypothesis test to control the Type I
error.
For a large positive outlier, the threshold is based on choosing
some acceptable Type I error α and finding c_α so that

P_0( max_{1 ≤ i ≤ n} |Z_i − Z̄| / SD(Z) ≥ c_α ) ≈ α
Grubbs' Test

The two-sided critical level has the form

c_α = ((n − 1) / √n) · sqrt( t²_{α/(2n), n−2} / (n − 2 + t²_{α/(2n), n−2}) )

where

P(T_k ≥ t_{γ,k}) = γ

defines the upper tail quantile of T_k.
In R, you can use the functions pnorm, qnorm, pt, qt
for these quantities.
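The slides point to R's pt and qt for these quantities; the same computation can be sketched in Python with scipy's t quantiles. The sample data below are made up for illustration:

```python
# Sketch of the two-sided Grubbs' test; scipy's t.ppf plays the role of R's qt.
import numpy as np
from scipy import stats

def grubbs_critical_value(n, alpha=0.05):
    # t_{alpha/(2n), n-2} is an upper-tail quantile, so evaluate ppf at 1 - alpha/(2n)
    t = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

def grubbs_statistic(z):
    z = np.asarray(z, dtype=float)
    # max_i |Z_i - Zbar| / SD(Z), using the sample standard deviation
    return np.max(np.abs(z - z.mean())) / z.std(ddof=1)

z = [2.1, 2.3, 1.9, 2.2, 2.0, 2.4, 9.0]   # the last value looks anomalous
G = grubbs_statistic(z)
c = grubbs_critical_value(len(z), alpha=0.05)
print(G > c)  # declare an outlier when the statistic exceeds c_alpha
```

Note that the statistic is bounded above by (n − 1)/√n, which is why the critical value has the form it does.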
Model-based techniques

Model based: fit a model (e.g. linear regression); cases
which don't fit the model are identified as outliers.
Residuals from an appropriate model can be fed in to Grubbs'
test.

Figure: Residuals from the model can be fed into Grubbs' test or a
Bonferroni variant.
Multivariate data

If the non-outlying data is assumed to be multivariate
Gaussian, what is the analogue of Grubbs' statistic

max_{1 ≤ i ≤ n} |Z_i − Z̄| / SD(Z) ?
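The slide leaves the question open; one natural analogue (an assumption here, not stated on the slide) replaces |Z_i − Z̄|/SD(Z) with the Mahalanobis distance from the sample mean. A sketch on made-up data:

```python
# Mahalanobis distance as a multivariate analogue of |Z_i - Zbar| / SD(Z);
# the data and the planted outlier are illustrative.
import numpy as np

def mahalanobis_distances(X):
    # distance of each row from the sample mean, scaled by the sample covariance
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X - mu
    # d_i^2 = (x_i - mu)^T Sigma^{-1} (x_i - mu)
    return np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X[0] = [6.0, -6.0]          # plant one clear outlier
d = mahalanobis_distances(X)
print(int(np.argmax(d)))    # index of the most outlying point
```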
Likelihood approach
Assume the data are drawn from a mixture

F = (1 − λ)M + λA

where M is the distribution of the “normal” points and A is
the distribution of the anomalies.
Likelihood approach
Do we estimate λ or fix it?
The book starts by describing an algorithm that tries to
maximize the equivalent classification likelihood

L(θ_M, θ_A; l) = (1 − λ)^{#l_M} ∏_{i ∈ l_M} f_M(x_i; θ_M) × λ^{#l_A} ∏_{i ∈ l_A} f_A(x_i; θ_A)
Likelihood approach: Algorithm
The algorithm tries to maximize this by forming iterative
estimates (M_t, A_t) of the “normal” and “outlying” data
points.
1. At each stage, it tries to move individual points of M_t to A_t.
2. Find (θ̂_M, θ̂_A) based on the new partition (if
necessary).
3. If the increase in likelihood is large enough, call these new
sets (M_{t+1}, A_{t+1}).
4. Repeat until no further changes occur.
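The steps above can be sketched as follows, with λ fixed, f_M a Gaussian refit to M_t, and f_A uniform over the data range; all three modelling choices (and the gain threshold) are illustrative assumptions, not the book's exact algorithm:

```python
# Greedy sketch of the classification-likelihood algorithm for outliers.
import numpy as np
from scipy import stats

def classification_log_lik(z, in_A, lam, a_logpdf):
    # log L(theta_M, theta_A; l), with f_M a Gaussian refit to the current M_t
    M, A = z[~in_A], z[in_A]
    ll = len(M) * np.log(1 - lam) + stats.norm.logpdf(M, M.mean(), M.std(ddof=1)).sum()
    if len(A):
        ll += len(A) * np.log(lam) + a_logpdf(A).sum()
    return ll

def detect_outliers(z, lam=0.05, min_gain=2.0):
    z = np.asarray(z, dtype=float)
    a_logpdf = lambda x: np.full(len(x), -np.log(np.ptp(z)))  # uniform f_A (assumed)
    in_A = np.zeros(len(z), dtype=bool)                       # A_0 starts empty
    current = classification_log_lik(z, in_A, lam, a_logpdf)
    changed = True
    while changed:                                            # repeat until no changes
        changed = False
        for i in np.where(~in_A)[0]:
            trial = in_A.copy()
            trial[i] = True                                   # try moving point i to A_t
            new = classification_log_lik(z, trial, lam, a_logpdf)
            if new - current > min_gain:                      # keep the move if the gain is large
                in_A, current, changed = trial, new, True
    return in_A

z = np.concatenate([np.random.default_rng(1).normal(size=50), [8.0]])
print(np.where(detect_outliers(z))[0])  # the planted point should be flagged
```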
Nearest neighbour approach
Many ways to define outliers.
Example: data points for which there are fewer than k
neighbouring points within a distance ε.
Example: the n points whose distance to the k-th nearest
neighbour is largest.
Example: the n points whose average distance to the first k
nearest neighbours is largest.
Each of these methods depends on the choice of some
parameters: k, n, ε. It is difficult to choose these in a
systematic way.
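The second definition (score each point by its distance to its k-th nearest neighbour) can be sketched as below; the toy data and the planted outlier are made up:

```python
# k-th-nearest-neighbour outlier score: larger distance = more outlying.
import numpy as np

def kth_nn_distance(X, k):
    # pairwise Euclidean distances; a point is not its own neighbour
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    # distance from each point to its k-th nearest neighbour
    return np.sort(D, axis=1)[:, k - 1]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(30, 2)), [[5.0, 5.0]]])  # one planted outlier
scores = kth_nn_distance(X, k=3)
print(int(np.argmax(scores)))  # the n = 1 highest-scoring point
```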
Density approach
For each point x_i, compute a density estimate f_{x_i,k} using
its k nearest neighbours.
The density estimate used is

f_{x_i,k} = ( Σ_{y ∈ N(x_i,k)} d(x_i, y) / #N(x_i, k) )^{−1}

Define

LOF(x_i) = f_{x_i,k} / ( Σ_{y ∈ N(x_i,k)} f_{y,k} / #N(x_i, k) )
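The density estimate and the ratio above can be sketched as follows; note that with this ratio an isolated point gets a small value (the reciprocal form would make outliers score high). The toy data are made up:

```python
# Sketch of the slide's density estimate f_{x_i,k} and the ratio LOF(x_i).
import numpy as np

def lof_scores(X, k):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)            # a point is not its own neighbour
    nbrs = np.argsort(D, axis=1)[:, :k]    # indices of the k nearest neighbours
    # f_{x_i,k} = (average distance to the k nearest neighbours)^{-1}
    f = 1.0 / np.take_along_axis(D, nbrs, axis=1).mean(axis=1)
    # LOF(x_i) = f_{x_i,k} / (average of f over the k nearest neighbours)
    return f / f[nbrs].mean(axis=1)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(40, 2)), [[6.0, 6.0]]])  # one planted outlier
scores = lof_scores(X, k=5)
print(int(np.argmin(scores)))  # with this ratio, the isolated point scores lowest
```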
Compute the local outlier factor (LOF) of a sample as the
average of the ratios of the density of the sample and the
density of its nearest neighbors.
Outliers are points with the largest LOF value.

Figure: in the example, p2 is not flagged by the nearest-neighbour
approach, while the LOF approach finds both p1 and p2.
Detection rate
Set P(O) to be the proportion of outliers or anomalies.
Set P(D|O) to be the probability of declaring an outlier if
it truly is an outlier. This is the detection rate.
Set P(D|O^c) to be the probability of declaring an outlier if it
is truly not an outlier.
Bayesian detection rate
The Bayesian detection rate is

P(O|D) = P(D|O)P(O) / ( P(D|O)P(O) + P(D|O^c)P(O^c) ).

Similarly,

P(O^c|D) = P(D|O^c)P(O^c) / ( P(D|O^c)P(O^c) + P(D|O)P(O) ).
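Plugging illustrative numbers (assumed here, not from the lecture) into the first formula shows why a small P(O) matters:

```python
# Worked example of the Bayesian detection rate; all numbers are assumptions.
p_o = 0.01      # P(O): proportion of outliers
p_d_o = 0.95    # P(D|O): detection rate
p_d_oc = 0.05   # P(D|O^c): false alarm rate

# P(O|D) = P(D|O)P(O) / (P(D|O)P(O) + P(D|O^c)P(O^c))
p_o_d = p_d_o * p_o / (p_d_o * p_o + p_d_oc * (1 - p_o))
print(round(p_o_d, 3))  # about 0.161
```

So even with a 95% detection rate, a rare outlier class leaves the posterior P(O|D) small: most declared outliers are false alarms.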