Anomaly
The goal of anomaly detection is to find objects that differ from most other objects. Such objects
are called outliers, and anomaly detection is also known as deviation detection.
Anomalous results may indicate either a problem with the experiment or a new phenomenon to
be investigated.
Examples:
Fraud detection
Intrusion detection
Ecosystem disturbances
A relatively small number of outliers can distort the mean and standard deviation of a data set.
They may also alter the set of clusters produced by a clustering algorithm.
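As a rough illustration of how much one outlier can move the mean and standard deviation, the sketch below compares a small made-up sample with and without a single extreme value. The numbers and the helper name summarize are assumptions chosen for the example, not data from these notes.

```python
import statistics

# Made-up, well-behaved sample (an assumption for illustration only).
values = [10, 11, 9, 10, 12, 10, 11, 9]
# The same sample plus one extreme value.
with_outlier = values + [100]

def summarize(xs):
    """Return mean, sample standard deviation and median of xs."""
    return {
        "mean": statistics.mean(xs),
        "stdev": statistics.stdev(xs),
        "median": statistics.median(xs),
    }

clean = summarize(values)
dirty = summarize(with_outlier)

print("without outlier:", clean)
print("with outlier:   ", dirty)
```

In this sample the mean roughly doubles and the standard deviation grows by more than an order of magnitude, while the median does not move.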
In the past, anomaly detection (and removal) was often just a part of data preprocessing.
Nowadays, in many cases the anomalies themselves are the focus, rather than their removal.
Topics:
1) Explore the causes of anomalies
2) Consider various anomaly detection problems
3) Distinguish approaches by whether they use class label information
4) Discuss issues related to this distinction
Hawkins’ definition of an outlier:
An outlier is an observation that differs so much from other observations as to arouse suspicion
that it was generated by a different mechanism (or is from a different class).
Gaussian distribution (normal distribution):
For a Gaussian distribution, the probability density of a data object decreases rapidly as the
distance of the object from the center of the distribution increases.
Causes of anomalies:
Data from different classes
Natural variation
Data measurement and collection errors
The detection techniques themselves do not depend on the source of an anomaly.
In many cases we can detect anomalies but cannot determine what caused them.
Approaches:
Model based techniques
Proximity based techniques
Density based techniques

Model based:
In clustering: an anomaly is an object that does not belong to any cluster; alternatively,
anomalies are small clusters that lie far from the other clusters.
In regression – anomaly is an object that is relatively far from its predicted value
In classification: anomalous and normal objects can be viewed as two distinct classes, so a
classifier can be trained to model these two classes.
Anomalies are relatively rare, and this class imbalance needs to be taken into account when
choosing the classification technique and the evaluation measures.
In each case we first build one of the above models and then identify outliers using the objective
function of that model (for example, the distance to the nearest cluster or the size of the residual).
When we cannot build a model, we use proximity-based techniques.
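The regression case can be sketched as follows: fit a least-squares line, then rank objects by the absolute value of their residual (observed minus predicted value). The data points and the function names fit_line and residuals are hypothetical, introduced only for this example.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def residuals(xs, ys):
    """Observed value minus the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Hypothetical data that roughly follows y = 2x, except at x = 6.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.9, 20.0, 14.1, 16.0]

res = residuals(xs, ys)
# Index of the object that lies farthest from its predicted value.
most_anomalous = max(range(len(xs)), key=lambda i: abs(res[i]))
print("largest residual at x =", xs[most_anomalous], "residual =", round(res[most_anomalous], 2))
```

Note that the outlier itself pulls the fitted line toward it, so the residuals of the remaining points are slightly inflated; this is one reason residual-based scores are usually read as a ranking rather than as exact measurements.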
Proximity based:
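Also called distance-based outlier detection techniques: an object is anomalous if it is far from
most other objects. (The Z-score and boxplot rules covered later measure how far a single value
lies from the center of its distribution and are usually classed as statistical techniques.)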
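One common way to turn "distance from other objects" into a score is the distance to the k-th nearest neighbor. The sketch below assumes Euclidean distance, k = 2, and a handful of made-up 2-D points; the function name knn_distance_scores is hypothetical.

```python
import math

def knn_distance_scores(points, k=2):
    """Outlier score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Made-up points: four close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (5.0, 5.0)]
scores = knn_distance_scores(points, k=2)

for p, s in zip(points, scores):
    print(p, round(s, 3))
```

The isolated point receives by far the largest score. This brute-force version compares every pair of points, so it is only meant for small data sets.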
Based on use of class labels:
Supervised anomaly detection
Unsupervised
Semi supervised
The major distinction among these three is the degree to which class labels (output labels) are
available for the data.
Supervised: Techniques for supervised anomaly detection require a training set that contains
both anomalous and normal objects.
Unsupervised: In many practical situations class labels are not available. In such cases the
objective is to assign to each instance a score that reflects the degree to which it is anomalous.
Semi-supervised: Sometimes the training data contains labelled normal objects but has no
information about the anomalous objects.
A general definition of an anomaly must specify how the values of multiple variables are used to
determine whether an object is anomalous or not
Issues:
Number of attributes used to define anomaly
Global vs. local
Degree to which a point is an anomaly
Identifying one anomaly at a time vs. many anomalies at once
Evaluation
Efficiency
Degree to which a point is an anomaly:
An assessment of the degree to which an object is anomalous is known as an anomaly score or
outlier score.
Describing objects in a binary fashion (anomalous or not) does not reflect the underlying reality
that some objects are more extreme anomalies than others.
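Masking: techniques that attempt to identify one anomaly at a time are often subject to masking,
where the presence of several anomalies hides the presence of all of them.
Swamping: techniques that attempt to identify multiple outliers at once can suffer from swamping,
where normal objects are wrongly classified as outliers.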
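A small numerical sketch of masking, assuming a plain z-score rule with a cut-off of 3 and the sample standard deviation. The values are made up and the function name z_outliers is hypothetical. With one extreme value the rule flags it, but with two identical extreme values the standard deviation is inflated so much that neither is flagged.

```python
import statistics

def z_outliers(xs, cutoff=3.0):
    """Return the values whose absolute z-score exceeds the cut-off."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)  # sample standard deviation
    return [x for x in xs if abs((x - mean) / sd) > cutoff]

# Made-up "normal" values clustered around 10.
normal = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11, 10, 10]

one_extreme = z_outliers(normal + [50])       # single anomaly
two_extremes = z_outliers(normal + [50, 50])  # two anomalies mask each other

print("one extreme value: ", one_extreme)
print("two extreme values:", two_extremes)
```

This is only one way masking can show up; the exact behavior depends on the sample size, the cut-off, and the detection rule being used.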
Statistical approach (model based):
Most statistical approaches to outlier detection are based on building a probability distribution
and considering how likely objects are under that model
An outlier is an object that has a low probability with respect to a probability distribution of the
model
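As a minimal sketch of this idea, assuming the data are modelled by a single Gaussian, one can estimate the mean and standard deviation from a sample and then compare the density the fitted model assigns to different values. The sample and the two test values (10 and 15) are made up for illustration; NormalDist is part of Python's standard statistics module.

```python
from statistics import NormalDist

# Made-up sample assumed to come from one Gaussian distribution.
sample = [9, 10, 11, 10, 9, 11, 10, 10, 9, 11, 10, 10]

# Estimate the mean and standard deviation from the sample.
model = NormalDist.from_samples(sample)

typical = model.pdf(10)   # density at a value near the center
extreme = model.pdf(15)   # density at a value far from the center

print("mean =", model.mean, "sd =", round(model.stdev, 3))
print("density at 10:", typical)
print("density at 15:", extreme)
```

A value would then be treated as an outlier when its density (or tail probability) under the fitted model falls below a chosen threshold; the choice of threshold is left open here.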
Issues:
Identifying the specific distribution of data set
Number of attributes used
Mixtures of distributions
Using wrong distribution model for data
Some distribution models: Gaussian, Poisson, binomial
Data can also be modelled as a mixture of distributions, and outlier detection schemes can be
developed based on such mixture models.
Mixture models are more powerful but also more complicated.
Detecting outliers for a single value:
Z-score:
z_i = (x_i - mean(x)) / sd(x)
where x_i is the i-th value of attribute x. The z-score is the number of standard deviations a
value lies above or below the mean.
A z-score of 3 or -3 is a common cut-off: values with a z-score greater than 3 or less than -3 are
treated as anomalies.
For a bell-shaped (normal) distribution, about 99.7% of values lie within 3 standard deviations of
the mean.
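The z = +/-3 rule can be sketched in a few lines. The grades below are hypothetical (they are not taken from spring2008exams.csv), and the sample standard deviation is assumed, which is also what R's sd() computes.

```python
import statistics

def z_scores(xs):
    """z-score of each value: (x - mean) / sample standard deviation."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [(x - mean) / sd for x in xs]

# Hypothetical exam grades with one very low score.
grades = [72, 75, 78, 80, 81, 83, 84, 85, 86, 88, 90, 91, 20]

zs = z_scores(grades)
flagged = [g for g, z in zip(grades, zs) if abs(z) > 3]

print("largest z: ", round(max(zs), 3))
print("smallest z:", round(min(zs), 3))
print("outliers by the z = +/-3 rule:", flagged)
```

With the sample standard deviation, the largest possible |z| in a sample of n values is (n - 1) / sqrt(n), so in samples of 10 or fewer values nothing can ever exceed the cut-off of 3.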
Detecting outliers using boxplot:
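By default, a boxplot flags as outliers the values that lie more than 1.5 IQRs above the third
quartile or more than 1.5 IQRs below the first quartile, where the IQR (interquartile range) is the
third quartile minus the first quartile.
The boxplot approach is more robust than the z-score because the mean and standard deviation
are themselves sensitive to outliers, whereas the quartiles are not.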
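The 1.5 IQR rule can be sketched as follows. The grades are hypothetical, and the quartiles are computed with the "inclusive" method of Python's statistics.quantiles; this convention is an assumption, and plotting tools (including R's boxplot, which uses hinges) may compute quartiles slightly differently and so may draw the fences in slightly different places.

```python
import statistics

def iqr_outliers(xs, k=1.5):
    """Return (q1, q3, outliers) using the k * IQR fences of a boxplot."""
    q1, _median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return q1, q3, [x for x in xs if x < lower or x > upper]

# Hypothetical exam grades with one very low score.
grades = [72, 75, 78, 80, 81, 83, 84, 85, 86, 88, 90, 91, 20]

q1, q3, outliers = iqr_outliers(grades)
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
print("outliers by the 1.5 IQR rule:", outliers)
```

Because the fences depend only on the quartiles, adding a second or third very low grade to this sample would leave them almost unchanged.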
Questions:
11) Using the grades for the first midterm in spring2008exams.csv, are there
any outliers according to the z=+/-3 rule? What is the value of the largest z
score and what is the value of the smallest (most negative) z score?
12) Using the grades for the second midterm in spring2008exams.csv, are
there any outliers according to the z=+/-3 rule? What is the value of the
largest z score and what is the value of the smallest (most negative) z
score?
13) Using Excel for the user agent column of the data at logs.txt (The user
agent column is the second to last column and the value for it in the first
row is "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR
1.1.4322)"). What user agents are identified as outliers using the z=+/-3
rule on the counts of the user agents? What are the z scores for these
outliers?
14) Using the grades for the second midterm in spring2008exams.csv, draw a
boxplot. Show your R commands and include the boxplot. Are any of the grades
for the second midterm outliers by the boxplot (1.5 IQR) rule? If so, which ones?
15) Using the midterm grades at spring2008exams.csv. Use linear
regression to find outliers. Be sure to include the plot. Which student had
the largest POSITIVE residual?
