
Anomaly

The goal of anomaly detection is to find objects that are different from most other objects. These
objects are also called outliers, and anomaly detection is also called deviation detection.

Anomalous results may indicate either a problem with the experiment or a new phenomenon to
be investigated.

Examples:

Fraud detection

Intrusion detection

Ecosystem disturbances

A relatively small number of outliers can distort the mean and standard deviation of a data set.
They may also alter the set of clusters produced by a clustering algorithm
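
A small sketch of this distortion, using invented numbers (the values below are hypothetical and
not from any data set in these notes). It compares the mean and sample standard deviation of a
list with and without a single extreme value:

```python
import statistics

# Hypothetical measurements; the value 100 is an invented extreme point.
clean = [10, 12, 11, 13, 12, 11, 10, 12]
with_outlier = clean + [100]

for name, values in [("clean", clean), ("with outlier", with_outlier)]:
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    print(f"{name:>12}: mean={mean:.2f}  sd={sd:.2f}")
```

One added value out of nine is enough to pull the mean well away from the bulk of the data and to
inflate the standard deviation many times over.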

In the past, anomaly detection (and removal) was often part of data preprocessing

Nowadays, in many cases the anomalies themselves are the focus, rather than their removal

Topics:

1) Explore the causes of anomalies
2) Consider various anomaly detection problems
3) Draw distinctions among approaches based on whether they use class label information
4) Issues in this distinction

Hawkins’ definition of an outlier:

An outlier is an observation that differs so much from other observations as to arouse suspicion
that it was generated by a different mechanism (or is from a different class)

Gaussian distribution (normal distribution):

The probability of a data object decreases rapidly as its distance from the center of the
distribution increases
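
To make "decreases rapidly" concrete, here is a minimal sketch that evaluates the Gaussian density
at 0 to 4 standard deviations from the center and reports it relative to the peak. The function
name normal_pdf is chosen here for illustration only:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

peak = normal_pdf(0.0)
for k in range(5):
    ratio = normal_pdf(float(k)) / peak
    print(f"{k} sd from the center: density is {ratio:.4f} of the peak")
```

At 3 standard deviations the density is already only about 1% of its value at the center.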

Causes of anomalies:

Data from different classes

Natural variation

Data measurement and collection errors

The detection techniques themselves are not affected by the source of an anomaly

In many cases we can find the anomalies but cannot determine their cause

Approaches:

Model based techniques

Proximity based techniques

Density based techniques


Model based:

In clustering – an anomaly is an object that does not belong to any cluster, or anomalies are
small clusters that are far from other clusters

In regression – an anomaly is an object that is relatively far from its predicted value

In classification – since anomalous and normal objects can be viewed as defining two distinct
classes, classification can be used to build models of these two classes

Anomalies are relatively rare, and this needs to be taken into account when choosing a
classification technique and the measures used for evaluation

Here we first build one of the above models and then find the outliers using the objective
function of that model
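
For the regression case, a minimal sketch might fit a least-squares line and look for the object
with the largest residual (observed value minus predicted value). The data and the helper names
fit_line and residuals are invented for illustration:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def residuals(xs, ys):
    """Observed y minus the y predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# Invented data: y is roughly 2x, except for the point at x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 25.0, 12.1, 13.8, 16.2]

res = residuals(xs, ys)
worst = max(range(len(res)), key=lambda i: abs(res[i]))
print(f"largest |residual| at x={xs[worst]}: {res[worst]:.2f}")
```

Note that the outlier itself influences the fitted line, so the residuals of the normal points are
shifted too; this is one reason the choice of model matters.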

When we cannot build a model we use proximity based techniques

Proximity based:

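Also called distance-based outlier detection: an object is considered anomalous if it is far from
most other objects. Simple one-variable rules such as the Z-score and the boxplot also flag values
that lie far from the center of the data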
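
A minimal sketch of a distance-based score, assuming the score of an object is the distance to its
k-th nearest neighbor (one common choice, not the only one). The points, the value of k, and the
function name knn_scores are all invented for illustration:

```python
import math

def knn_scores(points, k=2):
    """Outlier score of each point = distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Invented 2-D points: a tight group plus one distant point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_scores(points, k=2)

# Print the points from most to least anomalous.
for p, s in sorted(zip(points, scores), key=lambda t: -t[1]):
    print(p, round(s, 2))
```

This brute-force version compares every pair of points, so it is only suitable for small data
sets.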

Based on use of class labels:

Supervised anomaly detection

Unsupervised

Semi supervised

The main distinction among these three is the degree to which class labels (output labels) are
available for the data

Supervised: Techniques for supervised anomaly detection require the existence of a training set
with both anomalous and normal objects

Unsupervised: In many practical situations class labels are not available. In such cases the
objective is to assign a score to each instance that reflects the degree to which the instance is
anomalous

Semi-supervised: Sometimes the training data contains labelled normal data but has no information
about the anomalous objects

A general definition of an anomaly must specify how the values of multiple variables are used to
determine whether an object is anomalous or not

Issues:

Number of attributes used to define anomaly

Global vs. local

Degree to which a point is an anomaly

Identifying one anomaly at a time vs. many anomalies at once

Evaluation

Efficiency

Degree to which a point is an anomaly:

The assessment of the degree to which an object is anomalous is known as an anomaly score or
outlier score

Describing objects in a binary fashion (anomalous or not) does not reflect the underlying reality
that some objects are more extreme anomalies than others.

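Masking – techniques that attempt to identify one anomaly at a time are subject to the problem of
masking, where the presence of several anomalies hides the presence of all of them

Swamping – techniques that attempt to identify multiple outliers at once are subject to the
problem of swamping, where normal objects are classified as outliers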
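
Masking is usually described as several anomalies hiding one another. A small sketch of that
effect, using a z-score cut-off of 3 on invented data: one extreme value is flagged, but three
identical extreme values inflate the mean and standard deviation enough that none of them is
flagged.

```python
import statistics

def z_scores(values):
    """Number of sample standard deviations each value lies from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Invented data: mostly 10s, with either one or three extreme values of 100.
one_outlier = [10] * 19 + [100]
three_outliers = [10] * 17 + [100] * 3

print(max(z_scores(one_outlier)))     # above 3, so the value is flagged
print(max(z_scores(three_outliers)))  # below 3, so the extreme values hide each other
```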

Statistical approach (model based):

Most statistical approaches to outlier detection are based on building a probability distribution
and considering how likely objects are under that model

An outlier is an object that has a low probability with respect to a probability distribution of the
model
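
As a rough sketch of this idea, assuming a single Gaussian is an acceptable model for the data: fit
the mean and standard deviation, compute the log density of each object, and rank the objects from
least to most likely. The data values and the function name gaussian_log_density are invented for
illustration:

```python
import math
import statistics

def gaussian_log_density(x, mu, sigma):
    """Log of the Gaussian density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2 * math.pi))

# Invented sample; 30.0 is placed far from the rest.
data = [9.8, 10.1, 10.4, 9.9, 10.0, 10.2, 9.7, 10.3, 10.1, 9.9, 10.0, 30.0]

mu = statistics.mean(data)
sigma = statistics.stdev(data)

# Objects with the lowest log density under the fitted model come first.
ranked = sorted(data, key=lambda x: gaussian_log_density(x, mu, sigma))
print("least likely objects:", ranked[:3])
```

Ranking avoids having to pick a probability threshold; in practice a cut-off would still have to be
chosen, and the fitted parameters are themselves affected by the outlier.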

Issues:

Identifying the specific distribution of a data set

Number of attributes used

Mixtures of distributions

Using the wrong distribution model for the data

Some distribution models: Gaussian, Poisson, binomial

Data can be modelled as a mixture of distributions and outlier detection schemes can be
developed based on such schemes

These models are more powerful but are more complicated

Detecting outliers for a single value:

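Z-score:

z_i = (x_i – mean(x)) / sd(x)

where x_i is the i-th value of the variable x, mean(x) is its mean and sd(x) is its standard
deviation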

A z-score of 3 or -3 is a common cut-off value. Values with a z-score greater than 3 or less than
-3 are treated as anomalies

The z-score is the number of standard deviations a value lies above or below the mean

For a bell-shaped (normal) curve, about 99.7% of values lie within 3 standard deviations of the
mean
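
A minimal sketch of the z = +/-3 rule. The grades below are hypothetical and are not taken from
spring2008exams.csv; the sample standard deviation is assumed, and the function names are chosen
here for illustration:

```python
import math
import statistics

def z_scores(values):
    """Number of sample standard deviations each value lies from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def z_score_outliers(values, cutoff=3.0):
    """Return (value, z) pairs whose |z| exceeds the cutoff."""
    return [(v, z) for v, z in zip(values, z_scores(values)) if abs(z) > cutoff]

# Hypothetical exam grades with one very low score.
grades = [78, 82, 85, 90, 74, 88, 91, 79, 83, 86,
          80, 84, 87, 77, 89, 81, 85, 83, 82, 12]

for value, z in z_score_outliers(grades):
    print(value, round(z, 2))

# Share of a normal distribution within 3 standard deviations of the mean.
print(math.erf(3 / math.sqrt(2)))  # about 0.9973
```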

Detecting outliers using boxplot:

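The default is to identify as outliers the values more than 1.5 IQRs above the third quartile or
more than 1.5 IQRs below the first quartile, where the IQR (interquartile range) is the third
quartile minus the first quartile

The boxplot approach is more robust than the z-score because the mean and standard deviation are
sensitive to outliers, while the quartiles are much less so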
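
A minimal sketch of the 1.5 IQR rule on hypothetical grades (not taken from spring2008exams.csv).
Quartiles can be computed by several conventions, so the fences here may differ slightly from
those drawn by a particular boxplot function; the "inclusive" method below is an assumption:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values more than k IQRs below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical exam grades with one very low score.
grades = [78, 82, 85, 90, 74, 88, 91, 79, 83, 86,
          80, 84, 87, 77, 89, 81, 85, 83, 82, 12]

print(iqr_outliers(grades))
```

Because the quartiles barely move when one extreme grade is added, the fences stay close to the
bulk of the data.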

Questions:
11) Use the grades for the first midterm in spring2008exams.csv. Are there
any outliers according to the z=+/-3 rule? What is the value of the largest z
score and what is the value of the smallest (most negative) z score?

12) Use the grades for the second midterm in spring2008exams.csv. Are
there any outliers according to the z=+/-3 rule? What is the value of the
largest z score and what is the value of the smallest (most negative) z
score?

13) Use Excel on the user agent column of the data in logs.txt (the user
agent column is the second to last column and the value for it in the first
row is "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR
1.1.4322)"). What user agents are identified as outliers using the z=+/-3
rule on the counts of the user agents? What are the z scores for these
outliers?

14) Use the grades for the second midterm in spring2008exams.csv to draw a
boxplot. Show your R commands and include the boxplot. Are any of the grades
for the second midterm outliers by the boxplot rule? If so, which ones?

15) Using the midterm grades at spring2008exams.csv. Use linear
regression to find outliers. Be sure to include the plot. Which student had
the largest POSITIVE residual?
