In anomaly detection, the goal is to find the objects that are different from most other objects.
These objects are called outliers. Anomaly detection is also called deviation detection.
Anomalous results may indicate either a problem with the experiment or a new phenomenon to
be investigated.
Examples:
Fraud detection
Intrusion detection
Ecosystem disturbances
A relatively small number of outliers can distort the mean and standard deviation of a data set.
They may also alter the set of clusters produced by a clustering algorithm
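The effect on the mean and standard deviation can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the measurements are hypothetical and the value 100.0 is an injected outlier, not data from these notes.

```python
import statistics

# Hypothetical measurements (not from the notes); 100.0 is an injected outlier.
clean = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0]
with_outlier = clean + [100.0]


def summarize(values):
    """Return (mean, sample standard deviation, median) for a list of numbers."""
    return (
        statistics.mean(values),
        statistics.stdev(values),
        statistics.median(values),
    )


clean_mean, clean_sd, clean_median = summarize(clean)
dirty_mean, dirty_sd, dirty_median = summarize(with_outlier)

print(f"without outlier: mean={clean_mean:.2f} sd={clean_sd:.2f} median={clean_median:.2f}")
print(f"with outlier:    mean={dirty_mean:.2f} sd={dirty_sd:.2f} median={dirty_median:.2f}")
```

In this sketch a single extreme value more than doubles the mean and inflates the standard deviation many times over, while the median stays where it was.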
Nowadays, in many cases the anomalies themselves are the focus, rather than something to be removed
Topics:
An outlier is an observation that differs so much from other observations as to arouse suspicion
that it was generated by a different mechanism (or is from a different class)
The probability of a data object decreases rapidly as the distance of the object from the center
of the distribution increases
Causes of anomalies:
Natural variation
In many cases we can find the anomalies but cannot find their cause
Approaches:
In clustering – an anomaly is an object that does not belong to any cluster, or anomalies are
small clusters that are far from other clusters
In regression – an anomaly is an object that is relatively far from its predicted value
In classification – as anomalous and normal objects can be viewed as defining two distinct
classes, classification can be used for building models of these two classes
Anomalies are relatively rare, and this needs to be taken into account when choosing a
classification technique and the measures to be used for evaluation
Here we first build any one of the above models and then find the outliers based on the objective
function of the respective model
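For the regression case, "far from its predicted value" can be read as a large residual. The sketch below is one way to illustrate it, assuming a simple least-squares line as the model. The data are hypothetical, and `fit_line` is a helper written for this example rather than a library function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data that follow y = 2x + 1, except for one injected anomaly.
xs = [float(x) for x in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
ys[6] = 40.0  # the line would predict 13.0 here

slope, intercept = fit_line(xs, ys)

# Anomaly score: absolute distance between the observed and the predicted value.
residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
most_anomalous = max(range(len(xs)), key=lambda i: residuals[i])

print("index with the largest residual:", most_anomalous)
```

Note that the anomaly itself pulls the fitted line toward it, so in practice a robust fit or a refit after removing suspect points may be preferable.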
Proximity based:
Also called distance-based outlier detection techniques, such as the z-score and the boxplot
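As an illustration of scoring objects by distance, the sketch below uses the distance to the k-th nearest neighbour as the anomaly score. This particular score is an assumption chosen for the example, since the notes do not specify one. The points and the helper `kth_neighbor_distance` are hypothetical.

```python
import math


def kth_neighbor_distance(points, k):
    """Score each point by the distance to its k-th nearest other point."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores


# Hypothetical 2-D points: a tight group near the origin and one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]

scores = kth_neighbor_distance(points, k=2)
most_anomalous = points[scores.index(max(scores))]

for point, score in zip(points, scores):
    print(point, round(score, 3))
```

A larger score means the object has no close neighbours. The choice of k is a tuning decision, not something prescribed here.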
Supervised
Unsupervised
Semi-supervised
The major distinction among these three is the degree to which class labels (output labels) are
available for the data
Supervised: Techniques for supervised anomaly detection require the existence of a training set
with both anomalous and normal objects
Unsupervised: In many practical situations class labels are not available. In such cases the
objective is to assign a score to each instance that reflects the degree to which the instance is
anomalous
Semi-supervised: Sometimes the training data contains labelled normal objects but has no
information about the anomalous objects
A general definition of an anomaly must specify how the values of multiple variables are used to
determine whether an object is anomalous or not
Issues:
Evaluation
Efficiency
The assessment of the degree to which an object is anomalous is known as its anomaly score or
outlier score
If we label objects only in a binary fashion (anomalous or not), this does not reflect the
underlying reality that some objects are more extreme anomalies than others.
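Masking – techniques that attempt to identify one anomaly at a time are subject to the problem
of masking, where the presence of several anomalies hides the presence of all of them
Swamping – techniques that attempt to identify multiple outliers at a time are subject to the
problem of swamping, where normal objects are classified as outliers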
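One way masking can show up is with a z-score cut-off: several extreme values inflate the mean and standard deviation so much that none of them exceeds the cut-off. The sketch below illustrates this under stated assumptions. The data are hypothetical, the sample standard deviation is used, and `flagged_by_z_rule` is a helper written for this example.

```python
import statistics


def flagged_by_z_rule(values, cutoff=3.0):
    """Return the values whose absolute z-score exceeds the cutoff."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [v for v in values if abs(v - mean) / sd > cutoff]


# Hypothetical data: fifteen ordinary values, then one or two extreme ones.
normal = [9.0, 10.0, 11.0] * 5
one_outlier = normal + [50.0]
two_outliers = normal + [50.0, 50.0]

flagged_one = flagged_by_z_rule(one_outlier)
flagged_two = flagged_by_z_rule(two_outliers)

print("one extreme value, flagged: ", flagged_one)
print("two extreme values, flagged:", flagged_two)
```

With one extreme value the rule flags it. With two, each one hides the other and nothing is flagged.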
Most statistical approaches to outlier detection are based on building a probability distribution
and considering how likely objects are under that model
An outlier is an object that has a low probability with respect to the probability distribution
model of the data
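A minimal sketch of this idea, assuming the data can be modelled by a single normal distribution: fit the distribution to the observations, then rank the observations by their density under the fitted model. The observations are hypothetical. `NormalDist` comes from the Python standard library (3.8 or later).

```python
from statistics import NormalDist

# Hypothetical observations; 9.0 is far from the rest.
observations = [4.9, 5.1, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9, 9.0]

# Estimate the mean and standard deviation from the data
# (assumes a single normal distribution is a reasonable model).
model = NormalDist.from_samples(observations)

# Rank the observations from least likely to most likely under the model.
ranked = sorted(observations, key=model.pdf)
least_likely = ranked[0]

for x in ranked:
    print(f"x={x:.1f} density={model.pdf(x):.4f}")
```

The density acts as an inverse anomaly score: the lower the density, the more anomalous the object under the assumed model.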
Issues:
Mixtures of distributions
Data can be modelled as a mixture of distributions, and outlier detection schemes can be
developed based on such models
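Zscore:
The z-score of a value is its distance from the mean, measured in standard deviations.
A z-score of 3 or -3 is a common cut-off value. Values with a z-score greater than 3 or less than
-3 are referred to as anomalies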
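The z = +/-3 rule can be sketched as follows. The grades are hypothetical and are not taken from spring2008exams.csv. The sample standard deviation is used here, which is an assumption, since the notes do not say whether the sample or population version is intended.

```python
import statistics


def z_scores(values):
    """Return the z-score of each value: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [(v - mean) / sd for v in values]


# Hypothetical grades; the final grade of 5 is unusually low.
grades = [70, 72, 68, 75, 71, 69, 73, 74, 70, 72,
          71, 69, 68, 73, 75, 70, 72, 71, 74, 5]

scores = z_scores(grades)
outliers = [g for g, z in zip(grades, scores) if abs(z) > 3]

print("largest z-score: ", round(max(scores), 2))
print("smallest z-score:", round(min(scores), 2))
print("outliers by the z = +/-3 rule:", outliers)
```

The z-scores themselves serve as anomaly scores, and the cut-off of 3 turns them into a yes-or-no decision.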
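Boxplot:
The default is to identify as outliers the values that lie more than 1.5 IQRs above the third
quartile or more than 1.5 IQRs below the first quartile, where the IQR (interquartile range) is
the third quartile minus the first quartile
The boxplot approach is more robust than the z-score because the mean and standard deviation are
sensitive to outliers, whereas the quartiles are far less so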
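The 1.5 IQR rule can be sketched in Python as below. The grades are hypothetical. Quartile conventions differ between tools, so the fences computed here (with `statistics.quantiles` and `method="inclusive"`) may differ slightly from what a given boxplot function draws.

```python
import statistics


def iqr_outliers(values, multiplier=1.5):
    """Return (lower fence, upper fence, values outside the fences)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    outside = [v for v in values if v < lower or v > upper]
    return lower, upper, outside


# Hypothetical grades; the final grade of 5 is unusually low.
grades = [70, 72, 68, 75, 71, 69, 73, 74, 70, 72,
          71, 69, 68, 73, 75, 70, 72, 71, 74, 5]

lower_fence, upper_fence, outliers = iqr_outliers(grades)

print("fences:", lower_fence, upper_fence)
print("outliers by the 1.5 IQR rule:", outliers)
```

Because the fences are built from quartiles, the extreme grade has little influence on where they fall.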
Questions:
11) Using the grades for the first midterm in spring2008exams.csv, are there
any outliers according to the z=+/-3 rule? What is the value of the largest z
score and what is the value of the smallest (most negative) z score?
12) Using the grades for the second midterm in spring2008exams.csv, are
there any outliers according to the z=+/-3 rule? What is the value of the
largest z score and what is the value of the smallest (most negative) z
score?
13) Using Excel for the user agent column of the data at logs.txt (The user
agent column is the second to last column and the value for it in the first
row is "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR
1.1.4322)"). What user agents are identified as outliers using the z=+/-3
rule on the counts of the user agents? What are the z scores for these
outliers?
14) Using the grades for the second midterm in spring2008exams.csv, show
your R commands and include the boxplot. Are any of the grades for the
second midterm outliers by the 1.5 IQR boxplot rule? If so, which ones?