
MACHINE LEARNING AND NETWORK SECURITY

UNIT 2: Anomaly Detection


Anomaly vs Outliers
A standard normal table (also called the unit normal table or z-score table) is a mathematical table of the values
of Φ, the cumulative distribution function of the standard normal distribution. The z-score, also known as
the standard score, indicates how many standard deviations a value lies from the mean.

$$Z = \frac{X - \mu}{\sigma}$$
Reference Link: https://www.machinelearningplus.com/machine-learning/how-to-detect-outliers-with-z-score/
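
The z-score formula above can be turned into a simple outlier check. The following is a minimal sketch, not a definitive method: the function names, the sample weights, and the cut-off of 3 standard deviations are assumptions chosen for illustration, and the population standard deviation is used for σ.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score (X - mu) / sigma of every value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(x - mu) / sigma for x in values]

def flag_outliers(values, threshold=3.0):
    """Return the values whose |z| exceeds the threshold.

    The default threshold of 3 is a common rule of thumb, not a fixed rule.
    """
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical sample: fourteen weights near 150 and one extreme value.
weights = [150, 152, 149, 151, 148, 150, 153, 147, 151, 149,
           150, 152, 148, 151, 300]
print(flag_outliers(weights))
```

With this sample only the value 300 lies more than 3 standard deviations from the mean. Note that a single extreme value also inflates the mean and the standard deviation, so on very small samples the z-score of an outlier can stay below the threshold.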
Standard Normal Distribution
Mean=0, Standard Deviation=1
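
The values listed in a standard normal table can also be computed directly. The sketch below assumes the standard identity between the cumulative distribution function of the standard normal distribution and the error function. The function names are hypothetical.

```python
import math

def standard_normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_tail(z):
    """Probability of observing a value at least |z| standard deviations from the mean."""
    return 2.0 * (1.0 - standard_normal_cdf(abs(z)))

for z in (1.0, 2.0, 3.0):
    print(z, round(standard_normal_cdf(z), 4), round(two_sided_tail(z), 4))
```

The two-sided tail probability shrinks quickly as |z| grows, which is why values several standard deviations from the mean are treated as candidates for outliers.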
Causes of Anomalies

1. Data from different classes

An object may be anomalous because it belongs to a different class than the rest of the data. Credit card theft,
network intrusion, disease outcomes, and abnormal test results are good examples of anomalies that arise this way
and can be identified using class labels. Example: measuring the weights of oranges when a few grapefruit are mixed in.

2. Natural variation

In a Normal (Gaussian) distribution, the probability of a data object decreases rapidly as its distance from the mean
increases. Objects far from the mean are therefore rare and are considered anomalies. They are also called outliers.
Example: unusually tall people.

3. Data measurement and Collection Errors

These errors occur when erroneous data is collected or when the measurement process itself is faulty.
Example: a recorded weight of 200 pounds for a 2-year-old child.
[Figure: line A is blue, line B is green, and line C is red.]
We could use a clustering algorithm to assign each object to a cluster.
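
One way to assign cluster membership is k-means. The notes do not name a specific clustering algorithm, so the choice of k-means, the function names, the sample points, and the starting centroids below are all assumptions made for illustration.

```python
import math

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    updated = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            updated.append(old)  # keep an empty cluster's centroid in place
        else:
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return updated

def k_means(points, centroids, iterations=10):
    """Alternate assignment and update steps for a fixed number of iterations."""
    for _ in range(iterations):
        labels = assign_clusters(points, centroids)
        centroids = update_centroids(points, labels, centroids)
    return assign_clusters(points, centroids), centroids

# Hypothetical 2-D data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
print(centroids)
```

Here the first three points end up in cluster 0 and the last three in cluster 1. In practice the result depends on the starting centroids and on the chosen number of clusters.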
Other Challenges in Anomaly Detection
Machine learning methods can be classified in many different ways. Quite frequently, we differentiate between
supervised and unsupervised learning. In supervised learning, the learning program needs labeled examples
given by a “teacher”, whereas in unsupervised learning, the program learns patterns directly from the data,
without any human intervention or guidance. The typical supervised approach is to build a predictive model
for the normal and anomaly classes and then compare each unseen data instance against the model to identify
which class it belongs to. An unsupervised method instead relies on certain assumptions: (i) normal instances
are far more frequent than anomalous instances, and (ii) anomalous instances are statistically different from
normal instances. If these assumptions do not hold, such methods suffer from high false alarm rates.
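
The supervised approach described above can be sketched with a very small classifier. This is an illustration under stated assumptions only: the nearest-class-mean rule, the function names, and the single numeric feature (a hypothetical count of login attempts per minute) are not taken from the notes.

```python
from statistics import mean

def fit_class_means(samples, labels):
    """Supervised step: compute one mean per labeled class."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return {y: mean(xs) for y, xs in by_class.items()}

def predict(model, x):
    """Assign an unseen instance to the class whose mean is closest."""
    return min(model, key=lambda y: abs(x - model[y]))

# Hypothetical labeled training data supplied by a "teacher".
samples = [10, 12, 11, 9, 95, 102]
labels = ["normal", "normal", "normal", "normal", "anomaly", "anomaly"]

model = fit_class_means(samples, labels)
print(predict(model, 13))
print(predict(model, 90))
```

The sketch only works because labeled anomalies are available. An unsupervised method would have to rely on the two assumptions above instead of on labels.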

For supervised learning, an important issue is to obtain accurate and representative labels, especially for the
anomaly classes.
Various Types of Data

The attributes used to describe real-life objects can be of different types. The following are the commonly used
types of attribute variables.
