
Mar 15, 2018

Anomaly Detection

Revision 2.0

Bangalore Professional Development Center

Work Integrated Learning Programmes

Statistical Approaches

Univariate Normal Distribution

A model (or distribution) is created for the data.

Objects are evaluated with respect to how well they fit the model.

For a univariate normal distribution, the Gaussian (normal) density is used to identify outliers.

The normal distribution N(μ, σ) has two parameters: the mean (μ) and the standard deviation (σ). The plot shows the probability density function for N(0, 1).

An object's distance from the center can be used to test whether it is an outlier. For N(0, 1), the probability that an object lies more than 4 away from the center is less than one in ten thousand (extremely low).
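The tail probability quoted above can be checked numerically. The following is a minimal sketch, assuming a standard normal N(0, 1) and a two-sided test (distance from the center in either direction); the helper name `two_sided_tail` is hypothetical and not part of the slides.

```python
import math


def two_sided_tail(c: float) -> float:
    """P(|x| >= c) for x drawn from N(0, 1), via the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))


# Probability of landing at least c away from the center, for a few cut-offs.
for c in (1.0, 2.0, 3.0, 4.0):
    print(f"|x| >= {c:.0f}: {two_sided_tail(c):.6f}")
```

The smaller this probability is for an object, the stronger the case for treating it as an outlier.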

Normal (Gaussian) Density Function

$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where,

P(x) = probability density at x in the distribution

μ = mean

σ² = variance

Exercise

24, 3, 18, 19, 21, 13}. Find the probability density for the value 15, considering that the vector contains the entire population.

Mean (μ) = 15.71

Variance (σ²) = 42.20

Standard Deviation (σ) = 6.50

A = 1 / ((2π)^(1/2) · σ) = 0.061

x = 15

B = −(x − μ)² / (2σ²) = −0.006

C = e^B = 2.71828^(−0.006) = 0.994 ≈ 1

So the probability density = A × C = 0.061 × 1 = 0.061

2. [Optional] Z-normalize the above vector and, using the Excel sheet, draw the probability density plot. You can use Excel's NORMDIST() function to calculate the probability density of each data point.
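The arithmetic of the first exercise can be reproduced in a few lines. This is a sketch that takes the mean and variance stated in the solution as given (the start of the vector is not available here, so the statistics are not recomputed); the function name `normal_density` is hypothetical.

```python
import math


def normal_density(x: float, mean: float, variance: float) -> float:
    """Univariate normal density: A * e^B, with A and B as in the worked solution."""
    a = 1.0 / math.sqrt(2.0 * math.pi * variance)   # A = 1 / ((2*pi)^(1/2) * sigma)
    b = -((x - mean) ** 2) / (2.0 * variance)       # B = -(x - mu)^2 / (2 * sigma^2)
    return a * math.exp(b)


# Values taken from the worked solution: mu = 15.71, sigma^2 = 42.20, x = 15.
density_at_15 = normal_density(15.0, mean=15.71, variance=42.20)
print(round(density_at_15, 3))
```

Because 15 is very close to the mean, the exponential factor is almost 1 and the result is essentially the peak height A of the curve.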


Multivariate Normal Distribution

Probability Density Function (PDF)

How is this density identified?

How can outliers be identified in a multivariate normal distribution in general?

BITS Pilani, WILP

Outlier in Multivariate Normal Distribution

For a univariate dataset, the outlier detection approach uses the probability density function derived from μ and σ, assuming the points are normally distributed.

The question is how to adopt a similar approach for a multivariate normal distribution. The answer is to take the same approach, and this is where the covariance comes into the picture.

When the attributes are normally distributed and correlated with each other, the concept of the Mahalanobis distance comes into the picture. It uses the covariance in calculating the distance.

It was formalized by P.C. Mahalanobis, the famous Indian statistician who is remembered as the founder of the Indian Statistical Institute, Kolkata, and as a member of the first Planning Commission of India.

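$$\text{Mahalanobis}(X, \bar{X}) = [X - \bar{X}] \cdot S^{-1} \cdot [X - \bar{X}]^{T}$$

where, $\bar{X}$ is the mean of X, $S^{-1}$ is the inverse of the covariance matrix S, and $[X - \bar{X}]^{T}$ is the transpose of the matrix $[X - \bar{X}]$.

P.C. Mahalanobis (1893-1972)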

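Probability Density Function

Multivariate Normal Distribution

For a dataset with m attributes, mean $\bar{X}$ and covariance matrix S, the probability density function for a data point x is given by:

$$P(x) = \frac{1}{(2\pi)^{m/2}\,|S|^{1/2}}\; e^{-\frac{1}{2}\,(x-\bar{X})\cdot S^{-1}\cdot (x-\bar{X})^{T}}$$

The logarithm of this density is, apart from a constant, proportional to the magnitude of the Mahalanobis distance (ln e^(−x) = −x).

The Mahalanobis distance can therefore be used to find the outliers instead of calculating the actual probability.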
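The link between the density and the Mahalanobis distance can be illustrated with a small two-attribute sketch. The mean and covariance matrix below are hypothetical values chosen only for illustration, and the function names are not from the slides. The distance is used as defined in the text, without a square root.

```python
import math


def mahalanobis(x, mean, s_inv):
    """(x - mean) . S^-1 . (x - mean)^T for 2-D points (no square root taken)."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return (dx * (s_inv[0][0] * dx + s_inv[0][1] * dy)
            + dy * (s_inv[1][0] * dx + s_inv[1][1] * dy))


def mvn_density(x, mean, s):
    """Bivariate normal density (m = 2) for covariance matrix s."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    s_inv = [[s[1][1] / det, -s[0][1] / det],
             [-s[1][0] / det, s[0][0] / det]]
    m = 2
    norm = (2.0 * math.pi) ** (m / 2) * math.sqrt(det)
    return math.exp(-0.5 * mahalanobis(x, mean, s_inv)) / norm


# Hypothetical parameters: positively correlated attributes.
mean = (0.0, 0.0)
s = [[2.0, 1.0], [1.0, 2.0]]
s_inv = [[2 / 3, -1 / 3], [-1 / 3, 2 / 3]]  # inverse of s (determinant is 3)

for point in [(1.0, 1.0), (1.0, -1.0), (3.0, -3.0)]:
    d = mahalanobis(point, mean, s_inv)
    p = mvn_density(point, mean, s)
    # log(p) + d / 2 comes out as the same constant for every point.
    print(point, round(d, 3), p, math.log(p) + 0.5 * d)
```

The points (1, 1) and (1, −1) are equally far from the mean in Euclidean terms, but the second one runs against the correlation, so it gets a larger Mahalanobis distance and a lower density. Ranking objects by the distance therefore gives the same order as ranking them by density.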


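Covariance Matrix & Inverse

If there are two attributes (X, Y), then the covariance matrix is defined as:

$$S = \begin{bmatrix} S_{XX} & S_{XY} \\ S_{YX} & S_{YY} \end{bmatrix}$$

The inverse of a 2x2 matrix is calculated as shown below. The inverse of the above covariance matrix ($S^{-1}$) can be calculated similarly:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$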
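The 2x2 inverse rule translates directly into code. This is a minimal sketch; the matrix values are hypothetical, and the helper names `inverse_2x2` and `matmul_2x2` are made up for this example.

```python
def inverse_2x2(matrix):
    """Invert [[a, b], [c, d]] as (1 / (ad - bc)) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular; no inverse exists")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul_2x2(p, q):
    """Product of two 2x2 matrices, used here only to check the inverse."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Hypothetical covariance matrix (symmetric, so S_XY equals S_YX).
s = [[4.0, 1.0], [1.0, 2.0]]
s_inv = inverse_2x2(s)
print(s_inv)
print(matmul_2x2(s, s_inv))  # should be the identity matrix up to rounding
```

A zero determinant means the two attributes are perfectly correlated (or one has no variance), in which case the inverse, and with it the Mahalanobis distance, is undefined.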

Inverse of a 3x3 matrix.


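Mahalanobis Distance

Illustration

| # | X | Y | (X-X') | (Y-Y') | Mahalanobis Dist |
|---|---|---|--------|--------|------------------|
| A | 2 | 2 | -2.7 | -2.6 | 4.00 |
| B | 2 | 5 | -2.7 | 0.4 | 2.06 |
| C | 6 | 5 | 1.3 | 0.4 | 0.44 |
| D | 7 | 3 | 2.3 | -1.6 | 3.32 |
| E | 4 | 7 | -0.7 | 2.4 | 3.30 |
| F | 6 | 4 | 1.3 | -0.6 | 0.77 |
| G | 5 | 3 | 0.3 | -1.6 | 1.41 |
| H | 4 | 6 | -0.7 | 1.4 | 1.27 |
| I | 2 | 5 | -2.7 | 0.4 | 2.06 |
| J | 1 | 3 | -3.7 | -1.6 | 3.66 |
| K | 6 | 5 | 1.3 | 0.4 | 0.44 |
| L | 7 | 4 | 2.3 | -0.6 | 1.80 |
| M | 8 | 7 | 3.3 | 2.4 | 4.31 |
| N | 5 | 6 | 0.3 | 1.4 | 0.94 |
| O | 5 | 4 | 0.3 | -0.6 | 0.24 |

Mean (X', Y') = (4.7, 4.6)

$$S^{-1} = \begin{bmatrix} 0.25 & -0.09 \\ -0.09 & 0.51 \end{bmatrix}$$

Review the Excel sheet provided through eLearn for the calculations.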
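The table can be recomputed from the fifteen (X, Y) pairs alone. The sketch below assumes the population covariance (dividing by n rather than n − 1); that choice is an assumption, made because it reproduces the inverse matrix 0.25, −0.09, −0.09, 0.51 shown with the table. The distance is the form used in the slides, without a square root.

```python
points = {
    "A": (2, 2), "B": (2, 5), "C": (6, 5), "D": (7, 3), "E": (4, 7),
    "F": (6, 4), "G": (5, 3), "H": (4, 6), "I": (2, 5), "J": (1, 3),
    "K": (6, 5), "L": (7, 4), "M": (8, 7), "N": (5, 6), "O": (5, 4),
}

n = len(points)
mean_x = sum(x for x, _ in points.values()) / n
mean_y = sum(y for _, y in points.values()) / n

# Population covariance (divide by n): an assumption, see the note above.
s_xx = sum((x - mean_x) ** 2 for x, _ in points.values()) / n
s_yy = sum((y - mean_y) ** 2 for _, y in points.values()) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in points.values()) / n

# Inverse of the 2x2 covariance matrix.
det = s_xx * s_yy - s_xy ** 2
s_inv = [[s_yy / det, -s_xy / det], [-s_xy / det, s_xx / det]]


def mahalanobis(x, y):
    """[X - X'] . S^-1 . [X - X']^T for one object (no square root taken)."""
    dx, dy = x - mean_x, y - mean_y
    return dx * dx * s_inv[0][0] + 2 * dx * dy * s_inv[0][1] + dy * dy * s_inv[1][1]


distances = {name: mahalanobis(x, y) for name, (x, y) in points.items()}

# Objects listed from most to least distant.
for name, d in sorted(distances.items(), key=lambda item: item[1], reverse=True):
    print(name, round(d, 2))
```

The computed values should agree with the table to within about 0.01; small differences can arise if the spreadsheet worked from rounded deviations or a rounded inverse matrix.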

A Question!

… can be visually identified, or may be identified using the Euclidean distance.

Why do we need the Mahalanobis distance?

… only. When we have more dimensions and the attributes have a probabilistic relationship, the Mahalanobis distance is required.

Figure: 2-D view of the dataset points.

Figure: A possible higher dimensional view of the same dataset.

Density Based Outlier Detection

The DBSCAN clustering algorithm identifies outliers with a global view of the dataset.

In practice, datasets can have a more complex structure (for example, a dataset with regions of different densities), where objects may be considered outliers with respect to their local neighbourhood.

In the data distribution shown, there are two clusters, C1 and C2.

Object O3 can be declared a distance-based outlier because it is far from the majority of the objects.

What about objects O1 and O2?

The distance of O1 and O2 from the objects of cluster C1 is smaller than the average distance of an object from its nearest neighbour in cluster C2.

O1 and O2 are not distance-based outliers. But they are outliers with respect to cluster C1 because they deviate significantly from the other objects of C1.

Similarly, the distance between O4 and its nearest neighbour in C2 is higher than the distance between O1 or O2 and their nearest neighbours in C1, yet O4 may not be an outlier because C2 is sparse. Distance-based detection does not capture local outliers. A different approach is needed.


Density Based Outliers

For a dataset with a few objects, let:

distance(x, y) = the distance between object x and object y using some norm.

N(x, k) = the set containing the k nearest neighbours of an object x.

density(x, k) = the density of object x with respect to its k nearest neighbours. It is defined as the reciprocal of the average distance of the k nearest neighbours from x. It can be written as follows:

$$\text{density}(x, k) = \frac{|N(x, k)|}{\sum_{y \in N(x, k)} \text{distance}(x, y)}$$

average relative density(x, k), or outlier score = the ratio of the density of an object x to the average density of its k nearest neighbours. It can be written as follows:

$$\text{average relative density}(x, k) = \frac{\text{density}(x, k)}{\sum_{y \in N(x, k)} \text{density}(y, k) \,/\, |N(x, k)|}$$

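Illustration

| Objects | X | Y |
|---------|------|------|
| A | 1.00 | 2.00 |
| B | 2.00 | 1.50 |
| C | 1.00 | 1.50 |
| D | 2.00 | 2.75 |
| E | 7.00 | 2.25 |
| F | 7.00 | 2.50 |
| G | 7.00 | 2.00 |
| H | 7.50 | 2.25 |
| I | 6.00 | 2.50 |

| Object Pairs | distance (x, y), L1 norm |
|--------------|--------------------------|
| A-B | 1.50 |
| A-C | 0.50 |
| A-D | 1.75 |
| B-C | 1.00 |
| B-D | 1.25 |
| C-D | 2.25 |
| E-F | 0.25 |
| E-G | 0.25 |
| E-H | 0.50 |
| E-I | 1.25 |
| F-G | 0.50 |
| F-H | 0.75 |
| F-I | 1.00 |
| G-H | 0.75 |
| G-I | 1.50 |
| H-I | 1.75 |

| Objects | k=3 nearest neighbours | density (x, k=3) | average relative density (x, k=3) (outlier score) |
|---------|------------------------|------------------|---------------------------------------------------|
| A | B, C, D | 0.80 | 1.11 |
| B | A, C, D | 0.80 | 1.11 |
| C | A, B, D | 0.80 | 1.11 |
| D | A, B, C | 0.57 | 0.71 |
| E | F, G, H | 3.00 | 1.64 |
| F | E, G, H | 2.00 | 0.92 |
| G | E, F, H | 2.00 | 0.92 |
| H | E, F, G | 1.50 | 0.64 |
| I | E, F, G | 0.80 | 0.34 |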

… objects in their clusters, and I is in a denser region than D (density 0.80 versus 0.57).

When the average relative density is taken into account, I (0.34) is identified as a more likely outlier than D (0.71).
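The density and average relative density columns can be recomputed from the coordinates with the two definitions given earlier. This is a sketch using the L1 norm and k = 3 as in the illustration; the function names are hypothetical, and ties between equally distant neighbours are broken by listing order, which is an assumption that does not affect this dataset.

```python
coords = {
    "A": (1.00, 2.00), "B": (2.00, 1.50), "C": (1.00, 1.50),
    "D": (2.00, 2.75), "E": (7.00, 2.25), "F": (7.00, 2.50),
    "G": (7.00, 2.00), "H": (7.50, 2.25), "I": (6.00, 2.50),
}


def l1_distance(p, q):
    """L1 (Manhattan) distance between two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def nearest_neighbours(name, k):
    """N(x, k): the k objects closest to `name`, excluding itself."""
    others = [other for other in coords if other != name]
    others.sort(key=lambda other: l1_distance(coords[name], coords[other]))
    return others[:k]


def density(name, k):
    """Reciprocal of the average distance to the k nearest neighbours."""
    neighbours = nearest_neighbours(name, k)
    total = sum(l1_distance(coords[name], coords[other]) for other in neighbours)
    return len(neighbours) / total


def average_relative_density(name, k):
    """Density of `name` divided by the mean density of its k nearest neighbours."""
    neighbours = nearest_neighbours(name, k)
    mean_neighbour_density = sum(density(other, k) for other in neighbours) / len(neighbours)
    return density(name, k) / mean_neighbour_density


scores = {name: average_relative_density(name, 3) for name in coords}
for name in coords:
    print(name, nearest_neighbours(name, 3), round(density(name, 3), 2), round(scores[name], 2))
```

A score well below 1 means the object sits in a sparser spot than its own neighbours do, which is what marks a local outlier; here the lowest score belongs to I.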


Review Points

vs. datasets where attributes represent spatial

coordinates.

Identify the similarities between the univariate and multivariate probability density functions for the normal distribution.

How do we find outliers in these two different datasets?

What are the limitations of DBSCAN in finding local outliers?

Can you think of a scenario where local outlier identification could be useful? (e.g. rural and urban areas in the same dataset)


Thank You
