You are on page 1of 15

# WHAT ARE OUTLIERS?

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Outliers can be caused by measurement or execution error. the outliers may be of particular interest

Applications:
Fraud detection
Medicine Public health

## OUTLIER DETECTION METHODS

Statistical Distribution-Based Outlier Detection Distance-Based Outlier Detection Density-Based Local Outlier Detection Deviation-Based Outlier Detection

## Statistical Distribution-Based Outlier Detection

assumes a distribution for the given data set identifies outliers with respect to the model using a discordancy test requires knowledge of the data set parameters knowledge of distribution parameters expected number of outliers.

## How does the discordancy testing work?

This test examines two hypotheses: working hypothesis alternative hypothesis

A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is, H : oi E F, where i = 1, 2, , n. Verifies whether oi is <> in relation to F Assume T is some statistic used as discordancy test Assume value of the statistic for object oi is vi Then distribution T is constructed SP(vi)=Prob(T > vi), is evaluated If SP(vi) is small H is rejected

An alternative hypothesis, H, which states that oi comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen because oi may be an outlier under one model and a perfectly valid value under another.

kinds of alternative distributions. Inherent alternative distribution H : oi E G, where i = 1, 2, : : : , n Mixture alternative distribution G. H : oi E (1-mu)F +muG, where i = 1, 2, : : : , n. Slippage alternative distribution

## Distance-Based Outlier Detection

An object, o, in a data set, D, is a distancebased (DB) outlier with parameters pct and dmin,11 that is, a DB(pct;dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o.

## algorithms for mining distance-based outliers

Index-based algorithm Nested-loop algorithm Cell-based algorithm

## Density-Based Local Outlier Detection

Distance-based outlier detection is based on global distance distribution It encounters difficulties to identify outliers if data is not uniformly distributed.

## Deviation-Based Outlier Detection

it identifies outliers by examining the main characteristics of objects in a group

two techniques for deviation-based outlier detection Sequential Exception Technique OLAP Data Cube Technique