
# WHAT ARE OUTLIERS?

• A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers.
• Outliers may be of particular interest.
• Outliers can be caused by measurement or execution error.

Applications:
• Fraud detection
• Medicine
• Public health
• Sports statistics
• Detecting measurement errors

# OUTLIER DETECTION METHODS

• Statistical Distribution-Based Outlier Detection
• Distance-Based Outlier Detection
• Density-Based Local Outlier Detection
• Deviation-Based Outlier Detection

# Statistical Distribution-Based Outlier Detection

• Assumes a distribution for the given data set
• Identifies outliers with respect to the model using a discordancy test
• Requires knowledge of the data set parameters, the distribution parameters, and the expected number of outliers

# How does discordancy testing work?

• The test examines two hypotheses:
• a working hypothesis
• an alternative hypothesis

• A working hypothesis, $H$, is a statement that the entire data set of $n$ objects comes from an initial distribution model, $F$, that is, $H : o_i \in F$, where $i = 1, 2, \ldots, n$.
• The test verifies whether $o_i$ is significantly large (or small) in relation to $F$.
• Assume $T$ is some statistic used as the discordancy test.
• Assume the value of the statistic for object $o_i$ is $v_i$.
• The distribution of $T$ is then constructed.
• The significance probability $SP(v_i) = \mathrm{Prob}(T > v_i)$ is evaluated.
• If $SP(v_i)$ is sufficiently small, $H$ is rejected.

• An alternative hypothesis, $H'$, which states that $o_i$ comes from another distribution model, $G$, is adopted.
• The result depends heavily on which model $F$ is chosen, because $o_i$ may be an outlier under one model and a perfectly valid value under another.
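The discordancy test described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: it assumes $F$ is a normal distribution with known mean and standard deviation, it takes the statistic $T$ to be the absolute standardized deviation, and it treats "small" as below a hypothetical threshold `alpha`. None of these choices comes from the slides.

```python
import math


def significance_probability(x, mean, std):
    """SP(v) = Prob(T > v) under the working hypothesis H: x comes from F.

    Assumption: F is Normal(mean, std) and T = |X - mean| / std,
    so Prob(T > v) = erfc(v / sqrt(2)) (two-sided normal tail).
    """
    v = abs(x - mean) / std
    return math.erfc(v / math.sqrt(2))


def discordant_objects(data, mean, std, alpha=0.01):
    """Return the objects for which H is rejected (SP below alpha).

    alpha is an illustrative threshold; the slides only say "small".
    """
    return [x for x in data if significance_probability(x, mean, std) < alpha]


readings = [9.8, 10.1, 10.0, 9.9, 10.2, 14.0]
outliers = discordant_objects(readings, mean=10.0, std=0.2)
```

Changing the assumed model (for example a larger `std`) can turn the same reading from an outlier into a valid value, which is the model dependence noted above.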

Kinds of alternative distributions:
• Inherent alternative distribution: $H' : o_i \in G$, where $i = 1, 2, \ldots, n$
• Mixture alternative distribution: $H' : o_i \in (1 - \mu)F + \mu G$, where $i = 1, 2, \ldots, n$
• Slippage alternative distribution
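To make the mixture alternative concrete, the sketch below draws each object from $G$ with probability $\mu$ and from $F$ otherwise. The helper name `sample_mixture` and the two normal distributions used for $F$ and $G$ are illustrative assumptions, not part of the slides.

```python
import random


def sample_mixture(n, mu, draw_f, draw_g, rng):
    """Draw n objects from the mixture (1 - mu) * F + mu * G.

    Each object comes from G with probability mu (a contaminant)
    and from F otherwise. Returns (value, source) pairs so the
    origin of each object can be inspected.
    """
    samples = []
    for _ in range(n):
        if rng.random() < mu:
            samples.append((draw_g(rng), "G"))
        else:
            samples.append((draw_f(rng), "F"))
    return samples


rng = random.Random(0)  # fixed seed so the run is repeatable
data = sample_mixture(
    1000,
    mu=0.05,
    draw_f=lambda r: r.gauss(0.0, 1.0),  # assumed F: Normal(0, 1)
    draw_g=lambda r: r.gauss(8.0, 1.0),  # assumed G: Normal(8, 1)
    rng=rng,
)
contaminants = sum(1 for _, source in data if source == "G")
```

With a small `mu`, most objects follow $F$ and only a few contaminants from $G$ appear, which is the situation the mixture alternative describes.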

# Distance-Based Outlier Detection

• An object, o, in a data set, D, is a distance-based (DB) outlier with parameters pct and dmin, that is, a DB(pct, dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o.
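The DB(pct, dmin) definition translates almost directly into code. The following is a naive sketch that compares every object with every other object; it assumes Euclidean distance and points given as coordinate tuples, and the function names are illustrative.

```python
import math


def is_db_outlier(o, data, pct, dmin):
    """True if at least a fraction pct of the objects in data
    lie at a distance greater than dmin from o."""
    far = sum(1 for p in data if math.dist(o, p) > dmin)
    return far / len(data) >= pct


def db_outliers(data, pct, dmin):
    """Return every DB(pct, dmin)-outlier in data (quadratic scan)."""
    return [o for o in data if is_db_outlier(o, data, pct, dmin)]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
found = db_outliers(points, pct=0.75, dmin=3.0)
```

The scan costs one distance computation per pair of objects, so it only suits small data sets.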

# Algorithms for mining distance-based outliers

• Index-based algorithm
• Nested-loop algorithm
• Cell-based algorithm

# Density-Based Local Outlier Detection

• Distance-based outlier detection is based on the global distance distribution.
• It has difficulty identifying outliers when the data is not uniformly distributed.
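One common way to score outliers by local density is the local outlier factor (LOF). The slides do not name a specific measure, so treating LOF as the technique here is an assumption. The sketch below is simplified: it uses exactly the k nearest neighbors (ties are not expanded), Euclidean distance, and assumes no duplicate points.

```python
import math


def lof_scores(points, k):
    """Return a simplified local outlier factor for each point.

    Scores near 1 mean the point is about as dense as its neighbors;
    scores well above 1 mean it sits in a much sparser region.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        nearest = order[:k]
        neighbors.append(nearest)
        k_dist.append(dist[i][nearest[-1]])  # distance to the k-th neighbor

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(i, j) = max(k-distance(j), dist(i, j)).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


grid = [(x, y) for x in range(3) for y in range(3)]  # dense 3x3 cluster
scores = lof_scores(grid + [(10, 10)], k=3)          # last point is isolated
```

Because each point is compared only with its own neighborhood, the score adapts to regions of different density instead of relying on one global distance threshold.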

# Deviation-Based Outlier Detection

• Identifies outliers by examining the main characteristics of objects in a group
• Two techniques for deviation-based outlier detection:
• Sequential Exception Technique
• OLAP Data Cube Technique

# Sequential Exception Technique

# OLAP Data Cube Technique