A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Outliers can be caused by measurement or execution errors. Rather than being discarded as noise, the outliers may themselves be of particular interest.
Applications:
Fraud detection
Medicine
Public health
A working hypothesis, $H$, is a statement that the entire data set of $n$ objects comes from an initial distribution model, $F$, that is, $H: o_i \in F$, where $i = 1, 2, \ldots, n$.
A discordancy test verifies whether an object $o_i$ is significantly large (or small) in relation to $F$:
Assume $T$ is some statistic used as the discordancy test.
Assume the value of the statistic for object $o_i$ is $v_i$.
The distribution of $T$ is then constructed.
The significance probability $SP(v_i) = \mathrm{Prob}(T > v_i)$ is evaluated.
If $SP(v_i)$ is sufficiently small, $o_i$ is discordant and $H$ is rejected.
An alternative hypothesis, $\overline{H}$, which states that $o_i$ comes from another distribution model, $G$, is adopted. The result depends heavily on which model $F$ is chosen, because $o_i$ may be an outlier under one model and a perfectly valid value under another.
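The discordancy test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed method: $F$ is assumed to be a normal distribution with known parameters, the statistic is taken to be the absolute standardized deviation, and the names `significance_probability`, `discordant_objects`, and the threshold `alpha` are hypothetical.

```python
import math


def significance_probability(v: float) -> float:
    """SP(v) = Prob(T > v), assuming T = |Z| with Z standard normal."""
    # For a standard normal Z, Prob(|Z| > v) = erfc(v / sqrt(2)).
    return math.erfc(v / math.sqrt(2.0))


def discordant_objects(values, mu, sigma, alpha=0.01):
    """Return (value, SP) pairs for which the working hypothesis H is rejected.

    Assumption: F is Normal(mu, sigma) with known parameters, and alpha is
    an illustrative cutoff for "SP is small".
    """
    flagged = []
    for x in values:
        v = abs(x - mu) / sigma           # value of the statistic for this object
        sp = significance_probability(v)  # SP(v) = Prob(T > v) under F
        if sp < alpha:                    # small SP: reject H for this object
            flagged.append((x, sp))
    return flagged


data = [9.8, 10.1, 10.3, 9.9, 10.0, 14.2]

# Under a narrow model F, the value 14.2 is discordant.
narrow = discordant_objects(data, mu=10.0, sigma=0.2)

# Under a wider model F, the same value is not rejected.
wide = discordant_objects(data, mu=10.0, sigma=2.0)
```

Running the same data against the two choices of `sigma` shows the dependence on $F$: the value 14.2 is flagged under the narrow model and accepted under the wide one.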
Kinds of alternative distributions:
Inherent alternative distribution: $\overline{H}: o_i \in G$, where $i = 1, 2, \ldots, n$
Mixture alternative distribution: $\overline{H}: o_i \in (1 - \mu)F + \mu G$, where $i = 1, 2, \ldots, n$
Slippage alternative distribution
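To make the mixture alternative concrete, the sketch below draws objects from $(1 - \mu)F + \mu G$: each object comes from $G$ with probability $\mu$ and from $F$ otherwise. The function name `sample_mixture`, the choice of normal distributions for $F$ and $G$, and all parameter values are assumptions made for illustration only.

```python
import random


def sample_mixture(n, mu, draw_f, draw_g, rng):
    """Draw n (source, value) pairs from the mixture (1 - mu) * F + mu * G.

    draw_f and draw_g are callables that take the random generator and
    return one observation from F or G respectively (hypothetical helpers).
    """
    sample = []
    for _ in range(n):
        if rng.random() < mu:            # with probability mu, use G
            sample.append(("G", draw_g(rng)))
        else:                            # with probability 1 - mu, use F
            sample.append(("F", draw_f(rng)))
    return sample


rng = random.Random(0)                   # fixed seed for repeatability


def draw_f(r):
    return r.gauss(0.0, 1.0)             # assumed initial model F


def draw_g(r):
    return r.gauss(8.0, 1.0)             # assumed contaminating model G


objects = sample_mixture(1000, mu=0.05, draw_f=draw_f, draw_g=draw_g, rng=rng)
contaminants = [value for source, value in objects if source == "G"]
```

With a small `mu`, most objects follow $F$ and only a few contaminants from $G$ appear. Those contaminants are the objects a discordancy test would be expected to reject.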
Two techniques for deviation-based outlier detection:
Sequential Exception Technique
OLAP Data Cube Technique