Outlier Analysis

Data Mining:
Data Mining Methods
with Dr. Qin Lv
Learning objective: Apply techniques
for outlier analysis and explain how they
work. Evaluate and compare methods.
Outlier/Anomaly
Ø General/normal patterns
• Frequent patterns, classification, clustering
Ø Abnormal: differs from the norm
Ø Outlier analysis/anomaly detection
• “Outliers are errors and should be removed?”
• Outliers may be significant events, frauds, …
Types of Outliers/Anomalies
Ø Global
• Obj differs from the rest
Ø Contextual
• Obj differs in a context
Ø Collective
• Group of objs differ
[Figure: time series, Kelvin vs. date (yyyy-mm-dd), annotated "Unusual Time Series Snippets" and "Level Shifting"]
Outlier/Anomaly Examples
Ø Environmental
• Temperature, commute time, wind power generation
Ø Social
• #friends, #posts, #likes, time series, network structure
Ø Financial
• Stock price, credit card transactions, spatial/temporal
Anomaly Detection: Challenges
Ø Normal vs. abnormal
• No clear definition, varies by application, noisy data
Ø Efficiency
• Latency, scalability, adaptability
Ø Interpretability
• How to interpret and explain the results?
Anomaly Detection: Methods
Ø Ground-truth labels
• Supervised, unsupervised, semi-supervised
Ø Assumption of normal vs. abnormal
• Statistical model
• Majority vs. minority
• Proximity: distance, density
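A minimal sketch of the statistical-model assumption, where "normal" means close to the center of the data and "abnormal" means far from it. The modified z-score based on the median and the median absolute deviation (MAD), the constant 0.6745, and the cutoff 3.5 are a commonly used rule of thumb chosen here for illustration; they are not taken from the slides, and the commute times are made-up data.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds threshold.

    The 0.6745 scaling and the 3.5 default are a conventional
    rule of thumb (an assumption, not part of the lecture).
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure deviations against
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical commute times in minutes; one day is far off the norm.
commute_minutes = [31, 29, 33, 30, 32, 28, 30, 95]
print(mad_outliers(commute_minutes))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value shifts them much less.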
Classification-based Methods
Ø Supervised learning
• Normal and/or abnormal cases
• Possibly more than one normal/abnormal class
Ø Challenges
• Class imbalance: e.g., faults are much less frequent
• New patterns: e.g., new types of security attack
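The class-imbalance challenge can be illustrated with a small worked example: when faults are rare, a classifier that never predicts a fault still scores high accuracy. The labels, the 99-to-1 split, and the function name below are hypothetical.

```python
def accuracy_and_recall(y_true, y_pred, positive="fault"):
    """Return (accuracy, recall on the positive class).

    Assumes y_true contains at least one positive case.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return correct / len(y_true), true_pos / actual_pos

# Hypothetical data: 99 normal cases, 1 fault.
y_true = ["normal"] * 99 + ["fault"]
# A degenerate classifier that always predicts "normal".
y_pred = ["normal"] * 100

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(accuracy, recall)
```

Here accuracy is 0.99 while recall on the fault class is 0.0, which is why per-class measures are usually more informative than accuracy for anomaly detection.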
Clustering-based Methods
Ø Unsupervised learning
• No pre-defined normal/abnormal cases
• Majority vs. minority clusters
• Multiple clusters for the normal/abnormal cases
Ø Generalizable to different applications
• Clustering method, similarity measure
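The majority-vs.-minority idea can be sketched as follows: cluster the data without labels, then treat members of small clusters as candidate outliers. The clustering method here (linking points within a fixed radius and taking connected components) is a deliberately simple stand-in, and `radius`, `min_size`, and the sample points are assumptions for illustration.

```python
from math import dist

def threshold_clusters(points, radius):
    """Connected components: two points are linked when within radius."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            near = [j for j in unvisited
                    if dist(points[i], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return clusters

def minority_points(points, radius, min_size):
    """Indices of points that fall in clusters smaller than min_size."""
    return sorted(i
                  for c in threshold_clusters(points, radius)
                  if len(c) < min_size
                  for i in c)

# Hypothetical 2-D data: two groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (30, 30)]
print(minority_points(points, radius=1.5, min_size=3))
```

Swapping in a different clustering method or similarity measure changes only `threshold_clusters`; the minority rule stays the same, which is what makes the approach generalizable.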
Proximity-based Methods
Ø Far away from others => abnormal
Ø Distance-based
• Absolute proximity, e.g., kNN
Ø Density-based
• Relative proximity, e.g., ε-neighborhood
Semi-supervised Methods
Ø Partial labels: normal/abnormal cases
Ø Combination of clustering & classification
• Clustering
• Normal: large clusters and/or normal cases
• Abnormal: small clusters and/or abnormal cases
• Classification
Contextual Anomaly
Ø Context features => behavior features
• E.g., similar weather conditions => solar/wind power
Ø Identifying context
• Frequent patterns, subspace, domain-specific
Ø Detecting anomaly within context
• Transform to global outlier detection
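The transformation to global outlier detection can be sketched as: group records by their context features, then run an ordinary outlier test on the behavior feature inside each group. The context labels ("calm", "windy"), the power readings, and the modified z-score test with its 3.5 cutoff are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

def contextual_outliers(records, threshold=3.5):
    """records: (context, value) pairs.

    Returns the records whose value is unusual within its own context,
    using a modified z-score (median/MAD) per context group.
    """
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        group = by_context[context]
        med = median(group)
        mad = median(abs(v - med) for v in group)
        if mad > 0 and 0.6745 * abs(value - med) / mad > threshold:
            flagged.append((context, value))
    return flagged

# Hypothetical wind power readings under two weather contexts.
records = (
    [("calm", v) for v in (5, 6, 4, 5, 7, 5, 80)]
    + [("windy", v) for v in (78, 82, 80, 79, 81, 80)]
)
print(contextual_outliers(records))
```

A reading of 80 is ordinary among the windy records but stands out among the calm ones, so only the calm reading is flagged.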
Collective Anomaly
Ø A group of objects deviates from the norm
• E.g., Distributed Denial-of-Service (DDoS) attack
Ø Structural relationship among objects
• E.g., purchases of specific items at certain time/location
Ø Super object: a group of related objects
• Transform to global outlier detection
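The super-object idea can be sketched with request timestamps: no single request is unusual, but the requests in one time window form a group whose size is. Each fixed-length window becomes a super object described by its event count, and a global outlier test runs over those counts. The window length, the modified z-score test, and the synthetic timestamps are assumptions for illustration.

```python
from collections import Counter
from statistics import median

def burst_windows(timestamps, window=60, threshold=3.5):
    """Group events into fixed windows (super objects) and return the
    indices of windows whose event count is a global outlier."""
    counts = Counter(int(t // window) for t in timestamps)
    first, last = min(counts), max(counts)
    # Include empty windows so quiet periods count as observations.
    series = [counts.get(w, 0) for w in range(first, last + 1)]
    med = median(series)
    mad = median(abs(c - med) for c in series)
    if mad == 0:
        return []
    return [first + i for i, c in enumerate(series)
            if 0.6745 * abs(c - med) / mad > threshold]

# Hypothetical traffic: six one-minute windows, one with a burst.
per_window = [3, 4, 2, 3, 40, 3]
timestamps = [w * 60 + i for w, n in enumerate(per_window) for i in range(n)]
print(burst_windows(timestamps))
```

Richer super objects could carry more features than a count, such as the number of distinct sources or locations in the group.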
Example: Remote Sensing Data
Ø Spatial-temporal anomaly
[Figure: time series, Kelvin vs. date (yyyy-mm-dd), annotated "Unusual Time Series Snippets" and "Level Shifting"]
Example: Wind Farm
Ø Fault prediction
Ø Fault diagnosis
Example: Solar Farm (1)
Ø Hierarchical anomaly
[Figure labels: PV Panel, PV String, Combiner Box, Inverter, Utility Grid]
Example: Solar Farm (2)
Ø Collaborative fault detection
Summary: Outlier Analysis
Ø Types of outliers/anomalies
• Global, contextual, collective
Ø Methods
• Unsupervised, (semi-)supervised, context, structure
Ø Interpretation
• Errors or significant events
