
OFFICIAL (CLOSED) \ NON-SENSITIVE

What is Anomaly Detection
Mr Hew Ka Kian
hew_ka_kian@rp.edu.sg

What is Anomaly Detection


• Anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
• The anomalous items may translate to some kind of problem, such as bank fraud, a network breach, a structural defect, medical problems or errors in a text.
• Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.
• Anomaly detection applied to unlabeled data is known as unsupervised anomaly detection, although supervised anomaly detection is possible with labeled data.
Source: https://en.wikipedia.org/wiki/Anomaly_detection

Anomaly Types
• “Anomaly” is a broad concept, which may refer to many different types of events in a time series.
• A spike in value, a shift in volatility, etc. could all be anomalous or normal, depending on the specific context.


Time series data anomaly detection


• Successful anomaly detection hinges on an ability to accurately analyze time series data in real
time.
• Time series data is composed of a sequence of values over time. Each point is typically a pair: a timestamp for when the metric was measured, and the value of that metric.
• Time series data anomaly detection can be used for valuable metrics such as: seismic readings, virus infection cases, power generator output, transaction volume, login attempts and mobile app installs.
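Concretely, such a timestamp/value series can be built in pandas. This is a minimal sketch; the readings and the series name are made up for illustration:

```python
import pandas as pd

# A time series is a sequence of (timestamp, value) pairs: here, five
# daily readings where the fourth value is a suspicious spike.
timestamps = pd.date_range("2023-01-01", periods=5, freq="D")
values = [21.5, 21.7, 21.6, 35.2, 21.8]

s = pd.Series(values, index=timestamps, name="reading")
print(s)
```

A pandas Series with a DatetimeIndex like this is exactly the shape of input the ADTK exercises below work with.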

Univariate vs. Multivariate


Univariate
• Looking at one variable.
• If we want to look at anomalous weather patterns, univariate anomaly detection will measure a single indicator, such as temperature. We can then ask questions like “is this temperature strange for this region?”

Multivariate
• Need to consider multiple factors and the relationship between them.
• If we want to look at anomalous weather patterns, multivariate analysis will consider a host of factors, like precipitation, humidity and air pressure.

Anomaly Detection Toolkit

Anomaly Detection Toolkit (ADTK) is a Python package for unsupervised time series anomaly detection.

This package offers a set of functions that make training on the dataset and detecting anomalies easier to code.

It also provides some functions to process and visualize time series and anomaly events.

ADTK is open source, and its code and many examples of how to use the package are at https://github.com/arundo/adtk

ADTK Anomaly Types


• ADTK can detect a point anomaly, where there is a data point whose value is significantly different from others.
• An outlier point in a time series is one that exceeds the normal range of the series.
• To detect outliers, the normal range of time series values (baseline) is what a detector needs to learn.

ADTK Anomaly Types


• Spike and Level Shift: In some situations, whether a time point is normal depends on whether its value is aligned with its near past.
• An abrupt increase or decrease of value is called a spike if the change
is temporary.


ADTK Anomaly Types


• An abrupt increase or decrease of value is called a spike if the change is temporary.
• However, we should use ADTK to detect a level shift if the change is permanent.

Source: https://cloud.google.com/ai-platform/docs/ml-solutions-overview
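The distinction can be illustrated with two toy series: after a spike the values return to the old baseline, while after a level shift they do not. This is a sketch of the concept, not ADTK code:

```python
import pandas as pd

# Two toy series: a temporary spike vs. a permanent level shift.
spike = pd.Series([10] * 10 + [50] + [10] * 10)
shift = pd.Series([10] * 10 + [50] * 11)

def returned_to_baseline(s, tail=5):
    # Compare the mean of the last few points with the early baseline:
    # a spike returns to baseline, a level shift does not.
    baseline = s.iloc[:10].mean()
    return abs(s.iloc[-tail:].mean() - baseline) < 1.0

print(returned_to_baseline(spike))  # spike: back to baseline
print(returned_to_baseline(shift))  # shift: stays at the new level
```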

Workflow: Source data

• Data can come from places you have access to, like the credit card transactions if you work in a bank.
• You can also get your hands on privileged data through a commercial arrangement, like paying for the data.
• There are also plenty of open (public) datasets shared with the public for free:
• https://kaggle.com is an online community for machine learning enthusiasts and it has many open datasets.
• https://data.gov.sg was first launched in 2011 as the government's one-stop portal to its publicly-available datasets from 70 public agencies. To date, more than 100 apps have been created using the government's open data.

(Workflow: Source data → Prepare data → Select the algorithm → Train the model → Test the model → Use the model for Inference)

Workflow: Prepare data

Filter the data:
• The rows of interest, like those above 65 years old
• The columns of interest, basically just the datetime and the feature columns

Transform:
• Combine multiple sources
• Extract features by applying mathematical functions, like a moving average of 20 data points

Workflow: Select the algorithm

What type of anomaly?
• Global (point) anomalies: the ADTK threshold detector detects values outside certain threshold values; the ADTK quantile detector detects values outside a certain percentile.
• Contextual anomalies: the ADTK volatility shift detector detects a shift of volatility by comparing 2 windows of values.
• Collective anomalies: the ADTK seasonal detector detects departure from a repeating pattern.

Workflow: Train and Test

• To train the model using ADTK, call the fit(df) function.
• To test or use the model to detect anomalies, call the detect(df) function.
• For a convenient function that trains followed by detect, call the fit_detect(df) function.
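The pattern behind these three calls can be mimicked with a toy detector class. This class is our own illustration of the fit/detect/fit_detect convention, not ADTK's implementation:

```python
class ToyDetector:
    """Toy stand-in for an ADTK-style detector (not the real API)."""

    def fit(self, values):
        # "Training": learn a normal band from the data (here, the mean plus
        # or minus 90% of the widest observed deviation -- a crude rule).
        mean = sum(values) / len(values)
        spread = max(abs(v - mean) for v in values) * 0.9
        self.low, self.high = mean - spread, mean + spread
        return self

    def detect(self, values):
        # "Testing": flag values outside the learned band.
        return [v < self.low or v > self.high for v in values]

    def fit_detect(self, values):
        # Convenience: train and detect on the same data in one call.
        return self.fit(values).detect(values)

print(ToyDetector().fit_detect([10, 11, 9, 10, 50, 10]))
```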

Exercise D
Pandas is a fast and powerful Python library for data
manipulation.

Import the library

• import pandas as pd

Read the content of the comma separated values (CSV) file into a Pandas DataFrame

• s = pd.read_csv('dataset.csv', index_col="pr_date", parse_dates=True, infer_datetime_format=True)
• index_col is the column that holds the datetime
• infer_datetime_format=True tries to guess the date format
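The same read step can be tried without a file on disk by feeding an in-line CSV. The column names here mirror the exercise; infer_datetime_format is omitted because recent pandas versions deprecate it and infer the format by default:

```python
import io
import pandas as pd

# In-line stand-in for dataset.csv, with the same pr_date index column.
csv_text = """pr_date,value
2023-01-01,20
2023-01-02,21
2023-01-03,19
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="pr_date", parse_dates=True)
print(df.index.dtype)  # the index is now real timestamps
```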

Exercise D

Use the ADTK validate_series(df) function to check for errors in the Series

Import the library

• from adtk.data import validate_series

Validate the Series

• s = validate_series(s)

Exercise D Detecting Simple Threshold


Use ThresholdAD to detect outliers (point anomalies) that exceed the baseline threshold

Import the library

• from adtk.detector import ThresholdAD

Create the ThresholdAD object with the high and low threshold values. Values above the high or below the low threshold are flagged as anomalies
• threshold_ad = ThresholdAD(high=30, low=15)

Detect anomalies

• anomalies = threshold_ad.detect(s)
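What ThresholdAD computes can be reproduced in one line of pandas: a boolean series that is True wherever the value leaves the [low, high] band. This sketches the logic only, not ADTK's code; the values are invented:

```python
import pandas as pd

s = pd.Series([20, 25, 32, 18, 14, 22])

# Same rule as ThresholdAD(high=30, low=15): outside the band is anomalous.
anomalies = (s > 30) | (s < 15)
print(anomalies.tolist())  # [False, False, True, False, True, False]
```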

Exercise D

Use the plot() function to visualize the graph

Import the library

• from adtk.visualization import plot

Plot the graph with the DataFrame and anomalies. You can specify the anomaly marker colour and the tag as marker (dot on the graph)

• plot(s, anomaly=anomalies, anomaly_color='red', anomaly_tag="marker");

Exercise D Percentile
• Percentile: the value below which a percentage of data falls.

Source: https://www.mathsisfun.com/data/percentiles.html
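A percentile can be computed by hand with the nearest-rank method (one of several common definitions), which makes the idea concrete:

```python
# The 90th percentile is the value below which 90% of the data falls.
data = sorted([3, 7, 8, 5, 12, 14, 21, 13, 18, 2])

def percentile(sorted_data, p):
    # Nearest-rank method: the smallest value covering p percent of the data.
    rank = max(1, round(p / 100 * len(sorted_data)))
    return sorted_data[rank - 1]

print(percentile(data, 90))  # 18: 90% of the values are at or below it
```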

Exercise D Percentile
Use QuantileAD to detect outliers (point anomalies) that fall outside certain percentiles of the series

Import the library

• from adtk.detector import QuantileAD

Create the object with the high and low percentile with values outside of
this boundary considered anomalous
• quantile_ad = QuantileAD(high=0.99, low=0.01)

The detector needs to be trained on the data before detection. fit_detect() does this in one step
• anomalies = quantile_ad.fit_detect(s)
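What QuantileAD learns during fit can be sketched with pandas quantiles. This is illustrative, not ADTK's implementation; the data is synthetic:

```python
import pandas as pd

# 100 ordinary values plus one extreme one.
s = pd.Series(list(range(100)) + [1000])

# "fit": the 1st and 99th percentiles become the normal band,
# like QuantileAD(high=0.99, low=0.01).
low, high = s.quantile(0.01), s.quantile(0.99)

# "detect": flag points outside the band.
anomalies = (s < low) | (s > high)
print(int(anomalies.sum()))  # one point at each extreme is flagged
```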

Exercise D
Use VolatilityShiftAD() to detect a change in the volatility in the
time series

Import the library

• from adtk.detector import VolatilityShiftAD

VolatilityShiftAD() compares the volatility between 2 windows next to each other. We have to create the object specifying how many time points the windows contain

• volatility_shift_ad = VolatilityShiftAD(window=30)
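The quantity such a detector compares can be seen with a rolling standard deviation over a synthetic series that is quiet in its first half and noisy in its second. This sketches the idea of volatility, not VolatilityShiftAD's actual algorithm:

```python
import numpy as np
import pandas as pd

# Synthetic series: low volatility, then high volatility.
rng = np.random.default_rng(0)
quiet = rng.normal(0, 0.1, 100)
noisy = rng.normal(0, 2.0, 100)
s = pd.Series(np.concatenate([quiet, noisy]))

# Rolling standard deviation is a simple measure of local volatility.
rolling_std = s.rolling(window=30).std()
print(rolling_std.iloc[60] < rolling_std.iloc[150])  # volatility has risen
```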

Exercise E
• Read the tsla.us.txt csv file and print the content.
s = pd.read_csv('data/tsla.us.txt', index_col="Date", parse_dates=True, infer_datetime_format=True)
print(s)
• What are the columns?
Open, High, Low, Close, Volume, OpenInt

Exercise E
• Write the code to drop the other columns except the Date and Volume columns.
s = s.drop(['High','Low','Open','Close','OpenInt'],axis=1)
• Write the code to detect the volatility shift with window of 60 values
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(window=60)
anomalies = volatility_shift_ad.fit_detect(s)
• Plot the graph. Do you get something like the graph shown in the worksheet that detects the start of increased volatility?
plot(s, anomaly=anomalies, anomaly_color='red');
• What did you find that was the likely reason for the anomaly? Bear in mind that an anomaly does not have to be bad, simply something that deviates from the standard or norm.
• Tesla announced it was becoming profitable in 2013.

Exercise F
• Open the file weekly-infectious-disease-bulletin-cases.csv using Excel.
• What are the columns?

epi_week: week of the year
disease: disease
no._of_cases: number of cases

Exercise F
• How do we filter for the rows with ‘Dengue Fever’?
infectious = infectious[infectious['disease'] == 'Dengue Fever']
• Set the epi_week as the index column.
infectious = infectious.set_index('epi_week')
• Drop the disease column.
infectious = infectious.drop('disease',axis=1)

Student Activity

In exercise G, we are going to do the whole ML workflow:
1. Source the data at Data.gov.sg
2. Prepare the data
• Change the datetime format
• Choose a disease to examine and filter out the irrelevant columns
3. Select the algorithm: shift in volatility
4. Train and use the model for inference


Exercise G
• Acute Upper Respiratory Tract infections
infectious = pd.read_csv(
    'data/average-daily-polyclinic-attendances-for-selected-diseases.csv')
infectious['epi_week'] = pd.to_datetime(
    infectious['epi_week'] + '-1 00:00:00', format='%Y-W%W-%w %H:%M:%S')
infectious = infectious[
    infectious['disease'] == 'Acute Upper Respiratory Tract infections']
infectious = infectious.set_index('epi_week')
infectious = infectious.drop('disease', axis=1)
volatility_shift_ad = VolatilityShiftAD(window=52)
anomalies = volatility_shift_ad.fit_detect(infectious)
plot(infectious, anomaly=anomalies, anomaly_color='red');
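The trickiest line above is the epi_week parsing, and the format string can be checked in isolation with the standard library. In strptime format codes, %W is the week number of the year and %w the weekday, so appending '-1' pins each week to its Monday:

```python
from datetime import datetime

# Parse a week-of-year string like the epi_week column uses.
week = "2019-W01"
parsed = datetime.strptime(week + "-1", "%Y-W%W-%w")
print(parsed.date())  # the Monday of week 1 of 2019
```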

Problem Solution
Is there any abrupt change in the trend for visitors to Singapore?


Problem Solution
• Source data
• Get the visitor-international-arrivals-to-singapore-by-region-monthly.csv from Data.gov.sg
• The datetime column is well defined so no need to modify
s = pd.read_csv('data/visitor-international-arrivals-to-singapore-by-region-monthly.csv', index_col="month", parse_dates=True)
• Filter for a region like Africa
s = s[s['region']=='Africa']
• Drop the region column
s = s.drop(['region'],axis=1)
• Train and inference
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(window=12)
anomalies = volatility_shift_ad.fit_detect(s)
• Plot
plot(s, anomaly=anomalies, anomaly_color='red')

Exercise H
• Any anomaly?
Yes, around 2002-2003 (depending on country)
