
OFFICIAL (CLOSED) \ NON-SENSITIVE

What is Anomaly Detection
Mr Hew Ka Kian
hew_ka_kian@rp.edu.sg

What is Anomaly Detection


• Anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.
• The anomalous items may translate to some kind of problem, such as bank fraud, a network breach, a structural defect, medical problems or errors in a text.
• Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions.
• Anomaly detection applied to unlabeled data is known as unsupervised anomaly detection, although supervised anomaly detection is possible with labeled data.
Source: https://en.wikipedia.org/wiki/Anomaly_detection

Anomaly Types
• “Anomaly” is a broad concept, which may refer to many different types of events in a time series.
• A spike in value, a shift in volatility, etc. could all be anomalous or normal, depending on the specific context.


Time series data anomaly detection


• Successful anomaly detection hinges on an ability to accurately analyze time series data in real
time.
• Time series data is composed of a sequence of values over time. Each point is typically a pair: a timestamp for when the metric was measured, and the value of that metric.
• Time series data anomaly detection can be used for valuable metrics such as: seismic readings, virus infection cases, power generator output, transaction volume, login attempts and mobile app installs.
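Concretely, such a timestamp/value series can be built in pandas. This is a minimal sketch; the readings and the series name are made up for illustration:

```python
import pandas as pd

# A time series is a sequence of (timestamp, value) pairs: here, five
# daily readings where the fourth value is a suspicious spike.
timestamps = pd.date_range("2023-01-01", periods=5, freq="D")
values = [21.5, 21.7, 21.6, 35.2, 21.8]

s = pd.Series(values, index=timestamps, name="reading")
print(s)
```

A pandas Series with a DatetimeIndex like this is exactly the shape of input the ADTK exercises below work with.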

Univariate vs. Multivariate


Univariate
• Looking at one variable.
• If we want to look at anomalous weather patterns, univariate anomaly detection will measure a single indicator, such as temperature. We can then ask questions like “is this temperature strange for this region?”

Multivariate
• Need to consider multiple factors and the relationship between them.
• If we want to look at anomalous weather patterns, multivariate analysis will consider a host of factors, like precipitation, humidity and air pressure.

Anomaly Detection Toolkit

Anomaly Detection Toolkit (ADTK) is a Python package for unsupervised time series anomaly detection.

This package offers a set of functions that make training on the dataset and detecting anomalies easier to code.

It also provides some functions to process and visualize time series and anomaly events.

ADTK is open source, and its code and many examples of how to use the package are at https://github.com/arundo/adtk

ADTK Anomaly Types


• ADTK can detect a point anomaly, where there is a data point whose value is significantly different from others.
• An outlier point in a time series is one that exceeds the normal range of the series.
• To detect outliers, the normal range of time series values (baseline) is what a detector needs to learn.

ADTK Anomaly Types


• Spike and Level Shift: In some situations, whether a time point is normal depends on whether its value is aligned with its near past.
• An abrupt increase or decrease of value is called a spike if the change
is temporary.


ADTK Anomaly Types


• An abrupt increase or decrease of value is called a spike if the change is temporary.
• However, we should use ADTK to detect a level shift if the change is permanent.

Source: https://cloud.google.com/ai-platform/docs/ml-solutions-overview
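The distinction can be illustrated with two toy series: after a spike the values return to the old baseline, while after a level shift they do not. This is a sketch of the concept, not ADTK code:

```python
import pandas as pd

# Two toy series: a temporary spike vs. a permanent level shift.
spike = pd.Series([10] * 10 + [50] + [10] * 10)
shift = pd.Series([10] * 10 + [50] * 11)

def returned_to_baseline(s, tail=5):
    # Compare the mean of the last few points with the early baseline:
    # a spike returns to baseline, a level shift does not.
    baseline = s.iloc[:10].mean()
    return abs(s.iloc[-tail:].mean() - baseline) < 1.0

print(returned_to_baseline(spike))  # spike: back to baseline
print(returned_to_baseline(shift))  # shift: stays at the new level
```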

Workflow: Source data

• Data can come from places you have access to, like the credit card transactions if you work in a bank.
• You can also get your hands on privileged data through a commercial arrangement, like paying for the data.
• There are also plenty of open (public) datasets shared with the public for free:
• https://kaggle.com is an online community for machine learning enthusiasts and it has many open datasets.
• https://data.gov.sg was first launched in 2011 as the government's one-stop portal to its publicly-available datasets from 70 public agencies. To date, more than 100 apps have been created using the government's open data.

(Workflow: Source data → Prepare data → Select the algorithm → Train the model → Test the model → Use the model for Inference)

Workflow: Prepare data

Filter the data:
• The rows of interest, like those above 65 years old
• The columns of interest, basically just the datetime and the feature columns

Transform:
• Combine multiple sources
• Extract features by applying mathematical functions, like a moving average of 20 data points

Workflow: Select the algorithm

What type of anomaly?
• Global (point) anomalies: the ADTK threshold detector detects values outside certain threshold values; the ADTK quantile detector detects values outside a certain percentile.
• Contextual anomalies: the ADTK volatility shift detector detects a shift of volatility by comparing 2 windows of values.
• Collective anomalies: the ADTK seasonal detector detects departure from a repeating pattern.

Workflow: Train and Test

• To train the model using ADTK, call the fit(df) function.
• To test or use the model to detect anomalies, call the detect(df) function.
• For a convenient function that trains followed by detect, call the fit_detect(df) function.
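The pattern behind these three calls can be mimicked with a toy detector class. This class is our own illustration of the fit/detect/fit_detect convention, not ADTK's implementation:

```python
class ToyDetector:
    """Toy stand-in for an ADTK-style detector (not the real API)."""

    def fit(self, values):
        # "Training": learn a normal band from the data (here, the mean plus
        # or minus 90% of the widest observed deviation -- a crude rule).
        mean = sum(values) / len(values)
        spread = max(abs(v - mean) for v in values) * 0.9
        self.low, self.high = mean - spread, mean + spread
        return self

    def detect(self, values):
        # "Testing": flag values outside the learned band.
        return [v < self.low or v > self.high for v in values]

    def fit_detect(self, values):
        # Convenience: train and detect on the same data in one call.
        return self.fit(values).detect(values)

print(ToyDetector().fit_detect([10, 11, 9, 10, 50, 10]))
```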

Exercise D
Pandas is a fast and powerful Python library for data
manipulation.

Import the library

• import pandas as pd

Read the content of the comma separated values (CSV) file into a Pandas DataFrame

• s = pd.read_csv('dataset.csv', index_col="pr_date", parse_dates=True, infer_datetime_format=True)
• index_col is the column that holds the datetime
• infer_datetime_format=True tries to guess the date format
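The same read step can be tried without a file on disk by feeding an in-line CSV. The column names here mirror the exercise; infer_datetime_format is omitted because recent pandas versions deprecate it and infer the format by default:

```python
import io
import pandas as pd

# In-line stand-in for dataset.csv, with the same pr_date index column.
csv_text = """pr_date,value
2023-01-01,20
2023-01-02,21
2023-01-03,19
"""

df = pd.read_csv(io.StringIO(csv_text), index_col="pr_date", parse_dates=True)
print(df.index.dtype)  # the index is now real timestamps
```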

Exercise D

Use the ADTK validate_series(df) function to check for errors in the Series

Import the library

• from adtk.data import validate_series

Validate the Series

• s = validate_series(s)

Exercise D Detecting Simple Threshold


Use ThresholdAD to detect outliers (point anomalies) that exceed the baseline threshold

Import the library

• from adtk.detector import ThresholdAD

Create the ThresholdAD object with the high and low threshold values. Values above the high or below the low threshold are flagged as anomalies
• threshold_ad = ThresholdAD(high=30, low=15)

Detect anomalies

• anomalies = threshold_ad.detect(s)
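What ThresholdAD computes can be reproduced in one line of pandas: a boolean series that is True wherever the value leaves the [low, high] band. This sketches the logic only, not ADTK's code; the values are invented:

```python
import pandas as pd

s = pd.Series([20, 25, 32, 18, 14, 22])

# Same rule as ThresholdAD(high=30, low=15): outside the band is anomalous.
anomalies = (s > 30) | (s < 15)
print(anomalies.tolist())  # [False, False, True, False, True, False]
```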

Exercise D

Use the plot() function to visualize the graph

Import the library

• from adtk.visualization import plot

Plot the graph with the DataFrame and anomalies. You can specify the anomaly marker colour and the tag as marker (dot on the graph)

• plot(s, anomaly=anomalies, anomaly_color='red', anomaly_tag="marker");

Exercise D Percentile
• Percentile: the value below which a percentage of data falls.

Source: https://www.mathsisfun.com/data/percentiles.html
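A percentile can be computed by hand with the nearest-rank method (one of several common definitions), which makes the idea concrete:

```python
# The 90th percentile is the value below which 90% of the data falls.
data = sorted([3, 7, 8, 5, 12, 14, 21, 13, 18, 2])

def percentile(sorted_data, p):
    # Nearest-rank method: the smallest value covering p percent of the data.
    rank = max(1, round(p / 100 * len(sorted_data)))
    return sorted_data[rank - 1]

print(percentile(data, 90))  # 18: 90% of the values are at or below it
```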

Exercise D Percentile
Use QuantileAD to detect outliers (point anomalies) that fall outside certain percentiles of the series

Import the library

• from adtk.detector import QuantileAD

Create the object with the high and low percentile with values outside of
this boundary considered anomalous
• quantile_ad = QuantileAD(high=0.99, low=0.01)

The detector needs to be trained on the data before detection. fit_detect() does this in one step
• anomalies = quantile_ad.fit_detect(s)
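What QuantileAD learns during fit can be sketched with pandas quantiles. This is illustrative, not ADTK's implementation; the data is synthetic:

```python
import pandas as pd

# 100 ordinary values plus one extreme one.
s = pd.Series(list(range(100)) + [1000])

# "fit": the 1st and 99th percentiles become the normal band,
# like QuantileAD(high=0.99, low=0.01).
low, high = s.quantile(0.01), s.quantile(0.99)

# "detect": flag points outside the band.
anomalies = (s < low) | (s > high)
print(int(anomalies.sum()))  # one point at each extreme is flagged
```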

Exercise D
Use VolatilityShiftAD() to detect a change in the volatility in the
time series

Import the library

• from adtk.detector import VolatilityShiftAD

VolatilityShiftAD() compares the volatility between 2 windows next to each other. We have to create the object specifying how many time points the windows contain

• volatility_shift_ad = VolatilityShiftAD(window=30)
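The quantity such a detector compares can be seen with a rolling standard deviation over a synthetic series that is quiet in its first half and noisy in its second. This sketches the idea of volatility, not VolatilityShiftAD's actual algorithm:

```python
import numpy as np
import pandas as pd

# Synthetic series: low volatility, then high volatility.
rng = np.random.default_rng(0)
quiet = rng.normal(0, 0.1, 100)
noisy = rng.normal(0, 2.0, 100)
s = pd.Series(np.concatenate([quiet, noisy]))

# Rolling standard deviation is a simple measure of local volatility.
rolling_std = s.rolling(window=30).std()
print(rolling_std.iloc[60] < rolling_std.iloc[150])  # volatility has risen
```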

Exercise E
• Read the tsla.us.txt csv file and print the content.
s = pd.read_csv('data/tsla.us.txt', index_col="Date", parse_dates=True, infer_datetime_format=True)
print(s)
• What are the columns?
Open, High, Low, Close, Volume, OpenInt

Exercise E
• Write the code to drop the other columns except the Date and Volume columns.
s = s.drop(['High','Low','Open','Close','OpenInt'],axis=1)
• Write the code to detect the volatility shift with window of 60 values
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(window=60)
anomalies = volatility_shift_ad.fit_detect(s)
• Plot the graph. Do you get something like the graph shown in the worksheet that detects the start of increased volatility?
plot(s, anomaly=anomalies, anomaly_color='red');
• What did you find that was the likely reason for the anomaly? Bear in mind that an anomaly does not have to be bad, simply something that deviates from the standard or norm.
• Tesla announced it was becoming profitable in 2013.

Exercise F
• Open the file weekly-infectious-disease-bulletin-cases.csv using Excel.
• What are the columns?

epi_week: week of the year
disease: disease
no._of_cases: number of cases

Exercise F
• How do we filter for the rows with ‘Dengue Fever’?
infectious = infectious[infectious['disease'] == 'Dengue Fever']
• Set the epi_week as the index column.
infectious = infectious.set_index('epi_week')
• Drop the disease column.
infectious = infectious.drop('disease',axis=1)

Student Activity

In exercise G, we are going to do the whole ML workflow:
1. Source the data at Data.gov.sg
2. Prepare the data
• Change the datetime format
• Choose a disease to examine and filter out the irrelevant columns
3. Select the algorithm: shift in volatility
4. Train and use the model for inference


Exercise G
• Acute Upper Respiratory Tract infections
infectious = pd.read_csv(
    'data/average-daily-polyclinic-attendances-for-selected-diseases.csv')
infectious['epi_week'] = pd.to_datetime(
    infectious['epi_week'] + '-1 00:00:00', format='%Y-W%W-%w %H:%M:%S')
infectious = infectious[
    infectious['disease'] == 'Acute Upper Respiratory Tract infections']
infectious = infectious.set_index('epi_week')
infectious = infectious.drop('disease', axis=1)
volatility_shift_ad = VolatilityShiftAD(window=52)
anomalies = volatility_shift_ad.fit_detect(infectious)
plot(infectious, anomaly=anomalies, anomaly_color='red');
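The trickiest line above is the epi_week parsing, and the format string can be checked in isolation with the standard library. In strptime format codes, %W is the week number of the year and %w the weekday, so appending '-1' pins each week to its Monday:

```python
from datetime import datetime

# Parse a week-of-year string like the epi_week column uses.
week = "2019-W01"
parsed = datetime.strptime(week + "-1", "%Y-W%W-%w")
print(parsed.date())  # the Monday of week 1 of 2019
```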

Problem Solution
Is there any abrupt change in the trend for visitors to Singapore?


Problem Solution
• Source data
• Get the visitor-international-arrivals-to-singapore-by-region-monthly.csv from Data.gov.sg
• The datetime column is well defined so no need to modify
s = pd.read_csv('data/visitor-international-arrivals-to-singapore-by-region-monthly.csv', index_col="month", parse_dates=True)
• Filter for a region like Africa
s = s[s['region']=='Africa']
• Drop the region column
s = s.drop(['region'],axis=1)
• Train and inference
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(window=12)
anomalies = volatility_shift_ad.fit_detect(s)
• Plot
plot(s, anomaly=anomalies, anomaly_color='red')

Exercise H
• Any anomaly?
Yes, around 2002-2003 (depending on country)
