What is
Anomaly Detection
Mr Hew Ka Kian
hew_ka_kian@rp.edu.sg
OFFICIAL (CLOSED) \ NON-SENSITIVE
Anomaly Types
• Anomaly is a broad concept that may refer to many different types
of events in a time series.
• A spike in value, a shift in volatility, etc. could all be anomalous or
normal, depending on the specific context.
[Figure: example time series of power generator output and transaction volume]
ADTK (Anomaly Detection Toolkit) is a Python package offering a set of functions that
make training on a dataset and detecting anomalies easier to code.
It also provides functions to process and visualize time series and
anomaly events.
ADTK is open source; its code and many examples of how to use the
package are at https://github.com/arundo/adtk
Source: https://cloud.google.com/ai-platform/docs/ml-solutions-overview
Workflow: Prepare data, Select the algorithm, Train the model, Test the model
• Data can come from places you have access to, such as credit card transactions if you work in a bank.
• You can also obtain privileged data through a commercial arrangement, such as paying for the data.
• There are also plenty of open (public) datasets that are shared with the public for free:
https://kaggle.com is an online community for machine learning enthusiasts and it has many open datasets.
https://data.gov.sg was first launched in 2011 as the government's one-stop portal to its publicly available datasets from 70 public agencies. To date, more than 100 apps have been created using the government's open data.
Prepare data
Exercise D
Pandas is a fast and powerful Python library for data
manipulation.
• import pandas as pd
Exercise D
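Validate the time series with validate_series before running a detector.
• from adtk.data import validate_series
• s = validate_series(s)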
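As a rough illustration of the kind of tidying a validation step can involve, the sketch below sorts timestamped readings and keeps one value per timestamp, using only the standard library. This is an assumption about what validation covers, not ADTK's implementation; the helper name `tidy_series` and the sample readings are hypothetical.

```python
from datetime import datetime


def tidy_series(readings):
    """Sort (timestamp, value) pairs by time and keep the first value per timestamp.

    Hypothetical helper for illustration only; it is not part of ADTK.
    """
    seen = set()
    tidy = []
    # sorted() is stable, so among equal timestamps the earlier input entry wins.
    for stamp, value in sorted(readings, key=lambda pair: pair[0]):
        if stamp in seen:
            continue  # duplicated timestamp: skip the later entry
        seen.add(stamp)
        tidy.append((stamp, value))
    return tidy


# Made-up readings: out of order, with one repeated timestamp.
raw = [
    (datetime(2020, 1, 3), 21.5),
    (datetime(2020, 1, 1), 20.0),
    (datetime(2020, 1, 2), 22.1),
    (datetime(2020, 1, 2), 99.9),
]
clean = tidy_series(raw)
```

After this step the readings are in time order and each timestamp appears once, which is the shape a detector generally expects.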
Create the ThresholdAD object with the high and low threshold values. Values above the
high threshold or below the low threshold are flagged as anomalies.
• from adtk.detector import ThresholdAD
• threshold_ad = ThresholdAD(high=30, low=15)
Detect anomalies
• anomalies = threshold_ad.detect(s)
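The threshold rule itself is simple enough to sketch in plain Python. The example below flags any value outside the 15 to 30 band used on the slide; it assumes strict comparisons (a value exactly on a threshold is not flagged), and the function name and readings are made up for illustration rather than taken from ADTK.

```python
def threshold_flags(values, low, high):
    """Return True for each value strictly below `low` or strictly above `high`."""
    return [value < low or value > high for value in values]


# Made-up temperature readings; 15 and 30 are the thresholds from the slide.
temperatures = [22.0, 14.2, 29.9, 31.5, 15.0, 30.0]
flags = threshold_flags(temperatures, low=15, high=30)

# Keep only the readings that were flagged.
anomalous = [t for t, flagged in zip(temperatures, flags) if flagged]
```

Here `anomalous` holds the two readings that fall outside the band, 14.2 and 31.5.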
Exercise D
Plot the graph with the series and its anomalies. You can specify the anomaly
marker colour and set the anomaly tag to a marker (a dot on the graph).
Exercise D Percentile
• Percentile: the value below which a percentage of data falls.
Source: https://www.mathsisfun.com/data/percentiles.html
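A small worked example may help make the definition concrete. The sketch below uses the nearest-rank method, which is only one of several common percentile definitions (libraries often interpolate between values instead); the function name and the scores are made up for illustration.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]


scores = [15, 20, 35, 40, 50]  # made-up data
p50 = percentile_nearest_rank(scores, 50)    # rank ceil(2.5) = 3, the third value
p100 = percentile_nearest_rank(scores, 100)  # rank 5, the largest value
```

With these five scores the 50th percentile is 35 and the 100th percentile is 50.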
Exercise D Percentile
Use QuantileAD to detect outliers (point anomalies) that fall beyond a
given percentile of the series.
Create the object with the high and low percentiles; values outside
this boundary are considered anomalous.
• quantile_ad = QuantileAD(high=0.99, low=0.01)
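The idea behind a 1%/99% quantile rule can be sketched with the standard library: learn the two cut points from the data, then flag whatever lies outside them. This is a minimal sketch under the assumption that the detector works this way; it uses `statistics.quantiles` with linear interpolation and made-up readings, and is not ADTK's own code.

```python
from statistics import quantiles

# Made-up series: 99 ordinary readings between 20 and 29, plus two extremes.
readings = [20 + (i % 10) for i in range(99)] + [95, -40]

# quantiles(..., n=100) returns the 99 cut points for the 1st to 99th percentiles.
cut_points = quantiles(readings, n=100, method="inclusive")
low, high = cut_points[0], cut_points[-1]  # 1st and 99th percentiles

# Flag every reading outside the learned band.
outliers = [value for value in readings if value < low or value > high]
```

For this series the band runs from 20 to 29, so only the two extreme readings are flagged. Unlike a fixed threshold, the band moves with the data it was learned from.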
Exercise D
Use VolatilityShiftAD() to detect a change in the volatility in the
time series
• volatility_shift_ad = VolatilityShiftAD(window=30)
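One way to picture a volatility shift is to compare the standard deviation in two adjacent sliding windows. The sketch below does that in plain Python and reports where the right-to-left ratio is largest. It is an illustration under that assumption, not ADTK's decision rule: the function name, the ratio criterion and the series are all made up.

```python
from statistics import pstdev


def largest_volatility_shift(values, window):
    """Return (index, ratio) where right-window std / left-window std is largest."""
    best_index, best_ratio = None, 0.0
    for i in range(window, len(values) - window + 1):
        left = pstdev(values[i - window:i])   # volatility just before position i
        right = pstdev(values[i:i + window])  # volatility from position i onward
        if left == 0:
            continue  # ratio undefined for a perfectly flat left window
        ratio = right / left
        if ratio > best_ratio:
            best_index, best_ratio = i, ratio
    return best_index, best_ratio


# Made-up series: 20 calm points (100 +/- 1) followed by 20 volatile points (100 +/- 10).
calm = [100 + (1 if k % 2 == 0 else -1) for k in range(20)]
volatile = [100 + (10 if k % 2 == 0 else -10) for k in range(20)]
shift_index, shift_ratio = largest_volatility_shift(calm + volatile, window=6)
```

The largest ratio occurs at index 20, the first volatile point, where the standard deviation jumps from 1 to 10. A larger window smooths out short bursts but needs more data on either side of the shift.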
Exercise E
• Read the tsla.us.txt CSV file and print its content.
s = pd.read_csv('data/tsla.us.txt',
                index_col="Date", parse_dates=True)
print(s)
• What are the columns?
Open, High, Low, Close, Volume, OpenInt
Exercise E
• Write the code to drop the other columns except the Date and Volume columns.
s = s.drop(['High','Low','Open','Close','OpenInt'],axis=1)
• Write the code to detect the volatility shift with window of 60 values
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(window=60)
anomalies = volatility_shift_ad.fit_detect(s)
• Plot the graph. Do you get something like the graph shown in the worksheet, which
detects the start of increased volatility?
plot(s, anomaly=anomalies, anomaly_color='red');
• What did you find was the likely reason for the anomaly? Bear in mind that an anomaly
does not have to be bad; it is simply something that deviates from the standard or norm.
• Tesla announced in 2013 that it was becoming profitable.
Exercise F
• Open the file weekly-infectious-disease-bulletin-cases.csv using Excel.
• What are the columns?
Exercise F
• How do we filter for the rows with ‘Dengue Fever’?
infectious = infectious[infectious['disease'] == 'Dengue Fever']
• Set the epi_week as the index column.
infectious = infectious.set_index('epi_week')
• Drop the disease column.
infectious = infectious.drop('disease',axis=1)
Exercise G
• Acute Upper Respiratory Tract infections
infectious = pd.read_csv(
    'data/average-daily-polyclinic-attendances-for-selected-diseases.csv')
infectious['epi_week'] = pd.to_datetime(
    infectious['epi_week'] + '-1 00:00:00', format='%Y-W%W-%w %H:%M:%S')
infectious = infectious[
    infectious['disease'] == 'Acute Upper Respiratory Tract infections']
infectious = infectious.set_index('epi_week')
infectious = infectious.drop('disease', axis=1)
volatility_shift_ad = VolatilityShiftAD(window=52)
anomalies = volatility_shift_ad.fit_detect(infectious)
plot(infectious, anomaly=anomalies, anomaly_color='red');
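The epi_week conversion above relies on a date format that is easy to misread, so here is the same format string applied to a single label with the standard library. The label "2012-W01" is a made-up example in the year-week form that the format implies.

```python
from datetime import datetime

week_label = "2012-W01"  # made-up label in the year-week form used above

# Appending "-1" selects Monday (%w = 1) of that week. %W counts weeks from
# the first Monday of the year, so days before that Monday belong to week 0.
week_start = datetime.strptime(week_label + "-1 00:00:00", "%Y-W%W-%w %H:%M:%S")
```

Because 1 January 2012 was a Sunday, week 1 under `%W` starts on Monday 2 January 2012. This numbering is not the same as ISO week numbering, so the parsed dates are best treated as approximate week starts.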
Problem Solution
Is there any abrupt change in the trend of visitors to Singapore?
Problem Solution
• Source data
• Get the visitor-international-arrivals-to-singapore-by-region-monthly.csv from Data.gov.sg
• The datetime column is well defined, so there is no need to modify it
s = pd.read_csv(
    'data/visitor-international-arrivals-to-singapore-by-region-monthly.csv',
    index_col="month", parse_dates=True)
• Filter for a region like Africa
s = s[s['region']=='Africa']
• Drop the region column
s = s.drop(['region'],axis=1)
• Train and infer
s = validate_series(s)
volatility_shift_ad = VolatilityShiftAD(window=12)
anomalies = volatility_shift_ad.fit_detect(s)
• Plot
plot(s, anomaly=anomalies, anomaly_color='red')
Exercise H
• Any anomaly?
Yes, around 2002-2003 (depending on country)