
Institut für Technische Informatik und Kommunikationsnetze

Anomaly Detection on
High-Dimensional Time Series
Master Thesis

Athanasios Fitsios
afitsios@student.ethz.ch

Computer Engineering and Networks Laboratory


Department of Information Technology and Electrical Engineering
ETH Zürich

Supervisors:
Mitch Gusat, IBM Research
Prof. Dr. Lothar Thiele, ETH Zurich

September 8, 2020
Acknowledgements

This work was performed in collaboration with IBM Research Zurich. I would
like to thank Mitch Gusat for his guidance and fruitful collaboration during the
past six months. I would also like to thank Prof. Dr. Lothar Thiele for giving
me the opportunity to write this master thesis and for his valuable insights and
feedback.
This master thesis was supported by the Onassis Foundation - Scholarship
ID: F ZO 081-1/2018-2019.

Abstract

Anomaly detection has been increasingly attracting interest in various appli-
cation domains, such as cyber-security, communications and cloud computing.
Anomalies can often convey critical and actionable information in such applica-
tions. The goal of this thesis is to perform anomaly detection on high-dimensional
multivariate time series by leveraging deep learning techniques. We develop two
deep learning models for anomaly detection and apply them to Key Performance
Indicators (KPIs) provided by IBM Cloud storage systems. We illustrate the ef-
fectiveness of the two models for detecting anomalies in cloud storage KPIs
and propose a method to combine them in an ensemble. We also present two
attempts towards the optimization of the anomaly detection process by intro-
ducing anomaly causality and explainability.

Contents

Acknowledgements

Abstract

1 Introduction
1.1 Motivation
1.2 Research Questions and Main Contributions
1.3 Outline

2 Background and Related Work
2.1 Anomaly Detection Background
2.2 Related Work

3 Problem Context and Strategy
3.1 Data-related Challenges

4 Models for Anomaly Detection
4.1 Forecasting
4.2 Sequence-to-sequence Reconstruction
4.3 Predictability-based Ensemble

5 Towards Anomaly Detection Optimization
5.1 Root Cause Analysis
5.2 Model Explainability

6 Summary and Future Work
6.1 Summary
6.2 Future Work

Bibliography

A Software Implementation Tools


Chapter 1

Introduction

1.1 Motivation

Anomaly detection refers to the problem of finding patterns in data that are
unexpected and dissimilar to the majority of the data. There has been increasing
research interest in detecting anomalies, as they may often carry critical and
actionable information in various application domains. Such applications include
intrusion detection for cyber-security, fraud detection in financial institutions,
military surveillance, healthcare and many more. Due to the importance of
these applications, the real-life relevance of anomalies has been a key feature of
anomaly detection, making it an important subject for data analysts in research
and industry.
The problem of detecting anomalies in time series, in particular, is increas-
ingly important in communications, IT and cloud storage systems. While sta-
tistical methods such as autoregressive integrated moving average (ARIMA) [3]
and Holt-Winters [7] have traditionally been effective for anomaly detection,
there has been rapidly increasing interest in deep learning methods in recent
years.
In this thesis, we will focus on deep learning methods for anomaly de-
tection in multivariate time series. The main motivation for this choice stems
from the need to perform anomaly detection at large scale. Statistical methods
operate well on univariate time series, but struggle to scale to high-dimensional
multivariate time series. Moreover, deep learning methods can perform end-to-
end anomaly detection directly on raw input data and are able to capture
complex structures and features within the data.

1.2 Research Questions and Main Contributions

The main goal of this thesis is to exploit deep learning in order to perform
anomaly detection in Key Performance Indicators (KPIs) taken from real-life


cloud storage systems. The KPIs are high-dimensional multivariate time series
that have been collected by the Zurich lab of IBM Research and correspond to
complex storage systems in the IBM Cloud.
Solutions for anomaly detection in the state-of-the-art literature tend to
focus on and specialize in their respective application domains. A
guiding question for this work is whether it is possible to have an anomaly
detection method for time series that is applicable to the domain of cloud storage,
while also offering the potential for generalization.
It is also of key importance that we develop deep learning models for anomaly
detection that can effectively scale to multivariate time series with 100s or even
1000s of dimensions. This can prove very important for support engineers and
subject matter experts, who currently have to spend considerable time manually
examining a large number of KPIs in order to investigate potential anomalies
and their root causes.
The key contents of this thesis can be summarized as follows:

• We combine convolutional and recurrent neural networks and utilize two
deep learning models to detect anomalies in cloud storage KPIs.

• We propose a general method to determine an ensemble of our two models
based on data predictability.

• We explore optimization tools to assist in root cause analysis and explain-
able anomaly detection.

1.3 Outline

This thesis consists of the following parts:

• In Chapter 2 we provide some background in anomaly detection and briefly
discuss related work from the literature.

• In Chapter 3 we describe in detail the problem context, our main goals and
challenges. We particularly focus on the KPIs in our dataset and examine
the specific challenges posed by this type of data.

• In Chapter 4 we present two deep learning models that we developed for
this thesis. We also propose an ensemble method to combine or choose
between these models.

• In Chapter 5 we present our attempts towards the optimization of the
anomaly detection process. We specifically describe our solutions to pro-
vide anomaly causality and explainability.

• In Chapter 6 we offer a summary of the work presented in this thesis and
provide an outlook for the future.
Chapter 2

Background and Related Work

In this chapter we will provide some basic background knowledge and definitions
related to anomaly detection. Then, we will briefly go through some of the
related work on this subject.

2.1 Anomaly Detection Background

Anomaly detection is a process that aims to identify instances in a dataset which
are dissimilar to others and appear to deviate from the norm. Such instances are
called anomalies or outliers. Anomalies are rare in a dataset compared to normal
instances. Anomalies might be induced in the data for a variety of reasons, such
as malicious activity, fraud or system failures.

A process closely related to anomaly detection is novelty detection, which
aims to identify novel, previously unobserved patterns in the data. The tech-
niques used for novelty detection are usually the same as those used for anomaly
detection. The main distinction between novelties and anomalies is that novelties
are typically reclassified in the dataset as normal data instances after being de-
tected. In this sense, a novelty is not a technical problem or a sign of malicious
activity like an anomaly is. Novelties are new observations in the system that
will be deemed normal if they are repeated in the future [6].
Anomalies can generally be classified into three categories: point, contextual,
and collective anomalies.

• Point anomalies are individual instances that deviate from the rest of the
data. This is the most common type of anomaly and the focus of most of
the work in the literature.

• Contextual anomalies are points that can be deemed normal when in-
spected individually, but are considered anomalous in a specific problem
context. The most common context is time. For example, a temperature
below zero can be seen as normal, but if the context of time is a summer
month, this temperature would be anomalous.

• Collective anomalies occur when a group of instances appear normal indi-
vidually, but signal an anomaly when they are observed as a group. For
example, a set of otherwise normal credit card transactions can be consid-
ered an anomaly when they are all executed within just a few minutes.

In this work we specifically focus on anomaly detection on multivariate time
series. Time series are univariate when only a single variable varies over time. A
multivariate time series consists of several time-dependent variables. Each vari-
able varies over time, but also has some dependency on and correlation with other
variables. In the case of anomaly detection, statistical methods such as autore-
gressive integrated moving average (ARIMA) [3] and triple exponential smooth-
ing, i.e., the Holt-Winters method [7], exhibit very good performance on uni-
variate time series. However, they often struggle to scale to high-dimensional
multivariate time series, and are also sensitive to noise. This is one of the reasons
that attention in anomaly detection applications has shifted to deep learning,
especially for multivariate time series.
Deep learning for anomaly detection can be performed in a supervised or
unsupervised way, depending on the availability of class labels. Labels can dis-
tinguish the data instances between normal and anomalous. However, obtaining
labels for all types of data can be costly and challenging. It is especially difficult
to obtain labels for all possible anomalies in a dataset, as they are often dynamic
in nature and can occur very rarely. For example, as new data measurements
are gathered, new types of anomalies may occur, for which there are no avail-
able labels. Because of this, the majority of deep learning models for anomaly
detection perform unsupervised learning. These models detect anomalies based
solely on the intrinsic properties of the data. Moreover, unsupervised anomaly
detection models assume that the normal instances in a dataset far outnumber
the anomalies, which is usually indeed the case.

2.2 Related Work

In this section we will briefly mention some examples of state-of-the-art work
related to our topic. Several solutions have been presented in recent years that
utilize deep learning for anomaly detection in various domains. To the best of
utilize deep learning for anomaly detection in various domains. To the best of
our knowledge there has not yet been a widely accepted, general solution that
is effective across multiple domains, including cloud storage. In [5] Chalapathy
et al. provide a very useful and thorough survey on deep learning approaches
for anomaly detection. Our interest is focused on applications that address
multivariate time series.

In [12] Hundman et al. use LSTM-based forecasting and detect anomalies
on spacecraft telemetry data based on the statistical distribution of prediction
errors. However, this approach struggles with multivariate inputs of high di-
mension. Zhang et al. [25] attempt to overcome such limitations by combining
convolutional neural networks and convolutional LSTMs [21].
The Deep Autoencoding Gaussian Mixture Model [26] jointly considers a
deep autoencoder and a Gaussian mixture model to model the density distribu-
tion of multidimensional data. However, it cannot easily capture temporal
dependencies across different timesteps. Malhotra et al. [18] present an LSTM
encoder-decoder that manages to model temporal dependencies in time series and
detect anomalies by means of reconstruction error. The generalization capability
of this model may however be affected by noise, leading to an increased number
of false positives.
DeepAD, a deep learning-based framework for multivariate time series anomaly
detection, is proposed in [4]. The framework combines the predictions of multiple
forecasting techniques, including LSTM neural networks, autoregressive mod-
els [3] and triple exponential smoothing [7], in an ensemble in order to discover
anomalies in an unsupervised manner. In [24], Yuan et al. use deep attention-
based models to design an interpretable anomaly detection system that attempts
to effectively explain the anomalies detected.
Chapter 3

Problem Context and Strategy

In this chapter we will describe the main problem that this thesis sets out to
tackle and underline the key challenges we face in the context of the thesis.
Our main goal is to perform anomaly detection in real multivariate time series
data provided by IBM Cloud storage systems by utilizing deep neural networks.
The given time series are Key Performance Indicators (KPIs) of a cloud storage
system. It is a difficult problem, as we must work with high-dimensional raw time
series that have been recently collected, which means they have a relatively short
history and have not been previously curated. The key challenges associated
with this context will be presented in more detail in section 3.1.
To tackle our main problem, we decided to design a deep neural network
that can ingest all the KPIs at the same time. Hence, the input will be a
tensor similar to Fig. 3.1. The input KPIs cover approximately 11 weeks of
measurements taken at 5 minute intervals. To construct the input tensor we
divide the time series into samples, e.g. days, where 1 day equals 288 timesteps.
Therefore, our input will be a 3D tensor with dimensions [samples, timesteps,
KPIs]. At the time of this thesis, we do not have a sufficient number of class
labels that indicate the true anomalies in our time series. Due to this challenge,
our neural networks must perform unsupervised learning.
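
As a minimal sketch of this tensor construction (using numpy; the array shapes and helper name are illustrative, not part of the thesis code), the raw KPI matrix can be reshaped into daily samples as follows:

```python
import numpy as np

def make_input_tensor(raw, timesteps_per_sample=288):
    """Reshape a raw KPI matrix of shape [total_timesteps, n_kpis] into a
    3D tensor [samples, timesteps, KPIs], one sample per day (288
    five-minute intervals). Trailing incomplete days are dropped."""
    total_timesteps, n_kpis = raw.shape
    n_samples = total_timesteps // timesteps_per_sample
    trimmed = raw[: n_samples * timesteps_per_sample]
    return trimmed.reshape(n_samples, timesteps_per_sample, n_kpis)

# e.g. roughly 11 weeks of 5-minute measurements for 82 KPIs
raw = np.random.rand(77 * 288, 82)
x = make_input_tensor(raw)
print(x.shape)  # (77, 288, 82)
```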
The output of the deep learning model should be a list of detected anomalies,
the time window when they occur and a list of the KPIs that are the most
anomalous. The method that we use to compute this output will be covered in
detail in Chapter 4.
After the main problem of anomaly detection, we also work on two subprob-
lems that can further optimize the anomaly detection process: (1) Investigating
anomaly causality through root cause analysis, and (2) adding explainability to
anomaly detection. These subproblems are important, because they can provide
valuable insights to support engineers about possible technical problems behind
the anomalies and also about the most important parts and KPIs of the storage
system. Our approach to tackle these subproblems will be presented in detail in
Chapter 5.


Figure 3.1: Shape of the 3D tensor to be fed as input to the neural network.
Features correspond to KPIs.

3.1 Data-related Challenges

To better understand the challenges in the context of our problem, we need to
closely examine the input data. As we mentioned before, the provided KPIs
correspond to about 11 weeks of measurements taken every 5 min. This is a
relatively short history that amounts to a little over 20000 timesteps for train-
ing. Based on this history, the deep learning model will aim to learn the nor-
mal behaviour of every KPI, in order to detect anomalous deviations from this
normality. As future data are gathered and the training history grows, it will
be easier for a neural network to produce high and robust performance in the
anomaly detection task.
Another aspect to point out is that the given KPIs are taken from multiple
storage systems that are parts of the IBM Cloud. We must treat every one of
these storage systems separately and train a separate model for the KPIs of each
system. This is due to the fact that the same KPI can have significantly different
values in different systems due to unique characteristics like system size, number
of storage units and customer behaviour.
For a specific storage system, the number of corresponding KPIs may be
in the range of 100s or even 1000s. In Figures 3.2 and 3.3 we can see several
example KPIs plotted for a duration of one day.
In addition, even KPIs of the same storage system can have vastly different
scales. For example, in Fig. 3.2 the plotted KPIs are data rates that show values
in the range of millions. On the other hand, in Fig. 3.3 we have plotted response
times with values smaller than 1. Due to these differences it is necessary to
preprocess the data and transform them to similar scales. Such a normalization
will be highly beneficial to the deep learning model, as it can lead to better
learning and faster convergence.
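
A minimal preprocessing sketch is shown below, assuming the raw KPIs are stacked as a [timesteps, KPIs] matrix; as noted in Appendix A, the RobustScaler from scikit-learn is the normalizer used in this work:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Each KPI column is centred on its median and scaled by its
# interquartile range, which is robust to the extreme values we
# expect inside anomalous windows.
raw = np.random.rand(77 * 288, 82)      # [timesteps, KPIs], illustrative data
scaler = RobustScaler()
normalized = scaler.fit_transform(raw)  # every KPI rescaled independently
```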
In the context of the KPIs we are working with, an anomaly can be a pre-
viously unseen peak in the time sequence. An example can be seen in Fig. 3.4.
Figure 3.2: Example KPIs. These plots correspond to data rates.

Figure 3.3: Example KPIs. These plots correspond to response times.

Figure 3.4: Example of a potential anomaly. The plots correspond to response
times. Unusually high values indicate a possible anomaly (red circles).

The decision to classify a value in a time series as an anomaly is based on the
temporal history of the time series. Values that are dissimilar to any pattern
that we encountered before are candidates for anomaly detection. Since we do
not have a full set of labels for anomalies in our data, any detected anomaly will
be further validated by support engineers.
Our data contains a diverse set of KPIs which can exhibit quite different
behaviours over time. Some KPIs can be quite unpredictable and do not exhibit
any seasonal periodic patterns in their values. On the other hand, we also ob-
served KPIs with strong periodicity. This characteristic can make it easier for
an anomaly detection model to learn from these KPIs. We can see a relevant
example in Fig. 3.5. In the left part of the figure we have the plots of an I/O
rate (top) and a response time (bottom). The enlarged plot interval on the right
corresponds to measurements from five weekdays. We can observe that both
KPIs have a strong seasonal pattern on a daily basis during the weekdays. This
pattern stops repeating and changes during the weekend.
In Chapter 4 we will present in detail our approach to tackle the afore-
mentioned challenges. We develop two deep learning models, one that utilizes
time series forecasting (section 4.1) and one that performs input reconstruction
through an encoder-decoder architecture (section 4.2). Our input data in both
models will be normalized to account for the different scales. In the case of input
KPIs numbering in the 100s or 1000s, we incorporate compression in our
approach in section 4.2. In Chapter 4 we also observe the performance of our
deep learning models with regard to the varying degree of periodicity in the KPIs
and in section 4.3 we describe an approach to ensemble our models based on this
characteristic.

Figure 3.5: Example KPIs with seasonal behaviour. There is a periodic daily
pattern during the weekdays (highlighted by the blue rectangle), which stops on
the weekend.
Chapter 4

Models for Anomaly Detection

In this chapter we will present our solutions for anomaly detection. We developed
two deep learning models in the context of this thesis. We also describe a method
to combine the different models in an ensemble. Since the data provided by
IBM does not contain enough class labels for true anomalies, our models will
perform unsupervised learning.

4.1 Forecasting

Our initial model for anomaly detection is based on time series forecasting. Given
a time window of the input time series, the machine learning model tries to
predict the value of the time series for a certain time in the future. This time
interval is called the forecasting horizon and can be short, e.g. 1 timestep ahead,
or longer.
The input to the model is a multivariate time series with several dimensions
[samples, time, KPIs]. In the output, we obtain the next predicted values of the
given time series. We compare the predictions with the true values and compute
the forecasting error. Based on this error, we can then employ a heuristic method
to detect anomalies.
The forecasting model architecture (Fig. 4.1) is based on [16]. The main
components are a convolutional layer [17] followed by a recurrent layer [11]. The
convolutional layer discovers local dependency patterns among the input time
series, while the recurrent layer can learn long-term dependencies.
Two extra components are added to the model: (1) a recurrent skip layer is
added to exploit potential periodic properties in the input, and (2) an autore-
gressive linear component [3] is added in parallel to the model to make it more
robust for time series with rapid scale changes.
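
For illustration, a simplified Keras sketch of such an architecture is given below. It includes the convolutional, recurrent and parallel autoregressive components but omits the recurrent-skip layer, and all layer sizes are illustrative rather than the tuned values used in this thesis:

```python
from tensorflow.keras import layers, Model

# Sketch of the forecasting network: Conv1D for local patterns among
# the KPIs, a GRU for long-term temporal dependencies, and a parallel
# linear autoregressive (AR) head added to the non-linear forecast.
window, n_kpis = 288, 82

inputs = layers.Input(shape=(window, n_kpis))
x = layers.Conv1D(filters=64, kernel_size=6, activation="relu")(inputs)
x = layers.Dropout(0.2)(x)
x = layers.GRU(64)(x)                       # long-term dependencies
nn_out = layers.Dense(n_kpis)(x)            # non-linear one-step-ahead forecast

# AR component: a linear map over the last 24 timesteps of each KPI,
# which keeps the model robust to rapid scale changes.
ar = layers.Lambda(lambda t: t[:, -24:, :])(inputs)  # [batch, 24, KPIs]
ar = layers.Permute((2, 1))(ar)                      # [batch, KPIs, 24]
ar_out = layers.Flatten()(layers.Dense(1)(ar))       # [batch, KPIs]

model = Model(inputs, layers.Add()([nn_out, ar_out]))
model.compile(optimizer="adam", loss="mse")
```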

Figure 4.1: Deep learning model for time series forecasting. Based on [16].

We test the model on KPIs from an IBM Cloud storage system that exhibited
a known anomaly. In Fig. 4.2 we can see a particular KPI of this system, more
specifically a response time. The dates in the figure indicate times when an
anomaly is known to appear. It is easily discernible that the response time is
much higher in this time window. In the inner figure we see a histogram for this
KPI, indicating a long tail of outlier values. One can also observe, that even
before the anomaly, peaks of the KPI response time appear.
After applying the forecasting model to this system, we plot the prediction
and the true values for the same response time KPI. In Fig. 4.3 we zoom in on a
time window around the known anomaly. We observe that the prediction closely
follows the true value even around the known anomaly. Ideally, we would want
our model to only learn the "normal" behaviour of the system and have larger
forecasting errors in case of an anomaly. By examining this result we conclude
that, due to the unpredictable nature of this particular system, the model tends
to output a predicted value that resembles the last input value it received.
In Fig. 4.4 we produce a similar plot for another KPI of the same system,
this time a data rate. Even though the true data rate shows normal behaviour
at the time of the known anomaly, we observe that the predicted value (in blue)
deviates significantly from the true value in this time window. This indicates
that the convolutional part of the model captured the correlation between the
data rate and the response time. Hence, the predicted data rate was affected by
the "anomalous" response time.
Even though we see that the result in Fig. 4.3 is not ideal - as the predicted
output tends to follow the last input value - , we still investigate whether it can
be used to correctly detect the known anomaly. For this purpose, we obtain the
forecasting error for this specific KPI. We employ dynamic time warping [9] to
compute the forecasting error for a rolling window of 3 timesteps (equivalent to
15 min). We plot the resulting forecasting error in Fig. 4.5. As we can see, we
still have significantly larger error in the times of the known anomaly. However,
depending on the chosen error threshold (horizontal lines), we see that we run
the risk of detecting many false anomalies.

Figure 4.2: Example KPI with known anomaly. The inner figure shows the
histogram for this KPI.

Figure 4.3: Predicted output (blue) and true value (orange) for a KPI (response
time) with known anomaly.

Figure 4.4: Predicted output (blue) and true value (orange) for a different KPI
(data rate) in the same system.

Figure 4.5: Forecasting error for the KPI with the known anomaly, computed
with dynamic time warping (DTW). The horizontal lines indicate different error
thresholds for classifying anomalies.
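
To make the rolling error computation described above concrete, here is a minimal sketch using the fastdtw package; the exact windowing and distance function of the thesis implementation may differ:

```python
import numpy as np
from fastdtw import fastdtw

def rolling_dtw_error(y_true, y_pred, window=3):
    """DTW distance between the true and predicted values of one KPI over
    a rolling window of `window` timesteps (3 steps = 15 min at the
    5-minute sampling interval)."""
    errors = np.zeros(len(y_true))
    for t in range(window, len(y_true) + 1):
        dist, _ = fastdtw(y_true[t - window:t], y_pred[t - window:t],
                          dist=lambda a, b: abs(a - b))
        errors[t - 1] = dist
    return errors

# errors = rolling_dtw_error(true_kpi, predicted_kpi)
# anomalous_t = np.where(errors > threshold)[0]  # threshold: chosen error level
```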
These initial results lead us to the conclusion that our initial model has two
main limitations:

• The model performs better on data that have reasonably seasonal and
recurrent behaviour. In case of measured traces from complex cloud storage
systems like the one we investigated in this thesis, the KPIs can be quite
unpredictable over time and exhibit behaviour that is similar to a random
walk. In such cases, the quality of forecasting will drop, which may lead
to many falsely detected anomalies.

• The current model does not perform significant dimensionality reduction.
Given input data that may consist of 100s or even 1000s of KPIs, some
amount of compression will be necessary to improve resource requirements
and performance.

4.2 Sequence-to-sequence Reconstruction

To amend the limitations faced by the previous forecasting model, we proceed
to design a more complex deep learning model. Our final approach differs from
our initial attempt in two key aspects:

• We want to incorporate compression as a key characteristic of our model.

• Instead of aiming to predict future timesteps for our input time series, we
now perform reconstruction of the input.

To perform this task, the basic architecture we will utilize is the sequence-to-
sequence model [23], or encoder-decoder model (Fig. 4.6). In this basic structure,
the input passes through the encoder, which compresses the input into a latent
space manifold. In this way, the compressed representation of the data captures
only the most essential information. The decoder then uses this compressed
representation to generate the required output sequence. In our case, the desired
output will be the reconstructed input. In this sense, the model can also be called
an autoencoder.

Figure 4.6: Basic encoder-decoder structure.
Therefore, the goal of our model will be to reconstruct the given input time
series, which in our case are KPIs, through unsupervised learning. By comparing
the model output to the original input, we can obtain the reconstruction error.
We will then utilize this error to detect anomalies.
To implement our final model architecture, we use a combination of convo-
lutional and recurrent layers, as seen in Fig. 4.7. The encoder is comprised of a
series of convolutional layers, which extract the relevant features from the input
KPIs. The convolutional layers can capture correlations among the KPIs, as well
as short patterns in the time dimension. The depth of the encoder (i.e. number of
subsequent layers) depends on the level of compression that we want to achieve.
Following the convolutional layers, we also add a recurrent layer, specifically a
long short-term memory (LSTM) [11]. The LSTM layer provides a "forgetful"
memory, as it uses its internal gates to capture long-term dependencies and also
learn which information in a sequence should be kept or thrown away. For the
decoder part of the model, we use convolutional layers analogous to the ones in
the encoder. These layers perform a so-called "deconvolution", as they expand
the compressed representation back to the original input dimensions.
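
A minimal Keras sketch of this encoder-decoder is shown below. It uses two convolutional layers per side instead of the four in the final model and ignores the two-channel input split, so it should be read as an illustration of the structure rather than the exact architecture:

```python
from tensorflow.keras import layers, Model

# Sketch of the autoencoder: Conv1D layers compress along the time axis,
# an LSTM models long-term dependencies in the compressed sequence, and
# Conv1DTranspose ("deconvolution") layers expand back to the input shape.
window, n_kpis = 288, 82

inputs = layers.Input(shape=(window, n_kpis))
x = layers.Conv1D(64, 5, strides=2, padding="same", activation="relu")(inputs)
x = layers.Conv1D(32, 5, strides=2, padding="same", activation="relu")(x)
x = layers.LSTM(32, return_sequences=True)(x)  # latent representation
x = layers.Conv1DTranspose(64, 5, strides=2, padding="same",
                           activation="relu")(x)
outputs = layers.Conv1DTranspose(n_kpis, 5, strides=2, padding="same")(x)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")
# trained to reproduce its own input (unsupervised):
# autoencoder.fit(x_train, x_train, epochs=..., batch_size=...)
```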
In Fig. 4.8, we illustrate the final model architecture for a given IBM Cloud
storage system. In this example we start from 82 KPIs divided into two channels
in the input and use 4 convolutional layers for compression. In the model output
we obtain a reconstructed version of the input tensor, from which we compute
the reconstruction error.
To detect anomalies based on the reconstruction error, we employ a heuristic
method based on the k-sigma rule. We aggregate the reconstruction errors over
all KPIs and compute the mean value and the standard deviation (sigma). Then
we look at the error at every timestep and check its distance from the mean
value in relation to k times sigma. In the examples below, we tag timesteps as
anomalies when the error is at least a distance of 2 sigmas away from the mean.
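
This heuristic is simple enough to state in a few lines of numpy; the sketch below assumes the reconstruction error has already been aggregated over all KPIs into one value per timestep:

```python
import numpy as np

def k_sigma_anomalies(recon_error, k=2.0):
    """Tag timesteps whose aggregated reconstruction error lies more than
    k standard deviations away from the mean error."""
    mu, sigma = recon_error.mean(), recon_error.std()
    return np.where(np.abs(recon_error - mu) > k * sigma)[0]

# recon_error: one aggregated error value per timestep, e.g.
#   recon_error = np.mean((x - x_hat) ** 2, axis=-1).ravel()
# anomalous_timesteps = k_sigma_anomalies(recon_error, k=2.0)
```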
In Figures 4.9 and 4.10 we illustrate the result of applying this model to real
systems. The blue line shows the aggregated reconstruction error. Highlighted
in red rectangles are the timesteps that were considered an anomaly by our
algorithm. The time history of our given data in both examples covers a span
of almost 11 weeks. In the case of Fig. 4.9, we detect a significant anomaly that
lasted for a few days. Other than that, the reconstruction error remains steadily
low.

Figure 4.7: General architecture of our autoencoder model, built with a combi-
nation of CNN and RNN layers. Blue blocks are for the encoder, orange blocks
for the decoder.

Figure 4.8: Final model architecture for a given IBM cloud system. The encoder
(blue) and decoder (orange) each have four convolutional layers.

Figure 4.9: Example of anomaly detection based on reconstruction error (blue
line). The red lines indicate detected anomalies.

Another example of applying our anomaly detection on a different system
can be seen in Fig. 4.10. In this case, the error is very low until approxi-
mately the last 10 days of measurements. In that window we detect a number of
anomalies with elevated errors. It is worth noting that, if the system continues
to exhibit similar behaviour as future measurements are gathered and added to
the training history, we expect the model to eventually learn and adapt to this
"new normal" behaviour and no longer consider it anomalous. In this scenario,
the current detected anomalies would be indicative of novel system behaviour
(novelty detection), rather than e.g., a technical problem in the system.

4.3 Predictability-based Ensemble

After the implementation of the two deep learning models presented in sections
4.1 and 4.2, another aspect we want to explore is a generalizable method to
combine and discriminate between these models. The goal is to have an ensemble
of the two models and decide which one is better suited to the given data.
We choose to discriminate our ensemble of models based on data predictabil-
ity, as we observe that accurate forecasting depends on the predictability of the
time series. For example, for very seasonal well-behaved data, the simpler less
complex forecasting model may prove to be good enough. Seeing data from real
cloud storage systems, we observed quite different behaviour among KPIs be-
longing to different systems in terms of predictability. We can have KPIs that
exhibit strong seasonality and periodic behaviour (e.g. similar values on a daily
or weekly basis). On the other hand, we can observe KPIs whose behaviour does
not show any periodic pattern and approaches a random walk.
There can be multiple ways to examine the predictability of our given data,
including methods like autoregressive integrated moving average (ARIMA), Holt-
Winters and autocorrelation functions.
Figure 4.10: Second example of error-based anomaly detection, on a different
system. Multiple anomalies (red) are detected.

We made the choice to use the autocorrelation function (ACF) to examine
the predictability of our KPIs. This function computes the correlation of a time
series with a lagged copy of itself, for various time lags. It is used to look for
seasonal patterns in the history of the time series.
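
A minimal sketch of such a predictability check is given below, using the acf function from statsmodels; the daily lag of 288 steps matches the 5-minute sampling, while the 0.5 threshold is an illustrative choice, not a calibrated value from this work:

```python
from statsmodels.tsa.stattools import acf

def is_predictable(kpi, daily_lag=288, threshold=0.5):
    """Check whether a KPI correlates strongly with itself one day back
    (288 five-minute steps)."""
    correlations = acf(kpi, nlags=daily_lag, fft=True)
    return correlations[daily_lag] > threshold

# route predictable KPIs to the forecasting model (section 4.1) and the
# rest to the reconstruction-based model (section 4.2)
```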
For the example ensemble of our two models, the typical workflow is presented
in Fig. 4.11. After the necessary preprocessing of the data, the autocorrelation
function is employed to check the predictability of the input time series. De-
pending on the outcome of this check, we can select between our AD models.
In Fig. 4.12 we can see the autocorrelation function of an example KPI from
a given cloud storage system. The observed KPI is an I/O rate which represents
system input load. We observe some quite strong correlations, i.e., seasonal
patterns, particularly from day to day. This pattern is probably due to stable
user activity on the storage system. For this example, our initial forecasting
model could be applied to quite good effect.
On the contrary, in Fig. 4.13 we illustrate a KPI of a system without any
discernible seasonal pattern. The observed KPI is a response time. The autocor-
relation function in this case has low values across all the time lags. In this case,
therefore, we would be inclined to utilize our more complex reconstruction-based
encoder-decoder model.
Currently we can only select between our two models, but this approach can
be generalized to an ensemble of multiple models.

Figure 4.11: Flow chart for a predictability-based ensemble between two anomaly
detection models.

Figure 4.12: Autocorrelation function of an I/O rate KPI from an IBM cloud
storage system. The KPI exhibits strong seasonal correlation on a daily basis.

Figure 4.13: Autocorrelation function of a response time KPI from another IBM
cloud storage system. This example shows no discernible seasonality.
Chapter 5

Towards Anomaly Detection Optimization

In this chapter, we explore ways to further optimize the anomaly detection pro-
cess. We tackle two subproblems related to anomaly causality and explainability.
The goal is to provide helpful information to support engineers and assist them
in the workload of investigating system anomalies.

5.1 Root Cause Analysis

In this first subproblem, we designed an algorithmic tool to aid in root cause
analysis (RCA). The goal of this RCA tool is to investigate causal relationships
within the system KPIs and help support engineers identify the true problems
that caused the detected anomalies.
The RCA tool is employed after the anomaly detection model (Fig. 5.1) and
gets two inputs: (1) the time window (start and end) of every anomaly that was
detected by the deep learning model, and (2) the top-k KPIs that triggered the
anomaly, by having the largest reconstruction error in that window. The RCA
then goes on to analyse the rest of the KPIs to find root cause candidates for
the anomaly.
The output of RCA is a ranked list of KPIs. The top KPIs in the list are
the most likely candidates to have a causal connection with the top-k anomalous
KPIs. To achieve this, we apply an ensemble of similarity measures between the
anomalous KPIs and the remaining input KPIs. In the remainder of this section,
we will present our RCA method in more detail.

Figure 5.1: Enhanced anomaly detection pipeline. The results of the anomaly
detection model are coupled with a root cause analysis (RCA) tool.

Method Our RCA tool ranks a set of possible root cause KPIs according to
their similarity¹ to a smaller set of anomalous KPIs. This ranking is decided
based on a two-level ensemble of the individual similarity metrics: an intra-
ensemble (each metric ranks the candidate KPIs) and an inter-ensemble (an
ensemble across the rankings of the metrics). We use 3 distance and 2 correlation
metrics. These are the L1/Manhattan, L2/Euclidean, DTW/Dynamic Time
Warping, Spearman and Pearson metrics respectively. The similarity metrics
between two time series T and C_j are computed as follows:

¹We define a similarity function between two time series T and C_j as D(T, C_j). A time
series C_i is more similar to T than C_j if D(T, C_i) < D(T, C_j) [10].

$$\mathrm{Eucl}(T, C_j) = \sqrt{\sum_{i=0}^{n} (T_i - C_{ji})^2} \qquad (5.1)$$

$$\mathrm{Manh}(T, C_j) = \sum_{i=0}^{n} |T_i - C_{ji}| \qquad (5.2)$$

$$\mathrm{Pear}(T, C_j) = \frac{\frac{1}{n}\sum_{i=0}^{n} (T_i - \mu_T)(C_{ji} - \mu_{C_j})}{\sigma_T \sigma_{C_j}} \qquad (5.3)$$

$$\mathrm{Spear}(T, C_j) = \mathrm{Pear}(rg_T, rg_{C_j}) \qquad (5.4)$$

where µ_T, µ_{C_j}, σ_T and σ_{C_j} are the mean and standard deviation of time
series T and C_j, respectively, whereas rg_T and rg_{C_j} are the converted rank
variables used by Spearman (i.e., data points are sorted and values are replaced
by their rank). For DTW we use the approximate FastDTW [20] technique
that provides optimal or near-optimal alignments, by recursively projecting and
refining a solution from larger resolutions. These metrics have the following
beneficial properties:

• L_p-norm methods are parameter-free shape-based similarity measures [10].

• L2/Euclidean is susceptible to outliers, scale units and normalization.

• L1/Manhattan does not satisfy the same properties as L2 but is more
robust in the presence of noise.

• Dynamic Time Warping has a unique tolerance to skew.

• Pearson is a scale-invariant proxy for the cosine distance.

• Spearman has similar properties to Pearson, but is non-parametric and
more robust to noise.

Figure 5.2: Block diagram of our root cause analysis tool.
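
For illustration, the five measures between a normalized trigger KPI and one candidate could be computed as in the sketch below, using scipy and fastdtw; negating the absolute correlations so that smaller always means "more similar" is an illustrative convention, not necessarily the one used in the thesis implementation:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from fastdtw import fastdtw

def similarity_metrics(t, c):
    """Five similarity measures between a trigger KPI t and a candidate
    KPI c (1-D arrays over the normalized anomaly window)."""
    l1 = np.sum(np.abs(t - c))                            # Manhattan
    l2 = np.sqrt(np.sum((t - c) ** 2))                    # Euclidean
    dtw, _ = fastdtw(t, c, dist=lambda a, b: abs(a - b))  # FastDTW
    pear = -abs(pearsonr(t, c)[0])                        # Pearson
    spear = -abs(spearmanr(t, c)[0])                      # Spearman
    return {"L1": l1, "L2": l2, "DTW": dtw,
            "Pearson": pear, "Spearman": spear}
```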

A holistic view of the method can be seen in Fig. 5.2. The detailed op-
eration is captured in Algorithm 1. It requires an input containing: (1) a set
of anomalies to be analyzed; (2) the triggering KPIs for each anomaly; (3) a
set of the potential root cause KPI candidates. We normalize the KPIs within
the anomaly time window to bring them to similar scales. We compare every
trigger KPI with all candidate KPIs and produce a sorted list of candidates for
each metric. To produce the final ranked list of root cause candidate KPIs, we
perform ensemble voting across the five similarity metrics. The voting bias for
each metric is defined according to the system’s region of operation.

Voting Bias We use a novel method to adjust and re-bias the weights of each
component of the ensemble. The vote rigging is control-driven by the regions
suggested by the cloud system’s delay-load transfer functions, e.g., the family
of "hockey-stick" plots that characterize each specific type of Cloud/Storage
(Fig. 5.3).
Briefly, this is accomplished by

1. separating the equivalent dynamic queuing system's operational regions
into a normal (i.e., linear) region and a saturation region (identified by
green and red colors respectively in Fig. 5.3), and

2. detecting whether the system operates in a normal or saturation region,
and rigging the ensemble votes accordingly.

Normal Region When the system operates within its normal region, the
probability of encountering high noise or a vast amount of outliers is low. We
therefore cast equal votes to all metrics (0.2 to each).

Algorithm 1: Root Cause Analysis tool

Input: Event list, Trigger KPIs, Candidate KPIs
for event in Event_list do
    w ← time window of event;
    norm_data ← normalize_data(w);
    region ← region of operation;
    if region is normal then
        voting_bias ← [0.2, 0.2, 0.2, 0.2, 0.2];
    end
    /* bias order: [L1, L2, DTW, Pearson, Spearman] */
    if region is saturated then
        voting_bias ← [0.3, 0.05, 0.3, 0.3, 0.05];
    end
    for trigger in Trigger_KPIs do
        for metric = L1, L2, DTW, Pearson, Spearman do
            dists[metric] ← compute_distance_with_every_KPI(norm_data);
            Sort dists[metric] by smaller distance;  /* intra-ensemble ranking */
        end
        candidate_list ← inter_ensemble_ranking(dists, voting_bias);
        top_k_candidates ← candidate_list[:k];
    end
end
Result: Top-k root cause candidates

Figure 5.3: Detection of (a) Region: solve for power = (throughput/delay) = 0
to determine if normal or saturated with respect to the saturation knee, and
(b) Mode: within the window of interest, estimate the local variability of the
raw delay and load distributions. The interference is the ratio µ/σ of the delay
signal, where µ and σ are the mean and standard deviation.

Saturation Region On the contrary, when the system's load increases
near the saturation point, queues start building up and the throughput drops.
The initially quasi-flat end-to-end delay increases (sub)linearly until the queues
start overflowing – in our case (lossless Storage Area Network – SAN), the flow
control backpressures them upstream towards sources – and exponentially there-
after. Under this scenario, the amount of outliers is expected to be significantly
increased. We therefore decrease the contribution of the Euclidean and the
Spearman metrics (votes of 0.05 each). The votes of the rest of the metrics are
increased to ensure a total sum of 1 (0.3 each).
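
A minimal sketch of the region-biased inter-ensemble vote follows; the Borda-style weighted rank aggregation shown here is one plausible reading of the ranking step, with the bias vectors taken from Algorithm 1:

```python
def inter_ensemble_ranking(dists, voting_bias):
    """Combine the per-metric (intra-ensemble) rankings into one final
    ranking. `dists` maps each metric name to {candidate_kpi: distance};
    `voting_bias` maps metric names to vote weights summing to 1."""
    scores = {}
    for metric, weight in voting_bias.items():
        ranked = sorted(dists[metric], key=dists[metric].get)  # most similar first
        for rank, kpi in enumerate(ranked):
            scores[kpi] = scores.get(kpi, 0.0) + weight * rank
    return sorted(scores, key=scores.get)  # best root cause candidates first

# bias vectors from Algorithm 1, chosen by region of operation
normal_bias    = {"L1": 0.2, "L2": 0.2,  "DTW": 0.2, "Pearson": 0.2, "Spearman": 0.2}
saturated_bias = {"L1": 0.3, "L2": 0.05, "DTW": 0.3, "Pearson": 0.3, "Spearman": 0.05}
```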

Example In Fig. 5.4 we illustrate the result of applying our RCA tool to a
cloud storage system. The anomaly window is shown in red dotted lines. We
look into a time window of 8 hours before and after the anomalous event. In this
case, the anomalous KPI is a Response Time (delay). After applying the RCA
[Figure 5.4 plot area: a "Triggering events (DV)" panel on top showing the
anomaly, and a "Root cause candidates (IV)" panel below highlighting the top-1
to top-3 causes, both plotted as normalized value versus timestamp.]
Figure 5.4: Top-3 root cause candidates (bottom) returned by the RCA ensemble
versus the triggering KPI (top) detected by the anomaly detection engine. The
red dotted lines show the event window.

ensemble, we come up with a ranked list of potential root cause candidates. The
Top-3 root cause candidate KPIs shown in the figure are rates (loads), which
strongly correlate with this response time. By further investigating the role of
these KPIs in the system, we confirm that there is a pseudo-causal path in which
the activity of these three loads propagates through the system and causally
affects the respective response time during the event window.
The major benefit of RCA is that it significantly narrows down the workload
of Support Engineers in investigating anomalies. Instead of spending hours or
days manually going through hundreds of KPIs to look for root causes, the
Support team now only needs to look into a few top root cause candidate KPIs
suggested by the RCA tool.

5.2 Model Explainability

Our RCA tool helps with the investigation of causality among the KPIs in an
anomaly. We want to further enhance our anomaly detection process by introduc-
ing and leveraging explainability in the deep learning model. Initially, we would
like to investigate the importance of each of our input KPIs, i.e., how much each
KPI contributes to the anomaly detection model. This will give us information
about critical system KPIs that significantly affect the anomaly detection.
As we saw in section 4.2, in a sequence-to-sequence model the encoder com-
presses the input into a latent space manifold, keeping the most necessary in-
formation. The decoder then uses this compressed representation to reconstruct
the input. Hence, to estimate the importance of our KPIs, we need to extract
the contribution of each KPI to the compressed manifold (Fig. 5.5).

Figure 5.5: Extracting the contribution of input KPIs to the latent space in our
encoder-decoder model.
There exist several techniques in the literature related to explainability in
machine learning and particularly deep neural networks [2, 19, 13]. To compute
our manifold contribution, we will use the method described in [22] to generate
saliency maps. This is a gradient-based method, traditionally applied to image
data, that uses back-propagation to generate heatmaps with the importance of
each input pixel.
In a similar fashion, we can apply the saliency method on our time series
data. We start from the output of the last convolutional layer of the encoder
and we propagate back to the input, ultimately computing the gradients with
respect to the input time series. We can use these gradients to highlight input
parts that most contribute towards the encoder output. An example of applying
this method to an IBM Cloud system can be seen in Fig. 5.6a, b. The exam-
ple input spans 288 timesteps (one day) and 82 KPIs divided in two channels.
This visualization shows saliency values of each KPI (horizontal axis) for every
timestep (vertical axis). Since we are interested in knowing the importance of KPIs
across the entire input window, we can aggregate the saliency values over the
time axis and get a single value for each KPI (Fig. 5.6c). Ranking these values
will give us a list of KPIs ranked according to their manifold contribution.
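
A sketch of this saliency computation in modern TensorFlow terms is shown below (the thesis itself used the keras-vis toolkit, see Appendix A); `encoder` is assumed to be a Keras model ending at the last convolutional layer of the compression stage:

```python
import tensorflow as tf

def kpi_saliency(encoder, sample):
    """Gradient-based saliency of the encoder output with respect to the
    input, aggregated over time into one importance value per KPI."""
    x = tf.convert_to_tensor(sample[None, ...], dtype=tf.float32)  # add batch dim
    with tf.GradientTape() as tape:
        tape.watch(x)
        latent = encoder(x)
        score = tf.reduce_sum(latent)  # scalar proxy for the encoder output
    grads = tf.abs(tape.gradient(score, x))[0]   # [timesteps, KPIs]
    return tf.reduce_sum(grads, axis=0).numpy()  # one saliency value per KPI

# ranking the returned values gives the per-KPI manifold contribution
```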

Figure 5.6: Saliency heatmaps for an input of 288 timesteps and 82 KPIs divided
in two channels (a) and (b). We aggregate the saliency over time and return one
saliency value for each KPI in (c).
Chapter 6

Summary and Future Work

6.1 Summary

In this thesis we presented our work on anomaly detection in high-dimensional
multivariate time series. The dataset that we focused on consisted of Key Per-
formance Indicators (KPIs), which were provided by IBM Research Zurich and
contained measurements from real IBM Cloud storage systems. The measure-
ments cover a relatively short history of 11 weeks taken at 5 minute intervals.
We provided insights on the intricacies of our time series, such as the vast scaling
differences among KPIs and the varying degrees of periodicity.
To perform anomaly detection on the given KPIs, we developed two deep
learning models. Due to the shortage of available anomaly labels, our models
perform unsupervised learning based solely on the intrinsic properties of the input
data. Anomalies are characterized based on their deviation from the normal
patterns that are prevalent in the temporal history of the time series.
Initially, we utilized a deep learning model based on [16] that uses a con-
volutional and a recurrent layer to perform time series forecasting. We then
detect anomalies by using thresholds on the forecasting error. We conclude that
this model can be useful for detecting known anomalies, albeit with some limita-
tions. The number of falsely detected anomalies may be high, depending on
the quality of forecasting, which in turn is limited by the predictability of the
input data. Also, with input dimensions of 100s or 1000s of KPIs there exists a
need for deeper compression than this model provides.
To address these issues, we designed a second deep learning model based
on the encoder-decoder architecture. The encoder produces a compressed rep-
resentation of the input in the latent space, through a combination of several
convolutional layers, paired with an LSTM layer. The decoder then utilizes
analogous convolutional layers to reconstruct the input. We detect anomalies by
applying the 2-sigma rule on the reconstruction error. We conclude through ex-
amples that this model is able to reliably detect anomalies, while also performing
the desired compression on inputs in the range of 100 KPIs.


After observing that the given KPIs may be quite different in terms of pre-
dictability, we presented a method to combine our deep learning models in an
ensemble based on the predictability of the input KPIs. As a determining factor
for this ensemble, we use the autocorrelation function. We choose between our
models based on the presence of predictable periodic patterns in the data, as
indicated by the autocorrelation function.
Finally, we investigated two subproblems to optimize the anomaly detection
process:

• We designed an algorithmic tool to aid support engineers in anomaly root
cause analysis. Our tool takes the anomalous KPIs detected by the deep
learning model and cross-examines them with the rest of the KPIs to find
potential root cause candidates. The analysis is performed through an
ensemble of five similarity metrics. The result is a list of the top root cause
candidate KPIs. Feedback from support engineers indicates that this tool
can be quite useful and save them time in investigating the true problems
behind detected anomalies.

• We also introduced explainability to the anomaly detection process. We
generate saliency maps [22] that visualize the importance of each of the in-
put KPIs and their contribution to the latent space in the encoder-decoder
model.

In the next section, we will discuss some directions in which the work of this
thesis can be improved in the future.

6.2 Future Work

There are several ways to build upon the work that was done in the context of
this thesis.

• In this work we used a relatively simple heuristic for anomaly detection
based on the reconstruction error: the k-sigma rule. In the future, a more
complex method could be implemented, for example a heuristic algorithm
that takes into account the history of detected anomalies and remembers
which KPIs were anomalous in the past.

• There is also always room for further optimization and tuning in the deep
learning models. For example, if we have an input that consists of 1000s
of KPIs, more convolutional layers can be added to the encoder-decoder
model to achieve the desired compression.

• Lastly, if a large enough number of labels for the true anomalies can be pro-
vided in the future, it would be possible to make the switch to semi-supervised
or supervised learning.
Bibliography

[1] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Het-
erogeneous Distributed Systems. 2016. arXiv: 1603.04467 [cs.DC].
[2] Sebastian Bach et al. “On Pixel-Wise Explanations for Non-Linear Clas-
sifier Decisions by Layer-Wise Relevance Propagation”. In: PLOS ONE
10.7 (July 2015), pp. 1–46. doi: 10.1371/journal.pone.0130140. url:
https://doi.org/10.1371/journal.pone.0130140.
[3] P.J. Brockwell and R.A. Davis. Time Series: Theory and Methods. Springer
Series in Statistics. Springer New York, 2013. isbn: 9781489900043. url:
https://books.google.ch/books?id=DJ_lBwAAQBAJ.
[4] Teodora Sandra Buda, Bora Caglayan, and Haytham Assem. “DeepAD:
A Generic Framework Based on Deep Learning for Time Series Anomaly
Detection”. In: Advances in Knowledge Discovery and Data Mining. Ed. by
Dinh Phung et al. Cham: Springer International Publishing, 2018, pp. 577–
588. isbn: 978-3-319-93034-3.
[5] Raghavendra Chalapathy and Sanjay Chawla. Deep Learning for Anomaly
Detection: A Survey. 2019. arXiv: 1901.03407 [cs.LG].
[6] Varun Chandola, Arindam Banerjee, and Vipin Kumar. “Anomaly Detec-
tion: A Survey”. In: ACM Comput. Surv. 41.3 (July 2009). issn: 0360-
0300. doi: 10.1145/1541880.1541882. url: https://doi.org/10.1145/
1541880.1541882.
[7] C. Chatfield. “The Holt-Winters Forecasting Procedure”. In: Journal of the
Royal Statistical Society. Series C (Applied Statistics) 27.3 (1978), pp. 264–
279. issn: 00359254, 14679876. url: http://www.jstor.org/stable/
2347162.
[8] François Chollet et al. Keras. https://keras.io. 2015.
[9] “Dynamic Time Warping”. In: Information Retrieval for Music and Mo-
tion. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007, pp. 69–84. isbn:
978-3-540-74048-3. doi: 10.1007/978- 3- 540- 74048- 3_4. url: https:
//doi.org/10.1007/978-3-540-74048-3_4.
[10] Philippe Esling and Carlos Agon. “Time-Series Data Mining”. In: ACM
Comput. Surv. 45.1 (Dec. 2012). issn: 0360-0300. doi: 10.1145/2379776.
2379788. url: https://doi.org/10.1145/2379776.2379788.


[11] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”.
In: Neural Comput. 9.8 (Nov. 1997), pp. 1735–1780. issn: 0899-7667. doi:
10.1162/neco.1997.9.8.1735. url: https://doi.org/10.1162/neco.
1997.9.8.1735.
[12] Kyle Hundman et al. “Detecting Spacecraft Anomalies Using LSTMs and
Nonparametric Dynamic Thresholding”. In: Proceedings of the 24th ACM
SIGKDD International Conference on Knowledge Discovery and Data Min-
ing - KDD ’18 (2018). doi: 10 . 1145 / 3219819 . 3219845. url: http :
//dx.doi.org/10.1145/3219819.3219845.
[13] Been Kim et al. Interpretability Beyond Feature Attribution: Quantita-
tive Testing with Concept Activation Vectors (TCAV). 2017. arXiv: 1711.
11279 [stat.ML].
[14] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Opti-
mization. 2014. arXiv: 1412.6980 [cs.LG].
[15] Raghavendra Kotikalapudi and contributors. keras-vis. https://github.
com/raghakot/keras-vis. 2017.
[16] Guokun Lai et al. Modeling Long- and Short-Term Temporal Patterns with
Deep Neural Networks. 2017. arXiv: 1703.07015 [cs.LG].
[17] Yann Lecun et al. “Gradient-based learning applied to document recogni-
tion”. In: Proceedings of the IEEE. 1998, pp. 2278–2324.
[18] Pankaj Malhotra et al. LSTM-based Encoder-Decoder for Multi-sensor Anomaly
Detection. 2016. arXiv: 1607.00148 [cs.AI].
[19] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. “”Why Should I
Trust You?”: Explaining the Predictions of Any Classifier”. In: Proceedings
of the 22nd ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, San Francisco, CA, USA, August 13-17, 2016.
2016, pp. 1135–1144.
[20] Y Sakurai, M Yoshikawa, and Christos Faloutsos. “FTW: fast similarity
search under the time warping distance”. In: Proceedings of the twenty-
fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database
systems 90 (2005), pp. 326–337. doi: http://doi.acm.org/10.1145/
1065167.1065210.
[21] Xingjian Shi et al. Convolutional LSTM Network: A Machine Learning
Approach for Precipitation Nowcasting. 2015. arXiv: 1506.04214 [cs.CV].
[22] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep Inside
Convolutional Networks: Visualising Image Classification Models and Saliency
Maps. 2013. arXiv: 1312.6034 [cs.CV].

[23] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to Sequence
Learning with Neural Networks”. In: Proceedings of the 27th International
Conference on Neural Information Processing Systems - Volume 2. NIPS’14.
Montreal, Canada: MIT Press, 2014, pp. 3104–3112.
[24] Ye Yuan et al. “MuVAN: A Multi-view Attention Network for Multivariate
Temporal Data”. In: 2018 IEEE International Conference on Data Mining
(ICDM) (2018), pp. 717–726.
[25] Chuxu Zhang et al. A Deep Neural Network for Unsupervised Anomaly
Detection and Diagnosis in Multivariate Time Series Data. 2018. arXiv:
1811.08055 [cs.LG].
[26] Bo Zong et al. “Deep Autoencoding Gaussian Mixture Model for Unsuper-
vised Anomaly Detection”. In: ICLR. 2018.
Appendix A

Software Implementation Tools

Throughout this master thesis the programming language used was Python. For
the development of deep learning models I used TensorFlow [1] and Keras [8] with
a TensorFlow backend.
For the forecasting model in Section 4.1 the authors of [16] provided open-
source code online, written in PyTorch. I adapted that code and wrote my
own version in Keras. I also changed some parameters to suit the needs of my
application and added code to perform anomaly detection on the forecasting
error. Code development for the remaining parts of the thesis was performed by
myself.
Some other Python libraries that I utilized are:

• Matplotlib for generating plots.

• Numpy for handling the input data as arrays.

• The RobustScaler provided by the scikit-learn library to normalize the
data.

• The keras-vis toolkit [15] in Section 5.2 to generate the saliency maps.

For the training of my neural networks I used standard learning parameters.
The optimizer of choice was Adam [14]. As a loss function I used the mean
squared error (MSE).
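
In Keras terms this training setup amounts to the sketch below; the stand-in model and the commented fit parameters are illustrative, not the thesis's tuned values:

```python
from tensorflow.keras import layers, Model, optimizers

# illustrative stand-in for either model from Chapter 4
inputs = layers.Input(shape=(288, 82))
model = Model(inputs, layers.Dense(82)(layers.GRU(32)(inputs)))

# standard training configuration: Adam optimizer with an MSE loss
model.compile(optimizer=optimizers.Adam(), loss="mse")
# model.fit(x_train, y_train, epochs=..., batch_size=...)
```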
