Anomaly Detection on
High-Dimensional Time Series
Master Thesis
Athanasios Fitsios
afitsios@student.ethz.ch
Supervisors:
Mitch Gusat, IBM Research
Prof. Dr. Lothar Thiele, ETH Zurich
September 8, 2020
Acknowledgements
This work was performed in collaboration with IBM Research Zurich. I would
like to thank Mitch Gusat for his guidance and fruitful collaboration during the
past six months. I would also like to thank Prof. Dr. Lothar Thiele for giving
me the opportunity to write this master thesis and for his valuable insights and
feedback.
This master thesis was supported by the Onassis Foundation - Scholarship
ID: F ZO 081-1/2018-2019.
Abstract
Contents
Acknowledgements
Abstract
1 Introduction
1.1 Motivation
1.2 Research Questions and Main Contributions
1.3 Outline
Bibliography
Introduction
1.1 Motivation
Anomaly detection refers to the problem of finding patterns in data that are
unexpected and dissimilar to the majority of the data. There has been increasing
research interest in detecting anomalies, as they may often carry critical and
actionable information in various application domains. Such applications include
intrusion detection for cyber-security, fraud detection in financial institutions,
military surveillance, healthcare and many more. Because anomalies in these
applications have direct real-life consequences, anomaly detection is an
important subject for data analysts in research and industry.
The problem of detecting anomalies in time series, in particular, is increasingly
important in communications, IT and cloud storage systems. While statistical
methods such as autoregressive integrated moving average (ARIMA) [3]
and Holt-Winters [7] have traditionally been effective for anomaly
detection, interest in deep learning methods has grown rapidly in
recent years.
In this thesis, we focus on deep learning methods for anomaly detection
in multivariate time series. The main motivation for this choice stems
from the need to perform anomaly detection at large scale. Statistical methods
work well on univariate time series, but are difficult to scale to
high-dimensional multivariate time series. Moreover, deep learning methods can
perform end-to-end anomaly detection on raw input data and are able to capture
complex structures and features within the data.
The main goal of this thesis is to exploit deep learning in order to perform
anomaly detection in Key Performance Indicators (KPIs) taken from real-life
cloud storage systems. The KPIs are high-dimensional multivariate time series
that have been collected by the Zurich lab of IBM Research and correspond to
complex storage systems in the IBM Cloud.
In state-of-the-art literature there have been solutions for anomaly detection
that tend to focus and specialize in their respective application domains. A
guiding question for this work is whether it is possible to have an anomaly
detection method for time series that is applicable to the domain of cloud storage,
while also offering the potential for generalization.
It is also of key importance that we develop deep learning models for anomaly
detection that can effectively scale to multivariate time series with hundreds or
even thousands of dimensions. This can prove very important for support engineers
and subject matter experts, as they currently have to spend considerable time
manually examining a large number of KPIs in order to investigate potential
anomalies and their root causes.
The key contents of this thesis can be summarized as follows:
1.3 Outline
• In Chapter 3 we describe in detail the problem context, our main goals and
challenges. We particularly focus on the KPIs in our dataset and examine
the specific challenges posed by this type of data.
Chapter 2
Background and Related Work
In this chapter we will provide some basic background knowledge and definitions
related to anomaly detection. Then, we will briefly go through some of the
related work on this subject.
• Point anomalies are individual instances that deviate from the rest of the
data. It is the most common type of anomaly and the focus of most of the
work in literature.
• Contextual anomalies are points that can be deemed as normal when in-
spected individually, but are considered anomalous in a specific problem
context. The most common context is time. For example, a temperature
below zero can be seen as normal, but if the context of time is a summer
month, this temperature would be anomalous.
Chapter 3
Problem Context and Strategy
In this chapter we will describe the main problem that this thesis sets out to
tackle and underline the key challenges we face in the context of the thesis.
Our main goal is to perform anomaly detection in real multivariate time series
data provided by IBM Cloud storage systems by utilizing deep neural networks.
The given time series are Key Performance Indicators (KPIs) of a cloud storage
system. It is a difficult problem, as we must work with high-dimensional raw time
series that have been recently collected, which means they have relatively short
history and have not been previously curated. The key challenges associated
with this context will be presented in more detail in section 3.1.
To tackle our main problem, we decided to design a deep neural network
that can ingest all the KPIs at the same time. Hence, the input will be a
tensor shaped as in Fig. 3.1. The input KPIs cover approximately 11 weeks of
measurements taken at 5-minute intervals. To construct the input tensor we
divide the time series into samples, e.g. days, where 1 day equals 288 timesteps.
Therefore, our input will be a 3D tensor with dimensions [samples, timesteps,
KPIs]. At the time of this thesis, we do not have a sufficient number of class
labels that indicate the true anomalies in our time series. Due to this challenge,
our neural networks must perform unsupervised learning.
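The construction of the [samples, timesteps, KPIs] tensor described above can be sketched as follows. This is a minimal illustration in plain Python, not the code used in the thesis: the function name `to_samples`, the toy data and the choice to drop a trailing partial day are assumptions made for the example.

```python
# 5-minute sampling gives 24 * 60 / 5 = 288 timesteps per day.
TIMESTEPS_PER_DAY = 24 * 60 // 5

def to_samples(series, timesteps_per_sample=TIMESTEPS_PER_DAY):
    """Split a [time][KPI] series into [samples][timesteps][KPIs].

    A trailing partial sample is dropped (an assumption for this sketch).
    """
    n_samples = len(series) // timesteps_per_sample
    return [
        series[s * timesteps_per_sample:(s + 1) * timesteps_per_sample]
        for s in range(n_samples)
    ]

# Toy data: 3 full days plus 10 extra readings of 4 hypothetical KPIs.
n_kpis = 4
raw = [[float(t + k) for k in range(n_kpis)] for t in range(3 * 288 + 10)]
tensor = to_samples(raw)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape)  # (3, 288, 4)
```

With roughly 11 weeks of data, the same split would yield about 77 daily samples per system.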
The output of the deep learning model should be a list of detected anomalies,
the time window when they occur and a list of the KPIs that are the most
anomalous. The method that we use to compute this output will be covered in
detail in Chapter 4.
After the main problem of anomaly detection, we also work on two subprob-
lems that can further optimize the anomaly detection process: (1) Investigating
anomaly causality through root cause analysis, and (2) adding explainability to
anomaly detection. These subproblems are important, because they can provide
valuable insights to support engineers about possible technical problems behind
the anomalies and also about the most important parts and KPIs of the storage
system. Our approach to tackle these subproblems will be presented in detail in
Chapter 5.
Figure 3.1: Shape of the 3D tensor to be fed as input to the neural network.
Features correspond to KPIs.
temporal history of the time series. Values that are dissimilar to any pattern
that we encountered before are candidates for anomaly detection. Since we do
not have a full set of labels for anomalies in our data, any detected anomaly will
be further validated by support engineers.
Our data contains a diverse set of KPIs which can exhibit quite different
behaviours over time. Some KPIs can be quite unpredictable and do not exhibit
any seasonal periodic patterns in their values. On the other hand, we also ob-
served KPIs with strong periodicity. This characteristic can make it easier for
an anomaly detection model to learn from these KPIs. We can see a relevant
example in Fig. 3.5. In the left part of the figure we have the plots of an I/O
rate (top) and a response time (bottom). The enlarged plot interval on the right
corresponds to measurements from five weekdays. We can observe that both
KPIs have a strong seasonal pattern on a daily basis during the weekdays. This
pattern stops repeating and changes during the weekend.
In Chapter 4 we will present in detail our approach to tackle the afore-
mentioned challenges. We develop two deep learning models, one that utilizes
time series forecasting (section 4.1) and one that performs input reconstruction
through an encoder-decoder architecture (section 4.2). Our input data in both
models will be normalized to account for the different scales. When the input
KPIs number in the hundreds or thousands, we incorporate compression in our
approach in section 4.2. In Chapter 4 we also observe the performance of our
deep learning models with regard to the varying degree of periodicity in the KPIs
and in section 4.3 we describe an approach to ensemble our models based on this
characteristic.
Figure 3.5: Example KPIs with seasonal behaviour. There is a periodic daily
pattern during the weekdays (highlighted by the blue rectangle), which stops on
the weekend.
Chapter 4
Models for Anomaly Detection
In this chapter we will present our solutions for anomaly detection. We developed
two deep learning models in the context of this thesis. We also describe a method
to combine the different models in an ensemble. Since in our provided data from
IBM there are not enough class labels for true anomalies, our models will perform
unsupervised learning.
4.1 Forecasting
Our initial model for anomaly detection is based on time series forecasting. Given
a time window of the input time series, the machine learning model tries to
predict the value of the time series for a certain time in the future. This time
interval is called the forecasting horizon and can be short, e.g. 1 timestep ahead,
or longer.
The input to the model is a multivariate time series arranged as a tensor with
dimensions [samples, time, KPIs]. At the output, we obtain the next predicted values of the
given time series. We compare the predictions with the true values and compute
the forecasting error. Based on this error, we can then employ a heuristic method
to detect anomalies.
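The forecasting setup above can be sketched in a few lines. This example only illustrates the windowing and the error computation. The neural network is replaced by a hypothetical stand-in, `persistence_forecast`, which repeats the last observed row, and the absolute difference is used as the per-KPI error; both are assumptions for this sketch.

```python
def make_windows(series, window, horizon=1):
    """Return (inputs, targets) from a [time][KPI] series.

    Each input holds `window` consecutive rows; each target is the row
    `horizon` steps after the end of that window.
    """
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        inputs.append(series[start:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets

def persistence_forecast(window_rows):
    """Stand-in forecaster (assumption): predict the last observed row."""
    return window_rows[-1]

def abs_errors(pred, true):
    """Per-KPI absolute forecasting error for one timestep."""
    return [abs(p - t) for p, t in zip(pred, true)]

# Toy series with 2 hypothetical KPIs; the first one jumps at the end.
series = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0], [9.0, 10.0]]
X, y = make_windows(series, window=3, horizon=1)
errors = [abs_errors(persistence_forecast(x), t) for x, t in zip(X, y)]
print(errors)  # [[1.0, 0.0], [5.0, 0.0]]
```

A longer forecasting horizon only changes which row is taken as the target; the error computation stays the same.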
The forecasting model architecture (Fig. 4.1) is based on [16]. The main
components are a convolutional layer [17] followed by a recurrent layer [11]. The
convolutional layer discovers local dependency patterns among the input time
series, while the recurrent layer can learn long-term dependencies.
Two extra components are added to the model: (1) a recurrent skip layer is
added to exploit potential periodic properties in the input, and (2) an autore-
gressive linear component [3] is added in parallel to the model to make it more
robust for time series with rapid scale changes.
We test the model on KPIs from an IBM Cloud storage system that exhibited
a known anomaly. In Fig. 4.2 we can see a particular KPI of this system, more
Figure 4.1: Deep learning model for time series forecasting. Based on [16].
specifically a response time. The dates in the figure indicate times when an
anomaly is known to appear. It is easily discernible that the response time is
much higher in this time window. In the inner figure we see a histogram for this
KPI, indicating a long tail of outlier values. One can also observe that, even
before the anomaly, peaks appear in the response time.
After applying the forecasting model to this system, we plot the prediction
and the true values for the same response time KPI. In Fig. 4.3 we zoom in on a
time window around the known anomaly. We observe that the prediction closely
follows the true value even around the known anomaly. Ideally, we would want
our model to only learn the "normal" behaviour of the system and have larger
forecasting errors in case of an anomaly. By examining this result we conclude
that, due to the unpredictable nature of this particular system, the model tends
to output a predicted value that resembles the last input value it received.
In Fig. 4.4 we produce a similar plot for another KPI of the same system,
this time a data rate. Even though the true data rate shows normal behaviour
at the time of the known anomaly, we observe that the predicted value (in blue)
deviates significantly from the true value in this time window. This indicates
that the convolutional part of the model captured the correlation between the
data rate and the response time. Hence, the predicted data rate was affected by
the "anomalous" response time.
Although the result in Fig. 4.3 is not ideal, since the predicted
output tends to follow the last input value, we still investigate whether it can
be used to correctly detect the known anomaly. For this purpose, we obtain the
forecasting error for this specific KPI. We employ dynamic time warping [9] to
compute the forecasting error over a rolling window of 3 timesteps (equivalent to
15 minutes). We plot the resulting forecasting error in Fig. 4.5. As we can see, the
error is still significantly larger at the times of the known anomaly. However,
Figure 4.2: Example KPI with known anomaly. The inner figure shows the
histogram for this KPI.
Figure 4.3: Predicted output (blue) and true value (orange) for a KPI (response
time) with known anomaly
Figure 4.4: Predicted output (blue) and true value (orange) for a different KPI
(data rate) in the same system.
depending on the chosen error threshold (horizontal lines), we see that we run
the risk of detecting many false anomalies.
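The rolling DTW error and the effect of the threshold can be illustrated with a small sketch. It uses the textbook dynamic-programming form of DTW with an absolute-difference cost; how the rolling window was aligned in the thesis is not specified, so the windowing in `rolling_dtw_error`, the toy data and the threshold value are assumptions.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]

def rolling_dtw_error(pred, true, window=3):
    """DTW distance between prediction and truth over each rolling window."""
    return [dtw_distance(pred[i:i + window], true[i:i + window])
            for i in range(len(true) - window + 1)]

# Toy KPI: a flat prediction against a true series with a short spike.
true = [1.0, 1.0, 1.0, 1.0, 8.0, 9.0, 1.0, 1.0]
pred = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
errors = rolling_dtw_error(pred, true, window=3)
threshold = 5.0  # hypothetical value; a lower threshold flags more windows
flagged = [i for i, e in enumerate(errors) if e > threshold]
print(errors)   # [0.0, 0.0, 7.0, 15.0, 15.0, 8.0]
print(flagged)  # [2, 3, 4, 5]
```

Every window that overlaps the spike exceeds the threshold, which shows how sensitive the number of detections is to the chosen value.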
These initial results lead us to the conclusion that our initial model has two
main limitations:
• The model performs better on data that have reasonably seasonal and
recurrent behaviour. In case of measured traces from complex cloud storage
systems like the one we investigated in this thesis, the KPIs can be quite
unpredictable over time and exhibit behaviour that is similar to a random
walk. In such cases, the quality of forecasting will drop, which may lead
to many falsely detected anomalies.
Figure 4.5: Forecasting error for the KPI with the known anomaly, computed
with dynamic time warping (DTW). The horizontal lines indicate different error
thresholds for classifying anomalies.
• Instead of aiming to predict future timesteps for our input time series, we
now perform reconstruction of the input.
To perform this task, the basic architecture we will utilize is the sequence-to-sequence
model [23], or encoder-decoder model (Fig. 4.6). In this basic structure,
the input passes through the Encoder, which compresses the input into a latent
space manifold. In this way, the compressed representation of the data captures
only the most essential information. The Decoder then uses this compressed
representation to generate the required output sequence. In our case, the desired
output will be the reconstructed input. In this sense, the model can also be called
an autoencoder.
Therefore, the goal of our model will be to reconstruct the given input time
series, which in our case are KPIs, through unsupervised learning. By comparing
the model output to the original input, we can obtain the reconstruction error.
We will then utilize this error to detect anomalies.
To implement our final model architecture, we use a combination of convolutional
and recurrent layers, as seen in Fig. 4.7. The encoder consists of a
series of convolutional layers, which extract the relevant features from the input
KPIs. The convolutional layers can capture correlations among the KPIs, as well
as short patterns in the time dimension. The depth of the encoder (i.e. number of
subsequent layers) depends on the level of compression that we want to achieve.
Following the convolutional layers, we also add a recurrent layer, specifically a
long short-term memory (LSTM) layer [11]. The LSTM layer provides a "forgetful"
memory, as it uses its internal gates to capture long-term dependencies and also
learn which information in a sequence should be kept or discarded. For the
decoder part of the model, we use convolutional layers analogous to the ones in
the encoder. These layers perform a so-called "deconvolution", as they expand
the compressed representation back to the original input dimensions.
In Fig. 4.8, we illustrate the final model architecture for a given IBM Cloud
storage system. In this example we start from 82 KPIs divided into two channels
in the input and use 4 convolutional layers for compression. In the model output
we obtain a reconstructed version of the input tensor, from which we compute
the reconstruction error.
To detect anomalies based on the reconstruction error, we employ a heuristic
method based on the k-sigma rule. We aggregate the reconstruction errors over
all KPIs and compute the mean value and the standard deviation (sigma). Then
we look at the error at every timestep and check its distance from the mean
value in relation to k times sigma. In the examples below, we tag timesteps as
anomalies when the error is at least a distance of 2 sigmas away from the mean.
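A minimal sketch of the k-sigma rule is given below. The text does not state how the errors are aggregated over the KPIs or which standard deviation estimator is used; averaging over KPIs and using the population standard deviation are assumptions here, and the function names are illustrative.

```python
from statistics import mean, pstdev

def aggregate_errors(errors_per_kpi):
    """Collapse [timesteps][KPIs] reconstruction errors to one value per timestep."""
    return [mean(row) for row in errors_per_kpi]

def k_sigma_anomalies(errors, k=2.0):
    """Return the timesteps whose error is at least k sigmas away from the mean."""
    mu = mean(errors)
    sigma = pstdev(errors)
    if sigma == 0:
        return []  # a constant error series has no outliers
    return [i for i, e in enumerate(errors) if abs(e - mu) >= k * sigma]

# Toy example: 20 timesteps, 2 hypothetical KPIs, one badly reconstructed step.
errors_per_kpi = [[0.1, 0.1] for _ in range(20)]
errors_per_kpi[12] = [2.0, 4.0]
aggregated = aggregate_errors(errors_per_kpi)
print(k_sigma_anomalies(aggregated, k=2.0))  # [12]
```

Larger values of k make the rule more conservative; with k = 2 a single strong outlier in an otherwise flat error curve is flagged.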
In Figures 4.9 and 4.10 we illustrate the result of applying this model to real
systems. The blue line shows the aggregated reconstruction error. Highlighted
in red rectangles are the timesteps that were considered an anomaly by our
algorithm. The time history of our given data in both examples covers a span
of almost 11 weeks. In the case of Fig. 4.9, we detect a significant anomaly that
lasted for a few days. Other than that, the reconstruction error remains steadily
low.
Another example of applying our anomaly detection on a different system
can be seen in Fig. 4.10. In this case, the error is very low until approxi-
mately the last 10 days of measurements. In that window we detect a number of
anomalies with elevated errors. It is worth noting that, if the system continues
to exhibit similar behaviour as future measurements are gathered and added to
Figure 4.7: General architecture of our autoencoder model, built with a combi-
nation of CNN and RNN layers. Blue blocks are for the encoder, orange blocks
for the decoder.
Figure 4.8: Final model architecture for a given IBM cloud system. The encoder
(blue) and decoder (orange) each have four convolutional layers.
the training history, we expect the model to eventually learn and adapt to this
"new normal" behaviour and no longer consider it anomalous. In this scenario,
the current detected anomalies would be indicative of novel system behaviour
(novelty detection), rather than e.g., a technical problem in the system.
After the implementation of the two deep learning models presented in sections
4.1 and 4.2, another aspect we want to explore is a generalizable method to
combine and discriminate between these models. The goal is to have an ensemble
of the two models and to decide which one is better to use based on the given
data.
We choose to discriminate our ensemble of models based on data predictabil-
ity, as we observe that accurate forecasting depends on the predictability of the
time series. For example, for highly seasonal, well-behaved data, the simpler
forecasting model may prove to be good enough. In data from real
cloud storage systems, we observed quite different behaviour among KPIs
belonging to different systems in terms of predictability. We can have KPIs that
exhibit strong seasonality and periodic behaviour (e.g. similar values on a daily
or weekly basis). On the other hand, we can observe KPIs whose behaviour does
not show any periodic pattern and approaches a random walk.
There can be multiple ways to examine the predictability of our given data,
including methods like autoregressive integrated moving average (ARIMA), Holt-
Winters and autocorrelation functions.
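As an illustration of a predictability check based on the autocorrelation function, the sketch below estimates the autocorrelation at the daily lag (288 timesteps) and picks a model accordingly. The threshold of 0.5, the rule of sending periodic KPIs to the forecasting model and all names are assumptions for this example, not the exact decision logic of the thesis.

```python
import math
import random

def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    if lag >= n:
        return 0.0
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x)
    if var == 0:
        return 0.0
    cov = sum((x[t] - mu) * (x[t + lag] - mu) for t in range(n - lag))
    return cov / var

def choose_model(x, season_lag=288, threshold=0.5):
    """Pick the forecasting model for periodic KPIs, the autoencoder otherwise."""
    if autocorrelation(x, season_lag) >= threshold:
        return "forecasting"
    return "autoencoder"

# One week of 5-minute samples: a daily sine wave and non-periodic noise.
n = 7 * 288
periodic = [math.sin(2 * math.pi * t / 288) for t in range(n)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

print(choose_model(periodic))  # forecasting
print(choose_model(noise))     # autoencoder
```

In practice one would inspect several lags (e.g. daily and weekly) rather than a single one.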
Figure 4.11: Flow chart for a predictability-based ensemble between two anomaly
detection models.
Figure 4.12: Autocorrelation function of an I/O rate KPI from an IBM cloud
storage system. The KPI exhibits strong seasonal correlation on a daily basis.
Figure 4.13: Autocorrelation function of a response time KPI from another IBM
cloud storage system. This example shows no discernible seasonality.
Chapter 5
Towards Anomaly Detection Optimization
In this chapter, we explore ways to further optimize the anomaly detection pro-
cess. We tackle two subproblems related to anomaly causality and explainability.
The goal is to provide helpful information to support engineers and reduce their
workload when investigating system anomalies.
Figure 5.1: Enhanced anomaly detection pipeline. The results of the anomaly
detection model are coupled with a root cause analysis (RCA) tool.
Method Our RCA tool ranks a set of possible root cause KPIs according to
their similarity to a smaller set of anomalous KPIs. We define a similarity
function between two time series T and C_j as D(T, C_j); a time series C_i is
more similar to T than C_j if D(T, C_i) < D(T, C_j) [10]. This ranking is decided
based on a two-level ensemble of the individual similarity metrics: an
intra-ensemble (each metric ranks the candidate KPIs) and an inter-ensemble (an
ensemble across the rankings of the metrics). We use three distance metrics
(L1/Manhattan, L2/Euclidean and Dynamic Time Warping, DTW) and two correlation
metrics (Spearman and Pearson). The similarity metrics between two
time series T and C_j are computed as follows:
\mathrm{Eucl}(T, C_j) = \sqrt{\sum_{i=0}^{n} (T_i - C_{ji})^2}    (5.1)

\mathrm{Manh}(T, C_j) = \sum_{i=0}^{n} |T_i - C_{ji}|    (5.2)

\mathrm{Pear}(T, C_j) = \frac{\frac{1}{n} \sum_{i=0}^{n} (T_i - \mu_T)(C_{ji} - \mu_{C_j})}{\sigma_T \sigma_{C_j}}    (5.3)

\mathrm{Spear}(T, C_j) = \mathrm{Pear}(rg_T, rg_{C_j})    (5.4)

where \mu_T, \mu_{C_j}, \sigma_T and \sigma_{C_j} are the means and standard
deviations of time series T and C_j, respectively, whereas rg_T and rg_{C_j} are
the converted rank variables used by Spearman (i.e., data points are sorted and
values are replaced by their rank). For DTW we use the approximate FastDTW [20]
technique, which provides optimal or near-optimal alignments by recursively
projecting and refining a solution from coarser resolutions. These metrics have
the following beneficial properties:
A holistic view of the method can be seen in Fig. 5.2. The detailed op-
eration is captured in Algorithm 1. It requires an input containing: (1) a set
of anomalies to be analyzed; (2) the triggering KPIs for each anomaly; (3) a
set of the potential root cause KPI candidates. We normalize the KPIs within
the anomaly time window to bring them to similar scales. We compare every
trigger KPI with all candidate KPIs and produce a sorted list of candidates for
each metric. To produce the final ranked list of root cause candidate KPIs, we
perform ensemble voting across the five similarity metrics. The voting bias for
each metric is defined according to the system’s region of operation.
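The ranking procedure above can be sketched as follows. This is an illustrative reading of the method, not Algorithm 1 itself: min-max normalization, the weighted Borda-style vote in `rank_root_causes`, exact DTW in place of FastDTW, the simple tie handling in `ranks` and the KPI names are all assumptions for this example. Equal weights of 0.2 correspond to the normal region of operation.

```python
from statistics import mean, pstdev

def minmax(x):
    lo, hi = min(x), max(x)
    return [0.0 for _ in x] if hi == lo else [(v - lo) / (hi - lo) for v in x]

def manhattan(t, c):
    return sum(abs(a - b) for a, b in zip(t, c))

def euclidean(t, c):
    return sum((a - b) ** 2 for a, b in zip(t, c)) ** 0.5

def dtw(t, c):
    """Exact DTW (the thesis uses the approximate FastDTW instead)."""
    inf = float("inf")
    d = [[inf] * (len(c) + 1) for _ in range(len(t) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(t) + 1):
        for j in range(1, len(c) + 1):
            d[i][j] = abs(t[i - 1] - c[j - 1]) + min(
                d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[-1][-1]

def pearson(t, c):
    st, sc = pstdev(t), pstdev(c)
    if st == 0 or sc == 0:
        return 0.0
    mt, mc = mean(t), mean(c)
    return mean([(a - mt) * (b - mc) for a, b in zip(t, c)]) / (st * sc)

def ranks(x):
    """Replace values by their rank; ties are broken by position (simplification)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def spearman(t, c):
    return pearson(ranks(t), ranks(c))

# metric name -> (function, True if a higher value means "more similar")
METRICS = {
    "manhattan": (manhattan, False),
    "euclidean": (euclidean, False),
    "dtw": (dtw, False),
    "pearson": (pearson, True),
    "spearman": (spearman, True),
}

def rank_root_causes(trigger, candidates, weights=None):
    """Rank candidate KPIs by a weighted vote over the per-metric rankings."""
    weights = weights or {name: 0.2 for name in METRICS}
    t = minmax(trigger)
    norm = {name: minmax(series) for name, series in candidates.items()}
    scores = {name: 0.0 for name in candidates}
    for metric, (fn, higher_is_better) in METRICS.items():
        values = {name: fn(t, series) for name, series in norm.items()}
        ordered = sorted(values, key=values.get, reverse=higher_is_better)
        for position, name in enumerate(ordered):
            scores[name] += weights[metric] * (len(ordered) - position)
    return sorted(scores, key=scores.get, reverse=True)

trigger = [1, 1, 1, 5, 9, 5, 1, 1]  # anomalous response time (toy data)
candidates = {
    "load_a": [10, 10, 10, 50, 90, 50, 10, 10],
    "load_b": [3, 2, 3, 2, 3, 2, 3, 2],
    "load_c": [9, 8, 7, 6, 5, 4, 3, 2],
}
ranking = rank_root_causes(trigger, candidates)
```

In this toy case `load_a` has the same shape as the trigger after normalization, so every metric ranks it first. Passing a different `weights` dictionary changes the bias of the vote.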
Voting Bias We use a novel method to adjust and re-bias the weights of each
component of the ensemble. The vote weighting is driven by the operating regions
suggested by the cloud system’s delay-load transfer functions, e.g., the family
of "hockey-stick" plots that characterize each specific type of Cloud/Storage
(Fig. 5.3).
Briefly, this is accomplished by
Normal Region When the system operates within its normal region, the
probability of encountering high noise or a large number of outliers is low. We
therefore assign equal votes to all metrics (0.2 each).
Example In Fig. 5.4 we illustrate the result of applying our RCA tool to a
cloud storage system. The anomaly window is shown in red dotted lines. We
look into a time window of 8 hours before and after the anomalous event. In this
case, the anomalous KPI is a Response Time (delay). After applying the RCA
[Figure 5.4 plot: two stacked panels sharing a Timestamp axis]
Figure 5.4: Top-3 root cause candidates (bottom) returned by the RCA ensemble
versus the triggering KPI (top) detected by the anomaly detection engine. The
red dotted lines show the event window.
ensemble, we come up with a ranked list of potential root cause candidates. The
Top-3 root cause candidate KPIs shown in the figure are rates (loads), which
strongly correlate with this response time. By further investigating the role of
these KPIs in the system, we confirm that there is a pseudo-causal path in which
the activity of these three loads propagates through the system and causally
affects the respective response time during the event window.
The major benefit of RCA is that it significantly reduces the workload
of Support Engineers in investigating anomalies. Instead of spending hours or
days manually going through hundreds of KPIs to look for root causes, the
Support team now only needs to look into a few top root cause candidate KPIs
suggested by the RCA tool.
Our RCA tool helps with the investigation of causality among the KPIs in an
anomaly. We want to further enhance our anomaly detection process by introduc-
ing and leveraging explainability in the deep learning model. Initially, we would
like to investigate the importance of each of our input KPIs, i.e., how much each
KPI contributes to the anomaly detection model. This will give us information
about critical system KPIs that significantly affect the anomaly detection.
As we saw in Section 4.2, in a sequence-to-sequence model the encoder compresses
the input into a latent space manifold, keeping the most necessary in-
Figure 5.5: Extracting the contribution of input KPIs to the latent space in our
encoder-decoder model.
Figure 5.6: Saliency heatmaps for an input of 288 timesteps and 82 KPIs divided
in two channels (a) and (b). We aggregate the saliency over time and return one
saliency value for each KPI in (c).
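The aggregation over time shown in panel (c) can be sketched as below for a single channel. The thesis generates the saliency maps with keras-vis; the map here is toy data, and averaging the absolute saliency over the time axis is an assumption, since the exact aggregation is not stated. The KPI names are hypothetical.

```python
def kpi_saliency(saliency):
    """Collapse a [timesteps][KPIs] saliency map to one value per KPI."""
    n_steps = len(saliency)
    n_kpis = len(saliency[0])
    return [sum(abs(saliency[t][k]) for t in range(n_steps)) / n_steps
            for k in range(n_kpis)]

def top_kpis(saliency, names, n=3):
    """Return the names of the n KPIs with the highest aggregated saliency."""
    scores = kpi_saliency(saliency)
    order = sorted(range(len(names)), key=lambda k: scores[k], reverse=True)
    return [names[k] for k in order[:n]]

# Toy saliency map: 4 timesteps, 3 hypothetical KPIs.
saliency = [
    [0.0, 0.5, 0.25],
    [0.0, 1.0, 0.25],
    [0.0, 0.5, 0.25],
    [0.0, 1.0, 0.25],
]
names = ["kpi_a", "kpi_b", "kpi_c"]
print(kpi_saliency(saliency))       # [0.0, 0.75, 0.25]
print(top_kpis(saliency, names, 2)) # ['kpi_b', 'kpi_c']
```

For an input with two channels, the same aggregation would be applied to each channel before comparing KPIs.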
Chapter 6
Summary and Future Work
6.1 Summary
After observing that the given KPIs may be quite different in terms of pre-
dictability, we presented a method to combine our deep learning models in an
ensemble based on the predictability of the input KPIs. As a determining factor
for this ensemble, we use the autocorrelation function. We choose between our
models based on the presence of predictable periodic patterns in the data, as
indicated by the autocorrelation function.
Finally, we investigated two subproblems to optimize the anomaly detection
process:
In the next section, we will discuss some directions in which the work of this
thesis can be improved in the future.
There are several ways to build upon the work that was done in the context of
this thesis.
• There is always room for further optimization and tuning of the deep
learning models. For example, if the input consists of thousands
of KPIs, more convolutional layers can be added to the encoder-decoder
model to achieve the desired compression.
• Lastly, if a large enough number of labels for the true anomalies can be
provided in the future, it would be possible to switch to semi-supervised
or supervised learning.
Bibliography
[23] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to Sequence
Learning with Neural Networks”. In: Proceedings of the 27th International
Conference on Neural Information Processing Systems - Volume 2. NIPS’14.
Montreal, Canada: MIT Press, 2014, pp. 3104–3112.
[24] Ye Yuan et al. “MuVAN: A Multi-view Attention Network for Multivariate
Temporal Data”. In: 2018 IEEE International Conference on Data Mining
(ICDM) (2018), pp. 717–726.
[25] Chuxu Zhang et al. A Deep Neural Network for Unsupervised Anomaly
Detection and Diagnosis in Multivariate Time Series Data. 2018. arXiv:
1811.08055 [cs.LG].
[26] Bo Zong et al. “Deep Autoencoding Gaussian Mixture Model for Unsuper-
vised Anomaly Detection”. In: ICLR. 2018.
Appendix A
Throughout this master thesis the programming language used was Python. For
the development of deep learning models I used TensorFlow [1] and Keras [8] with
a TensorFlow backend.
For the forecasting model in Section 4.1 the authors of [16] provided open-
source code online, written in PyTorch. I adapted that code and wrote my
own version in Keras. I also changed some parameters to suit the needs of my
application and added code to perform anomaly detection on the forecasting
error. Code development for the remaining parts of the thesis was performed by
myself.
Some other Python libraries that I utilized are:
• I also used the keras-vis toolkit [15] in Section 5.2 to generate the saliency
maps.