Attention for time series forecasting and classification


Harnessing the most recent advances in NLP for time series forecasting and classification

Isaac Godfried


Published in Towards Data Science
14 min read · Apr 10, 2019


Note: this article was published on 10/18/19; I have no idea why Medium is saying that it is from 4/9/19. The article
contains the most recent and up-to-date results, including articles that will appear at NeurIPS 2019 and preprints that
came out just this past August.

Update 8/24/2020: A lot of people have asked for an update. I haven’t gotten around to writing another article on this
subject but you can find implementations of the transformer and several other models with attention in the flow-
forecast repository.

Transformers (specifically self-attention) have powered significant recent progress in NLP. They have enabled models like
BERT, GPT-2, and XLNet to form powerful language models that can be used to generate text, translate text, answer
questions, classify documents, summarize text, and much more. Given their recent success in NLP, one would expect
widespread adoption for problems like time series forecasting and classification. After all, both involve processing
sequential data. However, to this point, research on their adaptation to time series problems has remained limited.
Moreover, while some results are promising, others remain more mixed. In this article, I will review the current literature on
applying transformers, as well as attention more broadly, to time series problems, discuss the current barriers/limitations,
and brainstorm possible solutions to (hopefully) enable these models to achieve the same level of success as in NLP. This
article assumes that you have a basic understanding of soft attention, self-attention, and the transformer architecture. If
you don't, please read one of the linked articles. You can also watch my video from the PyData Orono presentation night.
Attention for time series data: Review
The need to accurately forecast and classify time series data spans just about every industry and long predates
machine learning. For instance, in hospitals you may want to triage patients with the highest mortality risk early on and
forecast patient length of stay; in retail you may want to predict demand and forecast sales; utility companies want to
forecast power usage, and so on.

Despite the successes of deep learning in computer vision, many time series models are still shallow. In industry in
particular, many data scientists still utilize simple autoregressive models instead of deep learning. In some
cases, they may even use models like XGBoost fed with manually crafted time intervals. The common
reasons for choosing these methods remain interpretability, limited data, ease of use, and training cost. While there is no
single solution to address all these issues, deep models with attention provide a compelling case. In many cases, they offer
overall performance improvements (over vanilla LSTMs/RNNs) with the added benefit of interpretability in the form of attention
heat maps. Additionally, in many cases, they are faster than an RNN/LSTM (particularly with some of the techniques
we will discuss).

Several papers have studied using basic and modified attention mechanisms for time series data. LSTNet is one of the first
papers that proposes using an LSTM + attention mechanism for forecasting multivariate time series. Temporal Pattern
Attention for Multivariate Time Series Forecasting by Shun-Yao Shih et al. focused on applying attention specifically
attuned to multivariate data. This mechanism aims to resolve issues such as noisy variables in the multivariate time
series and to introduce a better selection method than a simple average. Specifically,

The attention weights on rows select those variables that are helpful for forecasting. Since the context vector vt is now the weighted
sum of the row vectors containing the information across multiple time steps, it captures temporal information.
Simply speaking, this aims to select the useful information across the various feature time series for predicting the
target time series. First, they apply a 2-D convolution to the row vectors of the RNN's hidden states. This is followed by a
scoring function. Finally, they use a sigmoid activation instead of a softmax, since they expect multiple variables to be
relevant for prediction. The rest follows fairly standard attention practice.

Code for the temporal pattern attention mechanism. Notice that the authors choose to use 32 filters.
Diagram from the paper
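To make the mechanism more concrete, here is a minimal PyTorch sketch of how I read it. This is not the authors' implementation; the shapes are illustrative, and only the 32-filter choice comes from their code.

```python
import torch
import torch.nn as nn

class TemporalPatternAttention(nn.Module):
    """Rough sketch of temporal pattern attention (not the authors' code).
    H holds the RNN hidden states over the lookback window; h_t is the last state."""
    def __init__(self, hidden_size, window, num_filters=32):
        super().__init__()
        self.num_filters = num_filters
        # CNN filters slide along the time axis of each hidden-state row.
        self.conv = nn.Conv1d(hidden_size, hidden_size * num_filters,
                              kernel_size=window, groups=hidden_size)
        self.w_a = nn.Linear(hidden_size, num_filters, bias=False)  # scoring matrix
        self.w_h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w_v = nn.Linear(num_filters, hidden_size, bias=False)

    def forward(self, H, h_t):
        # H: (batch, hidden_size, window), h_t: (batch, hidden_size)
        hc = self.conv(H).view(H.size(0), -1, self.num_filters)  # (batch, hidden, filters)
        scores = (hc * self.w_a(h_t).unsqueeze(1)).sum(-1)       # one score per row
        alpha = torch.sigmoid(scores)                             # sigmoid, not softmax
        v_t = (alpha.unsqueeze(-1) * hc).sum(1)                  # weighted context vector
        return self.w_h(h_t) + self.w_v(v_t)
```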

In terms of results, the model outperforms other methods (in terms of relative absolute error), including a standard auto-
regressive model and LSTNet, on forecasting solar energy, electricity demand, traffic, and exchange rates.

Even though this article doesn't use self-attention, I think this is a really interesting and well-thought-out use of attention. A
lot of time series research seems to focus on univariate data. Moreover, the papers that do study multivariate time
series often solely expand the dimensions of the attention mechanism rather than applying it horizontally across the feature
time series. It might make sense to see if a modified self-attention mechanism could select the relevant source time series
data for predicting the target. The full code for this paper is publicly accessible on GitHub.
What is really going on with self-attention?
Let's first briefly review a couple of specifics of self-attention before we delve into the time series portion. For a more
detailed examination, please see this article on the mathematics of attention or the Illustrated Transformer. For self-attention,
recall that we generally have query, key, and value vectors that are formed via simple matrix multiplication of the embedding
by weight matrices. What a lot of explanatory articles don't mention is that the query, key, and value can often come from
different sources depending on the task, and vary based on whether it is an encoder or decoder layer. For instance, if
the task is machine translation, the query, key, and value vectors in the encoder would come from the source language, but
the query, key, and value vectors in the decoder would come from the target language. In the unsupervised language
modeling case, however, they are all generally formed from the source sequence. Later on we will see that many self-
attention time series models modify how these values are formed.
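As a refresher, a bare-bones single-head version looks roughly like the following; the separate q_in and kv_in arguments are there only to emphasize that the queries and the keys/values need not come from the same source.

```python
import math
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Minimal single-head scaled dot-product attention sketch.
    q_in and kv_in may be the same tensor (self-attention) or different
    tensors (e.g. decoder queries attending over encoder states)."""
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)

    def forward(self, q_in, kv_in, mask=None):
        q, k, v = self.w_q(q_in), self.w_k(kv_in), self.w_v(kv_in)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if mask is not None:                      # e.g. a causal mask for time series
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)   # attention weights (interpretable)
        return weights @ v
```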

Secondly, self-attention generally requires positional encodings, as it has no knowledge of sequence order. It usually
incorporates this positional information via addition to the word or time-step embedding rather than concatenation. This
is somewhat odd, as you would assume that adding positional encodings directly to the word embedding would hurt it.
However, according to this Reddit response, due to the high dimensionality of the word embeddings the positional
encodings approximate orthogonality (i.e. the positional encodings and word embeddings effectively occupy different
subspaces). Moreover, the poster argues that the sine and cosine functions help give nearby words similar positional embeddings.
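For reference, the standard sine/cosine encodings from the original transformer paper can be generated like this; they are simply added to the embeddings, not concatenated.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Standard sine/cosine positional encodings from 'Attention Is All You Need'.
    Assumes d_model is even."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# embeddings: (batch, seq_len, d_model) -> addition, not concatenation
# x = embeddings + sinusoidal_positional_encoding(embeddings.size(1), embeddings.size(2))
```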

But in the end this still leaves a lingering question: wouldn't straightforward concatenation work better in this respect? This
is something that I don't have a direct answer for at the moment. There are, however, some good recent papers on creating
better positional embeddings. Transformer-XL (the basis for XLNet) has its own relative positional embeddings. Also, the
NeurIPS 2019 paper Self-attention with Functional Time Representation Learning examines creating more effective
positional representations through a functional feature map (though the paper is not on arXiv at the moment).

A number of recent studies have analyzed what actually happens in models like BERT. Although geared entirely towards
NLP, these studies can help us understand how to effectively utilize these architectures for time series data, as well as
anticipate possible problems.

In “What Does BERT Look At? An Analysis of BERT’s Attention” the authors analyze the attention of BERT and investigate
linguistic relations. This paper is a great illustration of how self-attention (or any type of attention, really) naturally lends
itself to interpretability, as we can use the attention weights to visualize the relevant parts of focus.

Figure 5 from the paper. This technique of illustrating attention weights is highly useful for interpretability purposes
and cracking open the “black box” of deep learning. Similar methods of analyzing specific attention weights could
show which time steps or time series a model focuses on when predicting.

Also interesting is the fact that the authors find the following:

We find that most heads put little attention on the current token. However, there are heads that specialize to attending heavily on
the next or previous token, especially in earlier layers of the network

Obviously, in time series data, attention heads “attending to the next token” is problematic. Hence, when dealing with time
series we will have to apply some sort of mask. Secondly, it is hard to tell whether this behavior is solely a product of the
language data BERT was trained on or whether it is likely to occur with multi-headed attention more broadly. For forming
language representations, focusing on the closest word makes a lot of sense. However, this is much more variable with time
series data: in certain sequences, causality can come from steps much further back (for instance, for some rivers it can
take 24+ hours for heavy rainfall to raise the river).
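Such a mask is easy to construct: a lower-triangular matrix plugged into the SimpleAttention sketch above blocks every future time step.

```python
import torch

def causal_mask(seq_len):
    """Lower-triangular mask so position t can only attend to positions <= t.
    1 = allowed, 0 = blocked, matching the SimpleAttention sketch above."""
    return torch.tril(torch.ones(seq_len, seq_len))

# Example: with 4 time steps, step 2 (0-indexed) may attend to steps 0-2 but not 3.
# print(causal_mask(4))
```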

Are Sixteen Heads Really Better than One?

In this article, the authors found that pruning several attention heads had a limited effect on performance. Generally,
performance only fell significantly when more than 20% of attention heads were pruned. This is particularly relevant for
time series data, as we are often dealing with long dependencies. Notably, ablating only a single attention head seems to
have almost no impact on score and in some cases even results in better performance.
From p. 6 of the article. Using fewer attention heads may serve as an effective strategy for reducing the computational
burden of self-attention for time series data. There seems to be a substantial amount of overlap of certain heads. In
general it might make sense to train on more data (when available) rather than have more heads.

Visualizing the Geometry of BERT

UMAP visualization of the different semantic sub-spaces of the word “run.” Visualization made using the context atlas
https://storage.googleapis.com/bert-wsd-vis/demo/index.html?#word=run . It would be interesting to create a UMAP
visualization of different time series representations from a large-scale trained transformer model.

This paper explores the geometric structures found within the BERT model. The authors conclude that BERT seems to hold
geometric representations of parse trees internally. They also discover that there are semantically meaningful subspaces within
the larger embedding space. Although this probe is obviously linguistically focused, the main question it raises is: if BERT
learns these linguistically meaningful patterns, would it learn similar temporally relevant patterns? For instance, if we
trained a transformer at scale on time series, what would we discover in the embedding space? Would we see
similar patient trajectories clustered together, or, if we trained on many different streams for flood forecasting, would it
group dam-fed streams with similar release cycles together, etc.? Large-scale training of a transformer on thousands of
different time series could prove insightful and enhance our understanding of the data as well. The authors include two
cool GitHub pages with interactive visualizations that you can use to explore further.

Another fascinating research work that came out of ICLR 2019 was Pay Less Attention with Lightweight and Dynamic
Convolutions. This work investigates why self-attention works and proposes dynamic convolutions as an alternative.
The main advantage of dynamic convolutions is that they are computationally simpler and more parallelizable than self-
attention. The authors find that these dynamic convolutions perform roughly on par with self-attention. The authors also
employ weight sharing, which further reduces the overall number of parameters required. Interestingly, despite the potential speed
improvements, I haven’t seen any time series forecasting research adopt this methodology (at least not yet).
Self-attention for time series
There have been only a few research papers that use self-attention on time series data, with varying degrees of success. If
you know of any additional ones, please let me know. Additionally, huseinzol05 on GitHub has implemented a vanilla
version of Attention Is All You Need for stock forecasting.

Attend and Diagnose leverages self-attention on medical time series data. The time series data is multivariate and contains
information like a patient’s heart rate, SO2, blood pressure, etc.

The architecture for attend and diagnose

Their architecture starts with a 1-D convolution across each clinical factor, which they use to produce preliminary
embeddings. Recall that a 1-D convolution slides a kernel of a fixed length over its input. It is important to note that here
the 1-D convolution is not applied across the time series steps as is typical: if the initial time series contains 100 steps, the
output will still contain 100 steps. Rather, it is applied to create a multi-dimensional representation of each time step. For
more information on 1-D convolutions for time series data, refer to this great article.
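The paper does not give the exact kernel configuration, but one plausible reading is a convolution that maps the clinical measurements at each step to a higher-dimensional embedding while leaving the number of steps untouched. A toy sketch, with all sizes made up:

```python
import torch
import torch.nn as nn

# Hedged reading of the embedding step: map the clinical factors at each time step
# to a d_model-dimensional embedding, leaving the number of time steps unchanged
# (100 steps in -> 100 steps out). Sizes below are purely illustrative.
n_features, d_model, seq_len = 12, 64, 100
embed = nn.Conv1d(in_channels=n_features, out_channels=d_model, kernel_size=1)

x = torch.randn(8, n_features, seq_len)   # (batch, clinical factors, time)
z = embed(x)                               # (8, 64, 100) -- still 100 steps
```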
After the 1-D convolution step the authors then use positional encodings:

The encoding is performed by mapping time step t to the same randomized lookup table during both training and prediction.

This is different from standard self-attention, which uses sine and cosine functions to capture the position of words. The
positional encodings are joined (likely added, although the authors do not indicate exactly how) to each respective output
from the 1-D convolution layer.
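A simple way to realize such a lookup-table encoding (assuming addition, as noted above) is a fixed, randomly initialized embedding table indexed by the time step:

```python
import torch
import torch.nn as nn

# Sketch of a lookup-table positional encoding: each time step t indexes the same
# randomly initialized, fixed table at both training and prediction time, and the
# looked-up vector is added to that step's embedding. Whether the paper adds or
# concatenates is not stated; addition is assumed here.
max_steps, d_model = 512, 64
position_table = nn.Embedding(max_steps, d_model)
position_table.weight.requires_grad_(False)   # fixed randomized lookup table

z = torch.randn(8, 100, d_model)              # embeddings from the 1-D conv step
positions = torch.arange(100)                 # time step indices 0..99
z = z + position_table(positions)             # broadcast over the batch
```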

Next comes the self-attention operation. This is mostly the same as the standard multi-headed attention operation;
however, it has a few subtle differences. First, as mentioned above, since this is time series data the self-attention
mechanism cannot incorporate the entire sequence: it can only incorporate time steps up to the one currently being
considered. To accomplish this, the authors appear to use a masking mechanism that also masks time steps too far in the
past. Unfortunately, the authors are very non-specific about the actual formula; if I had to guess, I would assume it is
roughly analogous to the masking operation shown by the authors of the paper on overcoming the transformer memory
bottleneck (discussed below).

After the multi-headed attention, the transformed embeddings still need additional processing before they are useful.
In standard self-attention we typically have an addition and layer normalization component: the layer normalization
normalizes the sum of the self-attention output and the original embedding (see here for more information). Here,
however, the authors instead choose dense interpolation. This means the embeddings output by the multi-headed
attention module are compressed in a manner that is meant to be useful for capturing syntactic and structural
information.
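Dense interpolation itself is a short algorithm. The version below follows my reading of the original dense-interpolation embedding technique that the paper adapts, so treat it as a sketch rather than the paper's exact procedure: it squeezes the T per-step embeddings into a small, fixed number of vectors, weighting each step by how close it sits to each output slot.

```python
import torch

def dense_interpolation(h, factor):
    """Hedged sketch of dense interpolation: compress T per-step embeddings into
    `factor` vectors, weighting each step by its proximity to each output slot.
    h: (batch, T, d_model) -> (batch, factor, d_model)"""
    batch, T, d_model = h.shape
    out = h.new_zeros(batch, factor, d_model)
    for t in range(1, T + 1):
        s = factor * t / T                      # relative position of step t
        for m in range(1, factor + 1):
            w = (1 - abs(s - m) / factor) ** 2  # closeness weight
            out[:, m - 1] += w * h[:, t - 1]
    return out
```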

After the dense interpolation algorithm, there is a linear layer followed by a softmax, sigmoid, or ReLU layer (depending on
the task). The model itself is multitasking: it aims to predict the length of stay, the diagnosis code, the risk of
decompensation, and mortality.

Altogether, I thought this paper was a good demonstration of using self-attention on multivariate time series data. The results
were state of the art at the time of release; they have since been surpassed by TimeNet. However, that is primarily due
to the effectiveness of transfer-learning-based pretraining rather than the architecture itself. If I had to guess, similar
pretraining with SAND would result in better performance.

My main criticism of this paper is primarily from a reproducibility standpoint, as no code is provided and various
hyperparameters, such as the kernel size, are either not included or only vaguely hinted at. Other concepts, such as the
masking mechanism, are not discussed clearly enough. I’m currently working on reimplementing it in PyTorch and will
post the code here when I’m more confident about its reliability.

Another recent paper that is fairly interesting is “CDSA: Cross-Dimensional Self-Attention for Multivariate, Geo-tagged
Time Series Imputation” by Jiawei Ma et al. This article focuses on imputing (estimating) missing time series values.
Effective data imputation is important for many real-world applications, as sensors often have periods where they
malfunction, causing missing data. This creates problems when trying to forecast or classify, as the missing or null
values will impact the results. The authors set up their model to use a cross-attention mechanism that works by utilizing
data in different dimensions, such as time, location, and measurement.

This is another fascinating example of modifying the standard self-attention mechanism to work across multi-
dimensional data. In particular, as stated above, the value vector is meant to capture contextual information. Oddly, this
is in contrast to what we will see later, where the query and key vectors utilize contextual information but the value
vector does not.
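The exact cross-dimensional coupling is specific to the paper, but the general flavor of attending along different axes of a (location, measurement, time) tensor can be illustrated crudely like this. This is emphatically not the CDSA mechanism itself, just a toy that runs standard attention along each axis and sums the results.

```python
import torch
import torch.nn as nn

# Crude, hedged illustration only: one shared attention module applied along each
# axis of a (location, measurement, time, d_model) tensor, results summed.
d_model, n_loc, n_meas, n_time = 32, 10, 4, 48
attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

x = torch.randn(n_loc, n_meas, n_time, d_model)

def attend_along(x, dim):
    # Move `dim` to the sequence axis, flatten the rest into a batch, attend, restore.
    moved = x.movedim(dim, -2)                         # (..., seq, d_model)
    flat = moved.reshape(-1, moved.size(-2), d_model)
    out, _ = attn(flat, flat, flat)
    return out.reshape(moved.shape).movedim(-2, dim)

fused = attend_along(x, 0) + attend_along(x, 1) + attend_along(x, 2)
```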

The authors evaluate their results on several traffic-forecasting and air-quality datasets, with respect to both forecasting
and imputation. For testing imputation, they discard a certain percentage of the values, attempt to impute them using the
model, and compare the imputed values with the actual ones. Their model outperforms other RNN-based and statistical
imputation methods at all missing-data rates. In terms of forecasting, the model also achieves the best performance.

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting by Shiyang Li et
al.

This is a recent article that will appear at NeurIPS 2019. It focuses on several of the problems with applying the transformer
to time series data. The authors argue that classical self-attention does not fully leverage contextual data. They
argue that this particularly causes problems with dynamic time series that vary with seasonality (for instance,
forecasting sales around the holidays vs. the rest of the year, or forecasting extreme weather patterns). To remedy this, they
introduce a new method of generating the query and key vectors.

We propose convolutional self attention [mechanism] by employing causal convolutions to produce queries and keys in the self
attention layer. Query-key matching aware of local context, e.g. shapes, can help the model achieve lower training error and further
improve its forecasting accuracy.

From p. 3 of the article. Essentially, with other versions of multi-headed attention the query, key, and value vectors are
created from a single time step, whereas a larger kernel size allows the model to create the key and query vectors from
multiple time steps. This lets the model take a greater degree of context into account.
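A sketch of the query/key generation step, as I read it; the kernel size and the decision to leave the values as a pointwise projection are assumptions on my part.

```python
import torch
import torch.nn as nn

class ConvQueryKey(nn.Module):
    """Sketch: produce queries and keys with a causal convolution so each one
    summarizes the local context (shape) around a time step rather than a single
    point. Values keep the usual pointwise projection (my assumption)."""
    def __init__(self, d_model, kernel_size=5):
        super().__init__()
        self.pad = kernel_size - 1                    # left-pad only => causal
        self.q_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.k_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        xc = nn.functional.pad(x.transpose(1, 2), (self.pad, 0))  # pad the past only
        q = self.q_conv(xc).transpose(1, 2)
        k = self.k_conv(xc).transpose(1, 2)
        v = self.v_proj(x)
        return q, k, v                                # feed into standard attention
```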
Part two of the article focuses on solutions to the memory use of the transformer model. Self-attention is very
memory intensive, particularly for very long sequences (specifically it is O(L²)). The authors propose a new
attention mechanism that is O(L(log L)²). With this mechanism, cells can only attend to previous cells with
an exponential step size; so, for instance, cell five would attend to cell four and cell two. They also introduce two variations
of this log attention: local attention and restart attention. See their diagram below for more information.

From page 5 of the article. All these variations have the same overall time complexity, O(L(log L)²); however, the actual
runtime of Local Attention + LogSparse would likely be the longest, since it attends to the most cells.
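For intuition, here is one way to generate such an exponentially spaced index set; the paper's exact construction and its Local/Restart variants differ in the details.

```python
def logsparse_indices(t):
    """One way to pick O(log t) past indices with exponentially growing gaps,
    matching the example above (cell 5 attends to cells 4, 2, 1). The paper's
    exact index set and its Local/Restart variants differ in the details."""
    indices, i = [], t - 1
    while i > 0:
        indices.append(i)
        i //= 2              # step size doubles as we look further back
    return indices

print(logsparse_indices(5))  # [4, 2, 1]
```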

The authors evaluate their approach on several different datasets, including electricity consumption (recorded in 15-minute
intervals), traffic in San Francisco (20-minute intervals), hourly solar power production (from 137 different plants), and
wind data (daily estimates of 28 counties’ wind potential as a percentage of overall power production). Their choice of the ρ-
quantile loss as an evaluation metric is a bit unusual, as normally I’d expect MAE, MAPE, RMSE, or something similar for a
time series forecasting problem.

Equation for the ρ-quantile regression loss.
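For reference, the per-point ρ-quantile (pinball) loss is usually defined as below; the paper aggregates and normalizes it over the forecast horizon, and its exact normalization may differ from what I show here.

$$
P_\rho(y, \hat{y}) =
\begin{cases}
\rho\,(y - \hat{y}) & \text{if } y \ge \hat{y},\\
(1 - \rho)\,(\hat{y} - y) & \text{otherwise.}
\end{cases}
$$

Intuitively, it penalizes under- and over-prediction asymmetrically according to the target quantile ρ.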

I’m still trying to grasp exactly what this metric represents, but at least from the results it appears that a lower score is
better. Using this metric, their convolutional self-attention transformer outperforms DeepAR, DeepState, ARIMA, and other
models. They also conduct an ablation study examining the effect of kernel size when computing a seven-day forecast, and
found that a kernel size of 5 or 6 generally produced the best results.

I think this is a good research article that addresses some of the shortcomings of the transformer as applied to time series
data. I particularly think that the use of a convolutional kernel (of size greater than one) is really useful for time series
problems where you want to capture surrounding context in the key and query vectors. There is currently no code, but
NeurIPS is still more than a month away, so hopefully the authors release it between now and then.
Conclusion and future directions
In conclusion, self-attention and related architectures have led to improvements in several time series forecasting use
cases; however, altogether they have not seen widespread adoption. This likely comes down to several factors, such as
the memory bottleneck, the difficulty of encoding positional information, the focus on pointwise values, and the lack of
research around handling multivariate sequences. Additionally, outside of NLP, many researchers are probably not familiar
with self-attention and its potential. While simple models such as ARIMA make sense for many industry problems, I believe
transformers have a lot to offer as well.

Hopefully, the approaches summarized in this article shed some light on effectively applying transformers to time series
problems. In a subsequent article, I plan to give a practical step-by-step example of forecasting and classifying time
series data with a transformer in PyTorch. Any feedback and/or criticism is welcome in the comments. Please let me
know if I got something incorrect (which is quite possible given the complexity of the topic) and I will update the article.

