
Received: 31 January 2023 Revised: 29 March 2023 Accepted: 13 April 2023

DOI: 10.1111/exsy.13317

ORIGINAL ARTICLE

Is attention all you need for intraday Forex trading?

Przemysław Grądzki | Piotr Wójcik

Faculty of Economic Sciences, University of Warsaw, Warsaw, Poland

Correspondence
Przemysław Grądzki, Faculty of Economic Sciences, University of Warsaw, ul. Dluga 44/50, 00-241, Warsaw, Poland.
Email: p.gradzki@uw.edu.pl

Funding information
COST Action CA19130 – Fintech and Artificial Intelligence in Finance – Toward a transparent financial industry

Abstract
The main objective of this paper is to analyse whether the Transformer neural network, which has become one of the most influential algorithms in Artificial Intelligence over the last few years, exhibits predictive capabilities for high-frequency Forex data. The prediction task is to classify short-term Forex movements for six currency pairs and five different time intervals from 60 to 720 min. We find that the Transformer exhibits high predictive power in the context of intraday Forex trading. This performance is slightly better than for the carefully selected benchmark – ResNet-LSTM, which currently is a state-of-the-art algorithm. Since intraday Forex trading based on deep learning models is largely unexplored, we offer insight on which currency pair and time interval are amenable to devising a profitable trading strategy. We also show that high predictive accuracy can be misleading in real-world trading for short time intervals, as models trained on OHLC data tend to report the highest accuracy when the spread cost is the highest. This renders assessment based on typical machine learning metrics overly optimistic. Therefore, it is critical to backtest frequent intraday Forex trading strategies with realistic cost assumptions, which is rarely the case in the empirical literature. Lastly, sensitivity analysis shows that the length of the time interval used for training does not play a critical role in the Transformer's predictive capabilities, whereas features derived from technical analysis are essential.

KEYWORDS
algorithmic investment strategies, convolutional neural networks, financial forecasting, Forex,
machine learning, ResNet, self-attention, Transformer

1 | INTRODUCTION

In recent years, many articles have been published on forecasting the movements of financial assets through means of deep learning (Hu
et al., 2021; Jiang, 2021; Sezer et al., 2020; Thakkar & Chaudhari, 2021). Despite this influx, many areas still remain unexplored. One of them is
application of the Transformer network in trading. Given its spectacular achievements in different domains focused on NLP, speech recognition,
and computer vision, one might expect it to outperform current state-of-the-art deep learning models for trading. Another limitation in the exis-
ting literature is that researchers most commonly rely on daily data (Jiang, 2021) for the stock market and leave other markets overlooked. Addi-
tionally, daily data may not fully utilize the power of deep learning models that typically shine when provided with a high volume of data.
Decisions about time aggregation used in research will often be based on data availability, and this leads to a void when it comes to the compari-
son of different time intervals and their predictability. Another frequent limitation of existing publications regards the realistic evaluation of finan-
cial performance or lack thereof. Sometimes evaluation is limited only to comparison of machine learning performance metrics like MSE for
regression or accuracy for classification. At other times, the transactional cost assumptions might not hold true in real-life trading. Out of dozens of arti-
cles evaluated by Hu et al. (2021), only four analysed the Forex market and only one considered trading results as a performance metric. This lack


of interest might come as a surprise given that Forex is the world's largest financial market. We found that intraday trading on the Forex market
based on deep learning algorithms and different time frequencies is not well researched, which is one of the reasons behind our article. The main
contribution of this article is the application of the state-of-the-art Transformer network to different currency pairs within different time aggrega-
tions, which is followed by robust analysis of the practical aspect of our findings based on realistic assumptions regarding trading costs, something
which is rarely the case in the empirical literature. The innovative elements of this study are:

• Application of the Transformer network to financial forecasting and the devising of a trading strategy based on its predictions.
• Research of currently state-of-the-art deep learning models for intraday Forex trading to ensure the robustness of comparative models.
• The search for an optimal (from the trading perspective) time aggregation frequency for different currency pairs.
• Thorough analysis of the obtained results that shows how standard evaluation can overestimate the performance of deep learning models for
intraday Forex trading and consequently lead to mistaken conclusions.
• Application of transfer learning between different currency pairs, which, combined with the importance of features derived from technical anal-
ysis discussed in the sensitivity analysis chapter, gives some credence to this method popular today among traders.

The main objective of our research is to analyse if the Transformer network exhibits predictive capabilities in the context of the Forex market
and if so, whether it can outperform other deep learning algorithms commonly used in recent financial literature. Given that intraday Forex trading
based on deep learning is poorly researched, we first establish what is the state-of-the-art benchmark algorithm that will be compared against the
Transformer network. Due to more noise in data of high frequency, as well as the higher impact of transactional costs, we formulate the research
hypothesis that longer intervals (lower frequency of data) should lead to better trading performance. The analysed frequencies range from one
data point each 60 min to one data point each 720 min (12 h). The deep learning algorithms under consideration for the benchmark model are
variants of convolutional neural networks. To the best of our knowledge, analysis of time series frequency in the context of deep learning predic-
tive power for the Forex market has not yet been explored in financial literature. As discussed in the next section, differing variants of CNNs have
showed promising results in certain studies, whereas Transformers have scarcely been explored by researchers.
The next section starts with a review of commonly applied deep learning methods, and offers both a literature review and a description of
the Transformer network. Section 3 describes data and its transformations. Section 4 is dedicated to the comparison of deep learning architec-
tures from their performance standpoint and summarizes the obtained results, whereas section 5 focuses on challenges when interpreting results.
Section 6 offers a commentary on different configuration decisions that were made during the design of the research. Section 7 offers conclusions
on the research.

2 | METHODOLOGY

2.1 | CNNs

Convolutional neural networks (CNNs) are widely used in the field of image recognition with many impressive achievements, ranging from win-
ning the ImageNet competition in 2012 to reaching human level ability in many tasks. Other areas of achievement include object location, human
tracking, activity recognition, face recognition, and emotion recognition (Thakkar & Chaudhari, 2021). Their impressive pattern recognition ability
can be especially useful for analysis of price patterns, something that traders have practiced for decades. The advantage of using deep neural net-
works and CNNs in particular is the automatic feature selection. There is no need for prior definition of ‘overbought’ or ‘oversold’ conditions or
technical patterns like flags or wedges. Provided enough data, those patterns should be uncovered by the model if they indeed contain useful
information. CNNs can be successfully applied to univariate and multivariate time series. Convolution is a mathematical operation that for a given
input (typically an image) returns feature maps by successively applying filters to different regions of the input. For a univariate time-series this
operation would be equivalent to computing a weighted moving average of length equal to the kernel size. In this example, the filter has only one dimension
– time. In the case of a multivariate time series, the filter moves only along the temporal dimension and covers all available features, as presented in
Figure 1. Usage of 1-D convolutions is relatively new compared to 2-D convolutions. As reported by Kiranyaz et al. (2021), 1-D convolutions have
many advantages related to their computational efficiency combined with state-of-the-art performance in many domains. Even though time series
evolution can be stored as an actual image (and in fact, we later present some studies that have implemented such an approach), in practice many
studies, including ours, operate directly on matrices as if they were images. The output of a 1-D convolution layer applied to a multivariate time series
can be expressed similarly to 2-D cross-correlation (Goodfellow et al., 2016):

$$C_{t,i} = f\big((W_k \ast X)_{t,i} + b_k\big) = f\Big(b_k + \sum_{n}\sum_{m} X_{t+n,\,i+m}\, W_{n,m}\Big) \qquad (1)$$

FIGURE 1 Example of a 1-D convolution neural network applied to multivariate time series data.

where:
$\ast$ denotes convolution,
$X$ is the input matrix $\in \mathbb{R}^{s \times m}$,
$W_k$ are trainable weights $\in \mathbb{R}^{n \times m}$ for the $k$-th feature map,
$b_k$ are biases,
$f$ is an activation function.
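To make equation (1) concrete, the following is a minimal NumPy sketch of a single 1-D convolution feature map over a multivariate time series; names and shapes are illustrative, not the implementation used in our experiments.

```python
# Illustrative sketch of equation (1): one 1-D convolution feature map.
import numpy as np

def conv1d_feature_map(X, W_k, b_k, f=lambda z: np.maximum(z, 0.0)):
    """X: (s, m) time series; W_k: (n, m) kernel of the k-th feature map;
    b_k: scalar bias; f: activation (ReLU here). Returns an (s - n + 1,) map."""
    s, m = X.shape
    n = W_k.shape[0]
    out = np.empty(s - n + 1)
    for t in range(s - n + 1):
        # the filter spans n time steps and all m features at once
        out[t] = np.sum(W_k * X[t:t + n, :]) + b_k
    return f(out)

X = np.random.randn(96, 40)                       # 96 steps, 40 features
C = conv1d_feature_map(X, np.random.randn(8, 40), 0.1)
print(C.shape)                                    # (89,)
```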

2.2 | LSTMs

Another frequent approach to time series modelling with neural networks is to use recurrent neural networks (RNNs). They are especially suitable
for the task as they are designed to process sequences. RNNs can retain information from previous points in the sequence that can be used for
current prediction. In fact, one type of RNN – Long Short-Term Memory (LSTM) – is the most widely used algorithm for stock market prediction
according to Hu et al. (2021). In their meta-study reviewing deep learning algorithms published between 2015 and 2020, they concluded that the
most common approach was based on the LSTM algorithm (43%) followed by CNN (21%). Other types of RNNs were grouped together and con-
stituted 6% of research papers. LSTM is specifically designed to retain information over long periods, something that initial RNNs struggle to do in
practice because of vanishing gradients. LSTM achieves this through modifications in RNN cells by adding gates that control the memorizing pro-
cess. Since LSTMs are very well described in the literature (for example Staudemeyer and Morris (2019)), we are not going to present a detailed
explanation here. Although RNNs are typically associated with sequence modelling, it is worth noting that CNN architectures specially designed
for this type of problem show promising results and can outperform RNNs in a variety of sequence modelling problems (van den Oord
et al., 2016).

2.3 | Applications in trading

Before going into the financial literature, we propose to look first into time series forecasting more broadly. Fawaz et al. (2019) asked a question
relevant to our study: ‘what is the current state-of-the-art DNN for TSC [time series classification]’? They tested nine different deep learning
architectures on univariate and multivariate time series coming from different domains (e.g., sensor readings, electrocardiogram, human activity
recognition or sleep stage identification). The univariate study contained 85 time series from the UCR/UEA (2022). The time series had variability
in their length ranging from 24 to 2709 and training set size ranged from 16 to 8926. Interestingly, they found that a small training dataset was

not an impediment to obtaining high accuracy metrics with deep architectures. The number of classes for the target variable ranged from 2 to 60.
Multivariate analysis covered 13 datasets of diverse characteristics with a length from 29 to 1919 and train set size from 16 to 6600. In the case
of univariate series, ResNet (He et al., 2016) significantly outperformed other models; it also performed best for multivariate series, although this
time the results were not statistically significant because the sample of multivariate time series was much smaller. This is why ResNet will be an architecture extensively
studied in our research.
As mentioned, the study of CNNs for financial markets is particularly interesting, as it closely resembles one of the popular frameworks
among practitioners – namely, the chart pattern study also known as technical analysis. Menkhoff and Taylor (2007) reported that usage of tech-
nical analysis is a widespread practice among Forex traders and can lead to profitable results, although its theoretical explanation leaves many
economists puzzled. Chen and He (2018) built a CNN to classify future (10 days ahead) movements in stocks listed in China into two classes: up
and down. They emphasized that for time series prediction through CNN, a convolution operation can be considered as 1-D or 2-D. In the case of
1-D convolution, one of the core dimensions is fixed and moves only along the time axis, as explained at the beginning of this chapter. The net-
work they used was relatively shallow and consisted of 6 layers: input, 2 convolution layers, one pooling layer and two fully connected layers.
They reported that 1-D convolution produced a robust model. However, the model lacked adequate benchmarking to establish whether it is appli-
cable to real life investing. Sezer and Ozbayoglu (2018) applied a CNN model directly to images created out of financial time series. They analysed
daily movements in Dow 30 stocks and a variety of popular ETFs (SP500, Nasdaq100, US sector ETFs: Financial, Utilities, Consumer Staples, Con-
sumer Discretionary, Energy and MSCI for Brazil and Hong Kong). Images were of 15 × 15 shape, representing 15 different intervals for 15 indica-
tors coming from technical analysis. In that aspect, their study is different from ours and many others as it does not directly include time as one of
the dimensions. One pixel in such an image represents the value behind a technical indicator for a given time length. The target variable was
labelled into three categories: buy, hold, and sell by selecting bottom and top points in a sliding window. Based on their evaluation, the proposed
model outperformed benchmarking methods: buy & hold, technical analysis based systems, and other (non-CNN) deep learning models. Transac-
tion cost was assumed in the form of a commission of $1 per transaction with initial capital of $10,000, so 0.01% initially. The model architecture
was as follows: an input layer (15 × 15), two convolutional layers (15 × 15 × 32, 15 × 15 × 64), a max pooling layer (7 × 7 × 64), two dropout layers
(0.25, 0.50), a fully connected layer (128), and an output layer (3). The convolution operation was two-dimensional. Gudelek et al. (2017) built a CNN
model with 2 layers of 2-D convolution followed by max pooling and two fully connected layers of 128 and 64 neurons. Images were created
using daily data for 17 different ETFs (SP500, DJI, 9 US sector indexes and 6 MSCI indexes for different countries), 28 day rolling window and
various technical indicators. They reported high accuracy for next day predictions (above 70%) and investing results outperforming a buy & hold
strategy. It should be noted that when testing a model for stock markets, which exhibit an upward trend, it is not unusual to observe accuracy
substantially above 50%, something that is much harder for Forex markets. Transactional costs were assumed to be fixed at $5 with starting capi-
tal of $10,000. Additionally, they compared the two sides of one of the first choices that a machine learning researcher in finance has to
face – regression or classification. Interestingly, the classification model showed higher accuracy but lower profits. This is most likely due to the
fact that in the trading algorithm they incorporated the magnitude of prediction for the regression model, but did not do the same for classifica-
tion. Splitting the target variable into two classes (buy, sell) turned out to be more favourable than into three classes (buy, hold, sell). As we have
described, using technical indicators is a common approach when applying deep learning for predicting movements of financial assets. However,
Sim et al. (2019) showed that this does not necessarily improve the performance of the models. In fact, CNNs with technical indicators showed
worse performance measured by accuracy than the model relying only on price. It is difficult to assess how transferable these findings are, as it
was tested only on 1-min data for SP500. Actual images of 1-min ticks for the last 30 min were fed to a LeNet-5 (LeCun et al., 1999) type CNN
network to predict movement for the next minute. Zhao and Khushi (2020) compared different algorithms for high frequency (5 min) modelling of
USDJPY. In their approach, the image of the Forex price along with technical indicators first undergoes wavelet denoising, then ResNet is
used for feature extraction, and finally LightGBM is applied to obtain the prediction for the fifth time interval ahead (25 min). They showed that the
proposed solution outperformed other similar algorithms in terms of MAE, MSE, and RMSE. The study however lacked practical evaluation of the
results achieved – and as we show in our study, choosing the time interval has important implications on interpreting results and investing
feasibility.
As presented, CNNs show promising results according to different researchers, even with relatively shallow architectures. Optimal time
aggregation, which is one of the topics of our research, was not studied. Additionally, profitability of CNN-based trading strategies for the Forex
market is unknown. With our research, we aim to address these gaps.

2.4 | Transformers

The introduction of the Transformer network by Vaswani et al. (2017) has sparked great progress in artificial intelligence leading to state-of-the
art performance in tasks such as NLP, speech recognition, and computer vision. Since Transformers can outperform other deep learning models
on sequential data, it is a natural consequence to apply them to time series problems. Over the last few years, researchers have proposed different
approaches. Some examples for time series forecasting include LogTrans (Li et al., 2019), Informer (Zhou et al., 2021), Autoformer (Wu

et al., 2021), FEDformer (Zhou et al., 2022), Pyraformer (Liu et al., 2022). For time series classification, which is a focus of this paper, an example
is GTN (Liu et al., 2021). The contribution of that research is toward improving the computational efficiency of the algorithm and testing different
ways of encoding time series data. Despite the great promise of Transformers in pushing the envelope in time series forecasting, it should be
noted that on some tasks more traditional approaches can still perform better (Zeng et al., 2022). There are several advantages to using the Trans-
former network over previously mentioned deep learning algorithms, the main ones being that its design allows for parallel computing, contrary to
RNNs, and that distant parts of its input sequence can easily affect each other, which allows it to capture long-term dependencies more easily than
in the case of CNNs.
At the core of the Transformer network there is a self-attention mechanism, which operates on three matrices: queries, keys, and values. This
is a concept analogous to information retrieval. The output of an attention block is a representation of the input that focuses on different elements
of the input sequence irrespective of their position. It is computed as a weighted sum of the value matrix. The weights come from a softmax func-
tion applied to the dot product between queries and keys divided by the square root of $d_k$. Therefore, the self-attention mechanism is defined as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \qquad (2)$$
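As an illustration of equation (2), the following is a minimal NumPy sketch of scaled dot-product attention; the shapes (sequence length 96, $d_k = 64$) mirror our setting but the code is purely illustrative.

```python
# Illustrative sketch of equation (2): scaled dot-product attention.
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K: (seq, d_k); V: (seq, d_v). Returns (seq, d_v)."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (seq, seq) attention weights
    return weights @ V                          # weighted sum of values

seq, d_k = 96, 64
Q, K, V = (np.random.randn(seq, d_k) for _ in range(3))
print(attention(Q, K, V).shape)                 # (96, 64)
```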

Another core idea of the Transformer architecture is to use multi-head attention. This performs the above-described attention computation
multiple times in parallel and concatenates their outputs as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \qquad (3)$$

where $\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\big)$,

with $W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$, $W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$ and $W^{O} \in \mathbb{R}^{h d_k \times d_{\mathrm{model}}}$, where $d_{\mathrm{model}}$ is the size of the embedding layer in the original paper, which corresponds to the number of features in our case, $d_k$ is a hyperparameter, and $h$ is the number of heads.
Each of the encoder and decoder blocks also contains a position-wise feed-forward layer, which is equivalent to two convolution layers with a
kernel of size 1 and a ReLU activation function, expressed as:

$$\mathrm{FFN}(x) = \mathrm{ReLU}(x W_1 + b_1)\, W_2 + b_2 \qquad (4)$$

Since the order of a sequence carries essential information and the Transformer network on its own does not have a notion of a sequence,
Vaswani et al. (2017) proposed to add positional encoding to the input, which in their case was an embedding layer. To that end, they used sine
and cosine functions of different frequencies according to the below formula:

 
$$PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d_{\mathrm{model}}}\big) \qquad (5)$$

$$PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d_{\mathrm{model}}}\big) \qquad (6)$$

where pos in the case of a time series corresponds to a step within a time series sequence. Positional encoding needs to have the same dimension
as the input layer as they are summed together. It is worth noting that such encoding is fixed, that is, not trained during neural network training.
The authors experimented also with trainable positional embeddings, but the results did not differ substantially.
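A short sketch of the fixed sinusoidal encoding of equations (5) and (6) follows; it assumes an even $d_{\mathrm{model}}$, which in our setting corresponds to the number of input features.

```python
# Illustrative sketch of equations (5)-(6): fixed sinusoidal positional encoding.
import numpy as np

def positional_encoding(seq_len, d_model):
    """Returns (seq_len, d_model); assumes even d_model."""
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model // 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even indices: sine
    pe[:, 1::2] = np.cos(angles)    # odd indices: cosine
    return pe                       # added element-wise to the input sequence

print(positional_encoding(96, 40).shape)   # (96, 40)
```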
When it comes to the application of Transformers in investing, this area is still largely a novel one. In different meta-studies referred to earlier,
Transformer models are scarcely mentioned. Researchers have explored the attention mechanism in the past; however, the self-attention mechanism
proposed by Vaswani et al. (2017) remains largely unexplored. Wang et al. (2022) showed promising results with the Transformer network outperforming
benchmark methods on univariate daily data for market indices. However, it is not clear whether benchmark methods were optimally selected for
CNN and LSTM networks, which the Transformer network was compared against. Another promising avenue of research is to use the Transformer
network in studies that leverage text data for trading. Liu et al. (2019) proposed a model, which uses the Transformer Encoder architecture to
extract the deep semantic features for 47 stocks from SP500 based on twitter data. Subsequently, output for this layer is fed to a capsule net-
work. However, this model was based entirely on text data and therefore does not share similarities with the subject of our research. Zhang et al.
(2022) combined text and pricing data in one architecture that managed to outperform the model proposed by Liu et al. (2019). However, the
multi-head attention mechanism was applied to text data, whereas we plan to apply it only on prices and price-derived information.

3 | DATA

Empirical data used in this paper comes from histdata.com (histdata, 2022) in 1-min intervals (open, close, high, low). This source allowed us to col-
lect data for a long period – 12 years. However, when fetching data from a brokerage company for such granularity, much shorter periods can
typically be expected to be available. As there might be discrepancies in the data (i.e., depending on the source), we want to create a model that
can later be retrained on shorter data that would be used for real trading. This has consequences for the feature engineering explained later in this
section. The 12 years of data were aggregated into different intervals: 60, 120, 240, 480, and 720 min (12 h). We decided to split the dataset in a
sequential manner for training, validation, and test set in order to prevent overfitting in the validation set, which was present in the case of multi-
step ahead forecasting. In this case, the model can memorize patterns instead of learning generalizable knowledge. To avoid the problem of over-
fitting, we used checkpoints when training and we kept models with the best accuracy on the validation set. The training period ranges from
2010-01-01 to 2019-12-31, validation period from 2020-01-01 to 2020-12-31 and test period from 2021-01-01 to 2021-12-31. First, all models
were trained for one step ahead prediction. Analysis of different prediction horizons is included in section 6. Distribution of target variables for
training dataset is presented in Table 1.
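The aggregation step can be sketched as follows in pandas; this is our illustration of the preprocessing described above (the function and variable names are hypothetical), not the exact code used in the study.

```python
# Illustrative sketch: aggregating 1-min OHLC bars into coarser intervals
# and building the binary target (next bar up vs. down).
import pandas as pd

def aggregate_ohlc(df_1min: pd.DataFrame, minutes: int) -> pd.DataFrame:
    """df_1min: DatetimeIndex with columns open/high/low/close."""
    agg = df_1min.resample(f"{minutes}min").agg(
        {"open": "first", "high": "max", "low": "min", "close": "last"}
    ).dropna()                     # drop empty bins (e.g., weekends)
    # target: 1 ("buy") if the next bar closes higher, else 0 ("sell")
    agg["target"] = (agg["close"].shift(-1) > agg["close"]).astype(int)
    return agg.iloc[:-1]           # last row has no look-ahead close

# bars_60 = aggregate_ohlc(minute_bars, 60)   # minute_bars is a hypothetical frame
```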
Apart from price information, commonly used technical indicators (Murphy, 1999) were applied. The full list of features is presented below:

• Price: open, close, high, low


• Simple, exponential moving averages and standard deviation for close prices of different lengths. Length of moving averages as well as their
usefulness should ideally be tested empirically. This however would be computationally very expensive and it is not the core interest of our
research. For example, Sezer and Ozbayoglu (2018) used 15 different values ranging from 6 to 20. Some researchers and traders tend to use
round numbers like 20, 50, 100, 150, and 200, whereas others claim that using slightly different values, for example 9 instead of 10, gives them ‘an
edge’ in investing by reacting faster than many other traders. We decided to focus on short-term values for two reasons: we predict short-term
movements (next period), and computing long-term averages discards many initial observations. The second reason is important as per our com-
ment at the beginning of this section regarding data availability depending on the source. In the end, we use lengths of 3, 5, 10, 13, and 20.
The general idea behind moving averages is that when a shorter moving average crosses from below a longer-term moving average it is consid-
ered as bullish, and the opposite situation is considered bearish. Additionally moving averages can play support or resistance roles in their rela-
tion to price.
• Stochastic Oscillator
• RSI (relative strength index) for length 14, 10
• MACD (Moving Average Convergence / Divergence)
• Williams %R
• Bollinger bands for 2 standard deviations and period length 5.
• Historical returns between consecutive periods
• Time transformations: sin and cos of hour and weekday. It is a common practice among machine learning practitioners to represent time
through sin and cos transformation to preserve the cyclical nature of time where for example the distance between 11 p.m. and 1 a.m. is the
same as between 1 a.m. and 3 a.m. Transformation is done according to the following formula:

 
$$\mathrm{hour}_{\sin} = \sin\!\left(\frac{2\pi \cdot \mathrm{hour}}{23}\right) \qquad (7)$$

$$\mathrm{hour}_{\cos} = \cos\!\left(\frac{2\pi \cdot \mathrm{hour}}{23}\right) \qquad (8)$$
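A minimal sketch of these transformations follows; the analogous weekday encoding (dividing by 6 for days indexed 0–6) is our assumption.

```python
# Illustrative sketch of equations (7)-(8): cyclical time features.
import numpy as np
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """df must have a DatetimeIndex; adds sin/cos of hour and weekday."""
    hour = df.index.hour
    wday = df.index.weekday
    df["hour_sin"] = np.sin(2 * np.pi * hour / 23)
    df["hour_cos"] = np.cos(2 * np.pi * hour / 23)
    df["wday_sin"] = np.sin(2 * np.pi * wday / 6)   # weekday analogue (assumed)
    df["wday_cos"] = np.cos(2 * np.pi * wday / 6)
    return df

idx = pd.date_range("2021-01-01", periods=5, freq="h")
print(add_time_features(pd.DataFrame(index=idx)))
```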

TABLE 1 Distribution of target variable (share of “buy”) in training dataset.

Interval (min) EURGBP EURCHF EURUSD EURPLN USDJPY USDCHF


60 0.49 0.49 0.49 0.49 0.50 0.50
120 0.49 0.49 0.49 0.49 0.50 0.50
240 0.50 0.49 0.49 0.49 0.50 0.50
480 0.50 0.50 0.49 0.49 0.51 0.50
720 0.49 0.50 0.50 0.48 0.51 0.50

The sequence length that forms one training sample is another parameter that is hard to establish on theoretical grounds, and different values
should ideally be tested. For high-frequency data, Zhang et al. (2019) used the 100 most recent order book updates, whereas Zhao and Khushi (2020), also
for high-frequency data, used 30. We decided to start with a longer window of 96, as some technical patterns might require more time to play out.
A window of 96 is equivalent to 4 days at 60 min, 8 days at 120 min, and so on. Since the number of features was 40, the matrix going into the model has a
shape of 96 × 40. All features before entering the model are standardized by removing the mean and scaling to unit variance. This computation is
done on the training dataset and then applied to the validation and test sets with the same parameters.
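The scaling and windowing step can be sketched as follows; the placeholder arrays are illustrative, and the key point is that the scaler is fit on the training split only.

```python
# Illustrative sketch: standardization fit on the training split, then
# slicing overlapping sequences of length 96.
import numpy as np
from sklearn.preprocessing import StandardScaler

def make_windows(features, targets, seq_len=96):
    """Returns X: (samples, seq_len, n_features) and the aligned labels."""
    X = np.stack([features[i:i + seq_len]
                  for i in range(len(features) - seq_len)])
    y = targets[seq_len:]          # label of the bar following each window
    return X, y

train_feats = np.random.randn(5000, 40)      # placeholders for real features
valid_feats = np.random.randn(1000, 40)
scaler = StandardScaler().fit(train_feats)   # fit on the training split only
X_train, y_train = make_windows(scaler.transform(train_feats),
                                np.random.randint(0, 2, 5000))
X_valid, y_valid = make_windows(scaler.transform(valid_feats),
                                np.random.randint(0, 2, 1000))
print(X_train.shape, y_train.shape)          # (4904, 96, 40) (4904,)
```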
As the data comes without a spread, we assume the following values (in pips) to incorporate as transactional cost when analysing the results
from the point of view of financial performance – EURCHF: 2, USDCHF: 2, EURGBP: 2, EURUSD: 1.5, USDJPY: 1.5, EURPLN: 30. Those values
come from the analysis of offerings of different contract for difference (CFD) providers. For example, Oanda (2022) publishes average spreads on
their website. According to this data, we could assume 1.5 for all major currencies. However, we decided to be more conservative and assumed
1.5 only for the two most traded currency pairs: EURUSD and USDJPY. According to the Bank of International Settlements survey (2019), these
currencies are responsible for 24% and 13.2% of the volume respectively, whereas other pairs have low single digit values. Swap cost is not
directly included in the calculations. We think this is a reasonable simplification, as the spread is the higher of the two costs, and with frequent
intraday trading positions are often closed before overnight swap costs are charged. As of the moment of writing (May 2022), the swap cost is
significantly lower than the spread for all currency pairs.
In order to do a feasibility check of trading via different time intervals, we analysed what is the fraction of movements that are smaller than
the spreads assumed above (Table 2). This summary supports our research hypothesis that trading based on high frequency data will be less prof-
itable than for more aggregated data and very likely not profitable at all. Based on this data we excluded even more frequent aggregations from
our analysis. Additionally, we do not expect profitable results for less common currencies characterized by high spread and represented in our
study by EURPLN. One interesting consideration might be to model this as a classification task in which price movements within the transactional
cost constitute a separate class. However, in this article we decided to treat it as a binary classification task with two classes: “buy” and “sell”.
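The feasibility check behind Table 2 amounts to a one-line computation, sketched below with an example pip size; the values and names are illustrative.

```python
# Illustrative sketch: share of bar-to-bar moves smaller than the assumed spread.
import numpy as np
import pandas as pd

def share_below_spread(close: pd.Series, spread_pips: float,
                       pip: float = 0.0001) -> float:
    """Fraction of absolute close-to-close moves below the spread."""
    moves = close.diff().abs().dropna()
    return float((moves < spread_pips * pip).mean())

close = pd.Series(1.08 + np.random.randn(10_000).cumsum() * 0.0003)
print(share_below_spread(close, spread_pips=1.5))   # e.g. EURUSD assumption
```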

4 | EMPIRICAL RESULTS

4.1 | Deep learning architectures selected for Forex price prediction model

Our research framework consists of several steps. In the first one, we compare the Transformer network against various benchmark deep learn-
ing models. Next, the given Transformer network is analysed from the trading perspective. This is followed by additional discussion about the
results, with a special focus on trading performance and its practical usability. Lastly, we discuss some of the parameters of the study and how
they might impact the results for the best performing currency pair. The considered parameters are sequence length and forecasting period, that
is, forecasting one step ahead versus more long-term forecast. Additionally, we also test a model without technical indicators. The algorithm that
constitutes the best benchmark for the Transformer network is a residual network. This was also a type of benchmark selected by Dosovitskiy
et al. (2021) for their Vision Transformer. Residual networks were introduced by the Microsoft Research Group (He et al., 2016). The idea behind
residual networks is to allow for building deeper networks while controlling the complexity of the network by means of a shortcut connection. A
shortcut connection adds a linear connection between the input of a residual block (see Figure 2) and its output. The Microsoft Research Group
showed that this is an effective way of tackling the problem of degradation when deeper networks return higher training errors than their shallow
counterparts. The exact architecture that was used is presented in Figure 2. Each residual block consists of three convolutional layers with the
first convolution of filter size equal to 8, second equal to 5 and third equal to 3. In He et al. (2016) the residual block consisted of two convolution
layers (3 × 3, 3 × 3), however later implementations, for example, ResNet-50, use three convolution layers (1 × 1, 3 × 3, 1 × 1). Since most papers
refer to 2-D convolutions, there is no established standard for residual networks for 1-D convolutions. Because of the good results reported by
Fawaz et al. (2019) outside the finance domain we decided to follow the same construction of the residual block for 1-D convolutions. Each con-
volution layer uses ReLU activation function and is followed by batch normalization. The first residual block has 32 feature maps whereas the

TABLE 2 Fraction of movements of magnitude lower than transactional cost.

Interval (min) EURGBP (%) EURCHF (%) EURUSD (%) EURPLN (%) USDJPY (%) USDCHF (%)
60 31.5 38.1 17.3 72.3 18.5 26.7
120 22.9 29.7 12.3 61.9 13.2 19.3
240 16.4 22.4 8.4 50.3 9.0 13.4
480 10.5 16.2 5.3 27.5 5.9 8.8
720 7.7 13.3 4.1 29.3 4.6 6.5

F I G U R E 2 ResNet architecture used in our research. Residual blocks are represented by orange, green, and blue colours. All convolution
layers are 1D.

other two have 64 feature maps. The last residual block is followed by a global average pooling layer and a fully connected layer with two units
representing two classes: the price going up or down in the next period. We use padding which preserves input shape for all convolutions. In total,
the model has 138,434 trainable parameters.
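A minimal Keras sketch of the residual block and network described above follows (the framework choice and any layer-ordering details beyond the description are our assumptions for illustration, not published code).

```python
# Illustrative Keras sketch of the 1-D ResNet described above:
# three blocks with kernel sizes 8/5/3, 'same' padding, and a linear shortcut.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = layers.Conv1D(filters, 1, padding="same")(x)  # match channels
    for k in (8, 5, 3):                                      # reported kernel sizes
        x = layers.Conv1D(filters, k, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    return layers.Add()([shortcut, x])                       # shortcut connection

inp = layers.Input(shape=(96, 40))          # sequence length x features
x = residual_block(inp, 32)                 # first block: 32 feature maps
x = residual_block(x, 64)                   # remaining blocks: 64 feature maps
x = residual_block(x, 64)
x = layers.GlobalAveragePooling1D()(x)      # ResNet_LSTM swaps this for LSTM(64)
out = layers.Dense(2, activation="softmax")(x)
model = tf.keras.Model(inp, out)
```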
ResNet_LSTM takes advantage of one of the potential enhancements of CNNs for time series problems, which is to combine them with types of
recurrent network layers like LSTM (Lu et al., 2020; Zhang et al., 2019). Tsantekidis et al. (2020) show that a combination of CNN and LSTM out-
performs those models if they are considered separately, however in the case of the Forex market the difference is not large. Additionally, they
use different CNN architecture without residual connections. In our case, we replace the global average pooling layer with an LSTM layer with
64 units. Since this operation adds many new parameters to the model and in the experimental setup it proved to be prone to overfitting,
we added dropout at the end of each residual block, resulting in 171,458 trainable parameters in total. Another benchmark model that was tested
is the inception network. The inception architecture was introduced by Szegedy et al. (2015). Its main idea is to have multiple convolutions with different
filter sizes operating at the same level instead of stacking convolutional layers sequentially. We have experimented with different building blocks
of the inception module, however none of them yielded results similar to ResNet. Therefore, we chose ResNet to be the benchmark model and
we skip inception results.
Standard metrics for binary classification are presented for comparison: accuracy and area under the ROC curve (AUROC) (Table 3). The Trans-
former has the highest accuracy 16 times and AUROC 13 times. ResNet_LSTM achieved the highest accuracy 13 times and AUROC 16 times. In
that sense, results between them are comparable. However, when it comes to the practical aspect, the Transformer presents a better opportunity
for trading, something explained in more detail in the next chapter. For now, we note that it performs significantly better for EURCHF at frequencies
of 480 and 720 min, which are among the best combinations of currency and time interval for trading. As we explain in the next chapter, longer
intervals are more reliable for trading. Additionally, the Transformer offers the best performance for the same intervals for USDJPY and USDCHF,
two currencies for which it is possible to devise a profitable strategy, too. The combination for which ResNet_LSTM outperformed the Trans-
former and which offers positive returns is EURUSD for intervals 480 and 720 min. ResNet achieved the highest accuracy 11 times and AUROC
10 times and, from the important combinations, offers a good rate of return for EURGBP for 720 min. ResNet_LSTM has slightly outperformed plain
ResNet, which is in line with Tsantekidis et al. (2020). Based on the above, we conclude that the Transformer network offers good predictive
power for intraday Forex trading, and we will proceed to discuss the profitability of a trading strategy based on this model. A summary of all
hyperparameters used to train the Transformer applied in our study can be found in Table 4, along with graphical representation of the model on
Figure 3.

4.2 | Trading results

All presented results are computed on a test set that is held out from training and encompasses the whole of 2021. The standard accuracy met-
ric is used to assess whether the models have any predictive power. A value significantly deviating from 0.5 should be a necessary but not sufficient condi-
tion for a profitable trading strategy and serves as a simple benchmark. This is because the share of up and down movements for high frequency
Forex data is generally close to 50% over a long period as presented in Table 1. Accuracy is presented for different probability thresholds, as it
might be more profitable to trade when the probability of a certain move is higher and to stay out of the market when uncertainty is high. To eval-
uate financial performance, a simple trading strategy was implemented. Trading results (rows labelled “Net results” in Tables 5a, 5b, and 5c) rep-
resent the value of capital at the end of the investment period, assuming 100 at the beginning. No leverage is assumed. For example, a threshold of
0.45–0.55 means that for probabilities below 0.45 a short position is taken and above 0.55 a long position. Between 0.45 and 0.55, no position is
taken and if a transaction was open previously, it will be closed. Transactional cost is incorporated as explained in section 3. %time shows what
fraction of time either long or short position is open (1.00 = 100%).
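The mechanics can be sketched as follows; this simplified version charges the spread once per position change and ignores swap costs, so it illustrates the logic rather than reproducing our exact backtesting code.

```python
# Illustrative sketch of the threshold strategy: long above the upper
# threshold, short below the lower one, flat in between.
import numpy as np

def backtest(prob_up, close, lo=0.45, hi=0.55, spread=0.0002, capital=100.0):
    position = 0                                 # +1 long, -1 short, 0 flat
    for p, c_prev, c in zip(prob_up[:-1], close[:-1], close[1:]):
        new_pos = 1 if p > hi else (-1 if p < lo else 0)
        if new_pos != position and new_pos != 0:
            capital *= 1 - spread / c_prev       # pay spread on entry/flip
        position = new_pos
        capital *= 1 + position * (c - c_prev) / c_prev   # mark to market
    return capital

probs = np.random.rand(500)
prices = 1.05 + np.random.randn(500).cumsum() * 0.0005
print(backtest(probs, prices))
```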

TABLE 3 Performance metrics for different neural network architectures.

Transformer ResNet_LSTM ResNet

Interval (min) Accuracy AUROC Accuracy AUROC Accuracy AUROC


EURCHF 60 0.55 0.56 0.54 0.54 0.54 0.54
120 0.55 0.56 0.57 0.57 0.57 0.57
240 0.52 0.52 0.53 0.54 0.52 0.52
480 0.56 0.56 0.54 0.53 0.54 0.54
720 0.60 0.60 0.54 0.55 0.58 0.58
USDJPY 60 0.51 0.51 0.51 0.52 0.50 0.51
120 0.50 0.52 0.50 0.50 0.50 0.51
240 0.52 0.52 0.50 0.51 0.51 0.50
480 0.57 0.56 0.49 0.49 0.49 0.51
720 0.53 0.53 0.51 0.53 0.48 0.48
EURUSD 60 0.53 0.53 0.52 0.53 0.51 0.52
120 0.51 0.51 0.52 0.52 0.51 0.52
240 0.50 0.51 0.51 0.52 0.51 0.52
480 0.52 0.52 0.54 0.53 0.52 0.50
720 0.49 0.49 0.51 0.51 0.49 0.50
EURGBP 60 0.52 0.53 0.53 0.53 0.52 0.52
120 0.52 0.53 0.52 0.53 0.54 0.54
240 0.53 0.50 0.52 0.53 0.53 0.52
480 0.54 0.51 0.51 0.52 0.53 0.50
720 0.52 0.51 0.48 0.48 0.57 0.57
EURPLN 60 0.55 0.55 0.55 0.56 0.54 0.55
120 0.56 0.57 0.57 0.58 0.55 0.57
240 0.54 0.55 0.53 0.54 0.58 0.59
480 0.55 0.54 0.55 0.54 0.53 0.53
720 0.57 0.56 0.51 0.50 0.54 0.53
USDCHF 60 0.52 0.52 0.53 0.53 0.54 0.54
120 0.53 0.54 0.53 0.54 0.55 0.55
240 0.49 0.51 0.51 0.52 0.52 0.53
480 0.53 0.54 0.53 0.54 0.53 0.54
720 0.59 0.60 0.58 0.58 0.56 0.56

Note: The best result in each row is in bold.

Generally, the accuracy of predictions improves when looking at more strict probability thresholds, which is a good sign that when the model
is ‘more certain’ the discriminatory power increases. Based on those results, we conclude that it should be possible to create a profitable intraday
trading strategy based on the Transformer model for EURCHF, USDJPY, EURGBP, and USDCHF. For EURUSD, the model does not perform well,
which can be explained by the fact that it is the most traded and hence the most efficient currency pair. For EURPLN, the model performs well,
but even with high accuracy it is difficult to compensate for high transactional cost. Theoretically, it shows a positive return for 120 min, however
it should be noted that in reality its spread can be very volatile and therefore this result should be taken with caution.
In Figure 4 we present the evolution of equity curves of a portfolio for two well performing combinations of currencies and intervals. For
EURCHF 60 min the threshold is 0.35–0.65 and for 720 min the threshold is 0.5–0.5. For USDCHF the thresholds are 0.4–0.6 and 0.5–0.5,
respectively. The subfigure (a) shows that the level of transactional cost has a great impact on the obtained results. With no cost, the result is
110.7, with 2 pips it is 105.2, and with 4 pips the strategy generates a loss, ending at 99.9. This raises the question: how reliable are such backtested results
given that the actual market spread can be volatile? Subfigure (b) shows better results but an analogous conclusion for the lowest frequency of trading
(720 min). This is surprising, as for less frequent trading one might expect the spread impact to be less visible in the results. Subfigures (c) and (d) for
USDCHF lead to a very similar conclusion as for EURCHF.

TABLE 4 Hyperparameters of the Transformer model.

Hyperparameter Value
Number of heads 4
Number of self attention blocks 4
Size of query and key in each attention block 64
Number of neurons in dense layer 128
Activation function in dense layer ReLU
Dropout^a 0.25
Optimizer Adam
Learning rate 0.001
Batch size 64
Sequence length 96
Training period 2010-01-01 to 2019-12-31
Validation period 2020-01-01 to 2020-12-31
Test period 2021-01-01 to 2021-12-31
^a The same value applied in the attention block and dense layer.

FIGURE 3 Transformer network used in our study.
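The following is a minimal Keras sketch reconstructed from the hyperparameters in Table 4 and the layout in Figure 3; any details beyond those reported (e.g., the exact placement of normalization) are assumptions, and the trainable positional embedding mentioned in section 7 is omitted for brevity.

```python
# Illustrative Keras sketch of the Transformer classifier per Table 4:
# four encoder blocks, 4 heads, key size 64, 128-unit ReLU feed-forward,
# dropout 0.25, over 96x40 inputs.
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, heads=4, key_dim=64, ff_units=128, dropout=0.25):
    att = layers.MultiHeadAttention(num_heads=heads, key_dim=key_dim,
                                    dropout=dropout)(x, x)      # self-attention
    x = layers.LayerNormalization()(layers.Add()([x, att]))
    ff = layers.Dense(ff_units, activation="relu")(x)
    ff = layers.Dropout(dropout)(ff)
    ff = layers.Dense(x.shape[-1])(ff)          # project back to model width
    return layers.LayerNormalization()(layers.Add()([x, ff]))

inp = layers.Input(shape=(96, 40))
x = inp    # a trainable positional embedding added here is omitted for brevity
for _ in range(4):                              # four encoder blocks
    x = encoder_block(x)
x = layers.GlobalAveragePooling1D()(x)
out = layers.Dense(2, activation="softmax")(x)
model = tf.keras.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```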

Although the obtained trading results might appear small when compared to some benchmarks (like historical risk-free rate of return or buy
and hold strategy for different assets like SP500), it is worth remembering that the Forex market is characterized by lower volatility and that trad-
ing typically happens with high leverage, thus the obtained results would have to be scaled accordingly to reflect that.
In subfigure (a) of Figure 5 we present an example of a 720 min sequence in the test period, between 2021-03-01 17:00 and 2021-04-30 05:00,
for which the model correctly predicted a drop within the next 12 h with a probability of 58%. In this case it is another drop after a series of

TABLE 5a Trading performance of Transformer model for EURCHF and USDJPY on the test set (entire 2021).

EURCHF USDJPY

Threshold Interval 60 120 240 480 720 60 120 240 480 720
0.5–0.5 accuracy 0.55 0.55 0.52 0.56 0.60 0.51 0.50 0.52 0.57 0.53
%time 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Net results 98.2 98.1 91.9 99.9 107.8 81.6 88.1 94.4 105.9 104.0
0.45–0.55 accuracy 0.63 0.64 0.54 0.57 0.72 0.56 0.53 0.53 0.64 0.59
%time 0.27 0.36 0.49 0.63 0.21 0.26 0.21 0.14 0.14 0.38
Net results 101.1 106.5 94.5 99.7 104.0 94.4 95.8 98.8 100.8 102.0
0.4–0.6 accuracy 0.71 0.69 0.56 0.60 0.77 0.60 0.63 0.54 0.90 0.68
%time 0.11 0.21 0.23 0.41 0.07 0.10 0.07 0.03 0.01 0.11
Net results 105.0 105.1 97.1 100.7 101.6 100.9 100.9 99.9 101.2 102.4
0.35–0.65 accuracy 0.79 0.79 0.65 0.62 0.69 0.67 0.74 0.67 0.00 0.78
%time 0.06 0.13 0.10 0.23 0.02 0.04 0.02 0.00 0.00 0.02
Net results 105.2 105.3 100.7 100.0 100.4 100.8 101.2 100.4 99.9 100.5
0.3–0.7 accuracy 0.84 0.83 0.71 0.66 – 0.69 0.82 – – –
%time 0.03 0.07 0.04 0.13 – 0.02 0.01 – – –
Net results 102.9 103.1 100.5 101.2 – 100.3 100.5 – – –
0.25–0.75 accuracy 0.88 0.91 0.68 0.74 – 0.80 0.89 – – –
%time 0.01 0.03 0.02 0.06 – 0.01 0.00 – – –
Net results 101.1 101.8 100.0 101.4 – 100.5 100.4 – – –

Note: the best result in each column is in bold. A missing value means there was no probability of a given magnitude; %time shows the fraction of time
when either long or short position was open (1.00 = 100%). Threshold represents the probability range for which the decision is to stay out of the market.

TABLE 5b Trading performance of Transformer model for EURUSD and EURGBP on the test set (entire 2021).

EURUSD EURGBP

Threshold Interval 60 120 240 480 720 60 120 240 480 720
0.5–0.5 accuracy 0.53 0.51 0.50 0.52 0.49 0.52 0.52 0.53 0.54 0.52
%time 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Net results 91.1 92.3 93.8 95.3 96.0 71.4 78.6 106.6 104.5 97.4
0.45–0.55 accuracy 0.57 0.52 0.25 0.51 0.53 0.61 0.57 – 0.59 0.56
%time 0.19 0.26 0.01 0.17 0.22 0.18 0.37 – 0.02 0.48
Net results 94.9 94.1 99.1 100.3 100.0 93.4 89.5 – 100.2 101.6
0.4–0.6 accuracy 0.64 0.53 0.00 0.57 0.25 0.70 0.68 – 0.38 0.57
%time 0.05 0.07 0.00 0.01 0.01 0.08 0.16 – 0.01 0.26
Net results 99.6 97.7 100.0 100.2 99.7 99.4 98.7 – 99.4 101.6
0.35–0.65 accuracy 0.68 0.54 0.00 0.00 0.00 0.84 0.72 – 0.33 0.53
%time 0.01 0.03 0.00 0.00 0.00 0.04 0.08 – 0.01 0.11
Net results 100.0 99.0 100.0 100.0 99.9 101.8 99.4 – 99.3 100.4
0.3–0.7 accuracy 0.60 0.50 – – – 0.87 0.76 – 0.25 0.53
%time 0.00 0.01 – – – 0.01 0.04 – 0.01 0.03
Net results 99.9 99.7 – – – 100.9 99.8 – 99.5 100.3
0.25–0.75 accuracy 0.75 – – – – 0.86 0.84 – 0.50 0.40
%time 0.00 – – – – 0.00 0.02 – 0.00 0.01
Net results 100.0 – – – – 100.2 100.6 – 99.6 99.7

Note: the best result in each column is in bold. A missing value means there was no probability of a given magnitude; %time shows the fraction of time
when either long or short position was open (1.00 = 100%). Threshold represents the probability range for which the decision is to stay out of the market.

TABLE 5c Trading performance of Transformer model for EURPLN and USDCHF on the test set (entire 2021).

EURPLN USDCHF

Threshold Interval 60 120 240 480 720 60 120 240 480 720
0.5–0.5 accuracy 0.55 0.56 0.54 0.55 0.57 0.52 0.53 0.49 0.54 0.60
%time 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Net results 54.1 86.4 88.8 95.7 96.4 70.5 87.5 91.2 99.4 112.7
0.45–0.55 accuracy 0.64 0.69 0.57 0.58 0.60 0.68 0.59 0.51 0.81 0.74
%time 0.20 0.22 0.44 0.46 0.68 0.14 0.22 0.13 0.09 0.25
Net results 81.6 93.4 87.8 93.5 99.1 105.1 96.5 96.4 103.1 104.7
0.4–0.6 accuracy 0.75 0.87 0.64 0.68 0.66 0.77 0.70 0.60 0.82 0.85
%time 0.04 0.02 0.12 0.11 0.33 0.06 0.06 0.03 0.01 0.09
Net results 98.0 101.0 97.3 99.2 98.9 106.7 101.9 100.1 100.3 104.1
0.35–0.65 accuracy 0.82 1.00 0.73 0.67 0.62 0.79 0.78 0.81 1.00 –
%time 0.01 0.00 0.01 0.02 0.10 0.03 0.02 0.01 0.00 –
Net results 99.8 100.2 99.9 99.0 99.9 103.7 101.7 100.6 100.1 –
0.3–0.7 accuracy 0.84 1.00 0.67 – 0.88 0.79 0.80 0.80 – –
%time 0.00 0.00 0.00 – 0.01 0.01 0.01 0.00 – –
Net results 99.7 100.0 100.2 – 100.6 101.6 100.8 100.1 – –
0.25–0.75 accuracy – – 1.00 – – 0.73 0.67 0.67 – –
%time – – 0.00 – – 0.00 0.00 0.00 – –
Net results – – 100.1 – – 100.0 100.0 99.9 – –

Note: the best result in each column is in bold. A missing value means there was no probability of a given magnitude; %time shows the fraction of time
when either long or short position was open (1.00 = 100%). Threshold represents the probability range for which the decision is to stay out of the market.

previous declines and the model tells us that most likely the short-term bottom is not in. Attention score heatmaps on the right correspond to the
first multi-head attention layer. We can see that the first two heads focus on a relatively low number of time points to extract features useful for
classification. The bar chart on the left represents average attention scores from the heatmaps for ease of analysis. Based on this, we can see that the
first two and last heads attend to bottoms and drops in the chart with different strengths. For head number 3 there is no clear pattern and distri-
bution is more uniform. In subfigure (b) we see the same graphs but for a different period – between 2021-08-10 05:00 and 2021-10-10 17:00.
This time the model correctly predicted an upward movement with a probability of 60.8%. Similarly to the previous case, we see that heads 1, 2, and
4 attend to bottoms and drops in the chart. This time however the model is able to interpret this context as a reason for the price to go up.
Concerning our research objective, we note the following intervals giving the highest trading results for the analysed currency pairs: EURGBP
– 240 min, EURCHF – 720 min, EURUSD – not available, EURPLN – 120 min, USDJPY – 480 min, USDCHF – 720 min.

5 | MISLEADING PREDICTIVE ACCURACY

Research presented in empirical papers often stops at this point, reporting the predictive capabilities of deep learning models, given that they
increase the chances of correctly predicting the direction of financial asset movement over a random prediction or benchmark model (Di Persio &
Honchar, 2016; Lu et al., 2020; Tsantekidis et al., 2020; Zhao & Khushi, 2020). The following analysis shows that accuracy metrics and backtest
results can be misleading in the context of the Forex market. To demonstrate this point, we focus on an example of EURCHF, 1 h interval and 0.5
threshold. Subfigure (a) in Figure 6 shows that the vast majority of correct predictions occur within just three hourly intervals: 21:00–22:00,
22:00–23:00, and 23:00–00:00. Moreover, we show that predictions resulting from the model are no better than naïve predictions. Naïve predic-
tions were created based on an asymmetric distribution of difference in close prices between consecutive hours. Concretely, it is the mode of the
direction of price changes in a particular time interval in the training set. We conclude that in this case the model has predominantly learnt to
detect those simple patterns, as accuracy excluding the three above-mentioned intervals drops to 52%. The persistence of such patterns despite market
efficiency can be explained by the fact that spreads are much wider in these time intervals, as presented in Figure 7, which most probably makes
strategies trying to exploit these patterns unsuccessful. However, this hypothesis should be verified using actual spreads, which is one of the con-
clusions of this research – correct and robust assumptions about transactional costs are critical for evaluation of intraday trading strategy. As
data becomes more aggregated, these patterns disappear and for example for a 720 min interval such a naïve strategy would produce accuracy

FIGURE 4 Equity curves of the trading strategy for EURCHF (panels a and b) and USDCHF (panels c and d) for 60 min (left) and 720 min (right) intervals.

below 50%, significantly lower than the one based on the signals resulting from the model (subfigure (b) in Figure 6). The importance of the
above-mentioned time intervals relates to the fact that at 10 p.m. UTC the New York stock exchange session is closing and then until midnight
only the Sydney stock exchange is open with significantly reduced trading volume.
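The naïve benchmark described above can be sketched as follows; the names are illustrative and the exact implementation may differ.

```python
# Illustrative sketch of the naive benchmark: for each hourly interval,
# predict the modal direction of the close-to-close change in the training set.
import numpy as np
import pandas as pd

def naive_hourly_predictions(train_close: pd.Series, test_index):
    direction = np.sign(train_close.diff()).dropna()          # +1 / -1 / 0
    mode_by_hour = direction.groupby(direction.index.hour) \
                            .agg(lambda s: s.mode().iloc[0])  # modal direction
    return pd.Series(test_index.hour, index=test_index).map(mode_by_hour)

idx = pd.date_range("2010-01-01", "2019-12-31", freq="h")
train_close = pd.Series(np.random.randn(len(idx)).cumsum(), index=idx)
test_idx = pd.date_range("2021-01-01", "2021-12-31", freq="h")
print(naive_hourly_predictions(train_close, test_idx).head())
```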
Data on EURCHF spread presented in Figure 7 was collected in 20 s intervals for the CFD instrument from one of the leading European bro-
kers. It shows that the spread grows rapidly before 10 p.m. and decreases only after 11 p.m., thereby impeding trading in those hours. Data from a dif-
ferent broker – Oanda – confirms the cyclicality of this spread spike at night. For the remaining part of the day the average spread stays
below 2 pips with occasional spikes.
To illustrate the impact of this phenomenon on trading performance, we present in Table 6 once again the results obtained in the previous
section compared with the strategy that does not trade in the above mentioned time intervals. We can see that for 60 and 120 min this leads to
inferior results for all probability thresholds. Interestingly, for 60 min our Transformer model is still able to achieve accuracy higher than 50%,
however it is challenging to obtain profits with this relatively simple trading strategy.
The patterns described in this section show different strength depending on the currency pair. For USDJPY and 60 min interval with thresh-
olds 0.45–0.55, the Transformer model achieved 56% accuracy whereas a naïve method yields 50.8% accuracy; as Figure 8 shows, the model's advantage
is distributed across different hourly intervals, without as strong a pattern as for EURCHF. Hence, the performance of the deep learning
model for each currency pair and data interval has to be analysed carefully before drawing conclusions. Based on those findings, we conclude that
most likely there is no profitable interval for EURPLN, as the 120 min interval selected in the previous chapter suffers from the problem described in this
chapter. As for other currencies, the best intervals were longer than that, hence we maintain our conclusions for them.

6 | SENSITIVITY ANALYSIS – LENGTH OF THE TRAINING SAMPLE, FORECAST HORIZON, AND ADDITIONAL PREDICTORS

During the design of the research, we had to make a set of assumptions that ideally should be inferred from data. In this section we present the
results when changing one of the key assumptions at a time for the currency pair that proved to be well suited for modelling – EURCHF. The assump-
tions analysed in this section refer to the length of the training period, prediction horizon, and the importance of features other than the price. All


F I G U R E 5 EURCHF price for a selected time window along with average self-attention scores (left) and self-attention scores heatmap (right)
for all 4 heads in the first multi-head attention layer. (a) Correctly predicted downward movement; (b) Correctly predicted upward movement.

FIGURE 6 Accuracy by hour for EURCHF: (a) 60 min interval; (b) 720 min interval.

FIGURE 7 Spread for EURCHF at night (London time).

those considerations refer to the Transformer model described in previous sections. In order to remain concise, we decided to offer detailed data
upon request and present here only the key findings.
For the length of the input sequence forming one training sample, we tested three different values: 24, 48, and 96 observations. Overall, there is no single value that would
consistently outperform the others and the results are similar between different lengths. As a relative measure we counted how often a given
length showed the best results. We excluded any accuracy metric for thresholds other than 0.5–0.5 from this comparison as different values of %
time make them not directly comparable. Instead, we focused on trading results for this group that are comparable. The length of 24 won 14 times,
48 won 17 times, and 96 won 13 times. We conclude that feeding a longer series of historical prices to the algorithm does not improve its per-
formance. Therefore, a shorter length can be used, one that improves the speed of the training process and is computationally cheaper.
For the forecast horizon period, we tested three different scenarios – two fixed values: 4 and 8 observations and the scenario where we fore-
cast for the next 24 h, which means a different number of observations for each frequency of data. As stated in section 3, modification of the
forecast horizon leads to overfitting, which was controlled by selecting the best model based on the validation set. Overall, the results for the pre-
diction horizon longer than 1 observation are less appealing. Trading results might be positive and even better than one step ahead prediction.
This however should be taken with a grain of salt as accuracy no longer increases when selecting more restrictive probability thresholds.

TABLE 6 Comparison of trading performance for EURCHF after excluding hours with high spread.

All hours (initial results) High spread hours excluded

Threshold Interval 60 120 60 120


0.5–0.5 accuracy 0.55 0.55 0.52 0.50
%time 1.00 1.00 0.87 0.83
Net results 98.2 98.1 85.5 85.1
0.45–0.55 accuracy 0.63 0.64 0.55 0.51
%time 0.27 0.36 0.17 0.21
Net results 101.1 106.5 90.5 95.5
0.4–0.6 accuracy 0.71 0.69 0.55 0.45
%time 0.11 0.21 0.04 0.08
Net results 105.0 105.1 97.0 96.6
0.35–0.65 accuracy 0.79 0.79 0.65 0.48
%time 0.06 0.13 0.01 0.03
Net results 105.2 105.3 99.6 98.8
0.3–0.7 accuracy 0.84 0.83 0.67 0.39
%time 0.03 0.07 0.00 0.01
Net results 102.9 103.1 99.9 99.7
0.25–0.75 accuracy 0.88 0.91 1.00 0.33
%time 0.01 0.03 0.00 0.00
Net results 101.1 101.8 100.0 100.0

Note: %time shows the fraction of time when either long or short position was open (1.00 = 100%). Threshold represents the probability range for which
the decision is to stay out of the market.

FIGURE 8 Accuracy by hour for USDJPY for 60 min interval with threshold 0.45–0.55.

Performance for low intervals is still subject to the challenges described in section 5, which entail trading at times of high spread. This issue again
should be analysed case by case. The potential area for future research is to apply Transformer models tailored for long term forecasting described
in 2.4 for different prediction horizons.

Regarding feature importance, we tested the model with just four price features: open, high, low, and close. For many epochs, training and validation accuracy oscillated around 50% without any meaningful improvement. Based on this result, we conclude that the additional feature engineering process, in particular the technical analysis based features, is essential for improving model performance.

7 | CONCLUSIONS

We have shown that the Transformer network exhibits predictive power in the context of intraday Forex movements, with performance metrics varying greatly between currency pairs and data frequencies. At the beginning of our research, we compared the Transformer's performance to state-of-the-art models for six currency pairs and five different time intervals in one-step-ahead prediction. From the practical (trading) standpoint, the Transformer model offers an improvement over the ResNet-LSTM model, which presents a challenging benchmark and, in some cases, might still be a better choice. The Transformer used in our study consisted of four encoder blocks with four heads each, followed by global average pooling and a linear layer. We switched from the fixed positional encoding introduced in the original Transformer network to trainable encodings, which improved the performance of the model.
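
To make the described architecture concrete, below is a minimal Keras sketch of such a model. The sequence length, feature count, model width, and feed-forward size are illustrative assumptions rather than values taken from our experiments, and the sketch should be read as one possible implementation of the described setup, not as our exact code.

import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN, N_FEATURES, D_MODEL = 48, 12, 64  # illustrative sizes, not from the paper

class PositionalEmbedding(layers.Layer):
    # Trainable positional encodings added to the projected inputs,
    # replacing the fixed sinusoidal encoding of the original Transformer.
    def __init__(self, seq_len, d_model):
        super().__init__()
        self.pos_emb = layers.Embedding(input_dim=seq_len, output_dim=d_model)
        self.seq_len = seq_len

    def call(self, x):
        positions = tf.range(start=0, limit=self.seq_len, delta=1)
        return x + self.pos_emb(positions)

def encoder_block(x, num_heads=4, ff_dim=128):
    # Self-attention sub-layer with residual connection and layer norm.
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=D_MODEL // num_heads)(x, x)
    x = layers.LayerNormalization()(x + attn)
    # Position-wise feed-forward sub-layer, also with a residual connection.
    ff = layers.Dense(ff_dim, activation="relu")(x)
    ff = layers.Dense(D_MODEL)(ff)
    return layers.LayerNormalization()(x + ff)

inputs = layers.Input(shape=(SEQ_LEN, N_FEATURES))
x = layers.Dense(D_MODEL)(inputs)              # project features to model width
x = PositionalEmbedding(SEQ_LEN, D_MODEL)(x)
for _ in range(4):                             # four encoder blocks, four heads each
    x = encoder_block(x)
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)  # probability of an upward move
model = tf.keras.Model(inputs, outputs)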
We have shown that for four currency pairs (EURCHF, USDJPY, EURGBP, USDCHF) it is possible to devise a profitable trading strategy based on the proposed model, and we offer insights with respect to the optimal time aggregation for those currencies. Additionally, we note that applying transfer learning between currencies is a viable strategy to improve the results, as we can take advantage of the fact that for some currencies it is easier to train neural networks. Combined with our conclusion about the importance of technical analysis based features, this might lend some credence to technical analysis, suggesting that there are universal price patterns shared between different currencies.
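
As an illustration of the transfer step, assuming the model object from the sketch above, the idea amounts to initializing one pair's model with weights trained on another pair; the checkpoint file name and the fine-tuning data below are hypothetical.

import numpy as np

# Hypothetical fine-tuning data for the target pair, shaped to match
# the model defined above (1024 windows of 48 steps x 12 features).
rng = np.random.default_rng(3)
X_eurchf = rng.normal(size=(1024, 48, 12))
y_eurchf = rng.integers(0, 2, size=1024)

# Start from weights trained on another pair (hypothetical checkpoint name),
# then fine-tune on the target pair.
model.load_weights("transformer_eurusd.weights.h5")  # hypothetical file
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_eurchf, y_eurchf, epochs=10, validation_split=0.2)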
These modelling steps were followed by additional analysis of the obtained results in the context of real-world trading, which led to the conclusion that some of the positive results for high-frequency intervals (120 min or less) would not hold true in reality. We showed that the assumption of fixed transactional costs, which dominates the literature, leads to incorrect conclusions. Therefore, one of the key conclusions from our study is that the usefulness of deep learning algorithms in real Forex trading should be studied carefully under realistic assumptions about transactional costs. Relying on typical accuracy metrics and simple backtesting can be very misleading in the Forex market when using OHLC data: we showed that for high frequencies, high accuracy was achieved precisely when the spread cost was the highest. Because of this issue, and supported by the additional analysis of trading performance outside of hours with high spread, we conclude that the 240, 480, and 720 min intervals are more suitable for Forex trading based on deep learning models. Additionally, we concluded that the length of the training sample fed to the network did not play a critical role, whereas feature selection did. Potential future areas of research include improving Transformer networks by testing different architecture setups and time embeddings; designing smarter investment strategies could also lead to superior results.

FUNDING INFORMATION


This work was partially supported by the COST Action CA19130 – Fintech and Artificial Intelligence in Finance – Toward a transparent financial
industry.

DATA AVAILABILITY STATEMENT


The data that support the findings of this study are openly available at https://www.histdata.com/download-free-forex-data/?/ascii/1-minute-bar-quotes.

ORCID
Przemysław Grądzki https://orcid.org/0000-0001-8803-7614
Piotr Wójcik https://orcid.org/0000-0003-1853-8784


AUTHOR BIOGRAPHIES

Przemysław Grądzki is a Ph.D. candidate at the University of Warsaw, Faculty of Economic Sciences. He holds an M.Sc. degree in Computer Science and Econometrics from the University of Warsaw. His research focuses on the intersection of Artificial Intelligence and Quantitative Finance. Aside from academic activity, Przemysław has many years of professional experience working on data science projects across different industries.

Piotr Wójcik is an associate professor at the University of Warsaw, Faculty of Economic Sciences, and the head of the Data Science Lab
research team. His research interests are focused on two areas. The first is regional and local development and in particular the measurement
of inequalities and real economic convergence on a regional and local level. The second area is quantitative finance, in particular construction
and testing of algorithmic trading strategies. Both areas of interest involve the use of advanced quantitative tools including machine learning
algorithms.

How to cite this article: Grądzki, P., & Wójcik, P. (2023). Is attention all you need for intraday Forex trading? Expert Systems, e13317. https://doi.org/10.1111/exsy.13317
