
The use of Long Short-Term Memory

networks in pairs trading strategies


S.A.M. Trommelen

A thesis submitted in partial fulfillment of the


requirements for the degree of Master of Science in
Quantitative Finance and Actuarial Science

Tilburg School of Economics and Management


Tilburg University

supervised by
Dr. P. Čížek

March, 2021
Contents

1 Introduction

2 Data

3 Methodology
3.1 Trading periods
3.2 Distance Method
3.3 Copula modeling
3.4 Long Short-Term Memory Network
3.5 Trading costs
3.6 Performance calculation

4 Results
4.1 Descriptive statistics
4.2 Trade characteristics
4.3 Risk characteristics
4.4 Sensitivity analysis

5 Conclusion

Abstract

This thesis aims to provide a comparison between conventional pairs-trading strategies and a
more contemporary approach based on a Long Short-Term Memory (LSTM) network. It tests
the different strategies on constituents of the STOXX600 index, which covers roughly 90%
of the total European stock market capitalization, between August 1999 and May 2020, and
involves a total of 1562 unique assets. The distance method, copula-based approach, and LSTM
network strategy result in average monthly returns of 11, -3 and 27 basis points respectively over
the whole time horizon, after trading costs have been taken into account. The three strategies
exhibit no real dependence on the performance of the index and perform extremely well in times
of economic hardship in both the economic and risk-adjusted sense, suggesting that these could
be used to effectively hedge against market risk during volatile states of the market.

Acknowledgements

This thesis is the last part of my Quantitative Finance and Actuarial Science master program
at Tilburg University and is the product of months of hard work. It has been a year that tested
my determination to the fullest, with many ups and downs, and I feel glad to have been able
to persist through these adversities. This would not have been possible without the love and support of my
family and friends. Thank you so much for being there for me since the very beginning.

I would like to further extend my gratitude to my supervisor, Dr. Pavel Čížek, who has provided
me with very valuable insights and ideas that have ultimately formed the thesis you are reading
today.

1. Introduction

Pairs trading is a speculative investment strategy that seeks to profit from short-term deviations of
the spread between two paired assets by exploiting its mean-reverting nature. The strategy is to
identify co-moving pairs of securities and open up a long-short position in the event the spread
between this pair diverges. The rationale is that for a well-behaved pair such divergences imply
that the underlying securities are relatively over- and undervalued compared to one another. Any
mispricing will correct itself over time and presents the opportunity to generate a profit. Pairs
trading strategies (PTS) are thus inherently contrarian, as they prescribe shorting the asset that
is performing well and simultaneously going long the asset that is performing poorly. These strategies
have been around since the eighties and are still regularly used by institutional investors.

Gatev et al. (2006) provides one of the first exhaustive studies on PTS. They construct a relatively
simple PTS that is able to generate significant profits between July 1963 and December 2002.
Their strategy became known as the distance method (DM) and has become the benchmark in
the PTS literature. As with most arbitrage opportunities, the profits from the resulting strategy
appeared to be generally declining over time. Do and Faff (2010, 2012) examine the DM over
a 47-year time period and find that the performance of the DM peaks during 1970-1990 and
starts declining thereafter, with exceptions being the years following the dot-com bubble and the
global financial crisis. They attribute this decline to an increase in pairs that diverge in spread
and do not converge back.

The literature has since then been introduced to many new forms of PTS. One such strategy is
based upon modeling with copulas. A copula is a multivariate cumulative distribution function
that directly defines the dependence structure between two or more random variables; it allows the
marginal distribution functions and their underlying dependence function to be modelled separately.
Copulas have existed for quite some time, but have only recently been widely applied in
PTS. Xie and Wu (2013) point out that the distance method is fundamentally flawed in its implicit
assumption of multivariate normal returns. There exists a sizeable literature, most notably Cont
(2001), providing evidence against the normality of stock returns. In this sense, the distance
method is inefficiently and incorrectly using its available data. The copula strategy does not
need any assumption about the underlying marginal or joint distribution function being normal
as it allows these to be modelled independently without any further restrictions. However, their

profitability seems to vary between studies. Xie and Wu (2013) discover significantly
higher excess returns of copula-based portfolios compared to portfolios based upon the DM,
while the large empirical study of Rad et al. (2016) suggests that copulas actually perform considerably
worse than the DM. This thesis aims to find out how the copula strategy performs on a large set
of the European market, but also to come up with a more sophisticated strategy.

This brings us to artificial neural networks. These networks have gained considerable traction
over the last decade as their flexibility allows them to be applied to virtually any field. Recurrent
neural networks are a special class of artificial neural networks that enable the network to learn
in a sequential manner, which makes them incredibly suited for time-series applications and
financial applications like portfolio management or investment strategies. Fischer and Krauss
(2018) use a Long Short-Term Memory (LSTM) network to form investment portfolios based on
market predictions. The resulting investment portfolios exhibit low exposure to systematic risk
but become generally unprofitable after 2010; the LSTM-based model nevertheless performs
significantly better than more conventional machine learning techniques such as random forests.
This begs the question whether LSTM networks could also successfully be applied to PTS. The
main objective of this thesis is therefore to investigate if it is possible and worthwhile to apply
an LSTM network in a PTS setting.

This thesis contributes to the existing literature in two ways. First, it investigates the applicability
of recurrent neural networks for pairs trading strategies and tests their potential profitability.
Second, it provides a comparison between the benchmark, copula, and LSTM-based strategies
on a large set of the European market applied during the past two decades. The remainder of
this thesis is structured as follows. Section 2 provides a brief description of the dataset used to
backtest the different PTS. Section 3 reviews the methodology of the strategies in detail. The
results are presented in section 4. Finally, section 5 concludes.

2. Data

The PTS are backtested on the STOXX Europe 600 (SXXP) index to assess their performance.
This index was established in 1998 and consists of 600 different securities with small, medium
and large market capitalizations that all originate from a set of European countries, but mainly
are from the United Kingdom, France, Germany and Switzerland. This dataset is not directly
available and is constructed in two steps. First, for every calendar month, the set of all
constituents comprising the index is retrieved from Thomson Reuters Datastream. This list is
then aggregated over all the available years. Second, for each security on this list, daily series
about its trading price, return index (with and without dividends) and market capitalization, as
well as its corresponding industry, are retrieved from the same source. This yields a dataset
comprising a static list and three time series, consisting of observations on 5353 trading days
spanning the first of August 1999 until the end of May 2020, for a total of 1568 distinct securities.
Securities that lack observations for the majority of their active duration are dropped from the
dataset. A binary matrix is constructed to only allow trades on constituents of the index: a
security is marked with a one for each calendar month in which it is a constituent of the index.
Note that trading is still permitted in the event that a constituent leaves the index, but only for
that trading period.
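As an illustration, a minimal Python sketch of how such a membership matrix could be built is given below; the variable `constituents_by_month`, mapping each calendar month to the list of tickers retrieved from Datastream, is a hypothetical name and not part of the original implementation.

```python
import pandas as pd

def build_membership_matrix(constituents_by_month, all_tickers):
    """Binary matrix (months x tickers): 1 if the ticker is an index
    constituent in that calendar month, 0 otherwise."""
    months = sorted(constituents_by_month.keys())
    matrix = pd.DataFrame(0, index=months, columns=sorted(all_tickers))
    for month, tickers in constituents_by_month.items():
        matrix.loc[month, matrix.columns.intersection(tickers)] = 1
    return matrix
```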

3. Methodology

3.1. Trading periods

All three PTS consist of two stages in their implementation. The main purpose of the
first stage, or formation period, is to form the potential pairs and estimate any potential model
parameters that might be used during the second stage. In the second stage, or trading period,
the realized spread of each potential pair is evaluated and the strategy applies its trading rules
determined during the formation period. The lengths of the formation and trading periods can
be chosen arbitrarily. This implementation follows the preceding literature and adopts a
formation and trading period of twelve and six calendar months respectively, which facilitates
comparison with previously obtained results. Each strategy is applied on a monthly
basis without waiting for the previous trading period to finish, which results in six overlapping
portfolios. The return of a particular month is then defined by taking the average return of each
of the six overlapping portfolios based on marked-to-market valuation.
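A minimal sketch of this staggered setup is shown below; months are encoded as integers and `portfolio_returns` is a hypothetical mapping from a portfolio's start month to its marked-to-market monthly returns.

```python
import numpy as np

def strategy_month_return(month, portfolio_returns):
    """Average the return of one calendar month over the six overlapping
    portfolios whose six-month trading periods cover it, i.e. the
    portfolios started one to six months earlier."""
    active = [portfolio_returns[s][month]
              for s in range(month - 6, month)
              if s in portfolio_returns and month in portfolio_returns[s]]
    return np.mean(active) if active else 0.0
```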

3.2. Distance Method

The distance method (DM) applied in this thesis is based upon the original implementation of the
strategy introduced in Gatev et al. (2006). At the start of the formation period, the daily closing
prices of the securities are normalized and represent the total cumulative return index without
any dividends. Let $X_t^i$ and $X_t^j$ denote the normalized prices of securities $i, j \in \{1, \dots, n_s\}$ with
$i \neq j$. The spread between $X_t^i$ and $X_t^j$ at time $t$ is then given by

$$\Delta_t^{ij} = X_t^j - X_t^i.$$

The criterion by which the DM ranks the various pairs is the sum of squared differences (SSD).
The motivation behind using the SSD, or sum of squared spreads, is that securities that behave
similarly in price are more likely to persist with this behavior in future time periods. This means
that the pair corrects itself after any short-term deviation, reverting back to its mean.
Formally, such a process could be described by a cointegration framework
using Engle and Granger's Error Correction Model (Vidyamurthy, 2004). For the DM, such
sophistication is not necessary, as the only determinant by which the pairs are ranked is by this
measure of closeness. However, this measure is far from perfect, as the securities could also

behave similarly in price by factors independent of their common characteristics. Such pairs
often diverge during the trading period and typically result in a loss.

Throughout the formation period, the sum of squared differences (SSD) between the pair's
individual normalized price series is calculated. Let $T_1$ represent the number of observations
during the formation period; the SSD between stocks $i$ and $j$ is then given by

$$\delta_{i,j} = \sum_{t=1}^{T_1} (X_{i,t} - X_{j,t})^2.$$

The best performing pairs, in terms of minimized SSD, are eligible to trade during the trading
period. Gatev et al. (2006) reports that the portfolio formed on the top 20 pairs has the highest
average monthly return. For this reason, only 20 pairs are considered for trading during each
time period.
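A minimal sketch of this selection step is given below, assuming `prices` is a pandas DataFrame of daily formation-period prices with one column per security; the function name is illustrative only.

```python
import itertools
import pandas as pd

def select_pairs(prices, n_pairs=20):
    """Rank all pairs by the SSD of their normalized price series and
    return the n_pairs most closely co-moving ones."""
    normalized = prices / prices.iloc[0]          # rescale to start at 1
    ssd = {}
    for i, j in itertools.combinations(normalized.columns, 2):
        spread = normalized[i] - normalized[j]
        ssd[(i, j)] = (spread ** 2).sum()
    return sorted(ssd, key=ssd.get)[:n_pairs]     # smallest SSD first
```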

At the start of the trading period, the prices of the securities are again rescaled to represent
cumulative returns. At each point in time, the observed spread of each pair is determined and
monitored. The DM prescribes opening positions whenever the magnitude of the spread
exceeds a certain level, and closing positions whenever the spread reverts back to its mean
level. These thresholds are based on a standard deviation metric, where the standard deviation
of the spread is determined from the formation period only. Let the standard deviation of the spread
of a pair $(i, j)$ during the formation period be denoted by $\tilde{\sigma}_{ij}$ and let the number of observations
during the trading period be represented by $T_2$. If $o_t$ and $c_t$ represent the signals to open or
close a position respectively at time $t = T_1 + 1, \dots, T_2$, then

$$o_t = \mathbf{1}\{\Delta_t^{ij} \geq k_1 \tilde{\sigma}_{ij}\} - \mathbf{1}\{\Delta_t^{ij} \leq -k_1 \tilde{\sigma}_{ij}\} \quad \text{and} \quad c_t = \mathbf{1}\{k_2 \tilde{\sigma}_{ij} \geq \Delta_t^{ij} \geq -k_2 \tilde{\sigma}_{ij}\}.$$

Here, 𝑘 1 is the opening threshold parameter of the DM. A typical rule is to open up positions
whenever the observed spread exceeds two historical standard deviations, i.e. 𝑘 1 = 2. The
opening threshold is key to the overall performance of a strategy. If the opening threshold is
set too low, then a myriad of positions will be initiated, each open for a relatively short duration.
Accordingly, the average monthly returns will also be lower, as the position is opened when the
spread is relatively small. The average monthly returns of the strategy will therefore be affected
more severely by trading costs. On the other hand, if the opening threshold is too high, then the
strategy might miss out on mean-reversions of the spread. Section 4.4 studies the effect of different
opening thresholds for each different PTS in further detail.

The sign of $o_t$ determines the direction in which a position is opened. If $\Delta_t^{ij} \geq k_1 \tilde{\sigma}_{ij}$, then
the first security is relatively undervalued compared to the second. Accordingly, the strategy
prescribes to go long in the first and to short the second. Whenever the spread converges back,
that is within 𝑘 2 historical standard deviations, the position is closed. It is possible that a pair
opens and closes multiple positions throughout the trading period. In the event that the spread
does not converge, the position is manually closed at the end of the trading period. Such positions
typically lead to a net loss.
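The trading rule above can be summarized by the following sketch for a single pair; $k_1 = 2$ follows the typical rule mentioned earlier, while the default value of the closing threshold $k_2$ is only an illustrative placeholder.

```python
import numpy as np

def dm_positions(spread, sigma_formation, k1=2.0, k2=0.5):
    """Track the long-short position of one pair over the trading period.
    spread = X_j - X_i (daily); +1 means long i / short j, -1 the reverse."""
    position, positions = 0, []
    for delta in spread:
        if position == 0:
            if delta >= k1 * sigma_formation:
                position = 1          # security i relatively undervalued
            elif delta <= -k1 * sigma_formation:
                position = -1         # security j relatively undervalued
        elif abs(delta) <= k2 * sigma_formation:
            position = 0              # spread converged: close the position
        positions.append(position)
    return np.array(positions)
```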

Liew and Wu (2013) point out that a major shortcoming of the distance method is its inherent
assumption of linear association describing the dependency structure. Selecting pairs by
minimizing squared distance is equivalent to using linear correlation criteria. If the distribution of
the underlying pair is multivariate normal, then minimum distance is able to completely
capture the dependency structure. However, empirical studies, most notably Cont (2001), suggest
that the returns of assets (or pairs) are seldom normal. Tail dependency, which describes
co-movements between two random variables in the lower and upper parts of the distribution, is
often observed in practice and cannot be explained by (multivariate) normal distributions.
As such, the distance method is not adequate for predicting future price movements.

3.3. Copula modeling

A copula-based approach circumvents this issue as it makes no inherent assumption about


joint or marginal behavior. The copula-based PTS aims to quantify the relative mispricing
between two assets using their joint distribution function determined by its marginal and copula
distribution function. A copula is a function that describes the dependence structure of a
set of random variables given their univariate marginals. The copula was first introduced by
Sklar (1959) and his famous existence theorem provides a starting point for any copula-based
approach.

Theorem 1 (Sklar's theorem) Let $\boldsymbol{X} = (X_1, \dots, X_n)$ be an $n$-dimensional vector of random
variables with joint distribution function $H(X_1, \dots, X_n)$. Suppose each random variable $X_i$ has
a marginal distribution function given by $F_i(x_i)$. Then, there exists a copula function $C$
such that for all $\boldsymbol{x} \in \mathbb{R}^n$, it holds that

$$H(x_1, \dots, x_n) = C(F_1(x_1), \dots, F_n(x_n)).$$

Thus, the copula directly defines the dependence structure between the random variables, and
additionally, if each marginal distribution function is continuous, this copula is unique. Assume that the
copula is $n$-times differentiable. Then, the joint probability density function is given by

$$f(x_1, \dots, x_n) = \frac{\partial^n C(F_1(x_1), \dots, F_n(x_n))}{\partial x_1 \cdots \partial x_n}.$$

If each marginal $F_i(x_i)$ is differentiable, then this joint probability density function becomes

$$f(x_1, \dots, x_n) = f_1(x_1) \cdot \ldots \cdot f_n(x_n) \cdot c(F_1(x_1), \dots, F_n(x_n)).$$

The joint probability density function is therefore equal to the product of the marginal probability
densities multiplied by the copula density function, which is the $n$th-order mixed partial derivative of the
copula with respect to the marginal cdfs of the random variables, that is,

$$c(F_1(x_1), \dots, F_n(x_n)) = \frac{\partial^n C(F_1(x_1), \dots, F_n(x_n))}{\partial F_1(x_1) \cdots \partial F_n(x_n)}.$$

Indeed, this is the term that captures the complete dependence structure between the set of random
variables, with independence corresponding to the case where $c(F_1(x_1), \dots, F_n(x_n)) = 1$
for every $\boldsymbol{x} = (x_1, \dots, x_n)$. This highlights the flexibility of using a copula, as marginal densities
can be chosen independently without any assumptions about their joint behavior. Accordingly,
the joint behavior can now be modelled by a variety of copula distribution functions. This also
means that the entire dependence structure of the random variables is now able to be estimated
with standard techniques like maximum likelihood estimation.

In a bivariate setting like PTS, it can be shown that the copula function can be
factorized into conditional probabilities. If the random variables $X_1$ and $X_2$ have cumulative
distribution functions $F_1(x_1)$ and $F_2(x_2)$, then by the probability integral transformation theorem
it holds that $U_1 = F_1(X_1) \sim U(0,1)$ and $U_2 = F_2(X_2) \sim U(0,1)$. Nelsen (2007) proves that
the partial derivative of the copula with respect to one variate then denotes the distribution of the
other variate conditional on it, that is,

$$h_1(u_1 \mid u_2) = \frac{\partial C(u_1, u_2)}{\partial u_2} := P(U_1 \leq u_1 \mid U_2 = u_2) \tag{1}$$

and

$$h_2(u_2 \mid u_1) = \frac{\partial C(u_1, u_2)}{\partial u_1} := P(U_2 \leq u_2 \mid U_1 = u_1). \tag{2}$$
If 𝑋1 and 𝑋2 denote normalized prices of a pair of securities, then Xie and Wu (2013) show that
ℎ1 (𝑢 1 | 𝑢 2 ) represents the likelihood that the actual spread is smaller than the currently observed

spread, conditional on the normalized price of the second security. In a way, this conditional
probability indicates the degree of mispricing of the first security conditional on the second.
For example, if ℎ1 (𝑢 1 | 𝑢 2 ) > 0.5, then the normalized price is expected to decrease as it
converges back to its equilibrium, which means that the first security is overvalued conditional
on the second. Similarly, the first security is undervalued if ℎ1 (𝑢 1 | 𝑢 2 ) < 0.5 and predicts an
increase in the normalized price. These mispricing indices will form the basis in determining
whether new positions are opened or closed.

This implementation is mainly based upon Liew and Wu (2013) and Rad et al. (2016)
and is, like the DM, performed in two stages. First, pairs are formed during the formation
period according to their SSD. This measure is favorable as it is fast to compute and seems to
empirically perform just as well as measures based upon correlation or cointegration. Then,
their marginal and joint distribution functions are determined by fitting a given set of distribution
functions. Throughout the trading period, the marginals and copula distribution functions are
then used to determine the daily relative mispricing indices. Like the mispricing measure of
the DM, these determine whether positions are initiated or closed.

After forming the pairs, the normalized prices are transformed by taking the logarithm. This
transformation is not necessary, though it slightly improves overall fit. The marginal and
copula distribution functions are determined by applying the inference functions for margins
(IFM) approach (Joe, 1997). This starts with selecting an appropriate set of candidate marginals and
fitting each of the two random variables independently by maximum likelihood to each candidate
marginal, which means that the two assets do not necessarily share the same distribution. The set
of marginals consists of the normal, logistic, and generalized extreme value distributions, whose
densities and parameter spaces are presented in table 1.
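A minimal sketch of this first IFM step using scipy is given below; the candidate distributions stand in for the densities of table 1 and the function name is illustrative only.

```python
import numpy as np
from scipy import stats

CANDIDATE_MARGINALS = [stats.norm, stats.logistic, stats.genextreme]

def fit_best_marginal(x):
    """Fit each candidate marginal to one asset's log-normalized price
    series by maximum likelihood and keep the best-fitting one."""
    best = None
    for dist in CANDIDATE_MARGINALS:
        params = dist.fit(x)                        # MLE parameter estimates
        loglik = np.sum(dist.logpdf(x, *params))
        if best is None or loglik > best[2]:
            best = (dist, params, loglik)
    return best   # (distribution, fitted parameters, log-likelihood)
```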

In a bivariate setting, the decomposed joint probability function in section 3.3 greatly simplifies
as the copula density function is then equal to the product of the two conditional probabilities.
Using the probability integral transformation together with the best marginal distributions (in
terms of maximized likelihood) makes it possible to determine the log-likelihood of the joint
distribution function constructed by any given copula. Similar to Rad et al. (2016), the copulas
are chosen from the Student-t, Clayton, and Gumbel families, while also considering the rotated
variations of the latter two. The conditional probability densities and parameter space of each
copula is shown in table 2.

Distribution | Density function | Parameters
Normal | $\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)$ | $\mu \in \mathbb{R}$, $\sigma > 0$
Logistic | $\exp\left(\frac{-(x-\mu)}{\sigma}\right) \Big/ \left(\sigma \left[1 + \exp\left(\frac{-(x-\mu)}{\sigma}\right)\right]^{2}\right)$ | $\mu \in \mathbb{R}$, $\sigma > 0$
Gen. extreme value | $\frac{1}{\sigma}\exp\left(-\left(1 - c\,\frac{x-\mu}{\sigma}\right)^{1/c}\right)\left(1 - c\,\frac{x-\mu}{\sigma}\right)^{1/c - 1}$ | $\mu \in \mathbb{R}$, $c, \sigma > 0$

Table 1: The candidate marginal distribution functions. Note that $\mu$ and $\sigma$ represent location
and scale parameters.

Copula | $h_1(u_1 \mid u_2) = P(U_1 \leq u_1 \mid U_2 = u_2)$ | Parameters
Student-t | $t_{\nu+1}\left(\left(t_\nu^{-1}(u_1) - \rho\, t_\nu^{-1}(u_2)\right) \Big/ \sqrt{\left(\nu + (t_\nu^{-1}(u_2))^2\right)\left(1 - \rho^2\right)/(\nu + 1)}\right)$ | $\rho \in (-1, 1)$
Clayton | $u_2^{-(\theta+1)} \left(u_1^{-\theta} + u_2^{-\theta} - 1\right)^{-\frac{1}{\theta} - 1}$ | $\theta > 0$
Rotated Clayton | $1 - (1 - u_2)^{-(\theta+1)} \left((1 - u_1)^{-\theta} + (1 - u_2)^{-\theta} - 1\right)^{-\frac{1}{\theta} - 1}$ | $\theta > 0$
Gumbel | $C_\theta(u_1, u_2) \left((-\ln u_1)^\theta + (-\ln u_2)^\theta\right)^{\frac{1-\theta}{\theta}} (-\ln u_2)^{\theta - 1} / u_2$ | $\theta > 1$
Rotated Gumbel | $1 - C_\theta(1 - u_1, 1 - u_2) \left((-\ln(1 - u_1))^\theta + (-\ln(1 - u_2))^\theta\right)^{\frac{1-\theta}{\theta}} (-\ln(1 - u_2))^{\theta - 1} / (1 - u_2)$ | $\theta > 1$

with $C_\theta(u_1, u_2) = \exp\left(-\left((-\ln u_1)^\theta + (-\ln u_2)^\theta\right)^{1/\theta}\right)$

Table 2: The candidate copulas and their conditional distribution functions.

The copula that yields the highest log-likelihood is then selected to use during the trading
period. Since the number of free parameters of each copula is one, metrics like the Bayesian
Information Criterion (BIC) or Akaike Information Criterion (AIC) will lead to the same
distributions as those obtained using the maximum likelihood criterion. Table 3 presents
the fraction of pairs and assets that are assigned a certain distribution. The logistic distribution
seems to provide the best fit for the marginals overall, while the Student-t and Clayton copulas
are the most popular distributions that match the dependence structure between marginals.

Panel A: Copula families
 | Student-t | Clayton | Rotated Clayton | Gumbel | Rotated Gumbel
Fraction of pairs | 34.3% | 46.5% | 0.5% | 0.1% | 18.6%
Panel B: Marginal distributions
 | Logistic | Normal | Generalized extreme value
Fraction of assets | 96.3% | 3.1% | 0.6%

Table 3: Share of the matched distributions using the IFM procedure.

Trading with the copula-based PTS is determined entirely on the basis of the mispricing indices
of a pair of securities, as defined earlier. This starts by normalizing the price series of
the trading period and taking its logarithm. Applying the fitted marginal cumulative distribution
functions gives the quantiles $U_1$ and $U_2$. The relative mispricing indices
$m_{i,t}$ are now determined by applying equations (1) and (2) to the best-fitting copula
of the pair. These are then centered around zero, so their sign represents the direction of the
over- or undervaluation. That is,

$$m_{1,t} = h_1(u_1 \mid u_2) - 0.5 = P(U_1 \leq u_1 \mid U_2 = u_2) - 0.5,$$
$$m_{2,t} = h_2(u_2 \mid u_1) - 0.5 = P(U_2 \leq u_2 \mid U_1 = u_1) - 0.5.$$

Following Rad et al. (2016), the daily mispricing indices are cumulatively summed up during
the entire trading period to serve as the overall mispricing index of the two assets:

$$M_{1,t} = M_{1,t-1} + m_{1,t}, \qquad M_{2,t} = M_{2,t-1} + m_{2,t},$$

with initial values $M_{1,0} = M_{2,0} = 0$. If the first cumulative mispricing index $M_{1,t}$ is positive,
then the first security is relatively overpriced compared to the second one, and vice versa if it is
negative. If the mispricing measure is relevant, then it should exhibit mean-reversion around
zero, to reflect that the market is efficient in correcting the mispricing over time.

These two series are used as signals indicating whether or not a position should be opened up
on each day of the trading period. Again, let 𝑘 1 and 𝑘 2 represent the thresholds that determine
whether a position is opened or closed. Assuming there are no open positions at the time, a
new position is opened whenever the cumulative mispricing index of one of the securities is

greater than 𝑘 1 while the second one is lower than −𝑘 1 . In this event, the security that exceeds
the (positive) threshold is considered overvalued and is shorted. The other security is naturally
considered undervalued and is bought. The positions are reversed whenever one of the indices
falls below the closing threshold 𝑘 2 or if the end of the trading period is reached. The opening
and closing signals $o_t$ and $c_t$ at time $t = 1, \dots, T$ are thus given by

$$o_t = \mathbf{1}\{M_{2,t} \geq k_1\}\,\mathbf{1}\{M_{1,t} \leq -k_1\} - \mathbf{1}\{M_{1,t} \geq k_1\}\,\mathbf{1}\{M_{2,t} \leq -k_1\} \quad \text{and} \quad c_t = \mathbf{1}\{k_2 \geq M_{1,t} \geq -k_2\} + \mathbf{1}\{k_2 \geq M_{2,t} \geq -k_2\}.$$
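To make these rules concrete, the sketch below computes the mispricing indices and signals for a pair whose best-fitting copula happens to be Clayton (other families from table 2 would substitute their own conditional distributions); the names and vectorized form are illustrative only.

```python
import numpy as np

def clayton_h(u1, u2, theta):
    """h1(u1 | u2) = P(U1 <= u1 | U2 = u2) for a Clayton copula (table 2);
    h2 follows by swapping the arguments."""
    return u2 ** (-(theta + 1)) * (u1 ** (-theta) + u2 ** (-theta) - 1) ** (-1 / theta - 1)

def copula_signals(u1, u2, theta, k1, k2):
    """Cumulative mispricing indices and open/close signals for one pair;
    u1, u2 are the fitted-marginal quantiles over the trading period."""
    m1 = clayton_h(u1, u2, theta) - 0.5           # daily mispricing of security 1
    m2 = clayton_h(u2, u1, theta) - 0.5           # daily mispricing of security 2
    M1, M2 = np.cumsum(m1), np.cumsum(m2)         # cumulative indices (M_0 = 0)
    open_long_1 = (M2 >= k1) & (M1 <= -k1)        # security 2 overvalued: long 1, short 2
    open_long_2 = (M1 >= k1) & (M2 <= -k1)        # security 1 overvalued: long 2, short 1
    close = (np.abs(M1) <= k2) | (np.abs(M2) <= k2)
    return M1, M2, open_long_1, open_long_2, close
```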

3.4. Long Short-Term Memory Network

Recurrent neural networks (RNN) are becoming increasingly popular in time series applications
as their extremely flexible structure imposes fewer restrictions than classical time series models.
These networks distinguish themselves from ordinary (feedforward) neural networks by
allowing the neurons to construct cycles between themselves. These feedback loops enable the
retention and back-propagation of information to form temporal relationships of the predictors
between the past and the present.

A drawback of RNNs is that these may suffer from the so-called vanishing gradient problem,
which ultimately prevents the network from further training. The Long Short-Term Memory
(LSTM) network, first introduced by Hochreiter and Schmidhuber (1997), is a variant of the
RNN that tackles the vanishing gradient problem by introducing new elements to the neuron
that directly control the flow of information over time. These new elements are controlling gates
that regulate the rate at which the neuron forgets, accepts, and passes on information. There
exist many variants of the LSTM, each with varying performance. Greff et al. (2016) provide a
large-scale study that compares several of these variants. The architecture of the LSTM neurons
in this thesis is similar to the original implementation, and a schematic representation of the
neuron is given in figure 1.

The internal memory cell (or cell state) is the most important element of the LSTM neuron.
It stores, provides, and forgets past and current information in the network making use of the
three gates depicted in figure 1. These gates, together with the internal memory cell state, are
what separates LSTM and ordinary RNN networks. To illustrate their use, consider a sequence
of inputs (𝒙 1 , 𝒙 2 , . . . , 𝒙𝑇 ) consisting of input vectors for time period 𝑡 = 1, . . . , 𝑇. Let 𝑾 𝑖
and 𝑹𝑖 denote the weighting matrices for the ordinary and recurrent inputs of the input gate

Figure 1: Schematic representation of an LSTM neuron with its three different gates.

respectively and let 𝒃𝑖 denote the bias vector. Here, the subscript 𝑖 represents the input gate;
the forget and output gate have independent weights and biases and will be indexed by 𝑓 and
𝑜 respectively. The activation levels of the three gates represent a degree of relevance and are
typically determined by the Sigmoid activation function, which is given by

$$\sigma(x) = \frac{1}{1 + \exp(-x)}. \tag{3}$$

Components are less favorable the closer they are to zero, and conversely, more favorable the
closer they are to one.

The input gate transforms the inputs to the neuron into a vector with activation values by
applying the Sigmoid activation function to a weighted sum of the inputs, recurrent inputs, and
bias. The activation levels of the input gate at time 𝑡 are thus defined by

$$\boldsymbol{i}_t = \sigma(\boldsymbol{W}_i \boldsymbol{x}_t + \boldsymbol{R}_i \boldsymbol{h}_{t-1} + \boldsymbol{b}_i).$$

Besides the propagation of information, it is also crucial for the neuron to be able to forget
redundant information and reset its own cell state. This feature was introduced only two years
after the original paper but greatly improved the overall performance of the network. In the
introducing paper, Gers et al. (2000) argue that cell states unable to reset themselves may

grow indefinitely, which might cause the network to break down. The activation values are
determined from the same set of inputs (𝒙 𝑡 , 𝒉𝑡−1 ) and follow from

$$\boldsymbol{f}_t = \sigma(\boldsymbol{W}_f \boldsymbol{x}_t + \boldsymbol{R}_f \boldsymbol{h}_{t-1} + \boldsymbol{b}_f).$$

These activation values decide what information from the cell state will be omitted when
updating its interior state. To this extent, the neuron first determines a fresh cell state by
considering
$$\tilde{\boldsymbol{s}}_t = \tanh(\boldsymbol{W}_s \boldsymbol{x}_t + \boldsymbol{R}_s \boldsymbol{h}_{t-1} + \boldsymbol{b}_s).$$

The activation functions of the cell states are almost never the standard Sigmoid, as these states
require the ability to remove or subtract information from themselves, which is not possible
considering the range of the standard Sigmoid. Essentially, combining the relevant parts of
the former cell state with that of the fresh cell state enables the neuron to remember and forget
all different kinds of information and gives the LSTM its unique properties. It should be no
surprise that the relevant parts are determined by the activation levels of the forget and input
gate. Indeed, the new cell state 𝒔𝑡 is determined by

$$\boldsymbol{s}_t = \boldsymbol{f}_t \circ \boldsymbol{s}_{t-1} + \boldsymbol{i}_t \circ \tilde{\boldsymbol{s}}_t,$$

where ◦ denotes the element-wise product. The output gate is responsible for passing on the
right information to the next units in the current and following layers. The activation levels of
the output gate are given by

$$\boldsymbol{o}_t = \sigma(\boldsymbol{W}_o \boldsymbol{x}_t + \boldsymbol{R}_o \boldsymbol{h}_{t-1} + \boldsymbol{b}_o)$$

and combining this with the new cell state finally gives the recurrent output,

$$\boldsymbol{h}_t = \boldsymbol{o}_t \circ \tanh(\boldsymbol{s}_t).$$
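For concreteness, the gate equations above can be collected into a single forward step. The following is a bare numpy sketch of one LSTM unit (no training logic), not the Keras implementation used later.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, s_prev, W, R, b):
    """One forward pass of the LSTM unit; W, R and b are dicts holding the
    input weights, recurrent weights and biases of the input ('i'),
    forget ('f'), output ('o') and candidate cell state ('s') parts."""
    i_t = sigmoid(W['i'] @ x_t + R['i'] @ h_prev + b['i'])       # input gate
    f_t = sigmoid(W['f'] @ x_t + R['f'] @ h_prev + b['f'])       # forget gate
    o_t = sigmoid(W['o'] @ x_t + R['o'] @ h_prev + b['o'])       # output gate
    s_tilde = np.tanh(W['s'] @ x_t + R['s'] @ h_prev + b['s'])   # fresh cell state
    s_t = f_t * s_prev + i_t * s_tilde                           # new cell state
    h_t = o_t * np.tanh(s_t)                                     # recurrent output
    return h_t, s_t
```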

The network starts optimizing all of its parameters (i.e. the weighting matrices, bias vectors)
relative to some loss function through mini-batch gradient descent. This is a modified version of
the gradient descent algorithm and uses a random set of samples (a batch) instead of the entire
training set. It feeds batches through the neural network and approximates the error gradient
relative to the batch. It then updates the weights and biases by moving along this gradient in the
direction that decreases the loss. This process is repeated until the loss no longer improves or a
stopping criterion is reached.

The implementation of the LSTM network used for this PTS is built with Keras, an open-source
software library based on Tensorflow. Tensorflow is a general machine-learning library for
Python developed by Google and is considered the industry standard when it comes to artificial
neural networks. The approach follows these three stages: (1) select and prepare the right
training and test data; (2) construct the network and have it learn on the training set; and (3)
use the network to predict the future spread given the samples of the trading period, which are
then used to open up the trading positions.

It remains to be decided how far into the future the neural network should make its predictions.
If the horizon is too short, then the volatility of day-to-day spreads might disrupt the mean-
reverting property the strategy seeks to exploit. On the other hand, if the horizon is too long,
it might miss out on reversions and underperform compared to a shorter horizon. In this
implementation, the horizon is arbitrarily set to two weeks. Considering no trade takes place on
weekends and holidays, this usually means that the neural network has to predict ten time-steps
in advance.

Like any other machine learning approach, LSTM networks can perform poorly if the number
of features (or variables) is too high. Selecting the right features is therefore crucial. Here, the
future spread of a pair is predicted based upon the currently observed spread, the cumulative
return series of both assets and their respective industry average, and the SXXP price index.
However, these features and outcomes first need to be normalized. Spreads tend to be on a rather
small scale, and transforming these provides the network with the much needed sparsity. The
features and outputs are scaled in such a way that their range is mapped to the [−1, 1] interval.
As the range of the features is not necessarily the same during the formation and trading periods,
the minimum and maximum scaling parameters are set at 0.50 and 1.50 for returns and -0.25
and 0.25 for spreads. This ensures sufficient sparsity between the features and outputs while
not relying on scaling parameters that depend on the training or test set. These features are
stacked into sliding windows of approximately two weeks in length, which allows the network
to capture the temporal relationships between features more effectively.
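As an illustration of this preprocessing step, the sketch below maps a series from a fixed range to [-1, 1] and stacks the features into ten-day windows; the exact window construction is an assumption, as the thesis does not spell it out.

```python
import numpy as np

def scale_to_unit_interval(x, lo, hi):
    """Map x from the fixed range [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def make_windows(features, spread, window=10, horizon=10):
    """Stack scaled features (time x n_features) into sliding windows of
    ~two weeks and pair each window with the spread `horizon` days ahead."""
    X, y = [], []
    for t in range(window, len(features) - horizon):
        X.append(features[t - window:t])
        y.append(spread[t + horizon])
    return np.array(X), np.array(y)

# returns_scaled = scale_to_unit_interval(cum_returns, 0.50, 1.50)
# spread_scaled  = scale_to_unit_interval(spread, -0.25, 0.25)
```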

The neural network itself is comprised of an input layer with one neuron for each particular
feature, two hidden layers with 16 and 8 LSTM neurons, and an output layer with one neuron
that represents the two-week forecast of the spread. In total, the complete network has 2281
trainable parameters. Between the two LSTM layers is a so-called dropout layer. The purpose of

such a layer is to provide regularization to the network. If this layer is omitted, the network
will quickly start to overfit the training set, leading to poor out-of-sample predictions. The
dropout layer independently disables some random number of neurons of its parent layer with
some fixed probability. Disabling a random set of neurons makes the overall learning of the
network more noisy. Without any dropout layer, the layers of the network might learn to correct
mistakes made by the previous layers, leading to undesirable overfitting. In the presence of
a dropout layer, this effect is greatly reduced.

The network starts training by minimizing the standard mean-squared error loss function using
Adam (Kingma and Ba, 2014), one of the most popular gradient descent algorithms. This
is done by dividing the entire formation set into batches of 16 samples and executing the
mini-batch gradient descent algorithm until the entire formation set has passed through the
network 64 times (one full pass-through is commonly referred to as an epoch). The choice of
hyperparameters should ideally be determined by cross-validation to maximize the performance of the
network. The hyperparameters here were chosen by comparing the validation error over a wide
grid of values; while this resembles cross-validation, it is formally not the same. Either way, the
chosen hyperparameters seem to provide good results.
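A sketch of how this architecture and training setup might look in Keras is given below; the dropout rate is an assumption, as its value is not reported. With six input features (the spread, the two cumulative return series, their two industry averages, and the index), two stacked LSTM layers of 16 and 8 units plus a single output neuron amount to the 2281 trainable parameters mentioned above.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_network(window=10, n_features=6, dropout_rate=0.2):
    """Two stacked LSTM layers with a dropout layer in between and one
    output neuron for the two-week spread forecast."""
    model = Sequential([
        LSTM(16, return_sequences=True, input_shape=(window, n_features)),
        Dropout(dropout_rate),
        LSTM(8),
        Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')   # Adam + mean-squared error
    return model

# model = build_network()
# model.fit(X_train, y_train, batch_size=16, epochs=64, verbose=0)
```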

At this point, the network has been fully trained and is ready to start forecasting the future
spread. The network then starts predicting the future spread iteratively for each sliding window
of the trading set. As the network is not updated after each new sample of the trading period, its
weights and bias parameters are based entirely on the samples of the formation period. This does
however greatly improve the overall speed of the strategy, as training the neural network to the
formation set takes by far the longest time. The trade-off is the lowered overall accuracy of the
network, as later parts of the trading set have had less relevant training than those in the earlier
parts. So ideally, the model should be retrained after each new sample in the trading set, to
mimic the natural flow of information. Doing so for each pair is computationally very expensive
and would take roughly two to three weeks of total training time. This was unfortunately not
possible and highlights the major caveat of the current implementation.

The trading period starts with forming the cumulative return indices and determining the spread
between each nominated pair. Taking the difference between the actual and predicted future
spreads provides the direction and magnitude in which the spread is predicted to deviate and has
similar interpretation as the mispricing index of the copula strategy. The last five predictions

are then smoothed to reduce the effect of any strong short term deviations. Similar to the DM,
the opening and closing of positions is determined by applying a standard deviation metric. As
the predictions are made two weeks in advance, the threshold should be substantially higher
than the opening threshold of the DM. If the difference between the observed spread and the
smoothed forecast exceeds 𝑘 1 = 5 historical standard deviations, signaling a strong deviation
from the spread, a position is opened. The historical standard deviation is again determined
based on the formation period only. The position is reversed whenever it falls below 𝑘 2 = 0.5
historical standard deviations or whenever the trading period is at an end. If 𝜃˜𝑡 represents this
difference, then the opening and closing signals 𝑜𝑡 and 𝑐 𝑡 at time 𝑡 are given by

$$o_t = \mathbf{1}\{\tilde{\theta}_t \geq k_1 \tilde{\sigma}\} - \mathbf{1}\{\tilde{\theta}_t \leq -k_1 \tilde{\sigma}\} \quad \text{and} \quad c_t = \mathbf{1}\{k_2 \tilde{\sigma} \geq \tilde{\theta}_t \geq -k_2 \tilde{\sigma}\}.$$
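A sketch of this trading rule is given below; the five-prediction smoothing is implemented as a trailing moving average, which is an assumption, as the exact smoother is not specified.

```python
import numpy as np

def lstm_signals(spread, forecast, sigma_formation, k1=5.0, k2=0.5, smooth=5):
    """Open/close signals from the gap between the observed spread and the
    smoothed two-week-ahead forecast (theta in the text)."""
    smoothed = np.array([forecast[max(0, t - smooth + 1):t + 1].mean()
                         for t in range(len(forecast))])
    theta = spread - smoothed
    open_signal = ((theta >= k1 * sigma_formation).astype(int)
                   - (theta <= -k1 * sigma_formation).astype(int))
    close_signal = (np.abs(theta) <= k2 * sigma_formation).astype(int)
    return open_signal, close_signal
```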

3.5. Trading costs

Trading costs play a crucial part in the assessment of the performance of any given portfolio. The
main elements of transaction costs are commissions and market impact costs, plus an additional
loan fee in the event of short selling. Neglecting trading costs might make an unprofitable
strategy appear profitable, and due to the overall frequency of trades this effect compounds quite
quickly. Generally, trading costs are market- and investor-dependent, and so there is no perfect
way to incorporate these into the model. This application uses idealized trading costs that
resemble the trading costs faced by an institutional investor as closely as possible.

Commissions have declined considerably over the last decades, which necessitates the use of
time-varying commissions. Do and Faff (2012) find that the average commissions institutional
investors face have declined from 10 to 8 basis points (bps) during the years 1999 to 2009. For
the remaining years, it is assumed that this declining trend continues to approximately 3 bps
in 2020, which is the current rate for medium to large sized institutional investors applied to
European equities trading reported by InteractiveBrokers.

In a more recent study, Frazzini et al. (2018) develop a model to estimate the market impact
costs and report two relevant findings: (1) the overall size of a trade, typically represented
as a percentage of the daily trading volume (DTV), is the most significant variable affecting
price impact; and (2) the average market impact costs associated with a short sale are not
significantly different from one when selling long. The monthly one-way trading costs are
determined by the median of the distribution resulting from their market impact model, with

the DTV corresponding to one percent. As such, the resulting market impact costs are likely
to be an overestimation. The estimates are based upon live trading data and are available from
early 1998 until late 2016, which means that the last few years of the application lack estimates.
In the missing years, it is assumed that the dynamics of the market impact costs follow an
Ornstein-Uhlenbeck process, which is a mean-reverting process with temporal shocks.
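As an illustration, an Euler discretization of such a process could be used to extrapolate the monthly cost estimates past 2016; the parameters mu, kappa and sigma are placeholders that would have to be estimated from the 1998-2016 series.

```python
import numpy as np

def simulate_ou(x0, mu, kappa, sigma, n_steps, dt=1.0, seed=0):
    """Euler scheme for dX_t = kappa * (mu - X_t) dt + sigma dW_t,
    a mean-reverting process with temporal shocks."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        x[t] = (x[t - 1] + kappa * (mu - x[t - 1]) * dt
                + sigma * np.sqrt(dt) * rng.standard_normal())
    return x
```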


Figure 2: Monthly estimates of the trading costs (measured in basis points) following the
market impact model from Frazzini et al. (2018).

In addition to commissions and market impact costs, short sellers also face an extra fee associated
with the loan of the asset. This loan charges a short interest rate during the full duration of
the position, which varies based upon the individual characteristics of the asset that is being
shorted. As the constituents of the SXXP index are selected based on market capitalization and
liquidity, it follows that the short interest rate of the SXXP constituents must be relatively low
compared to smaller caps. This rate is idealized to start at 0.6% per annum at the start of 1999,
decreasing gradually to 0.3% per annum at the end of 2020.

3.6. Performance calculation

The main performance criterion by which the different PTS are compared is the average monthly
return of the resulting strategy’s portfolio, which can be determined in various forms. As the
payoffs are determined by the difference between a long and short position, these represent
excess returns. In principle, these positions are self-financing: the proceeds of the short-sale

could be used to buy the other asset. For return calculations, the valuation of the long-short
position is set to 1 dollar, which is common practice if the goal is to determine the actual
return.

There exist various return measures that can be used to assess the performance of a portfolio.
Gatev et al. (2006) propose two of them: one based upon employed capital and one based upon
committed capital. Returns based on employed capital only take the capital which has actively
been used to initiate the trading positions into consideration. On the other hand, returns based
on committed capital suppose the investor reserves a certain amount of capital at the start of the
trading period for each pair. If 𝑟 𝑡𝑖 denotes the return of pair 𝑖 = 1, . . . , 𝑘 at time 𝑡 = 1, . . . , 𝑇,
then the return on employed and committed capital of the portfolio are respectively given by
$$R_t^e = \sum_{i=1}^{k} \frac{r_t^i}{n_t} \quad \text{and} \quad R_t^c = \sum_{i=1}^{k} \frac{r_t^i}{k}, \tag{4}$$

where 𝑛𝑡 denotes the number of pairs that open up a long-short position at least once during
the trading period. The return on employed capital is therefore more of a direct measure that
evaluates the performance of the portfolio only on its active pairs. In contrast, the return
on committed capital is a more conservative measure that assumes that capital has been
set aside for each nominated pair. If funds are flexible with their investment opportunities,
then return on employed capital might be preferable to reflect that the remaining capital can
be invested elsewhere, for example in already existing positions. Furthermore, idle positions
typically do not benefit from interest. In any event, the two measures are closely comparable
and the preference for either measure is largely subjective. For these reasons, the returns
presented in this thesis are all based upon employed capital.
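A minimal illustration of the two measures in equation (4), with hypothetical argument names:

```python
import numpy as np

def monthly_portfolio_returns(pair_returns, n_active, n_pairs):
    """Equation (4): return on employed capital divides the summed pair
    returns by the number of pairs that traded at least once (n_active);
    return on committed capital divides by all nominated pairs (n_pairs)."""
    total = np.sum(pair_returns)
    return total / n_active, total / n_pairs
```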

4. Results

4.1. Descriptive statistics

Table 4 presents the characteristics of the distribution of the monthly returns. Of the
three PTS, the LSTM strategy provides the highest overall monthly return, with or without the
inclusion of trading costs, followed by the DM strategy. The copula strategy results in low
monthly returns and is even surpassed by the risk-free asset (a one-month Treasury bill), as
indicated by its negative Sharpe ratio, with or without trading costs. This makes the copula
strategy rather unattractive from an economic point of view. The DM strategy is also surpassed
(over the whole time horizon) by the risk-free asset if trading costs are taken into account. This
means that only the LSTM strategy is able to generate any worthwhile profits; however, these
returns are rather low.

In addition to the Sharpe ratio, table 4 also reports the Sortino ratio of each different strategy.
This ratio is an alternative risk-adjusted measure and is very similar to the Sharpe ratio, but
unlike the Sharpe only penalizes the downside risk of the portfolio. As the Sharpe ratio is
defined by dividing a mean excess return by its standard deviation, this metric will penalize
positive and negative returns equally, while in practice, only negative returns might constitute a
risk. The Sortino ratios show that the LSTM strategy is again the only profitable strategy (after
trading costs) on a risk-adjusted basis.

Strategy Mean t-stat Std. dev. Sharpe Sortino Min Max Skewness Kurtosis
Panel A: Before trading costs
DM 0.00230∗∗ 3.01 0.0118 0.090 0.130 -0.0590 0.0400 -0.298 3.279
Copula 0.00113 1.51 0.0116 -0.010 -0.016 -0.0390 0.0469 0.496 2.493
LSTM 0.00395∗∗∗ 3.42 0.0178 0.153 0.228 -0.0609 0.0724 -0.013 2.160
Panel B: After trading costs
DM 0.00111 1.45 0.0117 -0.012 -0.017 -0.0600 0.0377 -0.330 3.290
Copula −0.00027 -0.36 0.0115 -0.131 -0.215 -0.0408 0.0453 0.486 2.533
LSTM 0.00274∗∗ 2.38 0.0177 0.085 0.126 -0.0620 0.0711 -0.028 2.203

Table 4: Key characteristics of the monthly excess return distribution for each of the three
PTS applied during 2000-2020. ∗ , ∗∗ , and ∗∗∗ represent statistical significance at the 10%,
5%, and 1% levels respectively.

Figure 3 presents the cumulative returns of the three PTS before and after trading costs along
with the SXXP dividend-adjusted return. Without trading costs, all three strategies seem to
outperform the index at the end of the time period. However, the DM and copula strategies are
stagnating in cumulative returns after 2010. In the presence of trading costs, these strategies
actually start accumulating losses. The LSTM strategy experiences the same issue, albeit
only six years later. During this period, this strategy is able to generate significant returns,
especially during 2016-2017, when the market was still recovering from the stock market sell-
off that happened a year earlier. Interestingly, each strategy exhibits increased returns when the
index is performing poorly, which shows that the strategies are largely independent of the
index return.

The observations made in figure 3 carry over when we consider the risk-adjusted return.
The two-year rolling Sharpe ratios of the PTS are shown in figure 4 and confirm that the DM
and copula strategies are largely unprofitable throughout the sample period. The three PTS move
similarly in risk-adjusted returns, but over the entire time period, the LSTM strategy has the
best overall risk-adjusted return.

Note, however, that this is still dependent on the chosen time period, as the strategy actually suffers
from negative risk-adjusted returns during e.g. 2005-2007, 2011-2014, and 2018 onwards. Its
current applicability is therefore limited, and it seems that this strategy is mainly suitable for
picking up alpha or hedging against market risk during volatile financial times.

Figure 3: Cumulative excess returns of the three different strategies and the SXXP index. Panel (a): before trading costs. Panel (b): after trading costs.

These results generally seem to align with the findings of Do and Faff (2010), Do and Faff
(2012) and Rad et al. (2016). These studies document the declining trend in the profitability of
PTS in the United States over a wide time period and find that this trend started showing during the
1990s. Due to the relatively young age of the SXXP, there is no way to formally test whether the same
declining trend was also apparent around that time, but the stagnating cumulative returns of the
DM and copula strategies suggest that the declining profitability of PTS in European markets
must have begun before the inception of the index, very likely around the same time, which
makes sense from an arbitrage point of view.

Figure 4: Two-year rolling Sharpe ratios for each PTS. Panel (a): before trading costs. Panel (b): after trading costs.

4.2. Trade characteristics

The characteristics of the long-short positions of the three PTS are shown in table 5. Converged
positions are positions that are closed on the basis of trading rules and conversely, unconverged
positions are positions that are forcefully closed at the end of the trading period. The DM
and copula strategies open up significantly more positions compared to the LSTM strategy, but
generally attain lower returns in either event. The copula strategy has the highest percentage
of converged positions, though these positions have relatively low average returns of 1.3%

compared to the LSTM or DM strategies, which attain average returns of 4.6% and 5.3%
respectively on converged positions. Interestingly, the DM seems to perform better on converged
positions, with higher average returns and lower standard deviations than the other strategies.
However, this strategy suffers the most from unconverged positions, as it has the lowest average
return at -5.5% per unconverged trade. Considering the DM strategy has approximately one
unconverged trade for each converged trade, this effect is detrimental for its overall performance.
Incorporating stopping mechanisms into the trading rules of each strategy could potentially
decrease the overall loss of unconverged positions and could be beneficial for strategies with a
high fraction of unconverged trades or severe losses on these.

Strategy | Pairs | Positions | Type | Type (%) | Mean | Std. dev. | Avg. days open
DM | 2335 | 7478 | Converged | 53.0 | 0.0532 | 0.0288 | 41
 | | | Unconverged | 47.0 | -0.0525 | 0.1228 | 100
Copula | 2141 | 7655 | Converged | 72.9 | 0.0133 | 0.0512 | 28
 | | | Unconverged | 27.1 | -0.0382 | 0.1220 | 85
LSTM | 1426 | 2674 | Converged | 52.3 | 0.0459 | 0.0805 | 34
 | | | Unconverged | 47.7 | -0.0299 | 0.1157 | 87

Table 5: Key statistics of the long-short positions opened by the three PTS. The reported
returns of the positions are after trading costs.

Figure 5 shows the distributions of the after-cost returns for each different strategy. All of the
strategies appear to have fatter left-tails than right-tails, which is very likely to be caused by
the lack of a stopping mechanism in the case of a diverged position. In some sense, converged
positions do execute based on a stopping mechanism, which is why the distribution is leaning
more heavily to the right. Interestingly, the LSTM strategy shows heavier right-tails than the
other two strategies, which could be explained by its very strict opening criteria. If a position is
eventually opened, it is more likely to generate a higher return, which is also seen from table 5.

Figure 5: Distribution of the returns of the positions initiated by each different PTS in absence of trading costs. Panel (a): DM strategy. Panel (b): Copula strategy. Panel (c): LSTM strategy.
4.3. Risk characteristics

This section investigates the exposure of the different PTS to systematic sources of risk using the
Fama-French five-factor model (Fama and French, 2015). This model is an extension of their
Nobel-prize winning three-factor model and introduces two new risk factors. The five-factor
model captures the exposure of a portfolio relative to general market (Mkt-RF), small minus
big capitalization (SMB), high minus low book-to-market (HML), robust minus weak (RMW),
and conservative minus aggressive (CMA) risk factors. The returns (in excess of the risk-free
rate) of the portfolios corresponding to each PTS are regressed on the European risk factors.
The resulting coefficient estimates represent the exposure of the strategy's portfolio to each
particular risk factor. The standard errors of the regression model are based on
the Newey-West covariance estimator with lags corresponding to six months, to correct for the
autocorrelation induced by the overlapping portfolios. All of the factors
are retrieved from the data library of Kenneth French. In such a regression, the intercept is
often referred to as the alpha. A positive alpha indicates a portfolio's ability to outperform the
market, but without statistical significance it has little value.

Strategy | Alpha | Mkt-Rf | SMB | HML | RMW | CMA
Panel A: Before transaction costs
DM | 0.0012 | 0.0020 | −0.0194 | 0.0600 | −0.0429 | −0.0648
 | (0.001) | (0.021) | (0.037) | (0.052) | (0.047) | (0.076)
Copula | 0.0001 | 0.0001 | −0.0288 | 0.0893 | 0.0152 | −0.1774
 | (0.001) | (0.025) | (0.035) | (0.075) | (0.044) | (0.113)
LSTM | 0.0026∗∗ | 0.0060 | −0.1398∗∗ | 0.0557 | 0.0816 | −0.0146
 | (0.001) | (0.036) | (0.058) | (0.069) | (0.062) | (0.099)
Panel B: After transaction costs
DM | 0.0001 | 0.0015 | −0.0197 | 0.0547 | −0.0456 | −0.0657
 | (0.001) | (0.021) | (0.037) | (0.052) | (0.047) | (0.075)
Copula | −0.0013 | −0.0006 | −0.0287 | 0.0828 | 0.0116 | −0.1785
 | (0.001) | (0.025) | (0.035) | (0.075) | (0.044) | (0.113)
LSTM | 0.0014 | 0.0059 | −0.1390∗∗ | 0.0497 | 0.0780 | −0.0134
 | (0.001) | (0.036) | (0.058) | (0.068) | (0.061) | (0.098)

Table 6: This table presents the coefficient estimates of the monthly excess return series
regressed on the risk factors of the Fama-French five-factor model (Fama and French, 2015).
The standard errors are based upon the Newey-West estimator with 6 lags.
∗, ∗∗, and ∗∗∗ represent statistical significance at the 10%, 5%, and 1% levels respectively.

Table 6 reports the outcomes of this regression. In the absence of trading costs, only the LSTM
strategy seems to be able to generate a significantly positive alpha. The DM shows a moderately
sized alpha, but with 𝑝 = 0.19 it is not statistically significant. Furthermore, only the LSTM
strategy shows any significant exposure to the risk factors. In the presence of trading
costs, all of the alphas are insignificant, implying that the profitability of the strategies relative
to the overall market is highly sensitive to trading costs.

The Carhart four-factor model (Carhart, 1997) was originally introduced as an extension of
Fama-French's three-factor model to capture the risk exposure of a portfolio to momentum,
which describes the tendency of a stock to continue a trend it has been following for longer
periods of time. However, the momentum risk factor can also be added to the five-factor
model. Table 7 presents the results of regressing the excess returns of the three PTS on this
expanded model.

Strategy | Alpha | Mkt-Rf | SMB | HML | RMW | CMA | MOM
Panel A: Before transaction costs
DM | 0.0017∗∗ | −0.0096 | 0.0034 | 0.0257 | 0.0062 | 0.0044 | −0.0765∗∗∗
 | (0.001) | (0.022) | (0.041) | (0.045) | (0.062) | (0.068) | (0.029)
Copula | 0.0006 | −0.0139 | −0.0013 | 0.0481 | 0.0743 | −0.0940 | −0.0921∗∗∗
 | (0.001) | (0.026) | (0.035) | (0.055) | (0.058) | (0.084) | (0.033)
LSTM | 0.0031∗∗ | −0.0067 | −0.1148∗ | 0.0181 | 0.1355∗∗ | 0.0614 | −0.0839∗∗
 | (0.001) | (0.036) | (0.064) | (0.075) | (0.062) | (0.098) | (0.034)
Panel B: After transaction costs
DM | 0.0005 | −0.0100 | 0.0029 | 0.0208 | 0.0030 | 0.0028 | −0.0757∗∗∗
 | (0.001) | (0.022) | (0.041) | (0.045) | (0.062) | (0.067) | (0.029)
Copula | −0.0008 | −0.0145 | −0.0015 | 0.0418 | 0.0703 | −0.0957 | −0.0915∗∗∗
 | (0.001) | (0.025) | (0.035) | (0.055) | (0.058) | (0.084) | (0.034)
LSTM | 0.0019 | −0.0067 | −0.1144∗ | 0.0127 | 0.1310∗∗ | 0.0614 | −0.0826∗∗
 | (0.001) | (0.035) | (0.064) | (0.074) | (0.061) | (0.097) | (0.034)

Table 7: This table presents the coefficient estimates of the monthly excess return series
regressed on the risk factors of the Fama-French five-factor model (Fama and French, 2015)
and an additional momentum factor (Carhart, 1997). The standard errors are based
upon the Newey-West estimator with 6 lags. ∗, ∗∗, and ∗∗∗ represent statistical
significance at the 10%, 5%, and 1% levels respectively.

The regression shows that all three PTS exhibit a significant negative loading on the momentum
factor, which is not surprising as PTS are contrarian in nature. In addition to the
SMB risk factor, the LSTM strategy now also exhibits a significant (positive) exposure to the
RMW risk factor. Since the risk factor corresponding to the excess market return is
insignificant for all three strategies, the market-neutrality of the PTS is further underlined. This
means that they pose an attractive diversifier for investment portfolios that seek to mitigate some
form of market risk.

4.4. Sensitivity analysis

The opening threshold is one of the most important variables, as it directly controls the
rate at which new positions are opened. This section examines how changes in the opening
threshold affect the overall performance of the strategies in the presence of trading costs. Figure 6
presents the changes in cumulative returns of the three PTS for different opening thresholds and
Table 8 reports how the different opening parameters change the composition of the long-short
positions.

As expected, all three strategies open up fewer positions when the opening threshold is raised
and mostly attain higher average returns on both converged and unconverged positions. For the
LSTM strategy, the marked increase in profitability of converged and unconverged trades at an
opening threshold of 𝑘1 = 6 is offset by a very low number of initiated positions, so the strategy
benefits less from these improved returns. There is thus a trade-off between the number of
converged positions and their returns, and the threshold that balances high per-trade returns with
a sufficiently large number of opened positions typically yields the best overall performance.
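
As an illustration of how such a threshold sweep can be organized, the sketch below loops over candidate opening thresholds and records a few summary statistics per run. The callable run_strategy is hypothetical: it stands in for a full backtest of one of the three PTS that returns its monthly net return series for a given opening threshold.

import numpy as np

def sweep_opening_thresholds(run_strategy, thresholds):
    """Backtest the same strategy under different opening thresholds."""
    summary = {}
    for k1 in thresholds:
        monthly_returns = np.asarray(run_strategy(opening_threshold=k1))
        summary[k1] = {
            "cumulative_return": float(np.prod(1.0 + monthly_returns)),
            "mean_monthly": float(monthly_returns.mean()),
            "std_monthly": float(monthly_returns.std(ddof=1)),
        }
    return summary

# Example mirroring panel (a) of Figure 6:
# stats = sweep_opening_thresholds(run_dm_backtest, thresholds=[1.5, 2.0, 2.5])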

[Three panels, each plotting cumulative return against time (2000–2020): (a) the DM strategy
for opening thresholds 1.5, 2.0, and 2.5; (b) the copula strategy for thresholds 0.3, 0.5, and 0.7;
(c) the LSTM strategy for thresholds 4.0, 5.0, and 6.0.]

Figure 6: Sensitivity analysis on the opening threshold of the three PTS.

Threshold   Pairs   Positions   Type         Type (%)   Mean      Std. dev.   Avg. days open

Panel A: Distance Method
𝑘1 = 1.5    2431    9562        Converged    61.3       0.0401    0.0237      35
                                Unconverged  38.7       −0.0571   0.1251      102
𝑘1 = 2.0    2335    7478        Converged    53.0       0.0532    0.0288      41
                                Unconverged  47.0       −0.0523   0.1228      100
𝑘1 = 2.5    2307    6045        Converged    54.4       0.0654    0.0333      45
                                Unconverged  45.6       −0.0472   0.1213      98

Panel B: Copula Strategy
𝑘1 = 0.3    2238    11828       Converged    80.0       0.0075    0.0452      22
                                Unconverged  20.0       −0.0359   0.1220      85
𝑘1 = 0.5    2141    7655        Converged    72.9       0.0133    0.0512      28
                                Unconverged  27.1       −0.0382   0.1220      84
𝑘1 = 0.7    1992    5433        Converged    65.8       0.0185    0.0560      33
                                Unconverged  34.2       −0.0378   0.1217      83

Panel C: LSTM strategy
𝑘1 = 4.0    1895    4558        Converged    62.7       0.0352    0.0683      33
                                Unconverged  37.3       −0.0401   0.1113      88
𝑘1 = 5.0    1426    2674        Converged    52.3       0.0459    0.0805      34
                                Unconverged  47.7       −0.0296   0.1157      87
𝑘1 = 6.0    1006    1631        Converged    56.7       0.0597    0.0990      37
                                Unconverged  43.3       −0.0210   0.1215      85

Table 8: This table shows how the opening threshold changes key statistics of the long-short
positions.

5. Conclusion

This thesis studied the performance of three pairs trading strategies applied to constituents of
the SXXP index from August 1999 until May 2020. The DM, copula, and LSTM strategies result
in average monthly returns of 11, -3, and 27 basis points respectively after trading costs are
taken into account. In terms of the two risk-adjusted measures considered, the Sharpe and Sortino
ratios, only the LSTM strategy achieves positive risk-adjusted returns. The LSTM strategy
proposed in this thesis significantly outperforms both the DM and copula strategies in economic
and risk-adjusted terms. This is primarily due to its ability to form higher-quality positions:
it opens almost three times fewer positions than the other two strategies, but these positions
carry higher average returns regardless of whether they converge. This does come with a
substantial increase in variance, which, on a risk-adjusted basis, does not outweigh the large
increase in average monthly returns.

While all three strategies appear market-neutral, none of them shows a meaningful link with the
systematic risk factors of the Fama-French five-factor model. The different strategies do,
however, correlate significantly with the momentum factor, which follows from the contrarian
nature of PTS. Furthermore, each strategy exhibits no strong dependence on the overall SXXP
return and performs moderately to extremely well when the index is performing poorly. In the
years following the dot-com bubble and the global financial crisis, the LSTM strategy attained
annual returns of 19.1% and 26.6% respectively, which is exceptionally high. This momentum
did not persist, however, and after the 2017 upswing the strategy no longer appears profitable.

Due to their inherent market-neutrality, portfolios based on these PTS could serve as a viable
way to diversify a portfolio during volatile economic times, as their low exposure to market
risk allows them to pick up alpha. In their current form, however, they cannot be recommended,
as their risk-adjusted returns are mostly negative. Analyzing the characteristics of the opened
positions, the copula strategy shows the highest percentage of converged positions but sustains
large losses on its unconverged positions, making the strategy relatively unattractive overall.
Accordingly, trading costs also weigh more heavily on the copula strategy. It would be interesting
to see how the use of stopping criteria could affect the performance of the different PTS, as the
primary losses of each strategy stem from unconverged positions.

However, the current design of the LSTM network is far from optimal. The hyperparameters
are chosen without any formal cross-validation and the network is not retrained as new
observations arrive during the trading period. A network that iteratively updates with new
observations should yield considerably improved predictive power; a sketch of such an update
loop is given below. Nevertheless, the LSTM network shows promising results, even in its
current, unrefined form. Furthermore, pairs are formed using the same criterion for each of the
three PTS, which mainly highlights a strategy's ability to extract returns from deviations in a
potential spread equilibrium. This raises the question of how the strategies would perform if
each of them formed its own pairs, instead of using the same sum of squared deviations rule.
For instance, the inference functions for margins method could be applied to all possible pairs
in the market and the resulting joint distribution function could rank pairs more efficiently and
effectively. Additionally, LSTM networks are extremely flexible by construction and could in
principle be used to form their own pairs and trading rules, essentially encapsulating a completely
autonomous trading system. These points could serve as an interesting starting point for future
research in the PTS literature.
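
As an indication of what such iterative updating might look like, the following is a minimal sketch in Python that periodically refits an already-trained Keras LSTM on an expanding window during the trading period. The objects model and make_windows, as well as the refit frequency, are assumptions made purely for illustration and are not part of the design used in this thesis.

def walk_forward_update(model, spread_history, make_windows, refit_every=20):
    """Periodically refit the network on all observations seen so far."""
    for t in range(refit_every, len(spread_history), refit_every):
        # Turn the spread observations available up to day t into the
        # input/target arrays the network was originally trained on.
        X_seen, y_seen = make_windows(spread_history[:t])
        # A brief additional pass over the expanded training set; forecasts
        # for the next refit_every days would be generated here afterwards.
        model.fit(X_seen, y_seen, epochs=1, batch_size=32, verbose=0)
    return model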

References

Mark M Carhart. On Persistence in Mutual Fund Performance. The Journal of Finance, 52(1):
57–82, 1997.

R. Cont. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative
Finance, 1(2):223–236, 2001.

Binh Do and Robert Faff. Does Simple Pairs Trading Still Work? Financial Analysts Journal,
66(4):83–95, 2010.

Binh Do and Robert Faff. Are Pairs Trading Profits Robust to Trading Costs? Journal of
Financial Research, 35(2):261–287, 2012.

Eugene F Fama and Kenneth R French. Common risk factors in the returns on stocks and bonds.
Journal of Financial Economics, 33:3–56, 1993.

Eugene F Fama and Kenneth R French. A Five-Factor Asset Pricing Model. Journal of Financial
Economics, 116(1):1–22, 2015.

Thomas Fischer and Christopher Krauss. Deep learning with long short-term memory networks
for financial market predictions. European Journal of Operational Research, 270(2):654–
669, 2018.

Andrea Frazzini, Ronen Israel, and Tobias J Moskowitz. Trading Costs. Available at SSRN
3229719, 2018.

Evan Gatev, William N Goetzmann, and K Geert Rouwenhorst. Pairs Trading: Performance of
a Relative Value Arbitrage Rule. The Review of Financial Studies, 19(3):797–827, 2006.

Felix A Gers, Jürgen A Schmidhuber, and Fred A Cummins. Learning to Forget: Continual
Prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.

Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber.
LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning
Systems, 28(10):2222–2232, 2016.

Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 9
(8):1735–1780, 1997.

Narasimhan Jegadeesh. Evidence of predictable behavior of security returns. The Journal of
Finance, 45(3):881–898, 1990.

Harry Joe. Multivariate Models and Multivariate Dependence Concepts. CRC Press, 1997.

Iebeling Kaastra and Milton Boyd. Designing a neural network for forecasting financial and
economic time series. Neurocomputing, 10(3):215–236, 1996.

Diederik P Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv
preprint arXiv:1412.6980, 2014.

Christopher Krauss. Statistical arbitrage pairs trading strategies: Review and outlook. Journal
of Economic Surveys, 31(2):513–545, 2017.

Rong Qi Liew and Yuan Wu. Pairs trading: A copula approach. Journal of Derivatives & Hedge
Funds, 19(1):12–30, 2013.

Roger B Nelsen. An Introduction to Copulas. Springer Science & Business Media, 2007.

Hossein Rad, Rand Kwong Yew Low, and Robert Faff. The Profitability of Pairs Trading
Strategies: Distance, Cointegration, and Copula Methods. Quantitative Finance, 16(10):
1541–1558, 2016.

M Sklar. Fonctions de Répartition à n Dimensions et Leurs Marges. Publications de l'Institut
de Statistique de l'Université de Paris, 8:229–231, 1959.

Johannes Stübinger, Benedikt Mangold, and Christopher Krauss. Statistical arbitrage with vine
copulas. Quantitative Finance, 18(11):1831–1849, 2018.

Ganapathy Vidyamurthy. Pairs Trading: Quantitative Methods and Analysis, volume 217. John
Wiley & Sons, 2004.

Wenjun Xie and Yuan Wu. Copula-Based Pairs Trading Strategy. Asian Finance Association,
2013.
