Abstract
We demonstrate an application of online transfer learning as a digital assets trading agent.
This agent makes use of a powerful feature space representation in the form of an echo state
network, the output of which is made available to a direct, recurrent reinforcement learning
agent. The agent learns to trade the XBTUSD (Bitcoin versus US dollars) perpetual swap
derivatives contract on BitMEX. It learns to trade intraday on five minutely sampled data,
avoids excessive over-trading, captures a funding profit and is also able to predict the
direction of the market. Overall, our crypto agent realises a total return of 350%, net of
transaction costs, over roughly five years, of which 71 percentage points come from funding
profit. The annualised information ratio that it achieves is 1.46.
Keywords: online learning, transfer learning, echo state networks, recurrent reinforcement
learning, financial time series
1. Introduction
Financial time series pose many modelling challenges for both researchers and practitioners.
In some circumstances data are sparse, and in others the datasets are vast; this affects
the choice of model and the learning style. Financial time series are typically both
autocorrelated and nonstationary, requiring approaches such as integer or fractional
differencing (Shumway, 2011) to remove these effects and facilitate correct feature selection;
Granger and Newbold (1974) identified that, if left unmitigated, these effects lead to
spurious regressions.
Against this backdrop, we explore different forms of online learning, including sequential
learning, transfer learning (Yang et al., 2020) and reinforcement learning (Sutton and Barto,
2018). More concretely, we transfer the learning of an echo state network (Jaeger, 2002)
to a direct, recurrent reinforcement learning agent (Moody et al., 1998) that must learn to
trade digital asset futures, specifically the XBTUSD (Bitcoin versus US dollar) perpetual
swap on the BitMEX exchange. In addition, our transfer learner benefits from an ample,
dynamic reservoir feature space and can identify and learn from the different sources of
impact on profit and loss, including execution costs, exchange fees, funding costs and price
moves in the market.
2. Background
In this section, we provide a brief overview of the related ideas that we use in our exper-
imentation, namely transfer learning, reservoir computing and a form of policy gradients
reinforcement learning.
Borrageiro, Firoozye and Barucca
The Recurrent Reinforcement Learning Crypto Agent
values, policy gradient methods learn a parameterised policy that can select actions without
the use of a value function. Williams also introduced his REINFORCE algorithm
$$\Delta w_{ij} = \eta_{ij} (r - b_{ij}) \frac{\partial \ln \pi_i}{\partial w_{ij}}, \tag{1}$$
where $w_{ij}$ is the model weight going from the $j$-th input to the $i$-th output and $w_i$ is
the weight vector for the $i$-th hidden processing unit in a network of such units, whose
goal it is to adapt in such a way as to maximise the scalar reward $r$. The learning rate
is $\eta_{ij}$ and the weight update of equation 1 is typically applied with gradient ascent. The
reinforcement baseline $b_{ij}$ is conditionally independent of the model outputs $y_i$, given the
network parameters $w$ and inputs $x_i$. The characteristic eligibility of $w_{ij}$ is
$\partial \ln \pi_i / \partial w_{ij}$, where $\pi_i(y_i = c, w_i, x_i)$ is the probability mass function
determining the value of $y_i$ as a function of the parameters of the unit and its input.
Baseline subtraction $r - b_{ij}$ plays an important role in reducing the variance of gradient
estimators and Sugiyama (2015) shows that the optimal baseline is given as
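$$b^{*} = \frac{\mathbb{E}_{p(r|w)}\left[ r_t \left\| \sum_{t=1}^{T} \nabla \ln \pi(a_t | s_t, w) \right\|^2 \right]}{\mathbb{E}_{p(r|w)}\left[ \left\| \sum_{t=1}^{T} \nabla \ln \pi(a_t | s_t, w) \right\|^2 \right]}, \tag{2}$$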
where the policy function π(at |st , w) denotes the probability of taking action at at time t
given state st , parameterised by w. The main result of Williams’s paper is
Theorem 3 For any REINFORCE algorithm, the inner product of $E[\Delta w|w]$ and $\nabla E[r|w]$ is
non-negative and if $\eta_{ij} > 0$, then this inner product is zero if and only if $\nabla E[r|w] = 0$. If
$\eta_{ij}$ is independent of $i$ and $j$, then $E[\Delta w|w] = \eta \nabla E[r|w]$.
This result relates $\nabla E[r|w]$, the gradient in weight space of the performance measure
$E[r|w]$, to $E[\Delta w|w]$, the average update vector in weight space. Thus for any REINFORCE
algorithm, the average update vector in weight space lies in a direction for which this
performance measure is increasing and the quantity $(r - b_{ij})\, \partial \ln \pi_i / \partial w_{ij}$
represents an unbiased estimate of $\partial E[r|w] / \partial w_{ij}$.
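The update rule described above can be illustrated with a minimal sketch, assuming a single Bernoulli-logistic unit, a toy reward equal to the sampled output, and a running-average baseline. None of these choices come from the paper; the function name `reinforce_step` and all numerical values are hypothetical.

```python
import numpy as np

def reinforce_step(w, x, reward_fn, baseline, eta, rng):
    """One REINFORCE update for a single Bernoulli-logistic unit."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))   # probability that the output y equals 1
    y = float(rng.random() < p)          # sample the stochastic output
    r = reward_fn(y)                     # scalar reward for this output
    eligibility = (y - p) * x            # d ln(pi) / dw for a Bernoulli-logistic unit
    return w + eta * (r - baseline) * eligibility, r

rng = np.random.default_rng(0)
w = np.zeros(2)
x = np.array([1.0, 0.5])                 # hypothetical fixed input
baseline = 0.0
for _ in range(2000):
    # Toy reward (assumption): the unit is rewarded when it outputs 1.
    w, r = reinforce_step(w, x, lambda y: y, baseline, eta=0.1, rng=rng)
    baseline = 0.9 * baseline + 0.1 * r  # running-average baseline (assumption)
p_final = 1.0 / (1.0 + np.exp(-(w @ x)))
```

In this toy setting every update moves the weights in a direction that does not decrease the probability of the rewarded output, which is in line with the theorem's statement about the average update direction.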
The reward
$$r_t = \Delta p_t f_{t-1} - \delta_t |\Delta f_t| \tag{8}$$
depends on the change in reference price $p_t$, from which a gross profit and loss (pnl) is
computed, transaction costs $\delta_t$ and the differentiable position function $-1 \le f_t \le 1$. This
position function depends on the model inputs and parameters, $f_t \triangleq f(x_t; \theta_t)$. The right half
of equation 7 shows the dependency of the model parameters on the past sequence of trades.
To correctly compute and optimise these total derivatives requires the use of recurrent
algorithms such as backpropagation through time (Rumelhart et al., 1985; Werbos, 1990)
or real-time recurrent learning (Williams and Zipser, 1989). An undesirable property of
the Sharpe ratio is that it penalises a model that produces returns larger than
$$r_t^{*} = \frac{b_{t-1}}{a_{t-1}}, \tag{9}$$
which is counter-intuitive relative to most investors’ notions of risk and reward (Moody
et al., 1998).
Gold (2003) extends the work of Moody et al. (1998) and investigates high-frequency currency
trading with neural networks trained via recurrent reinforcement learning. He compares
the performance of linear networks with neural networks containing a single hidden layer
and examines the impact of shared system hyper-parameters on performance. In general,
he concludes that the trading systems may be effective but that the performance varies
widely for different currency markets, and simple statistics of the markets cannot explain
this variability.
Tamar et al. (2017) extend the policy gradient method to the whole class of coherent
risk measures that are widely used in finance. They consider both static and time-consistent
dynamic risk measures. Their approach is in the spirit of policy gradient algorithms for
static risk measures and combines a standard sampling approach with convex programming.
For dynamic risk measures, their approach is actor-critic style (Sutton and Barto, 2018)
and involves explicit approximation of value functions.
Luo et al. (2019) build a novel reinforcement learning trading agent, adopting a deep
learning actor-critic algorithm with deterministic policy gradients to find the optimal pol-
icy. Their proposed algorithm has two different convolutional neural network (Goodfellow
et al., 2016) based function approximators and outperforms some baseline methods when
experimenting with stock index futures.
Zhang et al. (2019) use deep reinforcement learning algorithms such as deep q-learning
networks (Mnih et al., 2013), neural policy gradients (Silver et al., 2016) and advantage
actor-critic (Mnih et al., 2016), to design trading strategies for continuous futures contracts.
They use long short-term memory neural networks (Hochreiter and Schmidhuber, 1997)
to train both the actor and critic networks. They show that their method outperforms
various baseline models, delivering positive profits despite high transaction costs. Their
experiments show that the proposed algorithms can follow large market trends without
changing positions and scale down or hold through consolidation periods.
Ye et al. (2020) address the optimal trade execution problem using limit order books,
where a model must learn how best to execute a given block of shares at minimal cost or for
maximal return. To this end, they propose a deep reinforcement learning-based solution
that uses a deterministic policy gradient framework. Experiments on three real market
datasets show that the proposed approach significantly outperforms other methods such
as a submit and leave policy, a q-learning algorithm (Watkins, 1989) and a hybrid method
that combines the Almgren-Chriss model (Almgren and Chriss, 2001) with reinforcement
learning.
We begin with a discussion of the research data that we use in our experiment, followed
by a detailed description of the research methods, finishing up with a description of the
experiment results. As a high-level summary, our goal in this experiment is to explore
transfer learning using a source model, an echo state network, and a target model, a direct,
recurrent reinforcement learning agent. The objective of this meta-model is to learn to
trade digital asset futures, specifically perpetual contracts on the BitMEX crypto exchange.
The dynamical reservoir of the echo state network acts as a powerful nonlinear feature
space; this is fed into the downstream recurrent reinforcement learner, which is aware
of the various sources of impact on profit or loss and learns to target the desired position.
3.2 Funding
The funding rate for the perpetual swap comprises two parts: an interest rate differential
component and the premium or discount of the basis. We denote this funding rate as κt .
The interest differential reflects the borrowing cost of each currency involved in the pair
$$e_t = \frac{e_t^{quote} - e_t^{base}}{T}. \tag{12}$$
For example, with the XBTUSD perpetual swap, the interest rate of borrowing in US
dollars is denoted by $e_t^{quote}$ and the interest rate of borrowing in Bitcoin is represented by
$e_t^{base}$. As funding occurs every 8 hours, $T = 8$. The basis premium / discount component
is computed in a manner similar to equation 11, with some subtleties applied to minimise
market manipulation, such as the use of time and volume weighted average prices. The
funding rate is finally
$$\kappa_t = b_t + \max(\min(\gamma, e_t - b_t), -\gamma), \tag{13}$$
where γ is a basis cap, typically 5 basis points (0.05%). When the basis is positive, funding
will be positive, and traders with long positions (buy XBT, sell USD) will pay those with
short positions (sell XBT, buy USD). Reciprocally, shorts pay longs when the basis is
negative.
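Equations 12 and 13 can be checked numerically with a short sketch. The function names and the example rates below are hypothetical and serve only to illustrate the clamping in equation 13; the default cap of 0.0005 corresponds to the 5 basis points quoted above.

```python
def interest_component(e_quote: float, e_base: float, T: float = 8.0) -> float:
    """Interest rate differential of equation 12."""
    return (e_quote - e_base) / T

def funding_rate(basis: float, interest: float, cap: float = 0.0005) -> float:
    """Funding rate of equation 13: the basis plus the clamped interest-basis gap."""
    return basis + max(min(cap, interest - basis), -cap)

# Hypothetical borrowing rates for the quote and base currencies.
e = interest_component(e_quote=0.0006, e_base=0.0003)

# Large positive basis: the gap is clamped at -cap, so funding stays positive.
kappa_wide = funding_rate(basis=0.0010, interest=e)

# Small basis: the gap lies inside the cap, so funding equals the interest component.
kappa_narrow = funding_rate(basis=0.0001, interest=e)

# With the sign convention of the text, a long position f = 1 pays kappa * f.
long_payment = kappa_wide * 1.0
```

With these inputs the wide-basis case gives a funding rate of 0.0005, so a maximally long position pays funding, while the narrow-basis case collapses to the interest component alone.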
3.3 Methods
We begin with a description of the dynamic reservoir feature space, the resultant learning
of which is transferred to the direct, recurrent reinforcement learner, which targets the
desired risk position.
• Change the signs of a desired number of entries of Whidden to get negative connection weights
as well.
• Sparsify $W^{hidden}$ with probability $P(\alpha)$, $0 \le \alpha < 1$, setting those elements to zero.
This procedure is guaranteed to ensure the echo state property for any input. Intuitively, a
recurrent neural network has the echo state property with respect to an input signal $u_t$ if any
initial network state is forgotten or washed out when the network is driven by $u_t$ for long
enough (Jaeger, 2017). The model supports recurrent connections from either a teacher
signal $y_t \in \mathbb{R}^{n_{back}}$ or model output $\hat{y}_t \in \mathbb{R}^{n_{back}}$. These are connected to the model via
the weight matrix $W^{back} \in \mathbb{R}^{n_{hidden} \times n_{back}}$, whose weights are initialised at random from a
standard normal. Note that $W^{input}$, $W^{hidden}$ and $W^{back}$ have weights that remain fixed.
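The initialisation steps listed above can be sketched as follows. The earlier steps of the procedure are not shown in this section, so the sketch assumes that they draw non-negative hidden weights and rescale them to a spectral radius below one; the input weight distribution, the function name `init_reservoir` and all default values are likewise assumptions.

```python
import numpy as np

def init_reservoir(n_input, n_hidden, n_back, rho=0.9, flip=0.5, alpha=0.5, seed=0):
    """Draw fixed echo state network weights (a sketch under stated assumptions)."""
    rng = np.random.default_rng(seed)
    # Assumed earlier steps: non-negative random weights, rescaled so that the
    # spectral radius equals rho < 1.
    W_hidden = rng.uniform(0.0, 1.0, size=(n_hidden, n_hidden))
    W_hidden *= rho / np.max(np.abs(np.linalg.eigvals(W_hidden)))
    # Change the signs of a fraction `flip` of the entries.
    signs = np.where(rng.random((n_hidden, n_hidden)) < flip, -1.0, 1.0)
    W_hidden *= signs
    # Sparsify: set each entry to zero with probability alpha.
    W_hidden[rng.random((n_hidden, n_hidden)) < alpha] = 0.0
    # Input weights (distribution assumed) and standard normal feedback weights.
    W_input = rng.standard_normal((n_hidden, n_input))
    W_back = rng.standard_normal((n_hidden, n_back))
    return W_input, W_hidden, W_back

W_input, W_hidden, W_back = init_reservoir(n_input=3, n_hidden=20, n_back=2)
```

Under these assumptions, flipping signs and zeroing entries cannot increase the spectral radius of the matrix of absolute weights, so it stays at or below `rho`.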
Finally, initialise the output weight vector $w_0^{out} \in \mathbb{R}^{n_{input}+n_{hidden}+n_{back}}$. It is at this
point that our procedure differs from the original echo state network formulation shown
by Jaeger (2002). There, $w^{out}$ is a matrix $W^{out} \in \mathbb{R}^{n_{back} \times (n_{input}+n_{hidden}+n_{back})}$ and the
performance measure of the model is the quadratic loss
$$\min_{W^{out}} \frac{1}{2T} \sum_{t=1}^{T} (y_t - W^{out} z_t)^2. \tag{14}$$
The reasons will become apparent shortly, when we detail the direct, recurrent reinforcement
learning part of the model. But first, we must describe how we create the augmented
state of the system, $z_t \in \mathbb{R}^{n_{input}+n_{hidden}+n_{back}}$. Firstly, we initialise a zero-valued internal
state vector $x_0 \in \mathbb{R}^{n_{hidden}}$. Then at time $t$, we compute the recurrent internal state
where $f_{hidden}(\cdot)$ is typically a squashing function such as the hyperbolic tangent. The
augmented, recurrent system state is then
$$\upsilon_t = \mu_t - \frac{\lambda}{2} \sigma_t^2, \tag{17}$$
where the expected return $\mu_t$ and variance of returns $\sigma_t^2$ may be estimated in an online
fashion with exponential decay
$$\sigma_t^2 = \tau \sigma_{t-1}^2 + (1 - \tau)(r_t - \mu_t)^2. \tag{19}$$
The risk appetite constant $\lambda > 0$ can be set as a function of an investor’s desired risk
adjusted return, as demonstrated by Grinold and Kahn (2019). Define the annualised information
ratio as the risk adjusted differential reward measure, where the difference is taken with
respect to a benchmark or baseline strategy
$$ir_t = 252^{0.5} \times \frac{\mu_t - b_t}{\sigma_t}. \tag{20}$$
Substituting the non-annualised information ratio into the quadratic utility and differenti-
ating with respect to the risk, we obtain a suitable value for the risk appetite parameter
$$\lambda = \frac{ir_t}{\sigma_t}. \tag{21}$$
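The quantities in equations 17 to 21 can be tracked online with a few lines of code. The sketch below is illustrative only: the class and function names are hypothetical, and the mean recursion is assumed to take the same exponentially decayed form as the variance recursion of equation 19.

```python
import math

class OnlineUtility:
    """Exponentially decayed mean, variance and quadratic utility of returns."""

    def __init__(self, tau: float = 0.99, risk_appetite: float = 1.0):
        self.tau = tau
        self.lam = risk_appetite
        self.mu = 0.0
        self.var = 0.0

    def update(self, r: float) -> float:
        # Mean update: assumed to mirror the variance recursion of equation 19.
        self.mu = self.tau * self.mu + (1.0 - self.tau) * r
        # Variance update of equation 19.
        self.var = self.tau * self.var + (1.0 - self.tau) * (r - self.mu) ** 2
        # Quadratic utility of equation 17.
        return self.mu - 0.5 * self.lam * self.var

def annualised_ir(mu: float, sigma: float, benchmark: float = 0.0, periods: int = 252) -> float:
    """Annualised information ratio of equation 20."""
    return math.sqrt(periods) * (mu - benchmark) / sigma

def risk_appetite(ir: float, sigma: float) -> float:
    """Risk appetite of equation 21, from a non-annualised information ratio."""
    return ir / sigma

tracker = OnlineUtility(tau=0.5, risk_appetite=2.0)
first_utility = tracker.update(0.02)      # hypothetical single-period return
ir = annualised_ir(mu=0.001, sigma=0.01)  # hypothetical moments
lam = risk_appetite(ir=0.1, sigma=0.01)
```

A decay factor close to one gives long-memory estimates, while a small value such as the 0.5 used here reacts quickly and is chosen only to make the arithmetic easy to follow.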
The net returns whose expectation and variance we seek to learn are decomposed as
$$r_t = \Delta p_t f_{t-1} - \delta_t |\Delta f_t| - \kappa_t f_t, \tag{22}$$
where $\Delta p_t$ is the change in reference price, typically a mid price
$$\Delta p_t = 0.5 \times (\mathrm{bid}_t + \mathrm{ask}_t - \mathrm{bid}_{t-1} - \mathrm{ask}_{t-1}), \tag{23}$$
$\delta_t$ represents the execution cost for a price taker
$$\delta_t = 0.5 \times (\mathrm{ask}_t - \mathrm{bid}_t), \tag{24}$$
$\kappa_t$ is the funding cost (see subsection 3.2) and $f_t$ is the desired position learnt by the
recurrent reinforcement learner
$$f_t = \tanh\left([w_t^{out}]^T z_t\right). \tag{25}$$
The model is maximally short when $f_t = -1$ and maximally long when $f_t = 1$. The past
positions of the model are used as the feedback connections for equation 16
$$\hat{y}_t = [f_{t-n_{back}}, ..., f_{t-1}]. \tag{26}$$
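A small numerical sketch may help to fix the notation of equations 22 to 26. The quotes, inputs, reservoir state and weights below are hypothetical, and the ordering of the blocks inside the augmented state is an assumption based only on its stated dimension.

```python
from collections import deque

import numpy as np

def net_reward(bid, ask, bid_prev, ask_prev, f, f_prev, kappa):
    """Net return of equation 22 from quotes, positions and the funding rate."""
    dp = 0.5 * (bid + ask - bid_prev - ask_prev)  # mid price change, equation 23
    delta = 0.5 * (ask - bid)                     # half spread, equation 24
    return dp * f_prev - delta * abs(f - f_prev) - kappa * f

def position(w_out, z):
    """Target position of equation 25, bounded between -1 and 1."""
    return float(np.tanh(w_out @ z))

n_back = 3
feedback = deque([0.0] * n_back, maxlen=n_back)   # past positions, equation 26
u_t = np.array([0.2, -0.1])                       # hypothetical inputs
x_t = np.array([0.05, 0.3, -0.4, 0.1])            # hypothetical reservoir state
# Assumed ordering of the augmented state: inputs, reservoir state, feedback.
z_t = np.concatenate([u_t, x_t, np.array(feedback)])
w_out = np.full(z_t.size, 0.1)                    # hypothetical output weights
f_t = position(w_out, z_t)
feedback.append(f_t)                              # the oldest position drops out

r_t = net_reward(100.5, 101.5, 99.5, 100.5, f=0.5, f_prev=1.0, kappa=0.01)
```

In the example reward, the previous long position earns the full mid price move, half of the position is closed at the cost of the half spread, and funding is charged on the remaining position.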
The goal of our recurrent reinforcement learner is to maximise the utility in equation
17, by targeting a position in equation 25. To do this, we apply an online optimisation
update of the form
$$w_t^{out} = w_{t-1}^{out} + \nabla \upsilon_t \equiv \Delta w_t^{out} + \frac{d\upsilon_t}{dw_t^{out}}, \tag{27}$$
where the weight update procedure is an extended Kalman filter for neural networks
(Williams, 1992b; Haykin, 2001), albeit modified for reinforcement learning in this con-
text. The sequential updates are applied as
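$$q = 1 + \frac{\nabla \upsilon_t^T P_{t-1} \nabla \upsilon_t}{\tau}, \tag{28}$$
$$k = \frac{P_{t-1} \nabla \upsilon_t}{q \tau}, \tag{29}$$
$$w_t^{out} = w_{t-1}^{out} + k, \tag{30}$$
$$P_t = \frac{P_{t-1}}{\tau} - k k^T q. \tag{31}$$
Above, $\tau$ is an exponential decay factor and $P_t$ is an approximation to $[\nabla^2 \upsilon_t]^{-1}$, the inverse
Hessian of the utility function $\upsilon_t$ with respect to the model weights $w_t^{out}$.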
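Equations 28 to 31 translate almost line by line into code. The sketch below is not the authors' implementation; the function name, the identity initialisation of $P$ and the example gradient are assumptions made for illustration.

```python
import numpy as np

def sequential_update(w, P, grad, tau=0.99):
    """Apply equations 28 to 31 once, given the utility gradient `grad`."""
    q = 1.0 + grad @ P @ grad / tau          # equation 28
    k = P @ grad / (q * tau)                 # equation 29
    w_new = w + k                            # equation 30
    P_new = P / tau - np.outer(k, k) * q     # equation 31
    return w_new, P_new

n = 3
w0 = np.zeros(n)
P0 = np.eye(n)                               # hypothetical initialisation of P
grad = np.array([0.5, -0.25, 0.0])           # hypothetical utility gradient
w1, P1 = sequential_update(w0, P0, grad, tau=1.0)
```

With no decay ($\tau = 1$) and an identity start, the updated matrix equals the inverse of $I + \nabla\upsilon_t \nabla\upsilon_t^T$ by the Sherman-Morrison identity, which gives a convenient sanity check on the recursion.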
We decompose the gradient of the utility function with respect to the recurrent rein-
forcement learner’s parameters as follows
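$$\begin{aligned}
\nabla \upsilon_t &= \frac{d\upsilon_t}{dr_t} \left\{ \frac{dr_t}{df_t} \frac{df_t}{dw_t^{out}} + \frac{dr_t}{df_{t-1}} \frac{df_{t-1}}{dw_{t-1}^{out}} \right\} \\
&= \frac{d\upsilon_t}{dr_t} \left\{ \frac{dr_t}{df_t} \left[ \frac{\partial f_t}{\partial w_t^{out}} + \frac{\partial f_t}{\partial f_{t-1}} \frac{\partial f_{t-1}}{\partial w_{t-1}^{out}} \right] + \frac{dr_t}{df_{t-1}} \left[ \frac{\partial f_{t-1}}{\partial w_{t-1}^{out}} + \frac{\partial f_{t-1}}{\partial f_{t-2}} \frac{\partial f_{t-2}}{\partial w_{t-2}^{out}} \right] \right\}.
\end{aligned} \tag{32}$$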
3.5 Results
Table 1 and figure 3 show the results of the experiment. The crypto agent achieves a
total return of just under 350% over a test set that spans less than five years. The associated
annualised information ratio is 1.46. Denoting the maximally short position as -1, no
position as 0 and the maximally long position as 1, we see that the agent averages a
position of 0.41. Thus there is a bias toward the agent maintaining a long position, which
is desirable, as Bitcoin has appreciated against the US dollar over this period. We see
visual evidence in figure 3 that on occasion, the crypto agent abstains from trading, or
rather is forced to take no position; this will happen during periods when the predictive
performance of the agent decreases relative to execution and funding costs. Our crypto
agent also captures a 71% cumulative return due to earning funding, which is expected,
as the agent learns to target the appropriate position that maximises its quadratic utility,
and funding is one of the drivers of this utility. The total execution cost and exchange fees
that the agent pays out amount to -54%.
4. Discussion
The echo state network provides a powerful and scalable feature space representation. We
transfer this powerful learning representation to a recurrent reinforcement learning agent,
that learns to target a position directly. It is possible to use the echo state network as a
reinforcement learning agent itself, as shown by Szita et al. (2006). However the approach
taken there may lead to undesired behaviour in a trading context. Specifically, they use
an echo state network to estimate the state-action value function of a temporal difference
learning sarsa model (Sutton, 1988; Sutton and Barto, 2018). This value function takes
the form
$$q_\pi(s, a) = E_\pi\{r_t | s_t, a_t\}, \tag{36}$$
where the expected return $r_t$ depends on the transition to state $s_t$ having taken action $a_t$
under policy $\pi$. Sarsa estimates this value function sequentially as
$$q(s_t, a_t) \leftarrow q(s_t, a_t) + \eta \left[ r_{t+1} + \gamma\, q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \right], \tag{37}$$
where $0 < \gamma \le 1$ is a discount factor for multi-step rewards, and $\eta > 0$ is a learning rate.
Equation 37 shows that the state transition reward is passed back to the starting state. In
activities such as maze traversal or board games, being aware of the reward associated with
multiple steps or decisions and passing that reward back to the current position is of great
value. However, in the context of trading, where the value function q(st , at ) represents the
value of a position st , with the possibility of switching or remaining in the same position
denoted by action at , we will on occasion find that a larger utility is assigned to the wrong
state. For example, imagine the current state is $s_t = 0$, that is, the model has no position.
Now we observe a large positive price jump leading to a large positive reward $r_{t+1} \gg 0$.
Value function estimators such as equation 37 would pass the state transition reward $r_{t+1}$
to the initial state $q(s_t = 0, a_t = 0)$. At the next iteration, with probability $1 - \epsilon$, the
model selects $q(s_{t+1} = 0, a_{t+1} = 0)$, as it has the highest value function. However, if the
position is zero, the model cannot hope to earn a profit. Even if one excludes the zero state, then there
is still the possibility of observing this problem for a reversal strategy with possible states
s = {−1, 1}. Direct reinforcement, as we describe it, does not incur these problems and
we have, through transfer learning, improved upon the earlier work in direct reinforcement
(Moody et al., 1998; Gold, 2003).
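The credit assignment problem described above can be reproduced with a tabular toy example. This is a deliberately simplified sketch: it uses a lookup table rather than an echo state network as the value function approximator, and the reward, learning rate and discount factor are hypothetical.

```python
from collections import defaultdict

def sarsa_update(q, s, a, r_next, s_next, a_next, eta=0.1, gamma=0.9):
    """Tabular sarsa: move q(s, a) towards r_next + gamma * q(s_next, a_next)."""
    td_error = r_next + gamma * q[(s_next, a_next)] - q[(s, a)]
    q[(s, a)] += eta * td_error
    return q

# States and actions are positions in {-1, 0, 1}; all values start at zero.
q = defaultdict(float)

# The scenario in the text: the model is flat, stays flat, and a large reward
# is observed on the transition and credited to the flat state-action pair.
sarsa_update(q, s=0, a=0, r_next=5.0, s_next=0, a_next=0)

# A greedy policy in the flat state now prefers to remain flat.
greedy_action = max((-1, 0, 1), key=lambda a: q[(0, a)])
```

After a single update the flat state-action pair carries the highest value, so the greedy choice is to hold no position, which is the behaviour the discussion identifies as undesirable for trading.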
5. Conclusion
We demonstrate an application of online transfer learning as a digital assets trading agent.
This agent makes use of a powerful feature space representation in the form of an echo state
network, the output of which is made available to a direct, recurrent reinforcement learning
agent. The agent learns to trade the XBTUSD (Bitcoin versus US dollars) perpetual swap
derivatives contract on BitMEX. It learns to trade intraday on five minutely sampled data,
avoids excessive over-trading, captures a funding profit and is also able to predict the
direction of the market. Overall, our crypto agent realises a total return of 350%, net of
transaction costs, over roughly five years, of which 71 percentage points come from funding
profit. The annualised information ratio that it achieves is 1.46.
            ir       total return
count       250      250
mean        1.160    2.977
std         0.299    0.740
min         0.219    0.654
25%         1.023    2.572
50%         1.199    3.076
75%         1.355    3.352
max         1.763    5.021
se(mean)    0.019    0.047
lb(mean)    1.123    2.886
ub(mean)    1.197    3.069
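The last three rows of the table appear consistent with a standard error of std divided by the square root of count, and with bounds of mean plus or minus 1.96 standard errors. This reading is an assumption rather than something stated in the text; the sketch below checks it against the rounded figures in the table, using a hypothetical helper `mean_interval`.

```python
import math

def mean_interval(mean: float, std: float, count: int, z: float = 1.96):
    """Standard error of the mean and a normal-approximation interval."""
    se = std / math.sqrt(count)
    return se, mean - z * se, mean + z * se

# Inputs are the rounded mean, std and count rows of the table.
se_ir, lb_ir, ub_ir = mean_interval(1.160, 0.299, 250)
se_tr, lb_tr, ub_tr = mean_interval(2.977, 0.740, 250)
```

Because the inputs are themselves rounded to three decimals, the recomputed bounds can differ from the tabulated ones in the last digit.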
References
Robert Almgren and Neil Chriss. Optimal execution of portfolio transactions. Journal of Risk, 3:
5–40, 2001.
Carl Gold. Fx trading via recurrent reinforcement learning. In IEEE. IEEE, 2003.
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
Clive W. J. Granger and Paul Newbold. Spurious regressions in econometrics. Journal of economet-
rics, 2:111–120, 1974.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.
Herbert Jaeger. Adaptive nonlinear system identification with echo state networks. In NIPS, 2002.
Herbert Jaeger. Using conceptors to manage neural long-term memories for temporal patterns.
Journal of Machine Learning Research, 18:13:1–13:43, 2017.
Richard C. Grinold and Ronald N. Kahn. Advances in Active Portfolio Management: New Developments
in Quantitative Investing. McGraw-Hill Education, 2019.
Mantas Lukosevicius and Herbert Jaeger. Reservoir computing approaches to recurrent neural net-
work training. Comput. Sci. Rev., 3:127–149, 2009.
Mantas Lukosevicius, Herbert Jaeger, and Benjamin Schrauwen. Reservoir computing trends. KI -
Künstliche Intelligenz, 26:365–371, 2012.
Suyuan Luo, Xudong Lin, and Zunxin Zheng. A novel cnn-ddpg based ai-trader: Performance and
roles in business operations. Transportation research. Part E, Logistics and transportation review,
131:68–79, 2019. ISSN 1366-5545.
Wolfgang Maass and Henry Markram. On the computational power of circuits of spiking neurons.
J. Comput. Syst. Sci., 69:593–616, 2004.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wier-
stra, and Martin Riedmiller. Playing atari with deep reinforcement learning, 2013.
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement
learning, 2016.
John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and rein-
forcement learning for trading systems and portfolios. Journal of Forecasting, 17:441–470, 1998.
James M. Murray. Local online learning in recurrent networks with random feedback. eLife, 8, 2019.
Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge
and Data Engineering, 22:1345–1359, 2010.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations
by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive
Science, 1985.
Bruno Klein Salvalaio and Gabriel de Oliveira Ramos. Self-adaptive appearance-based eye-tracking
with online transfer learning. In 2019 8th Brazilian Conference on Intelligent Systems (BRACIS),
pages 383–388. IEEE, 2019.
Arash Samadi, Timothy P. Lillicrap, and Douglas Blair Tweed. Deep learning with dynamic spiking
neurons and fixed feedback weights. Neural Computation, 29:578–602, 2017.
William F. Sharpe. Mutual fund performance. The Journal of Business, 39, 1 1966.
William. F. Sharpe. Expected utility asset allocation. Financial Analysts Journal, 63(5):18–30,
2007.
Robert H. Shumway. Time series analysis and its applications: with R examples. Springer texts in
statistics. Springer, 3rd edition, 2011.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine
Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with
deep neural networks and tree search. Nature, 529, 1 2016. ISSN 0028-0836. doi: 10.1038/
nature16961.
Masashi Sugiyama. Statistical Reinforcement Learning. CRC Press, 1st edition, 2015. ISBN
9781439856901.
R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):
9–44, 1988. ISSN 0885-6125.
István Szita, Viktor Gyenes, and András Lörincz. Reinforcement learning with echo state networks.
In ICANN, 2006.
Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Sequential decision
making with coherent risk. IEEE transactions on automatic control, 62(7):3323–3338, 2017. ISSN
0018-9286.
Xingda Wang, Xiaoping Wang, and Zhigang Zeng. A novel weight update rule of online transfer
learning. In 2020 12th International Conference on Advanced Computational Intelligence (ICACI),
pages 349–355. IEEE, 2020.
C. J. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, 1989.
P.J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the
IEEE, 78(10):1550–1560, 1990. doi: 10.1109/5.58337.
Ronald J. Williams. Training recurrent networks using the extended kalman filter. In [Proceedings
1992] IJCNN International Joint Conference on Neural Networks, volume 4, pages 241–246 vol.4,
1992b.
Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent
neural networks. Neural computation, 1:270–280, 1989.
Xinghan Xu and Weijie Ren. Prediction of air pollution concentration based on mrmr and echo
state network. Applied Sciences, 9(9), 2019. ISSN 2076-3417.
Qiang Yang, Yu Zhang, Wenyuan Dai, and Sinno Jialin Pan. Transfer Learning. Cambridge Uni-
versity Press, 2020.
Zekun Ye, Weijie Deng, Shuigeng Zhou, Yi Xu, and Jihong Guan. Optimal trade execution based
on deep deterministic policy gradient. In Database Systems for Advanced Applications, volume
12112 of Lecture Notes in Computer Science, pages 638–654. Springer International Publishing,
2020. ISBN 3030594092.
Izzet B. Yildiz, Herbert Jaeger, and Stefan J. Kiebel. Re-visiting the echo state property. Neural
networks : the official journal of the International Neural Network Society, 35:1–9, 2012.
Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep reinforcement learning for trading. In
ArXiv, 2019.
Peilin Zhao, Steven C. H. Hoi, Jialei Wang, and Bin Li. Online transfer learning. Artificial intelli-
gence, 216:76–102, 2014.