
Financial Computing Group, UCL (2021)

The Recurrent Reinforcement Learning Crypto Agent

Gabriel Borrageiro gabriel.borrageiro.20@ucl.ac.uk


Nick Firoozye n.firoozye@ucl.ac.uk
Paolo Barucca p.barucca@ucl.ac.uk
Department of Computer Science
University College London
Gower Street, London, WC1E 6BT, UK

Abstract
We demonstrate an application of online transfer learning as a digital assets trading agent.
This agent makes use of a powerful feature space representation in the form of an echo state
network, the output of which is made available to a direct, recurrent reinforcement learning
agent. The agent learns to trade the XBTUSD (Bitcoin versus US dollars) perpetual swap
derivatives contract on BitMEX. It learns to trade intraday on data sampled at five-minute intervals, avoids excessive over-trading, captures a funding profit and is also able to predict the direction of the market. Overall, our crypto agent realises a total return of 350%, net of transaction costs, over roughly five years, 71% of which is attributable to funding profit. The annualised information ratio that it achieves is 1.46.
Keywords: online learning, transfer learning, echo state networks, recurrent reinforcement
learning, financial time series

1. Introduction
Financial time series provide many modelling challenges for both researchers and practitioners. In some circumstances data availability is sparse, while in others the datasets are vast; this impacts the choice of model and the learning style. Financial time series are typically both autocorrelated and nonstationary, requiring approaches such as integer or fractional differencing (Shumway, 2011) to remove these effects and facilitate correct feature selection; Granger and Newbold (1974) identified that failing to do so leads to spurious regressions.
Against this backdrop, we explore different forms of online learning, including sequential
learning, transfer learning (Yang et al., 2020) and reinforcement learning (Sutton and Barto,
2018). More concretely, we transfer the learning of an echo state network (Jaeger, 2002)
to a direct, recurrent reinforcement learning agent (Moody et al., 1998), which must learn to trade digital asset futures, specifically the XBTUSD (Bitcoin versus US dollar) perpetual
swap on the BitMEX exchange. In addition, our transfer learner benefits from an ample,
dynamic reservoir feature space and can identify and learn from the different sources of
impact on profit and loss, including execution costs, exchange fees, funding costs and price
moves in the market.

2. Background
In this section, we provide a brief overview of the related ideas that we use in our exper-
imentation, namely transfer learning, reservoir computing and a form of policy gradients
reinforcement learning.


2.1 Transfer Learning


Transfer learning refers to the machine learning paradigm in which an algorithm extracts
knowledge from one or more application scenarios to help boost the learning performance in
a target scenario (Yang et al., 2020). Typically, traditional machine learning requires large
amounts of training data. Transfer learning copes better with data sparsity by looking at
related learning domains where data is sufficient. Even in a big-data setting such as streaming high-frequency data, transfer learning is beneficial early on, when data is still sparse yet the learner must begin providing forecasts as soon as it is asked to do so. An increasing number of papers focus on online transfer learning (Zhao et al., 2014; Salvalaio and de Oliveira Ramos, 2019; Wang et al., 2020). Following Pan and Yang (2010),
we define transfer learning as:
Definition 1 (transfer learning) Given a source domain $D_S$ and learning task $T_S$, and a target domain $D_T$ and learning task $T_T$, transfer learning aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $D_T$ using the knowledge in $D_S$ and $T_S$, where $D_S \neq D_T$ or $T_S \neq T_T$.

2.2 Echo State Networks


Echo state networks are a form of recurrent neural network. They consist of a large, fixed,
recurrent reservoir network, from which the desired output is obtained by training suitable
output connection weights. The optimal output weights can be determined analytically or learnt sequentially online, for example with recursive least squares, as in Jaeger (2002).
Echo state networks are an example of the reservoir computing paradigm of understanding
and training recurrent neural networks, based on treating the recurrent part (the reservoir)
differently than the readouts from it (Lukosevicius et al., 2012). Following Jaeger (2002),
the echo state property states that:
Definition 2 (echo state property) If a network is started from two arbitrary states $x_0$, $\tilde{x}_0$ and is run with the same input sequence in both cases, the resulting state sequences $x_T$, $\tilde{x}_T$ converge to each other. If this condition holds, the reservoir network state will asymptotically depend only on the input history, and the network is said to be an echo state network.
The echo state property is guaranteed if the dynamic reservoir weight matrix $W^{hidden}$ is scaled such that its spectral radius $\rho(W^{hidden})$, that is, its largest absolute eigenvalue, satisfies $\rho(W^{hidden}) < 1$. This ensures that $W^{hidden}$ is contractive. The mathematically correct connection between the spectral radius and the echo state property is that the latter is violated if $\rho(W^{hidden}) > 1$ in reservoirs using the tanh function as neuron nonlinearity and for zero input (Lukosevicius and Jaeger, 2009).
Murray (2019) states that locality and causality are the criteria for a plausible model of how the brain might perform challenging time-dependent computations. As the echo state network uses a fixed, dynamic reservoir of weights, whose update depends on local information of inputs and activations, with the addition of fixed and random feedback, it offers a plausible model of biological function. These biological aspects are explored by Maass and Markram (2004) concerning liquid state machines and more generally with spiking neural networks in Samadi et al. (2017).

Figure 1: the echo state network (Xu and Ren, 2019)

2.3 Policy Gradient Reinforcement Learning


Williams (1992a) introduced policy gradient methods in a reinforcement learning context.
Whereas the majority of reinforcement learning algorithms focus on action-value estimation, learning the values of actions and selecting actions based on their estimated
values, policy gradient methods learn a parameterised policy that can select actions without
the use of a value function. Williams also introduced his reinforce algorithm

$\Delta w_{ij} = \eta_{ij}(r - b_{ij})\,\dfrac{\partial \ln \pi_i}{\partial w_{ij}}$, (1)

where $w_{ij}$ is the model weight going from the $j$th input to the $i$th output and $w_i$ is the weight vector for the $i$th hidden processing unit in a network of such units, whose goal it is to adapt in such a way as to maximise the scalar reward $r$. The learning rate is $\eta_{ij}$ and the weight update of equation 1 is typically applied with gradient ascent. The reinforcement baseline $b_{ij}$ is conditionally independent of the model outputs $y_i$, given the network parameters $w$ and inputs $x_i$. The characteristic eligibility of $w_{ij}$ is $\partial \ln \pi_i / \partial w_{ij}$, where $\pi_i(y_i = c, w_i, x_i)$ is the probability mass function determining the value of $y_i$ as a function of the parameters of the unit and its input. Baseline subtraction $r - b_{ij}$ plays an
important role in reducing the variance of gradient estimators and Sugiyama (2015) shows
that the optimal baseline is given as
$b^{*} = \dfrac{\mathbb{E}_{p(r|w)}\left[ r \left\| \sum_{t=1}^{T} \nabla \ln \pi(a_t|s_t, w) \right\|^2 \right]}{\mathbb{E}_{p(r|w)}\left[ \left\| \sum_{t=1}^{T} \nabla \ln \pi(a_t|s_t, w) \right\|^2 \right]}$, (2)

where the policy function $\pi(a_t|s_t, w)$ denotes the probability of taking action $a_t$ at time $t$ given state $s_t$, parameterised by $w$. The main result of Williams's paper is

Theorem 3 For any reinforce algorithm, the inner product of $E[\Delta w|w]$ and $\nabla E[r|w]$ is non-negative, and if $\eta_{ij} > 0$, then this inner product is zero if and only if $\nabla E[r|w] = 0$. If $\eta_{ij}$ is independent of $i$ and $j$, then $E[\Delta w|w] = \eta \nabla E[r|w]$.

This result relates $\nabla E[r|w]$, the gradient in weight space of the performance measure $E[r|w]$, to $E[\Delta w|w]$, the average update vector in weight space. Thus for any reinforce
algorithm, the average update vector in weight space lies in a direction for which this performance measure is increasing, and the quantity $(r - b_{ij})\,\partial \ln \pi_i / \partial w_{ij}$ represents an unbiased estimate of $\partial E[r|w] / \partial w_{ij}$.
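To make the reinforce rule concrete, here is a minimal sketch (ours, in Python/NumPy, not from the paper; the linear-softmax policy and the function name are assumptions) of one gradient-ascent step of equation 1 for a discrete action:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(w, x, a, r, b, eta=0.01):
    """One gradient-ascent step of the reinforce rule, equation (1).

    w : policy weights, shape (n_actions, n_features)
    x : input features, shape (n_features,)
    a : index of the action that was taken
    r : scalar reward received for that action
    b : reinforcement baseline
    """
    probs = softmax(w @ x)                # pi(. | x, w)
    # characteristic eligibility d ln pi(a | x, w) / dw for a softmax policy
    grad_log_pi = -np.outer(probs, x)
    grad_log_pi[a] += x
    return w + eta * (r - b) * grad_log_pi
```

In practice, the action would be sampled from softmax(w @ x) and the baseline b could be a running average of past rewards, in the spirit of the variance reduction of equation 2.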

2.3.1 Policy Gradient Methods in Financial Trading


Moody et al. (1998) propose to train trading systems and portfolios by optimising objective
functions that directly measure trading and investment performance. Rather than basing a
trading system on forecasts, or training via a supervised learning algorithm using labelled
trading data, they train their systems using a direct, recurrent reinforcement learning
algorithm, which is an example of the policy gradient method. The direct part refers to the
fact that the model tries to target a position directly and the model’s weights are adapted
such that the performance measure is maximised. The performance function that they
primarily consider is the differential Sharpe ratio. The annualised Sharpe ratio (Sharpe,
1966) is
$sr = 252^{0.5} \times \dfrac{\mu - r_f}{\sigma}$, (3)
where $\mu$ is the return of the strategy, $\sigma$ is the standard deviation of returns and $r_f$ is the risk-free rate. The differential Sharpe ratio is defined as
$\dfrac{d\,sr_t}{d\tau} = \dfrac{b_{t-1}\Delta a_t - 0.5\, a_{t-1} \Delta b_t}{(b_{t-1} - a_{t-1}^2)^{3/2}}$, (4)
where the quantities $a_t$ and $b_t$ are exponentially weighted estimates of the first and second moments of $r_t$

$a_t = a_{t-1} + \tau \Delta a_t = a_{t-1} + \tau(r_t - a_{t-1})$
$b_t = b_{t-1} + \tau \Delta b_t = b_{t-1} + \tau(r_t^2 - b_{t-1})$, (5)
with $\tau \in (0, 1]$. They consider a batch gradient ascent update

$\Delta \theta_T = \eta \dfrac{d\,sr_T}{d\theta_T}$, (6)
where

$\dfrac{d\,sr_T}{d\theta} = \dfrac{1}{T}\sum_{t=1}^{T} \dfrac{d\,sr_T}{dr_t}\left\{ \dfrac{dr_t}{df_t}\dfrac{df_t}{d\theta} + \dfrac{dr_t}{df_{t-1}}\dfrac{df_{t-1}}{d\theta} \right\} = \dfrac{1}{T}\sum_{t=1}^{T} \dfrac{b_T - a_T r_t}{(b_T - a_T^2)^{3/2}} \left\{ \dfrac{dr_t}{df_t}\dfrac{df_t}{d\theta} + \dfrac{dr_t}{df_{t-1}}\dfrac{df_{t-1}}{d\theta} \right\}$. (7)

The reward

$r_t = \Delta p_t f_{t-1} - \delta_t |\Delta f_t|$ (8)

depends on the change in reference price $p_t$, from which a gross profit and loss (pnl) is computed, transaction costs $\delta_t$ and the differentiable position function $-1 \le f_t \le 1$. This position function depends on the model inputs and parameters, $f_t \equiv f(x_t; \theta_t)$. The right half of equation 7 shows the dependency of the model parameters on the past sequence of trades. Correctly computing and optimising these total derivatives requires recurrent algorithms such as backpropagation through time (Rumelhart et al., 1985; Werbos, 1990) or real-time recurrent learning (Williams and Zipser, 1989). An undesirable property of the Sharpe ratio is that it penalises a model that produces returns larger than

$r_t^{*} = \dfrac{b_{t-1}}{a_{t-1}}$, (9)


which is counter-intuitive relative to most investors’ notions of risk and reward (Moody
et al., 1998).
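As an illustration of equations 4 and 5, the following sketch (ours, in Python; the function name and the warm-up note are assumptions) updates the exponentially weighted moments and evaluates the differential Sharpe ratio online:

```python
def differential_sharpe_ratio(r_t, a_prev, b_prev, tau=0.01, eps=1e-12):
    """Online differential Sharpe ratio update, equations (4) and (5).

    r_t            : latest return
    a_prev, b_prev : exponentially weighted first and second moments of returns
    tau            : decay rate in (0, 1]
    Returns (dsr, a_t, b_t).
    """
    delta_a = r_t - a_prev
    delta_b = r_t ** 2 - b_prev
    # differential Sharpe ratio, computed with the previous moment estimates
    dsr = (b_prev * delta_a - 0.5 * a_prev * delta_b) / \
          max(b_prev - a_prev ** 2, eps) ** 1.5
    # exponentially weighted moment updates
    a_t = a_prev + tau * delta_a
    b_t = b_prev + tau * delta_b
    return dsr, a_t, b_t
```

The moments would typically be warmed up over an initial window so that $b_0 - a_0^2 > 0$.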
Gold (2003) extends the work of Moody et al. (1998) and investigates high-frequency currency trading with neural networks trained via recurrent reinforcement learning. He compares the performance of linear networks with neural networks containing a single hidden layer and examines the impact of shared system hyper-parameters on performance. In general, he concludes that the trading systems may be effective, but that performance varies widely across currency markets and that simple statistics of the markets cannot explain this variability.
Tamar et al. (2017) extend the policy gradient method to the whole class of coherent
risk measures that are widely used in finance. They consider both static and time-consistent
dynamic risk measures. Their approach is in the spirit of policy gradient algorithms for
static risk measures and combines a standard sampling approach with convex programming.
For dynamic risk measures, their approach is actor-critic style (Sutton and Barto, 2018)
and involves explicit approximation of value functions.
Luo et al. (2019) build a novel reinforcement learning trading agent, adopting a deep
learning actor-critic algorithm with deterministic policy gradients to find the optimal pol-
icy. Their proposed algorithm has two different convolutional neural network (Goodfellow
et al., 2016) based function approximators and outperforms some baseline methods when
experimenting with stock index futures.
Zhang et al. (2019) use deep reinforcement learning algorithms such as deep q-learning
networks (Mnih et al., 2013), neural policy gradients (Silver et al., 2016) and advantage
actor-critic (Mnih et al., 2016), to design trading strategies for continuous futures contracts.
They use long short-term memory neural networks (Hochreiter and Schmidhuber, 1997)
to train both the actor and critic networks. They show that their method outperforms
various baseline models, delivering positive profits despite high transaction costs. Their
experiments show that the proposed algorithms can follow large market trends without
changing positions and scale down or hold through consolidation periods.
Ye et al. (2020) address the optimal trade execution problem using limit order books,
where a model must learn how best to execute a given block of shares at minimal cost or for
maximal return. To this end, they propose a deep reinforcement learning-based solution
that uses a deterministic policy gradient framework. Experiments on three real market
datasets show that the proposed approach significantly outperforms other methods such
as a submit and leave policy, a q-learning algorithm (Watkins, 1989) and a hybrid method
that combines the Almgren-Chriss model (Almgren and Chriss, 2001) with reinforcement
learning.

3. The Research Experiment

We begin with a discussion of the research data that we use in our experiment, followed
by a detailed description of the research methods, finishing up with a description of the
experiment results. As a high-level summary, our goal in this experiment is to explore transfer learning using a source model, an echo state network, and a target model, a direct, recurrent reinforcement learning agent. The objective of this meta-model is to learn to trade digital asset futures, specifically perpetual contracts on the BitMEX crypto exchange. The dynamical reservoir of the echo state network acts as a powerful nonlinear feature space; this is fed into the upstream recurrent reinforcement learner, which is aware of the various sources of impact on profit or loss and learns to target the desired position.


3.1 The Data


The data that we experiment with is from the BitMEX cryptocurrency derivatives ex-
change. In 2016 they launched the XBTUSD perpetual swap, where speculators and in-
vestors can trade Bitcoin against the US dollar. The perpetual swap is similar to a tradi-
tional futures contract, except there is no expiry or settlement. It mimics a margin-based
spot market and trades close to the underlying reference index price. A funding mechanism
is used to tether the contract to its underlying spot price. In contrast, a futures contract may trade at a significantly different price due to the basis

$basis_t = futures_t - index_t$. (10)
The basis means different things in different markets. For example, in the oil market, the
demand for spot oil can outpace the demand for futures oil, especially if OPEC withholds
supply, leading to a higher spot price; this results in a futures curve in a state of back-
wardation. Crypto futures normally trade in a state of contango, where the futures price is higher than the spot price. Backwardation or contango in crypto markets does not represent supply and demand shortages in an economic sense but rather reflects risk appetite for crypto. Similar effects occur in the equity markets. As in the equity markets, crypto market participants can take risk more easily in the futures market. The
spot markets typically do not offer leverage, and the trader must have inventory in the
exchange to trade. In contrast, the futures markets allow traders to use leverage and sell
assets short as easily as buying and going long without having to borrow the underlying
asset. However, what is required is a margin or deposit to fund the position. Figure 2
shows the basis in relative terms for the XBTUSD perpetual swap during the bear market
of 2018. The midprice of the perpetual swap is compared against the underlying index it
tracks, .BXBT. The relative basis is computed as
$rbasis_t = \dfrac{futures_t - index_t}{index_t}$. (11)
Before this bear market, Bitcoin hit a then all-time high of $20,000, and the 100-day
exponentially weighted moving average of relative basis was very positive in late 2017. For
most of 2018 and 2019, the basis was largely negative, reflecting the cash market sell-off
from all-time highs down to circa USD 3,000.

3.2 Funding
The funding rate for the perpetual swap comprises two parts: an interest rate differential
component and the premium or discount of the basis. We denote this funding rate as $\kappa_t$. The interest rate differential reflects the borrowing cost of each currency involved in the pair

$e_t = \dfrac{e_t^{quote} - e_t^{base}}{T}$. (12)

For example, with the XBTUSD perpetual swap, the interest rate for borrowing in US dollars is denoted by $e_t^{quote}$ and the interest rate for borrowing in Bitcoin by $e_t^{base}$. As funding occurs every 8 hours, $T = 8$. The basis premium / discount component
is computed in a manner similar to equation 11, with some subtleties applied to minimise
market manipulation, such as the use of time and volume weighted average prices. The
funding rate is finally
$\kappa_t = b_t + \max(\min(\gamma, e_t - b_t), -\gamma)$, (13)

where $\gamma$ is a basis cap, typically 5 basis points (0.05%). When the basis is positive, funding will be positive, and traders with long positions (buy XBT, sell USD) will pay those with short positions (sell XBT, buy USD). Reciprocally, shorts pay longs when the basis is negative.

Figure 2: XBTUSD basis during the bear market of 2019
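To illustrate the funding mechanism of equations 12 and 13, here is a minimal sketch (ours, in Python; the argument names and default values are assumptions, and BitMEX's published calculation additionally uses time- and volume-weighted average prices):

```python
def funding_rate(premium, e_quote, e_base, T=8, cap=0.0005):
    """Simplified funding rate in the spirit of equations (12) and (13).

    premium : basis premium / discount of the swap versus its index
    e_quote : interest rate of borrowing the quote currency (USD)
    e_base  : interest rate of borrowing the base currency (XBT)
    T       : funding occurs every 8 hours, so T = 8 as in the text
    cap     : clamp gamma on the interest / premium difference (5 bps)
    """
    e = (e_quote - e_base) / T                          # equation (12)
    return premium + max(min(cap, e - premium), -cap)   # equation (13)
```

With a positive rate, longs pay shorts funding on their position value; with a negative rate, shorts pay longs.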

3.3 Methods
We begin with a description of the dynamic reservoir feature space, the resultant learning
of which is transferred to the direct, recurrent reinforcement learner, which targets the
desired risk position.

3.3.1 The Dynamic Reservoir Feature Space


Denote by $u_t$ a vector of external inputs to the system, observed at time $t$. In the context of this experiment, such external input would include order book, transaction and funding information. These features may come from the instrument being traded, exogenous instruments, or both. Initialise the external input weight matrix $W^{input} \in \mathbb{R}^{n_{hidden} \times n_{input}}$, where the weights are drawn at random; a draw from a standard normal would suffice. Here, $n_{hidden}$ denotes the number of hidden processing units in the internal dynamical reservoir and $n_{input}$ is the number of external inputs, including a bias term. Next, we initialise the hidden processing unit weight matrix $W^{hidden} \in \mathbb{R}^{n_{hidden} \times n_{hidden}}$. The procedure detailed by Yildiz et al. (2012), sketched in code after the list below, is:
• Initialise a random matrix $W^{hidden}$, all with non-negative entries.

• Scale $W^{hidden}$ such that its spectral radius $\rho(W^{hidden}) < 1$.

• Change the signs of a desired number of entries of $W^{hidden}$ to get negative connection weights as well.
• Sparsify $W^{hidden}$ with probability $P(\alpha)$, $0 \le \alpha < 1$, setting those elements to zero.
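The following sketch (ours, in Python/NumPy; the function name, the sign-flipping probability and the sparsification probability are illustrative assumptions) implements the four steps above:

```python
import numpy as np

def init_reservoir(n_hidden, spectral_radius=0.9, flip_prob=0.5,
                   sparse_prob=0.9, seed=0):
    """Initialise W_hidden following the procedure of Yildiz et al. (2012)."""
    rng = np.random.default_rng(seed)
    # 1. random matrix with non-negative entries
    W = rng.uniform(0.0, 1.0, size=(n_hidden, n_hidden))
    # 2. scale so that the spectral radius (largest absolute eigenvalue) < 1
    W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
    # 3. change the signs of a desired number of entries
    W *= np.where(rng.random(W.shape) < flip_prob, -1.0, 1.0)
    # 4. sparsify, setting elements to zero with some probability
    W *= rng.random(W.shape) > sparse_prob
    return W
```

Flipping signs leaves $|W^{hidden}|$ unchanged and zeroing entries can only reduce its spectral radius, so the scaling in the second step continues to bound it below one.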
This procedure is guaranteed to ensure the echo state property for any input. Intuitively, a recurrent neural network has the echo state property concerning an input signal $u_t$ if any initial network state is forgotten or washed out when the network is driven by $u_t$ for long enough (Jaeger, 2017). The model supports recurrent connections from either a teacher signal $y_t \in \mathbb{R}^{n_{back}}$ or the model output $\hat{y}_t \in \mathbb{R}^{n_{back}}$. These are connected to the model via the weight matrix $W^{back} \in \mathbb{R}^{n_{hidden} \times n_{back}}$, whose weights are initialised at random from a standard normal. Note that $W^{input}$, $W^{hidden}$ and $W^{back}$ have weights that remain fixed.
Finally, initialise the output weight vector $w_0^{out} \in \mathbb{R}^{n_{input}+n_{hidden}+n_{back}}$. It is at this point that our procedure differs from the original echo state network formulation shown by Jaeger (2002). There, $w^{out}$ is a matrix $W^{out} \in \mathbb{R}^{n_{back} \times (n_{input}+n_{hidden}+n_{back})}$ and the performance measure of the model is the quadratic loss

$\min_{W^{out}} \dfrac{1}{2T} \sum_{t=1}^{T} (y_t - W^{out} z_t)^2$. (14)

The reasons will become apparent shortly, when we detail the direct, recurrent reinforcement learning part of the model. But first, we must describe how we create the augmented state of the system, $z_t \in \mathbb{R}^{n_{input}+n_{hidden}+n_{back}}$. Firstly, we initialise a zero-valued internal state vector $x_0 \in \mathbb{R}^{n_{hidden}}$. Then, at time $t$, we compute the recurrent internal state

$x_t = f_{hidden}(W^{input} u_t + W^{hidden} x_{t-1} + W^{back} \hat{y}_t)$, (15)

where $f_{hidden}(\cdot)$ is typically a squashing function such as the hyperbolic tangent. The augmented, recurrent system state is then

$z_t = [u_t, x_t, \hat{y}_t]$. (16)
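A minimal sketch of the state update of equation 15 and the augmented state of equation 16 (ours, in Python/NumPy; it assumes the init_reservoir helper from the earlier sketch and random input and feedback matrices):

```python
import numpy as np

class EchoStateFeatures:
    """Dynamic reservoir producing the augmented state z_t of equation (16)."""

    def __init__(self, n_input, n_hidden, n_back, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.standard_normal((n_hidden, n_input))
        self.W_hid = init_reservoir(n_hidden, seed=seed)   # from earlier sketch
        self.W_back = rng.standard_normal((n_hidden, n_back))
        self.x = np.zeros(n_hidden)                        # internal state x_0

    def step(self, u_t, y_back):
        """u_t: external inputs (incl. bias); y_back: last n_back positions."""
        # equation (15): recurrent internal state update
        self.x = np.tanh(self.W_in @ u_t + self.W_hid @ self.x
                         + self.W_back @ y_back)
        # equation (16): augmented state fed to the recurrent RL agent
        return np.concatenate([u_t, self.x, y_back])
```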

3.3.2 Direct Recurrent Reinforcement Learning


This augmented, internal feature state is now fed into the upstream model, a direct, re-
current neural network, whose performance measure is a quadratic utility function of re-
ward and risk. Sharpe (2007) discusses asset allocation as a function of expected utility
maximisation, where the utility function may be more complex than that associated with
mean-variance analysis. Denote the expected utility for a single asset portfolio as

$\upsilon_t = \mu_t - \dfrac{\lambda}{2}\sigma_t^2$, (17)
where the expected return $\mu_t$ and variance of returns $\sigma_t^2$ may be estimated in an online fashion with exponential decay

$\mu_t = \tau \mu_{t-1} + (1-\tau) r_t$, (18)

$\sigma_t^2 = \tau \sigma_{t-1}^2 + (1-\tau)(r_t - \mu_t)^2$. (19)
The risk appetite constant $\lambda > 0$ can be set as a function of an investor's desired risk-adjusted return, as demonstrated by Grinold and Kahn (2019). Define the annualised information ratio as the risk-adjusted differential reward measure, where the difference is taken with respect to a benchmark or baseline strategy

$ir_t = 252^{0.5} \times \dfrac{\mu_t - b_t}{\sigma_t}$. (20)


Substituting the non-annualised information ratio into the quadratic utility and differenti-
ating with respect to the risk, we obtain a suitable value for the risk appetite parameter
$\lambda = \dfrac{ir_t}{\sigma_t}$. (21)
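For completeness, a short derivation of equation 21 (ours; it substitutes the non-annualised information ratio $ir_t = (\mu_t - b_t)/\sigma_t$ into equation 17 and treats the benchmark $b_t$ as independent of the risk):

$\upsilon_t = b_t + ir_t \sigma_t - \dfrac{\lambda}{2}\sigma_t^2, \qquad \dfrac{\partial \upsilon_t}{\partial \sigma_t} = ir_t - \lambda\sigma_t = 0 \;\Rightarrow\; \lambda = \dfrac{ir_t}{\sigma_t}.$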
The net returns, whose expectation and variance we seek to learn, are decomposed as

$r_t = \Delta p_t f_{t-1} - \delta_t |\Delta f_t| - \kappa_t f_t$, (22)

where $\Delta p_t$ is the change in reference price, typically a mid price

$\Delta p_t = 0.5 \times (bid_t + ask_t - bid_{t-1} - ask_{t-1})$, (23)

$\delta_t$ represents the execution cost for a price taker

$\delta_t = 0.5 \times (ask_t - bid_t)$, (24)

$\kappa_t$ is the funding rate (see subsection 3.2) and $f_t$ is the desired position learnt by the recurrent reinforcement learner

$f_t = \tanh([w_t^{out}]^T z_t)$. (25)

The model is maximally short when $f_t = -1$ and maximally long when $f_t = 1$. The past positions of the model are used as the feedback connections for equation 16

$\hat{y}_t = [f_{t-n_{back}}, \ldots, f_{t-1}]$. (26)
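The following sketch (ours, in Python; the argument names are assumptions) computes the net reward of equation 22 from consecutive quotes, the funding rate and the targeted positions:

```python
def net_reward(bid, ask, bid_prev, ask_prev, f, f_prev, kappa):
    """Net return of equation (22): price move, execution cost and funding."""
    delta_p = 0.5 * (bid + ask - bid_prev - ask_prev)   # equation (23)
    delta = 0.5 * (ask - bid)                           # half-spread, eq. (24)
    gross_pnl = delta_p * f_prev                        # pnl on the held position
    execution = delta * abs(f - f_prev)                 # cost of changing position
    carry = kappa * f                                   # funding paid (or earned)
    return gross_pnl - execution - carry
```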
The goal of our recurrent reinforcement learner is to maximise the utility of equation 17 by targeting the position of equation 25. To do this, we apply an online optimisation update of the form

$w_t^{out} = w_{t-1}^{out} + \Delta w_t^{out}, \quad \Delta w_t^{out} \equiv \nabla\upsilon_t = \dfrac{d\upsilon_t}{dw_t^{out}}$, (27)
where the weight update procedure is an extended Kalman filter for neural networks
(Williams, 1992b; Haykin, 2001), albeit modified for reinforcement learning in this con-
text. The sequential updates are applied as
$q = 1 + \dfrac{\nabla\upsilon_t^T P_{t-1} \nabla\upsilon_t}{\tau}$, (28)

$k = \dfrac{P_{t-1}\nabla\upsilon_t}{\tau q}$, (29)

$w_t^{out} = w_{t-1}^{out} + k$, (30)

$P_t = \dfrac{P_{t-1}}{\tau} - k k^T q$. (31)
Above, $\tau$ is an exponential decay factor and $P_t$ is an approximation to the inverse Hessian $[\nabla^2\upsilon_t]^{-1}$ of the utility function $\upsilon_t$ with respect to the model weights $w_t^{out}$.
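A minimal sketch of the sequential update of equations 28 to 31 (ours, in Python/NumPy; grad_v stands for $\nabla\upsilon_t$ as computed in equation 32, and the initialisation of $P$ is an assumption):

```python
import numpy as np

def ekf_update(w, P, grad_v, tau=0.99):
    """One extended-Kalman-filter style step, equations (28) to (31).

    w      : output weight vector w_t^out
    P      : approximation to the inverse Hessian of the utility
    grad_v : gradient of the utility with respect to w
    tau    : exponential decay (forgetting) factor
    """
    q = 1.0 + grad_v @ P @ grad_v / tau          # equation (28)
    k = P @ grad_v / (tau * q)                   # equation (29)
    w = w + k                                    # equation (30)
    P = P / tau - np.outer(k, k) * q             # equation (31)
    return w, P
```

P would typically be initialised to a scaled identity matrix.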
We decompose the gradient of the utility function with respect to the recurrent reinforcement learner's parameters as follows

$\nabla\upsilon_t = \dfrac{d\upsilon_t}{dr_t}\left[\dfrac{dr_t}{df_t}\dfrac{df_t}{dw_t^{out}} + \dfrac{dr_t}{df_{t-1}}\dfrac{df_{t-1}}{dw_{t-1}^{out}}\right] = \dfrac{d\upsilon_t}{dr_t}\left\{\dfrac{dr_t}{df_t}\left(\dfrac{\partial f_t}{\partial w_t^{out}} + \dfrac{\partial f_t}{\partial f_{t-1}}\dfrac{\partial f_{t-1}}{\partial w_{t-1}^{out}}\right) + \dfrac{dr_t}{df_{t-1}}\left(\dfrac{\partial f_{t-1}}{\partial w_{t-1}^{out}} + \dfrac{\partial f_{t-1}}{\partial f_{t-2}}\dfrac{\partial f_{t-2}}{\partial w_{t-2}^{out}}\right)\right\}$. (32)


The constituent derivatives for the left half of equation 32 are

$\dfrac{d\upsilon_t}{dr_t} = (1-\tau)[1 - \lambda(r_t - \mu_t)]$, (33)

$\dfrac{dr_t}{df_t} = -\delta_t \times \mathrm{sign}(\Delta f_t) - \kappa_t$, (34)

$\dfrac{df_t}{dw_t^{out}} = z_t[1 - \tanh^2([w_t^{out}]^T z_t)] + w_{t,n}^{out}[1 - \tanh^2([w_t^{out}]^T z_t)] \times z_{t-1}[1 - \tanh^2([w_{t-1}^{out}]^T z_{t-1})]$, (35)

where $n = n_{input} + n_{hidden} + n_{back} - 1$, using 0 as the starting index.
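A sketch of equations 33 to 35 (ours, in Python/NumPy; as in the text, it keeps only the left half of equation 32, and the argument names and the index n defaulting to the last element are assumptions):

```python
import numpy as np

def utility_gradient(w, w_prev, z, z_prev, r, mu, f, f_prev,
                     delta, kappa, lam, tau=0.99, n=-1):
    """Left half of equation (32): dU/dr * dr/df * df/dw, using (33)-(35)."""
    dv_dr = (1.0 - tau) * (1.0 - lam * (r - mu))              # equation (33)
    dr_df = -delta * np.sign(f - f_prev) - kappa              # equation (34)
    # equation (35): df_t/dw_t, including the one-step recurrent term through
    # the last feedback weight w[n] (n indexes the most recent past position)
    d_now = 1.0 - np.tanh(w @ z) ** 2
    d_prev = 1.0 - np.tanh(w_prev @ z_prev) ** 2
    df_dw = z * d_now + w[n] * d_now * z_prev * d_prev
    return dv_dr * dr_df * df_dw
```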

3.4 Experiment Design


We put a recurrent reinforcement learning crypto agent to work by trading the XBTUSD (Bitcoin versus US dollar) perpetual swap on BitMEX. We transfer the output of the source model, the dynamic reservoir feature space of subsection 3.3.1, to the target model, the direct recurrent reinforcement learner of subsection 3.3.2, which learns to target a risk position directly. We use intraday data sampled at five-minute intervals. Our performance evaluation procedure involves the following:
• Select the echo state network and direct reinforcement learner hyper-parameters through a cross-validation procedure on an initial training set, with the information ratio chosen as the performance measure.
• Backtest the remaining available history as a test set.
• Learn sequentially online to target the desired position.
• Force the agent to trade as a price taker, who incurs an execution cost equal to the half-spread of equation 24 plus exchange fees, which are set to 5 basis points (0.05%).
• For non-zero risk positions, apply the appropriate funding profit or loss as per equation 13.
• Monitor equation 18, the expected net reward of the strategy. If $\mu_t \ge 0$, the crypto agent can trade freely; otherwise, close the position and wait for an opportunity to re-enter the market (a sketch of this gating loop follows the list).
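The following sketch (ours, in Python; it assumes the EchoStateFeatures, net_reward, utility_gradient and ekf_update helpers from the earlier sketches, a feed of five-minute quotes and funding rates, and it omits exchange fees and the cross-validation step for brevity) wires the pieces into the online loop and gating rule described above:

```python
import numpy as np

def run_agent(feed, n_input, n_hidden=100, n_back=10, tau=0.99):
    """Online loop: ESN features -> position -> reward -> EKF weight update.

    feed yields (bid, ask, bid_prev, ask_prev, kappa, u) per five-minute bar.
    """
    esn = EchoStateFeatures(n_input, n_hidden, n_back)
    n_w = n_input + n_hidden + n_back
    w, P = np.zeros(n_w), np.eye(n_w)
    w_prev, z_prev = w.copy(), np.zeros(n_w)
    y_back = np.zeros(n_back)              # past positions, feedback to the ESN
    mu, var, f_prev = 0.0, 1e-4, 0.0
    for bid, ask, bid_prev, ask_prev, kappa, u in feed:
        z = esn.step(u, y_back)
        f = float(np.tanh(w @ z))          # target position, equation (25)
        if mu < 0.0:
            f = 0.0                        # gating rule: stand aside
        r = net_reward(bid, ask, bid_prev, ask_prev, f, f_prev, kappa)
        mu = tau * mu + (1 - tau) * r                   # equation (18)
        var = tau * var + (1 - tau) * (r - mu) ** 2     # equation (19)
        lam = mu / var                     # lambda = ir_t / sigma_t with b_t = 0
        delta = 0.5 * (ask - bid)
        grad = utility_gradient(w, w_prev, z, z_prev, r, mu, f, f_prev,
                                delta, kappa, lam, tau)
        w_prev, z_prev, f_prev = w.copy(), z, f
        w, P = ekf_update(w, P, grad, tau)
        y_back = np.roll(y_back, -1)
        y_back[-1] = f
        yield r, f
```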

3.5 Results
Table 1 and figure 3 show the results of the experiment. The crypto agent achieves a total return of just under 350% over a test set spanning slightly less than five years. The associated annualised information ratio is 1.46. Denoting the maximally short position as -1, no
position as 0 and the maximally long position as 1, we see that the agent averages a
position of 0.41. Thus there is a bias toward the agent maintaining a long position, which
is desirable, as Bitcoin has appreciated against the US dollar over this period. We see
visual evidence in figure 3 that on occasion, the crypto agent abstains from trading, or
rather is forced to take no position; this will happen during periods when the predictive
performance of the agent decreases relative to execution and funding costs. Our crypto agent also captures a 71% cumulative return from earning funding, which is expected, as the agent learns to target the position that maximises its quadratic utility, and funding is one of the drivers of this utility. The total execution cost and exchange fees that the agent pays out amount to -54%.


Figure 3: cumulative returns for the XBTUSD crypto agent

              position   execution    carry      pnl
count             1684        1684     1684     1684
mean             0.405      -0.000    0.000    0.002
std              0.579       0.001    0.005    0.027
min             -0.990      -0.019   -0.014   -0.155
25%              0.000       0.000   -0.000    0.000
50%              0.000       0.000    0.000    0.000
75%              0.990       0.000    0.000    0.003
max              0.990       0.000    0.014    0.177
sum                         -0.542    0.713    3.498
ir                                              1.46

Table 1: daily pnl statistics for the XBTUSD crypto agent

4. Discussion
The echo state network provides a powerful and scalable feature space representation. We transfer this learning representation to a recurrent reinforcement learning agent, which learns to target a position directly. It is possible to use the echo state network as a reinforcement learning agent itself, as shown by Szita et al. (2006). However, the approach taken there may lead to undesired behaviour in a trading context. Specifically, they use an echo state network to estimate the state-action value function of a temporal difference learning sarsa model (Sutton, 1988; Sutton and Barto, 2018). This value function takes
the form

$q_\pi(s, a) = E_\pi\{r_t | s_t, a_t\}$, (36)

where the expected return $r_t$ depends on the transition to state $s_t$ having taken action $a_t$ under policy $\pi$. Sarsa estimates this value function sequentially as

$q(s_t, a_t) = (1 - \eta) q(s_t, a_t) + \eta[r_{t+1} + \gamma q(s_{t+1}, a_{t+1})]$, (37)

where $0 < \gamma \le 1$ is a discount factor for multi-step rewards and $\eta > 0$ is a learning rate.
Equation 37 shows that the state transition reward is passed back to the starting state. In activities such as maze traversal or board games, being aware of the reward associated with multiple steps or decisions and passing that reward back to the current position is of great value. However, in the context of trading, where the value function $q(s_t, a_t)$ represents the value of a position $s_t$, with the possibility of switching or remaining in the same position denoted by the action $a_t$, we will on occasion find that a larger utility is assigned to the wrong state. For example, imagine the current state is $s_t = 0$, that is, the model has no position. Now we observe a large positive price jump leading to a large positive reward $r_{t+1} \gg 0$. Value function estimators such as equation 37 would pass the state transition reward $r_{t+1}$ to the initial state $q(s_t = 0, a_t = 0)$. At the next iteration, an $\epsilon$-greedy policy will, with probability $1 - \epsilon$, select the action that remains in $q(s_{t+1} = 0, a_{t+1} = 0)$, as it has the highest value function. However, if the position is zero, the model cannot hope to earn a profit. Even if one excludes the zero state, then there
is still the possibility of observing this problem for a reversal strategy with possible states $s \in \{-1, 1\}$. Direct reinforcement, as we describe it, does not incur these problems, and we have, through transfer learning, improved upon the earlier work in direct reinforcement (Moody et al., 1998; Gold, 2003).

5. Limitations Of The Work


As previously discussed in subsection 3.3.1, the various echo state network parameters are initialised at random. Figure 4 and table 2 measure the impact of this randomness on test set performance. We run a Monte Carlo simulation of 250 trials, where the network parameterisation is fixed to $\theta = \{n_{hidden} = 100, n_{back} = 10\}$. The information ratios in this set of simulations vary from 0.219 to 1.763, with total returns between 65.4% and 502.1%. Whilst this does show evidence of reasonable variability in returns, table 2 also shows that a 95% confidence interval for the mean information ratio spans roughly 1.12 to 1.20 and a 95% confidence interval for the mean total return spans roughly 289% to 307%. Thus the overall picture remains unchanged: the transfer learning crypto agent has successfully learnt how to trade this instrument during the test period. Other than this acceptable sensitivity to weight initialisation, we make no assumptions whose violation would invalidate our results.

6. Conclusion
We demonstrate an application of online transfer learning as a digital assets trading agent.
This agent makes use of a powerful feature space representation in the form of an echo state
network, the output of which is made available to a direct, recurrent reinforcement learning
agent. The agent learns to trade the XBTUSD (Bitcoin versus US dollars) perpetual swap
derivatives contract on BitMEX. It learns to trade intraday on data sampled at five-minute intervals, avoids excessive over-trading, captures a funding profit and is also able to predict the direction of the market. Overall, our crypto agent realises a total return of 350%, net of transaction costs, over roughly five years, 71% of which is attributable to funding profit. The annualised information ratio that it achieves is 1.46.


Figure 4: Monte Carlo simulation, information ratio versus total return

               ir   total return
count         250            250
mean        1.160          2.977
std         0.299          0.740
min         0.219          0.654
25%         1.023          2.572
50%         1.199          3.076
75%         1.355          3.352
max         1.763          5.021
se(mean)    0.019          0.047
lb(mean)    1.123          2.886
ub(mean)    1.197          3.069

Table 2: Monte Carlo simulation, information ratio versus total return

References
Robert Almgren and Neil Chriss. Optimal execution of portfolio transactions. Journal of Risk, 3:
5–40, 2001.

Carl Gold. FX trading via recurrent reinforcement learning. In IEEE International Conference on Computational Intelligence for Financial Engineering (CIFEr). IEEE, 2003.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.

Clive W. J. Granger and Paul Newbold. Spurious regressions in econometrics. Journal of economet-
rics, 2:111–120, 1974.


Simon Haykin. Kalman filtering and neural networks. Wiley, 2001.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):
1735–1780, 1997.

Herbert Jaeger. Adaptive nonlinear system identification with echo state networks. In NIPS, 2002.

Herbert Jaeger. Using conceptors to manage neural long-term memories for temporal patterns.
Journal of Machine Learning Research, 18:13:1–13:43, 2017.

Richard C. Grinold and Ronald N. Kahn. Advances in Active Portfolio Management: New Developments in Quantitative Investing. McGraw-Hill Education, 2019.

Mantas Lukosevicius and Herbert Jaeger. Reservoir computing approaches to recurrent neural net-
work training. Comput. Sci. Rev., 3:127–149, 2009.

Mantas Lukosevicius, Herbert Jaeger, and Benjamin Schrauwen. Reservoir computing trends. KI -
Künstliche Intelligenz, 26:365–371, 2012.

Suyuan Luo, Xudong Lin, and Zunxin Zheng. A novel cnn-ddpg based ai-trader: Performance and
roles in business operations. Transportation research. Part E, Logistics and transportation review,
131:68–79, 2019. ISSN 1366-5545.

Wolfgang Maass and Henry Markram. On the computational power of circuits of spiking neurons.
J. Comput. Syst. Sci., 69:593–616, 2004.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wier-
stra, and Martin Riedmiller. Playing atari with deep reinforcement learning, 2013.

Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim
Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement
learning, 2016.

John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and rein-
forcement learning for trading systems and portfolios. Journal of Forecasting, 17:441–470, 1998.

James M. Murray. Local online learning in recurrent networks with random feedback. eLife, 8, 2019.

Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge
and Data Engineering, 22:1345–1359, 2010.

David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations
by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive
Science, 1985.

Bruno Klein Salvalaio and Gabriel de Oliveira Ramos. Self-adaptive appearance-based eye-tracking
with online transfer learning. In 2019 8th Brazilian Conference on Intelligent Systems (BRACIS),
pages 383–388. IEEE, 2019.

Arash Samadi, Timothy P. Lillicrap, and Douglas Blair Tweed. Deep learning with dynamic spiking
neurons and fixed feedback weights. Neural Computation, 29:578–602, 2017.

William F. Sharpe. Mutual fund performance. The Journal of Business, 39, 1 1966.

William. F. Sharpe. Expected utility asset allocation. Financial Analysts Journal, 63(5):18–30,
2007.


Robert H. Shumway. Time series analysis and its applications : with R examples. Springer texts in
statistics. Springer, 3rd ed. edition, 2011.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine
Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of go with
deep neural networks and tree search. Nature, 529, 1 2016. ISSN 0028-0836. doi: 10.1038/
nature16961.

Masashi Sugiyama. Statistical Reinforcement Learning. CRC Press, 1st edition edition, 2015. ISBN
9781439856901.

R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):
9–44, 1988. ISSN 0885-6125.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018.

István Szita, Viktor Gyenes, and András Lörincz. Reinforcement learning with echo state networks.
In ICANN, 2006.

Aviv Tamar, Yinlam Chow, Mohammad Ghavamzadeh, and Shie Mannor. Sequential decision
making with coherent risk. IEEE transactions on automatic control, 62(7):3323–3338, 2017. ISSN
0018-9286.

Xingda Wang, Xiaoping Wang, and Zhigang Zeng. A novel weight update rule of online transfer
learning. In 2020 12th International Conference on Advanced Computational Intelligence (ICACI),
pages 349–355. IEEE, 2020.

C. J. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, 1989.

P.J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the
IEEE, 78(10):1550–1560, 1990. doi: 10.1109/5.58337.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 5 1992a. ISSN 0885-6125. doi: 10.1007/BF00992696.

Ronald J. Williams. Training recurrent networks using the extended kalman filter. In [Proceedings
1992] IJCNN International Joint Conference on Neural Networks, volume 4, pages 241–246 vol.4,
1992b.

Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent
neural networks. Neural computation, 1:270–280, 1989.

Xinghan Xu and Weijie Ren. Prediction of air pollution concentration based on mrmr and echo
state network. Applied Sciences, 9(9), 2019. ISSN 2076-3417.

Qiang Yang, Yu Zhang, Wenyuan Dai, and Sinno Jialin Pan. Transfer Learning. Cambridge Uni-
versity Press, 2020.

Zekun Ye, Weijie Deng, Shuigeng Zhou, Yi Xu, and Jihong Guan. Optimal trade execution based
on deep deterministic policy gradient. In Database Systems for Advanced Applications, volume
12112 of Lecture Notes in Computer Science, pages 638–654. Springer International Publishing,
2020. ISBN 3030594092.


Izzet B. Yildiz, Herbert Jaeger, and Stefan J. Kiebel. Re-visiting the echo state property. Neural
networks : the official journal of the International Neural Network Society, 35:1–9, 2012.

Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deep reinforcement learning for trading. arXiv preprint, 2019.

Peilin Zhao, Steven C. H. Hoi, Jialei Wang, and Bin Li. Online transfer learning. Artificial intelli-
gence, 216:76–102, 2014.

