Towards Inverse Reinforcement Learning for Limit Order Book Dynamics

Jacobo Roa-Vicens 1 2, Cyrine Chtourou 1, Angelos Filos 3, Francisco Rul·lan 2, Yarin Gal 3, Ricardo Silva 2

arXiv:1906.04813v1 [cs.LG] 11 Jun 2019

Abstract

Multi-agent learning is a promising method to simulate aggregate competitive behaviour in finance. Learning expert agents' reward functions through their external demonstrations is hence particularly relevant for the subsequent design of realistic agent-based simulations. Inverse Reinforcement Learning (IRL) aims to acquire such reward functions through inference, allowing the resulting policy to generalize to states not observed in the past. This paper investigates whether IRL can infer such rewards from agents within real financial stochastic environments: limit order books (LOB). We introduce a simple one-level LOB, where the interactions of a number of stochastic agents and an expert trading agent are modelled as a Markov decision process. We consider two cases for the expert's reward: either a simple linear function of state features, or a complex, more realistic non-linear function. Given the expert agent's demonstrations, we attempt to discover their strategy by modelling their latent reward function using linear and Gaussian process (GP) regressors from previous literature, and our own approach through Bayesian neural networks (BNN). While all three methods can learn the linear case, only the GP-based and our proposed BNN methods are able to discover the non-linear reward case. Our BNN IRL algorithm outperforms the other two approaches as the number of samples increases. These results illustrate that complex behaviours, induced by non-linear reward functions amid agent-based stochastic scenarios, can be deduced through inference, encouraging the use of inverse reinforcement learning for opponent modelling in multi-agent systems.

1 JPMorgan Chase & Co., London, United Kingdom. 2 Department of Statistical Science, University College London, United Kingdom. 3 Department of Computer Science, University of Oxford, United Kingdom. Correspondence to: Jacobo Roa-Vicens <jacobo.roavicens@jpmorgan.com>.

Published as a workshop paper on AI in Finance: Applications and Infrastructure for Multi-Agent Learning at the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). Opinions expressed in this paper are those of the authors, and do not necessarily reflect the view of JP Morgan.

1. Introduction

Limit order books play a central role in the formation of prices for financial securities in exchanges globally. These systems centralize limit orders of price and volume to buy or sell certain securities from large numbers of dealers and investors, matching bids and offers in a transparent process. The dynamics that emerge from this aggregate process of competitive behaviour (Preis et al., 2006; Cont et al., 2010) fall naturally within the scope of multi-agent learning, and understanding them is of paramount importance for liquidity providers (known as market makers) to achieve optimal execution of their operations. We focus on the problem of learning the latent reward function of an expert agent from a set of observations of their external actions in the LOB.

Reinforcement learning (RL) (Sutton & Barto, 2018) is a formal framework for studying sequential decision-making, particularly relevant for modelling the behaviour of financial agents in environments like the LOB. In particular, RL models their decision-making process as agents interacting with a dynamic environment through policies that seek to maximize their respective cumulative rewards. Inverse reinforcement learning (Russell, 1998) is therefore a powerful framework to analyze and model the actions of such agents, aiming at discovering their latent reward functions: the most "succinct, robust and transferable definition of a task" (Ng et al., 2000). Once learned, such reward functions can be generalized to unobserved regions of the state space, an important advantage over other learning methods.

Overview. The paper starts in Section 2 by introducing the foundations of IRL, relevant aspects of the maximum causal entropy model used, and useful properties of Gaussian processes and Bayesian neural networks applied to IRL. A formal description of the limit order book model follows. In Section 3, we formulate the one-level LOB as a finite Markov decision process (MDP) and express its dynamics in terms of a Poisson binomial distribution. Finally, we investigate the performance of two well-known IRL methods (maximum entropy IRL and GP-based IRL), and propose and test an additional IRL approach based on Bayesian neural networks. Each IRL method is tested on two versions of the LOB environment, where the reward function of the expert agent may be either a simple linear function of state features, or a more complex and realistic non-linear reward function. Results show that BNNs are able to recover the target rewards, outperforming comparable methods both in IRL performance and in terms of computational efficiency.
The distribution of trajectories generated by the optimal policy follows π*(a|s) ∝ exp{Q′(s_t, a_t)} (Ziebart et al., 2008; Haarnoja et al., 2017), where Q′ is a soft Q-function that incorporates the treatment of sequentially revealed information, regularizing the objective of forward reinforcement learning with respect to the differential entropy H(·), as described in Fu et al. (2017):

Q'(s_t, a_t) = r_t(s, a) + \mathbb{E}_{(s_{t+1}, \ldots) \sim \pi}\left[\sum_{t'=t}^{T} \gamma^{t'}\left(r(s_{t'}, a_{t'}) + H\left(\pi(\cdot \mid s_{t'})\right)\right)\right]

The inference problem for an unknown reward function r(s, a) thus boils down to maximizing the likelihood P(D|r), as in Levine et al. (2011):

P(D \mid r) = \exp\left[\sum_{i}\sum_{t}\left(Q'^{\,r}_{s_{i,t}, a_{i,t}} - \log \sum_{a'} \exp\left(Q'^{\,r}_{s_{i,t}, a'}\right)\right)\right]

2.2. IRL methods considered

The three inverse reinforcement learning methods that we test on our LOB model, for both linear and exponential expert rewards, are: maximum entropy IRL (MaxEnt), Gaussian process-based IRL (GPIRL), and our implementation through Bayesian neural networks (BNN IRL). These methods are defined as follows.

Maximum Entropy IRL. Proposed by Ziebart (2010), this method builds on the maximum causal entropy objective function described above, combined with a linearity assumption on the structure of the reward as a function of the state features: r(s) = w^T φ(s), where φ(s) : S ↦ R^m is the m-dimensional mapping from the state to the feature vector. The IRL problem then boils down to inferring w.

Gaussian Process-based IRL. Maximum causal entropy provides a method to infer values of the reward function at specific points of the state space. This suffices when the MDP is finite and the observed demonstrations cover the entire state space, which is not very common. In this context, it is important to learn a structure of rewards that mirrors as closely as possible the behaviour of the expert agent.

While many prior IRL methods assume linearity of the reward function, GP-based IRL (Levine et al., 2011) expands the function space of possible inferred rewards to non-linear reward structures. Under this model, the reward function r is modelled by a zero-mean GP prior with a kernel or covariance function k_θ. If X_f ∈ R^{n×m} is the feature matrix defining a finite number n of m-dimensional states, and f is the reward function evaluated at X_f, f = r(X_f), then f | X, θ ∼ N(0, K_{f,f}), where [K_{f,f}]_{i,j} = k_θ(x_i, x_j).

The GPIRL objective (Levine et al., 2011) is then maximized with respect to the finite rewards f and θ:

p(f, \theta, D \mid X_f) = \left[\int_{r} p(D \mid r)\, p(r \mid f, \theta, X_f)\, dr\right] p(f, \theta \mid X_f)

Inside the above integral, we can recognize the IRL objective p(D|r), the GP posterior p(r | f, θ, X_f), and the prior of f and θ. To mitigate the intractability of this integral, Levine et al. (2011) use the Deterministic Training Conditional (DTC) approximation, which reduces the GP posterior to its mean function.

Bayesian Neural Networks applied to IRL. A neural network (NN) is a superposition of multiple layers consisting of linear transformations and non-linear functions. Infinitely wide NNs whose weights are random variables have been proven to converge to a GP (Neal, 1995; Williams & Jacobs, 1997) and hence represent a universal function approximator (Cybenko, 1989). However, this convergence property does not apply to finite NNs, which are the networks used in practice. Bayesian neural networks are the equivalent networks in the finite case. BNNs have been the focus of multiple studies (Neal, 1995; MacKay, 1992; Gal & Ghahramani, 2016) and are known for their useful regularization properties. Since exact inference is computationally challenging, variational inference has instead been used to approximate these models (Hinton & Van Camp, 1993; Peterson, 1987; Graves, 2011; Gal & Ghahramani, 2016).

Given a dataset Z = {(x_n, y_n)}_{n=1}^{N}, we wish to learn the mapping x_n → y_n in a robust way. A BNN can be characterized through a prior distribution p(w) placed on its weights, and the likelihood p(Z|w). An approximate posterior distribution q(w) can be fit through Bayesian variational inference, maximizing the evidence lower bound (ELBO):

L_q = \mathbb{E}_q[\log p(Z \mid w)] - \mathrm{KL}\left[q(w) \,\|\, p(w)\right]

We parametrize q(w) with θ and use the mean-field variational inference approximation (Peterson, 1987), which assumes that the weights of each layer factorize into independent distributions. We choose a diagonal Gaussian prior distribution p(w) and solve the optimization problem:

\max_{\theta} L_{q_\theta}

In the context of the IRL problem, we leverage the benefits of BNNs to generalize the point estimates provided by maximum causal entropy to a reward function in a robust and efficient way. Unlike in GPIRL, where the full objective is maximized at each iteration, BNN IRL consists of two separate steps:
• Inference step: we first optimize the IRL objective p(D|r̂), obtaining a finite number of point estimates of the reward {r̂(s_n) = r̂_n}_{n=1}^{N} over a finite number of states {s_n ∈ S}_{n=1}^{N}.

• Learning step: secondly, we use these point estimates to train a BNN that learns the mapping between state features s ∈ S and their respective inferred reward values r̂(s) ∈ R.

The number of point estimates used is the number of states occurring in the expert's demonstrations. This means that we do not use each state in the state space once, but as many times as it appears in the demonstrations D. This is equivalent to tailoring the learning rate of the Bayesian neural network to match the state visitation counts.

2.3. Limit Order Book

Our experimental setup builds on limit order books (LOBs): here we introduce some basic definitions following the conventions of Gould et al. (2013) and Zhang et al. (2019). Two types of orders exist in an LOB: bids (buy orders) and asks (sell orders). A bid order for an asset is a quote to buy the underlying asset at or below a specified price P_b(t). Conversely, an ask order is a quote to sell the asset at or above a certain price P_a(t). For a certain price, the bid and ask orders have respective sizes (volumes) V_b(t) and V_a(t). Each level (l) of the LOB is represented by one pair (p_b^{(l)}(t), v_b^{(l)}(t)) or (p_a^{(l)}(t), v_a^{(l)}(t)), generally ranked in decreasing order of competitiveness.

3. Experiments

Experimental Setup. Our environment is a one-level LOB, resulting from the interaction of N (Markovian) trading agents (TA) and an expert agent (EA trader). Without loss of generality, the EA is restricted to hold at most I_max inventory (amount of securities in their portfolio). The EA is unaware of the strategies of each of the other trading agents, but adapts to them through trial and error, as in reinforcement learning frameworks (Sutton & Barto, 2018). The EA solves the following MDP ⟨S, A, T, r, γ, P_0⟩:

• State Space S. The environment state at time-step t, s_t, is a 3-dimensional vector:

s_t = \left[v_b^{(1)}(t) \;\; v_a^{(1)}(t) \;\; i^{(EA)}(t)\right]^T \in S \quad (1)

where v_b^{(1)}(t), v_a^{(1)}(t) ∈ R_+ are the volumes available in the one-level LOB at the best bid and ask levels, respectively, and i^{(EA)}(t) ∈ {−I_max, …, 0, …, +I_max} is the inventory held by the expert agent at time-step t.

• Action Space A. The expert agent takes an action a_t at time-step t by selecting volumes v_b^{(EA)}(t) and v_a^{(EA)}(t) to match the trading objectives defined by their reward with those available in the LOB through the other traders, at the best bid and ask, respectively:

a_t = \left[v_b^{(EA)}(t) \;\; v_a^{(EA)}(t)\right]^T \in A \quad (2)

where v_b^{(EA)}(t) + v_a^{(EA)}(t) ≤ N (see Transition Dynamics below). The EA is here an active market participant, which actively sells at the best ask and buys at the best bid, while the trading agents on the other side of the LOB only place passive orders.

• Transition Dynamics T. At each time-step t, the n-th trader, n ∈ {1, …, N}, places a single order o_t^{(n)} at either side of the LOB (best bid or best ask). The traders follow stochastic policies:

\pi^{(n)}(\cdot \mid s_t; \tau^{(n)}) = \mathrm{Ber}\!\left(\frac{e^{v_b^{(1)}(t-1)/\tau^{(n)}}}{e^{v_b^{(1)}(t-1)/\tau^{(n)}} + e^{v_a^{(1)}(t-1)/\tau^{(n)}}}\right) \quad (3)

o_t^{(n)} \sim \pi^{(n)}(\cdot \mid s_t; \tau^{(n)}) \quad (4)

where o_t^{(n)} = 1 corresponds to a trader placing an order at the best bid, and o_t^{(n)} = 0 at the best ask. By construction, each of the trading agents has to place exactly one order, a bid or an ask, at each time-step. Ber(·) denotes the Bernoulli distribution. The temperature parameters τ = (τ_1, …, τ_N) are independent and unknown to the expert agent.

Hence, the aggregate intermediate bid orders generated by the N trading agents, v_b^{(TA)}(t), are the sum of N independent Bernoulli variables, whose parameters are conditioned on the environment state s_t and the idiosyncratic temperature parameters τ. Therefore, the number of intermediate bids is distributed according to a Poisson binomial distribution (Shepp & Olkin, 1981), whose probability mass function can be expressed in closed form. Since each of the trading agents (but not the EA) places exactly one order, the aggregate intermediate ask orders are v_a^{(TA)}(t) = N − v_b^{(TA)}(t). The aggregate bids and asks of the trading agents form an intermediate state called the belief state b_t = (v_b^{(TA)}(t), v_a^{(TA)}(t)).

Although the EA can only see the last available LOB snapshot contained in s_t, their orders (a_t) are executed against the intermediate state b_t. This could be compared, in real-world LOBs, to a slippage effect. Based on the number of bids and asks finally matched by the …
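The trader policies of Eqs. (3)–(4) and the Poisson binomial belief state they induce can be sketched as follows. This is a minimal illustration with our own function names (`bid_probability`, `poisson_binomial_pmf`) and toy parameter values, not the authors' implementation:

```python
import math

def bid_probability(v_bid, v_ask, tau):
    """Eq. (3): probability that a trader posts at the best bid -- a softmax
    over the previous best bid/ask volumes with temperature tau."""
    zb = math.exp(v_bid / tau)
    za = math.exp(v_ask / tau)
    return zb / (zb + za)

def poisson_binomial_pmf(k, probs):
    """P(sum of independent Bernoulli(p_n) variables equals k), computed with
    the standard dynamic-programming recursion over the traders."""
    pmf = [1.0] + [0.0] * len(probs)
    for p in probs:
        for j in range(len(pmf) - 1, 0, -1):
            pmf[j] = pmf[j] * (1 - p) + pmf[j - 1] * p
        pmf[0] *= (1 - p)
    return pmf[k]

# Example: N = 3 traders with idiosyncratic temperatures observe the previous
# best-bid volume 2.0 and best-ask volume 1.0.
taus = [0.5, 1.0, 2.0]
probs = [bid_probability(2.0, 1.0, tau) for tau in taus]

# Belief state: distribution of the aggregate intermediate bids v_b^(TA);
# the asks follow as v_a^(TA) = N - v_b^(TA).
belief = [poisson_binomial_pmf(k, probs) for k in range(len(taus) + 1)]
assert abs(sum(belief) - 1.0) < 1e-12
```

Lower temperatures sharpen a trader's preference for the deeper side of the book, while the aggregate bid count remains a Poisson binomial variable because the Bernoulli parameters differ across traders.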
[Figure 4: two heatmaps showing the expert's linear and exponential rewards for i^{(EA)} = 0, over the best bid volume v_b^{(1)} and best ask volume v_a^{(1)}.]

Figure 4. Reward functions (linear and exponential) with respect to the state features: volume of best bid and ask, and expert agent's inventory.

2. the expected cumulative reward earned by following the policy π̂ implied by the rewards inferred through IRL:

\mathrm{EVD} = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r(s_t) \,\middle|\, \pi^*\right] - \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r(s_t) \,\middle|\, \hat{\pi}\right] \quad (7)

[Figure 5: EVD curves for the linear and exponential rewards, plotted against the number of demonstrations (2^10 to 2^14).]

Figure 5. EVD for both the linear and the exponential reward functions as inferred through the MaxEnt, GP and BNN IRL algorithms, for increasing numbers of demonstrations. EVDs are all evaluated on 100,000 trajectories each. The standard errors are plotted for 10 independent evaluations.

We run two versions of our experiments, where the expert agent has either a linear or an exponential reward function. Each IRL method is run for 512, 1024, 2048, 4096, 8192 and 16384 demonstrations. The results obtained are presented in Figure 5: as expected, all three IRL methods tested (MaxEnt IRL, GPIRL, BNN-IRL) learn linear reward functions fairly well. However, for an agent with an exponential reward, GPIRL and BNN-IRL are able to discover the latent function significantly better, with BNN-IRL outperforming as the number of demonstrations increases. In addition to improved EVD, our BNN-IRL experiments provide a significant improvement in computational time as compared to GPIRL, hence enabling potentially more efficient scaling of IRL on LOBs to state spaces of higher dimensions.

4. Conclusions

In this paper we attempt an application of IRL to a stochastic, non-stationary financial environment (the LOB) whose competing agents may follow complex reward functions. While, as expected, all the methods considered are able to recover linear latent reward functions, only GP-based IRL (Levine et al., 2011) and our implementation through BNNs are able to recover more realistic non-linear expert rewards, thus mitigating most of the challenges imposed by this stochastic multi-agent environment. Moreover, our BNN method outperforms GPIRL for larger numbers of demonstrations, and is less computationally intensive. This may enable future work to study LOBs of higher dimensions, with increasingly realistic numbers and complexity of the agents involved.

References

Cont, R., Stoikov, S., and Talreja, R. A stochastic model for order book dynamics. Operations Research, 58(3):549–563, 2010.

Cybenko, G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

Fu, J., Luo, K., and Levine, S. Learning robust rewards with adversarial inverse reinforcement learning. arXiv preprint arXiv:1710.11248, 2017.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pp. 1050–1059. PMLR, 2016.

Gould, M. D., Porter, M. A., Williams, S., McDonald, M., Fenn, D. J., and Howison, S. D. Limit order books. Quantitative Finance, 13(11):1709–1742, 2013.
Graves, A. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2011.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 1352–1361. JMLR.org, 2017.

Hendricks, D., Cobb, A., Everett, R., Downing, J., and Roberts, S. J. Inferring agent objectives at different scales of a complex adaptive system. arXiv preprint arXiv:1712.01137, 2017.

Hinton, G. and Van Camp, D. Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 6th Annual ACM Conference on Computational Learning Theory, 1993.

Jin, M., Damianou, A. C., Abbeel, P., and Spanos, C. J. Inverse reinforcement learning via deep Gaussian process. In UAI. AUAI Press, 2017.

Preis, T., Golke, S., Paul, W., and Schneider, J. J. Multi-agent-based order book model of financial markets. Europhysics Letters (EPL), 75(3):510–516, 2006. doi: 10.1209/epl/i2006-10139-0.

Roberson, B. The Colonel Blotto game. Economic Theory, 29(1):1–24, 2006.

Russell, S. Learning agents for uncertain environments (extended abstract). In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 101–103. ACM Press, 1998.

Shepp, L. A. and Olkin, I. Entropy of the sum of independent Bernoulli random variables and of the multinomial distribution. In Contributions to Probability, pp. 201–206. Elsevier, 1981.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tukey, J. W. A problem of strategy. Econometrica, 17(1):73, 1949.