Eric Benhamou 1,2, David Saltiel 1,3, Sandrine Ungari 4, Abhishek Mukhopadhyay 5

1 AI Square Connect, France, {eric.benhamou,david.saltiel}@aisquareconnect.com
2 MILES, LAMSADE, Dauphine university, France, eric.benhamou@lamsade.dauphine.fr
3 LISIC, ULCO, France, david.saltiel@univ-littoral.fr
4 Societe Generale, Cross Asset Quantitative Research, UK
5 Societe Generale, Cross Asset Quantitative Research, France
{sandrine.ungari,abhishek.mukhopadhyay}@sgcib.com
Abstract

While researchers in the asset management industry have mostly focused on techniques based on financial and risk planning, like the Markowitz efficient frontier, minimum variance, maximum diversification or equal risk parity, in parallel, another community in machine learning has started working on reinforcement learning, and more particularly deep reinforcement learning, to solve other decision making problems for challenging tasks like autonomous driving, robot learning, and, on a more conceptual side, game solving like Go. This paper aims to bridge the gap between these two approaches by showing that Deep Reinforcement Learning (DRL) techniques can shed new light on portfolio allocation thanks to a more general optimization setting that casts portfolio allocation as an optimal control problem that is not just a one-step optimization, but rather a continuous control optimization with a delayed reward. The advantages are numerous: (i) DRL maps market conditions directly to actions by design and hence should adapt to a changing environment, (ii) DRL does not rely on any traditional financial risk assumptions, such as risk being represented by variance, (iii) DRL can incorporate additional data and be a multi-input method, as opposed to more traditional optimization methods. We present an experiment with some encouraging results using convolutional networks.

Introduction

In asset management, there is a gap between mainstream methods and new machine learning techniques around reinforcement learning, and in particular deep reinforcement learning. The former rely on financial risk optimization and solve the planning problem of the optimal portfolio as a single-step optimization question. The latter make no assumptions about risk, perform a more involved multi-step optimization, and solve complex and challenging tasks like autonomous driving (Wang, Jia, and Weng 2018), learning advanced locomotion and manipulation skills from raw sensory inputs (Levine et al. 2015; 2016; Schulman et al. 2015a; 2017; Lillicrap et al. 2015) or, on a more conceptual side, reaching superhuman level in popular games like Atari (Mnih et al. 2013), Go (Silver et al. 2016; 2017), StarCraft II (Vinyals et al. 2019), etc. One of the reasons often put forward for this situation is that asset management researchers have mostly been trained with an econometric and financial mathematics background, while the deep reinforcement learning community has mostly been trained in computer science and robotics, leading to two distinct research communities that do not interact much with each other. In this paper, we aim to present the various approaches, showing similarities and differences, in order to bridge the gap between them. Both methods can help solve the decision making problem of finding the optimal portfolio allocation weights.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related works

As this paper aims at bridging the gap between traditional asset management portfolio selection methods and deep reinforcement learning, there are too many works to cite them all. On the traditional methods side, the seminal work is (Markowitz 1952), which has led to various extensions like minimum variance (Chopra and Ziemba 1993; Haugen and Baker 1991; Kritzman 2014), maximum diversification (Choueifaty and Coignard 2008; Choueifaty, Froidure, and Reynier 2012), maximum decorrelation (Christoffersen et al. 2010), and risk parity (Maillard, Roncalli, and Teïletche 2010; Roncalli and Weisang 2016). We review these works in the section entitled Traditional methods.

On the reinforcement learning side, the seminal book is (Sutton and Barto 2018). The field of deep reinforcement learning is growing every day at an unprecedented pace, making the citation exercise complicated. In terms of breakthroughs, one can cite the work on Atari games from raw pixel inputs (Mnih et al. 2013; 2015), Go (Silver et al. 2016; 2017), StarCraft II (Vinyals et al. 2019), learning advanced locomotion and manipulation skills from raw sensory inputs (Levine et al. 2015; 2016; Schulman et al. 2015a; 2015b; 2017; Lillicrap et al. 2015), autonomous driving (Wang, Jia, and Weng 2018) and robot learning (Gu et al. 2017).

Regarding the application of deep reinforcement learning methods to portfolio allocation, there is already growing interest, as recent breakthroughs have put increasing emphasis on this method. Hence, the field is moving very rapidly, and surveys like (Fischer 2018) are already outdated. Driven initially mostly by applications to cryptocurrencies and Chinese financial markets (Jiang and Liang 2016; Zhengyao et al. 2017; Liang et al. 2018; Yu et al. 2019; Wang and Zhou 2019; Saltiel et al. 2020; Benhamou et al. 2020b; 2020a; 2020c), the field is progressively taking off on other assets (Kolm and Ritter 2019; Liu et al. 2020; Ye et al. 2020; Li et al. 2019; Xiong et al. 2019). More generally, DRL has recently been applied to problems other than portfolio allocation. For instance, (Deng et al. 2016; Zhang, Zohren, and Roberts 2019; Huang 2018; Théate and Ernst 2020; Chakraborty 2019; Nan, Perumal, and Zaiane 2020; Wu et al. 2020) tackle the problem of direct trading strategies, (Bao and yang Liu 2019) handles multi-agent trading, while (Ning, Lin, and Jaimungal 2018) examine optimal execution.
Traditional methods

We are interested in finding an optimal portfolio, which makes the planning problem quite different from a standard planning problem, where the aim is to plan a succession of tasks. Typical planning algorithms are variations around STRIPS (Fikes and Nilsson 1971), which starts by analyzing end goals and means, builds the corresponding graph, and finds the optimal path in it. Indeed, we start from the goals to achieve and try to find means that can lead to them. Later work, like Graphplan as presented in (Blum and Furst 1995), uses a novel planning graph to reduce the amount of search needed, while hierarchical task network (HTN) planning leverages classification to structure the networks and hence reduce the number of graph searches. Other algorithms, like search algorithms such as A*, B* and weighted A*, or, for full graph search, branch and bound and its extensions, as well as evolutionary algorithms like particle swarm and CMA-ES, are also widely used in AI planning. However, when it comes to portfolio allocation, standard methods used by practitioners rely on more traditional financial risk-reward optimization problems and follow rather the Markowitz approach, as presented below.

Figure 1: Markowitz efficient frontier for the GAFA: returns taken from 2017 to end of 2019

Let µ be the vector of expected returns of the l strategies and Σ the matrix of variance covariances of the l strategies' returns. Let r_min be the minimum expected return. The Markowitz optimization problem to solve is to minimize the risk given a target of minimum expected return, as follows:

    Minimize_w   wᵀ Σ w                                                (1)
    subject to   µᵀ w ≥ r_min,   sum_{i=1}^{l} w_i = 1,   1 ≥ w ≥ 0

It is solved by standard quadratic programming. Thanks to duality, there is an equivalent maximization with a given maximum risk σ_max, for which the problem writes as follows:

    Maximize_w   µᵀ w                                                  (2)
    subject to   wᵀ Σ w ≤ σ_max,   sum_{i=1}^{l} w_i = 1,   1 ≥ w ≥ 0
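Problem (1) can be sketched with an off-the-shelf quadratic solver. The snippet below is only an illustration: the expected returns and covariance matrix are synthetic numbers (the paper's strategies are proprietary), and we use scipy's generic SLSQP solver rather than a dedicated QP solver.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical inputs for l = 4 strategies (synthetic, for illustration only)
mu = np.array([0.05, 0.07, 0.03, 0.10])          # expected returns
sigma = np.array([[0.040, 0.006, 0.002, 0.010],
                  [0.006, 0.050, 0.004, 0.012],
                  [0.002, 0.004, 0.030, 0.006],
                  [0.010, 0.012, 0.006, 0.090]])  # covariance matrix
r_min = 0.06                                      # minimum expected return target

l = len(mu)
res = minimize(
    fun=lambda w: w @ sigma @ w,                  # objective: w^T Sigma w
    x0=np.full(l, 1.0 / l),                       # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * l,                      # 0 <= w_i <= 1
    constraints=[
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},     # sum_i w_i = 1
        {"type": "ineq", "fun": lambda w: mu @ w - r_min},  # mu^T w >= r_min
    ],
)
w_opt = res.x
print("weights:", w_opt.round(4), "expected return:", round(mu @ w_opt, 4))
```

The dual form (2) is obtained by swapping the objective and the risk constraint in the same setup.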
Action

In our deep reinforcement learning setting, the augmented asset manager agent needs to decide at each period in which hedging strategy it invests. The augmented asset manager can invest in l strategies, which can be simple strategies or strategies that are themselves run by an asset management agent. To cope with reality, the agent is only able to act after one period. This is because asset managers have a one-day turnaround to change their position. We will see in the experiments that this one-day turnaround lag makes a big difference in the results.

As the agent has access to l potential hedging strategies, the output is an l-dimensional vector that provides how much it invests in each hedging strategy. For our deep network, this means that the last layer is a softmax layer to ensure that portfolio weights are between 0 and 100% and sum to 1, denoted by (p_t^1, ..., p_t^l). In addition, to include leverage, our deep network has a second output, which is the overall leverage, between 0 and a maximum leverage value (3 in our experiment), denoted by lvg_t. Hence the final allocation is given by lvg_t × (p_t^1, ..., p_t^l).

Reward

In terms of reward, we consider the net performance of our portfolio from t_0 to the last train date t_T, computed as follows: P_{t_T} / P_{t_0} − 1.

Multi inputs and outputs

We display in figure 2 the architecture of our network. Because we feed our network both with data from the strategies to select and with contextual information, our network is a multiple-input network. Additionally, as we want these inputs to provide not only the percentages in the different hedging strategies (with a softmax activation of a dense layer) but also the overall leverage (with a dense layer with a single output neuron), we also have a multiple-output network. An additional hyperparameter used in the network is L2 regularization, with a coefficient of 1e-8.

Figure 2: network architecture obtained via the tensorflow plot_model function. Our network is very different from standard DRL networks that have single inputs and outputs. Contextual information introduces a second input, while the leverage adds a second output.

Convolution networks

Because we want to extract some features implicitly with a limited set of parameters, and following (Liang et al. 2018), we use convolutional networks, which perform better than simple fully connected layers. For our so-called asset states, named like that because they are the part of the states that relates to the assets, we use two convolutional layers with 5 and 10 convolutions. These parameters were found to be efficient on our validation set. In contrast, for the contextual states part, we only use one convolutional layer with 3 convolutions. We flatten our two sub-networks in order to concatenate them into a single network.

Adversarial Policy Gradient

To learn the parameters of our network depicted in figure 2, we use a modified policy gradient algorithm called adversarial, as we introduce noise in the data, as suggested in (Liang et al. 2018). The idea of introducing noise in the data is to have some randomness in each training run to make it more robust. This is somehow similar to dropout in deep networks, where we randomly perturb the network by removing some neurons to make it more robust and less prone to overfitting. Here, we perturb the data directly to create this stochasticity and make the network more robust. A policy is a mapping from the observation space to the action space, π : O → A. To achieve this, a policy is specified by a deep network with a set of parameters θ. The action is a vector function of the observation given the parameters: a_t = π_θ(o_t). The performance metric of π_θ for the time interval [0, t] is defined as the corresponding total reward function of the interval, J_[0,t](π_θ) = R(o_1, π_θ(o_1), ..., o_t, π_θ(o_t), o_{t+1}). After random initialization, the parameters are continuously updated along the gradient direction with a learning rate λ: θ → θ + λ∇_θ J_[0,t](π_θ). The gradient ascent optimization is done with the standard Adam (short for Adaptive Moment Estimation) optimizer to get the benefit of adaptive gradient descent with root mean square propagation (Kingma and Ba 2014). The whole process is summarized in algorithm 1.

Algorithm 1 Adversarial Policy Gradient
1: Input: initial policy parameters θ, empty replay buffer D
2: repeat
3:   reset replay buffer
4:   while not terminal do
5:     Observe observation o and select action a = π_θ(o) with probability p and a random action with probability 1 − p
6:     Execute a in the environment
7:     Observe next observation o′, reward r, and done signal d to indicate whether o′ is terminal
8:     apply noise to next observation o′
9:     store (o, a, o′) in replay buffer D
10:    if terminal then
11:      for however many updates in D do
12:        compute final reward R
13:      end for
14:      update network parameters with Adam gradient ascent: θ → θ + λ∇_θ J_[0,t](π_θ)
15:    end if
16:  end while
17: until convergence

In our gradient ascent, we use a learning rate of 0.01 and an adversarial Gaussian noise with a standard deviation of 0.002. We do up to 500 iterations, with an early stopping condition if there is no improvement on the train set over the last 50 iterations.

Training is done for the period starting in 2000 and ending one day before the start of the testing period.

Data-set description

Systematic strategies are similar to asset managers that invest in financial markets according to an adaptive, pre-defined trading rule. Here, we use 4 SG CIB proprietary 'hedging strategies', which tend to perform when stock markets are down:

• Directional hedges - react to small negative returns in equities,
• Gap risk hedges - perform well in sudden market crashes,
• Proxy hedges - tend to perform in some market configurations, like for example when highly indebted stocks under-perform other stocks,
• Duration hedges - invest in the bond market, a classical diversifier to equity risk in finance.

The underlying financial instruments vary from put options, listed futures and single stocks to government bonds. Some of those strategies are akin to an insurance contract and bear a negative cost over the long run. The challenge consists in balancing cost versus benefits.

In practice, asset managers have to decide how much of these hedging strategies is needed on top of an existing portfolio to achieve a better risk reward. The decision making process is often based on contextual information, such as the economic and geopolitical environment, the level of risk aversion among investors and other correlation regimes. The contextual information is modeled by a large range of features:

• the level of risk aversion in financial markets, or market sentiment, measured as an indicator varying between 0 for maximum risk aversion and 1 for maximum risk appetite,
• the bond equity historical correlation, a classical ex-post measure of the diversification benefits of a duration hedge, measured on a 1 month, 3 month and 1 year rolling window,
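The mechanics above (one-period lag, softmax weights plus a bounded leverage output, the net-performance reward, and gradient ascent on J with noisy observations) can be caricatured in a few lines of numpy. This is only an illustrative sketch under heavy assumptions: the returns are synthetic, a linear softmax/sigmoid policy stands in for the convolutional network, and the gradient is approximated by central finite differences instead of Adam on backpropagated gradients; the names `policy` and `total_reward` are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy daily returns for l = 4 hedging strategies over one training episode
T, l = 200, 4
returns = rng.normal(0.0002, 0.01, size=(T, l))

def policy(theta, obs):
    """Linear stand-in for the deep network: a softmax head for the weights
    (p_t^1, ..., p_t^l) and a sigmoid head for the leverage in [0, 3]."""
    z = obs @ theta                          # (l + 1,) raw scores
    e = np.exp(z[:l] - z[:l].max())
    weights = e / e.sum()                    # softmax: in [0, 1], sum to 1
    leverage = 3.0 / (1.0 + np.exp(-z[l]))   # maximum leverage of 3
    return weights, leverage

def total_reward(theta, noise_std=0.0, seed=0):
    """J: net performance P_tT / P_t0 - 1, acting on a one-period lag."""
    noise = np.random.default_rng(seed).normal(0.0, noise_std, size=returns.shape)
    value = 1.0
    for t in range(1, T):
        w, lvg = policy(theta, returns[t - 1] + noise[t - 1])  # adversarial noise
        value *= 1.0 + lvg * (w @ returns[t])                  # leveraged return
    return value - 1.0

# Gradient ascent theta <- theta + lr * grad J, with a central
# finite-difference gradient (same noise seed on both sides of each step)
theta = rng.normal(0.0, 0.1, size=(l, l + 1))
lr, noise_std, eps = 0.01, 0.002, 1e-4
for step in range(10):
    grad = np.zeros_like(theta)
    for i in range(l):
        for j in range(l + 1):
            d = np.zeros_like(theta)
            d[i, j] = eps
            grad[i, j] = (total_reward(theta + d, noise_std, seed=step)
                          - total_reward(theta - d, noise_std, seed=step)) / (2 * eps)
    theta = theta + lr * grad

w, lvg = policy(theta, returns[-1])
print("final weights:", w.round(3), "leverage:", round(lvg, 2))
```

The final allocation lvg_t × (p_t^1, ..., p_t^l) is exactly what `leverage * weights` computes here; the real implementation replaces the finite differences with the Adam updates of algorithm 1.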
Figure 4: weights for all models

… and risk parity provides non-null weights for all our hedging strategies and does not do any cherry picking at all. These methods are not able to change the leverage used in the portfolio either, as opposed to the DRL model.

Adaptation to the Covid Crisis

The DRL model can change its portfolio allocation should market conditions change. This is the case from 2018 onwards, with a short deleveraging window emphasized by the small blank disruption during the Covid crisis, as shown in figure 5. We observe in this figure, where we have zoomed in on the year 2020, that the DRL model is able to reduce leverage from 300% to 200% during the Covid crisis (end of February 2020 to start of April 2020). This is a unique feature of our DRL model compared to traditional financial planning models, which do not take leverage into account and keep a leverage of 300% regardless of market conditions.

Benefits of DRL

As illustrated by the experiment, the advantages of DRL are numerous: (i) DRL maps market conditions directly to actions by design and hence should adapt to a changing environment, (ii) DRL does not rely on any traditional financial risk assumptions, (iii) DRL can incorporate additional data and be a multi-input method, as opposed to more traditional optimization methods.

Future work

As promising as this work is, there is room for improvement, as we have only tested a few scenarios and only a limited set of hyper-parameters for our convolutional networks. We should do more intensive testing to confirm that DRL is able to better adapt to a changing financial environment. We should also investigate the impact of more layers and other design choices in our network.

Conclusion

In this paper, we discuss how a traditional portfolio allocation problem can be reformulated as a DRL problem, trying to bridge the gap between the two approaches. We see that the DRL approach enables us to select fewer strategies, improving the overall results, as opposed to traditional methods that are built on the concept of diversification. We also stress that DRL can better adapt to changing market conditions and is able to incorporate more information to make its decisions.

Acknowledgments

We would like to thank Beatrice Guez and Marc Pantic for meaningful remarks. The views contained in this document are those of the authors and do not necessarily reflect the ones of SG CIB.

References

Astrom, K. 1969. Optimal control of Markov processes with incomplete state-information II. The convexity of the loss-function. Journal of Mathematical Analysis and Applications 26(2):403–406.

Bao, W., and yang Liu, X. 2019. Multi-agent deep reinforcement learning for liquidation strategy analysis.

Benhamou, E.; Saltiel, D.; Ohana, J.-J.; and Atif, J. 2020a. Detecting and adapting to crisis pattern with context based deep reinforcement learning. ICPR 2020.

Benhamou, E.; Saltiel, D.; Ungari, S.; and Mukhopadhyay, A. 2020b. AAMDRL: Augmented asset management with deep reinforcement learning. arXiv.

Benhamou, E.; Saltiel, D.; Ungari, S.; and Mukhopadhyay, A. 2020c. Bridging the gap between Markowitz planning and deep reinforcement learning. ICAPS FinPlan workshop.
Blum, A., and Furst, M. 1995. Fast Planning Through Planning Graph Analysis. In IJCAI, 1636–1642.

Chakraborty, S. 2019. Capturing financial markets to apply deep reinforcement learning.

Chopra, V. K., and Ziemba, W. T. 1993. The effect of errors in means, variances, and covariances on optimal portfolio choice. Journal of Portfolio Management 19(2):6–11.

Choueifaty, Y., and Coignard, Y. 2008. Toward maximum diversification. Journal of Portfolio Management 35(1):40–51.

Choueifaty, Y.; Froidure, T.; and Reynier, J. 2012. Properties of the most diversified portfolio. Journal of Investment Strategies 2(2):49–70.

Christoffersen, P.; Errunza, V.; Jacobs, K.; and Jin, X. 2010. Is the potential for international diversification disappearing? Working Paper.

Cogneau, P., and Hübner, G. 2009. The 101 ways to measure portfolio performance. SSRN Electronic Journal.

Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; and Dai, Q. 2016. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems 28:1–12.

Fikes, R. E., and Nilsson, N. J. 1971. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence 2:189.

Fischer, T. G. 2018. Reinforcement learning in financial markets - a survey. Discussion Papers in Economics 12.

Gu, S.; Holly, E.; Lillicrap, T.; and Levine, S. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), 3389–3396.

Haugen, R., and Baker, N. 1991. The efficient market inefficiency of capitalization-weighted stock portfolios. Journal of Portfolio Management 17:35–40.

Huang, C. Y. 2018. Financial trading as a game: A deep reinforcement learning approach.

Jiang, Z., and Liang, J. 2016. Cryptocurrency Portfolio Management with Deep Reinforcement Learning. arXiv e-prints.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. CoRR.

Kolm, P. N., and Ritter, G. 2019. Modern perspective on reinforcement learning in finance. SSRN.

Kritzman, M. 2014. Six practical comments about asset allocation. Practical Applications 1(3):6–11.

Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2015. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17.

Levine, S.; Pastor, P.; Krizhevsky, A.; and Quillen, D. 2016. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research.

Li, X.; Li, Y.; Zhan, Y.; and Liu, X.-Y. 2019. Optimistic bull or pessimistic bear: Adaptive deep reinforcement learning for stock portfolio allocation. In ICML.

Liang et al. 2018. Adversarial deep reinforcement learning in portfolio management.

Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. CoRR.

Liu, Y.; Liu, Q.; Zhao, H.; Pan, Z.; and Liu, C. 2020. Adaptive quantitative trading: an imitative deep reinforcement learning approach. In AAAI.

Maillard, S.; Roncalli, T.; and Teïletche, J. 2010. The properties of equally weighted risk contribution portfolios. The Journal of Portfolio Management 36(4):60–70.

Markowitz, H. 1952. Portfolio selection. Journal of Finance 7:77–91.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. NIPS Deep Learning Workshop.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.; Veness, J.; Bellemare, M.; Graves, A.; Riedmiller, M.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518:529–533.

Nan, A.; Perumal, A.; and Zaiane, O. R. 2020. Sentiment and knowledge based algorithmic trading with deep reinforcement learning.

Ning, B.; Lin, F. H. T.; and Jaimungal, S. 2018. Double deep q-learning for optimal execution.

Roncalli, T., and Weisang, G. 2016. Risk parity portfolios with risk factors. Quantitative Finance 16(3):377–388.

Saltiel, D.; Benhamou, E.; Ohana, J. J.; Laraki, R.; and Atif, J. 2020. DRLPS: Deep reinforcement learning for portfolio selection. ECML PKDD Demo track.

Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; and Abbeel, P. 2015a. Trust region policy optimization. In ICML.

Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2015b. High-dimensional continuous control using generalized advantage estimation. ICLR.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR.

Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529:484–489.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T.; Hui, F.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. 2017. Mastering the game of Go without human knowledge. Nature 550:354–359.

Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. The MIT Press, second edition.

Théate, T., and Ernst, D. 2020. Application of deep reinforcement learning in stock trading strategies and stock forecasting.

Vinyals, O.; Babuschkin, I.; Czarnecki, W.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.; Powell, R.; Ewalds, T.; Georgiev, P.; Oh, J.; Horgan, D.; Kroiss, M.; Danihelka, I.; Huang, A.; Sifre, L.; Cai, T.; Agapiou, J.; Jaderberg, M.; and Silver, D. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575.

Wang, H., and Zhou, X. Y. 2019. Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework. arXiv e-prints.

Wang, S.; Jia, D.; and Weng, X. 2018. Deep reinforcement learning for autonomous driving. ArXiv abs/1811.11329.

Wu, X.; Chen, H.; Wang, J.; Troiano, L.; Loia, V.; and Fujita, H. 2020. Adaptive stock trading strategies with deep reinforcement learning methods. Information Sciences 538:142–158.

Xiong, Z.; Liu, X.-Y.; Zhong, S.; Yang, H.; and Walid, A. 2019. Practical deep reinforcement learning approach for stock trading.

Ye, Y.; Pei, H.; Wang, B.; Chen, P.-Y.; Zhu, Y.; Xiao, J.; and Li, B. 2020. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In AAAI.

Yu, P.; Lee, J. S.; Kulyatin, I.; Shi, Z.; and Dasgupta, S. 2019. Model-based deep reinforcement learning for financial portfolio optimization. RWSDM Workshop, ICML 2019.

Zhang, Z.; Zohren, S.; and Roberts, S. 2019. Deep reinforcement learning for trading.

Zhengyao et al. 2017. Reinforcement learning framework for the financial portfolio management problem. arXiv.