Bridging the gap between Markowitz planning and deep reinforcement learning

Eric Benhamou 1,2, David Saltiel 1,3, Sandrine Ungari 4, Abhishek Mukhopadhyay 5
1 AI Square Connect, France, {eric.benhamou,david.saltiel}@aisquareconnect.com
2 MILES, LAMSADE, Dauphine university, France, eric.benhamou@lamsade.dauphine.fr
3 LISIC, ULCO, France, david.saltiel@univ-littoral.fr
4 Societe Generale, Cross Asset Quantitative Research, UK
5 Societe Generale, Cross Asset Quantitative Research, France
{sandrine.ungari,abhishek.mukhopadhyay}@sgcib.com

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

While researchers in the asset management industry have mostly focused on techniques based on financial and risk planning such as Markowitz efficient frontier, minimum variance, maximum diversification or equal risk parity, in parallel, another community in machine learning has started working on reinforcement learning, and more particularly deep reinforcement learning, to solve other decision making problems for challenging tasks like autonomous driving, robot learning, and, on a more conceptual side, solving games like Go. This paper aims to bridge the gap between these two approaches by showing that Deep Reinforcement Learning (DRL) techniques can shed new light on portfolio allocation thanks to a more general optimization setting that casts portfolio allocation as an optimal control problem that is not just a one-step optimization, but rather a continuous control optimization with a delayed reward. The advantages are numerous: (i) DRL maps market conditions directly to actions by design and hence should adapt to a changing environment, (ii) DRL does not rely on any traditional financial risk assumptions, such as risk being represented by variance, (iii) DRL can incorporate additional data and be a multi-input method, as opposed to more traditional optimization methods. We present some encouraging results from an experiment using convolutional networks.

Introduction

In asset management, there is a gap between mainstream methods and new machine learning techniques around reinforcement learning, and in particular deep reinforcement learning. The former rely on financial risk optimization and solve the planning problem of the optimal portfolio as a single-step optimization question. The latter make no assumptions about risk, perform a more involved multi-step optimization, and solve complex and challenging tasks like autonomous driving (Wang, Jia, and Weng 2018), learning advanced locomotion and manipulation skills from raw sensory inputs (Levine et al. 2015; 2016; Schulman et al. 2015a; 2017; Lillicrap et al. 2015) or, on a more conceptual side, reaching supra-human level in popular games like Atari (Mnih et al. 2013), Go (Silver et al. 2016; 2017), StarCraft II (Vinyals et al. 2019), etc. One of the reasons often put forward for this situation is that asset management researchers have mostly been trained with an econometric and financial mathematics background, while the deep reinforcement learning community has mostly been trained in computer science and robotics, leading to two distinct research communities that do not interact much with each other. In this paper, we aim to present the various approaches to show similarities and differences in order to bridge the gap between them. Both methods can help solve the decision making problem of finding the optimal portfolio allocation weights.

Related works

As this paper aims at bridging the gap between traditional asset management portfolio selection methods and deep reinforcement learning, there are too many works to be cited. On the traditional methods side, the seminal work is (Markowitz 1952), which has led to various extensions like minimum variance (Chopra and Ziemba 1993; Haugen and Baker 1991; Kritzman 2014), maximum diversification (Choueifaty and Coignard 2008; Choueifaty, Froidure, and Reynier 2012), maximum decorrelation (Christoffersen et al. 2010), and risk parity (Maillard, Roncalli, and Teïletche 2010; Roncalli and Weisang 2016). We review these works in the section entitled Traditional methods.

On the reinforcement learning side, the seminal book is (Sutton and Barto 2018). The field of deep reinforcement learning is growing every day at an unprecedented pace, making the citation exercise complicated. But in terms of breakthroughs of deep reinforcement learning, one can cite the work around Atari games from raw pixel inputs (Mnih et al. 2013; 2015), Go (Silver et al. 2016; 2017), StarCraft II (Vinyals et al. 2019), learning advanced locomotion and manipulation skills from raw sensory inputs (Levine et al. 2015; 2016; Schulman et al. 2015a; 2015b; 2017; Lillicrap et al. 2015), autonomous driving (Wang, Jia, and Weng 2018) and robot learning (Gu et al. 2017).

On the application of deep reinforcement learning methods to portfolio allocation, there is already a growing interest, as recent breakthroughs have put increasing emphasis on this method. Hence, the field is evolving very rapidly and surveys like (Fischer 2018) are already outdated. Driven initially mostly by applications to crypto currencies and Chinese financial markets (Jiang and Liang 2016; Zhengyao et al. 2017; Liang et al. 2018; Yu et al. 2019; Wang and Zhou 2019; Saltiel et al. 2020; Benhamou et al. 2020b; 2020a; 2020c), the field is progressively taking off on other assets (Kolm and Ritter 2019; Liu et al. 2020; Ye et al. 2020; Li et al. 2019; Xiong et al. 2019). More generally, DRL has recently been applied to problems other than portfolio allocation. For instance, (Deng et al. 2016; Zhang, Zohren, and Roberts 2019; Huang 2018; Théate and Ernst 2020; Chakraborty 2019; Nan, Perumal, and Zaiane 2020; Wu et al. 2020) tackle the problem of direct trading strategies, (Bao and yang Liu 2019) handles the one of multi-agent trading, while (Ning, Lin, and Jaimungal 2018) examine optimal execution.

Traditional methods
We are interested in finding an optimal portfolio, which makes the planning problem quite different from a standard planning problem where the aim is to plan a succession of tasks. Typical planning algorithms are variations around STRIPS (Fikes and Nilsson 1971): they start by analyzing ending goals and means, build the corresponding graph and find the optimal graph. Indeed, we start from the goals to achieve and try to find means that can lead to them. Newer work like Graphplan, as presented in (Blum and Furst 1995), uses a novel planning graph to reduce the amount of search needed, while hierarchical task network (HTN) planning leverages the classification to structure networks and hence reduce the number of graph searches. Other algorithms, like search algorithms such as A*, B* and weighted A*, or, for full graph search, branch and bound and its extensions, as well as evolutionary algorithms like particle swarm and CMA-ES, are also widely used in AI planning. However, when it comes to portfolio allocation, standard methods used by practitioners rely on more traditional financial risk reward optimization problems and follow rather the Markowitz approach as presented below.

Markowitz
The intuition of the Markowitz portfolio is to be able to compare various assets and assemble them taking into account both return and risk. Comparing just the returns of some financial assets would be too naive. One has to take into account in her/his investment decision returns with the associated risk. Risk is not an easy concept. In Modern Portfolio Theory (MPT), risk is represented by the variance of the asset returns. If we take various financial assets and display their returns and risk as in figure 1, we can find an efficient frontier. Indeed, there exists an efficient frontier, represented by the red dotted line.

Figure 1: Markowitz efficient frontier for the GAFA: returns taken from 2017 to end of 2019

Mathematically, if we denote by w = (w1, ..., wl) the allocation weights with 1 ≥ wi ≥ 0 for i = 1...l, summarized by 1 ≥ w ≥ 0, with the additional constraint that these weights sum to 1, Σ_{i=1...l} wi = 1, we can see this portfolio allocation question as an optimization.
Let µ = (µ1, ..., µl)^T be the expected returns for our l strategies and Σ the matrix of variance covariances of the l strategies' returns. Let rmin be the minimum expected return. The Markowitz optimization problem to solve is to minimize the risk given a target of minimum expected return as follows:

    Minimize_w   w^T Σ w                                              (1)
    subject to   µ^T w ≥ rmin,   Σ_{i=1...l} wi = 1,   1 ≥ w ≥ 0

It is solved by standard quadratic programming. Thanks to duality, there is an equivalent maximization with a given maximum risk σmax, for which the problem writes as follows:

    Maximize_w   µ^T w                                                (2)
    subject to   w^T Σ w ≤ σmax,   Σ_{i=1...l} wi = 1,   1 ≥ w ≥ 0
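To make the connection with standard numerical tooling concrete, here is a minimal sketch of program (1) solved with a generic constrained solver (scipy's SLSQP); the expected returns mu, covariance Sigma and target r_min are hypothetical placeholders, not the paper's data, and a dedicated QP solver would work equally well.

```python
# Minimal sketch of the Markowitz program (1): minimize w^T Sigma w subject to
# a minimum expected return, full investment and long-only weights.
import numpy as np
from scipy.optimize import minimize

mu = np.array([0.05, 0.07, 0.03, 0.06])      # expected returns (hypothetical)
Sigma = np.diag([0.04, 0.09, 0.01, 0.05])    # covariance matrix (hypothetical)
r_min = 0.05                                 # minimum expected return target
l = len(mu)

def risk(w):
    return w @ Sigma @ w                     # portfolio variance w^T Sigma w

constraints = [
    {"type": "eq",   "fun": lambda w: np.sum(w) - 1.0},   # sum of weights = 1
    {"type": "ineq", "fun": lambda w: mu @ w - r_min},    # mu^T w >= r_min
]
bounds = [(0.0, 1.0)] * l                    # 1 >= w_i >= 0 (long only)

res = minimize(risk, x0=np.full(l, 1.0 / l), bounds=bounds,
               constraints=constraints, method="SLSQP")
print(res.x)                                 # optimal weights
```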

Minimum variance portfolio  This seminal model has led to numerous extensions where the overall idea is to use a different optimization objective. As presented in (Chopra and Ziemba 1993; Haugen and Baker 1991; Kritzman 2014), we can for instance be interested in just minimizing risk (as we are not so much interested in expected returns), which leads to the minimum variance portfolio given by the following optimization program:

    Minimize_w   w^T Σ w                                              (3)
    subject to   Σ_{i=1...l} wi = 1,   1 ≥ w ≥ 0

Maximum diversification portfolio  Denoting by σ the volatilities of our l strategies, given by the diagonal of the covariance matrix Σ (σi = √Σ_{i,i} for i = 1..l), we can shoot for maximum diversification with the diversification ratio of a portfolio defined as D = w^T σ / √(w^T Σ w). We then solve the following optimization program, as presented in (Choueifaty and Coignard 2008; Choueifaty, Froidure, and Reynier 2012):

    Maximize_w   w^T σ / √(w^T Σ w)                                   (4)
    subject to   Σ_{i=1...l} wi = 1,   1 ≥ w ≥ 0

The concept of diversification is simply the ratio of the weighted average of volatilities divided by the portfolio volatility.

Maximum decorrelation portfolio  Following (Christoffersen et al. 2010) and denoting by C the correlation matrix of the portfolio strategies, the maximum decorrelation portfolio is obtained by finding the weights that provide the maximum decorrelation, or equivalently the minimum correlation, as follows:

    Minimize_w   w^T C w                                              (5)
    subject to   Σ_{i=1...l} wi = 1,   1 ≥ w ≥ 0

Risk parity portfolio  Another approach, following risk parity (Maillard, Roncalli, and Teïletche 2010; Roncalli and Weisang 2016), is to aim for more parity in risk and solve the following optimization program:

    Minimize_w   (1/2) w^T Σ w − (1/n) Σ_{i=1...l} ln(wi)             (6)
    subject to   Σ_{i=1...l} wi = 1,   1 ≥ w ≥ 0

All these optimization techniques are the usual way to solve the planning question of getting the best portfolio allocation. We will see in the following section that there are many alternatives leveraging machine learning that remove the cognitive bias of risk and are somehow more able to adapt to a changing environment.
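The same numerical machinery applies to the other objectives above; as an illustration, here is a minimal sketch of the risk parity program (6), again under the assumption of a placeholder covariance matrix and the scipy SLSQP solver rather than any solver used by the authors.

```python
# Minimal sketch of the risk parity program (6): variance penalised by a log
# barrier on the weights, with the budget and long-only constraints.
import numpy as np
from scipy.optimize import minimize

Sigma = np.array([[0.04, 0.01, 0.00],        # placeholder covariance matrix
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.02]])
l = Sigma.shape[0]
n = l                                        # the 1/n scaling factor of (6)

def objective(w):
    return 0.5 * w @ Sigma @ w - np.sum(np.log(w)) / n

constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
bounds = [(1e-6, 1.0)] * l                   # keep w_i > 0 so log(w_i) is defined

res = minimize(objective, x0=np.full(l, 1.0 / l), bounds=bounds,
               constraints=constraints, method="SLSQP")
print(res.x)
```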
Reinforcement learning
Previous financial methods treat the portfolio allocation planning question as a one-step optimization problem with convex objective functions. There are multiple limitations to this approach:
• they do not relate market conditions to portfolio allocation dynamically;
• they do not take into account that the result of the portfolio allocation may potentially be evaluated much later;
• they make strong assumptions about risk.
What if we could cast this portfolio allocation planning question as a dynamic control problem where we have some market information and need to decide at each time step the optimal portfolio allocation, and evaluate the result with a delayed reward? What if we could move from static portfolio allocation to optimal control territory, where we can change our portfolio allocation dynamically when market conditions change? Because the community of portfolio allocation is quite different from the one of reinforcement learning, this approach has been ignored for quite some time, even though there has been a growing interest in the use of reinforcement learning and deep reinforcement learning over the last few years. We present here in greater detail what deep reinforcement learning is, in order to suggest more discussions and exchanges between these two communities.
Contrary to supervised learning, reinforcement learning does not try to predict future returns. Nor does it try to learn the structure of the market implicitly. Reinforcement learning does more: it directly learns the optimal policy for the portfolio allocation in connection with the dynamically changing market conditions.

Deep Reinforcement Learning Intuition
As its name stands for, Deep Reinforcement Learning (DRL) is the combination of Reinforcement Learning (RL) and Deep (D). The usage of deep learning is to represent the policy function in RL. In a nutshell, the setting for applying RL to portfolio management can be summarized as follows:
• current knowledge of the financial markets is formalized via a state variable denoted by st.
• Our planning task, which is to find an optimal portfolio allocation, can be thought of as taking an action at on this market. This action is precisely the decision of the current portfolio allocation (also called portfolio weights).
• once we have decided the portfolio allocation, we observe the next state st+1.
• we use a reward to evaluate the performance of our actions. In our particular setting, we can compute this reward only at the final time of our episode, making it quite special compared to a standard reinforcement learning problem. We denote this reward by RT, where T is the final time of our episode. This reward RT is in a sense similar to our objective function in traditional methods. A typical reward is the final portfolio net performance. It could obviously be another financial performance evaluation criterion like the Sharpe or Sortino ratio, etc.
Following standard RL, we model our problem to solve with a Markov Decision Process (MDP) as in (Sutton and Barto 2018). MDP assumes that the agent knows all the states of the environment and has all the information to make the optimal decision in every state. The Markov property implies in addition that knowing the current state is sufficient. MDP assumes a 4-tuple (S, A, P, R) where S is the set of states, A is the set of actions, P is the state action to next state transition probability function P : S × A × S → [0, 1], and R is the immediate reward. The goal of the agent is to learn a policy that maps states to the optimal action, π : S → A, and that maximizes the expected discounted reward E[Σ_{t≥0} γ^t Rt].
The concept of using a deep network is to represent the function that relates dynamically the states to the action, called in RL the policy and denoted by at = π(st). This function is represented by a deep network because of the universal approximation theorem, which states that any function can be represented by a deep network provided we have enough layers and nodes. Compared to traditional methods that only solve a one-step optimization, we are solving the following dynamic control optimization program:

    Maximize_{π(·)}   E[RT]                                           (7)
    subject to        at = π(st)

Note that we maximize the expected value of the cumulated reward E[RT] because we are operating in a stochastic environment. To make things simpler, let us assume that the cumulated reward is the final portfolio net performance. Let us write Pt the price at time t of our portfolio, rtP its return at time t and rt the portfolio assets return vector at time t. The final net performance writes as PT/P0 − 1 = Π_{t=1...T}(1 + rtP) − 1. The return rtP is a function of our planning action at as follows: (1 + rtP) = 1 + ⟨at, rt⟩, where ⟨·, ·⟩ is the standard inner product of two vectors. In addition, if we recall that the policy is parametrized by some deep network parameters θ, at = πθ(st), we can make our optimization problem slightly more detailed as follows:

    Maximize_θ   E[ Π_{t=1...T} (1 + ⟨at, rt⟩) ]                       (8)
    subject to   at = πθ(st)

It is worth noticing that, compared to the previous traditional planning methods (optimizations 1, 3, 4, 5 or 6), the underlying optimization problem in RL (7) and its rewriting in terms of deep network parameters θ as presented in (8) have many differences:
• First, we are trying to optimize a function π and not simple weights wi. Although this function is in the end represented by a deep neural network that admittedly also has weights, this is conceptually very different, as we are optimizing in the space of functions π : S → A, which is a much bigger space than simply R^l.
• Second, it is a multi time step optimization, as it involves results from time t = 1 to t = T, making it also more involved.
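To make the objective in (8) concrete, the following sketch computes the episode reward Π(1 + ⟨at, rt⟩) − 1 for a given policy; the policy function, states and returns are stand-ins for illustration, not the paper's network or data.

```python
# Minimal sketch of the episode reward used in (8): the terminal net
# performance of the portfolio when actions are produced by a policy.
import numpy as np

def episode_reward(policy, states, asset_returns):
    """states: list of state observations s_t,
    asset_returns: array of shape (T, l) with the asset return vectors r_t."""
    wealth = 1.0
    for s_t, r_t in zip(states, asset_returns):
        a_t = policy(s_t)                # portfolio weights chosen for state s_t
        wealth *= 1.0 + a_t @ r_t        # one-step growth (1 + <a_t, r_t>)
    return wealth - 1.0                  # P_T / P_0 - 1

# hypothetical usage: an equally weighted "policy" that ignores the state
T, l = 250, 4
returns = np.random.normal(0.0, 0.01, size=(T, l))
uniform_policy = lambda s: np.full(l, 1.0 / l)
print(episode_reward(uniform_policy, [None] * T, returns))
```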
Partially Observable Markov Decision Process
If there is in addition some noise in our data and we are not able to observe the full state, it is better to use a Partially Observable Markov Decision Process (POMDP) as presented initially in (Astrom 1969). In a POMDP, only a subset of the information of a given state is available. The partially-informed agent cannot behave optimally. It uses a window of past observations to replace states as in a traditional MDP.
Mathematically, a POMDP is a generalization of an MDP. A POMDP adds two more variables to the tuple, O and Z, where O is the set of observations and Z is the observation transition function Z : S × A × O → [0, 1]. At each time, the agent is asked to take an action at ∈ A in a particular environment state st ∈ S, that is followed by the next state st+1 with probability P(st+1|st, at). The next state st+1 is not observed by the agent. It rather receives an observation ot+1 ∈ O on the state st+1 with probability Z(ot+1|st+1, at).
From a practical standpoint, the general RL setting is modified by taking a pseudo state formed with a set of past observations (ot−n, ot−n−1, ..., ot−1, ot). In practice, to avoid large dimension and the curse of dimensionality, it is useful to reduce this set and take only a subset of these past observations, with j < n past observations such that 0 < i1 < ... < ij, where ik ∈ N is an integer. The set δ1 = (0, i1, ..., ij) is called the observation lags. In our experiment we typically use lag periods like (0, 1, 2, 3, 4, 20, 60) for daily data, where (0, 1, 2, 3, 4) provides the last week's observations, 20 is for the one-month-ago observation (as there are approximately 20 business days in a month) and 60 for the three-month-ago observation.

Observations
Regular observations  There are two types of observations: regular and contextual information. Regular observations are data directly linked to the problem to solve. In the case of an asset management framework, regular observations are past prices observed over a lag period δ = (0 < i1 < ... < ij). To normalize data, we rather use past returns computed as rtk = pkt / pkt−1 − 1, where pkt is the price at time t of asset k. To give information about regime changes, our trading agent also receives the empirical standard deviation computed over a sliding estimation window denoted by d as follows: σtk = √( (1/d) Σ_{u=t−d+1...t} (ru − µ)^2 ), where the empirical mean µ is computed as µ = (1/d) Σ_{u=t−d+1...t} ru. Hence our regular observations form a three dimensional tensor At = [A1t, A2t], with

    A1t = [ r1_{t−ij} ... r1_t ; ... ; rm_{t−ij} ... rm_t ],
    A2t = [ σ1_{t−ij} ... σ1_t ; ... ; σm_{t−ij} ... σm_t ].

This setting with two layers (past returns and past volatilities) is quite different from the one presented in (Jiang and Liang 2016; Zhengyao et al. 2017; Liang et al. 2018), which uses different layers representing closing and open-high-low prices. There are various remarks to be made. First, high-low information does not make sense for portfolio strategies that are only evaluated daily, which is the case for all the funds. Secondly, open-high-low prices tend to be highly correlated, creating some noise in the inputs. Third, the concept of volatility is crucial to detect regime change and is surprisingly absent from these works as well as from other works like (Yu et al. 2019; Wang and Zhou 2019; Liu et al. 2020; Ye et al. 2020; Li et al. 2019; Xiong et al. 2019).

Context observation  Contextual observations are additional information that provide intuition about the current context. For our asset manager, they are other financial data not directly linked to its portfolio, assumed to have some predictive power for portfolio assets. Contextual observations are stored in a 2D matrix denoted by Ct, with the past p individual contextual observations stacked. Among these observations, we have the maximum and minimum portfolio strategies return and the maximum portfolio strategies volatility. The latter information is, as for regular observations, motivated by the stylized fact that standard deviations are useful features to detect crisis. The contextual state writes as

    Ct = [ c1_t ... c1_{t−ik} ; ... ; cp_t ... cp_{t−ik} ].

The matrix nature of contextual states Ct implies in particular that we will use 1D convolutions should we use convolutional layers. All in all, our augmented observations write as Ot = [At, Ct], with At = [A1t, A2t], and will feed the two sub-networks of our global network.
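As an illustration of how such a regular observation tensor can be assembled, here is a minimal numpy/pandas sketch using the lag set (0, 1, 2, 3, 4, 20, 60); the price DataFrame, the 20-day volatility window and the column layout are assumptions made for the example, not the paper's data pipeline.

```python
# Minimal sketch of the regular observation tensor A_t = [A1_t, A2_t]:
# lagged past returns and lagged rolling volatilities for m assets.
import numpy as np
import pandas as pd

LAGS = [0, 1, 2, 3, 4, 20, 60]      # observation lags used in the paper
VOL_WINDOW = 20                     # sliding window d for the empirical std (assumption)

def regular_observation(prices: pd.DataFrame, t: int) -> np.ndarray:
    returns = prices.pct_change()                    # r_t^k = p_t^k / p_{t-1}^k - 1
    vols = returns.rolling(VOL_WINDOW).std()         # sigma_t^k over window d
    a1 = np.stack([returns.iloc[t - lag].values for lag in reversed(LAGS)], axis=1)
    a2 = np.stack([vols.iloc[t - lag].values for lag in reversed(LAGS)], axis=1)
    return np.stack([a1, a2])                        # shape (2, m, len(LAGS))

# hypothetical usage with random data for m = 4 assets
dates = pd.date_range("2019-01-01", periods=300, freq="B")
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(np.random.normal(0, 0.01, (300, 4)), axis=0)),
    index=dates, columns=["s1", "s2", "s3", "s4"])
print(regular_observation(prices, t=200).shape)      # (2, 4, 7)
```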

Action
In our deep reinforcement learning setting, the augmented asset manager agent needs to decide at each period in which hedging strategy it invests. The augmented asset manager can invest in l strategies that can be simple strategies or strategies that are themselves run by an asset management agent. To cope with reality, the agent will only be able to act after one period. This is because asset managers have a one-day turnaround to change their position. We will see in the experiments that this one-day turnaround lag makes a big difference in the results. As it has access to l potential hedging strategies, the output is an l-dimensional vector that provides how much it invests in each hedging strategy. For our deep network, this means that the last layer is a softmax layer to ensure that portfolio weights are between 0 and 100% and sum to 1, denoted by (p1t, ..., plt). In addition, to include leverage, our deep network has a second output which is the overall leverage, between 0 and a maximum leverage value (3 in our experiment), denoted by lvgt. Hence the final allocation is given by lvgt × (p1t, ..., plt).

Reward
In terms of reward, we are considering the net performance of our portfolio from t0 to the last train date tT, computed as PtT / Pt0 − 1.

Multi inputs and outputs
We display in figure 2 the architecture of our network. Because we feed our network with both data from the strategies to select and contextual information, our network is a multiple-input network. Additionally, as we want these inputs to provide not only the percentages in the different hedging strategies (with a softmax activation of a dense layer) but also the overall leverage (with a dense layer with one single output neuron), we also have a multiple-output network. An additional hyper-parameter used in the network is an L2 regularization with a coefficient of 1e-8.

Figure 2: network architecture obtained via the tensorflow plot_model function. Our network is very different from standard DRL networks that have single inputs and outputs. Contextual information introduces a second input while the leverage adds a second output.

Convolution networks
Because we want to extract some features implicitly with a limited set of parameters, and following (Liang et al. 2018), we use convolution networks, which perform better than simple fully connected layers. For our so-called asset states, named like that because they are the part of the states that relates to the assets, we use two layers of convolutional networks with 5 and 10 convolutions. These parameters were found to be efficient on our validation set. In contrast, for the contextual states part, we only use one layer of convolution networks with 3 convolutions. We flatten our two sub-networks in order to concatenate them into a single network.
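A minimal Keras sketch of such a two-input, two-output network is given below. The number of convolutions (5 and 10 for the asset sub-network, 3 for the contextual one), the softmax weight head, the leverage head and the L2 coefficient follow the description above, while the kernel sizes, dense-layer details, input shapes and the use of a scaled sigmoid to cap leverage at 3 are assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a two-input / two-output allocation network: a conv
# sub-network on asset states, a 1D conv sub-network on contextual states,
# a softmax head for the l weights and a bounded head for the leverage.
from tensorflow.keras import layers, Model, regularizers

m, n_lags, p, n_ctx_lags, l, max_leverage = 4, 7, 3, 7, 4, 3.0
reg = regularizers.l2(1e-8)

asset_in = layers.Input(shape=(m, n_lags, 2), name="asset_states")
x = layers.Conv2D(5, kernel_size=(1, 3), activation="relu", kernel_regularizer=reg)(asset_in)
x = layers.Conv2D(10, kernel_size=(1, 3), activation="relu", kernel_regularizer=reg)(x)
x = layers.Flatten()(x)

ctx_in = layers.Input(shape=(n_ctx_lags, p), name="context_states")   # 1D conv over lags
y = layers.Conv1D(3, kernel_size=3, activation="relu", kernel_regularizer=reg)(ctx_in)
y = layers.Flatten()(y)

z = layers.Concatenate()([x, y])
weights_out = layers.Dense(l, activation="softmax", name="weights")(z)    # p_t sums to 1
leverage_out = layers.Lambda(lambda v: max_leverage * v, name="leverage")(
    layers.Dense(1, activation="sigmoid")(z))                             # 0 <= lvg_t <= 3

model = Model(inputs=[asset_in, ctx_in], outputs=[weights_out, leverage_out])
model.summary()
```

The softmax head guarantees a fully invested, long-only weight vector, while the separate leverage head lets the final allocation lvgt × (p1t, ..., plt) scale exposure up or down.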
Adversarial Policy Gradient
To learn the parameters of our network depicted in figure 2, we use a modified policy gradient algorithm called adversarial, as we introduce noise in the data as suggested in (Liang et al. 2018). The idea of introducing noise in the data is to have some randomness in each training run to make it more robust. This is somewhat similar to dropout in deep networks, where we randomly perturb the network by randomly removing some neurons to make it more robust and less prone to overfitting. Here, we perturb the data directly to create this stochasticity and make the network more robust. A policy is a mapping from the observation space to the action space, π : O → A. To achieve this, a policy is specified by a deep network with a set of parameters θ. The action is a vector function of the observation given the parameters: at = πθ(ot). The performance metric of πθ for the time interval [0, t] is defined as the corresponding total reward function of the interval, J[0,t](πθ) = R(o1, πθ(o1), ..., ot, πθ(ot), ot+1). After random initialization, the parameters are continuously updated along the gradient direction with a learning rate λ: θ → θ + λ∇θ J[0,t](πθ). The gradient ascent optimization is done with the standard Adam (short for Adaptive Moment Estimation) optimizer to have the benefit of adaptive gradient descent with root mean square propagation (Kingma and Ba 2014). The whole process is summarized in algorithm 1.

Algorithm 1 Adversarial Policy Gradient
1: Input: initial policy parameters θ, empty replay buffer D
2: repeat
3:    reset replay buffer
4:    while not terminal do
5:       Observe observation o and select action a = πθ(o) with probability p and a random action with probability 1 − p
6:       Execute a in the environment
7:       Observe next observation o′, reward r, and done signal d to indicate whether o′ is terminal
8:       apply noise to next observation o′
9:       store (o, a, o′) in replay buffer D
10:      if Terminal then
11:         for however many updates in D do
12:            compute final reward R
13:         end for
14:         update network parameters with Adam gradient ascent: θ → θ + λ∇θ J[0,t](πθ)
15:      end if
16:   end while
17: until convergence

In our gradient ascent, we use a learning rate of 0.01 and an adversarial Gaussian noise with a standard deviation of 0.002. We do up to 500 maximum iterations, with an early stop condition if there is no improvement on the train set over the last 50 iterations.
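The sketch below illustrates one possible reading of such an adversarial update: Gaussian noise is applied to the observations, the terminal net performance is computed from the network outputs, and Adam takes a gradient ascent step on that reward by differentiating it directly through the deterministic policy. It assumes the two-input model sketched earlier and placeholder tensors; it is a sketch of the idea, not the authors' implementation.

```python
# Minimal sketch of an adversarial training step: noisy observations,
# episode reward P_T / P_0 - 1, and an Adam gradient ascent step.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
NOISE_STD = 0.002

@tf.function
def train_step(asset_obs, ctx_obs, asset_returns):
    # asset_obs: (T, m, n_lags, 2), ctx_obs: (T, n_ctx_lags, p), asset_returns: (T, l)
    noisy_asset = asset_obs + tf.random.normal(tf.shape(asset_obs), stddev=NOISE_STD)
    noisy_ctx = ctx_obs + tf.random.normal(tf.shape(ctx_obs), stddev=NOISE_STD)
    with tf.GradientTape() as tape:
        weights, leverage = model([noisy_asset, noisy_ctx], training=True)
        allocation = leverage * weights                           # lvg_t * (p_t^1, ..., p_t^l)
        step_growth = 1.0 + tf.reduce_sum(allocation * asset_returns, axis=1)
        reward = tf.reduce_prod(step_growth) - 1.0                # terminal net performance
        loss = -reward                                            # ascent on J = descent on -J
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return reward
```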
Experiments

Goal of the experiment
We are interested in planning a hedging strategy for a risky asset. The experiment uses daily data from 01/05/2000 to 19/06/2020 for the MSCI and 4 SG-CIB proprietary systematic strategies. The risky asset is the MSCI world index, whose daily data can be found on Bloomberg. We choose this index because it is a good proxy for a wide range of asset manager portfolios. The hedging strategies are 4 SG-CIB proprietary systematic strategies further described below. Training and testing are done following an extending walk-forward analysis as presented in (Benhamou et al. 2020b; 2020c; 2020a), with initial training from 2000 to the end of 2006 and testing on a rolling 1-year period. Hence, there are 14 training and testing periods, with the testing periods corresponding to all the years from 2007 to 2020 and training done for the period starting in 2000 and ending one day before the start of the testing period.
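The extending walk-forward scheme can be sketched as follows; the exact calendar conventions are assumptions made for illustration.

```python
# Minimal sketch of the extending walk-forward splits: training always starts
# in 2000 and extends to the day before each 1-year test period (2007-2020).
import pandas as pd

def walk_forward_splits(first_test_year=2007, last_test_year=2020):
    splits = []
    for year in range(first_test_year, last_test_year + 1):
        test_start = pd.Timestamp(year=year, month=1, day=1)
        test_end = pd.Timestamp(year=year, month=12, day=31)
        train_start = pd.Timestamp("2000-05-01")
        train_end = test_start - pd.Timedelta(days=1)   # ends one day before testing
        splits.append((train_start, train_end, test_start, test_end))
    return splits

for train_start, train_end, test_start, test_end in walk_forward_splits():
    print(f"train {train_start.date()} -> {train_end.date()}, "
          f"test {test_start.date()} -> {test_end.date()}")
```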
Data-set description
Systematic strategies are similar to asset managers that invest in financial markets according to an adaptive, pre-defined trading rule. Here, we use 4 SG CIB proprietary 'hedging strategies' that tend to perform when stock markets are down:
• Directional hedges - react to small negative returns in equities,
• Gap risk hedges - perform well in sudden market crashes,
• Proxy hedges - tend to perform in some market configurations, like for example when highly indebted stocks under-perform other stocks,
• Duration hedges - invest in the bond market, a classical diversifier to equity risk in finance.
The underlying financial instruments vary from put options, listed futures and single stocks to government bonds. Some of those strategies are akin to an insurance contract and bear a negative cost over the long run. The challenge consists in balancing cost versus benefits.
In practice, asset managers have to decide how much of these hedging strategies is needed on top of an existing portfolio to achieve a better risk reward. The decision making process is often based on contextual information, such as the economic and geopolitical environment, the level of risk aversion among investors and other correlation regimes. The contextual information is modeled by a large range of features:
• the level of risk aversion in financial markets, or market sentiment, measured as an indicator varying between 0 for maximum risk aversion and 1 for maximum risk appetite,
• the bond equity historical correlation, a classical ex-post measure of the diversification benefits of a duration hedge, measured on 1-month, 3-month and 1-year rolling windows,
• the credit spreads of global corporates - investment grade and high yield, in Europe and in the US - known to be an early indicator of potential economic tensions,
• the equity implied volatility, a measure of the 'fear factor' in financial markets,
• the spread between the yield of Italian government bonds and the German government bond, a measure of potential tensions in the European Union,
• the US Treasury slope, a classical early indicator of US recession,
• and some more financial variables, often used as a gauge for global trade and activity: the dollar, the level of rates in the US, the estimated earnings per share (EPS).
A cross validation step selects the most relevant features. In the present case, the first three features are selected. The rebalancing of strategies in the portfolio comes with transaction costs, which can be quite high since hedges use options. Transaction costs are like frictions in physical systems. They are taken into account dynamically to penalise solutions with a high turnover rate.

Evaluation metrics
Asset managers use a wide range of metrics to evaluate the success of their investment decisions. For a thorough review of those metrics, see for example (Cogneau and Hübner 2009). The metrics we are interested in for our hedging problem are listed below:
• annualized return, defined as the average annualized compounded return,
• annualized daily-based Sharpe ratio, defined as the ratio of the annualized return over the annualized daily-based volatility, µ/σ,
• Sortino ratio, computed as the ratio of the annualized return over the downside standard deviation,
• maximum drawdown (max DD), computed as the maximum of all daily drawdowns. The daily drawdown is computed as the ratio of the difference between the running maximum of the portfolio value, defined as RMT = max_{t=0..T}(Pt), and the portfolio value, over the running maximum of the portfolio value. Hence the drawdown at time T is given by DDT = (RMT − PT)/RMT, while the maximum drawdown is MDDT = max_{t=0..T}(DDt). It is the maximum loss in return that an investor will incur if she/he invested at the worst time (at peak).
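A minimal sketch of these metrics computed from a daily series of portfolio values is given below; the 252 trading-day annualization convention is an assumption, not stated in the paper.

```python
# Minimal sketch of the evaluation metrics: annualized return, Sharpe,
# Sortino and maximum drawdown from daily portfolio values.
import numpy as np

def evaluation_metrics(portfolio_values, periods_per_year=252):
    values = np.asarray(portfolio_values, dtype=float)
    returns = values[1:] / values[:-1] - 1.0
    n_years = len(returns) / periods_per_year
    annual_return = (values[-1] / values[0]) ** (1.0 / n_years) - 1.0
    annual_vol = returns.std() * np.sqrt(periods_per_year)
    downside_vol = returns[returns < 0].std() * np.sqrt(periods_per_year)
    running_max = np.maximum.accumulate(values)          # RM_t = max_{u<=t} P_u
    max_dd = np.max((running_max - values) / running_max)  # MDD_T = max_t DD_t
    return {"return": annual_return,
            "Sharpe": annual_return / annual_vol,
            "Sortino": annual_return / downside_vol,
            "max DD": -max_dd}                            # reported as a negative number

# hypothetical usage on a simulated 5-year random walk
values = 100 * np.exp(np.cumsum(np.random.normal(0.0003, 0.01, 252 * 5)))
print(evaluation_metrics(values))
```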

Results and discussion
Overall, the DRL approach achieves much better results than the traditional methods, as shown in table 1, except for the maximum drawdown (max DD). Because the time horizon matters in the comparison, we provide risk measures over the last 2 and 5 years to emphasize that the DRL approach seems more robust than traditional portfolio allocation methods. When plotting performance results from 2007 to 2020, as shown in figure 3, we see that the DRL model is able to deviate upward from the risky asset continuously, indicating a steady performance. In contrast, the other financial models are not able to keep their marginal over-performance over time with respect to the risky asset and end slightly below the risky asset.

Table 1: Models comparison over 2 and 5 years

2 Years
                      return    Sortino   Sharpe   max DD
Risky asset            8.27%      0.39     0.36    -0.34
DRL                   20.64%      0.94     0.96    -0.27
Markowitz             -0.25%     -0.01    -0.01    -0.43
MinVariance           -0.22%     -0.01    -0.01    -0.43
MaxDiversification     0.24%      0.01     0.01    -0.43
MaxDecorrel           14.42%      0.65     0.63    -0.21
RiskParity            14.17%      0.73     0.72    -0.19

5 Years
                      return    Sortino   Sharpe   max DD
Risky asset            9.16%      0.57     0.54    -0.34
DRL                   16.95%      1.00     1.02    -0.27
Markowitz              1.48%      0.07     0.06    -0.43
MinVariance            1.56%      0.08     0.06    -0.43
MaxDiversification     1.77%      0.08     0.07    -0.43
MaxDecorrel            7.65%      0.44     0.39    -0.21
RiskParity             7.46%      0.48     0.43    -0.19

Figure 3: performance of all models

Allocation chosen by models
The reason for the stronger performance of DRL comes from the way it chooses its allocation. Contrarily to standard financial methods that rely on diversification, as shown in figure 4, DRL aims at choosing a single hedging strategy most of the time and at changing it dynamically, should financial market conditions change. In a sense, DRL is doing some cherry picking by selecting what it thinks is the best hedging strategy. In contrast, traditional models like Markowitz, minimum variance, maximum diversification, maximum decorrelation and risk parity provide non-null weights for all our hedging strategies and do not do cherry picking at all. Neither are they able to change the leverage used in the portfolio, as opposed to the DRL model.

Figure 4: weights for all models

Adaptation to the Covid Crisis
The DRL model can change its portfolio allocation should the market conditions change. This is the case from 2018 onwards, with a short deleveraging window emphasized by the small blank disruption during the Covid crisis, as shown in figure 5. We observe in this figure, where we have zoomed over the year 2020, that the DRL model is able to reduce leverage from 300% to 200% during the Covid crisis (end of February 2020 to start of April 2020). This is a unique feature of our DRL model compared to traditional financial planning models that do not take leverage into account and keep a leverage of 300% regardless of market conditions.

Figure 5: disallocation of DRL model

Benefits of DRL
As illustrated by the experiment, the advantages of DRL are numerous: (i) DRL maps market conditions directly to actions by design and hence should adapt to a changing environment, (ii) DRL does not rely on any traditional financial risk assumptions, (iii) DRL can incorporate additional data and be a multi-input method, as opposed to more traditional optimization methods.

Future work
As nice as this work is, there is room for improvement, as we have only tested a few scenarios and only a limited set of hyper-parameters for our convolutional networks. We should do more intensive testing to confirm that DRL is able to better adapt to a changing financial environment. We should also investigate the impact of more layers and other design choices in our network.

Conclusion
In this paper, we discuss how a traditional portfolio allocation problem can be reformulated as a DRL problem, trying to bridge the gap between the two approaches. We see that the DRL approach enables us to select fewer strategies, improving the overall results, as opposed to traditional methods that are built on the concept of diversification. We also stress that DRL can better adapt to changing market conditions and is able to incorporate more information to make decisions.

Acknowledgments
We would like to thank Beatrice Guez and Marc Pantic for meaningful remarks. The views contained in this document are those of the authors and do not necessarily reflect the ones of SG CIB.

References
Astrom, K. 1969. Optimal control of markov processes with incomplete state-information ii. the convexity of the loss-function. Journal of Mathematical Analysis and Applications 26(2):403–406.
Bao, W., and yang Liu, X. 2019. Multi-agent deep reinforcement learning for liquidation strategy analysis.
Benhamou, E.; Saltiel, D.; Ohana, J.-J.; and Atif, J. 2020a. Detecting and adapting to crisis pattern with context based deep reinforcement learning. ICPR 2020.
Benhamou, E.; Saltiel, D.; Ungari, S.; and Mukhopadhyay, A. 2020b. Aamdrl: Augmented asset management with deep reinforcement learning. arXiv.
Benhamou, E.; Saltiel, D.; Ungari, S.; and Mukhopadhyay, A. 2020c. Bridging the gap between markowitz planning and deep reinforcement learning. ICAPS FinPlan workshop.
Blum, A., and Furst, M. 1995. Fast Planning Through Planning Graph Analysis. In IJCAI, 1636–1642.
Chakraborty, S. 2019. Capturing financial markets to apply deep reinforcement learning.
Chopra, V. K., and Ziemba, W. T. 1993. The effect of errors in means, variances, and covariances on optimal portfolio choice. Journal of Portfolio Management 19(2):6–11.
Choueifaty, Y., and Coignard, Y. 2008. Toward maximum diversification. Journal of Portfolio Management 35(1):40–51.
Choueifaty, Y.; Froidure, T.; and Reynier, J. 2012. Properties of the most diversified portfolio. Journal of Investment Strategies 2(2):49–70.
Christoffersen, P.; Errunza, V.; Jacobs, K.; and Jin, X. 2010. Is the potential for international diversification disappearing? Working Paper.
Cogneau, P., and Hübner, G. 2009. The 101 ways to measure portfolio performance. SSRN Electronic Journal.
Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; and Dai, Q. 2016. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems 28:1–12.
Fikes, R. E., and Nilsson, N. J. 1971. Strips: A new approach to the application of theorem proving to problem solving. Artificial Intelligence 2:189.
Fischer, T. G. 2018. Reinforcement learning in financial markets - a survey. Discussion Papers in Economics 12.
Gu, S.; Holly, E.; Lillicrap, T.; and Levine, S. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In IEEE International Conference on Robotics and Automation (ICRA), 3389–3396.
Haugen, R., and Baker, N. 1991. The efficient market inefficiency of capitalization-weighted stock portfolios. Journal of Portfolio Management 17:35–40.
Huang, C. Y. 2018. Financial trading as a game: A deep reinforcement learning approach.
Jiang, Z., and Liang, J. 2016. Cryptocurrency Portfolio Management with Deep Reinforcement Learning. arXiv e-prints.
Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization.
Kolm, P. N., and Ritter, G. 2019. Modern perspective on reinforcement learning in finance. SSRN.
Kritzman, M. 2014. Six practical comments about asset allocation. Practical Applications 1(3):6–11.
Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2015. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research 17.
Levine, S.; Pastor, P.; Krizhevsky, A.; and Quillen, D. 2016. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research.
Li, X.; Li, Y.; Zhan, Y.; and Liu, X.-Y. 2019. Optimistic bull or pessimistic bear: Adaptive deep reinforcement learning for stock portfolio allocation. In ICML.
Liang et al. 2018. Adversarial deep reinforcement learning in portfolio management.
Lillicrap, T.; Hunt, J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. CoRR.
Liu, Y.; Liu, Q.; Zhao, H.; Pan, Z.; and Liu, C. 2020. Adaptive quantitative trading: an imitative deep reinforcement learning approach. In AAAI.
Maillard, S.; Roncalli, T.; and Teïletche, J. 2010. The properties of equally weighted risk contribution portfolios. The Journal of Portfolio Management 36(4):60–70.
Markowitz, H. 1952. Portfolio selection. Journal of Finance 7:77–91.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing atari with deep reinforcement learning. NIPS Deep Learning Workshop.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.; Veness, J.; Bellemare, M.; Graves, A.; Riedmiller, M.; Fidjeland, A.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518:529–33.
Nan, A.; Perumal, A.; and Zaiane, O. R. 2020. Sentiment and knowledge based algorithmic trading with deep reinforcement learning.
Ning, B.; Lin, F. H. T.; and Jaimungal, S. 2018. Double deep q-learning for optimal execution.
Roncalli, T., and Weisang, G. 2016. Risk parity portfolios with risk factors. Quantitative Finance 16(3):377–388.
Saltiel, D.; Benhamou, E.; Ohana, J. J.; Laraki, R.; and Atif, J. 2020. Drlps: Deep reinforcement learning for portfolio selection. ECML PKDD Demo track.
Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.; and Abbeel, P. 2015a. Trust region policy optimization. In ICML.
Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2015b. High-dimensional continuous control using generalized advantage estimation. ICLR.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. CoRR.
Silver, D.; Huang, A.; Maddison, C.; Guez, A.; Sifre, L.; Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; Dieleman, S.; Grewe, D.; Nham, J.; Kalchbrenner, N.; Sutskever, I.; Lillicrap, T.; Leach, M.; Kavukcuoglu, K.; Graepel, T.; and Hassabis, D. 2016. Mastering the game of go with deep neural networks and tree search. Nature 529:484–489.
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; Chen, Y.; Lillicrap, T.; Hui, F.; Sifre, L.; van den Driessche, G.; Graepel, T.; and Hassabis, D. 2017. Mastering the game of go without human knowledge. Nature 550:354–.
Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. The MIT Press, second edition.
Théate, T., and Ernst, D. 2020. Application of deep reinforcement learning in stock trading strategies and stock forecasting.
Vinyals, O.; Babuschkin, I.; Czarnecki, W.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D.; Powell, R.; Ewalds, T.; Georgiev, P.; Oh, J.; Horgan, D.; Kroiss, M.; Danihelka, I.; Huang, A.; Sifre, L.; Cai, T.; Agapiou, J.; Jaderberg, M.; and Silver, D. 2019. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575.
Wang, H., and Zhou, X. Y. 2019. Continuous-Time Mean-Variance Portfolio Selection: A Reinforcement Learning Framework. arXiv e-prints.
Wang, S.; Jia, D.; and Weng, X. 2018. Deep reinforcement learning for autonomous driving. ArXiv abs/1811.11329.
Wu, X.; Chen, H.; Wang, J.; Troiano, L.; Loia, V.; and Fujita, H. 2020. Adaptive stock trading strategies with deep reinforcement learning methods. Information Sciences 538:142–158.
Xiong, Z.; Liu, X.-Y.; Zhong, S.; Yang, H.; and Walid, A. 2019. Practical deep reinforcement learning approach for stock trading.
Ye, Y.; Pei, H.; Wang, B.; Chen, P.-Y.; Zhu, Y.; Xiao, J.; and Li, B. 2020. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In AAAI.
Yu, P.; Lee, J. S.; Kulyatin, I.; Shi, Z.; and Dasgupta, S. 2019. Model-based deep reinforcement learning for financial portfolio optimization. RWSDM Workshop, ICML 2019.
Zhang, Z.; Zohren, S.; and Roberts, S. 2019. Deep reinforcement learning for trading.
Zhengyao et al. 2017. Reinforcement learning framework for the financial portfolio management problem. arXiv.
