
Algorithmic trading on financial time series using

Deep Reinforcement Learning


Alireza Asghari
Iran University of Science and Technology
Nasser Mozayani (mozayani@iust.ac.ir)
Iran University of Science and Technology

Research Article

Keywords: Deep learning, Deep reinforcement learning, Algorithmic trading, Financial market, Quantitative
trading strategy

Posted Date: February 1st, 2024

DOI: https://doi.org/10.21203/rs.3.rs-3910354/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.
Read Full License

Additional Declarations: No competing interests reported.


Algorithmic trading on financial time series
using Deep Reinforcement Learning

Alireza Asghari, Nasser Mozayani*

Computer Engineering Department, Iran University of Science and Technology, Tehran, Iran

* Corresponding author. Email addresses: a_asghari@comp.iust.ac.ir (Alireza Asghari), mozayani@iust.ac.ir (Nasser Mozayani)

Abstract
The use of technology in financial markets has led to extensive changes in
conventional trading structures. Today, most orders that reach exchanges are created by algorithmic trading agents.
Machine learning-based methods now play an important role in building
automated trading systems. The increasing complexity and dynamism of financial
markets are among the key challenges of these methods. The most widely used
machine learning approach is supervised learning, but in interactive environments,
the use of supervised learning alone has limitations, such as the difficulty of defining appropriate labels and the failure to model the dynamic nature of the market. Given the good performance of deep reinforcement learning-based approaches, we use them to address these problems.
In this paper, we present a deep reinforcement learning framework for trading in the financial market, a set of input features and indicators selected and tailored to the problem, a reward function, and models based on fully connected, convolutional and hybrid networks. The proposed top models were traded under real market conditions, including transaction costs, and then evaluated. In addition to outperforming the buy-and-hold strategy, these models achieved excellent cumulative returns while maintaining appropriate risk metrics.
Keywords: Deep learning, Deep reinforcement learning, Algorithmic trading,
Financial market, Quantitative trading strategy

1. Introduction

Trading is the buying and selling of assets in the financial market, and many different types of assets are available for trading. In recent years, there have been extensive changes in the use of technology in financial markets: the share of online transactions has grown significantly and the speed of transactions has increased considerably. These developments have changed the paradigm of traditional trading and largely replaced it with algorithmic trading. Algorithmic trading refers to the process in which traders consistently buy and sell a given financial asset to make a profit. It is widely applied in
trading stocks, commodity futures, foreign exchanges and cryptocurrencies (Ozbayoglu et al., 2020). In quantitative finance, trading is essentially making dynamic decisions, namely
to decide where to trade, at what price, and what quantity, over a highly stochastic and
complex market (Liu et al., 2022).
We can categorize machine learning methods into three major categories: supervised,
unsupervised and reinforcement learning. In recent years, machine learning and deep
learning algorithms have been widely applied to build prediction and classification models
for the financial market. Fundamentals data (earnings report) and alternative data (market
news, academic graph data, credit card transactions, GPS traffic, etc.) are combined with
machine learning algorithms to extract new investment alphas or predict a company’s future
performance (Yang et al., 2020). Over time, price forecasting has been extensively studied with supervised methods, but using supervised learning in the real world raises problems. Deep reinforcement learning (DRL) has been recognized as a promising alternative for quantitative finance, since it has the potential to overcome some important limitations of supervised learning, such as the difficulty in label specification and the gap between modeling, positioning and order execution (Li et al., 2021). The popularity of deep reinforcement learning (DRL) applications in economics has increased exponentially. DRL,
through a wide range of capabilities from reinforcement learning (RL) to deep learning (DL),
offers vast opportunities for handling sophisticated dynamic economics systems. DRL is
characterized by scalability with the potential to be applied to high-dimensional problems in
conjunction with noisy and nonlinear patterns of economic data (Mosavi et al., 2020).
We develop DRL trading agents that produce trading signals in the financial market. Our critic-only agents are based on DQN (Mnih et al., 2016) and C51DQN (Bellemare et al., 2017), and our actor-critic agent is based on Proximal Policy Optimization (Schulman et al., 2017). In our experiments, we study the effect on model performance of using different algorithms and architectures, such as convolutional layers, and of different types of technical indicators in addition to raw OHLC data. We use technical indicators to reduce the noise of the raw OHLC data and improve performance. For the training and test data, we use price periods that contain almost all market regimes, such as trending and sideways markets. We propose combining different periods of a technical indicator, whereas most works use only one period per indicator. Furthermore, we capture market sentiment using Google Trends, which reflects what people are searching for in relation to the market. We also use a reward function appropriate for trading in such a volatile market. Finally, we scrutinize returns alongside risk measures such as the Sharpe ratio.
Our workflow for this paper is as follows: first, we build a trading environment and define the state space, action space and reward function; second, we train our models with the three algorithms mentioned above. Our experimental results show that our models outperform the buy-and-hold strategy while achieving appropriate Sharpe ratios.
In summary, the contributions of this paper are:

1. Providing a reward function tailored to the problem, sensitive to price movements and trading opportunities (even when the agent holds no asset), and able to withstand real market conditions such as transaction costs and minimum purchase restrictions.
2. Offering a set of input features and price indicators that complement each other.
3. Using an analysis of public interest in the target financial market as part of the input feature set.
4. Providing suitable hybrid models based on convolutional and fully connected networks with the ability to learn complex price patterns.
5. Outperforming the baseline strategy (buy and hold) and achieving excellent cumulative returns while maintaining appropriate risk metrics.

The remainder of this paper is organized as follows. Section 2 briefly reviews related works. Section 3 describes our proposed model and its details. Section 4 describes the experimental setup, presents the experimental results and compares them in terms of performance and risk measures. Section 5 examines the signaling behavior of the best models, and Section 6 concludes the paper and discusses future directions.

2. Theory and related works


In this section, we present the underlying theory and the related works.

2.1. Background

Reinforcement learning involves an agent, an environment, a policy, a reward, a value function, and (optionally) a model of the environment.

Fig. 1. Interaction of reinforcement learning agent with the environment (Sutton and Barto, 2018)

The reinforcement learning process can be represented as a Markov Decision Process (MDP) (Sutton and Barto, 2018). The main components of a reinforcement learning framework are:

1. A finite set of states $S$, summarizing the information the agent senses from the environment at every time step $t$.
2. A set of actions $A$ which the agent can perform at each time step to interact with the environment.
3. A set of transition probabilities $P$ between subsequent states, which render the environment stochastic.
4. A reward function $R$ which provides a numerical feedback value $r_t$ to the agent in response to its action $a_t$ in state $s_t$.
5. A policy $\pi$ which maps states to concrete actions to be carried out by the agent. The policy can hence be understood as the agent's rules for how to choose actions.
6. A value function $V^\pi$ which maps states to the total (discounted) reward the agent can expect from a given state until the end of the episode under policy $\pi$.
Given the above framework, the decision problem is formalized as finding the optimal policy $\pi^*$, i.e., the mapping from states to actions, corresponding to the optimal value function $V^*$:

$V^*(s_t) = \max_{a_t} \mathbb{E}\left[ r_{t+1} + \gamma V^*(s_{t+1}) \right]$    (1)

Hereby, $\mathbb{E}$ denotes the expectation operator, $\gamma$ the discount factor, and $r_{t+1}$ the expected immediate reward for carrying out action $a_t$ in state $s_t$. Further, $s_{t+1}$ denotes the next state of the agent. The value function can hence be understood as a mapping from states to discounted future rewards which the agent seeks to maximize with its actions. To solve this optimization problem, the Q-Learning algorithm (Watkins, 1989) can be applied, extending the above equation to the level of state-action tuples:

$Q^*(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right]$    (2)

Hereby, the Q-value $Q^*(s_t, a_t)$ equals the immediate reward for carrying out action $a_t$ in state $s_t$ plus the discounted future reward from carrying on in the best way possible. The optimal policy $\pi^*$ (the mapping from states to actions) then simply becomes:

$\pi^*(s_t) = \arg\max_{a_t} Q^*(s_t, a_t)$    (3)

i.e., in every state $s_t$, choose the action $a_t$ that yields the highest Q-value (Fischer, 2018).
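To make equations (2) and (3) concrete, the following minimal tabular Q-learning sketch illustrates the update rule and the greedy policy; the state encoding, the hyperparameter values and the environment interface are illustrative assumptions rather than the setup used later in this paper.

```python
import numpy as np

# Minimal tabular Q-learning sketch for equations (2) and (3).
# n_states, n_actions and the parameter values are illustrative assumptions.
n_states, n_actions = 100, 3          # e.g. 3 actions: buy, sell, nothing
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.65, 0.1

def select_action(state: int) -> int:
    """Epsilon-greedy policy derived from the Q-table (cf. equation (3))."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def q_update(state: int, action: int, reward: float, next_state: int) -> None:
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```

Deep Q-learning replaces the table with a neural network that approximates $Q(s_t, a_t)$, which is the route taken by the DQN-based agents described below.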

2.2. Related works

Several works in the literature have used deep reinforcement learning in trading applications. In this section, we briefly review recent applications of deep reinforcement learning in the financial market. The literature contains three major learning approaches: critic-only, actor-only and actor-critic approaches (Fischer, 2018).
Our study focuses on creating a trading agent based on deep reinforcement learning. In other words, we do not consider research that predicts stock prices or market direction with deep learning. Of course, if such a system succeeded, a good trading agent could be built on top of it, but prices usually fluctuate so strongly that accurate forecasts, and trading based on them, are not feasible (Nabipour et al., 2020).
In (Chen and Gao, 2019), the authors used a recurrent neural network in their DQN-based model, which can better extract temporal features from sequences, and added a second target network for stabilization. They used S&P 500 ETF price data to train and evaluate their model against two baseline benchmarks: the buy-and-hold strategy and a DQN agent with random action selection. Their best model, based on the DRQN method, outperforms the baseline models.
(Wu et al., 2020) proposed using Gated Recurrent Units (GRUs) to extract complex features from stock market data and then feed them to the main network. The paper also proposed a new reward function that better handles risk, based on the Sortino ratio (SR) (Mohan et al., 2016). The authors argue that this reward function performs well in a highly volatile market because its calculation is based on measuring negative volatility. They trained and tested their system on daily stock data from different countries, using OHLCV data plus some popular technical indicators such as MACD, MA, EMA and OBV. The authors proposed two models: GDQN and the actor-critic Gated Deterministic Policy Gradient (GDPG), which combines the Q-network from the GDQN with a policy network; they are based on DQN and DPG respectively. The baseline strategy in that paper is the turtle trading strategy (Curtis, 2003). Both models outperformed the baseline not only in trending but also in volatile stock markets.
In (Huang, 2018), the author proposed a Markov Decision Process (MDP) model suitable for the financial trading task and solved it with a deep recurrent Q-network (DRQN) algorithm. He proposed several modifications to the existing learning algorithm to make it more suitable for the financial trading setting. He employed a substantially smaller replay memory (only a few hundred transitions) compared to those used in modern deep reinforcement learning algorithms (often millions in size). He also developed an action augmentation technique that mitigates the need for random exploration by providing extra feedback signals for all actions to the agent. He trained and validated his approach on the spot foreign exchange market.
(Deng et al., 2016) tried to beat experienced traders at financial asset trading with a deep reinforcement learning-based trading agent. Their model is based on a deep recurrent Q-network. For the feature extraction part of their network, they used fuzzy learning concepts to reduce the uncertainty of the input data before feeding it to the rest of the deep learning network. Their model is technical-indicator-free. The robustness of the neural system was verified on both the stock and commodity futures markets under broad testing conditions.
(Jia et al., 2019) used a deep actor-only approach for trading. This study analyzed the effect of choosing an appropriate set of technical indicators for training and the effect of using LSTM networks instead of a fully connected network. The results showed appropriate returns for some stocks but oscillatory behavior for others.
(Yang et al., 2020) proposed an ensemble multiple stock trading strategy using three
actor-critic based algorithms: Proximal Policy Optimization (PPO), Advantage Actor-Critic
(A2C), and Deep Deterministic Policy Gradient (DDPG). The authors state that the ensemble
strategy inherits and integrates the best features of the three algorithms, thereby robustly
adjusting to different market situations and could maximize risk-adjusted returns. They
tested their algorithms on the 30 Dow Jones stocks that have adequate liquidity. The
features used to train the ensemble model were the available balance, adjusted close price,
shares already owned plus some technical indicators: MACD, RSI, CCI and ADX. The
performance of the trading agent with different reinforcement learning algorithms is
evaluated and compared with both the Dow Jones Industrial Average index and the
traditional min-variance portfolio allocation strategy. The results showed that the proposed deep ensemble strategy outperformed the three individual algorithms and the two baselines in terms of risk-adjusted return as measured by the Sharpe ratio.
(Azhikodan et al., 2019) proposed automating swing trading using deep reinforcement
learning. Their deep deterministic policy gradient-based neural network model was trained to choose an action to sell, buy, or hold stocks in order to maximize the gain in asset value. The
paper also acknowledges the need for a system that predicts the trend in stock value to work
along with the reinforcement learning algorithm. Thus the authors proposed a model using a
recurrent convolutional neural network to predict the stock trend from the financial news.
The authors stated that the proposed approach could learn the tricks of stock trading.

3. Material and methods

In this section, we describe the training and evaluation data, observations, actions, reward function, the architecture of the models and the hyperparameters.
Traditional algorithmic trading strategies offer innovative rules for identifying trading
opportunities. These rule-based methods often only work well in certain market conditions
and are not able to adapt to new financial market conditions. In this paper, we propose a deep reinforcement learning approach for swing trading. Our proposed method is based on
DQN, C51 DQN and PPO algorithms. The first two algorithms are in the category of critic-
only methods and the third algorithm is in the category of actor-critic methods.

3.1. Training and evaluation data

The cryptocurrency market has gained a lot of popularity worldwide, with Bitcoin making the top news almost weekly (Millea, 2021). The dataset used for this paper was Yahoo Finance OHLCV raw data (“Bitcoin USD (BTC-USD) Interactive Price Chart - Yahoo Finance,” n.d.), along with Google Trends data (“Google Trends,” n.d.). The training period was from 2015/1/4 to 2019/12/31 and the evaluation period was from 2020/1/1 to 2022/1/31. In the training phase, data from 2014/12/20 to 2015/1/3 were used only to compute the starting points of the moving averages needed by the indicators; these points are not part of the main agent training process. In the evaluation phase, to avoid leaking data from the training phase, the values of the first 14 days are set to zero for indicators that include a moving average. This naturally affects the agent's performance at the very beginning, but it was done to preserve the generalizability of the agent.
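A sketch of how this dataset could be assembled with the yfinance and pytrends packages is shown below; the paper does not publish its data pipeline, so the Google Trends keyword, the column handling and the treatment of the indicator warm-up window are assumptions.

```python
import yfinance as yf
from pytrends.request import TrendReq

# Daily BTC-USD OHLCV from Yahoo Finance. The extra days before 2015/1/4 are
# only used to warm up the moving-average-based indicators, as described above.
ohlcv = yf.download("BTC-USD", start="2014-12-20", end="2022-02-01", interval="1d")

train = ohlcv.loc["2015-01-04":"2019-12-31"]
evaluation = ohlcv.loc["2020-01-01":"2022-01-31"]

# Google Trends interest for an assumed keyword ("bitcoin"). Over multi-year
# windows Trends returns weekly or monthly values, which would then need to be
# resampled/forward-filled to the daily index of the price data.
pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(kw_list=["bitcoin"], timeframe="2014-12-20 2022-01-31")
trends = pytrends.interest_over_time()["bitcoin"]
```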

3.2. Observations

The suggested observations are the logarithm of the OHLC prices, Google Trends data and the chosen indicators. The indicators are ATR (“Average True Range (ATR),” n.d.), RSI (“Relative Strength Index (RSI),” n.d.), STDEV (standard deviation), ACC-DIST (“Accumulation/Distribution (A/D),” n.d.), HMA (“Hull Moving Average (HMA),” n.d.) and the logarithm of return.
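The following sketch illustrates how such an observation matrix could be built with pandas; the 14-day indicator period is only an example (several periods per indicator can be combined, as discussed in the introduction), and the exact smoothing choices inside RSI and ATR are assumptions.

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Relative Strength Index using Wilder-style exponential smoothing."""
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / period, adjust=False).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / period, adjust=False).mean()
    return 100 - 100 / (1 + gain / loss)

def atr(high: pd.Series, low: pd.Series, close: pd.Series, period: int = 14) -> pd.Series:
    """Average True Range: smoothed maximum of the three true-range components."""
    prev_close = close.shift(1)
    tr = pd.concat([high - low,
                    (high - prev_close).abs(),
                    (low - prev_close).abs()], axis=1).max(axis=1)
    return tr.ewm(alpha=1 / period, adjust=False).mean()

def build_observations(df: pd.DataFrame, trends: pd.Series) -> pd.DataFrame:
    """Assemble an observation table: log-OHLC, Google Trends interest,
    selected indicators and the log return (column names follow yfinance)."""
    obs = np.log(df[["Open", "High", "Low", "Close"]])
    obs["trend"] = trends.reindex(df.index).ffill()
    obs["log_return"] = np.log(df["Close"]).diff()
    obs["rsi_14"] = rsi(df["Close"], 14)
    obs["atr_14"] = atr(df["High"], df["Low"], df["Close"], 14)
    obs["stdev_14"] = df["Close"].rolling(14).std()
    # ACC-DIST and HMA would be added in the same way.
    return obs.fillna(0.0)   # warm-up values set to zero, as described in 3.1
```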

3.3. Actions

In the proposed framework, 'buy', 'sell' and 'nothing' operations are considered, and all actions are discrete. When the agent sells, it sells all of its assets, and when it buys, it buys with all of its cash in hand. 'Nothing' has two interpretations: (1) keeping the current assets, and (2) not entering a new buy transaction. Each action is performed at the beginning of the day (opening candle) and is the result of the agent's decision after the end of the previous day (closing candle).
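A minimal sketch of this all-in/all-out execution rule, including the transaction-cost and minimum-purchase constraints described in the next subsection, might look as follows (function and variable names are illustrative, not taken from the paper's implementation).

```python
BUY, SELL, NOTHING = 0, 1, 2   # discrete action space

def execute_action(action: int, cash: float, coins: float, open_price: float,
                   fee: float = 0.0, min_volume: float = 0.0001):
    """Execute a discrete action at the day's opening price.
    Buying spends all cash, selling liquidates all coins; fee and min_volume
    mirror the transaction-cost and minimum-purchase rules of section 3.4."""
    if action == BUY and cash > 0:
        bought = cash * (1 - fee) / open_price
        if bought >= min_volume:          # respect the exchange minimum
            return 0.0, coins + bought
    elif action == SELL and coins > 0:
        return cash + coins * open_price * (1 - fee), 0.0
    # NOTHING (or an infeasible trade): keep the current cash and assets
    return cash, coins
```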

3.4. Reward function

For the reward function, different formulations were evaluated, and the proposed final function shows a good ability to learn and to earn profit. This function guides the agent's learning even when the agent is not buying or selling; in other words, the agent is sensitive to all price movements, whether it holds an asset or not.
Equation (4) represents the proposed reward function. In this function, TF denotes the transaction cost, which is 0.001 of the purchase volume in most exchanges. During agent training this cost is set to zero, but during evaluation it is set to 0.001. In addition, most exchanges impose a minimum purchase volume; on the Binance exchange, for example, it is 0.0001 of a bitcoin. This limitation is taken into account during both agent training and testing.

(4)
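Since the exact form of equation (4) is not reproduced in this text, the following sketch is only one illustrative reward of the kind described above, not the paper's function: it stays sensitive to price movements whether or not the agent holds the asset, and charges the transaction cost TF when a trade occurs.

```python
import numpy as np

def reward(close_t: float, close_prev: float, holding: bool, traded: bool,
           tf: float = 0.001) -> float:
    """Illustrative reward only -- an assumption standing in for equation (4).
    The agent earns the daily log return while holding, earns the avoided loss
    (negative of the return) while flat, and pays the transaction cost TF on trades."""
    log_ret = np.log(close_t / close_prev)
    r = log_ret if holding else -log_ret
    if traded:
        r -= tf          # TF = 0.001 at evaluation time, 0 during training
    return r
```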

3.5. Architecture of models

In this paper, we have presented 6 proposed models in 3 categories: DQN, C51DQN and
PPO. There are two models in each category: the base model is based only on dense layers
and the other is a more complex model based on dense and convolutional layers.

3.5.1. Basic DQN


This model, which we call BDQN for short, is the most basic of our proposed models. The model architecture is shown in Fig. 2. As shown in the figure, the model has two fully connected hidden layers and a softmax output layer: the first hidden layer has 256 neurons, the second 128, and the output layer 3, corresponding to the three actions of buying, selling, or doing nothing. The model uses the ReLU activation function as well as Batch Normalization. These properties were chosen based on several tests.
Fig. 2. The architecture of the BDQN model
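A possible tf.keras realization of this architecture is sketched below; the Dense → BatchNormalization → ReLU ordering is an assumption, as the exact layer ordering is not specified in the text.

```python
import tensorflow as tf

def build_bdqn(observation_dim: int, n_actions: int = 3) -> tf.keras.Model:
    """Sketch of the BDQN head: Dense(256) and Dense(128) hidden layers with
    ReLU and Batch Normalization, followed by a 3-way softmax output."""
    inputs = tf.keras.Input(shape=(observation_dim,))
    x = tf.keras.layers.Dense(256)(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Dense(128)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    outputs = tf.keras.layers.Dense(n_actions, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```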

3.5.2. Convolutional DQN

This model, called CDQN for short, is a more complex model based on the BDQN model, using convolutional networks in the feature extraction part. The model architecture is shown in Fig. 3. As shown in the figure, the model initially has two convolutional layers: the first uses 128 filters of size 6 and the second 128 filters of size 3, each with the tanh activation function and Batch Normalization. In the convolutional layers, the tanh activation function helps to learn complex patterns. After the convolutional layers, a flatten layer is used, and the final part of the model follows the BDQN architecture.

Fig. 3. The architecture of the CDQN model
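A sketch of this convolutional feature extractor in tf.keras follows; the input is a window of daily observations (the sequence length of 128 days comes from Table 1), and the exact ordering of the normalization layers is an assumption.

```python
import tensorflow as tf

def build_cdqn(sequence_length: int = 128, n_features: int = 10,
               n_actions: int = 3) -> tf.keras.Model:
    """Sketch of CDQN: Conv1D(128, kernel 6) and Conv1D(128, kernel 3) with tanh
    and Batch Normalization, a flatten layer, then the BDQN-style dense head."""
    inputs = tf.keras.Input(shape=(sequence_length, n_features))
    x = tf.keras.layers.Conv1D(128, 6, activation="tanh")(inputs)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Conv1D(128, 3, activation="tanh")(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    outputs = tf.keras.layers.Dense(n_actions, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```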

3.5.3. Other models

The basic C51 DQN (BC51DQN) has the same architecture as the BDQN model, differing only in the learning algorithm and its hyperparameters. The convolutional C51 DQN (CC51DQN) is likewise identical to the CDQN model in terms of architecture. The basic PPO (BPPO) architecture is used for both the actor and critic networks; in terms of network architecture, it differs from the previous basic models in using the tanh activation function, in addition to the differences in the learning algorithm and its specific parameters. In the convolutional PPO (CPPO), the architecture of the feature extraction part is similar to that of the CDQN model, and the final part of the model uses the BPPO architecture exactly.

3.6. Hyperparameters

3.6.1. Shared hyperparameters of DQN- and C51DQN-based models

The values used for these hyperparameters are shown in Table 1. Rows 1 to 4 describe the main features of the learning framework, namely the learning rate, the number of episodes, the batch size, and the capacity of the experience replay buffer. For updating the weights between the main network and the target network, soft updating was used instead of hard updating, with the aim of better learning stability; the corresponding value is given in row 5 of the table.
Row 6 lists the gamma value, which adjusts the agent's foresight. For the exploration-exploitation dilemma, these models use decayed epsilon-greedy exploration; the values of the related parameters are shown in rows 8 and 9. In the simple models, the length of the learning sequence is 1: each time the weights are updated, the model is given only a single time step of movement in the environment (here, one day), so each batch contains experiences of length one day. In the models based on convolutional networks, i.e., CDQN and CC51DQN, the model is instead given a sequence of days to learn from; this sequence length is specified in row 10 of the table.

Table 1: Shared hyperparameters of DQN- and C51DQN-based models


Row Hyperparameter Value
1 Learning rate 0.0025
2 Episodes 500
3 Batch size 128
4 Replay buffer capacity 1000000
5 Tau 0.01
6 Gamma 0.65
7 Optimizer Adam
8 Initial epsilon 1
9 End epsilon 0.01
10 Sequence length (convolutional models) 128
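As an illustration of rows 5, 8 and 9 of Table 1, the soft (Polyak) target update and a decayed epsilon-greedy schedule could look like the following sketch; the linear decay shape is an assumption, since the exact schedule is not described.

```python
def soft_update(target_weights, online_weights, tau: float = 0.01):
    """Polyak averaging of network weights: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for w, t in zip(online_weights, target_weights)]

def epsilon_at(episode: int, n_episodes: int = 500,
               eps_start: float = 1.0, eps_end: float = 0.01) -> float:
    """Exploration rate decayed from the initial to the final epsilon of Table 1
    (a linear decay over the episodes is assumed here for illustration)."""
    frac = min(episode / max(n_episodes - 1, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)
```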

3.6.2. Hyperparameters of C51DQN based models

These hyperparameters are specific to the C51DQN algorithm. We set the number of atoms to 51, the maximum Q value to 0.5 and the minimum Q value to -0.5.
3.6.3. Hyperparameters of PPO based models

The algorithm we use here is PPO-Clip. PPO-Clip does not include a KL-divergence term in its objective function and has no explicit constraint; instead, it relies on clipping the objective function so that new policies do not stray too far from the old ones.
In these models, the number of episodes is 500, the learning rate is 0.0001, the number of epochs is 25 and the gamma is 0.99. Here we train the agent without batch training; as a result, the whole dataset is given to the model at once.
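For reference, the clipped surrogate objective that PPO-Clip optimizes can be written as in the sketch below; the clipping range of 0.2 is the common default and an assumption here, since the paper does not report its value.

```python
import tensorflow as tf

def ppo_clip_loss(old_log_probs: tf.Tensor, new_log_probs: tf.Tensor,
                  advantages: tf.Tensor, clip_eps: float = 0.2) -> tf.Tensor:
    """Clipped surrogate objective of PPO-Clip (returned as a loss to minimize).
    The probability ratio pi_new / pi_old is clipped to [1 - eps, 1 + eps]."""
    ratio = tf.exp(new_log_probs - old_log_probs)
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = tf.minimum(ratio * advantages, clipped * advantages)
    return -tf.reduce_mean(surrogate)
```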

4. Results and Discussion

In this section, we describe the GPU cloud platform, the performance metrics and the baseline strategy, and then evaluate the performance of the proposed methods for swing trading.

4.1. GPU Cloud Platform

All experiments were executed on a cloud framework with 6 CPU cores running at 2.6
GHz, 20 GB memory and an NVIDIA GTX 1080 with 8 GB of GPU memory.

4.2. Performance metrics

• Cumulative return. Cumulative return is the percent change of net value over the time horizon h.
• Maximum drawdown (MDD). Maximum drawdown measures the largest decline from a peak over the whole trading period, to show the worst case (Magdon-Ismail and Atiya, 2004).
• Sharpe ratio. The Sharpe ratio is a risk-adjusted profit measure, which refers to the return per unit of deviation (Sharpe, 1998).
• Sortino ratio. The Sortino ratio is a variant of the risk-adjusted profit measure, which applies downside deviation as a risk measure.
• Calmar ratio. The Calmar ratio is another variant of the risk-adjusted profit measure, which applies MDD as a risk measure.
• Annual volatility. Annual volatility is the annualized standard deviation of returns.
• Daily value at risk. Daily value at risk is an important criterion for measuring risk and estimating the potential loss.
• Time in the market. Time in the market means the fraction of time during which the trader holds the asset (Sun et al., 2021); a sketch of how these metrics can be computed from a daily return series is given after this list.
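The following sketch shows how these evaluation metrics can be computed from a strategy's daily returns with pandas; the 365-day annualization (crypto trades every calendar day), the 0% risk-free rate and the 95% confidence level for daily value at risk are assumptions, not values stated in this paper.

```python
import numpy as np
import pandas as pd

def performance_metrics(daily_returns: pd.Series, periods: int = 365) -> dict:
    """Risk/return metrics from a series of daily simple returns."""
    equity = (1 + daily_returns).cumprod()            # net value curve
    cumulative = equity.iloc[-1] - 1
    drawdown = equity / equity.cummax() - 1           # running drawdown series
    ann_ret = (1 + cumulative) ** (periods / len(daily_returns)) - 1
    ann_vol = daily_returns.std() * np.sqrt(periods)
    downside = daily_returns[daily_returns < 0].std() * np.sqrt(periods)
    return {
        "cumulative_return": cumulative,
        "max_drawdown": drawdown.min(),
        "average_drawdown": drawdown[drawdown < 0].mean(),
        "annual_volatility": ann_vol,
        "sharpe": daily_returns.mean() / daily_returns.std() * np.sqrt(periods),
        "sortino": ann_ret / downside,
        "calmar": ann_ret / abs(drawdown.min()),
        "daily_var_95": np.percentile(daily_returns, 5),
    }
```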

4.3. Baseline strategy

We use the common buy-and-hold strategy, abbreviated BH, as a comparison for our proposed approach. BH is a passive investment strategy in which an investor buys the desired asset and holds it over a long period so that its value gradually increases. This strategy is used in the financial literature as a basis for evaluating the performance of investment strategies and is therefore a valuable baseline for evaluating the performance of our proposed models.
4.4. Learning curve

After performing all the experiments for the 6 proposed model types and forming 30 final models, the learning curves of these models are shown in Fig. 4. Thick lines indicate the mean and the shaded areas indicate one standard deviation around the mean, obtained over the 5 runs of each model. The horizontal axis represents the episode and the vertical axis represents the reward.

Fig. 4. Learning curve of all 30 models

In this curve, as can be seen, the basic models, whose names start with B, achieved lower rewards at convergence than the family of more complex models based on convolutional networks, whose names start with C.
Of the three base models, the DQN-based model earns less reward than the other two. As discussed earlier, this model learns the Q-values directly, whereas the C51DQN-based model learns the Q distribution and therefore has the potential to achieve greater rewards. In our experiments, however, there was not much difference in reward between these two basic models.
The other base model, BPPO, earns considerably more reward than the other two. Of course, the reward obtained on the training data, although it can provide some insight into the performance of the model on the training set, is mostly used to examine the training process and the stability and convergence of the models during training.
Among the more complex convolutional models, the curves of the CDQN and CPPO models converge to almost the same value. What is interesting is the very good performance of the CC51DQN compared to both convolutional models and, in general, to all models. This model achieved the best reward during training and shows very good convergence and stability compared to the other models.

4.5. Overall performance

After reviewing the performance of the models on the daily price evaluation data from 2020/1/1 to 2022/1/31, we selected models based on two important criteria: (1) cumulative return and (2) Sharpe ratio. After choosing the best model for each of these two criteria within each of the 6 proposed model types, we arrived at the top 9 models. Some models are the best under both criteria, so there are not always two separate models per type; we therefore have 9 models instead of 12.
In naming these models, we append the letters CR to indicate the best model of a type in terms of cumulative return, or SH to indicate the best model in terms of Sharpe ratio. If a model carries both CR and SH, it is the best model of its type under both criteria.
Table 2 shows the performance and evaluation metrics of the top models in terms of cumulative return and Sharpe ratio. Fig. 5 compares the performance of the top models with each other and with the baseline strategy (BH).

Fig. 5. Performance of top models in terms of cumulative return and Sharpe ratio

Table 2: Performance and evaluation metrics of top models in terms of cumulative return and Sharpe ratio

Model | Cumulative return | Sharpe ratio | Sortino ratio | Max drawdown | Annual volatility | Calmar ratio | Daily Value at Risk | Average drawdown | Time in market
BH | 434.47% | 1.45 | 2.11 | -53.06% | 75.55% | 2.33 | -6.20% | -8.62% | 100.00%
BPPO_CR_SH | 789.10% | 2.29 | 4.07 | -37.92% | 51.42% | 4.88 | -4.10% | -4.76% | 71.00%
BC51DQN_SH | 1025.36% | 2.48 | 4.44 | -23.72% | 52.23% | 9.24 | -4.14% | -6.05% | 54.00%
BC51DQN_CR | 1078.10% | 2.46 | 4.38 | -24.23% | 53.91% | 9.34 | -4.28% | -5.64% | 62.00%
BDQN_CR_SH | 1205.80% | 2.7 | 5.08 | -23.72% | 50.17% | 10.24 | -3.95% | -5.63% | 50.00%
CDQN_SH | 1577.30% | 3.08 | 6.58 | -15.01% | 47.40% | 19.1 | -3.68% | -2.84% | 50.00%
CPPO_CR_SH | 1630.67% | 3.11 | 5.88 | -19.30% | 47.55% | 15.16 | -3.69% | -3.02% | 54.00%
CDQN_CR | 1666.25% | 2.86 | 4.98 | -39.89% | 53.03% | 7.43 | -4.15% | -3.47% | 77.00%
CC51DQN_SH | 1698.99% | 3.22 | 6.46 | -16.45% | 46.37% | 18.23 | -3.58% | -2.79% | 54.00%
CC51DQN_CR | 2530.71% | 3.19 | 5.83 | -24.53% | 53.70% | 15.48 | -4.15% | -3.25% | 68.00%

As shown in the figure, all proposed models outperformed the baseline strategy. BDQN_CR_SH is the top base model, with a Calmar ratio of 10.24 and a Sharpe ratio of 2.7, better than all the previous base models. The trader agent based on this model is out of the market half of the time. This agent was able to escape the first price valley, but it is more or less caught in the following valleys. The average drawdown of this model is about 5.6 percent, a relatively good figure compared to the previous models and the baseline strategy.
Among the 4 basic models reviewed, none was the best in terms of the evaluation metrics.
In the following, we review the results of the models based on convolutional networks. CDQN_SH has the most appropriate Sharpe ratio of its type (CDQN). The model, which is based on convolutional networks, has a return of 1577.3%, the best Sortino ratio of 6.58, a Calmar ratio of 19.1 and the best maximum drawdown of 15.01%. In addition, it has a Sharpe ratio of 3.07, better than all the previous base models. This model was able to escape from all the price valleys.
CPPO_CR_SH has about 60% more cumulative return than CDQN_SH, while its Sharpe ratio is only about 0.11 better than that of the previous model.
CDQN_CR outperforms the previous models only in terms of cumulative return and generally does not perform well on the other metrics. In the first price valley, the model stops following prices downward after a slight decline and starts moving upward again, but it falls in the last valley.
CC51DQN_SH, with a 1699% cumulative return, is the next best model in terms of cumulative return and has the most suitable Sharpe ratio of its type (C51DQN). With a Sharpe ratio of 3.22, it has the best ratio among the 9 selected models. In addition, it has the lowest annual volatility (46.37%), the most appropriate daily value at risk (3.58%) and the best average drawdown (2.79%) of all the models. This model does not fall into any of the deepest price valleys and continues to grow. Its evaluation metrics are very close to those of the CDQN_SH model; the main difference is in their Sharpe ratios: the CC51DQN_SH has lower annual volatility than the CDQN_SH and, as a result, a better Sharpe ratio.
The last model, CC51DQN_CR, has the highest cumulative return. Although it lags slightly behind the previous model on all other metrics, its metrics are very close to those of the previous model, and it records the highest cumulative return among all the models, at 2530.71%. This model moves well away from the baseline strategy and the other models in terms of cumulative return, and it is only slightly caught in the last valley: its maximum drawdown in this valley is half the maximum drawdown of the baseline strategy over the same period.
Table 3 shows the average evaluation metrics of all 30 tested models. The BH metrics are listed as the baseline and are not averages. As the table shows, the CC51DQN model is the best in terms of average return, Sharpe, Sortino and Calmar ratios, maximum drawdown and average drawdown. In other words, these results confirm the results obtained previously for CC51DQN. Next in terms of average return is CDQN, whose metrics are close to those of CC51DQN. After these two, CPPO is the best model in terms of annual volatility and daily value at risk, although its average cumulative return is lower than that of the other two models. The base models all performed worse than the models based on convolutional networks.

Table 3: Performance and average evaluation metrics of all 30 models

Model | Cumulative return | Sharpe ratio | Sortino ratio | Max drawdown | Annual volatility | Calmar ratio | Daily Value at Risk | Average drawdown | Time in market
BH | 434.47% | 1.45 | 2.11 | -53.06% | 75.55% | 2.33 | -6.20% | -8.62% | 100.00%
BPPO | 520.33% | 1.988 | 3.506 | -35.17% | 51.52% | 4.718 | -4.16% | -5.25% | 64.40%
BC51DQN | 796.22% | 2.144 | 3.644 | -32.56% | 56.96% | 6.518 | -4.58% | -5.99% | 66.60%
BDQN | 913.54% | 2.164 | 3.696 | -39.01% | 60.24% | 6.056 | -4.83% | -5.31% | 81.60%
CPPO | 1100.09% | 2.724 | 5.042 | -26.13% | 47.77% | 9.832 | -3.76% | -3.82% | 56.00%
CDQN | 1573.37% | 2.754 | 4.954 | -34.97% | 54.72% | 9.61 | -4.30% | -3.54% | 75.60%
CC51DQN | 1744.89% | 2.98 | 5.578 | -22.91% | 50.61% | 13.496 | -3.94% | -3.44% | 62.40%

The following box plots show 9 evaluation metrics for the 30 tested models. With the help of these plots, and because the experiments were repeated, the performance of each of the 6 proposed model types on each metric can be compared with the other types more reliably. In this way, a single run is no longer the basis for judging a particular type of model, and we can assess its performance with a greater degree of confidence.

Fig. 6. Time in market
Fig. 7. Daily Value at Risk
Fig. 8. Cumulative return
Fig. 9. Sharpe ratio
Fig. 10. Maximum drawdown
Fig. 11. Sortino ratio
Fig. 12. Annual volatility
Fig. 13. Average drawdown
Fig. 14. Calmar ratio

5. Signaling of more appropriate models

In the previous section, after a general review, we concluded that the CDQN_SH, CC51DQN_SH and CC51DQN_CR models are more appropriate than the other models. In this section, we examine the signaling produced by each of these models over a 5-month sample interval.

5.1. CDQN_SH

As shown in Fig. 15, the model's decisions, i.e., the actions of buying, selling, or doing nothing, are shown in green, red and black, respectively, on the price chart. According to the figure, when the price is in an upward trend the number of green (buy) signals is higher, and when it is in a downward trend the number of red (sell) signals is higher. This indicates the correct behavior of the model in both downtrends and uptrends.

Fig. 15. Signaling of CDQN_SH model

5.2. CC51DQN_SH

As shown in Fig. 16, as with the previous model, the agent does not mistakenly decide to stay in a downtrend. According to the figure, in downtrends the number of red signals is higher, which indicates the correct behavior of the model in downtrends. Here the number of buy signals is low, but most of the time a buy signal leads to a later sell signal at a suitable profit.

Fig. 16. Signaling of CC51DQN_SH model

5.3. CC51DQN_CR

As shown in Fig. 17, the model performs well in the uptrend and keeps doing well until the price reaches the peak of the chart. In the downtrend, the model does not perform well and makes many wrong decisions, which reduces the profit earned from the upward movement. Nevertheless, in this decline it still performs twice as well as the baseline strategy.

Fig. 17. Signaling of CC51DQN_CR model

6. Conclusions

In this paper, we presented a deep reinforcement learning framework for trading in the financial market, a set of appropriate features and indicators, a reward function, and models based on fully connected, convolutional and hybrid networks. The proposed top models were traded and evaluated under real market conditions, including transaction costs. In addition to outperforming the baseline strategy (buy and hold), these models achieved excellent cumulative returns while maintaining appropriate risk metrics.
Our future work on the balance between risk and cumulative return, as well as on other areas related to this paper, includes:

1) Introducing a reward function on which risk metrics such as the Sortino, Sharpe and Calmar ratios have a direct impact.
2) Forming ensemble models based on the best available models in terms of cumulative return and risk.
3) Suggesting additional price-driven input features.
4) Adjusting the current learning framework to fit trading periods shorter than one day.
5) Comprehensively modeling the desires, emotions and tendencies of society using economic news and Twitter, for use as an input feature or as one model among a set of models.

References

Accumulation/Distribution (A/D) [WWW Document], n.d. URL https://www.investopedia.com/terms/a/accumulationdistribution.asp

Average True Range (ATR) [WWW Document], n.d. URL https://www.tradingview.com/wiki/Average_True_Range_(ATR)
Azhikodan, A.R., Bhat, A.G., Jadhav, M.V., 2019. Stock trading bot using deep
reinforcement learning, in: Innovations in Computer Science and Engineering.
Springer, pp. 41–49.

Bellemare, M.G., Dabney, W., Munos, R., 2017. A Distributional Perspective on


Reinforcement Learning. arXiv:1707.06887 [cs, stat].
Bitcoin USD (BTC-USD) Interactive Price Chart - Yahoo Finance [WWW Document], n.d.
URL https://finance.yahoo.com/quote/BTC-USD/chart/ (accessed 12.5.21).

Chen, L., Gao, Q., 2019. Application of Deep Reinforcement Learning on Automated Stock
Trading, in: 2019 IEEE 10th International Conference on Software Engineering
and Service Science (ICSESS). Presented at the 2019 IEEE 10th International
Conference on Software Engineering and Service Science (ICSESS), pp. 29–33.
https://doi.org/10.1109/ICSESS47205.2019.9040728

Curtis, F., 2003. The original turtle trading rules.

Deng, Y., Bao, F., Kong, Y., Ren, Z., Dai, Q., 2016. Deep direct reinforcement learning for
financial signal representation and trading. IEEE transactions on neural networks
and learning systems 28, 653–664.

Fischer, T.G., 2018. Reinforcement learning in financial markets - a survey (Working Paper No. 12/2018). FAU Discussion Papers in Economics.

Google Trends [WWW Document], n.d. URL https://trends.google.com/trends/

Huang, C.Y., 2018. Financial trading as a game: A deep reinforcement learning approach.
arXiv preprint arXiv:1807.02787.

Hull Moving Average (HMA) [WWW Document], n.d. URL https://www.fidelity.com/learning-center/trading-investing/technical-analysis/technical-indicator-guide/hull-moving-average

Jia, W.U., Chen, W., Xiong, L., Hongyong, S.U.N., 2019. Quantitative trading on stock
market based on deep reinforcement learning, in: 2019 International Joint
Conference on Neural Networks (IJCNN). IEEE, pp. 1–8.

Li, Z., Liu, X.-Y., Zheng, J., Wang, Z., Walid, A., Guo, J., 2021. FinRL-Podracer: High
Performance and Scalable Deep Reinforcement Learning for Quantitative Finance.
arXiv:2111.05188 [cs, q-fin]. https://doi.org/10.1145/3490354.3494413

Liu, X.-Y., Yang, H., Chen, Q., Zhang, R., Yang, L., Xiao, B., Wang, C.D., 2022. FinRL: A
Deep Reinforcement Learning Library for Automated Stock Trading in
Quantitative Finance. arXiv:2011.09607 [cs, q-fin].

Magdon-Ismail, M., Atiya, A.F., 2004. Maximum drawdown. Risk Magazine 17, 99–102.

Millea, A., 2021. Deep reinforcement learning for trading—A critical survey. Data 6, 119.

Mohan, V., Singh, J.G., Ongsakul, W., 2016. Sortino ratio based portfolio optimization
considering EVs and renewable energy in microgrid power market. IEEE
Transactions on Sustainable Energy 8, 219–229.

Mosavi, A., Ghamisi, P., Faghan, Y., Duan, P., 2020. Comprehensive Review of Deep
Reinforcement Learning Methods and Applications in Economics.
arXiv:2004.01509 [cs, econ, q-fin, stat].
https://doi.org/10.20944/preprints202003.0309.v1

Nabipour, M., Nayyeri, P., Jabani, H., Mosavi, A., Salwana, E., S., S., 2020. Deep Learning
for Stock Market Prediction. Entropy 22, 840.
https://doi.org/10.3390/e22080840

Ozbayoglu, A.M., Gudelek, M.U., Sezer, O.B., 2020. Deep learning for financial
applications : A survey. Applied Soft Computing 93, 106384.
https://doi.org/10.1016/j.asoc.2020.106384

Relative Strength Index (RSI) [WWW Document], n.d. URL https://www.tradingview.com/wiki/Relative_Strength_Index_(RSI)

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal Policy
Optimization Algorithms. arXiv:1707.06347 [cs].

Sharpe, W.F., 1998. The sharpe ratio. Streetwise–the Best of the Journal of Portfolio
Management 169–185.

Sun, S., Wang, R., An, B., 2021. Reinforcement Learning for Quantitative Trading.
arXiv:2109.13851 [cs, q-fin].
Sutton, R.S., Barto, A.G., 2018. Reinforcement learning: An introduction. MIT press.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2016. Playing Atari with Deep Reinforcement Learning. https://doi.org/10.1038/nature14236

Wu, X., Chen, H., Wang, J., Troiano, L., Loia, V., Fujita, H., 2020. Adaptive stock trading
strategies with deep reinforcement learning methods. Information Sciences 538,
142–158.

Yang, H., Liu, X.-Y., Zhong, S., Walid, A., 2020. Deep reinforcement learning for
automated stock trading: An ensemble strategy. Available at SSRN.
