
Automated Portfolio Rebalancing using Q-learning

Dr. Narayana Darapaneni Najmus Saquib Sudarshan Singhavi


Director - AIML Student - AIML Student - AIML
Great Learning/Northwestern University Great Learning Great Learning
Illinois, USA Mumbai, India Mumbai, India
darapaneni@gmail.com saquibshaikh433@gmail.com sudarshan.singhavi@gmail.com

Aishwarya Kale Pratik Bid Anwesh Reddy Paduri


Student - AIML Student - AIML Research Assistant - AIML
Great Learning Great Learning Great Learning
Mumbai, India Mumbai, India Mumbai, India
aishwaryakale614@gmail.com pratik.bid@gmail.com anwesh@greatlearning.in

Amitavo Basu Sanket Savla


Student - AIML Student - AIML
Great Learning Great Learning
Mumbai, India Mumbai, India
amitavo.basu@gmail.com sanket.savla@gmail.com

Abstract—Fund Managers and retail Do-it-yourself investors constantly seek ideas and techniques to invest their wealth in various financial assets to maximize their wealth over time while trying to minimize the Investment Risk and Transaction costs. This study uses basic Q-Learning Reinforcement Learning agents which learn market patterns to trade in financial assets to maximize the fund value, using portfolio returns net of transaction costs as the learning criterion. 15 Indian financial assets covering Equity Sectoral Indices, Government Security Indices and Gold spot prices were chosen and a Reinforcement Learning agent was trained on each of them. The Reinforcement Learning agents were given Simple Moving Averages, a 52-Week Stochastic Indicator and Price Change Momentum Indicators for their respective financial assets to train on. Testing was conducted for the Year 2019 and evaluation of the performance was done using the annual returns net of transaction costs, Max Drawdown and Standard Deviation, in comparison to the benchmarks. Most of the agents have been able to reduce the Max Drawdown and Standard Deviation while there is further scope to improve on the Fund Performance. The study has been successful in implementing a basic Q-Learning algorithm for Portfolio optimization to reduce the downside risk and would lay the foundation for further research on this subject to maximize wealth creation from Financial Markets. Exploration of further features, tuning of hyper-parameters for each Asset and use of a Deep Q-Learning algorithm could be explored further to take this research forward.

Keywords—portfolio, rebalancing, q-learning, reinforcement learning, automated, trading, wealth creation

I. INTRODUCTION

The Bombay Stock Exchange and National Stock Exchange are two of the biggest Stock Exchanges in India. Their Market Capitalization places them in the top 12 Stock Exchanges globally. Investment in traded financial assets has been a source of potential wealth creation for the Indian investor, and in recent years the interest has heightened. There is a need for a well-balanced portfolio of financial assets which can protect investors from the down-side risk. This diversification can be achieved by spreading investment funds across asset classes. The ability to achieve superior returns while holding a well-balanced portfolio depends on rebalancing the assets to adjust their weights depending upon certain market conditions. This study adds to the literature on the implementation of Model-Free Reinforcement Learning (RL) based Portfolio Rebalancing in Indian Stock Markets using Technical Indicators for Equity Sectoral Indices, Government Security (G-Sec) Indices and Gold spot prices (GOLD). The following sections provide the background and motivation for the study and introduce the concepts of RL.

A. Background

Trading in stock markets for building wealth requires timely decision making. Humans are often biased by emotions like instinct, fear and greed, have limited information and knowledge, act upon someone else's advice and many a time end up taking decisions which hurt the goal of wealth creation. On the other hand, rules-based investing or algorithmic trading are systems which have a strict investment strategy in place, and trades are carried out according to the rules in the strategy, ensuring human emotions do not interfere in the process. Such rules-based investing requires trading strategies to be defined in the investment policy. Trading strategies are based on fundamental or technical signals or a combination of both and are static in nature. Defining a trading strategy requires thorough research as there is a plethora of financial data available to investors.

Machine Learning (ML) has brought advanced techniques to discover hidden and complex patterns in the fundamental and technical signals to augment the setting of rules-based trading strategies which were otherwise too complex to formulate. Once trading strategies are set and defined in the investment policy, the rules-based investing system will always follow the defined paths, making the investment strategy static. Dynamic trading strategies have been implemented which have the flexibility to change strategies based on signals. The drawback of such a rules-based system is that the trading strategies are formulated using historical data and patterns which may not reflect the future, as trends or regimes tend to change over longer periods of time, leading to underperforming trading strategies. The problem of rules-based trading boils down to the strategies being static and based on data which might not be relevant in the future. There is no learning from failures.

Coming to RL techniques, they involve software agents which are iteratively given new information from the raw data or environment to learn actions for maximization of defined rewards. This would seem appropriate for tackling the drawbacks of rules-based investing. The RL system would be dynamic as well as continue to discover or adjust to new market behaviors.

The use of ML by investment professionals has been around for many years but the implementation of RL in Portfolio Management has seen a recent uptrend. With the success of RL algorithms in solving sequential decision-making problems, whether that be board games or Robotics, there is a heightened interest in exploring the use of such algorithms to solve portfolio rebalancing decisions, with respect to which assets to invest in and the degree of exposure. This study attempts to answer: Can RL agents be used to make profitable investment decisions by interacting with a Trading Environment and observing returns to the portfolio as rewards?

B. Literature Survey

Research on Trading Strategies has been successfully conducted using Fundamental Analysis [1] or Technical Analysis [2] or a combination of both [3]. Fundamental Analysis pertains to predictions based on the study of companies' financial statements and accompanying notes, while Technical Analysis is associated with predictions based on the analysis of historical prices and volumes. This study considers the use of Technical Analysis.

Most of the existing studies in RL for Trading Strategies have focused on a single asset or at most a two to three asset portfolio with Companies' stock or Stock Indices and Foreign Exchange [4] [5] [6]. This study will use multiple Equity Indices in the form of NIFTY Sectoral Indices, Fixed Income in the form of G-Sec Indices and Commodities in the form of GOLD as tradable assets. Commodities, Stock Indices and Fixed Income Instruments have been used by [7] for a Deep RL study. Studies using RL for the Indian Stock Market have recently picked up [8] [9] [10].

C. Reinforcement Learning

Typically, for a trading problem to be defined as an RL problem, as given in Fig. 1, it must have an Environment with a discrete State Space (S) Set, a discrete Action Space (A) Set and a Rewards (R) Set [11]. The mapping of State and Action pairs is called the Policy (π). For a trading agent, the information available to the agent at Time (t), for example Open-High-Low-Close prices, Current Holdings and Volume Traded for the previous day, would make up the State (St) at that time. The Action (At) at that point in time could be a buy, hold or sell action. On taking an Action, the Environment will provide a Reward (Rt+1) to the agent depending on the change in wealth of the investor or the Sharpe ratio, and move on to the next state (St+1). The agent will learn from such experience and iteratively develop a strategy through this reinforcement and form the Optimal Policy (π*). Here the agent acts only on information available in the current state (St) and hence has the Markov Property. Since our trading problem requires a process of taking sequential decisions with finite states and actions, we can define the RL problem as a Markov Decision Process.

Fig. 1 Q-LEARNING AGENT AND ENVIRONMENT

The science of RL involves the execution of optimal decisions using experiences. Iteratively, the steps of RL involve:
1) Observing the environment
2) Deciding the action to be taken using some strategy
3) Acting accordingly
4) Receiving a reward or penalty
5) Learning from the experiences and refining the strategy
6) Iterating until an optimal strategy is found

There are 2 main types of RL algorithms: model-based and model-free. A model-free algorithm estimates the optimal policy without using or estimating the dynamics of the environment, i.e. its transition and reward functions, whereas a model-based algorithm uses the transition function and the reward function in order to estimate the optimal policy. Our study will not attempt to model the environment and hence will consider Model-free RL algorithms.

Model-free RL algorithms can be of two types based on how they are trained: On policy and Off policy. On policy learning includes SARSA (State-Action-Reward-State-Action), which learns policies and evaluates the consequences of the action currently taken. Off policy learning includes Q-learning, which evaluates rewards independent of the current action, always evaluating all the possible current actions to see which action could maximize the reward gained at the next step.
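To make the state–action–reward mapping above concrete, a minimal sketch of a single-asset trading environment is shown below. The class name, the use of five holding levels and the log-wealth-change reward are illustrative assumptions for this sketch, not code from the study.

```python
class SingleAssetTradingEnv:
    """Minimal sketch: the state is the day's discretized indicators plus the
    current holding level, the action is the target holding level, and the
    reward is the investor's change in log wealth."""

    def __init__(self, features, log_returns, n_actions=5):
        self.features = features        # list/array of per-day indicator tuples
        self.log_returns = log_returns  # next-day log return of the asset
        self.n_actions = n_actions      # holding levels 0%, 25%, ..., 100%
        self.t = 0
        self.holding = 0                # start fully in cash

    def reset(self):
        self.t, self.holding = 0, 0
        return (*self.features[self.t], self.holding)

    def step(self, action):
        invested = action / (self.n_actions - 1)       # fraction of fund invested
        reward = invested * self.log_returns[self.t]   # change in log wealth
        self.holding = action
        self.t += 1
        done = self.t >= len(self.log_returns)
        next_state = None if done else (*self.features[self.t], self.holding)
        return next_state, reward, done
```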
D. Q-Learning nearly zero.
Q-Learning is a model-free off-policy RL method in which the agent aims to obtain the optimal state-action-value function Q*(S,A) by interacting with the environment. The algorithm maintains a state-action table Q[S,A], called the Q-table, containing Q-values for every state-action pair. At the start, the Q-values are initialized to zeroes. The Q-learning algorithm updates the Q-values using the Temporal Difference method as shown in (1).

Q(St,At) = Q(St,At) + α(Rt+1 + γ maxA Q(St+1,A) – Q(St,At))   (1)

where,
α is the learning rate
γ is the discount factor

Q(St,At) is the actual Q-value for the state-action pair (St,At). The target Q-value for the state-action pair (St,At) is Rt+1 + γ maxA Q(St+1,A), i.e. the immediate reward plus the discounted Q-value of the next state. The table converges using iterative updates for each state-action pair. To efficiently converge the Q-table, the epsilon (ε)-greedy approach is used.
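A compact sketch of the Temporal Difference update in (1), together with ε-greedy action selection as described in the following subsections, might look as follows; the table shape and the hyper-parameter values in the usage lines are placeholders, not the study's tuned values.

```python
import numpy as np

def epsilon_greedy(q_table, state, epsilon, n_actions):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_table[state]))

def q_update(q_table, state, action, reward, next_state, alpha, gamma):
    """Temporal Difference update from equation (1)."""
    target = reward if next_state is None else reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (target - q_table[state, action])

# Illustrative usage: 1280 states and 5 actions, as described later in the paper.
q_table = np.zeros((1280, 5))
action = epsilon_greedy(q_table, state=0, epsilon=0.5, n_actions=5)
q_update(q_table, state=0, action=action, reward=0.001, next_state=1, alpha=0.1, gamma=0.9)
```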
E. Learning Rate

The learning rate α is a Hyper-parameter which determines to what extent newly acquired information replaces the existing information in the Q-Table. A factor of 0 would force the agent to learn nothing, allowing it to only exploit prior knowledge, while a factor of 1 makes the agent consider only the most recent information, ignoring prior experienced knowledge. In a fully deterministic environment, a learning rate of α = 1 is optimal. When the problem is stochastic, the algorithm converges under some technical conditions on the learning rate that require it to decrease to zero.

F. Discount Factor

The discount factor γ is a Hyper-parameter which determines the relevance of future rewards. A factor of 0 will make the agent short-sighted since it will only consider current rewards, while a factor approaching 1 will make it strive for a long-term high reward.

G. The ε - Greedy Approach

At the commencement of the training, the Q-values of the Q-table are initialized to zeroes. This implies all actions for a state have the same chance of being selected. So, to converge the Q-table using iterative updates, an exploration–exploitation tradeoff is used. The exploration action updates the Q-value of a random state-action pair Q(St,At) of the Q-table by randomly selecting the action. The exploitation selects the greedy action (At) for the state (St) from the Q-table having the maximum reward.

So, to converge the Q-table from a zero-initialized Q-table, more exploration is required in the initial phases of the iterative update process, and more exploitation later. The process can use a probability ε, in which a random action is selected with probability ε, and an action is chosen from the Q-table with probability 1 - ε. At the beginning, the value of ε is one so that there is only exploration, which helps to fill the Q-table; it reduces with time, and once the table converges it becomes nearly zero.

Once the training phase is completed and the Q-table is updated, for the testing & Production phase only exploitation is used for choosing the best learned action at a given state (St).

II. MATERIALS AND METHODS

In this section, we introduce the approach used for attempting to build a strategy for Automated Portfolio Rebalancing using the Q-learning algorithm.

A. Data Source

We are using Indian Market Data for our study. We have used publicly available historical data for Equity Sectoral Indices, G-Sec Indices and GOLD. Raw Daily Data for Equity Sectoral Indices and G-Sec Indices was downloaded from the NSE India Website, whereas Daily GOLD data was downloaded from a Bloomberg Terminal. For Equity Sectoral Indices, we have taken the NIFTY Auto (AUTO) Index, NIFTY Bank (BANK) Index, NIFTY Consumer Durables (CD) Index, NIFTY Financial Services (FINSERV) Index, NIFTY FMCG (FMCG) Index, NIFTY IT (IT) Index, NIFTY Media (MEDIA) Index, NIFTY Metal (METAL) Index, NIFTY Pharma (PHARMA) Index, NIFTY Private Bank (PVTB) Index, NIFTY PSU Bank (PSUB) Index, NIFTY Realty (REALTY) Index and NIFTY Oil and Gas (OILGAS) Index. For G-Sec Indices, we have taken the NIFTY Composite G-Sec (COMPGSEC) Index, NIFTY 4-8 yr. G-Sec (GSEC_4_8_YR) Index and NIFTY 8-13 yr. G-Sec (GSEC_8_13_YR) Index.

B. Data Pre-processing

For Equity Indices, daily Open, High, Low and Close prices were available for Price Return (PR) Indices whereas only daily Close Price Data was available for Total Return (TR) Indices. However, since TR Data provides a more appropriate valuation as it is adjusted for any Corporate Action Events, we collected both sets of Data and derived the TR Open Price, TR High Price and TR Low Price corresponding to the TR Data from the PR Data using (2), (3) and (4) respectively. However, Open, High and Low Prices were not available for the CD and OILGAS Indices and also for earlier dates for some of the Indices. For all such cases, the Close price was considered where Open, Low or High Price data was required for further calculations.

(2)

(3)

(4)

PR Indices data was discarded after deriving these new features. If some data was missing for some dates due to the Market for that segment not operating on that particular day, we have taken the previous business date's prices for further calculations.
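Equations (2)–(4) are not reproduced in this extract. A common way to derive TR Open, High and Low prices from PR data, assumed here purely for illustration, is to scale each PR price by the ratio of TR Close to PR Close; the column names in the sketch are hypothetical.

```python
import pandas as pd

def derive_tr_ohlc(pr: pd.DataFrame, tr_close: pd.Series) -> pd.DataFrame:
    """Approximate TR Open/High/Low by scaling PR prices with the TR/PR close
    ratio (an assumed form of equations (2)-(4), not the paper's exact formulas)."""
    ratio = tr_close / pr["Close"]
    tr = pd.DataFrame({"Open": pr["Open"] * ratio,
                       "High": pr["High"] * ratio,
                       "Low": pr["Low"] * ratio,
                       "Close": tr_close})
    # Fall back to the Close where Open/High/Low are unavailable, and carry
    # forward the previous business day's prices for missing dates.
    for col in ("Open", "High", "Low"):
        tr[col] = tr[col].fillna(tr["Close"])
    return tr.ffill()
```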
The Normalized Close Price using the 52-week Low High Range is referred to as the stochastic K% and is part of the stochastic oscillator developed initially by George Lane in the late 1950s [12]. If the reading goes above 80 it signifies an overbought level, and thereafter once the reading goes below 80 it is usually considered a sell signal. On the other hand, if the reading goes below 20 it signifies an oversold level, and once it goes above 20 again it is usually considered a buy signal. We derived the Low High Range using (5) for each Index & GOLD.
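A sketch of the 52-week Low High Range feature follows. Since equation (5) is not reproduced here, the usual stochastic %K form (position of the close within the trailing 52-week low–high band) and a 252-trading-day window are assumptions.

```python
import pandas as pd

def low_high_range(close: pd.Series, window: int = 252) -> pd.Series:
    """Position of the latest close within its trailing 52-week low-high band
    (0 to 1); multiplying by 100 gives the stochastic %K reading in the text."""
    lowest = close.rolling(window).min()
    highest = close.rolling(window).max()
    return (close - lowest) / (highest - lowest)
```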
The Simple Moving Average (SMA) has been observed to be very useful in smoothening noisy prices and identifying bullish or bearish trends, especially when shorter SMAs are compared against longer SMAs. SMAs help to cut the noise by smoothening the data. The popular Golden cross and Death cross [13] occur when the 50-day SMA crosses the 200-day SMA from below or above respectively. The current study compares the latest close price of an asset with its 5-day, 10-day, 15-day, 20-day, 30-day, 50-day, 100-day, 150-day & 200-day SMAs; the feature is derived using (6) for each Index & GOLD.

(6)
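Equation (6) is likewise not reproduced in this extract; given the later rule that values greater than 1 are treated as buy signals, the SMA feature is assumed here to be the ratio of the latest close to the N-day SMA.

```python
import pandas as pd

SMA_WINDOWS = [5, 10, 15, 20, 30, 50, 100, 150, 200]

def sma_ratio_features(close: pd.Series) -> pd.DataFrame:
    """Ratio of the latest close to each N-day SMA; a value above 1 suggests the
    price is trading above its recent average (assumed form of equation (6))."""
    return pd.DataFrame(
        {f"close_over_sma_{n}": close / close.rolling(n).mean() for n in SMA_WINDOWS}
    )
```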
As per Jegadeesh, N. & Titman, S. (1993) [14], there is evidence to suggest that stocks which have a strong 1-year return continue to perform better than other stocks and vice versa, while stocks which perform poorly in the immediate 1-month look-back period continue to perform poorly in the future. Considering these observations, we calculated 1-month, 3-month, 6-month, 9-month and 12-month Momentums using (7) for each Index & GOLD.

(7)

To normalize the effect of the value of the index, we derived Daily Returns over the previous day and, using these, we derived Daily Log Returns, as they exhibit a more Normal distribution than Daily Returns and are also additive in nature [15]. For the target variable, the Daily Log Return for the next trading day was considered.

On studying the constituents of the BANK Index, PVTB Index and PSUB Index [16], we realized that the BANK Index is representative of Stocks from the PVTB Index and PSUB Index. Hence, we eliminated the BANK Index from our further study. It was also found that the METAL index was launched in 2011 but data was available only from 2013, which we deemed inadequate for our Study, and hence this Index was dropped too.

We have only considered days when Equity Markets are operational and have discarded any other days from the Data, since our aim is to take Investment decisions on the next Trading day based on the RL Agent's signal from the previous Trading Day.

To reduce the number of features available, we evaluated performance on the 52-week Low High Range, SMAs & Momentums when used alone. Based on the 52-week Low High Range for each Index and GOLD, we first derived weights for each Index and GOLD by subtracting the Low High Range from 1. The funds were rebalanced after the end of each trading day based on these weights. Using this technique, we were able to achieve a modest 6% Compounded Annual Growth Rate for the combined portfolio for the study period between 3-Jan-2011 and 31-Dec-2019.

For SMA, values greater than 1 were considered a buy signal & values less than or equal to 1 were considered a sell signal. We simulated buying with 100% of the fund & staying invested if the N-day SMA feature was more than 1, or selling everything & holding Cash if it was less than or equal to 1, for each Index and GOLD. Similarly, for Momentum, positive values were considered a buy signal & 0 or negative values were considered a sell signal. We simulated buying with 100% of the fund & staying invested if the N-month Momentum was positive, or selling everything & holding Cash if it was 0 or negative, for each Index and GOLD. For both these strategies, we calculated the Annualized Return, Annualized Standard Deviation and Sharpe Ratio for each Index and GOLD. The Annualized Sharpe Ratios for SMA and Momentum are shown below in Tables I & II. We selected two N-day SMA features & one N-month Momentum feature for each Index and GOLD based on the highest Sharpe ratio for each Index. So, for AUTO we selected 5-day SMA, 15-day SMA and 1-month Momentum; for CD we selected 5-day SMA, 10-day SMA and 1-month Momentum; for FINSERV we selected 5-day SMA, 30-day SMA and 1-month Momentum; for FMCG we selected 10-day SMA, 20-day SMA and 9-month Momentum; for IT we selected 5-day SMA, 30-day SMA and 12-month Momentum; for MEDIA we selected 5-day SMA, 20-day SMA and 1-month Momentum; for METAL we selected 50-day SMA, 150-day SMA and 1-month Momentum; for OILGAS we selected 5-day SMA, 30-day SMA and 1-month Momentum; for PHARMA we selected 5-day SMA, 10-day SMA and 1-month Momentum; for PVTB we selected 5-day SMA, 30-day SMA and 1-month Momentum; for PSUB we selected 5-day SMA, 20-day SMA and 1-month Momentum; for REALTY we selected 5-day SMA, 20-day SMA and 1-month Momentum; for COMPGSEC we selected 10-day SMA, 200-day SMA and 1-month Momentum; for GSEC_4_8_YR we selected 20-day SMA, 30-day SMA and 1-month Momentum; for GSEC_8_13_YR we selected 50-day SMA, 200-day SMA and 6-month Momentum; and for GOLD we selected 15-day SMA, 20-day SMA and 3-month Momentum for the individual RL agents using Q-learning.
TABLE I. ANNUALIZED SHARPE RATIO FOR SMA
Index 5-day SMA 10-day SMA 15-day SMA 20-day SMA 30-day SMA 50-day SMA 100-day SMA 150-day SMA 200-day SMA
AUTO 1.48355667 1.15018890 1.24236519 1.10764975 1.13132530 1.09196363 1.03966691 0.80640208 0.86042763
CD 1.98000178 2.05726032 1.71621776 1.78816832 1.81525439 1.57956067 1.10115519 1.08389546 0.95049958
FINSERV 1.32622692 0.95231588 0.92225889 1.12444960 1.17522373 1.08411680 1.03858427 0.96120119 0.90593963
FMCG 0.79330225 0.94786267 0.80540931 0.94850238 0.77473633 0.73255051 0.59288194 0.47877364 0.58123421
IT 0.76589234 0.57333501 0.54894567 0.54680536 0.61476779 0.41905141 0.47521311 0.36846660 0.40207293
MEDIA 1.08151285 0.77363884 0.96586410 1.03963905 0.97922425 0.55866838 0.50876233 0.49345318 0.39541588
METAL 0.29581653 0.05442240 0.42805897 0.62713366 0.61951665 0.65571281 0.37075611 0.67718264 0.51280346
OILGAS 1.04411185 0.76507733 0.84351468 0.96284268 0.96435569 0.55228656 0.55920759 0.49564265 0.42000785
PHARMA 1.45380417 1.28840103 1.16009338 1.14711363 1.02955484 0.92591495 0.90844577 0.73100964 0.69575686
PVTB 1.23551421 1.09740186 1.14531106 1.20226426 1.33392775 1.13267051 0.83134067 0.99274672 0.89778140
PSUB 0.90468248 0.65106871 0.72326996 0.76611311 0.70069084 0.47731988 0.36581347 0.26282454 0.34178230
REALTY 0.95839762 0.72738983 0.71163265 0.82582332 0.74722861 0.49565908 0.40216909 0.20225667 0.09976231
COMPGSEC 2.02764098 2.15874240 1.66978388 2.09381107 2.06563468 2.13783209 2.13806415 2.11745001 2.14213716
GSEC_4_8_YR 2.77509697 2.89260501 2.75048037 2.93470924 2.98466119 2.76134428 2.66511964 2.49447746 2.46254803
GSEC_8_13_YR 1.66088464 1.74910704 1.47466846 1.62734220 1.82818853 1.93533755 1.90012896 1.90866595 1.98974448
GOLD 0.91121133 0.84033946 0.97160675 0.91469639 0.90424245 0.63230510 0.66404192 0.77842893 0.87594866
TABLE II. ANNUALIZED SHARPE RATIO FOR MOMENTUM
Index 1-Month Momentum 3-Months Momentum 6-Months Momentum 9-Months Momentum 12-Months Momentum
AUTO 1.12531586 0.94999480 0.64297727 0.61182625 0.80641380
CD 1.96321881 0.91180760 0.89440983 0.74017643 0.64969121
FINSERV 1.27148458 0.83393270 0.78094061 0.77949692 0.51470345
FMCG 0.52084291 0.50854070 0.58385211 0.77055799 0.61552002
IT 0.56925669 0.49948057 0.24589755 0.44601228 0.72314494
MEDIA 0.98062118 0.24901821 0.71447105 0.58811868 0.30449345
METAL 0.83716136 0.33963950 0.41230258 0.55078330 0.24352123
OILGAS 0.75955073 0.34811074 0.38691415 0.26890018 0.34171915
PHARMA 0.88791388 0.77116503 0.54493095 0.72129670 0.66676257
PVTB 1.26806205 0.77744619 0.67548553 0.66448668 0.52261897
PSUB 0.61454096 0.22770545 0.26108619 0.31407777 -0.02154569
REALTY 0.51303506 0.41474842 0.06693126 -0.15630960 -0.45799706
COMPGSEC 2.06822002 2.02029460 1.97330026 1.78257058 1.61162629
GSEC_4_8_YR 3.06211792 2.59074137 2.59179890 2.45077381 2.52577105
GSEC_8_13_YR 1.88097911 1.73170645 1.94553988 1.57661334 1.40426727
GOLD 0.73451161 0.91882153 0.63914257 0.86462776 0.75917356
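The momentum feature in (7) and the standalone evaluation behind Tables I and II could be produced along the following lines; the 21-trading-day month, the use of simple percentage changes, the 250-day annualization and the zero risk-free rate in the Sharpe ratio are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def momentum(close: pd.Series, months: int, days_per_month: int = 21) -> pd.Series:
    """Price change over the trailing N months (assumed form of equation (7))."""
    return close.pct_change(months * days_per_month)

def annualized_sharpe(signal: pd.Series, log_returns: pd.Series,
                      trading_days: int = 250) -> float:
    """All-in/all-out simulation: hold the asset on days the previous day's
    signal was a buy, hold cash otherwise, then annualize mean/std of returns."""
    strategy = log_returns * signal.shift(1).fillna(0)
    ann_return = strategy.mean() * trading_days
    ann_std = strategy.std() * np.sqrt(trading_days)
    return ann_return / ann_std

# Example: Sharpe ratio of the 1-month momentum rule for one index.
# close = pd.Series(...)              # daily TR close prices
# log_ret = np.log(close).diff()
# sharpe = annualized_sharpe((momentum(close, 1) > 0).astype(int), log_ret)
```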

C. Methodical steps taken

Since the Low High Range & 12-month Momentum require the last 12 months of data, we considered the date one year after the availability of the Close price as the starting date of the Training period. For Q-learning Hyper-parameter tuning, we considered data up to the end of 2017 as Training Data & used 2018 data as Validation Data. For evaluating final performance, we considered data up to the end of 2018 as Training Data & used 2019 data as Testing Data.

For using the Q-learning algorithm, we need a discrete State space & Action space. From Data pre-processing, we shortlisted the Low High Range, two N-day SMA and one N-month Momentum features. Discretizing the features into multiple states quickly explodes the number of states; e.g., using 10 discrete states across 4 features explodes the number of states to 10,000, while on the other hand we have fewer than 2,000 daily records for Training. Hence, we decided to use only 4 discrete states per feature based on the quartiles. The State for each Feature was considered as 0 if it lies in the first quartile, 1 if it lies in the second quartile, 2 if it lies in the third quartile and 3 if it lies in the fourth quartile. This way we get 256 states. We also discretize Actions by considering the Action as 0 if 100% Cash holding is suggested, 1 if 75% Cash holding and 25% investment in the Fund is suggested, 2 if 50% Cash holding and 50% Fund investment is suggested, 3 if 25% Cash holding & 75% Fund investment is suggested, and 4 if 100% Fund investment is suggested by the RL agent. Since one of our aims is to minimize transaction costs as well, the RL Agent would also need information about the current level of investment so that it can learn not to take big jumps which attract higher Transaction Charges; hence the 5 discrete Actions are also considered in the State, and we therefore have 1280 states in all for each Index and GOLD.

We also consider the Charges applicable for Transactions when calculating the net Performance of the RL Agent. Brokerage is considered as 0%, as this fee is normally levied for advisory services which we do not require, and also because there are platforms available which do not charge any Brokerage fees. The Transaction Charge is considered as 0.5%. GST is considered as 18% as per the existing applicable rate. A Securities Transaction Tax of 0.001% is charged on Sale transactions of Equity Funds. The Stamp Duty applicable is 0.01%. The SEBI Turnover Tax is 0.00015%. Also, ETFs have a Daily Expense Ratio of around 0.5% annually, which is divided by 250, the rough number of trading days in a year.

Rewards are calculated as the Log of the Net Fund performance over the previous day after deducting all these applicable Charges & are added to the Total Rewards.
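A sketch of the state encoding (four quartile-binned features plus the current holding level, giving 4^4 × 5 = 1280 states) and of a reward net of charges is shown below; the helper names are hypothetical and, for brevity, only the transaction charge, GST on it and the daily expense ratio from the schedule above are applied.

```python
import numpy as np

N_FEATURE_BINS, N_FEATURES, N_ACTIONS = 4, 4, 5   # 4**4 * 5 = 1280 states

def encode_state(feature_bins, holding_level):
    """Map four quartile bins (0-3) and the holding level (0-4) to an index in 0..1279."""
    state = 0
    for b in feature_bins:
        state = state * N_FEATURE_BINS + b
    return state * N_ACTIONS + holding_level

def reward(prev_value, gross_return, turnover,
           transaction_cost=0.005, gst=0.18, daily_expense=0.005 / 250):
    """Log of net fund performance over the previous day: gross growth less the
    transaction charge (plus GST on it) on the traded amount and the daily
    expense ratio. Other levies (STT, stamp duty, SEBI fee) are omitted here."""
    charges = prev_value * turnover * transaction_cost * (1 + gst)
    net_value = prev_value * gross_return - charges - prev_value * daily_expense
    return np.log(net_value / prev_value)
```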
We then tuned the Q-learning hyper-parameters with 100 Episodes for Training. We picked the MEDIA Index for hyper-parameter tuning for the Equity Indices because it was one of the toughest indices with respect to returns in the validation & testing period: in 2018, the Validation period, and 2019, the Test period, returns from this index were around -25% of the invested amount, and we wanted the RL agent to learn to avoid loss-making trades on this index. We picked the COMPGSEC Index for G-Sec Indices hyper-parameter tuning, and the hyper-parameters were also tuned separately for GOLD. The hyper-parameter Epsilon determines how much the RL Agent explores the next action randomly versus how much it exploits the best action from the already learnt Q-table during Training; the higher the value of Epsilon, the more it explores. The hyper-parameter Discount rate determines how much weightage is given to future Rewards; the higher the Discount rate, the more weightage is given to future rewards. The hyper-parameter Learning rate determines how much weightage should be given to new Learning versus the already updated value in the Q-table; the higher the Learning rate, the more weightage is given to new Learning. After several iterations, we tuned Epsilon to 0.48 for Equity Sectoral Indices, 0.83 for G-Sec Indices and 0.29 for GOLD, tuned the Discount rate to 0.33 for Equity Sectoral Indices, 0.91 for G-Sec Indices and 0.74 for GOLD, and tuned the Learning rate to Dynamic for Equity Sectoral Indices, meaning it gradually decreases as the RL Agent approaches completion of the Training Phase, 0.08 for G-Sec Indices and 0.09 for GOLD.

After hyper-parameter tuning, individual RL Agents were trained for 1000 episodes for each Index and GOLD using the Q-learning algorithm. Fig. 2 shows the Flowchart of Q-learning.

Fig. 2 Q-LEARNING FLOWCHART (initialize the learning rate α, discount factor γ & epsilon ε; randomly initialize the Q Table for all States (s) & Actions (a); initialize the state s0 indicating the initial fund performance for each asset; at each step, generate a random number R and pick a random Action if R < ε, otherwise take the Action with the maximum Q-value for the given State; calculate the reward (r) & the updated Q-value for the given State and Action taken; update the Q Table; repeat until the last step, and repeat over the remaining epochs.)

D. Limitation of method

With the basic Q-learning algorithm, we are able to build Index-specific RL agents instead of a single agent that could predict the optimal weights of investment for each Asset. This is because of the limitation on the number of States into which we can break down the features, due to the fewer daily records available for the Data on which we have based our study.

Also, the discrete nature of the State space & Action space limits the capability of this model.

III. RESULT

The following Table III describes the results of each RL Agent and compares them with their respective Benchmarks.

From Table III, it is observed that for Equity Sectoral Indices, the RL Agent's Annualized Return is between -23.01% for AUTO and 0.70% for REALTY, Max Drawdown is between 6.30% for PVTB and 25.36% for PSUB, Annualized Standard Deviation is between 6.10% for CD and 14.13% for PSUB, Reward-to-Risk Ratio is between -204.07% for AUTO and 8.70% for REALTY, UP days are between 21 for REALTY and 59 for PVTB, and DOWN days are between 57 days for FINSERV and 118 for AUTO for the Year 2019.

Similarly, for G-Sec Indices, the RL Agent's Annualized Return is between 5.54% for GSEC_8_13_YR and 8.54% for GSEC_4_8_YR, Max Drawdown is between 1.36% for COMPGSEC and 1.48% for GSEC_8_13_YR, Annualized Standard Deviation is between 2.80% for COMPGSEC and 2.91% for GSEC_8_13_YR, Reward-to-Risk Ratio is between 213.93% for COMPGSEC and 298.89% for GSEC_4_8_YR, UP days are between 69 for COMPGSEC and 133 for GSEC_4_8_YR, and DOWN days are between 49 days for COMPGSEC and 84 for GSEC_4_8_YR for the Year 2019.

And for GOLD, the Agent's Annualized Return is -1.42%, Max Drawdown is 2.80%, Annualized Standard Deviation is 3.13%, Reward-to-Risk Ratio is -45.31%, UP days are 28 and DOWN days are 37 for the Year 2019.


TABLE III. TRADING AGENTS PERFORMANCE FOR 2019 FOR EQUITY, DEBT & GOLD INDICES
AUTO CD FINSERV
Performance for 2019
Strategy Benchmark Strategy Benchmark Strategy Benchmark
Annualized Return -23.01% -8.84% -0.81% 18.54% -9.12% 25.13%
Max Drawdown % 24.58% 25.60% 8.46% 14.81% 13.33% 12.80%
Ann. Standard Deviation 11.27% 23.94% 6.10% 16.45% 8.58% 18.13%
Reward-To-Risk Ratio -204.07% -36.92% -13.32% 112.73% -106.34% 138.63%
Up Days 44 112 58 123 31 132
Down Days 118 132 107 121 57 112

FMCG IT MEDIA OILGAS


Performance for 2019
Strategy Benchmark Strategy Benchmark Strategy Benchmark Strategy Benchmark
Annualized Return -7.58% 0.48% -2.19% 10.91% -15.16% -29.13% -6.01% 14.26%
Max Drawdown % 8.47% 7.84% 8.65% 10.57% 21.20% 33.08% 13.65% 19.64%
Ann. Standard Deviation 7.52% 12.59% 7.34% 15.37% 9.67% 33.94% 7.86% 19.13%
Reward-To-Risk Ratio -100.84% 3.78% -29.90% 70.93% -156.71% -85.84% -76.43% 74.56%
Up Days 52 117 50 125 56 115 54 122
Down Days 67 127 90 119 103 129 75 122

PHARMA PVTB PSUB REALTY


Performance for 2019
Strategy Benchmark Strategy Benchmark Strategy Benchmark Strategy Benchmark
Annualized Return -10.00% -8.79% 0.04% 16.02% -20.83% -19.53% 0.70% 26.44%
Max Drawdown % 17.81% 23.89% 6.30% 15.81% 25.36% 37.22% 9.01% 16.68%
Ann. Standard Deviation 10.32% 17.56% 7.64% 19.79% 14.13% 32.12% 8.10% 23.76%
Reward-To-Risk Ratio -96.90% -50.06% 0.46% 80.93% -147.39% -60.82% 8.70% 111.29%
Up Days 54 117 59 125 39 116 21 138
Down Days 87 127 91 119 115 128 72 106

COMPGSEC GSEC_4_8_YR GSEC_8_13_YR GOLD


Performance for 2019
Strategy Benchmark Strategy Benchmark Strategy Benchmark Strategy Benchmark
Annualized Return 6.00% 10.79% 8.54% 11.03% 5.54% 11.02% -1.42% 23.04%
Max Drawdown % 1.36% 1.79% 1.44% 1.42% 1.48% 1.91% 2.80% 7.52%
Ann. Standard Deviation 2.80% 4.08% 2.86% 3.07% 2.91% 4.50% 3.13% 13.59%
Reward-To-Risk Ratio 213.93% 264.15% 298.89% 359.10% 190.46% 244.73% -45.31% 169.60%
Up Days 69 148 133 150 103 143 28 133
Down Days 49 95 84 92 79 101 37 110

IV. DISCUSSION AND CONCLUSION

A. Discussion

Individual RL Agents are only in some cases able to beat the Benchmark Indices in terms of Annualized Return, for example for MEDIA, where the RL Agent is able to provide an Annualized Return of -15.16% vs -29.13% for the Benchmark. This could be because we tuned the Hyper-parameters only for MEDIA and used them for all other Equity Sectoral Indices. The results might improve if we attempt to tune the Hyper-parameters for all individual RL Agents.

Almost all RL Agents are able to bring down the Max Drawdown as compared to their corresponding Benchmarks. However, the up-capture ratio remains poor in comparison to the Benchmarks. Again, this could be because we picked MEDIA to tune the Hyper-parameters for all Equity Sectoral Index RL Agents, and MEDIA had poor annualized returns both in the Validation Year of 2018, on which the Hyper-parameters were tuned, and in the Testing Year of 2019, on which the RL Agent was tested.

The Annualized Standard Deviation has been reduced for all RL Agents, which signifies a reduction in Risk.

The Reward-to-risk Ratio remains poor when compared with the corresponding Benchmarks. However, for the Year 2019 a high number of Indices have given negative Annualized Returns and, in such cases, it isn't an effective tool to measure the RL Agent's Performance.

As discussed earlier, in this study we are building Index-specific RL Agents. A future scope of improvement could be to club the output of all these Index-specific RL Agents to build a single model that can be trained to trade optimally across different indices & build an optimal Portfolio.

Also, we are only able to build RL Agents trained on individual Index Data. But there could be some correlation between several Indices which would provide better signals for the Agent to take optimum trading decisions, & it could be a scope for future improvements.

To overcome the limitation of the Q-learning algorithm of having to discretize the State space & Action space, we can use the Deep Q-Learning algorithm.

Another improvement would be to get data at the Intra-day level, which could present the RL Agents with an opportunity to train on a much larger Dataset & also take Trading decisions at the Intraday level. This could bring down risk considerably, as there could be unfavorable Market swings at times and the RL Agents could avoid taking decisions based only on the previous trading day's data.
To improve the model performance, a scope for investigation could be the inclusion of some non-price indicators, such as the traded volumes or traded values for each financial asset.

B. Conclusion

The study was done on Indian Financial Markets with daily Open, High, Low & Close Data for NIFTY Equity Sectoral Indices, G-Sec Indices & GOLD. Based on this raw data, the 52-week Low High Range, N-day SMAs and N-month Momentums were calculated.

When these derived features were used stand-alone, they gave good results and also helped in choosing which two N-day SMAs and one N-month Momentum would be useful for the RL Agents along with the 52-week Low High Range feature.

Each of these shortlisted features was discretized into 4 discrete values based on the Quartiles they belong to, and the Action space was also discretized into 5 values to determine the level of Investment in incremental chunks of 25% of the Fund value. These discretized features along with the Action space were used to form the State space, which was used for the Q-learning algorithm.

Hyper-parameter tuning for Q-learning was then done with MEDIA for Equity Sectoral Indices, COMPGSEC for G-Sec Indices and also on GOLD, with 2018 as the Validation Data. We also applied transaction charges to keep the results realistic.

Individual Q-learning RL Agents were then trained for 1000 episodes each with the tuned Hyper-parameters, with Training Data up to 2018, and performance was evaluated using Test data of 2019.

The Performance of each Q-learning RL Agent was evaluated using the Annualized Return, Max Drawdown, Annualized Standard Deviation, Reward-to-risk ratio, and the number of Up Days and Down Days each Agent was able to capture.

We found that the RL Agents were able to reduce downside risk considerably when compared to their Benchmarks, though they have not been able to capture Market uptrends effectively, because of which the RL Agents' overall performance did not look very promising when compared to their corresponding Benchmarks.

However, there are several scopes of Improvement, as described in the Discussion Section, to take this Study forward, some of which include using the Deep Q-learning Network Algorithm and getting Intra-day data if possible, which can enhance the performance of the RL Agents significantly.

Overall the study has laid a good foundation to extend it for further research based on our work on Automated Portfolio Rebalancing and, with the ideas presented in the Discussion section, there are several scopes of improvement to carry forward further Research work on this Topic.

V. ACKNOWLEDGMENT

We would like to convey our sincere gratitude to our mentor Mr. Raamanathan Gururajan. Without his able guidance, mentorship and motivation this study would not have been possible. His deep understanding of the use case helped in charting the right approach and deploying the appropriate models. We would also like to thank the Great Learning management, especially Dr. D Narayana and Mr. Anwesh Reddy, for giving us the required guidance for structuring our conference paper.

VI. REFERENCES

[1] Y. Huang, "Machine learning for stock prediction based on fundamental analysis," Western University, 2019.
[2] J. Larsen, "Predicting Stock Prices Using Technical Analysis and Machine Learning," 2010.
[3] D. Shah, H. Isah, and F. Zulkernine, "Stock market analysis: A review and taxonomy of prediction techniques," Int. J. Fin. Stud., vol. 7, no. 2, p. 26, 2019.
[4] J. B. Chakole, M. S. Kolhe, G. D. Mahapurush, A. Yadav, and M. P. Kurhekar, "A Q-learning agent for automated trading in equity stock markets," Expert Syst. Appl., vol. 163, no. 113761, p. 113761, 2021.
[5] A. Singh, N. Krishnan, X. Zhang, and Z. Ren, "Reinforcement Learning for Portfolio Optimization," 2018, king-jim.com.
[6] J. Moody, M. Saffell, Y. Liao, and L. Wu, "Reinforcement learning for trading systems and portfolios: Immediate vs future rewards," in Decision Technologies for Computational Finance, Boston, MA: Springer US, 1998, pp. 129–140.
[7] Z. Zhang, S. Zohren, and S. Roberts, "Deep Reinforcement Learning for trading," arXiv [q-fin.CP], 2019.
[8] S. Chakraborty, "Capturing Financial markets to apply Deep Reinforcement Learning," arXiv [q-fin.CP], 2019.
[9] S. Guruprasad and H. Chandramouli, "Q-Learning based Stock Market Prediction for Indian Stock market," Dynamicpublisher.org. [Online]. Available: http://www.dynamicpublisher.org/gallery/ijsrr-d239.pdf. [Accessed: 06-Sep-2020].
[10] "Stock price prediction using Reinforcement Learning and feature extraction," Regular Issue, vol. 8, no. 6, pp. 3324–3327, 2020.
[11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2012.
[12] "Stochastic Oscillator [ChartSchool]," Stockcharts.com. [Online]. Available: https://school.stockcharts.com/doku.php?id=technical_indicators:stochastic_oscillator_fast_slow_and_full. [Accessed: 06-Sep-2020].
[13] Nasdaq.com. [Online]. Available: https://www.nasdaq.com/glossary/g/golden-cross. [Accessed: 06-Sep-2020].
[14] N. Jegadeesh and S. Titman, "Returns to buying winners and selling losers: Implications for stock market efficiency," J. Finance, vol. 48, no. 1, p. 65, 1993.
[15] C. Brooks, Introductory Econometrics for Finance, 2nd ed. Cambridge, England: Cambridge Univ. Press, 2012.
[16] Niftyindices.com. [Online]. Available: https://www.niftyindices.com/IndexConstituent/ind_niftybanklist.csv. [Accessed: 06-Sep-2020].