
Introduction to Reinforcement Learning

Paul Alexander Bilokon, PhD

Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB

2023.01.17
Introduction to Reinforcement Learning
A historical perspective

First stage of automation


Introduction to Reinforcement Learning
A historical perspective

First stage of automation: the Industrial Revolution

▶ This painting from the Everett Collection depicts Wentworth Works, file and steel manufacturers and exporters of iron in Sheffield, England, ca. 1860.
▶ According to the 15th edition of Encyclopædia Britannica, the Industrial Revolution, in modern history, is the process of change from an agrarian, handicraft economy to one dominated by industry and machine manufacture.
▶ It started around 1760 and until around 1830 was largely confined to Britain.
▶ The technological changes included:
  ▶ the use of new basic materials, chiefly iron and steel,
  ▶ the use of new energy sources, including both fuels and motive power, such as coal, the steam engine, electricity, petroleum, and the internal-combustion engine,
  ▶ the invention of new machines, such as the spinning jenny and the power loom, that permitted increased production with a smaller expenditure of human energy,
  ▶ a new organisation of work known as the factory system, which entailed increased division of labour and specialisation of function,
  ▶ important developments in transportation and communication, including the steam locomotive, steamship, automobile, airplane, telegraph, and radio, and
  ▶ the increasing application of science to industry.
▶ This was the first step towards automation.
Introduction to Reinforcement Learning
A historical perspective

The Great Exhibition of The Works of Industry of All Nations (i)


Introduction to Reinforcement Learning
A historical perspective

The Great Exhibition of The Works of Industry of All Nations (ii)

From https://www.intriguing-history.com/great-exhibition/:
▶ On 1st May 1851 over half a million people massed in Hyde Park in London to witness its opening.
▶ Prince Albert captured the mood of the time when the British considered themselves to be the workshop of the world.
▶ The exhibition was to be the biggest display of objects of industry from all over the world with over half of it given over to all that Britain manufactured. It was to be a showcase for a hundred thousand objects, of inventions, machines, and creative works.
▶ The works of industry of all nations was to be a combination of visual wonder, competition (between manufacturers with prizes awarded) and shopping.
▶ The main exhibition hall was a giant glass structure, with over a million square feet of glass. The man who designed it, Joseph Paxton, named it the Crystal Palace. In itself it was a wondrous thing to behold and covered nearly 20 acres, easily accommodating the huge elm trees that grew in the park.
Introduction to Reinforcement Learning
A historical perspective

Second stage of automation


Introduction to Reinforcement Learning
A historical perspective

Second stage of automation: the Digital Revolution

▶ According to Wikipedia, the Digital Revolution is the shift from mechanical and analogue electronic technology to digital electronics, which began anywhere from the late 1950s to the late 1970s with the adoption and proliferation of digital computers and digital record keeping, and which continues to the present day.
▶ The term also refers to the sweeping changes brought about by digital computing and communication technology during (and after) the latter half of the 20th century.
▶ The Digital Revolution marked the beginning of the Information Age—a historical period characterized by a rapid epochal shift from the traditional industry established by the Industrial Revolution to an economy primarily based upon information technology.
Introduction to Reinforcement Learning
A historical perspective

The Information Age

Figure: Rings of time: Information Age (Digital Revolution) from 1968 to 2017. Spruce tree. By Petar
Milošević.
Introduction to Reinforcement Learning
A historical perspective

Marvin Minsky on programming languages

From Marvin Minsky’s 1969 Turing Award lecture:


Computer languages of the future will be more concerned with goals and less with procedures specified by the programmer. [Min70]

Marvin Minsky
Introduction to Reinforcement Learning
A historical perspective

Alan Turing on reinforcement

A quote from Alan Turing’s 1948 paper:


When a configuration is reached for which the action is undetermined, a random choice for the missing data is made and the appropriate entry is made in the description, tentatively, and is applied. When a pain stimulus occurs all tentative entries are cancelled, and when a pleasure stimulus occurs they are all made permanent. [Tur04]
Alan Turing
Introduction to Reinforcement Learning
A historical perspective

A hedonistic learning system

...in 1979 we came to realize that

perhaps the simplest of the ideas, which had long been taken for granted, had received surprisingly little attention from a computational perspective. This was simply the idea of a learning system that wants something, that adapts its behaviour in order to maximize a special signal from its environment. This was the idea of a “hedonistic” learning system, or, as we would say now, the idea of reinforcement learning. [SB18]

Rich Sutton and Andrew Barto
Introduction to Reinforcement Learning
A different kind of learning

Branches of machine learning

From David Silver:


Introduction to Reinforcement Learning
A different kind of learning

Reinforcement learning is multidisciplinary

From David Silver:


Introduction to Reinforcement Learning
A different kind of learning

Reinforcement learning is not supervised machine learning

▶ Reinforcement learning differs from other types of machine learning in that the training information is used to evaluate the actions taken rather than to instruct as to what the correct actions should be.
▶ Instructive feedback, as in supervised machine learning, points out the correct action to take, independently of the action actually taken.
▶ Evaluative feedback, as in reinforcement learning, points out how good the action taken is, but not whether it is the best or the worst action possible.
▶ This creates the need for active exploration, a trial-and-error search for good behaviour (a minimal example is sketched below).
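To make evaluative feedback and exploration concrete, the sketch below simulates a three-armed bandit with an ε-greedy rule: the learner only observes the reward of the arm it actually pulls, so it has to try arms out to discover which one is best. The arm success probabilities and the value of ε are illustrative assumptions, not taken from these slides.

import random

TRUE_MEANS = [0.2, 0.5, 0.8]   # unknown to the learner (illustrative values)
EPSILON = 0.1                  # exploration rate (assumed for the example)

def pull(arm: int) -> float:
    """Evaluative feedback: we only observe the reward of the arm we chose."""
    return 1.0 if random.random() < TRUE_MEANS[arm] else 0.0

q = [0.0, 0.0, 0.0]            # estimated value of each arm
n = [0, 0, 0]                  # number of times each arm has been pulled

for _ in range(10_000):
    # epsilon-greedy: mostly exploit the current best estimate, occasionally explore
    if random.random() < EPSILON:
        arm = random.randrange(3)
    else:
        arm = max(range(3), key=lambda a: q[a])
    reward = pull(arm)
    n[arm] += 1
    q[arm] += (reward - q[arm]) / n[arm]   # incremental sample-average update

print([round(v, 2) for v in q])            # estimates approach [0.2, 0.5, 0.8]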
Introduction to Reinforcement Learning
A different kind of learning

Reinforcement learning is not unsupervised machine learning

▶ One may be tempted to think of reinforcement learning as a kind of unsupervised machine learning, because it does not rely on examples of correct behaviour.
▶ However, reinforcement learning is concerned with maximising a reward signal rather than trying to find hidden structure, which distinguishes it from unsupervised machine learning.
Introduction to Reinforcement Learning
Elements of reinforcement learning

Agent

[Diagram: the agent-environment loop. At each step the agent sends an action a_t to the environment; the environment returns a reward r_t, and the state changes to s_{t+1}, which the agent observes.]

The agent is the entity that takes actions.


Introduction to Reinforcement Learning
Elements of reinforcement learning

Environment


The environment is the world in which the agent exists and operates.
Introduction to Reinforcement Learning
Elements of reinforcement learning

Action


The action is a move made by the agent in the environment.


Introduction to Reinforcement Learning
Elements of reinforcement learning

Observation


The observation provides the agent with information about the (possibly changed)
environment after taking an action.
Introduction to Reinforcement Learning
Elements of reinforcement learning

State


The state is the situation that the agent perceives.


Introduction to Reinforcement Learning
Elements of reinforcement learning

Reward


The reward is the feedback that measures the success or failure of the agent’s action. It
defines the goal of a reinforcement learning problem.
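With these elements defined, the agent-environment interaction can be written as a simple loop. Everything below (the toy random-walk environment, its reward scheme, and the random agent) is an illustrative placeholder, sketched in a Gym-like reset/step style rather than taken from the slides.

import random

class ToyEnvironment:
    """Illustrative stand-in environment: a random walk on the states 0..10."""

    def reset(self) -> int:
        self.state = 5
        return self.state

    def step(self, action: int):
        """Apply action (-1 or +1); return the new state, the reward, and a terminal flag."""
        self.state = min(10, max(0, self.state + action))
        done = self.state in (0, 10)
        reward = 1.0 if self.state == 10 else (-1.0 if self.state == 0 else 0.0)
        return self.state, reward, done

def agent(observation: int) -> int:
    """The agent maps what it observes to an action; here it simply acts at random."""
    return random.choice([-1, +1])

env = ToyEnvironment()
s = env.reset()
done = False
while not done:
    a = agent(s)                # the agent chooses action a_t
    s, r, done = env.step(a)    # the environment returns reward r_t and state s_{t+1}
    print(f"action={a:+d}  reward={r:+.1f}  next state={s}")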
Introduction to Reinforcement Learning
Elements of reinforcement learning

Total reward


The total (future) reward is given by $G_t = \sum_{i=t+1}^{\infty} r_i$. This sum may or may not converge.
Introduction to Reinforcement Learning
Elements of reinforcement learning

Discounted total reward


The discounted total reward is given by $G_t = \sum_{i=t+1}^{\infty} \gamma^{i-t-1} r_i$, where $\gamma \in [0, 1]$ is the discount rate.
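As a quick numerical illustration (the reward sequence and the value of γ below are made up for the example), the discounted total reward can be computed by folding the rewards from the end of the sequence backwards:

def discounted_return(rewards, gamma: float) -> float:
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...,
    computed backwards via G_k = r_k + gamma * G_{k+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

future_rewards = [0.0, 0.0, 1.0, 5.0]          # r_{t+1}, r_{t+2}, ... (illustrative)
print(discounted_return(future_rewards, 0.9))  # 0.81 * 1.0 + 0.729 * 5.0 = 4.455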
Introduction to Reinforcement Learning
Elements of reinforcement learning

Reward hypothesis

▶ Reinforcement learning is based on the reward hypothesis:

  All goals can be described by the maximisation of expected total reward.
Introduction to Reinforcement Learning
Elements of reinforcement learning

History

The history consists of the sequence of all observations, actions, and rewards (i.e. all observable variables) up to the current time:

$H_t = s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, a_3, r_3, \ldots, s_t$.
Introduction to Reinforcement Learning
Elements of reinforcement learning

Environment state

▶ The agent state, $s_t$, may or may not match the environment state, $s_t^e$.
▶ Consider, for example, a poker game. The agent (a poker player) knows only his own hand, whereas the environment state includes the hand of every player.
▶ In chess, on the other hand, $s_t = s_t^e$ — it is a perfect-information game.
Introduction to Reinforcement Learning
Elements of reinforcement learning

Markov state

▶ A state is said to be Markov iff

  $\mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_0, \ldots, s_t]$,

  in other words, the future is independent of the past given the present.
Introduction to Reinforcement Learning
Elements of reinforcement learning

Policy

▶ A policy is the agent’s behaviour.
▶ It is a map from state to action.
▶ Deterministic policy: $a = \pi(s)$.
▶ Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$ (both kinds are sketched in code below).
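Both kinds of policy are easy to write down directly in code. The state encoding, action set, and probabilities below are illustrative assumptions for the sketch.

import random

ACTIONS = ["left", "stay", "right"]

def deterministic_policy(state: int) -> str:
    """a = pi(s): the same state always maps to the same action."""
    return "right" if state < 5 else "left"

def stochastic_policy(state: int) -> str:
    """pi(a | s): a probability distribution over actions, given the state."""
    probabilities = [0.2, 0.1, 0.7] if state < 5 else [0.7, 0.1, 0.2]
    return random.choices(ACTIONS, weights=probabilities, k=1)[0]

print(deterministic_policy(3))                      # always "right"
print([stochastic_policy(3) for _ in range(5)])     # mostly "right", occasionally not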
Introduction to Reinforcement Learning
Elements of reinforcement learning

Value function

▶ A value function is a prediction of future reward.
▶ It is used to evaluate the goodness/badness of states,
▶ and therefore to select between actions, e.g.

  $v_\pi(s) = \mathbb{E}_\pi[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \mid S_t = s] = \mathbb{E}_\pi[G_t \mid S_t = s]$.

▶ Whereas the reward signal indicates what is good in an immediate sense, a value function specifies what is good in the long run.
▶ Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state (a Monte Carlo estimate of this quantity is sketched below).
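One way to see this definition in action is to estimate $v_\pi(s)$ by averaging sampled returns (first-visit Monte Carlo). The random-walk environment, the uniformly random policy, and the value of γ below are placeholders assumed for the sketch, not part of the slides.

import random

GAMMA = 0.9          # discount rate (assumed)
N_EPISODES = 10_000

def run_episode(start: int):
    """Follow a uniformly random policy on states 0..6; only reaching state 6 pays +1."""
    s, trajectory = start, []
    while s not in (0, 6):
        a = random.choice([-1, +1])        # stochastic policy pi(a | s)
        s_next = s + a
        r = 1.0 if s_next == 6 else 0.0
        trajectory.append((s, r))
        s = s_next
    return trajectory

returns = {s: [] for s in range(1, 6)}
for _ in range(N_EPISODES):
    episode = run_episode(start=3)
    g = 0.0
    first_visit_return = {}
    for s, r in reversed(episode):         # accumulate the discounted return backwards
        g = r + GAMMA * g
        first_visit_return[s] = g          # overwriting leaves the return of the first visit
    for s, g_s in first_visit_return.items():
        returns[s].append(g_s)

v = {s: round(sum(gs) / len(gs), 3) for s, gs in returns.items() if gs}
print(v)                                   # the estimated values increase towards state 6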
Introduction to Reinforcement Learning
Elements of reinforcement learning

Model

▶ A model predicts what the environment will do next:
▶ $\mathcal{P}$ predicts the next state,
▶ $\mathcal{R}$ predicts the next (immediate) reward.
▶ A minimal tabular model estimated from experience is sketched below.
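A minimal tabular model can be estimated directly from experience: $\mathcal{P}$ as empirical transition frequencies and $\mathcal{R}$ as average observed rewards. The experience tuples below are made-up illustrative data, not an example from the slides.

from collections import defaultdict

# Experience tuples (s, a, r, s'), illustrative data only.
experience = [
    (0, "right", 0.0, 1),
    (1, "right", 0.0, 2),
    (2, "right", 1.0, 2),
    (0, "right", 0.0, 1),
    (1, "left",  0.0, 0),
]

transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
reward_sums = defaultdict(float)                           # (s, a) -> summed reward
visit_counts = defaultdict(int)                            # (s, a) -> number of visits

for s, a, r, s_next in experience:
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a)] += r
    visit_counts[(s, a)] += 1

def P(s, a, s_next) -> float:
    """Estimated probability of landing in s' after taking action a in state s."""
    total = visit_counts[(s, a)]
    return transition_counts[(s, a)][s_next] / total if total else 0.0

def R(s, a) -> float:
    """Estimated immediate reward for taking action a in state s."""
    total = visit_counts[(s, a)]
    return reward_sums[(s, a)] / total if total else 0.0

print(P(0, "right", 1))   # 1.0: both observed transitions from (0, right) went to state 1
print(R(2, "right"))      # 1.0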
Introduction to Reinforcement Learning
Examples of reinforcement learning

Phil’s breakfast

From [SB18], inspired by [Agr88]:


Phil prepares his breakfast. Closely examined, even this apparently mundane ac-
tivity reveals a complex web of conditional behaviour and interlocking goal-subgoal
relationships: walking to the cupboard, opening it, selecting a cereal box, then
reaching for, grasping, and retrieving the box. Other complex, tuned, interactive
sequences of behaviour are required to obtain a bowl, spoon, and milk carton.
Each step involves a series of eye movements to obtain information and to guide
reaching and locomotion. Rapid judgments are continually made about how to
carry the objects or whether it is better to ferry some of them to the dining table
before obtaining others. Each step is guided by goals, such as grasping a spoon
or getting to the refrigerator, and is in service of other goals, such as having the
spoon to eat with once the cereal is prepared and ultimately obtaining nourish-
ment. Whether he is aware of it or not, Phil is accessing information about the
state of his body that determines his nutritional needs, level of hunger, and food
preferences.
Introduction to Reinforcement Learning
Examples of reinforcement learning

A prop trader

A proprietary trader [Car15, Cha08, Cha13, Cha16, Dur13, Tul15] observes the dynamics
of market securities and watches economic releases and news unfold on his Bloomberg
terminal. Based on this information, considering both the tactical and strategic information,
he places buy and sell orders, stop losses and stop gains. The trader’s goal is to have a
strong PnL.
Introduction to Reinforcement Learning
Examples of reinforcement learning

An options market maker

A vanilla options market maker [Che98, Cla10, JFB15, Tal96, Wys17] produces two-sided
quotes in FX options. She hedges her options position with spot. The market moves all the
time, so her risk (delta, gamma, vega, etc.) keeps changing. The market maker’s goal is to
hedge the position as safely and as cheaply as possible.
Introduction to Reinforcement Learning
Origins of reinforcement learning

Donald Michie on trial and error (i)

From the point of view of one of the players, any game, such as Tic-Tac-Toe, represents a sequential decision process. Sooner or later the sequence of choices terminates in an outcome, to which a value is attached, according to whether the game has been won, drawn or lost. If the player is able to learn from experience, the choices which have led up to a given outcome receive reinforcements in the light of the outcome value. In general, positive outcomes are fed back in the form of positive reinforcement, that is to say, the choices belonging to the successful sequence become more probable on later recurrence of the same situations. Similarly, negative outcomes are fed back as negative reinforcements. [Mic63]
Donald Michie FRSE FBCS
Introduction to Reinforcement Learning
Origins of reinforcement learning

Donald Michie on trial and error (ii)

This picture of trial-and-error learning uses the concepts and terminology of the experimental psychologist. Observations on animals agree with common sense in suggesting that the strength of reinforcement becomes less as we proceed backwards along the loop from the terminus towards the origin. The more recent the choice in the sequence, the greater its probable share of responsibility for the outcome. This provides an adequate conceptual basis for a trial-and-error learning device, provided that the total number of choice-points which can be encountered is small enough for them to be individually listed. [Mic63]

Donald Michie FRSE FBCS


Introduction to Reinforcement Learning
Successes of reinforcement learning

Checkers (i)

The game of checkers [Sam59, Sam67], following some ideas from [Sha50].
Introduction to Reinforcement Learning
Successes of reinforcement learning

Checkers (ii)

In Some Studies in Machine Learning Using the Game of Checkers [Sam59]:


Two machine-learning procedures have been investigated in some detail using the
game of checkers. Enough work has been done to verify the fact that a computer
can be programmed so that it will learn to play a better game of checkers than can
be played by the person who wrote the program. Furthermore, it can learn to do
this in a remarkably short period of time (8 or 10 hours of machine-playing time)
when given only the rules of the game, a sense of direction, and a redundant and
incomplete list of parameters which are thought to have something to do with the
game, but whose correct signs and relative weights are unknown and unspecified.
The principles of machine learning verified by these experiments are, of course,
applicable to many other situations.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Checkers (iii)

In Some Studies in Machine Learning Using the Game of Checkers. II — Recent Progress [Sam67]:
A new signature table technique is described together with an improved book
learning procedure which is thought to be much superior to the linear polynomial
method described earlier. Full use is made of the so-called “alpha-beta” pruning
and several forms of forward pruning to restrict the spread of the move tree and to
permit the program to look ahead to a much greater depth than it otherwise could
do. While still unable to outplay checker masters, the program’s playing ability has
been greatly improved.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Backgammon (i)

The game of backgammon [Tes92, Tes94, Tes95, Tes02].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Backgammon (ii)

In Practical Issues in Temporal Difference Learning [Tes92]:


This paper examines whether temporal difference methods for training connec-
tionist networks, such as Sutton’s TD (λ) algorithm, can be successfully applied to
complex real-world problems. A number of important practical issues are identified
and discussed from a general theoretical perspective. These practical issues are
then examined in the context of a case study in which TD (λ) is applied to learning
the game of backgammon from the outcome of self-play. This is apparently the
first application of this algorithm to a complex nontrivial task. It is found that, with
zero knowledge built in, the network is able to learn from scratch to play the entire
game at a fairly strong intermediate level of performance, which is clearly better
than conventional commercial programs, and which in fact surpasses compara-
ble networks trained on a massive human expert data set. The hidden units in
these networks have apparently discovered useful features, a longstanding goal
of computer games research. Furthermore, when a set of hand-crafted features is
added to the input representation, the resulting networks reach a near-expert level
of performance, and have achieved good results against world-class human play.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Backgammon (iii)

In TD-Gammon, A Self-Teaching Backgammon Program, Achieves Master-Level Play [Tes94]:
TD-Gammon is a neural network that is able to teach itself to play backgam-
mon solely by playing against itself and learning from the results based on the
TD (λ) reinforcement learning algorithm [Sut88]. Despite starting from random
initial weights (and hence random initial strategy), TD-Gammon achieves a sur-
prisingly strong level of play. With zero knowledge built in at the start of learning
(i.e. given only a “raw” description of the board state), the network learns to play
at a strong intermediate level. Furthermore, when a set of hand-crafted features
is added to the network’s input representation, the result is a truly staggering level
of performance: the latest version of TD-Gammon is now estimated to play at a
strong master level that is extremely close to the world’s best human players.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Backgammon (iv)

In Temporal Difference Learning with TD-Gammon [Tes95]:


TD-Gammon is a neural network that is able to teach itself to play backgam-
mon solely by playing against itself and learning from the results based on the
TD (λ) reinforcement learning algorithm [Sut88]. Despite starting from random
initial weights (and hence random initial strategy), TD-Gammon achieves a sur-
prisingly strong level of play. With zero knowledge built in at the start of learning
(i.e. given only a “raw” description of the board state), the network learns to play
at a strong intermediate level. Furthermore, when a set of hand-crafted features
is added to the network’s input representation, the result is a truly staggering level
of performance: the latest version of TD-Gammon is now estimated to play at a
strong master level that is extremely close to the world’s best human players.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Go (i)

The game of go [SHM+ 16, SSS+ 17].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Go (ii)

In Mastering the game of Go with deep neural networks and tree search [SHM+ 16]:
The game of Go has long been viewed as the most challenging of classic games
for artificial intelligence owing to its enormous search space and the difficulty of
evaluating board positions and moves. Here we introduce a new approach to
computer Go that uses ‘value networks’ to evaluate board positions and ‘policy
networks’ to select moves. These deep neural networks are trained by a novel
combination of supervised learning from human expert games, and reinforcement
learning from games of self-play. Without any lookahead search, the neural net-
works play Go at the level of state-of-the-art Monte Carlo tree search programs that
simulate thousands of random games of self-play. We also introduce a new search
algorithm that combines Monte Carlo simulation with value and policy networks.
Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate
against other Go programs, and defeated the human European Go champion by 5
games to 0. This is the first time that a computer program has defeated a human
professional player in the full-sized game of Go, a feat previously thought to be at
least a decade away.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Go (iii)

In Mastering the Game of Go without Human Knowledge [SSS+ 17]:


A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa,
superhuman proficiency in challenging domains. Recently, AlphaGo became the
first program to defeat a world champion in the game of Go. The tree search
in AlphaGo evaluated positions and selected moves using deep neural networks.
These neural networks were trained by supervised learning from human expert
moves, and by reinforcement learning from self-play. Here, we introduce an al-
gorithm based solely on reinforcement learning, without human data, guidance,
or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a
neural network is trained to predict AlphaGo’s own move selections and also the
winner of AlphaGo’s games. This neural network improves the strength of tree
search, resulting in higher quality move selection and stronger self-play in the next
iteration. Starting tabula rasa, our new program AlphaGo Zero achieved super-
human performance, winning 100-0 against the previously published, champion-
defeating AlphaGo.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Watson’s Daily-Double wagering (i)

The game of Jeopardy! [TGL+ 12, TGL+ 13].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Watson’s Daily-Double wagering (ii)

In Simulation, learning, and optimization techniques in Watson’s game strategies [TGL+ 12]:
The game of Jeopardy! features four types of strategic decision-making: 1) Daily
Double wagering; 2) Final Jeopardy! wagering; 3) selecting the next square when
in control of the board; and 4) deciding whether to attempt to answer, i.e., “buzz in”.
Strategies that properly account for the game state and future event probabilities
can yield a huge boost in overall winning chances, when compared with simple
“rule-of-thumb” strategies. In this paper, we present an approach to developing
and testing components to make said strategy decisions, founded upon develop-
ment of reasonably faithful simulation models of the players and the Jeopardy!
game environment. We describe machine learning and Monte Carlo methods
used in simulations to optimize the respective strategy algorithms. Application of
these methods yielded superhuman game strategies for IBM Watson that signifi-
cantly enhanced its overall competitive record.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Watson’s Daily-Double wagering (iii)


In Analysis of Watson’s Strategies for Playing Jeopardy! [TGL+ 13]:
Major advances in Question Answering technology were needed for IBM Watson
to play Jeopardy! at championship level — the show requires rapid-fire answers to
challenging natural language questions, broad general knowledge, high precision,
and accurate confidence estimates. In addition, Jeopardy! features four types of
decision making carrying great strategic importance: (1) Daily Double wagering;
(2) Final Jeopardy wagering; (3) selecting the next square when in control of the
board; (4) deciding whether to attempt to answer, i.e. “buzz in.” Using sophisti-
cated strategies for these decisions, that properly account for the game state and
future event probabilities, can significantly boost a player’s overall chances to win,
when compared with simple “rule of thumb” strategies.
This article presents our approach to developing Watson’s game-playing strate-
gies comprising development of a faithful simulation model, and then using learn-
ing and Monte-Carlo methods within the simulator to optimise Watson’s strategic
decision-making. After giving a detailed description of each of our game-strategy
algorithms, we then focus in particular on validating the accuracy of the simula-
tor’s predictions, and documenting performance improvements using our methods.
Quantitative performance benefits are shown with respect to both simple heuris-
tic strategies, and actual human contestant performance in historical episodes.
We further extend our analysis of human play to derive a number of valuable and
counterintuitive examples illustrating how human contestants may improve their
performance on the show.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Atari games (i)

Atari 2600 games, such as Breakout [MKS+ 13, MKS+ 15].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Atari games (ii)

In Playing Atari with Deep Reinforcement Learning [MKS+ 13]:


We present the first deep learning model to successfully learn control policies
directly from high-dimensional sensory input using reinforcement learning. The
model is a convolutional neural network, trained with a variant of Q-learning,
whose input is raw pixels and whose output is a value function estimating
future rewards. We apply our method to seven Atari 2600 games from the Arcade
Learning Environment, with no adjustment of the architecture or learning algo-
rithm. We find that it outperforms all previous approaches on six of the games and
surpasses a human expert on three of them.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Atari games (iii)


In Human-level control through deep reinforcement learning [MKS+ 15]:
The theory of reinforcement learning provides a normative account deeply rooted in psycho-
logical and neuroscientific perspectives on animal behaviour, of how agents may optimize
their control of an environment. To use reinforcement learning successfully in situations ap-
proaching real-world complexity, however, agents are confronted with a difficult task: they must
derive efficient representations of the environment from high-dimensional sensory inputs, and
use these to generalise past experience to new situations. Remarkably, humans and other an-
imals seem to solve this problem through a harmonious combination of reinforcement learning
and hierarchical sensory processing systems, the former evidenced by a wealth of neural data
revealing notable parallels between the phasic signals emitted by dopaminergic neurons and
temporal difference reinforcement learning algorithms. While reinforcement learning agents
have achieved some successes in a variety of domains, their applicability has previously been
limited to domains in which useful features can be handcrafted, or to domains with fully ob-
served, low-dimensional state spaces. Here we use recent advances in training deep neural
networks to develop a novel artificial agent, termed a deep Q-network, that can learn suc-
cessful policies directly from high-dimensional sensory inputs using end-to-end reinforcement
learning. We tested this agent on the challenging domain of classic Atari 2600 games. We
demonstrate that the deep Q-network agent, receiving only the pixels and the game score as
inputs, was able to surpass the performance of all previous algorithms and achieve a level
comparable to that of a professional human games tester across a set of 49 games, using
the same algorithm, network architecture and hyperparameters. This work bridges the divide
between high-dimensional sensory inputs and actions, resulting in the first artificial agent that
is capable of learning to excel at a diverse array of challenging tasks.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Personalised web services (i)

Personalised web services [TTG15, Tho15].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Personalised web services (ii)

In [TTG15]:
In this paper, we propose a framework for using reinforcement learning (RL) algo-
rithms to learn good policies for personalised ad recommendation (PAR) sys-
tems. The RL algorithms take into account the long-term effect of an action, and
thus, could be more suitable than myopic techniques like supervised learning and
contextual bandit, for modern PAR systems in which the number of returning visi-
tors is rapidly growing. However, while myopic techniques have been well-studied
in PAR systems, the RL approach is still in its infancy, mainly due to two fundamen-
tal challenges: how to compute a good RL strategy and how to evaluate a solution
using historical data to ensure its “safety” before deployment. In this paper, we pro-
pose to use a family of off-policy evaluation techniques with statistical guarantees
to tackle both these challenges. We apply these methods to a real PAR problem,
both for evaluating the final performance and for optimising the parameters of the
RL algorithm. Our results show that a RL algorithm equipped with these off-policy
evaluation techniques outperforms the myopic approaches. Our results also give
fundamental insights on the difference between the click through rate (CTR) and
life-time value (LTV) metrics for evaluating the performance of a PAR algorithm.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Cooling optimisation for data centres (i)

Cooling optimisation for data centres [LWTG19].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Cooling optimisation for data centres (ii)


In Transforming Cooling Optimization for Green Data Centre via Deep Reinforcement
Learning [LWTG19]:
Cooling system plays a critical role in a modern data centre (DC). Developing an optimal
control policy for DC cooling system is a challenging task. The prevailing approaches often
rely on approximating system models that are built upon the knowledge of mechanical cooling,
electrical and thermal management, which is difficult to design and may lead to suboptimal or
unstable performances. In this paper, we propose utilising the large amount of monitoring
data in DC to optimise the control policy. To do so, we cast the cooling control policy design
into an energy cost minimisation problem with temperature constraints, and tap it into the
emerging deep reinforcement learning (DRL) framework. Specifically, we propose an end-
to-end cooling control algorithm (CCA) that is based on the actor-critic framework and an
off-policy offline version of the deep deterministic policy gradient (DDPG) algorithm. In the
proposed CCA, an evaluation network is trained to predict an energy cost counter penalised
by the cooling status of the DC room, and a policy network is trained to predict optimised
control settings when given the current load and weather information. The proposed algorithm
is evaluated on the EnergyPlus simulation platform and on a real data trace collected from the
National Super Computing Centre (NSCC) of Singapore. Our results show that the proposed
CCA can achieve about 11% cooling cost saving on the simulation platform compared with a
manually configured baseline control algorithm. In the trace-based study, we propose a de-
underestimation validation mechanism as we cannot directly test the algorithm on a real DC.
Even though with DUE the results are conservative, we can still achieve about 15% cooling
energy saving on the NSCC data trace if we set the inlet temperature threshold at 26.6 degree
Celsius.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Optimising memory control (i)

Optimising memory control [İMMC08, Mİ09].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Optimising memory control (ii)

In Self-Optimizing Memory Controllers: A Reinforcement Learning Approach [İMMC08]:


Efficiently utilising off-chip DRAM bandwidth is a critical issue in designing cost-
effective, high-performance chip multiprocessors (CMPs). Conventional memory
controllers deliver relatively low performance in part because they often employ
fixed, rigid access scheduling policies designed for average-case application be-
haviour. As a result, they cannot learn and optimise the long-term performance
impact of their scheduling decisions, and cannot adapt their scheduling policies to
dynamic workload behaviour.
We propose a new, self-optimising memory controller design that operates using
the principles of reinforcement learning (RL) to overcome these limitations. Our
RL-based memory controller observes the system state and estimates the long-
term performance impact of each action it can take. In this way, the controller
learns to optimise its scheduling policy on the fly to maximise long-term perfor-
mance. Our results show that an RL-based memory controller improves the per-
formance of a set of parallel applications run on a 4-core CMP by 19% on average
(up to 33%), and it improves DRAM bandwidth utilisation by 22% compared to a
state-of-the-art controller.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Optimising memory control (iii)

In Dynamic Multicore Resource Management: A Machine Learning Approach [Mİ09]:


A machine learning approach to multicore resource management produces self-
optimising on-chip hardware agents capable of learning, planning, and continu-
ously adapting to changing workload demands. This results in more efficient and
flexible management of critical hardware resources at runtime.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Packet routing in dynamically changing networks (i)

Packet routing in dynamically changing networks [BL93].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Packet routing in dynamically changing networks (ii)

In Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach [BL93]:
This paper describes the Q-routing algorithm for packet routing, in which a rein-
forcement learning module is embedded into each node of a switching network.
Only local communication is used by each node to keep accurate statistics on
which routing decisions lead to minimal delivery times. In simple experiments in-
volving a 36-node, irregularly connected network, Q-routing proves superior to a
nonadaptive algorithm based on precomputed shortest paths and is able to route
efficiently even when critical aspects of the simulation, such as the network load,
are allowed to vary dynamically. The paper concludes with a discussion of the
tradeoff between discovering shortcuts and maintaining stable policies.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Mobile robots (i)

Mobile robots [SK02].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Mobile robots (ii)

In Effective Reinforcement Learning for Mobile Robots [SK02]:


Programming mobile robots can be a long, time-consuming process. Specifying
the low-level mapping from sensors to actuators is prone to programmer miscon-
ceptions, and debugging such a mapping can be tedious. The idea of having a
robot learn how to accomplish a task, rather than being told explicitly is an appeal-
ing one. It seems easier and much more intuitive for the programmer to specify
what the robot should be doing, and to let it learn the fine details of how to do it. In
this paper, we introduce a framework for reinforcement learning on mobile robots
and describe our experiments using it to learn simple tasks.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Robocup soccer (i)

Robocup soccer [SSK05].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Robocup soccer (ii)

In Reinforcement learning for robocup soccer keepaway [SSK05]:


RoboCup simulated soccer presents many challenges to reinforcement learning
methods, including a large state space, hidden and uncertain state, multiple in-
dependent agents learning simultaneously, and long and variable delays in the
effects of actions. We describe our application of episodic SMDP Sarsa(λ) with
linear tile-coding function approximation and variable λ to learning higher-level de-
cisions in a keepaway subtask of RoboCup soccer. In keepaway, one team “the
keepers”, tries to keep control of the ball for as long as possible despite the efforts
of “the takers”. The keepers learn individually when to hold the ball and when to
pass to a teammate. Our agents learned policies that significantly outperform a
range of benchmark policies. We demonstrate the generality of our approach by
applying it to a number of task variations including different field sizes and different
numbers of players on each team.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Self-driving cars (i)

Autonomous driving [SSSS16].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Self-driving cars (ii)


In Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving [SSSS16]:
Autonomous driving is a multi-agent setting where the host vehicle must apply sophisticated
negotiation skills with other road users when overtaking, giving way, merging, taking left and
right turns and while pushing ahead in unstructured urban roadways. Since there are many
possible scenarios, manually tackling all possible cases will likely yield a too simplistic policy.
Moreover, one must balance between unexpected behaviour of other drivers/pedestrians and
at the same time not to be too defensive so that normal traffic flow is maintained.
In this paper we apply deep reinforcement learning to the problem of forming long term driving
strategies. We note that there are two major challenges that make autonomous driving differ-
ent from other robotic tasks. First, is the necessity for ensuring functional safety — something
that machine learning has difficulty with given that performance is optimised at the level of an
expectation over many instances. Second, the Markov Decision Process model often used
in robotics is problematic in our case because of unpredictable behaviour of other agents in
this multi-agent scenario. We make three contributions in our work. First, we show how policy
gradient iterations can be used, and the variance of the gradient estimation using stochastic
gradient ascent can be minimised, without Markovian assumptions. Second, we decompose
the problem into a composition of a Policy for Desires (which is to be learned) and trajectory
planning with hard constraints (which is not learned). The goal of Desires is to enable com-
fort of driving, while hard constraints guarantees the safety of driving. Third, we introduce
a hierarchical temporal abstraction we call an “Option Graph” with a gating mechanism that
significantly reduces the effective horizon and thereby reducing the variance of the gradient
estimation even further. The Option Graph plays a similar role to “structured prediction” in
supervised learning, thereby reducing sample complexity, while also playing a similar role to
LSTM gating mechanisms used in supervised deep networks.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Thermal soaring (i)

Thermal soaring [RCSV16, WDV14].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Thermal soaring (ii)

In Learning to soar in turbulent environments [RCSV16]:


Birds and gliders exploit warm, rising atmospheric currents (thermals) to reach
heights comparable to low-lying clouds with a reduced expenditure of energy. This
strategy of flight (thermal soaring) is frequently used by migratory birds. Soar-
ing provides a remarkable instance of complex decision making in biology and
requires a long-term strategy to effectively use the ascending thermals. Further-
more, the problem is technologically relevant to extend the flying range of au-
tonomous gliders. Thermal soaring is commonly observed in the atmospheric
convective boundary layer on warm, sunny days. The formation of thermals un-
avoidably generates strong turbulent fluctuations, which constitute an essential
element of soaring. Here, we approach soaring flight as a problem of learning
to navigate complex, highly fluctuating turbulent environments. We simulate the
atmospheric boundary layer by numerical models of turbulent convective flow and
combine them with model-free, experience-based, reinforcement learning algo-
rithms to train the gliders. For the learned policies in the regimes of moderate and
strong turbulence levels, the glider adopts an increasingly conservative policy as
turbulence levels increase, quantifying the degree of risk affordable in turbulent en-
vironments. Reinforcement learning uncovers those sensorimotor cues that permit
effective control over soaring in turbulent environments.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Autonomous helicopter flight (i)

Autonomous helicopter flight [NCD+ 06].


Introduction to Reinforcement Learning
Successes of reinforcement learning

Autonomous helicopter flight (ii)

In Autonomous helicopter flight via reinforcement learning [NKJS03]:


Autonomous helicopter flight represents a challenging control problem, with com-
plex, noisy, dynamics. In this paper, we describe a successful application of rein-
forcement learning to autonomous helicopter flight. We first fit a stochastic nonlin-
ear model of the helicopter dynamics. We then use the model to learn to hover in
place, and to fly a number of maneuvers taken from an RC helicopter competition.
Introduction to Reinforcement Learning
Successes of reinforcement learning

Autonomous helicopter flight (iii)

In Autonomous inverted helicopter flight via reinforcement learning [NCD+ 06]:


Helicopters have highly stochastic, nonlinear, dynamics, and autonomous heli-
copter flight is widely regarded to be a challenging control problem. As helicopters
are highly unstable at low speeds, it is particularly difficult to design controllers for
low speed aerobatic maneuvers. In this paper, we describe a successful applica-
tion of reinforcement learning to designing a controller for sustained inverted flight
on an autonomous helicopter. Using data collected from the helicopter in flight,
we began by learning a stochastic, nonlinear model of the helicopter’s dynam-
ics. Then, a reinforcement learning algorithm was applied to automatically learn
a controller for autonomous inverted hovering. Finally, the resulting controller was
successfully tested on our autonomous helicopter platform.
Introduction to Reinforcement Learning
Financial applications of reinforcement learning

Financial applications of reinforcement learning


Introduction to Reinforcement Learning
Financial applications of reinforcement learning

Reinforcement Learning in finance—Kolm/Ritter

In Modern Perspectives on Reinforcement Learning in Finance [KR19b]:


We give an overview and outlook of the field of reinforcement learning as it ap-
plies to solving financial applications of intertemporal choice. In finance, common
problems of this kind include pricing and hedging of contingent claims, investment
and portfolio allocation, buying and selling a portfolio of securities subject to trans-
action costs, market making, asset liability management and optimization of tax
consequences, to name a few. Reinforcement learning allows us to solve these
dynamic optimization problems in an almost model-free way, relaxing the assump-
tions often needed for classical approaches.
A main contribution of this article is the elucidation of the link between these dy-
namic optimization problems and reinforcement learning, concretely addressing
how to formulate expected intertemporal utility maximization problems using mod-
ern machine learning techniques.
Introduction to Reinforcement Learning
Financial applications of reinforcement learning

RL pricing and hedging—Halperin (i)


In QLBS: Q-Learner in the Black–Scholes (–Merton) Worlds [Hal17]:
This paper presents a discrete-time option pricing model that is rooted in Rein-
forcement Learning (RL), and more specifically in the famous Q-Learning method
of RL. We construct a risk-adjusted Markov Decision Process for a discrete-time
version of the classical Black-Scholes-Merton (BSM) model, where the option
price is an optimal Q-function, while the optimal hedge is a second argument of
this optimal Q-function, so that both the price and hedge are parts of the same
formula. Pricing is done by learning to dynamically optimize risk-adjusted returns
for an option replicating portfolio, as in the Markowitz portfolio theory. Using Q-
Learning and related methods, once created in a parametric setting, the model is
able to go model-free and learn to price and hedge an option directly from data,
and without an explicit model of the world. This suggests that RL may provide
efficient data-driven and model-free methods for optimal pricing and hedging of
options, once we depart from the academic continuous-time limit, and vice versa,
option pricing methods developed in Mathematical Finance may be viewed as spe-
cial cases of model-based Reinforcement Learning. Further, due to simplicity and
tractability of our model which only needs basic linear algebra (plus Monte Carlo
simulation, if we work with synthetic data), and its close relation to the original
BSM model, we suggest that our model could be used for benchmarking of differ-
ent RL algorithms for financial trading applications.
Introduction to Reinforcement Learning
Financial applications of reinforcement learning

RL pricing and hedging—Halperin (ii)

In The QLBS Q-Learner Goes NuQLear: Fitted Q Iteration, Inverse RL, and Option
Portfolios [Hal18]:
The QLBS model is a discrete-time option hedging and pricing model that is based
on Dynamic Programming (DP) and Reinforcement Learning (RL). It combines the
famous Q-Learning method for RL with the Black–Scholes (–Merton) model’s idea
of reducing the problem of option pricing and hedging to the problem of optimal
rebalancing of a dynamic replicating portfolio for the option, which is made of a
stock and cash.
Here we expand on several NuQLear (Numerical Q-Learning) topics with the
QLBS model. First, we investigate the performance of Fitted Q Iteration for a RL
(data-driven) solution to the model, and benchmark it versus a DP (model-based)
solution, as well as versus the BSM model.
Second, we develop an Inverse Reinforcement Learning (IRL) setting for the
model, where we only observe prices and actions (re-hedges) taken by a trader,
but not rewards.
Third, we outline how the QLBS model can be used for pricing portfolios of options,
rather than a single option in isolation, thus providing its own, data-driven and
model independent solution to the (in)famous volatility smile problem of the Black–
Scholes model.
Introduction to Reinforcement Learning
Financial applications of reinforcement learning

RL hedging—Kolm/Ritter

In Dynamic Replication and Hedging: A Reinforcement Learning Approach [KR19a]:


The authors of this article address the problem of how to optimally hedge an op-
tions book in a practical setting, where trading decisions are discrete and trad-
ing costs can be nonlinear and difficult to model. Based on reinforcement learn-
ing (RL), a well-established machine learning technique, the authors propose a
model that is flexible, accurate and very promising for real-world applications. A
key strength of the RL approach is that it does not make any assumptions about
the form of trading cost. RL learns the minimum variance hedge subject to what-
ever transaction cost function one provides. All that it needs is a good simulator,
in which transaction costs and option prices are simulated accurately.
Introduction to Reinforcement Learning
Financial applications of reinforcement learning

Deep hedging—Buehler/Gonon/Teichmann/Wood/Mohan/Kochems

In Deep Hedging: Hedging Derivatives Under Generic Market Frictions Using Reinforcement Learning [BGT+ 19]:
This article discusses a new application of reinforcement learning: to the problem
of hedging a portfolio of “over-the-counter” derivatives under market frictions such
as trading costs and liquidity constraints.
The objective is to maximise a non-linear risk-adjusted return function by trading
in liquid hedging instruments such as equities or listed options. The approach
presented here is the first efficient and model-independent algorithm which can be
used for such problems at scale.
Introduction to Reinforcement Learning
Financial applications of reinforcement learning

Deep hedging—Cao/Chen/Hull/Poulos

In Deep Hedging of Derivatives Using Reinforcement Learning [CCHZ19]:


This paper shows how reinforcement learning can be used to derive optimal hedg-
ing strategies for derivatives when there are transaction costs. The paper illus-
trates the approach by showing the difference between using delta hedging and
optimal hedging for a short position in a call option when the objective is to mini-
mize a function equal to the mean hedging cost plus a constant times the standard
deviation of the hedging cost. Two situations are considered. In the first, the asset
price follows geometric Brownian motion. In the second, the asset price follows a
stochastic volatility process. The paper extends the basic reinforcement learning
approach in a number of ways. First, it uses two different Q-functions so that both
the expected value of the cost and the expected value of the square of the cost are
tracked for different state/action combinations. This approach increases the range
of objective functions that can be used. Second, it uses a learning algorithm that
allows for continuous state and action space. Third, it compares the accounting
P&L approach (where the hedged position is valued at each step) and the cash
flow approach (where cash inflows and outflows are used). We find that a hybrid
approach involving the use of an accounting P&L approach that incorporates a
relatively simple valuation model works well. The valuation model does not have
to correspond to the process assumed for the underlying asset price.
Introduction to Reinforcement Learning
Financial applications of reinforcement learning

Wealth management—Dixon/Halperin

In G-Learner and GIRL: Goal Based Wealth Management with Reinforcement Learning [DH20]:
We present a reinforcement learning approach to goal based wealth management problems
such as optimization of retirement plans or target dated funds. In such problems, an investor
seeks to achieve a financial goal by making periodic investments in the portfolio while being
employed, and periodically draws from the account when in retirement, in addition to the ability
to re-balance the portfolio by selling and buying different assets (e.g. stocks). Instead of
relying on a utility of consumption, we present G-Learner: a reinforcement learning algorithm
that operates with explicitly defined one-step rewards, does not assume a data generation
process, and is suitable for noisy data. Our approach is based on G-learning—a probabilistic
extension of the Q-learning method of reinforcement learning.
In this paper, we demonstrate how G-learning, when applied to a quadratic reward and Gaus-
sian reference policy, gives an entropy-regulated Linear Quadratic Regulator (LQR). This crit-
ical insight provides a novel and computationally tractable tool for wealth management tasks
which scales to high dimensional portfolios. In addition to the solution of the direct problem of
G-learning, we also present a new algorithm, GIRL, that extends our goal-based G-learning
approach to the setting of Inverse Reinforcement Learning (IRL) where rewards collected by
the agent are not observed, and should instead be inferred. We demonstrate that GIRL can
successfully learn the reward parameters of a G-Learner agent and thus imitate its behavior.
Finally, we discuss potential applications of the G-Learner and GIRL algorithms for wealth
management and robo-advising.
Introduction to Reinforcement Learning
Financial applications of reinforcement learning

Optimal execution—Ning/Lin/Jaimungal

In Double Deep Q-Learning for Optimal Execution [NLJ18]:


Optimal trade execution is an important problem faced by essentially all traders.
Much research into optimal execution uses stringent model assumptions and ap-
plies continuous time stochastic control to solve them. Here, we instead take a
model free approach and develop a variation of Deep Q-Learning to estimate the
optimal actions of a trader. The model is a fully connected Neural Network trained
using Experience Replay and Double DQN with input features given by the cur-
rent state of the limit order book, other trading signals, and available execution
actions, while the output is the Q-value function estimating the future rewards un-
der an arbitrary action. We apply our model to nine different stocks and find that
it outperforms the standard benchmark approach on most stocks using the mea-
sures of (i) mean and median out-performance, (ii) probability out-performance,
and (iii) gain-loss ratios.
Introduction to Reinforcement Learning
Financial applications of reinforcement learning

Optimal order placement—Schnaubelt

In Deep reinforcement learning for the optimal placement of cryptocurrency limit orders [Sch20]:
This paper presents the first large-scale application of deep reinforcement learning
to optimize the placement of limit orders at cryptocurrency exchanges. For train-
ing and out-of-sample evaluation, we use a virtual limit order exchange to reward
agents according to the realized shortfall over a series of time steps. Based on
the literature, we generate features that inform the agent about the current mar-
ket state. Leveraging 18 months of high-frequency data with 300 million historic
trades and more than 3.5 million order book states from major exchanges and cur-
rency pairs, we empirically compare state-of-the-art deep reinforcement learning
algorithms to several benchmarks. We find proximal policy optimization to reli-
ably learn superior order placement strategies when compared to deep double
Q-networks and other benchmarks. Further analyses shed light into the black box
of the learned execution strategy. Important features are current liquidity costs and
queue imbalances, where the latter can be interpreted as predictors of short-term
mid-price returns. To preferably execute volume in limit orders to avoid additional
market order exchange fees, order placement tends to be more aggressive in ex-
pectation of unfavorable price movements.
Introduction to Reinforcement Learning
Student projects

Toby Weston: Distributional Reinforcement Learning for Optimal Execution


Toby Weston. Distributional Reinforcement Learning for
Optimal Execution. A thesis submitted for the degree of MSc in
Mathematics and Finance, 2019-2020.
When trading a financial asset, large orders will often incur higher execution costs as the trader uses up the available liquidity. To reduce this effect, orders are split and executed over a short period of time. Theoretical solutions for how to optimally split orders rely on models of market environments. These fail to take into account market idiosyncrasies and tend to oversimplify a complex optimisation problem.
Deep Q learning provides a set of methodologies for learning an optimal solution from real experience. Successful application would allow models of the trading environment to be sidestepped in favour of direct interaction with the financial markets. Deep Q learning has previously been applied to the problem of optimal execution and has shown promise, both in simulated environments and on historical data.
In the last few years many improvements have been suggested for the vanilla deep Q learning algorithm. Distributional reinforcement learning in particular has been shown to outperform value based deep Q learning on a selection of Atari games. Given the highly stochastic nature of the trading environment it is reasonable to assume that it would perform well for the problem of optimal execution.
In the following work we will outline the principles behind distributional reinforcement learning and show that it can outperform value based deep Q learning for optimal execution. To the best of our knowledge this is the first time distributional reinforcement learning has been used for optimal execution.
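Distributional methods learn the full distribution of returns rather than only its mean. The sketch below shows the categorical (C51-style) representation on a fixed support and the projection of the Bellman-updated distribution back onto that support; it is our own minimal illustration with an arbitrary support range, not code from the thesis.

```python
# Minimal sketch of the categorical (C51-style) value distribution behind
# distributional RL: returns are probabilities over fixed atoms, and the
# Bellman-updated distribution is projected back onto that support.
import torch

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
support = torch.linspace(V_MIN, V_MAX, N_ATOMS)   # return atoms z_i
delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)

def expected_q(probs):
    """probs: (batch, n_actions, N_ATOMS) -> expected Q-values (batch, n_actions)."""
    return (probs * support).sum(dim=-1)

def project_target(next_probs, rewards, dones, gamma=0.99):
    """Project r + gamma * z onto the fixed support. next_probs: (batch, N_ATOMS)
    for the greedy next action; the result is the training target for a
    cross-entropy loss against the predicted distribution."""
    tz = rewards.unsqueeze(1) + gamma * (1.0 - dones).unsqueeze(1) * support
    tz = tz.clamp(V_MIN, V_MAX)
    b = (tz - V_MIN) / delta_z                    # fractional atom index
    lower, upper = b.floor().long(), b.ceil().long()
    # When tz lands exactly on an atom, split the mass so none is lost.
    lower[(upper > 0) & (lower == upper)] -= 1
    upper[(lower < N_ATOMS - 1) & (lower == upper)] += 1
    proj = torch.zeros_like(next_probs)
    proj.scatter_add_(1, lower, next_probs * (upper.float() - b))
    proj.scatter_add_(1, upper, next_probs * (b - lower.float()))
    return proj
```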
Introduction to Reinforcement Learning
Textbooks

Sutton/Barto

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, second edition. MIT Press, 2018. [SB18]
Like the first edition, this new edition focusses on core online learning algorithms, with the more mathematical material set off in shaded boxes. Part I covers as much of reinforcement learning as possible without going beyond the tabular case for which exact solutions can be found. Many algorithms presented in this part are new to the second edition, including UCB, Expected Sarsa, and double learning. Part II extends these ideas to function approximation, with new sections on such topics as artificial neural networks and the Fourier basis, and offers expanded treatment of off-policy learning and policy-gradient methods. Part III has new chapters on reinforcement learning’s relationships with psychology and neuroscience, as well as an updated case-studies chapter including AlphaGo and AlphaGo Zero, Atari game playing, and IBM Watson’s wagering strategy. The final chapter discusses the future societal impacts of reinforcement learning.

Available online for free:
http://www.incompleteideas.net/book/the-book.html
Introduction to Reinforcement Learning
Textbooks

Szepesvári
Csaba Szepesvári. Algorithms for Reinforcement Learning.
Synthesis Lectures on Artificial Intelligence and Machine
Learning, Morgan & Claypool, 2010 [Sze10].
Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner’s predictions. Further, the predictions may have long term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms’ merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas, note a large number of state of the art algorithms, followed by the discussion of their theoretical properties and limitations.
Available online for free:
https://sites.ualberta.ca/~szepesva/rlbook.html
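To make the dynamic programming backbone concrete, here is a minimal value-iteration sketch on a randomly generated toy MDP. This is our own example, not taken from the book.

```python
# Minimal value iteration on a random toy MDP: repeatedly apply the Bellman
# optimality backup until the value function stops changing.
import numpy as np

n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # expected reward r(s, a)

V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V          # Q[s, a] = r(s, a) + gamma * E[V(s')]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
policy = Q.argmax(axis=1)          # greedy policy w.r.t. the converged values
```

Reinforcement learning replaces the known transition kernel P and reward R above with samples obtained by interacting with the system.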
Introduction to Reinforcement Learning
Textbooks

Bertsekas
Dimitri Bertsekas. Reinforcement Learning and Optimal Control.
Athena Scientific, 2019. [Ber19]
This book considers large and challenging multistage decision problems, which can be solved in principle by dynamic programming, but their exact solution is computationally intractable. We discuss solution methods that rely on approximations to produce suboptimal policies with adequate performance. These methods are known by several essentially equivalent names: reinforcement learning, approximate dynamic programming, and neuro-dynamic programming. They underlie, among others, the recent impressive successes of self-learning in the context of games such as chess and Go. One of the aims of the book is to explore the common boundary between artificial intelligence and optimal control, and to form a bridge that is accessible by workers with background in either field. Another aim is to organize coherently the broad mosaic of methods that have proved successful in practice while having a solid theoretical and/or logical foundation. This may help researchers and practitioners to find their way through the maze of competing ideas that constitute the current state of the art. The mathematical style of this book is somewhat different than other books by the same author. While we provide a rigorous, albeit short, mathematical account of the theory of finite and infinite horizon dynamic programming, and some fundamental approximation methods, we rely more on intuitive explanations and less on proof-based insights. We also illustrate the methodology with many example algorithms and applications. Selected sections, instructional videos and slides, and other supporting material may be found at the author’s website.
Introduction to Reinforcement Learning
Textbooks

Agarwal/Jiang/Kakade/Sun

I Work in progress: Alekh Agarwal, Nan Jiang, Sham M. Kakade, Wen Sun.
Reinforcement Learning: Theory and Algorithms [AJKS21].
I A draft is available at https://rltheorybook.github.io/
I Current contents:
I Markov decision processes and computational complexity
I Sample complexity
I Approximate value function methods
I Generalization
I Multi-armed and linear bandits
I Strategic exploration in tabular MDPs
I Linearly parameterized MDPs
I Parametric models with bounded Bellman rank
I Policy gradient methods and non-convex optimization
I Optimality
I Function approximation and the NPG
I CPI, TRPO, and more
I Linear quadratic regulators
I Imitation learning
I Offline reinforcement learning
I Partially observable Markov decision processes
Introduction to Reinforcement Learning
Textbooks

Lapan
Maxim Lapan. Deep Reinforcement Learning Hands-On.
Packt [Lap18].
Recent developments in reinforcement learning (RL), combined with deep learning (DL) have seen unprecedented progress made towards training agents to solve complex problems in a human-like way. Google’s use of algorithms to play and defeat the well-known Atari arcade games has propelled the field to prominence, and researchers are generating new ideas at a rapid pace.
Deep Reinforcement Learning Hands-On is a comprehensive guide to the very latest DL tools and their limitations. You will evaluate methods including cross-entropy and policy gradients, before applying them to real-world environments. Take on both the Atari set of virtual games and family favourites such as Connect4. The book provides an introduction to the basics of RL, giving you the know-how to code intelligent learning agents to take on a formidable array of practical tasks. Discover how to implement Q-learning on ‘grid world’ environments, teach your agent to buy and trade stocks, and find out how natural language models are driving the boom in chatbots.
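As a taste of the ‘grid world’ exercises mentioned above, here is a minimal tabular Q-learning sketch of our own (not the book's code) on a 4 x 4 grid with a single goal state.

```python
# Minimal tabular Q-learning on a 4x4 grid world: start in the top-left corner,
# reach the bottom-right goal; each step costs a small penalty.
import numpy as np

rng = np.random.default_rng(0)
SIZE, GOAL = 4, 15                                   # states 0..15, goal = bottom-right
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]         # up, down, left, right
Q = np.zeros((SIZE * SIZE, len(ACTIONS)))
alpha, gamma, eps = 0.1, 0.95, 0.1                   # step size, discount, exploration

def step(s, a):
    r, c = divmod(s, SIZE)
    dr, dc = ACTIONS[a]
    r, c = min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1)
    s2 = r * SIZE + c
    return s2, (1.0 if s2 == GOAL else -0.01), s2 == GOAL

for _ in range(2000):                                # episodes
    s, done = 0, False
    while not done:
        a = int(rng.integers(len(ACTIONS))) if rng.random() < eps else int(Q[s].argmax())
        s2, reward, done = step(s, a)
        # Q-learning update: bootstrap from the greedy value of the next state.
        Q[s, a] += alpha * (reward + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2
```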
Introduction to Reinforcement Learning
Textbooks

Zai/Brown

Alexander Zai and Brandon Brown. Deep Reinforcement Learning in Action. Manning, 2020 [ZB20].
Humans learn best from feedback — we are encouraged to take actions that lead to positive results while deterred by decisions with negative consequences. This reinforcement process can be applied to computer programs allowing them to solve more complex problems that classical programming cannot. Deep Reinforcement Learning in Action teaches you the fundamental concepts and terminology of deep reinforcement learning, along with the practical skills and techniques you’ll need to implement it into your own projects.
Introduction to Reinforcement Learning
Textbooks

Dixon/Halperin/Bilokon
Matthew Dixon, Igor Halperin, and Paul Bilokon. Machine
Learning in Finance: From Theory to Practice. Springer, 2020.
This book is written for advanced graduate students and academics in financial econometrics, management science and applied statistics, in addition to quants and data scientists in the field of quantitative finance. We present machine learning as a non-linear extension of various topics in quantitative economics such as financial econometrics and dynamic programming, with an emphasis on novel algorithmic representations of data, regularisation, and techniques for controlling the bias-variance tradeoff leading to improved out-of-sample forecasting. The book is presented in three parts, each part covering theory and applications. The first presents supervised learning for cross-sectional data from both a Bayesian and frequentist perspective. The more advanced material places a firm emphasis on neural networks, including deep learning, as well as Gaussian processes, with examples in investment management and derivatives. The second part covers supervised learning for time series data, arguably the most common data type used in finance with examples in trading, stochastic volatility and fixed income modeling. Finally, the third part covers reinforcement learning and its applications in trading, investment and wealth management. We provide Python code examples to support the readers’ understanding of the methodologies and applications. As a bridge to research in this emergent field, we present the frontiers of machine learning in finance from a researcher’s perspective, highlighting how many well known concepts in statistical physics are likely to emerge as research topics for machine learning in finance.
Introduction to Reinforcement Learning
Textbooks

Novotny/Bilokon/Galiotos/Délèze

Jan Novotny, Paul Bilokon, Aris Galiotos, and Frédéric Délèze. Machine Learning and Big Data with kdb+/q. Wiley, 2019 [NBGD19].
This book opens the world of q and kdb+ to a wide audience, as it emphasises solutions to problems of practical importance. Implementations covered include: data description and summary statistics; basic regression methods and cointegration; volatility estimation and time series modelling; advanced machine learning techniques, including neural networks, random forests, and principal component analysis; techniques useful beyond finance related to text analysis, game engines, and agent-based models.
Introduction to Reinforcement Learning
Textbooks

Books on multi-armed bandits


I Donald Berry and Bert Fristedt. Bandit problems: sequential allocation of
experiments. Chapman & Hall, 1985.
I Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games.
Cambridge University Press, 2006.
I Dirk Bergemann and Juuso Välimäki. Bandit Problems. In Steven Durlauf and
Larry Blume (editors). The New Palgrave Dictionary of Economics, 2nd edition.
Macmillan Press, 2006.
I Aditya Mahajan and Demosthenis Teneketzis. Multi-armed Bandit Problems. In
Alfred Olivier Hero III, David A. Castañón, Douglas Cochran, Keith Kastella
(editors). Foundations and Applications of Sensor Management. Springer, Boston,
MA, 2008.
I John Gittins, Kevin Glazebrook, and Richard Weber. Multi-armed Bandit Allocation
Indices. John Wiley & Sons, 2011.
I Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret Analysis of Stochastic and
Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine
Learning, now publishers Inc., 2012.
I Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University
Press, 2020.
I Aleksandrs Slivkins. Introduction to Multi-Armed Bandits. Foundations and Trends in
Machine Learning, now publishers Inc., 2019.
Introduction to Reinforcement Learning
Textbooks

Books on Markov decision processes and dynamic programming


I Lloyd Stowell Shapley. Stochastic Games. Proceedings of the National Academy of Sciences of
the United States of America, October 1, 1953, 39 (10), 1095–1100 [Sha53].
I Richard Bellman. Dynamic Programming. Princeton University Press, NJ 1957 [Bel57].
I Ronald A. Howard. Dynamic programming and Markov processes. The Technology Press of
M.I.T., Cambridge, Mass. 1960 [How60].
I Dimitri P. Bertsekas and Steven E. Shreve. Stochastic optimal control. Academic Press, New
York, 1978 [BS78].
I Martin L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John
Wiley & Sons, New York, 1994 [Put94].
I Onesimo Hernández-Lerma and Jean B. Lasserre. Discrete-time Markov control processes.
Springer-Verlag, New York, 1996 [HLL96].
I Dimitri P. Bertsekas. Dynamic programming and optimal control, Volume I. Athena Scientific,
Belmont, MA, 2001 [Ber01].
I Dimitri P. Bertsekas. Dynamic programming and optimal control, Volume II. Athena Scientific,
Belmont, MA, 2005 [Ber05].
I Eugene A. Feinberg and Adam Shwartz. Handbook of Markov decision processes. Kluwer
Academic Publishers, Boston, MA, 2002 [FS02].
I Warren B. Powell. Approximate dynamic programming. Wiley-Interscience, Hoboken, NJ,
2007 [Pow07].
I Nicole Bäuerle and Ulrich Rieder. Markov Decision Processes with Applications to Finance.
Springer, 2011 [BR11].
I Alekh Agarwal, Nan Jiang, Sham M. Kakade, Wen Sun. Reinforcement Learning: Theory and
Algorithms. A draft is available at https://rltheorybook.github.io/
Introduction to Reinforcement Learning
Bibliography

Philip E. Agre.
The Dynamic Structure of Everyday Life.
PhD thesis, Massachusetts Institute of Technology, Cambridge MA, 1988.
Alekh Agarwal, Nan Jiang, Sham M. Kakade, and Wen Sun.
Reinforcement Learning: Theory and Algorithms.
2021.
https://rltheorybook.github.io/.
Richard Bellman.
Dynamic Programming.
Princeton University Press, NJ, 1957.
Dimitri P. Bertsekas.
Dynamic programming and optimal control, Volume I.
Athena Scientific, Belmont, MA, 2001.
Dimitri P. Bertsekas.
Dynamic programming and optimal control, Volume II.
Athena Scientific, Belmont, MA, 2005.
Dimitri P. Bertsekas.
Reinforcement Learning and Optimal Control.
Athena Scientific, 2019.
Hans Buehler, Lukas Gonon, Josef Teichmann, Ben Wood, Baranidharan Mohan, and
Jonathan Kochems.
Introduction to Reinforcement Learning
Bibliography

Deep hedging: Hedging derivatives under generic market frictions using reinforcement
learning.
Research Paper 19–80, Swiss Finance Institute, 2019.
Justin A. Boyan and Michael L. Littman.
Packet routing in dynamically changing networks: A reinforcement learning approach.
In Advances in Neural Information Processing Systems 6 (NIPS 1993), 1993.
Nicole Bäuerle and Ulrich Rieder.
Markov Decision Processes with Applications to Finance.
Springer, 2011.
Dimitri P. Bertsekas and Steven E. Shreve.
Stochastic optimal control.
Academic Press, New York, 1978.
Robert Carver.
Systematic Trading: A Unique New Method for Designing Trading and Investing
Systems.
Harriman House, 2015.
Jay Cao, Jacky Chen, John C. Hull, and Zissis Poulos.
Deep hedging of derivatives using reinforcement learning.
SSRN, December 2019.
Ernest P. Chan.
Quantitative Trading: How to Build Your Own Algorithmic Trading Business.
Wiley, 2008.
Introduction to Reinforcement Learning
Bibliography

Ernest P. Chan.
Algorithmic Trading: Winning Strategies and Their Rationale.
Wiley, 2013.
Ernest P. Chan.
Machine Trading: Deploying Computer Algorithms to Conquer the Markets.
Wiley, 2016.
Zhaohui Chen, editor.
Currency Options and Exchange Rate Economics.
World Scientific, 1998.
Iain J. Clark.
Foreign Exchange Option Pricing: A Practitioner’s Guide.
Wiley, 2010.
Matthew Dixon and Igor Halperin.
G-learner and GIRL: Goal based wealth management with reinforcement learning.
arXiv, 2020.
Eugene A. Durenard.
Professional Automated Trading: Theory and Practice.
Wiley, 2013.
Eugene A. Feinberg and Adam Shwartz.
Handbook of Markov decision processes.
Kluwer Academic Publishers, Boston, MA, 2002.
Introduction to Reinforcement Learning
Bibliography

Igor Halperin.
QLBS: Q-learner in the Black–Scholes (–Merton) worlds.
SSRN, 2017.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3087076.
Igor Halperin.
The QLBS Q-learner goes NuQLear: Fitted Q iteration, inverse RL, and option
portfolios.
SSRN, 2018.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3102707.
Onesimo Hernández-Lerma and Jean B. Lasserre.
Discrete-time Markov control processes.
Springer-Verlag, New York, 1996.
Ronald A. Howard.
Dynamic programming and Markov processes.
The Technology Press of M.I.T., Cambridge, Mass., 1960.
Engin İpek, Onur Mutlu, José F. Martínez, and Rich Caruana.
Self-optimizing memory controllers: A reinforcement learning approach.
In Proceedings of the 35th Annual International Symposium on Computer
Architecture, pages 39–50. IEEE Computer Society Washington, DC, 2008.
Jessica James, Jonathan Fullwood, and Peter Billington.
FX Option Performance and Data Set: An Analysis of the Value Delivered by FX
Options Since the Start of the Market.
Introduction to Reinforcement Learning
Bibliography

Wiley, 2015.
Petter N. Kolm and Gordon Ritter.
Dynamic replication and hedging: A reinforcement learning approach.
The Journal of Financial Data Science, 1(1):159–171, 2019.
Petter N. Kolm and Gordon Ritter.
Modern perspectives on reinforcement learning in finance.
Journal of Machine Learning in Finance, 1(1), 2019.
Maxim Lapan.
Deep Reinforcement Learning Hands-On.
Packt, 2018.
Yuanlong Li, Yonggang Wen, Dacheng Tao, and Kyle Guan.
Transforming cooling optimization for green data center via deep reinforcement
learning.
IEEE Transactions on Cybernetics, pages 1–12, 2019.
José F. Martínez and Engin İpek.
Dynamic multicore resource management: A machine learning approach.
Micro, IEEE, 29(5):8–17, 2009.
Donald Michie.
Experiments on the mechanization of game-learning. Part I. Characterization of the
model and its parameters.
The Computer Journal, 6(3):232–236, November 1963.
Introduction to Reinforcement Learning
Bibliography

Marvin Minsky.
Form and content in computer science, 1969 Turing Award lecture.
Journal of the Association for Computing Machinery, 17(2), 1970.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou,
Daan Wierstra, and Martin Riedmiller.
Playing Atari with deep reinforcement learning.
https://arxiv.org/abs/1312.5602, December 2013.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness,
Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg
Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen
King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis.
Human-level control through deep reinforcement learning.
Nature, 518, February 2015.
Jan Novotny, Paul Alexander Bilokon, Aris Galiotos, and Frédéric Délèze.
Machine Learning and Big Data with kdb+/q.
Wiley, 2019.
Andrew Y. Ng, Adam Coates, Mark Diel, Varun Ganapathi, Jamie Schulte, Ben Tse,
Eric Berger, and Eric Liang.
Experimental Robotics IX: The 9th International Symposium on Experimental
Robotics, chapter Autonomous Inverted Helicopter Flight via Reinforcement Learning,
pages 363–372.
Springer, 2006.
Introduction to Reinforcement Learning
Bibliography

Andrew Y. Ng, H. Jin Kim, Michael I. Jordan, and Shankar Sastry.
Autonomous helicopter flight via reinforcement learning.
In NIPS’03: Proceedings of the 16th International Conference on Neural Information Processing Systems,
pages 799–806, December 2003.
Brian Ning, Franco Ho Ting Lin, and Sebastian Jaimungal.
Double deep Q-learning for optimal execution.
arXiv, 2018.
https://arxiv.org/abs/1812.06600.
Warren B. Powell.
Approximate dynamic programming.
Wiley-Interscience, Hoboken, NJ, 2007.
Martin L. Puterman.
Markov decision processes: discrete stochastic dynamic programming.
John Wiley & Sons, New York, 1994.
Gautam Reddy, Antonio Celani, Terrence J. Sejnowski, and Massimo Vergassola.
Learning to soar in turbulent environments.
Proceedings of the National Academy of Sciences, 113(33):E4877–E4884, 2016.
Arthur L. Samuel.
Some studies in machine learning using the game of checkers.
IBM Journal of Research and Development, 3(3):210–229, 1959.
Arthur L. Samuel.
Introduction to Reinforcement Learning
Bibliography

Some studies in machine learning using the game of checkers. II — Recent progress.
IBM Journal of Research and Development, 11(6):601–617, 1967.
Richard S. Sutton and Andrew G. Barto.
Reinforcement Learning: An Introduction.
MIT Press, 2nd edition, 2018.
Matthias Schnaubelt.
Deep reinforcement learning for optimal placement of cryptocurrency limit orders.
FAU Discussion Papers in Economics 05/2020, Friedrich-Alexander-Universität
Erlangen-Nürnberg, Institute of Economics, Nürnberg, 2020.
Claude E. Shannon.
Programming a computer for playing chess.
Philosophical Magazine and Journal of Science, 41(314):256–275, 1950.
Lloyd Stowell Shapley.
Stochastic games.
Proceedings of the National Academy of Sciences of the United States of America,
39(10):1095–1100, October 1953.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George
van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya
Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel,
and Demis Hassabis.
Mastering the game of Go with deep neural networks and tree search.
Introduction to Reinforcement Learning
Bibliography

Nature, 529:484–489, January 2016.


William D. Smart and Leslie Pack Kaelbling.
Effective reinforcement learning for mobile robots.
IEEE International Conference on Robotics and Automation (ICRA-2002), 2002.
Peter Stone, Richard S. Sutton, and Gregory Kuhlmann.
Reinforcement learning for RoboCup soccer keepaway.
Adaptive Behavior, 13(3):165–188, September 2005.
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang,
Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen,
Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel,
and Demis Hassabis.
Mastering the game of go without human knowledge.
Nature, 550:354–359, October 2017.
Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua.
Safe, multi-agent, reinforcement learning for autonomous driving.
https://arxiv.org/abs/1610.03295, October 2016.
Richard S. Sutton.
Learning to predict by the methods of temporal differences.
Machine Learning, 3:9–44, 1988.
Csaba Szepesvári.
Algorithms for Reinforcement Learning.
Introduction to Reinforcement Learning
Bibliography

Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan &


Claypool, 2010.
Nassim Taleb.
Dynamic Hedging: Managing Vanilla and Exotic Options.
Wiley, 1996.
Gerry Tesauro.
Practical issues in temporal difference learning.
Machine Learning, 8(3–4):257–277, 1992.
Gerry Tesauro.
TD-Gammon, a self-teaching backgammon program, achieves master-level play.
Neural Computation, 1994.
Gerry Tesauro.
Temporal difference learning and TD-Gammon.
Communications of the ACM, 38(3):58–68, 1995.
Gerry Tesauro.
Programming backgammon using self-teaching neural nets.
Artificial Intelligence, 134(1–2):181–199, 2002.
Gerald Tesauro, David C. Gondek, Jonathan Lenchner, James Fan, and John M.
Prager.
Simulation, learning, and optimization techniques in Watson’s game strategies.
IBM Journal of Research and Development, 56(3–4):16:1–16:11, 2012.
Introduction to Reinforcement Learning
Bibliography

Gerald Tesauro, David C. Gondek, Jonathan Lenchner, James Fan, and John M.
Prager.
Analysis of Watson’s strategies for playing Jeopardy!
Journal of Artificial Intelligence Research, 47:205–251, 2013.
Philip S. Thomas.
Safe Reinforcement Learning.
PhD thesis, University of Massachusetts, Amherst, 2015.
Georgios Theocharous, Philip S. Thomas, and Mohammad Ghavamzadeh.
Personalized ad recommendation systems for life-time value optimization with
guarantees.
In Proceedings of the Twenty-Fourth International Joint Conference on Artificial
Intelligence (IJCAI 2015). AAAI Press, Palo Alto, CA, 2015.
Igor Tulchinsky.
Finding Alphas: A Quantitative Approach to Building Trading Strategies.
Wiley, 2015.
Alan Mathison Turing.
The Essential Turing, chapter Intelligent machinery, pages 410–432.
Oxford University Press, Oxford, 2004.
Timothy Woodbury, Caroline Dunn, and John Valasek.
Autonomous soaring using reinforcement learning for trajectory generation.
In 52nd Aerospace Sciences Meeting, 2014.
Introduction to Reinforcement Learning
Bibliography

Uwe Wystup.
FX Options and Structured Products.
Wiley, 2nd edition, 2017.
Alex Zai and Brandon Brown.
Deep Reinforcement Learning in Action.
Manning, 2020.
