
REINFORCEMENT LEARNING

Nguyễn Ngọc Thảo


nnthao@fit.hcmus.edu.vn
Outline
• Reinforcement learning: Basic concepts
• Multi-armed bandit problem

2
Reinforcement learning:
Basic concepts
Reinforcement learning (RL)
• An agent learns to interact with an environment based on
feedback signals it receives from the environment.

4
RL as a trial-and-error process
• The learner is not told which actions to take; instead, it
discovers which actions yield the most reward by trying them.

Learning to ride a bike requires trial and
error, much like reinforcement learning.
(Video courtesy of Mark Harris, who says he
is “learning reinforcement” as a parent.)

Image credit: Commoncog

5
Reinforcement learning is the culmination of many fields and
has a rich history in optimization and behavioral psychology. 6
Concept vocabulary
• Let’s frame the concepts in terms of a video game, Mario!

Image credit: CinnamonAI 7


Concepts: Agent and Environment
• The main entities of RL are the agent and the environment.

[Diagram: the agent acts on the environment (a game level) through actions]

• The environment is the world that the agent lives in and
interacts with.
• The environment changes when the agent acts on it but may also
change on its own.

8
Concepts: Agent and Environment
• At every step, the agent perceives a (possibly partial)
observation of the state of the world.
• E.g., in the game Mario, the environment is the whole game level,
yet Mario can only see a part of the scene.
• Then, it picks an action from the set of possible actions, causing the
environment to change from one state to another.
• E.g., in the game Mario, a state is the combination of {Mario, action,
environment}.
• The agent can observe this change, use it as a feedback
signal, and learn from it.

9
Concepts: State and Observation
• A state 𝑠 is a complete description of the state of the world.
• There is no information about the world that is hidden from the state.
• An observation 𝑜 is a partial description of a state, which
may omit information.
• Fully-/Partially- observable environment: whether the agent observes
the complete state of the environment (e.g., chess vs. poker).

• Each state or observation is represented by a real-valued
vector, matrix, or higher-order tensor.
• E.g., a visual observation: RGB matrix of pixel values; the state of a
robot: joint angles and velocities.

10
Concepts: Action space
• The action space includes the set of all valid actions in a
given environment.
• Discrete action space: only a finite number of moves are
available to the agent.
• E.g., Atari and Go
• Continuous action space: actions are real-valued vectors.
• E.g., the robotic agent operates in a physical world
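
To make the distinction concrete, here is a minimal sketch of how the two kinds of action space are typically declared, assuming the gymnasium package (not part of these slides) is available.

```python
import numpy as np
from gymnasium import spaces

# Discrete action space: e.g., 4 moves in a grid world or an Atari-like game.
discrete_actions = spaces.Discrete(4)          # actions are the integers 0, 1, 2, 3

# Continuous action space: e.g., 2-D torques for a robotic agent, each in [-1, 1].
continuous_actions = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

print(discrete_actions.sample())    # a random integer action
print(continuous_actions.sample())  # a random real-valued vector
```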

11
Concepts: A formal description
• At 𝑡0, the agent 𝐴 does not know what action to take.
• Thus, it can take a random action, or follow another strategy if
there is preliminary knowledge.
• At time step 𝑡, 𝐴 performs an action 𝑎𝑡.
• At the next time step 𝑡 + 1, 𝐴 perceives its new state 𝑠𝑡+1 and
considers the reward 𝑟𝑡+1 received from the environment.
• The environment at time step 𝑡 + 1 is the result of action 𝑎𝑡.
• If the rewards get smaller, 𝐴 will choose another action.
• This process repeats until the agent completes its episode
(see the loop sketched below).
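
This interaction is often written as a simple loop. Below is a minimal sketch using the common gymnasium-style interface; the environment name and the random placeholder policy are illustrative assumptions, not part of the original slides.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")          # any environment with the standard interface
state, info = env.reset(seed=0)        # s_0 sampled from the start-state distribution

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()                              # a_t: random placeholder policy
    state, reward, terminated, truncated, info = env.step(action)   # s_{t+1}, r_{t+1}
    total_reward += reward                                          # feedback the agent can learn from
    done = terminated or truncated                                  # episode ends

print("Return of this episode:", total_reward)
```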
12
Concepts: Another example

The agent in the food-delivery task is the rider.

The rider needs to navigate the streets (actions) to
reach the goal of arriving at the customer’s
house, while also making sure they stay on the
correct route (state).

The biggest reward comes when the rider reaches the customer and
delivers the food; penalties along the way can come in the form of taking
a wrong turn, getting caught in a traffic jam, or anything else that prevents
the rider from completing the task.

13
Concepts: Policy
• A policy is a mapping from the perceived states of the
environment to actions to be taken when in those states.

A policy might be that, given a certain tile, the agent moves in a certain direction.

14
Concepts: Policy
• The concept of a policy corresponds to what psychology calls
a set of stimulus–response rules (or associations).

Image credit: BioNinja 15


Concepts: Policy
• The policy is the core of an RL agent: it alone is sufficient
to determine behavior.
• A good policy results in a positive outcome.

• It may be a simple function, a lookup table, or an extensive
computation such as a search process.
• A policy can be deterministic, 𝑎𝑡 = 𝜇𝜃(𝑠𝑡).
• 𝜃 denotes a set of parameters to be optimized (e.g., the
weights and biases of a neural network).
• It can also be stochastic, 𝑎𝑡 ~ 𝜋𝜃(∙ | 𝑠𝑡), specifying a
probability for each action.
16
Concepts: Policy
• A deterministic policy always maps a given state to only one
particular action.

• A stochastic policy returns a probability distribution over
actions in the action space for a given state.
17
Concepts: Stochastic policy
• A categorical policy uses a categorical probability distribution
to select actions from the discrete action space.
• E.g., actions in the grid world include [Up, Down, Left, Right].

• A Gaussian policy chooses the action for a given state by
using a Gaussian distribution over a continuous action space.
• E.g., for an agent driving a car, the speed is selected from
a Gaussian distribution over the action space [0, 200].
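
A minimal numpy sketch of both kinds of stochastic policy; the probabilities, mean, and standard deviation below are made-up illustrative values, not anything specified in these slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Categorical policy over a discrete action space [Up, Down, Left, Right].
actions = ["Up", "Down", "Left", "Right"]
probs = np.array([0.7, 0.1, 0.1, 0.1])       # pi(a | s) for the current state
a_t = rng.choice(actions, p=probs)            # sample one action

# Gaussian policy over a continuous action space (speed in [0, 200]).
mean_speed, std_speed = 80.0, 10.0            # mu_theta(s), sigma_theta(s)
speed = np.clip(rng.normal(mean_speed, std_speed), 0.0, 200.0)

print(a_t, speed)
```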

18
Concepts: Trajectories
• A trajectory 𝜏 = (𝑠0, 𝑎0, 𝑠1, 𝑎1, …) is a sequence of states and
actions in the world.
• The first state of the world, 𝑠0, is randomly sampled from the
start-state distribution, 𝜌0:
$s_0 \sim \rho_0(\cdot)$
• State transitions are what happens to the world between the
state at time 𝑡, 𝑠𝑡, and the state at 𝑡 + 1, 𝑠𝑡+1.
• They are governed by the natural laws of the environment, and
depend only on the most recent action, 𝑎𝑡:
$s_{t+1} = f(s_t, a_t)$ (deterministic) or $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ (stochastic)

19
Concepts: Reward signal
• A reward signal defines the goal of a RL problem, i.e., what
are the good and bad events for the agent.

In Mario, a good way to measure reward might be the score!


20
Concepts: Reward signal
• Rewards are the immediate and defining features of the
problem faced by the agent.
• They are usually numeric values sent from the environment
at each time step.

A simple environment setup (left)
and its hidden reward mapping
(right). Only by exploring the
environment can the agent learn
that stepping on the goal tile
yields a reward of 1!

21
Concepts: Reward signal

22
Concepts: Reward signal
• The reward signal is the primary basis for altering the policy.
• If an action selected by the policy is followed by low reward,
the policy should be modified to select some other action in
that situation in the future.
• A reward signal may be described as a stochastic function of
the state of the environment and the actions taken.

23
Concepts: Reward signal
• The reward function ℛ is defined as 𝑟𝑡 = ℛ(𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1).
• Finite-horizon undiscounted return: the sum of rewards
obtained in a fixed window of steps.

$\mathcal{R}(\tau) = \sum_{t=0}^{T} r_t$

• Infinite-horizon discounted return: the sum of all rewards, but
discounted by how far off in the future they are obtained.

$\mathcal{R}(\tau) = \sum_{t=0}^{\infty} \gamma^{t} r_t$

• 𝛾: the discount factor, in (0, 1)
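
Both returns are easy to compute from a list of rewards; a short sketch with arbitrary example values:

```python
rewards = [1.0, 0.0, 2.0, 1.0]   # r_0 ... r_T, arbitrary example values
gamma = 0.9                      # discount factor in (0, 1)

# Finite-horizon undiscounted return: a plain sum over the window.
undiscounted = sum(rewards)

# Discounted return: each reward is weighted by gamma^t.
discounted = sum(gamma**t * r for t, r in enumerate(rewards))

print(undiscounted, discounted)  # 4.0 and 1.0 + 0.0 + 2*0.81 + 1*0.729 = 3.349
```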
24
Concepts: Reward signal
• The agent’s sole objective is to maximize the total reward it
receives over the long run (i.e., trajectories).

Image credit: KDnuggets

25
Concepts: Expected return
• Consider any choice of policy and any return measure
(infinite-horizon discounted or finite-horizon undiscounted).
• An agent aims to select a policy that maximizes the expected
return when it acts according to that policy.

• Suppose that both the environment transitions and the policy
are stochastic.

26
Concepts: Expected return
• The probability of a 𝑇-step trajectory is:

$P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$

• The expected return, 𝐽(𝜋), is then:

$J(\pi) = \int_{\tau} P(\tau \mid \pi)\, \mathcal{R}(\tau) = \mathbb{E}_{\tau \sim \pi}\left[ \mathcal{R}(\tau) \right]$

• The central optimization problem can then be expressed as
$\pi^{*} = \arg\max_{\pi} J(\pi)$
• with 𝜋* being the optimal policy.
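
In practice the integral over trajectories is intractable, so 𝐽(𝜋) is usually estimated by averaging the returns of sampled episodes. A hedged sketch, reusing the gymnasium-style loop from earlier; the environment and the random placeholder policy are assumptions for illustration only.

```python
import gymnasium as gym
import numpy as np

def episode_return(env, gamma=0.99, seed=None):
    """Roll out one episode with a random (placeholder) policy and return R(tau)."""
    obs, info = env.reset(seed=seed)
    ret, discount, done = 0.0, 1.0, False
    while not done:
        action = env.action_space.sample()            # replace with a real pi_theta(. | s)
        obs, reward, terminated, truncated, info = env.step(action)
        ret += discount * reward                      # accumulate discounted reward
        discount *= gamma
        done = terminated or truncated
    return ret

env = gym.make("CartPole-v1")
returns = [episode_return(env, seed=i) for i in range(100)]
print("Monte Carlo estimate of J(pi):", np.mean(returns))
```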

27
Concepts: Value
• The value of a state is the total amount of reward an agent
hopes to accumulate over the future, starting from that state.
Each square is a state: S is the start state,
G the goal state, T squares are traps, and
black squares cannot be entered.
The rewards (traps and goal state) are
initialized and then these values spread
over time until an equilibrium is reached.
Depending on the penalty value on traps
and reward value for the goal, different
solution patterns might emerge; the last
two grids show such solution states.

Image credit: Nvidia Developer 28


Concepts: Value
• The value function may find the most suitable solution for
contexts that involve trade-offs between factors.

Different racing lines around a corner. Each racing line has a
different distance, range of possible speeds, and force exerted
on the tires. A value function optimizing for lap time will
find the state transitions that minimize the total time spent in the turn.

• We seek actions that bring about states of highest value,
i.e., the greatest amount of reward for us over the long run.

29
Concepts: Value
• Rewards are primary, while values, as predictions of rewards, are secondary.
• There could be no values without rewards; the only purpose of
estimating values is to achieve more reward.
• Nevertheless, we are most concerned with values, not rewards alone,
when making and evaluating decisions.
• That is, action choices are made based on value judgments.
• It is much harder to determine values than rewards.

30
Concepts: Value functions
• On-policy value function: gives the expected return if the agent starts in
state 𝑠 and always acts according to policy 𝜋.
$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ \mathcal{R}(\tau) \mid s_0 = s \right]$
• Optimal value function: the on-policy value function under the optimal policy.
$V^{*}(s) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ \mathcal{R}(\tau) \mid s_0 = s \right]$

• On-policy action-value function: gives the expected return if the agent
starts in state 𝑠, takes an arbitrary action 𝑎 (which may not have come
from the policy), and then forever after acts according to policy 𝜋.
$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[ \mathcal{R}(\tau) \mid s_0 = s, a_0 = a \right]$
• Optimal action-value function: the on-policy action-value function under
the optimal policy.
$Q^{*}(s, a) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ \mathcal{R}(\tau) \mid s_0 = s, a_0 = a \right]$
31
Concepts: Value functions
• The connections between the value function and the action-value function:

$V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[ Q^{\pi}(s, a) \right]$
$V^{*}(s) = \max_{a} Q^{*}(s, a)$

• The optimal Q-function and the optimal action: given $Q^{*}(s, a)$,
the optimal action, $a^{*}(s)$, can be obtained directly from

$a^{*}(s) = \arg\max_{a} Q^{*}(s, a)$

32
Concepts: Value functions
• All four value functions obey Bellman equations, which express a
self-consistency condition:
The value of your starting point is the reward you expect to get
from being there, plus the value of wherever you land next.

• The Bellman equation for the on-policy value function is:

$V^{\pi}(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[ r(s, a) + \gamma V^{\pi}(s') \right]$
• $s' \sim P$ is shorthand for $s' \sim P(\cdot \mid s, a)$, i.e., the next state 𝑠′ is sampled from the
environment’s transition rules.
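
This self-consistency is what iterative policy evaluation exploits: start from an arbitrary value estimate and repeatedly apply the Bellman equation as an update rule. A minimal sketch on a made-up two-state MDP; the transition and reward numbers are illustrative, not from these slides.

```python
import numpy as np

# A tiny, made-up MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],       # pi(a | s): a uniform random policy
               [0.5, 0.5]])
gamma = 0.9

V = np.zeros(2)                  # arbitrary initial value estimate
for _ in range(200):             # repeated Bellman backups until (near) convergence
    # V(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    V = (pi * (R + gamma * (P @ V))).sum(axis=1)

print("V_pi ≈", V)
```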

33
Concepts: Model of environment
• The model allows inferences to be made about how the
environment will behave.
• E.g., given a state and action, the model might predict the resultant
next state and next reward.

As an agent explores an environment, it could build a 3D
interpretation of the world around it to help it
reason about the actions it might take in the future.

34
Concepts: Model of environment
• Modern RL spans the spectrum from low-level, trial-and-
error learning to high-level, deliberative planning.
• Model-free methods: trial-and-error learning
• Model-based methods: use models and planning
• Some systems simultaneously learn by trial and error, learn
a model of the environment, and use the model for planning.

35
A taxonomy of RL algorithms

Image credit

A non-exhaustive, but useful taxonomy of algorithms in modern RL


36
Markov Decision Processes (MDP)
• A Markov Decision Process is a 5-tuple ⟨𝒮, 𝒜, ℛ, 𝑃, 𝜌0⟩.
• 𝒮 is the set of all valid states, and 𝒜 is the set of all valid actions.
• ℛ: 𝒮 × 𝒜 × 𝒮 → ℝ is the reward function, with 𝑟𝑡 = ℛ(𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1).
• 𝑃: 𝒮 × 𝒜 → 𝒫(𝒮) is the transition probability function, with
𝑃(𝑠′ | 𝑠, 𝑎) being the probability of transitioning into state 𝑠′ if the
agent starts in state 𝑠 and takes action 𝑎.
• 𝜌0 is the starting-state distribution.
• The system obeys the Markov property: transitions only
depend on the most recent state and action, no prior history.
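
The five components map naturally onto a small container type. The sketch below is only an illustration of the tuple and the Markov property; all names and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import random

@dataclass
class MDP:
    states: Sequence[int]                               # S: valid states
    actions: Sequence[int]                              # A: valid actions
    reward: Callable[[int, int, int], float]            # R(s, a, s')
    transition: Callable[[int, int], Sequence[float]]   # P(. | s, a): distribution over states
    rho0: Sequence[float]                               # starting-state distribution

    def step(self, s: int, a: int) -> tuple[int, float]:
        # Markov property: the next state depends only on (s, a), not on history.
        s_next = random.choices(self.states, weights=self.transition(s, a))[0]
        return s_next, self.reward(s, a, s_next)
```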

37
Reinforcement learning: A demo

Source: YouTube
38
Reinforcement learning: A demo

Source: YouTube 39
Reinforcement learning: A demo

Learning to run – an example of reinforcement learning


Deepsense.ai

40
Source: YouTube
41
Multi-armed bandit
problem
Original material: The Multi-Armed Bandit Problem and Its Solutions
Exploration – Exploitation Dilemma
• The tradeoff between exploration and exploitation is one of
the challenges that arise in RL.

• Exploitation: An RL agent must prefer actions that it has tried
in the past and found to be effective in producing reward.
• Exploration: It must also try actions it has not selected before
to make better action selections in the future.

• One cannot focus solely on either exploration or exploitation
without experiencing failure in the task.

43
Exploration – Exploitation Dilemma
• The agent must try a variety of actions while progressively
favoring those that appear to be best.
• On a stochastic task, each action must be tried many times
to gain a reliable estimate of its expected reward.

Micromouse (Wikipedia)
44
Multi-armed bandit problem
• Imagine you are in a casino facing multiple slot machines.
• Each is configured with an unknown probability of yielding
a reward on a single play.

The reward probabilities are unknown to the player.

What is the best strategy to achieve the highest long-term reward?


45
Problem definition
• There are 𝐾 slot machines with reward probabilities, 𝜃1 , … , 𝜃𝐾 .
• At each time step 𝑡, the agent takes an action 𝑎 on one machine
and receives a reward 𝑟 in a stochastic fashion.
• A Bernoulli multi-armed bandit can be described as a tuple
⟨𝓐, 𝓡⟩, where:
• 𝓐 is a set of actions, each referring to one machine.
• The value of an action 𝑎 is the expected reward, 𝑄(𝑎) = 𝔼[𝑟 | 𝑎] = 𝜃.
• If the action 𝑎𝑖 at time step 𝑡 is on the 𝑖-th machine, then 𝑄(𝑎𝑖) = 𝜃𝑖.
• 𝓡 is a reward function. At time step 𝑡, 𝑟𝑡 = 𝓡(𝑎𝑡) returns
reward 1 with probability 𝑄(𝑎𝑡) and 0 otherwise.
• This simplifies the Markov decision process, as there is no state space 𝓢.

46
Problem definition
• The goal is to maximize the cumulative reward $\sum_{t=1}^{T} r_t$.
• The optimal reward probability 𝜃* of the optimal action 𝑎* is

$\theta^{*} = Q(a^{*}) = \max_{a \in \mathcal{A}} Q(a) = \max_{1 \le i \le K} \theta_i$

• The loss function is the total regret we may incur by not selecting
the optimal action up to time step 𝑇 (see the sketch below):

$\mathcal{L}_T = \mathbb{E}\left[ \sum_{t=1}^{T} \left( \theta^{*} - Q(a_t) \right) \right]$
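
A Bernoulli bandit takes only a few lines of code. The sketch below sets up the environment and the regret bookkeeping described above; the reward probabilities and the purely random player are arbitrary examples.

```python
import random

class BernoulliBandit:
    def __init__(self, probas):
        self.probas = probas                  # theta_1, ..., theta_K (unknown to the player)
        self.best_proba = max(probas)         # theta* = max_i theta_i

    def pull(self, i):
        # R(a_i): reward 1 with probability theta_i, else 0.
        return 1 if random.random() < self.probas[i] else 0

bandit = BernoulliBandit([0.1, 0.5, 0.8])     # example machines
regret = 0.0
for t in range(100):
    i = random.randrange(3)                   # a naive, purely random player
    r = bandit.pull(i)
    regret += bandit.best_proba - bandit.probas[i]   # theta* - Q(a_t)
print("total regret after 100 pulls:", regret)
```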

47
Bandit strategies

There are several ways to solve the multi-armed bandit
problem, based on how we do exploration:

• No exploration: the naivest and worst approach
• Exploration at random
• Exploration smartly, with preference to uncertainty

48
ε-greedy algorithm
• Take the best action most of the time, but occasionally do
random exploration.
• The action value is estimated from past experience:

$\hat{Q}_t(a) = \dfrac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \mathbb{1}[a_\tau = a]$

• 𝟏 is the binary indicator function and 𝑁𝑡(𝑎) is how many times the
action 𝑎 has been selected so far, $N_t(a) = \sum_{\tau=1}^{t} \mathbb{1}[a_\tau = a]$.
• We take a random action with a small probability ε.
• Otherwise, we pick the best option learnt so far (see the sketch below):

$\hat{a}_t^{*} = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a)$
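
A minimal ε-greedy solver for the BernoulliBandit sketched earlier; ε and the number of steps are arbitrary choices.

```python
import random

def epsilon_greedy(bandit, n_arms, steps=10000, eps=0.1):
    Q = [0.0] * n_arms               # Q_hat_t(a): running average of observed rewards
    N = [0] * n_arms                 # N_t(a): number of times each arm was pulled
    for t in range(steps):
        if random.random() < eps:                        # explore with probability eps
            a = random.randrange(n_arms)
        else:                                            # otherwise exploit the best estimate
            a = max(range(n_arms), key=lambda i: Q[i])
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]    # incremental update of the sample mean
    return Q, N
```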
49
Upper Confidence Bounds
• Random exploration may end up repeating a bad action that we
have already confirmed to be bad in the past.
• There are several solutions to address the problem.

• Decrease the parameter ε over time.

• Favor exploring actions that are most likely to have an
optimal value.

50
Upper Confidence Bounds
• Upper Confidence Bounds (UCB) measures this potential by an upper
confidence bound of the reward value, $\hat{U}_t(a)$, such that
$Q(a) \le \hat{Q}_t(a) + \hat{U}_t(a)$ with high probability.
• $\hat{U}_t(a)$ is a function of $N_t(a)$.
• A large number of trials $N_t(a)$ should give us a smaller bound $\hat{U}_t(a)$.
• We always select the greediest action to maximize the upper
confidence bound:

$a_t^{UCB} = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a) + \hat{U}_t(a)$

• Yet, how do we estimate the upper confidence bound?


51
Hoeffding’s Inequality
• Assume that there is no prior knowledge of what the
distribution looks like.
• Let’s consider Hoeffding’s Inequality – a theorem applicable
to any bounded distribution.

• Let 𝑋1, …, 𝑋𝑡 be independent and identically distributed
random variables, all bounded by [0, 1].
• The sample mean is $\bar{X}_t = \dfrac{1}{t} \sum_{\tau=1}^{t} X_\tau$.
• Then, for 𝑢 > 0, we have $\mathbb{P}\left[ \mathbb{E}[X] > \bar{X}_t + u \right] \le e^{-2tu^2}$.

52
Hoeffding’s Inequality to UCB
• Given one target action 𝑎, we define the following terms:
• 𝑟𝑡(𝑎): the random variables (the rewards observed for 𝑎)
• 𝑄(𝑎) and $\hat{Q}_t(a)$: the true mean and the sample mean, respectively
• 𝑢 = 𝑈𝑡(𝑎): the upper confidence bound

• Then, we have $\mathbb{P}\left[ Q(a) > \hat{Q}_t(a) + U_t(a) \right] \le e^{-2t U_t^2(a)}$.

• $e^{-2t U_t^2(a)}$ should be a small probability.
• We want to pick a bound so that, with high probability, the true mean is
below the sample mean plus the upper confidence bound.
• Finally, with a small threshold $p = e^{-2t U_t^2(a)}$, we get

$U_t(a) = \sqrt{\dfrac{-\ln p}{2 N_t(a)}}$

53
From UCB to UCB1
• One heuristic is to reduce the threshold 𝑝 over time.
• The bound estimate becomes more confident as more rewards are observed.
• Setting $p = t^{-4}$, we get the UCB1 algorithm:

$U_t(a) = \sqrt{\dfrac{2 \ln t}{N_t(a)}}$ and $a_t^{UCB1} = \arg\max_{a \in \mathcal{A}} \hat{Q}_t(a) + \sqrt{\dfrac{2 \ln t}{N_t(a)}}$
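
A sketch of UCB1 built on the same bookkeeping as the ε-greedy solver above; each arm is pulled once first so that 𝑁𝑡(𝑎) > 0, and the step count is arbitrary.

```python
import math

def ucb1(bandit, n_arms, steps=10000):
    Q = [0.0] * n_arms               # Q_hat_t(a): running average of observed rewards
    N = [0] * n_arms                 # N_t(a): pull counts
    for a in range(n_arms):          # pull every arm once so N(a) > 0
        Q[a] = bandit.pull(a)
        N[a] = 1
    for t in range(n_arms + 1, steps + 1):
        # a_t = argmax_a Q_hat(a) + sqrt(2 ln t / N(a))
        a = max(range(n_arms),
                key=lambda i: Q[i] + math.sqrt(2 * math.log(t) / N[i]))
        r = bandit.pull(a)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]    # incremental update of the sample mean
    return Q, N
```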

54
Bayesian UCB
• If we knew the reward distribution upfront, we would be able
to make a better bound estimate.

Image credit: Lil'log

When the expected reward has a Gaussian distribution, 𝜎(𝑎𝑖) is the standard deviation
and 𝑐𝜎(𝑎𝑖) is the upper confidence bound. The constant 𝑐 is an adjustable hyperparameter.
A simple experiment

The result of a small experiment on solving a Bernoulli bandit with K = 10 slot machines with
reward probabilities {0.0, 0.1, 0.2, ..., 0.9}. Each solver runs 10000 steps.

(Left) The plot of time step vs. the cumulative regret.
(Middle) The plot of true reward probability vs. estimated probability.
(Right) The fraction of times each action is picked during the 10000-step run.

56
Acknowledgements
• Some parts of the slide are adapted from
• Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An
introduction. Second edition. The MIT Press, 2018.
• The Multi-Armed Bandit Problem and Its Solutions (link)
• OpenAI Spinning Up: Introduction to RL (link)
• Deep learning in a nutshell: Reinforcement learning (link)

57
List of references
• Best benchmarks for reinforcement learning: the ultimate list (link)
• Introducing planet: A deep planning network for reinforcement learning
(link)
• Beginner’s guide to policy in reinforcement learning (link)
• RL Course by David Silver - Lecture 9: Exploration and Exploitation (link)

58