Reinforcement learning: Basic concepts
Reinforcement learning (RL)
• An agent learns to interact with an environment based on
feedback signals it receives from the environment.
RL as a trial-and-error process
• The learner is not told which actions to take; instead, it discovers which actions yield the most reward by trying them.
Reinforcement learning is the culmination of many fields and has a rich history in optimization and behavioral psychology.
Concept vocabulary
• Let’s frame the concepts in terms of a video game, Mario!
Agent → actions → Environment (a game level)
Concepts: Agent and Environment
• At every step, the agent perceives a (possibly partial)
observation of the state of the world.
• E.g., in the game Mario, the environment is the whole game level,
yet Mario can only see a part of the scene.
• Then, it picks one of the possible actions, which causes the environment to change from one state to another.
• E.g., in the game Mario, a state is the combination of {Mario, action,
environment}.
• The agent can observe this change, use it as a feedback
signal, and learn from it.
Concepts: State and Observation
• A state 𝑠 is a complete description of the state of the world.
• There is no information about the world that is hidden from the state.
• An observation 𝑜 is a partial description of a state, which
may omit information.
• Fully-/Partially- observable environment: whether the agent observes
the complete state of the environment (e.g., chess vs. poker).
Concepts: Action space
• The action space includes the set of all valid actions in a
given environment.
• Discrete action space: only a finite number of moves are
available to the agent.
• E.g., Atari and Go
• Continuous action space: actions are real-valued vectors.
• E.g., a robotic agent operating in the physical world (a short sketch contrasting the two cases follows below)
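As a concrete illustration, here is a minimal sketch that inspects one discrete and one continuous action space using the Gymnasium API (the library and environment names are illustrative assumptions, not part of the slides):

```python
import gymnasium as gym

# Discrete action space: CartPole offers two actions (push left, push right).
cartpole = gym.make("CartPole-v1")
print(cartpole.action_space)            # Discrete(2)
print(cartpole.action_space.sample())   # e.g., 0 or 1

# Continuous action space: Pendulum expects a real-valued torque vector.
pendulum = gym.make("Pendulum-v1")
print(pendulum.action_space)            # Box(-2.0, 2.0, (1,), float32)
print(pendulum.action_space.sample())   # e.g., array([0.73], dtype=float32)
```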
Concepts: A formal description
• At 𝑡 = 0, the agent does not know what action to take.
• Thus, it can take a random action, or follow another strategy if there is preliminary knowledge.
• At a time step 𝑡, the agent performs an action 𝑎𝑡.
• At the next time step 𝑡+1, the agent perceives its new state 𝑠𝑡+1 and considers the reward 𝑟𝑡+1 received from the environment.
• The environment at time step 𝑡+1 is the result of action 𝑎𝑡.
• If the rewards get smaller, the agent will choose another action.
• This process is repeated until the agent completes its episode (a minimal interaction loop is sketched below).
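A minimal sketch of this interaction loop in Python, assuming a Gymnasium-style environment and a purely random policy (both are illustrative assumptions, not prescribed by the slides):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset()        # s_0, sampled from the start-state distribution
total_reward = 0.0
done = False

while not done:
    action = env.action_space.sample()                            # a_t (random policy)
    obs, reward, terminated, truncated, info = env.step(action)   # s_{t+1}, r_{t+1}
    total_reward += reward
    done = terminated or truncated                                # episode ends

print("Return of this episode:", total_reward)
```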
Concepts: Another example
The biggest reward comes when the rider reaches the customer and delivers the food; penalties in between can come in the form of taking a wrong turn, getting caught in a traffic jam, etc., which prevent the rider from completing the task.
Concepts: Policy
• A policy is a mapping from the perceived states of the
environment to actions to be taken when in those states.
A policy might be that, given a certain tile, the agent moves in a certain direction.
Concepts: Policy
• The concept of policy corresponds to what in psychology is called a set of stimulus–response rules (or associations).
Concepts: Trajectories
• A trajectory 𝜏 = (𝑠0, 𝑎0, 𝑠1, 𝑎1, …) is a sequence of states and actions in the world.
• The first state of the world, 𝑠0, is randomly sampled from the start-state distribution 𝜌0:
$s_0 \sim \rho_0(\cdot)$
• State transitions are what happens to the world between the state at time 𝑡, 𝑠𝑡, and the state at 𝑡+1, 𝑠𝑡+1.
• They are governed by the natural laws of the environment and depend only on the most recent action, 𝑎𝑡 (a small numerical sketch follows below):
$s_{t+1} = f(s_t, a_t)$ (deterministic) or $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ (stochastic)
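To make the sampling concrete, here is a small sketch of a toy tabular environment in Python; the states, actions, and transition probabilities are invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 3, 2
rho0 = np.array([0.8, 0.2, 0.0])                 # start-state distribution rho_0
# P[s, a, :] is the distribution over next states after taking action a in state s
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

def sample_trajectory(T=5):
    s = rng.choice(n_states, p=rho0)             # s_0 ~ rho_0
    tau = [s]
    for _ in range(T):
        a = rng.integers(n_actions)              # a_t from a uniform random policy
        s = rng.choice(n_states, p=P[s, a])      # s_{t+1} ~ P(. | s_t, a_t)
        tau += [a, s]
    return tau

print(sample_trajectory())                        # alternating states and actions
```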
Concepts: Reward signal
• A reward signal defines the goal of an RL problem, i.e., what the good and bad events are for the agent.
Concepts: Reward signal
• The reward signal is the primary basis for altering the policy.
• If an action selected by the policy is followed by low reward,
the policy should be modified to select some other action in
that situation in the future.
• A reward signal may be described as a stochastic function of
the state of the environment and the actions taken.
Concepts: Reward signal
• The reward function is defined as $r_t = R(s_t, a_t, s_{t+1})$.
• Finite-horizon undiscounted return: the sum of rewards obtained in a fixed window of steps.
$R(\tau) = \sum_{t=0}^{T} r_t$
• Infinite-horizon discounted return: the sum of all rewards, but discounted by how far off in the future they're obtained.
$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$
• 𝛾: the discount factor in (0, 1) (a small sketch of both return computations follows below)
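A minimal sketch of both return definitions in Python; the reward sequence and discount factor are made-up example values:

```python
def undiscounted_return(rewards):
    """Finite-horizon undiscounted return: a plain sum over a fixed window."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    """Discounted return: each reward r_t is weighted by gamma**t."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 5.0]          # example rewards r_0 .. r_3
print(undiscounted_return(rewards))      # 6.0
print(discounted_return(rewards, 0.9))   # 1.0 + 0.9**3 * 5.0 = 4.645
```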
Concepts: Reward signal
• The agent's sole objective is to maximize the total reward it receives over the long run (i.e., over whole trajectories).
Concepts: Expected return
• Consider any choice of policy and any return measure (infinite-horizon discounted or finite-horizon undiscounted).
• The agent aims to select a policy which maximizes the expected return when it acts according to it.
Concepts: Expected return
• The probability of a 𝑇-step trajectory under a policy 𝜋 is:
$P(\tau \mid \pi) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)$
• The expected return, 𝐽(𝜋), is then:
$J(\pi) = \int_{\tau} P(\tau \mid \pi)\, R(\tau) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$
• The central optimization problem can then be expressed as
$\pi^* = \arg\max_{\pi} J(\pi)$
• with 𝜋* being the optimal policy (a Monte Carlo estimate of 𝐽(𝜋) is sketched below).
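Because the expectation is over trajectories, 𝐽(𝜋) can be approximated by averaging returns over sampled episodes. A minimal sketch, assuming a Gymnasium-style environment and a random policy standing in for 𝜋 (both assumptions for illustration):

```python
import gymnasium as gym
import numpy as np

def estimate_expected_return(env, policy, n_episodes=100, gamma=0.99):
    """Monte Carlo estimate of J(pi): average discounted return over episodes."""
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ret, t = False, 0.0, 0
        while not done:
            action = policy(obs)
            obs, reward, terminated, truncated, _ = env.step(action)
            ret += gamma**t * reward
            t += 1
            done = terminated or truncated
        returns.append(ret)
    return np.mean(returns)

env = gym.make("CartPole-v1")
random_policy = lambda obs: env.action_space.sample()   # stand-in for pi
print(estimate_expected_return(env, random_policy))
```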
Concepts: Value
• The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
Each square is a state: S is the start state, G the goal state, T squares are traps, and black squares cannot be entered. The rewards (for the traps and the goal state) are initialized, and these values spread over time until an equilibrium is reached. Depending on the penalty for traps and the reward for the goal, different solution patterns can emerge; the last two grids show such solution states.
Concepts: Value
• Rewards are primary, while values, as predictions of rewards, are secondary.
• There could be no values without rewards, and the only purpose of estimating values is to achieve more reward.
• Nevertheless, when making and evaluating decisions, we are most concerned with values, not rewards alone.
• That is, action choices are made based on value judgments.
• It is much harder to determine values than rewards.
Concepts: Value functions
• On-policy value function: gives the expected return if the agent starts in state 𝑠 and always acts according to policy 𝜋.
$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]$
• Optimal value function: the on-policy value function under the optimal policy.
$V^{*}(s) = \max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[ R(\tau) \mid s_0 = s \right]$
Concepts: Value functions
• These value functions obey the Bellman equations, which express a self-consistency condition:
The value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.
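Written out for the on-policy value function with the infinite-horizon discounted return (a standard form, stated here for completeness rather than taken from the slide), this self-consistency condition reads:

$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\ s' \sim P(\cdot \mid s, a)}\left[ R(s, a, s') + \gamma\, V^{\pi}(s') \right]$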
Concepts: Model of environment
• The model allows inferences to be made about how the
environment will behave.
• E.g., given a state and action, the model might predict the resultant
next state and next reward.
Concepts: Model of environment
• Modern RL spans the spectrum from low-level, trial-and-
error learning to high-level, deliberative planning.
• Model-free methods: trial-and-error learning
• Model-based methods: use models and planning
• Some systems simultaneously learn by trial and error, learn
a model of the environment, and use the model for planning.
A taxonomy of RL algorithms
Reinforcement learning: A demo
Source: YouTube
Multi-armed bandit problem
Original material: The Multi-Armed Bandit Problem and Its Solutions
Exploration – Exploitation Dilemma
• The tradeoff between exploration and exploitation is one of
the challenges that arise in RL.
Exploration – Exploitation Dilemma
• The agent must try a variety of actions while progressively
favoring those that appear to be best.
• On a stochastic task, each action must be tried many times
to gain a reliable estimate of its expected reward.
Micromouse (Wikipedia)
Multi-armed bandit problem
• Imagine you are in a casino facing multiple slot machines.
• Each machine is configured with an unknown probability of yielding a reward on a single play.
Problem definition
• The goal is to maximize the cumulative reward $\sum_{t=1}^{T} r_t$.
• The optimal reward probability 𝜃* of the optimal action 𝑎* is
$\theta^* = Q(a^*) = \max_{a \in \mathcal{A}} Q(a) = \max_{1 \le i \le K} \theta_i$
• The loss function is the total regret we may incur by not selecting the optimal action up to the time step 𝑇:
$\mathcal{L}_T = \mathbb{E}\left[ \sum_{t=1}^{T} \left( \theta^* - Q(a_t) \right) \right]$
Bandit strategies
• No exploration: the naivest and worst approach.
• Exploration at random.
• Exploration smartly, with preference to uncertainty.
ε-greedy algorithm
• Take the best action most of the time, but occasionally do random exploration.
• The action value is estimated from past experience:
$Q_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau\, \mathbb{1}[a_\tau = a]$
• $\mathbb{1}$ is the binary indicator function and $N_t(a)$ is how many times the action 𝑎 has been selected so far, $N_t(a) = \sum_{\tau=1}^{t} \mathbb{1}[a_\tau = a]$.
• With a small probability ε, we take a random action.
• Otherwise, we pick the best option learnt so far (a minimal implementation is sketched below):
$\hat{a}^*_t = \arg\max_{a \in \mathcal{A}} Q_t(a)$
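A minimal sketch of ε-greedy on a Bernoulli bandit; the reward probabilities below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.10, 0.45, 0.60, 0.80])   # unknown to the agent
K, T, eps = len(true_probs), 5000, 0.1

Q = np.zeros(K)   # estimated action values Q_t(a)
N = np.zeros(K)   # pull counts N_t(a)

for t in range(T):
    if rng.random() < eps:
        a = rng.integers(K)                       # explore: random action
    else:
        a = int(np.argmax(Q))                     # exploit: best action so far
    r = float(rng.random() < true_probs[a])       # Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]                     # incremental sample mean

print("Estimated values:", np.round(Q, 2))
print("Most-pulled arm:", int(np.argmax(N)))      # should tend toward arm 3
```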
Upper Confidence Bounds
• Random exploration may end up selecting an action that we have already confirmed to be bad in the past.
• There are several strategies to address this problem.
Upper Confidence Bounds
• The Upper Confidence Bounds (UCB) algorithm favors actions whose value estimate is still uncertain: it measures this potential by an upper confidence bound of the reward value, $U_t(a)$, such that $Q(a) \le Q_t(a) + U_t(a)$ holds with high probability.
• $U_t(a)$ is a function of $N_t(a)$: a large number of trials $N_t(a)$ should give us a smaller bound $U_t(a)$.
• We always select the greediest action to maximize the upper confidence bound:
$a^{UCB}_t = \arg\max_{a \in \mathcal{A}} \left[ Q_t(a) + U_t(a) \right]$
Hoeffding’s Inequality to UCB
• Given one target action 𝑎, we define the following terms:
• $r_t(a)$: the reward random variables
• $Q(a)$ and $Q_t(a)$: the true mean and the sample mean, respectively
• $u = U_t(a)$: the upper confidence bound
• Then, by Hoeffding's inequality, $\mathbb{P}\left[ Q(a) > Q_t(a) + U_t(a) \right] \le e^{-2 t U_t(a)^2}$.
• We want to pick a bound so that, with high probability, the true mean is below the sample mean plus the upper confidence bound, so $e^{-2 t U_t(a)^2}$ should be a small probability.
• Finally, setting a small threshold $p = e^{-2 t U_t(a)^2}$ and solving for the bound gives
$U_t(a) = \sqrt{\frac{-\ln p}{2 N_t(a)}}$
From UCB to UCB1
• One heuristic is to reduce the threshold 𝑝 over time, since the bound estimate becomes more confident as more rewards are observed.
• Setting $p = t^{-4}$ gives the UCB1 algorithm (a minimal implementation is sketched below):
$U_t(a) = \sqrt{\frac{2 \ln t}{N_t(a)}}$ and $a^{UCB1}_t = \arg\max_{a \in \mathcal{A}} \left[ Q_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}} \right]$
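A minimal sketch of UCB1 on the same kind of Bernoulli bandit as above (arm probabilities again invented; each arm is pulled once first so that $N_t(a) > 0$):

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.10, 0.45, 0.60, 0.80])   # unknown to the agent
K, T = len(true_probs), 5000

Q = np.zeros(K)   # sample means Q_t(a)
N = np.zeros(K)   # pull counts N_t(a)

for t in range(1, T + 1):
    if t <= K:
        a = t - 1                                  # pull each arm once to initialize
    else:
        ucb = Q + np.sqrt(2.0 * np.log(t) / N)     # Q_t(a) + U_t(a)
        a = int(np.argmax(ucb))
    r = float(rng.random() < true_probs[a])        # Bernoulli reward
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print("Pull counts per arm:", N.astype(int))       # the best arm should dominate
```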
Bayesian UCB
• If we know the reward distribution upfront, we can make a better bound estimate.
• E.g., when the expected reward follows a Gaussian distribution, 𝜎(𝑎𝑖) is its standard deviation and 𝑐𝜎(𝑎𝑖) is the upper confidence bound; the constant 𝑐 is an adjustable hyperparameter.
A simple experiment
The result of a small experiment on solving a Bernoulli bandit with K = 10 slot machines with reward probabilities {0.0, 0.1, 0.2, ..., 0.9}. Each solver runs 10,000 steps.
Acknowledgements
• Some parts of the slides are adapted from:
• Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An
introduction. Second edition. The MIT Press, 2018.
• The Multi-Armed Bandit Problem and Its Solutions (link)
• OpenAI Spinning Up: Introduction to RL (link)
• Deep learning in a nutshell: Reinforcement learning (link)