
Reinforcement Learning

Monte Carlo Methods

Pablo Zometa – Department of Mechatronics – GIU Berlin


Monte Carlo Methods

In general: the Monte Carlo method is a statistical technique used to find an approximate solution through sampling.
This family of methods is named after the Casino de Monte-Carlo, Monaco.

In the context of reinforcement learning:
▶ Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
▶ Dynamic Programming (DP) is a model-based method: value functions are computed from knowledge of the MDP (planning).
▶ Monte Carlo methods are model-free: value functions are learned from sample returns obtained by interacting with the MDP (learning).
▶ Both DP and MC fit into the generalized policy iteration (GPI) framework.



Learning from episodic tasks

Some agent/environment interactions can be easily broken down into episodes. In episodic tasks:
▶ all episodes terminate regardless of the actions taken,
▶ a new episode begins independently of how previous episodes ended,
▶ experience is acquired one episode (trajectory) at a time,
▶ value estimates and policies are changed only after completing an episode.

Examples of episodic tasks:


▶ Maze, path planning
▶ Video games



Monte Carlo Prediction

We want to estimate vπ(s) given a set of N episodes obtained by following π and passing through s:
▶ Each occurrence of state s in an episode is called a visit to s.
▶ In the same episode, s may be visited multiple times.
▶ The first time s is visited in an episode is called the first visit to s.
Two MC variants:
▶ The first-visit MC method estimates vπ(s) ≈ E(Gt | St = s) as the average of the returns following first visits to s. Subsequent visits to s in the same episode are not taken into account when computing the average. There is only one return Gt(s) per episode.
▶ The every-visit MC method averages the returns following all visits to s. One episode may contribute more than one return, e.g. Gt(s), Gt+k(s), etc., to the average.



Exercise: Monte Carlo prediction for 3-state grid world
The agent follows a stochastic policy π:
[Grid world figure: states A and B in the top row, terminal state C in the row below A]

  s   a   π(a|s)
  A   D   0.7
  A   R   0.3
  B   D   1.0
State C is a terminal state: once we reach it, the episode terminates.
Visiting state B gives a reward of −1. The reward for visiting C is +1.

The agent has no knowledge of the MDP. We denote the return for state
s in trajectory i as Gs,i . Starting from S0 = A and under policy π, we
obtain the trajectories and returns (γ = 1) of the first N = 3 episodes:

Episode 0:  A —D→ C (reward +1),                       GA,0 = 1
Episode 1:  A —R→ B (reward −1) —D→ C (reward +1),     GA,1 = 0
Episode 2:  A —D→ C (reward +1),                       GA,2 = 1
Exercise: Monte Carlo prediction for 3-state grid world

From these N = 3 episodes, the agent can estimate the state value
function for A under policy π:
v_\pi(A) = \mathbb{E}_\pi[G_t \mid S_t = A] \approx \frac{1}{N} \sum_{i=0}^{N-1} G_{A,i} = \frac{1}{3}\,(G_{A,0} + G_{A,1} + G_{A,2}) = \frac{2}{3}

By increasing the number of episodes N, the agent can get a better estimate of vπ(A).

Similarly, the value function for state B under π can be estimated by the
agent from experience:
v_\pi(B) = \mathbb{E}_\pi[G_t \mid S_t = B] \approx \frac{1}{N_B} \sum_{j=0}^{N_B-1} G_{B,j} = 1

where N_B is the number of episodes that visit B (here only episode 1 visits B, with GB,1 = 1).
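To make the computation concrete, here is a minimal Python sketch (not part of the original slides) that reproduces both estimates from the three episodes above; encoding an episode as (state, reward received on leaving that state) pairs is my own convention:

```python
# Monte Carlo prediction for the 3-state grid world (gamma = 1).
# Each episode is a list of (state, reward received on leaving that state) pairs,
# matching the three trajectories above; the terminal state C is not listed.
from collections import defaultdict

episodes = [
    [("A", +1)],             # A -D-> C,          G_{A,0} = 1
    [("A", -1), ("B", +1)],  # A -R-> B -D-> C,   G_{A,1} = 0, G_{B,1} = 1
    [("A", +1)],             # A -D-> C,          G_{A,2} = 1
]

returns = defaultdict(list)          # state -> list of sampled returns
for episode in episodes:
    G = 0.0
    # Walk the episode backwards, accumulating the (undiscounted) return.
    for state, reward in reversed(episode):
        G += reward
        returns[state].append(G)

for state, gs in sorted(returns.items()):
    print(f"v_pi({state}) ~= {sum(gs) / len(gs):.3f}")   # A: 0.667, B: 1.000
```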



First-visit Monte Carlo prediction
In the first-visit Monte Carlo prediction, if the same state is visited again
in the same episode, we don’t use the return for that state again to
estimate the expected return (the average).

Example: In episode k for a policy π, this trajectory is generated:


S0 = w, A0 = 1, R1 = 4,
S1 = x, A1 = 0, R2 = 3,
S2 = y, A2 = 2, R3 = 2,
S3 = x, A3 = 2, R4 = 1, S4 = z.
In this episode, state x is visited twice: at t = 1 and t = 3. For an
undiscounted return (i.e., γ = 1), the return for the first visit to x
(t = 1) is G1 = 3 + 2 + 1 = 6, whereas for t = 3 it is G3 = 1.
In first-visit MC, we use only G1 as the return for episode k, that is,
Gx,k = G1 = 6, to compute the average return over all episodes for x.
This criterion is used for all N episodes:
v_\pi(x) = \mathbb{E}_\pi[G_t \mid S_t = x] \approx \frac{1}{N} \sum_{j=0}^{N-1} G_{x,j}
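A possible Python sketch of first-visit MC prediction on the example trajectory above, assuming γ = 1; the (state, reward) episode encoding and all names are illustrative:

```python
# First-visit Monte Carlo prediction (gamma = 1) on the example episode.
# Each step is (S_t, R_{t+1}); the terminal state z is not listed.
episode = [("w", 4), ("x", 3), ("y", 2), ("x", 1)]
gamma = 1.0

# Returns G_t for every time step, computed backwards: G_t = R_{t+1} + gamma * G_{t+1}.
G, returns_per_step = 0.0, []
for state, reward in reversed(episode):
    G = reward + gamma * G
    returns_per_step.append((state, G))
returns_per_step.reverse()      # [("w", 10.0), ("x", 6.0), ("y", 3.0), ("x", 1.0)]

# Keep only the return following the FIRST visit to each state in this episode.
first_visit_returns = {}
for state, G_t in returns_per_step:
    first_visit_returns.setdefault(state, G_t)

print(first_visit_returns)      # {'w': 10.0, 'x': 6.0, 'y': 3.0}
# Across N episodes, v_pi(x) is estimated as the average of these per-episode returns.
```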
Every-visit Monte Carlo prediction
In the every-visit Monte Carlo prediction, if the same state is visited
many times in the same episode, we use all returns for that state to
estimate the expected return (the average).

Example: For the previous example with trajectory:


S0 = w, A0 = 1, R1 = 4,
S1 = x, A1 = 0, R2 = 3,
S2 = y, A2 = 2, R3 = 2,
S3 = x, A3 = 2, R4 = 1, S4 = z.
the return for the first visit to x (t = 1) is G1 = 3 + 2 + 1 = 6, whereas
for t = 3 it is G3 = 1.
In every-visit MC, we use the values of both G1 and G3 as returns for
episode k, that is, Gx,k,0 = G1 = 6 and Gx,k,1 = G3 = 1, to compute the
average return over all episodes for x. This criterion is used for all N
episodes, with episode j contributing nj returns:
v_\pi(x) = \mathbb{E}_\pi[G_t \mid S_t = x] \approx \frac{1}{n_0 + n_1 + \cdots + n_{N-1}} \sum_{j=0}^{N-1} \sum_{i=0}^{n_j - 1} G_{x,j,i}
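For comparison, a sketch of the every-visit variant on the same episode; the encoding is the same illustrative one used above:

```python
# Every-visit Monte Carlo prediction (gamma = 1) on the same example episode.
from collections import defaultdict

episode = [("w", 4), ("x", 3), ("y", 2), ("x", 1)]
gamma = 1.0

all_returns = defaultdict(list)       # state -> returns from ALL visits
G = 0.0
for state, reward in reversed(episode):
    G = reward + gamma * G
    all_returns[state].append(G)

print(dict(all_returns))              # {'x': [1.0, 6.0], 'y': [3.0], 'w': [10.0]}
# Pooling these lists over all N episodes and averaging gives the every-visit estimate;
# from this single episode alone, v_pi(x) ~= (6 + 1) / 2 = 3.5.
```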
Monte Carlo Control
Recall that prediction refers to the estimation of state-value functions
vπ (s), ∀ s ∈ S, for a given policy π. Control refers to finding an optimal
policy. In the case of Monte Carlo control, we can in general only find an
approximate optimal policy.

Both DP and MC control follow the generalized policy iteration:


▶ in DP, because we have a model of the environment p(s′ , r|s, a), we
can extract an optimal policy knowing only the optimal state-value
function v∗ (s) (recall from DP value iteration):
\pi_*(s) = \arg\max_a q_*(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl(r + \gamma\, v_*(s')\bigr)

▶ In MC, as in other model-free methods, we explicitly work with state-action value functions qπ(s, a): no need for a model.
▶ In MC, however, qπ(s, a) is estimated by the agent through experience with the environment (sampling the MDP), for all s ∈ S, a ∈ A(s) (a big assumption!). In DP we can explore the whole state-action space systematically with the help of the model.
Generalized Policy Iteration: MC vs DP
To do policy improvement, a greedy policy is used in both cases, i.e.:
\pi(s) = \arg\max_a q(s, a)

In DP:
\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_{\pi_*}
Both policy evaluation and improvement rely on the Bellman equation:
v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl(r + \gamma\, v_k(s')\bigr)

\pi_{k+1}(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl(r + \gamma\, v_{\pi_k}(s')\bigr)
In MC:
\pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_{\pi_*}
πk+1(s) is the greedy policy with respect to qk(s, a), for all s ∈ S, a ∈ A(s):

\pi_{k+1}(s) = \arg\max_a q_k(s, a)
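A minimal sketch of this greedy improvement step, assuming the agent stores qk(s, a) as a nested dictionary q[s][a]; the data structure and example values are illustrative, not from the slides:

```python
# Greedy policy improvement from an action-value table q[s][a].
def greedy_policy(q):
    """Return pi(s) = argmax_a q(s, a) for every state in the table."""
    return {s: max(actions, key=actions.get) for s, actions in q.items()}

# Example q-table for the 3-state grid world (made-up values):
q = {"A": {"D": 0.9, "R": -0.1}, "B": {"D": 1.0}}
print(greedy_policy(q))   # {'A': 'D', 'B': 'D'}
```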



GPI: Monte Carlo learning
Similarly to value iteration in DP, MC learning is a special case of GPI in which
policy evaluation (prediction) is done using a single episode; afterwards,
policy improvement is done at all states visited in that episode.
Assumption: we have a good estimate of qk(s, a) for all s ∈ S, a ∈ A(s).
How can we achieve this?
▶ Exploring starts: start each episode from a random state-action pair,
i.e., such that every pair s ∈ S, a ∈ A(s) has a non-zero probability of
being the start (see the sketch after this list). This guarantees that all
states are visited (not always practically possible) and that all actions
available at each state are taken.
▶ On-policy methods: attempt to evaluate or improve the policy that is
used to make decisions.
▶ Off-policy methods: evaluate or improve a policy (the target policy)
different from the one used to generate the data (the behaviour policy).
Both on-policy and off-policy methods face a tradeoff between exploration
and exploitation.
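As referenced in the exploring-starts bullet, here is a rough Python sketch of Monte Carlo control with exploring starts (first-visit updates) on the 3-state grid world from the earlier exercise; the transition encoding, helper names, and episode format (S_t, A_t, R_{t+1}) are my own assumptions, not part of the slides:

```python
# Sketch: Monte Carlo control with exploring starts (MC-ES), first-visit updates.
import random
from collections import defaultdict

STEP = {("A", "D"): ("C", +1), ("A", "R"): ("B", -1), ("B", "D"): ("C", +1)}
ACTIONS = {"A": ["D", "R"], "B": ["D"]}          # C is terminal

def sample_episode(s, a, policy):
    """Roll out one episode from the exploring start (s, a), then follow policy."""
    steps = []
    while True:
        s_next, r = STEP[(s, a)]
        steps.append((s, a, r))                  # (S_t, A_t, R_{t+1})
        if s_next == "C":
            return steps
        s, a = s_next, policy[s_next]

def mc_exploring_starts(num_episodes=1000, gamma=1.0):
    q = defaultdict(lambda: defaultdict(float))  # q[s][a] estimates
    returns = defaultdict(list)                  # (s, a) -> sampled returns
    policy = {s: random.choice(ACTIONS[s]) for s in ACTIONS}

    for _ in range(num_episodes):
        s0 = random.choice(list(ACTIONS))        # exploring start: random (s, a)
        a0 = random.choice(ACTIONS[s0])
        episode = sample_episode(s0, a0, policy)

        G = 0.0
        for t, (s, a, r) in reversed(list(enumerate(episode))):
            G = r + gamma * G
            if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:   # first visit
                returns[(s, a)].append(G)
                q[s][a] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(q[s], key=q[s].get)                 # greedy step
    return policy, q

print(mc_exploring_starts()[0])                  # expected: {'A': 'D', 'B': 'D'}
```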



On-policy learning: ϵ-soft policies
On-policy methods typically use soft policies:

\pi(a \mid s) > 0, \quad \forall\, s \in S,\ a \in A(s)

An ϵ-soft policy satisfies:

\pi(a \mid s) \ge \frac{\epsilon}{|A(s)|}, \quad \forall\, s \in S,\ a \in A(s), \text{ with } \epsilon > 0
A particular type of ϵ-soft policy is the ϵ-greedy policy.
An ϵ-greedy policy chooses a greedy action most of the time at any given
state, but with probability ϵ it selects an action at random at that state.
Formally, each nongreedy action gets the minimal probability

\pi(a \mid s) = \frac{\epsilon}{|A(s)|},

whereas the greedy action ā gets

\pi(\bar{a} \mid s) = 1 - \epsilon + \frac{\epsilon}{|A(s)|}.
The on-policy approach learns action values not for the optimal policy,
but for a near-optimal policy that still explores. On-policy methods
estimate the value of a policy while using it for control.
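A small sketch of drawing an action from an ϵ-greedy policy built on top of action-value estimates q(s, ·); the function and argument names are my own:

```python
# epsilon-greedy action selection over the action values q_s = {action: value}.
import random

def epsilon_greedy_action(q_s, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one,
    so each nongreedy action keeps probability epsilon / |A(s)| and the greedy action
    gets 1 - epsilon + epsilon / |A(s)|."""
    if random.random() < epsilon:
        return random.choice(list(q_s))   # explore: uniform over A(s)
    return max(q_s, key=q_s.get)          # exploit: greedy action

# Example: actions available in state A of the grid world (made-up values).
print(epsilon_greedy_action({"D": 0.9, "R": -0.1}, epsilon=0.2))
```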
Off-policy learning
We say that learning is from data ”off” the target policy, and the overall
process is termed off-policy learning. (Sutton and Barto)

In off-policy methods, two policies are used:
▶ target policy: the policy we want to learn, i.e., the one that should become the optimal policy
▶ behaviour policy: the policy used to explore (sample the environment), i.e., to generate behaviour.
Remarks:
▶ Off-policy methods are in general more complex and slower to converge,
but more powerful than on-policy methods.
▶ On-policy methods are a special case of off-policy methods in which
the target and behaviour policies are one and the same.
▶ In many cases of interest (control applications), the target policy
must be deterministic (e.g., greedy), whereas the behaviour policy
remains stochastic (e.g., ϵ-greedy).
▶ Behaviour data can come from a source external to the agent
(e.g., a human expert or an existing controller).
