
Reinforcement Learning

Monte Carlo Methods

Pablo Zometa – Department of Mechatronics – GIU Berlin


Monte Carlo Methods

In general: the Monte Carlo method is a statistical technique used to find an approximate solution through sampling.
This family of methods is named after the Casino de Monte-Carlo, Monaco.

In the context of reinforcement learning:
▶ Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns.
▶ Dynamic Programming (DP) is a model-based method: value functions are computed from knowledge of the MDP (planning).
▶ Monte Carlo methods are model-free: value functions are learned from sample returns obtained by interacting with the MDP (learning).
▶ Both DP and MC fit into the generalized policy iteration (GPI) framework.



Learning from episodic tasks

Some agent/environment interactions can be easily broken down into episodes. In episodic tasks:
▶ all episodes terminate regardless of the actions taken,
▶ a new episode begins independently of how previous episodes ended,
▶ experience is acquired one episode (trajectory) at a time,
▶ value estimates and policies are changed only after completing an episode.

Examples of episodic tasks:


▶ Maze, path planning
▶ Video games



Monte Carlo Prediction

We want to estimate vπ(s) given a set of N episodes obtained by following π and passing through s:
▶ Each occurrence of state s in an episode is called a visit to s.
▶ In the same episode, s may be visited multiple times.
▶ The first time s is visited in an episode is called the first visit to s.
Two MC variants:
▶ The first-visit MC method estimates vπ(s) ≈ E(Gt | St = s) as the average of the returns following first visits to s. Subsequent visits to s in the same episode are not taken into account when computing the average. There is only one return Gt(s) per episode.
▶ The every-visit MC method averages the returns following all visits to s. One episode may contribute more than one return, e.g. Gt(s), Gt+k(s), etc., to the average.



Exercise: Monte Carlo prediction for 3-state grid world
The agent follows a stochastic policy π:
[Grid world figure: states A and B in the top row, terminal state C in the row below A]

  s   a   π(a|s)
  A   D   0.7
  A   R   0.3
  B   D   1.0
State C is a terminal state: once we reach it, the episode terminates.
Visiting state B gives a reward of −1. The reward for visiting C is +1.

The agent has no knowledge of the MDP. We denote the return for state
s in trajectory i as Gs,i . Starting from S0 = A and under policy π, we
obtain the trajectories and returns (γ = 1) of the first N = 3 episodes:

Episode 0:  A —D→ C (reward +1),                       GA,0 = 1
Episode 1:  A —R→ B (reward −1) —D→ C (reward +1),     GA,1 = 0
Episode 2:  A —D→ C (reward +1),                       GA,2 = 1
Exercise: Monte Carlo prediction for 3-state grid world

From these N = 3 episodes, the agent can estimate the state value
function for A under policy π:
v_\pi(A) = \mathbb{E}_\pi[G_t \mid S_t = A] \approx \frac{1}{N} \sum_{i=0}^{N-1} G_{A,i} = \frac{1}{3}\,(G_{A,0} + G_{A,1} + G_{A,2}) = \frac{2}{3}

By increasing the number of episodes N, the agent can get a better estimate of vπ(A).

Similarly, the value function for state B under π can be estimated by the
agent from experience:
v_\pi(B) = \mathbb{E}_\pi[G_t \mid S_t = B] \approx \frac{1}{N_B} \sum_{j=0}^{N_B-1} G_{B,j} = 1

where N_B is the number of episodes that visit B (here only episode 1 visits B, with GB,1 = 1).
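To make the computation concrete, here is a minimal Python sketch (not part of the original slides) that reproduces both estimates from the three episodes above; encoding an episode as (state, reward received on leaving that state) pairs is my own convention:

```python
# Monte Carlo prediction for the 3-state grid world (gamma = 1).
# Each episode is a list of (state, reward received on leaving that state) pairs,
# matching the three trajectories above; the terminal state C is not listed.
from collections import defaultdict

episodes = [
    [("A", +1)],             # A -D-> C,          G_{A,0} = 1
    [("A", -1), ("B", +1)],  # A -R-> B -D-> C,   G_{A,1} = 0, G_{B,1} = 1
    [("A", +1)],             # A -D-> C,          G_{A,2} = 1
]

returns = defaultdict(list)          # state -> list of sampled returns
for episode in episodes:
    G = 0.0
    # Walk the episode backwards, accumulating the (undiscounted) return.
    for state, reward in reversed(episode):
        G += reward
        returns[state].append(G)

for state, gs in sorted(returns.items()):
    print(f"v_pi({state}) ~= {sum(gs) / len(gs):.3f}")   # A: 0.667, B: 1.000
```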



First-visit Monte Carlo prediction
In the first-visit Monte Carlo prediction, if the same state is visited again
in the same episode, we don’t use the return for that state again to
estimate the expected return (the average).

Example: In episode k for a policy π, this trajectory is generated:


S0 = w, A0 = 1, R1 = 4,
S1 = x, A1 = 0, R2 = 3,
S2 = y, A2 = 2, R3 = 2,
S3 = x, A3 = 2, R4 = 1, S4 = z.
In this episode, state x is visited twice: at t = 1 and t = 3. For an
undiscounted return (i.e., γ = 1), the return for the first visit to x
(t = 1) is G1 = 3 + 2 + 1 = 6, whereas for t = 3 it is G3 = 1.
In first-visit MC, we use only G1 as the return for episode k, that is,
Gx,k = G1 = 6, to compute the average return over all episodes for x.
This criterion is used for all N episodes:
v_\pi(x) = \mathbb{E}_\pi[G_t \mid S_t = x] \approx \frac{1}{N} \sum_{j=0}^{N-1} G_{x,j}
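A possible Python sketch of first-visit MC prediction on the example trajectory above, assuming γ = 1; the (state, reward) episode encoding and all names are illustrative:

```python
# First-visit Monte Carlo prediction (gamma = 1) on the example episode.
# Each step is (S_t, R_{t+1}); the terminal state z is not listed.
episode = [("w", 4), ("x", 3), ("y", 2), ("x", 1)]
gamma = 1.0

# Returns G_t for every time step, computed backwards: G_t = R_{t+1} + gamma * G_{t+1}.
G, returns_per_step = 0.0, []
for state, reward in reversed(episode):
    G = reward + gamma * G
    returns_per_step.append((state, G))
returns_per_step.reverse()      # [("w", 10.0), ("x", 6.0), ("y", 3.0), ("x", 1.0)]

# Keep only the return following the FIRST visit to each state in this episode.
first_visit_returns = {}
for state, G_t in returns_per_step:
    first_visit_returns.setdefault(state, G_t)

print(first_visit_returns)      # {'w': 10.0, 'x': 6.0, 'y': 3.0}
# Across N episodes, v_pi(x) is estimated as the average of these per-episode returns.
```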
Every-visit Monte Carlo prediction
In the every-visit Monte Carlo prediction, if the same state is visited
many times in the same episode, we use all returns for that state to
estimate the expected return (the average).

Example: For the previous example with trajectory:


S0 = w, A0 = 1, R1 = 4,
S1 = x, A1 = 0, R2 = 3,
S2 = y, A2 = 2, R3 = 2,
S3 = x, A3 = 2, R4 = 1, S4 = z.
the return for the first visit to x (t = 1) is G1 = 3 + 2 + 1 = 6, whereas
for t = 3 it is G3 = 1.
In every-visit MC, we use the values of both G1 and G3 as returns for
episode k, that is, Gx,k,0 = G1 = 6 and Gx,k,1 = G3 = 1, to compute the
average return over all episodes for x. This criterion is used for all N
episodes, with episode j contributing nj returns:
v_\pi(x) = \mathbb{E}_\pi[G_t \mid S_t = x] \approx \frac{1}{n_0 + n_1 + \cdots + n_{N-1}} \sum_{j=0}^{N-1} \sum_{i=0}^{n_j - 1} G_{x,j,i}
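For comparison, a sketch of the every-visit variant on the same episode; the encoding is the same illustrative one used above:

```python
# Every-visit Monte Carlo prediction (gamma = 1) on the same example episode.
from collections import defaultdict

episode = [("w", 4), ("x", 3), ("y", 2), ("x", 1)]
gamma = 1.0

all_returns = defaultdict(list)       # state -> returns from ALL visits
G = 0.0
for state, reward in reversed(episode):
    G = reward + gamma * G
    all_returns[state].append(G)

print(dict(all_returns))              # {'x': [1.0, 6.0], 'y': [3.0], 'w': [10.0]}
# Pooling these lists over all N episodes and averaging gives the every-visit estimate;
# from this single episode alone, v_pi(x) ~= (6 + 1) / 2 = 3.5.
```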
Monte Carlo Control
Recall that prediction refers to the estimation of state-value functions
vπ (s), ∀ s ∈ S, for a given policy π. Control refers to finding an optimal
policy. In the case of Monte Carlo control, we can in general only find an
approximate optimal policy.

Both DP and MC control follow the generalized policy iteration:


▶ in DP, because we have a model of the environment p(s′ , r|s, a), we
can extract an optimal policy knowing only the optimal state-value
function v∗ (s) (recall from DP value iteration):
\pi_*(s) = \arg\max_a q_*(s, a) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl(r + \gamma\, v_*(s')\bigr)

▶ In MC, as in other model-free methods, we explicitly work with state-action value functions qπ(s, a): no need for a model.
▶ In MC, however, qπ(s, a) is estimated by the agent through experience with the environment (sampling the MDP), for all s ∈ S, a ∈ A(s) (a big assumption!). In DP we can explore the whole state-action space systematically with the help of the model.
Generalized Policy Iteration: MC vs DP
To do policy improvement, a greedy policy is used in both cases, i.e.:
\pi(s) = \arg\max_a q(s, a)

In DP:
\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_{\pi_*}
Both policy evaluation and improvement rely on the Bellman equation:
v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl(r + \gamma\, v_k(s')\bigr)

\pi_{k+1}(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl(r + \gamma\, v_{\pi_k}(s')\bigr)
In MC:
\pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_{\pi_*}
πk+1(s) is the greedy policy with respect to qk(s, a), for all s ∈ S, a ∈ A(s):

\pi_{k+1}(s) = \arg\max_a q_k(s, a)
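A minimal sketch of this greedy improvement step, assuming the agent stores qk(s, a) as a nested dictionary q[s][a]; the data structure and example values are illustrative, not from the slides:

```python
# Greedy policy improvement from an action-value table q[s][a].
def greedy_policy(q):
    """Return pi(s) = argmax_a q(s, a) for every state in the table."""
    return {s: max(actions, key=actions.get) for s, actions in q.items()}

# Example q-table for the 3-state grid world (made-up values):
q = {"A": {"D": 0.9, "R": -0.1}, "B": {"D": 1.0}}
print(greedy_policy(q))   # {'A': 'D', 'B': 'D'}
```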



GPI: Monte Carlo learning
Similarly to value iteration in DP, MC learning is a special case of GPI in which
policy evaluation (prediction) is done using a single episode; afterwards,
policy improvement is done at all states visited in that episode.
Assumption: we have a good estimate of qk(s, a) for all s ∈ S, a ∈ A(s).
How can we achieve this?
▶ Exploring starts: start each episode from a random state-action pair,
i.e., such that every pair s ∈ S, a ∈ A(s) has a non-zero probability of
being the start (see the sketch after this list). This guarantees that all
states are visited (not always practically possible) and that all actions
available at each state are taken.
▶ On-policy methods: attempt to evaluate or improve the policy that is
used to make decisions.
▶ Off-policy methods: evaluate or improve a policy (the target policy)
different from the one used to generate the data (the behaviour policy).
Both on-policy and off-policy methods face a tradeoff between exploration
and exploitation.
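As referenced in the exploring-starts bullet, here is a rough Python sketch of Monte Carlo control with exploring starts (first-visit updates) on the 3-state grid world from the earlier exercise; the transition encoding, helper names, and episode format (S_t, A_t, R_{t+1}) are my own assumptions, not part of the slides:

```python
# Sketch: Monte Carlo control with exploring starts (MC-ES), first-visit updates.
import random
from collections import defaultdict

STEP = {("A", "D"): ("C", +1), ("A", "R"): ("B", -1), ("B", "D"): ("C", +1)}
ACTIONS = {"A": ["D", "R"], "B": ["D"]}          # C is terminal

def sample_episode(s, a, policy):
    """Roll out one episode from the exploring start (s, a), then follow policy."""
    steps = []
    while True:
        s_next, r = STEP[(s, a)]
        steps.append((s, a, r))                  # (S_t, A_t, R_{t+1})
        if s_next == "C":
            return steps
        s, a = s_next, policy[s_next]

def mc_exploring_starts(num_episodes=1000, gamma=1.0):
    q = defaultdict(lambda: defaultdict(float))  # q[s][a] estimates
    returns = defaultdict(list)                  # (s, a) -> sampled returns
    policy = {s: random.choice(ACTIONS[s]) for s in ACTIONS}

    for _ in range(num_episodes):
        s0 = random.choice(list(ACTIONS))        # exploring start: random (s, a)
        a0 = random.choice(ACTIONS[s0])
        episode = sample_episode(s0, a0, policy)

        G = 0.0
        for t, (s, a, r) in reversed(list(enumerate(episode))):
            G = r + gamma * G
            if (s, a) not in [(x[0], x[1]) for x in episode[:t]]:   # first visit
                returns[(s, a)].append(G)
                q[s][a] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(q[s], key=q[s].get)                 # greedy step
    return policy, q

print(mc_exploring_starts()[0])                  # expected: {'A': 'D', 'B': 'D'}
```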



On-policy learning: ϵ-soft policies
On-policy methods typically use soft policies:

\pi(a \mid s) > 0, \quad \forall\, s \in S,\ a \in A(s)

An ϵ-soft policy satisfies:

\pi(a \mid s) \ge \frac{\epsilon}{|A(s)|}, \quad \forall\, s \in S,\ a \in A(s), \text{ with } \epsilon > 0
A particular type of ϵ-soft policy is the ϵ-greedy policy.
An ϵ-greedy policy chooses a greedy action most of the time at any given
state, but with probability ϵ it selects an action at random at that state.
Formally, each nongreedy action gets the minimal probability

\pi(a \mid s) = \frac{\epsilon}{|A(s)|},

whereas the greedy action ā gets

\pi(\bar{a} \mid s) = 1 - \epsilon + \frac{\epsilon}{|A(s)|}.
The on-policy approach learns action values not for the optimal policy,
but for a near-optimal policy that still explores. On-policy methods
estimate the value of a policy while using it for control.
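A small sketch of drawing an action from an ϵ-greedy policy built on top of action-value estimates q(s, ·); the function and argument names are my own:

```python
# epsilon-greedy action selection over the action values q_s = {action: value}.
import random

def epsilon_greedy_action(q_s, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one,
    so each nongreedy action keeps probability epsilon / |A(s)| and the greedy action
    gets 1 - epsilon + epsilon / |A(s)|."""
    if random.random() < epsilon:
        return random.choice(list(q_s))   # explore: uniform over A(s)
    return max(q_s, key=q_s.get)          # exploit: greedy action

# Example: actions available in state A of the grid world (made-up values).
print(epsilon_greedy_action({"D": 0.9, "R": -0.1}, epsilon=0.2))
```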
Off-policy learning
We say that learning is from data ”off” the target policy, and the overall
process is termed off-policy learning. (Sutton and Barto)

In off-policy methods, two policies are used:
▶ target policy: the policy we want to learn, i.e., the one that should become the optimal policy
▶ behaviour policy: the policy used to explore (sample the environment), i.e., to generate behaviour.
Remarks:
▶ Off-policy methods are in general more complex and slower to converge,
but more powerful than on-policy methods.
▶ On-policy methods are a special case of off-policy methods in which
the target and behaviour policies are one and the same.
▶ In many cases of interest (control applications), the target policy
must be deterministic (e.g., greedy), whereas the behaviour policy
remains stochastic (e.g., ϵ-greedy).
▶ Behaviour data can come from a source external to the agent
(e.g., a human expert or an existing controller).
