CS234 Reinforcement Learning
Lecture 4: Model-Free Control
Emma Brunskill
Winter 2021
Refresh Your Knowledge 3. Piazza Poll
Bootstrapping is when: an estimate of the next state value is used instead of the true next state value.
Break
Class Structure
Today: Model-free Control
Reinforcement Learning
Model-free Control Examples
Recall Policy Iteration
Initialize policy $\pi$
Repeat:
  Policy evaluation: compute $V^\pi$
  Policy improvement: update $\pi$:
$$\pi'(s) = \arg\max_a \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^\pi(s') \Big] = \arg\max_a Q^\pi(s,a)$$
Now we want to do the above two steps without access to the true dynamics and reward models.
Last lecture introduced methods for model-free policy evaluation.
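For contrast with the model-free methods below, here is a minimal NumPy sketch of the model-based improvement step above; the array names and shapes (R as |S|×|A|, P as |S|×|A|×|S|) are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def greedy_improvement(R, P, V, gamma):
    """One model-based policy improvement step.

    R: reward array, shape (S, A)
    P: transition probabilities, shape (S, A, S)
    V: current value estimate, shape (S,)
    Returns the improved deterministic policy, shape (S,).
    """
    # Q^pi(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * P @ V          # (S, A, S) @ (S,) -> (S, A)
    return Q.argmax(axis=1)        # pi'(s) = argmax_a Q(s, a)
```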
Model Free Policy Iteration
Initialize policy $\pi$
Repeat:
  Policy evaluation: compute $Q^\pi$
  Policy improvement: update $\pi$
Model-free Generalized Policy Improvement
Check Your Understanding: Model-free Generalized Policy Improvement
Model-free Policy Iteration
Initialize policy $\pi$
Repeat:
  Policy evaluation: compute $Q^\pi$
  Policy improvement: update $\pi$ given $Q^\pi$
The Problem of Exploration
Explore vs Exploit
Policy Evaluation with Exploration
ε-greedy Policies
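As a minimal sketch, assuming a tabular Q of shape (S, A): with probability ε the agent picks an action uniformly at random, otherwise it takes the greedy action $\arg\max_a Q(s,a)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q, s, epsilon):
    """Sample an action from the epsilon-greedy policy w.r.t. Q at state s.

    Q: array of shape (S, A); epsilon in [0, 1].
    """
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniform random action
    return int(Q[s].argmax())                # exploit: greedy action
```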
Monotonic ε-greedy Policy Improvement

Theorem
For any ε-greedy policy $\pi_i$, the ε-greedy policy w.r.t. $Q^{\pi_i}$, $\pi_{i+1}$, is a monotonic improvement: $V^{\pi_{i+1}} \geq V^{\pi_i}$.

$$
\begin{aligned}
Q^{\pi_i}(s, \pi_{i+1}(s)) &= \sum_{a \in A} \pi_{i+1}(a|s)\, Q^{\pi_i}(s, a) \\
&= \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s, a) + (1 - \epsilon) \max_{a} Q^{\pi_i}(s, a) \\
&= \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s, a) + (1 - \epsilon) \max_{a} Q^{\pi_i}(s, a)\, \frac{1 - \epsilon}{1 - \epsilon} \\
&= \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s, a) + (1 - \epsilon) \max_{a} Q^{\pi_i}(s, a) \sum_{a \in A} \frac{\pi_i(a|s) - \frac{\epsilon}{|A|}}{1 - \epsilon} \\
&\geq \frac{\epsilon}{|A|} \sum_{a \in A} Q^{\pi_i}(s, a) + (1 - \epsilon) \sum_{a \in A} \frac{\pi_i(a|s) - \frac{\epsilon}{|A|}}{1 - \epsilon}\, Q^{\pi_i}(s, a) \\
&= \sum_{a \in A} \pi_i(a|s)\, Q^{\pi_i}(s, a) = V^{\pi_i}(s)
\end{aligned}
$$

The inequality holds because the weights $\frac{\pi_i(a|s) - \epsilon/|A|}{1 - \epsilon}$ are nonnegative and sum to 1 for an ε-greedy $\pi_i$, so the max dominates the weighted average. Since $Q^{\pi_i}(s, \pi_{i+1}(s)) \geq V^{\pi_i}(s)$ for all $s$, the policy improvement theorem then gives $V^{\pi_{i+1}} \geq V^{\pi_i}$.
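A quick numerical sanity check of the theorem on a small random MDP (a sketch; the MDP size, discount, and ε are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps = 5, 3, 0.9, 0.2

# Random MDP: rewards R (S, A) and transitions P (S, A, S)
R = rng.normal(size=(S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)

def evaluate(pi):
    """Exact V^pi via (I - gamma * P_pi) V = R_pi for a stochastic policy pi (S, A)."""
    R_pi = (pi * R).sum(axis=1)                 # (S,)
    P_pi = np.einsum('sa,sat->st', pi, P)       # (S, S)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def eps_greedy(Q):
    """Epsilon-greedy policy w.r.t. Q, as a stochastic (S, A) matrix."""
    pi = np.full((S, A), eps / A)
    pi[np.arange(S), Q.argmax(axis=1)] += 1 - eps
    return pi

# Start from an arbitrary epsilon-greedy policy pi_i
pi_i = eps_greedy(rng.normal(size=(S, A)))
V_i = evaluate(pi_i)
Q_i = R + gamma * P @ V_i                       # Q^{pi_i}
pi_next = eps_greedy(Q_i)                       # epsilon-greedy w.r.t. Q^{pi_i}
V_next = evaluate(pi_next)

print(np.all(V_next >= V_i - 1e-10))            # True: monotonic improvement
```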
Break
Today: Model-free Control
Recall Monte Carlo Policy Evaluation, Now for Q
Monte Carlo Online Control / On Policy Improvement
Poll. Check Your Understanding: MC for On Policy Control

Mars rover with new actions:
r(−, a1) = [1 0 0 0 0 0 +10], r(−, a2) = [0 0 0 0 0 0 +5], γ = 1.
Assume the current greedy policy is π(s) = a1 ∀s, ε = 0.5, and Q(s, a) = 0 for all (s, a).
Sample trajectory from the ε-greedy policy:
Trajectory = (s3, a1, 0, s2, a2, 0, s3, a1, 0, s2, a2, 0, s1, a1, 1, terminal)
What is the first-visit MC estimate of Q for each (s, a) pair after this trajectory? (Select all)
  Q^π(−, a1) = [1 0 1 0 0 0 0]
  Q^π(−, a2) = [0 0 0 0 0 0 0]
  The new greedy policy would be: π = [1 tie 1 tie tie tie tie]
  The new greedy policy would be: π = [1 2 1 tie tie tie tie]
  If ε = 1/3, the probability of selecting a1 in s1 under the new ε-greedy policy is 1/9.
  If ε = 1/3, the probability of selecting a1 in s1 under the new ε-greedy policy is 2/3.
  If ε = 1/3, the probability of selecting a1 in s1 under the new ε-greedy policy is 5/6.
  Not sure
Check Your Understanding: MC for On Policy Control

Mars rover with new actions:
r(−, a1) = [1 0 0 0 0 0 +10], r(−, a2) = [0 0 0 0 0 0 +5], γ = 1.
Assume the current greedy policy is π(s) = a1 ∀s, ε = 0.5.
Sample trajectory from the ε-greedy policy:
Trajectory = (s3, a1, 0, s2, a2, 0, s3, a1, 0, s2, a2, 0, s1, a1, 1, terminal)
First-visit MC estimate of Q for each (s, a) pair:
Q^π(−, a1) = [1 0 1 0 0 0 0], Q^π(−, a2) = [0 1 0 0 0 0 0]
What is π(s) = argmax_a Q^π(s, a) ∀s?
π = [1 2 1 tie tie tie tie]
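A short script reproducing the first-visit estimates above (state/action indexing is illustrative; with γ = 1 the returns are suffix sums of rewards):

```python
from collections import defaultdict

# Trajectory as (state, action, reward) triples, gamma = 1
trajectory = [(3, 1, 0), (2, 2, 0), (3, 1, 0), (2, 2, 0), (1, 1, 1)]
gamma = 1.0

# Return G_t from each time step, computed backward
G = 0.0
returns = [0.0] * len(trajectory)
for t in reversed(range(len(trajectory))):
    G = trajectory[t][2] + gamma * G
    returns[t] = G

# First-visit MC: only the first occurrence of each (s, a) pair counts
Q = defaultdict(float)
seen = set()
for t, (s, a, _) in enumerate(trajectory):
    if (s, a) not in seen:
        seen.add((s, a))
        Q[(s, a)] = returns[t]

print(dict(Q))  # {(3, 1): 1.0, (2, 2): 1.0, (1, 1): 1.0}
```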
MC control with ε-greedy policies
Greedy in the Limit of Infinite Exploration (GLIE)

Definition of GLIE
All state-action pairs are visited an infinite number of times:
$$\lim_{i \to \infty} N_i(s, a) \to \infty$$
Behavior policy converges to the greedy policy:
$$\lim_{i \to \infty} \pi_i(a|s) = \mathbb{1}\big(a = \arg\max_{a'} Q_i(s, a')\big)$$
A simple GLIE strategy: ε-greedy where ε is reduced to 0, e.g. with $\epsilon_i = 1/i$.
GLIE Monte-Carlo Control
Theorem
GLIE Monte-Carlo control converges to the optimal state-action value function: Q(s, a) → Q*(s, a).
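A compact sketch of GLIE Monte-Carlo control with incremental first-visit updates and ε_k = 1/k; the `env` interface (`reset()`, and `step(a)` returning `(next_state, reward, done)`) is an illustrative assumption, not a fixed API:

```python
import numpy as np

def glie_mc_control(env, n_states, n_actions, n_episodes=10000, gamma=1.0, seed=0):
    """GLIE Monte-Carlo control sketch: first-visit updates, eps_k = 1/k."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    N = np.zeros((n_states, n_actions))      # visit counts for incremental averaging

    for k in range(1, n_episodes + 1):
        eps = 1.0 / k                        # GLIE: epsilon decays to zero
        # Generate one episode with the current epsilon-greedy policy
        episode, s, done = [], env.reset(), False
        while not done:
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # First-visit incremental MC update: Q <- Q + (G - Q) / N
        G, first = 0.0, {}
        for t in reversed(range(len(episode))):
            s_t, a_t, r_t = episode[t]
            G = r_t + gamma * G
            first[(s_t, a_t)] = G            # keeps the return of the earliest visit
        for (s_t, a_t), G_t in first.items():
            N[s_t, a_t] += 1
            Q[s_t, a_t] += (G_t - Q[s_t, a_t]) / N[s_t, a_t]
    return Q
```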
Break
Today: Model-free Control
Model-free Policy Iteration with TD Methods
General Form of SARSA Algorithm
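A minimal tabular SARSA sketch with an ε-greedy behavior policy, under the same illustrative `env` interface:

```python
import numpy as np

def sarsa(env, n_states, n_actions, n_episodes=5000,
          alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    """Tabular SARSA sketch: on-policy TD control with epsilon-greedy behavior."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    def act(s):
        return int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())

    for _ in range(n_episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = act(s_next)
            # SARSA update: bootstrap with the action actually taken next
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```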
Worked Example: SARSA for Mars Rover
SARSA Initialization
Convergence Properties of SARSA
Theorem
SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, Q(s, a) → Q*(s, a), under the following conditions:
1. The policy sequence πt(a|s) satisfies the condition of GLIE
2. The step-sizes αt satisfy the Robbins-Monro conditions:
$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$
For example, $\alpha_t = \frac{1}{t}$ satisfies these conditions.
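To see why $\alpha_t = \frac{1}{t}$ qualifies:
$$\sum_{t=1}^{\infty} \frac{1}{t} = \infty \ \text{(harmonic series diverges)}, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty.$$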
Q-Learning: Learning the Optimal State-Action Value
On and Off-Policy Learning
On-policy learning
  Direct experience
  Learn to estimate and evaluate a policy from experience obtained from following that policy
Off-policy learning
  Learn to estimate and evaluate a policy using experience gathered from following a different policy
Q-Learning: Learning the Optimal State-Action Value
SARSA is an on-policy learning algorithm: it estimates the value of the current behavior policy (the policy being used to take actions in the world), and then updates the policy it is trying to estimate.
Alternatively, can we directly estimate the value of π* while acting with another behavior policy πb?
Yes! Q-learning, an off-policy RL algorithm.
Key idea: maintain state-action Q estimates and bootstrap using the value of the best future action.
Recall SARSA:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big)$$
Q-learning:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big)$$
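A minimal tabular Q-learning sketch matching the update above; it differs from the SARSA sketch only in the bootstrapped target (same illustrative `env` interface):

```python
import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=5000,
               alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    """Tabular Q-learning sketch: off-policy TD control."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy w.r.t. the current Q
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Q-learning update: bootstrap with the best next action,
            # regardless of what the behavior policy actually does next
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```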
Q-Learning with ε-greedy Exploration
Worked Example: ε-greedy Q-Learning Mars
Check Your Understanding: SARSA and Q-Learning
Q-Learning with ε-greedy Exploration
Break
Maximization Bias¹

Consider a single-state MDP (|S| = 1) with 2 actions, where both actions have 0-mean random rewards: E(r | a = a1) = E(r | a = a2) = 0.
Then Q(s, a1) = Q(s, a2) = 0 = V(s).
Assume there are prior samples of taking actions a1 and a2.
Let Q̂(s, a1), Q̂(s, a2) be finite-sample estimates of Q; use an unbiased estimator, e.g.
$$\hat{Q}(s, a_1) = \frac{1}{n(s, a_1)} \sum_{i=1}^{n(s, a_1)} r_i(s, a_1)$$
Let π̂ = argmax_a Q̂(s, a) be the greedy policy w.r.t. the estimated Q̂.
Even though each estimate of the state-action values is unbiased, the estimate of π̂'s value V̂^π̂ can be biased:
$$\hat{V}^{\hat{\pi}}(s) = \mathbb{E}\big[\max\big(\hat{Q}(s, a_1), \hat{Q}(s, a_2)\big)\big] \geq \max\big(\mathbb{E}[\hat{Q}(s, a_1)],\ \mathbb{E}[\hat{Q}(s, a_2)]\big) = \max(0, 0) = 0 = V^\pi,$$
where the inequality comes from Jensen's inequality.

¹ Example from Mannor, Simester, Sun and Tsitsiklis. Bias and Variance Approximation in Value Function Estimates. Management Science 2007.
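A quick simulation making the bias concrete (a sketch; the N(0, 1) reward noise and sample count are arbitrary choices): with a handful of samples per action, the greedy pick's estimated value is clearly positive even though both true means are zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_samples = 100_000, 5

# Each trial: estimate Q(s, a1) and Q(s, a2) from n_samples 0-mean rewards,
# then evaluate the greedy pick via the max of the two estimates.
q1_hat = rng.normal(0.0, 1.0, size=(n_trials, n_samples)).mean(axis=1)
q2_hat = rng.normal(0.0, 1.0, size=(n_trials, n_samples)).mean(axis=1)

print(q1_hat.mean())                      # ~0: each estimate is unbiased
print(np.maximum(q1_hat, q2_hat).mean())  # > 0: the max is biased upward
```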
Double Q-Learning
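Double Q-learning (van Hasselt, 2010) addresses this maximization bias by maintaining two estimates, Q1 and Q2, and decoupling action selection from evaluation: one estimate chooses the argmax while the other supplies its value. A minimal tabular sketch under the same illustrative `env` interface:

```python
import numpy as np

def double_q_learning(env, n_states, n_actions, n_episodes=5000,
                      alpha=0.1, gamma=1.0, eps=0.1, seed=0):
    """Tabular double Q-learning sketch (van Hasselt, 2010)."""
    rng = np.random.default_rng(seed)
    Q1 = np.zeros((n_states, n_actions))
    Q2 = np.zeros((n_states, n_actions))

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Act epsilon-greedily w.r.t. the sum of both estimates
            Q = Q1 + Q2
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # With prob 1/2 update Q1, evaluating Q1's argmax with Q2; else symmetric
            if rng.random() < 0.5:
                a_star = int(Q1[s_next].argmax())
                target = r + (0.0 if done else gamma * Q2[s_next, a_star])
                Q1[s, a] += alpha * (target - Q1[s, a])
            else:
                a_star = int(Q2[s_next].argmax())
                target = r + (0.0 if done else gamma * Q1[s_next, a_star])
                Q2[s, a] += alpha * (target - Q2[s, a])
            s = s_next
    return Q1, Q2
```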
What You Should Know
Class Structure
Today: Learning to Control Involves
Recall: Reinforcement Learning Involves
Optimization
Delayed consequences
Exploration
Generalization
Evaluation to Control