
Reinforcement Learning

Module 3
Dr. D. Sathian
SCOPE
Dynamic Programming
• Dynamic Programming (DP) refers to a collection of algorithms that can be used to
compute optimal policies given a perfect model of the environment as a Markov
decision process (MDP).
• Dynamic programming is fundamental to many reinforcement learning algorithms.
• Two of the main tasks of an agent in RL are:
• policy evaluation: compute vπ from π
• control: improve π based on vπ
• Policy evaluation refers to determining the value function for a specific policy, whereas
control refers to the task of finding a policy that maximizes reward.
• The general scheme is to start with an arbitrary policy and repeatedly alternate
evaluation and improvement until convergence.
Dynamic Programming
• The key idea of DP, and of reinforcement learning generally, is the use of value
functions to organize and structure the search for good policies.
• DP makes use of the Bellman equations to define iterative algorithms for both policy
evaluation and control.



Policy Evaluation
• First we consider how to compute the state-value function vπ for an arbitrary policy π.
This is called policy evaluation in the DP literature.
• We also refer to it as the prediction problem.

• The value function vπ satisfies the Bellman expectation equation:

vπ(s) = 𝔼π[Gt | St = s] = Σa π(a|s) Σs',r p(s', r | s, a) [ r + γ vπ(s') ]

where π(a|s) is the probability of taking action a in state s under policy π, and the
expectations are subscripted by π to indicate that they are conditional on π being followed.
• Iterative policy evaluation: start with an arbitrary value function V0 and repeatedly apply
the Bellman equation as an update rule,

Vk+1(s) = Σa π(a|s) Σs',r p(s', r | s, a) [ r + γ Vk(s') ]

until Vk converges to vπ.



Policy Evaluation
• Iterative Policy Evaluation, for estimating V ≈ vπ
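The algorithm box itself is not reproduced here. As a minimal sketch of the update above (assuming the dynamics are exposed as a function p(s, a) returning (next_state, reward, probability) triples and the policy as pi(s) returning action probabilities; both interfaces are assumptions for illustration, not part of the slides):

```python
def iterative_policy_evaluation(states, p, pi, gamma=1.0, theta=1e-6):
    """Estimate V ~ v_pi by sweeping the Bellman expectation update until convergence.

    p(s, a) -> list of (next_state, reward, probability) triples   (assumed interface)
    pi(s)   -> dict mapping each action to its probability under the policy
    theta   -> small threshold controlling the accuracy of the estimate
    """
    V = {s: 0.0 for s in states}              # arbitrary initial values; terminals stay 0
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a, prob_a in pi(s).items():
                for s_next, r, prob in p(s, a):
                    v_new += prob_a * prob * (r + gamma * V.get(s_next, 0.0))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # in-place ("sweeping") update
        if delta < theta:
            return V
```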



Policy Evaluation
• Gridworld Example:
• Non-terminal states are S = {1, 2, . . . , 14}; the shaded corner cells are the terminal states.
• Four possible actions in each state, A = {up, down, right, left}
• The reward is −1 on all transitions until the terminal state is reached.
• All actions in each state deterministically cause the corresponding state transitions,
except that actions that would take the agent off the grid in fact leave the state
unchanged.
• For example, p(9, −1 | 8, right) = 1, p(7, −1 | 7, right) = 1, and p(6, r | 9, right) = 0 for all r.
• This is an undiscounted, episodic task.
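As an illustrative sketch (the row-by-row cell numbering 0–15 below, with the shaded corners as cells 0 and 15, is an assumption consistent with the transition examples above), the gridworld and the equiprobable random policy can be written down and evaluated directly:

```python
# 4x4 gridworld: cells 0..15 row by row; 0 and 15 are the shaded terminal corners.
ROWS, COLS = 4, 4
TERMINAL = {0, 15}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic move; actions that would leave the grid keep the state unchanged."""
    r, c = divmod(state, COLS)
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    return nr * COLS + nc if 0 <= nr < ROWS and 0 <= nc < COLS else state

# Iterative policy evaluation for the equiprobable random policy (gamma = 1, reward = -1).
V = [0.0] * (ROWS * COLS)
while True:
    delta = 0.0
    for s in range(ROWS * COLS):
        if s in TERMINAL:
            continue
        new_v = sum(0.25 * (-1 + V[step(s, a)]) for a in ACTIONS)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:
        break

print([round(v) for v in V])
# -> [0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0]
```

The printed values agree with the vπ shown for the random policy in the figure on the next slide.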



Policy Evaluation
[Figure: Vk for the equiprobable random policy at successive iterations, together with the corresponding greedy policy, which converges from the random policy to the optimal policy.]



Policy Evaluation
• Gridworld Example:
• Q. If π is the equiprobable random policy,
• what is qπ(11,down)?
• What is qπ(7,down)?

• Suppose we take the action down from state 11. We then receive a reward of −1
deterministically and the episode ends, since the successor is the terminal state.
Hence qπ(11, down) = −1.
• Now suppose we take the action down from state 7, which moves us to state 11:
qπ(7, down) = 𝔼π[Gt | St = 7, At = down]
= 𝔼[Rt+1 + γ vπ(St+1) | St = 7, At = down]
= −1 + γ vπ(11)
= −1 + (−14) = −15      (γ = 1, and vπ(11) = −14 from the figure)
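These action values can also be read off mechanically from the converged state values of the random policy (the values used below are the ones printed by the earlier snippet, with the same cell numbering); a small sketch:

```python
# q_pi(s, a) = -1 + v_pi(s') for this undiscounted gridworld, where s' is the
# deterministic successor of (s, a). State values v_pi of the random policy:
V_PI = [0, -14, -20, -22,
        -14, -18, -20, -20,
        -20, -20, -18, -14,
        -22, -20, -14, 0]
COLS = 4
OFFSET = {"up": -COLS, "down": COLS, "left": -1, "right": 1}

def q_pi(s, a):
    """One-step lookahead: immediate reward -1 plus the value of the successor state."""
    r, c = divmod(s, COLS)
    off_grid = ((a == "up" and r == 0) or (a == "down" and r == 3) or
                (a == "left" and c == 0) or (a == "right" and c == 3))
    s_next = s if off_grid else s + OFFSET[a]
    return -1 + V_PI[s_next]

print(q_pi(11, "down"), q_pi(7, "down"))   # -1 -15
```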
Policy Evaluation
• Q. Suppose a new state 15 is added to the gridworld just below state 13, and its
actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15,
respectively. Assume that the transitions from the original states are unchanged.
What, then, is vπ(15) for the equiprobable random policy? Now suppose the
dynamics of state 13 are also changed, such that action down from state 13 takes the
agent to the new state 15. What is vπ(15) for the equiprobable random policy in this
case?



Policy Evaluation
vπ(15) = 𝔼π[Rt+1 + γ vπ(St+1) | St = 15]      (γ = 1)

vπ(15) = 0.25 [(−1 − 20) + (−1 − 22) + (−1 − 14) + (−1 + vπ(15))]
= 0.25 [−60 + vπ(15)] = −15 + 0.25 vπ(15)

We expect vπ(15) = vπ(13) = −20, since the two states have the same successor values (a
self-transition plus transitions to states worth −22, −20, and −14).
Proof:
vπ(15) = 0.25 vπ(15) − 15
0.75 vπ(15) = −15
vπ(15) = −15 / 0.75 = −20
Policy Evaluation
Q. Now suppose the dynamics of state 13 are also changed. We run iterative policy
evaluation. A natural initialization is V0(s) = vπ(s), where vπ is the value function when the
dynamics are unchanged (the k = ∞ case of the earlier figure). In particular, V0(15) = −20,
the value just derived for vπ(15).

V1(13) = 0.25 [(−1 − 20) + (−1 − 22) + (−1 − 14) + (−1 − 20)] = −20
Then we immediately update V1(15) using the updated V1(13); note that this in-place sweep
"overwrites" old values.

V1(15) = 0.25 [(−1 − 22) + (−1 − 20) + (−1 − 14) + (−1 − 20)] = −20

Since no value changes, the process has already converged: vπ(15) = −20 in this case as well.
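This can be checked numerically by rerunning policy evaluation on the extended grid. A sketch (the explicit successor table below, written in the slide's own state labels with "T" for the terminal state, is spelled out from the problem statement and is an assumption of this example):

```python
# Gridworld with the extra state 15 below state 13, and with action "down" from
# state 13 redirected to state 15. Equiprobable random policy, gamma = 1, reward = -1.
succ = {
    1:  {"up": 1,   "down": 5,  "left": "T", "right": 2},
    2:  {"up": 2,   "down": 6,  "left": 1,   "right": 3},
    3:  {"up": 3,   "down": 7,  "left": 2,   "right": 3},
    4:  {"up": "T", "down": 8,  "left": 4,   "right": 5},
    5:  {"up": 1,   "down": 9,  "left": 4,   "right": 6},
    6:  {"up": 2,   "down": 10, "left": 5,   "right": 7},
    7:  {"up": 3,   "down": 11, "left": 6,   "right": 7},
    8:  {"up": 4,   "down": 12, "left": 8,   "right": 9},
    9:  {"up": 5,   "down": 13, "left": 8,   "right": 10},
    10: {"up": 6,   "down": 14, "left": 9,   "right": 11},
    11: {"up": 7,   "down": "T","left": 10,  "right": 11},
    12: {"up": 8,   "down": 12, "left": 12,  "right": 13},
    13: {"up": 9,   "down": 15, "left": 12,  "right": 14},   # modified dynamics of state 13
    14: {"up": 10,  "down": 14, "left": 13,  "right": "T"},
    15: {"up": 13,  "down": 15, "left": 12,  "right": 14},   # the new state 15
}

V = {s: 0.0 for s in succ}
V["T"] = 0.0
for _ in range(2000):                        # plenty of in-place sweeps to converge
    for s in succ:
        V[s] = sum(0.25 * (-1 + V[s2]) for s2 in succ[s].values())

print(round(V[15], 3), round(V[13], 3))      # -20.0 -20.0
```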



Policy Improvement
• After evaluating a policy, how can we make it better?
• What if, in state s, we take some other action a ≠ π(s) and thereafter follow π? The value
of doing so is qπ(s, a) = 𝔼[Rt+1 + γ vπ(St+1) | St = s, At = a].
• Is qπ(s, a) greater than vπ(s)? If so, it is better to select a every time s is encountered,
and the resulting policy is at least as good as π.

The process of making a new policy that improves on an original policy, by making it
greedy with respect to the value function of the original policy, is called policy
improvement.
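The guarantee behind this step is the policy improvement theorem, stated here for reference (a standard result that the slide only alludes to):

```latex
\text{If } q_\pi\bigl(s, \pi'(s)\bigr) \ge v_\pi(s) \ \text{for all } s \in \mathcal{S},
\quad\text{then}\quad
v_{\pi'}(s) \ge v_\pi(s) \ \text{for all } s \in \mathcal{S}.
```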



Policy Improvement
• We know how to improve the policy at one state s for one action a.
• The natural extension is to consider changes at all states and for all possible actions,
selecting at each state the action that appears best according to qπ(s, a). In other
words, we consider the new greedy policy, π', given by

π'(s) = argmaxa qπ(s, a) = argmaxa Σs',r p(s', r | s, a) [ r + γ vπ(s') ]

• where argmaxa denotes the value of a at which the expression that follows is
maximized.
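A minimal sketch of this greedy improvement step, assuming the same hypothetical p(s, a) interface as the evaluation sketch above:

```python
def greedy_policy(states, actions, p, V, gamma=1.0):
    """Policy improvement: in each state pick the action maximizing the one-step lookahead
    q(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * V(s')).

    p(s, a) -> list of (next_state, reward, probability) triples   (assumed interface)
    Returns a deterministic policy as a dict: state -> action.
    """
    pi_new = {}
    for s in states:
        q = {a: sum(prob * (r + gamma * V.get(s_next, 0.0)) for s_next, r, prob in p(s, a))
             for a in actions}
        pi_new[s] = max(q, key=q.get)         # argmax over actions
    return pi_new
```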
Policy Improvement
• Suppose the new greedy policy, π’, is as good as, but not better than, the old policy π.
• Then vπ = vπ', and from the previous equation it follows that for all s ∈ S:

vπ'(s) = maxa Σs',r p(s', r | s, a) [ r + γ vπ'(s') ]

• But this is the Bellman optimality equation, so vπ' = v*, and both π and π' must be
optimal policies.


Policy Iteration
• Once a policy, π, has been improved using vπ to yield a better policy, π’, we can then
compute vπ’ and improve it again to yield an even better π’’.
• We can thus obtain a sequence of monotonically improving policies and value
functions:

π0 →E vπ0 →I π1 →E vπ1 →I π2 →E · · · →I π* →E v*

• where →E denotes a policy evaluation step and →I denotes a policy improvement step.
• Each policy is guaranteed to be a strict improvement over the previous one (unless it is
already optimal).
• Because a finite MDP has only a finite number of policies, this process must converge to
an optimal policy and optimal value function in a finite number of iterations.
• This way of finding an optimal policy is called policy iteration.
Policy Iteration - Algorithm
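The algorithm box is not reproduced here. As a sketch combining the two steps above (again assuming a hypothetical p(s, a) interface that returns (next_state, reward, probability) triples):

```python
def policy_iteration(states, actions, p, gamma=1.0, theta=1e-6):
    """Policy iteration sketch: alternate policy evaluation and greedy improvement
    until the policy no longer changes.

    Note: with gamma = 1 the inner evaluation loop assumes every policy encountered
    eventually reaches a terminal state; otherwise use gamma < 1.
    """

    def q_value(s, a, V):
        # one-step lookahead value of taking a in s, then following the estimate V
        return sum(prob * (r + gamma * V.get(s_next, 0.0)) for s_next, r, prob in p(s, a))

    pi = {s: actions[0] for s in states}      # arbitrary initial deterministic policy
    V = {s: 0.0 for s in states}
    while True:
        # 1. Policy evaluation: make V consistent with the current policy pi
        while True:
            delta = 0.0
            for s in states:
                v_new = q_value(s, pi[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to V
        policy_stable = True
        for s in states:
            best = max(actions, key=lambda a: q_value(s, a, V))
            if best != pi[s]:
                pi[s] = best
                policy_stable = False
        if policy_stable:
            return pi, V                      # pi is optimal for this finite MDP
```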

