
Reinforcement Learning

Module 3
Dr. D. Sathian
SCOPE
Dynamic Programming
• Dynamic Programming (DP) refers to a collection of algorithms that can be used to
compute optimal policies given a perfect model of the environment as a Markov
decision process (MDP).
• Dynamic programming is fundamental to many reinforcement learning algorithms.
• Two of the main tasks of an agent in RL are:
• policy evaluation: compute vπ from π
• control: improve π based on vπ
• Policy evaluation refers to determining the value function for a specific policy, whereas
control refers to the task of finding a policy that maximizes reward.
• The general scheme is to start with an arbitrary policy and repeatedly alternate
evaluation and improvement until convergence.
Dynamic Programming
• The key idea of DP, and of reinforcement learning generally, is the use of value
functions to organize and structure the search for good policies.
• DP makes use of the Bellman equations to define iterative algorithms for both policy
evaluation and control.



Policy Evaluation
• First we consider how to compute the state-value function vπ for an arbitrary policy π.
This is called policy evaluation in the DP literature.
• We also refer to it as the prediction problem.

• The value function vπ satisfies the Bellman expectation equation:

vπ(s) = 𝔼π[Gt | St = s] = Σa π(a|s) Σs',r p(s', r | s, a) [ r + γ vπ(s') ]

where π(a|s) is the probability of taking action a in state s under policy π, and the
expectations are subscripted by π to indicate that they are conditional on π being followed.
• Iterative policy evaluation: start with an arbitrary value function V0 and repeatedly apply
the Bellman equation as an update rule,

Vk+1(s) = Σa π(a|s) Σs',r p(s', r | s, a) [ r + γ Vk(s') ]

until Vk converges to vπ.



Policy Evaluation
• Iterative Policy Evaluation, for estimating V ≈ vπ
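The algorithm box itself is not reproduced here. As a minimal sketch of the update above (assuming the dynamics are exposed as a function p(s, a) returning (next_state, reward, probability) triples and the policy as pi(s) returning action probabilities; both interfaces are assumptions for illustration, not part of the slides):

```python
def iterative_policy_evaluation(states, p, pi, gamma=1.0, theta=1e-6):
    """Estimate V ~ v_pi by sweeping the Bellman expectation update until convergence.

    p(s, a) -> list of (next_state, reward, probability) triples   (assumed interface)
    pi(s)   -> dict mapping each action to its probability under the policy
    theta   -> small threshold controlling the accuracy of the estimate
    """
    V = {s: 0.0 for s in states}              # arbitrary initial values; terminals stay 0
    while True:
        delta = 0.0
        for s in states:
            v_new = 0.0
            for a, prob_a in pi(s).items():
                for s_next, r, prob in p(s, a):
                    v_new += prob_a * prob * (r + gamma * V.get(s_next, 0.0))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # in-place ("sweeping") update
        if delta < theta:
            return V
```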



Policy Evaluation
• Gridworld Example:
• Non-terminal states are S = {1, 2, . . . , 14}; the shaded corner cells are the terminal states.
• Four possible actions in each state, A = {up, down, right, left}
• The reward is −1 on all transitions until the terminal state is reached.
• All actions in each state deterministically cause the corresponding state transitions,
except that actions that would take the agent off the grid in fact leave the state
unchanged.
• For example, p(9, −1 | 8, right) = 1, p(7, −1 | 7, right) = 1, and p(6, r | 9, right) = 0 for all r.
• This is an undiscounted, episodic task.
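As an illustrative sketch (the row-by-row cell numbering 0–15 below, with the shaded corners as cells 0 and 15, is an assumption consistent with the transition examples above), the gridworld and the equiprobable random policy can be written down and evaluated directly:

```python
# 4x4 gridworld: cells 0..15 row by row; 0 and 15 are the shaded terminal corners.
ROWS, COLS = 4, 4
TERMINAL = {0, 15}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Deterministic move; actions that would leave the grid keep the state unchanged."""
    r, c = divmod(state, COLS)
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    return nr * COLS + nc if 0 <= nr < ROWS and 0 <= nc < COLS else state

# Iterative policy evaluation for the equiprobable random policy (gamma = 1, reward = -1).
V = [0.0] * (ROWS * COLS)
while True:
    delta = 0.0
    for s in range(ROWS * COLS):
        if s in TERMINAL:
            continue
        new_v = sum(0.25 * (-1 + V[step(s, a)]) for a in ACTIONS)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:
        break

print([round(v) for v in V])
# -> [0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0]
```

The printed values agree with the vπ shown for the random policy in the figure on the next slide.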



Policy Evaluation
[Figure: Vk for the equiprobable random policy at successive iterations, together with the corresponding greedy policy, which converges from the random policy to the optimal policy.]



Policy Evaluation
• Gridworld Example:
• Q. If π is the equiprobable random policy,
• what is qπ(11,down)?
• What is qπ(7,down)?

• Suppose we take the action down from state 11. We then receive a reward of −1
deterministically and the episode ends, since the successor is the terminal state.
Hence qπ(11, down) = −1.
• Now suppose we take the action down from state 7, which moves us to state 11:
qπ(7, down) = 𝔼π[Gt | St = 7, At = down]
= 𝔼[Rt+1 + γ vπ(St+1) | St = 7, At = down]
= −1 + γ vπ(11)
= −1 + (−14) = −15      (γ = 1, and vπ(11) = −14 from the figure)
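These action values can also be read off mechanically from the converged state values of the random policy (the values used below are the ones printed by the earlier snippet, with the same cell numbering); a small sketch:

```python
# q_pi(s, a) = -1 + v_pi(s') for this undiscounted gridworld, where s' is the
# deterministic successor of (s, a). State values v_pi of the random policy:
V_PI = [0, -14, -20, -22,
        -14, -18, -20, -20,
        -20, -20, -18, -14,
        -22, -20, -14, 0]
COLS = 4
OFFSET = {"up": -COLS, "down": COLS, "left": -1, "right": 1}

def q_pi(s, a):
    """One-step lookahead: immediate reward -1 plus the value of the successor state."""
    r, c = divmod(s, COLS)
    off_grid = ((a == "up" and r == 0) or (a == "down" and r == 3) or
                (a == "left" and c == 0) or (a == "right" and c == 3))
    s_next = s if off_grid else s + OFFSET[a]
    return -1 + V_PI[s_next]

print(q_pi(11, "down"), q_pi(7, "down"))   # -1 -15
```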
Policy Evaluation
• Q. Suppose a new state 15 is added to the gridworld just below state 13, and its
actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15,
respectively. Assume that the transitions from the original states are unchanged.
What, then, is vπ(15) for the equiprobable random policy? Now suppose the
dynamics of state 13 are also changed, such that action down from state 13 takes the
agent to the new state 15. What is vπ(15) for the equiprobable random policy in this
case?



Policy Evaluation
vπ(15) = 𝔼π[Rt+1 + γ vπ(St+1) | St = 15]      (γ = 1)

vπ(15) = 0.25 [(−1 − 20) + (−1 − 22) + (−1 − 14) + (−1 + vπ(15))]
= 0.25 [−60 + vπ(15)] = −15 + 0.25 vπ(15)

We expect vπ(15) = vπ(13) = −20, since the two states have the same successor values (a
self-transition plus transitions to states worth −22, −20, and −14).
Proof:
vπ(15) = 0.25 vπ(15) − 15
0.75 vπ(15) = −15
vπ(15) = −15 / 0.75 = −20
Policy Evaluation
Q. Now suppose the dynamics of state 13 are also changed. We run iterative policy
evaluation. A natural initialization is V0(s) = vπ(s), where vπ is the value function when the
dynamics are unchanged (the k = ∞ case of the earlier figure). In particular, V0(15) = −20,
the value just derived for vπ(15).

V1(13) = 0.25 [(−1 − 20) + (−1 − 22) + (−1 − 14) + (−1 − 20)] = −20
Then we immediately update V1(15) using the updated V1(13); note that this in-place sweep
"overwrites" old values.

V1(15) = 0.25 [(−1 − 22) + (−1 − 20) + (−1 − 14) + (−1 − 20)] = −20

Since no value changes, the process has already converged: vπ(15) = −20 in this case as well.
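This can be checked numerically by rerunning policy evaluation on the extended grid. A sketch (the explicit successor table below, written in the slide's own state labels with "T" for the terminal state, is spelled out from the problem statement and is an assumption of this example):

```python
# Gridworld with the extra state 15 below state 13, and with action "down" from
# state 13 redirected to state 15. Equiprobable random policy, gamma = 1, reward = -1.
succ = {
    1:  {"up": 1,   "down": 5,  "left": "T", "right": 2},
    2:  {"up": 2,   "down": 6,  "left": 1,   "right": 3},
    3:  {"up": 3,   "down": 7,  "left": 2,   "right": 3},
    4:  {"up": "T", "down": 8,  "left": 4,   "right": 5},
    5:  {"up": 1,   "down": 9,  "left": 4,   "right": 6},
    6:  {"up": 2,   "down": 10, "left": 5,   "right": 7},
    7:  {"up": 3,   "down": 11, "left": 6,   "right": 7},
    8:  {"up": 4,   "down": 12, "left": 8,   "right": 9},
    9:  {"up": 5,   "down": 13, "left": 8,   "right": 10},
    10: {"up": 6,   "down": 14, "left": 9,   "right": 11},
    11: {"up": 7,   "down": "T","left": 10,  "right": 11},
    12: {"up": 8,   "down": 12, "left": 12,  "right": 13},
    13: {"up": 9,   "down": 15, "left": 12,  "right": 14},   # modified dynamics of state 13
    14: {"up": 10,  "down": 14, "left": 13,  "right": "T"},
    15: {"up": 13,  "down": 15, "left": 12,  "right": 14},   # the new state 15
}

V = {s: 0.0 for s in succ}
V["T"] = 0.0
for _ in range(2000):                        # plenty of in-place sweeps to converge
    for s in succ:
        V[s] = sum(0.25 * (-1 + V[s2]) for s2 in succ[s].values())

print(round(V[15], 3), round(V[13], 3))      # -20.0 -20.0
```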



Policy Improvement
• After evaluating a policy, how can we make it better?
• What if, in state s, we take some other action a ≠ π(s) and thereafter follow π? The value
of doing so is qπ(s, a) = 𝔼[Rt+1 + γ vπ(St+1) | St = s, At = a].
• Is qπ(s, a) greater than vπ(s)? If so, it is better to select a every time s is encountered,
and the resulting policy is at least as good as π.

The process of making a new policy that improves on an original policy, by making it
greedy with respect to the value function of the original policy, is called policy
improvement.
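The guarantee behind this step is the policy improvement theorem, stated here for reference (a standard result that the slide only alludes to):

```latex
\text{If } q_\pi\bigl(s, \pi'(s)\bigr) \ge v_\pi(s) \ \text{for all } s \in \mathcal{S},
\quad\text{then}\quad
v_{\pi'}(s) \ge v_\pi(s) \ \text{for all } s \in \mathcal{S}.
```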



Policy Improvement
• We know how to improve the policy at one state s for one action a.
• The natural extension is to consider changes at all states and for all possible actions,
selecting at each state the action that appears best according to qπ(s, a). In other
words, we consider the new greedy policy, π', given by

π'(s) = argmaxa qπ(s, a) = argmaxa Σs',r p(s', r | s, a) [ r + γ vπ(s') ]

• where argmaxa denotes the value of a at which the expression that follows is
maximized.
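A minimal sketch of this greedy improvement step, assuming the same hypothetical p(s, a) interface as the evaluation sketch above:

```python
def greedy_policy(states, actions, p, V, gamma=1.0):
    """Policy improvement: in each state pick the action maximizing the one-step lookahead
    q(s, a) = sum over (s', r) of p(s', r | s, a) * (r + gamma * V(s')).

    p(s, a) -> list of (next_state, reward, probability) triples   (assumed interface)
    Returns a deterministic policy as a dict: state -> action.
    """
    pi_new = {}
    for s in states:
        q = {a: sum(prob * (r + gamma * V.get(s_next, 0.0)) for s_next, r, prob in p(s, a))
             for a in actions}
        pi_new[s] = max(q, key=q.get)         # argmax over actions
    return pi_new
```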
Policy Improvement
• Suppose the new greedy policy, π’, is as good as, but not better than, the old policy π.
• Then vπ = vπ', and from the previous equation it follows that for all s ∈ S:

vπ'(s) = maxa Σs',r p(s', r | s, a) [ r + γ vπ'(s') ]

• But this is the Bellman optimality equation, so vπ' = v*, and both π and π' must be
optimal policies.


Policy Iteration
• Once a policy, π, has been improved using vπ to yield a better policy, π’, we can then
compute vπ’ and improve it again to yield an even better π’’.
• We can thus obtain a sequence of monotonically improving policies and value
functions:

π0 →E vπ0 →I π1 →E vπ1 →I π2 →E · · · →I π* →E v*

• where →E denotes a policy evaluation step and →I denotes a policy improvement step.
• Each policy is guaranteed to be a strict improvement over the previous one (unless it is
already optimal).
• Because a finite MDP has only a finite number of policies, this process must converge to
an optimal policy and optimal value function in a finite number of iterations.
• This way of finding an optimal policy is called policy iteration.
Policy Iteration - Algorithm
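The algorithm box is not reproduced here. As a sketch combining the two steps above (again assuming a hypothetical p(s, a) interface that returns (next_state, reward, probability) triples):

```python
def policy_iteration(states, actions, p, gamma=1.0, theta=1e-6):
    """Policy iteration sketch: alternate policy evaluation and greedy improvement
    until the policy no longer changes.

    Note: with gamma = 1 the inner evaluation loop assumes every policy encountered
    eventually reaches a terminal state; otherwise use gamma < 1.
    """

    def q_value(s, a, V):
        # one-step lookahead value of taking a in s, then following the estimate V
        return sum(prob * (r + gamma * V.get(s_next, 0.0)) for s_next, r, prob in p(s, a))

    pi = {s: actions[0] for s in states}      # arbitrary initial deterministic policy
    V = {s: 0.0 for s in states}
    while True:
        # 1. Policy evaluation: make V consistent with the current policy pi
        while True:
            delta = 0.0
            for s in states:
                v_new = q_value(s, pi[s], V)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # 2. Policy improvement: act greedily with respect to V
        policy_stable = True
        for s in states:
            best = max(actions, key=lambda a: q_value(s, a, V))
            if best != pi[s]:
                pi[s] = best
                policy_stable = False
        if policy_stable:
            return pi, V                      # pi is optimal for this finite MDP
```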

