Module 3
Dr. D. Sathian
SCOPE
Dynamic Programming
• Dynamic Programming (DP) refers to a collection of algorithms that can be used to
compute optimal policies given a perfect model of the environment as a Markov
decision process (MDP).
• Dynamic programming is fundamental to many reinforcement learning algorithms.
• Two of the main tasks of an agent in RL are:
• policy evaluation: compute Vπ from π
• control: improve π based on Vπ
• Policy evaluation refers to determining the value function for a specific policy, whereas
control refers to the task of finding a policy that maximizes reward.
• start with an arbitrary policy
• repeat evaluation/improvement until convergence
Dynamic Programming
• The key idea of DP, and of reinforcement learning generally, is the use of value
functions to organize and structure the search for good policies.
• DP makes use of the Bellman equations to define iterative algorithms for both policy
evaluation and control.
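The iterative policy-evaluation algorithm can be sketched in Python for the 4×4 gridworld used in the examples that follow (a minimal sketch, assuming the standard Sutton & Barto setup: states 0 and 15 are terminal, every move earns reward −1, moves off the grid leave the state unchanged, γ = 1, equiprobable random policy):

```python
# Iterative policy evaluation on the 4x4 gridworld (a sketch; the grid layout,
# rewards, and equiprobable policy follow the standard Sutton & Barto example).
import numpy as np

N = 4                                            # 4x4 grid; states 0..15
TERMINAL = {0, N * N - 1}                        # the two shaded corner states
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

def step(s, a):
    """Deterministic transition: moves that leave the grid keep the agent in place."""
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    return nr * N + nc if 0 <= nr < N and 0 <= nc < N else s

def policy_evaluation(theta=1e-6, gamma=1.0):
    """Sweep Bellman expectation backups until the largest value change < theta."""
    v = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue
            # Equiprobable policy: each action has probability 1/4; reward -1.
            new_v = sum(0.25 * (-1 + gamma * v[step(s, a)]) for a in ACTIONS)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v          # in-place ("overwriting") update
        if delta < theta:
            return v

v = policy_evaluation()
print(np.round(v.reshape(N, N)))  # matches the slide's values, e.g. v(11) = -14
```

The in-place sweep converges to the same fixed point as the two-array version; only the intermediate iterates differ.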
• Suppose we commit to going down from state 11. We then receive a reward of −1
deterministically, and the game is over (the episode ends).
Hence qπ(11, down) = −1.
• Now suppose we commit to going down from state 7.
qπ(7, down) = 𝔼π[Gt | St = 7, At = down]
= 𝔼π[Rt+1 + γ Gt+1 | St = 7, At = down]
= −1 + γ vπ(11)
= −1 + (−14) = −15 (with γ = 1)
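The one-step lookahead used above can be written directly: a q-value is the immediate reward plus the discounted value of the successor state (γ = 1 here, transitions deterministic):

```python
# One-step lookahead: q_pi(s, a) = r + gamma * v_pi(s'), for the deterministic
# gridworld moves discussed above (reward -1 per step, gamma = 1).
def q_from_v(v_next, reward=-1, gamma=1.0):
    return reward + gamma * v_next

print(q_from_v(0))     # q_pi(11, down): the move ends the episode, so v = 0 -> -1
print(q_from_v(-14))   # q_pi(7, down): lands in state 11 with v_pi(11) = -14 -> -15
```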
Policy Evaluation
• Q. Suppose a new state 15 is added to the gridworld just below state 13, and its
actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15,
respectively. Assume that the transitions from the original states are unchanged.
What, then, is vπ(15) for the equiprobable random policy? Now suppose the
dynamics of state 13 are also changed, such that action down from state 13 takes the
agent to the new state 15. What is vπ(15) for the equiprobable random policy in this
case?
vπ(15) = 0.25 [(−1 − 20) + (−1 − 22) + (−1 − 14) + (−1 + vπ(15))]
= 0.25 [−60 + vπ(15)] = −15 + 0.25 vπ(15)
We know vπ(15) = vπ(13) = −20, since the state transitions and the values of the
successor states are identical.
Proof:
vπ (15) = 0.25 [vπ (15)] - 15
0.75 vπ (15) = -15
vπ (15) = -15/0.75 = -20
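Instead of solving algebraically, the same fixed point can be reached by simply iterating the self-referential update derived above:

```python
# Iterate v(15) <- 0.25 * v(15) - 15 from an arbitrary start. The update is a
# contraction with factor 0.25, so it converges to the algebraic answer
# -15 / 0.75 = -20.
v15 = 0.0
for _ in range(50):
    v15 = 0.25 * v15 - 15.0
print(v15)   # converges to -20.0
```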
Policy Evaluation
Q. Now suppose the dynamics of state 13 are also changed. We apply iterative policy
evaluation. The natural initialization is V0(s) = vπ(s), where vπ is the value function when
the dynamics are unchanged, as shown in the k = ∞ case in the figure below. Note, in
particular, that V0(15) = −20, where −20 is what we just derived for vπ(15).
v1(13) = 0.25(−1 − 20 − 1 − 22 − 1 − 14 − 1 − 20) = −20
We then immediately update V1(15) using the updated V1(13). Note that this is an in-place
("overwriting") update:
v1(15) = 0.25(−1 − 22 − 1 − 20 − 1 − 14 − 1 − 20) = −20
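The "overwriting" sweep above can be made explicit. The snippet below performs the two backups in order, with the second backup reading the value just written by the first (values of the unchanged neighbours are taken from the slides: vπ(9) = vπ(13) = −20, vπ(12) = −22, vπ(14) = −14):

```python
# One in-place sweep over the two changed states. V0 is initialized to the old
# v_pi values; with the new dynamics, "down" from state 13 leads to state 15.
v = {13: -20.0, 15: -20.0}                          # V0(13), V0(15)
V_UP, V_LEFT, V_RIGHT = -20.0, -22.0, -14.0         # v_pi(9), v_pi(12), v_pi(14)
v[13] = 0.25 * ((-1 + V_UP) + (-1 + V_LEFT) + (-1 + V_RIGHT) + (-1 + v[15]))
# "Overwriting": the backup for 15 immediately sees the freshly updated v[13].
v[15] = 0.25 * ((-1 - 22) + (-1 + v[13]) + (-1 - 14) + (-1 + v[15]))
print(v[13], v[15])   # -20.0 -20.0
```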
The process of making a new policy that improves on an original policy, by making it
greedy with respect to the value function of the original policy, is called policy
improvement.
• where argmaxa denotes the value of a at which the expression that follows is
maximized
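The argmax step can be sketched for any deterministic model given as a table of (next state, reward) pairs. The two-state chain below is a hypothetical usage example, not the gridworld:

```python
# Greedy policy improvement (a sketch): for each state, pick the action that
# maximizes the one-step lookahead r + gamma * v(s') under a deterministic model.
def greedy_policy(trans, v, gamma=1.0):
    """trans[s][a] = (next_state, reward); v maps states to current values."""
    pi = {}
    for s, moves in trans.items():
        pi[s] = max(moves, key=lambda a: moves[a][1] + gamma * v[moves[a][0]])
    return pi

# Hypothetical 3-state chain: state 2 is the goal, each step costs -1.
trans = {0: {"stay": (0, -1), "right": (1, -1)},
         1: {"left": (0, -1), "right": (2, -1)}}
v = {0: -2, 1: -1, 2: 0}
print(greedy_policy(trans, v))   # {0: 'right', 1: 'right'}
```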
Policy Improvement
• Suppose the new greedy policy, π′, is as good as, but not better than, the old policy π.
• Then vπ = vπ′, and from the previous equation it follows that for all s ∈ S:
vπ′(s) = maxa 𝔼[Rt+1 + γ vπ′(St+1) | St = s, At = a]
• But this is the Bellman optimality equation, so vπ′ must be v∗, and both π and π′ must be optimal policies.
• Policy iteration alternates these two steps: π0 →E vπ0 →I π1 →E vπ1 →I ··· →I π∗ →E v∗ (E: evaluation, I: improvement).
• Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal).
• Because a finite MDP has only a finite number of policies, this process must converge to
an optimal policy and the optimal value function in a finite number of iterations.
• This way of finding an optimal policy is called policy iteration.
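Policy iteration, alternating the E and I steps until the policy is stable, can be sketched for the same 4×4 gridworld. Note one assumption: γ = 0.9 is used here (unlike the undiscounted slides) so that evaluating an arbitrary initial deterministic policy still converges even when it never reaches a terminal state:

```python
# Policy iteration on the 4x4 gridworld (a sketch; gamma = 0.9 is an assumption
# made so that evaluation of any initial deterministic policy converges).
import numpy as np

N = 4
TERMINAL = {0, N * N - 1}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

def step(s, a):
    """Deterministic move; off-grid actions leave the state unchanged."""
    r, c = divmod(s, N)
    nr, nc = r + a[0], c + a[1]
    return nr * N + nc if 0 <= nr < N and 0 <= nc < N else s

def evaluate(pi, gamma, theta=1e-10):
    """E step: iterate the Bellman expectation backup for a deterministic policy."""
    v = np.zeros(N * N)
    while True:
        delta = 0.0
        for s in pi:
            nv = -1 + gamma * v[step(s, pi[s])]
            delta = max(delta, abs(nv - v[s]))
            v[s] = nv
        if delta < theta:
            return v

def policy_iteration(gamma=0.9):
    pi = {s: ACTIONS[0] for s in range(N * N) if s not in TERMINAL}
    while True:
        v = evaluate(pi, gamma)                  # E: evaluate current policy
        stable = True
        for s in pi:                             # I: greedy improvement
            best = max(ACTIONS, key=lambda a: -1 + gamma * v[step(s, a)])
            # Change the action only on a strict improvement, so ties cannot
            # cause the loop to oscillate between equally good policies.
            if -1 + gamma * v[step(s, best)] > -1 + gamma * v[step(s, pi[s])] + 1e-9:
                pi[s], stable = best, False
        if stable:
            return pi, v

pi, v = policy_iteration()
```

The resulting policy takes the shortest path to the nearer terminal corner, so e.g. state 1 moves left into the terminal state 0.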
Policy Iteration - Algorithm