David Silver
Lecture 2: Markov Decision Processes
1 Markov Processes
2 Markov Reward Processes
3 Markov Decision Processes
4 Extensions to MDPs
Markov Processes
Introduction
Introduction to MDPs
Markov decision processes formally describe an environment for reinforcement learning, where the environment is fully observable: the current state completely characterises the process.
Markov Property
Pss 0 = P St+1 = s 0 | St = s
to
P11 . . . P1n
P = from ...
Pn1 . . . Pnn
Markov Process
Definition
A Markov process (or Markov chain) is a memoryless random process, a tuple ⟨S, P⟩:
S is a (finite) set of states
P is a state transition probability matrix, Pss' = P[St+1 = s' | St = s]

Example: Student Markov Chain
[Figure: Student Markov Chain — states Class 1, Class 2, Class 3, Pass, Pub, Facebook, Sleep, with transition probabilities on the edges, e.g. 0.9 (stay on Facebook), 0.1 (quit), and 0.2/0.4/0.4 out of the Pub]
Markov Chains
The state transition matrix for the student Markov chain (rows = current state, columns = successor state; blank entries are zero):

            C1    C2    C3    Pass  Pub   FB    Sleep
    C1            0.5                     0.5
    C2                  0.8                     0.2
    C3                        0.6   0.4
    Pass                                        1.0
    Pub     0.2   0.4   0.4
    FB      0.1                           0.9
    Sleep                                       1.0
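To make the dynamics concrete, here is a minimal Python sketch (the numpy encoding and the sample_episode helper are my own, not part of the lecture) that stores this matrix and samples episodes from the chain:

```python
import numpy as np

states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = np.array([
    #  C1   C2   C3   Pass Pub  FB   Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],  # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],  # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],  # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],  # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],  # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # Sleep
])
assert np.allclose(P.sum(axis=1), 1.0)  # each row is a probability distribution

def sample_episode(start="C1", seed=0):
    """Run the chain from `start` until the absorbing Sleep state."""
    rng = np.random.default_rng(seed)
    s = states.index(start)
    episode = [states[s]]
    while states[s] != "Sleep":
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

print(sample_episode())  # e.g. ['C1', 'C2', 'C3', 'Pub', ...]
```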
Markov Reward Processes
MRP
Definition
A Markov reward process is a Markov chain with values, a tuple ⟨S, P, R, γ⟩:
S is a finite set of states
P is a state transition probability matrix, Pss' = P[St+1 = s' | St = s]
R is a reward function, Rs = E[Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1]
Example: Student MRP
[Figure: the student Markov chain with a reward attached to each state, e.g. R = -1 (Facebook), R = 0 (Sleep), R = +1 (Pub)]
Return
Definition
The return Gt is the total discounted reward from time-step t.
Gt = Rt+1 + γRt+2 + ... = Σ_{k=0}^∞ γ^k Rt+k+1
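As a quick worked example (the reward sequence here is invented for illustration), the return is just a discounted sum:

```python
def discounted_return(rewards, gamma):
    """Gt = sum over k of gamma^k * R_{t+k+1}, for a finite episode tail."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Rewards observed after time t in one episode:
print(discounted_return([-2, -2, -2, 10], gamma=0.9))  # -2 - 1.8 - 1.62 + 7.29 = 1.87
```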
Why discount?
Discounting is mathematically convenient, avoids infinite returns in cyclic Markov processes, and expresses uncertainty about the future; immediate rewards may also simply be preferred to delayed ones.
Value Function
Definition
The value function v(s) of an MRP is the expected return starting from state s,
v(s) = E[Gt | St = s]
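Averaging sampled returns gives a Monte Carlo estimate of v(s). The sketch below reuses `states` and `P` from the Markov-chain code above; the per-state rewards follow the slides' student MRP (R = -2 for the classes, +10 for Pass, +1 for Pub, -1 for Facebook, 0 for Sleep):

```python
R = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}

def mc_value(start, gamma, n_episodes=20000, seed=1):
    """Monte Carlo estimate of v(start): average the sampled returns."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = states.index(start), 0.0, 1.0
        while states[s] != "Sleep":
            g += discount * R[states[s]]  # R_{t+1} is received on leaving S_t
            discount *= gamma
            s = rng.choice(len(states), p=P[s])
        total += g
    return total / n_episodes

print(mc_value("FB", gamma=1.0))  # ≈ -22.5, the -23 shown in the γ = 1 figure below
```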
Example: Student MRP Returns
Sample returns starting from S1 = C1 with γ = 1/2:
G1 = R2 + γR3 + ... + γ^{T−2} RT
v(s) for γ = 0
[Figure: Student MRP values with γ = 0 — each state's value equals its immediate reward, e.g. v = -1 (Facebook), v = 0 (Sleep), v = +1 (Pub)]
v(s) for γ = 0.9
[Figure: Student MRP values with γ = 0.9, e.g. v = -7.6 (Facebook), v = 0 (Sleep), v = +1.9 (Pub)]
v(s) for γ = 1
[Figure: Student MRP values with γ = 1, e.g. v = -23 (Facebook), v = 0 (Sleep), v = +0.8 (Pub)]
Bellman Equation
v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s]
     = E[Rt+1 + γGt+1 | St = s]
     = E[Rt+1 + γv(St+1) | St = s]

[Backup diagram: v(s) at the root state s, reward r on each transition, v(s') at each successor state s']

Averaging over successor states gives

v(s) = Rs + γ Σ_{s'∈S} Pss' v(s')
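Since the value function is a fixed point of the backup v ← R + γPv, it can be computed by simple iteration. A sketch reusing `states`, `P`, and `R` from the earlier code:

```python
def bellman_iterate(P, R_vec, gamma, tol=1e-10):
    """Iterate v <- R + gamma * P v until the update is below tol."""
    v = np.zeros(len(R_vec))
    while True:
        v_new = R_vec + gamma * P @ v
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

R_vec = np.array([R[s] for s in states])
v = bellman_iterate(P, R_vec, gamma=1.0)  # converges here because Sleep is absorbing with R = 0
print(dict(zip(states, v.round(1))))      # FB ≈ -22.5 and Pub ≈ 0.8, as in the γ = 1 figure
```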
Example: Bellman Equation for Student MRP
[Figure: the student MRP with γ = 1 values, verifying the Bellman equation at one state]
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices,
v = R + γPv
where v and R are column vectors with one entry per state:

    ⎡ v(1) ⎤   ⎡ R1 ⎤     ⎡ P11 ... P1n ⎤ ⎡ v(1) ⎤
    ⎢  ... ⎥ = ⎢ ...⎥ + γ ⎢     ...     ⎥ ⎢  ... ⎥
    ⎣ v(n) ⎦   ⎣ Rn ⎦     ⎣ Pn1 ... Pnn ⎦ ⎣ v(n) ⎦
Solving the Bellman Equation
v = R + γPv
(I − γP) v = R
v = (I − γP)⁻¹ R
The direct solution costs O(n³) for n states, so it is only feasible for small MRPs; larger problems call for iterative methods.
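For the student MRP this is a one-line solve (γ = 0.9 here, since I − γP is singular at γ = 1 because P is stochastic):

```python
gamma = 0.9
v = np.linalg.solve(np.eye(len(states)) - gamma * P, R_vec)
print(dict(zip(states, v.round(1))))  # FB ≈ -7.6, Pub ≈ 1.9, as in the γ = 0.9 figure
```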
Markov Decision Processes
Markov Decision Process
Definition
A Markov decision process (MDP) is a Markov reward process with decisions. It is a tuple ⟨S, A, P, R, γ⟩:
S is a finite set of states
A is a finite set of actions
P is a state transition probability matrix, Pᵃss' = P[St+1 = s' | St = s, At = a]
R is a reward function, Rᵃs = E[Rt+1 | St = s, At = a]
γ is a discount factor, γ ∈ [0, 1]

Example: Student MDP
[Figure: Student MDP — actions chosen at each state, e.g. Facebook with R = -1 and Pub with R = +1; the Pub action moves to the three classes with probabilities 0.2, 0.4, 0.4]
Policies
Policies (1)
Definition
A policy π is a distribution over actions given states,
π(a|s) = P [At = a | St = s]
Policies (2)
Given an MDP ⟨S, A, P, R, γ⟩ and a policy π, the state sequence S1, S2, ... is a Markov process ⟨S, Pπ⟩, and the state and reward sequence S1, R2, S2, ... is a Markov reward process ⟨S, Pπ, Rπ, γ⟩, where
Pπss' = Σ_{a∈A} π(a|s) Pᵃss'
Rπs = Σ_{a∈A} π(a|s) Rᵃs
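A sketch of this reduction under array conventions of my choosing (Pa[a, s, s'] for the transition tensor and Ra[s, a] for rewards; neither layout is given in the slides):

```python
def mrp_under_policy(Pa, Ra, pi):
    """Collapse an MDP to the MRP induced by a fixed policy pi(a|s).

    Pa: (n_actions, n_states, n_states) transition probabilities
    Ra: (n_states, n_actions) expected immediate rewards
    pi: (n_states, n_actions) action probabilities per state
    """
    P_pi = np.einsum("sa,ast->st", pi, Pa)  # Pπss' = Σa π(a|s) Pᵃss'
    R_pi = np.einsum("sa,sa->s", pi, Ra)    # Rπs  = Σa π(a|s) Rᵃs
    return P_pi, R_pi
```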
Value Functions
Value Function
Definition
The state-value function vπ (s) of an MDP is the expected return
starting from state s, and then following policy π
vπ (s) = Eπ [Gt | St = s]
Definition
The action-value function qπ (s, a) is the expected return
starting from state s, taking action a, and then following policy π
qπ (s, a) = Eπ [Gt | St = s, At = a]
Example: State-Value Function for Student MDP
[Figure: Student MDP with vπ(s) for the uniform random policy π(a|s) = 0.5, γ = 1; e.g. vπ = -2.3 at one state, vπ = 0 at the terminal state, Pub action with R = +1]
Bellman Expectation Equation
[Backup diagram: vπ(s) at the root state s, qπ(s, a) at each action a]

vπ(s) = Σ_{a∈A} π(a|s) qπ(s, a)
[Backup diagram: qπ(s, a) at the root pair (s, a), reward r, vπ(s') at each successor state s']

qπ(s, a) = Rᵃs + γ Σ_{s'∈S} Pᵃss' vπ(s')
[Backup diagram: vπ(s) at the root, an average over actions, then over successor states s']

vπ(s) = Σ_{a∈A} π(a|s) ( Rᵃs + γ Σ_{s'∈S} Pᵃss' vπ(s') )
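This two-step backup is exactly what iterative policy evaluation sweeps. A sketch, again with the assumed Pa/Ra/pi conventions from above:

```python
def policy_evaluation(Pa, Ra, pi, gamma, tol=1e-8):
    """Sweep vπ(s) <- Σa π(a|s) [ Rᵃs + γ Σs' Pᵃss' vπ(s') ] to convergence."""
    v = np.zeros(Ra.shape[0])
    while True:
        q = Ra + gamma * np.einsum("ast,t->sa", Pa, v)  # one-step q backup
        v_new = (pi * q).sum(axis=1)                    # average over the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```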
[Backup diagram: qπ(s, a) at the root, successor states s', then next actions a']

qπ(s, a) = Rᵃs + γ Σ_{s'∈S} Pᵃss' Σ_{a'∈A} π(a'|s') qπ(s', a')
Example: Bellman Expectation Equation in Student MDP
[Figure: the student MDP under the random policy, verifying the Bellman expectation equation at one state]
Bellman Expectation Equation (Matrix Form)
The Bellman expectation equation can be expressed concisely using the induced MRP,
vπ = Rπ + γPπ vπ
with direct solution
vπ = (I − γPπ)⁻¹ Rπ
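Combining the two earlier sketches gives the exact solve (same assumed array conventions):

```python
def policy_value_direct(Pa, Ra, pi, gamma):
    """Exact vπ via the matrix-form Bellman expectation equation."""
    P_pi, R_pi = mrp_under_policy(Pa, Ra, pi)
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
```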
Optimal Value Functions
Definition
The optimal state-value function v∗(s) is the maximum value function over all policies,
v∗(s) = max_π vπ(s)
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies,
q∗(s, a) = max_π qπ(s, a)

Example: Optimal State-Value Function for Student MDP
[Figure: Student MDP with v∗(s) for γ = 1, e.g. v∗ = 6 at one state and v∗ = 0 at the terminal state; Pub action with R = +1]
Example: Optimal Action-Value Function for Student MDP
[Figure: Student MDP with q∗(s, a) for γ = 1, e.g. q∗ = 8.4 for the Pub action]
Optimal Policy
Define a partial ordering over policies: π ≥ π′ if vπ(s) ≥ vπ′(s) for all s.
Theorem
For any Markov Decision Process
There exists an optimal policy π∗ that is better than or equal
to all other policies, π∗ ≥ π, ∀π
All optimal policies achieve the optimal value function,
vπ∗ (s) = v∗ (s)
All optimal policies achieve the optimal action-value function,
qπ∗ (s, a) = q∗ (s, a)
Finding an Optimal Policy
An optimal policy can be found by maximising over q∗(s, a): in every state, pick an action with maximal q∗. There is always a deterministic optimal policy for any MDP, so knowing q∗ is enough to act optimally.
[Figure: the student MDP with the optimal actions highlighted, e.g. q∗ = 8.4 for the Pub action]
Bellman Optimality Equation
[Backup diagram: v∗(s) at the root state s, q∗(s, a) at each action a]

v∗(s) = max_a q∗(s, a)

[Backup diagram: q∗(s, a) at the root pair (s, a), reward r, v∗(s') at each successor state s']

q∗(s, a) = Rᵃs + γ Σ_{s'∈S} Pᵃss' v∗(s')
[Backup diagram: v∗(s) at the root, a max over actions, then an average over successor states s']

v∗(s) = max_a ( Rᵃs + γ Σ_{s'∈S} Pᵃss' v∗(s') )
[Backup diagram: q∗(s, a) at the root, an average over successor states s', then a max over next actions a']

q∗(s, a) = Rᵃs + γ Σ_{s'∈S} Pᵃss' max_{a'} q∗(s', a')
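Because of the max, the Bellman optimality equation is non-linear and has no closed-form solution in general; one standard approach is to iterate the backup, i.e. value iteration (solution methods are covered in later lectures — the sketch below, under the same assumed conventions, only illustrates the idea):

```python
def value_iteration(Pa, Ra, gamma, tol=1e-8):
    """Iterate v∗(s) <- max_a [ Rᵃs + γ Σs' Pᵃss' v∗(s') ]; return v∗ and a greedy policy."""
    v = np.zeros(Ra.shape[0])
    while True:
        q = Ra + gamma * np.einsum("ast,t->sa", Pa, v)
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # deterministic greedy action per state
        v = v_new
```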
Example: Bellman Optimality Equation in Student MDP
[Figure: the student MDP optimal values, verifying the Bellman optimality equation at one state with v∗ = 6]
Extensions to MDPs
Partially Observable MDPs (POMDPs)
Definition
A history Ht is a sequence of actions, observations and rewards,
Ht = A0 , O1 , R1 , ..., At−1 , Ot , Rt
Definition
A belief state b(h) is a probability distribution over states,
conditioned on the history h
[Figure: history tree and belief tree — each history (e.g. a1, a1o1, a1o1a1, ...) corresponds to a belief state P(s | h), branching on actions and observations]
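The belief state can be maintained recursively with a Bayes filter. A sketch assuming a known transition model Pa[a, s, s'] and an observation model Z[a, s', o] (the POMDP observation model is only implicit in these slides):

```python
def belief_update(b, a, o, Pa, Z):
    """Bayes update: b'(s') ∝ Z[a, s', o] · Σs Pa[a, s, s'] · b(s)."""
    predicted = Pa[a].T @ b         # predict: Σs P(s'|s,a) b(s)
    b_new = Z[a, :, o] * predicted  # correct: weight by observation likelihood
    return b_new / b_new.sum()      # normalise to a distribution
```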
Ergodic Markov Process
Theorem
An ergodic Markov process (recurrent and aperiodic: every state is visited infinitely often, without any systematic period) has a limiting stationary distribution dπ(s) with the property
dπ(s) = Σ_{s'∈S} dπ(s') Ps's
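The stationary distribution is a left eigenvector of P with eigenvalue 1; for an ergodic chain, power iteration converges to it from any starting distribution. A minimal sketch:

```python
def stationary_distribution(P, iters=10000):
    """Power iteration on d <- dP; ergodicity guarantees a unique limit."""
    d = np.full(P.shape[0], 1.0 / P.shape[0])  # start from the uniform distribution
    for _ in range(iters):
        d = d @ P
    return d
```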
Average Reward MDPs
Definition
An MDP is ergodic if the Markov chain induced by any policy is
ergodic.
" T #
π 1 X
ρ = lim E Rt
T →∞ T
t=1
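In practice ρπ can be estimated from a single long rollout, since ergodicity makes the time average independent of the start state. A sketch built around a hypothetical `step` helper (not from the slides) that samples the MDP under π:

```python
def average_reward(step, s0, horizon=100000):
    """Estimate ρπ = lim (1/T) E[Σ Rt] by a long sample path."""
    s, total = s0, 0.0
    for _ in range(horizon):
        s, r = step(s)  # hypothetical: samples S_{t+1}, R_{t+1} under policy π
        total += r
    return total / horizon
```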
The value function of an undiscounted, ergodic MDP can be expressed in terms of the average reward: ṽπ(s) is the extra reward accumulated, relative to the average, from starting in state s,

ṽπ(s) = Eπ[ Σ_{k=1}^∞ (Rt+k − ρπ) | St = s ]

This satisfies a Bellman equation,

ṽπ(s) = Eπ[ (Rt+1 − ρπ) + Σ_{k=1}^∞ (Rt+k+1 − ρπ) | St = s ]
      = Eπ[ (Rt+1 − ρπ) + ṽπ(St+1) | St = s ]
Questions?