Value Function
[Figure: two-state Markov chain on states H (happy) and S (sad), with self-loops P_HH = 0.8 and P_SS = 0.7 and cross transitions P_HS = 0.2 and P_SH = 0.3.]

P := [ 0.8  0.2
       0.3  0.7 ]
▶ Inertia ⇒ happy or sad today, likely to stay happy or sad tomorrow (P_HH = 0.8, P_SS = 0.7)
▶ But when sad, the inertia is a little weaker (P_HH > P_SS)
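As a quick check of these numbers, here is a minimal sketch (Python/NumPy assumed; the chain is the one in the figure above, and the 60/40 conclusion follows from the arithmetic, not from the slides):

```python
import numpy as np

# Two-state happy/sad chain: states 0 = H, 1 = S.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])

# The long-run fraction of time in each state is the stationary
# distribution: the left eigenvector of P for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
pi /= pi.sum()
print(pi)  # ~[0.6, 0.4]: happy about 60% of days in the long run
```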
[Figure: four-state Markov chain on states HH, HS, SH, SS encoding the moods of two consecutive days, with e.g. self-loop P_{HH,HH} = 0.9 and self-loop P_{SS,SS} = 0.7.]

P := [ 0.9  0.1  0    0
       0    0    0.4  0.6
       0.8  0.2  0    0
       0    0    0.3  0.7 ]
▶ More time spent happy or sad increases the likelihood of staying happy or sad, e.g. P[H tomorrow | HH] = 0.9 > P[H tomorrow | SH] = 0.8 (see the sampling sketch below)
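A short sampling sketch (Python assumed) of the two-day chain; the augmented state (yesterday, today) is exactly what makes the process Markov even though tomorrow's mood depends on two days:

```python
import numpy as np

# Two-day mood chain: state = (yesterday, today).
rng = np.random.default_rng(0)
states = ["HH", "HS", "SH", "SS"]
P = np.array([[0.9, 0.1, 0.0, 0.0],
              [0.0, 0.0, 0.4, 0.6],
              [0.8, 0.2, 0.0, 0.0],
              [0.0, 0.0, 0.3, 0.7]])

s = 0                               # start happy two days in a row
for _ in range(10):
    s = rng.choice(4, p=P[s])       # next augmented state
    print(states[s], end=" ")
```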
▶ Step to the right with probability p, to the left with probability 1 − p

[Figure: random-walk chain on states . . . , i − 1, i, i + 1, . . . ; each right arrow has probability p.]

P_{i,i+1} = p,   P_{i,i−1} = 1 − p
▶ P_{ij} = 0 for all other transitions
[Figure: three sample random-walk trajectories, position (in steps) versus time, for different values of p.]
▶ With p > 1/2 the walk diverges to the right (grows unbounded almost surely)
▶ With p < 1/2 it diverges to the left
▶ With p = 1/2 it always comes back to visit the origin (almost surely), as the sketch below illustrates
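A minimal simulation sketch (Python assumed) of the three regimes:

```python
import numpy as np

# 1D random walk: step +1 with probability p, -1 with probability 1 - p.
rng = np.random.default_rng(0)

def walk(p, n_steps=1000):
    steps = np.where(rng.random(n_steps) < p, 1, -1)
    return np.cumsum(steps)

# With p > 1/2 the endpoint drifts right, with p < 1/2 left,
# and with p = 1/2 it hovers around the origin.
for p in (0.45, 0.5, 0.55):
    print(p, walk(p)[-1])
```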
[Figure: sample two-dimensional random-walk trajectories; vertical axis: latitude (North-South).]

▶ Take a step in a random direction: East, West, South, or North
▶ States are pairs of coordinates (x, y)
⇒ x = 0, ±1, ±2, . . . and y = 0, ±1, ±2, . . .
▶ Each of the four directions is chosen with probability 1/4, e.g.

P[x(t + 1) = i, y(t + 1) = j + 1 | x(t) = i, y(t) = j] = 1/4

and similarly (probability 1/4 each) for the other three directions
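The same experiment in two dimensions, as a sketch (Python assumed):

```python
import numpy as np

# 2D random walk: one unit East, West, North, or South, each w.p. 1/4.
rng = np.random.default_rng(0)
moves = np.array([(1, 0), (-1, 0), (0, 1), (0, -1)])  # E, W, N, S

path = np.cumsum(moves[rng.integers(0, 4, size=1000)], axis=0)
print(path[-1])  # final (x, y) coordinates after 1000 steps
```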
[Figure: chain on states 0, . . . , i − 1, i, i + 1, . . . , J with right-step probability p and left-step probability 1 − p.]

▶ States 0 and J are called absorbing. Once there, the chain stays there forever
▶ The rest are transient states. Visits to them stop almost surely (see the sketch below)
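Absorption probabilities for this chain can be computed directly from a linear system; a sketch under assumed parameters (J = 10, p = 0.4; neither value is from the slides):

```python
import numpy as np

# Gambler's-ruin style chain on states 0..J with absorbing ends.
# h[i] = P(absorbed at J | start at i) satisfies
# h[i] = p*h[i+1] + (1-p)*h[i-1] for 0 < i < J, with h[0] = 0, h[J] = 1.
J, p = 10, 0.4
A = np.eye(J + 1)
b = np.zeros(J + 1)
b[J] = 1.0
for i in range(1, J):
    A[i, i + 1] -= p
    A[i, i - 1] -= 1 - p
h = np.linalg.solve(A, b)
print(h)  # absorption-at-J probability from each starting state
```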
Value Function
[Figure: tree from S0 through the possible actions a1, a2, . . . to the successor state s.]

P[S1 = s | S0] = Σ_{i=1}^∞ P[S1 = s, A0 = a_i | S0]

▶ Conditioning on A0

P[S1 = s | S0] = Σ_{i=1}^∞ P[S1 = s | A0 = a_i, S0] P[A0 = a_i | S0]
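A tiny numeric check of this marginalization, with illustrative numbers that are not from the slides:

```python
# P[S1 = s | S0] = sum_i P[S1 = s | A0 = a_i, S0] * P[A0 = a_i | S0]
policy = [0.5, 0.5]    # P[A0 = a_i | S0] for two hypothetical actions
trans = [0.8, 0.2]     # P[S1 = s | A0 = a_i, S0] for the same actions

p_marginal = sum(pa * pt for pa, pt in zip(policy, trans))
print(p_marginal)      # 0.5*0.8 + 0.5*0.2 = 0.5
```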
▶ Note that once we have defined the policy, we have a Markov chain
T_T = (S0, A0, R1, S1, A1, . . . , R_{T−1}, S_{T−1}, A_{T−1}, RT, ST)

▶ T can be finite or infinite and it is called the horizon
▶ We want to compute the probability of a given trajectory, that is, P[T_T]

P[T_T] = P[S0, A0, R1, S1, A1, . . . , R_{T−1}, S_{T−1}, A_{T−1}, RT, ST]

▶ Let us condition on T_{T−1} = (S0, A0, R1, S1, A1, . . . , R_{T−1}, S_{T−1})
⇒ By the Markov property, P[T_T] = π(A_{T−1} | S_{T−1}) p(ST, RT | S_{T−1}, A_{T−1}) P[T_{T−1}]
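A sketch of this factorization on a hypothetical finite MDP (all names and numbers below are assumptions for illustration, and rewards are marginalized out):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi[s][a]
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s][a][s']

def trajectory_prob(states, actions, p0):
    """P[T_T] = p0(s0) * prod_t pi(a_t | s_t) * P(s_{t+1} | s_t, a_t)."""
    prob = p0[states[0]]
    for t, a in enumerate(actions):
        prob *= pi[states[t], a] * P[states[t], a, states[t + 1]]
    return prob

print(trajectory_prob([0, 1, 0], [1, 0], p0=np.array([1.0, 0.0])))
```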
Value Function
▶ Since we are talking about sequential processes, we will use the notation

P(St+1 ∈ S′, Rt+1 ∈ R′ | St, At) = ∫_{S′×R′} p(s, r | St, At) ds dr

▶ The transition dynamics are (let us forget about the rewards)

St+1 = St + At + ξt

▶ And we also have the expression for the density of the one-transition case

p(s′ | s) = ∫_A p(s′ | s, a) π(a | s) da
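A sampling sketch of these dynamics (the Gaussian choice for ξt and the proportional policy are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1  # assumed noise scale

def step(s, a):
    # S_{t+1} = S_t + A_t + xi_t, with xi_t ~ N(0, sigma^2) assumed
    return s + a + sigma * rng.standard_normal()

s = 0.0
for t in range(5):
    a = -0.5 * s          # hypothetical policy pushing the state to 0
    s = step(s, a)
    print(t, round(s, 4))
```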
Value Function
▶ Define the return after time t as the sum of the rewards from time t + 1 onward
▶ For episodic tasks with finite horizon T

Gt = Rt+1 + Rt+2 + · · · + RT

▶ The goal is to balance the pole in the upright position for as long as possible
⇒ It is a continuing task ⇒ T = ∞, γ ∈ (0, 1)
⇒ A possible reward: Rt = 1 if θt ∈ [170, 190] (degrees), zero otherwise
▶ We also need the cart to stay in a specified range
⇒ A possible reward: Rt = −1 if |x| > 1
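A sketch of one way to combine the two reward terms above into a single function (summing them is an assumption; the slide lists them separately):

```python
def reward(theta_deg: float, x: float) -> float:
    # +1 while the pole angle is near upright (170-190 degrees)
    r = 1.0 if 170.0 <= theta_deg <= 190.0 else 0.0
    # -1 whenever the cart leaves the allowed range |x| <= 1
    if abs(x) > 1.0:
        r -= 1.0
    return r

print(reward(180.0, 0.0))  # 1.0: balanced pole, centered cart
print(reward(120.0, 1.5))  # -1.0: fallen pole, cart out of range
```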
Value Function
[Figure: happy/sad MDP on states H and S; for each action (e.g. drink, study) the arrows are labeled with a transition probability p ∈ {0.2, 0.8, 1} and a reward r ∈ {−10, 10, 20, 40}.]
▶ Some actions are good despite being bad in the short term
⇒ Studying while happy ⇒ it is hard to assign it credit
▶ Exploration vs. exploitation (see the ε-greedy sketch below)
⇒ If we start happy and we drink, we might think it is the best option
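One standard remedy for locking onto "drink" too early is ε-greedy action selection; a minimal sketch (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon explore uniformly at random,
    # otherwise exploit the action with the best current estimate.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

print(epsilon_greedy(np.array([1.0, 0.5])))  # usually 0, sometimes 1
```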
▶ The policy of an agent is the rule by which the actions are chosen
▶ Formally, it is a mapping from states to probability distributions over actions
▶ If the agent applies the policy π at time t, then π(a|s) is the probability of choosing At = a given that the state is St = s
▶ Policies are stationary in the RL framework (they do not depend on t)
▶ To evaluate the quality of the policy we use the expected return (ER)
▶ The value function is the ER when starting in s and following π
"∞ #
X k
vπ (s) = Eπ [Gt |St = s] = Eπ γ Rt+k+1 |St = s , for all s ∈ S
k=0
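A Monte Carlo sketch of this definition on the two-state chain from earlier (the per-state rewards and the truncation of the infinite sum are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.8, 0.2], [0.3, 0.7]])  # state transitions under pi
r = np.array([1.0, -1.0])               # assumed reward on entering a state
gamma = 0.9

def episode_return(s, steps=200):
    g, discount = 0.0, 1.0
    for _ in range(steps):               # truncates the infinite sum
        s = rng.choice(2, p=P[s])
        g += discount * r[s]
        discount *= gamma
    return g

print(np.mean([episode_return(0) for _ in range(2000)]))  # ~v_pi(H)
```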
▶ Notice that for any j ≠ k,  p(r_k) = ∫_R p(r_j, r_k) dr_j

vπ(s) = Σ_{k=0}^∞ γ^k ∫_R r_k p(r_k | S0 = s) dr_k

▶ It is convenient to write

p(r_k | S0 = s0) = ∫_{S²×A×R} p(r_k, s_k | s_{k−1}, a_{k−1}) π(a_{k−1} | s_{k−1}) p(s_{k−1}, r_{k−1} | s0) ds_k dr_{k−1} ds_{k−1} da_{k−1}
π⋆ = argmax_π vπ(s)
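For finite chains the value function can also be computed exactly: since vπ = rπ + γ Pπ vπ, it solves a linear system. A sketch with the same assumed numbers as the Monte Carlo example above:

```python
import numpy as np

P = np.array([[0.8, 0.2], [0.3, 0.7]])  # transitions under the policy
r_next = np.array([1.0, -1.0])          # assumed reward on entering a state
gamma = 0.9

r_pi = P @ r_next                        # expected immediate reward per state
v = np.linalg.solve(np.eye(2) - gamma * P, r_pi)
print(v)  # matches the Monte Carlo estimate up to sampling/truncation error
```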
Value Function
▶ Recall that the return is given by Gt = Σ_{k=0}^∞ γ^k R_{t+k+1}
▶ Splitting off the first term and substituting l = k − 1,

Gt = Rt+1 + γ Σ_{k=1}^∞ γ^{k−1} R_{t+k+1} = Rt+1 + γ Σ_{l=0}^∞ γ^l R_{t+1+l+1} = Rt+1 + γ Gt+1

▶ Taking expectations of both sides yields the Bellman equations

vπ(s) = Eπ[ Rt+1 + γ vπ(St+1) | St = s ]
qπ(s, a) = E[ Rt+1 + γ vπ(St+1) | St = s, At = a ]
▶ Let us define a "better" policy than π, for instance π′(s) = argmax_{a∈A} qπ(s, a)
▶ In which sense is the policy π′ "better" than π? (see the sketch below)
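A sketch of this greedy improvement step (the q-table values are hypothetical):

```python
import numpy as np

q = np.array([[1.0, 2.5],   # hypothetical q_pi(s, a), one row per state
              [0.3, -0.8]])

pi_improved = q.argmax(axis=1)  # pi'(s) = argmax_a q_pi(s, a)
print(pi_improved)              # action index chosen in each state
```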
▶ Let us check the Bellman optimality equation for the Grid World
▶ Let's focus on the state at the top left corner, v⋆(s) = 22, with γ = 0.9
⇒ up or left ⇒ Rt+1 + γ v⋆(s′) = −1 + 0.9 × 22 = 18.8
⇒ down ⇒ Rt+1 + γ v⋆(s′) = 0 + 0.9 × 19.8 = 17.82
⇒ right ⇒ Rt+1 + γ v⋆(s′) = 0 + 0.9 × 24.4 = 21.96
⇒ The maximum, 21.96, recovers v⋆(s) = 22 up to rounding, as the optimality equation requires
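The same check in code (rewards, successor values, and γ copied from the bullet points above):

```python
gamma = 0.9
# action: (immediate reward, v*(s') of the resulting state)
candidates = {
    "up":    (-1.0, 22.0),   # bumping off the grid: r = -1, stay in place
    "left":  (-1.0, 22.0),
    "down":  ( 0.0, 19.8),
    "right": ( 0.0, 24.4),
}
backups = {a: r + gamma * v for a, (r, v) in candidates.items()}
print(backups)                # "right" attains the maximum, 21.96
print(max(backups.values()))  # ~v*(s) = 22 up to rounding
```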