
Markov Decision Processes

Santiago Paternain and Miguel Calvo-Fullana


Electrical and Systems Engineering, University of Pennsylvania
{spater,cfullana}@seas.upenn.edu

August 29 – September 5, 2019



Markov chains. Definition and examples

Markov chains. Definition and examples

Markov Decision Processes

Markov Decision Processes (Formally)

Goal, Rewards, Returns and Episodes

Value Function

Bellman Equation and Optimality



Markov chains

I Consider time index n = 0, 1, 2, . . . and time-dependent random state Xn


I State Xn takes values in a countable set of states
⇒ In general we denote the states as i = 0, 1, 2, . . .
I Denote the history of the process as X^n = [Xn, Xn−1, . . . , X0]^T
I Denote the stochastic process as XN

I The stochastic process XN is a Markov chain (MC) if


   
P[Xn+1 = j | Xn = i, Xn−1, . . . , X0] = P[Xn+1 = j | Xn = i] = Pij
I Future depends only on current state Xn



Observations

I The process’s history (Xn−1, . . . , X0) is irrelevant for the future evolution of the process


I Probabilities Pij are constant for all times (time invariant)

I From the definition we have that for arbitrary m


   
P[Xn+m | Xn, Xn−1, . . . , X0] = P[Xn+m | Xn]
I Xn+m depends only on Xn+m−1, which depends only on Xn+m−2, . . . which
depends only on Xn

I Since the Pij ’s are probabilities they are nonnegative and sum up to 1



Pij ≥ 0,      Σj Pij = 1



Matrix representation

I Group transition probabilities Pij in a “matrix” P

P :=  [ P00  P01  P02  . . .
        P10  P11  P12  . . .
         ⋮    ⋮    ⋮
        Pi0  Pi1  Pi2  . . .
         ⋮    ⋮    ⋮        ]

I Not really a matrix if the number of states is infinite



Graph representation

I A graph representation is also used

[Figure: chain graph over states . . . , i−1, i, i+1, . . . with self-loop probabilities Pi−1,i−1, Pii, Pi+1,i+1, right transitions Pi−2,i−1, Pi−1,i, Pi,i+1, Pi+1,i+2 and left transitions Pi−1,i−2, Pi,i−1, Pi+1,i, Pi+2,i+1]

I Useful when the number of states is infinite



Example: Happy - Sad

I I can be happy (Xn = 0) or sad (Xn = 1).


I Happiness tomorrow affected by happiness today only
I Model as Markov chain with transition probabilities

P :=  [ 0.8  0.2
        0.3  0.7 ]

[Figure: two-state graph with nodes H and S; self-loops P00 = 0.8 and P11 = 0.7, transitions P01 = 0.2 (H to S) and P10 = 0.3 (S to H)]
I Inertia ⇒ happy or sad today, likely to stay happy or sad tomorrow
(P00 = 0.8, P11 = 0.7)
I But when sad, a little less likely so (P00 > P11 )
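As an illustration (not part of the original slides), a minimal Python sketch that simulates this happy-sad chain with the matrix P above; the long-run fraction of happy days should approach the stationary probability 0.6.

import numpy as np

# Transition matrix of the happy-sad chain: row = current state, column = next state
P = np.array([[0.8, 0.2],   # happy (0) -> {happy, sad}
              [0.3, 0.7]])  # sad   (1) -> {happy, sad}

rng = np.random.default_rng(0)

def simulate(P, x0=0, n_steps=100_000):
    # Simulate a Markov chain with transition matrix P starting from state x0
    states = np.empty(n_steps + 1, dtype=int)
    states[0] = x0
    for n in range(n_steps):
        states[n + 1] = rng.choice(P.shape[0], p=P[states[n]])
    return states

states = simulate(P)
print("empirical fraction of happy days:", np.mean(states == 0))  # close to the stationary value 0.6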



Example: Happy - Sad, version 2

I Happiness tomorrow affected by today and yesterday


I Define double states HH (happy-happy), HS (happy-sad), SH, SS
I Only some transitions are possible
⇒ HH and SH can only become HH or HS
⇒ HS and SS can only become SH or SS
P :=  [ 0.9  0.1  0    0
        0    0    0.4  0.6
        0.8  0.2  0    0
        0    0    0.3  0.7 ]      (rows and columns ordered HH, HS, SH, SS)

[Figure: four-state graph with edges HH→HH (0.9), HH→HS (0.1), HS→SH (0.4), HS→SS (0.6), SH→HH (0.8), SH→HS (0.2), SS→SH (0.3), SS→SS (0.7)]
I More time happy or sad increases likelihood of staying happy or sad

I State augmentation ⇒ Capture longer time memory



Random (drunkard’s) walk

I Step to the right with probability p, to the left with prob. (1-p)

[Figure: chain graph over states . . . , i−1, i, i+1, . . . ; from each state, step right with probability p and left with probability 1−p]

I States are 0, ±1, ±2, . . ., number of states is infinite


I Transition probabilities are

Pi,i+1 = p, Pi,i−1 = 1 − p,
I Pij = 0 for all other transitions



Random (drunkard’s) walk - continued

I Random walks behave differently if p < 1/2, p = 1/2 or p > 1/2

[Figure: three sample paths of the random walk for p = 0.45, p = 0.50 and p = 0.55; position (in steps) versus time]

I With p > 1/2 it diverges to the right (grows unbounded almost surely)
I With p < 1/2 it diverges to the left
I With p = 1/2 it always comes back to visit the origin (almost surely)
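A small Python sketch (added for illustration) that draws sample paths of the walk for the three regimes; the exact paths are random, only the qualitative behavior matches the plots described above.

import numpy as np

rng = np.random.default_rng(1)

def random_walk(p, n_steps=1000):
    # +1 step with probability p, -1 step with probability 1 - p
    steps = rng.choice([1, -1], size=n_steps, p=[p, 1 - p])
    return np.concatenate(([0], np.cumsum(steps)))

for p in (0.45, 0.50, 0.55):
    path = random_walk(p)
    print(f"p = {p:.2f}: final position after {len(path) - 1} steps = {path[-1]}")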



Two dimensional random walk

I Take a step in a random direction: East, West, South or North
⇒ E, W, S, N chosen with equal probability
I States are pairs of coordinates (x, y)
⇒ x = 0, ±1, ±2, . . . and y = 0, ±1, ±2, . . .
I Transition probabilities are nonzero only for points adjacent in the grid

P[x(t+1) = i+1, y(t+1) = j | x(t) = i, y(t) = j] = 1/4
P[x(t+1) = i−1, y(t+1) = j | x(t) = i, y(t) = j] = 1/4
P[x(t+1) = i, y(t+1) = j+1 | x(t) = i, y(t) = j] = 1/4
P[x(t+1) = i, y(t+1) = j−1 | x(t) = i, y(t) = j] = 1/4

[Figure: two sample paths of the two-dimensional random walk, Latitude (North−South) versus Longitude (East−West)]



More about random walks

I Some random facts of life for equiprobable random walks

I In one and two dimensions probability of returning to origin is 1


I Will almost surely return home

I In more than two dimensions the return probability is less than 1


I In three dimensions probability of returning to origin is 0.34
I Then 0.19, 0.14, 0.10, 0.08, . . .



Random walk with borders (gambling)

I As a random walk, but stop moving when i = 0 or i = J


⇒ Models a gambler that stops playing when ruined, Xn = 0
⇒ Or when the target gain Xn = J is reached
[Figure: chain graph over states 0, . . . , i−1, i, i+1, . . . , J; interior states step right with probability p and left with probability 1−p, while 0 and J have self-loops with probability 1]

I States are 0, 1, . . . , J. Finite number of states (J + 1). Transition probs.

Pi,i+1 = p, Pi,i−1 = 1 − p, P00 = 1, PJJ = 1


I Pij = 0 for all other transitions

I States 0 and J are called absorbing. Once there stay there forever
I The rest are transient states. Visits stop almost surely
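A hedged illustration in Python of the gambling interpretation: simulate the bounded walk until absorption and compare with the classical gambler's-ruin formula (the parameters p, start and J below are chosen arbitrarily).

import numpy as np

rng = np.random.default_rng(2)

def bounded_walk(p, start, J):
    # Run the walk until absorption at 0 (ruin) or J (target)
    i = start
    while i not in (0, J):
        i += 1 if rng.random() < p else -1
    return i

p, start, J, n_runs = 0.45, 5, 10, 10_000
wins = sum(bounded_walk(p, start, J) == J for _ in range(n_runs))
q_over_p = (1 - p) / p
exact = (1 - q_over_p**start) / (1 - q_over_p**J)  # classical gambler's-ruin formula (valid for p != 1/2)
print("estimated P(reach J before ruin):", wins / n_runs)
print("closed-form value:              ", exact)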



Markov Decision Processes

Markov chains. Definition and examples

Markov Decision Processes

Markov Decision Processes (Formally)

Goal, Rewards, Returns and Episodes

Value Function

Bellman Equation and Optimality



Markov Decision Process

I General framework to formalize sequential decision making

I At each time step t = 0, 1 . . . the agent is in state St ∈ S


I It selects an action At ∈ A(St ), where the action set may be state dependent
I As a consequence of the action the environment produces
⇒ A numerical reward Rt+1 ∈ R
⇒ The transition of the system to a new state St+1



Markov Decision Process

I A Markov Decision Process is a tuple (S, A, R, P)


I S is a finite or infinite but countable set of states
I A is a finite or infinite but countable set of actions
I R ⊆ ℝ is the set of rewards
I P is a Markov transition probability if for any R′ ⊆ R

P[St+1 , Rt+1 ∈ R′ | St , At , St−1 , At−1 , . . . , S0 , A0 ] = P[St+1 , Rt+1 ∈ R′ | St , At ]

⇒ Memory-less transition probability


⇒ Only depends on the current state and action



Examples: TIC-TAC-TOE

I State space S ⇒ All combinations of ◦ and ×


I Action space A ⇒ Where to place the next cross; it is state dependent
I The next state is a function of the current state, the action and “randomness”





Policies

I A policy π is a distribution over actions given states

π(a|s) = P[At = a|St = s]


I The policy defines the behavior of the agent
I The policy depends on the current state only
I Deterministic policies ⇒ Assign probability one to one action
I Policies are stationary

At ∼ π(·|St ), for all t ≥ 0
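As a small illustration (not from the slides), one common way to store a stationary stochastic policy for a finite MDP is a table of action probabilities per state; the numbers below are made up.

import numpy as np

rng = np.random.default_rng(3)

# A stochastic policy for a toy MDP with 2 states and 3 actions,
# stored as a table pi[s, a] = P(A_t = a | S_t = s); each row sums to 1
pi = np.array([[0.5, 0.3, 0.2],
               [0.1, 0.1, 0.8]])

def sample_action(pi, s):
    # Draw A_t ~ pi(. | S_t = s)
    return rng.choice(pi.shape[1], p=pi[s])

print([sample_action(pi, s=0) for _ in range(5)])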



Example: TIC-TAC-TOE

I We could use a random policy for instance π ∼ U[1, 2, . . . , 9]


I We can also use a mixed policy
⇒ S0 = empty board, A0 = center position
⇒ S1 = cross center, circle corner, A1 ∼ U[corners]
⇒ ···
I We can also make it fully deterministic
⇒ S0 = empty board, A0 = center position
⇒ S1 = cross center, circle corner 1, A1 = corner 2
⇒ S2 = cross center, circle corner 2, A2 = corner 3
⇒ ···



One Transition

I Probability of reaching state s in one step given that we start at S0


[Figure: from state S0, actions a1, a2, . . . , ai, . . . lead to state s]

I Law of total probability

P[S1 = s | S0] = Σi P[S1 = s, A0 = ai | S0]

I Conditioning on A0

P[S1 = s | S0] = Σi P[S1 = s | A0 = ai , S0] P[A0 = ai | S0]

I Using the definition of the policy

P[S1 = s | S0] = Σi P[S1 = s | A0 = ai , S0] π(ai | S0)

I Note that once we have defined the policy we have a Markov chain
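A minimal numerical sketch of this marginalization (added for illustration): for a finite MDP with a made-up transition tensor P[s, a, s′] and policy table π(a|s), the induced Markov chain has transition matrix Σa π(a|s) P[s, a, s′].

import numpy as np

# Hypothetical finite MDP with 2 states and 2 actions:
# P[s, a, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
# pi[s, a] = P(A_t = a | S_t = s)
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# Marginalize over the action: P_pi[s, s'] = sum_a pi[s, a] * P[s, a, s']
P_pi = np.einsum('sa,sar->sr', pi, P)
print(P_pi)              # transition matrix of the induced Markov chain
print(P_pi.sum(axis=1))  # each row sums to 1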



Two Steps Transition

I Probability of reaching state s in two steps given that we start at S0


I Using the law of total probability

P[S2 = s | S0] = Σi,j P[S2 = s, S1 = si , A1 = aj | S0]

I Conditioning on S1 and A1 we have

P[S2 = s | S0] = Σi,j P[S2 = s | S1 = si , A1 = aj , S0] P[S1 = si , A1 = aj | S0]

I Using the Markov property and conditioning on S1

P[S2 = s | S0] = Σi,j P[S2 = s | si , aj ] P[A1 = aj | S1 = si , S0] P[S1 = si | S0]



Two Steps Transition

I Probability of reaching state s in two steps given that we start at S0



P[S2 = s | S0] = Σi,j P[S2 = s | si , aj ] P[A1 = aj | S1 = si , S0] P[S1 = si | S0]

I Since the policy only depends on the current state

P[S2 = s | S0] = Σi,j P[S2 = s | si , aj ] π(aj | si ) P[S1 = si | S0]

I Recall that the expression for one transition is

P[S1 = si | S0] = Σk P[S1 = si | ak , S0] π(ak | S0)

I Substituting in the above equation it follows that

P[S2 = s | S0] = Σi,j,k P[S2 = s | si , aj ] π(aj | si ) P[S1 = si | ak , S0] π(ak | S0)



Trajectories

I A trajectory TT is a collection of states, actions and rewards

TT = (S0 , A0 , R1 , S1 , A1 , . . . , RT −1 , ST −1 , AT −1 , RT , ST )
I T can be finite or infinite and it is called the horizon
I We want to compute the probability of a given trajectory, that is, P[TT ]

P[TT ] = P [S0 , A0 , R1 , S1 , A1 , . . . , RT −1 , ST −1 , AT −1 , RT , ST ]
I Let us condition on TT −1 = (S0 , A0 , R1 , S1 , A1 , . . . , RT −1 , ST −1 )

P[TT ] = P[ST , RT , AT −1 |TT −1 ]P[TT −1 ]


I Let us also condition on AT −1

P[TT ] = P[ST , RT |AT −1 , TT −1 ]P[AT −1 |TT −1 ]P[TT −1 ]



Trajectories

I From the previous slide we have that

P[TT ] = P[ST , RT |AT −1 , TT −1 ]P[AT −1 |TT −1 ]P[TT −1 ]


I Let us expand the first factor of the right hand side

P[ST , RT |AT −1 , TT −1 ] = P[ST , RT |AT −1 , S0 , A0 , R1 , S1 , A1 , . . . , RT −1 , ST −1 ]


I Using the Markov Property the transition depends only on ST −1 , AT −1

P[ST , RT |AT −1 , TT −1 ] = P[ST , RT |AT −1 , ST −1 ]


I Likewise, P[AT −1 |TT −1 ] is the policy and it only depends on ST −1

P[TT ] = P[ST , RT |AT −1 , ST −1 ]π[AT −1 |ST −1 ]P[TT −1 ]



Trajectories

I From the previous slide we have that

P[TT ] = P[ST , RT |AT −1 , ST −1 ]π[AT −1 |ST −1 ]P[TT −1 ]


I The previous expression is true for all T ⇒ it also follows that

P[TT −1 ] = P[ST −1 , RT −1 |AT −2 , ST −2 ]π[AT −2 |ST −2 ]P[TT −2 ]


I This defines a recursion and we can keep expanding the expression
 
P[TT ] = ( Π_{t=0}^{T−1} P[St+1 , Rt+1 | At , St ] π[At | St ] ) P[S0 ]
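A sketch of this product formula in Python, assuming a small finite MDP with made-up tables and folding the rewards into the transition term for simplicity (each (s, a, s′) pair is assumed to produce its reward with probability 1).

import numpy as np

# Hypothetical finite MDP tables (illustrative values only)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
p0 = np.array([1.0, 0.0])  # initial distribution P[S0]

def trajectory_probability(states, actions):
    # P[T_T] = ( prod_t P[S_{t+1} | S_t, A_t] * pi[A_t | S_t] ) * P[S_0]
    prob = p0[states[0]]
    for t, a in enumerate(actions):
        prob *= pi[states[t], a] * P[states[t], a, states[t + 1]]
    return prob

# P[S0=0] * pi(0|0) P(0->0) * pi(1|0) P(0->1) = 1 * 0.7*0.9 * 0.3*0.8 = 0.1512
print(trajectory_probability(states=[0, 0, 1], actions=[0, 1]))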



Markov Decision Processes (Formally)

Markov chains. Definition and examples

Markov Decision Processes

Markov Decision Processes (Formally)

Goal, Rewards, Returns and Episodes

Value Function

Bellman Equation and Optimality



Measurable Space

I Let Ω be a set of outcomes


⇒ Example: the throw of a die, Ω = {1, 2, 3, 4, 5, 6}
⇒ For Markov Decision Processes Ω = S × A × S × R
I Let F be a set of events or a σ-algebra
I Formally F is a nonempty collection of subsets of Ω that satisfies
⇒ (i) if A ∈ F then Ac ∈ F
⇒ (ii) if Ai ∈ F is a countable sequence of sets then ∪i Ai ∈ F
I Examples of σ-algebras for the throw of a die
⇒ The trivial σ-algebra F = {∅, Ω}
⇒ The “less trivial” F = {∅, {1} , {2, 3, 4, 5, 6} , Ω}
⇒ The “Power set”
I The Borel σ-algebra is the one generated by the open sets
I A measurable space is any pair (Ω, F)



Probability measure

I Why is it called measurable?


I You can define a measure on the space, that is, a function µ : F → ℝ
⇒ µ(A) ≥ µ(∅) = 0 for all A ∈ F and
⇒ if Ai ∈ F is a countable sequence of disjoint sets, then

µ(∪i Ai ) = Σi µ(Ai )

I If µ(Ω) = 1 , we call µ a probability measure



Transition Probabilities

I A function p : S × A × S × R → R is a transition probability if


I (i) ∀(s, a) ∈ S × A , K → p(s, a, K) is a probability measure on (S × R)
I (ii) For each K ⊆ S × R , (s, a) → p(s, a, K) is a measurable function
⇒ “For every value of p there is a state-action pair that originated it”
I “The probability of getting a pair (s, r) only depends on the pair (s, a)”



Markov Decision Process

I Let p : (S × A) × (S × R) → R be a transition probability


I (St , At ) is a Markov Decision Process with transition probability p if

P(St+1 ∈ S′ , Rt+1 ∈ R′ | S0 , A0 , . . . , St , At ) = p(St , At , S′ , R′ )


I The probability of the state St+1 and the reward Rt+1


⇒ only depends on the last state St and action At
⇒ Independent of (S0 , A0 ), . . . , (St−1 , At−1 ) ⇒ Memoryless systems
I It is the same idea as in the discrete case and we will also write

P(St+1 ∈ S′ , Rt+1 ∈ R′ | S0 , A0 , . . . , St , At ) = P(St+1 ∈ S′ , Rt+1 ∈ R′ | St , At )




Discrete state-space examples

I Let p : (S × A) × (S × R) → R be a transition probability


I (St , At ) is a Markov Decision Process with transition probability p if

P(St+1 ∈ S′ , Rt+1 ∈ R′ | S0 , A0 , . . . , St , At ) = p(St , At , S′ , R′ )


I For example we have that p(high, search, high, rsearch ) = α

P(St+1 = high, Rt+1 = rsearch |S0 , A0 , . . . , St = high, At = search) = α



Markov Decision Process

I We need these formalities to unify continuous and discrete cases


I We can now consider situations where states and actions are uncountable
I If we are given the transition probability we can compute

P(St+1 ∈ S′ , Rt+1 ∈ R′ | St , At ) = p(St , At , S′ , R′ )


I From the definition of transition probability we have that, given (St , At ),
(S′ , R′ ) → p(St , At , S′ , R′ ) is a measure

P(St+1 ∈ S′ , Rt+1 ∈ R′ | St , At ) = ∫_{S′×R′} p(St , At , s, r ) ds dr

I Since we are talking about sequential processes we will use the notation

P(St+1 ∈ S′ , Rt+1 ∈ R′ | St , At ) = ∫_{S′×R′} p(s, r | St , At ) ds dr



Example: One dimensional particle

I Let ξt ∼ N (µ, σ) be i.i.d., and consider the following dynamics

St+1 = St + At + ξt
I The transition dynamics are (Let us forget about the rewards)

p(s | St , At ) = (1 / (√(2π) σ)) exp( −(s − St − At − µ)^2 / (2σ^2) )

I Equivalently St+1 ∼ N (St + At + µ, σ)
I Without the action this is a Gaussian random walk
I We can compute the probability P(St+1 ∈ S′ | St , At ) as

P(St+1 ∈ S′ | St , At ) = ∫_{S′} (1 / (√(2π) σ)) exp( −(s − St − At − µ)^2 / (2σ^2) ) ds
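A quick numerical check (added for illustration, with arbitrarily chosen µ, σ, St and At): sample the dynamics and compare the empirical probability of landing in an interval S′ with the Gaussian integral above.

import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(4)

mu, sigma = 0.0, 1.0        # noise parameters (chosen for illustration)
s_t, a_t = 2.0, 0.5         # current state and action

# Sample S_{t+1} = S_t + A_t + xi_t many times
xi = rng.normal(mu, sigma, size=100_000)
s_next = s_t + a_t + xi

# Probability that S_{t+1} lands in the set S' = [2, 3]
empirical = np.mean(np.logical_and(s_next >= 2.0, s_next <= 3.0))

def normal_cdf(x, loc, scale):
    return 0.5 * (1.0 + erf((x - loc) / (scale * sqrt(2.0))))

exact = normal_cdf(3.0, s_t + a_t + mu, sigma) - normal_cdf(2.0, s_t + a_t + mu, sigma)
print(empirical, exact)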



Policies in Continuous State-Action Spaces

I A policy π is a probability measure over actions given states


⇒ π(a|s) ≥ 0 and for every s ∈ S it follows that ∫_A π(a|s) da = 1
I The policy defines the behavior of the agent
I The policy depends only on the current state
I Policies are stationary At ∼ π(·|St ), for all t ≥ 0



Transitions

I We want to compute the density function for state S1 given state S0


I Recall the law of total probability

P[S1 = s | S0] = Σi P[S1 = s, A0 = ai | S0]

I Similarly for densities we have that

p(S1 = s′ | S0 = s) = ∫_A p(s′ , a | s) da

I Conditioning on the action it follows that

p(S1 = s′ | S0 = s) = ∫_A p(s′ | s, a) π(a|s) da



Two step Transitions

I We want to compute the density function for state S2 given state S0


I Marginalizing with respect to the state-action pair (s′ , a′ )

p(S2 = s″ | S0 = s) = ∫_{S×A} p(s″ , a′ , s′ | s) da′ ds′

I Conditioning on the state and action it follows that

p(S2 = s″ | S0 = s) = ∫_{S×A} p(s″ | a′ , s′ , s) p(a′ , s′ | s) da′ ds′

I Using the Markov property and conditioning with respect to s′

p(S2 = s″ | S0 = s) = ∫_{S×A} p(s″ | a′ , s′ ) p(a′ | s, s′ ) p(s′ | s) da′ ds′



Two step Transitions

I From the previous slide we have that

p(S2 = s″ | S0 = s) = ∫_{S×A} p(s″ | a′ , s′ ) p(a′ | s, s′ ) p(s′ | s) da′ ds′

I We also have the expression for the one-step transition density

p(s′ | s) = ∫_A p(s′ | s, a) π(a|s) da

I Note that p(a′ | s, s′ ) = π(a′ | s′ ), so

p(S2 = s″ | S0 = s) = ∫_{S×A^2} p(s″ | a′ , s′ ) π(a′ | s′ ) p(s′ | s, a) π(a|s) da′ ds′ da



Goal, Rewards, Returns and Episodes

Markov chains. Definition and examples

Markov Decision Processes

Markov Decision Processes (Formally)

Goal, Rewards, Returns and Episodes

Value Function

Bellman Equation and Optimality



Goals and Rewards

I Rewards need to be defined in terms of the desired goal


⇒ what we want accomplished and not how to do it
⇒ However we might use reward shaping for improving convergence
I Is the reward a function of the state that we reach or the transition?
⇒ It’s more natural to define it as a function of the state we reach
⇒ However it’s the same from a mathematical point of view
⇒ Conditioning on the state s′ it follows that

p(r , s′ | s, a) = p(r | s′ , s, a) p(s′ | s, a)

⇒ If the rewards depend only on the next state we have that

p(r , s′ | s, a) = p(r | s′ ) p(s′ | s, a)



Returns and Episodes

I The return after time t is the sum of the rewards from time t + 1 onward
I For episodic tasks with finite horizon T

Gt = Rt+1 + Rt+2 + · · · + RT

⇒ For instance TIC-TAC-TOE finishes after at most 5 actions

⇒ The horizon might not be fixed, as in TIC-TAC-TOE
I For continuing tasks the horizon is T = ∞

Gt = Rt+1 + γRt+2 + γ^2 Rt+3 + · · · = Σ_{k=0}^∞ γ^k Rt+k+1

⇒ γ ∈ (0, 1) is a discount factor ⇒ relative importance of the future



Why Discount?

I It is mathematically convenient to discount rewards


⇒ Avoids infinite returns in continuing tasks
I In general we are more uncertain about the far future
I For financial rewards immediate rewards are more relevant
⇒ Money loses value with time
I Animal/human behavior shows preference for immediate rewards
I It is possible to use undiscounted rewards if all sequences terminate



Returns and Episodes

I To unify both frameworks we can add an absorbing terminal state

I For instance in TIC-TAC-TOE the end of the game is absorbing


I At the terminal state rewards are zero and we can write the return as
Gt = Σ_{k=t+1}^T γ^{k−t−1} Rk

I With the possibility of having T = ∞ or γ = 1 but not both
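A tiny Python helper (illustrative only, with made-up reward values) that evaluates this return for a finite list of rewards.

def discounted_return(rewards, gamma):
    # G_t = sum over k of gamma^k * R_{t+k+1}, given the rewards [R_{t+1}, ..., R_T]
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]            # illustrative reward sequence
print(discounted_return(rewards, gamma=0.9))   # 0.9^2 * 1 + 0.9^4 * 5 = 4.0905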



Examples: Cart-Pole

I The goal is to balance the pole in the upright position for as long as possible
⇒ It is a continuing task ⇒ T = ∞, γ ∈ (0, 1)
⇒ A possible reward Rt = 1 if θt ∈ [170, 190], zero otherwise
I We also need the cart to be in a specified range
⇒ A possible reward Rt = −1 if |x| > 1



Examples: TIC-TAC-TOE

I The goal is to learn to win ⇒ We do not want to reward “good moves”


⇒ Reward Rt = 0 as long as the game is not finished
I Episodic task with random horizon T
I RT = 1 if game is won, RT = −1 if game is lost, RT = 0 if tied



Examples: Grid World

I The goal is to arrive at the target without colliding


I Episodic task with horizon T and absorbing states ⇒ goal or collision
I Reward Rt = 0 as long as the agent is moving
I RT = 1 if the goal is reached, RT = −1 if collision happens



Examples: Recycling Robot

I A robot is trying to collect cans in an office environment


I Its states are S = {high, low} battery level
I It can choose actions A = {recharge, search, wait}
I If it runs out of energy it gets a bad reward
I Otherwise it gets one point per can collected



Value Function

Markov chains. Definition and examples

Markov Decision Processes

Markov Decision Processes (Formally)

Goal, Rewards, Returns and Episodes

Value Function

Bellman Equation and Optimality



You are the Learner

I We can either be Happy Xt = H or Sad Xt = S


I In each state we can take two actions
⇒ Have a beer with your friends At = 1 or Study for ESE-680 At = 2

[Figure: two states H and S]



You are the Learner

I We can either be Happy Xt = H or Sad Xt = S


I In each state we can take two actions
⇒ Have a beer with your friends At = 1 or Study for ESE-680 At = 2

[Figure: MDP diagram over states H and S; each state-action pair leads to two labeled transitions with probability-reward labels p = 0.8, r = −10; p = 0.2, r = −10; p = 0.2, r = 40; p = 0.8, r = 40; p = 1, r = 10; p = 0.2, r = 20; p = 0.8, r = 20]



Some RL difficulties

I Some actions are good despite being bad in the short term
⇒ Studying while happy ⇒ it is hard to assign it credit
I Exploration vs. exploitation
⇒ If we start happy and we drink we might think it is the best option

[Figure: the same MDP diagram as on the previous slide]



How can we evaluate the quality of a policy?

I The policy of an agent is the rule under which the actions are chosen
I Formally, it is a mapping from states to probabilities over actions
I If the agent applies the policy π at time t then π(a|s) is the probability of
choosing At = a given that the state St = s
I Policies are stationary in the RL framework
I To evaluate the quality of the policy we use the expected return (ER)
I The value function is the ER when starting in s and following π

vπ (s) = Eπ [Gt | St = s] = Eπ [ Σ_{k=0}^∞ γ^k Rt+k+1 | St = s ],  for all s ∈ S
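As an illustration of this definition (not an algorithm from the slides), the value of a state can be estimated by averaging sampled discounted returns; the MDP tables below are hypothetical and the reward is assumed deterministic given (s, a).

import numpy as np

rng = np.random.default_rng(5)

# Hypothetical finite MDP tables (illustrative values)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]: reward assumed deterministic given (s, a)
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],                # pi[s, a]
               [0.4, 0.6]])
gamma = 0.9

def sampled_return(s, horizon=200):
    # One (truncated) sample of the discounted return G_t starting from state s under pi
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(2, p=pi[s])
        g += discount * R[s, a]
        discount *= gamma
        s = rng.choice(2, p=P[s, a])
    return g

v_hat = np.mean([sampled_return(s=0) for _ in range(2000)])
print("Monte Carlo estimate of v_pi(0):", v_hat)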



Examples of value function: Grid World

I Actions are A = {up, right, down, left}


I Rewards are −1 if the action pushes the agent off the board, in which case it does not move
I For uniform policy and γ = 0.9 the value function is

I It depends on the policy selected. For the optimal policy it yields



Examples of value function: Grid World

I Let’s compute V for the optimal policy at one point

I Say we want v(A) = 10 + 10γ^5 + 10γ^10 + . . ., with γ = 0.9

v(A) = 10 ( (γ^5)^0 + (γ^5)^1 + (γ^5)^2 + . . . )

I It is a geometric series with ratio γ^5

v(A) = 10 / (1 − γ^5) ≈ 24.42



Value function

I Let us recall the definition of the value function

vπ (s) = Eπ [G0 | S0 = s] = Eπ [ Σ_{k=0}^∞ γ^k Rk+1 | S0 = s ],  for all s ∈ S

I Let us make the dependence on the policy explicit

vπ (s) = ∫_{R^∞} Σ_{k=0}^∞ γ^k rk p(r0 , r1 , . . . | S0 = s) dr0 dr1 . . .

I Exchanging the sum and the integral it follows that

vπ (s) = Σ_{k=0}^∞ γ^k ∫_{R^∞} rk p(r0 , r1 , . . . | S0 = s) dr0 dr1 . . .

I Notice that for any j ≠ k, p(rk ) = ∫_R p(rj , rk ) drj

vπ (s) = Σ_{k=0}^∞ γ^k ∫_R rk p(rk | S0 = s) drk



Value function

I From the previous slide we have that


vπ (s) = Σ_{k=0}^∞ γ^k ∫_R rk p(rk | S0 = s) drk

I We just need to compute p(rk | S0 ) ⇒ introduce state and action

p(rk | S0 ) = ∫_{S×A} p(rk , sk−1 , ak−1 | S0 ) dsk−1 dak−1

I Conditioning on ak−1 and sk−1 it follows that

p(rk | S0 ) = ∫_{S×A} p(rk | sk−1 , ak−1 ) p(sk−1 , ak−1 | S0 ) dsk−1 dak−1

I Conditioning on the state sk−1 it follows that

p(rk | S0 ) = ∫_{S×A} p(rk | sk−1 , ak−1 ) π(ak−1 | sk−1 ) p(sk−1 | S0 ) dsk−1 dak−1



Value function

I From the previous slide we have that


p(rk | S0 = s0 ) = ∫_{S×A} p(rk | sk−1 , ak−1 ) π(ak−1 | sk−1 ) p(sk−1 | s0 ) dsk−1 dak−1

I It is convenient to write

p(rk | S0 = s0 ) = ∫_{S^2×A×R} p(rk , sk | sk−1 , ak−1 ) π(ak−1 | sk−1 ) p(sk−1 , rk−1 | s0 ) dsk drk−1 dsk−1 dak−1

I Repeating the same steps it follows that

p(rk | S0 = s0 ) = ∫_{S^k×A^k×R^{k−1}} Π_{j=0}^{k−1} p(sj+1 , rj+1 | sj , aj ) π(aj | sj ) dsk dak−1 drk−1

I Putting everything together we have that

vπ (s) = Σ_{k=0}^∞ γ^k ∫_R rk p(rk | S0 = s) drk
       = Σ_{k=0}^∞ γ^k ∫_{R^k×A^k×S^{k−1}} rk Π_{j=0}^{k−1} p(sj+1 , rj+1 | sj , aj ) π(aj | sj ) dsk dak−1 drk



Value function

I We have from the previous slide


vπ (s) = Σ_{k=0}^∞ γ^k ∫_{R^k×A^k×S^{k−1}} rk Π_{j=0}^{k−1} p(sj+1 , rj+1 | sj , aj ) π(aj | sj ) dsk dak−1 drk

I Alternatively, using the idea of trajectory we can write

vπ (s) = Σ_{k=0}^∞ γ^k ∫_{Tk} rk p(Tk | S0 ) dTk = ∫_T ( Σ_{k=0}^∞ γ^k rk ) p(T | S0 ) dT

I The value function depends on the policy ⇒ the goal is to find

π⋆ = argmax_π vπ (s)



Bellman Equation and Optimality

Markov chains. Definition and examples

Markov Decision Processes

Markov Decision Processes (Formally)

Goal, Rewards, Returns and Episodes

Value Function

Bellman Equation and Optimality



Bellman’s Equation

I Recursive relationship between vπ at two consecutive states

vπ (s) = Eπ [ Rt+1 + γ vπ (s′ ) | St = s ]

I Recall that the return is given by Gt = Σ_{k=0}^∞ γ^k Rt+k+1

Gt = Rt+1 + γ Σ_{k=1}^∞ γ^{k−1} Rt+k+1 = Rt+1 + γ Σ_{l=0}^∞ γ^l Rt+1+l+1 = Rt+1 + γ Gt+1

I Then it follows that

vπ (s) = Eπ [Gt | St = s] = Eπ [ Rt+1 + γ Gt+1 | St = s ]

I Using the tower property of the conditional expectation

vπ (s) = Eπ [ Rt+1 + γ Eπ [ Gt+1 | St+1 = s′ ] | St = s ]
       = Eπ [ Rt+1 + γ vπ (s′ ) | St = s ]



Examples of value function: Grid World

I Let us check the Bellman’s Equation for the Grid World

vπ (s) = Eπ [ Rt+1 + γ vπ (s′ ) | St = s ]

I With the uniform policy and γ = 0.9 the value function is

I Let’s focus on the state at the top left corner


⇒ Actions are up or left ⇒ state is unchanged and reward of −1
⇒ Actions are right or down ⇒ state changes and reward of 0
Eπ [ Rt+1 + γ vπ (s′ ) | St = s ] = (1/2)(−1 + 0.9 × 3.3) + (0.9/4)(8.8 + 1.5) = 3.3025



Bellman Equation

I The value function is the only solution to the Bellman equation


I Many policy evaluation algorithms are based on this property
I Policy evaluation in dynamic programming

vk+1 (s) = Eπ [ Rt+1 + γ vk (St+1 ) | St = s ]

⇒ Requires knowledge of the transition dynamics to compute the expectation


I Our first Reinforcement Learning algorithm TD(0)

vk+1 (St ) = vk (St ) + α [Rt+1 + γvk (St+1 ) − vk (St )]

⇒ Notice that we do not require the transition dynamics


⇒ We just require the next state St+1 and the reward Rt+1
⇒ We will return to this algorithm later in the class
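A compact sketch of both updates on a hypothetical finite MDP (same made-up tables as in the earlier sketches): model-based policy evaluation iterates the Bellman expectation, while TD(0) only needs sampled transitions.

import numpy as np

# Hypothetical MDP tables (illustrative values)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
gamma = 0.9

# Dynamic-programming policy evaluation:
# v_{k+1}(s) = E_pi[ R_{t+1} + gamma * v_k(S_{t+1}) | S_t = s ]
r_pi = (pi * R).sum(axis=1)             # expected immediate reward under pi
P_pi = np.einsum('sa,sar->sr', pi, P)   # induced state transition matrix
v = np.zeros(2)
for _ in range(200):
    v = r_pi + gamma * P_pi @ v
print("v_pi from policy evaluation:", v)

# A single TD(0) update from one observed transition (S_t, R_{t+1}, S_{t+1}); no model needed
alpha = 0.1
v_td = np.zeros(2)
s, r, s_next = 0, 1.0, 1
v_td[s] += alpha * (r + gamma * v_td[s_next] - v_td[s])
print("v after one TD(0) update:", v_td)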



Bellman Equation

I Proof of convergence of policy evaluation

max_s |vk+1 (s) − vπ (s)| = max_s | E[ Rt+1 + γ vk (St+1 ) | St = s, π ] − E[ Rt+1 + γ vπ (St+1 ) | St = s, π ] |
                          = max_s | E[ γ vk (St+1 ) − γ vπ (St+1 ) | St = s, π ] |
                          = γ max_s | E[ vk (St+1 ) − vπ (St+1 ) | St = s, π ] |
                          ≤ γ max_s | vk (s) − vπ (s) |

I This implies limk→∞ vk = vπ



Control of Markov Decision Processes

I So far we have only considered the evaluation of a policy by looking at

vπ (s) = Eπ [ Σ_{k=0}^∞ γ^k Rt+k+1 | St = s ],  for all s ∈ S

I For control we are interested in the Q-function

qπ (s, a) = Eπ [ Σ_{k=0}^∞ γ^k Rt+k+1 | St = s, At = a ],  for all s ∈ S, a ∈ A

I Conditioning with respect to the state St+1

qπ (s, a) = Eπ [ Eπ [ Rt+1 + γ Σ_{k=0}^∞ γ^k Rt+1+k+1 | St+1 = s′ ] | St = s, At = a ]
          = Eπ [ Rt+1 + γ Eπ [ Σ_{k=0}^∞ γ^k Rt+1+k+1 | St+1 = s′ ] | St = s, At = a ]
          = E[ Rt+1 + γ vπ (s′ ) | St = s, At = a ]

I Notice that the last expectation is independent of the policy



Policy improvement

I From the previous slide we have that

qπ (s, a) = E[ Rt+1 + γ vπ (s′ ) | St = s, At = a ]

I Let us define a “better” policy than π, for instance π′ (s) = argmax_{a∈A} qπ (s, a)
I In which sense is the policy π′ “better” than π?

vπ′ (s) ≥ vπ (s)


I Proof: ⇒ Homework
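A small sketch of this greedy improvement step, assuming the hypothetical MDP tables used earlier and a stand-in value function vπ.

import numpy as np

# Hypothetical MDP tables (illustrative values)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
v_pi = np.array([6.1, 5.8])  # stand-in for a value function obtained from policy evaluation

# q_pi(s, a) = E[ R_{t+1} + gamma * v_pi(S_{t+1}) | S_t = s, A_t = a ]
q_pi = R + gamma * np.einsum('sar,r->sa', P, v_pi)

# pi'(s) = argmax_a q_pi(s, a): a deterministic improved policy
pi_improved = q_pi.argmax(axis=1)
print(q_pi)
print("greedy policy:", pi_improved)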



Optimality

I In particular we are interested in the best policy

v⋆ (s) = max_π vπ (s) = max_π Eπ [ Rt+1 + γ vπ (s′ ) | St = s ]

I Recall from the previous slide that

q⋆ (s, a) = E[ Rt+1 + γ v⋆ (s′ ) | St = s, At = a ]
I Notice that it has to be the case that

v⋆ (s) = max_{a∈A} q⋆ (s, a) = max_{a∈A} E[ Rt+1 + γ v⋆ (s′ ) | St = s, At = a ]

I If we are given q⋆ (s, a) or v⋆ (s) we can find an optimal policy


⇒ Dynamic Programming algorithms like Policy Iteration estimate v⋆
⇒ Many RL algorithms aim to learn q⋆ ⇒ SARSA, Q-Learning
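For illustration, a value-iteration sketch on the same hypothetical tables; it repeatedly applies the optimal Bellman update and then reads off a greedy policy from the resulting q values.

import numpy as np

# Value iteration on the hypothetical MDP tables used above:
# v_{k+1}(s) = max_a E[ R_{t+1} + gamma * v_k(S_{t+1}) | S_t = s, A_t = a ]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(500):
    q = R + gamma * np.einsum('sar,r->sa', P, v)  # q(s, a) under the current value estimate
    v = q.max(axis=1)

pi_star = q.argmax(axis=1)  # greedy policy with respect to q, optimal at convergence
print("v_star:", v)
print("optimal policy:", pi_star)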



Optimality

I Let us check the optimal Bellman equation for the Grid World

v⋆ (s) = max_{a∈A} E[ Rt+1 + γ v⋆ (s′ ) | St = s, At = a ]

I Let’s focus on the state at the top left corner: v⋆ (s) = 22 with γ = 0.9
⇒ up or left ⇒ Rt+1 + γ v⋆ (s′ ) = −1 + 0.9 × 22 = 18.8
⇒ down ⇒ Rt+1 + γ v⋆ (s′ ) = 0 + 0.9 × 19.8 = 17.82
⇒ right ⇒ Rt+1 + γ v⋆ (s′ ) = 0 + 0.9 × 24.4 = 21.96

