
Chapter 3: The Reinforcement Learning Problem


(Markov Decision Processes, or MDPs)

Objectives of this chapter:

❐ present Markov decision processes—an idealized form of the AI problem for which we have precise theoretical results
❐ introduce key components of the mathematics: value functions and Bellman equations

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction


The Agent-Environment Interface

Figure 3.1: The agent–environment interaction in reinforcement learning. (Diagram: the Agent emits action At; the Environment returns reward Rt+1 and state St+1.)

Agent and environment interact at discrete time steps: t = 0, 1, 2, 3, ...
   Agent observes state at step t: St ∈ S
   produces action at step t: At ∈ A(St)
   gets resulting reward: Rt+1 ∈ R
   and resulting next state: St+1 ∈ S+

   ... St, At, Rt+1, St+1, At+1, Rt+2, St+2, At+2, Rt+3, St+3, At+3, ...

More specifically, the agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, . . .. At each time step t, the agent receives some representation of the environment's state, St ∈ S, where S is the set of possible states, and on that basis selects an action, At ∈ A(St), where A(St) is the set of actions available in state St. One time step later, in part as a consequence of its action, the agent receives a numerical reward, Rt+1 ∈ R ⊂ ℝ, where R is the set of possible rewards, and finds itself in a new state, St+1. Figure 3.1 diagrams the agent–environment interaction.

The environment gives rise to rewards, special numerical values that the agent tries to maximize over time. A complete specification of an environment defines a task, one instance of the reinforcement learning problem.

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted πt, where πt(a|s) is the probability that At = a if St = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent's goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 2


Markov Decision Processes

❐ A reinforcement learning task that satisfies the Markov property is called a Markov decision process, or MDP. If the state and action spaces are finite, it is called a finite Markov decision process (finite MDP).
❐ Finite MDPs are particularly important to the theory of reinforcement learning; they are all you need to understand 90% of modern reinforcement learning.
❐ To define a finite MDP, you need to give:
! state and action sets
! the one-step "dynamics" of the environment: given any state and action, s and a, the probability of each possible pair of next state and reward, s′, r, is denoted

      p(s′, r | s, a) = Pr{St+1 = s′, Rt+1 = r | St = s, At = a}

These quantities completely specify the dynamics of a finite MDP. Most of the theory we present in the rest of this book implicitly assumes the environment is a finite MDP.
❐ there is also: given the dynamics above, one can compute anything else one might want to know about the environment, such as the expected rewards for state–action pairs,

      r(s, a) = E[Rt+1 | St = s, At = a] = Σ_{r∈R} r Σ_{s′∈S} p(s′, r | s, a),

the state-transition probabilities,

      p(s′ | s, a) = Pr{St+1 = s′ | St = s, At = a} = Σ_{r∈R} p(s′, r | s, a),

and the expected rewards for state–action–next-state triples.
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 0
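The derived quantities above follow mechanically from the one-step dynamics. A minimal sketch, assuming p(s′, r | s, a) is stored as P[(s, a)] = list of (probability, next_state, reward) triples (an illustrative representation, not the book's notation):

```python
def expected_reward(P, s, a):
    """r(s, a) = sum over s', r of p(s', r | s, a) * r."""
    return sum(prob * r for prob, s_next, r in P[(s, a)])

def transition_prob(P, s, a, s_target):
    """p(s' | s, a) = sum over r of p(s', r | s, a)."""
    return sum(prob for prob, s_next, r in P[(s, a)] if s_next == s_target)

# Tiny made-up example: from state 0, action 'go' reaches state 1 with
# reward +1 w.p. 0.8, or stays in state 0 with reward 0 w.p. 0.2.
P = {(0, 'go'): [(0.8, 1, 1.0), (0.2, 0, 0.0)]}
print(expected_reward(P, 0, 'go'))       # 0.8
print(transition_prob(P, 0, 'go', 1))    # 0.8
```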
The Agent Learns a Policy

Policy at step t = π t =
a mapping from states to action probabilities
π t (a | s) = probability that At = a when St = s

Special case - deterministic policies:


πt (s) = the action taken with prob=1 when St = s

❐ Reinforcement learning methods specify how the agent


changes its policy as a result of experience.
❐ Roughly, the agent’s goal is to get as much reward as it can
over the long run.

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 4
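As a sketch of these two ideas in Python: a stochastic policy stored as a table of action probabilities, and one episode of interaction driven by it. The state and action names and the env.reset()/env.step() interface are made-up illustrations, not definitions from the book.

```python
import random

# pi[s][a] = probability of taking action a in state s (rows sum to 1).
# State and action names are hypothetical, for illustration only.
pi = {
    'low_battery':  {'search': 0.1, 'recharge': 0.9},
    'high_battery': {'search': 0.8, 'recharge': 0.2},
}

def sample_action(pi, state):
    """Draw A_t from the distribution pi(. | S_t = state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

def run_episode(env, pi, max_steps=1000):
    """One episode of the S_t, A_t, R_{t+1}, S_{t+1} loop under policy pi.

    Assumes a hypothetical environment with env.reset() -> state and
    env.step(action) -> (reward, next_state, done).
    """
    state, rewards = env.reset(), []
    for t in range(max_steps):
        action = sample_action(pi, state)            # A_t ~ pi(.|S_t)
        reward, next_state, done = env.step(action)  # R_{t+1}, S_{t+1}
        rewards.append(reward)
        state = next_state
        if done:
            break
    return rewards
```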


The Markov Property

❐ By "the state" at step t, the book means whatever information is available to the agent at step t about its environment.
❐ The state can include immediate "sensations," highly processed sensations, and structures built up over time from sequences of sensations.
❐ Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:

      Pr{Rt+1 = r, St+1 = s′ | S0, A0, R1, . . . , St−1, At−1, Rt, St, At}
         = Pr{Rt+1 = r, St+1 = s′ | St, At}

❐ for all s′ ∈ S+, r ∈ R, and all possible histories S0, A0, R1, . . . , St−1, At−1, Rt, St, At.
❐ If the state signal has the Markov property, then the environment's response at t + 1 depends only on the state and action representations at t, and its dynamics can be defined by specifying only p(s′, r | s, a). Such a state is a Markov state, and the environment and task as a whole are also said to have the Markov property.

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 5
The Meaning of Life
(goals, rewards, and returns)

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 6


Rewards and returns
• The objective in RL is to maximize long-term future reward

• That is, to choose At so as to maximize Rt+1 , Rt+2 , Rt+3 , . . .

• But what exactly should be maximized?

• The discounted return at time t:


Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + γ³ Rt+4 + · · · ,    where γ ∈ [0, 1) is the discount rate

γ               Reward sequence         Return
0.5 (or any)    1  0  0  0 …            1
0.5             0  0  2  0  0  0 …      0.5
0.9             0  0  2  0  0  0 …      1.62
0.5             -1  2  6  3  2  0 …     2
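Each row of this table can be checked with a couple of lines of Python; truncating the sum after the listed rewards is exact here because all later rewards are zero.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 0, 0], 0.5))          # 1.0 (for any gamma)
print(discounted_return([0, 0, 2, 0, 0], 0.5))       # 0.5
print(discounted_return([0, 0, 2, 0, 0], 0.9))       # 1.62 (up to float rounding)
print(discounted_return([-1, 2, 6, 3, 2, 0], 0.5))   # 2.0
```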
4 value functions

                  state values    action values
   prediction         vπ              qπ
   control            v*              q*

• All theoretical objects, mathematical ideals (expected values)
• Distinct from their estimates: Vt(s) and Qt(s, a)
Values are expected returns

• The value of a state, given a policy:
      vπ(s) = E{Gt | St = s, At:∞ ∼ π}                vπ : S → ℝ
• The value of a state-action pair, given a policy:
      qπ(s, a) = E{Gt | St = s, At = a, At+1:∞ ∼ π}    qπ : S × A → ℝ
• The optimal value of a state:
      v*(s) = max_π vπ(s)                              v* : S → ℝ
• The optimal value of a state-action pair:
      q*(s, a) = max_π qπ(s, a)                        q* : S × A → ℝ
• Optimal policy: π* is an optimal policy if and only if
      π*(a|s) > 0 only where q*(s, a) = max_b q*(s, b)    ∀s ∈ S
• in other words, π* is optimal iff it is greedy wrt q*
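The last two bullets translate directly into code: given a table of q* values, an optimal deterministic policy just picks a maximizing action in every state. A minimal sketch with a made-up toy table:

```python
def greedy_action(q_star, state, actions):
    """argmax_a q*(state, a); any policy that puts all of its probability
    on such maximizing actions is optimal."""
    return max(actions, key=lambda a: q_star[(state, a)])

# Toy q* table (made-up numbers: one state, two actions).
q_star = {('s0', 'left'): 1.0, ('s0', 'right'): 2.5}
print(greedy_action(q_star, 's0', ['left', 'right']))   # right
```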
optimal policy example

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 10


What we learned last time

❐ Finite Markov decision processes!


! States, actions, and rewards

! And returns

! And time, discrete time

! They capture essential elements of life — state, causality

❐ The goal is to optimize expected returns


! returns are discounted sums of future rewards

❐ Thus we are interested in values — expected returns


❐ There are four value functions
! state vs state-action values

! values for a policy vs values for the optimal policy

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 11


4 value functions

                  state values    action values
   prediction         vπ              qπ
   control            v*              q*

• All theoretical objects, mathematical ideals (expected values)
• Distinct from their estimates: Vt(s) and Qt(s, a)
Gridworld

❐ Actions: north, south, east, west; deterministic.


❐ If would take agent off the grid: no move but reward = –1
❐ Other actions produce reward = 0, except actions that move
agent out of special states A and B as shown.

State-value function
for equiprobable
random policy;
γ = 0.9

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 13


Golf

❐ State is ball location
❐ Reward of –1 for each stroke until the ball is in the hole
❐ Value of a state?
❐ Actions:
! putt (use putter)
! driver (use driver)
❐ putt succeeds anywhere on the green

[Figure: contour plots of vputt (upper panel) and q*(s, driver) (lower panel) over the fairway, sand, and green.]

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 14
Optimal Value Functions

❐ For finite MDPs, policies can be partially ordered:
      π ≥ π′ if and only if vπ(s) ≥ vπ′(s) for all s ∈ S
❐ There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all π*.
❐ Optimal policies share the same optimal state-value function:
      v*(s) = max_π vπ(s) for all s ∈ S
❐ Optimal policies also share the same optimal action-value function:
      q*(s, a) = max_π qπ(s, a) for all s ∈ S and a ∈ A(s)
   This is the expected return for taking action a in state s and thereafter following an optimal policy.

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 15
Why Optimal State-Value Functions are Useful

Any policy that is greedy with respect to v* is an optimal policy.

Therefore, given v*, one-step-ahead search produces the


long-term optimal actions.

E.g., back to the gridworld:

[Figure: a) the gridworld, with special states A and B and their successor states A′ and B′ (rewards +10 and +5); b) the optimal state-value function v*; c) an optimal policy π*.]

v*:
22.0  24.4  22.0  19.4  17.5
19.8  22.0  19.8  17.8  16.0
17.8  19.8  17.8  16.0  14.4
16.0  17.8  16.0  14.4  13.0
14.4  16.0  14.4  13.0  11.7

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 16
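A sketch of that one-step-ahead search, assuming v* is stored as a dictionary and the dynamics as P[(s, a)] = list of (probability, next_state, reward) triples (both illustrative representations):

```python
def greedy_from_v_star(P, v_star, state, actions, gamma=0.9):
    """One-step lookahead:
    argmax_a  sum_{s', r} p(s', r | s, a) [ r + gamma * v*(s') ]."""
    def one_step_value(a):
        return sum(prob * (r + gamma * v_star[s_next])
                   for prob, s_next, r in P[(state, a)])
    return max(actions, key=one_step_value)
```

Applied state by state, such a lookahead is greedy with respect to v* and therefore yields an optimal policy.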


Optimal Value Function for Golf

❐ We can hit the ball farther with driver than with putter, but with less accuracy
❐ q*(s, driver) gives the value of using driver first, then using whichever actions are best

[Figure: contour plot of q*(s, driver) over the fairway, sand, and green.]

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 17


What About Optimal Action-Value Functions?

Given q* , the agent does not even


have to do a one-step-ahead search:

π*(s) = argmax_a q*(s, a)

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 18


Value Functions
x4

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 19


Bellman Equations
x4

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 20


Bellman Equation for a Policy π

The basic idea:

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + γ³ Rt+4 + · · ·
   = Rt+1 + γ (Rt+2 + γ Rt+3 + γ² Rt+4 + · · ·)
   = Rt+1 + γ Gt+1

So:   vπ(s) = Eπ{Gt | St = s}
            = Eπ{Rt+1 + γ vπ(St+1) | St = s}

Or, without the expectation operator:


vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 21
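The last equation translates almost line for line into code. A sketch of the backup for a single state, using the same assumed tabular representation as the earlier sketches (pi[s] maps actions to probabilities, P[(s, a)] lists (probability, next_state, reward) triples):

```python
def bellman_backup(s, V, pi, P, gamma):
    """v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [ r + gamma * v_pi(s') ].

    pi[s][a] gives pi(a|s); P[(s, a)] lists (probability, next_state, reward)
    triples; V maps states to current value estimates.
    """
    return sum(
        prob_a * prob * (r + gamma * V[s_next])
        for a, prob_a in pi[s].items()
        for prob, s_next, r in P[(s, a)]
    )
```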


More on the Bellman Equation

vπ(s) = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]

This is a set of equations (in fact, linear), one for each state.
The value function for π is its unique solution:

vπ(s) = Eπ[Rt+1 + γ Rt+2 + γ² Rt+3 + · · · | St = s]
      = Eπ[Rt+1 + γ vπ(St+1) | St = s]
      = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [ r + γ vπ(s′) ]

Backup diagrams:      for vπ        for qπ
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 22
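Because the system is linear, vπ for a small finite MDP can be computed exactly by solving (I − γ Pπ) v = rπ, where Pπ and rπ are the state-transition matrix and expected one-step rewards under π. A NumPy sketch under the same assumed representation:

```python
import numpy as np

def evaluate_policy_exact(states, P, pi, gamma):
    """Solve (I - gamma * P_pi) v = r_pi for v_pi on a finite MDP."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    P_pi = np.zeros((n, n))   # P_pi[i, j] = p(j | i) under pi
    r_pi = np.zeros(n)        # r_pi[i]   = expected one-step reward under pi
    for s in states:
        for a, prob_a in pi[s].items():
            for prob, s_next, r in P[(s, a)]:
                P_pi[idx[s], idx[s_next]] += prob_a * prob
                r_pi[idx[s]] += prob_a * prob * r
    v = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    return {s: v[idx[s]] for s in states}
```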
Gridworld

❐ Actions: north, south, east, west; deterministic.


❐ If would take agent off the grid: no move but reward = –1
❐ Other actions produce reward = 0, except actions that move
agent out of special states A and B as shown.

State-value function
for equiprobable
random policy;
γ = 0.9

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 23


Bellman Optimality Equation for v*

The value of a state under an optimal policy must equal
the expected return for the best action from that state:

v*(s) = max_a qπ*(s, a)
      = max_a E[Rt+1 + γ v*(St+1) | St = s, At = a]
      = max_{a∈A(s)} Σ_{s′,r} p(s′, r | s, a) [ r + γ v*(s′) ]

The relevant backup diagram: [root state s; max over actions a; then reward r and next state s′]

v* is the unique solution of this system of nonlinear equations.


R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 24
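One standard way to find this fixed point is to apply the max-backup repeatedly until the values stop changing (value iteration, developed in Chapter 4). A sketch under the same assumed tabular representation; actions(s) is a hypothetical helper returning the actions available in s:

```python
def value_iteration(states, actions, P, gamma, theta=1e-8):
    """Iterate v(s) <- max_a sum_{s',r} p(s',r|s,a) [ r + gamma * v(s') ]
    until the largest change in a sweep is below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(
                sum(prob * (r + gamma * V[s_next])
                    for prob, s_next, r in P[(s, a)])
                for a in actions(s)
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```

Convergence is guaranteed for γ < 1 because the max-backup is a contraction.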
Bellman Optimality Equation for q*

The last two equations are two forms of the Bellman optimality equation for v*.
The Bellman optimality equation for q* is

q*(s, a) = E[Rt+1 + γ max_{a′} q*(St+1, a′) | St = s, At = a]
         = Σ_{s′,r} p(s′, r | s, a) [ r + γ max_{a′} q*(s′, a′) ]

The relevant backup diagram: [root state–action pair (s, a); then reward r and next state s′; max over next actions a′]

q* is the unique solution of this system of nonlinear equations.

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 25
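The q* equation admits the same fixed-point treatment: sweep a backup like the one below over all state–action pairs until convergence. A brief sketch with the same assumed representation (a terminal next state with no available actions would need a zero bootstrap, omitted here):

```python
def q_backup(s, a, Q, P, actions, gamma):
    """One Bellman optimality backup for q*:
    sum_{s',r} p(s',r|s,a) [ r + gamma * max_{a'} q(s',a') ]."""
    return sum(
        prob * (r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next)))
        for prob, s_next, r in P[(s, a)]
    )
```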


Solving the Bellman Optimality Equation
❐ Finding an optimal policy by solving the Bellman
Optimality Equation requires the following:
! accurate knowledge of environment dynamics;

! we have enough space and time to do the computation;

! the Markov Property.

❐ How much space and time do we need?


! polynomial in number of states (via dynamic

programming methods; Chapter 4),


! BUT, number of states is often huge (e.g., backgammon has about 10²⁰ states).


❐ We usually have to settle for approximations.
❐ Many RL methods can be understood as approximately
solving the Bellman Optimality Equation.

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 26


Summary

❐ Agent-environment interaction
! States
! Actions
! Rewards
❐ Policy: stochastic rule for selecting actions
❐ Return: the function of future rewards the agent tries to maximize
❐ Episodic and continuing tasks
❐ Markov Property
❐ Markov Decision Process
! Transition probabilities
! Expected rewards
❐ Value functions
! State-value function for a policy
! Action-value function for a policy
! Optimal state-value function
! Optimal action-value function
❐ Optimal value functions
❐ Optimal policies
❐ Bellman Equations
❐ The need for approximation

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction 27
