
13. Reinforcement Learning

Lecture slides for the textbook Machine Learning, T. Mitchell, McGraw-Hill, 1997 (slides 255-274)

[Read Chapter 13]


[Exercises 13.1, 13.2, 13.4]

• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Control Learning

Consider learning to choose actions, e.g.,

• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon

Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state is only partially observable
• Possible need to learn multiple tasks with the same sensors/effectors

One Example: TD-Gammon

[Tesauro, 1995]
Learn to play Backgammon.

Immediate reward:
• +100 if win
• −100 if lose
• 0 for all other states

Trained by playing 1.5 million games against itself.
Now approximately equal to the best human player.

Reinforcement Learning Problem

[Figure: the agent-environment interaction loop. At each step the agent observes the state and reward and sends an action to the environment, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]

Goal: Learn to choose actions that maximize

    r_0 + γ r_1 + γ^2 r_2 + ... ,   where 0 ≤ γ < 1
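A minimal sketch (not part of the original slides) of how this discounted sum is computed from a finite reward sequence; the function name and example rewards are illustrative:

    def discounted_return(rewards, gamma=0.9):
        """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Example: a reward of 100 arrives on the third step, gamma = 0.9
    print(discounted_return([0, 0, 100]))   # 0 + 0.9*0 + 0.81*100 = 81.0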

Markov Decision Processes

Assume
• finite set of states S
• set of actions A
• at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
• then receives immediate reward r_t
• and the state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  - i.e., r_t and s_{t+1} depend only on the current state and action
  - the functions δ and r may be nondeterministic
  - the functions δ and r are not necessarily known to the agent
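To make the notation concrete, here is a minimal sketch (illustrative only; the two-state layout and reward values are invented for the example) of a deterministic MDP with δ and r stored as Python dictionaries:

    # A tiny deterministic MDP: states S, actions A, transition function delta, reward r.
    S = ["s1", "s2"]
    A = ["stay", "go"]

    # delta[(s, a)] -> next state; r[(s, a)] -> immediate reward
    delta = {("s1", "stay"): "s1", ("s1", "go"): "s2",
             ("s2", "stay"): "s2", ("s2", "go"): "s1"}
    r = {("s1", "stay"): 0, ("s1", "go"): 10,
         ("s2", "stay"): 0, ("s2", "go"): 0}

    def step(s, a):
        """One environment step: the agent receives r_t and observes s_{t+1}."""
        return r[(s, a)], delta[(s, a)]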

Agent's Learning Task

Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes

      E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]

  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards

Note something new:
• the target function is π : S → A
• but we have no training examples of the form ⟨s, a⟩
• training examples are of the form ⟨⟨s, a⟩, r⟩

Value Function

To begin, consider deterministic worlds...

For each possible policy π the agent might adopt, we can define an evaluation function over states

    V^π(s) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
           ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s.

Restated, the task is to learn the optimal policy π*

    π* ≡ argmax_π V^π(s),  (∀s)
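As a concrete check (an illustrative example, not from the slides): with γ = 0.9, a policy that collects rewards 0, 0, 100 on its first three steps from s (and nothing afterwards) has V^π(s) = 0 + 0.9·0 + 0.9^2·100 = 81.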

[Figure: a simple six-state deterministic grid world with absorbing goal state G and γ = 0.9. One panel shows the immediate reward values r(s, a): 100 for the actions that enter G, 0 for all others. The remaining panels show the corresponding Q(s, a) values (e.g., 72, 81, 90, 100), the optimal state values V*(s) (81, 90, and 100, with V*(G) = 0), and one optimal policy.]
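A minimal value-iteration sketch (an illustrative reconstruction, not part of the slides) for a grid world of this kind. It assumes a 2×3 layout with G in the top-right corner, reward 100 for entering G, and γ = 0.9, and it reproduces the V*(s) values 90, 100, and 81 shown above:

    # Deterministic 2x3 grid world; the goal G = (0, 2) is absorbing.
    GAMMA = 0.9
    GOAL = (0, 2)
    states = [(i, j) for i in range(2) for j in range(3)]
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def delta(s, a):
        """Deterministic transition; moves off the grid are invalid (None)."""
        ni, nj = s[0] + moves[a][0], s[1] + moves[a][1]
        return (ni, nj) if 0 <= ni < 2 and 0 <= nj < 3 else None

    def r(s, a):
        """Immediate reward: 100 for entering the goal, 0 otherwise."""
        return 100 if delta(s, a) == GOAL else 0

    # Value iteration: V*(s) = max_a [ r(s, a) + gamma * V*(delta(s, a)) ]
    V = {s: 0.0 for s in states}
    for _ in range(50):
        for s in states:
            if s == GOAL:
                continue                      # absorbing state keeps V*(G) = 0
            V[s] = max(r(s, a) + GAMMA * V[delta(s, a)]
                       for a in moves if delta(s, a) is not None)

    print(V)   # V[(0,0)] = 90.0, V[(0,1)] = 100.0, V[(1,0)] = 81.0, ...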

What to Learn

We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).

It could then do a lookahead search to choose the best action from any state s, because

    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

A problem:
• this works well if the agent knows δ : S × A → S and r : S × A → ℝ
• but when it doesn't, it can't choose actions this way

Q Function

Define a new function very similar to V*:

    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If the agent learns Q, it can choose the optimal action even without knowing δ!

    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

    π*(s) = argmax_a Q(s, a)

Q is the evaluation function the agent will learn.
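A minimal sketch (illustrative only) of acting greedily from a learned Q table stored as a dictionary keyed by (state, action) pairs:

    def greedy_policy(Q, s, actions):
        """pi*(s) = argmax_a Q(s, a): pick the action with the largest learned Q value."""
        return max(actions, key=lambda a: Q[(s, a)])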

Training Rule to Learn Q

Note that Q and V* are closely related:

    V*(s) = max_{a'} Q(s, a')

which allows us to write Q recursively as

    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule

    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s.

Q Learning for Deterministic Worlds

For each s, a initialize the table entry Q̂(s, a) ← 0
Observe the current state s
Do forever:
• select an action a and execute it
• receive immediate reward r
• observe the new state s'
• update the table entry for Q̂(s, a) as follows:

      Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

• s ← s'

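A minimal sketch of this tabular algorithm in Python (illustrative only; the env_step interface, the random start states, and the purely random action selection are assumptions, not part of the slides):

    import random
    from collections import defaultdict

    def q_learning(states, actions, env_step, gamma=0.9, episodes=1000, steps=100):
        """Tabular Q-learning for a deterministic world.

        env_step(s, a) is assumed to return (reward, next_state)."""
        Q = defaultdict(float)                   # Q-hat(s, a), initialized to 0
        for _ in range(episodes):
            s = random.choice(states)            # arbitrary starting state
            for _ in range(steps):
                a = random.choice(actions)       # random exploration, for simplicity
                reward, s_next = env_step(s, a)  # receive r, observe s'
                Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
                s = s_next
        return Q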
Updating Q̂

[Figure: one Q̂ update in the grid world. The agent takes action a_right in initial state s_1 and reaches next state s_2, whose outgoing Q̂ values are 63, 81, and 100; the Q̂ entry for (s_1, a_right) is updated from 72 to 90.]

    Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                     ← 0 + 0.9 max{63, 81, 100}
                     ← 90

Notice that if the rewards are non-negative, then

    (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)

and

    (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

Q̂ converges to Q. Consider the case of a deterministic world where each ⟨s, a⟩ is visited infinitely often.

Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.

Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is

    Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

    |Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                              = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
                              ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                              ≤ γ max_{s'', a'} |Q̂_n(s'', a') − Q(s'', a')|

    |Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n

Note that we used the general fact that

    |max_a f_1(a) − max_a f_2(a)| ≤ max_a |f_1(a) − f_2(a)|

Nondeterministic Case

What if the reward and next state are non-deterministic?

We redefine V and Q by taking expected values:

    V^π(s) ≡ E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]
           ≡ E[ Σ_{i=0}^∞ γ^i r_{t+i} ]

    Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Nondeterministic Case

Q learning generalizes to nondeterministic worlds.

Alter the training rule to

    Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]

where

    α_n = 1 / (1 + visits_n(s, a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
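A minimal sketch of this decaying-learning-rate update (illustrative only; the dictionary-based table and visit counter are assumptions):

    from collections import defaultdict

    Q = defaultdict(float)        # Q-hat_{n-1}(s, a)
    visits = defaultdict(int)     # visits_n(s, a)

    def nondeterministic_update(s, a, reward, s_next, actions, gamma=0.9):
        """Apply Q-hat_n(s,a) <- (1 - alpha_n) Q-hat_{n-1}(s,a)
                                 + alpha_n [reward + gamma max_a' Q-hat_{n-1}(s',a')]."""
        visits[(s, a)] += 1
        alpha = 1.0 / (1 + visits[(s, a)])
        target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target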

Temporal Difference Learning

Q learning: reduce the discrepancy between successive Q estimates.

One-step time difference:

    Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

    Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ^2 max_a Q̂(s_{t+2}, a)

Or n?

    Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

    Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]
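A minimal sketch (illustrative only; truncating the blend at the end of the recorded trajectory is an added assumption) of computing the n-step estimates Q^(n) and their λ-blend from one trajectory, where max_Q[k] stands for max_a Q̂(s_k, a):

    def n_step_return(rewards, max_Q, t, n, gamma=0.9):
        """Q^(n)(s_t, a_t): n discounted rewards, then bootstrap with max_a Q-hat(s_{t+n}, a).

        rewards[k] = r_k; max_Q must have one more entry than rewards (states s_0 .. s_T)."""
        g = sum(gamma ** i * rewards[t + i] for i in range(n))
        return g + gamma ** n * max_Q[t + n]

    def lambda_return(rewards, max_Q, t, gamma=0.9, lam=0.5):
        """Q^lambda(s_t, a_t) = (1 - lam) * sum_n lam^(n-1) Q^(n), truncated at trajectory end."""
        N = len(rewards) - t       # number of n-step returns available from time t
        return (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, max_Q, t, n, gamma)
                               for n in range(1, N + 1))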

Temporal Difference Learning

    Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]

Equivalent recursive expression:

    Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]

The TD(λ) algorithm uses the above training rule.
• Sometimes converges faster than Q learning
• Converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm

Subtleties and Ongoing Research

• Replace the Q̂ table with a neural net or other generalizer
• Handle the case where the state is only partially observable
• Design optimal exploration strategies
• Extend to continuous actions and states
• Learn and use δ̂ : S × A → S
• Relationship to dynamic programming

