
13. Reinforcement Learning

Lecture slides for the textbook Machine Learning, T. Mitchell, McGraw-Hill, 1997 (slides 255-274)

[Read Chapter 13]


[Exercises 13.1, 13.2, 13.4]

• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Control Learning

Consider learning to choose actions, e.g.,

• Robot learning to dock on battery charger
• Learning to choose actions to optimize factory output
• Learning to play Backgammon

Note several problem characteristics:
• Delayed reward
• Opportunity for active exploration
• Possibility that state is only partially observable
• Possible need to learn multiple tasks with the same sensors/effectors

One Example: TD-Gammon

[Tesauro, 1995]
Learn to play Backgammon.

Immediate reward:
• +100 if win
• −100 if lose
• 0 for all other states

Trained by playing 1.5 million games against itself.
Now approximately equal to the best human player.

Reinforcement Learning Problem

[Figure: the agent-environment interaction loop. At each step the agent observes the state and reward and sends an action to the environment, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]

Goal: Learn to choose actions that maximize

    r_0 + γ r_1 + γ^2 r_2 + ... ,   where 0 ≤ γ < 1
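A minimal sketch (not part of the original slides) of how this discounted sum is computed from a finite reward sequence; the function name and example rewards are illustrative:

    def discounted_return(rewards, gamma=0.9):
        """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward list."""
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # Example: a reward of 100 arrives on the third step, gamma = 0.9
    print(discounted_return([0, 0, 100]))   # 0 + 0.9*0 + 0.81*100 = 81.0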

Markov Decision Processes

Assume
• finite set of states S
• set of actions A
• at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
• then receives immediate reward r_t
• and the state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  - i.e., r_t and s_{t+1} depend only on the current state and action
  - the functions δ and r may be nondeterministic
  - the functions δ and r are not necessarily known to the agent
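To make the notation concrete, here is a minimal sketch (illustrative only; the two-state layout and reward values are invented for the example) of a deterministic MDP with δ and r stored as Python dictionaries:

    # A tiny deterministic MDP: states S, actions A, transition function delta, reward r.
    S = ["s1", "s2"]
    A = ["stay", "go"]

    # delta[(s, a)] -> next state; r[(s, a)] -> immediate reward
    delta = {("s1", "stay"): "s1", ("s1", "go"): "s2",
             ("s2", "stay"): "s2", ("s2", "go"): "s1"}
    r = {("s1", "stay"): 0, ("s1", "go"): 10,
         ("s2", "stay"): 0, ("s2", "go"): 0}

    def step(s, a):
        """One environment step: the agent receives r_t and observes s_{t+1}."""
        return r[(s, a)], delta[(s, a)]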

Agent's Learning Task

Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes

      E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]

  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards

Note something new:
• the target function is π : S → A
• but we have no training examples of the form ⟨s, a⟩
• training examples are of the form ⟨⟨s, a⟩, r⟩

Value Function

To begin, consider deterministic worlds...

For each possible policy π the agent might adopt, we can define an evaluation function over states

    V^π(s) ≡ r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
           ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s.

Restated, the task is to learn the optimal policy π*

    π* ≡ argmax_π V^π(s),  (∀s)
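As a concrete check (an illustrative example, not from the slides): with γ = 0.9, a policy that collects rewards 0, 0, 100 on its first three steps from s (and nothing afterwards) has V^π(s) = 0 + 0.9·0 + 0.9^2·100 = 81.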

[Figure: a simple six-state deterministic grid world with absorbing goal state G and γ = 0.9. One panel shows the immediate reward values r(s, a): 100 for the actions that enter G, 0 for all others. The remaining panels show the corresponding Q(s, a) values (e.g., 72, 81, 90, 100), the optimal state values V*(s) (81, 90, and 100, with V*(G) = 0), and one optimal policy.]
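A minimal value-iteration sketch (an illustrative reconstruction, not part of the slides) for a grid world of this kind. It assumes a 2×3 layout with G in the top-right corner, reward 100 for entering G, and γ = 0.9, and it reproduces the V*(s) values 90, 100, and 81 shown above:

    # Deterministic 2x3 grid world; the goal G = (0, 2) is absorbing.
    GAMMA = 0.9
    GOAL = (0, 2)
    states = [(i, j) for i in range(2) for j in range(3)]
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def delta(s, a):
        """Deterministic transition; moves off the grid are invalid (None)."""
        ni, nj = s[0] + moves[a][0], s[1] + moves[a][1]
        return (ni, nj) if 0 <= ni < 2 and 0 <= nj < 3 else None

    def r(s, a):
        """Immediate reward: 100 for entering the goal, 0 otherwise."""
        return 100 if delta(s, a) == GOAL else 0

    # Value iteration: V*(s) = max_a [ r(s, a) + gamma * V*(delta(s, a)) ]
    V = {s: 0.0 for s in states}
    for _ in range(50):
        for s in states:
            if s == GOAL:
                continue                      # absorbing state keeps V*(G) = 0
            V[s] = max(r(s, a) + GAMMA * V[delta(s, a)]
                       for a in moves if delta(s, a) is not None)

    print(V)   # V[(0,0)] = 90.0, V[(0,1)] = 100.0, V[(1,0)] = 81.0, ...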

What to Learn

We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).

It could then do a lookahead search to choose the best action from any state s, because

    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

A problem:
• this works well if the agent knows δ : S × A → S and r : S × A → ℝ
• but when it doesn't, it can't choose actions this way

Q Function

Define a new function very similar to V*:

    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If the agent learns Q, it can choose the optimal action even without knowing δ!

    π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

    π*(s) = argmax_a Q(s, a)

Q is the evaluation function the agent will learn.
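A minimal sketch (illustrative only) of acting greedily from a learned Q table stored as a dictionary keyed by (state, action) pairs:

    def greedy_policy(Q, s, actions):
        """pi*(s) = argmax_a Q(s, a): pick the action with the largest learned Q value."""
        return max(actions, key=lambda a: Q[(s, a)])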

Training Rule to Learn Q

Note that Q and V* are closely related:

    V*(s) = max_{a'} Q(s, a')

which allows us to write Q recursively as

    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule

    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s.

Q Learning for Deterministic Worlds

For each s, a initialize the table entry Q̂(s, a) ← 0
Observe the current state s
Do forever:
• select an action a and execute it
• receive immediate reward r
• observe the new state s'
• update the table entry for Q̂(s, a) as follows:

      Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

• s ← s'

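A minimal sketch of this tabular algorithm in Python (illustrative only; the env_step interface, the random start states, and the purely random action selection are assumptions, not part of the slides):

    import random
    from collections import defaultdict

    def q_learning(states, actions, env_step, gamma=0.9, episodes=1000, steps=100):
        """Tabular Q-learning for a deterministic world.

        env_step(s, a) is assumed to return (reward, next_state)."""
        Q = defaultdict(float)                   # Q-hat(s, a), initialized to 0
        for _ in range(episodes):
            s = random.choice(states)            # arbitrary starting state
            for _ in range(steps):
                a = random.choice(actions)       # random exploration, for simplicity
                reward, s_next = env_step(s, a)  # receive r, observe s'
                Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
                s = s_next
        return Q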
Updating Q̂

[Figure: one Q̂ update in the grid world. The agent takes action a_right in initial state s_1 and reaches next state s_2, whose outgoing Q̂ values are 63, 81, and 100; the Q̂ entry for (s_1, a_right) is updated from 72 to 90.]

    Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                     ← 0 + 0.9 max{63, 81, 100}
                     ← 90

Notice that if the rewards are non-negative, then

    (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)

and

    (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

Q̂ converges to Q. Consider the case of a deterministic world where each ⟨s, a⟩ is visited infinitely often.

Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval the largest error in the Q̂ table is reduced by a factor of γ.

Let Q̂_n be the table after n updates, and Δ_n be the maximum error in Q̂_n; that is

    Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

    |Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                              = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
                              ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                              ≤ γ max_{s'', a'} |Q̂_n(s'', a') − Q(s'', a')|

    |Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n

Note that we used the general fact that

    |max_a f_1(a) − max_a f_2(a)| ≤ max_a |f_1(a) − f_2(a)|

Nondeterministic Case

What if the reward and next state are non-deterministic?

We redefine V and Q by taking expected values:

    V^π(s) ≡ E[r_t + γ r_{t+1} + γ^2 r_{t+2} + ...]
           ≡ E[ Σ_{i=0}^∞ γ^i r_{t+i} ]

    Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Nondeterministic Case

Q learning generalizes to nondeterministic worlds.

Alter the training rule to

    Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]

where

    α_n = 1 / (1 + visits_n(s, a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
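A minimal sketch of this decaying-learning-rate update (illustrative only; the dictionary-based table and visit counter are assumptions):

    from collections import defaultdict

    Q = defaultdict(float)        # Q-hat_{n-1}(s, a)
    visits = defaultdict(int)     # visits_n(s, a)

    def nondeterministic_update(s, a, reward, s_next, actions, gamma=0.9):
        """Apply Q-hat_n(s,a) <- (1 - alpha_n) Q-hat_{n-1}(s,a)
                                 + alpha_n [reward + gamma max_a' Q-hat_{n-1}(s',a')]."""
        visits[(s, a)] += 1
        alpha = 1.0 / (1 + visits[(s, a)])
        target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target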

Temporal Difference Learning

Q learning: reduce the discrepancy between successive Q estimates.

One-step time difference:

    Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

    Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ^2 max_a Q̂(s_{t+2}, a)

Or n?

    Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + ... + γ^{n−1} r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

    Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]
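A minimal sketch (illustrative only; truncating the blend at the end of the recorded trajectory is an added assumption) of computing the n-step estimates Q^(n) and their λ-blend from one trajectory, where max_Q[k] stands for max_a Q̂(s_k, a):

    def n_step_return(rewards, max_Q, t, n, gamma=0.9):
        """Q^(n)(s_t, a_t): n discounted rewards, then bootstrap with max_a Q-hat(s_{t+n}, a).

        rewards[k] = r_k; max_Q must have one more entry than rewards (states s_0 .. s_T)."""
        g = sum(gamma ** i * rewards[t + i] for i in range(n))
        return g + gamma ** n * max_Q[t + n]

    def lambda_return(rewards, max_Q, t, gamma=0.9, lam=0.5):
        """Q^lambda(s_t, a_t) = (1 - lam) * sum_n lam^(n-1) Q^(n), truncated at trajectory end."""
        N = len(rewards) - t       # number of n-step returns available from time t
        return (1 - lam) * sum(lam ** (n - 1) * n_step_return(rewards, max_Q, t, n, gamma)
                               for n in range(1, N + 1))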

Temporal Difference Learning

    Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ^2 Q^(3)(s_t, a_t) + ... ]

Equivalent recursive expression:

    Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]

The TD(λ) algorithm uses the above training rule.
• Sometimes converges faster than Q learning
• Converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992)
• Tesauro's TD-Gammon uses this algorithm

Subtleties and Ongoing Research

• Replace the Q̂ table with a neural net or other generalizer
• Handle the case where the state is only partially observable
• Design optimal exploration strategies
• Extend to continuous actions and states
• Learn and use δ̂ : S × A → S
• Relationship to dynamic programming

