Reinforcement Learning
Lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997
Control Learning
One Example: TD-Gammon
[Tesauro, 1995]
Learn to play Backgammon
Immediate reward
+100 if win
-100 if lose
0 for all other states
Trained by playing 1.5 million games against itself
Now approximately equal to best human player
Reinforcement Learning Problem
[Figure: the agent-environment interaction loop. At each step the agent observes state $s_t$, performs action $a_t$, and receives reward $r_t$, producing the sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$]
Goal: learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$.
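A minimal sketch of this loop in Python (the `env`/`agent` objects and their `reset`, `step`, and `act` methods are hypothetical placeholders, not something the slides specify):

    # Hypothetical agent-environment interaction loop: at each step the
    # agent observes s_t, chooses a_t, and receives r_t and s_{t+1}.
    def run_episode(env, agent, max_steps=100):
        s = env.reset()                  # observe s_0
        trajectory = []
        for _ in range(max_steps):
            a = agent.act(s)             # choose action a_t
            r, s_next = env.step(a)      # receive reward r_t, observe s_{t+1}
            trajectory.append((s, a, r))
            s = s_next
        return trajectory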
Markov Decision Processes
Assume
finite set of states $S$
set of actions $A$
at each discrete time step, the agent observes state $s_t \in S$ and chooses action $a_t \in A$
then receives immediate reward $r_t$
and the state changes to $s_{t+1}$
Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$; that is, $r_t$ and $s_{t+1}$ depend only on the current state and action. The functions $\delta$ and $r$ may be nondeterministic and are not necessarily known to the agent.
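For a finite deterministic MDP, $\delta$ and $r$ are just lookup tables; a minimal sketch in Python (the two-state world and its names are invented for illustration):

    # Toy deterministic MDP: delta maps (state, action) -> next state,
    # r maps (state, action) -> immediate reward.
    delta = {("s1", "right"): "s2", ("s1", "left"): "s1",
             ("s2", "right"): "s2", ("s2", "left"): "s1"}
    r = {("s1", "right"): 0, ("s1", "left"): 0,
         ("s2", "right"): 100, ("s2", "left"): 0}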
Agent's Learning Task
Execute actions in the environment, observe the results, and learn an action policy $\pi : S \rightarrow A$ that maximizes $E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots]$ from any starting state in $S$. Here $0 \le \gamma < 1$ is the discount factor for future rewards.
Value Function
To begin, consider deterministic worlds. For each possible policy $\pi$ the agent might adopt, define an evaluation function over states:
$$V^{\pi}(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \equiv \sum_{i=0}^{\infty} \gamma^i r_{t+i}$$
where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s$. The task, restated, is to learn the optimal policy $\pi^*$:
$$\pi^* \equiv \operatorname*{argmax}_{\pi} V^{\pi}(s), \quad (\forall s)$$
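As a quick check, the discounted sum is easy to compute directly; a small sketch (the reward sequence is invented):

    # Discounted return: sum of gamma^i * r_{t+i} over the sequence.
    def discounted_return(rewards, gamma=0.9):
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    # A reward of 100 arriving three steps from now is worth
    # 0.9**3 * 100 = 72.9 today.
    print(discounted_return([0, 0, 0, 100]))  # 72.9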
[Figures: a simple deterministic grid world with absorbing goal state $G$. One grid shows the immediate rewards $r(s,a)$: 100 for each action entering $G$, 0 for all other transitions. The others show the corresponding $V^*(s)$ values (100, 90, 81) and $Q(s,a)$ values (100, 90, 81, 72), computed with $\gamma = 0.9$.]
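With $\gamma = 0.9$, these numbers follow directly from the discounted-sum definition (a worked check, not part of the slides): the state one step from $G$ has value 100, since the best action enters $G$ immediately, and each additional step back multiplies by $\gamma$:
$$0.9 \cdot 100 = 90, \qquad 0.9 \cdot 90 = 81, \qquad 0.9 \cdot 81 = 72.9 \approx 72.$$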
What to Learn
One obvious choice is to have the agent learn the evaluation function $V^{\pi^*}$, which we write $V^*$. It could then choose the best action from any state $s$ by one-step lookahead:
$$\pi^*(s) = \operatorname*{argmax}_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$$
A problem:
This works well if the agent knows $\delta : S \times A \rightarrow S$ and $r : S \times A \rightarrow \mathbb{R}$.
But when it doesn't, it can't choose actions this way.
Q Function
Define a new evaluation function:
$$Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$$
If the agent learns $Q$, it can choose optimal actions even without knowing $\delta$ or $r$:
$$\pi^*(s) = \operatorname*{argmax}_{a} Q(s, a)$$
Q is the evaluation function the agent will learn
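This argmax rule transcribes directly to code, assuming a learned table `Q` keyed by (state, action) pairs (a sketch; the names are illustrative):

    # Greedy policy extraction: pick the action with the largest Q value.
    def pi_star(Q, state, actions):
        return max(actions, key=lambda a: Q[(state, a)])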
Training Rule to Learn Q
Note that $Q$ and $V^*$ are closely related:
$$V^*(s) = \max_{a'} Q(s, a')$$
which lets us rewrite $Q$ recursively:
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$$
Let $\hat{Q}$ denote the learner's current approximation to $Q$. Consider the training rule
$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$
where $s'$ is the state resulting from applying action $a$ in state $s$.
Q Learning for Deterministic Worlds
For each $s, a$ initialize the table entry $\hat{Q}(s, a) \leftarrow 0$. Observe the current state $s$. Do forever:
Select an action $a$ and execute it
Receive immediate reward $r$
Observe the new state $s'$
Update the table entry for $\hat{Q}(s, a)$:
$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$$
$s \leftarrow s'$
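A minimal tabular implementation of this algorithm for the deterministic case (a sketch, reusing the dictionary-based `delta` and `r` tables from the MDP sketch above; the slides leave the action-selection strategy open, so uniform random exploration is used here):

    import random
    from collections import defaultdict

    def q_learning(delta, r, actions, s0, gamma=0.9, episodes=1000, steps=20):
        Q = defaultdict(float)                 # Q_hat(s, a), initialized to 0
        for _ in range(episodes):
            s = s0
            for _ in range(steps):
                a = random.choice(actions)     # select and execute an action
                reward, s_next = r[(s, a)], delta[(s, a)]
                # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
                Q[(s, a)] = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
                s = s_next                     # s <- s'
        return Q

This can be run directly on the toy `delta`/`r` tables defined earlier, e.g. `q_learning(delta, r, ["left", "right"], "s1")`.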
Updating $\hat{Q}$
[Figure: one training step in the grid world; the agent moves right from state $s_1$ to $s_2$. Current $\hat{Q}$ values: 72 for the rightward move out of $s_1$; 63, 81, and 100 for the actions available in $s_2$.]
$$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{63, 81, 100\} = 90$$
Notice that if rewards are non-negative, then
$$(\forall s, a, n) \quad \hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a)$$
and
$$(\forall s, a, n) \quad 0 \le \hat{Q}_n(s, a) \le Q(s, a)$$
$\hat{Q}$ converges to $Q$. Consider the case of a deterministic world where each $\langle s, a \rangle$ is visited infinitely often.
Proof: Define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. During each full interval, the largest error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.
Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ the maximum error in $\hat{Q}_n$; that is,
$$\Delta_n = \max_{s,a} |\hat{Q}_n(s, a) - Q(s, a)|$$
For any table entry $\hat{Q}_n(s, a)$ updated on iteration $n + 1$, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is
$$|\hat{Q}_{n+1}(s, a) - Q(s, a)| = |(r + \gamma \max_{a'} \hat{Q}_n(s', a')) - (r + \gamma \max_{a'} Q(s', a'))|$$
$$= \gamma\, |\max_{a'} \hat{Q}_n(s', a') - \max_{a'} Q(s', a')|$$
$$\le \gamma\, \max_{a'} |\hat{Q}_n(s', a') - Q(s', a')|$$
$$\le \gamma\, \max_{s'', a'} |\hat{Q}_n(s'', a') - Q(s'', a')|$$
$$|\hat{Q}_{n+1}(s, a) - Q(s, a)| \le \gamma \Delta_n$$
Note that we used the general fact that
$$\left| \max_a f_1(a) - \max_a f_2(a) \right| \le \max_a \left| f_1(a) - f_2(a) \right|$$
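A concrete instance of this fact (numbers invented): if over two actions $f_1$ takes the values $\{5, 1\}$ and $f_2$ takes the values $\{2, 4\}$, then
$$\left| \max_a f_1(a) - \max_a f_2(a) \right| = |5 - 4| = 1 \le \max_a |f_1(a) - f_2(a)| = \max\{3, 3\} = 3.$$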
Nondeterministic Case
What if reward and next state are nondeterministic? We redefine $V$ (and $Q$) by taking expected values:
$$V^{\pi}(s) \equiv E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots] = E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$$
Nondeterministic Case
$Q$ learning generalizes to nondeterministic worlds. Alter the training rule to
$$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\, \hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$$
where
$$\alpha_n = \frac{1}{1 + visits_n(s, a)}$$
Convergence of $\hat{Q}$ to $Q$ can still be proved [Watkins and Dayan, 1992].
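The decaying learning rate drops straight into the tabular update; a sketch (the visit-count bookkeeping is an assumed implementation detail, not from the slides):

    from collections import defaultdict

    Q = defaultdict(float)        # Q_hat estimates
    visits = defaultdict(int)     # visit counts for each (s, a)

    def nd_update(s, a, reward, s_next, actions, gamma=0.9):
        visits[(s, a)] += 1
        alpha = 1.0 / (1 + visits[(s, a)])
        target = reward + gamma * max(Q[(s_next, a2)] for a2 in actions)
        # Blend the old estimate with the new sampled target.
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target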
Temporal Difference Learning
$Q$ learning: reduce the discrepancy between successive $Q$ estimates. One-step time difference:
$$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)$$
Why not two steps?
$$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)$$
Or $n$?
$$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)$$
Temporal Difference Learning
Blend all of these:
$$Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda\, Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$$
Equivalent recursive expression:
$$Q^{\lambda}(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_a \hat{Q}(s_{t+1}, a) + \lambda\, Q^{\lambda}(s_{t+1}, a_{t+1}) \right]$$
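A sketch of these estimators computed from one recorded episode, truncating the infinite blend at the end of the trajectory (the `states`/`rewards` arrays and `Q_hat` table are assumed data layouts, with `len(rewards) == len(states) - 1`):

    def n_step_return(rewards, states, t, n, Q_hat, actions, gamma=0.9):
        # Q^(n): n discounted rewards, then a bootstrapped tail estimate.
        ret = sum((gamma ** i) * rewards[t + i] for i in range(n))
        return ret + (gamma ** n) * max(Q_hat[(states[t + n], a)] for a in actions)

    def lambda_return(rewards, states, t, Q_hat, actions, gamma=0.9, lam=0.5):
        # Q^lambda: (1 - lambda) weighted sum of the n-step estimates,
        # truncated at the last observed state of the episode.
        N = len(states) - 1 - t
        return (1 - lam) * sum(
            (lam ** (n - 1)) * n_step_return(rewards, states, t, n, Q_hat, actions, gamma)
            for n in range(1, N + 1))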
Subtleties and Ongoing Research
Replace the $\hat{Q}$ table with a neural net or other generalizer
Handle the case where state is only partially observable
Design optimal exploration strategies
Extend to continuous actions and states
Learn and use $\hat{\delta} : S \times A \rightarrow S$
Relationship to dynamic programming