INTRODUCTION TO MACHINE LEARNING
3RD EDITION
ETHEM ALPAYDIN
The MIT Press, 2014
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 18: REINFORCEMENT LEARNING
Introduction

Single-state case (K-armed bandit): the estimated value $Q(a)$ of action $a$ is updated online from the observed reward,
$$Q_{t+1}(a) \leftarrow Q_t(a) + \eta\,\big[r_{t+1}(a) - Q_t(a)\big]$$
where $\eta$ is the learning rate.
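As an illustration, a minimal Python sketch of this running update on a hypothetical 3-armed bandit (the reward distributions and the learning rate eta are made-up values):

    import random

    eta = 0.1                      # learning rate (assumed value)
    true_means = [0.2, 0.5, 0.8]   # hypothetical expected reward of each arm
    Q = [0.0] * len(true_means)    # estimated action values Q(a)

    for t in range(10000):
        a = random.randrange(len(Q))           # explore uniformly for simplicity
        r = random.gauss(true_means[a], 1.0)   # sample a noisy reward
        Q[a] += eta * (r - Q[a])               # the update above

    print(Q)  # the estimates approach the true means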
Elements of RL (Markov Decision Processes)
$s_t$: state of the agent at time $t$
$a_t$: action taken at time $t$
In $s_t$, action $a_t$ is taken, the clock ticks, reward $r_{t+1}$ is received, and the state changes to $s_{t+1}$
Next-state probability: $P(s_{t+1} \mid s_t, a_t)$
Reward probability: $p(r_{t+1} \mid s_t, a_t)$
Initial state(s), goal state(s)
Episode (trial) of actions from initial state to goal
(Sutton and Barto, 1998; Kaelbling et al., 1996)
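To make the clock tick concrete, here is a minimal Python sketch (the toy model, state names, and rewards are assumptions for illustration) that samples one episode from the initial state to the goal:

    import random

    # P[(s, a)] = list of (next_state, probability); R[(s, a)] = expected reward
    P = {("s0", "go"):   [("s0", 0.3), ("goal", 0.7)],
         ("goal", "go"): [("goal", 1.0)]}
    R = {("s0", "go"): -1.0, ("goal", "go"): 0.0}   # a cost of -1 per move

    def step(s, a):
        """One clock tick: sample s_{t+1} from P(s'|s,a), return (r_{t+1}, s_{t+1})."""
        nxt, probs = zip(*P[(s, a)])
        return R[(s, a)], random.choices(nxt, weights=probs)[0]

    s, total = "s0", 0.0          # one episode (trial) from initial state to goal
    while s != "goal":
        r, s = step(s, "go")
        total += r
    print(total)                  # minus the number of moves taken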
Policy and Cumulative Reward

Policy: $\pi : S \rightarrow A$, $a_t = \pi(s_t)$
Value of a policy: $V^{\pi}(s_t)$
Finite horizon:
$$V^{\pi}(s_t) = E[r_{t+1} + r_{t+2} + \cdots + r_{t+T}] = E\left[\sum_{i=1}^{T} r_{t+i}\right]$$
Infinite horizon:
$$V^{\pi}(s_t) = E[r_{t+1} + \gamma\,r_{t+2} + \gamma^2 r_{t+3} + \cdots] = E\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right]$$
$0 \le \gamma < 1$ is the discount rate
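For instance, with a constant reward of 1 at every step, the geometric series gives
$$E\left[\sum_{i=1}^{\infty} \gamma^{i-1}\right] = \frac{1}{1-\gamma},$$
so $\gamma = 0.9$ yields a value of 10: rewards further in the future count less, but the total stays finite.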
The value of the optimal policy:
$$V^*(s_t) = \max_{\pi} V^{\pi}(s_t), \quad \forall s_t$$
$$\begin{aligned}
V^*(s_t) &= \max_{a_t} E\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right] \\
&= \max_{a_t} E\left[r_{t+1} + \gamma \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i+1}\right] \\
&= \max_{a_t} E\left[r_{t+1} + \gamma\,V^*(s_{t+1})\right]
\end{aligned}$$
Bellman's equation:
$$V^*(s_t) = \max_{a_t} \left( E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$$
With $Q^*(s_t, a_t)$ the value of taking action $a_t$ in state $s_t$:
$$V^*(s_t) = \max_{a_t} Q^*(s_t, a_t)$$
$$Q^*(s_t, a_t) = E[r_{t+1}] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$$
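As a small made-up illustration of one backup: if an action from $s_t$ yields $E[r_{t+1}] = 1$ and leads with probability 0.5 each to states with $V^* = 10$ and $V^* = 0$, then with $\gamma = 0.9$ its $Q^*$ value is $1 + 0.9\,(0.5 \times 10 + 0.5 \times 0) = 5.5$.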
Model-Based Learning

When the environment model, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is known, $V^*(s_t)$ can be solved for directly by dynamic programming, with no need for exploration.
Optimal policy:
$$\pi^*(s_t) = \arg\max_{a_t} \left( E[r_{t+1} \mid s_t, a_t] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^*(s_{t+1}) \right)$$
Value Iteration

Starting from arbitrary values, repeatedly apply the Bellman backup $V(s) \leftarrow \max_a \big( E[r \mid s, a] + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \big)$ for all states until the values converge.
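A minimal value-iteration sketch in Python; the two-state model, the discount rate, and the convergence threshold are all made up for illustration:

    gamma, theta = 0.9, 1e-6
    states, actions = ["s0", "s1"], ["left", "right"]
    # P[(s, a)] = list of (next_state, probability); R[(s, a)] = expected reward
    P = {("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
         ("s0", "right"): [("s1", 1.0)],
         ("s1", "left"):  [("s0", 1.0)],
         ("s1", "right"): [("s1", 1.0)]}
    R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
         ("s1", "left"): 0.0, ("s1", "right"): 0.0}

    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:                 # stop when values no longer change
            break

    # The greedy policy extracted from the converged values
    pi = {s: max(actions, key=lambda a: R[(s, a)] +
                 gamma * sum(p * V[s2] for s2, p in P[(s, a)]))
          for s in states}
    print(V, pi)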
Policy Iteration

Alternate policy evaluation (solve for $V^{\pi}$) and policy improvement (make the policy greedy with respect to $V^{\pi}$) until the policy no longer changes.
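A minimal policy-iteration sketch on the same made-up two-state model; the fixed number of evaluation sweeps is an assumption (one can instead iterate to convergence or solve the linear system):

    gamma = 0.9
    states, actions = ["s0", "s1"], ["left", "right"]
    P = {("s0", "left"):  [("s0", 0.9), ("s1", 0.1)],
         ("s0", "right"): [("s1", 1.0)],
         ("s1", "left"):  [("s0", 1.0)],
         ("s1", "right"): [("s1", 1.0)]}
    R = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
         ("s1", "left"): 0.0, ("s1", "right"): 0.0}

    def q(s, a, V):
        return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])

    pi = {s: "left" for s in states}        # arbitrary initial policy
    while True:
        V = {s: 0.0 for s in states}        # policy evaluation: iterate V^pi
        for _ in range(1000):
            V = {s: q(s, pi[s], V) for s in states}
        # policy improvement: act greedily with respect to V^pi
        new_pi = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
        if new_pi == pi:                    # stop when the policy is stable
            break
        pi = new_pi
    print(pi, V)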
Temporal Difference Learning

Eligibility traces:
$$e_t(s, a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma \lambda\, e_{t-1}(s, a) & \text{otherwise} \end{cases}$$
The TD error and the update over all state-action pairs:
$$\delta_t = r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$
$$Q(s, a) \leftarrow Q(s, a) + \eta\, \delta_t\, e_t(s, a), \quad \forall s, a$$

Sarsa(λ)
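A minimal tabular Sarsa(λ) sketch in Python built from the update above; the toy episodic environment, ε-greedy exploration, and all constants are assumptions:

    import random

    gamma, lam, eta, eps = 0.9, 0.8, 0.1, 0.1
    actions = ["left", "right"]

    def step(s, a):
        """Made-up dynamics: 'right' from s1 reaches the goal with reward 1."""
        if s == "s0":
            return (0.0, "s1") if a == "right" else (0.0, "s0")
        return (1.0, "goal") if a == "right" else (0.0, "s0")

    Q = {(s, a): 0.0 for s in ["s0", "s1"] for a in actions}

    def choose(s):                          # epsilon-greedy action selection
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for episode in range(500):
        e = {sa: 0.0 for sa in Q}           # eligibility traces e(s, a)
        s, a = "s0", choose("s0")
        while s != "goal":
            r, s2 = step(s, a)
            a2 = choose(s2) if s2 != "goal" else None
            delta = r + gamma * (Q[(s2, a2)] if a2 else 0.0) - Q[(s, a)]
            e[(s, a)] = 1.0                 # trace of the visited pair
            for sa in Q:                    # update all pairs, decay traces
                Q[sa] += eta * delta * e[sa]
                e[sa] *= gamma * lam
            s, a = s2, a2

    print(Q)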
Generalization

Instead of storing $Q(s, a)$ in a table, a regressor $Q(s, a \mid \theta)$ can be trained, using the temporal difference as the error to be minimized.
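A minimal sketch of the regressor idea with a linear model $Q(s, a \mid \theta)$; the feature map and the single made-up transition are assumptions:

    import numpy as np

    eta, gamma = 0.01, 0.9

    def phi(s, a):
        """Hypothetical feature vector for a state-action pair."""
        return np.array([s, a, s * a, 1.0])

    theta = np.zeros(4)                  # parameters of Q(s, a | theta)

    def Q(s, a):
        return theta @ phi(s, a)

    # One gradient step on the squared TD error, holding the target fixed:
    s, a, r, s2, a2 = 0.5, 1.0, 1.0, 0.3, 0.0    # made-up transition
    delta = r + gamma * Q(s2, a2) - Q(s, a)      # TD error
    theta += eta * delta * phi(s, a)             # grad of Q(s,a|theta) is phi(s,a)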
Partially Observable States: The Tiger Problem

Let $p$ denote the probability that the tiger is behind the left door. Expected rewards of opening the left or the right door:
$$R(a_L) = -100\,p + 80\,(1-p), \qquad R(a_R) = 90\,p - 100\,(1-p)$$
We can also sense, with a reward of $R(a_S) = -1$.
Our sensors are unreliable. If we sense $o_L$, our belief in the tiger's position changes:
$$p' = P(z_L \mid o_L) = \frac{P(o_L \mid z_L)\, P(z_L)}{P(o_L)} = \frac{0.7\,p}{0.7\,p + 0.3\,(1-p)}$$
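For example, starting from a uniform belief $p = 0.5$:
$$p' = \frac{0.7 \times 0.5}{0.7 \times 0.5 + 0.3 \times 0.5} = 0.7$$
One unreliable observation shifts the belief from 0.5 to 0.7.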
$$R(a_L \mid o_L) = r(a_L, z_L)\, P(z_L \mid o_L) + r(a_L, z_R)\, P(z_R \mid o_L) = -100\,p' + 80\,(1-p') = -100\,\frac{0.7\,p}{P(o_L)} + 80\,\frac{0.3\,(1-p)}{P(o_L)}$$
$$R(a_R \mid o_L) = r(a_R, z_L)\, P(z_L \mid o_L) + r(a_R, z_R)\, P(z_R \mid o_L) = 90\,p' - 100\,(1-p') = 90\,\frac{0.7\,p}{P(o_L)} - 100\,\frac{0.3\,(1-p)}{P(o_L)}$$
$$R(a_S \mid o_L) = -1$$
If we sense first and then act on the observation, the expected reward is
$$V' = \sum_j \left( \max_i R(a_i \mid o_j) \right) P(o_j) = \max\big(R(a_L \mid o_L), R(a_R \mid o_L), R(a_S \mid o_L)\big)\, P(o_L) + \max\big(R(a_L \mid o_R), R(a_R \mid o_R), R(a_S \mid o_R)\big)\, P(o_R)$$
Written out for the possible pairs of door choices:
$$V' = \max \begin{cases} -100\,p + 80\,(1-p) \\ -43\,p - 46\,(1-p) \\ 33\,p + 26\,(1-p) \\ 90\,p - 100\,(1-p) \end{cases}$$
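At $p = 0.5$, for instance, the four lines give $-10$, $-44.5$, $29.5$, and $-5$: sensing and then opening the door away from the observed tiger is worth $29.5$ (before the $-1$ sensing cost), whereas acting blindly yields at best $-5$.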
Let us say the tiger can move from one room to the other with probability 0.8. The belief is first propagated through this motion,
$$p' = 0.2\,p + 0.8\,(1-p)$$
$$V' = \max \begin{cases} -100\,p' + 80\,(1-p') \\ 33\,p' + 26\,(1-p') \\ 90\,p' - 100\,(1-p') \end{cases}$$
When planning for episodes of length two, we can take $a_L$, take $a_R$, or sense and wait:
$$V_2 = \max \begin{cases} -100\,p + 80\,(1-p) \\ 90\,p - 100\,(1-p) \\ -1 + V' \end{cases}$$
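A small Python sketch tying these formulas together; it is purely illustrative, evaluating the horizon-one and horizon-two values at a few beliefs:

    def V1(p):
        """Best immediate action without sensing."""
        return max(-100*p + 80*(1-p), 90*p - 100*(1-p))

    def V_sense(p):
        """Value of acting optimally after a sensing step (cost not included)."""
        return max(-100*p + 80*(1-p), -43*p - 46*(1-p),
                    33*p + 26*(1-p),  90*p - 100*(1-p))

    def V2(p):
        """Horizon two: act now, or pay -1 to sense and then act."""
        return max(-100*p + 80*(1-p), 90*p - 100*(1-p), -1 + V_sense(p))

    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(p, V1(p), V2(p))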