
Lecture Slides for

INTRODUCTION
TO
MACHINE
LEARNING
3RD EDITION
ETHEM ALPAYDIN
The MIT Press, 2014

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 18:

REINFORCEMENT LEARNING
Introduction

Game-playing: Sequence of moves to win a game
Robot in a maze: Sequence of actions to find a goal
The agent has a state in an environment, takes an action, and sometimes receives a reward; the state changes
Credit assignment
Learn a policy
Single State: K-armed Bandit

Among K levers, choose the one that pays best
Q(a): value of action a
Reward is $r_a$
Set $Q(a) = r_a$
Choose $a^*$ if $Q(a^*) = \max_a Q(a)$
Rewards stochastic (keep an expected reward):
$$Q_{t+1}(a) \leftarrow Q_t(a) + \eta\left[ r_{t+1}(a) - Q_t(a) \right]$$
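A minimal Python sketch of this running-average update for a K-armed bandit; the number of levers, the learning rate η, and the Gaussian reward model are illustrative assumptions, not part of the slides.

```python
import random

K, eta = 5, 0.1                                         # assumed number of levers and learning rate
true_means = [random.uniform(0, 1) for _ in range(K)]   # hidden expected payoff of each lever
Q = [0.0] * K                                           # value estimates Q(a)

for t in range(5000):
    a = t % K                                # pull each lever in turn, just to estimate its value
    r = random.gauss(true_means[a], 0.1)     # stochastic reward r_{t+1}(a)
    Q[a] += eta * (r - Q[a])                 # Q_{t+1}(a) <- Q_t(a) + eta * (r_{t+1}(a) - Q_t(a))

a_star = max(range(K), key=lambda a: Q[a])   # choose a* with Q(a*) = max_a Q(a)
print(a_star, [round(q, 2) for q in Q])
```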
Elements of RL (Markov Decision Processes)
$s_t$: state of agent at time t
$a_t$: action taken at time t
In $s_t$, action $a_t$ is taken, the clock ticks, reward $r_{t+1}$ is received, and the state changes to $s_{t+1}$
Next state prob: $P(s_{t+1} \mid s_t, a_t)$
Reward prob: $p(r_{t+1} \mid s_t, a_t)$
Initial state(s), goal state(s)
Episode (trial) of actions from initial state to goal
(Sutton and Barto, 1998; Kaelbling et al., 1996)
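To make these elements concrete, the model can be stored as arrays; the sizes, probabilities, and rewards below are made-up placeholders, not an example from the book.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9                  # assumed sizes and discount rate

# P[s, a, s'] plays the role of P(s_{t+1} = s' | s_t = s, a_t = a); each P[s, a] sums to 1
P = np.full((n_states, n_actions, n_states), 1.0 / n_states)

# R[s, a] plays the role of E[r_{t+1} | s_t = s, a_t = a]
R = np.zeros((n_states, n_actions))
R[2, :] = 1.0                                           # say reaching state 2 pays reward 1

# expected one-step return of action a in state s under a value estimate V
V = np.zeros(n_states)
s, a = 0, 1
print(R[s, a] + gamma * P[s, a] @ V)
```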
Policy and Cumulative Reward
Policy, $\pi : \mathcal{S} \rightarrow \mathcal{A}$, $a_t = \pi(s_t)$
Value of a policy, $V^{\pi}(s_t)$
Finite horizon:
$$V^{\pi}(s_t) = E\left[r_{t+1} + r_{t+2} + \cdots + r_{t+T}\right] = E\left[\sum_{i=1}^{T} r_{t+i}\right]$$
Infinite horizon:
$$V^{\pi}(s_t) = E\left[r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots\right] = E\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right]$$
$0 \le \gamma < 1$ is the discount rate
$$V^{*}(s_t) = \max_{\pi} V^{\pi}(s_t), \quad \forall s_t$$
$$V^{*}(s_t) = \max_{a_t} E\left[\sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i}\right]
            = \max_{a_t} E\left[r_{t+1} + \gamma \sum_{i=1}^{\infty} \gamma^{i-1} r_{t+i+1}\right]
            = \max_{a_t} E\left[r_{t+1} + \gamma V^{*}(s_{t+1})\right]$$
Bellman's equation:
$$V^{*}(s_t) = \max_{a_t}\left( E\left[r_{t+1}\right] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{*}(s_{t+1}) \right)$$
$$V^{*}(s_t) = \max_{a_t} Q^{*}(s_t, a_t) \quad \text{(value of $a_t$ in $s_t$)}$$
$$Q^{*}(s_t, a_t) = E\left[r_{t+1}\right] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})$$
Model-Based Learning
The environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is known
There is no need for exploration
Can be solved using dynamic programming
Solve for
$$V^{*}(s_t) = \max_{a_t}\left( E\left[r_{t+1}\right] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{*}(s_{t+1}) \right)$$
Optimal policy
$$\pi^{*}(s_t) = \arg\max_{a_t}\left( E\left[r_{t+1} \mid s_t, a_t\right] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{*}(s_{t+1}) \right)$$
Value Iteration
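A minimal tabular value iteration sketch in Python, repeatedly applying the Bellman optimality backup above; the random toy MDP and the stopping tolerance are assumptions for illustration.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9                             # assumed toy problem
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s'], rows sum to 1
R = rng.uniform(0, 1, size=(n_states, n_actions))                  # E[r | s, a]

V = np.zeros(n_states)
while True:
    Q = R + gamma * P @ V          # backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)          # V(s) <- max_a Q(s,a)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop when the values no longer change
        V = V_new
        break
    V = V_new

policy = (R + gamma * P @ V).argmax(axis=1)   # greedy policy with respect to V*
print(np.round(V, 3), policy)
```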
Policy Iteration
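And a matching policy iteration sketch, alternating exact policy evaluation (solving the linear system for $V^{\pi}$) with greedy improvement, on the same assumed toy model as above.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9                             # assumed toy problem
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s']
R = rng.uniform(0, 1, size=(n_states, n_actions))                  # E[r | s, a]

policy = np.zeros(n_states, dtype=int)                  # start from an arbitrary policy
while True:
    # policy evaluation: V = R_pi + gamma * P_pi V  =>  (I - gamma * P_pi) V = R_pi
    P_pi = P[np.arange(n_states), policy]
    R_pi = R[np.arange(n_states), policy]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # policy improvement: act greedily with respect to the evaluated V
    new_policy = (R + gamma * P @ V).argmax(axis=1)
    if np.array_equal(new_policy, policy):              # stable policy => optimal
        break
    policy = new_policy

print(policy, np.round(V, 3))
```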
Temporal Difference Learning
The environment, $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$, is not known; model-free learning
There is need for exploration to sample from $P(s_{t+1} \mid s_t, a_t)$ and $p(r_{t+1} \mid s_t, a_t)$
Use the reward received in the next time step to update the value of the current state (action)
The update is driven by the temporal difference between the value of the current action and the value discounted from the next state
Exploration Strategies
ε-greedy: With probability ε, choose one action at random uniformly; and choose the best action with probability 1 − ε
Probabilistic (softmax):
$$P(a \mid s) = \frac{\exp Q(s, a)}{\sum_{b=1}^{|\mathcal{A}|} \exp Q(s, b)}$$
Move smoothly from exploration to exploitation: decrease ε over time
Annealing: gradually decrease the temperature T
$$P(a \mid s) = \frac{\exp\left[Q(s, a)/T\right]}{\sum_{b=1}^{|\mathcal{A}|} \exp\left[Q(s, b)/T\right]}$$
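A short sketch of both strategies in Python; the value estimates, ε = 0.1, and T = 0.5 are made-up numbers for illustration.

```python
import math
import random

def epsilon_greedy(Q_s, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.randrange(len(Q_s))
    return max(range(len(Q_s)), key=lambda a: Q_s[a])

def softmax_action(Q_s, T=1.0):
    """Sample an action with probability proportional to exp(Q(s,a)/T)."""
    prefs = [math.exp(q / T) for q in Q_s]
    return random.choices(range(len(Q_s)), weights=prefs)[0]

Q_s = [0.2, 0.5, 0.1]                       # assumed value estimates for one state
print(epsilon_greedy(Q_s), softmax_action(Q_s, T=0.5))
```

Lowering T makes the softmax distribution sharper, which is what the annealing schedule exploits.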
Deterministic Rewards and Actions
$$Q^{*}(s_t, a_t) = E\left[r_{t+1}\right] + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, \max_{a_{t+1}} Q^{*}(s_{t+1}, a_{t+1})$$
Deterministic: single possible reward and next state
$$Q(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$
Used as an update rule (backup):
$$\hat{Q}(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1})$$
Starting at zero, Q values increase, never decrease
Example with $\gamma = 0.9$; consider the value of the action marked by *:
If path A is seen first, Q(*) = 0.9 × max(0, 81) ≈ 73
Then B is seen, Q(*) = 0.9 × max(100, 81) = 90
Or, if path B is seen first, Q(*) = 0.9 × max(100, 0) = 90
Then A is seen, Q(*) = 0.9 × max(100, 81) = 90
Q values increase but never decrease
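A tiny sketch of the deterministic backup that reproduces the two orderings above; the successor values 81 and 100 come from the example, the rest is assumed scaffolding.

```python
gamma = 0.9  # discount rate used in the example

def backup(reward, successor_q_values):
    """Deterministic Q-learning backup: Q(s,a) <- r + gamma * max_{a'} Q(s',a')."""
    return reward + gamma * max(successor_q_values)

# Path A seen first: the successor state's known action values are 0 and 81.
print(backup(0.0, [0.0, 81.0]))     # 0.9 * max(0, 81) = 72.9, i.e. ~73
# Then path B is seen: the better successor action is now worth 100.
print(backup(0.0, [100.0, 81.0]))   # 0.9 * max(100, 81) = 90
```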
Nondeterministic Rewards and Actions
When next states and rewards are nondeterministic (there is an opponent or randomness in the environment), we keep running averages (expected values) instead of direct assignments
Q-learning (Watkins and Dayan, 1992):
$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \eta\left[ r_{t+1} + \gamma \max_{a_{t+1}} \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t) \right]$$
Off-policy vs. on-policy (Sarsa) backup
Learning V (TD-learning: Sutton, 1988):
$$V(s_t) \leftarrow V(s_t) + \eta\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$
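As a quick illustration of the TD rule for V, here is a sketch on an assumed five-state random walk with reward 1 for exiting on the right; the chain, η, and γ are illustrative, not from the slides.

```python
import random

n_states, eta, gamma = 5, 0.1, 0.9         # assumed chain length, learning rate, discount
V = [0.0] * n_states

for episode in range(2000):
    s = n_states // 2                      # start each episode in the middle state
    while True:
        s_next = s + random.choice([-1, 1])             # random walk left or right
        done = s_next < 0 or s_next >= n_states
        r = 1.0 if s_next >= n_states else 0.0          # reward only for exiting on the right
        target = r if done else r + gamma * V[s_next]
        V[s] += eta * (target - V[s])                   # V(s) <- V(s) + eta*[r + gamma*V(s') - V(s)]
        if done:
            break
        s = s_next

print([round(v, 2) for v in V])
```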
Q-learning
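A hedged tabular Q-learning sketch in Python; the `env` object with `reset()` and `step(s, a)` is an assumed interface, not a specific library, and the hyperparameters are illustrative.

```python
import random

def q_learning(env, n_states, n_actions, episodes=500,
               eta=0.1, gamma=0.9, epsilon=0.1):
    """Off-policy tabular Q-learning: the target uses the max over next actions."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                       # epsilon-greedy behaviour
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s_next, r, done = env.step(s, a)
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += eta * (target - Q[s][a])                 # Q-learning backup
            s = s_next
    return Q
```

Because the target always takes the max over next actions, the learned values do not depend on the exploration policy actually followed, which is what makes the method off-policy.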
Sarsa
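A matching Sarsa sketch under the same assumed interface; the only change from Q-learning is that the backup uses the action the ε-greedy policy actually takes next (on-policy).

```python
import random

def sarsa(env, n_states, n_actions, episodes=500,
          eta=0.1, gamma=0.9, epsilon=0.1):
    """On-policy tabular Sarsa."""
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def choose(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda x: Q[s][x])

    for _ in range(episodes):
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(s, a)
            if done:
                target = r
            else:
                a_next = choose(s_next)                  # pick the next action first...
                target = r + gamma * Q[s_next][a_next]   # ...and back up with its value
            Q[s][a] += eta * (target - Q[s][a])
            if not done:
                s, a = s_next, a_next
    return Q
```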
Eligibility Traces
Keep a record of previously visited states (actions):
$$e_t(s, a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma \lambda\, e_{t-1}(s, a) & \text{otherwise} \end{cases}$$
$$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$
$$Q(s, a) \leftarrow Q(s, a) + \eta\, \delta_t\, e_t(s, a), \quad \forall s, a$$
Sarsa(λ)
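A hedged Sarsa(λ) sketch combining the Sarsa backup with the eligibility-trace update from the previous slide (replacing traces); `env`, λ = 0.8, and the other hyperparameters are assumptions.

```python
import random

def sarsa_lambda(env, n_states, n_actions, episodes=500,
                 eta=0.1, gamma=0.9, lam=0.8, epsilon=0.1):
    """Tabular Sarsa(lambda) with replacing eligibility traces."""
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def choose(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda x: Q[s][x])

    for _ in range(episodes):
        e = [[0.0] * n_actions for _ in range(n_states)]   # traces start at zero each episode
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s_next, r, done = env.step(s, a)
            a_next = None if done else choose(s_next)
            q_next = 0.0 if done else Q[s_next][a_next]
            delta = r + gamma * q_next - Q[s][a]            # TD error delta_t
            e[s][a] = 1.0                                   # e_t(s_t, a_t) = 1
            for si in range(n_states):                      # update every (s, a) pair
                for ai in range(n_actions):
                    Q[si][ai] += eta * delta * e[si][ai]
                    e[si][ai] *= gamma * lam                # decay: gamma * lambda * e_{t-1}
            s, a = s_next, a_next
    return Q
```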
Generalization
Tabular: Q(s, a) or V(s) stored in a table
Regressor: Use a learner to estimate Q(s, a) or V(s)
$$E^{t}(\theta) = \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]^2$$
$$\Delta\theta = \eta\left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \nabla_{\theta} Q(s_t, a_t)$$
Eligibility:
$$\Delta\theta = \eta\, \delta_t\, \mathbf{e}_t$$
$$\delta_t = r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$
$$\mathbf{e}_t = \gamma \lambda\, \mathbf{e}_{t-1} + \nabla_{\theta} Q(s_t, a_t), \quad \text{with } \mathbf{e}_0 \text{ all zeros}$$
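To ground the regressor case, here is a minimal sketch with a linear model $Q(s, a \mid \theta) = \theta^{\top}\phi(s, a)$ and a one-hot feature map; the feature encoding, problem size, and step size are assumptions for illustration.

```python
import numpy as np

n_states, n_actions = 10, 2            # assumed problem size
eta, gamma = 0.05, 0.9                 # assumed step size and discount
theta = np.zeros(n_states * n_actions)

def phi(s, a):
    """One-hot feature vector for (s, a) (a deliberately simple, assumed encoding)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def Q(s, a):
    return theta @ phi(s, a)           # linear approximation Q(s, a | theta)

def td_update(s, a, r, s_next, a_next):
    """theta <- theta + eta * delta_t * grad_theta Q(s,a); for a linear model the gradient is phi(s,a)."""
    global theta
    delta = r + gamma * Q(s_next, a_next) - Q(s, a)
    theta += eta * delta * phi(s, a)

td_update(3, 1, 1.0, 4, 0)             # one illustrative transition
print(Q(3, 1))
```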
Partially Observable States

The agent does not know its state but receives an observation $p(o_{t+1} \mid s_t, a_t)$ which can be used to infer a belief about states
Partially observable MDP (POMDP)
The Tiger Problem

Two doors, behind one of which there is a tiger
p: probability that the tiger is behind the left door
$$R(a_L) = -100p + 80(1 - p), \qquad R(a_R) = 90p - 100(1 - p)$$
We can sense with a reward of $R(a_S) = -1$
We have unreliable sensors
If we sense $o_L$, our belief in the tiger's position changes:
$$p' = P(z_L \mid o_L) = \frac{P(o_L \mid z_L)\, P(z_L)}{P(o_L)} = \frac{0.7p}{0.7p + 0.3(1 - p)}$$
$$R(a_L \mid o_L) = r(a_L, z_L) P(z_L \mid o_L) + r(a_L, z_R) P(z_R \mid o_L)
                 = -100p' + 80(1 - p')
                 = -100\,\frac{0.7p}{P(o_L)} + 80\,\frac{0.3(1 - p)}{P(o_L)}$$
$$R(a_R \mid o_L) = r(a_R, z_L) P(z_L \mid o_L) + r(a_R, z_R) P(z_R \mid o_L)
                 = 90p' - 100(1 - p')
                 = 90\,\frac{0.7p}{P(o_L)} - 100\,\frac{0.3(1 - p)}{P(o_L)}$$
$$R(a_S \mid o_L) = -1$$
$$V' = \sum_{j} \max_i R(a_i \mid o_j)\, P(o_j)
     = \max\big(R(a_L \mid o_L), R(a_R \mid o_L), R(a_S \mid o_L)\big)\, P(o_L)
     + \max\big(R(a_L \mid o_R), R(a_R \mid o_R), R(a_S \mid o_R)\big)\, P(o_R)$$
$$V' = \max \begin{cases} -100p + 80(1 - p) \\ -43p - 46(1 - p) \\ 33p + 26(1 - p) \\ 90p - 100(1 - p) \end{cases}$$

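A small numerical check of these formulas in Python; the 0.7/0.3 sensor model and the door payoffs come from the slides, while the grid of p values is only for illustration.

```python
def belief_after_oL(p):
    """p' = P(z_L | o_L): belief that the tiger is behind the left door after sensing o_L."""
    return 0.7 * p / (0.7 * p + 0.3 * (1 - p))

def V1(p):
    """Upper envelope of the four linear payoff functions of p listed above."""
    return max(
        -100 * p + 80 * (1 - p),    # open left regardless of the observation
        -43 * p - 46 * (1 - p),     # open left on o_L, right on o_R
        33 * p + 26 * (1 - p),      # open right on o_L, left on o_R
        90 * p - 100 * (1 - p),     # open right regardless of the observation
    )

for p in (0.0, 0.3, 0.5, 0.7, 1.0):
    print(p, round(belief_after_oL(p), 3), V1(p))
```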
Let us say the tiger can move from one room to the other with probability 0.8:
$$p' = 0.2p + 0.8(1 - p)$$
$$V' = \max \begin{cases} -100p' + 80(1 - p') \\ 33p' + 26(1 - p') \\ 90p' - 100(1 - p') \end{cases}$$

When planning for episodes of two, we can take $a_L$, $a_R$, or sense and wait:
$$V_2 = \max \begin{cases} -100p + 80(1 - p) \\ 90p - 100(1 - p) \\ \gamma \max V' - 1 \end{cases}$$

