
13. Reinforcement Learning

• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence

Updated lecture slides of Machine Learning textbook, Tom M. Mitchell, McGraw Hill, 1997
Control Learning

A computational approach to learning from interactions with the environment.
• The learner is not told which actions to take.
• Rewards may be delayed.
• Instead, it must discover which actions yield the most reward by trying them.
• Supervised learning is learning from examples provided by an external supervisor.
• In interactive environments, it is often impossible to obtain examples that are representative of all the situations the agent will face.
• There is a trade-off between exploration and exploitation.
• The agent has to exploit what it already knows to obtain reward.

• But it also has to explore in order to make better selections in the future; a simple ε-greedy selection scheme is sketched after this list.
• Agents should have explicit goals.
• They should be able to observe the state, at least partially.
• They should be able to choose actions that influence their environment.
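As a concrete illustration of the exploration-exploitation trade-off, here is a minimal sketch of ε-greedy action selection — a standard scheme, not specific to these slides. With probability epsilon the agent explores by trying a random action; otherwise it exploits its current value estimates. The table name q_hat and the argument names are hypothetical placeholders.

    import random

    def epsilon_greedy(q_hat, state, actions, epsilon=0.1):
        # q_hat: a (state, action) -> estimated value table (hypothetical name).
        # With probability epsilon, explore: try an action at random.
        if random.random() < epsilon:
            return random.choice(actions)
        # Otherwise exploit: pick the action with the highest current estimate.
        return max(actions, key=lambda a: q_hat.get((state, a), 0.0))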
Consider learning to choose actions, e.g.,
• Robot operating in an environment
• Learning to play Backgammon

One Example: TD-Gammon

[Tesauro, 1995]
Learn to play Backgammon
Immediate reward
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Now plays approximately as well as the best human players.

Reinforcement Learning Problem

[Figure: the agent-environment interaction loop. At each step the agent observes the state and reward and chooses an action that acts on the environment, producing the trajectory s_0 --a_0/r_0--> s_1 --a_1/r_1--> s_2 --> ...]

Goal: learn to choose actions that maximize

    r_0 + γ r_1 + γ² r_2 + ... ,   where 0 ≤ γ < 1

Future rewards are discounted exponentially due to their delay.
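For example, with γ = 0.9 (the discount used in the grid-world figures later in these slides), a reward of 100 received two steps from now contributes 0.9² · 100 = 81 to this sum. A minimal sketch of the computation (the function name is mine):

    def discounted_return(rewards, gamma=0.9):
        # r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence.
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    print(discounted_return([0, 0, 100]))   # 0 + 0.9*0 + 0.9**2 * 100, i.e. about 81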

Markov Decision Processes

Assume
• a finite set of states S
• a set of actions A
• at each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A
• it then receives immediate reward r_t
• and the state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r are not necessarily known to the agent
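As a concrete instance of these definitions, the sketch below encodes a deterministic MDP matching the small grid world shown a few slides later: six states, movement actions, δ given by a transition rule, and r(s, a) = 100 for any action entering the absorbing goal state G, 0 otherwise. The state numbering, layout, and helper names are my own assumptions read off that figure.

    # States 0..5 laid out as a 2 x 3 grid; state 2 (top-right) is the absorbing goal G.
    #     0  1  2(G)
    #     3  4  5
    GOAL = 2
    ACTIONS = ["up", "down", "left", "right"]
    MOVES = {"up": -3, "down": +3, "left": -1, "right": +1}

    def delta(s, a):
        # Deterministic transition function delta(s, a); blocked moves leave s unchanged.
        if s == GOAL:
            return s                      # G is absorbing
        row, col = divmod(s, 3)
        blocked = ((a == "up" and row == 0) or (a == "down" and row == 1) or
                   (a == "left" and col == 0) or (a == "right" and col == 2))
        if blocked:
            return s
        return s + MOVES[a]

    def r(s, a):
        # Immediate reward: 100 for any action that enters the goal, 0 otherwise.
        return 100 if s != GOAL and delta(s, a) == GOAL else 0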

Agent’s Learning Task

Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes
      E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards
Note something new:
• the target function is π : S → A
• but we have no training examples of the form ⟨s, a⟩
• training examples are of the form ⟨⟨s, a⟩, r⟩

Value Function

To begin, consider deterministic worlds...


For each possible policy π the agent might adopt, we can define an evaluation function over states:

    V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ...
           ≡ Σ_{i=0}^{∞} γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s.
Restated, the task is to learn the optimal policy π*:

    π* ≡ argmax_π V^π(s), (∀s)
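In a deterministic world, V^π(s) can be computed by simply rolling the policy forward and summing discounted rewards. A minimal sketch, reusing the hypothetical grid-world delta and r defined earlier and truncating the infinite sum at a fixed horizon:

    def v_pi(s, pi, gamma=0.9, horizon=50):
        # V^pi(s): discounted return obtained by following policy pi from state s,
        # truncating the infinite sum after `horizon` steps.
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            a = pi(s)
            total += discount * r(s, a)
            discount *= gamma
            s = delta(s, a)
        return total

    # A policy that heads for the goal: move right from the left/middle columns, else up.
    pi = lambda s: "right" if s in (0, 1, 3, 4) else "up"
    print(v_pi(3, pi))    # about 81, matching the value of the bottom-left state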

[Figure: a simple deterministic grid world with six states and an absorbing goal state G (γ = 0.9). Three diagrams show the r(s, a) (immediate reward) values — 0 everywhere except 100 for actions entering G; the Q(s, a) values (72, 81, 90, 100); the V*(s) values (81, 90, 100); and one optimal policy.]
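The V*(s) values in the figure can be reproduced from the known model with a short value-iteration sweep. This is only a check of the target values under the hypothetical grid-world encoding given earlier (it assumes δ and r are known, which the learning setting below does not):

    def value_iteration(states, gamma=0.9, sweeps=100):
        # Repeatedly apply V(s) <- max_a [ r(s, a) + gamma * V(delta(s, a)) ].
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            V = {s: max(r(s, a) + gamma * V[delta(s, a)] for a in ACTIONS)
                 for s in states}
        return V

    print(value_iteration(range(6)))
    # roughly {0: 90, 1: 100, 2: 0, 3: 81, 4: 90, 5: 100}, matching the V*(s) figure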

What to Learn

We might try to have the agent learn the evaluation function V^{π*} (which we write as V*).
It could then do a lookahead search to choose the best action from any state s, because

    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]

A problem:
• This works well if the agent knows δ : S × A → S and r : S × A → ℝ
• But when it doesn't, it can't choose actions this way
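A minimal sketch of this lookahead choice, reusing the hypothetical grid-world functions and the value_iteration check from the previous page; it works only because delta and r are available to the code:

    def greedy_policy(s, V_star, gamma=0.9):
        # pi*(s) = argmax_a [ r(s, a) + gamma * V*(delta(s, a)) ]
        # -- only possible when the agent knows delta and r.
        return max(ACTIONS, key=lambda a: r(s, a) + gamma * V_star[delta(s, a)])

    V_star = value_iteration(range(6))
    print(greedy_policy(1, V_star))   # "right": step straight into the goal G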

Q Function

Define a new function very similar to V*:

    Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If the agent learns Q, it can choose the optimal action even without knowing δ!

    π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
    π*(s) = argmax_a Q(s, a)

Q is the evaluation function the agent will learn.
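With a learned Q table, action choice reduces to a table lookup. A minimal sketch (q_hat is a hypothetical learned table, such as the one produced by the Q-learning sketch two slides below):

    def best_action(q_hat, s, actions):
        # pi*(s) = argmax_a Q(s, a): a plain table lookup -- no model of delta or r needed.
        return max(actions, key=lambda a: q_hat.get((s, a), 0.0))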

Training Rule to Learn Q

Note that Q and V* are closely related:

    V*(s) = max_{a'} Q(s, a')

which allows us to write Q recursively as

    Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
                = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Nice! Let Q̂ denote the learner's current approximation to Q. Consider the training rule

    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s.

Q Learning for Deterministic Worlds

For each s, a initialize table entry Q̂(s, a) ← 0


Observe current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s′
• Update the table entry for Q̂(s, a) as
follows:
      Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
• s ← s′
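A minimal sketch of this loop in Python, run on the hypothetical grid world defined earlier (delta, r, ACTIONS, GOAL). Actions are selected uniformly at random here, which is enough to visit every ⟨s, a⟩ pair in this small world; each episode ends when the absorbing goal is reached:

    import random

    def q_learning(episodes=5000, gamma=0.9):
        # Table of Q-hat values, initialized to zero for every (s, a).
        q_hat = {(s, a): 0.0 for s in range(6) for a in ACTIONS}
        for _ in range(episodes):
            s = random.randrange(6)                 # start each episode in a random state
            while s != GOAL:
                a = random.choice(ACTIONS)          # explore by acting randomly
                reward, s_next = r(s, a), delta(s, a)
                # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
                q_hat[(s, a)] = reward + gamma * max(q_hat[(s_next, a2)] for a2 in ACTIONS)
                s = s_next
        return q_hat

    q_hat = q_learning()
    print(round(q_hat[(3, "right")]))   # 81, as in the Q(s, a) figure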

Updating Q̂

[Figure: one Q̂ update in the grid world. The agent executes a_right, moving from initial state s_1 to next state s_2. Before the update Q̂(s_1, a_right) = 72; the Q̂ values of the actions available from s_2 are 63, 81, and 100; after the update Q̂(s_1, a_right) = 90.]

    Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                     ← 0 + 0.9 · max{63, 81, 100}
                     ← 90

Notice that if rewards are non-negative, then
    (∀ s, a, n)  Q̂_{n+1}(s, a) ≥ Q̂_n(s, a)
and
    (∀ s, a, n)  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

Convergence

Q̂ converges to Q. Consider the case of a deterministic world where each ⟨s, a⟩ is visited infinitely often.
Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. During each full interval, the largest error in the Q̂ table is reduced by a factor of γ.
Let Q̂_n be the table after n updates, and Δ_n the maximum error in Q̂_n; that is,

    Δ_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|

For any table entry Q̂_n(s, a) updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

    |Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a'} Q̂_n(s', a')) − (r + γ max_{a'} Q(s', a'))|
                               = γ |max_{a'} Q̂_n(s', a') − max_{a'} Q(s', a')|
                               ≤ γ max_{a'} |Q̂_n(s', a') − Q(s', a')|
                               ≤ γ max_{s'', a'} |Q̂_n(s'', a') − Q(s'', a')|

    |Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ Δ_n

Note that we used the general fact that

    | max_a f_1(a) − max_a f_2(a) | ≤ max_a | f_1(a) − f_2(a) |
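The contraction can also be seen numerically on the hypothetical grid world from the earlier sketches. The check below reuses delta, r, ACTIONS, and value_iteration, computes the true Q values from V*, and treats one pass over all state-action pairs as a "full interval":

    # Empirical check of the contraction argument: after each full pass over all
    # (s, a), the maximum error Delta_n of Q-hat shrinks by at least a factor of 0.9.
    V_star = value_iteration(range(6))
    Q_true = {(s, a): r(s, a) + 0.9 * V_star[delta(s, a)]
              for s in range(6) for a in ACTIONS}
    q_hat = {sa: 0.0 for sa in Q_true}
    for sweep in range(5):
        for (s, a) in q_hat:
            q_hat[(s, a)] = r(s, a) + 0.9 * max(q_hat[(delta(s, a), a2)] for a2 in ACTIONS)
        delta_n = max(abs(q_hat[sa] - Q_true[sa]) for sa in Q_true)
        print(sweep, delta_n)     # each printed Delta_n is at most 0.9 times the one before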

Nondeterministic Case

What if the reward and the next state are nondeterministic?
We redefine V and Q by taking expected values:

    V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
           ≡ E[ Σ_{i=0}^{∞} γ^i r_{t+i} ]

    Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]

Nondeterministic Case

Q learning generalizes to nondeterministic worlds.
Alter the training rule to

    Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ]

where

    α_n = 1 / (1 + visits_n(s, a))

Can still prove convergence of Q̂ to Q


[Watkins and Dayan, 1992]
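A minimal sketch of this revised update as a single function. The tables q_hat and visits and the argument names are hypothetical placeholders; the decaying α_n makes later revisions smaller, which averages out the noise in the sampled reward and next state:

    def nondet_q_update(q_hat, visits, s, a, reward, s_next, actions, gamma=0.9):
        # alpha_n = 1 / (1 + visits_n(s, a))
        visits[(s, a)] = visits.get((s, a), 0) + 1
        alpha = 1.0 / (1.0 + visits[(s, a)])
        target = reward + gamma * max(q_hat.get((s_next, a2), 0.0) for a2 in actions)
        # Move Q-hat(s, a) a fraction alpha of the way toward the sampled target.
        q_hat[(s, a)] = (1.0 - alpha) * q_hat.get((s, a), 0.0) + alpha * target
        return q_hat[(s, a)]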

Temporal Difference Learning

Q learning: reduce the discrepancy between successive Q estimates.
One-step time difference:

    Q^(1)(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

    Q^(2)(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)

Or n?

    Q^(n)(s_t, a_t) ≡ r_t + γ r_{t+1} + ... + γ^(n−1) r_{t+n−1} + γ^n max_a Q̂(s_{t+n}, a)

Blend all of these:

    Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ... ]

Temporal Difference Learning

    Q^λ(s_t, a_t) ≡ (1 − λ) [ Q^(1)(s_t, a_t) + λ Q^(2)(s_t, a_t) + λ² Q^(3)(s_t, a_t) + ... ]

Equivalent recursive expression:

    Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]
The TD(λ) algorithm uses the above training rule.
• It sometimes converges faster than Q learning.
• It converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992).
• Tesauro's TD-Gammon uses this algorithm.
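To make the blended return concrete, here is a minimal sketch that computes a truncated Q^λ target from a short recorded trajectory and a current Q̂ table. All names are placeholders, and a practical TD(λ) implementation would use eligibility traces rather than storing the trajectory:

    def lambda_return(rewards, states, q_hat, actions, lam=0.5, gamma=0.9):
        # rewards[i] is r_{t+i}; states[i] is s_{t+i}, with one extra final state,
        # so len(states) == len(rewards) + 1.  q_hat maps (state, action) -> estimate.
        T = len(rewards)

        def n_step(n):
            # Q^(n) = r_t + ... + gamma^(n-1) r_{t+n-1} + gamma^n max_a Q-hat(s_{t+n}, a)
            ret = sum((gamma ** i) * rewards[i] for i in range(n))
            return ret + (gamma ** n) * max(q_hat.get((states[n], a), 0.0) for a in actions)

        # Q^lambda = (1 - lam) * [ Q^(1) + lam Q^(2) + lam^2 Q^(3) + ... ],
        # truncated here at the end of the recorded trajectory.
        return (1 - lam) * sum((lam ** (n - 1)) * n_step(n) for n in range(1, T + 1))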
