Reinforcement Learning
• Control learning
• Control policies that choose optimal actions
• Q learning
• Convergence
Updated lecture slides of Machine Learning textbook, Tom M. Mitchell, McGraw Hill, 1997
Control Learning
Consider learning to choose actions, e.g.,
• a robot operating in an environment
• learning to play Backgammon

Such learning agents share several characteristics:
• They should have explicit goals.
• They should be able to observe their state, at least partially.
• They should be able to choose actions that influence their environment.
• They cannot only exploit what they already know: they must also explore to make better action selections in the future.
One Example: TD-Gammon
[Tesauro, 1995]
Learn to play Backgammon
Immediate reward:
• +100 if win
• −100 if lose
• 0 for all other states
Trained by playing 1.5 million games against
itself
Now approximately equal to best human player
Reinforcement Learning Problem
[Figure: the agent-environment interaction loop. At each step the agent observes the current state and reward from the environment and emits an action: states s_0, s_1, s_2, . . ., actions a_0, a_1, a_2, . . ., rewards r_0, r_1, r_2, . . .]

Goal: learn to choose actions that maximize the discounted reward
r_0 + γ r_1 + γ² r_2 + ⋯ , where 0 ≤ γ < 1
Markov Decision Processes
Assume
• finite set of states S
• set of actions A
• at each discrete time step t, the agent observes state s_t ∈ S and chooses action a_t ∈ A
• it then receives immediate reward r_t
• and the state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – i.e., r_t and s_{t+1} depend only on the current state and action
  – the functions δ and r may be nondeterministic
  – the functions δ and r are not necessarily known to the agent
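To make the formalism concrete, here is a minimal Python sketch of a deterministic MDP; the class name GridMDP and the dictionary-based encodings of δ and r are illustrative assumptions, not part of the slides:

    class GridMDP:
        """A minimal deterministic MDP: delta gives the next state, reward the payoff."""

        def __init__(self, delta, reward, actions):
            self.delta = delta      # dict: (state, action) -> next state
            self.reward = reward    # dict: (state, action) -> immediate reward
            self.actions = actions  # the finite action set A

        def step(self, s, a):
            # Markov assumption: s' and r depend only on the current (s, a)
            return self.delta[(s, a)], self.reward[(s, a)]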
Agent’s Learning Task

Execute actions in the environment, observe the results, and
• learn an action policy π : S → A that maximizes
  E[r_t + γ r_{t+1} + γ² r_{t+2} + ⋯]
  from any starting state in S
• here 0 ≤ γ < 1 is the discount factor for future rewards

Note two things that are new here:
• the target function is π : S → A
• but we have no training examples of the form ⟨s, a⟩; training examples are of the form ⟨⟨s, a⟩, r⟩
Value Function
To begin, consider deterministic worlds. For each possible policy π the agent might adopt, we can define an evaluation function over states:

V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ ≡ Σ_{i=0}^{∞} γ^i r_{t+i}

where r_t, r_{t+1}, . . . are generated by following policy π starting at state s.

Restated, the task is to learn the optimal policy π*:

π* ≡ argmax_π V^π(s), (∀s)
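As a quick numerical check of this definition, a short sketch (the helper name discounted_return is ours):

    def discounted_return(rewards, gamma=0.9):
        """Sum of gamma^i * r_{t+i} over a finite reward sequence."""
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    # A policy that reaches a +100 reward on its third step, with gamma = 0.9:
    print(round(discounted_return([0, 0, 100]), 6))  # 81.0, matching the grid example below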
[Figure: a simple deterministic grid world with absorbing goal state G, γ = 0.9. One panel shows the immediate rewards r(s, a): 0 for every move except those entering G, which yield 100. The other panels show the resulting Q(s, a) values (100, 90, 81, 72 on the arrows), the V*(s) values (100, 90, 81), and one optimal policy.]
What to Learn

We might try to have the agent learn the evaluation function V^{π*}, which we write as V*. It could then do a lookahead search to choose the best action from any state s, because

π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]

A problem: this works well if the agent knows δ : S × A → S and r : S × A → ℝ. But when it doesn’t, it can’t choose actions this way.
Q Function

Define a new function very similar to V*:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

If the agent learns Q, it can choose optimal actions even without knowing δ:

π*(s) = argmax_a Q(s, a)

Q is the evaluation function the agent will learn.
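This is the key practical point: once Q is learned, acting optimally is a table lookup that needs no model of δ or r. A sketch (the dict-based q_table and the helper name are assumptions):

    def greedy_action(q_table, s, actions):
        """pi*(s) = argmax_a Q(s, a), read directly off the learned table."""
        return max(actions, key=lambda a: q_table.get((s, a), 0.0))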
Training Rule to Learn Q

Note that Q and V* are closely related:

V*(s) = max_{a′} Q(s, a′)

which lets us write Q recursively as

Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
            = r(s_t, a_t) + γ max_{a′} Q(s_{t+1}, a′)

Let Q̂ denote the learner’s current approximation to Q. Consider the training rule

Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)

where s′ is the state resulting from applying action a in state s.
Q Learning for Deterministic Worlds

For each s, a initialize the table entry Q̂(s, a) ← 0.
Observe the current state s.
Do forever:
• select an action a and execute it
• receive immediate reward r
• observe the new state s′
• update the table entry for Q̂(s, a):
  Q̂(s, a) ← r + γ max_{a′} Q̂(s′, a′)
• s ← s′
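A minimal sketch of this algorithm in Python, reusing the GridMDP interface sketched earlier; episodic restarts from a fixed start state and purely random exploration are our simplifications:

    import random

    def q_learning(mdp, start, goal, gamma=0.9, episodes=1000):
        """Tabular Q-learning for a deterministic world."""
        q = {}  # the Q-hat table; missing entries are treated as 0
        for _ in range(episodes):
            s = start
            while s != goal:
                a = random.choice(mdp.actions)      # select an action a and execute it
                s_next, r = mdp.step(s, a)          # receive reward r, observe new state s'
                best_next = max(q.get((s_next, a2), 0.0) for a2 in mdp.actions)
                q[(s, a)] = r + gamma * best_next   # Q-hat(s, a) <- r + gamma * max_a' Q-hat(s', a')
                s = s_next                          # s <- s'
        return q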
Updating Q̂
[Figure: two snapshots of the grid world, before and after the agent moves right out of state s₁. The arrows leaving the successor state s₂ carry Q̂ values 63, 81, and 100; the value on the a_right arrow rises from 72 to 90.]

Q̂(s₁, a_right) ← r + γ max_{a′} Q̂(s₂, a′)
             = 0 + 0.9 max{63, 81, 100}
             = 90

Notice that if rewards are non-negative, the Q̂ values only increase during training and stay bounded by their targets: (∀ s, a, n) Q̂_{n+1}(s, a) ≥ Q̂_n(s, a) and 0 ≤ Q̂_n(s, a) ≤ Q(s, a).
Q̂ converges to Q. Consider the case of a deterministic world where each ⟨s, a⟩ is visited infinitely often.

Proof: Define a full interval to be an interval during which each ⟨s, a⟩ is visited. We show that during each full interval the largest error in the Q̂ table is reduced by a factor of γ.

Let Q̂_n be the table after n updates, and ∆_n be the maximum error in Q̂_n; that is,

∆_n = max_{s,a} |Q̂_n(s, a) − Q(s, a)|
For any table entry Q̂_n(s, a) that is updated on iteration n + 1, the error in the revised estimate Q̂_{n+1}(s, a) is

|Q̂_{n+1}(s, a) − Q(s, a)| = |(r + γ max_{a′} Q̂_n(s′, a′)) − (r + γ max_{a′} Q(s′, a′))|
                         = γ |max_{a′} Q̂_n(s′, a′) − max_{a′} Q(s′, a′)|
                         ≤ γ max_{a′} |Q̂_n(s′, a′) − Q(s′, a′)|
                         ≤ γ ∆_n

using the general fact that |max_a f₁(a) − max_a f₂(a)| ≤ max_a |f₁(a) − f₂(a)|. Hence |Q̂_{n+1}(s, a) − Q(s, a)| ≤ γ ∆_n.
Nondeterministic Case
What if the reward and the next state are nondeterministic? We redefine V and Q by taking expected values:

V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ⋯] = E[ Σ_{i=0}^{∞} γ^i r_{t+i} ]

Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
Nondeterministic Case
Q learning generalizes to nondeterministic worlds if we alter the training rule to

Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a′} Q̂_{n−1}(s′, a′)]

where

α_n = 1 / (1 + visits_n(s, a))

and visits_n(s, a) is the total number of times the pair ⟨s, a⟩ has been visited up to and including the n-th iteration. Convergence of Q̂ to Q can still be proved [Watkins and Dayan, 1992].
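A sketch of this decaying-α update as a single Python function (dict-based tables, consistent with the earlier sketches; the function name is ours):

    def nondet_update(q, visits, s, a, r, s_next, actions, gamma=0.9):
        """One nondeterministic Q-learning update, alpha_n = 1 / (1 + visits_n(s, a))."""
        visits[(s, a)] = visits.get((s, a), 0) + 1   # visits_n(s, a) including this visit
        alpha = 1.0 / (1.0 + visits[(s, a)])
        best_next = max((q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
        # blend the old estimate with the new sampled target
        q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * (r + gamma * best_next)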
Temporal Difference Learning

Q learning: reduce the discrepancy between successive Q estimates.

One-step time difference:

Q⁽¹⁾(s_t, a_t) ≡ r_t + γ max_a Q̂(s_{t+1}, a)

Why not two steps?

Q⁽²⁾(s_t, a_t) ≡ r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2}, a)

Or n?

Q⁽ⁿ⁾(s_t, a_t) ≡ r_t + γ r_{t+1} + ⋯ + γ^{n−1} r_{t+n−1} + γⁿ max_a Q̂(s_{t+n}, a)

Blend all of these:

Q^λ(s_t, a_t) ≡ (1 − λ) [ Q⁽¹⁾(s_t, a_t) + λ Q⁽²⁾(s_t, a_t) + λ² Q⁽³⁾(s_t, a_t) + ⋯ ]
Temporal Difference Learning
Equivalent expression:
Q^λ(s_t, a_t) = r_t + γ [ (1 − λ) max_a Q̂(s_{t+1}, a) + λ Q^λ(s_{t+1}, a_{t+1}) ]
The TD(λ) algorithm uses the above training rule.
• It sometimes converges faster than Q learning.
• It converges for learning V* for any 0 ≤ λ ≤ 1 (Dayan, 1992).
• Tesauro’s TD-Gammon uses this algorithm.
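To illustrate the recursion, here is a sketch that computes Q^λ backup targets backward over one recorded episode; the episode format, the zero base case at the terminal state, and the function name are our assumptions, not Tesauro’s implementation:

    def q_lambda_targets(episode, q, actions, gamma=0.9, lam=0.7):
        """episode: list of (s, a, r, s_next) tuples, ending at a terminal s_next."""
        targets = [0.0] * len(episode)
        next_target = 0.0  # the Q^lambda term beyond the terminal state is taken as 0
        for t in reversed(range(len(episode))):
            s, a, r, s_next = episode[t]
            best_next = max((q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
            # r_t + gamma * [(1 - lam) * max_a Q-hat(s_{t+1}, a) + lam * Q^lambda(s_{t+1}, a_{t+1})]
            targets[t] = r + gamma * ((1 - lam) * best_next + lam * next_target)
            next_target = targets[t]
        return targets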