Tuomas Sandholm
Carnegie Mellon University
Computer Science Department
Reinforcement Learning
(Ch. 17.1-17.3, Ch. 20)
[Figure: the learner may be passive or active.]
Sequential decision problems
Approaches:
1. Learn values of states (or state histories) & try to maximize the utility of their outcomes.
   • Needs a model of the environment: what operators are available & what states they lead to
2. Learn values of state-action pairs
   • Does not require a model of the environment (except legal moves)
   • Cannot look ahead
Reinforcement Learning …
Deterministic transitions
Stochastic transitions
$M_{ij}^a$ is the probability of reaching state j when taking action a in state i
[Figure: a 4×3 grid world. A simple environment that presents the agent with a sequential decision problem: the agent starts at (1,1); terminal states give +1 at (4,3) and −1 at (4,2); each move costs 0.04.]
(Temporal) credit assignment problem; sparse reinforcement problem.
Environment:
• Observable (accessible): the percept identifies the state
• Partially observable
Markov property: transition probabilities depend on the state only, not on the path to the state.
Markov decision problem (MDP).
Partially observable MDP (POMDP): the percepts do not carry enough information to identify the transition probabilities.
Partial observability in previous
figure
Utility function on histories
$U_h([s_0, s_1, \ldots, s_n]) = R_0 + U_h([s_1, \ldots, s_n])$
$U_{t+1}(i) \leftarrow R(i) + \sum_j M_{ij}^{Policy(i)} U_t(j)$
and using the current utility estimates from policy iteration as the initial values. (Here Policy(i) is the action suggested by the policy in state i.)
While this can work well in some environments, it will often take a very long time to converge in the early stages of policy iteration. This is because the policy is still more or less random, so many steps can be required to reach terminal states.
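A minimal sketch of this iterative scheme on an assumed 3-state chain (the rewards, transition matrix, and 100-sweep cutoff are illustrative, not taken from the figure):

```python
import numpy as np

# Iterative policy evaluation: repeat U_{t+1} = R + M U_t until convergence.
# Illustrative 3-state MDP: state 2 is terminal, so its row of M is zero
# and its utility stays fixed at R(2).
R = np.array([-0.04, -0.04, 1.0])   # reward received in each state
M = np.array([                      # M[i, j] = P(j | i, Policy(i))
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.0, 0.0],                # terminal: no further transitions
])

U = np.zeros(3)                     # initial utility estimates
for _ in range(100):                # sweep until (approximately) converged
    U = R + M @ U
print(U)
```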
Value Determination Algorithm
The second approach is to solve for the utilities directly. Given a fixed policy
P, the utilities of states obey a set of equations of the form:
$U(i) = R(i) + \sum_j M_{ij}^{P(i)} U(j)$
For example, suppose P is the policy shown in Figure 17.2(a). Then using the
transition model M, we can construct the following set of equations:
U(1,1) = 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1)
U(1,2) = 0.8 U(1,3) + 0.2 U(1,2)
and so on. This gives a set of 11 linear equations in 11 unknowns, which can
be solved by linear algebra methods such as Gaussian elimination. For small
state spaces, value determination using exact solution methods is often the
most efficient approach.
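A minimal sketch of the direct solution for the same illustrative 3-state MDP as above, using numpy's linear solver in place of hand-rolled Gaussian elimination:

```python
import numpy as np

# Value determination by direct solution: U = R + M U is linear in U,
# so solve (I - M) U = R exactly (np.linalg.solve uses an LU-style
# factorization, i.e. Gaussian elimination under the hood).
R = np.array([-0.04, -0.04, 1.0])
M = np.array([
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.0, 0.0, 0.0],
])

U = np.linalg.solve(np.eye(3) - M, R)
print(U)   # agrees with the iterative estimates above
```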
Policy iteration converges to the optimal policy, and the policy improves monotonically for all states.
The asynchronous version converges to the optimal policy if all states are visited infinitely often.
Discounting
Infinite horizon ⇒ infinite U ⇒ policy & value iteration fail to converge.
Solution: discounting
$U(H) = \sum_i \gamma^i R_i$
Finite if $0 \le \gamma < 1$.
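A quick numeric check of this, with assumed values γ = 0.9 and constant reward 1:

```python
# Discounting keeps the utility of an infinite reward stream finite:
# with constant reward R = 1 and gamma = 0.9, the truncated sum
# approaches R / (1 - gamma) = 10.
gamma, R = 0.9, 1.0
U = sum(gamma**i * R for i in range(1000))   # truncated infinite horizon
print(U)   # ~10.0
```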
Reinforcement Learning II:
Reinforcement learning (RL)
algorithms
(we will focus solely on observable
environments in this lecture)
Tuomas Sandholm
Carnegie Mellon University
Computer Science Department
Passive learning
[Figure: P = 0.1; +1 terminal state.]
An example where LMS does poorly: a new state is reached for the first time, and the agent then follows the path marked by the dashed lines, reaching a terminal state with reward +1.
Adaptive DP (ADP)
Idea: use the constraints (state transition probabilities) between
states to speed learning.
Solve
$U(i) = R(i) + \sum_j M_{ij} U(j)$
using dynamic programming (= value determination).
No maximization over actions, because the agent is passive, unlike in value iteration.
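A minimal sketch of this ADP scheme, assuming experience arrives as (state, reward, next_state) triples, every non-terminal state eventually reaches a terminal one, and terminal states have no outgoing transitions; all names are illustrative:

```python
import numpy as np
from collections import defaultdict

# Passive ADP: estimate the transition model M from observed transition
# counts, then re-solve the value-determination equations with it.
def adp_utilities(experience, n_states):
    counts = defaultdict(lambda: np.zeros(n_states))
    R = np.zeros(n_states)
    for i, r, j in experience:
        counts[i][j] += 1.0            # tally observed transitions i -> j
        R[i] = r                       # observed reward in state i
    M = np.zeros((n_states, n_states))
    for i, row in counts.items():
        M[i] = row / row.sum()         # maximum-likelihood estimate of M[i, :]
    # value determination; (I - M) must be nonsingular (absorbing chain)
    return np.linalg.solve(np.eye(n_states) - M, R)
```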
Temporal difference (TD) learning
$U(i) \leftarrow U(i) + \alpha \left[ R(i) + U(j) - U(i) \right]$
Theorem: the average value of U(i) converges to the correct value.
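A minimal sketch of this single-transition update; U can be any indexable map from states to utilities:

```python
# TD update: after observing one transition i -> j with reward R(i),
# move U(i) a step of size alpha toward the sampled target R(i) + U(j).
def td_update(U, i, r, j, alpha=0.1):
    U[i] += alpha * (r + U[j] - U[i])
```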
TD(λ)
Idea: update from the whole epoch, not just on one state transition.
$U(i_k) \leftarrow U(i_k) + \alpha \sum_{m \ge k} \lambda^{m-k} \left[ R(i_m) + U(i_{m+1}) - U(i_m) \right]$
Special cases:
λ = 1: LMS
λ = 0: TD
An intermediate choice of λ (between 0 and 1) is best.
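A minimal sketch of TD(λ) in its incremental eligibility-trace form, an equivalent way to apply the whole-epoch sum above; the epoch format and constants are assumptions:

```python
# TD(lambda) via eligibility traces: each new temporal-difference error
# is also credited to earlier states, with weight decaying by lambda.
def td_lambda_epoch(U, epoch, alpha=0.1, lam=0.7):
    e = {}                                # eligibility trace per state
    for i, r, j in epoch:                 # (state, reward, next state) triples
        delta = r + U[j] - U[i]           # temporal-difference error
        e[i] = e.get(i, 0.0) + 1.0        # mark state i as just visited
        for s in e:
            U[s] += alpha * e[s] * delta  # lam = 0 recovers TD; lam = 1 is LMS-like
            e[s] *= lam                   # decay all traces
```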
Interplay with …
Convergence of TD()
$\sum_t \alpha_i(t) = \infty, \qquad \sum_t \alpha_i^2(t) < \infty$
Tradeoff:
• Model-based (learn M)
• Model-free (e.g. Q-learning)
Which is better? Open question.
Q-learning
$Q(a, i)$: the value of taking action a in state i.
$U(i) = \max_a Q(a, i)$
$Q(a, i) = R(i) + \sum_j M_{ij}^a \max_{a'} Q(a', j)$
A direct approach (ADP) would require learning a model $M_{ij}^a$.
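A minimal sketch of the tabular Q-learning update, assuming experience arrives as (action, state, reward, next state) samples; the names are illustrative:

```python
from collections import defaultdict

# Tabular Q-learning: no model M is learned. The sampled successor j
# stands in for the expectation over M_ij^a, and the max over a'
# bootstraps from the current estimates.
Q = defaultdict(float)                    # Q[(action, state)], default 0

def q_update(a, i, r, j, actions, alpha=0.1):
    best_next = max(Q[(a2, j)] for a2 in actions)
    Q[(a, i)] += alpha * (r + best_next - Q[(a, i)])
```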
$P_a = \dfrac{e^{\left(\sum_j M_{ij}^a U(j)\right)/T}}{\sum_{a'} e^{\left(\sum_j M_{ij}^{a'} U(j)\right)/T}}$
(softmax action selection with temperature T)
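A minimal sketch of this softmax choice over precomputed action values (the values argument stands in for the bracketed expected utilities):

```python
import math
import random

# Softmax (Boltzmann) action selection with temperature T: higher-valued
# actions are chosen more often; high T is nearly random, T -> 0 is greedy.
def boltzmann_choice(values, T=1.0):
    weights = [math.exp(v / T) for v in values]
    return random.choices(range(len(values)), weights=weights)[0]
```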
Reinforcement Learning III:
Advanced topics
Tuomas Sandholm
Carnegie Mellon University
Computer Science Department
Generalization
With a table-lookup representation (of U, M, R, Q): up to ~10,000 states or more.
Chess ~ 10^120 states, Backgammon ~ 10^50 states.
Industrial problems.
Hard to represent & visit all states!
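One remedy is to replace the table with a function approximator so that value estimates generalize across states. A minimal sketch of Q-learning with an assumed linear approximator, where phi is a designer-supplied feature map (illustrative, not from the slides):

```python
import numpy as np

# Q-learning with a linear function approximator: Q(a, i) = w[a] . phi(i)
# instead of a table, so states sharing features share value estimates.
def q_approx_update(w, phi, a, i, r, j, n_actions, alpha=0.01):
    q_next = max(w[b] @ phi(j) for b in range(n_actions))
    delta = r + q_next - w[a] @ phi(i)    # TD error under the approximation
    w[a] += alpha * delta * phi(i)        # gradient-style weight update
```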
Convergence of Q-learning with function approximation:
• averagers: converges to Q*
• state aggregation (i, j in same class): converges to Q, with error in Q ≤ max_{i, j in same class} |Q^d(i) − Q^d(j)| / (1 − γ)
• linear, off-policy: diverges
• linear, on-policy: chatters, bound unknown
Applications of RL
• Checkers [Samuel 59]
• TD-Gammon [Tesauro 92]
• World’s best downpeak elevator dispatcher [Crites et al. ~95]
• Inventory management [Bertsekas et al ~95]
– 10-15% better than industry standard
• Dynamic channel assignment [Singh & Bertsekas, Nie & Haykin ~95]
– Outperforms best heuristics in the literature
• Cart-pole [Michie & Chambers 68-] with bang-bang control
• Robotic manipulation [Grupen et al. 93-]
• Path planning
• Robot docking [Lin 93]
• Parking
• Football
• Tetris
• Multiagent RL [Tan 93, Sandholm & Crites 95, Sen 94-, Carmel & Markovitch 95-, lots of work since]
• Combinatorial optimization: maintenance & repair
– Control of reasoning [Zhang & Dietterich IJCAI-95]
TD-Gammon
• TD(λ) & a backpropagation neural net
• Start with a random net
• Learned by playing 1.5 million games against itself
• As good as the best humans in the world
[Figure: performance against Gammontool as a function of the number of hidden units, comparing TD-Gammon (self-play) with Neurogammon (supervised learning on 15,000 examples).]