Professional Documents
Culture Documents
Csaba Szepesvári
MTA SZTAKI
William D. Smart
WUSTL
Summary
Introduction
Algorithm
Convergence results
Extensions
Some experimental results
Conclusions, future work
Title 1
Motivation, problem setup
-Q-learning
-Continuous space Q-
learning
Problem Setup
Title 2
ICML-2004, July 03 2004
3/29/2006
State space -
State space -
Title 3
ICML-2004, July 03 2004
3/29/2006
State space -
?
State space -
Title 4
ICML-2004, July 03 2004
3/29/2006
NO!
State space -
Difference to Regression
Title 5
ICML-2004, July 03 2004
3/29/2006
Goals
Avoid non-convergence
– Changes:
Local
Incremental
Algorithm
Title 6
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
Multi-component updates
Title 7
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
iFAPP-Q Learning
State space -
Title 8
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
iFAPP-Q Learning
State space -
Title 9
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
iFAPP-Q Learning
State space -
Title 10
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
iFAPP-Q Learning
Title 11
ICML-2004, July 03 2004
3/29/2006
Equations?
Notation
Title 12
ICML-2004, July 03 2004
3/29/2006
Gordon, 1995
Averagers
..1st order
B-splines
Title 13
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
vs. Q-Learning
aFAPP-Q Learning
vs. Q-Learning
Title 14
ICML-2004, July 03 2004
3/29/2006
aFAPP-Q Learning/Details
Theory
-FAPPs as operators
-Theorem
-Assumptions
-Proof outline
Title 15
ICML-2004, July 03 2004
3/29/2006
Non-expansions
Title 16
ICML-2004, July 03 2004
3/29/2006
Convergence Theorem
– A finite
– X is a compact subset of a
separable metric space (e.g. an
n-dimensional Euclidean space)
– r is continuous
Title 17
ICML-2004, July 03 2004
3/29/2006
Actions:
~ “positive recurrent”
States:
Rewards:
Assumption A3:
Conditions on the Influence Function s
- bounded, measurable
Positivity condition:
Title 18
ICML-2004, July 03 2004
3/29/2006
Counting visits:
Learning rates:
Constraint on
Definition of
Truncation:
Normalization:
Title 19
ICML-2004, July 03 2004
3/29/2006
Proof (outline)
allows
..componont-wise analysis
..use of standard stochastic approximation
How?
(A4)
and
Title 20
ICML-2004, July 03 2004
3/29/2006
Szepesvari&Littman, 1999
Proof: Relaxation Processes
Proof
Title 21
ICML-2004, July 03 2004
3/29/2006
Proof
Fixed Point
..since F is a non-expansion
Title 22
ICML-2004, July 03 2004
3/29/2006
Q.e.d
How to Choose F?
Averagers:
Title 23
ICML-2004, July 03 2004
3/29/2006
Title 24
Interpolation-based Q-learning
Csaba Szepesvári
MTA SZTAKI
William D. Smart
WUSTL
Averagers-based Q-learning
(and other convergent value
function approximation
algorithms)
Csaba Szepesvári
MTA SZTAKI
William D. Smart
WUSTL
Title 25
Generalizations
-Other learning
algorithms
-Increasing accuracy
-Approximating Q*
Generalizations
ASSUME
“interpolation assumption’
Title 26
ICML-2004, July 03 2004
3/29/2006
Generalizations
?
Generalization to “learning”:
-learning rates
-observed data
Generalizations
?
Generalization to “learning”:
Title 27
ICML-2004, July 03 2004
3/29/2006
“Meta Theorem”
Let T be a contraction.
If F is an interpolative non-expansion
and
defines a “relaxation process” when its second
parameter is fixed approximating w.p.1, and if
then
w.p.1
Applications
FAPP +
– Real-time dynamic programming
(speeding up dynamic programming by not updating
irrelevant parts of the state-space)
– Value iteration with Monte-Carlo updates
(when exact updates are too expansive)
– Other criteria (e.g. Markov games, Risk-sensitive
MDPs, ..)
Tsitsiklis&Van Roy, 1997
Gordon, 1995
Title 28
ICML-2004, July 03 2004
3/29/2006
Extensions
Title 29
ICML-2004, July 03 2004
3/29/2006
Algorithm
While (true)
{
Add basis points dens(St)>h0
Run IFAPP-Q Learning for time Tt
}
Convergence
Family of FAPPs:
Non-destructive refinement
Expansion finishes in finite time with probability
one
=> (random)
Title 30
ICML-2004, July 03 2004
3/29/2006
Extension #2
While (true)
{
Add data points
Shrink influence function s
Estimate density underlying µX with κt
and
run modified iFAPP-Q for time Tt
}
Output learned Q-function
Use
Title 31
Some Experimental Results
Comparison algorithms
Title 32
ICML-2004, July 03 2004
3/29/2006
Goal
p p
v v
left right
p p
v v
left right
Title 33
ICML-2004, July 03 2004
3/29/2006
go right
go left
p
v
Performance comparison
error
flops
Title 34
ICML-2004, July 03 2004
3/29/2006
error
flops
flops
Title 35
Conclusions
Conclusions
Title 36
ICML-2004, July 03 2004
3/29/2006
-parameter vector
-”FAPP”
- basis points
Title 37