Interpolation-Based Q-Learning

Interpolation-based Q-learning
Csaba Szepesvári
MTA SZTAKI
William D. Smart
WUSTL
1 3/29/2006 ICML-2004, July 03 2004
ICML-2004, July 03 2004

3/29/2006
Summary
Introduction
Algorithm
Convergence results
Extensions
Some experimental results
Conclusions, future work
Title 1
Motivation, problem setup
-Q-learning
-Continuous space Q-
learning
3 3/29/2006 ICML-2004, July 03 2004
ICML-2004, July 03 2004

3/29/2006
Problem Setup
Markovian Decision Problems

X – state space
A – action space
p – transition kernel (dynamics)
r – immediate rewards
Continuous State Space
Unknown Dynamics
Title 2
ICML-2004, July 03 2004
3/29/2006
Q-learning in Discrete State Spaces
State space -
ICML-2004, July 03 2004

3/29/2006
State space -
Title 3
ICML-2004, July 03 2004
3/29/2006
State space -
ICML-2004, July 03 2004

3/29/2006
Continuous State Spaces
?
State space -
Title 4
ICML-2004, July 03 2004
3/29/2006
Continuous State Spaces
NO!
State space -
ICML-2004, July 03 2004

3/29/2006
Difference to Regression
Title 5
ICML-2004, July 03 2004
3/29/2006
Goals
Avoid non-convergence
– Changes:
Local
Incremental
O(1) memory requirements

O(1) time update
Algorithm
12 3/29/2006 ICML-2004, July 03 2004
Title 6
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
ICML-2004, July 03 2004

3/29/2006
Multi-component updates
•Ribeiro & Szepesvari, 1996

“spreading”
•Szepesvari & Littman, 1999
“multi-state updates”
Title 7
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
ICML-2004, July 03 2004

3/29/2006
iFAPP-Q Learning
State space -
Title 8
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
ICML-2004, July 03 2004

3/29/2006
iFAPP-Q Learning
State space -
Title 9
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
ICML-2004, July 03 2004

3/29/2006
iFAPP-Q Learning
State space -
Title 10
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
State space -
ICML-2004, July 03 2004

3/29/2006
iFAPP-Q Learning
Title 11
ICML-2004, July 03 2004
3/29/2006
Equations?
ICML-2004, July 03 2004

3/29/2006
Notation
Title 12
ICML-2004, July 03 2004
3/29/2006
Gordon, 1995
Averagers
..1st order
B-splines
ICML-2004, July 03 2004

3/29/2006
aFAPP-Q Learning Equation
Title 13
ICML-2004, July 03 2004
3/29/2006
iFAPP-Q Learning
vs. Q-Learning
ICML-2004, July 03 2004

3/29/2006
aFAPP-Q Learning
vs. Q-Learning
Title 14
ICML-2004, July 03 2004
3/29/2006
aFAPP-Q Learning/Details
Theory
-FAPPs as operators
-Theorem
-Assumptions
-Proof outline
30 3/29/2006 ICML-2004, July 03 2004
Title 15
ICML-2004, July 03 2004
3/29/2006
Another View of Function

Approximation
Parameter space Function space
ICML-2004, July 03 2004

3/29/2006
Non-expansions
Parameter space Function space
Use sup-norm in both spaces!
Title 16
ICML-2004, July 03 2004
3/29/2006
Convergence Theorem
Under Assumptions A1-A4, and if F is a

non-expansion then
converges to some vector w.p.1, such that
is the fixed point of the operator

where and
is defined by
ICML-2004, July 03 2004

3/29/2006
Assumption A1: MDP
(X,A,p,r,γ) is a discounted MDP
– A finite
– X is a compact subset of a
separable metric space (e.g. an
n-dimensional Euclidean space)
– r is continuous
Title 17
ICML-2004, July 03 2004
3/29/2006
Assumption A2: Sampling
Actions:
~ “positive recurrent”
States:
is positive Harris, aperiodic
Rewards:
ICML-2004, July 03 2004

3/29/2006
Assumption A3:
Conditions on the Influence Function s
- bounded, measurable
Positivity condition:
- unique invariant measure underlying
Title 18
ICML-2004, July 03 2004
3/29/2006
Assumption A4: Learning Rates
Counting visits:
Learning rates:
Constraint on
ICML-2004, July 03 2004

3/29/2006
Definition of
Truncation:
Normalization:
Title 19
ICML-2004, July 03 2004
3/29/2006
Proof (outline)
allows
..componont-wise analysis
..use of standard stochastic approximation
How?
ICML-2004, July 03 2004

3/29/2006
Proof: Stochastic Approximation
• positive Harris, aperiodic (A2)
Time averages -> expected values
(A4)
and
Title 20
ICML-2004, July 03 2004
3/29/2006
Szepesvari&Littman, 1999
Proof: Relaxation Processes
..holds for all F
ICML-2004, July 03 2004

3/29/2006
Proof
Title 21
ICML-2004, July 03 2004
3/29/2006
Proof
ICML-2004, July 03 2004

3/29/2006
Fixed Point
..since F is a non-expansion
Title 22
ICML-2004, July 03 2004
3/29/2006
Proof: Finishing Steps
Q.e.d
ICML-2004, July 03 2004

3/29/2006
How to Choose F?
Averagers:
Averagers are non-expansions!
Title 23
ICML-2004, July 03 2004
3/29/2006
What Does this Theorem Tell Us?
ICML-2004, July 03 2004

3/29/2006
Theorem – Once Again
Under Assumptions A1-A4, and if F is a

non-expansion then
converges to some vector w.p.1, such that
is the fixed point of the operator

where and
is defined by
Title 24
Interpolation-based Q-learning
Csaba Szepesvári
MTA SZTAKI
William D. Smart
WUSTL
49 3/29/2006 ICML-2004, July 03 2004
Averagers-based Q-learning
(and other convergent value
function approximation
algorithms)
Csaba Szepesvári
MTA SZTAKI
William D. Smart
WUSTL
50 3/29/2006 ICML-2004, July 03 2004
Title 25
Generalizations
-Other learning
algorithms
-Increasing accuracy
-Approximating Q*
51 3/29/2006 ICML-2004, July 03 2004
ICML-2004, July 03 2004

3/29/2006
Generalizations
ASSUME
“interpolation assumption’
Title 26
ICML-2004, July 03 2004
3/29/2006
Generalizations
Why not start with
?
Generalization to “learning”:
-learning rates
-observed data
ICML-2004, July 03 2004

3/29/2006
Generalizations
Why not start with
?
Generalization to “learning”:
Title 27
ICML-2004, July 03 2004
3/29/2006
“Meta Theorem”
Let T be a contraction.
If F is an interpolative non-expansion
and
defines a “relaxation process” when its second
parameter is fixed approximating w.p.1, and if
then
w.p.1
ICML-2004, July 03 2004

3/29/2006
Applications
FAPP +
– Real-time dynamic programming
(speeding up dynamic programming by not updating
irrelevant parts of the state-space)
– Value iteration with Monte-Carlo updates
(when exact updates are too expansive)
– Other criteria (e.g. Markov games, Risk-sensitive
MDPs, ..)
Tsitsiklis&Van Roy, 1997
Gordon, 1995
Title 28
ICML-2004, July 03 2004
3/29/2006
Extensions
ICML-2004, July 03 2004

3/29/2006
Increasing the Density of Basis

Points
Title 29
ICML-2004, July 03 2004
3/29/2006
Algorithm
While (true)
{
Add basis points dens(St)>h0
Run IFAPP-Q Learning for time Tt
}
ICML-2004, July 03 2004

3/29/2006
Convergence
Family of FAPPs:
Non-destructive refinement
Expansion finishes in finite time with probability
one
=> (random)
Title 30
ICML-2004, July 03 2004
3/29/2006
Extension #2
While (true)
{
Add data points
Shrink influence function s
Estimate density underlying µX with κt
and
run modified iFAPP-Q for time Tt
}
Output learned Q-function
ICML-2004, July 03 2004

3/29/2006
Modified iFAPP-Q Learning
Use
Title 31
Some Experimental Results
63 3/29/2006 ICML-2004, July 03 2004
ICML-2004, July 03 2004

3/29/2006
Comparison algorithms
Soft state aggregation [Singh, Jaakkola & Jordan 95]

Kernel-based RL [Ormoneit & Sen 02]
Domain: car on the hill
Title 32
ICML-2004, July 03 2004
3/29/2006
State-action visit counts
Goal
p p
v v
left right
ICML-2004, July 03 2004

3/29/2006
Q-values for “optimal” policy
p p
v v
left right
Title 33
ICML-2004, July 03 2004
3/29/2006
Q-values for “optimal” policy
go right
go left
p
v
ICML-2004, July 03 2004

3/29/2006
Performance comparison
error
flops
Title 34
ICML-2004, July 03 2004
3/29/2006
Varying the exploration policy
error
flops
ICML-2004, July 03 2004

3/29/2006
Varying the density threshold

error
flops
Title 35
Conclusions
71 3/29/2006 ICML-2004, July 03 2004
ICML-2004, July 03 2004

3/29/2006
Conclusions
iFAPP-Q: Extends Q-learning to continuous

spaces
Need to update multiple components of the
parameter vector => influence function s
Works for non-expansions
Extensions are possible
Changing the policy?
When to add basis points?
Other FAPPs (LWR?)
Title 36
ICML-2004, July 03 2004
3/29/2006
What Makes F “Interpolative”?
-parameter vector
-”FAPP”
- basis points
Title 37

Interpolation-Based Q-Learning

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Interpolation-Based Q-Learning

Uploaded by

Copyright:

Available Formats

Interpolation-based Q-learning

1 3/29/2006 ICML-2004, July 03 2004

ICML-2004, July 03 2004

3 3/29/2006 ICML-2004, July 03 2004

ICML-2004, July 03 2004

Markovian Decision Problems

Q-learning in Discrete State Spaces

ICML-2004, July 03 2004

Q-learning in Discrete State Spaces

Q-learning in Discrete State Spaces

ICML-2004, July 03 2004

Continuous State Spaces

Continuous State Spaces

ICML-2004, July 03 2004

O(1) memory requirements

12 3/29/2006 ICML-2004, July 03 2004

ICML-2004, July 03 2004

•Ribeiro & Szepesvari, 1996

ICML-2004, July 03 2004

ICML-2004, July 03 2004

ICML-2004, July 03 2004

ICML-2004, July 03 2004

ICML-2004, July 03 2004

ICML-2004, July 03 2004

aFAPP-Q Learning Equation

ICML-2004, July 03 2004

30 3/29/2006 ICML-2004, July 03 2004

Another View of Function

Parameter space Function space

ICML-2004, July 03 2004

Parameter space Function space

Use sup-norm in both spaces!

Under Assumptions A1-A4, and if F is a

converges to some vector w.p.1, such that

is the fixed point of the operator

ICML-2004, July 03 2004

Assumption A1: MDP

(X,A,p,r,γ) is a discounted MDP

Assumption A2: Sampling

is positive Harris, aperiodic

ICML-2004, July 03 2004

- unique invariant measure underlying

Assumption A4: Learning Rates

ICML-2004, July 03 2004

ICML-2004, July 03 2004

Proof: Stochastic Approximation

• positive Harris, aperiodic (A2)

Time averages -> expected values

..holds for all F

ICML-2004, July 03 2004

ICML-2004, July 03 2004

Proof: Finishing Steps

ICML-2004, July 03 2004

Averagers are non-expansions!

What Does this Theorem Tell Us?

ICML-2004, July 03 2004

Theorem – Once Again

Under Assumptions A1-A4, and if F is a

converges to some vector w.p.1, such that

is the fixed point of the operator

49 3/29/2006 ICML-2004, July 03 2004

50 3/29/2006 ICML-2004, July 03 2004