You are on page 1of 37

Interpolation-based Q-learning

Csaba Szepesvári
MTA SZTAKI
William D. Smart
WUSTL

1 3/29/2006 ICML-2004, July 03 2004

ICML-2004, July 03 2004


3/29/2006

Summary

 Introduction
 Algorithm
 Convergence results
 Extensions
 Some experimental results
 Conclusions, future work

Title 1
Motivation, problem setup

-Q-learning
-Continuous space Q-
learning

3 3/29/2006 ICML-2004, July 03 2004

ICML-2004, July 03 2004


3/29/2006

Problem Setup

 Markovian Decision Problems


X – state space
A – action space
p – transition kernel (dynamics)
r – immediate rewards
 Continuous State Space
 Unknown Dynamics

Title 2
ICML-2004, July 03 2004
3/29/2006

Q-learning in Discrete State Spaces

State space -

ICML-2004, July 03 2004


3/29/2006

Q-learning in Discrete State Spaces

State space -

Title 3
ICML-2004, July 03 2004
3/29/2006

Q-learning in Discrete State Spaces

State space -

ICML-2004, July 03 2004


3/29/2006

Continuous State Spaces

?
State space -

Title 4
ICML-2004, July 03 2004
3/29/2006

Continuous State Spaces

NO!
State space -

ICML-2004, July 03 2004


3/29/2006

Difference to Regression

Title 5
ICML-2004, July 03 2004
3/29/2006

Goals

 Avoid non-convergence
– Changes:
 Local
 Incremental

 O(1) memory requirements


 O(1) time update

Algorithm

12 3/29/2006 ICML-2004, July 03 2004

Title 6
ICML-2004, July 03 2004
3/29/2006

iFAPP-Q Learning

State space -

ICML-2004, July 03 2004


3/29/2006

Multi-component updates

•Ribeiro & Szepesvari, 1996


“spreading”
•Szepesvari & Littman, 1999
“multi-state updates”

Title 7
ICML-2004, July 03 2004
3/29/2006

iFAPP-Q Learning

State space -

ICML-2004, July 03 2004


3/29/2006

iFAPP-Q Learning

State space -

Title 8
ICML-2004, July 03 2004
3/29/2006

iFAPP-Q Learning

State space -

ICML-2004, July 03 2004


3/29/2006

iFAPP-Q Learning

State space -

Title 9
ICML-2004, July 03 2004
3/29/2006

iFAPP-Q Learning

State space -

ICML-2004, July 03 2004


3/29/2006

iFAPP-Q Learning

State space -

Title 10
ICML-2004, July 03 2004
3/29/2006

iFAPP-Q Learning

State space -

ICML-2004, July 03 2004


3/29/2006

iFAPP-Q Learning

Title 11
ICML-2004, July 03 2004
3/29/2006

Equations?

ICML-2004, July 03 2004


3/29/2006

Notation

Title 12
ICML-2004, July 03 2004
3/29/2006

Gordon, 1995
Averagers

..1st order
B-splines

ICML-2004, July 03 2004


3/29/2006

aFAPP-Q Learning Equation

Title 13
ICML-2004, July 03 2004
3/29/2006

iFAPP-Q Learning
vs. Q-Learning

ICML-2004, July 03 2004


3/29/2006

aFAPP-Q Learning
vs. Q-Learning

Title 14
ICML-2004, July 03 2004
3/29/2006

aFAPP-Q Learning/Details

Theory

-FAPPs as operators
-Theorem
-Assumptions
-Proof outline

30 3/29/2006 ICML-2004, July 03 2004

Title 15
ICML-2004, July 03 2004
3/29/2006

Another View of Function


Approximation

Parameter space Function space

ICML-2004, July 03 2004


3/29/2006

Non-expansions

Parameter space Function space

Use sup-norm in both spaces!

Title 16
ICML-2004, July 03 2004
3/29/2006

Convergence Theorem

Under Assumptions A1-A4, and if F is a


non-expansion then

converges to some vector w.p.1, such that

is the fixed point of the operator


where and
is defined by

ICML-2004, July 03 2004


3/29/2006

Assumption A1: MDP

 (X,A,p,r,γ) is a discounted MDP

– A finite
– X is a compact subset of a
separable metric space (e.g. an
n-dimensional Euclidean space)
– r is continuous

Title 17
ICML-2004, July 03 2004
3/29/2006

Assumption A2: Sampling

Actions:

~ “positive recurrent”

States:

is positive Harris, aperiodic

Rewards:

ICML-2004, July 03 2004


3/29/2006

Assumption A3:
Conditions on the Influence Function s

- bounded, measurable

Positivity condition:

- unique invariant measure underlying

Title 18
ICML-2004, July 03 2004
3/29/2006

Assumption A4: Learning Rates

Counting visits:

Learning rates:

Constraint on

ICML-2004, July 03 2004


3/29/2006

Definition of

Truncation:

Normalization:

Title 19
ICML-2004, July 03 2004
3/29/2006

Proof (outline)

allows
..componont-wise analysis
..use of standard stochastic approximation

How?

ICML-2004, July 03 2004


3/29/2006

Proof: Stochastic Approximation

• positive Harris, aperiodic (A2)

Time averages -> expected values

(A4)

and

Title 20
ICML-2004, July 03 2004
3/29/2006

Szepesvari&Littman, 1999
Proof: Relaxation Processes

..holds for all F

ICML-2004, July 03 2004


3/29/2006

Proof

Title 21
ICML-2004, July 03 2004
3/29/2006

Proof

ICML-2004, July 03 2004


3/29/2006

Fixed Point

..since F is a non-expansion

Title 22
ICML-2004, July 03 2004
3/29/2006

Proof: Finishing Steps

Q.e.d

ICML-2004, July 03 2004


3/29/2006

How to Choose F?

Averagers:

Averagers are non-expansions!

Title 23
ICML-2004, July 03 2004
3/29/2006

What Does this Theorem Tell Us?

ICML-2004, July 03 2004


3/29/2006

Theorem – Once Again

Under Assumptions A1-A4, and if F is a


non-expansion then

converges to some vector w.p.1, such that

is the fixed point of the operator


where and
is defined by

Title 24
Interpolation-based Q-learning

Csaba Szepesvári
MTA SZTAKI
William D. Smart
WUSTL

49 3/29/2006 ICML-2004, July 03 2004

Averagers-based Q-learning
(and other convergent value
function approximation
algorithms)
Csaba Szepesvári
MTA SZTAKI
William D. Smart
WUSTL

50 3/29/2006 ICML-2004, July 03 2004

Title 25
Generalizations

-Other learning
algorithms
-Increasing accuracy
-Approximating Q*

51 3/29/2006 ICML-2004, July 03 2004

ICML-2004, July 03 2004


3/29/2006

Generalizations

ASSUME

“interpolation assumption’

Title 26
ICML-2004, July 03 2004
3/29/2006

Generalizations

Why not start with

?
Generalization to “learning”:

-learning rates
-observed data

ICML-2004, July 03 2004


3/29/2006

Generalizations

Why not start with

?
Generalization to “learning”:

Title 27
ICML-2004, July 03 2004
3/29/2006

“Meta Theorem”
 Let T be a contraction.
If F is an interpolative non-expansion
and
defines a “relaxation process” when its second
parameter is fixed approximating w.p.1, and if

then

w.p.1

ICML-2004, July 03 2004


3/29/2006

Applications

 FAPP +
– Real-time dynamic programming
(speeding up dynamic programming by not updating
irrelevant parts of the state-space)
– Value iteration with Monte-Carlo updates
(when exact updates are too expansive)
– Other criteria (e.g. Markov games, Risk-sensitive
MDPs, ..)
Tsitsiklis&Van Roy, 1997
Gordon, 1995

Title 28
ICML-2004, July 03 2004
3/29/2006

Extensions

ICML-2004, July 03 2004


3/29/2006

Increasing the Density of Basis


Points

Title 29
ICML-2004, July 03 2004
3/29/2006

Algorithm

While (true)
{
Add basis points dens(St)>h0
Run IFAPP-Q Learning for time Tt
}

ICML-2004, July 03 2004


3/29/2006

Convergence

 Family of FAPPs:
 Non-destructive refinement
 Expansion finishes in finite time with probability
one
=> (random)

Title 30
ICML-2004, July 03 2004
3/29/2006

Extension #2

While (true)
{
Add data points
Shrink influence function s
Estimate density underlying µX with κt
and
run modified iFAPP-Q for time Tt
}
Output learned Q-function

ICML-2004, July 03 2004


3/29/2006

Modified iFAPP-Q Learning

 Use

Title 31
Some Experimental Results

63 3/29/2006 ICML-2004, July 03 2004

ICML-2004, July 03 2004


3/29/2006

Comparison algorithms

 Soft state aggregation [Singh, Jaakkola & Jordan 95]


 Kernel-based RL [Ormoneit & Sen 02]

 Domain: car on the hill

Title 32
ICML-2004, July 03 2004
3/29/2006

State-action visit counts

Goal

p p
v v

left right

ICML-2004, July 03 2004


3/29/2006

Q-values for “optimal” policy

p p
v v

left right

Title 33
ICML-2004, July 03 2004
3/29/2006

Q-values for “optimal” policy

go right

go left
p
v

ICML-2004, July 03 2004


3/29/2006

Performance comparison
error

flops

Title 34
ICML-2004, July 03 2004
3/29/2006

Varying the exploration policy

error

flops

ICML-2004, July 03 2004


3/29/2006

Varying the density threshold


error

flops

Title 35
Conclusions

71 3/29/2006 ICML-2004, July 03 2004

ICML-2004, July 03 2004


3/29/2006

Conclusions

 iFAPP-Q: Extends Q-learning to continuous


spaces
 Need to update multiple components of the
parameter vector => influence function s
 Works for non-expansions
 Extensions are possible
 Changing the policy?
 When to add basis points?
 Other FAPPs (LWR?)

Title 36
ICML-2004, July 03 2004
3/29/2006

What Makes F “Interpolative”?

-parameter vector
-”FAPP”

- basis points

Title 37

You might also like