I. INTRODUCTION
CURRENT theory for dynamic programming (DP) based
algorithms focuses on finite state, finite action Markov
decision process (MDP) problems. However, discrete representations of state and action spaces often suffer from the
curse of dimensionality, typically limiting their application
to problems with a relatively small number of dimensions.
In addition, there are plenty of real world problems with
continuous state and action spaces, to which algorithms developed for discrete problems cannot be applied. As a result,
continuous function approximation is introduced to overcome
the problem. Nevertheless, in contrast to the mature and elegant convergence theory supporting discrete MDPs [1], there
is a paucity of theoretical results regarding the convergence
of DP algorithms with function approximation applied to
continuous state problems.
This paper is a synopsis of [2]. We describe
an approximate policy iteration algorithm with recursive least
squares function approximation for infinite horizon Markov
decision process problems with continuous state and action
spaces. The algorithm can handle the curse of dimensionality, i.e., problems in which the state and decision variables are high-dimensional continuous vectors and the expectation is hard to compute. It is provably convergent under the significant assumption that the true value function is continuous and linearly parameterized with finitely many known basis functions.
Extensions can be made to apply the convergence results
to continuously differentiable value functions. The idea of
policy iteration has been around for a while. [3] states
that policy iteration runs into two difficulties for problems
with infinite state and action spaces: (1) the existence of
a sequence of policies generated by the algorithm, (2) the
feasibility of obtaining accurate policy value functions in a
computationally implementable way. Least squares updating
[10]
[11]
[12]
[13]
[4]
[14]
[15]
[16]
[17]
[18]
[19]
[21]
[22]
[23]
[24]
[25]
[26], [27]
[28]
S
D
D
D
D
D
C
D
NA
D
D
C
C
C
C
C
C
C
C
A
D
D
D
D
D
D
D
NA
D
D
C
C
C
D
D
C
D
D
R
G
G
G
G
G
G
G
G
G
G
Q
Q
Q
G
G
G
G
G
E
Y
Y
N
N
N
N
N
Y
N
N
N
N
N
Y
N
N
N
N
Noise
S
S
S
S
S
S
S
S
S
D
D
D
S(G)
S
S
S
S
S
Policy
VI
FP
FP
FP
API
FP
API
VI
VI
VI
API
VI/API
VI
EPI
VI
API
API
VI
Conv.
Y
Y
Y
Y
N
Y
Y
Y
Y
Y
Y
Y
Y
Y
B
B
B
Y
Fig. 1. Table of some existing continuous function approximation algorithms and related convergence results
The state evolves according to
\[
S_{t+1} = S^M(S_t, x_t, W_{t+1}), \tag{1}
\]
where $W_{t+1}$ represents the exogenous information that arrives during the time interval from $t$ to $t+1$. The problem is to find the policy that solves
\[
\sup_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, C\bigl(S_t, X_t^{\pi}(S_t)\bigr) \right]. \tag{2}
\]
Since solving the objective function (2) directly is computationally intractable, Bellman's equation is introduced so that the optimal control can be computed recursively:
\[
V(s) = \max_{x \in \mathcal{X}} \Bigl\{ C(s, x) + \gamma\, \mathbb{E}\bigl[ V\bigl(S^M(s, x, W)\bigr) \mid s \bigr] \Bigr\}. \tag{3}
\]
When the contribution function also depends on the exogenous information $W$, the equation becomes
\[
V(s) = \max_{x \in \mathcal{X}} \mathbb{E}\bigl[ C(s, x, W) + \gamma\, V\bigl(S^M(s, x, W)\bigr) \mid s \bigr]. \tag{4}
\]
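To make the recursion concrete, the following is a minimal Python sketch of the backup in (4): for a single state it estimates the maximum, over a grid of candidate actions, of a sampled expectation of contribution plus discounted downstream value. The transition, contribution and value functions passed in are illustrative stand-ins (the linear quadratic instance of Section VI), not the paper's general specification.

```python
import numpy as np

def bellman_backup(s, V, transition, contribution, gamma=0.95,
                   action_grid=np.linspace(-1.0, 1.0, 41), n_samples=1000,
                   rng=np.random.default_rng(0)):
    """Monte Carlo estimate of max_x E[ C(s, x, W) + gamma * V(S^M(s, x, W)) | s ]."""
    best = -np.inf
    for x in action_grid:
        w = rng.standard_normal(n_samples)       # samples of the exogenous information W
        s_next = transition(s, x, w)             # S^M(s, x, W), vectorized over the samples
        q = np.mean(contribution(s, x, w) + gamma * V(s_next))
        best = max(best, q)
    return best

# Usage with the linear quadratic instance of Section VI and a stand-in value estimate.
value = bellman_backup(
    s=0.5,
    V=lambda s: -1.5 * s ** 2,
    transition=lambda s, x, w: 0.9 * (s + x) + w,
    contribution=lambda s, x, w: -s ** 2 - x ** 2,
)
print(value)
```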
To work with continuous problems, we make some assumptions regarding the state, action and outcome spaces and the transition kernel function. Assume that the state space $S$ is a convex and compact subset of $\mathbb{R}^m$, the action space $\mathcal{X}$ is a compact subset of $\mathbb{R}^n$, the outcome space $\mathcal{W}$ is a compact subset of $\mathbb{R}^l$, and $Q : S \times \mathcal{X} \times \mathcal{W} \to \mathbb{R}$ is a continuous probability transition function. Let $C_b(S)$ denote the space of all bounded continuous functions from $S$ to $\mathbb{R}$; $C_b(S)$ is a complete metric space.
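For concreteness, the metric implicitly used on $C_b(S)$ is the sup metric (the same norm used for the convergence results in Section IV), under which completeness is a standard fact:
\[
d(V_1, V_2) = \sup_{s \in S} \, |V_1(s) - V_2(s)|, \qquad V_1, V_2 \in C_b(S).
\]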
B. Contraction operators

A chain is called Harris (recurrent) if it is $\psi$-irreducible and every set in
\[
\mathcal{B}^{+}(S) = \{ A \in \mathcal{B}(S) : \psi(A) > 0 \}
\]
is Harris recurrent.
The RLSAPI algorithm proceeds as follows.

Step 0: Initialization:
Step 0a. Set the initial values of the value function parameters, $\theta_0$.
Step 0b. Set the initial policy
\[
\pi_0(s) = \arg\max_{x \in \mathcal{X}(s)} \bigl\{ C(s, x) + \gamma\, \phi(s^x)^{\mathsf T} \theta_0 \bigr\}.
\]
Step 1: Do for $n = 1, \dots, N$:
Step 1a. Run $M$ policy evaluation (inner-loop) iterations for the fixed policy $\pi_{n-1}$ to obtain the parameter estimate $\theta_{n,M}$.
Step 1b. Set $\theta_n = \theta_{n,M}$ and update the policy
\[
\pi_n(s) = \arg\max_{x \in \mathcal{X}(s)} \bigl\{ C(s, x) + \gamma\, \phi(s^x)^{\mathsf T} \theta_n \bigr\}.
\]
Here $\phi(s^x)$ denotes the vector of basis functions evaluated at the post-decision state $s^x$.
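As a concrete illustration of the policy updates in Steps 0b and 1b, the sketch below computes the greedy action for a single state by maximizing $C(s, x) + \gamma\, \phi(s^x)^{\mathsf T}\theta$ over a grid of candidate actions. The quadratic basis, the post-decision map and the contribution function are illustrative placeholders in the spirit of the LQ example of Section VI, and the grid search merely stands in for whatever maximization scheme is actually used.

```python
import numpy as np

def greedy_policy(s, theta, phi, post_decision, C, actions, gamma=0.95):
    """Return the action maximizing C(s, x) + gamma * phi(s^x)^T theta over a finite grid."""
    values = [C(s, x) + gamma * phi(post_decision(s, x)) @ theta for x in actions]
    return actions[int(np.argmax(values))]

# Placeholder usage: post-decision state s^x = s + x and quadratic basis (1, s, s^2).
actions = np.linspace(-2.0, 2.0, 81)
x_star = greedy_policy(
    s=1.0,
    theta=np.array([0.0, 0.0, -1.5]),
    phi=lambda sx: np.array([1.0, sx, sx ** 2]),
    post_decision=lambda s, x: s + x,
    C=lambda s, x: -s ** 2 - x ** 2,
    actions=actions,
)
print(x_star)
```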
Concavity and differentiability of the value functions can significantly reduce the computational difficulty and complexity.
IV. CONVERGENCE RESULTS AND SKETCH OF PROOFS
We observe that there are both inner and outer loops in the
RLSAPI algorithm. The inner loop is the policy evaluation
step, while the outer loop corresponds to policy updates. We
first consider an exact policy iteration algorithm, in which case both the inner and outer loop iteration counters go to infinity. This conceptual algorithm is provably convergent almost surely (for details see [2]). To prove convergence of the exact algorithm, it is necessary to prove convergence of the policy evaluation step in the inner loop, and this convergence result turns out to be useful in proving the mean convergence of RLSAPI.
A. Almost sure convergence for a fixed policy

For a fixed policy, a Markov decision problem is a Markov chain, and Bellman's equation for the post-decision value function of following the fixed policy (see (7)) becomes
\[
V^{\pi}(s) = \int_{S^x} P^{\pi}(s, ds') \bigl[ C^{\pi}(s, s') + \gamma\, V^{\pi}(s') \bigr]. \tag{8}
\]
Suppose $V^{\pi}(s) = \phi(s)^{\mathsf T} \theta^{\pi}$, where $\phi(s) = (\phi_f(s))_{f \in \mathcal{F}}$ is the vector of basis functions of dimension $F = |\mathcal{F}|$ (the number of basis functions) and $\mathcal{F}$ denotes the set of features. Bellman's equation (8) then gives
\[
\phi(s)^{\mathsf T} \theta^{\pi} = \int_{S^x} P^{\pi}(s, ds') \bigl[ C^{\pi}(s, s') + \gamma\, \phi(s')^{\mathsf T} \theta^{\pi} \bigr],
\]
which can be rearranged into the regression form
\[
C^{\pi}(s, s') = \Bigl( \phi(s) - \gamma \int_{S^x} P^{\pi}(s, ds'')\, \phi(s'') \Bigr)^{\mathsf T} \theta^{\pi} + \varepsilon^{\pi}(s, s'),
\]
where $\varepsilon^{\pi}(s, s')$ denotes
the observation error. The quantity $\phi(s) - \gamma \int_{S^x} P^{\pi}(s, ds')\, \phi(s')$ can be viewed as the input variable. Since the transition kernel function may be unknown, at iteration $m$, instead of having the exact input variable, we can only observe an unbiased sample $\phi_m - \gamma\, \phi_{m+1}$, where $\phi_m$ denotes the basis vector evaluated at the $m$-th sampled (post-decision) state. Therefore, the errors in the input variables introduce bias into the regular formula of linear regression for estimating $\theta^{\pi}$. To eliminate the asymptotic bias, an instrumental variable $\zeta$ is introduced. Then, the $m$-th estimate of $\theta^{\pi}$ becomes
\[
\theta_m = \left[ \frac{1}{m} \sum_{i=1}^{m} \zeta_i \bigl( \phi_i - \gamma\, \phi_{i+1} \bigr)^{\mathsf T} \right]^{-1} \left[ \frac{1}{m} \sum_{i=1}^{m} \zeta_i\, C_i \right], \tag{9}
\]
where $C_i$ is the $i$-th observation of the contribution. [36] proves convergence of the parameter estimates computed via (9) to the true values, assuming that the instrumental variable is correlated with the true input variable but uncorrelated with the observation error term, and that the correlation matrix between the instrumental variable and the input variable is nonsingular and finite.
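A minimal sketch of the batch estimator (9), assuming the instrumental variable is taken to be $\phi$ itself (the choice attributed to [13] below) and that the discount factor enters the sampled input variable as in the reconstruction above. The data here are random placeholders rather than trajectories of an actual Markov chain.

```python
import numpy as np

def iv_least_squares(phis, phis_next, contribs, gamma=0.95):
    """Batch instrumental-variable estimate (9) with instruments zeta_i = phi_i."""
    zetas = phis
    diffs = phis - gamma * phis_next               # sampled (noisy) input variables
    A = zetas.T @ diffs / len(phis)                # (1/m) sum_i zeta_i (phi_i - gamma*phi_{i+1})^T
    b = zetas.T @ contribs / len(phis)             # (1/m) sum_i zeta_i C_i
    return np.linalg.solve(A, b)

# Placeholder data with basis dimension F = 3.
rng = np.random.default_rng(0)
phis, phis_next = rng.standard_normal((100, 3)), rng.standard_normal((100, 3))
contribs = rng.standard_normal(100)
print(iv_least_squares(phis, phis_next, contribs))
```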
[13] chooses $\phi(s)$ as the instrumental variable $\zeta$. The same construction extends to multi-step transitions: iterating the fixed-policy Bellman equation over $k-1$ steps gives
\[
V^{\pi}(s_1) = \mathbb{E}\Bigl[ \sum_{i=1}^{k-1} \gamma^{\,i-1}\, C^{\pi}(s_i, s_{i+1}) + \gamma^{\,k-1}\, V^{\pi}(s_k) \Bigr].
\]
Let $P^{\pi,k}(s_1, ds_k) = \prod_{i=1}^{k-1} P^{\pi}(s_i, ds_{i+1})$, $C^{\pi,k}(s_1, s_k) = \sum_{i=1}^{k-1} \gamma^{\,i-1}\, C^{\pi}(s_i, s_{i+1})$ and $\gamma_k = \gamma^{\,k-1}$. It is easy to check that the invariant measure of $P^{\pi}$ is also an invariant measure for $P^{\pi,k}$.
The estimate (9) can also be computed recursively. Writing $\hat{\varepsilon}_m = C_m - (\phi_m - \gamma\, \phi_{m+1})^{\mathsf T} \theta_{m-1}$ for the current residual,
\[
B_m = B_{m-1} - \frac{B_{m-1}\, \zeta_m\, (\phi_m - \gamma\, \phi_{m+1})^{\mathsf T} B_{m-1}}{1 + (\phi_m - \gamma\, \phi_{m+1})^{\mathsf T} B_{m-1}\, \zeta_m},
\]
\[
\theta_m = \theta_{m-1} + \frac{B_{m-1}\, \zeta_m}{1 + (\phi_m - \gamma\, \phi_{m+1})^{\mathsf T} B_{m-1}\, \zeta_m}\, \hat{\varepsilon}_m .
\]
Provided that $1 + (\phi_m - \gamma\, \phi_{m+1})^{\mathsf T} B_{m-1}\, \zeta_m \neq 0$ for all $m$, $\theta_m \to \theta^{\pi}$
almost surely.
The proof applies the law of large numbers for a positive Harris chain [35] and otherwise involves only simple calculations and manipulation of equation (8), so it is omitted here (for details see [2]).
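The recursion above can be sanity-checked numerically. The sketch below implements the rank-one update of $B_m$ together with the corresponding correction of $\theta_m$ and compares the result with a direct batch solve of (9) on placeholder data; the initialization $B_0 = (1/\epsilon) I$ acts as a mild regularizer, so the two estimates agree up to that term. This mirrors the reconstructed recursions and is not lifted verbatim from the paper.

```python
import numpy as np

def rls_iv_update(theta, B, phi, phi_next, zeta, contrib, gamma=0.95):
    """One recursive update; B tracks the inverse of sum_i zeta_i (phi_i - gamma*phi_{i+1})^T."""
    diff = phi - gamma * phi_next
    denom = 1.0 + diff @ B @ zeta                  # assumed nonzero, as in the text
    resid = contrib - diff @ theta                 # the residual epsilon_m
    theta_new = theta + (B @ zeta) * resid / denom
    B_new = B - np.outer(B @ zeta, diff @ B) / denom
    return theta_new, B_new

# Compare the recursion with the batch formula (9) on placeholder data.
rng = np.random.default_rng(1)
phis, nxt = rng.standard_normal((200, 3)), rng.standard_normal((200, 3))
cs = rng.standard_normal(200)
theta, B = np.zeros(3), 1e6 * np.eye(3)            # B_0 = (1/eps) I with eps = 1e-6
for p, pn, c in zip(phis, nxt, cs):
    theta, B = rls_iv_update(theta, B, p, pn, p, c)  # instrument zeta = phi
A = phis.T @ (phis - 0.95 * nxt) / len(phis)
print(theta)                                        # recursive estimate
print(np.linalg.solve(A, phis.T @ cs / len(phis)))  # batch estimate, essentially identical
```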
B. Mean convergence of RLSAPI
In the RLSAPI algorithm, the estimated value function
of the approximate policy in the policy evaluation step is
random due to its dependence on both the Markov chain
samples and iteration counter M of the inner loop. In other
words, for fixed $s \in S$, $\bar{V}_n(s)$ is a random variable. Assuming that the (post-decision) state space $S^x$ is compact and the norm is the sup norm $\|\cdot\|_{\infty}$ for continuous functions, we can show that RLSAPI converges in the mean. That is to say, the mean of the norm of the difference between the optimal value function and the estimated policy value function produced by approximate policy iteration converges to 0 as the successive approximations improve. We first consider a
theorem regarding the mean convergence of a general API
algorithm. Then, the mean convergence of RLSAPI is just a
corollary from the generalization.
Theorem 4.2 (Mean convergence of an API algorithm):
Let $\pi_0, \pi_1, \dots, \pi_n$ be the sequence of policies generated by some approximate policy iteration algorithm and let $\bar{V}_0, \bar{V}_1, \dots, \bar{V}_n$ be the corresponding approximate value functions. Further assume that, for each fixed policy $\pi_n$, the MDP is reduced to a positive Harris chain that admits an invariant probability measure $\mu_n$. Let $\{\delta_n\}$ be positive scalars that bound the mean errors between the approximate and the true value functions over all iterations, so that, for all $n \in \mathbb{N}$,
\[
\mathbb{E}_n \| \bar{V}_n - V_{\pi_n} \| \le \delta_n .
\]
Suppose $\delta_n \to 0$ and $\lim_{n \to \infty} \sum_{i=0}^{n-1} \gamma^{\,n-1-i}\, \delta_i = 0$, e.g. $\delta_i = \gamma^{\,i}\, \delta$. Then, this sequence eventually produces policies whose performance converges to the optimal performance in the mean:
\[
\lim_{n \to \infty} \mathbb{E}_n \| V_{\pi_n} - V^{*} \| = 0 .
\]
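As a quick check of the summability condition, take the suggested choice $\delta_i = \gamma^{\,i}\, \delta$ for a constant $\delta > 0$. Then
\[
\sum_{i=0}^{n-1} \gamma^{\,n-1-i}\, \delta_i = \sum_{i=0}^{n-1} \gamma^{\,n-1-i}\, \gamma^{\,i}\, \delta = n\, \gamma^{\,n-1}\, \delta \longrightarrow 0 \quad \text{as } n \to \infty,
\]
since $0 < \gamma < 1$, so both hypotheses of the theorem hold.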
For RLSAPI, the error $\mathbb{E}_n \| \bar{V}_n - V_{\pi_n} \|$ in Theorem 4.2 can be bounded in terms of the parameter estimation errors $\sum_{i=1}^{F} \mathbb{E}_n \, | \theta^{n}_{M,i} - \theta^{\pi_n}_{i} |$, which vanish as the inner-loop iteration counter $M$ grows by the almost sure convergence result of Section IV-A.
For a weighting function $w$ and a family of orthogonal basis functions $\{ g^w_n \}$, it is easy to check that the weighted least squares projection of a function $f$ onto $\{ g^w_1, \dots, g^w_N \}$ is
\[
f_N = \sum_{n=1}^{N} \frac{\langle f, g^w_n \rangle_w}{\| g^w_n \|^2_w}\, g^w_n .
\]
B. Chebyshev polynomials
We further restrict ourselves to one specific weighting function: the Chebyshev weighting function $c(s) = \frac{1}{\sqrt{1 - s^2}}$ on $[-1, 1]$. Chebyshev polynomials are desirable for approximating non-periodic smooth functions due to their uniform convergence property [38]. The weighting function can easily be extended to an arbitrary closed interval $[a, b]$, e.g. the general Chebyshev weighting function is
\[
c_g(s) = \Bigl( 1 - \bigl( \tfrac{2s - a - b}{b - a} \bigr)^{2} \Bigr)^{-\frac{1}{2}} .
\]
The family of Chebyshev polynomials $T = \{ t_n \}_{n=0}^{\infty}$ is defined as $t_n(s) = \cos(n \cos^{-1} s)$ [38]. It can also be defined recursively as
\[
t_0(s) = 1, \quad t_1(s) = s, \quad t_{n+1}(s) = 2s\, t_n(s) - t_{n-1}(s).
\]
The degree-$N$ Chebyshev expansion of the value function is
\[
g_{C_N}(s) = \sum_{j=0}^{N} c_j\, t_j(s), \quad \text{where } c_j = \int_{-1}^{1} V^{\pi}(s)\, t_j(s)\, c(s)\, ds .
\]
Let $\bar{T} = \{ t_n \sqrt{f_c} \}_{n=0}^{\infty}$. It is easy to check that $\bar{T}$ is an orthonormal basis set with respect to $f$. The least squares approximation problem for the value function of policy $\pi$ onto the basis set $T_N$ is
\[
\inf_{g \in T_N} \int_{-1}^{1} \bigl( V^{\pi}(s) - g(s) \bigr)^{2} f(s)\, ds .
\]
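The following sketch illustrates the Chebyshev construction numerically: it computes the coefficients of a degree-$N$ expansion of a smooth stand-in value function by Gauss-Chebyshev quadrature and checks the truncation error. The normalization constants ($1/\pi$ for $j = 0$ and $2/\pi$ otherwise) are the standard ones from [38] and are supplied here because the extracted coefficient formula above omits them.

```python
import numpy as np

def chebyshev_coeffs(V, N, n_nodes=200):
    """Coefficients of the degree-N Chebyshev expansion of V on [-1, 1]."""
    k = np.arange(n_nodes)
    nodes = np.cos(np.pi * (k + 0.5) / n_nodes)          # Gauss-Chebyshev nodes
    vals = V(nodes)
    coeffs = []
    for j in range(N + 1):
        tj = np.cos(j * np.arccos(nodes))                # t_j(s) = cos(j * arccos s)
        integral = np.pi / n_nodes * np.sum(vals * tj)   # quadrature for int V t_j c ds
        coeffs.append(integral / np.pi if j == 0 else 2.0 * integral / np.pi)
    return np.array(coeffs)

# Stand-in smooth value function and a degree-8 approximation.
V = lambda s: -np.exp(s) * s ** 2
c = chebyshev_coeffs(V, N=8)
s = np.linspace(-1.0, 1.0, 5)
approx = sum(cj * np.cos(j * np.arccos(s)) for j, cj in enumerate(c))
print(np.max(np.abs(approx - V(s))))                     # small truncation error
```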
VI. EXPERIMENTS

In this section, we test RLSAPI on a stochastic linear quadratic control problem, a numerical example for which the value function has a known parametric form. Let the transition function be
\[
S^M(s, x, w) = \lambda (s + x) + w,
\]
where $0 < \lambda < 1$ and $w \sim N(0, 1)$, and let the contribution function be
\[
C(s, x) = -s^2 - x^2 .
\]
Since we have a concave quadratic contribution function, it is well known that the value function is also concave quadratic, i.e.
\[
V(s) = \theta_0 + \theta_1 s + \theta_2 s^2
\]
with $\theta_2 < 0$, and the optimal decision/control function is of linear form: $x = a + bs$.
In the numerical example, we fix the discount factor $\gamma = 0.95$ and $\lambda = 0.9$, and we take 1000 iterations for each policy evaluation step and 100 iterations for policy updating. The following figures plot the deviations of the RLSAPI estimates of the control function intercept $a$ and slope $b$ from their true values against the policy updating iteration counter $N$. From the figures, we clearly observe the convergence behavior of the RLSAPI algorithm for the stochastic linear quadratic problem.
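As a cross-check of the reported true slope, the sketch below runs exact value iteration on the coefficients of the quadratic value function for the LQ instance reconstructed above ($\gamma = 0.95$, $\lambda = 0.9$); it converges to $a = 0$ and $b \approx -0.5428$, matching the value quoted in Fig. 4. The coefficient-update formulas are our own derivation (first-order condition for the greedy control, then collecting powers of $s$), not code from the paper.

```python
import numpy as np

gamma, lam = 0.95, 0.9
theta = np.zeros(3)  # (theta0, theta1, theta2) of V(s) = theta0 + theta1*s + theta2*s^2

def greedy_coeffs(theta):
    """Affine greedy control x = a + b*s for the current quadratic value function."""
    t0, t1, t2 = theta
    denom = 1.0 - gamma * t2 * lam ** 2            # positive whenever t2 <= 0
    return gamma * lam * t1 / (2.0 * denom), gamma * t2 * lam ** 2 / denom

for _ in range(2000):
    t0, t1, t2 = theta
    a, b = greedy_coeffs(theta)
    c1 = 1.0 + b                                    # coefficient of s inside s + x
    theta = np.array([                              # Bellman update, powers of s collected
        -a ** 2 + gamma * (t0 + t1 * lam * a + t2 * (lam ** 2 * a ** 2 + 1.0)),
        -2.0 * a * b + gamma * (t1 * lam * c1 + 2.0 * t2 * lam ** 2 * a * c1),
        -1.0 - b ** 2 + gamma * t2 * lam ** 2 * c1 ** 2,
    ])

a, b = greedy_coeffs(theta)
print("theta =", theta)                  # theta1 -> 0, theta2 -> about -1.5428
print("a = %.4f, b = %.4f" % (a, b))     # a -> 0.0000, b -> -0.5428
```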
Fig. 3. Deviation of the intercept of the control from its true value, plotted against the policy updating iteration counter N.
Fig. 4. Deviation of the slope of the control from its true value -0.5428, plotted against the policy updating iteration counter N.

REFERENCES
[13] S.J. Bradtke and A.G. Barto, Linear Least-Squares algorithms for
temporal difference learning, Machine Learning, vol. 22, no. 1, pp.
33-57, 1996.
[14] F.S. Melo and P. Lisboa and M.I. Ribeiro, Convergence of Q-learning
with linear function approximation, Proceedings of the European
Control Conference 2007, 2007.
[15] T.J. Perkins and D. Precup, A Convergent Form of Approximate
Policy Iteration, Advances In Neural Information Processing Systems,
pp. 1627-1634, 2003.
[16] G.J. Gordon, Stable function approximation in dynamic programming, Proceedings of the Twelfth International Conference on Machine Learning, vol. 131, pp. 132-143, 1995.
[17] G.J. Gordon, Reinforcement learning with function approximation
converges to a region, Advances in Neural Information Processing
Systems, vol. 13, pp. 1040-1046, 2001.
[18] L.C. Baird, Residual algorithms: Reinforcement learning with function
approximation, Proceedings of the Twelfth International Conference on
Machine Learning, pp. 30-37, 1995.
[19] S.J. Bradtke and B.E. Ydstie and A.G. Barto, Adaptive linear
quadratic control using policy iteration, American Control Conference,
1994.
[20] S.J. Bradtke, Reinforcement Learning Applied to Linear Quadratic
Regulation, Advances In Neural Information Processing Systems, pp.
295-302, 1993.
[21] T. Landelius and H. Knutsson, Greedy Adaptive Critics for LQR
Problems: Convergence Proofs, Neural Computation, 1997.
[22] I. Szita, Rewarding Excursions: Extending Reinforcement Learning
to Complex Domains, Ph.D. Thesis, 2007.
[23] S.P. Meyn, The policy iteration algorithm for average reward Markov decision processes with general state space, IEEE Transactions on Automatic Control, vol. 42, no. 12, pp. 1663-1680, 1997.
[24] R. Munos and C. Szepesvari, Finite time bounds for sampling based
fitted value iteration, Proceedings of the 22nd international conference
on Machine learning, pp. 880-887, 2005.
[25] A. Antos and C. Szepesvari and R. Munos, Value-Iteration Based
Fitted Policy Iteration: Learning with a Single Trajectory, Approximate
Dynamic Programming and Reinforcement Learning, 2007. ADPRL
2007. IEEE International Symposium on, pp. 330-337, 2007.
[26] A. Antos and R. Munos and C.Szepesvari, Fitted Q-iteration in
continuous action-space MDPs, Proceedings of Neural Information
Processing Systems Conference (NIPS), Vancouver, Canada, 2007.
[27] A. Antos and C. Szepesvari and R. Munos, Learning near-optimal
policies with Bellman-residual minimization based fitted policy iteration
and a single sample path, Machine Learning, vol. 71, no. 1, pp. 89-129,
2008.
[28] D. Ormoneit and S. Sen, Kernel-Based Reinforcement Learning, Machine Learning, vol. 49, no. 2, pp. 161-178, 2002.
[29] D.P. Bertsekas, A Counterexample to Temporal Differences Learning, Neural Computation, vol. 7, no. 2, pp. 270-279, 1995.
[30] J. Boyan and A.W. Moore, Generalization in Reinforcement Learning: Safely Approximating the Value Function, Advances In Neural
Information Processing Systems, pp. 369-376, 1995.
[31] B. Van Roy and D.P. Bertsekas and Y. Lee and J.N. Tsitsiklis, A
Neuro-Dynamic Programming Approach to Retailer Inventory Management, Decision and Control, 1997., Proceedings of the 36th IEEE
Conference on vol. 4, 1997.
[32] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction.
MIT Press, 1998.
[33] W.B. Powell, Approximate Dynamic Programming: Solving the curses
of dimensionality. John Wiley and Sons, New York, 2007.
[34] S.P. Meyn and R.L. Tweedie, Markov chains and stochastic stability.
Springer, 1993.
[35] J. Luque, The nonlinear proximal point algorithm. Laboratory for Information and Decision Systems, Massachusetts Institute of Technology,
1986.
[36] P. Young, Recursive estimation and time-series analysis: an introduction. Springer-Verlag New York, Inc. New York, NY, USA, 1984.
[37] D.P. Bertsekas and J.N. Tsitsiklis, Neuro-Dynamic Programming.
Athena Scientific, 1996.
[38] K.L. Judd, Numerical Methods in Economics. MIT Press, 1998.
[39] J. Fan and I. Gijbels, Local Polynomial Modelling and Its Applications.
Chapman & Hall/CRC, 1996.