Dynamic Programming and Optimal Control
3rd Edition, Volume II
by
Dimitri P. Bertsekas
Massachusetts Institute of Technology
Chapter 6
Approximate Dynamic Programming
This is an updated version of the research-oriented Chapter 6 on Approximate Dynamic Programming. It will be periodically updated as new research becomes available, and will replace the current Chapter 6 in the book's next printing.
In addition to editorial revisions, rearrangements, and new exercises, the chapter includes an account of new research, which is collected mostly in Sections 6.3 and 6.8. Furthermore, a lot of new material has been added, such as an account of post-decision state simplifications (Section 6.1), regression-based TD methods (Section 6.3), policy oscillations (Section 6.3), exploration schemes and optimistic policy iteration (Section 6.3), convergence analysis of Q-learning (Section 6.4), aggregation methods (Section 6.5), and Monte Carlo linear algebra (Section 6.8).
This chapter represents "work in progress." It more than likely contains errors (hopefully not serious ones). Furthermore, its references to the literature are incomplete. Your comments and suggestions to the author at dimitrib@mit.edu are welcome. When quoting, please refer to the date of last revision given below.
May 13, 2010
6
Approximate
Dynamic Programming
Contents
6.1. General Issues of Cost Approximation . . . . . . . . p. 327
6.1.1. Approximation Architectures . . . . . . . . . p. 327
6.1.2. Approximate Policy Iteration . . . . . . . . . p. 331
6.1.3. Direct and Indirect Approximation . . . . . . p. 335
6.1.4. Simplifications . . . . . . . . . . . . . . . p. 337
6.1.5. The Role of Contraction Mappings . . . . . . p. 344
6.1.6. The Role of Monte Carlo Simulation . . . . . . p. 345
6.2. Direct Policy Evaluation – Gradient Methods . . . . . p. 348
6.3. Projected Equation Methods . . . . . . . . . . . . p. 354
6.3.1. The Projected Bellman Equation . . . . . . . p. 355
6.3.2. Deterministic Iterative Methods . . . . . . . . p. 360
6.3.3. Simulation-Based Methods . . . . . . . . . . p. 364
6.3.4. LSTD, LSPE, and TD(0) Methods . . . . . . p. 366
6.3.5. Optimistic Versions . . . . . . . . . . . . . p. 374
6.3.6. Policy Oscillations – Chattering . . . . . . . . p. 375
6.3.7. Multistep Simulation-Based Methods . . . . . p. 381
6.3.8. TD Methods with Exploration . . . . . . . . p. 386
6.3.9. Summary and Examples . . . . . . . . . . . p. 395
6.4. Q-Learning . . . . . . . . . . . . . . . . . . . . p. 399
6.4.1. Convergence Properties of Q-Learning . . . . . p. 401
6.4.2. Q-Learning and Approximate Policy Iteration . . p. 406
6.4.3. Q-Learning for Optimal Stopping Problems . . . p. 408
6.4.4. Finite Horizon Q-Learning . . . . . . . . . . p. 413
6.5. Aggregation Methods . . . . . . . . . . . . . . . p. 416
6.5.1. Cost and Q-Factor Approximation by Aggregation p. 419
6.5.2. Approximate Policy and Value Iteration . . . . p. 423
6.6. Stochastic Shortest Path Problems . . . . . . . . . p. 432
6.7. Average Cost Problems . . . . . . . . . . . . . . p. 436
6.7.1. Approximate Policy Evaluation . . . . . . . . p. 436
6.7.2. Approximate Policy Iteration . . . . . . . . . p. 445
6.7.3. Q-Learning for Average Cost Problems . . . . . p. 447
6.8. Simulation-Based Solution of Large Systems . . . . . p. 451
6.8.1. Projected Equations – Simulation-Based Versions p. 451
6.8.2. Matrix Inversion and Regression-Type Methods . p. 455
6.8.3. Iterative/LSPE-Type Methods . . . . . . . . p. 457
6.8.4. Extension of Q-Learning for Optimal Stopping . p. 464
6.8.5. Bellman Equation Error-Type Methods . . . . p. 467
6.8.6. Oblique Projections . . . . . . . . . . . . . p. 471
6.8.7. Generalized Aggregation by Simulation . . . . . p. 473
6.9. Approximation in Policy Space . . . . . . . . . . . p. 477
6.9.1. The Gradient Formula . . . . . . . . . . . . p. 478
6.9.2. Computing the Gradient by Simulation . . . . p. 480
6.9.3. Essential Features of Critics . . . . . . . . . p. 481
6.9.4. Approximations in Policy and Value Space . . . p. 484
6.10. Notes, Sources, and Exercises . . . . . . . . . . . p. 485
References . . . . . . . . . . . . . . . . . . . . . . p. 507
In this chapter we consider approximation methods for challenging, computationally intensive DP problems. We discussed a number of such methods in Chapter 6 of Vol. I and Chapter 1 of the present volume, for example rollout and other one-step lookahead approaches. Here our focus will be on algorithms that are mostly patterned after two principal methods of infinite horizon DP: policy and value iteration. These algorithms form the core of a methodology known by various names, such as approximate dynamic programming, neuro-dynamic programming, or reinforcement learning.
A principal aim of the methods of this chapter is to address problems with a very large number of states n. In such problems, ordinary linear algebra operations, such as n-dimensional inner products, are prohibitively time-consuming, and indeed it may be impossible to even store an n-vector in a computer memory. Our methods will involve linear algebra operations of dimension much smaller than n, and require only that the components of n-vectors be generated when needed rather than stored.
Another aim of the methods of this chapter is to address model-free situations, i.e., problems where a mathematical model is unavailable or hard to construct. Instead, the system and cost structure may be simulated (think, for example, of a queueing network with complicated but well-defined service disciplines at the queues). The assumption here is that there is a computer program that simulates, for a given control u, the probabilistic transitions from any given state i to a successor state j according to the transition probabilities p_ij(u), and also generates a corresponding transition cost g(i, u, j). It may then be possible to use repeated simulation to calculate (at least approximately) the transition probabilities of the system and the expected stage costs by averaging, and then to apply the methods discussed in earlier chapters.
The methods of this chapter, however, are geared towards an alternative possibility, which is much more attractive when one is faced with a large and complex system, and one contemplates approximations. Rather than estimate explicitly the transition probabilities and costs, we will aim to approximate the cost function of a given policy or even the optimal cost-to-go function by generating one or more simulated system trajectories and associated costs, and by using some form of "least squares fit." In another type of method, which we will discuss only briefly, we use a gradient method and simulation data to approximate directly an optimal policy with a policy of a given parametric form. Let us also mention two other approximate DP methods, which we have discussed at various points in other parts of the book: rollout algorithms (Sections 6.4, 6.5 of Vol. I, and Section 1.3.5 of Vol. II), and approximate linear programming (Section 1.3.4).
Our main focus will be on two types of methods: policy evaluation algorithms, which deal with approximation of the cost of a single policy, and Q-learning algorithms, which deal with approximation of the optimal cost. Let us summarize each type of method, focusing for concreteness on the finite-state discounted case.
Policy Evaluation Algorithms
With this class of methods, we aim to approximate the cost function J_µ(i) of a policy µ with a parametric architecture of the form J̃(i, r), where r is a parameter vector (cf. Section 6.3.5 of Vol. I). This approximation may be carried out repeatedly, for a sequence of policies, in the context of a policy iteration scheme. Alternatively, it may be used to construct an approximate cost-to-go function of a single suboptimal/heuristic policy, which can be used in an online rollout scheme, with one-step or multistep lookahead. We focus primarily on two types of methods.†
In the first class of methods, called direct, we use simulation to collect samples of costs for various initial states, and fit the architecture J̃ to the samples through some least squares problem. This problem may be solved by several possible algorithms, including linear least squares methods based on simple matrix inversion. Gradient methods have also been used extensively, and will be described in Section 6.2.
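The direct approach lends itself to a compact illustration. The following sketch simulates truncated discounted-cost samples of a fixed policy from each initial state and fits a linear architecture by least squares; the four-state chain, costs, and features are invented for the example and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, horizon = 0.9, 4, 200
P = np.array([[0.6, 0.4, 0.0, 0.0],
              [0.1, 0.6, 0.3, 0.0],
              [0.0, 0.1, 0.6, 0.3],
              [0.2, 0.0, 0.1, 0.7]])     # chain of the policy being evaluated
g = np.array([1.0, 0.5, 2.0, 0.0])       # expected one-stage costs
phi = lambda i: np.array([1.0, i])       # two features per state

def cost_sample(i0):
    # Truncated discounted cost of one simulated trajectory from state i0.
    i, total = i0, 0.0
    for t in range(horizon):
        total += alpha**t * g[i]
        i = rng.choice(n, p=P[i])
    return total

# Collect cost samples for various initial states, then fit J~(i, r) = phi(i)' r.
starts = [i for i in range(n) for _ in range(50)]
A = np.array([phi(i) for i in starts])
c = np.array([cost_sample(i) for i in starts])
r = np.linalg.lstsq(A, c, rcond=None)[0]
```

The least squares step here uses plain matrix-based solution; Section 6.2 discusses gradient-based alternatives for the same fitting problem.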
The second and currently more popular class of methods is called indirect. Here, we obtain r by solving an approximate version of Bellman's equation. We will focus exclusively on the case of a linear architecture, where J̃ is of the form Φr, and Φ is a matrix whose columns can be viewed as basis functions (cf. Section 6.3.5 of Vol. I). In an important method of this type, we obtain the parameter vector r by solving the equation

Φr = ΠT(Φr), (6.1)

where Π denotes projection with respect to a suitable norm on the subspace of vectors of the form Φr, and T is either the mapping T_µ or a related mapping, which also has J_µ as its unique fixed point [here ΠT(Φr) denotes the projection of the vector T(Φr) on the subspace].‡
† In another type of method, often called the Bellman equation error approach, which we will discuss briefly in Section 6.8.5, the parameter vector r is determined by minimizing a measure of error in satisfying Bellman's equation; for example, by minimizing over r

‖J̃ − TJ̃‖,

where ‖·‖ is some norm. If ‖·‖ is a Euclidean norm, and J̃(i, r) is linear in r, this minimization is a linear least squares problem.
‡ Another method of this type is based on aggregation (cf. Section 6.3.4 of Vol. I) and is discussed in Section 6.5. This approach can also be viewed as a problem approximation approach (cf. Section 6.3.3 of Vol. I): the original problem
We can view Eq. (6.1) as a form of projected Bellman equation. We will show that for a special choice of the norm of the projection, ΠT is a contraction mapping, so the projected Bellman equation has a unique solution Φr*. We will discuss several iterative methods for finding r* in Section 6.3. All these methods use simulation and can be shown to converge under reasonable assumptions to r*, so they produce the same approximate cost function. However, they differ in their speed of convergence and in their suitability for various problem contexts. Here are the methods that we will focus on in Section 6.3 for discounted problems, and also in Sections 6.6-6.8 for other types of problems. They all depend on a parameter λ ∈ [0, 1], whose role will be discussed later.
(1) TD(λ) or temporal differences method. This algorithm may be viewed as a stochastic iterative method for solving a version of the projected equation (6.1) that depends on λ. The algorithm embodies important ideas and has played an important role in the development of the subject, but in practical terms, it is typically inferior to the next two methods, so it will be discussed in less detail.

(2) LSTD(λ) or least squares temporal differences method. This algorithm computes and solves a progressively more refined simulation-based approximation to the projected Bellman equation (6.1).

(3) LSPE(λ) or least squares policy evaluation method. This algorithm is based on the idea of executing value iteration within the lower-dimensional space spanned by the basis functions. Conceptually, it has the form

Φr_{k+1} = ΠT(Φr_k) + simulation noise, (6.2)

i.e., the current value iterate T(Φr_k) is projected on S and is suitably approximated by simulation. The simulation noise tends to 0 asymptotically, so assuming that ΠT is a contraction, the method converges to the solution of the projected Bellman equation (6.1). There are also a number of variants of LSPE(λ). Both LSPE(λ) and its variants have the same convergence rate as LSTD(λ), because they share a common bottleneck: the slow speed of simulation.
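For a concrete sense of what Eq. (6.1) amounts to in matrix terms, the following sketch solves it exactly for a small invented chain, taking the projection norm to be weighted by the chain's steady-state distribution (the special norm choice for which ΠT turns out to be a contraction); the chain, costs, and basis matrix are assumptions for the example.

```python
import numpy as np

alpha = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])   # transition matrix of the policy's chain
g = np.array([1.0, 2.0, 0.5])     # expected one-stage costs under the policy
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [1.0, 2.0]])      # n = 3 states, s = 2 basis functions

# Steady-state distribution xi of P: the left eigenvector for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
xi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
xi = xi / xi.sum()
Xi = np.diag(xi)

# Orthogonality form of Phi r = Pi T(Phi r), with T(J) = g + alpha P J:
#   Phi' Xi (Phi - alpha P Phi) r = Phi' Xi g.
A = Phi.T @ Xi @ (Phi - alpha * P @ Phi)
b = Phi.T @ Xi @ g
r_star = np.linalg.solve(A, b)

# Verify the fixed-point property: projecting T(Phi r*) gives back Phi r*.
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)
assert np.allclose(Pi @ (g + alpha * P @ (Phi @ r_star)), Phi @ r_star)
```

The simulation-based methods (1)-(3) can be viewed as ways of forming the matrix A and vector b above (or the iteration they induce) from sampled trajectories rather than from P and g directly.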
is approximated with a related "aggregate" problem, which is then solved exactly to yield a cost-to-go approximation for the original problem. The analog of Eq. (6.1) has the form Φr = ΦDT(Φr), where Φ and D are matrices whose rows are restricted to be probability distributions (the aggregation and disaggregation probabilities, respectively).

Q-Learning Algorithms

With this class of methods, we aim to compute, without any approximation, the optimal cost function (not just the cost function of a single policy).† The letter "Q" stands for nothing special; it just refers to notation used in the original proposal of this method!
Q-learning maintains and updates for each state-control pair (i, u) an estimate of the expression that is minimized in the right-hand side of Bellman's equation. This is called the Q-factor of the pair (i, u), and is denoted by Q*(i, u). The Q-factors are updated with what may be viewed as a simulation-based form of value iteration, as will be explained in Section 6.4. An important advantage of using Q-factors is that when they are available, they can be used to obtain an optimal control at any state i simply by minimizing Q*(i, u) over u ∈ U(i), so the transition probabilities of the problem are not needed.
On the other hand, for problems with a large number of state-control pairs, Q-learning is often impractical because there may be simply too many Q-factors to update. As a result, the algorithm is primarily suitable for systems with a small number of states (or for aggregated/few-state versions of more complex systems). There are also algorithms that use parametric approximations for the Q-factors (see Sections 6.4 and 6.5), but they are either tailored to special classes of problems (e.g., optimal stopping problems discussed in Section 6.4.3, and problems resulting from aggregation discussed in Section 6.5), or else they lack a firm theoretical basis. Still, these methods are used widely, and often with success.
Chapter Organization
Throughout this chapter, we will focus almost exclusively on perfect state information problems, involving a Markov chain with a finite number of states i, transition probabilities p_ij(u), and single stage costs g(i, u, j). Extensions of many of the ideas to continuous state spaces are possible, but they are beyond our scope. We will consider first, in Sections 6.1-6.5, the discounted problem using the notation of Section 1.3. Section 6.1 provides a broad overview of cost approximation architectures and their uses in approximate policy iteration. Section 6.2 focuses on direct methods for policy evaluation. Section 6.3 is a long section on a major class of indirect methods for policy evaluation, which are based on the projected Bellman equation. Section 6.4 discusses Q-learning and its variations, and extends the projected Bellman equation approach to the case of multiple policies, and particularly to optimal stopping problems. Section 6.5 discusses methods based on aggregation. Stochastic shortest path and average cost problems
† There are also methods related to Q-learning, which are based on aggregation and compute an approximately optimal cost function (see Section 6.5). Other methods, which aim to approximate the optimal cost function, are based on linear programming. They were discussed in Section 1.3.4, and will not be considered further here (see also the references cited in Chapter 1).
are discussed in Sections 6.6 and 6.7, respectively. Section 6.8 extends and
elaborates on the projected Bellman equation approach of Sections 6.3,
6.6, and 6.7, discusses another approach based on the Bellman equation
error, and generalizes the aggregation methodology. Section 6.9 describes
methods involving the parametric approximation of policies.
6.1 GENERAL ISSUES OF COST APPROXIMATION
Most of the methodology of this chapter deals with approximation of some type of cost function (optimal cost, cost of a policy, Q-factors, etc.). The purpose of this section is to highlight the main issues involved, without getting too much into the mathematical details.

We start with general issues of parametric approximation architectures, which we have also discussed in Vol. I (Section 6.3.5). We then consider approximate policy iteration (Section 6.1.2), and the two general approaches for approximate cost evaluation (direct and indirect; Section 6.1.3). In Section 6.1.4, we discuss various special structures that can be exploited to simplify approximate policy iteration. In Sections 6.1.5 and 6.1.6, we provide orientation into the main mathematical issues underlying the methodology, and focus on two of its main components: contraction mappings and simulation.
6.1.1 Approximation Architectures
The major use of cost approximation is for obtaining a one-step lookahead† suboptimal policy (cf. Section 6.3 of Vol. I). In particular, suppose that we use J̃(j, r) as an approximation to the optimal cost of the finite-state discounted problem of Section 1.3. Here J̃ is a function of some chosen form (the approximation architecture) and r is a parameter/weight vector. Once r is determined, it yields a suboptimal control at any state i via the one-step lookahead minimization

µ̃(i) = arg min_{u∈U(i)} Σ_{j=1}^n p_ij(u) (g(i, u, j) + α J̃(j, r)). (6.3)
The degree of suboptimality of µ̃, as measured by ‖J_µ̃ − J*‖∞, is bounded by a constant multiple of the approximation error according to

‖J_µ̃ − J*‖∞ ≤ (2α/(1 − α)) ‖J̃ − J*‖∞,
† We may also use a multiple-step lookahead minimization, with a cost-to-go approximation at the end of the multiple-step horizon. Conceptually, single-step and multiple-step lookahead approaches are similar, and the cost-to-go approximation algorithms of this chapter apply to both.
as shown in Prop. 1.3.7. This bound is qualitative in nature, as it tends to
be quite conservative in practice.
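As an illustration of the one-step lookahead rule (6.3), the following sketch computes µ̃ for a small invented problem; the transition probabilities, costs, and the architecture J̃ are all assumptions for the example, not data from the text.

```python
import numpy as np

alpha, n = 0.9, 3
# Invented problem data: transition probabilities p[u][i][j] and costs g[u][i][j].
p = {0: np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]),
     1: np.array([[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]])}
g = {0: np.ones((n, n)), 1: 2.0 * np.ones((n, n))}

def J_tilde(j, r):
    # An assumed linear architecture J~(j, r) = r[0] + r[1] * j.
    return r[0] + r[1] * j

def lookahead_policy(i, r, controls=(0, 1)):
    # One-step lookahead minimization (6.3) at state i.
    def expected_cost(u):
        return sum(p[u][i][j] * (g[u][i][j] + alpha * J_tilde(j, r))
                   for j in range(n))
    return min(controls, key=expected_cost)

r = np.array([0.0, 1.0])   # J~(j) = j: higher-numbered states look costlier
policy = [lookahead_policy(i, r) for i in range(n)]
```

With these numbers, control 0 is both cheaper and keeps the chain near the low-cost states, so the lookahead rule selects it everywhere.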
An alternative possibility is to obtain a parametric approximation Q̃(i, u, r) of the Q-factor of the pair (i, u), defined in terms of the optimal cost function J* as

Q*(i, u) = Σ_{j=1}^n p_ij(u) (g(i, u, j) + α J*(j)).

Since Q*(i, u) is the expression minimized in Bellman's equation, given the approximation Q̃(i, u, r), we can generate a suboptimal control at any state i via

µ̃(i) = arg min_{u∈U(i)} Q̃(i, u, r).

The advantage of using Q-factors is that in contrast with the minimization (6.3), the transition probabilities p_ij(u) are not needed in the above minimization. Thus Q-factors are better suited to the model-free context.
Note that we may similarly use approximations to the cost functions J_µ and Q-factors Q_µ(i, u) of specific policies µ. A major use of such approximations is in the context of an approximate policy iteration scheme; see Section 6.1.2.
The choice of architecture is very significant for the success of the approximation approach. One possibility is to use the linear form

J̃(i, r) = Σ_{k=1}^s r_k φ_k(i), (6.4)

where r = (r_1, . . . , r_s) is the parameter vector, and φ_k(i) are some known scalars that depend on the state i. Thus, for each state i, the approximate cost J̃(i, r) is the inner product φ(i)′r of r and

φ(i) = (φ_1(i), . . . , φ_s(i))′.

We refer to φ(i) as the feature vector of i, and to its components as features (see Fig. 6.1.1). Thus the cost function is approximated by a vector in the subspace

S = {Φr | r ∈ ℜ^s},
where

Φ = [φ_1(1) · · · φ_s(1); . . . ; φ_1(n) · · · φ_s(n)] = [φ(1)′; . . . ; φ(n)′],

i.e., Φ is the n × s matrix whose ith row is the feature vector φ(i)′ and whose kth column collects the values of the basis function φ_k at the n states.
Figure 6.1.1. A linear feature-based architecture. A feature extraction mapping takes state i to the feature vector φ(i) = (φ_1(i), . . . , φ_s(i))′, which is combined with a parameter vector r to form the linear cost approximator φ(i)′r.
We can view the s columns of Φ as basis functions, and Φr as a linear combination of basis functions.

Features, when well-crafted, can capture the dominant nonlinearities of the cost function, and their linear combination may work very well as an approximation architecture. For example, in computer chess (Section 6.3.5 of Vol. I), where the state is the current board position, appropriate features are material balance, piece mobility, king safety, and other positional factors.
Example 6.1.1 (Polynomial Approximation)

An important example of linear cost approximation is based on polynomial basis functions. Suppose that the state consists of q integer components x_1, . . . , x_q, each taking values within some limited range of integers. For example, in a queueing system, x_k may represent the number of customers in the kth queue, where k = 1, . . . , q. Suppose that we want to use an approximating function that is quadratic in the components x_k. Then we can define a total of 1 + q + q² basis functions that depend on the state x = (x_1, . . . , x_q) via

φ_0(x) = 1,  φ_k(x) = x_k,  φ_km(x) = x_k x_m,  k, m = 1, . . . , q.

A linear approximation architecture that uses these functions is given by

J̃(x, r) = r_0 + Σ_{k=1}^q r_k x_k + Σ_{k=1}^q Σ_{m=k}^q r_km x_k x_m,

where the parameter vector r has components r_0, r_k, and r_km, with k = 1, . . . , q, m = k, . . . , q. In fact, any kind of approximating function that is polynomial in the components x_1, . . . , x_q can be constructed similarly.
It is also possible to combine feature extraction with polynomial approximations. For example, the feature vector φ(i) = (φ_1(i), . . . , φ_s(i))′ transformed by a quadratic polynomial mapping leads to approximating functions of the form

J̃(i, r) = r_0 + Σ_{k=1}^s r_k φ_k(i) + Σ_{k=1}^s Σ_{ℓ=1}^s r_kℓ φ_k(i) φ_ℓ(i),

where the parameter vector r has components r_0, r_k, and r_kℓ, with k, ℓ = 1, . . . , s. This function can be viewed as a linear cost approximation that uses the basis functions

w_0(i) = 1,  w_k(i) = φ_k(i),  w_kℓ(i) = φ_k(i) φ_ℓ(i),  k, ℓ = 1, . . . , s.
Example 6.1.2 (Interpolation)

A common type of approximation of a function J is based on interpolation. Here, a set I of special states is selected, and the parameter vector r has one component r_i per state i ∈ I, which is the value of J at i:

r_i = J(i),  i ∈ I.

The value of J at states i ∉ I is approximated by some form of interpolation using r.

Interpolation may be based on geometric proximity. For a simple example that conveys the basic idea, let the system states be the integers within some interval, let I be a subset of special states, and for each state i let i̲ and ī be the states in I that are closest to i from below and from above. Then for any state i, J̃(i, r) is obtained by linear interpolation of the costs r_i̲ = J(i̲) and r_ī = J(ī):

J̃(i, r) = ((ī − i)/(ī − i̲)) r_i̲ + ((i − i̲)/(ī − i̲)) r_ī.

The scalars multiplying the components of r may be viewed as features, so the feature vector of i above consists of two nonzero features (the ones corresponding to i̲ and ī), with all other features being 0. Similar examples can be constructed for the case where the state space is a subset of a multidimensional space (see Example 6.3.13 of Vol. I).
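The two-nonzero-feature structure of this example can be sketched as follows; the helper name and the data layout (a sorted list of special states) are illustrative, not from the text.

```python
import numpy as np

def interp_features(i, special):
    """Feature vector for linear interpolation over a sorted list of special
    states: two nonzero weights, at the nearest special states below and
    above i (a single weight of 1 if i is itself special or out of range)."""
    special = sorted(special)
    phi = np.zeros(len(special))
    if i <= special[0]:
        phi[0] = 1.0
        return phi
    if i >= special[-1]:
        phi[-1] = 1.0
        return phi
    hi = next(k for k, s in enumerate(special) if s >= i)
    if special[hi] == i:
        phi[hi] = 1.0
        return phi
    lo = hi - 1
    gap = special[hi] - special[lo]
    phi[lo] = (special[hi] - i) / gap   # weight of r at the state below
    phi[hi] = (i - special[lo]) / gap   # weight of r at the state above
    return phi

special = [0, 4, 10]
r = np.array([1.0, 5.0, 8.0])                    # r_i = J(i) at special states
assert interp_features(2, special) @ r == 3.0    # halfway between J(0) and J(4)
```

The dot product of the feature vector with r reproduces the interpolation formula of the example, so this architecture is linear in r, as claimed.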
A generalization of the preceding example is approximation based on aggregation; see Section 6.3.4 of Vol. I and the subsequent Section 6.5 in this chapter. There are also interesting nonlinear approximation architectures, including those defined by neural networks, perhaps in combination with feature extraction mappings (see Bertsekas and Tsitsiklis [BeT96], or Sutton and Barto [SuB98] for further discussion). In this chapter, we will not address the choice of the structure of J̃(i, r) or associated basis functions, but rather focus on various iterative algorithms for obtaining a suitable parameter vector r. However, we will mostly focus on the case of linear architectures, because many of the policy evaluation algorithms of this chapter are valid only for that case. Note that there is considerable ongoing research on automatic basis function generation approaches (see, e.g., Keller, Mannor, and Precup [KMP06], and Jung and Polani [JuP07]).
We finally mention the possibility of optimal selection of basis functions within some restricted class. In particular, consider an approximation subspace

S_θ = {Φ(θ)r | r ∈ ℜ^s},

where the s columns of the n × s matrix Φ are basis functions parametrized by a vector θ. Assume that for a given θ, there is a corresponding vector r(θ), obtained using some algorithm, so that Φ(θ)r(θ) is an approximation of a cost function J (various such algorithms will be presented later in this chapter). Then we may wish to select θ so that some measure of approximation quality is optimized. For example, suppose that we can compute the true cost values J(i) (or more generally, approximations to these values) for a subset of selected states I. Then we may determine θ so that

Σ_{i∈I} (J(i) − φ(i, θ)′r(θ))²

is minimized, where φ(i, θ)′ is the ith row of Φ(θ). Alternatively, we may determine θ so that the norm of the error in satisfying Bellman's equation,

‖Φ(θ)r(θ) − T(Φ(θ)r(θ))‖²,

is minimized. Gradient and random search algorithms for carrying out such minimizations have been proposed in the literature (see Menache, Mannor, and Shimkin [MMS06], and Yu and Bertsekas [YuB09]).
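As a minimal illustration of such a selection scheme, the following sketch tunes a single width parameter θ of invented Gaussian basis functions by crude grid search, with r(θ) obtained by a least squares fit to known values J(i); everything here (the basis family, the grid search, the data) is an assumption for the example, not a method prescribed in the text.

```python
import numpy as np

states = np.arange(10.0)
J = np.sqrt(states)                      # "true" costs, invented for the demo
centers = np.array([0.0, 4.5, 9.0])      # s = 3 basis-function centers

def Phi(theta):
    # Parametrized basis: phi_k(i) = exp(-(i - c_k)^2 / (2 theta^2)).
    return np.exp(-(states[:, None] - centers[None, :])**2 / (2 * theta**2))

def r_of_theta(theta):
    # r(theta): least squares fit of Phi(theta) r to J on the sampled states.
    return np.linalg.lstsq(Phi(theta), J, rcond=None)[0]

def score(theta):
    # sum over i of (J(i) - phi(i, theta)' r(theta))^2
    resid = J - Phi(theta) @ r_of_theta(theta)
    return float(resid @ resid)

grid = [0.5, 1.0, 2.0, 4.0, 8.0]
best_theta = min(grid, key=score)
```

The cited works replace the grid search with gradient or random search over θ, but the structure (an inner fit producing r(θ), an outer search over θ) is the same.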
6.1.2 Approximate Policy Iteration
Let us consider a form of approximate policy iteration, where we compute simulation-based approximations J̃(·, r) to the cost functions J_µ of stationary policies µ, and we use them to compute new policies based on (approximate) policy improvement. We impose no constraints on the approximation architecture, so J̃(i, r) may be linear or nonlinear in r.
Suppose that the current policy is µ, and for a given r, J̃(i, r) is an approximation of J_µ(i). We generate an "improved" policy µ̄ using the formula

µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n p_ij(u) (g(i, u, j) + α J̃(j, r)),  for all i. (6.5)
The method is illustrated in Fig. 6.1.1. Its theoretical basis was discussed in Section 1.3 (cf. Prop. 1.3.6), where it was shown that if the policy evaluation is accurate to within δ (in the sup-norm sense), then for an α-discounted problem, the method will yield in the limit (after infinitely many policy evaluations) a stationary policy that is optimal to within

2αδ/(1 − α)²,

where α is the discount factor. Experimental evidence indicates that this bound is usually conservative. Furthermore, often just a few policy evaluations are needed before the bound is attained.
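The structure of the policy iteration loop can be sketched as follows. For brevity the sketch evaluates each policy exactly by solving a linear system, whereas the methods of this chapter would replace that step with a simulation-based approximation J̃(·, r); all problem data below are invented.

```python
import numpy as np

alpha, n = 0.9, 3
p = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
              [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]]])  # p[u,i,j]
g = np.array([[2.0, 1.0, 3.0], [1.0, 2.0, 0.5]])   # expected cost g[u,i]

def evaluate(mu):
    # Exact policy evaluation: solve (I - alpha P_mu) J = g_mu for J_mu.
    P_mu = p[mu, np.arange(n)]
    g_mu = g[mu, np.arange(n)]
    return np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)

def improve(J):
    # Policy improvement, cf. Eq. (6.5): minimize over u for each state i.
    return np.argmin(g.T + alpha * (p @ J).T, axis=1)

mu = np.zeros(n, dtype=int)      # initial policy guess
for _ in range(20):
    mu_new = improve(evaluate(mu))
    if np.array_equal(mu_new, mu):
        break
    mu = mu_new
```

With exact evaluation the loop terminates at an optimal policy after finitely many iterations; the point of the simulation-based variants is to retain this loop while removing the need for p and g in the evaluation step.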
Figure 6.1.1. Block diagram of approximate policy iteration: starting from a guessed initial policy, the approximate cost J̃_µ(r) = Φr of the current policy µ is evaluated using simulation, and an "improved" policy µ̄ is then generated by policy improvement.
When the sequence of policies obtained actually converges to some µ, then it can be proved that µ is optimal to within

2αδ/(1 − α)

(see Section 6.5.2, where it is shown that if policy evaluation is done using an aggregation approach, the generated sequence of policies does converge).
A simulation-based implementation of the algorithm is illustrated in Fig. 6.1.2. It consists of four modules:

(a) The simulator, which given a state-control pair (i, u), generates the next state j according to the system's transition probabilities.

(b) The decision generator, which generates the control µ̄(i) of the improved policy at the current state i for use in the simulator.

(c) The cost-to-go approximator, which is the function J̃(j, r) that is used by the decision generator.

(d) The cost approximation algorithm, which accepts as input the output of the simulator and obtains the approximation J̃(·, r̄) of the cost of µ̄.

Note that there are two policies µ and µ̄, and parameter vectors r and r̄, which are simultaneously involved in this algorithm. In particular, r corresponds to the current policy µ, and the approximation J̃(·, r) is used in the policy improvement Eq. (6.5) to generate the new policy µ̄. At the same time, µ̄ drives the simulation that generates samples to be used by the algorithm that determines the parameter r̄ corresponding to µ̄, which will be used in the next policy iteration.
Figure 6.1.2. Simulation-based implementation of the approximate policy iteration algorithm. The system simulator, decision generator, cost-to-go approximator, and cost approximation algorithm are connected in a loop: given the approximation J̃(j, r), we generate cost samples of the "improved" policy µ̄ by simulation (the "decision generator" module), and we use these samples to generate the approximator J̃(j, r̄) of µ̄.
The Issue of Exploration
Let us note an important generic difficulty with simulation-based policy iteration: to evaluate a policy µ, we need to generate cost samples using that policy, but this biases the simulation by underrepresenting states that are unlikely to occur under µ. As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate, causing potentially serious errors in the calculation of the improved control policy µ̄ via the policy improvement Eq. (6.5).

The difficulty just described is known as inadequate exploration of the system's dynamics because of the use of a fixed policy. It is a particularly acute difficulty when the system is deterministic, or when the randomness embodied in the transition probabilities is "relatively small." One possibility for guaranteeing adequate exploration of the state space is to frequently restart the simulation and to ensure that the initial states employed form a rich and representative subset. A related approach, called iterative resampling, is to enrich the sampled set of states in evaluating the current policy µ as follows: derive an initial cost evaluation of µ, simulate the next policy µ̄ obtained on the basis of this initial evaluation to obtain a set of representative states S visited by µ̄, and repeat the evaluation of µ using additional trajectories initiated from S.

Still another frequently used approach is to artificially introduce some extra randomization in the simulation, by occasionally generating transitions that use a randomly selected control rather than the one dictated by the policy µ. This and other possibilities to improve exploration will be discussed further in Section 6.3.8.
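The extra-randomization idea can be sketched in a few lines; the function name and the randomization rate eps are illustrative, not from the text.

```python
import random

def randomized_control(i, mu, U, eps=0.1, rng=random):
    """Occasionally replace the policy's control with a random feasible one,
    to improve exploration during simulation (eps is the randomization rate)."""
    if rng.random() < eps:
        return rng.choice(U(i))   # explore: uniformly random feasible control
    return mu(i)                  # exploit: control dictated by the policy mu

# Usage: a two-control problem where the policy always picks control 0.
U = lambda i: [0, 1]
mu = lambda i: 0
random.seed(0)
controls = [randomized_control(5, mu, U, eps=0.5) for _ in range(1000)]
frac_random = sum(c == 1 for c in controls) / len(controls)
```

Here the off-policy control 1 is generated roughly a quarter of the time (half the draws explore, and half of those pick control 1), so states reachable only under control 1 also appear in the cost samples.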
Limited Sampling/Optimistic Policy Iteration
In the approximate policy iteration approach discussed so far, the policy evaluation of the cost of the improved policy µ̄ must be fully carried out. An alternative, known as optimistic policy iteration, is to replace the policy µ with the policy µ̄ after only a few simulation samples have been processed, at the risk of J̃(·, r) being an inaccurate approximation of J_µ.
Optimistic policy iteration is discussed extensively in the book by Bertsekas and Tsitsiklis [BeT96], together with several variants. It has been successfully used, among others, in an impressive backgammon application (Tesauro [Tes92]). However, the associated theoretical convergence properties are not fully understood. As shown in Section 6.4.2 of [BeT96], optimistic policy iteration can exhibit fascinating and counterintuitive behavior, including a natural tendency for a phenomenon called chattering, whereby the generated parameter sequence {r_k} converges, while the generated policy sequence oscillates because the limit of {r_k} corresponds to multiple policies (see also the discussion in Section 6.3.5).

We note that optimistic policy iteration tends to deal better with the problem of exploration discussed earlier, because with rapid changes of policy, there is less tendency to bias the simulation towards particular states that are favored by any single policy.
Approximate Policy Iteration Based on QFactors
The approximate policy iteration method discussed so far relies on the calculation of the approximation $\tilde J(\cdot, r)$ to the cost function $J_\mu$ of the current policy, which is then used for policy improvement using the minimization
$$\bar\mu(i) = \arg\min_{u\in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \alpha \tilde J(j,r)\bigr).$$
Carrying out this minimization requires knowledge of the transition probabilities $p_{ij}(u)$ and calculation of the associated expected values for all controls $u \in U(i)$ (otherwise a time-consuming simulation of these expected values is needed). An interesting alternative is to compute approximate Q-factors
$$\tilde Q(i,u,r) \approx \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \alpha J_\mu(j)\bigr), \tag{6.6}$$
and use the minimization
$$\bar\mu(i) = \arg\min_{u\in U(i)} \tilde Q(i,u,r) \tag{6.7}$$
for policy improvement. Here, $r$ is an adjustable parameter vector and $\tilde Q(i,u,r)$ is a parametric architecture, possibly of the linear form
$$\tilde Q(i,u,r) = \sum_{k=1}^s r_k \phi_k(i,u),$$
where $\phi_k(i,u)$ are basis functions that depend on both state and control [cf. Eq. (6.4)].
The important point here is that given the current policy $\mu$, we can construct Q-factor approximations $\tilde Q(i,u,r)$ using any method for constructing cost approximations $\tilde J(i,r)$. The way to do this is to apply the latter method to the Markov chain whose states are the pairs $(i,u)$, and the probability of transition from $(i,u)$ to $(j,v)$ is $p_{ij}(u)$ if $v = \mu(j)$, and is 0 otherwise. This is the probabilistic mechanism by which state-control pairs evolve under the stationary policy $\mu$.

A major concern with this approach is that the state-control pairs $(i,u)$ with $u \neq \mu(i)$ are never generated in this Markov chain, so they are not represented in the cost samples used to construct the approximation $\tilde Q(i,u,r)$ (see Fig. 6.1.3). This creates an acute difficulty due to diminished exploration, which must be carefully addressed in any simulation-based implementation. We will return to the use of Q-factors in Section 6.4, where we will discuss exact and approximate implementations of the Q-learning algorithm.
Figure 6.1.3. Markov chain underlying Q-factor-based policy evaluation, associated with policy $\mu$. The states are the pairs $(i,u)$, and the probability of transition from $(i,u)$ to $(j,v)$ is $p_{ij}(u)$ if $v = \mu(j)$, and is 0 otherwise. Thus, after the first transition, the generated pairs are exclusively of the form $(i, \mu(i))$; pairs of the form $(i,u)$, $u \neq \mu(i)$, are not explored.
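As a sketch of how little is needed for the improvement step (6.7), the fragment below carries it out with a hypothetical linear feature map `phi`; note that no transition probabilities or expected values appear anywhere:

```python
import numpy as np

def q_hat(phi, r, i, u):
    """Linear Q-factor architecture: Q~(i,u,r) = sum_k r_k phi_k(i,u).
    'phi(i,u)' returns the feature vector -- a hypothetical basis."""
    return float(np.dot(phi(i, u), r))

def improved_policy(phi, r, controls):
    """Policy improvement using Q-factors only, cf. Eq. (6.7):
    a plain minimization over the feasible controls of each state."""
    return {i: min(controls[i], key=lambda u: q_hat(phi, r, i, u))
            for i in controls}

# Toy instance: 2 states, controls {0, 1}, features phi(i, u) = (1, i + u).
phi = lambda i, u: np.array([1.0, float(i + u)])
r = np.array([0.0, 1.0])            # so Q~(i, u, r) = i + u
mu_bar = improved_policy(phi, r, {0: [0, 1], 1: [0, 1]})
# Q~ is increasing in u here, so the improved policy picks u = 0 everywhere.
```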
6.1.3 Direct and Indirect Approximation
We will now preview two general algorithmic approaches for approximating the cost function of a fixed stationary policy $\mu$ within a subspace of the form $S = \{\Phi r \mid r \in \Re^s\}$. (A third approach, based on aggregation, uses a special type of matrix $\Phi$ and is discussed in Section 6.5.) The first and most straightforward approach, referred to as direct, is to find an approximation $\tilde J \in S$ that best matches $J_\mu$ in some normed error sense, i.e.,
$$\min_{\tilde J \in S} \|J_\mu - \tilde J\|,$$
or equivalently,
$$\min_{r\in\Re^s} \|J_\mu - \Phi r\|$$
(see the left-hand side of Fig. 6.1.4).‡ Here, $\|\cdot\|$ is usually some (possibly weighted) Euclidean norm, in which case the approximation problem is a linear least squares problem, whose solution, denoted $r^*$, can in principle be obtained in closed form by solving the associated quadratic minimization problem. If the matrix $\Phi$ has linearly independent columns, the solution is unique and can also be represented as
$$\Phi r^* = \Pi J_\mu,$$
where $\Pi$ denotes projection with respect to $\|\cdot\|$ on the subspace $S$.† A major difficulty is that specific cost function values $J_\mu(i)$ can only be estimated through their simulation-generated cost samples, as we discuss in Section 6.2.
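For a small problem the direct approach can be carried out exactly; the sketch below solves the weighted least squares problem in closed form (here the exact $J_\mu$ is supplied, whereas in practice only simulation-generated cost samples would be available):

```python
import numpy as np

def direct_approx(Phi, J, xi):
    """Weighted least squares fit of J on the subspace S = {Phi r}:
    minimizes ||J - Phi r||_xi^2 with weight vector xi (a sketch of the
    'direct' method with J known exactly)."""
    Xi = np.diag(xi)
    A = Phi.T @ Xi @ Phi
    b = Phi.T @ Xi @ J
    r = np.linalg.solve(A, b)       # unique if Phi has independent columns
    return r, Phi @ r               # (r*, the projection Pi J)

# 4 states, 2 basis functions, uniform weights; J happens to lie in S,
# so the projection recovers J exactly.
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
J = np.array([1.0, 2.0, 3.0, 4.0])
r_star, PiJ = direct_approx(Phi, J, np.full(4, 0.25))
```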
An alternative and more popular approach, referred to as indirect, is to approximate the solution of Bellman's equation $J = T_\mu J$ on the subspace $S$ (see the right-hand side of Fig. 6.1.4). An important example of this approach, which we will discuss in detail in Section 6.3, leads to the problem of finding a vector $r^*$ such that
$$\Phi r^* = \Pi T_\mu(\Phi r^*). \tag{6.8}$$
We can view this equation as a projected form of Bellman's equation. We will consider another type of indirect approach based on aggregation in Section 6.5.
‡ Note that direct approximation may be used in other approximate DP contexts, such as finite horizon problems, where we use sequential single-stage approximation of the cost-to-go functions $J_k$, going backwards (i.e., starting with $J_N$, we obtain a least squares approximation of $J_{N-1}$, which is used in turn to obtain a least squares approximation of $J_{N-2}$, etc). This approach is sometimes called fitted value iteration.

† In what follows in this chapter, we will not distinguish between the linear operation of projection and the corresponding matrix representation, denoting them both by $\Pi$. The meaning should be clear from the context.
We note that solving projected equations as approximations to more complex/higher-dimensional equations has a long history in scientific computation in the context of Galerkin methods (see e.g., [Kra72]). For example, some of the most popular finite-element methods for partial differential equations are of this type. However, the use of the Monte Carlo simulation ideas that are central in approximate DP is an important characteristic that differentiates the methods of the present chapter from the Galerkin methodology.
Figure 6.1.4. Two methods for approximating the cost function Jµ as a linear
combination of basis functions (subspace S). In the direct method (ﬁgure on
the left), Jµ is projected on S. In the indirect method (ﬁgure on the right), the
approximation is found by solving Φr = ΠTµ(Φr), a projected form of Bellman’s
equation.
An important fact here is that $\Pi T_\mu$ is a contraction, provided we use a special weighted Euclidean norm for projection, as will be proved in Section 6.3 for discounted problems (Prop. 6.3.1). In this case, Eq. (6.8) has a unique solution, and allows the use of algorithms such as LSPE($\lambda$) and TD($\lambda$), which are discussed in Section 6.3. Unfortunately, the contraction property of $\Pi T_\mu$ does not extend to the case where $T_\mu$ is replaced by $T$, the DP mapping corresponding to multiple/all policies, although there are some interesting exceptions, one of which relates to optimal stopping problems and is discussed in Section 6.4.3.
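For illustration, when $T_\mu J = g + \alpha P J$ is linear, Eq. (6.8) reduces to a low-dimensional linear system in $r$. The sketch below forms that system densely (the names `C` and `d` are our own here); the simulation-based methods of Section 6.3 estimate such quantities instead of computing them exactly:

```python
import numpy as np

def projected_fixed_point(Phi, P, g, alpha, xi):
    """Solve Phi r = Pi T_mu(Phi r) for T_mu J = g + alpha P J, with
    projection weighted by xi, by writing the orthogonality condition
    Phi' Xi (Phi r - alpha P Phi r - g) = 0 as a linear system C r = d.
    (A small dense sketch, not a simulation-based method.)"""
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(len(g)) - alpha * P) @ Phi
    d = Phi.T @ Xi @ g
    return np.linalg.solve(C, d)

# Sanity check on a 2-state chain: with Phi = I the projection is exact,
# so the solution must coincide with J_mu = (I - alpha P)^{-1} g.
P = np.array([[0.5, 0.5], [0.2, 0.8]])
g = np.array([1.0, 2.0])
alpha = 0.9
r = projected_fixed_point(np.eye(2), P, g, alpha, np.array([0.5, 0.5]))
J_mu = np.linalg.solve(np.eye(2) - alpha * P, g)
```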
6.1.4 Simpliﬁcations
We now consider various situations where the special structure of the problem may be exploited to simplify policy iteration or other approximate DP algorithms.
Problems with Uncontrollable State Components
In many problems of interest the state is a composite $(i, y)$ of two components $i$ and $y$, and the evolution of the main component $i$ can be directly affected by the control $u$, but the evolution of the other component $y$ cannot. Then as discussed in Section 1.4 of Vol. I, the value and the policy iteration algorithms can be carried out over a smaller state space, the space of the controllable component $i$. In particular, we assume that given the state $(i,y)$ and the control $u$, the next state $(j,z)$ is determined as follows: $j$ is generated according to transition probabilities $p_{ij}(u,y)$, and $z$ is generated according to conditional probabilities $p(z \mid j)$ that depend on the main component $j$ of the new state (see Fig. 6.1.5). Let us assume for notational convenience that the cost of a transition from state $(i,y)$ is of the form $g(i,y,u,j)$ and does not depend on the uncontrollable component $z$ of the next state. If $g$ depends on $z$ it can be replaced by
$$\hat g(i,y,u,j) = \sum_z p(z\mid j)\, g(i,y,u,j,z)$$
in what follows.
Figure 6.1.5. States and transition probabilities for a problem with uncontrol
lable state components.
For an $\alpha$-discounted problem, consider the mapping $\hat T$ defined by
$$(\hat T \hat J)(i) = \sum_y p(y\mid i)\,(TJ)(i,y) = \sum_y p(y\mid i) \min_{u\in U(i,y)} \sum_{j=0}^n p_{ij}(u,y)\bigl(g(i,y,u,j) + \alpha \hat J(j)\bigr),$$
and the corresponding mapping for a stationary policy $\mu$,
$$(\hat T_\mu \hat J)(i) = \sum_y p(y\mid i)\,(T_\mu J)(i,y) = \sum_y p(y\mid i) \sum_{j=0}^n p_{ij}\bigl(\mu(i,y), y\bigr)\Bigl(g\bigl(i,y,\mu(i,y),j\bigr) + \alpha \hat J(j)\Bigr).$$
Bellman's equation, defined over the controllable state component $i$, takes the form
$$\hat J(i) = (\hat T \hat J)(i), \quad \text{for all } i. \tag{6.9}$$
The typical iteration of the simplified policy iteration algorithm consists of two steps:

(a) The policy evaluation step, which given the current policy $\mu^k(i,y)$, computes the unique $\hat J_{\mu^k}(i)$, $i = 1,\ldots,n$, that solve the linear system of equations $\hat J_{\mu^k} = \hat T_{\mu^k} \hat J_{\mu^k}$, or equivalently
$$\hat J_{\mu^k}(i) = \sum_y p(y\mid i) \sum_{j=0}^n p_{ij}\bigl(\mu^k(i,y), y\bigr)\Bigl(g\bigl(i,y,\mu^k(i,y),j\bigr) + \alpha \hat J_{\mu^k}(j)\Bigr)$$
for all $i = 1,\ldots,n$.

(b) The policy improvement step, which computes the improved policy $\mu^{k+1}(i,y)$, from the equation $\hat T_{\mu^{k+1}} \hat J_{\mu^k} = \hat T \hat J_{\mu^k}$, or equivalently
$$\mu^{k+1}(i,y) = \arg\min_{u\in U(i,y)} \sum_{j=0}^n p_{ij}(u,y)\bigl(g(i,y,u,j) + \alpha \hat J_{\mu^k}(j)\bigr), \quad \text{for all } (i,y).$$

Approximate policy iteration algorithms can be similarly carried out in reduced form.
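A minimal sketch of the policy evaluation step (a) in Python, with the problem data packed into small dense arrays; the names `p_y`, `P_mu`, and `g_mu` are hypothetical stand-ins for $p(y\mid i)$, the transition matrices under $\mu(\cdot,y)$, and the expected stage costs:

```python
import numpy as np

def evaluate_over_controllable(p_y, P_mu, g_mu, alpha):
    """Policy evaluation over the controllable component only: solves
    J^(i) = sum_y p(y|i) sum_j P_mu[y][i,j] (g_mu[y][i] + alpha J^(j))
    as one linear system in the n controllable states. Here p_y[i, y]
    is p(y|i), P_mu[y] the transition matrix under mu(., y), and
    g_mu[y][i] the expected stage cost (hypothetical dense arrays)."""
    n, m = p_y.shape
    A = np.zeros((n, n))             # y-averaged transition matrix
    b = np.zeros(n)                  # y-averaged stage cost
    for y in range(m):
        A += p_y[:, [y]] * P_mu[y]   # scale row i by p(y|i)
        b += p_y[:, y] * g_mu[y]
    return np.linalg.solve(np.eye(n) - alpha * A, b)

# Toy data: 2 controllable states, 2 values of y, self-transitions only,
# stage costs (1, 2), alpha = 0.9, so J^ = (10, 20).
p_y = np.array([[0.5, 0.5], [0.5, 0.5]])
P_mu = [np.eye(2), np.eye(2)]
g_mu = [np.array([1.0, 2.0]), np.array([1.0, 2.0])]
J_hat = evaluate_over_controllable(p_y, P_mu, g_mu, 0.9)
```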
Problems with Post-Decision States
In some stochastic problems, the transition probabilities and stage costs have the special form
$$p_{ij}(u) = q\bigl(j \mid f(i,u)\bigr), \tag{6.10}$$
where $f$ is some function and $q\bigl(\cdot \mid f(i,u)\bigr)$ is a given probability distribution for each value of $f(i,u)$. In words, the dependence of the transitions on $(i,u)$ comes through the function $f(i,u)$. We may exploit this structure by viewing $f(i,u)$ as a form of state: a post-decision state that determines the probabilistic evolution to the next state. An example where the conditions (6.10) are satisfied are inventory control problems of the type considered in Section 4.2 of Vol. I. There the post-decision state at time $k$ is $x_k + u_k$, i.e., the post-purchase inventory, before any demand at time $k$ has been filled.

Post-decision states can be exploited when the stage cost has no dependence on $j$,† i.e., when we have (with some notation abuse)
$$g(i,u,j) = g(i,u).$$
Then the optimal cost-to-go within an $\alpha$-discounted context at state $i$ is given by
$$J^*(i) = \min_{u\in U(i)} \bigl(g(i,u) + \alpha V^*\bigl(f(i,u)\bigr)\bigr),$$
while the optimal cost-to-go at post-decision state $m$ (optimal sum of costs of future stages) is given by
$$V^*(m) = \sum_{j=1}^n q(j\mid m)\, J^*(j).$$
In effect, we consider a modified problem where the state space is enlarged to include post-decision states, with transitions between ordinary states and post-decision states specified by the function $f$ and the distribution $q\bigl(\cdot \mid f(i,u)\bigr)$ (see Fig. 6.1.6). The preceding two equations represent Bellman's equation for this modified problem.
Figure 6.1.6. Modiﬁed problem where the postdecision states are viewed as
additional states.
Combining these equations, we have
$$V^*(m) = \sum_{j=1}^n q(j\mid m) \min_{u\in U(j)} \bigl(g(j,u) + \alpha V^*\bigl(f(j,u)\bigr)\bigr), \quad \forall\ m, \tag{6.11}$$
which can be viewed as Bellman's equation over the space of post-decision states $m$. This equation is similar to Q-factor equations, but is defined over the space of post-decision states rather than the larger space of state-control pairs. The advantage of this equation is that once the function $V^*$ is calculated (or approximated), the optimal policy can be computed as
$$\mu^*(i) = \arg\min_{u\in U(i)} \bigl(g(i,u) + \alpha V^*\bigl(f(i,u)\bigr)\bigr),$$
which does not require knowledge of the transition probabilities or computation of an expected value. It involves a deterministic optimization, and it can be used in a model-free context (as long as the functions $g$ and $f$ are known). This is important if the calculation of the optimal policy is done online.

† If there is dependence on $j$, one may consider computing, possibly by simulation, (an approximation to) $g(i,u) = \sum_{j=1}^n p_{ij}(u)\, g(i,u,j)$, and using it in place of $g(i,u,j)$.
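The computation of $\mu^*$ from $V^*$ can be sketched as follows; `V`, `f`, `g`, and `controls` are hypothetical stand-ins for the problem data, and only function evaluations (no expectations, no transition probabilities) are involved:

```python
def greedy_from_postdecision(V, f, g, controls, alpha):
    """Compute mu*(i) = argmin_{u in U(i)} [ g(i,u) + alpha V(f(i,u)) ].
    Only the deterministic functions g and f are needed, so the policy
    can be computed online and model-free. (Sketch; in practice V would
    be an approximation of V*.)"""
    return {i: min(controls[i], key=lambda u: g(i, u) + alpha * V[f(i, u)])
            for i in controls}

# Toy data: post-decision state m = f(i, u) = u; V strongly favors m = 1,
# so the greedy policy picks u = 1 in every state.
V = {0: 5.0, 1: 0.0}
mu = greedy_from_postdecision(V, lambda i, u: u, lambda i, u: 1.0,
                              {0: [0, 1], 1: [0, 1]}, alpha=0.9)
```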
It is straightforward to construct a policy iteration algorithm that is defined over the space of post-decision states. The cost-to-go function $V_\mu$ of a stationary policy $\mu$ is the unique solution of the corresponding Bellman equation
$$V_\mu(m) = \sum_{j=1}^n q(j\mid m)\Bigl(g\bigl(j,\mu(j)\bigr) + \alpha V_\mu\Bigl(f\bigl(j,\mu(j)\bigr)\Bigr)\Bigr), \quad \forall\ m.$$
Given $V_\mu$, the improved policy is obtained as
$$\bar\mu(i) = \arg\min_{u\in U(i)} \bigl(g(i,u) + \alpha V_\mu\bigl(f(i,u)\bigr)\bigr), \quad i = 1,\ldots,n.$$
There are also corresponding approximate policy iteration methods with
cost function approximation.
An advantage of this method when implemented by simulation is that the computation of the improved policy does not require the calculation of expected values. Moreover, with a simulator, the policy evaluation of $V_\mu$ can be done in model-free fashion, without explicit knowledge of the probabilities $q(j\mid m)$. These advantages are shared with policy iteration algorithms based on Q-factors. However, when function approximation is used in policy iteration, the methods using post-decision states may have a significant advantage over Q-factor-based methods: they use cost function approximation in the space of post-decision states, rather than the larger space of state-control pairs, and they are less susceptible to difficulties due to inadequate exploration.
We note that there is a similar simplification with post-decision states when $g$ is of the form
$$g(i,u,j) = h\bigl(f(i,u), j\bigr),$$
for some function $h$. Then we have
$$J^*(i) = \min_{u\in U(i)} V^*\bigl(f(i,u)\bigr),$$
where $V^*$ is the unique solution of the equation
$$V^*(m) = \sum_{j=1}^n q(j\mid m)\Bigl(h(m,j) + \alpha \min_{u\in U(j)} V^*\bigl(f(j,u)\bigr)\Bigr), \quad \forall\ m.$$
Here $V^*(m)$ should be interpreted as the optimal cost-to-go from post-decision state $m$, including the cost $h(m,j)$ incurred within the stage when $m$ was generated. When $h$ does not depend on $j$, the algorithm takes the simpler form
$$V^*(m) = h(m) + \alpha \sum_{j=1}^n q(j\mid m) \min_{u\in U(j)} V^*\bigl(f(j,u)\bigr), \quad \forall\ m. \tag{6.12}$$
Example 6.1.3 (Tetris)
Let us revisit the game of tetris, which was discussed in Example 1.4.1 of Vol.
I in the context of problems with an uncontrollable state component. We
will show that it also admits a postdecision state. Assuming that the game
terminates with probability 1 for every policy (a proof of this has been given
by Burgiel [Bur97]), we can model the problem of ﬁnding an optimal tetris
playing strategy as a stochastic shortest path problem.
The state consists of two components:
(1) The board position, i.e., a binary description of the full/empty status
of each square, denoted by x.
(2) The shape of the current falling block, denoted by y (this is the uncon
trollable component).
The control, denoted by u, is the horizontal positioning and rotation applied
to the falling block.
Bellman's equation over the space of the controllable state component takes the form
$$\hat J(x) = \sum_y p(y) \max_u \bigl(g(x,y,u) + \hat J\bigl(f(x,y,u)\bigr)\bigr), \quad \text{for all } x,$$
where $g(x,y,u)$ and $f(x,y,u)$ are the number of points scored (rows removed), and the board position when the state is $(x,y)$ and control $u$ is applied, respectively [cf. Eq. (6.9)].
This problem also admits a post-decision state. Once $u$ is applied at state $(x,y)$, a new board position $m$ is obtained, and the new state component $x$ is obtained from $m$ after removing a number of rows. Thus we have
$$m = f(x,y,u)$$
for some function $f$, and $m$ also determines the reward of the stage, which has the form $h(m)$ for some function $h$ [$h(m)$ is the number of complete rows that can be removed from $m$]. Thus, $m$ may serve as a post-decision state, and the corresponding Bellman's equation takes the form (6.12), i.e.,
$$V^*(m) = h(m) + \sum_{(x,y)} q(m,x,y) \max_u V^*\bigl(f(x,y,u)\bigr), \quad \forall\ m,$$
where $(x,y)$ is the state that follows $m$, and $q(m,x,y)$ are the corresponding transition probabilities. Note that both of the simplified Bellman's equations share the same characteristic: they involve a deterministic optimization.
Trading off Complexity of Control Space with Complexity of State Space
Suboptimal control using cost function approximation deals fairly well with
large state spaces, but still encounters serious diﬃculties when the number
of controls available at each state is large. In particular, the minimization
$$\min_{u\in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \tilde J(j,r)\bigr)$$
using an approximate cost-to-go function $\tilde J(j,r)$ may be very time-consuming.
For multistep lookahead schemes, the diﬃculty is exacerbated, since the
required computation grows exponentially with the size of the lookahead
horizon. It is thus useful to know that by reformulating the problem, it
may be possible to reduce the complexity of the control space by increasing
the complexity of the state space. The potential advantage is that the
extra state space complexity may still be dealt with by using function
approximation and/or rollout.
In particular, suppose that the control $u$ consists of $m$ components,
$$u = (u_1, \ldots, u_m).$$
Then, at a given state $i$, we can break down $u$ into the sequence of the $m$ controls $u_1, u_2, \ldots, u_m$, and introduce artificial intermediate "states" $(i, u_1), (i, u_1, u_2), \ldots, (i, u_1, \ldots, u_{m-1})$, and corresponding transitions to model the effect of these controls. The choice of the last control component $u_m$ at "state" $(i, u_1, \ldots, u_{m-1})$ marks the transition to state $j$ according to the given transition probabilities $p_{ij}(u)$. In this way the control space is simplified at the expense of introducing $m-1$ additional layers of states, and $m-1$ additional cost-to-go functions
$$J^1(i, u_1),\ J^2(i, u_1, u_2),\ \ldots,\ J^{m-1}(i, u_1, \ldots, u_{m-1}).$$
To deal with the increase in size of the state space we may use rollout, i.e., when at "state" $(i, u_1, \ldots, u_k)$, assume that future controls $u_{k+1}, \ldots, u_m$ will be chosen by a base heuristic. Alternatively, we may use function approximation, that is, introduce cost-to-go approximations
$$\tilde J^1(i, u_1, r_1),\ \tilde J^2(i, u_1, u_2, r_2),\ \ldots,\ \tilde J^{m-1}(i, u_1, \ldots, u_{m-1}, r_{m-1}),$$
in addition to $\tilde J(i, r)$. We refer to [BeT96], Section 6.1.4, for further discussion.
A potential complication in the preceding schemes arises when the controls $u_1, \ldots, u_m$ are coupled through a constraint of the form
$$u = (u_1, \ldots, u_m) \in U(i). \tag{6.13}$$
Then, when choosing a control $u_k$, care must be exercised to ensure that the future controls $u_{k+1}, \ldots, u_m$ can be chosen together with the already chosen controls $u_1, \ldots, u_k$ to satisfy the feasibility constraint (6.13). This requires a variant of the rollout algorithm that works with constrained DP problems; see Exercise 6.19 of Vol. I, and also references [Ber05a], [Ber05b].
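A rollout-style sketch of this decomposition: each component $u_k$ is chosen to minimize the cost of the partial tuple completed by a base heuristic. The names `cost`, `components`, and `base` are hypothetical stand-ins for the problem data, and the unconstrained case is assumed:

```python
def sequential_control(cost, components, base, m):
    """Choose u = (u_1,...,u_m) one component at a time: u_k minimizes
    the cost of (u_1,...,u_k, base completion), with remaining components
    filled in by a base heuristic. 'cost' scores a complete control
    tuple, 'components[k]' lists the choices for u_k, and 'base(prefix)'
    completes a partial tuple (all hypothetical names)."""
    prefix = ()
    for k in range(m):
        prefix = min((prefix + (u,) for u in components[k]),
                     key=lambda p: cost(base(p)))
    return prefix

# Toy problem: minimize (u1 + 2*u2 - 3)^2 over u_k in {0, 1, 2}; the base
# heuristic pads the unchosen components with 0.
cost = lambda u: (u[0] + 2 * u[1] - 3) ** 2
base = lambda p: p + (0,) * (2 - len(p))
u = sequential_control(cost, [[0, 1, 2], [0, 1, 2]], base, 2)
```

Here the sequential choice reaches an optimal tuple; in general the quality of the result depends on the base heuristic, as the text notes.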
6.1.5 The Role of Contraction Mappings
In this and the next subsection, we will try to provide some orientation
into the mathematical content of this chapter. The reader may wish to
skip these subsections at ﬁrst, but return to them later for orientation and
a higher level view of some of the subsequent technical material.
Most of the chapter (Sections 6.3-6.8) deals with the approximate computation of a fixed point of a (linear or nonlinear) mapping $T$ within a subspace
$$S = \{\Phi r \mid r \in \Re^s\}.$$
We will discuss a variety of approaches with distinct characteristics, but at an abstract mathematical level, these approaches fall into two categories:

(a) A projected equation approach, based on the equation
$$\Phi r = \Pi T(\Phi r),$$
where $\Pi$ is a projection operation with respect to a Euclidean norm (see Section 6.3 for discounted problems, and Sections 6.6-6.8 for other types of problems).
(b) A Q-learning/aggregation approach, based on an equation of the form
$$\Phi r = \Phi D T(\Phi r),$$
where $D$ is an $s \times n$ matrix whose rows are probability distributions (see Sections 6.4 and 6.5). Section 6.4 includes a discussion of Q-learning (where $\Phi$ is the identity matrix) and other methods (where $\Phi$ has a more general form), and Section 6.5 is focused on aggregation (where $\Phi$ is restricted to satisfy certain conditions).
If T is linear, both types of methods lead to linear equations in the
parameter vector r, whose solution can be approximated by simulation.
The approach here is very simple: we approximate the matrices and vectors
involved in the equation Φr = ΠT(Φr) or Φr = ΦDT(Φr) by simulation,
and we solve the resulting (approximate) linear system by matrix inversion.
This is called the matrix inversion approach. A primary example is the
LSTD methods of Section 6.3.
When iterative methods are used (which is the only possibility when
T is nonlinear, and may be attractive in some cases even when T is linear),
it is important that ΠT and ΦDT be contractions over the subspace S.
Note here that even if T is a contraction mapping (as is ordinarily the
case in DP), it does not follow that ΠT and ΦDT are contractions. In
our analysis, this is resolved by requiring that T be a contraction with
respect to a norm such that Π or ΦD, respectively, is a nonexpansive
mapping. As a result, we need various assumptions on T, Φ, and D, which
guide the algorithmic development. We postpone further discussion of these
issues, but for the moment we note that the projection approach revolves
mostly around Euclidean norm contractions and cases where T is linear,
while the Qlearning/aggregation approach revolves mostly around sup
norm contractions.
6.1.6 The Role of Monte Carlo Simulation
The methods of this chapter rely to a large extent on simulation in conjunction with cost function approximation in order to deal with large state spaces. The advantage that simulation holds in this regard can be traced to its ability to compute (approximately) sums with a very large number of terms. These sums arise in a number of contexts: inner product and matrix-vector product calculations, the solution of linear systems of equations and policy evaluation, linear least squares problems, etc.
Example 6.1.4 (Approximate Policy Evaluation)
Consider the approximate solution of the Bellman equation that corresponds to a given policy of an $n$-state discounted problem:
$$J = g + \alpha P J,$$
where $P$ is the transition probability matrix and $\alpha$ is the discount factor. Let us adopt a hard aggregation approach (cf. Section 6.3.4 of Vol. I; see also Section 6.5 later in this chapter), whereby we divide the $n$ states in two disjoint subsets $I_1$ and $I_2$ with $I_1 \cup I_2 = \{1, \ldots, n\}$, and we use the piecewise constant approximation
$$J(i) = \begin{cases} r_1 & \text{if } i \in I_1, \\ r_2 & \text{if } i \in I_2. \end{cases}$$
This corresponds to the linear feature-based architecture $J \approx \Phi r$, where $\Phi$ is the $n \times 2$ matrix with column components equal to 1 or 0, depending on whether the component corresponds to $I_1$ or $I_2$.

We obtain the approximate equations
$$J(i) \approx g(i) + \alpha \Bigl(\sum_{j\in I_1} p_{ij}\Bigr) r_1 + \alpha \Bigl(\sum_{j\in I_2} p_{ij}\Bigr) r_2, \quad i = 1, \ldots, n,$$
which we can reduce to just two equations by forming two weighted sums (with equal weights) of the equations corresponding to the states in $I_1$ and $I_2$, respectively:
$$r_1 \approx \frac{1}{n_1} \sum_{i\in I_1} J(i), \qquad r_2 \approx \frac{1}{n_2} \sum_{i\in I_2} J(i),$$
where $n_1$ and $n_2$ are the numbers of states in $I_1$ and $I_2$, respectively. We thus obtain the aggregate system of the following two equations in $r_1$ and $r_2$:
$$r_1 = \frac{1}{n_1}\sum_{i\in I_1} g(i) + \frac{\alpha}{n_1}\Bigl(\sum_{i\in I_1}\sum_{j\in I_1} p_{ij}\Bigr) r_1 + \frac{\alpha}{n_1}\Bigl(\sum_{i\in I_1}\sum_{j\in I_2} p_{ij}\Bigr) r_2,$$
$$r_2 = \frac{1}{n_2}\sum_{i\in I_2} g(i) + \frac{\alpha}{n_2}\Bigl(\sum_{i\in I_2}\sum_{j\in I_1} p_{ij}\Bigr) r_1 + \frac{\alpha}{n_2}\Bigl(\sum_{i\in I_2}\sum_{j\in I_2} p_{ij}\Bigr) r_2.$$
Here the challenge, when the number of states $n$ is very large, is the calculation of the large sums in the right-hand side, which can be of order $O(n^2)$. Simulation allows the approximate calculation of these sums with complexity that is independent of $n$. This is similar to the advantage that Monte Carlo integration holds over numerical integration, as discussed in standard texts on Monte Carlo methods.
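When $n$ is small the aggregate system above can of course be formed and solved directly; the sketch below does so, with the understanding that for very large $n$ the double sums would be estimated by simulation instead:

```python
import numpy as np

def hard_aggregation_2(P, g, alpha, I1, I2):
    """Form and solve the two aggregate equations of Example 6.1.4:
    r_a = mean_{i in I_a} g(i)
          + alpha * (mean transition mass from I_a into I_c) * r_c,
    summed over the two groups c. A dense sketch for small n."""
    groups = [list(I1), list(I2)]
    A = np.zeros((2, 2))
    b = np.zeros(2)
    for a, Ia in enumerate(groups):
        b[a] = g[Ia].mean()
        for c, Ic in enumerate(groups):
            A[a, c] = alpha * P[np.ix_(Ia, Ic)].sum() / len(Ia)
    return np.linalg.solve(np.eye(2) - A, b)

# 4-state chain with uniform transitions, split into I1 = {0,1}, I2 = {2,3}:
P = np.full((4, 4), 0.25)
g = np.array([1.0, 1.0, 3.0, 3.0])
r = hard_aggregation_2(P, g, alpha=0.9, I1=[0, 1], I2=[2, 3])
```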
To see how simulation can be used with advantage, let us consider the problem of estimating a scalar sum of the form
$$z = \sum_{\omega\in\Omega} v(\omega),$$
where $\Omega$ is a finite set and $v : \Omega \to \Re$ is a function of $\omega$. We introduce a distribution $\xi$ that assigns positive probability $\xi(\omega)$ to every element $\omega \in \Omega$, and we generate a sequence
$$\{\omega_1, \ldots, \omega_T\}$$
of samples from $\Omega$, with each sample $\omega_t$ taking values from $\Omega$ according to $\xi$. We then estimate $z$ with
$$\hat z_T = \frac{1}{T} \sum_{t=1}^T \frac{v(\omega_t)}{\xi(\omega_t)}. \tag{6.14}$$
Clearly $\hat z_T$ is unbiased:
$$E[\hat z_T] = \frac{1}{T}\sum_{t=1}^T E\left[\frac{v(\omega_t)}{\xi(\omega_t)}\right] = \frac{1}{T}\sum_{t=1}^T \sum_{\omega\in\Omega} \xi(\omega)\,\frac{v(\omega)}{\xi(\omega)} = \sum_{\omega\in\Omega} v(\omega) = z.$$
Suppose now that the samples are generated in a way that the long-term frequency of each $\omega \in \Omega$ is equal to $\xi(\omega)$, i.e.,
$$\lim_{T\to\infty} \frac{\sum_{t=1}^T \delta(\omega_t = \omega)}{T} = \xi(\omega), \quad \forall\ \omega\in\Omega, \tag{6.15}$$
where $\delta(\cdot)$ denotes the indicator function [$\delta(E) = 1$ if the event $E$ has occurred and $\delta(E) = 0$ otherwise]. Then from Eq. (6.14), we have
$$\hat z_T = \sum_{\omega\in\Omega} \frac{\sum_{t=1}^T \delta(\omega_t = \omega)}{T}\cdot\frac{v(\omega)}{\xi(\omega)},$$
and by taking the limit as $T \to \infty$ and using Eq. (6.15),
$$\lim_{T\to\infty} \hat z_T = \sum_{\omega\in\Omega} \left(\lim_{T\to\infty} \frac{\sum_{t=1}^T \delta(\omega_t = \omega)}{T}\right) \frac{v(\omega)}{\xi(\omega)} = \sum_{\omega\in\Omega} v(\omega) = z.$$
Thus in the limit, as the number of samples increases, we obtain the desired sum $z$. An important case, of particular relevance to the methods of this chapter, is when $\Omega$ is the set of states of an irreducible Markov chain. Then, if we generate an infinitely long trajectory $\{\omega_1, \omega_2, \ldots\}$ starting from any initial state $\omega_1$, the condition (6.15) will hold with probability 1, with $\xi(\omega)$ being the steady-state probability of state $\omega$.
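Eq. (6.14) is easy to exercise on a toy sum. The sketch below uses independent samples from a uniform $\xi$; as the surrounding discussion explains, the accuracy depends on the variance of $v(\omega)/\xi(\omega)$, not on the number of elements of $\Omega$:

```python
import random

def mc_sum(v, omegas, xi, T, rng=random.Random(0)):
    """Estimate z = sum_omega v(omega) by Eq. (6.14): draw T samples
    from the distribution xi and average the ratios v(omega)/xi(omega)."""
    samples = rng.choices(omegas, weights=[xi[w] for w in omegas], k=T)
    return sum(v(w) / xi[w] for w in samples) / T

# Estimate z = sum_{w=1..1000} w = 500500 with a uniform sampling
# distribution over the 1000 terms.
omegas = list(range(1, 1001))
xi = {w: 1.0 / 1000 for w in omegas}
z_hat = mc_sum(lambda w: float(w), omegas, xi, T=20000)
```

A $\xi$ closer to proportional to $v$ would reduce the variance, which is the importance sampling idea discussed in the footnote below.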
The samples $\omega_t$ need not be independent for the preceding properties to hold, but if they are, then the variance of $\hat z_T$ is given by
$$\mathrm{var}(\hat z_T) = \frac{1}{T^2}\sum_{t=1}^T \sum_{\omega\in\Omega} \xi(\omega)\left(\frac{v(\omega)}{\xi(\omega)} - z\right)^2,$$
which can be written as
$$\mathrm{var}(\hat z_T) = \frac{1}{T}\left(\sum_{\omega\in\Omega} \frac{v(\omega)^2}{\xi(\omega)} - z^2\right). \tag{6.16}$$
An important observation from this formula is that the accuracy of the approximation does not depend on the number of terms in the sum $z$ (the number of elements in $\Omega$), but rather depends on the variance of the random variable that takes values $v(\omega)/\xi(\omega)$, $\omega \in \Omega$, with probabilities $\xi(\omega)$.† Thus, it is possible to execute approximately linear algebra operations of very large size through Monte Carlo sampling (with whatever distributions may be convenient in a given context), and this is a principal idea underlying the methods of this chapter.

In the case where the samples are dependent, the variance formula (6.16) does not hold, but similar qualitative conclusions can be drawn under various assumptions that ensure that the dependencies between samples become sufficiently weak over time (see the specialized literature).
6.2 DIRECT POLICY EVALUATION - GRADIENT METHODS
We will now consider the direct approach for policy evaluation.‡ In particular, suppose that the current policy is $\mu$, and for a given $r$, $\tilde J(i,r)$ is an approximation of $J_\mu(i)$. We generate an "improved" policy $\bar\mu$ using the
† The selection of the distribution $\{\xi(\omega) \mid \omega\in\Omega\}$ can be optimized (at least approximately), and methods for doing this are the subject of the technique of importance sampling. In particular, assuming that samples are independent and that $v(\omega) \ge 0$ for all $\omega\in\Omega$, we have
$$\mathrm{var}(\hat z_T) = \frac{z^2}{T}\left(\sum_{\omega\in\Omega} \frac{\bigl(v(\omega)/z\bigr)^2}{\xi(\omega)} - 1\right),$$
so the optimal distribution is $\xi^* = v/z$ and the corresponding minimum variance value is 0. However, $\xi^*$ cannot be computed without knowledge of $z$. Instead, $\xi$ is usually chosen to be an approximation to $v$, normalized so that its components add to 1. Note that we may assume that $v(\omega) \ge 0$ for all $\omega\in\Omega$ without loss of generality: when $v$ takes negative values, we may decompose $v$ as
$$v = v^+ - v^-,$$
so that both $v^+$ and $v^-$ are positive functions, and then estimate separately
$$z^+ = \sum_{\omega\in\Omega} v^+(\omega) \quad \text{and} \quad z^- = \sum_{\omega\in\Omega} v^-(\omega).$$
‡ Direct policy evaluation methods have been historically important, and provide an interesting contrast with indirect methods. However, they are currently less popular than the projected equation methods to be considered in the next section, despite some generic advantages (the option to use nonlinear approximation architectures, and the capability of more accurate approximation). The material of this section will not be substantially used later, so the reader may read this section lightly without loss of continuity.
formula
$$\bar\mu(i) = \arg\min_{u\in U(i)} \sum_{j=1}^n p_{ij}(u)\bigl(g(i,u,j) + \alpha \tilde J(j,r)\bigr), \quad \text{for all } i. \tag{6.17}$$
To evaluate approximately $J_{\bar\mu}$, we select a subset of "representative" states $\tilde S$ (perhaps obtained by some form of simulation), and for each $i \in \tilde S$, we obtain $M(i)$ samples of the cost $J_{\bar\mu}(i)$. The $m$th such sample is denoted by $c(i,m)$, and mathematically, it can be viewed as being $J_{\bar\mu}(i)$ plus some simulation error/noise.† Then we obtain the corresponding parameter vector $\bar r$ by solving the following least squares problem
$$\min_r \sum_{i\in\tilde S} \sum_{m=1}^{M(i)} \bigl(\tilde J(i,r) - c(i,m)\bigr)^2, \tag{6.18}$$
and we repeat the process with $\bar\mu$ and $\bar r$ replacing $\mu$ and $r$, respectively (see Fig. 6.1.1).
The least squares problem (6.18) can be solved exactly if a linear approximation architecture is used, i.e., if
$$\tilde J(i,r) = \phi(i)' r,$$
where $\phi(i)'$ is a row vector of features corresponding to state $i$. In this case $r$ is obtained by solving the linear system of equations
$$\sum_{i\in\tilde S} \sum_{m=1}^{M(i)} \phi(i)\bigl(\phi(i)' r - c(i,m)\bigr) = 0,$$
which is obtained by setting to 0 the gradient with respect to $r$ of the quadratic cost in the minimization (6.18). When a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem (6.18), as we will now discuss.
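A small sketch of this linear system (the normal equations) in Python; `phi` and `samples` are hypothetical containers for the feature vectors $\phi(i)$ and the cost samples $c(i,m)$:

```python
import numpy as np

def fit_linear(phi, samples):
    """Least squares fit of J~(i,r) = phi(i)' r to cost samples, via the
    normal equations sum_{i,m} phi(i) (phi(i)' r - c(i,m)) = 0.
    'phi[i]' is the feature vector of state i; 'samples[i]' its list of
    cost samples c(i, .) -- both hypothetical containers."""
    s = len(next(iter(phi.values())))
    A = np.zeros((s, s))
    b = np.zeros(s)
    for i, cs in samples.items():
        f = np.asarray(phi[i])
        A += len(cs) * np.outer(f, f)   # each sample contributes phi phi'
        b += f * sum(cs)                # and phi * c(i, m)
    return np.linalg.solve(A, b)

# Two states with features (1, i) and samples of a cost J(i) = 2 + 3i:
phi = {0: [1.0, 0.0], 1: [1.0, 1.0]}
samples = {0: [2.0, 2.0], 1: [4.0, 6.0]}
r = fit_linear(phi, samples)
```

The fitted values $\phi(i)'r$ match the per-state sample averages here, as they must for this architecture.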
† The manner in which the samples $c(i,m)$ are collected is immaterial for the purposes of the subsequent discussion. Thus one may generate these samples through a single very long trajectory of the Markov chain corresponding to $\bar\mu$, or one may use multiple trajectories, with different starting points, to ensure that enough cost samples are generated for a "representative" subset of states. In either case, the samples $c(i,m)$ corresponding to any one state $i$ will generally be correlated as well as "noisy." Still the average $\frac{1}{M(i)}\sum_{m=1}^{M(i)} c(i,m)$ will ordinarily converge to $J_{\bar\mu}(i)$ as $M(i) \to \infty$ by a law of large numbers argument [see Exercise 6.2 and the discussion in [BeT96], Sections 5.1, 5.2, regarding the behavior of the average when $M(i)$ is finite and random].
Batch Gradient Methods for Policy Evaluation
Let us focus on an $N$-transition portion $(i_0, \ldots, i_N)$ of a simulated trajectory, also called a batch. We view the numbers
$$\sum_{t=k}^{N-1} \alpha^{t-k} g\bigl(i_t, \mu(i_t), i_{t+1}\bigr), \quad k = 0, \ldots, N-1,$$
as cost samples, one per initial state $i_0, \ldots, i_{N-1}$, which can be used for least squares approximation of the parametric architecture $\tilde J(i,r)$ [cf. Eq. (6.18)]:
$$\min_r \sum_{k=0}^{N-1} \frac{1}{2}\left(\tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g\bigl(i_t, \mu(i_t), i_{t+1}\bigr)\right)^2. \tag{6.19}$$
One way to solve this least squares problem is to use a gradient method, whereby the parameter $r$ associated with $\mu$ is updated at time $N$ by
$$r := r - \gamma \sum_{k=0}^{N-1} \nabla\tilde J(i_k, r)\left(\tilde J(i_k, r) - \sum_{t=k}^{N-1} \alpha^{t-k} g\bigl(i_t, \mu(i_t), i_{t+1}\bigr)\right). \tag{6.20}$$
Here, $\nabla\tilde J$ denotes gradient with respect to $r$ and $\gamma$ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment). Each of the $N$ terms in the summation in the right-hand side above is the gradient of a corresponding term in the least squares summation of problem (6.19). Note that the update of $r$ is done after processing the entire batch, and that the gradients $\nabla\tilde J(i_k, r)$ are evaluated at the preexisting value of $r$, i.e., the one before the update.
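One update of the form (6.20) on a single batch can be sketched as follows; `J_tilde` and `grad_J` stand for the architecture and its gradient in $r$, and `g(i, j)` abbreviates $g(i, \mu(i), j)$ with the policy held fixed (all hypothetical names):

```python
import numpy as np

def batch_gradient_step(r, traj, g, alpha, gamma, J_tilde, grad_J):
    """One batch gradient update, cf. Eq. (6.20): for each start index k
    in the batch, form the discounted cost sample and accumulate the
    gradient of the squared fitting error, then take one step of size
    gamma. All gradients are evaluated at the pre-update r."""
    N = len(traj) - 1
    update = np.zeros_like(r)
    for k in range(N):
        # cost sample from position k: sum_{t=k}^{N-1} alpha^(t-k) g(i_t, i_{t+1})
        c = sum(alpha ** (t - k) * g(traj[t], traj[t + 1]) for t in range(k, N))
        update += grad_J(traj[k], r) * (J_tilde(traj[k], r) - c)
    return r - gamma * update

# Linear one-parameter architecture J~(i, r) = r[0], on a 3-transition
# batch with unit stage costs and alpha = 0: each cost sample equals 1,
# so the step pulls r[0] from 0 toward 1.
J_tilde = lambda i, r: r[0]
grad_J = lambda i, r: np.array([1.0])
r = batch_gradient_step(np.array([0.0]), [0, 1, 2, 3],
                        lambda i, j: 1.0, alpha=0.0, gamma=0.1,
                        J_tilde=J_tilde, grad_J=grad_J)
```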
In a traditional gradient method, the gradient iteration (6.20) is repeated until convergence to the solution of the least squares problem (6.19), i.e., a single $N$-transition batch is used. However, there is an important tradeoff relating to the size $N$ of the batch: in order to reduce simulation error and generate multiple cost samples for a representatively large subset of states, it is necessary to use a large $N$, yet to keep the work per gradient iteration small it is necessary to use a small $N$.
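To make the tradeoff concrete, here is a minimal sketch of the batch iteration (6.20) for a scalar linear architecture $\tilde J(i, r) = \phi(i)r$, so that $\nabla\tilde J(i, r) = \phi(i)$. The two-state chain, features, trajectory, costs, and stepsize below are all hypothetical, chosen only to make the sketch self-contained:

```python
alpha = 0.9                        # discount factor
phi = {1: 1.0, 2: 2.0}             # feature of each state (assumed)
traj = [1, 2, 1, 1, 2, 1]          # one batch: states i_0, ..., i_N (N = 5)
g = [2.0, 1.0, 3.0, 2.0, 1.0]      # stage costs g(i_t, mu(i_t), i_{t+1})
N = len(g)

# cost sample for initial state i_k: sum_{t=k}^{N-1} alpha^(t-k) g_t
y = [sum(alpha ** (t - k) * g[t] for t in range(k, N)) for k in range(N)]

r, gamma = 0.0, 0.01
for _ in range(20000):             # repeat (6.20) on the single batch
    grad = sum(phi[traj[k]] * (phi[traj[k]] * r - y[k]) for k in range(N))
    r -= gamma * grad

# closed-form least squares solution of (6.19), for comparison
r_star = (sum(phi[traj[k]] * y[k] for k in range(N))
          / sum(phi[traj[k]] ** 2 for k in range(N)))
```

Because the whole batch is processed before each update, repeating (6.20) on the same batch simply solves the least squares problem (6.19) for that batch.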
To address the issue of size of N, an expanded view of the gradient
method is preferable in practice, whereby batches may be changed after one
or more iterations. Thus, in this more general method, the $N$-transition batch used in a given gradient iteration comes from a potentially longer simulated trajectory, or from one of many simulated trajectories. A sequence of gradient iterations is performed, with each iteration using cost samples formed from batches collected in a variety of different ways and whose length $N$ may vary. Batches may also overlap to a substantial degree.
We leave the method for generating simulated trajectories and forming batches open for the moment, but we note that it influences strongly the result of the corresponding least squares optimization (6.18), providing better approximations for the states that arise most frequently in the batches used. This is related to the issue of ensuring that the state space is adequately “explored,” with an adequately broad selection of states being represented in the least squares optimization, cf. our earlier discussion on the exploration issue.
The gradient method (6.20) is simple, widely known, and easily understood. There are extensive convergence analyses of this method and its variations, for which we refer to the literature cited at the end of the chapter. These analyses often involve considerable mathematical sophistication, particularly when multiple batches are involved, because of the stochastic nature of the simulation and the complex correlations between the cost samples. However, qualitatively, the conclusions of these analyses are consistent among themselves as well as with practical experience, and indicate that:
(1) Under some reasonable technical assumptions, convergence to a limiting value of $r$ that is a local minimum of the associated optimization problem is expected.

(2) For convergence, it is essential to gradually reduce the stepsize to 0, the most popular choice being to use a stepsize proportional to $1/m$, while processing the $m$th batch. In practice, considerable trial and error may be needed to settle on an effective stepsize choice method. Sometimes it is possible to improve performance by using a different stepsize (or scaling factor) for each component of the gradient.

(3) The rate of convergence is often very slow, and depends among other things on the initial choice of $r$, the number of states and the dynamics of the associated Markov chain, the level of simulation error, and the method for stepsize choice. In fact, the rate of convergence is sometimes so slow that practical convergence is infeasible, even if theoretical convergence is guaranteed.
Incremental Gradient Methods for Policy Evaluation
We will now consider a variant of the gradient method called incremental. This method can also be described through the use of $N$-transition batches, but we will see that (contrary to the batch version discussed earlier) the method is suitable for use with very long batches, including the possibility of a single very long simulated trajectory, viewed as a single batch.
For a given $N$-transition batch $(i_0, \ldots, i_N)$, the batch gradient method processes the $N$ transitions all at once, and updates $r$ using Eq. (6.20). The incremental method updates $r$ a total of $N$ times, once after each transition. Each time it adds to $r$ the corresponding portion of the gradient in the right-hand side of Eq. (6.20) that can be calculated using the newly available simulation data. Thus, after each transition $(i_k, i_{k+1})$:
(1) We evaluate the gradient $\nabla\tilde J(i_k, r)$ at the current value of $r$.

(2) We sum all the terms in the right-hand side of Eq. (6.20) that involve the transition $(i_k, i_{k+1})$, and we update $r$ by making a correction along their sum:
\[
r := r - \gamma\Biggl(\nabla\tilde J(i_k, r)\,\tilde J(i_k, r) - \biggl(\sum_{t=0}^{k} \alpha^{k-t}\,\nabla\tilde J(i_t, r)\biggr) g\bigl(i_k, \mu(i_k), i_{k+1}\bigr)\Biggr). \tag{6.21}
\]
By adding the parenthesized “incremental” correction terms in the above iteration, we see that after $N$ transitions, all the terms of the batch iteration (6.20) will have been accumulated, but there is a difference: in the incremental version, $r$ is changed during the processing of the batch, and the gradient $\nabla\tilde J(i_t, r)$ is evaluated at the most recent value of $r$ [after the transition $(i_t, i_{t+1})$]. By contrast, in the batch version these gradients are evaluated at the value of $r$ prevailing at the beginning of the batch. Note that the gradient sum in the right-hand side of Eq. (6.21) can be conveniently updated following each transition, thereby resulting in an efficient implementation.
It can now be seen that because $r$ is updated at intermediate transitions within a batch (rather than at the end of the batch), the location of the end of the batch becomes less relevant. It is thus possible to have very long batches, and indeed the algorithm can be operated with a single very long simulated trajectory and a single batch. In this case, for each state $i$, we will have one cost sample for every time when state $i$ is encountered in the simulation. Accordingly, state $i$ will be weighted in the least squares optimization in proportion to the frequency of its occurrence within the simulated trajectory.
Generally, within the least squares/policy evaluation context of this section, the incremental versions of the gradient methods can be implemented more flexibly and tend to converge faster than their batch counterparts, so they will be adopted as the default in our discussion. The book by Bertsekas and Tsitsiklis [BeT96] contains an extensive analysis of the theoretical convergence properties of incremental gradient methods (they are fairly similar to those of batch methods), and provides some insight into the reasons for their superior performance relative to the batch versions; see also the author's nonlinear programming book [Ber99] (Section 1.5.2), and the paper by Bertsekas and Tsitsiklis [BeT00]. Still, however, the rate of convergence can be very slow.
Implementation Using Temporal Differences – TD(1)
We now introduce an alternative, mathematically equivalent, implementation of the batch and incremental gradient iterations (6.20) and (6.21), which is described with cleaner formulas. It uses the notion of temporal difference (TD for short) given by
\[
q_k = \tilde J(i_k, r) - \alpha\tilde J(i_{k+1}, r) - g\bigl(i_k, \mu(i_k), i_{k+1}\bigr), \qquad k = 0, \ldots, N-2, \tag{6.22}
\]
\[
q_{N-1} = \tilde J(i_{N-1}, r) - g\bigl(i_{N-1}, \mu(i_{N-1}), i_N\bigr). \tag{6.23}
\]
In particular, by noting that the parenthesized term multiplying $\nabla\tilde J(i_k, r)$ in Eq. (6.20) is equal to
\[
q_k + \alpha q_{k+1} + \cdots + \alpha^{N-1-k} q_{N-1},
\]
we can verify by adding the equations below that iteration (6.20) can also
be implemented as follows:
After the state transition $(i_0, i_1)$, set
\[
r := r - \gamma q_0 \nabla\tilde J(i_0, r).
\]
After the state transition $(i_1, i_2)$, set
\[
r := r - \gamma q_1\bigl(\alpha\nabla\tilde J(i_0, r) + \nabla\tilde J(i_1, r)\bigr).
\]
Proceeding similarly, after the state transition $(i_{N-1}, i_N)$, set
\[
r := r - \gamma q_{N-1}\bigl(\alpha^{N-1}\nabla\tilde J(i_0, r) + \alpha^{N-2}\nabla\tilde J(i_1, r) + \cdots + \nabla\tilde J(i_{N-1}, r)\bigr).
\]
The batch version (6.20) is obtained if the gradients $\nabla\tilde J(i_k, r)$ are all evaluated at the value of $r$ that prevails at the beginning of the batch. The incremental version (6.21) is obtained if each gradient $\nabla\tilde J(i_k, r)$ is evaluated at the value of $r$ that prevails when the transition $(i_k, i_{k+1})$ is processed.
In particular, for the incremental version, we start with some vector $r_0$, and following the transition $(i_k, i_{k+1})$, $k = 0, \ldots, N-1$, we set
\[
r_{k+1} = r_k - \gamma_k q_k \sum_{t=0}^{k} \alpha^{k-t}\,\nabla\tilde J(i_t, r_t), \tag{6.24}
\]
where the stepsize $\gamma_k$ may vary from one transition to the next. In the important case of a linear approximation architecture of the form
\[
\tilde J(i, r) = \phi(i)'r, \qquad i = 1, \ldots, n,
\]
where $\phi(i) \in \Re^s$ are some fixed vectors, it takes the form
\[
r_{k+1} = r_k - \gamma_k q_k \sum_{t=0}^{k} \alpha^{k-t}\phi(i_t). \tag{6.25}
\]
This algorithm is known as TD(1), and we will see in Section 6.3.6 that it
is a limiting version (as λ → 1) of the TD(λ) method discussed there.
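A minimal sketch of TD(1) in the form (6.25), again for a hypothetical scalar linear architecture $\tilde J(i, r) = \phi(i)r$ and a made-up trajectory; the running sum $z_k = \sum_{t=0}^{k}\alpha^{k-t}\phi(i_t)$ (an eligibility trace) lets each update be computed with $O(s)$ work per transition:

```python
alpha = 0.9
phi = {1: 1.0, 2: 2.0}             # features (assumed)
traj = [1, 2, 1, 1, 2, 1]          # states i_0, ..., i_N of one batch
g = [2.0, 1.0, 3.0, 2.0, 1.0]      # stage costs along the trajectory
N = len(g)

r = 0.0
gamma = 0.05                       # constant stepsize, for the sketch only
z = 0.0                            # z_k = sum_{t<=k} alpha^(k-t) phi(i_t)
for k in range(N):
    i, j = traj[k], traj[k + 1]
    # temporal differences (6.22)-(6.23); the last one has no alpha-term
    q = phi[i] * r - (alpha * phi[j] * r if k < N - 1 else 0.0) - g[k]
    z = alpha * z + phi[i]
    r = r - gamma * q * z          # Eq. (6.25)
```
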
6.3 PROJECTED EQUATION METHODS
In this section, we consider the indirect approach, whereby the policy evaluation is based on solving a projected form of Bellman's equation (cf. the right-hand side of Fig. 6.1.3). We will be dealing with a single stationary policy $\mu$, so we generally suppress in our notation the dependence on control of the transition probabilities and the cost per stage. We thus consider a stationary finite-state Markov chain, and we denote the states by $i = 1, \ldots, n$, the transition probabilities by $p_{ij}$, $i, j = 1, \ldots, n$, and the stage costs by $g(i, j)$. We want to evaluate the expected cost of $\mu$ corresponding to each initial state $i$, given by
\[
J_\mu(i) = \lim_{N\to\infty} E\Biggl\{\sum_{k=0}^{N-1}\alpha^k g(i_k, i_{k+1}) \;\Bigg|\; i_0 = i\Biggr\}, \qquad i = 1, \ldots, n,
\]
where $i_k$ denotes the state at time $k$, and $\alpha \in (0, 1)$ is the discount factor.
We approximate $J_\mu(i)$ with a linear architecture of the form
\[
\tilde J(i, r) = \phi(i)'r, \qquad i = 1, \ldots, n, \tag{6.26}
\]
where $r$ is a parameter vector and $\phi(i)$ is an $s$-dimensional feature vector associated with the state $i$. (Throughout this section, vectors are viewed as column vectors, and a prime denotes transposition.) As earlier, we also write the vector
\[
\bigl(\tilde J(1, r), \ldots, \tilde J(n, r)\bigr)'
\]
in the compact form $\Phi r$, where $\Phi$ is the $n \times s$ matrix that has as rows the feature vectors $\phi(i)$, $i = 1, \ldots, n$. Thus, we want to approximate $J_\mu$ within
\[
S = \{\Phi r \mid r \in \Re^s\},
\]
the subspace spanned by $s$ basis functions, the columns of $\Phi$. Our assumptions in this section are the following (we will later discuss how our methodology may be modified in the absence of these assumptions).
Assumption 6.3.1: The Markov chain has steady-state probabilities $\xi_1, \ldots, \xi_n$, which are positive, i.e., for all $i = 1, \ldots, n$,
\[
\lim_{N\to\infty}\frac{1}{N}\sum_{k=1}^{N} P(i_k = j \mid i_0 = i) = \xi_j > 0, \qquad j = 1, \ldots, n.
\]
Assumption 6.3.2: The matrix Φ has rank s.
Assumption 6.3.1 is equivalent to assuming that the Markov chain is irreducible, i.e., has a single recurrent class and no transient states. Assumption 6.3.2 is equivalent to the basis functions (the columns of $\Phi$) being linearly independent, and is analytically convenient because it implies that each vector $J$ in the subspace $S$ is represented in the form $\Phi r$ with a unique vector $r$.
6.3.1 The Projected Bellman Equation
We will now introduce the projected form of Bellman's equation. We use a weighted Euclidean norm on $\Re^n$ of the form
\[
\|J\|_v = \sqrt{\sum_{i=1}^{n} v_i\bigl(J(i)\bigr)^2},
\]
where $v$ is a vector of positive weights $v_1, \ldots, v_n$. Let $\Pi$ denote the projection operation onto $S$ with respect to this norm. Thus for any $J \in \Re^n$, $\Pi J$ is the unique vector in $S$ that minimizes $\|J - \hat J\|_v^2$ over all $\hat J \in S$. It can also be written as
\[
\Pi J = \Phi r_J,
\]
where
\[
r_J = \arg\min_{r\in\Re^s}\|J - \Phi r\|_v^2, \qquad J \in \Re^n. \tag{6.27}
\]
This is because Φ has rank s by Assumption 6.3.2, so a vector in S is
uniquely written in the form Φr.
Note that $\Pi$ and $r_J$ can be written explicitly in closed form. This can be done by setting to 0 the gradient of the quadratic function
\[
\|J - \Phi r\|_v^2 = (J - \Phi r)'V(J - \Phi r),
\]
where $V$ is the diagonal matrix with $v_i$, $i = 1, \ldots, n$, along the diagonal [cf. Eq. (6.27)]. We thus obtain the necessary and sufficient optimality condition
\[
\Phi'V(J - \Phi r_J) = 0, \tag{6.28}
\]
from which
\[
r_J = (\Phi'V\Phi)^{-1}\Phi'VJ,
\]
and using the formula $\Phi r_J = \Pi J$,
\[
\Pi = \Phi(\Phi'V\Phi)^{-1}\Phi'V.
\]
[The inverse $(\Phi'V\Phi)^{-1}$ exists because $\Phi$ is assumed to have rank $s$; cf. Assumption 6.3.2.] The optimality condition (6.28), through left multiplication with an arbitrary $\bar r'$, can also be equivalently expressed as
\[
\bar J{}'V(J - \Phi r_J) = 0, \qquad \forall\ \bar J \in S, \tag{6.29}
\]
where $\bar J = \Phi\bar r$.
The interpretation is that the difference/approximation error $J - \Phi r_J$ is orthogonal to the subspace $S$ in the scaled geometry of the norm $\|\cdot\|_v$ (two vectors $x, y \in \Re^n$ are called orthogonal if $x'Vy = \sum_{i=1}^{n} v_i x_i y_i = 0$).
Consider now the mapping $T$ given by
\[
(TJ)(i) = \sum_{j=1}^{n} p_{ij}\bigl(g(i, j) + \alpha J(j)\bigr), \qquad i = 1, \ldots, n,
\]
or in more compact notation,
\[
TJ = g + \alpha PJ, \tag{6.30}
\]
where $g$ is the vector with components $\sum_{j=1}^{n} p_{ij}g(i, j)$, $i = 1, \ldots, n$, and $P$ is the matrix with components $p_{ij}$.
Consider also the mapping $\Pi T$ (the composition of $\Pi$ with $T$) and the equation
\[
\Phi r = \Pi T(\Phi r). \tag{6.31}
\]
We view this as a projected/approximate form of Bellman's equation, and we view a solution $\Phi r^*$ of this equation as an approximation to $J_\mu$. Note that since $\Pi$ is a linear mapping, this is a linear equation in the vector $r$. We will give an explicit form for this equation shortly.
We know from Section 1.4 that $T$ is a contraction with respect to the sup-norm, but unfortunately this does not necessarily imply that $T$ is a contraction with respect to the norm $\|\cdot\|_v$. We will next show an important fact: if $v$ is chosen to be the steady-state probability vector $\xi$, then $T$ is a contraction with respect to $\|\cdot\|_v$, with modulus $\alpha$. The critical part of the proof is addressed in the following lemma.
Lemma 6.3.1: For any $n \times n$ stochastic matrix $P$ that has a steady-state probability vector $\xi = (\xi_1, \ldots, \xi_n)$ with positive components, we have
\[
\|Pz\|_\xi \le \|z\|_\xi, \qquad z \in \Re^n.
\]
Proof: Let $p_{ij}$ be the components of $P$. For all $z \in \Re^n$, we have
\[
\|Pz\|_\xi^2 = \sum_{i=1}^{n}\xi_i\Biggl(\sum_{j=1}^{n} p_{ij}z_j\Biggr)^2 \le \sum_{i=1}^{n}\xi_i\sum_{j=1}^{n} p_{ij}z_j^2 = \sum_{j=1}^{n}\sum_{i=1}^{n}\xi_i p_{ij}z_j^2 = \sum_{j=1}^{n}\xi_j z_j^2 = \|z\|_\xi^2,
\]
where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property $\sum_{i=1}^{n}\xi_i p_{ij} = \xi_j$ of the steady-state probabilities. Q.E.D.
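Lemma 6.3.1 is easy to verify numerically. The sketch below uses a hypothetical 2-state chain whose steady-state vector can be computed by hand, and checks the inequality $\|Pz\|_\xi \le \|z\|_\xi$ on random vectors:

```python
import math
import random

P = [[0.75, 0.25],
     [0.50, 0.50]]                 # row-stochastic transition matrix (assumed)
xi = [2.0 / 3.0, 1.0 / 3.0]        # its steady-state probabilities: xi P = xi
assert all(abs(sum(xi[i] * P[i][j] for i in range(2)) - xi[j]) < 1e-12
           for j in range(2))

def norm_xi(z):
    """Weighted Euclidean norm ||z||_xi."""
    return math.sqrt(sum(xi[i] * z[i] ** 2 for i in range(2)))

random.seed(0)
for _ in range(1000):
    z = [random.uniform(-10.0, 10.0) for _ in range(2)]
    Pz = [sum(P[i][j] * z[j] for j in range(2)) for i in range(2)]
    assert norm_xi(Pz) <= norm_xi(z) + 1e-12   # the lemma's inequality
```
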
We next note an important property of projections: they are nonexpansive, in the sense
\[
\|\Pi J - \Pi\bar J\|_v \le \|J - \bar J\|_v, \qquad \text{for all } J, \bar J \in \Re^n.
\]
To see this, note that
\[
\bigl\|\Pi(J - \bar J)\bigr\|_v^2 \le \bigl\|\Pi(J - \bar J)\bigr\|_v^2 + \bigl\|(I - \Pi)(J - \bar J)\bigr\|_v^2 = \|J - \bar J\|_v^2,
\]
where the equality above follows from the Pythagorean Theorem:†
\[
\|J - \bar J\|_v^2 = \|\Pi J - \bar J\|_v^2 + \|J - \Pi J\|_v^2, \qquad \text{for all } J \in \Re^n,\ \bar J \in S. \tag{6.32}
\]
Thus, for $\Pi T$ to be a contraction with respect to $\|\cdot\|_v$, it is sufficient that $T$ be a contraction with respect to $\|\cdot\|_v$, since
\[
\|\Pi TJ - \Pi T\bar J\|_v \le \|TJ - T\bar J\|_v \le \beta\|J - \bar J\|_v,
\]
where $\beta$ is the modulus of contraction of $T$ with respect to $\|\cdot\|_v$ (see Fig. 6.3.1). This leads to the following proposition.
Proposition 6.3.1: The mappings $T$ and $\Pi T$ are contractions of modulus $\alpha$ with respect to the weighted Euclidean norm $\|\cdot\|_\xi$, where $\xi$ is the steady-state probability vector of the Markov chain.
Proof: Using the definition $TJ = g + \alpha PJ$ [cf. Eq. (6.30)], we have for all $J, \bar J \in \Re^n$,
\[
TJ - T\bar J = \alpha P(J - \bar J).
\]
† The Pythagorean Theorem follows from the orthogonality of the vectors $(J - \Pi J)$ and $(\Pi J - \bar J)$; cf. Eq. (6.29).
Figure 6.3.1. Illustration of the contraction property of $\Pi T$ due to the nonexpansiveness of $\Pi$. If $T$ is a contraction with respect to $\|\cdot\|_v$, the Euclidean norm used in the projection, then $\Pi T$ is also a contraction with respect to that norm, since $\Pi$ is nonexpansive and we have
\[
\|\Pi TJ - \Pi T\bar J\|_v \le \|TJ - T\bar J\|_v \le \beta\|J - \bar J\|_v,
\]
where $\beta$ is the modulus of contraction of $T$ with respect to $\|\cdot\|_v$.
We thus obtain
\[
\|TJ - T\bar J\|_\xi = \alpha\|P(J - \bar J)\|_\xi \le \alpha\|J - \bar J\|_\xi,
\]
where the inequality follows from Lemma 6.3.1. Hence $T$ is a contraction of modulus $\alpha$. The contraction property of $\Pi T$ follows from the contraction property of $T$ and the nonexpansiveness property of $\Pi$ noted earlier. Q.E.D.
The next proposition gives an estimate of the error in estimating $J_\mu$ with the fixed point of $\Pi T$.

Proposition 6.3.2: Let $\Phi r^*$ be the fixed point of $\Pi T$. We have
\[
\|J_\mu - \Phi r^*\|_\xi \le \frac{1}{\sqrt{1 - \alpha^2}}\,\|J_\mu - \Pi J_\mu\|_\xi.
\]
Proof: We have
\[
\begin{aligned}
\|J_\mu - \Phi r^*\|_\xi^2 &= \|J_\mu - \Pi J_\mu\|_\xi^2 + \bigl\|\Pi J_\mu - \Phi r^*\bigr\|_\xi^2 \\
&= \|J_\mu - \Pi J_\mu\|_\xi^2 + \bigl\|\Pi TJ_\mu - \Pi T(\Phi r^*)\bigr\|_\xi^2 \\
&\le \|J_\mu - \Pi J_\mu\|_\xi^2 + \alpha^2\|J_\mu - \Phi r^*\|_\xi^2,
\end{aligned}
\]
where the first equality uses the Pythagorean Theorem [cf. Eq. (6.32)], the second equality holds because $J_\mu$ is the fixed point of $T$ and $\Phi r^*$ is the fixed point of $\Pi T$, and the inequality uses the contraction property of $\Pi T$. From this relation, the result follows. Q.E.D.
Note the critical fact in the preceding analysis: $\alpha P$ (and hence $T$) is a contraction with respect to the projection norm $\|\cdot\|_\xi$ (cf. Lemma 6.3.1). Indeed, Props. 6.3.1 and 6.3.2 hold if $T$ is any (possibly nonlinear) contraction with respect to the Euclidean norm of the projection (cf. Fig. 6.3.1).
The Matrix Form of the Projected Bellman Equation
Let us now write the projected Bellman equation in explicit form. Its solution is the vector $J = \Phi r^*$, where $r^*$ solves the problem
\[
\min_{r\in\Re^s}\bigl\|\Phi r - (g + \alpha P\Phi r^*)\bigr\|_\xi^2.
\]
Setting to 0 the gradient with respect to $r$ of the above quadratic expression, we obtain
\[
\Phi'\Xi\bigl(\Phi r^* - (g + \alpha P\Phi r^*)\bigr) = 0,
\]
where $\Xi$ is the diagonal matrix with the steady-state probabilities $\xi_1, \ldots, \xi_n$ along the diagonal. Note that this equation is just the orthogonality condition (6.29).
Thus the projected equation is written as
\[
Cr^* = d, \tag{6.33}
\]
where
\[
C = \Phi'\Xi(I - \alpha P)\Phi, \qquad d = \Phi'\Xi g, \tag{6.34}
\]
and can be solved by matrix inversion:
\[
r^* = C^{-1}d,
\]
just like the Bellman equation, which can also be solved by matrix inversion,
\[
J = (I - \alpha P)^{-1}g.
\]
An important difference is that the projected equation has smaller dimension ($s$ rather than $n$). Still, however, computing $C$ and $d$ using Eq. (6.34) requires computation of inner products of size $n$, so for problems where $n$ is very large, the explicit computation of $C$ and $d$ is impractical. We will discuss shortly efficient methods to compute inner products of large size by using simulation and low-dimensional calculations. The idea is that an inner product, appropriately normalized, can be viewed as an expected value (the weighted sum of a large number of terms), which can be computed by sampling its components with an appropriate probability distribution and averaging the samples, as discussed in Section 6.1.6.
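For a problem small enough to do exactly, forming $C$ and $d$ via Eq. (6.34) is straightforward. The sketch below uses a hypothetical 2-state chain with a single basis function ($s = 1$), so that $C$ and $d$ are scalars, and verifies that $r^* = C^{-1}d$ gives a fixed point of $\Pi T$:

```python
alpha = 0.9
P = [[0.75, 0.25],
     [0.50, 0.50]]                     # transition probabilities (assumed)
xi = [2.0 / 3.0, 1.0 / 3.0]            # steady-state probabilities of P
gcost = [[1.0, 4.0],
         [2.0, 3.0]]                   # stage costs g(i, j) (assumed)
phi = [1.0, 2.0]                       # single basis function (s = 1)
n = 2

# expected stage costs g(i) = sum_j p_ij g(i, j)
g = [sum(P[i][j] * gcost[i][j] for j in range(n)) for i in range(n)]

# scalar versions of C = Phi' Xi (I - alpha P) Phi and d = Phi' Xi g
C = sum(xi[i] * phi[i] * (phi[i] - alpha * sum(P[i][j] * phi[j] for j in range(n)))
        for i in range(n))
d = sum(xi[i] * phi[i] * g[i] for i in range(n))
r_star = d / C                         # solves the projected equation (6.33)

# fixed-point check: the xi-weighted projection of T(Phi r*) onto span{phi}
# must reproduce the coefficient r*
J = [phi[i] * r_star for i in range(n)]
TJ = [g[i] + alpha * sum(P[i][j] * J[j] for j in range(n)) for i in range(n)]
proj_coef = (sum(xi[i] * phi[i] * TJ[i] for i in range(n))
             / sum(xi[i] * phi[i] ** 2 for i in range(n)))
assert abs(proj_coef - r_star) < 1e-9
```
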
Figure 6.3.2. Illustration of the projected value iteration (PVI) method
\[
\Phi r_{k+1} = \Pi T(\Phi r_k).
\]
At the typical iteration $k$, the current iterate $\Phi r_k$ is operated on with $T$, and the generated vector $T(\Phi r_k)$ is projected onto $S$, to yield the new iterate $\Phi r_{k+1}$.
6.3.2 Deterministic Iterative Methods
We have noted in Chapter 1 that for problems where $n$ is very large, an iterative method such as value iteration may be appropriate for solving the Bellman equation $J = TJ$. Similarly, one may consider an iterative method for solving the projected Bellman equation $\Phi r = \Pi T(\Phi r)$ or its equivalent version $Cr = d$ [cf. Eqs. (6.33)-(6.34)].

Since $\Pi T$ is a contraction (cf. Prop. 6.3.1), the first iterative method that comes to mind is the analog of value iteration: successively apply $\Pi T$, starting with an arbitrary initial vector $\Phi r_0$:
\[
\Phi r_{k+1} = \Pi T(\Phi r_k), \qquad k = 0, 1, \ldots. \tag{6.35}
\]
Thus at iteration $k$, the current iterate $\Phi r_k$ is operated on with $T$, and the generated value iterate $T(\Phi r_k)$ (which does not necessarily lie in $S$) is projected onto $S$, to yield the new iterate $\Phi r_{k+1}$ (see Fig. 6.3.2). We refer to this as projected value iteration (PVI for short). Since $\Pi T$ is a contraction, it follows that the sequence $\{\Phi r_k\}$ generated by PVI converges to the unique fixed point $\Phi r^*$ of $\Pi T$.
It is possible to write PVI explicitly by noting that
\[
r_{k+1} = \arg\min_{r\in\Re^s}\bigl\|\Phi r - (g + \alpha P\Phi r_k)\bigr\|_\xi^2.
\]
By setting to 0 the gradient with respect to $r$ of the above quadratic expression, we obtain
\[
\Phi'\Xi\bigl(\Phi r_{k+1} - (g + \alpha P\Phi r_k)\bigr) = 0,
\]
which yields
\[
r_{k+1} = r_k - (\Phi'\Xi\Phi)^{-1}(Cr_k - d), \tag{6.36}
\]
where $C$ and $d$ are given by Eq. (6.34).
Thus in this form of the iteration, the current iterate $r_k$ is corrected by the “residual” $Cr_k - d$ (which tends to 0), after “scaling” with the matrix $(\Phi'\Xi\Phi)^{-1}$. It is also possible to scale $Cr_k - d$ with a different/simpler matrix, leading to the iteration
\[
r_{k+1} = r_k - \gamma G(Cr_k - d), \tag{6.37}
\]
where $\gamma$ is a positive stepsize, and $G$ is some $s \times s$ scaling matrix.† Note that when $G$ is the identity or a diagonal approximation to $(\Phi'\Xi\Phi)^{-1}$, the iteration (6.37) is simpler than PVI in that it does not require a matrix inversion (it does require, however, the choice of a stepsize $\gamma$).
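In the scalar case ($s = 1$) the iteration (6.37) is easy to visualize: with hypothetical values of $C$ and $d$, any stepsize with $|1 - \gamma GC| < 1$ drives $r_k$ to $r^* = C^{-1}d$:

```python
# Hypothetical scalar data; C > 0, consistent with the positive
# definiteness of C established below.
C, d = 1.3, 2.6
G, gamma = 1.0, 0.5                    # identity "scaling" and a stepsize
r = 0.0
for _ in range(200):
    r = r - gamma * G * (C * r - d)    # Eq. (6.37)
# with |1 - gamma*G*C| = 0.35 < 1, r converges to r* = d / C = 2.0
```
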
The iteration (6.37) converges to the solution of the projected equation if and only if the matrix $I - \gamma GC$ has eigenvalues strictly within the unit circle. The following proposition shows that this is true when $G$ is positive definite symmetric, as long as the stepsize $\gamma$ is small enough to compensate for large components in the matrix $G$. This hinges on an important property of the matrix $C$, which we now define. Let us say that a (possibly nonsymmetric) $s \times s$ matrix $M$ is positive definite if
\[
r'Mr > 0, \qquad \forall\ r \ne 0.
\]
We say that $M$ is positive semidefinite if
\[
r'Mr \ge 0, \qquad \forall\ r \in \Re^s.
\]
The following proposition shows that $C$ is positive definite, and if $G$ is positive definite and symmetric, the iteration (6.37) is convergent for sufficiently small stepsize $\gamma$.
Proposition 6.3.3: The matrix $C$ of Eq. (6.34) is positive definite. Furthermore, if the $s \times s$ matrix $G$ is symmetric and positive definite, there exists $\bar\gamma > 0$ such that the eigenvalues of
\[
I - \gamma GC
\]
lie strictly within the unit circle for all $\gamma \in (0, \bar\gamma]$.
† Iterative methods that involve incremental changes along directions of the
form Gf(x) are very common for solving a system of equations f(x) = 0. They
arise prominently in cases where f(x) is the gradient of a cost function, or has
certain monotonicity properties. They also admit extensions to the case where
there are constraints on x (see [Ber09] for an analysis that is relevant to the
present DP context).
For the proof we need the following lemma.
Lemma 6.3.2: The eigenvalues of a positive definite matrix have positive real parts.

Proof: Let $M$ be a positive definite matrix. Then for sufficiently small $\gamma > 0$ we have $(\gamma/2)r'M'Mr < r'Mr$ for all $r \ne 0$, or equivalently
\[
\bigl\|(I - \gamma M)r\bigr\|^2 < \|r\|^2, \qquad \forall\ r \ne 0,
\]
implying that $I - \gamma M$ is a contraction mapping with respect to the standard Euclidean norm. Hence the eigenvalues of $I - \gamma M$ lie within the unit circle. Since these eigenvalues are $1 - \gamma\lambda$, where $\lambda$ are the eigenvalues of $M$, it follows that if $M$ is positive definite, the eigenvalues of $M$ have positive real parts. Q.E.D.
Proof of Prop. 6.3.3: For all $r \in \Re^s$, we have
\[
\|\Pi P\Phi r\|_\xi \le \|P\Phi r\|_\xi \le \|\Phi r\|_\xi, \tag{6.38}
\]
where the first inequality follows from the Pythagorean Theorem,
\[
\|P\Phi r\|_\xi^2 = \|\Pi P\Phi r\|_\xi^2 + \|(I - \Pi)P\Phi r\|_\xi^2,
\]
and the second inequality follows from Lemma 6.3.1. Also from properties of projections, all vectors of the form $\Phi r$ are orthogonal to all vectors of the form $x - \Pi x$, i.e.,
\[
r'\Phi'\Xi(I - \Pi)x = 0, \qquad \forall\ r \in \Re^s,\ x \in \Re^n, \tag{6.39}
\]
[cf. Eq. (6.29)]. Thus, we have for all r ,= 0,
r
′
Cr = r
′
Φ
′
Ξ(I −αP)Φr
= r
′
Φ
′
Ξ
_
I −αΠP + α(Π −I)P
_
Φr
= r
′
Φ
′
Ξ(I −αΠP)Φr
= Φr
2
ξ
−αr
′
Φ
′
ΞΠPΦr
≥ Φr
2
ξ
−αΦr
ξ
ΠPΦr
ξ
≥ (1 −α)Φr
2
ξ
> 0,
where the third equality follows from Eq. (6.39), the ﬁrst inequality follows
from the CauchySchwartz inequality applied with inner product < x, y >=
Sec. 6.3 Projected Equation Methods 363
x
′
Ξy, and the second inequality follows from Eq. (6.38). This proves the
positive deﬁniteness of C.
If $G$ is symmetric and positive definite, the matrix $G^{1/2}$ exists and is symmetric and positive definite. Let $M = G^{1/2}CG^{1/2}$, and note that since $C$ is positive definite, $M$ is also positive definite, so from Lemma 6.3.2 it follows that its eigenvalues have positive real parts. The eigenvalues of $M$ and $GC$ are equal (with eigenvectors that are multiples of $G^{1/2}$ or $G^{-1/2}$ of each other), so the eigenvalues of $GC$ have positive real parts. It follows that the eigenvalues of $I - \gamma GC$ lie strictly within the unit circle for sufficiently small $\gamma > 0$. This completes the proof of Prop. 6.3.3. Q.E.D.
Note that for the conclusion of Prop. 6.3.3 to hold, it is not necessary that $G$ be symmetric. It is sufficient that $GC$ has eigenvalues with positive real parts. An example is $G = C'\Sigma^{-1}$, in which case $GC = C'\Sigma^{-1}C$ is a positive definite matrix. Another example, important for our purposes, is
\[
G = (C'\Sigma^{-1}C + \beta I)^{-1}C'\Sigma^{-1}, \tag{6.40}
\]
where $\Sigma$ is any positive definite symmetric matrix, and $\beta$ is a positive scalar. Then $GC$ is given by
\[
GC = (C'\Sigma^{-1}C + \beta I)^{-1}C'\Sigma^{-1}C,
\]
and can be shown to have real eigenvalues that lie in the interval $(0, 1)$, even if $C$ is not positive definite.† As a result $I - \gamma GC$ has real eigenvalues in the interval $(-1, 1)$ for any $\gamma \in (0, 2]$.
Unfortunately, however, while PVI and its scaled version (6.37) are
conceptually important, they are not practical algorithms for problems
† To see this let $\lambda_1, \ldots, \lambda_s$ be the eigenvalues of $C'\Sigma^{-1}C$ and let $U\Lambda U'$ be its singular value decomposition, where $\Lambda = \mathrm{diag}\{\lambda_1, \ldots, \lambda_s\}$ and $U$ is a unitary matrix ($UU' = I$; see [Str09], [TrB97]). We also have $C'\Sigma^{-1}C + \beta I = U(\Lambda + \beta I)U'$, so
\[
GC = \bigl(U(\Lambda + \beta I)U'\bigr)^{-1}U\Lambda U' = U(\Lambda + \beta I)^{-1}\Lambda U'.
\]
It follows that the eigenvalues of $GC$ are $\lambda_i/(\lambda_i + \beta)$, $i = 1, \ldots, s$, and lie in the interval $(0, 1)$. Actually, the iteration
\[
r_{k+1} = r_k - G(Cr_k - d)
\]
[cf. Eq. (6.37)], where $G$ is given by Eq. (6.40), is the so-called proximal point algorithm applied to the problem of minimizing $(Cr - d)'\Sigma^{-1}(Cr - d)$ over $r$. From known results about this algorithm (Martinet [Mar70] and Rockafellar [Roc76]) it follows that the iteration will converge to a minimizing point of $(Cr - d)'\Sigma^{-1}(Cr - d)$. Thus it will converge to some solution of the projected equation $Cr = d$, even if there exist many solutions (as in the case where $\Phi$ does not have rank $s$).
where $n$ is very large. The reason is that the vector $T(\Phi r_k)$ is $n$-dimensional and its calculation is prohibitive for the type of large problems that we aim to address. Furthermore, even if $T(\Phi r_k)$ were calculated, its projection on $S$ requires knowledge of the steady-state probabilities $\xi_1, \ldots, \xi_n$, which are generally unknown. Fortunately, both of these difficulties can be dealt with through the use of simulation, as we now discuss.
6.3.3 Simulation-Based Methods
We will now consider approximate versions of the methods for solving the projected equation, which involve simulation and low-dimensional calculations. The idea is very simple: we use simulation to form a matrix $C_k$ that approximates
\[
C = \Phi'\Xi(I - \alpha P)\Phi,
\]
and a vector $d_k$ that approximates
\[
d = \Phi'\Xi g
\]
[cf. Eq. (6.34)]. We then approximate the solution $C^{-1}d$ of the projected equation with $C_k^{-1}d_k$, or we approximate the term $(Cr_k - d)$ in the PVI iteration (6.36) [or its scaled version (6.37)] with $(C_kr_k - d_k)$.
The simulation can be done as follows: we generate an infinitely long trajectory $(i_0, i_1, \ldots)$ of the Markov chain, starting from an arbitrary state $i_0$. After generating state $i_t$, we compute the corresponding row $\phi(i_t)'$ of $\Phi$, and after generating the transition $(i_t, i_{t+1})$, we compute the corresponding cost component $g(i_t, i_{t+1})$. After collecting $k + 1$ samples ($k = 0, 1, \ldots$), we form
\[
C_k = \frac{1}{k+1}\sum_{t=0}^{k}\phi(i_t)\bigl(\phi(i_t) - \alpha\phi(i_{t+1})\bigr)', \tag{6.41}
\]
and
\[
d_k = \frac{1}{k+1}\sum_{t=0}^{k}\phi(i_t)g(i_t, i_{t+1}), \tag{6.42}
\]
where $\phi(i)'$ denotes the $i$th row of $\Phi$.
It can be proved using simple law of large numbers arguments that $C_k \to C$ and $d_k \to d$ with probability 1. To show this, we use the expression $\Phi' = [\,\phi(1)\ \cdots\ \phi(n)\,]$ to write $C$ explicitly as
\[
C = \Phi'\Xi(I - \alpha P)\Phi = \sum_{i=1}^{n}\xi_i\phi(i)\Biggl(\phi(i) - \alpha\sum_{j=1}^{n}p_{ij}\phi(j)\Biggr)', \tag{6.43}
\]
and we rewrite $C_k$ in a form that matches the preceding expression, except that the probabilities $\xi_i$ and $p_{ij}$ are replaced by corresponding empirical frequencies produced by the simulation. Indeed, by denoting $\delta(\cdot)$ the indicator function [$\delta(E) = 1$ if the event $E$ has occurred and $\delta(E) = 0$ otherwise], we have
\[
C_k = \sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\sum_{t=0}^{k}\delta(i_t = i,\, i_{t+1} = j)}{k+1}\,\phi(i)\bigl(\phi(i) - \alpha\phi(j)\bigr)'
= \sum_{i=1}^{n}\frac{\sum_{t=0}^{k}\delta(i_t = i)}{k+1}\,\phi(i)\Biggl(\phi(i) - \alpha\sum_{j=1}^{n}\frac{\sum_{t=0}^{k}\delta(i_t = i,\, i_{t+1} = j)}{\sum_{t=0}^{k}\delta(i_t = i)}\,\phi(j)\Biggr)'
\]
and finally
\[
C_k = \sum_{i=1}^{n}\hat\xi_{i,k}\,\phi(i)\Biggl(\phi(i) - \alpha\sum_{j=1}^{n}\hat p_{ij,k}\,\phi(j)\Biggr)',
\]
where
\[
\hat\xi_{i,k} = \frac{\sum_{t=0}^{k}\delta(i_t = i)}{k+1}, \qquad \hat p_{ij,k} = \frac{\sum_{t=0}^{k}\delta(i_t = i,\, i_{t+1} = j)}{\sum_{t=0}^{k}\delta(i_t = i)}. \tag{6.44}
\]
Here, $\hat\xi_{i,k}$ and $\hat p_{ij,k}$ are the fractions of time that state $i$, or transition $(i, j)$, has occurred within $(i_0, \ldots, i_k)$, the initial $(k+1)$-state portion of the simulated trajectory. Since the empirical frequencies $\hat\xi_{i,k}$ and $\hat p_{ij,k}$ asymptotically converge (with probability 1) to the probabilities $\xi_i$ and $p_{ij}$, respectively, we have with probability 1,
\[
C_k \to \sum_{i=1}^{n}\xi_i\phi(i)\Biggl(\phi(i) - \alpha\sum_{j=1}^{n}p_{ij}\phi(j)\Biggr)' = \Phi'\Xi(I - \alpha P)\Phi = C
\]
[cf. Eq. (6.43)]. Similarly, we can write
\[
d_k = \sum_{i=1}^{n}\hat\xi_{i,k}\,\phi(i)\sum_{j=1}^{n}\hat p_{ij,k}\,g(i, j),
\]
and we have
\[
d_k \to \sum_{i=1}^{n}\xi_i\phi(i)\sum_{j=1}^{n}p_{ij}g(i, j) = \Phi'\Xi g = d.
\]
Note that $C_k$ and $d_k$ can be updated recursively as new samples $\phi(i_k)$ and $g(i_k, i_{k+1})$ are generated. In particular, we have
\[
C_k = \frac{1}{k+1}\bar C_k, \qquad d_k = \frac{1}{k+1}\bar d_k,
\]
where $\bar C_k$ and $\bar d_k$ are updated by
\[
\bar C_k = \bar C_{k-1} + \phi(i_k)\bigl(\phi(i_k) - \alpha\phi(i_{k+1})\bigr)', \qquad \bar d_k = \bar d_{k-1} + \phi(i_k)g(i_k, i_{k+1}).
\]
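The recursions above can be sketched end to end for a hypothetical 2-state chain with a single basis function ($s = 1$), so that $C_k$ and $d_k$ are scalars; the final line forms the simulation-based approximate solution $C_k^{-1}d_k$:

```python
import random
random.seed(1)

alpha = 0.9
P = [[0.75, 0.25],
     [0.50, 0.50]]                     # transition probabilities (assumed)
gcost = [[1.0, 4.0],
         [2.0, 3.0]]                   # stage costs g(i, j) (assumed)
phi = [1.0, 2.0]                       # single basis function (s = 1)

K = 200000                             # number of simulated transitions
state = 0
Cbar = dbar = 0.0
for _ in range(K):
    nxt = 0 if random.random() < P[state][0] else 1   # sample a transition
    Cbar += phi[state] * (phi[state] - alpha * phi[nxt])
    dbar += phi[state] * gcost[state][nxt]
    state = nxt

Ck, dk = Cbar / K, dbar / K            # cf. Eqs. (6.41)-(6.42)
r_hat = dk / Ck                        # simulation-based solution C_k^{-1} d_k
```

For this chain the exact solution of the projected equation can be computed by hand ($r^* \approx 8.095$), and $r_\text{hat}$ approaches it as $K$ grows.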
6.3.4 LSTD, LSPE, and TD(0) Methods
Given the simulation-based approximations $C_k$ and $d_k$, one possibility is to construct a simulation-based approximate solution
\[
\hat r_k = C_k^{-1}d_k. \tag{6.45}
\]
The established name for this method is LSTD (least squares temporal differences) method. Despite the dependence on the index $k$, this is not an iterative method, since we do not need $\hat r_{k-1}$ to compute $\hat r_k$. Rather it may be viewed as an approximate matrix inversion approach: we replace the projected equation $Cr = d$ with the approximation $C_kr = d_k$, using a batch of $k + 1$ simulation samples, and solve the approximate equation by matrix inversion. Note that by using Eqs. (6.41) and (6.42), the equation $C_kr = d_k$ can be written as
\[
\sum_{t=0}^{k}\phi(i_t)q_{k,t} = 0, \tag{6.46}
\]
where
\[
q_{k,t} = \phi(i_t)'r_k - \alpha\phi(i_{t+1})'r_k - g(i_t, i_{t+1}). \tag{6.47}
\]
The scalar $q_{k,t}$ is the so-called temporal difference, associated with $r_k$ and transition $(i_t, i_{t+1})$. It may be viewed as a sample of a residual term arising in the projected Bellman's equation. More specifically, from Eqs. (6.33), (6.34), we have
\[
Cr_k - d = \Phi'\Xi(\Phi r_k - \alpha P\Phi r_k - g). \tag{6.48}
\]
The three terms in the definition (6.47) of the temporal difference $q_{k,t}$ can be viewed as samples [associated with the transition $(i_t, i_{t+1})$] of the corresponding three terms in the expression $\Xi(\Phi r_k - \alpha P\Phi r_k - g)$ in Eq. (6.48).
Regression-Based LSTD
A potential difficulty in LSTD arises if the matrices $C$ and $C_k$ are nearly singular (e.g., when the discount factor is close to 1), because then the simulation-induced error
\[
\hat r_k - r^* = C_k^{-1}d_k - C^{-1}d
\]
may be greatly amplified. This is similar to a well-known difficulty in the solution of nearly singular systems of linear equations: the solution is strongly affected by small changes in the problem data, due for example to roundoff error.
Example 6.3.1:

To get a rough sense of the effect of the simulation error in LSTD, consider the approximate inversion of a small nonzero number $c$, which is estimated with simulation error $\epsilon$. The absolute and relative errors are
\[
E = \frac{1}{c+\epsilon} - \frac{1}{c}, \qquad E_r = \frac{E}{1/c}.
\]
By a first order Taylor series expansion around $\epsilon = 0$, we obtain for small $\epsilon$
\[
E \approx \frac{\partial\big(1/(c+\epsilon)\big)}{\partial \epsilon}\bigg|_{\epsilon=0}\epsilon = -\frac{\epsilon}{c^2}, \qquad E_r \approx -\frac{\epsilon}{c}.
\]
Thus for the estimate $\frac{1}{c+\epsilon}$ to be reliable, we must have $\epsilon \ll c$. If $N$ independent samples are used to estimate $c$, the variance of $\epsilon$ is proportional to $1/N$, so for a small relative error, $N$ must be much larger than $1/c^2$. Thus as $c$ approaches 0, the amount of sampling required for reliable simulation-based inversion increases very fast.
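This sampling requirement is easy to check numerically. The sketch below (with illustrative parameter values) estimates $c$ by averaging $N$ noisy samples and records the resulting relative inversion error; the error only becomes small once $N$ is large relative to $1/c^2$:

```python
import numpy as np

def relative_inversion_error(c, N, rng, trials=200):
    """Average relative error |1/(c+eps) - 1/c| / (1/c), where c is
    estimated by the mean of N unit-variance samples (so eps has
    variance 1/N)."""
    errs = []
    for _ in range(trials):
        c_hat = rng.normal(c, 1.0, size=N).mean()
        errs.append(abs(1.0 / c_hat - 1.0 / c) / (1.0 / c))
    return float(np.mean(errs))

rng = np.random.default_rng(0)
small_N = relative_inversion_error(0.1, 10, rng)      # N << 1/c^2 = 100
large_N = relative_inversion_error(0.1, 10_000, rng)  # N >> 1/c^2
```

With $c = 0.1$, reliable inversion requires $N \gg 1/c^2 = 100$: the $N = 10$ estimate is erratic, while $N = 10{,}000$ yields a small relative error, matching the first-order analysis above.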
To counter the large errors associated with a near-singular matrix $C$, an effective remedy is to estimate $r^*$ by a form of regularized regression, which works even if $C_k$ is singular, at the expense of a systematic/deterministic error (a "bias") in the generated estimate. In this approach, instead of solving the system $C_k r = d_k$, we use a least-squares fit of a linear model that properly encodes the effect of the simulation noise.
We write the projected form of Bellman's equation $d = Cr$ as
\[
d_k = C_k r + e_k, \tag{6.49}
\]
where $e_k$ is the vector
\[
e_k = (C - C_k) r + d_k - d,
\]
which we view as "simulation noise." We then estimate the solution $r^*$ based on Eq. (6.49) by using regression. In particular, we choose $r$ by solving the least squares problem:
\[
\min_r \Big\{ (d_k - C_k r)' \Sigma^{-1} (d_k - C_k r) + \beta \|r - \bar r\|^2 \Big\}, \tag{6.50}
\]
where $\bar r$ is an a priori estimate of $r^*$, $\Sigma$ is some positive definite symmetric matrix, and $\beta$ is a positive scalar. By setting to 0 the gradient of the least squares objective in Eq. (6.50), we can find the solution in closed form:
\[
\hat r_k = (C_k' \Sigma^{-1} C_k + \beta I)^{-1}(C_k' \Sigma^{-1} d_k + \beta \bar r). \tag{6.51}
\]
A suitable choice of $\bar r$ may be some heuristic guess based on intuition about the problem, or it may be the parameter vector corresponding to the estimated cost vector $\Phi \bar r$ of a similar policy (for example a preceding policy in an approximate policy iteration context). One may try to choose $\Sigma$ in special ways to enhance the quality of the estimate of $r^*$, but we will not consider this issue here; the subsequent analysis in this section does not depend on the choice of $\Sigma$, as long as it is positive definite and symmetric.

The quadratic $\beta\|r - \bar r\|^2$ in Eq. (6.50) is known as a regularization term, and has the effect of "biasing" the estimate $\hat r_k$ towards the a priori guess $\bar r$. The proper size of $\beta$ is not clear (a large value reduces the effect of near singularity of $C_k$, and the effect of the simulation errors $C_k - C$ and $d_k - d$, but may also cause a large "bias"). However, this is typically not a major difficulty in practice, because trial-and-error experimentation with different values of $\beta$ involves low-dimensional linear algebra calculations once $C_k$ and $d_k$ become available.
We will now derive an estimate for the error $\hat r_k - r^*$, where $r^* = C^{-1} d$ is the solution of the projected equation. Let us denote
\[
b_k = \Sigma^{-1/2}(d_k - C_k r^*),
\]
so from Eq. (6.51),
\[
\hat r_k - r^* = (C_k' \Sigma^{-1} C_k + \beta I)^{-1}\big(C_k' \Sigma^{-1/2} b_k + \beta(\bar r - r^*)\big). \tag{6.52}
\]
We have the following proposition, which involves the singular values of the matrix $\Sigma^{-1/2} C_k$ (these are the square roots of the eigenvalues of $C_k' \Sigma^{-1} C_k$; see e.g., Strang [Str09], [TrB97]).
Proposition 6.3.4: We have
\[
\|\hat r_k - r^*\| \le \max_{i=1,\dots,s}\left\{\frac{\lambda_i}{\lambda_i^2+\beta}\right\}\|b_k\| + \max_{i=1,\dots,s}\left\{\frac{\beta}{\lambda_i^2+\beta}\right\}\|\bar r - r^*\|, \tag{6.53}
\]
where $\lambda_1, \dots, \lambda_s$ are the singular values of $\Sigma^{-1/2} C_k$.
Proof: Let $\Sigma^{-1/2} C_k = U \Lambda V'$ be the singular value decomposition of $\Sigma^{-1/2} C_k$, where $\Lambda = \mathrm{diag}\{\lambda_1, \dots, \lambda_s\}$, and $U$, $V$ are unitary matrices ($UU' = VV' = I$ and $\|U\| = \|U'\| = \|V\| = \|V'\| = 1$; see [Str09], [TrB97]). Then, Eq. (6.52) yields
\[
\begin{aligned}
\hat r_k - r^* &= (V \Lambda U' U \Lambda V' + \beta I)^{-1}\big(V \Lambda U' b_k + \beta(\bar r - r^*)\big)\\
&= (V')^{-1}(\Lambda^2 + \beta I)^{-1} V^{-1}\big(V \Lambda U' b_k + \beta(\bar r - r^*)\big)\\
&= V(\Lambda^2 + \beta I)^{-1} \Lambda U' b_k + \beta\, V(\Lambda^2 + \beta I)^{-1} V'(\bar r - r^*).
\end{aligned}
\]
Figure 6.3.3. Illustration of Prop. 6.3.4. The figure shows the estimates
\[
\hat r_k = \big(C_k' \Sigma^{-1} C_k + \beta I\big)^{-1}\big(C_k' \Sigma^{-1} d_k + \beta \bar r\big)
\]
corresponding to a finite number of samples, and the exact values
\[
r^*_\beta = \big(C' \Sigma^{-1} C + \beta I\big)^{-1}\big(C' \Sigma^{-1} d + \beta \bar r\big)
\]
corresponding to an infinite number of samples. Note that $r^*_\infty = \bar r$ and $r^*_0 = r^* = C^{-1} d$. We may view $\hat r_k - r^*$ as the sum of a "simulation error" $\hat r_k - r^*_\beta$ whose norm is bounded by the first term in the estimate (6.53) and can be made arbitrarily small by sufficiently long sampling, and a "regularization error" $r^*_\beta - r^*$ whose norm is bounded by the second term in the right-hand side of Eq. (6.53).
Therefore, using the triangle inequality, we have
\[
\begin{aligned}
\|\hat r_k - r^*\| &\le \|V\| \max_{i=1,\dots,s}\left\{\frac{\lambda_i}{\lambda_i^2+\beta}\right\}\|U'\|\,\|b_k\| + \beta \|V\| \max_{i=1,\dots,s}\left\{\frac{1}{\lambda_i^2+\beta}\right\}\|V'\|\,\|\bar r - r^*\|\\
&= \max_{i=1,\dots,s}\left\{\frac{\lambda_i}{\lambda_i^2+\beta}\right\}\|b_k\| + \max_{i=1,\dots,s}\left\{\frac{\beta}{\lambda_i^2+\beta}\right\}\|\bar r - r^*\|.
\end{aligned}
\]
Q.E.D.
From Eq. (6.53), we see that the error $\|\hat r_k - r^*\|$ is bounded by the sum of two terms. The first term can be made arbitrarily small by using a sufficiently large number of samples, thereby making $\|b_k\|$ small. The second term reflects the bias introduced by the regularization and diminishes with $\beta$, but it cannot be made arbitrarily small by using more samples (see Fig. 6.3.3).
Now consider the case where $\beta = 0$, $\Sigma$ is the identity, and $C_k$ is invertible. Then $\hat r_k$ is the LSTD solution $C_k^{-1} d_k$, and the proof of Prop. 6.3.4 can be replicated to show that
\[
\|\hat r_k - r^*\| \le \max_{i=1,\dots,s}\left\{\frac{1}{\lambda_i}\right\}\|b_k\|,
\]
where $\lambda_1, \dots, \lambda_s$ are the (positive) singular values of $C_k$. This suggests that without regularization, the LSTD error can be adversely affected by near singularity of the matrix $C_k$ (smallest $\lambda_i$ close to 0). Thus we expect that for a nearly singular matrix $C$, a very large number of samples is necessary to attain a small error $\hat r_k - r^*$, with serious difficulties resulting, consistent with the scalar inversion example we gave earlier. Generally, if $\Phi$ has nearly dependent columns, $C$ is nearly singular, although in this case, it is possible that the error $r_k - r^*$ is large but the error $\Phi(r_k - r^*)$ is not.
Generally, the regularization of LSTD alleviates the effects of near singularity of $C$ and simulation error, but it comes at a price: there is a bias of the estimate $\hat r_k$ towards the prior guess $\bar r$ (cf. Fig. 6.3.3). One possibility to eliminate this bias is to adopt an iterative regularization approach: start with some $\bar r$, obtain $\hat r_k$, replace $\bar r$ by $\hat r_k$, and repeat any number of times. This turns LSTD into an iterative method, which will be shown to be a special case of LSPE, the next method to be discussed.
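The iterative regularization idea can be sketched by repeatedly reusing the regularized solve of Eq. (6.51), with the prior replaced by the latest estimate at each pass. When $C_k$ is invertible, the iterates converge to the unbiased LSTD solution $C_k^{-1} d_k$; a sketch with illustrative arguments:

```python
import numpy as np

def iterative_regularization(C_k, d_k, r0, Sigma, beta, iters):
    """Start with prior r0, solve the regularized regression (6.51),
    replace the prior by the new estimate, and repeat."""
    s = C_k.shape[1]
    Sinv = np.linalg.inv(Sigma)
    # Precompute the inverse appearing in Eq. (6.51).
    A = np.linalg.inv(C_k.T @ Sinv @ C_k + beta * np.eye(s))
    r = r0
    for _ in range(iters):
        r = A @ (C_k.T @ Sinv @ d_k + beta * r)  # Eq. (6.51) with r_bar = r
    return r
```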
LSPE Method
An alternative to LSTD is to use a true iterative method to solve the projected equation $Cr = d$ using simulation-based approximations to $C$ and $d$. One possibility is to approximate the scaled PVI iteration
\[
r_{k+1} = r_k - \gamma G(C r_k - d) \tag{6.54}
\]
[cf. Eq. (6.40)] with
\[
r_{k+1} = r_k - \gamma \hat G(\hat C r_k - \hat d), \tag{6.55}
\]
where $\hat C$ and $\hat d$ are simulation-based estimates of $C$ and $d$, $\gamma$ is a positive stepsize, and $\hat G$ is an $s \times s$ matrix, which may also be obtained by simulation. Assuming that $I - \gamma \hat G \hat C$ is a contraction, this iteration will yield a solution to the system $\hat C r = \hat d$, which will serve as a simulation-based approximation to a solution of the projected equation $Cr = d$.
Like LSTD, this may be viewed as a batch simulation approach: we first simulate to obtain $\hat C$, $\hat d$, and $\hat G$, and then solve the system $\hat C r = \hat d$ by the iteration (6.55) rather than direct matrix inversion. An alternative is to iteratively update $r$ as simulation samples are collected and used to form ever improving approximations to $C$ and $d$. In particular, one or more iterations of the form (6.55) may be performed after collecting a few additional simulation samples that are used to improve the approximations of the current $\hat C$ and $\hat d$. In the most extreme type of such an algorithm, the iteration (6.55) is used after a single new sample is collected. This algorithm has the form
\[
r_{k+1} = r_k - \gamma G_k(C_k r_k - d_k), \tag{6.56}
\]
where $G_k$ is an $s \times s$ matrix, $\gamma$ is a positive stepsize, and $C_k$ and $d_k$ are given by Eqs. (6.41)-(6.42). For the purposes of further discussion, we will focus on this algorithm, with the understanding that there are related versions that use (partial) batch simulation and have similar properties. Note that the iteration (6.56) may also be written in terms of temporal differences as
\[
r_{k+1} = r_k - \frac{\gamma}{k+1} G_k \sum_{t=0}^{k} \phi(i_t)\, q_{k,t} \tag{6.57}
\]
[cf. Eqs. (6.41), (6.42), (6.47)]. The convergence behavior of this method is satisfactory. Generally, we have $r_k \to r^*$, provided $C_k \to C$, $d_k \to d$, and $G_k \to G$, where $G$ and $\gamma$ are such that $I - \gamma G C$ is a contraction [this is fairly evident in view of the convergence of the iteration (6.54), which was shown in Section 6.3.2; see also the paper [Ber09]].

To ensure that $I - \gamma G C$ is a contraction for small $\gamma$, we may choose $G$ to be symmetric and positive definite, or to have a special form, such as
\[
G = (C' \Sigma^{-1} C + \beta I)^{-1} C' \Sigma^{-1},
\]
where $\Sigma$ is any positive definite symmetric matrix, and $\beta$ is a positive scalar [cf. Eq. (6.40)].
Regarding the choices of $\gamma$ and $G_k$, one possibility is to choose $\gamma = 1$ and $G_k$ to be a simulation-based approximation to $G = (\Phi' \Xi \Phi)^{-1}$, which is used in the PVI method (6.35)-(6.36):
\[
G_k = \left(\frac{1}{k+1}\sum_{t=0}^{k} \phi(i_t)\phi(i_t)'\right)^{-1}, \tag{6.58}
\]
or
\[
G_k = \left(\frac{\beta}{k+1} I + \frac{1}{k+1}\sum_{t=0}^{k} \phi(i_t)\phi(i_t)'\right)^{-1}, \tag{6.59}
\]
where $\beta I$ is a positive multiple of the identity (to ensure that $G_k$ is positive definite). This iteration is known as LSPE (least squares policy evaluation); it is historically the first method of this type, and has the advantage that it allows the use of the known stepsize value $\gamma = 1$.
Note that while $G_k$, as defined by Eqs. (6.58) and (6.59), requires updating and inversion at every iteration, a partial batch mode of updating $G_k$ is also possible: one may introduce into iteration (6.56) a new estimate of $G = (\Phi' \Xi \Phi)^{-1}$ periodically, obtained from the previous estimate using multiple simulation samples. This will save some computation and will not affect the asymptotic convergence rate of the method, as we will discuss shortly. Indeed, as noted earlier, the iteration (6.56) itself may be executed in partial batch mode, after collecting multiple samples between iterations. Note also that even if $G_k$ is updated at every $k$ using Eqs. (6.58) and (6.59), the updating can be done recursively; for example, from Eq. (6.58) we have
\[
G_k^{-1} = \frac{k}{k+1} G_{k-1}^{-1} + \frac{1}{k+1}\phi(i_k)\phi(i_k)'.
\]
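A minimal sketch of the LSPE iteration (6.56) with the scaling matrix (6.58), updating $C_k$, $d_k$, and $G_k^{-1}$ recursively as each transition is simulated. The problem data (`P`, `g`, `phi`, `alpha`) are illustrative, and a small multiple of the identity initializes $G_k^{-1}$ for numerical safety:

```python
import numpy as np

def lspe(P, g, phi, alpha, num_samples, gamma=1.0, rng=None):
    """LSPE: r_{k+1} = r_k - gamma * G_k (C_k r_k - d_k)  [Eq. (6.56)],
    where Ginv holds G_k^{-1} = (1/(k+1)) sum_t phi(i_t)phi(i_t)'  [Eq. (6.58)]."""
    rng = rng or np.random.default_rng(0)
    n, s = phi.shape
    C, d, Ginv = np.zeros((s, s)), np.zeros(s), 1e-6 * np.eye(s)
    r = np.zeros(s)
    i = rng.integers(n)
    for k in range(num_samples):
        j = rng.choice(n, p=P[i])
        # recursive running averages over t = 0, ..., k
        C += (np.outer(phi[i], phi[i] - alpha * phi[j]) - C) / (k + 1)
        d += (phi[i] * g[i, j] - d) / (k + 1)
        Ginv += (np.outer(phi[i], phi[i]) - Ginv) / (k + 1)
        # solve(Ginv, x) applies G_k without forming the inverse explicitly
        r = r - gamma * np.linalg.solve(Ginv, C @ r - d)
        i = j
    return r
```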
Another choice of $G_k$ is
\[
G_k = (C_k' \Sigma_k^{-1} C_k + \beta I)^{-1} C_k' \Sigma_k^{-1}, \tag{6.60}
\]
where $\Sigma_k$ is some positive definite symmetric matrix, and $\beta$ is a positive scalar. Then the iteration (6.56) takes the form
\[
r_{k+1} = r_k - \gamma (C_k' \Sigma_k^{-1} C_k + \beta I)^{-1} C_k' \Sigma_k^{-1}(C_k r_k - d_k),
\]
and for $\gamma = 1$, it can be written as
\[
r_{k+1} = (C_k' \Sigma_k^{-1} C_k + \beta I)^{-1}(C_k' \Sigma_k^{-1} d_k + \beta r_k). \tag{6.61}
\]
We recognize this as an iterative version of the regression-based LSTD method (6.51), where the prior guess $\bar r$ is replaced by the previous iterate $r_k$. This iteration is convergent to $r^*$ provided that $\{\Sigma_k^{-1}\}$ is bounded [$\gamma = 1$ is within the range of stepsizes for which $I - \gamma G C$ is a contraction; see the discussion following Eq. (6.40)].
A simple possibility is to use a diagonal matrix $G_k$, thereby simplifying the matrix inversion in the iteration (6.56). One possible choice is a diagonal approximation to $\Phi' \Xi \Phi$, obtained by discarding the off-diagonal terms of the matrix (6.58) or (6.59). Then it is reasonable to expect that a stepsize $\gamma$ close to 1 will often lead to $I - \gamma G C$ being a contraction, thereby facilitating the choice of $\gamma$. The simplest possibility is to just choose $G_k$ to be the identity, although in this case, some experimentation is needed to find a proper value of $\gamma$ such that $I - \gamma C$ is a contraction.
Convergence Rate of LSPE – Comparison with LSTD
Let us now discuss the choice of γ and G from the convergence rate point of
view. It can be easily veriﬁed with simple examples that the values of γ and
G aﬀect signiﬁcantly the convergence rate of the deterministic scaled PVI
iteration (6.54). Surprisingly, however, the asymptotic convergence rate of
the simulationbased iteration (6.56) does not depend on the choices of γ
and G. Indeed it can be proved that the iteration (6.56) converges at the
same rate asymptotically, regardless of the choices of γ and G, as long as
I − γGC is a contraction (although the shortterm convergence rate may
be signiﬁcantly aﬀected by the choices of γ and G).
The reason is that the scaled PVI iteration (6.54) has a linear convergence rate (since it involves a contraction), which is fast relative to the slow convergence rate of the simulation-generated $G_k$, $C_k$, and $d_k$. Thus the simulation-based iteration (6.56) operates on two time scales (see, e.g., Borkar [Bor08], Ch. 6): the slow time scale at which $G_k$, $C_k$, and $d_k$ change, and the fast time scale at which $r_k$ adapts to changes in $G_k$, $C_k$, and $d_k$. As a result, essentially, there is convergence in the fast time scale before there is appreciable change in the slow time scale. Roughly speaking, $r_k$ "sees $G_k$, $C_k$, and $d_k$ as effectively constant," so that for large $k$, $r_k$ is essentially equal to the corresponding limit of iteration (6.56) with $G_k$, $C_k$, and $d_k$ held fixed. This limit is $C_k^{-1} d_k$. It follows that the sequence $r_k$ generated by the scaled LSPE iteration (6.56) "tracks" the sequence $C_k^{-1} d_k$ generated by the LSTD iteration in the sense that
\[
\|r_k - C_k^{-1} d_k\| \ll \|r_k - r^*\|, \qquad \text{for large } k,
\]
independent of the choice of $\gamma$ and the scaling matrix $G$ that is approximated by $G_k$ (see also [Ber09] for further discussion).
TD(0) Method
This is an iterative method for solving the projected equation $Cr = d$. Like LSTD and LSPE, it generates an infinitely long trajectory $\{i_0, i_1, \dots\}$ of the Markov chain, but at each iteration, it uses only one sample, the last one. It has the form
\[
r_{k+1} = r_k - \gamma_k \phi(i_k) q_{k,k}, \tag{6.62}
\]
where $\gamma_k$ is a stepsize sequence that diminishes to 0. It may be viewed as an instance of a classical stochastic approximation/Robbins-Monro scheme for solving the projected equation $Cr = d$. This equation can be written as $\Phi' \Xi(\Phi r - A \Phi r - b) = 0$, and by using Eqs. (6.47) and (6.62), it can be seen that the direction of change $\phi(i_k) q_{k,k}$ in TD(0) is a sample of the left-hand side $\Phi' \Xi(\Phi r - A \Phi r - b)$ of the equation.
Let us note a similarity between TD(0) and the scaled LSPE method (6.57) with $G_k = I$, given by:
\[
r_{k+1} = r_k - \gamma(C_k r_k - d_k) = r_k - \frac{\gamma}{k+1}\sum_{t=0}^{k} \phi(i_t)\, q_{k,t}. \tag{6.63}
\]
While LSPE uses as direction of change a time-average approximation of $C r_k - d$ based on all the available samples, TD(0) uses a single sample approximation. It is thus not surprising that TD(0) is a much slower algorithm than LSPE, and moreover requires that the stepsize $\gamma_k$ diminish to 0 in order to deal with the nondiminishing noise that is inherent in the term $\phi(i_k) q_{k,k}$ of Eq. (6.62). On the other hand, TD(0) requires much less overhead per iteration: calculating the single temporal difference $q_{k,k}$ and multiplying it with $\phi(i_k)$, rather than updating the $s \times s$ matrix $C_k$ and multiplying it with $r_k$. Thus when $s$, the number of features, is very large, TD(0) may offer a significant overhead advantage over LSTD and LSPE.
We finally note a scaled version of TD(0) given by
\[
r_{k+1} = r_k - \gamma_k G_k \phi(i_k) q_{k,k}, \tag{6.64}
\]
where $G_k$ is a positive definite symmetric scaling matrix, selected to speed up convergence. It is a scaled (by the matrix $G_k$) version of TD(0), so it may be viewed as a type of scaled stochastic approximation method.
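For comparison with LSPE, here is a minimal TD(0) sketch of Eq. (6.62), using the diminishing stepsize $\gamma_k = 1/(k+1)$; note that it updates $r$ with a single-sample temporal difference rather than the running averages $C_k$, $d_k$ (problem data illustrative):

```python
import numpy as np

def td0(P, g, phi, alpha, num_samples, rng=None):
    """TD(0): r_{k+1} = r_k - gamma_k * phi(i_k) * q_{k,k}  [Eq. (6.62)],
    with gamma_k = 1/(k+1) diminishing to 0."""
    rng = rng or np.random.default_rng(0)
    n, s = phi.shape
    r = np.zeros(s)
    i = rng.integers(n)
    for k in range(num_samples):
        j = rng.choice(n, p=P[i])
        q = phi[i] @ r - alpha * phi[j] @ r - g[i, j]  # temporal difference
        r = r - (1.0 / (k + 1)) * phi[i] * q
        i = j
    return r
```

On small test problems each step is far cheaper than an LSPE step, but the iterates approach the solution much more slowly, consistent with the discussion above.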
6.3.5 Optimistic Versions
In the LSTD and LSPE methods discussed so far, the underlying assumption is that each policy is evaluated with a very large number of samples, so that accurate approximations of $C$ and $d$ are obtained. There are also optimistic versions (cf. Section 6.1.2), where the policy $\mu$ is replaced by an "improved" policy after only a certain number of simulation samples have been processed.
A natural form of optimistic LSTD is $\hat r_{k+1} = C_k^{-1} d_k$, where $C_k$ and $d_k$ are obtained by averaging samples collected using the controls corresponding to the (approximately) improved policy; this is the policy $\mu_{k+1}$ whose controls are generated by
\[
\mu_{k+1}(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big(g(i,u,j) + \alpha \phi(j)' \hat r_k\big)
\]
[cf. Eq. (6.45)]. By this we mean that $C_k$ and $d_k$ are time averages of the matrices and vectors
\[
\phi(i_t)\big(\phi(i_t) - \alpha \phi(i_{t+1})\big)', \qquad \phi(i_t) g(i_t, i_{t+1}),
\]
corresponding to simulated transitions $(i_t, i_{t+1})$ that are generated using $\mu_{k+1}$ [cf. Eqs. (6.41), (6.42)]. Unfortunately, this method requires the collection of many samples between policy updates, as it is susceptible to simulation noise in $C_k$ and $d_k$, particularly when $C_k$ is nearly singular.
The optimistic version of (scaled) LSPE is based on similar ideas. Following the state transition $(i_k, i_{k+1})$, we update $r_k$ using the iteration
\[
r_{k+1} = r_k - \gamma G_k(C_k r_k - d_k), \tag{6.65}
\]
where $C_k$ and $d_k$ are given by Eqs. (6.41), (6.42) [cf. Eq. (6.56)], and $G_k$ is a scaling matrix that converges to some $G$ for which $I - \gamma G C$ is a contraction. For example, $G_k$ could be a positive definite symmetric matrix [such as the one given by Eq. (6.58)] or the matrix
\[
G_k = (C_k' \Sigma_k^{-1} C_k + \beta I)^{-1} C_k' \Sigma_k^{-1} \tag{6.66}
\]
[cf. Eq. (6.60)]. In the latter case, for $\gamma = 1$ the method takes the form
\[
\hat r_{k+1} = (C_k' \Sigma_k^{-1} C_k + \beta I)^{-1}(C_k' \Sigma_k^{-1} d_k + \beta \hat r_k) \tag{6.67}
\]
[cf. Eq. (6.61)]. The simulated transitions are generated using a policy that is updated every few samples. In the extreme case of a single sample between policies, we generate the next transition $(i_{k+1}, i_{k+2})$ using the control
\[
u_{k+1} = \arg\min_{u \in U(i_{k+1})} \sum_{j=1}^{n} p_{i_{k+1} j}(u)\big(g(i_{k+1}, u, j) + \alpha \phi(j)' r_{k+1}\big).
\]
Because the theoretical convergence guarantees of LSPE apply only to the nonoptimistic version, it may be essential to experiment with various values of the stepsize $\gamma$ [this is true even if $G_k$ is chosen according to Eq. (6.58), for which $\gamma = 1$ guarantees convergence in the nonoptimistic version]. There is also a similar optimistic version of TD(0).

To improve the reliability of the optimistic LSTD method it seems necessary to turn it into an iterative method, which then brings it very close to LSPE. In particular, an iterative version of the regression-based LSTD method (6.51) is given by Eq. (6.67), and is the special case of LSPE corresponding to the special choice of the scaling matrix $G_k$ of Eq. (6.66). As in Section 6.3.4, the matrix $\Sigma_k$ should be comparable to the covariance matrix of the vector $C r^* - d$ and should be computed using some heuristic scheme.
Generally, in optimistic LSTD and LSPE, a substantial number of samples may need to be collected with the same policy before switching policies, in order to reduce the variance of $C_k$ and $d_k$. As an alternative, one may consider building up $C_k$ and $d_k$ as weighted averages, using samples from several past policies, while giving larger weight to the samples of the current policy. One may argue that mixing samples from several past policies may have a beneficial exploration effect. Still, however, similar to other versions of policy iteration, to enhance exploration, one may occasionally replace the control $u_{k+1}$ by a control selected at random from the constraint set $U(i_{k+1})$. The complexities introduced by these variations are not fully understood at present. For experimental investigations of optimistic policy iteration, see Bertsekas and Ioffe [BeI96], Jung and Polani [JuP07], and Busoniu et al. [BED09].
6.3.6 Policy Oscillations – Chattering
As noted earlier, optimistic variants of policy evaluation methods with function approximation are popular in practice, but the associated convergence behavior is complex and not well understood at present. Section 5.4 of Bertsekas and Tsitsiklis [BeT96] provides an analysis of optimistic policy iteration for the case of a lookup table representation, where $\Phi = I$, while Section 6.4 of the same reference considers general approximation architectures. This analysis is based on the use of the so-called greedy partition. For a given approximation architecture $\tilde J(\cdot, r)$, this is a partition of the space $\Re^s$ of parameter vectors $r$ into subsets $R_\mu$, each subset corresponding to a stationary policy $\mu$, and defined by
\[
R_\mu = \left\{ r \;\bigg|\; \mu(i) = \arg\min_{u \in U(i)} \sum_{j=1}^{n} p_{ij}(u)\big(g(i,u,j) + \alpha \tilde J(j, r)\big),\ i = 1, \dots, n \right\}.
\]
Thus, $R_\mu$ is the set of parameter vectors $r$ for which $\mu$ is greedy with respect to $\tilde J(\cdot, r)$.
For simplicity, let us assume that we use a policy evaluation method that for each given $\mu$ produces a unique parameter vector $r_\mu$. Nonoptimistic policy iteration starts with a parameter vector $r_0$, which specifies $\mu^0$ as a greedy policy with respect to $\tilde J(\cdot, r_0)$, and generates $r_{\mu^0}$ by using the given policy evaluation method. It then finds a policy $\mu^1$ that is greedy with respect to $\tilde J(\cdot, r_{\mu^0})$, i.e., a $\mu^1$ such that
\[
r_{\mu^0} \in R_{\mu^1}.
\]
It then repeats the process with $\mu^1$ replacing $\mu^0$. If some policy $\mu^k$ satisfying
\[
r_{\mu^k} \in R_{\mu^k} \tag{6.68}
\]
is encountered, the method keeps generating that policy. This is the necessary and sufficient condition for policy convergence in the nonoptimistic policy iteration method.
Figure 6.3.4: Greedy partition and cycle of policies generated by nonoptimistic policy iteration. In this figure, the method cycles between four policies and the corresponding four parameters $r_{\mu^k}$, $r_{\mu^{k+1}}$, $r_{\mu^{k+2}}$, $r_{\mu^{k+3}}$.
In the case of a lookup table representation where the parameter vectors $r_\mu$ are equal to the cost-to-go vector $J_\mu$, the condition $r_{\mu^k} \in R_{\mu^k}$ is equivalent to $r_{\mu^k} = T r_{\mu^k}$, and is satisfied if and only if $\mu^k$ is optimal. When there is function approximation, however, this condition need not be satisfied for any policy. Since there is a finite number of possible vectors $r_\mu$, one generated from another in a deterministic way, the algorithm ends up repeating some cycle of policies $\mu^k, \mu^{k+1}, \dots, \mu^{k+m}$ with
\[
r_{\mu^k} \in R_{\mu^{k+1}},\ r_{\mu^{k+1}} \in R_{\mu^{k+2}},\ \dots,\ r_{\mu^{k+m-1}} \in R_{\mu^{k+m}},\ r_{\mu^{k+m}} \in R_{\mu^k}; \tag{6.69}
\]
(see Fig. 6.3.4). Furthermore, there may be several different cycles, and the method may end up converging to any one of them; the actual cycle obtained depends on the starting policy $\mu^0$. This is similar to gradient methods applied to minimization of functions with multiple local minima, where the limit of convergence depends on the starting point.
In the case of optimistic policy iteration, the trajectory of the method is less predictable and depends on the fine details of the iterative policy evaluation method, such as the frequency of the policy updates and the stepsize used. Generally, given the current policy $\mu$, optimistic policy iteration will move towards the corresponding "target" parameter $r_\mu$, for as long as $\mu$ continues to be greedy with respect to the current cost-to-go approximation $\tilde J(\cdot, r)$, that is, for as long as the current parameter vector $r$ belongs to the set $R_\mu$. Once, however, the parameter $r$ crosses into another set, say $R_{\bar\mu}$, the policy $\bar\mu$ becomes greedy, and $r$ changes course and starts moving towards the new "target" $r_{\bar\mu}$. Thus, the "targets" $r_\mu$ of the method, and the corresponding policies $\mu$ and sets $R_\mu$ may keep changing, similar to nonoptimistic policy iteration. Simultaneously, the parameter vector $r$ will move near the boundaries that separate the regions $R_\mu$ that the method visits, following reduced versions of the cycles that nonoptimistic policy iteration may follow (see Fig. 6.3.5). Furthermore, as Fig. 6.3.5 shows, if diminishing parameter changes are made between policy updates (such as for example when a diminishing stepsize is used by the policy evaluation method) and the method eventually cycles between several policies, the parameter vectors will tend to converge to the common boundary of the regions $R_\mu$ corresponding to these policies. This is the so-called chattering phenomenon for optimistic policy iteration, whereby there is simultaneously oscillation in policy space and convergence in parameter space. The following is an example of chattering. Other examples are given in Section 6.4.2 of [BeT96] (Examples 6.9 and 6.10).
Figure 6.3.5: Illustration of a trajectory of optimistic policy iteration. The algorithm settles into an oscillation between policies $\mu^1$, $\mu^2$, $\mu^3$ with $r_{\mu^1} \in R_{\mu^2}$, $r_{\mu^2} \in R_{\mu^3}$, $r_{\mu^3} \in R_{\mu^1}$. The parameter vectors converge to the common boundary of these policies.

Example 6.3.2 (Chattering)

Consider a discounted problem with two states, 1 and 2, illustrated in Fig. 6.3.6(a). There is a choice of control only at state 1, and there are two policies, denoted $\mu^*$ and $\mu$. The optimal policy $\mu^*$, when at state 1, stays at 1 with probability $p > 0$ and incurs a negative cost $c$. The other policy, $\mu$, cycles between the two states with 0 cost. We consider linear approximation with a single feature $\phi(i)' = i$ for each of the states $i = 1, 2$, i.e.,
\[
\Phi = \begin{pmatrix}1\\2\end{pmatrix}, \qquad \tilde J = \Phi r = \begin{pmatrix}r\\2r\end{pmatrix}.
\]
Figure 6.3.6. The problem of Example 6.3.2. (a) Costs and transition probabilities for the policies $\mu$ and $\mu^*$. (b) The greedy partition and the solutions of the projected equations corresponding to $\mu$ and $\mu^*$; here $r_\mu = 0$ and $r_{\mu^*} \approx c/(1-\alpha)$. Nonoptimistic policy iteration oscillates between $r_\mu$ and $r_{\mu^*}$.
Let us construct the greedy partition.
We next calculate the points $r_\mu$ and $r_{\mu^*}$ that solve the projected equations
\[
C_\mu r_\mu = d_\mu, \qquad C_{\mu^*} r_{\mu^*} = d_{\mu^*},
\]
which correspond to $\mu$ and $\mu^*$, respectively [cf. Eqs. (6.33), (6.34)]. We have
\[
C_\mu = \Phi' \Xi_\mu (I - \alpha P_\mu) \Phi = (1\ \ 2)\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\begin{pmatrix}1 & -\alpha\\ -\alpha & 1\end{pmatrix}\begin{pmatrix}1\\2\end{pmatrix} = 5 - 4\alpha,
\]
\[
d_\mu = \Phi' \Xi_\mu g_\mu = (1\ \ 2)\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\begin{pmatrix}0\\0\end{pmatrix} = 0,
\]
so
\[
r_\mu = 0.
\]
Similarly, with some calculation,
\[
\begin{aligned}
C_{\mu^*} &= \Phi' \Xi_{\mu^*} (I - \alpha P_{\mu^*}) \Phi\\
&= (1\ \ 2)\begin{pmatrix}\frac{1}{2-p} & 0\\ 0 & \frac{1-p}{2-p}\end{pmatrix}\begin{pmatrix}1 - \alpha p & -\alpha(1-p)\\ -\alpha & 1\end{pmatrix}\begin{pmatrix}1\\2\end{pmatrix} = \frac{5 - 4p - \alpha(4 - 3p)}{2 - p},
\end{aligned}
\]
\[
d_{\mu^*} = \Phi' \Xi_{\mu^*} g_{\mu^*} = (1\ \ 2)\begin{pmatrix}\frac{1}{2-p} & 0\\ 0 & \frac{1-p}{2-p}\end{pmatrix}\begin{pmatrix}c\\0\end{pmatrix} = \frac{c}{2-p},
\]
so
\[
r_{\mu^*} = \frac{c}{5 - 4p - \alpha(4 - 3p)}.
\]
We now note that since $c < 0$,
\[
r_\mu = 0 \in R_{\mu^*},
\]
while for $p \approx 1$ and $\alpha > 1 - \alpha$, we have
\[
r_{\mu^*} \approx \frac{c}{1-\alpha} \in R_\mu;
\]
cf. Fig. 6.3.6(b). In this case, approximate policy iteration cycles between $\mu$ and $\mu^*$. Optimistic policy iteration uses some algorithm that moves the current value $r$ towards $r_{\mu^*}$ if $r \in R_{\mu^*}$, and towards $r_\mu$ if $r \in R_\mu$. Thus optimistic policy iteration starting from a point in $R_\mu$ moves towards $r_{\mu^*}$, and once it crosses the boundary point $c/\alpha$ of the greedy partition, it reverses course and moves towards $r_\mu$. If the method makes small incremental changes in $r$ before checking whether to change the current policy, it will incur a small oscillation around $c/\alpha$. If the incremental changes in $r$ are diminishing, the method will converge to $c/\alpha$. Yet $c/\alpha$ does not correspond to either of the two policies and has no meaning as a desirable parameter value.
Notice that it is hard to predict when an oscillation will occur and what kind of oscillation it will be. For example if $c > 0$, we have
\[
r_\mu = 0 \in R_\mu,
\]
while for $p \approx 1$ and $\alpha > 1 - \alpha$, we have
\[
r_{\mu^*} \approx \frac{c}{1-\alpha} \in R_{\mu^*}.
\]
In this case approximate as well as optimistic policy iteration will converge to $\mu$ (or $\mu^*$) if started with $r$ in $R_\mu$ (or $R_{\mu^*}$, respectively).
When chattering occurs, the limit of optimistic policy iteration tends to be on a common boundary of several subsets of the greedy partition and may not meaningfully represent a cost approximation of any of the corresponding policies. Thus, the limit to which the method converges cannot always be used to construct an approximation of the cost-to-go of any policy or the optimal cost-to-go. As a result, at the end of optimistic policy iteration and in contrast with the nonoptimistic version, one must go back and perform a screening process; that is, evaluate by simulation the many policies generated by the method starting from the initial conditions of interest and select the most promising one. This is a disadvantage of optimistic policy iteration that may nullify whatever practical rate of convergence advantages the method may have over its nonoptimistic counterpart.
An additional insight is that the choice of the iterative policy evaluation method (e.g., LSTD, LSPE, or TD for various values of $\lambda$) makes a difference in rate of convergence, but does not seem crucial for the quality of the final policy obtained (as long as the methods converge). Using a different value of $\lambda$ changes the targets $r_\mu$ somewhat, but leaves the greedy partition unchanged. As a result, different methods "fish in the same waters" and tend to yield similar ultimate cycles of policies.
Finally, let us note that while chattering is a typical phenomenon in optimistic policy iteration with a linear parametric architecture, and should be an important concern in its implementation, there are indications that for many problems, its effects may not be very serious. Indeed, suppose that we have convergence to a parameter vector $r$ and that there is a steady-state policy oscillation involving a collection of policies $\mathcal{M}$. Then, all the policies in $\mathcal{M}$ are greedy with respect to $\tilde J(\cdot, r)$, which implies that there is a subset of states $i$ such that there are at least two different controls $\mu_1(i)$ and $\mu_2(i)$ satisfying
\[
\begin{aligned}
\min_{u \in U(i)} \sum_j p_{ij}(u)\big(g(i,u,j) + \alpha \tilde J(j, r)\big)
&= \sum_j p_{ij}\big(\mu_1(i)\big)\Big(g\big(i, \mu_1(i), j\big) + \alpha \tilde J(j, r)\Big)\\
&= \sum_j p_{ij}\big(\mu_2(i)\big)\Big(g\big(i, \mu_2(i), j\big) + \alpha \tilde J(j, r)\Big).
\end{aligned}
\tag{6.70}
\]
Each equation of this type can be viewed as a constraining relation on the parameter vector $r$. Thus, excluding singular situations, there will be at most $s$ relations of the form (6.70) holding, where $s$ is the dimension of $r$. This implies that there will be at most $s$ "ambiguous" states where more than one control is greedy with respect to $\tilde J(\cdot, r)$ (in Example 6.3.2, state 1 is "ambiguous").
Now assume that we have a problem where the total number of states is much larger than $s$ and, furthermore, there are no "critical" states; that is, the cost consequences of changing a policy in just a small number of states (say, of the order of $s$) are relatively small. It then follows that all policies in the set $\mathcal{M}$ involved in chattering have roughly the same cost. Furthermore, for the methods of this section, one may argue that the cost approximation $\tilde J(\cdot, r)$ is close to the cost approximation $\tilde J(\cdot, r_\mu)$ that would be generated for any of the policies $\mu \in \mathcal{M}$. Note, however, that the assumption of "no critical states," aside from not being easily quantifiable, will not be true for many problems.
We finally note that even if all policies involved in chattering have roughly the same cost, it is still possible that none of them is particularly good; the policy iteration process may be just cycling in a "bad" part of the greedy partition. An interesting case in point is the game of tetris, which has been used as a testbed for approximate DP methods [Van95], [TsV96], [BeI96], [Kak02], [FaV06], [SzL06], [DFM09]. Using a set of 22 features and approximate policy iteration with policy evaluation based on the projected equation and the LSPE method [BeI96], an average score of a few thousands was achieved. Using the same features and a random search method in the space of weight vectors $r$, an average score of over 900,000 was achieved [ThS09]. This suggests that in the tetris problem, policy iteration using the projected equation is seriously hampered by oscillations/chattering between relatively poor policies, roughly similar to the attraction of gradient methods to poor local minima. The full ramifications of policy oscillation in practice are not fully understood at present, but it is clear that they give serious reason for concern. Moreover, local-minima-type phenomena may be causing similar difficulties in other related approximate DP methodologies: approximate policy iteration with the Bellman error method (see Section 6.8.4), policy gradient methods (see Section 6.9), and approximate linear programming (the tetris problem, using the same 22 features, has been addressed by approximate linear programming [FaV06], [DFM09], and with a policy gradient method [Kak02], also with an achieved average score of a few thousands).
6.3.7 Multistep Simulation-Based Methods
A useful approach in approximate DP is to replace Bellman's equation with an equivalent equation that reflects control over multiple successive stages. This amounts to replacing T with a multistep version that has the same fixed points; for example, T^ℓ with ℓ > 1, or T^(λ) given by

T^(λ) = (1 − λ) Σ_{ℓ=0}^∞ λ^ℓ T^{ℓ+1},

where λ ∈ (0, 1). We will focus on the λ-weighted multistep Bellman equation

J = T^(λ) J.

A straightforward calculation shows that it can be written as

T^(λ) J = g^(λ) + αP^(λ) J,    (6.71)

with

P^(λ) = (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^{ℓ+1},    g^(λ) = Σ_{ℓ=0}^∞ α^ℓ λ^ℓ P^ℓ g = (I − αλP)^{−1} g.    (6.72)
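As a numerical illustration of these formulas, the following sketch (a hypothetical two-state chain; all numbers are illustrative) computes P^(λ) and g^(λ) in closed form and checks that the policy cost J_µ = (I − αP)^{−1}g, the fixed point of T, is also a fixed point of T^(λ):

```python
import numpy as np

alpha, lam = 0.9, 0.5

# A hypothetical two-state Markov chain with transition matrix P and costs g.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
g = np.array([1.0, 2.0])

# Multistep quantities of Eq. (6.72), using the closed forms of the series:
#   P^(lam) = (1 - lam) (I - alpha*lam*P)^(-1) P,
#   g^(lam) = (I - alpha*lam*P)^(-1) g.
I = np.eye(2)
P_lam = (1 - lam) * np.linalg.solve(I - alpha * lam * P, P)
g_lam = np.linalg.solve(I - alpha * lam * P, g)

# The cost of the policy solves Bellman's equation J = g + alpha*P*J ...
J = np.linalg.solve(I - alpha * P, g)

# ... and is also a fixed point of T^(lam) J = g^(lam) + alpha * P^(lam) J.
T_lam_J = g_lam + alpha * P_lam @ J
assert np.allclose(T_lam_J, J)
print("fixed point check passed")
```

Note that the row sums of αP^(λ) equal α(1 − λ)/(1 − αλ), which is consistent with the contraction modulus given in Prop. 6.3.5.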
We may then apply variants of the preceding simulation algorithms to find a fixed point of T^(λ) in place of T. The corresponding projected equation takes the form

C^(λ) r = d^(λ),

where

C^(λ) = Φ′Ξ(I − αP^(λ))Φ,    d^(λ) = Φ′Ξ g^(λ),    (6.73)
[cf. Eq. (6.34)]. The motivation for replacing T with T^(λ) is that the modulus of contraction of T^(λ) is smaller, resulting in a tighter error bound. This is shown in the following proposition.
Proposition 6.3.5: The mappings T^(λ) and ΠT^(λ) are contractions of modulus

α_λ = α(1 − λ) / (1 − αλ)

with respect to the weighted Euclidean norm ‖·‖_ξ, where ξ is the steady-state probability vector of the Markov chain. Furthermore,

‖J_µ − Φr*_λ‖_ξ ≤ (1 / √(1 − α_λ²)) ‖J_µ − ΠJ_µ‖_ξ,    (6.74)

where Φr*_λ is the fixed point of ΠT^(λ).
Proof: Using Lemma 6.3.1, we have

‖P^(λ) z‖_ξ ≤ (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ ‖P^{ℓ+1} z‖_ξ ≤ (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ ‖z‖_ξ = ((1 − λ) / (1 − αλ)) ‖z‖_ξ.

Since T^(λ) is linear with associated matrix αP^(λ) [cf. Eq. (6.71)], it follows that T^(λ) is a contraction with modulus α(1 − λ)/(1 − αλ). The estimate (6.74) follows similarly to the proof of Prop. 6.3.2. Q.E.D.
Note that α_λ decreases as λ increases, and α_λ → 0 as λ → 1. Furthermore, the error bound (6.74) also becomes better as λ increases. Indeed, from Eq. (6.74) it follows that as λ → 1, the projected equation solution Φr*_λ converges to the "best" approximation ΠJ_µ of J_µ on S. This suggests
that large values of λ should be used. On the other hand, we will later
argue that when simulation-based approximations are used, the effects of simulation noise become more pronounced as λ increases. Furthermore, we
should note that in the context of approximate policy iteration, the objective is not just to approximate well the cost of the current policy, but rather
to use the approximate cost to obtain the next “improved” policy. We are
ultimately interested in a “good” next policy, and there is no consistent
experimental or theoretical evidence that this is achieved solely by good
cost approximation of the current policy. Thus, in practice, some trial and
error with the value of λ may be useful.
Another interesting fact, which follows from the property α_λ → 0 as λ → 1, is that given any norm, the mapping T^(λ) is a contraction (with arbitrarily small modulus) with respect to that norm for λ sufficiently close to 1. This is a consequence of the norm equivalence property in ℜ^n (any norm is bounded by a constant multiple of any other norm). As a result, for any weighted Euclidean norm of projection, ΠT^(λ) is a contraction provided λ is sufficiently close to 1.
LSTD(λ), LSPE(λ), and TD(λ)
The simulation-based methods of the preceding subsections correspond to λ = 0, but can be extended to λ > 0. In particular, in a matrix inversion approach, the unique solution of the projected equation may be approximated by

(C_k^(λ))^{−1} d_k^(λ),

where C_k^(λ) and d_k^(λ) are simulation-based approximations of C^(λ) and d^(λ). This is the LSTD(λ) method. There is also a regression/regularization variant of this method along the lines described earlier [cf. Eq. (6.51)].
Similarly, we may consider the (scaled) LSPE(λ) iteration

r_{k+1} = r_k − γG_k (C_k^(λ) r_k − d_k^(λ)),    (6.75)

where γ is a stepsize and G_k is a scaling matrix that converges to some G such that I − γGC^(λ) is a contraction. One possibility is to choose γ = 1 and

G_k = (1/(k + 1)) Σ_{t=0}^k φ(i_t) φ(i_t)′.
Diagonal approximations to this matrix may also be used to avoid the computational overhead of matrix inversion. Another possibility is

G_k = (C_k^(λ)′ Σ_k^{−1} C_k^(λ) + βI)^{−1} C_k^(λ)′ Σ_k^{−1},    (6.76)
where Σ_k is some positive definite symmetric matrix, and β is a positive scalar [cf. Eq. (6.60)]. For γ = 1, we obtain the iteration

r_{k+1} = (C_k^(λ)′ Σ_k^{−1} C_k^(λ) + βI)^{−1} (C_k^(λ)′ Σ_k^{−1} d_k^(λ) + βr_k).    (6.77)
This is an iterative version of the regression-based LSTD method [cf. Eq. (6.61)], for which convergence is assured provided C_k^(λ) → C^(λ), d_k^(λ) → d^(λ), and {Σ_k^{−1}} is bounded.
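To illustrate the behavior of iteration (6.77), the following sketch runs it with the simulation estimates held fixed at hypothetical values C and d, and with Σ_k = I; under these simplifying assumptions the iterates converge to the LSTD solution C^{−1}d:

```python
import numpy as np

# Illustrative sketch of the regression-based iteration (6.77), with the
# simulation estimates frozen at hypothetical values C, d and Sigma_k = I.
C = np.array([[2.0, 0.3],
              [0.1, 1.5]])
d = np.array([1.0, 0.5])
beta = 0.1

r = np.zeros(2)
for _ in range(200):
    A = C.T @ C + beta * np.eye(2)
    r = np.linalg.solve(A, C.T @ d + beta * r)   # one step of Eq. (6.77)

# The fixed point satisfies C'C r = C'd, i.e., r = C^{-1} d when C is invertible.
r_lstd = np.linalg.solve(C, d)
assert np.allclose(r, r_lstd, atol=1e-8)
print(r)
```

The iteration map has the form r ← β(C′C + βI)^{−1} r + const, whose spectral radius β/(σ_min(C)² + β) is below 1, which is why the sketch converges geometrically.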
Regarding the calculation of appropriate simulation-based approximations C_k^(λ) and d_k^(λ), one possibility is [cf. Eqs. (6.41)-(6.42)]

C_k^(λ) = (1/(k + 1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} (φ(i_m) − αφ(i_{m+1}))′,    (6.78)

d_k^(λ) = (1/(k + 1)) Σ_{t=0}^k φ(i_t) Σ_{m=t}^k α^{m−t} λ^{m−t} g_{i_m}.    (6.79)
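The double sums (6.78)-(6.79) can be evaluated directly from a simulated trajectory. The sketch below does this for a hypothetical 3-state chain with 2-dimensional features (all numbers illustrative), and also checks that for λ = 0 the double sums collapse to the one-transition formulas [cf. Eqs. (6.41)-(6.42)]:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.9

# Hypothetical 3-state chain, per-state costs g_i, and 2-dimensional features phi(i).
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
g = np.array([1.0, 0.0, 2.0])
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0]])

# Simulate a single trajectory i_0, ..., i_{k+1}.
k = 500
traj = [0]
for _ in range(k + 1):
    traj.append(rng.choice(3, p=P[traj[-1]]))

def C_d_multistep(lam):
    """Direct evaluation of the double sums in Eqs. (6.78)-(6.79)."""
    s = Phi.shape[1]
    C, d = np.zeros((s, s)), np.zeros(s)
    for t in range(k + 1):
        for m in range(t, k + 1):
            w = (alpha * lam) ** (m - t)
            C += np.outer(Phi[traj[t]], w * (Phi[traj[m]] - alpha * Phi[traj[m + 1]]))
            d += Phi[traj[t]] * w * g[traj[m]]
    return C / (k + 1), d / (k + 1)

C_lam, d_lam = C_d_multistep(lam=0.6)
r_hat = np.linalg.solve(C_lam, d_lam)   # LSTD(lam) estimate
print(r_hat)

# Sanity check: lam = 0 reduces to the single-transition sums.
C0, d0 = C_d_multistep(lam=0.0)
C0_direct = sum(np.outer(Phi[traj[t]], Phi[traj[t]] - alpha * Phi[traj[t + 1]])
                for t in range(k + 1)) / (k + 1)
d0_direct = sum(Phi[traj[t]] * g[traj[t]] for t in range(k + 1)) / (k + 1)
assert np.allclose(C0, C0_direct) and np.allclose(d0, d0_direct)
```

This direct evaluation costs O(k²); the streamlined recursion based on the vector z_t of Eq. (6.80), given a bit further below in the text, reduces it to O(k).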
It can be shown that these are indeed correct simulation-based approximations to C^(λ) and d^(λ) of Eq. (6.73). The verification is similar to the case λ = 0, by considering the approximation of the steady-state probabilities ξ_i and transition probabilities p_ij with the empirical frequencies ξ̂_{i,k} and p̂_{ij,k} defined by Eq. (6.44).
For a sketch of the argument, we first verify that the rightmost expression in the definition (6.78) of C_k^(λ) can be written as

Σ_{m=t}^k α^{m−t} λ^{m−t} (φ(i_m) − αφ(i_{m+1}))′ = φ(i_t)′ − α(1 − λ) Σ_{m=t}^{k−1} α^{m−t} λ^{m−t} φ(i_{m+1})′ − α^{k−t+1} λ^{k−t} φ(i_{k+1})′,

which, by discarding the last term (it is negligible for k ≫ t), yields

Σ_{m=t}^k α^{m−t} λ^{m−t} (φ(i_m) − αφ(i_{m+1}))′ = φ(i_t)′ − α(1 − λ) Σ_{m=t}^{k−1} α^{m−t} λ^{m−t} φ(i_{m+1})′.
Using this relation in the expression (6.78) for C_k^(λ), we obtain

C_k^(λ) = (1/(k + 1)) Σ_{t=0}^k φ(i_t) (φ(i_t) − α(1 − λ) Σ_{m=t}^{k−1} α^{m−t} λ^{m−t} φ(i_{m+1}))′.
We now compare this expression with C^(λ), which, similar to Eq. (6.43), can be written as

C^(λ) = Φ′Ξ(I − αP^(λ))Φ = Σ_{i=1}^n ξ_i φ(i) (φ(i) − α Σ_{j=1}^n p_ij^(λ) φ(j))′,

where p_ij^(λ) are the components of the matrix P^(λ). It can be seen (cf. the derivations of Section 6.3.3) that

(1/(k + 1)) Σ_{t=0}^k φ(i_t) φ(i_t)′ → Σ_{i=1}^n ξ_i φ(i) φ(i)′,

while by using the formula

p_ij^(λ) = (1 − λ) Σ_{ℓ=0}^∞ α^ℓ λ^ℓ p_ij^(ℓ+1),

with p_ij^(ℓ+1) being the (i, j)th component of P^{ℓ+1} [cf. Eq. (6.72)], it can be verified that

(1/(k + 1)) Σ_{t=0}^k φ(i_t) ((1 − λ) Σ_{m=t}^{k−1} α^{m−t} λ^{m−t} φ(i_{m+1})′) → Σ_{i=1}^n ξ_i φ(i) Σ_{j=1}^n p_ij^(λ) φ(j)′.
Thus, by comparing the preceding expressions, we see that C_k^(λ) → C^(λ) with probability 1. A full convergence analysis can be found in [NeB03], and also in [BeY09] in a more general context.
We may also streamline the calculation of C_k^(λ) and d_k^(λ) by introducing the vector

z_t = Σ_{m=0}^t (αλ)^{t−m} φ(i_m).    (6.80)
Then, by straightforward calculation, we may verify that

C_k^(λ) = (1/(k + 1)) Σ_{t=0}^k z_t (φ(i_t) − αφ(i_{t+1}))′,    (6.81)

d_k^(λ) = (1/(k + 1)) Σ_{t=0}^k z_t g(i_t, i_{t+1}).    (6.82)
Note that z_k, C_k^(λ), and d_k^(λ) can be conveniently updated by means of recursive formulas. We have

z_k = αλ z_{k−1} + φ(i_k),

C_k^(λ) = (1/(k + 1)) C̄_k^(λ),    d_k^(λ) = (1/(k + 1)) d̄_k^(λ),

where C̄_k^(λ) and d̄_k^(λ) are updated by

C̄_k^(λ) = C̄_{k−1}^(λ) + z_k (φ(i_k) − αφ(i_{k+1}))′,    d̄_k^(λ) = d̄_{k−1}^(λ) + z_k g(i_k, i_{k+1}).
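The recursion above and the double sums (6.78)-(6.79) are algebraically identical (interchange the order of summation over t and m). The following sketch, with hypothetical features, transition costs g(i, j) as in Eqs. (6.81)-(6.82), and an arbitrary random state sequence, checks this equivalence numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, lam = 0.9, 0.6
n, s, k = 3, 2, 50

# Hypothetical features, transition costs g(i, j), and a random state sequence.
Phi = rng.standard_normal((n, s))
G = rng.standard_normal((n, n))          # G[i, j] plays the role of g(i, j)
traj = rng.integers(0, n, size=k + 2)

# Recursive computation via the eligibility vector z_k.
z = np.zeros(s)
C_bar, d_bar = np.zeros((s, s)), np.zeros(s)
for t in range(k + 1):
    z = alpha * lam * z + Phi[traj[t]]               # z_k update
    C_bar += np.outer(z, Phi[traj[t]] - alpha * Phi[traj[t + 1]])
    d_bar += z * G[traj[t], traj[t + 1]]
C_rec, d_rec = C_bar / (k + 1), d_bar / (k + 1)

# Direct evaluation of the double sums [cf. Eq. (6.78), with transition costs].
C_dir, d_dir = np.zeros((s, s)), np.zeros(s)
for t in range(k + 1):
    for m in range(t, k + 1):
        w = (alpha * lam) ** (m - t)
        C_dir += np.outer(Phi[traj[t]], w * (Phi[traj[m]] - alpha * Phi[traj[m + 1]]))
        d_dir += Phi[traj[t]] * w * G[traj[m], traj[m + 1]]
C_dir /= k + 1
d_dir /= k + 1

assert np.allclose(C_rec, C_dir) and np.allclose(d_rec, d_dir)
print("recursive and double-sum formulas agree")
```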
Let us finally note that by using the above formulas for C_k^(λ) and d_k^(λ), the iteration (6.75) can also be written as

r_{k+1} = r_k − (γ/(k + 1)) G_k Σ_{t=0}^k z_t q_{k,t},    (6.83)

where q_{k,t} is the temporal difference

q_{k,t} = φ(i_t)′ r_k − αφ(i_{t+1})′ r_k − g(i_t, i_{t+1})    (6.84)

[cf. Eqs. (6.47) and (6.56)].
The TD(λ) algorithm is essentially TD(0) applied to the multistep projected equation C^(λ) r = d^(λ). It takes the form

r_{k+1} = r_k − γ_k z_k q_{k,k},    (6.85)

where γ_k is a stepsize parameter. When compared to the scaled LSPE method (6.83), we see that TD(λ) uses G_k = I and only the latest temporal difference q_{k,k}. This amounts to approximating C^(λ) and d^(λ) by a single sample, instead of k + 1 samples. Note that as λ → 1, z_k approaches Σ_{t=0}^k α^{k−t} φ(i_t) [cf. Eq. (6.80)], and TD(λ) approaches the TD(1) method given earlier in Section 6.2 [cf. Eq. (6.25)].
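A minimal sketch of the TD(λ) iteration (6.85), on a hypothetical two-state chain with identity features (so that Φr can represent J_µ exactly and the limit is J_µ itself); the chain, costs, stepsize schedule, and iteration count are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, lam = 0.9, 0.5

# Hypothetical 2-state chain with transition costs g(i, j); identity features.
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])
G = np.array([[1.0, 2.0],
              [0.0, 1.0]])
Phi = np.eye(2)

r = np.zeros(2)
z = np.zeros(2)
i = 0
for k in range(150000):
    j = rng.choice(2, p=P[i])                        # simulate one transition
    z = alpha * lam * z + Phi[i]                     # eligibility vector, Eq. (6.80)
    q = Phi[i] @ r - alpha * Phi[j] @ r - G[i, j]    # temporal difference, Eq. (6.84)
    r = r - (10.0 / (k + 100)) * z * q               # TD(lam) iteration, Eq. (6.85)
    i = j

# Exact J_mu for comparison (expected one-stage costs, then Bellman solve).
g_bar = (P * G).sum(axis=1)
J = np.linalg.solve(np.eye(2) - alpha * P, g_bar)
print(r, J)
```

With a diminishing stepsize of this form, the iterates should approach J here; the accuracy after a finite run depends on the stepsize schedule and the simulation noise, in line with the discussion of λ and noise later in this section.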
6.3.8 TD Methods with Exploration
We will now address some of the difficulties associated with TD methods, particularly in the context of policy iteration. The first difficulty has to do with the issue of exploration: to maintain the contraction property of ΠT it is important to generate trajectories according to the steady-state distribution ξ associated with the given policy µ. On the other hand, this biases the simulation by underrepresenting states that are unlikely to occur under µ, causing potentially serious errors in the calculation of a new policy via policy improvement. A second difficulty is that Assumption 6.3.1 (P is irreducible) may be hard or impossible to guarantee, in which case the methods break down, either because of the presence of transient states (in which case the components of ξ corresponding to transient states are 0), or because of multiple recurrent classes (in which case some states will never be generated during the simulation).
As mentioned earlier, a common approach to address the exploration difficulty is to modify the transition probability matrix P of the given policy µ by occasionally generating transitions that use a randomly selected control or state rather than the one dictated by µ. If the modified transition probability matrix is irreducible, we simultaneously address the second difficulty as well. Mathematically, in such a scheme we generate successive states according to an irreducible transition probability matrix

P̄ = (I − B)P + BQ,    (6.86)

where B is a diagonal matrix with diagonal components β_i ∈ [0, 1] and Q is another transition probability matrix. Thus, at state i, the next state is generated with probability 1 − β_i according to the transition probabilities p_ij, and with probability β_i according to the transition probabilities q_ij.† We refer to β_i as the exploration probability at state i.
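Constructing P̄ from Eq. (6.86) is a one-line matrix computation; the sketch below (with illustrative P, Q, and exploration probabilities β_i) also confirms that P̄ is again a transition probability matrix:

```python
import numpy as np

# Exploration-enhanced matrix P_bar = (I - B)P + BQ of Eq. (6.86).
# P, Q, and the exploration probabilities beta_i are all illustrative.
P = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.9, 0.1],
              [0.5, 0.0, 0.5]])
Q = np.full((3, 3), 1.0 / 3.0)      # e.g., uniformly random exploration moves
beta = np.array([0.1, 0.2, 0.1])    # exploration probability at each state

B = np.diag(beta)
P_bar = (np.eye(3) - B) @ P + B @ Q

# At state i: with probability 1 - beta_i follow P, with probability beta_i follow Q.
assert np.allclose(P_bar.sum(axis=1), 1.0)
assert np.all(P_bar >= 0)
print(P_bar)
```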
Unfortunately, using P̄ in place of P for simulation, with no other modification in the TD algorithms, creates a bias that tends to degrade the quality of policy evaluation, because it directs the algorithm towards approximating the fixed point of the mapping T̄^(λ), given by

T̄^(λ)(J) = ḡ^(λ) + αP̄^(λ) J,

where

T̄^(λ)(J) = (1 − λ) Σ_{t=0}^∞ λ^t T̄^{t+1}(J),
† In the literature, e.g., [SuB98], the policy being evaluated is sometimes called the target policy, to distinguish it from the policy used for exploration, which is called the behavior policy. Also, methods that use a behavior policy are called off-policy methods, while methods that do not are called on-policy methods.
with

T̄(J) = ḡ + αP̄J

[cf. Eq. (6.72)]. This is the cost of a different policy, an exploration-enhanced policy that has a cost vector ḡ with components

ḡ_i = Σ_{j=1}^n p̄_ij g(i, j),

and a transition probability matrix P̄ in place of P. In particular, when the simulated trajectory is generated according to P̄, the LSTD(λ), LSPE(λ), and TD(λ) algorithms yield the unique solution r̄_λ of the equation

Φr = Π̄T̄^(λ)(Φr),    (6.87)

where Π̄ denotes projection on the approximation subspace with respect to ‖·‖_ξ̄, where ξ̄ is the invariant distribution corresponding to P̄.
We will now discuss some schemes that allow the approximation of the solution of the projected equation

Φr = Π̄T^(λ)(Φr),    (6.88)

where Π̄ is the projection with respect to the norm ‖·‖_ξ̄ corresponding to the steady-state distribution ξ̄ of P̄. Note the difference between equations (6.87) and (6.88): the first involves T̄ but the second involves T, so it aims to approximate the desired fixed point of T, rather than a fixed point of T̄. Thus, the following schemes allow exploration, but without the degradation of approximation quality resulting from the use of T̄ in place of T. The corresponding LSTD(λ)-type methods, including the regression-based version [cf. Eq. (6.51)], do not require that Π̄T^(λ) be a contraction; they only require that the projected equation (6.88) have a unique solution. However, the corresponding LSPE(λ)- and TD(λ)-type methods are valid only if Π̄T^(λ) is a contraction [except for the scaled version of LSPE(λ) that has the form (6.61), where Π̄T^(λ) need not be a contraction]. In this connection, we note that, generally, Π̄T^(λ) may not be a contraction. We will show, however, that for any choice of P̄, the mapping Π̄T^(λ) is a contraction for λ sufficiently close to 1.
The key difficulty in asserting contraction properties of the composition Π̄T^(λ) is a potential norm mismatch: even if T^(λ) is a contraction with respect to some norm, Π̄ may not be nonexpansive with respect to the same norm. The solid convergence properties of iterative TD methods are due to the fact that T^(λ) is a contraction with respect to the norm ‖·‖_ξ and Π is nonexpansive with respect to the same norm, which makes the composition ΠT^(λ) a contraction with respect to ‖·‖_ξ. Our difficulty here is that T^(λ) need not in general be a contraction with respect to the modified/exploration-reweighted norm ‖·‖_ξ̄.
The following proposition quantifies the restrictions on the size of the exploration probabilities needed to avoid the difficulty just described. Since Π̄ is nonexpansive with respect to ‖·‖_ξ̄, the proof is based on finding values of β_i for which T^(λ) is a contraction with respect to ‖·‖_ξ̄. This is equivalent to showing that the corresponding induced norm of the matrix A^(λ) given by

A^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t (αP)^{t+1}    (6.89)

is less than 1.
Proposition 6.3.6: Assume that P̄ is irreducible and ξ̄ is its invariant distribution. Then T^(λ) and Π̄T^(λ) are contractions with respect to ‖·‖_ξ̄ for all λ ∈ [0, 1) provided ᾱ < 1, where

ᾱ = α / √(1 − max_{i=1,...,n} β_i).

The associated modulus of contraction is at most equal to

ᾱ(1 − λ) / (1 − ᾱλ).
Proof: For all z ∈ ℜ^n with z ≠ 0, we have

‖αPz‖²_ξ̄ = Σ_{i=1}^n ξ̄_i (Σ_{j=1}^n α p_ij z_j)²
= α² Σ_{i=1}^n ξ̄_i (Σ_{j=1}^n p_ij z_j)²
≤ α² Σ_{i=1}^n ξ̄_i Σ_{j=1}^n p_ij z_j²
≤ α² Σ_{i=1}^n ξ̄_i Σ_{j=1}^n (p̄_ij / (1 − β_i)) z_j²
≤ (α² / (1 − β̄)) Σ_{j=1}^n Σ_{i=1}^n ξ̄_i p̄_ij z_j²
= (α² / (1 − β̄)) Σ_{j=1}^n ξ̄_j z_j²
= ᾱ² ‖z‖²_ξ̄,

where β̄ = max_{i=1,...,n} β_i, the first inequality follows from the convexity of the quadratic function, the second inequality follows from the fact (1 − β_i) p_ij ≤ p̄_ij, and the next-to-last equality follows from the property Σ_{i=1}^n ξ̄_i p̄_ij = ξ̄_j of the invariant distribution ξ̄. Thus, αP is a contraction with respect to ‖·‖_ξ̄ with modulus at most ᾱ.

Next we note that if ᾱ < 1, we have

‖A^(λ)‖_ξ̄ ≤ (1 − λ) Σ_{t=0}^∞ λ^t ‖αP‖^{t+1}_ξ̄ ≤ (1 − λ) Σ_{t=0}^∞ λ^t ᾱ^{t+1} = ᾱ(1 − λ) / (1 − ᾱλ) < 1,    (6.90)

from which the result follows. Q.E.D.
The preceding proposition delineates a range of values for the exploration probabilities in order for Π̄T^(λ) to be a contraction: it is sufficient that

β_i < 1 − α²,    i = 1, . . . , n,

independently of the value of λ. We next consider the effect of λ on the range of allowable exploration probabilities. While it seems difficult to fully quantify this effect, it appears that values of λ close to 1 tend to enlarge the range. In fact, T^(λ) is a contraction with respect to any norm ‖·‖_ξ̄, and consequently for any value of the exploration probabilities β_i, provided λ is sufficiently close to 1. This is shown in the following proposition.
Proposition 6.3.7: Given any exploration probabilities from the range [0, 1] such that P̄ is irreducible, there exists λ̄ ∈ [0, 1) such that T^(λ) and Π̄T^(λ) are contractions with respect to ‖·‖_ξ̄ for all λ ∈ [λ̄, 1).
Proof: By Prop. 6.3.6, there exists ξ̄ such that ‖αP‖_ξ̄ < 1. From Eq. (6.90) it follows that lim_{λ→1} ‖A^(λ)‖_ξ̄ = 0, and hence lim_{λ→1} A^(λ) = 0. It follows that, given any norm ‖·‖, A^(λ) is a contraction with respect to that norm for λ sufficiently close to 1. In particular, this is true for any norm ‖·‖_ξ̄, where ξ̄ is the invariant distribution of an irreducible P̄ that is generated with any exploration probabilities from the range [0, 1]. Q.E.D.
We finally note that, assuming Π̄T^(λ) is a contraction with modulus

ᾱ_λ = ᾱ(1 − λ) / (1 − ᾱλ),

as per Prop. 6.3.6, we have the error bound

‖J_µ − Φr̄_λ‖_ξ̄ ≤ (1 / √(1 − ᾱ_λ²)) ‖J_µ − Π̄J_µ‖_ξ̄,

where Φr̄_λ is the fixed point of Π̄T^(λ). The proof is nearly identical to that of Prop. 6.3.5.
We will now consider some specific schemes for exploration. Let us first consider the case where λ = 0. Then a vector r* solves the exploration-enhanced projected equation Φr = Π̄T(Φr) if and only if it solves the problem

min_{r∈ℜ^s} ‖Φr − (g + αPΦr*)‖²_ξ̄.

The optimality condition for this problem is

Φ′Ξ̄(Φr* − αPΦr* − g) = 0,    (6.91)

where Ξ̄ is the steady-state distribution matrix corresponding to P̄, and can be written in matrix form as [cf. Eq. (6.48)]

Cr* = d,

where

C = Φ′Ξ̄(I − αP)Φ,    d = Φ′Ξ̄g.    (6.92)

These equations should be compared with the equations for the case where P̄ = P: the only difference is that the distribution matrix Ξ is replaced by the exploration-enhanced distribution matrix Ξ̄.
The following two schemes use different simulation-based approximations of the matrix C and the vector d, which are similar to the ones given earlier for the case P̄ = P, but employ appropriately modified forms of temporal differences.
Exploration Using Extra Transitions
The first scheme applies only to the case where λ = 0. We generate a state sequence {i_0, i_1, . . .} according to the exploration-enhanced transition matrix P̄ (or in fact any steady-state distribution ξ̄, such as the uniform distribution). We also generate an additional sequence of independent transitions {(i_0, j_0), (i_1, j_1), . . .} according to the original transition matrix P.
We approximate the matrix C and vector d of Eq. (6.92) using the formulas

C_k = (1/(k + 1)) Σ_{t=0}^k φ(i_t) (φ(i_t) − αφ(j_t))′,

and

d_k = (1/(k + 1)) Σ_{t=0}^k φ(i_t) g(i_t, j_t),

in place of Eqs. (6.41) and (6.42). Similar to the earlier case in Section 6.3.3, where P̄ = P, it can be shown using law of large numbers arguments that C_k → C and d_k → d with probability 1.
The corresponding approximation C_k r = d_k to the projected equation Φr = Π̄T(Φr) can be written as

Σ_{t=0}^k φ(i_t) q̃_{k,t} = 0,

where

q̃_{k,t} = φ(i_t)′ r_k − αφ(j_t)′ r_k − g(i_t, j_t)

is a temporal difference associated with the transition (i_t, j_t) [cf. Eq. (6.47)]. The three terms in the definition of q̃_{k,t} can be viewed as samples [associated with the transition (i_t, j_t)] of the corresponding three terms of the expression Ξ̄(Φr_k − αPΦr_k − g) in Eq. (6.91).
In a modified form of LSTD(0), we approximate the solution C^{−1}d of the projected equation with C_k^{−1} d_k. In a modified form of (scaled) LSPE(0), we approximate the term (Cr_k − d) in PVI by (C_k r_k − d_k), leading to the iteration

r_{k+1} = r_k − (γ/(k + 1)) G_k Σ_{t=0}^k φ(i_t) q̃_{k,t},

where γ is small enough to guarantee convergence [cf. Eq. (6.57)]. Finally, the modified form of TD(0) is

r_{k+1} = r_k − γ_k φ(i_k) q̃_{k,k},

where γ_k is a positive diminishing stepsize [cf. Eq. (6.62)]. Unfortunately, versions of these schemes for λ > 0 are complicated because of the difficulty of generating extra transitions in a multistep context.
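A sketch of the extra-transitions scheme for λ = 0 (chain, features, costs, and exploration amount all hypothetical): states i_t are generated from an exploration-enhanced matrix P̄, while each estimate uses an independent transition (i_t, j_t) drawn from the original P:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.9

# Hypothetical 2-state problem. P_bar mixes P with a uniform Q (beta_i = 0.5),
# so exploration here is heavy.
P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
P_bar = 0.5 * P + 0.5 * np.full((2, 2), 0.5)
G = np.array([[1.0, 0.5],
              [2.0, 0.0]])                  # transition costs g(i, j)
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0]])

k = 20000
C = np.zeros((2, 2))
d = np.zeros(2)
i = 0
for t in range(k + 1):
    j = rng.choice(2, p=P[i])          # extra transition from P, used in the estimates
    C += np.outer(Phi[i], Phi[i] - alpha * Phi[j])
    d += Phi[i] * G[i, j]
    i = rng.choice(2, p=P_bar[i])      # trajectory transition, from P_bar
C /= k + 1
d /= k + 1

# Modified LSTD(0): solve C r = d [cf. Eq. (6.92)].
r_hat = np.linalg.solve(C, d)
print(r_hat)
```

Note that with this much exploration the sufficient condition β_i < 1 − α² of Prop. 6.3.6 fails, so the contraction needed by LSPE(0)/TD(0) is not guaranteed; the LSTD(0) solve above only needs the approximate equation to have a solution.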
Exploration Using Modified Temporal Differences
We will now present an alternative exploration approach that works for all λ ≥ 0. Like the preceding approach, it aims to solve the projected equation Φr = Π̄T^(λ)(Φr) [cf. Eq. (6.88)], but it does not require extra transitions. It does require, however, explicit knowledge of the transition probabilities p_ij and p̄_ij.
Here we generate a single state sequence {i_0, i_1, . . .} according to the exploration-enhanced transition matrix P̄.† The formulas of the various TD algorithms are similar to the ones given earlier, but we use modified versions of temporal differences, defined by

q̃_{k,t} = φ(i_t)′ r_k − (p_{i_t i_{t+1}} / p̄_{i_t i_{t+1}}) (αφ(i_{t+1})′ r_k + g(i_t, i_{t+1})),    (6.93)

where p_ij and p̄_ij denote the ijth components of P and P̄, respectively.
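The ratio p_ij/p̄_ij in Eq. (6.93) is an importance-sampling correction: averaging the reweighted next-state term over transitions drawn from P̄ recovers its expectation under P. The following check makes this concrete for a single fixed state i (all numbers illustrative):

```python
import numpy as np

alpha = 0.9
p = np.array([0.7, 0.3])        # row of P at the fixed state i
p_bar = np.array([0.5, 0.5])    # row of P_bar at state i
phi = np.array([[1.0, 0.0],     # phi(j) for the two successor states
                [1.0, 1.0]])
g_row = np.array([1.0, 2.0])    # g(i, j)
r = np.array([0.4, -0.2])

# Expectation under p_bar of the importance-weighted next-state term ...
ratio = p / p_bar
weighted = sum(p_bar[j] * ratio[j] * (alpha * phi[j] @ r + g_row[j])
               for j in range(2))

# ... equals the expectation under p of the unweighted term.
unweighted = sum(p[j] * (alpha * phi[j] @ r + g_row[j]) for j in range(2))
assert np.isclose(weighted, unweighted)
print(weighted)
```

This identity is what makes the estimates below converge to C and d of Eq. (6.92), which involve the original P, even though the simulation follows P̄.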
Consider now the case where λ = 0 and the approximation of the matrix C and vector d of Eq. (6.92) by simulation: we generate a state sequence {i_0, i_1, . . .} using the exploration-enhanced transition matrix P̄. After collecting k + 1 samples (k = 0, 1, . . .), we form

C_k = (1/(k + 1)) Σ_{t=0}^k φ(i_t) (φ(i_t) − α (p_{i_t i_{t+1}} / p̄_{i_t i_{t+1}}) φ(i_{t+1}))′,

and

d_k = (1/(k + 1)) Σ_{t=0}^k φ(i_t) (p_{i_t i_{t+1}} / p̄_{i_t i_{t+1}}) g(i_t, i_{t+1}).

Similar to the earlier case in Section 6.3.3, where P̄ = P, it can be shown using simple law of large numbers arguments that C_k → C and d_k → d with probability 1 (see also Section 6.8.1, where this approximation approach is discussed within a more general context). Note that the approximation C_k r = d_k to the projected equation can also be written as

Σ_{t=0}^k φ(i_t) q̃_{k,t} = 0,

where q̃_{k,t} is the modified temporal difference given by Eq. (6.93).
The exploration-enhanced LSTD(0) method is simply r̂_k = C_k^{−1} d_k, and converges with probability 1 to the solution of the projected equation Φr = Π̄T(Φr). Exploration-enhanced versions of LSPE(0) and TD(0) can be similarly derived, but for convergence of these methods, the mapping Π̄T should be a contraction, which is guaranteed only if P̄ differs from P by a small amount (cf. Prop. 6.3.6).

† Note the difference in the sampling of transitions. Whereas in the preceding scheme with extra transitions, (i_t, j_t) was generated according to the original transition matrix P, here (i_t, i_{t+1}) is generated according to the exploration-enhanced transition matrix P̄.
Let us now consider the case where λ > 0. We first note that increasing values of λ tend to preserve the contraction property of Π̄T^(λ). In fact, given any norm ‖·‖_ξ̄, T^(λ) is a contraction with respect to that norm, provided λ is sufficiently close to 1 (cf. Prop. 6.3.7). This implies that given any exploration probabilities from the range [0, 1] such that P̄ is irreducible, there exists λ̄ ∈ [0, 1) such that T^(λ) and Π̄T^(λ) are contractions with respect to ‖·‖_ξ̄ for all λ ∈ [λ̄, 1).
The derivation of simulation-based approximations to the exploration-enhanced LSPE(λ), LSTD(λ), and TD(λ) methods is straightforward but somewhat tedious. We will just provide the formulas, and refer to Bertsekas and Yu [BeY07], [BeY09] for a detailed development. The exploration-enhanced LSPE(λ) iteration is given by

r_{k+1} = r_k − (γ/(k + 1)) G_k Σ_{t=0}^k φ(i_t) Σ_{m=t}^k λ^{m−t} w_{t,m−t} q̃_{k,m},    (6.94)

where G_k is a scaling matrix, γ is a positive stepsize, q̃_{k,t} are the modified temporal differences (6.93), and for all k and m,

w_{k,m} = α^m (p_{i_k i_{k+1}} / p̄_{i_k i_{k+1}}) (p_{i_{k+1} i_{k+2}} / p̄_{i_{k+1} i_{k+2}}) ··· (p_{i_{k+m−1} i_{k+m}} / p̄_{i_{k+m−1} i_{k+m}})   if m ≥ 1,
w_{k,m} = 1   if m = 0.    (6.95)
Iteration (6.94) can also be written in a compact form, as shown in [BeY09], where a convergence result is also given. In particular, we have

r_{k+1} = r_k − γG_k (C_k^(λ) r_k − d_k^(λ)),    (6.96)

where, similar to the earlier formulas with unmodified temporal differences [cf. Eqs. (6.80)-(6.82)],

z_k = αλ (p_{i_{k−1} i_k} / p̄_{i_{k−1} i_k}) z_{k−1} + φ(i_k),

C_k^(λ) = (1/(k + 1)) C̄_k^(λ),    d_k^(λ) = (1/(k + 1)) d̄_k^(λ),

where C̄_k^(λ) and d̄_k^(λ) are updated by

C̄_k^(λ) = C̄_{k−1}^(λ) + z_k (φ(i_k) − α (p_{i_k i_{k+1}} / p̄_{i_k i_{k+1}}) φ(i_{k+1}))′,

d̄_k^(λ) = d̄_{k−1}^(λ) + z_k (p_{i_{k−1} i_k} / p̄_{i_{k−1} i_k}) g(i_{k−1}, i_k).
It is possible to show the convergence of r_k (with probability 1) to the unique solution of the exploration-enhanced projected equation

Φr = Π̄T^(λ)(Φr),

assuming that P̄ is such that T^(λ) is a contraction with respect to ‖·‖_ξ̄ (e.g., for λ sufficiently close to 1). Note that this is a substantial restriction, since it may require a large (and unknown) value of λ.
A favorable exception is the iterative regression method

r_{k+1} = (C_k^(λ)′ Σ_k^{−1} C_k^(λ) + βI)^{−1} (C_k^(λ)′ Σ_k^{−1} d_k^(λ) + βr_k),    (6.97)

which is the special case of the LSPE method (6.96) [cf. Eqs. (6.76) and (6.77)]. This method converges for any λ, as it does not require that T^(λ) be a contraction with respect to ‖·‖_ξ̄.
The modified LSTD(λ) algorithm computes r̂_k as the solution of the equation C_k^(λ) r = d_k^(λ). It is possible to show the convergence of Φr̂_k to the solution of the exploration-enhanced projected equation Φr = Π̄T^(λ)(Φr), assuming only that this equation has a unique solution (a contraction property is not necessary, since LSTD is not an iterative method, but rather approximates the projected equation by simulation).
6.3.9 Summary and Examples
Several algorithms for policy cost evaluation have been given so far for finite-state discounted problems, under a variety of assumptions, and we will now summarize the analysis. We will also explain what can go wrong when the assumptions of this analysis are violated.

The algorithms considered so far for approximate evaluation of the cost vector J_µ of a single stationary policy µ are of two types:

(1) Direct methods, such as the batch and incremental gradient methods of Section 6.2, including TD(1). These methods allow for a nonlinear approximation architecture, and for a lot of flexibility in the collection of the cost samples that are used in the least squares optimization. The drawbacks of these methods are that they are not well-suited for problems with large variance of simulation "noise," and also that, when implemented using gradient-like methods, they can be very slow. The former difficulty is in part due to the lack of the parameter λ, which is used in other methods to reduce the variance in the parameter update formulas.

(2) Indirect methods that are based on solution of a projected version of Bellman's equation. These are simulation-based methods that include approximate matrix inversion methods such as LSTD(λ), and iterative methods such as LSPE(λ) and its scaled versions, and TD(λ) (Sections 6.3.1-6.3.8).
The salient characteristics of indirect methods are the following:

(a) For a given choice of λ ∈ [0, 1), they all aim to compute r*_λ, the unique solution of the projected Bellman equation Φr = ΠT^(λ)(Φr). This equation is linear of the form C^(λ) r = d^(λ), where C^(λ) and d^(λ) can be approximated by a matrix C_k^(λ) and a vector d_k^(λ) that can be computed by simulation. The equation expresses the orthogonality relation between Φr − T^(λ)(Φr) and the approximation subspace S. The LSTD(λ) method simply computes the solution

r̂_k = (C_k^(λ))^{−1} d_k^(λ)

of the simulation-based approximation C_k^(λ) r = d_k^(λ) of the projected equation. The projected equation may also be solved iteratively using LSPE(λ) and its scaled versions,

r_{k+1} = r_k − γG_k (C_k^(λ) r_k − d_k^(λ)),    (6.98)

and by TD(λ). A key fact for convergence of LSPE(λ) and TD(λ) to r*_λ is that T^(λ) is a contraction with respect to the projection norm ‖·‖_ξ, which implies that ΠT^(λ) is also a contraction with respect to the same norm.
(b) LSTD(λ) and LSPE(λ) are connected through the regularized regression-based form (6.77), which aims to deal effectively with cases where C_k^(λ) is nearly singular (see Section 6.3.4). This is the special case of the LSPE(λ) class of methods corresponding to the special choice (6.76) of G_k. LSTD(λ) and the entire class of LSPE(λ)-type iterations (6.98) converge at the same asymptotic rate, in the sense that

‖r̂_k − r_k‖ ≪ ‖r̂_k − r*_λ‖

for sufficiently large k. However, depending on the choice of G_k, the short-term behavior of the LSPE-type methods is more regular, as it involves implicit regularization. This regularized behavior is likely an advantage in the policy iteration context, where optimistic variants involving noisier iterations may be used.
(c) When the LSTD(λ) and LSPE(λ) methods are exploration-enhanced for the purpose of embedding within an approximate policy iteration framework, their convergence properties become more complicated: LSTD(λ) and the regularized regression-based version (6.97) of LSPE(λ) converge to the solution of the corresponding (exploration-enhanced) projected equation for an arbitrary amount of exploration, but TD(λ) and other special cases of LSPE(λ) do so only for λ sufficiently close to 1, as discussed in Section 6.3.8.
(d) The limit r*_λ depends on λ. The estimate of Prop. 6.3.5 indicates that the approximation error ‖J_µ − Φr*_λ‖_ξ increases as the distance ‖J_µ − ΠJ_µ‖_ξ from the subspace S becomes larger, and also increases as λ becomes smaller. Indeed, the error degradation may be very significant for small values of λ, as shown by an example in Bertsekas [Ber95] (also reproduced in Exercise 6.9), where TD(0) produces a very bad solution relative to ΠJ_µ, which is the limit of the solution Φr*_λ produced by TD(λ) as λ → 1. (This example involves a stochastic shortest path problem rather than a discounted problem, but it can be modified to illustrate the same conclusion for discounted problems.) Note, however, that in the context of approximate policy iteration, the correlation between the approximation error in the cost of the current policy and the performance of the next policy is somewhat unclear in practice.
(e) While the quality of the approximation Φr*_λ tends to improve as λ → 1, the methods become more vulnerable to simulation noise, and hence require more sampling for good performance. Indeed, the noise in a simulation sample of a t-stage cost T^t J tends to be larger as t increases, and from the formula

T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}

[cf. Eq. (6.71)], it can be seen that the simulation samples of T^(λ)(Φr_k), used by LSTD(λ) and LSPE(λ), tend to contain more noise as λ increases. This is consistent with practical experience, which indicates that the algorithms tend to be faster and more reliable when λ takes smaller values (or at least when λ is not too close to 1). There is no rule of thumb for selecting λ, which is usually chosen with some trial and error.
(f) TD(λ) is much slower than LSTD(λ) and LSPE(λ). Still, TD(λ) is simpler and embodies important ideas, which we did not cover sufficiently in our presentation. We refer to the convergence analysis by Tsitsiklis and Van Roy [TsV97], and the subsequent papers [TsV99a] and [TsV02], as well as the book [BeT96], for extensive discussions of TD(λ).
For all the TD methods, LSTD(λ), LSPE(λ), and TD(λ), the assumptions under which convergence is usually shown in the literature include:

(i) The existence of a steady-state distribution vector ξ, with positive components.

(ii) The use of a linear approximation architecture Φr, with Φ satisfying the rank Assumption 6.3.2.

(iii) The use of data from a single infinitely long simulated trajectory of the associated Markov chain. This is needed for the implicit construction of a projection with respect to the norm ‖·‖_ξ corresponding to the steady-state distribution ξ of the chain (except for the exploration-enhanced versions of Section 6.3.8, where the norm ‖·‖_ξ̄ is used, corresponding to the steady-state distribution ξ̄ of an exploration-enhanced chain).

(iv) The use of a diminishing stepsize for TD(λ); LSTD(λ) and LSPE(λ) do not require a stepsize choice, and scaled versions of LSPE(λ) require a constant stepsize.

(v) The use of a single policy, unchanged during the simulation; convergence does not extend to the case where T involves a minimization over multiple policies, or to optimistic variants, where the policy used to generate the simulation data is changed after a few transitions.
Let us now discuss the above assumptions (i)-(v). Regarding (i), if a
steady-state distribution exists but has some components that are 0, the
corresponding states are transient, so they will not appear in the simulation
after a finite number of transitions. Once this happens, the algorithms will
operate as if the Markov chain consisted of just the recurrent states, and
convergence will not be affected. However, the transient states would be
underrepresented in the cost approximation. If on the other hand there is
no steady-state distribution, there must be multiple recurrent classes, so the
results of the algorithms would depend on the initial state of the simulated
trajectory (more precisely, on the recurrent class of this initial state). In
particular, states from other recurrent classes and transient states would
be underrepresented in the cost approximation obtained. This may be
remedied by using multiple trajectories, with initial states from all the
recurrent classes, so that all these classes are represented in the simulation.
Regarding (ii), there are no convergence guarantees for methods that
use nonlinear architectures. In particular, an example by Tsitsiklis and Van
Roy [TsV97] (also replicated in Bertsekas and Tsitsiklis [BeT96], Example
6.6) shows that TD(λ) may diverge if a nonlinear architecture is used. In
the case where Φ does not have rank s, the mapping $\Pi T^{(\lambda)}$ will still be a
contraction with respect to $\|\cdot\|_\xi$, so it has a unique fixed point. In this case,
TD(λ) has been shown to converge to some vector $r^* \in \Re^s$. This vector
is the orthogonal projection of the initial guess $r_0$ on the set of solutions
of the projected Bellman equation, i.e., the set of all r such that Φr is the
unique fixed point of $\Pi T^{(\lambda)}$; see [Ber09]. LSPE(λ) and its scaled variants
can be shown to have a similar property.
The key issue regarding (iii) is whether the empirical frequencies at
which sample costs of states are collected are consistent with the corresponding
steady-state distribution ξ. If there is a substantial discrepancy
(as in the case where a substantial amount of exploration is introduced), the
projection implicitly constructed by simulation is with respect to a norm
substantially different from $\|\cdot\|_\xi$, in which case the contraction property of
$\Pi T^{(\lambda)}$ may be lost, with divergence potentially resulting. An example of
divergence in the case of TD(0) is given in Bertsekas and Tsitsiklis [BeT96]
(Example 6.7). Exercise 6.4 gives an example where Π is the projection with
respect to the standard Euclidean norm, ΠT is not a contraction, and
PVI(0) diverges. Tsitsiklis and Van Roy [TsV96] give an example of divergence,
which involves a projection with respect to a non-Euclidean norm.
On the other hand, as noted earlier, $\Pi T^{(\lambda)}$ is a contraction for any Euclidean
projection norm, provided λ is sufficiently close to 1. Moreover, the
contraction property is not needed at all for the convergence of LSTD(λ)
and the regularized regression-based version (6.97) of LSPE(λ).
Regarding (iv), the method for stepsize choice is critical for TD(λ),
both for convergence and for performance. On the other hand, LSTD(λ)
works without a stepsize, and LSPE(λ) requires a constant stepsize. In Section
6.7, we will see that for average cost problems there is an exceptional
case, associated with periodic Markov chains, where a stepsize less than 1
is essential for the convergence of LSPE(0).
Regarding (v), once minimization over multiple policies is introduced
[so T and $T^{(\lambda)}$ are nonlinear], or optimistic variants are used, the behavior
of the methods becomes quite peculiar and unpredictable, because $\Pi T^{(\lambda)}$
may not be a contraction.† For instance, there are examples where $\Pi T^{(\lambda)}$
has no fixed point, and examples where it has multiple fixed points; see
Bertsekas and Tsitsiklis [BeT96] (Example 6.9), and de Farias and Van Roy
[DFV00]. Section 6.4.2 of Bertsekas and Tsitsiklis [BeT96] gives examples
of the chattering phenomenon for optimistic TD(λ). Generally, the issues
associated with the asymptotic behavior of optimistic methods, or even
(nonoptimistic) approximate policy iteration, are not well understood at
present, and Figs. 6.3.4 and 6.3.5 suggest their enormously complex nature:
the points where subsets in the greedy partition join are potential “points
of attraction” of the various algorithms, and act as forms of “local minima.”
On the other hand, even in the case where $T^{(\lambda)}$ is nonlinear, if $\Pi T^{(\lambda)}$
is a contraction, it has a unique fixed point, and the peculiarities associated
with chattering do not arise. In this case the scaled PVI(λ) iteration [cf.
Eq. (6.54)] takes the form
Eq. (6.54)] takes the form
r
k+1
= r
k
−γGΦ
′
Ξ
_
Φr
k
−T
(λ)
(Φr
k
)
_
,
where G is a scaling matrix, and γ is a positive stepsize that is small enough
to guarantee convergence. As discussed in [Ber09], this iteration converges
† Similar to Prop. 6.3.5, it can be shown that $T^{(\lambda)}$ is a sup-norm contraction
with modulus that tends to 0 as λ → 1. It follows that given any projection
norm $\|\cdot\|_\xi$, $T^{(\lambda)}$ and $\Pi T^{(\lambda)}$ are contractions with respect to $\|\cdot\|_\xi$, provided λ is
sufficiently close to 1.
to the unique fixed point of $\Pi T^{(\lambda)}$, provided the constant stepsize γ is
sufficiently small. Note that there are limited classes of problems, involving
multiple policies, where the mapping $\Pi T^{(\lambda)}$ is a contraction. An example,
optimal stopping problems, is discussed in Sections 6.4.3 and 6.8.3. Finally,
let us note that the LSTD(λ) method relies on the linearity of the mapping
T, and it has no practical generalization to the case where T is nonlinear.
6.4 Q-LEARNING

We now introduce another method for discounted problems, which is suitable
for cases where there is no explicit model of the system and the cost
structure (a model-free context). The method is related to value iteration
and has the additional advantage that it can be used directly in the case
of multiple policies. Instead of approximating the cost function of a particular
policy, it updates the Q-factors associated with an optimal policy,
thereby avoiding the multiple policy evaluation steps of the policy iteration
method.
In the discounted problem, the Q-factors are defined, for all pairs
(i, u) with u ∈ U(i), by
$$Q^*(i,u) = \sum_{j=1}^n p_{ij}(u)\big(g(i,u,j) + \alpha J^*(j)\big).$$
Using Bellman’s equation, we see that the Q-factors satisfy, for all (i, u),
$$Q^*(i,u) = \sum_{j=1}^n p_{ij}(u)\Big(g(i,u,j) + \alpha \min_{v\in U(j)} Q^*(j,v)\Big), \tag{6.99}$$
and can be shown to be the unique solution of this set of equations. The
proof is essentially the same as the proof of existence and uniqueness of
the solution of Bellman’s equation. In fact, by introducing a system whose
states are the original states 1, . . . , n, together with all the pairs (i, u),
the above set of equations can be seen to be a special case of Bellman’s
equation (see Fig. 6.4.1). Furthermore, it can be verified that F is a sup-norm
contraction with modulus α. Thus, the Q-factors can be obtained by
the value iteration $Q_{k+1} = FQ_k$, where F is the mapping defined by
$$(FQ)(i,u) = \sum_{j=1}^n p_{ij}(u)\Big(g(i,u,j) + \alpha \min_{v\in U(j)} Q(j,v)\Big), \quad \forall\ (i,u). \tag{6.100}$$
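To make the mapping F concrete, here is a minimal Python sketch of the value iteration $Q_{k+1} = FQ_k$; the two-state, two-control MDP data (transition probabilities, costs, discount factor) is invented purely for illustration:

```python
import numpy as np

# Hypothetical 2-state, 2-control discounted MDP (all numbers invented).
n, alpha = 2, 0.9
# p[u][i][j]: transition probability from i to j under control u
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
# g[u][i][j]: cost of the transition (i, u, j)
g = np.array([[[1.0, 2.0], [0.0, 3.0]],
              [[2.0, 0.5], [1.0, 1.0]]])

def F(Q):
    """The mapping (6.100): (FQ)(i,u) = sum_j p_ij(u)(g(i,u,j) + alpha min_v Q(j,v))."""
    Qmin = Q.min(axis=1)                   # min_v Q(j, v), one value per state j
    return np.array([[p[u, i] @ (g[u, i] + alpha * Qmin) for u in range(2)]
                     for i in range(n)])

Q = np.zeros((n, 2))                       # Q[i, u]
for _ in range(500):                       # value iteration Q_{k+1} = F Q_k
    Q = F(Q)
# Since F is a sup-norm contraction with modulus alpha, Q now solves
# Bellman's equation (6.99) up to numerical tolerance.
```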
The Q-learning algorithm is an approximate version of value iteration,
whereby the expected value in Eq. (6.100) is suitably approximated.
In particular, an infinitely long sequence of state-control pairs $\{(i_k, u_k)\}$
Figure 6.4.1. Modified problem where the state-control pairs (i, u) are viewed
as additional states. The bottom figure corresponds to a fixed policy µ. The
transitions from (i, u) to j are according to transition probabilities $p_{ij}(u)$ and
incur a cost g(i, u, j). Once the control v is chosen, the transition from j to (j, v)
occurs with probability 1 and incurs no cost.
is generated according to some probabilistic mechanism. Given the pair
$(i_k, u_k)$, a state $j_k$ is generated according to the probabilities $p_{i_k j}(u_k)$.
Then the Q-factor of $(i_k, u_k)$ is updated using a stepsize $\gamma_k > 0$, while all
other Q-factors are left unchanged:
$$Q_{k+1}(i,u) = (1-\gamma_k)Q_k(i,u) + \gamma_k (F_k Q_k)(i,u), \quad \forall\ (i,u), \tag{6.101}$$
where the components of the vector $F_k Q_k$ are defined by
$$(F_k Q_k)(i_k,u_k) = g(i_k,u_k,j_k) + \alpha \min_{v\in U(j_k)} Q_k(j_k,v), \tag{6.102}$$
and
$$(F_k Q_k)(i,u) = Q_k(i,u), \quad \forall\ (i,u) \ne (i_k,u_k). \tag{6.103}$$
Note from Eqs. (6.101) and (6.102) that the Q-factor of the current sample
$(i_k, u_k)$ is modified by the iteration
$$Q_{k+1}(i_k,u_k) = Q_k(i_k,u_k) + \gamma_k \Big( g(i_k,u_k,j_k) + \alpha \min_{v\in U(j_k)} Q_k(j_k,v) - Q_k(i_k,u_k) \Big),$$
and that the rightmost term above has the character of a temporal difference.
To guarantee the convergence of the algorithm (6.101)-(6.103) to the
optimal Q-factors, some conditions must be satisfied. Chief among these
conditions are that all state-control pairs (i, u) must be generated infinitely
often within the infinitely long sequence $\{(i_k, u_k)\}$, and that the successor
states j must be independently sampled at each occurrence of a given state-control
pair. Furthermore, the stepsize $\gamma_k$ should diminish to 0 at an
appropriate rate, as we now proceed to discuss.
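The update (6.101)-(6.103) can be sketched in a few lines of Python; the states, costs, and stepsize below are hypothetical, chosen only to exhibit the temporal difference form of the update:

```python
import numpy as np

def q_learning_step(Q, i, u, j, g_iuj, alpha, gamma):
    """One Q-learning update, Eqs. (6.101)-(6.103): only the sampled pair
    (i, u) changes; the bracketed term is the temporal difference."""
    td = g_iuj + alpha * Q[j].min() - Q[i, u]
    Q = Q.copy()                # all other Q-factors are left unchanged
    Q[i, u] += gamma * td
    return Q

# Illustrative numbers (invented): 2 states, 2 controls.
alpha = 0.9
Q = np.array([[1.0, 2.0], [0.5, 4.0]])
Q1 = q_learning_step(Q, i=0, u=1, j=1, g_iuj=3.0, alpha=alpha, gamma=0.5)
# Only Q(0,1) changes: 2.0 + 0.5*(3.0 + 0.9*0.5 - 2.0) = 2.725
```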
6.4.1 Convergence Properties of Q-Learning

We will explain the convergence properties of Q-learning by viewing it as
an asynchronous value iteration algorithm, where the expected value in the
definition (6.100) of the mapping F is approximated via a form of Monte
Carlo averaging. In the process we will derive some variants of Q-learning
that may offer computational advantages in some situations.†
Consider the infinitely long sequence $\{(i_k, u_k)\}$, and a value iteration
algorithm where only the Q-factor corresponding to the pair $(i_k, u_k)$ is
updated at iteration k, while all the other Q-factors are left unchanged.
This is the algorithm
$$Q_{k+1}(i,u) = \begin{cases} (FQ_k)(i_k,u_k) & \text{if } (i,u) = (i_k,u_k), \\ Q_k(i,u) & \text{if } (i,u) \ne (i_k,u_k), \end{cases} \tag{6.104}$$
where F is the mapping (6.100). We can view this algorithm as a special
case of an asynchronous value iteration algorithm of the type discussed in
Section 1.3. Using the analysis of Gauss-Seidel value iteration and related
methods given in that section, it can be shown that the algorithm (6.104)
converges to the optimal Q-factor vector, provided all state-control pairs
(i, u) are generated infinitely often within the sequence $\{(i_k, u_k)\}$.†
† Much of the theory of Q-learning can be generalized to problems with post-decision
states, where $p_{ij}(u)$ is of the form $q\big(f(i,u),j\big)$ (cf. Section 6.1.4). In
particular, for such problems one may develop similar asynchronous simulation-based
versions of value iteration for computing the optimal cost-to-go function
$V^*$ of the post-decision states $m = f(i,u)$: the mapping F of Eq. (6.100) is
replaced by the mapping H given by
$$(HV)(m) = \sum_{j=1}^n q(m,j) \min_{u\in U(j)} \big( g(j,u) + \alpha V\big(f(j,u)\big) \big), \quad \forall\ m,$$
[cf. Eq. (6.11)]. Q-learning corresponds to the special case where f(i, u) = (i, u).
† Generally, iteration with a mapping that is either a contraction with respect
to a weighted sup-norm, or has some monotonicity properties and a fixed
point, converges when executed asynchronously (i.e., with different frequencies
for different components, and with iterates that are not up-to-date). One or
Suppose now that we replace the expected value in the definition
(6.100) of F with a Monte Carlo estimate based on all the samples up to
time k that involve $(i_k, u_k)$. In particular, let $n_k$ be the number of times the
current state-control pair $(i_k, u_k)$ has been generated up to and including
time k, and let
$$T_k = \big\{ t \mid (i_t,u_t) = (i_k,u_k),\ 0 \le t \le k \big\}$$
be the set of corresponding time indexes. The algorithm is given by
$$Q_{k+1}(i,u) = (\tilde F_k Q_k)(i,u), \quad \forall\ (i,u), \tag{6.105}$$
where the components of the vector $\tilde F_k Q_k$ are defined by
$$(\tilde F_k Q_k)(i_k,u_k) = \frac{1}{n_k} \sum_{t\in T_k} \Big( g(i_t,u_t,j_t) + \alpha \min_{v\in U(j_t)} Q_k(j_t,v) \Big), \tag{6.106}$$
$$(\tilde F_k Q_k)(i,u) = Q_k(i,u), \quad \forall\ (i,u) \ne (i_k,u_k). \tag{6.107}$$
Comparing the preceding equations with Eq. (6.100), and using the law of
large numbers, it is clear that for each (i, u) we have, with probability 1,
$$\lim_{k\to\infty,\ k\in T(i,u)} (\tilde F_k Q_k)(i,u) = (FQ_k)(i,u),$$
where $T(i,u) = \big\{ k \mid (i_k,u_k) = (i,u) \big\}$. From this, the sup-norm contraction
property of F, and the attendant asynchronous convergence properties
of the value iteration algorithm Q := FQ, it can be shown that the algorithm
(6.105)-(6.107) converges with probability 1 to the optimal Q-factors
[assuming again that all state-control pairs (i, u) are generated infinitely
often within the sequence $\{(i_k, u_k)\}$].
From the point of view of convergence rate, the algorithm (6.105)-(6.107)
is quite satisfactory, but unfortunately it may have a significant
drawback: it requires excessive overhead per iteration to calculate the
Monte Carlo estimate $(\tilde F_k Q_k)(i_k,u_k)$ using Eq. (6.106). In particular, while
the term
$$\frac{1}{n_k} \sum_{t\in T_k} g(i_t,u_t,j_t),$$
both of these properties are present in discounted and stochastic shortest path
problems. As a result, there are strong asynchronous convergence guarantees for
value iteration for such problems, as shown in Bertsekas [Ber82]. A general convergence
theory of distributed asynchronous algorithms was developed in [Ber83]
and has been discussed in detail in the book [BeT89].
in this equation can be recursively updated with minimal overhead, the
term
$$\frac{1}{n_k} \sum_{t\in T_k} \min_{v\in U(j_t)} Q_k(j_t,v) \tag{6.108}$$
must be completely recomputed at each iteration, using the current vector
$Q_k$. This may be impractical, since the above summation may potentially
involve a very large number of terms.†
Motivated by the preceding concern, let us modify the algorithm and
replace the offending term (6.108) in Eq. (6.106) with
$$\frac{1}{n_k} \sum_{t\in T_k} \min_{v\in U(j_t)} Q_t(j_t,v), \tag{6.110}$$
which can be computed recursively, with minimal extra overhead. This is
the algorithm (6.105), but with the Monte Carlo average $(\tilde F_k Q_k)(i_k,u_k)$
of Eq. (6.106) approximated by replacing the term (6.108) with the term
(6.110), which depends on all the iterates $Q_t$, $t \in T_k$. This algorithm has
† We note a special type of problem where the overhead involved in updating
the term (6.108) may be manageable. This is the case where for each pair (i, u)
the set S(i, u) of possible successor states j [the ones with $p_{ij}(u) > 0$] has small
cardinality. Then for each (i, u), we may maintain the number of times that each
successor state j ∈ S(i, u) has occurred up to time k, and use these counts to compute
efficiently the troublesome term (6.108). In particular, we may implement the
algorithm (6.105)-(6.106) as
$$Q_{k+1}(i_k,u_k) = \sum_{j\in S(i_k,u_k)} \frac{n_k(j)}{n_k} \Big( g(i_k,u_k,j) + \alpha \min_{v\in U(j)} Q_k(j,v) \Big), \tag{6.109}$$
where $n_k(j)$ is the number of times the transition $(i_k, j)$, $j \in S(i_k,u_k)$, occurred
at state $i_k$ under $u_k$ in the simulation up to time k, i.e., $n_k(j)$ is the cardinality
of the set
$$\{ j_t = j \mid t \in T_k \}, \quad j \in S(i_k,u_k).$$
Note that this amounts to replacing the probabilities $p_{i_k j}(u_k)$ in the mapping
(6.100) with their Monte Carlo estimates $n_k(j)/n_k$. While the minimization term in
Eq. (6.109), $\min_{v\in U(j)} Q_k(j,v)$, has to be computed for all $j \in S(i_k,u_k)$ [rather
than for just $j_k$ as in the Q-learning algorithm (6.101)-(6.103)], the
extra computation is not excessive if the cardinalities of $S(i_k,u_k)$ and U(j),
$j \in S(i_k,u_k)$, are small. This approach can be strengthened if some of the
probabilities $p_{i_k j}(u_k)$ are known, in which case they can be used directly in Eq.
(6.109). Generally, any estimate of $p_{i_k j}(u_k)$ can be used in place of $n_k(j)/n_k$, as
long as the estimate converges to $p_{i_k j}(u_k)$ as $k \to \infty$.
the form
$$Q_{k+1}(i_k,u_k) = \frac{1}{n_k} \sum_{t\in T_k} \Big( g(i_t,u_t,j_t) + \alpha \min_{v\in U(j_t)} Q_t(j_t,v) \Big), \tag{6.111}$$
and
$$Q_{k+1}(i,u) = Q_k(i,u), \quad \forall\ (i,u) \ne (i_k,u_k). \tag{6.112}$$
We now show that this (approximate) value iteration algorithm is essentially
the Q-learning algorithm.†
Indeed, let us observe that the iteration (6.111) can be written as
$$Q_{k+1}(i_k,u_k) = \frac{n_k - 1}{n_k} Q_k(i_k,u_k) + \frac{1}{n_k} \Big( g(i_k,u_k,j_k) + \alpha \min_{v\in U(j_k)} Q_k(j_k,v) \Big),$$
or
$$Q_{k+1}(i_k,u_k) = \Big(1 - \frac{1}{n_k}\Big) Q_k(i_k,u_k) + \frac{1}{n_k} (F_k Q_k)(i_k,u_k),$$
where $(F_k Q_k)(i_k,u_k)$ is given by the expression (6.102) used in the Q-learning
algorithm. Thus the algorithm (6.111)-(6.112) is the Q-learning
algorithm (6.101)-(6.103) with a stepsize $\gamma_k = 1/n_k$. It can be similarly
shown that the algorithm (6.111)-(6.112), equipped with a stepsize parameter,
is equivalent to the Q-learning algorithm with a different stepsize,
say
$$\gamma_k = \frac{\gamma}{n_k},$$
where γ is a positive constant.
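The algebraic equivalence just derived can be checked numerically. In the sketch below (all numbers invented), the target values $\min_{v} Q_t(j_t,v)$ are frozen, so that the incremental form with stepsize $\gamma_k = 1/n_k$ and the batch average (6.111) see the same data; the two then coincide exactly:

```python
import numpy as np

# Check (with invented data) that the running-average form
#   Q_{k+1} = (1 - 1/n_k) Q_k + (1/n_k) (F_k Q_k)     [stepsize 1/n_k]
# reproduces the batch average (6.111) over the samples in T_k.
alpha = 0.9
samples = [(1, 2.0), (0, 3.0), (1, 1.0)]  # (successor j_t, cost g_t)
Qmin = np.array([0.5, 1.5])   # frozen min_v Q_t(j, v) values for j = 0, 1

# Incremental form with gamma_k = 1/n_k:
q_inc, n = 0.0, 0
for j_t, g_t in samples:
    n += 1
    q_inc = (1 - 1/n) * q_inc + (1/n) * (g_t + alpha * Qmin[j_t])

# Batch form (6.111):
q_batch = np.mean([g_t + alpha * Qmin[j_t] for j_t, g_t in samples])
# Both equal the sample average, here (3.35 + 3.45 + 2.35)/3 = 3.05.
```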
† A potentially more effective algorithm is to introduce a window of size
m ≥ 0, and consider a more general scheme that calculates the last m terms
of the sum in Eq. (6.108) exactly and the remaining terms according to the
approximation (6.110). This algorithm, a variant of Q-learning, replaces the
offending term (6.108) by
$$\frac{1}{n_k} \left( \sum_{t\in T_k,\ t\le k-m} \min_{v\in U(j_t)} Q_{t+m}(j_t,v) + \sum_{t\in T_k,\ t> k-m} \min_{v\in U(j_t)} Q_k(j_t,v) \right), \tag{6.113}$$
which may also be updated recursively. The algorithm updates at time k the values
of $\min_{v\in U(j_t)} Q(j_t,v)$ to $\min_{v\in U(j_t)} Q_k(j_t,v)$ for all $t \in T_k$ within the window
$k-m \le t \le k$, and fixes them at the last updated value for t outside this window.
For m = 0, it reduces to the algorithm (6.111)-(6.112). For moderate values of
m it involves moderate additional overhead, and it is likely a more accurate approximation
to the term (6.108) than the term (6.110) [$\min_{v\in U(j_t)} Q_{t+m}(j_t,v)$
presumably approximates the “correct” term $\min_{v\in U(j_t)} Q_k(j_t,v)$ better than
$\min_{v\in U(j_t)} Q_t(j_t,v)$ does].
The preceding analysis provides a view of Q-learning as an approximation
to asynchronous value iteration (updating one component at a time)
that uses Monte Carlo sampling in place of the exact expected value in the
mapping F of Eq. (6.100). It also justifies the use of a diminishing stepsize
that goes to 0 at a rate proportional to $1/n_k$, where $n_k$ is the number of
times the pair $(i_k, u_k)$ has been generated up to time k. However, it does
not constitute a convergence proof, because the Monte Carlo estimate used
to approximate the expected value in the definition (6.100) of F is accurate
only in the limit, and only if $Q_k$ converges. We refer to Tsitsiklis [Tsi94b] for
a rigorous proof of convergence of Q-learning, which uses the theoretical
machinery of stochastic approximation algorithms.
In practice, despite its theoretical convergence guarantees, Q-learning
has some drawbacks, the most important of which is that the number
of Q-factors/state-control pairs (i, u) may be excessive. To alleviate this
difficulty, we may introduce a state aggregation scheme; we discuss this
possibility in Section 6.5.1. Alternatively, we may introduce a linear approximation
architecture for the Q-factors, similar to the policy evaluation
schemes of Section 6.3. This is the subject of the next two subsections.
6.4.2 Q-Learning and Approximate Policy Iteration

We will now consider Q-learning methods with linear Q-factor approximation.
As we discussed earlier (cf. Fig. 6.4.1), we may view Q-factors as
optimal costs of a certain discounted DP problem, whose states are the
state-control pairs (i, u). We may thus apply the TD/approximate policy
iteration methods of Section 6.3. For this, we need to introduce a linear
parametric architecture $\tilde Q(i,u,r)$,
$$\tilde Q(i,u,r) = \phi(i,u)'r, \tag{6.114}$$
where φ(i, u) is a feature vector that depends on both state and control.
At the typical iteration, given the current policy µ, these methods find
an approximate solution $\tilde Q_\mu(i,u,r)$ of the projected equation for the Q-factors
corresponding to µ, and then obtain a new policy $\overline\mu$ by
$$\overline\mu(i) = \arg\min_{u\in U(i)} \tilde Q_\mu(i,u,r).$$
For example, similar to our discussion in Section 6.3.4, LSTD(0) with a
linear parametric architecture of the form (6.114) generates a trajectory
$\{(i_0,u_0), (i_1,u_1), \ldots\}$ using the current policy [$u_t = \mu(i_t)$], and finds at
time k the unique solution of the projected equation [cf. Eq. (6.46)]
$$\sum_{t=0}^k \phi(i_t,u_t)\, q_{k,t} = 0,$$
where $q_{k,t}$ are the corresponding TDs
$$q_{k,t} = \phi(i_t,u_t)'r_k - \alpha \phi(i_{t+1},u_{t+1})'r_k - g(i_t,u_t,i_{t+1}), \tag{6.115}$$
[cf. Eq. (6.47)]. Also, LSPE(0) is given by [cf. Eq. (6.57)]
$$r_{k+1} = r_k - \frac{\gamma}{k+1}\, G_k \sum_{t=0}^k \phi(i_t,u_t)\, q_{k,t}, \tag{6.116}$$
where γ is a positive stepsize, and $G_k$ is a positive definite matrix, such as
$$G_k = \left( \frac{\beta}{k+1}\, I + \frac{1}{k+1} \sum_{t=0}^k \phi(i_t,u_t)\phi(i_t,u_t)' \right)^{-1},$$
with β > 0, or a diagonal approximation thereof.
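One update of the iteration (6.115)-(6.116), with the regularized scaling matrix $G_k$, can be sketched as follows; the short trajectory of feature vectors and costs is invented for the example:

```python
import numpy as np

# One LSPE(0) update (6.116) for Q-factors with linear features.
# Trajectory data is hypothetical.
alpha, gamma, beta = 0.9, 0.5, 1.0
s = 2
# phi[t] plays the role of phi(i_t, u_t); g_cost[t] of g(i_t, u_t, i_{t+1}).
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
g_cost = np.array([1.0, 2.0, 0.5])
r = np.zeros(s)

k = 2                                  # use transitions t = 0, 1, 2
# TDs (6.115): q_{k,t} = phi_t' r - alpha * phi_{t+1}' r - g_t
q = np.array([phi[t] @ r - alpha * phi[t + 1] @ r - g_cost[t]
              for t in range(k + 1)])
# Regularized scaling matrix G_k
G = np.linalg.inv(beta / (k + 1) * np.eye(s)
                  + sum(np.outer(phi[t], phi[t]) for t in range(k + 1)) / (k + 1))
r_next = r - gamma / (k + 1) * G @ sum(q[t] * phi[t] for t in range(k + 1))
```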
There are also optimistic approximate policy iteration methods based
on LSPE(0), LSTD(0), and TD(0), similar to the ones we discussed earlier.
As an example, let us consider the extreme case of TD(0) that uses a single
sample between policy updates. At the start of iteration k, we have the
current parameter vector $r_k$, we are at some state $i_k$, and we have chosen
a control $u_k$. Then:
(1) We simulate the next transition $(i_k, i_{k+1})$ using the transition probabilities
$p_{i_k j}(u_k)$.
(2) We generate the control $u_{k+1}$ from the minimization
$$u_{k+1} = \arg\min_{u\in U(i_{k+1})} \tilde Q(i_{k+1}, u, r_k). \tag{6.117}$$
(3) We update the parameter vector via
$$r_{k+1} = r_k - \gamma_k\, \phi(i_k,u_k)\, q_{k,k}, \tag{6.118}$$
where $\gamma_k$ is a positive stepsize, and $q_{k,k}$ is the TD
$$q_{k,k} = \phi(i_k,u_k)'r_k - \alpha \phi(i_{k+1},u_{k+1})'r_k - g(i_k,u_k,i_{k+1});$$
[cf. Eq. (6.115)].
The process is now repeated with $r_{k+1}$, $i_{k+1}$, and $u_{k+1}$ replacing $r_k$, $i_k$,
and $u_k$, respectively.
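Steps (1)-(3) can be sketched as a single function; the feature map and problem data below are hypothetical, and step (1), the simulation of the transition, is assumed to have already produced the successor state j:

```python
import numpy as np

def optimistic_td0_step(r, phi, i, u, j, controls, g_iuj, alpha, gamma_k):
    """One cycle of the single-sample optimistic TD(0) scheme
    (6.117)-(6.118), with Q~(i,u,r) = phi(i,u)' r."""
    # (2) choose the next control greedily, Eq. (6.117)
    u_next = min(controls, key=lambda v: phi(j, v) @ r)
    # (3) TD and parameter update, Eq. (6.118)
    q = phi(i, u) @ r - alpha * phi(j, u_next) @ r - g_iuj
    return r - gamma_k * q * phi(i, u), u_next

# Toy feature map (invented): phi(i, u) is a 2-vector.
phi = lambda i, u: np.array([1.0 * (u == 0), 1.0 * (u == 1)]) * (i + 1)
r = np.array([1.0, 2.0])
r_new, u1 = optimistic_td0_step(r, phi, i=0, u=0, j=1, controls=[0, 1],
                                g_iuj=1.0, alpha=0.9, gamma_k=0.1)
# q = 1 - 0.9*2 - 1 = -1.8, so r_new = [1.18, 2.0] and u1 = 0.
```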
In simulation-based methods, a major concern is the issue of exploration
in the approximate policy evaluation step, to ensure that state-control
pairs $(i,u) \ne \big(i,\mu(i)\big)$ are generated sufficiently often in the simulation.
For this, the exploration-enhanced schemes discussed in Section
6.3.8 may be used in conjunction with LSTD. As an example, given the
current policy µ, we may use any exploration-enhanced transition mechanism
to generate a sequence $\{(i_0,u_0), (i_1,u_1), \ldots\}$, and then use LSTD(0)
with extra transitions
$$(i_k,u_k) \to \big(j_k, \mu(j_k)\big),$$
where $j_k$ is generated from $(i_k,u_k)$ using the transition probabilities $p_{i_k j}(u_k)$
(cf. Section 6.3.8). Alternatively, we may use an exploration scheme based
on LSTD(λ) with modified temporal differences (cf. Section 6.3.8). In such
a scheme, we generate a sequence of state-control pairs $\{(i_0,u_0), (i_1,u_1), \ldots\}$
according to transition probabilities
$$p_{i_k i_{k+1}}(u_k)\, \nu(u_{k+1} \mid i_{k+1}),$$
where $\nu(u \mid i)$ is a probability distribution over the control constraint set
U(i), which provides a mechanism for exploration. Note that in this case
the calculation of the probability ratio in the modified temporal difference
of Eq. (6.93) does not require knowledge of the transition probabilities
$p_{ij}(u)$, since these probabilities appear in both the numerator and the denominator
and cancel out. Generally, in the context of Q-learning, the
required amount of exploration is likely to be substantial, so the underlying
mapping ΠT may not be a contraction, in which case the validity of
LSPE(0) or TD(0) comes into doubt, as discussed in Section 6.3.8.
As in other forms of policy iteration, the behavior of all the algorithms
described is very complex, involving for example near-singular matrix inversion
(cf. Section 6.3.4) or policy oscillations (cf. Section 6.3.6), and there
is no guarantee of success (except for the general error bounds for approximate
policy iteration methods). However, Q-learning with approximate policy
iteration is often tried because of its model-free character [it does not require
knowledge of the transition probabilities $p_{ij}(u)$].
6.4.3 Q-Learning for Optimal Stopping Problems

The policy evaluation algorithms of Section 6.3, such as TD(λ), LSPE(λ),
and LSTD(λ), apply when there is a single policy to be evaluated in the
context of approximate policy iteration. We may try to extend these methods
to the case of multiple policies, by aiming to solve by simulation the
projected equation
$$\Phi r = \Pi T(\Phi r),$$
where T is a DP mapping that now involves minimization over multiple
controls. However, there are some difficulties:
(a) The mapping ΠT is nonlinear, so a simulation-based approximation
approach like LSTD breaks down.
(b) ΠT may not in general be a contraction with respect to any norm, so
the PVI iteration
$$\Phi r_{k+1} = \Pi T(\Phi r_k)$$
[cf. Eq. (6.35)] may diverge, and simulation-based LSPE-like approximations
may also diverge.
(c) Even if ΠT is a contraction, so that the above PVI iteration converges, the
implementation of LSPE-like approximations may not admit an efficient
recursive implementation, because $T(\Phi r_k)$ is a nonlinear function
of $r_k$.
In this section we discuss the extension of iterative LSPE-type ideas to
the special case of an optimal stopping problem, where the last two difficulties
noted above can be largely overcome. Optimal stopping problems
are a special case of DP problems where we can only choose whether to
terminate at the current state or not. Examples are problems of search, sequential
hypothesis testing, and pricing of derivative financial instruments
(see Section 4.4 of Vol. I, and Section 3.4 of the present volume).
We are given a Markov chain with state space {1, . . . , n}, described
by transition probabilities $p_{ij}$. We assume that the states form a single
recurrent class, so that the steady-state distribution vector $\xi = (\xi_1, \ldots, \xi_n)$
satisfies $\xi_i > 0$ for all i, as in Section 6.3. Given the current state i, we
assume that we have two options: to stop and incur a cost c(i), or to
continue and incur a cost g(i, j), where j is the next state (there is no
control to affect the corresponding transition probabilities). The problem
is to minimize the associated α-discounted infinite horizon cost.
We associate a Q-factor with each of the two possible decisions. The
Q-factor for the decision to stop is equal to c(i). The Q-factor for the
decision to continue is denoted by Q(i), and satisfies Bellman’s equation
$$Q(i) = \sum_{j=1}^n p_{ij} \Big( g(i,j) + \alpha \min\big\{ c(j), Q(j) \big\} \Big). \tag{6.119}$$
The Q-learning algorithm generates an infinitely long sequence of states
$\{i_0, i_1, \ldots\}$, with all states generated infinitely often, and a corresponding
sequence of transitions $\{(i_k, j_k)\}$, generated according to the transition
probabilities $p_{i_k j}$. It updates the Q-factor for the decision to continue as
follows [cf. Eqs. (6.101)-(6.103)]:
$$Q_{k+1}(i) = (1-\gamma_k)Q_k(i) + \gamma_k (F_k Q_k)(i), \quad \forall\ i,$$
where the components of the mapping $F_k$ are defined by
$$(F_k Q)(i_k) = g(i_k,j_k) + \alpha \min\big\{ c(j_k), Q(j_k) \big\},$$
and
$$(F_k Q)(i) = Q(i), \quad \forall\ i \ne i_k.$$
The convergence of this algorithm is addressed by the general theory
of Q-learning discussed earlier. Once the Q-factors are calculated, an
optimal policy can be implemented by stopping at state i if and only if
c(i) ≤ Q(i). However, when the number of states is very large, the algorithm
is impractical, which motivates Q-factor approximations.
Let us introduce the mapping $F : \Re^n \to \Re^n$ given by
$$(FQ)(i) = \sum_{j=1}^n p_{ij} \Big( g(i,j) + \alpha \min\big\{ c(j), Q(j) \big\} \Big).$$
This mapping can be written in more compact notation as
$$FQ = g + \alpha P f(Q),$$
where g is the vector whose ith component is
$$\sum_{j=1}^n p_{ij}\, g(i,j), \tag{6.120}$$
and f(Q) is the function whose jth component is
$$f_j(Q) = \min\big\{ c(j), Q(j) \big\}. \tag{6.121}$$
We note that the (exact) Q-factor for the choice to continue is the unique
fixed point of F [cf. Eq. (6.119)].
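The compact iteration $Q := FQ = g + \alpha P f(Q)$ is easy to sketch; the chain, continuation costs, and stopping costs below are invented for the example:

```python
import numpy as np

# Value iteration Q := FQ for the optimal stopping problem, with
# FQ = g + alpha * P f(Q) and f_j(Q) = min{c(j), Q(j)}.  Data invented.
alpha = 0.9
P = np.array([[0.6, 0.4], [0.2, 0.8]])       # transition probabilities p_ij
Gc = np.array([[1.0, 2.0], [0.5, 1.5]])      # continuation costs g(i, j)
c = np.array([3.0, 4.0])                     # stopping costs c(i)

g = (P * Gc).sum(axis=1)                     # Eq. (6.120): g_i = sum_j p_ij g(i,j)
f = lambda Q: np.minimum(c, Q)               # Eq. (6.121)
F = lambda Q: g + alpha * P @ f(Q)

Q = np.zeros(2)
for _ in range(500):                         # F is a contraction, so this converges
    Q = F(Q)
# Q now solves Bellman's equation (6.119); stop at i iff c(i) <= Q(i).
```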
Let $\|\cdot\|_\xi$ be the weighted Euclidean norm associated with the steady-state
probability vector ξ. We claim that F is a contraction with respect
to this norm. Indeed, for any two vectors Q and $\bar Q$, we have
$$\big| (FQ)(i) - (F\bar Q)(i) \big| \le \alpha \sum_{j=1}^n p_{ij} \big| f_j(Q) - f_j(\bar Q) \big| \le \alpha \sum_{j=1}^n p_{ij} \big| Q(j) - \bar Q(j) \big|,$$
or
$$|FQ - F\bar Q| \le \alpha P |Q - \bar Q|,$$
where we use the notation |x| to denote a vector whose components are the
absolute values of the components of x. Hence,
$$\|FQ - F\bar Q\|_\xi \le \alpha \big\| P|Q - \bar Q| \big\|_\xi \le \alpha \|Q - \bar Q\|_\xi,$$
where the last step follows from the inequality $\|PJ\|_\xi \le \|J\|_\xi$, which holds
for every vector J (cf. Lemma 6.3.1). We conclude that F is a contraction
with respect to $\|\cdot\|_\xi$, with modulus α.
We will now consider Q-factor approximations, using a linear approximation
architecture
$$\tilde Q(i,r) = \phi(i)'r,$$
where φ(i) is an s-dimensional feature vector associated with state i. We
also write the vector
$$\big( \tilde Q(1,r), \ldots, \tilde Q(n,r) \big)'$$
in the compact form Φr, where as in Section 6.3, Φ is the n × s matrix
whose rows are φ(i)′, i = 1, . . . , n. We assume that Φ has rank s, and we
denote by Π the projection mapping with respect to $\|\cdot\|_\xi$ on the subspace
$S = \{\Phi r \mid r \in \Re^s\}$.
Because F is a contraction with respect to $\|\cdot\|_\xi$ with modulus α, and
Π is nonexpansive, the mapping ΠF is a contraction with respect to $\|\cdot\|_\xi$
with modulus α. Therefore, the algorithm
$$\Phi r_{k+1} = \Pi F(\Phi r_k) \tag{6.122}$$
converges to the unique fixed point of ΠF. This is the analog of the PVI
algorithm (cf. Section 6.3.2).
As in Section 6.3.2, we can write the PVI iteration (6.122) as
$$r_{k+1} = \arg\min_{r\in\Re^s} \Big\| \Phi r - \big( g + \alpha P f(\Phi r_k) \big) \Big\|_\xi^2, \tag{6.123}$$
where g and f are defined by Eqs. (6.120) and (6.121). By setting to 0 the
gradient of the quadratic function in Eq. (6.123), we see that the iteration
can be written as
$$r_{k+1} = r_k - (\Phi' \Xi \Phi)^{-1} \big( C(r_k) - d \big),$$
where
$$C(r_k) = \Phi' \Xi \big( \Phi r_k - \alpha P f(\Phi r_k) \big), \qquad d = \Phi' \Xi g.$$
Similar to Section 6.3.3, we may implement a simulation-based approximate
version of this iteration, thereby obtaining an analog of the
LSPE(0) method. In particular, we generate a single infinitely long simulated
trajectory $(i_0, i_1, \ldots)$ corresponding to an unstopped system, i.e.,
using the transition probabilities $p_{ij}$. Following the transition $(i_k, i_{k+1})$,
we update $r_k$ by
$$r_{k+1} = r_k - \left( \sum_{t=0}^k \phi(i_t)\phi(i_t)' \right)^{-1} \sum_{t=0}^k \phi(i_t)\, q_{k,t}, \tag{6.124}$$
where $q_{k,t}$ is the TD,
$$q_{k,t} = \phi(i_t)'r_k - \alpha \min\big\{ c(i_{t+1}),\ \phi(i_{t+1})'r_k \big\} - g(i_t,i_{t+1}). \tag{6.125}$$
Similar to the calculations involving the relation between PVI and LSPE,
it can be shown that $r_{k+1}$ as given by this iteration is equal to the iterate
produced by the iteration $\Phi r_{k+1} = \Pi F(\Phi r_k)$ plus a simulation-induced
error that asymptotically converges to 0 with probability 1 (see the paper
by Yu and Bertsekas [YuB07], to which we refer for further analysis). As
a result, the generated sequence $\{\Phi r_k\}$ asymptotically converges to the
unique fixed point of ΠF. Note that similar to discounted problems, we
may also use a scaled version of the PVI algorithm,
$$r_{k+1} = r_k - \gamma G \big( C(r_k) - d \big), \tag{6.126}$$
where γ is a positive stepsize, and G is a scaling matrix. If G is positive
definite symmetric, it can be shown that this iteration converges to the unique
solution of the projected equation, provided γ is sufficiently small. [The proof of
this is somewhat more complicated than the corresponding proof of Section
6.3.2 because $C(r_k)$ depends nonlinearly on $r_k$. It requires algorithmic
analysis using the theory of variational inequalities; see [Ber09].] We may
approximate the scaled PVI algorithm (6.126) by a simulation-based scaled
LSPE version of the form
$$r_{k+1} = r_k - \frac{\gamma}{k+1}\, G_k \sum_{t=0}^k \phi(i_t)\, q_{k,t},$$
where $G_k$ is a positive definite symmetric matrix and γ is a sufficiently
small positive stepsize. For example, we may use a diagonal approximation
to the inverse in Eq. (6.124).
In comparing the Q-learning iteration (6.124)-(6.125) with the alternative optimistic LSPE version (6.116), we note that it has considerably higher computation overhead. In the process of updating r_{k+1} via Eq. (6.124), we can compute the matrix Σ_{t=0}^k φ(i_t)φ(i_t)′ and the vector Σ_{t=0}^k φ(i_t)q_{k,t} iteratively as in the LSPE algorithms of Section 6.3. However, the terms

min{ c(i_{t+1}), φ(i_{t+1})′r_k }

in the TD formula (6.125) need to be recomputed for all the samples i_{t+1},
t ≤ k. Intuitively, this computation corresponds to repartitioning the states into those at which to stop and those at which to continue, based on the current approximate Q-factors Φr_k. By contrast, in the corresponding optimistic LSPE version (6.116), there is no repartitioning, and these terms are replaced by w̃(i_{t+1}, r_k), given by

w̃(i_{t+1}, r_k) = { c(i_{t+1})       if t ∈ T,
                    φ(i_{t+1})′r_k    if t ∉ T,

where

T = { t | c(i_{t+1}) ≤ φ(i_{t+1})′r_t }

is the set of states to stop based on the approximate Q-factors Φr_t, calculated at time t (rather than the current time k). In particular, the term
Σ_{t=0}^k φ(i_t) min{ c(i_{t+1}), φ(i_{t+1})′r_k }
in Eqs. (6.124), (6.125) is replaced by
Σ_{t=0}^k φ(i_t) w̃(i_{t+1}, r_k) = Σ_{t≤k, t∈T} φ(i_t)c(i_{t+1}) + ( Σ_{t≤k, t∉T} φ(i_t)φ(i_{t+1})′ ) r_k,    (6.127)
which can be efficiently updated at each time k. It can be seen that the optimistic algorithm that uses the expression (6.127) (no repartitioning) can only converge to the same limit as the nonoptimistic version (6.124). However, there is no convergence proof of this algorithm at present.

Another variant of the algorithm, with a more solid theoretical foundation, is obtained by simply replacing the term φ(i_{t+1})′r_k in the TD formula (6.125) by φ(i_{t+1})′r_t, thereby eliminating the extra overhead for repartitioning. The idea is that for large k and t, these two terms are close to each other, so convergence is still maintained. The convergence analysis of this algorithm and some variations is based on the theory of stochastic approximation methods, and is given in the paper by Yu and Bertsekas [YuB07], to which we refer for further discussion.
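The incremental bookkeeping that makes (6.127) cheap can be sketched as follows; the dimension, sample values, and helper names are hypothetical.

```python
import numpy as np

# Sketch of the bookkeeping behind Eq. (6.127): each sample t is classified
# once, at time t (stop if c(i_{t+1}) <= phi(i_{t+1})' r_t), and two running
# sums are kept so the term can be formed for any current r_k without
# revisiting old samples.

s = 2
stop_sum = np.zeros(s)          # sum over t in T of phi(i_t) c(i_{t+1})
cont_mat = np.zeros((s, s))     # sum over t not in T of phi(i_t) phi(i_{t+1})'

def add_sample(phi_t, phi_t1, c_t1, r_t):
    # classify sample t using the iterate r_t available at time t
    if c_t1 <= phi_t1 @ r_t:
        stop_sum[:] += phi_t * c_t1
    else:
        cont_mat[:] += np.outer(phi_t, phi_t1)

def term(r_k):
    # right-hand side of Eq. (6.127), formed for the current r_k
    return stop_sum + cont_mat @ r_k

add_sample(np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.5, np.array([1.0, 1.0]))
add_sample(np.array([0.0, 1.0]), np.array([1.0, 0.0]), 2.0, np.array([1.0, 1.0]))
out = term(np.array([2.0, 3.0]))
```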
Constrained Policy Iteration and Optimal Stopping
It is natural in approximate DP to try to exploit whatever prior information is available about J*. In particular, if it is known that J* belongs to a subset 𝒥 of ℜ^n, we may try to find an approximation Φr that belongs to 𝒥. This leads to projected equations involving projection on a restricted subset of the approximation subspace S. Corresponding analogs of the LSTD- and LSPE-type methods for such projected equations involve the solution of linear variational inequalities rather than linear systems of equations. The details of this are beyond our scope, and we refer to [Ber09] for a discussion.
In the practically common case where an upper bound of J* is available, a simple possibility is to modify the policy iteration algorithm. In particular, suppose that we know a vector J with J(i) ≥ J*(i) for all i. Then the approximate policy iteration method can be modified to incorporate this knowledge as follows. Given a policy µ, we evaluate it by finding an approximation Φr̃_µ to the solution J̃_µ of the equation

J̃_µ(i) = Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α min{ J(j), J̃_µ(j) } ),    i = 1, . . . , n,    (6.128)
followed by the (modified) policy improvement

µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α min{ J(j), φ(j)′r̃_µ } ),    i = 1, . . . , n,    (6.129)

where φ(j)′ is the row of Φ that corresponds to state j.
Note that Eq. (6.128) is Bellman's equation for the Q-factor of an optimal stopping problem that involves the stopping cost J(i) at state i [cf. Eq. (6.119)]. Under the assumption J(i) ≥ J*(i) for all i, and a lookup table representation (Φ = I), it can be shown that the method (6.128)-(6.129) yields J* in a finite number of iterations, just like the standard (exact) policy iteration method (Exercise 6.17). When a compact feature-based representation is used (Φ ≠ I), the approximate policy evaluation based on Eq. (6.128) can be performed using the Q-learning algorithms described earlier in this section. The method exhibits oscillatory behavior similar to its unconstrained policy iteration counterpart (cf. Section 6.3.6).
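A minimal sketch of the modified policy iteration (6.128)-(6.129), assuming a lookup table representation (Φ = I) and a made-up two-state MDP; per the discussion above, the method should then recover J* exactly.

```python
import numpy as np

# Sketch of constrained policy iteration (6.128)-(6.129) with Phi = I.
# J_upper plays the role of the known vector J with J(i) >= J*(i).
alpha = 0.9
p = np.array([[[0.8, 0.2], [0.3, 0.7]],     # p[u][i][j] (made up)
              [[0.5, 0.5], [0.9, 0.1]]])
g = np.array([[[2.0, 1.0], [1.0, 3.0]],     # g[u][i][j] (made up)
              [[0.5, 2.0], [2.5, 0.5]]])

def evaluate(mu, J_upper, iters=500):
    # fixed point of Eq. (6.128), by value iteration on the stopping operator
    J = np.zeros(2)
    for _ in range(iters):
        J = np.array([p[mu[i], i] @ (g[mu[i], i] + alpha * np.minimum(J_upper, J))
                      for i in range(2)])
    return J

def improve(J, J_upper):
    # Eq. (6.129), with phi(j)' r replaced by the lookup-table values J(j)
    Q = np.array([[p[u, i] @ (g[u, i] + alpha * np.minimum(J_upper, J))
                   for u in range(2)] for i in range(2)])
    return Q.argmin(axis=1)

# exact J* by ordinary value iteration, for reference
J_star = np.zeros(2)
for _ in range(500):
    J_star = np.array([min(p[u, i] @ (g[u, i] + alpha * J_star)
                           for u in range(2)) for i in range(2)])

J_upper = J_star + 1.0                      # a valid upper bound on J*
mu = np.array([0, 0])
for _ in range(10):
    mu = improve(evaluate(mu, J_upper), J_upper)
J_final = evaluate(mu, J_upper)
```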
6.4.4 Finite-Horizon Q-Learning

We will now briefly discuss Q-learning and related approximations for finite horizon problems. We will emphasize on-line algorithms that are suitable for relatively short horizon problems. Approximate DP methods for such problems are additionally important because they arise in the context of multistep lookahead and rolling horizon schemes, possibly with cost function approximation at the end of the horizon.

One may develop extensions of the Q-learning algorithms of the preceding sections to deal with finite horizon problems, with or without cost function approximation. For example, one may easily develop versions of the projected Bellman equation, and corresponding LSTD- and LSPE-type algorithms (see the end-of-chapter exercises). However, with a finite horizon, there are a few alternative approaches, with an on-line character, which resemble rollout algorithms. In particular, at state-time pair (i_k, k), we may compute approximate Q-factors

Q̃_k(i_k, u_k),    u_k ∈ U_k(i_k),

and use on-line the control ũ_k ∈ U_k(i_k) that minimizes Q̃_k(i_k, u_k) over u_k ∈ U_k(i_k). The approximate Q-factors have the form
Q̃_k(i_k, u_k) = Σ_{i_{k+1}=1}^{n_{k+1}} p_{i_k i_{k+1}}(u_k) ( g(i_k, u_k, i_{k+1}) + min_{u_{k+1}∈U_{k+1}(i_{k+1})} Q̂_{k+1}(i_{k+1}, u_{k+1}) ),    (6.130)
where Q̂_{k+1} may be computed in a number of ways:

(1) Q̂_{k+1} may be the cost function J̃_{k+1} of a base heuristic (and is thus independent of u_{k+1}), in which case Eq. (6.130) takes the form

Q̃_k(i_k, u_k) = Σ_{i_{k+1}=1}^{n_{k+1}} p_{i_k i_{k+1}}(u_k) ( g(i_k, u_k, i_{k+1}) + J̃_{k+1}(i_{k+1}) ).    (6.131)
This is the rollout algorithm discussed at length in Chapter 6 of Vol. I. A variation is when multiple base heuristics are used and J̃_{k+1} is the minimum of the cost functions of these heuristics. These schemes may also be combined with a rolling and/or limited lookahead horizon.
(2) Q̂_{k+1} is an approximately optimal cost function J̃_{k+1} [independent of u_{k+1} as in Eq. (6.131)], which is computed by (possibly multistep lookahead or rolling horizon) DP based on limited sampling to approximate the various expected values arising in the DP algorithm. Thus, here the function J̃_{k+1} of Eq. (6.131) corresponds to a (finite horizon) near-optimal policy in place of the base policy used by rollout. These schemes are well suited for problems with a large (or infinite) state space but only a small number of controls per state, and may also involve selective pruning of the control constraint set to reduce the associated DP computations. The book by Chang, Fu, Hu, and Marcus [CFH07] has extensive discussions of approaches of this type, including systematic forms of adaptive sampling that aim to reduce the effects of limited simulation (less sampling for controls that seem less promising at a given state, and less sampling for future states that are less likely to be visited starting from the current state i_k).
(3) Q̂_{k+1} is computed using a linear parametric architecture of the form

Q̂_{k+1}(i_{k+1}, u_{k+1}) = φ(i_{k+1}, u_{k+1})′r_{k+1},    (6.132)

where r_{k+1} is a parameter vector. In particular, Q̂_{k+1} may be obtained by a least-squares fit/regression or interpolation based on values computed at a subset of selected state-control pairs (cf. Section 6.4.3 of Vol. I). These values may be computed by finite horizon rollout, using as base policy the greedy policy corresponding to the preceding approximate Q-values in a backwards (off-line) Q-learning scheme:

µ̃_i(x_i) = arg min_{u_i∈U_i(x_i)} Q̂_i(x_i, u_i),    i = k + 2, . . . , N − 1.    (6.133)
Thus, in such a scheme, we first compute

Q̃_{N−1}(i_{N−1}, u_{N−1}) = Σ_{i_N=1}^{n_N} p_{i_{N−1} i_N}(u_{N−1}) ( g(i_{N−1}, u_{N−1}, i_N) + J_N(i_N) )
by the final stage DP computation at a subset of selected state-control pairs (i_{N−1}, u_{N−1}), followed by a least squares fit of the obtained values to obtain Q̂_{N−1} in the form (6.132); then we compute Q̃_{N−2} at a subset of selected state-control pairs (i_{N−2}, u_{N−2}) by rollout using the base policy {µ̃_{N−1}} defined by Eq. (6.133), followed by a least squares fit of the obtained values to obtain Q̂_{N−2} in the form (6.132); then compute Q̃_{N−3} at a subset of selected state-control pairs (i_{N−3}, u_{N−3}) by rollout using the base policy {µ̃_{N−2}, µ̃_{N−1}} defined by Eq. (6.133), etc.
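The backwards scheme of item (3) can be sketched as follows, with the simplifying assumption that the fit is done at every state-control pair with exact expected values (a stand-in for sampling and rollout); all problem data are invented. With the one-hot features used here the least-squares fit is exact, so the result must agree with exact finite-horizon DP.

```python
import numpy as np

# Sketch of backwards (off-line) Q-learning with a linear fit, Eq. (6.132).
n, m, N = 2, 2, 3                                  # states, controls, horizon
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(n), size=(n, m))         # p[i, u, j] (made up)
g = rng.uniform(0.0, 1.0, size=(n, m, n))          # g[i, u, j] (made up)
J_N = np.zeros(n)                                  # terminal cost

def phi(i, u):                                     # one-hot features
    f = np.zeros(n * m)
    f[i * m + u] = 1.0
    return f

J_next = J_N
for k in reversed(range(N)):
    A = np.array([phi(i, u) for i in range(n) for u in range(m)])
    y = np.array([p[i, u] @ (g[i, u] + J_next)     # target Q-values
                  for i in range(n) for u in range(m)])
    r, *_ = np.linalg.lstsq(A, y, rcond=None)      # least-squares fit of r_k
    Q = np.array([[phi(i, u) @ r for u in range(m)] for i in range(n)])
    J_next = Q.min(axis=1)                         # greedy values, cf. Eq. (6.133)
J_0 = J_next

# exact finite-horizon DP for comparison
J_exact = J_N
for k in reversed(range(N)):
    J_exact = np.array([min(p[i, u] @ (g[i, u] + J_exact) for u in range(m))
                        for i in range(n)])
```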
One advantage of finite horizon formulations is that convergence issues of the type arising in policy or value iteration methods do not play a significant role, so anomalous behavior does not arise. This is, however, a mixed blessing as it may mask poor performance and/or important qualitative differences between different approaches.
6.5 AGGREGATION METHODS

In this section we revisit the aggregation methodology discussed in Section 6.3.4 of Vol. I, viewing it now in the context of cost-to-go or Q-factor approximation for discounted DP.† The aggregation approach resembles in some ways the problem approximation approach discussed in Section 6.3.3 of Vol. I: the original problem is approximated with a related "aggregate" problem, which is then solved exactly to yield a cost-to-go approximation for the original problem. Still, in other ways the aggregation approach resembles the projected equation/subspace approximation approach, most importantly because it constructs cost approximations of the form Φr, i.e., linear combinations of basis functions. However, there are important differences: in aggregation methods there are no projections with respect to Euclidean norms, the simulations can be done more flexibly, and from a mathematical point of view, the underlying contractions are with respect to the sup-norm rather than a Euclidean norm.
To construct an aggregation framework, we introduce a finite set 𝒜 of aggregate states, and we introduce two (somewhat arbitrary) choices of probabilities, which relate the original system states with the aggregate states:

(1) For each aggregate state x and original system state i, we specify the disaggregation probability d_{xi} [we have Σ_{i=1}^n d_{xi} = 1 for each
† Aggregation may be used in conjunction with any Bellman equation associated with the given problem. For example, if the problem admits post-decision states (cf. Section 6.1.4), the aggregation may be done using the corresponding Bellman equation, with potentially significant simplifications resulting in the algorithms of this section.
x ∈ 𝒜]. Roughly, d_{xi} may be interpreted as the "degree to which x is represented by i."

(2) For each aggregate state y and original system state j, we specify the aggregation probability φ_{jy} (we have Σ_{y∈𝒜} φ_{jy} = 1 for each j = 1, . . . , n). Roughly, φ_{jy} may be interpreted as the "degree of membership of j in the aggregate state y." The vectors { φ_{jy} | j = 1, . . . , n } may also be viewed as basis functions that will be used to represent approximations of the cost vectors of the original problem.
Let us mention a few examples:

(a) In hard and soft aggregation (Examples 6.3.9 and 6.3.10 of Vol. I), we group the original system states into subsets, and we view each subset as an aggregate state. In hard aggregation each state belongs to one and only one subset, and the aggregation probabilities are

φ_{jy} = 1 if system state j belongs to aggregate state/subset y.

One possibility is to choose the disaggregation probabilities as

d_{xi} = 1/n_x if system state i belongs to aggregate state/subset x,

where n_x is the number of states of x (this implicitly assumes that all states that belong to aggregate state/subset y are "equally representative"). In soft aggregation, we allow the aggregate states/subsets to overlap, with the aggregation probabilities φ_{jy} quantifying the "degree of membership" of j in the aggregate state/subset y.

(b) In various discretization schemes, each original system state j is associated with a convex combination of aggregate states:

j ∼ Σ_{y∈𝒜} φ_{jy} y,

for some nonnegative weights φ_{jy}, whose sum is 1, and which are viewed as aggregation probabilities (this makes geometrical sense if both the original and the aggregate states are associated with points in a Euclidean space, as described in Example 6.3.13 of Vol. I).

(c) In coarse grid schemes (cf. Example 6.3.12 of Vol. I), a subset of representative states is chosen, each being an aggregate state. Thus, each aggregate state x is associated with a unique original state i_x, and we may use the disaggregation probabilities d_{xi} = 1 for i = i_x and d_{xi} = 0 for i ≠ i_x. The aggregation probabilities are chosen as in the preceding case (b).
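For instance, a hard aggregation scheme as in example (a) can be encoded by a pair of stochastic matrices; the partition below is arbitrary.

```python
import numpy as np

# Sketch: the matrices D (rows = disaggregation distributions d_x.) and
# Phi (rows = aggregation distributions phi_j.) for hard aggregation of
# n = 5 states into the two groups {0, 1} and {2, 3, 4}, with d_xi = 1/n_x
# inside each group.

n = 5
groups = [[0, 1], [2, 3, 4]]

D = np.zeros((len(groups), n))
Phi = np.zeros((n, len(groups)))
for x, members in enumerate(groups):
    for i in members:
        D[x, i] = 1.0 / len(members)    # "degree to which x is represented by i"
        Phi[i, x] = 1.0                 # phi_jy = 1 iff j belongs to y
```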
The aggregation approach approximates cost vectors with Φr, where r ∈ ℜ^s is a weight vector to be determined, and Φ is the matrix whose jth
row consists of the aggregation probabilities φ_{j1}, . . . , φ_{js}. Thus aggregation involves an approximation architecture similar to the one of projected equation methods: it uses as features the aggregation probabilities. Conversely, starting from a set of s features for each state, we may construct a feature-based hard aggregation scheme by grouping together states with "similar features." In particular, we may use a more or less regular partition of the feature space, which induces a possibly irregular partition of the original state space into aggregate states (all states whose features fall in the same set of the feature partition form an aggregate state). This is a general approach for passing from a feature-based approximation of the cost vector to an aggregation-based approximation. Unfortunately, in the resulting aggregation scheme the number of aggregate states may become very large.
The aggregation and disaggregation probabilities specify a dynamical system involving both aggregate and original system states (cf. Fig. 6.5.1). In this system:

(i) From aggregate state x, we generate original system state i according to d_{xi}.

(ii) We generate transitions from original system state i to original system state j according to p_{ij}(u), with cost g(i, u, j).

(iii) From original system state j, we generate aggregate state y according to φ_{jy}.
Figure 6.5.1. Illustration of the transition mechanism of a dynamical system
involving both aggregate and original system states.
One may associate various DP problems with this system, thereby effecting cost-to-go or Q-factor approximation. In the first problem, discussed in
the next subsection, the focus is on the aggregate states, the role of the original system states being to define the mechanisms of cost generation and probabilistic transition from one aggregate state to the next. In the second problem, discussed in Section 6.5.2, the focus is on both the original system states and the aggregate states, which are viewed as states of an enlarged system. Policy and value iteration algorithms are then defined for this enlarged system. These methods admit simulation-based implementations.
6.5.1 Cost and Q-Factor Approximation by Aggregation

Here we assume that the control constraint set U(i) is independent of the state i, and we denote it by U. Then, the transition probability from aggregate state x to aggregate state y under control u, and the corresponding expected transition cost, are given by (cf. Fig. 6.5.1)

p̂_{xy}(u) = Σ_{i=1}^n d_{xi} Σ_{j=1}^n p_{ij}(u) φ_{jy},    ĝ(x, u) = Σ_{i=1}^n d_{xi} Σ_{j=1}^n p_{ij}(u) g(i, u, j).
These transition probabilities and costs define an aggregate problem whose states are just the aggregate states.

The optimal cost function of the aggregate problem, denoted Ĵ, is obtained as the unique solution of Bellman's equation

Ĵ(x) = min_{u∈U} ( ĝ(x, u) + α Σ_{y∈𝒜} p̂_{xy}(u) Ĵ(y) ),    ∀ x,

which can be solved by any of the available value and policy iteration methods, including ones that involve simulation. The optimal cost function J* of the original problem is approximated by J̃ given by

J̃(j) = Σ_{y∈𝒜} φ_{jy} Ĵ(y),    ∀ j.
Thus, for an original system state j, the approximation J̃(j) is a convex combination of the costs Ĵ(y) of the aggregate states y for which φ_{jy} > 0. In the case of hard aggregation, J̃ is piecewise constant: it assigns the same cost to all the states j that belong to the same aggregate state y (since φ_{jy} = 1 if j belongs to y and φ_{jy} = 0 otherwise).
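The whole construction of this subsection can be sketched in a few lines: build (p̂, ĝ), solve the aggregate Bellman equation by value iteration, and interpolate. The randomly generated problem data and the hard aggregation used below are purely illustrative.

```python
import numpy as np

# Sketch of cost approximation by aggregation.
n, m, nu, alpha = 4, 2, 2, 0.9
rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(n), size=(nu, n))        # p[u, i, j] (made up)
g = rng.uniform(0.0, 1.0, size=(nu, n, n))         # g[u, i, j] (made up)

D = np.array([[0.5, 0.5, 0.0, 0.0],                # disaggregation d_xi
              [0.0, 0.0, 0.5, 0.5]])
Phi = np.array([[1.0, 0.0], [1.0, 0.0],            # aggregation phi_jy
                [0.0, 1.0], [0.0, 1.0]])

# aggregate transition probabilities and expected costs (cf. Fig. 6.5.1)
p_hat = np.stack([D @ p[u] @ Phi for u in range(nu)])
g_hat = np.stack([D @ (p[u] * g[u]).sum(axis=1) for u in range(nu)])

J_hat = np.zeros(m)                                # Bellman equation by VI
for _ in range(500):
    J_hat = np.min(g_hat + alpha * p_hat @ J_hat, axis=0)

J_tilde = Phi @ J_hat      # piecewise constant under hard aggregation
```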
The preceding scheme can also be applied to problems with infinite state space, and is well-suited for approximating the solution of partially observed Markov Decision problems (POMDP), which are defined over their belief space (space of probability distributions over their states, cf. Section 5.4.2 of Vol. I). By discretizing the belief space with a coarse grid, one obtains a finite-state DP problem of perfect state information that can
be solved with the methods of Chapter 1 (see [ZhL97], [ZhH01], [YuB04]). The following example illustrates the main ideas and shows that in the POMDP case, where the optimal cost function is a concave function over the simplex of beliefs (see Vol. I, Section 5.4.2), the approximation obtained is a lower bound of the optimal cost function.
Example 6.5.1 (Coarse Grid/POMDP Discretization and Lower Bound Approximations)

Consider a discounted DP problem of the type discussed in Section 1.2, where the state space is a convex subset C of a Euclidean space. We use z to denote the elements of this space, to distinguish them from x which now denotes aggregate states. Bellman's equation is J = TJ with T defined by

(TJ)(z) = min_{u∈U} E_w{ g(z, u, w) + αJ( f(z, u, w) ) },    ∀ z ∈ C.

Let J* denote the optimal cost function. Suppose we choose a finite subset/coarse grid of states {x_1, . . . , x_m}, which we view as aggregate states with aggregation probabilities φ_{z x_i}, i = 1, . . . , m, for each z. The disaggregation probabilities are d_{x_k i} = 1 for i = x_k, k = 1, . . . , m, and d_{x_k i} = 0 for i ≠ x_k. Consider the mapping
T̂ defined by

(T̂J)(z) = min_{u∈U} E_w{ g(z, u, w) + α Σ_{j=1}^m φ_{f(z,u,w) x_j} J(x_j) },    ∀ z ∈ C,

where φ_{f(z,u,w) x_j} are the aggregation probabilities of the next state f(z, u, w).
We note that T̂ is a contraction mapping with respect to the sup-norm. Let Ĵ denote its unique fixed point, so that we have

Ĵ(x_i) = (T̂Ĵ)(x_i),    i = 1, . . . , m.

This is Bellman's equation for an aggregated finite-state discounted DP problem whose states are x_1, . . . , x_m, and can be solved by standard value and policy iteration methods. We approximate the optimal cost function of the original problem by

J̃(z) = Σ_{i=1}^m φ_{z x_i} Ĵ(x_i),    ∀ z ∈ C.
Suppose now that J* is a concave function over C, so that for all (z, u, w),

J*( f(z, u, w) ) ≥ Σ_{j=1}^m φ_{f(z,u,w) x_j} J*(x_j).

It then follows from the definitions of T and T̂ that

J*(z) = (TJ*)(z) ≥ (T̂J*)(z),    ∀ z ∈ C,
so by iterating, we see that

J*(z) ≥ lim_{k→∞} (T̂^k J*)(z) = Ĵ(z),    ∀ z ∈ C,

where the last equation follows since T̂ is a contraction. For z = x_i, we have in particular

J*(x_i) ≥ Ĵ(x_i),    ∀ i = 1, . . . , m,
from which we obtain

J*(z) ≥ Σ_{i=1}^m φ_{z x_i} J*(x_i) ≥ Σ_{i=1}^m φ_{z x_i} Ĵ(x_i) = J̃(z),    ∀ z ∈ C,

where the first inequality follows from the concavity of J*. Thus the approximation J̃(z) obtained from the aggregate system provides a lower bound to J*(z). Similarly, if J* can be shown to be convex, the preceding argument can be modified to show that J̃(z) is an upper bound to J*(z).
Q-Factor Approximation

We now consider the Q-factors Q̂(x, u), x ∈ 𝒜, u ∈ U, of the aggregate problem. They are the unique solution of the Q-factor equation

Q̂(x, u) = ĝ(x, u) + α Σ_{y∈𝒜} p̂_{xy}(u) min_{v∈U} Q̂(y, v)
         = Σ_{i=1}^n d_{xi} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α Σ_{y∈𝒜} φ_{jy} min_{v∈U} Q̂(y, v) ).    (6.134)
We may apply Q-learning to solve the aggregate problem. In particular, we generate an infinitely long sequence of pairs {(x_k, u_k)} ⊂ 𝒜 × U according to some probabilistic mechanism. For each (x_k, u_k), we generate an original system state i_k according to the disaggregation probabilities d_{x_k i}, and then a successor state j_k according to probabilities p_{i_k j_k}(u_k). We finally generate an aggregate system state y_k using the aggregation probabilities φ_{j_k y}. Then the Q-factor of (x_k, u_k) is updated using a stepsize γ_k > 0 while all other Q-factors are left unchanged [cf. Eqs. (6.101)-(6.103)]:
Q̂_{k+1}(x, u) = (1 − γ_k) Q̂_k(x, u) + γ_k (F_k Q̂_k)(x, u),    ∀ (x, u),    (6.135)

where the vector F_k Q̂_k is defined by

(F_k Q̂_k)(x, u) = { g(i_k, u_k, j_k) + α min_{v∈U} Q̂_k(y_k, v)    if (x, u) = (x_k, u_k),
                    Q̂_k(x, u)                                      if (x, u) ≠ (x_k, u_k).
Note that the probabilistic mechanism by which the pairs {(x_k, u_k)} are generated is arbitrary, as long as all possible pairs are generated infinitely often. In practice, one may wish to use the aggregation and disaggregation probabilities, and the Markov chain transition probabilities in an effort to ensure that "important" state-control pairs are not underrepresented in the simulation.
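A minimal sketch of the aggregate Q-learning iteration (6.135), with uniform sampling of the pairs (x_k, u_k) and a diminishing stepsize (both arbitrary choices, as just noted); the result is compared with the exact solution of (6.134). All problem data are illustrative.

```python
import numpy as np

# Sketch of Q-learning for the aggregate problem, Eq. (6.135): sample a
# pair (x_k, u_k), disaggregate to i_k, simulate a transition to j_k,
# aggregate to y_k, and correct a single Q-factor.
n, m, nu, alpha = 4, 2, 2, 0.5
rng = np.random.default_rng(2)
p = rng.dirichlet(np.ones(n), size=(nu, n))            # p[u, i, j]
g = rng.uniform(0.0, 1.0, size=(nu, n, n))             # g[u, i, j]
D = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
Phi = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

Q = np.zeros((m, nu))
visits = np.zeros((m, nu))
for k in range(100000):
    x, u = rng.integers(m), rng.integers(nu)           # arbitrary pair sampling
    i = rng.choice(n, p=D[x])                          # disaggregation step
    j = rng.choice(n, p=p[u, i])                       # original-system transition
    y = rng.choice(m, p=Phi[j])                        # aggregation step
    visits[x, u] += 1.0
    gamma = 1.0 / visits[x, u]                         # diminishing stepsize
    Q[x, u] += gamma * (g[u, i, j] + alpha * Q[y].min() - Q[x, u])

# exact aggregate Q-factors from Eq. (6.134), by fixed point iteration
p_hat = np.stack([D @ p[u] @ Phi for u in range(nu)])
g_hat = np.stack([D @ (p[u] * g[u]).sum(axis=1) for u in range(nu)])
Q_ex = np.zeros((m, nu))
for _ in range(200):
    Q_ex = (g_hat + alpha * p_hat @ Q_ex.min(axis=1)).T
```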
After solving for the Q-factors Q̂, the Q-factors of the original problem are approximated by

Q̃(j, v) = Σ_{y∈𝒜} φ_{jy} Q̂(y, v),    j = 1, . . . , n,  v ∈ U.    (6.136)

We recognize this as an approximate representation Q̃ of the Q-factors of the original problem in terms of basis functions. There is a basis function for each aggregate state y ∈ 𝒜 (the vector { φ_{jy} | j = 1, . . . , n }), and the corresponding coefficients that weigh the basis functions are the Q-factors of the aggregate problem Q̂(y, v), y ∈ 𝒜, v ∈ U (so we have in effect a lookup table representation with respect to v). The optimal cost-to-go function of the original problem is approximated by

J̃(j) = min_{v∈U} Q̃(j, v),    j = 1, . . . , n,
and the corresponding one-step lookahead suboptimal policy is obtained as

µ̃(i) = arg min_{u∈U} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J̃(j) ),    i = 1, . . . , n.

Note that the preceding minimization requires knowledge of the transition probabilities p_{ij}(u), which is unfortunate since a principal motivation of Q-learning is to deal with model-free situations where the transition probabilities are not explicitly known. The alternative is to obtain a suboptimal control at j by minimizing over v ∈ U the Q-factor Q̃(j, v) given by Eq. (6.136). This is less discriminating in the choice of control (for example in the case of hard aggregation, it applies the same control at all states j that belong to the same aggregate state y).

The preceding analysis highlights the two kinds of approximation that are inherent in the method just described:

(a) The transition probabilities of the original system are modified through the aggregation and disaggregation process.

(b) In calculating Q-factors of the aggregate system via Eq. (6.134), controls are associated with aggregate states rather than original system states.

In the next section, we provide alternative algorithms based on value iteration (rather than Q-learning), where the approximation in (b) above is addressed more effectively.
6.5.2 Approximate Policy and Value Iteration

Let us consider the system consisting of the original states and the aggregate states, with the transition probabilities and the stage costs described earlier (cf. Fig. 6.5.1). We introduce the vectors J̃_0, J̃_1, and R* where:

R*(x) is the optimal cost-to-go from aggregate state x.

J̃_0(i) is the optimal cost-to-go from original system state i that has just been generated from an aggregate state (left side of Fig. 6.5.1).

J̃_1(j) is the optimal cost-to-go from original system state j that has just been generated from an original system state (right side of Fig. 6.5.1).

Note that because of the intermediate transitions to aggregate states, J̃_0 and J̃_1 are different.
These three vectors satisfy the following three Bellman's equations:

R*(x) = Σ_{i=1}^n d_{xi} J̃_0(i),    x ∈ 𝒜,    (6.137)

J̃_0(i) = min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α J̃_1(j) ),    i = 1, . . . , n,    (6.138)

J̃_1(j) = Σ_{y∈𝒜} φ_{jy} R*(y),    j = 1, . . . , n.    (6.139)
By combining these equations, we obtain an equation for R*:

R*(x) = (FR*)(x),    x ∈ 𝒜,

where F is the mapping defined by

(FR)(x) = Σ_{i=1}^n d_{xi} min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α Σ_{y∈𝒜} φ_{jy} R(y) ),    x ∈ 𝒜.    (6.140)
It can be seen that F is a sup-norm contraction mapping and has R* as its unique fixed point. This follows from standard contraction arguments (cf. Prop. 1.2.4) and the fact that d_{xi}, p_{ij}(u), and φ_{jy} are all transition probabilities.†

† A quick proof is to observe that F is the composition

F = DTΦ,

where T is the usual DP mapping, and D and Φ are the matrices with rows the disaggregation and aggregation distributions, respectively. Since T is a contrac-
Once R* is found, the optimal cost-to-go of the original problem may be approximated by J̃_1 = ΦR*, and a suboptimal policy may be found through the minimization defining J̃_0. Again, the optimal cost function approximation J̃_1 is a linear combination of the columns of Φ, which may be viewed as basis functions.

One may use value and policy iteration-type algorithms to find R*.
The value iteration algorithm is to generate successively FR, F²R, . . ., starting with some initial guess R. The policy iteration algorithm starts with a stationary policy µ^0 for the original problem, and given µ^k, it finds R_{µ^k} satisfying R_{µ^k} = F_{µ^k} R_{µ^k}, where F_µ is the mapping defined by
(F_µ R)(x) = Σ_{i=1}^n d_{xi} Σ_{j=1}^n p_{ij}(µ(i)) ( g(i, µ(i), j) + α Σ_{y∈𝒜} φ_{jy} R(y) ),    x ∈ 𝒜,    (6.141)
(this is the policy evaluation step). It then generates µ^{k+1} by

µ^{k+1}(i) = arg min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α Σ_{y∈𝒜} φ_{jy} R_{µ^k}(y) ),    ∀ i,    (6.142)
(this is the policy improvement step). We leave it as an exercise for the reader to show that this policy iteration algorithm converges to the unique fixed point of F in a finite number of iterations. The key fact here is that F and F_µ are not only sup-norm contractions, but also have the monotonicity property of DP mappings (cf. Section 1.1.2 and Lemma 1.1.1), which was used in an essential way in the convergence proof of ordinary policy iteration (cf. Prop. 1.3.4).
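The exact (n-dimensional) algorithms just described can be sketched as follows: value iteration with the mapping F of (6.140), and policy iteration (6.141)-(6.142) in which each policy evaluation R_µ = F_µ R_µ is a linear solve; both should reach the same R*. Problem data are illustrative.

```python
import numpy as np

# Sketch of exact value and policy iteration on the combined system.
n, m, nu, alpha = 4, 2, 2, 0.9
rng = np.random.default_rng(3)
p = rng.dirichlet(np.ones(n), size=(nu, n))            # p[u, i, j]
g = rng.uniform(0.0, 1.0, size=(nu, n, n))             # g[u, i, j]
D = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
Phi = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
gbar = (p * g).sum(axis=2)                             # gbar[u, i]

def F(R):                                              # Eq. (6.140)
    vals = gbar + alpha * p @ (Phi @ R)                # vals[u, i]
    return D @ vals.min(axis=0)

R = np.zeros(m)                                        # R* by value iteration
for _ in range(500):
    R = F(R)

mu = np.zeros(n, dtype=int)                            # policy iteration
for _ in range(10):
    P_mu = np.array([p[mu[i], i] for i in range(n)])
    g_mu = np.array([gbar[mu[i], i] for i in range(n)])
    # policy evaluation: R_mu = F_mu R_mu is a linear equation, cf. (6.141)
    R_mu = np.linalg.solve(np.eye(m) - alpha * D @ P_mu @ Phi, D @ g_mu)
    mu = (gbar + alpha * p @ (Phi @ R_mu)).argmin(axis=0)   # Eq. (6.142)
```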
Generally, in approximate policy iteration, by Prop. 1.3.6, we have an error bound of the form

limsup_{k→∞} ‖J_{µ^k} − J*‖_∞ ≤ 2αδ / (1 − α)²,

where δ satisfies

‖J_k − J_{µ^k}‖_∞ ≤ δ

for all generated policies µ^k and J_k is the approximate cost vector of µ^k that is used for policy improvement (which is ΦR_{µ^k} in the case of aggregation).
tion with respect to the sup-norm ‖·‖_∞, and D and Φ satisfy

‖Dx‖_∞ ≤ ‖x‖_∞,  ∀ x ∈ ℜ^n,        ‖Φy‖_∞ ≤ ‖y‖_∞,  ∀ y ∈ ℜ^s,

it follows that F is a sup-norm contraction.
However, when the policy sequence {µ^k} converges to some µ as it does here, it turns out that the much sharper bound

‖J_µ − J*‖_∞ ≤ 2αδ / (1 − α)    (6.143)

holds. To show this, let J be the cost vector used to evaluate µ (which is ΦR* in the case of aggregation), and note that it satisfies

TJ = T_µ J

since µ^k converges to µ [cf. Eq. (6.142) in the case of aggregation]. We write

TJ_µ ≥ T(J − δe) = TJ − αδe = T_µ J − αδe ≥ T_µ J_µ − 2αδe = J_µ − 2αδe,
where e is the unit vector, from which by repeatedly applying T to both sides, we obtain

J* = lim_{k→∞} T^k J_µ ≥ J_µ − ( 2αδ / (1 − α) ) e,

thereby showing the error bound (6.143).

The preceding error bound improvement suggests that approximate policy iteration based on aggregation may hold some advantage in terms of approximation quality, relative to its projected equation-based counterpart. For a generalization of this idea, see Exercise 6.15. Note, however, that the basis functions in the aggregation approach are restricted by the requirement that the rows of Φ must be probability distributions.
Simulation-Based Policy Iteration

The policy iteration method just described requires n-dimensional calculations, and is impractical when n is large. An alternative, which is consistent with the philosophy of this chapter, is to implement it by simulation, using an LSTD-type method, as we now proceed to describe.

For a given policy µ, the aggregate version of Bellman's equation, R = F_µ R, is linear of the form [cf. Eq. (6.141)]

R = DT_µ(ΦR),

where D and Φ are the matrices with rows the disaggregation and aggregation distributions, respectively, and T_µ is the DP mapping associated with µ, i.e.,

T_µ J = g_µ + αP_µ J,

with P_µ the transition probability matrix corresponding to µ, and g_µ is the vector whose ith component is

Σ_{j=1}^n p_{ij}(µ(i)) g(i, µ(i), j).
We can thus write this equation as
ER = f,
where
E = I −αDPΦ, f = Dg, (6.144)
in analogy with the corresponding matrix and vector for the projected
equation [cf. Eq. (6.34)].
We may use low-dimensional simulation to approximate E and f based on a given number of samples, similar to Section 6.3.3 [cf. Eqs. (6.41) and (6.42)]. In particular, a sample sequence

{ (i_0, j_0), (i_1, j_1), . . . }

is obtained by first generating a sequence of states {i_0, i_1, . . .} by sampling according to a distribution { ξ_i | i = 1, . . . , n } (with ξ_i > 0 for all i), and then by generating for each t the column index j_t using sampling according to the distribution { p_{i_t j} | j = 1, . . . , n }. Given the first k + 1 samples, we form the matrix Ê_k and vector f̂_k given by
Ê_k = I − (α/(k+1)) Σ_{t=0}^k (1/ξ_{i_t}) d(i_t) φ(j_t)′,    f̂_k = (1/(k+1)) Σ_{t=0}^k (1/ξ_{i_t}) d(i_t) g( i_t, µ(i_t), j_t ),    (6.145)

where d(i) is the ith column of D and φ(j)′ is the jth row of Φ. The convergence Ê_k → E and f̂_k → f follows from the expressions
E = I − α Σ_{i=1}^n Σ_{j=1}^n p_{ij}(µ(i)) d(i) φ(j)′,    f = Σ_{i=1}^n Σ_{j=1}^n p_{ij}(µ(i)) d(i) g( i, µ(i), j ),

the relation

lim_{k→∞} ( Σ_{t=0}^k δ(i_t = i, j_t = j) ) / (k + 1) = ξ_i p_{ij},

and law of large numbers arguments (cf. Section 6.3.3).
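A sketch of the estimates (6.145), simplified by drawing the states i_t i.i.d. from ξ rather than as a generated sequence; the estimates are checked against the exact E and f of (6.144). All data are made up and refer to a single fixed policy.

```python
import numpy as np

# Sketch of the simulation-based estimates (6.145) of E = I - alpha D P Phi
# and f = D g for a fixed policy.
n, m, alpha = 4, 2, 0.9
rng = np.random.default_rng(4)
P = rng.dirichlet(np.ones(n), size=n)                  # P[i, j] under the policy
G = rng.uniform(0.0, 1.0, size=(n, n))                 # g(i, mu(i), j)
D = np.array([[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]])
Phi = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
xi = np.full(n, 1.0 / n)                               # any positive distribution

K = 100000
S = np.zeros((m, m))
Sf = np.zeros(m)
for i in rng.choice(n, size=K, p=xi):
    j = rng.choice(n, p=P[i])
    S += np.outer(D[:, i], Phi[j]) / xi[i]             # d(i_t) phi(j_t)' / xi_{i_t}
    Sf += D[:, i] * G[i, j] / xi[i]
E_hat = np.eye(m) - alpha * S / K
f_hat = Sf / K

E_exact = np.eye(m) - alpha * D @ P @ Phi
f_exact = D @ (P * G).sum(axis=1)
```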
It is important to note that the sampling probabilities ξ_i are restricted to be positive, but are otherwise arbitrary and need not depend on the current policy. Moreover, their choice does not affect the obtained approximate solution of the equation ER = f. Because of this possibility, the problem of exploration is less acute in the context of policy iteration when aggregation is used for policy evaluation. This is in contrast with the projected equation approach, where the choice of ξ_i affects the projection norm and the solution of the projected equation, as well as the contraction properties of the mapping ΠT.

Note also that instead of using the probabilities ξ_i to sample original system states, we may alternatively sample the aggregate states x according to a distribution { ζ_x | x ∈ 𝒜 }, generate a sequence of aggregate states
{x_0, x_1, . . .}, and then generate a state sequence {i_0, i_1, . . .} using the disaggregation probabilities. In this case ξ_i = Σ_{x∈𝒜} ζ_x d_{xi} and the equations (6.145) should be modified as follows:

Ê_k = I − (α/(k+1)) Σ_{t=0}^k ( m / d_{x_t i_t} ) d(i_t) φ(j_t)′,    f̂_k = (1/(k+1)) Σ_{t=0}^k ( m / d_{x_t i_t} ) d(i_t) g( i_t, µ(i_t), j_t ),

where m is the number of aggregate states.
The corresponding LSTD(0) method generates R̂_k = Ê_k^{−1} f̂_k, and approximates the cost vector of µ by the vector ΦR̂_k:

J̃_µ = ΦR̂_k.
There is also a regression-based version that is suitable for the case where Ê_k is nearly singular (cf. Section 6.3.4), as well as an iterative regression-based version of LSTD, which may be viewed as a special case of (scaled) LSPE. The latter method takes the form

R̂_{k+1} = (Ê_k′ Σ_k^{−1} Ê_k + βI)^{−1} (Ê_k′ Σ_k^{−1} f̂_k + βR̂_k),   (6.146)

where β > 0 and Σ_k is a positive definite symmetric matrix [cf. Eq. (6.61)].
Note that contrary to the projected equation case, for a discount factor α ≈ 1, Ê_k will always be nearly singular [since DPΦ is a transition probability matrix, cf. Eq. (6.144)]. Of course, Ê_k will also be nearly singular when the rows of D and/or the columns of Φ are nearly dependent. The iteration (6.146) actually makes sense even if Ê_k is singular.
The nonoptimistic version of this aggregation-based policy iteration method does not exhibit the oscillatory behavior of the one based on the projected equation approach (cf. Section 6.3.6): the generated policies converge, and the limit policy satisfies the sharper error bound (6.143), as noted earlier. Moreover, optimistic versions of the method also do not exhibit the chattering phenomenon described in Section 6.3.6. This is similar to optimistic policy iteration for the case of a lookup table representation of the cost of the current policy: we are essentially dealing with a lookup table representation of the cost of the aggregate system of Fig. 6.5.1.

The preceding arguments indicate that aggregation-based policy iteration holds an advantage over its projected equation-based counterpart in terms of regularity of behavior, error guarantees, and exploration-related difficulties. Its limitation is that the basis functions in the aggregation approach are restricted by the requirement that the rows of Φ must be probability distributions. For example, in the case of a single basis function (s = 1), there is only one possible choice for Φ in the aggregation context, namely the matrix whose single column is the unit vector.
Simulation-Based Value Iteration

The value iteration algorithm also admits a simulation-based implementation, which is similar to the Q-learning algorithms of Section 6.4. It generates a sequence of aggregate states {x_0, x_1, . . .} by some probabilistic mechanism, which ensures that all aggregate states are generated infinitely often. Given each x_k, it independently generates an original system state i_k according to the probabilities d_{x_k i}, and updates R(x_k) according to
R_{k+1}(x_k) = (1 − γ_k) R_k(x_k) + γ_k min_{u∈U(i_k)} Σ_{j=1}^n p_{i_k j}(u) ( g(i_k, u, j) + α Σ_{y∈𝒜} φ_{jy} R_k(y) ),   (6.147)
where γ_k is a diminishing positive stepsize, and leaves all the other components of R unchanged:

R_{k+1}(x) = R_k(x),   if x ≠ x_k.

This algorithm can be viewed as an asynchronous stochastic approximation version of value iteration. Its convergence mechanism and justification are very similar to the ones given for Q-learning in Section 6.4.1 [cf. Eqs. (6.111) and (6.112)]. Note that contrary to the aggregation-based Q-learning algorithm (6.135), the iteration (6.147) involves the calculation of an expected value at every iteration.
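A minimal simulation sketch of the update (6.147) follows. The randomly generated two-control problem, the cyclic choice of aggregate states, and the stepsize rule are illustrative assumptions, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, nu, alpha = 6, 2, 2, 0.9          # states, aggregate states, controls, discount
P = rng.random((nu, n, n)); P /= P.sum(axis=2, keepdims=True)   # P[u, i, j]
G = rng.random((nu, n, n))                                       # G[u, i, j] = g(i, u, j)

groups = np.array([0, 0, 0, 1, 1, 1])   # hard aggregation
D = np.zeros((m, n))
for x in range(m):
    D[x, groups == x] = 1.0 / np.sum(groups == x)
Phi = np.zeros((n, m)); Phi[np.arange(n), groups] = 1.0

R = np.zeros(m)
for k in range(50_000):
    x = k % m                            # every aggregate state generated infinitely often
    i = rng.choice(n, p=D[x])            # disaggregation step
    gamma = 10.0 / (10.0 + k)            # diminishing positive stepsize
    target = min((P[u, i] * (G[u, i] + alpha * (Phi @ R))).sum() for u in range(nu))
    R[x] = (1 - gamma) * R[x] + gamma * target   # Eq. (6.147)
print(R)
```

The iterates approach the fixed point of the aggregate Bellman equation, which can be checked against the deterministic fixed point iteration R ← D H(ΦR), with H the one-step minimization above.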
Multistep Aggregation

The aggregation methodology of this section can be generalized by considering a multistep aggregation-based dynamical system. This system, illustrated in Fig. 6.5.2, is specified by disaggregation and aggregation probabilities as before, but involves k > 1 transitions between original system states in between transitions from and to aggregate states.
Figure 6.5.2. The transition mechanism for multistep aggregation. It is based on
a dynamical system involving aggregate states, and k transitions between original
system states in between transitions from and to aggregate states.
We introduce vectors J̃_0, J̃_1, . . . , J̃_k, and R*, where:

R*(x) is the optimal cost-to-go from aggregate state x.

J̃_0(i) is the optimal cost-to-go from original system state i that has just been generated from an aggregate state (left side of Fig. 6.5.2).

J̃_1(j_1) is the optimal cost-to-go from original system state j_1 that has just been generated from an original system state i.

J̃_m(j_m), m = 2, . . . , k, is the optimal cost-to-go from original system state j_m that has just been generated from an original system state j_{m−1}.
These vectors satisfy the following Bellman's equations:

R*(x) = Σ_{i=1}^n d_{xi} J̃_0(i),   x ∈ 𝒜,

J̃_0(i) = min_{u∈U(i)} Σ_{j_1=1}^n p_{ij_1}(u) ( g(i, u, j_1) + α J̃_1(j_1) ),   i = 1, . . . , n,   (6.148)

J̃_m(j_m) = min_{u∈U(j_m)} Σ_{j_{m+1}=1}^n p_{j_m j_{m+1}}(u) ( g(j_m, u, j_{m+1}) + α J̃_{m+1}(j_{m+1}) ),   j_m = 1, . . . , n,  m = 1, . . . , k − 1,   (6.149)

J̃_k(j_k) = Σ_{y∈𝒜} φ_{j_k y} R*(y),   j_k = 1, . . . , n.   (6.150)
By combining these equations, we obtain an equation for R*:

R*(x) = (FR*)(x),   x ∈ 𝒜,

where F is the mapping defined by

FR = DT^k(ΦR),

where T is the usual DP mapping of the problem. As earlier, it can be seen that F is a sup-norm contraction, but its contraction modulus is α^k rather than α.
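Since F is a sup-norm contraction, R* can be computed by simple fixed point iteration on F; the sketch below also checks the modulus α^k empirically. All problem data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, nu, alpha, ksteps = 6, 2, 2, 0.9, 3
P = rng.random((nu, n, n)); P /= P.sum(axis=2, keepdims=True)
G = rng.random((nu, n, n))
groups = np.array([0, 0, 0, 1, 1, 1])
D = np.zeros((m, n))
for x in range(m):
    D[x, groups == x] = 1.0 / np.sum(groups == x)
Phi = np.zeros((n, m)); Phi[np.arange(n), groups] = 1.0

def T(J):
    # usual DP mapping: (TJ)(i) = min_u sum_j p_ij(u) (g(i,u,j) + alpha J(j))
    return np.min([(P[u] * (G[u] + alpha * J)).sum(axis=1) for u in range(nu)], axis=0)

def F(R):
    # multistep aggregation mapping: FR = D T^k(Phi R)
    J = Phi @ R
    for _ in range(ksteps):
        J = T(J)
    return D @ J

R = np.zeros(m)
for _ in range(200):
    R = F(R)                  # converges since F is a sup-norm contraction

# empirical check of the contraction modulus alpha^k
R1, R2 = rng.random(m), rng.random(m)
print(np.abs(F(R1) - F(R2)).max() <= alpha**ksteps * np.abs(R1 - R2).max() + 1e-12)
```

The check reflects that each application of T contracts by α in the sup norm, while multiplication by Φ and by the averaging matrix D is nonexpansive.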
There is a similar mapping corresponding to a fixed policy, and it can be used to implement a policy iteration algorithm, which evaluates a policy through calculation of a corresponding vector R and then improves it. However, there is a major difference from the single-step aggregation case: a policy involves a set of k control functions {µ_0, . . . , µ_{k−1}}, and while a known policy can be easily simulated, its improvement involves multistep lookahead using the minimizations of Eqs. (6.148)-(6.150), and may be costly. Thus multistep aggregation is a useful idea only for problems where the cost of this multistep lookahead minimization (for a single given starting state) is not prohibitive. On the other hand, note that from the theoretical point of view, a multistep scheme provides a means of better approximation of the true optimal cost vector J*, independent of the use of a large number of aggregate states. This can be seen from Eqs. (6.148)-(6.150), which by classical value iteration convergence results, show that J̃_0(i) → J*(i) as k → ∞, regardless of the choice of aggregate states.
Asynchronous Multi-Agent Aggregation

Let us now discuss the distributed solution of large-scale discounted DP problems using cost function approximation based on hard aggregation. We partition the original system states into aggregate states/subsets x ∈ 𝒜 = {x_1, . . . , x_m}, and we envision a network of agents, each updating asynchronously a detailed/exact local cost function, defined on a single aggregate state/subset. Each agent also maintains an aggregate cost for its aggregate state, which is the weighted average detailed cost, weighted by the corresponding disaggregation probabilities. These aggregate costs are communicated between agents and are used to perform the local updates.
In a synchronous value iteration method of this type, each agent a = 1, . . . , m, maintains/updates a (local) cost J(i) for every state i ∈ x_a, and an aggregate cost

R(a) = Σ_{i∈x_a} d_{x_a i} J(i),

with d_{x_a i} being the corresponding disaggregation probabilities. We generically denote by J and R the vectors with components J(i), i = 1, . . . , n, and R(a), a = 1, . . . , m, respectively. These components are updated according to

J_{k+1}(i) = min_{u∈U(i)} H_a(i, u, J_k, R_k),   ∀ i ∈ x_a,   (6.151)

with

R_k(a) = Σ_{i∈x_a} d_{x_a i} J_k(i),   ∀ a = 1, . . . , m,   (6.152)
where the mapping H_a is defined for all a = 1, . . . , m, i ∈ x_a, u ∈ U(i), and J ∈ ℜ^n, R ∈ ℜ^m, by

H_a(i, u, J, R) = Σ_{j=1}^n p_{ij}(u) g(i, u, j) + α Σ_{j∈x_a} p_{ij}(u) J(j) + α Σ_{j∉x_a} p_{ij}(u) R(x(j)),   (6.153)

and where for each original system state j, we denote by x(j) the subset to which j belongs [i.e., j ∈ x(j)]. Thus the iteration (6.151) is the same as ordinary value iteration, except that the aggregate costs R(x(j)) are used for states j whose costs are updated by other agents.
It is possible to show that the iteration (6.151)-(6.152) involves a sup-norm contraction mapping of modulus α, so it converges to the unique solution of the system of equations in (J, R)

J(i) = min_{u∈U(i)} H_a(i, u, J, R),   R(a) = Σ_{i∈x_a} d_{x_a i} J(i),   ∀ i ∈ x_a,  a = 1, . . . , m.   (6.154)

This follows from the fact that {d_{x_a i} | i = 1, . . . , n} is a probability distribution. We may view the equations (6.154) as a set of Bellman equations for an "aggregate" DP problem, which, similar to our earlier discussion, involves both the original and the aggregate system states. The difference from the Bellman equations (6.137)-(6.139) is that the mapping (6.153) involves J(j) rather than R(x(j)) for j ∈ x_a.
In the algorithm (6.151)-(6.152), all agents a must update their local costs J(i) and aggregate costs R(a) synchronously, and communicate the aggregate costs to the other agents before a new iteration may begin. This is often impractical and time-wasting. In a more practical asynchronous version of the method, the aggregate costs R(a) may be outdated to account for communication "delays" between agents. Moreover, the costs J(i) need not be updated for all i; it is sufficient that they are updated by each agent a only for a (possibly empty) subset I_{a,k} ⊂ x_a. In this case, the iteration (6.151)-(6.152) is modified to take the form

J_{k+1}(i) = min_{u∈U(i)} H_a ( i, u, J_k, R_{τ_{1,k}}(1), . . . , R_{τ_{m,k}}(m) ),   ∀ i ∈ I_{a,k},   (6.155)

with 0 ≤ τ_{a,k} ≤ k for a = 1, . . . , m, and

R_k(a) = Σ_{i∈x_a} d_{x_a i} J_k(i),   ∀ a = 1, . . . , m.   (6.156)
The differences k − τ_{a,k}, a = 1, . . . , m, in Eq. (6.155) may be viewed as "delays" between the current time k and the times τ_{a,k} when the corresponding aggregate costs were computed at other processors. For convergence, it is of course essential that every i ∈ x_a belongs to I_{a,k} for infinitely many k (so each cost component is updated infinitely often), and lim_{k→∞} τ_{a,k} = ∞ for all a = 1, . . . , m (so that agents eventually communicate more recently computed aggregate costs to other agents).
We finally note that there are Q-learning versions of the algorithm (6.155)-(6.156), whose convergence analysis is very similar to the one for standard Q-learning (see [Tsi84], which also allows asynchronous distributed computation with communication "delays").
6.6 STOCHASTIC SHORTEST PATH PROBLEMS

In this section we consider policy evaluation for finite-state stochastic shortest path (SSP) problems (cf. Chapter 2). We assume that there is no discounting (α = 1), and that the states are 0, 1, . . . , n, where state 0 is a special cost-free termination state. We focus on a fixed proper policy µ, under which all the states 1, . . . , n are transient.
There are natural extensions of the LSTD(λ) and LSPE(λ) algorithms. We introduce a linear approximation architecture of the form

J̃(i, r) = φ(i)′r,   i = 0, 1, . . . , n,

and the subspace

S = {Φr | r ∈ ℜ^s},

where, as in Section 6.3, Φ is the n × s matrix whose rows are φ(i)′, i = 1, . . . , n. We assume that Φ has rank s. Also, for notational convenience in the subsequent formulas, we define φ(0) = 0.
The algorithms use a sequence of simulated trajectories, each of the form (i_0, i_1, . . . , i_N), where i_N = 0, and i_t ≠ 0 for t < N. Once a trajectory is completed, an initial state i_0 for the next trajectory is chosen according to a fixed probability distribution q_0 = ( q_0(1), . . . , q_0(n) ), where

q_0(i) = P(i_0 = i),   i = 1, . . . , n,   (6.157)

and the process is repeated.
For a trajectory i_0, i_1, . . . of the SSP problem, consider the probabilities

q_t(i) = P(i_t = i),   i = 1, . . . , n,  t = 0, 1, . . .

Note that q_t(i) diminishes to 0 as t → ∞ at the rate of a geometric progression (cf. Section 2.1), so the limits

q(i) = Σ_{t=0}^∞ q_t(i),   i = 1, . . . , n,

are finite. Let q be the vector with components q(1), . . . , q(n). We assume that the q_0(i) are chosen so that q(i) > 0 for all i [a stronger assumption is that q_0(i) > 0 for all i]. We introduce the norm

‖J‖_q = √( Σ_{i=1}^n q(i) (J(i))² ),
and we denote by Π the projection onto the subspace S with respect to this norm. In the context of the SSP problem, the projection norm ‖·‖_q plays a role similar to the one played by the steady-state distribution norm ‖·‖_ξ for discounted problems (cf. Section 6.3).
Let P be the n × n matrix with components p_{ij}, i, j = 1, . . . , n. Consider also the mapping T : ℜ^n → ℜ^n given by

TJ = g + PJ,

where g is the vector with components Σ_{j=0}^n p_{ij} g(i, j), i = 1, . . . , n. For λ ∈ [0, 1), define the mapping

T^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t T^{t+1}

[cf. Eq. (6.71)]. Similar to Section 6.3, we have

T^(λ)J = P^(λ)J + (I − λP)^{−1}g,

where

P^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t P^{t+1}   (6.158)

[cf. Eq. (6.72)].
We will now show that ΠT^(λ) is a contraction, so that it has a unique fixed point.

Proposition 6.6.1: For all λ ∈ [0, 1), ΠT^(λ) is a contraction with respect to some norm.
Proof: Let λ > 0. We will show that T^(λ) is a contraction with respect to the projection norm ‖·‖_q, so the same is true for ΠT^(λ), since Π is nonexpansive. Let us first note that with an argument like the one in the proof of Lemma 6.3.1, we have

‖PJ‖_q ≤ ‖J‖_q,   J ∈ ℜ^n.

Indeed, we have q = Σ_{t=0}^∞ q_t and q′_{t+1} = q′_t P, so

q′P = Σ_{t=0}^∞ q′_t P = Σ_{t=1}^∞ q′_t = q′ − q′_0,

or

Σ_{i=1}^n q(i) p_{ij} = q(j) − q_0(j),   ∀ j.
Using this relation, we have for all J ∈ ℜ^n,

‖PJ‖²_q = Σ_{i=1}^n q(i) ( Σ_{j=1}^n p_{ij} J(j) )² ≤ Σ_{i=1}^n q(i) Σ_{j=1}^n p_{ij} J(j)² = Σ_{j=1}^n J(j)² Σ_{i=1}^n q(i) p_{ij} = Σ_{j=1}^n ( q(j) − q_0(j) ) J(j)² ≤ ‖J‖²_q.   (6.159)
(6.159)
From the relation PJ
q
≤ J
q
it follows that
P
t
J
q
≤ J
q
, J ∈ ℜ
n
, t = 0, 1, . . .
Thus, by using the deﬁnition (6.158) of P
(λ)
, we also have
P
(λ)
J
q
≤ J
q
, J ∈ ℜ
n
.
Since lim
t→∞
P
t
J = 0 for any J ∈ ℜ
n
, it follows that P
t
J
q
< J
q
for
all J ,= 0 and t suﬃciently large. Therefore,
P
(λ)
J
q
< J
q
, for all J ,= 0. (6.160)
We now define

β = max{ ‖P^(λ) J‖_q | ‖J‖_q = 1 },

and note that since the maximum in the definition of β is attained by the Weierstrass Theorem (a continuous function attains a maximum over a compact set), we have β < 1 in view of Eq. (6.160). Since

‖P^(λ) J‖_q ≤ β ‖J‖_q,   J ∈ ℜ^n,

it follows that P^(λ) is a contraction of modulus β with respect to ‖·‖_q.
Let λ = 0. We use a different argument because T is not necessarily a contraction with respect to ‖·‖_q. [An example is given following Prop. 6.8.1. Note also that if q_0(i) > 0 for all i, from the calculation of Eq. (6.159) it follows that P, and hence T, is a contraction with respect to ‖·‖_q.] We show that ΠT is a contraction with respect to a different norm
by showing that the eigenvalues of ΠP lie strictly within the unit circle.† Indeed, with an argument like the one used to prove Lemma 6.3.1, we have ‖PJ‖_q ≤ ‖J‖_q for all J, which implies that ‖ΠPJ‖_q ≤ ‖J‖_q, so the eigenvalues of ΠP cannot be outside the unit circle. Assume to arrive at a contradiction that ν is an eigenvalue of ΠP with |ν| = 1, and let ζ be a corresponding eigenvector. We claim that Pζ must have both real and imaginary components in the subspace S. If this were not so, we would have Pζ ≠ ΠPζ, so that

‖Pζ‖_q > ‖ΠPζ‖_q = ‖νζ‖_q = |ν| ‖ζ‖_q = ‖ζ‖_q,

which contradicts the fact ‖PJ‖_q ≤ ‖J‖_q for all J. Thus, the real and imaginary components of Pζ are in S, which implies that Pζ = ΠPζ = νζ, so that ν is an eigenvalue of P. This is a contradiction because |ν| = 1, while the eigenvalues of P are strictly within the unit circle, since the policy being evaluated is proper. Q.E.D.
The preceding proof has shown that ΠT^(λ) is a contraction with respect to ‖·‖_q when λ > 0. As a result, similar to Prop. 6.3.5, we can obtain the error bound

‖J_µ − Φr*_λ‖_q ≤ (1/√(1 − α²_λ)) ‖J_µ − ΠJ_µ‖_q,   λ > 0,

where Φr*_λ and α_λ are the fixed point and contraction modulus of ΠT^(λ), respectively. When λ = 0, we have

‖J_µ − Φr*_0‖ ≤ ‖J_µ − ΠJ_µ‖ + ‖ΠJ_µ − Φr*_0‖
= ‖J_µ − ΠJ_µ‖ + ‖ΠTJ_µ − ΠT(Φr*_0)‖
≤ ‖J_µ − ΠJ_µ‖ + α_0 ‖J_µ − Φr*_0‖,

where ‖·‖ is the norm with respect to which ΠT is a contraction (cf. Prop. 6.7.1), and Φr*_0 and α_0 are the fixed point and contraction modulus of ΠT. We thus have the error bound

‖J_µ − Φr*_0‖ ≤ (1/(1 − α_0)) ‖J_µ − ΠJ_µ‖.
† We use here the fact that if a square matrix has eigenvalues strictly within the unit circle, then there exists a norm with respect to which the linear mapping defined by the matrix is a contraction. Also, in the following argument, the projection Πz of a complex vector z is obtained by separately projecting the real and the imaginary components of z on S. The projection norm for a complex vector x + iy is defined by

‖x + iy‖_q = √( ‖x‖²_q + ‖y‖²_q ).
Similar to the discounted problem case, the projected equation can be written as a linear equation of the form Cr = d. The corresponding LSTD and LSPE algorithms use simulation-based approximations C_k and d_k. This simulation generates a sequence of trajectories of the form (i_0, i_1, . . . , i_N), where i_N = 0, and i_t ≠ 0 for t < N. Once a trajectory is completed, an initial state i_0 for the next trajectory is chosen according to a fixed probability distribution q_0 = ( q_0(1), . . . , q_0(n) ). The LSTD method approximates the solution C^{−1}d of the projected equation by C_k^{−1} d_k, where C_k and d_k are simulation-based approximations to C and d, respectively. The LSPE algorithm and its scaled versions are defined by

r_{k+1} = r_k − γG_k(C_k r_k − d_k),

where γ is a sufficiently small stepsize and G_k is a scaling matrix. The derivation of the detailed equations is straightforward but somewhat tedious, and will not be given (see also the discussion in Section 6.8).
6.7 AVERAGE COST PROBLEMS

In this section we consider average cost problems and related approximations: policy evaluation algorithms such as LSTD(λ) and LSPE(λ), approximate policy iteration, and Q-learning. We assume throughout the finite state model of Section 4.1, with the optimal average cost being the same for all initial states (cf. Section 4.2).

6.7.1 Approximate Policy Evaluation

Let us consider the problem of approximate evaluation of a stationary policy µ. As in the discounted case (Section 6.3), we consider a stationary finite-state Markov chain with states i = 1, . . . , n, transition probabilities p_{ij}, i, j = 1, . . . , n, and stage costs g(i, j). We assume that the states form a single recurrent class. An equivalent way to express this assumption is the following.
Assumption 6.6.1: The Markov chain has a steady-state probability vector ξ = (ξ_1, . . . , ξ_n) with positive components, i.e., for all i = 1, . . . , n,

lim_{N→∞} (1/N) Σ_{k=1}^N P(i_k = j | i_0 = i) = ξ_j > 0,   j = 1, . . . , n.
From Section 4.2, we know that under Assumption 6.6.1, the average cost, denoted by η, is independent of the initial state:

η = lim_{N→∞} (1/N) E{ Σ_{k=0}^{N−1} g(x_k, x_{k+1}) | x_0 = i },   i = 1, . . . , n,   (6.161)

and satisfies

η = ξ′g,

where g is the vector whose ith component is the expected stage cost Σ_{j=1}^n p_{ij} g(i, j). (In Chapter 4 we denoted the average cost by λ, but in the present chapter, with apologies to the readers, we reserve λ for use in the TD, LSPE, and LSTD algorithms, hence the change in notation.) Together with a differential cost vector h = ( h(1), . . . , h(n) )′, the average cost η satisfies Bellman's equation

h(i) = Σ_{j=1}^n p_{ij} ( g(i, j) − η + h(j) ),   i = 1, . . . , n.

The solution is unique up to a constant shift for the components of h, and can be made unique by eliminating one degree of freedom, such as fixing the differential cost of a single state to 0 (cf. Prop. 4.2.4).
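For a small chain, η and h can be computed directly from these relations. The following sketch uses a randomly generated chain and the normalization h(n) = 0; both choices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # positive, hence irreducible
G = rng.random((n, n))                                       # stage costs g(i, j)
g = (P * G).sum(axis=1)                                      # expected stage cost vector

# steady-state distribution: xi' P = xi', sum_i xi_i = 1
A = np.vstack([P.T - np.eye(n), np.ones(n)])
xi = np.linalg.lstsq(A, np.r_[np.zeros(n), 1.0], rcond=None)[0]

eta = xi @ g                       # average cost, eta = xi' g

# Bellman's equation (I - P) h = g - eta e; singular but consistent,
# so take a particular solution and shift it so that h(n) = 0
h = np.linalg.lstsq(np.eye(n) - P, g - eta, rcond=None)[0]
h -= h[-1]

print(np.allclose(h, g - eta + P @ h))   # h satisfies Bellman's equation
```

The shift at the end implements the normalization mentioned in the text: fixing the differential cost of one state to 0 removes the remaining degree of freedom.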
We consider a linear architecture for the differential costs of the form

h̃(i, r) = φ(i)′r,   i = 1, . . . , n,

where r ∈ ℜ^s is a parameter vector and φ(i) is a feature vector associated with state i. These feature vectors define the subspace

S = {Φr | r ∈ ℜ^s},

where, as in Section 6.3, Φ is the n × s matrix whose rows are φ(i)′, i = 1, . . . , n. We will thus aim to approximate h by a vector in S, similar to Section 6.3, which dealt with cost approximation in the discounted case.

We introduce the mapping F : ℜ^n → ℜ^n defined by

FJ = g − ηe + PJ,

where P is the transition probability matrix and e is the unit vector. Note that the definition of F uses the exact average cost η, as given by Eq. (6.161). With this notation, Bellman's equation becomes

h = Fh,

so if we know η, we can try to find or approximate a fixed point of F.
Similar to Section 6.3, we introduce the projected equation

Φr = ΠF(Φr),

where Π is projection on the subspace S with respect to the norm ‖·‖_ξ. An important issue is whether ΠF is a contraction. For this it is necessary to make the following assumption.

Assumption 6.6.2: The columns of the matrix Φ together with the unit vector e = (1, . . . , 1)′ form a linearly independent set of vectors.

Note the difference with the corresponding Assumption 6.3.2 for the discounted case in Section 6.3. Here, in addition to Φ having rank s, we require that e does not belong to the subspace S. To get a sense why this is needed, observe that if e ∈ S, then ΠF cannot be a contraction, since any scalar multiple of e, when added to a fixed point of ΠF, would also be a fixed point.
We also consider multistep versions of the projected equation, of the form

Φr = ΠF^(λ)(Φr),   (6.162)

where

F^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t F^{t+1}.

In matrix notation, the mapping F^(λ) can be written as

F^(λ)J = (1 − λ) Σ_{t=0}^∞ λ^t P^{t+1} J + Σ_{t=0}^∞ λ^t P^t (g − ηe),

or more compactly as

F^(λ)J = P^(λ)J + (I − λP)^{−1}(g − ηe),   (6.163)

where the matrix P^(λ) is defined by

P^(λ) = (1 − λ) Σ_{t=0}^∞ λ^t P^{t+1}   (6.164)

[cf. Eq. (6.72)]. Note that for λ = 0, we have F^(0) = F and P^(0) = P.
We wish to delineate conditions under which the mapping ΠF^(λ) is a contraction. The following proposition relates to the composition of general linear mappings with Euclidean projections, and captures the essence of our analysis.
Proposition 6.7.1: Let S be a subspace of ℜ^n and let L : ℜ^n → ℜ^n be a linear mapping,

L(x) = Ax + b,

where A is an n × n matrix and b is a vector in ℜ^n. Let ‖·‖ be a weighted Euclidean norm with respect to which L is nonexpansive, and let Π denote projection onto S with respect to that norm.

(a) ΠL has a unique fixed point if and only if either 1 is not an eigenvalue of A, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to S.

(b) If ΠL has a unique fixed point, then for all γ ∈ (0, 1), the mapping

H_γ = (1 − γ)I + γΠL

is a contraction, i.e., for some scalar ρ_γ ∈ (0, 1), we have

‖H_γ x − H_γ y‖ ≤ ρ_γ ‖x − y‖,   ∀ x, y ∈ ℜ^n.
Proof: (a) Assume that ΠL has a unique fixed point, or equivalently (in view of the linearity of L) that 0 is the unique fixed point of ΠA. If 1 is an eigenvalue of A with a corresponding eigenvector z that belongs to S, then Az = z and ΠAz = Πz = z. Thus, z is a fixed point of ΠA with z ≠ 0, a contradiction. Hence, either 1 is not an eigenvalue of A, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to S.

Conversely, assume that either 1 is not an eigenvalue of A, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to S. We will show that the mapping Π(I − A) is one-to-one from S to S, and hence the fixed point of ΠL is the unique vector x* ∈ S satisfying Π(I − A)x* = Πb. Indeed, assume the contrary, i.e., that Π(I − A) has a nontrivial nullspace in S, so that some z ∈ S with z ≠ 0 is a fixed point of ΠA. Then, either Az = z (which is impossible, since then 1 is an eigenvalue of A, and z is a corresponding eigenvector that belongs to S), or Az ≠ z, in which case Az differs from its projection ΠAz and

‖z‖ = ‖ΠAz‖ < ‖Az‖ ≤ ‖A‖ ‖z‖,

so that 1 < ‖A‖ (which is impossible since L is nonexpansive, and therefore ‖A‖ ≤ 1), thereby arriving at a contradiction.
(b) If z ∈ ℜ^n with z ≠ 0 and z ≠ aΠAz for all a ≥ 0, we have

‖(1 − γ)z + γΠAz‖ < (1 − γ)‖z‖ + γ‖ΠAz‖ ≤ (1 − γ)‖z‖ + γ‖z‖ = ‖z‖,   (6.165)

where the strict inequality follows from the strict convexity of the norm, and the weak inequality follows from the nonexpansiveness of ΠA. If on the other hand z ≠ 0 and z = aΠAz for some a ≥ 0, we have ‖(1 − γ)z + γΠAz‖ < ‖z‖ because then ΠL has a unique fixed point so a ≠ 1, and ΠA is nonexpansive so a < 1. If we define

ρ_γ = sup{ ‖(1 − γ)z + γΠAz‖ | ‖z‖ ≤ 1 },

and note that the supremum above is attained by the Weierstrass Theorem (a continuous function attains a maximum over a compact set), we see that Eq. (6.165) yields ρ_γ < 1 and

‖(1 − γ)z + γΠAz‖ ≤ ρ_γ ‖z‖,   z ∈ ℜ^n.

By letting z = x − y, with x, y ∈ ℜ^n, and by using the definition of H_γ, we have

H_γ x − H_γ y = H_γ(x − y) = (1 − γ)(x − y) + γΠA(x − y) = (1 − γ)z + γΠAz,

so by combining the preceding two relations, we obtain

‖H_γ x − H_γ y‖ ≤ ρ_γ ‖x − y‖,   x, y ∈ ℜ^n.

Q.E.D.
We can now derive the conditions under which the mapping underlying the LSPE iteration is a contraction with respect to ‖·‖_ξ.
Proposition 6.7.2: The mapping

F_{γ,λ} = (1 − γ)I + γΠF^(λ)

is a contraction with respect to ‖·‖_ξ if one of the following is true:

(i) λ ∈ (0, 1) and γ ∈ (0, 1],

(ii) λ = 0 and γ ∈ (0, 1).
Proof: Consider first the case γ = 1 and λ ∈ (0, 1). Then F^(λ) is a linear mapping involving the matrix P^(λ). Since 0 < λ and all states form a single recurrent class, all entries of P^(λ) are positive. Thus P^(λ) can be expressed as a convex combination

P^(λ) = (1 − β)I + β P̄

for some β ∈ (0, 1), where P̄ is a stochastic matrix with positive entries. We make the following observations:
(i) P̄ corresponds to a nonexpansive mapping with respect to the norm ‖·‖_ξ. The reason is that the steady-state distribution of P̄ is ξ [as can be seen by multiplying the relation P^(λ) = (1 − β)I + βP̄ with ξ′, and by using the relation ξ′ = ξ′P^(λ) to verify that ξ′ = ξ′P̄]. Thus, we have ‖P̄z‖_ξ ≤ ‖z‖_ξ for all z ∈ ℜ^n (cf. Lemma 6.3.1), implying that P̄ has the nonexpansiveness property mentioned.

(ii) Since P̄ has positive entries, the states of the Markov chain corresponding to P̄ form a single recurrent class. If z is an eigenvector of P̄ corresponding to the eigenvalue 1, we have z = P̄^k z for all k ≥ 0, so z = P̄*z, where

P̄* = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P̄^k

(cf. Prop. 4.1.2). The rows of P̄* are all equal to ξ′ since the steady-state distribution of P̄ is ξ, so the equation z = P̄*z implies that z is a nonzero multiple of e. Using Assumption 6.6.2, it follows that z does not belong to the subspace S, and from Prop. 6.7.1 (with P̄ in place of A, and β in place of γ), we see that ΠP^(λ) is a contraction with respect to the norm ‖·‖_ξ. This implies that ΠF^(λ) is also a contraction.
Consider next the case γ ∈ (0, 1) and λ ∈ (0, 1). Since ΠF^(λ) is a contraction with respect to ‖·‖_ξ, as just shown, we have for any J, J̄ ∈ ℜ^n,

‖F_{γ,λ}J − F_{γ,λ}J̄‖_ξ ≤ (1 − γ)‖J − J̄‖_ξ + γ‖ΠF^(λ)J − ΠF^(λ)J̄‖_ξ ≤ (1 − γ + γβ)‖J − J̄‖_ξ,

where β is the contraction modulus of ΠF^(λ). Hence, F_{γ,λ} is a contraction.
Finally, consider the case γ ∈ (0, 1) and λ = 0. We will show that the mapping ΠF has a unique fixed point, by showing that either 1 is not an eigenvalue of P, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to S [cf. Prop. 6.7.1(a)]. Assume the contrary, i.e., that some z ∈ S with z ≠ 0 is an eigenvector corresponding to 1. We then have z = Pz. From this it follows that z = P^k z for all k ≥ 0, so z = P*z, where

P* = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P^k

(cf. Prop. 4.1.2). The rows of P* are all equal to ξ′, so the equation z = P*z implies that z is a nonzero multiple of e. Hence, by Assumption 6.6.2, z cannot belong to S, a contradiction. Thus ΠF has a unique fixed point, and the contraction property of F_{γ,λ} for γ ∈ (0, 1) and λ = 0 follows from Prop. 6.7.1(b). Q.E.D.
Error Estimate

We have shown that for each λ ∈ [0, 1), there is a vector Φr*_λ, the unique fixed point of ΠF_{γ,λ}, γ ∈ (0, 1), which is the limit of LSPE(λ) (cf. Prop. 6.7.2). Let h be any differential cost vector, and let β_{γ,λ} be the modulus of contraction of ΠF_{γ,λ} with respect to ‖·‖_ξ. Similar to the proof of Prop. 6.3.2 for the discounted case, we have

‖h − Φr*_λ‖²_ξ = ‖h − Πh‖²_ξ + ‖Πh − Φr*_λ‖²_ξ
= ‖h − Πh‖²_ξ + ‖ΠF_{γ,λ}h − ΠF_{γ,λ}(Φr*_λ)‖²_ξ
≤ ‖h − Πh‖²_ξ + β²_{γ,λ} ‖h − Φr*_λ‖²_ξ.

It follows that

‖h − Φr*_λ‖_ξ ≤ (1/√(1 − β²_{γ,λ})) ‖h − Πh‖_ξ,   λ ∈ [0, 1),  γ ∈ (0, 1),   (6.166)

for all differential cost vectors h.
This estimate is a little peculiar because the differential cost vector is not unique. The set of differential cost vectors is

D = { h* + γe | γ ∈ ℜ },

where h* is the bias of the policy evaluated (cf. Section 4.1, and Props. 4.1.1 and 4.1.2). In particular, h* is the unique h ∈ D that satisfies ξ′h = 0, or equivalently P*h = 0, where

P* = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P^k.
Usually, in average cost policy evaluation, we are interested in obtaining a small error (h − Φr*_λ), with the choice of h being immaterial (see the discussion of the next section on approximate policy iteration). It follows that since the estimate (6.166) holds for all h ∈ D, a better error bound can be obtained by using an optimal choice of h in the left-hand side and an optimal choice of γ in the right-hand side. Indeed, Tsitsiklis and Van Roy [TsV99a] have obtained such an optimized error estimate. It has the form

min_{h∈D} ‖h − Φr*_λ‖_ξ = ‖h* − (I − P*)Φr*_λ‖_ξ ≤ (1/√(1 − α²_λ)) ‖Π*h* − h*‖_ξ,   (6.167)

where h* is the bias vector, Π* denotes projection with respect to ‖·‖_ξ onto the subspace

S* = { (I − P*)y | y ∈ S },
and α_λ is the minimum over γ ∈ (0, 1) of the contraction modulus of the mapping Π*F_{γ,λ}:

α_λ = min_{γ∈(0,1)} max_{‖y‖_ξ=1} ‖Π*P_{γ,λ}y‖_ξ,

where P_{γ,λ} = (1 − γ)I + γΠ*P^(λ). Note that this error bound has a similar form to the one for discounted problems (cf. Prop. 6.3.5), but S has been replaced by S* and Π has been replaced by Π*. It can be shown that the scalar α_λ decreases as λ increases, and approaches 0 as λ ↑ 1. This is consistent with the corresponding error bound for discounted problems (cf. Prop. 6.3.5), and is also consistent with empirical observations, which suggest that smaller values of λ lead to larger approximation errors.

Figure 6.7.1 illustrates and explains the projection operation Π*, the distance of the bias h* from its projection Π*h*, and the other terms in the error bound (6.167).
LSTD(λ) and LSPE(λ)

The LSTD(λ) and LSPE(λ) algorithms for average cost are straightforward extensions of the discounted versions, and will only be summarized. The LSPE(λ) iteration can be written (similar to the discounted case) as

r_{k+1} = r_k − γG_k(C_k r_k − d_k),   (6.168)

where γ is a positive stepsize and

C_k = (1/(k+1)) Σ_{t=0}^k z_t ( φ(i_t)′ − φ(i_{t+1})′ ),   G_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) φ(i_t)′,

d_k = (1/(k+1)) Σ_{t=0}^k z_t ( g(i_t, i_{t+1}) − η_t ),   z_t = Σ_{m=0}^t λ^{t−m} φ(i_m).

Scaled versions of this algorithm, where G_k is a scaling matrix, are also possible. The LSTD(λ) algorithm is given by

r̂_k = C_k^{−1} d_k.

There is also a regression-based version that is well-suited for cases where C_k is nearly singular (cf. Section 6.3.4).
The matrices C_k, G_k, and vector d_k can be shown to converge to limits:

    C_k → Φ′Ξ(I − P^(λ))Φ,    G_k → Φ′ΞΦ,    d_k → Φ′Ξg^(λ),    (6.169)

where the matrix P^(λ) is defined by Eq. (6.164), g^(λ) is given by

    g^(λ) = Σ_{ℓ=0}^∞ λ^ℓ P^ℓ (g − ηe),

and Ξ is the diagonal matrix with diagonal entries ξ_1, …, ξ_n:

    Ξ = diag(ξ_1, …, ξ_n),

[cf. Eqs. (6.72) and (6.73)].
[Figure: the subspace S spanned by the basis vectors; the subspace E^* of vectors (I − P^*)y; the subspace S^*; the set D of differential cost vectors; the bias h^* and its projection Π^*h^*; the vectors Φr^*_λ and (I − P^*)Φr^*_λ; the unit vector e.]
Figure 6.7.1: Illustration of the estimate (6.167). Consider the subspace

    E^* = {(I − P^*)y | y ∈ ℜ^n}.

Let Ξ be the diagonal matrix with ξ_1, …, ξ_n on the diagonal. Note that:

(a) E^* is the subspace that is orthogonal to the unit vector e in the scaled
geometry of the norm ‖·‖_ξ, in the sense that e′Ξz = 0 for all z ∈ E^*.
Indeed we have

    e′Ξ(I − P^*)y = 0,    for all y ∈ ℜ^n,

because e′Ξ = ξ′ and ξ′(I − P^*) = 0, as can be easily verified from the fact
that the rows of P^* are all equal to ξ′.

(b) Projection onto E^* with respect to the norm ‖·‖_ξ is simply multiplication
with (I − P^*) (since P^*y = ξ′y e, so P^*y is orthogonal to E^* in the scaled
geometry of the norm ‖·‖_ξ). Thus, S^* is the projection of S onto E^*.

(c) We have h^* ∈ E^* since (I − P^*)h^* = h^* in view of P^*h^* = 0.

(d) The equation

    min_{h∈D} ‖h − Φr^*_λ‖_ξ = ‖h^* − (I − P^*)Φr^*_λ‖_ξ

is geometrically evident from the figure. Also, the term ‖Π^*h^* − h^*‖_ξ of the
error bound is the minimum possible error given that h^* is approximated
with an element of S^*.

(e) The estimate (6.167) is the analog of the discounted estimate of Prop.
6.3.5, with E^* playing the role of the entire space, and with the "geometry
of the problem" projected onto E^*. Thus, S^* plays the role of S, h^* plays
the role of J_µ, (I − P^*)Φr^*_λ plays the role of Φr^*_λ, and Π^* plays the role
of Π. Finally, α_λ is the best possible contraction modulus of Π^*F_{γ,λ} over
γ ∈ (0, 1) and within E^* (see the paper [TsV99a] for a detailed analysis).
6.7.2 Approximate Policy Iteration

Let us consider an approximate policy iteration method that involves approximate
policy evaluation and approximate policy improvement. We assume
that all stationary policies are unichain, and that a special state s is
recurrent in the Markov chain corresponding to each stationary policy. As
in Section 4.3.1, we consider the stochastic shortest path problem obtained
by leaving unchanged all transition probabilities p_{ij}(u) for j ≠ s, by setting
all transition probabilities p_{is}(u) to 0, and by introducing an artificial
termination state t to which we move from each state i with probability p_{is}(u).
The one-stage cost is equal to g(i, u) − η, where η is a scalar parameter.
We refer to this stochastic shortest path problem as the η-SSP.
The method generates a sequence of stationary policies µ^k, a corresponding
sequence of gains η_{µ^k}, and a sequence of cost vectors h^k. We
assume that for some ǫ > 0, we have

    max_{i=1,…,n} |h^k(i) − h_{µ^k,η^k}(i)| ≤ ǫ,    k = 0, 1, …,

where

    η^k = min_{m=0,1,…,k} η_{µ^m},

h_{µ^k,η^k}(i) is the cost-to-go from state i to the reference state s for the η^k-SSP
under policy µ^k, and ǫ is a positive scalar quantifying the accuracy of
evaluation of the cost-to-go function of the η^k-SSP. Note that we assume
exact calculation of the gains η_{µ^k}. Note also that we may calculate
approximate differential costs h̃_k(i, r) that depend on a parameter vector r
without regard to the reference state s. These differential costs may then
be replaced by

    h^k(i) = h̃_k(i, r) − h̃_k(s, r),    i = 1, …, n.
We assume that policy improvement is carried out by approximate
minimization in the DP mapping. In particular, we assume that there exists
a tolerance δ > 0 such that for all i and k, µ^{k+1}(i) attains the minimum in
the expression

    min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u)(g(i, u, j) + h^k(j)),

within a tolerance δ > 0.
We now note that since η^k is monotonically nonincreasing and is
bounded below by the optimal gain η^*, it must converge to some scalar η̄.
Since η^k can take only one of the finite number of values η_µ corresponding
to the finite number of stationary policies µ, we see that η^k must converge
finitely to η̄; that is, for some k̄, we have

    η^k = η̄,    k ≥ k̄.
Let h_{η̄}(s) denote the optimal cost-to-go from state s in the η̄-SSP. Then,
by using Prop. 2.4.1, we have

    limsup_{k→∞} (h_{µ^k,η̄}(s) − h_{η̄}(s)) ≤ n(1 − ρ + n)(δ + 2ǫ)/(1 − ρ)²,    (6.170)

where

    ρ = max_{i=1,…,n, µ} P(i_k ≠ s, k = 1, …, n | i_0 = i, µ),

and i_k denotes the state of the system after k stages. On the other hand,
as can also be seen from Fig. 6.7.2, the relation η̄ ≤ η_{µ^k} implies that

    h_{µ^k,η̄}(s) ≥ h_{µ^k,η_{µ^k}}(s) = 0.

It follows, using also Fig. 6.7.2, that

    h_{µ^k,η̄}(s) − h_{η̄}(s) ≥ −h_{η̄}(s) ≥ −h_{µ^*,η̄}(s) = (η̄ − η^*)N_{µ^*},    (6.171)

where µ^* is an optimal policy for the η^*-SSP (and hence also for the original
average cost per stage problem) and N_{µ^*} is the expected number of stages
to return to state s, starting from s and using µ^*. Thus, from Eqs. (6.170)
and (6.171), we have

    η̄ − η^* ≤ n(1 − ρ + n)(δ + 2ǫ)/(N_{µ^*}(1 − ρ)²).    (6.172)

This relation provides an estimate on the steady-state error of the approximate
policy iteration method.
We finally note that optimistic versions of the preceding approximate
policy iteration method are harder to implement than their discounted cost
counterparts. The reason is our assumption that the gain η_µ of every generated
policy µ is exactly calculated; in an optimistic method the current
policy µ may not remain constant for a sufficiently long time to estimate
η_µ accurately. One may consider schemes where an optimistic version of
policy iteration is used to solve the η-SSP for a fixed η. The value of η
may occasionally be adjusted downward by calculating "exactly" through
simulation the gain η_µ of some of the (presumably most promising) generated
policies µ, and by then updating η according to η := min{η, η_µ}. An
alternative is to approximate the average cost problem with a discounted
problem, for which an optimistic version of approximate policy iteration
can be readily implemented.
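For concreteness, the following is a minimal table-lookup sketch of a policy iteration loop of this type, with exact policy evaluation and exact policy improvement (i.e., δ = ǫ = 0). For simplicity it evaluates each policy directly through the average cost equation ηe + h = g_µ + P_µh with the normalization h(s) = 0, rather than through the η-SSP; all problem data and names are illustrative.

```python
import numpy as np

def evaluate(P, g, s=0):
    """Exact average-cost policy evaluation: solve eta*e + h = g + P h with h[s] = 0.

    P: (n, n) transition matrix of the policy; g: (n,) expected one-stage cost.
    Returns the gain eta and the differential cost vector h.
    """
    n = len(g)
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[:n, 0] = 1.0                      # the eta column
    A[:n, 1:] = np.eye(n) - P           # h - P h
    b[:n] = g
    A[n, 1 + s] = 1.0                   # normalization h[s] = 0
    x = np.linalg.solve(A, b)
    return x[0], x[1:]

def policy_iteration(P_u, g_u, iters=20):
    """P_u: (m, n, n) per-action transitions; g_u: (m, n) per-action expected costs."""
    m, n, _ = P_u.shape
    mu = np.zeros(n, dtype=int)
    gains = []
    for _ in range(iters):
        P = P_u[mu, np.arange(n)]       # (n, n) transition matrix of the current policy
        g = g_u[mu, np.arange(n)]
        eta, h = evaluate(P, g)
        gains.append(eta)
        q = g_u + P_u @ h               # (m, n): g(i,u) + sum_j p_ij(u) h(j)
        mu = np.argmin(q, axis=0)       # exact policy improvement
    return gains, mu

rng = np.random.default_rng(0)
m, n = 3, 5
P_u = rng.random((m, n, n)); P_u /= P_u.sum(axis=2, keepdims=True)
g_u = rng.random((m, n))
gains, mu = policy_iteration(P_u, g_u)
```

With exact evaluation the sequence of gains is nonincreasing and converges finitely, in agreement with the discussion above.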
[Figure: plot of h_η(s), h_{µ,η}(s) = (η_µ − η)N_µ, and h_{µ^*,η}(s) = (η^* − η)N_{µ^*} as functions of η, with the values η_µ, η̄, and η^* marked.]
Figure 6.7.2: Relation of the costs of stationary policies for the η-SSP in the
approximate policy iteration method. Here, N_µ is the expected number of stages
to return to state s, starting from s and using µ. Since η_{µ^k} ≥ η̄, we have

    h_{µ^k,η̄}(s) ≥ h_{µ^k,η_{µ^k}}(s) = 0.

Furthermore, if µ^* is an optimal policy for the η^*-SSP, we have

    h_{η̄}(s) ≤ h_{µ^*,η̄}(s) = (η^* − η̄)N_{µ^*}.
6.7.3 Q-Learning for Average Cost Problems

To derive the appropriate form of the Q-learning algorithm, we form an
auxiliary average cost problem by augmenting the original system with one
additional state for each possible pair (i, u) with u ∈ U(i). Thus, the states
of the auxiliary problem are those of the original problem, i = 1, …, n,
together with the additional states (i, u), i = 1, …, n, u ∈ U(i). The
probabilistic transition mechanism from an original state i is the same as
for the original problem [probability p_{ij}(u) of moving to state j], while the
probabilistic transition mechanism from a state (i, u) is that we move only
to states j of the original problem with corresponding probabilities p_{ij}(u)
and costs g(i, u, j).

It can be seen that the auxiliary problem has the same optimal average
cost per stage η as the original, and that the corresponding Bellman's
equation is

    η + h(i) = min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u)(g(i, u, j) + h(j)),    i = 1, …, n,    (6.173)

    η + Q(i, u) = Σ_{j=1}^n p_{ij}(u)(g(i, u, j) + h(j)),    i = 1, …, n, u ∈ U(i),    (6.174)

where Q(i, u) is the differential cost corresponding to (i, u). Taking the
minimum over u in Eq. (6.174) and comparing with Eq. (6.173), we obtain

    h(i) = min_{u∈U(i)} Q(i, u),    i = 1, …, n.

Substituting the above form of h(i) in Eq. (6.174), we obtain Bellman's
equation in a form that exclusively involves the Q-factors:

    η + Q(i, u) = Σ_{j=1}^n p_{ij}(u)(g(i, u, j) + min_{v∈U(j)} Q(j, v)),    i = 1, …, n, u ∈ U(i).
Let us now apply to the auxiliary problem the following variant of
the relative value iteration

    h^{k+1} = Th^k − h^k(s)e,

where s is a special state. We then obtain the iteration [cf. Eqs. (6.173)
and (6.174)]

    h^{k+1}(i) = min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u)(g(i, u, j) + h^k(j)) − h^k(s),    i = 1, …, n,

    Q^{k+1}(i, u) = Σ_{j=1}^n p_{ij}(u)(g(i, u, j) + h^k(j)) − h^k(s),    i = 1, …, n, u ∈ U(i).    (6.175)

From these equations, we have that

    h^k(i) = min_{u∈U(i)} Q^k(i, u),    i = 1, …, n,

and by substituting the above form of h^k in Eq. (6.175), we obtain the
following relative value iteration for the Q-factors:

    Q^{k+1}(i, u) = Σ_{j=1}^n p_{ij}(u)(g(i, u, j) + min_{v∈U(j)} Q^k(j, v)) − min_{v∈U(s)} Q^k(s, v).
The sequence of values min_{u∈U(s)} Q^k(s, u) is expected to converge to the
optimal average cost per stage, and the sequences of values min_{u∈U(i)} Q^k(i, u)
are expected to converge to the differential costs h(i).
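For a small table-lookup problem with known transition probabilities, the relative value iteration for the Q-factors just derived can be run directly. The following is a hedged sketch; all problem data are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, s = 3, 5, 0                         # actions, states, special state s
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # P[u, i, j] = p_ij(u)
g = rng.random((m, n, n))                 # g[u, i, j] = one-stage cost g(i, u, j)
gbar = (P * g).sum(axis=2)                # expected one-stage costs g(i, u)

Q = np.zeros((n, m))                      # Q[i, u]
for _ in range(2000):
    h = Q.min(axis=1)                     # h^k(j) = min_v Q^k(j, v)
    Q = (gbar + P @ h).T - h[s]           # Q^{k+1}(i,u) = sum_j p_ij(u)(g + h^k(j)) - h^k(s)

eta = Q[s].min()                          # estimate of the optimal average cost per stage
h_final = Q.min(axis=1)
residual = (gbar + P @ h_final).T - (eta + Q)   # Bellman equation residual for the Q-factors
```

At convergence the residual of the Q-factor form of Bellman's equation vanishes, and min_u Q(s, u) equals the optimal average cost per stage.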
An incremental version of the preceding iteration that involves a positive
stepsize γ is given by

    Q(i, u) := (1 − γ)Q(i, u) + γ(Σ_{j=1}^n p_{ij}(u)(g(i, u, j) + min_{v∈U(j)} Q(j, v)) − min_{v∈U(s)} Q(s, v)).

The natural form of the Q-learning method for the average cost problem
is an approximate version of this iteration, whereby the expected value is
replaced by a single sample, i.e.,

    Q(i, u) := Q(i, u) + γ(g(i, u, j) + min_{v∈U(j)} Q(j, v) − min_{v∈U(s)} Q(s, v) − Q(i, u)),

where j and g(i, u, j) are generated from the pair (i, u) by simulation. In
this method, only the Q-factor corresponding to the currently sampled pair
(i, u) is updated at each iteration, while the remaining Q-factors remain
unchanged. Also, the stepsize should diminish to 0. A convergence
analysis of this method can be found in the paper by Abounadi, Bertsekas,
and Borkar [ABB01].
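A minimal simulation-based sketch of this Q-learning iteration follows (illustrative random problem; pairs (i, u) are sampled uniformly, the stepsize diminishes with the visit count of each pair, and the resulting average cost estimate is compared against exact relative value iteration on the Q-factors).

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, s = 2, 4, 0
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # P[u, i, j] = p_ij(u)
g = rng.random((m, n, n))                                      # g[u, i, j]

Q = np.zeros((n, m))
visits = np.zeros((n, m))
for _ in range(60_000):
    i, u = rng.integers(n), rng.integers(m)     # sample a pair (i, u) to update
    j = rng.choice(n, p=P[u, i])                # simulate the transition and cost
    visits[i, u] += 1
    gamma = 1.0 / visits[i, u]                  # diminishing stepsize for this pair
    target = g[u, i, j] + Q[j].min() - Q[s].min()
    Q[i, u] += gamma * (target - Q[i, u])
eta_hat = Q[s].min()

# reference value from exact relative value iteration on the Q-factors
gbar = (P * g).sum(axis=2)
Qr = np.zeros((n, m))
for _ in range(2000):
    hr = Qr.min(axis=1)
    Qr = (gbar + P @ hr).T - hr[s]
eta = Qr[s].min()
```

Only the sampled Q-factor is updated at each iteration, as in the method described above; the estimate min_u Q(s, u) approaches the optimal average cost per stage.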
Q-Learning Based on the Contracting Value Iteration

We now consider an alternative Q-learning method, which is based on the
contracting value iteration method of Section 4.3. If we apply this method
to the auxiliary problem used above, we obtain the following algorithm:

    h^{k+1}(i) = min_{u∈U(i)} [Σ_{j=1}^n p_{ij}(u)g(i, u, j) + Σ_{j=1, j≠s}^n p_{ij}(u)h^k(j)] − η^k,    (6.176)

    Q^{k+1}(i, u) = Σ_{j=1}^n p_{ij}(u)g(i, u, j) + Σ_{j=1, j≠s}^n p_{ij}(u)h^k(j) − η^k,    (6.177)

    η^{k+1} = η^k + α_k h^{k+1}(s).
From these equations, we have that

    h^k(i) = min_{u∈U(i)} Q^k(i, u),
and by substituting the above form of h^k in Eq. (6.177), we obtain

    Q^{k+1}(i, u) = Σ_{j=1}^n p_{ij}(u)g(i, u, j) + Σ_{j=1, j≠s}^n p_{ij}(u) min_{v∈U(j)} Q^k(j, v) − η^k,

    η^{k+1} = η^k + α_k min_{v∈U(s)} Q^{k+1}(s, v).

A small-stepsize version of this iteration is given by

    Q(i, u) := (1 − γ)Q(i, u) + γ(Σ_{j=1}^n p_{ij}(u)g(i, u, j) + Σ_{j=1, j≠s}^n p_{ij}(u) min_{v∈U(j)} Q(j, v) − η),

    η := η + α min_{v∈U(s)} Q(s, v),

where γ and α are positive stepsizes. A natural form of Q-learning based
on this iteration is obtained by replacing the expected values by a single
sample, i.e.,

    Q(i, u) := (1 − γ)Q(i, u) + γ(g(i, u, j) + min_{v∈U(j)} Q̂(j, v) − η),    (6.178)

    η := η + α min_{v∈U(s)} Q(s, v),    (6.179)

where

    Q̂(j, v) = Q(j, v) if j ≠ s, and Q̂(j, v) = 0 otherwise,

and j and g(i, u, j) are generated from the pair (i, u) by simulation. Here
the stepsizes γ and α should be diminishing, but α should diminish "faster"
than γ; i.e., the ratio of the stepsizes α/γ should converge to zero. For
example, we may use γ = C/k and α = c/(k log k), where C and c are positive
constants and k is the number of iterations performed on the corresponding
pair (i, u) or η, respectively.
The algorithm has two components: the iteration (6.178), which is
essentially a Q-learning method that aims to solve the η-SSP for the current
value of η, and the iteration (6.179), which updates η towards its correct
value η^*. However, η is updated at a slower rate than Q, since the stepsize
ratio α/γ converges to zero. The effect is that the Q-learning iteration
(6.178) is fast enough to keep pace with the slower changing η-SSP. A
convergence analysis of this method can also be found in the paper [ABB01].
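The two-time-scale structure can be illustrated with the small-stepsize version given above, keeping the expected values (replacing them with single samples yields the Q-learning method (6.178)-(6.179)). In this sketch, constant stepsizes with α much smaller than γ stand in for the diminishing schedules, and the limit is checked against exact relative value iteration; all problem data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, s = 2, 4, 0
P = rng.random((m, n, n)); P /= P.sum(axis=2, keepdims=True)   # P[u, i, j] = p_ij(u)
g = rng.random((m, n, n))
gbar = (P * g).sum(axis=2)                # expected one-stage costs

Pns = P.copy(); Pns[:, :, s] = 0.0        # drop transitions into s (the eta-SSP structure)

Q = np.zeros((n, m))
eta = 0.0
gamma, alpha = 0.1, 0.01                  # alpha << gamma: eta changes on the slower time scale
for _ in range(20_000):
    h = Q.min(axis=1)
    Q = (1 - gamma) * Q + gamma * ((gbar + Pns @ h).T - eta)
    eta += alpha * Q[s].min()             # drive the SSP cost at s toward 0

# reference: exact relative value iteration for the Q-factors
Qr = np.zeros((n, m))
for _ in range(2000):
    hr = Qr.min(axis=1)
    Qr = (gbar + P @ hr).T - hr[s]
eta_star = Qr[s].min()
```

At the fixed point, the Q-iterate solves the η-SSP and min_v Q(s, v) = 0, which forces η to equal the optimal average cost per stage.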
6.8 SIMULATION-BASED SOLUTION OF LARGE SYSTEMS

We have focused so far in this chapter on approximating the solution of
Bellman equations within a subspace of basis functions in a variety of contexts.
We have seen common analytical threads across discounted, SSP,
and average cost problems, as well as differences in formulations, implementation
details, and associated theoretical results. In this section we
will aim for a more general view of simulation-based solution of large systems
within which the methods and analysis so far can be understood and
extended. The benefit of this generalization is a deeper perspective, and
the ability to address new problems in DP and beyond.
6.8.1 Projected Equations - Simulation-Based Versions

We first focus on general linear fixed point equations x = T(x), where

    T(x) = b + Ax,    (6.180)

A is an n × n matrix, and b ∈ ℜ^n is a vector. We consider approximations
of a solution by solving a projected equation

    Φr = ΠT(Φr) = Π(b + AΦr),

where Π denotes projection with respect to a weighted Euclidean norm
‖·‖_ξ on a subspace

    S = {Φr | r ∈ ℜ^s}.

We assume throughout that the columns of the n × s matrix Φ are linearly
independent basis functions.

Examples are Bellman's equation for policy evaluation, in which case
A = αP, where P is a transition matrix (discounted and average cost), or
P is a substochastic matrix (row sums less than or equal to 1, as in SSP), and
α = 1 (SSP and average cost), or α < 1 (discounted). Other examples in
DP include the semi-Markov problems discussed in Chapter 5. However,
for the moment we do not assume the presence of any stochastic structure
in A. Instead, we assume throughout that I − ΠA is invertible, so that the
projected equation has a unique solution denoted r^*.
Even though T or ΠT may not be contractions, we can obtain an
error bound that generalizes some of the bounds obtained earlier. We have

    x^* − Φr^* = x^* − Πx^* + ΠTx^* − ΠTΦr^* = x^* − Πx^* + ΠA(x^* − Φr^*),    (6.181)

from which

    x^* − Φr^* = (I − ΠA)^{−1}(x^* − Πx^*).

Thus, for any norm ‖·‖ and fixed point x^* of T,

    ‖x^* − Φr^*‖ ≤ ‖(I − ΠA)^{−1}‖ ‖x^* − Πx^*‖,    (6.182)
so the approximation error ‖x^* − Φr^*‖ is proportional to the distance of
the solution x^* from the approximation subspace. If ΠT is a contraction
mapping of modulus α ∈ (0, 1) with respect to ‖·‖, from Eq. (6.181), we
have

    ‖x^* − Φr^*‖ ≤ ‖x^* − Πx^*‖ + ‖ΠT(x^*) − ΠT(Φr^*)‖ ≤ ‖x^* − Πx^*‖ + α‖x^* − Φr^*‖,

so that

    ‖x^* − Φr^*‖ ≤ (1/(1 − α))‖x^* − Πx^*‖.    (6.183)
We first discuss an LSTD-type method for solving the projected equation
Φr = Π(b + AΦr). Let us assume that the positive distribution vector
ξ is given. By the definition of projection with respect to ‖·‖_ξ, the unique
solution r^* of this equation satisfies

    r^* = arg min_{r∈ℜ^s} ‖Φr − (b + AΦr^*)‖²_ξ.

Setting to 0 the gradient with respect to r, we obtain the corresponding
orthogonality condition

    Φ′Ξ(Φr^* − (b + AΦr^*)) = 0,

where Ξ is the diagonal matrix with the components ξ_1, …, ξ_n of ξ along
the diagonal. Equivalently,

    Cr^* = d,

where

    C = Φ′Ξ(I − A)Φ,    d = Φ′Ξb,    (6.184)

[cf. the matrix form (6.33)-(6.34) of the projected equation for discounted
DP problems].
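For a small dense example, the projected equation can be formed and solved directly. The following hedged sketch (all data randomly generated, with A scaled small so that I − ΠA is invertible, an assumption of the sketch) verifies that the solution of Cr^* = d satisfies Φr^* = Π(b + AΦr^*):

```python
import numpy as np

rng = np.random.default_rng(4)
n_dim, s_dim = 8, 3
A = 0.5 * rng.random((n_dim, n_dim)) / n_dim   # scaled small: I - Pi A invertible (assumption)
b = rng.random(n_dim)
Phi = rng.random((n_dim, s_dim))               # linearly independent columns
xi = rng.random(n_dim) + 0.5
xi /= xi.sum()                                 # positive distribution vector
Xi = np.diag(xi)

C = Phi.T @ Xi @ (np.eye(n_dim) - A) @ Phi     # cf. Eq. (6.184)
d = Phi.T @ Xi @ b
r_star = np.linalg.solve(C, d)

# verify the fixed point property: Phi r* = Pi (b + A Phi r*)
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)   # projection onto S w.r.t. the xi-norm
lhs = Phi @ r_star
rhs = Pi @ (b + A @ (Phi @ r_star))
```

The check is exact up to rounding, since Cr^* = d is algebraically equivalent to the orthogonality condition above.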
We will now develop a simulation-based approximation to the system
Cr^* = d, by using corresponding estimates of C and d. We write C and d
as expected values with respect to ξ:

    C = Σ_{i=1}^n ξ_i φ(i)(φ(i) − Σ_{j=1}^n a_{ij} φ(j))′,    d = Σ_{i=1}^n ξ_i φ(i)b_i.    (6.185)

As in Section 6.3.3, we approximate these expected values by simulation-obtained
sample averages; however, here we do not have a Markov chain
structure by which to generate samples. We must therefore design a sampling
process that can be used to properly approximate the expected values
in Eq. (6.185). In the most basic form of such a process, we generate a sequence
of indices {i_0, i_1, …}, and a sequence of transitions between indices
Figure 6.8.1. The basic simulation methodology consists of (a) generating a
sequence of indices {i_0, i_1, …} according to the distribution ξ (a Markov chain Q
may be used for this, but this is not a requirement), and (b) generating a sequence
of transitions {(i_0, j_0), (i_1, j_1), …} using a Markov chain P. It is possible that
j_k = i_{k+1}, but this is not necessary.
{(i_0, j_0), (i_1, j_1), …}. We use any probabilistic mechanism for this, subject
to the following two requirements (cf. Fig. 6.8.1):

(1) Row sampling: The sequence {i_0, i_1, …} is generated according to
the distribution ξ, which defines the projection norm ‖·‖_ξ, in the
sense that with probability 1,

    lim_{k→∞} (Σ_{t=0}^k δ(i_t = i))/(k + 1) = ξ_i,    i = 1, …, n,    (6.186)

where δ(·) denotes the indicator function [δ(E) = 1 if the event E has
occurred and δ(E) = 0 otherwise].

(2) Column sampling: The sequence {(i_0, j_0), (i_1, j_1), …} is generated
according to a certain stochastic matrix P with transition probabilities
p_{ij} that satisfy

    p_{ij} > 0 if a_{ij} ≠ 0,    (6.187)

in the sense that with probability 1,

    lim_{k→∞} (Σ_{t=0}^k δ(i_t = i, j_t = j))/(Σ_{t=0}^k δ(i_t = i)) = p_{ij},    i, j = 1, …, n.    (6.188)
At time k, we approximate C and d with

    C_k = (1/(k+1)) Σ_{t=0}^k φ(i_t)(φ(i_t) − (a_{i_t j_t}/p_{i_t j_t}) φ(j_t))′,    d_k = (1/(k+1)) Σ_{t=0}^k φ(i_t)b_{i_t}.    (6.189)
To show that this is a valid approximation, we count the number of times
an index occurs and, after collecting terms, we write Eq. (6.189) as

    C_k = Σ_{i=1}^n ξ̂_{i,k} φ(i)(φ(i) − Σ_{j=1}^n p̂_{ij,k} (a_{ij}/p_{ij}) φ(j))′,    d_k = Σ_{i=1}^n ξ̂_{i,k} φ(i)b_i,    (6.190)

where

    ξ̂_{i,k} = (Σ_{t=0}^k δ(i_t = i))/(k + 1),    p̂_{ij,k} = (Σ_{t=0}^k δ(i_t = i, j_t = j))/(Σ_{t=0}^k δ(i_t = i));

(cf. the calculations in Section 6.3.3). In view of the assumption

    ξ̂_{i,k} → ξ_i,    p̂_{ij,k} → p_{ij},    i, j = 1, …, n,

[cf. Eqs. (6.186) and (6.188)], by comparing Eqs. (6.185) and (6.190), we see
that C_k → C and d_k → d. Since the solution r^* of the system (6.185) exists
and is unique, the same is true for the system (6.190) for all k sufficiently
large. Thus, with probability 1, the solution of the system (6.189) converges
to r^* as k → ∞.
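A minimal sketch of this row/column sampling scheme follows. Here the indices are drawn i.i.d. from ξ, which is a valid special case of row sampling; the column-sampling matrix has all entries positive, so (6.187) holds; the basis is orthonormalized for good conditioning; all data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n_dim, s_dim = 6, 2
A = 0.5 * rng.random((n_dim, n_dim)) / n_dim   # scaled small: I - Pi A invertible (assumption)
b = rng.random(n_dim)
Phi, _ = np.linalg.qr(rng.random((n_dim, s_dim)))   # orthonormalized basis columns
xi = np.full(n_dim, 1.0 / n_dim)               # row-sampling distribution (uniform here)
Pcol = rng.random((n_dim, n_dim))
Pcol /= Pcol.sum(axis=1, keepdims=True)        # column-sampling matrix; p_ij > 0 wherever a_ij != 0

K = 50_000
C_k = np.zeros((s_dim, s_dim))
d_k = np.zeros(s_dim)
rows = rng.choice(n_dim, size=K, p=xi)         # row sampling: i.i.d. draws from xi
for i in rows:
    j = rng.choice(n_dim, p=Pcol[i])           # column sampling: one transition from i
    C_k += np.outer(Phi[i], Phi[i] - (A[i, j] / Pcol[i, j]) * Phi[j])   # cf. Eq. (6.189)
    d_k += Phi[i] * b[i]
C_k /= K
d_k /= K
r_hat = np.linalg.solve(C_k, d_k)

# exact solution of the projected equation, for comparison
Xi = np.diag(xi)
r_star = np.linalg.solve(Phi.T @ Xi @ (np.eye(n_dim) - A) @ Phi, Phi.T @ Xi @ b)
```

The importance weight a_{ij}/p_{ij} corrects for the column-sampling distribution, so the sample averages converge to the expected values in Eq. (6.185).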
A comparison of Eqs. (6.185) and (6.190) indicates some considerations
for selecting the stochastic matrix P. It can be seen that "important"
(e.g., large) components a_{ij} should be simulated more often (p_{ij}: large).
In particular, if (i, j) is such that a_{ij} = 0, there is an incentive to choose
p_{ij} = 0, since corresponding transitions (i, j) are "wasted" in that they
do not contribute to improvement of the approximation of Eq. (6.185) by
Eq. (6.190). This suggests that the structure of P should match in some
sense the structure of the matrix A, to improve the efficiency of the simulation
(the number of samples needed for a given level of simulation error
variance). On the other hand, the choice of P does not affect the limit of
Φr̂_k, which is the solution Φr^* of the projected equation. By contrast, the
choice of ξ affects the projection Π and hence also Φr^*.
Note that there is a lot of flexibility for generating the sequence
{i_0, i_1, …} and the transition sequence {(i_0, j_0), (i_1, j_1), …} to satisfy Eqs.
(6.186) and (6.188). For example, to satisfy Eq. (6.186), the indices i_t do
not need to be sampled independently according to ξ. Instead, it may be
convenient to introduce an irreducible Markov chain with transition matrix
Q, states 1, …, n, and ξ as its steady-state probability vector, and to start
at some state i_0 and generate the sequence {i_0, i_1, …} as a single infinitely
long trajectory of the chain. For the transition sequence, we may optionally
let j_k = i_{k+1} for all k, in which case P would be identical to Q, but in
general this is not essential.
Let us discuss two possibilities for constructing a Markov chain with
steady-state probability vector ξ. The first is useful when a desirable distribution
ξ is known up to a normalization constant. Then we can construct
such a chain using techniques that are common in Markov chain
Monte Carlo (MCMC) methods (see e.g., [Liu01], Rubinstein and Kroese
[RuK08]).
The other possibility, which is useful when there is no particularly
desirable ξ, is to specify first the transition matrix Q of the Markov chain
and let ξ be its steady-state probability vector. Then the requirement
(6.186) will be satisfied if the Markov chain is irreducible, in which case ξ
will be the unique steady-state probability vector of the chain and will have
positive components. An important observation is that explicit knowledge
of ξ is not required; it is just necessary to know the Markov chain and to
be able to simulate its transitions. The approximate DP applications of
Sections 6.3, 6.6, and 6.7, where Q = P, fall into this context. In the next
section, we will discuss favorable methods for constructing the transition
matrix Q from A, which result in ΠT being a contraction so that iterative
methods are applicable.
Note that multiple simulated sequences can be used to form the equation
(6.189). For example, in the Markov chain-based sampling schemes,
we can generate multiple infinitely long trajectories of the chain, starting at
several different states, and for each trajectory use j_k = i_{k+1} for all k. This
are no transient states and at least one trajectory is started from within
each recurrent class. Again ξ will be a steadystate probability vector of
the chain, and need not be known explicitly. Note also that using multiple
trajectories may be interesting even if there is a single recurrent class, for
at least two reasons:
(a) The generation of trajectories may be parallelized among multiple
processors, resulting in signiﬁcant speedup.
(b) The empirical frequencies of occurrence of the states may approach
the steadystate probabilities more quickly; this is particularly so for
large and “stiﬀ” Markov chains.
We finally note that the option of using distinct Markov chains Q and
P for row and column sampling is important in the DP/policy iteration
context. In particular, by using a distribution ξ that is not associated with
P, we may resolve the issue of exploration (see Section 6.3.8).
6.8.2 Matrix Inversion and Regression-Type Methods

Given simulation-based estimates C_k and d_k of C and d, respectively, we
may approximate r^* = C^{−1}d with

    r̂_k = C_k^{−1} d_k,

in which case we have r̂_k → r^* with probability 1 (this parallels the LSTD
method of Section 6.3.4). An alternative, which is more suitable for the case
where C_k is nearly singular, is the regression/regularization-based estimate

    r̂_k = (C_k′Σ^{−1}C_k + βI)^{−1}(C_k′Σ^{−1}d_k + βr̄),    (6.191)
[cf. Eq. (6.51) in Section 6.3.4], where r̄ is an a priori estimate of r^* =
C^{−1}d, β is a positive scalar, and Σ is some positive definite symmetric
matrix. The error estimate given by Prop. 6.3.4 applies to this method.
In particular, the error ‖r̂_k − r^*‖ is bounded by the sum of two terms:
one due to simulation error (which is larger when C is nearly singular,
and decreases with the amount of sampling used), and the other due to
regularization error (which depends on the regularization parameter β and
the error ‖r̄ − r^*‖); cf. Eq. (6.53).
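The contrast between the two estimates can be seen on a deliberately ill-conditioned two-dimensional example (all numbers illustrative): when the estimated matrix is nearly singular, a small simulation error in the right-hand side is blown up by the plain inverse, while the regularized estimate (6.191) keeps the error bounded.

```python
import numpy as np

# a nearly singular "estimated" matrix and a noisy right-hand side (illustrative data)
C = np.diag([1.0, 1e-3])
r_true = np.array([1.0, 1.0])
noise = np.array([0.01, 0.01])                  # stands in for simulation error in d_k
d = C @ r_true + noise

r_plain = np.linalg.solve(C, d)                 # plain matrix-inversion estimate

beta = 1e-2                                     # regularization parameter
r_bar = np.zeros(2)                             # a priori guess of r*
Sigma_inv = np.eye(2)                           # Sigma = I
M = C.T @ Sigma_inv @ C + beta * np.eye(2)
r_reg = np.linalg.solve(M, C.T @ Sigma_inv @ d + beta * r_bar)   # cf. Eq. (6.191)
```

Here the plain estimate is off by an order of magnitude in the nearly singular direction, while the regularized estimate trades that amplified simulation error for a bounded regularization error toward r̄.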
To obtain a confidence interval for the error r̂_k − r^*, we view all
variables generated by simulation to be random variables on a common
probability space. Let Σ̂_k be the covariance of (d_k − C_k r^*), and let

    b̂_k = Σ̂_k^{−1/2}(d_k − C_k r^*).

Note that b̂_k has covariance equal to the identity. Let P̂_k be the cumulative
distribution function of ‖b̂_k‖², and note that

    ‖b̂_k‖ ≤ √(P̂_k^{−1}(1 − θ))    (6.192)

with probability (1 − θ), where P̂_k^{−1}(1 − θ) is the threshold value v at which
the probability that ‖b̂_k‖² takes value greater than v is θ. We denote by
P(E) the probability of an event E.
Proposition 6.8.1: We have

    P(‖r̂_k − r^*‖ ≤ σ_k(Σ, β)) ≥ 1 − θ,

where

    σ_k(Σ, β) = max_{i=1,…,s} (λ_i/(λ_i² + β)) ‖Σ^{−1/2}Σ̂_k^{1/2}‖ √(P̂_k^{−1}(1 − θ))
              + max_{i=1,…,s} (β/(λ_i² + β)) ‖r̄ − r^*‖,    (6.193)

and λ_1, …, λ_s are the singular values of Σ^{−1/2}C_k.
Proof: Let b_k = Σ^{−1/2}(d_k − C_k r^*). Following the notation and proof of
Prop. 6.3.4, and using the relation b̂_k = Σ̂_k^{−1/2}Σ^{1/2}b_k, we have

    r̂_k − r^* = V(Λ² + βI)^{−1}ΛU′b_k + βV(Λ² + βI)^{−1}V′(r̄ − r^*)
              = V(Λ² + βI)^{−1}ΛU′Σ^{−1/2}Σ̂_k^{1/2}b̂_k + βV(Λ² + βI)^{−1}V′(r̄ − r^*).
From this, we similarly obtain

    ‖r̂_k − r^*‖ ≤ max_{i=1,…,s} (λ_i/(λ_i² + β)) ‖Σ^{−1/2}Σ̂_k^{1/2}‖ ‖b̂_k‖ + max_{i=1,…,s} (β/(λ_i² + β)) ‖r̄ − r^*‖.

Since Eq. (6.192) holds with probability (1 − θ), the desired result follows.
Q.E.D.
Using a form of the central limit theorem, we may assume that for
a large number of samples, b̂_k asymptotically becomes a Gaussian random
s-dimensional vector, so that the random variable

    ‖b̂_k‖² = (d_k − C_k r^*)′Σ̂_k^{−1}(d_k − C_k r^*)

can be treated as a chi-square random variable with s degrees of freedom
(since the covariance of b̂_k is the identity by definition). Under this assumption,
the threshold P̂_k^{−1}(1 − θ) in Eq. (6.193) may be replaced by P^{−1}(1 − θ; s),
the threshold value v at which the probability that a chi-square random
variable with s degrees of freedom takes value greater than v is θ. Thus in
a practical application of Prop. 6.8.1, one may replace P̂_k^{−1}(1 − θ) by
P^{−1}(1 − θ; s), and also replace Σ̂_k with an estimate of the covariance of
(d_k − C_k r^*); the other quantities in Eq. (6.193) (Σ, λ_i, β, and r̄) are known.
6.8.3 Iterative/LSPE-Type Methods

In this section, we will consider iterative methods for solving the projected
equation Cr = d [cf. Eq. (6.185)], using simulation-based estimates C_k and
d_k. We first consider the fixed point iteration

    Φr_{k+1} = ΠT(Φr_k),    k = 0, 1, …,    (6.194)

which generalizes the PVI method of Section 6.3.2. For this method to
be valid and to converge to r^* it is essential that ΠT is a contraction with
respect to some norm. In the next section, we will provide tools for verifying
that this is so.

Similar to the analysis of Section 6.3.3, the simulation-based approximation
(LSPE analog) is

    r_{k+1} = (Σ_{t=0}^k φ(i_t)φ(i_t)′)^{−1} Σ_{t=0}^k φ(i_t)((a_{i_t j_t}/p_{i_t j_t})φ(j_t)′r_k + b_{i_t}).    (6.195)

Here again {i_0, i_1, …} is an index sequence and {(i_0, j_0), (i_1, j_1), …} is a
transition sequence satisfying Eqs. (6.186)-(6.188).
A generalization of this iteration, written in more compact form and
introducing scaling with a matrix G_k, is given by

    r_{k+1} = r_k − γG_k(C_k r_k − d_k),    (6.196)

where C_k and d_k are given by Eq. (6.189) [cf. Eq. (6.56)]. As in Section
6.3.3, this iteration can be equivalently written in terms of generalized
temporal differences as

    r_{k+1} = r_k − (γ/(k + 1)) G_k Σ_{t=0}^k φ(i_t)q_{k,t},

where

    q_{k,t} = φ(i_t)′r_k − (a_{i_t j_t}/p_{i_t j_t})φ(j_t)′r_k − b_{i_t}

[cf. Eq. (6.57)]. The scaling matrix G_k should converge to an appropriate
matrix G.

For the scaled LSPE-type method (6.196) to converge to r^*, we must
have G_k → G, C_k → C, and G, C, and γ must be such that I − γGC is a
contraction. Noteworthy special cases where this is so are:
(a) The case of iteration (6.195), where γ = 1 and

    G_k = (Σ_{t=0}^k φ(i_t)φ(i_t)′)^{−1},

under the assumption that ΠT is a contraction. The reason is that
this iteration asymptotically becomes the fixed point iteration Φr_{k+1} =
ΠT(Φr_k) [cf. Eq. (6.194)].

(b) C is positive definite, G is symmetric positive definite, and γ is sufficiently
small. This case arises in various DP contexts, e.g., the
discounted problem where A = αP (cf. Section 6.3).

(c) C is invertible, γ = 1, and G has the form [cf. Eq. (6.60)]

    G = (C′Σ^{−1}C + βI)^{−1}C′Σ^{−1},

where Σ is some positive definite symmetric matrix, and β is a positive
scalar. The corresponding iteration (6.196) takes the form

    r_{k+1} = (C_k′Σ_k^{−1}C_k + βI)^{−1}(C_k′Σ_k^{−1}d_k + βr_k)

[cf. Eq. (6.61)]. As shown in Section 6.3.2, the eigenvalues of GC are
λ_i/(λ_i + β), where λ_i are the eigenvalues of C′Σ^{−1}C, so I − GC has
real eigenvalues in the interval (0, 1). This iteration also works if C
is not invertible.
Let us also note the analog of the TD(0) method. It is similar to Eq.
(6.196), but uses only the last sample:

    r_{k+1} = r_k − γ_k φ(i_k)q_{k,k},

where the stepsize γ_k must diminish to 0.
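The iteration (6.196) can be checked on a small example with exact C and d, using the scaling of case (a), G = (Φ′ΞΦ)^{−1} with γ = 1; A is scaled small so that ΠT is a contraction (an assumption of this sketch, not something the code verifies), and all data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n_dim, s_dim = 8, 3
A = 0.3 * rng.random((n_dim, n_dim)) / n_dim   # scaled small so Pi T is a contraction (assumption)
b = rng.random(n_dim)
Phi = rng.random((n_dim, s_dim))
xi = rng.random(n_dim) + 0.5
xi /= xi.sum()                                 # positive distribution vector
Xi = np.diag(xi)

C = Phi.T @ Xi @ (np.eye(n_dim) - A) @ Phi     # exact C and d, cf. Eq. (6.184)
d = Phi.T @ Xi @ b
G = np.linalg.inv(Phi.T @ Xi @ Phi)            # scaling of case (a)
gamma = 1.0

r = np.zeros(s_dim)
for _ in range(100):
    r = r - gamma * G @ (C @ r - d)            # cf. Eq. (6.196)

r_star = np.linalg.solve(C, d)
```

With this scaling the iteration is exactly the projected fixed point iteration Φr_{k+1} = ΠT(Φr_k) expressed in the coordinates r, so it converges geometrically to r^*.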
Contraction Properties

We will now derive conditions for ΠT to be a contraction, which facilitates
the use of the preceding iterative methods. We assume that the index
sequence {i_0, i_1, …} is generated as an infinitely long trajectory of a Markov
chain whose steady-state probability vector is ξ. We denote by Q the
corresponding transition probability matrix and by q_{ij} the components of
Q. As discussed earlier, Q may not be the same as P, which is used
to generate the transition sequence {(i_0, j_0), (i_1, j_1), …} to satisfy Eqs.
(6.186) and (6.188). It seems hard to guarantee that ΠT is a contraction
mapping, unless |A| ≤ Q [i.e., |a_{ij}| ≤ q_{ij} for all (i, j)]. The following
propositions assume this condition.
Proposition 6.8.2: Assume that Q is irreducible and that |A| ≤
Q. Then T and ΠT are contraction mappings under any one of the
following three conditions:

(1) For some scalar α ∈ (0, 1), we have |A| ≤ αQ.

(2) There exists an index i such that |a_{ij}| < q_{ij} for all j = 1, …, n.

(3) There exists an index i such that Σ_{j=1}^n |a_{ij}| < 1.
Proof: For any vector or matrix X, we denote by |X| the vector or matrix
that has as components the absolute values of the corresponding components
of X. Let ξ be the steady-state probability vector of Q. Assume
condition (1). Since Π is nonexpansive with respect to ‖·‖_ξ, it will suffice
to show that A is a contraction with respect to ‖·‖_ξ. We have

    |Az| ≤ |A| |z| ≤ αQ|z|,    ∀ z ∈ ℜ^n.    (6.197)

Using this relation, we obtain

    ‖Az‖_ξ ≤ α‖Q|z|‖_ξ ≤ α‖z‖_ξ,    ∀ z ∈ ℜ^n,    (6.198)

where the last inequality follows since ‖Qx‖_ξ ≤ ‖x‖_ξ for all x ∈ ℜ^n (see
Lemma 6.3.1). Thus, A is a contraction with respect to ‖·‖_ξ with modulus
α.

Assume condition (2). Then, in place of Eq. (6.197), we have

    |Az| ≤ |A| |z| ≤ Q|z|,    ∀ z ∈ ℜ^n,

with strict inequality for the row corresponding to i when z ≠ 0, and in
place of Eq. (6.198), we obtain

    ‖Az‖_ξ < ‖Q|z|‖_ξ ≤ ‖z‖_ξ,    ∀ z ≠ 0.
It follows that A is a contraction with respect to ‖·‖_ξ, with modulus

    max_{‖z‖_ξ≤1} ‖Az‖_ξ.

Assume condition (3). It will suffice to show that the eigenvalues of
ΠA lie strictly within the unit circle.† Let Q̄ be the matrix which is identical
to Q except for the ith row, which is identical to the ith row of |A|. From
the irreducibility of Q, it follows that for any i_1 ≠ i it is possible to find a
sequence of nonzero components Q̄_{i_1 i_2}, …, Q̄_{i_{k−1} i_k}, Q̄_{i_k i} that "lead" from
i_1 to i. Using a well-known result, we have Q̄^t → 0. Since |A| ≤ Q̄, we
also have |A|^t → 0, and hence also A^t → 0 (since |A^t| ≤ |A|^t). Thus, all
eigenvalues of A are strictly within the unit circle. We next observe that
from the proof argument under conditions (1) and (2), we have

    ‖ΠAz‖_ξ ≤ ‖z‖_ξ,    ∀ z ∈ ℜ^n,

so the eigenvalues of ΠA cannot lie outside the unit circle.

Assume to arrive at a contradiction that ν is an eigenvalue of ΠA
with |ν| = 1, and let ζ be a corresponding eigenvector. We claim that Aζ
must have both real and imaginary components in the subspace S. If this
were not so, we would have Aζ ≠ ΠAζ, so that

    ‖Aζ‖_ξ > ‖ΠAζ‖_ξ = ‖νζ‖_ξ = |ν| ‖ζ‖_ξ = ‖ζ‖_ξ,

which contradicts the fact ‖Az‖_ξ ≤ ‖z‖_ξ for all z, shown earlier. Thus,
the real and imaginary components of Aζ are in S, which implies that
Aζ = ΠAζ = νζ, so that ν is an eigenvalue of A. This is a contradiction
because |ν| = 1, while the eigenvalues of A are strictly within the unit
circle. Q.E.D.
Note that the preceding proof has shown that under conditions (1) and (2) of Prop. 6.8.2, T and ΠT are contraction mappings with respect to the specific norm ‖·‖_ξ, and that under condition (1), the modulus of contraction is α. Furthermore, Q need not be irreducible under these conditions – it is sufficient that Q has no transient states (so that it has a steady-state probability vector ξ with positive components). Under condition (3), T and ΠT need not be contractions with respect to ‖·‖_ξ. For a counterexample, take a_{i,i+1} = 1 for i = 1, . . . , n − 1, and a_{n,1} = 1/2, with every other entry of A equal to 0. Take also q_{i,i+1} = 1 for i = 1, . . . , n − 1, and q_{n,1} = 1, with every other entry of Q equal to 0, so ξ_i = 1/n for all i. Then for z = (0, 1, . . . , 1)′ we have Az = (1, . . . , 1, 0)′ and ‖Az‖_ξ = ‖z‖_ξ, so A is not a contraction with respect to ‖·‖_ξ. Taking S to be the entire space ℜ^n, we see that the same is true for ΠA.

† In the preceding argument, the projection Πz of a complex vector z is obtained by separately projecting the real and the imaginary components of z on S. The projection norm for a complex vector x + iy is defined by

‖x + iy‖_ξ = √(‖x‖²_ξ + ‖y‖²_ξ).
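The counterexample just described is easy to verify numerically; the following sketch (NumPy) reproduces it for n = 5 and confirms that ‖Az‖_ξ = ‖z‖_ξ for z = (0, 1, . . . , 1)′.

```python
import numpy as np

# The counterexample after Prop. 6.8.2: a_{i,i+1} = 1, a_{n,1} = 1/2,
# q_{i,i+1} = 1, q_{n,1} = 1, so xi_i = 1/n and condition (3) holds at i = n.
n = 5
A = np.zeros((n, n))
Q = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = 1.0
    Q[i, i + 1] = 1.0
A[n - 1, 0] = 0.5
Q[n - 1, 0] = 1.0
xi = np.full(n, 1.0 / n)               # steady-state vector of the cycle chain Q

def norm_xi(z):
    return np.sqrt(np.sum(xi * z ** 2))

z = np.array([0.0] + [1.0] * (n - 1))  # z = (0, 1, ..., 1)'
print(np.allclose(A @ z, np.array([1.0] * (n - 1) + [0.0])))  # Az = (1,...,1,0)'
print(np.isclose(norm_xi(A @ z), norm_xi(z)))  # equality: no xi-norm contraction
```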
When the row sums of |A| are no greater than one, one can construct Q with |A| ≤ Q by adding another matrix to |A|:

Q = |A| + Diag(e − |A|e)R, (6.199)

where R is a transition probability matrix, e is the unit vector that has all components equal to 1, and Diag(e − |A|e) is the diagonal matrix with 1 − Σ_{m=1}^n |a_im|, i = 1, . . . , n, on the diagonal. Then the row sum deficit of the ith row of A is distributed to the columns j according to fractions r_ij, the components of R.
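A minimal sketch of the construction (6.199), with a randomly generated A and the uniform transition matrix as R (both made up):

```python
import numpy as np

# Eq. (6.199): given A whose rows of |A| sum to at most 1, distribute each
# row-sum deficit over the columns using a transition probability matrix R.
n = 4
rng = np.random.default_rng(1)
A = rng.uniform(-0.2, 0.2, size=(n, n))   # row sums of |A| are well below 1
R = np.full((n, n), 1.0 / n)              # any transition probability matrix
e = np.ones(n)

absA = np.abs(A)
Q = absA + np.diag(e - absA @ e) @ R      # Q = |A| + Diag(e - |A|e) R

print(np.allclose(Q @ e, e))              # rows of Q sum to 1
print(np.all(Q >= absA - 1e-12))          # |A| <= Q componentwise
```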
The next proposition uses different assumptions than Prop. 6.8.2, and applies to cases where there is no special index i such that Σ_{j=1}^n |a_ij| < 1. In fact A may itself be a transition probability matrix, so that I − A need not be invertible, and the original system may have multiple solutions; see the subsequent Example 6.8.2. The proposition suggests the use of a damped version of the T mapping in various methods (compare with Section 6.7 and the average cost case for λ = 0).
Proposition 6.8.3: Assume that there are no transient states corresponding to Q, that ξ is a steady-state probability vector of Q, and that |A| ≤ Q. Assume further that I − ΠA is invertible. Then the mapping ΠT_γ, where

T_γ = (1 − γ)I + γT,

is a contraction with respect to ‖·‖_ξ for all γ ∈ (0, 1).
Proof: The argument of the proof of Prop. 6.8.2 shows that the condition |A| ≤ Q implies that A is nonexpansive with respect to the norm ‖·‖_ξ. Furthermore, since I − ΠA is invertible, we have z ≠ ΠAz for all z ≠ 0. Hence for all γ ∈ (0, 1) and z ∈ ℜ^n,

‖(1−γ)z + γΠAz‖_ξ < (1−γ)‖z‖_ξ + γ‖ΠAz‖_ξ ≤ (1−γ)‖z‖_ξ + γ‖z‖_ξ = ‖z‖_ξ, (6.200)

where the strict inequality follows from the strict convexity of the norm, and the weak inequality follows from the nonexpansiveness of ΠA. If we define

ρ_γ = sup{ ‖(1−γ)z + γΠAz‖_ξ | ‖z‖_ξ ≤ 1 },

and note that the supremum above is attained by Weierstrass' Theorem, we see that Eq. (6.200) yields ρ_γ < 1 and

‖(1−γ)z + γΠAz‖_ξ ≤ ρ_γ‖z‖_ξ, ∀ z ∈ ℜ^n.

From the definition of T_γ, we have for all x, y ∈ ℜ^n,

ΠT_γx − ΠT_γy = ΠT_γ(x − y) = (1−γ)Π(x − y) + γΠA(x − y) = (1−γ)Π(x − y) + γΠ(ΠA(x − y)),

so defining z = x − y, and using the preceding two relations and the nonexpansiveness of Π, we obtain

‖ΠT_γx − ΠT_γy‖_ξ = ‖(1−γ)Πz + γΠ(ΠAz)‖_ξ ≤ ‖(1−γ)z + γΠAz‖_ξ ≤ ρ_γ‖z‖_ξ = ρ_γ‖x − y‖_ξ,

for all x, y ∈ ℜ^n. Q.E.D.
Note that the mappings ΠT_γ and ΠT have the same fixed points, so under the assumptions of Prop. 6.8.3, there is a unique fixed point Φr* of ΠT. We now discuss examples of choices of ξ and Q in some special cases.
Example 6.8.1: (Discounted DP Problems and Exploration)

Bellman's equation for the cost vector of a stationary policy in an n-state discounted DP problem has the form x = T(x), where

T(x) = αPx + g,

g is the vector of single-stage costs associated with the n states, P is the transition probability matrix of the associated Markov chain, and α ∈ (0, 1) is the discount factor. If P is an irreducible Markov chain, and ξ is chosen to be its unique steady-state probability vector, the matrix inversion method based on Eq. (6.189) becomes LSTD(0). The methodology of the present section also allows row sampling/state sequence generation using a Markov chain other than P, with an attendant change in ξ, as discussed in the context of exploration-enhanced methods in Section 6.3.8.
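As an illustration of Example 6.8.1, the following sketch (synthetic P, g, Φ) solves the projected Bellman equation exactly, i.e., it computes the quantity that LSTD(0) estimates by simulation, and checks the fixed point property Φr* = ΠT(Φr*):

```python
import numpy as np

# Exact (non-simulation) counterpart of LSTD(0) for T(x) = alpha*P*x + g:
# solve Phi' Xi (I - alpha*P) Phi r = Phi' Xi g, with xi the steady-state
# probability vector of P. All problem data here is synthetic.
n, s, alpha = 6, 2, 0.9
rng = np.random.default_rng(2)
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
g = rng.random(n)
Phi = rng.random((n, s))

w, V = np.linalg.eig(P.T)
xi = np.real(V[:, np.argmax(np.real(w))]); xi /= xi.sum()
Xi = np.diag(xi)

C = Phi.T @ Xi @ (np.eye(n) - alpha * P) @ Phi
d = Phi.T @ Xi @ g
r = np.linalg.solve(C, d)             # what LSTD(0) estimates by simulation

# Fixed point check: Phi r equals the xi-weighted projection of T(Phi r).
Pi = Phi @ np.linalg.solve(Phi.T @ Xi @ Phi, Phi.T @ Xi)
print(np.allclose(Phi @ r, Pi @ (alpha * P @ (Phi @ r) + g)))
```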
Example 6.8.2: (Undiscounted DP Problems)

Consider the equation x = Ax + b, for the case where A is a substochastic matrix (a_ij ≥ 0 for all i, j and Σ_{j=1}^n a_ij ≤ 1 for all i). Here 1 − Σ_{j=1}^n a_ij may be viewed as a transition probability from state i to some absorbing state denoted 0. This is Bellman's equation for the cost vector of a stationary policy of a SSP. If the policy is proper in the sense that from any state i ≠ 0 there exists a path of positive probability transitions from i to the absorbing state 0, the matrix

Q = A + Diag(e − Ae)R

[cf. Eq. (6.199)] is irreducible, provided R has positive components. As a result, the conditions of Prop. 6.8.2 under condition (2) are satisfied, and T and ΠT are contractions with respect to ‖·‖_ξ. It is also possible to use a matrix R whose components are not all positive, as long as Q is irreducible, in which case Prop. 6.8.2 under condition (3) applies (cf. Prop. 6.7.1).

Consider also the equation x = Ax + b for the case where A is an irreducible transition probability matrix, with steady-state probability vector ξ. This is related to Bellman's equation for the differential cost vector of a stationary policy of an average cost DP problem involving a Markov chain with transition probability matrix A. Then, if the unit vector e is not contained in the subspace S spanned by the basis functions, the matrix I − ΠA is invertible, as shown in Section 6.7. As a result, Prop. 6.8.3 applies and shows that the mapping (1 − γ)I + γA is a contraction with respect to ‖·‖_ξ for all γ ∈ (0, 1) (cf. Section 6.7, Props. 6.7.1, 6.7.2).
The projected equation methodology of this section applies to general linear fixed point equations, where A need not have a probabilistic structure. A class of such equations where ΠA is a contraction is given in the following example, an important case where iterative methods are used for solving linear equations.
Example 6.8.3: (Weakly Diagonally Dominant Systems)

Consider the solution of the system

Cx = d,

where d ∈ ℜ^n and C is an n × n matrix that is weakly diagonally dominant, i.e., its components satisfy

c_ii ≠ 0, Σ_{j≠i} |c_ij| ≤ |c_ii|, i = 1, . . . , n. (6.201)

By dividing the ith row by c_ii, we obtain the equivalent system x = Ax + b, where the components of A and b are

a_ij = 0 if i = j, a_ij = −c_ij/c_ii if i ≠ j, b_i = d_i/c_ii, i = 1, . . . , n.

Then, from Eq. (6.201), we have

Σ_{j=1}^n |a_ij| = Σ_{j≠i} |c_ij|/|c_ii| ≤ 1, i = 1, . . . , n,
so Props. 6.8.2 and 6.8.3 may be used under the appropriate conditions. In particular, if the matrix Q given by Eq. (6.199) has no transient states and there exists an index i such that Σ_{j=1}^n |a_ij| < 1, Prop. 6.8.2 applies and shows that ΠT is a contraction.

Alternatively, instead of Eq. (6.201), assume the somewhat more restrictive condition

|1 − c_ii| + Σ_{j≠i} |c_ij| ≤ 1, i = 1, . . . , n, (6.202)

and consider the equivalent system x = Ax + b, where

A = I − C, b = d.

Then, from Eq. (6.202), we have

Σ_{j=1}^n |a_ij| = |1 − c_ii| + Σ_{j≠i} |c_ij| ≤ 1, i = 1, . . . , n,

so again Props. 6.8.2 and 6.8.3 apply under appropriate conditions.
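A quick numerical check of the second construction in Example 6.8.3 (synthetic data): condition (6.202) for C is, term by term, the statement that the row sums of |A| for A = I − C are at most 1.

```python
import numpy as np

# Condition (6.202) and the substochasticity of |A| for A = I - C, on a
# made-up matrix C close to the identity (hence diagonally dominant).
n = 5
rng = np.random.default_rng(3)
C = np.eye(n) + rng.uniform(-0.05, 0.05, size=(n, n))

# |1 - c_ii| + sum_{j != i} |c_ij| <= 1 for all i  [Eq. (6.202)]
off = np.abs(C) - np.diag(np.abs(np.diag(C)))  # |c_ij| with zeroed diagonal
cond = np.abs(1.0 - np.diag(C)) + off.sum(axis=1)
print(np.all(cond <= 1.0))

A = np.eye(n) - C
print(np.all(np.abs(A).sum(axis=1) <= 1.0 + 1e-12))  # row sums of |A| <= 1
```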
Let us finally address the question whether it is possible to find Q such that |A| ≤ Q and the corresponding Markov chain has no transient states or is irreducible. To this end, assume that Σ_{j=1}^n |a_ij| ≤ 1 for all i. If A is itself irreducible, then any Q such that |A| ≤ Q is also irreducible. Otherwise, consider the set

I = { i | Σ_{j=1}^n |a_ij| < 1 },

and assume that it is nonempty (otherwise the only possibility is Q = |A|). Let Ĩ be the set of indices i from which some index of I can be reached through a sequence of nonzero components a_{ij_1}, a_{j_1 j_2}, . . ., and let Î = { i | i ∉ I ∪ Ĩ } (we allow here the possibility that Ĩ or Î may be empty). Note that the square submatrix of |A| corresponding to Î is a transition probability matrix, and that we have a_ij = 0 for all i ∈ Î and j ∉ Î. Then it can be shown that there exists Q with |A| ≤ Q and no transient states if and only if the Markov chain corresponding to Î has no transient states. Furthermore, there exists an irreducible Q with |A| ≤ Q if and only if Î is empty.
6.8.4 Extension of Q-Learning for Optimal Stopping
If the mapping T is nonlinear (as for example in the case of multiple policies) the projected equation Φr = ΠT(Φr) is also nonlinear, and may have one or multiple solutions, or no solution at all. On the other hand, if ΠT is a contraction, there is a unique solution. We have seen in Section 6.4.3 a nonlinear special case of projected equation where ΠT is a contraction, namely optimal stopping. This case can be generalized as we now show.
Let us consider a system of the form

x = T(x) = Af(x) + b, (6.203)

where f : ℜ^n → ℜ^n is a mapping with scalar function components of the form f(x) = (f_1(x_1), . . . , f_n(x_n)). We assume that each of the mappings f_i : ℜ → ℜ is nonexpansive in the sense that

|f_i(x_i) − f_i(x̄_i)| ≤ |x_i − x̄_i|, ∀ i = 1, . . . , n, x_i, x̄_i ∈ ℜ. (6.204)

This guarantees that T is a contraction mapping with respect to any norm ‖·‖ with the property

‖y‖ ≤ ‖z‖ if |y_i| ≤ |z_i|, ∀ i = 1, . . . , n,

whenever A is a contraction with respect to that norm. Such norms include weighted l_1 and l_∞ norms, the norm ‖·‖_ξ, as well as any scaled Euclidean norm ‖x‖ = √(x′Dx), where D is a positive definite symmetric matrix with nonnegative components. Under the assumption (6.204), the theory of Section 6.8.2 applies and suggests appropriate choices of a Markov chain for simulation so that ΠT is a contraction.
As an example, consider the equation

x = T(x) = αPf(x) + b,

where P is an irreducible transition probability matrix with steady-state probability vector ξ, α ∈ (0, 1) is a scalar discount factor, and f is a mapping with components

f_i(x_i) = min{c_i, x_i}, i = 1, . . . , n, (6.205)

where c_i are some scalars. This is the Q-factor equation corresponding to a discounted optimal stopping problem with states i = 1, . . . , n, and a choice between two actions at each state i: stop at a cost c_i, or continue at a cost b_i and move to state j with probability p_ij. The optimal cost starting from state i is min{c_i, x*_i}, where x* is the fixed point of T. As a special case of Prop. 6.8.2, we obtain that ΠT is a contraction with respect to ‖·‖_ξ. Similar results hold in the case where αP is replaced by a matrix A satisfying condition (2) of Prop. 6.8.2, or the conditions of Prop. 6.8.3.
A version of the LSPE-type algorithm for solving the system (6.203), which extends the method of Section 6.4.3 for optimal stopping, may be used when ΠT is a contraction. In particular, the iteration

Φr_{k+1} = ΠT(Φr_k), k = 0, 1, . . . ,

takes the form

r_{k+1} = (Σ_{i=1}^n ξ_i φ(i)φ(i)′)^{−1} Σ_{i=1}^n ξ_i φ(i)(Σ_{j=1}^n a_ij f_j(φ(j)′r_k) + b_i),
and is approximated by

r_{k+1} = (Σ_{t=0}^k φ(i_t)φ(i_t)′)^{−1} Σ_{t=0}^k φ(i_t)((a_{i_t j_t}/p_{i_t j_t}) f_{j_t}(φ(j_t)′r_k) + b_{i_t}). (6.206)

Here, as before, {i_0, i_1, . . .} is a state sequence, and {(i_0, j_0), (i_1, j_1), . . .} is a transition sequence satisfying Eqs. (6.186) and (6.188) with probability 1. The justification of this approximation is very similar to the ones given so far, and will not be discussed further. Diagonally scaled versions of this iteration are also possible.
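A rough simulation sketch of iteration (6.206) for the optimal stopping case x = αPf(x) + b, with f_i(x_i) = min{c_i, x_i}, is given below. All problem data is synthetic; transitions are sampled from P itself, so the ratios a_{i_t j_t}/p_{i_t j_t} reduce to α, and a small ridge term keeps the matrix inversion well posed in the early iterations. The terms f_{j_t}(φ(j_t)′r_k) are recomputed over all past samples at every step, which is exactly the overhead noted in the text.

```python
import numpy as np

# Sketch of iteration (6.206) for optimal stopping (synthetic data).
n, s, alpha = 10, 3, 0.9
rng = np.random.default_rng(4)
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
b = rng.random(n)
c = rng.random(n) * 5.0
Phi = rng.random((n, s))

K = 2000
idx_i = np.zeros(K, dtype=int); idx_j = np.zeros(K, dtype=int)
i = 0
r = np.zeros(s)
for k in range(K):
    j = rng.choice(n, p=P[i])                  # transition i -> j from P
    idx_i[k], idx_j[k] = i, j
    Ii, Jj = idx_i[:k + 1], idx_j[:k + 1]
    M = Phi[Ii].T @ Phi[Ii]
    # f_{j_t}(phi(j_t)' r_k) is re-evaluated for all past samples t = 0..k
    v = Phi[Ii].T @ (alpha * np.minimum(c[Jj], Phi[Jj] @ r) + b[Ii])
    r = np.linalg.solve(M + 1e-6 * np.eye(s), v)
    i = j                                      # state sequence follows the chain

# Compare with the high-dimensional fixed point found by value iteration.
x = np.zeros(n)
for _ in range(2000):
    x = alpha * P @ np.minimum(c, x) + b
print(np.linalg.norm(Phi @ r - x) / np.linalg.norm(x))  # small if S approximates x*
```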
A difficulty with iteration (6.206) is that the terms f_{j_t}(φ(j_t)′r_k) must be computed for all t = 0, . . . , k, at every step k, thereby resulting in significant overhead. The methods to bypass this difficulty in the case of optimal stopping, discussed at the end of Section 6.4.3, can be extended to the more general context considered here.
Let us finally consider the case where instead of A = αP, the matrix A satisfies condition (2) of Prop. 6.8.2, or the conditions of Prop. 6.8.3. The case where Σ_{j=1}^n |a_ij| < 1 for some index i, and 0 ≤ A ≤ Q, where Q is an irreducible transition probability matrix, corresponds to an undiscounted optimal stopping problem where the stopping state will be reached from all other states with probability 1, even without applying the stopping action. In this case, from Prop. 6.8.2 under condition (3), it follows that ΠA is a contraction with respect to some norm, and hence I − ΠA is invertible. Using this fact, it can be shown by modifying the proof of Prop. 6.8.3 that the mapping ΠT_γ, where

T_γ(x) = (1 − γ)x + γT(x),

is a contraction with respect to ‖·‖_ξ for all γ ∈ (0, 1). Thus, ΠT_γ has a unique fixed point, which must also be the unique fixed point of ΠT (since ΠT and ΠT_γ have the same fixed points).
In view of the contraction property of ΠT_γ, the "damped" PVI iteration

Φr_{k+1} = (1 − γ)Φr_k + γΠT(Φr_k)

converges to the unique fixed point of ΠT and takes the form

r_{k+1} = (1−γ)r_k + γ(Σ_{i=1}^n ξ_i φ(i)φ(i)′)^{−1} Σ_{i=1}^n ξ_i φ(i)(Σ_{j=1}^n a_ij f_j(φ(j)′r_k) + b_i).
As earlier, it can be approximated by the LSPE iteration

r_{k+1} = (1−γ)r_k + γ(Σ_{t=0}^k φ(i_t)φ(i_t)′)^{−1} Σ_{t=0}^k φ(i_t)((a_{i_t j_t}/p_{i_t j_t}) f_{j_t}(φ(j_t)′r_k) + b_{i_t})

[cf. Eq. (6.206)].
6.8.5 Bellman Equation Error-Type Methods
We will now consider an alternative approach for approximate solution of the linear equation x = T(x) = b + Ax, based on finding a vector r that minimizes

‖Φr − T(Φr)‖²_ξ,

or

Σ_{i=1}^n ξ_i (φ(i)′r − Σ_{j=1}^n a_ij φ(j)′r − b_i)²,

where ξ is a distribution with positive components. In the DP context where the equation x = T(x) is the Bellman equation for a fixed policy, this is known as the Bellman equation error approach (see [BeT96], Section 6.10, for a detailed discussion of this case, and the more complicated nonlinear case where T involves minimization over multiple policies). We assume that the matrix (I − A)Φ has rank s, which guarantees that the vector r* that minimizes the weighted sum of squared errors is unique.
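For the linear case this minimization is an ordinary weighted least squares problem over the matrix (I − A)Φ; a minimal sketch (synthetic data, uniform ξ) is:

```python
import numpy as np

# Minimize ||Phi r - T(Phi r)||^2_xi for linear T(x) = b + A x, i.e., a
# weighted least squares problem on (I - A)Phi (assumed to have rank s).
n, s = 8, 3
rng = np.random.default_rng(5)
A = rng.uniform(-0.1, 0.1, size=(n, n))
b = rng.random(n)
Phi = rng.random((n, s))
xi = np.full(n, 1.0 / n)

# min_r || sqrt(Xi) ((I - A) Phi r - b) ||_2
B = (np.eye(n) - A) @ Phi
W = np.sqrt(xi)[:, None]
r_star, *_ = np.linalg.lstsq(W * B, np.sqrt(xi) * b, rcond=None)

# The residual (I - A)Phi r - b is Xi-orthogonal to the columns of (I - A)Phi.
resid = B @ r_star - b
print(np.allclose(B.T @ (xi * resid), 0.0, atol=1e-10))
```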
We note that the equation error approach is related to the projected equation approach. To see this, consider the case where ξ is the uniform distribution, so the problem is to minimize

‖Φr − (b + AΦr)‖², (6.207)

where ‖·‖ is the standard Euclidean norm. By setting the gradient to 0, we see that a necessary and sufficient condition for optimality is

Φ′(I − A)′(Φr* − T(Φr*)) = 0,

or equivalently,

Φ′(Φr* − T̂(Φr*)) = 0,

where

T̂(x) = T(x) + A′(x − T(x)).

Thus minimization of the equation error (6.207) is equivalent to solving the projected equation

Φr = ΠT̂(Φr),

where Π denotes projection with respect to the standard Euclidean norm. A similar conversion is possible when ξ is a general distribution with positive components.
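This equivalence is easy to confirm numerically; the sketch below (synthetic data, uniform ξ) computes the equation error minimizer and checks that Φr* is a fixed point of ΠT̂:

```python
import numpy as np

# For uniform xi, the equation error minimizer r* solves the projected
# equation Phi r = Pi T_hat(Phi r), with T_hat(x) = T(x) + A'(x - T(x)).
n, s = 7, 2
rng = np.random.default_rng(6)
A = rng.uniform(-0.1, 0.1, size=(n, n))
b = rng.random(n)
Phi = rng.random((n, s))

B = (np.eye(n) - A) @ Phi
r_star, *_ = np.linalg.lstsq(B, b, rcond=None)  # min ||Phi r - T(Phi r)||^2

x = Phi @ r_star
T_x = b + A @ x
T_hat_x = T_x + A.T @ (x - T_x)
Pi = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)  # Euclidean projection onto S
print(np.allclose(x, Pi @ T_hat_x))
```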
Error bounds analogous to the projected equation bounds of Eqs. (6.182) and (6.183) can be developed for the equation error approach, assuming that I − A is invertible and x* is the unique solution. In particular, let r̃ minimize ‖Φr − T(Φr)‖²_ξ. Then

x* − Φr̃ = Tx* − T(Φr̃) + T(Φr̃) − Φr̃ = A(x* − Φr̃) + T(Φr̃) − Φr̃,

so that

x* − Φr̃ = (I − A)^{−1}(T(Φr̃) − Φr̃).

Thus, we obtain

‖x* − Φr̃‖_ξ ≤ ‖(I − A)^{−1}‖_ξ ‖Φr̃ − T(Φr̃)‖_ξ
≤ ‖(I − A)^{−1}‖_ξ ‖Πx* − T(Πx*)‖_ξ
= ‖(I − A)^{−1}‖_ξ ‖Πx* − x* + Tx* − T(Πx*)‖_ξ
= ‖(I − A)^{−1}‖_ξ ‖(I − A)(Πx* − x*)‖_ξ
≤ ‖(I − A)^{−1}‖_ξ ‖I − A‖_ξ ‖x* − Πx*‖_ξ,

where the second inequality holds because r̃ minimizes ‖Φr − T(Φr)‖²_ξ. In the case where T is a contraction mapping with respect to the norm ‖·‖_ξ, with modulus α ∈ (0, 1), a similar calculation yields

‖x* − Φr̃‖_ξ ≤ ((1 + α)/(1 − α)) ‖x* − Πx*‖_ξ.
The vector r* that minimizes ‖Φr − T(Φr)‖²_ξ satisfies the corresponding necessary optimality condition

Σ_{i=1}^n ξ_i (φ(i) − Σ_{j=1}^n a_ij φ(j))(φ(i) − Σ_{j=1}^n a_ij φ(j))′ r* = Σ_{i=1}^n ξ_i (φ(i) − Σ_{j=1}^n a_ij φ(j)) b_i. (6.208)
To obtain a simulation-based approximation to Eq. (6.208), without requiring the calculation of row sums of the form Σ_{j=1}^n a_ij φ(j), we introduce an additional sequence of transitions {(i_0, j′_0), (i_1, j′_1), . . .} (see Fig. 6.8.2), which is generated according to the transition probabilities p_ij of the Markov chain, and is "independent" of the sequence {(i_0, j_0), (i_1, j_1), . . .} in the sense that with probability 1,

lim_{t→∞} (Σ_{k=0}^t δ(i_k = i, j_k = j))/(Σ_{k=0}^t δ(i_k = i)) = lim_{t→∞} (Σ_{k=0}^t δ(i_k = i, j′_k = j))/(Σ_{k=0}^t δ(i_k = i)) = p_ij, (6.209)
for all i, j = 1, . . . , n, and

lim_{t→∞} (Σ_{k=0}^t δ(i_k = i, j_k = j, j′_k = j′))/(Σ_{k=0}^t δ(i_k = i)) = p_ij p_ij′, (6.210)
for all i, j, j′ = 1, . . . , n. At time t, we form the linear equation

Σ_{k=0}^t (φ(i_k) − (a_{i_k j_k}/p_{i_k j_k}) φ(j_k))(φ(i_k) − (a_{i_k j′_k}/p_{i_k j′_k}) φ(j′_k))′ r = Σ_{k=0}^t (φ(i_k) − (a_{i_k j_k}/p_{i_k j_k}) φ(j_k)) b_{i_k}. (6.211)

Similar to our earlier analysis, it can be seen that this is a valid approximation to Eq. (6.208).
Figure 6.8.2. A possible simulation mechanism for minimizing the equation error norm [cf. Eq. (6.211)]. We generate a sequence of states {i_0, i_1, . . .} according to the distribution ξ, by simulating a single infinitely long sample trajectory of the chain. Simultaneously, we generate two independent sequences of transitions, {(i_0, j_0), (i_1, j_1), . . .} and {(i_0, j′_0), (i_1, j′_1), . . .}, according to the transition probabilities p_ij, so that Eqs. (6.209) and (6.210) are satisfied.
Note a disadvantage of this approach relative to the projected equation approach (cf. Section 6.8.1): it is necessary to generate two sequences of transitions (rather than one). Moreover, both of these sequences enter Eq. (6.211), which thus contains more simulation noise than its projected equation counterpart [cf. Eq. (6.189)].
Let us finally note that the equation error approach can be generalized to yield a simulation-based method for solving the general linear least squares problem

min_r Σ_{i=1}^n ξ_i (c_i − Σ_{j=1}^m q_ij φ(j)′r)²,

where q_ij are the components of an n × m matrix Q, and c_i are the components of a vector c ∈ ℜ^n. In particular, one may write the corresponding optimality condition [cf. Eq. (6.208)] and then approximate it by simulation [cf. Eq. (6.211)]; see [BeY09], and [WPB09], [PWB09], which also discuss a regression-based approach to deal with nearly singular problems (cf. the regression-based LSTD method of Section 6.3.4). Conversely, one may consider a selected set I of states of moderate size, and find r* that minimizes the sum of squared Bellman equation errors only for these states:

r* ∈ arg min_{r∈ℜ^s} Σ_{i∈I} ξ_i (φ(i)′r − Σ_{j=1}^n a_ij φ(j)′r − b_i)².

This least squares problem may be solved by conventional (non-simulation) methods.
An interesting question is how the approach of this section compares
with the projected equation approach in terms of approximation error. No
deﬁnitive answer seems possible, and examples where one approach gives
better results than the other have been constructed. Reference [Ber95]
shows that in the example of Exercise 6.9, the projected equation approach
gives worse results. For an example where the projected equation approach
may be preferable, see Exercise 6.11.
Approximate Policy Iteration with Bellman Equation Error Evaluation

When the Bellman equation error approach is used in conjunction with approximate policy iteration in a DP context, it is susceptible to chattering and oscillation just as much as the projected equation approach (cf. Section 6.3.6). The reason is that both approaches operate within the same greedy partition, and oscillate when there is a cycle of policies μ^k, μ^{k+1}, . . . , μ^{k+m} with

r_{μ^k} ∈ R_{μ^{k+1}}, r_{μ^{k+1}} ∈ R_{μ^{k+2}}, . . . , r_{μ^{k+m−1}} ∈ R_{μ^{k+m}}, r_{μ^{k+m}} ∈ R_{μ^k}

(cf. Fig. 6.3.4). The only difference is that the weight vector r_μ of a policy μ is calculated differently (by solving a least-squares Bellman error problem versus solving a projected equation). In practice the weights calculated by the two approaches may differ somewhat, but generally not enough to cause dramatic changes in qualitative behavior. Thus, much of our discussion of optimistic policy iteration in Sections 6.3.5-6.3.6 applies to the Bellman equation error approach as well.
Example 6.3.2 (continued)

Let us return to Example 6.3.2 where chattering occurs when r_μ is evaluated using the projected equation. When the Bellman equation error approach is used instead, the greedy partition remains the same (cf. Fig. 6.3.6), the weight of policy μ is r_μ = 0 (as in the projected equation case), and for p ≈ 1, the weight of policy μ* can be calculated to be

r_{μ*} ≈ c / ((1 − α)((1 − α)² + (2 − α)²))

[which is almost the same as the weight c/(1 − α) obtained in the projected equation case]. Thus with both approaches we have oscillation between μ and μ* in approximate policy iteration, and chattering in optimistic versions, with very similar iterates.
6.8.6 Oblique Projections
Some of the preceding methodology regarding projected equations can be generalized to the case where the projection operator Π is oblique (i.e., it is not a projection with respect to the weighted Euclidean norm; see e.g., Saad [Saa03]). Such projections have the form

Π = Φ(Ψ′ΞΦ)^{−1}Ψ′Ξ, (6.212)

where as before, Ξ is the diagonal matrix with the components ξ_1, . . . , ξ_n of a positive distribution vector ξ along the diagonal, Φ is an n × s matrix of rank s, and Ψ is an n × s matrix of rank s. The earlier case corresponds to Ψ = Φ. Two characteristic properties of Π as given by Eq. (6.212) are that its range is the subspace S = {Φr | r ∈ ℜ^s} and that it is idempotent, i.e., Π² = Π. Conversely, a matrix Π with these two properties can be shown to have the form (6.212) for some n × s matrix Ψ of rank s and a diagonal matrix Ξ with the components ξ_1, . . . , ξ_n of a positive distribution vector ξ along the diagonal. Oblique projections arise in a variety of interesting contexts, for which we refer to the literature.
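The two characteristic properties just stated are easy to verify numerically for randomly generated Ψ, Φ, and ξ (all made up):

```python
import numpy as np

# Check of Eq. (6.212): the oblique projection Pi is idempotent and fixes
# every element of S = {Phi r}. Psi, Phi, xi are arbitrary synthetic data.
n, s = 6, 2
rng = np.random.default_rng(8)
Phi = rng.random((n, s))
Psi = rng.random((n, s))
xi = rng.random(n); xi /= xi.sum()
Xi = np.diag(xi)

Pi = Phi @ np.linalg.solve(Psi.T @ Xi @ Phi, Psi.T @ Xi)  # Eq. (6.212)

print(np.allclose(Pi @ Pi, Pi))              # idempotent: Pi^2 = Pi
r = rng.random(s)
print(np.allclose(Pi @ (Phi @ r), Phi @ r))  # range of Pi contains S pointwise
```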
Let us now consider the generalized projected equation

Φr = ΠT(Φr) = Π(b + AΦr). (6.213)

Using Eq. (6.212) and the fact that Φ has rank s, it can be written as

r = (Ψ′ΞΦ)^{−1}Ψ′Ξ(b + AΦr),

or equivalently Ψ′ΞΦr = Ψ′Ξ(b + AΦr), which can be finally written as

Cr = d,

where

C = Ψ′Ξ(I − A)Φ, d = Ψ′Ξb. (6.214)
These equations should be compared to the corresponding equations for the Euclidean projection case where Ψ = Φ [cf. Eq. (6.184)].

It is clear that row and column sampling can be adapted to provide simulation-based estimates C_k and d_k of C and d, respectively. The corresponding equations have the form [cf. Eq. (6.189)]

C_k = (1/(k+1)) Σ_{t=0}^k ψ(i_t)(φ(i_t) − (a_{i_t j_t}/p_{i_t j_t}) φ(j_t))′, d_k = (1/(k+1)) Σ_{t=0}^k ψ(i_t) b_{i_t}, (6.215)

where ψ(i)′ is the ith row of Ψ. The sequence of vectors C_k^{−1} d_k converges with probability one to the solution C^{−1}d of the projected equation, assuming that C is nonsingular. For cases where C_k is nearly singular, the regression/regularization-based estimate (6.191) may be used. The corresponding iterative method is

r_{k+1} = (C_k′ Σ_k^{−1} C_k + βI)^{−1}(C_k′ Σ_k^{−1} d_k + βr_k),

and can be shown to converge with probability one to C^{−1}d.
An example where oblique projections arise in DP is aggregation/discretization with a coarse grid [cases (c) and (d) in Section 6.5, with the aggregate states corresponding to some distinct representative states {x_1, . . . , x_s} of the original problem; also Example 6.5.1]. Then the aggregation equation for a discounted problem has the form

Φr = ΦD(b + αPΦr), (6.216)

where the rows of D are unit vectors (have a single component equal to 1, corresponding to a representative state, and all other components equal to 0), and the rows of Φ are probability distributions, with the rows corresponding to the representative states x_k having a single unit component, Φ_{x_k x_k} = 1, k = 1, . . . , s. Then the matrix DΦ can be seen to be the identity, so we have ΦDΦD = ΦD and it follows that ΦD is an oblique projection. The conclusion is that the aggregation equation (6.216) in the special case of coarse grid discretization is the projected equation (6.213), with the oblique projection Π = ΦD.
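A small sketch of the coarse-grid case (a made-up 5-state example with representative states 0 and 3), confirming that DΦ = I and that ΦD is idempotent:

```python
import numpy as np

# Coarse-grid aggregation: n = 5 states, s = 2 representative states {0, 3}.
# Rows of D are unit vectors at the representative states; rows of Phi are
# probability distributions over {0, 3}, with unit rows at 0 and 3 themselves.
n, s = 5, 2
rep = [0, 3]
D = np.zeros((s, n))
for k, x in enumerate(rep):
    D[k, x] = 1.0                     # disaggregation: unit vector rows

Phi = np.array([[1.0, 0.0],           # state 0 is representative
                [0.6, 0.4],
                [0.3, 0.7],
                [0.0, 1.0],           # state 3 is representative
                [0.8, 0.2]])

print(np.allclose(D @ Phi, np.eye(s)))   # D Phi = I
PhiD = Phi @ D
print(np.allclose(PhiD @ PhiD, PhiD))    # Phi D is idempotent, i.e., a projection
```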
6.8.7 Generalized Aggregation by Simulation
We will finally discuss the simulation-based iterative solution of a general system of equations of the form

r = DT(Φr), (6.217)

where T : ℜ^n → ℜ^m is a (possibly nonlinear) mapping, D is an s × m matrix, and Φ is an n × s matrix. We can regard the system (6.217) as an approximation to a system of the form

x = T(x). (6.218)

In particular, the variables x_i of the system (6.218) are approximated by linear combinations of the variables r_j of the system (6.217), using the rows of Φ. Furthermore, the components of the mapping DT are obtained by linear combinations of the components of T, using the rows of D. Thus we may view the system (6.217) as being obtained by aggregation/linear combination of the variables and the equations of the system (6.218).
We have encountered equations of the form (6.217) in our discussion of Q-learning (Section 6.4) and aggregation (Section 6.5). For example the Q-learning mapping

(FQ)(i, u) = Σ_{j=1}^n p_ij(u)(g(i, u, j) + α min_{v∈U(j)} Q(j, v)), ∀ (i, u), (6.219)

[cf. Eq. (6.100)] is of the form (6.217), where r = Q, the dimensions s and n are equal to the number of state-control pairs (i, u), the dimension m is the number of state-control-next state triples (i, u, j), the components of D are the appropriate probabilities p_ij(u), Φ is the identity, and T is the nonlinear mapping that transforms Q to the vector with a component g(i, u, j) + α min_{v∈U(j)} Q(j, v) for each (i, u, j).
As another example, the aggregation mapping

(FR)(x) = Σ_{i=1}^n d_{xi} min_{u∈U(i)} Σ_{j=1}^n p_ij(u)(g(i, u, j) + α Σ_{y∈S} φ_{jy} R(y)), x ∈ S, (6.220)

[cf. Eq. (6.140)] is of the form (6.217), where r = R, the dimension s is equal to the number of aggregate states x, m = n is the number of states i, and the matrices D and Φ consist of the disaggregation and aggregation probabilities, respectively.
As a third example, consider the following Bellman's equation over the space of post-decision states m [cf. Eq. (6.11)]:

V(m) = Σ_{j=1}^n q(m, j) min_{u∈U(j)} (g(j, u) + αV(f(j, u))), ∀ m. (6.221)

This equation is of the form (6.217), where r = V, the dimension s is equal to the number of post-decision states x, m = n is the number of (pre-decision) states i, the matrix D consists of the probabilities q(m, j), and Φ is the identity matrix.

There are also versions of the preceding examples, which involve evaluation of a single policy, in which case there is no minimization in Eqs. (6.219)-(6.221), and the corresponding mapping T is linear. We will now consider separately the cases where T is linear and where T is nonlinear. For the linear case, we will give an LSTD-type method, while for the nonlinear case (where the LSTD approach does not apply), we will discuss iterative methods under some contraction assumptions on T, D, and Φ.
The Linear Case

Let T be linear, so the equation r = DT(Φr) has the form

r = D(b + AΦr), (6.222)

where A is an m × n matrix, and b ∈ ℜ^m. We can thus write this equation as

Er = f,

where

E = I − DAΦ, f = Db.

As in the case of projected equations (cf. Section 6.8.1), we can use low-dimensional simulation to approximate E and f based on row and column sampling. One way to do this is to introduce for each row index i = 1, . . . , m, a distribution {p_ij | j = 1, . . . , n} with the property

p_ij > 0 if a_ij ≠ 0,

and to obtain a sample sequence {(i_0, j_0), (i_1, j_1), . . .}. We do so by first generating a sequence of row indices {i_0, i_1, . . .} through sampling according to some distribution {ξ_i | i = 1, . . . , m}, and then by generating for each t the column index j_t by sampling according to the distribution {p_{i_t j} | j = 1, . . . , n}. There are also alternative schemes, in which we first sample rows of D and then generate rows of A, along the lines discussed in Section 6.5.2 (see also Exercise 6.14).
Given the first k + 1 samples, we form the matrix Ê_k and vector f̂_k given by

Ê_k = I − (1/(k+1)) Σ_{t=0}^k (a_{i_t j_t}/(ξ_{i_t} p_{i_t j_t})) d(i_t)φ(j_t)′, f̂_k = (1/(k+1)) Σ_{t=0}^k (1/ξ_{i_t}) d(i_t) b_{i_t},
where d(i) is the ith column of D and φ(j)′ is the jth row of Φ. By using the expressions

E = I − Σ_{i=1}^m Σ_{j=1}^n a_ij d(i)φ(j)′, f = Σ_{i=1}^m d(i) b_i,

and law of large numbers arguments, it can be shown that Ê_k → E and f̂_k → f, similar to the case of projected equations. In particular, we can write

f̂_k = Σ_{i=1}^m (Σ_{t=0}^k δ(i_t = i)/(k + 1))(1/ξ_i) d(i) b_i,

and since

Σ_{t=0}^k δ(i_t = i)/(k + 1) → ξ_i,

we have

f̂_k → Σ_{i=1}^m d(i) b_i = Db.
Similarly, we can write

(1/(k+1)) Σ_{t=0}^k (a_{i_t j_t}/(ξ_{i_t} p_{i_t j_t})) d(i_t)φ(j_t)′ = Σ_{i=1}^m Σ_{j=1}^n (Σ_{t=0}^k δ(i_t = i, j_t = j)/(k + 1))(a_ij/(ξ_i p_ij)) d(i)φ(j)′,

and since

Σ_{t=0}^k δ(i_t = i, j_t = j)/(k + 1) → ξ_i p_ij,

we have

Ê_k → I − Σ_{i=1}^m Σ_{j=1}^n a_ij d(i)φ(j)′ = E.
The convergence Ê_k → E and f̂_k → f implies in turn that Ê_k^{−1} f̂_k converges to the solution of the equation r = D(b + AΦr). There is also a regression-based version of this method that is suitable for the case where Ê_k is nearly singular (cf. Section 6.3.4), and an iterative LSPE-type method that works even when Ê_k is singular [cf. Eq. (6.61)].
The Nonlinear Case

Consider now the case where T is nonlinear and has the contraction property

‖T(x) − T(x̄)‖_∞ ≤ α‖x − x̄‖_∞, ∀ x, x̄ ∈ ℜ^n,

where α is a scalar with 0 < α < 1 and ‖·‖_∞ denotes the sup-norm. Furthermore, let the components of the matrices D and Φ satisfy

Σ_{i=1}^m |d_ℓi| ≤ 1, ∀ ℓ = 1, . . . , s,

and

Σ_{ℓ=1}^s |φ_jℓ| ≤ 1, ∀ j = 1, . . . , n.

These assumptions imply that D and Φ are nonexpansive in the sense that

‖Dx‖_∞ ≤ ‖x‖_∞, ∀ x ∈ ℜ^m,

‖Φy‖_∞ ≤ ‖y‖_∞, ∀ y ∈ ℜ^s,

so that DTΦ is a sup-norm contraction with modulus α, and the equation r = DT(Φr) has a unique solution, denoted r*.
The ideas underlying the Q-learning algorithm and its analysis (cf. Section 6.4.1) can be extended to provide a simulation-based algorithm for solving the equation r = DT(Φr). This algorithm contains as a special case the iterative aggregation algorithm (6.147), as well as other algorithms of interest in DP, such as, for example, Q-learning and aggregation-type algorithms for stochastic shortest path problems, and for problems involving post-decision states.
As in Q-learning, the starting point of the algorithm is the fixed point iteration

    r_{k+1} = DT(\Phi r_k).

This iteration is guaranteed to converge to r*, and the same is true for asynchronous versions where only one component of r is updated at each iteration (this is due to the sup-norm contraction property of DTΦ). To obtain a simulation-based approximation of DT, we introduce an s × m matrix D̄ whose rows are m-dimensional probability distributions with components d̄_{ℓi} satisfying

    \bar{d}_{\ell i} > 0 \quad \text{if } d_{\ell i} \ne 0, \qquad \ell = 1, \ldots, s, \quad i = 1, \ldots, m.
The ℓth component of the vector DT(Φr) can be written as an expected value with respect to this distribution:

    \sum_{i=1}^m d_{\ell i} T_i(\Phi r) = \sum_{i=1}^m \bar{d}_{\ell i} \left( \frac{d_{\ell i}}{\bar{d}_{\ell i}}\, T_i(\Phi r) \right),    (6.223)

where T_i is the ith component of T. This expected value is approximated by simulation in the algorithm that follows.
The algorithm generates a sequence of indices {ℓ_0, ℓ_1, ...} according to some mechanism that ensures that all indices ℓ = 1, ..., s, are generated infinitely often. Given ℓ_k, an index i_k ∈ {1, ..., m} is generated according to the probabilities d̄_{ℓ_k i}, independently of preceding indices. Then the components of r_k, denoted r_k(ℓ), ℓ = 1, ..., s, are updated using the following iteration:

    r_{k+1}(\ell) = \begin{cases}
    (1 - \gamma_k)\, r_k(\ell) + \gamma_k \dfrac{d_{\ell i_k}}{\bar{d}_{\ell i_k}}\, T_{i_k}(\Phi r_k) & \text{if } \ell = \ell_k, \\
    r_k(\ell) & \text{if } \ell \ne \ell_k,
    \end{cases}
where γ_k > 0 is a stepsize that diminishes to 0 at an appropriate rate. Thus only the ℓ_k-th component of r_k is changed, while all other components are left unchanged. The stepsize could be chosen to be γ_k = 1/n_k, where, as in Section 6.4.1, n_k is the number of times that index ℓ_k has been generated within the sequence {ℓ_0, ℓ_1, ...} up to time k.
The algorithm is similar to, and indeed contains as a special case, the Q-learning algorithm (6.101)-(6.103). The justification of the algorithm follows closely the one given for Q-learning in Section 6.4.1. Basically, we replace the expected value in the expression (6.223) of the ℓth component of DT with a Monte Carlo estimate based on all the samples up to time k that involve ℓ_k, and we then simplify the hard-to-calculate terms in the resulting method [cf. Eqs. (6.108) and (6.110)]. A rigorous convergence proof requires the theoretical machinery of stochastic approximation algorithms.
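A minimal sketch of this asynchronous scheme follows, under several simplifying assumptions not in the text: a toy sup-norm contraction T, rows of D that are themselves probability distributions (so that we may take D̄ = D and the ratio d_{ℓi}/d̄_{ℓi} equals 1), and a cyclic choice of ℓ_k to guarantee that every component is updated infinitely often.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, alpha = 5, 2, 0.8
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
c = rng.normal(size=n)
D = rng.random((s, n)); D /= D.sum(axis=1, keepdims=True)        # rows: distributions
Phi = rng.random((n, s)); Phi /= Phi.sum(axis=1, keepdims=True)  # row sums = 1

def T(x):
    # sup-norm contraction with modulus alpha (min(., 1) is nonexpansive)
    return c + alpha * P @ np.minimum(x, 1.0)

# Reference fixed point r* of r = D T(Phi r), via the deterministic iteration
r_star = np.zeros(s)
for _ in range(200):
    r_star = D @ T(Phi @ r_star)

# Asynchronous simulation-based version with stepsize 1/n_k
r = np.zeros(s)
visits = np.zeros(s)
for k in range(60_000):
    ell = k % s                          # each component updated infinitely often
    i = rng.choice(n, p=D[ell])          # i_k ~ d_bar(ell, .), here equal to d(ell, .)
    visits[ell] += 1
    gamma = 1.0 / visits[ell]
    r[ell] = (1 - gamma) * r[ell] + gamma * T(Phi @ r)[i]
print(np.abs(r - r_star).max())          # small
```

The conditional expectation of the noisy term T_{i_k}(Φr) given ℓ_k is exactly the ℓ_k-th component of DT(Φr), which is what drives the iterate toward r*.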
6.9 APPROXIMATION IN POLICY SPACE

Our approach so far in this chapter has been to use an approximation architecture for some cost function, differential cost, or Q-factor. Sometimes this is called approximation in value space, to indicate that a cost or value function is being approximated. In an important alternative, called approximation in policy space, we parameterize the set of policies by a vector r = (r_1, ..., r_s) and we optimize the cost over this vector. In particular, we consider randomized stationary policies of a given parametric form µ̃_u(i, r), where µ̃_u(i, r) denotes the probability that control u is applied when the state is i. Each value of r defines a randomized stationary policy, which in turn defines the cost of interest as a function of r. We then choose r to minimize this cost.
In an important special case of this approach, the parameterization of the policies is indirect, through an approximate cost function. In particular, a cost approximation architecture parameterized by r defines a policy dependent on r via the minimization in Bellman's equation. For example, Q-factor approximations Q̃(i, u, r) define a parameterization of policies by letting µ̃_u(i, r) = 1 for some u that minimizes Q̃(i, u, r) over u ∈ U(i), and µ̃_u(i, r) = 0 for all other u. This parameterization is discontinuous in r, but in practice it is smoothed by replacing the minimization operation with a smooth exponential-based approximation; we refer to the literature for the details. Also, in a more abstract and general view of approximation in policy space, rather than parameterizing policies or Q-factors, we can simply parameterize by r the problem data (stage costs and transition probabilities), and optimize the corresponding cost function over r. Thus, in this more general formulation, we may aim to select some parameters of a given system to optimize performance.
Once policies are parameterized in some way by a vector r, the cost function of the problem, over a finite or infinite horizon, is implicitly parameterized as a vector J̃(r). A scalar measure of performance may then be derived from J̃(r), e.g., the expected cost starting from a single initial state, or a weighted sum of costs starting from a selected set of states. The method of optimization may be any one of a number of possible choices, ranging from random search to gradient methods. This method need not relate to DP, although DP calculations may play a significant role in its implementation. Traditionally, gradient methods have received most attention within this context, but they often tend to be slow and to have difficulties with local minima. On the other hand, random search methods, such as the cross-entropy method [RuK04], are often very easy to implement and on occasion have proved surprisingly effective (see the literature cited in Section 6.10).
In this section, we will focus on the finite spaces average cost problem and gradient-type methods. Let the cost per stage vector and transition probability matrix be given as functions of r: G(r) and P(r), respectively. Assume that the states form a single recurrent class under each P(r), and let ξ(r) be the corresponding steady-state probability vector. We denote by G_i(r), P_{ij}(r), and ξ_i(r) the components of G(r), P(r), and ξ(r), respectively. Each value of r defines an average cost η(r), which is common for all initial states (cf. Section 4.2), and the problem is to find

    \min_{r \in \Re^s} \eta(r).

Assuming that η(r) is differentiable with respect to r (something that must be independently verified), one may use a gradient method for this minimization:

    r_{k+1} = r_k - \gamma_k \nabla \eta(r_k),

where γ_k is a positive stepsize. This is known as a policy gradient method.
6.9.1 The Gradient Formula

We will now show that a convenient formula for the gradients ∇η(r) can be obtained by differentiating Bellman's equation

    \eta(r) + h_i(r) = G_i(r) + \sum_{j=1}^n P_{ij}(r) h_j(r), \qquad i = 1, \ldots, n,    (6.224)
with respect to the components of r, where h_i(r) are the differential costs. Taking the partial derivative with respect to r_m, we obtain for all i and m,

    \frac{\partial \eta}{\partial r_m} + \frac{\partial h_i}{\partial r_m}
    = \frac{\partial G_i}{\partial r_m} + \sum_{j=1}^n \frac{\partial P_{ij}}{\partial r_m}\, h_j + \sum_{j=1}^n P_{ij} \frac{\partial h_j}{\partial r_m}.
(In what follows we assume that the partial derivatives with respect to components of r appearing in various equations exist. The argument at which they are evaluated is often suppressed to simplify notation.) By multiplying this equation with ξ_i(r), adding over i, and using the fact \sum_{i=1}^n \xi_i(r) = 1, we obtain

    \frac{\partial \eta}{\partial r_m} + \sum_{i=1}^n \xi_i \frac{\partial h_i}{\partial r_m}
    = \sum_{i=1}^n \xi_i \frac{\partial G_i}{\partial r_m}
    + \sum_{i=1}^n \xi_i \sum_{j=1}^n \frac{\partial P_{ij}}{\partial r_m}\, h_j
    + \sum_{i=1}^n \xi_i \sum_{j=1}^n P_{ij} \frac{\partial h_j}{\partial r_m}.
The last summation on the right-hand side cancels the last summation on the left-hand side, because from the defining property of the steady-state probabilities, we have

    \sum_{i=1}^n \xi_i \sum_{j=1}^n P_{ij} \frac{\partial h_j}{\partial r_m}
    = \sum_{j=1}^n \left( \sum_{i=1}^n \xi_i P_{ij} \right) \frac{\partial h_j}{\partial r_m}
    = \sum_{j=1}^n \xi_j \frac{\partial h_j}{\partial r_m}.
We thus obtain

    \frac{\partial \eta(r)}{\partial r_m}
    = \sum_{i=1}^n \xi_i(r) \left( \frac{\partial G_i(r)}{\partial r_m} + \sum_{j=1}^n \frac{\partial P_{ij}(r)}{\partial r_m}\, h_j(r) \right), \qquad m = 1, \ldots, s,    (6.225)
or in more compact form

    \nabla \eta(r) = \sum_{i=1}^n \xi_i(r) \left( \nabla G_i(r) + \sum_{j=1}^n \nabla P_{ij}(r)\, h_j(r) \right),    (6.226)

where all the gradients are column vectors of dimension s.
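The identity (6.226) can be verified numerically on a small example. The sketch below uses an arbitrary smooth parameterization of P(r) and G(r) by a scalar r (the softmax weights W and cost vector c are assumptions made up for the test), computes ξ(r) and the differential costs h(r) by linear algebra, and compares the formula with a finite-difference derivative of η(r). Note that adding a constant to h does not change the formula, since the rows of ∇P(r) sum to zero, so any normalization of h may be used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
W = rng.normal(size=(n, n))              # assumed logit weights
c = rng.normal(size=n)

def P_of(r):                              # row-softmax transition matrix
    e = np.exp(W * r)
    return e / e.sum(axis=1, keepdims=True)

def G_of(r):                              # assumed smooth stage costs
    return c * np.cos(r)

def eta_h_xi(r):
    P, G = P_of(r), G_of(r)
    # steady-state distribution: xi' P = xi', sum(xi) = 1
    xi = np.linalg.lstsq(np.vstack([P.T - np.eye(n), np.ones(n)]),
                         np.r_[np.zeros(n), 1.0], rcond=None)[0]
    eta = xi @ G
    # differential costs: (I - P) h = G - eta * 1, normalized by xi' h = 0
    h = np.linalg.lstsq(np.vstack([np.eye(n) - P, xi]),
                        np.r_[G - eta, 0.0], rcond=None)[0]
    return eta, h, xi

r0, d = 0.7, 1e-6
eta, h, xi = eta_h_xi(r0)
dG = (G_of(r0 + d) - G_of(r0 - d)) / (2 * d)
dP = (P_of(r0 + d) - P_of(r0 - d)) / (2 * d)

grad_formula = xi @ (dG + dP @ h)         # Eq. (6.226) for scalar r
grad_fd = (eta_h_xi(r0 + d)[0] - eta_h_xi(r0 - d)[0]) / (2 * d)
print(abs(grad_formula - grad_fd))        # the two agree to high accuracy
```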
6.9.2 Computing the Gradient by Simulation
Despite its relative simplicity, the gradient formula (6.226) involves formidable computations to obtain ∇η(r) at just a single value of r. The reason is that neither the steady-state probability vector ξ(r) nor the bias vector h(r) are readily available, so they must be computed or approximated in some way. Furthermore, h(r) is a vector of dimension n, so for large n, it can only be approximated either through its simulation samples or by using a parametric architecture and an algorithm such as LSPE or LSTD (see the references cited at the end of the chapter).
The possibility of approximating h using a parametric architecture ushers in a connection between approximation in policy space and approximation in value space. It also raises the question of whether approximations introduced in the gradient calculation may affect the convergence guarantees of the policy gradient method. Fortunately, however, gradient algorithms tend to be robust and maintain their convergence properties, even in the presence of significant error in the calculation of the gradient.
In the literature, algorithms where both µ and h are parameterized are sometimes called actor-critic methods. Algorithms where just µ is parameterized and h is not parameterized but rather estimated explicitly or implicitly by simulation are called actor-only methods, while algorithms where just h is parameterized and µ is obtained by one-step lookahead minimization are called critic-only methods.
We will now discuss some possibilities of using simulation to approximate ∇η(r). Let us introduce, for all i and j such that P_{ij}(r) > 0, the function

    L_{ij}(r) = \frac{\nabla P_{ij}(r)}{P_{ij}(r)}.
Then, suppressing the dependence on r, we write the partial derivative formula (6.226) in the form

    \nabla \eta = \sum_{i=1}^n \xi_i \left( \nabla G_i + \sum_{j=1}^n P_{ij} L_{ij} h_j \right).    (6.227)
We assume that for all states i and possible transitions (i, j), we can calculate ∇G_i and L_{ij}. Suppose now that we generate a single infinitely long simulated trajectory (i_0, i_1, ...). We can then estimate the average cost η as

    \tilde{\eta} = \frac{1}{k} \sum_{t=0}^{k-1} G_{i_t},
where k is large. Then, given an estimate η̃, we can estimate the bias components h_j by using simulation-based approximations to the formula

    h_{i_0} = \lim_{N \to \infty} E\left\{ \sum_{t=0}^N (G_{i_t} - \eta) \right\},
[which holds from general properties of the bias vector when P(r) is aperiodic – see the discussion following Prop. 4.1.2]. Alternatively, we can estimate h_j by using the LSPE or LSTD algorithms of Section 6.7.1 [note here that if the feature subspace contains the bias vector, the LSPE and LSTD algorithms will find exact values of h_j in the limit, so with a sufficiently rich set of features, an asymptotically exact calculation of h_j, and hence also ∇η(r), is possible]. Finally, given estimates η̃ and h̃_j, we can estimate the gradient ∇η with a vector δ_η given by
    \delta_\eta = \frac{1}{k} \sum_{t=0}^{k-1} \left( \nabla G_{i_t} + L_{i_t i_{t+1}} \tilde{h}_{i_{t+1}} \right).    (6.228)
This can be seen by a comparison of Eqs. (6.227) and (6.228): if we replace the expected values of ∇G_i and L_{ij} by empirical averages, and we replace h_j by h̃_j, we obtain the estimate δ_η.
The estimation-by-simulation procedure outlined above provides a conceptual starting point for more practical gradient estimation methods. For example, in such methods, the estimation of η and h_j may be done simultaneously with the estimation of the gradient via Eq. (6.228), and with a variety of different algorithms. We refer to the literature cited at the end of the chapter.
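To see Eq. (6.228) in action, the sketch below simulates one long trajectory of a small parameterized chain and forms δ_η, using the exact bias vector in place of the estimate h̃ so that only the trajectory averaging is being tested; the chain itself (softmax weights W, cost vector c, scalar parameter r) is an assumption made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
W = rng.normal(size=(n, n))
c = rng.normal(size=n)

def P_of(r):
    e = np.exp(W * r)
    return e / e.sum(axis=1, keepdims=True)

def G_of(r):
    return c * np.cos(r)

r0, d = 0.5, 1e-6
P, G = P_of(r0), G_of(r0)
dP = (P_of(r0 + d) - P_of(r0 - d)) / (2 * d)
dG = (G_of(r0 + d) - G_of(r0 - d)) / (2 * d)
L = dP / P                                 # L_ij = grad P_ij / P_ij (all P_ij > 0 here)

# exact xi, eta, h for the reference gradient value, Eq. (6.227)
xi = np.linalg.lstsq(np.vstack([P.T - np.eye(n), np.ones(n)]),
                     np.r_[np.zeros(n), 1.0], rcond=None)[0]
eta = xi @ G
h = np.linalg.lstsq(np.vstack([np.eye(n) - P, xi]),
                    np.r_[G - eta, 0.0], rcond=None)[0]
grad_exact = xi @ (dG + dP @ h)

# single long trajectory; estimator (6.228) with h as the critic
K = 100_000
i, acc = 0, 0.0
for _ in range(K):
    j = rng.choice(n, p=P[i])
    acc += dG[i] + L[i, j] * h[j]
    i = j
delta_eta = acc / K
print(abs(delta_eta - grad_exact))         # shrinks as K grows
```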
6.9.3 Essential Features of Critics
We will now develop an alternative (but mathematically equivalent) expression for the gradient ∇η(r) that involves Q-factors instead of differential costs. Let us consider randomized policies where µ̃_u(i, r) denotes the probability that control u is applied at state i. We assume that µ̃_u(i, r) is differentiable with respect to r for each i and u. Then the corresponding stage costs and transition probabilities are given by

    G_i(r) = \sum_{u \in U(i)} \tilde{\mu}_u(i, r) \sum_{j=1}^n p_{ij}(u) g(i, u, j), \qquad i = 1, \ldots, n,

    P_{ij}(r) = \sum_{u \in U(i)} \tilde{\mu}_u(i, r)\, p_{ij}(u), \qquad i, j = 1, \ldots, n.
Differentiating these equations with respect to r, we obtain

    \nabla G_i(r) = \sum_{u \in U(i)} \nabla \tilde{\mu}_u(i, r) \sum_{j=1}^n p_{ij}(u) g(i, u, j),    (6.229)

    \nabla P_{ij}(r) = \sum_{u \in U(i)} \nabla \tilde{\mu}_u(i, r)\, p_{ij}(u), \qquad i, j = 1, \ldots, n.    (6.230)
Since \sum_{u \in U(i)} \tilde{\mu}_u(i, r) = 1 for all r, we have \sum_{u \in U(i)} \nabla \tilde{\mu}_u(i, r) = 0, so Eq. (6.229) yields

    \nabla G_i(r) = \sum_{u \in U(i)} \nabla \tilde{\mu}_u(i, r) \left( \sum_{j=1}^n p_{ij}(u) g(i, u, j) - \eta(r) \right).
Also, by multiplying with h_j(r) and adding over j, Eq. (6.230) yields

    \sum_{j=1}^n \nabla P_{ij}(r) h_j(r) = \sum_{j=1}^n \sum_{u \in U(i)} \nabla \tilde{\mu}_u(i, r)\, p_{ij}(u) h_j(r).
By using the preceding two equations to rewrite the gradient formula (6.226), we obtain

    \nabla \eta(r) = \sum_{i=1}^n \xi_i(r) \left( \nabla G_i(r) + \sum_{j=1}^n \nabla P_{ij}(r) h_j(r) \right)
    = \sum_{i=1}^n \xi_i(r) \sum_{u \in U(i)} \nabla \tilde{\mu}_u(i, r) \sum_{j=1}^n p_{ij}(u) \left( g(i, u, j) - \eta(r) + h_j(r) \right),
and finally

    \nabla \eta(r) = \sum_{i=1}^n \sum_{u \in U(i)} \xi_i(r)\, \tilde{Q}(i, u, r)\, \nabla \tilde{\mu}_u(i, r),    (6.231)

where Q̃(i, u, r) are the approximate Q-factors corresponding to r:

    \tilde{Q}(i, u, r) = \sum_{j=1}^n p_{ij}(u) \left( g(i, u, j) - \eta(r) + h_j(r) \right).
Let us now express the formula (6.231) in a way that is amenable to proper interpretation. In particular, by writing

    \nabla \eta(r) = \sum_{i=1}^n \sum_{\{u \in U(i) \mid \tilde{\mu}_u(i,r) > 0\}} \xi_i(r)\, \tilde{\mu}_u(i, r)\, \tilde{Q}(i, u, r)\, \frac{\nabla \tilde{\mu}_u(i, r)}{\tilde{\mu}_u(i, r)},

and by introducing the function

    \psi_r(i, u) = \frac{\nabla \tilde{\mu}_u(i, r)}{\tilde{\mu}_u(i, r)},

we obtain

    \nabla \eta(r) = \sum_{i=1}^n \sum_{\{u \in U(i) \mid \tilde{\mu}_u(i,r) > 0\}} \zeta_r(i, u)\, \tilde{Q}(i, u, r)\, \psi_r(i, u),    (6.232)
where ζ_r(i, u) are the steady-state probabilities of the pairs (i, u) under r:

    \zeta_r(i, u) = \xi_i(r)\, \tilde{\mu}_u(i, r).

Note that for each (i, u), ψ_r(i, u) is a vector of dimension s, the dimension of the parameter vector r. We denote by ψ_r^m(i, u), m = 1, ..., s, the components of this vector.
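Formula (6.231) can likewise be checked numerically. The sketch below builds a small MDP with a softmax randomized policy in a scalar parameter r (the data p, g, θ are assumptions made up for the test), forms Q̃(i, u, r), and compares the sum Σ_{i,u} ξ_i(r) Q̃(i, u, r) ∇µ̃_u(i, r) with a finite-difference derivative of η(r).

```python
import numpy as np

rng = np.random.default_rng(4)
n, nu = 3, 2                                # states, controls
p = rng.random((nu, n, n))
p /= p.sum(axis=2, keepdims=True)           # p[u, i, j] = p_ij(u)
g = rng.normal(size=(nu, n, n))             # g[u, i, j] = g(i, u, j)
theta = rng.normal(size=(n, nu))            # assumed policy scores

def mu_of(r):                               # softmax randomized policy mu[i, u]
    e = np.exp(theta * r)
    return e / e.sum(axis=1, keepdims=True)

def eta_xi_P_G(r):
    mu = mu_of(r)
    P = np.einsum('iu,uij->ij', mu, p)          # P_ij(r)
    G = np.einsum('iu,uij,uij->i', mu, p, g)    # G_i(r)
    xi = np.linalg.lstsq(np.vstack([P.T - np.eye(n), np.ones(n)]),
                         np.r_[np.zeros(n), 1.0], rcond=None)[0]
    return xi @ G, xi, P, G

r0, d = 0.3, 1e-6
eta, xi, P, G = eta_xi_P_G(r0)
h = np.linalg.lstsq(np.vstack([np.eye(n) - P, xi]),
                    np.r_[G - eta, 0.0], rcond=None)[0]

# Q_tilde(i, u) = sum_j p_ij(u) (g(i, u, j) - eta + h_j)
Q = np.einsum('uij,uij->iu', p, g - eta + h[None, None, :])
dmu = (mu_of(r0 + d) - mu_of(r0 - d)) / (2 * d)

grad_231 = np.sum(xi[:, None] * Q * dmu)    # Eq. (6.231), scalar r
grad_fd = (eta_xi_P_G(r0 + d)[0] - eta_xi_P_G(r0 - d)[0]) / (2 * d)
print(abs(grad_231 - grad_fd))              # the two agree to high accuracy
```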
Equation (6.232) can form the basis for policy gradient methods that estimate Q̃(i, u, r) by simulation, thereby leading to actor-only algorithms. An alternative, suggested by Konda and Tsitsiklis [KoT99], [KoT03], is to interpret the formula as an inner product, thereby leading to a different set of algorithms. In particular, for a given r, we define the inner product of two real-valued functions Q_1, Q_2 of (i, u) by

    \langle Q_1, Q_2 \rangle_r = \sum_{i=1}^n \sum_{\{u \in U(i) \mid \tilde{\mu}_u(i,r) > 0\}} \zeta_r(i, u)\, Q_1(i, u)\, Q_2(i, u).
With this notation, we can rewrite Eq. (6.232) as

    \frac{\partial \eta(r)}{\partial r_m} = \langle \tilde{Q}(\cdot, \cdot, r),\, \psi_r^m(\cdot, \cdot) \rangle_r, \qquad m = 1, \ldots, s.

An important observation is that although ∇η(r) depends on Q̃(i, u, r), which has a number of components equal to the number of state-control pairs (i, u), the dependence is only through its inner products with the s functions ψ_r^m(·, ·), m = 1, ..., s.
Now let ‖·‖_r be the norm induced by this inner product, i.e.,

    \|Q\|_r^2 = \langle Q, Q \rangle_r.

Let also S_r be the subspace that is spanned by the functions ψ_r^m(·, ·), m = 1, ..., s, and let Π_r denote projection with respect to this norm onto S_r. Since

    \langle \tilde{Q}(\cdot, \cdot, r),\, \psi_r^m(\cdot, \cdot) \rangle_r = \langle \Pi_r \tilde{Q}(\cdot, \cdot, r),\, \psi_r^m(\cdot, \cdot) \rangle_r, \qquad m = 1, \ldots, s,

it is sufficient to know the projection of Q̃(·, ·, r) onto S_r in order to compute ∇η(r). Thus S_r defines a subspace of essential features, i.e., features the knowledge of which is essential for the calculation of the gradient ∇η(r). As discussed in Section 6.1, the projection of Q̃(·, ·, r) onto S_r can be done in an approximate sense with TD(λ), LSPE(λ), or LSTD(λ) for λ ≈ 1. We refer to the papers by Konda and Tsitsiklis [KoT99], [KoT03], and Sutton, McAllester, Singh, and Mansour [SMS99] for further discussion.
6.9.4 Approximations in Policy and Value Space
Let us now provide a comparative assessment of approximation in policy
and value space. We ﬁrst note that in comparing approaches, one must bear
in mind that speciﬁc problems may admit natural parametrizations that
favor one type of approximation over the other. For example, in inventory
control problems, it is natural to consider policy parametrizations that
resemble the (s, S) policies that are optimal for special cases, but also
make intuitive sense in a broader context.
Policy gradient methods for approximation in policy space are supported by interesting theory and aim directly at finding an optimal policy within the given parametric class (as opposed to aiming for policy evaluation in the context of an approximate policy iteration scheme). However, they suffer from a drawback that is well known to practitioners of nonlinear optimization: slow convergence, which, unless improved through the use of effective scaling of the gradient (with an appropriate diagonal or nondiagonal matrix), all too often leads to jamming (no visible progress) and complete breakdown. Unfortunately, there has been no proposal of a demonstrably effective scheme to scale the gradient in policy gradient methods (see, however, Kakade [Kak02] for an interesting attempt to address this issue, based on the work of Amari [Ama98]). Furthermore, the performance and reliability of policy gradient methods are susceptible to degradation by large variance of simulation noise. Thus, while policy gradient methods are supported by convergence guarantees in theory, attaining convergence in practice is often challenging. In addition, gradient methods have a generic difficulty with local minima, the consequences of which are not well understood at present in the context of approximation in policy space.
A major difficulty for approximation in value space is that a good choice of basis functions/features is often far from evident. Furthermore, even when good features are available, the indirect approach of TD(λ), LSPE(λ), and LSTD(λ) may neither yield the best possible approximation of the cost function or the Q-factors of a policy within the feature subspace, nor yield the best possible performance of the associated one-step-lookahead policy. In the case of a fixed policy, LSTD(λ) and LSPE(λ) are quite reliable algorithms, in the sense that they ordinarily achieve their theoretical guarantees in approximating the associated cost function or Q-factors: they involve solution of systems of linear equations, simulation (with convergence governed by the law of large numbers), and contraction iterations (with favorable contraction modulus when λ is not too close to 0). However, within the multiple policy context of an approximate policy iteration scheme, TD methods have additional difficulties: the need for adequate exploration, the chattering phenomenon and the associated issue of policy oscillation, and the lack of convergence guarantees for both optimistic and nonoptimistic schemes. When an aggregation method is used for policy evaluation, these difficulties do not arise, but the cost approximation vectors Φr are restricted by the requirement that the rows of Φ must be aggregation probability distributions.
6.10 NOTES, SOURCES, AND EXERCISES
There has been intensive interest in simulation-based methods for approximate DP since the early 90s, in view of their promise to address the dual curses of DP: the curse of dimensionality (the explosion of the computation needed to solve the problem as the number of states increases), and the curse of modeling (the need for an exact model of the system's dynamics). We have used the name approximate dynamic programming to collectively refer to these methods. Two other popular names are reinforcement learning and neuro-dynamic programming. The latter name, adopted by Bertsekas and Tsitsiklis [BeT96], comes from the strong connections with DP as well as with methods traditionally developed in the field of neural networks, such as the training of approximation architectures using empirical or simulation data.
Two books were written on the subject in the mid-90s, one by Sutton and Barto [SuB98], which reflects an artificial intelligence viewpoint, and another by Bertsekas and Tsitsiklis [BeT96], which is more mathematical and reflects an optimal control/operations research viewpoint. We refer to the latter book for a broader discussion of some of the topics of this chapter, for related material on approximation architectures, batch and incremental gradient methods, and neural network training, as well as for an extensive overview of the history and bibliography of the subject up to 1996. More recent books are Cao [Cao07], which emphasizes a sensitivity approach and policy gradient methods; Chang, Fu, Hu, and Marcus [CFH07], which emphasizes finite-horizon/limited lookahead schemes and adaptive sampling; Gosavi [Gos03], which emphasizes simulation-based optimization and reinforcement learning algorithms; and Powell [Pow07], which emphasizes resource allocation and the difficulties associated with large control spaces. The book by Borkar [Bor08] is an advanced monograph that addresses rigorously many of the convergence issues of iterative stochastic algorithms in approximate DP, mainly using the so-called ODE approach (see also Borkar and Meyn [BoM00]). The book by Meyn [Mey07] is broader in its coverage, but touches upon some of the approximate DP algorithms that we have discussed.
Several survey papers in the volume by Si, Barto, Powell, and Wunsch [SBP04], and the special issue by Lewis, Liu, and Lendaris [LLL08] describe recent work and approximation methodology that we have not covered in this chapter: linear programming-based approaches (De Farias and Van Roy [DFV03], [DFV04a], De Farias [DeF04]), large-scale resource allocation methods (Powell and Van Roy [PoV04]), and deterministic optimal control approaches (Ferrari and Stengel [FeS04], and Si, Yang, and Liu [SYL04]). An influential survey was written, from an artificial intelligence/machine learning viewpoint, by Barto, Bradtke, and Singh [BBS95]. More recent surveys are Borkar [Bor09] (a methodological point of view that explores connections with other Monte Carlo schemes), Lewis and Vrabie [LeV09] (a control theory point of view), Szepesvari [Sze09] (a machine learning point of view), and Bertsekas [Ber10] (which focuses on policy iteration and elaborates on some of the topics of this chapter). The reader is referred to these sources for a broader survey of the literature of approximate DP, which is very extensive and cannot be fully covered here.
Direct approximation methods and the fitted value iteration approach have been used for finite horizon problems since the early days of DP. They are conceptually simple and easily implementable, and they are still in wide use for either approximation of optimal cost functions or Q-factors (see e.g., Gordon [Gor99], Longstaff and Schwartz [LoS01], Ormoneit and Sen [OrS02], and Ernst, Geurts, and Wehenkel [EGW06]). The simplifications mentioned in Section 6.1.4 are part of the folklore of DP. In particular, post-decision states have sporadically appeared in the literature since the early days of DP. They were used in an approximate DP context by Van Roy, Bertsekas, Lee, and Tsitsiklis [VBL97] in the context of inventory control problems. They have been recognized as an important simplification in the book by Powell [Pow07], which pays special attention to the difficulties associated with large control spaces. For a recent application, see Simao et al. [SDG09].
Temporal differences originated in reinforcement learning, where they are viewed as a means to encode the error in predicting future costs, which is associated with an approximation architecture. They were introduced in the works of Samuel [Sam59], [Sam67] on a checkers-playing program. The papers by Barto, Sutton, and Anderson [BSA83], and Sutton [Sut88] proposed the TD(λ) method, on a heuristic basis without a convergence analysis. The method motivated a lot of research in simulation-based DP, particularly following an early success with the backgammon-playing program of Tesauro [Tes92]. The original papers did not discuss mathematical convergence issues and did not make the connection of TD methods with the projected equation. Indeed, for quite a long time it was not clear which mathematical problem TD(λ) was aiming to solve! The convergence of TD(λ) and related methods was considered for discounted problems by several authors, including Dayan [Day92], Gurvits, Lin, and Hanson [GLH94], Jaakkola, Jordan, and Singh [JJS94], Pineda [Pin97], Tsitsiklis and Van Roy [TsV97], and Van Roy [Van98]. The proof of Tsitsiklis and Van Roy [TsV97] was based on the contraction property of ΠT (cf. Lemma 6.3.1 and Prop. 6.3.1), which is the starting point of our analysis of Section 6.3. The scaled version of TD(0) [cf. Eq. (6.64)] as well as a λ-counterpart were proposed by Choi and Van Roy [ChV06] under the name Fixed Point Kalman Filter. The book by Bertsekas and Tsitsiklis [BeT96] contains a detailed convergence analysis of TD(λ), its variations, and its use in approximate policy iteration, based on the connection with the projected Bellman equation. Lemma 6.3.2 is attributed to Lyapunov (see Theorem 3.3.9 and Note 3.13.6 of Cottle, Pang, and Stone [CPS92]).
Generally, projected equations are the basis for Galerkin methods,
which are popular in scientiﬁc computation (see e.g., [Kra72], [Fle84]).
These methods typically do not use Monte Carlo simulation, which is es
sential for the DP context. However, Galerkin methods apply to a broad
range of problems, far beyond DP, which is in part the motivation for our
discussion of projected equations in more generality in Section 6.8.
The LSTD(λ) algorithm was first proposed by Bradtke and Barto [BrB96] for λ = 0, and later extended by Boyan [Boy02] for λ > 0. For λ > 0, the convergence C_k^{(λ)} → C^{(λ)} and d_k^{(λ)} → d^{(λ)} is not as easy to demonstrate as in the case λ = 0. An analysis has been given in several sources under different assumptions. For the case of an α-discounted problem, convergence was shown by Nedić and Bertsekas [NeB03] (for the standard version of the method), by Bertsekas and Yu [BeY09] (in the more general two-Markov chain sampling context of Section 6.8), and by Yu [Yu10], which gives the sharpest analysis. The analysis of [BeY09] and [Yu10] also extends to simulation-based solution of general projected equations. An analysis of the law-of-large-numbers convergence issues associated with LSTD for discounted problems was given by Nedić and Bertsekas [NeB03]. The rate of convergence of LSTD was analyzed by Konda [Kon02], who showed that LSTD has optimal rate of convergence within a broad class of temporal difference methods. The regression/regularization variant of LSTD is due to Wang, Polydorides, and Bertsekas [WPB09]. This work addresses more generally the simulation-based approximate solution of linear systems and least squares problems, and it applies to LSTD as well as to the minimization of the Bellman equation error as special cases.
The LSPE(λ) algorithm was first proposed for stochastic shortest path problems by Bertsekas and Ioffe [BeI96], and was applied to a challenging problem on which TD(λ) failed: learning an optimal strategy to play the game of tetris (see also Bertsekas and Tsitsiklis [BeT96], Section 8.3). The convergence of the method for discounted problems was given in [NeB03] (for a diminishing stepsize), and by Bertsekas, Borkar, and Nedić [BBN04] (for a unit stepsize). In the paper [BeI96] and the book [BeT96], the LSPE method was related to a version of the policy iteration method, called λ-policy iteration (see also Exercises 1.19 and 6.10). The paper [BBN04] compared informally LSPE and LSTD for discounted problems, and suggested that they asymptotically coincide in the sense described in Section 6.3. Yu and Bertsekas [YuB06b] provided a mathematical proof of this for both discounted and average cost problems. The scaled versions of LSPE and the associated convergence analysis were developed more recently, and within a more general context, in Bertsekas [Ber09], which is based on a connection between general projected equations and variational inequalities. Some related methods were given by Yao and Liu [YaL08]. The research on policy or Q-factor evaluation methods was of course motivated by their use in approximate policy iteration schemes. There has been considerable experimentation with such schemes, see e.g., [BeI96], [BeT96], [SuB98], [LaP03], [JuP07], [BED09]. However, the relative practical advantages of optimistic versus nonoptimistic schemes, in conjunction with LSTD, LSPE, and TD(λ), are not yet clear. Policy oscillations and chattering were first described by the author at an April 1996 workshop on reinforcement learning [Ber96], and were subsequently discussed in Section 6.4.2 of [BeT96]. The size of the oscillations is bounded by the error bound of Prop. 1.3.6, which is due to [BeT96]. An alternative error bound that is based on the Euclidean norm has been derived by Munos [Mun03].
The exploration scheme with extra transitions (Section 6.3.8) was given in the paper by Bertsekas and Yu [BeY09], Example 1. The LSTD(λ) algorithm with exploration and modified temporal differences (Section 6.3.8) was given by Bertsekas and Yu [BeY07], and a convergence with probability 1 proof was provided in [BeY09], Prop. 4, under the condition λα p_{ij} ≤ \bar{p}_{ij} for all (i, j). The idea of modified temporal differences stems from the techniques of importance sampling, which have been introduced in various DP-related contexts by a number of authors: Glynn and Iglehart [GlI89] (for exact cost evaluation), Precup, Sutton, and Dasgupta [PSD01] [for TD(λ) with exploration and stochastic shortest path problems], Ahamed, Borkar, and Juneja [ABJ06] (in adaptive importance sampling schemes for cost vector estimation without approximation), and Bertsekas and Yu [BeY07], [BeY09] (in the context of the generalized projected equation methods of Section 6.8.1).
Q-learning was proposed by Watkins [Wat89], who insightfully explained the essence of the method, but did not provide a rigorous convergence analysis; see also Watkins and Dayan [WaD92]. A convergence proof was given by Tsitsiklis [Tsi94b]. For SSP problems with improper policies, this proof required the assumption of nonnegative one-stage costs (see also [BeT96], Prop. 5.6). This assumption was relaxed by Abounadi, Bertsekas, and Borkar [ABB02], using an alternative line of proof. For a survey of related methods, which also includes many historical and other references up to 1995, see Barto, Bradtke, and Singh [BBS95].
A variant of Q-learning is the method of advantage updating, developed by Baird [Bai93], [Bai94], [Bai95], and Harmon, Baird, and Klopf [HBK94]. In this method, instead of aiming to compute Q(i, u), we compute

    A(i, u) = Q(i, u) - \min_{v \in U(i)} Q(i, v).

The function A(i, u) can serve just as well as Q(i, u) for the purpose of computing corresponding policies, based on the minimization \min_{u \in U(i)} A(i, u), but may have a much smaller range of values than Q(i, u), which may be helpful in some contexts involving basis function approximation. When using a lookup table representation, advantage updating is essentially equivalent to Q-learning, and has the same type of convergence properties. With function approximation, the convergence properties of advantage updating are not well understood (similar to Q-learning). We refer to the book [BeT96], Section 6.6.2, for more details and some analysis.
Another variant of Q-learning, also motivated by the fact that we are really interested in Q-factor differences rather than Q-factors, has been discussed in Section 6.4.2 of Vol. I, and is aimed at variance reduction of Q-factors obtained by simulation. A related variant of approximate policy iteration and Q-learning, called differential training, has been proposed by the author in [Ber97] (see also Weaver and Baxter [WeB99]). It aims to compute Q-factor differences in the spirit of the variance reduction ideas of Section 6.4.2 of Vol. I.
Approximation methods for the optimal stopping problem (Section 6.4.3) were investigated by Tsitsiklis and Van Roy [TsV99b], [Van98], who noted that Q-learning with a linear parametric architecture could be applied because the associated mapping F is a contraction with respect to the norm ∥·∥_ξ. They proved the convergence of a corresponding Q-learning method, and they applied it to a problem of pricing financial derivatives. The LSPE algorithm given in Section 6.4.3 for this problem is due to Yu and Bertsekas [YuB07], to which we refer for additional analysis. An alternative algorithm with some similarity to LSPE as well as TD(0) is given by Choi and Van Roy [ChV06], and is also applied to the optimal stopping problem. We note that approximate dynamic programming and simulation methods for stopping problems have become popular in the finance area, within the context of pricing options; see Longstaff and Schwartz [LoS01], who consider a finite horizon model in the spirit of Section 6.4.4, and Tsitsiklis and Van Roy [TsV01], and Li, Szepesvari, and Schuurmans [LSS09], whose works relate to the LSPE method of Section 6.4.3. The constrained policy iteration method of Section 6.4.3 is new.
Recently, an approach to Q-learning with exploration, called enhanced policy iteration, has been proposed (Bertsekas and Yu [BeY10]). Instead of policy evaluation by solving a linear system of equations, this method requires (possibly inexact) solution of Bellman's equation for an optimal stopping problem. It is based on replacing the standard Q-learning mapping used for evaluation of a policy µ with the mapping

(F_{J,ν}Q)(i, u) = Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α Σ_{v∈U(j)} ν(v | j) min{ J(j), Q(j, v) } ),

which depends on a vector J ∈ ℜ^n, with components denoted J(i), and on a randomized policy ν, which for each state i defines a probability distribution

{ ν(u | i) | u ∈ U(i) }
over the feasible controls at i, and may depend on the “current policy” µ. The vector J is updated using the equation J(i) = min_{u∈U(i)} Q(i, u), and the “current policy” µ is obtained from this minimization. Finding a fixed point of the mapping F_{J,ν} is an optimal stopping problem [a similarity with the constrained policy iteration (6.128)-(6.129)]. The policy ν may be chosen arbitrarily at each iteration. It encodes aspects of the “current policy” µ, but allows for an arbitrary and easily controllable amount of exploration. For extreme choices of ν and a lookup table representation, the algorithm of [BeY10] yields as special cases the classical Q-learning/value iteration and policy iteration methods. Together with linear Q-factor approximation, the algorithm may be combined with the TD(0)-like method of Tsitsiklis and Van Roy [TsV99b], which can be used to solve the associated stopping problems with low overhead per iteration, thereby resolving the issue of exploration.
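A minimal sketch of one application of the mapping F_{J,ν}, followed by the update of J, is given below. The two-state, two-control problem data are made up purely for illustration, and the code is not the algorithm of [BeY10] itself (it omits sampling, exploration, and the stopping-problem solver):

```python
import numpy as np

# Illustrative two-state, two-control data (all made up).
n, m, alpha = 2, 2, 0.9
p = np.full((m, n, n), 0.5)          # p[u, i, j]: transition probabilities
g = np.ones((n, m, n))               # g[i, u, j]: one-stage costs
nu = np.full((n, m), 0.5)            # nu[j, v]: randomized policy nu(v | j)
J = np.zeros(n)
Q = np.zeros((n, m))

def F(Q, J, nu):
    """One application of (F_{J,nu} Q)(i, u)."""
    FQ = np.zeros_like(Q)
    for i in range(n):
        for u in range(m):
            FQ[i, u] = sum(
                p[u, i, j] * (g[i, u, j]
                    + alpha * sum(nu[j, v] * min(J[j], Q[j, v])
                                  for v in range(m)))
                for j in range(n))
    return FQ

Q1 = F(Q, J, nu)
J1 = Q1.min(axis=1)                  # J(i) = min_u Q(i, u)
```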
Approximate value iteration methods that are based on aggregation (cf. Section 6.5.2) were proposed by Tsitsiklis and Van Roy [TsV96] (see also Section 6.7 of [BeT96] for a discussion). Related results are given by Gordon [Gor95], and by Singh, Jaakkola, and Jordan [SJJ94], [SJJ95]. Bounds on the error between the optimal cost-to-go vector J^∗ and the limit of the value iteration method in the case of hard aggregation are given under various assumptions in [TsV96] (see also Exercise 6.12 and Section 6.7.4 of [BeT96]). Related error bounds are given by Munos and Szepesvari [MuS08]. A more recent work that focuses on hard aggregation is Van Roy [Van06]. Multistep aggregation has not been discussed earlier. It may have some important practical applications in problems where multistep lookahead minimizations are feasible. Asynchronous multi-agent aggregation has also not been discussed earlier. Finally, it is worth emphasizing that while nearly all the methods discussed in this chapter produce basis function approximations to costs or Q-factors, there is an important qualitative difference that distinguishes the aggregation-based policy iteration approach: assuming sufficiently small simulation error, it is not susceptible to policy oscillation and chattering like the projected equation or Bellman equation error approaches. The price for this is the restriction of the type of basis functions that can be used in aggregation.
The contraction mapping analysis (Prop. 6.6.1) for SSP problems
in Section 6.6 is based on the convergence analysis for TD(λ) given in
Bertsekas and Tsitsiklis [BeT96], Section 6.3.4. The LSPE algorithm was
ﬁrst proposed for SSP problems in [BeI96].
The TD(λ) algorithm was extended to the average cost problem, and
its convergence was proved by Tsitsiklis and Van Roy [TsV99a] (see also
[TsV02]). The average cost analysis of LSPE in Section 6.7.1 is due to Yu
and Bertsekas [YuB06b]. An alternative to the LSPE and LSTD algorithms
of Section 6.7.1 is based on the relation between average cost and SSP
problems, and the associated contracting value iteration method discussed
in Section 4.4.1. The idea is to convert the average cost problem into a
parametric form of SSP, which converges to the correct one as the gain of the policy is estimated with increasing accuracy by simulation. The SSP algorithms of Section 6.6 can then be used with the estimated gain of the policy η_k replacing the true gain η.
While the convergence analysis of the policy evaluation methods of Sections 6.3 and 6.6 is based on contraction mapping arguments, a different line of analysis is necessary for Q-learning algorithms for average cost problems (as well as for SSP problems where there may exist some improper policies). The reason is that there may not be an underlying contraction, so the nonexpansive property of the DP mapping must be used instead. As a result, the analysis is more complicated, and a different method of proof has been employed, based on the so-called ODE approach; see Abounadi, Bertsekas, and Borkar [ABB01], [ABB02], and Borkar and Meyn [BoM00]. In particular, the Q-learning algorithms of Section 6.7.3 were proposed and analyzed in these references. They are also discussed in the book [BeT96] (Section 7.1.5). Alternative algorithms of the Q-learning type for average cost problems were given without convergence proof by Schwartz [Sch93b], Singh [Sin94], and Mahadevan [Mah96]; see also Gosavi [Gos04].
The framework of Sections 6.8.1-6.8.5 on generalized projected equation and Bellman error methods is based on Bertsekas and Yu [BeY09], which also discusses multistep methods, and several other variants of the methods given here (see also Bertsekas [Ber09]). The regression-based method and the confidence interval analysis of Prop. 6.8.1 is due to Wang, Polydorides, and Bertsekas [WPB09]. The material of Section 6.8.6 on oblique projections and the connections to aggregation/discretization with a coarse grid is based on unpublished collaboration with H. Yu. The generalized aggregation methodology of Section 6.8.7 is new in the form given here, but is motivated by the sources on aggregation-based approximate DP given for Section 6.5.
The paper by Yu and Bertsekas [YuB08] derives error bounds which apply to generalized projected equations and sharpen the rather conservative bound

∥J_µ − Φr^∗∥_ξ ≤ (1/√(1 − α²)) ∥J_µ − ΠJ_µ∥_ξ,   (6.233)
given for discounted DP problems (cf. Prop. 6.3.2) and the bound

∥x^∗ − Φr^∗∥ ≤ ∥(I − ΠA)^{−1}∥ ∥x^∗ − Πx^∗∥,
for the general projected equation Φr = Π(AΦr + b) [cf. Eq. (6.182)]. The bounds of [YuB08] apply also to the case where A is not a contraction and have the form

∥x^∗ − Φr^∗∥_ξ ≤ B(A, ξ, S) ∥x^∗ − Πx^∗∥_ξ,
where B(A, ξ, S) is a scalar that [contrary to the scalar 1/√(1 − α²) in Eq. (6.233)] depends on the approximation subspace S and the structure of
the matrix A. The scalar B(A, ξ, S) involves the spectral radii of some low-dimensional matrices and may be computed either analytically or by simulation (in the case where x has large dimension). One of the scalars B(A, ξ, S) given in [YuB08] involves only the matrices that are computed as part of the simulation-based calculation of the matrix C_k via Eq. (6.189), so
it is simply obtained as a byproduct of the LSTD- and LSPE-type methods of Section 6.8.1. Among other situations, such bounds can be useful in cases where the “bias” ∥Φr^∗ − Πx^∗∥_ξ (the distance between the solution Φr^∗ of the projected equation and the best approximation of x^∗ within S, which is Πx^∗) is very large [cf. the example of Exercise 6.9, mentioned earlier, where TD(0) produces a very bad solution relative to TD(λ) for λ ≈ 1]. A value of B(A, ξ, S) that is much larger than 1 strongly suggests a large bias and motivates corrective measures (e.g., increasing λ in the approximate DP case, changing the subspace S, or changing ξ). Such an inference cannot be made based on the much less discriminating bound (6.233), even if A is a contraction with respect to ∥·∥_ξ.
The Bellman equation error approach was initially suggested by Schweitzer and Seidman [ScS85], and simulation-based algorithms based on this approach were given later by Harmon, Baird, and Klopf [HBK94], Baird [Bai95], and Bertsekas [Ber95]. For some recent developments, see Ormoneit and Sen [OrS02], Szepesvari and Smart [SzS04], Antos, Szepesvari, and Munos [ASM08], and Bethke, How, and Ozdaglar [BHO08].
There is a large literature on policy gradient methods for average cost problems. The formula for the gradient of the average cost has been given in different forms and within a variety of different contexts: see Cao and Chen [CaC97], Cao and Wan [CaW98], Cao [Cao99], [Cao05], Fu and Hu [FuH94], Glynn [Gly87], Jaakkola, Singh, and Jordan [JSJ95], L'Ecuyer [L'Ec91], and Williams [Wil92]. We follow the derivations of Marbach and Tsitsiklis [MaT01]. The inner product expression of ∂η(r)/∂r_m was used to delineate essential features for gradient calculation by Konda and Tsitsiklis [KoT99], [KoT03], and Sutton, McAllester, Singh, and Mansour [SMS99].
Several implementations of policy gradient methods, some of which
use cost approximations, have been proposed: see Cao [Cao04], Grudic and
Ungar [GrU04], He [He02], He, Fu, and Marcus [HFM05], Kakade [Kak02],
Konda [Kon02], Konda and Borkar [KoB99], Konda and Tsitsiklis [KoT99],
[KoT03], Marbach and Tsitsiklis [MaT01], [MaT03], Sutton, McAllester,
Singh, and Mansour [SMS99], and Williams [Wil92].
Approximation in policy space can also be carried out very simply by a random search method in the space of policy parameters. There has been considerable progress in random search methodology, and the so-called cross-entropy method (see Rubinstein and Kroese [RuK04], [RuK08], de Boer et al. [BKM05]) has gained considerable attention. A noteworthy success with this method has been attained in learning a high scoring strategy in the game of Tetris (see Szita and Lorincz [SzL06], and Thiery and Scherrer [ThS09]); surprisingly, this method outperformed methods
based on approximate policy iteration, approximate linear programming,
and policy gradient by more than an order of magnitude (see the discussion
of chattering in Section 6.3.6). Other random search algorithms have also
been suggested; see Chang, Fu, Hu, and Marcus [CFH07], Ch. 3.
Approximate DP methods for partially observed Markov decision problems are not as well-developed as their perfect observation counterparts. Approximations obtained by aggregation/interpolation schemes and solution of finite-spaces discounted or average cost problems have been proposed by Zhang and Liu [ZhL97], Zhou and Hansen [ZhH01], and Yu and Bertsekas [YuB04]. Alternative approximation schemes based on finite-state controllers are analyzed in Hauskrecht [Hau00], Poupart and Boutilier [PoB04], and Yu and Bertsekas [YuB06a]. Policy gradient methods of the actor-only type have been given by Baxter and Bartlett [BaB01], and Aberdeen and Baxter [AbB00]. An alternative method, which is of the actor-critic type, has been proposed by Yu [Yu05]. See also Singh, Jaakkola, and Jordan [SJJ94].
Many problems have special structure, which can be exploited in approximate DP. For some representative work, see Guestrin et al. [GKP03], and Koller and Parr [KoP00].
E X E R C I S E S
6.1
Consider a fully connected network with n nodes, and the problem of finding a travel strategy that takes a traveller from node 1 to node n in no more than a given number m of time periods, while minimizing the expected travel cost (sum of the travel costs of the arcs on the travel path). The cost of traversing an arc changes randomly and independently at each time period with given distribution. For any node i, the current cost of traversing the outgoing arcs (i, j), j ≠ i, will become known to the traveller upon reaching i, who will then either choose the next node j on the travel path, or stay at i (waiting for smaller costs of outgoing arcs at the next time period) at a fixed (deterministic) cost per period. Derive a DP algorithm in a space of post-decision variables and compare it to ordinary DP.
6.2 (Multiple State Visits in Monte Carlo Simulation)
Argue that the Monte Carlo simulation formula

J_µ(i) = lim_{M→∞} (1/M) Σ_{m=1}^M c(i, m)

is valid even if a state may be revisited within the same sample trajectory. Note: If only a finite number of trajectories is generated, in which case the number M of cost samples collected for a given state i is finite and random, the sum (1/M) Σ_{m=1}^M c(i, m) need not be an unbiased estimator of J_µ(i). However, as the number of trajectories increases to infinity, the bias disappears. See [BeT96], Sections 5.1, 5.2, for a discussion and examples. Hint: Suppose the M cost samples are generated from N trajectories, and that the kth trajectory involves n_k visits to state i and generates n_k corresponding cost samples. Denote m_k = n_1 + · · · + n_k. Write
lim_{M→∞} (1/M) Σ_{m=1}^M c(i, m) = lim_{N→∞} [ (1/N) Σ_{k=1}^N Σ_{m=m_{k−1}+1}^{m_k} c(i, m) ] / [ (1/N)(n_1 + · · · + n_N) ] = E{ Σ_{m=m_{k−1}+1}^{m_k} c(i, m) } / E{n_k},
and argue that

E{ Σ_{m=m_{k−1}+1}^{m_k} c(i, m) } = E{n_k} J_µ(i)

(or see Ross [Ros83b], Cor. 7.2.3 for a closely related result).
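The every-visit averaging in this exercise can be illustrated by simulation. The sketch below uses a hypothetical two-state example: from state 1 the system incurs cost 1 per stage and terminates with probability 1/2, so J_µ(1) = 2; each trajectory contributes one cost sample per visit to state 1.

```python
import random

# Illustrative chain: from state 1, cost 1 per stage; terminate w.p. 0.5,
# stay at state 1 w.p. 0.5.  Then J_mu(1) = expected number of stages = 2.
random.seed(0)

samples = []                        # pooled cost samples c(1, m), all visits
for _ in range(200000):
    steps = 1
    while random.random() < 0.5:    # revisit state 1
        steps += 1
    for t in range(steps):          # one sample per visit: remaining cost
        samples.append(steps - t)

estimate = sum(samples) / len(samples)
# estimate is close to J_mu(1) = 2, despite revisits within trajectories
```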
6.3 (Viewing Q-Factors as Optimal Costs)
Consider the stochastic shortest path problem under Assumptions 2.1.1 and 2.1.2.
Show that the Qfactors Q(i, u) can be viewed as state costs associated with a
modiﬁed stochastic shortest path problem. Use this fact to show that the Q
factors Q(i, u) are the unique solution of the system of equations
Q(i, u) =
j
pij(u)
_
g(i, u, j) + min
v∈U(j)
Q(j, v)
_
.
Hint: Introduce a new state for each pair (i, u), with transition probabilities p_{ij}(u) to the states j = 1, . . . , n, t.
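The fixed-point property of the Q-factors can be illustrated on a hypothetical one-state SSP (the costs and probabilities below are made up): control a moves to t at cost 3, while control b costs 1 and terminates with probability 1/2.

```python
# Hypothetical one-state SSP (state 1 plus termination state t):
#   control 'a': go to t w.p. 1, cost 3
#   control 'b': cost 1, stay at 1 w.p. 0.5, go to t w.p. 0.5
# Q must satisfy Q(1,u) = sum_j p_1j(u) [ g(1,u,j) + min_v Q(j,v) ], Q(t,.) = 0.
Q = {'a': 0.0, 'b': 0.0}
for _ in range(200):                      # fixed-point (value) iteration
    m = min(Q.values())                   # min_v Q(1, v)
    Q = {'a': 3.0, 'b': 1.0 + 0.5 * m}

# converges to Q('a') = 3, Q('b') = 2, so 'b' is optimal at state 1
```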
6.4
This exercise provides a counterexample to the convergence of PVI for discounted problems when the projection is with respect to a norm other than ∥·∥_ξ. Consider the mapping TJ = g + αPJ and the algorithm Φr_{k+1} = ΠT(Φr_k), where P and Φ satisfy Assumptions 6.3.1 and 6.3.2. Here Π denotes projection on the subspace spanned by Φ with respect to the weighted Euclidean norm ∥J∥_V = √(J′V J), where V is a diagonal matrix with positive components. Use the formula Π = Φ(Φ′V Φ)^{−1}Φ′V to show that in the single basis function case (Φ is an n × 1 vector) the algorithm is written as

r_{k+1} = (Φ′V g)/(Φ′V Φ) + α ( (Φ′V PΦ)/(Φ′V Φ) ) r_k.

Construct choices of α, g, P, Φ, and V for which the algorithm diverges.
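One possible construction (not the only one) is sketched below: Φ = (1, 2)′, g = 0, α = 0.9, unweighted Euclidean projection (V = I), and a P whose steady-state distribution is concentrated on the second state. With the ξ-weighted norm the iteration would contract, but with V = I the multiplier exceeds 1:

```python
import numpy as np

# Illustrative divergent instance (one of many possible constructions).
alpha = 0.9
Phi = np.array([1.0, 2.0])            # single basis function
P = np.array([[0.01, 0.99],
              [0.01, 0.99]])          # irreducible, aperiodic; xi = (0.01, 0.99)
V = np.eye(2)                         # projection in the UNWEIGHTED norm
g = np.zeros(2)

a = (Phi @ V @ g) / (Phi @ V @ Phi)
b = alpha * (Phi @ V @ P @ Phi) / (Phi @ V @ Phi)   # iteration multiplier

r = 1.0
for _ in range(50):
    r = a + b * r                     # r_{k+1} = a + b r_k

# b = 0.9 * 5.97 / 5 = 1.0746 > 1, so r_k grows geometrically
```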
6.5 (LSPE(0) for Average Cost Problems [YuB06b])
Show the convergence of LSPE(0) for average cost problems with unit stepsize,
assuming that P is aperiodic, by showing that the eigenvalues of the matrix ΠF
lie strictly within the unit circle.
6.6 (Relation of Discounted and Average Cost Approximations [TsV02])
Consider the finite-state α-discounted and average cost frameworks of Sections 6.3 and 6.7 for a fixed stationary policy with cost per stage g and transition probability matrix P. Assume that the states form a single recurrent class, let J_α be the α-discounted cost vector, let (η^∗, h^∗) be the gain-bias pair, let ξ be the steady-state probability vector, let Ξ be the diagonal matrix with diagonal elements the components of ξ, and let

P^∗ = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P^k.
Show that:
(a) η^∗ = (1 − α)ξ′J_α and P^∗J_α = (1 − α)^{−1}η^∗e.

(b) h^∗ = lim_{α→1} (I − P^∗)J_α. Hint: Use the Laurent series expansion of J_α (cf. Prop. 4.1.2).
(c) Consider the subspace

E^∗ = { (I − P^∗)y | y ∈ ℜ^n },

which is orthogonal to the unit vector e in the scaled geometry where x and y are orthogonal if x′Ξy = 0 (cf. Fig. 6.7.1). Verify that J_α can be decomposed into the sum of two vectors that are orthogonal (in the scaled geometry): P^∗J_α, which is the projection of J_α onto the line defined by e, and (I − P^∗)J_α, which is the projection of J_α onto E^∗ and converges to h^∗ as α → 1.
(d) Use part (c) to show that the limit r^∗_{λ,α} of PVI(λ) for the α-discounted problem converges to the limit r^∗_λ of PVI(λ) for the average cost problem as α → 1.
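Part (a) can be checked numerically. The two-state chain below is illustrative, and P^∗ is computed as lim P^k, which here coincides with the Cesàro limit because the chain is aperiodic:

```python
import numpy as np

# Illustrative two-state chain; P is aperiodic, so lim P^k exists and
# equals the Cesaro limit P*.
alpha = 0.95
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
g = np.array([1.0, 3.0])

J_alpha = np.linalg.solve(np.eye(2) - alpha * P, g)   # discounted cost vector
Pstar = np.linalg.matrix_power(P, 200)                # rows converge to xi'
xi = Pstar[0]
eta = xi @ g                                          # average cost eta*

assert np.isclose((1 - alpha) * (xi @ J_alpha), eta)     # eta* = (1-alpha) xi' J_alpha
assert np.allclose(Pstar @ J_alpha, eta / (1 - alpha))   # P* J_alpha = (1-alpha)^{-1} eta* e
```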
6.7 (Conversion of SSP to Average Cost Policy Evaluation)
We have often used the transformation of an average cost problem to an SSP problem (cf. Section 4.3.1, and Chapter 7 of Vol. I). The purpose of this exercise (unpublished collaboration of H. Yu and the author) is to show that a reverse transformation is possible, from SSP to average cost, at least in the case where all policies are proper. As a result, analysis, insights, and algorithms for average cost policy evaluation can be applied to policy evaluation of a SSP problem.

Consider the SSP problem, a single proper stationary policy µ, and the probability distribution q_0 = ( q_0(1), . . . , q_0(n) ) used for restarting simulated trajectories [cf. Eq. (6.157)]. Let us modify the Markov chain by eliminating the self-transition from state 0 to itself, and substituting instead transitions from 0 to i with probabilities q_0(i),

p̃_{0i} = q_0(i),

each with a fixed transition cost β, where β is a scalar parameter. All other transitions and costs remain the same (cf. Fig. 6.10.1). We call the corresponding average cost problem β-AC. Denote by J_µ the SSP cost vector of µ, and by η_β and h_β(i) the average and differential costs of β-AC, respectively.
Figure 6.10.1. Transformation of a SSP problem to an average cost problem. In the SSP problem (left) state 0 has the self-transition p_{00} = 1; in the average cost problem (right) the transitions from 0 to each i = 1, . . . , n, occur with probability q_0(i) and have cost β.
(a) Show that η_β can be expressed as the average cost per stage of the cycle that starts at state 0 and returns to 0, i.e.,

η_β = ( β + Σ_{i=1}^n q_0(i)J_µ(i) ) / T,

where T is the expected time to return to 0 starting from 0.
(b) Show that for the special value

β^∗ = − Σ_{i=1}^n q_0(i)J_µ(i),

we have η_{β^∗} = 0, and

J_µ(i) = h_{β^∗}(i) − h_{β^∗}(0), i = 1, . . . , n.

Hint: Since the states of β-AC form a single recurrent class, we have from Bellman's equation

η_β + h_β(i) = Σ_{j=0}^n p_{ij} ( g(i, j) + h_β(j) ), i = 1, . . . , n, (6.234)

η_β + h_β(0) = β + Σ_{i=1}^n q_0(i)h_β(i). (6.235)

From Eq. (6.234) it follows that if β = β^∗, we have η_{β^∗} = 0, and

δ(i) = Σ_{j=0}^n p_{ij}g(i, j) + Σ_{j=1}^n p_{ij}δ(j), i = 1, . . . , n, (6.236)

where

δ(i) = h_{β^∗}(i) − h_{β^∗}(0), i = 1, . . . , n.

Since Eq. (6.236) is Bellman's equation for the SSP problem, we see that δ(i) = J_µ(i) for all i.
(c) Derive a transformation to convert an average cost policy evaluation problem into another average cost policy evaluation problem where the transition probabilities out of a single state are modified in any way such that the states of the resulting Markov chain form a single recurrent class. The two average cost problems should have the same differential cost vectors, except for a constant shift. Note: This conversion may be useful if the transformed problem has more favorable properties.
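Parts (a) and (b) can be verified numerically on a small illustrative SSP. The sketch below builds the β-AC chain for β = β^∗, solves the average cost Bellman equation with the normalization h(0) = 0, and checks that η_{β^∗} = 0 and J_µ(i) = h_{β^∗}(i) − h_{β^∗}(0):

```python
import numpy as np

# Illustrative SSP (states 1, 2, terminal 0):
#   from 1: cost 1, to 0 or to 2 w.p. 1/2 each;  from 2: cost 2, to 0 w.p. 1.
# Hence J_mu = (2, 2).  Restart distribution q0 = (1/2, 1/2).
J = np.array([2.0, 2.0])
q0 = np.array([0.5, 0.5])
beta = -(q0 @ J)                       # beta* = -sum_i q0(i) J_mu(i) = -2

# beta-AC chain on states (0, 1, 2): the self-transition at 0 is replaced by
# restarts 0 -> i with probability q0(i) and cost beta.
Pt = np.array([[0.0, 0.5, 0.5],
               [0.5, 0.0, 0.5],
               [1.0, 0.0, 0.0]])
gt = np.array([beta, 1.0, 2.0])        # expected cost per stage

# solve eta*e + h = gt + Pt h with the normalization h(0) = 0
M = np.zeros((4, 4))                   # unknowns (eta, h0, h1, h2)
M[:3, 0] = 1.0
M[:3, 1:] = np.eye(3) - Pt
M[3, 1] = 1.0
eta, *h = np.linalg.solve(M, np.append(gt, 0.0))
h = np.array(h)

assert abs(eta) < 1e-10                # eta_{beta*} = 0
assert np.allclose(h[1:] - h[0], J)    # J_mu(i) = h(i) - h(0)
```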
6.8 (Projected Equations for Finite-Horizon Problems)
Consider a finite-state finite-horizon policy evaluation problem with the cost vector and transition matrices at time m denoted by g_m and P_m, respectively. The DP algorithm/Bellman's equation takes the form

J_m = g_m + P_mJ_{m+1}, m = 0, . . . , N − 1,

where J_m is the cost vector of stage m for the given policy, and J_N is a given terminal cost vector. Consider a low-dimensional approximation of J_m that has the form

J_m ≈ Φ_m r_m, m = 0, . . . , N − 1,
where Φ_m is a matrix whose columns are basis functions. Consider also a projected equation of the form

Φ_m r_m = Π_m(g_m + P_mΦ_{m+1}r_{m+1}), m = 0, . . . , N − 1,

where Π_m denotes projection onto the space spanned by the columns of Φ_m with respect to a weighted Euclidean norm with weight vector ξ_m.
(a) Show that the projected equation can be written in the equivalent form

Φ′_m Ξ_m(Φ_m r_m − g_m − P_mΦ_{m+1}r_{m+1}) = 0, m = 0, . . . , N − 2,

Φ′_{N−1} Ξ_{N−1}(Φ_{N−1}r_{N−1} − g_{N−1} − P_{N−1}J_N) = 0,

where Ξ_m is the diagonal matrix having the vector ξ_m along the diagonal. Abbreviated solution: The derivation follows the one of Section 6.3.1 [cf. the analysis leading to Eqs. (6.33) and (6.34)]. The solution {r^∗_0, . . . , r^∗_{N−1}} of the projected equation is obtained as

{r^∗_0, . . . , r^∗_{N−1}} = arg min_{r_0,...,r_{N−1}} [ Σ_{m=0}^{N−2} ∥Φ_m r_m − (g_m + P_mΦ_{m+1}r^∗_{m+1})∥²_{ξ_m} + ∥Φ_{N−1}r_{N−1} − (g_{N−1} + P_{N−1}J_N)∥²_{ξ_{N−1}} ].

The minimization can be decomposed into N minimizations, one for each m, and by setting to 0 the gradient with respect to r_m, we obtain the desired form.
(b) Consider a simulation process that generates a sequence of trajectories of the system, similar to the case of a stochastic shortest path problem (cf. Section 6.6). Derive corresponding LSTD and (scaled) LSPE algorithms.

(c) Derive appropriate modifications of the algorithms of Section 6.4.3 to address a finite horizon version of the optimal stopping problem.
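The orthogonality conditions of part (a) decouple into a backward recursion: each r_m is the ξ_m-weighted least squares fit of g_m + P_mΦ_{m+1}r_{m+1} (with the terminal target g_{N−1} + P_{N−1}J_N). A sketch on random illustrative data:

```python
import numpy as np

# Random illustrative data: n = 5 states, s = 2 basis functions, N = 3 stages.
rng = np.random.default_rng(0)
n, s, N = 5, 2, 3
P = [rng.dirichlet(np.ones(n), size=n) for _ in range(N)]    # P_m (rows stochastic)
g = [rng.standard_normal(n) for _ in range(N)]               # g_m
Phi = [rng.standard_normal((n, s)) for _ in range(N)]        # Phi_m
Xi = [np.diag(rng.dirichlet(np.ones(n))) for _ in range(N)]  # diag(xi_m)
JN = rng.standard_normal(n)                                  # terminal cost J_N

r = [None] * N
target = g[N - 1] + P[N - 1] @ JN            # stage N-1 target
for m in reversed(range(N)):
    if m < N - 1:                            # stage m target uses r_{m+1}
        target = g[m] + P[m] @ (Phi[m + 1] @ r[m + 1])
    A = Phi[m].T @ Xi[m] @ Phi[m]
    r[m] = np.linalg.solve(A, Phi[m].T @ Xi[m] @ target)
# each r_m satisfies Phi_m' Xi_m (Phi_m r_m - target_m) = 0, i.e., part (a)
```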
6.9 (Approximation Error of TD Methods [Ber95])
This exercise illustrates how the value of λ may significantly affect the approximation quality in TD methods. Consider a problem of the SSP type, but with a single policy. The states are 0, 1, . . . , n, with state 0 being the termination state. Under the given policy, the system moves deterministically from state i ≥ 1 to state i − 1 at a cost g_i. Consider a linear approximation of the form

J̃(i, r) = i r

for the cost-to-go function, and the application of TD methods. Let all simulation runs start at state n and end at 0 after visiting all the states n − 1, n − 2, . . . , 1 in succession.
(a) Derive the corresponding projected equation Φr^∗_λ = ΠT^{(λ)}(Φr^∗_λ) and show that its unique solution r^∗_λ satisfies

Σ_{k=1}^n (g_k − r^∗_λ) ( λ^{n−k}n + λ^{n−k−1}(n − 1) + · · · + k ) = 0.
(b) Plot J̃(i, r^∗_λ) with λ from 0 to 1 in increments of 0.2, for the following two cases:

(1) n = 50, g_1 = 1 and g_i = 0 for all i ≠ 1.

(2) n = 50, g_n = −(n − 1) and g_i = 1 for all i ≠ n.
Figure 6.10.2 gives the results for λ = 0 and λ = 1.

Figure 6.10.2: Form of the cost-to-go function J(i), and the linear representations J̃(i, r^∗_λ) in Exercise 6.9, for the case g_1 = 1, g_i = 0, ∀ i ≠ 1 (figure on the left), and the case g_n = −(n − 1), g_i = 1, ∀ i ≠ n (figure on the right). Each plot shows the cost function J(i) together with the TD(0) and TD(1) approximations over the states i = 0 to 50.
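The scalar equation of part (a) is easy to solve numerically: the coefficient of (g_k − r^∗_λ) is w_k = Σ_{i=k}^{n} λ^{i−k} i, so r^∗_λ is a weighted average of the g_k. A sketch for case (1):

```python
# Sketch: solving the scalar projected equation of part (a) for r*_lambda.
# The coefficient of (g_k - r) is  w_k = sum_{i=k}^{n} lambda^(i-k) * i,
# so r*_lambda = (sum_k w_k g_k) / (sum_k w_k).
def r_star(g, lam):
    n = len(g)                                   # g[k-1] holds g_k
    w = [sum(lam ** (i - k) * i for i in range(k, n + 1))
         for k in range(1, n + 1)]
    return sum(wk * gk for wk, gk in zip(w, g)) / sum(w)

n = 50
g1 = [1.0] + [0.0] * (n - 1)                     # case (1): g_1 = 1, g_i = 0, i != 1
r0, r1 = r_star(g1, 0.0), r_star(g1, 1.0)
# r0 is about 0.00078 and r1 about 0.0297; multiplied by i = 50 these
# reproduce the TD(0) and TD(1) lines in the left plot of Figure 6.10.2
```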
6.10 (λ-Policy Iteration [BeI96])
The purpose of this exercise is to discuss a method that relates to approximate policy iteration and LSPE(λ). Consider the discounted problem of Section 6.3, a scalar λ ∈ [0, 1), and a policy iteration-like method that generates a sequence of policy-cost vector pairs ( (µ_0, J_0), (µ_1, J_1), . . . ), starting from an arbitrary pair (µ_0, J_0), as follows: Given the current pair (µ, J), it computes the next pair (µ̄, J̄) according to

T_µ̄ J = TJ,

and

J̄ = T^{(λ)}_µ̄ J. (6.237)

(Here µ̄ is the “improved” policy and J̄ is an estimate of the cost vector J_µ̄.)

(a) Show that for λ = 0 the algorithm is equivalent to value iteration, and as λ → 1, the algorithm approaches policy iteration (in the sense that J̄ → J_µ̄).
(b) Show that for all k greater than some index, we have

∥J_{k+1} − J^∗∥_∞ ≤ ( α(1 − λ)/(1 − αλ) ) ∥J_k − J^∗∥_∞.

(See [BeI96] or [BeT96], Prop. 2.8, for a proof.)
(c) Consider a linear cost approximation architecture of the form Φr ≈ J and a corresponding projected version of the “evaluation” equation (6.237):

Φr̄ = ΠT^{(λ)}_µ̄(Φr).

Interpret this as a single iteration of PVI(λ) for approximating J_µ̄. Note: Since this method uses a single PVI(λ) iteration, one may consider a simulation-based implementation with a single LSPE(λ) iteration:

r̄ = r − G_k(C_k r − d_k),

where G_k, C_k, and d_k are simulation-based approximations of the corresponding matrices Φ′ΞΦ, C^{(λ)}, and vector d^{(λ)} (cf. Section 6.3.7). However, a substantial number of samples using the policy µ̄ are necessary to approximate well these quantities. Still the intuition around this method suggests that once good approximations G_k, C_k, and d_k of Φ′ΞΦ, C^{(λ)}, and d^{(λ)} have been obtained, it is only necessary to perform just a few LSPE(λ) iterations for a given policy within an approximate policy iteration context.
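For a lookup table (exact) version of the method, the evaluation step (6.237) can be computed in closed form via the identity T^{(λ)}_µ J = J + (I − λαP_µ)^{−1}(T_µJ − J). The sketch below runs the iteration on random illustrative data and checks that it reaches the fixed point of T:

```python
import numpy as np

# Sketch of an exact (lookup table) version of the method, using the identity
# T^(lambda)_mu J = J + (I - lambda*alpha*P_mu)^{-1} (T_mu J - J).
# The 3-state, 2-control problem data are illustrative only.
rng = np.random.default_rng(1)
n, m, alpha, lam = 3, 2, 0.9, 0.5
P = rng.dirichlet(np.ones(n), size=(m, n))    # P[u][i]: transition probs under u
g = rng.random((m, n))                        # g[u][i]: expected stage cost

def T(J):
    """Bellman operator; returns (TJ, greedy policy)."""
    Q = np.array([g[u] + alpha * P[u] @ J for u in range(m)])
    return Q.min(axis=0), Q.argmin(axis=0)

J = np.zeros(n)
for _ in range(300):
    TJ, mu = T(J)                             # policy improvement: T_mu J = T J
    Pmu = P[mu, np.arange(n)]                 # transition matrix of mu
    J = J + np.linalg.solve(np.eye(n) - lam * alpha * Pmu, TJ - J)

Jstar, _ = T(J)
assert np.allclose(Jstar, J)                  # J has converged to the fixed point J*
```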
6.11
This exercise provides an example of comparison of the projected equation approach of Section 6.8.1 and the least squares approach of Section 6.8.2. Consider the case of a linear system involving a vector x with two block components, x_1 ∈ ℜ^k and x_2 ∈ ℜ^m. The system has the form

x_1 = A_{11}x_1 + b_1, x_2 = A_{21}x_1 + A_{22}x_2 + b_2,

so x_1 can be obtained by solving the first equation. Let the approximation subspace be ℜ^k × S_2, where S_2 is a subspace of ℜ^m. Show that with the projected equation approach, we obtain the component x^∗_1 of the solution of the original equation, but with the least squares approach, we do not.
6.12 (Error Bounds for Hard Aggregation [TsV96])
Consider the hard aggregation case of Section 6.5.2, and denote i ∈ x if the original state i belongs to aggregate state x. Also for every i denote by x(i) the aggregate state x with i ∈ x. Consider the corresponding mapping F defined by

(FR)(x) = Σ_{i=1}^n d_{xi} min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + αR(x(j)) ), x ∈ A,

[cf. Eq. (6.140)], and let R^∗ be the unique fixed point of this mapping. Show that

R^∗(x) − ǫ/(1 − α) ≤ J^∗(i) ≤ R^∗(x) + ǫ/(1 − α), ∀ x ∈ A, i ∈ x,

where

ǫ = max_{x∈A} max_{i,j∈x} |J^∗(i) − J^∗(j)|.
Abbreviated Proof: Let the vector R be defined by

R(x) = min_{i∈x} J^∗(i) + ǫ/(1 − α), x ∈ A.

We have for all x ∈ A,

(FR)(x) = Σ_{i=1}^n d_{xi} min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + αR(x(j)) )
≤ Σ_{i=1}^n d_{xi} min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + αJ^∗(j) + αǫ/(1 − α) )
= Σ_{i=1}^n d_{xi} ( J^∗(i) + αǫ/(1 − α) )
≤ min_{i∈x} ( J^∗(i) + ǫ ) + αǫ/(1 − α)
= min_{i∈x} J^∗(i) + ǫ/(1 − α)
= R(x).

Thus, FR ≤ R, from which it follows that R^∗ ≤ R (since R^∗ = lim_{k→∞} F^kR and F is monotone). This proves the left-hand side of the desired inequality. The right-hand side follows similarly.
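The bound can be checked numerically. The sketch below uses an illustrative single-policy four-state problem with two aggregate states (so the minimization over u is trivial) and uniform disaggregation probabilities:

```python
import numpy as np

# Numerical check of the bound on an illustrative single-policy problem:
# 4 states, aggregates x = 0 for states {0,1} and x = 1 for states {2,3}.
alpha = 0.9
P = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.1, 0.1],
              [0.1, 0.1, 0.7, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
g = np.array([1.0, 1.2, 3.0, 3.1])            # expected cost per stage
Jstar = np.linalg.solve(np.eye(4) - alpha * P, g)   # single policy, so J* is exact

x = [0, 0, 1, 1]                              # x(i): aggregate state of i
D = np.array([[0.5, 0.5, 0.0, 0.0],           # disaggregation probabilities d_xi
              [0.0, 0.0, 0.5, 0.5]])

R = np.zeros(2)                               # fixed-point iteration for R* = F R*
for _ in range(500):
    R = D @ (g + alpha * P @ R[x])

eps = max(abs(Jstar[i] - Jstar[j])
          for i in range(4) for j in range(4) if x[i] == x[j])
assert all(R[x[i]] - eps/(1 - alpha) <= Jstar[i] <= R[x[i]] + eps/(1 - alpha)
           for i in range(4))
```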
6.13 (Hard Aggregation as a Projected Equation Method)
Consider a fixed point equation of the form

r = DT(Φr),

where T : ℜ^n → ℜ^n is a (possibly nonlinear) mapping, and D and Φ are s × n and n × s matrices, respectively, and Φ has rank s. Writing this equation as

Φr = ΦDT(Φr),

we see that it is a projected equation if ΦD is a projection onto the subspace S = {Φr | r ∈ ℜ^s} with respect to a weighted Euclidean norm. The purpose of this exercise is to prove that this is the case in hard aggregation schemes, where the set of indices {1, . . . , n} is partitioned into s disjoint subsets I_1, . . . , I_s and:

(1) The ℓth column of Φ has components that are 1 or 0 depending on whether they correspond to an index in I_ℓ or not.

(2) The ℓth row of D is a probability distribution (d_{ℓ1}, . . . , d_{ℓn}) whose components are positive depending on whether they correspond to an index in I_ℓ or not, i.e., Σ_{i=1}^n d_{ℓi} = 1, d_{ℓi} > 0 if i ∈ I_ℓ, and d_{ℓi} = 0 if i ∉ I_ℓ.

Show in particular that ΦD is given by the projection formula

ΦD = Φ(Φ′ΞΦ)^{−1}Φ′Ξ,

where Ξ is the diagonal matrix with the nonzero components of D along the diagonal, normalized so that they form a probability distribution, i.e.,

ξ_i = d_{ℓi} / ( Σ_{k=1}^s Σ_{j=1}^n d_{kj} ), ∀ i ∈ I_ℓ, ℓ = 1, . . . , s.

Notes: (1) Based on the preceding result, if T is a contraction mapping with respect to the projection norm, the same is true for ΦDT. In addition, if T is a contraction mapping with respect to the sup-norm, the same is true for DTΦ (since aggregation and disaggregation matrices are nonexpansive with respect to the sup-norm); this is true for all aggregation schemes, not just hard aggregation. (2) For ΦD to be a weighted Euclidean projection, we must have ΦDΦD = ΦD. This implies that if DΦ is invertible and ΦD is a weighted Euclidean projection, we must have DΦ = I (since if DΦ is invertible, Φ has rank s, which implies that DΦD = D and hence DΦ = I, since D also has rank s). From this it can be seen that out of all possible aggregation schemes with DΦ invertible and D having nonzero columns, only hard aggregation has the projection property of this exercise.
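The projection formula can be verified numerically on a small illustrative hard aggregation scheme (n = 4, s = 2):

```python
import numpy as np

# Illustrative hard aggregation: I_1 = {0, 1}, I_2 = {2, 3}.
Phi = np.array([[1.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0],
                [0.0, 1.0]])
D = np.array([[0.3, 0.7, 0.0, 0.0],      # row l: distribution over I_l
              [0.0, 0.0, 0.5, 0.5]])

# xi: the nonzero components of D, normalized into a single distribution
xi = D.sum(axis=0) / D.sum()
Xi = np.diag(xi)

Pi = Phi @ np.linalg.inv(Phi.T @ Xi @ Phi) @ Phi.T @ Xi
assert np.allclose(Phi @ D, Pi)          # Phi D is the xi-weighted projection
assert np.allclose(Pi @ Pi, Pi)          # idempotent, as a projection must be
```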
6.14 (Simulation-Based Implementation of Linear Aggregation Schemes)
Consider a linear system Ax = b, where A is an n × n matrix and b is a column vector in ℜ^n. In a scheme that generalizes the aggregation approach of Section 6.5, we introduce an n × s matrix Φ, whose columns are viewed as basis functions, and an s × n matrix D. We find a solution r^∗ of the s × s system

DAΦr = Db,

and we view Φr^∗ as an approximate solution of Ax = b. An approximate implementation is to compute by simulation approximations C̃ and d̃ of the matrices C = DAΦ and d = Db, respectively, and then solve the system C̃r = d̃. The purpose of the exercise is to provide a scheme for doing this.

Let Ĝ and Â be matrices of dimensions s × n and n × n, respectively, whose rows are probability distributions, and are such that their components satisfy

Ĝ_{ℓi} > 0 if D_{ℓi} ≠ 0, ℓ = 1, . . . , s, i = 1, . . . , n,

Â_{ij} > 0 if A_{ij} ≠ 0, i = 1, . . . , n, j = 1, . . . , n.

We approximate the (ℓm)th component C_{ℓm} of C as follows. We generate a sequence { (i_t, j_t) | t = 1, 2, . . . } by independently generating each i_t according to the distribution {Ĝ_{ℓi} | i = 1, . . . , n}, and then by independently generating j_t according to the distribution {Â_{i_t j} | j = 1, . . . , n}. For k = 1, 2, . . ., consider the scalar

C^k_{ℓm} = (1/k) Σ_{t=1}^k (D_{ℓ i_t}/Ĝ_{ℓ i_t}) (A_{i_t j_t}/Â_{i_t j_t}) Φ_{j_t m},

where Φ_{jm} denotes the (jm)th component of Φ. Show that with probability 1 we have

lim_{k→∞} C^k_{ℓm} = C_{ℓm}.

Derive a similar scheme for approximating the components of d.
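A sketch of this estimator on random illustrative data is given below; Ghat and Ahat stand for the row sampling distributions (positive wherever the corresponding entries of D and A are nonzero), and here they are simply taken proportional to |D| and |A| row by row:

```python
import numpy as np

# Random illustrative data; Ghat and Ahat are the row sampling distributions.
rng = np.random.default_rng(0)
n, s = 4, 2
D = rng.random((s, n))
A = rng.standard_normal((n, n))
Phi = rng.standard_normal((n, s))
Ghat = np.abs(D) / np.abs(D).sum(axis=1, keepdims=True)   # Ghat_li > 0 where D_li != 0
Ahat = np.abs(A) / np.abs(A).sum(axis=1, keepdims=True)   # Ahat_ij > 0 where A_ij != 0

l, mcol, k = 0, 1, 100000
total = 0.0
for _ in range(k):
    i = rng.choice(n, p=Ghat[l])              # i_t sampled from row l of Ghat
    j = rng.choice(n, p=Ahat[i])              # j_t sampled from row i_t of Ahat
    total += (D[l, i] / Ghat[l, i]) * (A[i, j] / Ahat[i, j]) * Phi[j, mcol]

C_est = total / k                             # the scalar C^k_lm
C_true = (D @ A @ Phi)[l, mcol]
# C_est -> C_true with probability 1 as k -> infinity
```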
6.15 (Approximate Policy Iteration Using an Approximate Problem)
Consider the discounted problem of Section 6.3 (referred to as DP) and an approximation to this problem (this is a different discounted problem referred to as AP). This exercise considers an approximate policy iteration method where the policy evaluation is done through AP, but the policy improvement is done through DP – a process that is patterned after the aggregation-based policy iteration method of Section 6.5.2. In particular, we assume that the two problems, DP and AP, are connected as follows:

(1) DP and AP have the same state and control spaces, and the same policies.

(2) For any policy µ, its cost vector in AP, denoted J̃_µ, satisfies

∥J̃_µ − J_µ∥_∞ ≤ δ,
i.e., policy evaluation using AP, rather than DP, incurs an error of at most
δ in supnorm.
(3) The policy µ̄ obtained by exact policy iteration in AP satisfies the equation

T J̃µ̄ = Tµ̄ J̃µ̄.

This is true in particular if the policy improvement process in AP is identical to the one in DP.

Show the error bound

‖Jµ̄ − J*‖∞ ≤ (2αδ)/(1 − α)

[cf. Eq. (6.143)]. Hint: Follow the derivation of the error bound (6.143).
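The bound is easy to check numerically. Below is a small sanity check on a made-up two-state, two-action discounted problem: AP is taken to be DP with a fixed perturbation of each policy's cost vector (of sup-norm δ), and every policy that satisfies the fixed-point condition of (3), i.e., is greedy with respect to its own perturbed evaluation, is verified against the bound 2αδ/(1 − α). All problem data are illustrative assumptions.

```python
# Made-up 2-state, 2-action discounted problem (all data are illustrative).
alpha, delta = 0.9, 0.1
n, nu = 2, 2
g = [[1.0, 2.0], [0.5, 3.0]]                  # g[i][u]: cost of action u at state i
P = [[[0.8, 0.2], [0.3, 0.7]],                # P[i][u]: transition row for (i, u)
     [[0.6, 0.4], [0.1, 0.9]]]
pert = [delta, -delta]                        # AP's evaluation error, sup-norm = delta

def q(i, u, J):
    return g[i][u] + alpha * sum(P[i][u][j] * J[j] for j in range(n))

def eval_policy(mu):                          # J_mu via fixed-point iteration
    J = [0.0] * n
    for _ in range(2000):
        J = [q(i, mu[i], J) for i in range(n)]
    return J

Jstar = [0.0] * n                             # J* via value iteration
for _ in range(2000):
    Jstar = [min(q(i, u, Jstar) for u in range(nu)) for i in range(n)]

bound = 2 * alpha * delta / (1 - alpha)
checked = 0
for mu in [(a, b) for a in range(nu) for b in range(nu)]:
    Jt = [x + e for x, e in zip(eval_policy(mu), pert)]   # AP evaluation of mu
    # Condition (3): mu is greedy with respect to its own AP evaluation.
    if all(q(i, mu[i], Jt) == min(q(i, u, Jt) for u in range(nu)) for i in range(n)):
        gap = max(abs(a - b) for a, b in zip(eval_policy(mu), Jstar))
        assert gap <= bound + 1e-9
        checked += 1
print(checked, "stable policy/policies; bound =", bound)
```

On this instance the only stable policy is in fact optimal, so the bound holds with room to spare; the check is a sketch, not a proof.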
6.16 (Approximate Policy Iteration and Newton's Method)

Consider the discounted problem, and a policy iteration method that uses function approximation over a subspace S = {Φr | r ∈ ℜ^s}, and operates as follows: Given a vector r^k ∈ ℜ^s, define µ^k to be a policy such that

T_{µ^k}(Φr^k) = T(Φr^k),   (6.238)

and let r^{k+1} satisfy

r^{k+1} = L^{k+1} T_{µ^k}(Φr^{k+1}),

where L^{k+1} is an s×n matrix. Show that if µ^k is the unique policy satisfying Eq. (6.238), then r^{k+1} is obtained from r^k by a Newton step for solving the equation

r = L^{k+1} T(Φr)

(cf. Exercise 1.10).
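To see the Newton connection concretely, here is a scalar (s = 1) sketch on a made-up two-state problem: on the region where µ^k remains greedy, T is linear, so the iterate r^{k+1} defined by the fixed-point condition coincides with a Newton step for F(r) = r − L T(Φr). The problem data and the choices of L and Φ are illustrative assumptions.

```python
alpha = 0.9
n = 2
g = [[1.0, 2.0], [0.5, 3.0]]                  # made-up costs g[i][u]
P = [[[0.8, 0.2], [0.3, 0.7]],                # made-up transitions P[i][u]
     [[0.6, 0.4], [0.1, 0.9]]]
Phi = [1.0, 1.0]                              # n x 1 basis (s = 1)
L = [0.5, 0.5]                                # 1 x n matrix

def T(J):
    """Return (T J, greedy policy)."""
    vals, mu = [], []
    for i in range(n):
        qs = [g[i][u] + alpha * sum(P[i][u][j] * J[j] for j in range(n))
              for u in range(2)]
        u = 0 if qs[0] <= qs[1] else 1
        vals.append(qs[u]); mu.append(u)
    return vals, mu

rk = 0.0
vals, mu = T([Phi[i] * rk for i in range(n)])          # mu^k from Eq. (6.238)

# r^{k+1} solves the linear scalar equation r = L T_{mu^k}(Phi r) = a r + b.
a = alpha * sum(L[i] * sum(P[i][mu[i]][j] * Phi[j] for j in range(n))
                for i in range(n))
b = sum(L[i] * g[i][mu[i]] for i in range(n))
r_next = b / (1 - a)

# Newton step for F(r) = r - L T(Phi r): where mu^k stays greedy, F is
# linear with slope 1 - a, so the step lands on the same point.
F_rk = rk - sum(L[i] * vals[i] for i in range(n))
r_newton = rk - F_rk / (1 - a)
print(r_next, r_newton)    # identical
```

The agreement is exact here because T is piecewise linear and the step stays within a single linearity region; this illustrates, rather than proves, the statement of the exercise.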
6.17 (Generalized Policy Iteration)

Consider a policy iteration-type algorithm that has the following form: Given µ, we find J̃µ satisfying

J̃µ(i) = (Fµ J̃µ)(i),   i = 1, . . . , n.

We update µ to a new policy µ̄ by

µ̄(i) = arg min_ν (Fν J̃µ)(i),   i = 1, . . . , n,

where we assume that µ̄ so obtained satisfies

(Fµ̄ J̃µ)(i) = min_ν (Fν J̃µ)(i),   i = 1, . . . , n.

Here Fµ : ℜ^n → ℜ^n is a mapping that is parametrized by the policy µ, and is assumed to be both monotone and a sup-norm contraction with modulus α ∈ (0, 1). Furthermore, (Fν J̃µ)(i) depends on ν only through the component ν(i), and is minimized over that component, which may take only a finite number of values. Show that the sequence of policies generated by the algorithm terminates with J̃µ being the unique fixed point J* of the mapping F defined by

(FJ)(i) = min_µ (Fµ J)(i),   i = 1, . . . , n.

Hint: Show that J̃µ ≥ J̃µ̄, with equality if J̃µ = J*.
Sec. 6.10 Notes, Sources, and Exercises 505
6.18 (Policy Gradient Formulas for SSP)

Consider the SSP context, and let the cost per stage and transition probability matrix be given as functions of a parameter vector r. Denote by gi(r), i = 1, . . . , n, the expected cost starting at state i, and by pij(r) the transition probabilities. Each value of r defines a stationary policy, which is assumed proper. For each r, the expected costs starting at states i are denoted by Ji(r). We wish to calculate the gradient of a weighted sum of the costs Ji(r), i.e.,

J̄(r) = Σ_{i=1}^n q(i) Ji(r),

where q = (q(1), . . . , q(n)) is some probability distribution over the states. Consider a single scalar component rm of r, and differentiate Bellman's equation to show that

∂Ji/∂rm = ∂gi/∂rm + Σ_{j=1}^n (∂pij/∂rm) Jj + Σ_{j=1}^n pij (∂Jj/∂rm),   i = 1, . . . , n,

where the argument r at which the partial derivatives are computed is suppressed. Interpret the above equation as Bellman's equation for a SSP problem.
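The interpretation suggests a direct numerical check: since ∂J/∂rm satisfies a Bellman equation with the same transition probabilities and modified stage costs ∂gi/∂rm + Σ_j (∂pij/∂rm) Jj, it can be computed by the same fixed-point iteration and compared with a finite difference of J(r). The sketch below does this for a hypothetical SSP with two transient states plus a cost-free termination state; the parametrization is an illustrative assumption.

```python
# Hypothetical SSP: two transient states plus a cost-free termination state;
# the parameter r enters one stage cost and one transition probability.
def model(r):
    g = [1.0 + r * r, 2.0]                       # stage costs g_i(r)
    p = [[0.1, 0.3 + 0.1 * r],                   # transient-to-transient probs
         [0.2, 0.1]]                             # (the rest goes to termination)
    return g, p

def dmodel(r):                                   # exact partial derivatives
    dg = [2.0 * r, 0.0]
    dp = [[0.0, 0.1], [0.0, 0.0]]
    return dg, dp

def solve(g, p):                                 # J = g + p J (proper policy)
    J = [0.0, 0.0]
    for _ in range(5000):
        J = [g[i] + sum(p[i][j] * J[j] for j in range(2)) for i in range(2)]
    return J

r = 0.5
g, p = model(r)
dg, dp = dmodel(r)
J = solve(g, p)

# The differentiated Bellman equation dJ = dg + dp J + p dJ is itself an SSP
# Bellman equation with stage costs dg_i + sum_j dp_ij J_j.
dJ = solve([dg[i] + sum(dp[i][j] * J[j] for j in range(2)) for i in range(2)], p)

# Central finite-difference check of dJ against J(r +/- h).
h = 1e-6
Jp = solve(*model(r + h))
Jm = solve(*model(r - h))
fd = [(a - b) / (2 * h) for a, b in zip(Jp, Jm)]
print(dJ, fd)    # the two agree to several digits
```

The weighted gradient ∂J̄/∂rm of the exercise is then just q·dJ for the chosen distribution q.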
References
[ABB01] Abounadi, J., Bertsekas, D. P., and Borkar, V. S., 2001. “Learning Algorithms for Markov Decision Processes with Average Cost,” SIAM J. on Control and Optimization, Vol. 40, pp. 681-698.
[ABB02] Abounadi, J., Bertsekas, D. P., and Borkar, V. S., 2002. “Stochastic Approximation for Non-Expansive Maps: Q-Learning Algorithms,” SIAM J. on Control and Optimization, Vol. 41, pp. 1-22.
[ABJ06] Ahamed, T. P. I., Borkar, V. S., and Juneja, S., 2006. “Adap
tive Importance Sampling Technique for Markov Chains Using Stochastic
Approximation,” Operations Research, Vol. 54, pp. 489504.
[ASM08] Antos, A., Szepesvari, C., and Munos, R., 2008. “Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path,” Machine Learning, Vol. 71, pp. 89-129.
[AbB02] Aberdeen, D., and Baxter, J., 2002. “Scalable InternalState Policy
Gradient Methods for POMDPs,” Proc. of the Nineteenth International
Conference on Machine Learning, pp. 310.
[Ama98] Amari, S., 1998. “Natural Gradient Works Eﬃciently in Learn
ing,” Neural Computation, Vol. 10, pp. 251276.
[BBN04] Bertsekas, D. P., Borkar, V., and Nedi´c, A., 2004. “Improved
Temporal Diﬀerence Methods with Linear Function Approximation,” in
Learning and Approximate Dynamic Programming, by J. Si, A. Barto, W.
Powell, (Eds.), IEEE Press, N. Y.
[BBS95] Barto, A. G., Bradtke, S. J., and Singh, S. P., 1995. “Real
Time Learning and Control Using Asynchronous Dynamic Programming,”
Artiﬁcial Intelligence, Vol. 72, pp. 81138.
[BED09] Busoniu, L., Ernst, D., De Schutter, B., and Babuska, R., 2009.
“Online LeastSquares Policy Iteration for Reinforcement Learning Con
trol,” unpublished report, Delft Univ. of Technology, Delft, NL.
[BHO08] Bethke, B., How, J. P., and Ozdaglar, A., 2008. “Approximate Dynamic Programming Using Support Vector Regression,” Proc. IEEE Conference on Decision and Control, Cancun, Mexico.
[BKM05] de Boer, P. T., Kroese, D. P., Mannor, S., and Rubinstein, R. Y.
2005. “A Tutorial on the CrossEntropy Method,” Annals of Operations
Research, Vol. 134, pp. 1967.
[BSA83] Barto, A. G., Sutton, R. S., and Anderson, C. W., 1983. “Neuron
like Elements that Can Solve Diﬃcult Learning Control Problems,” IEEE
Trans. on Systems, Man, and Cybernetics, Vol. 13, pp. 835846.
[BaB01] Baxter, J., and Bartlett, P. L., 2001. “InﬁniteHorizon Policy
Gradient Estimation,” J. Artiﬁcial Intelligence Research, Vol. 15, pp. 319–
350.
[Bai93] Baird, L. C., 1993. “Advantage Updating,” Report WLTR93
1146, Wright Patterson AFB, OH.
[Bai94] Baird, L. C., 1994. “Reinforcement Learning in Continuous Time:
Advantage Updating,” International Conf. on Neural Networks, Orlando,
Fla.
[Bai95] Baird, L. C., 1995. “Residual Algorithms: Reinforcement Learning
with Function Approximation,” Dept. of Computer Science Report, U.S.
Air Force Academy, CO.
[BeI96] Bertsekas, D. P., and Ioﬀe, S., 1996. “Temporal DiﬀerencesBased
Policy Iteration and Applications in NeuroDynamic Programming,” Lab.
for Info. and Decision Systems Report LIDSP2349, Massachusetts Insti
tute of Technology.
[BeT89] Bertsekas, D. P., and Tsitsiklis, J. N., 1989. Parallel and Dis
tributed Computation: Numerical Methods, PrenticeHall, Englewood Cliﬀs,
N. J.; republished by Athena Scientiﬁc, Belmont, MA, 1997.
[BeT96] Bertsekas, D. P., and Tsitsiklis, J. N., 1996. NeuroDynamic Pro
gramming, Athena Scientiﬁc, Belmont, MA.
[BeT00] Bertsekas, D. P., and Tsitsiklis, J. N., 2000. “Gradient Convergence
in Gradient Methods,” SIAM J. on Optimization, Vol. 10, pp. 627642.
[BeY07] Bertsekas, D. P., and Yu, H., 2007. “Solution of Large Systems
of Equations Using Approximate Dynamic Programming Methods,” LIDS
Report 2754, MIT.
[BeY09] Bertsekas, D. P., and Yu, H., 2009. “Projected Equation Methods
for Approximate Solution of Large Linear Systems,” Journal of Computa
tional and Applied Mathematics, Vol. 227, pp. 2750.
[BeY10] Bertsekas, D. P., and Yu, H., 2010. “QLearning and Enhanced
Policy Iteration in Discounted Dynamic Programming,” Lab. for Informa
tion and Decision Systems Report LIDSP2831, MIT.
[Ber82] Bertsekas, D. P., 1982. “Distributed Dynamic Programming,” IEEE
Trans. Automatic Control, Vol. AC27, pp. 610616.
[Ber83] Bertsekas, D. P., 1983. “Asynchronous Distributed Computation of
Fixed Points,” Math. Programming, Vol. 27, pp. 107120.
[Ber95] Bertsekas, D. P., 1995. “A Counterexample to Temporal Diﬀerences
Learning,” Neural Computation, Vol. 7, pp. 270279.
[Ber96] Bertsekas, D. P., 1996. Lecture at NSF Workshop on Reinforcement
Learning, Hilltop House, Harper’s Ferry, N.Y.
[Ber97] Bertsekas, D. P., 1997. “Diﬀerential Training of Rollout Policies,”
Proc. of the 35th Allerton Conference on Communication, Control, and
Computing, Allerton Park, Ill.
[Ber99] Bertsekas, D. P., 1999. Nonlinear Programming: 2nd Edition, Athe
na Scientiﬁc, Belmont, MA.
[Ber05a] Bertsekas, D. P., 2005. “Dynamic Programming and Suboptimal
Control: A Survey from ADP to MPC,” in Fundamental Issues in Control,
European J. of Control, Vol. 11.
[Ber05b] Bertsekas, D. P., 2005. “Rollout Algorithms for Constrained Dy
namic Programming,” Lab. for Information and Decision Systems Report
2646, MIT.
[Ber09] Bertsekas, D. P., 2009. “Projected Equations, Variational Inequali
ties, and Temporal Diﬀerence Methods,” Lab. for Information and Decision
Systems Report LIDSP2808, MIT; IEEE Trans. on Aut. Control, to ap
pear.
[Ber10] Bertsekas, D. P., 2010. “Approximate Policy Iteration: A Survey
and Some New Methods,” Lab. for Information and Decision Systems Re
port LIDSP2833, MIT; to appear in Journal of Control Theory and Ap
plications.
[BoM00] Borkar, V. S., and Meyn, S. P., 2000. “The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning,” SIAM J. Control and Optimization, Vol. 38, pp. 447-469.
[Bor08] Borkar, V. S., 2008. Stochastic Approximation: A Dynamical Sys
tems Viewpoint, Cambridge Univ. Press, N. Y.
[Bor09] Borkar, V. S., 2009. “Reinforcement Learning A Bridge Between
Numerical Methods and Monte Carlo,” in World Scientiﬁc Review Vol. 9,
Chapter 4.
[Boy02] Boyan, J. A., 2002. “Technical Update: LeastSquares Temporal
Diﬀerence Learning,” Machine Learning, Vol. 49, pp. 115.
[BrB96] Bradtke, S. J., and Barto, A. G., 1996. “Linear LeastSquares
Algorithms for Temporal Diﬀerence Learning,” Machine Learning, Vol. 22,
pp. 3357.
[Bur97] Burgiel, H., 1997. “How to Lose at Tetris,” The Mathematical
Gazette, Vol. 81, pp. 194200.
[CFH07] Chang, H. S., Fu, M. C., Hu, J., Marcus, S. I., 2007. Simulation
Based Algorithms for Markov Decision Processes, Springer, N. Y.
[CPS92] Cottle, R. W., Pang, JS., and Stone, R. E., 1992. The Linear
Complementarity Problem, Academic Press, N. Y.; republished by SIAM
in 2009.
[CaC97] Cao, X. R., and Chen, H. F., 1997. “Perturbation Realization, Potentials, and Sensitivity Analysis of Markov Processes,” IEEE Transactions on Automatic Control, Vol. 42, pp. 1382-1393.
[CaW98] Cao, X. R., and Wan, Y. W., 1998. “Algorithms for Sensitivity
Analysis of Markov Systems Through Potentials and Perturbation Realiza
tion,” IEEE Transactions Control Systems Technology, Vol. 6, pp. 482494.
[Cao99] Cao, X. R., 1999. “Single Sample Path Based Optimization of Markov Chains,” J. of Optimization Theory and Applications, Vol. 100, pp. 527-548.
[Cao04] Cao, X. R., 2004. “Learning and Optimization from a System The
oretic Perspective,” in Learning and Approximate Dynamic Programming,
by J. Si, A. Barto, W. Powell, (Eds.), IEEE Press, N. Y.
[Cao05] Cao, X. R., 2005. “A Basic Formula for Online Policy Gradient
Algorithms,” IEEE Transactions on Automatic Control, Vol. 50, pp. 696
699.
[Cao07] Cao, X. R., 2007. Stochastic Learning and Optimization: A Sensiti
vityBased Approach, Springer, N. Y.
[ChV06] Choi, D. S., and Van Roy, B., 2006. “A Generalized Kalman Filter
for Fixed Point Approximation and Eﬃcient TemporalDiﬀerence Learn
ing,” Discrete Event Dynamic Systems, Vol. 16, pp. 207239.
[DFM09] Desai, V. V., Farias, V. F., and Moallemi, C. C., 2009. “Approximate Dynamic Programming via a Smoothed Approximate Linear Program,” Submitted.
[DFV00] de Farias, D. P., and Van Roy, B., 2000. “On the Existence of
Fixed Points for Approximate Value Iteration and TemporalDiﬀerence
Learning,” J. of Optimization Theory and Applications, Vol. 105.
[DFV03] de Farias, D. P., and Van Roy, B., 2003. “The Linear Programming
Approach to Approximate Dynamic Programming,” Operations Research,
Vol. 51, pp. 850865.
[DFV04a] de Farias, D. P., and Van Roy, B., 2004. “On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming,” Mathematics of Operations Research, Vol. 29, pp. 462-478.
[Day92] Dayan, P., 1992. “The Convergence of TD(λ) for General λ,” Ma
chine Learning, Vol. 8, pp. 341362.
[DeF04] De Farias, D. P., 2004. “The Linear Programming Approach to
Approximate Dynamic Programming,” in Learning and Approximate Dy
namic Programming, by J. Si, A. Barto, W. Powell, (Eds.), IEEE Press,
N. Y.
[EGW06] Ernst, D., Geurts, P., and Wehenkel, L., 2006. “TreeBased Batch
Mode Reinforcement Learning,” Journal of Machine Learning Research,
Vol. 6, pp. 503556.
[FaV06] Farias, V. F., and Van Roy, B., 2006. “Tetris: A Study of Randomized Constraint Sampling,” in Probabilistic and Randomized Methods for Design Under Uncertainty, Springer-Verlag.
[FeS94] Feinberg, E. A., and Shwartz, A., 1994. “Markov Decision Models
with Weighted Discounted Criteria,” Mathematics of Operations Research,
Vol. 19, pp. 117.
[FeS04] Ferrari, S., and Stengel, R. F., 2004. “ModelBased Adaptive Critic
Designs,” in Learning and Approximate Dynamic Programming, by J. Si,
A. Barto, W. Powell, (Eds.), IEEE Press, N. Y.
[Fle84] Fletcher, C. A. J., 1984. Computational Galerkin Methods, Springer
Verlag, N. Y.
[FuH94] Fu, M. C., and Hu, J., 1994. “Smoothed Perturbation Analysis Derivative Estimation for Markov Chains,” Oper. Res. Letters, Vol. 41, pp. 241-251.
[GKP03] Guestrin, C. E., Koller, D., Parr, R., and Venkataraman, S.,
2003. “Eﬃcient Solution Algorithms for Factored MDPs,” J. of Artiﬁcial
Intelligence Research, Vol. 19, pp. 399468.
[GLH94] Gurvits, L., Lin, L. J., and Hanson, S. J., 1994. “Incremen
tal Learning of Evaluation Functions for Absorbing Markov Chains: New
Methods and Theorems,” Preprint.
[GlI89] Glynn, P. W., and Iglehart, D. L., 1989. “Importance Sampling for
Stochastic Simulations,” Management Science, Vol. 35, pp. 13671392.
[Gly87] Glynn, P. W., 1987. “Likelihood Ratio Gradient Estimation: An
Overview,” Proc. of the 1987 Winter Simulation Conference, pp. 366375.
[Gor95] Gordon, G. J., 1995. “Stable Function Approximation in Dynamic
Programming,” in Machine Learning: Proceedings of the Twelfth Interna
tional Conference, Morgan Kaufmann, San Francisco, CA.
[Gos03] Gosavi, A., 2003. SimulationBased Optimization Parametric Op
timization Techniques and Reinforcement Learning, SpringerVerlag, N. Y.
[Gos04] Gosavi, A., 2004. “Reinforcement Learning for LongRun Average
Cost,” European J. of Operational Research, Vol. 155, pp. 654674.
[GrU04] Grudic, G., and Ungar, L., 2004. “Reinforcement Learning in
Large, HighDimensional State Spaces,” in Learning and Approximate Dy
namic Programming, by J. Si, A. Barto, W. Powell, (Eds.), IEEE Press,
N. Y.
[HBK94] Harmon, M. E., Baird, L. C., and Klopf, A. H., 1994. “Advan
tage Updating Applied to a Diﬀerential Game,” Presented at NIPS Conf.,
Denver, Colo.
[HFM05] He, Y., Fu, M. C., and Marcus, S. I., 2005. “A TwoTimescale
SimulationBased Gradient Algorithm for Weighted Cost Markov Decision
Processes,” Proc. of the 2005 Conf. on Decision and Control, Seville, Spain,
pp. 80228027.
[Hau00] Hauskrecht, M., 2000. “Value-Function Approximations for Partially Observable Markov Decision Processes,” Journal of Artificial Intelligence Research, Vol. 13, pp. 33-95.
[He02] He, Y., 2002. SimulationBased Algorithms for Markov Decision
Processes, Ph.D. Thesis, University of Maryland.
[JJS94] Jaakkola, T., Jordan, M. I., and Singh, S. P., 1994. “On the
Convergence of Stochastic Iterative Dynamic Programming Algorithms,”
Neural Computation, Vol. 6, pp. 11851201.
[JSJ95] Jaakkola, T., Singh, S. P., and Jordan, M. I., 1995. “Reinforcement
Learning Algorithm for Partially Observable Markov Decision Problems,”
Advances in Neural Information Processing Systems, Vol. 7, pp. 345352.
[JuP07] Jung, T., and Polani, D., 2007. “Kernelizing LSPE(λ),” in Proc.
2007 IEEE Symposium on Approximate Dynamic Programming and Rein
forcement Learning, Honolulu, Hawaii. pp. 338345.
[KMP06] Keller, P. W., Mannor, S., and Precup, D., 2006. “Automatic
Basis Function Construction for Approximate Dynamic Programming and
Reinforcement Learning,” Proc. of the 23rd ICML, Pittsburgh, Penn.
[Kak02] Kakade, S., 2002. “A Natural Policy Gradient,” Proc. Advances
in Neural Information Processing Systems, Vancouver, BC, Vol. 14, pp.
15311538.
[KoB99] Konda, V. R., and Borkar, V. S., 1999. “Actor-Critic Like Learning Algorithms for Markov Decision Processes,” SIAM J. on Control and Optimization, Vol. 38, pp. 94-123.
[KoP00] Koller, D., and Parr, R., 2000. “Policy Iteration for Factored MDPs,” Proc. of the 16th Annual Conference on Uncertainty in AI, pp. 326-334.
[KoT99] Konda, V. R., and Tsitsiklis, J. N., 1999. “ActorCritic Algo
rithms,” Proc. 1999 Neural Information Processing Systems Conference,
Denver, Colorado, pp. 10081014.
[KoT03] Konda, V. R., and Tsitsiklis, J. N., 2003. “ActorCritic Algo
rithms,” SIAM J. on Control and Optimization, Vol. 42, pp. 11431166.
[Kon02] Konda, V. R., 2002. ActorCritic Algorithms, Ph.D. Thesis, Dept.
of EECS, M.I.T., Cambridge, MA.
[Kra72] Krasnoselskii, M. A., et al., 1972. Approximate Solution of Operator Equations, Translated by D. Louvish, Wolters-Noordhoff Pub., Groningen.
[LLL08] Lewis, F. L., Liu, D., and Lendaris, G. G., 2008. Special Issue on
Adaptive Dynamic Programming and Reinforcement Learning in Feedback
Control, IEEE Transactions on Systems, Man, and Cybernetics, Part B,
Vol. 38, No. 4.
[LSS09] Li, Y., Szepesvari, C., and Schuurmans, D., 2009. “Learning Ex
ercise Policies for American Options,” Proc. of the Twelfth International
Conference on Artiﬁcial Intelligence and Statistics, Clearwater Beach, Fla.
[LaP03] Lagoudakis, M. G., and Parr, R., 2003. “LeastSquares Policy
Iteration,” J. of Machine Learning Research, Vol. 4, pp. 11071149.
[LeV09] Lewis, F. L., and Vrabie, D., 2009. “Reinforcement Learning and
Adaptive Dynamic Programming for Feedback Control,” IEEE Circuits
and Systems Magazine, 3rd Q. Issue.
[Liu01] Liu, J. S., 2001. Monte Carlo Strategies in Scientiﬁc Computing,
Springer, N. Y.
[LoS01] Longstaﬀ, F. A., and Schwartz, E. S., 2001. “Valuing American
Options by Simulation: A Simple LeastSquares Approach,” Review of
Financial Studies, Vol. 14, pp. 113147.
[MMS06] Menache, I., Mannor, S., and Shimkin, N., 2005. “Basis Function
Adaptation in Temporal Diﬀerence Reinforcement Learning,” Ann. Oper.
Res., Vol. 134, pp. 215238.
[MaT01] Marbach, P., and Tsitsiklis, J. N., 2001. “SimulationBased Opti
mization of Markov Reward Processes,” IEEE Transactions on Automatic
Control, Vol. 46, pp. 191209.
[MaT03] Marbach, P., and Tsitsiklis, J. N., 2003. “Approximate Gradient
Methods in PolicySpace Optimization of Markov Reward Processes,” J.
Discrete Event Dynamic Systems, Vol. 13, pp. 111148.
[Mar70] Martinet, B., 1970. “Regularisation d’ Inequations Variationnelles
par Approximations Successives”, Rev. Francaise Inf. Rech. Oper., pp. 154
159.
[Mey07] Meyn, S., 2007. Control Techniques for Complex Networks, Cam
bridge University Press, N. Y.
[MuS08] Munos, R., and Szepesvari, C., 2008. “Finite-Time Bounds for Fitted Value Iteration,” Journal of Machine Learning Research, Vol. 9, pp. 815-857.
[Mun03] Munos, R., 2003. “Error Bounds for Approximate Policy Itera
tion,” Proc. 20th International Conference on Machine Learning, pp. 560
567.
[NeB03] Nedi´c, A., and Bertsekas, D. P., 2003. “LeastSquares Policy Eval
uation Algorithms with Linear Function Approximation,” J. of Discrete
Event Systems, Vol. 13, pp. 79110.
[OrS02] Ormoneit, D., and Sen, S., 2002. “KernelBased Reinforcement
Learning,” Machine Learning, Vol. 49, pp. 161178.
[Pin97] Pineda, F., 1997. “MeanField Analysis for Batched TD(λ),” Neural
Computation, pp. 14031419.
[PSD01] Precup, D., Sutton, R. S., and Dasgupta, S., 2001. “OﬀPolicy
TemporalDiﬀerence Learning with Function Approximation,” In Proc. 18th
Int. Conf. Machine Learning, pp. 417424.
[PWB09] Polydorides, N., Wang, M., and Bertsekas, D. P., 2009. “Approx
imate Solution of LargeScale Linear Inverse Problems with Monte Carlo
Simulation,” Lab. for Information and Decision Systems Report LIDSP
2822, MIT.
[PoB04] Poupart, P., and Boutilier, C., 2004. “Bounded Finite State Con
trollers,” Advances in Neural Information Processing Systems.
[PoV04] Powell, W. B., and Van Roy, B., 2004. “Approximate Dynamic
Programming for HighDimensional Resource Allocation Problems,” in Le
arning and Approximate Dynamic Programming, by J. Si, A. Barto, W.
Powell, (Eds.), IEEE Press, N. Y.
[Pow07] Powell, W. B., 2007. Approximate Dynamic Programming: Solving
the Curses of Dimensionality, J. Wiley and Sons, Hoboken, N. J.
[Roc76] Rockafellar, R. T., 1976. “Monotone Operators and the Proximal Point Algorithm,” SIAM J. on Control and Optimization, Vol. 14, pp. 877-898.
[RuK04] Rubinstein, R. Y., and Kroese, D. P., 2004. The CrossEntropy
Method: A Uniﬁed Approach to Combinatorial Optimization, Springer, N.
Y.
[RuK08] Rubinstein, R. Y., and Kroese, D. P., 2008. Simulation and the
Monte Carlo Method, 2nd Edition, J. Wiley, N. Y.
[SBP04] Si, J., Barto, A., Powell, W., and Wunsch, D., (Eds.) 2004. Learn
ing and Approximate Dynamic Programming, IEEE Press, N. Y.
[SDG09] Simao, D. G., Day, S., George, A. P., Giﬀord, T., Nienow, J., and
Powell, W. B., 2009. “An Approximate Dynamic Programming Algorithm
for LargeScale Fleet Management: A Case Application,” Transportation
Science, Vol. 43, pp. 178197.
[SJJ94] Singh, S. P., Jaakkola, T., and Jordan, M. I., 1994. “Learning
without StateEstimation in Partially Observable Markovian Decision Pro
cesses,” Proceedings of the Eleventh Machine Learning Conference, pp.
284292.
[SJJ95] Singh, S. P., Jaakkola, T., and Jordan, M. I., 1995. “Reinforcement
Learning with Soft State Aggregation,” in Advances in Neural Information
Processing Systems 7, MIT Press, Cambridge, MA.
[SMS99] Sutton, R. S., McAllester, D., Singh, S. P., and Mansour, Y.,
1999. “Policy Gradient Methods for Reinforcement Learning with Func
tion Approximation,” Proc. 1999 Neural Information Processing Systems
Conference, Denver, Colorado.
[SYL04] Si, J., Yang, L., and Liu, D., 2004. “Direct Neural Dynamic Pro
gramming,” in Learning and Approximate Dynamic Programming, by J.
Si, A. Barto, W. Powell, (Eds.), IEEE Press, N. Y.
[Saa03] Saad, Y., 2003. Iterative Methods for Sparse Linear Systems, SIAM,
Phila., Pa.
[Sam59] Samuel, A. L., 1959. “Some Studies in Machine Learning Using
the Game of Checkers,” IBM Journal of Research and Development, pp.
210229.
[Sam67] Samuel, A. L., 1967. “Some Studies in Machine Learning Using
the Game of Checkers. II – Recent Progress,” IBM Journal of Research
and Development, pp. 601617.
[ScS85] Schweitzer, P. J., and Seidman, A., 1985. “Generalized Polyno
mial Approximations in Markovian Decision Problems,” J. Math. Anal.
and Appl., Vol. 110, pp. 568582.
[Sin94] Singh, S. P., 1994. “Reinforcement Learning Algorithms for Average
Payoﬀ Markovian Decision Processes,” Proc. of 12th National Conference
on Artiﬁcial Intelligence, pp. 202207.
[Str09] Strang, G., 2009. Linear Algebra and its Applications, Wellesley-Cambridge Press, Wellesley, MA.
[SuB98] Sutton, R. S., and Barto, A. G., 1998. Reinforcement Learning,
MIT Press, Cambridge, MA.
[Sut88] Sutton, R. S., 1988. “Learning to Predict by the Methods of Tem
poral Diﬀerences,” Machine Learning, Vol. 3, pp. 944.
[SzL06] Szita, I., and Lorinz, A., 2006. “Learning Tetris Using the Noisy
CrossEntropy Method,” Neural Computation, Vol. 18, pp. 29362941.
[SzS04] Szepesvari, C., and Smart, W. D., 2004. “InterpolationBased Q
Learning,” Proc. of 21st International Conf. on Machine Learning, Banﬀ,
Ca.
[Sze09] Szepesvari, C., 2009. “Reinforcement Learning Algorithms for MDPs,”
Dept. of Computing Science Report TR0913, University of Alberta, Ca.
[Tes92] Tesauro, G., 1992. “Practical Issues in Temporal Diﬀerence Learn
ing,” Machine Learning, Vol. 8, pp. 257277.
[ThS09] Thiery, C., and Scherrer, B., 2009. “Improvements on Learning
Tetris with CrossEntropy,” International Computer Games Association
Journal, Vol. 32, pp. 2333.
[TrB97] Trefethen, L. N., and Bau, D., 1997. Numerical Linear Algebra,
SIAM, Phila., PA.
[TsV96] Tsitsiklis, J. N., and Van Roy, B., 1996. “FeatureBased Methods
for LargeScale Dynamic Programming,” Machine Learning, Vol. 22, pp.
5994.
[TsV97] Tsitsiklis, J. N., and Van Roy, B., 1997. “An Analysis of Temporal
Diﬀerence Learning with Function Approximation,” IEEE Transactions on
Automatic Control, Vol. 42, pp. 674690.
[TsV99a] Tsitsiklis, J. N., and Van Roy, B., 1999. “Average Cost Temporal
Diﬀerence Learning,” Automatica, Vol. 35, pp. 17991808.
[TsV99b] Tsitsiklis, J. N., and Van Roy, B., 1999. “Optimal Stopping of
Markov Processes: Hilbert Space Theory, Approximation Algorithms, and
an Application to Pricing Financial Derivatives,” IEEE Transactions on
Automatic Control, Vol. 44, pp. 18401851.
[TsV01] Tsitsiklis, J. N., and Van Roy, B., 2001. “Regression Methods for
Pricing Complex AmericanStyle Options,” IEEE Trans. on Neural Net
works, Vol. 12, pp. 694703.
[TsV02] Tsitsiklis, J. N., and Van Roy, B., 2002. “On Average Versus Dis
counted Reward Temporal–Diﬀerence Learning,” Machine Learning, Vol.
49, pp. 179191.
[Tsi94b] Tsitsiklis, J. N., 1994. “Asynchronous Stochastic Approximation
and QLearning,” Machine Learning, Vol. 16, pp. 185202.
[Van95] Van Roy, B., 1995. “FeatureBased Methods for Large Scale Dy
namic Programming,” Lab. for Info. and Decision Systems Report LIDS
TH2289, Massachusetts Institute of Technology, Cambridge, MA.
[Van98] Van Roy, B., 1998. Learning and Value Function Approximation
in Complex Decision Processes, Ph.D. Thesis, Dept. of EECS, MIT, Cam
bridge, MA.
[Van06] Van Roy, B., 2006. “Performance Loss Bounds for Approximate
Value Iteration with State Aggregation,” Mathematics of Operations Re
search, Vol. 31, pp. 234244.
[VBL07] Van Roy, B., Bertsekas, D. P., Lee, Y., and Tsitsiklis, J. N.,
1997. “A NeuroDynamic Programming Approach to Retailer Inventory
Management,” Proc. of the IEEE Conference on Decision and Control;
based on a more extended Lab. for Information and Decision Systems
Report, MIT, Nov. 1996.
[WPB09] Wang, M., Polydorides, N., and Bertsekas, D. P., 2009. “Approx
imate SimulationBased Solution of LargeScale Least Squares Problems,”
Lab. for Information and Decision Systems Report LIDSP2819, MIT.
[WaB92] Watkins, C. J. C. H., and Dayan, P., 1992. “QLearning,” Machine
Learning, Vol. 8, pp. 279292.
[Wat89] Watkins, C. J. C. H., 1989. Learning from Delayed Rewards, Ph.D. Thesis, Cambridge Univ., England.
[WeB99] Weaver, L., and Baxter, J., 1999. “Reinforcement Learning From
State and Temporal Diﬀerences,” Tech. Report, Department of Computer
Science, Australian National University.
[WiB93] Williams, R. J., and Baird, L. C., 1993. “Analysis of Some In
cremental Variants of Policy Iteration: First Steps Toward Understanding
ActorCritic Learning Systems,” Report NUCCS9311, College of Com
puter Science, Northeastern University, Boston, MA.
[Wil92] Williams, R. J., 1992. “Simple Statistical Gradient Following Algo
rithms for Connectionist Reinforcement Learning,” Machine Learning, Vol.
8, pp. 229256.
[YaL08] Yao, H., and Liu, Z.Q., 2008. “Preconditioned Temporal Diﬀer
ence Learning,” Proc. of the 25th ICML, Helsinki, Finland.
[YuB04] Yu, H., and Bertsekas, D. P., 2004. “Discretized Approximations
for POMDP with Average Cost,” Proc. of the 20th Conference on Uncer
tainty in Artiﬁcial Intelligence, Banﬀ, Canada.
[YuB06a] Yu, H., and Bertsekas, D. P., 2006. “On NearOptimality of the
Set of FiniteState Controllers for Average Cost POMDP,” Lab. for Infor
mation and Decision Systems Report 2689, MIT; Mathematics of Opera
tions Research, Vol. 33, pp. 111, 2008.
[YuB06b] Yu, H., and Bertsekas, D. P., 2006. “Convergence Results for Some Temporal Difference Methods Based on Least Squares,” Lab. for Information and Decision Systems Report 2697, MIT; also in IEEE Transactions on Aut. Control, Vol. 54, 2009, pp. 1515-1531.
[YuB07] Yu, H., and Bertsekas, D. P., 2007. “A Least Squares QLearning
Algorithm for Optimal Stopping Problems,” Lab. for Information and Deci
sion Systems Report 2731, MIT; also in Proc. European Control Conference
2007, Kos, Greece.
[YuB08] Yu, H., and Bertsekas, D. P., 2008. “New Error Bounds for Ap
proximations from Projected Linear Equations,” Lab. for Information and
Decision Systems Report LIDSP2797, MIT, July 2008; to appear in Math
ematics of Operations Research.
[YuB09] Yu, H., and Bertsekas, D. P., 2009. “Basis Function Adaptation
Methods for Cost Approximation in MDP,” Proceedings of 2009 IEEE
Symposium on Approximate Dynamic Programming and Reinforcement
Learning (ADPRL 2009), Nashville, Tenn.
[Yu05] Yu, H., 2005. “A Function Approximation Approach to Estimation
of Policy Gradient for POMDP with Structured Policies,” Proc. of the 21st
Conference on Uncertainty in Artiﬁcial Intelligence, Edinburgh, Scotland.
[Yu10] Yu, H., 2010. “Convergence of Least Squares Temporal Difference Methods Under General Conditions,” Technical Report C-2010-1, Dept. Computer Science, University of Helsinki, 2010; a shorter version to appear in Proc. of 2010 ICML, Haifa, Israel.
[ZhH01] Zhou, R., and Hansen, E. A., 2001. “An Improved GridBased
Approximation Algorithm for POMDPs,” In Int. J. Conf. Artiﬁcial Intelli
gence.
[ZhL97] Zhang, N. L., and Liu, W., 1997. “A Model Approximation Scheme
for Planning in Partially Observable Stochastic Domains,” J. Artiﬁcial In
telligence Research, Vol. 7, pp. 199230.
Our methods will involve linear algebra operations of dimension much smaller than n.5 of Vol. II).3. computationally intensive DP problems. the system and cost structure may be simulated (think. Rather than estimate explicitly the transition probabilities and costs. A principal aim of the methods of this chapter is to address problems with very large number of states n. Another aim of the methods of this chapter is to address modelfree situations. and require only that the components of nvectors are just generated when needed rather than stored. Here our focus will be on algorithms that are mostly patterned after two principal methods of inﬁnite horizon DP: policy and value iteration. and indeed it may be impossible to even store an nvector in a computer memory.” In another type of method. and Section 1. and then to apply the methods discussed in earlier chapters. we will aim to approximate the cost function of a given policy or even the optimal costtogo function by generating one or more simulated system trajectories and associated costs.5 of Vol. which we will discuss only brieﬂy. u. and one contemplates approximations. problems where a mathematical model is unavailable or hard to construct. or neurodynamic programming. We discussed a number of such methods in Chapter 6 of Vol. i. which we have discussed at various points in other parts of the book: rollout algorithms (Sections 6. for a given control u. or reinforcement learning.. 6. 6.0 323 In this chapter we consider approximation methods for challenging.Sec. which deal with approximation of the optimal . which deal with approximation of the cost of a single policy. such as approximate dynamic programming.e. I and Chapter 1 of the present volume. The methods of this chapter. and approximate linear programming (Section 1. of a queueing network with complicated but welldeﬁned service disciplines at the queues). and also generates a corresponding transition cost g(i. 
Our main focus will be on two types of methods: policy evaluation algorithms, which deal with approximation of the cost of a single policy, and Q-learning algorithms, which deal with approximation of the optimal cost. Let us summarize each type of method, focusing for concreteness on the finite-state discounted case.

Policy Evaluation Algorithms

With this class of methods, we aim to approximate the cost function Jµ(i) of a policy µ with a parametric architecture of the form J̃(i, r), where r is a parameter vector (cf. Section 6.3.5 of Vol. I). This approximation may be carried out repeatedly, for a sequence of policies, in the context of a policy iteration scheme. Alternatively, it may be used to construct an approximate cost-to-go function of a single suboptimal/heuristic policy, which can be used in an on-line rollout scheme, with one-step or multistep lookahead. We focus primarily on two types of methods.†

In the first class of methods, called direct, we use simulation to collect samples of costs for various initial states, and fit the architecture J̃ to the samples through some least squares problem. This problem may be solved by several possible algorithms, including linear least squares methods based on simple matrix inversion. Gradient methods have also been used extensively, and will be described in Section 6.2. We will focus exclusively on the case of a linear architecture, where J̃ is of the form Φr, and Φ is a matrix whose columns can be viewed as basis functions (cf. Section 6.3.5 of Vol. I).

The second and currently more popular class of methods is called indirect. Here, we obtain r by solving an approximate version of Bellman's equation. In an important method of this type, we obtain the parameter vector r by solving the equation

    Φr = ΠT(Φr),                                  (6.1)

where Π denotes projection with respect to a suitable norm on the subspace of vectors of the form Φr, and T is either the mapping Tµ or a related mapping, which also has Jµ as its unique fixed point [here ΠT(Φr) denotes the projection of the vector T(Φr) on the subspace].‡

† In another type of method, the parameter vector r is determined by minimizing a measure of error in satisfying Bellman's equation; for example, by minimizing over r the norm ‖J̃ − T J̃‖, where ‖·‖ is some norm. If ‖·‖ is a Euclidean norm, and J̃(i, r) is linear in r, this minimization is a linear least squares problem. This approach, often called the Bellman equation error approach, will be discussed briefly in Section 6.8.

‡ Another method of this type is based on aggregation (cf. Section 6.3.4 of Vol. I) and is discussed in Section 6.5. This approach can also be viewed as a problem approximation approach (cf. Section 6.3.3 of Vol. I): the original problem is approximated with a related "aggregate" problem, which is then solved exactly to yield a cost-to-go approximation for the original problem.
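When the model is available and the problem is small, the projected equation (6.1) can be solved exactly, which is useful for checking the simulation-based methods discussed below. The sketch uses a hypothetical two-state chain under a fixed policy, a single basis function (s = 1), and projection with respect to a ξ-weighted Euclidean norm; all numbers are illustrative:

```python
# Hypothetical two-state chain under a fixed policy; s = 1 basis function.
P = [[0.8, 0.2], [0.3, 0.7]]   # transition probabilities p_ij under the policy
g = [1.0, 2.0]                 # expected one-stage costs
phi = [1.0, 2.0]               # basis function values phi(i)
xi = [0.6, 0.4]                # projection weights (here the steady-state distribution)
alpha = 0.9                    # discount factor

def solve_projected_equation():
    """Solve Phi r = Pi T(Phi r) in closed form: with one basis function this
    reduces to the scalar equation  (phi' Xi (I - alpha P) phi) r = phi' Xi g."""
    n = len(P)
    A = sum(xi[i] * phi[i] * (phi[i] - alpha * sum(P[i][j] * phi[j] for j in range(n)))
            for i in range(n))
    b = sum(xi[i] * phi[i] * g[i] for i in range(n))
    return b / A
```

The fixed point property can be checked through the orthogonality condition Φ′Ξ(T(Φr*) − Φr*) = 0, which is equivalent to (6.1) for a Euclidean projection.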
We can view Eq. (6.1) as a form of projected Bellman equation. We will show that for a special choice of the norm of the projection, ΠT is a contraction mapping, so the projected Bellman equation has a unique solution Φr*. We will discuss several iterative methods for finding r* in Section 6.3. All these methods use simulation and can be shown to converge under reasonable assumptions to r*, so they produce the same approximate cost function. However, they differ in their speed of convergence and in their suitability for various problem contexts. They all depend on a parameter λ ∈ [0, 1], whose role will be discussed later. Here are the methods that we will focus on in Section 6.3 for discounted problems, and also in Sections 6.6-6.8 for other types of problems:

(1) TD(λ) or temporal differences method. This algorithm may be viewed as a stochastic iterative method for solving a version of the projected equation (6.1) that depends on λ. The algorithm embodies important ideas and has played an important role in the development of the subject, but in practical terms, it is typically inferior to the next two methods, so it will be discussed in less detail.

(2) LSTD(λ) or least squares temporal differences method. This algorithm computes and solves a progressively more refined simulation-based approximation to the projected Bellman equation (6.1).

(3) LSPE(λ) or least squares policy evaluation method. This algorithm is based on the idea of executing value iteration within the lower dimensional space spanned by the basis functions. Conceptually, it has the form

    Φr_{k+1} = ΠT(Φr_k) + simulation noise,          (6.2)

i.e., the current value iterate T(Φr_k) is projected on S and is suitably approximated by simulation. The simulation noise tends to 0 asymptotically, so assuming that ΠT is a contraction, the method converges to the solution of the projected Bellman equation (6.1). There are also a number of variants of LSPE(λ). Both LSPE(λ) and its variants have the same convergence rate as LSTD(λ), because they share a common bottleneck: the slow speed of simulation.

In the case of aggregation, the analog of Eq. (6.1) has the form

    Φr = ΦDT(Φr),

where Φ and D are matrices whose rows are restricted to be probability distributions (the aggregation and disaggregation probabilities, respectively).
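A minimal simulation-based counterpart of method (2) is sketched below for λ = 0, a single basis function, and a hypothetical two-state chain (all numbers illustrative). LSTD(0) forms sample-average estimates of the quantities defining the projected equation and then solves for r:

```python
import random

# Illustrative two-state chain under a fixed policy; hypothetical numbers.
P = [[0.8, 0.2], [0.3, 0.7]]
g = [1.0, 2.0]        # expected one-stage cost at each state
phi = [1.0, 2.0]      # single basis function (s = 1)
alpha = 0.9

def lstd0(num_samples=200_000, seed=1):
    """LSTD(0): accumulate A ~ Phi' Xi (I - alpha P) Phi and b ~ Phi' Xi g
    from a single simulated trajectory, then return r = b / A (scalar case)."""
    rng = random.Random(seed)
    i, A, b = 0, 0.0, 0.0
    for _ in range(num_samples):
        j = 0 if rng.random() < P[i][0] else 1
        A += phi[i] * (phi[i] - alpha * phi[j])
        b += phi[i] * g[i]
        i = j
    return b / A
```

With s > 1 basis functions, A becomes an s × s matrix estimate and the final step is a linear solve rather than a division; the structure is otherwise the same.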
Q-Learning Algorithms

With this class of methods, we aim to compute, without any approximation, the optimal cost function (not just the cost function of a single policy). Q-learning maintains and updates for each state-control pair (i, u) an estimate of the expression that is minimized in the right-hand side of Bellman's equation. This is called the Q-factor of the pair (i, u), and is denoted by Q*(i, u).† The Q-factors are updated with what may be viewed as a simulation-based form of value iteration, as will be explained in Section 6.4. An important advantage of using Q-factors is that when they are available, they can be used to obtain an optimal control at any state i simply by minimizing Q*(i, u) over u ∈ U(i), so the transition probabilities of the problem are not needed.

On the other hand, for problems with a large number of state-control pairs, Q-learning is often impractical because there may be simply too many Q-factors to update. As a result, the algorithm is primarily suitable for systems with a small number of states (or for aggregated/few-state versions of more complex systems). There are also algorithms that use parametric approximations for the Q-factors (see Sections 6.4 and 6.5), but they are either tailored to special classes of problems (e.g., optimal stopping problems discussed in Section 6.4, and problems resulting from aggregation discussed in Section 6.5), or else they lack a firm theoretical basis. Still, these methods are used widely, and often with success. There are also methods related to Q-learning, which are based on aggregation and compute an approximately optimal cost function (see Section 6.5). Other methods, which aim to approximate the optimal cost function, are based on linear programming. They were discussed in Section 1.3.4, and will not be considered further here (see also the references cited in Chapter 1).

† The letter "Q" stands for nothing special - it just refers to notation used in the original proposal of this method!
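A tabular sketch of the simulation-based value iteration on Q-factors is given below on a hypothetical two-state, two-control problem (the model, the randomized control selection, and the diminishing stepsize schedule are all illustrative assumptions, not prescriptions from the text):

```python
import random

# Tiny illustrative problem (all numbers hypothetical): 2 states, 2 controls.
# P[(i, u)] gives the probability of moving to state 0; G[(i, u)] is the stage cost.
P = {(0, 0): 0.9, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.2}
G = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 0.0}
alpha = 0.9  # discount factor

def q_learning(iterations=100_000, seed=0):
    """Tabular Q-learning: stochastic approximation of the value iteration
    Q(i,u) <- E[ g(i,u,j) + alpha * min_v Q(j,v) ] along a simulated trajectory."""
    rng = random.Random(seed)
    Q = {(i, u): 0.0 for i in (0, 1) for u in (0, 1)}
    i = 0
    for t in range(1, iterations + 1):
        u = rng.choice((0, 1))               # randomized controls for exploration
        j = 0 if rng.random() < P[(i, u)] else 1
        target = G[(i, u)] + alpha * min(Q[(j, 0)], Q[(j, 1)])
        step = 1.0 / (1 + t // 100)          # a common diminishing stepsize choice
        Q[(i, u)] += step * (target - Q[(i, u)])
        i = j
    return Q
```

Once the Q-factors have (approximately) converged, an optimal control at state i is obtained by minimizing Q(i, u) over u, with no reference to the transition probabilities.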
Chapter Organization

Throughout this chapter, we will focus almost exclusively on perfect state information problems, involving a Markov chain with a finite number of states i, transition probabilities pij(u), and single stage costs g(i, u, j). Extensions of many of the ideas to continuous state spaces are possible, but they are beyond our scope. We will consider first, in Sections 6.1-6.5, the discounted problem using the notation of Section 1.3. Section 6.1 provides a broad overview of cost approximation architectures and their uses in approximate policy iteration. Section 6.2 focuses on direct methods for policy evaluation. Section 6.3 is a long section on a major class of indirect methods for policy evaluation, which are based on the projected Bellman equation. Section 6.4 discusses Q-learning and its variations, and extends the projected Bellman equation approach to the case of multiple policies, and particularly to optimal stopping problems. Section 6.5 discusses methods based on aggregation. Stochastic shortest path and average cost problems are discussed in Sections 6.6 and 6.7, respectively. Section 6.8 extends and elaborates on the projected Bellman equation approach of Sections 6.3, 6.6, and 6.7, discusses another approach based on the Bellman equation error, and generalizes the aggregation methodology. Section 6.9 describes methods involving the parametric approximation of policies.

6.1 GENERAL ISSUES OF COST APPROXIMATION

Most of the methodology of this chapter deals with approximation of some type of cost function (optimal cost, cost of a policy, Q-factors, etc). The purpose of this section is to highlight the main issues involved, without getting too much into the mathematical details. We start with general issues of parametric approximation architectures, which we have also discussed in Vol. I (Section 6.3.5). We then consider approximate policy iteration (Section 6.1.2), and the two general approaches for approximate cost evaluation (direct and indirect; Section 6.1.3). In Section 6.1.4, we discuss various special structures that can be exploited to simplify approximate policy iteration. In Sections 6.1.5 and 6.1.6, we provide orientation into the main mathematical issues underlying the methodology, and focus on two of its main components: contraction mappings and simulation.

6.1.1 Approximation Architectures

The major use of cost approximation is for obtaining a one-step lookahead† suboptimal policy (cf. Section 6.3 of Vol. I). In particular, suppose that we use J̃(j, r) as an approximation to the optimal cost of the finite-state discounted problem of Section 1.3. Here J̃ is a function of some chosen form (the approximation architecture) and r is a parameter/weight vector. Once r is determined, it yields a suboptimal control at any state i via the one-step lookahead minimization

    µ̃(i) = arg min_{u∈U(i)} Σ_{j=1}^n pij(u) [ g(i, u, j) + α J̃(j, r) ].          (6.3)

The degree of suboptimality of µ̃, as measured by ‖Jµ̃ − J*‖∞, is bounded by a constant multiple of the approximation error according to

    ‖Jµ̃ − J*‖∞ ≤ (2α / (1 − α)) ‖J̃ − J*‖∞,

as shown in Section 1.3. This bound is qualitative in nature, as it tends to be quite conservative in practice.

† We may also use a multiple-step lookahead minimization, with a cost-to-go approximation at the end of the multiple-step horizon. Conceptually, single-step and multiple-step lookahead approaches are similar, and the cost-to-go approximation algorithms of this chapter apply to both.
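The one-step lookahead minimization (6.3) is straightforward to implement when the model is known; the sketch below uses a hypothetical two-state, two-control model, with illustrative numbers throughout:

```python
# One-step lookahead policy, cf. Eq. (6.3); the tiny model is illustrative only.
# P[(i, u)] lists (next_state, probability); G[(i, u)] is the expected stage cost.
P = {(0, 0): [(0, 0.9), (1, 0.1)], (0, 1): [(0, 0.4), (1, 0.6)],
     (1, 0): [(0, 0.5), (1, 0.5)], (1, 1): [(0, 0.2), (1, 0.8)]}
G = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 0.0}
alpha, controls = 0.9, (0, 1)

def one_step_lookahead(i, J_tilde):
    """mu(i) = argmin_u sum_j p_ij(u) * (g(i,u,j) + alpha * J_tilde[j])."""
    def q(u):
        return G[(i, u)] + alpha * sum(p * J_tilde[j] for j, p in P[(i, u)])
    return min(controls, key=q)
```

Note that evaluating the minimand requires the transition probabilities; the Q-factor alternative discussed next avoids this requirement.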
An alternative possibility is to obtain a parametric approximation Q̃(i, u, r) of the Q-factor of the pair (i, u), defined in terms of the optimal cost function J* as

    Q*(i, u) = Σ_{j=1}^n pij(u) [ g(i, u, j) + α J*(j) ].

Since Q*(i, u) is the expression minimized in Bellman's equation, given the approximation Q̃(i, u, r), we can generate a suboptimal control at any state i via

    µ̃(i) = arg min_{u∈U(i)} Q̃(i, u, r).

The advantage of using Q-factors is that in contrast with the minimization (6.3), the transition probabilities pij(u) are not needed in the above minimization. Thus Q-factors are better suited to the model-free context. Note that we may similarly use approximations to the cost functions Jµ and Q-factors Qµ(i, u) of specific policies µ. A major use of such approximations is in the context of an approximate policy iteration scheme; see Section 6.1.2.

The choice of architecture is very significant for the success of the approximation approach. One possibility is to use the linear form

    J̃(i, r) = Σ_{k=1}^s rk φk(i),                              (6.4)

where r = (r1, ..., rs) is the parameter vector, and φk(i) are some known scalars that depend on the state i. Thus, for each state i, the approximate cost J̃(i, r) is the inner product φ(i)′r of r and the vector

    φ(i) = ( φ1(i), ..., φs(i) )′.

We refer to φ(i) as the feature vector of i, and to its components as features (see Fig. 6.1.1). Thus the cost function is approximated by a vector in the subspace

    S = { Φr | r ∈ ℜs },

where Φ is the n × s matrix whose ith row is φ(i)′, i.e., whose kth column consists of the values φk(1), ..., φk(n).
Figure 6.1.1. A linear feature-based architecture. It combines a mapping that extracts the feature vector φ(i) = (φ1(i), ..., φs(i))′ associated with state i, and a parameter vector r to form a linear cost approximator φ(i)′r.

We can view the s columns of Φ as basis functions, and Φr as a linear combination of basis functions. Features, when well-crafted, can capture the dominant nonlinearities of the cost function, and their linear combination may work very well as an approximation architecture. For example, in computer chess (Section 6.3.5 of Vol. I), where the state is the current board position, appropriate features are material balance, piece mobility, king safety, and other positional factors.

Example 6.1.1 (Polynomial Approximation)

An important example of linear cost approximation is based on polynomial basis functions. Suppose that the state consists of q integer components x1, ..., xq, each taking values within some limited range of integers. For example, in a queueing system, xk may represent the number of customers in the kth queue. Suppose that we want to use an approximating function that is quadratic in the components xk. Then we can define a total of 1 + q + q² basis functions that depend on the state x = (x1, ..., xq) via

    φ0(x) = 1,   φk(x) = xk,   φkm(x) = xk xm,   k, m = 1, ..., q.

A linear approximation architecture that uses these functions is given by

    J̃(x, r) = r0 + Σ_{k=1}^q rk xk + Σ_{k=1}^q Σ_{m=k}^q rkm xk xm,

where the parameter vector r has components r0, rk, and rkm, with k = 1, ..., q, m = k, ..., q. In fact, any kind of approximating function that is polynomial in the components x1, ..., xq can be constructed similarly.

It is also possible to combine feature extraction with polynomial approximations. For example, the feature vector φ(i) = (φ1(i), ..., φs(i))′ transformed by a quadratic polynomial mapping leads to approximating functions of the form

    J̃(i, r) = r0 + Σ_{k=1}^s rk φk(i) + Σ_{k=1}^s Σ_{ℓ=1}^s rkℓ φk(i) φℓ(i),

where the parameter vector r has components r0, rk, and rkℓ, with k, ℓ = 1, ..., s. This function can be viewed as a linear cost approximation that uses the basis functions

    w0(i) = 1,   wk(i) = φk(i),   wkℓ(i) = φk(i) φℓ(i),   k, ℓ = 1, ..., s.
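The quadratic architecture of the example above can be sketched directly; the function names below are illustrative, and to avoid duplicating symmetric terms the sketch keeps only the products xk·xm with k ≤ m, as in the displayed formula:

```python
def quadratic_features(x):
    """Feature vector [1, x_k, x_k * x_m (k <= m)] for state x = (x_1, ..., x_q)."""
    q = len(x)
    feats = [1.0]                                             # phi_0
    feats += [float(xk) for xk in x]                          # phi_k
    feats += [float(x[k] * x[m])                              # phi_km, k <= m
              for k in range(q) for m in range(k, q)]
    return feats

def linear_cost(x, r):
    """J(x, r) = r' * phi(x), the linear architecture of Eq. (6.4)."""
    return sum(rk * pk for rk, pk in zip(r, quadratic_features(x)))
```

For q components this gives 1 + q + q(q+1)/2 basis functions; including both xk·xm and xm·xk as in the count 1 + q + q² of the example would simply duplicate features with tied weights.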
Example 6.1.2 (Interpolation)

A common type of approximation of a function J is based on interpolation. Here, a set I of special states is selected, and the parameter vector r has one component ri per state i ∈ I, which is the value of J at i:

    ri = J(i),   i ∈ I.

The value of J at states i ∉ I is approximated by some form of interpolation using r. Interpolation may be based on geometric proximity. For a simple example that conveys the basic idea, let the system states be the integers within some interval, let I be a subset of special states, and for each state i let i̲ and ī be the states in I that are closest to i from below and from above. Then for any state i, J̃(i, r) is obtained by linear interpolation of the costs ri̲ = J(i̲) and rī = J(ī):

    J̃(i, r) = ((ī − i)/(ī − i̲)) ri̲ + ((i − i̲)/(ī − i̲)) rī.

The scalars multiplying the components of r may be viewed as features, so the feature vector of i above consists of two nonzero features (the ones corresponding to i̲ and ī), with all other features being 0. Similar examples can be constructed for the case where the state space is a subset of a multidimensional space (see Example 6.3.13 of Vol. I).

A generalization of the preceding example is approximation based on aggregation, which we have discussed in Section 6.3.4 of Vol. I; see also the subsequent Section 6.5 in this chapter.

In this chapter, we will not address the choice of the structure of J̃(i, r) or associated basis functions, but rather focus on various iterative algorithms for obtaining a suitable parameter vector r. However, we will mostly focus on the case of linear architectures, because many of the policy evaluation algorithms of this chapter are valid only for that case. There are also interesting nonlinear approximation architectures, including those defined by neural networks, perhaps in combination with feature extraction mappings (see Bertsekas and Tsitsiklis [BeT96], or Sutton and Barto [SuB98] for further discussion). Note that there is considerable ongoing research on automatic basis function generation approaches (see, e.g., Keller, Mannor, and Precup [KMP06], and Jung and Polani [JuP07]).
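The interpolation architecture of Example 6.1.2 is easy to sketch; the function names are illustrative, and the set of special states must be given in sorted order:

```python
def interpolation_features(i, special):
    """Return the (at most two) nonzero features for linear interpolation of
    state i between the closest special states below and above it."""
    lo = max(s for s in special if s <= i)
    hi = min(s for s in special if s >= i)
    if lo == hi:                       # i is itself a special state
        return {lo: 1.0}
    w_hi = (i - lo) / (hi - lo)
    return {lo: 1.0 - w_hi, hi: w_hi}

def interpolate(i, special, r):
    """J(i, r): weighted combination of the stored values r[s], s special."""
    return sum(w * r[s] for s, w in interpolation_features(i, special).items())
```

The returned weights are exactly the two nonzero features described in the example; all other features are implicitly zero.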
We finally mention the possibility of optimal selection of basis functions within some restricted class. In particular, consider an approximation subspace

    Sθ = { Φ(θ)r | r ∈ ℜs },

where the s columns of the n × s matrix Φ(θ) are basis functions parametrized by a vector θ. Assume that for a given θ, there is a corresponding vector r(θ), obtained using some algorithm, so that Φ(θ)r(θ) is an approximation of a cost function J (various such algorithms will be presented later in this chapter). Then we may wish to select θ so that some measure of approximation quality is optimized. For example, suppose that we can compute the true cost values J(i) (or more generally, approximations to these values) for a subset of selected states I. Then we may determine θ so that

    Σ_{i∈I} ( J(i) − φ(i, θ)′r(θ) )²

is minimized, where φ(i, θ)′ is the ith row of Φ(θ). Alternatively, we may determine θ so that the norm of the error in satisfying Bellman's equation,

    ‖ Φ(θ)r(θ) − T(Φ(θ)r(θ)) ‖²,

is minimized. Gradient and random search algorithms for carrying out such minimizations have been proposed in the literature (see Menache, Mannor, and Shimkin [MMS06], and Yu and Bertsekas [YuB09]).

6.1.2 Approximate Policy Iteration

Let us consider a form of approximate policy iteration, where we compute simulation-based approximations J̃(·, r) to the cost functions Jµ of stationary policies µ, and we use them to compute new policies based on (approximate) policy improvement. We impose no constraints on the approximation architecture, so J̃(i, r) may be linear or nonlinear in r. Suppose that the current policy is µ, and for a given r, J̃(i, r) is an approximation of Jµ(i). We generate an "improved" policy µ̄ using the formula

    µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n pij(u) [ g(i, u, j) + α J̃(j, r) ],   for all i.     (6.5)

The method is illustrated in Fig. 6.1.2. Its theoretical basis was discussed in Section 1.3, where it was shown that if the policy evaluation is accurate to within δ (in the sup-norm sense), then for an α-discounted problem, the method will yield in the limit (after infinitely many policy evaluations) a stationary policy that is optimal to within

    2αδ / (1 − α)²,

where α is the discount factor. Experimental evidence indicates that this bound is usually conservative; often just a few policy evaluations are needed before the bound is attained.
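The alternation of evaluation and improvement can be sketched on a tiny hypothetical model. Here the approximate evaluation is played by a fixed number of value iteration steps for the current policy (a stand-in for the δ-accurate simulation-based evaluation); the model and all numbers are illustrative:

```python
# Illustrative two-state, two-control model (hypothetical numbers).
P = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6], (1, 0): [0.5, 0.5], (1, 1): [0.2, 0.8]}
G = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.5, (1, 1): 0.0}  # expected stage costs
alpha, states, controls = 0.9, (0, 1), (0, 1)

def evaluate(mu, m=200):
    """Approximate policy evaluation: m applications of T_mu starting from 0."""
    J = [0.0, 0.0]
    for _ in range(m):
        J = [G[(i, mu[i])] + alpha * sum(p * J[j] for j, p in enumerate(P[(i, mu[i])]))
             for i in states]
    return J

def improve(J):
    """Policy improvement, cf. Eq. (6.5)."""
    return [min(controls,
                key=lambda u: G[(i, u)] + alpha * sum(p * J[j]
                                                      for j, p in enumerate(P[(i, u)])))
            for i in states]

def approximate_policy_iteration(iterations=10):
    mu = [0, 0]
    for _ in range(iterations):
        mu = improve(evaluate(mu))
    return mu
```

On this toy model the loop settles on a single policy after a couple of iterations, consistent with the observation that the error bound is usually attained after just a few policy evaluations.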
Figure 6.1.2. Block diagram of approximate policy iteration: starting from an initial policy guess, evaluate the approximate cost J̃µ(r) = Φr of the current policy using simulation, and generate an "improved" policy µ̄ by policy improvement.

A simulation-based implementation of the algorithm is illustrated in Fig. 6.1.3. It consists of four modules:

(a) The simulator, which given a state-control pair (i, u), generates the next state j according to the system's transition probabilities.

(b) The decision generator, which generates the control µ̄(i) of the improved policy at the current state i for use in the simulator.

(c) The cost-to-go approximator, which is the function J̃(j, r) that is used by the decision generator.

(d) The cost approximation algorithm, which accepts as input the output of the simulator and obtains the approximation J̃(·, r̄) of the cost of µ̄.

Note that there are two policies µ and µ̄, and two parameter vectors r and r̄, which are simultaneously involved in this algorithm. In particular, r corresponds to the current policy µ, and the approximation J̃(·, r) is used in the policy improvement Eq. (6.5) to generate the new policy µ̄. At the same time, µ̄ drives the simulation that generates samples to be used by the algorithm that determines the parameter r̄ corresponding to µ̄, which will be used in the next policy iteration. When the sequence of policies obtained actually converges to some µ̄, then it can be proved that µ̄ is optimal to within

    2αδ / (1 − α)

(see Section 6.5, where it is shown that if policy evaluation is done using an aggregation approach, the generated sequence of policies does converge).
Figure 6.1.3. Simulation-based implementation of the approximate policy iteration algorithm. Given the approximation J̃(i, r), we generate cost samples of the "improved" policy µ̄ by simulation (the "decision generator" module). We use these samples to generate the approximator J̃(i, r̄) of µ̄.

The Issue of Exploration

Let us note an important generic difficulty with simulation-based policy iteration: to evaluate a policy µ, we need to generate cost samples using that policy, but this biases the simulation by underrepresenting states that are unlikely to occur under µ. As a result, the cost-to-go estimates of these underrepresented states may be highly inaccurate, causing potentially serious errors in the calculation of the improved control policy µ̄ via the policy improvement Eq. (6.5). The difficulty just described is known as inadequate exploration of the system's dynamics because of the use of a fixed policy. It is a particularly acute difficulty when the system is deterministic, or when the randomness embodied in the transition probabilities is "relatively small."

One possibility for guaranteeing adequate exploration of the state space is to frequently restart the simulation and to ensure that the initial states employed form a rich and representative subset. A related approach, called iterative resampling, is to enrich the sampled set of states in evaluating the current policy µ as follows: derive an initial cost evaluation of µ, simulate the next policy µ̄ obtained on the basis of this initial evaluation to obtain a set of representative states S̄ visited by µ̄, and repeat the evaluation of µ using additional trajectories initiated from S̄. Still another frequently used approach is to artificially introduce some extra randomization in the simulation, by occasionally generating transitions that use a randomly selected control rather than the one dictated by the policy µ. This and other possibilities to improve exploration will be discussed further in Section 6.3.7.
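The last scheme, extra randomization of the simulated controls, amounts to a one-line modification of the decision generator. In the sketch below, the randomization probability `epsilon` and the function name are illustrative choices, not values from the text:

```python
import random

def randomized_control(i, mu, controls, epsilon=0.1, rng=random):
    """Return the policy's control mu[i], except with probability epsilon
    substitute a randomly selected control, to improve exploration."""
    if rng.random() < epsilon:
        return rng.choice(controls)
    return mu[i]
```

Larger values of `epsilon` explore more state-control pairs at the price of biasing the cost samples away from the policy being evaluated, so in practice the randomization is kept occasional.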
Limited Sampling/Optimistic Policy Iteration

In the approximate policy iteration approach discussed so far, the policy evaluation of the cost of the improved policy µ̄ must be fully carried out. An alternative, known as optimistic policy iteration, is to replace the policy µ with the policy µ̄ after only a few simulation samples have been processed, at the risk of J̃(·, r) being an inaccurate approximation of Jµ.

Optimistic policy iteration is discussed extensively in the book by Bertsekas and Tsitsiklis [BeT96], together with several variants. It has been successfully used, among others, in an impressive backgammon application (Tesauro [Tes92]). However, the associated theoretical convergence properties are not fully understood. As shown in Section 6.4.2 of [BeT96], optimistic policy iteration can exhibit fascinating and counterintuitive behavior, including a natural tendency for a phenomenon called chattering, whereby the generated parameter sequence {rk} converges, while the generated policy sequence oscillates because the limit of {rk} corresponds to multiple policies (see also the discussion in Section 6.3).

We note that optimistic policy iteration tends to deal better with the problem of exploration discussed earlier, because with rapid changes of policy, there is less tendency to bias the simulation towards particular states that are favored by any single policy.

Approximate Policy Iteration Based on Q-Factors

The approximate policy iteration method discussed so far relies on the calculation of the approximation J̃(·, r) to the cost function Jµ of the current policy, which is then used for policy improvement using the minimization

    µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n pij(u) [ g(i, u, j) + α J̃(j, r) ].

Carrying out this minimization requires knowledge of the transition probabilities pij(u) and calculation of the associated expected values for all controls u ∈ U(i) (otherwise a time-consuming simulation of these expected values is needed). An interesting alternative is to compute approximate Q-factors

    Q̃(i, u, r) ≈ Σ_{j=1}^n pij(u) [ g(i, u, j) + α Jµ(j) ],          (6.6)

and use the minimization

    µ̄(i) = arg min_{u∈U(i)} Q̃(i, u, r)                              (6.7)

for policy improvement.
Here, r is an adjustable parameter vector and Q̃(i, u, r) is a parametric architecture, possibly of the linear form

    Q̃(i, u, r) = Σ_{k=1}^s rk φk(i, u),

where φk(i, u) are basis functions that depend on both state and control [cf. Eq. (6.4)].

The important point here is that given the current policy µ, we can construct Q-factor approximations Q̃(i, u, r) using any method for constructing cost approximations J̃(i, r). The way to do this is to apply the latter method to the Markov chain whose states are the pairs (i, u), where the probability of transition from (i, u) to (j, v) is

    pij(u) if v = µ(j),

and is 0 otherwise. This is the probabilistic mechanism by which state-control pairs evolve under the stationary policy µ (see Fig. 6.1.4).

A major concern with this approach is that the state-control pairs (i, u) with u ≠ µ(i) are never generated in this Markov chain, so they are not represented in the cost samples used to construct the approximation Q̃(i, u, r) (see Fig. 6.1.4). This creates an acute difficulty due to diminished exploration, which must be carefully addressed in any simulation-based implementation. We will return to the use of Q-factors in Section 6.4, where we will discuss exact and approximate implementations of the Q-learning algorithm.

Figure 6.1.4. Markov chain underlying Q-factor-based policy evaluation, associated with policy µ. The states are the pairs (i, u). After the first transition, the generated pairs are exclusively of the form (i, µ(i)); pairs of the form (i, u), u ≠ µ(i), are not explored.

6.1.3 Direct and Indirect Approximation

We will now preview two general algorithmic approaches for approximating the cost function of a fixed stationary policy µ within a subspace of the form S = {Φr | r ∈ ℜs}.
(A third approach, based on aggregation, uses a special type of matrix Φ and is discussed in Section 6.5.) The first and most straightforward approach, referred to as direct, is to find an approximation J̃ ∈ S that matches best Jµ in some normed error sense, i.e.,

    min_{J̃∈S} ‖Jµ − J̃‖,

or equivalently,

    min_{r∈ℜs} ‖Jµ − Φr‖

(see the left-hand side of Fig. 6.1.5).‡ Here, ‖·‖ is usually some (possibly weighted) Euclidean norm, in which case the approximation problem is a linear least squares problem, whose solution, denoted r*, can in principle be obtained in closed form by solving the associated quadratic minimization problem. If the matrix Φ has linearly independent columns, the solution is unique and can also be represented as

    Φr* = ΠJµ,

where Π denotes projection with respect to ‖·‖ on the subspace S.† A major difficulty is that specific cost function values Jµ(i) can only be estimated through their simulation-generated cost samples, as we discuss in Section 6.2.

An alternative and more popular approach, referred to as indirect, is to approximate the solution of Bellman's equation J = TµJ on the subspace S (see the right-hand side of Fig. 6.1.5). An important example of this approach, which we will discuss in detail in Section 6.3, leads to the problem of finding a vector r* such that

    Φr* = ΠTµ(Φr*).                                  (6.8)

We can view this equation as a projected form of Bellman's equation.

† In what follows in this chapter, we will not distinguish between the linear operation of projection and the corresponding matrix representation, denoting them both by Π. The meaning should be clear from the context.

‡ Note that direct approximation may be used in other approximate DP contexts, such as finite horizon problems, where we use sequential single-stage approximation of the cost-to-go functions Jk, going backwards (i.e., starting with JN, we obtain a least squares approximation of JN−1, which is used in turn to obtain a least squares approximation of JN−2, etc). This approach is sometimes called fitted value iteration.
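For the direct approach, the weighted least squares projection has a simple closed form; the sketch below treats the single basis function case (s = 1), where Φr* = ΠJµ reduces to a scalar formula, and all numbers are illustrative:

```python
# Direct approximation with s = 1: minimize sum_i xi_i * (J_i - phi_i * r)^2.
J_mu = [7.5, 4.9]   # hypothetical cost values of a policy
phi = [1.0, 2.0]    # single basis function
xi = [0.6, 0.4]     # weights of the Euclidean norm

def project(J, phi, xi):
    """Closed-form weighted least squares fit: r* = (phi' Xi J) / (phi' Xi phi)."""
    num = sum(x * p * j for x, p, j in zip(xi, phi, J))
    den = sum(x * p * p for x, p in zip(xi, phi))
    return num / den
```

The resulting error Jµ − Φr* is orthogonal to the subspace in the ξ-weighted inner product, which is the defining property of the projection Π. In practice the values Jµ(i) are not available and must be replaced by simulation-generated cost samples, as discussed in Section 6.2.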
Figure 6.1.5. Two methods for approximating the cost function Jµ as a linear combination of basis functions (subspace S). In the direct method (figure on the left), Jµ is projected on S. In the indirect method (figure on the right), the approximation is found by solving Φr = ΠTµ(Φr), a projected form of Bellman's equation.

We note that solving projected equations as approximations to more complex/higher-dimensional equations has a long history in scientific computation in the context of Galerkin methods (see, e.g., [Kra72]). For example, some of the most popular finite-element methods for partial differential equations are of this type. However, the use of the Monte Carlo simulation ideas that are central in approximate DP is an important characteristic that differentiates the methods of the present chapter from the Galerkin methodology.

An important fact here is that ΠTµ is a contraction, provided we use a special weighted Euclidean norm for projection, as will be proved in Section 6.3 for discounted problems (Prop. 6.3.1). In this case, Eq. (6.8) has a unique solution, and allows the use of algorithms such as LSPE(λ) and TD(λ), which are discussed in Section 6.3. Unfortunately, the contraction property of ΠTµ does not extend to the case where Tµ is replaced by T, the DP mapping corresponding to multiple/all policies, although there are some interesting exceptions, one of which relates to optimal stopping problems and is discussed in Section 6.4.

6.1.4 Simplifications

We now consider various situations where the special structure of the problem may be exploited to simplify policy iteration or other approximate DP algorithms.
Problems with Uncontrollable State Components

In many problems of interest the state is a composite (i, y) of two components i and y, and the evolution of the main component i can be directly affected by the control u, but the evolution of the other component y cannot. Then as discussed in Section 1.4 of Vol. I, the value and the policy iteration algorithms can be carried out over a smaller state space, the space of the controllable component i. In particular, we assume that given the state (i, y) and the control u, the next state (j, z) is determined as follows: j is generated according to transition probabilities pij(u, y), and z is generated according to conditional probabilities p(z | j) that depend on the main component j of the new state (see Fig. 6.1.5).

Figure 6.1.5. States and transition probabilities for a problem with uncontrollable state components.

Let us assume for notational convenience that the cost of a transition from state (i, y) is of the form g(i, y, u, j), and does not depend on the uncontrollable component z of the next state (j, z). If g depends on z, it can be replaced by

    ĝ(i, y, u, j) = Σ_z p(z | j) g(i, y, u, j, z)

in what follows. For an α-discounted problem, consider the mapping T̂ defined by

    (T̂ Ĵ)(i) = Σ_y p(y | i) (T Ĵ)(i, y)
              = Σ_y p(y | i) min_{u ∈ U(i,y)} Σ_{j=0}^n p̂_ij(u, y) ( g(i, y, u, j) + α Ĵ(j) ),
and the corresponding mapping for a stationary policy µ,

    (T̂µ Ĵ)(i) = Σ_y p(y | i) (Tµ Ĵ)(i, y)
               = Σ_y p(y | i) Σ_{j=0}^n p̂_ij(µ(i, y), y) ( g(i, y, µ(i, y), j) + α Ĵ(j) ).

Bellman's equation, defined over the controllable state component i, takes the form

    Ĵ(i) = (T̂ Ĵ)(i),  for all i.  (6.9)

The typical iteration of the simplified policy iteration algorithm consists of two steps:

(a) The policy evaluation step, which given the current policy µk(i, y), computes the unique Ĵ_µk(i), i = 1, ..., n, that solve the linear system of equations Ĵ_µk = T̂_µk Ĵ_µk, or equivalently

    Ĵ_µk(i) = Σ_y p(y | i) Σ_{j=0}^n p̂_ij(µk(i, y), y) ( g(i, y, µk(i, y), j) + α Ĵ_µk(j) ),  for all i = 1, ..., n.

(b) The policy improvement step, which computes the improved policy µk+1(i, y), from the equation T̂_µk+1 Ĵ_µk = T̂ Ĵ_µk, or equivalently

    µk+1(i, y) = arg min_{u ∈ U(i,y)} Σ_{j=0}^n p̂_ij(u, y) ( g(i, y, u, j) + α Ĵ_µk(j) ),  for all (i, y).

Approximate policy iteration algorithms can be similarly carried out in reduced form.

Problems with Post-Decision States

In some stochastic problems, the transition probabilities and stage costs have the special form

    p_ij(u) = q( j | f(i, u) ),  (6.10)

where f is some function and q(· | f(i, u)) is a given probability distribution for each value of f(i, u). In words, the dependence of the transitions on (i, u) comes through the function f(i, u). We may exploit this structure by viewing f(i, u) as a form of state: a post-decision state that determines the probabilistic evolution to the next state. An example where the conditions (6.10) are satisfied are inventory control problems of the type considered in
Section 4.2 of Vol. I. There the post-decision state at time k is xk + uk, i.e., the post-purchase inventory, before any demand at time k has been filled.

Post-decision states can be exploited when the stage cost has no dependence on j,† i.e., when we have (with some notation abuse) g(i, u, j) = g(i, u), for all i, u, j. Then the optimal cost-to-go within an α-discounted context at state i is given by

    J*(i) = min_{u ∈ U(i)} ( g(i, u) + α V*( f(i, u) ) ),

while the optimal cost-to-go at post-decision state m (optimal sum of costs of future stages) is given by

    V*(m) = Σ_{j=1}^n q(j | m) J*(j).

In effect, we consider a modified problem where the state space is enlarged to include post-decision states, with transitions between ordinary states and post-decision states specified by f and q(· | f(i, u)) (see Fig. 6.1.6). The preceding two equations represent Bellman's equation for this modified problem.

Figure 6.1.6. Modified problem where the post-decision states are viewed as additional states.

Combining these equations, we have

    V*(m) = Σ_{j=1}^n q(j | m) min_{u ∈ U(j)} ( g(j, u) + α V*( f(j, u) ) ),  ∀ m,  (6.11)

† If there is dependence on j, one may consider computing, possibly by simulation, (an approximation to) ĝ(i, u) = Σ_{j=1}^n p_ij(u) g(i, u, j), and using it in place of g(i, u, j).
which can be viewed as Bellman's equation over the space of post-decision states m. This equation is similar to Q-factor equations, but is defined over the space of post-decision states rather than the larger space of state-control pairs. The advantage of this equation is that once the function V* is calculated (or approximated), the optimal policy can be computed as

    µ*(i) = arg min_{u ∈ U(i)} ( g(i, u) + α V*( f(i, u) ) ),  i = 1, ..., n,

which does not require the knowledge of transition probabilities and computation of an expected value. It involves a deterministic optimization, and it can be used in a model-free context (as long as the functions g and f are known). This is important if the calculation of the optimal policy is done on-line.

It is straightforward to construct a policy iteration algorithm that is defined over the space of post-decision states. The cost-to-go function Vµ of a stationary policy µ is the unique solution of the corresponding Bellman equation

    Vµ(m) = Σ_{j=1}^n q(j | m) ( g( j, µ(j) ) + α Vµ( f( j, µ(j) ) ) ),  ∀ m.

Moreover, the policy evaluation of Vµ can be done in model-free fashion, with a simulator, without explicit knowledge of the probabilities q(j | m). There are also corresponding approximate policy iteration methods with cost function approximation. These advantages are shared with policy iteration algorithms based on Q-factors. However, when function approximation is used in policy iteration, the methods using post-decision states may have a significant advantage over Q-factor-based methods: they use cost function approximation in the space of post-decision states, rather than the larger space of state-control pairs, and they are less susceptible to difficulties due to inadequate exploration.
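As a concrete sketch of the last two displayed relations, value iteration can be run directly over the post-decision states, followed by the deterministic policy extraction step. All model data below (sizes, f, q, g) are synthetic placeholders, not taken from the text:

```python
import numpy as np

# Value iteration on V(m) = sum_j q(j|m) min_u [g(j,u) + alpha*V(f(j,u))],
# followed by greedy policy extraction.  All model data here is synthetic.
rng = np.random.default_rng(0)
n, n_u, n_m = 4, 3, 5                     # states, controls, post-decision states
alpha = 0.9                               # discount factor
f = rng.integers(n_m, size=(n, n_u))      # post-decision state f(i, u)
q = rng.dirichlet(np.ones(n), size=n_m)   # q(j | m), rows sum to 1
g = rng.random((n, n_u))                  # stage cost g(i, u)

V = np.zeros(n_m)
for _ in range(500):                      # contraction of modulus alpha
    Q = g + alpha * V[f]                  # Q[j, u] = g(j, u) + alpha*V(f(j, u))
    V = q @ Q.min(axis=1)                 # V(m) = sum_j q(j|m) min_u Q[j, u]

# mu*(i) = argmin_u [g(i, u) + alpha*V(f(i, u))]: a deterministic
# minimization, with no expected value computation
mu = (g + alpha * V[f]).argmin(axis=1)
```

Note that the policy extraction in the last line involves no transition probabilities, which is precisely the model-free advantage discussed above.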
Given Vµ, the improved policy is obtained as

    µ̄(i) = arg min_{u ∈ U(i)} ( g(i, u) + α Vµ( f(i, u) ) ),  i = 1, ..., n.

An advantage of this method when implemented by simulation is that the computation of the improved policy does not require the calculation of expected values. We note that there is a similar simplification with post-decision states when g is of the form

    g(i, u, j) = h( f(i, u), j ),

for some function h. Then we have

    J*(i) = min_{u ∈ U(i)} V*( f(i, u) ),
where V* is the unique solution of the equation

    V*(m) = Σ_{j=1}^n q(j | m) ( h(m, j) + α min_{u ∈ U(j)} V*( f(j, u) ) ),  ∀ m.

Here V*(m) should be interpreted as the optimal cost-to-go from post-decision state m, including the cost h(m, j) incurred within the stage when m was generated. When h does not depend on j, the algorithm takes the simpler form

    V*(m) = h(m) + α Σ_{j=1}^n q(j | m) min_{u ∈ U(j)} V*( f(j, u) ),  ∀ m.  (6.12)

Example 6.1.3 (Tetris)

Let us revisit the game of tetris, which was discussed in Example 1.4.1 of Vol. I in the context of problems with an uncontrollable state component. We will show that it also admits a post-decision state. The state consists of two components:

(1) The board position, i.e., a binary description of the full/empty status of each square, denoted by x.

(2) The shape of the current falling block, denoted by y (this is the uncontrollable component).

The control, denoted by u, is the horizontal positioning and rotation applied to the falling block. Assuming that the game terminates with probability 1 for every policy (a proof of this has been given by Burgiel [Bur97]), we can model the problem of finding an optimal tetris playing strategy as a stochastic shortest path problem. Bellman's equation over the space of the controllable state component takes the form

    Ĵ(x) = Σ_y p(y) max_u ( g(x, y, u) + Ĵ( f(x, y, u) ) ),  for all x,

[cf. Eq. (6.9)], where g(x, y, u) and f(x, y, u) are the number of points scored (rows removed), and the board position when the state is (x, y) and control u is applied, respectively. This problem also admits a post-decision state. Once u is applied at state (x, y), a new board position m is obtained, and the new state component x is obtained from m after removing a number of rows. Thus we have m = f(x, y, u) for some function f, and m also determines the reward of the stage, which has the form h(m) for some function h [h(m) is the number of complete rows that
can be removed from m]. Thus, m may serve as a post-decision state, and the corresponding Bellman's equation has the form of Eq. (6.12):

    V*(m) = h(m) + Σ_{(x,y)} q(m, x, y) max_u V*( f(x, y, u) ),  ∀ m,

where (x, y) is the state that follows m, and q(m, x, y) are the corresponding transition probabilities. Note that both of the simplified Bellman's equations share the same characteristic: they involve a deterministic optimization.

Trading off Complexity of Control Space with Complexity of State Space

Suboptimal control using cost function approximation deals fairly well with large state spaces, but still encounters serious difficulties when the number of controls available at each state is large. In particular, the minimization

    min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + J̃(j, r) )

using an approximate cost-to-go function J̃(j, r) may be very time-consuming. For multistep lookahead schemes, the difficulty is exacerbated, since the required computation grows exponentially with the size of the lookahead horizon. It is thus useful to know that by reformulating the problem, it may be possible to reduce the complexity of the control space by increasing the complexity of the state space. The potential advantage is that the extra state space complexity may still be dealt with by using function approximation and/or rollout.

In particular, suppose that the control u consists of m components, u = (u1, ..., um). Then, at a given state i, we can break down u into the sequence of the m controls u1, u2, ..., um, and introduce artificial intermediate "states" (i, u1), (i, u1, u2), ..., (i, u1, ..., um−1), and corresponding transitions to model the effect of these controls. The choice of the last control component um at "state" (i, u1, ..., um−1) marks the transition to state j according to the given transition probabilities p_ij(u). In this way the control space is simplified at the expense of introducing m − 1 additional layers of states, and m − 1 additional cost-to-go functions J1(i, u1), J2(i, u1, u2), ..., Jm−1(i, u1, ..., um−1).
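The reformulation can be sketched for the case m = 2: minimizing over u = (u1, u2) jointly, and minimizing in two smaller layers through the intermediate "state" (i, u1), give the same result. All model data below is synthetic:

```python
import numpy as np

# Minimizing over a two-component control u = (u1, u2) jointly, and in two
# layers through an artificial intermediate "state" (i, u1), give the same
# result.  All model data is synthetic.
rng = np.random.default_rng(0)
n, n1, n2 = 5, 3, 4                                # states, choices for u1, for u2
alpha = 0.9
P = rng.dirichlet(np.ones(n), size=(n, n1, n2))    # p_ij(u1, u2)
g = rng.random((n, n1, n2))                        # g(i, u1, u2)

def one_step_min(J):
    # joint minimization over the full control space U1 x U2
    Q = g + alpha * np.einsum('iabj,j->iab', P, J)
    return Q.min(axis=(1, 2))

def layered_min(J):
    # first minimize over u2 at "state" (i, u1), then over u1 at state i;
    # J1 is the cost-to-go of the intermediate layer of states
    Q = g + alpha * np.einsum('iabj,j->iab', P, J)
    J1 = Q.min(axis=2)
    return J1.min(axis=1)

J = rng.random(n)   # any cost-to-go vector; both orderings agree on it
```

The point of the layered form is that each minimization is over a set of size n1 or n2 rather than n1 × n2, at the price of carrying the intermediate cost-to-go J1.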
To deal with the increase in size of the state space we may use rollout, i.e., when at "state" (i, u1, ..., uk), assume that future controls uk+1, ..., um will be chosen by a base heuristic. Alternatively, we may use function approximation, that is, in addition to J̃(i, r), introduce cost-to-go approximations J̃1(i, u1, r1), J̃2(i, u1, u2, r2), ..., J̃m−1(i, u1, ..., um−1, rm−1).

A potential complication in the preceding schemes arises when the controls u1, ..., um are coupled through a constraint of the form

    u = (u1, ..., um) ∈ U(i).  (6.13)

Then, when choosing a control uk, care must be exercised to ensure that the future controls uk+1, ..., um can be chosen together with the already chosen controls u1, ..., uk to satisfy the feasibility constraint (6.13). This requires a variant of the rollout algorithm that works with constrained DP problems; we refer to Exercise 6.19 of Vol. I, and to [BeT96], [Ber05a], [Ber05b], for further discussion.

6.1.5 The Role of Contraction Mappings

In this and the next subsection, we will try to provide some orientation into the mathematical content of this chapter. The reader may wish to skip these subsections at first, but return to them later for orientation and a higher level view of some of the subsequent technical material.

Most of the chapter (Sections 6.3-6.8) deals with the approximate computation of a fixed point of a (linear or nonlinear) mapping T within a subspace S = {Φr | r ∈ ℜs}. We will discuss a variety of approaches with distinct characteristics, but at an abstract mathematical level, these approaches fall into two categories:

(a) A projected equation approach, based on the equation Φr = ΠT(Φr), where Π is a projection operation with respect to a Euclidean norm (see Section 6.3 for discounted problems, and Sections 6.6-6.8 for other types of problems).

(b) A Q-learning/aggregation approach, based on an equation of the form Φr = ΦDT(Φr), where D is an s × n matrix whose rows are probability distributions (see Sections 6.4 and 6.5). Section 6.4 includes a discussion of Q-learning (where Φ is the identity matrix) and other methods (where
Φ has a more general form), and Section 6.5 is focused on aggregation (where Φ is restricted to satisfy certain conditions).

If T is linear, both types of methods lead to linear equations in the parameter vector r, whose solution can be approximated by simulation. The approach here is very simple: we approximate the matrices and vectors involved in the equation Φr = ΠT(Φr) or Φr = ΦDT(Φr) by simulation, and we solve the resulting (approximate) linear system by matrix inversion. This is called the matrix inversion approach. A primary example is the LSTD methods of Section 6.3.

When iterative methods are used (which is the only possibility when T is nonlinear, and may be attractive in some cases even when T is linear), it is important that ΠT and ΦDT be contractions over the subspace S. Note here that even if T is a contraction mapping (as is ordinarily the case in DP), it does not follow that ΠT and ΦDT are contractions. In our analysis, this is resolved by requiring that T be a contraction with respect to a norm such that Π or ΦD, respectively, is a nonexpansive mapping. As a result, we need various assumptions on T, Φ, and D, which guide the algorithmic development. We postpone further discussion of these issues, but for the moment we note that the projection approach revolves mostly around Euclidean norm contractions and cases where T is linear, while the Q-learning/aggregation approach revolves mostly around sup-norm contractions.

6.1.6 The Role of Monte Carlo Simulation

The methods of this chapter rely to a large extent on simulation in conjunction with cost function approximation in order to deal with large state spaces. The advantage that simulation holds in this regard can be traced to its ability to compute (approximately) sums with a very large number of terms. These sums arise in a number of contexts: inner product and matrix-vector product calculations, linear least squares problems, the solution of linear systems of equations and policy evaluation.

Example 6.1.4 (Approximate Policy Evaluation)

Consider the approximate solution of the Bellman equation that corresponds to a given policy of an n-state discounted problem:

    J = g + αPJ,

where P is the transition probability matrix and α is the discount factor. Let us adopt a hard aggregation approach (cf. Section 6.3.4 of Vol. I; see also Section 6.5 later in this chapter), whereby we divide the n states in two disjoint subsets I1 and I2 with I1 ∪ I2 = {1, ..., n}, and we use the piecewise constant approximation

    J(i) = r1  if i ∈ I1,
    J(i) = r2  if i ∈ I2.
This corresponds to the linear feature-based architecture J ≈ Φr, where Φ is the n × 2 matrix with column components equal to 1 or 0, depending on whether the component corresponds to I1 or I2. We obtain the approximate equations

    J(i) ≈ g(i) + α Σ_{j∈I1} p_ij r1 + α Σ_{j∈I2} p_ij r2,  i = 1, ..., n,

which we can reduce to just two equations by forming two weighted sums (with equal weights) of the equations corresponding to the states in I1 and I2, respectively:

    r1 ≈ (1/n1) Σ_{i∈I1} J(i),    r2 ≈ (1/n2) Σ_{i∈I2} J(i),

where n1 and n2 are the numbers of states in I1 and I2, respectively. We thus obtain the aggregate system of the following two equations in r1 and r2:

    r1 = (1/n1) Σ_{i∈I1} g(i) + (α/n1) Σ_{i∈I1} Σ_{j∈I1} p_ij r1 + (α/n1) Σ_{i∈I1} Σ_{j∈I2} p_ij r2,

    r2 = (1/n2) Σ_{i∈I2} g(i) + (α/n2) Σ_{i∈I2} Σ_{j∈I1} p_ij r1 + (α/n2) Σ_{i∈I2} Σ_{j∈I2} p_ij r2.

Here the challenge, when the number of states n is very large, is the calculation of the large sums in the right-hand side, which can be of order O(n²). Simulation allows the approximate calculation of these sums with complexity that is independent of n. This is similar to the advantage that Monte-Carlo integration holds over numerical integration, as discussed in standard texts on Monte-Carlo methods.

To see how simulation can be used with advantage, let us consider the problem of estimating a scalar sum of the form

    z = Σ_{ω∈Ω} v(ω),

where Ω is a finite set and v : Ω → ℜ is a function of ω. We introduce a distribution ξ that assigns positive probability ξ(ω) to every element ω ∈ Ω, and we generate a sequence {ω1, ..., ωT} of samples from Ω, with each sample ωt taking values from Ω according to ξ. We then estimate z with

    ẑT = (1/T) Σ_{t=1}^T v(ωt)/ξ(ωt).  (6.14)
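A minimal numerical illustration of the estimator (6.14), with a synthetic v and a uniform sampling distribution ξ; the last three lines anticipate the importance sampling discussion that follows (the choice ξ* = v/z, computable here only because the toy sum is small):

```python
import numpy as np

# Estimate z = sum_w v(w) with z_hat = (1/T) sum_t v(w_t)/xi(w_t), where
# w_t ~ xi i.i.d. [Eq. (6.14)].  The data below is synthetic.
rng = np.random.default_rng(0)
n = 10_000                        # |Omega|
v = rng.random(n)                 # v(w) >= 0
z = v.sum()                       # the sum being estimated

xi = np.full(n, 1.0 / n)          # sampling distribution (uniform here)
T = 200_000
idx = rng.choice(n, size=T, p=xi)
z_hat = (v[idx] / xi[idx]).mean()
# accuracy depends on var(v(w)/xi(w)), not on the number of terms n;
# with the zero-variance choice xi* = v/z the estimate is exact:
xi_star = v / z
idx = rng.choice(n, size=100, p=xi_star)
z_star = (v[idx] / xi_star[idx]).mean()
```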
i. The samples ωt need not be independent for the preceding properties to hold.Sec. . T ∀ ω ∈ Ω. ω2 . if we generate an inﬁnitely long trajectory {ω1 . ω∈Ω Thus in the limit. with ξ(ω) being the steadystate probability of state ω. Then. Then from Eq. we obtain the desired sum z. ω∈Ω An important observation from this formula is that the accuracy of the approximation does not depend on the number of terms in the sum z (the number of elements in Ω). . T T →∞ lim t=1 δ(ωt = ω) = ξ(ω). of particular relevance to the methods of this chapter. ω∈Ω Suppose now that the samples are generated in a way that the longterm frequency of each ω ∈ Ω is equal to ξ(ω). is when Ω is the set of states of an irreducible Markov chain. T ξ(ω) and by taking limit as T → ∞ and using Eq. but if they are.15). then the variance of zT is given by ˆ var(ˆT ) = z which can be written as var(ˆT ) = z 1 T v(ω)2 − z2 .15) will hold with probability 1.15) where δ(·) denotes the indicator function [δ(E) = 1 if the event E has occurred and δ(E) = 0 otherwise]. 6. T T →∞ lim zT = ˆ ω∈Ω T →∞ lim t=1 δ(ωt = ω) v(ω) · = T ξ(ω) v(ω) = z.16) 1 T2 T ξ(ω) t=1 ω∈Ω v(ω) −z ξ(ω) 2 . (6.e. then the condition (6. we have T T →∞ lim zT = ˆ ω∈Ω t=1 δ(ωt = ω) v(ω) · . .14). (6. but rather depends on the variance of the random . ξ(ω) (6.1 General Issues of Cost Approximation 347 Clearly z is unbiased: ˆ E[ˆT ] = z 1 T T E t=1 1 v(ωt ) = ξ(ωt ) T T ξ(ω) t=1 ω∈Ω v(ω) = ξ(ω) v(ω) = z. An important case. as the number of samples increases.} starting from any initial state ω1 .. (6.
variable that takes values v(ω)/ξ(ω), ω ∈ Ω, with probabilities ξ(ω).† Thus, it is possible to execute approximately linear algebra operations of very large size through Monte Carlo sampling (with whatever distributions may be convenient in a given context), and this is a principal idea underlying the methods of this chapter. In the case where the samples are dependent, the variance formula (6.16) does not hold, but similar qualitative conclusions can be drawn under various assumptions that ensure that the dependencies between samples become sufficiently weak over time (see the specialized literature).

† The selection of the distribution {ξ(ω) | ω ∈ Ω} can be optimized (at least approximately), and methods for doing this are the subject of the technique of importance sampling. In particular, assuming that samples are independent and that v(ω) ≥ 0 for all ω ∈ Ω, we have

    var(ẑT) = (z²/T) ( Σ_{ω∈Ω} ( v(ω)/z )² / ξ(ω) − 1 ),

so the optimal distribution is ξ* = v/z and the corresponding minimum variance value is 0. However, ξ* cannot be computed without knowledge of z. Instead, ξ is usually chosen to be an approximation to v, normalized so that its components add to 1. Note that we may assume that v(ω) ≥ 0 for all ω ∈ Ω without loss of generality: when v takes negative values, we may decompose v as v = v+ − v−, so that both v+ and v− are positive functions, and then estimate separately z+ = Σ_{ω∈Ω} v+(ω) and z− = Σ_{ω∈Ω} v−(ω).

6.2 DIRECT POLICY EVALUATION - GRADIENT METHODS

We will now consider the direct approach for policy evaluation.‡ In particular, suppose that the current policy is µ, and for a given r, J̃(i, r) is an approximation of Jµ(i). We generate an "improved" policy µ̄ using the

‡ Direct policy evaluation methods have been historically important, and provide an interesting contrast with indirect methods. However, despite some generic advantages (the option to use nonlinear approximation architectures, and the capability of more accurate approximation), they are currently less popular than the projected equation methods to be considered in the next section. The material of this section will not be substantially used later, so the reader may read lightly this section without loss of continuity.
formula

    µ̄(i) = arg min_{u ∈ U(i)} Σ_{j=1}^n p_ij(u) ( g(i, u, j) + α J̃(j, r) ),  for all i,  (6.17)

and we repeat the process with µ̄ and r̄ replacing µ and r. To evaluate approximately Jµ, we select a subset of "representative" states S̃ (perhaps obtained by some form of simulation), and for each i ∈ S̃, we obtain M(i) samples of the cost Jµ(i). The m-th such sample is denoted by c(i, m), and mathematically, it can be viewed as being Jµ(i) plus some simulation error/noise.† Then we obtain the corresponding parameter vector r̄ by solving the following least squares problem

    min_r Σ_{i∈S̃} Σ_{m=1}^{M(i)} ( J̃(i, r) − c(i, m) )².  (6.18)

The least squares problem (6.18) can be solved exactly if a linear approximation architecture is used, i.e., if

    J̃(i, r) = φ(i)′ r,

where φ(i)′ is a row vector of features corresponding to state i. In this case r is obtained by solving the linear system of equations

    Σ_{i∈S̃} Σ_{m=1}^{M(i)} φ(i) ( φ(i)′ r − c(i, m) ) = 0,

which is obtained by setting to 0 the gradient with respect to r of the quadratic cost in the minimization (6.18). When a nonlinear architecture is used, we may use gradient-like methods for solving the least squares problem (6.18), as we will now discuss.

† The manner in which the samples c(i, m) are collected is immaterial for the purposes of the subsequent discussion. Thus one may generate these samples through a single very long trajectory of the Markov chain corresponding to µ, or one may use multiple trajectories, with different starting points, to ensure that enough cost samples are generated for a "representative" subset of states. In either case, the samples c(i, m) corresponding to any one state i will ordinarily be correlated as well as "noisy." Still the average (1/M(i)) Σ_{m=1}^{M(i)} c(i, m) will converge to Jµ(i) as M(i) → ∞ by a law of large numbers argument [see Exercise 6.2 and the discussion in [BeT96], Sections 5.1, 5.2, regarding the behavior of the average when M(i) is finite and random].
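For the linear case, the normal equations above can be solved directly. The sketch below uses synthetic features and cost samples (in practice the c(i, m) would come from simulation of µ, and only a subset of representative states would be sampled):

```python
import numpy as np

# Solve sum_i sum_m phi(i) (phi(i)' r - c(i,m)) = 0 for r, with a linear
# architecture J(i,r) = phi(i)' r; features and samples here are synthetic.
rng = np.random.default_rng(0)
n, s, M = 20, 3, 50              # states, features, cost samples per state
Phi = rng.random((n, s))         # row i is phi(i)'
J_mu = rng.random(n)             # "true" costs (unknown in practice)
c = J_mu[:, None] + 0.1 * rng.standard_normal((n, M))   # noisy samples c(i, m)

# Normal equations: (sum_{i,m} phi(i) phi(i)') r = sum_{i,m} phi(i) c(i,m)
A = M * (Phi.T @ Phi)
b = Phi.T @ c.sum(axis=1)
r = np.linalg.solve(A, b)
# Phi @ r is then the least squares fit of the sampled costs
```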
Batch Gradient Methods for Policy Evaluation

Let us focus on an N-transition portion (i0, ..., iN) of a simulated trajectory, also called a batch. We view the numbers

    Σ_{t=k}^{N−1} α^{t−k} g( it, µ(it), it+1 ),  k = 0, ..., N − 1,

as cost samples, one per initial state i0, ..., iN−1, which can be used for least squares approximation of the parametric architecture J̃(i, r) [cf. Eq. (6.18)]:

    min_r Σ_{k=0}^{N−1} (1/2) ( J̃(ik, r) − Σ_{t=k}^{N−1} α^{t−k} g( it, µ(it), it+1 ) )².  (6.19)

One way to solve this least squares problem is to use a gradient method, whereby the parameter r associated with µ is updated at time N by

    r := r − γ Σ_{k=0}^{N−1} ∇J̃(ik, r) ( J̃(ik, r) − Σ_{t=k}^{N−1} α^{t−k} g( it, µ(it), it+1 ) ).  (6.20)

Here, ∇J̃ denotes gradient with respect to r, and γ is a positive stepsize, which is usually diminishing over time (we leave its precise choice open for the moment). Each of the N terms in the summation in the right-hand side above is the gradient of a corresponding term in the least squares summation of problem (6.19). Note that the update of r is done after processing the entire batch, and that the gradients ∇J̃(ik, r) are evaluated at the preexisting value of r, i.e., the one before the update.

In a traditional gradient method, the gradient iteration (6.20) is repeated until convergence to the solution of the least squares problem (6.19), i.e., a single N-transition batch is used. However, there is an important tradeoff relating to the size N of the batch: in order to reduce simulation error and generate multiple cost samples for a representatively large subset of states, it is necessary to use a large N, yet to keep the work per gradient iteration small it is necessary to use a small N. To address the issue of the size of N, an expanded view of the gradient method is preferable in practice, whereby batches may be changed after one or more iterations. Thus, in this more general method, the N-transition batch used in a given gradient iteration comes from a potentially longer simulated trajectory, or from one of many simulated trajectories. A sequence of gradient iterations is performed, with each iteration using cost samples formed from batches collected in a variety of different ways and whose length N may vary. Batches may also overlap to a substantial degree.

We leave the method for generating simulated trajectories and forming batches open for the moment, but we note that it influences strongly the result of the corresponding least squares optimization (6.18), providing better approximations for the states that arise most frequently in the batches used. This is related to the issue of ensuring that the state space is adequately "explored," with an adequately broad selection of states being represented in the least squares optimization, cf. our earlier discussion on the exploration issue.

The gradient method (6.20) is simple, widely known, and easily understood. There are extensive convergence analyses of this method and its variations, for which we refer to the literature cited at the end of the chapter. These analyses often involve considerable mathematical sophistication, particularly when multiple batches are involved, because of the stochastic nature of the simulation and the complex correlations between the cost samples. However, qualitatively, the conclusions of these analyses are consistent among themselves as well as with practical experience, and indicate that:

(1) Under some reasonable technical assumptions, convergence to a limiting value of r that is a local minimum of the associated optimization problem is expected.

(2) For convergence, it is essential to gradually reduce the stepsize to 0, the most popular choice being to use a stepsize proportional to 1/m, while processing the m-th batch. In practice, considerable trial and error may be needed to settle on an effective stepsize choice method. Sometimes it is possible to improve performance by using a different stepsize (or scaling factor) for each component of the gradient.

(3) The rate of convergence is often very slow, and depends among other things on the initial choice of r, the number of states and the dynamics of the associated Markov chain, the level of simulation error, and the method for stepsize choice. In fact, the rate of convergence is sometimes so slow, that practical convergence is infeasible, even if theoretical convergence is guaranteed.

Incremental Gradient Methods for Policy Evaluation

We will now consider a variant of the gradient method called incremental. This method can also be described through the use of N-transition batches, but we will see that (contrary to the batch version discussed earlier) the method is suitable for use with very long batches, including the possibility of a single very long simulated trajectory, viewed as a single batch. For a given N-transition batch (i0, ..., iN), the batch gradient method processes the N transitions all at once, and updates r using Eq. (6.20). The incremental method updates r a total of N times, once after each transition (ik, ik+1):

(1) We evaluate the gradient ∇J̃(ik, r) at the current value of r.

(2) We sum all the terms in the right-hand side of Eq. (6.20) that involve the transition (ik, ik+1), and we update r by making a correction along their sum:

    r := r − γ ( ∇J̃(ik, r) J̃(ik, r) − ( Σ_{t=0}^k α^{k−t} ∇J̃(it, r) ) g( ik, µ(ik), ik+1 ) ).  (6.21)

By adding the parenthesized "incremental" correction terms in the above iteration, we see that after N transitions, all the terms of the batch iteration (6.20) will have been accumulated, but there is a difference: in the incremental version, r is updated at intermediate transitions within a batch (rather than at the end of the batch), and the gradient ∇J̃(it, r) is evaluated at the most recent value of r [after the transition (it, it+1)]. By contrast, in the batch version these gradients are evaluated at the value of r prevailing at the beginning of the batch. Note that the gradient sum in the right-hand side of Eq. (6.21) can be conveniently updated following each transition, thereby resulting in an efficient implementation.

It can now be seen that because r is updated at intermediate transitions within a batch, the location of the end of the batch becomes less relevant. It is thus possible to have very long batches, and indeed the algorithm can be operated with a single very long simulated trajectory and a single batch. In this case, for each state i, we will have one cost sample for every time when state i is encountered in the simulation. Accordingly state i will be weighted in the least squares optimization in proportion to the frequency of its occurrence within the simulated trajectory.

Generally, within the least squares/policy evaluation context of this section, the incremental versions of the gradient methods can be implemented more flexibly and tend to converge faster than their batch counterparts, so they will be adopted as the default in our discussion. The book by Bertsekas and Tsitsiklis [BeT96] contains an extensive analysis of the theoretical convergence properties of incremental gradient methods (they are fairly similar to those of batch methods), and provides some insight into the reasons for their superior performance relative to the batch versions; see also the author's nonlinear programming book [Ber99] (Section 1.5.2), and the paper by Bertsekas and Tsitsiklis [BeT00].

Implementation Using Temporal Differences - TD(1)

We now introduce an alternative, mathematically equivalent, implementation of the batch and incremental gradient iterations (6.20) and (6.21), which is described with cleaner formulas. It uses the notion of temporal difference (TD for short) given by

    qk = J̃(ik, r) − α J̃(ik+1, r) − g( ik, µ(ik), ik+1 ),  k = 0, ..., N − 2,  (6.22)

    qN−1 = J̃(iN−1, r) − g( iN−1, µ(iN−1), iN ).  (6.23)

In particular, by noting that the parenthesized term multiplying ∇J̃(ik, r) in Eq. (6.20) is equal to

    qk + α qk+1 + · · · + α^{N−1−k} qN−1,

we can verify by adding the equations below that iteration (6.20) can also be implemented as follows:

After the state transition (i0, i1), set

    r := r − γ q0 ∇J̃(i0, r).

After the state transition (i1, i2), set

    r := r − γ q1 ( α ∇J̃(i0, r) + ∇J̃(i1, r) ).

Proceeding similarly, after the state transition (iN−1, iN), set

    r := r − γ qN−1 ( α^{N−1} ∇J̃(i0, r) + α^{N−2} ∇J̃(i1, r) + · · · + ∇J̃(iN−1, r) ).

The batch version (6.20) is obtained if the gradients ∇J̃(ik, r) are all evaluated at the value of r that prevails at the beginning of the batch. The incremental version (6.21) is obtained if each gradient ∇J̃(ik, r) is evaluated at the value of r that prevails when the transition (ik, ik+1) is processed. In particular, for the incremental version, we start with some vector r0, and following the transition (ik, ik+1), we set

    rk+1 = rk − γk qk Σ_{t=0}^k α^{k−t} ∇J̃(it, rt),  (6.24)

where the stepsize γk may vary from one transition to the next. This algorithm is known as TD(1), and we will see later (in Section 6.3) that it is a limiting version (as λ → 1) of the TD(λ) method discussed there. In the important case of a linear approximation architecture of the form

    J̃(i, r) = φ(i)′ r,  i = 1, ..., n,

where φ(i) ∈ ℜs are some fixed vectors, it takes the form

    rk+1 = rk − γk qk Σ_{t=0}^k α^{k−t} φ(it).  (6.25)
6.3 PROJECTED EQUATION METHODS

In this section, we consider the indirect approach, whereby the policy evaluation is based on solving a projected form of Bellman's equation (cf. the right-hand side of Fig. 6.1.3). We will be dealing with a single stationary policy \mu, so we generally suppress in our notation the dependence on control of the transition probabilities and the cost per stage. We thus consider a stationary finite-state Markov chain, and we denote the states by i = 1, \ldots, n, the transition probabilities by p_{ij}, i, j = 1, \ldots, n, and the stage costs by g(i, j). We want to evaluate the expected cost of \mu corresponding to each initial state i, given by

    J_\mu(i) = \lim_{N \to \infty} E\Big\{ \sum_{k=0}^{N-1} \alpha^k g(i_k, i_{k+1}) \;\Big|\; i_0 = i \Big\},  i = 1, \ldots, n,

where i_k denotes the state at time k, and \alpha \in (0, 1) is the discount factor. We approximate J_\mu(i) with a linear architecture of the form

    \tilde J(i, r) = \phi(i)' r,  i = 1, \ldots, n,   (6.26)

where r is a parameter vector and \phi(i) is an s-dimensional feature vector associated with the state i. (Throughout this section, vectors are viewed as column vectors, and a prime denotes transposition.) As earlier, we also write the vector

    \big( \tilde J(1, r), \ldots, \tilde J(n, r) \big)'

in the compact form \Phi r, where \Phi is the n \times s matrix that has as rows the feature vectors \phi(i)', i = 1, \ldots, n. Thus, we want to approximate J_\mu within

    S = \{ \Phi r \mid r \in \Re^s \},

the subspace spanned by s basis functions, the columns of \Phi. Our assumptions in this section are the following (we will later discuss how our methodology may be modified in the absence of these assumptions).

Assumption 6.3.1: The Markov chain has steady-state probabilities \xi_1, \ldots, \xi_n, which are positive, i.e., for all i = 1, \ldots, n,

    \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^N P(i_k = j \mid i_0 = i) = \xi_j > 0,  j = 1, \ldots, n.

Assumption 6.3.2: The matrix \Phi has rank s.
Assumption 6.3.1 is equivalent to assuming that the Markov chain is irreducible, i.e., has a single recurrent class and no transient states. Assumption 6.3.2 is equivalent to the basis functions (the columns of \Phi) being linearly independent, and is analytically convenient because it implies that each vector J in the subspace S is represented in the form \Phi r with a unique vector r.

6.3.1 The Projected Bellman Equation
We will now introduce the projected form of Bellman's equation. We use a weighted Euclidean norm on \Re^n of the form

    \| J \|_v = \sqrt{ \sum_{i=1}^n v_i \big( J(i) \big)^2 },

where v is a vector of positive weights v_1, \ldots, v_n. Let \Pi denote the projection operation onto S with respect to this norm. Thus for any J \in \Re^n, \Pi J is the unique vector in S that minimizes \| J - \hat J \|_v^2 over all \hat J \in S. It can also be written as

    \Pi J = \Phi r_J,  where  r_J = \arg\min_{r \in \Re^s} \| J - \Phi r \|_v^2,  J \in \Re^n.   (6.27)

This is because \Phi has rank s by Assumption 6.3.2, so a vector in S is uniquely written in the form \Phi r. Note that \Pi and r_J can be written explicitly in closed form. This can be done by setting to 0 the gradient of the quadratic function

    \| J - \Phi r \|_v^2 = (J - \Phi r)' V (J - \Phi r),

where V is the diagonal matrix with v_i, i = 1, \ldots, n, along the diagonal [cf. Eq. (6.27)]. We thus obtain the necessary and sufficient optimality condition

    \Phi' V (J - \Phi r_J) = 0,   (6.28)

from which

    r_J = (\Phi' V \Phi)^{-1} \Phi' V J,

and using the formula \Phi r_J = \Pi J,

    \Pi = \Phi (\Phi' V \Phi)^{-1} \Phi' V.

[The inverse (\Phi' V \Phi)^{-1} exists because \Phi is assumed to have rank s; cf. Assumption 6.3.2.] The optimality condition (6.28), through left multiplication with \bar r', can also be equivalently expressed as

    \bar J' V (J - \Phi r_J) = 0,  for all \bar J \in S.   (6.29)
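These closed-form expressions are easy to check numerically. The following sketch (with hypothetical data: a 4-state example with 2 basis functions) forms \Pi = \Phi (\Phi' V \Phi)^{-1} \Phi' V and verifies the orthogonality condition (6.29):

```python
import numpy as np

# Numerical check of the weighted projection formulas; data are hypothetical.
Phi = np.array([[1., 0.],
                [1., 1.],
                [0., 1.],
                [2., 1.]])                 # n = 4 states, s = 2 basis functions
v = np.array([0.1, 0.2, 0.3, 0.4])         # positive weights
V = np.diag(v)

# Pi = Phi (Phi' V Phi)^{-1} Phi' V
Pi = Phi @ np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V)

J = np.array([1., 2., 3., 4.])
PiJ = Pi @ J
residual = Phi.T @ V @ (J - PiJ)           # should vanish, by Eq. (6.29)
```

As expected for a projection, \Pi is also idempotent (\Pi^2 = \Pi), which serves as another quick sanity check.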
The interpretation is that the difference/approximation error J - \Phi r_J is orthogonal to the subspace S in the scaled geometry of the norm \|\cdot\|_v (two vectors x, y \in \Re^n are called orthogonal if x' V y = \sum_{i=1}^n v_i x_i y_i = 0).

Consider now the mapping T given by

    (T J)(i) = \sum_{j=1}^n p_{ij} \big( g(i, j) + \alpha J(j) \big),  i = 1, \ldots, n,

or in more compact notation,

    T J = g + \alpha P J,   (6.30)

where g is the vector with components \sum_{j=1}^n p_{ij} g(i, j), i = 1, \ldots, n, and P is the matrix with components p_{ij}. Consider also the mapping \Pi T (the composition of \Pi with T) and the equation

    \Phi r = \Pi T(\Phi r).   (6.31)

We view this as a projected/approximate form of Bellman's equation, and we view a solution \Phi r^* of this equation as an approximation to J_\mu. Note that since \Pi is a linear mapping, this is a linear equation in the vector r. We will give an explicit form for this equation shortly.

We know from Section 1.4 that T is a contraction with respect to the sup-norm, but unfortunately this does not necessarily imply that T is a contraction with respect to the norm \|\cdot\|_v. We will next show an important fact: if v is chosen to be the steady-state probability vector \xi, then T is a contraction with respect to \|\cdot\|_v, with modulus \alpha. The critical part of the proof is addressed in the following lemma.

Lemma 6.3.1: For any n \times n stochastic matrix P that has a steady-state probability vector \xi = (\xi_1, \ldots, \xi_n) with positive components, we have

    \| P z \|_\xi \le \| z \|_\xi,  z \in \Re^n.
Proof: Let p_{ij} be the components of P. For all z \in \Re^n, we have

    \| P z \|_\xi^2 = \sum_{i=1}^n \xi_i \Big( \sum_{j=1}^n p_{ij} z_j \Big)^2
                 \le \sum_{i=1}^n \xi_i \sum_{j=1}^n p_{ij} z_j^2
                   = \sum_{j=1}^n \Big( \sum_{i=1}^n \xi_i p_{ij} \Big) z_j^2
                   = \sum_{j=1}^n \xi_j z_j^2
                   = \| z \|_\xi^2,

where the inequality follows from the convexity of the quadratic function, and the next to last equality follows from the defining property \sum_{i=1}^n \xi_i p_{ij} = \xi_j of the steady-state probabilities. Q.E.D.

We next note an important property of projections: they are nonexpansive, in the sense

    \| \Pi J - \Pi \bar J \|_v \le \| J - \bar J \|_v,  for all J, \bar J \in \Re^n.

To see this, note that

    \| \Pi (J - \bar J) \|_v^2 \le \| \Pi (J - \bar J) \|_v^2 + \| (I - \Pi)(J - \bar J) \|_v^2 = \| J - \bar J \|_v^2,

where the equality above follows from the Pythagorean Theorem:†

    \| J - \bar J \|_v^2 = \| \Pi J - \bar J \|_v^2 + \| J - \Pi J \|_v^2,  for all J \in \Re^n, \bar J \in S.   (6.32)

Thus, for \Pi T to be a contraction with respect to \|\cdot\|_v, it is sufficient that T be a contraction with respect to \|\cdot\|_v, since

    \| \Pi T J - \Pi T \bar J \|_v \le \| T J - T \bar J \|_v \le \beta \| J - \bar J \|_v,

where \beta is the modulus of contraction of T with respect to \|\cdot\|_v (see Fig. 6.3.1). This leads to the following proposition.
Proposition 6.3.1: The mappings T and \Pi T are contractions of modulus \alpha with respect to the weighted Euclidean norm \|\cdot\|_\xi, where \xi is the steady-state probability vector of the Markov chain.
Proof: Using the definition T J = g + \alpha P J [cf. Eq. (6.30)], we have for all J, \bar J \in \Re^n,

    T J - T \bar J = \alpha P (J - \bar J).

We thus obtain

    \| T J - T \bar J \|_\xi = \alpha \| P (J - \bar J) \|_\xi \le \alpha \| J - \bar J \|_\xi,

where the inequality follows from Lemma 6.3.1. Hence T is a contraction of modulus \alpha. The contraction property of \Pi T follows from the contraction property of T and the nonexpansiveness property of \Pi noted earlier. Q.E.D.

† The Pythagorean Theorem follows from the orthogonality of the vectors (J - \Pi J) and (\Pi J - \bar J); cf. Eq. (6.29).

[Figure 6.3.1. Illustration of the contraction property of \Pi T due to the nonexpansiveness of \Pi. If T is a contraction with respect to \|\cdot\|_v, the Euclidean norm used in the projection, then \Pi T is also a contraction with respect to that norm, since \Pi is nonexpansive and we have \| \Pi T J - \Pi T \bar J \|_v \le \| T J - T \bar J \|_v \le \beta \| J - \bar J \|_v, where \beta is the modulus of contraction of T with respect to \|\cdot\|_v.]

The next proposition gives an estimate of the error in estimating J_\mu with the fixed point of \Pi T.

Proposition 6.3.2: Let \Phi r^* be the fixed point of \Pi T. We have

    \| J_\mu - \Phi r^* \|_\xi \le \frac{1}{\sqrt{1 - \alpha^2}} \| J_\mu - \Pi J_\mu \|_\xi.

Proof: We have

    \| J_\mu - \Phi r^* \|_\xi^2 = \| J_\mu - \Pi J_\mu \|_\xi^2 + \| \Pi J_\mu - \Phi r^* \|_\xi^2
                               = \| J_\mu - \Pi J_\mu \|_\xi^2 + \| \Pi T J_\mu - \Pi T(\Phi r^*) \|_\xi^2
                             \le \| J_\mu - \Pi J_\mu \|_\xi^2 + \alpha^2 \| J_\mu - \Phi r^* \|_\xi^2,

where the first equality uses the Pythagorean Theorem [cf. Eq. (6.32)], the second equality holds because J_\mu is the fixed point of T and \Phi r^* is the fixed point of \Pi T, and the inequality uses the contraction property of \Pi T. From this relation, the result follows. Q.E.D.

Note the critical fact in the preceding analysis: \alpha P (and hence T) is a contraction with respect to the projection norm \|\cdot\|_\xi (cf. Lemma 6.3.1). Indeed, Props. 6.3.1 and 6.3.2 hold if T is any (possibly nonlinear) contraction with respect to the Euclidean norm of the projection (cf. Fig. 6.3.1).

The Matrix Form of the Projected Bellman Equation

Let us now write the projected Bellman equation in explicit form. Its solution is the vector J = \Phi r^*, where r^* solves the problem

    \min_{r \in \Re^s} \| \Phi r - (g + \alpha P \Phi r^*) \|_\xi^2.

Setting to 0 the gradient with respect to r of the above quadratic expression, we obtain

    \Phi' \Xi \big( \Phi r^* - (g + \alpha P \Phi r^*) \big) = 0,

where \Xi is the diagonal matrix with the steady-state probabilities \xi_1, \ldots, \xi_n along the diagonal. Note that this equation is just the orthogonality condition (6.29). Thus the projected equation is written as

    C r^* = d,   (6.33)

where

    C = \Phi' \Xi (I - \alpha P) \Phi,  d = \Phi' \Xi g,   (6.34)

and can be solved by matrix inversion:

    r^* = C^{-1} d.

Here the projected equation is similar to the Bellman equation J = (I - \alpha P)^{-1} g, which can also be solved by matrix inversion, just like the Bellman equation. An important difference is that the projected equation has smaller dimension (s rather than n). Still, computing C and d using Eq. (6.34) requires computation of inner products of size n, so for problems where n is very large, the explicit computation of C and d is impractical. We will discuss shortly efficient methods to compute inner products of large size by using simulation and low-dimensional calculations. The idea is that an inner product, appropriately normalized, can be viewed as an expected value (the weighted sum of a large number of terms), which can be computed by sampling its components with an appropriate probability distribution and averaging the samples, as discussed in Section 6.1.
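On a small example the projected equation (6.33)-(6.34) can be formed and solved directly. The chain below is a hypothetical 3-state, 2-feature illustration (the transition matrix, costs, and features are invented for the sketch):

```python
import numpy as np

# Forming and solving the projected equation C r* = d for a hypothetical
# 3-state chain with 2 basis functions.
alpha = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5]])
g = np.array([1.0, 2.0, 0.5])              # expected stage costs g(i)
Phi = np.array([[1., 0.],
                [1., 1.],
                [0., 1.]])

# steady-state probabilities: left eigenvector of P for eigenvalue 1
w, vl = np.linalg.eig(P.T)
xi = np.real(vl[:, np.argmin(np.abs(w - 1.0))])
xi = xi / xi.sum()
Xi = np.diag(xi)

C = Phi.T @ Xi @ (np.eye(3) - alpha * P) @ Phi   # Eq. (6.34)
d = Phi.T @ Xi @ g
r_star = np.linalg.solve(C, d)                   # Eq. (6.33)
```

By construction, \Phi r^* then satisfies the fixed-point relation \Phi r^* = \Pi T(\Phi r^*) with \Pi the \xi-weighted projection.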
6.3.2 Deterministic Iterative Methods

We have noted in Chapter 1 that for problems where n is very large, an iterative method such as value iteration may be appropriate for solving the Bellman equation J = T J. Similarly, one may consider an iterative method for solving the projected Bellman equation \Phi r = \Pi T(\Phi r) or its equivalent version C r = d [cf. Eqs. (6.33)-(6.34)]. The first iterative method that comes to mind is the analog of value iteration: successively apply \Pi T, starting with an arbitrary initial vector \Phi r_0:

    \Phi r_{k+1} = \Pi T(\Phi r_k),  k = 0, 1, \ldots.   (6.35)

We refer to this as projected value iteration (PVI for short). Thus at the typical iteration k, the current iterate \Phi r_k is operated on with T, and the generated value iterate T(\Phi r_k) (which does not necessarily lie in S) is projected onto S, to yield the new iterate \Phi r_{k+1} (see Fig. 6.3.2). Since \Pi T is a contraction (cf. Prop. 6.3.1), it follows that the sequence \{\Phi r_k\} generated by PVI converges to the unique fixed point \Phi r^* of \Pi T.

[Figure 6.3.2. Illustration of the projected value iteration (PVI) method \Phi r_{k+1} = \Pi T(\Phi r_k). At the typical iteration k, the current iterate \Phi r_k is operated on with T, and the generated vector T(\Phi r_k) is projected onto S, to yield the new iterate \Phi r_{k+1}.]

It is possible to write PVI explicitly by noting that

    r_{k+1} = \arg\min_{r \in \Re^s} \| \Phi r - (g + \alpha P \Phi r_k) \|_\xi^2.

By setting to 0 the gradient with respect to r of the above quadratic expression, we obtain

    \Phi' \Xi \big( \Phi r_{k+1} - (g + \alpha P \Phi r_k) \big) = 0,
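A sketch of PVI in its explicit form, r_{k+1} = (\Phi' \Xi \Phi)^{-1} \Phi' \Xi (g + \alpha P \Phi r_k), on a hypothetical toy chain (all data invented for the sketch):

```python
import numpy as np

# Sketch of projected value iteration (PVI) in explicit form; the 3-state
# chain, costs, and features are hypothetical.
alpha = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5]])
g = np.array([1.0, 2.0, 0.5])
Phi = np.array([[1., 0.],
                [1., 1.],
                [0., 1.]])

w, vl = np.linalg.eig(P.T)
xi = np.real(vl[:, np.argmin(np.abs(w - 1.0))])
xi = xi / xi.sum()
Xi = np.diag(xi)

B = np.linalg.inv(Phi.T @ Xi @ Phi)
r = np.zeros(2)
for _ in range(300):
    # r_{k+1} = (Phi' Xi Phi)^{-1} Phi' Xi (g + alpha P Phi r_k)
    r = B @ (Phi.T @ Xi @ (g + alpha * P @ (Phi @ r)))
```

Because \Pi T is a contraction of modulus \alpha, the iterates approach the solution r^* = C^{-1} d geometrically.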
which yields

    r_{k+1} = r_k - (\Phi' \Xi \Phi)^{-1} (C r_k - d),   (6.36)

where C and d are given by Eq. (6.34). Thus in this form of the iteration, the current iterate r_k is corrected by the "residual" C r_k - d (which tends to 0), after "scaling" with the matrix (\Phi' \Xi \Phi)^{-1}. It is also possible to scale C r_k - d with a different/simpler matrix, leading to the iteration

    r_{k+1} = r_k - \gamma G (C r_k - d),   (6.37)

where \gamma is a positive stepsize, and G is some s \times s scaling matrix.† Note that when G is the identity or a diagonal approximation to (\Phi' \Xi \Phi)^{-1}, the iteration (6.37) is simpler than PVI in that it does not require a matrix inversion (it does require, however, the choice of a stepsize \gamma).

The iteration (6.37) converges to the solution of the projected equation if and only if the matrix I - \gamma G C has eigenvalues strictly within the unit circle. This hinges on an important property of the matrix C, which we now define. Let us say that a (possibly nonsymmetric) s \times s matrix M is positive definite if

    r' M r > 0,  \forall r \ne 0.

We say that M is positive semidefinite if

    r' M r \ge 0,  \forall r \in \Re^s.

The following proposition shows that C is positive definite, and that if G is positive definite and symmetric, the iteration (6.37) is convergent for sufficiently small stepsize \gamma.

Proposition 6.3.3: The matrix C of Eq. (6.34) is positive definite. Furthermore, if the s \times s matrix G is symmetric and positive definite, there exists \bar\gamma > 0 such that the eigenvalues of I - \gamma G C lie strictly within the unit circle for all \gamma \in (0, \bar\gamma].

† Iterative methods that involve incremental changes along directions of the form G f(x) are very common for solving a system of equations f(x) = 0. They arise prominently in cases where f(x) is the gradient of a cost function, or has certain monotonicity properties. They also admit extensions to the case where there are constraints on x (see [Ber09] for an analysis that is relevant to the present DP context). Such methods are convergent as long as the stepsize \gamma is small enough to compensate for large components in the matrix G.
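Before turning to the proof, a quick numerical sanity check of these claims (on a hypothetical 3-state chain): the matrix C has a positive definite symmetric part, the spectral radius of I - \gamma G C is below 1 for a modest stepsize, and the iteration (6.37) converges to C^{-1} d.

```python
import numpy as np

# Numerical check of the claims around (6.37): positive definiteness of C
# and convergence of r_{k+1} = r_k - gamma G (C r_k - d) for small gamma.
# The 3-state chain below is a hypothetical example.
alpha = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5]])
g = np.array([1.0, 2.0, 0.5])
Phi = np.array([[1., 0.],
                [1., 1.],
                [0., 1.]])

w, vl = np.linalg.eig(P.T)
xi = np.real(vl[:, np.argmin(np.abs(w - 1.0))])
xi = xi / xi.sum()
Xi = np.diag(xi)
C = Phi.T @ Xi @ (np.eye(3) - alpha * P) @ Phi
d = Phi.T @ Xi @ g

# r'Cr > 0 for all r != 0 iff the symmetric part of C is positive definite
sym_eigs = np.linalg.eigvalsh(0.5 * (C + C.T))

G = np.eye(2)                         # simplest scaling: G = I
gamma = 0.5
rho = max(abs(np.linalg.eigvals(np.eye(2) - gamma * G @ C)))

r = np.zeros(2)
for _ in range(1000):
    r = r - gamma * G @ (C @ r - d)   # iteration (6.37)
```

The stepsize 0.5 happens to work for this toy example; in general \gamma must be tuned as the proposition indicates.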
For the proof we need the following two lemmas.

Lemma 6.3.2: The eigenvalues of a positive definite matrix have positive real parts.

Proof: Let M be a positive definite matrix. Then for sufficiently small \gamma > 0 we have (\gamma/2) r' M' M r < r' M r for all r \ne 0, or equivalently

    \| (I - \gamma M) r \|^2 < \| r \|^2,  \forall r \ne 0,

implying that I - \gamma M is a contraction mapping with respect to the standard Euclidean norm. Hence the eigenvalues of I - \gamma M lie within the unit circle. Since these eigenvalues are 1 - \gamma \lambda, where \lambda are the eigenvalues of M, it follows that if M is positive definite, the eigenvalues of M have positive real parts. Q.E.D.

Lemma 6.3.3: For all r \in \Re^s,

    \| \Pi P \Phi r \|_\xi \le \| P \Phi r \|_\xi \le \| \Phi r \|_\xi.   (6.38)

Proof: By the Pythagorean Theorem,

    \| P \Phi r \|_\xi^2 = \| \Pi P \Phi r \|_\xi^2 + \| (I - \Pi) P \Phi r \|_\xi^2,

from which the first inequality follows; the second inequality follows from Lemma 6.3.1. Q.E.D.

Proof of Prop. 6.3.3: From properties of projections, all vectors of the form \Phi r are orthogonal to all vectors of the form x - \Pi x, i.e.,

    r' \Phi' \Xi (I - \Pi) x = 0,  \forall r \in \Re^s, x \in \Re^n,   (6.39)

[cf. Eq. (6.29)]. Thus, we have for all r \ne 0,

    r' C r = r' \Phi' \Xi (I - \alpha P) \Phi r
           = r' \Phi' \Xi \big( I - \alpha \Pi P + \alpha (\Pi - I) P \big) \Phi r
           = r' \Phi' \Xi (I - \alpha \Pi P) \Phi r
           = \| \Phi r \|_\xi^2 - \alpha r' \Phi' \Xi \Pi P \Phi r
         \ge \| \Phi r \|_\xi^2 - \alpha \| \Phi r \|_\xi \cdot \| \Pi P \Phi r \|_\xi
         \ge (1 - \alpha) \| \Phi r \|_\xi^2
           > 0,

where the third equality follows from Eq. (6.39), the first inequality follows from the Cauchy-Schwarz inequality applied with the inner product <x, y> = x' \Xi y, and the second inequality follows from Eq. (6.38). This proves the positive definiteness of C.

If G is symmetric and positive definite, the matrix G^{1/2} exists and is symmetric and positive definite. Let M = G^{1/2} C G^{1/2}, and note that since C is positive definite, M is also positive definite, so from Lemma 6.3.2 it follows that its eigenvalues have positive real parts. The eigenvalues of M and GC are equal (with eigenvectors that are multiples of G^{1/2} or G^{-1/2} of each other), so the eigenvalues of GC have positive real parts. It follows that the eigenvalues of I - \gamma GC lie strictly within the unit circle for sufficiently small \gamma > 0. Q.E.D.

Note that for the conclusion of Prop. 6.3.3 to hold, it is not necessary that G is symmetric. It is sufficient that GC has eigenvalues with positive real parts. An example is G = C' \Sigma^{-1}, where \Sigma is any positive definite symmetric matrix, in which case GC = C' \Sigma^{-1} C is a positive definite matrix. Another example, important for our purposes, is

    G = (C' \Sigma^{-1} C + \beta I)^{-1} C' \Sigma^{-1},   (6.40)

where \Sigma is any positive definite symmetric matrix, and \beta is a positive scalar. Then GC is given by

    GC = (C' \Sigma^{-1} C + \beta I)^{-1} C' \Sigma^{-1} C,

and can be shown to have real eigenvalues that lie in the interval (0, 1),† even if C is not positive definite. As a result I - \gamma GC has real eigenvalues in the interval (0, 1) for any \gamma \in (0, 2].

The iteration

    r_{k+1} = r_k - G (C r_k - d),

where G is given by Eq. (6.40), is the so-called proximal point algorithm applied to the problem of minimizing (Cr - d)' \Sigma^{-1} (Cr - d) over r. From known results about this algorithm (Martinet [Mar70] and Rockafellar [Roc76]) it follows that the iteration will converge to a minimizing point of (Cr - d)' \Sigma^{-1} (Cr - d). Thus it will converge to some solution of the projected equation Cr = d, even if there exist many solutions (as in the case where \Phi does not have rank s).

† To see this, let \lambda_1, \ldots, \lambda_s be the eigenvalues of C' \Sigma^{-1} C and let U \Lambda U' be its singular value decomposition, where \Lambda = diag\{\lambda_1, \ldots, \lambda_s\} and U is a unitary matrix (U U' = I; see [Str09], [TrB97]). We also have C' \Sigma^{-1} C + \beta I = U (\Lambda + \beta I) U', so

    GC = \big( U (\Lambda + \beta I) U' \big)^{-1} U \Lambda U' = U (\Lambda + \beta I)^{-1} \Lambda U'.

It follows that the eigenvalues of GC are \lambda_i / (\lambda_i + \beta), i = 1, \ldots, s, and lie in the interval (0, 1).

Generally, while PVI and its scaled version (6.37) are conceptually important, they are not practical algorithms for problems
where n is very large.

6.3.3 Simulation-Based Methods

We will now consider approximate versions of the methods for solving the projected equation, which involve simulation and low-dimensional calculations. The reason is that the vector T(\Phi r_k) is n-dimensional and its calculation is prohibitive for the type of large problems that we aim to address. Furthermore, even if T(\Phi r_k) were calculated, its projection on S requires knowledge of the steady-state probabilities \xi_1, \ldots, \xi_n, which are generally unknown. Fortunately, both of these difficulties can be dealt with through the use of simulation, as we now discuss.

The idea is very simple: we use simulation to form a matrix C_k that approximates

    C = \Phi' \Xi (I - \alpha P) \Phi,

and a vector d_k that approximates

    d = \Phi' \Xi g.

We then approximate the solution C^{-1} d of the projected equation with C_k^{-1} d_k, or we approximate the term (C r_k - d) in the PVI iteration (6.36) [or its scaled version (6.37)] with (C_k r_k - d_k).

The simulation can be done as follows: we generate an infinitely long trajectory (i_0, i_1, \ldots) of the Markov chain, starting from an arbitrary state i_0. After generating state i_t, we compute the corresponding row \phi(i_t)' of \Phi, and after generating the transition (i_t, i_{t+1}), we compute the corresponding cost component g(i_t, i_{t+1}). After collecting k+1 samples (k = 0, 1, \ldots), we form

    C_k = \frac{1}{k+1} \sum_{t=0}^k \phi(i_t) \big( \phi(i_t) - \alpha \phi(i_{t+1}) \big)',   (6.41)

and

    d_k = \frac{1}{k+1} \sum_{t=0}^k \phi(i_t) g(i_t, i_{t+1}),   (6.42)

where \phi(i)' denotes the i-th row of \Phi. It can be proved using simple law of large numbers arguments that C_k \to C and d_k \to d with probability 1. To show this, we use the expression

    \Phi' = [\, \phi(1) \ \cdots \ \phi(n) \,]

to write C explicitly as

    C = \Phi' \Xi (I - \alpha P) \Phi = \sum_{i=1}^n \xi_i \, \phi(i) \Big( \phi(i) - \alpha \sum_{j=1}^n p_{ij} \phi(j) \Big)',   (6.43)

and we rewrite C_k in a form that matches the preceding expression, except that the probabilities \xi_i and p_{ij} are replaced by corresponding empirical
frequencies produced by the simulation. Indeed, by denoting \delta(\cdot) the indicator function [\delta(E) = 1 if the event E has occurred and \delta(E) = 0 otherwise], we have

    C_k = \sum_{i=1}^n \sum_{j=1}^n \frac{\sum_{t=0}^k \delta(i_t = i, i_{t+1} = j)}{k+1} \, \phi(i) \big( \phi(i) - \alpha \phi(j) \big)',

and finally

    C_k = \sum_{i=1}^n \hat\xi_{i,k} \, \phi(i) \Big( \phi(i) - \alpha \sum_{j=1}^n \hat p_{ij,k} \, \phi(j) \Big)',

where

    \hat\xi_{i,k} = \frac{\sum_{t=0}^k \delta(i_t = i)}{k+1},  \hat p_{ij,k} = \frac{\sum_{t=0}^k \delta(i_t = i, i_{t+1} = j)}{\sum_{t=0}^k \delta(i_t = i)}.   (6.44)

Here, \hat\xi_{i,k} and \hat p_{ij,k} are the fractions of time that state i, or transition (i, j), has occurred within (i_0, \ldots, i_k), the initial (k+1)-state portion of the simulated trajectory. Since the empirical frequencies \hat\xi_{i,k} and \hat p_{ij,k} asymptotically converge (with probability 1) to the probabilities \xi_i and p_{ij}, we have with probability 1,

    C_k \to \sum_{i=1}^n \xi_i \, \phi(i) \Big( \phi(i) - \alpha \sum_{j=1}^n p_{ij} \phi(j) \Big)' = \Phi' \Xi (I - \alpha P) \Phi = C,

[cf. Eq. (6.43)]. Similarly, we can write

    d_k = \sum_{i=1}^n \hat\xi_{i,k} \, \phi(i) \sum_{j=1}^n \hat p_{ij,k} \, g(i, j),

and we have

    d_k \to \sum_{i=1}^n \xi_i \, \phi(i) \sum_{j=1}^n p_{ij} \, g(i, j) = \Phi' \Xi g = d.

Note that C_k and d_k can be updated recursively as new samples \phi(i_k) and g(i_k, i_{k+1}) are generated. In particular, we have

    C_k = \frac{1}{k+1} \bar C_k,  d_k = \frac{1}{k+1} \bar d_k,

where \bar C_k and \bar d_k are updated by

    \bar C_k = \bar C_{k-1} + \phi(i_k) \big( \phi(i_k) - \alpha \phi(i_{k+1}) \big)',  \bar d_k = \bar d_{k-1} + \phi(i_k) g(i_k, i_{k+1}).
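The estimates (6.41)-(6.42), accumulated recursively along a simulated trajectory, can be sketched as follows. The chain, costs, and features are hypothetical; with a long enough trajectory, C_k and d_k approach C and d:

```python
import numpy as np

# Simulation-based estimates (6.41)-(6.42), accumulated recursively as in the
# text; the chain, costs, and features are hypothetical.
rng = np.random.default_rng(0)
alpha = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5]])
g = np.array([1.0, 2.0, 0.5])              # here the cost depends only on i
Phi = np.array([[1., 0.],
                [1., 1.],
                [0., 1.]])

K = 20000
C_bar = np.zeros((2, 2))
d_bar = np.zeros(2)
i = 0
for t in range(K):
    j = rng.choice(3, p=P[i])              # simulate one transition
    C_bar += np.outer(Phi[i], Phi[i] - alpha * Phi[j])
    d_bar += Phi[i] * g[i]
    i = j
C_k, d_k = C_bar / K, d_bar / K
r_k = np.linalg.solve(C_k, d_k)            # simulation-based approximation of C^{-1} d
```

Only s-dimensional quantities are stored; the n-dimensional objects \Xi and P never appear in the accumulation, which is the point of the simulation-based approach.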
6.3.4 LSTD, LSPE, and TD(0) Methods

Given the simulation-based approximations C_k and d_k, one possibility is to construct a simulation-based approximate solution

    \hat r_k = C_k^{-1} d_k.   (6.45)

The established name for this method is LSTD (least squares temporal differences) method. Despite the dependence on the index k, this is not an iterative method, since we do not need \hat r_{k-1} to compute \hat r_k. Rather it may be viewed as an approximate matrix inversion approach: we replace the projected equation Cr = d with the approximation C_k r = d_k, and solve the approximate equation by matrix inversion. Note that by using Eqs. (6.41) and (6.42), the equation C_k r = d_k can be written as

    \sum_{t=0}^k \phi(i_t) \, q_{k,t} = 0,   (6.46)

where

    q_{k,t} = \phi(i_t)' r_k - \alpha \phi(i_{t+1})' r_k - g(i_t, i_{t+1}).   (6.47)

The scalar q_{k,t} is the so-called temporal difference, associated with r_k and transition (i_t, i_{t+1}). It may be viewed as a sample of a residual term arising in the projected Bellman's equation. More specifically, from Eqs. (6.33) and (6.34), we have

    C r_k - d = \Phi' \Xi (\Phi r_k - \alpha P \Phi r_k - g).   (6.48)

The three terms in the definition (6.47) of the temporal difference q_{k,t} can be viewed as samples [associated with the transition (i_t, i_{t+1})] of the corresponding three terms in the expression \Xi (\Phi r_k - \alpha P \Phi r_k - g) in Eq. (6.48).

Regression-Based LSTD

A potential difficulty in LSTD arises if the matrices C and C_k are nearly singular (e.g., when the discount factor is close to 1), because then the simulation-induced error

    \hat r_k - r^* = C_k^{-1} d_k - C^{-1} d

may be greatly amplified. This is similar to a well-known difficulty in the solution of nearly singular systems of linear equations: the solution is strongly affected by small changes in the problem data, due for example to roundoff error.
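This amplification effect can be illustrated numerically in the scalar case: estimating 1/c from an N-sample estimate c + \epsilon of a small number c (the numbers below are hypothetical; cf. Example 6.3.1 below):

```python
import numpy as np

# Scalar illustration (hypothetical numbers): the relative error of 1/(c+eps)
# versus 1/c is about eps/c, and eps shrinks like 1/sqrt(N) for N samples.
rng = np.random.default_rng(3)
c = 0.05
rel_errs = {}
for N in (100, 10_000, 1_000_000):
    eps = rng.standard_normal(N).mean()    # sample-mean noise, variance 1/N
    rel_errs[N] = abs(1.0 / (c + eps) - 1.0 / c) * c
```

With N = 100 the noise standard deviation (0.1) exceeds c itself, so the inverse estimate can be wildly off; reliability requires N much larger than 1/c^2.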
Example 6.3.1: To get a rough sense of the effect of the simulation error in LSTD, consider the approximate inversion of a small nonzero number c, which is estimated with simulation error \epsilon. The absolute and relative errors are

    E = \frac{1}{c + \epsilon} - \frac{1}{c},  E_r = \frac{E}{1/c}.

By a first order Taylor series expansion around \epsilon = 0, we obtain for small \epsilon

    E \approx \frac{\partial \big( 1/(c+\epsilon) \big)}{\partial \epsilon}\Big|_{\epsilon=0} \, \epsilon = -\frac{\epsilon}{c^2},  E_r \approx -\frac{\epsilon}{c}.

Thus for the estimate 1/(c+\epsilon) to be reliable, we must have |\epsilon| \ll |c|. If N independent samples are used to estimate c, the variance of \epsilon is proportional to 1/N, so for a small relative error, N must be much larger than 1/c^2. Thus as c approaches 0, the amount of sampling required for reliable simulation-based inversion increases very fast.

To counter the large errors associated with a near-singular matrix C, an effective remedy is to estimate r^* by a form of regularized regression, which works even if C_k is singular, at the expense of a systematic/deterministic error (a "bias") in the generated estimate. In this approach, instead of solving the system C_k r = d_k, we use a least-squares fit of a linear model that properly encodes the effect of the simulation noise. We write the projected form of Bellman's equation d = Cr as

    d_k = C_k r + e_k,   (6.49)

where e_k is the vector

    e_k = (C - C_k) r + d_k - d,

which we view as "simulation noise." We then estimate the solution r^* based on Eq. (6.49) by using regression. In particular, we choose r by solving the least squares problem

    \min_r \; (d_k - C_k r)' \Sigma^{-1} (d_k - C_k r) + \beta \| r - \bar r \|^2,   (6.50)

where \bar r is an a priori estimate of r^*, \Sigma is some positive definite symmetric matrix, and \beta is a positive scalar. By setting to 0 the gradient of the least squares objective in Eq. (6.50), we can find the solution in closed form:

    \hat r_k = (C_k' \Sigma^{-1} C_k + \beta I)^{-1} (C_k' \Sigma^{-1} d_k + \beta \bar r).   (6.51)

A suitable choice of \bar r may be some heuristic guess based on intuition about the problem, or it may be the parameter vector corresponding to the estimated cost vector \Phi \bar r of a similar policy (for example a preceding policy
in an approximate policy iteration context). The quadratic \beta \| r - \bar r \|^2 in Eq. (6.50) is known as a regularization term, and has the effect of "biasing" the estimate \hat r_k towards the a priori guess \bar r. The proper size of \beta is not clear (a large size reduces the effect of near singularity of C_k, but may also cause a large "bias"). However, this is typically not a major difficulty in practice, because trial-and-error experimentation with different values of \beta involves low-dimensional linear algebra calculations once C_k and d_k become available. One may try to choose \Sigma in special ways to enhance the quality of the estimate of r^*, but we will not consider this issue here; the subsequent analysis in this section does not depend on the choice of \Sigma, as long as it is positive definite and symmetric.

We will now derive an estimate for the error \hat r_k - r^*, where r^* = C^{-1} d is the solution of the projected equation. Let us denote

    b_k = \Sigma^{-1/2} (d_k - C_k r^*),

so that from Eq. (6.51),

    \hat r_k - r^* = (C_k' \Sigma^{-1} C_k + \beta I)^{-1} \big( C_k' \Sigma^{-1/2} b_k + \beta (\bar r - r^*) \big).   (6.52)

We have the following proposition.

Proposition 6.3.4: We have

    \| \hat r_k - r^* \| \le \max_{i=1,\ldots,s} \frac{\lambda_i}{\lambda_i^2 + \beta} \, \| b_k \| + \max_{i=1,\ldots,s} \frac{\beta}{\lambda_i^2 + \beta} \, \| \bar r - r^* \|,   (6.53)

where \lambda_1, \ldots, \lambda_s are the singular values of \Sigma^{-1/2} C_k.

Proof: Let \Sigma^{-1/2} C_k = U \Lambda V' be the singular value decomposition of \Sigma^{-1/2} C_k, where \Lambda = diag\{\lambda_1, \ldots, \lambda_s\}, and U, V are unitary matrices (U U' = V V' = I and \|U\| = \|U'\| = \|V\| = \|V'\| = 1; see [Str09], [TrB97]). Then, Eq. (6.52) yields

    \hat r_k - r^* = (V \Lambda U' U \Lambda V' + \beta I)^{-1} \big( V \Lambda U' b_k + \beta (\bar r - r^*) \big)
                   = (V')^{-1} (\Lambda^2 + \beta I)^{-1} V^{-1} \big( V \Lambda U' b_k + \beta (\bar r - r^*) \big)
                   = V (\Lambda^2 + \beta I)^{-1} \Lambda U' b_k + \beta V (\Lambda^2 + \beta I)^{-1} V' (\bar r - r^*).

Therefore, using the triangle inequality, we have

    \| \hat r_k - r^* \| \le \|V\| \max_{i=1,\ldots,s} \frac{\lambda_i}{\lambda_i^2 + \beta} \, \|U'\| \, \|b_k\| + \beta \|V\| \max_{i=1,\ldots,s} \frac{1}{\lambda_i^2 + \beta} \, \|V'\| \, \| \bar r - r^* \|
                         = \max_{i=1,\ldots,s} \frac{\lambda_i}{\lambda_i^2 + \beta} \, \|b_k\| + \max_{i=1,\ldots,s} \frac{\beta}{\lambda_i^2 + \beta} \, \| \bar r - r^* \|.

Q.E.D.

From Eq. (6.53), we see that the error \hat r_k - r^* is bounded by the sum of two terms. The first term can be made arbitrarily small by using a sufficiently large number of samples, thereby making \| b_k \| small. The second term reflects the bias introduced by the regularization and diminishes with \beta, but it cannot be made arbitrarily small by using more samples (see Fig. 6.3.3).

[Figure 6.3.3. Illustration of Prop. 6.3.4. The figure shows the estimates \hat r_k = (C_k' \Sigma^{-1} C_k + \beta I)^{-1} (C_k' \Sigma^{-1} d_k + \beta \bar r) corresponding to a finite number of samples, and the exact values r^*_\beta = (C' \Sigma^{-1} C + \beta I)^{-1} (C' \Sigma^{-1} d + \beta \bar r) corresponding to an infinite number of samples. We may view \hat r_k - r^* as the sum of a "simulation error" \hat r_k - r^*_\beta, whose norm is bounded by the first term in the estimate (6.53) and can be made arbitrarily small by sufficiently long sampling, and a "regularization error" r^*_\beta - r^*, whose norm is bounded by the second term in the right-hand side of Eq. (6.53).]

Now consider the case where \beta = 0, \Sigma is the identity, and C_k is invertible. Then \hat r_k is the LSTD solution C_k^{-1} d_k, and the proof of Prop. 6.3.4 can be replicated to show that

    \| \hat r_k - r^* \| \le \max_{i=1,\ldots,s} \frac{1}{\lambda_i} \, \| b_k \|,

where \lambda_1, \ldots, \lambda_s are the (positive) singular values of C_k.
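The regularized estimate (6.51) is a one-line computation once C_k and d_k are available. In the sketch below, C, d, the noise level, and \beta are all hypothetical, chosen only to show the mechanics:

```python
import numpy as np

# Sketch of the regression-based LSTD estimate (6.51); all numbers hypothetical.
rng = np.random.default_rng(1)
C = np.array([[0.21, -0.08],
              [0.03,  0.23]])
d = np.array([1.0, 0.78])
r_star = np.linalg.solve(C, d)                 # exact solution of C r = d

C_k = C + 0.01 * rng.standard_normal((2, 2))   # simulated estimates with noise
d_k = d + 0.01 * rng.standard_normal(2)

Sigma_inv = np.eye(2)       # Sigma = I for simplicity
beta = 1e-3
r_bar = np.zeros(2)         # a priori guess

A = C_k.T @ Sigma_inv @ C_k + beta * np.eye(2)
b = C_k.T @ Sigma_inv @ d_k + beta * r_bar
r_hat = np.linalg.solve(A, b)                  # Eq. (6.51)
```

Note that A = C_k' \Sigma^{-1} C_k + \beta I is positive definite for any \beta > 0, so the estimate is well defined even when C_k itself is singular.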
This suggests that without regularization, the LSTD error can be adversely affected by near singularity of the matrix C_k (smallest \lambda_i close to 0). Thus we expect that for a nearly singular matrix C, a very large number of samples are necessary to attain a small error (\hat r_k - r^*), consistent with the scalar inversion example we gave earlier. Generally, the regularization of LSTD alleviates the effects of near singularity of C and simulation error, but it comes at a price: there is a bias of the estimate \hat r_k towards the prior guess \bar r (cf. Fig. 6.3.3). One possibility to eliminate this bias is to adopt an iterative regularization approach: start with some \bar r, obtain \hat r_k, replace \bar r by \hat r_k, and repeat for any number of times. This turns LSTD into an iterative method, which will be shown to be a special case of LSPE, the next method to be discussed.

Note that, generally, C is nearly singular if \Phi has nearly dependent columns, with serious difficulties resulting, although in this case it is possible that the error (r_k - r^*) is large but the error \Phi (r_k - r^*) is not.

LSPE Method

An alternative to LSTD is to use a true iterative method to solve the projected equation Cr = d using simulation-based approximations to C and d. One possibility is to approximate the scaled PVI iteration

    r_{k+1} = r_k - \gamma G (C r_k - d)   (6.54)

[cf. Eq. (6.37)] with

    r_{k+1} = r_k - \gamma \hat G (\hat C r_k - \hat d),   (6.55)

where \hat C and \hat d are simulation-based estimates of C and d, \gamma is a positive stepsize, and \hat G is an s \times s matrix, which may also be obtained by simulation. Assuming that I - \gamma \hat G \hat C is a contraction, this iteration will yield a solution to the system \hat C r = \hat d, which will serve as a simulation-based approximation to a solution of the projected equation Cr = d.

Like LSTD, this may be viewed as a batch simulation approach: we first simulate to obtain \hat C, \hat d, and \hat G, and then solve the system \hat C r = \hat d by the iteration (6.55) rather than direct matrix inversion. An alternative is to iteratively update r as simulation samples are collected and used to form ever improving approximations to C and d. In particular, one or more iterations of the form (6.55) may be performed after collecting a few additional simulation samples that are used to improve the approximations of the current \hat C and \hat d. In the most extreme type of such an algorithm, the iteration (6.55) is used after a single new sample is collected. This algorithm has the form

    r_{k+1} = r_k - \gamma G_k (C_k r_k - d_k),   (6.56)
where G_k is an s \times s matrix, \gamma is a positive stepsize, and C_k and d_k are given by Eqs. (6.41)-(6.42). This iteration converges to r^*, provided C_k \to C, d_k \to d, and G_k \to G, where G and \gamma are such that I - \gamma G C is a contraction [this is fairly evident in view of the convergence of the iteration (6.54); see also the paper [Ber09]]. To ensure that I - \gamma G C is a contraction for small \gamma, we may choose G to be symmetric and positive definite, or to have a special form, such as

    G = (C' \Sigma^{-1} C + \beta I)^{-1} C' \Sigma^{-1},

where \Sigma is any positive definite symmetric matrix and \beta is a positive scalar [cf. Eq. (6.40)].

The iteration (6.56) may also be written in terms of temporal differences as

    r_{k+1} = r_k - \frac{\gamma}{k+1} G_k \sum_{t=0}^k \phi(i_t) \, q_{k,t}   (6.57)

[cf. Eqs. (6.41), (6.42), (6.47)]. This iteration is known as LSPE (least squares policy evaluation); it is historically the first method of this type. Its convergence behavior is satisfactory, and it has the advantage that it allows the use of the known stepsize value \gamma = 1, as we will discuss shortly. For the purposes of further discussion, we will focus on this algorithm, with the understanding that there are related versions that use (partial) batch simulation and have similar properties. Note that the iteration (6.56) itself may be executed in partial batch mode, after collecting multiple samples between iterations.
Regarding the choices of γ and Gk .t k+1 t=0 k (6. This will save some computation and will not aﬀect the asymptotic convergence rate of the method. such as G = (C ′ Σ−1 C + βI)−1 C ′ Σ−1 .56) a new estimate of G = (Φ′ ΞΦ)−1 periodically.Sec.
the asymptotic convergence rate of the simulationbased iteration (6. (6. A simple possibility is to use a diagonal matrix Gk .56) takes the form ′ ′ rk+1 = rk − γ(Ck Σ−1 Ck + βI)−1 Ck Σ−1 (Ck rk − dk ). thereby simplifying the matrix inversion in the iteration (6.59). as long as I − γGC is a contraction (although the shortterm convergence rate may be signiﬁcantly aﬀected by the choices of γ and G). This iteration is convergent to r∗ provided that {Σ−1 } is bounded k [γ = 1 is within the range of stepsizes for which I − γGC is a contraction. k k (6. see the discussion following Eq.40)].60) where Σk is some positive deﬁnite symmetric matrix. 6 Note also that even if Gk is updated at every k using Eqs. although in this case.58) and (6.51). The reason is that the scaled PVI iteration (6.58) we have −1 Gk = k 1 G−1 + φ(ik )φ(ik )′ . it can be written as ′ ′ rk+1 = (Ck Σ−1 Ck + βI)−1 (Ck Σ−1 dk + βrk ). Then it is reasonable to expect that a stepsize γ close to 1 will often lead to I − γGC being a contraction. One possible choice is a diagonal approximation to Φ′ ΞΦ. k−1 k+1 k+1 Another choice of Gk is ′ ′ Gk = (Ck Σ−1 Ck + βI)−1 Ck Σ−1 .54).59).56) does not depend on the choices of γ and G. Convergence Rate of LSPE – Comparison with LSTD Let us now discuss the choice of γ and G from the convergence rate point of view. k k and for γ = 1. which is fast relative to the . (6. Indeed it can be proved that the iteration (6. The simplest possibility is to just choose Gk to be the identity. however. where the prior guess r is replaced by the previous iterate ¯ rk . from Eq. k k (6. the updating can be done recursively. regardless of the choices of γ and G.54) has a linear convergence rate (since it involves a contraction). thereby facilitating the choice of γ.61) We recognize this as an iterative version of the regressionbased LSTD method (6. some experimentation is needed to ﬁnd a proper value of γ such that I − γC is a contraction. 
obtained by discarding the oﬀdiagonal terms of the matrix (6.56). Surprisingly. It can be easily veriﬁed with simple examples that the values of γ and G aﬀect signiﬁcantly the convergence rate of the deterministic scaled PVI iteration (6. and β is a positive scalar. Then the iteration (6.372 Approximate Dynamic Programming Chap.56) converges at the same rate asymptotically. for example. (6.58) or (6.
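To make the preceding discussion concrete, the following sketch implements the scaled LSPE iteration rk+1 = rk − γGk(Ck rk − dk), with Ck and dk formed as sample averages along a single simulated trajectory and Gk the regularized estimate of (Φ′ΞΦ)−1 of Eq. (6.59). The chain, costs, features, and all names (`lspe`, `warmup`, etc.) are our own illustrative choices, not part of the text; in particular, the warm-up period during which r is not updated is a practical safeguard while the early estimates Ck and dk are still noisy.

```python
import numpy as np

def lspe(P, g, Phi, alpha, gamma=1.0, beta=1e-3, num_steps=20000, warmup=50, seed=0):
    """Scaled LSPE sketch: r_{k+1} = r_k - gamma * G_k (C_k r_k - d_k), with
    C_k, d_k the sample averages of the simulated transitions [cf. Eqs.
    (6.41)-(6.42)] and G_k the regularized estimate of (Phi' Xi Phi)^{-1}
    of Eq. (6.59), all built along a single trajectory of the chain P."""
    rng = np.random.default_rng(seed)
    n, s = Phi.shape
    C_sum, d_sum, S_sum = np.zeros((s, s)), np.zeros(s), np.zeros((s, s))
    r = np.zeros(s)
    Ck, dk = np.eye(s), np.zeros(s)
    i = 0
    for k in range(num_steps):
        j = rng.choice(n, p=P[i])                       # simulate one transition
        phi_i, phi_j = Phi[i], Phi[j]
        C_sum += np.outer(phi_i, phi_i - alpha * phi_j)
        d_sum += phi_i * g[i, j]
        S_sum += np.outer(phi_i, phi_i)
        Ck, dk = C_sum / (k + 1), d_sum / (k + 1)
        if k >= warmup:                                 # let C_k, d_k settle first
            Gk = np.linalg.inv((beta * np.eye(s) + S_sum) / (k + 1))
            r = r - gamma * Gk @ (Ck @ r - dk)
        i = j
    return r, np.linalg.solve(Ck, dk)                   # LSPE iterate, LSTD estimate

# Illustrative 3-state chain with uniform steady-state distribution
P = np.full((3, 3), 0.25) + 0.25 * np.eye(3)
g = np.repeat(np.array([[1.0], [2.0], [3.0]]), 3, axis=1)   # cost depends on start state
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
r_lspe, r_lstd = lspe(P, g, Phi, alpha=0.9)
```

On such small examples one can observe the "tracking" property discussed in the text: the returned LSPE iterate stays much closer to the LSTD estimate Ck−1 dk than either is to the limit r∗.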
t . It follows that the sequence rk generated −1 by the scaled LSPE iteration (6. t=0 (6. Ch. and dk change.47) and (6. and moreover requires that the stepsize γk diminishes to 0 in order to deal with the nondiminishing noise that is inherent in the term φ(ik )qk. This equation can be written as Φ′ Ξ(Φr − AΦr − b) = 0. Like LSTD and LSPE.Sec. given by: rk+1 = rk − γ(Ck rk − dk ) = rk − γ k+1 k φ(it )qk. and dk . Ck . for large k. This limit is Ck dk .} of the Markov chain. essentially. but at each iteration. i1 . Ck . rk “sees Gk .62).56) operates on two time scales (see. and dk . Ck . there is convergence in the fast time scale before there is appreciable change in the slow time scale.62).63) While LSPE uses as direction of change a timeaverage approximation of Crk − d based on all the available samples.56) “tracks” the sequence Ck dk generated by the LSTD iteration in the sense that −1 rk − Ck dk << rk − r∗ .56) with Gk . . .k in TD(0) is a sample of the lefthand side Φ′ Ξ(Φr − AΦr − b) of the equation. It has the form rk+1 = rk − γk φ(ik )qk.62) where γk is a stepsize sequence that diminishes to 0. Thus the simulationbased iteration (6. e. it generates an inﬁnitely long trajectory {i0 . (6. and the fast time scale at which rk adapts to changes in Gk . As a result.k and multiplying it with φ(ik ).57) with Gk = I. TD(0) Method This is an iterative method for solving the projected equation Cr = d. Ck . Let us note a similarity between TD(0) and the scaled LSPE method (6. Roughly speaking. (6.g. (6. It may be viewed as an instance of a classical stochastic approximation/RobbinsMonro scheme for solving the projected equation Cr = d. Ck .. and dk as eﬀectively constant. and by using Eqs. it can be seen that the direction of change φ(ik )qk. it uses only one sample. the last one. rk is essentially equal to the corresponding limit of iteration (6. .k of Eq. TD(0) uses a single sample approximation.k .” so that for large k. 
and dk −1 held ﬁxed.3 Projected Equation Methods 373 slow convergence rate of the simulationgenerated Gk . On the other hand. Borkar [Bor08]. 6. rather than updating the s × s matrix Ck and . independent of the choice of γ and the scaling matrix G that is approximated by Gk (see also [Ber09] for further discussion). It is thus not surprising that TD(0) is a much slower algorithm than LSPE. TD(0) requires much less overhead per iteration: calculating the single temporal diﬀerence qk. 6): the slow time scale at which Gk .
particularly when Ck is nearly singular. Thus when s.374 Approximate Dynamic Programming Chap.2). (6.1. we update rk using the iteration rk+1 = rk − γGk (Ck rk − dk ).66) . so that an accurate approximation of C and d are obtained. TD(0) may oﬀer a signiﬁcant overhead advantage over LSTD and LSPE. the number of features. it+1 ). (6.45)]. this is the policy µk+1 whose controls are generated by n µk+1 (i) = arg min u∈U(i) pij (u) g(i. Section 6. (6. Following the state transition (ik . ik+1 ). (6. Eq. 6 multiplying it with rk .42)]. (6. Eq. Unfortunately.65) where Ck and dk are given by Eqs.41).56)]. The optimistic version of (scaled) LSPE is based on similar ideas. corresponding to simulated transitions (it . For example Gk could be a positive deﬁnite symmetric matrix [such as for example the one given by Eq.41).5 Optimistic Versions In the LSTD and LSPE methods discussed so far. We ﬁnally note a scaled version of TD(0) given by rk+1 = rk − γk Gk φ(ik )qk.64) where Gk is a positive deﬁnite symmetric scaling matrix. (6. (6. (6. There are also optimistic versions (cf. u. where the policy µ is replaced by an “improved” policy µ after only a certain number of simulation samples have been processed. is very large.3. By this we mean that Ck and dk are time averages of the matrices and vectors φ(it ) φ(it ) − αφ(it+1 ) . the underlying assumption is that each policy is evaluated with a very large number of samples. and Gk is a scaling matrix that converges to some G for which I −γGC is a contraction.58)] or the matrix ′ ′ Gk = (Ck Σ−1 Ck + βI)−1 Ck Σ−1 k k (6.42) [cf. (6. ′ φ(it )g(it . this method requires the collection of many samples between policy updates. Eqs. It is a scaled (by the matrix Gk ) version of TD(0). −1 A natural form of optimistic LSTD is rk+1 = Ck dk . as it is susceptible to simulation noise in Ck and dk . j) + αφ(j)′ rk ˆ j=1 [cf. selected to speed up convergence. it+1 ) that are generated using µk+1 [cf. 
where Ck and ˆ dk are obtained by averaging samples collected using the controls corresponding to the (approximately) improved policy. so it may be viewed as a type of scaled stochastic approximation method. 6.k .
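As an illustration of how little work TD(0) does per transition, the sketch below implements the iteration rk+1 = rk − γk φ(ik) qk,k of Eqs. (6.62) and (6.84) on the same kind of small chain. The chain, the particular stepsize rule γk = 50/(100 + k), and the iteration count are our own illustrative choices; in practice the stepsize schedule typically requires tuning.

```python
import numpy as np

def td0(P, g, Phi, alpha, num_steps=100000, seed=0):
    """TD(0) sketch [cf. Eqs. (6.62), (6.84)]: a Robbins-Monro iteration
    r_{k+1} = r_k - gamma_k * phi(i_k) * q_{k,k}, where the temporal
    difference is q_{k,k} = phi(i_k)'r_k - alpha*phi(i_{k+1})'r_k - g(i_k, i_{k+1})."""
    rng = np.random.default_rng(seed)
    n, s = Phi.shape
    r = np.zeros(s)
    i = 0
    for k in range(num_steps):
        j = rng.choice(n, p=P[i])
        q = Phi[i] @ r - alpha * (Phi[j] @ r) - g[i, j]   # temporal difference
        gamma_k = 50.0 / (100.0 + k)                      # diminishing stepsize
        r = r - gamma_k * Phi[i] * q
        i = j
    return r

# Same illustrative chain as before; the exact projected-equation solution
# r* = C^{-1} d can be computed directly since xi is uniform here.
P = np.full((3, 3), 0.25) + 0.25 * np.eye(3)
g = np.repeat(np.array([[1.0], [2.0], [3.0]]), 3, axis=1)
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Xi = np.eye(3) / 3.0
C = Phi.T @ Xi @ (np.eye(3) - 0.9 * P) @ Phi
d = Phi.T @ Xi @ np.array([1.0, 2.0, 3.0])
r_star = np.linalg.solve(C, d)
r_td = td0(P, g, Phi, alpha=0.9)
```

Note that, per iteration, the method touches only the s-vector r and one feature vector; no s×s matrix is formed or inverted, which is the overhead advantage discussed in the text, paid for by a much slower convergence rate.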
j) + αφ(j)′ rk+1 . Section 5.51) is given by Eq. One may argue that mixing samples from several past policies may have a beneﬁcial exploration eﬀect. one may occasionally replace the control uk+1 by a control selected at random from the constraint set U (ik+1 ). we generate the next transition (ik+1 . a substantial number of samples may need to be collected with the same policy before switching policies. ˆ r k k (6. to enhance exploration. while giving larger weight to the samples of the current policy. and is the special case of LSPE. which then brings it very close to LSPE. To improve the reliability of the optimistic LSTD method it seems necessary to turn it into an iterative method.3 Projected Equation Methods 375 [cf. As an alternative. Jung and Polani [JuP07]. 6. There is also a similar optimistic version of TD(0).Sec. see Bertsekas and Ioﬀe [BeI96]. for γ = 1 the method takes the form ′ ′ rk+1 = (Ck Σ−1 Ck + βI)−1 (Ck Σ−1 dk + βˆk ). in optimistic LSTD and LSPE.3.58). using samples from several past policies. for which γ = 1 guarantees convergence in the nonoptimistic version]. In the extreme case of a single sample between policies. (6. The simulated transitions are generated using a policy that is updated every few samples. In particular. The complexities introduced by these variations are not fully understood at present. an iterative version of the regressionbased LSTD method (6.4. but the associated convergence behavior is complex and not well understood at present.4 of Bertsekas and Tsitsiklis [BeT96] provides an analysis of optimistic policy . however.67) [cf. Eq. u.6 Policy Oscillations – Chattering As noted earlier. In the latter case.60)]. it may be essential to experiment with various values of the stepsize γ [this is true even if Gk is chosen according to Eq. As in Section 6. (6. [BED09].61)]. Still. and Busoniu et al. similar to other versions of policy iteration. 
the matrix Σk should be comparable to the covariance matrix of the vector Cr∗ − d and should be computed using some heuristic scheme.67). For experimental investigations of optimistic policy iteration. one may consider building up Ck and dk as weighted averages. 6. Eq. j=1 Because the theoretical convergence guarantees of LSPE apply only to the nonoptimistic version.66). corresponding to the special choice of the scaling matrix Gk of Eq. (6. Generally. optimistic variants of policy evaluation methods with function approximation are popular in practice. in order to reduce the variance of Ck and dk . ik+2 ) using the control n uk+1 = arg u∈U(ik+1 ) min pik+1 j (u) g(ik+1 . (6.3. (6.
iteration for the case of a lookup table representation, where Φ = I, while Section 6.4.4 of the same reference considers general approximation architectures. This analysis is based on the use of the so-called greedy partition. For a given approximation architecture J̃(·, r), this is a partition of the space ℜs of parameter vectors r into subsets Rµ, each subset corresponding to a stationary policy µ, and defined by

Rµ = { r | µ(i) = arg min_{u∈U(i)} Σ_{j=1}^n pij(u)( g(i, u, j) + αJ̃(j, r) ), i = 1, . . . , n }.   (6.68)

Thus, Rµ is the set of parameter vectors r for which µ is greedy with respect to J̃(·, r).

For simplicity, let us assume that we use a policy evaluation method that for each given µ produces a unique parameter vector rµ. Nonoptimistic policy iteration starts with a parameter vector r0, which specifies µ0 as a greedy policy with respect to J̃(·, r0), and generates rµ0 by using the given policy evaluation method. It then finds a policy µ1 that is greedy with respect to J̃(·, rµ0), i.e., a µ1 such that rµ0 ∈ Rµ1, and repeats the process with µ1 replacing µ0. If some policy µk satisfying rµk ∈ Rµk is encountered, the method keeps generating that policy; this is the necessary and sufficient condition for policy convergence in the nonoptimistic policy iteration method.

Figure 6.3.4: Greedy partition and cycle of policies generated by nonoptimistic policy iteration. In this figure, the method cycles between four policies and the corresponding four parameters rµk, rµk+1, rµk+2, rµk+3.

In the case of a lookup table representation, where the parameter vectors rµ are equal to the cost-to-go vectors Jµ, the condition rµk ∈ Rµk
is equivalent to rµk = T rµk, and is satisfied if and only if µk is optimal. When there is function approximation, however, this condition need not be satisfied for any policy. Since there is a finite number of possible vectors rµ, one generated from another in a deterministic way, the algorithm ends up repeating some cycle of policies µk, µk+1, . . . , µk+m with

rµk ∈ Rµk+1, rµk+1 ∈ Rµk+2, . . . , rµk+m−1 ∈ Rµk+m, rµk+m ∈ Rµk   (6.69)

(see Fig. 6.3.4). Furthermore, there may be several different cycles; the actual cycle obtained depends on the initial policy µ0. This is similar to gradient methods applied to the minimization of functions with multiple local minima, where the limit of convergence depends on the starting point.

In the case of optimistic policy iteration, the trajectory of the method is less predictable, and depends on the fine details of the iterative policy evaluation method, such as the frequency of the policy updates and the stepsize used. Generally, given the current policy µ, optimistic policy iteration will move towards the corresponding "target" parameter rµ, for as long as µ continues to be greedy with respect to the current cost-to-go approximation J̃(·, r), that is, for as long as the current parameter vector r belongs to the set Rµ. Once, however, the parameter r crosses into another set, say Rµ̄, the policy µ̄ becomes greedy, and r changes course and starts moving towards the new "target" rµ̄. Thus, the "targets" rµ of the method, and the corresponding policies µ and sets Rµ, may keep changing, similar to nonoptimistic policy iteration. Furthermore, the parameter vector r will move near the boundaries that separate the regions Rµ that the method visits, following reduced versions of the cycles that nonoptimistic policy iteration may follow (see Fig. 6.3.5). Simultaneously, if diminishing parameter changes are made between policy updates (such as for example when a diminishing stepsize is used by the policy evaluation method) and the method eventually cycles between several policies, the parameter vectors will tend to converge to the common boundary of the regions Rµ corresponding to these policies. This is the so-called chattering phenomenon for optimistic policy iteration, whereby there is simultaneously oscillation in policy space and convergence in parameter space. The following is an example of chattering; other examples are given in Section 6.4.2 of [BeT96] (Examples 6.9 and 6.10).

Example 6.3.2 (Chattering)

Consider a discounted problem with two states, 1 and 2, illustrated in Fig. 6.3.6(a). There is a choice of control only at state 1, and there are two policies, denoted µ∗ and µ. The optimal policy µ∗, when at state 1, stays at 1 with probability p > 0 and incurs a negative cost c. The other policy is µ and
cycles between the two states with 0 cost.

Figure 6.3.5: Illustration of a trajectory of optimistic policy iteration for the problem of Example 6.3.2. The algorithm settles into an oscillation between policies µ1, µ2, µ3, with rµ1 ∈ Rµ2, rµ2 ∈ Rµ3, rµ3 ∈ Rµ1. The parameter vectors converge to the common boundary of the regions corresponding to these policies.

Figure 6.3.6: (a) Costs and transition probabilities for the policies µ and µ∗. (b) The greedy partition and the solutions of the projected equations corresponding to µ and µ∗; here rµ = 0 and rµ∗ ≈ c/(1 − α). Nonoptimistic policy iteration oscillates between rµ and rµ∗.

We consider a linear approximation with a single feature φ(i)′ = i for each of the states i = 1, 2, i.e.,

Φ = ( 1, 2 )′,    J̃ = Φr = ( r, 2r )′.

Let us construct the greedy partition, shown in Fig. 6.3.
6(b). We next calculate the points rµ and rµ∗ that solve the projected equations

Cµ rµ = dµ,    Cµ∗ rµ∗ = dµ∗,

which correspond to µ and µ∗, respectively [cf. Eqs. (6.33), (6.34)]. We have

Cµ = Φ′ Ξµ (I − αPµ)Φ = ( 1  2 ) ( 1/2  0 ; 0  1/2 ) ( 1  −α ; −α  1 ) ( 1 ; 2 ) = (5 − 4α)/2,

dµ = Φ′ Ξµ gµ = 0,

so that rµ = 0. Similarly, using the steady-state distribution ξµ∗ = ( 1/(2 − p), (1 − p)/(2 − p) ) and with some calculation,

Cµ∗ = Φ′ Ξµ∗ (I − αPµ∗)Φ = (5 − 4p − α(4 − 3p))/(2 − p),    dµ∗ = Φ′ Ξµ∗ gµ∗ = c/(2 − p),

so that rµ∗ = c/(5 − 4p − α(4 − 3p)).

We now note that since c < 0, we have rµ = 0 ∈ Rµ∗, while for p ≈ 1 and α > 1 − α, we have rµ∗ ≈ c/(1 − α) ∈ Rµ. In this case, approximate policy iteration cycles between µ and µ∗. Optimistic policy iteration uses some algorithm that moves the current value r towards rµ∗ if r ∈ Rµ∗, and towards rµ if r ∈ Rµ. Thus optimistic policy iteration, starting from a point in Rµ, moves towards rµ∗, and once it crosses the boundary point c/α of the greedy partition, it reverses course and moves towards rµ. If the method makes small incremental changes in r before checking whether to change the current policy, it will incur a small oscillation around c/α; if the incremental changes in r are diminishing, the method will converge to c/α. Yet c/α does not correspond to any one of the two policies, and has no meaning as a desirable parameter value. Notice also that it is hard to predict when an oscillation will occur and what kind of oscillation it will be; consider, for example, the case where c > 0.
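The quantities of this example are easy to check numerically. The sketch below assumes the transition structure as described above (µ cycles deterministically between the two states at zero cost; µ∗ stays at state 1 with probability p at cost c and otherwise moves to state 2; from state 2 both policies move to state 1); under these assumptions the greedy boundary at state 1 works out to c/(αp), which is the c/α of the text when p ≈ 1. All variable names are our own.

```python
import numpy as np

alpha, p, c = 0.9, 0.95, -1.0            # discount factor, stay probability, cost c < 0
Phi = np.array([[1.0], [2.0]])           # single feature phi(i) = i, so J~ = (r, 2r)'

def projected_fixed_point(P, xi, gbar):
    """Solve C r = d with C = Phi' Xi (I - alpha P) Phi, d = Phi' Xi gbar."""
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(2) - alpha * P) @ Phi
    d = Phi.T @ Xi @ gbar
    return np.linalg.solve(C, d)[0]

# Policy mu: cycle 1 -> 2 -> 1 at zero cost; steady-state distribution (1/2, 1/2).
r_mu = projected_fixed_point(np.array([[0.0, 1.0], [1.0, 0.0]]),
                             [0.5, 0.5], np.array([0.0, 0.0]))

# Policy mu*: at state 1, stay w.p. p (one-stage cost c), else move to state 2.
P_star = np.array([[p, 1.0 - p], [1.0, 0.0]])
xi_star = [1.0 / (2.0 - p), (1.0 - p) / (2.0 - p)]
r_star = projected_fixed_point(P_star, xi_star, np.array([c, 0.0]))

# mu* is greedy at state 1 iff c + alpha*(2 - p)*r <= 2*alpha*r, i.e. iff r >= c/(alpha*p)
boundary = c / (alpha * p)
```

Since rµ = 0 lies on the µ∗ side of the boundary and rµ∗ lies on the µ side, each policy evaluation points to the other policy, which is the oscillation described in the example.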
respectively). while for p ≈ 1 and α > 1 − α. As a result. its eﬀects may not be very serious. all the ˜ policies in M are greedy with respect to J(·. µ1 (i). but does not seem crucial for the quality of the ﬁnal policy obtained (as long as the methods converge).380 Approximate Dynamic Programming Chap. we have c ∈ Rµ∗ . diﬀerent methods “ﬁsh in the same waters” and tend to yield similar ultimate cycles of policies. the limit to which the method converges cannot always be used to construct an approximation of the costtogo of any policy or the optimal costtogo.. µ2 (i). (6. let us note that while chattering is a typical phenomenon in optimistic policy iteration with a linear parametric architecture. j + αJ(j. Finally. 6 In this case approximate as well as optimistic policy iteration will converge to µ (or µ∗ ) if started with r in Rµ (or Rµ∗ . LSPE. and should be an important concern in its implementation. rµ∗ ≈ 1−α When chattering occurs. u. the limit of optimistic policy iteration tends to be on a common boundary of several subsets of the greedy partition and may not meaningfully represent a cost approximation of any of the corresponding policies. one must go back and perform a screening process. which implies that there is a subset of states i such that there are at least two diﬀerent controls µ1 (i) and µ2 (i) satisfying u∈U(i) min ˜ pij (u) g(i. there are indications that for many problems.70) = Each equation of this type can be viewed as a constraining relation on the parameter vector r. r) ˜ g i. evaluate by simulation the many policies generated by the method starting from the initial conditions of interest and select the most promising one. r). suppose that we have convergence to a parameter vector r and that there is a steadystate policy oscillation involving a collection of policies M. or TD for various values of λ) makes a diﬀerence in rate of convergence. An additional insight is that the choice of the iterative policy evaluation method (e. 
As a result.g. there will be at . Indeed. LSTD. at the end of optimistic policy iteration and in contrast with the nonoptimistic version. Then. excluding singular situations. Thus. that is. Using a diﬀerent value of λ changes the targets rµ somewhat. r) . j + αJ(j. This is a disadvantage of optimistic policy iteration that may nullify whatever practical rate of convergence advantages the method may have over its nonoptimistic counterpart. j) + αJ(j. but leaves the greedy partition unchanged. r) j = j pij µ1 (i) pij µ2 (i) j ˜ g i. Thus.
This suggests that in the tetris problem. the policy iteration process may be just cycling in a “bad” part of the greedy partition. also with an achieved average score of a few thousands). We ﬁnally note that even if all policies involved in chattering have roughly the same cost. An interesting case in point is the game of tetris. [TsV96]. for the methods of this section. it is still possible that none of them is particularly good. there are no “critical” states. has been addressed by approximate linear programming [FaV06].6. that is. The full ramiﬁcations of policy oscillation in practice are not fully understood at present.Sec. [SzL06]. [BeI96]. It then follows that all policies in the set M involved in chattering have roughly the same cost. policy gradient methods (see Section 6. policy iteration using the projected equation is seriously hampered by oscillations/chattering between relatively poor policies. [Kak02].3.7 Multistep SimulationBased Methods A useful approach in approximate DP is to replace Bellman’s equation with an equivalent equation that reﬂects control over multiple successive stages. which has been used as a testbed for approximate DP methods [Van95]. [FaV06]. Now assume that we have a problem where the total number of states is much larger than s and. This amounts to replacing T with a multistep version that has the same . that the assumption of “no critical states. where s is the dimension of r. r) is close to the cost approximation J(·. and with a policy gradient method [Kak02]. Using a set of 22 features and approximate policy iteration with policy evaluation based on the projected equation and the LSPE method [BeI96]. r) (in Example 6. an average score of a few thousands was achieved. however. state 1 is “ambiguous”).4). 6. Note. furthermore. an average score of over 900. 
Moreover local minimatype phenomena may be causing similar diﬃculties in other related approximate DP methodologies: approximate policy iteration with the Bellman error method (see Section 6. This implies that there will be at most s “ambiguous” states where more ˜ than one control is greedy with respect to J(·.3. [DFM09]. using the same 22 features.3 Projected Equation Methods 381 most s relations of the form (6.9). 6. roughly similar to the attraction of gradient methods to poor local minima.” aside from the fact that it may not be easily quantiﬁable. and approximate linear programming (the tetris problem.000 was achieved [ThS09].8. rµ ) that would be generated for any of the policies µ ∈ M. Using the same features and a random search method in the space of weight vectors r. Furthermore. the cost consequences of changing a policy in just a small number of states (say. one may argue that the cost ˜ ˜ approximation J(·. it will not be true for many problems. [DFM09].70) holding. but it is clear that they give serious reason for concern. of the order of s) is relatively small.
Eq. for example. T ℓ with ℓ > 1. ℓ=0 g (λ) = ℓ=0 αℓ λℓ P ℓ g = (I − αλP )−1 g. with ∞ ∞ (6. 6 ﬁxed points.71) P (λ) = (1 − λ) αℓ λℓ P ℓ+1 . 1). Furthermore ∗ Jµ − Φrλ ξ ≤ 1 1 − α2 λ Jµ − ΠJµ ξ .72) We may then apply variants of the preceding simulation algorithms to ﬁnd a ﬁxed point of T (λ) in place of T .74) ∗ where Φrλ is the ﬁxed point of ΠT (λ) . resulting in a tighter error bound. We will focus on the λweighted multistep Bellman equation J = T (λ) J.5: The mappings T (λ) and ΠT (λ) are contractions of modulus α(1 − λ) αλ = 1 − αλ with respect to the weighted Euclidean norm · ξ . (6. A straightforward calculation shows that it can be written as T (λ) J = g (λ) + αP (λ) J. The corresponding projected equation takes the form C (λ) r = d(λ) . (6. or T (λ) given by ∞ T (λ) = (1 − λ) λℓ T ℓ+1 . (6. The motivation for replacing T with T (λ) is that the modulus of contraction of T (λ) is smaller.3. ℓ=0 where λ ∈ (0. where C (λ) = Φ′ Ξ I − αP (λ) Φ. .73) [cf. d(λ) = Φ′ Ξg (λ) . Proposition 6.382 Approximate Dynamic Programming Chap. (6. This is shown in the following proposition. where ξ is the steadystate probability vector of the Markov chain.34)].
in a matrix inversion approach. We are ultimately interested in a “good” next policy. This is the LSTD(λ) method.3. There is also a regression/regularization variant of this method along the lines described earlier [cf. LSPE(λ).3 Projected Equation Methods 383 Proof: Using Lemma 6. it follows that T (λ) is a contraction with modulus α(1 − λ)/(1 − αλ). As a result.2. The estimate (6. (6. In particular. Eq. Q. Thus. and TD(λ) The simulationbased methods of the preceding subsections correspond to λ = 0. (λ) (λ) . the objective is not just to approximate well the cost of the current policy. 6. the eﬀects of simulation noise become more pronounced as λ increases. Indeed from Eq. This suggests that large values of λ should be used. but rather to use the approximate cost to obtain the next “improved” policy. (6.71)]. we will later argue that when simulationbased approximations are used. the unique solution of the projected equation may be approximated by (λ) −1 (λ) Ck dk . for any weighted Euclidean norm of projection.74) follows similar to the proof of Prop. Furthermore. but can be extended to λ > 0. is that given any norm. On the other hand. (6. ΠT (λ) is a contraction provided λ is suﬃciently close to 1.3. and αλ → 0 as λ → 1. it follows that as λ → 1.74). Eq.D. LSTD(λ). where Ck and dk are simulationbased approximations of C (λ) and d(λ) . 6.51)]. some trial and error with the value of λ may be useful. 1 − αλ Since T (λ) is linear with associated matrix αP (λ) [cf.1. in practice. which follows from the property αλ → 0 as λ → 1. the projected equation solution ∗ Φrλ converges to the “best” approximation ΠJµ of Jµ on S. Note that αλ decreases as λ increases.E. the error bound (6. we have ∞ P (λ) z ξ ≤ (1 − λ) ≤ (1 − λ) = αℓ λℓ P ℓ+1 z ℓ=0 ∞ ξ αℓ λℓ z ℓ=0 ξ (1 − λ) z ξ. and there is no consistent experimental or theoretical evidence that this is achieved solely by good cost approximation of the current policy. 
Another interesting fact.74) also becomes better as λ increases. This is a consequence of the norm equivalence property in ℜn (any norm is bounded by a constant multiple of any other norm). we should note that in the context of approximate policy iteration. the mapping T (λ) is a contraction (with arbitrarily small modulus) with respect to that norm for λ suﬃciently close to 1. Furthermore.Sec.
(6. one possibility is [cf. One possibility is to choose γ = 1 and k 1 Gk = φ(it )φ(it )′ . we ﬁrst verify that the rightmost ex(λ) pression in the deﬁnition (6. ˆ For a sketch of the argument. we may consider the (scaled) LSPE(λ) iteration rk+1 = rk − γGk Ck rk − dk (λ) (λ) . (6.77) This as an iterative version of the regressionbased LSTD method [cf. The veriﬁcation is similar to the case λ = 0. Eqs. .76) where Σk is some positive deﬁnite symmetric matrix.78) of Ck can be written as k m=t αm−t λm−t φ(im ) − αφ(im+1 ) k−1 ′ = φ(it ) − α(1 − λ) m=t αm−t λm−t φ(im+1 ) − αk−t+1 λk−t φ(ik+1 ).k and pij.41)(6.k deﬁned by Eq. dk → −1 d(λ) .75) where γ is a stepsize and Gk is a scaling matrix that converges to some G such that I − γGC (λ) is a contraction. For γ = 1.79) It can be shown that indeed these are correct simulationbased approximations to C (λ) and d(λ) of Eq.384 Approximate Dynamic Programming Chap.60)]. for which convergence is assured provided Ck → C (λ) . k (6. Eq. by considering the approximation of the steadystate probabilities ˆ ξi and transition probabilities pij with the empirical frequencies ξi.44). Another possibility is Gk = Ck (λ)′ Σ−1 Ck k (λ) + βI −1 Ck (λ)′ Σ−1 . Regarding the calculation of appropriate simulationbased approxi(λ) (λ) mations Ck and dk .42)] Ck (λ) = 1 k+1 k k φ(it ) t=0 (λ) m=t αm−t λm−t φ(im ) − αφ(im+1 ) .61)].78) dk = 1 k+1 k φ(it ) t=0 m=t αm−t λm−t gim . 6 Similarly. k + 1 t=0 Diagonal approximations to this matrix may also be used to avoid the computational overhead of matrix inversion. Eq. and β is a positive scalar [cf. (6. and {Σk } is bounded. (6. k ′ (6. (λ) (λ) (6. we obtain the iteration rk+1 = Ck (λ)′ Σ−1 Ck k (λ) + βI −1 Ck (λ)′ Σ−1 dk + βrk . k (λ) (6.73). (6. (6.
Sec.3 Projected Equation Methods 385 which by discarding the last term (it is negligible for k >> t). (6. A full convergence analysis can be found in [NeB03] and also in [BeY09]. m=t (λ) Using this relation in the expression (6. where pij are the components of the matrix P (λ) .78) for Ck . (λ) (λ) We may also streamline the calculation of Ck and dk by introducing the vector t zt = (αλ)t−m φ(im ). Eq.3) that 1 k+1 k t=0 n φ(it )φ(it )′ → ξi φ(i)φ(i)′ .72)]. we see that Ck → C (λ) with probability 1. It can be seen (cf.3. We now compare this expression with C (λ) . j)th component of P (ℓ+1) [cf. i=1 while by using the formula ∞ pij = (1 − λ) (ℓ+1) (λ) αℓ λℓ pij ℓ=0 (ℓ+1) being the (i.43).80) . yields k m=t αm−t λm−t φ(im ) − αφ(im+1 ) = φ(it ) − α(1 − λ) ′ k−1 αm−t λm−t φ(im+1 ). 6. (λ) Thus. m=0 (6. which similar to Eq. we obtain (λ) Ck 1 = k+1 k t=0 k−1 ′ φ(it ) φ(it ) − α(1 − λ) αm−t λm−t φ(i m=t m+1 ) . in a more general context. by comparing the preceding expressions. can be written as ′ n n C (λ) = Φ′ Ξ(I − αP (λ) )Φ = (λ) i=1 ξi φ(i) φ(i) − α j=1 (λ) pij φ(j) . it can with pij be veriﬁed that 1 k+1 k t=0 k−1 n n φ(it ) (1 − λ) αm−t λm−t φ(im+1 )′ m=t → ξi φ(i) i=1 j=1 (λ) pij φ(j)′ . (6. the derivations of Section 6.
(6.t . we see that TD(λ) uses Gk = I and only the latest temporal diﬀerence qk. (6.25)].82) Note that zk . We have zk = αλ zk−1 + φ(ik ). dk . . ik+1 ). we may verify that (λ) Ck = 1 k+1 k t=0 zt φ(it ) − αφ(it+1 ) .83) where qk. zk approaches k k−t φ(i ) [cf. (6. and TD(λ) approaches the TD(1) method t t=0 α given earlier in Section 6.84) [cf. t=0 (6.t is the temporal diﬀerence qk. This amounts to approximating C (λ) and d(λ) by a single sample.47) and (6. t=0 (6. k+1 k dk = 1 ¯(λ) d . k+1 k ¯(λ) ¯ (λ) where Ck and dk are updated by ′ ¯ (λ) ¯ (λ) Ck = Ck−1 + zk φ(ik ) − αφ(ik+1 ) . it+1 ). ¯(λ) ¯(λ) dk = dk−1 + zk g(ik . It takes the form rk+1 = rk − γk zk qk.80)].386 Approximate Dynamic Programming Chap.2 [cf. Note that as λ → 1. 6 Then.t = φ(it )′ rk − αφ(it+1 )′ rk − g(it . Ck (λ) = 1 ¯ (λ) C .75) can also be written as γ Gk k+1 k and dk .k . k ′ (6.81) dk (λ) (λ) (λ) = 1 k+1 zt g(it . can be conveniently updated by means of recursive formulas. Eqs. Eq.k . it+1 ) (6. instead of k + 1 samples.85) where γk is a stepsize parameter.83). (λ) Let us ﬁnally note that by using the above formulas for Ck the iteration (6. Ck . Eq. by straightforward calculation. (λ) rk+1 = rk − zt qk. When compared to the scaled LSPE method (6. (6. The TD(λ) algorithm is essentially TD(0) applied to the multistep projected equation C (λ) r = d(λ) .56)].
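The trace-based recursions (6.80)-(6.82) can be sketched as follows; the companion function computes the exact solution of the multistep projected equation from the matrices P(λ) and g(λ) of Eq. (6.72), for comparison. The chain, the value of λ, and the tolerances in the usage below are our own illustrative choices.

```python
import numpy as np

def lstd_lambda(P, g, Phi, alpha, lam, num_steps=100000, seed=0):
    """LSTD(lambda) sketch via the eligibility-trace recursions (6.80)-(6.82):
    z_k = alpha*lam*z_{k-1} + phi(i_k), with running sums
    Cbar += z_k (phi(i_k) - alpha*phi(i_{k+1}))' and dbar += z_k g(i_k, i_{k+1})."""
    rng = np.random.default_rng(seed)
    n, s = Phi.shape
    z, Cbar, dbar = np.zeros(s), np.zeros((s, s)), np.zeros(s)
    i = 0
    for _ in range(num_steps):
        j = rng.choice(n, p=P[i])
        z = alpha * lam * z + Phi[i]                  # eligibility trace, Eq. (6.80)
        Cbar += np.outer(z, Phi[i] - alpha * Phi[j])
        dbar += z * g[i, j]
        i = j
    return np.linalg.solve(Cbar, dbar)                # the 1/(k+1) factors cancel

def exact_multistep_solution(P, gbar, xi, Phi, alpha, lam):
    """Exact solution of C^(lam) r = d^(lam), using P^(lam) = (1-lam) P (I - alpha*lam*P)^{-1}
    and g^(lam) = (I - alpha*lam*P)^{-1} gbar [cf. Eq. (6.72)]."""
    n = P.shape[0]
    R = np.linalg.inv(np.eye(n) - alpha * lam * P)
    P_lam, g_lam = (1.0 - lam) * P @ R, R @ gbar
    Xi = np.diag(xi)
    C = Phi.T @ Xi @ (np.eye(n) - alpha * P_lam) @ Phi
    return np.linalg.solve(C, Phi.T @ Xi @ g_lam)

# Illustrative 3-state chain with uniform steady-state distribution
P = np.full((3, 3), 0.25) + 0.25 * np.eye(3)
g = np.repeat(np.array([[1.0], [2.0], [3.0]]), 3, axis=1)
gbar = np.array([1.0, 2.0, 3.0])
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
xi = np.full(3, 1.0 / 3.0)

r_sim = lstd_lambda(P, g, Phi, alpha=0.9, lam=0.3)
r_ex = exact_multistep_solution(P, gbar, xi, Phi, alpha=0.9, lam=0.3)

# As lam -> 1 the exact solution approaches the projection Pi J_mu of the policy cost
J_mu = np.linalg.solve(np.eye(3) - 0.9 * P, gbar)
W = Phi.T * (1.0 / 3.0)                               # Phi' Xi with uniform xi
r_proj = np.linalg.solve(W @ Phi, W @ J_mu)
r_near1 = exact_multistep_solution(P, gbar, xi, Phi, alpha=0.9, lam=0.99)
```

On this small example one can also see the simulation-noise effect noted in the text: for larger λ the trace z has larger magnitude, and more samples are needed for Cbar and dbar to stabilize.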
6.3.8 TD Methods with Exploration

We will now address some of the difficulties associated with TD methods, particularly in the context of policy iteration. The first difficulty has to do with the issue of exploration: to maintain the contraction property of ΠT it is important to generate trajectories according to the steady-state distribution ξ associated with the given policy µ. On the other hand, this biases the simulation by underrepresenting states that are unlikely to occur under µ, causing potentially serious errors in the calculation of a new policy via policy improvement. A second difficulty is that Assumption 6.3.1 (P is irreducible) may be hard or impossible to guarantee, either because of the presence of transient states (in which case the components of ξ corresponding to transient states are 0), or because of multiple recurrent classes (in which case some states will never be generated during the simulation), in which case the methods break down.

As mentioned earlier, a common approach to address the exploration difficulty is to modify the transition probability matrix P of the given policy µ by occasionally generating transitions that use a randomly selected control or state rather than the one dictated by µ. Mathematically, in such a scheme we generate successive states according to an irreducible transition probability matrix

    P̄ = (I − B)P + BQ,    (6.86)

where B is a diagonal matrix with diagonal components βi ∈ [0, 1] and Q is another transition probability matrix. Thus, at state i, the next state is generated with probability 1 − βi according to transition probabilities pij, and with probability βi according to transition probabilities qij.† We refer to βi as the exploration probability at state i. If the modified transition probability matrix P̄ is irreducible, we simultaneously address the second difficulty as well.

Unfortunately, using P̄ in place of P for simulation, with no other modification in the TD algorithms, creates a bias that tends to degrade the quality of policy evaluation, because it directs the algorithm towards approximating the fixed point of the mapping T̄(λ), given by

    T̄(λ)(J) = (1 − λ) Σ_{t=0}^∞ λt T̄^{t+1}(J),

† In the literature, the policy being evaluated is sometimes called the target policy, to distinguish it from the policy used for exploration, which is called the behavior policy. Also, methods that use a behavior policy are called off-policy methods, while methods that do not are called on-policy methods.
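The construction of the exploration-enhanced matrix in Eq. (6.86) is a simple matrix operation. The sketch below uses made-up chain data and uniform restarts for Q; it only verifies that P̄ = (I − B)P + BQ is again a stochastic matrix.

```python
import numpy as np

# Exploration scheme Pbar = (I - B) P + B Q: B is diagonal with exploration
# probabilities beta_i, and Q is any transition matrix (here: uniform).
# All numbers are illustrative.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.9, 0.1],
              [0.1, 0.0, 0.9]])
Q = np.full((3, 3), 1.0 / 3.0)     # exploration transitions (uniform restart)
beta = np.array([0.1, 0.2, 0.1])   # exploration probability beta_i at state i
B = np.diag(beta)

Pbar = (np.eye(3) - B) @ P + B @ Q
# Row i of Pbar: with prob. 1 - beta_i follow p_ij, with prob. beta_i follow q_ij.
```

Because each row of P and Q sums to 1 and the mixture weights 1 − βi, βi are nonnegative, each row of P̄ is again a probability distribution, and with Q chosen suitably (e.g., uniform), P̄ is irreducible even if P is not.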
with T̄(J) = ḡ + αP̄J. This is the cost of a different policy, an exploration-enhanced policy that has a cost vector ḡ with components

    ḡi = Σ_{j=1}^n p̄ij g(i, j),    i = 1, …, n,

and a transition probability matrix P̄ in place of P. Thus the fixed point of T̄(λ) may differ substantially from the desired fixed point of T(λ).

The solid convergence properties of iterative TD methods are due to the fact that T(λ) is a contraction with respect to the norm ‖·‖ξ, corresponding to the steady-state distribution ξ of P, and Π is nonexpansive with respect to the same norm. This makes the composition ΠT(λ) a contraction with respect to ‖·‖ξ. We will now discuss some schemes that allow the approximation of the solution of the projected equation

    Φr = Π̄T(λ)(Φr),    (6.87)

where Π̄ denotes projection on the approximation subspace with respect to ‖·‖ξ̄, and ξ̄ is the invariant distribution corresponding to P̄. Thus these schemes allow exploration, when the simulated trajectory is generated according to P̄, but without the degradation of approximation quality resulting from the use of T̄ in place of T: Eq. (6.87) involves the mapping T(λ), so it aims to approximate the desired fixed point of T(λ), rather than a fixed point of T̄(λ). By contrast, if the simulated trajectory is generated according to P̄ and the algorithms are used with no other modification, the LSTD(λ), LSPE(λ), and TD(λ) algorithms yield the unique solution r̄λ of the equation

    Φr = Π̄T̄(λ)(Φr).    (6.88)

Note the difference between equations (6.87) and (6.88): the first involves T but the second involves T̄.

Our difficulty here is that T(λ) need not be a contraction with respect to ‖·‖ξ̄, so Π̄T(λ) need not be a contraction. The key difficulty for asserting contraction properties of the composition Π̄T(λ) is a potential norm mismatch: even if T(λ) is a contraction with respect to some norm, Π̄ may not be nonexpansive with respect to the same norm. In this connection, we note that generally, for an arbitrary choice of P̄, Π̄T(λ) may not be a contraction. The corresponding LSTD(λ)-type methods, including the regression-based version [cf. Eq. (6.51)], do not require that Π̄T(λ) be a contraction; they only require that the projected equation (6.87) have a unique solution. In contrast, the corresponding LSPE(λ)- and TD(λ)-type methods are valid only if Π̄T(λ) is a contraction [except for the scaled version of LSPE(λ) that has the form (6.61), where Π̄T(λ) need not be a contraction]. We will show, however, that for any choice of P̄, the mapping Π̄T(λ) is a contraction for λ sufficiently close to 1.
The composition Π̄T(λ) need not in general be a contraction with respect to the modified/exploration-reweighted norm ‖·‖ξ̄. The following proposition quantifies the restrictions on the size of the exploration probabilities needed to avoid the difficulty just described. Since Π̄ is nonexpansive with respect to ‖·‖ξ̄, the proof is based on finding values of βi for which T(λ) is a contraction with respect to ‖·‖ξ̄. This is equivalent to showing that the induced norm of the matrix

    A(λ) = (1 − λ) Σ_{t=0}^∞ λt (αP)^{t+1}    (6.89)

is less than 1.

Proposition 6.3.6: Assume that P̄ is irreducible and ξ̄ is its invariant distribution. Then T(λ) and Π̄T(λ) are contractions with respect to ‖·‖ξ̄ for all λ ∈ [0, 1), provided ᾱ < 1, where

    ᾱ = α / √(1 − max_{i=1,…,n} βi).

The associated modulus of contraction is at most equal to

    ᾱ(1 − λ) / (1 − ᾱλ).

Proof: For all z ∈ ℜn with z ≠ 0, we have

    ‖αPz‖ξ̄²  = α² Σ_{i=1}^n ξ̄i (Σ_{j=1}^n pij zj)²
              ≤ α² Σ_{i=1}^n ξ̄i Σ_{j=1}^n pij zj²
              ≤ α² Σ_{i=1}^n ξ̄i Σ_{j=1}^n (p̄ij / (1 − βi)) zj²
              ≤ (α² / (1 − β)) Σ_{j=1}^n Σ_{i=1}^n ξ̄i p̄ij zj²
              = (α² / (1 − β)) Σ_{j=1}^n ξ̄j zj²
              = ᾱ² ‖z‖ξ̄²,

where β = max_{i} βi, the first inequality follows from the convexity of the quadratic function, the second inequality follows from the fact (1 − βi)pij ≤ p̄ij, and the next to last equality follows from the property

    Σ_{i=1}^n ξ̄i p̄ij = ξ̄j

of the invariant distribution. Thus, αP is a contraction with respect to ‖·‖ξ̄ with modulus at most ᾱ. It follows that

    ‖A(λ)‖ξ̄ ≤ (1 − λ) Σ_{t=0}^∞ λt ‖αP‖ξ̄^{t+1} ≤ (1 − λ) Σ_{t=0}^∞ λt ᾱ^{t+1} = ᾱ(1 − λ)/(1 − ᾱλ) < 1,    (6.90)

from which the result follows. Q.E.D.

The preceding proposition delineates a range of values for the exploration probabilities in order for Π̄T(λ) to be a contraction: it is sufficient that

    βi < 1 − α²,    i = 1, …, n,

independent of the value of λ. We next consider the effect of λ on the range of allowable exploration probabilities. While it seems difficult to fully quantify this effect, it appears that values of λ close to 1 tend to enlarge the range. In fact, T(λ) is a contraction with respect to any norm ‖·‖ξ̄, and consequently for any value of the exploration probabilities βi, provided λ is sufficiently close to 1. This is shown in the following proposition.

Proposition 6.3.7: Given any exploration probabilities from the range [0, 1] such that P̄ is irreducible, there exists λ̄ ∈ [0, 1) such that T(λ) and Π̄T(λ) are contractions with respect to ‖·‖ξ̄ for all λ ∈ [λ̄, 1).

Proof: Note first that if α < 1, there exists ξ̄ such that ‖αP‖ξ̄ < 1. From Eq. (6.90) it follows that limλ→1 ‖A(λ)‖ξ̄ = 0 and hence limλ→1 A(λ) = 0. It follows that given any norm ‖·‖, A(λ) is a contraction with respect to that norm for λ sufficiently close to 1; in particular, this is true for any norm ‖·‖ξ̄. Since Π̄ is nonexpansive with respect to ‖·‖ξ̄, the result follows. Q.E.D.
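The quantities in Prop. 6.3.6 are easy to evaluate numerically. The sketch below picks an illustrative discount factor and exploration level satisfying the sufficient condition βi < 1 − α², and checks that the contraction-modulus bound ᾱ(1 − λ)/(1 − ᾱλ) shrinks toward 0 as λ → 1, which is the mechanism behind Prop. 6.3.7.

```python
import numpy as np

# Numeric illustration of Prop. 6.3.6 (values are illustrative):
# abar = alpha / sqrt(1 - max_i beta_i); if abar < 1, T^(lambda) is a
# contraction with modulus at most abar*(1 - lam) / (1 - abar*lam).
alpha = 0.9
beta_max = 0.15                       # satisfies beta_i < 1 - alpha**2 = 0.19
abar = alpha / np.sqrt(1.0 - beta_max)

def modulus_bound(lam):
    """Upper bound on the contraction modulus of T^(lambda) w.r.t. ||.||_xibar."""
    return abar * (1.0 - lam) / (1.0 - abar * lam)
```

For λ = 0 the bound equals ᾱ itself, and it decreases monotonically to 0 as λ → 1, consistent with the claim that large λ enlarges the range of allowable exploration probabilities.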
We finally note that assuming Π̄T(λ) is a contraction with modulus

    ᾱλ = ᾱ(1 − λ)/(1 − ᾱλ),

as per Prop. 6.3.6, we have the error bound

    ‖Jµ − Φr̄λ‖ξ̄ ≤ (1/√(1 − ᾱλ²)) ‖Jµ − Π̄Jµ‖ξ̄,

where Φr̄λ is the fixed point of Π̄T(λ).

We will now consider some specific schemes for exploration. Let us first consider the case where λ = 0. Then a vector r* solves the exploration-enhanced projected equation Φr = Π̄T(Φr) if and only if it solves the problem

    min_{r∈ℜs} ‖Φr − (g + αPΦr*)‖ξ̄².

The optimality condition for this problem is

    Φ′Ξ̄(Φr* − αPΦr* − g) = 0,    (6.91)

and can be written in matrix form as [cf. Eq. (6.48)]

    Cr* = d,

where

    C = Φ′Ξ̄(I − αP)Φ,    d = Φ′Ξ̄g,    (6.92)

and Ξ̄ is the steady-state distribution matrix of P̄. These equations should be compared with the equations for the case where P̄ = P: the only difference is that the distribution matrix Ξ is replaced by the exploration-enhanced distribution matrix Ξ̄. The following two schemes use different simulation-based approximations of the matrix C and the vector d, which are similar to the ones given earlier for the case P̄ = P, but employ appropriately modified forms of temporal differences.

Exploration Using Extra Transitions

The first scheme applies only to the case where λ = 0. We generate a state sequence i0, i1, … according to the exploration-enhanced transition matrix P̄ (or in fact any steady-state distribution ξ̄, such as the uniform distribution). We also generate an additional sequence of independent transitions (i0, j0), (i1, j1), … according to the original transition matrix P.

We approximate the matrix C and vector d of Eq. (6.92) using the formulas

    Ck = (1/(k+1)) Σ_{t=0}^k φ(it)(φ(it) − αφ(jt))′,

and

    dk = (1/(k+1)) Σ_{t=0}^k φ(it) g(it, jt),

in place of Eqs. (6.41) and (6.42). Similar to the earlier case in Section 6.3.3, where P̄ = P, it can be shown using law of large numbers arguments that Ck → C and dk → d with probability 1.

In a modified form of LSTD(0), we approximate the solution C⁻¹d of the projected equation with Ck⁻¹dk. In a modified form of (scaled) LSPE(0), we approximate the term (Crk − d) in PVI by (Ck rk − dk), leading to the iteration

    rk+1 = rk − (γ/(k+1)) Gk Σ_{t=0}^k φ(it) q̃k,t,

where γ is small enough to guarantee convergence [cf. Eq. (6.57)], and

    q̃k,t = φ(it)′rk − αφ(jt)′rk − g(it, jt)

is a temporal difference associated with the transition (it, jt). The three terms in the definition of q̃k,t can be viewed as samples [associated with the transition (it, jt)] of the corresponding three terms of the expression Ξ̄(Φrk − αPΦrk − g) in Eq. (6.91). The corresponding approximation Ck r = dk to the projected equation Φr = Π̄T(Φr) can be written as

    Σ_{t=0}^k φ(it) q̃k,t = 0.

Finally, the modified form of TD(0) is

    rk+1 = rk − γk φ(ik) q̃k,k,

where γk is a positive diminishing stepsize [cf. Eq. (6.62)]. Unfortunately, versions of these schemes for λ > 0 are complicated because of the difficulty of generating extra transitions in a multistep context.
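The extra-transitions scheme can be sketched directly from the formulas above. The chain, features, and costs below are illustrative; the sketch forms Ck and dk from a trajectory under P̄ with one extra P-transition per state, and compares the modified LSTD(0) solution with the exact solution of Cr = d computed from the stationary distribution of P̄.

```python
import numpy as np

# Sketch of exploration using extra transitions (lambda = 0). States i_t are
# generated under the exploration-enhanced matrix Pbar, while for each i_t an
# extra transition (i_t, j_t) is sampled under the original matrix P. Then
# C_k -> C = Phi' Xibar (I - alpha P) Phi and d_k -> d = Phi' Xibar gbar.
# All problem data are illustrative.
np.random.seed(1)
n, s, alpha = 4, 2, 0.9
P = np.random.dirichlet(np.ones(n), size=n)   # original policy's chain
Pbar = 0.8 * P + 0.2 / n                      # exploration-enhanced chain
phi = np.random.rand(n, s)                    # rows: feature vectors phi(i)
g = np.random.rand(n, n)                      # g(i, j)

C = np.zeros((s, s))
d = np.zeros(s)
i = 0
K = 20000
for t in range(K):
    j = np.random.choice(n, p=P[i])           # extra transition under P
    C += np.outer(phi[i], phi[i] - alpha * phi[j])
    d += phi[i] * g[i, j]
    i = np.random.choice(n, p=Pbar[i])        # next state under Pbar
r = np.linalg.solve(C / K, d / K)             # modified LSTD(0)

# Exact solution of C r = d, using the stationary distribution of Pbar.
w, V = np.linalg.eig(Pbar.T)
xibar = np.real(V[:, np.argmin(np.abs(w - 1))])
xibar /= xibar.sum()
Xibar = np.diag(xibar)
gbar = (P * g).sum(axis=1)                    # expected one-stage cost under P
r_exact = np.linalg.solve(phi.T @ Xibar @ (np.eye(n) - alpha * P) @ phi,
                          phi.T @ Xibar @ gbar)
```

With a long enough trajectory, `r` approaches `r_exact`, illustrating the law of large numbers argument in the text.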
Exploration Using Modified Temporal Differences

We will now present an alternative exploration approach that works for all λ ≥ 0. Like the preceding approach, it aims to solve the projected equation Φr = Π̄T(λ)(Φr) [cf. Eq. (6.87)], but it does not require extra transitions. It does require, however, the explicit knowledge of the transition probabilities pij and p̄ij.

Here we generate a single state sequence {i0, i1, …} using the exploration-enhanced transition matrix P̄. The formulas of the various TD algorithms are similar to the ones given earlier, but we use modified versions of temporal differences, defined by

    q̃k,t = φ(it)′rk − (p_{it it+1} / p̄_{it it+1}) (αφ(it+1)′rk + g(it, it+1)),    (6.93)

where pij and p̄ij denote the ij-th components of P and P̄, respectively.†

Consider now the case where λ = 0 and the approximation of the matrix C and vector d of Eq. (6.92) by simulation. After collecting k + 1 samples (k = 0, 1, …), we form

    Ck = (1/(k+1)) Σ_{t=0}^k φ(it) (φ(it) − α (p_{it it+1}/p̄_{it it+1}) φ(it+1))′,

and

    dk = (1/(k+1)) Σ_{t=0}^k (p_{it it+1}/p̄_{it it+1}) φ(it) g(it, it+1).

Similar to the earlier case in Section 6.3.3, where P̄ = P, it can be shown using simple law of large numbers arguments that Ck → C and dk → d with probability 1 (see also Section 6.8.1, where this approximation approach is discussed within a more general context). Note that the approximation Ck r = dk to the projected equation can also be written as

    Σ_{t=0}^k φ(it) q̃k,t = 0,

where q̃k,t is the modified temporal difference given by Eq. (6.93).

The exploration-enhanced LSTD(0) method is simply r̂k = Ck⁻¹dk, and converges with probability 1 to the solution of the projected equation

† Note the difference in the sampling of transitions. Whereas in the preceding scheme with extra transitions, (it, jt) was generated according to the original transition matrix P, here (it, it+1) is generated according to the exploration-enhanced transition matrix P̄.
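A sketch of the modified-temporal-differences scheme follows, again with illustrative data. A single trajectory is generated under P̄, and the ratio pij/p̄ij reweights the relevant terms. Carrying an eligibility vector z lets the same loop cover λ > 0, in the spirit of the exploration-enhanced multistep recursions developed later in this section; indexing conventions vary slightly across presentations, and this sketch simply applies each transition's likelihood ratio exactly once. Setting `lam = 0` recovers the λ = 0 formulas above.

```python
import numpy as np

# Sketch of exploration using modified temporal differences. A single
# trajectory is generated under Pbar only; the ratio p_ij / pbar_ij corrects
# each transition so that Cbar/K and dbar/K estimate the exploration-enhanced
# projected-equation quantities. All problem data are illustrative.
np.random.seed(2)
n, s, alpha, lam = 4, 2, 0.9, 0.5
P = np.random.dirichlet(np.ones(n), size=n)
Pbar = 0.8 * P + 0.2 / n
phi = np.random.rand(n, s)
g = np.random.rand(n, n)

Cbar = np.zeros((s, s))
dbar = np.zeros(s)
i = 0
z = phi[i].copy()                             # eligibility vector
K = 20000
for t in range(K):
    j = np.random.choice(n, p=Pbar[i])        # trajectory under Pbar only
    w = P[i, j] / Pbar[i, j]                  # importance correction p/pbar
    Cbar += np.outer(z, phi[i] - alpha * w * phi[j])
    dbar += z * w * g[i, j]
    z = alpha * lam * w * z + phi[j]          # ratio-weighted eligibility update
    i = j
r = np.linalg.solve(Cbar / K, dbar / K)       # exploration-enhanced LSTD(lambda)

# Exact solution for comparison: C^(lam) = Phi' Xibar (I - aP)(I - a*lam*P)^-1 Phi,
# d^(lam) = Phi' Xibar (I - a*lam*P)^-1 gbar, with Xibar the stationary
# distribution of Pbar and gbar the expected cost under P.
w_eig, V = np.linalg.eig(Pbar.T)
xibar = np.real(V[:, np.argmin(np.abs(w_eig - 1))])
xibar /= xibar.sum()
Xibar = np.diag(xibar)
M = np.linalg.inv(np.eye(n) - alpha * lam * P)
gbar = (P * g).sum(axis=1)
r_exact = np.linalg.solve(phi.T @ Xibar @ (np.eye(n) - alpha * P) @ M @ phi,
                          phi.T @ Xibar @ M @ gbar)
```

The likelihood-ratio products decay geometrically here (the ratio is bounded and multiplied by αλ), which keeps the variance of the estimates under control.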
Φr = Π̄T(Φr). Exploration-enhanced versions of LSPE(0) and TD(0) can be similarly derived, but for convergence of these methods, the mapping Π̄T should be a contraction, which is guaranteed only if P̄ differs from P by a small amount (cf. Prop. 6.3.6).

Let us now consider the case where λ > 0. We first note that increasing values of λ tend to preserve the contraction of Π̄T(λ). In fact, given any norm ‖·‖ξ̄, T(λ) is a contraction with respect to that norm, provided λ is sufficiently close to 1 (cf. Prop. 6.3.7). This implies that given any exploration probabilities from the range [0, 1] such that P̄ is irreducible, there exists λ̄ ∈ [0, 1) such that T(λ) and Π̄T(λ) are contractions with respect to ‖·‖ξ̄ for all λ ∈ [λ̄, 1).

The derivation of simulation-based approximations to the exploration-enhanced LSPE(λ), LSTD(λ), and TD(λ) methods is straightforward but somewhat tedious. We will just provide the formulas, and refer to Bertsekas and Yu [BeY07], [BeY09] for a detailed development. The exploration-enhanced LSPE(λ) iteration is given by

    rk+1 = rk − (γ/(k+1)) Gk Σ_{t=0}^k φ(it) Σ_{m=t}^k λ^{m−t} w_{t,m−t} q̃k,m,    (6.94)

where Gk is a scaling matrix, γ is a positive stepsize, q̃k,t are the modified temporal differences (6.93), and for all k and m,

    w_{k,m} = α^m (p_{ik ik+1}/p̄_{ik ik+1}) (p_{ik+1 ik+2}/p̄_{ik+1 ik+2}) ⋯ (p_{ik+m−1 ik+m}/p̄_{ik+m−1 ik+m})   if m ≥ 1,
    w_{k,0} = 1.    (6.95)

Iteration (6.94) can also be written in a compact form, as shown in [BeY09], where a convergence result is also given. In particular, we have

    rk+1 = rk − γGk (Ck(λ) rk − dk(λ)),    (6.96)

where, similar to the earlier formulas with unmodified TD [cf. Eqs. (6.80)-(6.82)],

    zk = αλ (p_{ik−1 ik}/p̄_{ik−1 ik}) zk−1 + φ(ik),

    Ck(λ) = (1/(k+1)) C̄k(λ),    dk(λ) = (1/(k+1)) d̄k(λ),

and C̄k(λ) and d̄k(λ) are updated by

    C̄k(λ) = C̄k−1(λ) + zk (φ(ik) − α (p_{ik ik+1}/p̄_{ik ik+1}) φ(ik+1))′,
    d̄k(λ) = d̄k−1(λ) + zk (p_{ik−1 ik}/p̄_{ik−1 ik}) g(ik−1, ik).

It is possible to show the convergence of rk (with probability 1) to the unique solution of the exploration-enhanced projected equation Φr = Π̄T(λ)(Φr), assuming that P̄ is such that T(λ) is a contraction with respect to ‖·‖ξ̄ (e.g., for λ sufficiently close to 1). Note that this is a substantial restriction, since it may require a large (and unknown) value of λ. A favorable exception is the iterative regression method

    rk+1 = (Ck(λ)′ Σk⁻¹ Ck(λ) + βI)⁻¹ (Ck(λ)′ Σk⁻¹ dk(λ) + β rk),    (6.97)

which is the special case of the LSPE method (6.96) [cf. Eqs. (6.76) and (6.77)]. This method converges for any λ, as it does not require that T(λ) be a contraction with respect to ‖·‖ξ̄.

The modified LSTD(λ) algorithm computes r̂k as the solution of the equation Ck(λ) r = dk(λ). It is possible to show the convergence of Φr̂k to the solution of the exploration-enhanced projected equation Φr = Π̄T(λ)(Φr), assuming only that this equation has a unique solution (a contraction property is not necessary since LSTD is not an iterative method, but rather approximates the projected equation by simulation).

6.3.9 Summary and Examples
Several algorithms for policy cost evaluation have been given so far for finite-state discounted problems, under a variety of assumptions, and we will now summarize the analysis. We will also explain what can go wrong when the assumptions of this analysis are violated.

The algorithms considered so far for approximate evaluation of the cost vector Jµ of a single stationary policy µ are of two types:

(1) Direct methods, such as the batch and incremental gradient methods of Section 6.2, including TD(1). These methods allow for a nonlinear approximation architecture, and for a lot of flexibility in the collection of the cost samples that are used in the least squares optimization. The drawbacks of these methods are that they are not well-suited for problems with a large variance of simulation “noise,” and that, when implemented using gradient-like methods, they can be very slow. The former difficulty is in part due to the lack of the parameter λ, which is used in other methods to reduce the variance in the parameter update formulas.

(2) Indirect methods that are based on solution of a projected version of Bellman’s equation. These are simulation-based methods that include approximate matrix inversion methods such as LSTD(λ), and
iterative methods such as LSPE(λ) and its scaled versions, and TD(λ) (Sections 6.3.1-6.3.8). The salient characteristics of indirect methods are the following:

(a) For a given choice of λ ∈ [0, 1), they all aim to compute r*λ, the unique solution of the projected Bellman equation Φr = ΠT(λ)(Φr). This equation is linear of the form C(λ)r = d(λ), where C(λ) and d(λ) can be approximated by a matrix Ck(λ) and a vector dk(λ) that can be computed by simulation. The equation expresses the orthogonality relation between Φr − T(λ)(Φr) and the approximation subspace S. The LSTD(λ) method simply computes the solution

    r̂k = (Ck(λ))⁻¹ dk(λ)

of the simulation-based approximation Ck(λ)r = dk(λ) of the projected equation. The projected equation may also be solved iteratively using LSPE(λ) and its scaled versions,

    rk+1 = rk − γGk (Ck(λ) rk − dk(λ)),    (6.98)

and by TD(λ). A key fact for convergence of LSPE(λ) and TD(λ) to r*λ is that T(λ) is a contraction with respect to the projection norm ‖·‖ξ, which implies that ΠT(λ) is also a contraction with respect to the same norm.

(b) LSTD(λ) and LSPE(λ) are connected through the regularized regression-based form (6.77), which aims to deal effectively with cases where Ck(λ) is nearly singular (see Section 6.3.4). This is the special case of the LSPE(λ) class of methods corresponding to the special choice (6.76) of Gk. LSTD(λ), and the entire class of LSPE(λ)-type iterations (6.98), converge at the same asymptotic rate, in the sense that

    ‖r̂k − rk‖ << ‖r̂k − r*λ‖

for sufficiently large k. However, depending on the choice of Gk, the short-term behavior of the LSPE-type methods is more regular as it involves implicit regularization. This regularized behavior is likely an advantage in the policy iteration context, where optimistic variants, involving more noisy iterations, may be used.

(c) When the LSTD(λ) and LSPE(λ) methods are exploration-enhanced for the purpose of embedding within an approximate policy iteration framework, their convergence properties become more complicated: LSTD(λ) and the regularized regression-based version (6.97) of LSPE(λ) converge to the solution of the corresponding (exploration-enhanced) projected equation for an arbitrary amount of exploration,
but TD(λ) and other special cases of LSPE(λ) do so only for λ sufficiently close to 1, as discussed in Section 6.3.8.

(d) The limit r*λ depends on λ. The estimate of Prop. 6.3.5 indicates that the approximation error ‖Jµ − Φr*λ‖ξ increases as the distance ‖Jµ − ΠJµ‖ξ from the subspace S becomes larger, and also increases as λ becomes smaller. Indeed, the error degradation may be very significant for small values of λ, as shown by an example in Bertsekas [Ber95] (also reproduced in Exercise 6.9), where TD(0) produces a very bad solution relative to ΠJµ, which is the limit of the solution Φr*λ produced by TD(λ) as λ → 1. (This example involves a stochastic shortest path problem rather than a discounted problem, but can be modified to illustrate the same conclusion for discounted problems.) Note, however, that in the context of approximate policy iteration, the correlation between approximation error in the cost of the current policy and the performance of the next policy is somewhat unclear in practice.

(e) While the quality of the approximation Φr*λ tends to improve as λ → 1, the methods become more vulnerable to simulation noise, and hence require more sampling for good performance. Indeed, the noise in a simulation sample of a t-stage cost T^t J tends to be larger as t increases, and from the formula

    T(λ) = (1 − λ) Σ_{t=0}^∞ λt T^{t+1}

[cf. Eq. (6.71)], it can be seen that the simulation samples of T(λ)(Φrk), used by LSTD(λ) and LSPE(λ), tend to contain more noise as λ increases. This is consistent with practical experience, which indicates that the algorithms tend to be faster and more reliable in practice when λ takes smaller values (or at least when λ is not too close to 1). There is no rule of thumb for selecting λ, which is usually chosen with some trial and error.

(f) TD(λ) is much slower than LSTD(λ) and LSPE(λ). Still, TD(λ) is simpler and embodies important ideas, which we did not cover sufficiently in our presentation. We refer to the convergence analysis by Tsitsiklis and Van Roy [TsV97], the subsequent papers [TsV99a] and [TsV02], and the book [BeT96] for extensive discussions of TD(λ).

For all the TD methods, LSTD(λ), LSPE(λ), and TD(λ), the assumptions under which convergence of the methods is usually shown in the literature include:

(i) The existence of a steady-state distribution vector ξ, with positive components.
This is needed for the implicit construction of a projection with respect to the norm ‖·‖ξ corresponding to the steady-state distribution ξ of the chain (except for the exploration-enhanced versions of Section 6.3.8, where the norm ‖·‖ξ̄ is used, corresponding to the steady-state distribution ξ̄ of an exploration-enhanced chain).

(ii) The use of a linear approximation architecture Φr, with Φ satisfying the rank Assumption 6.3.2.

(iii) The use of data from a single infinitely long simulated trajectory of the associated Markov chain.

(iv) The use of a diminishing stepsize for TD(λ); LSTD(λ) and LSPE(λ) do not require a stepsize choice, and scaled versions of LSPE(λ) require a constant stepsize.

(v) The use of a single policy, unchanged during the simulation. In particular, convergence does not extend to the case where T involves a minimization over multiple policies, or to optimistic variants, where the policy used to generate the simulation data is changed after a few transitions.

Let us now discuss the above assumptions (i)-(v).

Regarding (i), if a steady-state distribution exists but has some components that are 0, the corresponding states are transient, so they will not appear in the simulation after a finite number of transitions. Once this happens, the algorithms will operate as if the Markov chain consists of just the recurrent states. In this case, the mapping ΠT(λ) will still be a contraction with respect to ‖·‖ξ, and convergence will not be affected; however, transient states would be underrepresented in the cost approximation. If on the other hand there is no steady-state distribution, there must be multiple recurrent classes, so the results of the algorithms would depend on the initial state of the simulated trajectory (more precisely on the recurrent class of this initial state). In particular, states from other recurrent classes, and transient states, would be underrepresented in the cost approximation obtained. This may be remedied by using multiple trajectories, with initial states from all the recurrent classes, so that all these classes are represented in the simulation.

Regarding (ii), there are no convergence guarantees for methods that use nonlinear architectures. In particular, an example by Tsitsiklis and Van Roy [TsV97] (also replicated in Bertsekas and Tsitsiklis [BeT96]) shows that TD(λ) may diverge if a nonlinear architecture is used. In the case where Φ does not have rank s, the mapping ΠT(λ) will still be a contraction with respect to ‖·‖ξ, so it has a unique fixed point. In this case, TD(λ) has been shown to converge to some vector r* ∈ ℜs within the set of all r such that Φr is the unique fixed point of ΠT(λ). This vector is the orthogonal projection of the initial guess r0 on the set of solutions of the projected Bellman equation; see [Ber09]. LSPE(λ) and its scaled variants can be shown to have a similar property.

The key issue regarding (iii) is whether the empirical frequencies at which sample costs of states are collected are consistent with the corresponding steady-state distribution ξ. If there is a substantial discrepancy (as in the case where a substantial amount of exploration is introduced), the projection implicitly constructed by simulation is with respect to a norm substantially different from ‖·‖ξ, in which case the contraction property of ΠT(λ) may be lost, with divergence potentially resulting. In this case the scaled PVI(λ) iteration [cf. Eq. (6.54)] takes the form

    rk+1 = rk − γGΦ′Ξ̄(Φrk − T(λ)(Φrk)),

where Ξ̄ is the distribution matrix corresponding to the empirical frequencies, G is a scaling matrix, and γ is a positive stepsize that is small enough to guarantee convergence. Tsitsiklis and Van Roy [TsV96] give an example of divergence, which involves a projection with respect to a non-Euclidean norm, and Exercise 6.4 gives an example where Π is projection with respect to the standard Euclidean norm, ΠT is not a contraction, and PVI(0) diverges. On the other hand, ΠT(λ) is a contraction for any Euclidean projection norm, provided λ is sufficiently close to 1: similar to Prop. 6.3.7, given any projection norm ‖·‖ξ, T(λ) and ΠT(λ) are contractions with respect to ‖·‖ξ, provided λ is sufficiently close to 1, and the iteration then converges with modulus that tends to 0 as λ → 1.

Regarding (iv), the method for stepsize choice is critical for TD(λ), both for convergence and for performance. An example of divergence in the case of TD(0) is given in Bertsekas and Tsitsiklis [BeT96] (Example 6.9). By contrast, LSTD(λ) works without a stepsize and LSPE(λ) requires a constant stepsize. Later, for average cost problems, we will see that there is an exceptional case, associated with periodic Markov chains, where a stepsize less than 1 is essential for the convergence of LSPE(0).

Regarding (v), once minimization over multiple policies is introduced [so T and T(λ) are nonlinear], or optimistic variants are used, the behavior of the methods becomes quite peculiar and unpredictable, because ΠT(λ) may not be a contraction.† Generally, even in the case where T(λ) is nonlinear, if ΠT(λ) is a contraction, it has a unique fixed point, and the peculiarities associated with chattering do not arise. On the other hand, there are examples where ΠT(λ) has no fixed point, and examples where it has multiple fixed points; see Bertsekas and Tsitsiklis [BeT96] (Example 6.7), and de Farias and Van Roy [DFV00]. As discussed in [Ber09], the issues associated with the asymptotic behavior of optimistic methods, or even (nonoptimistic) approximate policy iteration, are not well understood at present. Section 6.4.2 of Bertsekas and Tsitsiklis [BeT96] gives examples of the chattering phenomenon for optimistic TD(λ), and the associated figures suggest their enormously complex nature: the points where subsets in the greedy partition join are potential “points of attraction” of the various algorithms, and act as forms of “local minima.”

† Similar to the single-policy case, it can be shown that when T(λ) is a sup-norm contraction, the corresponding scaled iteration converges to the unique fixed point of ΠT(λ), provided the constant stepsize γ is sufficiently small.
Note that there are limited classes of problems, involving multiple policies, where the mapping ΠT(λ) is a contraction; an example, optimal stopping problems, is discussed in Sections 6.4 and 6.8. Finally, let us note that the LSTD(λ) method relies on the linearity of the mapping T, and it has no practical generalization for the case where T is nonlinear.

6.4 Q-LEARNING

We now introduce another method for discounted problems, which is suitable for cases where there is no explicit model of the system and the cost structure (a model-free context). The method is related to value iteration and has the additional advantage that it can be used directly in the case of multiple policies. Instead of approximating the cost function of a particular policy, it updates the Q-factors associated with an optimal policy, thereby avoiding the multiple policy evaluation steps of the policy iteration method.

In the discounted problem, the Q-factors are defined, for all pairs (i, u) with u ∈ U(i), by

    Q*(i, u) = Σ_{j=1}^n pij(u) (g(i, u, j) + αJ*(j)).

Using Bellman’s equation, we see that the Q-factors satisfy, for all (i, u),

    Q*(i, u) = Σ_{j=1}^n pij(u) (g(i, u, j) + α min_{v∈U(j)} Q*(j, v)),    (6.99)

and can be shown to be the unique solution of this set of equations. The proof is essentially the same as the proof of existence and uniqueness of solution of Bellman’s equation. In fact, by introducing a system whose states are the original states 1, …, n, together with all the pairs (i, u), the above set of equations can be seen to be a special case of Bellman’s equation (see Fig. 6.4.1). Furthermore, the Q-factors can be obtained by the value iteration Qk+1 = FQk, where F is the mapping defined by

    (FQ)(i, u) = Σ_{j=1}^n pij(u) (g(i, u, j) + α min_{v∈U(j)} Q(j, v)),    ∀ (i, u),    (6.100)

and it can be verified that F is a sup-norm contraction with modulus α.

The Q-learning algorithm is an approximate version of value iteration, whereby the expected value in Eq. (6.100) is suitably approximated. In particular, an infinitely long sequence of state-control pairs {(ik, uk)} is generated according to some probabilistic mechanism.
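Before turning to the sampled version, the exact value iteration Qk+1 = FQk of Eq. (6.100) can be sketched directly. The 2-state, 2-control problem data below are made up for illustration; since F is a sup-norm contraction with modulus α, the iterates converge geometrically to Q*.

```python
import numpy as np

# Exact Q-factor value iteration Q_{k+1} = F Q_k [cf. Eq. (6.100)] on an
# illustrative 2-state, 2-control problem (all data are made up).
np.random.seed(3)
n, m, alpha = 2, 2, 0.9
P = np.random.dirichlet(np.ones(n), size=(n, m))   # P[i, u, :] = (p_ij(u))_j
g = np.random.rand(n, m, n)                        # g(i, u, j)

def F(Q):
    # (FQ)(i,u) = sum_j p_ij(u) * (g(i,u,j) + alpha * min_v Q(j,v))
    return np.einsum('iuj,iuj->iu', P, g + alpha * Q.min(axis=1)[None, None, :])

Q = np.zeros((n, m))
for _ in range(500):
    Q = F(Q)
# After 500 iterations the residual ||F(Q) - Q|| is of order alpha**500.
```

This exact iteration requires the model (the pij(u) and expected costs); Q-learning replaces the expectation in F with samples, which is what makes it applicable in the model-free context.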
Figure 6.4.1. Modified problem where the state-control pairs (i, u) are viewed as additional states. The transitions from (i, u) to j are according to the transition probabilities pij(u) and incur a cost g(i, u, j). Once the control v is chosen, the transition from j to (j, v) occurs with probability 1 and incurs no cost. The bottom part of the figure corresponds to a fixed policy µ, in which case state j is followed by the pair (j, µ(j)).

Given the pair (ik, uk), a state jk is generated according to the probabilities pik j(uk). Then the Q-factor of (ik, uk) is updated using a stepsize γk > 0, while all other Q-factors are left unchanged:

    Qk+1(i, u) = (1 − γk)Qk(i, u) + γk (Fk Qk)(i, u),    ∀ (i, u),    (6.101)

where the components of the vector Fk Qk are defined by

    (Fk Qk)(ik, uk) = g(ik, uk, jk) + α min_{v∈U(jk)} Qk(jk, v),    (6.102)

and

    (Fk Qk)(i, u) = Qk(i, u),    ∀ (i, u) ≠ (ik, uk).    (6.103)

Note from Eqs. (6.101) and (6.102) that the Q-factor of the current sample (ik, uk) is modified by the iteration

    Qk+1(ik, uk) = Qk(ik, uk) + γk (g(ik, uk, jk) + α min_{u∈U(jk)} Qk(jk, u) − Qk(ik, uk)),

and that the rightmost term above has the character of a temporal difference.
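The Q-learning iteration (6.101)-(6.103) can be sketched as follows, on the same kind of illustrative 2-state, 2-control problem (all data are made up). Pairs (ik, uk) are sampled uniformly, so every pair occurs infinitely often, and γk = 1/(number of visits to the pair) is a diminishing stepsize.

```python
import numpy as np

# Sketch of the Q-learning iteration (6.101)-(6.103); illustrative data.
np.random.seed(4)
n, m, alpha = 2, 2, 0.6
P = np.random.dirichlet(np.ones(n), size=(n, m))   # P[i, u, :] = (p_ij(u))_j
g = np.random.rand(n, m, n)                        # g(i, u, j)

Q = np.zeros((n, m))
visits = np.zeros((n, m))
for k in range(40000):
    i, u = np.random.randint(n), np.random.randint(m)   # sample (i_k, u_k)
    j = np.random.choice(n, p=P[i, u])                  # successor under p_ij(u)
    visits[i, u] += 1
    gamma = 1.0 / visits[i, u]                          # diminishing stepsize
    target = g[i, u, j] + alpha * Q[j].min()            # (F_k Q_k)(i_k, u_k)
    Q[i, u] = (1 - gamma) * Q[i, u] + gamma * target    # only (i_k, u_k) changes
```

After many iterations, Q approximately satisfies the Q-factor equation (6.99), i.e., Q ≈ FQ for the mapping F of Eq. (6.100).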
u). while all the other Qfactors are left unchanged. uk )}. Qlearning corresponds to the special case where f (i.4). uk )}.104) converges to the optimal Qfactor vector provided all statecontrol pairs (i. [cf.4. and with iterates that are not uptodate). Furthermore. and that the successor states j must be independently sampled at each occurrence of a given statecontrol pair. or has some monotonicity properties and a ﬁxed point. as we now proceed to discuss.100) is replaced by the mapping H given by n (HV )(m) = j=1 q(m. Qk (i.† † Much of the theory of Qlearning can be generalized to problems with postdecision states. This is the algorithm Qk+1 (i. 6. Chief among these conditions are that all statecontrol pairs (i. (6. u) = (F Qk )(ik . (6.3. u) = (i. some conditions must be satisﬁed.1. uk )}. iteration with a mapping that is either a contraction with respect to a weighted supnorm. the stepsize γk should be diminishing to 0 at an appropriate rate. with diﬀerent frequencies for diﬀerent components. it can be shown that the algorithm (6. One or .100). u) = (ik . u) if (i. u) must be generated inﬁnitely often within the inﬁnitely long sequence {(ik . u): the mapping F of Eq. where the expected value in the deﬁnition (6. u). where pij (u) is of the form q f (i. u) are generated inﬁnitely often within the sequence {(ik . u) = (ik . and a value iteration algorithm where only the Qfactor corresponding to the pair (ik . In particular. We can view this algorithm as a special case of an asynchronous value iteration algorithm of the type discussed in Section 1. converges when executed asynchronously (i. † Generally. for such problems one may develop similar asynchronous simulationbased versions of value iteration for computing the optimal costtogo function V ∗ of the postdecision states m = f (i.1 Convergence Properties of QLearning We will explain the convergence properties of Qlearning. (6. j) min u∈U (j) g(j.104) where F is the mapping (6. j (cf. 
In the process we will derive some variants of Qlearning that may oﬀer computational advantages in some situations.11)].† Consider the inﬁnitely long sequence {(ik . u) . uk ).402 Approximate Dynamic Programming Chap.100) of the mapping F is approximated via a form of Monte Carlo averaging. u) + αV f (j. Using the analysis of GaussSeidel value iteration and related methods given in that section.103) to the optimal Qfactors.e. uk ) is updated at iteration k. ∀ m. by viewing it as an asynchronous value iteration algorithm. Section 6.. uk ).101)(6. 6 To guarantee the convergence of the algorithm (6. uk ) if (i. Eq.
Suppose now that we replace the expected value in the definition (6.100) of F with a Monte Carlo estimate based on all the samples up to time k that involve (i_k, u_k). In particular, let n_k be the number of times the current state-control pair (i_k, u_k) has been generated up to and including time k, and let

    T_k = { t | (i_t, u_t) = (i_k, u_k), 0 ≤ t ≤ k }

be the set of corresponding time indexes. The algorithm is given by

    Q_{k+1}(i,u) = (F̃_k Q_k)(i,u)   if (i,u) = (i_k, u_k),
    Q_{k+1}(i,u) = Q_k(i,u)          if (i,u) ≠ (i_k, u_k),        (6.105)

where the components of the vector F̃_k Q_k are defined by

    (F̃_k Q_k)(i_k, u_k) = (1/n_k) Σ_{t∈T_k} [ g(i_t, u_t, j_t) + α min_{v∈U(j_t)} Q_k(j_t, v) ],        (6.106)

and

    (F̃_k Q_k)(i,u) = Q_k(i,u),   ∀ (i,u) ≠ (i_k, u_k).        (6.107)

Comparing the preceding equations and Eq. (6.100), and using the law of large numbers, it is clear that for each (i,u), we have with probability 1

    lim_{k→∞, k∈T(i,u)} [ (F̃_k Q_k)(i,u) − (F Q_k)(i,u) ] = 0,

where T(i,u) = { k | (i_k, u_k) = (i,u) }. From this, it can be shown that the algorithm (6.105)-(6.107) converges with probability 1 to the optimal Q-factors [assuming again that all state-control pairs (i,u) are generated infinitely often within the sequence {(i_k, u_k)}].

From the point of view of convergence rate, the algorithm (6.105)-(6.107) is quite satisfactory, but unfortunately it may have a significant drawback: it requires excessive overhead per iteration to calculate the Monte Carlo estimate (F̃_k Q_k)(i_k, u_k) using Eq. (6.106). In particular, while the term (1/n_k) Σ_{t∈T_k} g(i_t, u_t, j_t) in this equation can be recursively updated with minimal overhead, the term

    (1/n_k) Σ_{t∈T_k} min_{v∈U(j_t)} Q_k(j_t, v)        (6.108)

must be completely recomputed at each iteration, using the current vector Q_k. This may be impractical, since the above summation may potentially involve a very large number of terms.†
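The Monte Carlo averaged iteration (6.105)-(6.107), including the costly recomputation of the minimization term, can be sketched as follows. The toy problem and the uniform pair selection are illustrative assumptions.

```python
import random
from collections import defaultdict

def mc_value_iteration(n, controls, g, sample_next, alpha, num_iters, seed=0):
    """Sketch of the Monte Carlo averaged value iteration (6.105)-(6.107):
    the updated pair receives the average, over ALL its past samples, of
    g(i_t,u_t,j_t) + alpha * min_v Q_k(j_t,v), with the minimization
    re-evaluated at the CURRENT Q_k -- the costly step noted in the text."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    avg_cost = defaultdict(float)  # running average of the cost term
    samples = defaultdict(list)    # successor states j_t for t in T_k
    for _ in range(num_iters):
        i = rng.randrange(n)
        u = rng.choice(controls[i])
        j = sample_next(i, u, rng)
        samples[(i, u)].append(j)
        nk = len(samples[(i, u)])
        # the cost average is updated recursively with minimal overhead ...
        avg_cost[(i, u)] += (g(i, u, j) - avg_cost[(i, u)]) / nk
        # ... but the term (6.108) is recomputed from scratch each time
        min_term = sum(min(Q[(jt, v)] for v in controls[jt])
                       for jt in samples[(i, u)]) / nk
        Q[(i, u)] = avg_cost[(i, u)] + alpha * min_term
    return Q

# Same toy chain as before: all transitions go to state 1, unit stage
# cost, alpha = 0.5; the optimal Q-factors are both 2.
Q = mc_value_iteration(2, [[0], [0]], lambda i, u, j: 1.0,
                       lambda i, u, rng: 1, 0.5, 400)
```

Note that the per-iteration cost of the `min_term` loop grows with the number of stored samples, which is precisely the overhead concern discussed in the text.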
† We note a special type of problem where the overhead involved in updating the term (6.108) may be manageable. This is the case where for each pair (i,u) the set S(i,u) of possible successor states j [the ones with p_ij(u) > 0] has small cardinality. Then for each (i,u), we may maintain, with minimal extra overhead, the numbers of times that each successor state j ∈ S(i,u) has occurred up to time k, and use them to compute efficiently the troublesome term (6.108). In particular, we may implement the algorithm (6.105)-(6.106) as

    Q_{k+1}(i_k, u_k) = Σ_{j∈S(i_k,u_k)} (n_k(j)/n_k) [ g(i_k, u_k, j) + α min_{v∈U(j)} Q_k(j, v) ],        (6.109)

where n_k(j) is the number of times the transition (i_k, j), j ∈ S(i_k, u_k), occurred at state i_k under u_k in the simulation up to time k, i.e., n_k(j) is the cardinality of the set {j_t = j | t ∈ T_k}. Note that this amounts to replacing the probabilities p_{i_k j}(u_k) in the mapping (6.100) with their Monte Carlo estimates n_k(j)/n_k. While the minimization term in Eq. (6.109), min_{v∈U(j)} Q_k(j, v), has to be computed for all j ∈ S(i_k, u_k) [rather than for just j_k as in the Q-learning algorithm (6.101)-(6.103)], the extra computation is not excessive if the cardinalities of S(i_k, u_k) and U(j), j ∈ S(i_k, u_k), are small. This approach can be strengthened if some of the probabilities p_{i_k j}(u_k) are known, in which case they can be used directly in Eq. (6.109). Generally, any estimate of p_{i_k j}(u_k) can be used in place of n_k(j)/n_k, as long as the estimate converges to p_{i_k j}(u_k) as k → ∞.

Motivated by the preceding concern, let us modify the algorithm and replace the offending term (6.108) in Eq. (6.106) with

    (1/n_k) Σ_{t∈T_k} min_{v∈U(j_t)} Q_t(j_t, v),        (6.110)

which depends on all the iterates Q_t, t ∈ T_k, but can be computed recursively. This is the algorithm (6.105)-(6.106) with the term (6.108) replaced by the term (6.110), which is updated with minimal overhead at each iteration.
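The footnote's variant (6.109), which replaces the transition probabilities by their empirical frequencies n_k(j)/n_k, can be sketched as follows; the toy problem and the uniform pair selection are again illustrative assumptions.

```python
import random
from collections import defaultdict

def empirical_model_iteration(n, controls, g, sample_next, alpha,
                              num_iters, seed=0):
    """Sketch of the variant (6.109): the probabilities p_{i_k j}(u_k) are
    replaced by the empirical frequencies n_k(j)/n_k of the observed
    transitions, and the backup runs only over the observed successors
    S(i_k, u_k), which is cheap when that set is small."""
    rng = random.Random(seed)
    Q = defaultdict(float)
    counts = defaultdict(lambda: defaultdict(int))  # counts[(i,u)][j] = n_k(j)
    for _ in range(num_iters):
        i = rng.randrange(n)
        u = rng.choice(controls[i])
        j = sample_next(i, u, rng)
        counts[(i, u)][j] += 1
        nk = sum(counts[(i, u)].values())
        Q[(i, u)] = sum(
            (nkj / nk) * (g(i, u, succ)
                          + alpha * min(Q[(succ, v)] for v in controls[succ]))
            for succ, nkj in counts[(i, u)].items())
    return Q

# Same toy chain: all transitions go to state 1, unit stage cost,
# alpha = 0.5; the optimal Q-factors are both 2.
Q = empirical_model_iteration(2, [[0], [0]], lambda i, u, j: 1.0,
                              lambda i, u, rng: 1, 0.5, 400)
```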
This algorithm has the form

    Q_{k+1}(i_k, u_k) = (F̂_k Q_k)(i_k, u_k),        (6.111)
    Q_{k+1}(i,u) = Q_k(i,u),   ∀ (i,u) ≠ (i_k, u_k),        (6.112)

where

    (F̂_k Q_k)(i_k, u_k) = (1/n_k) Σ_{t∈T_k} g(i_t, u_t, j_t) + (1/n_k) Σ_{t∈T_k} min_{v∈U(j_t)} Q_t(j_t, v),        (6.113)

which may also be updated recursively. We now show that this (approximate) value iteration algorithm is essentially the Q-learning algorithm. Indeed, let us observe that the iteration (6.111) can be written as

    Q_{k+1}(i_k, u_k) = ((n_k − 1)/n_k) Q_k(i_k, u_k) + (1/n_k) [ g(i_k, u_k, j_k) + α min_{v∈U(j_k)} Q_k(j_k, v) ],

so it reduces to the Q-learning algorithm (6.101)-(6.103) with a stepsize γ_k = 1/n_k. It can be similarly shown that the algorithm (6.111)-(6.112), equipped with a stepsize parameter, say

    γ_k = γ/n_k,

where γ is a positive constant, is equivalent to the Q-learning algorithm with a different stepsize.†

† A potentially more effective algorithm is to introduce a window of size m ≥ 0, and consider a more general scheme, a variant of Q-learning, that replaces the offending term (6.108) by

    (1/n_k) [ Σ_{t∈T_k, t≤k−m} min_{v∈U(j_t)} Q_{t+m}(j_t, v) + Σ_{t∈T_k, t>k−m} min_{v∈U(j_t)} Q_k(j_t, v) ],

which may also be updated recursively. The algorithm updates at time k the values of min_{v∈U(j_t)} Q(j_t, v) to min_{v∈U(j_t)} Q_k(j_t, v) for all t ∈ T_k within the window k − m ≤ t ≤ k, and fixes them at the last updated value for t outside this window. For m = 0, it reduces to the algorithm (6.111)-(6.112). For moderate values of m it involves moderate additional overhead, and it is likely a more accurate approximation to the term (6.108) than the term (6.110) [min_{v∈U(j_t)} Q_{t+m}(j_t, v) presumably approximates better than min_{v∈U(j_t)} Q_t(j_t, v) the "correct" term min_{v∈U(j_t)} Q_k(j_t, v)].
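The equivalence of sample averaging with a Q-learning stepsize γ_k = 1/n_k rests on the elementary identity that a sample average can be formed incrementally. A minimal check (the target values are made up):

```python
def incremental_averages(targets):
    """The sample average A_n of the targets s_1, ..., s_n obeys
    A_n = (1 - 1/n) A_{n-1} + (1/n) s_n, which is exactly a Q-learning
    step on the running estimate with stepsize gamma_k = 1/n_k."""
    A, out = 0.0, []
    for n, s in enumerate(targets, start=1):
        A = (1.0 - 1.0 / n) * A + s / n  # incremental form
        out.append(A)
    return out

avgs = incremental_averages([3.0, 1.0, 2.0])
# agrees with the batch averages of the first n targets: 3, 2, 2
```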
The preceding analysis provides a view of Q-learning as an approximation to asynchronous value iteration (updating one component at a time) that uses Monte Carlo sampling in place of the exact expected value in the mapping F of Eq. (6.100). It also justifies the use of a diminishing stepsize that goes to 0 at a rate proportional to 1/n_k, where n_k is the number of times the pair (i_k, u_k) has been generated up to time k. However, it does not constitute a convergence proof, because the Monte Carlo estimate used to approximate the expected value in the definition (6.100) of F is accurate only in the limit, if Q_k converges. We refer to Tsitsiklis [Tsi94b] for a rigorous proof of convergence of Q-learning, which uses the theoretical machinery of stochastic approximation algorithms.

6.4.2 Q-Learning and Approximate Policy Iteration

Despite its theoretical convergence guarantees, Q-learning has some drawbacks, the most important of which is that the number of Q-factors/state-control pairs (i,u) may be excessive. To alleviate this difficulty, we may introduce a state aggregation scheme; we discuss this possibility in Section 6.5. Alternatively, we may introduce a linear approximation architecture for the Q-factors, similar to the policy evaluation schemes of Section 6.3. This is the subject of the next two subsections.

We will now consider Q-learning methods with linear Q-factor approximation. As we discussed earlier (cf. Fig. 6.4.1), we may view Q-factors as optimal costs of a certain discounted DP problem, whose states are the state-control pairs (i,u). We may thus apply the TD/approximate policy iteration methods of Section 6.3. For this, we need to introduce a linear parametric architecture Q̃(i,u,r),

    Q̃(i,u,r) = φ(i,u)′r,        (6.114)

where φ(i,u) is a feature vector that depends on both state and control.

At the typical iteration, given the current policy µ, these methods find an approximate solution Q̃_µ(i,u,r) of the projected equation for Q-factors corresponding to µ, and then obtain a new policy µ̄ by

    µ̄(i) = arg min_{u∈U(i)} Q̃_µ(i,u,r).

For example, similar to our discussion in Section 6.3, LSTD(0) with a linear parametric architecture of the form (6.114) generates a trajectory (i_0,u_0), (i_1,u_1), ..., using the current policy [u_t = µ(i_t)], and finds at time k the unique solution of the projected equation [cf. Eq. (6.46)]

    Σ_{t=0}^k φ(i_t, u_t) q_{k,t} = 0,
where q_{k,t} are the corresponding TDs

    q_{k,t} = φ(i_t, u_t)′r_k − αφ(i_{t+1}, u_{t+1})′r_k − g(i_t, u_t, i_{t+1}).        (6.115)

Also, LSPE(0) is given by [cf. Eq. (6.57)]

    r_{k+1} = r_k − (γ/(k+1)) G_k Σ_{t=0}^k φ(i_t, u_t) q_{k,t},        (6.116)

where γ is a positive stepsize and G_k is a positive definite matrix, such as

    G_k = ( (β/(k+1)) I + (1/(k+1)) Σ_{t=0}^k φ(i_t, u_t)φ(i_t, u_t)′ )^{-1},

with β > 0, or a diagonal approximation thereof. There are also optimistic approximate policy iteration methods based on LSPE(0), LSTD(0), and TD(0), similar to the ones we discussed earlier. As an example, let us consider the extreme case of TD(0) that uses a single sample between policy updates. At the start of iteration k, we have the current parameter vector r_k, we are at some state i_k, and we have chosen a control u_k. Then:

(1) We simulate the next transition (i_k, i_{k+1}) using the transition probabilities p_{i_k j}(u_k).

(2) We generate the control u_{k+1} from the minimization

    u_{k+1} = arg min_{u∈U(i_{k+1})} Q̃(i_{k+1}, u, r_k).        (6.117)

(3) We update the parameter vector via

    r_{k+1} = r_k − γ_k φ(i_k, u_k) q_{k,k},        (6.118)

where γ_k is a positive stepsize, and q_{k,k} is the TD

    q_{k,k} = φ(i_k, u_k)′r_k − αφ(i_{k+1}, u_{k+1})′r_k − g(i_k, u_k, i_{k+1}),

[cf. Eq. (6.115)].

The process is now repeated with r_{k+1}, i_{k+1}, and u_{k+1} replacing r_k, i_k, and u_k.
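Steps (1)-(3) can be sketched as follows. The one-hot features over state-control pairs (so that r is effectively a table), the toy two-state problem, and the ε-greedy exploration device are all illustrative assumptions; the text only notes that exploration is needed, without prescribing a mechanism.

```python
import random

def td0_q(num_states, num_controls, step, g, alpha, gamma, iters,
          eps=0.3, seed=0):
    """Sketch of optimistic TD(0) with the architecture (6.114), using
    one-hot phi(i,u), so r reduces to a table indexed by (i,u)."""
    rng = random.Random(seed)
    r = [[0.0] * num_controls for _ in range(num_states)]
    i, u = 0, 0
    for _ in range(iters):
        j = step(i, u, rng)                       # (1) simulate transition
        if rng.random() < eps:                    # exploration (assumed)
            u_next = rng.randrange(num_controls)
        else:                                     # (2) greedy minimization (6.117)
            u_next = min(range(num_controls), key=lambda v: r[j][v])
        # (3) gradient step (6.118) on the temporal difference
        td = r[i][u] - alpha * r[j][u_next] - g(i, u, j)
        r[i][u] -= gamma * td
        i, u = j, u_next
    return r

# Toy problem: control u moves deterministically to state u at stage
# cost u, alpha = 0.9; control 0 is clearly optimal at both states.
r = td0_q(2, 2, lambda i, u, rng: u, lambda i, u, j: float(u),
          0.9, 0.1, 5000)
```

After training, the greedy policy extracted from r should select the cheaper control 0 at both states, with the learned value of control 1 strictly larger.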
In simulation-based methods, a major concern is the issue of exploration in the approximate policy evaluation step, to ensure that state-control pairs (i, µ(i)) are generated sufficiently often in the simulation. For this, the exploration-enhanced schemes discussed in Section 6.3.8 may be used in conjunction with LSTD. As an example, given the current policy µ, we may use any exploration-enhanced transition mechanism to generate a sequence (i_0,u_0), (i_1,u_1), ..., and then use LSTD(0) with extra transitions (i_k,u_k) → j_k, where j_k is generated from (i_k,u_k) using the transition probabilities p_{i_k j}(u_k) (cf. Section 6.3.8). Alternatively, we may use an exploration scheme based on LSTD(λ) with modified temporal differences (cf. Section 6.3.8). In such a scheme, we generate a sequence of state-control pairs (i_0,u_0), (i_1,u_1), ..., according to transition probabilities

    p_{i_k i_{k+1}}(u_k) ν(u_{k+1} | i_{k+1}),

where ν(u | i) is a probability distribution over the control constraint set U(i), which provides a mechanism for exploration. Note that in this case the calculation of the probability ratio in the modified temporal difference of Eq. (6.93) does not require knowledge of the transition probabilities p_ij(u), since these probabilities appear both in the numerator and the denominator and cancel out. Moreover, the required amount of exploration is likely to be substantial, so the underlying mapping ΠT may not be a contraction, in which case the validity of LSPE(0) or TD(0) comes into doubt.

Q-learning with approximate policy iteration is often tried because of its model-free character [it does not require knowledge of the transition probabilities p_ij(u)]. However, as in other forms of policy iteration, the behavior of all the algorithms described is very complex, involving for example near-singular matrix inversion (cf. Section 6.3.4) or policy oscillations (cf. Section 6.3.6), and there is no guarantee of success (except for general error bounds for approximate policy iteration methods).

6.4.3 Q-Learning for Optimal Stopping Problems

The policy evaluation algorithms of Section 6.3, such as TD(λ), LSPE(λ), and LSTD(λ), apply when there is a single policy to be evaluated in the context of approximate policy iteration. We may try to extend these methods to the case of multiple policies, in the context of Q-learning, by aiming to solve by simulation the projected equation

    Φr = ΠT(Φr),

where T is a DP mapping that now involves minimization over multiple controls. However, there are some difficulties:

(a) The mapping ΠT is nonlinear, so a simulation-based approximation approach like LSTD breaks down.
(b) ΠT may not in general be a contraction with respect to any norm, so the PVI iteration Φr_{k+1} = ΠT(Φr_k) [cf. Eq. (6.35)] may diverge, and simulation-based LSPE-like approximations may also diverge.

(c) Even if ΠT is a contraction, so the above PVI iteration converges, the implementation of LSPE-like approximations may not admit an efficient recursive implementation, because T(Φr_k) is a nonlinear function of r_k.

In this section we discuss the extension of iterative LSPE-type ideas to the special case of an optimal stopping problem, where the last two difficulties noted above can be largely overcome. Optimal stopping problems are a special case of DP problems where we can only choose whether to terminate at the current state or not. Examples are problems of search, sequential hypothesis testing, and pricing of derivative financial instruments (see Section 4.4 of Vol. I, and Section 3.4 of the present volume).

We are given a Markov chain with state space {1, ..., n}, described by transition probabilities p_ij. We assume that the states form a single recurrent class, so that the steady-state distribution vector ξ = (ξ_1, ..., ξ_n) satisfies ξ_i > 0 for all i. Given the current state i, we assume that we have two options: to stop and incur a cost c(i), or to continue and incur a cost g(i,j), where j is the next state (there is no control to affect the corresponding transition probabilities). The problem is to minimize the associated α-discounted infinite horizon cost.

We associate a Q-factor with each of the two possible decisions. The Q-factor for the decision to stop is equal to c(i). The Q-factor for the decision to continue is denoted by Q(i), and satisfies Bellman's equation

    Q(i) = Σ_{j=1}^n p_ij [ g(i,j) + α min{ c(j), Q(j) } ].        (6.119)

The Q-learning algorithm generates an infinitely long sequence of states {i_0, i_1, ...}, with all states generated infinitely often, and a corresponding sequence of transitions {(i_k, j_k)}, generated according to the transition probabilities p_{i_k j}. It updates the Q-factor for the decision to continue as follows [cf. Eqs. (6.101)-(6.103)]:

    Q_{k+1}(i) = (1 − γ_k) Q_k(i) + γ_k (F_k Q_k)(i),   ∀ i,

where the components of the mapping F_k are defined by

    (F_k Q)(i_k) = g(i_k, j_k) + α min{ c(j_k), Q(j_k) },

and

    (F_k Q)(i) = Q(i),   ∀ i ≠ i_k.
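This stopping-problem update can be sketched as follows; the toy two-state chain with uniform transitions, the single simulated trajectory, and the 1/(number of visits) stepsize are illustrative assumptions.

```python
import random

def stopping_q_learning(n, p_sample, g, c, alpha, num_iters, seed=0):
    """Sketch of Q-learning for optimal stopping: only the Q-factor (for
    the decision to continue) of the visited state is updated, toward
    g(i,j) + alpha * min(c(j), Q(j))."""
    rng = random.Random(seed)
    Q = [0.0] * n
    visits = [0] * n
    i = 0
    for _ in range(num_iters):
        j = p_sample(i, rng)            # next state sampled from p_ij
        visits[i] += 1
        gamma = 1.0 / visits[i]         # diminishing stepsize (assumed)
        target = g(i, j) + alpha * min(c[j], Q[j])
        Q[i] = (1.0 - gamma) * Q[i] + gamma * target
        i = j
    return Q

# Toy chain: two states, uniform transitions, unit continuation cost,
# stopping costs c = (0.5, 5), alpha = 0.5. Bellman's equation (6.119)
# gives Q(0) = Q(1) = 1.5, so it is optimal to stop at state 0 only.
Q = stopping_q_learning(2, lambda i, rng: rng.randrange(2),
                        lambda i, j: 1.0, [0.5, 5.0], 0.5, 40000)
```

With these numbers the stopping rule c(i) ≤ Q(i) fires at state 0 (0.5 ≤ 1.5) but not at state 1 (5 > 1.5).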
The convergence of this algorithm is addressed by the general theory of Q-learning discussed earlier. Once the Q-factors are calculated, an optimal policy can be implemented by stopping at state i if and only if c(i) ≤ Q(i). However, when the number of states is very large, the algorithm is impractical, which motivates Q-factor approximations.

Let us introduce the mapping F : ℜ^n → ℜ^n given by

    (FQ)(i) = Σ_{j=1}^n p_ij [ g(i,j) + α min{ c(j), Q(j) } ].

This mapping can be written in more compact notation as

    FQ = g + αP f(Q),

where g is the vector whose ith component is

    Σ_{j=1}^n p_ij g(i,j),        (6.120)

and f(Q) is the function whose jth component is

    f_j(Q) = min{ c(j), Q(j) }.        (6.121)

We note that the (exact) Q-factor for the choice to continue is the unique fixed point of F [cf. Eq. (6.119)].

Let ||·||_ξ be the weighted Euclidean norm associated with the steady-state probability vector ξ. We claim that F is a contraction with respect to this norm. Indeed, for any two vectors Q and Q̄, we have

    |(FQ)(i) − (FQ̄)(i)| ≤ α Σ_{j=1}^n p_ij | f_j(Q) − f_j(Q̄) | ≤ α Σ_{j=1}^n p_ij | Q(j) − Q̄(j) |,

or

    |FQ − FQ̄| ≤ αP |Q − Q̄|,

where we use the notation |x| to denote a vector whose components are the absolute values of the components of x. Hence,

    ||FQ − FQ̄||_ξ ≤ α || P|Q − Q̄| ||_ξ ≤ α ||Q − Q̄||_ξ,

where the last step follows from the inequality ||PJ||_ξ ≤ ||J||_ξ, which holds for every vector J (cf. Lemma 6.3.1). We conclude that F is a contraction with respect to ||·||_ξ, with modulus α.
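The contraction bound can be checked numerically. In the sketch below, the 3-state chain is made up and chosen doubly stochastic purely so that its steady-state distribution ξ is uniform and the example stays self-contained.

```python
import random

def F(Q, P, G, c, alpha):
    """The mapping (FQ)(i) = sum_j p_ij [ g(i,j) + alpha*min(c(j), Q(j)) ]."""
    n = len(Q)
    return [sum(P[i][j] * (G[i][j] + alpha * min(c[j], Q[j]))
                for j in range(n)) for i in range(n)]

def xi_norm(x, xi):
    """Weighted Euclidean norm ||x||_xi."""
    return sum(w * v * v for w, v in zip(xi, x)) ** 0.5

# doubly stochastic transition matrix => uniform steady-state xi
P = [[0.2, 0.5, 0.3], [0.3, 0.2, 0.5], [0.5, 0.3, 0.2]]
xi = [1 / 3, 1 / 3, 1 / 3]
G = [[1.0] * 3 for _ in range(3)]
c = [0.5, 2.0, 1.0]
alpha = 0.9
rng = random.Random(0)
for _ in range(200):
    Q1 = [rng.uniform(-5, 5) for _ in range(3)]
    Q2 = [rng.uniform(-5, 5) for _ in range(3)]
    lhs = xi_norm([a - b for a, b in zip(F(Q1, P, G, c, alpha),
                                         F(Q2, P, G, c, alpha))], xi)
    rhs = alpha * xi_norm([a - b for a, b in zip(Q1, Q2)], xi)
    assert lhs <= rhs + 1e-12  # ||FQ - FQ'||_xi <= alpha ||Q - Q'||_xi
```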
We will now consider Q-factor approximations, using a linear approximation architecture

    Q̃(i,r) = φ(i)′r,

where φ(i) is an s-dimensional feature vector associated with state i. We also write the vector (Q̃(1,r), ..., Q̃(n,r))′ in the compact form Φr, where as in Section 6.3, Φ is the n × s matrix whose rows are φ(i)′, i = 1, ..., n. We assume that Φ has rank s, and we denote by Π the projection mapping with respect to ||·||_ξ on the subspace S = {Φr | r ∈ ℜ^s}.

Because F is a contraction with respect to ||·||_ξ with modulus α, and Π is nonexpansive, the mapping ΠF is a contraction with respect to ||·||_ξ with modulus α. Therefore, the algorithm

    Φr_{k+1} = ΠF(Φr_k)        (6.122)

converges to the unique fixed point of ΠF. This is the analog of the PVI algorithm (cf. Section 6.3.2).

Similar to Section 6.3.2, we can write the PVI iteration (6.122) as

    r_{k+1} = arg min_{r∈ℜ^s} || Φr − ( g + αP f(Φr_k) ) ||²_ξ,        (6.123)

where g and f are defined by Eqs. (6.120) and (6.121). By setting to 0 the gradient of the quadratic function in Eq. (6.123), we see that the iteration is written as

    r_{k+1} = r_k − (Φ′ΞΦ)^{-1} ( C(r_k) − d ),

where

    C(r_k) = Φ′Ξ( Φr_k − αP f(Φr_k) ),   d = Φ′Ξg.

Similar to Section 6.3, we may implement a simulation-based approximate version of this iteration, thereby obtaining an analog of the LSPE(0) method. In particular, we generate a single infinitely long simulated trajectory (i_0, i_1, ...), using the transition probabilities p_ij. Following the transition (i_k, i_{k+1}), we update r_k by

    r_{k+1} = r_k − ( Σ_{t=0}^k φ(i_t)φ(i_t)′ )^{-1} Σ_{t=0}^k φ(i_t) q_{k,t},        (6.124)

where q_{k,t} is the TD,

    q_{k,t} = φ(i_t)′r_k − α min{ c(i_{t+1}), φ(i_{t+1})′r_k } − g(i_t, i_{t+1}).        (6.125)

Similar to the calculations involving the relation between PVI and LSPE, it can be shown that r_{k+1} as given by this iteration is equal to the iterate produced by the iteration Φr_{k+1} = ΠF(Φr_k) plus a simulation-induced error that asymptotically converges to 0 with probability 1 (see the paper by Yu and Bertsekas [YuB07], to which we refer for further analysis).
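A minimal sketch of the iteration (6.124)-(6.125) follows, deliberately using a single constant feature φ(i) = 1 so that r is a scalar and the matrix inverse in (6.124) reduces to division by k+1. The toy stopping problem (uniform two-state chain, unit continuation cost, stopping costs 0.5 and 5, α = 0.5) is an assumption; with a constant feature its projected fixed point coincides with the exact constant Q-factor, 1.5.

```python
import random

def stopping_lspe0(c, g_cost, sample_next, alpha, num_steps, seed=0):
    """Sketch of the LSPE(0) analog (6.124)-(6.125) with phi(i) = 1.
    Note that the min terms are recomputed over the whole trajectory at
    every step, which is the overhead discussed in the text."""
    rng = random.Random(seed)
    traj = [0]
    for _ in range(num_steps):
        traj.append(sample_next(traj[-1], rng))
    r = 0.0
    for k in range(1, num_steps + 1):
        # sum of TDs q_{k,t} over t = 0, ..., k-1, with phi = 1
        s = sum(r - alpha * min(c[traj[t + 1]], r)
                - g_cost(traj[t], traj[t + 1]) for t in range(k))
        r -= s / k  # (sum phi phi')^{-1} is 1/k here
    return r

r = stopping_lspe0([0.5, 5.0], lambda i, j: 1.0,
                   lambda i, rng: rng.randrange(2), 0.5, 2000)
```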
As a result, the generated sequence {Φr_k} asymptotically converges to the unique fixed point of ΠF. [The proof of this is somewhat more complicated than the corresponding proof of Section 6.3.2, because C(r_k) depends nonlinearly on r_k. It requires algorithmic analysis using the theory of variational inequalities; see [Ber09].]

Note that, similar to discounted problems, we may also use a scaled version of the PVI algorithm,

    r_{k+1} = r_k − γG ( C(r_k) − d ),        (6.126)

where γ is a positive stepsize, and G is a scaling matrix. If G is positive definite symmetric, it can be shown that this iteration converges to the unique solution of the projected equation if γ is sufficiently small. We may approximate the scaled PVI algorithm (6.126) by a simulation-based scaled LSPE version of the form

    r_{k+1} = r_k − (γ/(k+1)) G_k Σ_{t=0}^k φ(i_t) q_{k,t},

where G_k is a positive definite symmetric matrix and γ is a sufficiently small positive stepsize [cf. Eq. (6.116)]. For example, we may use a diagonal approximation to the inverse in Eq. (6.124).

In comparing the Q-learning iteration (6.124)-(6.125) with the alternative optimistic LSPE version (6.116), we note that it has considerably higher computation overhead. In the process of updating r_{k+1} via Eq. (6.124), we can compute the matrix Σ_{t=0}^k φ(i_t)φ(i_t)′ and the vector Σ_{t=0}^k φ(i_t) q_{k,t} iteratively as in the LSPE algorithms of Section 6.3. However, the terms

    min{ c(i_{t+1}), φ(i_{t+1})′r_k }

in the TD formula (6.125) need to be recomputed for all the samples i_{t+1}, t ≤ k. Intuitively, this computation corresponds to repartitioning the states into those at which to stop and those at which to continue, based on the current approximate Q-factors Φr_k. By contrast, in the corresponding optimistic LSPE version, there is no repartitioning, and these terms are replaced by w(i_{t+1}, r_t), calculated at time t (rather than the current time k), given by

    w(i_{t+1}, r_t) = c(i_{t+1})        if t ∈ T̃,
    w(i_{t+1}, r_t) = φ(i_{t+1})′r_t    if t ∉ T̃,

where

    T̃ = { t | c(i_{t+1}) ≤ φ(i_{t+1})′r_t }

is the set of states to stop based on the approximate Q-factors Φr_t. The convergence analysis of this algorithm and some variations is based on the theory of stochastic approximation methods, and is given in the paper by Yu and Bertsekas [YuB07], to which we refer for further discussion.
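The no-repartitioning idea can be sketched as follows, again with a single constant feature φ(i) = 1 so that both accumulated sums are scalars: each sample's stop/continue classification and continuation value are frozen at the time t it is generated, using r_t, so each step costs O(1). The toy problem is the same illustrative two-state stopping example used earlier, and this scalar implementation is an assumption-laden sketch, not the algorithm as analyzed in [YuB07].

```python
import random

def stopping_lspe0_frozen(c, g_cost, sample_next, alpha, num_steps, seed=0):
    """Sketch of the frozen-classification variant: w(i_{t+1}, r_t) is
    fixed at sample time, so the sums are accumulated recursively."""
    rng = random.Random(seed)
    r, i = 0.0, 0
    sum_w = 0.0  # accumulates g(i_t, i_{t+1}) + alpha * w(i_{t+1}, r_t)
    for k in range(1, num_steps + 1):
        j = sample_next(i, rng)
        w = c[j] if c[j] <= r else r  # classified with the current r = r_t
        sum_w += g_cost(i, j) + alpha * w
        r = sum_w / k                 # least-squares fit with phi = 1
        i = j
    return r

# Same toy stopping problem; the limit agrees with the nonoptimistic
# version, r close to 1.5.
r = stopping_lspe0_frozen([0.5, 5.0], lambda i, j: 1.0,
                          lambda i, rng: rng.randrange(2), 0.5, 20000)
```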
Another variant of the algorithm, with a more solid theoretical foundation, is obtained by simply replacing the term φ(i_{t+1})′r_k in the TD formula (6.125) by φ(i_{t+1})′r_t, thereby eliminating the extra overhead for repartitioning. The idea is that for large k and t, these two terms are close to each other, so convergence is still maintained. In particular, the term

    Σ_{t=0}^k φ(i_t) min{ c(i_{t+1}), φ(i_{t+1})′r_k }

in Eqs. (6.124)-(6.125) is replaced by

    Σ_{t=0}^k φ(i_t) w(i_{t+1}, r_t) = Σ_{t≤k, t∈T̃} φ(i_t) c(i_{t+1}) + Σ_{t≤k, t∉T̃} φ(i_t) φ(i_{t+1})′r_t,        (6.127)

which can be efficiently updated at each time k. It can be seen that the optimistic algorithm that uses the expression (6.127) (no repartitioning) can only converge to the same limit as the nonoptimistic version (6.124). However, there is no convergence proof of this algorithm at present.

Constrained Policy Iteration and Optimal Stopping

It is natural in approximate DP to try to exploit whatever prior information is available about J*. In particular, if it is known that J* belongs to a subset J of ℜ^n, we may try to find an approximation Φr that belongs to J. This leads to projected equations involving projection on a restricted subset of the approximation subspace S. Corresponding analogs of the LSTD- and LSPE-type methods for such projected equations involve the solution of linear variational inequalities rather than linear systems of equations. The details of this are beyond our scope, and we refer to [Ber09] for a discussion.

In the practically common case where an upper bound of J* is available, a simple possibility is to modify the policy iteration algorithm. In particular, suppose that we know a vector J̄ with J̄(i) ≥ J*(i) for all i. Then the approximate policy iteration method can be modified to incorporate this knowledge as follows. Given a policy µ, we evaluate it by finding an approximation Φr̃_µ to the solution J̃_µ of the equation

    J̃_µ(i) = Σ_{j=1}^n p_ij(µ(i)) [ g(i, µ(i), j) + α min{ J̄(j), J̃_µ(j) } ],   i = 1, ..., n,        (6.128)

followed by the (modified) policy improvement

    µ̄(i) = arg min_{u∈U(i)} Σ_{j=1}^n p_ij(u) [ g(i,u,j) + α min{ J̄(j), φ(j)′r̃_µ } ],   i = 1, ..., n,        (6.129)

where φ(j)′ is the row of Φ that corresponds to state j.

Note that Eq. (6.128) is Bellman's equation for the Q-factor of an optimal stopping problem that involves the stopping cost J̄(i) at state i [cf. Eq. (6.119)]. Under the assumption J̄(i) ≥ J*(i) for all i, and a lookup table representation (Φ = I), it can be shown that the method (6.128)-(6.129) yields J* in a finite number of iterations, just like the standard (exact) policy iteration method (Exercise 6.17). When a compact feature-based representation is used (Φ ≠ I), the approximate policy evaluation based on Eq. (6.128) can be performed using the Q-learning algorithms described earlier in this section. The method exhibits oscillatory behavior similar to its unconstrained policy iteration counterpart (cf. Section 6.3.6).
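The constrained evaluation equation (6.128) can be sketched, for a fixed policy and a lookup-table representation, as a model-based fixed point iteration; the two-state chain and the numbers below are an illustrative assumption.

```python
def constrained_evaluation(P, g, J_bar, alpha, iters=200):
    """Sketch of the constrained policy evaluation (6.128) for a fixed
    policy (P, g are its transition matrix and stage costs): the upper
    bound J_bar acts as a stopping cost that caps the evaluated costs."""
    n = len(P)
    J = [0.0] * n
    for _ in range(iters):
        J = [sum(P[i][j] * (g[i][j] + alpha * min(J_bar[j], J[j]))
                 for j in range(n)) for i in range(n)]
    return J

# Two-state chain under a fixed policy: uniform transitions, unit stage
# cost, alpha = 0.5, so the unconstrained evaluation is J_mu = 2 at both
# states. With J_bar = (1.5, 10) the capped evaluation solves
# J = 1 + 0.25*(1.5 + J), i.e., J = 11/6 at both states.
P = [[0.5, 0.5], [0.5, 0.5]]
g = [[1.0, 1.0], [1.0, 1.0]]
J = constrained_evaluation(P, g, [1.5, 10.0], 0.5)
```

The capped values (≈1.833) lie below the unconstrained evaluation (2), illustrating how the prior bound J̄ tightens the evaluation.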
6.4.4 Finite-Horizon Q-Learning

We will now briefly discuss Q-learning and related approximations for finite-horizon problems. One may develop extensions of the Q-learning algorithms of the preceding sections to deal with finite horizon problems, with or without cost function approximation; for example, one may easily develop versions of the projected Bellman equation, and corresponding LSTD- and LSPE-type algorithms (see the end-of-chapter exercises). However, there are a few alternative approaches with an on-line character, which resemble rollout algorithms and are suitable for relatively short horizon problems; we will emphasize these. Approximate DP methods for such problems are additionally important because they arise in the context of multistep lookahead and rolling horizon schemes, possibly with cost function approximation at the end of the horizon.

In particular, at state-time pair (i_k, k), we may compute approximate Q-factors

    Q̃_k(i_k, u_k) = Σ_{i_{k+1}} p_{i_k i_{k+1}}(u_k) [ g(i_k, u_k, i_{k+1}) + min_{u_{k+1}∈U_{k+1}(i_{k+1})} Q̂_{k+1}(i_{k+1}, u_{k+1}) ],   u_k ∈ U_k(i_k),        (6.130)

and use on-line the control ũ_k ∈ U_k(i_k) that minimizes Q̃_k(i_k, u_k) over u_k ∈ U_k(i_k).
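The backward computation of the Q-factors (6.130) can be sketched as follows, taking the exact (DP) Q-factors in place of the approximations Q̂_{k+1}; the toy deterministic problem and the terminal cost vector are illustrative assumptions.

```python
def finite_horizon_q(N, states, controls, p, g, J_terminal):
    """Sketch of the backward recursion for the Q-factors (6.130), with
    hat-Q_{k+1} taken as the exact DP Q-factors; the recursion is closed
    at the end of the horizon by the terminal cost J_terminal."""
    Q = [None] * N
    J = list(J_terminal)
    for k in reversed(range(N)):
        Qk = {(i, u): sum(p[i][u][j] * (g[i][u][j] + J[j]) for j in states)
              for i in states for u in controls[i]}
        # cost-to-go induced by minimizing the Q-factors over the controls
        J = [min(Qk[(i, u)] for u in controls[i]) for i in states]
        Q[k] = Qk
    return Q

# Toy: two states, control u moves deterministically to state u at stage
# cost u; terminal cost (1, 3); horizon 3.
p = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
g = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
Q = finite_horizon_q(3, [0, 1], [[0, 1], [0, 1]], p, g, [1.0, 3.0])
```

Here Q[2][(0,1)] = 1 + 3 = 4 (control 1 pays its stage cost and lands at the expensive terminal state), while at earlier stages the minimization shields the recursion from the bad terminal value.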
The function Q̂_{k+1} in Eq. (6.130) may be computed in a number of ways:

(1) Q̂_{k+1} may be the cost function J̃_{k+1} of a base heuristic (and is thus independent of u_{k+1}), in which case Eq. (6.130) takes the form

    Q̃_k(i_k, u_k) = Σ_{i_{k+1}} p_{i_k i_{k+1}}(u_k) [ g(i_k, u_k, i_{k+1}) + J̃_{k+1}(i_{k+1}) ].        (6.131)

This is the rollout algorithm discussed at length in Chapter 6 of Vol. I. A variation is when multiple base heuristics are used and J̃_{k+1} is the minimum of the cost functions of these heuristics. These schemes may also be combined with a rolling and/or limited lookahead horizon.

(2) Q̂_{k+1} is an approximately optimal cost function J̃_{k+1} [independent of u_{k+1}, as in Eq. (6.131)], which is computed by (possibly multistep lookahead or rolling horizon) DP based on limited sampling to approximate the various expected values arising in the DP algorithm. Thus, here the function J̃_{k+1} of Eq. (6.131) corresponds to a (finite-horizon) near-optimal policy in place of the base policy used by rollout. These schemes are well suited for problems with a large (or infinite) state space but only a small number of controls per state, and may also involve selective pruning of the control constraint set to reduce the associated DP computations. The book by Chang, Fu, Hu, and Marcus [CFH07] has extensive discussions of approaches of this type, including systematic forms of adaptive sampling that aim to reduce the effects of limited simulation (less sampling for controls that seem less promising at a given state, and less sampling for future states that are less likely to be visited starting from the current state i_k).

(3) Q̂_{k+1} is computed using a linear parametric architecture of the form

    Q̂_{k+1}(i_{k+1}, u_{k+1}) = φ(i_{k+1}, u_{k+1})′r̂_{k+1},        (6.132)

where r̂_{k+1} is a parameter vector. In particular, Q̂_{k+1} may be obtained by a least-squares fit/regression or interpolation based on values computed at a subset of selected state-control pairs (cf. Section 6.4.3 of Vol. I). These values may be computed by finite horizon rollout, using as base policy the greedy policy corresponding to the preceding approximate Q-values in a backwards (off-line) Q-learning scheme:

    µ̂_i(x_i) = arg min_{u_i∈U_i(x_i)} Q̂_i(x_i, u_i).        (6.133)
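The one-step lookahead rule (6.131) can be sketched in a few lines; the toy problem and the heuristic values are illustrative assumptions.

```python
def rollout_control(i, controls, p, g, J_heur, states):
    """Sketch of the rollout rule (6.131): choose the control minimizing
    the expected stage cost plus the base heuristic's cost-to-go."""
    def qval(u):
        return sum(p[i][u][j] * (g[i][u][j] + J_heur[j]) for j in states)
    return min(controls[i], key=qval)

# Toy: from state 0, control u moves deterministically to state u at zero
# stage cost; heuristic values (5, 1) make control 1 the rollout choice.
p = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
g = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
u = rollout_control(0, [[0, 1], [0, 1]], p, g, [5.0, 1.0], [0, 1])
# u == 1
```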
Thus, in such a scheme, we first compute

    Q̃_{N−1}(i_{N−1}, u_{N−1}) = Σ_{i_N} p_{i_{N−1} i_N}(u_{N−1}) [ g(i_{N−1}, u_{N−1}, i_N) + J̃_N(i_N) ]

by the final stage DP computation at a subset of selected state-control pairs (i_{N−1}, u_{N−1}), followed by a least squares fit of the obtained values to obtain Q̂_{N−1} in the form (6.132); then we compute Q̃_{N−2} at a subset of selected state-control pairs (i_{N−2}, u_{N−2}) by rollout using the base policy {µ̃_{N−1}} defined by Eq. (6.133), followed by a least squares fit of the obtained values to obtain Q̂_{N−2} in the form (6.132); then we compute Q̃_{N−3} at a subset of selected state-control pairs (i_{N−3}, u_{N−3}) by rollout using the base policy {µ̃_{N−2}, µ̃_{N−1}} defined by Eq. (6.133), etc.

One advantage of finite horizon formulations is that convergence issues of the type arising in policy or value iteration methods do not play a significant role, so anomalous behavior does not arise. This is, however, a mixed blessing, as it may mask poor performance and/or important qualitative differences between different approaches.

6.5 AGGREGATION METHODS

In this section we revisit the aggregation methodology discussed in Section 6.3.4 of Vol. I, viewing it now in the context of cost-to-go or Q-factor approximation for discounted DP.† The aggregation approach resembles in some ways the problem approximation approach discussed in Section 6.3.3 of Vol. I: the original problem is approximated with a related "aggregate" problem, which is then solved exactly to yield a cost-to-go approximation for the original problem. Still, in other ways the aggregation approach resembles the projected equation/subspace approximation approach, most importantly because it constructs cost approximations of the form Φr, i.e., linear combinations of basis functions. However, there are important differences: in aggregation methods there are no projections with respect to Euclidean norms, and from a mathematical point of view, the underlying contractions are with respect to the sup-norm rather than a Euclidean norm, with potentially significant simplifications resulting in the algorithms of this section.

† Aggregation may be used in conjunction with any Bellman equation associated with the given problem. For example, if the problem admits post-decision states (cf. Section 6.1.4), the aggregation may be done using the corresponding Bellman equation.

To construct an aggregation framework, we introduce a finite set A of aggregate states, and we introduce two (somewhat arbitrary) choices of probabilities, which relate the original system states with the aggregate states:

(1) For each aggregate state x and original system state i, we specify the disaggregation probability d_xi [we have Σ_{i=1}^n d_xi = 1 for each x ∈ A]. Roughly, d_xi may be interpreted as the "degree to which x is represented by i."
Sec. 6.5 Aggregation Methods

we group the original system states into subsets, and we view each subset as an aggregate state. The aggregation approach is specified by two sets of probabilities:

(1) For each aggregate state x and original system state i, we specify the disaggregation probability dxi [we have Σ_{i=1}^n dxi = 1 for each x ∈ A]. Roughly, dxi may be interpreted as the "degree to which x is represented by i."

(2) For each aggregate state y and original system state j, we specify the aggregation probability φjy (we have Σ_{y∈A} φjy = 1 for each j = 1, ..., n). Roughly, φjy may be interpreted as the "degree of membership of j in the aggregate state y." The vectors {φjy | j = 1, ..., n} may also be viewed as basis functions that will be used to represent approximations of the cost vectors of the original problem.

Let us mention a few examples:

(a) In hard and soft aggregation (Examples 6.3.9 and 6.3.10 of Vol. I), we group the original system states into subsets, each being an aggregate state. In hard aggregation each state belongs to one and only one subset, and the aggregation probabilities are φjy = 1 if system state j belongs to aggregate state/subset y. One possibility is to choose the disaggregation probabilities as dxi = 1/nx if system state i belongs to aggregate state/subset x, where nx is the number of states of x (this implicitly assumes that all states that belong to aggregate state/subset x are "equally representative"). In soft aggregation, we allow the aggregate states/subsets to overlap, with the aggregation probabilities φjy quantifying the "degree of membership" of j in the aggregate state/subset y.

(b) In various discretization schemes (cf. Example 6.3.12 of Vol. I), each original system state j is associated with a convex combination of aggregate states:

    j ∼ Σ_{y∈A} φjy y,

for some nonnegative weights φjy, whose sum is 1, and which are viewed as aggregation probabilities (this makes geometrical sense if both the original and the aggregate states are associated with points in a Euclidean space).

(c) In coarse grid schemes (cf. Example 6.3.13 of Vol. I), a subset of representative states is chosen, each being an aggregate state. Thus, each aggregate state x is associated with a unique original state ix, and we may use the disaggregation probabilities dxi = 1 for i = ix and dxi = 0 for i ≠ ix. The aggregation probabilities are chosen as in the preceding case (b).

The aggregation approach approximates cost vectors with Φr, where r ∈ ℜs is a weight vector to be determined, and Φ is the matrix whose jth row consists of the aggregation probabilities φj1, ..., φjs.
Thus aggregation involves an approximation architecture similar to the one of projected equation methods: it uses as features the aggregation probabilities, thereby effecting cost-to-go or Q-factor approximation. This is a general approach for passing from a feature-based approximation of the cost vector to an aggregation-based approximation. Conversely, starting from a set of s features for each state, we may construct a feature-based hard aggregation scheme by grouping together states with "similar features." In particular, we may use a more or less regular partition of the feature space, which induces a possibly irregular partition of the original state space into aggregate states (all states whose features fall in the same set of the feature partition form an aggregate state). Unfortunately, in the resulting aggregation scheme the number of aggregate states may become very large.

The aggregation and disaggregation probabilities specify a dynamical system involving both aggregate and original system states (cf. Fig. 6.5.1). In this system:

(i) From aggregate state x, we generate original system state i according to dxi.

(ii) We generate transitions from original system state i to original system state j according to pij(u), with cost g(i, u, j).

(iii) From original system state j, we generate aggregate state y according to φjy.

Figure 6.5.1. Illustration of the transition mechanism of a dynamical system involving both aggregate and original system states. [Figure: original system states i and j in the middle, with transitions according to pij(u) and associated costs; aggregate state x on the left connected to i via the disaggregation probabilities dxi, and aggregate state y on the right reached from j via the aggregation probabilities φjy.]

One may associate various DP problems with this system.
In the first problem, discussed in the next subsection, the focus is on the aggregate states, the role of the original system states being to define the mechanisms of cost generation and probabilistic transition from one aggregate state to the next. In the second problem, discussed in Section 6.5.2, the focus is on both the original system states and the aggregate states, which are viewed as states of an enlarged system. Policy and value iteration algorithms are then defined for this enlarged system. These methods admit simulation-based implementations.

6.5.1 Cost and Q-Factor Approximation by Aggregation

Here we assume that the control constraint set U(i) is independent of the state i, and we denote it by U. Then, the transition probability from aggregate state x to aggregate state y under control u, and the corresponding expected transition cost, are given by (cf. Fig. 6.5.1)

    p̂xy(u) = Σ_{i=1}^n dxi Σ_{j=1}^n pij(u) φjy,        ĝ(x, u) = Σ_{i=1}^n dxi Σ_{j=1}^n pij(u) g(i, u, j).

These transition probabilities and costs define an aggregate problem whose states are just the aggregate states. The optimal cost function of the aggregate problem, denoted Ĵ, is obtained as the unique solution of Bellman's equation

    Ĵ(x) = min_{u∈U} [ ĝ(x, u) + α Σ_{y∈A} p̂xy(u) Ĵ(y) ],        ∀ x ∈ A,

which can be solved by any of the available value and policy iteration methods, including ones that involve simulation. The optimal cost function J* of the original problem is approximated by J̃ given by

    J̃(j) = Σ_{y∈A} φjy Ĵ(y),        ∀ j.

In the case of hard aggregation, J̃ is piecewise constant: it assigns the same cost to all the states j that belong to the same aggregate state y (since φjy = 1 if j belongs to y and φjy = 0 otherwise). Generally, for an original system state j, the approximation J̃(j) is a convex combination of the costs Ĵ(y) of the aggregate states y for which φjy > 0.

The preceding scheme can also be applied to problems with infinite state space, and is well-suited for approximating the solution of partially observed Markov Decision problems (POMDP), which are defined over their belief space (space of probability distributions over their states, cf. Section 5.4.2 of Vol. I). By discretizing the belief space with a coarse grid, one obtains a finite-state DP problem of perfect state information that can be solved with the methods of Chapter 1 (see [ZhL97], [ZhH01], [YuB04]). The following example illustrates the main ideas and shows that in the POMDP case, where the optimal cost function is a concave function over the simplex of beliefs (see Vol. I, Section 5.4.2), the approximation obtained is a lower bound of the optimal cost function.
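In matrix terms, the aggregate quantities above are P̂(u) = D P(u) Φ and ĝ(u) = D g(u), with g(u) the vector of expected one-stage costs under u. A minimal sketch (illustrative names, assuming a finite control set indexed by u) that forms the aggregate problem and solves its Bellman equation by value iteration:

```python
import numpy as np

def solve_aggregate(P, G, D, Phi, alpha, iters=2000):
    """P[u]: n x n transition matrix, G[u]: n x n stage costs,
    D: s x n disaggregation, Phi: n x s aggregation, alpha: discount."""
    nu, s = len(P), D.shape[0]
    P_hat = [D @ P[u] @ Phi for u in range(nu)]                 # aggregate transitions
    g_hat = [D @ (P[u] * G[u]).sum(axis=1) for u in range(nu)]  # aggregate costs
    J = np.zeros(s)
    for _ in range(iters):                                      # value iteration
        J = np.min([g_hat[u] + alpha * P_hat[u] @ J for u in range(nu)], axis=0)
    return J, Phi @ J       # J_hat on aggregate states, and J_tilde = Phi J_hat
```

The second returned vector is the approximation J̃ = Φ Ĵ of the optimal cost function of the original problem.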
Example 6.5.1 (Coarse Grid/POMDP Discretization and Lower Bound Approximations)

Consider a discounted DP problem of the type discussed in Section 1.2, where the state space is a convex subset C of a Euclidean space. We use z to denote the elements of this space, to distinguish them from x, which now denotes aggregate states. Bellman's equation is J = TJ with T defined by

    (TJ)(z) = min_{u∈U} E_w{ g(z, u, w) + αJ( f(z, u, w) ) },        ∀ z ∈ C.

Let J* denote the optimal cost function. Suppose we choose a finite subset/coarse grid of states {x1, ..., xm}, which we view as aggregate states with aggregation probabilities φ_{z xi}, for each z ∈ C. The disaggregation probabilities are d_{xk i} = 1 for i = xk, and d_{xk i} = 0 for i ≠ xk, k = 1, ..., m. Consider the mapping T̂ defined by

    (T̂J)(z) = min_{u∈U} E_w{ g(z, u, w) + α Σ_{j=1}^m φ_{f(z,u,w) xj} J(xj) },        ∀ z ∈ C,

where φ_{f(z,u,w) xj} are the aggregation probabilities of the next state f(z, u, w). We note that T̂ is a contraction mapping with respect to the sup-norm. Let Ĵ denote its unique fixed point, so that we have

    Ĵ(xi) = (T̂Ĵ)(xi),        i = 1, ..., m.

This is Bellman's equation for an aggregated finite-state discounted DP problem whose states are x1, ..., xm, and it can be solved by standard value and policy iteration methods. We approximate the optimal cost function of the original problem by

    J̃(z) = Σ_{i=1}^m φ_{z xi} Ĵ(xi),        ∀ z ∈ C.

Suppose now that J* is a concave function over C, so that for all (z, u, w),

    J*( f(z, u, w) ) ≥ Σ_{j=1}^m φ_{f(z,u,w) xj} J*(xj).

It then follows from the definitions of T and T̂ that

    J*(z) = (TJ*)(z) ≥ (T̂J*)(z),        ∀ z ∈ C,
so by iterating, we see that

    J*(z) ≥ lim_{k→∞} (T̂^k J*)(z) = Ĵ(z),        ∀ z ∈ C,

where the last equation follows since T̂ is a contraction. For z = xi, we have in particular

    J*(xi) ≥ Ĵ(xi),        ∀ i = 1, ..., m,

from which we obtain

    J*(z) ≥ Σ_{i=1}^m φ_{z xi} J*(xi) ≥ Σ_{i=1}^m φ_{z xi} Ĵ(xi) = J̃(z),        ∀ z ∈ C,

where the first inequality follows from the concavity of J*. Thus the approximation J̃(z) obtained from the aggregate system provides a lower bound to J*(z). Similarly, if J* can be shown to be convex, the preceding argument can be modified to show that J̃(z) is an upper bound to J*(z).

Q-Factor Approximation

We now consider the Q-factors Q̂(x, u), x ∈ A, u ∈ U, of the aggregate problem. They are the unique solution of the Q-factor equation

    Q̂(x, u) = ĝ(x, u) + α Σ_{y∈A} p̂xy(u) min_{v∈U} Q̂(y, v)
             = Σ_{i=1}^n dxi Σ_{j=1}^n pij(u) ( g(i, u, j) + α Σ_{y∈A} φjy min_{v∈U} Q̂(y, v) ).        (6.134)

We may apply Q-learning to solve the aggregate problem. In particular, we generate an infinitely long sequence of pairs {(xk, uk)} ⊂ A × U according to some probabilistic mechanism. For each (xk, uk), we generate an original system state ik according to the disaggregation probabilities d_{xk i}, and then a successor state jk according to probabilities p_{ik jk}(uk). We finally generate an aggregate system state yk using the aggregation probabilities φ_{jk y}. Then the Q-factor of (xk, uk) is updated using a stepsize γk > 0, while all other Q-factors are left unchanged [cf. Eqs. (6.101)-(6.103)]:

    Q̂_{k+1}(x, u) = (1 − γk) Q̂k(x, u) + γk (Fk Q̂k)(x, u),        ∀ (x, u),

where the vector Fk Q̂k is defined by

    (Fk Q̂k)(x, u) = g(ik, uk, jk) + α min_{v∈U} Q̂k(yk, v)    if (x, u) = (xk, uk),
    (Fk Q̂k)(x, u) = Q̂k(x, u)                                  if (x, u) ≠ (xk, uk).        (6.135)

Note that the probabilistic mechanism by which the pairs {(xk, uk)} are generated is arbitrary, as long as all possible pairs are generated infinitely often. In practice, one may wish to use the aggregation and disaggregation probabilities, and the Markov chain transition probabilities, in an effort to ensure that "important" state-control pairs are not underrepresented in the simulation.

After solving for the Q-factors Q̂, the Q-factors of the original problem are approximated by

    Q̃(j, v) = Σ_{y∈A} φjy Q̂(y, v),        j = 1, ..., n,  v ∈ U.        (6.136)

We recognize this as an approximate representation Q̃ of the Q-factors of the original problem in terms of basis functions. There is a basis function for each aggregate state y ∈ A (the vector {φjy | j = 1, ..., n}), and the corresponding coefficients that weigh the basis functions are the Q-factors of the aggregate problem Q̂(y, v), y ∈ A, v ∈ U (so we have in effect a lookup table representation with respect to v).

The optimal cost-to-go function of the original problem is approximated by

    J̃(j) = min_{v∈U} Q̃(j, v),        j = 1, ..., n,

and the corresponding one-step lookahead suboptimal policy is obtained as

    μ̃(i) = arg min_{u∈U} Σ_{j=1}^n pij(u) ( g(i, u, j) + αJ̃(j) ),        i = 1, ..., n.

Note that the preceding minimization requires knowledge of the transition probabilities pij(u), which is unfortunate since a principal motivation of Q-learning is to deal with model-free situations where the transition probabilities are not explicitly known. The alternative is to obtain a suboptimal control at j by minimizing over v ∈ U the Q-factor Q̃(j, v) given by Eq. (6.136). This is less discriminating in the choice of control (for example, in the case of hard aggregation, it applies the same control at all states j that belong to the same aggregate state y).

The preceding analysis highlights the two kinds of approximation that are inherent in the method just described:

(a) The transition probabilities of the original system are modified through the aggregation and disaggregation process.

(b) In calculating the Q-factors of the aggregate system via Eq. (6.134), controls are associated with aggregate states rather than original system states.

In the next section, we provide alternative algorithms based on value iteration (rather than Q-learning), where the approximation in (b) above is addressed more effectively.
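The Q-learning iteration (6.134)-(6.135) may be sketched as follows (illustrative code; the pairs (xk, uk) are sampled uniformly here, one of many valid mechanisms, and the stepsize rule is an arbitrary diminishing choice):

```python
import numpy as np

def aggregate_q_learning(P, G, D, Phi, alpha, num_iters=20000, seed=0):
    """Q-learning on the aggregate problem: sample (x, u), disaggregate to i,
    transition to j, aggregate to y, then update the single Q-factor Q(x, u)."""
    rng = np.random.default_rng(seed)
    s, n = D.shape
    nu = len(P)                                    # number of controls
    Q = np.zeros((s, nu))
    for k in range(num_iters):
        x, u = rng.integers(s), rng.integers(nu)   # sample pair (x_k, u_k)
        i = rng.choice(n, p=D[x])                  # i_k ~ d_{x i}
        j = rng.choice(n, p=P[u][i])               # j_k ~ p_{ij}(u)
        y = rng.choice(s, p=Phi[j])                # y_k ~ phi_{j y}
        gamma = 10.0 / (k + 100)                   # diminishing stepsize
        Q[x, u] += gamma * (G[u][i, j] + alpha * Q[y].min() - Q[x, u])
    return Q
```

The update line is exactly the form (6.135): only the sampled pair (xk, uk) moves toward the sampled target, all other Q-factors stay fixed.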
6.5.2 Approximate Policy and Value Iteration

Let us consider the system consisting of the original states and the aggregate states, with the transition probabilities and the stage costs described earlier (cf. Fig. 6.5.1). We introduce the vectors J̃0, J̃1, and R* where:

    R*(x) is the optimal cost-to-go from aggregate state x.

    J̃0(i) is the optimal cost-to-go from original system state i that has just been generated from an aggregate state (left side of Fig. 6.5.1).

    J̃1(j) is the optimal cost-to-go from original system state j that has just been generated from an original system state (right side of Fig. 6.5.1).

Note that because of the intermediate transitions to aggregate states, J̃0 and J̃1 are different.

These three vectors satisfy the following three Bellman's equations:

    R*(x) = Σ_{i=1}^n dxi J̃0(i),        x ∈ A,        (6.137)

    J̃0(i) = min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + αJ̃1(j) ),        i = 1, ..., n,        (6.138)

    J̃1(j) = Σ_{y∈A} φjy R*(y),        j = 1, ..., n.        (6.139)

By combining these equations, we obtain an equation for R*:

    R*(x) = (FR*)(x),        x ∈ A,

where F is the mapping defined by

    (FR)(x) = Σ_{i=1}^n dxi min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + α Σ_{y∈A} φjy R(y) ),        x ∈ A.        (6.140)

It can be seen that F is a sup-norm contraction mapping and has R* as its unique fixed point. This follows from standard contraction arguments (cf. Section 1.2) and the fact that dxi, pij(u), and φjy are all transition probabilities.†

† A quick proof is to observe that F is the composition F = DTΦ, where T is the usual DP mapping, and D and Φ are the matrices with rows the disaggregation and aggregation distributions, respectively. Since T is a contraction with respect to the sup-norm ‖·‖∞, and D and Φ satisfy

    ‖Dx‖∞ ≤ ‖x‖∞,  ∀ x ∈ ℜn,        ‖Φy‖∞ ≤ ‖y‖∞,  ∀ y ∈ ℜs,

it follows that F is a sup-norm contraction.

Once R* is found, the optimal cost function of the original problem may be approximated by J̃1 = ΦR*, and a suboptimal policy may be found through the minimization defining J̃0. Again, the optimal cost function approximation J̃1 is a linear combination of the columns of Φ, which may be viewed as basis functions.

One may use value and policy iteration-type algorithms to find R*. The value iteration algorithm is to generate successively FR, F²R, ..., starting with some initial guess R. The policy iteration algorithm starts with a stationary policy μ0 for the original problem, and given μk, it finds R_{μk} satisfying R_{μk} = F_{μk} R_{μk}, where Fμ is the mapping defined by

    (FμR)(x) = Σ_{i=1}^n dxi Σ_{j=1}^n pij(μ(i)) ( g(i, μ(i), j) + α Σ_{y∈A} φjy Rμ(y) ),        x ∈ A,        (6.141)

(this is the policy evaluation step). It then generates μ_{k+1} by

    μ_{k+1}(i) = arg min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + α Σ_{y∈A} φjy R_{μk}(y) ),        ∀ i,        (6.142)

(this is the policy improvement step). The key fact here is that F and Fμ are not only sup-norm contractions, but also have the monotonicity property of DP mappings (cf. Lemma 1.1.1), which was used in an essential way in the convergence proof of ordinary policy iteration (cf. Prop. 1.3.4). We leave it as an exercise for the reader to show that this policy iteration algorithm converges to the unique fixed point of F in a finite number of iterations.

Generally, in approximate policy iteration, by Prop. 1.3.6, we have an error bound of the form

    lim sup_{k→∞} ‖J_{μk} − J*‖∞ ≤ 2αδ / (1 − α)²,

where δ satisfies ‖Jk − J_{μk}‖∞ ≤ δ for all generated policies μk, and Jk is the approximate cost vector of μk that is used for policy improvement (which is ΦR_{μk} in the case of aggregation).
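The footnote's composition F = DTΦ translates directly into code; a sketch of the mapping (6.140) and the value iteration R ← FR (illustrative names, finite control set):

```python
import numpy as np

def F(R, P, G, D, Phi, alpha):
    """One application of F = D T Phi, Eq. (6.140): map R to J1 = Phi R,
    apply the DP mapping T over the original states, then average with D."""
    J1 = Phi @ R
    TJ = np.min([(P[u] * G[u]).sum(axis=1) + alpha * P[u] @ J1
                 for u in range(len(P))], axis=0)
    return D @ TJ
```

Iterating R ← FR from any starting vector converges to the unique fixed point R*, by the sup-norm contraction property noted above.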
However, when the policy sequence {μk} converges to some μ, as it does here, it turns out that the much sharper bound

    ‖Jμ − J*‖∞ ≤ 2αδ / (1 − α)        (6.143)

holds. To show this, let J̄ be the cost vector used to evaluate μ (which is ΦR* in the case of aggregation), and note that it satisfies TJ̄ = TμJ̄, since μk converges to μ [cf. Eq. (6.142) in the case of aggregation]. We write

    TJμ ≥ T(J̄ − δe) = TJ̄ − αδe = TμJ̄ − αδe ≥ TμJμ − 2αδe = Jμ − 2αδe,

where e is the unit vector, from which, by repeatedly applying T to both sides, we obtain

    J* = lim_{k→∞} T^k Jμ ≥ Jμ − (2αδ / (1 − α)) e,

thereby showing the error bound (6.143).

The preceding error bound improvement suggests that approximate policy iteration based on aggregation may hold some advantage in terms of approximation quality, relative to its projected equation-based counterpart. Note, however, that the basis functions in the aggregation approach are restricted by the requirement that the rows of Φ must be probability distributions. For a generalization of this idea, see Exercise 6.15.

Simulation-Based Policy Iteration

The policy iteration method just described requires n-dimensional calculations, and is impractical when n is large. An alternative, which is consistent with the philosophy of this chapter, is to implement it by simulation, using an LSTD-type method, as we now proceed to describe.

For a given policy μ, the aggregate version of Bellman's equation, R = FμR, is linear of the form [cf. Eq. (6.141)]

    R = DTμ(ΦR),

where D and Φ are the matrices with rows the disaggregation and aggregation distributions, respectively, and Tμ is the DP mapping associated with μ, i.e.,

    TμJ = gμ + αPμJ,

with Pμ the transition probability matrix corresponding to μ, and gμ the vector whose ith component is Σ_{j=1}^n pij(μ(i)) g(i, μ(i), j). We can thus write this equation as

    ER = f,

where

    E = I − αDPμΦ,        f = Dgμ,        (6.144)

in analogy with the corresponding matrix and vector for the projected equation [cf. Eq. (6.34)].

We may use low-dimensional simulation to approximate E and f based on a given number of samples, similar to Section 6.3.3 [cf. Eqs. (6.41) and (6.42)]. In particular, a sample sequence (i0, j0), (i1, j1), ... is obtained by first generating a sequence of states {i0, i1, ...} by sampling according to a distribution {ξi | i = 1, ..., n} (with ξi > 0 for all i), and then by generating for each t the column index jt using sampling according to the distribution {p_{it j} | j = 1, ..., n}. Given the first k + 1 samples, we form the matrix Êk and vector f̂k given by

    Êk = I − (α / (k+1)) Σ_{t=0}^k (1/ξ_{it}) d(it) φ(jt)′,        f̂k = (1 / (k+1)) Σ_{t=0}^k (1/ξ_{it}) d(it) g(it, μ(it), jt),        (6.145)

where d(i) is the ith column of D and φ(j)′ is the jth row of Φ. The convergence Êk → E and f̂k → f follows from these expressions and law of large numbers arguments (cf. Section 6.3.3), using the relation

    lim_{k→∞} (1 / (k+1)) Σ_{t=0}^k δ(it = i, jt = j) = ξi pij,        ∀ i, j.

It is important to note that the sampling probabilities ξi are restricted to be positive, but are otherwise arbitrary and need not depend on the current policy. Moreover, their choice does not affect the obtained approximate solution of the equation ER = f. This is in contrast with the projected equation approach, where the choice of ξi affects the projection norm and the solution of the projected equation, as well as the contraction properties of the mapping ΠT. Because of this, the problem of exploration is less acute in the context of policy iteration when aggregation is used for policy evaluation.

Note also that instead of using the probabilities ξi to sample original system states, we may alternatively sample the aggregate states x according to a distribution {ζx | x ∈ A}, generate a sequence of aggregate states {x0, x1, ...}, and then generate a state sequence {i0, i1, ...} using the disaggregation probabilities. In this case ξi = Σ_{x∈A} ζx dxi, and the weights 1/ξ_{it} in the equations (6.145) should be modified accordingly.

The corresponding LSTD(0) method generates R̂k = Êk^{-1} f̂k, and approximates the cost vector of μ by the vector ΦR̂k:

    J̃μ = ΦR̂k.

There is also a regression-based version that is suitable for the case where Êk is nearly singular (cf. Section 6.3.4), as well as an iterative regression-based version of LSTD, which may be viewed as a special case of (scaled) LSPE. The latter method takes the form

    R̂_{k+1} = (Êk′ Σk^{-1} Êk + βI)^{-1} (Êk′ Σk^{-1} f̂k + βR̂k),        (6.146)

where β > 0 and Σk is a positive definite symmetric matrix [cf. Eq. (6.61)]. Note that contrary to the projected equation case, the iteration (6.146) makes sense even if Êk is singular. Near-singularity of Êk can arise in several ways: for a discount factor α ≈ 1, Êk will always be nearly singular [since DPμΦ is a transition probability matrix, cf. Eq. (6.144)], and Êk will also be nearly singular when the rows of D and/or the columns of Φ are nearly dependent.

The nonoptimistic version of this aggregation-based policy iteration method does not exhibit the oscillatory behavior of the one based on the projected equation approach (cf. Section 6.3.6): the generated policies converge, and the limit policy satisfies the sharper error bound (6.143). Moreover, optimistic versions of the method also do not exhibit the chattering phenomenon described in Section 6.3.6. This is similar to optimistic policy iteration for the case of a lookup table representation of the cost of the current policy: we are essentially dealing with a lookup table representation of the cost of the aggregate system of Fig. 6.5.1.

The preceding arguments indicate that aggregation-based policy iteration holds an advantage over its projected equation-based counterpart in terms of regularity of behavior, error guarantees, and exploration-related difficulties. Its limitation is that the basis functions in the aggregation approach are restricted by the requirement that the rows of Φ must be probability distributions. For example, in the case of a single basis function (s = 1), there is only one possible choice for Φ in the aggregation context, namely the matrix whose single column is the unit vector.
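A sketch of the simulation-based estimates (6.145) and the LSTD(0) solve R̂k = Êk^{-1} f̂k for a fixed policy (illustrative names; P_mu and g_mu are the transition matrix and stage costs under μ, and xi is any positive sampling distribution):

```python
import numpy as np

def lstd_aggregation(P_mu, g_mu, D, Phi, alpha, xi, k=1000, seed=1):
    """Form E_hat, f_hat of Eq. (6.145) from k independent samples
    (i_t, j_t) with i_t ~ xi and j_t ~ p_{i_t j}(mu(i_t)), then solve."""
    rng = np.random.default_rng(seed)
    s, n = D.shape
    B = np.zeros((s, s))
    f_hat = np.zeros(s)
    for _ in range(k):
        i = rng.choice(n, p=xi)
        j = rng.choice(n, p=P_mu[i])
        B += np.outer(D[:, i], Phi[j]) / xi[i]     # d(i_t) phi(j_t)' / xi_{i_t}
        f_hat += D[:, i] * g_mu[i, j] / xi[i]
    E_hat = np.eye(s) - alpha * B / k
    return np.linalg.solve(E_hat, f_hat / k)       # R_hat = E_hat^{-1} f_hat
```

As noted above, the choice of xi (positive, but otherwise arbitrary) does not affect the limit of the computed solution, only the variance of the estimates.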
Simulation-Based Value Iteration

The value iteration algorithm also admits a simulation-based implementation, which is similar to the Q-learning algorithms of Section 6.4.1 [cf. Eqs. (6.111) and (6.112)]. It generates a sequence of aggregate states {x0, x1, ...} by some probabilistic mechanism, which ensures that all aggregate states are generated infinitely often. Given each xk, it independently generates an original system state ik according to the probabilities d_{xk i}, and updates R(xk) according to

    R_{k+1}(xk) = (1 − γk) Rk(xk) + γk min_{u∈U(ik)} Σ_{j=1}^n p_{ik j}(u) ( g(ik, u, j) + α Σ_{y∈A} φjy Rk(y) ),        (6.147)

where γk is a diminishing positive stepsize, and leaves all the other components of R unchanged:

    R_{k+1}(x) = Rk(x),        if x ≠ xk.

This algorithm can be viewed as an asynchronous stochastic approximation version of value iteration. Its convergence mechanism and justification are very similar to the ones given for Q-learning in Section 6.4.1. Note that contrary to the aggregation-based Q-learning algorithm (6.134)-(6.135), the iteration (6.147) involves the calculation of an expected value at every iteration.

Multistep Aggregation

The aggregation methodology of this section can be generalized by considering a multistep aggregation-based dynamical system, illustrated in Fig. 6.5.2. This system is specified by disaggregation and aggregation probabilities as before, but involves k > 1 transitions between original system states in between transitions from and to aggregate states.

Figure 6.5.2. The transition mechanism for multistep aggregation. It is based on a dynamical system involving aggregate states, and k transitions between original system states in between transitions from and to aggregate states. [Figure: aggregate state x on the left, connected via disaggregation probabilities dxi to original system state i; then k stages of transitions i → j1 → j2 → ... → jk according to pij(u), with the associated costs; finally aggregation probabilities φjy lead to aggregate state y on the right.]
We introduce vectors J̃0, J̃1, ..., J̃k, and R* where:

    R*(x) is the optimal cost-to-go from aggregate state x.

    J̃0(i) is the optimal cost-to-go from original system state i that has just been generated from an aggregate state (left side of Fig. 6.5.2).

    J̃m(jm), m = 1, ..., k, is the optimal cost-to-go from original system state jm that has just been generated from an original system state j_{m−1}.

These vectors satisfy the following Bellman's equations:

    R*(x) = Σ_{i=1}^n dxi J̃0(i),        x ∈ A,        (6.148)

    J̃0(i) = min_{u∈U(i)} Σ_{j1=1}^n p_{i j1}(u) ( g(i, u, j1) + αJ̃1(j1) ),        i = 1, ..., n,

    J̃m(jm) = min_{u∈U(jm)} Σ_{j_{m+1}=1}^n p_{jm j_{m+1}}(u) ( g(jm, u, j_{m+1}) + αJ̃_{m+1}(j_{m+1}) ),        jm = 1, ..., n,  m = 1, ..., k − 1,        (6.149)

    J̃k(jk) = Σ_{y∈A} φ_{jk y} R*(y),        jk = 1, ..., n.        (6.150)

By combining these equations, we obtain an equation for R*:

    R*(x) = (FR*)(x),        x ∈ A,

where F is the mapping defined by

    FR = DT^k(ΦR),

where T is the usual DP mapping of the problem. As earlier, it can be seen that F is a sup-norm contraction, but its contraction modulus is α^k rather than α.

There is a similar mapping corresponding to a fixed policy, and it can be used to implement a policy iteration algorithm, which evaluates a policy through calculation of a corresponding vector R and then improves it. However, there is a major difference from the single-step aggregation case: a policy involves a set of k control functions {μ0, ..., μ_{k−1}}, and while a known policy can be easily simulated, its improvement involves multistep lookahead using the minimizations of Eqs. (6.148)-(6.150), and may be costly. Thus multistep aggregation is a useful idea only for problems where the cost of this multistep lookahead minimization (for a single given starting state) is not prohibitive. On the other hand, note that from the theoretical point of view, a multistep scheme provides a means of better approximation of the true optimal cost vector J*, independent of the use of a large number of aggregate states. This can be seen from Eqs. (6.148)-(6.150), which by classical value iteration convergence results, show that J̃0(i) → J*(i) as k → ∞, regardless of the choice of aggregate states.

Asynchronous Multi-Agent Aggregation

Let us now discuss the distributed solution of large-scale discounted DP problems using cost function approximation based on hard aggregation. We partition the original system states into aggregate states/subsets x ∈ A = {x1, ..., xm}, and we envision a network of agents, each updating asynchronously a detailed/exact local cost function, defined on a single aggregate state/subset. In particular, each agent a = 1, ..., m maintains/updates a (local) cost J(i) for every state i ∈ xa. Each agent also maintains an aggregate cost for its aggregate state,

    R(a) = Σ_{i∈xa} d_{xa i} J(i),

with d_{xa i} being the corresponding disaggregation probabilities; this is the weighted average detailed cost of the states in xa, weighted by the disaggregation probabilities. These aggregate costs are communicated between agents and are used to perform the local updates.

We generically denote by J and R the vectors with components J(i), i = 1, ..., n, and R(a), a = 1, ..., m, respectively, so that J ∈ ℜn and R ∈ ℜm. For each original system state j, we denote by x(j) the subset to which j belongs [i.e., j ∈ x(j)].

In a synchronous value iteration method of this type, each agent a = 1, ..., m updates its components according to

    J_{k+1}(i) = min_{u∈U(i)} Ha(i, u, Jk, Rk),        ∀ i ∈ xa,        (6.151)

with

    Rk(a) = Σ_{i∈xa} d_{xa i} Jk(i),        ∀ a = 1, ..., m,        (6.152)

where the mapping Ha is defined for all a = 1, ..., m, i ∈ xa, u ∈ U(i), J ∈ ℜn, and R ∈ ℜm, by

    Ha(i, u, J, R) = Σ_{j=1}^n pij(u) g(i, u, j) + α Σ_{j∈xa} pij(u) J(j) + α Σ_{j∉xa} pij(u) R( x(j) ),        (6.153)

which is the same as in ordinary value iteration, except that the aggregate costs R( x(j) ) are used for states j whose costs are updated by other agents.

It is possible to show that the iteration (6.151)-(6.152) involves a sup-norm contraction mapping of modulus α, so it converges to the unique solution of the system of equations in (J, R)

    J(i) = min_{u∈U(i)} Ha(i, u, J, R),        R(a) = Σ_{i∈xa} d_{xa i} J(i),        ∀ i ∈ xa,  a = 1, ..., m.        (6.154)

This follows from the fact that {d_{xa i} | i = 1, ..., n} is a probability distribution. We may view the equations (6.154) as a set of Bellman equations for an "aggregate" DP problem, which, similar to our earlier discussion, involves both the original and the aggregate system states. The difference from the Bellman equations (6.137)-(6.139) is that the mapping (6.153) involves J(j) rather than R( x(j) ) for j ∈ xa.

In the algorithm (6.151)-(6.152), all agents a must be updating their local costs J(i) and aggregate costs R(a) synchronously, and communicate the aggregate costs to the other agents before a new iteration may begin. This is often impractical and time-wasting. In a more practical asynchronous version of the method, the aggregate costs R(a) may be outdated to account for communication "delays" between agents. Moreover, the costs J(i) need not be updated for all i; it is sufficient that they are updated by each agent a only for a (possibly empty) subset Ia,k of xa. In this case, the iteration (6.151)-(6.152) is modified to take the form

    J_{k+1}(i) = min_{u∈U(i)} Ha( i, u, Jk, R_{τ1,k}(1), ..., R_{τm,k}(m) ),        ∀ i ∈ Ia,k,        (6.155)

with 0 ≤ τa,k ≤ k for a = 1, ..., m, and

    Rk(a) = Σ_{i∈xa} d_{xa i} Jk(i).        (6.156)

The differences k − τa,k in Eq. (6.155) may be viewed as "delays" between the current time k and the times τa,k when the corresponding aggregate costs were computed at other processors. For convergence, it is of course essential that every i ∈ xa belongs to Ia,k for infinitely many k (so each cost component is updated infinitely often), and that lim_{k→∞} τa,k = ∞ for all a = 1, ..., m (so that agents eventually communicate more recently computed aggregate costs to other agents).

Asynchronous multi-agent DP methods of this type have been proposed and analyzed by the author in [Ber82], [Ber83]. Their convergence, based on the sup-norm contraction property of the mapping underlying Eqs. (6.151)-(6.152), has been established in [Ber82]. The monotonicity property is also sufficient to establish convergence, and this may be useful in the convergence analysis of related algorithms for other nondiscounted DP models.
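One synchronous sweep (6.151)-(6.152) by a single agent can be sketched as follows (illustrative names; uniform disaggregation probabilities d_{xa i} = 1/|xa| are assumed). Detailed costs J(j) are used for states inside the agent's own subset, aggregate costs R( x(j) ) for all others:

```python
import numpy as np

def agent_sweep(a, members, membership, P, G, J, R, alpha):
    """Agent a updates J(i) for its states i via H_a of Eq. (6.153),
    then refreshes its own aggregate cost R(a)."""
    n = len(membership)
    for i in members[a]:
        J[i] = min(
            sum(P[u][i, j] * (G[u][i, j]
                              + alpha * (J[j] if membership[j] == a
                                         else R[membership[j]]))
                for j in range(n))
            for u in range(len(P)))
    R[a] = np.mean([J[i] for i in members[a]])   # uniform d_{x_a i}
    return J, R
```

In an asynchronous implementation, each agent would run such sweeps on its own schedule, reading possibly outdated aggregate costs R(b), b ≠ a, as in Eq. (6.155).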
We finally note that there are Q-learning versions of the algorithm (6.155)-(6.156), whose convergence analysis is very similar to the one for standard Q-learning (see [Tsi84], which also allows asynchronous distributed computation with communication "delays").

6.6 STOCHASTIC SHORTEST PATH PROBLEMS

In this section we consider policy evaluation for finite-state stochastic shortest path (SSP) problems (cf. Chapter 2). We assume that there is no discounting (α = 1), and that the states are 0, 1, ..., n, where state 0 is a special cost-free termination state. We focus on a fixed proper policy μ, under which all the states 1, ..., n are transient.

There are natural extensions of the LSTD(λ) and LSPE(λ) algorithms. We introduce a linear approximation architecture of the form

    J̃(i, r) = φ(i)′r,        i = 0, 1, ..., n,

and the subspace

    S = {Φr | r ∈ ℜs},        (6.157)

where, as in Section 6.3, Φ is the n × s matrix whose rows are φ(i)′, i = 1, ..., n. We assume that Φ has rank s. Also, for notational convenience in the subsequent formulas, we define φ(0) = 0.

The algorithms use a sequence of simulated trajectories, each of the form (i0, i1, ..., iN), where iN = 0, and it ≠ 0 for t < N. Once a trajectory is completed, an initial state i0 for the next trajectory is chosen according to a fixed probability distribution q0 = ( q0(1), ..., q0(n) ), where

    q0(i) = P(i0 = i),        i = 1, ..., n,

and the process is repeated.

For a trajectory i0, i1, ... of the SSP problem, consider the probabilities

    qt(i) = P(it = i),        i = 1, ..., n,  t = 0, 1, ...

Note that qt(i) diminishes to 0 as t → ∞ at the rate of a geometric progression (cf. Section 2.1), so the limits

    q(i) = Σ_{t=0}^∞ qt(i),        i = 1, ..., n,

are finite. Let q be the vector with components q(1), ..., q(n). We assume that the q0(i) are chosen so that q(i) > 0 for all i [a stronger assumption is that q0(i) > 0 for all i]. We introduce the norm

    ‖J‖q = ( Σ_{i=1}^n q(i) J(i)² )^{1/2},
Using this relation, we have for all J ∈ ℜn,

||PJ||²_q = sum_{i=1}^n q(i) ( sum_{j=1}^n pij J(j) )²
 ≤ sum_{i=1}^n q(i) sum_{j=1}^n pij J(j)²
 = sum_{j=1}^n ( q(j) − q0(j) ) J(j)²
 ≤ ||J||²_q.   (6.159)

From the relation ||PJ||_q ≤ ||J||_q it follows that

||P^t J||_q ≤ ||J||_q,  J ∈ ℜn,  t = 0, 1, ....

Hence, by using the definition (6.158) of P(λ), we also have

||P(λ) J||_q ≤ ||J||_q,  J ∈ ℜn.

Since lim_{t→∞} P^t J = 0 for any J ∈ ℜn, it follows that ||P^t J||_q < ||J||_q for all J ≠ 0 and t sufficiently large. Therefore, since λ > 0,

||P(λ) J||_q < ||J||_q,  for all J ≠ 0.   (6.160)

We now define

β = max{ ||P(λ) J||_q : ||J||_q = 1 },

and note that since the maximum in the definition of β is attained by the Weierstrass Theorem (a continuous function attains a maximum over a compact set), we have β < 1 in view of Eq. (6.160). Since

||P(λ) J||_q ≤ β ||J||_q,  J ∈ ℜn,

it follows that P(λ) is a contraction of modulus β with respect to ||·||_q. Note also that if q0(i) > 0 for all i, then q(i) > 0 for all i.

Let λ = 0. We use a different argument because T is not necessarily a contraction with respect to ||·||_q. [An example is given following Prop. 6.8.2.] We show that ΠT is a contraction with respect to a different norm
by showing that the eigenvalues of ΠP lie strictly within the unit circle.† Indeed, with an argument like the one used to prove Lemma 6.3.1, we have ||PJ||_q ≤ ||J||_q for all J, which implies that ||ΠPJ||_q ≤ ||J||_q, so the eigenvalues of ΠP cannot be outside the unit circle.

Assume, to arrive at a contradiction, that ν is an eigenvalue of ΠP with |ν| = 1, and let ζ be a corresponding eigenvector. We claim that Pζ must have both real and imaginary components in the subspace S. If this were not so, we would have Pζ ≠ ΠPζ, so that

||Pζ||_q > ||ΠPζ||_q = ||νζ||_q = |ν| ||ζ||_q = ||ζ||_q,

which contradicts the fact ||PJ||_q ≤ ||J||_q for all J. Thus, the real and imaginary components of Pζ are in S, which implies that Pζ = ΠPζ = νζ, so that ν is an eigenvalue of P. This is a contradiction because |ν| = 1, while the eigenvalues of P are strictly within the unit circle, since the policy being evaluated is proper. Q.E.D.

† We use here the fact that if a square matrix has eigenvalues strictly within the unit circle, then there exists a norm with respect to which the linear mapping defined by the matrix is a contraction. Also, in the preceding argument, the projection Πz of a complex vector z is obtained by separately projecting the real and the imaginary components of z on S. The projection norm for a complex vector x + iy is defined by

||x + iy||_q = sqrt( ||x||²_q + ||y||²_q ).

The preceding proof has shown that ΠT(λ) is a contraction with respect to ||·||_q when λ > 0. As a result, similar to the discounted case (cf. Prop. 6.3.5), we can obtain the error bound

||Jµ − Φr*_λ||_q ≤ (1/sqrt(1 − α²_λ)) ||Jµ − ΠJµ||_q,  λ > 0,

where Φr*_λ and αλ are the fixed point and contraction modulus of ΠT(λ), respectively. When λ = 0, we have

||Jµ − Φr*_0|| ≤ ||Jµ − ΠJµ|| + ||ΠJµ − Φr*_0||
 = ||Jµ − ΠJµ|| + ||ΠT Jµ − ΠT(Φr*_0)||
 = ||Jµ − ΠJµ|| + α0 ||Jµ − Φr*_0||,

where ||·|| is the norm with respect to which ΠT is a contraction (cf. Prop. 6.6.1), and Φr*_0 and α0 are the fixed point and contraction modulus of ΠT. We thus have the error bound

||Jµ − Φr*_0|| ≤ (1/(1 − α0)) ||Jµ − ΠJµ||.
As in the discounted case (Section 6.3), the projected equation can be written as a linear equation of the form Cr = d, and the corresponding LSTD and LSPE algorithms use simulation-based approximations Ck and dk of C and d. The simulation generates a sequence of trajectories of the form (i0, i1, ..., iN), where iN = 0, and it ≠ 0 for t < N. Once a trajectory is completed, an initial state i0 for the next trajectory is chosen according to a fixed probability distribution q0 = (q0(1), ..., q0(n)). The LSTD method approximates the solution C^{-1}d of the projected equation by Ck^{-1} dk. The LSPE algorithm and its scaled versions are defined by

rk+1 = rk − γ Gk (Ck rk − dk),

where γ is a sufficiently small stepsize and Gk is a scaling matrix. The derivation of the detailed equations is straightforward but somewhat tedious, and will not be given (see also the discussion in Section 6.8).

6.7 AVERAGE COST PROBLEMS

In this section we consider average cost problems and related approximations: policy evaluation algorithms such as LSTD(λ) and LSPE(λ), approximate policy iteration, and Q-learning. We assume throughout the finite state model of Section 4.1, with the optimal average cost being the same for all initial states (cf. Section 4.2).

6.7.1 Approximate Policy Evaluation

Let us consider the problem of approximate evaluation of a stationary policy µ. As in the discounted case (Section 6.3), we consider a stationary finite-state Markov chain with states i = 1, ..., n, transition probabilities pij, i, j = 1, ..., n, and stage costs g(i, j). We assume that the states form a single recurrent class. An equivalent way to express this assumption is the following.

Assumption 6.7.1: The Markov chain has a steady-state probability vector ξ = (ξ1, ..., ξn) with positive components, i.e., for all i = 1, ..., n,

lim_{N→∞} (1/N) sum_{k=1}^N P(ik = j | i0 = i) = ξj > 0,  j = 1, ..., n.
From Section 4.2, we know that under Assumption 6.7.1 the average cost, denoted by η, is independent of the initial state:

η = lim_{N→∞} (1/N) E{ sum_{k=0}^{N−1} g(xk, xk+1) | x0 = i },  i = 1, ..., n,   (6.161)

and satisfies

η = ξ′g,

where g is the vector whose ith component is the expected stage cost sum_{j=1}^n pij g(i, j). (In Chapter 4 we denoted the average cost by λ, but in the present chapter, with apologies to the readers, we reserve λ for use in the TD, LSPE, and LSTD algorithms, hence the change in notation.) Together with a differential cost vector h = (h(1), ..., h(n))′, the average cost η satisfies Bellman's equation

h(i) = sum_{j=1}^n pij ( g(i, j) − η + h(j) ),  i = 1, ..., n.

The solution is unique up to a constant shift for the components of h, and can be made unique by eliminating one degree of freedom, such as fixing the differential cost of a single state to 0 (cf. Prop. 4.1.1).

We consider a linear architecture for the differential costs of the form

h̃(i, r) = φ(i)′r,  i = 1, ..., n,

where r ∈ ℜs is a parameter vector and φ(i) is a feature vector associated with state i. These feature vectors define the subspace

S = {Φr | r ∈ ℜs},

where, as in Section 6.3, Φ is the n × s matrix whose rows are φ(i)′, i = 1, ..., n. We will thus aim to approximate h by a vector in S, similar to Section 6.3, which dealt with cost approximation in the discounted case.

We introduce the mapping F : ℜn → ℜn defined by

F J = g − ηe + P J,

where P is the transition probability matrix and e is the unit vector. Note that the definition of F uses the exact average cost η, as given by Eq. (6.161). With this notation, Bellman's equation becomes

h = F h,

so if we know η, we can try to find or approximate a fixed point of F.
Similar to Section 6.3, we introduce the projected equation

Φr = ΠF(Φr),

where Π is projection on the subspace S with respect to the norm ||·||_ξ. We also consider multistep versions of the projected equation of the form

Φr = ΠF(λ)(Φr),   (6.162)

where

F(λ) = (1 − λ) sum_{t=0}^∞ λ^t F^{t+1}.

In matrix notation, the mapping F(λ) can be written as

F(λ) J = (1 − λ) sum_{t=0}^∞ λ^t P^{t+1} J + sum_{t=0}^∞ λ^t P^t (g − ηe),

or more compactly as

F(λ) J = P(λ) J + (I − λP)^{-1} (g − ηe),   (6.163)

where the matrix P(λ) is defined by

P(λ) = (1 − λ) sum_{t=0}^∞ λ^t P^{t+1}   (6.164)

[cf. Eq. (6.72)]. Note that for λ = 0, we have F(0) = F and P(0) = P.

We wish to delineate conditions under which the mapping ΠF(λ) is a contraction. For this it is necessary to make the following assumption.

Assumption 6.7.2: The columns of the matrix Φ together with the unit vector e = (1, ..., 1)′ form a linearly independent set of vectors.

Note the difference with the corresponding Assumption 6.3.2 for the discounted case in Section 6.3. Here, in addition to Φ having rank s, we require that e does not belong to the subspace S. To get a sense why this is needed, observe that if e ∈ S, then ΠF cannot be a contraction, since any scalar multiple of e, when added to a fixed point of ΠF, would also be a fixed point.

An important issue is thus whether ΠF(λ) is a contraction. The following proposition relates to the composition of general linear mappings with Euclidean projections, and captures the essence of our analysis.
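Summing the geometric series in Eq. (6.164) gives the closed form P(λ) = (1 − λ)P(I − λP)^{-1}. The sketch below verifies this identity numerically on a hypothetical 3-state chain (the matrix P and the value of λ are illustrative assumptions) and checks that P(λ) is again a stochastic matrix, as a convex combination of the stochastic matrices P^{t+1}:

```python
import numpy as np

# Hypothetical irreducible 3-state chain (illustrative numbers).
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.3, 0.3, 0.4]])
lam = 0.7

# Closed form: P_lam = (1 - lam) * P @ inv(I - lam * P).
P_lam = (1 - lam) * P @ np.linalg.inv(np.eye(3) - lam * P)

# Check against the truncated series (1 - lam) * sum_t lam^t P^(t+1).
series = (1 - lam) * sum(
    lam**t * np.linalg.matrix_power(P, t + 1) for t in range(400))
assert np.allclose(P_lam, series)

# P_lam is again stochastic: rows sum to 1 and entries are nonnegative.
print(P_lam.sum(axis=1))
```

Since λ > 0 and the chain is irreducible, all entries of P(λ) are positive, which is the property used in the contraction proof that follows.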
Proposition 6.7.1: Let S be a subspace of ℜn and let L : ℜn → ℜn be a linear mapping,

L(x) = Ax + b,

where A is an n × n matrix and b is a vector in ℜn. Let ||·|| be a weighted Euclidean norm with respect to which L is nonexpansive, and let Π denote projection onto S with respect to that norm.

(a) ΠL has a unique fixed point if and only if either 1 is not an eigenvalue of A, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to S.

(b) If ΠL has a unique fixed point, then for all γ ∈ (0, 1), the mapping

Hγ = (1 − γ)I + γΠL

is a contraction, i.e., for some scalar ργ ∈ (0, 1), we have

||Hγ x − Hγ y|| ≤ ργ ||x − y||,  ∀ x, y ∈ ℜn.   (6.165)

Proof: (a) Assume that ΠL has a unique fixed point, or equivalently (in view of the linearity of L) that 0 is the unique fixed point of ΠA. If 1 were an eigenvalue of A with a corresponding eigenvector z that belongs to S, then Az = z and ΠAz = Πz = z; thus, z would be a fixed point of ΠA with z ≠ 0, a contradiction. Hence, either 1 is not an eigenvalue of A, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to S.

Conversely, assume that either 1 is not an eigenvalue of A, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to S. We will show that the mapping Π(I − A) is one-to-one from S to S, and hence the fixed point of ΠL is the unique vector x* ∈ S satisfying Π(I − A)x* = Πb. Indeed, assume the contrary, i.e., that Π(I − A) has a nontrivial nullspace in S, so that some z ∈ S with z ≠ 0 is a fixed point of ΠA. Then, either Az = z (which is impossible, since then 1 would be an eigenvalue of A, and z a corresponding eigenvector that belongs to S), or Az ≠ z, in which case Az differs from its projection ΠAz, so that

||z|| = ||ΠAz|| < ||Az|| ≤ ||A|| ||z||,

and 1 < ||A|| (which is impossible, since L is nonexpansive, and therefore ||A|| ≤ 1), thereby arriving at a contradiction.
(b) If z ∈ ℜn with z ≠ 0 and z ≠ aΠAz for all a ≥ 0, we have

||(1 − γ)z + γΠAz|| < (1 − γ)||z|| + γ||ΠAz|| ≤ (1 − γ)||z|| + γ||z|| = ||z||,

where the strict inequality follows from the strict convexity of the norm, and the weak inequality follows from the nonexpansiveness of ΠA. If on the other hand z ≠ 0 and z = aΠAz for some a ≥ 0, we also have ||(1 − γ)z + γΠAz|| < ||z||, because then ΠL has a unique fixed point so a ≠ 1, and ΠA is nonexpansive so a < 1. If we define

ργ = sup{ ||(1 − γ)z + γΠAz|| : ||z|| ≤ 1 },

and note that the supremum above is attained by the Weierstrass Theorem (a continuous function attains a maximum over a compact set), we see that ργ < 1. For all x, y ∈ ℜn, by using the definition of Hγ with z = x − y, we have

Hγ x − Hγ y = (1 − γ)(x − y) + γΠA(x − y) = (1 − γ)z + γΠAz,

so that

||Hγ x − Hγ y|| ≤ ργ ||x − y||,

which is Eq. (6.165). Q.E.D.

We can now derive the conditions under which the mapping underlying the LSPE iteration is a contraction with respect to ||·||_ξ.

Proposition 6.7.2: The mapping

Fγ,λ = (1 − γ)I + γΠF(λ)

is a contraction with respect to ||·||_ξ if one of the following is true:

(i) λ ∈ (0, 1) and γ ∈ (0, 1];

(ii) λ = 0 and γ ∈ (0, 1).

Proof: Consider first the case γ = 1 and λ ∈ (0, 1). Then F(λ) is a linear mapping involving the matrix P(λ). Since 0 < λ and all states form a single recurrent class, all entries of P(λ) are positive. Thus P(λ) can be expressed as a convex combination

P(λ) = (1 − β̄)I + β̄ P̄

for some β̄ ∈ (0, 1), where P̄ is a stochastic matrix with positive entries. We make the following observations:
(i) P̄ corresponds to a nonexpansive mapping with respect to the norm ||·||_ξ. The reason is that the steady-state distribution of P̄ is ξ [as can be seen by multiplying the relation P(λ) = (1 − β̄)I + β̄P̄ with ξ′, and by using the relation ξ′ = ξ′P(λ) to verify that ξ′ = ξ′P̄]. Thus, we have ||P̄z||_ξ ≤ ||z||_ξ for all z ∈ ℜn (cf. Lemma 6.3.1), implying that P̄ has the nonexpansiveness property mentioned.

(ii) Since P̄ has positive entries, the states of the Markov chain corresponding to P̄ form a single recurrent class. If z is an eigenvector of P̄ corresponding to the eigenvalue 1, we have z = P̄^k z for all k ≥ 0, so z = P̄* z, where

P̄* = lim_{N→∞} (1/N) sum_{k=0}^{N−1} P̄^k

(cf. Prop. 4.1.2). The rows of P̄* are all equal to ξ′, since the steady-state distribution of P̄ is ξ, so the equation z = P̄*z implies that z is a nonzero multiple of e. Using Assumption 6.7.2, it follows that z does not belong to the subspace S, and from Prop. 6.7.1 (with P̄ in place of A, and β̄ in place of γ), we see that ΠP(λ) is a contraction with respect to the norm ||·||_ξ. This implies that ΠF(λ) is also a contraction.

Consider next the case γ ∈ (0, 1) and λ ∈ (0, 1). Since ΠF(λ) is a contraction with respect to ||·||_ξ, as just shown, we have for any J, J̄ ∈ ℜn,

||Fγ,λ J − Fγ,λ J̄||_ξ ≤ (1 − γ)||J − J̄||_ξ + γ||ΠF(λ)J − ΠF(λ)J̄||_ξ ≤ (1 − γ + γβ)||J − J̄||_ξ,

where β is the contraction modulus of ΠF(λ). Hence, Fγ,λ is a contraction.

Finally, consider the case γ ∈ (0, 1) and λ = 0. We will show that the mapping ΠF has a unique fixed point, by showing that either 1 is not an eigenvalue of P, or else the eigenvectors corresponding to the eigenvalue 1 do not belong to S [cf. Prop. 6.7.1(a)]. Assume the contrary, i.e., that some z ∈ S with z ≠ 0 is an eigenvector corresponding to 1. We then have z = Pz, from which it follows that z = P^k z for all k ≥ 0, so z = P*z, where

P* = lim_{N→∞} (1/N) sum_{k=0}^{N−1} P^k

(cf. Prop. 4.1.2). The rows of P* are all equal to ξ′, so the equation z = P*z implies that z is a nonzero multiple of e. Hence, by Assumption 6.7.2, z cannot belong to S, a contradiction. Thus ΠF has a unique fixed point, and the contraction property of Fγ,λ for γ ∈ (0, 1) and λ = 0 follows from Prop. 6.7.1(b). Q.E.D.
Error Estimate
We have shown that for each λ ∈ [0, 1), there is a vector Φr*_λ, the unique fixed point of ΠFγ,λ, γ ∈ (0, 1), which is the limit of LSPE(λ) (cf. Prop. 6.7.2). Let h be any differential cost vector, and let βγ,λ be the contraction modulus of ΠFγ,λ with respect to ||·||_ξ. Similar to the proof of Prop. 6.3.2 for the discounted case, we have

||h − Φr*_λ||²_ξ = ||h − Πh||²_ξ + ||Πh − Φr*_λ||²_ξ
 = ||h − Πh||²_ξ + ||ΠFγ,λ h − ΠFγ,λ(Φr*_λ)||²_ξ
 ≤ ||h − Πh||²_ξ + β²_{γ,λ} ||h − Φr*_λ||²_ξ.

It follows that

||h − Φr*_λ||_ξ ≤ (1/sqrt(1 − β²_{γ,λ})) ||h − Πh||_ξ,  λ ∈ [0, 1), γ ∈ (0, 1),   (6.166)

for all differential cost vectors h. This estimate is a little peculiar because the differential cost vector is not unique. The set of differential cost vectors is

D = { h* + γe | γ ∈ ℜ },

where h* is the bias of the policy evaluated (cf. Section 4.1, and Props. 4.1.1 and 4.1.2). In particular, h* is the unique h ∈ D that satisfies ξ′h = 0, or equivalently P*h = 0, where

P* = lim_{N→∞} (1/N) sum_{k=0}^{N−1} P^k.

Usually, in average cost policy evaluation, we are interested in obtaining a small error (h − Φr*_λ), with the choice of h being immaterial (see the discussion of the next section on approximate policy iteration). It follows that, since the estimate (6.166) holds for all h ∈ D, a better error bound can be obtained by using an optimal choice of h in the left-hand side and an optimal choice of γ in the right-hand side. Indeed, Tsitsiklis and Van Roy [TsV99a] have obtained such an optimized error estimate. It has the form

min_{h∈D} ||h − Φr*_λ||_ξ = ||h* − (I − P*)Φr*_λ||_ξ ≤ (1/sqrt(1 − α²_λ)) ||Π*h* − h*||_ξ,   (6.167)

where h* is the bias vector, Π* denotes projection with respect to ||·||_ξ onto the subspace

S* = { (I − P*)y | y ∈ S },
and αλ is the minimum over γ ∈ (0, 1) of the contraction modulus of the mapping Π*Fγ,λ:

αλ = min_{γ∈(0,1)} max_{||y||_ξ=1} ||Π* Pγ,λ y||_ξ,

where Pγ,λ = (1 − γ)I + γΠ*P(λ). Note that this error bound has a similar form to the one for discounted problems (cf. Prop. 6.3.5), but S has been replaced by S* and Π has been replaced by Π*. It can be shown that the scalar αλ decreases as λ increases, and approaches 0 as λ ↑ 1. This is consistent with the corresponding error bound for discounted problems (cf. Prop. 6.3.5), and is also consistent with empirical observations, which suggest that smaller values of λ lead to larger approximation errors. Figure 6.7.1 illustrates and explains the projection operation Π*, the distance of the bias h* from its projection Π*h*, and the other terms in the error bound (6.167).

LSTD(λ) and LSPE(λ)

The LSTD(λ) and LSPE(λ) algorithms for average cost are straightforward extensions of the discounted versions, and will only be summarized. The LSPE(λ) iteration can be written (similar to the discounted case) as

rk+1 = rk − γ Gk (Ck rk − dk),   (6.168)

where γ is a positive stepsize and

Ck = (1/(k+1)) sum_{t=0}^k zt ( φ(it) − φ(it+1) )′,

Gk = (1/(k+1)) sum_{t=0}^k φ(it) φ(it)′,

dk = (1/(k+1)) sum_{t=0}^k zt ( g(it, it+1) − ηt ),

zt = sum_{m=0}^t λ^{t−m} φ(im),

with ηt an estimate of the average cost η. Scaled versions of this algorithm, where Gk is a scaling matrix, are also possible. The LSTD(λ) algorithm is given by

r̂k = Ck^{-1} dk.

There is also a regression-based version that is well-suited for cases where Ck is nearly singular (cf. Section 6.3.4). The matrices Ck, Gk, and vector dk can be shown to converge to limits:

Ck → Φ′Ξ(I − P(λ))Φ,  Gk → Φ′ΞΦ,  dk → Φ′Ξ g(λ),   (6.169)

where the matrix P(λ) is defined by Eq. (6.164), g(λ) is given by

g(λ) = sum_{ℓ=0}^∞ λ^ℓ P^ℓ (g − ηe),

and Ξ is the diagonal matrix with diagonal entries ξ1, ..., ξn:

Ξ = diag(ξ1, ..., ξn)

[cf. Eqs. (6.72) and (6.73)].
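The quantities above can be accumulated along a single simulated trajectory. The following sketch implements the tabular ingredients of average cost LSTD(λ) on a hypothetical 3-state chain (the chain, the costs, and the single basis function are illustrative assumptions, not data from the text); the running average of the observed stage costs serves as the estimate ηt:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state chain; the stage cost is taken to depend only on
# the current state, g(i, j) = g[i].  All numbers are illustrative.
P = np.array([[0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4],
              [0.5, 0.2, 0.3]])
g = np.array([1.0, 2.0, 0.5])
phi = np.array([[1.0], [2.0], [3.0]])    # one basis function (s = 1)
lam, K = 0.5, 20000

i = 0
z = np.zeros(1)                          # eligibility vector z_t
C = np.zeros((1, 1)); d = np.zeros(1); G = np.zeros((1, 1))
cost_sum = 0.0
for k in range(K):
    j = int(rng.choice(3, p=P[i]))
    cost_sum += g[i]
    eta = cost_sum / (k + 1)             # running estimate of eta
    z = lam * z + phi[i]                 # z_t = sum_m lam^(t-m) phi(i_m)
    C += np.outer(z, phi[i] - phi[j])
    d += z * (g[i] - eta)
    G += np.outer(phi[i], phi[i])
    i = j
C /= K; d /= K; G /= K

r_lstd = np.linalg.solve(C, d)           # LSTD(lambda): r = C^{-1} d
print(r_lstd)
```

Since the feature vector here is not a multiple of e, the limit matrix Φ′Ξ(I − P(λ))Φ is nonsingular, so the solve is well posed for large K.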
Figure 6.7.1: Illustration of the estimate (6.167). Consider the subspace

E* = { (I − P*)y | y ∈ ℜn }.

Let Ξ be the diagonal matrix with ξ1, ..., ξn on the diagonal. Note that:

(a) E* is the subspace that is orthogonal to the unit vector e in the scaled geometry of the norm ||·||_ξ, in the sense that e′Ξz = 0 for all z ∈ E*. Indeed, we have

e′Ξ(I − P*)y = 0,  for all y ∈ ℜn,

because e′Ξ = ξ′ and ξ′(I − P*) = 0, as can be easily verified from the fact that the rows of P* are all equal to ξ′.

(b) Projection onto E* with respect to the norm ||·||_ξ is simply multiplication with (I − P*) (since P*y = (ξ′y)e, so P*y is orthogonal to E* in the scaled geometry of the norm ||·||_ξ).

(c) We have h* ∈ E*, since (I − P*)h* = h* in view of P*h* = 0.

(d) The equation

min_{h∈D} ||h − Φr*_λ||_ξ = ||h* − (I − P*)Φr*_λ||_ξ

is geometrically evident from the figure.

(e) The estimate (6.167) is the analog of the discounted estimate of Prop. 6.3.5, with E* playing the role of the entire space, and with the "geometry of the problem" projected onto E*. Thus, S* plays the role of S, (I − P*)Φr*_λ plays the role of Φr*_λ, h* plays the role of Jµ, and Π* plays the role of Π. Finally, αλ is the best possible contraction modulus of Π*Fγ,λ over γ ∈ (0, 1) and within E* (see the paper [TsV99a] for a detailed analysis), and the term ||Π*h* − h*||_ξ of the error bound is the minimum possible error, given that h* is approximated with an element of S*.
6.7.2 Approximate Policy Iteration

Let us consider an approximate policy iteration method that involves approximate policy evaluation and approximate policy improvement. We assume that all stationary policies are unichain, and that a special state s is recurrent in the Markov chain corresponding to each stationary policy. In particular, we consider the stochastic shortest path problem obtained by leaving unchanged all transition probabilities pij(u) for j ≠ s, by setting all transition probabilities pis(u) to 0, and by introducing an artificial termination state t to which we move from each state i with probability pis(u). The one-stage cost is equal to g(i, u) − η, where η is a scalar parameter. We refer to this stochastic shortest path problem as the η-SSP.

The method generates a sequence of stationary policies µk, a corresponding sequence of gains ηµk, and a sequence of cost vectors hk. We assume that for some ǫ > 0,

max_{i=1,...,n} |hk(i) − hµk,ηk(i)| ≤ ǫ,  k = 0, 1, ...,

where

ηk = min_{m=0,...,k} ηµm,

hµk,ηk(i) is the cost-to-go from state i to the reference state s for the ηk-SSP under policy µk, and ǫ is a positive scalar quantifying the accuracy of evaluation of the cost-to-go function of the ηk-SSP. Note that we assume exact calculation of the gains ηµk. Note also that we may calculate approximate differential costs h̃k(i, r) that depend on a parameter vector r, without regard to the reference state s. These differential costs may then be replaced by

h̃k(i) = h̃k(i, r) − h̃k(s, r).

We assume that policy improvement is carried out by approximate minimization in the DP mapping. In particular, we assume that there exists a tolerance δ > 0 such that for all i and k, µk+1(i) attains the minimum in the expression

min_{u∈U(i)} sum_{j=1}^n pij(u) ( g(i, u, j) + hk(j) ),

within a tolerance δ > 0.

We now note that since ηk is monotonically nonincreasing and is bounded below by the optimal gain η*, it must converge to some scalar η̄. Since ηk can take only one of the finite number of values ηµ corresponding to the finite number of stationary policies µ, we see that ηk must converge finitely to η̄; that is, for some k̄, we have

ηk = η̄,  k ≥ k̄.
Let hη̄(s) denote the optimal cost-to-go from state s in the η̄-SSP. Then it can be shown that

lim sup_{k→∞} ( hµk,η̄(s) − hη̄(s) ) ≤ n(1 − ρ + n)(δ + 2ǫ)/(1 − ρ)²,   (6.170)

where

ρ = max_{i=1,...,n, µ} P( ik ≠ s, k = 1, ..., n | i0 = i, µ ),

and ik denotes the state of the system after k stages. On the other hand, the relation η̄ ≤ ηµk implies that hµk,η̄(s) ≥ hµk,ηµk(s) = 0, so that, using also Fig. 6.7.2,

hµk,η̄(s) − hη̄(s) ≥ −hη̄(s) ≥ −hµ∗,η̄(s) = (η̄ − η*)Nµ∗,   (6.171)

where µ* is an optimal policy for the η*-SSP (and hence also for the original average cost per stage problem) and Nµ∗ is the expected number of stages to return to state s, starting from s and using µ*. Thus, from Eqs. (6.170) and (6.171), we have

η̄ − η* ≤ n(1 − ρ + n)(δ + 2ǫ) / ( Nµ∗ (1 − ρ)² ).   (6.172)

This relation provides an estimate on the steady-state error of the approximate policy iteration method.

We finally note that optimistic versions of the preceding approximate policy iteration method are harder to implement than their discounted cost counterparts. The reason is our assumption that the gain ηµ of every generated policy µ is exactly calculated; in an optimistic method, the current policy µ may not remain constant for sufficiently long time to estimate ηµ accurately. One may consider schemes where an optimistic version of policy iteration is used to solve the η-SSP for a fixed η. The value of η may occasionally be adjusted downward by calculating "exactly" through simulation the gain ηµ of some of the (presumably most promising) generated policies µ, and by then updating η according to η := min{η, ηµ}. An alternative is to approximate the average cost problem with a discounted problem, for which an optimistic version of approximate policy iteration can be readily implemented.
Figure 6.7.2: Relation of the costs of stationary policies for the η-SSP in the approximate policy iteration method. Here, Nµ is the expected number of stages to return to state s, starting from s and using µ, so that hµ,η(s) = (ηµ − η)Nµ. Since ηµk ≥ η̄, we have hµk,η̄(s) ≥ hµk,ηµk(s) = 0. Furthermore, if µ* is an optimal policy for the η*-SSP, we have hη̄(s) ≤ hµ∗,η̄(s) = (η* − η̄)Nµ∗.

6.7.3 Q-Learning for Average Cost Problems

To derive the appropriate form of the Q-learning algorithm, we form an auxiliary average cost problem by augmenting the original system with one additional state for each possible pair (i, u) with u ∈ U(i). Thus, the states of the auxiliary problem are those of the original problem, i = 1, ..., n, together with the additional states (i, u), i = 1, ..., n, u ∈ U(i). The probabilistic transition mechanism from an original state i is the same as for the original problem [probability pij(u) of moving to state j], while the probabilistic transition mechanism from a state (i, u) is that we move only to states j of the original problem, with corresponding probabilities pij(u) and costs g(i, u, j).

It can be seen that the auxiliary problem has the same optimal average cost per stage η as the original, and that the corresponding Bellman's
. . we have that hk (i) = min Qk (i. . u) is the diﬀerential cost corresponding to (i. . . we obtain Bellman’s equation in a form that exclusively involves the Qfactors: n η+Q(i.175) From these equations. . u) = j=1 pij (u) g(i. v) − min Qk (s. . u.173) and (6. we obtain the following relative value iteration for the Qfactors n Qk+1 (i. j) + h(j) .174) and comparing with Eq. u. u ∈ U (i). u. (6. u ∈ U (i). Substituting the above form of h(i) in Eq. . u ∈ U (i). u∈U(i) i = 1. i = 1. j) + min Qk (j. j=1 i = 1. . j) + min Q(j. u) = j=1 pij (u) g(i. (6. . u) = j=1 pij (u) g(i.173) η + Q(i. Qk+1 (i. n. . u. . (6. . . .448 Approximate Dynamic Programming Chap. . . . i = 1. We then obtain the iteration [cf. . n.175). . where s is a special state.174)] n hk+1 (i) = min n u∈U(i) j=1 pij (u) g(i. . u. v) .174) where Q(i. . u). (6. . u. u∈U(i) i = 1. n. n. we obtain h(i) = min Q(i. . j) + hk (j) − hk (s). . u). i = 1. v∈U(j) Let us now apply to the auxiliary problem the following variant of the relative value iteration hk+1 = T hk − hk (s)e. and by substituting the above form of hk in Eq. 6 equation is n η + h(i) = min u∈U(i) n pij (u) g(i. j)+hk (j) −hk (s). (6. Taking the minimum over u in Eq. . (6. Eqs. n. j) + h(j) . (6. . . u).173). n. u) = j=1 pij (u) g(i.174). n. v∈U(j) v∈U(s) . (6. i = 1. v).
An incremental version of the preceding iteration that involves a positive stepsize γ is given by

Q(i, u) := (1 − γ)Q(i, u) + γ ( sum_{j=1}^n pij(u) ( g(i, u, j) + min_{v∈U(j)} Q(j, v) ) − min_{v∈U(s)} Q(s, v) ).

The natural form of the Q-learning method for the average cost problem is an approximate version of this iteration, whereby the expected value is replaced by a single sample, i.e.,

Q(i, u) := (1 − γ)Q(i, u) + γ ( g(i, u, j) + min_{v∈U(j)} Q(j, v) − min_{v∈U(s)} Q(s, v) ),

where j and g(i, u, j) are generated from the pair (i, u) by simulation. In this method, only the Q-factor corresponding to the currently sampled pair (i, u) is updated at each iteration, while the remaining Q-factors remain unchanged. Also, the stepsize should be diminishing to 0. A convergence analysis of this method can be found in the paper by Abounadi, Bertsekas, and Borkar [ABB01].

Q-Learning Based on the Contracting Value Iteration

We now consider an alternative Q-learning method, which is based on the contracting value iteration method of Section 4.3. If we apply this method to the auxiliary problem used above, we obtain the following algorithm:

hk+1(i) = min_{u∈U(i)} ( sum_{j=1}^n pij(u) g(i, u, j) + sum_{j≠s} pij(u) hk(j) ) − ηk,   (6.176)

ηk+1 = ηk + αk hk+1(s).   (6.177)
If we define Q-factors by

Qk+1(i, u) = sum_{j=1}^n pij(u) g(i, u, j) + sum_{j≠s} pij(u) hk(j) − ηk,

then hk(i) = min_{u∈U(i)} Qk(i, u), and by substituting this form of hk in the preceding equation, we obtain the iteration

Qk+1(i, u) = sum_{j=1}^n pij(u) g(i, u, j) + sum_{j≠s} pij(u) min_{v∈U(j)} Qk(j, v) − ηk,

ηk+1 = ηk + αk min_{v∈U(s)} Qk+1(s, v).

A small-stepsize version of this iteration is given by

Q(i, u) := (1 − γ)Q(i, u) + γ ( sum_{j=1}^n pij(u) g(i, u, j) + sum_{j≠s} pij(u) min_{v∈U(j)} Q(j, v) − η ),

η := η + α min_{v∈U(s)} Q(s, v),

where γ and α are positive stepsizes. A natural form of Q-learning based on this iteration is obtained by replacing the expected values by a single sample, i.e.,

Q(i, u) := (1 − γ)Q(i, u) + γ ( g(i, u, j) + min_{v∈U(j)} Q̂(j, v) − η ),   (6.178)

η := η + α min_{v∈U(s)} Q(s, v),   (6.179)

where

Q̂(j, v) = Q(j, v) if j ≠ s, and Q̂(j, v) = 0 if j = s,

and j and g(i, u, j) are generated from the pair (i, u) by simulation. Here the stepsizes γ and α should be diminishing, but α should diminish "faster" than γ; i.e., the ratio of the stepsizes α/γ should converge to zero. For example, we may use γ = C/k and α = c/(k log k), where C and c are positive constants and k is the number of iterations performed on the corresponding pair (i, u) or η, respectively.

The algorithm has two components: the iteration (6.178), which is essentially a Q-learning method that aims to solve the η-SSP for the current value of η, and the iteration (6.179), which updates η towards its correct value η*. However, η is updated at a slower rate than Q, since the stepsize ratio α/γ converges to zero. The effect is that the Q-learning iteration (6.178) is fast enough to keep pace with the slower changing η-SSP. A convergence analysis of this method can also be found in the paper [ABB01].
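The two-timescale mechanics of (6.178)-(6.179) can be sketched as follows on a hypothetical 2-state, 2-action problem (all numerical data and the uniform exploration rule are illustrative assumptions). Because α/γ → 0, the η update is deliberately very slow, so this sketch only illustrates the mechanics of the coupled iterations rather than accurate convergence:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 2-action problem: p[u][i][j] are transition
# probabilities and c[u][i] expected stage costs (illustrative numbers).
p = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.6, 0.4]]])
c = np.array([[2.0, 1.0],
              [0.5, 3.0]])
s = 0

Q = np.zeros((2, 2))
eta = 0.0
counts = np.zeros((2, 2), dtype=int)
i = 0
for k in range(1, 50001):
    u = int(rng.integers(2))              # uniform exploration (assumption)
    j = int(rng.choice(2, p=p[u, i]))     # simulate one transition
    counts[i, u] += 1
    gamma = 1.0 / counts[i, u]            # gamma = C/k per pair, with C = 1
    alpha = 1.0 / (k * np.log(k + 1.0))   # alpha = c/(k log k), with c = 1
    Qhat = 0.0 if j == s else Q[j].min()  # Q-hat of Eq. (6.178)
    Q[i, u] = (1 - gamma) * Q[i, u] + gamma * (c[u, i] + Qhat - eta)
    eta += alpha * Q[s].min()             # slow eta update, Eq. (6.179)
    i = j
print(Q, eta)
```

The fast iteration tracks the Q-factors of the η-SSP for the current η, while η drifts slowly toward η*; in practice the constants C and c must be tuned to the problem.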
6.8 SIMULATION-BASED SOLUTION OF LARGE SYSTEMS

We have focused so far in this chapter on approximating the solution of Bellman equations within a subspace of basis functions in a variety of contexts. We have seen common analytical threads across discounted, SSP, and average cost problems, as well as differences in formulations, implementation details, and associated theoretical results. In this section we will aim for a more general view of simulation-based solution of large systems, within which the methods and analysis so far can be understood and extended. The benefit of this generalization is a deeper perspective, and the ability to address new problems in DP and beyond.

6.8.1 Projected Equations - Simulation-Based Versions

We first focus on general linear fixed point equations

x = T(x),

where

T(x) = b + Ax,   (6.180)

A is an n × n matrix, and b ∈ ℜn is a vector. Examples are Bellman's equation for policy evaluation, in which case A = αP, where P is a transition matrix (discounted and average cost) or a substochastic matrix (row sums less than or equal to 1, as in SSP), and α = 1 (SSP and average cost) or α < 1 (discounted). Other examples in DP include the semi-Markov problems discussed in Chapter 5. However, for the moment we do not assume the presence of any stochastic structure in A. We assume throughout that the columns of the n × s matrix Φ are linearly independent basis functions.

We consider approximations of a solution by solving a projected equation

Φr = ΠT(Φr) = Π(b + AΦr),

where Π denotes projection with respect to a weighted Euclidean norm ||·||_ξ on a subspace

S = {Φr | r ∈ ℜs}.

We assume throughout that I − ΠA is invertible, so that the projected equation has a unique solution denoted r*.

Even though T or ΠT may not be contractions, we can obtain an error bound that generalizes some of the bounds obtained earlier. We have

x* − Φr* = x* − Πx* + ΠT x* − ΠT(Φr*) = x* − Πx* + ΠA(x* − Φr*),   (6.181)

from which

x* − Φr* = (I − ΠA)^{-1} (x* − Πx*).

Thus, for any norm ||·|| and fixed point x* of T,

||x* − Φr*|| ≤ ||(I − ΠA)^{-1}|| ||x* − Πx*||,   (6.182)
so the approximation error x* − Φr* is proportional to the distance of the solution x* from the approximation subspace. If ΠT is a contraction mapping of modulus α ∈ (0, 1) with respect to ||·||, we have

||x* − Φr*|| ≤ ||x* − Πx*|| + ||ΠT(x*) − ΠT(Φr*)|| ≤ ||x* − Πx*|| + α||x* − Φr*||,

so that

||x* − Φr*|| ≤ (1/(1 − α)) ||x* − Πx*||.   (6.183)

We first discuss an LSTD-type method for solving the projected equation Φr = Π(b + AΦr). Let us assume that the positive distribution vector ξ is given. By the definition of projection with respect to ||·||_ξ, the unique solution r* of this equation satisfies

r* = arg min_{r∈ℜs} ||Φr − (b + AΦr*)||²_ξ.

Setting to 0 the gradient with respect to r, we obtain the corresponding orthogonality condition

Φ′Ξ( Φr* − (b + AΦr*) ) = 0,

where Ξ is the diagonal matrix with the probabilities ξ1, ..., ξn along the diagonal. Equivalently,

Cr* = d,

where

C = Φ′Ξ(I − A)Φ,  d = Φ′Ξb,   (6.184)

[cf. the matrix form (6.33)-(6.34) of the projected equation for discounted DP problems]. We write C and d as expected values with respect to ξ:

C = sum_{i=1}^n ξi φ(i) ( φ(i) − sum_{j=1}^n aij φ(j) )′,  d = sum_{i=1}^n ξi φ(i) bi.   (6.185)

As in Section 6.3, we approximate these expected values by simulation-obtained sample averages, and we will now develop a simulation-based approximation to the system Cr* = d by using corresponding estimates of C and d. However, here we do not have a Markov chain structure by which to generate samples. We must therefore design a sampling process that can be used to properly approximate the expected values in Eq. (6.185). In the most basic form of such a process, we generate a sequence of indices {i0, i1, ...}, and a sequence of transitions between indices
We use any probabilistic mechanism for this, subject to the following two requirements (cf. Fig. 6.8.1):

(1) Row sampling: The sequence {i_0, i_1, ...} is generated according to the distribution ξ, which defines the projection norm ‖·‖_ξ, in the sense that with probability 1,

    lim_{k→∞} (Σ_{t=0}^k δ(i_t = i)) / (k + 1) = ξ_i,    i = 1, ..., n,    (6.186)

where δ(·) denotes the indicator function [δ(E) = 1 if the event E has occurred and δ(E) = 0 otherwise]. A Markov chain Q may be used for this, but this is not a requirement.

(2) Column sampling: The sequence {(i_0, j_0), (i_1, j_1), ...} is generated according to a certain stochastic matrix P with transition probabilities p_{ij} that satisfy

    p_{ij} > 0    if a_{ij} ≠ 0,    (6.187)

in the sense that with probability 1,

    lim_{k→∞} (Σ_{t=0}^k δ(i_t = i, j_t = j)) / (Σ_{t=0}^k δ(i_t = i)) = p_{ij},    i, j = 1, ..., n.    (6.188)

It is possible that j_k = i_{k+1}, but this is not necessary.

[Figure 6.8.1: The basic simulation methodology consists of (a) generating a sequence of indices {i_0, i_1, ...} according to the distribution ξ (row sampling, which may use a Markov chain Q), and (b) generating a sequence of transitions {(i_0, j_0), (i_1, j_1), ...} using a Markov chain P with P ~ |A| (column sampling).]

At time k, we approximate C and d with

    C_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) (φ(i_t) − (a_{i_t j_t}/p_{i_t j_t}) φ(j_t))',    d_k = (1/(k+1)) Σ_{t=0}^k φ(i_t) b_{i_t}.    (6.189)

To show that this is a valid approximation, we count the number of times an index occurs and after collecting terms, we write Eq. (6.189) as

    C_k = Σ_{i=1}^n ξ̂_{i,k} φ(i) (φ(i) − Σ_{j=1}^n p̂_{ij,k} (a_{ij}/p_{ij}) φ(j))',    d_k = Σ_{i=1}^n ξ̂_{i,k} φ(i) b_i,    (6.190)

where

    ξ̂_{i,k} = (Σ_{t=0}^k δ(i_t = i)) / (k + 1),    p̂_{ij,k} = (Σ_{t=0}^k δ(i_t = i, j_t = j)) / (Σ_{t=0}^k δ(i_t = i)).

In view of the assumption ξ̂_{i,k} → ξ_i and p̂_{ij,k} → p_{ij}, i, j = 1, ..., n [cf. Eqs. (6.186) and (6.188)], by comparing Eqs. (6.185) and (6.190), we see that C_k → C and d_k → d, with probability 1.

A comparison of Eqs. (6.185) and (6.190) indicates some considerations for selecting the stochastic matrix P. It can be seen that "important" (e.g., large) components a_{ij} should be simulated more often (p_{ij}: large). On the other hand, if (i, j) is such that a_{ij} = 0, there is an incentive to choose p_{ij} = 0, since corresponding transitions (i, j) are "wasted" in that they do not contribute to improvement of the approximation of Eq. (6.185) by Eq. (6.190). This suggests that the structure of P should match in some sense the structure of the matrix A, to improve the efficiency of the simulation (the number of samples needed for a given level of simulation error variance). On the other hand, the choice of P does not affect the limit of Φr̂_k, which is the solution Φr* of the projected equation. By contrast, the choice of ξ affects the projection Π and hence also Φr*.

Note that there is a lot of flexibility for generating the sequence {i_0, i_1, ...} and the transition sequence {(i_0, j_0), (i_1, j_1), ...} to satisfy Eqs. (6.186) and (6.188). For example, to satisfy Eq. (6.186), the indices i_t do not need to be sampled independently according to ξ. Instead, it may be convenient to introduce an irreducible Markov chain with transition matrix Q, states 1, ..., n, and ξ as its steady-state probability vector, and to start at some state i_0 and generate the sequence {i_0, i_1, ...} as a single infinitely long trajectory of the chain. For the transition sequence, we may optionally let j_k = i_{k+1} for all k, in which case P would be identical to Q, but in general this is not essential.

Let us discuss two possibilities for constructing a Markov chain with steady-state probability vector ξ. The first is useful when a desirable distribution ξ is known up to a normalization constant. Then we can construct such a chain using techniques that are common in Markov chain Monte Carlo (MCMC) methods (see e.g., [Liu01], Rubinstein and Kroese [RuK08]).

The other possibility, which is useful when there is no particularly desirable ξ, is to specify first the transition matrix Q of the Markov chain and let ξ be its steady-state probability vector. Then the requirement (6.186) will be satisfied if the Markov chain is irreducible, in which case ξ will be the unique steady-state probability vector of the chain and will have positive components. An important observation is that explicit knowledge of ξ is not required; it is just necessary to know the Markov chain and to be able to simulate its transitions. The approximate DP applications of Sections 6.3, 6.6, and 6.7, where Q = P, fall into this context. In the next section, we will discuss favorable methods for constructing the transition matrix Q from A, which result in ΠT being a contraction so that iterative methods are applicable.

Note that multiple simulated sequences can be used to form the equation (6.189). For example, in the Markov chain-based sampling schemes, we can generate multiple infinitely long trajectories of the chain, starting at several different states, and for each trajectory use j_k = i_{k+1} for all k. This will work even if the chain has multiple recurrent classes, as long as there are no transient states and at least one trajectory is started from within each recurrent class. Again ξ will be a steady-state probability vector of the chain, and need not be known explicitly. Note also that using multiple trajectories may be interesting even if there is a single recurrent class, for at least two reasons:

(a) The generation of trajectories may be parallelized among multiple processors, resulting in significant speedup.

(b) The empirical frequencies of occurrence of the states may approach the steady-state probabilities more quickly; this is particularly so for large and "stiff" Markov chains.

We finally note that the option of using distinct Markov chains Q and P for row and column sampling is important in the DP/policy iteration context. In particular, by using a distribution ξ that is not associated with P, we may resolve the issue of exploration (see Section 6.3.8).

6.8.2 Matrix Inversion and Regression-Type Methods

Given simulation-based estimates C_k and d_k of C and d, respectively, we may approximate r* = C^{-1}d with

    r̂_k = C_k^{-1} d_k,

in which case we have r̂_k → r* with probability 1 (this parallels the LSTD method of Section 6.3). An alternative, which is more suitable for the case where C_k is nearly singular, is the regression/regularization-based estimate

    r̂_k = (C_k'Σ^{-1}C_k + βI)^{-1}(C_k'Σ^{-1}d_k + βr̄),    (6.191)

[cf. Eq. (6.51) in Section 6.3.4], where r̄ is an a priori estimate of r* = C^{-1}d, β is a positive scalar, and Σ is some positive definite symmetric matrix.
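The pipeline of Eqs. (6.186)-(6.191) can be sketched in a few lines. The code below (our own illustration, with small hypothetical data) samples rows i.i.d. from a known ξ, which is the simplest mechanism satisfying Eq. (6.186), samples columns from P proportional to |A|, forms C_k and d_k per Eq. (6.189), and computes both the matrix-inversion estimate and the regression estimate (6.191):

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, K = 5, 2, 200_000

# Hypothetical system x = Ax + b and basis matrix Phi.
A = 0.15 * rng.random((n, n))       # small spectral radius, I - A invertible
b = rng.random(n)
Phi = rng.random((n, s)) - 0.5
xi = np.full(n, 1.0 / n)            # row-sampling distribution
P = np.abs(A) / np.abs(A).sum(axis=1, keepdims=True)  # p_ij > 0 where a_ij != 0

# Row sampling (i.i.d. from xi) and column sampling from P(i, .).
i_t = rng.choice(n, size=K, p=xi)
u = rng.random(K)
j_t = (u[:, None] > np.cumsum(P, axis=1)[i_t]).sum(axis=1)
j_t = np.minimum(j_t, n - 1)        # guard against float rounding in the CDF

# Sample averages of Eq. (6.189).
ratio = A[i_t, j_t] / P[i_t, j_t]
C_k = np.einsum('ti,tj->ij', Phi[i_t], Phi[i_t] - ratio[:, None] * Phi[j_t]) / K
d_k = (Phi[i_t] * b[i_t, None]).sum(axis=0) / K

# LSTD-analog estimate, and regression/regularization estimate (6.191).
r_lstd = np.linalg.solve(C_k, d_k)
Sigma_inv = np.eye(s)               # Sigma = I for simplicity
beta, r_bar = 1e-3, np.zeros(s)
r_reg = np.linalg.solve(C_k.T @ Sigma_inv @ C_k + beta * np.eye(s),
                        C_k.T @ Sigma_inv @ d_k + beta * r_bar)

# Exact quantities (6.184) for comparison.
Xi = np.diag(xi)
C = Phi.T @ Xi @ (np.eye(n) - A) @ Phi
d = Phi.T @ Xi @ b
r_star = np.linalg.solve(C, d)
```

With i.i.d. row sampling the convergence C_k → C, d_k → d is just the law of large numbers; the Markov chain mechanisms of the text give the same limits under the ergodicity conditions (6.186) and (6.188).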
The error estimate for the regression-based method parallels Prop. 6.3.4, and is given in the following proposition. To formulate it, we view all variables generated by simulation to be random variables on a common probability space, and we denote by P(E) the probability of an event E. Let Σ̂_k be the covariance of (d_k − C_k r*), and let P̂_k be the cumulative distribution function of ‖b̂_k‖^2, where b̂_k = Σ̂_k^{-1/2}(d_k − C_k r*).

Proposition 6.8.1: We have

    P(‖r̂_k − r*‖ ≤ σ_k(Σ, β)) ≥ 1 − θ,    (6.192)

where

    σ_k(Σ, β) = max_{i=1,...,s} (λ_i/(λ_i^2 + β)) ‖Σ^{-1/2} Σ̂_k^{1/2}‖ sqrt(P̂_k^{-1}(1 − θ)) + max_{i=1,...,s} (β/(λ_i^2 + β)) ‖r̄ − r*‖,    (6.193)

λ_1, ..., λ_s are the singular values of Σ^{-1/2}C_k, and P̂_k^{-1}(1 − θ) is the threshold value v at which the probability that ‖b̂_k‖^2 takes value greater than v is θ.

Proof: Let b_k = Σ^{-1/2}(d_k − C_k r*) and b̂_k = Σ̂_k^{-1/2}(d_k − C_k r*). Note that b̂_k has covariance equal to the identity. Following the notation and proof of Prop. 6.3.4, and using the relation b_k = Σ^{-1/2} Σ̂_k^{1/2} b̂_k, we have

    r̂_k − r* = V(Λ^2 + βI)^{-1}ΛU'b_k + βV(Λ^2 + βI)^{-1}V'(r̄ − r*)
             = V(Λ^2 + βI)^{-1}ΛU'Σ^{-1/2}Σ̂_k^{1/2}b̂_k + βV(Λ^2 + βI)^{-1}V'(r̄ − r*),

where Σ^{-1/2}C_k = UΛV' is a singular value decomposition with Λ = diag(λ_1, ..., λ_s). From this, we similarly obtain

    ‖r̂_k − r*‖ ≤ max_{i=1,...,s} (λ_i/(λ_i^2 + β)) ‖Σ^{-1/2} Σ̂_k^{1/2}‖ ‖b̂_k‖ + max_{i=1,...,s} (β/(λ_i^2 + β)) ‖r̄ − r*‖.

Since P̂_k is the cumulative distribution function of ‖b̂_k‖^2, we have

    ‖b̂_k‖ ≤ sqrt(P̂_k^{-1}(1 − θ))

with probability (1 − θ), and since Eq. (6.192) holds whenever this inequality holds, the desired result follows. Q.E.D.

The error estimate given by Prop. 6.8.1 has the same character as the one of Prop. 6.3.4. In particular, the error ‖r̂_k − r*‖ is bounded by the sum of two terms: one due to simulation error (which is larger when C is nearly singular, and decreases with the amount of sampling used), and the other due to regularization error (which depends on the regularization parameter β and the error ‖r̄ − r*‖).

In a practical application of Prop. 6.8.1, the quantities in Eq. (6.193) other than P̂_k^{-1}(1 − θ) and Σ̂_k (i.e., Σ, β, λ_i, and r̄) are known. Using a form of the central limit theorem, we may assume that for a large number of samples, b̂_k asymptotically becomes a Gaussian random s-dimensional vector, so that the random variable

    ‖b̂_k‖^2 = (d_k − C_k r*)'Σ̂_k^{-1}(d_k − C_k r*)

can be treated as a chi-square random variable with s degrees of freedom (since the covariance of b̂_k is the identity by definition). Assuming this, the quantity P̂_k^{-1}(1 − θ) in Eq. (6.193) is approximately equal and may be replaced by P^{-1}(1 − θ; s), the threshold value v at which the probability that a chi-square random variable with s degrees of freedom takes value greater than v is θ. One may also replace Σ̂_k with an estimate of the covariance of (d_k − C_k r*).

6.8.3 Iterative/LSPE-Type Methods

In this section, we will consider iterative methods for solving the projected equation Cr = d [cf. Eq. (6.185)], using simulation-based estimates C_k and d_k. We first consider the fixed point iteration

    Φr_{k+1} = ΠT(Φr_k),    k = 0, 1, ...,    (6.194)

which generalizes the PVI method of Section 6.3.2. For this method to be valid and to converge to r* it is essential that ΠT is a contraction with respect to some norm. In the next section, we will provide tools for verifying that this is so.

Similar to the analysis of Section 6.3.3, the simulation-based approximation (LSPE analog) is

    r_{k+1} = (Σ_{t=0}^k φ(i_t)φ(i_t)')^{-1} Σ_{t=0}^k φ(i_t) ((a_{i_t j_t}/p_{i_t j_t}) φ(j_t)'r_k + b_{i_t}).    (6.195)

Here again {i_0, i_1, ...} is an index sequence and {(i_0, j_0), (i_1, j_1), ...} is a transition sequence satisfying Eqs. (6.186)-(6.188).

A generalization of this iteration, written in more compact form and introducing scaling with a matrix G_k, is given by

    r_{k+1} = r_k − γG_k(C_k r_k − d_k),    (6.196)

where C_k and d_k are given by Eq. (6.189). The iteration (6.195) is the special case of Eq. (6.196) where γ = 1 and

    G_k = (Σ_{t=0}^k φ(i_t)φ(i_t)')^{-1}.

As in Section 6.3.2, this iteration can be equivalently written in terms of generalized temporal differences as

    r_{k+1} = r_k − (γ/(k+1)) G_k Σ_{t=0}^k φ(i_t) q_{k,t},

where

    q_{k,t} = φ(i_t)'r_k − (a_{i_t j_t}/p_{i_t j_t}) φ(j_t)'r_k − b_{i_t}.

The scaling matrix G_k should converge to an appropriate matrix G. For the scaled LSPE-type method (6.196) to converge to r*, we must have G_k → G, C_k → C, and γ must be such that I − γGC is a contraction. Noteworthy special cases where this is so are:

(a) The case of iteration (6.195), where γ = 1, G_k is given above, and ΠT is a contraction. The reason is that this iteration asymptotically becomes the fixed point iteration Φr_{k+1} = ΠT(Φr_k) [cf. Eq. (6.194)].

(b) C is positive definite, G is symmetric positive definite, and γ is sufficiently small. This case arises in various DP contexts, e.g., the discounted problem where A = αP (cf. Section 6.3).

(c) C is invertible, γ = 1, and G has the form [cf. Eq. (6.60)]

    G = (C'Σ^{-1}C + βI)^{-1}C'Σ^{-1},

where Σ is some positive definite symmetric matrix, and β is a positive scalar. The corresponding iteration (6.196) takes the form

    r_{k+1} = (C_k'Σ^{-1}C_k + βI)^{-1}(C_k'Σ^{-1}d_k + βr_k)

[cf. Eq. (6.57)]. As shown in Section 6.3.2 [cf. Eq. (6.61)], the eigenvalues of GC are λ_i/(λ_i + β), where λ_i are the eigenvalues of C'Σ^{-1}C, so I − GC has real eigenvalues in the interval (0, 1). This iteration also works if C is not invertible.

Let us also note the analog of the TD(0) method. It is similar to Eq. (6.196), but uses only the last sample:

    r_{k+1} = r_k − γ_k φ(i_k) q_{k,k},

where the stepsize γ_k must be diminishing to 0.
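To isolate the deterministic core of iteration (6.196), the sketch below (our own, with hypothetical data) runs r_{k+1} = r_k − γG(Cr_k − d) with the limiting matrices C and G, i.e., ignoring simulation noise, for case (b): C positive definite, G = I, and γ small enough that I − γGC is a contraction:

```python
import numpy as np

rng = np.random.default_rng(2)
s = 3
M = rng.random((s, s))
C = M @ M.T + 0.5 * np.eye(s)       # positive definite C, as in case (b)
d = rng.random(s)
G = np.eye(s)                        # symmetric positive definite scaling
gamma = 0.5 / np.linalg.norm(C, 2)   # ensures spectral radius of I - gamma*G*C < 1

r = np.zeros(s)
for _ in range(3000):
    r = r - gamma * G @ (C @ r - d)  # deterministic analog of iteration (6.196)

r_star = np.linalg.solve(C, d)       # the target solution of Cr = d
```

In the simulation-based method, C and d are replaced by the estimates C_k and d_k, which converge to C and d, so the iteration inherits the convergence of this deterministic version.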
Contraction Properties

We will now derive conditions for ΠT to be a contraction, which facilitates the use of the preceding iterative methods. We assume that the index sequence {i_0, i_1, ...} is generated as an infinitely long trajectory of a Markov chain whose steady-state probability vector is ξ. We denote by Q the corresponding transition probability matrix and by q_{ij} the components of Q. As discussed earlier, Q may not be the same as P, which is used to generate the transition sequence {(i_0, j_0), (i_1, j_1), ...} to satisfy Eqs. (6.186) and (6.188). It seems hard to guarantee that ΠT is a contraction mapping, unless |A| ≤ Q [i.e., |a_{ij}| ≤ q_{ij} for all (i, j)]. The following propositions assume this condition.

Proposition 6.8.2: Assume that Q is irreducible and that |A| ≤ Q. Then T and ΠT are contraction mappings under any one of the following three conditions:

(1) For some scalar α ∈ (0, 1), we have |A| ≤ αQ.

(2) There exists an index ī such that |a_{īj}| < q_{īj} for all j = 1, ..., n.

(3) There exists an index ī such that Σ_{j=1}^n |a_{īj}| < 1.

Proof: For any vector or matrix X, we denote by |X| the vector or matrix that has as components the absolute values of the corresponding components of X. Let ξ be the steady-state probability vector of Q.

Assume condition (1). Since Π is nonexpansive with respect to ‖·‖_ξ, it will suffice to show that A is a contraction with respect to ‖·‖_ξ. We have

    |Az| ≤ |A| |z| ≤ αQ|z|,    ∀ z ∈ ℜ^n.    (6.197)

Using this relation, we obtain

    ‖Az‖_ξ ≤ α ‖Q|z|‖_ξ ≤ α ‖z‖_ξ,    ∀ z ∈ ℜ^n,    (6.198)

where the last inequality follows since ‖Qx‖_ξ ≤ ‖x‖_ξ for all x ∈ ℜ^n (see Lemma 6.3.1). Thus, A is a contraction with respect to ‖·‖_ξ with modulus α.

Assume condition (2). Then, in place of Eq. (6.197), we have

    |Az| ≤ |A| |z| ≤ Q|z|,    ∀ z ∈ ℜ^n,

with strict inequality for the row corresponding to ī when z ≠ 0, and in place of Eq. (6.198), we obtain

    ‖Az‖_ξ < ‖Q|z|‖_ξ ≤ ‖z‖_ξ,    ∀ z ≠ 0.

It follows that A is a contraction with respect to ‖·‖_ξ, with modulus max_{‖z‖_ξ ≤ 1} ‖Az‖_ξ.

Assume condition (3). It will suffice to show that the eigenvalues of ΠA lie strictly within the unit circle.† Let Q̄ be the matrix which is identical to Q except for the ī-th row, which is identical to the ī-th row of |A|. From the irreducibility of Q, it follows that for any i_1 ≠ ī it is possible to find a sequence of nonzero components Q̄_{i_1 i_2}, ..., Q̄_{i_{k-1} i_k}, Q̄_{i_k ī} that "lead" from i_1 to ī. Using a well-known result, we have Q̄^t → 0. Since |A| ≤ Q̄, we also have |A|^t → 0, and hence also A^t → 0 (since |A^t| ≤ |A|^t). Thus, all eigenvalues of A are strictly within the unit circle. We next observe that from the proof argument under conditions (1) and (2), we have

    ‖ΠAz‖_ξ ≤ ‖z‖_ξ,    ∀ z ∈ ℜ^n,

so the eigenvalues of ΠA cannot lie outside the unit circle.

Assume to arrive at a contradiction that ν is an eigenvalue of ΠA with |ν| = 1, and let ζ be a corresponding eigenvector. We claim that Aζ must have both real and imaginary components in the subspace S. If this were not so, we would have Aζ ≠ ΠAζ, so that

    ‖Aζ‖_ξ > ‖ΠAζ‖_ξ = ‖νζ‖_ξ = |ν| ‖ζ‖_ξ = ‖ζ‖_ξ,

which contradicts the fact ‖Az‖_ξ ≤ ‖z‖_ξ for all z, shown earlier. Thus, the real and imaginary components of Aζ are in S, which implies that Aζ = ΠAζ = νζ, so that ν is an eigenvalue of A. This is a contradiction because |ν| = 1, while the eigenvalues of A are strictly within the unit circle. Q.E.D.

† In the preceding argument, the projection Πz of a complex vector z is obtained by separately projecting the real and the imaginary components of z on S. The projection norm for a complex vector x + iy is defined by

    ‖x + iy‖_ξ = sqrt(‖x‖_ξ^2 + ‖y‖_ξ^2).

Note that the preceding proof has shown that under conditions (1) and (2) of Prop. 6.8.2, T and ΠT are contraction mappings with respect to the specific norm ‖·‖_ξ, and that under condition (1), the modulus of contraction is α. Furthermore, Q need not be irreducible under these conditions; it is sufficient that Q has no transient states (so that it has a steady-state probability vector ξ with positive components). Under condition (3), T and ΠT need not be contractions with respect to ‖·‖_ξ. For a counterexample, take a_{i,i+1} = 1 for i = 1, ..., n − 1, and a_{n,1} = 1/2, with every other entry of A equal to 0. Take also q_{i,i+1} = 1 for i = 1, ..., n − 1, and q_{n,1} = 1, with every other entry of Q equal to 0, so ξ_i = 1/n for all i. Then for z = (0, 1, ..., 1)' we have Az = (1, ..., 1, 0)' and ‖Az‖_ξ = ‖z‖_ξ, so A is not a contraction with respect to ‖·‖_ξ. Taking S to be the entire space ℜ^n, we see that the same is true for ΠA.
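The counterexample above is easy to verify numerically. The sketch below (our own, with n = 4) builds the stated A, checks that ‖Az‖_ξ = ‖z‖_ξ for z = (0, 1, ..., 1)', so A is not a contraction with respect to ‖·‖_ξ, and checks that the eigenvalues of A are nevertheless strictly inside the unit circle, as the proof under condition (3) asserts:

```python
import numpy as np

n = 4
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = 1.0        # a_{i,i+1} = 1
A[n - 1, 0] = 0.5            # a_{n,1} = 1/2

xi = np.full(n, 1.0 / n)     # steady-state vector of the cyclic chain Q

def xi_norm(v):
    return float(np.sqrt((xi * v * v).sum()))

z = np.array([0.0, 1.0, 1.0, 1.0])
Az = A @ z                   # equals (1, 1, 1, 0)'

norms_equal = np.isclose(xi_norm(Az), xi_norm(z))
spectral_radius = max(abs(np.linalg.eigvals(A)))  # here (1/2)^(1/4) < 1
```

The eigenvalues satisfy ν^4 = 1/2, so the spectral radius is (1/2)^{1/4} ≈ 0.84, confirming that noncontractiveness in ‖·‖_ξ is compatible with all eigenvalues being strictly inside the unit circle.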
The next proposition uses different assumptions than Prop. 6.8.2, and applies to cases where there is no special index ī such that Σ_{j=1}^n |a_{īj}| < 1. In fact |A| may itself be a transition probability matrix, so that I − |A| need not be invertible, and the original system may have multiple solutions; see the subsequent Example 6.8.2. The proposition suggests the use of a damped version of the T mapping in various methods (compare with Section 6.7 and the average cost case for λ = 0).

Proposition 6.8.3: Assume that there are no transient states corresponding to Q, that ξ is a steady-state probability vector of Q, and that |A| ≤ Q. Assume further that I − ΠA is invertible. Then the mapping ΠT_γ, where

    T_γ = (1 − γ)I + γT,

is a contraction with respect to ‖·‖_ξ for all γ ∈ (0, 1).

Proof: The argument of the proof of Prop. 6.8.2 shows that the condition |A| ≤ Q implies that A is nonexpansive with respect to the norm ‖·‖_ξ. Since Π is nonexpansive with respect to ‖·‖_ξ, we see that the same is true for ΠA. Furthermore, since I − ΠA is invertible, we have z ≠ ΠAz for all z ≠ 0. Hence for all γ ∈ (0, 1) and z ∈ ℜ^n with z ≠ 0,

    ‖(1 − γ)z + γΠAz‖_ξ < (1 − γ)‖z‖_ξ + γ‖ΠAz‖_ξ ≤ (1 − γ)‖z‖_ξ + γ‖z‖_ξ = ‖z‖_ξ,    (6.200)

where the strict inequality follows from the strict convexity of the norm, and the weak inequality follows from the nonexpansiveness of ΠA. If we define

    ρ_γ = sup{ ‖(1 − γ)z + γΠAz‖_ξ : ‖z‖_ξ ≤ 1 },

and note that the supremum above is attained by Weierstrass' Theorem, we see that Eq. (6.200) yields ρ_γ < 1 and

    ‖(1 − γ)z + γΠAz‖_ξ ≤ ρ_γ ‖z‖_ξ,    ∀ z ∈ ℜ^n.

From the definition of T_γ, we have for all x, y ∈ ℜ^n,

    ΠT_γ x − ΠT_γ y = ΠT_γ(x − y) = (1 − γ)Π(x − y) + γΠA(x − y) = (1 − γ)Πz + γΠ(ΠAz),

where z = x − y. Using the preceding two relations and the nonexpansiveness of Π, we obtain

    ‖ΠT_γ x − ΠT_γ y‖_ξ = ‖(1 − γ)Πz + γΠ(ΠAz)‖_ξ ≤ ‖(1 − γ)z + γΠAz‖_ξ ≤ ρ_γ ‖z‖_ξ = ρ_γ ‖x − y‖_ξ,

so ΠT_γ is a contraction. Q.E.D.

Note that the mappings ΠT_γ and ΠT have the same fixed points, so under the assumptions of Prop. 6.8.3, there is a unique fixed point Φr* of ΠT.

When the row sums of |A| are no greater than one, one can construct Q with |A| ≤ Q by adding another matrix to |A|:

    Q = |A| + Diag(e − |A|e)R,    (6.199)

where R is a transition probability matrix, e is the unit vector that has all components equal to 1, and Diag(e − |A|e) is the diagonal matrix with 1 − Σ_{m=1}^n |a_{im}|, i = 1, ..., n, on the diagonal. Then the row sum deficit of the i-th row of |A| is distributed to the columns j according to fractions r_{ij}, the components of R.

We now discuss examples of choices of ξ and Q in some special cases.

Example 6.8.1: (Discounted DP Problems and Exploration)

Bellman's equation for the cost vector of a stationary policy in an n-state discounted DP problem has the form x = T(x), where

    T(x) = αPx + g,

g is the vector of single-stage costs associated with the n states, P is the transition probability matrix of the associated Markov chain, and α ∈ (0, 1) is the discount factor. If P is an irreducible Markov chain, and ξ is chosen to be its unique steady-state probability vector, the matrix inversion method based on Eq. (6.189) becomes LSTD(0). The methodology of the present section also allows row sampling/state sequence generation using a Markov chain other than P, with an attendant change in ξ, as discussed in the context of exploration-enhanced methods in Section 6.3.8.

Example 6.8.2: (Undiscounted DP Problems)

Consider the equation x = Ax + b, for the case where A is a substochastic matrix (a_{ij} ≥ 0 for all i, j and Σ_{j=1}^n a_{ij} ≤ 1 for all i). Here 1 − Σ_{j=1}^n a_{ij} may be viewed as a transition probability from state i to some absorbing state denoted 0. This is Bellman's equation for the cost vector of a stationary policy of a SSP. If the policy is proper in the sense that from any state i ≠ 0 there exists a path of positive probability transitions from i to the absorbing state 0, the matrix Q = |A| + Diag(e − |A|e)R [cf. Eq. (6.199)] is irreducible, provided R has positive components. As a result, the conditions of Prop. 6.8.2 under condition (2) are satisfied, and T and ΠT are contractions with respect to ‖·‖_ξ. It is also possible to use a matrix R whose components are not all positive, as long as Q is irreducible, in which case Prop. 6.8.2 under condition (3) applies (cf. the calculations in Section 6.7).

Consider also the equation x = Ax + b for the case where A is an irreducible transition probability matrix, with steady-state probability vector ξ. This is related to Bellman's equation for the differential cost vector of a stationary policy of an average cost DP problem involving a Markov chain with transition probability matrix A. Then, if the unit vector e is not contained in the subspace S spanned by the basis functions, the matrix I − ΠA is invertible, as shown in Section 6.7. As a result, Prop. 6.8.3 applies and shows that the mapping ΠT_γ, with T_γ = (1 − γ)I + γT, is a contraction with respect to ‖·‖_ξ for all γ ∈ (0, 1) (cf. Section 6.7, Props. 6.7.1, 6.7.2).

The projected equation methodology of this section applies to general linear fixed point equations, where A need not have a probabilistic structure. This includes an important case where iterative methods are used for solving linear equations; a class of such equations where ΠA is a contraction is given in the following example.

Example 6.8.3: (Weakly Diagonally Dominant Systems)

Consider the solution of the system

    Cx = d,

where d ∈ ℜ^n and C is an n x n matrix that is weakly diagonally dominant, i.e., its components satisfy

    c_{ii} ≠ 0,    Σ_{j≠i} |c_{ij}| ≤ |c_{ii}|,    i = 1, ..., n.    (6.201)

By dividing the i-th row by c_{ii}, we obtain the equivalent system x = Ax + b, where the components of A and b are

    a_{ij} = 0 if i = j,    a_{ij} = −c_{ij}/c_{ii} if i ≠ j,    b_i = d_i/c_{ii},    i = 1, ..., n.

Then, from Eq. (6.201), we have

    Σ_{j=1}^n |a_{ij}| = Σ_{j≠i} |c_{ij}|/|c_{ii}| ≤ 1,    i = 1, ..., n,

so Props. 6.8.2 and 6.8.3 may be used under the appropriate conditions. In particular, if the matrix Q given by Eq. (6.199) has no transient states and there exists an index ī such that Σ_{j=1}^n |a_{īj}| < 1, Prop. 6.8.2 under condition (3) applies and shows that ΠT is a contraction.
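The conversion in Example 6.8.3 is mechanical. The sketch below (our own, with a hypothetical strictly diagonally dominant C) applies it, and checks both that x = Ax + b reproduces the solution of Cx = d and that the row sums Σ_j |a_{ij}| are at most 1 (here strictly below 1, so condition (3) of Prop. 6.8.2 is available for every row):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
off = rng.random((n, n))
np.fill_diagonal(off, 0.0)
Cmat = -off
np.fill_diagonal(Cmat, off.sum(axis=1) + 0.5)   # |c_ii| > sum_{j != i} |c_ij|

d = rng.random(n)
diag = np.diag(Cmat).copy()

# a_ij = -c_ij / c_ii for j != i, a_ii = 0, b_i = d_i / c_ii  [Example 6.8.3]
A = -Cmat / diag[:, None]
np.fill_diagonal(A, 0.0)
b = d / diag

x = np.linalg.solve(Cmat, d)                    # solution of Cx = d
row_sums = np.abs(A).sum(axis=1)                # should be at most 1
```

Dividing row i of Cx = d by c_{ii} gives x_i = −Σ_{j≠i}(c_{ij}/c_{ii})x_j + d_i/c_{ii}, which is exactly the i-th component of x = Ax + b.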
Alternatively, instead of Eq. (6.201), assume the somewhat more restrictive condition

    |1 − c_{ii}| + Σ_{j≠i} |c_{ij}| ≤ 1,    i = 1, ..., n,    (6.202)

and consider the equivalent system x = Ax + b, where

    A = I − C,    b = d.

Then, from Eq. (6.202), we have

    Σ_{j=1}^n |a_{ij}| = |1 − c_{ii}| + Σ_{j≠i} |c_{ij}| ≤ 1,    i = 1, ..., n,

so again Props. 6.8.2 and 6.8.3 apply under appropriate conditions.

Let us finally address the question whether it is possible to find Q such that |A| ≤ Q and the corresponding Markov chain has no transient states or is irreducible. To this end, assume that Σ_{j=1}^n |a_{ij}| ≤ 1 for all i. Consider the set

    I = { i : Σ_{j=1}^n |a_{ij}| < 1 },

and assume that it is nonempty (otherwise the only possibility is Q = |A|). Let Ĩ be the set of i such that there exists a sequence of nonzero components a_{ij_1}, a_{j_1 j_2}, ..., a_{j_m ī} with ī ∈ I, and let Î = { i : i ∉ I ∪ Ĩ } (we allow here the possibility that Ĩ or Î may be empty). Note that the square submatrix of |A| corresponding to Î is a transition probability matrix, and that we have a_{ij} = 0 for all i ∈ Î and j ∉ Î. Then it can be shown that there exists Q with |A| ≤ Q and no transient states if and only if the Markov chain corresponding to Î has no transient states. Furthermore, there exists an irreducible Q with |A| ≤ Q if and only if Î is empty. If |A| is itself irreducible, then any Q such that |A| ≤ Q is also irreducible.

6.8.4 Extension of Q-Learning for Optimal Stopping

If the mapping T is nonlinear (as for example in the case of multiple policies), the projected equation Φr = ΠT(Φr) is also nonlinear, and may have one or multiple solutions, or no solution at all. On the other hand, if ΠT is a contraction, there is a unique solution.
We have seen in Section 6.4.3 a nonlinear special case of projected equation where ΠT is a contraction, namely optimal stopping. This case can be generalized as we now show. Let us consider a system of the form

    x = T(x) = Af(x) + b,    (6.203)

where f : ℜ^n → ℜ^n is a mapping with scalar function components of the form f(x) = (f_1(x_1), ..., f_n(x_n)). We assume that each of the mappings f_i : ℜ → ℜ is nonexpansive in the sense that

    |f_i(x_i) − f_i(x̄_i)| ≤ |x_i − x̄_i|,    ∀ i = 1, ..., n, x_i, x̄_i ∈ ℜ.    (6.204)

This guarantees that T is a contraction mapping with respect to any norm ‖·‖ with the property

    ‖y‖ ≤ ‖z‖    if |y_i| ≤ |z_i|, ∀ i = 1, ..., n,

whenever A is a contraction with respect to that norm. Such norms include weighted l1 and l∞ norms, the norm ‖·‖_ξ, as well as any scaled Euclidean norm ‖x‖ = sqrt(x'Dx), where D is a positive definite symmetric matrix with nonnegative components.

As an example, consider the equation

    x = T(x) = αPf(x) + b,    (6.205)

where P is an irreducible transition probability matrix with steady-state probability vector ξ, α ∈ (0, 1) is a scalar discount factor, and f is a mapping with components

    f_i(x_i) = min{c_i, x_i},    i = 1, ..., n,

where c_i are some scalars. This is the Q-factor equation corresponding to a discounted optimal stopping problem with states i = 1, ..., n, and a choice between two actions at each state i: stop at a cost c_i, or continue at a cost b_i and move to state j with probability p_{ij}. The optimal cost starting from state i is min{c_i, x*_i}, where x* is the fixed point of T. As a special case of Prop. 6.8.2, we obtain that ΠT is a contraction with respect to ‖·‖_ξ. Similar results hold in the case where αP is replaced by a matrix A satisfying condition (2) of Prop. 6.8.2, or the conditions of Prop. 6.8.3.

A version of the LSPE-type algorithm for solving the system (6.203), which extends the method of Section 6.4.3 for optimal stopping, may be used when ΠT is a contraction. In particular, the iteration

    Φr_{k+1} = ΠT(Φr_k),    k = 0, 1, ...,

takes the form

    r_{k+1} = (Σ_{i=1}^n ξ_i φ(i)φ(i)')^{-1} Σ_{i=1}^n ξ_i φ(i) (Σ_{j=1}^n a_{ij} f_j(φ(j)'r_k) + b_i),

and is approximated by

    r_{k+1} = (Σ_{t=0}^k φ(i_t)φ(i_t)')^{-1} Σ_{t=0}^k φ(i_t) ((a_{i_t j_t}/p_{i_t j_t}) f_{j_t}(φ(j_t)'r_k) + b_{i_t}).    (6.206)

Here, as before, {i_0, i_1, ...} is a state sequence, and {(i_0, j_0), (i_1, j_1), ...} is a transition sequence satisfying Eqs. (6.186) and (6.188) with probability 1. The justification of this approximation is very similar to the ones given so far, and will not be discussed further. Diagonally scaled versions of this iteration are also possible.

A difficulty with iteration (6.206) is that the terms f_{j_t}(φ(j_t)'r_k) must be computed for all t = 0, ..., k, at every step k, thereby resulting in significant overhead. The methods to bypass this difficulty in the case of optimal stopping, discussed at the end of Section 6.4.3, can be extended to the more general context considered here.

Let us finally consider the case where, instead of A = αP, the matrix A satisfies condition (2) of Prop. 6.8.2, or the conditions of Prop. 6.8.3. The case where Σ_{j=1}^n |a_{ij}| < 1 for some index i, and 0 ≤ A ≤ Q, where Q is an irreducible transition probability matrix, corresponds to an undiscounted optimal stopping problem where the stopping state will be reached from all other states with probability 1, even without applying the stopping action. In this case, it can be shown by modifying the proof of Prop. 6.8.3 that the mapping ΠT_γ, where T_γ(x) = (1 − γ)x + γT(x), is a contraction with respect to ‖·‖_ξ for all γ ∈ (0, 1). In view of the contraction property of ΠT_γ, it follows that ΠA is a contraction with respect to some norm, and hence I − ΠA is invertible. Thus ΠT_γ has a unique fixed point, which must also be the unique fixed point of ΠT (since ΠT and ΠT_γ have the same fixed points). As earlier, the "damped" PVI iteration

    Φr_{k+1} = (1 − γ)Φr_k + γΠT(Φr_k)

converges to the unique fixed point of ΠT, and it can be approximated by the LSPE iteration

    r_{k+1} = (1 − γ)r_k + γ (Σ_{t=0}^k φ(i_t)φ(i_t)')^{-1} Σ_{t=0}^k φ(i_t) ((a_{i_t j_t}/p_{i_t j_t}) f_{j_t}(φ(j_t)'r_k) + b_{i_t})

[cf. Eq. (6.206)].
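For the stopping equation (6.205), T is a sup-norm contraction with modulus α, so plain fixed point iteration (without the subspace approximation) already converges. The following sketch (our own, with a small hypothetical stopping problem) computes the Q-factors this way; it serves as an exact baseline against which projected versions such as (6.206) can be compared:

```python
import numpy as np

rng = np.random.default_rng(5)
n, alpha = 6, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
b = rng.random(n)            # continuation costs b_i
c = 5.0 * rng.random(n)      # stopping costs c_i

x = np.zeros(n)
for _ in range(600):
    x = alpha * P @ np.minimum(c, x) + b   # T(x) = alpha P f(x) + b, f_i(x_i) = min(c_i, x_i)

optimal_cost = np.minimum(c, x)            # optimal cost min{c_i, x_i*} from each state i
```

Since each f_i is nonexpansive and ‖αPz‖_∞ ≤ α‖z‖_∞, the composite map contracts with modulus α = 0.9, so 600 iterations bring x within numerical precision of the fixed point x*.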
206)]. we see that a necessary and suﬃcient condition for optimality is Φ′ (I − A)′ Φr∗ − T (Φr∗ ) = 0. where ˆ Φ′ Φr∗ − T (Φr∗ ) = 0. Thus minimization of the equation error (6. We assume that the matrix (I − A)Φ has rank s. consider the case where ξ is the uniform distribution. Eq. this is known as the Bellman equation error approach (see [BeT96].10 for a detailed discussion of this case.5 Bellman Equation ErrorType Methods We will now consider an alternative approach for approximate solution of the linear equation x = T (x) = b + Ax. 6. . In the DP context where the equation x = T (x) is the Bellman equation for a ﬁxed policy. it can be approximated by the LSPE iteration k −1 k rk+1 = (1−γ)rk +γ t=0 φ(it )φ(it )′ t=0 φ(it ) ait jt fj φ(jt )′ rk + bit pit jt t [cf. Section 6. ˆ T (x) = T (x) + A′ x − T (x) .8 SimulationBased Solution of Large Systems 467 As earlier. We note that the equation error approach is related to the projected equation approach. To see this.207) where · is the standard Euclidean norm. 2 .Sec. (6. so the problem is to minimize Φr − (b + AΦr) 2 ξi φ(i)′ r − n j=1 aij φ(j)′ r − bi . By setting the gradient to 0. or equivalently. based on ﬁnding a vector r that minimizes Φr − T (Φr) 2 . which guarantees that the vector r∗ that minimizes the weighted sum of squared errors is unique. 6.8.207) is equivalent to solving the projected equation ˆ Φr = ΠT (Φr). (6. ξ or n i=1 where ξ is a distribution with positive components. and the more complicated nonlinear case where T involves minimization over multiple policies).
1). (6. where the second inequality holds because r minimizes Φr − T (Φr) 2 . with modulus α ∈ (0. Then ˜ ξ x∗ − Φ˜ = T x∗ − T (Φ˜) + T (Φ˜) − Φ˜ = A(x∗ − Φ˜) + T (Φ˜) − Φ˜. we intro′ ). a similar calculation yields x∗ − Φ˜ r ξ ≤ 1+α ∗ x − Πx∗ ξ . duce an additional sequence of transitions {(i0 . . we obtain x∗ − Φ˜ r ξ ≤ (I − A)−1 ≤ (I − = (I − A)−1 = (I − A)−1 A)−1 ≤ (I − A)−1 ξ ξ ξ ξ ξ Φ˜ − T (Φ˜) r r Πx∗ − ξ ξ ξ T (Πx∗ ) Πx∗ − x∗ + T x∗ − T (Πx∗ ) (I − A)(Πx∗ ξ − x∗ ) ξ I −A x∗ − Πx∗ ξ . Error bounds analogous to the projected equation bounds of Eqs. 1−α 2 ξ The vector r∗ that minimizes Φr − T (Φr) ing necessary optimality condition n i=1 satisﬁes the correspond′ ξi φ(i) − n j=1 aij φ(j) φ(i) − n n j=1 = i=1 ξi φ(i) − aij φ(j) r∗ n j=1 aij φ(j) bi . r r r r r r r so that x∗ − Φ˜ = (I − A)−1 T (Φ˜) − Φ˜ . (6. (6. j0 1 1 6.208). let r minimize Φr − T (Φr) 2 .208) To obtain a simulationbased approximation to Eq. (i . 6 where Π denotes projection with respect to the standard Euclidean norm. without ren quiring the calculation of row sums of the form j=1 aij φ(j). . r r r Thus.183) can be developed for the equation error approach. assuming that I − A is invertible and x∗ is the unique solution. which is generated according to the transition probabilities pij of the . j ′ ).182) and (6. In ˜ ξ the case where T is a contraction mapping with respect to the norm · ξ .2).468 Approximate Dynamic Programming Chap. .} (see Fig. In particular.8. A similar conversion is possible when ξ is a general distribution with positive components.
The second sequence is "independent" of the sequence {(i₀, j₀), (i₁, j₁), ...} in the sense that, with probability 1,

    lim_{t→∞} ( Σ_{k=0}^t δ(i_k = i, j_k = j) ) / ( Σ_{k=0}^t δ(i_k = i) )
        = lim_{t→∞} ( Σ_{k=0}^t δ(i_k = i, j_k′ = j) ) / ( Σ_{k=0}^t δ(i_k = i) ) = p_{ij},   (6.209)

for all i, j = 1, ..., n, and

    lim_{t→∞} ( Σ_{k=0}^t δ(i_k = i, j_k = j, j_k′ = j′) ) / ( Σ_{k=0}^t δ(i_k = i) ) = p_{ij} p_{ij′},   (6.210)

for all i, j, j′ = 1, ..., n. At time t, we form the linear equation

    Σ_{k=0}^t ( φ(i_k) − (a_{i_k j_k}/p_{i_k j_k}) φ(j_k) ) ( φ(i_k) − (a_{i_k j_k′}/p_{i_k j_k′}) φ(j_k′) )′ r
        = Σ_{k=0}^t ( φ(i_k) − (a_{i_k j_k}/p_{i_k j_k}) φ(j_k) ) b_{i_k}.   (6.211)

Similar to our earlier analysis, it can be seen that this is a valid approximation to Eq. (6.208).

Figure 6.8.2. A possible simulation mechanism for minimizing the equation error norm [cf. Eq. (6.211)]. We generate a sequence of states {i₀, i₁, ...} according to the distribution ξ, by simulating a single infinitely long sample trajectory of the chain. Simultaneously, we generate two independent sequences of transitions, {(i₀, j₀), (i₁, j₁), ...} and {(i₀, j₀′), (i₁, j₁′), ...}, according to the transition probabilities p_{ij}, so that Eqs. (6.209) and (6.210) are satisfied.

Note a disadvantage of this approach relative to the projected equation approach (cf. Section 6.8.1). It is necessary to generate two sequences of transitions (rather than one). Moreover, both of these sequences enter
Eq. (6.211), which thus contains more simulation noise than its projected equation counterpart [cf. Eq. (6.189)]. Alternatively, one may consider a selected set I of states of moderate size, and find the vector r∗ that minimizes the sum of squared Bellman equation errors only for these states:

    r∗ ∈ arg min_{r∈ℜˢ} Σ_{i∈I} ( φ(i)′r − Σ_{j=1}^n a_{ij} φ(j)′r − b_i )².

This least squares problem may be solved by conventional (non-simulation) methods.

An interesting question is how the approach of this section compares with the projected equation approach in terms of approximation error. No definitive answer seems possible, and examples where one approach gives better results than the other have been constructed. Reference [Ber95] shows that in the example of Exercise 6.9, the projected equation approach gives worse results. For an example where the projected equation approach may be preferable, see Exercise 6.11.

Let us finally note that the equation error approach can be generalized to yield a simulation-based method for solving the general linear least squares problem

    min_{r∈ℜˢ} Σ_{i=1}^n ξ_i ( c_i − Σ_{j=1}^m q_{ij} φ(j)′r )²,

where q_{ij} are the components of an n × m matrix Q, and c_i are the components of a vector c ∈ ℜⁿ. In particular, one may write the corresponding optimality condition [cf. Eq. (6.208)] and then approximate it by simulation [cf. Eq. (6.211)]; see [BeY09], [PWB09], and [WPB09], which also discuss a regression-based approach to deal with nearly singular problems (cf. the regression-based LSTD method of Section 6.3.4).

Approximate Policy Iteration with Bellman Equation Error Evaluation

When the Bellman equation error approach is used in conjunction with approximate policy iteration in a DP context, it is susceptible to chattering and oscillation just as much as the projected equation approach (cf. Section 6.3.6). The reason is that both approaches operate within the same greedy partition, and oscillate when there is a cycle of policies μᵏ, μᵏ⁺¹, ..., μᵏ⁺ᵐ with

    r_{μᵏ} ∈ R_{μᵏ⁺¹}, r_{μᵏ⁺¹} ∈ R_{μᵏ⁺²}, ..., r_{μᵏ⁺ᵐ⁻¹} ∈ R_{μᵏ⁺ᵐ}, r_{μᵏ⁺ᵐ} ∈ R_{μᵏ}

(cf. Fig. 6.3.4). The only difference is that the weight vector r_μ of a policy μ is calculated differently (by solving the least squares Bellman equation error problem of minimizing Σ_i ξ_i ( φ(i)′r − Σ_j a_{ij} φ(j)′r − b_i )²,
versus solving a projected equation). In practice the weights calculated by the two approaches may differ somewhat, but generally not enough to cause dramatic changes in qualitative behavior. Thus, much of our discussion of optimistic policy iteration in Sections 6.3.5 and 6.3.6 applies to the Bellman equation error approach as well.

Example 6.3.2 (continued)

Let us return to Example 6.3.2, where chattering occurs when r_μ is evaluated using the projected equation. When the Bellman equation error approach is used instead, the greedy partition remains the same (cf. Fig. 6.3.6), the weight of policy μ is r_μ = 0 (as in the projected equation case), and for p ≈ 1, the weight of policy μ∗ can be calculated to be

    r_{μ∗} ≈ c / ( (1 − α)((1 − α)² + (2 − α)²) )

[which is almost the same as the weight c/(1 − α) obtained in the projected equation case]. Thus with both approaches we have oscillation between μ and μ∗ in approximate policy iteration, and chattering in optimistic versions.

6.8.6  Oblique Projections

Some of the preceding methodology regarding projected equations can be generalized to the case where the projection operator Π is oblique (i.e., it is not a projection with respect to the weighted Euclidean norm; see, e.g., Saad [Saa03]). Such projections have the form

    Π = Φ(Ψ′ΞΦ)^{−1}Ψ′Ξ,   (6.212)

where, as before, ξ is a distribution with positive components, Ξ is the diagonal matrix with the components ξ₁, ..., ξ_n of ξ along the diagonal, Φ is an n × s matrix of rank s, and Ψ is an n × s matrix of rank s. The earlier case corresponds to Ψ = Φ. Two characteristic properties of Π as given by Eq. (6.212) are that its range is the subspace S = {Φr | r ∈ ℜˢ} and that it is idempotent, i.e., Π² = Π. Conversely, a matrix Π with these two properties can be shown to have the form (6.212) for some n × s matrix Ψ of rank s and a diagonal matrix Ξ with the components ξ₁, ..., ξ_n of a positive distribution vector ξ along the diagonal. Oblique projections arise in a variety of interesting contexts, for which we refer to the literature.

Let us now consider the generalized projected equation

    Φr = ΠT(Φr) = Π(b + AΦr).   (6.213)

Using Eq. (6.212) and the fact that Φ has rank s, it can be written as

    r = (Ψ′ΞΦ)^{−1}Ψ′Ξ(b + AΦr),
or equivalently

    Ψ′ΞΦr = Ψ′Ξ(b + AΦr),

which can be finally written as

    Cr = d,

where

    C = Ψ′Ξ(I − A)Φ,   d = Ψ′Ξb.   (6.214)

These equations should be compared to the corresponding equations for the Euclidean projection case where Ψ = Φ [cf. Eq. (6.184)].

It is clear that row and column sampling can be adapted to provide simulation-based estimates C_k and d_k of C and d, respectively. The corresponding equations have the form [cf. Eq. (6.189)]

    C_k = (1/(k+1)) Σ_{t=0}^k ψ(i_t) ( φ(i_t) − (a_{i_t j_t}/p_{i_t j_t}) φ(j_t) )′,
    d_k = (1/(k+1)) Σ_{t=0}^k ψ(i_t) b_{i_t},   (6.215)

where ψ′(i) is the ith row of Ψ. The sequence of vectors C_k^{−1}d_k converges with probability one to the solution C^{−1}d of the projected equation, assuming that C is nonsingular. For cases where C_k is nearly singular, the regression/regularization-based estimate (6.191) may be used. The corresponding iterative method is

    r_{k+1} = (C_k′Σ^{−1}C_k + βI)^{−1}(C_k′Σ^{−1}d_k + βr_k),

and can be shown to converge with probability one to C^{−1}d.

An example where oblique projections arise in DP is aggregation/discretization with a coarse grid [cases (c) and (d) in Section 6.5.1], with the aggregate states corresponding to some distinct representative states {x₁, ..., x_s} of the original problem (cf. also Section 6.5). Then the aggregation equation for a discounted problem has the form

    Φr = ΦD(b + αPΦr),   (6.216)

where the rows of D are unit vectors (have a single component equal to 1, corresponding to a representative state, and all other components equal to 0), and the rows of Φ are probability distributions, with the rows corresponding to the representative states x_k having a single unit component, Φ_{x_k x_k} = 1, k = 1, ..., s. Then the matrix DΦ can be seen to be the identity, so we have ΦD · ΦD = ΦD, and it follows that ΦD is an oblique projection. The conclusion is that the aggregation equation (6.216) in the special case of coarse grid discretization is the projected equation (6.213), with the oblique projection Π = ΦD.
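A quick numerical check of the two characteristic properties of the oblique projection (6.212), and of the equivalence of the generalized projected equation (6.213) with Cr = d, might look as follows (hypothetical random data; illustrative names only).

```python
import numpy as np

rng = np.random.default_rng(2)
n, s = 6, 2
Phi = rng.standard_normal((n, s))
Psi = rng.standard_normal((n, s))
xi = rng.random(n); xi /= xi.sum()       # positive distribution vector
Xi = np.diag(xi)
A = 0.1 * rng.standard_normal((n, n))
b = rng.standard_normal(n)

# oblique projection (6.212): idempotent, with range S = {Phi r}
Pi = Phi @ np.linalg.solve(Psi.T @ Xi @ Phi, Psi.T @ Xi)
print(np.allclose(Pi @ Pi, Pi))          # True: Pi is idempotent
r = rng.standard_normal(s)
print(np.allclose(Pi @ (Phi @ r), Phi @ r))   # True: Pi fixes the subspace S

# generalized projected equation Phi r = Pi(b + A Phi r)  <=>  C r = d
C = Psi.T @ Xi @ (np.eye(n) - A) @ Phi   # cf. Eq. (6.214)
d = Psi.T @ Xi @ b
r_star = np.linalg.solve(C, d)
print(np.allclose(Phi @ r_star, Pi @ (b + A @ Phi @ r_star)))   # True
```

All three checks print True; setting Psi = Phi recovers the weighted Euclidean projection case discussed earlier.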
6.8.7  Generalized Aggregation by Simulation

We will finally discuss the simulation-based iterative solution of a general system of equations of the form

    r = DT(Φr),   (6.217)

where T : ℜⁿ → ℜᵐ is a (possibly nonlinear) mapping, D is an s × m matrix, and Φ is an n × s matrix. We can regard the system (6.217) as an approximation to a system of the form

    x = T(x).   (6.218)

In particular, the variables x_i of the system (6.218) are approximated by linear combinations of the variables r_j of the system (6.217), using the rows of Φ. Furthermore, the components of the mapping DT are obtained by linear combinations of the components of T, using the rows of D. Thus we may view the system (6.217) as being obtained by aggregation/linear combination of the variables and the equations of the system (6.218).

We have encountered equations of the form (6.217) in our discussion of Q-learning (Section 6.4) and aggregation (Section 6.5). For example, the Q-learning mapping

    (FQ)(i, u) = Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α min_{v∈U(j)} Q(j, v) ),   ∀ (i, u),   (6.219)

[cf. Eq. (6.100)] is of the form (6.217), where r = Q, the dimensions s and n are equal to the number of state-control pairs (i, u), the dimension m is the number of state-control-next state triples (i, u, j), the components of D are the appropriate probabilities p_{ij}(u), Φ is the identity, and T is the nonlinear mapping that transforms Q to the vector with a component g(i, u, j) + α min_{v∈U(j)} Q(j, v) for each (i, u, j).

As another example, the aggregation mapping

    (FR)(x) = Σ_{i=1}^n d_{xi} min_{u∈U(i)} Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) + α Σ_{y∈S} φ_{jy} R(y) ),   x ∈ S,   (6.220)

[cf. Eq. (6.140)] is of the form (6.217), where r = R, the dimension s is equal to the number of aggregate states x, m = n is the number of states i, and the matrices D and Φ consist of the disaggregation and aggregation probabilities, respectively.

As a third example, consider the following Bellman's equation over the space of post-decision states m [cf. Eq. (6.11)]:

    V(m) = Σ_{j=1}^n q(m, j) min_{u∈U(j)} ( g(j, u) + αV(f(j, u)) ),   ∀ m.   (6.221)
This equation is of the form (6.217), where r = V, the dimension s is equal to the number of post-decision states, m = n is the number of (pre-decision) states i, the matrix D consists of the probabilities q(m, j), and Φ is the identity matrix.

There are also versions of the preceding examples, which involve evaluation of a single policy, in which case there is no minimization in Eqs. (6.219)-(6.221), and the corresponding mapping T is linear.

We will now consider separately the cases where T is linear and where T is nonlinear. For the linear case, we will give an LSTD-type method, while for the nonlinear case (where the LSTD approach does not apply), we will discuss iterative methods under some contraction assumptions on T, D, and Φ.

The Linear Case

Let T be linear, so the equation r = DT(Φr) has the form

    r = D(b + AΦr),   (6.222)

where A is an m × n matrix, and b ∈ ℜᵐ. We can thus write this equation as

    Er = f,

where

    E = I − DAΦ,   f = Db.

As in the case of projected equations (cf. Section 6.8.1), we can use low-dimensional simulation to approximate E and f based on row and column sampling. One way to do this is to introduce for each index i = 1, ..., m, a distribution {p_{ij} | j = 1, ..., n} with the property p_{ij} > 0 if a_{ij} ≠ 0, and to obtain a sample sequence (i₀, j₀), (i₁, j₁), .... We do so by first generating a sequence of row indices {i₀, i₁, ...} through sampling according to some distribution {ξ_i | i = 1, ..., m}, and then by generating for each t the column index j_t by sampling according to the distribution {p_{i_t j} | j = 1, ..., n}. There are also alternative schemes, in which we first sample rows of D and then generate rows of A, along the lines discussed in Section 6.5.2 (see also Exercise 6.14).

Given the first k + 1 samples, we form the matrix Ê_k and vector f̂_k given by

    Ê_k = I − (1/(k+1)) Σ_{t=0}^k ( a_{i_t j_t} / (ξ_{i_t} p_{i_t j_t}) ) d(i_t) φ(j_t)′,
    f̂_k = (1/(k+1)) Σ_{t=0}^k (1/ξ_{i_t}) d(i_t) b_{i_t},

where d(i) is the ith column of D and φ(j)′ is the jth row of Φ.
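A minimal sketch of this row and column sampling scheme is given below (hypothetical data, with the rows of D and Φ normalized to be probability distributions, in the spirit of the aggregation examples; names are illustrative). The estimate Ê_k^{−1}f̂_k approaches the exact solution of r = D(b + AΦr) as the sample count grows.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, s, N = 5, 4, 3, 100_000
A = 0.3 * rng.random((m, n))
b = rng.random(m)
D = rng.random((s, m)); D /= D.sum(axis=1, keepdims=True)       # rows: distributions
Phi = rng.random((n, s)); Phi /= Phi.sum(axis=1, keepdims=True)

E = np.eye(s) - D @ A @ Phi          # exact E = I - D A Phi
f = D @ b                            # exact f = D b
xi = np.full(m, 1.0 / m)             # row sampling distribution
P = np.full((m, n), 1.0 / n)         # column sampling distributions

i_t = rng.integers(0, m, size=N)     # i_t ~ xi (uniform here)
j_t = rng.integers(0, n, size=N)     # j_t ~ p_{i_t .} (uniform here)
w = A[i_t, j_t] / (xi[i_t] * P[i_t, j_t])
# E_hat = I - (1/N) sum_t (a_{ij} / (xi_i p_ij)) d(i_t) phi(j_t)'
E_hat = np.eye(s) - (D[:, i_t] * w) @ Phi[j_t] / N
f_hat = (D[:, i_t] * (b[i_t] / xi[i_t])).sum(axis=1) / N
r_sim = np.linalg.solve(E_hat, f_hat)
r_exact = np.linalg.solve(E, f)
print(np.max(np.abs(r_sim - r_exact)))   # error shrinks like 1/sqrt(N)
```

Only the sampled entries of A, columns of D, and rows of Φ are touched per sample, which is the point of the method for large m and n.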
By using the expressions

    E = I − Σ_{i=1}^m Σ_{j=1}^n a_{ij} d(i)φ(j)′,   f = Σ_{i=1}^m d(i)b_i,

and law of large numbers arguments, it can be shown that Ê_k → E and f̂_k → f, similar to the case of projected equations. In particular, we can write

    f̂_k = Σ_{i=1}^m ( Σ_{t=0}^k δ(i_t = i) / (k+1) ) (1/ξ_i) d(i)b_i,

and since

    Σ_{t=0}^k δ(i_t = i) / (k+1) → ξ_i,

we have

    f̂_k → Σ_{i=1}^m d(i)b_i = f.

Similarly, we can write

    (1/(k+1)) Σ_{t=0}^k ( a_{i_t j_t} / (ξ_{i_t} p_{i_t j_t}) ) d(i_t)φ(j_t)′
        = Σ_{i=1}^m Σ_{j=1}^n ( Σ_{t=0}^k δ(i_t = i, j_t = j) / (k+1) ) ( a_{ij} / (ξ_i p_{ij}) ) d(i)φ(j)′,

and since

    Σ_{t=0}^k δ(i_t = i, j_t = j) / (k+1) → ξ_i p_{ij},

we have

    Ê_k → I − Σ_{i=1}^m Σ_{j=1}^n a_{ij} d(i)φ(j)′ = E.

The convergence Ê_k → E and f̂_k → f implies in turn that Ê_k^{−1}f̂_k converges to the solution of the equation r = D(b + AΦr). There is also a regression-based version of this method that is suitable for the case where Ê_k is nearly singular (cf. Section 6.3.4), and an iterative LSPE-type method that works even when Ê_k is singular [cf. Eq. (6.61)].

The Nonlinear Case

Consider now the case where T is nonlinear and has the contraction property

    ‖T(x) − T(x̄)‖∞ ≤ α ‖x − x̄‖∞,   ∀ x, x̄,
where α is a scalar with 0 < α < 1, and ‖·‖∞ denotes the sup-norm. Furthermore, let the components of the matrices D and Φ satisfy

    Σ_{i=1}^m |d_{ℓi}| ≤ 1,   ∀ ℓ = 1, ..., s,

and

    Σ_{ℓ=1}^s |φ_{jℓ}| ≤ 1,   ∀ j = 1, ..., n.

These assumptions imply that D and Φ are nonexpansive in the sense that

    ‖Dx‖∞ ≤ ‖x‖∞,   ∀ x ∈ ℜᵐ,
    ‖Φy‖∞ ≤ ‖y‖∞,   ∀ y ∈ ℜˢ,

so that DTΦ is a sup-norm contraction with modulus α, and the equation r = DT(Φr) has a unique solution, denoted r∗.

The ideas underlying the Q-learning algorithm and its analysis (cf. Section 6.4.1) can be extended to provide a simulation-based algorithm for solving the equation r = DT(Φr). This algorithm contains as a special case the iterative aggregation algorithm (6.147), as well as other algorithms of interest in DP, such as, for example, Q-learning and aggregation-type algorithms for stochastic shortest path problems, and for problems involving post-decision states.

As in Q-learning, the starting point of the algorithm is the fixed point iteration

    r_{k+1} = DT(Φr_k).

This iteration is guaranteed to converge to r∗, and the same is true for asynchronous versions where only one component of r is updated at each iteration (this is due to the sup-norm contraction property of DTΦ). To obtain a simulation-based approximation of DT, we introduce an s × m matrix D̂ whose rows are m-dimensional probability distributions with components d̂_{ℓi} satisfying

    d̂_{ℓi} > 0 if d_{ℓi} ≠ 0,   ℓ = 1, ..., s,   i = 1, ..., m.

The ℓth component of the vector DT(Φr) can be written as an expected value with respect to this distribution:

    Σ_{i=1}^m d_{ℓi} T_i(Φr) = Σ_{i=1}^m d̂_{ℓi} ( d_{ℓi} / d̂_{ℓi} ) T_i(Φr),   (6.223)

where T_i is the ith component of T. This expected value is approximated by simulation in the algorithm that follows.
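Before turning to the stochastic algorithm, the deterministic fixed point iteration r_{k+1} = DT(Φr_k) can be sketched under the preceding assumptions (a hypothetical example; T here adds a scaled minimum over components, which is sup-norm nonexpansive, so this T is a contraction with modulus α).

```python
import numpy as np

rng = np.random.default_rng(4)
m = n = 6; s = 3; alpha = 0.9
c = rng.random(m)
D = rng.random((s, m)); D /= D.sum(axis=1, keepdims=True)        # sum_i |d_li| <= 1
Phi = rng.random((n, s)); Phi /= Phi.sum(axis=1, keepdims=True)  # sum_l |phi_jl| <= 1

def T(x):
    # nonlinear sup-norm contraction with modulus alpha, since
    # |min_j x_j - min_j y_j| <= max_j |x_j - y_j|
    return c + alpha * np.min(x) * np.ones(m)

r = np.zeros(s)
for _ in range(500):
    r = D @ T(Phi @ r)          # fixed point iteration r_{k+1} = D T(Phi r_k)
print(np.max(np.abs(r - D @ T(Phi @ r))))   # ~0: r has converged to r*
```

The iteration error contracts by a factor α per step, so 500 iterations drive it below machine precision; the stochastic algorithm of the text replaces the exact evaluation of the expected value (6.223) with samples.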
The algorithm generates a sequence of indices {ℓ₀, ℓ₁, ...} according to some mechanism that ensures that all indices ℓ = 1, ..., s, are generated infinitely often. Given ℓ_k, an index i_k ∈ {1, ..., m} is generated according to the probabilities d̂_{ℓ_k i}, independently of preceding indices. Then the components of r_k, denoted r_k(ℓ), ℓ = 1, ..., s, are updated using the following iteration:

    r_{k+1}(ℓ) = (1 − γ_k) r_k(ℓ) + γ_k ( d_{ℓ i_k} / d̂_{ℓ i_k} ) T_{i_k}(Φr_k)   if ℓ = ℓ_k,
    r_{k+1}(ℓ) = r_k(ℓ)   if ℓ ≠ ℓ_k,

where γ_k > 0 is a stepsize that diminishes to 0 at an appropriate rate. Thus only the ℓ_kth component of r_k is changed, while all other components are left unchanged. The stepsize could be chosen to be γ_k = 1/n_k, where n_k is the number of times that index ℓ_k has been generated within the sequence {ℓ₀, ℓ₁, ...} up to time k.

The algorithm is similar to, and indeed contains as a special case, the Q-learning algorithm (6.103). The justification of the algorithm follows closely the one given for Q-learning in Section 6.4.1. In particular, we replace the expected value in the expression (6.223) of the ℓth component of DT with a Monte Carlo estimate based on all the samples up to time k that involve ℓ_k, and we then simplify the hard-to-calculate terms in the resulting method [cf. Eqs. (6.108) and (6.110)]. A rigorous convergence proof requires the theoretical machinery of stochastic approximation algorithms.

6.9  APPROXIMATION IN POLICY SPACE

Our approach so far in this chapter has been to use an approximation architecture for some cost function, differential cost, or Q-factor. Sometimes this is called approximation in value space, to indicate that a cost or value function is being approximated. In an important alternative, called approximation in policy space, we parameterize the set of policies by a vector r = (r₁, ..., r_s) and we optimize the cost over this vector. In particular, we consider randomized stationary policies of a given parametric form μ̃_u(i, r), where μ̃_u(i, r) denotes the probability that control u is applied when the state is i. Each value of r defines a randomized stationary policy, which in turn defines the cost of interest as a function of r. We then choose r to minimize this cost.

In an important special case of this approach, the parameterization of the policies is indirect, through an approximate cost function. In particular, a cost approximation architecture parameterized by r defines a policy dependent on r via the minimization in Bellman's equation. For example, Q-factor approximations Q̃(i, u, r) define a parameterization of policies by letting μ̃_u(i, r) = 1 for some u that minimizes Q̃(i, u, r) over u ∈ U(i),
and μ̃_u(i, r) = 0 for all other u. This parameterization is discontinuous in r, but in practice it is smoothed by replacing the minimization operation with a smooth exponential-based approximation; we refer to the literature for the details.

Also, in a more abstract and general view of approximation in policy space, rather than parameterizing policies or Q-factors, we can simply parameterize by r the problem data (stage costs and transition probabilities), and optimize the corresponding cost function over r. Thus, in this more general formulation, we may aim to select some parameters of a given system to optimize performance. This method need not relate to DP, although DP calculations may play a significant role in its implementation.

Once policies are parameterized in some way by a vector r, the cost function of the problem, over a finite or infinite horizon, is implicitly parameterized as a vector J̃(r). A scalar measure of performance may then be derived from J̃(r), e.g., the expected cost starting from a single initial state, or a weighted sum of costs starting from a selected set of states. The method of optimization may be any one of a number of possible choices, ranging from random search to gradient methods. Random search methods, such as the cross-entropy method [RuK04], are often very easy to implement and on occasion have proved surprisingly effective (see the literature cited in Section 6.10). Traditionally, however, gradient methods have received most attention within this context, but they often tend to be slow and to have difficulties with local minima.

In this section, we will focus on the finite spaces average cost problem and gradient-type methods. Let the cost per stage vector and transition probability matrix be given as functions of r: G(r) and P(r), respectively. Assume that the states form a single recurrent class under each P(r), and let ξ(r) be the corresponding steady-state probability vector. We denote by G_i(r), P_{ij}(r), and ξ_i(r) the components of G(r), P(r), and ξ(r), respectively. Each value of r defines an average cost η(r), which is common for all initial states (cf. Section 4.2), and the problem is to find

    min_{r∈ℜˢ} η(r).

Assuming that η(r) is differentiable with respect to r (something that must be independently verified), one may use a gradient method for this minimization:

    r_{k+1} = r_k − γ_k ∇η(r_k),

where γ_k is a positive stepsize. This is known as a policy gradient method.

6.9.1  The Gradient Formula

We will now show that a convenient formula for the gradients ∇η(r) can
be obtained by differentiating Bellman's equation

    η(r) + h_i(r) = G_i(r) + Σ_{j=1}^n P_{ij}(r) h_j(r),   i = 1, ..., n,   (6.224)

with respect to the components of r, where h_i(r) are the differential costs. Taking the partial derivative with respect to r_m, we obtain for all i and m,

    ∂η/∂r_m + ∂h_i/∂r_m = ∂G_i/∂r_m + Σ_{j=1}^n (∂P_{ij}/∂r_m) h_j + Σ_{j=1}^n P_{ij} (∂h_j/∂r_m).

(In what follows we assume that the partial derivatives with respect to components of r appearing in various equations exist. The argument at which they are evaluated is often suppressed to simplify notation.) By multiplying this equation with ξ_i(r), adding over i, and using the fact Σ_{i=1}^n ξ_i(r) = 1, we obtain

    ∂η/∂r_m + Σ_{i=1}^n ξ_i ∂h_i/∂r_m
        = Σ_{i=1}^n ξ_i ∂G_i/∂r_m + Σ_{i=1}^n ξ_i Σ_{j=1}^n (∂P_{ij}/∂r_m) h_j + Σ_{i=1}^n ξ_i Σ_{j=1}^n P_{ij} ∂h_j/∂r_m.

The last summation on the right-hand side cancels the last summation on the left-hand side, because from the defining property of the steady-state probabilities, we have

    Σ_{i=1}^n Σ_{j=1}^n ξ_i P_{ij} (∂h_j/∂r_m) = Σ_{j=1}^n ξ_j (∂h_j/∂r_m).

We thus obtain

    ∂η(r)/∂r_m = Σ_{i=1}^n ξ_i(r) ( ∂G_i(r)/∂r_m + Σ_{j=1}^n (∂P_{ij}(r)/∂r_m) h_j(r) ),   m = 1, ..., s,   (6.225)

or in more compact form

    ∇η(r) = Σ_{i=1}^n ξ_i(r) ( ∇G_i(r) + Σ_{j=1}^n ∇P_{ij}(r) h_j(r) ),   (6.226)

where all the gradients are column vectors of dimension s.
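Formula (6.226) can be verified numerically on a tiny example. The sketch below (a hypothetical 2-state chain with a scalar parameter r; the model and names are illustrative, not from the text) computes ξ(r) and a solution h(r) of Bellman's equation directly, evaluates the right-hand side of (6.226), and compares it with a central finite difference of η(r).

```python
import numpy as np

def sigmoid(r):
    return 1.0 / (1.0 + np.exp(-r))

def model(r):
    # hypothetical 2-state chain: first row depends on r, second row is fixed
    p = sigmoid(r)
    P = np.array([[p, 1 - p], [0.3, 0.7]])
    G = np.array([r * r, 1.0])          # stage costs G(r)
    return P, G

def stationary(P):
    # steady-state distribution: xi'P = xi', sum(xi) = 1
    n = P.shape[0]
    M = np.vstack([P.T - np.eye(n), np.ones(n)])
    return np.linalg.lstsq(M, np.r_[np.zeros(n), 1.0], rcond=None)[0]

def eta(r):
    P, G = model(r)
    return stationary(P) @ G            # average cost per stage

r = 0.3
P, G = model(r)
xi = stationary(P)
avg = xi @ G
# differential costs: eta*1 + h = G + P h, normalized so that xi'h = 0
M = np.vstack([np.eye(2) - P, xi])
h = np.linalg.lstsq(M, np.r_[G - avg, 0.0], rcond=None)[0]

# gradient formula (6.226); dP/dr and dG/dr are available analytically here
dp = sigmoid(r) * (1 - sigmoid(r))
dP = np.array([[dp, -dp], [0.0, 0.0]])
dG = np.array([2 * r, 0.0])
grad = xi @ (dG + dP @ h)

fd = (eta(r + 1e-6) - eta(r - 1e-6)) / 2e-6   # central finite difference
print(grad, fd)   # the two values agree to high accuracy
```

Note that the formula is insensitive to the normalization of h: adding a constant to all h_j leaves Σ_j ∇P_{ij} h_j unchanged, because the rows of P(r) sum to 1 for every r.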
6.9.2  Computing the Gradient by Simulation

Despite its relative simplicity, the gradient formula (6.226) involves formidable computations to obtain ∇η(r) at just a single value of r. The reason is that neither the steady-state probability vector ξ(r) nor the bias vector h(r) are readily available, so they must be computed or approximated in some way. Furthermore, h(r) is a vector of dimension n, so for large n, it can only be approximated either through its simulation samples or by using a parametric architecture and an algorithm such as LSPE or LSTD (see the references cited at the end of the chapter).

The possibility to approximate h using a parametric architecture ushers a connection between approximation in policy space and approximation in value space. It also raises the question whether approximations introduced in the gradient calculation may affect the convergence guarantees of the policy gradient method. Fortunately, however, gradient algorithms tend to be robust and maintain their convergence properties, even in the presence of significant error in the calculation of the gradient.

In the literature, algorithms where both μ and h are parameterized are sometimes called actor-critic methods. Algorithms where just μ is parameterized and h is not parameterized but rather estimated explicitly or implicitly by simulation, are called actor-only methods, while algorithms where just h is parameterized and μ is obtained by one-step lookahead minimization, are called critic-only methods.

We will now discuss some possibilities of using simulation to approximate ∇η(r). Let us introduce, for all i and j such that P_{ij}(r) > 0, the function

    L_{ij}(r) = ∇P_{ij}(r) / P_{ij}(r).

Then, suppressing the dependence on r, we write the partial derivative formula (6.226) in the form

    ∇η = Σ_{i=1}^n ξ_i ( ∇G_i + Σ_{j=1}^n P_{ij} L_{ij} h_j ).

We assume that for all states i and possible transitions (i, j), we can calculate ∇G_i and L_{ij}. Suppose now that we generate a single infinitely long simulated trajectory (i₀, i₁, ...). We can then estimate the average cost η as

    η̃ = (1/k) Σ_{t=0}^{k−1} G_{i_t},

where k is large. Then, given the estimate η̃, we can estimate the bias components h_j by using simulation-based approximations to the formula

    h_{i₀} = lim_{N→∞} E{ Σ_{t=0}^N (G_{i_t} − η) }   (6.227)
[which holds from general properties of the bias vector when P(r) is aperiodic; see the discussion following Prop. 4.1.2]. Alternatively, we can estimate h_j by using the LSPE or LSTD algorithms of Section 6.7.1 [note here that if the feature subspace contains the bias vector, the LSPE and LSTD algorithms will find exact values of h_j in the limit, so with a sufficiently rich set of features, an asymptotically exact calculation of h_j, and hence also of ∇η(r), is possible]. Finally, given estimates η̃ and h̃_j, we can estimate the gradient ∇η with a vector δη given by

    δη = (1/k) Σ_{t=0}^{k−1} ( ∇G_{i_t} + L_{i_t i_{t+1}} h̃_{i_{t+1}} ).   (6.228)

This can be seen by a comparison of Eqs. (6.227) and (6.228): if we replace the expected values of ∇G_i and L_{ij} by empirical averages, and we replace h_j by h̃_j, we obtain the estimate δη.

The estimation-by-simulation procedure outlined above provides a conceptual starting point for more practical gradient estimation methods. For example, in such methods, the estimation of η and h_j may be done simultaneously with the estimation of the gradient via Eq. (6.228), and with a variety of different algorithms. We refer to the literature cited at the end of the chapter.

6.9.3  Essential Features of Critics

We will now develop an alternative (but mathematically equivalent) expression for the gradient ∇η(r) that involves Q-factors instead of differential costs. Let us consider randomized policies where μ̃_u(i, r) denotes the probability that control u is applied at state i. We assume that μ̃_u(i, r) is differentiable with respect to r for each i and u. Then the corresponding stage costs and transition probabilities are given by

    G_i(r) = Σ_{u∈U(i)} μ̃_u(i, r) Σ_{j=1}^n p_{ij}(u) g(i, u, j),   i = 1, ..., n,

    P_{ij}(r) = Σ_{u∈U(i)} μ̃_u(i, r) p_{ij}(u),   i, j = 1, ..., n.

Differentiating these equations with respect to r, we obtain

    ∇G_i(r) = Σ_{u∈U(i)} ∇μ̃_u(i, r) Σ_{j=1}^n p_{ij}(u) g(i, u, j),   (6.229)

    ∇P_{ij}(r) = Σ_{u∈U(i)} ∇μ̃_u(i, r) p_{ij}(u),   i, j = 1, ..., n.   (6.230)
Since Σ_{u∈U(i)} μ̃_u(i, r) = 1 for all r, we have Σ_{u∈U(i)} ∇μ̃_u(i, r) = 0, so Eq. (6.229) yields

    ∇G_i(r) = Σ_{u∈U(i)} ∇μ̃_u(i, r) ( Σ_{j=1}^n p_{ij}(u) g(i, u, j) − η(r) ).

Also, by multiplying with h_j(r) and adding over j, Eq. (6.230) yields

    Σ_{j=1}^n ∇P_{ij}(r) h_j(r) = Σ_{j=1}^n Σ_{u∈U(i)} ∇μ̃_u(i, r) p_{ij}(u) h_j(r).

By using the preceding two equations to rewrite the gradient formula (6.226), we obtain

    ∇η(r) = Σ_{i=1}^n ξ_i(r) ( ∇G_i(r) + Σ_{j=1}^n ∇P_{ij}(r) h_j(r) )
          = Σ_{i=1}^n ξ_i(r) Σ_{u∈U(i)} ∇μ̃_u(i, r) Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) − η(r) + h_j(r) ),

and finally

    ∇η(r) = Σ_{i=1}^n Σ_{u∈U(i)} ξ_i(r) ∇μ̃_u(i, r) Q̃(i, u, r),   (6.231)

where Q̃(i, u, r) are the approximate Q-factors corresponding to r:

    Q̃(i, u, r) = Σ_{j=1}^n p_{ij}(u) ( g(i, u, j) − η(r) + h_j(r) ).

Let us now express the formula (6.231) in a way that is amenable to proper interpretation. In particular, by writing

    ∇η(r) = Σ_{i=1}^n Σ_{u: μ̃_u(i,r)>0} ξ_i(r) μ̃_u(i, r) ( ∇μ̃_u(i, r) / μ̃_u(i, r) ) Q̃(i, u, r),

and by introducing the function

    ψ_r(i, u) = ∇μ̃_u(i, r) / μ̃_u(i, r),

we obtain

    ∇η(r) = Σ_{i=1}^n Σ_{u: μ̃_u(i,r)>0} ζ_r(i, u) ψ_r(i, u) Q̃(i, u, r),   (6.232)
where ζ_r(i, u) are the steady-state probabilities of the pairs (i, u) under r:

    ζ_r(i, u) = ξ_i(r) μ̃_u(i, r).

Note that for each (i, u), ψ_r(i, u) is a vector of dimension s, the dimension of the parameter vector r. We denote by ψ_r^m(i, u), m = 1, ..., s, the components of this vector.

Equation (6.232) can form the basis for policy gradient methods that estimate Q̃(i, u, r) by simulation, thereby leading to actor-only algorithms. An alternative, suggested by Konda and Tsitsiklis [KoT99], [KoT03], is to interpret the formula as an inner product, thereby leading to a different set of algorithms. In particular, for a given r, we define the inner product of two real-valued functions Q₁, Q₂ of (i, u), by

    ⟨Q₁, Q₂⟩_r = Σ_{i=1}^n Σ_{u: μ̃_u(i,r)>0} ζ_r(i, u) Q₁(i, u) Q₂(i, u).

With this notation, we can rewrite Eq. (6.232) as

    ∂η(r)/∂r_m = ⟨ Q̃(·, ·, r), ψ_r^m(·, ·) ⟩_r,   m = 1, ..., s.

Now let ‖·‖_r be the norm induced by this inner product, i.e.,

    ‖Q‖²_r = ⟨Q, Q⟩_r.

Let also S_r be the subspace that is spanned by the functions ψ_r^m(·, ·), m = 1, ..., s, and let Π_r denote projection with respect to this norm onto S_r. Since

    ⟨ Q̃(·, ·, r), ψ_r^m(·, ·) ⟩_r = ⟨ Π_r Q̃(·, ·, r), ψ_r^m(·, ·) ⟩_r,   m = 1, ..., s,

it is sufficient to know the projection of Q̃(·, ·, r) onto S_r in order to compute ∇η(r). Thus S_r defines a subspace of essential features, i.e., features the knowledge of which is essential for the calculation of the gradient ∇η(r). An important observation is that although ∇η(r) depends on Q̃(i, u, r), which has a number of components equal to the number of state-control pairs (i, u), the dependence is only through its inner products with the s functions ψ_r^m(·, ·), m = 1, ..., s.

As discussed in Section 6.1, the projection of Q̃(·, ·, r) onto S_r can be done in an approximate sense with TD(λ), LSPE(λ), or LSTD(λ) for λ ≈ 1. We refer to the papers by Konda and Tsitsiklis [KoT99], [KoT03], and Sutton, McAllester, Singh, and Mansour [SMS99] for further discussion.
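The Q-factor form (6.232) of the gradient can be checked on a small example. The sketch below (a hypothetical 2-state, 2-action MDP with a scalar sigmoid-parameterized randomized policy; all names are illustrative) forms ζ_r, ψ_r, and Q̃ exactly, and compares Eq. (6.232) with a finite difference of η(r).

```python
import numpy as np

rng = np.random.default_rng(5)
n, nu = 2, 2                                   # states, actions
p = rng.random((n, nu, n)); p /= p.sum(axis=2, keepdims=True)  # p_ij(u)
g = rng.random((n, nu, n))                                     # g(i, u, j)
gbar = (p * g).sum(axis=2)                     # expected cost of pair (i, u)

sig = lambda r: 1.0 / (1.0 + np.exp(-r))

def policy(r):
    # randomized policy: mu_1(i, r) = sigmoid(r) at every state
    return np.array([[sig(r), 1 - sig(r)]] * n)   # shape (n, nu)

def stationary(P):
    M = np.vstack([P.T - np.eye(n), np.ones(n)])
    return np.linalg.lstsq(M, np.r_[np.zeros(n), 1.0], rcond=None)[0]

def eta(r):
    mu = policy(r)
    P = np.einsum('iu,iuj->ij', mu, p)         # P_ij(r)
    return stationary(P) @ (mu * gbar).sum(axis=1)

r = 0.2
mu = policy(r)
P = np.einsum('iu,iuj->ij', mu, p)
xi = stationary(P)
G = (mu * gbar).sum(axis=1)
e = xi @ G                                     # eta(r)
M = np.vstack([np.eye(n) - P, xi])             # Bellman eq. for h, xi'h = 0
h = np.linalg.lstsq(M, np.r_[G - e, 0.0], rcond=None)[0]

Q = gbar - e + p @ h                           # Q(i,u) = sum_j p_ij(u)(g - eta + h_j)
psi = np.array([[1 - sig(r), -sig(r)]] * n)    # psi_r(i, u) = d/dr log mu_u(i, r)
zeta = xi[:, None] * mu                        # steady-state probs of (i, u)
grad = (zeta * psi * Q).sum()                  # Eq. (6.232)

fd = (eta(r + 1e-6) - eta(r - 1e-6)) / 2e-6
print(grad, fd)                                # the two values agree
```

This is the identity that underlies likelihood-ratio ("score function") policy gradient estimators: ζ_r and ψ_r are available from simulation of the policy itself, while Q̃ is the object a critic must supply, exactly or in projected form.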
6.9.4  Approximations in Policy and Value Space

Let us now provide a comparative assessment of approximation in policy and value space. We first note that in comparing approaches, one must bear in mind that specific problems may admit natural parametrizations that favor one type of approximation over the other. For example, in inventory control problems, it is natural to consider policy parametrizations that resemble the (s, S) policies that are optimal for special cases, but also make intuitive sense in a broader context.

Policy gradient methods for approximation in policy space are supported by interesting theory and aim directly at finding an optimal policy within the given parametric class (as opposed to aiming for policy evaluation in the context of an approximate policy iteration scheme). However, they suffer from a drawback that is well-known to practitioners of nonlinear optimization: slow convergence, which, unless improved through the use of effective scaling of the gradient (with an appropriate diagonal or nondiagonal matrix), all too often leads to jamming (no visible progress) and complete breakdown. Unfortunately, there has been no proposal of a demonstrably effective scheme to scale the gradient in policy gradient methods (see, however, Kakade [Kak02] for an interesting attempt to address this issue, based on the work of Amari [Ama98]). Furthermore, the performance and reliability of policy gradient methods are susceptible to degradation by large variance of simulation noise. Thus, while policy gradient methods are supported by convergence guarantees in theory, attaining convergence in practice is often challenging. In addition, gradient methods have a generic difficulty with local minima, the consequences of which are not well-understood at present in the context of approximation in policy space.

A major difficulty for approximation in value space is that a good choice of basis functions/features is often far from evident. Furthermore, even when good features are available, the indirect approach of TD(λ), LSPE(λ), and LSTD(λ) may neither yield the best possible approximation of the cost function or the Q-factors of a policy within the feature subspace, nor yield the best possible performance of the associated one-step-lookahead policy. In the case of a fixed policy, LSTD(λ) and LSPE(λ) are quite reliable algorithms, in the sense that they ordinarily achieve their theoretical guarantees in approximating the associated cost function or Q-factors: they involve solution of systems of linear equations, simulation (with convergence governed by the law of large numbers), and contraction iterations (with favorable contraction modulus when λ is not too close to 0). However, within the multiple policy context of an approximate policy iteration scheme, TD methods have additional difficulties: the need for adequate exploration, the chattering phenomenon and the associated issue of policy oscillation, and the lack of convergence guarantees for both optimistic and nonoptimistic schemes. When an aggregation method is used
such as the training of approximation architectures using empirical or simulation data. largescale resource . and the curse of modeling (the need for an exact model of the system’s dynamics).10 NOTES. and Lendaris [LLL08] describe recent work and approximation methodology that we have not covered in this chapter: linear programmingbased approaches (De Farias and Van Roy [DFV03]. which emphasizes ﬁnitehorizon/limited lookahead schemes and adaptive sampling. batch and incremental gradient methods. [DFV04a]. Powell. Two other popular names are reinforcement learning and neurodynamic programming. but the cost approximation vectors Φr are restricted by the requirements that the rows of Φ must be aggregation probability distributions. Several survey papers in the volume by Si.Sec. which emphasizes simulationbased optimization and reinforcement learning algorithms. and neural network training. and Exercises 485 for policy evaluation. for related material on approximation architectures. Sources. Fu. Barto. and another by Bertsekas and Tsitsiklis [BeT96]. mainly using the so called ODE approach (see also Borkar and Meyn [BoM00]). but touches upon some of the approximate DP algorithms that we have discussed. Two books were written on the subject in the mid90s. which reﬂects an artiﬁcial intelligence viewpoint. More recent books are Cao [Cao07]. adopted by Bertsekas and Tsitsiklis [BeT96]. The book by Borkar [Bor08] is an advanced monograph that addresses rigorously many of the convergence issues of iterative stochastic algorithms in approximate DP. We refer to the latter book for a broader discussion of some of the topics of this chapter. Liu. SOURCES. comes from the strong connections with DP as well as with methods traditionally developed in the ﬁeld of neural networks. one by Sutton and Barto [SuB98]. these diﬃculties do not arise. AND EXERCISES There has been intensive interest in simulationbased methods for approximate DP since the early 90s. 
which emphasizes resource allocation and the diﬃculties associated with large control spaces. 6. We have used the name approximate dynamic programming to collectively refer to these methods. De Farias [DeF04]). which is more mathematical and reﬂects an optimal control/operations research viewpoint. The book by Meyn [Mey07] is broader in its coverage. and Marcus [CFH07]. as well as for an extensive overview of the history and bibliography of the subject up to 1996.10 Notes. Chang. The latter name. which emphasizes a sensitivity approach and policy gradient methods. Gosavi [Gos03]. and the special issue by Lewis. 6. and Powell [Pow07]. in view of their promise to address the dual curses of DP: the curse of dimensionality (the explosion of the computation needed to solve the problem as the number of states increases). Hu. and Wunsch [SBP04].
The papers by Barto.64)] as well as a λcounterpart were proposed by Choi and Van Roy [ChV06] under the name Fixed Point Kalman . which is the starting point of our analysis of Section 6. [Sam67] on a checkersplaying program. Eq. and Sutton [Sut88] proposed the TD(λ) method. Three more recent surveys are Borkar [Bor09] (a methodological point of view that explores connections with other Monte Carlo schemes). Sutton. For a recent application. and they are still in wide use for either approximation of optimal cost functions or Qfactors (see e. 6 allocation methods (Powell and Van Roy [PoV04]). from an artiﬁcial intelligence/machine learning viewpoint.. They were used in an approximate DP context by Van Roy. and Si.486 Approximate Dynamic Programming Chap. and Bertsekas [Ber10] (which focuses on policy iteration and elaborates on some of the topics of this chapter). and deterministic optimal control approaches (Ferrari and Stengel [FeS04]. which is very extensive and cannot be fully covered here. Bertsekas. Indeed for quite a long time it was not clear which mathematical problem TD(λ) was aiming to solve! The convergence of TD(λ) and related methods was considered for discounted problems by several authors. They are conceptually simple and easily implementable.3. Pineda [Pin97]. particularly following an early success with the backgammon playing program of Tesauro [Tes92]. Lee. Geurts. and Hanson [GLH94]. Lewis and Vrabie [LeV09] (a control theory point of view). and Tsitsiklis [VBL97] in the context of inventory control problems. The method motivated a lot of research in simulationbased DP. Ormoneit and Sen [OrS02]. Gurvits. and Anderson [BSA83]. and Singh [BBS95]. which pays special attention to the diﬃculties associated with large control spaces. 6.1. The proof of Tsitsiklis and Van Roy [TsV97] was based on the contraction property of ΠT (cf.1 and Prop. The simpliﬁcations mentioned in Section 6. [SDG09]. which is associated with an approximation architecture. 
They were introduced in the works of Samuel [Sam59]. (6. Lin. and Liu [SYL04]). The scaled version of TD(0) [cf.4 are part of the folklore of DP. including Dayan [Day92].3. In particular. see Simao et. on a heuristic basis without a convergence analysis. They have been recognized as an important simpliﬁcation in the book by Powell [Pow07]. by Barto. Bradtke. and Wehenkel [EGW06]). Jordan. postdecision states have sporadically appeared in the literature since the early days of DP.g.1). Temporal diﬀerences originated in reinforcement learning. and Singh [JJS94]. Jaakkola. al. and Ernst. Yang. An inﬂuential survey was written. Lemma 6.3. Tsitsiklis and Van Roy [TsV97]. and Szepesvari [Sze09] (a machine learning point of view). Direct approximation methods and the ﬁtted value iteration approach have been used for ﬁnite horizon problems since the early days of DP. where they are viewed as a means to encode the error in predicting future costs. The original papers did not discuss mathematical convergence issues and did not make the connection of TD methods with the projected equation. Longstaﬀ and Schwartz [LoS01]. and Van Roy [Van98]. Gordon [Gor99]. The reader is referred to these sources for a broader survey of the literature of approximate DP.
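The connection of TD methods with the projected equation mentioned above can be made concrete in a few lines. The sketch below is not from the text: the two-state chain, the single basis function, and all numbers are made up for illustration. It forms the projected Bellman equation Φr = ΠT(Φr) in its equivalent matrix form Φ′Ξ(Φ − αPΦ)r = Φ′Ξg and solves it directly; simulation-based methods such as LSTD(0) approximate these same quantities from sample trajectories.

```python
import numpy as np

# Hypothetical 2-state Markov chain (illustrative numbers only).
alpha = 0.9                               # discount factor
P = np.array([[0.5, 0.5],
              [0.3, 0.7]])                # transition probabilities
g = np.array([1.0, 2.0])                  # expected one-stage costs
Phi = np.array([[1.0], [2.0]])            # single basis function (n x 1 matrix)

# Steady-state distribution xi of P (left eigenvector for eigenvalue 1),
# used as the weights of the projection norm.
w, v = np.linalg.eig(P.T)
xi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
xi = xi / xi.sum()
Xi = np.diag(xi)

# Projected Bellman equation in matrix form: C r = d, with
#   C = Phi' Xi (Phi - alpha P Phi),   d = Phi' Xi g.
C = Phi.T @ Xi @ (Phi - alpha * P @ Phi)
d = Phi.T @ Xi @ g
r_star = np.linalg.solve(C, d)

# Fixed point check: at the solution, the Bellman residual is
# Xi-orthogonal to the basis functions (the projection condition).
residual = g + alpha * P @ (Phi @ r_star) - Phi @ r_star
residual_proj = (Phi.T @ Xi @ residual).item()
print(r_star, residual_proj)  # residual_proj ~ 0 up to round-off
```

The same r* is what TD(0) and LSTD(0) converge to when the projection weights are the steady-state probabilities, which is why the choice of norm matters (cf. Exercise 6.4 below).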
The book by Bertsekas and Tsitsiklis [BeT96] contains a detailed convergence analysis of TD(λ), its variations, and its use in approximate policy iteration, based on the connection with the projected Bellman equation. Lemma 6.3.2 is attributed to Lyapunov (see Theorem 3.3.9 and Note 3.13.6 of Cottle, Pang, and Stone [CPS92]). Generally, projected equations are the basis for Galerkin methods, which are popular in scientific computation (see e.g., [Kra72], [Fle84]). These methods typically do not use Monte Carlo simulation, which is essential for the DP context. However, Galerkin methods apply to a broad range of problems, far beyond DP, which is in part the motivation for our discussion of projected equations in more generality in Section 6.8.

The LSTD(λ) algorithm was first proposed by Bradtke and Barto [BrB96] for λ = 0, and later extended by Boyan [Boy02] for λ > 0. For λ > 0, the convergence C_k^(λ) → C^(λ) and d_k^(λ) → d^(λ) is not as easy to demonstrate as in the case λ = 0. An analysis has been given in several sources under different assumptions: for the case of an α-discounted problem, convergence was shown by Nedić and Bertsekas [NeB03] (for the standard version of the method), by Bertsekas and Yu [BeY09] (in the more general two-Markov chain sampling context of Section 6.8), and by Yu [Yu10], which gives the sharpest analysis. The analysis of [BeY09] and [Yu10] also extends to simulation-based solution of general projected equations. An analysis of the law-of-large-numbers convergence issues associated with LSTD for discounted problems was given by Nedić and Bertsekas [NeB03]. The rate of convergence of LSTD was analyzed by Konda [Kon02], who showed that LSTD has optimal rate of convergence within a broad class of temporal difference methods. The regression/regularization variant of LSTD is due to Wang, Polydorides, and Bertsekas [WPB09].
This work addresses more generally the simulation-based approximate solution of linear systems and least squares problems, and it applies to LSTD as well as to the minimization of the Bellman equation error as special cases. The LSPE(λ) algorithm was first proposed for stochastic shortest path problems by Bertsekas and Ioffe [BeI96], and was applied to a challenging problem on which TD(λ) failed: learning an optimal strategy to play the game of tetris (see also Bertsekas and Tsitsiklis [BeT96], Section 8.3). The convergence of the method for discounted problems was given in [NeB03] (for a diminishing stepsize), and by Bertsekas, Borkar, and Nedić [BBN04] (for a unit stepsize). In the paper [BeI96] and the book [BeT96], the LSPE method was related to a version of the policy iteration method, called λ-policy iteration (see also Exercises 1.19 and 6.10). The paper [BBN04] compared informally LSPE and LSTD for discounted problems, and suggested that they asymptotically coincide in the sense described in Section 6.3. Yu and Bertsekas [YuB06b] provided a mathematical proof of this for both discounted and average cost problems. The scaled versions of LSPE and the associated convergence analysis were developed more recently, and within a more general context in Bertsekas [Ber09], which is
based on a connection between general projected equations and variational inequalities. Some related methods were given by Yao and Liu [YaL08].

The research on policy or Q-factor evaluation methods was of course motivated by their use in approximate policy iteration schemes. There has been considerable experimentation with such schemes; see e.g., [BeI96], [BeT96], [SuB98], [LaP03], [JuP07], [BED09]. However, the relative practical advantages of optimistic versus nonoptimistic schemes, in conjunction with LSTD, LSPE, and TD(λ), are not yet clear. Policy oscillations and chattering were first described by the author at an April 1996 workshop on reinforcement learning [Ber96], and were subsequently discussed in Section 6.4.2 of [BeT96]. The size of the oscillations is bounded by the error bound of Prop. 1.3.6, which is due to [BeT96]. An alternative error bound that is based on the Euclidean norm has been derived by Munos [Mun03].

The exploration scheme with extra transitions (Section 6.3.8) was given in the paper by Bertsekas and Yu [BeY09], Example 1. The LSTD(λ) algorithm with exploration and modified temporal differences (Section 6.3.8) was given by Bertsekas and Yu [BeY07], and a convergence with probability 1 proof was provided under the condition λαp_ij ≤ p̄_ij for all (i, j) in [BeY09], Prop. 4. The idea of modified temporal differences stems from the techniques of importance sampling, which have been introduced in various DP-related contexts by a number of authors: Glynn and Iglehart [GlI89] (for exact cost evaluation), Precup, Sutton, and Dasgupta [PSD01] [for TD(λ) with exploration and stochastic shortest path problems], Ahamed, Borkar, and Juneja [ABJ06] (in adaptive importance sampling schemes for cost vector estimation without approximation), and Bertsekas and Yu [BeY07], [BeY09] (in the context of the generalized projected equation methods of Section 6.8.1).
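The importance sampling idea that underlies the modified temporal differences can be illustrated in isolation: to estimate an expectation under the original transition probabilities while sampling from exploration-enhanced probabilities, each sample is weighted by the likelihood ratio. A minimal sketch, with distributions and a test function chosen only for illustration (none of these numbers come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: estimate E_p[f(X)] while sampling X from q.
p = np.array([0.7, 0.2, 0.1])      # "original" distribution
q = np.array([0.4, 0.3, 0.3])      # exploration-enhanced sampling distribution
f = np.array([1.0, 5.0, 10.0])     # arbitrary function of the sampled state

exact = float(p @ f)               # the target expectation under p

# Sample from q, weight each sample by the likelihood ratio p/q,
# exactly as the modified temporal differences weight transitions.
samples = rng.choice(3, size=200_000, p=q)
weights = p[samples] / q[samples]
estimate = float(np.mean(weights * f[samples]))

print(exact, estimate)
```

The weighted estimator is unbiased for E_p[f(X)]; in the TD context the same correction lets trajectories be generated with enhanced exploration while still evaluating the original policy's chain.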
Q-learning was proposed by Watkins [Wat89], who explained insightfully the essence of the method, but did not provide a rigorous convergence analysis; see also Watkins and Dayan [WaD92]. A convergence proof was given by Tsitsiklis [Tsi94b]. For SSP problems with improper policies, this proof required the assumption of nonnegative one-stage costs (see also [BeT96], Prop. 5.6). This assumption was relaxed by Abounadi, Bertsekas, and Borkar [ABB02], using an alternative line of proof. For a survey of related methods, which also includes many historical and other references up to 1995, see Barto, Bradtke, and Singh [BBS95].

A variant of Q-learning is the method of advantage updating, developed by Baird [Bai93], [Bai94], [Bai95], and Harmon, Baird, and Klopf [HBK94]. In this method, instead of aiming to compute Q(i, u), we compute

A(i, u) = Q(i, u) − min_{v∈U(i)} Q(i, v).

The function A(i, u) can serve just as well as Q(i, u) for the purpose of computing corresponding policies, based on the minimization min_{u∈U(i)} A(i, u), but may have a much smaller range of values than Q(i, u), which may be
helpful in some contexts involving basis function approximation. When using a lookup table representation, advantage updating is essentially equivalent to Q-learning, and has the same type of convergence properties. With function approximation, the convergence properties of advantage updating are not well-understood (similar to Q-learning). We refer to the book [BeT96], Section 6.6.2, for more details and some analysis. Another variant of Q-learning, also motivated by the fact that we are really interested in Q-factor differences rather than Q-factors, has been discussed in Section 6.4.2 of Vol. I, and is aimed at variance reduction of Q-factors obtained by simulation. A related variant of approximate policy iteration and Q-learning, called differential training, has been proposed by the author in [Ber97] (see also Weaver and Baxter [WeB99]). It aims to compute Q-factor differences in the spirit of the variance reduction ideas of Section 6.4.2 of Vol. I.

Approximation methods for the optimal stopping problem (Section 6.4.3) were investigated by Tsitsiklis and Van Roy [TsV99b], [Van98], who noted that Q-learning with a linear parametric architecture could be applied because the associated mapping F is a contraction with respect to the norm ‖·‖_ξ. They proved the convergence of a corresponding Q-learning method, and they applied it to a problem of pricing financial derivatives. The LSPE algorithm given in Section 6.4.3 for this problem is due to Yu and Bertsekas [YuB07], to which we refer for additional analysis. An alternative algorithm with some similarity to LSPE as well as TD(0) is given by Choi and Van Roy [ChV06], and is also applied to the optimal stopping problem.
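The point that A(i, u) induces the same greedy policy as Q(i, u), while having a smaller range of values, is easy to verify numerically. A minimal sketch with a hypothetical 3-state, 2-control Q-table (the numbers are illustrative only, not from the text):

```python
import numpy as np

# Hypothetical Q-factors: rows are states i, columns are controls u.
Q = np.array([[10.0, 12.0],
              [ 4.0,  3.5],
              [ 8.0,  8.1]])

# Advantage factors: A(i, u) = Q(i, u) - min_v Q(i, v).
A = Q - Q.min(axis=1, keepdims=True)

# Same greedy (cost-minimizing) policy from either table...
assert (A.argmin(axis=1) == Q.argmin(axis=1)).all()

# ...but min_u A(i, u) = 0 at every state, so the values of A span
# a narrower range than those of Q.
print(A.min(axis=1))          # all zeros by construction
print(np.ptp(Q), np.ptp(A))   # overall ranges of the two tables
```

With a lookup table this is a trivial reparametrization; the interest of advantage updating lies in the function-approximation case, where the narrower range may be easier to represent, as discussed above.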
We note that approximate dynamic programming and simulation methods for stopping problems have become popular in the finance area, within the context of pricing options; see Longstaff and Schwartz [LoS01], who consider a finite horizon model in the spirit of Section 6.4.4, and Tsitsiklis and Van Roy [TsV01], and Li, Szepesvari, and Schuurmans [LSS09], whose works relate to the LSPE method of Section 6.4.3.

The constrained policy iteration method of Section 6.4.3 is new. Recently, an approach to Q-learning with exploration, called enhanced policy iteration, has been proposed (Bertsekas and Yu [BeY10]). Instead of policy evaluation by solving a linear system of equations, this method requires (possibly inexact) solution of Bellman's equation for an optimal stopping problem. It is based on replacing the standard Q-learning mapping used for evaluation of a policy µ with the mapping

(F_{J,ν}Q)(i, u) = Σ_{j=1}^{n} p_ij(u) ( g(i, u, j) + α Σ_{v∈U(j)} ν(v | j) min{ J(j), Q(j, v) } ),

which depends on a vector J ∈ ℜ^n, with components denoted J(i), and on a randomized policy ν, which for each state i defines a probability distribution

{ ν(u | i) | u ∈ U(i) }
over the feasible controls at i, and may depend on the "current policy" µ. The vector J is updated using the equation J(i) = min_{u∈U(i)} Q(i, u), and the "current policy" µ is obtained from this minimization. The policy ν may be chosen arbitrarily at each iteration. It encodes aspects of the "current policy" µ, but allows for an arbitrary and easily controllable amount of exploration, thereby resolving the issue of exploration. Finding a fixed point of the mapping F_{J,ν} is an optimal stopping problem [a similarity with the constrained policy iteration (6.128)-(6.129)]. For extreme choices of ν and a lookup table representation, the algorithm of [BeY10] yields as special cases the classical Q-learning/value iteration and policy iteration methods. Together with linear Q-factor approximation, the algorithm may be combined with the TD(0)-like method of Tsitsiklis and Van Roy [TsV99b], which can be used to solve the associated stopping problems with low overhead per iteration.

Approximate value iteration methods that are based on aggregation (cf. Section 6.5.2) were proposed by Tsitsiklis and Van Roy [TsV96], and by Singh, Jaakkola, and Jordan [SJJ94], [SJJ95]. Bounds on the error between the optimal cost-to-go vector J* and the limit of the value iteration method in the case of hard aggregation are given under various assumptions in [TsV96] (see also Exercise 6.12 and Section 6.7 of [BeT96] for a discussion). Related error bounds are given by Munos and Szepesvari [MuS08], and related results are given by Gordon [Gor95]. A more recent work that focuses on hard aggregation is Van Roy [Van06]. Multistep aggregation has not been discussed earlier; it may have some important practical applications in problems where multistep lookahead minimizations are feasible. Also asynchronous multiagent aggregation has not been discussed earlier. Finally, it is worth emphasizing that while nearly all the methods discussed in this chapter produce basis function approximations to costs or Q-factors, there is an important qualitative difference that distinguishes the aggregation-based policy iteration approach: assuming sufficiently small simulation error, it is not susceptible to policy oscillation and chattering like the projected equation or Bellman equation error approaches. The price for this is the restriction of the type of basis functions that can be used in aggregation.

The TD(λ) algorithm was extended to the average cost problem, and its convergence was proved by Tsitsiklis and Van Roy [TsV99a] (see also [TsV02]). The average cost analysis of LSPE in Section 6.7.1 is due to Yu and Bertsekas [YuB06b]; it is based on the convergence analysis for TD(λ) given in Bertsekas and Tsitsiklis [BeT96]. The LSPE algorithm was first proposed for SSP problems in [BeI96]. An alternative to the LSPE and LSTD algorithms of Section 6.7.1 is based on the relation between average cost and SSP problems, and the associated contracting value iteration method discussed in Chapter 4. The idea is to convert the average cost problem into a parametric form of SSP, which however converges to the correct one as the gain of the policy is estimated correctly by simulation. The SSP algorithms of Section 6.6, together with the corresponding contraction mapping analysis (Prop. 6.6.1), can then be used with the estimated gain of the policy ηk replacing the true gain η.

Alternative algorithms of the Q-learning type for average cost problems were given without convergence proof by Schwartz [Sch93b], Singh [Sin94], and Mahadevan [Mah96]; see also Gosavi [Gos04]. While the convergence analysis of the policy evaluation methods of Sections 6.3, 6.6, and 6.7 is based on contraction mapping arguments, a different line of analysis is necessary for Q-learning algorithms for average cost problems (as well as for SSP problems where there may exist some improper policies). The reason is that there may not be an underlying contraction, so the nonexpansive property of the DP mapping must be used instead. As a result, the analysis is more complicated, and a different method of proof has been employed, based on the so-called ODE approach; see Abounadi, Bertsekas, and Borkar [ABB01], [ABB02], and Borkar and Meyn [BeM00]. The Q-learning algorithms of this chapter for SSP and average cost problems were proposed and analyzed in these references.

The framework of Section 6.8 on generalized projected equation and Bellman error methods is based on Bertsekas and Yu [BeY09], which also discusses multistep methods, and several other variants of the methods given here (see also Bertsekas [Ber09]). The regression-based method and the confidence interval analysis of Prop. 6.8.1 is due to Wang, Polydorides, and Bertsekas [WPB09]. The material of Section 6.8.6 on oblique projections and the connections to aggregation/discretization with a coarse grid is based on unpublished collaboration with H. Yu. The generalized aggregation methodology of Section 6.8 is new in the form given here, but is motivated by the sources on aggregation-based approximate DP given for Section 6.5.

The paper by Yu and Bertsekas [YuB08] derives error bounds which apply to generalized projected equations and sharpen the rather conservative bound

‖Jµ − Φr*‖_ξ ≤ (1/√(1 − α²)) ‖Jµ − ΠJµ‖_ξ   (6.233)

given for discounted DP problems, and the bound

‖x* − Φr*‖ ≤ ‖(I − ΠA)^{-1}‖ ‖x* − Πx*‖

for the general projected equation Φr = Π(AΦr + b) [cf. Eq. (6.182)]. The bounds of [YuB08] apply also to the case where A is not a contraction and have the form

‖x* − Φr*‖_ξ ≤ B(A, ξ, S) ‖x* − Πx*‖_ξ,

where B(A, ξ, S) is a scalar that [contrary to the scalar 1/√(1 − α²) in Eq. (6.233)] depends on the approximation subspace S and the structure of the matrix A. One of the scalars B(A, ξ, S) given in [YuB08] involves only the matrices that are computed as part of the simulation-based calculation of the matrix Ck via Eq. (6.189), so it is simply obtained as a byproduct of the LSTD- and LSPE-type methods of Section 6.8. The scalar B(A, ξ, S) involves the spectral radii of some low-dimensional matrices and may be computed either analytically or by simulation (in the case where x has large dimension). Among other situations, such bounds can be useful in cases where the "bias" ‖Φr* − Πx*‖_ξ (the distance between the solution Φr* of the projected equation and the best approximation of x* within S, which is Πx*) is very large [cf. the example of Exercise 6.9, where TD(0) produces a very bad solution relative to TD(λ) for λ ≈ 1]. A value of B(A, ξ, S) that is much larger than 1 strongly suggests a large bias and motivates corrective measures (e.g., increase λ in the approximate DP case, changing the subspace S, or changing ξ). Such an inference cannot be made based on the much less discriminating bound (6.233), even if A is a contraction with respect to ‖·‖_ξ.

The Bellman equation error approach was initially suggested by Schweitzer and Seidman [ScS85], and simulation-based algorithms based on this approach were given later by Harmon, Baird, and Klopf [HBK94], Baird [Bai95], and Bertsekas [Ber95]. For some recent developments, see Ormoneit and Sen [OrS02], Szepesvari and Smart [SzS04], Antos, Szepesvari, and Munos [ASM08], and Bethke, How, and Ozdaglar [BHO08].

There is a large literature on policy gradient methods for average cost problems. The formula for the gradient of the average cost has been given in different forms and within a variety of different contexts: see Cao and Chen [CaC97], Cao and Wan [CaW98], Cao [Cao99], [Cao05], Fu and Hu [FuH94], Glynn [Gly87], Jaakkola, Singh, and Jordan [JSJ95], L'Ecuyer [L'Ec91], and Williams [Wil92]. We follow the derivations of Marbach and Tsitsiklis [MaT01]. The inner product expression of ∂η(r)/∂rm was used to delineate essential features for gradient calculation by Konda and Tsitsiklis [KoT99], [KoT03], and Sutton, McAllester, Singh, and Mansour [SMS99]. Several implementations of policy gradient methods, some of which use cost approximations, have been proposed: see Cao [Cao04], Grudic and Ungar [GrU04], He [He02], He, Fu, and Marcus [HFM05], Kakade [Kak02], Konda [Kon02], Konda and Borkar [KoB99], Konda and Tsitsiklis [KoT99], [KoT03], mentioned earlier, Marbach and Tsitsiklis [MaT01], [MaT03], Sutton, McAllester, Singh, and Mansour [SMS99], and Williams [Wil92].

Approximation in policy space can also be carried out very simply by a random search method in the space of policy parameters. There has been considerable progress in random search methodology, and the so-called cross-entropy method (see Rubinstein and Kroese [RuK04], [RuK08], de Boer et al. [BKM05]) has gained considerable attention. A noteworthy success with this method has been attained in learning a high scoring strategy in the game of tetris (see Szita and Lorincz [SzL06], and Thiery and Scherrer [ThS09]); surprisingly this method outperformed methods
based on approximate policy iteration, approximate linear programming, and policy gradient by more than an order of magnitude (see the discussion of chattering in Section 6.3.6). Other random search algorithms have also been suggested; see Chang, Fu, Hu, and Marcus [CFH07], Ch. 3.

Approximate DP methods for partially observed Markov decision problems are not as well-developed as their perfect observation counterparts. Approximations obtained by aggregation/interpolation schemes and solution of finite-spaces discounted or average cost problems have been proposed by Zhang and Liu [ZhL97], Zhou and Hansen [ZhH01], and Yu and Bertsekas [YuB04]. Alternative approximation schemes based on finite-state controllers are analyzed in Hauskrecht [Hau00], Poupart and Boutilier [PoB04], and Yu and Bertsekas [YuB06a]. Policy gradient methods of the actor-only type have been given by Baxter and Bartlett [BaB01], and Aberdeen and Baxter [AbB00]. An alternative method, which is of the actor-critic type, has been proposed by Yu [Yu05]. See also Singh, Jaakkola, and Jordan [SJJ94]. Many problems have special structure, which can be exploited in approximate DP. For some representative work, see Guestrin et al. [GKP03], and Koller and Parr [KoP00].

EXERCISES

6.1

Consider a fully connected network with n nodes, and the problem of finding a travel strategy that takes a traveller from node 1 to node n in no more than a given number m of time periods, while minimizing the expected travel cost (the sum of the travel costs of the arcs on the travel path). The cost of traversing an arc changes randomly and independently at each time period with given distribution. For any node i, the current cost of traversing the outgoing arcs (i, j), j ≠ i, will become known to the traveller upon reaching i, who will then either choose the next node j on the travel path, or stay at i (waiting for smaller costs of outgoing arcs at the next time period) at a fixed (deterministic) cost per period. Derive a DP algorithm in a space of post-decision variables and compare it to ordinary DP.

6.2 (Multiple State Visits in Monte Carlo Simulation)
Argue that the Monte Carlo simulation formula

Jµ(i) = lim_{M→∞} (1/M) Σ_{m=1}^{M} c(i, m)

is valid even if a state may be revisited within the same sample trajectory. Note: If only a finite number of trajectories is generated, in which case the number M of cost samples collected for a given state i is finite and random, the sum (1/M) Σ_{m=1}^{M} c(i, m) need not be an unbiased estimator of Jµ(i). However, as the number of trajectories increases to infinity, the bias disappears. See [BeT96], Sections 5.1, 5.2, for a discussion and examples. Hint: Suppose the M cost samples are generated from N trajectories, and that the kth trajectory involves nk visits to state i and generates nk corresponding cost samples. Denote mk = n1 + · · · + nk. Write

lim_{M→∞} (1/M) Σ_{m=1}^{M} c(i, m) = lim_{N→∞} (1/(n1 + · · · + nN)) Σ_{k=1}^{N} Σ_{m=m_{k−1}+1}^{m_k} c(i, m),

and argue that

E{ Σ_{m=m_{k−1}+1}^{m_k} c(i, m) } = E{nk} Jµ(i)

(or see Ross [Ros83b], Cor. 7.2.3, for a closely related result).

6.3 (Viewing Q-Factors as Optimal Costs)

Consider the stochastic shortest path problem under Assumptions 2.1 and 2.2. Show that the Q-factors Q(i, u) can be viewed as state costs associated with a modified stochastic shortest path problem. Use this fact to show that the Q-factors Q(i, u) are the unique solution of the system of equations

Q(i, u) = Σ_j pij(u) ( g(i, u, j) + min_{v∈U(j)} Q(j, v) ).

Hint: Introduce a new state for each pair (i, u), with transition probabilities pij(u) to the states j = 1, . . . , n, t.
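The system of Exercise 6.3 can be illustrated numerically (this is an illustration, not the requested proof). The sketch below uses a made-up 2-state, 2-control SSP in which every transition reaches the termination state t with probability at least 0.3, so value iteration on the Q-factor equation contracts; for simplicity the transition costs are taken in expected form g(i, u) rather than g(i, u, j).

```python
import numpy as np

# Hypothetical SSP: states {0, 1}, termination state t, controls {0, 1}.
# p[u][i, j] = transition probability from i to j under control u; the
# remaining probability 1 - sum_j p[u][i, j] goes to t (which has cost 0).
p = {0: np.array([[0.2, 0.5],
                  [0.3, 0.3]]),
     1: np.array([[0.6, 0.1],
                  [0.1, 0.5]])}
g = {0: np.array([2.0, 1.0]),     # expected one-stage costs g(i, u)
     1: np.array([1.0, 3.0])}

# Value iteration on Q(i, u) = sum_j p_ij(u) ( g(i, u) + min_v Q(j, v) ),
# with the expected cost pulled out of the sum.
Q = np.zeros((2, 2))
for _ in range(500):
    J = Q.min(axis=1)                       # J(j) = min_v Q(j, v)
    Q = np.stack([g[u] + p[u] @ J for u in (0, 1)], axis=1)

# Fixed-point check: the computed Q-factors satisfy the system.
J = Q.min(axis=1)
residual = max(np.max(np.abs(g[u] + p[u] @ J - Q[:, u])) for u in (0, 1))
print(Q, residual)
```

The vanishing residual reflects the uniqueness asserted in the exercise: under the stated assumptions the Q-factors are the unique fixed point of this mapping.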
6.4

This exercise provides a counterexample to the convergence of PVI for discounted problems when the projection is with respect to a norm other than ‖·‖_ξ. Consider the mapping TJ = g + αPJ and the algorithm Φr_{k+1} = ΠT(Φr_k), where P and Φ satisfy Assumptions 6.3.1 and 6.3.2. Here Π denotes projection on the subspace spanned by Φ with respect to the weighted Euclidean norm ‖J‖_V = √(J′VJ), where V is a diagonal matrix with positive components. Use the formula Π = Φ(Φ′VΦ)^{-1}Φ′V to show that in the single basis function case (Φ is an n × 1 vector) the algorithm is written as

r_{k+1} = (αΦ′VPΦ / Φ′VΦ) r_k + Φ′Vg / Φ′VΦ.

Construct choices of α, g, P, Φ, and V for which the algorithm diverges.

6.5 (LSPE(0) for Average Cost Problems [YuB06b])

Show the convergence of LSPE(0) for average cost problems with unit stepsize, assuming that P is aperiodic, by showing that the eigenvalues of the matrix ΠF lie strictly within the unit circle.

6.6 (Relation of Discounted and Average Cost Approximations [TsV02])

Consider the finite-state α-discounted and average cost frameworks of Sections 6.3 and 6.7 for a fixed stationary policy with cost per stage g and transition probability matrix P. Assume that the states form a single recurrent class. Let Jα be the α-discounted cost vector, let (η*, h*) be the gain-bias pair, let ξ be the steady-state probability vector, let Ξ be the diagonal matrix with diagonal elements the components of ξ, and let

P* = lim_{N→∞} (1/N) Σ_{k=0}^{N−1} P^k.

Show that:

(a) η* = (1 − α)ξ′Jα and P*Jα = (1 − α)^{−1} η* e.

(b) h* = lim_{α→1} (I − P*)Jα. Hint: Use the Laurent series expansion of Jα (cf. Prop. 4.1.2).

(c) Consider the subspace

E* = { (I − P*)y | y ∈ ℜ^n },

which is orthogonal to the unit vector e in the scaled geometry where x and y are orthogonal if x′Ξy = 0 (cf. Fig. 6.10.1). Verify that Jα can be decomposed into the sum of two vectors that are orthogonal (in the scaled geometry): P*Jα, which is the projection of Jα onto the line defined by e, and (I − P*)Jα, which is the projection of Jα onto E* and converges to h* as α → 1.

(d) Use part (c) to show that the limit r*_{λ,α} of PVI(λ) for the α-discounted problem converges to the limit r*_λ of PVI(λ) for the average cost problem as α → 1.

6.7 (Conversion of SSP to Average Cost Policy Evaluation)

We have often used the transformation of an average cost problem to an SSP problem (cf. Chapter 4, and Chapter 7 of Vol. I). The purpose of this exercise (unpublished collaboration of H. Yu and the author) is to show that a reverse transformation is possible, from SSP to average cost, at least in the case where all policies are proper, so that analysis, insights, and algorithms for average cost policy evaluation can be applied to policy evaluation of an SSP problem.

Consider the SSP problem, a single proper stationary policy µ, and the probability distribution q0 = (q0(1), . . . , q0(n)) used for restarting simulated trajectories [cf. Eq. (6.157)]. Denote by Jµ the SSP cost vector of µ. Let us modify the Markov chain by eliminating the self-transition from state 0 to itself, and substituting instead transitions from 0 to i with probabilities q0(i),

p̃0i = q0(i), i = 1, . . . , n,

each with a fixed transition cost β, where β is a scalar parameter. All other transitions and costs remain the same (cf. Fig. 6.10.2). We call the corresponding average cost problem β-AC, and denote by ηβ and hβ(i) the average and differential costs of β-AC, respectively.

Figure 6.10.2. Transformation of an SSP problem to an average cost problem: the self-transition p00 = 1 at the termination state 0 is replaced by restart transitions to each state i with probability q0(i) and cost β.

(a) Show that ηβ can be expressed as the average cost per stage of the cycle that starts at state 0 and returns to 0, i.e.,

ηβ = ( β + Σ_{i=1}^{n} q0(i)Jµ(i) ) / T,

where T is the expected time to return to 0 starting from 0.
236) is Bellman’s equation for the SSP problem. . respectively. N − 1. we have from Bellman’s equation n ηβ + hβ (i) = j=0 pij g(i. j) + hβ (j) . (6.234) it follows that if β = β ∗ . .235) From Eq. . we see that δ(i) = Jµ (i) for all i. . The DP algorithm/Bellman’s equation takes the form Jm = gm + Pm Jm+1 . (6. n. Consider a lowdimensional approximation of Jm that has the form Jm ≈ Φm rm . 6.236) where δ(i) = hβ ∗ (i) − hβ ∗ (0).234) ηβ + hβ (0) = β + i=1 q0 (i)hβ (i). except for a constant shift. .8 (Projected Equations for FiniteHorizon Problems) Consider a ﬁnitestate ﬁnitehorizon policy evaluation problem with the cost vector and transition matrices at time m denoted by gm and Pm . 6. . . . . . i=1 Jµ (i) = hβ ∗ (i) − hβ ∗ (0). . . j) + j=1 pij δ(j). Sources. . (6. m = 0. and JN is a given terminal cost vector. . . . and q0 (i)Jµ (i). . . .Sec. and n n δ(i) = j=0 pij g(i. Hint: Since the states of βAC form a single recurrent class. where Jm is the cost vector of stage m for the given policy. . n i = 1. . Note: This conversion may be useful if the transformed problem has more favorable properties. n. i = 1. n.10 Notes. . Since Eq. n. The two average cost problems should have the same diﬀerential cost vectors. . (6. . (6. . i = 1. and Exercises 497 (b) Show that for the special value n β∗ = − we have ηβ ∗ = 0. m = 0. i = 1. N − 1. we have ηβ ∗ = 0. (c) Derive a transformation to convert an average cost policy evaluation problem into another average cost policy evaluation problem where the transition probabilities out of a single state are modiﬁed in any way such that the states of the resulting Markov chain form a single recurrent class.
where Φm is a matrix whose columns are basis functions, one for each m. Consider also a projected equation of the form

Φm rm = Πm ( gm + Pm Φm+1 rm+1 ),  m = 0, ..., N − 1,

where Πm denotes projection onto the space spanned by the columns of Φm with respect to a weighted Euclidean norm with weight vector ξm (with ΦN rN interpreted as the terminal cost vector JN).

(a) Show that the projected equation can be written in the equivalent form

Φ′m Ξm ( Φm rm − gm − Pm Φm+1 rm+1 ) = 0,  m = 0, ..., N − 2,

Φ′N−1 ΞN−1 ( ΦN−1 rN−1 − gN−1 − PN−1 JN ) = 0,

where Ξm is the diagonal matrix having the vector ξm along the diagonal. Abbreviated solution: The derivation follows the one of Section 6.3.1 [cf. the analysis leading to Eqs. (6.33) and (6.34)]. The solution {r*0, ..., r*N−1} of the projected equation is obtained as

{r*0, ..., r*N−1} = arg min_{r0,...,rN−1} [ Σ_{m=0}^{N−2} || Φm rm − (gm + Pm Φm+1 rm+1) ||²_{ξm} + || ΦN−1 rN−1 − (gN−1 + PN−1 JN) ||²_{ξN−1} ].

The minimization can be decomposed into N minimizations, one for each m, and by setting to 0 the gradient with respect to rm, we obtain the desired form.

(b) Consider a simulation process that generates a sequence of trajectories of the system, similar to the case of a stochastic shortest path problem. Derive corresponding LSTD and (scaled) LSPE algorithms.

(c) Derive appropriate modifications of the algorithms of Section 6.4.3 to address a finite horizon version of the optimal stopping problem.

6.10.9 (Approximation Error of TD Methods [Ber95])

This exercise illustrates how the value of λ may significantly affect the approximation quality in TD methods. Consider a problem of the SSP type, but with a single policy. The states are 0, 1, ..., n, with state 0 being the termination state. Under the given policy, the system moves deterministically from state i ≥ 1 to state i − 1 at a cost gi. Consider a linear approximation of the form

J̃(i, r) = i r

for the cost-to-go function, and the application of TD methods. Let all simulation runs start at state n and end at 0 after visiting all the states n − 1, n − 2, ..., 1 in succession.
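For this deterministic example the projected equation can be formed and solved exactly with a few lines of linear algebra, which makes the dependence of the fit on λ easy to inspect. The sketch below was written for this note: it solves TD(λ) by inverting (I − λP) rather than by simulation, and the test case uses the cost data of case (1) given below.

```python
import numpy as np

def td_lambda_fit(g, lam):
    """Solve the projected equation for the deterministic chain of this
    exercise: state i moves to i - 1 at cost g[i - 1] (0-indexed), the
    approximation is J~(i, r) = i * r, and every trajectory starts at n,
    so each state is visited once per run (uniform projection weights)."""
    n = len(g)
    P = np.zeros((n, n))
    for i in range(1, n):
        P[i, i - 1] = 1.0                     # deterministic transition
    Phi = np.arange(1, n + 1, dtype=float).reshape(-1, 1)
    Xi = np.eye(n) / n                        # uniform visit frequencies
    M = np.linalg.inv(np.eye(n) - lam * P)    # (I - lambda P)^{-1}
    C = Phi.T @ Xi @ M @ (np.eye(n) - P) @ Phi
    d = Phi.T @ Xi @ M @ g
    return float(d[0] / C[0, 0])

# Case (1): n = 50, g_i = 1 for i != n, g_n = -(n - 1), so that the exact
# cost-to-go is J(i) = i for i < n and J(n) = 0.
n = 50
g = np.ones(n)
g[-1] = -(n - 1)
r0 = td_lambda_fit(g, 0.0)
r1 = td_lambda_fit(g, 1.0)
print(r0, r1)   # TD(0) slope is close to -1, TD(1) slope is close to +1
```

The single outlier cost gn dominates the TD(0) fit (which weighs one-step residuals), while TD(1) fits the cost-to-go values themselves and stays close to the correct slope of 1.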
(a) Derive the corresponding projected equation Φ r*λ = Π T^(λ)( Φ r*λ ) and show that its unique solution r*λ satisfies

Σ_{k=1}^n ( gk − r*λ ) ( λ^{n−k} n + λ^{n−k−1} (n − 1) + ··· + k ) = 0.

(b) Plot J̃(i, r*λ) with λ from 0 to 1 in increments of 0.1, for the following two cases:

(1) n = 50, gi = 1 for all i ≠ n, gn = −(n − 1).

(2) n = 50, g1 = 1, gi = 0 for all i ≠ 1.

Figure 6.10.2 gives the results for λ = 0 and λ = 1.

Figure 6.10.2: Form of the cost-to-go function J(i), and the linear representations J̃(i, r*λ) of Exercise 6.10.9, for the two cases of part (b), one in each panel. (Each panel plots J(i) against the state i = 0, ..., 50, together with the TD(1) and TD(0) approximations.)

6.10.10 (λ-Policy Iteration [BeI96])

The purpose of this exercise is to discuss a method that relates to approximate policy iteration and LSPE(λ). Consider the discounted problem of Section 6.3, a scalar λ ∈ [0, 1), and a policy iteration-like method that generates a sequence of
policy–cost vector pairs (µ0, J0), (µ1, J1), ..., starting from an arbitrary pair (µ0, J0), as follows: Given the current pair (µ, J), it computes the next pair (µ̄, J̄) according to

Tµ̄ J = T J,  J̄ = Tµ̄^(λ) J.  (6.237)

(Here µ̄ is the "improved" policy and J̄ is an estimate of the cost vector Jµ̄.)

(a) Show that for λ = 0 the algorithm is equivalent to value iteration, and as λ → 1, the algorithm approaches policy iteration (in the sense that J̄ → Jµ̄).

(b) Show that for all k greater than some index, we have

|| Jk+1 − J* ||∞ ≤ ( α(1 − λ) / (1 − αλ) ) || Jk − J* ||∞.

(See [BeI96] or [BeT96] for a proof.)

(c) Consider a linear cost approximation architecture of the form Φr ≈ J, and a corresponding projected version of the "evaluation" equation (6.237):

Φ r̄ = Π Tµ̄^(λ)( Φ r ).

Interpret this as a single iteration of PVI(λ) for approximating Jµ̄. Note: Since this method uses a single PVI(λ) iteration, one may consider a simulation-based implementation with a single LSPE(λ) iteration:

r̄ = r − Gk ( Ck r − dk ),

where Gk, Ck, and dk are simulation-based approximations of (Φ′ΞΦ)^{−1}, the matrix C^(λ), and the vector d^(λ), respectively (cf. Section 6.3). However, a substantial number of samples using the policy µ̄ are necessary to approximate well these quantities. Still the intuition around this method suggests that once good approximations Gk, Ck, and dk of (Φ′ΞΦ)^{−1}, C^(λ), and d^(λ) have been obtained, it is only necessary to perform just a few LSPE(λ) iterations for a given policy within an approximate policy iteration context.

6.10.11

This exercise provides an example of comparison of the projected equation approach of Section 6.8.1 and the least squares approach of Section 6.8.2. Consider the case of a linear system involving a vector x with two block components, x1 ∈ ℜk and x2 ∈ ℜm. The system has the form

x1 = A11 x1 + b1,
x2 = A21 x1 + A22 x2 + b2,

so x1 can be obtained by solving the first equation. Let the approximation subspace be ℜk × S2, where S2 is a subspace of ℜm. Show that with the projected equation approach, we obtain the component x1* of the solution of the original equation, but with the least squares approach, we do not.
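A tiny numerical illustration of this claim (the matrices below are arbitrary assumptions chosen for the sketch, with x1 scalar and S2 the span of (1, 1) in ℜ²):

```python
import numpy as np

# Block-triangular system x1 = A11 x1 + b1, x2 = A21 x1 + A22 x2 + b2,
# with k = 1, m = 2 (invented data).
A = np.array([[0.5, 0.0, 0.0],
              [0.3, 0.2, 0.1],
              [0.2, 0.1, 0.3]])
b = np.array([1.0, 1.0, 2.0])
x_true = np.linalg.solve(np.eye(3) - A, b)   # exact solution; x1* = 2 here

# Approximation subspace R^k x S2 with S2 = span{(1, 1)}: basis matrix
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 1.0]])
M = (np.eye(3) - A) @ Phi

# Projected equation (weights = identity): Phi'((I - A) Phi r - b) = 0.
# Its first row involves only the first block, so it recovers x1* exactly.
r_proj = np.linalg.solve(Phi.T @ M, Phi.T @ b)

# Least squares approach: min_r || (I - A) Phi r - b ||^2 couples the blocks
# through the residual of the second block, so its x1-component is biased.
r_ls = np.linalg.lstsq(M, b, rcond=None)[0]

print(r_proj[0], r_ls[0], x_true[0])
```

Running this shows the projected-equation value of x1 matching x1* to machine precision, while the least squares value differs.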
6.10.12 (Error Bounds for Hard Aggregation [TsV96])

Consider the hard aggregation case of Section 6.5, and denote i ∈ x if the original state i belongs to aggregate state x. Also for every i denote by x(i) the aggregate state x with i ∈ x. Consider the corresponding mapping F defined by

(F R)(x) = Σ_{i=1}^n dxi min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + α R(x(j)) ),  x ∈ A,

[cf. Eq. (6.140)], and let R* be the unique fixed point of this mapping. Show that

R*(x) − ε/(1 − α) ≤ J*(i) ≤ R*(x) + ε/(1 − α),  ∀ x ∈ A, i ∈ x,

where

ε = max_{x∈A} max_{i,j∈x} | J*(i) − J*(j) |.

Abbreviated Proof: Let the vector R be defined by

R(x) = min_{i∈x} J*(i) + ε/(1 − α),  x ∈ A.

We have for all x ∈ A,

(F R)(x) = Σ_{i=1}^n dxi min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + α R(x(j)) )
≤ Σ_{i=1}^n dxi min_{u∈U(i)} Σ_{j=1}^n pij(u) ( g(i, u, j) + α J*(j) ) + αε/(1 − α)
= Σ_{i=1}^n dxi J*(i) + αε/(1 − α)
≤ min_{i∈x} J*(i) + ε + αε/(1 − α)
= min_{i∈x} J*(i) + ε/(1 − α)
= R(x).

Thus F R ≤ R, from which it follows that R* ≤ R (since R* = lim_{k→∞} F^k R and F is monotone). This proves the left-hand side of the desired inequality. The right-hand side follows similarly.
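The bound is easy to check numerically. Below is a sketch on a small discounted MDP; all of the problem data (transition matrices, costs, groups, disaggregation probabilities) are invented for illustration.

```python
import numpy as np

# Illustrative discounted MDP: 4 states, 2 controls, alpha = 0.9 (invented data).
alpha = 0.9
p = [np.array([[0.7, 0.1, 0.1, 0.1],
               [0.2, 0.6, 0.1, 0.1],
               [0.1, 0.1, 0.7, 0.1],
               [0.1, 0.2, 0.2, 0.5]]),
     np.array([[0.4, 0.3, 0.2, 0.1],
               [0.1, 0.5, 0.2, 0.2],
               [0.3, 0.1, 0.4, 0.2],
               [0.2, 0.2, 0.1, 0.5]])]
g = [np.array([1.0, 1.2, 3.0, 3.3]),
     np.array([1.1, 0.9, 3.2, 2.9])]

def T(J):  # Bellman operator of the original problem
    return np.min([g[u] + alpha * p[u] @ J for u in range(2)], axis=0)

J = np.zeros(4)
for _ in range(1000):
    J = T(J)                       # value iteration: J ~= J*

# Hard aggregation: groups {0,1} and {2,3}, uniform disaggregation probabilities.
groups = [[0, 1], [2, 3]]
x_of = np.array([0, 0, 1, 1])      # aggregate state x(i) of each original state i
d = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])

def F(R):  # the mapping of the exercise: R(x(j)) plays the role of J(j)
    RJ = R[x_of]
    return d @ np.min([g[u] + alpha * p[u] @ RJ for u in range(2)], axis=0)

R = np.zeros(2)
for _ in range(1000):
    R = F(R)                       # F is an alpha-contraction, so this converges

eps = max(np.ptp(J[idx]) for idx in groups)  # max within-group spread of J*
bound = eps / (1 - alpha)
print(np.abs(J - R[x_of]).max(), bound)      # observed error vs the bound
```

The observed error is typically much smaller than ε/(1 − α); the bound is a worst case.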
6.10.13 (Hard Aggregation as a Projected Equation Method)

Consider a fixed point equation of the form

r = DT(Φr),

where T: ℜn → ℜn is a (possibly nonlinear) mapping, and D and Φ are s × n and n × s matrices, respectively. Writing this equation as

Φr = ΦDT(Φr),

we see that it is a projected equation if ΦD is a projection onto the subspace S = {Φr | r ∈ ℜs} with respect to a weighted Euclidean norm. The purpose of this exercise is to prove that this is the case in hard aggregation schemes, where the set of indices {1, ..., n} is partitioned into s disjoint subsets I1, ..., Is and:

(1) The ℓth column of Φ has components that are 1 or 0 depending on whether they correspond to an index in Iℓ or not.

(2) The ℓth row of D is a probability distribution (dℓ1, ..., dℓn) whose components are positive or zero depending on whether they correspond to an index in Iℓ or not, i.e., Σ_{i=1}^n dℓi = 1, dℓi > 0 if i ∈ Iℓ, and dℓi = 0 if i ∉ Iℓ.

Show in particular that ΦD is given by the projection formula

ΦD = Φ(Φ′ΞΦ)^{−1}Φ′Ξ,

where Ξ is the diagonal matrix with the nonzero components of D along the diagonal, normalized so that they form a probability distribution, i.e.,

ξi = dℓi / ( Σ_{k=1}^s Σ_{j=1}^n dkj ),  ∀ i ∈ Iℓ, ℓ = 1, ..., s.

Notes: (1) Based on the preceding result, if T is a contraction mapping with respect to the projection norm, the same is true for ΦDT. In addition, if T is a contraction mapping with respect to the sup-norm, the same is true for DTΦ (since aggregation and disaggregation matrices are nonexpansive with respect to the sup-norm); this is true for all aggregation schemes, not just hard aggregation. (2) For ΦD to be a weighted Euclidean projection, we must have ΦDΦD = ΦD. This implies that if DΦ is invertible and ΦD is a weighted Euclidean projection, we must have DΦ = I (since if DΦ is invertible, Φ has rank s, which implies that DΦD = D and hence DΦ = I, since D also has rank s). From this it can be seen that out of all possible aggregation schemes with DΦ invertible and D having nonzero columns, only hard aggregation has the projection property of this exercise.
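The projection formula can be verified directly on a small instance (a sketch; the partition and the entries of D below are invented for illustration):

```python
import numpy as np

# Hard aggregation with n = 5 original states and s = 2 groups:
# I1 = {0, 1, 2}, I2 = {3, 4} (0-indexed); numbers are invented.
Phi = np.array([[1., 0.],
                [1., 0.],
                [1., 0.],
                [0., 1.],
                [0., 1.]])
D = np.array([[0.2, 0.5, 0.3, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.6, 0.4]])
s = 2

# Weights of the exercise: the nonzero components of D, normalized to sum to 1.
xi = D.sum(axis=0) / s
Xi = np.diag(xi)

# Phi D should equal the Xi-weighted Euclidean projection onto S = {Phi r}.
Proj = Phi @ np.linalg.inv(Phi.T @ Xi @ Phi) @ Phi.T @ Xi
print(np.allclose(Phi @ D, Proj))   # True
```

The checks in the accompanying assertions also confirm that the matrix is idempotent (as a projection must be) and that DΦ = I, consistent with Note (2).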
6.10.14 (Simulation-Based Implementation of Linear Aggregation Schemes)

Consider a linear system Ax = b, where A is an n × n matrix and b is a column vector in ℜn. In a scheme that generalizes the aggregation approach of Section 6.5, we introduce an n × s matrix Φ, whose columns are viewed as basis functions, and an s × n matrix D. We find a solution r* of the s × s system

DAΦr = Db,

and we view Φr* as an approximate solution of Ax = b. An approximate implementation is to compute by simulation approximations C̃ and d̃ of the matrices C = DAΦ and d = Db, respectively, and then solve the system C̃r = d̃. The purpose of the exercise is to provide a scheme for doing this.

Let G and Ã be matrices of dimensions s × n and n × n, respectively, whose rows are probability distributions, and are such that their components satisfy

Gℓi > 0 if Dℓi ≠ 0,  Ãij > 0 if Aij ≠ 0.

We approximate the (ℓm)th component Cℓm of C as follows. We generate a sequence {(it, jt) | t = 1, 2, ...} by independently generating each it according to the distribution {Gℓi | i = 1, ..., n}, and then by independently generating jt according to the distribution {Ãit j | j = 1, ..., n}. For k = 1, 2, ..., consider the scalar

C^k_ℓm = (1/k) Σ_{t=1}^k ( Dℓit Ait jt / ( Gℓit Ãit jt ) ) Φjt m,

where Φjm denotes the (jm)th component of Φ. Show that with probability 1 we have

lim_{k→∞} C^k_ℓm = Cℓm.

Derive a similar scheme for approximating the components of d.

6.10.15 (Approximate Policy Iteration Using an Approximate Problem)

Consider the discounted problem of Section 6.3 (referred to as DP) and an approximation to this problem (this is a different discounted problem referred to as AP). This exercise considers an approximate policy iteration method where the policy evaluation is done through AP, but the policy improvement is done through DP – a process that is patterned after the aggregation-based policy iteration method of Section 6.5. In particular, we assume that the two problems, DP and AP, are connected as follows:

(1) DP and AP have the same state and control spaces, and the same policies.

(2) For any policy µ, its cost vector in AP, denoted J̃µ, satisfies

|| J̃µ − Jµ ||∞ ≤ δ,

i.e., policy evaluation using AP, rather than DP, incurs an error of at most δ in sup-norm.

(3) The policy µ̄ obtained by exact policy iteration in AP satisfies the equation T J̃µ = Tµ̄ J̃µ. This is true in particular if the policy improvement process in AP is identical to the one in DP.

Show the error bound

|| Jµ̄ − J* ||∞ ≤ 2αδ/(1 − α)

[cf. Eq. (6.143)]. Hint: Follow the derivation of the error bound (6.143).

6.10.16 (Approximate Policy Iteration and Newton's Method)

Consider the discounted problem, and a policy iteration method that uses function approximation over a subspace S = {Φr | r ∈ ℜs}, and operates as follows: Given a vector rk ∈ ℜs, define µk to be a policy such that

Tµk(Φrk) = T(Φrk),  (6.238)

and let rk+1 satisfy

rk+1 = Lk+1 Tµk(Φrk+1),

where Lk+1 is an s × n matrix. Show that if µk is the unique policy satisfying Eq. (6.238), then rk+1 is obtained from rk by a Newton step for solving the equation

r = Lk+1 T(Φr)

(cf. Exercise 1.10).

6.10.17 (Generalized Policy Iteration)

Consider a policy iteration-type algorithm that has the following form: Given µ, we find J̃µ satisfying

J̃µ(i) = (Fµ J̃µ)(i),  i = 1, ..., n,

where Fµ: ℜn → ℜn is a mapping that is parametrized by the policy µ, and is assumed to be both monotone and a sup-norm contraction with modulus α ∈ (0, 1). We update µ by

µ̄(i) = arg min_ν (Fν J̃µ)(i),  i = 1, ..., n,

where we assume that µ̄ so obtained satisfies

(Fµ̄ J̃µ)(i) = min_ν (Fν J̃µ)(i),  i = 1, ..., n.

Furthermore, (Fν J̃µ)(i) depends on ν only through the component ν(i), which may take only a finite number of values, and is minimized over that component. Show that the sequence of policies generated by the algorithm terminates with J̃µ being the unique fixed point J* of the mapping F defined by

(F J)(i) = min_µ (Fµ J)(i),  i = 1, ..., n.

Hint: Show that J̃µ ≥ J̃µ̄, with equality if J̃µ = J*.
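Exercise 6.10.17 can be tried out concretely by taking Fµ to be the standard DP mapping Tµ of a small discounted MDP, which is a legitimate instance since Tµ is monotone and an α-contraction in sup-norm. In the sketch below the MDP data are invented for illustration:

```python
import numpy as np

alpha = 0.9
# Small discounted MDP (invented data): 2 states, 2 controls.
p = [np.array([[0.8, 0.2], [0.3, 0.7]]),
     np.array([[0.5, 0.5], [0.9, 0.1]])]
g = [np.array([2.0, 1.0]), np.array([0.5, 3.0])]
n = 2

def F(J, mu):
    """(F_mu J)(i) with F_mu = T_mu, the DP mapping of policy mu."""
    return np.array([g[mu[i]][i] + alpha * p[mu[i]][i] @ J for i in range(n)])

def improve(J):
    """mu_bar(i) = argmin over the control at i of (F_nu J)(i)."""
    vals = np.array([[g[u][i] + alpha * p[u][i] @ J for u in range(2)]
                     for i in range(n)])
    return vals.argmin(axis=1)

mu = np.array([0, 0])
while True:
    # policy evaluation: solve the linear fixed point equation J = F_mu(J)
    Pm = np.array([p[mu[i]][i] for i in range(n)])
    gm = np.array([g[mu[i]][i] for i in range(n)])
    J = np.linalg.solve(np.eye(n) - alpha * Pm, gm)
    mu_bar = improve(J)
    if np.array_equal(mu_bar, mu):
        break                      # finite termination, as the exercise asserts
    mu = mu_bar

# At termination, J is the fixed point of (F J)(i) = min_mu (F_mu J)(i),
# i.e., Bellman's equation J = T J holds and J = J*.
TJ = np.minimum(F(J, np.array([0, 0])), F(J, np.array([1, 1])))
print(mu, J)
```

The loop terminates after finitely many policy changes, and the final J satisfies the minimized fixed point equation.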
6.10.18 (Policy Gradient Formulas for SSP)

Consider the SSP context, and let the cost per stage and transition probability matrix be given as functions of a parameter vector r. Denote by gi(r), i = 1, ..., n, the expected cost per stage at state i, and by pij(r) the transition probabilities. Each value of r defines a stationary policy, which is assumed proper. For each r, the expected costs starting at states i are denoted by Ji(r). We wish to calculate the gradient of a weighted sum of the costs Ji(r), i.e., of

J̄(r) = Σ_{i=1}^n q(i) Ji(r),

where q = (q(1), ..., q(n)) is some probability distribution over the states. Consider a single scalar component rm of r, and differentiate Bellman's equation to show that

∂Ji/∂rm = ∂gi/∂rm + Σ_{j=1}^n (∂pij/∂rm) Jj + Σ_{j=1}^n pij ∂Jj/∂rm,  i = 1, ..., n,

where the argument r at which the partial derivatives are computed is suppressed. Interpret the above equation as Bellman's equation for a SSP problem.
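A finite-difference check of the differentiated Bellman equation (a sketch; the parametric model below is an invented example in which the parameter enters both the stage costs and the transition probabilities):

```python
import numpy as np

# A 2-state SSP depending smoothly on a scalar parameter r (invented forms).
def model(r):
    p = np.array([[0.3 + 0.1 * r, 0.2],       # substochastic among states 1, 2;
                  [0.1, 0.4 - 0.1 * r]])      # remaining mass goes to state 0
    g = np.array([1.0 + r ** 2, 2.0 - r])
    return p, g

def dmodel(r):
    dp = np.array([[0.1, 0.0], [0.0, -0.1]])  # elementwise d p_ij / d r
    dg = np.array([2.0 * r, -1.0])
    return dp, dg

def J(r):
    p, g = model(r)
    return np.linalg.solve(np.eye(2) - p, g)  # Bellman: J = g + P J

def dJ(r):
    # Differentiated Bellman equation: dJ = dg + dP J + P dJ, i.e.
    # dJ = (I - P)^{-1} (dg + dP J) -- itself an SSP policy evaluation with
    # "stage cost" dg + dP J and the same transition matrix P.
    p, _ = model(r)
    dp, dg = dmodel(r)
    return np.linalg.solve(np.eye(2) - p, dg + dp @ J(r))

r = 0.5
q = np.array([0.6, 0.4])
grad = q @ dJ(r)                               # d/dr of q' J(r)
fd = (q @ J(r + 1e-6) - q @ J(r - 1e-6)) / 2e-6
print(grad, fd)    # analytic and finite-difference values agree closely
```

This makes the "Bellman's equation for an SSP problem" interpretation concrete: the derivative vector solves a policy evaluation equation with modified stage costs.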
References

[ABB01] Abounadi, J., Bertsekas, D. P., and Borkar, V. S., 2001. "Learning Algorithms for Markov Decision Processes with Average Cost," SIAM J. on Control and Optimization, Vol. 40, pp. 681-698.

[ABB02] Abounadi, J., Bertsekas, D. P., and Borkar, V. S., 2002. "Stochastic Approximation for Nonexpansive Maps: Application to Q-Learning Algorithms," SIAM J. on Control and Optimization, Vol. 41, pp. 1-22.

[ASM08] Antos, A., Szepesvari, C., and Munos, R., 2008. "Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path," Machine Learning, Vol. 71, pp. 89-129.

[BBS95] Barto, A. G., Bradtke, S. J., and Singh, S. P., 1995. "Learning to Act Using Real-Time Dynamic Programming," Artificial Intelligence, Vol. 72, pp. 81-138.