Operations Research
Publication details, including instructions for authors and subscription information:
http://pubsonline.informs.org
This article may be used only for the purposes of research, teaching, and/or private study. Commercial use
or systematic downloading (by robots or other automatic processes) is prohibited without explicit Publisher
approval, unless otherwise noted. For more information, contact permissions@informs.org.
The Publisher does not warrant or guarantee the article’s accuracy, completeness, merchantability, fitness
for a particular purpose, or non-infringement. Descriptions of, or references to, products or publications, or
inclusion of an advertisement in this article, neither constitutes nor implies a guarantee, endorsement, or
support of claims made of that product, publication, or service.
© 2003 INFORMS
INFORMS is the largest professional society in the world for professionals in the fields of operations research, management
science, and analytics.
For more information on INFORMS, its publications, membership, or meetings visit http://www.informs.org
THE LINEAR PROGRAMMING APPROACH TO APPROXIMATE
DYNAMIC PROGRAMMING
D. P. DE FARIAS
Downloaded from informs.org by [128.32.10.230] on 29 March 2018, at 03:33 . For personal use only, all rights reserved.
B. VAN ROY
Department of Management Science and Engineering, Stanford University, Stanford, California 94306,
bvr@stanford.edu
The curse of dimensionality gives rise to prohibitive computational requirements that render infeasible the exact solution of large-scale
stochastic control problems. We study an efficient method based on linear programming for approximating solutions to such problems. The
approach “fits” a linear combination of pre-selected basis functions to the dynamic programming cost-to-go function. We develop error
bounds that offer performance guarantees and also guide the selection of both basis functions and “state-relevance weights” that influence
quality of the approximation. Experimental results in the domain of queueing network control provide empirical support for the methodology.
(Dynamic programming/optimal control: approximations/large-scale problems. Queues, algorithms: control of queueing networks.)
Received September 2001; accepted July 2002.
Area of review: Stochastic Models.
how and why approximate dynamic programming algorithms work and a lack of streamlined guidelines for implementation. These deficiencies pose a barrier to the use of approximate dynamic programming in industry. Limited understanding also affects the linear programming approach; in particular, although the algorithm was introduced by Schweitzer and Seidmann more than 15 years ago, there has been virtually no theory explaining its behavior.

We develop a variant of approximate linear programming that represents a significant improvement over the original formulation. While the original algorithm may exhibit poor scaling properties, our version enjoys strong theoretical guarantees and is provably well-behaved for a fairly general class of problems involving queueing networks—and we expect the same to be true for other classes of problems. Specifically, our contributions can be summarized as follows:

• We develop an error bound that characterizes the quality of approximations produced by approximate linear programming. The error is characterized in relative terms, compared against the "best possible" approximation of the optimal cost-to-go function given the selection of basis functions—"best possible" is placed in quotation marks because it involves the choice of a metric by which to compare different approximations. In addition to providing performance guarantees, the error bounds and associated analysis offer new interpretations and insights pertaining to approximate linear programming. Furthermore, insights from the analysis offer guidance in the selection of basis functions, motivating our variant of the algorithm.

Our error bound is the first to link the quality of the approximate cost-to-go function to the quality of the "best" approximate cost-to-go function within the approximation architecture, not only for the linear programming approach but also for any algorithm that approximates cost-to-go functions of general stochastic control problems by computing weights for arbitrary collections of basis functions.

• We provide analysis, theoretical results, and numerical examples that explain the impact of state-relevance weights on the performance of approximate linear programming and offer guidance on how to choose them for practical problems. In particular, appropriate choice of state-relevance weights is shown to be of fundamental importance for the scalability of the algorithm.

• We develop a bound on the cost increase that results from using policies generated by the approximation of the cost-to-go function instead of the optimal policy. The bound suggests a natural metric by which to compare different approximations to the cost-to-go function and provides guidance on the choice of state-relevance weights.

refinement in two-dimensional problems. Some of their grid generation techniques are based on stationary state distributions, which also appear in our analysis of state-relevance weights. Paschalidis and Tsitsiklis (2000) also apply the algorithm to two-dimensional problems. An important feature of the linear programming approach is that it generates lower bounds as approximations to the cost-to-go function; Gordon (1999) discusses problems that may arise from that and suggests constraint relaxation heuristics. One of these problems is that the linear program used in the approximate linear programming algorithm may be overly constrained, which may lead to poor approximations or even infeasibility. The approach taken in our work prevents this—part of the difference between our variant of approximate linear programming and the original one proposed by Schweitzer and Seidmann is that we include certain basis functions that guarantee feasibility and also lead to improved bounds on the approximation error. Morrison and Kumar (1999) develop efficient implementations in the context of queueing network control. Guestrin et al. (2002) and Schuurmans and Patrascu (2001) develop efficient implementations of the algorithm for factored MDPs. The linear programming approach involves linear programs with a prohibitive number of constraints, and the emphasis of the previous three articles is on exploiting problem-specific structure that allows the constraints in the linear program to be represented compactly. Alternatively, de Farias and Van Roy (2001) suggest an efficient constraint sampling algorithm.

This paper is organized as follows. We first formulate in §2 the stochastic control problem under consideration and discuss linear programming approaches to exact and approximate dynamic programming. In §3, we discuss the significance of "state-relevance weights" and establish a bound on the performance of policies generated by approximate linear programming. Section 4 contains the main results of the paper, which offer error bounds for the algorithm as well as associated analyses. The error bounds involve problem-dependent terms, and in §5 we study characteristics of these terms in examples involving queueing networks. Presented in §6 are experimental results involving problems of queueing network control. A final section offers closing remarks, including a discussion of merits of the linear programming approach relative to other methods for approximate dynamic programming.

2. STOCHASTIC CONTROL AND LINEAR PROGRAMMING

We consider discrete-time stochastic control problems involving a finite state space 𝒮 of cardinality |𝒮| = N. For each state x ∈ 𝒮, there is a finite set of available actions
852 / de Farias and Van Roy
𝒜_x. Taking action a ∈ 𝒜_x when the current state is x incurs cost g_a(x). State transition probabilities p_a(x, y) represent, for each pair (x, y) of states and each action a ∈ 𝒜_x, the probability that the next state will be y given that the current state is x and the current action is a ∈ 𝒜_x.

A policy u is a mapping from states to actions. Given a policy u, the dynamics of the system follow a Markov chain with transition probabilities p_{u(x)}(x, y). For each policy u, we define a transition matrix P_u whose (x, y)th entry is p_{u(x)}(x, y).

The problem of stochastic control amounts to selection of a policy that optimizes a given criterion. In this paper, we will employ as an optimality criterion infinite-horizon discounted cost of the form

  J_u(x) = E[ Σ_{t=0}^∞ α^t g_u(x_t) | x_0 = x ],

where g_u(x) is used as shorthand for g_{u(x)}(x), and the discount factor α ∈ (0, 1) reflects intertemporal preferences. It is well known that there exists a single policy u that minimizes J_u(x) simultaneously for all x, and the goal is to identify that policy.

Let us define operators T_u and T by

  T_u J = g_u + αP_u J  and  TJ = min_u (g_u + αP_u J),

where the minimization is carried out component-wise. Dynamic programming involves solution of Bellman's equation,

  J = TJ.

The unique solution J* of this equation is the optimal cost-to-go function

  J* = min_u J_u,

and optimal control actions can be generated based on this function, according to

  u(x) = argmin_{a ∈ 𝒜_x} ( g_a(x) + α Σ_{y ∈ 𝒮} p_a(x, y) J*(y) ).

Dynamic programming offers a number of approaches to solving Bellman's equation. One of particular relevance to our paper makes use of linear programming, as we will now discuss. Consider the problem

  max  c^T J
  s.t. TJ ≥ J,   (1)

where c is a (column) vector with positive components, which we will refer to as state-relevance weights, and c^T denotes the transpose of c. It can be shown that any feasible J satisfies J ≤ J*. It follows that, for any set of positive weights c, J* is the unique solution to (1).

Note that T is a nonlinear operator, and therefore the constrained optimization problem written above is not a linear program. However, it is easy to reformulate the constraints to transform the problem into a linear program. In particular, noting that each constraint

  (TJ)(x) ≥ J(x)

is equivalent to a set of constraints

  g_a(x) + α Σ_{y ∈ 𝒮} p_a(x, y) J(y) ≥ J(x)  ∀ a ∈ 𝒜_x,

we can rewrite the problem as

  max  c^T J
  s.t. g_a(x) + α Σ_{y ∈ 𝒮} p_a(x, y) J(y) ≥ J(x)  ∀ x ∈ 𝒮, a ∈ 𝒜_x.

We will refer to this problem as the exact LP.

As mentioned in the introduction, state spaces for practical problems are enormous due to the curse of dimensionality. Consequently, the linear program of interest involves prohibitively large numbers of variables and constraints. The approximation algorithm we study reduces dramatically the number of variables.

Let us now introduce the linear programming approach to approximate dynamic programming. Given pre-selected basis functions φ_1, …, φ_K, define a matrix

  Φ = [ φ_1 ⋯ φ_K ].

With an aim of computing a weight vector r̃ ∈ ℝ^K such that Φr̃ is a close approximation to J*, one might pose the following optimization problem:

  max  c^T Φr
  s.t. TΦr ≥ Φr.   (2)

Given a solution r̃, one might then hope to generate near-optimal decisions according to

  u(x) = argmin_{a ∈ 𝒜_x} ( g_a(x) + α Σ_{y ∈ 𝒮} p_a(x, y) (Φr̃)(y) ).

We will call such a policy a greedy policy with respect to Φr̃. More generally, a greedy policy u with respect to a function J is one that satisfies

  u(x) = argmin_{a ∈ 𝒜_x} ( g_a(x) + α Σ_{y ∈ 𝒮} p_a(x, y) J(y) ).

As with the case of exact dynamic programming, the optimization problem (2) can be recast as a linear program

  max  c^T Φr
  s.t. g_a(x) + α Σ_{y ∈ 𝒮} p_a(x, y) (Φr)(y) ≥ (Φr)(x)  ∀ x ∈ 𝒮, a ∈ 𝒜_x.
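Both linear programs above translate directly into standard-form LPs. The following is our own illustrative sketch (not code from the paper): it solves the exact LP and the approximate LP for a small randomly generated MDP using SciPy. All problem data (sizes, costs, transition probabilities, basis) are made up for illustration.

```python
# Sketch (ours): the exact LP and the approximate LP of this section for a
# small random MDP, via scipy.optimize.linprog. linprog minimizes, so we
# negate the objective and rewrite TJ >= J as (I - alpha P_a) J <= g_a.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, A, K, alpha = 20, 3, 3, 0.95            # states, actions, basis size, discount

P = rng.random((A, N, N))
P /= P.sum(axis=2, keepdims=True)          # p_a(x, y), rows normalized
g = rng.random((A, N))                     # costs g_a(x)
c = np.ones(N) / N                         # state-relevance weights (uniform)

# Exact LP: max c'J  s.t.  g_a(x) + alpha * sum_y p_a(x,y) J(y) >= J(x).
A_ub = np.vstack([np.eye(N) - alpha * P[a] for a in range(A)])
b_ub = np.concatenate([g[a] for a in range(A)])
res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * N)
J_star = res.x                             # the unique solution is J*

# Approximate LP: same constraints restricted to the span of Phi.
Phi = np.vander(np.arange(N) / N, K, increasing=True)     # basis 1, x, x^2
res_alp = linprog(-(c @ Phi),
                  A_ub=np.vstack([(np.eye(N) - alpha * P[a]) @ Phi
                                  for a in range(A)]),
                  b_ub=b_ub, bounds=[(None, None)] * K)
J_approx = Phi @ res_alp.x

# Any feasible J satisfies J <= J*, so the approximation is a lower bound.
assert np.all(J_approx <= J_star + 1e-6)
```

Note the variable count drops from N to K in the second program, while the constraint count is unchanged, exactly as the text observes.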
We will refer to this problem as the approximate LP. Note that although the number of variables is reduced to K, the number of constraints remains as large as in the exact LP. Fortunately, most of the constraints become inactive, and solutions to the linear program can be approximated efficiently. In numerical studies presented in §6, for example, we sample and use only a relatively small subset of the constraints. We expect that subsampling in this way suffices for most practical problems, and have developed sample-complexity bounds that qualify this expectation (de Farias and Van Roy 2001). There are also alternative approaches studied in the literature for alleviating the need to consider all constraints. Examples include heuristics presented in Trick and Zin (1993) and problem-specific approaches making use of constraint generation methods (e.g., Grötschel and Holland 1991, Schuurmans and Patrascu 2001) or structure allowing constraints to be represented compactly (e.g., Morrison and Kumar 1999, Guestrin et al. 2002).

In the next four sections, we assume that the approximate LP can be solved, and we study the quality of the solution as an approximation to the cost-to-go function.

3. THE IMPORTANCE OF STATE-RELEVANCE WEIGHTS

In the exact LP, for any vector c with positive components, maximizing c^T J yields J*. In other words, the choice of state-relevance weights does not influence the solution. The same statement does not hold for the approximate LP. In fact, the choice of state-relevance weights may bear a significant impact on the quality of the resulting approximation, as suggested by theoretical results in this section and demonstrated by numerical examples later in the paper.

To motivate the role of state-relevance weights, let us start with a lemma that offers an interpretation of their function in the approximate LP. The proof can be found in the appendix.

Lemma 1. A vector r̃ solves

  max  c^T Φr
  s.t. TΦr ≥ Φr

if and only if it solves

  min  ‖J* − Φr‖_{1,c}
  s.t. TΦr ≥ Φr.

The preceding lemma points to an interpretation of the approximate LP as the minimization of a certain weighted norm, with weights equal to the state-relevance weights. This suggests that c imposes a trade-off in the quality of the approximation across different states, and we can lead the algorithm to generate better approximations in a region of the state space by assigning relatively larger weight to that region.

Underlying the choice of state-relevance weights is the question of how to compare different approximations to the cost-to-go function. A possible measure of quality is the distance to the optimal cost-to-go function; intuitively, we expect that the better the approximate cost-to-go function captures the real long-run advantage of being in a given state, the better the policy it generates. A more direct measure is a comparison between the actual costs incurred by using the greedy policy associated with the approximate cost-to-go function and those incurred by an optimal policy. We now provide a bound on the cost increase incurred by using approximate cost-to-go functions generated by approximate linear programming.

We consider as a measure of the quality of policy u the expected increase in the infinite-horizon discounted cost, conditioned on the initial state of the system being distributed according to a probability distribution ν; i.e.,

  E_{X∼ν}[J_u(X) − J*(X)] = ‖J_u − J*‖_{1,ν}.

It will be useful to define a measure μ_{u,ν} over the state space associated with each policy u and probability distribution ν, given by

  μ_{u,ν}^T = (1 − α) ν^T Σ_{t=0}^∞ α^t P_u^t.   (3)

Note that because Σ_{t=0}^∞ α^t P_u^t = (I − αP_u)^{−1}, we also have

  μ_{u,ν}^T = (1 − α) ν^T (I − αP_u)^{−1}.

The measure μ_{u,ν} captures the expected frequency of visits to each state when the system runs under policy u, conditioned on the initial state being distributed according to ν. Future visits are discounted according to the discount factor α.

Proofs for the following lemma and theorem can be found in the appendix.

Lemma 2. μ_{u,ν} is a probability distribution.

We are now poised to prove the following bound on the expected cost increase associated with policies generated by approximate linear programming. Henceforth we will use the norm ‖·‖_{1,ν}, defined by

  ‖J‖_{1,ν} = Σ_{x ∈ 𝒮} ν(x)|J(x)|.

Theorem 1. Let J: 𝒮 → ℝ be such that TJ ≥ J. Then

  ‖J_{u_J} − J*‖_{1,ν} ≤ (1/(1 − α)) ‖J − J*‖_{1,μ_{u_J,ν}}.   (4)

Theorem 1 offers some reassurance that if the approximate cost-to-go function J is close to J*, the performance of the policy generated by J should similarly be close to the performance of the optimal policy. Moreover, the bound (4) also establishes how approximation errors in different states in the system map to losses in performance, which is useful for comparing different approximations to the cost-to-go function.
Contrasting Lemma 1 with the bound on the increase in costs (4) given by Theorem 1, we may want to choose state-relevance weights c that capture the (discounted) frequency with which different states are expected to be visited. Note that the frequency with which different states are visited in general depends on the policy being used. One possibility is to have an iterative scheme, where the approximate LP is solved multiple times with state-relevance weights

In Figure 1, the span of the basis functions comes relatively close to the optimal cost-to-go function J*; if we were able to perform, for instance, a maximum-norm projection of J* onto the subspace {Φr : r ∈ ℝ^K}, we would obtain the reasonably good approximate cost-to-go function Φr*. At the same time, the approximate LP yields the approximate cost-to-go function Φr̃. In this section, we develop bounds guaranteeing that Φr̃ is not too much farther from

For any J and scalar k,

  T(J − ke) = TJ − αke.   (6)

Combining (5) and (6), we have

  T(Φr* − ke) = TΦr* − αke
              ≥ J* − αεe − αke
              ≥ Φr* − (1 + α)εe − αke
              = Φr* − ke + ((1 − α)k − (1 + α)ε)e.
Because e is within the span of the columns of Φ, there exists a vector r̄ such that

  Φr̄ = Φr* − ((1 + α)ε/(1 − α)) e,

and Φr̄ is a feasible solution to the approximate LP. By the triangle inequality,

  ‖Φr̄ − J*‖_∞ ≤ ‖J* − Φr*‖_∞ + ‖Φr* − Φr̄‖_∞
             ≤ ε + (1 + α)ε/(1 − α)
             = 2ε/(1 − α).

If r̃ is an optimal solution to the approximate LP, by Lemma 1 we have

  ‖J* − Φr̃‖_{1,c} ≤ ‖J* − Φr̄‖_{1,c} ≤ ‖J* − Φr̄‖_∞ ≤ 2ε/(1 − α),

where the second inequality holds because c is a probability distribution. The result follows.

This bound establishes that when the optimal cost-to-go function lies close to the span of the basis functions, the approximate LP generates a good approximation. In particular, if the error min_r ‖J* − Φr‖_∞ goes to zero (e.g., as we make use of more and more basis functions), the error resulting from the approximate LP also goes to zero.

Though the above bound offers some support for the linear programming approach, there are some significant weaknesses:

1. The bound calls for an element of the span of the basis functions to exhibit uniformly low error over all states. In practice, however, min_r ‖J* − Φr‖_∞ is typically huge, especially for large-scale problems.

2. The bound does not take into account the choice of state-relevance weights. As demonstrated in the previous section, these weights can significantly impact the approximation error. A sharp bound should take them into account.

In §4.2, we will state and prove the main result of this paper, which provides an improved bound that aims to alleviate the shortcomings listed above.

4.2. An Improved Bound

To set the stage for development of an improved bound, let us establish some notation. First, we introduce a weighted maximum norm, defined by

  ‖J‖_{∞,ψ} = max_{x ∈ 𝒮} ψ(x)|J(x)|,   (7)

for any ψ: 𝒮 → ℝ₊. As opposed to the maximum norm employed in Theorem 2, this norm allows for uneven weighting of errors across the state space.

We also introduce an operator H, defined by

  (HV)(x) = max_{a ∈ 𝒜_x} Σ_y p_a(x, y)V(y)

for all V: 𝒮 → ℝ. For any V, (HV)(x) represents the maximum expected value of V(y) if the current state is x and y is a random variable representing the next state. For each V: 𝒮 → ℝ₊, we define a scalar β_V given by

  β_V = α max_x (HV)(x)/V(x).   (8)

We can now introduce the notion of a "Lyapunov function."

Definition 1 (Lyapunov Function). We call V: 𝒮 → ℝ₊ a Lyapunov function if β_V < 1.

Our definition of a Lyapunov function translates into the condition that there exist V > 0 and β < 1 such that

  α(HV)(x) ≤ βV(x)  ∀ x ∈ 𝒮.   (9)

If α were equal to 1, this would look like a Lyapunov stability condition: the maximum expected value (HV)(x) at the next time step must be less than the current value V(x). In general, α is less than 1, and this introduces some slack in the condition.

Our error bound for the approximate LP will grow proportionately with 1/(1 − β_V), and we therefore want β_V to be small. Note that β_V becomes smaller as the (HV)(x)'s become small relative to the V(x)'s; β_V conveys a degree of "stability," with smaller values representing stronger stability. Therefore our bound suggests that the more stable the system is, the easier it may be for the approximate LP to generate a good approximate cost-to-go function.

We now state our main result. For any given function V mapping 𝒮 to positive reals, we use 1/V as shorthand for the function x ↦ 1/V(x).

Theorem 3. Let r̃ be a solution of the approximate LP. Then, for any v ∈ ℝ^K such that (Φv)(x) > 0 for all x ∈ 𝒮 and αH(Φv) < Φv,

  ‖J* − Φr̃‖_{1,c} ≤ (2c^TΦv/(1 − β_{Φv})) min_r ‖J* − Φr‖_{∞,1/Φv}.   (10)

We will first present three preliminary lemmas leading to the main result. Omitted proofs can be found in the appendix.

The first lemma bounds the effects of applying T to two different vectors.

Lemma 3. For any J and J̄,

  |TJ − TJ̄| ≤ αH(|J − J̄|).

Based on the preceding lemma, we can place the following bound on constraint violations in the approximate LP.

Lemma 4. For any vector V with positive components and any vector J,

  TJ ≥ J − (αHV + V)‖J* − J‖_{∞,1/V}.   (11)
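The operator H and the factor β_V are simple to compute by brute force for small problems. The sketch below is our own (problem data made up); it also illustrates why a constant function is always a Lyapunov function here: for V = e we have HV = e, so (8) gives β_V = α < 1, which is one reason to keep a constant basis function in the span of Φ.

```python
# Sketch (ours): the operator H of this section and the factor beta_V of (8),
# evaluated by brute force for a small made-up MDP.
import numpy as np

rng = np.random.default_rng(2)
N, A, alpha = 15, 3, 0.9
P = rng.random((A, N, N))
P /= P.sum(axis=2, keepdims=True)         # p_a(x, y)

def H(V):
    # (HV)(x) = max_a sum_y p_a(x, y) V(y)
    return (P @ V).max(axis=0)

V = np.ones(N)                            # the constant function e
beta_V = alpha * (H(V) / V).max()         # equation (8)

# He = e, so beta_e = alpha < 1: e is a Lyapunov function.
assert abs(beta_V - alpha) < 1e-12
```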
The next lemma establishes that subtracting an appropriately scaled version of a Lyapunov function from any Φr leads us to the feasible region of the approximate LP.

Lemma 5. Let v be a vector such that Φv is a Lyapunov function, let r be an arbitrary vector, and let

  r̄ = r − ‖J* − Φr‖_{∞,1/Φv} (2/(1 − β_{Φv}) − 1) v.

the Lyapunov function value. This should to some extent alleviate difficulties arising in large problems. In particular, the Lyapunov function should take on large values in undesirable regions of the state space—regions where J* is large. Hence, division by the Lyapunov function acts as a normalizing procedure that scales down errors in such regions.

2. As opposed to the bound of Theorem 2, the state-
easy to verify that such a choice of boundary conditions is possible.

We assume that p < 1/2 so that the system is "stable." Stability here is taken in a loose sense indicating that the steady-state probabilities are decreasing for all sufficiently large states.

Suppose that we wish to generate an approximation to the optimal cost-to-go function using the linear programming approach. Further suppose that we have chosen the state-relevance weights c to be the vector π of steady-state probabilities and the basis functions to be φ₁(x) = 1 and φ₂(x) = x².

How good can we expect the approximate cost-to-go function Φr̃ generated by approximate linear programming to be as we increase the number of states N? First note that

  min_r ‖J* − Φr‖_{1,c} ≤ ‖J* − ρ₀φ₁ − ρ₂φ₂‖_{1,c}
                        = Σ_{x=0}^{N−1} π(x) ρ₁ x
                        = ρ₁ Σ_{x=0}^{N−1} π(0) x (p/(1 − p))^x
                        ≤ ρ₁ (p/(1 − p)) / (1 − p/(1 − p))

for all N. The last inequality follows from the fact that the summation in the third line corresponds to the expected value of a geometric random variable conditioned on its being less than N. Hence, min_r ‖J* − Φr‖_{1,c} is uniformly bounded over N. One would hope that ‖J* − Φr̃‖_{1,c}, with r̃ being an outcome of the approximate LP, would be similarly uniformly bounded over N. It is clear that Theorem 2 does not offer a uniform bound of this sort. In particular, the term min_r ‖J* − Φr‖_∞ on the right-hand side grows proportionately with N and is unbounded as N increases. Fortunately, this situation is remedied by Theorem 3, which does provide a uniform bound. In particular, as we will show in the remainder of this section, for an appropriate Lyapunov function V = Φv, the values of min_r ‖J* − Φr‖_{∞,1/V}, 1/(1 − β_V), and c^T V are all uniformly bounded over N, and together these values offer a bound on ‖J* − Φr̃‖_{1,c} that is uniform over N.

We will make use of a Lyapunov function

  V(x) = x² + 2/(1 − α),

so that

  min_r ‖J* − Φr‖_{∞,1/V} ≤ max_x ρ₁x/V(x) ≤ ρ₁ / (2√(2/(1 − α))).

Hence, min_r ‖J* − Φr‖_{∞,1/V} is uniformly bounded over N.

We next show that 1/(1 − β_V) is uniformly bounded over N. To do that, we find bounds on HV in terms of V. For 0 < x < N − 1, we have

  (HV)(x) = p((x + 1)² + 2/(1 − α)) + (1 − p)((x − 1)² + 2/(1 − α))
          = x² + 2/(1 − α) + 1 + 2x(2p − 1)
          ≤ x² + 2/(1 − α) + 1
          = V(x) + 1
          ≤ ((1 + α)/(2α)) V(x),

where the last inequality holds because 1 = ((1 − α)/2)V(0) ≤ ((1 − α)/(2α))V(x). For x = 0, we have

  (HV)(0) = p(1 + 2/(1 − α)) + (1 − p)(2/(1 − α))
          = p + 2/(1 − α)
          ≤ ((1 − α)/(2α))V(0) + V(0)
          = ((1 + α)/(2α)) V(0).

Finally, we clearly have

  (HV)(N − 1) ≤ V(N − 1) ≤ ((1 + α)/(2α)) V(N − 1),

because the only possible transitions from state N − 1 are to states x ≤ N − 1 and V is a nondecreasing function. Therefore, β_V ≤ (1 + α)/2, and 1/(1 − β_V) is uniformly bounded over N.

We now treat c^T V. Note that for N ≥ 1,

  c^T V = Σ_{x=0}^{N−1} π(0) (p/(1 − p))^x (x² + 2/(1 − α))
        = [(1 − p/(1 − p))/(1 − (p/(1 − p))^N)] Σ_{x=0}^{N−1} (p/(1 − p))^x (x² + 2/(1 − α))
  ≤ 2(1 − p)/((1 − 2p)(1 − α)) + 2p²/(1 − 2p)² + p/(1 − 2p),

so c^T V is uniformly bounded for all N.

  = max{ (pρ₂ + ρ₁)/(1 − α), (m(1 − p) + α(ρ₂ + ρ₁(2p − 1)))/(1 − α) }.

For any state x such that 0 < x < N − 1, we can verify that

  J(x) − (T̄J)(x) = ρ₀(1 − α) − m(1 − p) − α(ρ₂ + ρ₁(2p − 1)).
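The chain of inequalities bounding HV above can be spot-checked numerically. The sketch below is our own; it assumes the birth–death dynamics used in the displayed computation (up with probability p, down with probability 1 − p, reflecting at the two boundaries) and verifies β_V ≤ (1 + α)/2 for V(x) = x² + 2/(1 − α).

```python
# Numerical sanity check (ours, not from the paper): for the birth-death
# dynamics used in the HV computation above, with p < 1/2 and
# V(x) = x^2 + 2/(1 - alpha), the factor beta_V of (8) satisfies
# beta_V <= (1 + alpha)/2.
import numpy as np

N, p, alpha = 1000, 0.3, 0.95
x = np.arange(N)
V = x.astype(float) ** 2 + 2.0 / (1.0 - alpha)

HV = np.empty(N)
HV[0] = p * V[1] + (1 - p) * V[0]          # reflecting boundary at 0
HV[1:-1] = p * V[2:] + (1 - p) * V[:-2]    # interior states
HV[-1] = p * V[-1] + (1 - p) * V[-2]       # transitions only to x <= N - 1

beta_V = (alpha * HV / V).max()
assert beta_V <= (1 + alpha) / 2 + 1e-12
```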
under some stability assumptions, there should be a geometric upper bound for the tail of the steady-state distribution; we expect that results for finite (large) buffers should be similar if the system is stable, because in this case most of the steady-state distribution will be concentrated on relatively small states. Let us assume that the system under the optimal policy is indeed stable—that should generally be the case if the discount factor is large. For a queue with infinite buffer the optimal service rate q(x) is nondecreasing in x (Bertsekas 1995), and stability therefore implies that

  q(x) ≥ q(x₀) > p

for all x ≥ x₀ and some sufficiently large x₀. It is easy then to verify that the tail of the steady-state distribution has an upper bound with geometric decay: the balance equations give

  π(x)p = π(x + 1)q(x + 1),

and therefore

  π(x + 1)/π(x) = p/q(x + 1) ≤ p/q(x₀) < 1

for all x ≥ x₀. Thus a reasonable choice of state-relevance weights is c(x) = π₀ξ^x, where π₀ = (1 − ξ)/(1 − ξ^N) is a normalizing constant making c a probability distribution. In this case,

  c^T V = E[X² + C | X < N]
        ≤ 2ξ²/(1 − ξ)² + ξ/(1 − ξ) + C,

where X represents a geometric random variable with parameter 1 − ξ. We conclude that c^T V is uniformly bounded over N.

5.3. A Queueing Network

Both previous examples involved one-dimensional state spaces and had terms of interest in the approximation error bound uniformly bounded over the number of states. We now consider a queueing network with d queues and finite buffers of size B to determine the impact of dimensionality on the terms involved in the error bound of Theorem 3.

We assume that the number of exogenous arrivals occurring in any time step has expected value less than or equal to Ad, for a finite A. The state x ∈ ℤ₊^d indicates the number of jobs in each queue. The cost per stage incurred at state x is given by

  g(x) = (1/d) Σ_{i=1}^d x_i,

the average number of jobs per queue.

optimal cost-to-go function for the finite buffer case should be bounded above by that of the infinite buffer case, as having finite buffers corresponds to having jobs arriving at a full queue discarded at no additional cost.

We have

  E_x[|x_t|] ≤ |x| + Adt,

where |x| denotes Σ_{i=1}^d x_i, because the expected total number of jobs at time t cannot exceed the total number of jobs at time 0 plus the expected number of arrivals between 0 and t, which is less than or equal to Adt. Therefore, we have

  E[ Σ_{t=0}^∞ α^t |x_t| | x₀ = x ] = Σ_{t=0}^∞ α^t E_x[|x_t|]
                                    ≤ Σ_{t=0}^∞ α^t (|x| + Adt)
                                    = |x|/(1 − α) + αAd/(1 − α)².   (14)

The first equality holds because x_t ≥ 0 for all t; by the monotone convergence theorem, we can interchange the expectation and the summation. We conclude from (14) that the optimal cost-to-go function in the infinite buffer case should be bounded above by a linear function of the state; in particular,

  0 ≤ J*(x) ≤ (ρ₁/d)|x| + ρ₀

for some positive scalars ρ₀ and ρ₁ independent of the number of queues d.

As discussed before, the optimal cost-to-go function in the infinite buffer case provides an upper bound for the optimal cost-to-go function in the case of finite buffers of size B. Therefore, the optimal cost-to-go function in the finite buffer case should be bounded above by the same linear function regardless of the buffer size B.

As in the previous examples, we will establish bounds on the terms involved in the error bound of Theorem 3. We consider a Lyapunov function V(x) = (1/d)|x| + C for some constant C > 0, which implies

  min_r ‖J* − Φr‖_{∞,1/V} ≤ ‖J*‖_{∞,1/V}
                          = max_x (ρ₁|x| + dρ₀)/(|x| + dC)
                          ≤ ρ₁ + ρ₀/C,
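The claim above that c^T V is uniformly bounded over N for geometric state-relevance weights can be checked directly. A small sketch (ours), for the choice c(x) ∝ ξ^x on {0, …, N − 1} and V(x) = x² + C:

```python
# Numerical check (ours): with c the geometric law conditioned on X < N and
# V(x) = x^2 + C, the value c'V = E[X^2 + C | X < N] never exceeds the
# N-independent bound 2 xi^2/(1 - xi)^2 + xi/(1 - xi) + C.
import numpy as np

xi, C = 0.95, 3.0
bound = 2 * xi**2 / (1 - xi) ** 2 + xi / (1 - xi) + C

for N in (10, 100, 1000):
    x = np.arange(N)
    c = xi ** x
    c /= c.sum()                          # conditional geometric distribution
    cTV = (c * (x.astype(float) ** 2 + C)).sum()
    assert cTV <= bound + 1e-9
```

The bound holds because truncating a geometric distribution can only lower the expectation of the increasing function x² + C.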
and the bound above is independent of the number of 6.1. Single Queue with Controlled Service Rate
queues in the system. In §5.2, we studied a queue with a controlled service rate
Now let us study 'V . We have and determined that the bounds on the error of the approx-
x + Ad imate LP were uniform over the number of states. That
H V x +C example provided some guidance on the choice of basis
d
functions; in particular, we now know that including a
A
V x + quadratic and a constant function guarantees that an appro-
x/d + C priate Lyapunov function is in the span of the columns
Downloaded from informs.org by [128.32.10.230] on 29 March 2018, at 03:33 . For personal use only, all rights reserved.
A of . Furthermore, our analysis of the (unknown) steady-
V x + state distribution revealed that state-relevance weights of
C
the form cx = 1 − 33 x are a sensible choice. However,
and it is clear that, for C sufficiently large and inde- how to choose an appropriate value of 3 was not discussed
pendent of d, there is a ' < 1 independent of d such there. In this section, we present results of experiments with
that αHV ≤ βV for some β < 1, and therefore 1/(1 − β_V) is uniformly bounded over the number of queues d.

Finally, let us consider c^T V. We expect that under some stability assumptions, the tail of the steady-state distribution will have an upper bound with geometric decay (Bertsimas et al. 2001), and we take

    c(x) = ((1 − ξ)/(1 − ξ^{B+1}))^d ξ^{|x|}.

The state-relevance weights c are equivalent to the conditional joint distribution of d independent and identically distributed geometric random variables conditioned on the event that they are all less than B + 1. Therefore,

    c^T V = E[ (1/d) Σ_{i=1}^d X_i + C | X_i < B + 1, i = 1, …, d ]
          ≤ E[X_1] + C
          = ξ/(1 − ξ) + C,

where X_i, i = 1, …, d, are identically distributed geometric random variables with parameter 1 − ξ. It follows that c^T V is uniformly bounded over the number of queues.

6. APPLICATION TO CONTROLLED QUEUEING NETWORKS

In this section, we discuss numerical experiments involving application of the linear programming approach to controlled queueing problems. Such problems are relevant to several industries, including manufacturing and telecommunications, and the experimental results presented here suggest approximate linear programming as a promising approach to solving them.

In all examples, we assume that at most one event (arrival/departure) occurs at each time step. We also choose basis functions that are polynomial in the states. This is partly motivated by the analysis in the previous sections.

The first example illustrates how state-relevance weights influence the solution of the approximate LP. We solve the approximate LP for different values of ξ for a particular instance of the model described in §5.2. The values of ξ chosen for experimentation are motivated by ideas developed in §3.

We assume that jobs arrive at a queue with probability p = 0.2 in any unit of time. Service rates/probabilities q are chosen from the set {0.2, 0.4, 0.6, 0.8}. The cost incurred at any time for being in state x and taking action q is given by

    g(x, q) = x + 60 q^3.

We take the buffer size to be 49,999 and the discount factor to be α = 0.98. We select basis functions φ_1(x) = 1, φ_2(x) = x, φ_3(x) = x^2, φ_4(x) = x^3, and state-relevance weights c(x) = (1 − ξ) ξ^x. The approximate LP is solved for ξ = 0.9 and ξ = 0.999, and we denote the solution of the approximate LP by r̃_ξ. The numerical results are presented in Figures 2, 3, 4, and 5.

Figure 2 shows the approximations Φr̃_ξ to the cost-to-go function generated by the approximate LP. Note that the results agree with the analysis developed in §3; small states are approximated better when ξ = 0.9, whereas large states are approximated almost exactly when ξ = 0.999.

In Figure 3, we see the greedy action with respect to Φr̃_ξ. We get the optimal action for almost all "small" states

[Figure 2. Approximate cost-to-go function for the example in §6.1. Curves over states 0–50: optimal, ξ = 0.9, ξ = 0.999.]
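The approximate LP for this single-queue example can be assembled and solved with generic tools. The sketch below is our own reconstruction, not the authors' code, and it makes assumptions not fully specified above: simple birth–death dynamics (from state x the queue grows by one with probability p, shrinks by one with the chosen service probability q, and is unchanged otherwise), and a much smaller buffer than the 49,999 used in the paper, to keep the LP small.

```python
import numpy as np
from scipy.optimize import linprog

# Sketch of the approximate LP for a single controlled queue (assumed dynamics).
N = 100                            # truncated buffer (the paper uses 49,999)
p, alpha, xi = 0.2, 0.98, 0.9      # arrival prob., discount, state-relevance param.
actions = [0.2, 0.4, 0.6, 0.8]     # available service probabilities

x = np.arange(N + 1, dtype=float)
Phi = np.stack([np.ones(N + 1), x, x**2, x**3], axis=1)  # basis 1, x, x^2, x^3
c = (1 - xi) * xi**x
c /= c.sum()                       # state-relevance weights on the truncated space

# ALP constraints: (Phi r)(x) <= g(x,q) + alpha * sum_y P_q(x,y) (Phi r)(y),
# for every state x and every action q, i.e. (Phi - alpha P_q Phi) r <= g_q.
A_blocks, b_blocks = [], []
for q in actions:
    P = np.zeros((N + 1, N + 1))
    for s in range(N + 1):
        up = p if s < N else 0.0       # arrival blocked at the buffer limit
        down = q if s > 0 else 0.0     # departure only if the queue is nonempty
        if s < N:
            P[s, s + 1] = up
        if s > 0:
            P[s, s - 1] = down
        P[s, s] = 1.0 - up - down      # no event
    A_blocks.append(Phi - alpha * P @ Phi)
    b_blocks.append(x + 60.0 * q**3)   # g(x, q) = x + 60 q^3
A = np.vstack(A_blocks)
b = np.concatenate(b_blocks)

# Approximate LP: maximize c^T Phi r subject to the constraints above.
res = linprog(-(c @ Phi), A_ub=A, b_ub=b, bounds=[(None, None)] * Phi.shape[1])
r_tilde = res.x                        # weights of the fitted cost-to-go Phi r
```

Because the basis contains the constant function, the LP is feasible (e.g., r = 0), and every feasible Φr lies below J* on the truncated chain, so the LP is bounded.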
de Farias and Van Roy / 861
[Figure 3. Greedy action for the example in §6.1. Curves over states 0–50: optimal, ξ = 0.9, ξ = 0.999.]

[Figure 5. Steady-state probabilities for the example in §6.1. Curves over states 0–50: steady-state probabilities under the optimal policy and under the greedy policies for ξ = 0.9 and ξ = 0.999, together with the state-relevance weights for ξ = 0.9 and ξ = 0.999.]

[System for the example in §6.2: arrival probabilities λ = 0.08; service probabilities µ_1 = 0.12, µ_2 = 0.12, µ_3 = 0.28, µ_4 = 0.28.]
always serves the longest queue (LONGEST). Results are summarized in Table 1, and we can see that the approximate LP yields significantly better performance than all of the other heuristics.

Table 1. Performance of different policies for the example in §6.2.

    Policy        Average Cost
    ξ = 0.95          33.37
    LONGEST           45.04
    FIFO              45.71
    LBFS             144.1

    Average cost estimated by simulation after 50,000,000 iterations, starting with an empty system.

6.3. An Eight-Dimensional Queueing Network

In our last example, we consider a queueing network with eight queues. The system is depicted in Figure 7, with arrival (λ_i, i = 1, 2) and departure (µ_i, i = 1, …, 8) probabilities indicated.

The state x ∈ ℤ_+^8 represents the number of jobs in each queue. The cost-per-state is g(x) = |x|, and the discount factor is 0.995. Actions a ∈ {0, 1}^8 indicate which queues are being served; a_i = 1 iff a job from queue i is being processed. We consider only nonidling policies and, at each time step, a server processes jobs from one of its queues exclusively.

We choose state-relevance weights of the form c(x) = (1 − ξ)^8 ξ^{|x|}. The basis functions are chosen to span all polynomials in x of degree at most 2; therefore, the approximate LP has 47 variables. Because of the relatively large number of actions per state (up to 18), we choose to sample a relatively small number of states. Note that we take a slightly different approach from that proposed in de Farias and Van Roy (2001) and include constraints relative to all actions associated with each state in the system. Constraints for the approximate LP are generated by sampling 5,000 states according to the distribution associated with the state-relevance weights c. Experiments were performed for ξ = 0.85, 0.9, and 0.95, and ξ = 0.9 yielded the policy with smallest average cost. We do not specify a maximum buffer size. The maximum number of jobs in the system for states sampled in the LP was 235, and the maximum single queue length, 93. During simulation of the policy obtained, the maximum number of jobs in the system was 649, and the maximum number of jobs in any single queue, 384.

To evaluate the performance of the policy generated by the approximate LP, we compare it with first-in-first-out (FIFO), last-buffer-first-serve (LBFS), and a policy that serves the longest queue in each server (LONGEST). LBFS serves the job that is closest to leaving the system; for example, if there are jobs in queue 2 and in queue 6, a job from queue 2 is processed, since it will leave the system after going through only one more queue, whereas the job from queue 6 will still have to go through two more queues. We also choose to assign higher priority to queue 3 than to queue 8 because queue 3 has a higher departure probability.

We estimated the average cost of each policy with 50,000,000 simulation steps, starting with an empty system. Results appear in Table 2. The policy generated by the approximate LP performs significantly better than each of the heuristics, yielding more than 10% improvement over LBFS, the second-best policy. We expect that even better results could be obtained by refining the choice of basis functions and state-relevance weights.

The constraint generation step took 74.9 seconds, and the resulting LP was solved in approximately 3.5 minutes of CPU time with CPLEX 7.0 running on a Sun Ultra Enterprise 5500 machine with the Solaris 7 operating system and a 400-MHz processor.

7. CLOSING REMARKS AND OPEN ISSUES

In this paper, we studied the linear programming approach to approximate dynamic programming for stochastic control problems as a means of alleviating the curse of dimensionality. We provided an error bound for a variant of approximate linear programming based on certain assumptions on the basis functions. The bounds were shown to be uniform in the number of states and state variables in certain queueing problems. Our analysis also led to some guidelines on the choice of the so-called "state-relevance weights" for the approximate LP.

An alternative to the approximate LP is offered by temporal-difference (TD) learning methods (Bertsekas and Tsitsiklis 1996; Dayan 1992; de Farias and Van Roy 2000; Sutton 1988; Sutton and Barto 1998; Tsitsiklis and Van Roy 1997; Van Roy 1998, 2000). In such methods, one tries to find a fixed point of an "approximate dynamic programming operator" by simulating the system and learning from the observed costs and state transitions. Experimentation is necessary to determine when TD can offer better results than the approximate LP. However, it is worth mentioning that because of its complexity, much of TD's
[Figure 7. System for the example in §6.3: arrivals λ_1 = λ_2 = 1/11.5; service probabilities include µ_1 = 4/11.5, µ_2 = 2/11.5, µ_3 = 3/11.5, µ_5 = 3/11.5, µ_7 = 3/11.5.]
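The constraint-sampling step described in §6.3 is straightforward to implement, because the weights c(x) = (1 − ξ)^8 ξ^{|x|} factor into eight independent geometric(1 − ξ) marginals. The Python sketch below (our own illustration, not the authors' code) samples 5,000 states from c and builds one feature row per state from the degree-2 monomials; note that this raw monomial count is 45, while the paper reports 47 LP variables, so the paper's exact basis may differ slightly.

```python
import numpy as np

# Sample states from c(x) = (1 - xi)^8 * xi^(x_1 + ... + x_8): each queue
# length is an independent geometric(1 - xi) draw on {0, 1, 2, ...}.
rng = np.random.default_rng(0)
xi = 0.9
num_samples, num_queues = 5000, 8

# rng.geometric counts trials, so its support is {1, 2, ...}; shift to {0, 1, ...}.
states = rng.geometric(1 - xi, size=(num_samples, num_queues)) - 1

def features(x):
    """All monomials in x of degree at most 2: 1, x_i, and x_i * x_j (i <= j)."""
    feats = [1.0]
    feats.extend(float(v) for v in x)
    for i in range(len(x)):
        for j in range(i, len(x)):
            feats.append(float(x[i] * x[j]))
    return np.array(feats)

Phi_sampled = np.array([features(s) for s in states])  # one row per sampled state
```

Each sampled state then contributes one ALP constraint per action available in that state.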
behavior is still to be understood; there are no conver- the objective function be chosen in some adaptive way?
gence proofs or effective error bounds for general stochas- Can we add robustness to the approximate LP algorithm to
tic control problems. Such poor understanding leads to account for errors in the estimation of costs and transition
implementation difficulties; a fair amount of trial and error probabilities, i.e., design an alternative LP with meaning-
is necessary to get the method to perform well or even to ful performance bounds when problem parameters are just
converge. The approximate LP, on the other hand, benefits known to be in a certain range? How do our results extend
from the inherent simplicity of linear programming: its to the average cost case? How do our results extend to the
analysis is simpler, and error bounds such as those pro- infinite-state case?
vided here provide guidelines on how to set the algorithm’s Finally, in this paper we utilize linear architectures
parameters most efficiently. Packages for large-scale linear to represent approximate cost-to-go functions. It may
programming developed in the recent past also make the be interesting to explore algorithms using nonlinear
approximate LP relatively easy to implement. representations. Alternative representations encountered in
A central question in approximate linear programming the literature include neural networks (Bishop 1995, Haykin
not addressed here is the choice of basis functions. In the 1994) and splines (Chen et al. 1999, Trick and Zin 1997),
applications to queueing networks, we have chosen basis among others.
functions polynomial in the states. This was largely moti-
vated by the fact that with linear/quadratic costs, it can be APPENDIX A: PROOFS
shown in these problems that the optimal cost-to-go func-
tion is asymptotically linear/quadratic. Reasonably accurate Lemma 1. A vector r˜ solves
knowledge of structure the cost-to-go function may be diffi- max c T r
cult in other problems. An alternative approach is to extract
a number of features of the states which are believed to be st T r r
relevant to the decision being made. The hope is that the
mapping from features to the cost-to-go function might be if and only if it solves
smooth, in which case certain sets of basis functions such min J ∗ − r1 c
as polynomials might lead to good approximations.
We have motivated many of the ideas and guidelines for st T r r
choice of parameters through examples in queueing prob-
lems. In future work, we intend to explore how these ideas Proof. It is well known that the dynamic programming
would be interpreted in other contexts, such as portfolio operator T is monotonic. From this and the fact that T is
management and inventory control. a contraction with fixed point J ∗ , it follows that for any J
Several other questions remain open and are the object with J TJ , we have
of future investigation: Can the state-relevance weights in J TJ T 2 J · · · J ∗
Table 2. Average number of jobs in the system for the example in §6.3, after 50,000,000 simulation steps.

    Policy        Average Cost
    [table entries not legible in this copy]

Hence, any r that is a feasible solution to the optimization problems of interest satisfies Φr ≤ J*. It follows that

    ‖J* − Φr‖_{1,c} = Σ_{x∈S} c(x)(J*(x) − (Φr)(x)) = c^T J* − c^T Φr,

and the claim follows.
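Lemma 1 can be checked numerically on a small MDP: every feasible point of the LP satisfies Φr ≤ J*, so the two objectives coincide on the feasible region. The sketch below (the three-state MDP, basis, and weights are arbitrary choices of ours, not from the paper) verifies the feasibility property.

```python
import numpy as np
from scipy.optimize import linprog

# Numerical check of Lemma 1 on a small random MDP.
rng = np.random.default_rng(1)
n, m, alpha = 3, 2, 0.9            # states, actions, discount factor
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)  # row-stochastic transition matrices P_a
g = rng.random((m, n))             # costs g_a(x)
c = np.array([0.2, 0.3, 0.5])      # state-relevance weights (a distribution)
Phi = np.stack([np.ones(n), np.arange(n, dtype=float)], axis=1)  # basis: 1, x

def T(J):
    """Dynamic programming operator: (TJ)(x) = min_a [g_a(x) + alpha (P_a J)(x)]."""
    return np.min(g + alpha * P @ J, axis=0)

J_star = np.zeros(n)
for _ in range(500):               # value iteration to numerical convergence
    J_star = T(J_star)

# LP:  max c^T Phi r   s.t.   g_a + alpha P_a Phi r >= Phi r  for every a,
# written below in linprog's  A_ub r <= b_ub  form.
A = np.vstack([Phi - alpha * P[a] @ Phi for a in range(m)])
b = np.concatenate([g[a] for a in range(m)])
res = linprog(-(c @ Phi), A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Any feasible r satisfies Phi r <= J*, so maximizing c^T Phi r is the same
# as minimizing ||J* - Phi r||_{1,c} over the feasible region.
assert np.all(Phi @ res.x <= J_star + 1e-6)
```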
Theorem 1. Let J: S → ℝ be such that TJ ≥ J. Then, for any probability distribution ν,

    ‖J_{u_J} − J*‖_{1,ν} ≤ (1/(1 − α)) ‖J − J*‖_{1,µ_{ν,u_J}},

where µ^T_{ν,u_J} = (1 − α) ν^T (I − αP_{u_J})^{−1}.

Proof. We have

    J_{u_J} − J = (I − αP_{u_J})^{−1} g_{u_J} − J
                = (I − αP_{u_J})^{−1} [g_{u_J} − (I − αP_{u_J}) J]
                = (I − αP_{u_J})^{−1} (g_{u_J} + αP_{u_J} J − J)
                = (I − αP_{u_J})^{−1} (TJ − J).

Because J ≤ TJ, we have J ≤ TJ ≤ J* ≤ J_{u_J}. Hence,

    ‖J_{u_J} − J*‖_{1,ν} = ν^T (J_{u_J} − J*)
                         ≤ ν^T (J_{u_J} − J)
                         = ν^T (I − αP_{u_J})^{−1} (TJ − J)
                         = (1/(1 − α)) µ^T_{ν,u_J} (TJ − J)
                         ≤ (1/(1 − α)) µ^T_{ν,u_J} (J* − J)
                         = (1/(1 − α)) ‖J* − J‖_{1,µ_{ν,u_J}}.

Lemma 3. For any J and J̄,

    TJ − TJ̄ ≤ α max_u P_u (J − J̄).

Proof. Note that, for any J and J̄,

    TJ − TJ̄ = min_u {g_u + αP_u J} − min_u {g_u + αP_u J̄}
             = g_{u_J} + αP_{u_J} J − g_{u_{J̄}} − αP_{u_{J̄}} J̄
             ≤ g_{u_{J̄}} + αP_{u_{J̄}} J − g_{u_{J̄}} − αP_{u_{J̄}} J̄
             = αP_{u_{J̄}} (J − J̄)
             ≤ α max_u P_u (J − J̄),

where u_J and u_{J̄} denote greedy policies with respect to J and J̄, respectively. An entirely analogous argument gives us

    TJ̄ − TJ ≤ α max_u P_u (J̄ − J),

and the result follows.

Lemma 4. For any vector V with positive components and any vector J,

    TJ ≥ J − ‖J* − J‖_{∞,1/V} (αHV + V).

Proof. By Lemma 3,

    J*(x) − TJ(x) = TJ*(x) − TJ(x)
                  ≤ α max_a Σ_{y∈S} P_a(x, y)(J*(y) − J(y))
                  ≤ α ‖J* − J‖_{∞,1/V} max_a Σ_{y∈S} P_a(x, y) V(y)
                  = α ‖J* − J‖_{∞,1/V} (HV)(x).

Letting ε = ‖J* − J‖_{∞,1/V}, it follows that

    TJ(x) ≥ J*(x) − εα(HV)(x)
          ≥ J(x) − εV(x) − εα(HV)(x).

The result follows.

Lemma 5. Let v be a vector such that V = Φv is a Lyapunov function, let r be an arbitrary vector, and

    r̄ = r − ‖J* − Φr‖_{∞,1/V} (2/(1 − β_V) − 1) v.

Then,

    TΦr̄ ≥ Φr̄.

Proof. Let ε = ‖J* − Φr‖_{∞,1/V}. By Lemma 3,

    TΦr(x) − TΦr̄(x) = TΦr(x) − T(Φr − ε(2/(1 − β_V) − 1)V)(x)
                     ≤ α max_a Σ_{y∈S} P_a(x, y) ε(2/(1 − β_V) − 1) V(y)
                     = εα(2/(1 − β_V) − 1)(HV)(x),

because V is a Lyapunov function and therefore 2/(1 − β_V) − 1 > 0. It follows that

    TΦr̄ ≥ TΦr − εα(2/(1 − β_V) − 1) HV.

By Lemma 4,

    TΦr ≥ Φr − ε(αHV + V),

and therefore

    TΦr̄ ≥ Φr − ε(αHV + V) − εα(2/(1 − β_V) − 1) HV
         = Φr̄ − ε(αHV + V) + ε(2/(1 − β_V) − 1)(V − αHV)
         ≥ Φr̄ − ε(αHV + V) + ε(V + αHV)
         = Φr̄,
where the last inequality follows from the fact that V − αHV > 0, and

    2/(1 − β_V) − 1 = 2/(1 − α max_x (HV)(x)/V(x)) − 1
                    = max_x (V(x) + α(HV)(x)) / (V(x) − α(HV)(x)).

Guestrin, C., D. Koller, R. Parr. 2002. Efficient solution algorithms for factored MDPs. J. Artificial Intelligence Res. Forthcoming.
Haykin, S. 1994. Neural Networks: A Comprehensive Foundation. Macmillan, New York.
Hordijk, A., L. C. M. Kallenberg. 1979. Linear programming and Markov decision chains. Management Sci. 25 352–362.
Kumar, P. R., T. I. Seidman. 1990. Dynamic instabilities and stabi-