
Folding Algorithm for Policy Evaluation for Markov Decision
Processes With Quasi-Birth Death Structure
Yassir Yassir and Langford B. White
Abstract—This technical note presents a new numerical procedure for policy evaluation of Stochastic Shortest Path Markov
Decision Processes (MDPs) having a level independent Quasi
Birth-Death structure. The algorithm is derived using a method
analogous to the folding method of Ye and Li (1994). The computational complexity is O(M^3 log_2 N) + O(M^2 N), where the
process has N levels and M phases. A simple example involving
the control of two queues is presented to illustrate the application
of this efficient policy evaluation algorithm to compare and rank
control policies.
Index Terms—Dynamic programming, optimisation, queueing
analysis.
I. INTRODUCTION
A finite level Quasi Birth-Death (QBD) process is a (finite) discrete
state Markov process with a transition probability matrix having a
block tridiagonal structure. QBD processes represent an extension of
standard birth-death processes (which possess tridiagonal transition
probability matrices) to more than one dimension. QBD processes are
the subject of texts such as [1] and [2] to which the reader is referred
for details. An important application of QBD models is in telecommunications systems modelling, where the determination of the stationary
distribution of the states of the system (usually queue occupancies)
permits the evaluation of various performance measures such as blocking probabilities and delays which are important in assessing system
performance. Matrix analytic methods (MAM) are a commonly used
approach for determining the stationary distribution of a QBD process
(see [1], [2] and references therein). A significant computational
saving is obtained by use of various MAM algorithms which exploit
the QBD structure of the process. In such problems, resources can be
allocated in different ways in order to optimize some utility function
or cost/reward. Examples would be minimizing blocking probabilities
or maximizing throughput. Thus there is a notion of the reward of a
particular policy of allocating resources. In this technical note, we are
interested in the evaluation of the reward of a specified policy associated with a QBD process, rather than the evaluation of its stationary
distribution. Thus our class of models becomes Markov decision processes (MDPs), and we are interested in evaluating policies for controlling MDPs with QBD transition probability structure.
In previous work [4], White presented an approach to policy
evaluation for QBD MDPs which was based on the MAM technique

Manuscript received March 12, 2013; revised November 12, 2013,
February 17, 2014, and July 2, 2014; accepted August 12, 2014. Date of publication August 18, 2014; date of current version March 20, 2015. Recommended
by Associate Editor L. H. Lee.
The authors are with the School of Electrical and Electronic Engineering,
The University of Adelaide, 5005, South Australia (e-mail: Yassir.Yassir@
adelaide.edu.au; Lang.White@adelaide.edu.au).
Digital Object Identifier 10.1109/TAC.2014.2348803

known as linear level reduction. This approach has computational
complexity of O(NM^3), where N is the number of levels and M
is the number of phases in the QBD MDP, and was applicable for the
general level dependent case. The purpose of this technical note is to
describe a faster numerical procedure for policy evaluation applicable
to level independent QBD MDPs. A QBD MDP is level independent if
both its transition probabilities and one-stage rewards are independent
of level (apart from the boundary levels). The technique described
is based on the folding method presented by Ye and Li [5], and has
computational complexity of O(M^3 log_2 N) + O(M^2 N). The first
term dominates except when M is small and N is large. There is thus a
significant computational saving compared to the linear reduction case
[4]. In linear algebraic terms, linear level reduction corresponds to LU
factorization of a general block tridiagonal matrix, whilst the folding
method, a form of logarithmic reduction, corresponds to a kind of
factorization applicable to block tridiagonal Toeplitz matrices. Importantly, as argued in the MAM literature, the probabilistic interpretation
assists in the proof of the applicability of these factorizations in terms
of the existence of certain inverses and other relevant issues; the computational complexity is the same in either view. The other significant contribution of
this technical note is that here we consider stochastic shortest path
MDPs, rather than uniformly discounted MDPs (as in [4]), although
the latter case can also be addressed in the current framework. We
note at this point that we do not address general approaches to finding
optimal policies for QBD MDPs in this technical note. This problem
is more complicated than the general approaches based on value or
policy iteration [3], although policy evaluation as presented here would
be expected to form part of an appropriate policy iteration method. The
general optimization case is a matter for ongoing work.
This technical note briefly describes discrete-time, discrete-state
QBD MDP models in Section II. In Section III, our algorithm for
policy evaluation based on the folding method is described. Finally,
in Section IV, we present a queueing example which utilises our new
method in the context of ranking a number of control policies in terms
of expected reward. The paper concludes with some suggestions for
subsequent research.
II. QUASI BIRTH-DEATH MARKOV DECISION PROCESSES
A discrete time QBD process X(t), t ≥ 0 is a finite state Markov
process defined on a state space labelled (without loss of generality)
by the set of all ordered pairs (n, m) for 0 ≤ n ≤ N − 1, 0 ≤ m ≤
M − 1, where P = NM denotes the total number of states. In MAM
terminology, the set of states corresponding to a given index n is called
a level, and the set of states corresponding to a given m is called a
phase. The set of states ℓ(n) = {(n, m) : m = 0, ..., M − 1} is
called level n. The key property of a QBD that distinguishes it from a
general two-dimensional Markov process is that transitions are allowable only within a given level or between adjacent levels. Thus allowable transitions from a given state (n, m) to state (j, k) are restricted
to the cases where |n − j| ≤ 1. In this technical note we assume that
the number of levels is a power of 2, although this assumption can be
relaxed as indicated in [5]. In addition, as we will be interested in solving stochastic shortest path (SSP) problems [3], we augment the state


space with a unique absorbing state, which we shall characterize as level/state −1 (with a single phase). The transition probability matrix for a level independent QBD process X(t), t ≥ 0, with an absorbing state has the block matrix form

$$
\begin{bmatrix}
1 & 0 & 0 & \cdots & & 0 \\
\alpha_0 & D_1 & D_0 & 0 & \cdots & 0 \\
\alpha & A_2 & A_1 & A_0 & & \vdots \\
\vdots & & \ddots & \ddots & \ddots & \\
\alpha & & & A_2 & A_1 & A_0 \\
\alpha_{N-1} & 0 & \cdots & 0 & C_2 & C_1
\end{bmatrix} \qquad (1)
$$

The first column contains the transition probabilities from each level to the absorbing state; at most two of α_0, α, α_{N−1} may be zero. The diagonal blocks contain the (unnormalized) transition probabilities associated with each level, whilst the off-diagonal blocks contain the transition probabilities between adjacent levels. The lower right block of this matrix (call it Q) is of size P × P, with each sub-block having size M × M.
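For concreteness, the following sketch (illustrative only; the function and argument names are ours, not part of the paper) assembles a matrix with the block structure of (1) from given sub-blocks using NumPy.

```python
import numpy as np

def assemble_qbd_with_absorption(A0, A1, A2, D0, D1, C1, C2,
                                 alpha0, alpha, alphaN1, N):
    """Assemble the (1 + N*M) x (1 + N*M) matrix of (1).

    A0/A1/A2 are the level independent up/within/down blocks (M x M),
    D1/D0 the level-0 boundary blocks, C1/C2 the level-(N-1) boundary
    blocks, and alpha0/alpha/alphaN1 the length-M vectors of transition
    probabilities into the absorbing state (column 0).
    """
    M = A1.shape[0]
    Q_bar = np.zeros((1 + N * M, 1 + N * M))
    Q_bar[0, 0] = 1.0                         # absorbing state stays put

    def rows(n):                              # row/column indices of level n
        return slice(1 + n * M, 1 + (n + 1) * M)

    for n in range(N):
        # column 0: transitions into the absorbing state
        Q_bar[rows(n), 0] = alpha0 if n == 0 else (alphaN1 if n == N - 1 else alpha)
        # within-level block
        Q_bar[rows(n), rows(n)] = D1 if n == 0 else (C1 if n == N - 1 else A1)
        # one level down / one level up
        if n > 0:
            Q_bar[rows(n), rows(n - 1)] = C2 if n == N - 1 else A2
        if n < N - 1:
            Q_bar[rows(n), rows(n + 1)] = D0 if n == 0 else A0
    return Q_bar
```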

In an MDP, the transition probability matrix Q is parametrized by a finite set of control functions U; thus each block in Q is also a function of u ∈ U. A policy is admissible if all resulting controls u(t) ∈ U. We will assume that a stationary policy (see [3]) is applied, meaning that the mapping (policy) μ : X(t) → u(t) is independent of t. We restrict attention to functions μ which are level independent, i.e., such that all admissible policies give rise to a level independent QBD transition matrix (1).¹ Similar remarks apply to transitions to the absorbing state. We also assume that the one-stage rewards are independent of level, apart from level 0 and level N − 1. In the sequel, we shall only consider SSP problems; we assume there is a unique absorbing state as described above, and that the absorbing state is reached with probability one in a finite time.

¹This assumption can be relaxed at the boundary levels 0 and N − 1, as in the example of Section IV.

The reward-to-go from state i under policy μ is defined by

$$
J_\mu(i) = E_\mu\left[\,\sum_{t=0}^{\infty} g\big(X(t), X(t+1), u(t)\big) \,\Big|\, X(0) = i\right] \qquad (2)
$$

where g(·) ≥ 0 denotes the reward obtained for a transition from X(t) to X(t + 1) under control u(t) = μ(X(t)), and the expectation is with respect to all states evolving under the policy μ. Note that the reward-to-go from the absorbing state is zero: J(−1) = 0 for all admissible policies (including an optimal policy), because the Markov chain always remains in the absorbing state if initially there, and the one-stage reward for self-transitions in the absorbing state is zero, i.e., g(−1, −1, u) = 0 for all controls u. Using the strong Markov property of the process X(t), we can write, for i = 0, ..., P − 1,

$$
J_\mu(i) = \sum_{j=-1}^{P-1} \big[Q\big(\mu(i)\big)\big]_{i,j}\,\big\{ J_\mu(j) + g\big(i, j, \mu(i)\big) \big\} \qquad (3)
$$

where, in the QBD case, g(i, j, u) depends only on i and j. Let g(u) ∈ R^P denote the vector of average rewards out of state i under control u,

$$
[g(u)]_i = \sum_{j=-1}^{P-1} \big[Q(u)\big]_{i,j}\, g(i, j, u)
$$

where again the control u is specified by the policy μ for each component state. Then (3) can be conveniently written in matrix-vector form as

$$
J_\mu = g(u) + Q(u)\, J_\mu \;\Longleftrightarrow\; \big(I - Q(u)\big) J_\mu = g(u). \qquad (4)
$$

There is a unique solution to these equations because of the probabilistic assumptions made on the SSP problem, namely that the absorbing state is reachable with probability one from any initial state (see [3]). Thus the reward-to-go under a specified policy μ can be evaluated by solving the set of linear equations (4). The solution of (4) is called policy evaluation, and is the main computational cost in performing standard policy iteration.

An optimal policy is one which realizes the maximum reward-to-go for each initial state. The determination of an optimal policy for a level independent QBD MDP is not a standard dynamic programming problem, and will not be addressed in this technical note. We propose that the policy evaluation algorithm presented here can find utility in ranking a number of candidate policies which might be selected on a heuristic basis.

III. FOLDING METHOD FOR POLICY EVALUATION

In the sequel, we consider the policy fixed and delete explicit reference to it. Consider the equation (I − Q)J = g, where J = [J_0^T ⋯ J_{N−1}^T]^T represents the reward-to-go vector and g = [g_0^T ⋯ g_{N−1}^T]^T is the vector of average one-stage rewards. For general transition probability matrices Q, solving this system requires O(P^3) operations, which can be prohibitively large. However, in the QBD case, we can exploit the special structure of Q to reduce this complexity substantially, following the folding idea of [5].
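To fix ideas, the dense baseline can be spelled out as follows; this sketch (illustrative only, not code from the paper) evaluates a fixed policy by solving (4) with a general-purpose solver, which is the O(P^3) approach the folding method is designed to avoid.

```python
import numpy as np

def policy_evaluation_dense(Q, g):
    """Solve (I - Q) J = g directly, cf. (4).

    Q : (P x P) substochastic block of (1) over the transient states,
        evaluated at the fixed policy.
    g : length-P vector of average one-stage rewards for that policy.
    Cost is O(P^3) = O((N M)^3), which the folding method avoids.
    """
    P = Q.shape[0]
    return np.linalg.solve(np.eye(P) - Q, g)
```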

From (4), we can write

$$
(I - D_1)\,J_0 - D_0\,J_1 = g_0, \qquad
(I - A_1)\,J_{2n} - A_2\,J_{2n-1} - A_0\,J_{2n+1} = g_{2n} \qquad (5)
$$

for n = 1, ..., N/2 − 1. So if the odd blocks of J are available, we can easily compute the even blocks using (5) via

$$
J_0 = (I - D_1)^{-1}\big(g_0 + D_0 J_1\big), \qquad
J_{2n} = (I - A_1)^{-1}\big(g_{2n} + A_0 J_{2n+1} + A_2 J_{2n-1}\big) \qquad (6)
$$

for n = 1, ..., N/2 − 1. The inverse of the matrix I − A_1 is guaranteed to exist because A_1 is substochastic; similar remarks apply to I − D_1. This computation requires the three matrices (I − D_1)^{-1}D_0, (I − A_1)^{-1}A_0 and (I − A_1)^{-1}A_2, which takes O(M^3) effort, independent of N, together with N matrix-vector multiplications requiring O(NM^2) effort.

Now consider the process Y_t defined to be X_t observed on odd levels. Using the strong Markov property of the process X(t), this process is also a level independent QBD (having N/2 + 1 levels, including the absorbing state), with transition probability matrix

$$
\begin{bmatrix}
1 & 0 & 0 & \cdots & & 0 \\
\beta_0 & E_1 & B_0 & 0 & \cdots & 0 \\
\beta & B_2 & B_1 & B_0 & & \vdots \\
\vdots & & \ddots & \ddots & \ddots & \\
\beta & & & B_2 & B_1 & B_0 \\
\beta_{N/2-1} & 0 & \cdots & 0 & F_2 & F_1
\end{bmatrix} \qquad (7)
$$

where

$$
\begin{aligned}
E_1 &= A_1 + A_0 (I - A_1)^{-1} A_2 + A_2 (I - D_1)^{-1} D_0, \\
B_1 &= A_1 + A_0 (I - A_1)^{-1} A_2 + A_2 (I - A_1)^{-1} A_0, \\
B_0 &= A_0 (I - A_1)^{-1} A_0, \qquad B_2 = A_2 (I - A_1)^{-1} A_2, \\
F_1 &= C_1 + C_2 (I - A_1)^{-1} A_0, \qquad F_2 = C_2 (I - A_1)^{-1} A_2.
\end{aligned} \qquad (8)
$$

The transition probabilities for Y_t to the absorbing state are

$$
\beta_0 = \big[I + A_0 (I - A_1)^{-1}\big]\alpha + A_2 (I - D_1)^{-1}\alpha_0, \quad
\beta = \big[I + (A_0 + A_2)(I - A_1)^{-1}\big]\alpha, \quad
\beta_{N/2-1} = \alpha_{N-1} + C_2 (I - A_1)^{-1}\alpha. \qquad (9)
$$

The calculation of these quantities requires O(M^3) effort, independent of the number of levels N. Let the lower right hand block of (7) be denoted by Q̃. We now seek average one-stage rewards g̃ for the process Y_t so that the reward-to-go for Y_t corresponds to that for the odd levels of X_t, i.e., so that if we solve (I − Q̃)J̃ = g̃, then J̃_n = J_{2n+1} for n = 0, ..., N/2 − 1.

Let 0 ≤ n ≤ N/2 − 2 and let X_0 ∈ ℓ(2n + 1). Suppose we move up to level 2n + 3 via level 2n + 2. A typical sample path having τ stops in ℓ(2n + 2) has the form

$$
(2n+1, i) \to (2n+2, k_1) \to \cdots \to (2n+2, k_\tau) \to (2n+3, j). \qquad (10)
$$

The corresponding sample path for Y_t is simply the one step (n, i) → (n + 1, j) (assuming we relabel the levels for Y_t). This path has probability

$$
[A_0]_{i,k_1} \left(\prod_{m=1}^{\tau-1} [A_1]_{k_m, k_{m+1}}\right) [A_0]_{k_\tau, j} \qquad (11)
$$

and the reward associated with it is

$$
[g_0]_{i,k_1} + \sum_{m=1}^{\tau-1} [g_1]_{k_m, k_{m+1}} + [g_0]_{k_\tau, j} \qquad (12)
$$

where g_0 ∈ R^{M×M} are the one-stage rewards for X_t from ℓ(n) to ℓ(n + 1) and g_1 ∈ R^{M×M} are the one-stage rewards for X_t from ℓ(n) to ℓ(n). Multiplying these two terms together and summing over all the k_m yields the average reward of a path from (2n + 1, i) to (2n + 3, j) having exactly τ stops in level 2n + 2. Let this quantity be called h(i, j, τ); it is given by

$$
h(i, j, \tau) = \sum_k [g_0]_{i,k} [A_0]_{i,k} \big[A_1^{\tau-1} A_0\big]_{k,j}
+ \sum_k \big[A_0 A_1^{\tau-1}\big]_{i,k} [g_0]_{k,j} [A_0]_{k,j}
+ \sum_{m=1}^{\tau-1} \sum_{k,r} \big[A_0 A_1^{m-1}\big]_{i,k} [g_1]_{k,r} [A_1]_{k,r} \big[A_1^{\tau-m-1} A_0\big]_{r,j}. \qquad (13)
$$

In matrix terms, (13) can be written

$$
h(\tau) = (g_0 \odot A_0) A_1^{\tau-1} A_0 + A_0 A_1^{\tau-1} (g_0 \odot A_0)
+ \sum_{m=1}^{\tau-1} A_0 A_1^{m-1} (g_1 \odot A_1) A_1^{\tau-m-1} A_0 \qquad (14)
$$

where ⊙ denotes the Hadamard (componentwise) product. Summing over τ, the average reward of a transition upwards of the process Y_t from level n, 0 ≤ n ≤ N/2 − 2, is given by

$$
\tilde g_0 = (g_0 \odot A_0)(I - A_1)^{-1} A_0 + A_0 (I - A_1)^{-1} (g_0 \odot A_0)
+ A_0 (I - A_1)^{-1} (g_1 \odot A_1)(I - A_1)^{-1} A_0 \qquad (15)
$$

which is independent of n. By a similar process, for 1 ≤ n ≤ N/2 − 2, the one-stage rewards for Y_t remaining in ℓ(n) are

$$
\begin{aligned}
\tilde g_1 = {} & (g_1 \odot A_1) + (g_0 \odot A_0)(I - A_1)^{-1} A_2
+ A_0 (I - A_1)^{-1} (g_1 \odot A_1)(I - A_1)^{-1} A_2
+ A_0 (I - A_1)^{-1} (g_2 \odot A_2) \\
& + (g_2 \odot A_2)(I - A_1)^{-1} A_0
+ A_2 (I - A_1)^{-1} (g_1 \odot A_1)(I - A_1)^{-1} A_0
+ A_2 (I - A_1)^{-1} (g_0 \odot A_0)
\end{aligned} \qquad (16)
$$

and the one-stage rewards for Y_t going from ℓ(n) to ℓ(n − 1) are

$$
\tilde g_2 = (g_2 \odot A_2)(I - A_1)^{-1} A_2 + A_2 (I - A_1)^{-1} (g_2 \odot A_2)
+ A_2 (I - A_1)^{-1} (g_1 \odot A_1)(I - A_1)^{-1} A_2 \qquad (17)
$$

where g_2 ∈ R^{M×M} are the one-stage rewards for X_t from ℓ(n) to ℓ(n − 1). For 1 ≤ n ≤ N/2 − 2, the one-stage rewards for Y_t from level n to the absorbing state are given by

$$
\begin{aligned}
\tilde g_{-1} = {} & (\alpha \odot g_{-1}) + (g_0 \odot A_0)(I - A_1)^{-1}\alpha
+ A_0 (I - A_1)^{-1} (\alpha \odot g_{-1})
+ A_0 (I - A_1)^{-1} (g_1 \odot A_1)(I - A_1)^{-1}\alpha \\
& + (g_2 \odot A_2)(I - A_1)^{-1}\alpha
+ A_2 (I - A_1)^{-1} (\alpha \odot g_{-1})
+ A_2 (I - A_1)^{-1} (g_1 \odot A_1)(I - A_1)^{-1}\alpha
\end{aligned} \qquad (18)
$$

where g_{−1} are the one-stage rewards to the absorbing state. The average one-stage rewards g̃_n for Y_t going from ℓ(n), n = 1, ..., N/2 − 2, are thus obtained via

$$
\tilde g = (\tilde g_0 + \tilde g_1 + \tilde g_2)\mathbf{1} + \tilde g_{-1} \qquad (19)
$$

where 1 is a vector consisting of all ones, with appropriate modifications for levels 0 and N/2 − 1. The boundary cases for g̃_1 and g̃_{−1} at levels 0 and N/2 − 1 are treated separately using the same approach with the appropriate modifications. Because of the space limitations of this technical note, the proofs are omitted, but they are straightforward using the sample path approach of [4]. The computational overhead of each step is O(M^3), independent of N.

We repeat the above reduction process by considering a new Markov process which is Y_t observed on its odd levels, and determine its transition probabilities and average one-stage rewards using the results above. At the k-th step, k = 0, ..., log_2 N, the process has N/2^k levels, which correspond to levels j2^k − 1, j = 1, ..., N/2^k, of the original process X_t. At the final step we have a process with only one level, corresponding to level N − 1 of X_t, and we can then evaluate J_{N−1} with O(M^3) cost. We then recurse backwards, determining the corresponding "even" level reward-to-go blocks at each step via (6), until finally we return to step 0 and have the complete reward-to-go vector. As pointed out in [5], it is important that this process is performed with numerical accuracy in mind, since errors made in determining J_{N−1} will propagate backwards as the remaining reward-to-go blocks are evaluated.

The overall computational requirement is O(M^3 log_2 N) for the forward reduction process and O(M^2 N) for the backwards process, as in [5]. Ye and Li [5] also provide considerable discussion regarding computational and memory requirements for their folding algorithm, much of which is directly relevant (with appropriate modifications) to the policy evaluation method presented above. Space limitations preclude the inclusion of these details here; readers are referred to [5].
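The following sketch transcribes the interior-level computations of a single forward reduction step, following (7)-(9) and (15)-(19) above. It is for illustration only: the boundary blocks (E_1, F_1, F_2, β_0, β_{N/2−1}) and the boundary reward cases are omitted, the function and variable names are ours, and '*' denotes the Hadamard product in NumPy.

```python
import numpy as np

def fold_once(A0, A1, A2, alpha, g0, g1, g2, g_abs):
    """One forward folding step for the interior (level independent) blocks.

    Returns the reduced transition blocks of (7)-(9) and the folded
    one-stage rewards (15)-(19) for the observed-on-odd-levels process.
    Boundary blocks and boundary reward cases are handled analogously
    and omitted here for brevity.
    """
    M = A1.shape[0]
    R = np.linalg.inv(np.eye(M) - A1)            # (I - A1)^{-1}

    # Reduced interior transition blocks, cf. (7)-(8)
    B0 = A0 @ R @ A0
    B1 = A1 + A0 @ R @ A2 + A2 @ R @ A0
    B2 = A2 @ R @ A2
    beta = (np.eye(M) + (A0 + A2) @ R) @ alpha   # absorption, cf. (9)

    # Folded one-stage rewards, cf. (15)-(18)
    G0, G1, G2, Ga = g0 * A0, g1 * A1, g2 * A2, alpha * g_abs
    g0_f = G0 @ R @ A0 + A0 @ R @ G0 + A0 @ R @ G1 @ R @ A0
    g1_f = (G1 + G0 @ R @ A2 + A0 @ R @ G1 @ R @ A2 + A0 @ R @ G2
            + G2 @ R @ A0 + A2 @ R @ G1 @ R @ A0 + A2 @ R @ G0)
    g2_f = G2 @ R @ A2 + A2 @ R @ G2 + A2 @ R @ G1 @ R @ A2
    gabs_f = (Ga + G0 @ R @ alpha + A0 @ R @ Ga + A0 @ R @ G1 @ R @ alpha
              + G2 @ R @ alpha + A2 @ R @ Ga + A2 @ R @ G1 @ R @ alpha)
    g_level = (g0_f + g1_f + g2_f) @ np.ones(M) + gabs_f   # cf. (19)

    return (B0, B1, B2, beta), (g0_f, g1_f, g2_f, gabs_f, g_level)
```

Repeating such a step log_2 N times, solving the final M × M system for J_{N−1}, and then back-substituting with (6) yields the complete reward-to-go vector.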

IV. NUMERICAL EXAMPLE

In this section, we present an example of an MDP problem to illustrate the application of the method of policy evaluation presented in this technical note. We firstly compare the execution time of evaluating a specified scheduling policy using the linear level reduction (LLR) method of [4]² and the method of this technical note. We then show how the method can be used to rank a number of candidate scheduling policies.

²It should be noted that LLR allows the QBD MDP to be level dependent, so it is more general.

The problem addressed is the scheduling of service in a generalisation of the two class priority queueing system such as considered in [7], to which the reader is referred for more details. In a switch or router, there are two priority classes of traffic, and a buffer (queue) is provided for each class of packets. The high priority class is delay-sensitive but loss-tolerant (e.g., voice traffic), whilst the low priority class is delay-tolerant but loss-sensitive (e.g., data traffic). There is a single server, with adequate capacity to handle the total offered traffic, allocated to serve each queue with a specified probability. This probability is, in general, dependent on the number of packets waiting in the system. Here, we consider the familiar case (e.g., [7]) where the server is allocated to either queue with probability taking a value in a finite set, and that probability is not dependent on the queue occupancy (apart from when either of the queues is empty). The maximum buffer sizes for the data and voice packets are N and M respectively. We assume that there is a requirement to buffer a significantly larger number of data packets than voice packets. So we identify the levels of the QBD process with the number of data packets in the system (since this is the quantity to which the logarithmic reduction applies), and the phases with the number of voice packets in the system.

We model this in the context of a stochastic shortest path problem where the absorbing state corresponds to overflow of the data buffer. When the absorbing state is entered, the switch needs to signal higher levels of the communication system that data buffer overflow has occurred; the switch sends a message to higher levels of the system, and subsequent appropriate action is taken which does not concern us here.

The control variables are the probabilities p_m that we serve the low priority (data) queue given that the voice queue has m ≥ 1 packets present; the high priority (voice) queue is served with probability 1 − p_m unless it is empty, and the data queue is not full. When the data queue is full, we use a different set of server scheduling probabilities q_m, defined as a function of m (the number of voice packets) as for p_m. This is to allow a higher level of service for the data queue when it is full, because the cost of overflow in this case is large. Thus admissible policies are those which yield the QBD transition structure (1) with blocks dependent on the p_m and q_m terms.

A reward is accumulated as each packet is switched (served). A reward of m units is received for serving the voice queue when it contains m packets, so we reward service of the voice queue more highly when it has a larger number of packets present, to mitigate against delay. A reward of 0.5 units is received for serving the data queue unless it is full, when a reward of 5 units is received; because loss of data packets is important, we allocate a significantly larger reward to mitigate against loss of data packets due to overflow. These rewards are independent of the number of data packets (i.e., level independent) apart from the boundary condition where the data queue is full; the QBD structure allows these rewards to be functions of the number of voice packets (phases) in the system.

We compared the execution times of policy evaluation between the folding algorithm and the linear level reduction method as a function of the number of levels N, with the number of phases fixed at M = 8, averaging over 200 000 independent trials.

Fig. 1. Execution times as a function of the number of levels of policy evaluation.

From Fig. 1, we observe that the time required to evaluate a policy with the folding algorithm is considerably faster than that of the LLR method.

Now we turn to the application of the policy evaluation algorithm to the ranking of a number of candidate resource allocation schemes. In this example, we took the data queue arrival rate to be 0.8 packets per time unit and the voice queue arrival rate to be 0.1 packets per time unit; the service rate is 1 packet per time unit. The data packet buffer size is 32 and the voice buffer size is 5. In our example, p_m and q_m equal constant values p and q respectively. Table I shows the reward-to-go for various values of p and q (each pair (p, q) is one of the policies being ranked). We show the reward-to-go from the initial state of both queues being empty, until the data queue overflows (absorbing state).

TABLE I: AVERAGE REWARD-TO-GO ('000s) FOR QUEUE SERVICE POLICIES WITH VARIOUS VALUES OF p AND q

This problem has a deterministic solution which allocates all service capacity to the data queue when it is non-empty. This result is in contrast to the so-called μc-rule [7], which considers the infinite-horizon problem, and where all capacity would be allocated to the voice queue unless the data queue was full. The effect of having to "reset" the system due to data queue overflow, and the high reward obtained when serving a full data queue, changes the service policy significantly compared to [7]. As mentioned in the introduction, we do not consider the full optimization problem in this technical note because the level independent case is not a standard DP problem; this is the subject of ongoing work.
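To illustrate how such a ranking might be mechanized, the following sketch assumes a routine build_queue_blocks(p, q) that encodes the arrival, service, and scheduling probabilities of this section into the blocks of (1), and a routine policy_evaluation_folding implementing the method of Section III; both names are placeholders rather than code from this work.

```python
import itertools

def rank_policies(p_values, q_values, build_queue_blocks,
                  policy_evaluation_folding, initial_state=0):
    """Rank constant (p, q) scheduling policies by reward-to-go.

    build_queue_blocks(p, q)        -> blocks of (1) for the two-queue model
    policy_evaluation_folding(blks) -> reward-to-go vector J
    Both routines are assumed to be supplied by the user.
    """
    scores = []
    for p, q in itertools.product(p_values, q_values):
        J = policy_evaluation_folding(build_queue_blocks(p, q))
        scores.append(((p, q), J[initial_state]))      # both queues empty
    # Larger reward-to-go is better under the reward structure above.
    return sorted(scores, key=lambda s: s[1], reverse=True)
```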
V. CONCLUSION

We have presented a new algorithm for policy evaluation for stochastic shortest path MDPs with Quasi-Birth Death structure. The algorithm has computational complexity of O(M^3 log_2 N) + O(M^2 N), where the MDP has N levels and M phases, as compared to O(NM^3) for the linear level reduction based method presented in [4]. This new method represents the policy evaluation analogue to the folding method for evaluation of the stationary probabilities of a Markov chain presented by Ye and Li [5]. A simple example involving the capacity scheduling of a two class priority queueing system was presented to illustrate the applicability of the method.

There are several possible extensions of the work presented here. One could also address uniformly discounted and average reward-per-stage QBD MDPs using our approach. Also, one can apply the ideas of [5] to piecewise level independent MDPs with a commensurate increase in computational and memory requirements. Clearly, the issue of optimisation via a suitable DP methodology is a subject of current research by the authors. Further investigation of the priority queue example is also ongoing work.

Blondia. Also. Matrix-Geometric Solutions in Stochastic Models. and C. 60. Commun. One could also address uniformly discounted and average reward-per-stage QBD MDPs using our approach. P. 4. 1981. [4] L. This new method represents the policy evaluation analogue to the folding method for evaluation of the stationary probabilities of a Markov chain presented by Ye and Li [5]. Press. Lambert. NO. no. 1994. Suk and C. Baltimore. MD. Li. “Folding algorithm: A computational method for finite QBD processes with level-dependent transitions. ICST. White. 1991.” IEEE Trans. NH: Athena Scientific. B. 1999. PA. R EFERENCES [1] M. . Sep. A simple example involving the capacity scheduling of a two class priority queueing system is presented to illustrate the applicability of the method. no. Bertsekas. 2/3/4. [7] J. Autom. Ye and S. 2–3. vol. 1086–1091. USA: ASA-SIAM. [6] J. and the paper’s referees for helpful comments. Introduction to Matrix Analytic Methods in Stochastic Modeling. 2005. pp. “Optimal scheduling of two competing queues with blocking. Van Houdt. 42. Control. pp.1168 IEEE TRANSACTIONS ON AUTOMATIC CONTROL. VOL.” Stochastic Models. B. Dynamic Programming and Optimal Control. G. Neuts. 625–639.-B. pp.-Q.. 2007. Clearly.” in ValueTools.” IEEE Trans. 2nd ed. vol. “A new algorithm for policy evaluation for Markov decision processes having quasi-birth death structure. “A policy iteration algorithm for Markov decision processes skip-free in one direction. [5] J. Belgium. no. [3] D. Further investigation of the priority queue example is also ongoing work. vol. APRIL 2015 where the MDP has N levels and M phases as compared to O(N M 3 ) for the linear level reduction based method presented in [4]. the issue of optimisation via a suitable DP methodology is a subject of current research by the authors. Cassandras. 2001. Latouche and V. ACKNOWLEDGMENT The authors would like to thank Professor Peter G Taylor. 21. one can apply the ideas of [5] to piecewise level independent MDPs with a commensurate increase in computational and memory requirements. Nashua. pp. USA: Johns Hopkins Univ. Ramaswami. Philadelphia. 9. 75:1–75:9. 36. 785–797. There are several possible extensions of the work presented here. [2] G.