CHAPTER 23

Markovian Decision Process
Chapter Guide. This chapter applies dynamic programming to the solution of a stochastic decision process with a finite number of states. The transition probabilities between the states are described by a Markov chain. The reward structure of the process is a matrix representing the revenue (or cost) associated with movement from one state to another. Both the transition and revenue matrices depend on the decision alternatives available to the decision maker. The objective is to determine the optimal policy that maximizes the expected revenue over a finite or infinite number of stages. The prerequisites for this chapter are basic knowledge of Markov chains (Chapter 17), probabilistic dynamic programming (Chapter 22), and linear programming (Chapter 2).

This chapter includes 6 solved examples and 14 end-of-section problems.

23.1 SCOPE OF THE MARKOVIAN DECISION PROBLEM

We use the gardener problem (Example 17.1-1) to present the details of the Markovian decision process. The idea of the example can be adapted to represent important applications in the areas of inventory, replacement, cash flow management, and regulation of water reservoir capacity.

The transition matrices, P^1 and P^2, associated with the no-fertilizer and fertilizer cases are repeated here for convenience. States 1, 2, and 3 correspond, respectively, to good, fair, and poor soil conditions.

    P^1 = [.2   .5   .3 ]        P^2 = [.30  .60  .10]
          [ 0   .5   .5 ]              [.10  .60  .30]
          [ 0    0    1 ]              [.05  .40  .55]

To put the decision problem in perspective, the gardener associates a return function (or a reward structure) with the transition from one state to another. The return function expresses the gain or loss during a 1-year period, depending on the states between which the transition is made. Because the gardener has the option of using or not using fertilizer, gain and loss vary depending on the decision made. The matrices R^1 and R^2 summarize the return functions, in hundreds of dollars, associated with the matrices P^1 and P^2, respectively:

    R^1 = ||r_ij^1|| = [7  6   3]        R^2 = ||r_ij^2|| = [6  5  -1]
                       [0  5   1]                           [7  4   0]
                       [0  0  -1]                           [6  3  -2]

The elements r_ij^2 of R^2 take into account the cost of applying fertilizer. For example, if the soil condition was fair last year (state 2) and becomes poor this year (state 3), the gain will be r_23^2 = 0, compared with r_23^1 = 1 when no fertilizer is used. Thus, R^2 gives the net reward after the cost of the fertilizer is factored in.

What kind of a decision problem does the gardener have? First, we must know whether the gardening activity will continue for a limited number of years or indefinitely. These situations are referred to as finite-stage and infinite-stage decision problems. In both cases, the gardener uses the outcome of the chemical tests (state of the system) to determine the best course of action (fertilize or do not fertilize) that maximizes expected revenue.

The gardener may also be interested in evaluating the expected revenue resulting from a prespecified course of action for a given state of the system. For example, fertilizer may be applied whenever the soil condition is poor (state 3). The decision-making process in this case is said to be represented by a stationary policy.

Each stationary policy is associated with different transition and return matrices, which are constructed from the matrices P^1, P^2, R^1, and R^2. For example, for the stationary policy calling for applying fertilizer only when the soil condition is poor (state 3), the resulting transition and return matrices are given as

    P = [.20  .50  .30]        R = [7  6   3]
        [  0  .50  .50]            [0  5   1]
        [.05  .40  .55]            [6  3  -2]

These matrices differ from P^1 and R^1 only in their third rows, which are taken directly from P^2 and R^2, the matrices associated with applying fertilizer.

PROBLEM SET 23.1A
1. In the gardener model, identify the matrices P and R associated with the stationary policy that calls for using fertilizer whenever the soil condition is fair or poor.
*2. Identify all the stationary policies for the gardener model.
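Because a stationary policy simply fixes, for each state, which alternative's row of P and R applies, the policy matrices can be assembled mechanically. The short Python sketch below is ours, not part of the text (the function name stationary_policy is our invention); it may help in checking Problems 1 and 2.

    # Sketch (not from the text): build P and R for any stationary policy of
    # the gardener model. actions[i] = alternative (1 or 2) used in state i+1.
    P = {1: [[.2, .5, .3], [0, .5, .5], [0, 0, 1]],
         2: [[.30, .60, .10], [.10, .60, .30], [.05, .40, .55]]}
    R = {1: [[7, 6, 3], [0, 5, 1], [0, 0, -1]],
         2: [[6, 5, -1], [7, 4, 0], [6, 3, -2]]}

    def stationary_policy(actions):
        Ps = [P[k][i] for i, k in enumerate(actions)]
        Rs = [R[k][i] for i, k in enumerate(actions)]
        return Ps, Rs

    # Fertilize only when the soil is poor (state 3):
    Ps, Rs = stationary_policy((1, 1, 2))
    print(Ps)  # [[0.2, 0.5, 0.3], [0, 0.5, 0.5], [0.05, 0.4, 0.55]]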

23.2 FINITE-STAGE DYNAMIC PROGRAMMING MODEL

Suppose that the gardener plans to "retire" from gardening in N years. We are interested in determining the optimal course of action for each year (to fertilize or not to fertilize) that will return the highest expected revenue at the end of N years.

Let k = 1 and 2 represent the two courses of action (alternatives) available to the gardener. The matrices P^k and R^k representing the transition probabilities and reward function for alternative k were given in Section 23.1 and are summarized here for convenience:

    P^1 = [.2   .5   .3 ]        R^1 = [7  6   3]
          [ 0   .5   .5 ]              [0  5   1]
          [ 0    0    1 ]              [0  0  -1]

    P^2 = [.30  .60  .10]        R^2 = [6  5  -1]
          [.10  .60  .30]              [7  4   0]
          [.05  .40  .55]              [6  3  -2]

The gardener problem is expressed as a finite-stage dynamic programming (DP) model as follows. For the sake of generalization, define

    m = number of states at each stage (year) (= 3 in the gardener problem)
    f_n(i) = optimal expected revenue of stages n, n + 1, ..., N, given that i is the state of the system (soil condition) at the beginning of year n

The backward recursive equation relating f_n and f_{n+1} is

    f_n(i) = max_k { Σ_{j=1}^m p_ij^k [r_ij^k + f_{n+1}(j)] }, n = 1, 2, ..., N

where f_{N+1}(j) = 0 for all j. A justification for the equation is that the cumulative revenue, r_ij^k + f_{n+1}(j), resulting from reaching state j at stage n + 1 from state i at stage n, occurs with probability p_ij^k.

Letting

    v_i^k = Σ_{j=1}^m p_ij^k r_ij^k

the DP recursive equation can be written as

    f_N(i) = max_k { v_i^k }
    f_n(i) = max_k { v_i^k + Σ_{j=1}^m p_ij^k f_{n+1}(j) }, n = 1, 2, ..., N - 1
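The recursion is compact enough to implement directly. The following Python sketch is ours, not the book's (the name solve_finite and the parameter alpha are our inventions); it uses the gardener data, and the optional discount factor alpha anticipates the second generalization described in the Remarks at the end of Example 23.2-1 below.

    # Sketch (ours): backward recursion for the finite-stage gardener model.
    P = {1: [[.2, .5, .3], [0, .5, .5], [0, 0, 1]],
         2: [[.30, .60, .10], [.10, .60, .30], [.05, .40, .55]]}
    R = {1: [[7, 6, 3], [0, 5, 1], [0, 0, -1]],
         2: [[6, 5, -1], [7, 4, 0], [6, 3, -2]]}
    m = 3

    def solve_finite(N, alpha=1.0):
        # v[k][i] = expected one-step revenue of alternative k in state i.
        v = {k: [sum(p * r for p, r in zip(P[k][i], R[k][i])) for i in range(m)]
             for k in P}
        f = [0.0] * m                        # f_{N+1}(j) = 0 for all j
        policy = []
        for n in range(N, 0, -1):            # stages N, N-1, ..., 1
            new_f, best = [], []
            for i in range(m):
                cand = {k: v[k][i] + alpha * sum(P[k][i][j] * f[j] for j in range(m))
                        for k in P}
                k_star = max(cand, key=cand.get)
                best.append(k_star)
                new_f.append(cand[k_star])
            f, policy = new_f, [best] + policy
        return f, policy

    f1, pol = solve_finite(3)
    print(pol)  # [[2, 2, 2], [2, 2, 2], [1, 2, 2]] -- matches Example 23.2-1
    print(f1)   # about (10.74, 7.92, 4.22); the text's 4.23 rounds f_2(3) to 2.13 first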

Example 23.2-1

In this example, we solve the gardener problem using the data summarized in the matrices P^1, P^2, R^1, and R^2, given a horizon of 3 years (N = 3). Recall that k = 1 represents "do not fertilize" and k = 2 represents "fertilize."

To illustrate the computation of the expected one-step revenues v_i^k, consider the case in which no fertilizer is used (k = 1):

    v_1^1 = .2(7) + .5(6) + .3(3) = 5.3
    v_2^1 = 0(0) + .5(5) + .5(1) = 3
    v_3^1 = 0(0) + 0(0) + 1(-1) = -1

Thus, if the soil condition is good, a single transition yields 5.3 for that year; if it is fair, the yield is 3; and if it is poor, the yield is -1. Because the values of v_i^k will be used repeatedly in the computations, they are summarized here for convenience:

    i    v_i^1    v_i^2
    1     5.3      4.7
    2     3        3.1
    3    -1         .4

Stage 3

         v_i^k                     Optimal solution
    i    k = 1    k = 2            f_3(i)    k*
    1     5.3      4.7              5.3       1
    2     3        3.1              3.1       2
    3    -1         .4               .4       2

Stage 2

         v_i^k + p_i1^k f_3(1) + p_i2^k f_3(2) + p_i3^k f_3(3)                                Optimal solution
    i    k = 1                                       k = 2                                     f_2(i)    k*
    1    5.3 + .2(5.3) + .5(3.1) + .3(.4) = 8.03     4.7 + .3(5.3) + .6(3.1) + .1(.4) = 8.19    8.19      2
    2    3 + 0(5.3) + .5(3.1) + .5(.4) = 4.75        3.1 + .1(5.3) + .6(3.1) + .3(.4) = 5.61    5.61      2
    3    -1 + 0(5.3) + 0(3.1) + 1(.4) = -.6          .4 + .05(5.3) + .4(3.1) + .55(.4) = 2.13   2.13      2

Stage 1

         v_i^k + p_i1^k f_2(1) + p_i2^k f_2(2) + p_i3^k f_2(3)                                      Optimal solution
    i    k = 1                                          k = 2                                        f_1(i)    k*
    1    5.3 + .2(8.19) + .5(5.61) + .3(2.13) = 10.38   4.7 + .3(8.19) + .6(5.61) + .1(2.13) = 10.74   10.74    2
    2    3 + 0(8.19) + .5(5.61) + .5(2.13) = 6.87       3.1 + .1(8.19) + .6(5.61) + .3(2.13) = 7.92     7.92    2
    3    -1 + 0(8.19) + 0(5.61) + 1(2.13) = 1.13        .4 + .05(8.19) + .4(5.61) + .55(2.13) = 4.23    4.23    2

The optimal solution shows that for years 1 and 2, the gardener should apply fertilizer (k* = 2) regardless of the state of the system (soil condition, as revealed by the chemical tests). In year 3, fertilizer should be applied only if the system is in state 2 or 3 (fair or poor soil condition). The total expected revenues for the three years are f_1(1) = 10.74 if the state of the system in year 1 is good, f_1(2) = 7.92 if it is fair, and f_1(3) = 4.23 if it is poor.

Remarks. The finite-horizon problem can be generalized in two ways. First, the transition probabilities and their return functions need not be the same for all years. Second, a discounting factor can be applied to the expected revenue of the successive stages, so that f_1(i) will equal the present value of the expected revenues of all the stages.

The first generalization requires the return values r_ij^k and transition probabilities p_ij^k to be functions of the stage n, as the following DP recursive equation shows:

    f_N(i) = max_k { v_i^{k,N} }
    f_n(i) = max_k { v_i^{k,n} + Σ_{j=1}^m p_ij^{k,n} f_{n+1}(j) }, n = 1, 2, ..., N - 1

where

    v_i^{k,n} = Σ_{j=1}^m p_ij^{k,n} r_ij^{k,n}

In the second generalization, given that a (< 1) is the discount factor per year, such that D dollars a year from now have a present value of aD dollars, the new recursive equation becomes

    f_N(i) = max_k { v_i^k }
    f_n(i) = max_k { v_i^k + a Σ_{j=1}^m p_ij^k f_{n+1}(j) }, n = 1, 2, ..., N - 1

PROBLEM SET 23.2A

*1. A company reviews the state of one of its important products annually and decides whether it is successful (state 1) or unsuccessful (state 2). The company must decide whether or not to advertise the product to further promote sales. The following matrices, P^1 and P^2, provide the transition probabilities with and without advertising during any year. The associated returns are given by the matrices R^1 and R^2. Find the optimal decisions over the next 3 years.

    P^1 = [.9  .1]        P^2 = [.7  .3]
          [.6  .4]              [.2  .8]

    R^1 = [2  -1]         R^2 = [4   1]
          [1  -3]               [2  -1]

2. A company can advertise through radio, TV, or newspaper. The weekly costs of advertising on the three media are estimated at $200, $900, and $300, respectively. The company can classify its sales volume during each week as (1) fair, (2) good, or (3) excellent. A summary of the transition probabilities associated with each advertising medium follows.

         Radio                   TV                      Newspaper
         1    2    3             1    2    3             1    2    3
    1   .4   .5   .1        1   .7   .2   .1        1   .2   .5   .3
    2   .1   .7   .2        2   .3   .6   .1        2    0   .7   .3
    3   .1   .2   .7        3   .1   .7   .2        3   .1   .2   .7

The corresponding weekly returns (in dollars) are

         Radio                     TV                        Newspaper
    [400  520  600]       [1000  1300  1600]       [400  530  710]
    [300  400  700]       [ 800  1000  1700]       [350  450  800]
    [200  250  500]       [ 600   700  1100]       [250  400  650]

Find the optimal advertising policy over the next 3 weeks.

*3. Inventory Problem. An appliance store can place orders for refrigerators at the beginning of each month for immediate delivery. A fixed cost of $100 is incurred every time an order is placed. The storage cost per refrigerator per month is $5. The penalty for running out of stock is estimated at $150 per refrigerator per month. The monthly demand is given by the following pdf:

    Demand x     0     1     2
    p(x)        .2    .5    .3

The store's policy is that the maximum stock level should not exceed two refrigerators in any single month. Determine the following:
(a) The transition probabilities for the different decision alternatives of the problem.
(b) The expected inventory cost per month as a function of the state of the system and the decision alternative.
(c) The optimal ordering policy over the next 3 months.

4. Repeat Problem 3, assuming that the pdf of demand over the next quarter changes according to the following table:

                Demand x
    Month     0     1     2
      1      .1    .4    .5
      2      .3    .5    .2
      3      .2    .4    .4

23.3 INFINITE-STAGE MODEL

There are two methods for solving the infinite-stage problem. The first method calls for evaluating all possible stationary policies of the decision problem. This is equivalent to an exhaustive enumeration process and can be used only if the number of stationary policies is reasonably small. The second method, called policy iteration, is generally more efficient because it determines the optimum policy iteratively.

23.3.1 Exhaustive Enumeration Method

Suppose that the decision problem has S stationary policies, and assume that P^s and R^s are the (one-step) transition and revenue matrices associated with policy s, s = 1, 2, ..., S. The steps of the enumeration method are as follows.

Step 1. Compute v_i^s, the expected one-step (one-period) revenue of policy s given state i, i = 1, 2, ..., m.

Step 2. Compute π_i^s, the long-run stationary probabilities of the transition matrix P^s associated with policy s. These probabilities, when they exist, are computed from the equations

    π^s P^s = π^s
    π_1^s + π_2^s + ... + π_m^s = 1

where π^s = (π_1^s, π_2^s, ..., π_m^s).

Step 3. Determine E^s, the expected revenue of policy s per transition step (period), by using the formula

    E^s = Σ_{i=1}^m π_i^s v_i^s

Step 4. The optimal policy s* is determined such that E^{s*} = max_s { E^s }.

We illustrate the method by solving the gardener problem for an infinite-period planning horizon.

Example 23.3-1

The gardener problem has a total of eight stationary policies, as the following table shows:

    Stationary policy, s    Action
    1                       Do not fertilize at all.
    2                       Fertilize regardless of the state.
    3                       Fertilize if in state 1.
    4                       Fertilize if in state 2.
    5                       Fertilize if in state 3.
    6                       Fertilize if in state 1 or 2.
    7                       Fertilize if in state 1 or 3.
    8                       Fertilize if in state 2 or 3.

The matrices P^s and R^s for policies 3 through 8 are derived from those of policies 1 and 2 and are given as

    P^3 = [.3   .6   .1 ]        R^3 = [6  5  -1]
          [ 0   .5   .5 ]              [0  5   1]
          [ 0    0    1 ]              [0  0  -1]

    P^4 = [.2   .5   .3 ]        R^4 = [7  6   3]
          [.1   .6   .3 ]              [7  4   0]
          [ 0    0    1 ]              [0  0  -1]

    P^5 = [.2   .5   .3 ]        R^5 = [7  6   3]
          [ 0   .5   .5 ]              [0  5   1]
          [.05  .4   .55]              [6  3  -2]

    P^6 = [.3   .6   .1 ]        R^6 = [6  5  -1]
          [.1   .6   .3 ]              [7  4   0]
          [ 0    0    1 ]              [0  0  -1]

    P^7 = [.3   .6   .1 ]        R^7 = [6  5  -1]
          [ 0   .5   .5 ]              [0  5   1]
          [.05  .4   .55]              [6  3  -2]

    P^8 = [.2   .5   .3 ]        R^8 = [7  6   3]
          [.1   .6   .3 ]              [7  4   0]
          [.05  .4   .55]              [6  3  -2]

The values of v_i^s can thus be computed, as given in the following table.

         v_i^s
    s    i = 1    i = 2    i = 3
    1     5.3      3.0     -1.0
    2     4.7      3.1      0.4
    3     4.7      3.0     -1.0
    4     5.3      3.1     -1.0
    5     5.3      3.0      0.4
    6     4.7      3.1     -1.0
    7     4.7      3.0      0.4
    8     5.3      3.1      0.4

The computations of the stationary probabilities are achieved by using the equations

    π^s P^s = π^s
    π_1 + π_2 + ... + π_m = 1

As an illustration, consider s = 2. The associated equations are

    .3π_1 + .1π_2 + .05π_3 = π_1
    .6π_1 + .6π_2 + .40π_3 = π_2
    .1π_1 + .3π_2 + .55π_3 = π_3
    π_1 + π_2 + π_3 = 1

(Notice that one of the first three equations is redundant.) The solution yields

    π_1^2 = 6/59, π_2^2 = 31/59, π_3^2 = 22/59

In this case, the expected yearly revenue is

    E^2 = π_1^2 v_1^2 + π_2^2 v_2^2 + π_3^2 v_3^2
        = (6/59)(4.7) + (31/59)(3.1) + (22/59)(.4) = 2.256

The following table summarizes π^s and E^s for all the stationary policies. (Although this will not affect the computations in any way, note that each of policies 1, 3, 4, and 6 has an absorbing state: state 3. This is the reason π_1 = π_2 = 0 and π_3 = 1 for all these policies.)

    s      π_1^s      π_2^s      π_3^s      E^s
    1        0          0          1        -1
    2       6/59      31/59      22/59     2.256
    3        0          0          1        -1
    4        0          0          1        -1
    5       5/154     69/154     80/154    1.724
    6        0          0          1        -1
    7       5/137     62/137     70/137    1.734
    8      12/135     69/135     54/135    2.216
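The entire enumeration can be automated, since each policy only requires solving a small linear system. The following Python sketch is ours (it assumes numpy is available); it rebuilds each policy's P^s and v^s, solves the steady-state equations, and recovers the E^s column of the table above.

    import numpy as np
    from itertools import product

    # Sketch (ours): exhaustive enumeration of the 2^3 = 8 stationary policies.
    P = {1: np.array([[.2, .5, .3], [0, .5, .5], [0, 0, 1.]]),
         2: np.array([[.30, .60, .10], [.10, .60, .30], [.05, .40, .55]])}
    R = {1: np.array([[7, 6, 3], [0, 5, 1], [0, 0, -1.]]),
         2: np.array([[6, 5, -1], [7, 4, 0], [6, 3, -2.]])}

    best = None
    for actions in product([1, 2], repeat=3):
        Ps = np.array([P[k][i] for i, k in enumerate(actions)])
        vs = np.array([(P[k][i] * R[k][i]).sum() for i, k in enumerate(actions)])
        # Stationary probabilities: pi P = pi with sum(pi) = 1. One balance
        # equation is redundant, so replace the last one by the normalization.
        A = np.vstack([(Ps.T - np.eye(3))[:-1], np.ones(3)])
        pi = np.linalg.solve(A, [0, 0, 1.])
        E = pi @ vs
        if best is None or E > best[1]:
            best = (actions, round(E, 3))
    print(best)  # ((2, 2, 2), 2.256): fertilize in every state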

Policy 2 yields the largest expected yearly revenue. The optimum long-range policy calls for applying fertilizer regardless of the state of the system.

PROBLEM SET 23.3A

1. Solve Problem 1, Set 23.2a, by the exhaustive enumeration method, assuming an infinite horizon.
2. Solve Problem 2, Set 23.2a, for an infinite number of periods using the exhaustive enumeration method.
*3. Solve Problem 3, Set 23.2a, for an infinite planning horizon using the exhaustive enumeration method.

23.3.2 Policy Iteration Method without Discounting

To appreciate the difficulty associated with the exhaustive enumeration method, let us assume that the gardener had four courses of action (alternatives) instead of two: (1) do not fertilize, (2) fertilize once during the season, (3) fertilize twice, and (4) fertilize three times. In this case, the gardener would have a total of 4^3 = 64 stationary policies. By increasing the number of alternatives from 2 to 4, the number of stationary policies "soars" exponentially from 8 to 64. Not only is it difficult to enumerate all the policies explicitly, but the amount of computations may also be prohibitively large. This is the reason we are interested in developing the policy iteration method.

In Section 23.2, we have shown that, for any specific policy, the expected total return at stage n is expressed by the recursive equation

    f_n(i) = v_i + Σ_{j=1}^m p_ij f_{n+1}(j), i = 1, 2, ..., m

This recursive equation is the basis for the development of the policy iteration method. However, the present form must be modified slightly to allow us to study the asymptotic behavior of the process. We define h as the number of stages remaining for consideration. This is in contrast with n in the equation, which defines stage n. The recursive equation is thus written as

    f_h(i) = v_i + Σ_{j=1}^m p_ij f_{h-1}(j), i = 1, 2, ..., m

Note that f_h is the cumulative expected revenue given that h is the number of stages remaining for consideration. With the new definition, the asymptotic behavior of the process can be studied by letting h → ∞.

Given that P = (π_1, π_2, ..., π_m) is the steady-state probability vector of the transition matrix P = ||p_ij|| and

    E = π_1 v_1 + π_2 v_2 + ... + π_m v_m

is the expected revenue per stage, as computed in Section 23.3.1, we can see intuitively why f_h(i) should equal hE plus a correction factor f(i) that accounts for starting in the specific state i. Indeed, it can be shown that for very large h,

    f_h(i) = hE + f(i)

where f(i) is a constant term representing the asymptotic intercept of f_h(i) given state i.

Because f_h(i) is the cumulative optimum return for h remaining stages given state i, and E is the expected revenue per stage, we can write the recursive equation as

    hE + f(i) = v_i + Σ_{j=1}^m p_ij [(h - 1)E + f(j)], i = 1, 2, ..., m

Simplifying this equation, we get

    E + f(i) - Σ_{j=1}^m p_ij f(j) = v_i, i = 1, 2, ..., m

Here, we have m equations in m + 1 unknowns, f(1), f(2), ..., f(m), and E. As in Section 23.3.1, our objective is to determine the optimum policy that yields the maximum value of E. Because there are m equations in m + 1 unknowns, the optimum value of E cannot be determined in one step. Instead, a two-step iterative approach is utilized which, starting with an arbitrary policy, will determine a new policy that yields a better value of E. The iterative process ends when two successive policies are identical.

1. Value determination step. Choose an arbitrary policy s. Using its associated matrices P^s and R^s, and arbitrarily assuming f^s(m) = 0, solve the equations

       E^s + f^s(i) - Σ_{j=1}^m p_ij^s f^s(j) = v_i^s, i = 1, 2, ..., m

   in the unknowns E^s, f^s(1), ..., f^s(m - 1). Go to the policy improvement step.

2. Policy improvement step. For each state i, determine the alternative k that yields

       max_k { v_i^k + Σ_{j=1}^m p_ij^k f^s(j) }, i = 1, 2, ..., m

   The values of f^s(j), j = 1, 2, ..., m, are those determined in the value determination step. (Strictly, the quantity being maximized is v_i^k + Σ_j p_ij^k [(h - 1)E + f^s(j)]; the term (h - 1)E can be dropped because it is a constant independent of k.) The resulting optimum decisions for states 1, 2, ..., m constitute the new policy t. If s and t are identical, t is optimal. Otherwise, set s = t and return to the value determination step.
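The two steps map directly onto a linear solve and an argmax. Here is a compact Python sketch of the loop for the gardener problem (ours; it assumes numpy and fixes f(m) = 0 as the text does):

    import numpy as np

    # Sketch (ours): policy iteration without discounting.
    P = {1: np.array([[.2, .5, .3], [0, .5, .5], [0, 0, 1.]]),
         2: np.array([[.30, .60, .10], [.10, .60, .30], [.05, .40, .55]])}
    R = {1: np.array([[7, 6, 3], [0, 5, 1], [0, 0, -1.]]),
         2: np.array([[6, 5, -1], [7, 4, 0], [6, 3, -2.]])}
    v = {k: (P[k] * R[k]).sum(axis=1) for k in P}   # v[k][i] = one-step revenue
    m = 3

    policy = (1, 1, 1)                              # arbitrary starting policy
    while True:
        Ps = np.array([P[k][i] for i, k in enumerate(policy)])
        vs = np.array([v[k][i] for i, k in enumerate(policy)])
        # Value determination: E + f(i) - sum_j p_ij f(j) = v_i, with the
        # unknowns ordered (E, f(1), ..., f(m-1)) and f(m) = 0.
        A = np.hstack([np.ones((m, 1)), (np.eye(m) - Ps)[:, :-1]])
        sol = np.linalg.solve(A, vs)
        E, f = sol[0], np.append(sol[1:], 0.0)
        # Policy improvement: best alternative in each state, given f.
        new = tuple(max(P, key=lambda k: v[k][i] + P[k][i] @ f) for i in range(m))
        if new == policy:
            break
        policy = new
    print(policy, round(E, 2))                      # (2, 2, 2) 2.26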

Example 23.3-2

We solve the gardener problem by the policy iteration method.

Let us start with the arbitrary policy that calls for not applying fertilizer. The associated matrices are

    P = [.2  .5  .3]        R = [7  6   3]
        [ 0  .5  .5]            [0  5   1]
        [ 0   0   1]            [0  0  -1]

The equations of the value determination step are

    E + f(1) - .2f(1) - .5f(2) - .3f(3) = 5.3
    E + f(2) - .5f(2) - .5f(3) = 3
    E + f(3) - f(3) = -1

If we arbitrarily let f(3) = 0, the equations yield the solution E = -1, f(1) = 12.88, f(2) = 8, f(3) = 0. Next, we apply the policy improvement step. The associated calculations are shown in the following tableau.

         v_i^k + p_i1^k f(1) + p_i2^k f(2) + p_i3^k f(3)                                        Optimal solution
    i    k = 1                                        k = 2                                      f(i)     k*
    1    5.3 + .2(12.88) + .5(8) + .3(0) = 11.88      4.7 + .3(12.88) + .6(8) + .1(0) = 13.36    13.36     2
    2    3 + 0(12.88) + .5(8) + .5(0) = 7             3.1 + .1(12.88) + .6(8) + .3(0) = 9.19      9.19     2
    3    -1 + 0(12.88) + 0(8) + 1(0) = -1             .4 + .05(12.88) + .4(8) + .55(0) = 4.24     4.24     2

The new policy calls for applying fertilizer regardless of the state. Because the new policy differs from the preceding one, the value determination step is entered again. The matrices associated with the new policy are

    P = [.3   .6   .1 ]        R = [6  5  -1]
        [.1   .6   .3 ]            [7  4   0]
        [.05  .4   .55]            [6  3  -2]

These matrices yield the following equations:

    E + f(1) - .3f(1) - .6f(2) - .1f(3) = 4.7
    E + f(2) - .1f(1) - .6f(2) - .3f(3) = 3.1
    E + f(3) - .05f(1) - .4f(2) - .55f(3) = .4

Again, arbitrarily letting f(3) = 0, we get the solution E = 2.26, f(1) = 6.75, f(2) = 3.80, f(3) = 0. The computations of the policy improvement step are given in the following tableau.

         v_i^k + p_i1^k f(1) + p_i2^k f(2) + p_i3^k f(3)                                          Optimal solution
    i    k = 1                                        k = 2                                        f(i)     k*
    1    5.3 + .2(6.75) + .5(3.80) + .3(0) = 8.55     4.7 + .3(6.75) + .6(3.80) + .1(0) = 9.01     9.01      2
    2    3 + 0(6.75) + .5(3.80) + .5(0) = 4.90        3.1 + .1(6.75) + .6(3.80) + .3(0) = 6.06     6.06      2
    3    -1 + 0(6.75) + 0(3.80) + 1(0) = -1           .4 + .05(6.75) + .4(3.80) + .55(0) = 2.26    2.26      2

The new policy, which calls for applying fertilizer regardless of the state, is identical with the preceding one. Thus the last policy is optimal and the iterative process ends. This is the same conclusion obtained by the exhaustive enumeration method (Section 23.3.1). Note, however, that the policy iteration method converges quickly to the optimum policy, a typical characteristic of the new method.

PROBLEM SET 23.3B

1. Assume in Problem 1, Set 23.2a, that the planning horizon is infinite. Solve the problem by the policy iteration method, and compare the results with those of Problem 1, Set 23.3a.
2. Solve Problem 2, Set 23.2a, by the policy iteration method, assuming an infinite planning horizon, and compare the results with those of Problem 2, Set 23.3a.
3. Solve Problem 3, Set 23.2a, by the policy iteration method, assuming an infinite planning horizon, and compare the results with those of Problem 3, Set 23.3a.

23.3.3 Policy Iteration Method with Discounting

The policy iteration algorithm can be extended to include discounting. Given the discount factor a (< 1), the finite-stage recursive equation can be written as (see Section 23.2)

    f_h(i) = max_k { v_i^k + a Σ_{j=1}^m p_ij^k f_{h-1}(j) }

(Note that h represents the number of stages to go.) It can be proved that as h → ∞ (the infinite-stage model), f_h(i) = f(i), where f(i) is the expected present-worth (discounted) revenue given that the system is in state i and operating over an infinite horizon. Thus the long-run behavior of f_h(i) as h → ∞ is independent of the value of h. This is in contrast with the case of no discounting, where f_h(i) = hE + f(i). This result should be expected because, with discounting, the effect of future revenues asymptotically diminishes to zero. Indeed, the present worth f(i) should approach a constant value as h → ∞.

Based on this information, the steps of the policy iteration method are modified as follows.

1. Value determination step. For an arbitrary policy s with its matrices P^s and R^s, solve the m equations

       f^s(i) - a Σ_{j=1}^m p_ij^s f^s(j) = v_i^s, i = 1, 2, ..., m

   in the m unknowns f^s(1), f^s(2), ..., f^s(m).

2. Policy improvement step. For each state i, determine the alternative k that yields

       max_k { v_i^k + a Σ_{j=1}^m p_ij^k f^s(j) }, i = 1, 2, ..., m

   where f^s(j) is obtained from the value determination step. If the resulting policy t is the same as s, stop; t is optimal. Otherwise, set s = t and return to the value determination step.

Example 23.3-3

We will solve Example 23.3-2 using the discounting factor a = .6.

Starting with the arbitrary policy s = {1, 1, 1}, the associated matrices P and R (P^1 and R^1 in Example 23.3-1) yield the equations

    f(1) - .6[.2f(1) + .5f(2) + .3f(3)] = 5.3
    f(2) - .6[.5f(2) + .5f(3)] = 3
    f(3) - .6[f(3)] = -1

The solution of these equations yields f(1) = 6.61, f(2) = 3.21, f(3) = -2.5. A summary of the policy improvement iteration is given in the following tableau:

         v_i^k + .6[p_i1^k f(1) + p_i2^k f(2) + p_i3^k f(3)]                                                Optimal solution
    i    k = 1                                              k = 2                                            f(i)     k*
    1    5.3 + .6[.2(6.61) + .5(3.21) + .3(-2.5)] = 6.61    4.7 + .6[.3(6.61) + .6(3.21) + .1(-2.5)] = 6.90   6.90     2
    2    3 + .6[0(6.61) + .5(3.21) + .5(-2.5)] = 3.21       3.1 + .6[.1(6.61) + .6(3.21) + .3(-2.5)] = 4.20   4.20     2
    3    -1 + .6[0(6.61) + 0(3.21) + 1(-2.5)] = -2.5        .4 + .6[.05(6.61) + .4(3.21) + .55(-2.5)] = .54    .54     2

The value determination step using P^2 and R^2 (Example 23.3-1) yields the following equations:

    f(1) - .6[.3f(1) + .6f(2) + .1f(3)] = 4.7
    f(2) - .6[.1f(1) + .6f(2) + .3f(3)] = 3.1
    f(3) - .6[.05f(1) + .4f(2) + .55f(3)] = .4

The solution of these equations yields f(1) = 8.89, f(2) = 6.62, f(3) = 3.37. The policy improvement step yields the following tableau:

         v_i^k + .6[p_i1^k f(1) + p_i2^k f(2) + p_i3^k f(3)]                                                  Optimal solution
    i    k = 1                                               k = 2                                             f(i)     k*
    1    5.3 + .6[.2(8.89) + .5(6.62) + .3(3.37)] = 8.96     4.7 + .6[.3(8.89) + .6(6.62) + .1(3.37)] = 8.89    8.96     1
    2    3 + .6[0(8.89) + .5(6.62) + .5(3.37)] = 6.00        3.1 + .6[.1(8.89) + .6(6.62) + .3(3.37)] = 6.62    6.62     2
    3    -1 + .6[0(8.89) + 0(6.62) + 1(3.37)] = 1.02         .4 + .6[.05(8.89) + .4(6.62) + .55(3.37)] = 3.37   3.37     2

Because the new policy {1, 2, 2} differs from the preceding one, the value determination step is entered again, using P^8 and R^8 (Example 23.3-1). This results in the following equations:

    f(1) - .6[.2f(1) + .5f(2) + .3f(3)] = 5.3
    f(2) - .6[.1f(1) + .6f(2) + .3f(3)] = 3.1
    f(3) - .6[.05f(1) + .4f(2) + .55f(3)] = .4

The solution of these equations yields f(1) = 8.97, f(2) = 6.63, f(3) = 3.38. The policy improvement step yields the following tableau:

         v_i^k + .6[p_i1^k f(1) + p_i2^k f(2) + p_i3^k f(3)]                                                  Optimal solution
    i    k = 1                                               k = 2                                             f(i)     k*
    1    5.3 + .6[.2(8.97) + .5(6.63) + .3(3.38)] = 8.97     4.7 + .6[.3(8.97) + .6(6.63) + .1(3.38)] = 8.90    8.97     1
    2    3 + .6[0(8.97) + .5(6.63) + .5(3.38)] = 6.00        3.1 + .6[.1(8.97) + .6(6.63) + .3(3.38)] = 6.63    6.63     2
    3    -1 + .6[0(8.97) + 0(6.63) + 1(3.38)] = 1.03         .4 + .6[.05(8.97) + .4(6.63) + .55(3.38)] = 3.38   3.38     2

Because the new policy {1, 2, 2} is identical with the preceding one, it is optimal. Note that discounting has resulted in a different optimal policy, one that calls for not applying fertilizer when the state of the system is good (state 1).
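Under discounting, the value determination step is simply the linear solve (I - aP)f = v, which makes the loop even shorter than in the undiscounted case. A Python sketch (ours; assumes numpy) that reproduces this example:

    import numpy as np

    # Sketch (ours): policy iteration with discounting for the gardener problem.
    P = {1: np.array([[.2, .5, .3], [0, .5, .5], [0, 0, 1.]]),
         2: np.array([[.30, .60, .10], [.10, .60, .30], [.05, .40, .55]])}
    R = {1: np.array([[7, 6, 3], [0, 5, 1], [0, 0, -1.]]),
         2: np.array([[6, 5, -1], [7, 4, 0], [6, 3, -2.]])}
    v = {k: (P[k] * R[k]).sum(axis=1) for k in P}
    a = .6

    policy = (1, 1, 1)
    while True:
        Ps = np.array([P[k][i] for i, k in enumerate(policy)])
        vs = np.array([v[k][i] for i, k in enumerate(policy)])
        f = np.linalg.solve(np.eye(3) - a * Ps, vs)          # value determination
        new = tuple(max(P, key=lambda k: v[k][i] + a * (P[k][i] @ f))
                    for i in range(3))                       # policy improvement
        if new == policy:
            break
        policy = new
    print(policy, np.round(f, 2))   # (1, 2, 2) [8.97 6.63 3.38]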

PROBLEM SET 23.3C

1. Repeat the problems listed, assuming the discount factor a = .9.
(a) Problem 1, Set 23.3b.
(b) Problem 2, Set 23.3b.
(c) Problem 3, Set 23.3b.

23.4 LINEAR PROGRAMMING SOLUTION

The infinite-stage Markovian decision problems, both with discounting and without, can be formulated and solved as linear programs. We consider the no-discounting case first.

Section 23.3.1 shows that the infinite-stage Markovian problem with no discounting ultimately reduces to determining the optimal policy, s*, which corresponds to

    max_{s∈S} { Σ_{i=1}^m π_i^s v_i^s | π^s P^s = π^s, π_1^s + π_2^s + ... + π_m^s = 1, π_i^s ≥ 0, i = 1, 2, ..., m }

The set S is the collection of all possible policies of the problem. The constraints ensure that π_i^s, i = 1, 2, ..., m, represent the steady-state probabilities of the Markov chain P^s. The problem was solved in Section 23.3.1 by exhaustive enumeration.

The same problem is the basis for the development of the LP formulation. However, because each policy s is specified by a fixed set of actions (as illustrated by the gardener problem in Example 23.3-1), we need to modify the unknowns of the problem such that the optimal solution automatically determines the optimal action (alternative) k when the system is in state i. The collection of all the optimal actions will then define s*, the optimal policy. Let

    q_i^k = conditional probability of choosing alternative k, given that the system is in state i

The problem may thus be expressed as

    Maximize E = Σ_{i=1}^m π_i ( Σ_{k=1}^K q_i^k v_i^k )

subject to

    π_j = Σ_{i=1}^m π_i p_ij, j = 1, 2, ..., m
    π_1 + π_2 + ... + π_m = 1
    q_i^1 + q_i^2 + ... + q_i^K = 1, i = 1, 2, ..., m
    π_i ≥ 0, q_i^k ≥ 0, for all i and k

Note that p_ij is a function of the policy selected and hence of the specific alternatives k of the policy.

Observe that the formulation is equivalent to the original one in Section 23.3.1 only if q_i^k = 1 for exactly one k for each i, which will reduce the sum Σ_{k=1}^K q_i^k v_i^k to v_i^{k*}, where k* is the optimal alternative chosen. The linear program we develop here does account for this condition automatically.

The problem can be converted into a linear program by making proper substitutions involving q_i^k. Define

    w_ik = π_i q_i^k, for all i and k

By definition, w_ik represents the joint probability of being in state i and making decision k. From probability theory,

    π_i = Σ_{k=1}^K w_ik

Hence,

    q_i^k = w_ik / Σ_{k=1}^K w_ik

We thus see that the restriction Σ_{i=1}^m π_i = 1 can be written as

    Σ_{i=1}^m Σ_{k=1}^K w_ik = 1

Also, the restriction Σ_{k=1}^K q_i^k = 1 is automatically implied by the way we defined q_i^k in terms of w_ik. (Verify!) Thus the problem can be written as

    Maximize E = Σ_{i=1}^m Σ_{k=1}^K v_i^k w_ik

subject to

    Σ_{k=1}^K w_jk - Σ_{i=1}^m Σ_{k=1}^K p_ij^k w_ik = 0, j = 1, 2, ..., m
    Σ_{i=1}^m Σ_{k=1}^K w_ik = 1
    w_ik ≥ 0, i = 1, 2, ..., m; k = 1, 2, ..., K

The resulting model is a linear program in w_ik. Its optimal solution automatically guarantees that q_i^k = 1 for one k for each i. First, note that the linear program has m independent equations (one of the equations associated with π = πP is redundant). Hence, the problem must have m basic variables. Second, it can be shown that w_ik must be strictly positive for at least one k for each i. From these two results, we conclude that

    q_i^k = w_ik / Σ_{k=1}^K w_ik

can assume a binary value (0 or 1) only. (As a matter of fact, the preceding result also shows that π_i = Σ_{k=1}^K w_ik = w_ik*, where k* is the alternative corresponding to w_ik > 0.)
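In this w_ik form the model is a standard LP that any solver will accept. The following Python sketch is ours (it assumes scipy is available); it assembles the LP for the gardener data and can be compared with the hand formulation in Example 23.4-1 below.

    import numpy as np
    from scipy.optimize import linprog

    # Sketch (ours): the w_ik linear program for the gardener problem.
    # linprog minimizes, so the objective coefficients are negated.
    P = {1: np.array([[.2, .5, .3], [0, .5, .5], [0, 0, 1.]]),
         2: np.array([[.30, .60, .10], [.10, .60, .30], [.05, .40, .55]])}
    R = {1: np.array([[7, 6, 3], [0, 5, 1], [0, 0, -1.]]),
         2: np.array([[6, 5, -1], [7, 4, 0], [6, 3, -2.]])}
    m = 3
    idx = [(i, k) for i in range(m) for k in (1, 2)]    # w_11, w_12, w_21, ...
    c = [-(P[k][i] * R[k][i]).sum() for i, k in idx]    # -v_i^k

    A_eq = np.zeros((m + 1, len(idx)))
    for col, (i, k) in enumerate(idx):
        A_eq[i, col] += 1.0           # the sum_k w_jk term (j = i)
        A_eq[:m, col] -= P[k][i]      # the -sum_i sum_k p_ij^k w_ik terms
        A_eq[m, col] = 1.0            # total-probability row
    b_eq = [0.0] * m + [1.0]

    res = linprog(c, A_eq=A_eq, b_eq=b_eq)    # w_ik >= 0 is linprog's default
    print(np.round(res.x, 4))   # about [0 .1017 0 .5254 0 .3729]
    print(round(-res.fun, 3))   # E = 2.256, i.e., policy (2, 2, 2)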

Example 23.4-1

The following is an LP formulation of the gardener problem without discounting:

    Maximize E = 5.3w_11 + 4.7w_12 + 3w_21 + 3.1w_22 - w_31 + .4w_32

subject to

    w_11 + w_12 - (.2w_11 + .3w_12 + .1w_22 + .05w_32) = 0
    w_21 + w_22 - (.5w_11 + .6w_12 + .5w_21 + .6w_22 + .4w_32) = 0
    w_31 + w_32 - (.3w_11 + .1w_12 + .5w_21 + .3w_22 + w_31 + .55w_32) = 0
    w_11 + w_12 + w_21 + w_22 + w_31 + w_32 = 1
    w_ik ≥ 0, for all i and k

The optimal solution is w_11 = w_21 = w_31 = 0 and w_12 = .1017, w_22 = .5254, w_32 = .3729. This result means that q_1^2 = q_2^2 = q_3^2 = 1. Thus, the optimal policy selects alternative k = 2 for i = 1, 2, and 3. The optimal value of E is 4.7(.1017) + 3.1(.5254) + .4(.3729) = 2.256. It is interesting that the positive values of w_ik exactly equal the values of π_i associated with the optimal policy in the exhaustive enumeration procedure of Example 23.3-1. This observation demonstrates the direct relationship between the two methods.

We next consider the Markovian decision problem with discounting. In Section 23.3.3, the problem is expressed by the recursive equation

    f(i) = max_k { v_i^k + a Σ_{j=1}^m p_ij^k f(j) }, i = 1, 2, ..., m

These equations are equivalent to

    f(i) ≥ v_i^k + a Σ_{j=1}^m p_ij^k f(j), for all i and k

provided that f(i) achieves its minimum value for each i. Now consider the objective function

    Minimize Σ_{i=1}^m b_i f(i)

where b_i (> 0 for all i) is an arbitrary constant. It can be shown that the optimization of this function, subject to the inequalities given, will result in the minimum value of f(i) for each i. Thus, the problem can be written as

    Minimize Σ_{i=1}^m b_i f(i)

subject to

    f(i) - a Σ_{j=1}^m p_ij^k f(j) ≥ v_i^k, for all i and k
    f(i) unrestricted in sign for all i

Now the dual of the problem is

    Maximize Σ_{i=1}^m Σ_{k=1}^K v_i^k w_ik

subject to

    Σ_{k=1}^K w_jk - a Σ_{i=1}^m Σ_{k=1}^K p_ij^k w_ik = b_j, j = 1, 2, ..., m
    w_ik ≥ 0, for all i and k

Example 23.4-2

Consider the gardener problem given the discounting factor a = .6. If we let b_1 = b_2 = b_3 = 1, the dual LP problem may be written as

    Maximize 5.3w_11 + 4.7w_12 + 3w_21 + 3.1w_22 - w_31 + .4w_32

subject to

    w_11 + w_12 - .6(.2w_11 + .3w_12 + .1w_22 + .05w_32) = 1
    w_21 + w_22 - .6(.5w_11 + .6w_12 + .5w_21 + .6w_22 + .4w_32) = 1
    w_31 + w_32 - .6(.3w_11 + .1w_12 + .5w_21 + .3w_22 + w_31 + .55w_32) = 1
    w_ik ≥ 0, for all i and k

The optimal solution is w_12 = w_21 = w_31 = 0 and w_11 = 1.45, w_22 = 3.28, w_32 = 2.76. The solution shows that the optimal policy is (1, 2, 2), which agrees with the policy iteration result of Example 23.3-3.
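The same assembly as in the previous sketch, with the discount factor and the right-hand side b, reproduces this dual. A Python sketch (ours; assumes scipy):

    import numpy as np
    from scipy.optimize import linprog

    # Sketch (ours): the dual LP with discounting, b = (1, 1, 1).
    P = {1: np.array([[.2, .5, .3], [0, .5, .5], [0, 0, 1.]]),
         2: np.array([[.30, .60, .10], [.10, .60, .30], [.05, .40, .55]])}
    R = {1: np.array([[7, 6, 3], [0, 5, 1], [0, 0, -1.]]),
         2: np.array([[6, 5, -1], [7, 4, 0], [6, 3, -2.]])}
    a, m = .6, 3
    idx = [(i, k) for i in range(m) for k in (1, 2)]
    c = [-(P[k][i] * R[k][i]).sum() for i, k in idx]

    A_eq = np.zeros((m, len(idx)))
    for col, (i, k) in enumerate(idx):
        A_eq[i, col] += 1.0
        A_eq[:, col] -= a * P[k][i]
    res = linprog(c, A_eq=A_eq, b_eq=[1.0] * m)
    print(np.round(res.x, 2))   # positive w_11, w_22, w_32 -> policy (1, 2, 2)
    print(round(-res.fun, 2))   # 18.98 = f(1) + f(2) + f(3) from Example 23.3-3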

PROBLEM SET 23.4A

1. Formulate the following problems as linear programs.
(a) Problem 1, Set 23.3b.
(b) Problem 2, Set 23.3b.
(c) Problem 3, Set 23.3b.
