
ABSTRACT

This chapter presents a dynamic programming approach to the solution of a stochastic decision process that can be described by a finite number of states. The transition probabilities between the states are described by a Markov chain. The reward structure of the process is described by a matrix whose individual elements represent the revenue (or cost) resulting from moving from one state to another.

Both the transition and revenue matrices depend on the decision alternatives

available to the decision maker. The objective of the problem is to determine the optimal

policy that maximizes the expected revenue of the process over a finite or infinite number

of stages.
INTRODUCTION

Markov decision processes (MDPs), named after Andrey Markov, provide a

mathematical framework for modeling decision making in situations where outcomes are

partly random and partly under the control of a decision maker. MDPs are useful for studying a

wide range of optimization problems solved via dynamic programming and reinforcement

learning. MDPs were known at least as early as the 1950s (cf. Bellman 1957). A core body of

research on Markov decision processes resulted from Ronald A. Howard's book published in

1960, Dynamic Programming and Markov Processes. They are used in a wide range of disciplines, including robotics, automated control, economics, and manufacturing.

More precisely, a Markov decision process is a discrete-time stochastic control process. At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s', and giving the decision maker a corresponding reward R_a(s, s').

The probability that the process moves into its new state s' is influenced by the chosen action. Specifically, it is given by the state transition function P_a(s, s'). Thus, the next state s' depends on the current state s and the decision maker's action a. But given s and a, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess the Markov property.

Markov decision processes are an extension of Markov chains; the difference is the

addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one

action exists for each state and all rewards are zero, a Markov decision process reduces to

a Markov chain.
Definition

A Markov decision process is a 4-tuple (S, A, P, R), where

- S is a finite set of states,
- A is a finite set of actions (alternatively, A_s is the finite set of actions available from state s),
- P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the probability that action a in state s at time t will lead to state s' at time t + 1,
- R_a(s, s') is the immediate reward (or expected immediate reward) received after transition to state s' from state s with transition probability P_a(s, s').

(The theory of Markov decision processes does not actually require S or A to be finite, but the basic algorithms below assume that they are finite.)

[Figure: Example of a simple MDP with three states and two actions.]
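As a concrete illustration of the 4-tuple above, the following minimal sketch encodes a small finite MDP in Python as plain dictionaries. The state and action names, the probabilities, and the rewards are invented purely for illustration, and the reward is stored as the expected immediate reward R(s, a) rather than per transition.

    # A minimal, hypothetical finite MDP encoded as plain Python dictionaries.
    S = ["s0", "s1", "s2"]          # finite set of states
    A = ["a0", "a1"]                # finite set of actions

    # P[(s, a)] maps each successor state s' to Pr(s' | s, a); the values sum to 1.
    P = {
        ("s0", "a0"): {"s0": 0.5, "s1": 0.5},
        ("s0", "a1"): {"s2": 1.0},
        ("s1", "a0"): {"s1": 0.9, "s2": 0.1},
        ("s1", "a1"): {"s0": 0.3, "s2": 0.7},
        ("s2", "a0"): {"s2": 1.0},
        ("s2", "a1"): {"s0": 0.2, "s2": 0.8},
    }

    # R[(s, a)] gives the expected immediate reward for taking action a in state s
    # (this absorbs the per-transition reward R_a(s, s') of the definition above).
    R = {
        ("s0", "a0"): 5.0, ("s0", "a1"): 10.0,
        ("s1", "a0"): -1.0, ("s1", "a1"): 2.0,
        ("s2", "a0"): 0.0, ("s2", "a1"): 1.0,
    }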


Problem

The core problem of MDPs is to find a "policy" for the decision maker: a function π that specifies the action π(s) that the decision maker will choose when in state s. Note that once a Markov decision process is combined with a policy in this way, this fixes the action for each state and the resulting combination behaves like a Markov chain.

The goal is to choose a policy π that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:

    E [ ∑_{t=0}^{∞} γ^t R_{a_t}(s_t, s_{t+1}) ]    (where we choose a_t = π(s_t))

where γ is the discount factor and satisfies 0 ≤ γ ≤ 1 (for example, γ = 1/(1 + r) when the discount rate is r). γ is typically close to 1.

Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of s only, as assumed above.
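As a small illustration of this objective, the sketch below estimates the expected discounted return of a fixed policy by Monte Carlo simulation, using the hypothetical dictionary encoding of P and R introduced earlier; the discount factor, horizon, and number of runs are arbitrary choices for the example.

    import random

    def discounted_return(P, R, policy, start, gamma=0.9, horizon=200):
        """Simulate one trajectory and accumulate sum_t gamma^t * reward_t."""
        s, total, discount = start, 0.0, 1.0
        for _ in range(horizon):                 # truncate the infinite horizon
            a = policy[s]
            successors, probs = zip(*P[(s, a)].items())
            s_next = random.choices(successors, weights=probs)[0]
            total += discount * R[(s, a)]
            discount *= gamma
            s = s_next
        return total

    def expected_discounted_return(P, R, policy, start, runs=5000, **kw):
        """Average many simulated trajectories to estimate the expectation."""
        return sum(discounted_return(P, R, policy, start, **kw) for _ in range(runs)) / runs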

Algorithms

MDPs can be solved by linear programming or dynamic programming. In what follows we present the latter approach.

Suppose we know the state transition function P and the reward function R, and we wish to calculate the policy π that maximizes the expected discounted reward.

The standard family of algorithms to calculate this optimal policy requires storage for two arrays indexed by state: value V, which contains real values, and policy π, which contains actions. At the end of the algorithm, π will contain the solution and V(s) will contain the discounted sum of the rewards to be earned (on average) by following that solution from state s.

The algorithm has the following two kinds of steps, which are repeated in some order for all the states until no further changes take place. They are

    π(s) := argmax_a { ∑_{s'} P_a(s, s') ( R_a(s, s') + γ V(s') ) }

    V(s) := ∑_{s'} P_{π(s)}(s, s') ( R_{π(s)}(s, s') + γ V(s') )

Their order depends on the variant of the algorithm; one can also do them for all states at once or state by state, and more often to some states than others. As long as no state is permanently excluded from either of the steps, the algorithm will eventually arrive at the correct solution.

Notable variants

Value iteration

In value iteration (Bellman 1957), which is also called backward induction, the π array is not used; instead, the value of π(s) is calculated within V(s) whenever it is needed. Shapley's 1953 paper on stochastic games included as a special case the value iteration method for MDPs, but this was recognized only later on.

Substituting the calculation of π(s) into the calculation of V(s) gives the combined step:

    V(s) := max_a { ∑_{s'} P_a(s, s') ( R_a(s, s') + γ V(s') ) }

This update rule is iterated for all states s until it converges with the left-hand side equal to the right-hand side (which is the "Bellman equation" for this problem).
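A minimal value iteration sketch in Python, using the hypothetical dictionary encoding of S, A, P, and R shown earlier; the discount factor and tolerance are arbitrary choices, and R(s, a) is treated as the expected immediate reward, so it plays the role of ∑_{s'} P_a(s, s') R_a(s, s').

    def value_iteration(S, A, P, R, gamma=0.9, tol=1e-8):
        """Iterate the combined Bellman update until the values stop changing."""
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                best = max(
                    sum(p * (R[(s, a)] + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                    for a in A
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        # Recover a greedy policy from the converged value function.
        pi = {
            s: max(A, key=lambda a: sum(p * (R[(s, a)] + gamma * V[s2])
                                        for s2, p in P[(s, a)].items()))
            for s in S
        }
        return V, pi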

Policy iteration

In policy iteration (Howard 1960), step one is performed once, and then step two

is repeated until it converges. Then step one is again performed once and so on.

Instead of repeating step two to convergence, it may be formulated and solved as

a set of linear equations.

This variant has the advantage that there is a definite stopping condition: when the π array does not change in the course of applying step one to all states, the algorithm is completed.
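A compact policy iteration sketch under the same hypothetical encoding; solving the policy evaluation step as a set of linear equations with numpy is a choice of this sketch, not something prescribed by the text.

    import numpy as np

    def policy_iteration(S, A, P, R, gamma=0.9):
        """Alternate policy evaluation (linear solve) and greedy policy improvement."""
        idx = {s: i for i, s in enumerate(S)}
        pi = {s: A[0] for s in S}                  # arbitrary initial policy
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
            P_pi = np.zeros((len(S), len(S)))
            R_pi = np.zeros(len(S))
            for s in S:
                R_pi[idx[s]] = R[(s, pi[s])]
                for s2, p in P[(s, pi[s])].items():
                    P_pi[idx[s], idx[s2]] = p
            V = np.linalg.solve(np.eye(len(S)) - gamma * P_pi, R_pi)
            # Policy improvement: pick the greedy action in every state.
            stable = True
            for s in S:
                best = max(A, key=lambda a: sum(p * (R[(s, a)] + gamma * V[idx[s2]])
                                                for s2, p in P[(s, a)].items()))
                if best != pi[s]:
                    pi[s], stable = best, False
            if stable:                             # definite stopping condition
                return {s: V[idx[s]] for s in S}, pi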

Modified policy iteration

In modified policy iteration (van Nunen, 1976; Puterman and Shin 1978), step one

is performed once, and then step two is repeated several times. Then step one is again

performed once and so on.

Prioritized sweeping

In this variant, the steps are preferentially applied to states which are in some way important, whether based on the algorithm (there were large changes in V or π around those states recently) or based on use (those states are near the starting state, or otherwise of interest to the person or program using the algorithm).

Scope of the Markovian Decision Problem: The Gardener Example

The example parallels a number of important applications in the areas of inventory, replacement, cash flow management, and regulation of water reservoir capacity.

An avid gardener tends a plot of land in her backyard. Every year, at the beginning of the gardening season, she uses chemical tests to classify the garden's productivity for the new season as good, fair, or poor.

Over the years, the gardener has observed that the current year's productivity can be assumed to depend only on last year's soil condition. She is thus able to represent the transition probabilities over a one-year period from one productivity state to another in terms of the following Markov chain:

                          State of the system next year
                               1     2     3

    State of the system   1   .2    .5    .3
    this year             2    0    .5    .5     = P1
                          3    0     0     1

The representation assumes the following correspondence between productivity

and the states of the chain:


Productivity State of the system
(Soil condition)

Good 1
Fair 2
Poor 3

The transition probabilities in P1 indicate that the productivity for a current year can be no better than last year's. For example, if the soil condition for this year is fair (state 2), next year's productivity may remain fair with probability .5 or become poor (state 3), also with probability .5.

The gardener can alter the transition probabilities P1 by taking other courses of

action available to her. Typically, she may decide to fertilize the garden to boost the soil

condition. If she does not, her transition probabilities will remain as given in P1. But if

she does, the following transition matrix P2 will result:

                1     2     3

          1    .3    .6    .1
    P2 =  2    .1    .6    .3
          3   .05    .4   .55

In the new transition matrix P2, it is possible to improve the condition of the soil over last year's.

To put the decision problem in perspective, the gardener associates a return

function (or a reward structure) with the transition from one state to another. The return

function expresses the gain or loss during a 1-year period, depending on the states

between which the transition is made. Since the gardener has the option of using or not

using fertilizer, her gains and losses are expected to vary depending on the decision she makes. The matrices R1 and R2 summarize the return functions, in hundreds of dollars, associated with the matrices P1 and P2, respectively. Thus R1 applies when no fertilizer is used; otherwise, R2 applies.

                             1    2    3

                        1    7    6    3
    R1 = ||r^1_ij|| =   2    0    5    1
                        3    0    0   -1

                             1    2    3

                        1    6    5   -1
    R2 = ||r^2_ij|| =   2    7    4    0
                        3    6    3   -2

Notice that the elements r^2_ij of R2 take into account the cost of applying the fertilizer. For example, if the system was in state 1 and remained in state 1 during the next year, its gain would be r^2_11 = 6, compared with r^1_11 = 7 when no fertilizer is used.

What kind of decision problem does the gardener have? First, we must know whether the gardening activity will continue for a limited number of years or, for all practical purposes, indefinitely. These situations are referred to as finite-stage and infinite-stage decision problems. In both cases, the gardener needs to determine the best course of action (fertilize or do not fertilize) given the outcome of the chemical tests (state of the system). The optimization process will be based on maximization of expected revenue.


The gardener may also be interested in evaluating the expected revenue resulting from following a prespecified course of action whenever a given state of the system occurs. For example, she may decide to fertilize whenever the soil condition is poor (state 3). The decision-making process in this case is said to be represented by a stationary policy.

We must note that each stationary policy is associated with different transition and return matrices, which, in general, can be constructed from the matrices P1, P2, R1, and R2. For example, for the stationary policy calling for applying fertilizer only when the soil condition is poor (state 3), the resulting transition and return matrices, P and R, respectively, are given as

           .2   .5   .3                   7    6    3
    P  =    0   .5   .5     and    R  =   0    5    1
          .05   .4  .55                   6    3   -2

These matrices differ from P1 and R1 in the third rows only, which are taken directly from P2 and R2. The reason is that P2 and R2 are the matrices that result when fertilizer is applied in every state.
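A small numpy sketch of how the matrices P and R for this stationary policy (fertilize only in state 3) can be assembled row by row from P1, P2, R1, and R2; the zero-based indexing of states is an artifact of the code, not of the model.

    import numpy as np

    P1 = np.array([[.2, .5, .3], [0, .5, .5], [0, 0, 1]])
    P2 = np.array([[.3, .6, .1], [.1, .6, .3], [.05, .4, .55]])
    R1 = np.array([[7, 6, 3], [0, 5, 1], [0, 0, -1]])
    R2 = np.array([[6, 5, -1], [7, 4, 0], [6, 3, -2]])

    # Stationary policy: alternative 1 (no fertilizer) in states 1 and 2,
    # alternative 2 (fertilizer) in state 3.  States are indexed 0, 1, 2 here.
    policy = [1, 1, 2]

    P = np.array([(P1 if k == 1 else P2)[i] for i, k in enumerate(policy)])
    R = np.array([(R1 if k == 1 else R2)[i] for i, k in enumerate(policy)])

    print(P)   # third row .05 .4 .55 comes from P2
    print(R)   # third row  6   3  -2  comes from R2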


Extension and Generalizations

A Markov decision process is a stochastic game with only one player.

Partial observability

Main article: partially observable Markov decision process

The solution above assumes that the state s is known when the action is to be taken; otherwise π(s) cannot be calculated. When this assumption is not true, the problem is called a partially observable Markov decision process or POMDP.

A major breakthrough in this area was provided in "Optimal adaptive policies for

Markov decision processes" [2] by Burnetas and Katehakis. In this work a class of

adaptive policies that possess uniformly maximum convergence rate properties for the

total expected finite-horizon reward was constructed under the assumptions of finite

state-action spaces and irreducibility of the transition law. These policies prescribe that

the choice of actions, at each state and time period, should be based on indices that are

inflations of the right-hand side of the estimated average reward optimality equations.

Reinforcement learning

If the probabilities or rewards are unknown, the problem is one of reinforcement

learning (Sutton and Barto, 1998).


For this purpose it is useful to define a further function, Q, which corresponds to taking the action a and then continuing optimally (or according to whatever policy one currently has):

    Q(s, a) = ∑_{s'} P_a(s, s') ( R_a(s, s') + γ V(s') )

While this function is also unknown, experience during learning is based on (s, a) pairs (together with the outcome s'); that is, "I was in state s and I tried doing a, and s' happened". Thus, one has an array Q and uses experience to update it directly. This is known as Q-learning.
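A minimal tabular Q-learning sketch; the learning rate, the epsilon-greedy exploration scheme, and the step(s, a) simulator interface are assumptions made for this illustration, not something specified in the text.

    import random

    def q_learning(states, actions, step, episodes=1000, alpha=0.1,
                   gamma=0.9, epsilon=0.1):
        """Update Q(s, a) directly from experienced (s, a, reward, s') tuples.

        `step(s, a)` is an assumed simulator returning (next_state, reward, done).
        """
        Q = {(s, a): 0.0 for s in states for a in actions}
        for _ in range(episodes):
            s = random.choice(states)              # restart from a random state
            done = False
            while not done:
                # epsilon-greedy action selection
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda a: Q[(s, a)])
                s2, r, done = step(s, a)           # "I was in s, I tried a, and s2 happened"
                target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s2
        return Q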

Reinforcement learning can solve Markov decision processes without explicit

specification of the transition probabilities; the values of the transition probabilities are

needed in value and policy iteration. In reinforcement learning, instead of explicit

specification of the transition probabilities, the transition probabilities are accessed

through a simulator that is typically restarted many times from a uniformly random initial

state. Reinforcement learning can also be combined with function approximation to

address problems with a very large number of states.

Continuous-Time Markov Decision Process

In discrete-time Markov decision processes, decisions are made at discrete time epochs. In a continuous-time Markov decision process, however, decisions can be made at any time the decision maker chooses. Unlike the discrete-time case, the continuous-time Markov decision process can better model the decision-making process when the system of interest has continuous dynamics, i.e., when the system dynamics are defined by partial differential equations (PDEs).

Definition

In order to discuss the continuous-time Markov decision process, we introduce two sets of notation.

If the state space and action space are finite:

- S: state space;
- A: action space;
- q(j | i, a): transition rate function;
- R(i, a): reward function.

If the state space and action space are continuous:

- X: state space;
- U: space of possible controls;
- f(x, u): transition rate function;
- r(x, u): reward rate function, such that r(x(t), u(t)) dt = dR(x(t), u(t)), where R is the reward function discussed in the previous case.


Problem

Like the discrete-time Markov decision process, in the continuous-time Markov decision process we want to find the optimal policy or control that gives us the optimal expected integrated reward:

    max_u  E_u [ ∫_0^T γ^t r( x(t), u(t) ) dt | x(0) = i ]

where 0 ≤ γ < 1 is the discount factor.

Linear programming formulation

If the state space and action space are finite, we can use the linear programming formulation to find the optimal policy; this was one of the earliest solution approaches. Here we consider only the ergodic model, which means that the continuous-time MDP becomes an ergodic continuous-time Markov chain under any stationary policy. Under this assumption, although the decision maker can make a decision at any time while in the current state, there is no added benefit in taking more than one action; it is better to take an action only at the times when the system transitions from the current state to another state. Under some conditions (for details, see Corollary 3.14 of Continuous-Time Markov Decision Processes), if the optimal value function is independent of the state i, we have the following inequality:

    g ≥ R(i, a) + ∑_j q(j | i, a) h(j)    for all i and a

If there exists a function h, then the optimal average reward is the smallest g satisfying the above inequality. In order to find it, we can use the following linear programming models:

- Primal linear program (P-LP):

      minimize    g
      subject to  g - ∑_j q(j | i, a) h(j) ≥ R(i, a)    for all i, a

- Dual linear program (D-LP):

      maximize    ∑_i ∑_a R(i, a) y(i, a)
      subject to  ∑_i ∑_a q(j | i, a) y(i, a) = 0    for all j,
                  ∑_i ∑_a y(i, a) = 1,
                  y(i, a) ≥ 0    for all i, a.

y(i, a) is a feasible solution to the D-LP if y(i, a) is nonnegative and satisfies the constraints of the D-LP problem. A feasible solution y*(i, a) to the D-LP is said to be an optimal solution if

    ∑_i ∑_a R(i, a) y*(i, a) ≥ ∑_i ∑_a R(i, a) y(i, a)

for all feasible solutions y(i, a) to the D-LP. Once we have found the optimal solution y*(i, a), we can use it to establish the optimal policies.
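To make the D-LP concrete, the following sketch solves it with scipy.optimize.linprog for a tiny hypothetical ergodic model; the transition rates q and rewards R below are invented purely for illustration, and linprog minimizes, so the objective is negated.

    import numpy as np
    from scipy.optimize import linprog

    states = [0, 1]
    actions = [0, 1]

    # Hypothetical transition rate matrices q[a][i][j] (each row sums to zero)
    # and rewards R[i][a], chosen only to demonstrate the formulation.
    q = {
        0: np.array([[-1.0, 1.0], [0.5, -0.5]]),
        1: np.array([[-2.0, 2.0], [1.5, -1.5]]),
    }
    R = np.array([[1.0, 3.0],
                  [0.0, 2.0]])

    pairs = [(i, a) for i in states for a in actions]

    # D-LP: maximize sum R(i,a) y(i,a)  s.t.  sum_{i,a} q(j|i,a) y(i,a) = 0 for all j,
    #       sum y(i,a) = 1,  y >= 0.
    c = [-R[i, a] for (i, a) in pairs]
    A_eq = [[q[a][i, j] for (i, a) in pairs] for j in states]
    A_eq.append([1.0] * len(pairs))
    b_eq = [0.0] * len(states) + [1.0]

    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * len(pairs))
    print(dict(zip(pairs, res.x)))   # optimal occupation measure y*(i, a)
    print(-res.fun)                  # optimal long-run average reward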


Hamilton-Jacobi-Bellman equation

In a continuous-time MDP, if the state space and action space are continuous, the optimal criterion can be found by solving the Hamilton-Jacobi-Bellman (HJB) partial differential equation. In order to discuss the HJB equation, we need to reformulate our problem:

    V(x(0), 0) = max_u  ∫_0^T r( x(t), u(t) ) dt + D( x(T) )
    subject to  dx(t)/dt = f( x(t), u(t) )

where D(·) is the terminal reward function, x(t) is the system state vector, u(t) is the system control vector we try to find, and f(·) shows how the state vector changes over time. The Hamilton-Jacobi-Bellman equation is as follows:

    0 = max_u { r(x, u) + ∂V(x, t)/∂t + (∂V(x, t)/∂x) · f(x, u) }

We can solve this equation to find the optimal control u(t), which gives us the optimal value function V.

Application

Queueing systems, epidemic processes, and population processes.


Alternative Notations

The terminology and notation for MDPs are not entirely settled. There are two main streams: one focuses on maximization problems from contexts like economics, using the terms action, reward, and value, and calling the discount factor β or γ, while the other focuses on minimization problems from engineering and navigation, using the terms control, cost, and cost-to-go, and calling the discount factor α. In addition, the notation for the transition probability varies.

    in this article                     alternative                       comment
    action a                            control u
    reward R                            cost g                            g is the negative of R
    value V                             cost-to-go J                      J is the negative of V
    policy π                            policy μ
    discount factor γ                   discount factor α
    transition probability P_a(s, s')   transition probability p_{s s'}(a)

In addition, the transition probability is sometimes written Pr(s, a, s'), Pr(s' | s, a), or, rarely, p^a_{s s'}.

Finite-Stage Dynamic Programming Model

Suppose that the gardener plans to “retire” from exercising her hobby in N

years. She is thus interested in determining her optimal course of action for each year (to
fertilize or not fertilize) over a finite planning horizon. Optimality here is defined such

that the gardener will accumulate the highest expected revenue at the end of the N years.

Let k = 1 and 2 represent the two courses of action (alternatives) available to the

gardener. The matrices Pk and Rk represent the transition probabilities and reward function for alternative k. In summary:

                         .2   .5   .3                            7    6    3
    P1 = ||p^1_ij|| =     0   .5   .5   ,    R1 = ||r^1_ij|| =   0    5    1   ,   and
                          0    0    1                            0    0   -1

                         .3   .6   .1                            6    5   -1
    P2 = ||p^2_ij|| =    .1   .6   .3   ,    R2 = ||r^2_ij|| =   7    4    0
                        .05   .4  .55                            6    3   -2
Recall that the system has three states: good (state 1), fair (state 2), and poor (state

3).

We can express the gardener’s problem as a finite-stage dynamic programming

(DP) model as follows. For the sake of generalization, suppose that the number of states

for each stage (year) is m (=3 in the gardener’s example) and define fn (i) = optimal

expected revenue of stages n, n+1, . . . , N, given that the state of the system (soil

condition) at the beginning of year n is i.

The backward recursive equation relating fn and fn+1 can be written as

    f_n(i) = max_k { ∑_{j=1}^{m} p^k_ij [ r^k_ij + f_{n+1}(j) ] },    n = 1, 2, ..., N

where f_{N+1}(j) = 0 for all j.

[Figure: Network representation of the recursive equation, showing each state i at stage n (with value f_n(i)) linked to each state j at stage n + 1 (with value f_{n+1}(j)) through the transition probability p^k_ij and return r^k_ij.]

A justification for the equation above is that the cumulative revenue, r^k_ij + f_{n+1}(j), resulting from reaching state j at stage n + 1 from state i at stage n, occurs with probability p^k_ij. In fact, if V^k_i represents the expected return resulting from a single transition from state i given alternative k, then V^k_i can be expressed as

    V^k_i = ∑_{j=1}^{m} p^k_ij r^k_ij

The DP recursive equation can thus be written as

    f_N(i) = max_k { V^k_i }

    f_n(i) = max_k { V^k_i + ∑_{j=1}^{m} p^k_ij f_{n+1}(j) },    n = 1, 2, ..., N - 1

Before showing how the recursive equation is used to solve the gardener's problem, we illustrate the computation of V^k_i, which is part of the recursive equation. For example, suppose that no fertilizer is used (k = 1); then

    V^1_1 = .2 × 7 + .5 × 6 + .3 × 3 = 5.3
    V^1_2 = 0 × 0 + .5 × 5 + .5 × 1 = 3
    V^1_3 = 0 × 0 + 0 × 0 + 1 × (-1) = -1

These values show that if the soil condition is good (state 1) at the beginning of the year, a single transition is expected to yield 5.3 for that year. Similarly, if the soil condition is fair (poor), the expected revenue is 3 (-1).
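A numpy sketch of the backward recursion for the gardener's problem (a straightforward transcription of the recursive equation, not code from the original text). The optional alpha parameter anticipates the discounting generalization discussed below; with alpha = 1 the intermediate one-step returns reproduce the values 5.3, 3, and -1 computed above for k = 1.

    import numpy as np

    P = {1: np.array([[.2, .5, .3], [0, .5, .5], [0, 0, 1]]),
         2: np.array([[.3, .6, .1], [.1, .6, .3], [.05, .4, .55]])}
    R = {1: np.array([[7, 6, 3], [0, 5, 1], [0, 0, -1]]),
         2: np.array([[6, 5, -1], [7, 4, 0], [6, 3, -2]])}

    def finite_stage_dp(P, R, N, alpha=1.0):
        """Backward recursion f_n(i) = max_k { V^k_i + alpha * sum_j p^k_ij f_{n+1}(j) }."""
        alts = sorted(P)
        m = P[alts[0]].shape[0]
        V = {k: (P[k] * R[k]).sum(axis=1) for k in alts}   # V[1] -> [5.3, 3.0, -1.0]
        f = np.zeros(m)                                    # f_{N+1}(j) = 0 for all j
        decisions = []
        for n in range(N, 0, -1):
            values = np.array([V[k] + alpha * P[k] @ f for k in alts])
            decisions.append([alts[k] for k in values.argmax(axis=0)])
            f = values.max(axis=0)
        decisions.reverse()        # decisions[n-1][i] = optimal alternative at stage n, state i+1
        return f, decisions

    f1, policy = finite_stage_dp(P, R, N=3)
    print(np.round(f1, 2))   # f_1(i): optimal expected revenue for soil states 1, 2, 3
    print(policy)            # 1 = do not fertilize, 2 = fertilize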

The (finite-horizon) gardener's problem can be generalized in two ways. First, the transition probabilities and their return functions need not be the same for every year. Second, she may apply a discount factor to the expected revenue of the successive stages so that the value of f_1(i) represents the present value of the expected revenues of all the stages.

The first generalization requires simply that the return values r^k_ij and transition probabilities p^k_ij be additionally functions of the stage n. In this case, the DP recursive equation appears as

    f_N(i) = max_k { V^k_{i,N} }

    f_n(i) = max_k { V^k_{i,n} + ∑_{j=1}^{m} p^k_{ij,n} f_{n+1}(j) },    n = 1, 2, ..., N - 1

where

    V^k_{i,n} = ∑_{j=1}^{m} p^k_{ij,n} r^k_{ij,n}


The second generalization is accomplished as follows. Let α (< 1) be the discount factor per year, which is normally computed as α = 1/(1 + t), where t is the annual interest rate. Thus D dollars a year from now are equivalent to αD dollars now. The introduction of the discount factor modifies the original recursive equation as follows:

    f_N(i) = max_k { V^k_i }

    f_n(i) = max_k { V^k_i + α ∑_{j=1}^{m} p^k_ij f_{n+1}(j) },    n = 1, 2, ..., N - 1

The application of this recursive equation is the same as before. In general, the use of a discount factor may result in a different optimum decision in comparison with the case where no discounting is used.

The DP recursive equation can be used to evaluate any stationary policy for the gardener's problem. Assuming that no discounting is used (i.e., α = 1), the recursive equation for evaluating a stationary policy is

    f_n(i) = v_i + ∑_{j=1}^{m} p_ij f_{n+1}(j)

where p_ij is the (i, j)th element of the transition matrix associated with the policy and v_i is the expected one-step transition revenue of the policy.

Infinite-Stage Model

The long-run behavior of the Markovian process is characterized by its

independence of the initial state of the system. In this case the system is said to have
reached steady state. We are thus primarily interested in evaluating policies for which the

associated Markov chains allow the existence of a steady-state solution.

Here, we are interested in determining the optimum long-run policy of a

Markovian decision problem. It is logical to base the evaluation of a policy on

maximizing (minimizing) the expected revenue (cost) per transition period. For example,

in the gardener’s problem, the selection of the best (infinite-stage) policy is based on the

maximum expected revenue per year.

There are two methods for solving the infinite-stage problem. The first method calls for enumerating all possible stationary policies of the decision problem. By evaluating each policy, the optimum solution can be determined. This is basically an exhaustive enumeration process and can be used only if the total number of stationary policies is reasonably small for practical computations.

The second method, called policy iteration, alleviates the computational difficulties that could arise in the exhaustive enumeration procedure. The new method is generally efficient in the sense that it determines the optimum policy in a small number of iterations.

Naturally, both methods must lead to the same optimum solution. We demonstrate these points, as well as the application of the two methods, via the gardener example.

Exhaustive Enumeration Method


Suppose that the decision problem has a total of S stationary policies, and assume that P^s and R^s are the (one-step) transition and revenue matrices associated with the sth policy, s = 1, 2, ..., S. The steps of the enumeration method are as follows.

Step 1: Compute V^s_i, the expected one-step (one-period) revenue of policy s given state i, i = 1, 2, ..., m.

Step 2: Compute π^s_i, the long-run stationary probabilities of the transition matrix P^s associated with policy s. These probabilities, when they exist, are computed from the equations

    π^s P^s = π^s

    π^s_1 + π^s_2 + ... + π^s_m = 1

where π^s = (π^s_1, π^s_2, ..., π^s_m).

Step 3: Determine E^s, the expected revenue of policy s per transition step (period), by using the formula

    E^s = ∑_{i=1}^{m} π^s_i V^s_i

Step 4: The optimum policy s* is determined such that

    E^{s*} = max_s { E^s }
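A numpy sketch of the four enumeration steps applied to the gardener's problem; generating the eight stationary policies with itertools and solving π P = π together with the normalization condition by least squares are choices of this sketch.

    import numpy as np
    from itertools import product

    P = {1: np.array([[.2, .5, .3], [0, .5, .5], [0, 0, 1]]),
         2: np.array([[.3, .6, .1], [.1, .6, .3], [.05, .4, .55]])}
    R = {1: np.array([[7, 6, 3], [0, 5, 1], [0, 0, -1]]),
         2: np.array([[6, 5, -1], [7, 4, 0], [6, 3, -2]])}
    m = 3

    def stationary_probs(Ps):
        """Solve pi Ps = pi with the components of pi summing to 1 (Step 2)."""
        A = np.vstack([Ps.T - np.eye(m), np.ones(m)])
        b = np.zeros(m + 1); b[-1] = 1.0
        return np.linalg.lstsq(A, b, rcond=None)[0]

    best = None
    for policy in product([1, 2], repeat=m):           # all S = 2^3 = 8 stationary policies
        Ps = np.array([P[k][i] for i, k in enumerate(policy)])
        Rs = np.array([R[k][i] for i, k in enumerate(policy)])
        Vs = (Ps * Rs).sum(axis=1)                     # Step 1: one-step revenues V^s_i
        pi = stationary_probs(Ps)                      # Step 2: long-run probabilities
        Es = pi @ Vs                                   # Step 3: expected revenue per period
        if best is None or Es > best[0]:               # Step 4: keep the maximizing policy
            best = (Es, policy)

    print(best)   # (E^{s*}, optimal alternative for states 1, 2, 3)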

Policy Iteration Method with Discounting

The policy iteration algorithm can be extended to include discounting. Specifically, given that α (< 1) is the discount factor, the finite-stage recursive equation can be written as

    f_n(i) = max_k { V^k_i + α ∑_{j=1}^{m} p^k_ij f_{n+1}(j) }

where n represents the number of stages to go.
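For a fixed stationary policy, the infinite-horizon discounted values satisfy f(i) = v_i + α ∑_j p_ij f(j), which can be solved directly as a set of linear equations. The sketch below does this with numpy for the stationary policy "fertilize only when the soil is poor" described earlier; the value α = 0.6 is an arbitrary choice for illustration.

    import numpy as np

    alpha = 0.6                               # assumed discount factor for illustration
    # Transition matrix and one-step revenues for the policy: rows 1-2 from
    # P1/R1 (no fertilizer), row 3 from P2/R2 (fertilizer).
    Pp = np.array([[.2, .5, .3], [0, .5, .5], [.05, .4, .55]])
    Rp = np.array([[7, 6, 3], [0, 5, 1], [6, 3, -2]])
    v = (Pp * Rp).sum(axis=1)                 # expected one-step revenue v_i

    # Solve (I - alpha * Pp) f = v for the discounted values f(i).
    f = np.linalg.solve(np.eye(3) - alpha * Pp, v)
    print(np.round(f, 2))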


Policy Iteration Method without Discounting

To gain an appreciation of the difficulty associated with the exhaustive enumeration method, let us assume that the gardener has four courses of action (alternatives) instead of two: do not fertilize, fertilize once during the season, fertilize twice, and fertilize three times. In this case, the gardener would have a total of 4^3 = 64 stationary policies. Thus, by increasing the number of alternatives from 2 to 4, the number of stationary policies "soars" from 8 to 64. Not only is it difficult to enumerate all the policies explicitly, but the number of computations involved in the evaluation of these policies may also be prohibitively large.

The policy iteration method is based principally on the following development. For any specific policy, we showed that the expected total return at stage n is expressed by the recursive equation

    f_n(i) = v_i + ∑_{j=1}^{m} p_ij f_{n+1}(j),    i = 1, 2, ..., m

Summary
This research provides models for the solution of the Markovian decision

problem. The models developed include the finite-stage models solved directly by the DP

recursive equations. In the infinite-stage model, it is shown that exhaustive enumeration

is not practical for large problems. The policy iteration algorithm, which is based on the

DP recursive equation, is shown to be more efficient computationally than the exhaustive

enumeration method, since it normally converges in a small number of iterations.

Discounting is shown to result in a possible change of the optimal policy in comparison

with the case where no discounting is used. This conclusion applies to both the finite- and

infinite-stage models.

The LP formulation is quite interesting but not as efficient computationally as the

policy iteration algorithm. For problems with K decision alternatives and m states, the

associated LP model would include (m + 1) constraints and mK variables, which tend to

be large for large values of m and K.

Although we presented the simplified gardener example to demonstrate the development of the algorithms, the Markovian decision problem has applications in such areas as inventory, maintenance, replacement, and water resources.

References
R. Bellman. "A Markovian Decision Process." Journal of Mathematics and Mechanics, 6, 1957.

R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957. Dover paperback edition (2003), ISBN 0-486-42809-5.

R. A. Howard. Dynamic Programming and Markov Processes. The M.I.T. Press, 1960.

D. Bertsekas. Dynamic Programming and Optimal Control, Volume 2. Athena Scientific, Belmont, MA, 1995.

A. N. Burnetas and M. N. Katehakis. "Optimal Adaptive Policies for Markov Decision Processes." Mathematics of Operations Research, 22(1), 1997.

M. L. Puterman. Markov Decision Processes. Wiley, 1994.

H. C. Tijms. A First Course in Stochastic Models. Wiley, 2003.

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.

J. A. E. E. van Nunen. "A set of successive approximation methods for discounted Markovian decision problems." Zeitschrift für Operations Research, 20:203-208, 1976.

S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. ISBN 978-0-521-88441-9. Appendix contains an abridged version of Meyn & Tweedie.

S. M. Ross. Introduction to Stochastic Dynamic Programming. Academic Press, 1983.

X. Guo and O. Hernández-Lerma. Continuous-Time Markov Decision Processes. Springer, 2009.

M. L. Puterman and M. C. Shin. "Modified Policy Iteration Algorithms for Discounted Markov Decision Problems." Management Science, 24, 1978.
