The Markovian decision problem deals with a stochastic decision process that can be described by a finite number of states. The transition probabilities between the states are described by a Markov chain. The reward structure of the process is described by a matrix whose individual elements represent the revenue (or cost) resulting from moving from one state to another.
Both the transition and revenue matrices depend on the decision alternatives available to the decision maker. The objective of the problem is to determine the optimal policy that maximizes the expected revenue of the process over a finite or infinite number of stages.
INTRODUCTION
A Markov decision process (MDP) provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the 1950s (cf. Bellman 1957). A core body of research on Markov decision processes resulted from Ronald A. Howard's book published in 1960, Dynamic Programming and Markov Processes. They are used in a wide range of disciplines, including robotics, automatic control, economics, and manufacturing.
More precisely, a Markov decision process is a discrete-time stochastic control process. At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s′ and giving the decision maker a corresponding reward R_a(s, s′).
The probability that the process moves into its new state s′ is influenced by the chosen action. Specifically, it is given by the state transition function P_a(s, s′). Thus, the next state s′ depends on the current state s and the decision maker's action a. But given s and a, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP satisfy the Markov property.
Markov decision processes are an extension of Markov chains; the difference is the
addition of actions (allowing choice) and rewards (giving motivation). Conversely, if only one
action exists for each state and all rewards are zero, a Markov decision process reduces to
a Markov chain.
Definition
A Markov decision process is a 4-tuple (S, A, P_a, R_a), where S is a finite set of states, A is a finite set of actions, P_a(s, s′) is the probability that action a in state s will lead to state s′ at the next time step, and R_a(s, s′) is the immediate reward received after transitioning from state s to state s′ under action a.
(The theory of Markov decision processes does not actually require S or A to be finite, but the basic algorithms below assume that they are finite.)
The core problem of MDPs is to find a "policy" for the decision maker: a function π that specifies the action π(s) that the decision maker will choose when in state s. Note that once a Markov decision process is combined with a policy in this way, this fixes the action for each state and the resulting combination behaves like a Markov chain.
The goal is to choose a policy π that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:

E[ Σ_{t=0}^{∞} γ^t R_{a_t}(s_t, s_{t+1}) ],   where a_t = π(s_t)

and γ is the discount factor, which satisfies 0 ≤ γ < 1 and is typically close to 1. Because of the Markov property, the optimal policy for this particular problem can indeed be written as a function of the current state alone, as assumed above.
Algorithms
Suppose we know the state transition function P and the reward function R, and we wish to calculate the policy π that maximizes the expected discounted reward.
The standard family of algorithms for calculating this optimal policy requires storage for two arrays indexed by state: value V, which contains real values, and policy π, which contains actions. At the end of the algorithm, π will contain the solution and V(s) will contain the discounted sum of the rewards to be earned (on average) by following that solution from state s.
The algorithm has the following two kinds of steps, which are repeated in some order for all the states until no further changes take place:

π(s) := argmax_a { Σ_{s′} P_a(s, s′) [ R_a(s, s′) + γ V(s′) ] }

V(s) := Σ_{s′} P_{π(s)}(s, s′) [ R_{π(s)}(s, s′) + γ V(s′) ]

Their order depends on the variant of the algorithm; one can also do them for all states at once or state by state, and more often to some states than others. As long as no state is permanently excluded from either of the steps, the algorithm will eventually arrive at the correct solution.
Notable variants
Value iteration
In value iteration (Bellman 1957), which is also called backward induction, the π array is not used; instead, the value of π(s) is calculated within V(s) whenever it is needed. Shapley's 1953 paper on stochastic games included as a special case the value iteration method for MDPs, but this was recognized only later on.
Substituting the calculation of π(s) into the calculation of V(s) gives the combined step:

V(s) := max_a { Σ_{s′} P_a(s, s′) [ R_a(s, s′) + γ V(s′) ] }

This update rule is iterated for all states s until it converges with the left-hand side equal to the right-hand side (which is the "Bellman equation" for this problem).
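
As a concrete illustration, here is a minimal Python sketch of value iteration; the array layout, discount factor, and tolerance are our own illustrative choices, not part of the original text:

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a finite MDP.

    P[a, s, s2]: probability of moving from s to s2 under action a.
    R[a, s, s2]: reward for that transition.
    Returns the optimal values V and a greedy policy.
    """
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[a, s] = sum over s2 of P_a(s, s2) (R_a(s, s2) + gamma V(s2))
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:  # Bellman equation holds (to tolerance)
            return V_new, Q.argmax(axis=0)
        V = V_new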
Policy iteration
In policy iteration (Howard 1960), step one is performed once, and then step two is repeated until it converges. Then step one is performed once again, and so on.
This variant has the advantage that there is a definite stopping condition: when the π array does not change in the course of applying step one to all states, the algorithm is complete.
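
A matching sketch of policy iteration follows. As a common shortcut, step two is carried out exactly by solving the linear system (I − γP_π)V = r_π instead of by repeated sweeps; the code layout is again our own:

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Howard's policy iteration; P and R are laid out as in value_iteration."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    states = np.arange(n_states)
    while True:
        # Step two (policy evaluation), done exactly via a linear solve.
        P_pi = P[policy, states]                 # (S, S): rows follow the policy
        r_pi = (P_pi * R[policy, states]).sum(axis=1)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Step one (policy improvement).
        Q = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):   # definite stopping condition
            return V, policy
        policy = new_policy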
In modified policy iteration (van Nunen 1976; Puterman and Shin 1978), step one is performed once, and then step two is repeated several times. Then step one is performed once again, and so on.
Prioritized sweeping
In this variant, the steps are preferentially applied to states which are in some way important, whether based on the algorithm (there were large changes in V or π around those states recently) or based on use (those states are near the starting state, or otherwise of interest to the person or program using the algorithm).
The Markovian decision model has applications in such areas as inventory, replacement, cash flow management, and regulation of water reservoir capacity. The gardener example below illustrates the model.
An avid gardener tends a plot of land in her backyard. Every year, at the beginning of the gardening season, she uses chemical tests to check the soil's condition; depending on the outcome of the tests, she can classify the garden's productivity for the new season as good (state 1), fair (state 2), or poor (state 3).
Over the years, the gardener has observed that the current year's productivity can be assumed to depend only on last year's soil condition. She is thus able to represent the transition probabilities over a one-year period from one productivity state to another in terms of the following Markov chain:
            1    2    3
       1   .2   .5   .3
P1 =   2    0   .5   .5
       3    0    0    1

where states 1, 2, and 3 correspond to good, fair, and poor soil condition, respectively.
The transition probabilities in P1 indicate that the productivity for a current year can be no better than last year's. For example, if the soil condition for this year is fair (state 2), next year's productivity may remain fair with probability .5 or become poor with probability .5.
The gardener can alter the transition probabilities P1 by taking other courses of action available to her. Typically, she may decide to fertilize the garden to boost the soil condition. If she does not, her transition probabilities will remain as given in P1. But if she does apply fertilizer, the transition probabilities become those of P2:

            1    2    3
       1   .3   .6   .1
P2 =   2   .1   .6   .3
       3  .05   .4  .55
Under the new transition matrix P2, it is possible to improve the condition of the soil over last year's.
To put the decision problem in perspective, the gardener associates a return function (or a reward structure) with the transition from one state to another. The return function expresses the gain or loss during a one-year period, depending on the states between which the transition is made. Since the gardener has the option of using or not using fertilizer, her gains and losses are expected to vary depending on the decision she makes. The matrices R1 and R2 summarize the return functions in hundreds of dollars associated with the matrices P1 and P2, respectively. Thus R1 applies when no fertilizer is used, and R2 applies when fertilizer is used:

                          1    2    3
                     1    7    6    3
R1 = ||r^1_ij|| =    2    0    5    1
                     3    0    0   -1

                          1    2    3
                     1    6    5   -1
R2 = ||r^2_ij|| =    2    7    4    0
                     3    6    3   -2
Notice that the elements r^2_ij of R2 take into account the cost of applying the fertilizer. For example, if the system was in state 1 and remained in state 1 during the next year, its gain would be r^2_11 = 6, compared with r^1_11 = 7 when no fertilizer is used.
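
For the computations that follow, the gardener's data can be transcribed directly into arrays; this is a plain transcription of P1, P2, R1, and R2, with the numpy layout being our own choice:

import numpy as np

# Rows: this year's state; columns: next year's state.
# States: 1 = good, 2 = fair, 3 = poor (indices 0, 1, 2 in the code).
P1 = np.array([[.2,  .5, .3],    # no fertilizer
               [.0,  .5, .5],
               [.0,  .0, 1.]])
P2 = np.array([[.3,  .6, .1],    # fertilizer
               [.1,  .6, .3],
               [.05, .4, .55]])

# Returns (in hundreds of dollars) for the corresponding transitions.
R1 = np.array([[7., 6.,  3.],
               [0., 5.,  1.],
               [0., 0., -1.]])
R2 = np.array([[6., 5., -1.],
               [7., 4.,  0.],
               [6., 3., -2.]])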
What kind of decision problem does the gardener have? First, we must know whether the gardening activity will continue for a limited number of years or indefinitely. These situations are referred to as finite-stage and infinite-stage decision problems. In both cases, the gardener would need to determine the best course of action she should follow (fertilize or do not fertilize) given the outcome of the chemical tests (state of the system). Her optimization process will be based on the maximization of expected revenue.
The gardener may also be interested in evaluating the expected revenue resulting from following a prespecified course of action whenever a given state of the system occurs. For example, she may decide to fertilize whenever the soil condition is poor (state 3). The decision-making process in this case is said to be represented by a stationary policy.
We must note that each stationary policy is associated with different transition and return matrices, which, in general, can be constructed from the matrices P1, P2, R1, and R2. For example, for the stationary policy calling for applying fertilizer only when the soil condition is poor (state 3), the resulting transition and return matrices, P and R, are given as

      .2   .5   .3             7    6    3
P =    0   .5   .5       R =   0    5    1
     .05   .4  .55             6    3   -2

These matrices differ from P1 and R1 in their third rows only, which are taken directly from P2 and R2. The reason is that P2 and R2 are the matrices that result when fertilizer is applied.
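
Assembling the matrices of any stationary policy row by row is mechanical; the small helper below (our own, reusing the arrays defined earlier) performs exactly the construction just described:

import numpy as np

def policy_matrices(policy, Ps=(P1, P2), Rs=(R1, R2)):
    """Build (P, R) for a stationary policy; policy[i] is the alternative
    (0 = no fertilizer, 1 = fertilizer) used in state i + 1."""
    P = np.array([Ps[k][i] for i, k in enumerate(policy)])
    R = np.array([Rs[k][i] for i, k in enumerate(policy)])
    return P, R

# "Fertilize only when the soil is poor": rows 1-2 from P1/R1, row 3 from P2/R2.
P, R = policy_matrices((0, 0, 1))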
Partial observability
The solution above assumes that the state s is known when the action is to be taken; otherwise π(s) cannot be calculated. When this assumption is not true, the problem is called a partially observable Markov decision process (POMDP).
A major breakthrough in this area was provided in "Optimal adaptive policies for Markov decision processes" [2] by Burnetas and Katehakis. In this work, a class of adaptive policies that possess uniformly maximum convergence rate properties for the total expected finite-horizon reward was constructed under the assumptions of finite state-action spaces and irreducibility of the transition law. These policies prescribe that the choice of actions, at each state and time period, should be based on indices that are inflations of the right-hand side of the estimated average reward optimality equations.
Reinforcement learning
If the probabilities or rewards are unknown, the problem is one of reinforcement learning. For this purpose it is useful to define a further function Q, which corresponds to taking the action a and then continuing optimally (or according to whatever policy one currently has):

Q(s, a) = Σ_{s′} P_a(s, s′) [ R_a(s, s′) + γ V(s′) ]

While this function is also unknown, experience during learning is based on (s, a) pairs (together with the outcome s′); that is, "I was in state s and I tried doing a, and s′ happened". Thus, one has an array Q and uses experience to update it directly. This is known as Q-learning.
Reinforcement learning can solve Markov decision processes without explicit specification of the transition probabilities; the values of the transition probabilities are needed only in value and policy iteration. In reinforcement learning, instead of an explicit specification, the transition probabilities are accessed through a simulator that is typically restarted many times from a uniformly random initial state.
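
A minimal tabular Q-learning sketch follows; the simulator interface env_reset/env_step, the learning rate, and the epsilon-greedy exploration scheme are illustrative assumptions, not from the text:

import numpy as np

def q_learning(env_reset, env_step, n_states, n_actions,
               episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Learn Q(s, a) from experience alone; no transition probabilities needed.

    env_reset() -> initial state; env_step(s, a) -> (next_state, reward, done).
    """
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env_reset()
        done = False
        while not done:
            # epsilon-greedy: mostly exploit, occasionally explore.
            a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = env_step(s, a)
            # "I was in state s, I tried a, and s2 happened":
            Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
            s = s2
    return Q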
In discrete-time Markov decision processes, decisions are made at discrete time epochs. In continuous-time Markov decision processes, however, decisions can be made at any time the decision maker chooses. Unlike the discrete-time case, a continuous-time Markov decision process can better model the decision-making process for a system that has continuous dynamics, i.e., a system whose dynamics are defined by differential equations.
Definition
In order to discuss the continuous-time Markov decision process, we introduce two sets of notation.

If the state space and action space are finite:
S : state space;
A : action space;
q(j | i, a) : transition rate from state i to state j under action a;
R(i, a) : reward function.

If the state space and action space are continuous:
X : state space;
U : space of possible controls;
f(x(t), u(t)) : transition rate function, describing how the state vector changes over time;
r(x(t), u(t)) : reward rate function.
As in the discrete-time Markov decision process, in the continuous-time Markov decision process we want to find the optimal policy or control that gives us the optimal expected integrated reward:

max_u E[ ∫_0^∞ e^(−βt) r(x(t), u(t)) dt ]

where β > 0 is the discount rate.
If the state space and action space are finite, we can use a linear programming formulation to find the optimal policy; this was one of the earliest solution approaches applied. Here we consider only the ergodic model, which means our continuous-time MDP becomes an ergodic continuous-time Markov chain under a stationary policy. Under this assumption, although the decision maker can make a decision at any time in the current state, there is no benefit in taking more than one action. It is better to take an action only at the time when the system transitions from the current state to another state. Under some conditions, if the optimal value function is independent of the state i, we will have the following inequality:

g ≥ R(i, a) + Σ_j q(j | i, a) h(j)   for all states i and actions a
If there exists a function h satisfying the above inequality, then ḡ will be the smallest g that satisfies it. In order to find ḡ, we can use the following linear programming model:

Primal (P-LP): minimize g subject to g − Σ_j q(j | i, a) h(j) ≥ R(i, a) for all i, a.

Dual (D-LP): maximize Σ_i Σ_a R(i, a) y(i, a)
subject to Σ_i Σ_a q(j | i, a) y(i, a) = 0 for all j,
Σ_i Σ_a y(i, a) = 1, and y(i, a) ≥ 0.

A feasible solution y*(i, a) is an optimal solution to the D-LP if

Σ_i Σ_a R(i, a) y*(i, a) ≥ Σ_i Σ_a R(i, a) y(i, a)

for all feasible solutions y(i, a) to the D-LP. Once we have found the optimal solution y*(i, a), we can use it to establish the optimal policies.
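
Given tables of rates and rewards, the D-LP can be handed to a generic LP solver. The sketch below feeds a made-up two-state, two-action instance to scipy.optimize.linprog; all numbers are purely illustrative:

import numpy as np
from scipy.optimize import linprog

# q[a, i, j]: transition rate from i to j under action a (rows sum to 0).
# R[a, i]: reward rate in state i under action a.
q = np.array([[[-1.0, 1.0], [2.0, -2.0]],
              [[-3.0, 3.0], [0.5, -0.5]]])
R = np.array([[1.0, 0.0],
              [2.0, -0.5]])

n_a, n_s, _ = q.shape
# Variables y(i, a), flattened with i major; maximize sum R(i,a) y(i,a).
c = -np.array([R[a, i] for i in range(n_s) for a in range(n_a)])
# Balance constraints: sum over (i, a) of q(j | i, a) y(i, a) = 0 for each j.
A_eq = np.array([[q[a, i, j] for i in range(n_s) for a in range(n_a)]
                 for j in range(n_s)])
A_eq = np.vstack([A_eq, np.ones(n_s * n_a)])   # normalization: sum y = 1
b_eq = np.append(np.zeros(n_s), 1.0)

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
y = res.x.reshape(n_s, n_a)   # in state i, favor the action with largest y(i, a)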
In continuous-time MDPs, if the state space and action space are continuous, the optimal criterion can be found by solving the Hamilton–Jacobi–Bellman (HJB) partial differential equation. In order to discuss the HJB equation, we need to reformulate our problem:

V(x(0), 0) = max_u { ∫_0^T r(x(t), u(t)) dt + D(x(T)) }
subject to dx(t)/dt = f(x(t), u(t))

D(·) is the terminal reward function, x(t) is the system state vector, and u(t) is the system control vector we try to find; f(·) shows how the state vector changes over time. The Hamilton–Jacobi–Bellman partial differential equation is

∂V(x, t)/∂t + max_u { r(x, u) + (∂V(x, t)/∂x) f(x, u) } = 0

with terminal condition V(x, T) = D(x). We can solve this equation to find the optimal control u(t), which gives us the optimal value V.
The terminology and notation for MDPs are not entirely settled. There are two main streams: one focuses on maximization problems from contexts like economics, using the terms action, reward, and value, and calling the discount factor β or γ, while the other focuses on minimization problems from engineering and navigation, using the terms control, cost, and cost-to-go, and calling the discount factor α. In addition, the notation for the transition probability varies: it may be written P_a(s, s′), Pr(s′ | s, a), or, rarely, p_{s′s}(a).

Application
Suppose that the gardener plans to "retire" from her hobby in N years. She is thus interested in determining her optimal course of action for each year (to fertilize or not to fertilize) over a finite planning horizon. Optimality here is defined such that the gardener accumulates the highest expected revenue at the end of the N years.
Let k = 1 and 2 represent the two courses of action (alternatives) available to the gardener. The matrices Pk and Rk representing the transition probabilities and reward functions for alternative k are repeated here for convenience:
                    .2   .5   .3                         7    6    3
P1 = ||p^1_ij|| =    0   .5   .5 ,    R1 = ||r^1_ij|| =  0    5    1    and
                     0    0    1                         0    0   -1

                    .3   .6   .1                         6    5   -1
P2 = ||p^2_ij|| =   .1   .6   .3 ,    R2 = ||r^2_ij|| =  7    4    0
                   .05   .4  .55                         6    3   -2
Recall that the system has three states: good (state 1), fair (state 2), and poor (state
3).
The problem can be expressed as a dynamic programming (DP) model as follows. For the sake of generalization, suppose that the number of states for each stage (year) is m (= 3 in the gardener's example) and define

f_n(i) = optimal expected revenue of stages n, n + 1, . . . , N, given that the state of the system (soil condition) at the beginning of year n is i.

The backward recursive equation relating f_n and f_{n+1} can be written as

f_n(i) = max_k { Σ_{j=1}^m p^k_ij [ r^k_ij + f_{n+1}(j) ] },   n = 1, 2, . . . , N

where f_{N+1}(j) = 0 for all j.
A justification for the equation above is that the cumulative revenue, r^k_ij + f_{n+1}(j), resulting from reaching state j at stage n + 1 from state i at stage n, occurs with probability p^k_ij. In fact, if v^k_i represents the expected return resulting from a single transition from state i given alternative k, then v^k_i can be expressed as

v^k_i = Σ_{j=1}^m p^k_ij r^k_ij
Before showing how the recursive equation is used to solve the gardener's problem, we illustrate the computation of v^k_i, which is part of the recursive equation. For k = 1 (do not fertilize):

v^1_1 = .2 × 7 + .5 × 6 + .3 × 3 = 5.3
v^1_2 = 0 × 0 + .5 × 5 + .5 × 1 = 3
v^1_3 = 0 × 0 + 0 × 0 + 1 × (−1) = −1

These values show that if the soil condition is found good (state 1) at the beginning of the year, a single transition is expected to yield 5.3 for that year. Similarly, if the condition is fair (state 2), the expected yield is 3, and if it is poor (state 3), the expected loss is 1.
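
These one-step expected returns are row-wise products of the transition and return arrays; with the P1, P2, R1, R2 arrays defined earlier, both alternatives can be computed at once:

# v[k][i] = expected one-step return in state i + 1 under alternative k + 1.
v1 = (P1 * R1).sum(axis=1)   # -> [5.3, 3.0, -1.0], matching the text
v2 = (P2 * R2).sum(axis=1)   # -> [4.7, 3.1, 0.4]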
The (finite-horizon) gardener's problem can be generalized in two ways. First, the transition probabilities and their return functions need not be the same for every year. Second, she may apply a discounting factor to the expected revenue of the successive stages, so that the values of f_1(i) would represent the present value of the expected revenues of all the stages.
The first generalization requires simply that the return values r^k_ij and the transition probabilities p^k_ij be functions of the stage n.
For the second generalization, let α (< 1) be the discount factor per year, which is normally computed as α = 1/(1 + t), where t is the annual interest rate. Thus D dollars a year from now are equivalent to αD dollars now.
The introduction of the discount factor modifies the original recursive equation as follows:

f_n(i) = max_k { Σ_{j=1}^m p^k_ij [ r^k_ij + α f_{n+1}(j) ] }

The use of the discount factor may result in a different optimum decision in comparison with the case when no discount is used.
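
The backward recursion, discounted or not, is then only a few lines on top of the arrays above (α = 1 recovers the undiscounted model; the horizon N is a parameter):

import numpy as np

def gardener_dp(N, alpha=1.0):
    """f_n(i) = max_k sum_j p^k_ij (r^k_ij + alpha f_{n+1}(j)), run backward."""
    P = np.stack([P1, P2])      # shape (2, 3, 3): alternative, state, next state
    R = np.stack([R1, R2])
    f = np.zeros(3)             # f_{N+1}(j) = 0 for all j
    decisions = []
    for n in range(N, 0, -1):
        Q = (P * (R + alpha * f[None, None, :])).sum(axis=2)   # (2, 3)
        decisions.append(Q.argmax(axis=0) + 1)  # best alternative k per state
        f = Q.max(axis=0)
    decisions.reverse()         # decisions[n-1][i]: best k in year n, state i + 1
    return f, decisions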
The DP recursive equation can also be used to evaluate any given stationary policy:

f_n(i) = v_i + Σ_{j=1}^m p_ij f_{n+1}(j)

where p_ij is the (i, j)th element of the transition matrix associated with the policy and v_i is the expected return of a single transition from state i under that policy.
Infinite-Stage Model
As the number of stages grows very large, the expected revenue per stage becomes independent of the initial state of the system. In this case the system is said to have reached steady state. We are thus primarily interested in evaluating policies for which the objective is maximizing (minimizing) the expected revenue (cost) per transition period. For example, in the gardener's problem, the selection of the best (infinite-stage) policy is based on the maximization of the expected annual revenue.
There are two methods for solving the infinite-stage problem. The first method calls for enumerating all possible stationary policies of the decision problem. By evaluating each policy, the optimum solution can be determined. This is basically an exhaustive enumeration process and can be used only if the total number of stationary policies is reasonably small. The second method, called policy iteration, overcomes the computational difficulties that could arise in the exhaustive enumeration procedure. The new method is generally efficient in the sense that it determines the optimum policy in a small number of iterations.
Naturally, both methods must lead to the same optimum solution. We demonstrate these points, as well as the application of the two methods, via the gardener example.
The exhaustive enumeration method proceeds as follows.
Step 1: Determine all the stationary policies of the problem, and suppose that Ps and Rs are the (one-step) transition and revenue matrices associated with the sth policy, s = 1, 2, . . . , S. Compute v^s_i, the expected one-step revenue of policy s given state i, i = 1, 2, . . . , m.
Step 2: Compute π^s_i, the long-run stationary probabilities of the transition matrix Ps associated with policy s. These probabilities, when they exist, are computed from the equations

π^s Ps = π^s,   π^s_1 + π^s_2 + · · · + π^s_m = 1
Step 3: Determine E^s, the expected revenue of policy s per transition step, by using the formula

E^s = Σ_{i=1}^m π^s_i v^s_i

Step 4: The optimal policy s* is determined such that

E^{s*} = max_s { E^s }
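
For the gardener's two alternatives (2^3 = 8 stationary policies), the whole enumeration fits in a short script; the steady-state probabilities come from solving πP = π together with Σπ = 1. A sketch, reusing policy_matrices and the data arrays defined earlier:

import itertools
import numpy as np

def steady_state(P):
    """Solve pi P = pi with sum(pi) = 1 (a consistent system, so lstsq is exact)."""
    m = P.shape[0]
    A = np.vstack([P.T - np.eye(m), np.ones(m)])
    b = np.append(np.zeros(m), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

best = None
for policy in itertools.product((0, 1), repeat=3):   # all 8 stationary policies
    P, R = policy_matrices(policy)
    v = (P * R).sum(axis=1)       # expected one-step revenues v^s_i
    pi = steady_state(P)          # long-run probabilities pi^s_i
    E = pi @ v                    # expected revenue per transition step E^s
    if best is None or E > best[0]:
        best = (E, policy)
print(best)   # the optimal policy s* and its expected revenue E^{s*}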
When a discount factor α (< 1) is used, the finite-stage recursive equation given earlier extends to the infinite-stage case in a similar manner.
To illustrate the exhaustive enumeration method, let us assume that the gardener has four courses of action (alternatives) instead of two: do not fertilize, fertilize once during the season, fertilize twice, and fertilize three times. In this case, the gardener would have a total of 4^3 = 64 stationary policies. It may still be possible to enumerate all the policies explicitly, but the number of computations involved in evaluating them could be excessive.
For any specific policy, we showed that the expected total return at stage n is expressed by the recursive equation f_n(i) = v_i + Σ_j p_ij f_{n+1}(j).
Summary
This research provides models for the solution of the Markovian decision problem. The models developed include the finite-stage models, solved directly by the DP recursive equation, and the infinite-stage models, solved by exhaustive enumeration or by the policy iteration algorithm. Exhaustive enumeration is not practical for large problems. The policy iteration algorithm, which is based on the long-run steady-state behavior of the underlying Markov chain, is generally more efficient. The use of a discount factor may result in a different optimum policy in comparison with the case where no discounting is used. This conclusion applies to both the finite- and infinite-stage models.
The infinite-stage problem can also be formulated as a linear program, although this may be computationally less convenient than the policy iteration algorithm. For problems with K decision alternatives and m states, the number of computations grows rapidly with K and m, which again favors the policy iteration algorithm. Aside from the development of the algorithms, the Markovian decision problem has applications in such areas as inventory, replacement, cash flow management, and regulation of water reservoir capacity.
References
R. Bellman. "A Markovian Decision Process." Journal of Mathematics and Mechanics 6, 1957.
Ronald A. Howard. Dynamic Programming and Markov Processes. The M.I.T. Press, 1960.
Burnetas, A. N. and M. N. Katehakis. "Optimal Adaptive Policies for Markov Decision Processes." Mathematics of Operations Research 22 (1), 1997.