
DYNAMIC PROGRAMMING
Lecture 3
Prof. Preetam Basu
IIM Calcutta

Stochastic Dynamic Programming

The state at the next stage is not completely determined by the state and the decision at the current stage.

Instead, there is a probability distribution for what the next state will be.

The probability distribution of the next state is completely determined by the state and the decision at the current stage.

Formulation: Stochastic Dynamic Programming

Stochastic dynamic programming problems can be solved using recursions of the following form (for max problems):

f_t(i) = max_a { (expected reward during stage t | i, a) + Σ_j p(j | i, a) f_{t+1}(j) }

f_T(i) = boundary condition

Here p(j | i, a) is the probability that the next state is j, given that the current state is i and action a is chosen.
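The backward recursion above can be sketched directly in code. This is a minimal, generic sketch, not part of the lecture: the callables `reward(t, i, a)`, `trans(i, a)` (returning a dict mapping next state j to p(j | i, a)), and `boundary(i)` are hypothetical placeholders for whatever a specific model supplies.

```python
def solve(T, states, actions, reward, trans, boundary):
    """Finite-horizon stochastic DP (max problem) via backward recursion."""
    f = {i: boundary(i) for i in states}        # f_T(i) = boundary condition
    policy = {}
    for t in range(T - 1, -1, -1):              # work backward from T-1 down to 0
        f_next, f = f, {}
        for i in states:
            best, best_a = float("-inf"), None
            for a in actions(i):
                # expected stage reward plus expected reward-to-go
                val = reward(t, i, a) + sum(
                    p * f_next[j] for j, p in trans(i, a).items()
                )
                if val > best:
                    best, best_a = val, a
            f[i] = best
            policy[(t, i)] = best_a
    return f, policy
```

With concrete `reward`, `trans`, and `boundary` functions plugged in, `solve` returns both the value function f_0 and the optimal action for every (t, i) pair.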

Stochastic Dynamic Programming Example

When Sally arrives at the bank, 30 minutes remain in her lunch break.

If Sally makes it to the head of the line and enters service before the end of her lunch break, she earns a reward r. The assumption here is that Sally earns the reward r as soon as her transaction starts. This version is represented as Model 1.

There is another way of interpreting the reward r: if we assume that Sally earns the reward only when her transaction is completed, the model is slightly different. This version is given as Model 2.

However, Sally does not enjoy waiting in lines, so to reflect her dislike for waiting, she incurs a cost of c for each minute she waits.

During any minute in which n people are ahead of Sally, there is a probability p(x|n) that x people will complete their transactions.

Suppose that when Sally arrives, 20 people are ahead of her in line.

Use dynamic programming to determine a strategy for Sally that maximizes her expected net reward (reward minus waiting costs).

Solution: Model 1

When Sally arrives at the bank, she must decide whether to join the line or give up and leave.

At any later time, she may also decide to leave if it is unlikely that she will be served by the end of her lunch break.

We can work backward to solve the problem.

We define f_t(n) to be the maximum expected net reward that Sally can receive from time t to the end of her lunch break if, at time t, n people are ahead of her.

Solution Contd: Model 1

We let t = 0 be the present and t = 30 be the end of the problem. The boundary conditions are:

f_30(n) = 0 for n > 0, and f_t(0) = r for any t.

Since t = 29 is the beginning of the last minute of the problem, we write

f_29(n) = max { 0                                              (Leave)
              { r p(n|n) - c + Σ_{k<n} p(k|n) f_30(n - k)      (Stay)

Solution Contd

For t < 29, we write

f_t(n) = max { 0                                               (Leave)
             { r p(n|n) - c + Σ_{k<n} p(k|n) f_{t+1}(n - k)    (Stay)

Solution Contd

The last recursion follows because, if Sally stays, she will earn an expected reward (as in the t = 29 case) of r p(n|n) - c during the current minute, and with probability p(k|n) there will be n - k people ahead of her; in this case, her expected net reward from time t+1 to time 30 will be f_{t+1}(n - k).

If Sally stays, her overall expected reward received from times t+1, t+2, ..., 30 will be Σ_{k<n} p(k|n) f_{t+1}(n - k).

Solution Contd

To determine Sally's optimal waiting policy, we work backward until f_0(20) is computed.

Problems in which the decision maker can terminate the problem by choosing a particular action are known as stopping-rule problems.
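The backward pass for Model 1 can be sketched numerically. The service distribution below is an assumption made only for illustration: in any minute, exactly one of the n >= 1 people ahead finishes with probability q (so p(1|n) = q, p(0|n) = 1 - q), and the values of r, c, and q are placeholders, not from the lecture.

```python
# Illustrative parameters (assumptions): reward r, per-minute cost c,
# per-minute completion probability q, horizon T = 30 minutes.
R, C, Q, T = 10.0, 0.1, 0.8, 30

def p(x, n):
    """p(x | n): assumed probability that x people finish this minute."""
    if n == 0:
        return 1.0 if x == 0 else 0.0
    return {1: Q, 0: 1 - Q}.get(x, 0.0)

# Boundary conditions: f_30(n) = 0 for n > 0, and f_t(0) = r for any t.
f = {n: 0.0 for n in range(21)}
f[0] = R
for t in range(T - 1, -1, -1):
    g = {0: R}
    for n in range(1, 21):
        # Stay-value: r p(n|n) - c + sum over k < n of p(k|n) f_{t+1}(n-k)
        stay = R * p(n, n) - C + sum(p(k, n) * f[n - k] for k in range(n))
        g[n] = max(0.0, stay)   # 0 = Leave, stay = keep waiting
    f = g

print(round(f[20], 4))  # f_0(20): Sally's maximum expected net reward
```

Wherever the Stay-value drops to 0 in some g[n] at some t, leaving is optimal; the computed f_0(20) is the answer to the original question under these assumed probabilities.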

Model 2

Here the assumption is that Sally earns the reward r only when her transaction is complete.

As in Model 1, we let t = 0 be the present and t = 30 be the end of the problem. The boundary condition is f_30(n) = 0 for n >= 0.

For any t < 30, we write

f_t(n) = max { 0                                                    (Leave)
             { r p(n+1|n) - c + Σ_{k<=n} p(k|n) f_{t+1}(n - k)      (Stay)

The term p(n+1|n) is the probability that the n people ahead of Sally and Sally herself all complete their transactions during the minute.
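The only change from Model 1 is the Stay-value: the reward probability becomes p(n+1|n), and the sum runs over k <= n rather than k < n. A minimal side-by-side sketch (the arguments `p` and `f_next` are placeholders, not the lecture's notation):

```python
def stay_value_model1(n, r, c, p, f_next):
    # Model 1: reward r when service *starts*, i.e. all n ahead finish.
    return r * p(n, n) - c + sum(p(k, n) * f_next[n - k] for k in range(n))

def stay_value_model2(n, r, c, p, f_next):
    # Model 2: reward r only when Sally's own transaction *completes*,
    # so n+1 completions are needed; the sum now includes k = n.
    return r * p(n + 1, n) - c + sum(p(k, n) * f_next[n - k] for k in range(n + 1))
```

Plugging either function into the same backward loop used for Model 1 solves the corresponding model.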

Example 1: Inventory Management

ABC Fashion Stores is a leading retailer of men's shirts. ABC orders shirts from XYZ Manufacturers on a monthly basis. In the current month, ABC has 100 shirts in stock. The management at ABC needs to determine an optimal ordering policy for the next 12 months. Each month they cannot order more than M shirts. Demand for the shirts is uncertain and follows a probability distribution p(D_t). The purchasing cost of each shirt is c, the holding cost for each unsold shirt is h per month, and the salvage value for each unsold shirt at the beginning of the 13th month is s. The selling price of each shirt is k.

The orders made by ABC have a lead time of one month, i.e., whatever is ordered in the current month will be available to ABC in the next month.

Formulate the above as a probabilistic dynamic program that maximizes profit for ABC.
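One possible formulation can be sketched as follows; this is an assumption, not the lecture's official answer. Take the state to be the shirts on hand at the start of month t (after last month's order arrives); the decision q in {0, ..., M} arrives next month, and f_13(i) = s·i is the salvage boundary. The demand distribution, M, and the cost values below are illustrative placeholders, and the state space is kept tiny (the real problem starts from 100 shirts).

```python
# Illustrative parameters (assumptions): M = max order, K = selling price k,
# C = purchase cost c, H = holding cost h, S = salvage value s.
M, K, C, H, S = 2, 5.0, 2.0, 0.5, 1.0
demand = {0: 0.3, 1: 0.5, 2: 0.2}           # assumed p(D_t)
MAX_INV = 6                                  # small state space for the sketch

f = {i: S * i for i in range(MAX_INV + 1)}   # month 13: salvage unsold shirts
policy = {}
for t in range(12, 0, -1):                   # months 12 down to 1
    g = {}
    for i in range(MAX_INV + 1):
        best, best_q = float("-inf"), None
        for q in range(M + 1):               # order q, arrives next month
            val = -C * q
            for d, prob in demand.items():
                sold = min(d, i)             # sales limited by stock on hand
                left = i - sold              # unsold shirts incur holding cost
                nxt = min(left + q, MAX_INV)
                val += prob * (K * sold - H * left + f[nxt])
            if val > best:
                best, best_q = val, q
        g[i] = best
        policy[(t, i)] = best_q
    f = g
```

Reading off policy[(1, i)] for the starting stock level gives the first month's optimal order under these assumed numbers.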

Example 2: Capacity Expansion

The management at Khosla Constructions is strategically planning to add production facilities to their current operations over the next 10 years. They want to decide how many new facilities to add each year. Each year they cannot add more than M facilities. Presently they have 15 facilities. Each facility produces n units of a special type of generator. The cost of adding a new facility is Rs. U. The generators produced by Khosla earn Rs. B per unit. The demand for the generators follows a probability distribution given by p(D_t). At the end of the 10th year, production facilities are salvaged for Rs. S per facility. Develop a dynamic programming model that maximizes the profit for Khosla Constructions.

Assume that each year Khosla produces at full capacity, the cost of running each facility is k, and the capacity added at time t only comes into operation at t+1.
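A sketch formulation, with the assumptions flagged: the state x is the number of facilities in operation in year t; adding a <= M facilities costs U each and comes online at t+1; yearly sales are taken as min(n·x, D_t) at Rs. B per unit (an interpretation of "produces to full capacity"), minus the running cost per facility; after year 10 each facility salvages for Rs. S. All numeric values are illustrative placeholders.

```python
# Illustrative parameters (assumptions): N = units per facility n,
# RUN_COST = running cost k, SALV = salvage value S.
M, N, U, B, RUN_COST, SALV = 2, 10, 100.0, 3.0, 5.0, 20.0
demand = {100: 0.4, 200: 0.4, 300: 0.2}      # assumed p(D_t)
MAX_FAC = 15 + 10 * M                         # 15 now, at most M added per year

f = {x: SALV * x for x in range(MAX_FAC + 1)}  # end of year 10: salvage
for t in range(10, 0, -1):                     # years 10 down to 1
    g = {}
    for x in range(MAX_FAC + 1):
        best = float("-inf")
        for a in range(min(M, MAX_FAC - x) + 1):   # facilities added this year
            # expected revenue from this year's production, capped by demand
            exp_rev = sum(prob * B * min(N * x, d) for d, prob in demand.items())
            val = exp_rev - RUN_COST * x - U * a + f[x + a]
            best = max(best, val)
        g[x] = best
    f = g
```

After the loop, f[15] is the maximum expected 10-year profit from the current 15 facilities under these assumed numbers; tracking the maximizing a at each (t, x), as in the inventory sketch, recovers the expansion plan.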

Example 3
