
# DEPARTMENT OF ENGINEERING SCIENCE ENGSCI760 - Dynamic Programming Tutorial 2013

May 31, 2013

1. Suppose you are interviewing job applicants for a position. There are $N$ applicants to interview lined up outside your office and you interview them one by one. Assume that each candidate gives a random reward $R$ to your company that is distributed as an exponential random variable:

$$\Pr[R < r] = \begin{cases} 0, & r \le 0, \\ 1 - e^{-r}, & r > 0. \end{cases}$$

After each interview, you discover the realization of $R$ for the candidate and may offer them the job, or tell them that you will not hire them. If you do the former then you stop and get $R$. If you do the latter then the candidate is lost. Let $V_n$ be the optimal expected reward after interviewing (and rejecting) $n-1$ candidates.

(a) What is $V_N$?

$$V_N = E[R] = \int_0^\infty r e^{-r}\,dr = 1.$$

(b) What is the optimal decision to make after interviewing and rejecting $N-2$ candidates?

Accept the $(N-1)$st candidate if their value of $R$ exceeds 1.

(c) Show that $V_{N-1} = e^{-1} + 1$.

The density function of $R$ is $f(r) = e^{-r}$. Thus

$$\begin{aligned}
V_{N-1} &= \int_0^\infty \max\{r, 1\}\, e^{-r}\,dr \\
&= \int_0^1 e^{-r}\,dr + \int_1^\infty r e^{-r}\,dr \\
&= \left[-e^{-r}\right]_0^1 + \left[-e^{-r}(r+1)\right]_1^\infty \\
&= 1 - e^{-1} + 2e^{-1} \\
&= e^{-1} + 1.
\end{aligned}$$

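The same argument applies for general $n < N$, with $V_{n+1}$ in place of 1:

$$\begin{aligned}
V_n &= \int_0^\infty \max\{r, V_{n+1}\}\, e^{-r}\,dr \\
&= \int_0^{V_{n+1}} V_{n+1} e^{-r}\,dr + \int_{V_{n+1}}^\infty r e^{-r}\,dr \\
&= \left[-V_{n+1} e^{-r}\right]_0^{V_{n+1}} + \left[-e^{-r}(r+1)\right]_{V_{n+1}}^\infty \\
&= V_{n+1} - V_{n+1} e^{-V_{n+1}} + V_{n+1} e^{-V_{n+1}} + e^{-V_{n+1}} \\
&= V_{n+1} + e^{-V_{n+1}}.
\end{aligned}$$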
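The recursion $V_n = V_{n+1} + e^{-V_{n+1}}$ with $V_N = 1$ can be evaluated backwards numerically. The following is a minimal Python sketch; the function name `optimal_values` and the choice $N = 5$ are illustrative assumptions, not part of the tutorial.

```python
import math


def optimal_values(num_applicants):
    """Return [V_1, ..., V_N] using V_N = 1 and V_n = V_{n+1} + exp(-V_{n+1})."""
    values = [0.0] * num_applicants
    values[-1] = 1.0  # V_N = E[R] = 1: the last candidate must be accepted
    # Work backwards from n = N-1 down to n = 1 (list indices N-2 .. 0).
    for n in range(num_applicants - 2, -1, -1):
        next_value = values[n + 1]
        values[n] = next_value + math.exp(-next_value)
    return values


# N = 5 is an arbitrary illustrative choice.
values = optimal_values(5)
for n, v in enumerate(values, start=1):
    print(f"V_{n} = {v:.4f}")
```

As in part (b), candidate $n$ should be accepted when their reward exceeds $V_{n+1}$, so the printed values can also be read as acceptance thresholds.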

2. Consider computing infinite-horizon shortest paths in the following directed graph.

[Figure: directed graph on nodes 1, 2, 3, 4 with arcs (1,2), (1,4), (2,3), (2,4), (3,1), (4,1).]

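Let the costs of the arcs be given by

$$c^\top = \begin{bmatrix} c_{12} & c_{14} & c_{23} & c_{24} & c_{31} & c_{41} \end{bmatrix} = \begin{bmatrix} 5 & 4 & 3 & 2 & 1 & 3 \end{bmatrix},$$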

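and suppose the discount factor is $\alpha$. Consider the problem of computing a never-ending path in which a move must be made from node to node at the start of every year. Write down a linear program in six variables that can be used to determine a policy that has minimum discounted cost.

Let $c(j,y) = c_{jy}$ as above, and $n = 4$:

$$\begin{aligned}
\text{D:}\quad \text{minimize}\quad & \sum_{j=1}^{n} \sum_{y=1}^{n} c(j,y)\,x(j,y) \\
\text{s.t.}\quad & \sum_{y=1}^{n} x(j,y) - \alpha \sum_{y=1}^{n} x(y,j) = 1, \quad j = 1, 2, \ldots, n, \\
& x(j,y) \ge 0.
\end{aligned}$$

Here the sums run only over pairs $(j,y)$ that are arcs of the graph, so there is one variable $x(j,y)$ for each of the six arcs.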
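Written out arc by arc, the program D might look as follows. This is a sketch under two notational assumptions: $x_{jy}$ is shorthand for $x(j,y)$, and $\alpha$ stands for the discount factor. Each constraint is outflow from a node minus discounted inflow to that node.

$$\begin{aligned}
\text{minimize}\quad & 5x_{12} + 4x_{14} + 3x_{23} + 2x_{24} + x_{31} + 3x_{41} \\
\text{s.t.}\quad & x_{12} + x_{14} - \alpha\,(x_{31} + x_{41}) = 1 && \text{(node 1)} \\
& x_{23} + x_{24} - \alpha\,x_{12} = 1 && \text{(node 2)} \\
& x_{31} - \alpha\,x_{23} = 1 && \text{(node 3)} \\
& x_{41} - \alpha\,(x_{14} + x_{24}) = 1 && \text{(node 4)} \\
& x_{12},\, x_{14},\, x_{23},\, x_{24},\, x_{31},\, x_{41} \ge 0.
\end{aligned}$$

In a basic optimal solution one would expect exactly one positive outflow variable at each node, and the arc carrying it indicates the move to make from that node.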

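3. Consider an infinite-horizon Markov decision process with states $x = 1$, 2, and 3, and discount factor 0.9. In each state there are two actions ($y = 1$ and $y = 2$). The transition matrices are

$$P_1 = \begin{bmatrix} \tfrac12 & \tfrac12 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad P_2 = \begin{bmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ 1 & 0 & 0 \\ 0 & \tfrac12 & \tfrac12 \end{bmatrix}.$$

In each period there is a reward $R(x)$ for being in state $x$, where $R(1) = 0$, $R(2) = 3$, $R(3) = 2$.

(a) Write down a dynamic programming recursion that the optimal value $V = \begin{bmatrix} V_1 & V_2 & V_3 \end{bmatrix}^\top$ must satisfy.

Write $r = 0.9$ for the discount factor and $e_x$ for the $x$th unit vector. Then

$$\begin{aligned}
V_1 &= \max\{R(1) + r e_1^\top P_1 V,\; R(1) + r e_1^\top P_2 V\}, \\
V_2 &= \max\{R(2) + r e_2^\top P_1 V,\; R(2) + r e_2^\top P_2 V\}, \\
V_3 &= \max\{R(3) + r e_3^\top P_1 V,\; R(3) + r e_3^\top P_2 V\},
\end{aligned}$$

where

$$P_1 V = \begin{bmatrix} \tfrac12 & \tfrac12 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \\ V_3 \end{bmatrix} = \begin{bmatrix} \tfrac12 V_1 + \tfrac12 V_2 \\ V_1 \\ V_3 \end{bmatrix}, \qquad P_2 V = \begin{bmatrix} \tfrac13 & \tfrac13 & \tfrac13 \\ 1 & 0 & 0 \\ 0 & \tfrac12 & \tfrac12 \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \\ V_3 \end{bmatrix} = \begin{bmatrix} \tfrac13 V_1 + \tfrac13 V_2 + \tfrac13 V_3 \\ V_1 \\ \tfrac12 V_2 + \tfrac12 V_3 \end{bmatrix}.$$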

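Substituting, the recursion becomes

$$\begin{aligned}
V_1 &= \max\{r(\tfrac12 V_1 + \tfrac12 V_2),\; r(\tfrac13 V_1 + \tfrac13 V_2 + \tfrac13 V_3)\}, \\
V_2 &= \max\{3 + rV_1,\; 3 + rV_1\}, \\
V_3 &= \max\{2 + rV_3,\; 2 + r(\tfrac12 V_2 + \tfrac12 V_3)\}.
\end{aligned}$$

(b) Write down a formulation for solving the problem using linear programming.

$$\begin{aligned}
\min\quad & V_1 + V_2 + V_3 \\
\text{s.t.}\quad & r(\tfrac12 V_1 + \tfrac12 V_2) \le V_1, \\
& r(\tfrac13 V_1 + \tfrac13 V_2 + \tfrac13 V_3) \le V_1, \\
& 3 + rV_1 \le V_2, \\
& 3 + rV_1 \le V_2, \\
& 2 + rV_3 \le V_3, \\
& 2 + r(\tfrac12 V_2 + \tfrac12 V_3) \le V_3.
\end{aligned}$$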
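The recursion in part (a) can be checked numerically by value iteration, that is, by applying the maximization repeatedly from an arbitrary starting vector. Value iteration is not one of the methods the question asks for; it is used here only as a cross-check. The names `bellman_update` and `value_iteration`, the tolerance, and the zero starting vector are assumptions of this sketch.

```python
# Data from Question 3. States are indexed 0, 1, 2 for x = 1, 2, 3.
R = [0.0, 3.0, 2.0]
P = {
    1: [[1 / 2, 1 / 2, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    2: [[1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0], [0.0, 1 / 2, 1 / 2]],
}
DISCOUNT = 0.9  # the discount factor r


def bellman_update(V):
    """Apply V_x <- max_y { R(x) + r * (row x of P_y) . V } to every state."""
    return [
        max(
            R[x] + DISCOUNT * sum(p * v for p, v in zip(P[y][x], V))
            for y in (1, 2)
        )
        for x in range(3)
    ]


def value_iteration(tol=1e-10, max_iter=10_000):
    """Iterate the update until successive vectors differ by less than tol."""
    V = [0.0, 0.0, 0.0]
    for _ in range(max_iter):
        new_V = bellman_update(V)
        if max(abs(a - b) for a, b in zip(new_V, V)) < tol:
            return new_V
        V = new_V
    return V


V = value_iteration()
print([round(v, 3) for v in V])
```

The values this converges to should agree, up to rounding, with the policy-iteration solution $V_1 = 16.047$, $V_2 = 17.442$, $V_3 = 20.0$ found in part (c).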

(c) Write down an algorithm for solving the problem using policy iteration.

(This is from the notes. We illustrate it with one step.) Consider the policy defined by actions $y(1)$, $y(2)$, $y(3)$ in each state. We now compute $V(x)$ corresponding to this policy. We can do this by choosing, for each state, the inequality from the above set that corresponds to the chosen action and requiring it to hold with equality. Thus if $y(1) = 2$, $y(2) = 1$, $y(3) = 1$, we get

$$\begin{aligned}
V_1 &= r(\tfrac13 V_1 + \tfrac13 V_2 + \tfrac13 V_3), \\
V_2 &= 3 + rV_1, \\
V_3 &= 2 + rV_3,
\end{aligned}$$

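which has solution $V_1 = 16.047$, $V_2 = 17.442$, $V_3 = 20.0$. We update the policy by putting these values into the recursion and testing $y$:

$$\begin{aligned}
V_1 &= \max\{r(\tfrac12 V_1 + \tfrac12 V_2),\; r(\tfrac13 V_1 + \tfrac13 V_2 + \tfrac13 V_3)\} = \max\{15.07005,\; 16.0467\}, \\
V_2 &= \max\{3 + rV_1,\; 3 + rV_1\} = \max\{17.442,\; 17.442\}, \\
V_3 &= \max\{2 + rV_3,\; 2 + r(\tfrac12 V_2 + \tfrac12 V_3)\} = \max\{20,\; 18.849\}.
\end{aligned}$$

The maximizing actions are again $y(1) = 2$, $y(2) = 1$ (both actions tie in state 2), $y(3) = 1$. The policy is unchanged, so policy iteration has converged and this policy is optimal.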
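The evaluate-then-improve step illustrated above can be repeated until the policy stops changing. The following is a minimal Python sketch of that loop, not the version in the course notes. The helper names (`solve_linear`, `evaluate`, `improve`, `policy_iteration`), the starting policy, and the rule of breaking ties in favour of action 1 are all assumptions of this sketch.

```python
# Data from Question 3. States are indexed 0, 1, 2 for x = 1, 2, 3.
R = [0.0, 3.0, 2.0]
P = {
    1: [[1 / 2, 1 / 2, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    2: [[1 / 3, 1 / 3, 1 / 3], [1.0, 0.0, 0.0], [0.0, 1 / 2, 1 / 2]],
}
DISCOUNT = 0.9  # the discount factor r


def solve_linear(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(n):
            if i != col:
                factor = M[i][col] / M[col][col]
                M[i] = [a - factor * c for a, c in zip(M[i], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def evaluate(policy):
    """Policy evaluation: solve (I - r P_policy) V = R for the fixed policy."""
    A = [
        [(1.0 if i == j else 0.0) - DISCOUNT * P[policy[i]][i][j] for j in range(3)]
        for i in range(3)
    ]
    return solve_linear(A, R)


def improve(V):
    """Policy improvement: pick the maximizing action in each state."""
    policy = []
    for x in range(3):
        q = {y: R[x] + DISCOUNT * sum(p * v for p, v in zip(P[y][x], V)) for y in (1, 2)}
        policy.append(max((1, 2), key=lambda y: q[y]))  # ties go to action 1
    return policy


def policy_iteration(policy, max_iter=100):
    """Alternate evaluation and improvement until the policy is unchanged."""
    for _ in range(max_iter):
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:
            break
        policy = new_policy
    return policy, V


# Starting policy [1, 1, 1] is an arbitrary illustrative choice.
policy, V = policy_iteration([1, 1, 1])
print(policy, [round(v, 3) for v in V])
```

Starting from $y(1) = y(2) = y(3) = 1$, the first improvement step should switch state 1 to action 2, after which the loop reproduces the step worked through above and stops.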