Lecture Notes: Markov Decision Processes

Marc Toussaint
Machine Learning & Robotics group, TU Berlin
Franklinstr. 28/29, FR 6-9, 10587 Berlin, Germany

April 13, 2009

1 Markov Decision Processes

1.1 Definition

A Markov Decision Process is a stochastic process on the random variables of state $x_t$, action $a_t$, and reward $r_t$, as given by the Dynamic Bayesian Network in Figure 1. The process is defined by the conditional probabilities

  $P(x_{t+1} \mid a_t, x_t)$   transition probability   (1)
  $P(r_t \mid a_t, x_t)$   reward probability   (2)
  $P(a_t \mid x_t) = \pi(a_t \mid x_t)$   policy   (3)

In the following we will assume the process to be stationary, i.e., none of these conditional probabilities explicitly depends on time.

For a given policy $\pi$ and a start state $x$ we may forward simulate the process and compute the expectation of future rewards. The value function $V^\pi(x)$ is such a measure; more precisely, it measures the expected discounted return,

  $V^\pi(x) = E\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x;\ \pi \}$   (4)
  $\qquad\quad\;\, = E\{ \sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = x;\ \pi \}$.   (5)

Choosing the discount factor $\gamma \in [0,1)$ smaller than 1 ensures convergence of the sum. For every start state $x$ we may investigate what the best policy $\pi$ is and what its value would be. Let us define the optimal value function as

  $V^*(x) = \max_\pi V^\pi(x)$.

The optimal value function fulfills the Bellman optimality equation

  $V^*(x) = \max_a \big[ R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, V^*(x') \big]$,

where $R(a,x) = E\{ r_t \mid a_t = a, x_t = x \}$ denotes the expected immediate reward, and an optimal policy chooses in every state an action that maximizes the bracket term. This is easily proved via negation: if a policy $\pi$ selected an action that does not maximize the bracket term, then the policy $\pi'$ which is equal to $\pi$ everywhere except at state $x$, where it chooses the action maximizing the bracket term, would have a higher value. Thus such a $\pi$ cannot be optimal. Conversely, every optimal policy must choose actions to maximize the bracket term.

The Bellman optimality equation is central throughout the theory of MDPs and reflects the principle of optimality. If the system is deterministic, this principle can be formulated as follows: "From any point on an optimal trajectory, the remaining trajectory is optimal for the corresponding problem initiated at that point." This is another way of saying that an optimal policy achieves optimal value for every start state, or, whenever the process reaches a state, the optimal policy will select the same action as an optimal policy would if that state were the start state of the process.

1.3 The Q-function

As an alternative to the value function $V$ one may define the Q-function,

  $Q^\pi(a,x) = E\{ r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \mid x_0 = x,\ a_0 = a;\ \pi \}$,   (13)

which is related to the value function as follows,

  $Q^\pi(a,x) = R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, V^\pi(x')$,   (14)
  $V^\pi(x) = Q^\pi(\pi(x), x)$.   (15)

The Q-function fulfills the following recursive property and the corresponding Bellman optimality equation,

  $Q^\pi(a,x) = R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, Q^\pi(\pi(x'), x')$,   (16)
  $Q^*(a,x) = R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, \max_{a'} Q^*(a', x')$,   (17)
  $\pi^*(x) = \operatorname{argmax}_a Q^*(a, x)$.   (18)

1.4 Computing value functions

1.4.1 Value Iteration

Many algorithms which compute the optimal policy $\pi^*$ are based on first computing the value function $V^\pi$ or $Q^\pi$ for a suboptimal policy $\pi$, or the optimal value function $V^*$ or $Q^*$. Iterative algorithms to compute these functions can be derived directly from their recursive properties and the Bellman optimality equations. Let us summarize all of them,

  $V^\pi(x) = R(\pi(x), x) + \gamma \sum_{x'} P(x' \mid \pi(x), x)\, V^\pi(x')$,   (19)
  $V^*(x) = \max_a \big[ R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, V^*(x') \big]$,   (20)
  $Q^\pi(a,x) = R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, Q^\pi(\pi(x'), x')$,   (21)
  $Q^*(a,x) = R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, \max_{a'} Q^*(a', x')$,   (22)

and for each of these four equations consider the respective update equations

  $\forall x:\ V_{k+1}(x) = R(\pi(x), x) + \gamma \sum_{x'} P(x' \mid \pi(x), x)\, V_k(x')$,   (23)
  $\forall x:\ V_{k+1}(x) = \max_a \big[ R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, V_k(x') \big]$,   (24)
  $\forall a, x:\ Q_{k+1}(a,x) = R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, Q_k(\pi(x'), x')$,   (25)
  $\forall a, x:\ Q_{k+1}(a,x) = R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, \max_{a'} Q_k(a', x')$.   (26)

Each of these equations describes an iterative algorithm which starts with an initial value function $V_0(x)$ or $Q_0(a,x)$, respectively (e.g., they could be initialized to constant zero), and then updates these functions while iterating $k = 1, 2, \ldots$. Note that the equations (19-22) are fixed point equations of the updates (23-26), respectively. Hence it seems rather intuitive that these iterative updates will converge to the desired value functions. We can prove this convergence for the case of computing $Q^*$ via (26): let $\Delta_k = \| Q^* - Q_k \|_\infty = \max_{a,x} | Q^*(a,x) - Q_k(a,x) |$ be an error measure for the Q-function in iteration $k$. We have

  $Q_{k+1}(a,x) = R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, \max_{a'} Q_k(a', x')$   (27)
  $\qquad\qquad \le R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, \big[ \max_{a'} Q^*(a', x') + \Delta_k \big]$   (28)
  $\qquad\qquad = \big[ R(a,x) + \gamma \sum_{x'} P(x' \mid a, x)\, \max_{a'} Q^*(a', x') \big] + \gamma \Delta_k$   (29)
  $\qquad\qquad = Q^*(a,x) + \gamma \Delta_k$,   (30)

where the inequality uses $Q_k(a',x') \le Q^*(a',x') + \Delta_k$, which follows from the definition of $\Delta_k$. This gives an upper bound on the deviation of $Q_{k+1}$ from $Q^*$. Similarly we can prove a lower bound on the deviation, $Q_k \ge Q^* - \Delta_k \ \Rightarrow\ Q_{k+1} \ge Q^* - \gamma \Delta_k$. Hence, the error measure contracts, $\Delta_{k+1} \le \gamma \Delta_k$, and converges to zero for $\gamma \in [0,1)$. A stopping criterion for value iteration is $\max_x | V_{k+1}(x) - V_k(x) | < \epsilon$ or $\max_{a,x} | Q_{k+1}(a,x) - Q_k(a,x) | < \epsilon$, respectively.
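As a concrete illustration, the following is a minimal Python sketch of Q-iteration, i.e. the update (26) combined with the stopping criterion above, for a small discrete MDP. The function name q_iteration and the array layout (P[x_next, a, x] = P(x'|a,x), R[a, x] = R(a,x)) are illustrative choices, not conventions from the notes.

```python
import numpy as np

def q_iteration(P, R, gamma=0.9, eps=1e-6):
    """Q-iteration, i.e. the update (26), for a small discrete MDP.

    P : (n_states, n_actions, n_states) array, P[x_next, a, x] = P(x'|a,x)
    R : (n_actions, n_states) array,           R[a, x]         = R(a,x)
    Returns the (approximately) optimal Q-function and a greedy policy.
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_actions, n_states))        # Q_0 initialized to constant zero
    while True:
        V = Q.max(axis=0)                      # max_{a'} Q_k(a', x') for every x'
        # Q_{k+1}(a,x) = R(a,x) + gamma * sum_{x'} P(x'|a,x) max_{a'} Q_k(a',x')
        Q_next = R + gamma * np.einsum('yax,y->ax', P, V)
        if np.max(np.abs(Q_next - Q)) < eps:   # stopping criterion from Sec. 1.4.1
            return Q_next, Q_next.argmax(axis=0)   # pi*(x) = argmax_a Q*(a,x), eq. (18)
        Q = Q_next
```

Replacing the inner update by $V_{k+1}(x) = \max_a [\cdots]$ gives the V-based value iteration (24) with the same stopping test.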
1.4.2 Direct matrix inversion

In the case of policy evaluation, where we are interested in computing the value $V^\pi$ of the current policy $\pi$ (rather than directly $V^*$), there is an alternative to the iterative algorithm. If the state space is discrete and not too large, we can think of $V^\pi(x)$ as a large vector $V^\pi$ with entries $V^\pi_x = V^\pi(x)$ indexed by $x$. We also define a large transition matrix $T^\pi$ with entries $T^\pi_{x'x} = P(x' \mid \pi(x), x)$ and the reward vector $R^\pi$ with entries $R^\pi_x = R(\pi(x), x)$. Then the recursion equation (19) can be written in vector notation,

  $V^\pi = R^\pi + \gamma T^\pi V^\pi$   (31)
  $(I - \gamma T^\pi)\, V^\pi = R^\pi$   (32)
  $V^\pi = (I - \gamma T^\pi)^{-1} R^\pi$.   (33)

This is an exact solution which requires the inversion of an $n \times n$ matrix (complexity $O(n^3)$), where $n$ is the cardinality of the state space.
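For comparison, here is a minimal sketch of exact policy evaluation via (33), under the same assumed array layout as in the Q-iteration example; the policy pi is assumed to be given as an array of action indices, and a linear solve is used instead of forming the inverse explicitly.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Exact policy evaluation V^pi = (I - gamma T^pi)^{-1} R^pi, eq. (33).

    P  : (n_states, n_actions, n_states) array, P[x_next, a, x] = P(x'|a,x)
    R  : (n_actions, n_states) array,           R[a, x]         = R(a,x)
    pi : (n_states,) integer array,             pi[x]           = pi(x)
    """
    n_states = P.shape[0]
    xs = np.arange(n_states)
    T = P[:, pi, xs].T                  # T[x, x'] = P(x' | pi(x), x)
    R_pi = R[pi, xs]                    # R_pi[x]  = R(pi(x), x)
    # Solve (I - gamma T) V = R_pi rather than inverting the matrix explicitly
    return np.linalg.solve(np.eye(n_states) - gamma * T, R_pi)
```

Both the explicit inversion and the linear solve cost $O(n^3)$; the solve is simply the numerically preferable way of applying (33).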
