An Introduction to Markov Decision Processes

Bob Givan, Purdue University
Ron Parr, Duke University


Outline
Markov Decision Processes defined (Bob)
• Objective functions
• Policies
Finding Optimal Solutions (Ron)
• Dynamic programming
• Linear programming
Refinements to the basic model (Bob)
• Partial observability
• Factored representations


Stochastic Automata with Utilities
A Markov Decision Process (MDP) model contains:
• A set of possible world states S
• A set of possible actions A
• A real-valued reward function R(s, a)
• A description T of each action's effects in each state
We assume the Markov Property: the effects of an action taken in a state depend only on that state and not on the prior history.
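To make the pieces concrete, here is a minimal Python sketch of these components, using a tiny made-up two-state battery example; the states, actions, and numbers are assumptions for illustration, not from the tutorial.

    # A tiny illustrative MDP: S, A, R as plain Python data.
    S = {"low", "high"}                      # possible world states
    A = {"wait", "charge"}                   # possible actions
    R = {("low", "wait"): 0.0, ("low", "charge"): -1.0,    # reward R(s, a)
         ("high", "wait"): 2.0, ("high", "charge"): 1.0}
    # The action description T is sketched alongside the "Representing Actions" slide below.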

Representing Actions
Deterministic Actions:
• T : S × A → S
  For each state and action we specify a new state.
Stochastic Actions:
• T : S × A → Prob(S)
  For each state and action we specify a probability distribution over next states. Represents the distribution P(s′ | s, a).
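A sketch of the two forms of T for the toy example above; all probabilities are made up for illustration.

    # Deterministic actions, T : S x A -> S (a single next state)
    T_det = {("low", "wait"): "low",   ("low", "charge"): "high",
             ("high", "wait"): "high", ("high", "charge"): "high"}

    # Stochastic actions, T : S x A -> Prob(S), i.e. P(s' | s, a)
    T = {("low", "wait"):    {"low": 0.9, "high": 0.1},
         ("low", "charge"):  {"low": 0.2, "high": 0.8},
         ("high", "wait"):   {"low": 0.3, "high": 0.7},
         ("high", "charge"): {"high": 1.0}}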

Representing Solutions
A policy π is a mapping from S to A.

Following a Policy
Following a policy π:
1. Determine the current state s
2. Execute action π(s)
3. Goto step 1
Assumes full observability: the new state resulting from executing an action will be known to the system.
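A minimal sketch of this loop on the toy example, assuming the stochastic T sketched above and an arbitrary policy pi (both are illustrative choices, not the tutorial's):

    import random

    pi = {"low": "charge", "high": "wait"}    # a policy: S -> A

    def step(s, a):
        """Sample a next state s' from P(s' | s, a)."""
        dist = T[(s, a)]
        return random.choices(list(dist), weights=list(dist.values()))[0]

    s = "low"                 # 1. determine the current state s
    for _ in range(5):
        a = pi[s]             # 2. execute action pi(s)
        s = step(s, a)        # full observability: the resulting state is known
                              # 3. goto step 1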

Evaluating a Policy
How good is a policy π in a state s?
For deterministic actions, just total the rewards obtained, but the result may be infinite.
For stochastic actions, instead use the expected total reward obtained, which again typically yields an infinite value.
How do we compare policies of infinite value?

Objective Functions
An objective function maps infinite sequences of rewards to single real numbers (representing utility).
Options:
1. Set a finite horizon and just total the reward
2. Discounting to prefer earlier rewards
3. Average reward rate in the limit
Discounting is perhaps the most analytically tractable and most widely studied approach.

Discounting
A reward n steps away is discounted by γⁿ, for discount rate 0 < γ < 1.
• models mortality: you may die at any moment
• models preference for shorter solutions
• a smoothed-out version of limited-horizon lookahead
We use cumulative discounted reward as our objective.

  (Max value ≤ M + γ ⋅ M + γ² ⋅ M + … = 1/(1 – γ) ⋅ M)
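The bound is just the geometric series; a quick numeric check, with M and γ chosen arbitrarily for illustration:

    # Truncated sum of M * gamma^n versus the closed-form bound M / (1 - gamma).
    M, gamma = 1.0, 0.9
    total = sum(M * gamma ** n for n in range(1000))
    bound = M / (1 - gamma)
    print(round(total, 6), bound)    # both are (approximately) 10.0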

Value Functions
A value function Vπ : S → ℜ represents the expected objective value obtained following policy π from each state in S.
Value functions partially order the policies,
• but at least one optimal policy exists, and
• all optimal policies have the same value function, V*.

Bellman Equations
Bellman equations relate the value function to itself via the problem dynamics.
For the discounted objective function,

  Vπ(s) = R(s, π(s)) + ∑_{s′ ∈ S} T(s, π(s), s′) ⋅ γ ⋅ Vπ(s′)

  V*(s) = MAX_{a ∈ A} [ R(s, a) + ∑_{s′ ∈ S} T(s, a, s′) ⋅ γ ⋅ V*(s′) ]

In each case, there is one equation per state in S.
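One way to use these equations, as a sketch on the toy example (reusing S, A, R, T from the earlier sketches; gamma and the stopping tolerance are assumptions): iterate the first equation to evaluate a fixed policy, and the second, as value iteration, to approach V*.

    gamma = 0.9

    def evaluate_policy(pi, tol=1e-8):
        """Iterate V(s) <- R(s, pi(s)) + sum_s' T(s, pi(s), s') * gamma * V(s')."""
        V = {s: 0.0 for s in S}
        while True:
            V_new = {s: R[(s, pi[s])]
                        + sum(p * gamma * V[s2]
                              for s2, p in T[(s, pi[s])].items())
                     for s in S}
            if max(abs(V_new[s] - V[s]) for s in S) < tol:
                return V_new
            V = V_new

    def value_iteration(tol=1e-8):
        """Iterate V(s) <- max_a [ R(s, a) + sum_s' T(s, a, s') * gamma * V(s') ]."""
        V = {s: 0.0 for s in S}
        while True:
            V_new = {s: max(R[(s, a)]
                            + sum(p * gamma * V[s2] for s2, p in T[(s, a)].items())
                            for a in A)
                     for s in S}
            if max(abs(V_new[s] - V[s]) for s in S) < tol:
                return V_new
            V = V_new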

Finite-horizon Bellman Equations
Finite-horizon values at adjacent horizons are related by the action dynamics:

  Vπ,0(s) = R(s, π(s))

  Vπ,n(s) = R(s, π(s)) + ∑_{s′ ∈ S} T(s, π(s), s′) ⋅ γ ⋅ Vπ,n−1(s′)
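A sketch of the same recurrence as backward induction on the toy example (reuses S, R, T, pi, and gamma from the earlier sketches; the function name is illustrative):

    def finite_horizon_value(pi, N):
        """Compute V^{pi,N} by applying the recurrence N times."""
        V = {s: R[(s, pi[s])] for s in S}          # V^{pi,0}(s) = R(s, pi(s))
        for _ in range(N):
            V = {s: R[(s, pi[s])]
                    + sum(p * gamma * V[s2]
                          for s2, p in T[(s, pi[s])].items())
                 for s in S}                        # V^{pi,n} from V^{pi,n-1}
        return V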

Relation to Model Checking
Some thoughts on the relationship:
• MDP solution focuses critically on expected value
• Contrast safety properties, which focus on the worst case
• This contrast allows MDP methods to exploit sampling and approximation more aggressively

• At this point, Ron Parr spoke on solution methods for about 1/2 an hour, and then I continued.

Large State Spaces
In AI problems, the "state space" is typically
• astronomically large
• described implicitly, not enumerated
• decomposed into factors, or aspects of state
Issues raised:
• How can we represent reward and action behaviors in such MDPs?
• How can we find solutions in such MDPs?

A Factored MDP Representation
• State Space S — assignments to state variables:
  On-Mars?, Need-Power?, Daytime?, ..., etc.
• Partitions — each block a DNF formula (or BDD, etc.):
  Block 1: not On-Mars?
  Block 2: On-Mars? and Need-Power?
  Block 3: On-Mars? and not Need-Power?
• Reward function R — labelled state-space partition:
  Block 1: not On-Mars?, Reward = 0
  Block 2: On-Mars? and Need-Power?, Reward = 4
  Block 3: On-Mars? and not Need-Power?, Reward = 5
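A sketch of this labelled partition in Python, with states as assignments to boolean state variables and each block written as a predicate standing in for a DNF formula or BDD; the Python encoding is an illustrative assumption, not the tutorial's.

    # Reward function R as a labelled state-space partition.
    reward_partition = [
        (lambda s: not s["On-Mars?"],                      0),   # Block 1
        (lambda s: s["On-Mars?"] and s["Need-Power?"],     4),   # Block 2
        (lambda s: s["On-Mars?"] and not s["Need-Power?"], 5),   # Block 3
    ]

    def reward(s):
        """Reward of the (unique) block containing state s."""
        for block, r in reward_partition:
            if block(s):
                return r

    # e.g. reward({"On-Mars?": True, "Need-Power?": False, "Daytime?": True}) == 5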

1 e. . . . .. a) * Pr (On-Mars? | x. .. . .Pr(Nd-Power? ^ On-Mars? | x. .1 1.. MDP Tutorial . . . . a) • Represent effect on each state variable as labelled partition: Effects of Action Charge-Battery on variable Need-Power? Pr(Need-Power? | Block n) Block 1: not On-Mars? . .9 Block 2: On-Mars? and Need-Power? . . a) = Pr (Nd-Power? | x. 0. . 0. .19 . . . . . .Factored Representations of Actions • Assume: actions affect state variables independently.3 Block 3: On-Mars? and not Need-Power? . This assumption can be relaxed.g. . . . . . . .. 0. . . .

Representing Blocks
• Identifying "irrelevant" state variables
• Decision trees
• DNF formulas
• Binary/Algebraic Decision Diagrams
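For instance, a block written as a DNF formula could be encoded as a list of conjunctive clauses; this is a simple illustrative encoding, and a BDD package would replace it in practice.

    # A DNF formula: a list of clauses; each clause maps a variable to a required value.
    block3 = [{"On-Mars?": True, "Need-Power?": False}]   # On-Mars? and not Need-Power?

    def in_block(s, dnf):
        """A state is in the block if it satisfies at least one clause."""
        return any(all(s[v] == val for v, val in clause.items()) for clause in dnf)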

Partial Observability
System state cannot always be determined
⇒ a Partially Observable MDP (POMDP)
• Action outcomes are not fully observable
• Add a set of observations O to the model
• Add an observation distribution U(s, o) for each state
• Add an initial state distribution I
Key notion: belief state, a distribution over system states representing "where I think I am"
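A sketch of these additions on the earlier two-state toy example; the observation set and all probabilities are made-up assumptions.

    O = {"beep", "silent"}                                   # observations
    U = {("low", "beep"): 0.8, ("low", "silent"): 0.2,       # U(s, o) = P(o | s)
         ("high", "beep"): 0.1, ("high", "silent"): 0.9}
    I = {"low": 0.5, "high": 0.5}                            # initial state distribution

    belief = dict(I)          # belief state: "where I think I am"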

POMDP to MDP Conversion
The belief state Pr(s) can be updated to Pr(s′ | o) using Bayes' rule:

  Pr(s′ | s, o) = Pr(o | s, s′) Pr(s′ | s) / Pr(o | s) = U(s′, o) T(s, a, s′), normalized

  Pr(s′ | o) = ∑_s Pr(s′ | s, o) Pr(s)

A POMDP is Markovian and fully observable relative to the belief state.
⇒ a POMDP can be treated as a continuous-state MDP
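A sketch of this Bayes-rule update on the toy POMDP above, using T and U from the earlier sketches; the function and variable names are illustrative.

    def update_belief(b, a, o):
        """Return the new belief Pr(s' | b, a, o)."""
        new_b = {}
        for s2 in S:
            # Pr(s', o | b, a) = sum_s b(s) * T(s, a, s') * U(s', o)
            new_b[s2] = sum(b[s] * T[(s, a)].get(s2, 0.0) for s in S) * U[(s2, o)]
        z = sum(new_b.values())                  # normalizing constant Pr(o | b, a)
        return {s2: p / z for s2, p in new_b.items()}

    # e.g. belief = update_belief(belief, "charge", "silent")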

Belief State Approximation
Problem: When the MDP state space is astronomical, belief states cannot be explicitly represented.
Consequence: MDP conversion of a POMDP is impractical.
Solution: Represent the belief state approximately
• Typically exploiting the factored state representation
• Typically exploiting (near) conditional independence properties of the belief state factors
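One common way to do this (an assumption here, not a prescription from the tutorial) is to keep only a marginal per state variable and treat the variables as nearly independent, so the belief over a huge factored space becomes a small product of factors:

    # Approximate belief: one marginal per state variable (product of factors).
    approx_belief = {
        "On-Mars?":    {True: 0.99, False: 0.01},
        "Need-Power?": {True: 0.30, False: 0.70},
        "Daytime?":    {True: 0.50, False: 0.50},
    }

    def approx_prob(assignment):
        """Pr(assignment) under the factored approximation."""
        p = 1.0
        for var, val in assignment.items():
            p *= approx_belief[var][val]
        return p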