
Making complex decisions

Chapter 17
Outline
• Sequential decision problems (Markov
Decision Process)
– Value iteration
– Policy iteration
Sequential Decisions
• Agent’s utility depends on a sequence of
decisions
Markov Decision Process (MDP)
• Defined as a tuple: <S, A, M, R>
– S: State
– A: Action
– M: Transition function
• Table entry Maij = P(sj | si, a): the probability of reaching sj when action a is taken in state si
– R: Reward
• R(si, a) = cost or reward of taking action a in state si
• In our case R = R(si)

• Choose a sequence of actions (not just one decision or one action)

– Utility is based on the whole sequence of decisions
(a minimal code sketch of an MDP representation follows)
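As a concrete illustration of the tuple <S, A, M, R> (the sketch referenced above), here is one minimal way such an MDP could be represented in Python; the two states, the action names, and all probabilities are invented for the example.

```python
# A toy MDP <S, A, M, R> as plain dictionaries (all names and numbers are illustrative).
S = ["s1", "s2"]                      # states
A = ["stay", "go"]                    # actions

# M[si][a] = {sj: P(sj | si, a)} — the transition function as a nested dict.
M = {
    "s1": {"stay": {"s1": 1.0}, "go": {"s2": 0.9, "s1": 0.1}},
    "s2": {"stay": {"s2": 1.0}, "go": {"s1": 0.9, "s2": 0.1}},
}

# R(si): reward collected in each state.
R = {"s1": -0.04, "s2": 1.0}
```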
Generalization
 Inputs:
• Initial state s0
• Action model
• Reward R(si) collected in each state si
 A state is terminal if it has no successor
 Starting at s0, the agent keeps executing actions
until it reaches a terminal state
 Its goal is to maximize the expected sum of rewards
collected (additive rewards)
 Additive rewards: U(s0, s1, s2, …) = R(s0) + R(s1) + R(s2) + …
 Discounted rewards:
U(s0, s1, s2, …) = R(s0) + γ R(s1) + γ² R(s2) + …   (0 ≤ γ ≤ 1)
(a small numeric example of a discounted sum follows)
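As a quick numeric check of the discounted-rewards formula; the reward sequence and discount factor below are made-up values.

```python
# U(s0, s1, ...) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
rewards = [-0.04, -0.04, -0.04, 1.0]   # example rewards collected along one path
gamma = 0.9                            # example discount factor

U = sum(gamma ** t * r for t, r in enumerate(rewards))
print(U)   # -0.04 - 0.036 - 0.0324 + 0.729 = 0.6206
```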
Dealing with infinite sequences
of actions
• Use discounted rewards
• The environment contains terminal states
that the agent is guaranteed to reach
eventually
• Infinite sequences are compared by their
average rewards
Example
• Fully observable environment
• Non-deterministic actions
– intended effect: probability 0.8
– unintended effect: probability 0.2 (0.1 to each perpendicular side)

[Figure: 4×3 grid world. The agent starts in cell (1,1); cell (2,2) is an obstacle; the terminal cells are (4,3), with reward +1, and (4,2), with reward −1. Each action moves the agent in the intended direction with probability 0.8 and to each perpendicular side with probability 0.1.]
MDP of example
• S: the agent's position on the 4×3 grid
– Each cell is denoted by its coordinates (x, y)
• A: Actions of the agent, i.e., N, E, S, W

• M: Transition function
– E.g., M( (4,2) | (3,2), N) = 0.1
– E.g., M((3, 3) | (3,2), N) = 0.8
– (Robot movement, uncertainty of another agent’s actions,…)

• R: Reward (more comments on the reward function later)


– R(1, 1) = −1/25, R(1, 2) = −1/25, …
– R(4, 3) = +1, R(4, 2) = −1

• Discount factor: γ = 1
(a code sketch of this grid MDP follows)
Policy
• Policy is a mapping from states to actions.
• Given a policy, one may calculate the expected utility of the
series of actions produced by the policy.

Example of policy

[Figure: an example policy for the 4×3 grid world — an arrow in each non-terminal cell indicates the action to take there; the terminal cells are (4,3) = +1 and (4,2) = −1.]
• The goal: find an optimal policy π*, one that produces
maximal expected utility.
Utility of a State
Given a state s, we can measure the expected utility obtained by applying
any policy π.
We assume the agent is in state s and define St (a random variable)
as the state reached at step t.
Obviously S0 = s

The expected utility of a state s given a policy π is

Uπ(s) = E[ Σt γ^t R(St) ]

π*s will be a policy that maximizes the expected utility of state s
(a simulation-based estimate of Uπ(s) is sketched below).
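The expectation above can also be approximated by simulation. The sketch below is a hypothetical Monte Carlo estimate of Uπ(s); it assumes transition(s, a) returns {s′: P(s′ | s, a)} and reward(s) returns R(s), matching the grid sketch earlier.

```python
import random

# Estimate U^pi(s) = E[ sum_t gamma^t R(S_t) ] by averaging over simulated trajectories.
def estimate_utility(s, policy, transition, reward, terminals, gamma=1.0,
                     n_rollouts=10_000, horizon=1_000):
    total = 0.0
    for _ in range(n_rollouts):
        state, discount = s, 1.0
        for _ in range(horizon):
            total += discount * reward(state)
            if state in terminals:
                break
            probs = transition(state, policy[state])
            state = random.choices(list(probs), weights=list(probs.values()))[0]
            discount *= gamma
    return total / n_rollouts
```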


Utility of a State
The utility of a state s measures its desirability:

 If Si is terminal:
U(Si) = R(Si)

 If Si is non-terminal:
U(Si) = R(Si) + γ maxa Σj P(Sj | Si, a) U(Sj)
[Bellman equation]
[the reward of Si augmented by the expected sum of
discounted rewards collected in future states]

→ The Bellman equations can be solved by dynamic programming
(a single backup step is sketched in code below)
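The backup step referenced above, as a minimal sketch under the same assumed transition/reward interfaces:

```python
# One Bellman backup: U'(Si) = R(Si) + gamma * max_a sum_j P(Sj | Si, a) * U(Sj).
def bellman_backup(s, U, actions, transition, reward, gamma=1.0):
    best_action_value = max(
        sum(p * U[s2] for s2, p in transition(s, a).items())
        for a in actions
    )
    return reward(s) + gamma * best_action_value
```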
Optimal Policy
 A policy is a function that maps each state s into the
action to execute if s is reached

 The optimal policy π* is the policy that always leads
to maximizing the expected sum of rewards
collected in future states
(Maximum Expected Utility principle)

π*(Si) = argmaxa Σj P(Sj | Si, a) U(Sj)
(a short policy-extraction sketch in code follows)
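The policy-extraction sketch referenced above, under the same assumed interfaces; given converged utilities U, it applies the argmax to every non-terminal state.

```python
# pi*(Si) = argmax_a sum_j P(Sj | Si, a) U(Sj), computed for every non-terminal state.
def extract_policy(states, actions, transition, U, terminals):
    policy = {}
    for s in states:
        if s in terminals:
            continue
        policy[s] = max(
            actions,
            key=lambda a: sum(p * U[s2] for s2, p in transition(s, a).items()),
        )
    return policy
```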


Optimal policy and state utilities
for MDP

[Figure: left — the optimal policy for the 4×3 grid (an arrow in each non-terminal cell); right — the corresponding state utilities:]

       1      2      3      4
  3  0.812  0.868  0.912   +1
  2  0.762         0.660   -1
  1  0.705  0.655  0.611  0.388

• What will happen if the cost of a step is very low?


Finding the optimal policy
• Solution must satisfy both equations
– π*(Si) = argmaxa Σj P(Sj | Si, a) U(Sj)      (1)
– U(Si) = R(Si) + γ maxa Σj P(Sj | Si, a) U(Sj)      (2)
• Value iteration:
– start with a guess of a utility function and use (2) to
get a better estimate
• Policy iteration:
– start with a fixed policy, and solve for the exact
utilities of states; use eq 1 to find an updated policy.
Value iteration
function Value-Iteration(MDP) returns a utility function
  inputs: P(Sj | Si, a), a transition model
          R, a reward function on states
          γ, a discount parameter
  local variables: U, utility function, initially identical to R
                   U', utility function, initially identical to R
  repeat
    U ← U'
    for each state Si do
      U'[Si] ← R[Si] + γ maxa Σj P(Sj | Si, a) U[Sj]
    end
  until Close-Enough(U, U')
  return U
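A runnable Python version of the pseudocode above, assuming the MDP is supplied as explicit dictionaries; the two-state example at the bottom is invented only to exercise the function.

```python
# Value iteration: repeatedly apply U'(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) * U(s')
# until the largest change falls below a tolerance.
def value_iteration(states, actions, P, R, terminals, gamma=0.9, eps=1e-6):
    U = {s: R[s] for s in states}
    while True:
        U_new = {}
        for s in states:
            if s in terminals:
                U_new[s] = R[s]
                continue
            best = max(
                sum(p * U[s2] for s2, p in P[s][a].items())
                for a in actions
            )
            U_new[s] = R[s] + gamma * best
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            return U_new
        U = U_new

# Toy example (illustrative numbers only): two non-terminal states and one terminal goal.
states = ["a", "b", "goal"]
actions = ["left", "right"]
P = {
    "a": {"left": {"a": 1.0}, "right": {"b": 0.8, "a": 0.2}},
    "b": {"left": {"a": 0.8, "b": 0.2}, "right": {"goal": 0.8, "b": 0.2}},
}
R = {"a": -0.04, "b": -0.04, "goal": 1.0}
print(value_iteration(states, actions, P, R, terminals={"goal"}))
```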
Value iteration - convergence of
utility values
Policy iteration
function Policy-Iteration(MDP) returns a policy
  inputs: P(Sj | Si, a), a transition model
          R, a reward function on states
          γ, a discount parameter
  local variables: U, utility function, initially identical to R
                   π, a policy, initially optimal with respect to U
  repeat
    U ← Value-Determination(π, U, MDP)
    unchanged? ← true
    for each state Si do
      if maxa Σj P(Sj | Si, a) U[Sj] > Σj P(Sj | Si, π(Si)) U[Sj] then
        π(Si) ← argmaxa Σj P(Sj | Si, a) U[Sj]
        unchanged? ← false
  until unchanged?
  return π
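A matching Python sketch of policy iteration. Value-Determination is approximated here by repeated backups under the fixed policy (a simplification of solving the linear equations exactly); the dictionary format for the MDP is the same as in the value-iteration sketch.

```python
import random

# Policy iteration: alternate (approximate) policy evaluation with greedy policy improvement.
def policy_iteration(states, actions, P, R, terminals, gamma=0.9, eval_sweeps=50):
    pi = {s: random.choice(actions) for s in states if s not in terminals}
    U = {s: R[s] for s in states}
    while True:
        # Value determination: approximate U under the fixed policy pi.
        for _ in range(eval_sweeps):
            U = {
                s: R[s] if s in terminals
                else R[s] + gamma * sum(p * U[s2] for s2, p in P[s][pi[s]].items())
                for s in states
            }
        # Policy improvement: switch to the greedy action wherever it is strictly better.
        unchanged = True
        for s in pi:
            def q(a, s=s):
                return sum(p * U[s2] for s2, p in P[s][a].items())
            best = max(actions, key=q)
            if q(best) > q(pi[s]):
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi, U

# Example (reusing the dictionaries from the value-iteration sketch):
# pi, U = policy_iteration(states, actions, P, R, terminals={"goal"})
```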
Decision Theoretic Agent
• Agent that must act in uncertain
environments. Actions may be non-
deterministic.
Stationarity and Markov
assumption
• A stationary process is one whose
dynamics, i.e., P(Xt|Xt-1,…,X0) for t>0, are
assumed not to change with time.
• The Markov assumption is that the
current state Xt is dependent only on a
finite history of previous states. In this
class, we will only consider first-order
Markov processes for which
P(Xt|Xt-1,…,X0) =P(Xt|Xt-1)
Transition model and sensor
model
• For first-order Markov processes, the laws
describing how the process state evolves with
time are contained entirely within the
conditional distribution P(Xt|Xt-1), which is
called the transition model for the process.
• We will also assume that the observable state
variables (evidence) Et are dependent only on
the state variables Xt. P(Et|Xt) is called the
sensor or observational model.
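As a small illustration of a transition model and a sensor model used together (loosely echoing the lane-position example later in these slides; every state name and probability below is invented), with one basic filtering step that combines them:

```python
# Transition model P(X_t | X_{t-1}) and sensor model P(E_t | X_t) for a two-state process.
transition_model = {              # P(next_state | current_state)
    "in_lane":  {"in_lane": 0.9, "drifting": 0.1},
    "drifting": {"in_lane": 0.4, "drifting": 0.6},
}
sensor_model = {                  # P(observation | current_state)
    "in_lane":  {"centered": 0.8, "off_center": 0.2},
    "drifting": {"centered": 0.3, "off_center": 0.7},
}

def filter_step(belief, observation):
    """Predict with the transition model, then update with the sensor model (normalized)."""
    predicted = {
        x: sum(belief[x0] * transition_model[x0][x] for x0 in belief)
        for x in transition_model
    }
    unnormalized = {x: sensor_model[x][observation] * predicted[x] for x in predicted}
    z = sum(unnormalized.values())
    return {x: p / z for x, p in unnormalized.items()}

print(filter_step({"in_lane": 0.5, "drifting": 0.5}, "off_center"))
```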
Decision Theoretic Agent – implemented

Sensor model
[Figure: generic example — part of a steam train controller]

Sensor model II
[Figure: model of the lane position sensor for an automated vehicle]
Dynamic belief network – generic structure
• Describes the action model, namely the
state evolution model.
State evolution model - example
Dynamic Decision Network
• Can handle uncertainty
• Can deal with continuous streams of
evidence
• Can handle unexpected events (they
have no fixed plan)
• Can handle sensor noise and sensor
failure
• Can act to obtain information
• Can handle large state spaces
