Markov Decision Process Tutorial
Intro to AI 096210
Erez Karpas
Faculty of Industrial Engineering & Management
Technion
December 22, 2011
Markov Decision Process
A Markov Decision Process (MDP) is a stochastic planning
problem
Stationary Markovian Dynamics
The rewards and transitions depend only on the current state
Fully observable
We might not know where we're going, but we always know where
we are
Decision-theoretic planning
We want to maximize expected reward
Markov Decision Process: Formal Definition
An MDP is a tuple $\langle S, A, R, T \rangle$
$S$ is a finite set of states
$A$ is a finite set of actions
$R : S \to [0, r_{\max}]$ is the reward function
Rewards are bounded
$T : S \times A \times S \to [0, 1]$ is the transition function
The probability of going from $s$ to $s'$ after applying $a$ is $T(s, a, s')$
Where is the initial state?
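To make the tuple concrete, here is a minimal sketch of an MDP container in Python; the names (`MDP`, `is_valid`, `State`, `Action`) are illustrative choices, not part of the tutorial.

    from typing import Dict, List, NamedTuple, Tuple

    State = str
    Action = str

    class MDP(NamedTuple):
        states: List[State]                                     # S: finite set of states
        actions: List[Action]                                   # A: finite set of actions
        reward: Dict[State, float]                              # R : S -> [0, r_max]
        trans: Dict[Tuple[State, Action], Dict[State, float]]   # T : S x A x S -> [0, 1]

    def is_valid(mdp: MDP) -> bool:
        """Check that T(s, a, .) is a probability distribution for every (s, a)."""
        return all(abs(sum(dist.values()) - 1.0) < 1e-9
                   for dist in mdp.trans.values())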
Markov Decision Process: Example
Shamelessly stolen from Andrew Moore
You run a startup company. In every decision period, you must
choose between Saving money and Advertizing.
S = {Poor&Unknown, Poor&Famous, Rich&Unknown, Rich&Famous}
A = {Save, Advertize}
$R(s) = \begin{cases} 0 & \text{if } s = \text{Poor\&}X \\ 10 & \text{if } s = \text{Rich\&}X \end{cases}$
$T$: see the next slide
Markov Decision Process: Graphic Example
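The transition diagram itself is not reproduced in this text version. As a substitute, here is the example written out as data, reusing the `MDP` container sketched earlier. The transition probabilities are a reconstruction based on Andrew Moore's original example and are consistent with the value iteration and policy iteration tables later in the tutorial (with $\gamma = 0.9$); the Rich&Famous/Advertize entry is barely constrained by those tables and is therefore an assumption.

    # The startup example as data; reuses the MDP container from the earlier sketch.
    PU, PF, RU, RF = "Poor&Unknown", "Poor&Famous", "Rich&Unknown", "Rich&Famous"
    SAVE, ADV = "Save", "Advertize"
    GAMMA = 0.9

    STARTUP = MDP(
        states=[PU, PF, RU, RF],
        actions=[SAVE, ADV],
        reward={PU: 0.0, PF: 0.0, RU: 10.0, RF: 10.0},
        trans={
            (PU, SAVE): {PU: 1.0},
            (PU, ADV):  {PU: 0.5, PF: 0.5},
            (PF, SAVE): {PU: 0.5, RF: 0.5},
            (PF, ADV):  {PF: 1.0},
            (RU, SAVE): {PU: 0.5, RU: 0.5},
            (RU, ADV):  {PU: 0.5, PF: 0.5},
            (RF, SAVE): {RU: 0.5, RF: 0.5},
            (RF, ADV):  {PU: 0.5, PF: 0.5},  # assumption: least constrained by the tables
        },
    )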
Markov Decision Process: Solution
How do we solve an MDP?
What does a solution for an MDP look like?
A solution to an MDP is a policy $\pi : S \to A$
Given that I'm in state $s$, I should apply action $\pi(s)$
This is why we need full observability
What is an optimal policy?
Markov Decision Process: Policy
We can compute the expected value of following a fixed policy $\pi$ at
state $s$:
$V^\pi(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') V^\pi(s')$
$\gamma$ is a discount factor
It makes sure the infinite sum converges
It can also be explained by interest rates, mortality, . . .
Value is immediate reward plus discounted expected future
reward
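A minimal sketch of evaluating a fixed policy by repeatedly applying this equation (iterative policy evaluation), assuming the `MDP` container and the `STARTUP`/`GAMMA` example from the earlier sketches:

    from typing import Dict

    def evaluate_policy(mdp: MDP, policy: Dict[str, str],
                        gamma: float, iters: int = 1000) -> Dict[str, float]:
        """Compute V^pi by iterating V(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') V(s')."""
        v = {s: 0.0 for s in mdp.states}
        for _ in range(iters):
            v = {s: mdp.reward[s] + gamma * sum(p * v[s2]
                     for s2, p in mdp.trans[(s, policy[s])].items())
                 for s in mdp.states}
        return v

    # For example, the "always Advertize" policy gives roughly
    # {PU: 0, PF: 0, RU: 10, RF: 10}.
    print(evaluate_policy(STARTUP, {s: ADV for s in STARTUP.states}, GAMMA))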
Markov Decision Process: Optimal Policy Value
An optimal policy maximizes $V^\pi(s)$ for all states
Is the optimal policy unique? No
Is the value of an optimal policy unique? Yes
We denote the value of an optimal policy at state $s$ by $V^*(s)$
$V^*(s)$ is unique
Markov Decision Process: Using V*
If we know $V^*$, we can simply choose the best action for each
state
The best action maximizes:
$R(s) + \gamma \sum_{s'} T(s, a, s') V^*(s')$
So we want to find $V^*$
Markov Decision Process: Value Iteration
How do we find $V^*$? Value Iteration
$V^0(s) = R(s)$
$V^t(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s') V^{t-1}(s')$
Converges: $V^t \to V^*$
Stop when
$\max_s |V^t(s) - V^{t-1}(s)| < \epsilon$
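A sketch of value iteration with the stopping rule above, again assuming the `MDP`/`STARTUP`/`GAMMA` definitions from the earlier sketches:

    from typing import Dict

    def value_iteration(mdp: MDP, gamma: float, eps: float = 1e-4) -> Dict[str, float]:
        v = dict(mdp.reward)                          # V^0(s) = R(s)
        while True:
            v_new = {s: mdp.reward[s] + gamma * max(
                         sum(p * v[s2] for s2, p in mdp.trans[(s, a)].items())
                         for a in mdp.actions)
                     for s in mdp.states}
            if max(abs(v_new[s] - v[s]) for s in mdp.states) < eps:   # stop: max change < eps
                return v_new
            v = v_new

    v_star = value_iteration(STARTUP, GAMMA)
    # Printing v_new after each sweep should reproduce the table on the next slide;
    # the result is roughly {PU: 31.6, PF: 38.6, RU: 44.0, RF: 54.2}.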
Markov Decision Process: Value Iteration Example

$V^0(s) = R(s)$
$V^t(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s') V^{t-1}(s')$
$\gamma = 0.9$

  t     V^t(PU)   V^t(PF)   V^t(RU)   V^t(RF)
  0     0         0         10        10
  1     0         4.5       14.5      19
  2     2.03      8.55      16.525    25.08
  3     4.75      12.2      18.34     28.72
  4     7.62      15.07     20.39     31.18
  5     10.21     17.46     22.61     33.2
  ...   ...       ...       ...       ...
  100   31.58     38.6      44.02     54.2
Markov Decision Process: Policy from Values

$\pi(s) := \mathrm{argmax}_a \left[ R(s) + \gamma \sum_{s'} T(s, a, s') V^*(s') \right]$

  s     V*       π(s)
  PU    31.58    A
  PF    38.6     S
  RU    44.02    S
  RF    54.2     S
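A sketch of extracting the greedy policy from $V^*$, which should reproduce the table above (assuming the earlier `MDP`/`STARTUP`/`GAMMA` sketches and the `v_star` computed by value iteration):

    from typing import Dict

    def greedy_policy(mdp: MDP, v: Dict[str, float], gamma: float) -> Dict[str, str]:
        """pi(s) := argmax_a [ R(s) + gamma * sum_s' T(s, a, s') V(s') ]."""
        return {s: max(mdp.actions,
                       key=lambda a: mdp.reward[s] + gamma * sum(
                           p * v[s2] for s2, p in mdp.trans[(s, a)].items()))
                for s in mdp.states}

    print(greedy_policy(STARTUP, v_star, GAMMA))
    # expected: Advertize at Poor&Unknown, Save everywhere else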
Markov Decision Process: Policy Iteration
Another algorithm to find an optimal policy:
1. Initialize a policy $\pi$ arbitrarily
2. Evaluate $V^\pi(s)$ for all states $s \in S$
3. $\pi'(s) := \mathrm{argmax}_a \sum_{s'} T(s, a, s') V^\pi(s')$
4. If $\pi \neq \pi'$, set $\pi := \pi'$ and go to step 2
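A sketch of this loop, reusing `evaluate_policy` and `greedy_policy` from the earlier sketches (the improvement step below includes $R(s)$ and $\gamma$, which does not change the argmax):

    from typing import Dict

    def policy_iteration(mdp: MDP, gamma: float) -> Dict[str, str]:
        policy = {s: mdp.actions[0] for s in mdp.states}   # 1. arbitrary initial policy
        while True:
            v = evaluate_policy(mdp, policy, gamma)        # 2. evaluate V^pi
            improved = greedy_policy(mdp, v, gamma)        # 3. greedy improvement
            if improved == policy:                         # 4. stop once the policy is stable
                return policy
            policy = improved

    print(policy_iteration(STARTUP, GAMMA))
    # expected: the same policy as in the policy-from-values table above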
Markov Decision Process: Policy Iteration Example

  t    π_t(PU)  π_t(PF)  π_t(RU)  π_t(RF)   V^{π_t}(PU)  V^{π_t}(PF)  V^{π_t}(RU)  V^{π_t}(RF)
  0    A        A        A        A         0            0            10           10
  1    A        S        S        S         31.58        38.6         44.02        54.2
  2    A        S        S        S         Done
Value Iteration vs. Policy Iteration
Which is better? It depends
VI takes more iterations than PI, but PI requires more time on
each iteration
Lots of actions? PI
Already got a fair policy? PI
Few actions, acyclic? VI
Also possible to mix
Solving an MDP without the Model
What if we do not have access to the model?
We don't know the transition probabilities $T$
We don't know the reward function $R$
Then we can't compute a policy offline
We must choose an action online
Reinforcement Learning
The model
At every time step, the agent sees the current state s and the
applicable actions at s
After choosing an action to execute, the agent receives a reward
There are many RL algorithms
We will focus on Q-Learning
Q-Learning
We define $Q : S \times A \to [0, r_{\max}]$
$Q(s, a)$ is the best value we can expect after taking action $a$ in
state $s$
$Q(s, a) = R(s) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a')$
$Q(s, a)$ is the immediate reward plus the discounted expected future
reward if we choose the best action in the next state
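When the model is known, $Q$ can be computed by iterating this equation, in direct analogy to value iteration; a sketch assuming the earlier `MDP`/`STARTUP`/`GAMMA` definitions:

    from typing import Dict, Tuple

    def q_value_iteration(mdp: MDP, gamma: float,
                          eps: float = 1e-4) -> Dict[Tuple[str, str], float]:
        q = {(s, a): 0.0 for s in mdp.states for a in mdp.actions}
        while True:
            q_new = {(s, a): mdp.reward[s] + gamma * sum(
                         p * max(q[(s2, a2)] for a2 in mdp.actions)
                         for s2, p in mdp.trans[(s, a)].items())
                     for s in mdp.states for a in mdp.actions}
            if max(abs(q_new[k] - q[k]) for k in q) < eps:
                return q_new
            q = q_new

    # max_a Q(s, a) recovers V*(s); Q-learning below estimates Q without the model.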
Learning Q
Suppose our agent performed action a in state s
It moved to some state $s'$, and got some reward $R(s)$
We can update $Q(s, a)$:
$Q(s, a) := Q(s, a) + \alpha \left[ R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$
$\alpha$ is the learning rate: how much weight to give new vs. past
knowledge
Under some (realistic?) assumptions, Q-learning will converge to
optimal Q
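A minimal sketch of the tabular update; note the agent only needs the observed transition (state $s$, action $a$, reward $r$, next state $s'$), not $T$ or $R$. The names (`q_update`, `ALPHA`) and the learning-rate value are illustrative.

    from collections import defaultdict

    Q = defaultdict(float)      # Q[(state, action)], initialized to 0
    ALPHA = 0.1                 # learning rate alpha (illustrative value)
    GAMMA = 0.9                 # discount factor gamma

    def q_update(s, a, r, s2, actions):
        """Q(s, a) := Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))."""
        best_next = max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])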
Q-Learning: Exploration/Exploitation
Suppose we're in the middle of Q-learning
We're at state $s$
We have some estimate for $Q(s, a)$, for any applicable action $a$
Which action to choose?
We can choose an action greedily: the one that maximizes
$Q(s, a)$
But we might not know about the best action, and miss out
We want a policy that is greedy in the limit of infinite exploration
(GLIE)
GLIE Policies
Need to make exploitation more likely as more knowledge is
gained
One of the most popular GLIE policies is Boltzmann exploration
Choose action $a$ with probability proportional to
$e^{Q(s, a)/T}$
T is the temperature, which decreases with time
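A sketch of Boltzmann (softmax) action selection over the `Q` table from the previous sketch; the temperature schedule in the comment is an illustrative choice, not from the tutorial.

    import math
    import random

    def boltzmann_action(s, actions, temperature):
        """Pick action a with probability proportional to exp(Q(s, a) / temperature)."""
        weights = [math.exp(Q[(s, a)] / temperature) for a in actions]
        return random.choices(actions, weights=weights)[0]

    # e.g. decay the temperature over time so exploitation dominates in the limit:
    #   temperature = max(0.01, 1.0 * (0.99 ** step))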