
UNIT IV: Markov Decision Processes & Reinforcement Learning
• Stochastic Process
• Markov Property
• Markov Chain
• Markov Decision Process
• Reinforcement Learning
• RL Techniques
• Example Applications
Stochastic Process
• Quick definition: A Random Process
• Often viewed as a collection of indexed random
variables
• Useful to us: Set of states with probabilities of
being in those states indexed over time
• We’ll deal with discrete stochastic processes
Stochastic Process Example
• Classic: Random Walk
• Start at state X0 at time t0
• At time ti, move a step Zi where
P(Zi = -1) = p and P(Zi = 1) = 1 - p
• At time ti, state Xi = X0 + Z1 +…+ Zi

http://en.wikipedia.org/wiki/Image:Random_Walk_example.png
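A minimal simulation sketch in Python (the step count and p = 0.5 are illustrative choices, not values from the slide):

```python
import random

def random_walk(x0=0, steps=20, p=0.5):
    """Simulate X_i = X_0 + Z_1 + ... + Z_i with P(Z_i = -1) = p, P(Z_i = +1) = 1 - p."""
    x = x0
    path = [x]
    for _ in range(steps):
        z = -1 if random.random() < p else 1   # draw step Z_i
        x += z
        path.append(x)
    return path

print(random_walk())
```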
Markov Property
• Also thought of as the “memoryless” property
• A stochastic process is said to have the Markov
property if the probability of state Xn+1 having any
given value depends only upon state Xn
• Whether the property holds depends very much on how the states are described
Markov Property Example
• Checkers:
• Current State: The current configuration of the board
• Contains all information needed for transition to next
state
• Thus, each configuration can be said to have the Markov
property
Markov Chain
• Discrete-time stochastic process with the Markov property
• Industry example: Google’s PageRank algorithm
• Probability distribution representing the likelihood of random link-following ending up on a given page
http://en.wikipedia.org/wiki/PageRank
Markov Decision Process (MDP)
• Discrete time stochastic control process
• Extension of Markov chains
• Differences:
• Addition of actions (choice)
• Addition of rewards (motivation)
• If the actions are fixed, an MDP reduces to a
Markov chain
Description of MDPs
• Tuple (S, A, P(.,.), R(.))
• S -> state space
• A -> action space
• Pa(s, s’) = Pr(st+1 = s’ | st = s, at = a)
• R(s) = immediate reward at state s
• Goal is to maximize some cumulative function of
the rewards
• Finite MDPs have finite state and action spaces
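As a rough illustration of the tuple, a tiny MDP can be written out as plain Python data structures (the states, actions, and numbers below are made up for the example):

```python
# Hypothetical two-state MDP, spelled out as the tuple (S, A, P, R).
S = ["s0", "s1"]                       # state space
A = ["stay", "go"]                     # action space
P = {                                  # P[a][(s, s')] = Pr(s_{t+1} = s' | s_t = s, a_t = a)
    "stay": {("s0", "s0"): 1.0, ("s1", "s1"): 1.0},
    "go":   {("s0", "s1"): 1.0, ("s1", "s0"): 1.0},
}
R = {"s0": 0.0, "s1": 1.0}             # R(s) = immediate reward at state s
gamma = 0.9                            # discount factor for the cumulative reward
```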
Simple MDP Example
• Recycling MDP Robot
• Can search for a trashcan, wait for someone to bring a trashcan, or go home and recharge its battery
• Has two energy levels – high and low
• Searching runs down battery, waiting does not, and a
depleted battery has a very low reward
Transition Probabilities
s = st s’ = st+1 a = at Pass’ Rass’
high high search α Rsearch
high low search 1-α Rsearch
low high search 1-β -3
low low search β Rsearch
high high wait 1 Rwait
high low wait 0 Rwait
low high wait 0 Rwait
low low wait 1 Rwait
low high recharge 1 0
low low recharge 0 0
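The table can be encoded directly as data; this sketch plugs in placeholder numbers for α, β, Rsearch, and Rwait, which the slides leave symbolic:

```python
alpha, beta = 0.9, 0.6              # placeholder probabilities (symbolic in the table)
R_search, R_wait = 2.0, 1.0         # placeholder rewards (symbolic in the table)

# (s, a, s') -> (transition probability, expected reward), row for row from the table
recycling = {
    ("high", "search",   "high"): (alpha,     R_search),
    ("high", "search",   "low"):  (1 - alpha, R_search),
    ("low",  "search",   "high"): (1 - beta,  -3.0),    # battery ran out, robot rescued
    ("low",  "search",   "low"):  (beta,      R_search),
    ("high", "wait",     "high"): (1.0,       R_wait),
    ("high", "wait",     "low"):  (0.0,       R_wait),
    ("low",  "wait",     "high"): (0.0,       R_wait),
    ("low",  "wait",     "low"):  (1.0,       R_wait),
    ("low",  "recharge", "high"): (1.0,       0.0),
    ("low",  "recharge", "low"):  (0.0,       0.0),
}
```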
Transition Graph

[Figure: transition graph with state nodes and action nodes]
Solution to an MDP = Policy π
• Gives the action to take from a given state
regardless of history
• Two arrays indexed by state
• V is the value function, namely the discounted sum of
rewards on average from following a policy
• π is an array of actions to be taken in each state (Policy)

Two basic steps, iterated until the policy stabilizes:
1. V(s) := R(s) + γ ∑s' Pπ(s)(s, s') V(s')
2. π(s) := argmaxa ∑s' Pa(s, s') V(s')
Variants
• Value Iteration
• Policy Iteration
• Modified Policy Iteration
• Prioritized Sweeping
Value Iteration

V(s) := R(s) + γ maxa ∑s' Pa(s, s') V(s')

k    Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
1    0        0        10       10
2    0        4.5      14.5     19
3    2.03     8.55     18.55    24.18
4    4.76     11.79    19.26    29.23
5    7.45     15.30    20.81    31.82
6    10.23    17.67    22.72    33.68
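A minimal value-iteration sketch over the dictionary-based (S, A, P, R) representation from the earlier MDP example; the stopping threshold theta is an arbitrary choice:

```python
def value_iteration(S, A, P, R, gamma=0.9, theta=1e-6):
    """Sweep V(s) := R(s) + gamma * max_a sum_s' P_a(s, s') V(s') until the values stabilize."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            best = max(
                sum(p * V[s2] for (s1, s2), p in P[a].items() if s1 == s)
                for a in A
            )
            new_v = R[s] + gamma * best
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V
```

The greedy policy can then be read off as π(s) = argmaxa ∑s' Pa(s, s') V(s').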
Why So Interesting?
• If the transition probabilities are known, this
becomes a straightforward computational problem,
however…
• If the transition probabilities are unknown, then
this is a problem for reinforcement learning.
Typical Agent
• In reinforcement learning (RL), the agent observes
a state and takes an action.

• Afterward, the agent receives a reward.


Mission: Optimize Reward
• Rewards are calculated in the environment
• Used to teach the agent how to reach a goal state
• Must signal what we ultimately want achieved, not
necessarily subgoals
• May be discounted over time
• In general, seek to maximize the expected return
Value Functions
• Vπ is a value function: how good is it to be in this state?
• Vπ is the unique solution to its Bellman equation
• The Bellman equation expresses the relationship between the value of a state and the values of its successor states

[Figure: state-value function for policy π and its Bellman equation]
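In the Sutton and Barto (1998) notation, the state-value function and its Bellman equation are:

```latex
V^{\pi}(s) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s \right]
           = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]
```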
Another Value Function
• Qπ defines the value of taking action a in state s
under policy π
• Expected return starting from s, taking action a,
and thereafter following policy π

[Figure: action-value function for policy π]
[Figure: backup diagrams for (a) Vπ and (b) Qπ]
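In the same notation, the action-value function for policy π is:

```latex
Q^{\pi}(s,a) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s,\, a_t = a \right]
             = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right]
```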


Dynamic Programming
• Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP
• The classical view is not so useful in practice since
we rarely have a perfect environment model
• Provides foundation for other methods
• Not practical for large problems
DP Continued…
• Use value functions to organize and structure
the search for good policies.
• Turn Bellman equations into update rules.
• Iterative policy evaluation using full backups
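A minimal sketch of iterative policy evaluation with full backups, assuming the same dictionary-based MDP as before and a deterministic policy pi stored as a state-to-action dict:

```python
def policy_evaluation(S, P, R, pi, gamma=0.9, theta=1e-6):
    """Back up V(s) under the fixed policy pi with full sweeps until the values stop changing."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            a = pi[s]                      # action the policy takes in s
            new_v = R[s] + gamma * sum(
                p * V[s2] for (s1, s2), p in P[a].items() if s1 == s
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V
```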
Policy Improvement
• When should we change the policy?
• If we pick a new action a in state s, thereafter follow the current policy, and the resulting policy π' satisfies Vπ' ≥ Vπ, then always picking a in state s gives a policy at least as good overall.
• Results from the policy improvement theorem
Policy Iteration
• Continue improving
the policy π and
recalculating V(π)
• A finite MDP has a
finite number of
policies, so
convergence is
guaranteed in a finite
number of iterations
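A sketch of the full loop, alternating the policy_evaluation helper above with a greedy improvement step until the policy stops changing:

```python
def policy_iteration(S, A, P, R, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement; stops when pi is stable."""
    pi = {s: A[0] for s in S}              # arbitrary initial policy
    while True:
        V = policy_evaluation(S, P, R, pi, gamma)
        stable = True
        for s in S:
            best_a = max(
                A,
                key=lambda a: sum(p * V[s2] for (s1, s2), p in P[a].items() if s1 == s),
            )
            if best_a != pi[s]:
                pi[s] = best_a
                stable = False
        if stable:
            return pi, V
```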
Remember Value Iteration?

Used to truncate policy iteration by combining


one sweep of policy evaluation and one of policy
improvement in each of its sweeps.
Monte Carlo Methods
• Requires only episodic experience – on-line or
simulated
• Based on averaging sample returns
• Value estimates and policies only changed at the
end of each episode, not on a step-by-step basis
Policy Evaluation
• Compute average
returns as the
episode runs
• Two methods:
first-visit and
every-visit
• First-visit is most
widely studied

[Algorithm: first-visit MC method]
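A rough first-visit MC sketch; generate_episode is a hypothetical stand-in for whatever simulates one episode under the policy and returns (state, subsequent reward) pairs:

```python
from collections import defaultdict

def first_visit_mc(generate_episode, num_episodes=1000, gamma=1.0):
    """Estimate V(s) by averaging the return that follows the first visit to s in each episode."""
    total, count = defaultdict(float), defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode()       # [(s_t, r_{t+1}), ...]
        G = 0.0
        first_return = {}
        for s, r in reversed(episode):     # accumulate returns backwards through the episode
            G = gamma * G + r
            first_return[s] = G            # ends up holding the earliest (first-visit) return
        for s, g in first_return.items():
            total[s] += g
            count[s] += 1
    return {s: total[s] / count[s] for s in total}
```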
Estimation of Action Values
• State values are not enough without a model – we
need action values as well
• Qπ(s, a): expected return when starting in state s, taking action a, and thereafter following policy π
• Exploration vs. Exploitation
• Exploring starts
Example Monte Carlo Algorithm

[Algorithm: first-visit Monte Carlo assuming exploring starts]


Another MC Algorithm

[Algorithm: on-line, first-visit, ε-greedy MC without exploring starts]


Temporal-Difference Learning
• Central and novel to reinforcement learning
• Combines Monte Carlo and DP methods
• Can learn from experience w/o a model – like MC
• Updates estimates based on other learned
estimates (bootstraps) – like DP
TD(0)

• Simplest TD method
• Uses sample backup from single successor state
or state-action pair instead of full backup of DP
methods
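A minimal TD(0) backup sketch; V is a dict of state values and alpha is the step-size parameter:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move V(s) a step toward the sampled target r + gamma * V(s')."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```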
SARSA – On-policy Control

• Quintuple of events (st, at, rt+1, st+1, at+1)


• Continually estimate Qπ while changing π
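A sketch of one SARSA backup over the quintuple (st, at, rt+1, st+1, at+1), with Q stored as a dict keyed by (state, action):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy backup: the target uses the action a_next actually chosen by the current policy."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
```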
Q-Learning – Off-policy Control

• Learned action-value function, Q, directly


approximates Q*, the optimal action-value
function, independent of policy being followed
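A sketch of one Q-learning backup; the max over next actions is what makes the update off-policy:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy backup: the target uses the greedy value max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```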
Case Study
• Job-shop Scheduling
• Temporal and resource constraints
• Find constraint-satisfying schedules of short
duration
• In its general form, NP-complete
NASA Space Shuttle Payload
Processing Problem (SSPPP)
• Schedule tasks required for installation and testing
of shuttle cargo bay payloads
• Typical: 2-6 shuttle missions, each requiring 34-164
tasks
• Zhang and Dietterich (1995, 1996; Zhang, 1996)
• First successful instance of RL applied in plan-space
• states = complete plans
• actions = plan modifications
SSPPP – continued…
• Each state was an entire schedule
• Two types of actions:
• REASSIGN-POOL operators – reassign a resource to a different pool
• MOVE operators – move a task to the first earlier or later time at which its resource constraints are satisfied
• Small negative reward for each step
• Resource dilation factor (RDF) formula for
rewarding final schedule’s duration
Even More SSPPP…
• Used TD() to learn value function
• Actions selected by decreasing ε-greedy policy with
one-step lookahead
• Function approximation used multilayer neural
networks
• Training generally took 10,000 episodes
• Each resulting network represented a different scheduling algorithm – not a schedule for a specific instance!
RL and CBR
• Example: CBR used to store various policies and RL
used to learn and modify those policies
• Ashwin Ram and Juan Carlos Santamaría, 1993
• Autonomous Robotic Control
• Job shop scheduling: RL used to repair schedules,
CBR used to determine which repair to make
• Similar methods can be used for IDSS
References
• Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
• Stochastic Processes, www.hanoivn.net
• http://en.wikipedia.org/wiki/PageRank
• http://en.wikipedia.org/wiki/Markov_decision_process
• Zeng, D. and Sycara, K. Using Case-Based Reasoning as a Reinforcement Learning Framework for Optimization with Changing Criteria, 1995.
