
Reinforcement Learning

Mitchell, Ch. 13
(see also the Sutton & Barto book, available online)
Rationale
• Learning from experience
• Adaptive control
• Examples not explicitly labeled, delayed
feedback
• Problem of credit assignment – which
action(s) led to payoff?
• Tradeoff: short-term gain (immediate reward) vs. long-term consequences
Agent Model
• Transition function – T:SxA->S, environment
• Reward function R:SxA->real, payoff
• Stochastic but Markov:
  P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ...) = P(s_{t+1} | s_t, a_t)

• Policy = decision function, π: S -> A


• “rationality” – maximize long term expected
reward
– Discounted long-term reward (convergent series)
– Alternatives: finite time horizon, uniform weights
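A sketch of the discounted long-term reward being maximized (standard definition; γ is the discount factor and r_t the reward received at step t):

    V^{\pi}(s_t) = E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \right] = E\left[ \sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \right], \qquad 0 \le \gamma < 1

The series converges because 0 ≤ γ < 1 and rewards are bounded, which is what makes the infinite-horizon objective well defined.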
[Figure: agent-environment interaction loop, with reward function R and transition function T]
Markov Decision Processes (MDPs)
• if R and T (= P) are known, solve for the value function V(s)
• policy evaluation
• Bellman Equations
• dynamic programming (|S| eqns in |S| unknowns)
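For a fixed policy π, the Bellman equations take the standard form

    V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} T(s, \pi(s), s')\, V^{\pi}(s')

which is |S| linear equations in the |S| unknowns V^π(s). A minimal policy-evaluation sketch in Python, assuming a small tabular MDP stored as arrays T[s, a, s'] (transition probabilities, the P above) and R[s, a]; all names and shapes are illustrative:

    import numpy as np

    def evaluate_policy(T, R, policy, gamma=0.9):
        # T[s, a, s2]: transition probabilities, R[s, a]: expected reward,
        # policy[s]: action taken in state s (assumed representation)
        n_states = R.shape[0]
        P_pi = T[np.arange(n_states), policy]   # (S, S): transitions under the policy
        r_pi = R[np.arange(n_states), policy]   # (S,):  rewards under the policy
        # V = r_pi + gamma * P_pi V  =>  (I - gamma * P_pi) V = r_pi
        return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)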
MDPs
• finding optimal policies
• Value iteration – update V(s) iteratively until the greedy policy
  π(s) = argmax_a [ R(s,a) + γ Σ_s' T(s,a,s') V(s') ]
  stops changing
• Policy iteration – iterate between choosing π and updating V over all states
• Monte Carlo sampling: run random scenarios using π and take average rewards as V(s)
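A value-iteration sketch in Python under the same assumed representation (T[s, a, s'] as transition probabilities, R[s, a] as expected rewards):

    import numpy as np

    def value_iteration(T, R, gamma=0.9, tol=1e-6):
        n_states, n_actions = R.shape
        V = np.zeros(n_states)
        while True:
            # Q[s, a] = R[s, a] + gamma * sum_s' T[s, a, s'] * V[s']
            Q = R + gamma * (T @ V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V, Q.argmax(axis=1)   # optimal values and greedy policy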
Q-learning: model-free
• Q-function: reformulate as a value function of both S and A, independent of explicit models of R and T (= P)
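In symbols, for the deterministic transition function T: S x A -> S from the agent-model slide (a standard sketch, not an exact transcription):

    Q(s, a) = R(s, a) + \gamma V^{*}(T(s, a)), \qquad \pi^{*}(s) = \arg\max_{a} Q(s, a)

Knowing Q is enough to act optimally: the agent simply compares Q(s, a) across actions, with no need to evaluate R or T itself.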
Q-learning algorithm
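A minimal tabular Q-learning sketch in Python with ε-greedy action selection. The environment interface (reset, step, n_actions) and the learning rate α are assumptions for illustration; Mitchell's deterministic version corresponds to α = 1:

    import random
    from collections import defaultdict

    def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
        # env is an assumed interface: reset() -> s, step(a) -> (s_next, r, done)
        Q = defaultdict(float)                  # Q[(s, a)], defaults to 0
        actions = list(range(env.n_actions))    # assumed attribute
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                # epsilon-greedy: explore with probability epsilon, else exploit
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda a_: Q[(s, a_)])
                s_next, r, done = env.step(a)
                # move Q(s, a) toward r + gamma * max_a' Q(s', a')
                best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q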
Convergence
• Theorem: Q converges to Q* if every state-action pair is visited infinitely often (assuming bounded rewards and 0 ≤ γ < 1)
• Proof idea: with each iteration in which all of SxA is visited, the magnitude of the largest error in the Q table shrinks by at least a factor of γ (sketched below)
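The key contraction step, in standard notation: let Δ_n be the largest error in the table after iteration n. When the pair (s, a) with successor state s' is updated,

    |\hat{Q}_{n+1}(s,a) - Q(s,a)|
        = \gamma \,\bigl|\max_{a'} \hat{Q}_n(s',a') - \max_{a'} Q(s',a')\bigr|
        \le \gamma \max_{a'} \bigl|\hat{Q}_n(s',a') - Q(s',a')\bigr|
        \le \gamma \Delta_n

so once every pair has been updated, the worst-case error has shrunk by at least a factor of γ.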
• “on-policy” Training
– exploitation vs. exploration
– will relevant parts of the space be explored if the agent sticks to its current (sub-optimal) policy?
– ε-greedy policies: choose the action with the max Q value most of the time, and a random action with probability ε
• “off-policy”
– learn from simulations or traces
– SARSA: database of training-example tuples <s, a, r, s', a'> (SARSA vs. Q-learning updates contrasted below)
• Actor-critic
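For contrast, the two standard one-step update rules (α is a learning rate; a' is the next action actually taken under the current policy):

    SARSA (on-policy):       Q(s,a) \leftarrow Q(s,a) + \alpha\,[\, r + \gamma\, Q(s',a') - Q(s,a) \,]
    Q-learning (off-policy): Q(s,a) \leftarrow Q(s,a) + \alpha\,[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,]

SARSA bootstraps from the action the behaviour policy actually takes, while Q-learning bootstraps from the greedy action regardless of what is executed.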
Non-deterministic case
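A sketch of the usual update for the non-deterministic case (as in Mitchell Ch. 13), where the learning rate α_n decays with the number of times (s, a) has been visited:

    \hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\,\hat{Q}_{n-1}(s,a) + \alpha_n \bigl[\, r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \,\bigr], \qquad \alpha_n = \frac{1}{1 + \mathrm{visits}_n(s,a)}

Averaging over repeated visits is what handles stochastic rewards and transitions; convergence to Q* still holds under the usual conditions on α_n.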
Temporal Difference Learning
• convergence is not the problem
• representation of large Q table is the
problem (domains with many states or
continuous actions)
• how to represent large Q tables?
– neural network
– function approximation
– basis functions
– hierarchical decomposition of state space
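A minimal sketch of one option from the list above, linear function approximation over hand-chosen basis functions; the feature map phi and all names are illustrative assumptions:

    import numpy as np

    def q_hat(w, phi_sa):
        # approximate Q(s, a) ~ w . phi(s, a) instead of storing a table entry
        return float(w @ phi_sa)

    def td_update(w, phi_sa, r, phi_next, alpha=0.01, gamma=0.9):
        # one gradient-style TD/Q update on the weight vector w:
        #   phi_sa   = phi(s, a) for the visited state-action pair
        #   phi_next = phi(s', a') for the greedy next action (zeros if terminal)
        td_target = r + gamma * q_hat(w, phi_next)
        td_error = td_target - q_hat(w, phi_sa)
        return w + alpha * td_error * phi_sa

The same structure carries over to a neural-network Q(s, a; w): the weight update just uses the network's gradient in place of phi_sa.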
