
REINFORCEMENT

LEARNING
(part 2)

Nguyen Do Van, PhD


Reinforcement Learning

§ Online Learning
§ Value Function Approximation
§ Policy Gradients

2
Reinforcement learning: Recall

§ Making good decisions in new tasks: a fundamental challenge in AI and ML
§ Learn to make a good sequence of decisions
§ Intelligent agents learning and acting
q Learning by trial-and-error, in real time
q Improve with experience
q Inspired by psychology:
• Agents + environment
• Agents select action to maximize cumulative rewards

3
Reinforcement learning: Recall

§ At each step t the agent:


q Executes action At
q Receives observation Ot
q Receives scalar reward Rt
§ The environment:
q Receives action At
q Emits observation Ot+1
q Emits scalar reward Rt+1
§ t increments at environment step
4
Reinforcement learning: Recall

§ Policy - maps current state to action


§ Value function - prediction of value for each state and action
§ Model - agent’s representation of the environment.

5
Markov Decision Process
(Model of the environment)

§ Terminologies:

6
Bellman’s equation

§ State value function (for a fixed policy with discount)

• State-action value function (Q-function)

• When S is a finite set of states, this is a system of linear equations (one per state)
• Bellman's equation in matrix form:
7
Optimal Value, Q and policy

§ Optimal V: the highest possible value for each s under any possible policy
§ Satisfies the Bellman equation:

§ Optimal Q-function:

§ Optimal policy:
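In standard notation, these conditions read:

$$v^*(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v^*(s') \Big]$$

$$q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} q^*(s',a')$$

$$\pi^*(s) = \arg\max_a q^*(s,a)$$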

8
Dynamic Programming (DP)

§ Assuming full knowledge of Markov Decision Process


§ It is used for planning in an MDP
§ For prediction
q Input: MDP (S,A,P,R,γ) and policy π
q Output: value function vπ
§ For control
q Input: MDP (S,A,P,R,γ)
q Output: Optimal value function v* and optimal policy π*
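As an illustration of the prediction case, here is a minimal sketch of iterative policy evaluation in Python. The data layout is an assumption for illustration: NumPy arrays P (shape [states, actions, states]) for transition probabilities, R (shape [states, actions]) for expected rewards, and policy (shape [states, actions]).

import numpy as np

def iterative_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Prediction with a fully known MDP: sweep Bellman expectation backups
    until the value function stops changing."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # v(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
            v_new = sum(policy[s, a] * (R[s, a] + gamma * P[s, a] @ V)
                        for a in range(n_actions))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V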
9
ONLINE LEARNING
Model-free Reinforcement Learning
Partially observable environment, Monte Carlo, TD, Q-Learning

10
Monte-Carlo Reinforcement Learning

§ MC methods learn directly from episodes of experience
§ MC is model-free: no knowledge of MDP
transitions / rewards
§ MC learns from complete episodes: no
bootstrapping
§ MC uses the simplest possible idea: value = mean
return
§ Caveat:
q Can only apply MC to episodic MDPs
q All episodes must terminate

11
Monte-Carlo Policy Evaluation

§ Goal: learn vπ from episodes of experience under policy π

§ Total discounted reward

§ Value function is expected return

§ Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return
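Written out in standard notation, the return and the value estimate are:

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$$

$$v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,] \approx \text{mean of the returns } G_t \text{ observed from state } s$$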

12
State Visit Monte-Carlo Policy Evaluation

§ To evaluate state s
§ At each time-step t that state s is visited in an episode:
q Visiting state s: first visit or every visit
§ Increment counter N(s) ← N(s) + 1
§ Increase total return S(s) ← S(s) + Gt
§ Value is estimated by mean return V(s) = S(s)/N(s)
§ By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
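A minimal first-visit Monte-Carlo evaluation sketch in Python; the episode format (a list of (state, reward) pairs collected under π) is an assumption for illustration:

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """First-visit Monte-Carlo policy evaluation from complete episodes."""
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # total return S(s)
    for episode in episodes:
        # compute the return G_t by working backwards through the episode
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state not in seen:          # first visit to this state only
                seen.add(state)
                N[state] += 1
                S[state] += G
    return {s: S[s] / N[s] for s in N}     # V(s) = S(s)/N(s)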

13
Incremental Monte-Carlo Updates

§ Learning from experience


§ Update V(s) incrementally after each complete episode
§ For each state St, with actual return Gt

§ With a constant learning rate α
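Concretely, the incremental updates are:

$$N(S_t) \leftarrow N(S_t) + 1, \qquad V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$$

and with a constant learning rate $\alpha$ (useful in non-stationary problems):

$$V(S_t) \leftarrow V(S_t) + \alpha \big(G_t - V(S_t)\big)$$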

14
Temporal-Difference Learning

§ Model-free: no knowledge of MDP


§ Does not wait for the end of an episode; learns from incomplete episodes by bootstrapping
§ Update value V(St) toward the estimated return (the TD target)
§ The difference between the TD target and V(St) is the TD error
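In the simplest case, TD(0), the update, target, and error are:

$$V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t, \qquad \delta_t = \underbrace{R_{t+1} + \gamma V(S_{t+1})}_{\text{TD target}} - V(S_t)$$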
15
Monte-Carlo and Temporal Difference

§ TD can learn before knowing the final outcome


q TD can learn online after every step
q MC must wait until end of episode before return is known
§ TD can learn without the final outcome
q TD can learn from incomplete sequences
q MC can only learn from complete sequences
q TD works in continuing (non-terminating) environments
q MC only works for episodic (terminating) environments

16
Monte-Carlo Backup

17
Temporal-Difference Backup

18
Dynamic Programming Backup

19
Bootstrapping and Sampling

§ Bootstrapping: update involves an estimate
q MC does not bootstrap
q DP bootstraps
q TD bootstraps
§ Sampling: update samples an
expectation
q MC samples
q DP does not sample
q TD samples

20
N-step prediction

§ n-step return

§ Define n-step return

§ n-step temporal-difference learning
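The n-step return and the corresponding update are:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

$$V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{(n)} - V(S_t)\big)$$

For n = 1 this is TD(0); as n → ∞ it recovers the Monte-Carlo return.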

21
On-policy Learning

§ Advantages of TD:
q Lower variance
q Online
q Works with incomplete sequences
§ Sarsa:
q Apply TD to Q(S,A)
q Use ϵ-greedy policy improvement
q Update every time-step

22
Sarsa Algorithm
Initialize Q(s,a) arbitrarily and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using a policy derived from Q (e.g. ϵ-greedy)
    Repeat (for each step of the episode):
        Take action A, observe R, S'
        Choose A' from S' using a policy derived from Q (e.g. ϵ-greedy)
        Q(S,A) ← Q(S,A) + α[R + γ Q(S',A') − Q(S,A)]
        S ← S'; A ← A'
    Until S is terminal

23
Off-Policy Learning

§ Evaluate target policy π(a|s) to compute vπ(s) or qπ(s,a)


§ While following behaviour policy μ(a|s):
$\{S_1, A_1, R_2, \ldots, S_T\} \sim \mu$
§ Advantages:
q Learning from observing human or other agents
q Reuse experience generated from old policies π1 , π2 , π3 , … , πt−1
q Learn about optimal policy while following exploratory policy
q Learn about multiple policies while following one policy

24
Q-Learning
§ Off-policy learning of the action-value function Q(s,a)
§ No importance sampling is required
§ Off-policy: the next action is chosen using the behaviour policy
§ Q-Learning: consider an alternative successor action under the target policy
§ Update Q(St,At) towards the value of the alternative action

§ Improve the policy greedily with respect to Q
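Written out, the Q-learning update towards the greedy alternative action is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big)$$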

25
Q-Learning

Update equation

Algorithm

Q-Learning Table version
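A minimal tabular Q-learning sketch in Python. The environment interface (env.reset() → state, env.step(a) → (next_state, reward, done)) is a simplified, Gym-like convention assumed for illustration, not code from the slides:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (behaviour policy)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy target uses the greedy value of the next state
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q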


26
Visualization and Codes

§ https://cs.stanford.edu/people/karpathy/reinforcejs/index.html

27
VALUE FUNCTION APPROXIMATION
State representation in complex environments
Linear Function Approximation
Gradient Descent and Update rules

28
Function approximation

29
Type of Value Function Approximation

§ Differentiable function
approximation
q Linear combination of features
• Robots: distance from checkpoint, target, dead mark, wall
• Business intelligence systems: trends in the stock market
q Neural Network
• Deep Q Learning
§ Training strategies
30
Value Function by Stochastic Gradient Descent

§ Goal: find parameter vector w minimizing the mean-squared error between the approximate value function v̂(s,w) and the true value function vπ(s)

§ Gradient descent finds a local minimum

§ Stochastic gradient descent samples the gradient
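In symbols, with parameter vector w and approximate value v̂(S,w):

$$J(\mathbf{w}) = \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S,\mathbf{w}))^2\big], \qquad \Delta \mathbf{w} = -\tfrac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha\, \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S,\mathbf{w}))\,\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})\big]$$

and the stochastic gradient descent step samples this expectation:

$$\Delta \mathbf{w} = \alpha\big(v_\pi(S) - \hat{v}(S,\mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})$$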

31
Linear Value Function Approximation

§ Represent state by a feature vector


§ Feature examples:
q Distance to obstacle by lidar
q Angle to target
q Energy level of robot
§ Represent a value function by a linear combination of features

32
Linear Value Function Approximation

§ Objective function is quadratic in parameter w

§ Stochastic gradient descent converges to the global optimum

§ Update rule:
Update = step-size × prediction error × feature value
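With the linear form $\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w}$, the gradient is simply the feature vector, so the SGD step becomes:

$$\nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w}) = \mathbf{x}(S), \qquad \Delta \mathbf{w} = \alpha\big(v_\pi(S) - \hat{v}(S,\mathbf{w})\big)\,\mathbf{x}(S)$$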

33
Incremental Prediction Algorithms

§ So far, the true value function vπ(s) was assumed to be given by a supervisor
§ In reinforcement learning there are only rewards
§ In practice (online learning), a target is substituted for vπ(s)
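Substituting a target for the unknown $v_\pi(S_t)$ gives, for Monte-Carlo and TD(0) respectively:

$$\Delta \mathbf{w} = \alpha\big(G_t - \hat{v}(S_t,\mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$

$$\Delta \mathbf{w} = \alpha\big(R_{t+1} + \gamma\, \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$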

34
Control with Value Function

§ Policy evaluation: approximate policy evaluation
§ Policy improvement: ϵ-greedy policy improvement

35
Action-Value Function Approximation

§ Approximate the action-value function (Q-value)

§ Minimize the mean-squared error between the approximate action-value function and the true action-value function under π

§ Use stochastic gradient descent to find a local minimum
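In symbols, with $\hat{q}(S,A,\mathbf{w}) \approx q_\pi(S,A)$:

$$J(\mathbf{w}) = \mathbb{E}_\pi\big[(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w}))^2\big], \qquad \Delta \mathbf{w} = \alpha\big(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{q}(S,A,\mathbf{w})$$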

36
Linear Action-Value Function Approximation

§ Represent state and action by a feature vector


§ Represent the action-value function by a linear combination of features

§ Stochastic gradient descent update

§ In practice, a target is substituted in the update (written out below):


q MC
q TD
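With features $\mathbf{x}(S,A)$ and $\hat{q}(S,A,\mathbf{w}) = \mathbf{x}(S,A)^\top \mathbf{w}$, the updates with sampled targets are:

$$\text{MC:} \quad \Delta \mathbf{w} = \alpha\big(G_t - \hat{q}(S_t,A_t,\mathbf{w})\big)\,\mathbf{x}(S_t,A_t)$$

$$\text{TD(0):} \quad \Delta \mathbf{w} = \alpha\big(R_{t+1} + \gamma\, \hat{q}(S_{t+1},A_{t+1},\mathbf{w}) - \hat{q}(S_t,A_t,\mathbf{w})\big)\,\mathbf{x}(S_t,A_t)$$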

37
POLICY GRADIENT

38
Policy-Based Reinforcement Learning

§ Last part: value (and action-value) functions were approximated by a parameterized function:

§ Generate a policy from the value function, e.g. using ϵ-greedy

§ In this part: directly parameterize the policy

§ Effective in high-dimensional or continuous action spaces


39
Policy Objective Functions

§ Goal: given a policy πθ(s,a) with parameters θ, find the best θ


§ Define objective function to measure quality of policy
q In episodic environments, objective function is the start value

q In continuing environments, objective function is average value

q Or average reward per time-step
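In standard notation, the three objectives are:

$$J_1(\theta) = v_{\pi_\theta}(s_1), \qquad J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, v_{\pi_\theta}(s), \qquad J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, R(s,a)$$

where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$.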

40
Policy Optimization

§ Policy-based RL is an optimization problem
§ Find θ that maximizes the objective function J(θ)
§ Policy gradient algorithms search for a local maximum of J(θ) by ascending the gradient of the policy
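That is, the parameters are moved in the direction of the gradient:

$$\Delta\theta = \alpha\, \nabla_\theta J(\theta)$$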

41
Monte-Carlo Policy Gradient (REINFORCE)

§ Update parameter by stochastic gradient ascent


§ Using the policy gradient theorem
§ Using the return vt as an unbiased sample of Qπθ(st, at)
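Putting these together, the policy gradient theorem and the resulting REINFORCE update are:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\big], \qquad \Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$$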

42
Reducing Variance using a Critic

§ Monte-Carlo policy gradient has high variance


§ A critic is used to estimate the action-value function

§ Actor-critic algorithms maintain two sets of parameters


q Critic: Updates action-value function parameter w
q Actor: Updates policy parameter θ
§ Actor-critic algorithms follow an approximate policy gradient
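With a critic $Q_{\mathbf{w}}(s,a) \approx Q^{\pi_\theta}(s,a)$, the approximate policy gradient is:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, Q_{\mathbf{w}}(s,a)\big], \qquad \Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(s,a)\, Q_{\mathbf{w}}(s,a)$$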

43
Action-Value Actor-Critic

§ Simple actor-critic algorithm based on an action-value critic
§ Using linear value function approximation Qw(s,a)
§ Critic: update w by linear TD(0)
§ Actor: update θ by policy gradient
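A minimal sketch of this QAC loop in Python, assuming discrete actions, a feature function phi(s, a) returning a NumPy vector, a softmax actor, and a simplified environment interface (env.reset() → state, env.step(a) → (next_state, reward, done)); all of these are illustrative assumptions, not code from the slides:

import numpy as np

def softmax_policy(theta, phi, s, n_actions):
    """pi_theta(a|s) proportional to exp(theta . phi(s,a))."""
    prefs = np.array([theta @ phi(s, a) for a in range(n_actions)])
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def qac(env, phi, n_actions, n_features, episodes=500,
        alpha_w=0.01, alpha_theta=0.001, gamma=0.99):
    """Action-value actor-critic: linear critic Q_w(s,a) = w.phi(s,a), softmax actor."""
    w = np.zeros(n_features)        # critic parameters
    theta = np.zeros(n_features)    # actor parameters
    for _ in range(episodes):
        s = env.reset()
        probs = softmax_policy(theta, phi, s, n_actions)
        a = np.random.choice(n_actions, p=probs)
        done = False
        while not done:
            s2, r, done = env.step(a)
            q_sa = w @ phi(s, a)
            if done:
                q_next = 0.0
            else:
                probs2 = softmax_policy(theta, phi, s2, n_actions)
                a2 = np.random.choice(n_actions, p=probs2)
                q_next = w @ phi(s2, a2)
            # Actor: theta <- theta + alpha * grad log pi(s,a) * Q_w(s,a)
            grad_log = phi(s, a) - sum(p * phi(s, b) for b, p in enumerate(probs))
            theta += alpha_theta * grad_log * q_sa
            # Critic: linear TD(0) update of w
            w += alpha_w * (r + gamma * q_next - q_sa) * phi(s, a)
            if not done:
                s, a, probs = s2, a2, probs2
    return theta, w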
44
Recap on Reinforcement Learning 02

§ Online Learning
q Model-free Reinforcement Learning
q Partially observable environments
q Monte Carlo
q Temporal Difference
q Q-Learning
§ Value Function Approximation
q State representation in complex environments
q Linear Function Approximation
q Gradient Descent and Update rules
§ Policy Gradient
q Objective Function
q Gradient Ascent
q REINFORCE, Actor-Critic

45
Questions?

THANK YOU!

46
