
REINFORCEMENT

LEARNING
(part 2)

Nguyen Do Van, PhD


Reinforcement Learning

§ Online Learning
§ Value Function Approximation
§ Policy Gradients

2
Reinforcement learning: Recall

§ Making good decisions in new tasks: a fundamental challenge in AI and ML
§ Learn to make a good sequence of decisions
§ Intelligent agents learning and acting
q Learning by trial-and-error, in real time
q Improve with experience
q Inspired by psychology:
• Agents + environment
• Agents select action to maximize cumulative rewards

3
Reinforcement learning: Recall

§ At each step t the agent:


q Executes action At
q Receives observation Ot
q Receives scalar reward Rt
§ The environment:
q Receives action At
q Emits observation Ot+1
q Emits scalar reward Rt+1
§ t increments at environment step
4
Reinforcement learning: Recall

§ Policy - maps current state to action


§ Value function - prediction of value for each state and action
§ Model - agent’s representation of the environment.

5
Markov Decision Process
(Model of the environment)

§ Terminologies:

6
Bellman’s equation

§ State value function (for a fixed policy with discount)

• State-action value function (Q-function)

• When S is a finite set of states, this is a system of linear equations (one per state)
• Bellman's equation in matrix form:
7
Optimal Value, Q and policy

§ Optimal V: the highest possible value for each s under any possible policy
§ Satisfies the Bellman equation:

§ Optimal Q-function:

§ Optimal policy:
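In standard notation, these conditions read:

$$v^*(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v^*(s') \Big]$$

$$q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} q^*(s',a')$$

$$\pi^*(s) = \arg\max_a q^*(s,a)$$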

8
Dynamic Programming (DP)

§ Assuming full knowledge of Markov Decision Process


§ It is used for planning in an MDP
§ For prediction
q Input: MDP (S,A,P,R,γ) and policy π
q Output: value function vπ
§ For control
q Input: MDP (S,A,P,R,γ)
q Output: Optimal value function v* and optimal policy π*
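As an illustration of the prediction case, here is a minimal sketch of iterative policy evaluation in Python. The data layout is an assumption for illustration: NumPy arrays P (shape [states, actions, states]) for transition probabilities, R (shape [states, actions]) for expected rewards, and policy (shape [states, actions]).

import numpy as np

def iterative_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Prediction with a fully known MDP: sweep Bellman expectation backups
    until the value function stops changing."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # v(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
            v_new = sum(policy[s, a] * (R[s, a] + gamma * P[s, a] @ V)
                        for a in range(n_actions))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V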
9
ONLINE LEARNING
Model-free Reinforcement Learning
Partially observable environment, Monte Carlo, TD, Q-Learning

10
Monte-Carlo Reinforcement Learning

§ MC methods learn directly from episodes of experience
§ MC is model-free: no knowledge of MDP
transitions / rewards
§ MC learns from complete episodes: no
bootstrapping
§ MC uses the simplest possible idea: value = mean
return
§ Caveat:
q Can only apply MC to episodic MDPs
q All episodes must terminate

11
Monte-Carlo Policy Evaluation

§ Goal: learn vπ from episodes of experience under policy π

§ Total discounted reward

§ Value function is expected return

§ Monte-Carlo policy evaluation uses the empirical mean return instead of the expected return
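Written out in standard notation, the return and the value estimate are:

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$$

$$v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,] \approx \text{mean of the returns } G_t \text{ observed from state } s$$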

12
State Visit Monte-Carlo Policy Evaluation

§ To evaluate state s
§ At each time-step t that state s is visited in an episode:
q Visiting state s: first visit or every visit
§ Increment counter N(s) ← N(s) + 1
§ Increase total return S(s) ← S(s) + Gt
§ Value is estimated by mean return V(s) = S(s)/N(s)
§ By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
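A minimal first-visit Monte-Carlo evaluation sketch in Python; the episode format (a list of (state, reward) pairs collected under π) is an assumption for illustration:

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """First-visit Monte-Carlo policy evaluation from complete episodes."""
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # total return S(s)
    for episode in episodes:
        # compute the return G_t by working backwards through the episode
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()
        for state, G in returns:
            if state not in seen:          # first visit to this state only
                seen.add(state)
                N[state] += 1
                S[state] += G
    return {s: S[s] / N[s] for s in N}     # V(s) = S(s)/N(s)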

13
Incremental Monte-Carlo Updates

§ Learning from experience


§ Update V(s) incrementally after each complete episode
§ For each state St, with actual return Gt

§ With a constant learning rate α
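Concretely, the incremental updates are:

$$N(S_t) \leftarrow N(S_t) + 1, \qquad V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)$$

and with a constant learning rate $\alpha$ (useful in non-stationary problems):

$$V(S_t) \leftarrow V(S_t) + \alpha \big(G_t - V(S_t)\big)$$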

14
Temporal-Difference Learning

§ Model-free: no knowledge of MDP


§ Does not wait for the end of an episode; learns from incomplete episodes by bootstrapping
§ Update value V(St) toward the estimated return (the TD target)
§ The difference between the TD target and V(St) is the TD error
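In the simplest case, TD(0), the update, target, and error are:

$$V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t, \qquad \delta_t = \underbrace{R_{t+1} + \gamma V(S_{t+1})}_{\text{TD target}} - V(S_t)$$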
15
Monte-Carlo and Temporal Difference

§ TD can learn before knowing the final outcome


q TD can learn online after every step
q MC must wait until end of episode before return is known
§ TD can learn without the final outcome
q TD can learn from incomplete sequences
q MC can only learn from complete sequences
q TD works in continuing (non-terminating) environments
q MC only works for episodic (terminating) environments

16
Monte-Carlo Backup

17
Temporal-Difference Backup

18
Dynamic Programming Backup

19
Bootstrapping and Sampling

§ Bootstrapping: update involves an estimate
q MC does not bootstrap
q DP bootstraps
q TD bootstraps
§ Sampling: update samples an
expectation
q MC samples
q DP does not sample
q TD samples

20
N-step prediction

§ n-step return

§ Define n-step return

§ n-step temporal-difference learning
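The n-step return and the corresponding update are:

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$$

$$V(S_t) \leftarrow V(S_t) + \alpha\big(G_t^{(n)} - V(S_t)\big)$$

For n = 1 this is TD(0); as n → ∞ it recovers the Monte-Carlo return.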

21
On-policy Learning

§ Advantages of TD:
q Lower variance
q Online
q Works with incomplete sequences
§ Sarsa:
q Apply TD to Q(S,A)
q Use ϵ-greedy policy improvement
q Update every time-step

22
Sarsa Algorithm
Initialize Q(s,a) arbitrarily and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using a policy derived from Q (e.g. ϵ-greedy)
    Repeat (for each step of the episode):
        Take action A, observe R, S'
        Choose A' from S' using a policy derived from Q (e.g. ϵ-greedy)
        Q(S,A) ← Q(S,A) + α[R + γ Q(S',A') − Q(S,A)]
        S ← S'; A ← A'
    Until S is terminal

23
Off-Policy Learning

§ Evaluate target policy π(a|s) to compute vπ(s) or qπ(s,a)


§ While following behaviour policy μ(a|s):
$\{S_1, A_1, R_2, \ldots, S_T\} \sim \mu$
§ Advantages:
q Learning from observing human or other agents
q Reuse experience generated from old policies π1 , π2 , π3 , … , πt−1
q Learn about optimal policy while following exploratory policy
q Learn about multiple policies while following one policy

24
Q-Learning
§ Off-policy learning of the action-value function Q(s,a)
§ No importance sampling is required
§ Off-policy: the next action is chosen using the behaviour policy
§ Q-Learning: consider an alternative successor action under the target policy
§ Update Q(St,At) towards the value of the alternative action

§ Improve the policy greedily with respect to Q
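Written out, the Q-learning update towards the greedy alternative action is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big(R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big)$$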

25
Q-Learning

Update equation

Algorithm

Q-Learning Table version
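A minimal tabular Q-learning sketch in Python. The environment interface (env.reset() → state, env.step(a) → (next_state, reward, done)) is a simplified, Gym-like convention assumed for illustration, not code from the slides:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (behaviour policy)
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # off-policy target uses the greedy value of the next state
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q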


26
Visualization and Codes

§ https://cs.stanford.edu/people/karpathy/reinforcejs/index.html

27
VALUE FUNCTION APPROXIMATION
State representation in complex environments
Linear Function Approximation
Gradient Descent and Update rules

28
Function approximation

29
Type of Value Function Approximation

§ Differentiable function
approximation
q Linear combination of features
• Robots: distance from checkpoint, target, dead mark, wall
• Business intelligence systems: trends in the stock market
q Neural Network
• Deep Q Learning
§ Training strategies
30
Value Function by Stochastic Gradient Descent

§ Goal: find parameter vector w minimizing the mean-squared error between the approximate value function v̂(s,w) and the true value function vπ(s)

§ Gradient descent finds a local minimum

§ Stochastic gradient descent samples the gradient
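In symbols, with parameter vector w and approximate value v̂(S,w):

$$J(\mathbf{w}) = \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S,\mathbf{w}))^2\big], \qquad \Delta \mathbf{w} = -\tfrac{1}{2}\alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha\, \mathbb{E}_\pi\big[(v_\pi(S) - \hat{v}(S,\mathbf{w}))\,\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})\big]$$

and the stochastic gradient descent step samples this expectation:

$$\Delta \mathbf{w} = \alpha\big(v_\pi(S) - \hat{v}(S,\mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{v}(S,\mathbf{w})$$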

31
Linear Value Function Approximation

§ Represent state by a feature vector


§ Feature examples:
q Distance to obstacle by lidar
q Angle to target
q Energy level of robot
§ Represent a value function by a linear combination of features

32
Linear Value Function Approximation

§ Objective function is quadratic in parameter w

§ Stochastic gradient descent converges to the global optimum

§ Update rule:
Update = step-size × prediction error × feature value
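With the linear form $\hat{v}(S,\mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w}$, the gradient is simply the feature vector, so the SGD step becomes:

$$\nabla_{\mathbf{w}} \hat{v}(S,\mathbf{w}) = \mathbf{x}(S), \qquad \Delta \mathbf{w} = \alpha\big(v_\pi(S) - \hat{v}(S,\mathbf{w})\big)\,\mathbf{x}(S)$$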

33
Incremental Prediction Algorithms

§ So far, the true value function vπ(s) was assumed to be given by a supervisor
§ In reinforcement learning there are only rewards
§ In practice (online learning), a target is substituted for vπ(s)
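Substituting a target for the unknown $v_\pi(S_t)$ gives, for Monte-Carlo and TD(0) respectively:

$$\Delta \mathbf{w} = \alpha\big(G_t - \hat{v}(S_t,\mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$

$$\Delta \mathbf{w} = \alpha\big(R_{t+1} + \gamma\, \hat{v}(S_{t+1},\mathbf{w}) - \hat{v}(S_t,\mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{v}(S_t,\mathbf{w})$$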

34
Control with Value Function

§ Policy evaluation: approximate policy evaluation
§ Policy improvement: ϵ-greedy policy improvement

35
Action-Value Function Approximation

§ Approximate the action-value function (Q-value)

§ Minimize the mean-squared error between the approximate action-value function and the true action-value function under π

§ Use stochastic gradient descent to find a local minimum
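In symbols, with $\hat{q}(S,A,\mathbf{w}) \approx q_\pi(S,A)$:

$$J(\mathbf{w}) = \mathbb{E}_\pi\big[(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w}))^2\big], \qquad \Delta \mathbf{w} = \alpha\big(q_\pi(S,A) - \hat{q}(S,A,\mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{q}(S,A,\mathbf{w})$$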

36
Linear Action-Value Function Approximation

§ Represent state and action by a feature vector


§ Represent the action-value function by a linear combination of features

§ Stochastic gradient descent update

§ In practice, a target is substituted in the update (written out below):


q MC
q TD
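With features $\mathbf{x}(S,A)$ and $\hat{q}(S,A,\mathbf{w}) = \mathbf{x}(S,A)^\top \mathbf{w}$, the updates with sampled targets are:

$$\text{MC:} \quad \Delta \mathbf{w} = \alpha\big(G_t - \hat{q}(S_t,A_t,\mathbf{w})\big)\,\mathbf{x}(S_t,A_t)$$

$$\text{TD(0):} \quad \Delta \mathbf{w} = \alpha\big(R_{t+1} + \gamma\, \hat{q}(S_{t+1},A_{t+1},\mathbf{w}) - \hat{q}(S_t,A_t,\mathbf{w})\big)\,\mathbf{x}(S_t,A_t)$$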

37
POLICY GRADIENT

38
Policy-Based Reinforcement Learning

§ Last part: value (and action-value) functions were approximated by a parameterized function:

§ Generate a policy from the value function, e.g. using ϵ-greedy

§ In this part: directly parameterize the policy

§ Effective in high-dimensional or continuous action spaces


39
Policy Objective Functions

§ Goal: given a policy πθ(s,a) with parameters θ, find the best θ


§ Define objective function to measure quality of policy
q In episodic environments, objective function is the start value

q In continuing environments, objective function is average value

q Or average reward per time-step
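In standard notation, the three objectives are:

$$J_1(\theta) = v_{\pi_\theta}(s_1), \qquad J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, v_{\pi_\theta}(s), \qquad J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, R(s,a)$$

where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain induced by $\pi_\theta$.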

40
Policy Optimization

§ Policy-based RL is an optimization problem
§ Find θ that maximizes the objective function J(θ)
§ Policy gradient algorithms search for a local maximum of J(θ) by ascending the gradient of the policy
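That is, the parameters are moved in the direction of the gradient:

$$\Delta\theta = \alpha\, \nabla_\theta J(\theta)$$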

41
Monte-Carlo Policy Gradient (REINFORCE)

§ Update parameter by stochastic gradient ascent


§ Using the policy gradient theorem
§ Using the return vt as an unbiased sample of Qπθ(st, at)
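Putting these together, the policy gradient theorem and the resulting REINFORCE update are:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, Q^{\pi_\theta}(s,a)\big], \qquad \Delta\theta_t = \alpha\, \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$$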

42
Reducing Variance using a Critic

§ Monte-Carlo policy gradient has high variance


§ A critic is used to estimate the action-value function

§ Actor-critic algorithms maintain two sets of parameters


q Critic: Updates action-value function parameter w
q Actor: Updates policy parameter θ
§ Actor-critic algorithms follow an approximate policy gradient
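With a critic $Q_{\mathbf{w}}(s,a) \approx Q^{\pi_\theta}(s,a)$, the approximate policy gradient is:

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, Q_{\mathbf{w}}(s,a)\big], \qquad \Delta\theta = \alpha\, \nabla_\theta \log \pi_\theta(s,a)\, Q_{\mathbf{w}}(s,a)$$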

43
Action-Value Actor-Critic

§ Simple actor-critic algorithm based on an action-value critic
§ Using linear value function approximation Qw(s,a)
§ Critic: update w by linear TD(0)
§ Actor: update θ by policy gradient
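A minimal sketch of this QAC loop in Python, assuming discrete actions, a feature function phi(s, a) returning a NumPy vector, a softmax actor, and a simplified environment interface (env.reset() → state, env.step(a) → (next_state, reward, done)); all of these are illustrative assumptions, not code from the slides:

import numpy as np

def softmax_policy(theta, phi, s, n_actions):
    """pi_theta(a|s) proportional to exp(theta . phi(s,a))."""
    prefs = np.array([theta @ phi(s, a) for a in range(n_actions)])
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def qac(env, phi, n_actions, n_features, episodes=500,
        alpha_w=0.01, alpha_theta=0.001, gamma=0.99):
    """Action-value actor-critic: linear critic Q_w(s,a) = w.phi(s,a), softmax actor."""
    w = np.zeros(n_features)        # critic parameters
    theta = np.zeros(n_features)    # actor parameters
    for _ in range(episodes):
        s = env.reset()
        probs = softmax_policy(theta, phi, s, n_actions)
        a = np.random.choice(n_actions, p=probs)
        done = False
        while not done:
            s2, r, done = env.step(a)
            q_sa = w @ phi(s, a)
            if done:
                q_next = 0.0
            else:
                probs2 = softmax_policy(theta, phi, s2, n_actions)
                a2 = np.random.choice(n_actions, p=probs2)
                q_next = w @ phi(s2, a2)
            # Actor: theta <- theta + alpha * grad log pi(s,a) * Q_w(s,a)
            grad_log = phi(s, a) - sum(p * phi(s, b) for b, p in enumerate(probs))
            theta += alpha_theta * grad_log * q_sa
            # Critic: linear TD(0) update of w
            w += alpha_w * (r + gamma * q_next - q_sa) * phi(s, a)
            if not done:
                s, a, probs = s2, a2, probs2
    return theta, w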
44
Recap on Reinforcement Learning 02

§ Online Learning
q Model-free Reinforcement Learning
q Partially observable environments
q Monte Carlo
q Temporal Difference
q Q-Learning
§ Value Function Approximation
q State representation in complex environments
q Linear Function Approximation
q Gradient Descent and Update rules
§ Policy Gradient
q Objective Function
q Gradient Ascent
q REINFORCE, Actor-Critic

45
Questions?

THANK YOU!

46
