REINFORCEMENT LEARNING
(part 2)
§ Online Learning
§ Value Function Approximation
§ Policy Gradients
Reinforcement learning: Recall
Markov Decision Process
(Model of the environment)
§ Terminologies:
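The usual MDP terminology, which the rest of the deck relies on (a standard summary, not verbatim from the slide):
An MDP is a tuple (S, A, P, R, γ):
  S — set of states; A — set of actions
  P(s' | s, a) = Pr[ S_{t+1} = s' | S_t = s, A_t = a ] — transition model
  R(s, a) = E[ R_{t+1} | S_t = s, A_t = a ] — reward function
  γ ∈ [0, 1] — discount factor
A policy π(a | s) gives the probability of taking action a in state s; the return is
  G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ...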
Bellman’s equation
§ Optimal Q-function:
§ Optimal policy:
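In standard form (a reconstruction; the slide's own equations are not reproduced here):
Optimal Q-function: Q*(s, a) = max_π Q_π(s, a)
Bellman optimality equation: Q*(s, a) = E[ R_{t+1} + γ max_{a'} Q*(S_{t+1}, a') | S_t = s, A_t = a ]
Optimal policy: π*(s) = argmax_a Q*(s, a)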
Dynamic Programming (DP)
Monte-Carlo Reinforcement Learning
Monte-Carlo Policy Evaluation
State Visit Monte-Carlo Policy Evaluation
§ To evaluate state s
§ At each time-step t at which state s is visited in an episode:
q Count the visit at the first time-step only (first-visit MC) or at every time-step (every-visit MC)
§ Increment the counter N(s) ← N(s) + 1
§ Increase the total return S(s) ← S(s) + Gt
§ The value is estimated by the mean return V(s) = S(s)/N(s)
§ By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
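A minimal Python sketch of the first-visit Monte-Carlo evaluation described above. The environment interface (env.reset(), env.step(action) returning (next_state, reward, done)) and the policy function are assumptions for illustration, not part of the slides.

from collections import defaultdict

def first_visit_mc_evaluation(env, policy, num_episodes, gamma=1.0):
    N = defaultdict(int)      # N(s): visit counter
    S = defaultdict(float)    # S(s): total return
    V = defaultdict(float)    # V(s) = S(s) / N(s)
    for _ in range(num_episodes):
        # Generate one episode following the policy
        episode = []                                       # (state, reward) pairs
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)    # assumed interface
            episode.append((state, reward))
            state = next_state
        # Index of the first visit to each state in this episode
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards accumulating returns G_t = R_{t+1} + gamma * G_{t+1}
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            if first_visit[s] == t:       # update only at the first visit
                N[s] += 1
                S[s] += G
                V[s] = S[s] / N[s]
    return V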
Incremental Monte-Carlo Updates
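The incremental form of the mean return (a standard reconstruction of what this slide covers): the mean can be updated after each episode instead of storing all returns.
V(S_t) ← V(S_t) + (1 / N(S_t)) ( G_t − V(S_t) )
For non-stationary problems, a constant step-size is used instead:
V(S_t) ← V(S_t) + α ( G_t − V(S_t) )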
Temporal-Difference Learning
TD error
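In standard form, the TD(0) update and the TD error referred to above are:
V(S_t) ← V(S_t) + α ( R_{t+1} + γ V(S_{t+1}) − V(S_t) )
TD target: R_{t+1} + γ V(S_{t+1})
TD error: δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)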
Monte-Carlo and Temporal Difference
Monte-Carlo Backup
Temporal-Difference Backup
Dynamic Programming Backup
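The three backups compared on these slides, in standard form (the backup diagrams themselves are not reproduced here):
Monte-Carlo backup: V(S_t) ← V(S_t) + α ( G_t − V(S_t) )   — samples a complete return, no bootstrapping
Temporal-Difference backup: V(S_t) ← V(S_t) + α ( R_{t+1} + γ V(S_{t+1}) − V(S_t) )   — samples one step, then bootstraps
Dynamic Programming backup: V(S_t) ← E_π [ R_{t+1} + γ V(S_{t+1}) ]   — full-width expectation, bootstraps without sampling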
Bootstrapping and Sampling
N-step prediction
§ n-step return
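In standard form, the n-step return and the corresponding update are:
G_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^(n−1) R_{t+n} + γ^n V(S_{t+n})
V(S_t) ← V(S_t) + α ( G_t^(n) − V(S_t) )
n = 1 recovers TD(0); n → ∞ recovers Monte-Carlo.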
On-policy Learning
§ Advantages of TD:
q Lower variance
q Online
q Can learn from incomplete sequences
§ Sarsa:
q Apply TD to Q(S,A)
q Use ϵ-greedy policy improvement
q Update at every time-step
Sarsa Algorithm
Initialize Q(s,a) arbitrarily and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using a policy derived from Q (e.g., ϵ-greedy)
    Repeat (for each step of the episode):
        Take action A, observe R, S'
        Choose A' from S' using a policy derived from Q (e.g., ϵ-greedy)
        Q(S,A) ← Q(S,A) + α [ R + γ Q(S',A') − Q(S,A) ]
        S ← S'; A ← A'
    Until S is terminal
Off-Policy Learning
Q-Learning
§ Off-policy learning of the action-value function Q(s,a)
§ No importance sampling is required
§ Off-policy: the next action is chosen using the behaviour policy (e.g., ϵ-greedy w.r.t. Q)
§ Q-Learning: an alternative successor action is chosen greedily w.r.t. Q
§ Update Q(St,At) towards the value of the alternative action
Q-Learning
Update equation
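The standard Q-learning update this heading refers to:
Q(S_t, A_t) ← Q(S_t, A_t) + α ( R_{t+1} + γ max_{a'} Q(S_{t+1}, a') − Q(S_t, A_t) )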
Algorithm
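A minimal Python sketch of tabular Q-learning consistent with the update above. The environment interface (env.actions, env.reset(), env.step(action) returning (next_state, reward, done)) and the hyper-parameter values are assumptions for illustration.

import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                # Q[(state, action)], defaults to 0

    def epsilon_greedy(state):
        # Behaviour policy: explore with probability epsilon, else act greedily
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)    # assumed interface
            # Off-policy target: greedy (max) over next actions,
            # independent of the action the behaviour policy takes next
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q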
§ https://cs.stanford.edu/people/karpathy/reinforcejs/index.html
VALUE FUNCTION APPROXIMATION
State representation in complex environments
Linear Function Approximation
Gradient Descent and Update rules
Function approximation
Types of Value Function Approximation
§ Differentiable function approximation
q Linear combination of features
• Robots: distance from a checkpoint, the target, a dead mark, a wall
• Business Intelligence Systems: trends in the stock market
q Neural Network
• Deep Q-Learning
§ Training strategies
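A hypothetical Python illustration of the "linear combination of features" idea for the robot example above; the feature choice and the linear form are illustrative assumptions, not taken from the slides.

def robot_features(dist_checkpoint, dist_target, dist_dead_mark, dist_wall):
    # Hand-crafted feature vector x(s) built from distances the robot can measure
    return [1.0,                # bias feature
            dist_checkpoint,
            dist_target,
            dist_dead_mark,
            dist_wall]

def linear_value(weights, features):
    # v_hat(s, w) = w . x(s): a linear combination of the features
    # e.g. linear_value(w, robot_features(2.0, 5.5, 8.0, 1.2))
    return sum(w * x for w, x in zip(weights, features))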
Value Function by Stochastic Gradient Descent
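In standard form (a reconstruction of what this slide covers):
Objective (mean squared value error): J(w) = E_π [ ( v_π(S) − v̂(S, w) )² ]
Gradient descent: Δw = −(1/2) α ∇_w J(w) = α E_π [ ( v_π(S) − v̂(S, w) ) ∇_w v̂(S, w) ]
Stochastic gradient descent samples this expectation:
Δw = α ( v_π(S) − v̂(S, w) ) ∇_w v̂(S, w)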
Linear Value Function Approximation
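For a linear approximator the equations above simplify (standard form):
Feature vector: x(S) = ( x_1(S), ..., x_n(S) )ᵀ
v̂(S, w) = x(S)ᵀ w = Σ_j x_j(S) w_j
∇_w v̂(S, w) = x(S), so the update becomes
Δw = α ( v_π(S) − v̂(S, w) ) x(S)
i.e. update = step-size × prediction error × feature value.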
Incremental Prediction Algorithms
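Since v_π(S) is not available in practice, a target is substituted for it (standard form):
MC: Δw = α ( G_t − v̂(S_t, w) ) ∇_w v̂(S_t, w)
TD(0): Δw = α ( R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w) ) ∇_w v̂(S_t, w)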
Control with Value Function
Action-Value Function Approximation
Linear Action-Value Function Approximation
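In standard form, the action-value approximation and the corresponding Sarsa-style update are:
q̂(S, A, w) ≈ q_π(S, A), e.g. linear: q̂(S, A, w) = x(S, A)ᵀ w
Δw = α ( R + γ q̂(S', A', w) − q̂(S, A, w) ) x(S, A)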
POLICY GRADIENT
Policy-Based Reinforcement Learning
Policy Optimization
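In standard form, the parameterised policy, an episodic objective, and the gradient-ascent step are:
Parameterised policy: π_θ(a | s) with parameters θ
Episodic objective: J(θ) = E_{π_θ} [ G_1 ] = V^{π_θ}(s_1)
Gradient ascent: θ ← θ + α ∇_θ J(θ)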
Monte-Carlo Policy Gradient (REINFORCE)
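In standard form, the policy gradient theorem and the REINFORCE update it leads to:
∇_θ J(θ) = E_{π_θ} [ ∇_θ log π_θ(S, A) Q^{π_θ}(S, A) ]
REINFORCE uses the sampled return G_t as an unbiased estimate of Q^{π_θ}(S_t, A_t):
Δθ = α ∇_θ log π_θ(S_t, A_t) G_t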
Reducing Variance using a Critic
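In standard form, the critic replaces the noisy return G_t with an estimated action value, giving the action-value actor-critic (QAC) on the next slide; the linear critic and its step-size β are common assumptions, not verbatim from the slides:
Critic: Q_w(s, a) ≈ Q^{π_θ}(s, a)
Actor: Δθ = α ∇_θ log π_θ(s, a) Q_w(s, a)
Per step (QAC): δ = R + γ Q_w(S', A') − Q_w(S, A)
  w ← w + β δ x(S, A)      (critic, e.g. linear Q_w(s, a) = x(s, a)ᵀ w)
  θ ← θ + α ∇_θ log π_θ(S, A) Q_w(S, A)      (actor)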
Action-Value Actor-Critic
q Q-Learning
§ Policy Gradient
q Objective Function
q Gradient Ascent
q REINFORCE, Actor-Critic
Questions?
THANK YOU!