Anant Nawalgaria
Alex Erfurt
Machine Learning Specialists,
Google Cloud
Agenda
● Introduction to reinforcement learning (RL)
● Applications of RL
● Policy-based RL
● Contextual bandits
Objectives
Machine learning types
[Diagram: agent–environment loop: the agent takes an action; the environment returns a reward.]
Reinforcement learning in dog training
[Diagram: the agent (dog) uses a policy to choose actions in the environment, and receives observations and rewards.]
[Diagram: within the agent, the reinforcement learning algorithm applies policy updates; the policy maps observations from the environment to actions, and the environment returns rewards.]
Reinforcement learning methods
[Diagram: overview of reinforcement learning methods.]
Video Name: T-RSML-O_2_M6_L8_applications_of_reinforcement_learning
Industries that use reinforcement learning
Supervised learning: prediction problems
● Short term
● Offline training on cold historic data
● IID data
● No trial and error required
● Suitable for lower variance / dynamism
● Transfer learning
● Differentiable, non-noisy loss
When to use RL rather than SL or USL
The reinforcement learning framework
[Diagram: the agent observes states, takes actions in the environment, and receives rewards.]
Terminology in reinforcement learning
● State: Summary of events so far; the current situation
● Action: One or more events that alter the state
● Environment: The scenario the agent has to respond to
● Agent: The learner entity that performs actions in an environment
● Reward: Feedback on agent actions, also known as the reward signal
● Policy: Method to map the agent’s state to actions
● Episode: A sequence of agent–environment interactions that ends at a termination point
● Value: Long-term reward gained by the end of an episode
● Value function: Measure of potential future rewards from being in a particular state, or V(S)
● Q(S,A): The “Q-value” of an action in various state/action pairs
● SARSA: State, Action, Reward, State, Action
Using the RL framework: Example
[Diagram: a fulfillment-center grid world and its decision tree. From each cell the agent can move up, down, or right; one cell yields +1 and another yields -1, so being in a cell near the +1 is better than being in the one below it.]
Find the best sequence of actions that will generate the optimal outcome: collect the most reward!
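The search for the best action sequence can be sketched in code; the 3x3 layout, goal, and trap cells below are assumptions for illustration, not the exact slide figure.

```python
from itertools import product

# Hypothetical 3x3 grid: reaching (0, 2) yields +1, stepping into (1, 2) yields -1.
GOAL, TRAP = (0, 2), (1, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def run(start, actions):
    """Apply a sequence of moves and return the total reward; stop at goal or trap."""
    r, c = start
    total = 0
    for a in actions:
        dr, dc = MOVES[a]
        r, c = min(max(r + dr, 0), 2), min(max(c + dc, 0), 2)  # clamp to the grid
        if (r, c) == GOAL:
            return total + 1
        if (r, c) == TRAP:
            return total - 1
    return total

# Exhaustively score every 4-step plan from the bottom-left corner.
best = max(product(MOVES, repeat=4), key=lambda seq: run((2, 0), seq))
print(best, run((2, 0), best))
```

Brute force works only because the grid is tiny; the value and Q-function machinery above is what replaces this enumeration at scale.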
[Diagram: agent and environment connected by policy, action, reward, observation, and state.]
Setting up the problem for the RL workflow
1. Define the environment: real or simulated.
2. Incentivize the agent to do what we want.
3. Structure the logic and parameters.
4. Choose a training algorithm.
5. Put the agent to work.
Reinforcement learning workflow
[Diagram: the workflow spans the environment, the reward, and the agent.]
Workflow
Step 1: Pull the observation generated by the environment.
Step 2: Choose an action based on the policy; apply it to the environment.
Step 3: Get the reward produced by the environment, based on steps 1 and 2.
Step 4: Use the agent to train the target policy on trajectory data, including the observation, action, and reward from steps 1, 2, and 3.
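The four steps can be sketched as a plain interaction loop; the one-dimensional environment and random policy below are made up for illustration.

```python
import random

random.seed(0)

def env_step(state, action):
    """Toy 1-D environment: move on positions 0..10; reward +1 at position 10."""
    next_state = min(max(state + action, 0), 10)
    reward = 1.0 if next_state == 10 else 0.0
    return next_state, reward

def policy(state):
    """A placeholder random policy: step left (-1) or right (+1)."""
    return random.choice([-1, 1])

trajectory = []  # Step 4 trains on this (observation, action, reward) data.
state = 0
for _ in range(50):
    observation = state                      # Step 1: pull the observation.
    action = policy(observation)             # Step 2: choose an action from the policy.
    state, reward = env_step(state, action)  # Step 3: get the reward.
    trajectory.append((observation, action, reward))

print(len(trajectory), sum(r for _, _, r in trajectory))
```

In a real workflow, step 4 would update the policy from `trajectory` instead of leaving it random.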
SARSA
For instance, for grid cells S7 and S17:
● Q(S7, right) would have a possible reward of r = +1
● Q(S17, up) would have a possible reward of r = -1
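SARSA's update moves Q(S, A) toward R + γ·Q(S', A'), using the next action actually chosen. A minimal sketch; the state names echo the slide, but the reward value, step size, and discount are assumptions.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: move Q(s, a) toward r + gamma * Q(s', a')."""
    old = Q.get((s, a), 0.0)
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = old + alpha * (td_target - old)

Q = {}
# After taking "right" in S7, earning +1, and choosing "up" in the next state:
sarsa_update(Q, "S7", "right", 1.0, "S8", "up")
print(Q[("S7", "right")])  # 0.1
```

The quintuple (S, A, R, S', A') in the call is exactly where the algorithm's name comes from.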
Course T-RSML: Recommendation Systems
Video Name: T-RSML-O_2_M6_L4_model-based_and_model-free_reinforcement_learning
Model-based and model-free RL
Reinforcement learning
[Diagram: model-based approach: experience feeds model building; the model supports planning, which updates the value function and policy.]
Model-based method
The agent learns what is considered optimal behavior by taking actions and observing the resulting state and reward outcomes.
[Diagram: a maze with a destination and a dead end.]
Model-free method
The agent learns by exploring all areas of the state space to fill out its value function as it searches for the best reward.
[Diagram: model-free approach: experience updates the value function and policy directly, with no model or planning step.]
Model-based vs model-free RL
[Table: side-by-side comparison of the model-based and model-free approaches.]
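A model-based agent can plan against an explicit model of transitions and rewards, for example with value iteration; the tiny chain world below is an assumption for illustration.

```python
# Model: transitions[state][action] = (next_state, reward). State 2 is terminal.
transitions = {
    0: {"right": (1, 0.0), "left": (0, 0.0)},
    1: {"right": (2, 1.0), "left": (0, 0.0)},
}
GAMMA = 0.9

# Value iteration: sweep Bellman backups until the values stop changing.
V = {0: 0.0, 1: 0.0, 2: 0.0}
for _ in range(100):
    for s, acts in transitions.items():
        V[s] = max(r + GAMMA * V[s2] for s2, r in acts.values())

print(V)  # state 1 is worth 1.0, state 0 is worth 0.9
```

A model-free agent would instead estimate these values from sampled experience, never touching the `transitions` dictionary directly.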
Video Name: T-RSML-O_2_M6_L5_value-based_reinforcement_learning
RL methods
Reinforcement learning
[Diagram: taxonomy of value-based methods: shallow backups, temporal difference, and Monte Carlo.]
Enforcing control in value methods
[Diagram: a decision tree of states (S), actions (A), and termination points (T) branching from the root state S1.]
Terms of the decision tree
[Diagram key: S marks a state, A an action, and T the termination point of the episode.]
Monte Carlo backup
[Diagram: one complete trajectory from the root state S1 through alternating actions and states down to a termination point T; the backup uses the return of the full episode. Legend: S = state, A = action, T = termination point.]
Temporal Difference backup
[Diagram: a one-step path from the root state S1 through one action to the next state; the backup bootstraps from the estimated value of that next state instead of waiting for the episode to terminate. Legend: S = state, A = action, T = termination point.]
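The two backups differ only in the target that V(s) is moved toward: Monte Carlo uses the observed return of the whole episode, while TD(0) uses one reward plus the current estimate of the next state's value. A sketch with made-up episode data:

```python
GAMMA = 1.0  # no discounting, for simplicity
ALPHA = 0.5  # learning rate

def mc_target(rewards_to_go):
    """Monte Carlo: the actual discounted return from this state to termination."""
    return sum(GAMMA ** i * r for i, r in enumerate(rewards_to_go))

def td_target(reward, v_next):
    """TD(0): one real reward, then bootstrap from the next state's value estimate."""
    return reward + GAMMA * v_next

V = {"s1": 0.0, "s2": 0.4}
# Episode from s1: reward 0, reach s2, reward 1, terminate.
V["s1"] += ALPHA * (mc_target([0.0, 1.0]) - V["s1"])    # toward the full return, 1.0
print(V["s1"])  # 0.5
V["s1"] += ALPHA * (td_target(0.0, V["s2"]) - V["s1"])  # toward 0 + V(s2) = 0.4
print(round(V["s1"], 2))  # 0.45
```

Monte Carlo must wait for the termination point T before updating; TD can update after every single step.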
Achieving data efficiency

The problem:
● Data inefficiency
● Rare event loss
● Policy drift
● Correlated experience
● Large state space

The solution: store experience tuples <s1, a1, r2, s2>, <s2, a2, r3, s3>, ..., <st, at, rt+1, st+1> in a buffer, each with a sampling priority p, and sample from that buffer when training and choosing actions.
[Diagram: prioritized experience tuples being sampled from the buffer to drive actions.]
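The tuple store described above is commonly implemented as an experience replay buffer; sampling at random breaks up correlated experience. A minimal uniform-sampling sketch (prioritized sampling by p is omitted here):

```python
import random
from collections import deque

random.seed(1)

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s_next) tuples, sampled uniformly."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are evicted first

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(10):
    buf.add(t, "right", 0.0, t + 1)
batch = buf.sample(4)
print(len(batch))  # 4 decorrelated transitions
```

Because rare transitions stay in the buffer and can be replayed many times, the same experience trains the policy repeatedly, which is the data-efficiency win.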
Video Name: T-RSML-O_2_M6_L6_policy-based_reinforcement_learning
Reinforcement learning
[Diagram: taxonomy: policy-based methods support continuous action spaces.]
Policy-based vs value-based
A policy-based approach is preferable over a value-based approach when:
● There are large action spaces.
● Stochasticity is needed.
● An agent will learn the policy directly.
● Lower bias in the policy is needed.
[Diagram: a value-based model outputs action values (e.g., 1 and 5); a policy-based model outputs action probabilities (e.g., 20% and 80%).]
Example: CartPole problem
A policy-based approach outputs an optimal action for a selected state.
Inputs (each bounded between a min and a max): cart position, cart velocity, pole angle.
Output: a probability for each action, e.g., 0.9 for "left" and 0.1 for "right".
Model-free, policy-based methods:
● State → optimal policy
● REINFORCE
● Proximal policy optimization (PPO)
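A bare-bones REINFORCE sketch on a two-action toy problem; the reward function, step size, and episode count are assumptions, and a real CartPole policy would be a neural network rather than two logits.

```python
import math
import random

random.seed(0)

theta = [0.0, 0.0]  # one logit per action: [left, right]
ALPHA = 0.5         # step size

def probs():
    """Softmax over the two logits gives the action probabilities."""
    e = [math.exp(t) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def reward(action):
    return 1.0 if action == 1 else 0.0  # "right" is the rewarded action here

for _ in range(200):
    p = probs()
    a = 0 if random.random() < p[0] else 1  # sample an action from the policy
    r = reward(a)
    # Policy gradient for a softmax policy: grad log pi(a) = 1[a == i] - p[i]
    for i in range(2):
        theta[i] += ALPHA * r * ((1.0 if i == a else 0.0) - p[i])

print(round(probs()[1], 2))  # probability of the rewarded action approaches 1
```

The update directly increases the probability of actions that led to reward, which is the "learn the policy directly" behavior listed above.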
Q-Learning vs DQN
Q-learning maintains an explicit table of Q-values, one entry per state/action pair:

          Action1  Action2  ...  ActionM
State1    Q1,1     Q1,2     ...  Q1,M
State2    Q2,1     Q2,2     ...  Q2,M
...
StateN    QN,1     QN,2     ...  QN,M

DQN replaces the table with a neural network: the input is the state (State1 ... StateN), and the output is a Q-value for each action (Action1 ... ActionM).
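Tabular Q-learning fills in exactly such a table with the update Q(s,a) ← Q(s,a) + α[r + γ max Q(s',·) - Q(s,a)]; the chain environment and hyperparameters below are assumptions for illustration.

```python
import random

random.seed(0)

N_STATES, ACTIONS = 4, [0, 1]        # actions: 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # the Q-table: rows = states, cols = actions

def step(s, a):
    """Chain world: reaching the last state pays +1 and ends the episode."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r

for _ in range(500):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy: mostly exploit the table, sometimes explore.
        a = random.choice(ACTIONS) if random.random() < EPSILON else Q[s].index(max(Q[s]))
        s2, r = step(s, a)
        # Q-learning backup: bootstrap from the best action in the next state.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

print([round(max(row), 2) for row in Q])  # values rise toward the goal state
```

When N and M grow too large for a table (as in the DQN column), the same update is applied to a network's predicted Q-values instead.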
Contextual bandits
[Diagram: a function approximator maps the context to action choices.]
A multi-armed bandit agent
● There is no state.
● Every episode is independent.
Contextual bandit agent
The same action can earn different rewards in different contexts: showing item 1 on Webpage A earns $0 when the user is at work, but $2 when the user is at home (and similarly for Webpage B).
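The webpage example can be sketched as an epsilon-greedy contextual bandit that keeps a per-(context, action) average reward; the payoff table below loosely echoes the slide's dollar amounts but is otherwise an assumption.

```python
import random
from collections import defaultdict

random.seed(0)

# Assumed expected payoffs per (context, action).
PAYOFF = {("at work", "item 1"): 0.0, ("at work", "item 2"): 1.0,
          ("at home", "item 1"): 2.0, ("at home", "item 2"): 0.5}
ACTIONS = ["item 1", "item 2"]
EPSILON = 0.1

est = defaultdict(float)  # running average reward per (context, action)
count = defaultdict(int)

def choose(context):
    """Epsilon-greedy over actions for this context; no state carries over."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: est[(context, a)])

for _ in range(2000):
    ctx = random.choice(["at work", "at home"])
    act = choose(ctx)
    reward = PAYOFF[(ctx, act)]  # each round is independent: no next state
    count[(ctx, act)] += 1
    est[(ctx, act)] += (reward - est[(ctx, act)]) / count[(ctx, act)]

print(max(ACTIONS, key=lambda a: est[("at home", a)]))  # item 1
print(max(ACTIONS, key=lambda a: est[("at work", a)]))  # item 2
```

Because every round is independent, no value propagates between states, which is exactly what separates bandits from the full RL framework above.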