
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Intro to module

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L1_introduction_to_module


Reinforcement
Learning

Anant Nawalgaria
Alex Erfurt
Machine Learning Specialists,
Google Cloud
Agenda
Introduction to RL

Applications of RL

RL framework and workflow


Model-based and model-free RL
Value-based RL

Policy-based RL

Contextual bandits
Objectives

1 Explain how reinforcement learning fits within the different machine learning types.

2 Define the primary branches and useful types of reinforcement learning.

3 Describe the framework and workflow to solve a common reinforcement learning problem.

4 Identify use cases where RL is the ideal approach to solving an ML problem and where other methods are preferable to RL.
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Introduction to Reinforcement Learning

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
Agenda

Introduction to reinforcement learning (RL)

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
Machine learning types

● Unsupervised learning
● Supervised learning
● Reinforcement learning
What is reinforcement learning?

Reinforcement learning is an area of machine learning where an agent learns by interacting with its environment.

Agents learn to:
● Achieve a goal.
● Achieve the optimal behavior.
● Obtain the maximum reward.

[Diagram: a tree of candidate actions (a1, a2) with the rewards they lead to, such as R = +1, R = 0, R = +2, and R = -5.]
RL process

[Diagram: the agent and the environment form a loop of (1) observation, (2) action, and (3) reward.]
Reinforcement learning in dog training

[Diagram: the dog is the agent and its surroundings are the environment; observations, actions, and rewards flow between them, and the rewards shape the dog's policy.]
Agent

[Diagram: inside the agent, a reinforcement learning algorithm performs policy updates; observations and rewards flow in from the environment, and the policy produces the actions applied back to it.]
Reinforcement learning methods

Reinforcement learning
● Model-based methods
● Model-free methods

Characteristics of reinforcement learning

● There is no supervisor; there is only a real number or reward signal.
● Decision making is sequential.
● Time plays a crucial role in RL problems.
● Feedback is always delayed, not instantaneous.
● The agent's actions determine the subsequent data it receives.
Reference use case

Warehouse fulfillment center:


● It spans 28 football fields.
● Online merchants outsource to have their
products stored, managed, and shipped
from the fulfillment center warehouse.
● Robots need to choose optimal paths.
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Applications of reinforcement learning

Presenter: Alex Erfurt

Format: Video - Lecture Screencast

Video Name:
T-RSML-O_2_M6_L8_applications_of_reinforcement_learning
Agenda

Introduction to reinforcement learning

Applications of RL

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits
Industries that use reinforcement learning

● Business
● Gaming
● Recommendation systems
● Science
Real-life applications

● Spotify recommendations
● Retailer recommendations ("You may also like")
● AlphaGo Zero / AlphaZero
When to use RL rather than SL or USL

Supervised learning | Reinforcement learning
Prediction problems | Control/optimization/decision-making problems
Short term | Optimized for long-term/delayed value
Offline training on cold historic data | Real-time training or offline simulation
IID data | Non-IID data possible
No trial and error required | Trial and error necessary
Suitable for lower variance/dynamism | Suitable for higher variance/dynamism
Transfer learning | Transfer learning not yet possible
Differentiable, non-noisy loss | Differentiability not critical; noisy reward OK
Why not combine model-based and model-free approaches?

● Combining allows the agent to learn from both real experience and the trajectories generated by the model.
● The agent uses both sources to take an action on the environment.

[Diagram: experience feeds both direct RL updates and model learning; the learned model is used for planning, and the resulting value/policy drives acting in the environment, which generates new experience.]
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: The Reinforcement Learning framework and workflow

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
Agenda

Introduction to reinforcement learning (RL)

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
The reinforcement learning framework

[Diagram: the agent receives the state and a reward from the environment and responds with an action; the environment defines the states, actions, and rewards.]
Terminology in reinforcement learning

Term | Definition
State | Summary of events so far; the current situation
Action | One or more events that alter the state
Environment | The scenario the agent has to respond to
Agent | The learner entity that performs actions in an environment
Reward | Feedback on agent actions, also known as the reward signal
Policy | Method to map the agent’s state to actions
Episode | A complete sequence of interactions that ends at a termination point
Value | Long-term reward gained by the end of an episode
Value function | Measure of potential future rewards from being in a particular state, or V(S)
Q(S,A) | “Q-value” of an action in various state/action pairs
SARSA | State Action Reward State Action
Using the RL framework: Example

[Diagram: a decision tree of "move up" and "move right" action sequences, shown next to the fulfillment center grid; one terminal cell carries a reward of +1 and another a reward of -1. At each cell the agent must decide: up or right?]
The value function as the RL algorithm

[Diagram: the same decision tree and fulfillment center grid, with the optimal sequence of moves highlighted. The value function captures that being in a cell near the +1 reward is better than being in the one below it.]
Find the best sequence of actions that will generate the optimal outcome

Collect the most reward!

[Diagram: the agent's policy maps the observation/state coming from the environment to an action; the environment returns a reward and the next observation.]
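The fulfillment center example can be made concrete with a small tabular sketch. The grid below is an illustrative stand-in (a 3x3 grid with a +1 goal cell and a -1 penalty cell, not the course's lab environment), and the update rule is plain Q-learning over the "up"/"right" moves from the decision tree.

import random

# Illustrative 3x3 grid (not the lab environment): start at (0, 0),
# +1 reward at (2, 2), -1 penalty at (2, 1), 0 everywhere else.
REWARDS = {(2, 2): 1.0, (2, 1): -1.0}
ACTIONS = {"up": (1, 0), "right": (0, 1)}   # the agent may only move up or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = {((row, col), a): 0.0 for row in range(3) for col in range(3) for a in ACTIONS}

def step(state, action):
    """Apply a move, clipping at the grid edge, and return (next_state, reward, done)."""
    d_row, d_col = ACTIONS[action]
    nxt = (min(state[0] + d_row, 2), min(state[1] + d_col, 2))
    return nxt, REWARDS.get(nxt, 0.0), nxt in REWARDS

for _ in range(2000):                                   # episodes
    state, done = (0, 0), False
    while not done:
        if random.random() < EPSILON:                   # explore
            action = random.choice(list(ACTIONS))
        else:                                           # exploit: argmax over Q-values
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

print(max(ACTIONS, key=lambda a: Q[((0, 0), a)]))       # preferred first move from the start cell

After enough episodes, the learned Q-values play the same role as the value function above: they rank the moves from each cell by how much future reward they are expected to collect.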
Setting up the problem for the RL workflow

Environment | How is it defined? Real or simulated.
Reward | Incentivize the agent to do what we want.
Policy | Structure the logic and parameters.
Training | Choose a training algorithm.
Deployment | Put the agent to work.
Reinforcement learning workflow

1 Create the environment
● Define the environment, including the interface with the agent (see the sketch after this list).
● It can be either a simulation model or a real physical system.
● Simulated environments are safer and allow experimentation.

2 Define the reward
● Specify the reward signal.
● Iterate several times to shape the reward.

3 Create the agent
● Choose how to represent the policy.
● Select a training algorithm.

4 Train and validate the agent
● Set up stopping criteria.
● Train the agent to tune the policy.
● Utilize multiple CPUs, GPUs, and clusters for complex applications.

5 Deploy the policy
● Use generated code to deploy the policy.
● Agents and training algorithms are no longer needed at this stage.
● Revisit earlier stages as needed.
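Stage 1 of this workflow is about defining the environment and its interface with the agent. As a rough sketch only (the class and method names below are assumptions, not part of any particular RL library), a simulated environment typically exposes a reset/step interface like this:

class WarehouseEnv:
    """Minimal simulated-environment sketch: the interface an agent trains against."""

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.position = (0, 0)
        return self.position

    def step(self, action):
        """Apply an action and return (observation, reward, done)."""
        row, col = self.position
        if action == "up":
            row = min(row + 1, 2)
        elif action == "right":
            col = min(col + 1, 2)
        self.position = (row, col)
        reward = 1.0 if self.position == (2, 2) else 0.0
        return self.position, reward, self.position == (2, 2)

Whether the environment is a simulator like this or a wrapper around a real physical system, keeping the interface stable lets the later stages (reward, policy, training, deployment) reuse the same agent code.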
Workflow

Step 1: Pull the observation generated by the environment.
Step 2: Choose an action based on the policy and apply it to the environment.
Step 3: Get the reward produced by the environment based on steps 1 and 2.
Step 4: Use the agent to train the target policy on trajectory data, including the observation, action, and reward from steps 1, 2, and 3.
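Steps 1 through 4 form a loop. A minimal sketch of that loop, assuming an environment with the reset/step interface sketched earlier and a policy object with illustrative act/update methods:

def run_episode(env, policy):
    """One pass through workflow steps 1 to 4: observe, act, collect reward, learn."""
    observation = env.reset()                            # Step 1: pull the first observation
    trajectory, done = [], False
    while not done:
        action = policy.act(observation)                 # Step 2: choose an action from the policy
        next_observation, reward, done = env.step(action)  # Step 3: reward from the environment
        trajectory.append((observation, action, reward))
        observation = next_observation
    policy.update(trajectory)                            # Step 4: train the policy on the trajectory
    return sum(reward for _, _, reward in trajectory)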
SARSA

State Action Reward State Action represents the quintuple an agent collects as it interacts with its environment, updating Q(St, At) and its policy based on the feedback it receives.

[Diagram: the fulfillment center grid with states such as S7 and S17, a +1 reward cell, and a -1 penalty cell.]

For instance:
● Q(S7, right) would have a possible reward of r = +1.
● Q(S17, up) would have a possible reward of r = -1.
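The SARSA quintuple drives a concrete update: Q(St, At) is nudged toward the reward plus the discounted Q-value of the next state-action pair the agent actually takes. A minimal sketch, with an illustrative dictionary Q-table and hyperparameters:

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q

# The slide's Q(S7, right) with reward r = +1; "S8" and "up" are an illustrative
# next state-action pair the agent might take afterwards.
Q = {}
sarsa_update(Q, "S7", "right", +1, "S8", "up")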
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Model-based and model-free reinforcement learning

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
T-RSML-O_2_M6_L4_model-based_and_model-free_reinforcement_learning
Agenda

Introduction to reinforcement learning

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
Model-based and model-free RL

Reinforcement learning
● Model-based methods: the agent predicts what happens when certain actions are taken.
● Model-free methods: the agent learns a control policy directly from interacting with the environment.
Model-based method

The agent learns what is considered optimal behavior by taking actions and observing the resulting state and reward outcomes.

[Diagram: in the model-based approach, experience feeds model building; planning with the model produces the value function and policy.]

[Diagram: a maze with a destination and a dead end; with a model, we know what's not worth exploring.]
Model-free method

The agent learns by exploring all areas of the state space to fill out its value function as it searches for the best reward.

[Diagram: in the model-free approach, experience directly produces the value function and policy, with no model in between.]
Model-based vs model-free RL

Statement | Model-based | Model-free
You have access to or knowledge about the environment. | Yes | No
You can avoid needless exploration by focusing on areas you already know are worthwhile. | Yes | No
Need to make more assumptions and approximations. | Yes | No
Need lots of samples. | No | Yes
Over many episodes, results become less optimal. | Yes | No
Over many episodes, results become more optimal. | No | Yes
Applicable across a wide variety of applications. | No | Yes

Examples of model-based and model-free RL methods

Reinforcement learning
● Model-based methods: analytic gradient computation, sampling-based planning, model-based data generation, value-equivalence prediction.
● Model-free methods: *value-based, *policy-based, actor-critic, *contextual bandits, on-policy, off-policy.

* Discussed in this module.


Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Value-based reinforcement learning

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
T-RSML-O_2_M6_L5_value-based_reinforcement_learning
Agenda

Introduction to reinforcement learning

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
RL methods

Reinforcement learning
● Model-based methods: analytic gradient computation, sampling-based planning, model-based data generation, value-equivalence prediction.
● Model-free methods: *value-based, *policy-based, actor-critic, *contextual bandits, on-policy, off-policy.
Value-based approach

You explore in order to learn state-action values and maximize a value function, V(S).

The agent can sample exhaustively, or sample and generalize, to derive a policy, π, that maximizes the value of the action for each state.

[Diagram: the space of backup strategies: full backups (dynamic programming, exhaustive search) versus sample backups (temporal difference, Monte Carlo), and shallow backups (dynamic programming, temporal difference) versus deep backups (exhaustive search, Monte Carlo).]
Enforcing control in value methods

[Diagram: for a given state, the policy predicts a value for each action (A1, A2, A3) and the agent takes the argmax of these predicted values, with extra exploration during training.]
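The argmax with exploration shown above is often implemented as epsilon-greedy selection over the predicted action values. A minimal sketch, assuming the values for A1, A2, and A3 come from whatever value model is being trained:

import random

def select_action(action_values, epsilon=0.1):
    """Pick the argmax action, but try a random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(list(action_values))        # exploration (during training)
    return max(action_values, key=action_values.get)     # exploitation: argmax

# Illustrative predicted values for three actions in the current state:
print(select_action({"A1": 12.0, "A2": 27.0, "A3": 8.0}))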


Three approaches of value-based RL algorithms

● Monte Carlo
● Temporal Difference
● Dynamic Programming

[Diagram: three backup diagrams drawn over the same decision tree of states (S), actions (A), and termination points (T), one per approach.]
Terms of the decision tree

Symbol | What it represents | Examples
S | State (S) at each time step (t), or St, until termination (T). | S1, S2, S3, … ST
A | Action (A) at each time step (t), or At, until termination (T). | A1, A2, A3, … AT
T | Termination point, or end of the episode. |
Monte Carlo backup

[Diagram: a single sampled trajectory is followed from S1 through A1, S2, A2, S3, A3, and so on, all the way to a termination point; the value is backed up from the complete episode.]
Temporal Difference backup

[Diagram: only one step is sampled, from S1 through A1 to S2; the value is backed up from the immediate reward and the estimated value of the next state rather than from a complete episode.]
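The difference between the two backups shows up directly in the learning target: Monte Carlo waits for the full episode return, while temporal difference (TD(0)) bootstraps from the estimated value of the very next state. A minimal sketch with an illustrative discount factor:

GAMMA = 0.9

def monte_carlo_target(rewards):
    """Full-episode return: the sum of discounted rewards until termination."""
    return sum((GAMMA ** t) * r for t, r in enumerate(rewards))

def td0_target(reward, value_of_next_state):
    """One-step bootstrapped target: immediate reward plus the discounted estimate of S'."""
    return reward + GAMMA * value_of_next_state

# Monte Carlo needs the whole trajectory's rewards; TD(0) uses only the first step.
print(monte_carlo_target([0.0, 0.0, 1.0]))   # deep backup, waits for termination
print(td0_target(0.0, 0.81))                 # shallow backup, uses the V(S2) estimate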
Achieving data efficiency

The problem:
● Data inefficiency
● Rare event loss
● Policy drift
● Correlated experience
● Large state space

The solution: an experience replay buffer that stores transition tuples <s1, a1, r2, s2>, <s2, a2, r3, s3>, …, <st, at, rt+1, st+1>, each stored with a priority.

Collecting and learning from experience

[Diagram: the game environment sends states and rewards to the Q-learning agent, which sends actions back. Each transition <…>, together with its priority (pt, pt+1, pt+2, …), is written to the experience replay buffer, and batches of experiences are sampled from the buffer for the agent to learn from.]
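A minimal sketch of the replay mechanism above, assuming uniform sampling rather than the prioritized sampling suggested by the p values on the slide:

import random
from collections import deque

class ReplayBuffer:
    """Stores <state, action, reward, next_state> transitions and returns random batches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)    # the oldest experience is dropped once full

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        """Uniformly sample a batch; random draws break up correlated experience."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

# The game loop would call buffer.add(...) after every step and train the Q-learning
# agent on buffer.sample() batches instead of on only the latest transition.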


Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Policy-based reinforcement learning

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
T-RSML-O_2_M6_L6_policy-based_reinforcement_learning
Agenda

Introduction to reinforcement learning

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
RL methods

Reinforcement learning
● Model-based methods: analytic gradient computation, sampling-based planning, model-based data generation, value-equivalence prediction.
● Model-free methods: value-based, policy-based, actor-critic, contextual bandits, on-policy, off-policy.
Policy-based RL

You want the action performed in every state to help you gain maximum reward in the future.

The agent will:
● Learn the stochastic policy function that maps state to action.
● Act by sampling the policy.
● Utilize exploration techniques.

[Diagram: a stochastic policy assigns probabilities to actions, for example 20% and 80%, and can also produce continuous actions.]
Policy-based vs value-based

A policy-based approach is preferable over a value-based approach when:
● There are large action spaces.
● Stochasticity is needed.
● An agent will learn the policy directly.
● Lower bias in the policy is needed.

[Diagram: value-based methods score discrete actions (for example, 1 and 5), while policy-based methods output action probabilities (for example, 20% and 80%) and support continuous actions.]
Example: CartPole problem

A policy-based approach outputs an optimal action for a selected state.

Observation | Min | Max
Cart position | -2.4 | 2.4
Cart velocity | -Inf | Inf
Pole angle | -41.8 deg | 41.8 deg
Pole velocity at tip | -Inf | Inf
Policy-based algorithms

[Diagram: a model-free policy network takes the cart position, cart velocity, pole angle, and pole velocity at tip as inputs and outputs action probabilities, for example 0.9 for the left action and 0.1 for the right action.]

Model-free:
● State → optimal policy
● REINFORCE
● Proximal policy optimization (PPO)
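REINFORCE, listed above, nudges the policy parameters in the direction of the log-probability of each sampled action, weighted by the return that followed it. The sketch below uses an illustrative linear-softmax policy over the four CartPole observations and two actions; it is not the PPO or library implementation used in practice:

import numpy as np

theta = np.zeros((4, 2))           # linear policy: 4 observations -> 2 action preferences
ALPHA, GAMMA = 0.01, 0.99

def policy(state):
    """Softmax over action preferences: probability of the left and right actions."""
    prefs = state @ theta
    exp = np.exp(prefs - prefs.max())
    return exp / exp.sum()

def reinforce_update(episode):
    """episode: list of (state, action, reward); push up log-probs weighted by the return."""
    global theta
    returns, g = [], 0.0
    for _, _, r in reversed(episode):             # discounted return-to-go for each step
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()
    for (state, action, _), g in zip(episode, returns):
        probs = policy(state)
        grad_log = -np.outer(state, probs)        # gradient of log pi(action | state) ...
        grad_log[:, action] += state              # ... for a linear-softmax policy
        theta += ALPHA * g * grad_log

# One illustrative (state, action, reward) step: observation vector, action index, reward.
reinforce_update([(np.array([0.1, 0.0, 0.05, 0.0]), 1, 1.0)])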


Function approximation with Deep Learning

Q-Learning (used in the CartPole example): a table of Q-values with one row per state (State1 … StateN) and one column per action (Action1 … ActionM), holding entries Q1,1 through QN,M.

DQN (a better way): a neural network that takes the state as input and outputs a Q-value for every action (Action1 … ActionM).
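Moving from the Q-table to a DQN replaces the table lookup with a function approximator: the network takes the state as input and outputs one Q-value per action. A forward-pass-only sketch with illustrative layer sizes and untrained weights; a real DQN would also train these weights against replayed experience:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative Q-network for CartPole: 4 state inputs -> 16 hidden units -> 2 action values.
W1, b1 = rng.normal(scale=0.1, size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 2)), np.zeros(2)

def q_values(state):
    """Forward pass: one Q-value per action, replacing a row lookup in the Q-table."""
    hidden = np.maximum(0.0, state @ W1 + b1)    # ReLU hidden layer
    return hidden @ W2 + b2

state = np.array([0.02, -0.1, 0.03, 0.05])       # cart position, cart velocity, pole angle, tip velocity
print(q_values(state), "-> act with argmax:", int(np.argmax(q_values(state))))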
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Contextual bandits

Presenter: Alex Erfurt

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L7_contextual_bandits


Agenda

Introduction to reinforcement learning

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
RL methods

Reinforcement learning
● Model-based methods: analytic gradient computation, sampling-based planning, model-based data generation, value-equivalence prediction.
● Model-free methods: *value-based, *policy-based, actor-critic, *contextual bandits, on-policy, off-policy.
Contextual bandits

An extension of multi-armed bandits, or simplified RL.

● In a sequence of trials, the agent acts based on a given context.

[Diagram: in a contextual bandit, each trial is a single state → action → reward step; in the full RL problem, state → action → reward repeats as a sequence.]
What are multi-armed bandits?

An agent simultaneously attempts to:
● Explore (acquire new knowledge).
● Exploit (optimize its decisions based on existing knowledge).

[Diagram: four slot-machine arms with payouts such as 10, 9, 8, and 12 drawn from unknown distributions D1, D2, D3, D4; which arm should the agent pick at each time?]
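A minimal sketch of the explore/exploit trade-off for the four arms above, using epsilon-greedy selection over running-average reward estimates; the arm payouts are illustrative stand-ins for the unknown distributions D1 through D4:

import random

ARM_MEANS = [10, 9, 8, 12]            # illustrative mean payouts of the unknown distributions
counts = [0, 0, 0, 0]
estimates = [0.0, 0.0, 0.0, 0.0]
EPSILON = 0.1

for _ in range(1000):
    if random.random() < EPSILON:                        # explore: acquire new knowledge
        arm = random.randrange(len(ARM_MEANS))
    else:                                                # exploit: best estimate so far
        arm = max(range(len(ARM_MEANS)), key=lambda i: estimates[i])
    reward = random.gauss(ARM_MEANS[arm], 1.0)           # draw from the arm's unknown distribution
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]    # incremental average

print("Best arm so far:", max(range(len(ARM_MEANS)), key=lambda i: estimates[i]))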
What are contextual bandits?

● Each data point is a new episode.
● The value of exploration strategies is much easier to quantify and tune.
● The context can be the input feature space (recommender/personalization systems).
● A policy acts as the function approximator, estimating the value gained from an action.

[Diagram: the same four arms (D1, D2, D3, D4), but a function approximator now maps the context to the choice of arm.]
A multi-armed bandits agent

[Diagram: on the first play of the webpage, the agent shows item 1 and earns $0 or shows item 2 and earns $22; on the second play, showing item 1 earns $2 and showing item 2 earns $44. Each play is an action followed by a reward.]

There is no state. Every episode is independent.
Contextual bandits agent

[Diagram: the agent now sees context. At work on webpage A, showing item 1 earns $0 and showing item 2 earns $22; at home on webpage A, showing item 1 earns $2 and showing item 2 earns $44. At work on webpage B, showing item 1 earns $50 and showing item 2 earns $0; at home on webpage B, both items earn $0.]

There is context. Other episodes may influence the agent.
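The webpage example maps naturally onto a contextual bandit whose value estimates are keyed on (context, action) pairs rather than on the action alone. A minimal epsilon-greedy sketch; the reward table simply mirrors the dollar amounts on the slide and is illustrative:

import random

# Illustrative rewards mirroring the slide: (context, action) -> dollars earned.
REWARDS = {
    (("work", "A"), "item 1"): 0,  (("work", "A"), "item 2"): 22,
    (("home", "A"), "item 1"): 2,  (("home", "A"), "item 2"): 44,
    (("work", "B"), "item 1"): 50, (("work", "B"), "item 2"): 0,
    (("home", "B"), "item 1"): 0,  (("home", "B"), "item 2"): 0,
}
ACTIONS = ["item 1", "item 2"]
EPSILON = 0.1
estimates, counts = {}, {}

for _ in range(2000):
    context = (random.choice(["work", "home"]), random.choice(["A", "B"]))
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)                  # explore
    else:                                                # exploit the estimate for this context
        action = max(ACTIONS, key=lambda a: estimates.get((context, a), 0.0))
    reward = REWARDS[(context, action)]
    key = (context, action)
    counts[key] = counts.get(key, 0) + 1
    estimates[key] = estimates.get(key, 0.0) + (reward - estimates.get(key, 0.0)) / counts[key]

for ctx in sorted({k[0] for k in estimates}):
    print(ctx, "->", max(ACTIONS, key=lambda a: estimates.get((ctx, a), 0.0)))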


Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Quiz

Presenter: n/a

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L9_quiz_reinforcement_learning


Quiz
Reinforcement Learning
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Lab Intro

Presenter: Alex Erfurt

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L11_lab_intro


Lab intro
Applying reinforcement learning
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Lab Review

Presenter: Alex Erfurt

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L13_lab_review


Lab review
Applying reinforcement learning
