
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Intro to module

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L1_introduction_to_module


Reinforcement
Learning

Anant Nawalgaria
Alex Erfurt
Machine Learning Specialists,
Google Cloud
Agenda
Introduction to RL

Applications of RL

RL framework and workflow


Model-based and model-free RL
Value-based RL

Policy-based RL

Contextual bandits
Objectives

1 Explain how reinforcement learning fits within the different machine learning types.

2 Define the primary branches and useful types of reinforcement learning.

3 Describe the framework and workflow to solve a common reinforcement learning problem.

4 Identify use cases where RL is the ideal approach to solving an ML problem and where other methods are preferable to RL.
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Introduction to Reinforcement Learning

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
Agenda

Introduction to reinforcement learning (RL)

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
Machine learning types

● Unsupervised learning
● Supervised learning
● Reinforcement learning
What is reinforcement learning?

Reinforcement learning is an area of machine learning where an agent learns by interacting with its environment.

Agents learn to:
● Achieve a goal.
● Achieve the optimal behavior.
● Obtain the maximum reward.

[Diagram: a tree of candidate actions (a1, a2) with the rewards they lead to, such as R = +1, R = 0, R = +2, and R = -5.]
RL process

[Diagram: the agent and the environment form a loop of (1) observation, (2) action, and (3) reward.]
Reinforcement learning in dog training

[Diagram: the dog is the agent and its surroundings are the environment; observations, actions, and rewards flow between them, and the rewards shape the dog's policy.]
Agent

[Diagram: inside the agent, a reinforcement learning algorithm performs policy updates; observations and rewards flow in from the environment, and the policy produces the actions applied back to it.]
Reinforcement learning methods

Reinforcement learning
● Model-based methods
● Model-free methods

Characteristics of reinforcement learning

● There is no supervisor; there is only a real number or reward signal.
● Decision making is sequential.
● Time plays a crucial role in RL problems.
● Feedback is always delayed, not instantaneous.
● The agent's actions determine the subsequent data it receives.
Reference use case

Warehouse fulfillment center:


● It spans 28 football fields.
● Online merchants outsource to have their
products stored, managed, and shipped
from the fulfillment center warehouse.
● Robots need to choose optimal paths.
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Applications of reinforcement learning

Presenter: Alex Erfurt

Format: Video - Lecture Screencast

Video Name:
T-RSML-O_2_M6_L8_applications_of_reinforcement_learning
Agenda

Introduction to reinforcement learning

Applications of RL

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits
Industries that use reinforcement learning

● Business
● Gaming
● Recommendation systems
● Science
Real-life applications

● Spotify recommendations
● Retailer recommendations ("You may also like")
● AlphaGo Zero / AlphaZero
When to use RL rather than SL or USL

Supervised learning | Reinforcement learning
Prediction problems | Control/optimization/decision-making problems
Short term | Optimized for long-term/delayed value
Offline training on cold historic data | Real-time training or offline simulation
IID data | Non-IID data possible
No trial and error required | Trial and error necessary
Suitable for lower variance/dynamism | Suitable for higher variance/dynamism
Transfer learning | Transfer learning not yet possible
Differentiable, non-noisy loss | Differentiability not critical; noisy reward OK
Why not combine model-based and model-free approaches?

● Combining allows the agent to learn from both real experience and the trajectories generated by the model.
● The agent uses both sources to take an action on the environment.

[Diagram: experience feeds both direct RL updates and model learning; the learned model is used for planning, and the resulting value/policy drives acting in the environment, which generates new experience.]
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: The Reinforcement Learning framework and workflow

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
Agenda

Introduction to reinforcement learning (RL)

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
The reinforcement learning framework

[Diagram: the agent receives the state and a reward from the environment and responds with an action; the environment defines the states, actions, and rewards.]
Terminology in reinforcement learning

Term | Definition
State | Summary of events so far; the current situation
Action | One or more events that alter the state
Environment | The scenario the agent has to respond to
Agent | The learner entity that performs actions in an environment
Reward | Feedback on agent actions, also known as the reward signal
Policy | Method to map the agent’s state to actions
Episode | A complete sequence of interactions that ends at a termination point
Value | Long-term reward gained by the end of an episode
Value function | Measure of potential future rewards from being in a particular state, or V(S)
Q(S,A) | “Q-value” of an action in various state/action pairs
SARSA | State Action Reward State Action
Using the RL framework: Example

[Diagram: a decision tree of "move up" and "move right" action sequences, shown next to the fulfillment center grid; one terminal cell carries a reward of +1 and another a reward of -1. At each cell the agent must decide: up or right?]
The value function as the RL algorithm

[Diagram: the same decision tree and fulfillment center grid, with the optimal sequence of moves highlighted. The value function captures that being in a cell near the +1 reward is better than being in the one below it.]
Find the best sequence of actions that will generate the optimal outcome

Collect the most reward!

[Diagram: the agent's policy maps the observation/state coming from the environment to an action; the environment returns a reward and the next observation.]
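The fulfillment center example can be made concrete with a small tabular sketch. The grid below is an illustrative stand-in (a 3x3 grid with a +1 goal cell and a -1 penalty cell, not the course's lab environment), and the update rule is plain Q-learning over the "up"/"right" moves from the decision tree.

import random

# Illustrative 3x3 grid (not the lab environment): start at (0, 0),
# +1 reward at (2, 2), -1 penalty at (2, 1), 0 everywhere else.
REWARDS = {(2, 2): 1.0, (2, 1): -1.0}
ACTIONS = {"up": (1, 0), "right": (0, 1)}   # the agent may only move up or right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

Q = {((row, col), a): 0.0 for row in range(3) for col in range(3) for a in ACTIONS}

def step(state, action):
    """Apply a move, clipping at the grid edge, and return (next_state, reward, done)."""
    d_row, d_col = ACTIONS[action]
    nxt = (min(state[0] + d_row, 2), min(state[1] + d_col, 2))
    return nxt, REWARDS.get(nxt, 0.0), nxt in REWARDS

for _ in range(2000):                                   # episodes
    state, done = (0, 0), False
    while not done:
        if random.random() < EPSILON:                   # explore
            action = random.choice(list(ACTIONS))
        else:                                           # exploit: argmax over Q-values
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = nxt

print(max(ACTIONS, key=lambda a: Q[((0, 0), a)]))       # preferred first move from the start cell

After enough episodes, the learned Q-values play the same role as the value function above: they rank the moves from each cell by how much future reward they are expected to collect.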
Setting up the problem for the RL workflow

Environment | How is it defined? Real or simulated.
Reward | Incentivize the agent to do what we want.
Policy | Structure the logic and parameters.
Training | Choose a training algorithm.
Deployment | Put the agent to work.
Reinforcement learning workflow

1 Create the environment
● Define the environment, including the interface with the agent (see the sketch after this list).
● It can be either a simulation model or a real physical system.
● Simulated environments are safer and allow experimentation.

2 Define the reward
● Specify the reward signal.
● Iterate several times to shape the reward.

3 Create the agent
● Choose how to represent the policy.
● Select a training algorithm.

4 Train and validate the agent
● Set up stopping criteria.
● Train the agent to tune the policy.
● Utilize multiple CPUs, GPUs, and clusters for complex applications.

5 Deploy the policy
● Use generated code to deploy the policy.
● Agents and training algorithms are no longer needed at this stage.
● Revisit earlier stages as needed.
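Stage 1 of this workflow is about defining the environment and its interface with the agent. As a rough sketch only (the class and method names below are assumptions, not part of any particular RL library), a simulated environment typically exposes a reset/step interface like this:

class WarehouseEnv:
    """Minimal simulated-environment sketch: the interface an agent trains against."""

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.position = (0, 0)
        return self.position

    def step(self, action):
        """Apply an action and return (observation, reward, done)."""
        row, col = self.position
        if action == "up":
            row = min(row + 1, 2)
        elif action == "right":
            col = min(col + 1, 2)
        self.position = (row, col)
        reward = 1.0 if self.position == (2, 2) else 0.0
        return self.position, reward, self.position == (2, 2)

Whether the environment is a simulator like this or a wrapper around a real physical system, keeping the interface stable lets the later stages (reward, policy, training, deployment) reuse the same agent code.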
Workflow

Step 1: Pull the observation generated by the environment.
Step 2: Choose an action based on the policy and apply it to the environment.
Step 3: Get the reward produced by the environment based on steps 1 and 2.
Step 4: Use the agent to train the target policy on trajectory data, including the observation, action, and reward from steps 1, 2, and 3.
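Steps 1 through 4 form a loop. A minimal sketch of that loop, assuming an environment with the reset/step interface sketched earlier and a policy object with illustrative act/update methods:

def run_episode(env, policy):
    """One pass through workflow steps 1 to 4: observe, act, collect reward, learn."""
    observation = env.reset()                            # Step 1: pull the first observation
    trajectory, done = [], False
    while not done:
        action = policy.act(observation)                 # Step 2: choose an action from the policy
        next_observation, reward, done = env.step(action)  # Step 3: reward from the environment
        trajectory.append((observation, action, reward))
        observation = next_observation
    policy.update(trajectory)                            # Step 4: train the policy on the trajectory
    return sum(reward for _, _, reward in trajectory)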
SARSA

State Action Reward State Action represents the quintuple an agent collects as it interacts with its environment, updating Q(St, At) and its policy based on the feedback it receives.

[Diagram: the fulfillment center grid with states such as S7 and S17, a +1 reward cell, and a -1 penalty cell.]

For instance:
● Q(S7, right) would have a possible reward of r = +1.
● Q(S17, up) would have a possible reward of r = -1.
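The SARSA quintuple drives a concrete update: Q(St, At) is nudged toward the reward plus the discounted Q-value of the next state-action pair the agent actually takes. A minimal sketch, with an illustrative dictionary Q-table and hyperparameters:

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """One SARSA step: Q(S,A) <- Q(S,A) + alpha * (R + gamma * Q(S',A') - Q(S,A))."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q

# The slide's Q(S7, right) with reward r = +1; "S8" and "up" are an illustrative
# next state-action pair the agent might take afterwards.
Q = {}
sarsa_update(Q, "S7", "right", +1, "S8", "up")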
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Model-based and model-free reinforcement learning

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
T-RSML-O_2_M6_L4_model-based_and_model-free_reinforcement_learning
Agenda

Introduction to reinforcement learning

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
Model-based and model-free RL

Reinforcement learning
● Model-based methods: the agent predicts what happens when certain actions are taken.
● Model-free methods: the agent learns a control policy directly from interacting with the environment.
Model-based method

The agent learns what is considered optimal behavior by taking actions and observing the resulting state and reward outcomes.

[Diagram: in the model-based approach, experience feeds model building; planning with the model produces the value function and policy.]

[Diagram: a maze with a destination and a dead end; with a model, we know what's not worth exploring.]
Model-free method

The agent learns by exploring all areas of the state space to fill out its value function as it searches for the best reward.

[Diagram: in the model-free approach, experience directly produces the value function and policy, with no model in between.]
Model-based vs model-free RL

Statement | Model-based | Model-free
You have access to or knowledge about the environment. | Yes | No
You can avoid needless exploration by focusing on areas you already know are worthwhile. | Yes | No
Need to make more assumptions and approximations. | Yes | No
Need lots of samples. | No | Yes
Over many episodes, results become less optimal. | Yes | No
Over many episodes, results become more optimal. | No | Yes
Applicable across a wide variety of applications. | No | Yes

Examples of model-based and model-free RL methods

Reinforcement learning
● Model-based methods: analytic gradient computation, sampling-based planning, model-based data generation, value-equivalence prediction.
● Model-free methods: *value-based, *policy-based, actor-critic, *contextual bandits, on-policy, off-policy.

* Discussed in this module.


Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Value-based reinforcement learning

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
T-RSML-O_2_M6_L5_value-based_reinforcement_learning
Agenda

Introduction to reinforcement learning

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
RL methods

Reinforcement learning
● Model-based methods: analytic gradient computation, sampling-based planning, model-based data generation, value-equivalence prediction.
● Model-free methods: *value-based, *policy-based, actor-critic, *contextual bandits, on-policy, off-policy.
Value-based approach

You explore in order to learn state-action values and maximize a value function, V(S).

The agent can sample exhaustively, or sample and generalize, to derive a policy, π, that maximizes the value of the action for each state.

[Diagram: the space of backup strategies: full backups (dynamic programming, exhaustive search) versus sample backups (temporal difference, Monte Carlo), and shallow backups (dynamic programming, temporal difference) versus deep backups (exhaustive search, Monte Carlo).]
Enforcing control in value methods

[Diagram: for a given state, the policy predicts a value for each action (A1, A2, A3) and the agent takes the argmax of these predicted values, with extra exploration during training.]
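The argmax with exploration shown above is often implemented as epsilon-greedy selection over the predicted action values. A minimal sketch, assuming the values for A1, A2, and A3 come from whatever value model is being trained:

import random

def select_action(action_values, epsilon=0.1):
    """Pick the argmax action, but try a random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(list(action_values))        # exploration (during training)
    return max(action_values, key=action_values.get)     # exploitation: argmax

# Illustrative predicted values for three actions in the current state:
print(select_action({"A1": 12.0, "A2": 27.0, "A3": 8.0}))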


Three approaches of value-based RL algorithms

● Monte Carlo
● Temporal Difference
● Dynamic Programming

[Diagram: three backup diagrams drawn over the same decision tree of states (S), actions (A), and termination points (T), one per approach.]
Terms of the decision tree

Symbol | What it represents | Examples
S | State (S) at each time step (t), or St, until termination (T). | S1, S2, S3, … ST
A | Action (A) at each time step (t), or At, until termination (T). | A1, A2, A3, … AT
T | Termination point, or end of the episode. |
Monte Carlo backup

[Diagram: a single sampled trajectory is followed from S1 through A1, S2, A2, S3, A3, and so on, all the way to a termination point; the value is backed up from the complete episode.]
Temporal Difference backup

[Diagram: only one step is sampled, from S1 through A1 to S2; the value is backed up from the immediate reward and the estimated value of the next state rather than from a complete episode.]
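The difference between the two backups shows up directly in the learning target: Monte Carlo waits for the full episode return, while temporal difference (TD(0)) bootstraps from the estimated value of the very next state. A minimal sketch with an illustrative discount factor:

GAMMA = 0.9

def monte_carlo_target(rewards):
    """Full-episode return: the sum of discounted rewards until termination."""
    return sum((GAMMA ** t) * r for t, r in enumerate(rewards))

def td0_target(reward, value_of_next_state):
    """One-step bootstrapped target: immediate reward plus the discounted estimate of S'."""
    return reward + GAMMA * value_of_next_state

# Monte Carlo needs the whole trajectory's rewards; TD(0) uses only the first step.
print(monte_carlo_target([0.0, 0.0, 1.0]))   # deep backup, waits for termination
print(td0_target(0.0, 0.81))                 # shallow backup, uses the V(S2) estimate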
Achieving data efficiency

The problem:
● Data inefficiency
● Rare event loss
● Policy drift
● Correlated experience
● Large state space

The solution: an experience replay buffer that stores transition tuples <s1, a1, r2, s2>, <s2, a2, r3, s3>, …, <st, at, rt+1, st+1>, each stored with a priority.

Collecting and learning from experience

[Diagram: the game environment sends states and rewards to the Q-learning agent, which sends actions back. Each transition <…>, together with its priority (pt, pt+1, pt+2, …), is written to the experience replay buffer, and batches of experiences are sampled from the buffer for the agent to learn from.]
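A minimal sketch of the replay mechanism above, assuming uniform sampling rather than the prioritized sampling suggested by the p values on the slide:

import random
from collections import deque

class ReplayBuffer:
    """Stores <state, action, reward, next_state> transitions and returns random batches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)    # the oldest experience is dropped once full

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        """Uniformly sample a batch; random draws break up correlated experience."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

# The game loop would call buffer.add(...) after every step and train the Q-learning
# agent on buffer.sample() batches instead of on only the latest transition.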


Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Policy-based reinforcement learning

Presenter: Anant Nawalgaria

Format: Video - Lecture Screencast

Video Name:
T-RSML-O_2_M6_L6_policy-based_reinforcement_learning
Agenda

Introduction to reinforcement learning

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
RL methods

Reinforcement learning
● Model-based methods: analytic gradient computation, sampling-based planning, model-based data generation, value-equivalence prediction.
● Model-free methods: value-based, policy-based, actor-critic, contextual bandits, on-policy, off-policy.
Policy-based RL

You want the action performed in every state to help you gain maximum reward in the future.

The agent will:
● Learn the stochastic policy function that maps state to action.
● Act by sampling the policy.
● Utilize exploration techniques.

[Diagram: a stochastic policy assigns probabilities to actions, for example 20% and 80%, and can also produce continuous actions.]
Policy-based vs value-based

A policy-based approach is preferable over a value-based approach when:
● There are large action spaces.
● Stochasticity is needed.
● An agent will learn the policy directly.
● Lower bias in the policy is needed.

[Diagram: value-based methods score discrete actions (for example, 1 and 5), while policy-based methods output action probabilities (for example, 20% and 80%) and support continuous actions.]
Example: CartPole problem

A policy-based approach outputs an optimal action for a selected state.

Observation | Min | Max
Cart position | -2.4 | 2.4
Cart velocity | -Inf | Inf
Pole angle | -41.8 deg | 41.8 deg
Pole velocity at tip | -Inf | Inf
Policy-based algorithms

[Diagram: a model-free policy network takes the cart position, cart velocity, pole angle, and pole velocity at tip as inputs and outputs action probabilities, for example 0.9 for the left action and 0.1 for the right action.]

Model-free:
● State → optimal policy
● REINFORCE
● Proximal policy optimization (PPO)
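REINFORCE, listed above, nudges the policy parameters in the direction of the log-probability of each sampled action, weighted by the return that followed it. The sketch below uses an illustrative linear-softmax policy over the four CartPole observations and two actions; it is not the PPO or library implementation used in practice:

import numpy as np

theta = np.zeros((4, 2))           # linear policy: 4 observations -> 2 action preferences
ALPHA, GAMMA = 0.01, 0.99

def policy(state):
    """Softmax over action preferences: probability of the left and right actions."""
    prefs = state @ theta
    exp = np.exp(prefs - prefs.max())
    return exp / exp.sum()

def reinforce_update(episode):
    """episode: list of (state, action, reward); push up log-probs weighted by the return."""
    global theta
    returns, g = [], 0.0
    for _, _, r in reversed(episode):             # discounted return-to-go for each step
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()
    for (state, action, _), g in zip(episode, returns):
        probs = policy(state)
        grad_log = -np.outer(state, probs)        # gradient of log pi(action | state) ...
        grad_log[:, action] += state              # ... for a linear-softmax policy
        theta += ALPHA * g * grad_log

# One illustrative (state, action, reward) step: observation vector, action index, reward.
reinforce_update([(np.array([0.1, 0.0, 0.05, 0.0]), 1, 1.0)])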


Function approximation with Deep Learning

Q-Learning (used in the CartPole example): a table of Q-values with one row per state (State1 … StateN) and one column per action (Action1 … ActionM), holding entries Q1,1 through QN,M.

DQN (a better way): a neural network that takes the state as input and outputs a Q-value for every action (Action1 … ActionM).
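Moving from the Q-table to a DQN replaces the table lookup with a function approximator: the network takes the state as input and outputs one Q-value per action. A forward-pass-only sketch with illustrative layer sizes and untrained weights; a real DQN would also train these weights against replayed experience:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative Q-network for CartPole: 4 state inputs -> 16 hidden units -> 2 action values.
W1, b1 = rng.normal(scale=0.1, size=(4, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 2)), np.zeros(2)

def q_values(state):
    """Forward pass: one Q-value per action, replacing a row lookup in the Q-table."""
    hidden = np.maximum(0.0, state @ W1 + b1)    # ReLU hidden layer
    return hidden @ W2 + b2

state = np.array([0.02, -0.1, 0.03, 0.05])       # cart position, cart velocity, pole angle, tip velocity
print(q_values(state), "-> act with argmax:", int(np.argmax(q_values(state))))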
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Contextual bandits

Presenter: Alex Erfurt

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L7_contextual_bandits


Agenda

Introduction to reinforcement learning

RL framework and workflow

Model-based and model-free RL

Value-based RL

Policy-based RL

Contextual bandits

Applications of RL
RL methods

Reinforcement learning
● Model-based methods: analytic gradient computation, sampling-based planning, model-based data generation, value-equivalence prediction.
● Model-free methods: *value-based, *policy-based, actor-critic, *contextual bandits, on-policy, off-policy.
Contextual bandits

An extension of multi-armed bandits, or simplified RL.

● In a sequence of trials, the agent acts based on a given context.

[Diagram: in a contextual bandit, each trial is a single state → action → reward step; in the full RL problem, state → action → reward repeats as a sequence.]
What are multi-armed bandits?

An agent simultaneously attempts to:
● Explore (acquire new knowledge).
● Exploit (optimize its decisions based on existing knowledge).

[Diagram: four slot-machine arms with payouts such as 10, 9, 8, and 12 drawn from unknown distributions D1, D2, D3, D4; which arm should the agent pick at each time?]
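A minimal sketch of the explore/exploit trade-off for the four arms above, using epsilon-greedy selection over running-average reward estimates; the arm payouts are illustrative stand-ins for the unknown distributions D1 through D4:

import random

ARM_MEANS = [10, 9, 8, 12]            # illustrative mean payouts of the unknown distributions
counts = [0, 0, 0, 0]
estimates = [0.0, 0.0, 0.0, 0.0]
EPSILON = 0.1

for _ in range(1000):
    if random.random() < EPSILON:                        # explore: acquire new knowledge
        arm = random.randrange(len(ARM_MEANS))
    else:                                                # exploit: best estimate so far
        arm = max(range(len(ARM_MEANS)), key=lambda i: estimates[i])
    reward = random.gauss(ARM_MEANS[arm], 1.0)           # draw from the arm's unknown distribution
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]    # incremental average

print("Best arm so far:", max(range(len(ARM_MEANS)), key=lambda i: estimates[i]))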
What are contextual bandits?

● Each data point is a new episode.
● The value of exploration strategies is much easier to quantify and tune.
● The context can be the input feature space (recommender/personalization systems).
● A policy acts as the function approximator, estimating the value gained from an action.

[Diagram: the same four arms (D1, D2, D3, D4), but a function approximator now maps the context to the choice of arm.]
A multi-armed bandits agent

[Diagram: on the first play of the webpage, the agent shows item 1 and earns $0 or shows item 2 and earns $22; on the second play, showing item 1 earns $2 and showing item 2 earns $44. Each play is an action followed by a reward.]

There is no state. Every episode is independent.
Contextual bandits agent

[Diagram: the agent now sees context. At work on webpage A, showing item 1 earns $0 and showing item 2 earns $22; at home on webpage A, showing item 1 earns $2 and showing item 2 earns $44. At work on webpage B, showing item 1 earns $50 and showing item 2 earns $0; at home on webpage B, both items earn $0.]

There is context. Other episodes may influence the agent.
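The webpage example maps naturally onto a contextual bandit whose value estimates are keyed on (context, action) pairs rather than on the action alone. A minimal epsilon-greedy sketch; the reward table simply mirrors the dollar amounts on the slide and is illustrative:

import random

# Illustrative rewards mirroring the slide: (context, action) -> dollars earned.
REWARDS = {
    (("work", "A"), "item 1"): 0,  (("work", "A"), "item 2"): 22,
    (("home", "A"), "item 1"): 2,  (("home", "A"), "item 2"): 44,
    (("work", "B"), "item 1"): 50, (("work", "B"), "item 2"): 0,
    (("home", "B"), "item 1"): 0,  (("home", "B"), "item 2"): 0,
}
ACTIONS = ["item 1", "item 2"]
EPSILON = 0.1
estimates, counts = {}, {}

for _ in range(2000):
    context = (random.choice(["work", "home"]), random.choice(["A", "B"]))
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)                  # explore
    else:                                                # exploit the estimate for this context
        action = max(ACTIONS, key=lambda a: estimates.get((context, a), 0.0))
    reward = REWARDS[(context, action)]
    key = (context, action)
    counts[key] = counts.get(key, 0) + 1
    estimates[key] = estimates.get(key, 0.0) + (reward - estimates.get(key, 0.0)) / counts[key]

for ctx in sorted({k[0] for k in estimates}):
    print(ctx, "->", max(ACTIONS, key=lambda a: estimates.get((ctx, a), 0.0)))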


Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Quiz

Presenter: n/a

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L9_quiz_reinforcement_learning


Quiz
Reinforcement Learning
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Lab Intro

Presenter: Alex Erfurt

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L11_lab_intro


Lab intro
Applying reinforcement learning
Course T-RSML: Recommendation Systems

Module 6: Reinforcement Learning

Lesson Title: Lab Review

Presenter: Alex Erfurt

Format: Video - Lecture Screencast

Video Name: T-RSML-O_2_M6_L13_lab_review


Lab review
Applying reinforcement learning
