
Reinforcement and Online Learning

Course 1: Markov Decision Processes

Nicolas Gast

September 29, 2021

What is Reinforcement Learning?
And why it differs from supervised or unsupervised learning

No i.i.d. dataset, but an environment.


No labels, but observation of rewards.
We design an agent that maps states to actions.

Challenges:
Many possible states and actions.
Rewards can be delayed or sparse.

Applications

Games (Go, Atari, StarCraft, ...)
Auto-piloting vehicles, robots, helicopters
Supply management, energy (e.g., data centers)
Trading, bidding
Toy models (OpenAI Gym)
...

The number of applications is increasing.

RL is about interacting with an environment

1 Get an observation of the state of the environment.
2 Choose an action.
3 Obtain a reward.

Your goal is to select actions to maximize the total reward.
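To make these three steps concrete, here is a minimal interaction loop in Python. The `CoinEnv` environment and its `reset`/`step` interface are made up for this sketch; they only mimic the observe/act/reward cycle described above.

```python
import random

class CoinEnv:
    """A toy environment (made up for this sketch): guess a coin flip, 10 steps per episode."""
    def reset(self):
        self.t = 0
        return 0                                   # 1. initial observation (a dummy state)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == random.randint(0, 1) else 0.0
        done = self.t >= 10
        return 0, reward, done                     # next state, reward, end-of-episode flag

def run_episode(env, policy):
    state = env.reset()                            # 1. observe the state
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                     # 2. choose an action
        state, reward, done = env.step(action)     # 3. obtain a reward
        total_reward += reward
    return total_reward

print(run_episode(CoinEnv(), policy=lambda state: 0))   # total reward of one episode
```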

Reward signal

At time $t$, we observe $S_t$, take action $A_t$, and obtain a reward $R_{t+1}$.

$S_1, A_1, R_2, S_2, A_2, R_3, S_3, A_3, \ldots, R_T, S_T$

Impact of actions can be delayed: on which actions does the reward depend?
Impact of actions can be weak or noisy.

Overview of the course
https://cours.univ-grenoble-alpes.fr/course/view.php?id=16628

1 Reinforcement learning
  1–2 MDP + hands-on session
  3–6 Classical RL algorithms + hands-on session

2 Online learning and bandits
  7–9 Adversarial bandits + hands-on session
  10–12 Online convex optimization + hands-on session

Grading: 30% homework and projects + 70% final exam.

Personal work: a few pieces of advice
Mandatory:
Understand the course / finish the exercises
Do your homework

Highly suggested:
Question what you learn.
Try to do some exercises.
Do not hesitate to program, to go deeper, to ask follow-up questions.
Ask questions during or after the course.
After the course, list the key ideas.

Read books (and/or research articles):

Reinforcement Learning: An Introduction (Sutton and Barto, 2nd ed., 2018)
Algorithms for Reinforcement Learning (Szepesvári, 2010)
Deep Reinforcement Learning Hands-On (Lapan, 2020)
Bandit Algorithms (Lattimore and Szepesvári, 2020)

Outline

1 Markov Decision Processes

2 Policies and Returns

3 Value Function and Bellman’s Equation

4 Conclusion

Fully observed environment
$\mathcal{S}$ – state space.
$\mathcal{A}$ – action space.
$\mathcal{R}$ – reward space.

Example: the Vacuum Cleaner World has 8 possible states and 3 actions.


Deterministic environments

In the Vacuum Cleaner World, the environment is deterministic. This is the case in many problems, for example Go, chess, tic-tac-toe, or the 8-puzzle.

Algorithms to solve deterministic environments

1 Generate a tree of possibilities.
2 Compute the payoff at the leaves.
3 Go backward to the root of the tree.

Problem: computationally prohibitive if the tree is too big (or if the path to a leaf is too long). A minimal sketch of this backward pass is given below.
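As an illustration, here is a minimal backward-induction sketch in Python. The tree, actions, and payoffs are made up for the example and are not tied to any particular game.

```python
# Backward induction on a small deterministic game tree.
# Internal nodes map actions to children; leaves carry a payoff.
tree = {
    "root": {"left": "A", "right": "B"},
    "A": {"left": "A1", "right": "A2"},
    "B": {"left": "B1", "right": "B2"},
}
payoffs = {"A1": 1.0, "A2": 0.0, "B1": 2.0, "B2": -1.0}

def best_value(node):
    """Best achievable payoff from `node`, computed backward from the leaves."""
    if node in payoffs:                 # leaf: the payoff is known
        return payoffs[node]
    return max(best_value(child) for child in tree[node].values())

print(best_value("root"))               # 2.0 (go right, then left)
```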

Stochastic environment
In a stochastic environment, the next state is random.

$$S_{t+1} \sim P(\cdot \mid S_t = s, A_t = a)$$

Example: Frozen Lake (OpenAI Gym).

Markov decision processes

An MDP is defined by:
$\mathcal{S}$ – state space
$\mathcal{A}$ – action set
Markovian transitions that drive the evolution:
$$\mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) = p(r, s' \mid s, a).$$

MDP = Markov chain + decisions

Most reinforcement learning problems can be framed as MDPs.
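To make the definition concrete, here is a minimal sketch in Python of a two-state MDP stored as a transition table; the states, actions, probabilities, and rewards are made up for illustration.

```python
import random

# Transition table: P[(s, a)] is a list of (probability, reward, next_state) triples,
# i.e. a tabular representation of p(r, s' | s, a).
P = {
    ("s0", "go"):   [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s1", "go"):   [(1.0, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
}

def step(state, action):
    """Sample (reward, next_state) from the Markovian transition kernel."""
    probs, outcomes = zip(*[(p, (r, s_next)) for p, r, s_next in P[(state, action)]])
    return random.choices(outcomes, weights=probs)[0]

print(step("s0", "go"))   # (1.0, 's1') with probability 0.8, (0.0, 's0') otherwise
```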

Graphical representation

[Diagram: at each step, the agent in state $S_t$ takes action $A_t$; the environment returns reward $R_{t+1}$ and next state $S_{t+1}$, and so on.]

Example: Frozen Lake (a gridworld example; see the sketch below).

State space: $\{0, \ldots, 15\}$ (a $4 \times 4$ grid).
Actions: $\{L, R, U, D\}$.
Transitions: slippery; the agent moves in the intended direction with probability 1/3.
Rewards: there are Holes (which end the episode) and a Goal (which gives a reward).
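A minimal sketch of interacting with this environment, assuming the classic OpenAI Gym API (recent gym/gymnasium releases return `(obs, info)` from `reset()` and a 5-tuple from `step()`, so the unpacking below may need adjusting):

```python
import gym

env = gym.make("FrozenLake-v1")                    # 4x4 slippery gridworld
state = env.reset()                                # classic API: reset() returns the initial state
total_reward, done = 0.0, False

while not done:
    action = env.action_space.sample()             # random policy, for illustration only
    state, reward, done, info = env.step(action)   # classic API: 4-tuple
    total_reward += reward

print("episode return:", total_reward)             # 1.0 if the Goal was reached, 0.0 otherwise
```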

Outline

1 Markov Decision Processes

2 Policies and Returns

3 Value Function and Bellman’s Equation

4 Conclusion

Policies
A (deterministic) policy specifies which action to take in each state:
$$\pi : \mathcal{S} \to \mathcal{A}.$$
The agent takes $A_t = \pi(S_t)$; this defines its behavior.

A stochastic policy specifies a distribution over actions:
$$\pi : \mathcal{S} \times \mathcal{A} \to [0, 1].$$
The agent takes $A_t \sim \pi(\cdot \mid S_t)$.

Deterministic: optimal in general.
Stochastic: forces exploration; useful in games / non-Markovian settings; differentiable.
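A minimal sketch in Python of both kinds of policy on a toy state/action set; the states, actions, and probabilities are made up for illustration.

```python
import random

actions = ["L", "R", "U", "D"]

# Deterministic policy: one action per state.
pi_det = {"s0": "R", "s1": "U"}

# Stochastic policy: a distribution over actions for each state.
pi_sto = {"s0": [0.7, 0.1, 0.1, 0.1], "s1": [0.25, 0.25, 0.25, 0.25]}

def act_deterministic(state):
    return pi_det[state]                                        # A_t = pi(S_t)

def act_stochastic(state):
    return random.choices(actions, weights=pi_sto[state])[0]    # A_t ~ pi(. | S_t)

print(act_deterministic("s0"), act_stochastic("s0"))
```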
Return of a policy

We want to compute the best policy... But what is the best policy?

Do we choose $A_t$ to optimize:
$R_{t+1}$? (no: too greedy)
$R_T$? (only the final reward?)

Return of a policy (finite horizon)

Sometimes a problem has a known finite horizon $T$. In that case, the return (a.k.a. gain) at time $t$ is:
$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T.$$
It indicates the value of the current state.

Return of a policy (discounted infinite horizon)

When $T$ is not specified, it is common to look at the discounted return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+1+k},$$
where the discount factor $\gamma \in [0, 1)$ indicates the relative value of closer rewards.

In practice, we will look at the expected return $\mathbb{E}[G_t]$.
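A minimal sketch in Python of computing a (truncated) discounted return from a sequence of observed rewards; the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = sum_k gamma^k * R_{k+1} for a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):      # work backward, using G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))   # 1 + 0.9*0 + 0.81*2 = 2.62
```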

Outline

1 Markov Decision Processes

2 Policies and Returns

3 Value Function and Bellman’s Equation

4 Conclusion

Value function

The value function of a policy $\pi$ is
$$V^\pi(s) = \mathbb{E}\bigl[G_t \mid S_t = s,\ A_{t+k} \sim \pi(\cdot \mid S_{t+k})\ \text{for all } k \ge 0\bigr].$$

It specifies the expected return. It is a vector of $|\mathcal{S}|$ values: if $\mathcal{S} = \{s_1, \ldots, s_4\}$, then $V$ is a table with one entry $V(s_i)$ per state $s_1, \ldots, s_4$.

Action-Value function

The action-value function of a policy $\pi$ is
$$Q^\pi(s, a) = \mathbb{E}\bigl[G_t \mid S_t = s,\ A_t = a,\ A_{t+k} \sim \pi(\cdot \mid S_{t+k})\ \text{for all } k \ge 1\bigr].$$

It is a table of $|\mathcal{S}| \times |\mathcal{A}|$ values. If $\mathcal{S} = \{s_1, \ldots, s_4\}$ and $\mathcal{A} = \{a_1, a_2\}$, then $Q$ has one row per state $s_1, \ldots, s_4$ and one column per action $a_1, a_2$.

From $Q$, we can define a greedy policy: $A_t = \arg\max_{a \in \mathcal{A}} Q(S_t, a)$.
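A minimal sketch in Python of reading the greedy action out of a Q-table; the table entries are made up for illustration.

```python
# Q-table: Q[state][action] = estimated action value.
Q = {
    "s0": {"a1": 0.3, "a2": 1.2},
    "s1": {"a1": 0.8, "a2": 0.1},
}

def greedy_action(state):
    """Return arg max_a Q(state, a)."""
    return max(Q[state], key=Q[state].get)

print(greedy_action("s0"))   # 'a2'
```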

Bellman Equation (policy evaluation)

Recall that $V^\pi(s) = \mathbb{E}[G_t \mid S_t = s]$ and that
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1}.$$

Hence:
$$V^\pi(s) = \underbrace{\sum_{s', r} \bigl(r + \gamma V^\pi(s')\bigr)\, p(r, s' \mid s, a = \pi(s))}_{=\,Q^\pi(s,\, \pi(s))}.$$
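Since $V^\pi$ is a fixed point of this equation, it can be computed by iterating the right-hand side. Here is a minimal policy-evaluation sketch in Python, using a made-up transition table in the same format as before (P[(s, a)] lists (probability, reward, next_state) triples):

```python
# Iterative policy evaluation: repeatedly apply the Bellman operator of a fixed policy.
P = {
    ("s0", "go"):   [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s1", "go"):   [(1.0, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
}
policy = {"s0": "go", "s1": "stay"}        # a fixed deterministic policy pi
gamma = 0.9

V = {s: 0.0 for s in ("s0", "s1")}
for _ in range(1000):                      # iterate until (approximate) convergence
    V = {s: sum(p * (r + gamma * V[s_next])
                for p, r, s_next in P[(s, policy[s])])
         for s in V}

print(V)                                   # approximation of V^pi
```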

The optimal policy

We denote $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s, a) = \max_\pi Q^\pi(s, a)$.

The optimal policy $\pi^*$ is such that:
$$\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in \mathcal{S},$$
or equivalently:
$$\pi^* = \arg\max_\pi Q^\pi(s, a) \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}.$$

Bellman’s equation (optimal policy)

From Bellman's equation, the optimal value function and action-value function must satisfy:

$$V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a)$$
$$Q^*(s, a) = \sum_{s', r} \bigl(r + \gamma V^*(s')\bigr)\, p(r, s' \mid s, a)$$

How do we get $\pi^*$ and $V^*$?

Iterative solutions
[Diagram: evaluate $\pi$ to obtain $Q^\pi$, then improve $\pi$ from $Q^\pi$, and repeat.]

If you know the transitions and rewards: value iteration or policy iteration.
Value iteration (a code sketch is given below):
Initialize $V^0$ (to some arbitrary value).
For $k \ge 0$ and $s \in \mathcal{S}$, do:
$$V^{k+1}(s) := \max_{a \in \mathcal{A}} \sum_{s', r} \bigl(r + \gamma V^k(s')\bigr)\, p(r, s' \mid s, a).$$

Policy iteration:
Initialize $\pi^0$ (to some arbitrary policy).
For $k \ge 0$:
Compute $Q^{\pi^k}$ (a linear system).
For all $s \in \mathcal{S}$: $\pi^{k+1}(s) := \arg\max_{a \in \mathcal{A}} Q^{\pi^k}(s, a)$.
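A minimal value-iteration sketch in Python, reusing the same made-up transition-table format (P[(s, a)] lists (probability, reward, next_state) triples):

```python
# Value iteration on a tiny made-up MDP.
P = {
    ("s0", "go"):   [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s1", "go"):   [(1.0, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
}
states, actions, gamma = ["s0", "s1"], ["go", "stay"], 0.9

def backup(V, s, a):
    """One-step Bellman backup: sum of (r + gamma * V(s')) weighted by p(r, s' | s, a)."""
    return sum(p * (r + gamma * V[s_next]) for p, r, s_next in P[(s, a)])

V = {s: 0.0 for s in states}
for _ in range(1000):                      # V^{k+1}(s) = max_a backup(V^k, s, a)
    V = {s: max(backup(V, s, a) for a in actions) for s in states}

pi = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}   # greedy policy
print(V, pi)
```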
Exercise: the wheel of fortune

You can spin the wheel indefinitely. At each time $t$:
If you spin, the wheel stops on $X_t \in \{1, \ldots, 10\}$ (uniformly) and you earn $X_t$.
You can then spin again, or keep $X_{t+1} := X_t$.

How do you play, knowing that you want to maximize your discounted reward
$$\mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t X_t\right] \quad \text{with } \gamma = 0.9?$$

Bonus: if you can spin at most $T = 5$ times, what does it change?

Outline

1 Markov Decision Processes

2 Policies and Returns

3 Value Function and Bellman’s Equation

4 Conclusion

Important concepts

(to be filled in by you!)

Challenges and future courses

Learning the transitions and rewards? (course 2)
If the state space is too large:
  How do you store Q-values? (course 3)
Exploration or exploitation? (course 4)

Some of the modern challenges

Limited samples, convergence guarantees.
Safety issues, explainable agents.
Multi-agent settings (e.g., competing objectives).
Delayed or partial observations.
