
Reinforcement and Online Learning

Course 1: Markov Decision Processes

Nicolas Gast

September 29, 2021

What is Reinforcement Learning?
And why it differs from supervised or unsupervised learning

No i.i.d. dataset, but an environment.


No labels, but observation of rewards.
We design an agent that maps states to actions.

Challenges:
Many possible states and actions.
Rewards can be delayed or sparse.

Applications

Games (Go, Atari, StarCraft, ...)
Auto-piloting vehicles, robots, helicopters
Supply management, energy (e.g., data centers)
Trading, bidding
Toy models (OpenAI Gym)
...

The number of applications is increasing.

RL is about interacting with an environment

1 Get an observation of the state of the environment.
2 Choose an action.
3 Obtain a reward.

Your goal is to select actions to maximize the total reward.
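To make these three steps concrete, here is a minimal interaction loop in Python. The `CoinEnv` environment and its `reset`/`step` interface are made up for this sketch; they only mimic the observe/act/reward cycle described above.

```python
import random

class CoinEnv:
    """A toy environment (made up for this sketch): guess a coin flip, 10 steps per episode."""
    def reset(self):
        self.t = 0
        return 0                                   # 1. initial observation (a dummy state)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == random.randint(0, 1) else 0.0
        done = self.t >= 10
        return 0, reward, done                     # next state, reward, end-of-episode flag

def run_episode(env, policy):
    state = env.reset()                            # 1. observe the state
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                     # 2. choose an action
        state, reward, done = env.step(action)     # 3. obtain a reward
        total_reward += reward
    return total_reward

print(run_episode(CoinEnv(), policy=lambda state: 0))   # total reward of one episode
```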

Reward signal

At time $t$, we observe $S_t$, take action $A_t$, and obtain a reward $R_{t+1}$.

$S_1, A_1, R_2, S_2, A_2, R_3, S_3, A_3, \ldots, R_T, S_T$

Impact of actions can be delayed: on which actions does the reward depend?
Impact of actions can be weak or noisy.

Overview of the course
https://cours.univ-grenoble-alpes.fr/course/view.php?id=16628

1 Reinforcement learning
  1–2 MDP + hands-on session
  3–6 Classical RL algorithms + hands-on session

2 Online learning and bandits
  7–9 Adversarial bandits + hands-on session
  10–12 Online convex optimization + hands-on session

Grading: 30% homework and projects + 70% final exam.

Personal work: a few pieces of advice
Mandatory:
Understand the course / finish the exercises
Do your homework

Highly suggested:
Question what you learn.
Try to do some exercises.
Do not hesitate to program, to go deeper, to ask follow-up questions.
Ask questions during or after the course.
After the course, list the key ideas.

Read books (and/or research articles):

Reinforcement Learning: An Introduction (Sutton and Barto, 2nd ed., 2018)
Algorithms for Reinforcement Learning (Szepesvári, 2010)
Deep Reinforcement Learning Hands-On (Lapan, 2020)
Bandit Algorithms (Lattimore and Szepesvári, 2020)

Outline

1 Markov Decision Processes

2 Policies and Returns

3 Value Function and Bellman’s Equation

4 Conclusion

Fully observed environment
$\mathcal{S}$ – state space.
$\mathcal{A}$ – action space.
$\mathcal{R}$ – reward space.

Example: the Vacuum Cleaner World has 8 possible states and 3 actions.


Deterministic environments

In the Vacuum Cleaner World, the environment is deterministic. This is the case in many problems, for example Go, chess, tic-tac-toe, or the 8-puzzle.

Algorithms to solve deterministic environments

1 Generate a tree of possibilities.
2 Compute the payoff at the leaves.
3 Go backward to the root of the tree.

Problem: computationally prohibitive if the tree is too big (or if the path to a leaf is too long). A minimal sketch of this backward pass is given below.
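As an illustration, here is a minimal backward-induction sketch in Python. The tree, actions, and payoffs are made up for the example and are not tied to any particular game.

```python
# Backward induction on a small deterministic game tree.
# Internal nodes map actions to children; leaves carry a payoff.
tree = {
    "root": {"left": "A", "right": "B"},
    "A": {"left": "A1", "right": "A2"},
    "B": {"left": "B1", "right": "B2"},
}
payoffs = {"A1": 1.0, "A2": 0.0, "B1": 2.0, "B2": -1.0}

def best_value(node):
    """Best achievable payoff from `node`, computed backward from the leaves."""
    if node in payoffs:                 # leaf: the payoff is known
        return payoffs[node]
    return max(best_value(child) for child in tree[node].values())

print(best_value("root"))               # 2.0 (go right, then left)
```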

Stochastic environment
In a stochastic environment, the next state is random.

$$S_{t+1} \sim P(\cdot \mid S_t = s, A_t = a)$$

Example: Frozen Lake (OpenAI Gym).

Markov decision processes

An MDP is defined by:
$\mathcal{S}$ – state space
$\mathcal{A}$ – action set
Markovian transitions that drive the evolution:
$$\mathbb{P}(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a) = p(r, s' \mid s, a).$$

MDP = Markov chain + decisions

Most reinforcement learning problems can be framed as MDPs.
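To make the definition concrete, here is a minimal sketch in Python of a two-state MDP stored as a transition table; the states, actions, probabilities, and rewards are made up for illustration.

```python
import random

# Transition table: P[(s, a)] is a list of (probability, reward, next_state) triples,
# i.e. a tabular representation of p(r, s' | s, a).
P = {
    ("s0", "go"):   [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s1", "go"):   [(1.0, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
}

def step(state, action):
    """Sample (reward, next_state) from the Markovian transition kernel."""
    probs, outcomes = zip(*[(p, (r, s_next)) for p, r, s_next in P[(state, action)]])
    return random.choices(outcomes, weights=probs)[0]

print(step("s0", "go"))   # (1.0, 's1') with probability 0.8, (0.0, 's0') otherwise
```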

Graphical representation

[Diagram: at each step, the agent in state $S_t$ takes action $A_t$; the environment returns reward $R_{t+1}$ and next state $S_{t+1}$, and so on.]

Example: Frozen Lake (a gridworld example; see the sketch below).

State space: $\{0, \ldots, 15\}$ (a $4 \times 4$ grid).
Actions: $\{L, R, U, D\}$.
Transitions: slippery; the agent moves in the intended direction with probability 1/3.
Rewards: there are Holes (which end the episode) and a Goal (which gives a reward).
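A minimal sketch of interacting with this environment, assuming the classic OpenAI Gym API (recent gym/gymnasium releases return `(obs, info)` from `reset()` and a 5-tuple from `step()`, so the unpacking below may need adjusting):

```python
import gym

env = gym.make("FrozenLake-v1")                    # 4x4 slippery gridworld
state = env.reset()                                # classic API: reset() returns the initial state
total_reward, done = 0.0, False

while not done:
    action = env.action_space.sample()             # random policy, for illustration only
    state, reward, done, info = env.step(action)   # classic API: 4-tuple
    total_reward += reward

print("episode return:", total_reward)             # 1.0 if the Goal was reached, 0.0 otherwise
```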

Outline

1 Markov Decision Processes

2 Policies and Returns

3 Value Function and Bellman’s Equation

4 Conclusion

Policies
A (deterministic) policy specifies which action to take in each state:
$$\pi : \mathcal{S} \to \mathcal{A}.$$
The agent takes $A_t = \pi(S_t)$; this defines its behavior.

A stochastic policy specifies a distribution over actions:
$$\pi : \mathcal{S} \times \mathcal{A} \to [0, 1].$$
The agent takes $A_t \sim \pi(\cdot \mid S_t)$.

Deterministic: optimal in general.
Stochastic: forces exploration; useful in games / non-Markovian settings; differentiable.
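A minimal sketch in Python of both kinds of policy on a toy state/action set; the states, actions, and probabilities are made up for illustration.

```python
import random

actions = ["L", "R", "U", "D"]

# Deterministic policy: one action per state.
pi_det = {"s0": "R", "s1": "U"}

# Stochastic policy: a distribution over actions for each state.
pi_sto = {"s0": [0.7, 0.1, 0.1, 0.1], "s1": [0.25, 0.25, 0.25, 0.25]}

def act_deterministic(state):
    return pi_det[state]                                        # A_t = pi(S_t)

def act_stochastic(state):
    return random.choices(actions, weights=pi_sto[state])[0]    # A_t ~ pi(. | S_t)

print(act_deterministic("s0"), act_stochastic("s0"))
```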
Return of a policy

We want to compute the best policy... But what is the best policy?

Do we choose $A_t$ to optimize:
$R_{t+1}$? (no: too greedy)
$R_T$? (only the final reward?)

Return of a policy (finite horizon)

Sometimes a problem has a known finite horizon $T$. In that case, the return (a.k.a. gain) at time $t$ is:
$$G_t = R_{t+1} + R_{t+2} + \cdots + R_T.$$
It indicates the value of the current state.

Return of a policy (discounted infinite horizon)

When $T$ is not specified, it is common to look at the discounted return:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+1+k},$$
where the discount factor $\gamma \in [0, 1)$ indicates the relative value of closer rewards.

In practice, we will look at the expected return $\mathbb{E}[G_t]$.
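A minimal sketch in Python of computing a (truncated) discounted return from a sequence of observed rewards; the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = sum_k gamma^k * R_{k+1} for a finite list of rewards."""
    g = 0.0
    for r in reversed(rewards):      # work backward, using G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))   # 1 + 0.9*0 + 0.81*2 = 2.62
```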

Outline

1 Markov Decision Processes

2 Policies and Returns

3 Value Function and Bellman’s Equation

4 Conclusion

Value function

The value function of a policy $\pi$ is
$$V^\pi(s) = \mathbb{E}\bigl[G_t \mid S_t = s,\ A_{t+k} \sim \pi(\cdot \mid S_{t+k})\ \text{for all } k \ge 0\bigr].$$

It specifies the expected return. It is a vector of $|\mathcal{S}|$ values: if $\mathcal{S} = \{s_1, \ldots, s_4\}$, then $V$ is a table with one entry $V(s_i)$ per state $s_1, \ldots, s_4$.

Action-Value function

The action-value function of a policy $\pi$ is
$$Q^\pi(s, a) = \mathbb{E}\bigl[G_t \mid S_t = s,\ A_t = a,\ A_{t+k} \sim \pi(\cdot \mid S_{t+k})\ \text{for all } k \ge 1\bigr].$$

It is a table of $|\mathcal{S}| \times |\mathcal{A}|$ values. If $\mathcal{S} = \{s_1, \ldots, s_4\}$ and $\mathcal{A} = \{a_1, a_2\}$, then $Q$ has one row per state $s_1, \ldots, s_4$ and one column per action $a_1, a_2$.

From $Q$, we can define a greedy policy: $A_t = \arg\max_{a \in \mathcal{A}} Q(S_t, a)$.
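A minimal sketch in Python of reading the greedy action out of a Q-table; the table entries are made up for illustration.

```python
# Q-table: Q[state][action] = estimated action value.
Q = {
    "s0": {"a1": 0.3, "a2": 1.2},
    "s1": {"a1": 0.8, "a2": 0.1},
}

def greedy_action(state):
    """Return arg max_a Q(state, a)."""
    return max(Q[state], key=Q[state].get)

print(greedy_action("s0"))   # 'a2'
```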

Bellman Equation (policy evaluation)

Recall that $V^\pi(s) = \mathbb{E}[G_t \mid S_t = s]$ and that
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1}.$$

Hence:
$$V^\pi(s) = \underbrace{\sum_{s', r} \bigl(r + \gamma V^\pi(s')\bigr)\, p(r, s' \mid s, a = \pi(s))}_{=\,Q^\pi(s,\, \pi(s))}.$$
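Since $V^\pi$ is a fixed point of this equation, it can be computed by iterating the right-hand side. Here is a minimal policy-evaluation sketch in Python, using a made-up transition table in the same format as before (P[(s, a)] lists (probability, reward, next_state) triples):

```python
# Iterative policy evaluation: repeatedly apply the Bellman operator of a fixed policy.
P = {
    ("s0", "go"):   [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s1", "go"):   [(1.0, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
}
policy = {"s0": "go", "s1": "stay"}        # a fixed deterministic policy pi
gamma = 0.9

V = {s: 0.0 for s in ("s0", "s1")}
for _ in range(1000):                      # iterate until (approximate) convergence
    V = {s: sum(p * (r + gamma * V[s_next])
                for p, r, s_next in P[(s, policy[s])])
         for s in V}

print(V)                                   # approximation of V^pi
```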

The optimal policy

We denote $V^*(s) = \max_\pi V^\pi(s)$ and $Q^*(s, a) = \max_\pi Q^\pi(s, a)$.

The optimal policy $\pi^*$ is such that:
$$\pi^* = \arg\max_\pi V^\pi(s) \quad \forall s \in \mathcal{S},$$
or equivalently:
$$\pi^* = \arg\max_\pi Q^\pi(s, a) \quad \forall s \in \mathcal{S},\ a \in \mathcal{A}.$$

Bellman’s equation (optimal policy)

From Bellman's equation, the optimal value function and action-value function must satisfy:

$$V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a)$$
$$Q^*(s, a) = \sum_{s', r} \bigl(r + \gamma V^*(s')\bigr)\, p(r, s' \mid s, a)$$

How do we get $\pi^*$ and $V^*$?

Iterative solutions
[Diagram: evaluate $\pi$ to obtain $Q^\pi$, then improve $\pi$ from $Q^\pi$, and repeat.]

If you know the transitions and rewards: value iteration or policy iteration.
Value iteration (a code sketch is given below):
Initialize $V^0$ (to some arbitrary value).
For $k \ge 0$ and $s \in \mathcal{S}$, do:
$$V^{k+1}(s) := \max_{a \in \mathcal{A}} \sum_{s', r} \bigl(r + \gamma V^k(s')\bigr)\, p(r, s' \mid s, a).$$

Policy iteration:
Initialize $\pi^0$ (to some arbitrary policy).
For $k \ge 0$:
Compute $Q^{\pi^k}$ (a linear system).
For all $s \in \mathcal{S}$: $\pi^{k+1}(s) := \arg\max_{a \in \mathcal{A}} Q^{\pi^k}(s, a)$.
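A minimal value-iteration sketch in Python, reusing the same made-up transition-table format (P[(s, a)] lists (probability, reward, next_state) triples):

```python
# Value iteration on a tiny made-up MDP.
P = {
    ("s0", "go"):   [(0.8, 1.0, "s1"), (0.2, 0.0, "s0")],
    ("s0", "stay"): [(1.0, 0.0, "s0")],
    ("s1", "go"):   [(1.0, 0.0, "s0")],
    ("s1", "stay"): [(1.0, 2.0, "s1")],
}
states, actions, gamma = ["s0", "s1"], ["go", "stay"], 0.9

def backup(V, s, a):
    """One-step Bellman backup: sum of (r + gamma * V(s')) weighted by p(r, s' | s, a)."""
    return sum(p * (r + gamma * V[s_next]) for p, r, s_next in P[(s, a)])

V = {s: 0.0 for s in states}
for _ in range(1000):                      # V^{k+1}(s) = max_a backup(V^k, s, a)
    V = {s: max(backup(V, s, a) for a in actions) for s in states}

pi = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}   # greedy policy
print(V, pi)
```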
Exercise: the wheel of fortune

You can spin the wheel indefinitely. At each time $t$:
If you spin, the wheel stops on $X_t \in \{1, \ldots, 10\}$ (uniformly) and you earn $X_t$.
You can then spin again, or keep $X_{t+1} := X_t$.

How do you play, knowing that you want to maximize your discounted reward
$$\mathbb{E}\left[\sum_{t=1}^{\infty} \gamma^t X_t\right] \quad \text{with } \gamma = 0.9?$$

Bonus: if you can spin at most $T = 5$ times, what does it change?

Outline

1 Markov Decision Processes

2 Policies and Returns

3 Value Function and Bellman’s Equation

4 Conclusion

Important concepts

(to be filled in by you!)

Challenges and future courses

Learning the transitions and rewards? (course 2)
If the state space is too large:
  How do you store Q-values? (course 3)
Exploration or exploitation? (course 4)

Some of the modern challenges

Limited samples, convergence guarantees.
Safety issues, explainable agents.
Multi-agent settings (e.g., competing objectives).
Delayed or partial observations.
