Nicolas Gast
Nicolas Gast – 1 / 31
What is Reinforcement Learning?
And why it differs from supervised or unsupervised learning
Challenges:
Many possible states and actions.
Rewards can be delayed or sparse.
Applications
...
The number of applications is increasing.
RL is about interacting with an environment
Reward signal
S1, A1, R2, S2, A2, R3, S3, A3, ..., RT, ST
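The interaction producing such a trajectory can be sketched as a loop. The toy environment and the random agent below are invented for illustration (they are not part of the course material):

```python
import random

class ToyEnv:
    """A hypothetical two-state environment, invented for illustration."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Action 1 from state 0 reaches the goal state and yields reward 1.
        if self.state == 0 and action == 1:
            self.state = 1
            return self.state, 1.0, True    # S_{t+1}, R_{t+1}, episode over?
        return self.state, 0.0, False

env = ToyEnv()
trajectory = [env.state]                    # S_1
done = False
while not done:
    action = random.choice([0, 1])          # the agent picks A_t
    state, reward, done = env.step(action)  # environment returns R_{t+1}, S_{t+1}
    trajectory += [action, reward, state]
```

The resulting list interleaves states, actions, and rewards exactly as in the trajectory above.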
Overview of the course
https://cours.univ-grenoble-alpes.fr/course/view.php?id=16628
1 Reinforcement learning
1–2 MDP + hands-on session
3–6 Classical RL algorithms + hands-on session
Personal work: some advice
Mandatory:
Understand the course / finish the exercises.
Do your homework.
Highly suggested:
Question what you learn.
Try to do some exercises.
Do not hesitate to program, to go deeper, to ask follow-up questions.
Ask questions during or after the course.
After the course, list the key ideas.
Outline
4 Conclusion
Fully observed environment
S – state space.
A – action space.
R – reward space.
Example: Vacuum Cleaner World
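As a concrete illustration, here is one possible encoding of the Vacuum Cleaner World (the exact formulation used in the course may differ): a state combines the robot's position with the dirt status of each cell.

```python
from itertools import product

positions = ["left", "right"]
dirt_status = list(product([True, False], repeat=2))  # dirt in each of the 2 cells

S = [(pos, dirt) for pos in positions for dirt in dirt_status]  # state space
A = ["move_left", "move_right", "suck"]                         # action space

print(len(S))  # 2 positions x 4 dirt configurations = 8 states
```

Even this tiny world already has 8 states; the state space grows exponentially with the number of cells.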
Algorithms to solve deterministic environments
Problem: computationally prohibitive if the tree is too big (or if the path to a leaf is too long).
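For a deterministic environment, the search over the tree of action sequences can be sketched as a recursive enumeration. The transition table below is a made-up example; note that the cost grows as |A|^depth, which is exactly the "tree is too big" problem mentioned above.

```python
# Deterministic transitions: (state, action) -> (next_state, reward).
# This tiny example is invented for illustration.
T = {
    ("s0", "a"): ("s1", 0), ("s0", "b"): ("s2", 1),
    ("s1", "a"): ("s3", 5), ("s1", "b"): ("s0", 0),
    ("s2", "a"): ("s3", 2), ("s2", "b"): ("s3", 0),
}
GOAL = "s3"

def best_return(state, depth):
    """Exhaustively search all action sequences of length <= depth."""
    if state == GOAL or depth == 0:
        return 0
    values = []
    for action in ("a", "b"):
        nxt, reward = T[(state, action)]
        values.append(reward + best_return(nxt, depth - 1))
    return max(values)

print(best_return("s0", 3))  # 5: take "a" to s1, then "a" to s3
```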
Stochastic environment
In a stochastic environment, the next state is random.
St+1 ∼ P(· | St = s, At = a)
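A stochastic transition St+1 ∼ P(· | St = s, At = a) can be simulated by sampling from the kernel; the states and probabilities below are invented for illustration.

```python
import random

# Hypothetical transition kernel: P[(s, a)][s2] = P(s2 | s, a).
P = {
    ("s0", "a"): {"s0": 0.3, "s1": 0.7},
    ("s0", "b"): {"s0": 0.9, "s1": 0.1},
}

def sample_next_state(s, a, rng=random):
    """Draw S_{t+1} according to P(. | S_t = s, A_t = a)."""
    dist = P[(s, a)]
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

next_state = sample_next_state("s0", "a")   # "s0" w.p. 0.3, "s1" w.p. 0.7
```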
Markov decision processes
Graphical representation
[Diagram: at each step t, the agent in state St takes action At, receives reward Rt+1, and moves to state St+1, and so on.]
Policies
A (deterministic) policy specifies which action to take in a given state:
π : S → A.
The agent takes At = π(St). This defines the behavior of the agent.
A (stochastic) policy specifies a probability distribution over actions:
π : S × A → [0, 1].
The agent takes At ∼ π(· | St).
Deterministic policies: optimal in general (for MDPs).
Stochastic policies: force exploration; useful in games / non-Markovian settings; differentiable.
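The two kinds of policies can be sketched as follows; the states, actions, and probabilities are invented for illustration.

```python
import random

# Deterministic policy: maps each state to one action, A_t = pi(S_t).
pi_det = {"s0": "right", "s1": "left"}

# Stochastic policy: pi(. | s) is a distribution over actions, A_t ~ pi(. | S_t).
pi_sto = {"s0": {"left": 0.2, "right": 0.8},
          "s1": {"left": 0.5, "right": 0.5}}

def act_det(state):
    return pi_det[state]

def act_sto(state, rng=random):
    dist = pi_sto[state]
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
```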
Return of a policy
We want to compute the best policy... But what is the best policy?
Do we choose At to optimize:
Rt+1? (no: too greedy)
RT? (only the final reward?)
Return of a policy (finite horizon)
Gt = Rt+1 + Rt+2 + ··· + RT.
Return of a policy (discounted infinite horizon)
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ··· = Σ_{k≥0} γ^k Rt+k+1,
where the discount factor γ ∈ [0, 1] indicates the relative value of closer rewards.
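Both returns can be computed with the same backward recursion Gt = Rt+1 + γ Gt+1 (taking γ = 1 recovers the finite-horizon sum):

```python
def discounted_return(rewards, gamma):
    """Compute G_t from the reward sequence R_{t+1}, ..., R_T."""
    g = 0.0
    for r in reversed(rewards):   # backward: G = R + gamma * G_next
        g = r + gamma * g
    return g

print(discounted_return([1, 1, 1], 0.5))   # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1, 1, 1], 1.0))   # plain sum: 3.0
```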
Value function
V^π(s) = E_π[Gt | St = s] is the expected return when starting in state s and then following policy π.
Action-Value function
Q^π(s, a) = E_π[Gt | St = s, At = a] is the expected return when starting in state s, taking action a, and then following policy π.
[Table: Q^π can be stored as a table with one entry per state–action pair, e.g. states s1...s4 and actions a1, a2.]
Bellman Equation (policy evaluation)
Hence:
V^π(s) = Σ_{s',r} (r + γ V^π(s')) p(r, s' | s, a = π(s)) = Q^π(s, π(s)).
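This equation can be solved by fixed-point iteration: repeatedly apply the right-hand side until V stops changing. The two-state MDP below is invented for illustration; here p[(s, a)] lists (probability, next state, reward) triples.

```python
# Hypothetical MDP, for illustration only.
p = {
    ("s0", "go"):   [(0.5, "s0", 0.0), (0.5, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
}
pi = {"s0": "go", "s1": "stay"}    # the fixed policy being evaluated
gamma = 0.9

V = {"s0": 0.0, "s1": 0.0}
for _ in range(500):               # iterate V <- Bellman right-hand side
    V = {s: sum(prob * (r + gamma * V[s2])
                for prob, s2, r in p[(s, pi[s])])
         for s in V}

# Fixed point: V(s1) = 2 / (1 - 0.9) = 20, V(s0) = 9.5 / 0.55 ≈ 17.27.
```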
The optimal policy
The optimal policy π* maximizes V^π(s) for every state s, or equivalently:
π*(s) ∈ argmax_a Q*(s, a).
Bellman’s equation (optimal policy)
V*(s) = max_a Σ_{s',r} (r + γ V*(s')) p(r, s' | s, a)
Q*(s, a) = Σ_{s',r} (r + γ max_{a'} Q*(s', a')) p(r, s' | s, a)
Iterative solutions
[Diagram: evaluate π to obtain Q^π, then improve π, and repeat.]
If you know the transitions and rewards: value iteration or policy iteration.
Value iteration:
Initialize V^0 (to some arbitrary value).
For k ≥ 0 and s ∈ S, do:
V^{k+1}(s) := max_a Σ_{s',r} (r + γ V^k(s')) p(r, s' | s, a)
Policy iteration:
Initialize π^0 (to some arbitrary policy).
For k ≥ 0:
Compute Q^{π^k} (= solve a linear system).
For all s ∈ S: π^{k+1}(s) := argmax_{a∈A} Q^{π^k}(s, a).
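Value iteration as described above can be sketched in a few lines. The MDP below is invented for illustration, with p[(s, a)] listing (probability, next state, reward) triples.

```python
p = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "go"):   [(1.0, "s1", 1.0)],
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "go"):   [(1.0, "s0", 0.0)],
}
states, actions, gamma = ("s0", "s1"), ("stay", "go"), 0.9

def backup(V, s, a):
    # One-step lookahead: expectation over (probability, next state, reward).
    return sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])

V = {s: 0.0 for s in states}
for _ in range(500):                       # V^{k+1}(s) = max_a backup(V^k, s, a)
    V = {s: max(backup(V, s, a) for a in actions) for s in states}

# Extract the greedy policy from the (near-)optimal V.
pi = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}
print(V, pi)   # V ≈ {'s0': 19.0, 's1': 20.0}; greedy policy: go from s0, stay in s1
```

Policy iteration would instead alternate the evaluation step (the fixed-point iteration shown earlier, or a linear solve) with this greedy improvement step.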
Exercise: the wheel of fortune
Important concepts
Challenges and future courses
Some of the modern challenges