Lecture 01: Introduction to Reinforcement Learning
Outline
5. Wrap-Up
Agent-oriented learning:
Repeated interactions with the world
Rewards for sequences of decisions
Do not know in advance how the world works
Why is RL different?
Agent-oriented learning: learning by interacting with an environment to achieve a goal
every AI problem can be phrased this way
all data science work loops are reinforcement learning
Typical Setting
[Diagram: a typical data-science work loop — history data is ingested, used for training & validation against test data, and the deployed model is monitored on new data, with a feedback path back to ingestion.]
RL Success Stories:
Learning Plasma Control for Fusion Science
Image credits: left Alain Herzog / EPFL, right DeepMind & SPC/EPFL.
Degrave et al. Nature 2022
https://www.nature.com/articles/s41586-021-04301-9
G. Chalvatzaki & D. Tateo · RL: Foundations to Deep · Summer Term 2024 11 / 38
2. What’s Special about RL?
Goal:
Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown & stochastic environment.
General assumption: it is “easier” to specify the cost of behavior than the behavior itself.
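The “long-term sum of rewards” is typically formalized as a (discounted) return. A minimal sketch, with a hypothetical reward sequence and an illustrative discount factor:

```python
gamma = 0.9                       # discount factor (an illustrative choice)
rewards = [1.0, 0.0, -0.5, 2.0]   # hypothetical rewards from one episode

# Discounted return: G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
G = sum(gamma ** t * r for t, r in enumerate(rewards))
print(G)
```

The discount factor trades off immediate against future rewards; with gamma = 1 this reduces to the plain sum of rewards over the episode.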
Reinforcement Learning:
Core Characteristics
RL Designer Choices
Representation:
how to represent the world, the space of actions/interventions, and the feedback signal/reward
Use of Prior Knowledge
Algorithm for learning
Objective function
Evaluation
Desirable Properties
Convergence
Consistency
Small generalization/estimation/approximation error
High learning speed
Safety
Stability
Computation time
Data available
Restrictions on how we can act (policy class, constraints on which actions can be taken in which states)
Online vs. offline learning
Do we get to choose how to act, or does someone else (an expert, a semi-expert; on-policy vs. off-policy learning . . . )?
                        AI Planning  SL  UL  RL  IL
Optimization                 X                X   X
Learns from experience            X   X   X   X
Generalization               X    X   X   X   X
Delayed consequences         X                X   X
Exploration                               X
AI planning assumes having a model of how decisions impact the environment
Supervised learning (SL) has access to the correct labels
Unsupervised learning (UL) has access to no labels
RL is given only reward information, and only for states reached and actions taken
Imitation learning (IL) typically assumes input demonstrations of good policies
Agent-Environment representation
Stochastic Processes
Markov Chain
A Markov chain (or Markov process) is a memoryless stochastic process, i.e., a sequence of random states s1, s2, . . . with the Markov property.
It models an environment in which all states are Markov and time is divided into stages.
Definition 3 (Finite Markov Chain)
A finite Markov chain is a tuple ⟨S, P⟩ with
S a finite set of discrete-time states St ∈ S,
P the state transition probability matrix with entries Pss′ = P[St+1 = s′ | St = s].
Ergodic
A Markov process is called ergodic if all states are recurrent (each state is visited an infinite number of times) and aperiodic (each state is visited without any systematic period).
Regular
A Markov process is called regular if some power of the transition matrix has only positive elements.
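The definitions above can be made concrete with a small sketch: sampling from a finite Markov chain and checking regularity by powering the transition matrix. The two-state chain below is a hypothetical example, not from the lecture:

```python
import numpy as np

# A small illustrative two-state Markov chain (hypothetical numbers).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])   # P[s, s'] = P[S_{t+1} = s' | S_t = s]

# Each row is a distribution over next states, so rows must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

def sample_chain(P, s0, steps, rng):
    """Sample s_0, ..., s_steps using the Markov property:
    the next state depends only on the current state."""
    states = [s0]
    for _ in range(steps):
        states.append(int(rng.choice(len(P), p=P[states[-1]])))
    return states

def is_regular(P, max_power=50):
    """Regular: some power of the transition matrix is strictly positive."""
    Pk = P.copy()
    for _ in range(max_power):
        if np.all(Pk > 0):
            return True
        Pk = Pk @ P
    return False

print(sample_chain(P, 0, 5, np.random.default_rng(0)))
print(is_regular(P))                                # this chain is regular
print(is_regular(np.array([[0., 1.], [1., 0.]])))   # periodic: not regular
```

The periodic chain that deterministically alternates between its two states never has an all-positive matrix power, so it is not regular (and not aperiodic, hence not ergodic either).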
[Diagram: the intelligent agent receives observations y1:t and rewards r1:t−1 from the environment and emits action at.]
1st Assumption:
Filtered State, Sufficient Statistics
Sufficient statistics, or belief state: bt = b(y1:t, a1:t−1)
[Diagram: observations y1:t pass through a filter that produces the belief state; the intelligent agent chooses action at based on it.]
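The belief state bt = b(y1:t, a1:t−1) can be maintained recursively. A minimal sketch of such a filter for a discrete hidden state, assuming a hypothetical two-state transition model T and observation model O (action dependence omitted for brevity):

```python
import numpy as np

# Hypothetical two-state example: T and O are illustrative numbers.
T = np.array([[0.8, 0.2],    # T[s, s'] = P(s' | s); action dependence omitted
              [0.3, 0.7]])
O = np.array([[0.9, 0.1],    # O[s, y] = P(y | s)
              [0.2, 0.8]])

def belief_update(b, y):
    """One filtering step: predict through T, then correct with P(y | s)."""
    b_pred = b @ T               # prediction: sum_s b(s) T[s, s']
    b_new = b_pred * O[:, y]     # correction: weight by observation likelihood
    return b_new / b_new.sum()   # renormalize to a probability distribution

b = np.array([0.5, 0.5])         # uniform initial belief
for y in [0, 0, 1]:              # a hypothetical observation sequence
    b = belief_update(b, y)
print(b)
```

Because the update only needs the previous belief and the newest observation, the belief is a sufficient statistic of the whole history y1:t, a1:t−1.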
2nd Assumption:
Markovian Observable State
[Diagram: the intelligent agent observes the Markovian state st directly and emits action at.]
Further Simplifications...
Problem Classification
Wrap-Up