
Reinforcement Learning: From Foundations to Deep Approaches
Lecture 01: Introduction to Reinforcement Learning

Georgia Chalvatzaki & Davide Tateo

Department of Computer Science


TU Darmstadt
Summer Term 2024



1. What is Reinforcement Learning?

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


What is Reinforcement Learning?

“The fundamental challenge in artificial intelligence and machine learning is learning to make good decisions under uncertainty.” – E. Brunskill

Agent-oriented learning:
Repeated interactions with the world
Rewards for sequences of decisions
Do not know in advance how the world works


Why is RL different?
Agent-oriented learning: learning by interacting with an environment to achieve a goal
every AI problem can be phrased this way
all data science work loops are reinforcement learning

[Figure: a typical data-science workflow: ingestion, data preparation, training and validation on history data, evaluation on test data, deployment, prediction/scoring on new data, acting, monitoring, and a feedback loop back to ingestion]

Learning by trial and error, with only delayed evaluative feedback (reward)
the kind of machine learning most like natural learning
learning that can tell for itself when it is right or wrong
The beginnings of a science of mind that is neither natural science nor applications technology

Typical Setting

At every time step, the agent perceives the state of the environment
Based on this perception, it chooses an action
The action causes the agent to receive a numerical reward
Find a way of choosing actions, called a policy, that maximizes the agent’s long-term expected return (see the sketch below)
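Below is a minimal sketch of this interaction loop, assuming a hypothetical toy `Environment` class with `reset`/`step` methods and a purely random policy; all names and numbers are illustrative, not part of the lecture:

```python
import random

class Environment:
    """Toy corridor environment: a hypothetical stand-in for the real world."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); reward arrives only at the goal (delayed feedback)
        self.state = max(0, min(self.length, self.state + action))
        done = self.state == self.length
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def policy(state):
    # a (bad) random policy; the point of RL is to learn a better one from rewards
    return random.choice([-1, +1])

env = Environment()
state, done, episode_return = env.reset(), False, 0.0
while not done:
    action = policy(state)                    # perceive the state, choose an action
    state, reward, done = env.step(action)    # environment returns next state and reward
    episode_return += reward                  # accumulate the return of the episode
print("episode return:", episode_return)
```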

Why study RL now?

[Figure from Henderson et al. 2018 AAAI, https://arxiv.org/pdf/1709.06560.pdf]


RL Success Stories: Atari

Learned to play 49 Atari 2600 console games from self-play
Predicts the final score for all 18 joystick actions
Amazing performance:
  Better than all previous algorithms
  Human level for ~50% of all games
  Same learning algorithm for all 49 games

RL Success Stories: AlphaGo


RL Success Stories: Robot Locomotion

Lee, Joonho, et al. "Learning quadrupedal locomotion over challenging terrain." Science Robotics 5.47 (2020), https://arxiv.org/abs/2010.11251
Video: https://youtu.be/9j2a1oAHDL8

RL Success Stories: ChatGPT


RL Success Stories: Learning Plasma Control for Fusion Science

Image credits: left Alain Herzog / EPFL, right DeepMind & SPC/EPFL.
Degrave et al. Nature 2022
https://www.nature.com/articles/s41586-021-04301-9
2. What’s Special about RL?

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


What’s Special about RL?

Goal: learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown & stochastic environment.
General assumption: it is “easier” to specify the cost of behavior than the behavior itself.


Reinforcement Learning:
Core Characteristics

Characterization of any RL problem:
There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters: data is sequential, not i.i.d.
The agent’s actions affect the subsequent data it receives (the agent generates its own data)
(Picture by Emma Brunskill)


Reinforcement Learning:
Core Characteristics

Complicating factors and issues


Dynamic (state-dependent) environments
Stochastic dynamics and rewards
Unknown dynamics and rewards
Exploration vs. exploitation
Delayed rewards: temporal credit assignment
Complicated systems: large state space, unstructured dynamics


RL Designer Choices

Representation: how to represent the world, the space of actions/interventions, and the feedback signal/reward
Use of Prior Knowledge
Algorithm for learning
Objective function
Evaluation


Desirable Properties

Convergence
Consistency
Small generalization/estimation/approximation error
High learning speed
Safety
Stability


Typical Restrictions and Constraints

Computation time
Data available
Restrictions on how the agent can act (policy class, constraints on which actions can be taken in which states)
Online vs. offline learning
Do we get to choose how to act, or does someone else (an expert, a semi-expert; off-policy vs. on-policy learning, ...)?


RL vs. other learning paradigms

                        | AI Planning | SL | UL | RL | IL
Optimization            |      X      |    |    |  X |  X
Learns from experience  |             |  X |  X |  X |  X
Generalization          |      X      |  X |  X |  X |  X
Delayed Consequences    |      X      |    |    |  X |  X
Exploration             |             |    |    |  X |
AI planning assumes having a model of how decisions impact the environment
Supervised learning (SL) has access to the correct labels
Unsupervised learning (UL) has access to no labels
RL is given only reward information, and only for the states reached and actions taken
Imitation learning (IL) typically assumes input demonstrations of good policies



3. Introduction to Markov Decision Process

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


Agent-Environment representation

Richard Sutton (1960? - now) and Andrew Barto (1948 - now)
(Picture from Sutton & Barto)

Sutton/Barto: Reward Hypothesis - learn from a scalar reward function


Intro to MDPs (1/2)

Markov Decision Processes formally describe an environment for reinforcement learning
The environment is fully observable: the current state completely characterizes the process
MDPs allow precise theoretical statements (e.g., on optimal solutions)
They deliver insights into suitable RL algorithms, since many real-world problems can be abstracted as MDPs
Almost all RL problems can be formalized as MDPs:
  Optimal control primarily deals with continuous MDPs
  Partially observable problems can be converted into MDPs
  Bandits are MDPs with one state


Intro to MDPs (2/2)

In the following we’ll focus on:
  fully observable MDPs and
  finite MDPs (i.e., a finite number of states & actions)

               | All states observable: Yes | All states observable: No
Actions: No    | Markov Chain               | Hidden Markov Model
Actions: Yes   | MDP                        | Partially Observable MDP

Table: Types of Markov Models


Stochastic Processes

A stochastic process is an indexed collection of random variables $\{S_t\}$, e.g., a time series of weekly demands for a product
Discrete case: at a particular time t, labeled by integers, the system is found in exactly one of a finite number of mutually exclusive and exhaustive categories or states, also labeled by integers
The process can be embedded, i.e., the time points refer to the occurrence of specific events (or time may be evenly spaced)
Some basic types of stochastic processes include Markov processes, Poisson processes (such as radioactive decay), and time series, with the index variable referring to time


Example of Stochastic Processes

Random variables may depend on others, e.g.,

$$S_{t+1} = \begin{cases} \max\{3 - D_{t+1},\, 0\} & \text{if } S_t \le 0 \\ \max\{S_t - D_{t+1},\, 0\} & \text{if } S_t > 0 \end{cases}$$

or

$$S_{t+1} = \sum_{k=0}^{K} \alpha_k S_{t-k} + \xi_t, \quad \text{with } \xi_t \sim \mathcal{N}(\mu, \sigma^2)$$
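Both example processes are easy to simulate; below is a small sketch, assuming an illustrative Poisson demand distribution and made-up autoregressive coefficients (none of these values come from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Inventory-style process: restock to 3 when empty, otherwise serve demand D_{t+1}
S = [2.0]
for t in range(10):
    D = rng.poisson(1.0)                      # illustrative demand distribution
    if S[-1] <= 0:
        S.append(max(3 - D, 0.0))
    else:
        S.append(max(S[-1] - D, 0.0))

# Autoregressive process: S_{t+1} = sum_k alpha_k S_{t-k} + xi_t, xi_t ~ N(mu, sigma^2)
alpha, mu, sigma = np.array([0.6, 0.3]), 0.0, 0.1   # K = 1, made-up coefficients
X = [1.0, 0.8]
for t in range(10):
    xi = rng.normal(mu, sigma)
    X.append(float(alpha @ X[-1:-3:-1]) + xi)       # uses S_t and S_{t-1}

print("inventory trajectory:", S)
print("AR(2) trajectory:", [round(x, 3) for x in X])
```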


The Markov Property

“The future is independent of the past given the present”


Definition 1
A stochastic process $S_t$ is said to be Markovian if and only if
$$P(S_{t+1} = j \mid S_t = i, S_{t-1} = k_{t-1}, \ldots, S_0 = k_0) = P(S_{t+1} = j \mid S_t = i)$$

The state captures all the information from the history
Once the state is known, the history may be thrown away
The state is a sufficient statistic for the future
The conditional probabilities are transition probabilities
If the probabilities are stationary (time-invariant), we can write:
$$p_{ij} = P(S_{t+1} = j \mid S_t = i) = P(S_1 = j \mid S_0 = i)$$
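If the transition probabilities are indeed stationary, they can be estimated from an observed sequence simply by counting transitions; a small sketch with a made-up state sequence:

```python
import numpy as np

# Illustrative observed state sequence over states {0, 1, 2}
sequence = [0, 1, 1, 2, 0, 1, 2, 2, 1, 0, 0, 1]

n = 3
counts = np.zeros((n, n))
for i, j in zip(sequence[:-1], sequence[1:]):
    counts[i, j] += 1                            # count observed transitions i -> j

# Empirical transition probabilities p_ij = P(S_{t+1} = j | S_t = i)
P_hat = counts / counts.sum(axis=1, keepdims=True)
print(np.round(P_hat, 2))
```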


State Transition Matrix


Definition 2 (State transition matrix)
Given a Markov state $S_t = s$ and its successor $S_{t+1} = s'$, the state transition probability for all $s, s' \in \mathcal{S}$ is defined by the matrix

$$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s].$$

Here, $\mathcal{P}_{ss'} \in \mathbb{R}^{n \times n}$ has the form

$$\mathcal{P}_{ss'} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ p_{n1} & \cdots & \cdots & p_{nn} \end{pmatrix}$$

with $p_{ij} \in [0, 1]$ being the probability of going from state $s = S_i$ to state $s' = S_j$. Obviously, $\sum_j p_{ij} = 1$ must hold for all $i$.
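A quick numerical check of these properties for a concrete transition matrix (the matrix values below are made up for illustration):

```python
import numpy as np

# Illustrative 3-state transition matrix: P[i, j] = p_ij = P(S_{t+1} = j | S_t = i)
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])

assert np.all((0.0 <= P) & (P <= 1.0)), "every p_ij must lie in [0, 1]"
assert np.allclose(P.sum(axis=1), 1.0), "every row must sum to 1"

# k-step transition probabilities are given by the k-th matrix power
P5 = np.linalg.matrix_power(P, 5)
print("P(S_5 = j | S_0 = i):\n", P5)
```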


Markov Chain
A Markov chain (or Markov process) is a memoryless stochastic process, i.e., a sequence of random states $S_1, S_2, \ldots$ with the Markov property.
It models an environment in which all states are Markov and time is divided into stages.

Definition 3 (Finite Markov Chain)
A finite Markov chain is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$ with
  $\mathcal{S}$ being a finite set of discrete-time states $S_t \in \mathcal{S}$,
  $\mathcal{P}$, with entries $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$, being the state transition probability.

A specific stochastic process model
Sequence of random variables $S_t, S_{t+1}, \ldots$
“Memoryless”
In a continuous-time framework: Markov process (a sampling sketch follows below)
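Sampling such a chain is straightforward: thanks to the Markov property, each step only needs the current state and the corresponding row of the transition matrix (the matrix below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative transition matrix over states {0, 1, 2}
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])

def sample_chain(P, s0, steps):
    """Sample a trajectory s_0, s_1, ... using only the current state (memoryless)."""
    states = [s0]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_chain(P, s0=0, steps=15))
```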

Types of Markov Processes


Absorbing
A Markov process is called absorbing if it has at least one absorbing
state and if that state can be reached from every other state (not
necessarily in one step).

Ergodic
A Markov process is called ergodic if all states are recurrent (each
state is visited an infinite number of times) and aperiodic (each state
is visited without any systematic period).

Regular
A Markov process is called regular if some power of the transition
matrix has only positive elements.
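These definitions translate into simple checks on the transition matrix; below is a sketch under the assumption that the state space is small enough for brute force. The example matrices are made up, and the absorbing check only finds absorbing states, not the full reachability condition of an absorbing process:

```python
import numpy as np

def is_regular(P, max_power=50):
    """Regular: some power of P has only positive entries (max_power is a pragmatic bound)."""
    Q = np.eye(len(P))
    for _ in range(max_power):
        Q = Q @ P
        if np.all(Q > 0):
            return True
    return False

def absorbing_states(P):
    """A state i is absorbing if p_ii = 1 (the process never leaves it)."""
    return [i for i in range(len(P)) if np.isclose(P[i, i], 1.0)]

P_regular = np.array([[0.5, 0.5], [0.2, 0.8]])
P_absorbing = np.array([[1.0, 0.0], [0.3, 0.7]])
print(is_regular(P_regular), absorbing_states(P_absorbing))   # True [0]
```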



4. Flavors of the RL Problem

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


Flavors of the RL Problem

[Diagram: the intelligent agent receives observations $y_{1:t}$, rewards $r_{1:t-1}$, and its past actions $a_{1:t-1}$, and outputs the next action $a_t$]

Full Reinforcement Learning Problem
The agent can only test $p(y_{t+1} \mid y_{1:t}, a_{1:t})$ to obtain rewards $r_t = r(y_{1:t}, a_{1:t})$
True AI? The agent has to “understand” the environment
Practically? Requires infinitely many “twin brothers”...


1st Assumption: Filtered State, Sufficient Statistics

[Diagram: a filter compresses the observations $y_{1:t}$ and past actions $a_{1:t-1}$ into a sufficient statistic or belief state $b_t = b(y_{1:t}, a_{1:t-1})$; the intelligent agent acts on $b_t$, chooses action $a_t$, and receives decomposable rewards, e.g., $r(b_t, a_t)$]

Partially Observable Markov Decision Problem (POMDP)
The sufficient filter statistics $b_t$ are Markovian: $p(b_{t+1} \mid b_t, a_t)$
Rewards are decomposable: $r_t = r(b_t, a_t)$
Every problem is an (infinite- or finite-dimensional) POMDP, but where do filters come from? (a belief-update sketch follows below)
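One concrete way such a belief state could be maintained is a discrete Bayes filter; the sketch below assumes known (and made-up) transition and observation models and, for brevity, a single fixed action:

```python
import numpy as np

# Hypothetical 2-state POMDP with made-up models
T = np.array([[0.9, 0.1],      # T[s, s'] = p(s' | s, a) for a single fixed action
              [0.2, 0.8]])
O = np.array([[0.8, 0.2],      # O[s', y] = p(y | s')
              [0.3, 0.7]])

def belief_update(b, y):
    """b_{t+1}(s') is proportional to p(y | s') * sum_s p(s' | s) * b_t(s); the belief is the Markov state."""
    predicted = T.T @ b            # prediction step through the dynamics
    updated = O[:, y] * predicted  # correction step with the new observation
    return updated / updated.sum()

b = np.array([0.5, 0.5])           # uniform initial belief
for y in [0, 0, 1, 1]:             # an illustrative observation sequence
    b = belief_update(b, y)
print("belief over states:", b)
```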


2nd Assumption: Markovian Observable State

[Diagram: the intelligent agent directly observes the state $s_t$, chooses action $a_t$, and receives decomposable rewards $r(s_t, a_t)$]

Markov Decision Problem
We observe the state $s_t = b_t$ directly and remain Markovian: $p(s_{t+1} \mid s_{1:t}, a_{1:t}) = p(s_{t+1} \mid s_t, a_t)$
This can frequently be done, e.g., by reformulating $y_{t+1} = a y_t + b y_{t-1} + c a_{t-1} + a_t$ or $\ddot{y} = mg - a$ (a state-augmentation sketch follows below)

Further Simplifications...

Contextual Bandit: state distribution $p(s_{t+1} \mid s_{1:t}, a_{1:t}) = p(s_{t+1})$
Bandit: only a single state (i.e., the state is not relevant); a minimal sketch follows below
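The sketch below shows such a single-state problem: a multi-armed bandit with made-up arm means and a simple epsilon-greedy rule (the rule itself is an illustrative choice, not part of the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])      # hidden, made-up reward means of the arms
Q = np.zeros(3)                             # estimated value of each arm
counts = np.zeros(3)

for t in range(1000):
    # epsilon-greedy: mostly exploit the best estimate, sometimes explore
    arm = rng.integers(3) if rng.random() < 0.1 else int(np.argmax(Q))
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    Q[arm] += (reward - Q[arm]) / counts[arm]   # incremental sample mean

print("estimated arm values:", np.round(Q, 2))
```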


Problem Classification

                                     | Actions do not change   | Actions change
                                     | the state of the world  | the state of the world
Learn model of outcomes              | (Multi-Armed) Bandits   | Reinforcement Learning
Given model of stochastic outcomes   | Decision Theory         | Optimal Control, Planning



5. Wrap-Up

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


Wrap-Up

Why RL is crucial for AI and why all other approaches are probably doomed!
Background and characteristics of RL
Classifications of RL problems
Core components of RL algorithms


Reading Assignment for Next Week

1. Light fun reading from Forbes: https://tinyurl.com/tpc43xer
2. The real assignment: Chapter 1 of http://incompleteideas.net/book/RLbook2020.pdf
