
Reinforcement Learning: From Foundations to Deep Approaches
Lecture 01: Introduction to Reinforcement Learning

Georgia Chalvatzaki & Davide Tateo

Department of Computer Science


TU Darmstadt
Summer Term 2024



1. What is Reinforcement Learning?

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


What is Reinforcement Learning?

“The fundamental challenge in artificial intelligence and machine learning is learning to make good decisions under uncertainty.” – E. Brunskill

Agent-oriented learning:
Repeated interactions with the world
Rewards for sequences of decisions
Do not know in advance how the world works


Why is RL different?
Agent-oriented learning: learning by interacting with an environment to achieve a goal
every AI problem can be phrased this way
all data science work loops are reinforcement learning

[Figure: a typical data-science workflow: ingestion, data preparation, training and validation on history data, evaluation on test data, deployment, prediction/scoring on new data, acting, monitoring, and a feedback loop back to ingestion]

Learning by trial and error, with only delayed evaluative feedback (reward)
the kind of machine learning most like natural learning
learning that can tell for itself when it is right or wrong
The beginnings of a science of mind that is neither natural science nor applications technology

Typical Setting

At every time step, the agent perceives the state of the environment
Based on this perception, it chooses an action
The action causes the agent to receive a numerical reward
Find a way of choosing actions, called a policy, that maximizes the agent’s long-term expected return (see the sketch below)
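Below is a minimal sketch of this interaction loop, assuming a hypothetical toy `Environment` class with `reset`/`step` methods and a purely random policy; all names and numbers are illustrative, not part of the lecture:

```python
import random

class Environment:
    """Toy corridor environment: a hypothetical stand-in for the real world."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); reward arrives only at the goal (delayed feedback)
        self.state = max(0, min(self.length, self.state + action))
        done = self.state == self.length
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def policy(state):
    # a (bad) random policy; the point of RL is to learn a better one from rewards
    return random.choice([-1, +1])

env = Environment()
state, done, episode_return = env.reset(), False, 0.0
while not done:
    action = policy(state)                    # perceive the state, choose an action
    state, reward, done = env.step(action)    # environment returns next state and reward
    episode_return += reward                  # accumulate the return of the episode
print("episode return:", episode_return)
```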

Why study RL now?

[Figure from Henderson et al. 2018 AAAI, https://arxiv.org/pdf/1709.06560.pdf]


RL Success Stories: Atari

Learned to play 49 Atari 2600 console games from self-play
Predicts the final score for all 18 joystick actions
Amazing performance:
  Better than all previous algorithms
  Human level for ~50% of all games
  Same learning algorithm for all 49 games

RL Success Stories: AlphaGo


RL Success Stories: Robot Locomotion

Lee, Joonho, et al. "Learning quadrupedal locomotion over challenging terrain." Science Robotics 5.47 (2020), https://arxiv.org/abs/2010.11251
Video: https://youtu.be/9j2a1oAHDL8

RL Success Stories: ChatGPT


RL Success Stories: Learning Plasma Control for Fusion Science

Image credits: left Alain Herzog / EPFL, right DeepMind & SPC/EPFL.
Degrave et al. Nature 2022
https://www.nature.com/articles/s41586-021-04301-9
2. What’s Special about RL?

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


What’s Special about RL?

Goal: learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown & stochastic environment.
General assumption: it is “easier” to specify the cost of behavior than the behavior itself.


Reinforcement Learning:
Core Characteristics

Characterization of any RL problem:
There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters: data is sequential, not i.i.d.
The agent’s actions affect the subsequent data it receives (the agent generates its own data)
(Picture by Emma Brunskill)


Reinforcement Learning:
Core Characteristics

Complicating factors and issues


Dynamic (state-dependent) environments
Stochastic dynamics and rewards
Unknown dynamics and rewards
Exploration vs. exploitation
Delayed rewards: temporal credit assignment
Complicated systems: large state space, unstructured dynamics


RL Designer Choices

Representation: how to represent the world, the space of actions/interventions, and the feedback signal/reward
Use of Prior Knowledge
Algorithm for learning
Objective function
Evaluation


Desirable Properties

Convergence
Consistency
Small generalization/estimation/approximation error
High learning speed
Safety
Stability


Typical Restrictions and Constraints

Computation time
Data available
Restrictions on how the agent can act (policy class, constraints on which actions can be taken in which states)
Online vs. offline learning
Do we get to choose how to act, or does someone else (an expert, a semi-expert; off-policy vs. on-policy learning, ...)?


RL vs. other learning paradigms

                        | AI Planning | SL | UL | RL | IL
Optimization            |      X      |    |    |  X |  X
Learns from experience  |             |  X |  X |  X |  X
Generalization          |      X      |  X |  X |  X |  X
Delayed Consequences    |      X      |    |    |  X |  X
Exploration             |             |    |    |  X |
AI planning assumes having a model of how decisions impact the environment
Supervised learning (SL) has access to the correct labels
Unsupervised learning (UL) has access to no labels
RL is given only reward information, and only for the states reached and actions taken
Imitation learning (IL) typically assumes input demonstrations of good policies



3. Introduction to Markov Decision Process

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


Agent-Environment representation

Richard Sutton (1960? - now) and Andrew Barto (1948 - now)
(Picture from Sutton & Barto)

Sutton/Barto: Reward Hypothesis - learn from a scalar reward function


Intro to MDPs (1/2)

Markov Decision Processes formally describe an environment for reinforcement learning
The environment is fully observable: the current state completely characterizes the process
MDPs allow precise theoretical statements (e.g., on optimal solutions)
They deliver insights into suitable RL algorithms, since many real-world problems can be abstracted as MDPs
Almost all RL problems can be formalized as MDPs:
  Optimal control primarily deals with continuous MDPs
  Partially observable problems can be converted into MDPs
  Bandits are MDPs with one state


Intro to MDPs (2/2)

In the following we’ll focus on:
  fully observable MDPs and
  finite MDPs (i.e., a finite number of states & actions)

               | All states observable: Yes | All states observable: No
Actions: No    | Markov Chain               | Hidden Markov Model
Actions: Yes   | MDP                        | Partially Observable MDP

Table: Types of Markov Models


Stochastic Processes

A stochastic process is an indexed collection of random variables $\{S_t\}$, e.g., a time series of weekly demands for a product
Discrete case: at a particular time t, labeled by integers, the system is found in exactly one of a finite number of mutually exclusive and exhaustive categories or states, also labeled by integers
The process can be embedded, i.e., the time points refer to the occurrence of specific events (or time may be evenly spaced)
Some basic types of stochastic processes include Markov processes, Poisson processes (such as radioactive decay), and time series, with the index variable referring to time


Example of Stochastic Processes

Random variables may depend on others, e.g.,

$$S_{t+1} = \begin{cases} \max\{3 - D_{t+1},\, 0\} & \text{if } S_t \le 0 \\ \max\{S_t - D_{t+1},\, 0\} & \text{if } S_t > 0 \end{cases}$$

or

$$S_{t+1} = \sum_{k=0}^{K} \alpha_k S_{t-k} + \xi_t, \quad \text{with } \xi_t \sim \mathcal{N}(\mu, \sigma^2)$$
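Both example processes are easy to simulate; below is a small sketch, assuming an illustrative Poisson demand distribution and made-up autoregressive coefficients (none of these values come from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Inventory-style process: restock to 3 when empty, otherwise serve demand D_{t+1}
S = [2.0]
for t in range(10):
    D = rng.poisson(1.0)                      # illustrative demand distribution
    if S[-1] <= 0:
        S.append(max(3 - D, 0.0))
    else:
        S.append(max(S[-1] - D, 0.0))

# Autoregressive process: S_{t+1} = sum_k alpha_k S_{t-k} + xi_t, xi_t ~ N(mu, sigma^2)
alpha, mu, sigma = np.array([0.6, 0.3]), 0.0, 0.1   # K = 1, made-up coefficients
X = [1.0, 0.8]
for t in range(10):
    xi = rng.normal(mu, sigma)
    X.append(float(alpha @ X[-1:-3:-1]) + xi)       # uses S_t and S_{t-1}

print("inventory trajectory:", S)
print("AR(2) trajectory:", [round(x, 3) for x in X])
```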


The Markov Property

“The future is independent of the past given the present”


Definition 1
A stochastic process $S_t$ is said to be Markovian if and only if
$$P(S_{t+1} = j \mid S_t = i, S_{t-1} = k_{t-1}, \ldots, S_0 = k_0) = P(S_{t+1} = j \mid S_t = i)$$

The state captures all the information from the history
Once the state is known, the history may be thrown away
The state is a sufficient statistic for the future
The conditional probabilities are transition probabilities
If the probabilities are stationary (time-invariant), we can write:
$$p_{ij} = P(S_{t+1} = j \mid S_t = i) = P(S_1 = j \mid S_0 = i)$$
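If the transition probabilities are indeed stationary, they can be estimated from an observed sequence simply by counting transitions; a small sketch with a made-up state sequence:

```python
import numpy as np

# Illustrative observed state sequence over states {0, 1, 2}
sequence = [0, 1, 1, 2, 0, 1, 2, 2, 1, 0, 0, 1]

n = 3
counts = np.zeros((n, n))
for i, j in zip(sequence[:-1], sequence[1:]):
    counts[i, j] += 1                            # count observed transitions i -> j

# Empirical transition probabilities p_ij = P(S_{t+1} = j | S_t = i)
P_hat = counts / counts.sum(axis=1, keepdims=True)
print(np.round(P_hat, 2))
```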


State Transition Matrix


Definition 2 (State transition matrix)
Given a Markov state $S_t = s$ and its successor $S_{t+1} = s'$, the state transition probability for all $s, s' \in \mathcal{S}$ is defined by the matrix

$$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s].$$

Here, $\mathcal{P}_{ss'} \in \mathbb{R}^{n \times n}$ has the form

$$\mathcal{P}_{ss'} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1n} \\ p_{21} & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ p_{n1} & \cdots & \cdots & p_{nn} \end{pmatrix}$$

with $p_{ij} \in [0, 1]$ being the probability of going from state $s = S_i$ to state $s' = S_j$. Obviously, $\sum_j p_{ij} = 1$ must hold for all $i$.
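A quick numerical check of these properties for a concrete transition matrix (the matrix values below are made up for illustration):

```python
import numpy as np

# Illustrative 3-state transition matrix: P[i, j] = p_ij = P(S_{t+1} = j | S_t = i)
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])

assert np.all((0.0 <= P) & (P <= 1.0)), "every p_ij must lie in [0, 1]"
assert np.allclose(P.sum(axis=1), 1.0), "every row must sum to 1"

# k-step transition probabilities are given by the k-th matrix power
P5 = np.linalg.matrix_power(P, 5)
print("P(S_5 = j | S_0 = i):\n", P5)
```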


Markov Chain
A Markov chain (or Markov process) is a memoryless stochastic process, i.e., a sequence of random states $S_1, S_2, \ldots$ with the Markov property.
It models an environment in which all states are Markov and time is divided into stages.

Definition 3 (Finite Markov Chain)
A finite Markov chain is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$ with
  $\mathcal{S}$ being a finite set of discrete-time states $S_t \in \mathcal{S}$,
  $\mathcal{P}$, with entries $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$, being the state transition probability.

A specific stochastic process model
Sequence of random variables $S_t, S_{t+1}, \ldots$
“Memoryless”
In a continuous-time framework: Markov process (a sampling sketch follows below)
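Sampling such a chain is straightforward: thanks to the Markov property, each step only needs the current state and the corresponding row of the transition matrix (the matrix below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative transition matrix over states {0, 1, 2}
P = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.0, 0.3, 0.7],
])

def sample_chain(P, s0, steps):
    """Sample a trajectory s_0, s_1, ... using only the current state (memoryless)."""
    states = [s0]
    for _ in range(steps):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

print(sample_chain(P, s0=0, steps=15))
```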

Types of Markov Processes


Absorbing
A Markov process is called absorbing if it has at least one absorbing
state and if that state can be reached from every other state (not
necessarily in one step).

Ergodic
A Markov process is called ergodic if all states are recurrent (each
state is visited an infinite number of times) and aperiodic (each state
is visited without any systematic period).

Regular
A Markov process is called regular if some power of the transition
matrix has only positive elements.
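These definitions translate into simple checks on the transition matrix; below is a sketch under the assumption that the state space is small enough for brute force. The example matrices are made up, and the absorbing check only finds absorbing states, not the full reachability condition of an absorbing process:

```python
import numpy as np

def is_regular(P, max_power=50):
    """Regular: some power of P has only positive entries (max_power is a pragmatic bound)."""
    Q = np.eye(len(P))
    for _ in range(max_power):
        Q = Q @ P
        if np.all(Q > 0):
            return True
    return False

def absorbing_states(P):
    """A state i is absorbing if p_ii = 1 (the process never leaves it)."""
    return [i for i in range(len(P)) if np.isclose(P[i, i], 1.0)]

P_regular = np.array([[0.5, 0.5], [0.2, 0.8]])
P_absorbing = np.array([[1.0, 0.0], [0.3, 0.7]])
print(is_regular(P_regular), absorbing_states(P_absorbing))   # True [0]
```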



4. Flavors of the RL Problem

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


Flavors of the RL Problem

[Diagram: the intelligent agent receives observations $y_{1:t}$, rewards $r_{1:t-1}$, and its past actions $a_{1:t-1}$, and outputs the next action $a_t$]

Full Reinforcement Learning Problem
The agent can only test $p(y_{t+1} \mid y_{1:t}, a_{1:t})$ to obtain rewards $r_t = r(y_{1:t}, a_{1:t})$
True AI? The agent has to “understand” the environment
Practically? Requires infinitely many “twin brothers”...


1st Assumption: Filtered State, Sufficient Statistics

[Diagram: a filter compresses the observations $y_{1:t}$ and past actions $a_{1:t-1}$ into a sufficient statistic or belief state $b_t = b(y_{1:t}, a_{1:t-1})$; the intelligent agent acts on $b_t$, chooses action $a_t$, and receives decomposable rewards, e.g., $r(b_t, a_t)$]

Partially Observable Markov Decision Problem (POMDP)
The sufficient filter statistics $b_t$ are Markovian: $p(b_{t+1} \mid b_t, a_t)$
Rewards are decomposable: $r_t = r(b_t, a_t)$
Every problem is an (infinite- or finite-dimensional) POMDP, but where do filters come from? (a belief-update sketch follows below)
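One concrete way such a belief state could be maintained is a discrete Bayes filter; the sketch below assumes known (and made-up) transition and observation models and, for brevity, a single fixed action:

```python
import numpy as np

# Hypothetical 2-state POMDP with made-up models
T = np.array([[0.9, 0.1],      # T[s, s'] = p(s' | s, a) for a single fixed action
              [0.2, 0.8]])
O = np.array([[0.8, 0.2],      # O[s', y] = p(y | s')
              [0.3, 0.7]])

def belief_update(b, y):
    """b_{t+1}(s') is proportional to p(y | s') * sum_s p(s' | s) * b_t(s); the belief is the Markov state."""
    predicted = T.T @ b            # prediction step through the dynamics
    updated = O[:, y] * predicted  # correction step with the new observation
    return updated / updated.sum()

b = np.array([0.5, 0.5])           # uniform initial belief
for y in [0, 0, 1, 1]:             # an illustrative observation sequence
    b = belief_update(b, y)
print("belief over states:", b)
```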


2nd Assumption: Markovian Observable State

[Diagram: the intelligent agent directly observes the state $s_t$, chooses action $a_t$, and receives decomposable rewards $r(s_t, a_t)$]

Markov Decision Problem
We observe the state $s_t = b_t$ directly and remain Markovian: $p(s_{t+1} \mid s_{1:t}, a_{1:t}) = p(s_{t+1} \mid s_t, a_t)$
This can frequently be done, e.g., by reformulating $y_{t+1} = a y_t + b y_{t-1} + c a_{t-1} + a_t$ or $\ddot{y} = mg - a$ (a state-augmentation sketch follows below)

Further Simplifications...

Contextual Bandit: state distribution $p(s_{t+1} \mid s_{1:t}, a_{1:t}) = p(s_{t+1})$
Bandit: only a single state (i.e., the state is not relevant); a minimal sketch follows below
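The sketch below shows such a single-state problem: a multi-armed bandit with made-up arm means and a simple epsilon-greedy rule (the rule itself is an illustrative choice, not part of the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])      # hidden, made-up reward means of the arms
Q = np.zeros(3)                             # estimated value of each arm
counts = np.zeros(3)

for t in range(1000):
    # epsilon-greedy: mostly exploit the best estimate, sometimes explore
    arm = rng.integers(3) if rng.random() < 0.1 else int(np.argmax(Q))
    reward = rng.normal(true_means[arm], 0.1)
    counts[arm] += 1
    Q[arm] += (reward - Q[arm]) / counts[arm]   # incremental sample mean

print("estimated arm values:", np.round(Q, 2))
```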


Problem Classification

                                     | Actions do not change   | Actions change
                                     | the state of the world  | the state of the world
Learn model of outcomes              | (Multi-Armed) Bandits   | Reinforcement Learning
Given model of stochastic outcomes   | Decision Theory         | Optimal Control, Planning



5. Wrap-Up

Outline

1. What is Reinforcement Learning?

2. What’s Special about RL?

3. Introduction to Markov Decision Process

4. Flavors of the RL Problem

5. Wrap-Up


Wrap-Up

Why RL is crucial for AI and why all other approaches are probably doomed!
Background and characteristics of RL
Classifications of RL problems
Core components of RL algorithms


Reading Assignment for Next Week

1. Light fun reading from Forbes: https://tinyurl.com/tpc43xer
2. The real assignment: Chapter 1 of http://incompleteideas.net/book/RLbook2020.pdf
