
Reinforcement Learning

Introduction

Pablo Zometa – Department of Mechatronics – GIU Berlin


Organization

Teaching staff
▶ Pablo Zometa, Office 6.05
▶ Ali Tarek, Office 6.01

Assessment
▶ 20 % Quizzes (best of 2)
▶ 20 % Assignments
▶ 25 % Midterm
▶ 35 % Final

Other info:
▶ Lecture v1.0



Course Overview
Part I: Reinforcement Learning
▶ Markov Decision Process
▶ Dynamic Programming
▶ Temporal difference and Q-Learning
▶ Neural Networks and Deep Q-Networks

Part II: Optimal Control
▶ Convex Optimization
▶ Linear Quadratic Regulator
▶ Model Predictive Control

Reinforcement Learning textbook (freely available):


Reinforcement Learning: An Introduction, second edition, R. S.
Sutton and A. G. Barto

Optimization textbook (freely available):


Convex Optimization, S. Boyd and L. Vandenberghe



Reinforcement Learning (RL) in a Nutshell
Reinforcement learning: learning what actions to take in any
particular state to maximize a numerical reward over time.

[Diagram] The Agent applies action = f(state) to the Environment; the Environment returns a reward and the next state (state⁺) to the Agent.

RL is typically modelled using the Agent/Environment framework:


▶ Agent: the learner and action taker
▶ Environment: returns rewards and transitions to a new state,
depending on the chosen actions
▶ the boundary between agent and environment depends on what
needs to be learned and on the chosen state representation,
not on physical boundaries
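This interaction can be written as a simple loop. Below is a minimal sketch in Python, assuming a generic environment with reset() and step() methods (a Gym-style interface); all names are illustrative and not part of the lecture.

import random


class RandomAgent:
    """An agent that picks actions uniformly at random (no learning yet)."""

    def __init__(self, actions):
        self.actions = actions

    def act(self, state):
        # the policy: action = f(state); this placeholder ignores the state
        return random.choice(self.actions)


def run_episode(agent, env, max_steps=100):
    """Run one episode; the environment returns a reward and the next state each step."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                # agent: learner and action taker
        state, reward, done = env.step(action)   # environment: reward, next state
        total_reward += reward
        if done:
            break
    return total_reward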
Motivation

Why reinforcement learning?


▶ Learn from experience: solve a complex "puzzle" (e.g., robot
navigation), recommendation systems.
▶ Use in dynamic environments: an autonomous robot navigating
a busy factory floor
▶ Ability to handle uncertainty: stock trading, a game of Chess
or Go, autonomous driving
▶ Flexibility: from computer games (e.g., Super Mario Bros.) to
learning how to walk (e.g., Learn to walk).



Reinforcement Learning (RL)
RL: learn what actions to take in any particular state to maximize
a numerical reward:
▶ map a state x to an action a: π(x) = a
▶ which action a to take at state x is not predefined, i.e., π(x)
is initially unknown
▶ π(x) must be discovered by experimentation, i.e., by finding
which action a delivers the highest reward at state x
▶ after taking action a at x, a nonzero reward may not be immediate
The two most important distinguishing features of RL:
▶ trial-and-error search for the "best" action and
▶ delayed reward for an action taken
Unique to RL is the trade-off between
▶ exploration: to discover the best actions to exploit, the agent has to
try new actions that may yield better (or worse) results
▶ exploitation: retake actions that have led to positive rewards
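A common way to balance this trade-off is an ε-greedy rule: with a small probability ε, explore a random action; otherwise, exploit the best-known action. A minimal sketch (the action names and reward estimates below are made up for illustration):

import random


def epsilon_greedy(estimates, epsilon=0.1):
    """Pick an action from a dict mapping action -> estimated reward."""
    if random.random() < epsilon:
        return random.choice(list(estimates))    # exploration: try any action
    return max(estimates, key=estimates.get)     # exploitation: best estimate so far


# hypothetical reward estimates for three actions at some state x
estimates = {"left": 0.2, "right": 1.3, "stay": 0.7}
print(epsilon_greedy(estimates))  # most of the time: "right"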



The RL landscape
Currently, the dominant branch of artificial intelligence is machine
learning. Three main branches of machine learning:
▶ supervised learning: labeled samples, e.g., linear regression,
image classification
▶ unsupervised learning: finding structure hidden in collections
of unlabeled data, e.g., clustering
▶ reinforcement learning: trying to maximize a reward signal,
e.g., playing a (video) game
Within RL, there are several branches, among them:
▶ Q-learning: model-free method that builds a table of the expected
value (reward) of taking an action at a given state; for
discrete action spaces (see the sketch below)
▶ Policy gradient: model-free methods that directly optimize the
policy function π(x); suited to continuous action spaces
▶ Actor-critic: the actor learns the policy π(x), while the critic evaluates
how good π(x) is based on the expected reward of a = π(x)
▶ Deep RL: uses deep neural networks to extend these methods
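A rough sketch of the tabular Q-learning idea: after each observed transition, nudge the table entry for the (state, action) pair toward the reward plus the discounted best value of the next state. The update rule is the standard one; the variable names and the learning-rate/discount values are illustrative assumptions.

from collections import defaultdict

Q = defaultdict(float)   # table: (state, action) -> expected value
alpha = 0.1              # learning rate (assumed value)
gamma = 0.99             # discount factor (assumed value)


def q_update(state, action, reward, next_state, actions):
    """One tabular Q-learning update after observing a single transition."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])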
History of RL: animal learning psychology
Reinforcement learning drew inspiration from psychological
learning theories. For instance: Rat basketball.
▶ In 1898, based on experiments on animals escaping puzzle
boxes, Thorndike formulates the Law of Effect: learning by
trial and error.
▶ In the 1920s, the Russian physiologist Ivan Pavlov introduces the idea
of Pavlovian or classical conditioning: connecting new stimuli
to innate reflexes.
▶ In the 1950s, Skinner popularizes the idea of operant conditioning:
consequences lead to changes in voluntary behaviour
▶ In 1958, Skinner recognizes the effectiveness of shaping
behaviour with small intermediate rewards which reinforce
step-wise changes, until a desired complex behaviour is learnt

Classical vs. Operant conditioning.

RL does not attempt to replicate or explain how animals learn.


History of RL: optimal control
In the 1950s, the main mathematical concepts of optimal control (e.g.,
the linear quadratic regulator) were developed:
▶ minimize a cost (i.e., maximize performance) of a dynamical
system over time
▶ the Bellman equation relates states with a value function (one
common form is shown below)
▶ Dynamic programming (DP) was developed as a way to solve
the Bellman equation
▶ Although efficient, for large problems DP becomes intractable
(curse of dimensionality)
▶ Bellman also introduced the use of Markov decision process
(MDP) to solve discrete stochastic problems
In 1989, Watkins treated RL using MDPs. In some sense, RL can be seen
as a way to approximately solve the Bellman equation for
problems where DP is not feasible.
Both fields (OC and RL) evolved separately, and have developed
different terms for the same concepts.
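For reference, one common form of the Bellman optimality equation for a discounted, finite MDP (the notation here is an assumption, not taken from the lecture):

V^*(x) = \max_{a} \Big[ r(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^*(x') \Big]

Dynamic programming solves this exactly by sweeping over all states; RL methods approximate the solution from sampled experience.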
RL: exploration vs exploitation trade-off

Unique to RL is the trade-off between exploration and exploitation


▶ exploration: to discover the best actions to exploit, the agent has to
try new actions that may yield better (or worse) results
▶ Half-Cheetah: exploration
▶ exploitation: retake actions that have led to positive
rewards in the past
▶ Half-Cheetah: exploitation
How the reward signal is defined is crucial to how the agent learns
to solve a task.
Q: What could be the reward for the Half-Cheetah example?
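One possible answer, following the convention used in common Half-Cheetah simulation benchmarks (an assumption, not stated in the lecture): reward forward progress and penalize large control effort, e.g.

def half_cheetah_reward(forward_velocity, action, ctrl_cost_weight=0.1):
    """Hypothetical reward: move forward fast, but penalize large joint torques."""
    return forward_velocity - ctrl_cost_weight * sum(a * a for a in action)

A reward defined this way encourages fast forward motion without wasting actuation.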



Modern Examples

Modern examples:
▶ Several RL agents have taught themselves how to play
games: RL: Atari, Super Mario Bros.
▶ AlphaZero: learned how to play Go and Chess without
human instruction, using only data generated by playing against
itself, and has reached super-human playing strength.
▶ ChatGPT: has been fine-tuned (an approach to transfer
learning) using both supervised and reinforcement learning
techniques. (Source: Wikipedia)



Perspective: RL and OC for mechatronic systems?
Nowadays, reinforcement learning
▶ is mostly used in software applications: games, chatbots,
recommendation systems, etc.
▶ for mechatronic systems, it is mostly limited to research.
Currently, model-based approaches (physics-based models of the robot
and environment) dominate, in particular for simple applications.
Classical and optimal control are commonly used.
Limitations of RL for complex applications (e.g., learning to walk):
▶ performing the training with the real robot may be dangerous,
slow, and expensive
▶ performing the training with simulations may require an
extremely accurate model
▶ industrial applications often prefer simple approaches
Still, RL may have an edge as a high-level decision maker in
uncertain environments, and may be combined with OC as a
low-level controller.
