
Reinforcement Learning

Fundamental Discussion
08/04/23

Vision and Language Group


Branches of machine learning
Markov Decision Process
Markov decision processes give us a way to formalize sequential decision making.
This formalization is the basis for structuring problems that are solved with
reinforcement learning.
Components of an MDP:

○ Agent
○ Environment
○ State
○ Action
○ Reward
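The slide lists the components informally; as a hedged aside (not spelled out on the slide), these ingredients are commonly collected into a tuple, with the agent and environment interacting through states, actions, and rewards:

\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)

Here γ is the discount factor that appears on the discounted-return slide.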
Expected Return

Discounted Return

Discount factor: 0 < γ < 1
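The return formulas on this slide were image-based and did not survive extraction; a sketch of the standard definitions, assuming the usual G_t notation, is:

G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T   (return, episodic case)

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}   (discounted return)

With 0 < γ < 1 the infinite sum stays finite for bounded rewards, and a smaller γ makes the agent more short-sighted.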


A policy specifies how probable it is for the agent to select each available action from a given state.
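In the standard notation (assumed here, since the slide's own formula is not in the text), a policy maps each state to a probability distribution over actions:

\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)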
Value functions are functions of states, or of state–action pairs, that estimate how good it is for the
agent to be in a given state, or to perform a given action in a given state.

State Value Function

Action Value Function
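The formulas for these two functions were likewise image-based; assuming the conventional definitions, each is the expected return from a state (or state–action pair) when following π thereafter:

v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,]

q_\pi(s, a) = \mathbb{E}_\pi[\, G_t \mid S_t = s, A_t = a \,]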


Optimal Policy

Optimal State-Value Function

Optimal Action-Value Function


Bellman Optimality Equation
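The equations for these slides are also missing from the extracted text; a sketch of the usual statements is:

v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)

v_*(s) = \max_a \mathbb{E}[\, R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a \,]

q_*(s, a) = \mathbb{E}[\, R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a \,]

A policy that acts greedily with respect to q_* is an optimal policy.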
Methods for solving MDP

● Value iteration (see the sketch after this list)
● Policy iteration
● Q-Learning
● SARSA
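Q-learning and SARSA are covered on the following slides, so as a hedged illustration of the first bullet only, here is a minimal tabular value-iteration sketch in Python; the transition tensor P, reward matrix R, and tolerance theta are assumptions for illustration, not something given on the slide.

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """P: transition probabilities, shape (S, A, S); R: expected rewards, shape (S, A)."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman optimality backup for every state at once
        Q = R + gamma * (P @ V)           # shape (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < theta:
            break
        V = V_new
    policy = Q.argmax(axis=1)             # greedy policy w.r.t. the converged values
    return V, policy

Policy iteration differs in that it alternates a full policy-evaluation step with a greedy policy-improvement step instead of folding the max into every backup.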
Q-Learning
The Q-learning algorithm iteratively updates the Q-values for each state-action pair
using the Bellman equation until the Q-function converges to the optimal
Q-function, q*.
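A minimal tabular sketch of this update rule, assuming a small discrete environment with Gym-style reset()/step() methods; env, n_states, and n_actions are hypothetical placeholders rather than anything defined on the slide.

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    # env is a hypothetical environment with reset() -> state and
    # step(action) -> (next_state, reward, done)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)      # explore
            else:
                a = int(np.argmax(Q[s]))              # exploit
            s_next, r, done = env.step(a)
            # off-policy target: bootstrap from the greedy (max) next action
            target = r + gamma * np.max(Q[s_next]) * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

The max over next actions is what makes Q-learning off-policy: the target ignores which action the behaviour policy will actually take next.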
SARSA (State–action–reward–state–action):

It is an on-policy temporal-difference learning method in which the same policy π is followed when
choosing the action to be taken in both the current and the next state.

On-policy: the learning agent learns the value function from the actions it actually takes under the
policy it is currently following.
Example gridworld rewards: black circle = -10, red star = +10
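For comparison with the Q-learning sketch above, here is a minimal tabular SARSA loop under the same hypothetical Gym-style environment; the only change is that the next action is sampled from the same ε-greedy policy and then used in the update target.

import numpy as np

def sarsa(env, n_states, n_actions, episodes=500,
          alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)   # chosen by the same policy pi
            # on-policy target: uses the action that will actually be taken next
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q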
