Fundamental Discussion
08/04/23
○ Agent
○ Environment
○ State
○ Action
○ Reward
Expected Return
Discounted Return
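The two returns named above are usually written as follows. This is a sketch in the common textbook notation, which is an assumption here since the slide gives only the names: G_t is the return from time step t, R_{t+1} is the reward received after the action taken at step t, T is the final time step of an episode, and γ is the discount rate.

```latex
% Expected (undiscounted) return for an episodic task
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T

% Discounted return, with discount rate 0 \le \gamma \le 1
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```

With γ < 1, rewards that arrive sooner count for more than rewards that arrive later, and the sum stays finite for bounded rewards even when the task never terminates.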
● Value iteration
● Policy iteration
● Q-Learning
● SARSA
Q-Learning
The Q-learning algorithm iteratively updates the Q-value of each state-action pair
using the Bellman optimality equation until the Q-function converges to the optimal
Q-function, q*.
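A single update of this kind can be sketched in tabular form. This is a minimal illustration, not the slide's own code: the list-of-lists table `q`, the function name `q_learning_update`, the learning rate `alpha` and the discount rate `gamma` are all assumptions introduced for the example.

```python
def q_learning_update(q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning step and return the new Q-value.

    q is a list of rows, one row per state, one entry per action
    (a hypothetical layout chosen for this sketch).
    """
    # The target bootstraps from the best action available in the next
    # state, regardless of which action the agent will actually take.
    best_next = 0.0 if done else max(q[next_state])
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


# Toy table: 3 states, 2 actions, all zeros except state 1.
q = [[0.0, 0.0] for _ in range(3)]
q[1] = [1.0, 2.0]
new_value = q_learning_update(q, state=0, action=1, reward=10.0,
                              next_state=1, alpha=0.5, gamma=0.9)
print(new_value)  # approximately 5.9 = 0.5 * (10 + 0.9 * 2.0)
```

Repeating this step over many transitions is what drives the table toward q*, provided every state-action pair keeps being visited.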
SARSA (State–action–reward–state–action):
SARSA is an on-policy temporal-difference (TD) learning algorithm: the same policy π
chooses the action taken in both the current state and the next state.
On-policy: the agent learns the value function of the policy it is currently following,
so each update uses the action that this policy actually selects.
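The on-policy character of SARSA shows up in the update target, which uses the action the policy actually picks in the next state rather than the best one. The sketch below assumes an epsilon-greedy policy and a list-of-lists Q-table; the names `epsilon_greedy`, `sarsa_update`, `alpha`, `gamma` and `epsilon` are hypothetical and chosen only for illustration.

```python
import random


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    n_actions = len(q[state])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[state][a])


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular SARSA step and return the new Q-value."""
    # The target uses the Q-value of the next action chosen by the same
    # policy (on-policy), not the maximum over next actions.
    next_value = 0.0 if done else q[next_state][next_action]
    td_target = reward + gamma * next_value
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


rng = random.Random(0)
q = [[0.0, 0.0], [1.0, 2.0]]
greedy_action = epsilon_greedy(q, 1, epsilon=0.0, rng=rng)  # no exploration
# Suppose the policy explored and picked action 0 in the next state.
value = sarsa_update(q, state=0, action=1, reward=10.0,
                     next_state=1, next_action=0, alpha=0.5, gamma=0.9)
print(greedy_action, value)
```

Here the update bootstraps from the value of action 0 in the next state (1.0), whereas a Q-learning update on the same transition would bootstrap from the maximum (2.0).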
Reward: Black circle = -10
Red star = +10