
UNIT-5

n-step returns
In reinforcement learning, n-step returns are targets used to estimate the value of a state or
state-action pair by combining the discounted rewards accumulated over the next n time steps with a
bootstrapped value estimate of the state reached after those steps. Methods built on n-step returns
are a form of temporal difference learning, and they generalize both one-step TD methods (like
TD(0)) and Monte Carlo methods.
N-Step TD Update:
The n-step return, denoted G_t^(n), is the discounted sum of the rewards over the next n time steps
plus the discounted value estimate of the state reached after those n steps.
The update rule for the state value function (V) using the n-step return is expressed as follows:
V(s_t) ← V(s_t) + α [G_t^(n) − V(s_t)]
where G_t^(n) is calculated as:
G_t^(n) = R_{t+1} + γ R_{t+2} + … + γ^(n−1) R_{t+n} + γ^n V(s_{t+n})
 R_{t+i} is the reward received at time step t+i.
 V(s_{t+n}) is the estimated value of the state at time step t+n.
 γ is the discount factor.
 α is the learning rate.
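As a concrete illustration, the following is a minimal tabular sketch of the n-step return and the
corresponding value update in Python. The function names (n_step_return, n_step_td_update) and the
use of a plain dictionary for V are assumptions made here for illustration, not part of any
standard library.

def n_step_return(rewards, bootstrap_value, gamma):
    # G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * V(s_{t+n})
    g = 0.0
    for k, r in enumerate(rewards):            # rewards = [R_{t+1}, ..., R_{t+n}]
        g += (gamma ** k) * r
    g += (gamma ** len(rewards)) * bootstrap_value
    return g

def n_step_td_update(V, s_t, rewards, s_tn, alpha, gamma, terminal=False):
    # V(s_t) <- V(s_t) + alpha * [G_t^(n) - V(s_t)]
    # If the episode has already ended, the bootstrap value is taken as zero.
    bootstrap = 0.0 if terminal else V.get(s_tn, 0.0)
    g = n_step_return(rewards, bootstrap, gamma)
    v = V.get(s_t, 0.0)
    V[s_t] = v + alpha * (g - v)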
Key Points:
1. N-Step TD for State Values (V):
 If n=1, it becomes TD(0) with a one-step return.
 As n increases, the algorithm considers a longer sequence of rewards and values,
leading to a trade-off between bias and variance.
2. N-Step TD for Action Values (Q):
 The concept can be extended to Q-learning by considering the n-step return for action
values.
3. Selection of n:
 The choice of n depends on the specific characteristics of the environment and the
learning task.
 Smaller n relies more heavily on bootstrapping, which introduces bias, while larger n relies
more on sampled rewards, which increases variance.
4. Terminal States:
 If the episode terminates before n time steps have elapsed, the return is simply truncated at
the terminal state; since the value of a terminal state is zero, the n-step return reduces to the
ordinary Monte Carlo return from time t.
N-Step SARSA:
For n-step SARSA (State-Action-Reward-State-Action), the update rule has the same form but uses
action values, with the n-step return bootstrapping from Q(s_{t+n}, a_{t+n}) instead of V(s_{t+n}):
Q(s_t, a_t) ← Q(s_t, a_t) + α [G_t^(n) − Q(s_t, a_t)]
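The only change relative to the state-value sketch above is the bootstrap term. A minimal,
illustrative version (again with assumed function names and Q stored as a dictionary keyed by
(state, action) pairs) could look like this:

def n_step_sarsa_update(Q, s_t, a_t, rewards, s_tn, a_tn, alpha, gamma, terminal=False):
    # G_t^(n) here bootstraps from Q(s_{t+n}, a_{t+n}) instead of V(s_{t+n}).
    g = 0.0
    for k, r in enumerate(rewards):            # rewards = [R_{t+1}, ..., R_{t+n}]
        g += (gamma ** k) * r
    if not terminal:
        g += (gamma ** len(rewards)) * Q.get((s_tn, a_tn), 0.0)
    q = Q.get((s_t, a_t), 0.0)
    Q[(s_t, a_t)] = q + alpha * (g - q)        # Q(s_t,a_t) <- Q + alpha*[G_t^(n) - Q]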
Advantages and Considerations:
 Bias-Variance Balance: Compared with one-step methods, n-step returns propagate delayed
rewards more quickly and reduce the bias introduced by bootstrapping, while keeping variance lower
than full Monte Carlo returns.
 Memory and Delay: Larger n values require the algorithm to store and wait for n time steps of
experience before each update can be computed.
 Application to Deep RL: N-step returns are compatible with deep reinforcement learning,
where neural networks are used to approximate value functions.
N-step returns offer a flexible framework for temporal difference learning, providing a spectrum of
trade-offs between bias and variance based on the choice of n. The selection of n depends on the
specific characteristics of the task and the learning environment.

TD(λ) algorithm
TD(λ) is a reinforcement learning algorithm that belongs to the family of temporal difference (TD)
learning methods. The notation "TD(λ)" represents TD with eligibility traces, where the parameter λ
(lambda) determines the degree of bootstrapping and influences the updates to the value function.
Key Concepts of TD(λ):
1. Eligibility Traces:
 Eligibility traces are used to keep track of the influence of previous states on the
current state's value.
 The eligibility trace for a state s is denoted as E_t(s) and is updated as
E_t(s) = γλ E_{t−1}(s) + 1(S_t = s), where 1(S_t = s) is an indicator function that equals 1 if
the current state is s and 0 otherwise.
2. TD(λ) Update Rule for State Values (V):
 At each time step, the TD(λ) update is applied to every state s: V(s) ← V(s) + α δ_t E_t(s),
where α is the learning rate, δ_t is the TD error at time t, and E_t(s) is the eligibility trace
for state s.
3. TD Error (δ_t):
 The TD error at time t is defined as δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t), where R_{t+1} is
the immediate reward, S_t is the current state, and S_{t+1} is the next state.
4. λ (Lambda) Parameter:
 The λ parameter determines the weighting of the eligibility traces. A value of 0
corresponds to one-step TD (no eligibility traces), while a value of 1 corresponds to
Monte Carlo updates (full eligibility traces).
 Intermediate values of λ allow for a trade-off between one-step updates and full
Monte Carlo updates.
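As a concrete illustration of how these pieces fit together, here is a minimal tabular sketch of a
single TD(λ) step with accumulating eligibility traces (the backward view). The function name
td_lambda_step and the dictionary representation of V and E are assumptions made here for
illustration:

def td_lambda_step(V, E, states, s, r, s_next, alpha, gamma, lam, terminal=False):
    # TD error: delta_t = R_{t+1} + gamma*V(S_{t+1}) - V(S_t); a terminal state's value is 0.
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    delta = r + gamma * v_next - V.get(s, 0.0)
    # Accumulating traces: decay every trace by gamma*lambda, then bump the current state's trace.
    for x in states:
        E[x] = gamma * lam * E.get(x, 0.0)
    E[s] = E.get(s, 0.0) + 1.0
    # Apply V(x) <- V(x) + alpha * delta_t * E_t(x) to every state.
    for x in states:
        V[x] = V.get(x, 0.0) + alpha * delta * E[x]
    return delta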
Algorithm Steps:
1. Initialization:
 Initialize the state values V(s) arbitrarily.
 Initialize eligibility traces E(s) to zero for all states, and reset them at the start of each
episode.
2. Repeat Until Convergence:
 Choose actions and observe rewards to transition through states.
 Update eligibility traces and state values using the TD(λ) update rule.
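A hypothetical episode loop driving the per-step sketch above might look like the following; env
and policy are assumed placeholders, with env.reset() returning the start state and env.step(a)
returning (next_state, reward, done):

def run_td_lambda_episode(env, policy, V, states, alpha=0.1, gamma=0.99, lam=0.8):
    E = {x: 0.0 for x in states}               # reset eligibility traces at the episode start
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)          # assumed to return (next_state, reward, done)
        td_lambda_step(V, E, states, s, r, s_next, alpha, gamma, lam, terminal=done)
        s = s_next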
Advantages and Considerations:
 Versatility: TD(λ) is versatile and can be applied to various reinforcement learning problems.
 Bootstrapping Trade-off: The λ parameter allows control over the trade-off between one-
step bootstrapping (TD(0)) and full Monte Carlo updates (λ = 1).
 Efficiency: TD(λ) combines the advantages of one-step methods and multi-step methods,
making it computationally more efficient than full Monte Carlo updates.
 Applicability to Online Learning: TD(λ) is suitable for online learning, where updates are
made after each time step.
TD(λ) is widely used in reinforcement learning and has applications in fields such as robotics, game
playing, and control tasks. The choice of the λ parameter depends on the characteristics of the
learning task, and tuning it can significantly impact the algorithm's performance.
