
EASWARI ENGINEERING COLLEGE

(AUTONOMOUS)
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

191AIC601T – REINFORCEMENT LEARNING

Unit IV – Notes

(Temporal Difference Learning)

III YEAR - B.TECH

PREPARED BY: G.SIVASATHIYA, AP/AI&DS
APPROVED BY: HOD/AI&DS


TEMPORAL DIFFERENCE LEARNING (TD)
▶ Temporal-Difference learning = TD learning

▶ The prediction problem is that of estimating the value function for a policy π

▶ The control problem is the problem of finding an optimal policy π*

▶ Given some experience following a policy π, update the estimate V of vπ for the non-terminal states
occurring in that experience

▶ Given the current step t, TD methods wait only until the next time step (not the end of the episode) to update V(St)

▶ Learn from partial returns.

▶ TD learning is an unsupervised technique for predicting a variable's expected value in a sequence of states. TD uses a mathematical trick to replace complex reasoning about the future with a simple learning procedure that can produce the same results.

▶ Instead of calculating the total future reward, TD tries to predict the combination of
immediate reward and its own reward prediction at the next moment in time.

Comparing TD with DP and MC

▶ The Temporal-Difference (TD) method is a blend of the Monte Carlo (MC) method and the Dynamic Programming (DP) method: like MC, it learns directly from sampled experience without a model; like DP, it bootstraps, updating estimates based on other learned estimates.
Difference Between MC & TD

▶ Monte Carlo learning is like an annual examination, where the student completes the episode only at the end of the year.

▶ TD learning can be thought of like a weekly or monthly examination: the student can adjust their performance based on the score (reward) received after every small interval, and the final score is the accumulation of all the weekly tests (total reward).

TD(0) is the simplest form of TD learning.

▶ In this form of TD learning, after every step the value function is updated with the value of the next state together with the reward obtained along the way.

▶ This observed reward is the key factor that keeps the learning grounded, and the algorithm converges after a sufficient number of samples (in the limit of infinitely many).

▶ TD(0) can be represented with the update equations shown below.

▶ Equation 1 is the form generally shown in the literature, but the same equation written as Equation 2 is more intuitive.

 α as a learning factor,

 γ as a discount factor.

 The value of a state S at time step t is updated in the next time step (t+1), based on the reward r_{t+1} observed after time step t together with the estimated value of the state reached at time step t+1.

▶ So it is a bootstrap: the value of S at time step t is updated using the estimate from time step t+1, while r_{t+1} is the observed reward (the real quantity that keeps the algorithm grounded).

▶ The TD target and TD error, shown below, are two important components of the equation and are used in many other areas of RL.
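In standard notation (as in Sutton and Barto's textbook), the two forms of the TD(0) update are:

$$V(S_t) \leftarrow V(S_t) + \alpha\,\bigl[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\bigr] \qquad \text{(Equation 1)}$$

$$V(S_t) \leftarrow (1-\alpha)\,V(S_t) + \alpha\,\bigl[R_{t+1} + \gamma V(S_{t+1})\bigr] \qquad \text{(Equation 2)}$$

The TD target is $R_{t+1} + \gamma V(S_{t+1})$ and the TD error is $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$. Equation 2 makes the intuition explicit: the new estimate is a weighted average of the old estimate and the TD target, with α controlling the weight.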
Parameters

▶ Alpha (α): the learning rate. This parameter controls how much we adjust our estimates based on the error. The learning rate is between 0 and 1. A large learning rate adjusts aggressively and might lead to fluctuating training results that fail to converge; a small learning rate adjusts slowly and takes more time to converge.

▶ Gamma (γ): the discount rate. It controls how much we value future rewards. The discount rate is between 0 and 1. The bigger the discount rate, the more we value future rewards.
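As a concrete illustration, below is a minimal sketch of tabular TD(0) prediction in Python. It assumes a small episodic environment with a classic gym-style interface (env.reset() returning a state, env.step(action) returning a 4-tuple) and a fixed policy function policy(state); these names are illustrative assumptions, not part of any particular library.

```python
import collections

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """Tabular TD(0): estimate the state-value function V for a fixed policy."""
    V = collections.defaultdict(float)                # value estimates, default 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                    # follow the fixed policy pi
            next_state, reward, done, _ = env.step(action)
            td_target = reward + gamma * V[next_state] * (not done)
            td_error = td_target - V[state]           # delta_t
            V[state] += alpha * td_error              # move V(S_t) toward the TD target
            state = next_state
    return V
```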

Advantages of TD

▶ TD can learn at every step, online or offline

▶ TD can learn from incomplete sequences

▶ TD can work in non-terminating environments (continuing)

▶ TD has lower variance than MC, as each update depends on only one random action, transition, and reward

▶ Usually more efficient than MC

▶ TD exploits the Markov property and is thus more effective in Markov environments



Limitations of TD

▶ TD gives a biased estimate, since it bootstraps from its own current estimates

▶ TD is more sensitive to the initial value

Conclusion

Beyond the one-step TD method, there are multi-step TD methods as well as combinations of TD and MC, such as the TD(λ) algorithms. TD was a breakthrough innovation in Reinforcement Learning, and every practitioner needs to have it in their tool kit.

SARSA
▶ One of the TD algorithms for control or improvement is SARSA.

▶ The SARSA name comes from the fact that the agent takes one step from one state-action pair to the next state-action pair and collects the reward R along the way (so it is the (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) tuple that creates the term S,A,R,S,A).

▶ SARSA is an on-policy method.

▶ SARSA uses the action-value function Q and follows the policy π.

▶ GPI (Generalized Policy Iteration) is used to take actions based on the policy π (ε-greedy to ensure exploration, as well as greedy steps to improve the policy).

▶ SARSA can be represented with the update equations shown below. Equation 1 is the form generally shown in the literature, but Equation 2 is more intuitive.

 α as a learning factor,

 γ as a discount factor.

▶ The action-value versions of the TD target and TD error are shown as well.
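In standard notation, the corresponding action-value updates are:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\bigl[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\bigr] \qquad \text{(Equation 1)}$$

$$Q(S_t, A_t) \leftarrow (1-\alpha)\,Q(S_t, A_t) + \alpha\,\bigl[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})\bigr] \qquad \text{(Equation 2)}$$

Here the TD target is $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$ and the TD error is $\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$.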


▶ The SARSA algorithm has one conceptual issue: when updating, we imply that we know in advance what the next action A_{t+1} is for any possible next state. This requires stepping forward and computing the next action of our policy when updating, so learning is highly dependent on the current policy the agent is following.

▶ This complicates the exploration process, and it is therefore common to use some form of ε-soft policy for on-policy methods.
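Below is a minimal sketch of tabular SARSA in Python, again assuming a classic gym-style environment (env.reset() returning a state, env.step(action) returning a 4-tuple) with a known number of discrete actions; these interface names are assumptions made for illustration.

```python
import random
import collections

def epsilon_greedy(Q, state, n_actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy action."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def sarsa(env, n_actions, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: on-policy TD control with an epsilon-greedy policy."""
    Q = collections.defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            td_target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state, action = next_state, next_action   # the S, A, R, S', A' tuple
    return Q
```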
Q LEARNING
▶ Q-learning is an off-policy algorithm.

▶ In off-policy learning, we evaluate the target policy (π) while following another policy called the behavior policy (μ) (this is like a robot learning by following a video, or an agent learning from experience gained by another agent).

▶ DQN (Deep Q-Network), which made the front page of Nature, is a Q-learning-based algorithm (with a few additional tricks) that surpassed human-level performance on Atari games.

▶ In Q-learning, the target policy is the greedy policy and the behavior policy is the ε-greedy policy (which ensures exploration).

Refer to the update equations below for the Q-learning algorithm written in two different ways. Note how the target-policy and behavior-policy actions are represented in the equations.
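In standard notation, the Q-learning update written in the two ways referred to above is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\,\bigl[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\bigr] \qquad \text{(Equation 1)}$$

$$Q(S_t, A_t) \leftarrow (1-\alpha)\,Q(S_t, A_t) + \alpha\,\bigl[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a)\bigr] \qquad \text{(Equation 2)}$$

The behavior policy chooses A_t (ε-greedily in practice), while the target policy appears inside the update as the greedy $\max_{a}$ over the next state's actions. Compared with the SARSA sketch above, only the TD target changes: the maximum over next actions replaces the actually sampled next action.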
Expected SARSA
▶ Expected SARSA is just like Q-learning except that instead of the maximum over next state-
action pairs it uses the expected value.

▶ It takes into account how likely each action is under the current policy. Given the next state, the Q-learning algorithm moves deterministically in the direction of the maximizing action, while this algorithm moves according to the expectation over the policy's actions, and accordingly it is called Expected SARSA.
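In standard notation, the Expected SARSA update is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\Bigl[R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a) - Q(S_t, A_t)\Bigr]$$

The expectation $\sum_{a} \pi(a \mid S_{t+1})\, Q(S_{t+1}, a)$ replaces SARSA's sampled $Q(S_{t+1}, A_{t+1})$ and Q-learning's $\max_{a} Q(S_{t+1}, a)$.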
Comparison – QL with SARSA

QL and SARSA are both excellent initial approaches for reinforcement learning problems. A few key notes on when to use QL or SARSA:

▶ Both approaches work in a finite environment (or a discretized continuous environment)

▶ QL directly learns the optimal policy, while SARSA learns a "near"-optimal policy. QL is a more aggressive agent, while SARSA is more conservative. An example is walking near a cliff: QL will take the shortest path because it is optimal (with the risk of falling), while SARSA will take the longer, safer route (to avoid an unexpected fall).

▶ In practice, if you want fast learning in a fast-iterating environment, QL should be your choice. However, if mistakes are costly (where even a small unexpected failure matters, as with robots), then SARSA is the better option.

Implementation
▶ For the CartPole game, OpenAI's gym has a prebuilt environment. A few of the basic gym calls are sketched below.
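This is a minimal sketch of the usual gym calls, assuming the classic gym API in which env.reset() returns an observation and env.step() returns a 4-tuple (newer gymnasium releases return (obs, info) from reset and a 5-tuple from step):

```python
import gym

env = gym.make("CartPole-v0")      # prebuilt CartPole environment (capped at 200 steps)
state = env.reset()                # start a new episode, get the initial observation
done = False
while not done:
    action = env.action_space.sample()            # placeholder: random action
    state, reward, done, info = env.step(action)  # apply the action, observe the result
env.close()
```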
Next, since QL and SARSA work best in a discrete state space and the CartPole observations are continuous, we will discretize them into bins. The more bins, the better the performance, since more bins let the model distinguish more specific regions of the state space, leading to better overall behavior. However, more bins also require training on more games, costing computational power. If time, computational power, and storage space are your constraints, stay with a small number of bins; otherwise, you are welcome to try a larger number. Also, try a small number of bins to check the performance before scaling up.
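A minimal sketch of one way to discretize the four-dimensional CartPole observation with NumPy. The clipping bounds and the number of bins are illustrative assumptions, not values prescribed by gym:

```python
import numpy as np

N_BINS = 10
# Assumed clipping bounds for (cart position, cart velocity, pole angle, pole angular velocity).
LOWER = np.array([-2.4, -3.0, -0.21, -3.0])
UPPER = np.array([ 2.4,  3.0,  0.21,  3.0])
# Interior edges only, so np.digitize yields bin indices 0 .. N_BINS - 1 per dimension.
EDGES = [np.linspace(LOWER[i], UPPER[i], N_BINS + 1)[1:-1] for i in range(4)]

def discretize(observation):
    """Map a continuous CartPole observation to a tuple of bin indices (a Q-table key)."""
    clipped = np.clip(observation, LOWER, UPPER)
    return tuple(int(np.digitize(clipped[i], EDGES[i])) for i in range(4))
```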

A few graphs showing the performance of the algorithms are included below. Note that in OpenAI's gym CartPole game, the maximum number of steps you can reach is 200 (the game self-terminates at that point).

QL training:
SARSA training:

QL testing:
SARSA testing:

The training graphs show the performance of the agent over many games. The x-axis indicates the number of games trained on, and the y-axis represents the maximum number of steps the agent can take (capped at 200 due to OpenAI gym's setup).

The testing graphs show the performance in the deployment phase (after training is finished). The histograms show the distribution of outcomes for each model when we play 1000 games.

The results are as expected, because SARSA (usually) chooses to play more safely than QL. Hence, it tends to take less dramatic steps during the game, leading to better performance. That is one possible explanation for SARSA's advantage over QL here.


GRADIENT DESCENT ALGORITHM AND ITS VARIANTS

Gradient Descent (GD) is a popular optimization algorithm used in machine learning to minimize the cost function of a model. It works by iteratively adjusting the weights or parameters of the model in the direction of the negative gradient of the cost function, until the minimum of the cost function is reached.

There are several variants of gradient descent that differ in the way the step size or learning rate is chosen and the way the updates are made; the most popular ones are described below.

Each variant of gradient descent has its own advantages and disadvantages, and the choice of which one to use depends on the specific problem and the available computing resources.

Gradient descent is a powerful optimization algorithm used to minimize the loss function in a machine learning model. It is a popular choice for a variety of algorithms, including linear regression, logistic regression, and neural networks. This section covers what gradient descent is, how it works, and several variants of the algorithm that are designed to address different challenges and provide optimizations for different use cases.

Gradient Descent is an iterative optimization algorithm used to minimize the cost function of a machine learning model. The idea is to move in the direction of the steepest descent of the cost function to reach the global minimum or a local minimum. Here are the steps involved in the Gradient Descent algorithm:

1. Initialize the parameters of the model with random values.

2. Calculate the gradient of the cost function with respect to each parameter.

3. Update the parameters by subtracting a fraction of the gradient from each parameter. This fraction is called the learning rate, which determines the step size of the algorithm.

Repeat steps 2 and 3 until convergence, which is achieved when the cost function stops improving or reaches a predetermined threshold.
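A minimal NumPy sketch of these steps for a linear regression model (the function and variable names are illustrative):

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.01, n_iterations=1000):
    """Minimize the mean squared error of a linear model y ~ X @ w."""
    n_samples, n_features = X.shape
    w = np.random.randn(n_features)                   # step 1: random initialization
    for _ in range(n_iterations):
        error = X @ w - y
        gradient = (2.0 / n_samples) * (X.T @ error)  # step 2: gradient of the MSE cost
        w -= learning_rate * gradient                 # step 3: step against the gradient
    return w

# Example: recover weights close to [2, -3] from synthetic data.
X = np.random.randn(200, 2)
y = X @ np.array([2.0, -3.0]) + 0.1 * np.random.randn(200)
print(batch_gradient_descent(X, y))
```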

There are several variants of the Gradient Descent algorithm, which differ in the way they calculate the updates to the parameters:

Batch Gradient Descent: In this variant, the entire training dataset is used to calculate the gradient and update the parameters. This can be slow for large datasets, but it ensures convergence to the global minimum for convex cost functions.

Stochastic Gradient Descent (SGD): In this variant, only one random training example is used to calculate the gradient and update the parameters. This can be faster than Batch Gradient Descent, but the updates can be noisy and may not converge to the global minimum.

Mini-Batch Gradient Descent: In this variant, a small subset of the training dataset is used to calculate the gradient and update the parameters. This is a compromise between Batch Gradient Descent and SGD, as it is faster than Batch Gradient Descent and less noisy than SGD.

Momentum-based Gradient Descent: In this variant, the updates to the parameters are based on the current gradient and the previous updates. This helps the algorithm to overcome local minima and accelerate convergence.

Adagrad: In this variant, the learning rate is adaptively scaled for each parameter based on the historical gradient information. This allows for larger updates for infrequent parameters and smaller updates for frequent parameters.

RMSprop: In this variant, the learning rate is adaptively scaled for each parameter based on the moving average of the squared gradient. This helps the algorithm to converge faster in the presence of noisy gradients.

Adam: In this variant, the learning rate is adaptively scaled for each parameter based on the moving averages of the gradient and the squared gradient. This combines the benefits of Momentum-based Gradient Descent, Adagrad, and RMSprop, and it is one of the most popular optimization algorithms for deep learning.
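To make the differences concrete, here is a sketch of the plain, momentum, and Adam update rules as standalone NumPy functions (the hyperparameter values are typical defaults, used here only for illustration):

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    """Plain (stochastic or batch) gradient descent step."""
    return w - lr * grad

def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
    """Momentum: blend the current gradient with the previous update direction."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected moving averages of the gradient (m) and squared gradient (v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)      # t is the 1-based step count
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```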
How does Gradient Descent Work?

The basic idea of gradient descent is to start with an initial set of weights and
update them in the direction of the negative gradient of the loss function.
The gradient is a vector of partial derivatives that represents the rate of
change of the loss function with respect to the weights. By updating the
weights in the direction of the negative gradient, the algorithm moves
towards a minimum of the loss function.
The learning rate is a hyperparameter that determines the size of the step
taken in the weight update. A small learning rate results in slow
convergence, while a large learning rate can overshoot the minimum and
oscillate around it. It is important to choose an appropriate learning
rate that balances the speed of convergence and the stability of the
optimization.
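A tiny worked example of this trade-off, minimizing f(x) = x² (whose gradient is 2x) from the same starting point with different learning rates:

```python
def minimize_quadratic(lr, steps=20, x0=5.0):
    """Gradient descent on f(x) = x**2, gradient 2*x; the minimum is at x = 0."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(minimize_quadratic(0.01))  # too small: still far from 0 after 20 steps (about 3.3)
print(minimize_quadratic(0.4))   # reasonable: very close to 0
print(minimize_quadratic(1.1))   # too large: |x| grows every step, the iteration diverges
```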
Advantages of gradient descent and its variants:

1. Widely used: Gradient descent and its variants are widely used in machine
learning and optimization problems because they are effective and easy to
implement.
2. Convergence: Gradient descent and its variants can converge to a global minimum
or a good local minimum of the cost function, depending on the problem and the
variant used.
3. Scalability: Many variants of gradient descent can be parallelized and are scalable
to large datasets and high-dimensional models.
4. Flexibility: Different variants of gradient descent offer a range of trade-offs
between accuracy and speed, and can be adjusted to optimize the performance of a
specific problem.

Disadvantages of gradient descent and its variants:

1. Choice of learning rate: The choice of learning rate is crucial for the convergence
of gradient descent and its variants. Choosing a learning rate that is too large can
lead to oscillations or overshooting, while choosing a learning rate that is too small
can lead to slow convergence or getting stuck in local minima.
2. Sensitivity to initialization: Gradient descent and its variants can be sensitive to
the initialization of the model’s parameters, which can affect the convergence and
the quality of the solution.
3. Time-consuming: Gradient descent and its variants can be time-consuming,
especially when dealing with large datasets and high-dimensional models. The
convergence speed can also vary depending on the variant used and the specific
problem.
4. Local optima: Gradient descent and its variants can converge to a local minimum
instead of the global minimum of the cost function, especially in non-convex
problems. This can affect the quality of the solution, and techniques like random
initialization and multiple restarts may be used to mitigate this issue.
