
EASWARI ENGINEERING COLLEGE

(AUTONOMOUS)
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND
DATA SCIENCE

191AIC601T – REINFORCEMENT LEARNING

Unit V – Notes

(Eligibility Traces)

III YEAR - B.TECH

PREPARED BY: G.SIVASATHIYA, AP/AI&DS

APPROVED BY: HOD/AI&DS


Eligibility Traces

Eligibility traces are a way of weighting between temporal-difference
“targets” and Monte Carlo “returns”: instead of using the
one-step TD target, we use the TD(λ) target. In other words, they fine-tune
the target to achieve better learning performance.
Here are the benefits of eligibility traces:
 Provide a way of implementing Monte Carlo in an online fashion
(without waiting for the episode to finish) and on problems
without episodes.
 Provide an algorithmic mechanism that uses a short-term
memory vector.
 Improve computational efficiency by storing a single trace
vector instead of a list of feature vectors.
 Learning is done continually rather than waiting for results at the
end of an episode.

The Forward View


Temporal Difference and Monte Carlo methods update a state based
on future rewards. This is done either by looking directly one step
ahead or by waiting for the episode to finish.
This approach is called the Forward View.

In Forward View we look ahead n steps for future rewards


In TD(0) we look one step ahead, while in Monte Carlo we look ahead
until the episode terminates and collect the discounted rewards.
However, there is a middle ground, in which we look n steps ahead.

The n-steps Forward View


As explained, looking ahead can vary from one step ahead to the
end of the episode, as in the case of Monte Carlo. So n steps is a kind
of middle ground.
Remember that in Monte Carlo we execute the episodes, get their
returns Gi, and average those returns to compute the state value.
Note that the length (number of steps) of each episode may vary from one
episode to the other. It is not constant!
Similarly, we will do the same with the n-step look-ahead. As in
Monte Carlo, the number of steps is not necessarily the same on each
episode. So let’s define a weighted average over all these returns, as
follows:

    G(λ, t) = (1 − λ) Σ (n = 1 to ∞) λ^(n−1) G(t, t+n)

where G(λ, t) is the weighted average of all the n-step returns G(t, t+n),
each of which starts at time t and looks n steps ahead,
for n going from 1 to infinity.
λ is a weight that has a value in [0, 1].
As in any weighted average, the sum of the weights must be one, which
is the case since

    (1 − λ) Σ (n = 1 to ∞) λ^(n−1) = (1 − λ) · 1 / (1 − λ) = 1
[Figure: the weight λ^(n−1) assigned to each n-step return; the weight
becomes smaller as the time (or number of steps n) increases.]

In short, if an episode terminates after 3 steps, the weight associated
with its return is far greater than that of an episode that terminates at T steps
(where T is much greater than 3).
It is also important to notice that the weight decreases exponentially.
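The weighting above is easy to check numerically. The short sketch below (with an illustrative value of λ, not from the notes) builds the weights (1 − λ)λ^(n−1) and confirms that they decay and sum to (almost) one:

```python
# Sketch: the weights of the lambda-return.
# Each n-step return G(t, t+n) receives weight (1 - lam) * lam**(n - 1),
# so short look-aheads dominate and the weights sum to 1 as n -> infinity.
lam = 0.8  # illustrative value of lambda in [0, 1]

weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 51)]

print(weights[:3])   # the first (shortest look-ahead) weights are the largest
print(sum(weights))  # close to 1.0; exactly 1.0 in the limit
```

The geometric series (1 − λ)(1 + λ + λ² + …) telescopes to 1, which is why the truncated sum is already very close to one after 50 terms.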

The Problem is not Resolved Yet!


Forward views are somewhat complex to implement because the
update of each state depends on later events or rewards that are not
available at the current time.
However, this is going to change by adopting a new approach: the
Backward View.

Backward View TD
Suppose an agent is randomly walking in an environment and finds
a treasure. It then stops and looks backwards in an attempt to know
what led it to this treasure.
Naturally, the steps that are close to the treasure have more
merit in finding it than the steps that are miles away. So closer
locations are more valuable than distant ones, and thus they are
assigned bigger values.
This materializes through a vector E called the eligibility
trace. Concretely, the eligibility trace is a function of the state, E(s), or of
the state-action pair, E(s, a), and holds the decaying values of V(s).

So how do we transition from the Forward View to the Backward View,
and what is the role of eligibility traces in that?
Remember what we said about the Forward View: the
contribution of each n-step return to the current state is attenuated
exponentially (λ to the power n − 1) with the number of steps (n) in
the episode.
Using the same logic, when we are at state s, instead of looking
ahead to see the decaying return (Gt) of an episode coming towards
us, we simply use the value we have and throw it backward using the
same decaying mechanism.
For example, in TD(0) we defined the TD error as:

    δt = Rt+1 + γ V(St+1) − V(St)

This error will be propagated backwards, but in a decaying
manner, similar to a voice that fades away with distance.
The way we implement this is by multiplying the error δt by the eligibility
trace of each state:

    V(s) ← V(s) + α δt Et(s), for all s

where Et(s) is updated as follows:

    Et(s) = γλ Et−1(s) + 1(St = s)

The Backward View propagates the error δ to previous states.

The notation 1(St = s) means that we assign the full value when we are
at the state s; as the error gets propagated backwards, it gets attenuated
exponentially.
The eligibility trace update starts with E(s) = 0 for all states; then, as we
pass by each state (due to performing an action), we increment E(s)
to boost the value of the state, and then we decay E(s) by γλ (E(s) ← γλ
E(s)) for all s.
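The update rules above can be sketched in a few lines of Python. The example below runs one episode of backward-view TD(λ) on a toy 5-state random walk; the environment, rewards, and parameter values are illustrative assumptions, not part of the notes:

```python
import random

# Sketch: one episode of backward-view TD(lambda) on a toy random walk.
# States 0..4; 0 and 4 are terminal; reward 1 only at the right terminal.
random.seed(0)
gamma, lam, alpha = 1.0, 0.9, 0.1
n_states = 5
V = [0.0] * n_states   # state values
E = [0.0] * n_states   # eligibility traces, start at zero

s = 2                  # start in the middle
while s not in (0, 4):
    s2 = s + random.choice((-1, 1))
    r = 1.0 if s2 == 4 else 0.0
    v_next = 0.0 if s2 in (0, 4) else V[s2]
    delta = r + gamma * v_next - V[s]   # TD error for this step
    E[s] += 1.0                         # boost the trace of the visited state
    for i in range(n_states):
        V[i] += alpha * delta * E[i]    # propagate delta backwards...
        E[i] *= gamma * lam             # ...and decay every trace by gamma*lambda
    s = s2
```

Every state carries its share of the TD error through its trace, so learning happens at every step rather than only at the end of the episode.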

The main advantage of eligibility traces over the n-step forward view is
that only one single trace vector is required, rather than a store of the
last n feature vectors.

Job Shop Scheduling Problem (JSSP): An Overview


Job Shop Scheduling (JSS), or the Job Shop Problem (JSP), is a
popular optimization problem in computer science and operational
research. It focuses on assigning jobs to resources at particular times.

What is a Job Shop?


A Job Shop is a work location in which a number of general-
purpose work stations exist and are used to perform a variety of jobs.

The most basic version of the JSSP is:

Given n jobs J1, J2, …, Jn of varying processing times, schedule
them on m machines with varying processing power while trying to
minimize the makespan.

The makespan is the total length of the schedule (that is, when all the
jobs have finished processing).

Each job consists of a sequence of tasks, which must be performed in a


given order, and each task must be processed on a specific machine.
Constraints for the Job Shop Problem
There are three main constraints for the job shop problem:
 No task for a job can be started until the previous task for that
job is completed.
 A machine can only work on one task at a time.
 A task, once started, must run to completion.
Factors to Describe Job Shop Scheduling Problem
Based on the research and experiments done in Job Shop
Scheduling, the following variables have the most effect:
1. Arrival pattern
2. Number of machines (work stations)
3. Work Sequence
4. Performance evaluation criterion
Types of Arrival Patterns
The arrival pattern of jobs to machines takes one of two forms, either
static or dynamic.
 Static — n jobs arrive at an idle shop and must be scheduled for
work
 Dynamic — intermittent arrival (this is often stochastic)

Types of Work Sequence


 Fixed, repeated sequence — flow shop
 Random sequence — All patterns possible
Performance Evaluation Criterion on Job Scheduling
Most research evaluates job scheduling heuristics against the
following performance criteria:
 Makespan — total time to completely process all jobs
 Average time of jobs in shop
 Lateness
 Average number of jobs in shop
 Utilization of machines
 Utilization of workers

Illustration of Job Shop Scheduling

Representation of JSS
The Gantt-Chart is a convenient way of visually representing a
solution of the JSSP.

A Gantt chart representation of a 3x3 problem


J1-J3 stands for the Jobs

M1-M3 stands for the Machines

The length of this solution is 12, which is the first time when all three
jobs are complete. However, note that this is not the optimal solution!

Flow Shop Scheduling


Flow shop scheduling is a special case of job shop scheduling,
where there is a strict order of all operations to be performed on all jobs.
It follows a linear fashion.
The most basic version of FSS is:
Given n jobs J1, J2, …, Jn of varying specified processing times,
which need to be scheduled on m machines.
The i-th operation of each job must be executed on the i-th machine.
No machine can perform more than one operation simultaneously.
Operations within one job must be performed in the specified
order: the first operation is executed on the first machine, then (once the
first operation is finished) the second operation on the second machine,
and so on until the m-th operation.
The problem here is to determine the optimal such arrangement,
one with the shortest possible total job execution makespan.
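Given a fixed job order, the makespan of a flow shop can be computed with the standard completion-time recurrence C[j][i] = max(C[j−1][i], C[j][i−1]) + p[j][i]: an operation starts only when its machine is free and the job's previous operation is done. The sketch below uses made-up processing times for illustration:

```python
# Sketch: makespan of a flow shop for a fixed job order.
# p[j][i] = processing time of job j on machine i.
def flow_shop_makespan(p):
    n_jobs, n_machines = len(p), len(p[0])
    C = [[0] * n_machines for _ in range(n_jobs)]   # completion times
    for j in range(n_jobs):
        for i in range(n_machines):
            machine_free = C[j - 1][i] if j > 0 else 0  # machine i frees up
            job_ready = C[j][i - 1] if i > 0 else 0     # previous operation done
            C[j][i] = max(machine_free, job_ready) + p[j][i]
    return C[-1][-1]

jobs = [[3, 2], [1, 4], [2, 2]]      # 3 jobs on 2 machines (illustrative)
print(flow_shop_makespan(jobs))      # makespan of this particular order
```

Trying different job orders with this function makes the optimization problem concrete: the goal is the permutation with the smallest returned value.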

Illustration of Flow Shop Scheduling


Job Shop or Flow Shop?

Basically, most small machine shops that deal with high-mix,
low-volume products use a job shop model, while high-volume
manufacturers, such as automotive plants, use a flow shop model.

The fundamental difference between how the task sequence is


processed between the two schedules can be illustrated as below.

Job Shop vs Flow Shop

Solving the Job shop Problem

In order to solve the scheduling problem, a wide range of


solutions have been proposed in both computer science and
operational research.
Johnson’s Rule

In operational research, Johnson’s rule is the most common


method of scheduling jobs in two work centers. Its primary objective
is to find an optimal sequence of jobs to reduce makespan. It also
focuses on reducing the amount of idle time between the two work
centers.
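Johnson's rule itself is simple to state: jobs whose shorter processing time is in the first work center are scheduled first, in ascending order of that time, and jobs whose shorter time is in the second work center are scheduled last, in descending order of that time. A minimal sketch with made-up processing times:

```python
# Sketch of Johnson's rule for two work centers.
# times[j] = (time of job j in center 1, time of job j in center 2).
def johnsons_rule(times):
    # Jobs whose shorter time is in center 1 go first, ascending.
    front = sorted((j for j, (a, b) in enumerate(times) if a <= b),
                   key=lambda j: times[j][0])
    # Jobs whose shorter time is in center 2 go last, descending.
    back = sorted((j for j, (a, b) in enumerate(times) if a > b),
                  key=lambda j: times[j][1], reverse=True)
    return front + back

print(johnsons_rule([(3, 6), (5, 2), (1, 2)]))  # -> [2, 0, 1]
```

For two work centers this ordering is guaranteed to minimize the makespan, which is why the rule is the standard starting point in operational research.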
Statistical Approaches

In computer science, ample developments and research have
been conducted with the aim of scheduling and optimization.
Genetic Algorithms, Ant Colony Optimization, Simulated Annealing
(SA), Artificial Neural Networks, and Multi-Agent Systems are some of
the approaches. These techniques differ computationally in their
methodologies and in the optimization capability they can offer.

However, the optimization of scheduling is still a
challenging issue in both the computer science and operational research
fields. Most of the current solutions are unable to cope with
environmental uncertainties and the dynamic behavior of tasks and
agents, and are not adaptive. There is even a lack of solutions that are
implementable in real time. So there is a major requirement to
implement a workaround solution in order to solve this issue.

Watkins's Q(λ)
Unlike TD(λ) or Sarsa(λ), Watkins's Q(λ) does not look ahead all the way
to the end of the episode in its backup. It only looks ahead as far as the next
exploratory action.

Aside from this difference, however, Watkins's Q(λ) is much like TD(λ)
and Sarsa(λ). Their lookahead stops at the episode's end, whereas Q(λ)'s lookahead
stops at the first exploratory action, or at the episode's end if there are no exploratory
actions before that.

Suppose the next action taken is exploratory. Watkins's Q(λ) would still do the
one-step update of Q(st, at) toward rt+1 + γ maxa Q(st+1, a). In general, if at+n
is the first exploratory action, then the longest backup is toward

    rt+1 + γ rt+2 + … + γ^(n−1) rt+n + γ^n maxa Q(st+n, a)

where we assume off-line updating. The backup diagram illustrates the forward
view of Watkins's Q(λ), showing all the component backups.
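The backward-view mechanization of this idea cuts all eligibility traces to zero whenever a non-greedy action is taken, so the backup never reaches past the first exploratory action. The sketch below (a toy chain environment with illustrative parameters, not from the notes) shows the trace cut:

```python
import random

# Sketch: Watkins's Q(lambda) with accumulating traces on a toy chain.
# Taking a non-greedy (exploratory) action zeroes all traces, so the
# backup never reaches past the first exploratory action.
random.seed(1)
gamma, lam, alpha, eps = 0.9, 0.8, 0.1, 0.2
n_states, actions = 5, (0, 1)             # action 0 = left, 1 = right
Q = [[0.0, 0.0] for _ in range(n_states)]

for _ in range(100):                      # episodes
    E = [[0.0, 0.0] for _ in range(n_states)]   # reset traces
    s = 0
    while s < n_states - 1:               # rightmost state is terminal
        greedy = max(actions, key=lambda a: Q[s][a])
        a = random.choice(actions) if random.random() < eps else greedy
        s2 = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s2 == n_states - 1 else 0.0
        q_next = 0.0 if s2 == n_states - 1 else max(Q[s2])
        delta = r + gamma * q_next - Q[s][a]    # Q-learning TD error
        E[s][a] += 1.0
        for i in range(n_states):
            for b in actions:
                Q[i][b] += alpha * delta * E[i][b]
                # cut the traces after a non-greedy action, else decay
                E[i][b] = 0.0 if a != greedy else gamma * lam * E[i][b]
        s = s2
```

After training, the greedy action in the state next to the goal should be "right", since its Q-value approaches the immediate reward of 1.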
