You are on page 1of 11

1 INTRODUCTION

Instruction Scheduling is the task of deciding what instruction will be executed at which unit of time. Since instruction scheduling algorithms are used in compilers to reduce run-time delays for the compiled code by the reordering or transformation of program statements that is done usually at the intermediate language or assembly code level. A research has been carried out by the users on scheduling code within the scope of basic blocks, i.e., straight line sections of code, and most modern compilers have been given with very effective basic block schedulers and especially for pipeline processors. Modern compilers employ sophisticated instruction scheduling techniques to shorten the number of cycles taken to execute the instruction stream. , Depending on their locality and the memory hierarchy architecture, memory access instructions have a variable latency. The scheduler must assume a constant value, usually the execution time assigned to a hit. Moreover, it is ensured by the instruction scheduler that hardware resources are not oversubscribed in any cycle. It is not an easy task for a contemporary processor implementation with multiple pipelines and complex resource usage restrictions. Such resource hazards involve the complexity which is one of the primary factors that constrain the instruction scheduler from performing certain kinds of transformations which may result in improved schedules. For detecting pipeline resource hazards we need an idea based on finite state automata which should support the efficient implementation of such transformation which makes it essential for aggressive instruction scheduling beyond basic blocks. Our scheme is superior in terms of space and time even though if the similar code transformations can be supported by other schemes such as reservation tables. This new technique takes advantage of a conventional, high quality basic block scheduler which is done as followed:

Firstly, suppressing selected subsequence of instructions. Secondly, scheduling the modified sequence of instructions using the basic block scheduler.

INSTRUCTION SCHEDULING
Instruction scheduling attempts to rewrite the code for maximum instruction-level parallelism (ILP).

Time a = 1 + x; + z;

ba = 1 + = 2 + y; x; b=2+ y; c=3+ z;

c=3

Since all three instructions are independent, we can execute them in parallel, assuming adequate hardware processing resources. Instruction scheduling is a compiler optimization which is used to improve instruction-level parallelism that improves performance on machines with instruction pipelines. Instruction scheduling tries to avoid:

Pipeline stalls by rearranging the order of instructions. Illegal or semantically ambiguous operations (typically involving subtle instruction pipeline timing issues or non-interlocked resources.)

The pipeline stalls can be caused by structural hazards (processor resource limit), data hazards (output of one instruction needed by another instruction) and control hazards (branching).

Parallelism constraints
Data-dependence constraints: If instruction A computes a value that is read by instruction B, then B cant execute before A is completed Resource hazards: Finiteness of hardware (e.g., how many add circuits) means limited parallelism.

Hardware parallelism
Three forms of parallelism are found in modern hardware: pipelining superscalar processing multiprocessing Of these, the first two forms are commonly exploited by instruction scheduling.

Pipelining
Basic idea is to decompose an instructions execution into a sequence of stages, so that multiple instruction executions can be overlapped. The classic 5-stage pipeline

IF Instruction fetch RF Decode and register fetch EX Execute on ALU ME Memory access WB Write back to register file

time

Why this might help?


The classical laundry

Best work if each stage takes same amount of time (eg-one clock cycle)

Pipelining illustration

time The standard vov neuman model In a given cycle each instruction is in different stage, but each stage is active.

The pipline is full here time

Pipelining speedup:
Suppose a non-pipelined machine can execute an instruction in 5ns.This is a completion rate of 0.20inst/ns. If each stage of the pipelined machine completes in 1ns, then it executes at a completion rate of 1.0inst/ns (after 5ns to fill the pipeline). In the ideal case, a pipelined processor will complete one instruction every cycle. This is the instruction throughput, which ideally is 1 IPC (instructions per cycle). Alternatively, we can talk in terms of CPI (cycles per instruction).

Pipeline hazards:

CPU One instruction may need the results of a previous instruction. A branch/jump target might not be known until later in the pipeline. An instruction may be waiting on a fetch from main memory. Such inconsistencies in the pipeline that causes stalls are called hazards.
3ns 8B 6ns CACHE MEMORY

Larger, slower ,

32B 60ns

cheaper

MAIN MEMORY
8KB

8ms

DISK

Limitations of pipelineing
Since all stages execute in parallel, the cycle time is limited by the slowest stage. Also, the deeper the pipeline the longer it takes to fill. Worst of all, pipeline hazards can cause the processor to suspend progress temporarily.

Scheduling complications
Complications: 1. Data dependences 2. Control dependences 3. Limited hardware resources

Data dependences
In practice, data dependences are extremely difficult to reason about. Considerable research effort on alias analysis, pointer analysis, with limited success. Must be careful not to read or write a data location too early x=1 y=x x=1 x=2 y=x x=1 True dependence: read-after-write Output dependence: write-after-write

Anti-dependence: write-after-read Sometimes fixed Via remaining Instruction scheduling is done on a single basic block. In order to know that whether rearranging the block's instructions preserves the behavior of that block; we need the concept of a data dependency. There are three types of dependencies, which also happen to be the three data hazards: 1. Read after Write (RAW or "True"):Firstly it writes a value which is to be later used by the second instruction and the value should be written first else the second instruction will read the old value instead of the new 2. Write after Read (WAR or "Anti"): Firstly it reads a location that is later overwritten by the second instruction and the location should be read first or it will read the new value instead of the old. 3. Write after Write (WAW or "Output"): Two instructions both write the same location. They must occur in their original order. There is also a fourth type i.e. Read after Read (RAR or "Input"): In this both instructions read the same location. Input does not depend on the execution order of the two statements but is useful in scalar replacement of array replacements.

Data hazards
r1 = r2 + r3 r4 = r1 + r1 IF RF EX ME WB r2+r3 available here

r1 = [r2] r4 = r1 + r1 IF RF EX ME WB

[r2] available here

Register renaming
Sometimes data hazards are not real, in the sense that a simple renaming of registers can eliminate them. For output dependence, A and B write. For anti dependence, A reads & B writes.

The hardware can sometimes rename registers, thereby allowing reordering.

Control dependences
What do we do when we reach a conditional branch? Unconditional target is available only after decode stage. Conditional target is available only after execute stage.

Branch delay slots

Jump

option 1: hardware solution stall pipeline

Jump nop

option 2: software solution insert nop instruction

Jump Inst X Inst Y Inst 1 Inst 2

option 3: branch takes effect after delay slots.

This means some execution get executed after the branch but the branching takes effect.

Hardware
The hardware is finite. Also, for engineering reasons, there may be constraints on how the hardware resources may be used. Hardware complication: Issue width Superscalar allows more than one instruction to be issued to the processor in each cycle. However, there is a finite limit. Hardware Complication: Functional units A 4-way superscalar might be limited to issuing, e.g., at most 2 integer, 1 memory, and 1 floating point instruction per cycle. Hardware complication: Pipeline limits Often, there are restrictions on the pipelining in a functional unit. e.g., some FPUs allow a new FP division only once every 2 cycles.

Algorithms
The most frequently used algorithm is used to find a topological sort and is known as list scheduling. It repeatedly selects a source of the dependency graph, appends it to the current instruction schedule and removes it from the graph. This further causes other vertices to be sources, which will then also be considered for scheduling. The algorithm is terminated if the graph is empty. In order to get a good schedule, stalls should be prevented. This is determined by the choice of the next instruction to be scheduled. A number of heuristics are in common use:

The processor resources that are already being used by the scheduled instruction are recorded. If an already occupied resource is used its priority drops. If a candidate is scheduled closer to its predecessors than the associated latency its priority will drop. If a candidate lies on the critical path of the graph, its priority will rise. This heuristic provides some form of look-ahead in an otherwise local decision process. If choosing a candidate will create many new sources, its priority will rise. This heuristic tends to generate more freedom for the scheduler.

The phase order of Instruction Scheduling


Instruction scheduling can be done either before or after register allocation .If it is done before register allocation it results in maximum parallelism and also in this case the register allocator has to use a number of registers exceeding those available .This causes spill/fill code to be introduced which will reduce the performance of the section of code in question.

10

If the architecture being scheduled has instruction sequences that have potentially illegal combinations (due to a lack of instruction interlocks) the instructions must be scheduled after register allocation. This second scheduling pass will also improve the placement of the spill/fill code. If scheduling is done after register allocation then there will be false dependencies introduced by the register allocation that will limit the amount of instruction motion possible by the scheduler.

Types of Instruction Scheduling


Various types of instruction scheduling are as follows: 1. Local Scheduling: In this the instructions can't move across basic block boundaries. 2. Global scheduling: In this the instructions can move across basic block boundaries. 3. Modulo Scheduling: It is also known as software pipelining, it interleaves different iterations of a loop. 4. Trace scheduling: It is the first practical approach for global scheduling; it tries to optimize the control flow path that is executed most often. 5. Super block scheduling: It is a simplified form of trace scheduling. It does not merge control flow paths, rather the code can be implemented by more than one schedule.

Major scheduling approaches

x=a+b x=a+b y=c+d x=a+b x=a+b y = c+ d

List scheduling pipelining (Within a block)

trace scheduling (Across blocks)

software (Across iterations)

11

Bibliography
References: http://www.cs.cmu.edu/afs/cs/academic/class/15745s06/web/handouts/11.pdf http://suif.stanford.edu/~courses/cs243/lectures/l7.pdf Book referred: Advanced Computer Architecture