
Instruction Level Parallelism (ILP)

Advanced Computer Architecture


CSE 8383
Spring 2004 2/19/2004
Presented By:
Saad Al-Harbi
Saeed Abu Nimeh
Outline
What Is ILP?
ILP vs Parallel Processing
Sequential execution vs ILP execution
Limitations of ILP
ILP Architectures
Sequential Architecture
Dependence Architecture
Independence Architecture
ILP Scheduling
Open Problems
References

What Is ILP?
Architectural technique that allows the
overlap of individual machine operations
(add, mul, load, store)
Multiple operations will execute in parallel
(simultaneously)
Goal: Speed Up the execution
Example:
load  R1 <- R2          add   R3 <- R3, 1
add   R3 <- R3, 1       add   R4 <- R3, R2
add   R4 <- R4, R2      store [R4] <- R0

Example: Sequential vs ILP
Sequential execution (Without ILP)
Add r1, r2 -> r8    4 cycles
Add r3, r4 -> r7    4 cycles
Total of 8 cycles


ILP execution (overlap execution)
Add r1, r2 -> r8
Add r3, r4 -> r7

Total of 5 cycles
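
The cycle counts in this example can be reproduced with a small calculation. The sketch below assumes a simple model that the slide does not spell out: every instruction takes the same number of one-cycle steps, and with overlap a new instruction can start one cycle after the previous one, with no stalls. The function names are illustrative.

```python
def sequential_cycles(num_instructions: int, cycles_per_instruction: int) -> int:
    """Each instruction finishes before the next one starts."""
    return num_instructions * cycles_per_instruction


def overlapped_cycles(num_instructions: int, cycles_per_instruction: int) -> int:
    """Assumed model: one new instruction starts every cycle, no stalls."""
    if num_instructions == 0:
        return 0
    # The first instruction takes its full latency; each later one adds a cycle.
    return cycles_per_instruction + (num_instructions - 1)


print(sequential_cycles(2, 4))  # 8
print(overlapped_cycles(2, 4))  # 5
```

Under this model the benefit grows with the number of independent instructions: the overlapped count approaches one cycle per instruction.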

ILP vs Parallel Processing
ILP
Overlap individual machine
operations (add, mul, load)
so that they execute in
parallel

Transparent to the user

Goal: speed up execution

Parallel Processing
Separate processors each
work on a separate chunk of
the program (the processors are
programmed to do so)

Nontransparent to the user

Goal: speed up and quality
up

ILP Challenges
To achieve parallelism, there must be
no dependences among the
instructions that are executing in
parallel:
H/W terminology: data hazards (RAW,
WAR, WAW)
S/W terminology: data dependences

Dependences and Hazards
Dependences are a property of
programs
If two instructions are data dependent,
they cannot execute simultaneously
A dependence can result in a hazard, and
an unresolved hazard causes a stall
Data dependences may occur through
registers or memory
Types of Dependencies
Name dependencies
Output dependence
Anti-dependence
Data (true) dependence
Control Dependence
Resource Dependence
Name dependences
Output dependence
When instructions i and j write the same register or
memory location. The ordering must be preserved to
leave the correct value in the register
i: add r7,r4,r3
j: div r7,r2,r8
Anti-dependence
When instruction j writes a register or memory
location that instruction i reads
i: add r6,r5,r4
j: sub r5,r8,r11
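
The two definitions above can be checked mechanically. The following is a minimal sketch, assuming a three-operand format in which the first register is the destination and the rest are sources (as in the slide examples). `Instr`, `parse`, and `classify` are illustrative names, and the RAW check is included only for completeness.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def parse(text: str) -> Instr:
    """Parse 'add r7,r4,r3' assuming the first register is the destination."""
    op, operands = text.split(None, 1)
    regs = [r.strip() for r in operands.split(",")]
    return Instr(op, regs[0], tuple(regs[1:]))


def classify(i: Instr, j: Instr) -> set:
    """Return the dependences of later instruction j on earlier instruction i."""
    kinds = set()
    if i.dest in j.srcs:
        kinds.add("RAW")  # true data dependence
    if j.dest in i.srcs:
        kinds.add("WAR")  # anti-dependence
    if i.dest == j.dest:
        kinds.add("WAW")  # output dependence
    return kinds


print(classify(parse("add r7,r4,r3"), parse("div r7,r2,r8")))   # {'WAW'}
print(classify(parse("add r6,r5,r4"), parse("sub r5,r8,r11")))  # {'WAR'}
```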
Data Dependences
An instruction j is data
dependent on instruction i
if either of the following
holds:
instruction i produces a
result that may be used by
instruction j, or
instruction j is data
dependent on instruction k,
and instruction k is data
dependent on instruction i
LOOP: LD  F0, 0(R1)
      ADD F4, F0, F2
      SD  F4, 0(R1)
      SUB R1, R1, -8
      BNE R1, R2, LOOP
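
The two-part definition above (direct use of a result, plus chains of such uses) can be sketched on the loop body. The encoding of each instruction as a set of registers written and read is an assumption made for illustration; the sketch covers a single iteration only and ignores memory and loop-carried dependences.

```python
# One loop iteration, hand-encoded as (text, registers written, registers read).
LOOP_BODY = [
    ("LD  F0, 0(R1)",    {"F0"}, {"R1"}),
    ("ADD F4, F0, F2",   {"F4"}, {"F0", "F2"}),
    ("SD  F4, 0(R1)",    set(),  {"F4", "R1"}),
    ("SUB R1, R1, -8",   {"R1"}, {"R1"}),
    ("BNE R1, R2, LOOP", set(),  {"R1", "R2"}),
]


def direct_dependences(program):
    """Map each instruction index to the earlier producers whose results it reads."""
    last_writer = {}
    deps = {idx: set() for idx in range(len(program))}
    for idx, (_, writes, reads) in enumerate(program):
        for reg in reads:
            if reg in last_writer:
                deps[idx].add(last_writer[reg])
        for reg in writes:
            last_writer[reg] = idx
    return deps


def transitive_dependences(deps):
    """Close the relation: j depends on i if j depends on k and k depends on i."""
    closed = {}
    for idx in sorted(deps):  # producers always have smaller indices
        result = set(deps[idx])
        for parent in deps[idx]:
            result |= closed[parent]
        closed[idx] = result
    return closed


closed = transitive_dependences(direct_dependences(LOOP_BODY))
print(closed[2])  # {0, 1}: SD depends on ADD directly and on LD through ADD
print(closed[4])  # {3}: BNE depends on SUB
```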

Control Dependences
A control dependence determines the ordering of an instruction
i with respect to a branch instruction, so that instruction i is
executed in correct program order.
Example:
if p1 {
S1;
};
if p2 {
S2;
};
S1 is control dependent on p1; S2 is control dependent on p2 but not on p1.


Two constraints imposed by control
dependences:
1. An instruction that is control dependent on a
branch cannot be moved before the branch
2. An instruction that is not control dependent
on a branch cannot be moved after the branch
Resource dependences
An instruction is resource-dependent on
a previously issued instruction if it
requires a hardware resource which is
still being used by a previously issued
instruction.
e.g.
div r1, r2, r3
div r4, r2, r5
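
The two divides above share no registers that would create a data dependence, yet they still conflict if the machine has only one divider. A minimal sketch of that effect follows; the 4-cycle divide latency, the non-pipelined unit, and the function name are assumptions for illustration, and data dependences are deliberately ignored.

```python
def issue_times(ops, units, latency):
    """Start cycle of each operation when only functional units limit issue.

    ops: unit kinds in program order, e.g. ["div", "div"]
    units: number of (non-pipelined) units of each kind
    latency: cycles a unit stays busy per operation (assumed values)
    """
    free_at = {kind: [0] * count for kind, count in units.items()}
    starts = []
    for kind in ops:
        slots = free_at[kind]
        k = min(range(len(slots)), key=slots.__getitem__)  # earliest free unit
        start = slots[k]
        slots[k] = start + latency[kind]
        starts.append(start)
    return starts


print(issue_times(["div", "div"], {"div": 1}, {"div": 4}))  # [0, 4]
print(issue_times(["div", "div"], {"div": 2}, {"div": 4}))  # [0, 0]
```

With a second divider the resource dependence disappears and both operations start in the same cycle.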

ILP Architectures
Computer architecture is a contract
(the instruction format and the interpretation of
the bits that constitute an instruction)
between the class of programs that are
written for the architecture and the set of
processor implementations of that
architecture.
An ILP architecture adds to this contract
information embedded in the program about the
parallelism available between instructions and
operations in the program
ILP Architectures
Classifications
Sequential Architectures: the program is not
expected to convey any explicit information regarding
parallelism. (Superscalar processors)
Dependence Architectures: the program explicitly
indicates the dependences that exist between
operations (Dataflow processors)
Independence Architectures: the program provides
information as to which operations are independent
of one another. (VLIW processors)
Sequential architecture and
superscalar processors
Program contains no explicit information
regarding dependencies that exist between
instructions
Dependencies between instructions must be
determined by the hardware
It is only necessary to determine dependencies
with sequentially preceding instructions that have
been issued but not yet completed
The compiler may reorder instructions to
facilitate the hardware's task of extracting
parallelism
Superscalar Processors
Superscalar processors attempt to issue
multiple instructions per cycle
However, essential dependencies are
specified by sequential ordering so
operations must be processed in sequential
order
This proves to be a performance
bottleneck that is very expensive to
overcome
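
The cost of sequential ordering can be seen in a toy issue model. This is a sketch under stated assumptions, not a description of any real superscalar: instructions are `(dest, srcs)` pairs, every operation has unit latency, and the hardware issues up to `width` instructions per cycle strictly in program order, stopping at the first one that reads or rewrites a register written earlier in the same group.

```python
def in_order_issue(program, width):
    """Group instruction indices into issue cycles, strictly in program order."""
    groups = []
    i = 0
    while i < len(program):
        group = []
        written = set()
        while i < len(program) and len(group) < width:
            dest, srcs = program[i]
            # RAW or WAW against an instruction already in this group: stop.
            if written & set(srcs) or dest in written:
                break
            group.append(i)
            written.add(dest)
            i += 1
        groups.append(group)
    return groups


program = [
    ("r3", ("r1", "r2")),  # 0
    ("r4", ("r3", "r2")),  # 1: needs the result of 0
    ("r5", ("r1", "r2")),  # 2: independent of 0 and 1
    ("r6", ("r5", "r1")),  # 3: needs the result of 2
]
print(in_order_issue(program, width=2))  # [[0], [1, 2], [3]]
```

Instruction 2 could have issued alongside instruction 0, but in-order issue stops at the dependent instruction 1, so the four instructions take three cycles instead of two.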

Dependence architecture and
data flow processors
The compiler (programmer) identifies the parallelism
in the program and communicates it to the hardware
(specify the dependences between operations)
The hardware determines at run-time when each
operation is independent of the others and performs
the scheduling
No scanning of the sequential program is needed to
determine dependences
Objective: execute the instruction at the earliest
possible time (available input operands and
functional units).
Dependence architectures
Dataflow processors
Dataflow processors are representative of
Dependence architectures
Execute instruction at earliest possible time subject
to availability of input operands and functional units
Dependencies communicated by providing with each
instruction a list of all successor instructions
As soon as all input operands of an instruction are
available, the hardware fetches the instruction
The instruction is executed as soon as a functional
unit is available
Few Dataflow processors currently exist
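
The firing rule described above can be sketched in a few lines. The sketch assumes unit latency and interchangeable functional units, and counts only operands produced by other instructions (external inputs are taken as already available); `dataflow_schedule` and its parameters are illustrative names.

```python
from collections import deque


def dataflow_schedule(successors, num_inputs, num_units):
    """Return, per cycle, the instructions that fire.

    successors: instruction -> list of instructions that consume its result
    num_inputs: instruction -> number of operands it is still waiting for
    num_units: functional units available each cycle
    """
    pending = dict(num_inputs)
    ready = deque(sorted(n for n, count in pending.items() if count == 0))
    cycles = []
    while ready:
        # Fire as many ready instructions as there are functional units.
        fired = [ready.popleft() for _ in range(min(num_units, len(ready)))]
        cycles.append(fired)
        for node in fired:
            for succ in successors.get(node, []):
                pending[succ] -= 1  # one more operand has arrived
                if pending[succ] == 0:
                    ready.append(succ)
    return cycles


succ = {"a": ["c"], "b": ["c"], "c": ["d"]}
inputs = {"a": 0, "b": 0, "c": 2, "d": 1}
print(dataflow_schedule(succ, inputs, num_units=2))  # [['a', 'b'], ['c'], ['d']]
print(dataflow_schedule(succ, inputs, num_units=1))  # [['a'], ['b'], ['c'], ['d']]
```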
Dataflow strengths and
limitations
Dataflow processors use control parallelism
alone to fully utilize the functional units (FUs).
A dataflow processor is more successful than
others at looking far down the execution path
to find control parallelism
When successful, it is better than speculative
execution:
Every instruction that is executed is useful
The processor does not have to deal with error
conditions caused by speculative operations

Independence architecture
and VLIW processors
By knowing which operations are independent, the
hardware needs no further checking to determine
which instructions can be issued in the same cycle
The set of independent operations is much larger than the set of
dependent operations
Only a subset of the independent operations is specified
The compiler may additionally specify on which
functional unit and in which cycle an operation is
executed
The hardware needs to make no run-time decisions
VLIW processors
Operation vs instruction
Operation: a unit of computation (add, load,
branch); equivalent to an instruction in a sequential architecture
Instruction: a set of operations that are intended to
be issued simultaneously
The compiler decides which operations go into
each instruction (scheduling)
All operations that are supposed to begin at
the same time are packaged into a single
VLIW instruction
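
The packing step can be illustrated with a greedy sketch. It is a simplification made for illustration: every operation has unit latency, any slot can hold any operation (functional unit types are ignored), and `ops` is listed so that producers come before their consumers. The operation names and the `pack_vliw` function are hypothetical.

```python
def pack_vliw(ops, deps, width):
    """Pack operations into VLIW instructions of at most `width` operations.

    ops: operation names in an order where producers precede consumers
    deps: operation -> operations whose results it needs
    """
    slot_of = {}
    words = []
    for op in ops:
        # An operation must come after every operation it depends on.
        earliest = max((slot_of[d] + 1 for d in deps.get(op, ())), default=0)
        cycle = earliest
        while cycle < len(words) and len(words[cycle]) >= width:
            cycle += 1  # that instruction is full, try the next one
        while len(words) <= cycle:
            words.append([])
        words[cycle].append(op)
        slot_of[op] = cycle
    return words


ops = ["load a", "load b", "add c", "mul d", "store c"]
deps = {"add c": ["load a", "load b"], "mul d": ["load a"], "store c": ["add c"]}
for word in pack_vliw(ops, deps, width=2):
    print(word)
# ['load a', 'load b']
# ['add c', 'mul d']
# ['store c']
```

Each inner list corresponds to one VLIW instruction; the hardware issues everything in it together without checking dependences.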
VLIW strengths
The hardware is very simple:
it consists of a collection of functional units (adders,
multipliers, branch units, etc.) connected by a bus,
plus some registers and caches
More silicon goes to the actual processing
(rather than being spent on branch
prediction, for example)
It should run fast, as the only limit is the
latency of the functional units themselves
Programming a VLIW chip is very much like
writing microcode
VLIW limitations
The need for a powerful compiler
Increased code size arising from aggressive
scheduling policies
Larger memory bandwidth and register-file
bandwidth requirements
Limitations due to lock-step operation
Binary compatibility across implementations
with varying numbers of functional units and
latencies
Summary: ILP Architectures

|                                         | Sequential Architecture | Dependence Architecture | Independence Architectures |
|-----------------------------------------|-------------------------|-------------------------|----------------------------|
| Additional info required in the program | None | Specification of dependences between operations | Minimally, a partial list of independences. A complete specification of when and where each operation is to be executed |
| Typical kind of ILP processor           | Superscalar | Dataflow | VLIW |
| Dependences analysis                    | Performed by HW | Performed by compiler | Performed by compiler |
| Independences analysis                  | Performed by HW | Performed by HW | Performed by compiler |
| Scheduling                              | Performed by HW | Performed by HW | Performed by compiler |
| Role of compiler                        | Rearranges the code to make the analysis and scheduling HW more successful | Replaces some analysis HW | Replaces virtually all the analysis and scheduling HW |

ILP Scheduling
Static Scheduling boosted
by parallel code
optimization
done by the compiler
The processor receives
dependency-free and
optimized code for
parallel execution
Typical for VLIWs and a
few pipelined
processors (e.g. MIPS)

Dynamic Scheduling
without static parallel code
optimization
done by the processor
The code is not
optimized for parallel
execution. The
processor detects and
resolves dependencies
on its own
Early ILP processors
(e.g. CDC 6600, IBM
360/91 etc.)

Dynamic Scheduling
boosted by static parallel
code optimization
done by processor in
conjunction with parallel
optimizing compiler
The processor receives
optimized code for
parallel execution, but it
detects and resolves
dependencies on its own
Usual practice for
pipelined and
superscalar processors
(e.g. RS6000)
ILP Scheduling: Trace
scheduling
An optimization technique that has been
widely used for VLIW, superscalar, and
pipelined processors.
It selects a sequence of basic blocks as a
trace and schedules the operations from the
trace together.
Example:
Instr1
Instr2
Branch x
Instr3
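
Trace selection, the first step described above, can be sketched as following the most likely successor from block to block. The control-flow graph, the branch probabilities, and the `select_trace` function below are hypothetical; real trace schedulers also add compensation code at trace entries and exits, which this sketch does not attempt.

```python
def select_trace(entry, edges):
    """Follow the most probable successor from `entry` until the path ends or repeats.

    edges: block -> {successor: probability}, e.g. taken from profile data
    """
    trace = [entry]
    seen = {entry}
    block = entry
    while edges.get(block):
        nxt = max(edges[block], key=edges[block].get)  # most likely successor
        if nxt in seen:
            break  # do not follow a back edge into the trace
        trace.append(nxt)
        seen.add(nxt)
        block = nxt
    return trace


# Hypothetical CFG: B1 ends in a branch that usually falls through to B2.
cfg = {
    "B1": {"B2": 0.9, "B3": 0.1},
    "B2": {"B4": 1.0},
    "B3": {"B4": 1.0},
    "B4": {},
}
print(select_trace("B1", cfg))  # ['B1', 'B2', 'B4']
```

The operations of the selected blocks are then scheduled together as if they formed one long basic block.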
Trace Scheduling
Extract more ILP
Increase machine fetch bandwidth by
storing logically consecutive blocks in
physically contiguous cache locations
(making it possible to fetch multiple basic blocks
in one cycle)
Trace scheduling can be implemented
by hardware or software
Trace Scheduling in HW
The hardware technique makes use of the large
amount of information available during dynamic execution
to form traces dynamically and schedule
the instructions in a trace more efficiently.
Since dependences and memory access
addresses have been resolved during dynamic
execution, instructions in a trace can be
reordered more easily and efficiently.
Example: trace cache approach
Trace scheduling in SW
A supplement for machines without
hardware trace scheduling support.
Forms traces based on static profile
data, and schedules instructions using
traditional compiler scheduling and
optimization techniques.
It faces some difficulties, such as code
explosion and exception handling.
ILP open problems
Pipelined scheduling: Optimized scheduling of pipelined
behavioral descriptions. There are two simple types of pipelining (structural
and functional).
Controller cost: Most scheduling algorithms do not consider the
controller cost, which depends directly on the controller
style used during scheduling.
Area constraints: Resource-constrained algorithms could
have better interaction between scheduling and floorplanning.
Realism:
Scheduling realistic design descriptions that contain several special
language constructs.
Using more realistic libraries and cost functions.
Scheduling algorithms must also be expanded to incorporate
different target architectures.
References
Instruction-Level Parallel Processing: History, Overview and Perspective. B. Ramakrishna
Rau, Joseph A. Fisher. Journal of Supercomputing, Vol. 7, No. 1, Jan. 1993, pages 9-50.

Limits of Control Flow on Parallelism. Monica S. Lam, Robert P. Wilson. 19th ISCA, May
1992, pages 19-21.

Global Code Generation for Instruction-Level Parallelism: Trace Scheduling-2. Joseph A.
Fisher. Technical Report, HPLabs HPL-93-43, Jun. 1993.

VLIW at IBM Research
http://www.research.ibm.com/vliw

Intel and HP hope to speed CPUs with VLIW technology that's riskier than RISC, Dick
Pountain
http://www.byte.com/art/9604/sec8/art3.htm

Hardware and Software Trace Scheduling
http://charlotte.ucsd.edu/users/yhu/paperlist/summary.html

ILP open problems
http://www.ececs.uc.edu/~ddel/projects/dss/hls_paper/node9.html

Computer Architecture: A Quantitative Approach. Hennessy & Patterson, 3rd edition, Morgan Kaufmann.