Outline
Introduction
◦ Superscalar Review
VLIW Processors
Loop Unrolling
Summary
Superscalar Review
It has clear strengths
◦ The hardware solves everything at run time
It detects potential parallelism between instructions
It tries to issue as many instructions as possible in parallel
It uses register renaming to increase parallelism
◦ Binary compatibility
If functional units are added to a new version of the architecture or
some other improvements have been made (without changing the
instruction set), programs don’t need to be changed
The number of functional units is invisible to the instruction set
Old programs can benefit from the additional potential of
parallelism, since the new hardware will simply issue instructions
in a more efficient way
Superscalar Review…
It also has several problems
◦ The architecture is very complex
Much hardware is needed for run-time detection of parallelism
There is, therefore, a limit in how far we can go with this
technique
◦ The instruction window for execution is limited in size
This limits the capacity to detect a large number of parallel
instructions
VLIW Processors
In a VLIW (Very Long Instruction Word) processor, several
operations that can be executed in parallel are placed in a single long
instruction word
VLIW Processors…(Recap)
Instruction parallelism scheduled at compile time
◦ Included within the machine instructions explicitly
Processor uses this information to perform parallel execution
The hardware is less complex
◦ The controller is similar to that of a simple scalar computer
◦ The number of FUs can be increased without needing additional
sophisticated hardware to detect parallelism, as in superscalar architectures (SSA)
Compiler has much more time to determine possible parallel
operations
◦ This analysis is only done once off-line, while run-time detection is
carried out by SSA hardware for each execution of the code.
Good compilers can detect parallelism based on global analysis of
the whole program
VLIW Processors…
Main Problems
◦ A large number of registers is needed in order to keep all
FUs active (to store operands and results)
◦ Large data transport capacity is needed between FUs and the
register files and between register files and memory
◦ High bandwidth between instruction cache and fetch unit is
also needed due to long instructions
Ex: one instruction with 8 operations, each 24 bits
192 bits/instruction
◦ Large code size, partly because unused operation slots
waste bits in the instruction word
VLIW Processors…
Main Problems…
◦ Incompatibility of binary code
If a new version of the processor introduces additional FUs, the number of
operations to execute in parallel is increased
Therefore, the instruction word changes, and old binary code cannot be
run on the new processor
◦ We might not have sufficient parallelism in the program to
utilize the large degree of machine parallelism
Hardware resources will be wasted in this case
A technique to address this problem is loop unrolling, which can be
performed by a compiler
Loop Unrolling
A technique used by compilers in order to increase the potential
of parallelism in a program
◦ It supports more efficient code generation for processors with
instruction-level parallelism
A technique that attempts to optimize a program's execution
speed at the expense of its binary size (space-time tradeoff)
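The space-time tradeoff can be sketched in C (a sketch only; the function names, the unroll factor of 4, and the assumption that the array length is a multiple of 4 are all illustrative):

```c
#include <stddef.h>

/* Original loop: one floating-point add per iteration. */
void add_s(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* The same loop unrolled by a factor of 4 (illustrative choice).
 * Loop overhead (index update, branch) is paid once per 4 adds,
 * and the 4 adds are independent of each other, so a compiler for
 * an ILP processor can schedule them in parallel.
 * n is assumed to be a multiple of 4 in this sketch. */
void add_s_unrolled(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}
```

The unrolled body is about four times larger, which is the binary-size cost the tradeoff refers to.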
Loop Unrolling…
Assumptions:
◦ R1 initially contains the address of the last element in x; the
other elements are at lower addresses; x[0] is at address 0
◦ Floating point register F2 contains the value s
◦ A floating point value is 8 bytes long
Loop Unrolling…
For an ordinary processor, this C++ code will be compiled to:
◦ Loop: LDD  F0,(R1)   ; F0 := x[i]    (load double)
        ADF  F4,F0,F2  ; F4 := F0+F2   (add floating point)
        STD  (R1),F4   ; x[i] := F4    (store double)
        SBI  R1,R1,#8  ; R1 := R1-8
        BGEZ R1,Loop   ; branch if R1 ≥ 0
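The source listing itself is not shown here; reconstructed from the instruction comments, the loop has roughly this shape (a sketch; the function name and signature are illustrative assumptions):

```c
/* Reconstructed from the assembly comments: s is kept in F2, and
 * R1 walks from the last element of x down to x[0], which is at
 * address 0 under the slide's assumptions. */
void add_s_to_array(double *x, int n, double s) {
    for (int i = n - 1; i >= 0; i--)  /* SBI + BGEZ */
        x[i] = x[i] + s;              /* LDD, ADF, STD */
}
```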
Loop Unrolling…
A VLIW Processor Example
◦ Two memory references, two FP operations, and one integer
operation or branch can be packed in an instruction, in the
following way:
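The slot mix can be pictured as a C struct (a sketch only; the field names and 32-bit slot widths are illustrative assumptions, not the actual encoding):

```c
#include <stdint.h>

/* One VLIW instruction word with the slot mix described above:
 * two memory references, two FP operations, and one integer
 * operation or branch. A slot the compiler cannot fill is an
 * explicit no-op, which is where the wasted bits come from. */
typedef struct {
    uint32_t mem_op[2];      /* e.g. LDD / STD slots */
    uint32_t fp_op[2];       /* e.g. ADF slots       */
    uint32_t int_or_branch;  /* e.g. SBI or BGEZ     */
} vliw_instruction;
```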
Loop Unrolling…
A VLIW Processor Example…
Loop: LDD  F0,(R1)   ; load double
      ADF  F4,F0,F2  ; add FP
      STD  (R1),F4   ; store double
      SBI  R1,R1,#8
      BGEZ R1,Loop
One iteration takes 6 cycles; the whole loop takes 960*6 = 5760 cycles
Almost no parallelism; most of the fields in the instructions are empty
There are two completely empty cycles
Loop Unrolling…
Loop unrolling is a technique used in compilers in order to increase the
potential of parallelism in a program
◦ It supports more efficient code generation for processors with instruction level
parallelism
Loop Unrolling…
Given a certain set of resources (processor architecture) and a given
loop, there is a limit on how many iterations should be unrolled
◦ Beyond that limit there is no gain any more
Loop unrolling increases the memory space needed to store the
program
A good compiler has to find the optimal level of unrolling for each
loop
The example above also illustrates some of the hardware support
needed to keep a VLIW processor busy:
◦ Large number of registers
In order to store data for operations which are active in parallel
◦ Large traffic has to be supported in parallel:
register files <-> memory
register files <-> functional units
Some VLIW Processors
Successful current VLIW processors
◦ TriMedia of Philips
◦ TMS320C6x of Texas Instruments
◦ Both are targeting the multimedia market
Summary
In order to keep up with increasing demands on performance, superscalar
architectures become excessively complex
VLIW architectures avoid complexity by relying on the compiler for
parallelism detection
Operation level parallelism is explicit in the instructions of a VLIW
processor
Not having to deal with parallelism detection, VLIW processors can have
a higher number of functional units
◦ This generates the need for a large number of registers and high communication
bandwidth
In order to keep the large number of functional units busy, compilers for
VLIW processors have to be aggressive in parallelism detection
Loop unrolling is used in order to increase the degree of parallelism in
loops: several iterations of a loop are unrolled and handled in parallel
Summary
The IA-64 family uses VLIW techniques for the future generation of
high-end processors. Itanium is the first member of this family
The instruction format avoids empty operations and allows more
flexibility in parallelism specification, by using the template field
Beside typical VLIW features, the Itanium architecture uses branch
predication and speculative loading.
Branch predication is based on predicated execution:
◦ The execution of an instruction is tied to a predicate register; the
instruction's result is committed only if the predicate evaluates to true
◦ It allows instructions from different branches to be started in parallel
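The idea behind predicated execution, if-conversion, can be sketched in C (function names and values are illustrative; on IA-64 each operation would carry a predicate register rather than an explicit select):

```c
/* Branching form: the processor must pick one path. */
int branching(int a, int b, int c) {
    int r;
    if (c > 0)
        r = a + b;
    else
        r = a - b;
    return r;
}

/* Predicated form: both results are computed, possibly in
 * parallel, and the predicate decides which one is committed.
 * No branch means no misprediction penalty. */
int predicated(int a, int b, int c) {
    int p = (c > 0);   /* predicate register       */
    int t = a + b;     /* executed under p         */
    int f = a - b;     /* executed under !p        */
    return p ? t : f;  /* commit by predicate      */
}
```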
Speculative loading tries to reduce latencies generated by load
instructions
◦ It allows load instructions to be moved across branch boundaries
◦ Page exceptions will be handled only if really needed
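The reordering can be sketched in C (a sketch only; on Itanium the hoisted load is a speculative `ld.s` and a `chk.s` at the original position handles any deferred exception, which plain C cannot model):

```c
/* Sketch of a speculative load: the load is hoisted above the
 * branch so its latency overlaps earlier work. A chk.s at the
 * original load position would validate the speculation; here
 * only the reordering is shown, and p is assumed valid. */
double sum_if(const double *p, int cond, double base) {
    double v = *p;        /* hoisted (speculative) load  */
    if (cond)
        return base + v;  /* chk.s would validate here   */
    return base;          /* load result simply unused   */
}
```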
Reading Assignment
Read about EPIC
Compare EPIC with VLIW
Do EPIC and IA-64 refer to the same thing?