Unit II

The VLIW Architecture [4]
A typical VLIW (very long instruction word) machine has instruction words hundreds of bits in length. Multiple functional units are used concurrently in a VLIW processor. All functional units share the use of a common large register file.
Advantages of VLIW
Compiler prepares fixed packets of multiple operations that give the full "plan of execution"
dependencies are determined by compiler and used to schedule according to function unit latencies function units are assigned by compiler and correspond to the position within the instruction packet ("slotting") compiler produces fully-scheduled, hazard-free code => hardware doesn't have to "rediscover" dependencies or schedule
Disadvantages of VLIW
Compatibility across implementations is a major problem
VLIW code won't run properly with different number of function units or different latencies unscheduled events (e.g., cache miss) stall entire processor low slot utilization (mostly nops) reduce nops by compression ("flexible VLIW", "variable-length VLIW")
Code density is another problem

Exploiting Instruction-Level Parallelism with Software Approaches
Overview
Basic Compiler Techniques
Pipeline scheduling loop unrolling
Static Branch Prediction Static Multiple Issue: VLIW Advanced Compiler Support for Exposing ILP
Detecting loop-level parallelism Software pipelining symbolic loop unrolling Global code scheduling
Hardware support for exposing more parallelism

Conditional or predicted instructions Compiler speculation with hardware support
Hardware vs Software speculation mechanisms Intel IA-64 ISA
Review of Multi-issue Taxonomy

Common Name Superscaler (static) Superscaler (dynamic) Sperscaler (speculative) Issue Hazard structure detection dynamic dynamic dynamic hardware hardware hardware Scheduling Distinguishing characteristic in-order execution some out-of-order execution out-of-order execution with speculation Examples static dynamic dynamic with speculation
Sun UltraSPARC II/III IBM Power2
Pentium III/4 MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64III Trimedia, i860
VLIW/LIW
static
software
static
no hazards between issue packets
EPIC
mostly static
mostly software
mostly static explicit dependencies marked by compiler
Itanium (IA-64 is one implementation)
Basic Pipeline Scheduling

To keep pipeline full
Find sequences of unrelated instructions to overlap Separate dependent instructions by at least the latency of source instruction
Compiler success depends on:

Amount of ILP available Latencies of functional units
Assumptions for Examples

Standard 5-stage integer pipeline plus floating point pipeline
Branches have delay of 1 cycle Integer load latency of 1 cycle, ALU latency of 0 Functional units fully pipelined or replicated so that there are no structural hazards
Latencies between dependent FP instructions:

Instruction producing result FP ALU operation FP ALU operation Load double Instruction using result Another FP ALU operation Store double FP ALU operation Latency in clock cycles 3 2 1
Load double
Store double
Loop Example
Add a scalar to an array. for (i=1000; i>0; i=i-1) x[i] = x[i] + s; Iterations of the loop are parallel with no dependencies between iterations.
Straightforward Conversion
R1 holds the address of the highest array element F2 holds the scalar R2 is pre-computed so that 8(R2) is the last element loop: L.D F0, 0(R1) ;F0 = array element ADD.D F4,F0,F2 ;add scalar in F2 S.D F4, 0(R1) ;store result DADDUI R1,R1,#-8 ;decrement pointer (DW) BNE R1,R2, loop ;branch if R1 != R2
Program in MIPS Pipeline

loop: L.D F0, 0(R1) 1-cycle Stall latency ADD.D F4,F0,F2 Stall 2-cycle latency Stall S.D F4, 0(R1) DADDUI R1,R1,#-8 1-cycle latency Stall branch BNE R1,R2, loop delay slot Stall
Clock cycle issued 1 2 3 4 5 6 7 8 9 10
Scheduled Program in MIPS Pipeline

loop: L.D Stall ADD.D Stall Stall S.D DADDUI Stall BNE Stall F0, 0(R1) F4,F0,F2 Clock cycle issued 1 2 3 4 5 6 7 8 9 10 Clock cycle issued 1 2 3 4 5 6
OLD
F4, 0(R1) R1,R1,#-8
R1,R2, loop
loop:
2-cycle latency
L.D DADDUI ADD.D Stall BNE S.D
F0, 0(R1) R1,R1,#-8 F4,F0,F2 R1,R2, loop F4, 8(R1)
NEW
Compiler Tasks
loop: L.D DADDUI ADD.D Stall BNE S.D F0, 0(R1) R1,R1,#-8 F4,F0,F2 R1,R2, loop F4, 8(R1) Clock cycle issued 1 2 3 4 5 6
OK to reorder DADDUI and ADD.D OK to reorder S.D and BNE OK to reorder DADDUI and S.D, but requires
0(R1) 8(R1)
This one is difficult since a reverse dependency can be seen:
S.D F4,0(R1) DADDUI R1,R1,# -8
Loop Overhead
loop: L.D DADDUI ADD.D Stall BNE S.D F0, 0(R1) R1,R1,#-8 F4,F0,F2
R1,R2, loop F4, 8(R1) Clock cycle issued 1 2 3 4 5 6
6 is the minimum due to dependencies and pipeline latencies Actual work of the loop is just 3 instructions:
L.D, ADD.D, S.D
Other instructions are loop overhead:

DADDUI, BNE
Loop Unrolling
Eliminate some of the overhead by unrolling the loop (fully or partially). Need to adjust the loop termination code Allows more parallel instructions in a row Allows more flexibility in reordering Usually requires register renaming
Unrolled Version
loop: L.D ADD.D S.D L.D ADD.D S.D F0, 0(R1) F4,F0,F2 F4, 0(R1) F6, -8(R1) F8,F6,F2 F8, -8(R1) Assume that the # iterations is a multiple of 4 Decrement R1 by 32 for this 4 iterations More registers required to avoid unnecessary dependencies
L.D ADD.D S.D

L.D ADD.D S.D DADDI BNE
F10, -16(R1) F12,F10,F2 F12, -16(R1)

F14, -24(R1) F16,F14,F2 F16, -24(R1) R1, R1, #-32 R1, R2, loop
Eliminate 3 DADDI and 3 BNE instructions

Without scheduling, each operation will cause a stall when pipelined:
L.D stall ADD.D stall stall S.D F0, 0(R1) F4,F0,F2
F4, 0(R1)
Scheduled Unrolled Version

loop: L.D ADD.D S.D L.D ADD.D S.D L.D ADD.D S.D L.D ADD.D S.D DADDI BNE F0, 0(R1) F4,F0,F2 F4, 0(R1) F6, -8(R1 F8,F6,F2 F8, -8(R1) F10, -16(R1) F12,F10,F2 F12, -16(R1) F14, -24(R1) F16,F14,F2 F16, -24(R1) R1, R1, #-32 R1, R2, loop L.D L.D L.D L.D ADD.D ADD.D Schedule ADD.D ADD.D S.D S.D Move DADDI dependent instructions S.D apart BNE Fill in delay slot S.D loop: F0, 0(R1) Move dependent F6, -8(R1) instructions F10, -16(R1) apart F14, -24(R1) F4,F0,F2 F8,F6,F2 F12,F10,F2 F16,F14,F2 F4, 0(R1) F8, -8(R1) R1, R1, # -32 F12, 16(R1) R1, R2, loop F16, 8(R1) ; 8-32 = -24
Requires 28 clock cycles for 4 iterations 7 cycles/iteration
Requires 14 clock cycles for 4 iterations 3.5 cycles/iteration
Summary of Example
Version Unscheduled Scheduled
Clock cycles per iteration

10 6
Code size
5 5
Unrolled
Unrolled and Scheduled
7
3.5
14
14
Compiler Tasks for Unrolled Scheduled Version

OK to move S.D after DADDUI and BNE if S.D offset is adjusted Determine that unrolling is useful because loop iterations are independent Use different registers to avoid name hazards
Eliminate extra test and branch instructions and adjust iteration code
OK to move L.D and S.D instructions in unrolled code (requires analyzing memory addresses) Keep all the real dependencies, but reorder to avoid stalls
Limits to Loop Unrolling

Eventually the gains of removing loop overhead diminishes
Remaining loop overhead amortization
Code size limitations

Embedded applications Increase in cache misses
Compiler limitations
Shortfall in registers Increase in number of live values past # registers
Pipeline Schedule for 5 Iteration Unrolled Version

Integer Instruction
loop: L.D L.D L.D L.D L.D S.D F0, 0(R1) F6, -8(R1) F10, -16(R1) F14, -24(R1) F18, -32(R1) F4, 0(R1) ADD.D ADD.D ADD.D ADD.D F4,F0,F2 F8,F6,F2 F12,F10,F2 F16,F14,F2
FP Instruction
Clock Cycle 1 2 3 4 5 6
S.D
S.D DADDI
F8, -8(R1)
F12, -16(R1) R1, R1, # -32
ADD.D
F20,F18,F2
7
8 9
S.D
BNE S.D
F16, 16(R1)
R1, R2, loop F20, 8(R1)
10
11 12
12 cycles for 5 iterations 2.4 cycles/iteration
Summary of Example
Version Unscheduled Scheduled Unrolled (4) Unrolled (4) and Scheduled Unrolled (5) and Schedule in MultiIssue Pipe
Clock cycles per iteration 10

6 7 3.5 2.4
Code size
5 5 14 14 17
Overview

Advanced compiler support

Detecting Parallelism Loop-level parallelism
Analyzed at the source level requires recognition of array references, loops, indices. Loop-carried dependence a dependence of one loop iteration on a previous iteration.
for (k=1; k<=100; k=k+1) { A[k+1] = A[k] + B[k];
Examples of Loop Parallelism

A loop is parallel if it can be written without a cycle in the dependencies.
for (k=1; k<=100; k=k+1) {
a loop carried dependency in a single statement is a cycle
for (k=1; k<=100; k=k+1) {
A[k+1] = A[k] + C[k];
dependency within iteration does not make a cycle
A[k] = A[k] + B[k];
B[k+1] = B[k] + A[k+1]; }
B[k+1] = C[k] + D[1]; } This loop can be modified to make it parallel
loop carried dependency, but no cycle (B[k+1] does not have B[k] as source)
Transformation
Two statements can be interchanged. First iteration of first statement can be computed outside loop so that A[k+1] is computed within loop. Last iteration of second statement must also be computed outside loop. A[1] = A[1] + B[1]; for (k=1; k<=100; k=k+1) { A[k] = A[k] + B[k]; B[k+1] = C[k] + D[k];
Expose the parallelism
for (k=1; k<=99; k=k+1) { B[k+1] = C[k] +

dependency D[k]; within iteration
A[k+1] = A[k+1] + B[k+1];
} B[101] = C[100] + D[100];
Recurrences
for (i=2; i<=100; i=i-1) { Y[i] = Y[i-n] + Y[i]; }
Y[i] depends on itself, but uses the value of an earlier iteration. n = dependence distance. Most often, n=1. The larger that n is, the more parallelism is available.
Some architectures have special support for recurrences.
Finding Dependencies
Important for
Efficient scheduling Determining which loops to unroll Eliminating name dependencies
Makes finding dependencies difficult:

Arrays and pointers in C or C++ Pass by reference parameter passing in FORTRAN
Dependencies in Arrays
Array indices, i, are affine if:
ai+ b (for a one-dimensional array) Index of multiple-dimension arrays is affine if indices in each dimension are affine
Common example of non-affine index:

x[y[i]] (indirect array addressing)
For two affine indices: ai+b and ck+d there is a dependence if:
GCD(c,a) must divide (d-b) evenly
Example
For (k=1; k<=100; k=k+1) { X[2k+3] = X[2k] * 5.0; }
GCD test: a=2, b=3 c=2, d=0 Dependence if GCD(c,a) divides (d-b) evenly GCD(2,2) = 2 d-b = -3 k 1 2 3 4 5 6 7 . . . . 100 2k+3 5 7 9 11 13 15 17 . . . . 203 2k 2 4 6 8 10 12 14 . . . . 200
GCD Test Limitations

GCD test must take the limits of the indices into account.
If GCD shows NO dependency, then there is no dependency If GCD shows SOME dependency, it might not occur (because it might be outside the bounds of the indices).
In general, determining if dependency exists is NP-complete. There are exact tests for restricted situations
Dependency Classification
Different dependencies are handled differently
Anti-dependence and output dependence
rename
Real dependencies
try to reorder to separate by length of latency
Example
Find dependencies
True dependencies Output dependencies Anti-dependencies
Output dependence
for (i=1; i<=100; i=i+1) { Y[i] = X[i]/c;

True dependencies Not loop-carried Anti-dependence
X[i] = X[i] + c;
Z[i] = Y[i] + c;
Y[i] = c -Y[i]; }
Anti-dependence
Example
Eliminate output dependence (also eliminates second anti-dependence)
Rename Y T
Output dependence
for (i=1; i<=100; i=i+1) { T[i] = X[i]/c;

X[i] = X[i] + c;
Z[i] = T[i] + c;
Y[i] = c -T[i]; }
Anti-dependence
Example
Eliminate anti-dependence
Rename X S Final result is parallel loop that can be unrolled
for (i=1; i<=100; i=i+1) { T[i] = X[i]/c;

S[i] = X[i] + c;
Z[i] = T[i] + c;
Y[i] = c -T[i]; }
Software Pipelining
Interleaves instructions from different iterations of a loop without unrolling
each iteration is made from instructions from different iterations of the loop software counterpart to Tomasulos algorithm start-up and finish-up code required
Software Pipelining
Software Pipelining
Loop: LD ADD.D S.D DADDUI BNE F0, 0(R1) F4, F0, F2 F4, 0(R1) R1, R1, # -8 R1,R2, Loop 3
symbolic unrolling
new loop
Loop: S.D ADD.D L.D DADDUI BNE F0, 16(R1) ; store to M[i] F4, F0, F2 ; add to M[i-1] F4, 0(R1) ; load M[i-2] R1, R1, # -8 R1,R2, Loop
LD ADD.D S.D LD ADD.D S.D LD ADD.D S.D
F0, 0(R1) F4, F0, F2 F4, 0(R1) F0, 0(R1) F4, F0, F2 F4, 0(R1) F0, 0(R1) F4, F0, F2 F4, 0(R1)
3 2
Software Pipelining
new loop
Loop: S.D ADD.D L.D DADDUI BNE F0, 16(R1) ; store to M[i] F4, F0, F2 ; add to M[i-1] F4, 0(R1) ; load M[i-2] R1, R1, # -8 R1,R2, Loop
Result: 1 cycle per instruction 1 loop iteration per 5 cycles less code space than unrolling
separate to eliminate RAW stall and to fill delay slot
rescheduled loop
Loop: S.D DADDUI ADD.D BNE L.D F0, 16(R1) ; store to M[i] R1, R1, # -8 F4, F0, F2 ; add to M[i-1] R1, R2, Loop F4, 8(R1) ; load M[i-2]
Start-up and Wind-down Code
Iterations in software pipeline loop
Software Pipelining Benefits

Here the overhead includes instructions not optimally overlapped
Here the overhead includes branch/counter instructions that are not easy to overlap
Global Code Scheduling

Loop unrolling and software pipelining
Improve ILP when loop bodies are straightling code (no branches) Control flow (branches) within loops makes both more complex.
will require moving instructions across branches
Global Code Scheduling moving instructions across branches
Global Code Scheduling

Goal: compact code fragment with internal control structure into shortest possible sequence
must preserve data and control dependencies
data dependencies force partial order of instructions control dependencies dictate instructions across which code cannot be moved
Finding shortest possible sequence requires identification of critical path. Critical path is the longest sequence of dependent instructions
Global Code Motion

Moving code across branches affects frequency of execution of such code Need to determine relative frequency of different paths
Global Code Motion

Global code motion:
can we move B[i] or C[i] assignments? is it beneficial?
execution frequencies cost of moving (any empty slots?) effect on critical path which is better move, B[i] or C[i] cost of compensation code
Simplifications of Global Code Scheduling

Trace Scheduling
find the most frequent path (trace selection) unroll loop to create trace and then schedule it efficiently (trace compaction)
Trace Scheduling Example
Trace exits and re-entrances are very complex and require much bookkeeping
Trace Scheduling
Advantages
eliminates some hard decisions in global code scheduling good for code such as scientific programs with intensive loops and predictable behavior
Disadvantages
significant overhead in compensation code when trace must be exited
Superblocks
Similar to trace scheduling, but have only ONE entrance point When trace is exited, a duplicate loop is used for remainder of code (tail duplication)
Tail duplication
Chapter 4 Overview

Hardware Options
Instruction set change:
conditional instructions
example: conditional move CMOVZ R1, R2, R3
R1 R2 if R3=0
predicated instructions
example: predicated load LWC R1, 9(R2), R3
R1 M[R2+9] if R3 0
Conditional Moves
Can be used to eliminate some branches
if (A==0) {S=T;}
Let A,S,T be assigned to R1, R2, R3 Code without conditional move: BNEZ R1, L ADDU R2, R3, R0 L: Using conditional move: CMOVZ R2, R3, R1 Control dependence is converted to data dependence
Conditional Moves
EX:
Useful for conversions such as absolute value A = abs (B) if (B < 0), { A = -B;} else {A = B}; Can implement with two conditional moves
Useful for short sequences Not efficient for branches that guard large blocks of code. Simplest form of predicated instruction
Predication
Execution of an instruction is controlled by a predicate.
When predicate is false, instruction becomes a nop Full predication is when all instructions can be predicated. Full predication allows conversions of large blocks of code that are branch dependent
Predication and Multi-Issue

Predicated instructions can be used to improve scheduling Can be used to fill delay slots Some branches can be eliminated Eliminates some control dependencies Reduces overhead of global code scheduling
Limits of Predicated Instructions

Annulled instructions take processor resources Predicated instructions are not efficient for multiple branches. Implementing conditional/predicated instructions has some hardware cost
Examples
Support conditional moves:
MIPS Alpha PowerPC SPARC Intelx86
Support full predication:

IA-64
Compiler Speculation
To speculate ambitiously, must have
1. The ability to find instructions that can be speculatively moved and not affect program data flow. 2. The ability to ignore exceptions in speculated instructions, until it is certain they should occur. 3. The ability to speculatively interchange loads and stores which may have address conflicts. The last two require hardware support.
Preserving Exception Behavior

The results of a mis-predicted instruction will not be used in final computation, and should not cause an exception. Four approaches:
1. HW and OS cooperatively ignore exceptions for speculated instructions 2. Speculative instructions that never raise exceptions are used, checks are implemented to determine when exceptions should occur 3. Poison status bits are attached to result registers written by speculated instructions when they cause exceptions. Cause a fault when normal instruction uses results. 4. Hardware buffers speculative instruction results until instruction is no longer speculative
Hardware Support for Memory Reference Speculation

Moving loads above stores is important for reducing critical paths of code segments Compiler can not always be certain that reordering load/store is correct (memory address dependencies) Add special instruction to check for address conflicts.
Left at original location of load instruction Hardware saves address of speculative load If subsequent store (before special instruction) uses address, then speculation fails, else it succeeds
When failure occurs, must re-execute load at check point If other speculative instructions were execute, must re-execute those instructions too. Expensive to have speculation fail
Hardware vs Software Speculation

Disambiguation of memory references
Software hard to do at compile time if program uses pointers Hardware dynamic disambiguation is possible for supporting reordering of loads and stores in Tomasulos approach Support for speculative memory references can help compiler, but overhead of recovery is high.

When control flow is unpredictable, hardware speculation works better than compiler speculation Integer programs tend to have unpredictable control flow
Static (compiler) predictor misses 16% (SPECint) Hardware predictor misses under 10%
Statically scheduled processors normally include dynamic branch predictors

Hardware-based speculation maintains completely precise exception model Software-based approaches have added special hardware support to allow this as well. Hardware-based speculation does not require compensation or bookkeeping code Ambitious software-based approaches require this hardware support Compiler-based approaches can see a longer code sequence Better code scheduling can result Hardware-based speculation requires complex additional hardware resources Compiler-based speculation requires complex software
Intel IA-64 Architecture Itanium Implementation

IA-64
Instruction set architecture Instruction format Examples of explicit parallelism support Predication and speculation support
Itanium Implementation
Functional units and instruction issue Performance
The IA-64 Instruction Set Architecture

Register model
64 Predicate Registers (1-bit ) 128 General Purpose Registers (64-bit) Actually have 65 bits 128 Floating Point Registers (82-bit)
8 Branch Registers (64-bit)
Hold branch address for indirect branches
Other Registers for systems control, memory mapping, performance counters, communication with OS
Integer Registers
R0-R31 always accessible R32-R127 implemented as a register stack
each procedure is allocated to a set of registers
128 General Purpose Registers (64-bit)
R0-R31 CFM R32-R127
CFM Current Frame pointer used to point to a set of registers for a procedure
Instruction Format
VLIW approach
implicit parallelism among operations in an instruction fixed formatting of the operation fields
More flexible than most VLIW architectures

depends on compiler to detect ILP compiler schedules into parallel slots
A sequence of consecutive instructions with no register dependencies There may be some memory dependencies Boundaries between groups are indicated with a stop.
Instruction Groups
Instruction Bundles
128 bits of encoded instructions Each bundle: 5-bit template field specifies what types of execution units each instruction requires 3 instructions, each 41 bits
Execution Slots
Execution unit slot Instruction Instruction Type Description I-unit A Integer ALU Non-integer ALU Integer ALU Memory Access Floating Point Example instructions
add,sub,and,or... integer and multimedia shifts, bit tests, ... add,sub,and,or... load/stores, int/FP
Floating point instructions Conditional branches Extended immediates, stops, nops
I
M-unit A M F
F-unit
B-unit
L+X
B
L+X
Branches
Extended
IA-64 Instruction Format

41-bits
4-bits (major opcode)
31-bits
6-bits (predicate register)
Together with the 5-bit bundle template, determine the major operation
64 Predicate Registers (1-bit )
Predication
Almost all instructions can be predicated
Specify a predicate register in last 6 bits of instruction Predicate registers set with compare or test instructions
Compare 10 possible comparison tests Two predicate registers as destinations Written with result and compliment OR, with logical function that combines tests and compliment Multiple tests can be done with one instruction
Speculation Support
Control speculation support:
Deferred exceptions for speculated instructions
Equivalent of poison bits
Memory reference speculation

Support for speculation of load instructions
Deferred Exceptions
Support to indicate an exception on a speculative instruction
GPRs have NaT (Not a Thing) bits (make registers 65-bits long) FPRs use NaTVal (Not a Thing Value)
Significand of 0 and exponent out of range
NaTs and NaTVals are propagated by speculative instructions that dont reference memory FP instructions use status registers to record exceptions for this purpose.
Resolution of Deferred Exceptions

If non-speculative instruction receives a NaT or NaTVal as source operand, generate terminating exception
If a chk.s instruction detects the presence of NaT or NaTVal, branch to routine designed to recover from speculative operation.
Memory Reference Support

Advanced loads
Speculative load that is moved above a store Creates an entry in ALAT (Advanced Load Address) Table
Register destination of the load Address of accessed memory location
When store is executed, address is compared to active ALAT table entries

If a match occurs, ALAT table entry marked invalid
When instruction USING speculative load value is executed, ALAT table is checked
Ld.c check that is used if only load is speculative
Only causes a reload of the value
Chk.a check that is used if other speculative code used the load value.
Specifies address of a fix-up routine that re-executes code sequence.
Itanium Processor
First implementation of IA-64 (2001) 800 MHz clock Multi-issue up to 6-issues per clock cycle Up to 3 branches and 2 memory references 3-level cache memory hierarchy
L1 is split data/instruction caches L2 is unified cache on-chip L3 is unified cache off-chip (but on container)
Features of itanium

64-bit addressing EPIC (Explicit Parallel Instruction Computing) Wide Parallel Execution core Prediction FPU, ALU and Rotating registers Large fast Cache High Clock Speed Scalability Error Handling Fast Bus Architecture
Functional Units
I-unit I-unit M-unit M-unit Instruction Integer load B-unit F-unit B-unit F-unit B-unit Floating-point load Correctly predicted taken branch Mispredicted branch Latency 1 9 0-3 9 All functional units are pipelined Bypassing paths are implemented (forwarding) Bypass between units has a 1 cycle delay
Integer ALU operation FP arithmetic
0 4
Itanium Multi-Issue
Instruction Issue window of 2 bundles at a time Up to 6 instructions issues at once
template
inst 1
inst 2
inst 3
template
inst 1
inst 2
inst 3
NOPs and predicated instructions with false predicates are not issued If one or more instructions cannot be issued due to unavailable function unit, bundle can be split
Itanium Pipeline
10 Stages
Front-End IPG Fetch Rotate Instruction Delivery EXP REN Operand Delivery WLD REG EXE Execution DET WRB
Prefetches up to 32 bytes per clock Can hold up to 8 bundles Branch prediction: multilevel adaptive predictor
Distributes up to 6 instructions to 9 functional units. Implements register renaming
Access register file register bypassing register scoreboard checks predicate dependencies
Executes instructions through ALU and load-store units Detects exceptions and posts NaTs Write back
IPG
: instruction pointer generation
FET
ROT EXP REN
: Fetch instruction
: Rotate : Expand : Rename
WL.D
REG EXE DET WRB
: Word line Decode

: Register Read : Execute : Exception Detection : Write Back Result
Limits on ILP
Achieving Parallelism
Techniques Scoreboarding / Tomasulos Algorithm Pipelining Speculation Branch Prediction But how much more performance could we theoretically get? How much ILP exists? How much more performance could we realistically get?
Limits to ILP
Initial HW Model here; MIPS compilers. Assumptions for ideal/perfect machine to start: 1. Register renaming infinite virtual registers => all register WAW & WAR hazards are avoided 2. Branch prediction perfect; no mispredictions 3. Jump prediction all jumps perfectly predicted (returns, case statements) 2 & 3 no control dependencies; perfect speculation & an unbounded buffer of instructions available 4. Memory-address alias analysis addresses known & a load can be moved before a store provided addresses not equal; 1&4 eliminates all but RAW Also: perfect caches; 1 cycle latency for all instructions (FP 83 *,/); unlimited instructions issued/clock cycle;
Limits to ILP HW Model comparison

Model
Instructions Issued per clock Instruction Window Size Renaming Registers Branch Prediction Cache Memory Alias Analysis Infinite Infinite Infinite Perfect Perfect Perfect
Power 5
4 200 48 integer + 40 Fl. Pt. 2% to 6% misprediction (Tournament Branch Predictor) 64KI, 32KD, 1.92MB L2, 36 MB L3 ??
WAR and WAW hazards through memory: Eliminated WAW and WAR hazards through register renaming, but not in memory usage
84

Unit II

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Unit II

Uploaded by

Copyright:

Available Formats

The VLIW Architecture [4]

Code density is another problem

Exploiting Instruction-Level Parallelism with Software Approaches

Hardware support for exposing more parallelism

Hardware vs Software speculation mechanisms Intel IA-64 ISA

Review of Multi-issue Taxonomy

no hazards between issue packets

mostly static explicit dependencies marked by compiler

Itanium (IA-64 is one implementation)

Basic Pipeline Scheduling

Compiler success depends on:

Assumptions for Examples

Latencies between dependent FP instructions:

Program in MIPS Pipeline

Scheduled Program in MIPS Pipeline

F4, 0(R1) R1,R1,#-8

L.D DADDUI ADD.D Stall BNE S.D

F0, 0(R1) R1,R1,#-8 F4,F0,F2 R1,R2, loop F4, 8(R1)

S.D F4,0(R1) DADDUI R1,R1,# -8

Other instructions are loop overhead:

L.D ADD.D S.D

F10, -16(R1) F12,F10,F2 F12, -16(R1)

Eliminate 3 DADDI and 3 BNE instructions

Scheduled Unrolled Version

Requires 28 clock cycles for 4 iterations 7 cycles/iteration

Requires 14 clock cycles for 4 iterations 3.5 cycles/iteration

Clock cycles per iteration

Compiler Tasks for Unrolled Scheduled Version

Limits to Loop Unrolling

Code size limitations

Pipeline Schedule for 5 Iteration Unrolled Version

12 cycles for 5 iterations 2.4 cycles/iteration

Clock cycles per iteration 10

Hardware support for exposing more parallelism

Hardware vs Software speculation mechanisms Intel IA-64 ISA

Advanced compiler support

Examples of Loop Parallelism

for (k=1; k<=100; k=k+1) {

A[k+1] = A[k] + C[k];

dependency within iteration does not make a cycle

A[k] = A[k] + B[k];

B[k+1] = B[k] + A[k+1]; }

B[k+1] = C[k] + D[1]; } This loop can be modified to make it parallel

for (k=1; k<=99; k=k+1) { B[k+1] = C[k] +

A[k+1] = A[k+1] + B[k+1];

} B[101] = C[100] + D[100];

Some architectures have special support for recurrences.

Makes finding dependencies difficult:

Common example of non-affine index:

GCD Test Limitations

for (i=1; i<=100; i=i+1) { Y[i] = X[i]/c;

for (i=1; i<=100; i=i+1) { T[i] = X[i]/c;

for (i=1; i<=100; i=i+1) { T[i] = X[i]/c;

LD ADD.D S.D LD ADD.D S.D LD ADD.D S.D

separate to eliminate RAW stall and to fill delay slot

Start-up and Wind-down Code

Iterations in software pipeline loop

Software Pipelining Benefits

Global Code Scheduling

Global Code Scheduling moving instructions across branches

Global Code Scheduling

Global Code Motion

Global Code Motion