Pipelining
Chapter Overview
4.1 Basic Compiler Techniques for Exposing ILP
4.2 Static Branch Prediction
4.3 Static Multiple Issue: The VLIW Approach
4.4 Advanced Compiler Support for ILP
4.7 Intel IA-64 Architecture and Itanium Processor
Figure 4.1 Latencies of FP operations used in this chapter. Assume an integer load latency of 1. Branches have a delay of one clock cycle.
Basic pipeline scheduling (6 cycles per loop iteration; 3 of these are loop overhead: the DADDUI, the BNE, and a stall)
Loop Unrolling
Four copies of the loop body
- eliminates 3 branches and 3 decrements of R1 => improves performance
- requires symbolic substitution and simplification
- increases code size substantially => exposes more computation that can be scheduled => the gain from scheduling the unrolled loop is larger than for the original
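The transformation can be sketched in Python (a hedged illustration, not the MIPS code from the slides; the function names are ours): four copies of the body amortize the induction-variable update and loop test over four elements, with a cleanup loop for the remainder.

```python
def add_scalar(x, s):
    """Rolled loop: one overhead test + increment per element."""
    i = 0
    while i < len(x):
        x[i] += s
        i += 1
    return x

def add_scalar_unrolled(x, s):
    """Unrolled by 4: overhead amortized over four body copies."""
    i = 0
    n4 = len(x) - (len(x) % 4)
    while i < n4:                # one test + one increment per 4 elements
        x[i]     += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += 4
    while i < len(x):            # cleanup loop for leftover iterations
        x[i] += s
        i += 1
    return x
```

The cleanup loop corresponds to the strip-mined leftover iterations when the trip count is not a multiple of the unroll factor.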
Unrolling without scheduling (28 cycles / 14 instr.) vs. unrolling with scheduling (14 cycles)
Each L.D has 1 stall, each ADD.D has 2, the DADDUI 1, the branch 1, plus 14 instruction-issue cycles
Three limits to the gains achievable by loop unrolling:
1. Loop overhead: decrease in the amount of overhead amortized with each unroll
2. Code size limitations: larger loops => larger code-size growth => higher instruction-cache miss rate
3. Compiler limitations: potential shortfall in registers
Summary
Method                                  # of Instr.   Cycles/loop
Original                                5             10
Schedule                                5             6
Unroll 4                                14            28/4 = 7
Unroll 4 + Schedule                     14            14/4 = 3.5
Unroll 5 + Schedule + Multiple Issue    17            12/5 = 2.4
4.2 Static Branch Prediction
- used where branch behavior is highly predictable at compile time
- architectural feature to support static branch prediction: the delayed branch
The instruction in the branch delay slot is executed whether or not the branch is taken (for zero-cycle penalty)
Static Branch Prediction Schemes
- Simplest scheme: predict every branch as taken
- Direction-based scheme: predict backward-going branches as taken and forward-going branches as not taken
Not good for SPEC programs: 34% average misprediction rate (ranging from 9% to 59%)
Profile-based scheme: predict on the basis of profile information collected from earlier runs
- an individual branch is often highly biased toward taken or untaken
- changing the input so that the profile is from a different run leads to only a small change in accuracy
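A minimal simulation of the three schemes, on a made-up branch trace (the trace, PCs, and accuracy numbers here are illustrative, not SPEC data):

```python
from collections import Counter

# Each trace entry: (branch_pc, target_pc, taken_outcome)

def always_taken(trace):
    """Simplest scheme: predict every branch taken."""
    return sum(1 for (_, _, taken) in trace if taken) / len(trace)

def direction_based(trace):
    """Backward branch (target <= pc): predict taken; forward: not taken."""
    hits = 0
    for pc, target, taken in trace:
        predict = target <= pc
        hits += (predict == taken)
    return hits / len(trace)

def profile_based(profile_trace, trace):
    """Predict each branch's majority direction from an earlier profiling run."""
    counts = Counter()
    for pc, _, taken in profile_trace:
        counts[pc] += 1 if taken else -1
    hits = 0
    for pc, _, taken in trace:
        predict = counts.get(pc, 0) >= 0   # unseen branches default to taken
        hits += (predict == taken)
    return hits / len(trace)
```

On a trace with a mostly-taken loop-back branch and a mostly-not-taken forward branch, the direction-based and profile-based schemes beat always-taken, matching the slide's point that individual branches are highly biased.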
Flavor I: Superscalar processors
- issue a varying number of instructions per clock (1-8)
- either statically scheduled (by the compiler) or dynamically scheduled (by the hardware, e.g. Tomasulo)
Flavor II: Very Long Instruction Word (VLIW)
- issue a fixed number of instructions (4-16) in parallel
- the compiler chooses the instructions to be issued simultaneously, formatted either as one very large instruction or as a fixed issue packet of smaller instructions
- rigid in early VLIWs: dependences checked and instructions scheduled by the compiler
- simplifies hardware (maybe no, or a simpler, dependency check)
- requires a complex compiler to uncover enough parallelism
Loop unrolling
- local scheduling: operates on a single basic block
- global scheduling: may move code across branches
Style: Explicitly Parallel Instruction Computer (EPIC)
- Intel Architecture-64 (IA-64), 64-bit address
- joint HP/Intel agreement in 1999/2000
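A toy sketch of the static packing a VLIW compiler performs (greedy, with registers as strings; purely illustrative, not IA-64's actual packing rules): independent operations are placed into one fixed-width issue packet; an operation that reads or rewrites a register written earlier in the same packet starts a new packet.

```python
def pack(ops, width):
    """ops: list of (dest_reg, src_regs) in program order.
    Returns issue packets of at most `width` independent ops."""
    packets, current, written = [], [], set()
    for dest, srcs in ops:
        # an op may join the current packet only if it neither reads (RAW)
        # nor rewrites (WAW) a register written earlier in the same packet
        if len(current) < width and not (set(srcs) | {dest}) & written:
            current.append(dest)
        else:
            packets.append(current)
            current, written = [dest], set()
        written.add(dest)
    if current:
        packets.append(current)
    return packets
```

WAR hazards within a packet need no check here, because all reads in a packet conceptually happen before its writes.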
4.4 Advanced Compiler Support For ILP How can compilers be smart?
1. Produce good scheduling of code 2. Determine which loops might contain parallelism 3. Eliminate name dependencies
Compilers must be REALLY smart to figure out aliases -- pointers in C are a real problem Two important ideas:
Software Pipelining - Symbolic Loop Unrolling Trace Scheduling - Critical Path Scheduling
Loop-Level Dependence and Parallelism
Loop-carried dependence: data accesses in later iterations depend on data values produced in earlier iterations
Structures analyzed at source level by compilers: loops, array references, induction-variable computations
Loop-level parallelism
Dependence between two uses of x[i]: dependent within a single iteration, not loop-carried
A loop is parallel if it can be written without a cycle in the dependences, since the absence of a cycle means that the dependences give a partial ordering on the statements
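The cycle test can be made concrete with a small helper (our own sketch, not from the text): a loop is declared parallel exactly when its loop-carried dependence graph admits a topological order, i.e. has no cycle.

```python
from collections import defaultdict, deque

def is_parallel(statements, deps):
    """deps: set of (src_stmt, dst_stmt) loop-carried dependence edges.
    Kahn's algorithm: the loop is parallel iff every statement can be
    topologically ordered, i.e. the dependence graph is acyclic."""
    indeg = {s: 0 for s in statements}
    out = defaultdict(list)
    for a, b in deps:
        out[a].append(b)
        indeg[b] += 1
    q = deque(s for s in statements if indeg[s] == 0)
    seen = 0
    while q:
        s = q.popleft()
        seen += 1
        for t in out[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                q.append(t)
    return seen == len(statements)   # a cycle leaves statements unordered
```

An acyclic dependence set like {S1 -> S2} still permits parallel execution after reordering, while a cycle such as {S1 -> S2, S2 -> S1} forces serial iterations.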
P.320 Example
Assume A, B, and C are distinct, non-overlapping arrays
Two different dependences:
1. Dependence of S1 on an earlier iteration of S1 (and of S2 on S2)
   - S1: A[i+1] depends on A[i]; S2: B[i+1] depends on B[i]
   - forces successive iterations to execute in series
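A Python rendering of a loop with this dependence shape (the statement forms are inferred from the dependences listed above; the concrete values are ours): each iteration of S1 needs the A value produced by the previous iteration, so the iterations must run in series.

```python
def run(A, B, C, n):
    """Loop with loop-carried dependences on A (S1) and B (S2)."""
    for i in range(n):
        A[i + 1] = A[i] + C[i]       # S1: A[i+1] depends on A[i]
        B[i + 1] = B[i] + A[i + 1]   # S2: B[i+1] depends on B[i]
    return A, B
```

Because A[i+1] feeds the next iteration's A[i], no two iterations can execute concurrently without breaking the chain.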
P.321 Example
Interchanging the two statements
=> the dependence is no longer loop-carried
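A sketch of the transformation, assuming the statement shapes of the textbook example (S1 reads B[i], S2 writes B[i+1]; array contents are ours): since the only loop-carried dependence (S2 -> S1 via B) is not part of a cycle, interchanging the statements and peeling the first and last copies leaves a loop whose iterations are independent.

```python
def original(A, B, C, D, n):
    for i in range(n):
        A[i] = A[i] + B[i]           # S1: uses B[i] from previous iteration
        B[i + 1] = C[i] + D[i]       # S2: loop-carried dependence into S1
    return A, B

def transformed(A, B, C, D, n):
    A[0] = A[0] + B[0]               # peeled first S1
    for i in range(n - 1):           # body now carries no dependence
        B[i + 1] = C[i] + D[i]       # S2 first...
        A[i + 1] = A[i + 1] + B[i + 1]  # ...then S1 of the next iteration
    B[n] = C[n - 1] + D[n - 1]       # peeled last S2
    return A, B
```

Both versions compute identical results, but in the transformed loop each iteration touches only B[i+1] and A[i+1], so iterations can run in parallel.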
Array Renaming
2. Antidependence from S1 to S2, based on X[i]
3. Antidependence from S3 to S4, based on Y[i]
4. Output dependence from S1 to S4, based on Y[i]
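A hedged sketch of the renaming (statement shapes are assumed from the dependences listed; T and X1 are fresh names we introduce): giving each value its own name removes the anti and output dependences, leaving only the true dependence of S3 and S4 on S1.

```python
def renamed(X, Z, c, n):
    """Array-renamed loop body for statements of the assumed shapes:
    S1: Y[i] = X[i]/c   S2: X[i] = X[i]+c
    S3: Z[i] = Y[i]+c   S4: Y[i] = c - Y[i]"""
    T = [0.0] * n        # fresh name for S1's result: kills anti (S3->S4)
    X1 = [0.0] * n       # fresh name for S2's writes: kills anti (S1->S2)
    Y = [0.0] * n        # final Y written only by S4: kills output dep
    for i in range(n):
        T[i] = X[i] / c      # S1
        X1[i] = X[i] + c     # S2: no longer overwrites S1's input
        Z[i] = T[i] + c      # S3: true dependence on S1 remains
        Y[i] = c - T[i]      # S4: true dependence on S1 remains
    return X1, Y, Z
```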
Recurrence
dependence distance of 5
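For instance, a recurrence with dependence distance 5 (loop shape assumed, e.g. Y[i] = Y[i-5] + Y[i]) splits into five independent chains, so any five consecutive iterations can run in parallel:

```python
def recurrence(Y, n):
    """Recurrence with dependence distance 5: iteration i needs i-5."""
    for i in range(5, n):
        Y[i] = Y[i - 5] + Y[i]
    return Y

def recurrence_by_chains(Y, n):
    """Same result computed as 5 independent chains (i % 5 picks a chain);
    the chains could execute concurrently."""
    for start in range(5, 10):
        for i in range(start, n, 5):
            Y[i] = Y[i - 5] + Y[i]
    return Y
```

The larger the dependence distance, the more parallelism a compiler can extract from the recurrence.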
Software Pipelining
Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations Software pipelining: interleave instructions from different loop iterations
reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo's algorithm in software)
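The idea in miniature (a Python stand-in for the load/add/store pattern, not actual compiler output): the steady-state body mixes the store of iteration i, the add of iteration i+1, and the load of iteration i+2, with a prologue to fill the pipeline and an epilogue to drain it.

```python
def sw_pipelined(x, s):
    """Compute [v + s for v in x] with a software-pipelined schedule."""
    n = len(x)
    if n < 2:
        return [v + s for v in x]
    out = [None] * n
    # prologue: start the first two iterations
    loaded = x[0]              # load for iteration 0
    added = loaded + s         # add  for iteration 0
    loaded = x[1]              # load for iteration 1
    # steady state: store(i), add(i+1), load(i+2) in each "cycle"
    for i in range(n - 2):
        out[i] = added             # store for iteration i
        added = loaded + s         # add   for iteration i+1
        loaded = x[i + 2]          # load  for iteration i+2
    # epilogue: drain the last two iterations
    out[n - 2] = added
    out[n - 1] = loaded + s
    return out
```

In the steady-state loop no instruction depends on another issued in the same "cycle", which is exactly what lets the hardware overlap them.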
Before: original loop (5/5)
LOOP: 1 L.D    F0,0(R1)
      2 ADD.D  F4,F0,F2
      3 S.D    F4,0(R1)
      4 DADDUI R1,R1,#-8
      5 BNE    R1,R2,LOOP
Before: unrolled 3 times (11/11)
1  L.D    F0,0(R1)
2  ADD.D  F4,F0,F2
3  S.D    F4,0(R1)
4  L.D    F0,-8(R1)
5  ADD.D  F4,F0,F2
6  S.D    F4,-8(R1)
7  L.D    F0,-16(R1)
8  ADD.D  F4,F0,F2
9  S.D    F4,-16(R1)
10 DADDUI R1,R1,#-24
11 BNE    R1,R2,LOOP
After: software pipelined (5/11)
      2  ADD.D  F4,F0,F2    ; prologue (no RAW!)
      4  L.D    F0,-8(R1)
LOOP: 3  S.D    F4,0(R1)    ; stores M[i] (only WARs)
      5  ADD.D  F4,F0,F2    ; adds to M[i-1]
      7  L.D    F0,-16(R1)  ; loads M[i-2]
      10 DADDUI R1,R1,#-8
      11 BNE    R1,R2,LOOP
      6  S.D    F4,-8(R1)   ; epilogue
      8  ADD.D  F4,F0,F2
      9  S.D    F4,-16(R1)
Software pipelining reduces the time when the loop is not running at peak speed to once per loop; the best performance can come from combining it with unrolling.
Major advantage: consumes less code space. In practice, compilation using software pipelining is quite difficult.
Assume it is the inner loop and pick the likely path; unwind it four times. Requires the compiler to generate and track the compensation code.
Superblocks
superblock: single entry point, but multiple exits allowed
Compacting a superblock is much easier than compacting a trace
Tail duplication
Create a separate block that corresponds to the portion of the trace after the entry
Figure: trace (trace scheduling and unwinding the path) vs. superblock (tail duplication)
4.7 Intel IA-64 and Itanium Processor
Designed to benefit the VLIW approach
IA-64 Register Model
- 128 64-bit GPRs (65 bits actually)
- 128 82-bit floating-point registers
  - two extra exponent bits over the standard 80-bit IEEE format
- 64 1-bit predicate registers
- 8 64-bit branch registers, used for indirect branches
- a variety of registers used for system control, etc.
Other support:
- register stack frame: like register windows in SPARC
  - current frame marker (CFM), register stack engine
Bundle: fixed formatting of multiple instructions (3). IA-64 instructions are encoded in bundles:
- 128 bits wide: a 5-bit template field and three 41-bit instructions
- the template field describes the presence of stops and specifies the type of execution unit for each instruction
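A sketch of the bit layout (the field order within the 128-bit bundle is an assumption here; only the sizes, 5 + 3 x 41 = 128, come from the text):

```python
def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instruction slots into a
    single 128-bit integer (slot 0 in the lowest bits after the template;
    this ordering is illustrative, not the architected encoding)."""
    assert 0 <= template < 2**5 and len(slots) == 3
    bundle, shift = template, 5
    for instr in slots:
        assert 0 <= instr < 2**41
        bundle |= instr << shift
        shift += 41
    return bundle                    # 5 + 3*41 = 128 bits total

def unpack_bundle(bundle):
    """Inverse of pack_bundle."""
    template = bundle & 0x1F
    slots = [(bundle >> (5 + 41 * k)) & ((1 << 41) - 1) for k in range(3)]
    return template, slots
```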
Itanium Support for Explicit Parallelism
Scheme 1: LOAD moved above the branch by the compiler
  ld.s r1=...
  instr 1
  instr 2
  br
  ...
  chk.s r1       ; check before use
  use =r1
Scheme 2: uses moved above the branch by the compiler as well
  ld.s r1=...
  instr 1
  use =r1
  instr 2
  br
  ...
  chk.s use      ; branches to recovery code if the load faulted
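The speculation protocol can be modeled in a few lines (a toy stand-in: a Python sentinel plays the role of IA-64's deferred-exception NaT token; the ld_s/chk_s names mirror the slide, not the full architected semantics):

```python
NAT = object()                 # stand-in for the NaT (deferred fault) bit

def ld_s(mem, addr):
    """Speculative load: a fault is recorded as NAT instead of raised,
    so the load can be hoisted above the guarding branch."""
    try:
        return mem[addr]
    except KeyError:           # stands in for a memory fault
        return NAT

def chk_s(value, recovery):
    """Check before use: run the recovery code only if the speculative
    load actually faulted; otherwise the speculated value is used as-is."""
    return recovery() if value is NAT else value

mem = {0x100: 42}
r1 = ld_s(mem, 0x100)          # hoisted above the branch by the compiler
r1 = chk_s(r1, lambda: 0)      # on the used path: verify, else recover
```

The key property is that a mis-speculated load costs nothing unless its result is actually needed, at which point chk_s diverts into the recovery code.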