Chapter Overview
4.1 Instruction Level Parallelism: Concepts and Challenges
Chap. 4 - Pipelining II
Technique                                  | Reduces                      | Section
Loop Unrolling                             | Control stalls               | 4.1
Basic Pipeline Scheduling                  | RAW stalls                   | 4.1
Dynamic Scheduling with Scoreboarding      | RAW stalls                   | 4.2
Dynamic Scheduling with Register Renaming  | WAR and WAW stalls           | 4.2
Dynamic Branch Prediction                  | Control stalls               | 4.3
Issue Multiple Instructions per Cycle      | Ideal CPI                    | 4.4
Compiler Dependence Analysis               | Ideal CPI & data stalls      | 4.5
Software Pipelining and Trace Scheduling   | Ideal CPI & data stalls      | 4.5
Speculation                                | All data & control stalls    | 4.6
Dynamic Memory Disambiguation              | RAW stalls involving memory  | 4.2, 4.6
ILP is the observation that many instructions in code don't depend on each other, which makes it possible to execute those instructions in parallel. This is easier said than done. Issues include building compilers to analyze the code, and building hardware to be even smarter than that code.
Terminology
Basic Block - the set of instructions between an entry point and a branch. A basic block has only one entry and one exit. Typically it is about 6 instructions long.
Loop Level Parallelism - that parallelism that exists within a loop. Such parallelism can cross loop iterations.
Loop Unrolling - replicating the loop body multiple times so that either the compiler or the hardware can exploit the parallelism inherent in the loop.
Loop: LD   F0,0(R1)   ;F0=vector element
      ADDD F4,F0,F2   ;add scalar from F2
      SD   0(R1),F4   ;store result
      SUBI R1,R1,#8   ;decrement pointer 8 bytes (DW)
      BNEZ R1,Loop    ;branch R1!=zero
      NOP             ;delayed branch slot
FP Loop Hazards
Loop: LD   F0,0(R1)   ;F0=vector element
      ADDD F4,F0,F2   ;add scalar in F2
      SD   0(R1),F4   ;store result
      SUBI R1,R1,#8   ;decrement pointer 8B (DW)
      BNEZ R1,Loop    ;branch R1!=zero
      NOP             ;delayed branch slot
Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Integer op                   | Integer op               | 0
      SD   0(R1),F4   ;store result
      SUBI R1,R1,#8   ;decrement pointer 8 bytes (DW)
      BNEZ R1,Loop    ;branch R1!=zero
      NOP             ;delayed branch slot
Scheduling brings the loop to 6 clocks per iteration. Now unroll the loop 4 times to make it faster.
Pipeline Scheduling and Loop Unrolling
Unroll Loop Four Times (straightforward way)
 1 Loop: LD   F0,0(R1)
 2       stall
 3       ADDD F4,F0,F2
 4       stall
 5       stall
 6       SD   0(R1),F4
 7       LD   F6,-8(R1)
 8       stall
 9       ADDD F8,F6,F2
10       stall
11       stall
12       SD   -8(R1),F8
13       LD   F10,-16(R1)
14       stall
15       ADDD F12,F10,F2
16       stall
17       stall
18       SD   -16(R1),F12
19       LD   F14,-24(R1)
20       stall
21       ADDD F16,F14,F2
22       stall
23       stall
24       SD   -24(R1),F16
25       SUBI R1,R1,#32
26       BNEZ R1,Loop
27       stall
28       NOP

28 clock cycles, or 7 per iteration, for the unrolled but unscheduled loop. Rescheduling the unrolled loop removes every stall:

Loop: LD   F0,0(R1)
      LD   F6,-8(R1)
      LD   F10,-16(R1)
      LD   F14,-24(R1)
      ADDD F4,F0,F2
      ADDD F8,F6,F2
      ADDD F12,F10,F2
      ADDD F16,F14,F2
      SD   0(R1),F4
      SD   -8(R1),F8
      SUBI R1,R1,#32
      SD   16(R1),F12   ; 16-32 = -16
      BNEZ R1,Loop
      SD   8(R1),F16    ; 8-32 = -24

No Stalls!!
14 clock cycles, or 3.5 per iteration
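The same transformation, sketched in Python rather than DLX (function names and the multiple-of-4 assumption are mine): the loop body x[i] += s is replicated four times per trip, which removes three quarters of the loop-overhead branches and gives a scheduler four independent updates to interleave.

```python
def add_scalar(x, s):
    # Rolled loop: one element per trip.
    for i in range(len(x)):
        x[i] += s

def add_scalar_unrolled(x, s):
    # Unrolled four times; assumes len(x) is a multiple of 4,
    # just as the DLX example assumes about its trip count.
    assert len(x) % 4 == 0
    i = 0
    while i < len(x):
        x[i]     += s   # four independent updates per trip:
        x[i + 1] += s   # no update depends on another, so a
        x[i + 2] += s   # scheduler may interleave the loads,
        x[i + 3] += s   # adds, and stores freely
        i += 4
```

Both functions leave x in the same state; unrolling changes only how much work each loop branch amortizes.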
Dependencies
Data Dependencies
An instruction j is data dependent on instruction i when i produces a result that j uses (or when j is data dependent on an instruction that is data dependent on i). If the two instructions are close enough in the pipeline, this becomes a RAW hazard.
Name Dependencies
Two instructions are name dependent when they use the same register or memory location (the same name) but no data actually flows between them: either an antidependence (WAR) or an output dependence (WAW). Renaming the register or location removes the dependence.
Control Dependencies
The final kind of dependence is the control dependence.

Example:
    if p1 {S1;}
    if p2 {S2;}

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
2. NEW:
A[1] = A[1] + B[1];
for (i=1; i<=99; i=i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
Dynamic Scheduling
4.1 Instruction Level Parallelism: Concepts and Challenges 4.2 Overcoming Data Hazards with Dynamic Scheduling 4.3 Reducing Branch Penalties with Dynamic Hardware Prediction 4.4 Taking Advantage of More ILP with Multiple Issue 4.5 Compiler Support for Exploiting ILP 4.6 Hardware Support for Extracting more Parallelism 4.7 Studies of ILP
Dynamic Scheduling is when the hardware rearranges the order of instruction execution to reduce stalls. Advantages: dependencies unknown at compile time can be handled by the hardware, and code compiled for one type of pipeline can run efficiently on another. Disadvantage: the hardware is much more complex.
Dynamic Scheduling
The idea:
Scoreboarding allows an instruction to execute as soon as (1) its functional unit is free and (2) its operands are available, without waiting for prior stalled instructions. A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together. We will use in-order issue, out-of-order execution, out-of-order commit (also called completion). First used in the CDC 6600; our example is modified here for DLX. The CDC 6600 had 4 FP units, 5 memory reference units, and 7 integer units; DLX has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.
Dynamic Scheduling
Using A Scoreboard
Scoreboard Implications
Out-of-order completion raises WAR and WAW hazards. Solutions for WAR: queue both the operation and copies of its operands, or read registers only during the Read Operands stage. For WAW, the hazard must be detected: issue stalls until the other instruction completes. We need multiple instructions in the execution phase, hence multiple execution units or pipelined execution units. The scoreboard keeps track of dependencies and the state of operations, and it replaces ID, EX, WB with 4 stages: Issue, Read Operands, Execution, Write Result.
Dynamic Scheduling
Using A Scoreboard
1. Instruction status: indicates which of the 4 stages the instruction is in.
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
   Busy: whether the unit is busy or not
   Op: operation to perform in the unit (e.g., + or -)
   Fi: destination register
   Fj, Fk: source-register numbers
   Qj, Qk: functional units producing source registers Fj, Fk
   Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
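The tables can be sketched as plain Python data structures. This is a simplified sketch of the functional-unit and register-result tables and the issue-stage checks only (class and method names are mine); the fields mirror the slide's Busy/Op/Fi/Fj/Fk/Qj/Qk/Rj/Rk.

```python
from dataclasses import dataclass

@dataclass
class FUStatus:
    busy: bool = False
    op: str = ""
    fi: str = ""       # destination register
    fj: str = ""       # source register j
    fk: str = ""       # source register k
    qj: str = ""       # FU that will produce Fj ("" = none pending)
    qk: str = ""       # FU that will produce Fk
    rj: bool = True    # Fj ready?
    rk: bool = True    # Fk ready?

class Scoreboard:
    def __init__(self, units):
        self.fu = {u: FUStatus() for u in units}
        self.register_result = {}   # reg -> FU that will write it

    def can_issue(self, unit, dest):
        # Issue only if the FU is free (structural hazard) and no other
        # instruction already plans to write dest (WAW hazard).
        return not self.fu[unit].busy and dest not in self.register_result

    def issue(self, unit, op, dest, src1, src2):
        assert self.can_issue(unit, dest)
        st = self.fu[unit]
        st.busy, st.op, st.fi, st.fj, st.fk = True, op, dest, src1, src2
        st.qj = self.register_result.get(src1, "")
        st.qk = self.register_result.get(src2, "")
        st.rj, st.rk = st.qj == "", st.qk == ""
        self.register_result[dest] = unit
```

With the DLX unit mix from the slides, issuing a load marks the Integer unit busy, and a second instruction targeting the same destination register is blocked until the first writes back.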
Dynamic Scheduling
Using A Scoreboard
Bookkeeping at each stage (← denotes an update to the scoreboard):

Issue: Busy(FU) ← Yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU

Read operands: Rj ← No; Rk ← No

Execution complete: (the functional unit signals that it has finished)

Write result: wait until ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No)); then ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example
This is the sample code we'll be working with in the example:

      LD    F6, 34(R2)
      LD    F2, 45(R3)
      MULTD F0, F2, F4
      SUBD  F8, F6, F2
      DIVD  F10, F0, F6
      ADDD  F6, F8, F2

Latencies (clock cycles): LD 1, MULTD 10, SUBD 2, DIVD 40, ADDD 2
Dynamic Scheduling
Using A Scoreboard
Scoreboard Example
[Scoreboard snapshot, clock 0: the instruction status table lists the six instructions above with no stages completed; all functional units (Integer, Mult1, Mult2, Add, Divide) are idle; the register result status for F0-F30 is empty.]
Dynamic Scheduling
Using A Scoreboard
Issue LD #1
Shows in which cycle the operation occurred.
[Clock 1: the Integer unit is busy (Op=Load, Fi=F6, Fk=R2, Rk=Yes); register result status: F6 will be written by the Integer unit.]
Dynamic Scheduling
Using A Scoreboard
LD #2 can't issue, since the integer unit is busy. MULT can't issue, because we require in-order issue.
[Clock 2: scoreboard unchanged; the Integer unit is still busy with LD #1.]
Dynamic Scheduling
Using A Scoreboard
[Clock 3: still unchanged; the Integer unit remains busy with LD #1 (Fi=F6, Fk=R2).]
Dynamic Scheduling
Using A Scoreboard
[Clock 4: LD #1 writes F6; the Integer unit entry is about to clear.]
Dynamic Scheduling
Using A Scoreboard
[Clock 5: LD #2 issues; the Integer unit is busy (Op=Load, Fi=F2, Fk=R3, Rk=Yes); register result status: F2 ← Integer.]
Dynamic Scheduling
Using A Scoreboard
Issue MULT.
[Clock 6: MULTD issues to Mult1 (Fi=F0, Fj=F2, Fk=F4) while LD #2 still occupies the Integer unit (Fi=F2, Fk=R3); register result status: F0 ← Mult1, F2 ← Integer.]
Dynamic Scheduling
Using A Scoreboard
[Clock 7: SUBD issues to the Add unit (Fi=F8, Fj=F6, Fk=F2); Mult1 and Integer remain busy.]
Dynamic Scheduling
Using A Scoreboard
[Clock 8: DIVD issues to the Divide unit; Mult1, Add, and Divide are busy while LD #2 finishes in the Integer unit.]
Dynamic Scheduling
Using A Scoreboard
LD #2 writes F2.
[Clock 8, after the write: the Integer unit is free; Mult1, Add, and Divide remain busy; register result status: F0 ← Mult1.]
Dynamic Scheduling
Using A Scoreboard

[Clock 9: instruction status so far: LD #1 issue/read/exec/write at clocks 1/2/3/4; LD #2 at 5/6/7/8; MULTD issued at 6 and reads at 9; SUBD issued at 7 and reads at 9; DIVD issued at 8. Mult1 has 10 cycles of execution remaining, Add has 2.]
Dynamic Scheduling
Using A Scoreboard
[Clock 11: SUBD completes execution; MULTD continues in Mult1; DIVD still waits on F0.]
Dynamic Scheduling
Using A Scoreboard
[Clock 12: SUBD writes F8 and frees the Add unit. The Divide unit shows Fj=F0 not ready (Qj=Mult1, Rj=No) and Fk=F6 ready (Rk=Yes).]
Dynamic Scheduling
Using A Scoreboard
ADDD issues.
[Clock 13: ADDD issues to the now-free Add unit; Mult1 and Divide remain busy.]
Dynamic Scheduling
Using A Scoreboard
[Clocks 14-16: ADDD reads its operands and completes execution. Clock 17: ADDD cannot write F6 yet, because DIVD has still to read F6 (a WAR hazard); DIVD in turn waits for Mult1 to produce F0.]

Nothing Happens!!

[Clocks 18-19: no state changes while MULTD keeps executing in Mult1.]
Dynamic Scheduling
Using A Scoreboard
MULT writes.
[Clock 20: MULTD writes F0 and frees Mult1; both of the Divide unit's operands become ready (Rj=Yes, Rk=Yes).]
Dynamic Scheduling
Using A Scoreboard
[Clock 21: DIVD reads F0 and F6.]
Dynamic Scheduling
Using A Scoreboard
[Clock 22: ADDD writes F6 and frees the Add unit; only the Divide unit remains busy, executing its 40-cycle divide.]
Dynamic Scheduling
Using A Scoreboard
[Clock 61: DIVD completes execution.]
Dynamic Scheduling
Using A Scoreboard
DONE!!
[Clock 62: DIVD writes F10; all functional units are idle and the scoreboard is empty.]
Dynamic Scheduling
Using A Scoreboard
Why study this? Its ideas led to the Alpha 21264, HP PA-8000, MIPS R10000, Pentium II, and PowerPC 604.
Dynamic Scheduling
Tomasulo's Algorithm
Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming. Renaming avoids WAR and WAW hazards. There are more reservation stations than registers, so the hardware can do optimizations compilers can't.
Results go to the FUs from the RSs, not through the registers, over a Common Data Bus that broadcasts results to all FUs. Loads and stores are treated as FUs with RSs as well. Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue.
Dynamic Scheduling
Using A Scoreboard
Tomasulo Organization
[Figure: Tomasulo organization: FP op queue, load buffers, FP registers, reservation stations, and the Common Data Bus.]
[Reservation-station snapshot, clock 0: all stations idle. Fields per station: Busy, Op, Vj, Vk (operand values), Qj, Qk (reservation stations producing each operand); the register status for F0-F30 is empty.]
Dynamic Scheduling
Review: Tomasulo
Prevents the register file from becoming a bottleneck. Avoids the WAR and WAW hazards of the scoreboard. Allows loop unrolling in hardware. Not limited to basic blocks (provided branch prediction). Lasting contributions:
Dynamic scheduling Register renaming Load/store disambiguation
360/91 descendants are PowerPC 604, 620; MIPS R10000; HP-PA 8000; Intel Pentium Pro
Dynamic Branch Prediction is the ability of the hardware to make an educated guess about which way a branch will go: will the branch be taken or not? The hardware can look for clues in the instructions themselves, or it can use past history; we will discuss both directions.
[Figure: branch-history table: the low-order bits of the branch address (bits 13-2) index a table of prediction bits; the indexed entry says predict taken or predict not taken.]
[Figure: 2-bit prediction state machine: two Predict Taken states and two Predict Not Taken states; taken (T) outcomes move toward Predict Taken, not-taken (NT) outcomes toward Predict Not Taken, so the prediction flips only after two consecutive mispredictions.]
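The state machine above is just a 2-bit saturating counter. A minimal sketch (class name and encoding are mine): states 0-1 predict not taken, 2-3 predict taken, and each outcome moves the counter one step.

```python
class TwoBitPredictor:
    def __init__(self, start=3):          # start in "strongly taken"
        self.state = start                # 0..3

    def predict(self):
        return self.state >= 2            # True = predict taken

    def update(self, taken):
        # Saturating move: one step toward the actual outcome.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

A single anomalous outcome (e.g., a loop exit) cannot flip the prediction: from "strongly taken" it takes two not-taken outcomes in a row before the predictor says not taken.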
BHT Accuracy
Mispredictions occur because either the guess was wrong for that branch, or the predictor got the branch history of the wrong branch when indexing the table. A 4096-entry table gives misprediction rates that vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%. 4096 entries are about as good as an infinite table, but 4096 entries is a lot of hardware.
Correlating Branches
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to that branch's own history). The behavior of recent branches then selects between, say, four predictions of the next branch, and only the selected prediction is updated.
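That selection can be sketched as a (2,2) correlating predictor (class name, table size, and encoding are my choices): a 2-bit global history of the last two branch outcomes picks one of four 2-bit counters in the indexed table entry, and only that counter is updated.

```python
class CorrelatingPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.history = 0                       # last 2 outcomes, as 2 bits
        # counters[index][history] is a 2-bit saturating counter (0..3)
        self.counters = [[1] * 4 for _ in range(entries)]

    def predict(self, pc):
        return self.counters[pc % self.entries][self.history] >= 2

    def update(self, pc, taken):
        c = self.counters[pc % self.entries]
        h = self.history
        # Move only the counter the current history selected.
        c[h] = min(3, c[h] + 1) if taken else max(0, c[h] - 1)
        # Shift the new outcome into the global history.
        self.history = ((h << 1) | int(taken)) & 0b11
```

A strictly alternating branch defeats a single 2-bit counter, but here the two history contexts learn opposite predictions, so the alternation is predicted correctly.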
[Chart: frequency of mispredictions, ranging from 0% to 11%, for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li.]
[Figure: branch-target buffer: the fetch-stage PC is looked up in the buffer, which returns the predicted PC for the next fetch.]
Example
Example on page 274. Determine the total branch penalty for a BTB using the above penalties. Assume also the following: prediction accuracy of 90% (so 10% of predictions are incorrect), a hit rate in the buffer of 90%, and a 60% taken-branch frequency.
Branch penalty = (buffer hit rate × percent incorrect predictions × 2) + ((1 - buffer hit rate) × percent taken branches × 2)
Branch penalty = (90% × 10% × 2) + (10% × 60% × 2)
Branch penalty = 0.18 + 0.12 = 0.30 clock cycles
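The same computation as a small function (the name and parameterization are mine), so the numbers can be varied. With the 90% hit rate, the 10% incorrect-prediction rate used in the formula, and the 60% taken frequency, it reproduces the 0.30-cycle result.

```python
def btb_branch_penalty(hit_rate, incorrect, taken_freq, penalty=2):
    # Hit in the buffer, but the prediction was wrong.
    mispredict_cost = hit_rate * incorrect * penalty
    # Missed in the buffer, and the branch turns out taken.
    miss_cost = (1 - hit_rate) * taken_freq * penalty
    return mispredict_cost + miss_cost
```

btb_branch_penalty(0.90, 0.10, 0.60) evaluates to 0.30 clock cycles per branch.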
Multiple Issue
4.1 Instruction Level Parallelism: Concepts and Challenges 4.2 Overcoming Data Hazards with Dynamic Scheduling 4.3 Reducing Branch Penalties with Dynamic Hardware Prediction 4.4 Taking Advantage of More ILP with Multiple Issue 4.5 Compiler Support for Exploiting ILP 4.6 Hardware Support for Extracting more Parallelism 4.7 Studies of ILP
Multiple Issue is the ability of the processor to start more than one instruction in a given cycle.
Flavor I:
Superscalar processors issue a varying number of instructions per clock (1 to 8), scheduled either statically by the compiler or dynamically by the hardware (Tomasulo). Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000.
Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II:
VLIW (Very Long Instruction Word) issues a fixed number of instructions (4-16), formatted either as one very large instruction or as a fixed packet of smaller instructions, scheduled by the compiler into wide templates. The joint HP/Intel effort (1999/2000) produced Intel Architecture-64 (IA-64), a 64-bit-address architecture in the style called Explicitly Parallel Instruction Computing (EPIC).
Multiple Issue
Issuing Multiple Instructions/Cycle
Flavor II - continued:
3 instructions in 128-bit groups; a template field indicates whether the instructions are dependent or independent. Code size is smaller than old VLIW, larger than x86/RISC. Groups can be linked to show independence across more than 3 instructions. 128 integer registers + 128 floating-point registers, not separate register files per functional unit as in old VLIW. Hardware checks dependencies (interlocks, giving binary compatibility over time). Predicated execution (select 1 out of 64 1-bit flags) promises roughly 40% fewer mispredictions. IA-64 is the name of the instruction set architecture and EPIC is the style; Merced was the name of the first implementation (1999/2000).
Multiple Issue
In our DLX example, we can handle 2 instructions/cycle: Floating Point Anything Else
Type              Pipe stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB

A 1-cycle load delay now delays 3 instructions in the superscalar: the instruction in the right half of the load's issue slot can't use the result, nor can either instruction in the next slot.
Multiple Issue
Multiple Issue
Memory ref 1     Memory ref 2     FP op 1          FP op 2          Int. op/branch
LD F0,0(R1)      LD F6,-8(R1)
LD F10,-16(R1)   LD F14,-24(R1)
LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2    ADDD F8,F6,F2
LD F26,-48(R1)                    ADDD F12,F10,F2  ADDD F16,F14,F2
                                  ADDD F20,F18,F2  ADDD F24,F22,F2
SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2
SD -16(R1),F12   SD -24(R1),F16
SD -32(R1),F20   SD -40(R1),F24
SD -0(R1),F28

(The SUBI and BNEZ occupy the integer-op/branch slot in the last rows.)

Unrolled 7 times to avoid delays. 7 results in 9 clocks, or 1.3 clocks per iteration. Need more registers to use VLIW effectively.
VLIW
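A quick check of the per-iteration arithmetic the chapter has been tracking (plain division; the helper name is mine): 6 clocks for the scheduled loop, 14 clocks for 4 unrolled iterations, 9 VLIW issue slots for 7 iterations.

```python
def clocks_per_iteration(clocks, iterations):
    # Average issue clocks charged to each original loop iteration.
    return clocks / iterations

print(clocks_per_iteration(6, 1))             # 6.0  (scheduled loop)
print(round(clocks_per_iteration(14, 4), 1))  # 3.5  (unrolled 4x, scheduled)
print(round(clocks_per_iteration(9, 7), 1))   # 1.3  (VLIW, unrolled 7x)
```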
Multiple Issue
Difficulties in building the hardware:
- Duplicate functional units to get parallel execution
- More ports on the register file (the VLIW example needs 6 read and 3 write ports for the integer registers, and 6 read and 4 write ports for the FP registers)
- More ports to memory
- Superscalar decode complexity and its impact on clock rate and pipeline depth
Multiple Issue
If more instructions issue at the same time, decoding and issuing get harder: even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue.
How can compilers be smart?
1. Produce good scheduling of code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.
Compilers must be REALLY smart to figure out aliases; pointers in C are a real problem. The techniques lead to: Symbolic Loop Unrolling and Critical Path Scheduling.
Software Pipelining
Observation: if the iterations of a loop are independent, we can get ILP by taking instructions from different iterations. Software pipelining reorganizes the loop so that each new iteration is made from instructions chosen from different iterations of the original loop (Tomasulo's algorithm in software).

[Figure: a software-pipelined iteration draws its instructions from several consecutive iterations (0-4) of the original loop.]
SW Pipelining Example
After: Software Pipelined
      LD   F0,0(R1)
      ADDD F4,F0,F2
      LD   F0,-8(R1)
LOOP: SD   0(R1),F4     ; stores M[i]
      ADDD F4,F0,F2     ; adds to M[i-1]
      LD   F0,-16(R1)   ; loads M[i-2]
      SUBI R1,R1,#8
      BNEZ R1,LOOP
      SD   0(R1),F4
      ADDD F4,F0,F2
      SD   -8(R1),F4
[Pipeline diagram: the SD, ADDD, and LD of one software-pipelined iteration overlap in the pipeline; the SD reads F4 and the ADDD reads F0, with no stalls.]
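The same schedule can be sketched in Python (the function name is mine; a sketch assuming the loop computes M[i] = M[i] + s from the top of the array down, as the DLX code does): the steady-state body stores for iteration i, adds for iteration i-1, and loads for iteration i-2, with explicit start-up and wind-down code.

```python
def add_scalar_sw_pipelined(m, s):
    n = len(m)
    if n < 3:                      # too short for a steady state: do it plainly
        for i in range(n):
            m[i] += s
        return
    # Start-up code (fills the "software pipeline"):
    f0 = m[n - 1]                  # LD
    f4 = f0 + s                    # ADDD
    f0 = m[n - 2]                  # LD
    # Steady-state loop: one store, one add, one load per trip,
    # each belonging to a different original iteration.
    for i in range(n - 1, 1, -1):
        m[i] = f4                  # SD: store result for iteration i
        f4 = f0 + s                # ADDD: add for iteration i-1
        f0 = m[i - 2]              # LD: load for iteration i-2
    # Wind-down code (drains the pipeline):
    m[1] = f4                      # SD
    f4 = f0 + s                    # ADDD
    m[0] = f4                      # SD
```

The result is identical to the plain loop; only the schedule of loads, adds, and stores changes.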
SW Pipelining Example
Symbolic Loop Unrolling
Compared with loop unrolling, software pipelining uses less code space, and the start-up/wind-down overhead is paid only once rather than on every unrolled iteration.
Trace Scheduling
Trace scheduling finds parallelism across conditional (IF) branches, not just loop branches. Two steps:
Trace Selection: find a likely sequence of basic blocks (a trace) forming a long, statically predicted or profile-predicted stretch of straight-line code.
Trace Compaction: squeeze the trace into a few VLIW instructions. Bookkeeping code is needed in case the prediction is wrong, and the compiler undoes bad guesses (discarding values in registers). Subtle compiler bugs mean wrong answers rather than just poorer performance, since there are no hardware interlocks.
Software support of ILP is best when the code is predictable at compile time. But what if there's no predictability? Here we'll talk about hardware techniques:
Conditional (predicated) instructions
Hardware speculation
Nullified Instructions
A = B op C
Compiler Speculation
Increasing Parallelism
The idea is to move an instruction across a branch so as to increase the size of a basic block and thus increase parallelism. The primary difficulty is avoiding exceptions: for example, in if (a != 0) c = b / a; the division may raise a divide-by-zero error if it is moved above the test. Methods for supporting speculation include:
1. A set of status bits (poison bits) associated with the registers, signaling that an instruction's result is invalid until some later time.
2. Not writing the result of an instruction until it's certain the instruction is no longer speculative.
Compiler Speculation
Increasing Parallelism
Example on Page 305. Code for if ( A == 0 ) A = B; else A = A + 4; Assume A is at 0(R3) and B is at 0(R4) Note here that only ONE side needs to take a branch!!
Original Code:
      LW   R1, 0(R3)
      BNEZ R1, L1
      LW   R1, 0(R2)
      J    L2
L1:   ADDI R1, R1, #4
L2:   SW   0(R3), R1

Speculated Code:
      LW   R1, 0(R3)
      LW   R14, 0(R2)
      BEQZ R1, L3
      ADDI R14, R1, #4
L3:   SW   0(R3), R14
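The control-flow shapes of the two versions, rendered in Python for comparison (a hypothetical rendering, not DLX semantics): the speculated form loads B unconditionally before the test, so only one branch side and one merged store remain.

```python
def original(a, b):
    # Branch on both sides: either take the if-side or the else-side.
    if a == 0:
        a = b
    else:
        a = a + 4
    return a

def speculated(a, b):
    r14 = b              # speculative load of B, hoisted above the test
    if a != 0:           # only one side branches now
        r14 = a + 4
    return r14           # single merged store of the result
```

For every input, the two versions compute the same final value of A; speculation changed only where the load happens.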
Compiler Speculation
Poison Bits
In the example on the last page, if the LW* (speculative load) causes an exception, a poison bit is set on its destination register. Then, if a later instruction tries to use that register, an exception is raised at that point.

Speculated Code:
      LW   R1, 0(R3)     ; load A
      LW*  R14, 0(R2)    ; speculative load of B
      BEQZ R1, L3        ; branch
      ADDI R14, R1, #4   ; else clause
L3:   SW   0(R3), R14    ; non-speculative store
Hardware Speculation
Studies of ILP
4.1 Instruction Level Parallelism: Concepts and Challenges 4.2 Overcoming Data Hazards with Dynamic Scheduling 4.3 Reducing Branch Penalties with Dynamic Hardware Prediction 4.4 Taking Advantage of More ILP with Multiple Issue 4.5 Compiler Support for Exploiting ILP 4.6 Hardware Support for Extracting more Parallelism 4.7 Studies of ILP
There are conflicting studies of how much improvement is available, depending on the benchmarks (vectorized FP Fortran vs. integer C programs), hardware sophistication, and compiler sophistication. How much ILP is available using existing mechanisms with increasing hardware budgets? Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
Studies of ILP
Limits to ILP
Initial hardware model here; MIPS compilers. Assumptions for the ideal/perfect machine to start:
1. Register renaming: infinite virtual registers, so all WAW and WAR hazards are avoided.
2. Branch prediction: perfect; no mispredictions.
3. Jump prediction: all jumps perfectly predicted. Together these give a machine with perfect speculation and an unbounded buffer of instructions available.
4. Memory-address alias analysis: addresses are known, and a store can be moved before a load provided the addresses are not equal.
Also: 1-cycle latency for all instructions, and an unlimited number of instructions issued per clock cycle.
Studies of ILP
This is the amount of parallelism when there are no branch mis-predictions and were limited only by data dependencies.
[Chart: IPC under the perfect model, with values ranging up to about 160 across the benchmarks.]
Studies of ILP
What parallelism do we get when we don't allow perfect branch prediction, as in the previous chart, but assume some realistic model? Possibilities include:
1. Perfect: all branches are perfectly predicted (the previous slide).
Studies of ILP
Bonus!!
[Figure: tournament predictor: an 8K x 2-bit selector chooses between two component predictors.]
Studies of ILP
Limiting the type of branch prediction.
[Chart: IPC for gcc, espresso, li, fpppp, doducd, and tomcatv under Perfect, Selective predictor, Standard 2-bit, Static, and None branch prediction. FP programs: 15-45 IPC; integer programs: 6-12.]
Studies of ILP
Effect of limiting the number of renaming registers.
[Chart: IPC for gcc, espresso, li, fpppp, doducd, and tomcatv with Infinite, 256, 128, 64, 32, and no renaming registers. FP programs: 11-45 IPC; integer programs: 5-15.]
Studies of ILP
What happens when there may be conflicts with memory aliasing?
[Chart: IPC for gcc, espresso, li, fpppp, doducd, and tomcatv under Perfect, Global/stack perfect, Inspection, and None alias analysis. Integer programs: 4-9 IPC.]
Summary
4.1 Instruction Level Parallelism: Concepts and Challenges