Professional Documents
Culture Documents
Chapter 4
Exploiting Instruction-Level Parallelism
with Software Approaches
November 2004
Chapter Overview
4.1 Basic Compiler Techniques for Exposing ILP
4.2 Static Branch Prediction
4.3 Static Multiple Issue: The VLIW Approach
Basic pipeline scheduling (6 cycles per loop; 3 for loop overhead: DADDUI and BNE)
Loop Unrolling
Four copies of the loop body
Eliminate 3 branch 3 decrements of R1 => improve performance
Require symbolic substitution and simplification
Increase code size substantially => expose more computation that can be
scheduled => Gain from scheduling on unrolled loop is larger than original
Unrolling without Scheduling(28 cycles/14 instr.)
Summary
Method
Original
Schedule
Unroll 4
# of Instr.
Cycles/loop
5
10
5
6
14
28/4=7
10
11
R1, 0(R2)
R7, R8, R9
R1, R1, R3
R1, L
R4, R5, R6
R10, R4, R3
LD
OR
DSUBU
BEQZ
DADDU
L: DADDU
R1, 0(R2)
R4, R5, R6
R1, R1, R3
R1, L
R10, R4, R3
R7, R8, R9
12
Profile-based Static
Branch Prediction
Misprediction rate on SPEC92
varying widely: 3% to 24%
in average, 9% for FP programs and
15% for integer programs
15
1
2
3
4
5
6
7
8
9
16
17
18
Loop-level parallelism
dependence between two uses of x[i]
dependent within a single iteration, not loop carried
A loop is parallel if it can written without a cycle in the
dependences, since the absence of a cycle means that the
dependences give a partial ordering on the statements
19
P.320 Example
20
P.321 Example
Interchanging
no longer loop carried
21
Array Renaming
22
Recurrence
dependence distance of 5
23
Software Pipelining
Observation: if iterations from loops are independent, then
can get ILP by taking instructions from different iterations
Software pipelining: interleave instructions from different
loop iterations
reorganizes loops so that each iteration is made from instructions
chosen from different iterations of the original loop (Tomasulo in SW)
24
LOOP: 1
2
3
4
5
L.D
ADD.D
S.D
DADDUI
BNE
F0,0(R1)
F4,F0,F2
F4,0(R1)
(5/5)
R1,R1,#8
R1,R2,LOOP 2 RAWs!
1 L.D
F0,0(R1)
2 ADD.D F4,F0,F2
3 S.D
F4,0(R1)
Software Pipelining
Example
4 L.D
F0,-8(R1)
5 ADD.D F4,F0,F2
7 L.D
F0,-16(R1)
6 S.D
F4,-8(R1) 8 ADD.D F4,F0,F2
9 S.D
F4,-16(R1)
L.D
ADD.D
L.D
S.D
ADD.D
L.D
DADDUI
BNE
S.D
ADD.D
S.D
F0,0(R1)
F4,F0,F2
F0,-8(R1) No RAW! (2 WARs)
F4,0(R1);
Stores M[i]
F4,F0,F2;
Adds to M[i-1]
F0,-16(R1); loads M[i-2]
R1,R1,#8
R1,R2,LOOP
F4,-8(R1)
F4,F0,F2
F4,-16(R1)
25
26
Two tasks
1. Find the common path
2. Move the assignments to B or C
27
A[i] = A[i]+B[i]
B[i]=
C[i]=
if(A[i]!=0) {
undo
X
}
28
29
Superblocks
superblock: single entry point but allow multiple exits
Compacting a superblock is much easier than compacting a trace
Tail duplication
Create a separate block that corresponds to the portion of the trace
after the entry
30
Tail Duplication
Trace
trace scheduling and unwinding path
Super block
tail duplication
32
34
35
36
9 bundles
21 cycles
85% of slots filled
(23/27)
11 bundles
12 cycles
70% of slots filled
(23/33)
37
Some
Instruction
Formats of
IA-64
See Textbook or
Intels manuals for
more information
38
instr 1
instr 2
:
br
ld r1=
use =r1
ld.s r1=
instr 1
use =r1 instr 2
:
br
chk.s use
Recovery code
ld r1=
use =r1
Uses moved
br
above branch
by compiler
39
Summary
Chapter 4 Exploiting Instruction-Level Parallelism
with Software Approaches
4.1
4.2
4.3
4.4
4.7
40