We now concentrate on promoting instruction-level parallelism (ILP) in order to further improve pipeline performance.
ILP: the amount of parallelism in a basic block of code, that is, code without branches, or the code between branches.
Given that branches make up about 15%-25% of all code in our MIPS examples, a basic block may be between 4 and 6 instructions long.
Question: of these 4-6 instructions, how many can be executed in parallel or in an overlapped fashion?
  If the number is low, this argues for a pipeline of at most 4-6 stages.
  Otherwise, we will have to find ways to increase the available ILP in a basic block.
  So we will also consider ways to extend ILP across blocks.
  The more we can parallelize, the better our CPI will be; we may even get CPI < 1.
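As a rough check on the 4-6 instruction figure, the average basic block length can be estimated as the reciprocal of the branch frequency. This is a minimal sketch that assumes branches are spread evenly through the code; the function name is illustrative and not from the notes.

```python
def average_basic_block_length(branch_fraction):
    """Estimate instructions per basic block, counting the branch that ends it."""
    if not 0 < branch_fraction <= 1:
        raise ValueError("branch_fraction must be in (0, 1]")
    return 1.0 / branch_fraction


# Branch frequencies quoted in the notes: 25% and 15% of all instructions.
for fraction in (0.25, 0.15):
    length = average_basic_block_length(fraction)
    print(f"{fraction:.0%} branches: about {length:.1f} instructions per block")
```

With 25% branches the estimate is 4 instructions, and with 15% it is a little under 7, which brackets the range quoted above.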
Example
Loop: L.S   F1, 0(R5)
      L.S   F2, 0(R6)
      ADD.S F3, F1, F2
      S.S   F3, 0(R5)
      DADDI R5, R5, #4
      DADDI R6, R6, #4
      DSUBI R4, R4, #1
      BNEZ  R4, Loop
How much performance can we get from this loop in a pipeline, given that each iteration has
  a data hazard between the second L.S and the ADD.S
  a data hazard of 3 cycles between the ADD.S and the S.S
  a data hazard between the DSUBI and the BNEZ
  a branch penalty
The high-level language code is so concise that the machine code has only 8 instructions, so there is not much ILP that can be exploited in the pipeline.
  Compiler scheduling can remove all of these hazards. Can you figure out how?
  This would not be true if we had a multiply instead of an add.
If 2 instructions are parallel, they can execute simultaneously in a pipeline without causing stalls.
But if 2 instructions are dependent, then they are not parallel and cannot be rearranged or executed in a pipeline,
  at least not without stalls or forwarding to safeguard the dependencies.
  The likelihood of forwarding being successful depends on the pipeline depth and the distance between the instructions.
  Longer pipelines have a greater potential for stalls and for lengthier stalls.
Name dependencies arise when 2 instructions refer to the same named item (a register or a memory location).
ADD.D is data dependent on L.D for F0; S.D is data dependent on ADD.D for F4.
[Code listing: the loop containing L.D, ADD.D, S.D, DSUBI, and BNEZ]
Output dependence
  Instruction i and instruction j both write to the same name, but j executes first (out of order).
  This can lead to a WAW hazard.
  For output dependencies, ordering must be preserved.
There are data dependencies between L.D and ADD.D and between ADD.D and S.D. There are name dependencies between iterations of the loop.
  That is, we use F0 in ADD.D and again in the next iteration, but in the second iteration F0 refers to a different datum.
  We can rename the register for the second iteration to remove the name dependencies.
Example
What are the dependencies in the following loop? For each, identify whether the dependence is loop carried (LC) or not.
Is the loop parallelizable?
for (j = 1; j < 99; j++) {
    a[j]   = b[j] * c[j+1];   // s1
    c[j]   = a[j+1] * s;      // s2
    a[j-1] = c[j] + b[j];     // s3
    b[j+1] = a[j];            // s4
}
True (data) dependencies:
  b from s4 to s1 and s3 (LC)
  c from s2 to s3 (not LC)
  a from s1 to s4 (not LC)
Output dependencies:
  a from s1 to s3 (LC)
Anti dependencies:
  c from s1 to s2 (LC): s1 reads c[j+1] before s2 writes it in the next iteration
  a from s2 to s1 and from s4 to s3 (LC)
Since there is a LC data dependence, the loop is not parallelizable, although in the MIPS pipeline the latency might be short enough to allow for unrolling and scheduling.
Unrolling the loop would look like this:
  a[1] = b[1] * c[2];
  c[1] = a[2] * s;
  a[0] = c[1] + b[1];
  b[2] = a[1];
  a[2] = b[2] * c[3];
  c[2] = a[3] * s;
  a[1] = c[2] + b[2];
  b[3] = a[2];
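One way to double-check a hand analysis like this is to replay the loop's memory accesses and record which statement last wrote or read each array element. The sketch below is an illustrative checker, not part of the original notes: the statement table mirrors s1 through s4, and `find_dependences` and its tuple layout are names chosen here. Because it reports every dependence the access trace exhibits, it may list more than a quick hand analysis does.

```python
def find_dependences(n_iters=6):
    """Return a set of (kind, array, source stmt, sink stmt, loop_carried)."""
    # Each statement: (name, written (array, index offset), list of reads).
    stmts = [
        ("s1", ("a", 0), [("b", 0), ("c", 1)]),   # a[j]   = b[j] * c[j+1]
        ("s2", ("c", 0), [("a", 1)]),             # c[j]   = a[j+1] * s
        ("s3", ("a", -1), [("c", 0), ("b", 0)]),  # a[j-1] = c[j] + b[j]
        ("s4", ("b", 1), [("a", 0)]),             # b[j+1] = a[j]
    ]
    last_writer = {}  # element -> (stmt, iteration) of the most recent write
    readers = {}      # element -> reads seen since the most recent write
    deps = set()
    for j in range(1, 1 + n_iters):
        for name, (w_arr, w_off), reads in stmts:
            # Reads happen before the write within one statement.
            for arr, off in reads:
                loc = (arr, j + off)
                if loc in last_writer:
                    src, it = last_writer[loc]
                    deps.add(("true", arr, src, name, it != j))
                readers.setdefault(loc, []).append((name, j))
            loc = (w_arr, j + w_off)
            if loc in last_writer:
                src, it = last_writer[loc]
                deps.add(("output", w_arr, src, name, it != j))
            for src, it in readers.get(loc, []):
                deps.add(("anti", w_arr, src, name, it != j))
            last_writer[loc] = (name, j)
            readers[loc] = []
    return deps


for kind, arr, src, dst, carried in sorted(find_dependences()):
    tag = "LC" if carried else "not LC"
    print(f"{kind:6s} {arr} from {src} to {dst} ({tag})")
```

Here "from X to Y" means X executes first. Any loop-carried true dependence in the output (such as b from s4 to s1) is what prevents the iterations from running in parallel.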
Control Dependencies
Control dependencies arise from instructions that depend on branches.
  All instructions in a program have control dependencies except for the earliest instructions prior to any branch, but here we will refer only to those instructions that are directly affected by a branch,
  such as the then or else clause of an if-then-else statement or the body of a loop.
Instructions that are control dependent on a branch cannot be moved before the branch.
  Example: if (x != 0) x++; else y++;
  Neither the then clause nor the else clause should precede the conditional branch on (x != 0)!
An instruction that is not control dependent on a branch cannot be moved after the branch.
Example
What are the control dependences in the code below? Which statements can be scheduled before the if statement?
  Assume only b and c are used again after the code fragment.
if (a > c) {
    d = d + 5;
    a = b + d + e;
}
else {
    e = e + 2;
    f = f + 2;
    c = c + f;
}
b = a + f;
d = d + 5 can be moved because d is only used again in a = b + d + e if the condition is true, and is not used again if the condition is false.
a = b + d + e cannot be moved as it would affect a, thus impacting the outcome of the condition, and would affect the statement b = a + f if the condition were false.
e = e + 2 cannot be moved because it could alter a if the condition were true.
f = f + 2 cannot be moved because it could alter b incorrectly if the condition were true.
c = c + f cannot be moved because it will alter the condition, and since c is used later, it may have the wrong value.
b = a + f cannot be moved since a or f might change.
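The reasoning about which statements may be scheduled before the if can be checked by brute force. The sketch below is an illustrative model, not from the notes: it runs the fragment over a small range of integer inputs, once as written and once with a single statement hoisted above the condition, and compares the values of b and c, the only variables assumed to be live afterwards. The helper names are hypothetical.

```python
from itertools import product

THEN_BRANCH = ["d = d + 5", "a = b + d + e"]
ELSE_BRANCH = ["e = e + 2", "f = f + 2", "c = c + f"]


def execute(statement, v):
    """Apply one assignment statement to the variable dictionary v."""
    if statement == "d = d + 5":
        v["d"] = v["d"] + 5
    elif statement == "a = b + d + e":
        v["a"] = v["b"] + v["d"] + v["e"]
    elif statement == "e = e + 2":
        v["e"] = v["e"] + 2
    elif statement == "f = f + 2":
        v["f"] = v["f"] + 2
    elif statement == "c = c + f":
        v["c"] = v["c"] + v["f"]
    else:
        raise ValueError(f"unknown statement: {statement}")


def run_fragment(initial, hoisted=None):
    """Run the if-then-else, optionally executing one statement before the if."""
    v = dict(initial)
    if hoisted is not None:
        execute(hoisted, v)
    branch = THEN_BRANCH if v["a"] > v["c"] else ELSE_BRANCH
    for statement in branch:
        if statement != hoisted:  # the hoisted statement has already run
            execute(statement, v)
    v["b"] = v["a"] + v["f"]
    return v["b"], v["c"]  # only b and c are live after the fragment


def can_hoist(statement, values=range(-2, 3)):
    """True if hoisting never changes b or c for any tested input."""
    for combo in product(values, repeat=6):
        initial = dict(zip("abcdef", combo))
        if run_fragment(initial) != run_fragment(initial, hoisted=statement):
            return False
    return True


for statement in THEN_BRANCH + ELSE_BRANCH:
    verdict = "can" if can_hoist(statement) else "cannot"
    print(f"{statement}: {verdict} be moved before the if")
```

An exhaustive test over a small input range is evidence rather than proof, but a single counterexample is enough to show that a statement cannot be moved.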
To simplify problems in these notes, we will use new latencies, which remove some of the stalls after FP multiply and divide.
Modify the displacement for the S.D, since we have moved the DSUBI earlier.
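The displacement fix follows from simple arithmetic: once the DSUBI has already subtracted from the base register, a memory reference that now comes after it must add the same amount back to its displacement. A minimal sketch, with a hypothetical helper name and example values chosen for illustration:

```python
def adjusted_displacement(displacement, decrement):
    """Displacement to use after the base register was already decremented."""
    return displacement + decrement


# S.D 0(R1) scheduled after DSUBI R1, R1, #8 must become S.D 8(R1).
print(adjusted_displacement(0, 8))
# With a decrement of 32, displacements -16 and -24 become 16 and 8.
print(adjusted_displacement(-16, 32), adjusted_displacement(-24, 32))
```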
Loop Unrolling
A compiler technique to improve ILP by providing more instructions to schedule.
As an added advantage, loop unrolling consolidates the loop mechanisms from several iterations into one iteration.
Loop: L.D   F0, 0(R1)
      ADD.D F4, F0, F2
      S.D   0(R1), F4
      L.D   F6, -8(R1)
      ADD.D F8, F6, F2
      S.D   -8(R1), F8
      L.D   F10, -16(R1)
      ADD.D F12, F10, F2
      S.D   -16(R1), F12
      L.D   F14, -24(R1)
      ADD.D F16, F14, F2
      S.D   -24(R1), F16
      DSUBI R1, R1, #32
      BNEZ  R1, Loop
We unroll the previous loop to contain 4 iterations, so that it iterates only 250 times. We adjust the code appropriately:
  use new registers
  alter the memory reference displacements
  change the decrement of R1
The new loop is only slightly better because we have removed instructions, but once we schedule it, we will get a much better improvement.
Advantages: fewer branch penalties; provides more instructions for ILP.
Disadvantages: uses more registers, lengthens the program, complicates the compiler.
With 4 ADD.Ds, we can place them consecutively so we no longer have RAW hazards with the S.Ds.
With 4 S.Ds, we move one between the DSUBI and the BNEZ and one after the BNEZ.
Version of the loop with no stalls:
Loop: L.D   F0, 0(R1)
      L.D   F6, -8(R1)
      L.D   F10, -16(R1)
      L.D   F14, -24(R1)
      ADD.D F4, F0, F2
      ADD.D F8, F6, F2
      ADD.D F12, F10, F2
      ADD.D F16, F14, F2
      S.D   0(R1), F4
      S.D   -8(R1), F8
      DSUBI R1, R1, #32
      S.D   16(R1), F12    ; -16 + 32 = 16
      BNEZ  R1, Loop
      S.D   8(R1), F16     ; -24 + 32 = 8
Results
Speedup over original = 10,000 / 3,500 = 2.86
Speedup over scheduled but not unrolled = 6,000 / 3,500 = 1.71
All gains are from the compiler.
  Eliminate extra conditions, branches, and decrements, and adjust the loop maintenance code.
  Use different registers.
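The speedup figures are just ratios of total cycle counts. A small sketch that recomputes them from the cycle totals quoted in the results (the function name is illustrative):

```python
def speedup(old_cycles, new_cycles):
    """Speedup of the new version relative to the old one."""
    return old_cycles / new_cycles


ORIGINAL = 10_000                # original loop, total cycles
SCHEDULED = 6_000                # scheduled but not unrolled
UNROLLED_AND_SCHEDULED = 3_500   # unrolled 4 times and scheduled

print(f"over original:  {speedup(ORIGINAL, UNROLLED_AND_SCHEDULED):.2f}")
print(f"over scheduled: {speedup(SCHEDULED, UNROLLED_AND_SCHEDULED):.2f}")
```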
Another Example
Loop: LD    F0, 0(R1)
      LD    F1, 0(R2)
      MUL.D F2, F1, F0
      SD    F2, 8(R2)
      DADDI R2, R2, #16
      DSUBI R1, R1, #8
      BNEZ  R1, Loop
Stalls, assuming the 7-cycle multiply from appendix A:
  1 after the second LD
  7 after the MUL.D (or 6 if we can handle the structural hazard)
  1 after the DSUBI
  1 for the branch hazard
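Adding the stalls listed above to the 7 instructions gives the cost of one iteration of the unscheduled loop. A minimal sketch of that tally, with illustrative names:

```python
INSTRUCTIONS_PER_ITERATION = 7  # LD, LD, MUL.D, SD, DADDI, DSUBI, BNEZ


def cycles_per_iteration(mul_stalls):
    """Instructions plus stalls for one iteration of the unscheduled loop."""
    stalls = {
        "after second LD": 1,
        "after MUL.D": mul_stalls,
        "after DSUBI": 1,
        "branch hazard": 1,
    }
    return INSTRUCTIONS_PER_ITERATION + sum(stalls.values())


print(cycles_per_iteration(7))  # 7 stalls after the MUL.D
print(cycles_per_iteration(6))  # 6 if the structural hazard can be handled
```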
We can reduce the stalls by unrolling and scheduling the loop. How many times?
  How about 4 more times?
Loop: LD    F0, 0(R1)
      LD    F1, 0(R2)
      LD    F3, -8(R1)
      LD    F4, 16(R2)
      LD    F6, -16(R1)
      LD    F7, 32(R2)
      LD    F9, -24(R1)
      LD    F10, 48(R2)
      LD    F12, -32(R1)
      LD    F13, 64(R2)
      MUL.D F2, F1, F0
      MUL.D F5, F3, F4
      MUL.D F8, F6, F7
      MUL.D F11, F9, F10
      MUL.D F14, F12, F13
      DADDI R2, R2, #80
      DSUBI R1, R1, #40
      SD    F2, -72(R2)
      SD    F5, -56(R2)
      SD    F8, -40(R2)
      SD    F11, -24(R2)
      BNEZ  R1, Loop
      SD    F14, -8(R2)
Solution
We will need 7 stalls or other instructions between the MUL.D and its SD.
  We have 3 operations that can go there (DADDI, DSUBI, BNEZ), but the branch cannot be followed by more than 1 instruction.
  So we need 4 more instructions to fill the remaining stalls, and we unroll the loop to create them.
  By unrolling the loop 4 more times, we have 4 more MUL.Ds to fill those slots.
We also create more SDs, and at most one can reside after the BNEZ.
  We move the DADDI and DSUBI to fill some of those slots, and move them up early enough to remove the stall before the BNEZ.
  We arrange the code so that all LDs occur first, followed by all MUL.Ds, followed by our DADDI and DSUBI, followed by all of the SDs, with the BNEZ before the final SD.
We also have to figure out the displacement offsets and how to adjust R1 and R2.
If we can access this buffer to retrieve this information at the same time that we retrieve the instruction itself,
  we can then use the information to predict if and where to branch while still in the IF stage, and thus remove any branch penalty!
The buffer is a small cache indexed by the low-order bits of the address of the instruction.
  The buffer stores 1 data bit recording whether the branch was taken the last time it was executed (a 1-time history).
  If the bit is set, we predict that the branch is taken; if the bit is not set, we predict that the branch is not taken.
  Even though the branch is taken 9 out of 10 times, our approach mispredicts twice, giving an accuracy of 80%.
Here, with a 2-bit scheme, we do not change the prediction after 1 wrong guess; we have to have 2 wrong guesses to change.
It turns out that the n-bit approach is not that much better, so a 2-bit approach is good enough.
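The 80% figure for the 1-bit scheme, and the gain from 2 bits, can be reproduced with a small simulation. This sketch assumes a loop branch that is taken 9 times and then falls through once, repeated many times, and models the n-bit predictor as a saturating counter that starts in the strongest taken state; both the workload and the counter encoding are assumptions made here for illustration.

```python
def count_mispredictions(outcomes, bits):
    """Simulate one n-bit saturating-counter predictor on a list of outcomes."""
    max_count = (1 << bits) - 1
    threshold = 1 << (bits - 1)   # predict taken when counter >= threshold
    counter = max_count           # assumed initial state: strongly taken
    mispredictions = 0
    for taken in outcomes:
        predicted_taken = counter >= threshold
        if predicted_taken != taken:
            mispredictions += 1
        # Move the counter toward the actual outcome, saturating at the ends.
        if taken:
            counter = min(counter + 1, max_count)
        else:
            counter = max(counter - 1, 0)
    return mispredictions


# A loop branch: taken 9 times, then not taken once, for 100 runs of the loop.
outcomes = ([True] * 9 + [False]) * 100

for bits in (1, 2):
    wrong = count_mispredictions(outcomes, bits)
    accuracy = 1 - wrong / len(outcomes)
    print(f"{bits}-bit predictor: {wrong} mispredictions, accuracy {accuracy:.1%}")
```

The 1-bit predictor is wrong twice per run of the loop (on the exit and again on re-entry), while the 2-bit counter is wrong only on the exit.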
Correlating Predictors
The 1-bit or 2-bit approach only considers the current branch, but a compiler may detect correlation among branches.
  Consider the C code to the right: the third branch will not be taken if the first two branches are both taken.
  If we can analyze such code, we can improve on branch prediction.
So we want to package together a branch prediction based not only on previous occurrences of this branch, but also on the behavior of other branches.
  This is known as a correlating predictor.
A (1, 2) correlating predictor uses the behavior of the last branch to select between 2 2-bit predictors.
  An (m, n) correlating predictor uses the behavior of the last m branches to select between 2^m n-bit predictors.
  This is probably overkill, and in fact the (1, 2) predictor winds up offering good prediction accuracy while requiring only twice the memory space of 2-bit predictors by themselves.
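The memory cost of an (m, n) predictor follows directly from the description: each buffer entry holds 2^m predictors of n bits each. A minimal sketch, where the entry count of 4096 is an arbitrary example value rather than a figure from the notes:

```python
def predictor_bits(m, n, entries):
    """Total storage in bits for an (m, n) predictor with the given entries."""
    return (2 ** m) * n * entries


ENTRIES = 4096  # example buffer size, chosen only for illustration

plain_2bit = predictor_bits(0, 2, ENTRIES)    # no history: one 2-bit predictor
correlating = predictor_bits(1, 2, ENTRIES)   # (1, 2) correlating predictor
print(plain_2bit, correlating, correlating // plain_2bit)
```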
Tournament Predictors
The correlating predictor can be thought of as a global predictor, whereas the earlier 1-bit or 2-bit predictors were local predictors.
In some cases local predictions are more accurate, and in some cases global predictions are more accurate.
A third approach, the tournament predictor, combines both of these by using yet another set of bits to determine which predictor should be used, the local or the global.
  A 2-bit counter can be used to count the number of previous mispredictions.
  Once we have two mispredictions, we switch from one predictor to the other.
  We might use static prediction information to determine which predictor we should start with.
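The selection mechanism can be sketched as a 2-bit saturating counter that moves toward whichever predictor was right when the two disagree. This is one common way to realize the idea, offered as an assumption rather than as the exact scheme any particular processor uses; the class and method names are illustrative.

```python
class TournamentChooser:
    """2-bit selector: states 0-1 pick the local predictor, 2-3 the global one."""

    def __init__(self, start_with_global=False):
        # Start saturated on the initially preferred predictor.
        self.counter = 3 if start_with_global else 0

    def use_global(self):
        return self.counter >= 2

    def update(self, local_correct, global_correct):
        """Shift toward the predictor that was right when only one was right."""
        if global_correct and not local_correct:
            self.counter = min(self.counter + 1, 3)
        elif local_correct and not global_correct:
            self.counter = max(self.counter - 1, 0)


chooser = TournamentChooser()
chooser.update(local_correct=False, global_correct=True)
print(chooser.use_global())  # still local after one local misprediction
chooser.update(local_correct=False, global_correct=True)
print(chooser.use_global())  # switches after the second one
```

Starting from a saturated state, it takes two mispredictions by the selected predictor (with the other one correct) before the chooser switches.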
Figure 2.8 compares these various approaches, and you can see that the tournament predictor is clearly the most accurate.
  The misprediction rate can be as low as about 2.8% on SPEC benchmarks.
The Power5 and Pentium 4 use 30K bits to store prediction information, while the Alpha 21264 uses 4K 2-bit counters with 4K 2-bit global prediction entries and 1K 10-bit local prediction entries.
[Figure: the five MIPS pipeline stages (IF, ID, EX, MEM, WB), marking the point at which the branch target becomes known]
Even with a good prediction, we don't know where to branch to until the branch target has been computed, and by then we've already retrieved the next instruction.
  Therefore, there is no point in performing branch prediction (by itself) in MIPS; we also need to know where to branch to.
  The branch prediction technique by itself is only useful in longer pipelines, where branch locations are computed earlier than branch conditions.
Branch-Target Buffers
In MIPS, we need to know both the branch condition outcome and the branch address by the end of the IF stage, so we enhance our buffer to include the prediction and, if taken, the branch target location as well.
  Send the PC of the current instruction to the target buffer in the IF stage.
  If there is a hit (the PC is in the table), look up the predicted PC and the branch prediction.
  If we predict taken, update the PC with the predicted PC; otherwise increment the PC as usual.
We predict the branch location before we have even decoded the instruction to see that it is a branch!
On a miss or misprediction, update the buffer by moving the missing PC into the buffer, or by updating the predicted PC/branch prediction bit.
If there is a cache miss, use the normal branch mechanism in MIPS; the penalty is then 2 cycles:
  one cycle penalty as normal in MIPS, plus one cycle penalty to update the cache.
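The lookup and update steps can be sketched in a few lines. This is a simplified model under stated assumptions: the buffer is an unbounded dictionary keyed by the full PC rather than a small cache indexed by low-order address bits, instructions are 4 bytes, and the class and method names are illustrative.

```python
class BranchTargetBuffer:
    """Maps a branch's PC to its predicted target and a taken/not-taken bit."""

    def __init__(self):
        self.entries = {}  # branch PC -> (predicted target PC, predict taken?)

    def next_pc(self, pc, instruction_size=4):
        """IF stage: choose the next PC before the instruction is decoded."""
        entry = self.entries.get(pc)
        if entry is not None:
            target, predict_taken = entry
            if predict_taken:
                return target
        # Miss, or hit with a not-taken prediction: fall through as usual.
        return pc + instruction_size

    def update(self, pc, taken, target):
        """After the branch resolves: insert or correct the entry for this PC."""
        self.entries[pc] = (target, taken)


btb = BranchTargetBuffer()
print(btb.next_pc(100))                  # miss, so fetch continues at PC + 4
btb.update(100, taken=True, target=40)   # branch resolved as taken
print(btb.next_pc(100))                  # hit, predicted taken
```

In this model the penalty is paid whenever `next_pc` disagrees with the resolved outcome, which is when `update` would be called.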
Branch Folding
Notice that by using the branch target buffer,
  we are fetching the new PC value (or the offset for the PC) from the buffer, then updating the PC, and then fetching the instruction at the branch target location.
Instead, why not just fetch the next instruction directly? In branch folding, the buffer stores the instruction at the predicted location.
  If we use this scheme for unconditional branches, we wind up with a penalty of -1 (we are in essence removing the unconditional branch).
  Note: for this to work, we must also update the PC, so we must store both the target instruction and the target location.
  This approach won't work well for conditional branches.
Examples
For branch target buffer:
  prediction accuracy is 90%
  branch target buffer hit rate is 90%
  branch penalty = hit rate * percent incorrect predictions * 2 cycles + (1 - hit rate) * 2 cycles
                 = (90% * 10% * 2) + (1 - 90%) * 2 = .38 cycles
Using delayed branches (as seen in the appendix notes), we had an average branch penalty of .3 cycles.
  So this is not an improvement; however, for longer pipelines with greater branch penalties this will be an improvement, and so could be applied very efficiently.
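The branch penalty formula is easy to evaluate for other hit rates and accuracies. A minimal sketch of that formula, with an illustrative function name and the 2-cycle penalty as the default:

```python
def btb_branch_penalty(hit_rate, accuracy, penalty_cycles=2):
    """Average penalty per branch: mispredicted hits plus buffer misses."""
    mispredicted_hit = hit_rate * (1 - accuracy) * penalty_cycles
    miss = (1 - hit_rate) * penalty_cycles
    return mispredicted_hit + miss


print(f"{btb_branch_penalty(0.90, 0.90):.2f} cycles")  # the example values
```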
Figure 2.25 demonstrates the usefulness of different sized caches, where an 8-element cache yields 95% accuracy in prediction.
With the scoreboard, we separated the instruction fetch from the instruction execution. We will continue to do this as we explore other dynamic scheduling approaches.
In order to perform the instruction fetches, coupled with branch prediction, we add an integrated fetch unit:
  a single unit that can fetch instructions at a fast rate and operates independently of the execution units.
Sample Problem #1
Using the new latencies from chapter 2, and assuming a 5-stage pipeline with forwarding available and branches completed in the ID stage, for the code below:
  determine the stalls that will arise from the code as is
  unroll and schedule the code to remove all stalls
  if the original loop were to iterate 1000 times, how much faster is your unrolled and scheduled version of the code?
  note: assume that no structural hazard will arise when an FP ALU operation reaches the MEM stage at the same time as a LD or SD
Loop: LD    R2, 0(R1)
      LD    F0, 0(R2)
      ADD.D F2, F0, F1
      LD    F3, 8(R2)
      MUL.D F4, F3, F2
      SD    F4, 16(R2)
      DSUBI R1, R1, #4
      BNEZ  R1, Loop
Stalls:
  1 after each LD (3 total)
  1 after ADD.D
  2 after MUL.D
  1 after DSUBI
  1 after the branch
Solution
Loop: LD    R2, 0(R1)
      LD    R3, -4(R1)
      LD    F0, 0(R2)
      LD    F5, 0(R3)
      ADD.D F2, F0, F1
      ADD.D F6, F5, F1
      LD    F3, 8(R2)
      LD    F7, 8(R3)
      MUL.D F4, F3, F2
      MUL.D F8, F7, F6
      DSUBI R1, R1, #8
      SD    F4, 16(R2)
      BNEZ  R1, Loop
      SD    F8, 16(R3)
Unroll: the greatest source of stalls will be after the MUL.D; however, we can insert the DSUBI to take up one spot, so we need 1 additional instruction there.
  We will unroll the loop 1 additional time.
  The second copy uses R3 for its pointer, loaded from -4(R1) because R1 counts down by 4 per original iteration, and the two pointer loads are placed first so that neither dependent load stalls.
Speedup: the original loop takes 14 cycles (excluding the first iteration).
  The new loop takes 14 cycles for 2 iterations.
  Therefore, there is a 2 times speedup.
Sample Problem #2
The MIPS R4000 pipeline has 8 stages, with branches determined in stage 4.
Make the following assumptions:
  the only source of stalls is from branches
  we modify the MIPS R4000 to compute the branch target location in stage 3, but conditions are still computed in stage 4
  the compiler schedules 1 neutral instruction into the branch delay slot 80% of the time
  aside from the branch delay slot, we implement assume-not-taken
If a benchmark consists of 5% jumps/calls/returns and 12% conditional branches, and assuming that 67% of all conditional branches are taken, what is the CPI of this machine?
Solution
CPI = 1 + branch penalty / instruction
  Unconditional branches will have a penalty of 1 if the branch delay slot is filled, or 2 if it is not filled, since we know where to branch to at the end of stage 3.
  Conditional branches will have a penalty of 0 if we do not take the branch, 2 if we take the branch and the branch delay slot is filled, or 3 if we take the branch and the branch delay slot is not filled.
branch penalty / instruction = 5% * (80% * 1 + 20% * 2) + 12% * 67% * (80% * 2 + 20% * 3) = .237
CPI = 1.237
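The CPI calculation can be written out so that the assumptions are easy to vary. This sketch restates the arithmetic of the solution; the function and parameter names are illustrative.

```python
def r4000_cpi(jump_fraction=0.05, cond_fraction=0.12, taken_fraction=0.67,
              slot_filled=0.80):
    """CPI = 1 + average branch penalty per instruction."""
    slot_empty = 1 - slot_filled
    # Unconditional: penalty 1 if the delay slot is filled, 2 otherwise.
    unconditional = jump_fraction * (slot_filled * 1 + slot_empty * 2)
    # Conditional: penalty 0 if not taken; 2 or 3 if taken, depending on the slot.
    conditional = cond_fraction * taken_fraction * (slot_filled * 2 + slot_empty * 3)
    return 1 + unconditional + conditional


print(f"CPI = {r4000_cpi():.3f}")
```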
Sample Problem #3
We have more complex conditions, so we can determine where to branch in stage 2, but we don't know if we are branching until the 3rd stage.
Which buffer should we use, assuming that the benchmark we are testing has 5% jumps/calls/returns and 12% conditional branches?
Solution
A miss or misprediction on an unconditional branch has a 1 cycle penalty + 1 cycle buffer update.
A miss or misprediction on a conditional branch has a 2 cycle penalty + 1 cycle buffer update.
Prediction buffer:
  CPI = 1 + .08 * .05 * 2 + .08 * .12 * 3 + .92 * .11 * .05 * 2 + .92 * .11 * .12 * 3 = 1.083
Target buffer:
  CPI = 1 + .20 * .05 * 2 + .20 * .12 * 3 + .20 * .18 * .05 * 2 + .20 * .18 * .12 * 3 = 1.109
The prediction buffer gives the lower CPI, so it is the better choice for this benchmark.