# CSC506 Pipeline Homework – due Wednesday, June 9, 1999

Question 1. An instruction requires four stages to execute: stage 1 (instruction fetch) requires 30 ns, stage 2 (instruction decode) = 9 ns, stage 3 (instruction execute) = 20 ns and stage 4 (store results) = 10 ns. An instruction must proceed through the stages in sequence. What is the minimum asynchronous time for any single instruction to complete? • 30 + 9 + 20 + 10 = 69 ns.

Question 2. We want to set this up as a pipelined operation. How many stages should we have and at what rate should we clock the pipeline? • We have 4 natural stages given and no information on how we might further subdivide them, so we use 4 stages in our pipeline. We have a choice of clock rate. The simplest choice is a clock cycle that accommodates the longest stage in our pipe – 30 ns. This would allow us to initiate a new instruction every 30 ns with a latency through the pipe of 30 ns x 4 stages = 120 ns. We could also pick a finer clock cycle that more closely matches the shortest stage (9 ns) but divides evenly into the other stage times. A clock of 10 ns is a good match and would require 3 clocks for the first stage, 1 clock for the second, 2 clocks for the third, and 1 clock for the fourth (7 clocks total). This would still allow us to initiate a new instruction every 30 ns but provide a latency of 70 ns rather than 120 ns. Either 30 ns or 10 ns is acceptable.
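The two clocking options can be checked with a short calculation. This is a sketch of my own (not part of the assignment), using the stage times given in question 1:

```python
# Stage times from question 1, in ns.
stage_ns = [30, 9, 20, 10]

# Option 1: clock the pipe at the longest stage time.
clock1 = max(stage_ns)                        # 30 ns
latency1 = clock1 * len(stage_ns)             # 4 stages x 30 ns = 120 ns

# Option 2: a finer 10 ns clock; each stage takes ceil(t / 10) clocks.
clock2 = 10
clocks = [-(-t // clock2) for t in stage_ns]  # ceiling division -> [3, 1, 2, 1]
latency2 = clock2 * sum(clocks)               # 7 clocks = 70 ns
initiation = clock2 * max(clocks)             # bottleneck stage = 30 ns

print(clock1, latency1, clock2, latency2, initiation)   # 30 120 10 70 30
```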

Question 3. For the pipeline in question 2, how frequently can we initiate the execution of a new instruction, and what is the latency? • See the answer to question 2: with either clock choice we can initiate a new instruction every 30 ns; the latency is 120 ns with the 30 ns clock and 70 ns with the 10 ns clock.

Question 4. What is the speedup of the pipeline in question 2? • Speedup per Stone's preferred definition is (30 + 9 + 20 + 10)/30 = 69/30 = 2.3. Speedup per best clocked definition is (30 + 10 + 20 + 10)/30 = 70/30 ≈ 2.33.
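The arithmetic behind the two numbers can be made explicit (a Python sketch of my own; the variable names for the two definitions are illustrative):

```python
stage_ns = [30, 9, 20, 10]      # stage times from question 1

# Stone's preferred definition: total unpipelined time over time per result.
speedup_stone = sum(stage_ns) / max(stage_ns)          # 69 / 30 = 2.3

# Best clocked definition: stage times rounded up to whole 10 ns clocks.
clocked_ns = [30, 10, 20, 10]
speedup_clocked = sum(clocked_ns) / max(clocked_ns)    # 70 / 30 = 2.33...

print(round(speedup_stone, 2), round(speedup_clocked, 2))   # 2.3 2.33
```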

Question 5. Draw the reduced state diagram and show the maximum-rate cycle using the following collision vector: 100011. • [Reduced state diagram, reconstructed from the figure: the initial state is 100011, with permissible latencies 2, 3, and 4. From 100011, latency 2 leads to 101111, latency 3 leads to 111011, and latency 4 leads to 110011. From 101111 the only permissible latency is 2, leading to 111111. From 111011 the only permissible latency is 4, leading to 110011. From 110011, latency 3 returns to 111011 and latency 4 loops back to 110011. From every state, a latency of 7 or more returns to 100011.] The maximum-rate cycle is the sequence 3.4.3.4. . ., giving two operations initiated every seven cycles, or 0.29 ops/cycle. The greedy cycle is 2.2.7.2.2.7. . ., giving three operations initiated every 11 cycles, or 0.27 ops/cycle. This is a case where the greedy cycle is not the optimum.
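Both cycles can be verified by simulating the collision-vector state machine. The helper below is my own sketch, not part of the assignment, and assumes the leftmost bit of the collision vector corresponds to a latency of 1 (the convention under which latencies 2, 3, and 4 are permissible, as above):

```python
CV = 0b100011    # collision vector; MSB = latency 1 ... LSB = latency 6
NBITS = 6
MASK = (1 << NBITS) - 1

def permissible(state):
    """Latencies 1..6 whose bit is 0, i.e. initiation causes no collision."""
    return [k for k in range(1, NBITS + 1) if not (state >> (NBITS - k)) & 1]

def step(state, k):
    """Let k cycles elapse (shift left) and OR in the collision vector."""
    return ((state << k) & MASK) | CV

def greedy_cycle(start):
    """Initiate at the smallest permissible latency until a state repeats."""
    seen, state, latencies = {}, start, []
    while state not in seen:
        seen[state] = len(latencies)
        allowed = permissible(state)
        latencies.append(min(allowed) if allowed else NBITS + 1)  # else wait 7
        state = step(state, latencies[-1])
    return latencies[seen[state]:]

def cycle_ok(cycle):
    """Check that a latency cycle never initiates into a collision."""
    state = CV
    for k in cycle * 2:          # twice around, to confirm the cycle closes
        if k <= NBITS and k not in permissible(state):
            return False
        state = step(state, k)
    return True

print(greedy_cycle(CV))                    # [2, 2, 7] -> 3 ops / 11 cycles
print(cycle_ok([3, 4]), 2 / 7 > 3 / 11)   # True True -> 3.4.3.4 is faster
```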

Question 6. We have a RISC processor with register-register arithmetic instructions that have the format R1 ← R2 op R3. The pipeline for these instructions runs with a 100 MHz clock with the following stages: instruction fetch = 2 clocks, instruction decode = 1 clock, fetch operands = 1 clock, execute = 2 clocks, and store result = 1 clock. a) At what rate (in MIPS) can we execute register-register instructions that have no data dependencies with other instructions? b) At what rate can we execute the instructions when every instruction depends on the results of the previous instruction? c) We implement internal forwarding. At what rate can we now execute the instructions when every instruction depends on the results of the previous instruction?

| Clock | Inst Fetch | Inst Decode | Op Fetch | Execute | Op Store |
|-------|------------|-------------|----------|---------|----------|
| 1 | 1 | | | | |
| 2 | 1 | | | | |
| 3 | 2 | 1 | | | |
| 4 | 2 | | 1 | | |
| 5 | 3 | 2 | | 1 | |
| 6 | 3 | | 2 | 1 | |
| 7 | 4 | 3 | | 2 | 1 |
| 8 | 4 | | 3 | 2 | |
| 9 | 5 | 4 | | 3 | 2 |
| 10 | 5 | | 4 | 3 | |
| 11 | 6 | 5 | | 4 | 3 |
| 12 | 6 | | 5 | 4 | |

• a) No dependencies rate = 1 inst/2 cycles at 100 MHz clock = 50 MIPS.

| Clock | Inst Fetch | Inst Decode | Op Fetch | Execute | Op Store |
|-------|------------|-------------|----------|---------|----------|
| 1 | 1 | | | | |
| 2 | 1 | | | | |
| 3 | 2 | 1 | | | |
| 4 | 2 | | 1 | | |
| 5 | 3 | 2 | | 1 | |
| 6 | 3 | | wait (2) | 1 | |
| 7 | 4 | 3 | wait (2) | | 1 |
| 8 | 4 | hold (3) | 2 | | |
| 9 | 5 | 4 | wait (3) | 2 | |
| 10 | 5 | hold (4) | wait (3) | 2 | |
| 11 | hold (5) | hold (4) | wait (3) | | 2 |
| 12 | hold (5) | hold (4) | 3 | | |
| 13 | 6 | 5 | wait (4) | 3 | |
| 14 | 6 | hold (5) | wait (4) | 3 | |
| 15 | | hold (5) | wait (4) | | 3 |

• b) Dependencies rate = 1 inst/4 cycles = 25 MIPS. The operand fetch unit must wait until the prior instruction stores its result before it can retrieve one of its operands (e.g., Op Fetch for #2 must wait until Op Store for #1 completes). As a result, although we begin fetching instructions every two cycles, things begin backing up in the pipeline, and we produce one instruction output only every 4 cycles.

| Clock | Inst Fetch | Inst Decode | Op Fetch | Execute | Op Store |
|-------|------------|-------------|----------|---------|----------|
| 1 | 1 | | | | |
| 2 | 1 | | | | |
| 3 | 2 | 1 | | | |
| 4 | 2 | | 1 | | |
| 5 | 3 | 2 | | 1 | |
| 6 | 3 | | 2 | 1 | |
| 7 | 4 | 3 | | 2 | 1 |
| 8 | 4 | | 3 | 2 | |
| 9 | 5 | 4 | | 3 | 2 |
| 10 | 5 | | 4 | 3 | |
| 11 | 6 | 5 | | 4 | 3 |
| 12 | 6 | | 5 | 4 | |

• c) Dependencies with internal forwarding rate = 1 inst/2 cycles = 50 MIPS, with timing identical to the no-dependency case. If we implement internal forwarding, the operand fetch unit can bypass fetching the dependent operand and just rename the dependent operand input register to be the result of instruction 1; we just have to point one of the inputs for instruction 2 execution to the internal register that receives the output of instruction 1 in order to get it. The result is available in time for the next calculation, so we can proceed without waiting.
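The 25 MIPS figure in case (b) comes from the store times settling into a 4-clock rhythm. Here is a tiny timing model (my own sketch with end-of-clock counting, not part of the assignment) that reproduces the store clocks 7, 11, 15, ... shown in the table:

```python
# Stage lengths in clocks, from the question.
IF, ID, OF, EX, OS = 2, 1, 1, 2, 1

def store_times(n):
    """Clock at which each of n chained (dependent) instructions stores."""
    times, prev_store = [], 0
    for i in range(n):
        fetch_done = (i + 1) * IF            # fetches issue back to back
        decode_done = fetch_done + ID
        # Operand fetch must wait for the previous instruction's store.
        of_done = max(decode_done, prev_store) + OF
        prev_store = of_done + EX + OS
        times.append(prev_store)
    return times

print(store_times(4))   # [7, 11, 15, 19] -> one result every 4 clocks
```

With the dependency stall modeled as a simple `max()`, the 4-clock spacing (OF + EX + OS = 4) falls out directly.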

Question 7. Conditional branches are a problem with instruction pipelines. For the RISC processor described in question 6, we decide to implement branches by always assuming the branch will not be taken rather than implementing some form of branch prediction or speculative execution. Assume a sequence of instructions where the condition code setting instruction immediately precedes the conditional branch, and assume we do not implement internal forwarding. a) What penalty in lost cycles do we incur for the branch not taken? b) What penalty in lost cycles do we incur for the branch taken? c) We implement delayed branching and the conditional branch is a delayed conditional branch. What penalty in lost cycles do we incur for the delayed branch taken? d) We implement internal forwarding along with the delayed branch. What penalty in lost cycles do we incur for the delayed branch taken with internal forwarding?

(CC = condition-code-setting instruction, BR = conditional branch, NSI = next sequential instruction, 2SI/3SI/4SI = following sequential instructions, BT = branch target instruction.)

| Clock | Inst Fetch | Inst Decode | Op Fetch | Execute | Op Store |
|-------|------------|-------------|----------|---------|----------|
| 1 | CC | | | | |
| 2 | CC | | | | |
| 3 | BR | CC | | | |
| 4 | BR | | CC | | |
| 5 | NSI | BR | | CC | |
| 6 | NSI | | wait (BR) | CC | |
| 7 | 2SI | NSI | wait (BR) | | CC |
| 8 | 2SI | hold (NSI) | BR | | |
| 9 | 3SI | 2SI | NSI | BR | |
| 10 | 3SI | hold (2SI) | hold (NSI) | BR | |
| 11 | 4SI | 3SI | 2SI | NSI | BR |
| 12 | 4SI | hold (3SI) | hold (2SI) | NSI | |

• a) We have a data dependency between the CC instruction and the branch instruction: we don't know the condition code setting (for instructions that set the condition code) until stage 5 (operand store) is complete, we don't know that the instruction is a branch until stage 2 (decode), and we can't provide the target address (of a branch taken) to stage 1 until the end of stage 5. The penalty depends on how we implement the pipeline. The operand fetch unit must wait 2 cycles until the CC is stored by the operand store unit before fetching it for use by the branch instruction. If we force the operand fetch 2-cycle delay up the pipeline, we introduce a two-cycle delay even when the branch is not taken. Penalty of 2 cycles for a branch not taken.

• However, we have two cycles of buffering in the pipeline – one cycle in the instruction decode unit (it waits every other cycle) and one cycle in the operand fetch unit (it also waits every other cycle). If these units can each hold onto their results for a cycle until the next stage is available, as shown in the reservation table, we take no penalty for this particular instruction pair. However, we will end up taking the 2-cycle penalty when the next (and every succeeding) instruction pair with data dependencies comes along.

| Clock | Inst Fetch | Inst Decode | Op Fetch | Execute | Op Store |
|-------|------------|-------------|----------|---------|----------|
| 1 | CC | | | | |
| 2 | CC | | | | |
| 3 | BR | CC | | | |
| 4 | BR | | CC | | |
| 5 | NSI | BR | | CC | |
| 6 | NSI | | wait (BR) | CC | |
| 7 | 2SI | NSI | wait (BR) | | CC |
| 8 | 2SI | hold (NSI) | BR | | |
| 9 | 3SI | 2SI | NSI | BR | |
| 10 | 3SI | hold (2SI) | hold (NSI) | BR | |
| 11 | wait | | | | BR |
| 12 | BT | | | | |
| 13 | BT | | | | |

• b) The operand fetch unit must still wait 2 cycles until the CC is available from the operand store unit, so we make the branch wait one more cycle. By the time we know the outcome of the branch instruction (clock 10), we have fetched the next three sequential instructions. We must stop the execution of NSI, dump the 2SI and 3SI instructions, and stop the instruction fetch unit from fetching the fourth sequential instruction after the branch, for 6 wasted clocks. Since the instruction fetch unit can't get the new program counter address for the branch target (BT) instruction until clock 11, it can't begin fetching the instruction at the target address until clock 12. The total penalty for the branch taken is 7 cycles. If you assumed that the instruction fetch unit could not be stopped after fetching 3SI and proceeded to fetch 4SI, the total penalty is 8 cycles.

| Clock | Inst Fetch | Inst Decode | Op Fetch | Execute | Op Store |
|-------|------------|-------------|----------|---------|----------|
| 1 | CC | | | | |
| 2 | CC | | | | |
| 3 | BR | CC | | | |
| 4 | BR | | CC | | |
| 5 | NSI | BR | | CC | |
| 6 | NSI | | wait (BR) | CC | |
| 7 | 2SI | NSI | wait (BR) | | CC |
| 8 | 2SI | hold (NSI) | BR | | |
| 9 | 3SI | 2SI | NSI | BR | |
| 10 | 3SI | hold (2SI) | hold (NSI) | BR | |
| 11 | wait | | | NSI | BR |
| 12 | BT | | | NSI | |
| 13 | BT | | | | NSI |

• c) The difference here is that we do not need to stop the execution of NSI on a delayed branch. It can continue to completion, but we still need to dump the 2SI and 3SI instructions. The instruction fetch unit still can't get the new program counter address until clock 11, and it can't begin fetching the instruction at the target address until clock 12, so the total penalty for the branch taken is only 5 cycles: the four we lost by fetching the 2SI and 3SI instructions, and the one it had to wait before proceeding with the new PC. Again, if you assumed that 4SI was fetched, it would be 6 cycles.

| Clock | Inst Fetch | Inst Decode | Op Fetch | Execute | Op Store |
|-------|------------|-------------|----------|---------|----------|
| 1 | CC | | | | |
| 2 | CC | | | | |
| 3 | BR | CC | | | |
| 4 | BR | | CC | | |
| 5 | NSI | BR | | CC | |
| 6 | NSI | | BR | CC | |
| 7 | 2SI | NSI | | BR | CC |
| 8 | 2SI | | NSI | BR | |
| 9 | BT | | | NSI | BR |
| 10 | BT | | | NSI | |
| 11 | 2T | BT | | | NSI |
| 12 | 2T | | BT | | |
| 13 | 3T | 2T | | BT | |

• d) Internal forwarding allows us to forward the condition code result directly from the CC Execute stage (clock 6) to the branch Execute stage (clock 7), so we don't delay the branch. We can also forward the Branch Target address directly from the output of the Branch Execute stage (clock 8) to the Instruction Fetch unit, so we don't lose the branch Operand Store cycle in clock 9. We still need to dump the 2SI instruction that we pre-fetched, so the total penalty for the delayed branch taken with internal forwarding is only 2 cycles.
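The branch-taken penalties above reduce to a simple tally. This back-of-the-envelope sketch is my own; the counts of dumped instructions and wait clocks come from the reservation tables, and the 2-clock fetch time is from question 6:

```python
fetch_clocks = 2   # instruction fetch takes 2 clocks (question 6)

# b) branch taken, no delayed branch: NSI, 2SI and 3SI are all discarded
#    (6 wasted clocks) plus 1 clock waiting for the new program counter.
penalty_b = 3 * fetch_clocks + 1

# c) delayed branch: NSI completes, so only 2SI and 3SI are discarded,
#    plus the same 1-clock wait for the new program counter.
penalty_c = 2 * fetch_clocks + 1

# d) delayed branch with internal forwarding: only 2SI is discarded, and
#    the branch target address is forwarded, so there is no wait clock.
penalty_d = 1 * fetch_clocks

print(penalty_b, penalty_c, penalty_d)   # 7 5 2
```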

Question 8. Why would you implement a branch history table in a pipelined computer? • A branch history table gives you a better guess than random on whether or not a conditional branch will be taken. The assumption is that recent history is a good predictor of the near future, the same idea that the LRU cache replacement algorithm is based on. If we have a long instruction pipeline, a good guess will reduce the number of times we have to discard instructions that we prefetch and start into the pipeline following a conditional branch.

Question 9. What problem is speculative execution trying to solve? • Speculative execution is another strategy used to reduce the effects of conditional branches. Rather than guessing which way a branch will go and fetching instructions only along one path, we proceed to fetch, decode, and begin execution of instructions along both paths. Results from both instruction streams are tentative until we know which way the branch goes. When the outcome of the branch is known, the tentative results from the path not taken are discarded and the results from the path taken are made permanent.

Question 10. What do we mean when we say a computer is superscalar? • A superscalar computer executes more than one instruction per clock tick. This is achieved by having more than one pipeline and allowing instructions without dependencies on one another to proceed in parallel through the separate pipelines.

Question 11. What is a greedy cycle? • The greedy cycle arises from initiating a new instruction into the pipeline at the first opportunity in each state. The greedy cycle is also the maximum-rate cycle in many cases, but not necessarily.
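To make the branch history table idea concrete, here is a toy predictor of my own (not from the course notes) using the classic 2-bit saturating counter per entry, indexed by the low-order bits of the branch address:

```python
class BranchHistoryTable:
    """Toy BHT: one 2-bit saturating counter (states 0-3) per entry."""

    def __init__(self, size=1024):
        self.size = size
        self.counters = [2] * size          # start in "weakly taken"

    def _index(self, pc):
        return pc % self.size               # low-order PC bits pick the entry

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch: taken 8 times, not taken once at loop exit, taken 8 more.
bht = BranchHistoryTable()
outcomes = [True] * 8 + [False] + [True] * 8
correct = 0
for o in outcomes:
    correct += (bht.predict(0x400) == o)    # 0x400 is an arbitrary example PC
    bht.update(0x400, o)
print(correct, "of", len(outcomes))         # prints: 16 of 17
```

The 2-bit counter mispredicts only once on the loop exit: a single not-taken outcome moves it from "strongly taken" to "weakly taken" without flipping the prediction, which is exactly the recent-history assumption described above.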