Pipeline Processor Design

Beyond Pipeline Architecture
Virendra Singh, Computer Design and Test Lab, Indian Institute of Science, Bangalore
virendra@computer.org

Advance Computer Architecture

Branch Hazard
Consider the heuristic "branch not taken": continue fetching instructions in sequence following the branch instruction. If the branch is taken (indicated by a zero output of the ALU):
• Control generates the branch signal in the ID cycle.
• branch activates the PCSource signal in the MEM cycle to load PC with the new branch address.
• Three instructions in the pipeline must be flushed if the branch is taken – can this penalty be reduced?

Pipeline Flush
If branch is taken (as indicated by zero), then control does the following:
• Change all control signals to 0, similar to the case of a stall for a data hazard, i.e., insert a bubble in the pipeline.
• Generate a signal IF.Flush that changes the instruction in the pipeline register IF/ID to 0 (nop).

Penalty of branch hazard is reduced by
• Adding branch detection and address generation hardware in the decode cycle – only one bubble needed – next-address generation logic in the decode stage writes PC+4, the branch address, or the jump address into PC.
• Using branch prediction.

Branch Prediction Useful for program loops. branch was taken • History bit = 0. A one-bit prediction scheme: a one-bit buffer carries a “history bit” that tells what happened on the last branch instruction • History bit = 1. Advance Computer Architecture SE-273@SERC 4 . 2011 14. branch was not taken taken Predict branch taken 1 taken Not taken Predict branch not taken 0 Not taken Feb 26.

Branch Prediction
(Diagram: a prediction buffer indexed by the low-order bits of the PC. Each entry holds the address of a recent branch instruction, its target address, and its history bit(s). If the current PC matches a stored branch address, prediction logic selects either PC+4 or the stored target address as the next PC.)

Branch Prediction for a Loop
Program:
1: I = 0
2: I = I + 1
3: X = X + R(I)
4: I – 10 = 0? (N: branch back to 2, Y: fall through)
5: Store X in memory

Execution of instruction 4 with a one-bit history bit (h.bit = 1: branch taken, h.bit = 0: branch not taken):

Exec. seq. | Old h.bit | Actual branch | New h.bit | Next instr. | Prediction
1          | 0         | taken         | 1         | 2           | Bad
2–9        | 1         | taken         | 1         | 2           | Good
10         | 1         | not taken     | 0         | 5           | Bad

Two-Bit Prediction Buffer
Can improve correct prediction statistics. Four states:
• 11: predict branch taken
• 10: predict branch taken
• 01: predict branch not taken
• 00: predict branch not taken
(State diagram: a taken branch moves the state toward 11, a not-taken branch toward 00, so the prediction changes only after two consecutive mispredictions.)

Branch Prediction for a Loop
Same program as before; execution of instruction 4 with a two-bit prediction buffer, initially 10:

Exec. seq. | Old buffer | Actual branch | New buffer | Next instr. | Prediction
1          | 10         | taken         | 11         | 2           | Good
2–9        | 11         | taken         | 11         | 2           | Good
10         | 11         | not taken     | 10         | 5           | Bad
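The two-bit buffer can be sketched as a saturating counter, one common realization of the four-state diagram above (the exact transition structure of the slide's diagram may differ slightly; this variant agrees with the loop table):

```python
# Two-bit saturating counter: states 00/01 predict not taken,
# states 10/11 predict taken. A taken branch increments toward 11,
# a not-taken branch decrements toward 00, so the prediction flips
# only after two consecutive mispredictions.
def simulate_two_bit(outcomes, counter=0b10):
    correct = 0
    for taken in outcomes:
        prediction = 1 if counter >= 0b10 else 0
        if prediction == taken:
            correct += 1
        counter = min(counter + 1, 0b11) if taken else max(counter - 1, 0b00)
    return correct

# Same loop branch: taken 9 times, not taken on exit.
outcomes = [1] * 9 + [0]
print(simulate_two_bit(outcomes))   # 9 correct: only the exit mispredicts
```

On repeated visits to the loop the advantage persists: the counter leaves the loop at state 10 (still predicting taken), so re-entering the loop does not cost a second misprediction the way the one-bit scheme does.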

Exceptions
A typical exception occurs when the ALU produces an overflow signal. Overflow is detected in the EX cycle. Control asserts the following actions on an exception (similar to the data hazard stall and pipeline flush):
• Change the PC to 4000 0040hex, the location of the exception routine. This is done by adding an additional input to the PC input multiplexer.
• Set IF/ID to 0 (nop).
• Generate ID.Flush and EX.Flush signals to set all control signals to 0 in the ID/EX and EX/MEM registers. This also prevents the ALU result (presumed contaminated) from being written in the WB cycle.

Limits of Pipelining
IBM RISC experience:
• Control and data dependences add 15%
• Best case CPI of 1.15, IPC of 0.87
• Deeper pipelines (higher frequency) magnify dependence penalties
This analysis assumes 100% cache hit rates:
• Hit rates approach 100% for some programs
• Many important programs have much worse hit rates
• Later!

Processor Performance
Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle)
               (code size)               (CPI)                   (cycle time)
• In the 1980s (decade of pipelining): CPI: 5.0 => 1.15
• In the 1990s (decade of superscalar): CPI: 1.15 => 0.5 (best case)
• In the 2000s (decade of multicore): marginal CPI improvement
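The performance equation above multiplies out directly. A tiny sketch, with illustrative numbers (a hypothetical 10^9-instruction program at a fixed 1 ns clock) chosen only to show the CPI term at work:

```python
# Iron law of processor performance:
# Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)
def execution_time(instructions, cpi, cycle_time_ns):
    """Total execution time in nanoseconds."""
    return instructions * cpi * cycle_time_ns

# Same program, same clock, as CPI improves from 5.0 to 1.15:
for cpi in (5.0, 1.15):
    print(cpi, execution_time(1_000_000_000, cpi, 1.0))
```

Pipelining attacks the middle term (CPI) without changing the instruction count, which is why the decades above are summarized by their CPI trajectories.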

Amdahl’s Law
(Diagram: processors 1 to N over time; the serial fraction h runs on one processor, the vectorizable fraction f on all N.)
h = fraction of time in serial code
f = fraction that is vectorizable
v = speedup for f
Overall speedup:
Speedup = 1 / ((1 − f) + f/v)

2011 14. of Processors 1 Feb 26.Revisit Amdahl’s Law Sequential bottleneck Even if v is infinite 1 lim = v →∞ f 1− f 1− f + v 1 • Performance limited by nonvectorizable portion (1-f) N No. h 1-h 1-f f Time 13 Advance Computer Architecture SE-273@SERC .

Pipelined Performance Model
(Diagram: pipeline depth 1 to N over time; the pipeline is filled for fraction g of the time and unfilled for fraction 1 − g.)
g = fraction of time pipeline is filled
1 − g = fraction of time pipeline is not filled (stalled)


Pipelined Performance Model
Tyranny of Amdahl’s Law [Bob Colwell]:
• When g is even slightly below 100%, a big performance hit will result
• Stalled cycles are the key adversary and must be minimized as much as possible
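The pipelined model has the same algebraic shape as Amdahl's Law, with the filled fraction g playing the role of f and the pipeline depth N the role of v. A sketch under that reading (the depth N = 20 is an illustrative value, not from the slides):

```python
# Pipelined performance model, Amdahl-style:
# speedup = 1 / ((1 - g) + g / N), where g is the fraction of time the
# pipeline is filled and N is the pipeline depth.
def pipeline_speedup(g, depth):
    return 1.0 / ((1.0 - g) + g / depth)

# "Tyranny of Amdahl's Law": g only slightly below 100% already hurts.
for g in (1.00, 0.99, 0.95, 0.90):
    print(g, round(pipeline_speedup(g, 20), 2))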

Motivation for Superscalar [Agerwala and Cocke]
Speedup jumps from 3 to 4.3 for N = 6, f = 0.8, with s = 2 instead of s = 1 (scalar). (Plot shows the typical range.)

Superscalar Proposal Moderate tyranny of Amdahl’s Law • Ease sequential bottleneck • More generally applicable • Robust (less sensitive to f) • Revised Amdahl’s Law: Speedup Feb 26. 2011 14. 1 = (1 − f ) + f s v Advance Computer Architecture SE-273@SERC 18 .

Limits on Instruction Level Parallelism (ILP)
Weiss and Smith [1984]        1.58
Sohi and Vajapeyam [1987]     1.81
Tjaden and Flynn [1970]       1.86 (Flynn’s bottleneck)
Tjaden and Flynn [1973]       1.96
Uht [1986]                    2.00
Smith et al. [1989]           2.00
Jouppi and Wall [1988]        2.40
Johnson [1991]                2.50
Acosta et al. [1986]          2.79
Wedig [1982]                  3.00
Butler et al. [1991]          5.8
Melvin and Patt [1991]        6
Wall [1991]                   7 (Jouppi disagreed)
Kuck et al. [1972]            8
Riseman and Foster [1972]     51 (no control dependences)
Nicolau and Fisher [1984]     90 (Fisher’s optimism)

Superscalar Proposal
Go beyond the single instruction pipeline, achieve IPC > 1:
• Dispatch multiple instructions per cycle
• Provide a more generally applicable form of concurrency (not just vectors)
• Geared for sequential code that is hard to parallelize otherwise
• Exploit fine-grained or instruction-level parallelism (ILP)

Classifying ILP Machines [Jouppi, DECWRL 1991]
Baseline scalar RISC:
• Issue parallelism = IP = 1
• Operation latency = OP = 1
• Peak IPC = 1
(Diagram: successive instructions flowing through IF, DE, EX, WB, one issued per baseline machine cycle.)

Beyond Pipelining
The time taken by a program is constrained by:
• The number of instructions required to execute the program
• The average number of cycles required to execute an instruction
• The processor cycle time

Beyond Pipelining
Superscalar Processor: reduces the average number of cycles per instruction beyond what is possible in a pipelined scalar RISC processor by allowing concurrent execution of instructions in the same pipeline stage as well as in different pipeline stages.
• Emphasizes multiple concurrent operations on scalar quantities
• There are many hurdles to supporting it

Superscalar Architecture
Superscalar Processor: a simple concept – a wide pipeline – but the instructions are not independent.
Superscalar architecture is the natural descendant of pipelined scalar RISC. Superscalar techniques largely concern the processor organization, independent of the ISA and other architectural features. Thus, it is possible to develop a superscalar processor that is code compatible with an existing architecture.

Superscalar Architecture
Fundamental Limitations:
• True data dependencies: if an instruction uses a value produced by a previous instruction, the second instruction has a true data dependency on the first and must be delayed.
• Procedural dependencies: due to changes in the program flow.
• Resource conflicts: arise when two instructions want to use the same resource at the same time; can be eliminated by resource duplication.
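A true data dependency is simple to state as a check between two instructions. The sketch below uses an illustrative encoding of an instruction as (destination register, source registers); the names are hypothetical, not from the slides.

```python
# Toy check for a true (read-after-write) data dependency between two
# instructions, each encoded as (dest_register, [source_registers]).
def true_dependency(producer, consumer):
    """True if consumer reads a register that producer writes."""
    dest, _ = producer
    _, sources = consumer
    return dest in sources

i1 = ("r1", ["r2", "r3"])    # r1 = r2 + r3
i2 = ("r4", ["r1", "r5"])    # r4 = r1 + r5 -- reads r1 written by i1
print(true_dependency(i1, i2))   # True: i2 must be delayed (or forwarded to)
```

A superscalar issue stage must perform checks of exactly this flavor (in hardware, across all instructions considered for issue in a cycle) before executing instructions concurrently.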

Superscalar Architecture
Instruction parallelism and machine parallelism:
Instruction parallelism of a program is a measure of the average number of instructions that a superscalar processor might be able to execute at the same time. It is mostly determined by the number of true dependencies and the number of branches in relation to the other instructions.

Superscalar Architecture
Instruction parallelism and machine parallelism:
Machine parallelism of a processor is a measure of the ability of the processor to take advantage of the ILP. It is determined by the number of instructions that can be fetched and executed at the same time. A challenge in the design of a superscalar processor is to achieve a good balance between instruction parallelism and machine parallelism.

Superscalar Architecture
Instruction issue and machine parallelism:
• ILP is not necessarily exploited by widening the pipelines and adding more resources.
• The processor’s policies for fetching, decoding, and executing instructions have a significant effect on its ability to discover instructions that can be executed concurrently.
• Instruction issue refers to the process of initiating instruction execution.
• The instruction issue policy limits or enhances performance because it determines the processor’s lookahead capability.

Superscalar Pipelines
(Diagram: pipeline stages IF, ID, RD, ALU, MEM, WB.)

Superscalar Pipelines
Dynamic Pipelines:
1. Alleviate the limitations of the pipelined implementation
2. Use diversified pipelines
3. Temporal machine parallelism

Superscalar Pipelines (Diversified)
(Diagram: IF, ID, RD stages feeding diversified execution pipelines – ALU; Mem1–Mem2; FP1–FP3; BR – converging at WB.)

Superscalar Pipelines (Diversified)
Diversified Pipelines:
• Each pipeline can be customized for a particular instruction type
• Each instruction type incurs only the necessary latency
• Certainly less expensive than identical copies
• If all inter-instruction dependencies are resolved, then there is no stall after instruction issue
Require special consideration:
• Number and mix of functional units

Superscalar Pipelines (Dynamic Pipelines)
Dynamic Pipelines:
• Buffers are needed
• Multi-entry buffers: every entry is hardwired to one read port and one write port
• Complex multi-entry buffers minimize stalls

Thank You
