Beyond Pipeline Architecture
Virendra Singh Computer Design and Test Lab. Indian Institute of Science Bangalore
Advance Computer Architecture
Consider the heuristic: predict branch not taken. Continue fetching instructions in sequence following the branch instruction. If the branch is taken (indicated by zero output of the ALU):
• Control generates the branch signal in the ID cycle.
• branch activates the PCSource signal in the MEM cycle to load the PC with the new branch address.
• Three instructions in the pipeline must be flushed if the branch is taken. Can this penalty be reduced?
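The flush counts above can be checked with a small sketch (an illustrative helper, not hardware from the lecture, assuming the classic 5-stage IF-ID-EX-MEM-WB pipeline with one instruction fetched per cycle behind the branch):

```python
# Number of wrong-path instructions already in the pipeline when a taken
# branch resolves in a given stage (IF=1, ID=2, EX=3, MEM=4, WB=5).
def flush_penalty(resolve_stage):
    # One younger instruction entered the pipeline in each cycle after the fetch.
    return resolve_stage - 1

print(flush_penalty(4))  # 3 instructions flushed when the branch resolves in MEM
print(flush_penalty(2))  # 1 instruction flushed when it resolves in ID instead
```

Resolving the branch earlier is exactly the optimization the following slides pursue.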
If the branch is taken (as indicated by zero), then control does the following:
• Changes all control signals to 0, similar to the stall for a data hazard, i.e., inserts a bubble into the pipeline.
• Generates a signal IF.Flush that changes the instruction in the pipeline register IF/ID to 0 (nop).
The penalty of the branch hazard is reduced by:
• Adding branch detection and address generation hardware in the decode cycle, so only one bubble is needed: next-address generation logic in the decode stage writes PC+4, the branch address, or the jump address into the PC.
• Using branch prediction.
Branch Prediction
Useful for program loops. A one-bit prediction scheme: a one-bit buffer carries a "history bit" that tells what happened on the last branch instruction.
• History bit = 1: branch was taken. Predict branch taken (state 1).
• History bit = 0: branch was not taken. Predict branch not taken (state 0).
A taken branch sets the bit to 1; a not-taken branch sets it to 0.
(Feb 26, 2011, SE-273@SERC)
Branch Prediction
A prediction buffer holds the addresses of recent branch instructions, their target addresses, and history bit(s). The low-order bits of the PC are used as the index; if the stored address matches the PC (=), the prediction logic selects either PC+4 or the stored target address as the next PC.
Branch Prediction for a Loop
Loop: 1. I = 0  2. I = I + 1  3. X = X + R(I)  4. I - 10 = 0? (N: branch back to 2; Y: fall through)  5. Store X in memory.
Instruction 4 is the branch: taken on executions 1-9 (next instruction 2), not taken on execution 10 (next instruction 5).
One-bit prediction for executions of instruction 4 (history bit initially 0):
• Execution 1: old bit 0, predict not taken, actually taken, new bit 1: Bad
• Executions 2-9: old bit 1, predict taken, actually taken, new bit 1: Good
• Execution 10: old bit 1, predict taken, actually not taken, new bit 0: Bad
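The one-bit column of this loop example can be reproduced with a short Python sketch (a reconstruction under the table's assumption that the history bit starts at 0):

```python
# One-bit predictor on the loop branch: taken on executions 1-9, not taken on 10.
bit = 0                               # history bit, initially 0 as in the table
outcomes = [True] * 9 + [False]       # actual branch behavior per execution
results = []
for taken in outcomes:
    predicted = (bit == 1)            # predict whatever happened last time
    results.append("Good" if predicted == taken else "Bad")
    bit = 1 if taken else 0           # remember this outcome
print(results)  # Bad on loop entry (execution 1) and exit (10), Good in between
```

The two mispredictions per pass through the loop are what the two-bit scheme on the next slides improves.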
Two-Bit Prediction Buffer
Can improve correct prediction statistics. Two history bits form four states: 11 and 10 predict branch taken; 01 and 00 predict branch not taken. A taken branch moves the state toward 11 (00, 01, 10, 11); a not-taken branch moves it toward 00. The prediction changes only after two mispredictions in a row.
Branch Prediction for a Loop
Same loop as before (branch taken on executions 1-9, not taken on 10); two-bit prediction buffer initially 10:
• Execution 1: old buffer 10, predict taken, actually taken, new buffer 11: Good
• Executions 2-9: old buffer 11, predict taken, actually taken, new buffer 11: Good
• Execution 10: old buffer 11, predict taken, actually not taken, new buffer 10: Bad
With the two-bit buffer the loop branch mispredicts only once per pass (the loop exit), instead of twice with the one-bit scheme.
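Assuming the standard saturating-counter encoding (00/01 predict not taken, 10/11 predict taken; this sketch and its function name are ours, not from the lecture), the two schemes can be compared over repeated passes through the same loop:

```python
def mispredictions(pattern, passes, scheme="two_bit"):
    # Initial states taken from the tables: history bit 0, two-bit buffer 10 (= 2).
    state = 2 if scheme == "two_bit" else 0
    bad = 0
    for _ in range(passes):
        for taken in pattern:
            if scheme == "one_bit":
                predicted = (state == 1)
                state = 1 if taken else 0
            else:
                predicted = (state >= 2)   # states 10 and 11 predict taken
                state = min(3, state + 1) if taken else max(0, state - 1)
            if predicted != taken:
                bad += 1
    return bad

loop = [True] * 9 + [False]   # branch taken 9 times, then falls out of the loop
print(mispredictions(loop, 3, "one_bit"))  # 6: two bad guesses per pass
print(mispredictions(loop, 3, "two_bit"))  # 3: one bad guess per pass (exit only)
```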
Exceptions
A typical exception occurs when the ALU produces an overflow signal. Overflow is detected in the EX cycle. Handling is similar to a data hazard and pipeline flush. Control asserts the following actions on an exception:
• Change the PC address to 4000 0040hex, the location of the exception routine. This is done by adding an additional input to the PC input multiplexer.
• Set IF/ID to 0 (nop).
• Generate ID.Flush and EX.Flush signals to set all control signals to 0 in the ID/EX and EX/MEM registers. This also prevents the ALU result (presumed contaminated) from being written in the WB cycle.
Limits of Pipelining
IBM RISC Experience:
• Control and data dependences add 15%
• Best case CPI of 1.15, IPC of 0.87
• Deeper pipelines (higher frequency) magnify dependence penalties
This analysis assumes 100% cache hit rates:
• Hit rates approach 100% for some programs
• Many important programs have much worse hit rates
• Later!
Processor Performance
Processor Performance = Time / Program = (Instructions / Program) × (Cycles / Instruction) × (Time / Cycle)
i.e., code size × CPI × cycle time.
In the 1980's (decade of pipelining): CPI: 5.0 => 1.15
In the 1990's (decade of superscalar): CPI: 1.15 => 0.5 (best case)
In the 2000's (decade of multicore): marginal CPI improvement
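The product above can be evaluated directly; the workload size and clock below are illustrative numbers, not from the lecture:

```python
def exec_time(instructions, cpi, cycle_time_ns):
    """Time/Program = (Instructions/Program) x (Cycles/Instruction) x (Time/Cycle)."""
    return instructions * cpi * cycle_time_ns

# Hypothetical 1e9-instruction program on a 1 ns clock.
t_start = exec_time(1e9, 5.0, 1.0)    # CPI 5.0: pre-pipelining starting point
t_piped = exec_time(1e9, 1.15, 1.0)   # CPI 1.15: pipelined best case
print(t_start / t_piped)              # ~4.35x improvement from CPI reduction alone
```

Holding instruction count and cycle time fixed isolates the CPI term, which is the axis each "decade" on this slide attacks.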
Amdahl's Law
With N processors: h = fraction of time in serial code; f = fraction that is vectorizable; v = speedup for f.
Overall speedup:
Speedup = 1 / ((1 - f) + f/v)
Revisit Amdahl's Law: Sequential Bottleneck
Even if v is infinite:
lim (v -> infinity) 1 / ((1 - f) + f/v) = 1 / (1 - f)
Performance is limited by the nonvectorizable portion (1 - f).
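Both the formula and its limit can be checked numerically (a minimal sketch; the function name is ours):

```python
def amdahl_speedup(f, v):
    # Overall speedup when fraction f of the work is sped up by factor v.
    return 1.0 / ((1.0 - f) + f / v)

print(amdahl_speedup(0.8, 10))   # ~3.57: far below the 10x applied to f
print(amdahl_speedup(0.8, 1e9))  # ~5.0: approaches 1/(1-f), the sequential bottleneck
```

Even an effectively infinite v cannot push the speedup past 1/(1-f) = 5 when 20% of the work stays serial.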
Pipelined Performance Model
Pipeline depth N; g = fraction of time the pipeline is filled; 1 - g = fraction of time the pipeline is not filled (stalled).
Pipelined Performance Model: Tyranny of Amdahl's Law [Bob Colwell]
• When g is even slightly below 100%, a big performance hit will result
• Stalled cycles are the key adversary and must be minimized as much as possible
Motivation for Superscalar [Agerwala and Cocke]
Speedup jumps from 3 to 4.3 for N = 6 and f = 0.8 when s = 2 instead of s = 1 (scalar). (The figure marks the typical range of f.)
Superscalar Proposal
Moderate the tyranny of Amdahl's Law:
• Ease the sequential bottleneck
• More generally applicable
• Robust (less sensitive to f)
Revised Amdahl's Law:
Speedup = 1 / ((1 - f)/s + f/v)
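Plugging the numbers from the Agerwala-and-Cocke slide into the revised formula reproduces the cited jump (sketch; the function name is ours):

```python
def revised_speedup(f, v, s):
    # Revised Amdahl's Law: the non-vectorizable fraction also runs s-wide.
    return 1.0 / ((1.0 - f) / s + f / v)

print(revised_speedup(0.8, 6, 1))  # 3.0: scalar baseline (s = 1)
print(revised_speedup(0.8, 6, 2))  # ~4.29: the jump to about 4.3 with s = 2
```

Because s divides the (1 - f) term, even modest superscalar width attacks the sequential bottleneck itself, which is why the result is less sensitive to f.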
Limits on Instruction Level Parallelism (ILP)
• Weiss and Smith: 1.58
• Sohi and Vajapeyam: 1.81
• Tjaden and Flynn: 1.86 (Flynn's bottleneck)
• Tjaden and Flynn: 1.96
• Uht: 2.00
• Smith et al.: 2.00
• Jouppi and Wall: 2.40
• Johnson: 2.50
• Acosta et al.: 2.79
• Wedig: 3.00
• Butler et al.: 5.8
• Melvin and Patt: 6
• Wall: 7 (Jouppi disagreed)
• Kuck et al.: 8
• Riseman and Foster: 51 (no control dependences)
• Nicolau and Fisher: 90 (Fisher's optimism)
Superscalar Proposal
Go beyond the single-instruction pipeline and achieve IPC > 1:
• Dispatch multiple instructions per cycle
• Provide a more generally applicable form of concurrency (not just vectors)
• Geared for sequential code that is hard to parallelize otherwise
• Exploit fine-grained or instruction-level parallelism (ILP)
Classifying ILP Machines [Jouppi, DECWRL 1991]
Baseline scalar RISC:
• Issue parallelism = IP = 1
• Operation latency = OP = 1
• Peak IPC = 1
(Diagram: successive instructions 1-6 flowing through IF, DE, EX, WB, one new instruction per cycle of the baseline machine.)
Beyond Pipelining
The time taken by a program is constrained by:
• The number of instructions required to execute the program
• The average number of cycles required to execute an instruction
• The processor cycle time
Beyond Pipelining: Superscalar Processor
Reduces the average number of cycles per instruction beyond what is possible in a pipelined scalar RISC processor by allowing concurrent execution of instructions in the same pipeline stage, as well as concurrent execution of instructions in different pipeline stages. Emphasizes multiple concurrent operations on scalar quantities. There are many hurdles to supporting it.
Superscalar Architecture: Superscalar Processor
Simple concept: a wide pipeline. But instructions are not independent. Superscalar architecture is a natural descendant of the pipelined scalar RISC. Superscalar techniques largely concern the processor organization, independent of the ISA and the other architectural features. Thus, it is possible to develop a superscalar processor that is code compatible with an existing architecture.
Superscalar Architecture: Fundamental Limitations
• True data dependencies: if an instruction uses a value produced by a previous instruction, the second instruction has a true data dependency on the first and must be delayed.
• Procedural dependencies: due to changes in the program flow.
• Resource conflicts: arise when two instructions want to use the same resource at the same time; can be eliminated by resource duplication.
Superscalar Architecture: Instruction Parallelism and Machine Parallelism
The instruction parallelism of a program is a measure of the average number of instructions that a superscalar processor might be able to execute at the same time. Mostly, ILP is determined by the number of true dependencies and the number of branches in relation to the other instructions.
The machine parallelism of a processor is a measure of the ability of the processor to take advantage of the ILP. It is determined by the number of instructions that can be fetched and executed at the same time. A challenge in the design of a superscalar processor is to achieve a good balance between instruction parallelism and machine parallelism.
Superscalar Architecture: Instruction Issue and Machine Parallelism
ILP is not necessarily exploited by widening the pipelines and adding more resources. Processor policies towards fetching, decoding, and executing instructions have a significant effect on its ability to discover instructions which can be executed concurrently. Instruction issue refers to the process of initiating instruction execution. The instruction issue policy limits or enhances performance because it determines the processor's lookahead capability.
Superscalar Pipelines
(Diagram: a pipeline with stages IF, ID, RD, ALU, MEM, WB.)
Superscalar Pipelines: Dynamic Pipelines
1. Alleviate the limitations of pipelined implementation
2. Use diversified pipelines
3. Temporal machine parallelism
Superscalar Pipelines (Diversified)
(Diagram: after IF, ID, and RD, execution diversifies into parallel pipelines: ALU; Mem1/Mem2; FP1/FP2/FP3; BR; all converging at WB.)
Superscalar Pipelines (Diversified): Diversified Pipelines
• Each pipeline can be customized for a particular instruction type
• Each instruction type incurs only the necessary latency
• Certainly less expensive than identical copies
• If all inter-instruction dependencies are resolved, then there is no stall after instruction issue
• Requires special consideration of the number and mix of functional units
Superscalar Pipelines (Dynamic Pipelines): Dynamic Pipelines
• Buffers are needed: multi-entry buffers
• Every entry is hardwired to one read port and one write port
• Complex multi-entry buffers minimize stalls
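How a multi-entry buffer minimizes stalls can be sketched in Python (an assumed, simplified wakeup/select design; the class and method names are illustrative, not the lecture's hardware):

```python
# A multi-entry issue buffer: instructions wait in entries until their source
# operands are ready, so a stalled instruction no longer blocks younger ones.
class IssueBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = []          # each entry: (name, set of unready source regs)

    def insert(self, name, unready_sources):
        if len(self.entries) >= self.size:
            return False           # buffer full: the front of the pipeline stalls
        self.entries.append((name, set(unready_sources)))
        return True

    def wakeup(self, reg):
        """A result broadcast marks operand `reg` ready in every waiting entry."""
        self.entries = [(n, srcs - {reg}) for n, srcs in self.entries]

    def issue(self):
        """Issue the oldest entry whose operands are all ready (out of order)."""
        for i, (name, srcs) in enumerate(self.entries):
            if not srcs:
                self.entries.pop(i)
                return name
        return None                # every entry is waiting: no issue this cycle

buf = IssueBuffer(4)
buf.insert("add r3,r1,r2", {"r1"})   # waits on r1
buf.insert("sub r5,r4,r0", set())    # ready immediately
print(buf.issue())                   # the sub issues past the stalled add
buf.wakeup("r1")
print(buf.issue())                   # now the add can issue
```

The buffer decouples instruction fetch from execution: stalls are absorbed by the entries rather than propagated backwards through the pipeline.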
Thank You