
MIPS Pipeline

- Five stages, one step per stage
  1. IF: Instruction fetch from memory
  2. ID: Instruction decode & register read
  3. EX: Execute operation or calculate address
  4. MEM: Access memory operand
  5. WB: Write result back to register

Chapter 4 — The Processor — 1

Pipeline Performance
- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps


Pipeline Performance
[Figure: instruction timing, single-cycle (Tc = 800ps) versus pipelined (Tc = 200ps)]
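
As a rough illustration of the two clock periods above, the sketch below totals the execution time of a short instruction sequence on each datapath. The function names, the three-instruction example, and the no-stall assumption are ours, not the slides'.

```python
def single_cycle_time(n_instructions, cycle_ps=800):
    """Total time when every instruction takes one long clock cycle."""
    return n_instructions * cycle_ps


def pipelined_time(n_instructions, stages=5, cycle_ps=200):
    """Total time with no stalls: fill the pipeline once, then finish
    one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return (stages + n_instructions - 1) * cycle_ps


if __name__ == "__main__":
    # Hypothetical example: three back-to-back instructions.
    print(single_cycle_time(3), "ps single-cycle")
    print(pipelined_time(3), "ps pipelined")
```

For long sequences the fill time becomes negligible and the ratio approaches 800/200 = 4.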


Pipeline Speedup
- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions (pipelined)
    = Time between instructions (nonpipelined) / Number of stages
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease
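
A minimal sketch of the speedup relation, assuming the nonpipelined clock must cover the sum of all stage times and the pipelined clock is set by the slowest stage. The function name is hypothetical.

```python
def steady_state_speedup(stage_times_ps):
    """Ratio of time between instructions, nonpipelined over pipelined.

    Assumption: nonpipelined period = sum of stage times,
    pipelined period = slowest stage.
    """
    nonpipelined = sum(stage_times_ps)
    pipelined = max(stage_times_ps)
    return nonpipelined / pipelined


if __name__ == "__main__":
    # Balanced stages: speedup equals the number of stages.
    print(steady_state_speedup([200, 200, 200, 200, 200]))
    # Stage times from the lw row of the earlier table: speedup is less.
    print(steady_state_speedup([200, 100, 200, 200, 100]))
```

With five balanced stages the ratio is 5; with the unbalanced stage times it drops to 4.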

Pipelining and ISA Design
- MIPS ISA designed for pipelining
  - All instructions are 32 bits
    - Easier to fetch and decode in one cycle
    - c.f. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3rd stage, access memory in 4th stage
  - Alignment of memory operands
    - Memory access takes only one cycle

Hazards
- Situations that prevent starting the next instruction in the next cycle
- Structure hazards
  - A required resource is busy
- Data hazard
  - Need to wait for previous instruction to complete its data read/write
- Control hazard
  - Deciding on control action depends on previous instruction

Structure Hazards
- Conflict for use of a resource
- In MIPS pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to stall for that cycle
    - Would cause a pipeline “bubble”
- Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches

Data Hazards
- An instruction depends on completion of data access by a previous instruction
    add $s0, $t0, $t1
    sub $t2, $s0, $t3

Forwarding (aka Bypassing)
- Use result when it is computed
  - Don’t wait for it to be stored in a register
  - Requires extra connections in the datapath

Load-Use Data Hazard
- Can’t always avoid stalls by forwarding
  - If value not computed when needed
  - Can’t forward backward in time!

Code Scheduling to Avoid Stalls
- Reorder code to avoid use of load result in the next instruction
- C code for A = B + E; C = B + F;

    Original order (13 cycles)      Reordered (11 cycles)
          lw   $t1, 0($t0)          lw   $t1, 0($t0)
          lw   $t2, 4($t0)          lw   $t2, 4($t0)
    stall add  $t3, $t1, $t2        lw   $t4, 8($t0)
          sw   $t3, 12($t0)         add  $t3, $t1, $t2
          lw   $t4, 8($t0)          sw   $t3, 12($t0)
    stall add  $t5, $t1, $t4        add  $t5, $t1, $t4
          sw   $t5, 16($t0)         sw   $t5, 16($t0)
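
The cycle counts on this slide can be reproduced with a small model. The sketch below assumes a five-stage pipeline with full forwarding, where the only stall is one bubble when an instruction reads a register loaded by the instruction immediately before it. The tuple encoding `(op, dest, sources)` and the function name are hypothetical.

```python
# Each instruction: (opcode, destination register or None, source registers).
ORIGINAL = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

REORDERED = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]


def count_cycles(program, stages=5):
    """Cycles to run `program`, charging one stall per load-use pair."""
    if not program:
        return 0
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1  # loaded value is not ready for the next instruction
    # (stages - 1) cycles to fill the pipeline, then one cycle per
    # instruction, plus one cycle per bubble.
    return (stages - 1) + len(program) + stalls


if __name__ == "__main__":
    print(count_cycles(ORIGINAL), count_cycles(REORDERED))
```

Moving the third `lw` up separates each load from its first use, which removes both bubbles.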

Control Hazards
- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can’t always fetch correct instruction
    - Still working on ID stage of branch
- In MIPS pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage

Stall on Branch
- Wait until branch outcome determined before fetching next instruction

Branch Prediction
- Longer pipelines can’t readily determine branch outcome early
  - Stall penalty becomes unacceptable
- Predict outcome of branch
  - Only stall if prediction is wrong
- In MIPS pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

MIPS with Predict Not Taken
[Figure: two pipeline diagrams, labeled “Prediction correct” and “Prediction incorrect”]

More-Realistic Branch Prediction
- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

Pipeline Summary
The BIG Picture
- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation

§4.6 Pipelined Datapath and Control

MIPS Pipelined Datapath
[Figure: pipelined datapath with the MEM and WB stages marked; right-to-left flow leads to hazards]

Pipeline registers
- Need registers between stages
  - To hold information produced in previous cycle

Pipeline Operation
- Cycle-by-cycle flow of instructions through the pipelined datapath
  - “Single-clock-cycle” pipeline diagram
    - Shows pipeline usage in a single cycle
    - Highlight resources used
  - c.f. “multi-clock-cycle” diagram
    - Graph of operation over time
- We’ll look at “single-clock-cycle” diagrams for load & store

IF for Load, Store, …

ID for Load, Store, …

EX for Load

MEM for Load

WB for Load
[Figure annotation: “Wrong register number”]

Corrected Datapath for Load

EX for Store

MEM for Store

WB for Store

Multi-Cycle Pipeline Diagram
- Form showing resource usage

Multi-Cycle Pipeline Diagram
- Traditional form

Single-Cycle Pipeline Diagram
- State of pipeline in a given cycle

Pipelined Control (Simplified)

Pipelined Control
- Control signals derived from instruction
  - As in single-cycle implementation

Pipelined Control

§4.7 Data Hazards: Forwarding vs. Stalling

Data Hazards in ALU Instructions
- Consider this sequence:
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
- We can resolve hazards with forwarding
  - How do we detect when to forward?
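
One way to see which instructions in this sequence need a forwarded value is to look at how far back the producer of each source register sits. The sketch below is a simplified model, not the hardware: it assumes a producer one instruction back is forwarded from EX/MEM, two back from MEM/WB, and that anything further back is read normally from the register file (which assumes the register file is written before it is read within a cycle). The encoding and function name are hypothetical.

```python
# Each instruction: (opcode, destination register number or None, source register numbers).
SEQUENCE = [
    ("sub", 2, (1, 3)),
    ("and", 12, (2, 5)),
    ("or", 13, (6, 2)),
    ("add", 14, (2, 2)),
    ("sw", None, (15, 2)),
]


def forwarding_sources(program):
    """For each instruction, list (register, pipeline register) pairs
    whose value would have to be forwarded under the assumptions above."""
    needs = []
    for i, (_, _, sources) in enumerate(program):
        found = []
        for reg in sorted(set(sources)):
            if reg == 0:
                continue  # $zero never needs forwarding
            # Check the most recent producer first.
            for distance, pipeline_reg in ((1, "EX/MEM"), (2, "MEM/WB")):
                j = i - distance
                if j >= 0 and program[j][1] == reg:
                    found.append((reg, pipeline_reg))
                    break
        needs.append(found)
    return needs


if __name__ == "__main__":
    for instr, fwd in zip(SEQUENCE, forwarding_sources(SEQUENCE)):
        print(instr[0], fwd)
```

Under these assumptions only `and` and `or` need a forwarded copy of `$2`; `add` and `sw` are far enough behind `sub` to read it from the register file.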

Dependencies & Forwarding

Detecting the Need to Forward
- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
- Data hazards when
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs    (Fwd from EX/MEM pipeline reg)
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt    (Fwd from EX/MEM pipeline reg)
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs    (Fwd from MEM/WB pipeline reg)
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt    (Fwd from MEM/WB pipeline reg)

Detecting the Need to Forward
- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not $zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

Forwarding Paths

Forwarding Conditions
- EX hazard
  - if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 10
  - if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 10
- MEM hazard
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01

Double Data Hazard
- Consider the sequence:
    add $1, $1, $2
    add $1, $1, $3
    add $1, $1, $4
- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - Only fwd if EX hazard condition isn’t true

Revised Forwarding Condition
- MEM hazard
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
      ForwardA = 01
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
        and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
      ForwardB = 01
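
The EX-hazard and revised MEM-hazard conditions can be written out as a small function. This is a software sketch of the logic only, with hypothetical names; checking the EX hazard first and falling through to the MEM hazard is equivalent to the "and not (EX hazard)" term in the revised condition.

```python
def forward_select(ex_mem_reg_write, ex_mem_rd,
                   mem_wb_reg_write, mem_wb_rd, source_reg):
    """Return the 2-bit forwarding mux select for one ALU operand.

    0b10: forward from EX/MEM, 0b01: forward from MEM/WB, 0b00: no forwarding.
    """
    ex_hazard = (ex_mem_reg_write and ex_mem_rd != 0
                 and ex_mem_rd == source_reg)
    mem_hazard = (mem_wb_reg_write and mem_wb_rd != 0
                  and mem_wb_rd == source_reg)
    if ex_hazard:
        return 0b10  # most recent result wins
    if mem_hazard:
        return 0b01  # only reached when the EX hazard condition is false
    return 0b00


def forwarding_unit(ex_mem_reg_write, ex_mem_rd,
                    mem_wb_reg_write, mem_wb_rd, id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) for the instruction in the EX stage."""
    forward_a = forward_select(ex_mem_reg_write, ex_mem_rd,
                               mem_wb_reg_write, mem_wb_rd, id_ex_rs)
    forward_b = forward_select(ex_mem_reg_write, ex_mem_rd,
                               mem_wb_reg_write, mem_wb_rd, id_ex_rt)
    return forward_a, forward_b


if __name__ == "__main__":
    # Third add of the double-hazard sequence (add $1, $1, $4) in EX:
    # both earlier adds write $1, so Rs must come from EX/MEM.
    print(forwarding_unit(True, 1, True, 1, id_ex_rs=1, id_ex_rt=4))
```

For the double data hazard the function selects EX/MEM for Rs even though MEM/WB also holds a value for `$1`.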

Datapath with Forwarding

Load-Use Data Hazard
[Figure annotation: “Need to stall for one cycle”]

Load-Use Hazard Detection
- Check when using instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
- Load-use hazard when
  - ID/EX.MemRead and
      ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
       (ID/EX.RegisterRt = IF/ID.RegisterRt))
- If detected, stall and insert bubble
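
The detection condition translates directly into a boolean function. The sketch below uses hypothetical parameter names for the pipeline register fields, and the register numbers in the example are illustrative only.

```python
def load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID reads the register that the
    load currently in EX is about to write."""
    return bool(id_ex_mem_read
                and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


if __name__ == "__main__":
    # Illustrative case: a load into $2 in EX, followed by an
    # instruction in ID that reads $2 and $5.
    print(load_use_hazard(True, 2, 2, 5))
    # Same registers, but the instruction in EX is not a load.
    print(load_use_hazard(False, 2, 2, 5))
```

When the function returns true, the pipeline would stall for one cycle and insert a bubble.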

How to Stall the Pipeline
- Force control values in ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for lw
    - Can subsequently forward to EX stage

Stall/Bubble in the Pipeline
[Figure annotation: “Stall inserted here”]

Stall/Bubble in the Pipeline
[Figure annotation: “Or, more accurately…”]

Datapath with Hazard Detection

Stalls and Performance
The BIG Picture
- Stalls reduce performance
  - But are required to get correct results
- Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

n  If branch outcome determined in MEM §4.8 Control Hazards Branch Hazards Flush these instructions (Set control values to 0) PC Chapter 4 — The Processor — 52 .

Reducing Branch Delay
- Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
- Example: branch taken
    36:  sub  $10, $4, $8
    40:  beq  $1,  $3, 7
    44:  and  $12, $2, $5
    48:  or   $13, $2, $6
    52:  add  $14, $4, $2
    56:  slt  $15, $6, $7
         ...
    72:  lw   $4, 50($7)

Example: Branch Taken

Example: Branch Taken

Data Hazards for Branches
- If a comparison register is a destination of 2nd or 3rd preceding ALU instruction
    add $1, $2, $3        IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    …                             IF  ID  EX  MEM WB
    beq $1, $4, target                IF  ID  EX  MEM WB
- Can resolve using forwarding

Data Hazards for Branches
- If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
  - Need 1 stall cycle
    lw  $1, addr          IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    beq stalled                   IF  ID
    beq $1, $4, target                ID  EX  MEM WB

Data Hazards for Branches
- If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles
    lw  $1, addr          IF  ID  EX  MEM WB
    beq stalled               IF  ID
    beq stalled                   ID
    beq $1, $0, target                ID  EX  MEM WB
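
The three cases above can be summarized as a function of the producer type and its distance from the branch. This is a sketch with hypothetical names; it assumes the branch is resolved in ID with forwarding available, and it treats a load three or more instructions back as needing no stall, which the slides do not state explicitly.

```python
def branch_stall_cycles(producer, distance):
    """Stall cycles for a branch that compares a register written by an
    earlier instruction.

    producer: "alu" or "load"
    distance: 1 means the immediately preceding instruction, 2 the one
              before that, and so on.
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    if producer == "alu":
        return max(0, 2 - distance)   # ALU result is ready one stage earlier
    if producer == "load":
        return max(0, 3 - distance)   # loaded data arrives after MEM
    raise ValueError("producer must be 'alu' or 'load'")


if __name__ == "__main__":
    for kind in ("alu", "load"):
        for d in (1, 2, 3):
            print(kind, d, branch_stall_cycles(kind, d))
```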

Dynamic Branch Prediction
- In deeper and superscalar pipelines, branch penalty is more significant
- Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
    - Check table, expect the same outcome
    - Start fetching from fall-through or target
    - If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
- Inner loop branches mispredicted twice!
    outer: …
           …
    inner: …
           …
           beq …, …, inner
           …
           beq …, …, outer
  - Mispredict as taken on last iteration of inner loop
  - Then mispredict as not taken on first iteration of inner loop next time around

2-Bit Predictor
- Only change prediction on two successive mispredictions
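
To make the difference concrete, the sketch below replays the inner-loop branch of a nested loop through both predictors and counts mispredictions. The 2-bit version is modeled as a saturating counter, which is one common realization and an assumption here; the state machine in the slide's figure may differ in its transitions. The loop trip counts and all names are hypothetical.

```python
def inner_branch_outcomes(outer_iters, inner_iters):
    """Outcomes of the inner-loop branch: taken on every inner
    iteration except the last one of each pass."""
    return ([True] * (inner_iters - 1) + [False]) * outer_iters


def mispredictions_1bit(outcomes, prediction=True):
    """1-bit predictor: always predict the previous outcome."""
    misses = 0
    for taken in outcomes:
        if prediction != taken:
            misses += 1
        prediction = taken
    return misses


def mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter (assumed variant): states 2 and 3
    predict taken, states 0 and 1 predict not taken."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


if __name__ == "__main__":
    outcomes = inner_branch_outcomes(outer_iters=3, inner_iters=4)
    print("1-bit:", mispredictions_1bit(outcomes))
    print("2-bit:", mispredictions_2bit(outcomes))
```

After the first pass, the 1-bit predictor misses twice per pass of the inner loop, while the 2-bit counter misses only on the loop exit.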

Calculating the Branch Target
- Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
- Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
    - If hit and instruction is branch predicted taken, can fetch target immediately

n  “Unexpected” events requiring change in flow of control n  n  Different ISAs use the terms differently Exception n  Arises within the CPU n  n  e. overflow.. syscall.g. undefined opcode. … Interrupt n  n  §4.9 Exceptions Exceptions and Interrupts From an external I/O controller Dealing with them without sacrificing performance is hard Chapter 4 — The Processor — 63 .

Handling Exceptions
- In MIPS, exceptions managed by a System Control Coprocessor (CP0)
- Save PC of offending (or interrupted) instruction
  - In MIPS: Exception Program Counter (EPC)
- Save indication of the problem
  - In MIPS: Cause register
  - We’ll assume 1-bit
    - 0 for undefined opcode, 1 for overflow
- Jump to handler at 8000 0180 (hex)

Handler Actions
- Read cause, and transfer to relevant handler
- Determine action required
- If restartable
  - Take corrective action
  - Use EPC to return to program
- Otherwise
  - Terminate program
  - Report error using EPC, cause, …

Exceptions in a Pipeline
- Another form of control hazard
- Consider overflow on add in EX stage
    add $1, $2, $1
  - Prevent $1 from being clobbered
  - Complete previous instructions
  - Flush add and subsequent instructions
  - Set Cause and EPC register values
  - Transfer control to handler
- Similar to mispredicted branch
  - Use much of the same hardware

Speculation
- “Guess” what to do with an instruction
  - Start operation as soon as possible
  - Check whether guess was right
    - If so, complete the operation
    - If not, roll-back and do the right thing
- Common to static and dynamic multiple issue
- Examples
  - Speculate on branch outcome
    - Roll back if path taken is different
  - Speculate on load
    - Roll back if location is updated

Compiler/Hardware Speculation
- Compiler can reorder instructions
  - e.g., move load before branch
  - Can include “fix-up” instructions to recover from incorrect guess
- Hardware can look ahead for instructions to execute
  - Buffer results until it determines they are actually needed
  - Flush buffers on incorrect speculation

Static Multiple Issue
- Compiler groups instructions into “issue packets”
  - Group of instructions that can be issued on a single cycle
  - Determined by pipeline resources required
- Think of an issue packet as a very long instruction
  - Specifies multiple concurrent operations
  - ⇒ Very Long Instruction Word (VLIW)

Scheduling Static Multiple Issue
- Compiler must remove some/all hazards
  - Reorder instructions into issue packets
  - No dependencies within a packet
  - Possibly some dependencies between packets
    - Varies between ISAs; compiler must know!
  - Pad with nop if necessary

MIPS with Static Dual Issue
- Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned
    - ALU/branch, then load/store
    - Pad an unused instruction with nop

Address   Instruction type   Pipeline Stages
n         ALU/branch         IF  ID  EX  MEM WB
n + 4     Load/store         IF  ID  EX  MEM WB
n + 8     ALU/branch             IF  ID  EX  MEM WB
n + 12    Load/store             IF  ID  EX  MEM WB
n + 16    ALU/branch                 IF  ID  EX  MEM WB
n + 20    Load/store                 IF  ID  EX  MEM WB

MIPS with Static Dual Issue

Hazards in the Dual-Issue MIPS
- More instructions executing in parallel
- EX data hazard
  - Forwarding avoided stalls with single-issue
  - Now can’t use ALU result in load/store in same packet
      add  $t0, $s0, $s1
      load $s2, 0($t0)
    - Split into two packets, effectively a stall
- Load-use hazard
  - Still one cycle use latency, but now two instructions
- More aggressive scheduling required

Scheduling Example
- Schedule this for dual-issue MIPS
    Loop: lw   $t0, 0($s1)        # $t0=array element
          addu $t0, $t0, $s2      # add scalar in $s2
          sw   $t0, 0($s1)        # store result
          addi $s1, $s1, -4       # decrement pointer
          bne  $s1, $zero, Loop   # branch $s1!=0

          ALU/branch               Load/store         cycle
    Loop: nop                      lw  $t0, 0($s1)    1
          addi $s1, $s1, -4        nop                2
          addu $t0, $t0, $s2       nop                3
          bne  $s1, $zero, Loop    sw  $t0, 4($s1)    4

- IPC = 5/4 = 1.25 (c.f. peak IPC = 2)

Loop Unrolling
- Replicate loop body to expose more parallelism
  - Reduces loop-control overhead
- Use different registers per replication
  - Called “register renaming”
  - Avoid loop-carried “anti-dependencies”
    - Store followed by a load of the same register
    - Aka “name dependence”
      - Reuse of a register name

Loop Unrolling Example

          ALU/branch               Load/store         cycle
    Loop: addi $s1, $s1, -16       lw  $t0, 0($s1)    1
          nop                      lw  $t1, 12($s1)   2
          addu $t0, $t0, $s2       lw  $t2, 8($s1)    3
          addu $t1, $t1, $s2       lw  $t3, 4($s1)    4
          addu $t2, $t2, $s2       sw  $t0, 16($s1)   5
          addu $t3, $t3, $s2       sw  $t1, 12($s1)   6
          nop                      sw  $t2, 8($s1)    7
          bne  $s1, $zero, Loop    sw  $t3, 4($s1)    8

- IPC = 14/8 = 1.75
  - Closer to 2, but at cost of registers and code size
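
The IPC figures for the scheduled loop and the unrolled loop follow from counting the non-nop slots in each schedule. The sketch below encodes both dual-issue schedules by opcode only; the list layout and function name are hypothetical.

```python
# Each packet is (ALU/branch slot, load/store slot); "nop" marks an empty slot.
SCHEDULED_LOOP = [
    ("nop", "lw"),
    ("addi", "nop"),
    ("addu", "nop"),
    ("bne", "sw"),
]

UNROLLED_LOOP = [
    ("addi", "lw"),
    ("nop", "lw"),
    ("addu", "lw"),
    ("addu", "lw"),
    ("addu", "sw"),
    ("addu", "sw"),
    ("nop", "sw"),
    ("bne", "sw"),
]


def ipc(schedule):
    """Useful instructions issued per cycle, one packet per cycle."""
    issued = sum(1 for packet in schedule for slot in packet if slot != "nop")
    return issued / len(schedule)


if __name__ == "__main__":
    print(ipc(SCHEDULED_LOOP))
    print(ipc(UNROLLED_LOOP))
```

Unrolling four times fills most of the slots that were nops in the single-iteration schedule.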

Dynamic Multiple Issue
- “Superscalar” processors
- CPU decides whether to issue 0, 1, 2, … each cycle
  - Avoiding structural and data hazards
- Avoids the need for compiler scheduling
  - Though it may still help
  - Code semantics ensured by the CPU

Speculation
- Predict branch and continue issuing
  - Don’t commit until branch outcome determined
- Load speculation
  - Avoid load and cache miss delay
    - Predict the effective address
    - Predict loaded value
    - Load before completing outstanding stores
    - Bypass stored values to load unit
  - Don’t commit load until speculation cleared

Why Do Dynamic Scheduling?
- Why not just let the compiler schedule code?
- Not all stalls are predictable
  - e.g., cache misses
- Can’t always schedule around branches
  - Branch outcome is dynamically determined
- Different implementations of an ISA have different latencies and hazards

Does Multiple Issue Work?
The BIG Picture
- Yes, but not as much as we’d like
- Programs have real dependencies that limit ILP
- Some dependencies are hard to eliminate
  - e.g., pointer aliasing
- Some parallelism is hard to expose
  - Limited window size during instruction issue
- Memory delays and limited bandwidth
  - Hard to keep pipelines full
- Speculation can help if done well

Power Efficiency
- Complexity of dynamic scheduling and speculations requires power
- Multiple simpler cores may be better

Microprocessor   Year  Clock Rate  Pipeline Stages  Issue Width  Out-of-order/Speculation  Cores  Power
i486             1989  25MHz       5                1            No                        1      5W
Pentium          1993  66MHz       5                2            No                        1      10W
Pentium Pro      1997  200MHz      10               3            Yes                       1      29W
P4 Willamette    2001  2000MHz     22               3            Yes                       1      75W
P4 Prescott      2004  3600MHz     31               3            Yes                       1      103W
Core             2006  2930MHz     14               4            Yes                       2      75W
UltraSparc III   2003  1950MHz     14               4            No                        1      90W
UltraSparc T1    2005  1200MHz     6                1            No                        8      70W

§4.11 Real Stuff: The AMD Opteron X4 (Barcelona) Pipeline

The Opteron X4 Microarchitecture
[Figure annotation: “72 physical registers”]

The Opteron X4 Pipeline Flow
- For integer operations
  [Figure: pipeline stages]
  - FP is 5 stages longer
  - Up to 106 RISC-ops in progress
- Bottlenecks
  - Complex instructions with long dependencies
  - Branch mispredictions
  - Memory access delays

§4.13 Fallacies and Pitfalls

Fallacies
- Pipelining is easy (!)
  - The basic idea is easy
  - The devil is in the details
    - e.g., detecting data hazards
- Pipelining is independent of technology
  - So why haven’t we always done pipelining?
  - More transistors make more advanced techniques feasible
  - Pipeline-related ISA design needs to take account of technology trends
    - e.g., predicated instructions