COMPUTER ARCHITECTURE
CSE Fall 2013

BK TP.HCM
Faculty of Computer Science and Engineering
Department of Computer Engineering

Vo Tan Phuong
http://www.cse.hcmut.edu.vn/~vtphuong

Chapter 4.2
Pipelined Processor Design

Computer Architecture Chapter 4.2, Fall 2013, CS

Presentation Outline

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipelining Example

Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers

Each stage takes 30 minutes to complete
Four loads of clothes to wash, dry, and fold

Sequential Laundry

[Figure: timeline from 6 PM to 12 AM; loads A-D processed one after another, each taking three 30-minute stages]

Sequential laundry takes 6 hours for 4 loads
Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP

[Figure: timeline from 6 PM to 9 PM; loads A-D overlapped, each starting 30 minutes after the previous one]

Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)

Serial Execution versus Pipelining

Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units

Pipelining is to overlap the execution
The k stages work in parallel on k different tasks
Tasks enter/leave the pipeline at the rate of one task per time unit

[Figure: without pipelining, one completion every k time units; with pipelining, one completion every time unit]

Synchronous Pipeline

Uses clocked registers between stages
Upon arrival of a clock edge
All registers hold the results of previous stages simultaneously

The pipeline stages are combinational logic circuits
It is desirable to have balanced stages
Approximately equal delay in all stages

Clock period is determined by the maximum stage delay

[Figure: stages S1 through Sk separated by registers, all driven by a common clock]

Pipeline Performance

Let ti = time delay in stage Si
Clock cycle t = max(ti) is the maximum stage delay
Clock frequency f = 1/t = 1/max(ti)

A pipeline can process n tasks in k + n - 1 cycles
k cycles are needed to complete the first task
n - 1 cycles are needed to complete the remaining n - 1 tasks

Ideal speedup of a k-stage pipeline over serial execution:

    Sk = Serial execution in cycles / Pipelined execution in cycles
       = nk / (k + n - 1)

Sk approaches k for large n
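The speedup formula can be checked numerically. A minimal Python sketch (illustration only, not part of the original slides):

```python
def pipeline_speedup(k: int, n: int) -> float:
    """Speedup of a k-stage pipeline processing n tasks:
    serial time is n*k cycles, pipelined time is k + n - 1 cycles."""
    return (n * k) / (k + n - 1)

# Laundry example: 3 stages, 4 loads -> speedup 2 (6 hours vs 3 hours)
print(pipeline_speedup(3, 4))
# A 5-stage pipeline approaches (but never reaches) speedup k = 5
print(pipeline_speedup(5, 1000))
```

For large n the fill time k - 1 becomes negligible, which is why Sk tends toward k.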

MIPS Processor Pipeline

Five stages, one cycle per stage
1. IF: Instruction Fetch from instruction memory
2. ID: Instruction Decode, register read, and J/Br address
3. EX: Execute operation or calculate load/store address
4. MEM: Memory access for load and store
5. WB: Write Back result to register

Single-Cycle vs Pipelined Performance

Consider a 5-stage instruction execution in which
Instruction fetch = ALU operation = Data memory access = 200 ps
Register read = register write = 150 ps

What is the clock cycle of the single-cycle processor?
What is the clock cycle of the pipelined processor?
What is the speedup factor of pipelined execution?

Solution
Single-Cycle Clock = 200 + 150 + 200 + 200 + 150 = 900 ps

[Figure: two instructions executed back to back, 900 ps each: IF, Reg, ALU, MEM, Reg]

Single-Cycle versus Pipelined cont'd

Pipelined clock cycle = max(200, 150) = 200 ps

[Figure: three overlapped instructions, each starting 200 ps apart: IF, Reg, ALU, MEM, Reg]

CPI for pipelined execution = 1
One instruction completes each cycle (ignoring pipeline fill)

Speedup of pipelined execution = 900 ps / 200 ps = 4.5
Instruction count and CPI are equal in both cases

Speedup factor is less than 5 (the number of pipeline stages)
Because the pipeline stages are not balanced

Pipeline Performance Summary

Pipelining doesn't improve the latency of a single instruction
However, it improves the throughput of the entire workload
Instructions are initiated and completed at a higher rate

In a k-stage pipeline, k instructions operate in parallel
Overlapped execution using multiple hardware resources
Potential speedup = number of pipeline stages k

Pipeline rate is limited by the slowest pipeline stage
Unbalanced lengths of pipeline stages reduce speedup
Also, time to fill and drain the pipeline reduces speedup

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Single-Cycle Datapath

Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce a pipeline register at the end of each stage

[Figure: single-cycle datapath with sections IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back; Next PC logic uses J, Beq, Bne, and PCSrc; control signals RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg]

Pipelined Datapath

Pipeline registers are shown in green, including the PC
The same clock edge updates all pipeline registers, the register
file, and the data memory (for store instructions)

[Figure: pipelined datapath; registers NPC, Imm, BusA/BusB, ALUout, and WB Data separate the IF, ID, EX, MEM, and WB stages]

Problem with Register Destination

Is there a problem with the register destination address?
The instruction in the ID stage is different from the one in the WB stage
The instruction in the WB stage is not writing to its destination register
but to the destination of a different instruction in the ID stage

[Figure: pipelined datapath in which the destination register number Rd flows directly from the ID stage to the register file write port RW]

Pipelining the Destination Register

The destination register number should be pipelined
The destination register number is passed from the ID to the WB stage
The WB stage writes back data knowing the destination register

[Figure: pipelined datapath with the destination register number carried through pipeline registers Rd2, Rd3, and Rd4 to the write port RW]

Graphically Representing Pipelines

Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle

[Figure: lw $t6, 8($s5); add $s1, $s2, $s3; ori $s4, $t3, 7; sub $t5, $s2, $t3; sw $s2, 10($t3) flowing through IM, Reg, ALU, DM, Reg over cycles CC1-CC8]

Instruction-Time Diagram

Instruction-Time Diagram shows:
Which instruction occupies which stage at each clock cycle
Instruction flow is pipelined over the 5 stages

Up to five instructions can be in the pipeline during the same cycle
Instruction Level Parallelism (ILP)

ALU instructions skip the MEM stage
Store instructions skip the WB stage

[Figure: lw $t7, 8($s3); lw $t6, 8($s5); ori $t4, $s3, 7; sub $s5, $s2, $t3; sw $s2, 10($s3) over cycles CC1-CC9, each one stage behind the previous]

Control Signals

The same control signals used in the single-cycle datapath

[Figure: pipelined datapath annotated with control signals RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, Beq, Bne, J, PCSrc, MemRead, MemWrite, MemtoReg]

Pipelined Control

Pass control signals along the pipeline just like the data

[Figure: Main & ALU Control in the ID stage generates all signals from Op and func; the EX, MEM, and WB control signals travel with the instruction through the pipeline registers]

Pipelined Control Cont'd

ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
Each stage uses some of the control signals

Instruction Decode and Register Read
Control signals are generated
RegDst is used in this stage

Execution Stage => ExtOp, ALUSrc, and ALUCtrl
Next PC uses the J, Beq, Bne, and zero signals for branch control

Memory Stage => MemRead, MemWrite, and MemtoReg

Write Back Stage => RegWrite is used in this stage

Control Signals Summary

Op     | RegDst | ALUSrc | ExtOp  | Beq | Bne | ALUCtrl | MemRd | MemWr | MemReg | RegWrite
-------|--------|--------|--------|-----|-----|---------|-------|-------|--------|---------
R-Type | 1=Rd   | 0=Reg  |   x    |  0  |  0  | func    |   0   |   0   |   0    |    1
addi   | 0=Rt   | 1=Imm  | 1=sign |  0  |  0  | ADD     |   0   |   0   |   0    |    1
slti   | 0=Rt   | 1=Imm  | 1=sign |  0  |  0  | SLT     |   0   |   0   |   0    |    1
andi   | 0=Rt   | 1=Imm  | 0=zero |  0  |  0  | AND     |   0   |   0   |   0    |    1
ori    | 0=Rt   | 1=Imm  | 0=zero |  0  |  0  | OR      |   0   |   0   |   0    |    1
lw     | 0=Rt   | 1=Imm  | 1=sign |  0  |  0  | ADD     |   1   |   0   |   1    |    1
sw     |   x    | 1=Imm  | 1=sign |  0  |  0  | ADD     |   0   |   1   |   x    |    0
beq    |   x    | 0=Reg  |   x    |  1  |  0  | SUB     |   0   |   0   |   x    |    0
bne    |   x    | 0=Reg  |   x    |  0  |  1  | SUB     |   0   |   0   |   x    |    0

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipeline Hazards

Hazards: situations that would cause incorrect execution
if the next instruction were launched during its designated clock cycle

1. Structural hazards
Caused by resource contention
Using the same resource by two instructions during the same cycle

2. Data hazards
An instruction may compute a result needed by the next instruction
Hardware can detect dependencies between instructions

3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control

Hazards complicate pipeline control and limit performance

Structural Hazards

Problem
Attempt to use the same hardware resource by two different
instructions during the same cycle

Example
Writing back the ALU result in stage 4
Conflicts with writing the load data in stage 5

Structural Hazard: two instructions are attempting to write
the register file during the same cycle

[Figure: lw $t6, 8($s5); ori $t4, $s3, 7; sub $t5, $s2, $s3; sw $s2, 10($s3) over cycles CC1-CC9; the WB stages of ori and sub collide with those of neighboring instructions]

Resolving Structural Hazards

Serious Hazard:
Hazard cannot be ignored

Solution 1: Delay Access to Resource
Must have a mechanism to delay instruction access to the resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything

Solution 2: Add more hardware resources (more costly)
Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4
Second write port can be used to write back load data in stage 5

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Data Hazards

Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access

Read After Write - RAW Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependence in compiler terminology

    I: add $s1, $s2, $s3   # $s1 is written
    J: sub $s4, $s1, $s3   # $s1 is read

Hazard occurs when J reads the operand before I writes it

Example of a RAW Data Hazard

[Figure: sub $s2, $t1, $t3 followed by add $s4, $s2, $t5; or $s6, $t3, $s2; and $s7, $t4, $s2; sw $t8, 10($s2) over cycles CC1-CC8; the value of $s2 changes from 10 to 20 at the end of CC5]

Result of sub is needed by the add, or, and, & sw instructions
Instructions add & or will read the old value of $s2 from the register file
During CC5, $s2 is written at the end of the cycle; the old value is read

Solution 1: Stalling the Pipeline

[Figure: sub $s2, $t1, $t3; add $s4, $s2, $t5 stalled in decode for three cycles; or $s6, $t3, $s2 over cycles CC1-CC9]

Three stall cycles during CC3 thru CC5 (wasting 3 cycles)
Stall cycles delay execution of add & fetching of the or instruction

The add instruction cannot read $s2 until the beginning of CC6
The add instruction remains in the Instruction register until CC6
The PC register is not modified until the beginning of CC6

Solution 2: Forwarding ALU Result

The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted
ALU result is forwarded from the ALU, MEM, and WB stages

[Figure: sub $s2, $t1, $t3 followed by add $s4, $s2, $t5; or $s6, $t3, $s2; and $s7, $s6, $s2; sw $t8, 10($s2) over cycles CC1-CC8, with forwarding arrows carrying the sub result to the dependent instructions]

Implementing Forwarding

Two multiplexers added at the inputs of the A & B registers
Data from the ALU stage, MEM stage, and WB stage is fed back
Two signals, ForwardA and ForwardB, control forwarding

[Figure: 4-input forwarding multiplexers (selects 0-3) on BusA and BusB, fed from the register file, the ALU result, the MEM-stage result, and the WB data]

Forwarding Control Signals

Signal        Explanation
ForwardA = 0  First ALU operand comes from register file = Value of (Rs)
ForwardA = 1  Forward result of previous instruction to A (from ALU stage)
ForwardA = 2  Forward result of 2nd previous instruction to A (from MEM stage)
ForwardA = 3  Forward result of 3rd previous instruction to A (from WB stage)
ForwardB = 0  Second ALU operand comes from register file = Value of (Rt)
ForwardB = 1  Forward result of previous instruction to B (from ALU stage)
ForwardB = 2  Forward result of 2nd previous instruction to B (from MEM stage)
ForwardB = 3  Forward result of 3rd previous instruction to B (from WB stage)

Forwarding Example

Instruction sequence:
    lw  $t4, 4($t0)
    ori $t7, $t1, 2
    sub $t3, $t4, $t7

When the sub instruction is fetched
ori will be in the ALU stage
lw will be in the MEM stage

ForwardA = 2 from MEM stage
ForwardB = 1 from ALU stage

[Figure: datapath with sub in the decode stage, ori in the ALU stage, lw in the MEM stage, and the two forwarding paths highlighted]

RAW Hazard Detection

Current instruction being decoded is in the Decode stage
Previous instruction is in the Execute stage
Second previous instruction is in the Memory stage
Third previous instruction is in the Write Back stage

If      ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite))   ForwardA = 1
Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite))  ForwardA = 2
Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite))   ForwardA = 3
Else                                                    ForwardA = 0

If      ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite))   ForwardB = 1
Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite))  ForwardB = 2
Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite))   ForwardB = 3
Else                                                    ForwardB = 0
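The priority logic above can be sketched as a small function. Python is used here purely for illustration; the stage names (Rd2/Rd3/Rd4, RegWrite per stage) follow the slides:

```python
def forward_select(src_reg, rd2, rd3, rd4,
                   ex_regwrite, mem_regwrite, wb_regwrite):
    """Select the forwarding mux input for one ALU operand.
    Returns 0 (register file), 1 (from ALU/EX stage),
    2 (from MEM stage), or 3 (from WB stage).
    Register $0 is never forwarded: it is hardwired to zero."""
    if src_reg != 0 and src_reg == rd2 and ex_regwrite:
        return 1   # closest producer wins (highest priority)
    if src_reg != 0 and src_reg == rd3 and mem_regwrite:
        return 2
    if src_reg != 0 and src_reg == rd4 and wb_regwrite:
        return 3
    return 0

# Forwarding example from the slides: sub $t3, $t4, $t7 in decode,
# ori $t7 in the ALU stage (Rd2), lw $t4 in the MEM stage (Rd3).
# MIPS register numbers: $t4 = 12, $t7 = 15.
print(forward_select(12, 15, 12, 0, True, True, False))  # ForwardA = 2
print(forward_select(15, 15, 12, 0, True, True, False))  # ForwardB = 1
```

Note the ordering: the most recent producer is checked first, so the newest value of the register is the one forwarded.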

Hazard Detect and Forward Logic

[Figure: Hazard Detect and Forward unit compares Rs and Rt against Rd2, Rd3, and Rd4 using the RegWrite signals of the EX, MEM, and WB stages; its outputs ForwardA and ForwardB drive the forwarding multiplexers]

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Pipeline Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Load Delay

Unfortunately, not all data hazards can be forwarded
Load has a delay that cannot be eliminated by forwarding

In the example shown below
The LW instruction does not read data until the end of CC4
Cannot forward data to ADD at the end of CC3 - NOT possible
However, load can forward data to the 2nd next and later instructions

[Figure: lw $s2, 20($t1) followed by add $s4, $s2, $t5; or $t6, $t3, $s2; and $t7, $s2, $t4 over cycles CC1-CC8]

Detecting RAW Hazard after Load

Detecting a RAW hazard after a Load instruction:
The load instruction will be in the EX stage
The instruction that depends on the load data is in the Decode stage

Condition for stalling the pipeline
    if ((EX.MemRead == 1)                      // Detect Load in EX stage
        and (ForwardA == 1 or ForwardB == 1))  // RAW Hazard
            Stall

Insert a bubble into the EX stage after a load instruction
Bubble is a no-op that wastes one clock cycle
Delays the dependent instruction after the load by one cycle
Because of the RAW hazard
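The stall condition above is a single boolean expression; a minimal Python sketch (illustration only, signal names as in the slides):

```python
def load_use_stall(ex_memread: bool, forward_a: int, forward_b: int) -> bool:
    """Stall when a load in the EX stage produces a value that the
    instruction currently in the Decode stage needs (ForwardA/B == 1)."""
    return bool(ex_memread and (forward_a == 1 or forward_b == 1))

# lw $s2, 20($s1) in EX, add $s4, $s2, $t5 in ID -> ForwardA = 1 -> stall
print(load_use_stall(True, 1, 0))   # True
# An instruction that does not use the load result needs no stall
print(load_use_stall(True, 0, 0))   # False
```

Only ForwardA/B == 1 (the immediately previous instruction) triggers a stall: values from the MEM and WB stages (selects 2 and 3) are already available and can be forwarded normally.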

Stall the Pipeline for one Cycle

ADD instruction depends on LW - stall at CC3
Allow the Load instruction in the ALU stage to proceed
Freeze the PC and Instruction registers (NO instruction is fetched)
Introduce a bubble into the ALU stage (bubble is a NO-OP)
Load can forward data to the next instruction after delaying it

[Figure: lw $s2, 20($s1); add $s4, $s2, $t5 delayed one cycle by a bubble; or $t6, $s3, $s2 over cycles CC1-CC8]

Showing Stall Cycles

Stall cycles can be shown on the instruction-time diagram
Hazard is detected in the Decode stage
Stall indicates that the instruction is delayed
Instruction fetching is also delayed after a stall
Data forwarding is shown using green arrows

Example:

[Figure: lw $s1, ($t5); lw $s2, 8($s1) stalled one cycle in decode; add $v0, $s2, $t3 stalled one cycle; sub $v1, $s2, $v0 over cycles CC1-CC10, with forwarding arrows between dependent instructions]

Hazard Detect, Forward, and Stall

[Figure: Hazard Detect, Forward, & Stall unit; MemRead of the EX stage feeds the stall logic; Stall disables the PC and Instruction registers, and a Bubble = 0 mux clears the control signals entering the EX stage]

Code Scheduling to Avoid Stalls

Compilers reorder code in a way to avoid load stalls
Consider the translation of the following statements:
A = B + C; D = E - F; // A thru F are in Memory

Slow code:
    lw   $t0, 4($s0)      # &B = 4($s0)
    lw   $t1, 8($s0)      # &C = 8($s0)
    add  $t2, $t0, $t1    # stall cycle
    sw   $t2, 0($s0)      # &A = 0($s0)
    lw   $t3, 16($s0)     # &E = 16($s0)
    lw   $t4, 20($s0)     # &F = 20($s0)
    sub  $t5, $t3, $t4    # stall cycle
    sw   $t5, 12($s0)     # &D = 12($s0)

Fast code: No Stalls
    lw   $t0, 4($s0)
    lw   $t1, 8($s0)
    lw   $t3, 16($s0)
    lw   $t4, 20($s0)
    add  $t2, $t0, $t1
    sw   $t2, 0($s0)
    sub  $t5, $t3, $t4
    sw   $t5, 12($s0)
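The effect of scheduling can be checked with a toy stall counter. This Python sketch (not from the slides) models only the one-cycle load-use stall of this pipeline: a stall occurs when an instruction reads the destination of the immediately preceding lw:

```python
def count_load_stalls(instrs):
    """Count one-cycle load-use stalls in a straight-line sequence.
    Each instruction is a tuple (op, dest, sources)."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

slow = [("lw", "$t0", []), ("lw", "$t1", []),
        ("add", "$t2", ["$t0", "$t1"]), ("sw", None, ["$t2"]),
        ("lw", "$t3", []), ("lw", "$t4", []),
        ("sub", "$t5", ["$t3", "$t4"]), ("sw", None, ["$t5"])]

fast = [("lw", "$t0", []), ("lw", "$t1", []),
        ("lw", "$t3", []), ("lw", "$t4", []),
        ("add", "$t2", ["$t0", "$t1"]), ("sw", None, ["$t2"]),
        ("sub", "$t5", ["$t3", "$t4"]), ("sw", None, ["$t5"])]

print(count_load_stalls(slow))  # 2 stall cycles
print(count_load_stalls(fast))  # 0 stall cycles
```

The scheduled version simply moves the independent loads up so no load result is consumed in the very next cycle.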

Name Dependence: Write After Read

Instruction J should write its result after it is read by I
Called anti-dependence by compiler writers

    I: sub $t4, $t1, $t3   # $t1 is read
    J: add $t1, $t2, $t3   # $t1 is written

Results from reuse of the name $t1
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order

Anti-dependence can be eliminated by renaming
Use a different destination register for add (e.g., $t5)

Name Dependence: Write After Write

Same destination register is written by two instructions
Called output-dependence in compiler terminology

    I: sub $t1, $t4, $t3   # $t1 is written
    J: add $t1, $t2, $t3   # $t1 is written again

Not a data hazard in the 5-stage pipeline because:
All writes are ordered and always take place in stage 5

However, can be a hazard in more complex pipelines
If instructions are allowed to complete out of order, and
Instruction J completes and writes $t1 before instruction I

Output dependence can be eliminated by renaming $t1
Read After Read is NOT a name dependence

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Control Hazards

Jump and Branch can cause great performance loss
Jump instruction needs only the jump target address
Branch instruction needs two things:
Branch Result: Taken or Not Taken
Branch Target Address:
    PC + 4                  if the branch is NOT taken
    PC + 4 + 4 x immediate  if the branch is Taken

Jump and Branch targets are computed in the ID stage
At which point a new instruction is already being fetched
Jump instruction: 1-cycle delay
Branch: 2-cycle delay for branch result (taken or not taken)

2-Cycle Branch Delay

Control logic detects a Branch instruction in the 2nd stage
ALU computes the branch outcome in the 3rd stage
Next1 and Next2 instructions will be fetched anyway
Convert Next1 and Next2 into bubbles if the branch is taken

[Figure: beq $t1, $t2, L1 followed by Next1 and Next2, which become bubbles when the branch is taken; the L1 target instruction is fetched in cc4 using the branch target address]

Implementing Jump and Branch

Branch target & outcome are computed in the ALU stage
Branch Delay = 2 cycles

[Figure: datapath with Next PC logic selecting the Jump or Branch Target via PCSrc using J, Beq, Bne, and zero; a Bubble = 0 mux clears the control signals of the wrongly fetched instructions]

Predict Branch NOT Taken

Branches can be predicted to be NOT taken
If the branch outcome is NOT taken then
Next1 and Next2 instructions can be executed
Do not convert Next1 & Next2 into bubbles
No wasted cycles

[Figure: beq $t1, $t2, L1 resolved as NOT taken in the ALU stage; Next1 and Next2 continue through the pipeline normally]

Reducing the Delay of Branches

Branch delay can be reduced from 2 cycles to just 1 cycle
Branches can be determined earlier, in the Decode stage
A comparator is used in the Decode stage to determine the branch
decision, whether the branch is taken or not
Because of forwarding, the delay in the second stage will be
increased, and this will also increase the clock cycle

Only one instruction that follows the branch is fetched
If the branch is taken then only one instruction is flushed
We should insert a bubble after a jump or taken branch
This will convert the next instruction into a NOP

Reducing Branch Delay to 1 Cycle

Data is forwarded and then compared in the Decode stage (longer cycle)
Reset signal converts the instruction after a jump or taken branch
into a bubble

[Figure: datapath with the branch comparator (Zero) moved into the Decode stage; PCSrc selects the Jump or Branch Target; the Reset signal and a Bubble = 0 mux kill the wrongly fetched instruction]

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Branch Hazard Alternatives

Predict Branch Not Taken (previously discussed)
Successor instruction is already fetched
Do NOT flush the instruction after the branch if the branch is NOT taken
Flush only instructions appearing after a jump or taken branch

Delayed Branch
Define the branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)

Dynamic Branch Prediction
Loop branches are taken most of the time
Must reduce branch delay to 0, but how?
How to predict branch behavior at runtime?

Delayed Branch

Define the branch to take place after the next instruction
For a 1-cycle branch delay, we have one delay slot:

           branch instruction
           branch delay slot    (next instruction)
    label: branch target        (if branch taken)
           . . .

Compiler fills the branch delay slot
By selecting an independent instruction from before the branch

    label: . . .                       label: . . .
           add $t2,$t3,$t4       ==>          beq $s1,$s0,label
           beq $s1,$s0,label                  add $t2,$t3,$t4   # delay slot

If no independent instruction is found
Compiler fills the delay slot with a NO-OP

Drawback of Delayed Branching

New meaning for the branch instruction
Branching takes place after the next instruction (not immediately!)

Impacts software and compiler
Compiler is responsible for filling the branch delay slot
For a 1-cycle branch delay: one branch delay slot

However, modern processors are deeply pipelined
Branch penalty is multiple cycles in deeper pipelines
Multiple delay slots are difficult to fill with useful instructions

MIPS used delayed branching in earlier pipelines
However, delayed branching is not useful in recent processors

Zero-Delayed Branching

How to achieve zero delay for a jump or a taken branch?
Jump or branch target address is computed in the ID stage
Next instruction has already been fetched in the IF stage

Solution
Introduce a Branch Target Buffer (BTB) in the IF stage
Store the target address of recent branch and jump instructions
Use the lower bits of the PC to index the BTB
Each BTB entry stores the Branch/Jump address & Target Address
Check the PC to see if the instruction being fetched is a branch
Update the PC using the target address stored in the BTB
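The index-then-tag-check behavior can be sketched in a few lines. This Python model is an illustration only; the entry count and 4-byte instruction alignment are assumptions, not taken from the slides:

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: low-order PC bits index the table, and the
    stored branch address acts as the tag that must match the full PC."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.table = [None] * entries  # each entry: (branch_pc, target_pc)

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # drop byte offset, keep low bits

    def update(self, branch_pc: int, target_pc: int) -> None:
        self.table[self._index(branch_pc)] = (branch_pc, target_pc)

    def lookup(self, pc: int):
        """Predicted target on a hit, None on a miss (fetch PC + 4)."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

btb = BranchTargetBuffer()
btb.update(0x400048, 0x400010)    # a recently taken branch and its target
print(hex(btb.lookup(0x400048)))  # hit: redirect fetch to 0x400010
print(btb.lookup(0x40004C))       # miss: None, fetch falls through to PC + 4
```

The tag check against the full branch address is what lets the IF stage redirect fetch without even decoding the instruction.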

Branch Target Buffer

The branch target buffer is implemented as a small cache
Stores the target address of recent branches and jumps

We must also have prediction bits
To predict whether branches are taken or not taken
The prediction bits are dynamically determined by the hardware

[Figure: Branch Target & Prediction Buffer indexed by the low-order bits of the PC; each entry holds the address of a recent branch, its target address, and predict bits; a hit with predict_taken steers the PC mux away from PC increment]

Dynamic Branch Prediction

Prediction of branches at runtime using prediction bits
Prediction bits are associated with each entry in the BTB
Prediction bits reflect the recent history of a branch instruction
Typically few prediction bits (1 or 2) are used per entry

We don't know if the prediction is correct or not
If correct prediction
Continue normal execution - no wasted cycles
If incorrect prediction (misprediction)
Flush the instructions that were incorrectly fetched - wasted cycles
Update prediction bits and target address for future use

Dynamic Branch Prediction Cont'd

IF: Use the PC to address the Instruction Memory and the Branch Target Buffer
    Found a BTB entry with predict taken?  Yes: PC = target address
                                           No: increment PC

ID: Jump or taken branch that was not predicted?
    Yes: Mispredicted jump/branch
         Enter the jump/branch address, target address, and set
         prediction in the BTB entry
         Flush the fetched instructions
         Restart PC at the target address

EX: Predicted taken - is the branch actually taken?
    Yes: Correct prediction, no stall cycles
    No:  Mispredicted branch (branch not taken)
         Update the prediction bits
         Flush the fetched instructions
         Restart PC after the branch

1-bit Prediction Scheme

Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are flushed

1-bit prediction scheme is simplest to implement
1 bit per branch instruction (associated with BTB entry)
Record last outcome of a branch instruction (Taken/Not taken)
Use last outcome to predict future behavior of a branch

[State diagram: Predict Not Taken <-> Predict Taken; a Taken outcome moves to Predict Taken, a Not Taken outcome moves to Predict Not Taken]

1-Bit Predictor: Shortcoming

Inner loop branch mispredicted twice!
Mispredicted as taken on the last iteration of the inner loop
Then mispredicted as not taken on the first iteration of the inner
loop next time around

    outer: . . .
    inner: . . .
           bne . . . , . . . , inner
           . . .
           bne . . . , . . . , outer
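The double misprediction is easy to reproduce with a toy model. A Python sketch (illustration only): the predictor simply remembers the last outcome of the branch.

```python
def predict_1bit(outcomes, last=True):
    """1-bit predictor: predict that the branch repeats its last outcome.
    Returns the number of mispredictions over a sequence of outcomes
    (True = taken, False = not taken)."""
    mispredicts = 0
    for taken in outcomes:
        if last != taken:
            mispredicts += 1
        last = taken  # remember only the most recent outcome
    return mispredicts

# Inner loop: taken 9 times then exits (not taken), executed twice.
# The exit mispredicts, and so does the first iteration of the next pass.
loop = ([True] * 9 + [False]) * 2
print(predict_1bit(loop, last=True))  # 3 mispredictions
```

Every full pass after the first costs two mispredictions: one on the loop exit and one on re-entry, exactly the shortcoming described above.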

2-bit Prediction Scheme

1-bit prediction scheme has a performance shortcoming
2-bit prediction scheme works better and is often used
4 states: strong and weak predict taken / predict not taken

Implemented as a saturating counter
Counter is incremented to max = 3 when branch outcome is taken
Counter is decremented to min = 0 when branch is not taken

[State diagram: Strong Predict Not Taken (0) <-> Weak Predict Not Taken (1) <-> Weak Predict Taken (2) <-> Strong Predict Taken (3); a Taken outcome moves toward 3, a Not Taken outcome moves toward 0]
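The saturating counter takes only a few lines. A Python sketch (illustration only), using the state encoding 0-3 from the diagram, with states 2-3 predicting taken:

```python
def predict_2bit(outcomes, state=0):
    """2-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken; a taken outcome increments (max 3),
    a not-taken outcome decrements (min 0).
    Returns the number of mispredictions."""
    mispredicts = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredicts += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts

# Same inner-loop pattern as the 1-bit example: taken 9 times then
# exits, run twice. Starting in Strong Predict Taken, only the loop
# exits mispredict; re-entry stays correctly predicted taken.
loop = ([True] * 9 + [False]) * 2
print(predict_2bit(loop, state=3))  # 2 mispredictions
```

One not-taken outcome only moves the counter from strong to weak taken, so a single anomaly does not flip the prediction — this is the advantage over the 1-bit scheme.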

Fallacies and Pitfalls

Pipelining is easy!
The basic idea is easy
The devil is in the details
Detecting data hazards and stalling the pipeline

Poor ISA design can make pipelining harder
Complex instruction sets (Intel IA-32)
Significant overhead to make pipelining work
IA-32 micro-op approach
Complex addressing modes
Register update side effects, memory indirection

Pipeline Hazards Summary

Three types of pipeline hazards
Structural hazards: conflicts using a resource during the same cycle
Data hazards: due to data dependencies between instructions
Control hazards: due to branch and jump instructions

Hazards limit the performance and complicate the design
Structural hazards: eliminated by careful design or more hardware
Data hazards are eliminated by forwarding
However, load delay cannot be eliminated and stalls the pipeline
Delayed branching can be a solution when branch delay = 1 cycle
BTB with branch prediction can reduce branch delay to zero
Branch misprediction should flush the wrongly fetched instructions
