COMPUTER ARCHITECTURE
CSE Fall 2013

BK TP.HCM
Faculty of Computer Science and Engineering
Department of Computer Engineering

Vo Tan Phuong
http://www.cse.hcmut.edu.vn/~vtphuong

Chapter 4.2
Pipelined Processor Design

Computer Architecture Chapter 4.2, Fall 2013, CS

Presentation Outline

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipelining Example

Laundry Example: Three Stages
1. Wash dirty load of clothes
2. Dry wet clothes
3. Fold and put clothes into drawers

Each stage takes 30 minutes to complete
Four loads of clothes to wash, dry, and fold

Sequential Laundry

[Figure: timeline from 6 PM to 12 AM; loads A-D processed one after another, each taking three 30-minute stages]

Sequential laundry takes 6 hours for 4 loads
Intuitively, we can use pipelining to speed up laundry

Pipelined Laundry: Start Load ASAP

[Figure: timeline from 6 PM to 9 PM; loads A-D overlapped, each starting 30 minutes after the previous one]

Pipelined laundry takes 3 hours for 4 loads
Speedup factor is 2 for 4 loads
Time to wash, dry, and fold one load is still the same (90 minutes)

Serial Execution versus Pipelining

Consider a task that can be divided into k subtasks
The k subtasks are executed on k different stages
Each subtask requires one time unit
The total execution time of the task is k time units

Pipelining is to overlap the execution
The k stages work in parallel on k different tasks
Tasks enter/leave the pipeline at the rate of one task per time unit

[Figure: without pipelining, one completion every k time units; with pipelining, one completion every time unit]

Synchronous Pipeline

Uses clocked registers between stages
Upon arrival of a clock edge
All registers hold the results of previous stages simultaneously

The pipeline stages are combinational logic circuits
It is desirable to have balanced stages
Approximately equal delay in all stages

Clock period is determined by the maximum stage delay

[Figure: stages S1 through Sk separated by registers, all driven by a common clock]

Pipeline Performance

Let ti = time delay in stage Si
Clock cycle t = max(ti) is the maximum stage delay
Clock frequency f = 1/t = 1/max(ti)

A pipeline can process n tasks in k + n - 1 cycles
k cycles are needed to complete the first task
n - 1 cycles are needed to complete the remaining n - 1 tasks

Ideal speedup of a k-stage pipeline over serial execution:

    Sk = Serial execution in cycles / Pipelined execution in cycles
       = nk / (k + n - 1)

Sk approaches k for large n
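The speedup formula can be checked numerically. A minimal Python sketch (illustration only, not part of the original slides):

```python
def pipeline_speedup(k: int, n: int) -> float:
    """Speedup of a k-stage pipeline processing n tasks:
    serial time is n*k cycles, pipelined time is k + n - 1 cycles."""
    return (n * k) / (k + n - 1)

# Laundry example: 3 stages, 4 loads -> speedup 2 (6 hours vs 3 hours)
print(pipeline_speedup(3, 4))
# A 5-stage pipeline approaches (but never reaches) speedup k = 5
print(pipeline_speedup(5, 1000))
```

For large n the fill time k - 1 becomes negligible, which is why Sk tends toward k.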

MIPS Processor Pipeline

Five stages, one cycle per stage
1. IF: Instruction Fetch from instruction memory
2. ID: Instruction Decode, register read, and J/Br address
3. EX: Execute operation or calculate load/store address
4. MEM: Memory access for load and store
5. WB: Write Back result to register

Single-Cycle vs Pipelined Performance

Consider a 5-stage instruction execution in which
Instruction fetch = ALU operation = Data memory access = 200 ps
Register read = register write = 150 ps

What is the clock cycle of the single-cycle processor?
What is the clock cycle of the pipelined processor?
What is the speedup factor of pipelined execution?

Solution
Single-Cycle Clock = 200 + 150 + 200 + 200 + 150 = 900 ps

[Figure: two instructions executed back to back, 900 ps each: IF, Reg, ALU, MEM, Reg]

Single-Cycle versus Pipelined cont'd

Pipelined clock cycle = max(200, 150) = 200 ps

[Figure: three overlapped instructions, each starting 200 ps apart: IF, Reg, ALU, MEM, Reg]

CPI for pipelined execution = 1
One instruction completes each cycle (ignoring pipeline fill)

Speedup of pipelined execution = 900 ps / 200 ps = 4.5
Instruction count and CPI are equal in both cases

Speedup factor is less than 5 (the number of pipeline stages)
Because the pipeline stages are not balanced

Pipeline Performance Summary

Pipelining doesn't improve the latency of a single instruction
However, it improves the throughput of the entire workload
Instructions are initiated and completed at a higher rate

In a k-stage pipeline, k instructions operate in parallel
Overlapped execution using multiple hardware resources
Potential speedup = number of pipeline stages k

Pipeline rate is limited by the slowest pipeline stage
Unbalanced lengths of pipeline stages reduce speedup
Also, time to fill and drain the pipeline reduces speedup

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Single-Cycle Datapath

Shown below is the single-cycle datapath
How to pipeline this single-cycle datapath?
Answer: Introduce a pipeline register at the end of each stage

[Figure: single-cycle datapath with sections IF = Instruction Fetch, ID = Decode & Register Read, EX = Execute, MEM = Memory Access, WB = Write Back; Next PC logic uses J, Beq, Bne, and PCSrc; control signals RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, MemRead, MemWrite, MemtoReg]

Pipelined Datapath

Pipeline registers are shown in green, including the PC
The same clock edge updates all pipeline registers, the register
file, and the data memory (for store instructions)

[Figure: pipelined datapath; registers NPC, Imm, BusA/BusB, ALUout, and WB Data separate the IF, ID, EX, MEM, and WB stages]

Problem with Register Destination

Is there a problem with the register destination address?
The instruction in the ID stage is different from the one in the WB stage
The instruction in the WB stage is not writing to its destination register
but to the destination of a different instruction in the ID stage

[Figure: pipelined datapath in which the destination register number Rd flows directly from the ID stage to the register file write port RW]

Pipelining the Destination Register

The destination register number should be pipelined
The destination register number is passed from the ID to the WB stage
The WB stage writes back data knowing the destination register

[Figure: pipelined datapath with the destination register number carried through pipeline registers Rd2, Rd3, and Rd4 to the write port RW]

Graphically Representing Pipelines

Multiple instruction execution over multiple clock cycles
Instructions are listed in execution order from top to bottom
Clock cycles move from left to right
Figure shows the use of resources at each stage and each cycle

[Figure: lw $t6, 8($s5); add $s1, $s2, $s3; ori $s4, $t3, 7; sub $t5, $s2, $t3; sw $s2, 10($t3) flowing through IM, Reg, ALU, DM, Reg over cycles CC1-CC8]

Instruction-Time Diagram

Instruction-Time Diagram shows:
Which instruction occupies which stage at each clock cycle
Instruction flow is pipelined over the 5 stages

Up to five instructions can be in the pipeline during the same cycle
Instruction Level Parallelism (ILP)

ALU instructions skip the MEM stage
Store instructions skip the WB stage

[Figure: lw $t7, 8($s3); lw $t6, 8($s5); ori $t4, $s3, 7; sub $s5, $s2, $t3; sw $s2, 10($s3) over cycles CC1-CC9, each one stage behind the previous]

Control Signals

The same control signals used in the single-cycle datapath

[Figure: pipelined datapath annotated with control signals RegDst, RegWrite, ExtOp, ALUSrc, ALUCtrl, Beq, Bne, J, PCSrc, MemRead, MemWrite, MemtoReg]

Pipelined Control

Pass control signals along the pipeline just like the data

[Figure: Main & ALU Control in the ID stage generates all signals from Op and func; the EX, MEM, and WB control signals travel with the instruction through the pipeline registers]

Pipelined Control Cont'd

ID stage generates all the control signals
Pipeline the control signals as the instruction moves
Extend the pipeline registers to include the control signals
Each stage uses some of the control signals

Instruction Decode and Register Read
Control signals are generated
RegDst is used in this stage

Execution Stage => ExtOp, ALUSrc, and ALUCtrl
Next PC uses the J, Beq, Bne, and zero signals for branch control

Memory Stage => MemRead, MemWrite, and MemtoReg

Write Back Stage => RegWrite is used in this stage

Control Signals Summary

Op     | RegDst | ALUSrc | ExtOp  | Beq | Bne | ALUCtrl | MemRd | MemWr | MemReg | RegWrite
-------|--------|--------|--------|-----|-----|---------|-------|-------|--------|---------
R-Type | 1=Rd   | 0=Reg  |   x    |  0  |  0  | func    |   0   |   0   |   0    |    1
addi   | 0=Rt   | 1=Imm  | 1=sign |  0  |  0  | ADD     |   0   |   0   |   0    |    1
slti   | 0=Rt   | 1=Imm  | 1=sign |  0  |  0  | SLT     |   0   |   0   |   0    |    1
andi   | 0=Rt   | 1=Imm  | 0=zero |  0  |  0  | AND     |   0   |   0   |   0    |    1
ori    | 0=Rt   | 1=Imm  | 0=zero |  0  |  0  | OR      |   0   |   0   |   0    |    1
lw     | 0=Rt   | 1=Imm  | 1=sign |  0  |  0  | ADD     |   1   |   0   |   1    |    1
sw     |   x    | 1=Imm  | 1=sign |  0  |  0  | ADD     |   0   |   1   |   x    |    0
beq    |   x    | 0=Reg  |   x    |  1  |  0  | SUB     |   0   |   0   |   x    |    0
bne    |   x    | 0=Reg  |   x    |  0  |  1  | SUB     |   0   |   0   |   x    |    0

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Pipeline Hazards

Hazards: situations that would cause incorrect execution
if the next instruction were launched during its designated clock cycle

1. Structural hazards
Caused by resource contention
Using the same resource by two instructions during the same cycle

2. Data hazards
An instruction may compute a result needed by the next instruction
Hardware can detect dependencies between instructions

3. Control hazards
Caused by instructions that change control flow (branches/jumps)
Delays in changing the flow of control

Hazards complicate pipeline control and limit performance

Structural Hazards

Problem
Attempt to use the same hardware resource by two different
instructions during the same cycle

Example
Writing back the ALU result in stage 4
Conflicts with writing the load data in stage 5

Structural Hazard: two instructions are attempting to write
the register file during the same cycle

[Figure: lw $t6, 8($s5); ori $t4, $s3, 7; sub $t5, $s2, $s3; sw $s2, 10($s3) over cycles CC1-CC9; the WB stages of ori and sub collide with those of neighboring instructions]

Resolving Structural Hazards

Serious Hazard:
Hazard cannot be ignored

Solution 1: Delay Access to Resource
Must have a mechanism to delay instruction access to the resource
Delay all write backs to the register file to stage 5
ALU instructions bypass stage 4 (memory) without doing anything

Solution 2: Add more hardware resources (more costly)
Add more hardware to eliminate the structural hazard
Redesign the register file to have two write ports
First write port can be used to write back ALU results in stage 4
Second write port can be used to write back load data in stage 5

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Data Hazards

Dependency between instructions causes a data hazard
The dependent instructions are close to each other
Pipelined execution might change the order of operand access

Read After Write - RAW Hazard
Given two instructions I and J, where I comes before J
Instruction J should read an operand after it is written by I
Called a data dependence in compiler terminology

    I: add $s1, $s2, $s3   # $s1 is written
    J: sub $s4, $s1, $s3   # $s1 is read

Hazard occurs when J reads the operand before I writes it

Example of a RAW Data Hazard

[Figure: sub $s2, $t1, $t3 followed by add $s4, $s2, $t5; or $s6, $t3, $s2; and $s7, $t4, $s2; sw $t8, 10($s2) over cycles CC1-CC8; the value of $s2 changes from 10 to 20 at the end of CC5]

Result of sub is needed by the add, or, and, & sw instructions
Instructions add & or will read the old value of $s2 from the register file
During CC5, $s2 is written at the end of the cycle; the old value is read

Solution 1: Stalling the Pipeline

[Figure: sub $s2, $t1, $t3; add $s4, $s2, $t5 stalled in decode for three cycles; or $s6, $t3, $s2 over cycles CC1-CC9]

Three stall cycles during CC3 thru CC5 (wasting 3 cycles)
Stall cycles delay execution of add & fetching of the or instruction

The add instruction cannot read $s2 until the beginning of CC6
The add instruction remains in the Instruction register until CC6
The PC register is not modified until the beginning of CC6

Solution 2: Forwarding ALU Result

The ALU result is forwarded (fed back) to the ALU input
No bubbles are inserted into the pipeline and no cycles are wasted
ALU result is forwarded from the ALU, MEM, and WB stages

[Figure: sub $s2, $t1, $t3 followed by add $s4, $s2, $t5; or $s6, $t3, $s2; and $s7, $s6, $s2; sw $t8, 10($s2) over cycles CC1-CC8, with forwarding arrows carrying the sub result to the dependent instructions]

Implementing Forwarding

Two multiplexers added at the inputs of the A & B registers
Data from the ALU stage, MEM stage, and WB stage is fed back
Two signals, ForwardA and ForwardB, control forwarding

[Figure: 4-input forwarding multiplexers (selects 0-3) on BusA and BusB, fed from the register file, the ALU result, the MEM-stage result, and the WB data]

Forwarding Control Signals

Signal        Explanation
ForwardA = 0  First ALU operand comes from register file = Value of (Rs)
ForwardA = 1  Forward result of previous instruction to A (from ALU stage)
ForwardA = 2  Forward result of 2nd previous instruction to A (from MEM stage)
ForwardA = 3  Forward result of 3rd previous instruction to A (from WB stage)
ForwardB = 0  Second ALU operand comes from register file = Value of (Rt)
ForwardB = 1  Forward result of previous instruction to B (from ALU stage)
ForwardB = 2  Forward result of 2nd previous instruction to B (from MEM stage)
ForwardB = 3  Forward result of 3rd previous instruction to B (from WB stage)

Forwarding Example

Instruction sequence:
    lw  $t4, 4($t0)
    ori $t7, $t1, 2
    sub $t3, $t4, $t7

When the sub instruction is fetched
ori will be in the ALU stage
lw will be in the MEM stage

ForwardA = 2 from MEM stage
ForwardB = 1 from ALU stage

[Figure: datapath with sub in the decode stage, ori in the ALU stage, lw in the MEM stage, and the two forwarding paths highlighted]

RAW Hazard Detection

Current instruction being decoded is in the Decode stage
Previous instruction is in the Execute stage
Second previous instruction is in the Memory stage
Third previous instruction is in the Write Back stage

If      ((Rs != 0) and (Rs == Rd2) and (EX.RegWrite))   ForwardA = 1
Else if ((Rs != 0) and (Rs == Rd3) and (MEM.RegWrite))  ForwardA = 2
Else if ((Rs != 0) and (Rs == Rd4) and (WB.RegWrite))   ForwardA = 3
Else                                                    ForwardA = 0

If      ((Rt != 0) and (Rt == Rd2) and (EX.RegWrite))   ForwardB = 1
Else if ((Rt != 0) and (Rt == Rd3) and (MEM.RegWrite))  ForwardB = 2
Else if ((Rt != 0) and (Rt == Rd4) and (WB.RegWrite))   ForwardB = 3
Else                                                    ForwardB = 0
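The priority logic above can be sketched as a small function. Python is used here purely for illustration; the stage names (Rd2/Rd3/Rd4, RegWrite per stage) follow the slides:

```python
def forward_select(src_reg, rd2, rd3, rd4,
                   ex_regwrite, mem_regwrite, wb_regwrite):
    """Select the forwarding mux input for one ALU operand.
    Returns 0 (register file), 1 (from ALU/EX stage),
    2 (from MEM stage), or 3 (from WB stage).
    Register $0 is never forwarded: it is hardwired to zero."""
    if src_reg != 0 and src_reg == rd2 and ex_regwrite:
        return 1   # closest producer wins (highest priority)
    if src_reg != 0 and src_reg == rd3 and mem_regwrite:
        return 2
    if src_reg != 0 and src_reg == rd4 and wb_regwrite:
        return 3
    return 0

# Forwarding example from the slides: sub $t3, $t4, $t7 in decode,
# ori $t7 in the ALU stage (Rd2), lw $t4 in the MEM stage (Rd3).
# MIPS register numbers: $t4 = 12, $t7 = 15.
print(forward_select(12, 15, 12, 0, True, True, False))  # ForwardA = 2
print(forward_select(15, 15, 12, 0, True, True, False))  # ForwardB = 1
```

Note the ordering: the most recent producer is checked first, so the newest value of the register is the one forwarded.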

Hazard Detect and Forward Logic

[Figure: Hazard Detect and Forward unit compares Rs and Rt against Rd2, Rd3, and Rd4 using the RegWrite signals of the EX, MEM, and WB stages; its outputs ForwardA and ForwardB drive the forwarding multiplexers]

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Pipeline Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Load Delay

Unfortunately, not all data hazards can be forwarded
Load has a delay that cannot be eliminated by forwarding

In the example shown below
The LW instruction does not read data until the end of CC4
Cannot forward data to ADD at the end of CC3 - NOT possible
However, load can forward data to the 2nd next and later instructions

[Figure: lw $s2, 20($t1) followed by add $s4, $s2, $t5; or $t6, $t3, $s2; and $t7, $s2, $t4 over cycles CC1-CC8]

Detecting RAW Hazard after Load

Detecting a RAW hazard after a Load instruction:
The load instruction will be in the EX stage
The instruction that depends on the load data is in the Decode stage

Condition for stalling the pipeline
    if ((EX.MemRead == 1)                      // Detect Load in EX stage
        and (ForwardA == 1 or ForwardB == 1))  // RAW Hazard
            Stall

Insert a bubble into the EX stage after a load instruction
Bubble is a no-op that wastes one clock cycle
Delays the dependent instruction after the load by one cycle
Because of the RAW hazard
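The stall condition above is a single boolean expression; a minimal Python sketch (illustration only, signal names as in the slides):

```python
def load_use_stall(ex_memread: bool, forward_a: int, forward_b: int) -> bool:
    """Stall when a load in the EX stage produces a value that the
    instruction currently in the Decode stage needs (ForwardA/B == 1)."""
    return bool(ex_memread and (forward_a == 1 or forward_b == 1))

# lw $s2, 20($s1) in EX, add $s4, $s2, $t5 in ID -> ForwardA = 1 -> stall
print(load_use_stall(True, 1, 0))   # True
# An instruction that does not use the load result needs no stall
print(load_use_stall(True, 0, 0))   # False
```

Only ForwardA/B == 1 (the immediately previous instruction) triggers a stall: values from the MEM and WB stages (selects 2 and 3) are already available and can be forwarded normally.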

Stall the Pipeline for one Cycle

ADD instruction depends on LW - stall at CC3
Allow the Load instruction in the ALU stage to proceed
Freeze the PC and Instruction registers (NO instruction is fetched)
Introduce a bubble into the ALU stage (bubble is a NO-OP)
Load can forward data to the next instruction after delaying it

[Figure: lw $s2, 20($s1); add $s4, $s2, $t5 delayed one cycle by a bubble; or $t6, $s3, $s2 over cycles CC1-CC8]

Showing Stall Cycles

Stall cycles can be shown on the instruction-time diagram
Hazard is detected in the Decode stage
Stall indicates that the instruction is delayed
Instruction fetching is also delayed after a stall
Data forwarding is shown using green arrows

Example:

[Figure: lw $s1, ($t5); lw $s2, 8($s1) stalled one cycle in decode; add $v0, $s2, $t3 stalled one cycle; sub $v1, $s2, $v0 over cycles CC1-CC10, with forwarding arrows between dependent instructions]

Hazard Detect, Forward, and Stall

[Figure: Hazard Detect, Forward, & Stall unit; MemRead of the EX stage feeds the stall logic; Stall disables the PC and Instruction registers, and a Bubble = 0 mux clears the control signals entering the EX stage]

Code Scheduling to Avoid Stalls

Compilers reorder code in a way to avoid load stalls
Consider the translation of the following statements:
A = B + C; D = E - F; // A thru F are in Memory

Slow code:
    lw   $t0, 4($s0)      # &B = 4($s0)
    lw   $t1, 8($s0)      # &C = 8($s0)
    add  $t2, $t0, $t1    # stall cycle
    sw   $t2, 0($s0)      # &A = 0($s0)
    lw   $t3, 16($s0)     # &E = 16($s0)
    lw   $t4, 20($s0)     # &F = 20($s0)
    sub  $t5, $t3, $t4    # stall cycle
    sw   $t5, 12($s0)     # &D = 12($s0)

Fast code: No Stalls
    lw   $t0, 4($s0)
    lw   $t1, 8($s0)
    lw   $t3, 16($s0)
    lw   $t4, 20($s0)
    add  $t2, $t0, $t1
    sw   $t2, 0($s0)
    sub  $t5, $t3, $t4
    sw   $t5, 12($s0)
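The effect of scheduling can be checked with a toy stall counter. This Python sketch (not from the slides) models only the one-cycle load-use stall of this pipeline: a stall occurs when an instruction reads the destination of the immediately preceding lw:

```python
def count_load_stalls(instrs):
    """Count one-cycle load-use stalls in a straight-line sequence.
    Each instruction is a tuple (op, dest, sources)."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls

slow = [("lw", "$t0", []), ("lw", "$t1", []),
        ("add", "$t2", ["$t0", "$t1"]), ("sw", None, ["$t2"]),
        ("lw", "$t3", []), ("lw", "$t4", []),
        ("sub", "$t5", ["$t3", "$t4"]), ("sw", None, ["$t5"])]

fast = [("lw", "$t0", []), ("lw", "$t1", []),
        ("lw", "$t3", []), ("lw", "$t4", []),
        ("add", "$t2", ["$t0", "$t1"]), ("sw", None, ["$t2"]),
        ("sub", "$t5", ["$t3", "$t4"]), ("sw", None, ["$t5"])]

print(count_load_stalls(slow))  # 2 stall cycles
print(count_load_stalls(fast))  # 0 stall cycles
```

The scheduled version simply moves the independent loads up so no load result is consumed in the very next cycle.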

Name Dependence: Write After Read

Instruction J should write its result after it is read by I
Called anti-dependence by compiler writers

    I: sub $t4, $t1, $t3   # $t1 is read
    J: add $t1, $t2, $t3   # $t1 is written

Results from reuse of the name $t1
NOT a data hazard in the 5-stage pipeline because:
Reads are always in stage 2
Writes are always in stage 5, and
Instructions are processed in order

Anti-dependence can be eliminated by renaming
Use a different destination register for add (e.g., $t5)

Name Dependence: Write After Write

Same destination register is written by two instructions
Called output-dependence in compiler terminology

    I: sub $t1, $t4, $t3   # $t1 is written
    J: add $t1, $t2, $t3   # $t1 is written again

Not a data hazard in the 5-stage pipeline because:
All writes are ordered and always take place in stage 5

However, can be a hazard in more complex pipelines
If instructions are allowed to complete out of order, and
Instruction J completes and writes $t1 before instruction I

Output dependence can be eliminated by renaming $t1
Read After Read is NOT a name dependence

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Control Hazards

Jump and Branch can cause great performance loss
Jump instruction needs only the jump target address
Branch instruction needs two things:
Branch Result: Taken or Not Taken
Branch Target Address:
    PC + 4                  if the branch is NOT taken
    PC + 4 + 4 x immediate  if the branch is Taken

Jump and Branch targets are computed in the ID stage
At which point a new instruction is already being fetched
Jump instruction: 1-cycle delay
Branch: 2-cycle delay for branch result (taken or not taken)

2-Cycle Branch Delay

Control logic detects a Branch instruction in the 2nd stage
ALU computes the branch outcome in the 3rd stage
Next1 and Next2 instructions will be fetched anyway
Convert Next1 and Next2 into bubbles if the branch is taken

[Figure: beq $t1, $t2, L1 followed by Next1 and Next2, which become bubbles when the branch is taken; the L1 target instruction is fetched in cc4 using the branch target address]

Implementing Jump and Branch

Branch target & outcome are computed in the ALU stage
Branch Delay = 2 cycles

[Figure: datapath with Next PC logic selecting the Jump or Branch Target via PCSrc using J, Beq, Bne, and zero; a Bubble = 0 mux clears the control signals of the wrongly fetched instructions]

Predict Branch NOT Taken

Branches can be predicted to be NOT taken
If the branch outcome is NOT taken then
Next1 and Next2 instructions can be executed
Do not convert Next1 & Next2 into bubbles
No wasted cycles

[Figure: beq $t1, $t2, L1 resolved as NOT taken in the ALU stage; Next1 and Next2 continue through the pipeline normally]

Reducing the Delay of Branches

Branch delay can be reduced from 2 cycles to just 1 cycle
Branches can be determined earlier, in the Decode stage
A comparator is used in the Decode stage to determine the branch
decision, whether the branch is taken or not
Because of forwarding, the delay in the second stage will be
increased, and this will also increase the clock cycle

Only one instruction that follows the branch is fetched
If the branch is taken then only one instruction is flushed
We should insert a bubble after a jump or taken branch
This will convert the next instruction into a NOP

Reducing Branch Delay to 1 Cycle

Data is forwarded and then compared in the Decode stage (longer cycle)
Reset signal converts the instruction after a jump or taken branch
into a bubble

[Figure: datapath with the branch comparator (Zero) moved into the Decode stage; PCSrc selects the Jump or Branch Target; the Reset signal and a Bubble = 0 mux kill the wrongly fetched instruction]

Next . . .

Pipelining versus Serial Execution
Pipelined Datapath and Control
Pipeline Hazards
Data Hazards and Forwarding
Load Delay, Hazard Detection, and Stall
Control Hazards
Delayed Branch and Dynamic Branch Prediction

Branch Hazard Alternatives

Predict Branch Not Taken (previously discussed)
Successor instruction is already fetched
Do NOT flush the instruction after the branch if the branch is NOT taken
Flush only instructions appearing after a jump or taken branch

Delayed Branch
Define the branch to take place AFTER the next instruction
Compiler/assembler fills the branch delay slot (for 1 delay cycle)

Dynamic Branch Prediction
Loop branches are taken most of the time
Must reduce branch delay to 0, but how?
How to predict branch behavior at runtime?

Delayed Branch

Define the branch to take place after the next instruction
For a 1-cycle branch delay, we have one delay slot:

           branch instruction
           branch delay slot    (next instruction)
    label: branch target        (if branch taken)
           . . .

Compiler fills the branch delay slot
By selecting an independent instruction from before the branch

    label: . . .                       label: . . .
           add $t2,$t3,$t4       ==>          beq $s1,$s0,label
           beq $s1,$s0,label                  add $t2,$t3,$t4   # delay slot

If no independent instruction is found
Compiler fills the delay slot with a NO-OP

Drawback of Delayed Branching

New meaning for the branch instruction
Branching takes place after the next instruction (not immediately!)

Impacts software and compiler
Compiler is responsible for filling the branch delay slot
For a 1-cycle branch delay: one branch delay slot

However, modern processors are deeply pipelined
Branch penalty is multiple cycles in deeper pipelines
Multiple delay slots are difficult to fill with useful instructions

MIPS used delayed branching in earlier pipelines
However, delayed branching is not useful in recent processors

Zero-Delayed Branching

How to achieve zero delay for a jump or a taken branch?
Jump or branch target address is computed in the ID stage
Next instruction has already been fetched in the IF stage

Solution
Introduce a Branch Target Buffer (BTB) in the IF stage
Store the target address of recent branch and jump instructions
Use the lower bits of the PC to index the BTB
Each BTB entry stores the Branch/Jump address & Target Address
Check the PC to see if the instruction being fetched is a branch
Update the PC using the target address stored in the BTB
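The index-then-tag-check behavior can be sketched in a few lines. This Python model is an illustration only; the entry count and 4-byte instruction alignment are assumptions, not taken from the slides:

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: low-order PC bits index the table, and the
    stored branch address acts as the tag that must match the full PC."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.table = [None] * entries  # each entry: (branch_pc, target_pc)

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # drop byte offset, keep low bits

    def update(self, branch_pc: int, target_pc: int) -> None:
        self.table[self._index(branch_pc)] = (branch_pc, target_pc)

    def lookup(self, pc: int):
        """Predicted target on a hit, None on a miss (fetch PC + 4)."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

btb = BranchTargetBuffer()
btb.update(0x400048, 0x400010)    # a recently taken branch and its target
print(hex(btb.lookup(0x400048)))  # hit: redirect fetch to 0x400010
print(btb.lookup(0x40004C))       # miss: None, fetch falls through to PC + 4
```

The tag check against the full branch address is what lets the IF stage redirect fetch without even decoding the instruction.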

Branch Target Buffer

The branch target buffer is implemented as a small cache
Stores the target address of recent branches and jumps

We must also have prediction bits
To predict whether branches are taken or not taken
The prediction bits are dynamically determined by the hardware

[Figure: Branch Target & Prediction Buffer indexed by the low-order bits of the PC; each entry holds the address of a recent branch, its target address, and predict bits; a hit with predict_taken steers the PC mux away from PC increment]

Dynamic Branch Prediction

Prediction of branches at runtime using prediction bits
Prediction bits are associated with each entry in the BTB
Prediction bits reflect the recent history of a branch instruction
Typically few prediction bits (1 or 2) are used per entry

We don't know if the prediction is correct or not
If correct prediction
Continue normal execution - no wasted cycles
If incorrect prediction (misprediction)
Flush the instructions that were incorrectly fetched - wasted cycles
Update prediction bits and target address for future use

Dynamic Branch Prediction Cont'd

IF: Use the PC to address the Instruction Memory and the Branch Target Buffer
    Found a BTB entry with predict taken?  Yes: PC = target address
                                           No: increment PC

ID: Jump or taken branch that was not predicted?
    Yes: Mispredicted jump/branch
         Enter the jump/branch address, target address, and set
         prediction in the BTB entry
         Flush the fetched instructions
         Restart PC at the target address

EX: Predicted taken - is the branch actually taken?
    Yes: Correct prediction, no stall cycles
    No:  Mispredicted branch (branch not taken)
         Update the prediction bits
         Flush the fetched instructions
         Restart PC after the branch

1-bit Prediction Scheme

Prediction is just a hint that is assumed to be correct
If incorrect then fetched instructions are flushed

1-bit prediction scheme is simplest to implement
1 bit per branch instruction (associated with BTB entry)
Record last outcome of a branch instruction (Taken/Not taken)
Use last outcome to predict future behavior of a branch

[State diagram: Predict Not Taken <-> Predict Taken; a Taken outcome moves to Predict Taken, a Not Taken outcome moves to Predict Not Taken]

1-Bit Predictor: Shortcoming

Inner loop branch mispredicted twice!
Mispredicted as taken on the last iteration of the inner loop
Then mispredicted as not taken on the first iteration of the inner
loop next time around

    outer: . . .
    inner: . . .
           bne . . . , . . . , inner
           . . .
           bne . . . , . . . , outer
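The double misprediction is easy to reproduce with a toy model. A Python sketch (illustration only): the predictor simply remembers the last outcome of the branch.

```python
def predict_1bit(outcomes, last=True):
    """1-bit predictor: predict that the branch repeats its last outcome.
    Returns the number of mispredictions over a sequence of outcomes
    (True = taken, False = not taken)."""
    mispredicts = 0
    for taken in outcomes:
        if last != taken:
            mispredicts += 1
        last = taken  # remember only the most recent outcome
    return mispredicts

# Inner loop: taken 9 times then exits (not taken), executed twice.
# The exit mispredicts, and so does the first iteration of the next pass.
loop = ([True] * 9 + [False]) * 2
print(predict_1bit(loop, last=True))  # 3 mispredictions
```

Every full pass after the first costs two mispredictions: one on the loop exit and one on re-entry, exactly the shortcoming described above.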

2-bit Prediction Scheme

1-bit prediction scheme has a performance shortcoming
2-bit prediction scheme works better and is often used
4 states: strong and weak predict taken / predict not taken

Implemented as a saturating counter
Counter is incremented to max = 3 when branch outcome is taken
Counter is decremented to min = 0 when branch is not taken

[State diagram: Strong Predict Not Taken (0) <-> Weak Predict Not Taken (1) <-> Weak Predict Taken (2) <-> Strong Predict Taken (3); a Taken outcome moves toward 3, a Not Taken outcome moves toward 0]
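The saturating counter takes only a few lines. A Python sketch (illustration only), using the state encoding 0-3 from the diagram, with states 2-3 predicting taken:

```python
def predict_2bit(outcomes, state=0):
    """2-bit saturating counter: states 0-1 predict not taken,
    states 2-3 predict taken; a taken outcome increments (max 3),
    a not-taken outcome decrements (min 0).
    Returns the number of mispredictions."""
    mispredicts = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            mispredicts += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts

# Same inner-loop pattern as the 1-bit example: taken 9 times then
# exits, run twice. Starting in Strong Predict Taken, only the loop
# exits mispredict; re-entry stays correctly predicted taken.
loop = ([True] * 9 + [False]) * 2
print(predict_2bit(loop, state=3))  # 2 mispredictions
```

One not-taken outcome only moves the counter from strong to weak taken, so a single anomaly does not flip the prediction — this is the advantage over the 1-bit scheme.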

Fallacies and Pitfalls

Pipelining is easy!
The basic idea is easy
The devil is in the details
Detecting data hazards and stalling the pipeline

Poor ISA design can make pipelining harder
Complex instruction sets (Intel IA-32)
Significant overhead to make pipelining work
IA-32 micro-op approach
Complex addressing modes
Register update side effects, memory indirection

Pipeline Hazards Summary

Three types of pipeline hazards
Structural hazards: conflicts using a resource during the same cycle
Data hazards: due to data dependencies between instructions
Control hazards: due to branch and jump instructions

Hazards limit the performance and complicate the design
Structural hazards: eliminated by careful design or more hardware
Data hazards are eliminated by forwarding
However, load delay cannot be eliminated and stalls the pipeline
Delayed branching can be a solution when branch delay = 1 cycle
BTB with branch prediction can reduce branch delay to zero
Branch misprediction should flush the wrongly fetched instructions
