
Module 1: Pipelining

Module Outline:
• Pipelined Data Path Design
• Pipelined Control Path Design
• Hazards
• Exception Handling

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 1


Pipelining is Natural



Sequential Laundry



Pipelined Laundry: Start Work ASAP



Pipelining Lessons

throughput - overall system performance

stall = delay, i.e. we will have a delay in the pipelined computer whenever there are control / data dependences between the instructions
The Five Stages of Load

• Ifetch: Instruction Fetch
- fetch the instruction from the instruction memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file



NOTE: These 5 Stages Were There All Along!
PC - Program Counter

* Every instruction type may go through a different number of pipeline stages in total, e.g. BEQ - 3 PL stages!

R - Register File; rt - destination register for the update!



Pipelining

• As the instruction execution cycle uses different hardware components for different
steps, it is possible to set up a pipeline.
P1: instruction fetch unit
P2: instruction analyzer unit
P3: address calculation unit
P4: data fetch unit
P5: instruction execution unit

P1: 1 2 3 4 5 6 7 8
P2: 1 2 3 4 5 6 7
P3: 1 2 3 4 5 6
P4: 1 2 3 4 5
P5: 1 2 3 4

Time
1 2 3 4 5 6 7 8

• Suppose each stage takes 1 nsec (nanosecond). Then each instruction still takes 5 nsec. But once the pipeline has been filled, a complete instruction rolls out every 1 nsec. Thus, the speedup is 5.



Pipelining

• Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
• Pipelining is the key implementation technique currently used to build high-performance CPUs.
• Pipelining does not help to lower the latency of a single instruction, but it helps to
increase the throughput of the entire program execution.
• The pipeline rate is limited by the slowest pipeline stage.
• The principle: multiple independent tasks are operating simultaneously.
• Potential speedup = number of pipeline stages.
• Unbalanced pipeline stage lengths reduce the speedup: the key is that every stage should take the same time.
• The time needed to “fill” the pipeline and the time needed to “drain” it reduces the
speedup.
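The fill/drain effect can be quantified with a small model (the helper names here are my own, not from the slides): an n-instruction program on a k-stage pipeline with stage time t finishes in (k + n - 1) * t, versus n * k * t without pipelining.

```python
def unpipelined_time(n_instr, n_stages, stage_ns):
    # Each instruction occupies the whole datapath for all its stages.
    return n_instr * n_stages * stage_ns

def pipelined_time(n_instr, n_stages, stage_ns):
    # The first instruction fills the pipeline (k cycles); after that,
    # one instruction completes every cycle.
    return (n_stages + n_instr - 1) * stage_ns

# Speedup approaches the stage count (5) as the program grows,
# but fill/drain time keeps it below the ideal.
for n in (5, 100, 10_000):
    s = unpipelined_time(n, 5, 1) / pipelined_time(n, 5, 1)
    print(n, round(s, 2))
```

For 100 instructions the pipelined time is (5 + 100 - 1) = 104 cycles, so the speedup is already close to, but not quite, 5.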



Pipelining in the CPU
instruction-level parallelism !
• Pipelining is an implementation technique that exploits parallelism among
instructions in a sequential instruction stream.
• A major advantage of pipelining over “parallel processing” is that it is not visible to
the programmer (whereas in parallel processing, the program usually needs to
specify what kinds of tasks to be executed in parallel).
• In a computer system, each pipeline stage completes a part of the instruction being
executed.
• The time required to move an instruction one step down the pipeline is a machine cycle (the clock cycle). The length of a machine cycle is determined by the time required by the slowest stage.
• The computer engineer should try to balance the length of each pipeline stage so as to achieve the ideal speedup. In practice, however, the pipeline stages will not be perfectly balanced, and there are additional overheads. But we can get close to the ideal case (i.e. CPI = 1).



Pipelining in the CPU (continued)

Design Issues:
• We have to make sure that the same resource (e.g., ALU) is not used in more than
one pipeline stage.
• If the resources used in each pipeline stage are different, then overlapping is possible.
• However, we must note that to retain the intermediate values produced by an
individual instruction for all its pipeline stages, we must include temporary registers
between the pipeline stages.
Time (in clock cycles): CC1 CC2 CC3 CC4 CC5 CC6 CC7
Program execution order (in instructions):
lw $1, 100($0):  IM Reg ALU DM Reg
lw $2, 200($0):     IM Reg ALU DM Reg
lw $3, 300($0):        IM Reg ALU DM Reg

$1, $2 or $3 represent Register #1, #2 and #3


Performance of Pipelining

• Pipelining increases the processor instruction throughput - the number of instructions completed per unit time.
• Remember: pipelining does not reduce the execution time of a single instruction.
• The increase in instruction throughput means that the program runs faster, even though no single instruction runs faster.
• Imbalance among the pipeline stages reduces performance since the clock cannot run faster than the time needed for the slowest pipeline stage.
• Pipelining overhead can arise from the combination of pipeline register delay and other factors.
* pipeline register delay: the propagation delay of the registers that transfer values between the pipeline stages



Graphically Representing Pipelines

• Can help with answering questions like:
- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?
- use this representation to help understand datapaths



Conventional Pipelined Execution Representation



Single Cycle, Multiple Cycle, vs. Pipeline



Why Pipelining Actually?

• Suppose we execute 100 instructions
• Single Cycle Machine
- 45 ns/cycle x 1 CPI x 100 instructions = 4500 ns
• Multi-Cycle Machine
- 10 ns/cycle x 4.6 CPI (due to instruction mix) x 100 instructions = 4600 ns
• Ideal pipelined machine
- 10 ns/cycle x (1 CPI x 100 instructions + 4 cycles drain) = 1040 ns!!

Ans: pipelining is much faster, i.e. in the above example, the total duration is around 1/4 of the total time required for the single-cycle machine!
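The three totals can be checked directly (a quick sketch; the numbers are the ones on this slide):

```python
single_cycle = 45 * 1 * 100        # 45 ns/cycle, CPI = 1
multi_cycle  = 10 * 4.6 * 100      # 10 ns/cycle, CPI = 4.6 from the mix
pipelined    = 10 * (1 * 100 + 4)  # 1 CPI, plus 4 cycles to drain
print(single_cycle, round(multi_cycle), pipelined)  # 4500 4600 1040
print(round(single_cycle / pipelined, 2))           # 4.33
```

4500 / 1040 is about 4.33, i.e. the pipelined machine takes roughly 1/4 of the single-cycle time.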



Why Pipeline? Because We Can!



Can Pipelining Get Us into Trouble?

YES: Pipeline Hazards


• Structural Hazards: attempt to use the same resource in two different ways (e.g., by
two different instructions) at the same time
- e.g., combined washer/dryer would be a structural hazard or “folder” busy doing
something else (e.g., watching TV ;-)
• Control Hazards: attempt to make a decision before the condition is evaluated
- e.g., washing football uniforms: need to get the proper detergent level; need to see the result after the dryer before starting the next load
(Control hazards are related to branch instructions)
• Data Hazards: attempt to use an item before it is ready
- e.g., one sock of a pair in the dryer and one in the washer; can't fold until the sock gets from the washer through the dryer
- an instruction depends on the result of a prior instruction still in the pipeline
• Can always resolve these hazards by waiting (pipeline stall):
- pipeline control must detect the hazard
- take action (or delay action) to resolve hazards
(For all types of hazards, the problem can be resolved by inserting a pipeline stall/delay!)



Single Memory Is a Structural Hazard

(IF) (ID) (EX) (MEM) (WB)

(Read on the main Memory)

(IF)

CC#4 - conflict of interests!

Solution: insert a delay/stall for Instr #3


Structural Hazard Limits Performance
on average - 1.3 memory accesses for each instruction!
• Example: If 1.3 memory accesses per instruction, and only one memory access per
cycle then:
- average CPI >= 1.3
- otherwise resource is more than 100% utilized (impossible!)
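The utilization argument can be made concrete with a two-line sketch (the variable names are my own):

```python
mem_accesses_per_instr = 1.3  # instruction fetch + ~0.3 data accesses
mem_accesses_per_cycle = 1.0  # a single shared memory port

# The one port must serve every access, so cycles per instruction can
# never drop below accesses-per-instruction / accesses-per-cycle.
min_cpi = mem_accesses_per_instr / mem_accesses_per_cycle
print(min_cpi)  # 1.3 -- anything lower would need >100% port utilization
```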



Control Hazard Solution #1: Stall

BEQ - Branch on Equal

IF (R1 == 0) ?

* MOVE the decision earlier, to the decode stage

impact: 1 lost cycle



Control Hazard Solution #2: Predict

e.g. if 30% of instructions are branches, then the total CPI = 1.5 x 0.3 + 1.0 x 0.7 = 1.15
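Branches costing 1.5 cycles under this scheme, with a 30% branch frequency, give a weighted-average CPI; a quick check (sketch):

```python
branch_frac = 0.30  # fraction of instructions that are branches
# Weighted average: branches cost 1.5 cycles, everything else costs 1.
cpi = 1.5 * branch_frac + 1.0 * (1 - branch_frac)
print(round(cpi, 2))  # 1.15
```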



Control Hazard Solution #3: Delayed Branch



Data Hazard on r1: Read After Write (RAW)
There must be a data dependence!

add r1, r2, r3   (W)

ADD rd, rs, rt:
rd - destination register (W)
rs - 1st source operand reg. (R)
rt - 2nd source operand reg. (R)

sub r4, r1, r3   (R)
and r6, r1, r7   (R)
or r8, r1, r9    (R)
xor r10, r1, r11 (R)

* r1 is the common register for all these instructions!
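Without forwarding, every later instruction here reads r1 before add has written it back. A small checker (a sketch; the tuple encoding of instructions is my own) flags each RAW dependence in the sequence:

```python
# (name, dest, src1, src2) register numbers for the sequence above
program = [
    ("add", 1, 2, 3),    # add r1, r2, r3 -- writes r1
    ("sub", 4, 1, 3),    # sub r4, r1, r3 -- reads r1
    ("and", 6, 1, 7),
    ("or",  8, 1, 9),
    ("xor", 10, 1, 11),
]

writer_of = {}  # register number -> name of the last instruction writing it
hazards = []    # (reader, register, writer) RAW pairs
for name, rd, rs, rt in program:
    for src in (rs, rt):
        if src in writer_of:
            hazards.append((name, src, writer_of[src]))
    writer_of[rd] = name

for reader, reg, writer in hazards:
    print(f"RAW: {reader} reads r{reg} written by {writer}")
```

This flags every dependence on r1; in the real 5-stage pipeline only the instructions close enough to add actually hazard (the register file's write-first-half/read-second-half trick and forwarding resolve the rest).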



Data Hazard on r1: Read After Write (RAW)

(W on R1)

(R on R1)

(Read on R1 with the new/updated value!)



Data Hazard Solution: Forwarding

the earliest stage at which the result is available!

(W)

(R)



Forwarding (or Bypassing): What About Loads?

(R)



Forwarding (or Bypassing): What About Loads?



Designing a Pipelined Processor
datapath - the sequence of operations for updating the data items
control diagram - the sequence of control signals to facilitate the program execution

• Go back and examine your datapath and control diagram
• Associate resources with states
• Ensure that flows do not conflict, or figure out how to resolve them
- solve the specific type of hazard!
• Assert control in the appropriate stage
- provide the solution to the hazard!!



Control and Datapath: Split State Diagram into 5 Pieces

(Datapath diagram: IF, ID, EX, MEM, WB stages)

(Control diagram: WB, IF, ID, EX, MEM stages)


Summary of Concepts

• Reduce CPI by overlapping many instructions
- average throughput of approximately 1 CPI with a fast clock
• Utilize capabilities of the datapath
- start the next instruction while working on the current one
- limited by the length of the longest stage (plus fill/flush)
- detect and resolve hazards
• What makes it easy (Very important for the pipeline design!!)
- all instructions are of the same length
- just a few instruction formats (Reduced Instruction Set Computers (RISC))
- memory operands appear only in loads and stores (LW, SW)
• What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
Whenever two instructions have a data dependence, there will be a data hazard!



Pipelined Processor

(Pipeline stages: IF, ID/RF, EX, MEM, WB)



Pipelining the Load Instruction

• The 5 independent functional units in the pipeline datapath are:
- instruction memory for the Ifetch stage
- register file's Read ports (bus A and bus B) for the Reg/Dec stage
- ALU for the Exec stage
- data memory for the Mem stage
- register file's Write port (bus W) for the Wr stage



The Four Stages of R-type

• Ifetch: Instruction Fetch
- fetch the instruction from the instruction memory
• Reg/Dec: registers fetch and instruction decode
• Exec:
- ALU operates on the two register operands
- update PC
• Wr: write the ALU output back to the register file



Pipelining the R-type and Load Instruction

* Both (Load) & (R-type) try to write back to the same register file

• We have a pipeline conflict or structural hazard:
- two instructions try to write to the register file at the same time
- only one write port



Important Observation
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions

• 2 ways to solve this pipeline hazard



Solution 1: Insert “Bubble” into the Pipeline

• Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle
- the control logic can be complex
- lose instruction fetch and issue opportunity
• No instruction is started in Cycle 6!!



Solution 2: Delay R-type’s Write by One Cycle

• Delay R-type’s register write by one cycle:



Modified Control and Datapath

* Add a NOOP "Mem" stage for R-type and ORi-type



The Four Stages of Store

* NOOP for WB
• Ifetch: Instruction Fetch
- fetch the instruction from the instruction memory
• Reg/Dec: registers fetch and instruction decode
• Exec: calculate the memory address
• Mem: write the data into the data memory



The Three Stages of Beq

NOOP for both MEM & WB !

• Ifetch: Instruction Fetch
- fetch the instruction from the instruction memory
• Reg/Dec: registers fetch and instruction decode
• Exec:
- compares the two register operands
- select correct branch target address
- latch into PC



Control Diagram



Data Stationary Control



Datapath + Data Stationary Control
IR - Instruction Register
rw = rd (destination register)
(control signal)



Let’s Try It Out
rd - destination register
* Each instruction is of 4 bytes!
* Memory addresses are specified in octal (base 8), giving where the instruction is stored!

index addressing - r2 holds the base address, the constant 35 is the offset:
target memory addr = [r2] + 35 (offset)
[r2] + 3 -> [r2]
[r4] - [r5] -> [r3]
branch to address 100 (octal) if ([r6] == [r7])



Fetch
Start: fetch the instruction at addr 10



Fetch 14, Decode 10



Fetch 20, Decode 14, Exec 10

IR - Instruction Register
IF - Instruction Fetch
ID - Instruction Decode
EX - Execution stage
PC - Program Counter



Fetch 24, Decode 20, Exec 14, Mem 10



Fetch 30, Decode 24, Exec 20, Mem 14, WB 10

index addressing: r2 (base addr) + offset (35)

No Mem access is needed (NOOP)

if (r6 == r7) then jump to the instr at 100!

- ori is a misc (or independent) instr. to be inserted after BEQ



Fetch 100, Decode 30, Exec 24, Mem 20, WB 14
Here, r6 == r7 is true. So, PC = 100

Due to the first instr. that is completed

NOOP! sub does not need any MEM operation!



Fetch 104, Decode 100, Exec 30, Mem 24, WB 20
and r13,r14,15   ori r8,r9,17   beq r6,r7,100   sub r3,r4,r5

104

NOOP!

** Each instruction requires 4 bytes. The instr. after 100 will be 104!



Fetch 110, Decode 104, Exec 100, Mem 30, WB 24
Instr. at 104   and r13,r14,15   ori instr.   beq instr.

(operands 15 and r14; MEM stage - no access needed)

110



Fetch 114, Decode 110, Exec 104, Mem 100, WB 30
instr. at 110   instr. at 104   and instr.   ori instr.

114



Pipeline Hazards Again

* they try to compete for the MEM unit!

Due to the Branch Instr!

e.g. common register as R1

common register as R2
* WB on R2 by the 2nd instr should be AFTER WB on R2 by the 3rd instr
* WB on R3 by the last instr should be AFTER reading on R3 by the 4th instr, i.e. the RS should be after the OF in the above diagram!
Data Hazards

• Avoid some “by design”


- eliminate WAR by always fetching operands early (DCD) in pipe
- eliminate WAW by doing all WBs in order (last stage, static) i.e. serialize all the WBs

• Detect and resolve remaining ones


- stall or forward (if possible)



Hazard Detection

• Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline (i - the current instr, j - an earlier/previous instr)

• A RAW hazard exists on register ρ if ρ ∈ Rregs ( i ) ∩ Wregs ( j )
- keep a record of pending writes (for instructions in the pipe) and compare with the operand regs of the current instruction
- when an instruction issues, reserve its result register
- when an operation completes, remove its write reservation
(Rregs(i) = set of registers to be read by the current instr i; Wregs(j) = set of registers to be written back by the previous instr j)
• A WAW hazard exists on register ρ if ρ ∈ Wregs ( i ) ∩ Wregs ( j )
• A WAR hazard exists on register ρ if ρ ∈ Wregs ( i ) ∩ Rregs ( j )
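The three set-intersection tests translate directly into code (a sketch; the function and argument names are my own):

```python
def hazards(rregs_i, wregs_i, rregs_j, wregs_j):
    """Classify hazards between the current instr i and an earlier instr j."""
    return {
        "RAW": rregs_i & wregs_j,  # i reads what j will write
        "WAW": wregs_i & wregs_j,  # i writes what j will write
        "WAR": wregs_i & rregs_j,  # i writes what j still reads
    }

# sub r4, r1, r3 issued after add r1, r2, r3:
h = hazards(rregs_i={1, 3}, wregs_i={4}, rregs_j={2, 3}, wregs_j={1})
print(h)  # RAW on r1; no WAW, no WAR
```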



Record of Pending Writes

Read register set by the current instr i

ID

implied by one of the following conditions being TRUE

EX

MEM

WB



Resolve RAW by Forwarding

MUX
- Multiplexer



What about Memory Operations? LW or SW

• If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
• What does delaying WB on arithmetic operations (e.g. ADD or SUB) cost?
- cycles?
- hardware?
• What about data dependence on loads? (LW)
- R1 <-- R4 + R5
- R2 <-- Mem[R2 + I]   (index addressing!)
- R3 <-- R2 + R1   (what IF we change the 3rd instr to SW?)
- ==> delayed loads (LOAD stalls/delays)
• Can recognize this in the decode stage and introduce a bubble while stalling the fetch stage
• Tricky situation:
- R1 <-- Mem[R2 + I]   (LW)
- Mem[R3 + 34] <-- R1   (SW)
Handle with a bypass in the memory stage!
SW r6, 20(r4)   At the Mem stage of "SW", r6 -> Mem[20 + r4]
LW r7, 20(r4)   ??
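The SW/LW pair is the memory-stage analogue of register forwarding: if the load's effective address matches an in-flight store's, the memory stage can bypass the stored value instead of reading stale memory. A sketch (the register/memory model here is my own illustration):

```python
regs = {4: 1000, 6: 42, 7: 0}  # register file: r4, r6, r7
mem = {}                       # address -> value, the "real" data memory
store_buffer = {}              # in-flight stores not yet visible in mem

# SW r6, 20(r4): the value of r6 is headed for Mem[20 + r4]
addr = 20 + regs[4]
store_buffer[addr] = regs[6]

# LW r7, 20(r4): same effective address -- bypass from the store buffer
addr = 20 + regs[4]
regs[7] = store_buffer.get(addr, mem.get(addr, 0))
print(regs[7])  # 42, forwarded without waiting for the memory write
```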



Compiler Avoiding Load Stalls

* Scheduled version: rescheduling the instructions so as to reduce the load stalls
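The idea of the scheduled version can be sketched with a simple load-use stall counter (the instruction encoding and the example program are my own illustration): the compiler moves an independent instruction between a load and the first use of its result.

```python
def load_use_stalls(program):
    """Count 1-cycle stalls when an instruction uses a register loaded
    by the immediately preceding instruction (forwarding assumed)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _srcs = prev
        if op == "lw" and dest in cur[2]:  # cur[2] = source registers
            stalls += 1
    return stalls

# naive order: each lw feeds the very next add -> one stall each
naive = [
    ("lw",  1, ()),   ("add", 2, (1,)),
    ("lw",  3, ()),   ("add", 4, (3,)),
]
# scheduled: hoist the second load between the first load and its use
sched = [
    ("lw",  1, ()),   ("lw",  3, ()),
    ("add", 2, (1,)), ("add", 4, (3,)),
]
print(load_use_stalls(naive), load_use_stalls(sched))  # 2 0
```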



What about interrupts, traps, faults?

• External interrupts:
- allow pipeline to drain
- load PC with interrupt address
• Faults (within instruction, restartable)
- force trap instruction into IF
- disable writes till trap hits WB
- must save multiple PCs or PC + state
• Recall: precise exceptions ==> state of the machine is preserved as if program
executed up to the offending instruction
- all previous instructions completed
- offending instruction and all following instructions act as if they have not even started
- same system code will work on different implementations



Exception Problem

• Exceptions/Interrupts: 5 instructions executing in a 5-stage pipeline
- how to stop the pipeline?
- restart?
- who caused the interrupt?
Stage   Problem when interrupts occur
IF      page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      undefined or illegal opcode
EX      arithmetic exception
MEM     page fault on data fetch; misaligned memory access; memory-protection violation; memory error

• Load with data page fault, Add with instruction page fault?
• Solution
1: interrupt vector/instruction
2: interrupt ASAP, restart everything incomplete



Another look at the exception problem

• Use the pipeline to sort this out!
- pass exception status along with instruction
- keep track of PCs for every instruction in pipeline
- don’t act on exception until it reaches WB stage
• Handle interrupts through “faulting no-op” in IF stage
• When an instruction reaches the WB stage:
- save PC => EPC, interrupt vector address => PC
- turn all instructions in earlier stages into no-ops
(* Here, we are trying to execute the Interrupt Service Routine (ISR))



Exception Handling



Resolution: Freeze above & Bubble below



MIPS - Microprocessor without Interlocked Pipeline Stages
R3000 is one of the models under the MIPS architecture.

FYI: MIPS R3000 Clocking Discipline

Write Read
(WB)

ref. back to Page 19 ("Single Memory Is a Structural Hazard") (W|R)



MIPS R3000 Instruction Pipeline



Recall: Data Hazard on r1
In the first half of the 5th clock cycle, r1 is already updated.

In the 2nd half of the 5th clock cycle, when the "OR" instruction performs its register fetch (RF) - successful!



MIPS R3000 Multicycle Operations



Issues in Pipelined Design

Very Long Instruction Word

vector-based computers



Summary of Concepts

• What makes it easy
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores
• What makes it hard? HAZARDS!!
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
• Pipelines pass control information down the pipe just as data moves down pipe
• Forwarding/stalls handled by local control
• Exceptions stop the pipeline
• MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load)
• More performance from deeper pipelines, parallelism

