
Module 1: Pipelining

Module Outline:
• Pipelined Data Path Design
• Pipelined Control Path Design
• Hazards
• Exception Handling

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 1


Pipelining is Natural

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 2


Sequential Laundry

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 3


Pipelined Laundry: Start Work ASAP

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 4


Pipelining Lessons

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 5


The Five Stages of Load

• Ifetch: Instruction Fetch


- fetch the instruction from the instruction memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 6


NOTE: These 5 Stages Were There All Along!

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 7


Pipelining

• As the instruction execution cycle uses different hardware components for different
steps, it is possible to set up a pipeline.
  P1: instruction fetch unit
  P2: instruction analyzer unit
  P3: address calculation unit
  P4: data fetch unit
  P5: instruction execution unit

            Time (clock cycles)
             1  2  3  4  5  6  7  8
     P1:     1  2  3  4  5  6  7  8
     P2:        1  2  3  4  5  6  7
     P3:           1  2  3  4  5  6
     P4:              1  2  3  4  5
     P5:                 1  2  3  4
  (table entries are instruction numbers)

• Suppose each stage takes 1 nsec. Then each instruction still takes 5 nsec. But once
the pipeline has been filled, a complete instruction rolls out every 1 nsec. Thus, the
speedup is 5.
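As a quick check of this arithmetic, the short Python sketch below (not part of the original slides) computes the total time with and without pipelining for a given number of instructions, stages, and stage time:

    # 5 stages of 1 ns each: 8 instructions need 5 + (8 - 1) = 12 ns pipelined,
    # versus 8 * 5 = 40 ns when executed strictly one after another.
    def sequential_time(n_instr, n_stages, stage_ns):
        return n_instr * n_stages * stage_ns

    def pipelined_time(n_instr, n_stages, stage_ns):
        # fill the pipeline once, then one instruction completes every cycle
        return (n_stages + n_instr - 1) * stage_ns

    print(sequential_time(8, 5, 1.0))   # 40.0 ns
    print(pipelined_time(8, 5, 1.0))    # 12.0 ns; speedup approaches 5 as n grows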

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 8


Pipelining

• Pipelining is an implementation technique whereby multiple instructions are
overlapped in execution.
• Pipelining is the key implementation technique that is currently used to make high-
performance CPUs.
• Pipelining does not help to lower the latency of a single instruction, but it helps to
increase the throughput of the entire program execution.
• The pipeline rate is limited by the slowest pipeline stage.
• The principle: multiple independent tasks are operating simultaneously.
• Potential speedup = number of pipeline stages.
• Unbalanced pipeline stage lengths reduce the speedup: ideally, every stage should take
the same amount of time.
• The time needed to “fill” the pipeline and the time needed to “drain” it also reduce
the achievable speedup.

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 9


Pipelining in the CPU

• Pipelining is an implementation technique that exploits parallelism among
instructions in a sequential instruction stream.
• A major advantage of pipelining over “parallel processing” is that it is not visible to
the programmer (whereas in parallel processing, the program usually has to specify
which tasks are to be executed in parallel).
• In a computer system, each pipeline stage completes a part of the instruction being
executed.
• The time needed to move an instruction one step down the pipeline is a machine cycle
(normally one clock cycle). The length of a machine cycle is determined by the time
required for the slowest stage.
• The computer engineer should try to balance the length of each pipeline stage so as
to achieve the ideal speedup. In practice the stages will not be perfectly balanced,
and there is additional overhead, but we can get close to the ideal case.

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 10


Pipelining in the CPU (continued)

Design Issues:
• We have to make sure that the same resource (e.g., the ALU) is not used in more than
one pipeline stage.
• If each pipeline stage uses different resources from the others, then overlapping the
execution of successive instructions is possible.
• However, note that to retain the intermediate values an instruction produces as it
moves through the pipeline, we must insert temporary (pipeline) registers between the
stages (a small sketch follows the diagram below).
  Time (in clock cycles)     CC1   CC2   CC3   CC4   CC5   CC6   CC7
  Program execution order
  (in instructions):
  lw $1, 100($0)             IM    Reg   ALU   DM    Reg
  lw $2, 200($0)                   IM    Reg   ALU   DM    Reg
  lw $3, 300($0)                         IM    Reg   ALU   DM    Reg
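To make the role of these temporary registers concrete, here is a minimal Python sketch (names such as if_id and id_ex are my own labels, not from the slides) of the four inter-stage registers of a five-stage pipeline; on each clock edge every register captures the output of the stage in front of it:

    class PipelineRegisters:
        """Inter-stage latches that keep each in-flight instruction's state separate."""
        def __init__(self):
            self.if_id  = None   # fetched instruction word and its PC
            self.id_ex  = None   # decoded control fields and register operands
            self.ex_mem = None   # ALU result and data to be stored
            self.mem_wb = None   # loaded data or ALU result heading for write-back

        def clock_edge(self, fetched, decoded, alu_out, mem_out):
            # all four latches update together on the clock edge; each stage's
            # inputs for the next cycle are whatever was latched here
            self.mem_wb = mem_out
            self.ex_mem = alu_out
            self.id_ex  = decoded
            self.if_id  = fetched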

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 11


Performance of Pipelining

• Pipelining increases the processor instruction throughput, that is, the number of
instructions completed per unit time.
• Remember: pipelining does not reduce the execution time of a single instruction.
• The increase in instruction throughput means that the program runs faster, even
though no single instruction runs faster.
• Imbalance among the pipeline stages reduces performance since the clock cannot
run faster than the time needed for the slowest pipeline stage.
• Pipelining overhead arises from pipeline register (latch) delay and clock skew, both
of which add to the effective cycle time.
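A rough sketch of how stage imbalance and register overhead eat into the ideal speedup (the delays below are made-up numbers, purely for illustration):

    def pipelined_speedup(stage_delays_ns, latch_overhead_ns, n_instr):
        k = len(stage_delays_ns)
        per_instr_unpipelined = sum(stage_delays_ns)
        cycle = max(stage_delays_ns) + latch_overhead_ns   # clock set by slowest stage + overhead
        pipelined_total = (k + n_instr - 1) * cycle        # fill time, then one completion per cycle
        return (n_instr * per_instr_unpipelined) / pipelined_total

    # five stages that are not quite balanced, 0.5 ns of register delay per stage
    print(pipelined_speedup([10, 8, 10, 10, 7], 0.5, 1000))   # about 4.3, not the ideal 5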

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 12


Graphically Representing Pipelines

• Can help with answering questions like:


- how many cycles does it take to execute this code?
- what is the ALU doing during cycle 4?
- use this representation to help understand datapaths

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 13


Conventional Pipelined Execution Representation

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 14


Single Cycle, Multiple Cycle, vs. Pipeline

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 15


Why Pipeline, Actually?

• Suppose we execute 100 instructions


• Single Cycle Machine
- 45 ns/cycle x 1 CPI x 100 instructions = 4500ns
• Multi-Cycle Machine
- 10 ns/cycle x 4.6 CPI (due to instruction mix) x 100 instructions = 4600ns
• Ideal pipelined machine
- 10 ns/cycle x (1 CPI x 100 instructions + 4 cycles drain) = 1040ns!!
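The figures above can be checked with a few lines of Python (the cycle times and CPI are the assumed values from this slide):

    n = 100
    single_cycle = 45 * 1.0 * n          # 4500 ns
    multi_cycle  = 10 * 4.6 * n          # 4600 ns
    pipelined    = 10 * (1.0 * n + 4)    # 1040 ns: 100 issue cycles plus 4 cycles to drain
    print(single_cycle, multi_cycle, pipelined)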

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 16


Why Pipeline? Because We Can!

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 17


Can Pipelining Get Us into Trouble?

YES: Pipeline Hazards


• Structural Hazards: attempt to use the same resource in two different ways (e.g., by
two different instructions) at the same time
- e.g., a combined washer/dryer would be a structural hazard, as would the “folder”
being busy doing something else (e.g., watching TV ;-)
• Control Hazards: attempt to make a decision before the condition has been evaluated
- e.g., washing football uniforms and needing to get the detergent level right: we have to
see the result out of the dryer before deciding on the next load
- branch instructions
• Data Hazards: attempt to use an item before it is ready
- e.g., one sock of a pair is in the dryer and the other is in the washer; we can’t fold
the pair until the sock in the washer has made it through the dryer
- instruction depends on result of prior instruction still in the pipeline
• Can always resolve these hazards by waiting (pipeline stall):
- pipeline control must detect the hazard
- take action (or delay action) to resolve hazards

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 18


Single Memory Is a Structural Hazard

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 19


Structural Hazard Limits Performance

• Example: if there are 1.3 memory accesses per instruction and the single memory allows
only one access per cycle, then:
- average CPI >= 1.3
- otherwise the memory would be more than 100% utilized (impossible!)
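The bound follows directly from utilization (1.3 accesses per instruction is the assumed mix from this slide):

    accesses_per_instr = 1.3   # one instruction fetch plus roughly 0.3 data accesses
    ports_per_cycle    = 1.0   # a single shared memory can serve one access per cycle
    min_cpi = accesses_per_instr / ports_per_cycle
    print(min_cpi)             # 1.3: the pipeline cannot average better than this CPI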

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 20


Control Hazard Solution #1: Stall

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 21


Control Hazard Solution #2: Predict

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 22


Control Hazard Solution #3: Delayed Branch

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 23


Data Hazard on r1: Read After Write (RAW)

add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or  r8, r1, r9
xor r10, r1, r11
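The sketch below (my own illustration, assuming registers are written in the 5th stage, read in the 2nd, and that the register file writes in the first half of a cycle) works out which of these readers of r1 would need forwarding or a stall:

    WRITE_STAGE, READ_STAGE = 5, 2        # Wr and Reg/Dec stages
    add_wb_cycle = 1 + WRITE_STAGE - 1    # add issues in cycle 1, writes r1 in cycle 5
    readers = {"sub": 2, "and": 3, "or": 4, "xor": 5}   # issue cycle of each dependent

    for name, issue in readers.items():
        read_cycle = issue + READ_STAGE - 1
        # a write-before-read register file makes a same-cycle read safe
        safe = read_cycle >= add_wb_cycle
        print(name, "reads r1 safely" if safe else "needs forwarding or a stall")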
ELEC6036 (Vincent Tam) Module 1: Pipelining Page 24
Data Hazard on r1: Read After Write (RAW)

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 25


Data Hazard Solution: Forwarding

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 26


Forwarding (or Bypassing): What About Loads?

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 27


Forwarding (or Bypassing): What About Loads?

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 28


Designing a Pipelined Processor

• Go back and examine your datapath and control diagram


• Associate resources with states
• Ensure that flows do not conflict, or figure out how to resolve
• Assert control in appropriate stage

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 29


Control and Datapath: Split State Diagram into 5 Pieces

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 30


Summary of Concepts

• Reduce CPI by overlapping many instructions


- average throughput of approximately 1 CPI with fast clock
• Utilize capabilities of the datapath
- start next instruction while working on the current one
- limited by length of longest stage (plus fill/flush)
- detect and resolve hazards
• What makes it easy
- all instructions are of the same length
- just a few instruction formats
- memory operands appear only in loads and stores
• What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 31


Pipelined Processor

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 32


Pipelining the Load Instruction

• The 5 independent functional units in the pipeline datapath are:


- instruction memory for the Ifetch stage
- register file’s Read ports (bus A and bus B) for the Reg/Dec stage
- ALU for the Exec stage
- data memory for the Mem stage
- register file’s Write port (bus W) for the Wr stage

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 33


The Four Stages of R-type

• Ifetch: Instruction Fetch


- fetch the instruction from the instruction memory
• Reg/Dec: registers fetch and instruction decode
• Exec:
- ALU operates on the two register operands
- update PC
• Wr: write the ALU output back to the register file

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 34


Pipelining the R-type and Load Instruction

• We have a pipeline conflict, i.e., a structural hazard (see the sketch below):
- two instructions try to write to the register file at the same time
- there is only one write port
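A tiny sketch (illustrative, not from the slides) of why the conflict appears: a 4-stage R-type issued one cycle after a 5-stage load reaches its write stage in the same cycle as the load:

    instrs = [("lw",  5),   # load writes the register file in its 5th stage
              ("add", 4)]   # R-type writes in its 4th stage

    writes = {}
    for issue, (name, wr_stage) in enumerate(instrs, start=1):
        cycle = issue + wr_stage - 1              # cycle in which this instruction reaches Wr
        writes.setdefault(cycle, []).append(name)

    for cycle, names in sorted(writes.items()):
        if len(names) > 1:
            print("write-port conflict in cycle", cycle, ":", names)   # cycle 5: ['lw', 'add']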

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 35


Important Observation

• Each functional unit can only be used once per instruction


• Each functional unit must be used at the same stage for all instructions

• 2 ways to solve this pipeline hazard

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 36


Solution 1: Insert “Bubble” into the Pipeline

• Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle
- the control logic can be complex
- lose instruction fetch and issue opportunity
• No instruction is started in Cycle 6!!

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 37


Solution 2: Delay R-type’s Write by One Cycle

• Delay R-type’s register write by one cycle:

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 38


Modified Control and Datapath

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 39


The Four Stages of Store

• Ifetch: Instruction Fetch


- fetch the instruction from the instruction memory
• Reg/Dec: registers fetch and instruction decode
• Exec: calculate the memory address
• Mem: write the data into the data memory

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 40


The Three Stages of Beq

• Ifetch: Instruction Fetch


- fetch the instruction from the instruction memory
• Reg/Dec: registers fetch and instruction decode
• Exec:
- compare the two register operands
- select the correct branch target address
- latch it into the PC

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 41


Control Diagram

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 42


Data Stationary Control

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 43


Datapath + Data Stationary Control

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 44


Let’s Try It Out

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 45


Start: Fetch 10

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 46


Fetch 14, Decode 10

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 47


Fetch 20, Decode 14, Exec 10

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 48


Fetch 24, Decode 20, Exec 14, Mem 10

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 49


Fetch 30, Decode 24, Exec 20, Mem 14, WB 10

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 50


Fetch 100, Decode 30, Exec 24, Mem 20, WB 14

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 51


Fetch 104, Decode 100, Exec 30, Mem 24, WB 20

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 52


Fetch 110, Decode 104, Exec 100, Mem 30, WB 24

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 53


Fetch 114, Decode 110, Exec 104, Mem 100, WB 30

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 54


Pipeline Hazards Again

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 55


Data Hazards

• Avoid some “by design”


- eliminate WAR by always fetching operands early (DCD) in pipe
- eliminate WAW by doing all WBs in order (last stage, static)
• Detect and resolve remaining ones
- stall or forward (if possible)

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 56


Hazard Detection

• Suppose instruction i is about to be issued and a predecessor instruction j is in the
instruction pipeline.
• A RAW hazard exists on register ρ if ρ ∈ Rregs(i) ∩ Wregs(j)
- keep a record of pending writes (for instructions in the pipe) and compare with the
operand registers of the current instruction
- when an instruction issues, reserve its result register
- when an operation completes, remove its write reservation
• A WAW hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Wregs(j)
• A WAR hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Rregs(j)
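These conditions translate directly into set intersections; the sketch below is my own transcription (the register sets are given explicitly, there is no real decoder behind it):

    def hazards(rregs_i, wregs_i, rregs_j, wregs_j):
        """i is about to issue; j is an earlier instruction still in the pipe."""
        return {
            "RAW": rregs_i & wregs_j,   # i reads a register j has not yet written
            "WAW": wregs_i & wregs_j,   # i would overwrite j's pending result
            "WAR": wregs_i & rregs_j,   # i would overwrite a register j still has to read
        }

    # j = lw r2, 0(r5) already in the pipe; i = add r3, r2, r1 about to issue
    print(hazards(rregs_i={"r2", "r1"}, wregs_i={"r3"},
                  rregs_j={"r5"}, wregs_j={"r2"}))   # RAW on r2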

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 57


Record of Pending Writes

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 58


Resolve RAW by Forwarding

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 59


What about Memory Operations?
• If instructions are initiated in order and operations always occur in the same stage,
there can be no hazards between memory operations!
• What does delaying WB on arithmetic operations cost?
- cycles?
- hardware?
• What about data dependence on loads?
- R1 <-- R4 + R5
- R2 <-- Mem[R2 + I]
- R3 <-- R2 + R1
- ==> delayed loads
• Can recognize this in the decode stage and introduce a bubble while stalling the fetch stage
• Tricky situation:
- R1 <-- Mem[R2 + I]
- Mem[R3 + 34] <-- R1
- handle with a bypass in the memory stage!
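The decode-stage check mentioned above can be written as a one-line condition; the field names (mem_read, rs, rt) below are my own, modelled on the usual MIPS-style hazard detection unit rather than taken from these slides:

    def must_stall(id_ex, if_id):
        """True if the instruction in decode needs the result of a load still in EX."""
        return id_ex["mem_read"] and id_ex["rt"] in (if_id["rs"], if_id["rt"])

    # lw r2, 0(r5) is in EX while add r3, r2, r1 sits in decode: insert one bubble
    print(must_stall(id_ex={"mem_read": True, "rt": "r2"},
                     if_id={"rs": "r2", "rt": "r1"}))   # True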

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 60


Compiler Avoiding Load Stalls

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 61


What about interrupts, traps, faults?

• External interrupts:
- allow pipeline to drain
- load PC with interrupt address
• Faults (within instruction, restartable)
- force trap instruction into IF
- disable writes till trap hits WB
- must save multiple PCs or PC + state
• Recall: precise exceptions ==> state of the machine is preserved as if program
executed up to the offending instruction
- all previous instructions completed
- offending instruction and all following instructions act as if they have not even started
- same system code will work on different implementations

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 62


Exception Problem

• Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline


- how to stop the pipeline?
- restart?
- who caused the interrupt?
  Stage   Problem when interrupts occur
  IF      page fault on instruction fetch; misaligned memory access;
          memory-protection violation
  ID      undefined or illegal opcode
  EX      arithmetic exception
  MEM     page fault on data fetch; misaligned memory access;
          memory-protection violation; memory error

• Load with a data page fault, Add with an instruction page fault?
• Solutions:
1: interrupt vector/instruction
2: interrupt ASAP, restart everything that is incomplete

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 63


Another look at the exception problem

• Use pipeline to sort this out!


- pass exception status along with instruction
- keep track of PCs for every instruction in pipeline
- don’t act on exception until it reaches WB stage
• Handle interrupts through “faulting no-op” in IF stage
• When instruction reaches WB stage:
- save PC => EPC, interrupt vector address => PC
- turn all instructions in earlier stages into no-ops
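A minimal sketch of this “act only at WB” policy (the structure, field names, and addresses below are assumptions for illustration, not details given on the slide):

    class InFlight:
        def __init__(self, pc):
            self.pc = pc              # every in-flight instruction carries its own PC
            self.exception = None     # set by whichever stage detects a problem
            self.squashed = False

    def writeback(instr, earlier_stages, regs):
        if instr.exception is not None:
            regs["EPC"] = instr.pc            # remember where to restart
            regs["PC"]  = 0x80000080          # assumed exception-handler address
            for younger in earlier_stages:    # instructions behind the faulting one
                younger.squashed = True       # turn them into no-ops
            return False                      # do not let the faulting instruction commit
        return True

    faulting = InFlight(pc=0x00400120)        # hypothetical PC, for illustration
    faulting.exception = "page fault"
    print(writeback(faulting, [InFlight(0x00400124)], regs={}))   # False: pipeline redirected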

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 64


Exception Handling

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 65


Resolution: Freeze above & Bubble below

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 66


FYI: MIPS R3000 Clocking Discipline

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 67


MIPS R3000 Instruction Pipeline

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 68


Recall: Data Hazard on r1

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 69


MIPS R3000 Multicycle Operations

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 70


Issues in Pipelined Design

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 71


Summary of Concepts

• What makes it easy


- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores
• What makes it hard? HAZARDS!!
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
• Pipelines pass control information down the pipe just as data moves down pipe
• Forwarding/stalls handled by local control
• Exceptions stop the pipeline
• MIPS I instruction set architecture made pipeline visible (delayed branch, delayed
load)
• More performance from deeper pipelines and more parallelism

ELEC6036 (Vincent Tam) Module 1: Pipelining Page 72
