CS211 1
How to improve performance?
CS211 3
Pipeline Approach to Improve System
Performance
• Analogous to fluid flow in pipelines and
assembly line in factories
• Divide process into “stages” and send tasks
into a pipeline
– Overlap computations of different tasks by
operating on them concurrently in different stages
CS211 4
Instruction Pipeline
CS211 5
Linear Pipeline Processor
Linear pipeline processes a sequence of
subtasks with linear precedence
At a higher level - Sequence of processors
S1 → S2 → ... → Sk
CS211 6
Synchronous Pipeline
CS211 7
Time Space Utilization of Pipeline
Time:  1   2   3   4
S3             T1  T2
S2         T1  T2  T3
S1     T1  T2  T3  T4
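The space-time diagram above can be generated mechanically. The following is a minimal sketch, assuming a linear pipeline in which task Ti enters stage S1 at cycle i and advances one stage per cycle; the function name `space_time_diagram` is illustrative and not part of the slides.

```python
def space_time_diagram(num_stages: int, num_tasks: int) -> list[list[str]]:
    """Return one row per stage (last stage first); each row holds the task
    occupying that stage in cycles 1..k+n-1, or "" when the stage is idle."""
    total_cycles = num_stages + num_tasks - 1
    rows = []
    for stage in range(num_stages, 0, -1):      # top row is the last stage
        row = []
        for cycle in range(1, total_cycles + 1):
            task = cycle - stage + 1            # task i is in stage s at cycle i + s - 1
            row.append(f"T{task}" if 1 <= task <= num_tasks else "")
        rows.append(row)
    return rows


def render(rows: list[list[str]]) -> str:
    """Format the rows as fixed-width text, labelling stages S_k down to S_1."""
    k = len(rows)
    lines = []
    for index, row in enumerate(rows):
        cells = "".join(cell.ljust(4) for cell in row)
        lines.append(f"S{k - index}  {cells}".rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(space_time_diagram(num_stages=3, num_tasks=4)))
```

With 3 stages and 4 tasks the diagram spans 3 + 4 - 1 = 6 cycles; the slide shows only the first four.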
CS211 8
Asynchronous Pipeline
CS211 9
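Pipeline Clock and Timing
Si → Si+1 (stage delay $\tau_m$, followed by a latch with delay $d$)
Clock cycle of the pipeline: $\tau$
Latch delay: $d$
$\tau = \max\{\tau_m\} + d$
Pipeline frequency: $f$
$f = 1/\tau$
CS211 10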
Speedup and Efficiency
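A k-stage pipeline processes n tasks in k + (n-1) clock cycles ($\tau$ is the pipeline clock period):
$T_k = [k + (n-1)]\,\tau$
$T_1 = n k \tau$
Speedup factor:
$S_k = \dfrac{T_1}{T_k} = \dfrac{n k \tau}{[k + (n-1)]\,\tau} = \dfrac{n k}{k + (n-1)}$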
CS211 11
Efficiency and Throughput
Efficiency:
$E_k = \dfrac{S_k}{k} = \dfrac{n}{k + (n-1)}$
Throughput:
$H_k = \dfrac{n}{[k + (n-1)]\,\tau} = \dfrac{n f}{k + (n-1)}$
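The speedup, efficiency, and throughput formulas can be checked numerically. This is a minimal sketch, assuming all k stages share the same clock period (so the non-pipelined time is n·k·τ); the names `PipelineMetrics` and `pipeline_metrics` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class PipelineMetrics:
    cycles: int          # k + (n - 1)
    speedup: float       # S_k = n*k / (k + (n - 1))
    efficiency: float    # E_k = S_k / k
    throughput: float    # H_k = n / ([k + (n - 1)] * tau), tasks per time unit


def pipeline_metrics(k: int, n: int, tau: float) -> PipelineMetrics:
    """Evaluate the linear-pipeline formulas for k stages, n tasks, period tau."""
    cycles = k + (n - 1)
    t_pipelined = cycles * tau       # T_k
    t_sequential = n * k * tau       # T_1
    speedup = t_sequential / t_pipelined
    return PipelineMetrics(
        cycles=cycles,
        speedup=speedup,
        efficiency=speedup / k,
        throughput=n / t_pipelined,
    )


if __name__ == "__main__":
    # As n grows, speedup approaches k and efficiency approaches 1.
    for n in (1, 10, 1000):
        print(n, pipeline_metrics(k=4, n=n, tau=100.0))
```

For a single task (n = 1) the speedup is exactly 1: the pipeline only pays off once it is kept full.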
CS211 12
Pipeline Performance: Example
• Task has 4 subtasks with time: t1=60, t2=50, t3=90, and t4=80 ns
(nanoseconds)
• Latch delay = 10 ns
• Pipeline cycle time = 90+10 = 100 ns
• For non-pipelined execution
– time = 60+50+90+80 = 280 ns
• Speedup for above case is: 280/100 = 2.8 !!
• Pipeline time for 1000 tasks = (1000 + 4 - 1) cycles × 100 ns = 1003 × 100 ns = 100,300 ns
• Sequential time = 1000 × 280 ns = 280,000 ns
• Throughput = 1000/1003 ≈ 0.997 tasks per clock cycle
• What is the problem here ?
• How to improve performance ?
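The arithmetic in this example can be reproduced with a short script. This is a sketch under the slide's own numbers (stage times 60, 50, 90, 80 ns and a 10 ns latch); the function name `example_timing` is illustrative. It also reports the slowest stage, since that stage alone sets the cycle time.

```python
def example_timing(stage_ns: list[int], latch_ns: int, num_tasks: int) -> dict:
    """Timing for a linear pipeline whose stages have unequal delays."""
    cycle_ns = max(stage_ns) + latch_ns                  # clock set by slowest stage
    per_task_sequential_ns = sum(stage_ns)               # no latches without pipelining
    pipelined_total_ns = (len(stage_ns) + num_tasks - 1) * cycle_ns
    sequential_total_ns = num_tasks * per_task_sequential_ns
    return {
        "cycle_ns": cycle_ns,
        "per_task_sequential_ns": per_task_sequential_ns,
        "pipelined_total_ns": pipelined_total_ns,
        "sequential_total_ns": sequential_total_ns,
        "speedup": sequential_total_ns / pipelined_total_ns,
        "bottleneck_stage": stage_ns.index(max(stage_ns)) + 1,   # 1-based
    }


if __name__ == "__main__":
    result = example_timing([60, 50, 90, 80], latch_ns=10, num_tasks=1000)
    for key, value in result.items():
        print(f"{key}: {value}")
```

The speedup stays below 2.8 on a 4-stage pipeline because the stages are unbalanced: every stage is clocked at the pace of the 90 ns stage plus the latch.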
CS211 13
Non-linear pipelines and pipeline control
algorithms
• Can have non-linear path in pipeline…
– How to schedule instructions so they do not conflict
for resources
• How does one control the pipeline at the
microarchitecture level
– How to build a scheduler in hardware ?
– How much time does scheduler have to make
decision ?
CS211 14
Non-linear Dynamic Pipelines
CS211 15
Reservation Tables
CS211 16
CS211 17
Latency Analysis
CS211 18
Collisions (Example)
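[Figure: stage occupancy over clock cycles 1-10 for successive initiations x1, x2, x3, x4; cells holding two initiations (x1x2, x2x3) are collisions]
Latency = 2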
CS211 19
Latency Cycle
[Figure: stage occupancy over clock cycles 1-18 for initiations x1, x2, x3 across three stages, with the repeating latency cycle marked]
CS211 21
Collision Vector
Reservation Table
[Figure: a reservation table with two marks (x) in each of three rows, and the same table occupied by two initiations, x1 (value X1) and x2 (value X2)]
C = (? ? . . . ? ?)
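A collision vector can be derived from a reservation table by collecting the forbidden latencies, that is, the distances between any two marks in the same row. The sketch below assumes the usual convention C = (C_m ... C_1), where C_i = 1 when latency i is forbidden and m is the largest forbidden latency. The table `EXAMPLE_TABLE` is a hypothetical three-stage table chosen for illustration, not the one on the slide.

```python
from itertools import combinations

# Hypothetical reservation table: stage name -> clock cycles in which one
# initiation uses that stage. This is NOT the table from the slide.
EXAMPLE_TABLE = {
    "S1": [1, 6, 8],
    "S2": [2, 4],
    "S3": [3, 5, 7],
}


def forbidden_latencies(table: dict[str, list[int]]) -> set[int]:
    """Latencies that would make two initiations use the same stage at once."""
    forbidden = set()
    for cycles in table.values():
        for earlier, later in combinations(sorted(cycles), 2):
            forbidden.add(later - earlier)
    return forbidden


def collision_vector(table: dict[str, list[int]]) -> str:
    """Bits C_m ... C_1 as a string; '1' marks a forbidden latency."""
    forbidden = forbidden_latencies(table)
    if not forbidden:
        return ""
    m = max(forbidden)
    return "".join("1" if i in forbidden else "0" for i in range(m, 0, -1))


if __name__ == "__main__":
    print(sorted(forbidden_latencies(EXAMPLE_TABLE)))
    print(collision_vector(EXAMPLE_TABLE))
```

Any latency whose bit is 0 (here 1, 3, and 6) is a permissible initiation interval for the first pair of tasks.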
CS211 22
Back to our focus: Computer Pipelines
CS211 23
Designing a Pipelined Processor
• Go back and examine your datapath and control
diagram
• associate resources with states
• ensure that flows do not conflict, or figure out how
to resolve
• assert control in appropriate stage
CS211 24
5 Steps of MIPS Datapath
What do we need to do to pipeline the process ?
Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
[Figure: MIPS datapath. Labels: Next PC, Next SEQ PC, adder (+4), instruction memory (Address, Inst), register file (RS1, RS2, RD), sign extend (Imm), ALU, Zero?, data memory (LMD), MUXes, WB Data]
CS211 25
5 Steps of MIPS/DLX Datapath
[Figure: the same datapath divided by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB; Next SEQ PC and RD are carried forward through the pipeline registers]
CS211 27
Visualizing Pipelining
Instr. order (top to bottom), time (left to right):
Ifetch  Reg     ALU     DMem    Reg
        Ifetch  Reg     ALU     DMem    Reg
                Ifetch  Reg     ALU     DMem    Reg
                        Ifetch  Reg     ALU     DMem    Reg
CS211 28
Conventional Pipelined Execution Representation
Time
CS211 29
Single Cycle, Multiple Cycle, vs. Pipeline
Cycle 1 Cycle 2
Clk
Single Cycle Implementation:
Load Store Waste
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10
Clk
Pipeline Implementation:
Load Ifetch Reg Exec Mem Wr
CS211 31
The Four Stages of R-type
Cycle 1 Cycle 2 Cycle 3 Cycle 4
CS211 32
Pipelining the R-type and Load
Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
CS211 33
Important
• Observation: each functional unit can be used only once per instruction
• Each functional unit must be used at the same stage
for all instructions:
– Load uses Register File’s Write Port during its 5th stage
    Load:    1 Ifetch | 2 Reg/Dec | 3 Exec | 4 Mem | 5 Wr
– R-type uses Register File’s Write Port during its 4th stage
    R-type:  1 Ifetch | 2 Reg/Dec | 3 Exec | 4 Wr
CS211 34
Solution 1: Insert “Bubble” into the Pipeline
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9
Clock
Ifetch Reg/Dec Exec Wr
Load Ifetch Reg/Dec Exec Mem Wr
CS211 36
Why Pipeline?
• Suppose we execute 100 instructions
• Single Cycle Machine
– 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
• Multicycle Machine
– 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
• Ideal pipelined machine
– 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
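The three totals follow from time = cycle time × cycles. A minimal sketch using the slide's numbers as defaults (the function names are illustrative):

```python
def single_cycle_ns(num_instr: int, cycle_ns: float = 45.0) -> float:
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * 1 * num_instr


def multicycle_ns(num_instr: int, cycle_ns: float = 10.0, cpi: float = 4.6) -> float:
    """Short cycles, but several of them per instruction."""
    return cycle_ns * cpi * num_instr


def ideal_pipeline_ns(num_instr: int, cycle_ns: float = 10.0, drain_cycles: int = 4) -> float:
    """One instruction completes per cycle, plus cycles to drain the pipe."""
    return cycle_ns * (1 * num_instr + drain_cycles)


if __name__ == "__main__":
    n = 100
    print("single cycle:", single_cycle_ns(n), "ns")
    print("multicycle:  ", multicycle_ns(n), "ns")
    print("pipelined:   ", ideal_pipeline_ns(n), "ns")
```

The 4-cycle drain is a fixed cost, so its share of the total shrinks as the instruction count grows.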
CS211 37
Why Pipeline? Because the resources are there!
Time (clock cycles), instr. order top to bottom:
Inst 0   Im   Reg  ALU  Dm   Reg
Inst 1        Im   Reg  ALU  Dm   Reg
Inst 2             Im   Reg  ALU  Dm   Reg
Inst 3                  Im   Reg  ALU  Dm   Reg
Inst 4                       Im   Reg  ALU  Dm   Reg
CS211 38
Problems with Pipeline processors?
• Limits to pipelining: Hazards prevent next instruction from
executing during its designated clock cycle and introduce
stall cycles which increase CPI
– Structural hazards: HW cannot support this combination of
instructions - two dogs fighting for the same bone
– Data hazards: Instruction depends on result of prior
instruction still in the pipeline
» Data dependencies
– Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow
(branches and jumps).
» Control dependencies
• Can always resolve hazards by stalling
• More stall cycles = more CPU time = less performance
– Increase performance = decrease stall cycles
CS211 39
Back to our old friend: CPU time equation
CS211 40
Speed Up Equation for Pipelining
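A sketch of the relations these two slide titles refer to, written to agree with the CPU time relation (IC × CPI × Clk) and the speedup formula in the deck's summary. It assumes an ideal pipelined CPI of 1 and an unpipelined CPI equal to the pipeline depth; the intermediate steps are a reconstruction, not the slides' own text.

```latex
\begin{align*}
\text{CPU time} &= \text{IC} \times \text{CPI} \times \text{Clk} \\
\text{CPI}_{\text{pipelined}} &= \text{Ideal CPI} + \text{Pipeline stall CPI} \\
\text{Speedup} &= \frac{\text{CPI}_{\text{unpipelined}} \times \text{Cycle Time}_{\text{unpipelined}}}
                       {\text{CPI}_{\text{pipelined}} \times \text{Cycle Time}_{\text{pipelined}}} \\
               &= \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}}
                  \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}
\end{align*}
```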
CS211 41
One Memory Port/Structural Hazards
Instr. order (top to bottom), time (left to right):
Load     Ifetch Reg    ALU    DMem   Reg
Instr 1         Ifetch Reg    ALU    DMem   Reg
Instr 2                Ifetch Reg    ALU    DMem   Reg
Instr 3                       Ifetch Reg    ALU    DMem   Reg
Instr 4                              Ifetch Reg    ALU    DMem   Reg
CS211 42
One Memory Port/Structural Hazards
Instr. order (top to bottom), time (left to right):
Load     Ifetch Reg    ALU    DMem   Reg
Instr 1         Ifetch Reg    ALU    DMem   Reg
Instr 2                Ifetch Reg    ALU    DMem   Reg
Stall                         Bubble Bubble Bubble Bubble Bubble
Instr 3                              Ifetch Reg    ALU    DMem   Reg
CS211 43
Example: Dual-port vs. Single-port
• Machine A: Dual ported memory (“Harvard
Architecture”)
• Machine B: Single ported memory, but its
pipelined implementation has a 1.05 times faster
clock rate
• Ideal CPI = 1 for both
• Note: loads cause a stall of 1 cycle on Machine B
• Recall our friend:
– CPU time = IC × CPI × Clk
» CPI = ideal CPI + stalls
CS211 44
Example…
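The comparison can be worked through with CPU time = IC × CPI × Clk. The slides do not state how often loads occur, so the 40% load fraction below is an assumption made only for illustration; the function names are likewise illustrative.

```python
def avg_instr_time(ideal_cpi: float, stall_cycles_per_instr: float, cycle_time: float) -> float:
    """Average time per instruction: (ideal CPI + stalls) x clock cycle time."""
    return (ideal_cpi + stall_cycles_per_instr) * cycle_time


def time_b_over_time_a(load_fraction: float,
                       clock_rate_ratio_b: float = 1.05,
                       stall_per_load: float = 1.0) -> float:
    """Ratio of Machine B's time per instruction to Machine A's.

    Machine A: dual-ported memory, no structural stalls, cycle time 1.
    Machine B: single-ported memory, each load stalls `stall_per_load` cycles,
               clock rate `clock_rate_ratio_b` times faster than A.
    """
    time_a = avg_instr_time(1.0, 0.0, 1.0)
    time_b = avg_instr_time(1.0, load_fraction * stall_per_load, 1.0 / clock_rate_ratio_b)
    return time_b / time_a


if __name__ == "__main__":
    ASSUMED_LOAD_FRACTION = 0.4   # assumption: not given in the slides
    ratio = time_b_over_time_a(ASSUMED_LOAD_FRACTION)
    print(f"Machine B takes {ratio:.2f}x the time of Machine A per instruction")
```

Under this assumption Machine A is faster despite its slower clock; Machine B wins only if loads are rarer than 5% of instructions, since it needs (1 + load fraction) / 1.05 < 1.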
CS211 45
Data Dependencies
CS211 46
Three Generic Data Hazards
I: add r1,r2,r3
J: sub r4,r1,r3
CS211 47
RAW Dependency
CS211 48
Three Generic Data Hazards
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
CS211 50
WAR and WAW Dependency
• Example program (a):
– i1: mul r1, r2, r3;
– i2: add r2, r4, r5;
• Example program (b):
– i1: mul r1, r2, r3;
– i2: add r1, r4, r5;
• In both cases there is a dependence between i1 and i2
– in (a), WAR: r2 must be read by i1 before it is written by i2
– in (b), WAW: r1 must be written by i2 after it has been written
by i1
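The three kinds of dependence can be identified by comparing the register sets of two instructions in program order. A minimal sketch, assuming each instruction has one destination register and a tuple of source registers; the `Instr` type and `dependences` function are illustrative.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str                 # register written
    srcs: tuple[str, ...]     # registers read


def dependences(first: Instr, second: Instr) -> set[str]:
    """Dependences of `second` on `first`, where `first` is earlier in program order."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("RAW")      # second reads what first writes
    if second.dest in first.srcs:
        kinds.add("WAR")      # second overwrites what first still has to read
    if first.dest == second.dest:
        kinds.add("WAW")      # both write the same register
    return kinds


if __name__ == "__main__":
    mul = Instr("r1", ("r2", "r3"))                       # mul r1, r2, r3
    print(dependences(mul, Instr("r2", ("r4", "r5"))))    # program (a)
    print(dependences(mul, Instr("r1", ("r4", "r5"))))    # program (b)
```

Only RAW reflects a true flow of data; WAR and WAW arise from reusing a register name.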
CS211 51
What to do with WAR and WAW ?
• Problem:
– i1: mul r1, r2, r3;
– i2: add r2, r4, r5;
• Is this really a dependence/hazard ?
CS211 52
What to do with WAR and WAW
CS211 53
Hazard Detection in H/W
CS211 54
Hazard Detection in
Hardware
• A RAW hazard exists on register ρ if ρ ∈ Rregs( i ) ∩ Wregs( j )
– Keep a record of pending writes (for instructions in the pipe) and
compare with operand regs of current instruction.
– When an instruction issues, reserve its result register.
– When an operation completes, remove its write reservation.
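The pending-write record described above can be modelled in software. This is a sketch of the bookkeeping only, not of real hardware; the class name `PendingWrites` and its methods are illustrative.

```python
from collections import Counter


class PendingWrites:
    """Tracks result registers reserved by instructions still in the pipe."""

    def __init__(self) -> None:
        self._reserved = Counter()    # register -> number of outstanding writes

    def raw_hazard(self, source_regs: list[str]) -> bool:
        """True if the instruction about to issue reads a register with a pending write."""
        return any(self._reserved[reg] > 0 for reg in source_regs)

    def issue(self, dest_reg: str) -> None:
        """On issue, reserve the instruction's result register."""
        self._reserved[dest_reg] += 1

    def complete(self, dest_reg: str) -> None:
        """On completion, remove the write reservation."""
        if self._reserved[dest_reg] > 0:
            self._reserved[dest_reg] -= 1


if __name__ == "__main__":
    pending = PendingWrites()
    pending.issue("r1")                           # add r1,r2,r3 issues
    print(pending.raw_hazard(["r1", "r3"]))       # sub r4,r1,r3 must wait
    pending.complete("r1")                        # add writes back
    print(pending.raw_hazard(["r1", "r3"]))       # sub may now issue
```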
CS211 55
Internal Forwarding: Getting rid of some
hazards
• In some cases the data needed by the next
instruction at the ALU stage has been
computed by the ALU (or some stage defining
it) but has not been written back to the
registers
• Can we “forward” this result by bypassing
stages ?
CS211 56
Data Hazard on R1
Time (clock cycles), instr. order top to bottom:
                 IF      ID/RF   EX      MEM     WB
add r1,r2,r3     Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r3             Ifetch  Reg     ALU     DMem    Reg
and r6,r1,r7                     Ifetch  Reg     ALU     DMem    Reg
or  r8,r1,r9                             Ifetch  Reg     ALU     DMem    Reg
xor r10,r1,r11                                   Ifetch  Reg     ALU     DMem    Reg
CS211 57
Forwarding to Avoid Data Hazard
[Figure: the same pipeline diagram (Ifetch, Reg, ALU, DMem, Reg per instruction, staggered by one cycle) with forwarding paths drawn between stages, for the first instruction followed by:]
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
CS211 58
Internal Forwarding of Instructions
CS211 59
Store-load forwarding
[Figure: memory location M and registers R1, R2, before and after store-load forwarding]
CS211 60
Load-load forwarding
[Figure: memory location M and registers R1, R2, before and after load-load forwarding]
CS211 61
Store-store forwarding
[Figure: memory location M and registers R1, R2, before and after store-store forwarding]
Before: STO M,R1 ; STO M,R2
After: STO M,R2
CS211 62
HW Change for Forwarding
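[Figure: forwarding hardware. Labels: NextPC, Registers, Immediate, ALU, Data Memory, pipeline registers ID/EX, EX/MEM, MEM/WR, and multiplexers (mux) in front of the ALU inputs and after data memory]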
CS211 63
What about memory operations?
• If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
• What does delaying WB on arithmetic operations cost?
– cycles ?
– hardware ?
• What about data dependence on loads?
R1 <- R4 + R5
R2 <- Mem[ R2 + I ]
R3 <- R2 + R1
– “Delayed Loads”
• Can recognize this in decode stage and introduce bubble while stalling fetch stage
• Tricky situation:
R1 <- Mem[ R2 + I ]
Mem[R3+34] <- R1
– Handle with bypass in memory stage!
[Figure: pipeline register contents by stage: op Rd Ra Rb; op Rd Ra Rb A B; Rd D R; Rd T; Mem; to reg file]
CS211 64
Data Hazard Even with Forwarding
Instr. order (top to bottom), time (left to right):
(first instr.)   Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r6             Ifetch  Reg     ALU     DMem    Reg
and r6,r1,r7                     Ifetch  Reg     ALU     DMem    Reg
or  r8,r1,r9                             Ifetch  Reg     ALU     DMem    Reg
CS211 65
Data Hazard Even with Forwarding
Instr. order (top to bottom), time (left to right):
(first instr.)   Ifetch  Reg     ALU     DMem    Reg
sub r4,r1,r6             Ifetch  Reg     Bubble  ALU     DMem    Reg
and r6,r1,r7                     Ifetch  Bubble  Reg     ALU     DMem    Reg
or  r8,r1,r9                             Bubble  Ifetch  Reg     ALU     DMem
CS211 66
What can we (S/W) do?
CS211 67
Software Scheduling to Avoid Load
Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:          Fast code:
LW  Rb,b            LW  Rb,b
LW  Rc,c            LW  Rc,c
ADD Ra,Rb,Rc        LW  Re,e
SW  a,Ra            ADD Ra,Rb,Rc
LW  Re,e            LW  Rf,f
LW  Rf,f            SW  a,Ra
SUB Rd,Re,Rf        SUB Rd,Re,Rf
SW  d,Rd            SW  d,Rd
CS211 68
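The benefit of the reordering can be counted. The sketch below assumes a simplified model: full forwarding, so the only stall is one cycle when an instruction reads a register loaded by the instruction immediately before it. The `Instr` type, the `load_use_stalls` function, and the two program lists are illustrative encodings of the slow and fast sequences.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    op: str
    dest: Optional[str]       # register written, or None for a store
    srcs: tuple[str, ...]     # registers read


def load_use_stalls(program: list[Instr]) -> int:
    """Count instructions that read a register loaded by their immediate predecessor."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        if previous.op == "LW" and previous.dest in current.srcs:
            stalls += 1
    return stalls


SLOW = [
    Instr("LW", "Rb", ()), Instr("LW", "Rc", ()),
    Instr("ADD", "Ra", ("Rb", "Rc")), Instr("SW", None, ("Ra",)),
    Instr("LW", "Re", ()), Instr("LW", "Rf", ()),
    Instr("SUB", "Rd", ("Re", "Rf")), Instr("SW", None, ("Rd",)),
]

FAST = [
    Instr("LW", "Rb", ()), Instr("LW", "Rc", ()), Instr("LW", "Re", ()),
    Instr("ADD", "Ra", ("Rb", "Rc")), Instr("LW", "Rf", ()),
    Instr("SW", None, ("Ra",)), Instr("SUB", "Rd", ("Re", "Rf")),
    Instr("SW", None, ("Rd",)),
]

if __name__ == "__main__":
    print("slow code stalls:", load_use_stalls(SLOW))
    print("fast code stalls:", load_use_stalls(FAST))
```

Both sequences execute the same eight instructions; the fast version only moves independent instructions into the slot after each load.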
Control Hazards: Branches
• Instruction flow
– Stream of instructions processed by Inst. Fetch unit
– Speed of “input flow” puts bound on rate of outputs
generated
• Branch instruction affects instruction flow
– Do not know next instruction to be executed until
branch outcome known
• When we hit a branch instruction
– Need to compute target address (where to branch)
– Resolution of branch condition (true or false)
– Might need to ‘flush’ pipeline if other instructions
have been fetched for execution
CS211 69
Control Hazard on Branches
Three Stage Stall
Instr. order (top to bottom), time (left to right):
(branch)             Ifetch  Reg     ALU     DMem    Reg
14: and r2,r3,r5             Ifetch  Reg     ALU     DMem    Reg
18: or r6,r1,r7                      Ifetch  Reg     ALU     DMem    Reg
22: add r8,r1,r9                             Ifetch  Reg     ALU     DMem    Reg
36: xor r10,r1,r11                                   Ifetch  Reg     ALU     DMem    Reg
CS211 70
Branch Stall Impact
CS211 71
Pipelined MIPS (DLX) Datapath
CS211 72
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear – flushing pipe
#2: Predict Branch Not Taken
– Execute successor instructions in sequence
– “Squash” instructions in pipeline if branch actually taken
– Advantage of late pipeline state update
– 47% DLX branches not taken on average
– PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
– 53% DLX branches taken on average
– But haven’t calculated branch target address in DLX
» DLX still incurs 1 cycle branch penalty
» Other machines: branch target known before outcome
CS211 73
Four Branch Hazard Alternatives
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
(successors 1 to n: branch delay of length n)
branch target if taken
CS211 74
CS211 75
Delayed Branch
• Where to get instructions to fill branch delay slot?
– Before branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
– Cancelling branches allow more slots to be filled
CS211 76
Evaluating Branch Alternatives
$\text{Pipeline speedup} = \dfrac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$
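The branch speedup formula makes it easy to compare alternatives by their penalty. A minimal sketch; the depth of 5, the 20% branch frequency, and the penalties are assumed values for illustration, not figures from the slides.

```python
def branch_pipeline_speedup(depth: int, branch_frequency: float, branch_penalty: float) -> float:
    """Pipeline speedup = depth / (1 + branch frequency x branch penalty)."""
    return depth / (1 + branch_frequency * branch_penalty)


if __name__ == "__main__":
    ASSUMED_DEPTH = 5               # assumption: five-stage pipeline
    ASSUMED_BRANCH_FREQUENCY = 0.2  # assumption: 20% of instructions are branches
    for label, penalty in [("stall 3 cycles", 3), ("1-cycle penalty", 1), ("no penalty", 0)]:
        speedup = branch_pipeline_speedup(ASSUMED_DEPTH, ASSUMED_BRANCH_FREQUENCY, penalty)
        print(f"{label}: speedup {speedup:.2f}")
```

For schemes whose penalty depends on the branch outcome, the penalty argument would be the average over taken and not-taken branches.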
CS211 77
Branch Prediction based on history
CS211 78
Summary :
Control and Pipelining
• Just overlap tasks; easy if tasks are independent
• Speedup ≤ Pipeline Depth; if ideal CPI is 1, then:
$\text{Speedup} = \dfrac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \dfrac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$
CS211 79
Summary #1/2: Pipelining
• What makes it easy
– all instructions are the same length
– just a few instruction formats
– memory operands appear only in loads and stores
CS211 80
Introduction to ILP
• What is ILP?
– Processor and Compiler design techniques that
speed up execution by causing individual machine
operations to execute in parallel
• ILP is transparent to the user
– Multiple operations executed in parallel even though
the system is handed a single program written with
a sequential processor in mind
• Same execution hardware as a normal RISC
machine
– May be more than one of any given type of hardware
CS211 81
Compiler vs. Processor
Compiler Hardware
TTA
Execute
B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview, and perspective. The
Journal of Supercomputing, 7(1-2):9-50, May 1993.
CS211 82