Introduction
• Basic requirements for pipelining the MIPS32 data path:
– We should be able to start a new instruction every clock cycle.
– Each of the five steps mentioned before (IF, ID, EX, MEM and WB) becomes a
pipeline stage.
– Each stage must finish its execution within one clock cycle.
• Since the execution of several instructions is overlapped, we must ensure that
there is no conflict during the execution.
– Simplicity of the MIPS32 instruction set makes this evaluation quite easy.
– We shall discuss these issues in detail.
15/09/17
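[Figure: the non-pipelined data path as a single block (IF + ID + EX + MEM + WB) with total delay 5T, compared with the same five steps as separate stages IF, ID, EX, MEM, WB.]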
                  Clock Cycles
Instruction       1     2     3     4     5     6     7     8
i                 IF    ID    EX    MEM   WB
i+1                     IF    ID    EX    MEM   WB
i+2                           IF    ID    EX    MEM   WB
i+3                                 IF    ID    EX    MEM   WB
Instr-i finishes in cycle 5, Instr-(i+1) in cycle 6, Instr-(i+2) in cycle 7, and Instr-(i+3) in cycle 8.
Advantages of Pipelining
• In the non-pipelined version, the execution time of an instruction is equal to
the combined delay of the five stages (say, 5T).
• In the pipelined version, once the pipeline is full, one instruction completes
every T time units.
– Assuming all stage delays are equal (equal to T), and neglecting latch delay.
• However, due to various conflicts between instructions (called hazards), we
cannot achieve the ideal performance.
– Several techniques have been proposed to improve the performance.
– To be discussed.
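The ideal-case comparison above can be sketched numerically. This is a minimal illustration, assuming equal stage delays T, no latch delay and no hazards; the function names are made up for the example.

```python
def non_pipelined_time(n, stages=5, t=1.0):
    """Total time for n instructions when each one takes stages * t."""
    return n * stages * t

def pipelined_time(n, stages=5, t=1.0):
    """The first instruction needs `stages` cycles; each later one adds one cycle."""
    return (stages + n - 1) * t

def speedup(n, stages=5, t=1.0):
    return non_pipelined_time(n, stages, t) / pipelined_time(n, stages, t)

for n in (1, 4, 100, 10_000):
    print(n, round(speedup(n), 3))
```

For the four instructions in the timing diagram the pipeline needs 8 cycles instead of 20; as n grows, the speedup approaches the number of stages.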
Some Observations
a) To support overlapped execution, peak memory bandwidth must be
increased 5 times over that required for the non-pipelined version.
– An instruction fetch occurs every clock cycle.
– Also there can be two memory accesses per clock cycle (one for instruction and
one for data).
– Separate instruction and data caches are typically used to support this.
[Figure: CPU connected to a separate I-Cache and D-Cache.]
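[Figure: Register Bank with two read ports and one write port. Read Port 1 takes Source Register 1 (5 bits) and delivers Register Data (32 bits); Read Port 2 takes Source Register 2 (5 bits) and delivers Register Data (32 bits); the Write Port takes the Destination Register (5 bits) and Register Data (32 bits).]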
• Register stages are inserted between pipeline stages, which increases the
execution time of an individual instruction.
– Because of overlapped execution of instructions, throughput increases.
• The clock period T has to be chosen suitably, taking into account:
– Slowest stage in the pipeline.
– Clock skew and jitter.
– Register setup time: minimum time the register input must be held stable before
the active clock edge arrives.
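A rough sketch of how the three factors combine is given here. It assumes the clock period is simply the slowest stage delay plus the skew/jitter and setup margins; all delay values are hypothetical.

```python
def clock_period(stage_delays_ns, skew_jitter_ns, setup_ns):
    """Clock period: slowest stage plus clock skew/jitter plus register setup time."""
    return max(stage_delays_ns) + skew_jitter_ns + setup_ns

# Hypothetical delays (ns) for IF, ID, EX, MEM, WB.
stage_delays = [0.8, 0.7, 1.0, 0.9, 0.6]
period = clock_period(stage_delays, skew_jitter_ns=0.15, setup_ns=0.10)

unpipelined_latency = sum(stage_delays)          # one instruction, no latches
pipelined_latency = len(stage_delays) * period   # one instruction, five clocked stages

print(f"clock period      = {period:.2f} ns")
print(f"latency unpiped   = {unpipelined_latency:.2f} ns")
print(f"latency pipelined = {pipelined_latency:.2f} ns")
```

With these made-up numbers a single instruction takes longer in the pipelined design, yet one instruction can complete every clock period once the pipeline is full.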
Example 1
• Consider the 5-stage MIPS32 pipeline, with the following features:
– Pipeline clock rate of 1 GHz (i.e. 1 ns clock cycle time).
– For a non-pipelined implementation, ALU operations and branches take 4 cycles,
while memory operations take 5 cycles.
– Relative frequencies of ALU operations, branches and memory operations are
50%, 15%, and 35% respectively.
– In the pipelined implementation, due to clock skew and setup time, the clock
cycle time increases by 0.25 ns.
– Calculate the estimated speedup of the pipelined implementation.
• Solution:
a) For non-pipelined processor:
• Average instruction execution time = Clock cycle time x Average CPI
= 1 ns x (0.50 x 4 + 0.15 x 4 + 0.35 x 5) = 4.35 ns
b) For pipelined processor:
• Clock cycle time = 1 + 0.25 = 1.25 ns
• In the steady state, one instruction will get executed every clock cycle.
• Speedup = 4.35 / 1.25 = 3.48
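The arithmetic of this solution can be reproduced with a few lines of Python. The variable names are chosen for the sketch; the numbers are those given in the example.

```python
cycle_ns = 1.0  # non-pipelined clock cycle time

# instruction class -> (relative frequency, cycles in the non-pipelined design)
mix = {"alu": (0.50, 4), "branch": (0.15, 4), "memory": (0.35, 5)}

avg_cpi = sum(freq * cycles for freq, cycles in mix.values())
nonpipelined_ns = cycle_ns * avg_cpi      # average time per instruction
pipelined_ns = cycle_ns + 0.25            # one instruction per (longer) cycle
speedup = nonpipelined_ns / pipelined_ns

print(f"{nonpipelined_ns:.2f} ns vs {pipelined_ns:.2f} ns, speedup {speedup:.2f}")
```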
WB stage, (a) R-R ALU:
Reg [rd] ← ALUOut;
Putting it all together
[Figure: the complete non-pipelined MIPS32 data path: PC, adder (+4), NPC, Instruction Memory, IR (rs, rt, rd fields), Register Bank, Sign Extend, A, B, Imm, multiplexers, ALU with zero test (=0, cond), ALUOut, Data Memory, LMD.]
END OF LECTURE 53
• Convention used:
– Many of the temporary registers required in the data path are included as part of
the inter-stage latches.
– IF/ID: denotes the latch stage between the IF and ID stages.
– ID/EX: denotes the latch stage between the ID and EX stages.
– EX/MEM: denotes the latch stage between the EX and MEM stages.
– MEM/WB: denotes the latch stage between the MEM and WB stages.
• Example:
– ID/EX.A means a register A that is implemented as part of the ID/EX latch stage.
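[Figure: IF stage. PC, Instruction Memory, an adder (PC + 4) and a multiplexer that selects the next PC using inputs from EX/MEM.ALUOut and EX/MEM.cond; NPC and IR are written into the IF/ID latch.]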
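[Figure: ID stage. NPC and IR pass from IF/ID to ID/EX; the rs and rt fields of IR read the Register Bank into A and B; Sign Extend produces Imm. The register write port is driven from MEM/WB.ALUOut or MEM/WB.LMD, with the destination from MEM/WB.rd.]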
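[Figure: EX stage. Multiplexers select the ALU inputs from NPC or A, and from B or Imm, in ID/EX; the ALU result goes to ALUOut, a zero test (=0) produces cond, and B and IR pass on to EX/MEM.]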
LOAD:
MEM/WB.IR ← EX/MEM.IR;
MEM/WB.LMD ← Mem [EX/MEM.ALUOut];
STORE:
MEM/WB.IR ← EX/MEM.IR;
Mem [EX/MEM.ALUOut] ← EX/MEM.B;
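The register transfers of the MEM stage can be mimicked in software. The sketch below is only an illustration: it models the latches as Python dictionaries and the memory as a dictionary from address to word, and the "op" field inside IR is a made-up stand-in for instruction decoding.

```python
def mem_stage(ex_mem, memory):
    """Return the new MEM/WB latch contents for one LOAD or STORE."""
    mem_wb = {"IR": ex_mem["IR"]}          # MEM/WB.IR <- EX/MEM.IR
    addr = ex_mem["ALUOut"]                # effective address computed in EX
    if ex_mem["IR"]["op"] == "LW":
        mem_wb["LMD"] = memory[addr]       # MEM/WB.LMD <- Mem[EX/MEM.ALUOut]
    elif ex_mem["IR"]["op"] == "SW":
        memory[addr] = ex_mem["B"]         # Mem[EX/MEM.ALUOut] <- EX/MEM.B
    return mem_wb

memory = {100: 7}
after_load = mem_stage({"IR": {"op": "LW"}, "ALUOut": 100, "B": 0}, memory)
after_store = mem_stage({"IR": {"op": "SW"}, "ALUOut": 104, "B": 42}, memory)
print(after_load["LMD"], memory[104])
```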
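[Figure: MEM stage. EX/MEM.cond and EX/MEM.ALUOut are sent to the IF stage; ALUOut addresses the Data Memory, B supplies the store data, and the loaded value is written to LMD; ALUOut and IR pass on to MEM/WB.]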
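[Figure: WB stage. A multiplexer selects between MEM/WB.LMD and MEM/WB.ALUOut for the register access (write-back); IR is also held in MEM/WB.]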
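[Figure: the complete pipelined MIPS32 data path: PC, adder (+4), next-PC multiplexer, instruction memory (IM), Register Bank, Sign Extend, ALU input multiplexers, ALU with zero test (=0), data memory (DM), write-back multiplexer, and the latch fields NPC, IR, A, B, ALUOut, LMD.]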
END OF LECTURE 54
Introduction
• Pipeline Hazards:
– Ideally, an instruction pipeline should complete the execution of an instruction
every clock cycle.
– Hazards refer to situations that prevent a pipeline from operating at its maximum
possible speed.
• They prevent some instructions from executing during their designated clock cycles.
• Three types of hazards:
a) Structural hazard: Arises due to resource conflicts.
b) Data hazard: Arises due to data dependencies between instructions.
c) Control hazard: Arises due to branch and other instructions that change the PC.
• If we ignore the increase in clock cycle time due to pipelining (i.e. C_nopipe = C_pipe)
• Illustration:
– Structural hazard in a single-port memory system, which stores both instructions and data.
Instruction       1     2     3     4     5     6     7     8
LW R1,10(R2)      IF    ID    EX    MEM   WB
ADD R3,R4,R5            IF    ID    EX    MEM   WB
SUB R10,R2,R9                 IF    ID    EX    MEM   WB
AND R5,R7,R7                        STALL IF    ID    EX    MEM
ADD R2,R1,R5                              STALL IF    ID    EX
– Without the stall, the IF of AND would fall in cycle 4, the same cycle in which LW accesses memory in its MEM stage.
Example 1
• Consider an instruction pipeline with the following features:
– Data references constitute 35% of the instructions.
– Ideal CPI ignoring structural hazard is 1.3.
How much faster is the ideal machine without the memory structural hazard, versus
the machine with the hazard?
\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline Depth}}{\text{Ideal CPI} + \text{Pipeline stall cycles per instruction}} \]
\[ \text{Speedup}_{ideal} = \frac{1.3 \times \text{Pipeline Depth}}{1.3 + 0} \]
\[ \text{Speedup}_{real} = \frac{1.3 \times \text{Pipeline Depth}}{1.3 + 0.35 \times 1} \]
\[ \frac{\text{Speedup}_{ideal}}{\text{Speedup}_{real}} = \frac{1.65}{1.3} = 1.27 \]
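The ratio can be checked with a short script. This is a sketch with made-up names; the pipeline depth is set to 5 only for concreteness, since it cancels in the ratio.

```python
def pipeline_speedup(ideal_cpi, depth, stalls_per_instr):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall cycles per instruction)."""
    return ideal_cpi * depth / (ideal_cpi + stalls_per_instr)

depth = 5  # any positive value gives the same ratio
ideal = pipeline_speedup(1.3, depth, 0.0)
real = pipeline_speedup(1.3, depth, 0.35 * 1)   # 35% data references, 1 stall each
ratio = ideal / real

print(round(ratio, 2))
```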
Instruction       1     2     3     4     5     6
ADD R2,R5,R8      IF    ID    EX    MEM   WB
SUB R9,R2,R6            IF    ID    EX    MEM   WB
Instruction       1     2     3     4     5     6     7     8
ADD R2,R5,R8      IF    ID    EX    MEM   WB
SUB R9,R2,R6            IF    ID    STALL STALL ID    EX    MEM
Instr. (i+2)                  IF    STALL STALL IF    ID    EX
Instr. (i+3)                              STALL STALL IF    ID
Instr. (i+4)                                    STALL STALL IF
Instruction       1     2     3     4     5     6     7     8     9
ADD R4, R5, R6    IF    ID    EX    MEM   WB
SUB R3, R4, R8          IF    ID    EX    MEM   WB
ADD R7, R2, R4                IF    ID    EX    MEM   WB
AND R9, R4, R10                     IF    ID    EX    MEM   WB
OR R11, R4, R5                            IF    ID    EX    MEM   WB
• The first instruction computes R4, which is required by all four subsequent
instructions.
• The dependencies are depicted by red arrows (result written in WB, operands
read in ID).
• The last instruction (OR) is not affected by the data dependency: its ID stage
is in cycle 6, after the WB of the first instruction in cycle 5.
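The observation can be verified by comparing stage timings. This is a minimal sketch: it assumes no stalls and no forwarding, and that a register read in the same cycle as the write still sees the old value. The tuple layout is made up for the example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_cycle(index, stage):
    """Cycle in which instruction `index` (0-based) is in `stage`, with no stalls."""
    return index + STAGES.index(stage) + 1

# (opcode, destination, sources) for the five instructions in the table
program = [
    ("ADD", "R4", ("R5", "R6")),
    ("SUB", "R3", ("R4", "R8")),
    ("ADD", "R7", ("R2", "R4")),
    ("AND", "R9", ("R4", "R10")),
    ("OR", "R11", ("R4", "R5")),
]

producer_wb = stage_cycle(0, "WB")   # R4 is written in cycle 5
affected = []
for i, (op, dest, srcs) in enumerate(program[1:], start=1):
    # Hazard if R4 is read in ID no later than the cycle in which it is written.
    if "R4" in srcs and stage_cycle(i, "ID") <= producer_wb:
        affected.append(i)

print(affected)  # indices of the instructions that would read a stale R4
```

Under these assumptions SUB, ADD and AND are affected, while OR reads R4 one cycle after it has been written.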
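Data Forwarding Hardware
Only the relevant parts of the MIPS32 pipeline data path are shown.
[Figure: forwarding paths. ALUOut and LMD from the later pipeline latches are fed back through multiplexers to the ALU inputs, alongside NPC, A, B and Imm; the Data Memory and LMD are also shown.]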
END OF LECTURE 55
• In the MIPS32 pipeline, memory references are always kept in order, and so data
hazards between memory references never occur.
– Cache misses can result in pipeline stalls.
Instruction       1     2     3     4     5     6
SW R2, 100(R5)    IF    ID    EX    MEM   WB
LW R10, 100(R5)         IF    ID    EX    MEM   WB
Instruction       1     2     3     4     5     6     7     8
LW R4, 100(R5)    IF    ID    EX    MEM   WB
SUB R3, R4, R8          IF    ID    EX    MEM   WB
ADD R7, R2, R4                IF    ID    EX    MEM   WB
AND R9, R4, R10                     IF    ID    EX    MEM   WB
– SUB: the ALU wants to use the loaded data in cycle 4 (EX), but the loaded data is
available only at the end of cycle 4 (MEM of LW). Forwarding cannot solve this.
– ADD and AND: forwarding can solve these.
• Terminology:
– Instruction Issue: This is the process of allowing an instruction to move
from the ID stage to the EX stage.
– In the MIPS32 pipeline, all possible data hazards can be checked in the ID
stage itself.
• If a data hazard (LOAD followed by use) exists, the instruction is stalled
before it is issued.
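The check made in the ID stage can be sketched as a small predicate. The latch field names used here ("op", "rt", "sources") are hypothetical and chosen only for this illustration; a real implementation compares register-number fields of the instruction words.

```python
def must_stall(id_ex, if_id):
    """True if the instruction in ID uses the register being loaded by the one in EX."""
    if id_ex["op"] != "LW":
        return False                      # only a LOAD followed by a use stalls
    return id_ex["rt"] in if_id["sources"]

load = {"op": "LW", "rt": "R4"}                    # LW R4, 100(R5) now in EX
use = {"op": "SUB", "sources": ("R4", "R8")}       # SUB R3, R4, R8 now in ID
other = {"op": "ADD", "sources": ("R2", "R6")}     # does not read R4

print(must_stall(load, use), must_stall(load, other))
```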
• A common example:
A = B + C
• Pipelined execution of the corresponding MIPS32 code is shown.
Instruction       1     2     3     4     5     6     7     8     9
LW R1, B          IF    ID    EX    MEM   WB
LW R2, C                IF    ID    EX    MEM   WB
ADD R5, R1, R2                IF    ID    STALL EX    MEM   WB
SW R5, A                            IF    STALL ID    EX    MEM   WB
• The ADD needs R2 from the second LW, so it is stalled for one cycle; the SW
behind it is delayed as well.
Example 1
A C code segment:
x = a - b;
y = c + d;
• Some observations:
– Pipeline scheduling can increase the number of registers required, but
results in performance improvement in general.
– A LOAD instruction requiring that the following instruction does not use the
loaded value is called a delayed load.
– A pipeline slot after a LOAD is called the load delay slot.
– If the compiler cannot move some instruction to fill up the delay slot
(scheduling), it can insert a NOP (No Operation) instruction.
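The effect of pipeline scheduling can be illustrated by counting load-use stalls before and after reordering. This sketch assumes forwarding, so that only a LOAD immediately followed by a use costs one stall. The two instruction sequences for x = a - b; y = c + d and the register names are one possible choice made for this example, not taken from the slides.

```python
def load_use_stalls(program):
    """Count adjacent pairs where a LW result is used by the very next instruction."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        op, dest, _ = prev
        if op == "LW" and dest in cur[2]:
            stalls += 1
    return stalls

# (opcode, destination register, source operands)
unscheduled = [
    ("LW", "R1", ("a",)), ("LW", "R2", ("b",)),
    ("SUB", "R3", ("R1", "R2")), ("SW", None, ("R3", "x")),
    ("LW", "R4", ("c",)), ("LW", "R5", ("d",)),
    ("ADD", "R6", ("R4", "R5")), ("SW", None, ("R6", "y")),
]
scheduled = [
    ("LW", "R1", ("a",)), ("LW", "R2", ("b",)), ("LW", "R4", ("c",)),
    ("SUB", "R3", ("R1", "R2")), ("LW", "R5", ("d",)),
    ("SW", None, ("R3", "x")), ("ADD", "R6", ("R4", "R5")),
    ("SW", None, ("R6", "y")),
]

print(load_use_stalls(unscheduled), load_use_stalls(scheduled))
```

The reordered sequence keeps more loaded values live at the same time, which is why scheduling tends to need more registers.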
Instruction       1     2     3     4     5     6     7     8     9     10    11
BEQ Label         IF    ID    EX    MEM   WB
Instr. (i+1)            IF    Stall Stall IF    ID    EX    MEM   WB
Instr. (i+2)                  Stall Stall Stall IF    ID    EX    MEM   WB
Instr. (i+3)                        Stall Stall Stall IF    ID    EX    MEM   WB
Instr. (i+4)                              Stall Stall Stall IF    ID    EX    MEM
Instr. (i+5)                                    Stall Stall Stall IF    ID    EX
Instr. (i+6)                                          Stall Stall Stall IF    ID
• By the end of the ID cycle, we know whether the branch is taken and also
know the branch target address.
– This requires only a single-cycle stall for branches.
Instruction       1     2     3     4     5     6     7     8     9     10    11
BEQ Label         IF    ID    EX    MEM   WB
Instr. (i+1)            IF    IF    ID    EX    MEM   WB
Instr. (i+2)                  Stall IF    ID    EX    MEM   WB
Instr. (i+3)                        Stall IF    ID    EX    MEM   WB
Instr. (i+4)                              Stall IF    ID    EX    MEM   WB
Instr. (i+5)                                    Stall IF    ID    EX    MEM   WB
END OF LECTURE 56
PROF. INDRANIL SENGUPTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, IIT KHARAGPUR
DR. KAMALIKA DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA
Not taken branch (no penalty):
Instruction       1     2     3     4     5     6     7     8     9     10
BEQ Label         IF    ID    EX    MEM   WB
Instr. (i+1)            IF    ID    EX    MEM   WB
Instr. (i+2)                  IF    ID    EX    MEM   WB
Instr. (i+3)                        IF    ID    EX    MEM   WB
Instr. (i+4)                              IF    ID    EX    MEM   WB
Taken branch (1-cycle penalty):
Instruction       1     2     3     4     5     6     7     8     9     10
BEQ Label         IF    ID    EX    MEM   WB
Instr. (i+1)            IF    IF    ID    EX    MEM   WB
Instr. (i+2)                  Stall IF    ID    EX    MEM   WB
Instr. (i+3)                        Stall IF    ID    EX    MEM   WB
Instr. (i+4)                              Stall IF    ID    EX    MEM   WB
Taken or not taken branch (1-cycle penalty):
Instruction       1     2     3     4     5     6     7     8     9     10
BEQ Label         IF    ID    EX    MEM   WB
Instr. (i+1)            IF    IF    ID    EX    MEM   WB
Instr. (i+2)                  Stall IF    ID    EX    MEM   WB
Instr. (i+3)                        Stall IF    ID    EX    MEM   WB
Instr. (i+4)                              Stall IF    ID    EX    MEM   WB
d) We can use a technique called delayed branch, which makes the hardware
simple but puts more responsibility on the compiler.
• If a branch instruction incurs a penalty of n stall cycles, the execution cycle of a branch
instruction is defined as follows:
Branch instruction
Successor_1
Successor_2
...
Successor_n
(Successor_1 to Successor_n are the branch delay slots)
Branch target if taken
For MIPS32, n = 1.
• Just like in load-delay slots, the task of the compiler is to try and fill up the branch-delay
slots with meaningful instructions.
• Instructions in the branch-delay slot are always executed irrespective of the outcome
of the branch.
Some Examples
(a)                     (b)                     (c)
   ADD R3,R4,R5         L: ADD R1,R2,R8            ADD R1,R2,R8
   BEQZ R2,L               SUB R3,R4,R5            BEQZ R3,L
   DELAY SLOT              BEQZ R3,L               DELAY SLOT
   …                       DELAY SLOT              SUB R3,R4,R5
L:                                              L:
Example 1
• Consider the MIPS32 pipeline with ideal CPI of 1. Assume that 20% of the instructions
executed are branch instructions, out of which 75% are taken branches. Evaluate the
pipeline speedup for the four techniques discussed to reduce branch penalties.
• Solution:
\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline Depth}}{\text{Ideal CPI} + \text{Pipeline stall cycles per instruction}} = \frac{\text{Pipeline Depth}}{1 + (\text{Branch frequency} \times \text{Branch penalty})} \]
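One way to evaluate the formula for the four techniques is sketched below. The average penalties per branch are assumptions read off the timing diagrams and notes elsewhere in these slides: 3 cycles when the pipeline is stalled, 1 cycle for predict taken, 1 cycle on taken branches only for predict not taken, and 0.5 cycles on average for delayed branch. The pipeline depth is taken as 5.

```python
def speedup(depth, branch_freq, avg_penalty, ideal_cpi=1.0):
    """Pipeline speedup with branch stalls as the only source of stall cycles."""
    return ideal_cpi * depth / (ideal_cpi + branch_freq * avg_penalty)

DEPTH = 5
BRANCH_FREQ = 0.20
TAKEN = 0.75

# Average penalty in cycles per branch (assumed values, see the note above).
penalties = {
    "stall pipeline": 3,
    "predict taken": 1,
    "predict not taken": TAKEN * 1,
    "delayed branch": 0.5,
}
results = {name: speedup(DEPTH, BRANCH_FREQ, p) for name, p in penalties.items()}

for name, value in results.items():
    print(f"{name:18s} {value:.2f}")
```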
• When an interrupt occurs, the pipeline state can be saved safely as follows:
a) A TRAP instruction is forced into the pipeline at the next IF cycle.
b) Until the TRAP is taken, disable all writes for the faulting instruction and all
others that follow. → prevents any state change
c) After the interrupt handler takes control, it saves the PC of the faulting
instruction.
• For delayed branches, suppose that an instruction in the branch-delay slot
causes the interrupt.
– If the branch is taken, then the instructions to restart are those in the slot plus
the instruction at the branch target.
– A number of PC values need to be saved and restored.
• After handling the interrupt, a special instruction (RFE in MIPS32) returns the
machine from the interrupt handler by reloading the PCs and restarting the
interrupted instructions.
– Precise Interrupts: The pipeline supports precise interrupts if it can be stopped
such that the instructions before the faulting instruction are completed, while
those after it can be restarted from scratch.
– Supporting precise interrupts is a requirement in many systems (e.g. page fault,
IEEE arithmetic, etc.).
– Can cause a data page fault (in MEM) and arithmetic exception (in EX) at the
same time (clock cycle 4).
– As a possible solution, we can deal only with the data page fault and then restart
the execution. The second interrupt will occur again, and will be handled later.
• Possible Solution:
– The hardware posts each interrupt in a status vector, which is carried along
with each instruction as it moves through the pipeline.
– When an instruction reaches WB, the interrupt status vector is checked and
handled if present.
– Guarantees that all interrupt handling is carried out in precise order.
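The idea of the status vector can be illustrated with a small simulation. This is a simplified sketch: the two interrupts and their causes are hypothetical, stalls are ignored, and restarting the later instructions after the first handler runs is not modelled.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def cycle_of(index, stage):
    """Clock cycle in which instruction `index` (0-based) is in `stage`."""
    return index + STAGES.index(stage) + 1

# (instruction index, stage where the interrupt is raised, cause)
raised = [(0, "MEM", "data page fault"), (1, "IF", "instruction page fault")]

# Order in which the interrupts actually occur in time.
occurrence_order = sorted(raised, key=lambda r: cycle_of(r[0], r[1]))

# Each interrupt is only posted in the status vector of its own instruction.
status = {}
for index, stage, cause in raised:
    status.setdefault(index, []).append(cause)

# Instructions reach WB in program order, so the vectors are checked in that order.
handled = []
for index in sorted(status):
    handled.extend((index, cause) for cause in status[index])

print(occurrence_order[0], handled[0])
```

Here the second instruction faults first in time (cycle 2), but the fault of the first instruction (cycle 4) is handled first.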
END OF LECTURE 57
Introduction
• What have we seen so far to reduce branch penalties?
– Simple hardware-based schemes that use a static prediction approach
(assume taken or not taken).
– Software-based scheme with compiler support (delayed branches).
• We shall now explore some hardware-based approaches to dynamically
predict the outcome of a branch.
– Prediction changes with change in branch behavior.
Example 1
• Consider the following nested loop where the inner loop iterates 20
times before exiting.
   for (loop=0; loop<1000; loop++)
   {
      ……
      for (i=0; i<20; i++)
         A[i] = A[i] + 5;
   }
• For every execution of the inner loop, there will be two mispredictions:
a) Last iteration of the current loop.
b) First iteration of the next loop.
• Though the branch is taken 95% of the time (19 out of 20), correct prediction
occurs only 18 times.
• 90% accuracy.
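The misprediction count can be reproduced with a small simulation. This sketch assumes that the loop-closing branch of the inner loop is taken 19 times and not taken once per execution, that the predictor is a saturating counter with 1 or 2 bits, and that the counter starts at 0 (predict not taken); these modelling choices are made for the example.

```python
def inner_loop_outcomes(outer=1000, inner=20):
    """Outcome of the inner loop branch: taken on all but the last iteration."""
    for _ in range(outer):
        for i in range(inner):
            yield i < inner - 1

def mispredictions(outcomes, bits):
    """Count mispredictions of a `bits`-bit saturating counter starting at 0."""
    top = (1 << bits) - 1
    counter = 0
    misses = 0
    for taken in outcomes:
        predicted_taken = counter > top // 2
        if predicted_taken != taken:
            misses += 1
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return misses

total = 1000 * 20
one_bit = mispredictions(inner_loop_outcomes(), bits=1)
two_bit = mispredictions(inner_loop_outcomes(), bits=2)
print(one_bit, 1 - one_bit / total)
print(two_bit, 1 - two_bit / total)
```

With a 1-bit history the simulation gives two mispredictions per execution of the inner loop, i.e. 90% accuracy; a 2-bit counter removes the misprediction on re-entering the loop.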
• An observation:
– It may be noted that the 2-bit prediction scheme as discussed cannot speed up
the MIPS32 pipeline.
– In the MIPS32 pipeline, both the branch outcome and the branch target address
become known together (viz. at the end of ID).
– To improve MIPS32 pipeline performance, we need to know from what address
to fetch by the end of IF.
• Whether the instruction is a branch (not decoded yet).
• What is the predicted next value of PC?
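[Figure: branch target buffer lookup. The PC of the instruction being fetched is compared (=) against the stored entries. N: not predicted to be a branch; proceed normally. Y: instruction is a branch; the predicted PC is to be used as the next PC.]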
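The lookup can be sketched as a small table indexed by the fetch address. The class below is a hypothetical illustration: it stores only taken branches, removes an entry when a branch turns out not taken, and assumes 4-byte instructions. Real designs differ in size, tagging and update policy.

```python
class BranchTargetBuffer:
    def __init__(self):
        self.entries = {}   # branch PC -> predicted target PC

    def next_pc(self, pc):
        """Looked up during IF: predicted target on a hit, else the sequential PC."""
        return self.entries.get(pc, pc + 4)

    def update(self, pc, taken, target):
        """Called once the branch outcome and target are known."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)

btb = BranchTargetBuffer()
first = btb.next_pc(0x100)             # miss: proceed normally
btb.update(0x100, taken=True, target=0x200)
second = btb.next_pc(0x100)            # hit: predicted PC used as next PC
btb.update(0x100, taken=False, target=0x200)
third = btb.next_pc(0x100)             # entry removed again
print(hex(first), hex(second), hex(third))
```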
[Figure: flowchart of BTB handling. In the IF stage, the fetch address is checked: In BTB? (No / Yes).]
Example 2
• Consider the following parameters of a MIPS pipeline with BTB:
– 90% of the branches are found in BTB.
– 8% of the predictions are incorrect.
– 75% of the branches are taken.
Compute the branch penalty.
Note: Delayed branch requires a penalty of 0.5 clock cycles on the average.
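One way to set up the computation is sketched below. The example as given does not state how many cycles are lost per wrong outcome; the 2-cycle value used here is an assumption for both cases (a branch found in the BTB but mispredicted, and a taken branch not found in the BTB).

```python
HIT_RATE = 0.90        # branches found in the BTB
WRONG = 0.08           # predictions that are incorrect
TAKEN = 0.75           # branches that are taken
PENALTY_CYCLES = 2     # assumption: not stated in the example

hit_mispredicted = HIT_RATE * WRONG * PENALTY_CYCLES
miss_taken = (1 - HIT_RATE) * TAKEN * PENALTY_CYCLES
branch_penalty = hit_mispredicted + miss_taken

print(f"{branch_penalty:.3f} cycles per branch")
```

Under this assumption the average penalty is below the 0.5 cycles quoted for delayed branch.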
Conditional Instructions
• Conditional instructions can help eliminate some branch instructions and
hence also the corresponding branch penalties.
– MIPS32 has a number of conditional instructions (e.g. conditional move).
END OF LECTURE 58