Professional Documents
Culture Documents
1 2
Introduc)on
• Basic requirements for pipelining the MIPS32 data path:
– We should be able to start a new instrucEon every clock cycle.
– Each of the five steps menEoned before (IF, ID, EX, MEM and WB) becomes a
pipeline stage.
– Each stage must finish its execuEon within one clock cycle.
• Since execuEon of several instrucEons are overlapped, we must ensure that
there is no conflict during the execuEon.
– Simplicity of the MIPS32 instrucEon set makes this evaluaEon quite easy.
– We shall discuss these issues in detail.
2 2
1
15/09/17
5T
IF + ID + EX + MEM + WB
(Non-pipelined)
IF ID EX MEM WB
3 2
Clock Cycles
Instruc)on 1 2 3 4 5 6 7 8
i IF ID EX MEM WB
i+1 IF ID EX MEM WB
i+2 IF ID EX MEM WB
i+3 IF ID EX MEM WB
Instr-i Instr-(i+2)
finishes finishes
Instr-(i+1) Instr-(i+3)
finishes finishes
4 2
2
15/09/17
Clock Cycles
Instruc)on 1 2 3 4 5 6 7 8
i IF ID EX MEM WB
i+1 IF ID EX MEM WB
i+2 IF ID EX MEM WB
i+3 IF ID EX MEM WB
5 2
Advantages of Pipelining
• In the non-pipelined version, the execuEon Eme of an instrucEon is equal to
the combined delay of the five stages (say, 5T ).
• In the pipelined version, once the pipeline is full, one instrucEon gets
executed aYer every T Eme.
– Assuming all state delays are equal (equal to T ), and neglecEng latch delay.
• However, due to various conflicts between instrucEons (called hazards), we
cannot achieve the ideal performance.
– Several techniques have been proposed to improve the performance.
– To be discussed.
6 2
3
15/09/17
Some Observa)ons
a) To support overlapped execuEon, peak memory bandwidth must be
increased 5 Emes over that required for the non-pipelined version.
– An instrucEon fetch occurs every clock cycle.
– Also there can be two memory accesses per clock cycle (one for instrucEon and
one for data).
– Separate instrucEon and data caches are typically used to support this.
I-Cache
CPU
D-Cache
7 2
Clock
8 2
4
15/09/17
Source Register 1
Read (5 bits)
Port 1 Register Data DesEnaEon Register
(32 bits) (5 bits) Write
REGISTER Register Data Port
Source Register 1 BANK (32 bits)
(5 bits)
Read
Port 2 Register Data
(32 bits)
9 2
10 2
5
15/09/17
• Register stages are inserted between pipeline stages, which increases the
execuEon Eme of an individual instrucEon.
– Because of overlapped execuEon of instrucEons, throughput increases.
• The clock period T has to be chosen suitably:
– Slowest stage in the pipeline.
– Clock skew and jieer.
– Register setup Eme: minimum Eme the register input must be held stable before
the acEve clock edge arrives.
11 2
Example 1
• Consider the 5-stage MIPS32 pipeline, with the following features:
– Pipeline clock rate of 1GHz (i.e. 1 ns clock cycle Eme).
– For a non-pipelined implementaEon, ALU operaEons and branches take 4 cycles,
while memory operaEons take 5 cycles.
– RelaEve frequencies of ALU operaEons, branches and memory operaEons are
50%, 15%, and 35% respecEvely.
– In the pipelined implementaEon, due to clock skew and setup Eme, the clock
cycle Eme increases by 0.25 ns.
– Calculate the esEmated speedup of the pipelined implementaEon.
12 2
6
15/09/17
• Solu)on:
a) For non-pipelined processor:
• Average instrucEon execuEon Eme = Clock cycle Eme x Average CPI
= 1 ns x (0.50 x 4 + 0.15 x 4 + 0.35 x 5) = 4.35 ns
b) For pipelined processor:
• Clock cycle Eme = 1 + 0.25 = 1.25 ns
• In the steady state, one instrucEon will get executed every clock cycle.
• Speedup = 4.35 / 1.25 = 3.48
13 2
14 2
7
15/09/17
33 2
Introduc)on
• Pipeline Hazards:
– Ideally, an instrucEon pipeline should complete the execuEon of an instrucEon
every clock cycle.
– Hazards refer to situaEons that prevent a pipeline from operaEng at its maximum
possible clock speed.
• Prevents some instrucEons from execuEng during its designated clock cycle.
• Three types of hazards:
a) Structural hazard: Arise due to resource conflicts.
b) Data hazard: Arise due to data dependencies between instrucEons.
c) Control hazard: Arise due to branch and other instrucEons that change the PC.
34 2
17
15/09/17
35 2
36 2
18
15/09/17
• If we ignore the increase in clock cycle Eme due to pipelining (i.e. Cnopipe = Cpipe)
37 2
38 2
19
15/09/17
• IllustraEon:
– Structural hazard in a single-port memory system, which stores both instrucEons and data.
Instruc)on 1 2 3 4 5 6 7 8
LW R1,10(R2) IF ID EX MEM WB
ADD R3,R4,R5 IF ID EX MEM WB
SUB R10,R2,R9 IF ID EX MEM WB
AND R5,R7,R7 STALL
IF IF
ID ID
EX EX
MEM MEM
WB
ADD R2,R1,R5 IF
STALL ID
IF EX
ID MEM
EX
39 2
Example 1
• Consider an instrucEon pipeline with the following features:
– Data references consEtute 35% of the instrucEons.
– Ideal CPI ignoring structural hazard is 1.3.
How much faster is the ideal machine without the memory structural hazard, versus
the machine with the hazard?
Ideal CPI x Pipeline Depth
Speedup =
Ideal CPI + Pipeline stall cycles / instrucEon
Speedupideal = 1.3 x Pipeline Depth
1.3 + 0 Speedupideal 1.65
= = 1.27
1.3 x Pipeline Depth Speedupreal 1.3
Speedupreal =
1.3 + 0.35 x 1
40 2
20
15/09/17
41 2
Instruc)on 1 2 3 4 5 6
ADD R2,R5,R8 IF ID EX MEM WB
SUB R9,R2,R6 IF ID EX MEM WB
42 2
21
15/09/17
Instruc)on 1 2 3 4 5 6 7 8
ADD R2,R5,R8 IF ID EX MEM WB
SUB R9,R2,R6 IF ID STALL STALL ID EX MEM
Instr. (i+2) IF STALL STALL IF ID EX
Instr. (i+3) STALL STALL IF ID
Instr. (i+4) STALL STALL IF
43 2
44 2
22
15/09/17
45 2
Instruc)on 1 2 3 4 5 6 7 8 9
ADD R4, R5, R6 IF ID EX MEM WB
SUB R3, R4, R8 IF ID EX MEM WB
ADD R7, R2, R4 IF ID EX MEM WB
AND R9, R4, R10 IF ID EX MEM WB
OR R11, R4, R5 IF ID EX MEM WB
• The first instrucEon computes R4, which is required by all the subsequent four
instrucEons.
• The dependencies are depicted by red arrows (result wrieen in WB, operands
read in ID).
• The last instrucEon (OR) is not affected by the data dependency.
46 2
23
15/09/17
Instruc)on 1 2 3 4 5 6 7 8 9
ADD R4, R5, R6 IF ID EX MEM WB
SUB R3, R4, R8 IF ID EX MEM WB
ADD R7, R2, R4 IF ID EX MEM WB
AND R9, R4, R10 IF ID EX MEM WB
OR R11, R4, R5 IF ID EX MEM WB
47 2
48 2
24
15/09/17
Instruc)on 1 2 3 4 5 6 7 8 9
ADD R4, R5, R6 IF ID EX MEM WB
SUB R3, R4, R8 IF ID EX MEM WB
ADD R7, R2, R4 IF ID EX MEM WB
AND R9, R4, R10 IF ID EX MEM WB
OR R11, R4, R5 IF ID EX MEM WB
49 2
Data
Forwarding M
U
Hardware X
NPC
Only the A
A ALUOut
relevant parts L
of the MIPS32 B U Data LMD
pipeline data Imm M B Memory
U
path is shown. X
ALUOut
50 2
25
15/09/17
END OF LECTURE 55
51 2
52 2
26
15/09/17
• In MIPS32 pipeline, memory references are always kept in order, and so data
hazards between memory references never occur.
– Cache misses can result in pipeline stalls.
Instruc)on 1 2 3 4 5 6
SW R2, 100(R5) IF ID EX MEM WB
LW R10, 100(R5) IF ID EX MEM WB
53 2
Instruc)on 1 2 3 4 5 6 7 8
LW R4, 100(R5) IF ID EX MEM WB
SUB R3, R4, R8 IF ID EX MEM WB
ADD R7, R2, R4 IF ID EX MEM WB
AND R9, R4, R10 IF ID EX MEM WB
ALU wants to use the loaded data; Loaded data Forwarding can solve
Forwarding cannot solve available here
54 2
27
15/09/17
55 2
• A terminology:
– Instruc)on Issue: This is the process of allowing an instrucEon to move
from the ID stage to the EX stage.
– In the MIPS32 pipeline, all possible data hazards can be checked in the ID
stage itself.
• If a data hazard (LOAD followed by use) exists, the instrucEon is stalled
before it is issued.
56 2
28
15/09/17
• A common example:
A = B + C
• Pipelined execuEon of the corresponding MIPS32 code is shown.
Instruc)on 1 2 3 4 5 6 7 8 9
LW R1, B IF ID EX MEM WB
LW R2, C IF ID STALL EX MEM WB
ADD R5, R1, R2 IF STALL ID EX MEM WB
SW R5, A STALL IF ID EX MEM WB
57 2
A C code segment:
Example 1 x = a – b;
y = c + d;
58 2
29
15/09/17
• Some observaEons:
– Pipeline scheduling can increase the number of registers required, but
results in performance improvement in general.
– A LOAD instrucEon requiring that the following instrucEon do not use the
loaded value is called a delayed load.
– A pipeline slot aYer a LOAD is called the load delay slot.
– If the compiler cannot move some instrucEon to fill up the delay slot
(scheduling), it can insert a NOP (No OperaEon) instrucEon.
59 2
60 2
30
15/09/17
61 2
Instruc)on 1 2 3 4 5 6 7 8 9 10 11
BEQ Label IF ID EX MEM WB
Instr. (i+1) IF Stall Stall IF ID EX MEM WB
Instr. (i+2) Stall Stall Stall IF ID EX MEM WB
Instr. (i+3) Stall Stall Stall IF ID EX MEM WB
Instr. (i+4) Stall Stall Stall IF ID EX MEM
Instr. (i+5) Stall Stall Stall IF ID EX
Instr. (i+6) Stall Stall Stall IF ID
62 2
31
15/09/17
63 2
• By the end of the ID cycle, we know whether the branch is taken and also
know the branch target address.
– Requires a single cycle stall during branches.
64 2
32
15/09/17
Instruc)on 1 2 3 4 5 6 7 8 9 10 11
BEQ Label IF ID EX MEM WB
Instr. (i+1) IF IF ID EX MEM WB
Instr. (i+2) Stall IF ID EX MEM WB
Instr. (i+3) Stall IF ID EX MEM WB
Instr. (i+4) Stall IF ID EX MEM WB
Instr. (i+5) Stall IF ID EX MEM WB
65 2
END OF LECTURE 56
66 2
33
15/09/17
PROF.
DR. KAMALIKA DATTA
DR.INDRANIL
KAMALIKASENGUPTA
DATTA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, IIT MEGHALAYA
KHARAGPUR
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING, NIT MEGHALAYA
67 2
68 2
34
15/09/17
Instruc)on 1 2 3 4 5 6 7 8 9 10
BEQ Label IF ID EX MEM WB
Not
Taken Instr. (i+1) IF ID EX MEM WB
Branch Instr. (i+2) IF ID EX MEM WB
(no
Instr. (i+3) IF ID EX MEM WB
penalty)
Instr. (i+4) IF ID EX MEM WB
Instruc)on 1 2 3 4 5 6 7 8 9 10
BEQ Label IF ID EX MEM WB
Taken
Branch Instr. (i+1) IF IF ID EX MEM WB
(1 cycle Instr. (i+2) Stall IF ID EX MEM WB
penalty)
Instr. (i+3) Stall IF ID EX MEM WB
Instr. (i+4) Stall IF ID EX MEM WB
69 2
Taken Instruc)on 1 2 3 4 5 6 7 8 9 10
or Not BEQ Label IF ID EX MEM WB
Taken Instr. (i+1) IF IF ID EX MEM WB
Branch
Instr. (i+2) Stall IF ID EX MEM WB
(1 cycle
penalty) Instr. (i+3) Stall IF ID EX MEM WB
Instr. (i+4) Stall IF ID EX MEM WB
70 2
35
15/09/17
d) We can use a technique called delayed branch, which makes the hardware
simple but puts more responsibility on the compiler.
• If a branch instrucEon incurs a penalty of n stall cycles, the execuEon cycle of a branch
instrucEon is defined as follows:
Branch instruc)on
Successor1
.
Successor Branch For MIPS32, n=1
.. 2
delay slots
Successorn
Branch target if taken
• Just like in load-delay slots, the task of the compiler is to try and fill up the branch-delay
slots with meaningful instrucEons.
• InstrucEons in the branch-delay slot are always executed irrespecEve of the outcome
of the branch.
71 2
Some Examples
ADD R3,R4,R5 ADD R1,R2,R8
BEQZ R2,L L: ADD R1,R2,R8 BEQZ R3,L
DELAY SLOT SUB R3,R4,R5 DELAY SLOT
… BEQZ R3,L SUB R3,R4,R5
L: DELAY SLOT L:
72 2
36
15/09/17
73 2
Example 1
• Consider the MIPS32 pipeline with ideal CPI of 1. Assume that 20% of the instrucEons
executed are branch instrucEons, out of which 75% are taken branches. Evaluate the
pipeline speedup for the four techniques discussed to reduce branch penalEes.
• Solu)on:
Ideal CPI x Pipeline Depth
Speedup =
Ideal CPI + Pipeline stall cycles / instrucEon
Pipeline Depth
=
1 + (Branch frequency x Branch penalty)
74 2
37
15/09/17
75 2
76 2
38