Lecture - 17 - MIPS - Instruction Level Parallelism

VLSI Architecture
MEL G642
Lecture 17: Instruction Level Parallelism (ILP)
Date: 24-02-2022
Intro
1. Branch prediction eliminates most of the control dependencies.
2. But data dependencies make it harder to finish an instruction in a single cycle.
3. How to further enhance the performance of the processor?
• Why do we execute only one instruction per cycle?
• Why don't we execute more than one instruction per cycle?
• Issue multiple instructions per cycle, decode them, and execute them.
• How multiple instructions can possibly be executed in a given cycle: Instruction Level Parallelism.
Instruction Level Parallelism
Two primary methods to increase ILP:
A - Deeper pipeline
• To overlap more instructions.
• Hazards still have to be resolved; data dependencies have been resolved only up to a certain extent.
B - Multiple issue
• Replicate internal components of the computer so that it can launch multiple instructions per cycle.
• Issue multiple instructions in a cycle and execute multiple instructions in a cycle.
• We can improve the instructions per cycle: CPI < 1 becomes possible, even though a single-issue pipeline has CPI ≈ 1 (it stays close to 1 because there are very few branch mispredictions).
• IPC: instructions per cycle.
• This is not a multicore processor: the components of a single processor are replicated.
Concept of Speculation
A. An approach that allows the compiler or the processor to "guess" about properties of other instructions.
B. Enables execution of other instructions that may depend on the speculated instruction.
• Speculation is done by the compiler or by hardware.
• Incorrect speculations should be handled correctly.
Approach I: Concept of Superscalar Processors
[Figure: instructions I1, I2, ..., In are issued in the same cycle to separate execution units.]
• All this is ideally possible.

Approach II: VLIW (very long instruction word)
[Figure: one long instruction word holding I1 I2 I3 ... In is issued to the execution units EU EU EU EU.]
• The compiler makes a very long instruction from the n instructions, and this is issued as one unit.
• Here compiler support is required; without the compiler the processor can't function.
Instruction Parallelism
Approach II - Multiple Issue: Static Multiple Issue
1. Compiler groups instructions to be issued together and packages them into "issue packets". VLIW is an example.
2. Compiler detects and avoids hazards.
Approach I - Multiple Issue: Dynamic Multiple Issue
1. Processor chooses the instruction(s) to execute in each cycle. The compiler can help by reordering.
2. Processor resolves hazards without the compiler.
• Superscalar is an example, with in-order execution or out-of-order execution.
• Pentium processors are also superscalar.
What should be the ideal CPI?
• For a pipelined processor: ideal CPI = 1
• For an ILP processor: ideal CPI → 0 (all instructions in the same cycle)
Example of Instruction Level Parallelism
1. Ideally a processor can fetch all instructions and execute all instructions in parallel. This is not going to happen.
Ideal case example (all instructions proceed in parallel):
INST \ CYCLE    1      2       3     4          5
ADD R1,R2,R3    Fetch  Decode  Exec  WriteBack
SUB R4,R1,R5    Fetch  Decode  Exec  WriteBack
DIV R6,R7,R8    Fetch  Decode  Exec  WriteBack
MUL R5,R8,R9    Fetch  Decode  Exec  WriteBack
ADD R4,R8,R9    Fetch  Decode  Exec  WriteBack
2. CPI calculation: 5 cycles / ∞ (the number of instructions which can be executed in parallel), so CPI = 5/∞ ≈ 0.
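The arithmetic behind this limit can be written out as below. The symbols N (number of instructions executed) and C (number of cycles taken) are introduced here only for illustration; they are not notation from the slides.

```latex
\[
\mathrm{CPI} = \frac{C}{N}, \qquad
\mathrm{IPC} = \frac{N}{C} = \frac{1}{\mathrm{CPI}}, \qquad
\lim_{N \to \infty} \frac{5}{N} = 0
\]
```

With C fixed at 5 cycles, the CPI shrinks towards 0 as more and more independent instructions are processed in parallel.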
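3. Issues
• All instructions cannot be executed in the same cycle due to dependencies.
• In the example, SUB R4,R1,R5 has a RAW dependency on ADD R1,R2,R3 and has to STALL until R1 is produced, so CPI = 2/5 = 0.4.
• Even if ∞ instructions are there to be executed, CPI ≠ 0, due to data dependencies.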
RAW Dependencies
Dependency 1: RAW dependency
• Ideal processors which can execute any number of instructions still have to obey RAW dependencies: the dependent instruction has to wait till the result it uses is produced.
Example: a perfectly sequential program
• I1 → I2 → I3 → I4 → I5, each with a RAW dependency on the previous instruction, so the instructions need to execute one after the other.
• Fetch and decode are not considered.
• Considering the RAW dependencies, the instructions execute in cycles 1, 2, 3, 4, 5: 5 instructions in 5 cycles, so CPI = 5/5 = 1.
WAW Dependencies
• WAW: write after write dependency, in which the same register is written by more than one instruction.
[Figure: pipeline timing (E, M, WB) over cycles 5, 6, 7 for several instructions, two of which write R4.]
• If the write to R4 that comes later in program order reaches WB first, R4 is updated again afterwards by the earlier instruction. This should not be the final value of R4.
• Some way, the processor has to figure out how to WB the value in a later cycle (will be seen later).
Example: Dependency Quiz
Consider a processor with a 5-stage pipeline and the forwarding feature, which can issue 10 instructions and execute them in one cycle. In which cycle do we EXE and WB each of the following instructions? Assume that the instructions are issued at the 0th cycle.
Given:
• 5-stage pipeline: cycle 0 fetch, 1 decode, 2 EXE, 3 MEM, 4 WB
• Forwarding
• Can issue 10 instructions in one cycle
• Obey RAW dependencies

INSTRUCTION     F  D  EXE  MEM  WB
MUL R2,R2,R2    0  1  2    3    4
ADD R1,R1,R2    1  2  3    4    5
MUL R3,R3,R3    0  1  2    3    4
ADD R1,R1,R3    2  3  4    5    6
MUL R4,R4,R4    0  1  2    3    4
ADD R1,R1,R4    3  4  5    6    7

Write the EXE cycle first; the other stages are then set accordingly.
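The EXE and WB columns of the quiz can be reproduced with a short sketch. It assumes what the quiz notes state (first possible EXE in cycle 2, WB two cycles after EXE, forwarding so that a dependent instruction executes one cycle after its producer); the function name `exe_wb_cycles` and its parameters are hypothetical, chosen only for this illustration.

```python
def exe_wb_cycles(program, first_exe=2, exe_to_wb=2):
    """Return (EXE, WB) cycles for each (dest, src1, src2) in program order.

    Assumption: only RAW dependencies delay an instruction, and forwarding
    lets a consumer execute one cycle after its producer's EXE.
    """
    last_writer_exe = {}  # register -> EXE cycle of its most recent producer
    schedule = []
    for dest, *sources in program:
        exe = first_exe
        for reg in sources:
            if reg in last_writer_exe:
                exe = max(exe, last_writer_exe[reg] + 1)
        schedule.append((exe, exe + exe_to_wb))
        last_writer_exe[dest] = exe
    return schedule


quiz = [
    ("R2", "R2", "R2"),  # MUL R2,R2,R2
    ("R1", "R1", "R2"),  # ADD R1,R1,R2
    ("R3", "R3", "R3"),  # MUL R3,R3,R3
    ("R1", "R1", "R3"),  # ADD R1,R1,R3
    ("R4", "R4", "R4"),  # MUL R4,R4,R4
    ("R1", "R1", "R4"),  # ADD R1,R1,R4
]
for exe, wb in exe_wb_cycles(quiz):
    print(f"EXE in cycle {exe}, WB in cycle {wb}")
```

The three MULs are independent and execute together in cycle 2, while the three ADDs form a RAW chain through R1 and execute in cycles 3, 4 and 5.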
Removing False Dependencies
• RAW: TRUE dependencies. We need to obey them, because the program computes that way; we may delay the EXE or use forwarding.
• WAR and WAW: FALSE (NAME) dependencies. Nothing fundamental; they are dependencies only because we use the same register to write two different outcomes.
• We can execute if we take care of the use of the same register for multiple writes.
• If we can find a way by which waiting on the same register can be avoided, then false dependencies can be removed.
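Duplicating Register Values
[Figure: pipeline timing (E, Mem, WB) from cycles C100, C101 onwards for the instructions below, all issued in the same cycle.]
ADD R1,R2,R3
SUB R4,R1,R5
ADDI R3,R4,01
SUB R4,R8,R9
.
.
.
DIV R10,R4,R11
• SUB R4,R8,R9 does not have to wait for any result, so its execution happens earlier than that of SUB R4,R1,R5, and not in the program's order.
• The write of SUB R4,R1,R5 then lands last, so R4 will have the wrong value.
• Say there is a later instruction such as DIV R10,R4,R11: the result it uses will be wrong.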
Register Renaming
Proposed solution: concept of physical registers
• Architectural registers: registers for programmer/compiler use (the programmer visualizes these registers)
• Physical registers: all places where the processor can put values
• In parallel to FETCH and DECODE, the processor can rewrite the program to use physical registers. This is called register renaming, and for this a Register Allocation Table is needed.
• Uses Register Allocation Table (RAT): a table that says which physical register has the value for which architectural register
RAT Illustration
RAT example
• Each architectural register (R0, R1, R2, ...) in the RAT points to a physical register (P0, P1, P2, ...).
• Register renaming is done on the fly by a superscalar processor: when it fetches an instruction, it decodes it and does register renaming, using the RAT.
• E.g. ADD R1,R2,R3: the source registers are looked up in the table (R2 → P2, R3 → P3) and the data is taken from these physical registers.
• The destination R1 is given a new physical register, because the old one might have been written, or still be in use, by some other instruction.
Other subsequent instructions are renamed in the same way:
SUB R4,R1,R5
XOR R6,R7,R8
MUL R5,R8,R9
ADD R4,R8,R9
• Only written (destination) registers are renamed.
• Check the RAT for renaming the source registers.
• WAW dependency is completely eliminated, but RAW dependency you have to obey.
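The renaming rule described above (look the sources up in the RAT, then give the destination a fresh physical register) can be sketched as follows. The identity initial mapping Rn → Pn and the sequential numbering of fresh registers are assumptions made for this sketch, so the physical register numbers it produces need not match the ones on the slide; `rename` is a hypothetical helper name.

```python
def rename(program, num_arch_regs=10):
    """Rename (op, dest, src1, src2) tuples using a Register Allocation Table.

    Assumptions: the RAT starts as R0->P0 ... R9->P9, and fresh physical
    registers are handed out sequentially starting at P10.
    """
    rat = {f"R{i}": f"P{i}" for i in range(num_arch_regs)}
    next_free = num_arch_regs
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources are read from the RAT BEFORE the destination entry changes,
        # so an instruction such as MUL R2,R2,R2 still reads the old R2.
        p_src1, p_src2 = rat[src1], rat[src2]
        rat[dest] = f"P{next_free}"  # every write gets a new physical register
        next_free += 1
        renamed.append((op, rat[dest], p_src1, p_src2))
    return renamed, rat


program = [
    ("ADD", "R1", "R2", "R3"),
    ("SUB", "R4", "R1", "R5"),
    ("XOR", "R6", "R7", "R8"),
    ("MUL", "R5", "R8", "R9"),
    ("ADD", "R4", "R8", "R9"),
]
renamed, rat = rename(program)
for op, dest, a, b in renamed:
    print(f"{op} {dest},{a},{b}")
```

After renaming, the two writes to R4 go to different physical registers, so the WAW dependency between them disappears, while SUB still reads the physical register written by the first ADD (the RAW dependency is kept).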
Example: Register Renaming Exercise
1. Do the renaming.
2. RAW dependencies remain; WAW and WAR dependencies will be removed after renaming.

FETCHED                       RENAMED
MUL R2,R2,R2                  MUL P7,P2,P2
ADD R1,R1,R2   (RAW)          ADD P8,P1,P7
MUL R2,R4,R4   (WAW, WAR)     MUL P9,P4,P4
ADD R3,R3,R2   (RAW)          ADD P10,P3,P9
MUL R2,R6,R6   (WAW, WAR)     MUL P11,P6,P6
ADD R5,R5,R2   (RAW)          ADD P12,P5,P11

RAT after renaming: R1 → P8, R2 → P11, R3 → P10, R5 → P12; R4 and R6 still point to P4 and P6.

What happens to the FALSE dependencies after RENAMING?
• Evidently there are no more WAW and WAR dependencies.
CPI Calculation
a) Without renaming
• The example given is purely sequential: fetching and decoding can be done at once for all instructions, but execution will be sequential.
• CPI = 1, i.e. instructions per cycle IPC = 1.
b) With renaming

Instruction after renaming    Cycle
MUL P7,P2,P2                  1
ADD P8,P1,P7                  2
MUL P9,P4,P4                  1
ADD P10,P3,P9                 2
MUL P11,P6,P6                 1
ADD P12,P5,P11                2

• 2 cycles for 6 instructions: CPI = 2/6 = 0.33 and IPC = 6/2 = 3.
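The two cycle counts can be checked with a small sketch that assigns each instruction of the fetched program to the earliest cycle its dependencies allow. It assumes one cycle per instruction, no limit on execution units, and that a dependent instruction executes at least one cycle after the instruction it depends on; `execution_cycles` and its flag `obey_false_deps` are hypothetical names for this illustration.

```python
def execution_cycles(program, obey_false_deps):
    """Earliest execution cycle of each (dest, src1, src2) instruction.

    obey_false_deps=True  models the program without renaming (RAW, WAW, WAR).
    obey_false_deps=False models the renamed program (RAW only).
    """
    last_write = {}  # register -> cycle of its most recent writer
    last_read = {}   # register -> latest cycle in which it was read
    cycles = []
    for dest, *srcs in program:
        cycle = 1
        for reg in srcs:                       # RAW: wait for the producer
            if reg in last_write:
                cycle = max(cycle, last_write[reg] + 1)
        if obey_false_deps:
            if dest in last_write:             # WAW: stay behind earlier writer
                cycle = max(cycle, last_write[dest] + 1)
            if dest in last_read:              # WAR: stay behind earlier readers
                cycle = max(cycle, last_read[dest] + 1)
        for reg in srcs:
            last_read[reg] = max(last_read.get(reg, 0), cycle)
        last_write[dest] = cycle
        cycles.append(cycle)
    return cycles


fetched = [
    ("R2", "R2", "R2"),  # MUL R2,R2,R2
    ("R1", "R1", "R2"),  # ADD R1,R1,R2
    ("R2", "R4", "R4"),  # MUL R2,R4,R4
    ("R3", "R3", "R2"),  # ADD R3,R3,R2
    ("R2", "R6", "R6"),  # MUL R2,R6,R6
    ("R5", "R5", "R2"),  # ADD R5,R5,R2
]
for flag in (True, False):
    cycles = execution_cycles(fetched, obey_false_deps=flag)
    print(flag, cycles, "CPI =", max(cycles) / len(fetched))
```

Without renaming the reuse of R2 serializes all six instructions (6 cycles, CPI = 1); with only the RAW dependencies left, the MULs share cycle 1 and the ADDs share cycle 2.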
WHAT IS Instruction Level Parallelism (ILP)?
ILP = IPC when
• Processor completes an entire INSTRUCTION in one cycle
• Processor can do any number of INSTRUCTIONS in the same cycle
• But, it should obey true dependencies
ILP is what an ideal processor can do, subject only to true dependencies.
If a program is given and we are asked to compute ILP, then the steps to compute ILP are:
a. Rename registers
b. Execute any number of instructions (the ideal processor has no limit on the number of execution units)
• ILP is a property of a program, not of the processor; ILP is for an ideal processor.
• ILP is of a program and is processor independent, but IPC is for running a program on a real (non-ideal) processor.
ILP Example
• For calculating ILP we need to know only the RAW dependencies; renaming is optional.
• Consider only RAW dependencies and ignore false dependencies.

Renamed program (find ILP)    Cycle
ADD P10,P2,P3                 1
XOR P6,P7,P8                  1
MUL P5,P8,P8                  1
ADD P4,P8,P9                  1
SUB P11,P10,P5   (RAW)        2
ILP of the program = 5 instructions / 2 cycles = 2.5

Example                       Cycle
MUL R1,R1,R1                  1
SUB R2,R2,R1     (RAW)        2
DIV R3,R2,R1     (RAW)        3
SUB R6,R7,R8                  1
ADD R8,R3,R7                  4
MUL R1,R1,R1                  2
DIV R1,R7,R7                  1
ILP = 7 instructions / 4 cycles = 1.75
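The procedure for computing ILP (consider only RAW dependencies, then execute as many instructions per cycle as possible) can be sketched directly on the instruction text. The helper name `ilp` and the plain "OP dest,src1,src2" text format are assumptions of this sketch; it also assumes every instruction takes one cycle.

```python
from fractions import Fraction


def ilp(lines):
    """ILP of a program given as 'OP dest,src1,src2' strings (RAW only)."""
    ready = {}       # register -> cycle in which its latest value is produced
    last_cycle = 0
    for line in lines:
        _, operands = line.split(maxsplit=1)
        dest, *srcs = [reg.strip() for reg in operands.split(",")]
        # An instruction runs one cycle after the latest producer of its sources.
        cycle = 1 + max((ready.get(src, 0) for src in srcs), default=0)
        ready[dest] = cycle
        last_cycle = max(last_cycle, cycle)
    return Fraction(len(lines), last_cycle)


renamed_program = [
    "ADD P10,P2,P3",
    "XOR P6,P7,P8",
    "MUL P5,P8,P8",
    "ADD P4,P8,P9",
    "SUB P11,P10,P5",
]
example_program = [
    "MUL R1,R1,R1",
    "SUB R2,R2,R1",
    "DIV R3,R2,R1",
    "SUB R6,R7,R8",
    "ADD R8,R3,R7",
    "MUL R1,R1,R1",
    "DIV R1,R7,R7",
]
print(float(ilp(renamed_program)), float(ilp(example_program)))
```

Because only the most recent writer of each source register is tracked, the second program gives the same answer whether or not its registers are renamed first, which is why renaming is optional when computing ILP by hand.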
ILP with Structural and Control Dependencies
While computing ILP, it is for a perfect processor and for the program, so we assume
• No structural dependencies. Structural dependencies come only when you look at a particular processor, so ignore them.
• Perfect same-cycle branch prediction
• Only obey true dependencies; if the branch is affected due to true dependencies, then obey them

Example (cycles n, n+1, n+2, ...):
SUB R1,R2,R3
DIV R1,R1,R1
BEQ R5,R1,LOOP
SUB R5,R1,R1
.
.
LOOP:
DIV R5,R7,R8

• The branch is treated just as another instruction.
• BEQ reads R1, so it is affected by the true dependency chain SUB → DIV → BEQ: SUB executes in cycle n, DIV in n+1 and BEQ in n+2.
ILP VS IPC
ILP ≠ IPC in general; they can be equal for an ideal processor with no resource constraints and perfect predictions.
• ILP is for an ideal processor; IPC is for a real processor.
• ILP = IPC only if the real processor behaves like the ideal processor.

SUB R1,R2,R3
ADD R4,R1,R5
AND R6,R7,R8
DIV R5,R8,R9
SUB R4,R8,R9

• ILP =
• IPC = (when 1 div and 2 ALU units are present) [for a multiple-issue out-of-order superscalar processor]
• IPC = (when 1 div and 1 ALU unit are present) [for a multiple-issue out-of-order superscalar processor]
ILP ≥ IPC; ILP cannot be less than IPC.
Repeat the above problem for a two-issue in-order superscalar processor.
Example to understand the relation between ILP and IPC
What the program does is not important.
Case 1: find ILP and IPC when 1 div and 2 ALU units are present. This processor can issue instructions out of order.
• ILP = 5/2 = 2.5
• Cycles: 2, so IPC = 5/2 = 2.5
Case 2: the same program with 1 unit for DIV and 1 ALU.
• ILP = 5/2 = 2.5
• Cycles: 4, so IPC = 5/4 = 1.25
ILP ≥ IPC: ILP cannot be less than IPC.
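The effect of the number of execution units on IPC can be explored with a small greedy scheduler. This is only a sketch under stated assumptions: every instruction takes one cycle, issue width is unlimited, false dependencies are already removed by renaming, and in each cycle ready instructions grab free units in program order. The function name `ipc` and the unit labels "ALU" and "DIV" are hypothetical.

```python
from fractions import Fraction


def ipc(program, units):
    """IPC of (unit_kind, dest, src1, src2) instructions on limited units."""
    if any(units.get(kind, 0) < 1 for kind, *_ in program):
        raise ValueError("every instruction needs at least one unit of its kind")
    # RAW producers: index of the most recent earlier writer of each source.
    producers, last_writer = [], {}
    for idx, (_, dest, *srcs) in enumerate(program):
        producers.append({last_writer[s] for s in srcs if s in last_writer})
        last_writer[dest] = idx
    done_cycle = {}  # instruction index -> cycle in which it executed
    cycle = 0
    while len(done_cycle) < len(program):
        cycle += 1
        free = dict(units)  # units available in this cycle
        for idx, (kind, *_) in enumerate(program):
            if idx in done_cycle or free[kind] == 0:
                continue
            # Ready only if every producer finished in an EARLIER cycle.
            if all(done_cycle.get(p, cycle) < cycle for p in producers[idx]):
                done_cycle[idx] = cycle
                free[kind] -= 1
    return Fraction(len(program), cycle)


program = [
    ("ALU", "R1", "R2", "R3"),  # SUB R1,R2,R3
    ("ALU", "R4", "R1", "R5"),  # ADD R4,R1,R5
    ("ALU", "R6", "R7", "R8"),  # AND R6,R7,R8
    ("DIV", "R5", "R8", "R9"),  # DIV R5,R8,R9
    ("ALU", "R4", "R8", "R9"),  # SUB R4,R8,R9
]
print(float(ipc(program, {"ALU": 2, "DIV": 1})))
print(float(ipc(program, {"ALU": 1, "DIV": 1})))
```

With two ALUs the program finishes in 2 cycles, the same as the ideal processor; with a single ALU the four ALU instructions are serialized and the program needs 4 cycles, so IPC drops below ILP.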

ILP and IPC Findings
• ILP: ideal out-of-order processor
• ILP ≥ IPC
• Narrow-issue in-order execution (IPC is mostly limited by the issue width)
• Wide-issue in-order execution (mostly limited by the execution order)
• Wide-issue out-of-order: the best of both
  • Fetch/EXE more than 1 instruction per cycle
  • Eliminate false dependencies
  • Reorder instructions
Thank You