
Lecture 9

Instruction Level Parallelism (ILP)
(Static Scheduling, Loop Unrolling)

MC FP MIPS Pipeline vs. Scoreboard

                      MC FP MIPS Pipeline                  Scoreboard
Instruction           IF     ID     EX      MEM   WB       Write Result
L.D   F6, 34(R2)      1      2      3       4     5        4
L.D   F2, 45(R3)      2      3      4       5     6        8
MUL.D F0, F2, F4      3      4-5    6-15    16    17       20
SUB.D F8, F6, F2      4-5    6      7-8     9     10       12
DIV.D F10, F0, F6     6      7-15   16-55   56    57       62
ADD.D F6, F8, F2      7-15   16     17-18   19    20       22

Features                          MC FP MIPS            MIPS with Scoreboard
Data forwarding                   Yes                   No
Pipelined FU                      Yes                   No
Hazard detection                  RAW/WAW (WAR: NA)     RAW/WAW/WAR
Out-of-order execution start      No                    Yes
Facilitation to independent inst  No                    Yes

EE/CS520 Comp.Archi.

9/28/2012

Quiz 2: Tuesday, 02-10-2012
Lectures 5-8
Topics covered: Appendix A (A1, A2, A4, A5, A7), 4th Ed.
There will be problems and objective-type questions (TF and/or MCQs)


Instruction Level Parallelism (ILP)
Basic idea:
  Execute several instructions in parallel
We already do pipelining
  But it can only push through at most 1 inst/cycle
  MC FP Pipeline / Scoreboard
We want multiple inst/cycle
  It gets a bit complicated
  More transistors, more logic, more complexity
  That's how we got from the 486 (pipelined) to the Pentium and beyond


Static Scheduling
To exploit pipelining efficiently, the pipeline should remain full all the time
We must exploit ILP by finding unrelated insts that can be overlapped in the pipeline
To avoid stalls, a dependent inst must be separated from its src inst by the latency of the src inst in clock cycles
  If done by the compiler: static scheduling
  If done by the hardware at run time: dynamic scheduling
    We just saw Scoreboarding


Parallelism in a Basic Block (BB)
BB: A code sequence with no branches except an entry and an exit
Example:

  result1 = b + c;
  result2 = d + e;
  result3 = result1 + result2;
  return (result3);

Typical MIPS program branch freq.: 15%-25%
  Length of BB = 3-6 insts
  Mostly dependent on each other
Not much ILP can be extracted from a BB
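To make the "not much ILP" point concrete, the basic block above can be sketched as a small dependence graph. This is only an illustration: the dictionary encoding and the helper name dependence_levels are assumptions, not part of the lecture.

```python
# Hypothetical encoding of the basic block: each statement maps to the names
# it reads. Statement names come from the slide; the layout is an assumption.
block = {
    "result1": ["b", "c"],
    "result2": ["d", "e"],
    "result3": ["result1", "result2"],
}

def dependence_levels(stmts):
    """Return {statement: level}; level 0 has no producer inside the block."""
    levels = {}
    for name, sources in stmts.items():  # statements are in program order
        producers = [levels[s] for s in sources if s in levels]
        levels[name] = 1 + max(producers) if producers else 0
    return levels

levels = dependence_levels(block)
critical_path = 1 + max(levels.values())   # longest dependence chain
ilp = len(block) / critical_path           # average ops available per step
print(levels, critical_path, ilp)
```

Under these assumptions the two independent adds share a level and the third must wait for both, so the three operations need two steps, i.e. an average of 1.5 operations per step.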

Loop Level Parallelism
Go beyond BBs
Extract parallelism among different iterations of a loop
Example

  for (i=1; i<=1000; i++)
      x[i] = x[i] + y[i];

  x[1] = x[1] + y[1];
  x[2] = x[2] + y[2];
  x[3] = x[3] + y[3];
  x[4] = x[4] + y[4];
  ...
  x[1000] = x[1000] + y[1000];

Every iteration of this loop can overlap with others
Loop Unrolling:
  Technique to convert loop-level parallelism to ILP
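As a minimal sketch of what unrolling does at the source level, the loop can be written once in rolled form and once unrolled by 4. The function names and the clean-up loop are illustrative assumptions, and Python lists are 0-based while the slide indexes from 1.

```python
def add_rolled(x, y):
    """One element per loop iteration, as in the original loop."""
    out = list(x)
    for i in range(len(out)):
        out[i] = out[i] + y[i]
    return out

def add_unrolled_by_4(x, y):
    """Four independent element updates per loop iteration."""
    out = list(x)
    n = len(out)
    main = n - n % 4  # largest multiple of 4 that fits
    for i in range(0, main, 4):
        out[i] = out[i] + y[i]
        out[i + 1] = out[i + 1] + y[i + 1]
        out[i + 2] = out[i + 2] + y[i + 2]
        out[i + 3] = out[i + 3] + y[i + 3]
    for i in range(main, n):  # clean-up loop for leftover elements
        out[i] = out[i] + y[i]
    return out

x = list(range(1000))
y = [2 * v for v in x]
print(add_rolled(x, y) == add_unrolled_by_4(x, y))
```

The four statements in the unrolled body touch different elements, so nothing forces them to execute one after another; that independence is what a scheduler can exploit.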

Static Scheduling: Example

Inst producing result    Inst using result    Latency in cc
FP ALU op                Another FP ALU op    3
FP ALU op                Store Double         2
Load Double              FP ALU op            1
Load Double              Store Double         0

Standard 5-stage pipeline, branches have 1 cycle delay, 1 instruction issued per cc, no structural hazard

for (i=1000; i>0; i--)
    x[i] = x[i] + s;

Loop: L.D    F0,0(R1)     # F0 = array element
      ADD.D  F4,F0,F2     # s in F2 => x[i]+s
      S.D    F4,0(R1)     # store result
      DADDI  R1,R1,-8     # decrement pointer
      BNE    R1,R2,Loop   # R2 contains the base address

Static Scheduling: Example

Unscheduled implementation = 9 cc

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDI  R1,R1,-8
      stall
      BNE    R1,R2,Loop

Useful work: L.D, ADD.D, S.D
Rest is the loop overhead

Scheduled implementation = 7 cc

Loop: L.D    F0,0(R1)
      DADDI  R1,R1,-8
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,8(R1)     # displacement adjusted because DADDI has already run
      BNE    R1,R2,Loop
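The cycle counts of the two listings above can be checked with a small in-order issue model. This is a sketch under stated assumptions: the tuple encoding of instructions and the function name issue_cycles are hypothetical, and the latency entries are the ones implied by the stall pattern in the listings (1 stall between a load and an FP ALU op, 2 between an FP ALU op and a store, 1 between DADDI and BNE); every other producer/consumer pair is assumed to need no stall.

```python
# Stalls required between a producer and a dependent consumer, keyed by
# (producer kind, consumer kind). Values are inferred from the stall pattern
# in the listings; any pair not listed is assumed to need 0 stalls.
LATENCY = {
    ("load", "fpalu"): 1,
    ("fpalu", "store"): 2,
    ("int", "branch"): 1,
}

def issue_cycles(program):
    """program: list of (kind, dest, sources). Returns the issue cycle of
    each instruction on an in-order, single-issue pipeline."""
    producer = {}  # register -> (kind, issue cycle) of its latest writer
    cycles = []
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in producer:
                p_kind, p_cycle = producer[reg]
                gap = LATENCY.get((p_kind, kind), 0)
                earliest = max(earliest, p_cycle + gap + 1)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            producer[dest] = (kind, cycle)
    return cycles

unscheduled = [
    ("load",   "F0", ["R1"]),        # L.D   F0,0(R1)
    ("fpalu",  "F4", ["F0", "F2"]),  # ADD.D F4,F0,F2
    ("store",  None, ["F4", "R1"]),  # S.D   F4,0(R1)
    ("int",    "R1", ["R1"]),        # DADDI R1,R1,-8
    ("branch", None, ["R1", "R2"]),  # BNE   R1,R2,Loop
]
# Same instructions with DADDI hoisted above ADD.D, as in the scheduled code.
scheduled = [unscheduled[0], unscheduled[3], unscheduled[1],
             unscheduled[2], unscheduled[4]]

print(issue_cycles(unscheduled)[-1])  # cycles per iteration, unscheduled
print(issue_cycles(scheduled)[-1])    # cycles per iteration, scheduled
```

The model only tracks register dependences and issue order, so it does not check that the store displacement was adjusted after moving DADDI; that correctness step still has to be done by hand.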

Loop Unrolling
Scheme to reduce the loop overhead
  Increases no. of useful insts relative to overhead insts
Replicates the loop body multiple times and adjusts the loop termination code
Improves scheduling by reducing branches
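A rough way to see the overhead reduction is to count instructions. The sketch below assumes the loop used in this lecture, with 3 useful instructions per element (L.D, ADD.D, S.D) and 2 overhead instructions per loop body (DADDI, BNE); the function name overhead_fraction is illustrative.

```python
def overhead_fraction(unroll_factor, useful_per_iter=3, overhead_insts=2):
    """Fraction of instructions in the loop body that are loop overhead."""
    useful = useful_per_iter * unroll_factor
    return overhead_insts / (useful + overhead_insts)

for k in (1, 2, 4):
    print(k, round(overhead_fraction(k), 3))
```

With these counts the overhead share falls from 2 of 5 instructions in the original loop to 2 of 14 when the body is replicated four times.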


Loop Unrolling: Example

Unscheduled implementation = 27 cc (loop unrolled 4 times)

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      L.D    F6,-8(R1)
      stall
      ADD.D  F8,F6,F2
      stall
      stall
      S.D    F8,-8(R1)
      L.D    F10,-16(R1)
      stall
      ADD.D  F12,F10,F2
      stall
      stall
      S.D    F12,-16(R1)
      L.D    F14,-24(R1)
      stall
      ADD.D  F16,F14,F2
      stall
      stall
      S.D    F16,-24(R1)
      DADDI  R1,R1,-32
      stall
      BNE    R1,R2,Loop

27/4 = 6.75 cc/iteration

Static Scheduling + Loop Unrolling

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDI  R1,R1,-32
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop

14 cc; 14/4 = 3.5 cc/iteration

Summary

No. of cc/iteration    Static Scheduling    Loop Unrolling
7                      Yes                  No
6.75                   No                   Yes
3.5                    Yes                  Yes

Loop Unrolling
Advantage
  Increased performance thanks to overhead reduction
Disadvantage
  Increased code size
  Register pressure
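The summary figures can also be read as speedups over the original 9 cc/iteration loop. The short calculation below uses only cycle counts that appear in this lecture; the variable names are illustrative.

```python
baseline = 9.0  # cc/iteration, unscheduled and not unrolled
variants = {
    "static scheduling only": 7.0,
    "loop unrolling (x4) only": 27 / 4,
    "scheduling + unrolling (x4)": 14 / 4,
}
speedups = {name: baseline / cc for name, cc in variants.items()}
for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

Either technique alone gains roughly 1.3x, while combining them gives about 2.57x, because unrolling supplies the independent instructions that scheduling needs to fill the stall slots.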

How Many Times Can the Loop Be Unrolled?
Minimum:
  Depends on no. of stalls in the original/scheduled code
  Try to remove as many stalls as possible (target is zero stalls)
Maximum:
  Depends on available no. of registers
  You can use only even FP registers
  You can't reuse the same register again while unrolling


Minimum Unrolling

Unscheduled implementation = 9 cc

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDI  R1,R1,-8
      stall
      BNE    R1,R2,Loop

Unrolled twice and scheduled:

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      DADDI  R1,R1,-16
      S.D    F4,16(R1)
      S.D    F8,8(R1)
      BNE    R1,R2,Loop

8 cc; 8/2 = 4 cc/iteration

Maximum Unrolling

Unscheduled implementation = 9 cc

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      S.D    F4,0(R1)
      DADDI  R1,R1,-8
      stall
      BNE    R1,R2,Loop

Let's assume we have 26 FP registers (F0 to F25), only even registers can be used, and we cannot reuse the same register again for unrolling

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      L.D    F18,-32(R1)
      L.D    F22,-40(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      ADD.D  F20,F18,F2
      ADD.D  F24,F22,F2
      S.D    F4,0(R1)
      S.D    F8,-8(R1)
      DADDI  R1,R1,-48
      S.D    F12,32(R1)
      S.D    F16,24(R1)
      S.D    F20,16(R1)
      S.D    F24,8(R1)
      BNE    R1,R2,Loop

20 cc; 20/6 = 3.33 cc/iteration
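The register limit behind the factor of 6 can be worked out mechanically. The sketch below assumes, as in the code above, that F2 is reserved for the scalar s and that each unrolled copy needs two fresh registers (one for the loaded value, one for the sum); the function name max_unroll_factor and its parameters are illustrative.

```python
def max_unroll_factor(num_fp_regs, reserved=("F2",), regs_per_iter=2):
    """Largest unroll factor when only even FP registers may be used and no
    register is reused across unrolled copies."""
    even = [f"F{n}" for n in range(0, num_fp_regs, 2)]
    free = [r for r in even if r not in reserved]
    return len(free) // regs_per_iter

print(max_unroll_factor(26))
```

With F0 to F25 there are 13 even registers; removing F2 leaves 12, which is enough for 6 copies of the loop body.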

How to do Static Scheduling + Loop Unrolling?
Checklist:
How many stalls are there in the original unscheduled code?
Are there independent insts in the loop to cover the stalls by reordering?
  If yes, do it in optimized fashion
  How many stalls are left?
How many times to unroll to cover the remaining stalls?
What would be the value of the final loop counter after unrolling?
  8 x no. of loop iterations unrolled (8 bytes per double)
  Whether the counter was incrementing or decrementing?
  How to adjust the displacement values (-ve or +ve)
Is there any restriction on the no. of registers to be used?
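The loop-counter and displacement items of the checklist above reduce to simple arithmetic. The helper below is a sketch with a hypothetical name and return layout; it assumes 8-byte elements and that every store in question has been moved below the pointer update.

```python
ELEMENT_BYTES = 8  # one double-precision element

def unrolled_displacements(k, decrementing=True):
    """Return (pointer step, load displacements, store displacements for
    stores placed after the pointer update) for a loop unrolled k times."""
    sign = -1 if decrementing else 1
    step = sign * ELEMENT_BYTES * k
    loads = [sign * ELEMENT_BYTES * i for i in range(k)]
    # A store moved below the pointer update sees the new pointer value, so
    # its displacement must undo the step.
    stores_after_update = [d - step for d in loads]
    return step, loads, stores_after_update

print(unrolled_displacements(2))
print(unrolled_displacements(4))
```

For k = 2 this gives a step of -16 with store displacements 16 and 8, matching the minimum-unrolling code; stores that stay above the pointer update keep the same displacement as their load.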

Loop Unrolling: Example 2 (minimum)

Instruction Producing Result    Instruction Using Result             Latency in Clock Cycles
FP MUL                          Another FP ALU op                    (not shown)
FP ADD                          Another FP ALU op or Store Double    4
Load Double                     FP ALU op                            1
Load Double                     Store Double                         (not shown)

Loop: L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      S.D    F4,0(R1)
      DADDI  R1,R1,-8
      BNE    R1,R2,Loop

With stalls:

Loop: L.D    F0,0(R1)
      stall
      ADD.D  F4,F0,F2
      stall
      stall
      stall
      stall
      S.D    F4,0(R1)
      DADDI  R1,R1,-8
      stall
      BNE    R1,R2,Loop

11 cc/iteration

Loop Unrolling: Example 2 (minimum)

Loop: L.D    F0,0(R1)
      L.D    F6,-8(R1)
      L.D    F10,-16(R1)
      L.D    F14,-24(R1)
      ADD.D  F4,F0,F2
      ADD.D  F8,F6,F2
      ADD.D  F12,F10,F2
      ADD.D  F16,F14,F2
      DADDI  R1,R1,-32
      S.D    F4,32(R1)
      S.D    F8,24(R1)
      S.D    F12,16(R1)
      S.D    F16,8(R1)
      BNE    R1,R2,Loop

14 cc / 4 iterations = 3.5 cc/iteration
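Why 4 is the minimum here, while 2 was enough in the first example, can be checked by searching for the smallest unroll factor that leaves no stalls. The sketch below assumes the schedule shape used in the listing above (all loads, then all FP adds, then DADDI, then all stores, then BNE) and a 1-cycle gap between DADDI and BNE; the function names are hypothetical.

```python
def total_cycles(k, load_to_fp, fp_to_store, int_to_branch=1):
    """Cycles for one pass of the schedule: k loads, k FP adds, DADDI,
    k stores, BNE. Each latency is the number of cycles that must separate
    a producer from its consumer."""
    deps = {}  # instruction index -> (producer index, required gap)
    for i in range(k):
        deps[k + i] = (i, load_to_fp)               # ADD.D i needs L.D i
        deps[2 * k + 1 + i] = (k + i, fp_to_store)  # S.D i needs ADD.D i
    deps[3 * k + 1] = (2 * k, int_to_branch)        # BNE needs DADDI
    issue = []
    for idx in range(3 * k + 2):
        cycle = issue[-1] + 1 if issue else 1
        if idx in deps:
            producer, gap = deps[idx]
            cycle = max(cycle, issue[producer] + gap + 1)
        issue.append(cycle)
    return issue[-1]

def minimum_unroll(load_to_fp, fp_to_store, limit=16):
    """Smallest k for which the schedule issues with zero stalls."""
    for k in range(1, limit + 1):
        if total_cycles(k, load_to_fp, fp_to_store) == 3 * k + 2:
            return k
    return None

print(minimum_unroll(load_to_fp=1, fp_to_store=2))  # first example
print(minimum_unroll(load_to_fp=1, fp_to_store=4))  # Example 2
```

In this schedule shape each store sits k instructions after its add once DADDI is counted, so the FP-add-to-store latency directly sets the smallest stall-free unroll factor.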
