Lect9-Static Scheduling-LoopUnrolling-sec2 PDF

Lecture 9
Instruction Level Parallelism (ILP)
(Static Scheduling, Loop Unrolling)

MC FP MIPS Pipeline vs. Scoreboard

                       MC FP MIPS Pipeline                    Scoreboard
Instruction            IF     ID     EX      MEM    WB        Write Result
L.D   F6, 34(R2)       1      2      3       4      5         4
L.D   F2, 45(R3)       2      3      4       5      6         8
MUL.D F0, F2, F4       3      4-5    6-15    16     17        20
SUB.D F8, F6, F2       4-5    6      7-8     9      10        12
DIV.D F10, F0, F6      6      7-15   16-55   56     57        62
ADD.D F6, F8, F2       7-15   16     17-18   19     20        22

Features                            MC FP MIPS          MIPS with Scoreboard
Data forwarding                     Yes                 No
Pipelined FU                        Yes                 No
Hazard detection                    RAW/WAW (WAR NA)    RAW/WAW/WAR
Out-of-order execution start        No                  Yes
Facilitation to independent inst    No                  Yes

EE/CS 520 Comp. Archi.
9/28/2012

Quiz 2: Tuesday, 02-10-2012
Lecture 5-8
Topics covered: Appendix A (A1, A2, A4, A5, A7), 4th Ed.
There will be problems and objective type questions (TF and/or MCQs)

Instruction Level Parallelism (ILP)
Basic idea:
  Execute several instructions in parallel
We already do pipelining
  But it can only push through at most 1 inst/cycle
  MC FP Pipeline / Scoreboard
We want multiple inst/cycle
  It gets a bit complicated
  More transistors, more logic, more complexity
  That's how we got from 486 (pipelined) to Pentium and beyond

Static Scheduling
To exploit pipelining efficiently, the pipeline should remain full all the time
We must use ILP by finding unrelated insts that can be overlapped in a pipeline
To avoid stalls, the dependent inst must be separated from the src inst by the
latency of the src inst in clock cycles
  If done by the compiler: static scheduling
  If done by the hardware at run time: dynamic scheduling
    We just saw Scoreboarding

Parallelism in a Basic Block (BB)
BB: A code sequence with no branches in except at the entry and no branches out
except at the exit
Example:
  result1 = b + c;
  result2 = d + e;
  result3 = result1 + result2;
  return (result3);
Typical MIPS program branch freq.: 15%-25%
  Length of BB = 3-6 insts
  Mostly dependent on each other
Not much ILP can be extracted from a BB

Loop-Level Parallelism
Go beyond BBs
Extract parallelism among different iterations of a loop
Example:
  for (i = 1; i <= 1000; i++)
      x[i] = x[i] + y[i];
Every iteration of this loop can overlap with the others
Loop Unrolling:
  Technique to convert loop-level parallelism to ILP
  x[1] = x[1] + y[1];
  x[2] = x[2] + y[2];
  x[3] = x[3] + y[3];
  x[4] = x[4] + y[4];
  .
  .
  x[1000] = x[1000] + y[1000];
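
The unrolling idea can be tried out in a few lines of Python. This is only an illustrative sketch: the function names and the unroll factor of 4 are assumptions made here, not part of the lecture, and Python itself gains nothing from unrolling. The point is the shape of the transformed loop, including the clean-up loop needed when the trip count is not a multiple of the unroll factor.

```python
def add_arrays_rolled(x, y):
    """Reference loop: one element is updated per iteration."""
    out = list(x)
    for i in range(len(out)):
        out[i] = out[i] + y[i]
    return out


def add_arrays_unrolled4(x, y):
    """Same computation with the loop body replicated 4 times."""
    out = list(x)
    n = len(out)
    i = 0
    # Main loop: four independent updates per trip, one counter update.
    while i + 4 <= n:
        out[i] = out[i] + y[i]
        out[i + 1] = out[i + 1] + y[i + 1]
        out[i + 2] = out[i + 2] + y[i + 2]
        out[i + 3] = out[i + 3] + y[i + 3]
        i += 4
    # Clean-up loop for the 0 to 3 elements left over.
    while i < n:
        out[i] = out[i] + y[i]
        i += 1
    return out
```

The four statements in the main loop touch different elements, so none depends on another; that independence is what a scheduler can exploit.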

Static Scheduling: Example

Inst producing result    Inst using result    Latency in cc
FP ALU op                Another FP ALU op    3
FP ALU op                Store Double         2
Load Double              FP ALU op            1
Load Double              Store Double         0

Standard 5-stage pipeline, branches have 1 cycle delay, 1 instruction issued per cc,
no structural hazard

for (i = 1000; i > 1; i--)
    x[i] = x[i] + s;

Loop:  L.D    F0, 0(R1)       # F0 = array element
       ADD.D  F4, F0, F2      # s in F2 => x[i] + s
       S.D    F4, 0(R1)       # store result
       DADDI  R1, R1, -8      # decrement pointer
       BNE    R1, R2, Loop    # R2 contains the base address

Static Scheduling: Example

Unscheduled implementation = 9 cc

Loop:  L.D    F0, 0(R1)       # useful work
       stall
       ADD.D  F4, F0, F2      # useful work
       stall
       stall
       S.D    F4, 0(R1)       # useful work
       DADDI  R1, R1, -8
       stall
       BNE    R1, R2, Loop

Rest is the loop overhead

Scheduled implementation = 7 cc

Loop:  L.D    F0, 0(R1)
       DADDI  R1, R1, -8
       ADD.D  F4, F0, F2
       stall
       stall
       S.D    F4, 8(R1)
       BNE    R1, R2, Loop
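
The 9 cc and 7 cc counts can be reproduced with a small stall counter. The sketch below is written in Python purely for illustration and rests on several assumptions of its own: the instruction encoding as `(op, dest, sources)` tuples, the names `OP_CLASS`, `LATENCY` and `count_cycles`, the 1-cycle DADDI-to-BNE latency (read off the DADDI / stall / BNE pattern in the slides), and a latency of 0 for every producer/consumer pair not listed. It models only RAW dependences in a single-issue, in-order pipeline.

```python
# Instruction classes for the ops used in the slides.
OP_CLASS = {"L.D": "load", "ADD.D": "fpalu", "S.D": "store",
            "DADDI": "intalu", "BNE": "branch"}

# (producer class, consumer class) -> cycles that must separate the two.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("intalu", "branch"): 1,  # assumption: inferred from DADDI / stall / BNE
}


def count_cycles(program):
    """Return the cycle in which the last instruction issues (stalls included)."""
    producers = {}  # register -> (producer class, issue cycle)
    cycle = 0
    for op, dest, srcs in program:
        kind = OP_CLASS[op]
        issue = cycle + 1  # in-order, one instruction per cycle at best
        for reg in srcs:
            if reg in producers:
                p_kind, p_cycle = producers[reg]
                issue = max(issue, p_cycle + 1 + LATENCY.get((p_kind, kind), 0))
        cycle = issue
        if dest is not None:
            producers[dest] = (kind, cycle)
    return cycle


UNSCHEDULED = [
    ("L.D", "F0", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("S.D", None, ("F4", "R1")),
    ("DADDI", "R1", ("R1",)),
    ("BNE", None, ("R1", "R2")),
]

SCHEDULED = [
    ("L.D", "F0", ("R1",)),
    ("DADDI", "R1", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("S.D", None, ("F4", "R1")),
    ("BNE", None, ("R1", "R2")),
]

print(count_cycles(UNSCHEDULED), count_cycles(SCHEDULED))  # 9 7
```

Moving DADDI up fills the load delay slot and hides the DADDI-to-BNE stall, which is where the two saved cycles come from.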

Loop Unrolling
Scheme to reduce the loop overhead
Increases the no. of useful insts relative to overhead insts
Replicates the loop body multiple times and adjusts the loop termination code
Improves scheduling by reducing branches

Loop Unrolling: Example

Loop:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       L.D    F6, -8(R1)
       ADD.D  F8, F6, F2
       S.D    F8, -8(R1)
       L.D    F10, -16(R1)
       ADD.D  F12, F10, F2
       S.D    F12, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F16, F14, F2
       S.D    F16, -24(R1)
       DADDI  R1, R1, -32
       BNE    R1, R2, Loop

With stalls:

Loop:  L.D    F0, 0(R1)
       stall
       ADD.D  F4, F0, F2
       stall
       stall
       S.D    F4, 0(R1)
       L.D    F6, -8(R1)
       stall
       ADD.D  F8, F6, F2
       stall
       stall
       S.D    F8, -8(R1)
       L.D    F10, -16(R1)
       stall
       ADD.D  F12, F10, F2
       stall
       stall
       S.D    F12, -16(R1)
       L.D    F14, -24(R1)
       stall
       ADD.D  F16, F14, F2
       stall
       stall
       S.D    F16, -24(R1)
       DADDI  R1, R1, -32
       stall
       BNE    R1, R2, Loop

27 cc, 27/4 = 6.75 cc/iteration

Static Scheduling + Loop Unrolling

Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F4, F0, F2
       ADD.D  F8, F6, F2
       ADD.D  F12, F10, F2
       ADD.D  F16, F14, F2
       S.D    F4, 0(R1)
       S.D    F8, -8(R1)
       DADDI  R1, R1, -32
       S.D    F12, 16(R1)
       S.D    F16, 8(R1)
       BNE    R1, R2, Loop

14 cc, 14/4 = 3.5 cc/iteration

Summary

No. of cc/iteration    Static Scheduling    Loop Unrolling
7                      Yes                  No
6.75                   No                   Yes
3.5                    Yes                  Yes

Loop Unrolling
Advantage
  Increased performance thanks to overhead reduction
Disadvantage
  Increased code size
  Register pressure
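
The per-iteration figures follow from simple arithmetic: the unrolled loop costs (unroll factor x cycles per body copy) + overhead cycles, divided by the unroll factor. The Python helper below is a sketch with an assumed name (`cc_per_iteration`) and assumed parameter names. It covers the cases where every copy of the body costs the same; the scheduled but not unrolled loop (7 cc) does not fit this form because its two stalls sit between body and overhead instructions.

```python
def cc_per_iteration(body_cc, overhead_cc, unroll):
    """Cycles per original loop iteration after unrolling `unroll` times.

    body_cc     -- cycles for one copy of the body (stalls included)
    overhead_cc -- cycles for the pointer update and branch (stalls included)
    """
    total_cc = unroll * body_cc + overhead_cc
    return total_cc / unroll


# Unscheduled: body = L.D, stall, ADD.D, stall, stall, S.D (6 cc),
# overhead = DADDI, stall, BNE (3 cc).
print(cc_per_iteration(6, 3, 1))  # 9.0
print(cc_per_iteration(6, 3, 4))  # 6.75

# Scheduled and unrolled, no stalls: body = 3 cc, overhead = 2 cc.
print(cc_per_iteration(3, 2, 4))  # 3.5
```

As the unroll factor grows, the result approaches `body_cc`, which is why unrolling alone cannot beat 6 cc/iteration here while unrolling plus scheduling approaches 3.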

How Many Times the Loop can be Unrolled!
@Minimum:
  Depends on the no. of stalls in the original/scheduled code
  Try to remove as many stalls as possible (target is zero stalls)
@Maximum:
  Depends on the available no. of registers
  You can use only even FP registers
  You can't reuse the same register again while unrolling

Minimum Unrolling

Loop:  L.D    F0, 0(R1)
       stall
       ADD.D  F4, F0, F2
       stall
       stall
       S.D    F4, 0(R1)
       DADDI  R1, R1, -8
       stall
       BNE    R1, R2, Loop

Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       ADD.D  F4, F0, F2
       ADD.D  F8, F6, F2
       DADDI  R1, R1, -16
       S.D    F4, 16(R1)
       S.D    F8, 8(R1)
       BNE    R1, R2, Loop

8 cc, 8/2 = 4 cc/iteration

Maximum Unrolling

Loop:  L.D    F0, 0(R1)
       stall
       ADD.D  F4, F0, F2
       stall
       stall
       S.D    F4, 0(R1)
       DADDI  R1, R1, -8
       stall
       BNE    R1, R2, Loop

Let's assume we have 26 FP registers (F0 to F25), only even registers can be
used, and we cannot reuse the same register again for unrolling

Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       L.D    F18, -32(R1)
       L.D    F22, -40(R1)
       ADD.D  F4, F0, F2
       ADD.D  F8, F6, F2
       ADD.D  F12, F10, F2
       ADD.D  F16, F14, F2
       ADD.D  F20, F18, F2
       ADD.D  F24, F22, F2
       S.D    F4, 0(R1)
       S.D    F8, -8(R1)
       DADDI  R1, R1, -48
       S.D    F12, 32(R1)
       S.D    F16, 24(R1)
       S.D    F20, 16(R1)
       S.D    F24, 8(R1)
       BNE    R1, R2, Loop

20 cc, 20/6 = 3.33 cc/iteration
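
The limit of 6 follows from counting registers: F0 to F25 contains 13 even registers, one of them (F2) holds s, and each unrolled copy needs two fresh registers (one for the load, one for the add). The Python sketch below does that count; the function name `max_unroll_factor` and its parameters are assumptions introduced for illustration.

```python
def max_unroll_factor(num_fp_regs, regs_per_copy=2, reserved=1, even_only=True):
    """Largest unroll factor allowed by the FP register budget.

    num_fp_regs   -- registers F0 .. F(num_fp_regs - 1)
    regs_per_copy -- fresh registers needed by each copy of the loop body
    reserved      -- registers held for loop-invariant values (here: s in F2)
    even_only     -- True if only even-numbered registers may be used
    """
    usable = (num_fp_regs + 1) // 2 if even_only else num_fp_regs
    return (usable - reserved) // regs_per_copy


print(max_unroll_factor(26))  # 6
```

With the constraint "no register reuse" lifted, the bound would no longer come from this count, so the helper applies only under the rules stated on the slide.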

How to do Static Scheduling + Loop Unrolling?

Checklist:
How many stalls are there in the original unscheduled code?
Are there independent insts in the loop to cover the stalls by reordering?
  If yes, do it in an optimized fashion
  How many stalls are left?
How many times to unroll to cover the remaining stalls?
What would be the value of the final loop counter after unrolling?
  8 x no. of loop iterations unrolled
Was the counter incrementing or decrementing?
How to adjust the displacement values (-ve or +ve)?
Is there any restriction on the no. of registers to be used?
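
The displacement adjustment in the checklist is mechanical. Copy k of the body accesses the element at k x stride from the original pointer, and any store that is moved below the pointer update must add back the total step. The Python sketch below computes these values; the function name `unrolled_displacements` and its parameters are assumptions made for illustration, with a stride of -8 matching the decrementing loop over doubles used in the lecture.

```python
def unrolled_displacements(unroll, stride=-8, stores_after_update=0):
    """Return (load_offsets, store_offsets, pointer_step) for an unrolled loop.

    unroll              -- number of copies of the loop body
    stride              -- bytes between consecutive elements (negative: decrementing)
    stores_after_update -- how many of the last stores are placed after DADDI
    """
    step = stride * unroll
    loads = [k * stride for k in range(unroll)]
    stores = []
    for k in range(unroll):
        offset = k * stride
        # A store placed after the pointer update sees the new R1,
        # so the update is subtracted back out of its displacement.
        if k >= unroll - stores_after_update:
            offset -= step
        stores.append(offset)
    return loads, stores, step


# 4 copies, last two stores after DADDI R1, R1, -32:
print(unrolled_displacements(4, -8, 2))
# ([0, -8, -16, -24], [0, -8, 16, 8], -32)
```

The same call with `unroll=2, stores_after_update=2` gives store displacements 16 and 8, and with `unroll=6, stores_after_update=4` gives 0, -8, 32, 24, 16, 8.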

Loop Unrolling: Example 2 (minimum)

Instruction Producing Result    Instruction Using Result               Latency in Clock Cycles
FP MUL                          Another FP ALU op
FP ADD                          Another FP ALU op or Store Double      4
Load Double                     FP ALU op                              1
Load Double                     Store Double

Loop:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       DADDI  R1, R1, -8
       BNE    R1, R2, Loop

Loop:  L.D    F0, 0(R1)
       stall
       ADD.D  F4, F0, F2
       stall
       stall
       stall
       stall
       S.D    F4, 0(R1)
       DADDI  R1, R1, -8
       stall
       BNE    R1, R2, Loop

11 cc/iteration

Loop Unrolling: Example 2 (minimum)

Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F4, F0, F2
       ADD.D  F8, F6, F2
       ADD.D  F12, F10, F2
       ADD.D  F16, F14, F2
       DADDI  R1, R1, -32
       S.D    F4, 32(R1)
       S.D    F8, 24(R1)
       S.D    F12, 16(R1)
       S.D    F16, 8(R1)
       BNE    R1, R2, Loop

14 cc / 4 iterations = 3.5 cc/iteration
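
Why 4 copies is the minimum here, while 2 sufficed in the earlier example, can be worked out from the schedule layout. The Python sketch below assumes the specific layout used on these slides (all loads, then all adds, then DADDI, then all stores, then BNE); the function name `min_unroll_factor` and its parameters are illustrative assumptions. With k copies in that layout, each load is separated from its add by k - 1 instructions and each add from its store by k instructions, so k must satisfy both latencies.

```python
def min_unroll_factor(load_to_alu, alu_to_store):
    """Smallest unroll factor giving zero stalls in the grouped layout
    loads x k, adds x k, DADDI, stores x k, BNE.

    load_to_alu  -- latency (cc) between a load and the FP op using it
    alu_to_store -- latency (cc) between an FP op and the store using it
    """
    # k - 1 instructions sit between a load and its add: k - 1 >= load_to_alu.
    # k instructions sit between an add and its store:   k >= alu_to_store.
    return max(load_to_alu + 1, alu_to_store, 1)


print(min_unroll_factor(1, 2))  # 2  (first example: 8 cc for 2 iterations)
print(min_unroll_factor(1, 4))  # 4  (Example 2: 14 cc for 4 iterations)
```

A different instruction order could change these bounds, so the result holds only for the layout named above.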
