Instruction Level Parallelism (ILP)
(Static Scheduling, Loop Unrolling)
MC FP MIPS Pipeline vs. Scoreboard

                         MC FP MIPS Pipeline                  Scoreboard
Instruction              IF      ID      EX      MEM   WB     Write Result
L.D   F6, 34(R2)         1       2       3       4     5      4
L.D   F2, 45(R3)         2       3       4       5     6      8
MUL.D F0, F2, F4         3       4-5     6-15    16    17     20
SUB.D F8, F6, F2         4-5     6       7-8     9     10     12
DIV.D F10, F0, F6        6       7-15    16-55   56    57     62
ADD.D F6, F8, F2         7-15    16      17-18   19    20     22

Features                            MC FP MIPS           MIPS with Scoreboard
Data forwarding                     Yes                  No
Pipelined FU                        Yes                  No
Hazard detection                    RAW/WAW (WAR: NA)    RAW/WAW/WAR
Out-of-order execution start        No                   Yes
Facilitation to independent inst    No                   Yes
EE/CS520 Comp.Archi.
9/28/2012
Quiz 2: Tuesday, 02-10-2012
Lectures 5-8
Topics covered: Appendix A (A1, A2, A4, A5, A7), 4th Ed.
There will be problems and objective type questions (TF and/or MCQs)
Instruction Level Parallelism (ILP)
Basic idea:
    Execute several instructions in parallel
We already do pipelining
    But it can only push through at most 1 inst/cycle
    MC FP Pipeline / Scoreboard
We want multiple inst/cycle
It gets a bit complicated
    More transistors, more logic, more complexity
That's how we got from 486 (pipelined) to Pentium and beyond
Static Scheduling
To exploit pipelining efficiently, the pipeline should remain full all the time
We must use ILP by finding unrelated insts that can be overlapped in a pipeline
To avoid stalls, the dependent inst must be separated from the src inst by the latency of the src inst in clock cycles
If done by the compiler: static scheduling
If done by the hardware at run time: dynamic scheduling
    We just saw Scoreboarding
Parallelism in a Basic Block (BB)
BB: A code sequence with no branches in except at the entry and no branches out except at the exit
Example:
    result1 = b + c;
    result2 = d + e;
    result3 = result1 + result2;
    return (result3);
Typical MIPS program branch freq.: 15%-25%
    Length of BB = 3-6 insts
    Mostly dependent on each other
Not much ILP can be extracted from a BB
Loop Level Parallelism
Go beyond BBs
Extract parallelism among different iterations of a loop
Example
    for (i=1; i<=1000; i++)
        x[i] = x[i] + y[i];
Every iteration of this loop can overlap with others
Loop Unrolling:
    Technique to convert loop level parallelism to ILP
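
To make the idea concrete, here is a minimal source-level sketch in Python of unrolling the x[i] = x[i] + y[i] loop by a factor of 4. The function names and the cleanup loop for leftover elements are illustrative assumptions, not part of the lecture material.

```python
def add_arrays_rolled(x, y):
    """Original loop: one element update per loop test."""
    for i in range(len(x)):
        x[i] = x[i] + y[i]
    return x


def add_arrays_unrolled4(x, y):
    """Same computation with the loop body replicated 4 times."""
    n = len(x)
    limit = n - n % 4          # largest multiple of 4 not exceeding n
    i = 0
    while i < limit:
        # Four independent updates: none reads a value written by another,
        # so a scheduler is free to overlap them.
        x[i] = x[i] + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
        i += 4                 # one counter update and one test per 4 elements
    for j in range(limit, n):  # cleanup loop for the leftover elements
        x[j] = x[j] + y[j]
    return x
```

The unrolled version performs the same element updates but executes the counter update and loop test only once per four elements, which is the overhead reduction that unrolling targets.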
Static Scheduling: Example

Inst producing result    Inst using result     Latency in cc
FP ALU op                Another FP ALU op     3
FP ALU op                Store Double          2
Load Double              FP ALU op             1
Load Double              Store Double          0

Standard 5-stage pipeline, branches have 1 cycle delay, 1 instruction issued per cc, no structural hazard

for (i=1000; i>1; i--)
    x[i] = x[i] + s;

Loop:   L.D     F0,0(R1)       # F0 = array element
        ADD.D   F4,F0,F2       # s in F2 => x[i] + s
        S.D     F4,0(R1)       # store result
        DADDI   R1,R1,-8       # decrement pointer
        BNE     R1,R2,Loop     # R2 contains the base address
Static Scheduling: Example

Unscheduled implementation = 9 cc

Loop:   L.D     F0,0(R1)
        stall
        ADD.D   F4,F0,F2
        stall
        stall
        S.D     F4,0(R1)
        DADDI   R1,R1,-8
        stall
        BNE     R1,R2,Loop

Useful work: L.D, ADD.D, S.D
Rest is the loop overhead

Scheduled implementation = 7 cc

Loop:   L.D     F0,0(R1)
        DADDI   R1,R1,-8
        ADD.D   F4,F0,F2
        stall
        stall
        S.D     F4,8(R1)
        BNE     R1,R2,Loop
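
The 9 cc and 7 cc counts can be checked with a small stall-counting model. The sketch below is a simplification under stated assumptions: it is a single-issue, in-order model that tracks only RAW dependences through registers, it uses a Load Double to FP ALU op latency of 1 cc and an FP ALU op to Store Double latency of 2 cc, and it assumes 1 cc between DADDI and the dependent BNE (read off the stall in the listing). All names in the code are hypothetical.

```python
# Latency (in stall cycles) between a producer kind and a consumer kind.
# Pairs not listed are assumed to need no stall.
LATENCY = {
    ("load", "fpalu"): 1,
    ("fpalu", "store"): 2,
    ("int", "branch"): 1,   # assumption taken from the DADDI -> BNE stall
}

KIND = {"L.D": "load", "ADD.D": "fpalu", "S.D": "store",
        "DADDI": "int", "BNE": "branch"}


def count_cycles(program):
    """Return the cycle in which the last instruction issues."""
    producers = {}   # register -> (producer kind, producer issue cycle)
    cycle = 0
    for op, dest, srcs in program:
        kind = KIND[op]
        issue = cycle + 1                       # in-order, one issue per cycle
        for reg in srcs:
            if reg in producers:
                p_kind, p_cycle = producers[reg]
                stalls = LATENCY.get((p_kind, kind), 0)
                issue = max(issue, p_cycle + stalls + 1)
        cycle = issue
        if dest is not None:
            producers[dest] = (kind, issue)
    return cycle


# Each entry: (opcode, destination register or None, source registers)
UNSCHEDULED = [
    ("L.D", "F0", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("S.D", None, ["F4", "R1"]),
    ("DADDI", "R1", ["R1"]),
    ("BNE", None, ["R1", "R2"]),
]

SCHEDULED = [
    ("L.D", "F0", ["R1"]),
    ("DADDI", "R1", ["R1"]),
    ("ADD.D", "F4", ["F0", "F2"]),
    ("S.D", None, ["F4", "R1"]),
    ("BNE", None, ["R1", "R2"]),
]

print(count_cycles(UNSCHEDULED), count_cycles(SCHEDULED))
```

Moving DADDI into the load delay slot removes both the load stall and the DADDI/BNE stall, which is why the scheduled version saves two cycles while the two stalls before S.D remain.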
Loop Unrolling
Scheme to reduce the loop overhead
Increases no. of useful insts relative to overhead insts
Replicates loop body multiple times and adjusts loop termination code
Improves scheduling by reducing branches
Loop Unrolling: Example
Unscheduled implementation = 27 cc (loop body unrolled 4 times)

Loop:   L.D     F0,0(R1)
        stall
        ADD.D   F4,F0,F2
        stall
        stall
        S.D     F4,0(R1)
        L.D     F6,-8(R1)
        stall
        ADD.D   F8,F6,F2
        stall
        stall
        S.D     F8,-8(R1)
        L.D     F10,-16(R1)
        stall
        ADD.D   F12,F10,F2
        stall
        stall
        S.D     F12,-16(R1)
        L.D     F14,-24(R1)
        stall
        ADD.D   F16,F14,F2
        stall
        stall
        S.D     F16,-24(R1)
        DADDI   R1,R1,-32
        stall
        BNE     R1,R2,Loop

27/4 = 6.75 cc/iteration
Static Scheduling + Loop Unrolling

Loop:   L.D     F0,0(R1)
        L.D     F6,-8(R1)
        L.D     F10,-16(R1)
        L.D     F14,-24(R1)
        ADD.D   F4,F0,F2
        ADD.D   F8,F6,F2
        ADD.D   F12,F10,F2
        ADD.D   F16,F14,F2
        S.D     F4,0(R1)
        S.D     F8,-8(R1)
        DADDI   R1,R1,-32
        S.D     F12,16(R1)
        S.D     F16,8(R1)
        BNE     R1,R2,Loop

14 cc, 14/4 = 3.5 cc/iteration
Summary

No. of cc/iteration    Static Scheduling    Loop Unrolling
7                      X
6.75                                        X
3.5                    X                    X

Loop Unrolling
Advantage
    Increased performance thanks to overhead reduction
Disadvantage
    Increased code size
    Register pressure
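
The cycle counts in this lecture follow a simple pattern that can be written down as arithmetic. The sketch below is an illustration under the example's assumptions (3 useful instructions per iteration, DADDI and BNE as overhead, 3 stalls per unscheduled copy plus 1 stall before BNE); the function names are hypothetical.

```python
def unscheduled_cycles(copies):
    """Cycles for one pass of the loop unrolled `copies` times, no reordering.

    Each copy costs L.D, stall, ADD.D, stall, stall, S.D = 6 cycles,
    and the loop overhead costs DADDI, stall, BNE = 3 cycles.
    """
    return 6 * copies + 3


def scheduled_cycles(copies):
    """Cycles for one pass after static scheduling.

    With a single copy two stalls remain before S.D (7 cycles). With two or
    more copies the schedules shown in the slides hide every stall, leaving
    3 useful instructions per copy plus DADDI and BNE.
    """
    if copies == 1:
        return 7
    return 3 * copies + 2


for k in (1, 2, 4, 6):
    print(k, unscheduled_cycles(k) / k, scheduled_cycles(k) / k)
```

As the unroll factor grows, the scheduled cost per iteration approaches 3 cc, the cost of the useful work alone, while code size and register use keep increasing.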
How Many Times the Loop can be Unrolled?
@Minimum:
    Depends on no. of stalls in the original/scheduled code
    Try to remove as many stalls as possible (target is zero stalls)
@Maximum:
    Depends on available no. of registers
    You can use only even FP registers
    You can't reuse the same register again while unrolling
Minimum Unrolling

Unscheduled implementation = 9 cc

Loop:   L.D     F0,0(R1)
        stall
        ADD.D   F4,F0,F2
        stall
        stall
        S.D     F4,0(R1)
        DADDI   R1,R1,-8
        stall
        BNE     R1,R2,Loop

Unrolled twice and scheduled:

Loop:   L.D     F0,0(R1)
        L.D     F6,-8(R1)
        ADD.D   F4,F0,F2
        ADD.D   F8,F6,F2
        DADDI   R1,R1,-16
        S.D     F4,16(R1)
        S.D     F8,8(R1)
        BNE     R1,R2,Loop

8 cc, 8/2 = 4 cc/iteration
Maximum Unrolling

Unscheduled implementation = 9 cc

Loop:   L.D     F0,0(R1)
        stall
        ADD.D   F4,F0,F2
        stall
        stall
        S.D     F4,0(R1)
        DADDI   R1,R1,-8
        stall
        BNE     R1,R2,Loop

Let's assume we have 26 FP registers (F0 to F25), only even registers can be used, and we cannot reuse the same register again for unrolling

Loop:   L.D     F0,0(R1)
        L.D     F6,-8(R1)
        L.D     F10,-16(R1)
        L.D     F14,-24(R1)
        L.D     F18,-32(R1)
        L.D     F22,-40(R1)
        ADD.D   F4,F0,F2
        ADD.D   F8,F6,F2
        ADD.D   F12,F10,F2
        ADD.D   F16,F14,F2
        ADD.D   F20,F18,F2
        ADD.D   F24,F22,F2
        S.D     F4,0(R1)
        S.D     F8,-8(R1)
        DADDI   R1,R1,-48
        S.D     F12,32(R1)
        S.D     F16,24(R1)
        S.D     F20,16(R1)
        S.D     F24,8(R1)
        BNE     R1,R2,Loop

20 cc, 20/6 = 3.33 cc/iteration
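
The factor of 6 follows from the register budget. The sketch below works through that count under the slide's assumptions, plus two that are read off the listing rather than stated: one register (F2) is reserved for the scalar s, and each unrolled copy needs two fresh registers (one for the loaded value, one for the sum). The function and parameter names are hypothetical.

```python
def max_unroll_factor(num_fp_regs=26, regs_per_copy=2, reserved=1,
                      even_only=True):
    """Largest unroll factor that fits in the available FP registers."""
    if even_only:
        usable = len(range(0, num_fp_regs, 2))   # F0, F2, ..., even numbers only
    else:
        usable = num_fp_regs
    # Registers left after setting aside the reserved ones (here F2 for s),
    # divided among copies that each need `regs_per_copy` fresh registers.
    return (usable - reserved) // regs_per_copy


print(max_unroll_factor())
```

With F0 to F25 there are 13 even registers; removing F2 leaves 12, which is enough for 6 copies of the loop body.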
How to do Static Scheduling + Loop Unrolling?
Checklist:
How many stalls are there in the original unscheduled code?
Are there independent insts in the loop to cover the stalls by reordering?
    If yes, do it in optimized fashion
    How many stalls are left?
How many times to unroll to cover the remaining stalls?
What would be the value of the final loop counter after unrolling?
    8 x no. of loop iterations
Whether counter was incrementing or decrementing?
    How to adjust the displacement values (-ve or +ve)
Is there any restriction on no. of registers to be used?
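
The displacement adjustment in the checklist is mechanical once the pointer step is known. The following sketch assumes 8-byte elements, a pointer that moves by `step` bytes per original iteration (negative for a decrementing loop), and a schedule in which the single pointer update is placed before all the stores. The function name and return layout are hypothetical.

```python
def unrolled_displacements(copies, step):
    """Displacements for a loop unrolled `copies` times.

    Returns (load_disps, pointer_update, store_disps), where store_disps
    are the displacements to use when the stores come after the pointer
    update.
    """
    # Copy i accesses the element i steps away from the current pointer.
    load_disps = [i * step for i in range(copies)]
    # One combined pointer update replaces `copies` separate ones.
    pointer_update = copies * step
    # After the update the pointer has moved, so each store compensates.
    store_disps = [d - pointer_update for d in load_disps]
    return load_disps, pointer_update, store_disps


loads, update, stores = unrolled_displacements(4, -8)
print(loads, update, stores)
```

A store that is scheduled before the pointer update keeps the same displacement as its load; only stores placed after the update need the compensated value.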
Loop Unrolling: Example 2 (minimum)

Instruction Producing Result    Instruction Using Result              Latency in Clock Cycles
FP MUL                          Another FP ALU op
FP ADD                          Another FP ALU op or Store Double     4
Load Double                     FP ALU op                             1
Load Double                     Store Double

Loop:   L.D     F0,0(R1)
        ADD.D   F4,F0,F2
        S.D     F4,0(R1)
        DADDI   R1,R1,-8
        BNE     R1,R2,Loop

With stalls:

Loop:   L.D     F0,0(R1)
        Stall
        ADD.D   F4,F0,F2
        Stall
        Stall
        Stall
        Stall
        S.D     F4,0(R1)
        DADDI   R1,R1,-8
        Stall
        BNE     R1,R2,Loop

11 CC/Iteration
Loop Unrolling: Example 2 (minimum)

Loop:   L.D     F0,0(R1)
        L.D     F6,-8(R1)
        L.D     F10,-16(R1)
        L.D     F14,-24(R1)
        ADD.D   F4,F0,F2
        ADD.D   F8,F6,F2
        ADD.D   F12,F10,F2
        ADD.D   F16,F14,F2
        DADDI   R1,R1,-32
        S.D     F4,32(R1)
        S.D     F8,24(R1)
        S.D     F12,16(R1)
        S.D     F16,8(R1)
        BNE     R1,R2,Loop

14 CC / 4 Iterations = 3.5 CC/Iteration
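
The minimum unroll factor can be estimated from the latencies, given the schedule shape used in these examples: all loads, then all adds, then DADDI, then all stores, then BNE. The sketch below is an illustration under that assumed shape only; other orderings may give different answers, and the function name is hypothetical.

```python
def min_unroll_for_zero_stalls(load_to_alu, alu_to_store):
    """Smallest unroll factor k that hides all stalls for the schedule
    'k loads, k adds, DADDI, k stores, BNE'.

    Between L.D i and ADD.D i there are k - 1 other instructions,
    so k - 1 >= load_to_alu is required.
    Between ADD.D i and S.D i there are k - 1 adds or stores plus DADDI,
    that is k instructions, so k >= alu_to_store is required.
    """
    return max(load_to_alu + 1, alu_to_store, 1)


# First example: load -> FP ALU = 1 cc, FP ALU -> store = 2 cc
print(min_unroll_for_zero_stalls(1, 2))
# Example 2: load -> FP ALU = 1 cc, FP ADD -> store = 4 cc
print(min_unroll_for_zero_stalls(1, 4))
```

The results, 2 and 4, match the unroll factors used in the Minimum Unrolling listing and in Example 2.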