Capp 04
Dynamic Scheduling
Dynamic Scheduling by hardware
Enables handling some cases where dependencies are unknown at compile time (e.g. dependencies involving memory references)
Allows the processor to tolerate unpredictable delays
Allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline
Allows out-of-order execution and out-of-order completion
In-order: If an instruction is stalled, no later instructions can proceed.
DIVD F0, F2, F4
ADDD F10, F0, F8     (stall: needs F0 from DIVD)
SUBD F12, F8, F14    (has to wait, although it does not depend on DIVD or ADDD)
Dynamic Scheduling
In the classical pipeline, instruction issue and execution are both in-order.
Both structural and data hazards are checked during the ID stage.
If there are no hazards, the instruction is issued from ID for execution.
To allow out-of-order execution, the ID stage is split into 2 stages:
Issue: decode the instruction and check for structural hazards (SH).
Read Operands: wait until there are no data hazards (DH), then read the operands.
Dynamic Scheduling
Here issue is still in-order, but instructions may bypass each other in the read operands stage and thus enter the EXE stage out-of-order.
Out-of-order execution introduces the possibility of WAW and WAR hazards.
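The three hazard types can be made concrete with a small sketch. This is only an illustration: the `(dest, src1, src2)` tuple encoding and the name `classify_hazards` are assumptions made for this example, not part of the lecture.

```python
def classify_hazards(earlier, later):
    """Return the hazards that `later` has with respect to `earlier`.

    Each instruction is a (dest, src1, src2) tuple of register names
    (a hypothetical encoding used only for this sketch).
    """
    e_dest, *e_srcs = earlier
    l_dest, *l_srcs = later
    hazards = set()
    if e_dest in l_srcs:      # later reads what earlier writes
        hazards.add("RAW")
    if l_dest in e_srcs:      # later overwrites what earlier still has to read
        hazards.add("WAR")
    if l_dest == e_dest:      # both write the same register
        hazards.add("WAW")
    return hazards


divd = ("F0", "F2", "F4")     # DIVD F0, F2, F4
addd = ("F10", "F0", "F8")    # ADDD F10, F0, F8
subd = ("F12", "F8", "F14")   # SUBD F12, F8, F14

print(classify_hazards(divd, addd))  # {'RAW'}
print(classify_hazards(divd, subd))  # set()
```

SUBD has no hazard with DIVD, which is why an out-of-order machine may let it proceed while ADDD waits for F0.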
Dynamic Scheduling
Scoreboards allow an instruction to execute whenever conditions 1 and 2 hold (no structural hazard and no data hazard), without waiting for prior instructions.
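[Figure: scoreboard pipeline. After IF and ISSUE, each functional unit has its own path of RO (read operands), execution stages EX1 ... EXn, and WB, with a WAR check before write-back.]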
2. Read operands (RO): Wait until there are no data hazards (no earlier active instruction will write a source operand), then read the operands (no RAW hazards). No forwarding is supported.
3. Execution (EX): The functional unit starts execution upon receiving its operands. When the result is ready, it notifies the scoreboard.
4. Write result (WB): Stall until there are no WAR hazards with previous instructions.
Register result status: Indicates which functional unit will write to each register (Result). Blank when no pending instructions will write that register.
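Scoreboard pipeline control (for each instruction status: wait until, then bookkeeping)
Issue
  Wait until: not Busy(FU) and not Result(D)   (no structural hazard, no WAW hazard)
  Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU
Read operands
  Wait until: Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No; Qj ← 0; Qk ← 0
Execution complete
  Wait until: functional unit done
Write result
  Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))   (no WAR hazard)
  Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes); ∀f (if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No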
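The wait conditions and bookkeeping rules can be sketched as a toy Python model. This is a minimal sketch, not a cycle-accurate simulator: the class name `Scoreboard`, its method names, and the dictionary layout are assumptions chosen for illustration, and `None` stands in for a blank Qj/Qk entry.

```python
class Scoreboard:
    """Toy model of the scoreboard tables (illustrative, not cycle-accurate)."""

    def __init__(self, units):
        self.fu = {name: {"busy": False} for name in units}
        self.result = {}  # register result status: register -> unit that will write it

    def can_issue(self, unit, dest):
        # Wait until: not Busy(FU) and not Result(D)
        return not self.fu[unit]["busy"] and dest not in self.result

    def issue(self, unit, op, dest, s1, s2):
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[unit] = {"busy": True, "op": op, "fi": dest, "fj": s1, "fk": s2,
                         "qj": qj, "qk": qk, "rj": qj is None, "rk": qk is None}
        self.result[dest] = unit

    def can_read_operands(self, unit):
        # Wait until: Rj and Rk
        return self.fu[unit]["rj"] and self.fu[unit]["rk"]

    def read_operands(self, unit):
        self.fu[unit].update(rj=False, rk=False, qj=None, qk=None)

    def can_write_result(self, unit):
        # Wait until no other busy unit still has to read the old value of Fi (WAR).
        fi = self.fu[unit]["fi"]
        return all(not ((f["fj"] == fi and f["rj"]) or (f["fk"] == fi and f["rk"]))
                   for name, f in self.fu.items() if f["busy"] and name != unit)

    def write_result(self, unit):
        for f in self.fu.values():
            if f["busy"] and f["qj"] == unit:
                f["rj"] = True
            if f["busy"] and f["qk"] == unit:
                f["rk"] = True
        del self.result[self.fu[unit]["fi"]]
        self.fu[unit] = {"busy": False}


sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
sb.issue("Mult1", "MULTD", "F0", "F2", "F4")
sb.issue("Divide", "DIVD", "F10", "F0", "F6")
sb.issue("Add", "ADDD", "F6", "F8", "F2")
sb.read_operands("Add")
war_stall = not sb.can_write_result("Add")   # True: DIVD has not read F6 yet
```

In this state ADDD has its operands but may not write F6, because DIVD is still waiting for F0 and has not yet read the old F6. Once MULTD writes F0 and DIVD reads its operands, `can_write_result("Add")` becomes true.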
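Scoreboard example: initial state
Functional unit status columns: Time, Name, Busy, Op, Fi (dest), Fj (S1), Fk (S2), Qj (FU for j), Qk (FU for k), Rj (Fj ready?), Rk (Fk ready?)
Functional units: Integer, Mult1, Mult2, Add, Divide all idle (Busy = No)
Register result status (FU for F0, F2, ..., F30): no pending writers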
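Scoreboard snapshot, clock 1
Instructions (op dest, j, k): LD F6, 34+R2; LD F2, 45+R3; MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/-/-/-
Functional unit status: Integer busy (Op=Load, Fi=F6, Fk=R2, Rk=Yes); Mult1, Mult2, Add, Divide idle
Register result status: F6=Integer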
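Scoreboard snapshot, clock 2
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/-/-
Functional unit status: Integer busy (Op=Load, Fi=F6, Fk=R2, Rk=Yes); Mult1, Mult2, Add, Divide idle
Register result status: F6=Integer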
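Scoreboard snapshot, clock 3
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/-
Note: issue is in-order, so MULTD cannot issue ahead of the second LD, which is waiting for the Integer unit.
Functional unit status: Integer busy (Op=Load, Fi=F6, Fk=R2, Rk=Yes); Mult1, Mult2, Add, Divide idle
Register result status: F6=Integer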
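Scoreboard snapshot, clock 4
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4
Functional unit status: Integer busy (Op=Load, Fi=F6, Fk=R2, Rk=Yes); Mult1, Mult2, Add, Divide idle
Register result status: F6=Integer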
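Scoreboard snapshot, clock 5
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/-/-/-
Functional unit status: Integer busy (Op=Load, Fi=F2, Fk=R3, Rk=Yes); Mult1, Mult2, Add, Divide idle
Register result status: F2=Integer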
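Scoreboard snapshot, clock 6
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/-/-; MULTD 6/-/-/-
Functional unit status: Integer busy (Op=Load, Fi=F2, Fk=R3, Rk=Yes); Mult1 busy (Op=Mult, Fi=F0, Fj=F2, Fk=F4, Qj=Integer, Rj=No, Rk=Yes); Mult2, Add, Divide idle
Register result status: F0=Mult1, F2=Integer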
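Scoreboard snapshot, clock 7
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/-; MULTD 6/-/-/-; SUBD 7/-/-/-
Functional unit status: Integer busy (Op=Load, Fi=F2, Fk=R3, Rk=Yes); Mult1 busy (Op=Mult, Fi=F0, Fj=F2, Fk=F4, Qj=Integer, Rj=No, Rk=Yes); Add busy (Op=Sub, Fi=F8, Fj=F6, Fk=F2, Qk=Integer, Rj=Yes, Rk=No); Mult2, Divide idle
Register result status: F0=Mult1, F2=Integer, F8=Add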
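Scoreboard snapshot, clock 8 (DIVD issues)
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/-; MULTD 6/-/-/-; SUBD 7/-/-/-; DIVD 8/-/-/-
Functional unit status: Integer busy (Op=Load, Fi=F2, Fk=R3, Rk=Yes); Mult1 busy (Op=Mult, Fi=F0, Fj=F2, Fk=F4, Qj=Integer, Rj=No, Rk=Yes); Add busy (Op=Sub, Fi=F8, Fj=F6, Fk=F2, Qk=Integer, Rj=Yes, Rk=No); Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes); Mult2 idle
Register result status: F0=Mult1, F2=Integer, F8=Add, F10=Divide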
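Scoreboard snapshot, clock 8 (second LD writes its result)
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/-/-/-; SUBD 7/-/-/-; DIVD 8/-/-/-
Functional unit status: Mult1 busy (Op=Mult, Fi=F0, Fj=F2, Fk=F4, Rj=Yes, Rk=Yes); Add busy (Op=Sub, Fi=F8, Fj=F6, Fk=F2, Rj=Yes, Rk=Yes); Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes); Integer, Mult2 idle
Register result status: F0=Mult1, F8=Add, F10=Divide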
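Scoreboard snapshot, clock 9
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/-/-; DIVD 8/-/-/-; ADDD ?
Note: ADDD cannot issue because the Add unit is still busy with SUBD (structural hazard).
Functional unit status: Mult1 busy (Op=Mult, Fi=F0, Fj=F2, Fk=F4); Add busy (Op=Sub, Fi=F8, Fj=F6, Fk=F2); Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes); Integer, Mult2 idle
Register result status: F0=Mult1, F8=Add, F10=Divide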
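Scoreboard snapshot, clock 11
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/-; DIVD 8/-/-/-
Functional unit status: Mult1 busy (Op=Mult, Fi=F0, Fj=F2, Fk=F4); Add busy (Op=Sub, Fi=F8, Fj=F6, Fk=F2); Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1); Integer, Mult2 idle
Register result status: F0=Mult1, F8=Add, F10=Divide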
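Scoreboard snapshot, clock 12
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-
Functional unit status: Mult1 busy (Op=Mult, Fi=F0, Fj=F2, Fk=F4); Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1, Rj=No, Rk=Yes); Integer, Mult2, Add idle
Register result status: F0=Mult1, F10=Divide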
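Scoreboard snapshot, clock 13
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/-/-/-
Functional unit status: Mult1 busy (Op=Mult, Fi=F0, Fj=F2, Fk=F4); Add busy (Op=Add, Fi=F6, Fj=F8, Fk=F2, Rj=Yes, Rk=Yes); Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1); Integer, Mult2 idle
Register result status: F0=Mult1, F6=Add, F10=Divide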
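Scoreboard snapshot, clock 17
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/-/-; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/14/16/?
Note: ADDD has finished execution but cannot write F6, because DIVD has not yet read F6 (WAR hazard).
Functional unit status: Mult1 busy (Op=Mult, Fi=F0, Fj=F2, Fk=F4); Add busy (Op=Add, Fi=F6, Fj=F8, Fk=F2); Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6, Qj=Mult1); Integer, Mult2 idle
Register result status: F0=Mult1, F6=Add, F10=Divide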
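Scoreboard snapshot, clock 20
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/-/-/-; ADDD 13/14/16/-
Functional unit status: Add busy (Op=Add, Fi=F6, Fj=F8, Fk=F2); Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6, Rj=Yes, Rk=Yes); Integer, Mult1, Mult2 idle
Register result status: F6=Add, F10=Divide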
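Scoreboard snapshot, clock 21
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/21/-/-; ADDD 13/14/16/-
Functional unit status: Add busy (Op=Add, Fi=F6, Fj=F8, Fk=F2); Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6); Integer, Mult1, Mult2 idle
Register result status: F6=Add, F10=Divide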
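Scoreboard snapshot, clock 22
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/21/-/-; ADDD 13/14/16/22
Functional unit status: Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6); Integer, Mult1, Mult2, Add idle
Register result status: F10=Divide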
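Scoreboard snapshot, clock 61
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/21/61/-; ADDD 13/14/16/22
Functional unit status: Divide busy (Op=Div, Fi=F10, Fj=F0, Fk=F6); Integer, Mult1, Mult2, Add idle
Register result status: F10=Divide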
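Scoreboard snapshot, clock 62
Instruction status (Issue/Read operands/Execution complete/Write result): LD F6 1/2/3/4; LD F2 5/6/7/8; MULTD 6/9/19/20; SUBD 7/9/11/12; DIVD 8/21/61/62; ADDD 13/14/16/22
Functional unit status: Integer, Mult1, Mult2, Add, Divide all idle
Register result status: no pending writers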
Review: Scoreboard
Limitations of CDC6600 scoreboard
No forwarding
Limited to instructions in a basic block (small window)
Small number of functional units (structural hazards)
Stall on WAR hazards
Stall on WAW hazards
F0, F2, F4
F6, F0, F8
F6, 0(R1)
F8, F10, F14
F6, F10, F8
WAW: output dependence (a name dependence)
Tomasulo Algorithm
Designed for the IBM 360/91, about 3 years after the CDC 6600, by Robert Tomasulo
Goal: high performance without special compilers
Designed to overcome long memory access and floating point delays
RAW hazards are avoided by executing an instruction only when its operands are available
Tomasulo Algorithm
WAR and WAW hazards, which arise from name dependencies, are eliminated by register renaming.
Registers in instructions are replaced by values or by pointers to reservation stations.
The Common Data Bus (CDB) is used to bypass the registers and pass results from the reservation stations directly to the functional units.
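The effect of renaming can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function `rename`, the `(op, dest, src1, src2)` tuple encoding, and the tags `T0`, `T1`, ... (which stand in for reservation station names) are all hypothetical.

```python
def rename(program):
    """Give every result a fresh tag and point each source at its newest producer."""
    latest = {}      # architectural register -> tag of its most recent producer
    renamed = []
    for n, (op, dest, s1, s2) in enumerate(program):
        # Sources are looked up before the destination is renamed.
        srcs = tuple(latest.get(s, s) for s in (s1, s2))
        tag = f"T{n}"            # hypothetical tag for this instruction's result
        latest[dest] = tag
        renamed.append((op, tag) + srcs)
    return renamed


program = [
    ("LD", "F6", "34", "R2"),
    ("LD", "F2", "45", "R3"),
    ("MULTD", "F0", "F2", "F4"),
    ("SUBD", "F8", "F6", "F2"),
    ("DIVD", "F10", "F0", "F6"),
    ("ADDD", "F6", "F8", "F2"),
]
renamed = rename(program)
print(renamed[4])  # ('DIVD', 'T4', 'T2', 'T0')
print(renamed[5])  # ('ADDD', 'T5', 'T3', 'T1')
```

After renaming, DIVD reads `T0` (the first load's result) while ADDD writes `T5`, so the WAR hazard on F6 disappears. Only the true (RAW) dependences remain.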
Tomasulo's Organization
Normal data bus: data + destination (go to bus)
CDB: data + source (come from bus)
64 bits of data + 4 bits of functional unit source address
A unit writes the value if the source matches the functional unit it expects (the one producing its result)
The CDB broadcasts
Register file status Qi: indicates which functional unit will write each register, if one exists. Blank (0) when no pending instruction will write that register, meaning that the value is already available.
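The broadcast on the CDB can be sketched as follows. This is an illustrative model only: the function `broadcast` and the dictionary layout are assumptions, while the field names Vj, Vk, Qj, Qk and the station names follow the example tables. `None` stands for a blank Q field.

```python
def broadcast(tag, value, stations, reg_status, regs):
    """Deliver (source tag, value) from the CDB to every consumer waiting on that tag."""
    for rs in stations.values():
        if rs.get("Qj") == tag:
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == tag:
            rs["Vk"], rs["Qk"] = value, None
    for reg in [r for r, q in reg_status.items() if q == tag]:
        regs[reg] = value        # the register file also listens to the bus
        del reg_status[reg]      # no pending writer any more


# State just before Load2 delivers M(A2) in the example.
stations = {
    "Add1": {"Op": "SUBD", "Vj": "M(A1)", "Vk": None, "Qj": None, "Qk": "Load2"},
    "Mult1": {"Op": "MULTD", "Vj": None, "Vk": "R(F4)", "Qj": "Load2", "Qk": None},
}
reg_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
regs = {"F6": "M(A1)"}

broadcast("Load2", "M(A2)", stations, reg_status, regs)
print(stations["Add1"]["Vk"], stations["Mult1"]["Vj"], regs["F2"])  # M(A2) M(A2) M(A2)
```

One broadcast satisfies both waiting stations and the register file at once, which is why the bus carries the source tag rather than a destination.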
Tomasulo example: initial state
Instructions (op dest, j, k): LD F6, 34+R2; LD F2, 45+R3; MULTD F0, F2, F4; SUBD F8, F6, F2; DIVD F10, F0, F6; ADDD F6, F8, F2
Load buffers (Busy, Address): Load1, Load2, Load3 all free
Reservation stations (Time, Name, Busy, Op, Vj, Vk, Qj, Qk): Add1, Add2, Add3, Mult1, Mult2 all free
Register result status (F0, F2, ..., F30): no pending writers
Tomasulo example: first LD issued
Load buffers: Load1 busy (address 34+R2); Load2, Load3 free
Reservation stations: all free
Register result status: F6=Load1
Tomasulo example: second LD issued
Load buffers: Load1 busy (address 34+R2); Load2 busy (address 45+R3); Load3 free
Reservation stations: all free
Register result status: F2=Load2, F6=Load1
Tomasulo example: MULTD issued
Load buffers: Load1 busy (address 34+R2); Load2 busy (address 45+R3); Load3 free
Reservation stations: Mult1 busy (Op=MULTD, Vk=R(F4), Qj=Load2); Add1, Add2, Add3, Mult2 free
Register result status: F0=Mult1, F2=Load2, F6=Load1
Tomasulo example: first LD has written M(A1), SUBD issued
Load buffers: Load2 busy (address 45+R3); Load1, Load3 free
Reservation stations: Add1 busy (Op=SUBD, Vj=M(A1), Qk=Load2); Mult1 busy (Op=MULTD, Vk=R(F4), Qj=Load2); Add2, Add3, Mult2 free
Register result status: F0=Mult1, F2=Load2, F6=M(A1), F8=Add1
Tomasulo example: second LD has written M(A2), DIVD issued
Load buffers: all free
Reservation stations: Add1 busy (Time=2, Op=SUBD, Vj=M(A1), Vk=M(A2)); Mult1 busy (Time=10, Op=MULTD, Vj=M(A2), Vk=R(F4)); Mult2 busy (Op=DIVD, Vk=M(A1), Qj=Mult1); Add2, Add3 free
Register result status: F0=Mult1, F2=M(A2), F8=Add1, F10=Mult2
Tomasulo example: ADDD issued
Load buffers: all free
Reservation stations: Add1 busy (Time=1, Op=SUBD, Vj=M(A1), Vk=M(A2)); Add2 busy (Op=ADDD, Vk=M(A2), Qj=Add1); Mult1 busy (Time=9, Op=MULTD, Vj=M(A2), Vk=R(F4)); Mult2 busy (Op=DIVD, Vk=M(A1), Qj=Mult1); Add3 free
Register result status: F0=Mult1, F2=M(A2), F6=Add2, F8=Add1, F10=Mult2
Tomasulo example: SUBD finishes execution
Load buffers: all free
Reservation stations: Add1 busy (Time=0, Op=SUBD, Vj=M(A1), Vk=M(A2)); Add2 busy (Op=ADDD, Vk=M(A2), Qj=Add1); Mult1 busy (Time=8, Op=MULTD, Vj=M(A2), Vk=R(F4)); Mult2 busy (Op=DIVD, Vk=M(A1), Qj=Mult1); Add3 free
Register result status: F0=Mult1, F2=M(A2), F6=Add2, F8=Add1, F10=Mult2
Tomasulo example: SUBD has written (M-M), ADDD starts execution
Load buffers: all free
Reservation stations: Add2 busy (Time=2, Op=ADDD, Vj=(M-M), Vk=M(A2)); Mult1 busy (Time=7, Op=MULTD, Vj=M(A2), Vk=R(F4)); Mult2 busy (Op=DIVD, Vk=M(A1), Qj=Mult1); Add1, Add3 free
Register result status: F0=Mult1, F2=M(A2), F6=Add2, F10=Mult2
Tomasulo example: ADDD finishes execution
Load buffers: all free
Reservation stations: Add2 busy (Time=0, Op=ADDD, Vj=(M-M), Vk=M(A2)); Mult1 busy (Time=5, Op=MULTD, Vj=M(A2), Vk=R(F4)); Mult2 busy (Op=DIVD, Vk=M(A1), Qj=Mult1); Add1, Add3 free
Register result status: F0=Mult1, F2=M(A2), F6=Add2, F10=Mult2
Tomasulo example: ADDD has written its result
Load buffers: all free
Reservation stations: Mult1 busy (Time=4, Op=MULTD, Vj=M(A2), Vk=R(F4)); Mult2 busy (Op=DIVD, Vk=M(A1), Qj=Mult1); Add1, Add2, Add3 free
Register result status: F0=Mult1, F2=M(A2), F10=Mult2
Tomasulo example: MULTD finishes execution
Load buffers: all free
Reservation stations: Mult1 busy (Time=0, Op=MULTD, Vj=M(A2), Vk=R(F4)); Mult2 busy (Op=DIVD, Vk=M(A1), Qj=Mult1); Add1, Add2, Add3 free
Register result status: F0=Mult1, F2=M(A2), F10=Mult2
Tomasulo example: MULTD has written M*F4, DIVD starts execution
Load buffers: all free
Reservation stations: Mult2 busy (Time=40, Op=DIVD, Vj=M*F4, Vk=M(A1)); Add1, Add2, Add3, Mult1 free
Register result status: F0=M*F4, F2=M(A2), F10=Mult2
Tomasulo example: DIVD one cycle from completion
Load buffers: all free
Reservation stations: Mult2 busy (Time=1, Op=DIVD, Vj=M*F4, Vk=M(A1)); Add1, Add2, Add3, Mult1 free
Register result status: F0=M*F4, F2=M(A2), F10=Mult2
Tomasulo example: DIVD finishes execution
Load buffers: all free
Reservation stations: Mult2 busy (Time=0, Op=DIVD, Vj=M*F4, Vk=M(A1)); Add1, Add2, Add3, Mult1 free
Register result status: F0=M*F4, F2=M(A2), F10=Mult2
Tomasulo example: final state
Load buffers: all free
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all free
Register result status: F0=M*F4, F2=M(A2); no pending writers
Branch Prediction
Lec. 7
Easiest (static prediction)
Always taken, always not taken
Opcode based
Displacement based (forward not taken, backward taken)
Compiler directed (branch likely, branch not likely)
Next easiest
1-bit predictor: remember the last taken/not-taken outcome per branch
Use a branch-prediction buffer or branch-history table
Use part of the PC (low-order bits) to index the buffer/table
Example
Q: Assume a loop branch is taken nine times in a row, then not taken once. What is the prediction accuracy using a 1-bit predictor?
A: After the first pass through the loop, the predictor says not taken on re-entry, because the last time execution left the loop it stored a 0 in the predictor. So it is a misprediction, and the bit is now set to 1. Prediction then works correctly until the last iteration, which is predicted as taken. So there are 2 mispredictions in 10 loop iterations => 80% accuracy.
How about a 2-bit predictor? Let the prediction be changed only after it misses twice in a row.
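The 80% figure, and the answer for the 2-bit case, can be checked with a short simulation. This is a sketch under an assumption: the 2-bit predictor is modeled as a saturating counter, which is one common realization and not necessarily the exact state machine intended by the slide. The function name `accuracy` and the warm-up handling are illustrative choices.

```python
def accuracy(outcomes, bits, warmup=10):
    """Fraction of correct predictions for one branch using an n-bit saturating counter."""
    top = (1 << bits) - 1
    counter = 0                          # start in the strongest not-taken state
    correct = 0
    for i, taken in enumerate(outcomes):
        predicted_taken = counter > top // 2
        if i >= warmup and predicted_taken == taken:
            correct += 1
        # Saturating update: count up on taken, down on not taken.
        counter = min(counter + 1, top) if taken else max(counter - 1, 0)
    return correct / (len(outcomes) - warmup)


# Eleven runs of the loop branch (taken 9 times, then not taken); the first run is warm-up.
loop = ([True] * 9 + [False]) * 11

print(accuracy(loop, bits=1))  # 0.8
print(accuracy(loop, bits=2))  # 0.9
```

With one bit, the counter reduces to "remember the last outcome" and mispredicts both the loop exit and the next loop entry. With two bits, the single not-taken outcome only weakens the state, so only the loop exit is mispredicted.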
[Figure: the branch PC indexes the branch history table (BHT); the entry shown holds the state 01.]
Nov. 2, 2004
Can we do better? Correlating branch predictors also look at other branches for clues.
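if (aa==2) aa = 0;
if (bb==2) bb = 0;
if (aa!=bb) {
[Figure: a predictor entry with two prediction bits, one giving the prediction if the last branch was not taken (NT) and one giving the prediction if the last branch was taken (T).]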
Performance of Correlating Branch Prediction
With the same number of state bits, a (2,2) predictor performs better than a noncorrelating 2-bit predictor.
It even outperforms a 2-bit predictor with an infinite number of entries.
The global history register (GHR) is an m-bit shift register that records the outcomes of the last m branches encountered by the processor.
Usually both the PC address and the GHR are used (2-level).
[Figure: the PC and the m-bit GHR feed a combining function that selects among n-bit predictors.]
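A two-level predictor of this kind can be sketched in Python. This is a minimal sketch with several assumptions: the combining function is taken to be XOR of the PC bits with the GHR (one common choice; the slides do not name it), the counters are 2-bit saturating counters, instructions are 4 bytes wide, and the class name `TwoLevelPredictor` and its parameters are hypothetical.

```python
class TwoLevelPredictor:
    """m-bit global history combined with the PC to index a table of 2-bit counters."""

    def __init__(self, m=2, index_bits=4):
        self.m = m
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                               # outcomes of the last m branches
        self.table = [1] * (1 << index_bits)       # 2-bit counters, weakly not taken

    def _index(self, pc):
        # Assumed combining function: XOR of word-aligned PC bits with the GHR.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)


# A branch that strictly alternates taken / not taken.
p = TwoLevelPredictor()
misses = 0
for i in range(40):
    taken = (i % 2 == 0)
    if p.predict(0x40) != taken:
        misses += 1
    p.update(0x40, taken)
print(misses)  # 2
```

Both mispredictions occur during warm-up. After that the history distinguishes "last outcome was taken" from "last outcome was not taken", so the alternating pattern is predicted correctly, which a single 2-bit counter indexed by PC alone cannot do.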
If we predict the branch is taken and suppose it is correct, what is the target address?
Need a mechanism to provide target address as well
Can we eliminate the one cycle delay for the 5-stage pipeline?
Need to fetch from branch target immediately after branch
prediction is a taken direct branch (for not-taken branches the target is simply PC+4).
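A structure that supplies the target address at fetch time can be sketched as a small direct-mapped table. This is an illustration under assumptions: 4-byte instructions, a table that stores only taken branches, and the hypothetical names `BranchTargetBuffer`, `lookup`, and `update`.

```python
class BranchTargetBuffer:
    """Direct-mapped table from branch PC to predicted target (taken branches only)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = {}                 # index -> (tag PC, predicted target)

    def _index(self, pc):
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        """Return the address to fetch next, available in the same cycle as the fetch."""
        entry = self.table.get(self._index(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]             # hit: predicted taken, fetch from the target
        return pc + 4                   # miss: fall through

    def update(self, pc, taken, target):
        idx = self._index(pc)
        if taken:
            self.table[idx] = (pc, target)
        elif self.table.get(idx, (None, None))[0] == pc:
            del self.table[idx]         # branch was not taken: drop its entry


btb = BranchTargetBuffer()
first = btb.lookup(0x100)               # 0x104: nothing known yet
btb.update(0x100, taken=True, target=0x80)
second = btb.lookup(0x100)              # 0x80: target supplied at fetch time
```

Because the lookup uses only the fetch PC, the target is known before the instruction is even decoded, which is what removes the one-cycle branch delay when the prediction is correct.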