You are on page 1of 62

Lecture 6

Score Board Contd. And Tomasulo’s


Algorithm
Instructor: Laxmi Bhuyan

Nov. 2, 2004 Lec. 7 1


Three Parts of the Scoreboard
1. Instruction status—which of 4 steps the instruction is in
(Issue, Operand Read, EX, Write)
2. Functional unit status—Indicates the state of the functional unit (FU). 9 fields for each
functional unit
Busy—Indicates whether the unit is busy or not
Op—Operation to perform in the unit (e.g., + or –)
Fi—Destination register
Fj, Fk—Source-register numbers
Qj, Qk—Functional units producing source registers Fj, Fk
Rj, Rk—Flags indicating when Fj, Fk are ready and not yet read. Set to
No after operand are read.

3. Register result status—Indicates which functional unit will write each register, if one
exists. Blank when no pending instructions will write that register

Nov. 2, 2004 Lec. 7 2


Detailed Scoreboard Pipeline Control
Instruction
Wait until Bookkeeping
status
Busy(FU) yes; Op(FU) op;
Fi(FU) `D’; Fj(FU) `S1’;
Not busy (FU)
Issue Fk(FU) `S2’; Qj Result(‘S1’);
and not result(D)
Qk Result(`S2’); Rj not Qj;
WAW Rk not Qk; Result(‘D’) FU;
Read
Rj and Rk Rj No; Rk No
operands
Execution Functional unit
complete done
f((Fj( f )!=Fi(FU)
or Rj( f )=No) & f(if Qj(f)=FU then Rj(f) Yes);
Write
(Fk( f )!=Fi(FU) or f(if Qk(f)=FU then Rj(f) Yes);
result
Result(Fi(FU)) 0; Busy(FU) No
Rk( f )=No))

WAR A.55 on page A-76


Nov. 2, 2004 Lec. 7 3
Scoreboard Example
• The following numbers are to illustrate behavior, not
representative

• LD – 1 cycle
– (compute address + data cache access)
• ADDDs and SUBs are 2 cycles
• Multiply is 10 cycles
• Divide is 40 cycles

Nov. 2, 2004 Lec. 7 4


Scoreboard Example

Instruction status Read ExecutionWrite


Instruction j k Issue operandscompleteResult
LD F6 34+ R2
LD F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
FU
Nov. 2, 2004 Lec. 7 5
Scoreboard Example Cycle 1

Instruction status Read ExecutionWrite


Instruction j k Issue operands complete Result
LD F6 34+ R2 1
LD F2 45+ R3
MULTDF0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Integer
Nov. 2, 2004 Lec. 7 6
Scoreboard Example Cycle 2
Instruction status Read Execution Write
Instruction j k Issue operands
complete Result
LD F6 34+ R2 1 2 Note: Can’t issue I2
LD F2 45+ R3 because Integer unit
MULTD F0 F2 F4 is busy. Can’t issue
SUBD F8 F6 F2 next instruction due
DIVD F10 F0 F6 to in-order issue
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 No
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Integer
Nov. 2, 2004 Lec. 7 7
Scoreboard Example Cycle 3
Instruction status Read Execution Write
Instruction j k Issue operands
complete Result
LD F6 34+ R2 1 2 3
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 No
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Integer
Nov. 2, 2004 Lec. 7 8
Scoreboard Example Cycle 4
Instruction status Read Execution Write
Instruction j k Issue operands
complete Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3
MULTD F0 F2 F4
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F6 R2 No
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU
Nov. 2, 2004 Lec. 7 9
Scoreboard Example Cycle 5
Instruction status Read Execution Write
Instruction j k Issue operands
complete Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5
MULTD F0 F2 F4 Now I2 is issued
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 Yes
Mult1 No
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Integer
Nov. 2, 2004 Lec. 7 10
Scoreboard Example Cycle 6
Instruction status Read Execution Write
Instruction j k Issue operands
complete Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6
MULTD F0 F2 F4 6
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 No
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult Integer
Nov. 2, 2004 Lec. 7 11
Scoreboard Example Cycle 7
Instruction status Read Execution Write
Instruction j k Issue operands
complete Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 I3 stalled at read
MULTD F0 F2 F4 6 because I2 isn’t
SUBD F8 F6 F2 7 complete
DIVD F10 F0 F6
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer Yes Load F2 R3 No
Mult1 Yes Mult F0 F2 F4 Integer No Yes
Mult2 No
Add Yes Subd F8 F6 F2 Integer Yes No
Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult Integer Add
Nov. 2, 2004 Lec. 7 12
Scoreboard Example Cycle 8
Instruction status Read EX Write
Instruction j k Issue Op compl. Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6
SUBD F8 F6 F2 7
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for FU
j for kFj? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 Yes Mult F0 F2 F4 Yes Yes
Mult2 No
Add Yes Sub F8 F6 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 Nov. 2, 2004 FU Mult1 Lec. 7
Add Divide 13
Scoreboard Example Cycle 9
Instruction status Read EX Write
Instruction j k IssueOp complete
Result
LD F6 34+ R2 1 2 3 4 Note: I3 and I4 read
LD F2 45+ R3 5 6 7 8 operands because F2 is now
MULTD F0 F2 F4 6 9 available. ADDD (I6) can’t
SUBD F8 F6 F2 7 9 be issued because SUBD
DIVD F10 F0 F6 8 (I4) uses the adder
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
10 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
2 Add Yes Sub F8 F6 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 Add Divide

Nov. 2, 2004 Lec. 7 14


Scoreboard Example Cycle 11
Instruction status Read Execution
Write
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8 Note: Add takes 2 cycles,
MULTD F0 F2 F4 6 9 so nothing happens in
SUBD F8 F6 F2 7 9 11 cycle 10. MUL continues.
DIVDF10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
8 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
0 Add Yes Sub F8 F6 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 Add Divide

Nov. 2, 2004 Lec. 7 15


Scoreboard Example Cycle 12
Instruction status Read Execution
Write
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
7 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 Divide
Nov. 2, 2004 Lec. 7 16
Scoreboard Example Cycle 13
Instruction status Read Execution
Write
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
Now ADDD is issued
MULTD F0 F2 F4 6 9
because SUBD has
SUBD F8 F6 F2 7 9 11 12 completed
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13
Functional unit status dest S1 S2 FU for j FU for kFj? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
6 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add Yes Add F6 F8 F2 Yes Yes
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
13 FU Mult1 Add Divide
Nov. 2, 2004 Lec. 7 17
Scoreboard Example Cycle 14
Instruction status Read Execution
Write
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
5 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
2 Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
14 FU Mult1 Add Divide
Nov. 2, 2004 Lec. 7 18
Scoreboard Example Cycle 15
Instruction status Read Execution
Write
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 Note: ADDD takes 2
SUBD F8 F6 F2 7 9 11 12 cycles, so no change
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
4 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
1 Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 Add Divide
Nov. 2, 2004 Lec. 7 19
Scoreboard Example Cycle 16
Instruction status Read Execution
Write
Instruction j k Issue operands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9
SUBD F8 F6 F2 7 9 11 12 ADDD completes, but
DIVD F10 F0 F6 8 MULTD and DIVD go on
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
3 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
0 Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU Mult1 Add Divide
Nov. 2, 2004 Lec. 7 20
Scoreboard Example Cycle 17
Instruction status Read Execution
Write
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 ADDD stalls, can’t write back
due to WAR with DIVD.
SUBD F8 F6 F2 7 9 11 12 MULT and DIV continue
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
2 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
17 FU Mult1 Add Divide
Nov. 2, 2004 Lec. 7 21
Scoreboard Example Cycle 18
Instruction status Read Execution
Write
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 MULT and DIV
SUBD F8 F6 F2 7 9 11 12 continue
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
1 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
18 FU Mult1 Add Divide
Nov. 2, 2004 Lec. 7 22
Scoreboard Example Cycle 19
Instruction status Read Execution
Write
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 MULT completes
SUBD F8 F6 F2 7 9 11 12 after 10 cycles
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
0 Mult1 Yes Mult F0 F2 F4 No No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
19 FU Mult1 Add Divide
Nov. 2, 2004 Lec. 7 23
Scoreboard Example Cycle 20
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20 MULTD completes and
SUBD F8 F6 F2 7 9 11 12 writes to F0
DIVD F10 F0 F6 8
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 Yes Yes
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
20 FU Add Divide

Nov. 2, 2004 Lec. 7 24


Scoreboard Example Cycle 21
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20 Now DIVD reads
SUBD F8 F6 F2 7 9 11 12 because F0 is
DIVD F10 F0 F6 8 21 available
ADDD F6 F8 F2 13 14 16
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add Yes Add F6 F8 F2 No No
Divide Yes Div F10 F0 F6 No No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Add Divide

Nov. 2, 2004 Lec. 7 25


Scoreboard Example Cycle 22
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
ADDD writes result
SUBD F8 F6 F2 7 9 11 12 because WAR is
DIVD F10 F0 F6 8 21 removed.
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide Yes Div F10 F0 F6 No No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
21 FU Divide

Nov. 2, 2004 Lec. 7 26


Scoreboard Example Cycle 61
Instruction j k Issueoperands
complete
Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8
MULTD F0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12 DIVD completes
DIVD F10 F0 F6 8 21 61 execution
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for FU
j for F
k j? Fk?
TimeName Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
Divide Yes Div F10 F0 F6 No No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
61 FU Divide

Nov. 2, 2004 Lec. 7 27


Scoreboard Example Cycle 62

Instruction status Read ExecutionWrite


Instruction j k Issue operands complete Result
LD F6 34+ R2 1 2 3 4
LD F2 45+ R3 5 6 7 8 Execution is finished
MULTDF0 F2 F4 6 9 19 20
SUBD F8 F6 F2 7 9 11 12
DIVD F10 F0 F6 8 21 61 62
ADDD F6 F8 F2 13 14 16 22
Functional unit status dest S1 S2 FU for j FU for k Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No
Mult1 No
Mult2 No
Add No
0 Divide No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
62 FU
Nov. 2, 2004 Lec. 7 28
Review: Scoreboard
• Limitations of 6600 scoreboard
– No forwarding
– Limited to instructions in basic block (small window)
– Large number of functional units (structural hazards)
– Stall on WAR hazards
– Stall on WAW hazards
DIV.D F0, F2, F4
ADD.D F6, F0, F8
WAR S.D F6, 0(R1)
WAW
SUB.D F8, F10, F14 Output dependence
Antidependence
MUL.D F6, F10, F8
Name dependence

Nov. 2, 2004 Lec. 7 29


Another Dynamic Algorithm: Tomasulo Algorithm

• For IBM 360/91 about 3 years after CDC 6600


• Goal: High Performance without special compilers
• Differences between Tomasulo Algorithm & Scoreboard
– Control & buffers distributed with Function Units vs. centralized in
scoreboard; called “reservation stations”
– Registers in instructions replaced by pointers to reservation station buffer
– HW renaming of registers to avoid WAW hazards
– Buffer operand values to avoid WAR hazards
– Common Data Bus broadcasts results to all FUs
– Load and Stores treated as FUs as well
• Why study? Lead to Alpha 21264, HP 8000, MIPS 10000,
Pentium II, Power PC 604 …

Nov. 2, 2004 Lec. 7 30


FP unit and load-store unit using Tomasulo’s alg.

Nov. 2, 2004 Lec. 7 31


Another Dynamic Algorithm: Tomasulo Algorithm
DIV.D F0, F2, F4
ADD.D S, F0, F8
S.D S, 0(R1) register renaming
SUB.D T, F10, F14
MUL.D F6, F10, T
• Implemented through reservation stations (rs) per functional unit
– Buffers an operand as soon as it is available – avoids WAR hazards.
– Pending instr. designate rs that will provide their inputs – avoids WAW hazards.
– The last write in a sequence of same-register-writing actually updates the
register
– Decentralize hazard detection and execution control
– Instruction results are passed directly to the FU from rs rather than from registers
 Through common data bus (CDB)

Nov. 2, 2004 Lec. 7 32


Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
Stall if structural hazard, ie. no space in the rs. If reservation station (rs) is free, the
issue logic issues instr to rs & read operands into rs if ready (Register renaming =>
Solves WAR). Make status of destination register waiting for this latest instn even
if the previous instn writing to this register hasn’t completed => Solves WAW
hazards.
2. Execution—operate on operands (EX)
When both operands are ready then execute;
if not ready, watch CDB for result – Solves RAW
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available. Write result into dest. reg. if its status is r. =>
Solves WAW.
• Normal data bus: data + destination (“go to” bus)
• CDB: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does broadcast
Nov. 2, 2004 Lec. 7 33
Reservation Station Components
Op—Operation to perform in the unit (e.g., + or –)
Vj, Vk— Value of the source operand.
Qj, Qk— Name of the RS that would provide the source
operands. Value zero means the source operands already
available in Vj or Vk, or is not necessary.
Busy—Indicates reservation station or FU is busy

Register File Status Qi:


Qi —Indicates which functional unit will write each register, if
one exists. Blank (0) when no pending instructions that will write
that register meaning that the value is already available.

Nov. 2, 2004 Lec. 7 34


Tomasulo Example Cycle 0
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 Load1 No
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
0 FU
Nov. 2, 2004 Lec. 7 35
Tomasulo Example Cycle 1
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 Load1 Yes 34+R2
LD F2 45+ R3 Load2 No
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
1 FU Load1
Nov. 2, 2004 Lec. 7 36
Tomasulo Example Cycle 2
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2- Load1 Yes 34+R2
LD F2 45+ R3 2 Load2 Yes 45+R3
MULTD F0 F2 F4 Load3 No
SUBD F8 F6 F2 Assume Load takes 2 cycles
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No
Add3 No
0 Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
2 FU Load2 Load1
Nov. 2, 2004 Lec. 7 37
Tomasulo Example Cycle 3
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 Load1 Yes 34+R2
LD F2 45+ R3 2 3- Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 No read value
Add3 No
0 Mult1 Yes Mult R(F4) Load2
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
3 FU Mult1 Load2 Load1
Nov. 2, 2004 Lec. 7 38
Tomasulo Example Cycle 4
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 Load2 Yes 45+R3
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes Sub M(A1) Load2
0 Add2 No
Add3 No
0 Mult1 Yes Mult R(F4) Load2
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
4 FU Mult1 Load2 M(A1) Add1
Nov. 2, 2004 Lec. 7 39
Tomasulo Example Cycle 5
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 Load3 No
SUBD F8 F6 F2 4
DIVD F10 F0 F6 5
ADDD F6 F8 F2
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
2 Add1 Yes Sub M(A1) M(A2)
0 Add2 No
Add3 No
10 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
5 FU Mult1 M(A2) M(A1) Add1 Mult2
Nov. 2, 2004 Lec. 7 40
Tomasulo Example Cycle 6
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 --
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
1 Add1 Yes Sub M(A1) M(A2)
0 Add2 Yes Add M(A2) Add1
Add3 No
9 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
6 FU Mult1 M(A2) Add2 Add1 Mult2
Nov. 2, 2004 Lec. 7 41
Tomasulo Example Cycle 7
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 Yes Sub M(A1) M(A2)
0 Add2 Yes Add M(A2) Add1
Add3 No
8 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
7 FU Mult1 M(A2) Add2 Add1 Mult2
Nov. 2, 2004 Lec. 7 42
Tomasulo Example Cycle 8
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
2 Add2 Yes Add M1-M2 M(A2)
Add3 No
7 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
8 FU Mult1 M(A2) Add2 M1-M2 Mult2
Nov. 2, 2004 Lec. 7 43
Tomasulo Example Cycle 9
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 --
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
1 Add2 Yes Add M1-M2 M(A2)
Add3 No
6 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
9 FU Mult1 M(A2) Add2 M1-M2 Mult2
Nov. 2, 2004 Lec. 7 44
Tomasulo Example Cycle 10
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
0 Add2 Yes Add M1-M2 M(A2)
Add3 No
5 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
10 FU Mult1 M(A2) Add2 M1-M2 Mult2
Nov. 2, 2004 Lec. 7 45
Tomasulo Example Cycle 11
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
4 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
11 FU Mult1 M(A2) M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004 Lec. 7 46
Tomasulo Example Cycle 12
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
4 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
12 FU Mult1 M(A2) M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004 Lec. 7 47
Tomasulo Example Cycle 15
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
0 Mult1 Yes Mult M(A2) R(F4)
0 Mult2 Yes Div M(A1) Mult1
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
15 FU Mult1 M(A2) M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004 Lec. 7 48
Tomasulo Example Cycle 16
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
40 Mult2 Yes Div M*F4 M(A1)
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
16 FU M*F4 M(A2) M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004 Lec. 7 49
Tomasulo Example Cycle 56
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5 17 -- 56
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 Yes Div M*F4 M(A1)
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
56 FU M*F4 M(A2) M1-M2+M(A2)
M1-M2 Mult2
Nov. 2, 2004 Lec. 7 50
Tomasulo Example Cycle 57
Instruction status Execution Write
Instruction j k Issue complete Result Busy Address
LD F6 34+ R2 1 2--3 4 Load1 No
LD F2 45+ R3 2 3--4 5 Load2 No
MULTD F0 F2 F4 3 6 -- 15 16 Load3 No
SUBD F8 F6 F2 4 6 -- 7 8
DIVD F10 F0 F6 5 17 -- 56 57
ADDD F6 F8 F2 6 9 -- 10 11
Reservation Stations S1 S2 RS for j RS for k
Time Name Busy Op Vj Vk Qj Qk
0 Add1 No
Add2 No
Add3 No
Mult1 No
0 Mult2 No
Register result status
Clock F0 F2 F4 F6 F8 F10 F12 ... F30
57 FU M*F4 M(A2) M1-M2+M(A2)
M1-M2 result
Nov. 2, 2004 Lec. 7 51
Branch Prediction (3.4, 3.5)

Nov. 2, 2004 Lec. 7 52


Branch Prediction
• Easiest (static prediction)
– Always taken, always not taken
– Opcode based
– Displacement based (forward not taken, backward taken)
– Compiler directed (branch likely, branch not likely)
• Next easiest
– 1 bit predictor – remember last taken/not taken per branch
 Use a branch-prediction buffer or branch-history table
 Use part of the PC (low-order bits) to index buffer/table
– Multiple branches may share the same bit
 Invert the bit if the prediction is wrong
 Backward branches for loops will be mispredicted twice

Nov. 2, 2004 Lec. 7 53


Q: Assume a loop branch is taken nine times in a row, then not taken once. What
is the prediction accuracy using 1-bit predictor?

A: After first loop, the predictor will say not to take because the last time the
execution came out of loop, it set a “0” in the predictor. So, it’s a misprediction.
The bit will now be set to “1”. Works fine until the last loop when it is predicted
as taken. So, 2 mispredictions in in 10 loop executions => 80% accuracy.

How about a 2-bit predictor? Let the prediction be changed only after it misses
twice in a row.

Nov. 2, 2004 Lec. 7 54


2-bit Branch Prediction
• Has 4 states instead of 2, allowing for more information about
tendencies
• A prediction must miss twice before it is changed
• Good for backward branches of loops

Nov. 2, 2004 Lec. 7 55


Branch History Table
• Has limited size
• 2 bits by N (e.g. 4K)
• 4K same as infinite, see Fig. 3.9
• Uses low-order bits of branch PC to
choose entry

branch PC BHT

01

Nov. 2, 2004 Lec. 7 56


Can we do better ?
• Correlating branch predictors also look at other branches for
clues
if (aa==2) T
aa = 0
if (bb==2) T
bb = 0
if(aa!=bb) { … NT

Prediction if the last branch is NT

Prediction if the last branch is T

(1,1) predictor – uses history of 1 branch and uses a 1-bit predictor


Nov. 2, 2004 Lec. 7 57
Correlating Branch Predictor
• If we use 2 branches as histories, then there are 4 possibilities
(T-T, NT-T, NT-NT, NT-T).
• For each possibility, we need to use a predictor (1-bit, 2-bit).
• And this repeats for every branch.

(2,2) branch prediction

Nov. 2, 2004 Lec. 7 58


Performance of Correlating Branch Prediction

• With same number of


state bits, (2,2) performs
better than noncorrelating
2-bit predictor.
• Outperforms a 2-bit
predictor with infinite
number of entries

Nov. 2, 2004 Lec. 7 59


General (m,n) Branch Predictors
• The global history register is an m-bit shift register that records
the last m branches encountered by the processor
• Usually use both the PC address and the GHR (2-level)

m-bit ghr

01
n-bit predictors
PC 00
Combining
funciton

Nov. 2, 2004 Lec. 7 60


Is Branch Predictor Enough?
• When is using branch prediction beneficial?
– When the outcome is known later than the target
– For example, in our standard MIPS pipeline, we compute the target in ID
stage but testing the branch condition incur a structure hazard in register
file.
• If we predict the branch is taken and suppose it is correct, what is
the target address?
– Need a mechanism to provide target address as well
• Can we eliminate the one cycle delay for the 5-stage pipeline?
– Need to fetch from branch target immediately after branch

Nov. 2, 2004 Lec. 7 61


Branch Target Buffer (BTB)
Is the current instruction a branch ?
• BTB provides the answer before the current instruction is decoded
and therefore enables fetching to begin after IF-stage .

What is the branch target ?


• BTB provides the branch target if the prediction is a taken direct
branch (for not taken branches the target is simply PC+4 ) .

Nov. 2, 2004 Lec. 7 62

You might also like