
Instruction Level Parallelism using Dynamic Scheduling Topic 4

Dynamic Scheduling
Dynamic scheduling is done by hardware. It:
  Handles some cases where dependencies are unknown at compile time (e.g. dependencies involving memory references)
  Allows the processor to tolerate unpredictable delays
  Allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline
  Allows out-of-order execution and out-of-order completion

In-order: if an instruction is stalled, no later instructions can proceed.
  DIVD F0, F2, F4
  ADDD F10, F0, F8     (stall)
  SUBD F12, F8, F14    (has to wait)

Dynamic Scheduling

In a classical pipeline, instructions are issued and executed in order.
Both structural and data hazards are checked during the ID stage. If there are no hazards, the instruction is issued from ID for execution.
To allow out-of-order execution, we split the ID stage into 2 stages:
  Issue: decode the instruction and check for structural hazards (SH)
  Read operands: wait until there are no data hazards (DH), then read the operands

Dynamic Scheduling

Issue is still in order, but instructions may bypass each other in the read-operands stage and thus enter the EX stage out of order.
Out-of-order execution introduces the possibility of WAW and WAR hazards.

Dynamic Scheduling

Two dynamic scheduling approaches:
  Scoreboarding
  The Tomasulo approach

HW Schemes: Instruction Parallelism

Out-of-order execution divides the ID stage:
1. Issue: decode instructions and check for structural hazards. Issue in order if the functional unit is free and there is no WAW hazard.
2. Read operands (RO): wait until there are no data hazards, then read operands.

In the DIVD / ADDD / SUBD sequence, ADDD would stall at RO, and SUBD could proceed with no stalls.
Scoreboards allow an instruction to execute whenever 1 and 2 hold, without waiting for prior instructions.

[Figure: scoreboard pipeline. IF and ISSUE are followed by per-functional-unit RO, EX1 ... EXm / EXn / EXp and WB stages, with a WAR check before WB.]

Focusing on FP operations; assume no MEM stage.

Four Stages of Instructions with Scoreboard

1. Issue (ID1):
   Decode instructions and check for structural hazards.
   Instructions are issued in program order (for hazard checking).
   Don't issue if there is a structural hazard.
   Don't issue if the instruction is output dependent on any previously issued but uncompleted instruction (no WAW hazards).

2. Read operands (ID2):
   Wait until there are no data hazards (no earlier active instruction will write a source operand), then read operands (no RAW hazards).
   No forwarding is supported.

3. Execution (EX):
   The functional unit starts execution upon receiving operands. When the result is ready, it notifies the scoreboard.

4. Write result (WB):
   Stall until there are no WAR hazards with previous instructions.

Three Parts of the Scoreboard

1. Instruction status: which of the 4 steps (Issue, RO, EX, WB) the instruction is in.

2. Functional unit status: indicates the state of the functional unit (FU). Nine fields for each functional unit:
   Busy     Indicates whether the unit is busy or not (values Yes and No)
   Op       Operation to perform in the unit (e.g. ADDD or SUBD)
   Fi       Destination register (e.g. R2, F2)
   Fj, Fk   Source-register numbers (e.g. R1, R2, F1)
   Qj, Qk   Functional units producing source registers Fj, Fk (e.g. Integer, Mult, Div)
   Rj, Rk   Flags indicating when Fj, Fk are ready and not yet read. Set to No after the operands are read.

3. Register result status: indicates which functional unit will write each register (Result). Blank when no pending instruction will write that register.
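The three parts of the scoreboard can be pictured as plain data structures. The following is only a minimal sketch: the class name `FunctionalUnitStatus` and the dictionary names are hypothetical, chosen for illustration, and the functional unit names are taken from the example slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FunctionalUnitStatus:
    """One row of the functional unit status table (nine fields)."""
    busy: bool = False
    op: Optional[str] = None   # operation, e.g. "ADDD"
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # FU producing Fj (None = no pending producer)
    qk: Optional[str] = None   # FU producing Fk
    rj: bool = False           # Fj ready and not yet read
    rk: bool = False           # Fk ready and not yet read


# 1. Instruction status: instruction -> stage ("Issue", "RO", "EX", "WB")
instruction_status = {}
# 2. Functional unit status: one entry per functional unit
fu_status = {name: FunctionalUnitStatus()
             for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
# 3. Register result status: register -> FU that will write it
register_result = {}

# State right after a load into F6 (address register R2) is issued
# to the Integer unit.
instruction_status["L.D F6"] = "Issue"
fu_status["Integer"] = FunctionalUnitStatus(
    busy=True, op="Load", fi="F6", fk="R2", rk=True)
register_result["F6"] = "Integer"
```

A register missing from `register_result` plays the role of a blank entry: no pending instruction will write it.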

Detailed Scoreboard Pipeline Control

Issue
  Wait until:   not Busy(FU) and not Result(D)   (the Result(D) check prevents WAW)
  Bookkeeping:  Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2;
                Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU

Read operands
  Wait until:   Rj and Rk
  Bookkeeping:  Rj ← No; Rk ← No; Qj ← 0; Qk ← 0

Execution complete
  Wait until:   functional unit done

Write result
  Wait until:   ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) and (Fk(f) ≠ Fi(FU) or Rk(f) = No))   (prevents WAR)
  Bookkeeping:  ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes);
                Result(Fi(FU)) ← 0; Busy(FU) ← No
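The wait conditions and bookkeeping rules can be sketched in code as follows. This is an illustrative sketch, not a cycle-accurate simulator: the function names and the dictionary layout are assumptions, and `None` stands in for a blank or 0 entry.

```python
def new_fu():
    """An idle functional unit entry."""
    return dict(busy=False, op=None, fi=None, fj=None, fk=None,
                qj=None, qk=None, rj=False, rk=False)


def can_issue(fus, result, fu, dest):
    # Structural hazard check (unit free) and WAW check (no pending writer).
    return not fus[fu]["busy"] and result.get(dest) is None


def issue(fus, result, fu, op, dest, s1, s2):
    qj, qk = result.get(s1), result.get(s2)
    fus[fu] = dict(busy=True, op=op, fi=dest, fj=s1, fk=s2,
                   qj=qj, qk=qk, rj=qj is None, rk=qk is None)
    result[dest] = fu


def can_read_operands(fus, fu):
    return fus[fu]["rj"] and fus[fu]["rk"]


def read_operands(fus, fu):
    fus[fu].update(rj=False, rk=False, qj=None, qk=None)


def can_write_result(fus, fu):
    # WAR check: no unit still has to read the old value of our destination.
    fi = fus[fu]["fi"]
    return all((f["fj"] != fi or not f["rj"]) and
               (f["fk"] != fi or not f["rk"]) for f in fus.values())


def write_result(fus, result, fu):
    for f in fus.values():
        if f["qj"] == fu:
            f["rj"] = True
        if f["qk"] == fu:
            f["rk"] = True
    del result[fus[fu]["fi"]]
    fus[fu] = new_fu()


fus = {n: new_fu() for n in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
result = {}
issue(fus, result, "Mult1", "Mult", "F0", "F2", "F4")
issue(fus, result, "Divide", "Div", "F10", "F0", "F6")  # waits for F0
issue(fus, result, "Add", "Add", "F6", "F8", "F2")
read_operands(fus, "Mult1")
read_operands(fus, "Add")
war_blocked = not can_write_result(fus, "Add")   # Divide has not read F6 yet
waw_blocked = not can_issue(fus, result, "Mult2", "F10")
write_result(fus, result, "Mult1")               # F0 becomes available
divide_ready = can_read_operands(fus, "Divide")
read_operands(fus, "Divide")
war_cleared = can_write_result(fus, "Add")
```

In this run the Add unit may not write F6 until the Divide unit has read it, which is the WAR stall, and a second writer of F10 may not issue while the Divide unit is pending, which is the WAW stall.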

Scoreboard Example (Cycle 0)


FP Latency: LD = 1 Cycle (compute address + data cache access) Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write operands complete Result

Time Name

Busy Op
No No No No No

dest Fi

S1 Fj

S2 Fk

FU for j FU for k Fj? Qj Qk Rj

Fk? Rk

Integer Mult1 Mult2 Add Divide Register result status

Clock
1

FU

F0

F2

F4

F6 F8 F10

F12

...

F30


Scoreboard Example (Cycle 1)


FP latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
L.D    F6     34+  R2   1
L.D    F2     45+  R3
MUL.D  F0     F2   F4
SUB.D  F8     F6   F2
DIV.D  F10    F0   F6
ADD.D  F6     F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj   Qk   Rj   Rk
      Integer  Yes   Load  F6        R2                  Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock  F0  F2  F4  F6       F8  F10  F12  ...  F30
1                  Integer

Scoreboard Example (Cycle 2)


FP latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
L.D    F6     34+  R2   1      2
L.D    F2     45+  R3   (structural hazard, not issued)
MUL.D  F0     F2   F4
SUB.D  F8     F6   F2
DIV.D  F10    F0   F6
ADD.D  F6     F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj   Qk   Rj   Rk
      Integer  Yes   Load  F6        R2                  Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status
Clock  F0  F2  F4  F6       F8  F10  F12  ...  F30
2                  Integer

Scoreboard Example (Cycle 3)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read
Issue

Execution Write

operands complete Result

issue is in-order

Time Name

Busy Op
Yes No No No No Load

dest Fi
F6

S1 Fj

S2 Fk
R2

FU for j FU for k Fj? Qj Qk Rj

Fk? Rk
Yes

Integer Mult1 Mult2 Add Divide Register result status

Clock
3

FU

F0

F2

F4

Integer

F6 F8 F10

F12

...

F30


Scoreboard Example (Cycle 4)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read
Issue

Execution Write

operands complete Result

Time Name

Busy Op
Yes No No No No Load

dest Fi
F6

S1 Fj

S2 Fk
R2

FU for j FU for k Fj? Qj Qk Rj

Fk? Rk
Yes

Integer Mult1 Mult2 Add Divide Register result status

Clock
4

FU

F0

F2

F4

Integer

F6 F8 F10

F12

...

F30


Scoreboard Example (Cycle 5)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write

operands complete Result

1 5

Time Name

Busy Op
Yes No No No No Load

dest Fi
F2

S1 Fj

S2 Fk
R3

FU for j FU for k Fj? Qj Qk Rj

Fk? Rk
Yes

Integer Mult1 Mult2 Add Divide Register result status

Clock
5

FU

F0

Integer

F2

F4

F6 F8 F10

F12

...

F30


Scoreboard Example (Cycle 6)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Execution Write operands complete Result

Issue

1 5 6

2 6

Time Name

Busy Op
Yes Yes No No No Load Mult

dest Fi
F2 F0

S1 Fj
F2

S2 Fk
R3 F4

FU for j FU for k Fj? Qj Qk Rj


Integer No

Fk? Rk
Yes Yes

Integer Mult1 Mult2 Add Divide Register result status

Clock
6

FU

Mult1 Integer

F0

F2

F4

F6 F8 F10

F12

...

F30


Scoreboard Example (Cycle 7)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write

operands complete Result

1 5 6 7

2 6

3 7

Still waiting F2 to be written back

Time Name Integer

Busy Op Yes Load


Yes No Yes No Mult Sub

dest Fi F2
F0 F8

S1 Fj
F2 F6

S2 Fk R3
F4 F2

FU for j FU for k Fj? Qj Qk Rj


Integer Integer No Yes

Fk? Rk Yes
Yes No

Mult1 Mult2 Add Divide Register result status

Clock
7

FU

Mult1 Integer

F0

F2

F4

F6 F8 F10
Add

F12

...

F30


Scoreboard Example (Cycle 8a)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write operands complete Result

1 5 6 7 8

2 6

3 7

No WAW hazards No structural hazards

Time Name

Busy Op
Yes Yes No Yes Yes Load Mult

dest Fi
F2 F0

S1 Fj
F2

S2 Fk
R3 F4

FU for j FU for k Fj? Qj Qk Rj


Integer No

Fk? Rk
Yes Yes

Integer Mult1 Mult2 Add Divide Register result status

Sub Div

F8 F10

F6 F0

F2 F6

Mult1

Integer Yes No

No Yes

Clock
8

FU

Mult1 Integer

F0

F2

F4

F6 F8 F10
Add Divide

F12

...

F30


Scoreboard Example (Cycle 8b)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k F6 34+ R2 L.D F2 45+ R3 L.D F2 F4 MUL.D F0 F6 F2 SUB.D F8 F6 DIV.D F10 F0 F8 F2 ADD.D F6 Functional unit status
Read Issue Execution Write operands complete Result

1 5 6 7 8

2 6

3 7

4 8

Time Name

Busy Op
No Yes No Yes Yes Mult Sub Div

dest Fi
F0 F8 F10

S1 Fj
F2 F6 F0

S2 Fk
F4 F2 F6

FU for j FU for k Fj? Qj Qk Rj


Yes Yes No

Fk? Rk
Yes Yes Yes

Integer Mult1 Mult2 Add Divide Register result status

Mult1

Clock
8

FU

F0
Mult1

F2

F4

F6 F8 F10
Add Divide

F12

...

F30


Scoreboard Example (Cycle 9)


FP latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
L.D    F6     34+  R2   1      2              3                   4
L.D    F2     45+  R3   5      6              7                   8
MUL.D  F0     F2   F4   6      9
SUB.D  F8     F6   F2   7      9
DIV.D  F10    F0   F6   8
ADD.D  F6     F8   F2   ?  (structural hazards)

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj     Qk   Rj   Rk
      Integer  No
10    Mult1    Yes   Mult  F0   F2   F4               Yes  Yes
      Mult2    No
2     Add      Yes   Sub   F8   F6   F2               Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1       No   Yes

Register result status
Clock  F0     F2  F4  F6  F8   F10     F12  ...  F30
9      Mult1              Add  Divide

Scoreboard Example (Cycle 11)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write operands complete Result

1 5 6 7 8

2 6 9 9

3 7 11

4 8

Time Name

Busy Op
No Yes No Yes Yes Mult Sub Div

dest Fi
F0 F8 F10

S1 Fj
F2 F6 F0

S2 Fk
F4

FU for j FU for k Fj? Qj Qk Rj


Yes Yes No

Fk? Rk
Yes Yes Yes

Integer 8 Mult1 Mult2 0 Add Divide Register result status

F2 F6 Mult1

Clock
11

FU

F0
Mult1

F2

F4

F6 F8 F10
Add Divide

F12

...

F30


Scoreboard Example (Cycle 12)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write operands complete Result

1 5 6 7 8

2 6 9 9

3 7 11

4 8 12

Time Name

Busy Op
No Yes No No Yes Mult

dest Fi
F0

S1 Fj
F2

S2 Fk
F4

FU for j FU for k Fj? Qj Qk Rj


Yes

Fk? Rk
Yes

Integer 7 Mult1 Mult2 Add Divide Register result status

Div

F10

F0

F6 Mult1

No

Yes

Clock
12

FU

F0
Mult1

F2

F4

F6 F8 F10
Divide

F12

...

F30


Scoreboard Example (Cycle 13)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write operands complete Result

1 5 6 7 8 13

2 6 9 9

3 7 11

4 8 12

Time Name

Busy Op
No Yes No Yes Yes Mult Add Div

dest Fi
F0 F6 F10

S1 Fj
F2 F8 F0

S2 Fk
F4 F2 F6

FU for j FU for k Fj? Qj Qk Rj


Yes Yes No

Fk? Rk
Yes Yes Yes

Integer 6 Mult1 Mult2 Add Divide Register result status

Mult1

Clock
13

FU

F0
Mult1

F2

F4

F6 F8 F10
Add Divide

F12

...

F30


Scoreboard Example (Cycle 17)


FP latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
L.D    F6     34+  R2   1      2              3                   4
L.D    F2     45+  R3   5      6              7                   8
MUL.D  F0     F2   F4   6      9
SUB.D  F8     F6   F2   7      9              11                  12
DIV.D  F10    F0   F6   8
ADD.D  F6     F8   F2   13     14             16                  ?

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj     Qk   Rj   Rk
      Integer  No
2     Mult1    Yes   Mult  F0   F2   F4               Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2               Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1       No   Yes

Register result status
Clock  F0     F2  F4  F6   F8  F10     F12  ...  F30
17     Mult1          Add      Divide

Scoreboard Example (Cycle 20)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write operands complete Result

1 5 6 7 8 13

2 6 9 9 14

3 7 19 11 16

4 8 20 12

Time Name

Busy Op
No No No Yes Yes

dest Fi

S1 Fj

S2 Fk

FU for j FU for k Fj? Qj Qk Rj

Fk? Rk

Integer Mult1 Mult2 Add Divide Register result status

Add Div

F6 F10

F8 F0

F2 F6

Yes Yes

Yes Yes

Clock
20

FU

F0

F2

F4

Add

F6 F8 F10

Divide

F12

...

F30


Scoreboard Example (Cycle 21)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write operands complete Result

1 5 6 7 8 13

2 6 9 9 21 14
Op

3 7 19 11 16

4 8 20 12

Time Name

Busy
No No No Yes Yes

dest
Fi

S1
Fj

S2
Fk

FU for j FU for k Fj?


Qj Qk Rj

Fk?
Rk

Integer Mult1 Mult2 Add Divide Register result status

Add Div

F6 F10

F8 F0

F2 F6

Yes Yes

Yes Yes

Clock
21

FU

F0

F2

F4

Add

F6 F8 F10

Divide

F12

...

F30


Scoreboard Example (Cycle 22)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write operands complete Result

1 5 6 7 8 13

2 6 9 9 21 14
Op Fi

3 7 19 11 16
dest

4 8 20 12 22
S1 Fj S2 Fk FU for j Qj FU for k Qk Fj? Rj Fk? Rk

Time Name

Busy
No No No No Yes

Integer Mult1 Mult2 Add 40 Divide Register result status

Div

F10

F0

F6

Yes

Yes

Clock
22

FU

F0

F2

F4

F6 F8 F10

Divide

F12

...

F30


Scoreboard Example (Cycle 61)


FP Latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40
Instruction status Instruction j k L.D F6 34+ R2 L.D F2 45+ R3 MUL.D F0 F2 F4 SUB.D F8 F6 F2 DIV.D F10 F0 F6 ADD.D F6 F8 F2 Functional unit status
Read Issue Execution Write operands complete Result

1 5 6 7 8 13

2 6 9 9 21 14
Op Fi

3 7 19 11 61 16
dest

4 8 20 12 22
S1 S2 FU for j FU for k Fj? Fk?

Time Name

Busy
No No No No Yes

Fj

Fk

Qj

Qk

Rj

Rk

Integer Mult1 Mult2 Add 0 Divide Register result status

Div

F10

F0

F6

Yes

Yes

Clock
61

FU

F0

F2

F4

F6 F8 F10

Divide

F12

...

F30


Scoreboard Example (Cycle 62)


FP latency: LD = 1 cycle, Add = 2 cycles, Multiply = 10, Divide = 40

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write result
L.D    F6     34+  R2   1      2              3                   4
L.D    F2     45+  R3   5      6              7                   8
MUL.D  F0     F2   F4   6      9              19                  20
SUB.D  F8     F6   F2   7      9              11                  12
DIV.D  F10    F0   F6   8      21             61                  62
ADD.D  F6     F8   F2   13     14             16                  22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj   Qk   Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No

Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
62

In-order issue, Out-of-order execute and commit


Review: Scoreboard
Limitations of the CDC 6600 scoreboard:
  No forwarding
  Limited to instructions in a basic block (small window)
  Limited number of functional units (structural hazards)
  Stall on WAR hazards
  Stall on WAW hazards

  DIV.D  F0, F2, F4
  ADD.D  F6, F0, F8
  S.D    F6, 0(R1)
  SUB.D  F8, F10, F14
  MUL.D  F6, F10, F8

WAR (antidependence): SUB.D writes F8, which the earlier ADD.D reads.
WAW (output dependence): MUL.D writes F6, which the earlier ADD.D also writes.
Both are name dependences.
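The name dependences in the five-instruction sequence can be found mechanically. The sketch below is illustrative only: the function name `name_hazards` and the tuple encoding of instructions are assumptions, and a store is modelled as an instruction with no destination register.

```python
def name_hazards(instrs):
    """Return (kind, earlier, later, register) for each WAR/WAW pair.

    instrs is a list of (name, dest, sources) in program order.
    """
    found = []
    for i, (name1, dest1, srcs1) in enumerate(instrs):
        for name2, dest2, _ in instrs[i + 1:]:
            if dest2 is None:
                continue
            if dest2 in srcs1:      # later write to a register read earlier
                found.append(("WAR", name1, name2, dest2))
            if dest2 == dest1:      # two writes to the same register
                found.append(("WAW", name1, name2, dest2))
    return found


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
hazards = name_hazards(program)
for h in hazards:
    print(h)
```

Besides the ADD.D/SUB.D antidependence on F8 and the ADD.D/MUL.D output dependence on F6, the check also reports a WAR pair between S.D (which reads F6) and MUL.D (which writes F6).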


Tomasulo Algorithm
Designed by Robert Tomasulo for the IBM 360/91, about 3 years after the CDC 6600
Goal: high performance without special compilers
Designed to overcome long memory-access and floating-point delays
RAW hazards are avoided by executing an instruction only when its operands are available

Tomasulo Algorithm
WAR and WAW hazards, which arise from name dependences, are eliminated by register renaming.
Registers in instructions are replaced by values or by pointers to reservation stations.
The Common Data Bus (CDB) is used to bypass the registers and pass the results from the reservation stations directly to the functional units.
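To see why renaming removes WAR and WAW hazards, here is a small sketch that gives every destination a fresh name. It is a simplification under stated assumptions: the tags `T0`, `T1`, ... are hypothetical stand-ins for reservation station names, and the function `rename` is not taken from the slides.

```python
def rename(instrs):
    """Give each destination a fresh tag and redirect later readers to it."""
    latest = {}      # architectural register -> most recent tag
    renamed = []
    for n, (op, dest, srcs) in enumerate(instrs):
        # Sources read the newest producer of each register, if any.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = f"T{n}"
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


program = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, every destination is unique, so no two instructions write the same name (no WAW) and no later write can overwrite a value an earlier instruction still has to read (no WAR). The true RAW dependences remain, expressed through the tags.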


Tomasulo Algorithm: Differences between Tomasulo Algorithm & Scoreboard

Control and buffers are distributed with the functional units (called reservation stations) vs. centralized in the scoreboard
Registers in instructions are replaced by pointers to reservation station buffers
HW renaming of registers avoids WAW hazards
Buffering operand values avoids WAR hazards
The Common Data Bus broadcasts results to all FUs
Loads and stores are treated as FUs as well

Tomasulo's Organization

[Figure: FP unit and load-store unit using Tomasulo's algorithm.]

Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
   Stall if there is a structural hazard, i.e. no space in the reservation station (rs).
   If a reservation station is free, the issue logic issues the instruction to the rs and reads the operands into the rs if they are ready (register renaming => solves WAR).
   Mark the destination register as waiting for this latest instruction, even if a previous instruction writing to this register hasn't completed => solves WAW hazards.

2. Execution: operate on operands (EX)
   When both operands are ready, execute; if not ready, watch the CDB for the result => solves RAW.

3. Write result: finish execution (WB)
   Write on the Common Data Bus to all awaiting units; mark the reservation station available.
   Write the result into the destination register if its status is r => solves WAW.

Normal data bus: data + destination ("go to" bus)
CDB: data + source ("come from" bus)
   64 bits of data + 4 bits of functional unit source address
   Write if it matches the expected functional unit (the one that produces the result)
   Does the broadcast
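The "come from" behaviour of the CDB can be sketched as a broadcast that every waiting unit compares against its expected source. The function name `broadcast` and the dictionary layout below are assumptions made for illustration; the values are symbolic strings such as `M(A2)`, in the style of the example slides.

```python
def broadcast(stations, register_status, registers, source, value):
    """Deliver (source, value) on the CDB to every unit waiting for source."""
    for rs in stations.values():
        if rs["Qj"] == source:
            rs["Vj"], rs["Qj"] = value, None
        if rs["Qk"] == source:
            rs["Vk"], rs["Qk"] = value, None
    for reg, producer in register_status.items():
        # Write the register only if it still waits for this producer.
        if producer == source:
            registers[reg] = value
            register_status[reg] = None


stations = {
    "Add1":  {"Op": "SUBD",  "Vj": "M(A1)", "Vk": None,    "Qj": None,    "Qk": "Load2"},
    "Mult1": {"Op": "MULTD", "Vj": None,    "Vk": "R(F4)", "Qj": "Load2", "Qk": None},
}
register_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
registers = {}

# Load2 finishes and puts its result on the bus.
broadcast(stations, register_status, registers, "Load2", "M(A2)")
```

Because the bus carries the name of the producer rather than a destination, one broadcast updates both reservation stations and register F2 at once; a register whose status had been redirected to a later instruction would simply ignore it.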

Reservation Station Components

Op: operation to perform in the unit (e.g. + or -)
Vj, Vk: values of the source operands
Qj, Qk: names of the RSs that will provide the source operands. A value of zero means the source operand is already available in Vj or Vk, or is not necessary.
Busy: indicates that the reservation station or FU is busy

Register File Status Qi:
Qi: indicates which functional unit will write each register, if one exists. Blank (0) when no pending instruction will write that register, meaning that the value is already available.
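A reservation station entry and the issue step that fills it might look like the sketch below. The class name, the `issue` helper and the numeric register values are hypothetical; `None` plays the role of the zero/blank entries.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # value of first operand, if available
    vk: Optional[float] = None   # value of second operand, if available
    qj: Optional[str] = None     # RS that will produce the first operand
    qk: Optional[str] = None     # RS that will produce the second operand

    def ready(self):
        """Both operands are present, so execution may start."""
        return self.busy and self.qj is None and self.qk is None


def issue(rs, op, src1, src2, dest, regs, qi):
    """Fill rs, reading values or producer names, then rename dest."""
    rs.busy, rs.op = True, op
    if qi.get(src1) is not None:
        rs.qj = qi[src1]
    else:
        rs.vj = regs[src1]
    if qi.get(src2) is not None:
        rs.qk = qi[src2]
    else:
        rs.vk = regs[src2]
    qi[dest] = rs.name           # register file status Qi now points at rs


regs = {"F0": 0.0, "F2": 1.5, "F4": 2.0}   # made-up register contents
qi = {"F2": "Load2"}                        # F2 is still being loaded
mult1 = ReservationStation("Mult1")
issue(mult1, "MULTD", "F2", "F4", "F0", regs, qi)
```

Here `mult1` holds the value of F4 but only the name `Load2` for F2, so `mult1.ready()` is false until that load broadcasts its result.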

Tomasulo Example (Cycle 0)


FP latency: Add = 2 cycles, Multiply = 10, Divide = 40
Load takes 2 cycles in execution stage

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD     F6     34+  R2
LD     F2     45+  R3
MULTD  F0     F2   F4
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
0

Tomasulo Example (Cycle 1)


FP latency: Add = 2 cycles, Multiply = 10, Divide = 40
Load takes 2 cycles in execution stage

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD     F6     34+  R2   1
LD     F2     45+  R3
MULTD  F0     F2   F4
SUBD   F8     F6   F2
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Name   Busy  Address
Load1  Yes   34+R2
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status:
Clock  F0  F2  F4  F6     F8  F10  F12  ...  F30
1                  Load1

Tomasulo Example (Cycle 2)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 Load1 Load2 Load3

Busy Address
Yes Yes No 34+R2 45+R3

Reservation Stations:
Time Name Busy Add1 No Add2 No Add3 No Mult1 No Mult2 No

Op

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


2 FU

F0

F2
Load2

F4

F6
Load1

F8

F10

F12

...

F30


Tomasulo Example (Cycle 3)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 3 Load1 Load2 Load3

Busy Address
Yes Yes No 34+R2 45+R3

Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes MULTD Mult2 No

S1 Vj

S2 Vk

RS Qj

RS Qk

R(F4) Load2

Register result status: Clock


3 FU

F0

F2

F4

F6
Load1

F8

F10

F12

...

F30

Mult1 Load2


Tomasulo Example (Cycle 4)


FP latency: Add = 2 cycles, Multiply = 10, Divide = 40
Load takes 2 cycles in execution stage

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD     F6     34+  R2   1      3          4
LD     F2     45+  R3   2      4
MULTD  F0     F2   F4   3
SUBD   F8     F6   F2   4
DIVD   F10    F0   F6
ADDD   F6     F8   F2

Name   Busy  Address
Load1  No
Load2  Yes   45+R3
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   Yes   SUBD   M(A1)                      Load2
      Add2   No
      Add3   No
      Mult1  Yes   MULTD           R(F4)    Load2
      Mult2  No

Register result status:
Clock  F0     F2     F4  F6     F8    F10  F12  ...  F30
4      Mult1  Load2      M(A1)  Add1

Tomasulo Example (Cycle 5)


FP latency: Add = 2 cycles, Multiply = 10, Divide = 40
Load takes 2 cycles in execution stage

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD     F6     34+  R2   1      3          4
LD     F2     45+  R3   2      4          5
MULTD  F0     F2   F4   3
SUBD   F8     F6   F2   4
DIVD   F10    F0   F6   5
ADDD   F6     F8   F2

Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
2     Add1   Yes   SUBD   M(A1)    M(A2)
      Add2   No
      Add3   No
10    Mult1  Yes   MULTD  M(A2)    R(F4)
      Mult2  Yes   DIVD            M(A1)    Mult1

Register result status:
Clock  F0     F2     F4  F6     F8    F10    F12  ...  F30
5      Mult1  M(A2)      M(A1)  Add1  Mult2

Tomasulo Example (Cycle 6)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 4 5 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op 1 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 9 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


6 FU

F0

F2

F4

F6
Add2

F8

F10

F12

...

F30

Mult1 M(A2)

Add1 Mult2

Tomasulo Example (Cycle 7)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 4 5 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op 0 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 8 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


7 FU

F0

F2

F4

F6
Add2

F8

F10

F12

...

F30

Mult1 M(A2)

Add1 Mult2

Tomasulo Example (Cycle 8)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 4 5 8 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No 2 Add2 Yes ADDD (M-M) M(A2) Add3 No 7 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


8 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

Add2 (M-M) Mult2



Tomasulo Example (Cycle 10)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 10 4 5 8 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No 0 Add2 Yes ADDD (M-M) M(A2) Add3 No 5 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


10 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

Add2 (M-M) Mult2



Tomasulo Example (Cycle 11)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 10 4 5 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No 4 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


11 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

(M-M+M) (M-M) Mult2



Tomasulo Example (Cycle 15)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 15 7 10 4 5 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No 0 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


15 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

(M-M+M) (M-M) Mult2



Tomasulo Example (Cycle 16)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 15 7 10 4 5 16 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes DIVD M*F4 M(A1)

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


16 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

M*F4 M(A2)

(M-M+M) (M-M) Mult2



Tomasulo Example (Cycle 55)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 15 7 10 4 5 16 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No 1 Mult2 Yes DIVD M*F4 M(A1)

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


55 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

M*F4 M(A2)

(M-M+M) (M-M) Mult2



Tomasulo Example (Cycle 56)


FP latency: Add = 2 cycles, Multiply = 10, Divide = 40
Load takes 2 cycles in execution stage

Instruction status:
Instruction   j    k    Issue  Exec Comp  Write Result
LD     F6     34+  R2   1      3          4
LD     F2     45+  R3   2      4          5
MULTD  F0     F2   F4   3      15         16
SUBD   F8     F6   F2   4      7          8
DIVD   F10    F0   F6   5      56
ADDD   F6     F8   F2   6      10         11

Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation Stations:
Time  Name   Busy  Op     Vj (S1)  Vk (S2)  Qj (RS)  Qk (RS)
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
0     Mult2  Yes   DIVD   M*F4     M(A1)

Register result status:
Clock  F0    F2     F4  F6       F8     F10    F12  ...  F30
56     M*F4  M(A2)      (M-M+M)  (M-M)  Mult2

Tomasulo Example (Cycle 57)


FP Latency: Add = 2 cycles, Multiply = 10, Divide = 40 Load takes 2 cycles in execution stage

Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 15 7 56 10 4 5 16 8 57 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:
Time Name Busy Add1 No Add2 No Add3 No Mult1 No 0 Mult2 No

Op

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


57 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

M*F4 M(A2)

(M-M+M) (M-M) Mult2



In-order issue, Out-of-order execute and commit

Branch Prediction

Lec. 7


Branch Prediction
Easiest (static prediction)
  Always taken, always not taken
  Opcode based
  Displacement based (forward not taken, backward taken)
  Compiler directed (branch likely, branch not likely)

Next easiest
  1-bit predictor: remember last taken/not taken per branch
    Use a branch-prediction buffer or branch-history table
    Use part of the PC (low-order bits) to index the buffer/table
    Multiple branches may share the same bit
    Invert the bit if the prediction is wrong
  Backward branches for loops will be mispredicted twice

Example
Q: Assume a loop branch is taken nine times in a row, then not taken once. What is the prediction accuracy using a 1-bit predictor?
A: On the first iteration, the predictor says not taken, because the last time execution came out of the loop it set the bit to 0. So it's a misprediction, and the bit is now set to 1. Prediction then works fine until the last iteration, where the branch is predicted as taken but is not taken. So, 2 mispredictions in 10 loop executions => 80% accuracy.
How about a 2-bit predictor? Let the prediction be changed only after it misses twice in a row.
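The 80% figure can be checked with a few lines of simulation. This is a minimal sketch for a single branch with its own bit; the function name `one_bit_accuracy` and the parameter `last_outcome` are illustrative assumptions.

```python
def one_bit_accuracy(outcomes, last_outcome=False):
    """Fraction of correct predictions made by a 1-bit predictor."""
    bit = last_outcome           # True means "predict taken"
    correct = 0
    for taken in outcomes:
        if bit == taken:
            correct += 1
        bit = taken              # remember only the most recent outcome
    return correct / len(outcomes)


# Nine taken iterations followed by the loop exit (not taken).
loop_branch = [True] * 9 + [False]
# The bit starts at "not taken", left behind by the previous loop exit.
accuracy = one_bit_accuracy(loop_branch, last_outcome=False)
print(accuracy)
```

The first and the last iteration are mispredicted, giving 8 correct predictions out of 10.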


2-bit Branch Prediction

Has 4 states instead of 2, allowing for more information about tendencies
A prediction must miss twice before it is changed
Good for backward branches of loops
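One common way to realise the four states is a 2-bit saturating counter. The sketch below assumes that encoding (0 and 1 predict not taken, 2 and 3 predict taken), which may differ in detail from the state diagram on the slide; the function name is illustrative.

```python
def two_bit_accuracy(outcomes, counter=2):
    """Fraction of correct predictions made by a 2-bit saturating counter."""
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken == taken:
            correct += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


# Same loop branch as in the 1-bit example: taken nine times, then not taken.
loop_branch = [True] * 9 + [False]
# Counter value 2 is what a previous loop exit leaves behind (3 -> 2).
accuracy = two_bit_accuracy(loop_branch, counter=2)
print(accuracy)
```

Only the loop exit is mispredicted, because a single not-taken outcome moves the counter from 3 to 2 and the prediction stays "taken" for the next entry into the loop.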


Branch History Table


Has limited size: 2 bits by N entries (e.g. N = 4K)
A 4K-entry table performs about the same as an infinite one, see Fig. 3.9
Uses the low-order bits of the branch PC to choose an entry
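Indexing with the low-order PC bits can be sketched as follows. The class name is hypothetical, and the right shift by 2 assumes 4-byte aligned instructions, as in MIPS; the counters use the saturating encoding where values 2 and 3 predict taken.

```python
class BranchHistoryTable:
    """N two-bit counters indexed by low-order bits of the branch PC."""

    def __init__(self, entries=4096):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.counters = [0] * entries

    def index(self, pc):
        # Drop the two byte-offset bits, keep the low-order index bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)


bht = BranchHistoryTable(entries=4096)
branch_a = 0x1000
branch_b = 0x1000 + 4096 * 4     # differs only in bits above the index
bht.update(branch_a, True)
bht.update(branch_a, True)
shared = bht.index(branch_a) == bht.index(branch_b)
```

Since no tags are stored, `branch_b` inherits the history trained by `branch_a`; this is the sharing of entries between branches that a limited table size implies.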

[Figure: the branch PC selects an entry in the BHT; example entry 01.]

Nov. 2, 2004


Can we do better?
Correlating branch predictors also look at other branches for clues:
  if (aa==2) aa = 0
  if (bb==2) bb = 0
  if (aa!=bb) {

[Figure: predictor entry T / NT. The first bit is the prediction if the last branch is NT; the second bit is the prediction if the last branch is T.]

A (1,1) predictor uses the history of 1 branch and uses a 1-bit predictor.

Correlating Branch Predictor

If we use the last 2 branches as history, there are 4 possibilities (T-T, T-NT, NT-T, NT-NT).
For each possibility, we need a predictor (1-bit or 2-bit).
This repeats for every branch.

[Figure: (2,2) branch prediction]

Performance of Correlating Branch Prediction
With the same number of state bits, a (2,2) predictor performs better than a noncorrelating 2-bit predictor.
It even outperforms a 2-bit predictor with an infinite number of entries.

General (m,n) Branch Predictors

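The global history register (GHR) is an m-bit shift register that records the outcomes of the last m branches encountered by the processor.
Usually both the PC address and the GHR are used to select one of the n-bit predictors (2-level).

[Figure: the PC and the m-bit GHR feed a combining function that selects an n-bit predictor; example entries 01 and 00.]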
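A general (m,n) predictor can be sketched as a table of n-bit saturating counters selected by the branch PC and the m-bit global history. The class below is an illustration under assumptions: its name, the table size, and the choice of "PC picks the row, GHR picks the column" as the combining function are not taken from the slides.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history, n-bit counters."""

    def __init__(self, m, n, entries=1024):
        self.m = m
        self.entries = entries
        self.max_count = (1 << n) - 1
        self.ghr = 0                                   # global history register
        self.table = [[0] * (1 << m) for _ in range(entries)]

    def _row(self, pc):
        return self.table[(pc >> 2) % self.entries]

    def predict(self, pc):
        # Upper half of the counter range means "predict taken".
        return self._row(pc)[self.ghr] > self.max_count // 2

    def update(self, pc, taken):
        row = self._row(pc)
        c = row[self.ghr]
        row[self.ghr] = min(c + 1, self.max_count) if taken else max(c - 1, 0)
        # Shift the outcome into the m-bit history.
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.m) - 1)


def mispredictions(predictor, pc, outcomes):
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


alternating = [True, False] * 10
with_history = mispredictions(CorrelatingPredictor(1, 1), 0x40, alternating)
without_history = mispredictions(CorrelatingPredictor(0, 1), 0x40, alternating)
print(with_history, without_history)
```

On a strictly alternating branch, the (1,1) predictor mispredicts only the first outcome, while the plain 1-bit predictor (m = 0) mispredicts every time. With m = 0 the sketch degenerates to an ordinary branch history table.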


Is Branch Predictor Enough?

When is using branch prediction beneficial?
  When the outcome is known later than the target
  For example, in our standard MIPS pipeline, we compute the target in the ID stage, but testing the branch condition incurs a structural hazard in the register file.

If we predict that the branch is taken and the prediction is correct, what is the target address?
  Need a mechanism to provide the target address as well

Can we eliminate the one-cycle delay for the 5-stage pipeline?
  Need to fetch from the branch target immediately after the branch

Branch Target Buffer (BTB)

Is the current instruction a branch?
  The BTB provides the answer before the current instruction is decoded, and therefore enables fetching to begin after the IF stage.

What is the branch target?
  The BTB provides the branch target if the prediction is a taken direct branch (for not-taken branches the target is simply PC+4).
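A BTB lookup can be sketched as a small table keyed by the fetch PC. This is a simplified illustration: the class and method names are hypothetical, the table is an unbounded dictionary rather than a fixed-size tagged cache, and only branches last seen taken are kept.

```python
class BranchTargetBuffer:
    """Maps the PC of a predicted-taken branch to its target address."""

    def __init__(self):
        self.targets = {}

    def next_fetch_pc(self, pc):
        # A hit means: this PC holds a branch that is predicted taken.
        target = self.targets.get(pc)
        return target if target is not None else pc + 4

    def update(self, pc, taken, target):
        if taken:
            self.targets[pc] = target
        else:
            self.targets.pop(pc, None)


btb = BranchTargetBuffer()
before = btb.next_fetch_pc(0x400)      # unknown PC: fall through to PC+4
btb.update(0x400, taken=True, target=0x380)
after = btb.next_fetch_pc(0x400)       # hit: fetch from the stored target
btb.update(0x400, taken=False, target=0x380)
evicted = btb.next_fetch_pc(0x400)     # entry removed: PC+4 again
```

The lookup uses only the PC, so it can run during IF, before the instruction is decoded.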
