
UNIT-III

Instruction Level Parallelism

3.1 Instruction Level Parallelism: Concepts and Challenges
3.2 Overcoming Data Hazards with Dynamic Scheduling
3.3 Dynamic Scheduling: Examples & The Algorithm
3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
3.5 High Performance Instruction Delivery
3.6 Taking Advantage of More ILP with Multiple Issue
3.7 Hardware-based Speculation
3.8 Studies of The Limitations of ILP

Chap. 3 -ILP 1

Ideas To Reduce Stalls


Technique                                     Reduces

Chapter 3
Dynamic scheduling                            Data hazard stalls
Dynamic branch prediction                     Control stalls
Issuing multiple instructions per cycle       Ideal CPI
Speculation                                   Data and control stalls
Dynamic memory disambiguation                 Data hazard stalls involving memory

Chapter 4
Loop unrolling                                Control hazard stalls
Basic compiler pipeline scheduling            Data hazard stalls
Compiler dependence analysis                  Ideal CPI and data hazard stalls
Software pipelining and trace scheduling      Ideal CPI and data hazard stalls
Compiler speculation                          Ideal CPI, data and control stalls


Instruction Level Parallelism



ILP is the principle that many instructions in a program don't depend on each other, so it's possible to execute those instructions in parallel. This is easier said than done. Issues include building compilers that can analyze the code, and building hardware that is even smarter than that code.

This section looks at some of the problems to be solved.


Terminology
Basic Block - The set of instructions between one entry point and the next branch. A basic block has only one entry and one exit. Typically it is about 6 instructions long.

Loop Level Parallelism - The parallelism that exists within a loop. Such parallelism can cross loop iterations.
Loop Unrolling - Replicating the loop body so that either the compiler or the hardware is able to exploit the parallelism inherent in the loop.


Basic Block (BB) ILP is quite small
- BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
- Average dynamic branch frequency of 15% to 25% => 4 to 7 instructions execute between a pair of branches
- Plus, instructions in a BB are likely to depend on each other
To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
- Simplest: loop-level parallelism, which exploits parallelism among iterations of a loop
- Vector is one way
- If not vector, then either dynamic via branch prediction or static via loop unrolling by the compiler
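
The 4-to-7 figure is just the reciprocal of the branch frequency: if a fraction f of executed instructions are branches, a branch shows up about once every 1/f instructions. A quick check of that arithmetic, as a minimal sketch (the function name is made up for this illustration):

```python
def instructions_per_branch(branch_frequency: float) -> float:
    """Average number of dynamic instructions per branch, branch included."""
    return 1.0 / branch_frequency


for f in (0.15, 0.25):
    print(f"{f:.0%} branches -> about {instructions_per_branch(f):.1f} instructions per branch")
```

At 25% this gives 4 instructions and at 15% a little under 7, which matches the range quoted above.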


Data Dependence and Hazards

- InstrJ is data dependent on InstrI: InstrJ tries to read an operand before InstrI writes it

    I: add r1,r2,r3
    J: sub r4,r1,r3

- or InstrJ is data dependent on InstrK, which is dependent on InstrI
- Caused by a True Dependence (compiler term)
- If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard


- Dependences are a property of programs
- The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline
- Importance of the data dependences:
  1) indicates the possibility of a hazard
  2) determines the order in which results must be calculated
  3) sets an upper bound on how much parallelism can possibly be exploited
- Today: looking at HW schemes to avoid hazards


Name Dependence #1: Anti-dependence

- Name dependence: 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are 2 versions of name dependence.
- InstrJ writes an operand before InstrI reads it

    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

- Called an anti-dependence by compiler writers. This results from reuse of the name r1.
- If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.

Name Dependence #2: Output dependence

- InstrJ writes an operand before InstrI writes it.

    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7

- Called an output dependence by compiler writers. This also results from the reuse of the name r1.
- If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.


ILP and Data Hazards

- HW/SW must preserve program order: the order instructions would execute in if executed sequentially, one at a time, as determined by the original source program
- HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
- Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict
  - Register renaming resolves name dependences for registers
  - Done either by the compiler or by HW
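
Register renaming can be sketched in a few lines. The version below is only an illustration, not how any particular compiler or processor does it: it assumes three-operand instructions written as tuples, and the names p0, p1, ... for the fresh registers are hypothetical. Every destination gets a new name and every source reads the most recent name, so the conflict on r1 from the anti-dependence example (sub, add, mul) disappears while the true dependence from add to mul is kept.

```python
from itertools import count


def rename(program):
    """Rename destinations of (op, dest, src1, src2) tuples to fresh names."""
    fresh = (f"p{i}" for i in count())   # hypothetical pool of new register names
    current = {}                         # architectural name -> latest new name
    renamed = []
    for op, dest, src1, src2 in program:
        s1 = current.get(src1, src1)     # look up sources before renaming dest
        s2 = current.get(src2, src2)
        current[dest] = next(fresh)
        renamed.append((op, current[dest], s1, s2))
    return renamed


program = [("sub", "r4", "r1", "r3"),
           ("add", "r1", "r2", "r3"),
           ("mul", "r6", "r1", "r7")]
for instr in rename(program):
    print(instr)
```

After renaming, sub still reads the original r1 while add writes p1, so the two no longer conflict; mul reads p1, the value add produced.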



Every instruction is control dependent on some set of branches and, in general, these control dependences must be preserved to preserve program order:

    if p1 { S1; };
    if p2 { S2; }

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.



Control dependence need not always be preserved: we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program. Instead, the 2 properties critical to program correctness are exception behavior and data flow.


Exception Behavior

Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in the program (=> no new exceptions)
Example:

    DADDU R2,R3,R4
    BEQZ  R2,L1
    LW    R1,0(R2)
L1:

Problem with moving LW before BEQZ?


Data Flow

Data flow: the actual flow of data values among instructions that produce results and those that consume them
- Branches make the flow dynamic; they determine which instruction is the supplier of data
Example:

    DADDU R1,R2,R3
    BEQZ  R4,L
    DSUBU R1,R5,R6
L:  OR    R7,R1,R8

Does OR depend on DADDU or DSUBU? Data flow must be preserved on execution.
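
The question in the example can be answered by following the branch. The sketch below is only an illustration (the function name is invented here); it models the four instructions directly and returns which of them supplied the R1 value that OR reads.

```python
def producer_of_r1(r4: int) -> str:
    """Return the instruction whose R1 value reaches the OR, for a given R4."""
    producer = "DADDU"       # DADDU R1,R2,R3 always executes
    if r4 != 0:              # BEQZ R4,L falls through only when R4 is nonzero
        producer = "DSUBU"   # DSUBU R1,R5,R6 then overwrites R1
    return producer          # L: OR R7,R1,R8 reads whichever value is live


print(producer_of_r1(0))     # branch taken, DSUBU skipped
print(producer_of_r1(5))     # branch not taken
```

So the supplier depends on the run-time value of R4, which is why the dependence cannot be resolved by looking at the code alone.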

Dynamic Scheduling

Advantages of Dynamic Scheduling


- Handles cases when dependences are unknown at compile time (e.g., because they may involve a memory reference)
- Simplifies the compiler
- Allows code that was compiled for one pipeline to run efficiently on a different pipeline
- Enables hardware speculation, a technique with significant performance advantages that builds on dynamic scheduling

Logistics

Sections 3.2 and 3.3 of the text use, as an example of Dynamic Scheduling, an algorithm due to Tomasulo. We instead use another technique, scoreboarding, which is discussed in Appendix A8.


The idea:

HW Schemes: Instruction Parallelism


Why is this in hardware at run time?
- Works when the real dependence can't be known at compile time
- Compiler is simpler
- Code for one machine runs well on another
Key idea: allow instructions behind a stall to proceed.
Key idea: instructions execute in parallel. There are multiple execution units, so use them.

    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14

Even though ADDD stalls, the SUBD has no dependences and can run.

Enables out-of-order execution => out-of-order completion



Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands

- Scoreboards allow an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions.
- A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together.
- We will use in-order issue, out-of-order execution, and out-of-order commit (also called completion).
- First used in the CDC 6600. Our example is modified here for MIPS.
- The CDC had 4 FP units, 5 memory reference units, and 7 integer units.
- MIPS has 2 FP multipliers, 1 FP adder, 1 FP divider, and 1 integer unit.


Using A Scoreboard

Scoreboard Implications
- Out-of-order completion => WAR, WAW hazards?
- Solutions for WAR:
  - Queue both the operation and copies of its operands
  - Read registers only during the Read Operands stage
- For WAW, must detect the hazard: stall until the other instruction completes
- Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
- Scoreboard keeps track of dependences and the state of operations
- Scoreboard replaces ID, EX, WB with 4 stages


Four Stages of Scoreboard Control


1. Issue: decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
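
The issue test can be sketched with two small tables. This is a minimal illustration under assumptions, not the scoreboard's actual hardware logic: `busy` and `result` are hypothetical Python dictionaries standing in for the functional unit Busy flags and the register result status, and the function names are invented here.

```python
def can_issue(unit: str, dest: str, busy: dict, result: dict) -> bool:
    """True if the unit is free (no structural hazard) and no active
    instruction already targets dest (no WAW hazard)."""
    return not busy[unit] and dest not in result


def issue(unit: str, dest: str, busy: dict, result: dict) -> None:
    """Bookkeeping on issue: mark the unit busy and claim the destination."""
    busy[unit] = True
    result[dest] = unit


busy = {"Integer": False, "Mult1": False, "Add": False}
result = {}   # register -> functional unit that will write it

issue("Integer", "F6", busy, result)               # e.g. LD F6,34(R2)
print(can_issue("Integer", "F2", busy, result))    # unit busy: structural hazard
print(can_issue("Add", "F6", busy, result))        # F6 already claimed: WAW hazard
print(can_issue("Add", "F8", busy, result))        # free unit, unclaimed register
```

Because issue is in order, a failed test here holds up every later instruction as well.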



2. Read operands: wait until no data hazards, then read operands (ID2)
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.



3. Execution: operate on operands (EX)
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.

4. Write result: finish execution (WB)
Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If there are none, it writes the result. If there is a WAR hazard, it stalls the instruction.
Example:

    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F8,F8,F14

The scoreboard would stall SUBD until ADDD reads its operands.
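
The WAR check at write result can be sketched as follows. It is an illustration only: the `UnitStatus` class and `can_write` function are invented for this example, the unit names Add1 and Add2 are hypothetical (the example needs two adders active at once), and a flag value of True is taken to mean "operand ready but not yet read".

```python
from dataclasses import dataclass


@dataclass
class UnitStatus:
    fi: str     # destination register
    fj: str     # first source register
    fk: str     # second source register
    rj: bool    # True: Fj is ready and has not been read yet
    rk: bool    # True: Fk is ready and has not been read yet


def can_write(unit: str, status: dict) -> bool:
    """A unit may write its result only if no unit still has to read the
    old value of the destination register (no WAR hazard)."""
    dest = status[unit].fi
    return all((f.fj != dest or not f.rj) and (f.fk != dest or not f.rk)
               for f in status.values())


status = {
    "Divide": UnitStatus("F0", "F2", "F4", rj=False, rk=False),   # DIVD, executing
    "Add1":   UnitStatus("F10", "F0", "F8", rj=False, rk=True),   # ADDD, waiting on F0
    "Add2":   UnitStatus("F8", "F8", "F14", rj=False, rk=False),  # SUBD, done executing
}
print(can_write("Add2", status))                 # ADDD has not read F8 yet
status["Add1"].rj = status["Add1"].rk = False    # ADDD reads its operands
print(can_write("Add2", status))
```

The first call reports that SUBD must stall; once ADDD has read F8, the second call lets SUBD write.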

Three Parts of the Scoreboard


1. Instruction status: which of the 4 steps the instruction is in

2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
   Busy: indicates whether the unit is busy or not
   Op: operation to perform in the unit (e.g., + or -)
   Fi: destination register
   Fj, Fk: source-register numbers
   Qj, Qk: functional units producing source registers Fj, Fk
   Rj, Rk: flags indicating when Fj, Fk are ready

3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register
Detailed Scoreboard Pipeline Control

Instruction status: Issue
  Wait until:  not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← Yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU

Instruction status: Read operands
  Wait until:  Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No

Instruction status: Execution complete
  Wait until:  Functional unit done
  Bookkeeping: (none)

Instruction status: Write result
  Wait until:  ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No


Scoreboard Example
This is the sample code we'll be working with in the example:

    LD    F6, 34(R2)
    LD    F2, 45(R3)
    MULTD F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2

Latencies (clock cycles):

    LD     1
    MULTD 10
    SUBD   2
    DIVD  40
    ADDD   2

What are the hazards in this code?
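
One way to answer is to compare every earlier instruction with every later one. The sketch below is a simplification for illustration: it checks pairs only, ignores intervening writes to the same register (there are none that matter in this program), and reports potential hazards rather than actual stalls. The names `PROGRAM` and `potential_hazards` are invented here; instructions are numbered from 1 in program order.

```python
PROGRAM = [  # (op, destination, source registers)
    ("LD",    "F6",  ["R2"]),
    ("LD",    "F2",  ["R3"]),
    ("MULTD", "F0",  ["F2", "F4"]),
    ("SUBD",  "F8",  ["F6", "F2"]),
    ("DIVD",  "F10", ["F0", "F6"]),
    ("ADDD",  "F6",  ["F8", "F2"]),
]


def potential_hazards(program):
    """Return (kind, earlier, later, register) for every dependent pair."""
    found = []
    for i, (_, dest_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            _, dest_j, srcs_j = program[j]
            if dest_i in srcs_j:                 # later reads what earlier writes
                found.append(("RAW", i + 1, j + 1, dest_i))
            if dest_j in srcs_i:                 # later writes what earlier reads
                found.append(("WAR", i + 1, j + 1, dest_j))
            if dest_i == dest_j:                 # both write the same register
                found.append(("WAW", i + 1, j + 1, dest_i))
    return found


for hazard in potential_hazards(PROGRAM):
    print(hazard)
```

The WAR pairs both involve ADDD writing F6 (read earlier by SUBD and DIVD), and the single WAW pair is the first LD and ADDD, which both write F6.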


Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30


Scoreboard Example Cycle 1

Issue LD #1. Each entry in the instruction status table shows the cycle in which that step occurred.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
1                           Integer


Scoreboard Example Cycle 2

LD #2 can't issue since the integer unit is busy. MULT can't issue because we require in-order issue.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
2                           Integer


Scoreboard Example Cycle 3

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
3                           Integer


Scoreboard Example Cycle 4

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6        R2                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
4                           Integer


Scoreboard Example Cycle 5

Issue LD #2 since the integer unit is now free.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
5             Integer


Scoreboard Example Cycle 6

Issue MULT.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6
MULTD F0, F2, F4    6
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      No
      Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
6      Mult1  Integer


Scoreboard Example Cycle 7

MULT can't read its operands (F2) because LD #2 hasn't finished.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7
MULTD F0, F2, F4    6
SUBD  F8, F6, F2    7
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
      Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
7      Mult1  Integer                Add


Scoreboard Example Cycle 8a

DIVD issues. MULT and SUBD are both waiting for F2.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7
MULTD F0, F2, F4    6
SUBD  F8, F6, F2    7
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2        R3                          Yes
      Mult1    Yes   Mult  F0   F2   F4   Integer           No   Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2            Integer  Yes  No
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
8      Mult1  Integer                Add   Divide


Scoreboard Example Cycle 8b

LD #2 writes F2.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6
SUBD  F8, F6, F2    7
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
8      Mult1                         Add   Divide

Scoreboard Example Cycle 9

Now MULT and SUBD can both read F2. How can both instructions do this at the same time?

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
10    Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
2     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
9      Mult1                         Add   Divide


Scoreboard Example Cycle 11

ADDD can't start because the add unit is busy.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9              11
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
8     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Sub   F8   F6   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
11     Mult1                         Add   Divide


Scoreboard Example Cycle 12

SUBD finishes. DIVD is waiting for F0.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
7     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      No
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
12     Mult1                               Divide


Scoreboard Example Cycle 13

ADDD issues.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
6     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
13     Mult1                Add            Divide


Scoreboard Example Cycle 14

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
5     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
2     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
14     Mult1                Add            Divide


Scoreboard Example Cycle 15

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
4     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
1     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
15     Mult1                Add            Divide


Scoreboard Example Cycle 16

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
3     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
0     Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
16     Mult1                Add            Divide


Scoreboard Example Cycle 17

ADDD can't write F6 because DIVD has not yet read it. WAR!

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
2     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
17     Mult1                Add            Divide


Scoreboard Example Cycle 18

Nothing happens!

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
1     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
18     Mult1                Add            Divide


Scoreboard Example Cycle 19

MULT completes execution.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9              19
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
0     Mult1    Yes   Mult  F0   F2   F4                     Yes  Yes
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1             No   Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
19     Mult1                Add            Divide


Scoreboard Example Cycle 20

MULT writes.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9              19              20
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8
ADDD  F6, F8, F2    13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
20                          Add            Divide


Scoreboard Example Cycle 21

DIVD reads its operands.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9              19              20
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8      21
ADDD  F6, F8, F2    13     14             16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2                     Yes  Yes
      Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
21                          Add            Divide


Scoreboard Example Cycle 22

Now ADDD can write since the WAR hazard has been removed.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9              19              20
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8      21
ADDD  F6, F8, F2    13     14             16              22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
40    Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
22                                         Divide


Scoreboard Example Cycle 61

DIVD completes execution.

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9              19              20
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8      21             61
ADDD  F6, F8, F2    13     14             16              22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   Yes   Div   F10  F0   F6                     Yes  Yes

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
61                                         Divide


Scoreboard Example Cycle 62

DONE!

Instruction status
Instruction         Issue  Read operands  Exec. complete  Write result
LD    F6, 34(R2)    1      2              3               4
LD    F2, 45(R3)    5      6              7               8
MULTD F0, F2, F4    6      9              19              20
SUBD  F8, F6, F2    7      9              11              12
DIVD  F10, F0, F6   8      21             61              62
ADDD  F6, F8, F2    13     14             16              22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No

Register result status (FU that will write each register)
Clock  F0     F2       F4   F6       F8    F10     F12  ...  F30
62
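
The final instruction status table can be reproduced with a short calculation. The sketch below is a simplified model written for this walkthrough, not a cycle-accurate scoreboard: it assumes the timing conventions visible in the example (one issue per cycle in program order, a unit can accept a new instruction the cycle after its previous one writes, and a consumer reads the cycle after its producer writes). All names (`schedule`, `UNIT_COUNT`, and so on) are invented for the illustration.

```python
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "DIVD": 40, "ADDD": 2}
UNIT = {"LD": "Integer", "MULTD": "Mult", "SUBD": "Add", "ADDD": "Add", "DIVD": "Divide"}
UNIT_COUNT = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}

PROGRAM = [  # (op, destination, source registers)
    ("LD", "F6", ["R2"]),
    ("LD", "F2", ["R3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program):
    """Return (issue, read, exec_complete, write) cycles per instruction."""
    free_at = {u: [1] * n for u, n in UNIT_COUNT.items()}  # earliest issue cycle per unit copy
    earlier = []        # (dest, srcs, read, write) of instructions already placed
    prev_issue = 0
    rows = []
    for op, dest, srcs in program:
        copies = free_at[UNIT[op]]
        k = copies.index(min(copies))                       # copy that frees first
        waw = max((w for d, _, _, w in earlier if d == dest), default=0)
        issue = max(prev_issue + 1, copies[k], waw + 1)     # in order, unit free, no WAW
        raw = max((w for d, _, _, w in earlier if d in srcs), default=0)
        read = max(issue + 1, raw + 1)                      # RAW: wait for producers
        done = read + LATENCY[op]
        war = max((r for _, s, r, _ in earlier if dest in s), default=0)
        write = max(done + 1, war + 1)                      # WAR: wait for earlier readers
        copies[k] = write + 1                               # unit stays busy through write
        earlier.append((dest, srcs, read, write))
        prev_issue = issue
        rows.append((issue, read, done, write))
    return rows


for (op, dest, _), row in zip(PROGRAM, schedule(PROGRAM)):
    print(op, dest, row)
```

Under these assumptions the ADDD row comes out with write result at cycle 22, one cycle after DIVD reads F6, which is the WAR stall seen in cycles 17 through 21.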


Tomasulo Algorithm

Another Dynamic Algorithm: Tomasulo Algorithm


- For the IBM 360/91, about 3 years after the CDC 6600 (1966)
- Goal: high performance without special compilers
- Differences between the IBM 360 and CDC 6600 ISAs:
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  - IBM has 4 FP registers vs. 8 in the CDC 6600
- Why study? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...

This is the example that the text uses in Sections 3.2 & 3.3.


Tomasulo Algorithm vs. Scoreboard


- Control & buffers are distributed with the function units (FUs) vs. centralized in the scoreboard
  - FU buffers are called reservation stations; they hold pending operands
- Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
  - Avoids WAR, WAW hazards
  - More reservation stations than registers, so it can do optimizations compilers can't
- Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
- Loads and stores are treated as FUs with RSs as well
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue
Tomasulo Organization

[Figure: Tomasulo organization. Labeled parts: FP Op Queue, FP Registers, Load Buffers (Load1 to Load6, from memory), Store Buffers (to memory), Reservation Stations (Add1, Add2, Add3 for the FP adders; Mult1, Mult2 for the FP multipliers), Common Data Bus (CDB).]


Reservation Station Components


Op: operation to perform in the unit (e.g., + or -)
Vj, Vk: values of the source operands
  - Store buffers have a V field: the result to be stored
Qj, Qk: reservation stations producing the source registers (value to be written)
  - Note: no ready flags as in the scoreboard; Qj, Qk = 0 => ready
  - Store buffers only have Qi, for the RS producing the result
Busy: indicates the reservation station or FU is busy

Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instructions will write that register.


Three Stages of Tomasulo Algorithm


1. Issue: get instruction from the FP Op Queue
If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers).

2. Execution: operate on operands (EX)
When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.

3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units; mark the reservation station available.

- Normal data bus: data + destination ("go to" bus)
- Common data bus: data + source ("come from" bus)
  - 64 bits of data + 4 bits of Functional Unit source address
  - Write if it matches the expected Functional Unit (which produces the result)
  - Does the broadcast
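
The "come from" behaviour of the Common Data Bus can be sketched as follows. This is a minimal illustration under assumptions, not the 360/91 design: the `ReservationStation` class and `broadcast` function are invented here, `None` plays the role of Qj, Qk = 0, and the values 2.0 and 6.0 are arbitrary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    op: str
    vj: Optional[float] = None   # value of first operand, once known
    vk: Optional[float] = None   # value of second operand, once known
    qj: Optional[str] = None     # station that will produce Vj (None: ready)
    qk: Optional[str] = None     # station that will produce Vk (None: ready)

    def ready(self) -> bool:
        return self.qj is None and self.qk is None


def broadcast(source, value, stations, register_status, registers):
    """Write result: everything waiting on `source` captures the value."""
    for rs in stations.values():
        if rs.qj == source:
            rs.vj, rs.qj = value, None
        if rs.qk == source:
            rs.vk, rs.qk = value, None
    for reg, producer in list(register_status.items()):
        if producer == source:
            registers[reg] = value
            del register_status[reg]


stations = {"Add1": ReservationStation(op="ADDD", vk=2.0, qj="Mult1")}
register_status = {"F0": "Mult1"}   # F0 will be written by Mult1
registers = {}

broadcast("Mult1", 6.0, stations, register_status, registers)
print(stations["Add1"].ready(), stations["Add1"].vj, registers)
```

Listeners match on the tag of the producing station rather than on a register number, which is why results can reach waiting stations without passing through the register file.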


Tomasulo Example Cycle 0


Instruction status
Instruction         Issue  Execution complete  Write result
LD    F6, 34(R2)
LD    F2, 45(R3)
MULTD F0, F2, F4
SUBD  F8, F6, F2
DIVD  F10, F0, F6
ADDD  F6, F8, F2

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Reservation stations
Time  Name   Busy  Op  Vj  Vk  Qj  Qk
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status (FU that will write each register)
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
0


Review: Tomasulo
- Prevents the register file from becoming a bottleneck
- Avoids the WAR, WAW hazards of the scoreboard
- Allows loop unrolling in HW
- Not limited to basic blocks (provided branch prediction)
- Lasting contributions:
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- 360/91 descendants are the PowerPC 604, 620; MIPS R10000; HP-PA 8000; Intel Pentium Pro
