You are on page 1of 214

Advanced Computer Architectures

Lecture 6: Pipeline Hazards and Their Resolution Mechanisms


Mr. Bhabani shankar Prasad Mishra. SCHOLE OF TECHNOLOGY

KIIT UNIVERSITY
BHUBANESWAR
1

Module Objectives

Hazards, their causes, and resolution


Branch prediction Exploiting loop-level parallelism Dynamic instruction scheduling:

Scoreboarding and Tomasulos algorithm

Compiler techniques for exposing ILP


Superscalar and VLIW processing

Survey of some modern processors


2

Introduction

What is ILP (Instruction-Level Parallelism)?

Parallel execution of different instructions belonging to the same thread.

A thread usually consists of several basic blocks:

As well as several branches and loops.


A sequence of instructions not having a branch instruction.
3

Basic block:

Introduction
cont

Instruction pipelines can effectively exploit parallelism in a basic block:


Pipelining can be viewed to:


An n-stage pipeline can improve performance up to n times. Does not require much investment in hardware Transparent to the programmers. Decrease average CPI, and/or Decrease clock cycle time for instructions.
4

Drags on Pipeline Performance

Factors that can degrade pipeline performance:


Unbalanced stages Pipeline overheads Clock skew Hazards

Hazards cause the worst drag on the performance of a pipeline.


5

Pipeline Hazards

What is a pipeline hazard?

A situation that prevents an instruction from executing during its designated clock cycles.

There are 3 classes of hazards:


Structural Hazards Data Hazards Control Hazards

Arise from resource conflicts among instructions executing concurrently:

Structural Hazards

Same resource is required by two (or more) concurrently executing instructions at the same time.

Easy way to avoid structural hazards:

Duplicate resources (sometimes not practical)


Memory interleaving ( lower & higher order ) An ALU to perform an arithmetic operation and an adder to increment PC. Separate data cache and instruction cache accessed simultaneously in the same cycle.
7

Examples of Resolution of Structural Hazard:


Structural Hazard: Example

IF

ID IF

EXE MEM ID IF EXE ID IF

WB MEM EXE ID WB MEM EXE WB MEM

WB

An Example of a Structural Hazard


Load
Mem Reg
ALU

DM

Reg

Instruction 1
Instruction 2 Instruction 3 Instruction 4

Mem

Reg

ALU

DM

Reg

Mem

Reg

ALU

DM

Reg

Mem

Reg

ALU

DM

Reg

Mem

Reg

ALU

DM

Reg

Time

Would there be a hazard here?

How is it Resolved?
Load
Mem Reg
ALU

DM

Reg

Instruction 1 Instruction 2 Stall Instruction 3

Mem

Reg

ALU

DM

Reg

Mem

Reg

ALU

DM

Reg

Bubble

Bubble

Bubble

Bubble

Bubble

Mem

Reg

ALU

DM

Reg

Time

A Pipeline can be stalled by inserting a bubble or NOP

10

Performance with Stalls

Stalls degrade performance of a pipeline:


Result

in deviation from 1 instruction executing/clock cycle. Lets examine by how much stalls can impact CPI

11

Stalls and Performance


CPI pipelined =
=Ideal CPI + Pipeline stall cycles per instruction =1 + Pipeline stall cycles per instruction

Ignoring overhead and assuming stages are balanced: CPI unpipelined Speedup 1 pipeline stall cycles per instruction

12

Speedup Due to Pipelining

1 Pipeline depth 1 Pipeline stall cycles per instruction

13

Alternate Speedup Expression


Clock cycle pipelined
Pipeline depth
Speedup from pipelining

Clock cycle unpipelined Pipeline depth

Clock cycle unpipelined Clock cycle pipelined

1 Clock cycle unpipelined 1 Pipeline stall cycles per instruction Clock cycle pipelined

14

An Example of Performance Impact of Structural Hazard

Assume:

Pipelined processor.
Data references constitute 40% of an instruction mix. Ideal CPI of the pipelined machine is 1.

Consider two cases:

Unified data and instruction cache vs. separate data and instruction cache.

What is the impact on performance?


15

An Example

Avg. Inst. Time = CPI x Clock Cycle Time


(ii) For Unified cache case:
= (1 + 0.4 x 1) x (Clock cycle timeideal) = 1.4 x Clock cycle timeideal=1.4

cont

(i) For Separate cache: Avg. Instr. Time=1*1=1

Speedup= 1/1.4 = 0.7

30% degradation in performance


16

Data Hazards

Occur when an instruction under execution depends on:

Data from an instruction ahead in pipeline.


A=B+C; D=A+E; IF ID IF EXE MEM ID EXE WB MEM

Example:
A=B+C;
D=A+E;

WB

Dependent instruction uses old data:

Results in wrong computations


17

Types of Data Hazards

Data hazards are of three types:


With an in-order execution machine:

Read After Write (RAW) Write After Read (WAR) Write After Write (WAW)

Assume instruction i is issued before j.


18

WAW, WAR hazards can not occur.

Read after Write (RAW) Hazards

Hazard between two instructions I & J may occur when j attempts to read some data object that has been modified by I.

instruction j tries to read its operand before instruction i writes it.

j would incorrectly receive an old or incorrect value. i: ADD R1, R2, R3 j: SUB R4, R1, R6 Example:

Instruction j is a read instruction issued after i

Instruction i is a write instruction issued before j


19

Read after Write (RAW) Hazards


D (I) Instn . I
Write

R (I) D (J)
Instn . J Read

RAW

R (J)

R (I) D (J) for RAW


20

Example program (a):


RAW Dependency: More Examples

i1: load r1, addr; i2: add r2, r1,r1; i1: mul r1, r4, r5; i2: add r2, r1, r1;

Program (b):

Both cases, i2 does not get operand until i1 has completed writing the result

In (a) this is due to load-use dependency In (b) this is due to define-use dependency
21

Write after Read (WAR) Hazards

Hazard may occur when j attempts to modify (write) some data object that is going to read by I.

Instruction J tries to write its operand at destination before instruction I read it.

I would incorrectly receive a new or incorrect value. i: ADD R1, R2, R3 j: SUB R2, R4, R6 Example:

Instruction i is a read instruction issued before j


22

Instruction j is a write instruction issued after i

Write after Read (WAR) Hazards


D (J) Instn . J
Write

R (J) D (I)
Instn . I Read

WAR

R (I)

D (I) R (J) for WAR


23

Write After Write (WAW) Hazards

WAW hazard:

Both I & J wants to modify a same data object. instruction j tries to write an operand before instruction i writes it. Writes are performed in wrong order.

Example:

i: DIV F1, F2, F3 j: SUB F1, F4, F6

(How can this happen???)

Instruction j is a write instruction issued after i

Instruction i is a write instruction issued before j


24

Write After Write (WAW) Hazards


D (I) Instn . I
Write

R (I) R (J)
Instn . J Write

WAW

D (J)

R (I) R (J) for WAW


25

Example program (a):


WAR and WAW Dependency: More Examples


i1: mul r1, r2, r3; i2: add r2, r4, r5; i1: mul r1, r2, r3;

Example program (b):

i2: add r1, r4, r5;


in (a) r2 must be read before it is written into in (b) r1 must be written by i2 after it has been written into by i1
26

Both cases have dependence between i1 and i2


Inter-Instruction Dependences
Data dependence
r3 r1 op r2 Read-after-Write r5 r3 op r4 (RAW) Anti-dependence r3 r1 op r2 Write-after-Read r1 r4 op r5 (WAR) Output dependence r3 r1 op r2 Write-after-Write r5 r3 op r4 (WAW) r3 r6 op r7
Control dependence
27

False Dependency

Data Dependencies : Summary


Data dependencies in straight-line code RAW Read After Write dependency
( Flow dependency )

WAR Write After Read dependency


( Anti dependency )

WAW Write After Write dependency


( Output dependency )

Load-Use dependency

Define-Use dependency

True dependency
Cannot be overcome

False dependency
Can be eliminated by register renaming
28

Solutions to Data Hazard


Operand forwarding

By S/W (NOP)
Reordering the instruction

29

Recollect Data Hazards


What causes them?

Pipelining changes the order of read/write accesses to operands. Order differs from that of an unpipelined machine.
ADD R1, R2, R3 SUB R4, R1, R5

Example:

For MIPS, ADD writes the register in WB but SUB needs it in ID.

This is a data hazard


30

Illustration of a Data Hazard


ADD R1, R2, R3
Mem Reg
ALU

DM

Reg

SUB R4, R1, R5


AND R6, R1, R7 OR R8, R1, R9 XOR R10, R1, R11
Time

Mem

Reg

ALU

DM

Reg

Mem

Reg

ALU

DM

Mem

Reg

Mem

Reg

ADD instruction causes a hazard in next 3 instructions because register not written until after those 3 read it.
31

ALU

Forwarding

Simplest solution to data hazard:

Result of the ADD instruction not really needed:

forwarding

until after ADD actually produces it.

Can we move the result from EX/MEM register to the beginning of ALU (where SUB needs it)?

Yes!
32

Forwarding

cont

Generally speaking:
Forwarding

occurs when a result is passed directly to the functional unit that requires it.
goes from output of one pipeline stage to input of another.

Result

33

Forwarding Technique

Latch EXECUTE ALU

Latch WRITE RESULT

Forwarding Path

34

When Can We Forward?


ALU

ADD R1, R2, R3

Mem

Reg

DM

Reg

SUB R4, R1, R5

Mem

Reg

ALU

DM

Reg

SUB gets info. from EX/MEM pipe register AND gets info. from MEM/WB pipe register OR gets info. by forwarding from register file

AND R6, R1, R7

Mem

Reg

ALU

DM

OR R8, R1, R9

Mem

Reg

XOR R10, R1, R11

Mem

Reg

Time

If line goes forward you can do forwarding. If its drawn backward, its physically impossible.
35

ALU

Handling data hazard by S/W

Compiler introduce NOP in between two instructions NOP = a piece of code which keeps a gap between two instruction

Detection of the dependency is left entirely on the S/W


Advantage :- We find the easy technique called as instruction reordering.
36

Instruction Reordering

ADD SUB

R1 , R2 , R3 R4 , R1 , R5

Before

XOR
AND ADD XOR AND SUB

R8 , R6 , R7
R9 , R10 , R11 R1 , R2 , R3 R8 , R6 , R7 R9 , R10 , R11 R4 , R1 , R5
37

After

Control Hazards

Result from branch and other instructions that change the flow of a program (i.e. change PC).
Example:
1: If(cond){
2: 3: s2 s1}

Statement in line 2 is control dependent on statement at line 1. Until condition evaluation completes:

It is not known whether s1 or s2 will execute next.


38

Can You Identify Control Dependencies?


1: if(cond1){ 2: 3: 4: s1; if(cond2){ s2;}

5: }

39

Solutions to Branch Hazards

Three simplest methods of dealing with branches:

Flush Pipeline:

Branch Not Taken:

Redo the instructions following a branch, once an instruction is detected to be branch during the ID stage.

Another scheme is delayed branch.

A slightly higher performance scheme is to assume every branch to be not taken.

40

An Example of Impact of Branch Penalty

Assume for a MIPS pipeline:


16%

of all instructions are branches:


4% unconditional branches: 3 cycle penalty 12% conditional: 50% taken: 3 cycle penalty

41

Impact of Branch Penalty

For a sequence of N instructions:

N cycles to initiate each


3 * 0.04 * N delays due to unconditional branches

0.5 * 3 * 0.12 * N delays due to conditional taken


1.3*N (or 1.3 cycles/instruction) 30% Performance Hit!!!
42

Overall CPI=

Reducing Branch Penalty

Two approaches:
1) Move condition comparator to ID stage:
Decide branch outcome and target address in the ID stage itself:

Reduces branch delay to 2 cycles.

2)Branch prediction
43

Four Simple Branch Hazard Solutions


#1: Stall

until branch direction is clear flushing pipe Execute successor instructions in sequence as if there is no branch undo instructions in pipeline if branch actually taken

#2: Predict Branch Not Taken

47% branches not taken on average


44

Four Simple Branch Hazard Solutions cont


#3: Predict Branch Taken

53% branches taken on average.


But branch target address not available after IF in MIPS

MIPS still incurs 1 cycle branch penalty even with predict taken Other machines: branch target known before branch outcome computed, significant benefits can accrue
45

Four Simple Branch Hazard Solutions cont


#4: Delayed Branch
Insert unrelated successor in the branch delay slot branch instruction sequential successor1 sequential successor2 ........ Branch delay of sequential successorn branch target if taken 1 slot delay required in 5 stage pipeline

length n

46

Delayed Branch

Simple idea: Put an instruction that would be executed anyway right after a branch.
Branch Delayed slot instruction Branch target OR successor IF ID IF EX MEM WB

ID EX MEM delay slotWB IF ID EX MEM WB

Question: What instruction do we put in the delay slot?


Answer: one that can safely be executed no matter what the branch does.

The compiler decides this.


47

Delayed Branch

One possibility: An instruction from before Example:


R1, R2, R3

DADD

DADD if

R1, R2, R3 then

if R2 == 0 then

R2 == 0

delay slot
. . .

DADD

R1, R2, R3

The DADD instruction is executed no matter what happens in the branch:


Because it is executed before the branch! Therefore, it can be moved


48

Delayed Branch

We get to execute the DADD execution for free


branch
add instruction branch target/successor

IF

ID IF

EX ID IF

MEM WB EX ID MEM WB EX MEM WB

By this time, we know whether to take the branch or whether not to take it
49

Delayed Branch

Another possibility: An instruction much before

Example:

DSUB R4, R5, R6 ... DADD R1, R2, R3 if R1 == 0 delay slot then

The DSUB instruction can be replicated into the delay slot, and the branch target can be changed
50

Delayed Branch

Another possibility: An instruction from before

Example:

DSUB R4, R5, R6 ... DADD R1, R2, R3 if R1 == 0 then

DSUB R4, R5, R6

The DSUB instruction can be replicated into the delay slot, and the branch target can be changed
51

Delayed Branch

Yet another possibility: An instruction from inside the taken path Example:
DADD R1, R2, R3 if R1 == 0 delay slot OR R7, R8, R9 DSUB R4, R5, R6 then

The OR instruction can be moved into the delay slot ONLY IF its execution doesnt disrupt the program execution (e.g., R7 is overwritten later)
52

Delayed Branch

Third possibility: An instruction from inside the taken path Example:


DADD R1, R2, R3 if R1 == 0 then

OR R7, R8, R9 OR R7, R8, R9 DSUB R4, R5, R6

The OR instruction can be moved into the delay slot ONLY IF its execution doesnt disrupt the program execution (e.g., R7 is overwritten later)
53

Delayed Branch Example


B1

LD
DSUBU BEQZ OR L:

R1,0(R2) R1,R1,R3

R1 != 0

LD DSUBU BEQZ

R1,0(R2) R1,R1,R3 R1,L

R1,L
R4,R5,R6

OR R4,R5,R6 DADDU R10,R4,R3 B2

R1 == 0

DADDU R10,R4,R3
DADDU R7,R8,R9

DADDU R7,R8,R9 B3 1.) BEQZ is dependent on DSUBU and DSUBU on LD, 2.) If we knew that the branch was taken with a high probability, then DADDU could be moved into block B1, since it doesnt have any dependencies with block B2,

3.) Conversely, knowing the branch was not taken, then OR could be moved into block B1, since it doesnt affect anything in B3,

54

Delayed Branch

Where to get instructions to fill branch delay slots?


Before

branch instruction

From

the target address: Useful only if branch taken.


fall through: Useful only if branch not taken.

From

55

Delayed Branch

cont

Compiler effectiveness for single branch delay slot:

Delayed Branch downside: what if multiple instructions issued per clock cycle (superscalar)?

Fills about 60% of branch delay slots. About 80% of instructions executed in branch delay slots useful in computation. About (60% x 80%) i.e. 50% of slots usefully filled.

56

Performance of branch with Stalls

Stalls degrade performance of a pipeline:


Result

in deviation from 1 instruction executing/clock cycle. Lets examine by how much stalls can impact CPI

57

Stalls and Performance with branch


CPI pipelined =
=Ideal CPI + Pipeline stall cycles per instruction =1 + Pipeline stall cycles per instruction

58

Performance of branch instn

Pipeline speed up

Pipeline depth 1+ pipeline stall cycle from branch

Pipeline stall cycle from branches = Branch frequency * branch penalty

Pipeline speed up =

Pipeline depth

1+ Branch frequency * Branch Penalty


59

Program Dependences Can Cause Hazards!

Hazards can be caused by dependences within a program. There are three main types of dependences in a program:
Data dependence Name dependence Control dependence

60

Data Dependences

An instruction j is data dependent on instruction i, if either of:


Direct: Instruction i produces a result that r3 r1 op r2 is used by instruction j. r5 r3 op r4 Transitive:

Instruction j is data dependent on instruction k and Instruction k is data dependent on instruction i.


r3 r1 op r2 r4 r3 op r2 r5 r6 op r4
61

Detecting Data Dependences

A data value may flow between instructions:

(i) through registers

(ii) through memory locations.


Detection is rather straight forward. Detection is difficult.

When data flow is through a register:

When data flow is through a memory location:

Two addresses may refer to the same memory location but look different.
100(R4) and 20(R6)
62

Types of Data Dependences

Two types of data dependences:


True data dependence. Name dependence:

Two types of name dependences:


Two instructions use the same register or memory location (called a name). There is no true flow of data between the two instructions. Example: A=B+C; A=P+Q;

Anti-dependence Output dependence


63

Anti-Dependence or (WAR)

Anti-dependence occurs between two instructions i and j, iff:

j writes to a register or memory location that i reads.

Original ordering must be preserved to ensure that i reads the correct value.
ADD F0,F6,F8 SUB F8,F4,F5
64

Example:

Output Dependence or (WAW)

Output dependence occurs between two instructions i and j, iff:

The two instructions write to the same memory location.

Ordering of the instructions must be preserved to ensure:

Finally written value corresponds to j.

Example:- ADD f6,f0,f8


Mul f6,f10,f8

65

Exercise

Identify all the dependences in the following C code:


1. 2. 3. 4.

a=b+c; b=c+d; a=a+c; c=b+a;

66

Hazard Resolution

Name dependences:

Once identified, can be easily eliminated through simple compiler renaming techniques. Memory-related dependences are difficult to identify:

True data dependences:


Hardware techniques (scoreboarding and dynamic instruction scheduling) are being used.

More difficult to handle. Can not be eliminated; can only be overcome! Many techniques have evolved over the years.
67

A Solution to WAR and WAW Hazards

Rename Registers

i1: mul r1, r2, r3; i2: add r6, r4, r5;

Register renaming can get rid of most false dependencies:

Compiler can do register renaming in the register allocation process (i.e., the process that assigns registers to variables).

68

Dependences and Hazards


Dependences
True Data Name

Hazards
RAW

Output
Anti

WAW
WAR Control

Control

------

Structural

69

Out-of-order Pipelining
IF ID RD EX INT Fadd1 Fadd2

Program Order Ia: F1 F2 x F3 ..... Ib: F1 F4 + F5

Fmult1 Fmult2 Fmult3

LD/ST

Out-of-order WB
Ib: F1 F4 + F5 ...... Ia: F1 F2 x F3
70

WB

Use of Compiler Techniques to Tackle Data hazards

A compiler can help eliminate some stalls caused by data hazards:

Example: an instruction that uses result of a LOADs destination register should not immediately follow the LOAD instruction. compiler-based pipeline instruction scheduling
71

The technique is called:

Hardware Techniques to Deal with Hazards

Simple solution
Stall

pipeline

Pipeline stall:
Lets

some instruction(s) in pipeline proceed, others are made to wait for data, resource, etc.

72

How to Implement Hazard Control Logic?

In a pipeline,

All data hazards can be checked during ID phase of pipeline. If a data hazard is detected, next instruction should be stalled. Whether forwarding is needed can also be determined at this stage, control signals set. Control unit of pipeline must stall pipeline and prevent instructions in IF, ID from advancing.
73

If hazard is detected,

Modern Computer Architectures

Lecture 8: Branch Prediction


Mr. Bhabani Shankar Prasad Mishra.
KIIT UNIVERSITY
BHUBANESWAR

74

Delayed Branch

cont

Compiler effectiveness for single branch delay slot:

Delayed Branch downside: what if multiple instructions issued per clock cycle (superscalar)?

Fills about 60% of branch delay slots. About 80% of instructions executed in branch delay slots useful in computation. About (60% x 80%) i.e. 50% of slots usefully filled.

75

Branch Prediction
KEY IDEA: Hope that branch assumption is correct.
If yes, then weve gained a performance improvement.

Otherwise, discard instructions


program is still correct, all weve done is waste a clock cycle.
Two approaches

Direction Based Prediction

Profile Based Prediction


76

Direction Based Prediction


Simple to implement
However, often branch behaviour is variable (dynamic).
Cant capture such behaviour at compile time with simple direction based prediction! Need history (aka profile)-based prediction.
77

History-based Branch Prediction

An important example is State-based branch prediction: Needs 2 parts:

Predictor to guess where/if instruction will branch (and to where)


Recovery Mechanism: i.e. a way to fix mistakes

78

One bit predictor:

History-based Branch Prediction

cont

Use result from last time this instruction executed. Even if branch is almost always taken, we will be wrong at least twice if branch alternates between taken, not taken

Problem:

We get 0% accuracy
79

1-bit Predictor

Set bit to 1 or 0:
Depending
Pipeline If

(T) or Not-taken (NT)

on whether branch Taken

checks bit value and predicts

incorrect then need to discard speculatively executed instruction

Actual outcome used to set the bit value.


80

Example

Let initial value = T, actual outcome of branches is- NT, NT,NT,T,T,T

Predictions are: T, NT,NT,NT,T,T

2 wrong (in red), 4 correct = 66% accuracy

2-bit predictors can do even better

In general, can have k-bit predictors.

81

2-bit Dynamic Branch Prediction Scheme

Change prediction only if twice mispredicted:


T Predict Taken NT Predict Taken

11
T

T NT

10
NT

Predict Not Taken

01

00

Predict Not Taken

Adds hysteresis to decision making process


82

NT

An Example of Computing Performance

Program assumptions:

23% loads and in of cases, next instruction uses load value 13% stores 19% conditional branches 2% unconditional branches 43% other

83

Example

Machine Assumptions:
5

cont

stage pipe

Penalty of 1 cycle on use of load value immediately after a load. Jumps are resolved in ID stage for a 1 cycle branch penalty. 75% branch prediction accuracy. 1 cycle delay on misprediction.
84

Example

CPI penalty calculation:

cont

Loads:

50% of the 23% of loads have 1 cycle penalty: .5*.23=0.115

Jumps:

All of the 2% of jumps have 1 cycle penalty: 0.02*1 = 0.02


25% of the 19% are mispredicted, have a 1 cycle penalty: 0.25*0.19*1 = 0.0475

Conditional Branches:

Total Penalty: 0.115 + 0.02 + 0.0475 = 0.1825 Average CPI: 1 + 0.1825 = 1.1825

85

Exploiting Loop-level Parallelism: Motivation

An instruction pipeline essentially exploits ILP within a basic block:


On the average the size of a basic block is 7. After every 7 instructions, a branch instruction is encountered.

To obtain substantial performance benefits:

ILP across multiple basic blocks need to be exploited.


86

Software-based Scheduling vs. Hardware-based Scheduling


Disadvantage with compilers:
Examples:
In many cases, many information can not be extracted from code pointers to the same memory location. Value of the induction variable of a loop

It is still possible to assist hardware by exposing more ILP:

Rearrange instructions for increased performance


87

Loop-level Parallelism

It may be possible to execute different iterations of a loop in parallel. Example:


For(i=0;i<1000;i++){ a[i]=a[i]+b[i]; b[i]=b[i]*2; }

88

Problems in Exploiting Looplevel Parallelism


Loop Carried Dependences: Loop Independent Dependences:

A dependence across different iterations of a loop.

A dependence within the body of the loop itself (i.e. within one iteration).

89

Loop-level Dependence

Example:

For(i=0;i<1000;i++){ a[i+1]=b[i]+c[i]

b[i+1]=a[i+1]+d[i];
}

Loop-carried dependence from one iteration to the preceding iteration. Also, loop-independent dependence on account of a[i+1]
90

Eliminating Loop-level Dependences Through Code Transformations

We shall examine 3 techniques:


Static

loop unrolling Basic block transformations Software pipelining

91

Static Loop Unrolling


- A high proportion of loop instructions are loop management instructions.
- Eliminating this overhead can significantly increase the performance of the loop. - for(i=1000;i>0;i--){ -} a[i]=a[i]+c;

92

Static Loop Unrolling


L.D F0,0(R1) F4,F0,F2 ; F0 = array elem. ; add scalar in F2

Loop :

ADD.D
S.D

F4,0(R1)

; store result
; decrement ptr

DADDUI R1,R1,#-8 BNE

R1,R2,Loop ; branch if R1 !=R2

93

Static Loop Unrolling


cont
Loop : L.D ADD.D S.D F0,0(R1) F4,F0,F2 F4,0(R1) F6,-8(R1) F8,F6,F2

L.D
ADD.D S.D L.D ADD.D S.D L.D

F8,-8(R1)
F10,-16(R1) F12,F10,F2 F12,-16(R1) F14,-24(R1) F16,F14,F2 F16,-24(R1)

ADD.D
S.D DADDUI BNE

R1,R1,#-32
R1,R2,Loop
94

Static Loop Unrolling


cont
Loop : L.D F0,0(R1) ADD.D F4,F0,F2 S.D L.D F4,0(R1) F6,-8(R1) Note the renamed registers. This eliminates dependencies between each of n loop bodies of different iterations.

ADD.D F8,F6,F2 n loop Bodies for n=4 S.D L.D F8,-8(R1) F10,-16(R1)

ADD.D F12,F10,F2 S.D L.D F12,-16(R1) Note the adjustments for store and load offsets (only store highlighted red)!

F14,-24(R1)

ADD.D F16,F14,F2 S.D Adjusted loop overhead instructions F16,-24(R1)

DADDUI R1,R1,#-32 BNE R1,R2,Loop


95

Transformation of A Basic Block

It is possible to rewrite a loop to eliminate loop-carried dependences:

Only if, there are no cyclic dependences.


a[1]=a[1]+b[1];
for(i=1;i<999;i++){ b[i+1]=c[i]+d[i];

for(i=1;i<1000;i++){ a[i]=a[i]+b[i];

b[i+1]=c[i]+d[i];
} } With dependence

a[i+1]=a[i+1]+b[i+1];
b[1000]=c[999]+d[999]; Without dependence
96

Modern Computer Architectures

Lecture 11:Software Pipelining and Predicated Instructions


Mr. Bhabani Shankar Prasad Mishra.
KIIT UNIVERSITY
BHUBANESWAR
97

Software Pipelining

Eliminates loop-independent dependence through code restructuring.


Reduces

stalls Helps achieve better performance in pipelined execution.

As compared to simple loop unrolling:


Consumes

less code space


98

Software Pipelining
cont

Central idea: reorganize loops

Each iteration is made from instructions chosen from different iterations of the original loop.
i0

Software Pipeline Iteration

i1 i2 i3 i4

i5

99

Software Pipelining
cont

Exactly just as it happens in a hardware pipeline:


In

each iteration of a software pipelined code, some instruction of some iteration of the original loop is executed.

100

Software Pipelining
cont

- How is this done?

1 unroll loop body with an unroll factor of n. (we have taken n = 3 for our example) 2 select order of instructions from different iterations to pipeline 3 paste instructions from different iterations into the new pipelined loop body
101

Static Loop Unrolling Example


L.D F0,0(R1) F4,F0,F2 ; F0 = array elem. ; add scalar in F2

Loop :

ADD.D
S.D

F4,0(R1)

; store result
; decrement ptr

DADDUI R1,R1,#-8 BNE

R1,R2,Loop ; branch if R1 !=R2

102

Software Pipelining: Step 1


Iteration i: L.D F0,0(R1) F4,F0,F2 F4,0(R1) F0,0(R1) F4,F0,F2

Note: 1.) We are unrolling the loop Hence no loop overhead Instructions are needed! 2.) A single loop body of restructured loop would contain instructions from different iterations of the original loop body.

ADD.D
S.D Iteration i + 1: L.D

ADD.D
S.D Iteration i + 2: L.D ADD.D S.D

F4,0(R1)
F0,0(R1) F4,F0,F2 F4,0(R1)

103

Software Pipelining: Step 2


Iteration i: L.D F0,0(R1) F4,F0,F2 F4,0(R1) F0,0(R1) F4,F0,F2 2.) 1.)

ADD.D
S.D Iteration i + 1: L.D

Notes: 1.) Well select the following order in our pipelined loop: 2.) Each instruction (L.D ADD.D S.D) must be selected at least once to make sure that we dont leave out any instructions of the original loop in the pipelined loop.

ADD.D
S.D Iteration i + 2: L.D ADD.D S.D

F4,0(R1)
F0,0(R1) F4,F0,F2 F4,0(R1) 3.)

104

Software Pipelining: Step 3


Iteration i: L.D F0,0(R1) F4,F0,F2 F4,0(R1) F0,0(R1) F4,F0,F2 2.) 1.) Loop : S.D F4,16(R1) F4,F0,F2 F0,0(R1) The Pipelined Loop

ADD.D
S.D Iteration i + 1: L.D

ADD.D
L.D DADDU 3.) BNE

ADD.D
S.D Iteration i + 2: L.D ADD.D S.D

F4,0(R1)
F0,0(R1) F4,F0,F2 F4,0(R1)

R1,R1,#-8
R1,R2,Loop

105

Software Pipelining: Step 4


Preheader Instructions to fill software pipeline Loop : S.D ADD.D L.D BNE Postheader F4,16(R1) F4,F0,F2 F0,0(R1) R1,R2,Loop ; M[ i ] ; M[ i 1 ] ; M[ i 2 ]

Pipelined Loop Body

DADDUI R1,R1,#-8

Instructions to drain software pipeline

106

Software Pipelined Code


Loop : S.D
ADD.D F4,16(R1) ; M[ i ] F4,F0,F2 ; M[ i 1 ]

L.D
BNE

F0,0(R1)
R1,R2,Loop

; M[ i 2 ]

DADDUI R1,R1,#-8

107

Software Pipelining Issues

Register management can be tricky.

In more complex examples, we may need to increase the iterations between when data is read and when the results are used.

Optimal software pipelining has been shown to be an NP-complete problem:

Present solutions are based on heuristics.

108

Software Pipelining versus Loop Unrolling

Software pipelining takes less code space.


Software pipelining and loop unrolling reduce different types of inefficiencies:
Loop unrolling reduces loop management overheads. Software pipelining allows a pipeline to run at full efficiency by eliminating loopindependent dependencies.

109

Hardware Support for ILP: Predicated Instructions


Consider :
If (A == 0) {S = T;}

Following MIPS code would be generated:


BNEZ R1,L L : ADDU R2,R3,R0

With predicated instructions:


CMOVZ R2,R3,R1; if (R1 == 0) move R3 to R2

110

Advantages of Dynamic Scheduling

Can handle dependences unknown at compile time:


E.g.

dependences involving memory references.

Simplifies the compiler.


Allows code compiled for one pipeline to run efficiently on a different pipeline. Hardware speculation can be used:
Can

lead to further performance advantages, builds on dynamic scheduling. 111

Overview of Dynamic Instruction Scheduling

We shall discuss two schemes for implementing dynamic scheduling:


Scoreboarding:

6600 computer. Tomasulos Algorithm: Implemented for the FP unit of the IBM 360/91 in 1966.

First used in the 1964 CDC

Since scoreboarding is a little closer to in-order execution, well look at it first.


112

A Point to Note About Dynamic Scheduling

WAR and WAW hazards that did not exist in an in-order pipeline:
Can

arise in a dynamically scheduled processor.

113

Scoreboarding

cont

Scoreboarding allows instructions to execute out of order:


When

Named after the scoreboard:


Originally

there are sufficient resources. 6600.

developed for CDC

114

Scoreboarding The 5 Stage MIPS Pipeline

Split the ID pipe stage of simple 5-stage pipeline into 2 stages:


Issue:

Decode instructions, check for structural hazards. Read operands: Wait until no data hazards, then read operands.

115

Scoreboarding

Instructions pass through the issue stage in order. Instructions can bypass each other in the read operands stage:
Then

cont

enter execution out of order.

116

Scoreboarding Concepts

We had observed that WAR and WAW hazards can occur in out-oforder execution:
Instructions

are stalled, But, instructions having no dependence are allowed to continue. Different units are kept as busy as possible.
117

involved in a dependence

Scoreboarding Concepts

Essence of scoreboarding:
Execute

possible. When an instruction is stalled,

instructions as early as

Later instructions are issued and executed if they do not depend on any active or stalled instruction.
118

A Few More Basic Scoreboarding Concepts

Every instruction goes through the scoreboard:


Scoreboard constructs the data dependences of the instruction. Scoreboard decides when an instruction can execute. Scoreboard also controls when an instruction can write its results into the destination register.

119

Scoreboarding

Out-of-order execution requires multiple instructions to be in the EX stage simultaneously:

Achieved with multiple functional units, along with pipelined functional units.

All instructions go through the scoreboard:

Centralized control of issue, operand reading, execution and writeback.


All hazard resolution is centralized in the scoreboard as well.
120

A Scoreboard for MIPS


R e g i s t e r s
Data buses source of structural hazard FP Mult FP Mult FP Divide FP Add

Integer Unit

Control/ status

Scoreboard

Control/ status
121

1. Issue: when a f.u. for an instruction is free and no other active instruction has the same destination register: 2. Read operands: when all source operands are available:

4 Steps of Execution with Scoreboarding

Avoids structural and WAW hazards.

Note: forwarding not used. A source operand is available if no earlier issued active instruction is going to write it. Thus resolves RAW hazards dynamically.
122

Steps in Execution with Scoreboarding


3. Execution: begins when the f.u. receives its operands; scoreboard notified when execution completes. 4. Write Result: after WAR hazards have been resolved. Example:

ADD.D cannot proceed to read operands until DIV.D completes; SUB.D can execute but not write back until ADD.D has read F8.
123

DIV.D F0, F2, F4 ADD.D F10, F0, F8 SUB.D F8, F8, F14

An Assessment of Scoreboarding

Pro: Factor of 1.7 improvement for FORTRAN and 2.5 for hand-coded assembly on CDC 6600!

Scoreboard on the CDC 6600:

Before semiconductor main memory or caches Required about as much logic as a functional unit -- quite low. Large number of buses needed:

Cons:

Centralized hardware for hazard resolution.

However, if we wish to issue multiple instructions per clock, more wires are needed in any case.
124

An Assessment of Scoreboarding cont

Pro: A scoreboard effectively handles true data dependencies:


Minimizes

the number of stalls due to true data dependencies.

Con: Anti dependences and output dependences (WAR and WAW hazards) are also handled using stalls:

Could have been better handled.


125

Advanced Computer Architectures

Lecture 13: Tomasulos Algorithm


Mr. B.S.P.Mishra.
KIIT UNIVERSITY
BHUBANESWAR

126

A More Sophisticated Approach: Tomasulos Algorithm

Developed for IBM 360/91:


Goal:

To keep the floating point pipeline as busy as possible. This led Tomasulo to try to figure out how to achieve renaming in hardware!

The descendants of this have flourished!


Alpha

21264, HP 8000, MIPS 10000, Pentium III, PowerPC 604, Pentium 4


127

Reservation stations:

Key Innovations in Dynamic Instruction Scheduling

Common Data Bus (CDB):

Single entry buffer at the head of each functional unit has been replaced by a multiple entry buffer.

Register Tags:

Connects the output of the functional units to the reservation stations as well as registers. Tag corresponds to the reservation station entry number for the instruction producing the result.
128

Reservation Stations

The basic idea:


An

instruction waits in the reservation station, until its operands become available. reservation station fetches and buffers an operand as soon as it is available:

Helps overcome RAW hazards.

Eliminates the need to get operands from registers.


129

Tomasulos Algorithm

Control & buffers distributed with Function Units (FU) In the form of reservation stations associated with every function unit. Store operands for issued but pending instructions. Registers in instructions replaced by values and others with pointers to reservation stations (RS):

Achieves register renaming. Avoids WAR, WAW hazards without stalling.

130

Tomasulos Algorithm

Results passed to FUs from RSs,

cont

Load and Stores:

Not through registers, therefore similar to forwarding. Over Common Data Bus (CDB) that broadcasts results to all FUs. Treated as FUs with RSs as well.

Integer instructions can go past branches:

Allows FP ops beyond basic block in FP queue.


131

Tomasulos Scheme
From Instruction Unit Instruction Queue Registers Address Unit Store Buffer

Load Buffer

Adder

Reservation Stations

Multiplier

Memory Unit

CDB
132

Three Stages of Tomasulo Algorithm


1. Issue: Get instruction from Instr Queue
Issue instruction only if a matching reservation station is free (no structural hazard). Send registers or the functional unit that would produce the result (achieves renaming).

2. Execute: Operate on operands (EX)

3. Write result: Finish execution (WB)

When both operands ready then execute; if not ready, watch Common Data Bus for result

Write on CDB to all awaiting units; mark reservation station available.


133

Instruction stream
Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8

Tomasulo Example
k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


Load1 Load2 Load3

Busy Address
No No No

3 Load/Buffers
Op S1 Vj S2 Vk RS Qj RS Qk

Reservation Stations:
Time Name Busy Add1 No Add2 No FU count Add3 No down Mult1 No Mult2 No

3 FP Adder R.S. 2 FP Mult R.S.

Register result status: Clock


0 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Clock cycle counter


134

Tomasulo Example Cycle 1


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 Load1 Load2 Load3

Busy Address
Yes No No 34+R2

Reservation Stations:
Time Name Busy Add1 No Add2 No Add3 No Mult1 No Mult2 No

Op

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


1 FU

F0

F2

F4

F6
Load1

F8

F10

F12

...

F30

135

Tomasulo Example Cycle 2


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 Load1 Load2 Load3

Busy Address
Yes Yes No 34+R2 45+R3

Reservation Stations:
Time Name Busy Add1 No Add2 No Add3 No Mult1 No Mult2 No

Op

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


2 FU

F0

F2
Load2

F4

F6
Load1

F8

F10

F12

...

F30

Note: Can have multiple loads outstanding


136

Tomasulo Example Cycle 3


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 3 Load1 Load2 Load3

Busy Address
Yes Yes No 34+R2 45+R3

Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes MULTD Mult2 No

S1 Vj

S2 Vk

RS Qj

RS Qk

R(F4) Load2

Register result status: Clock


3 FU

F0

F2

F4

F6
Load1

F8

F10

F12

...

F30

Mult1 Load2

Note: registers names are removed (renamed) in Reservation Stations; MULT issued 137 Load1 completing; what is waiting for Load1?

Tomasulo Example Cycle 4


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 3 4 4 Load1 Load2 Load3

Busy Address
No Yes No 45+R3

Reservation Stations:

Time Name Busy Op Add1 Yes SUBD M(A1) Load2 Add2 No Add3 No Mult1 Yes MULTD R(F4) Load2 Mult2 No

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


4 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 Load2

M(A1) Add1

Load2 completing; what is waiting for Load2?


138

Tomasulo Example Cycle 5


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 3 4 4 5 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op 2 Add1 Yes SUBD M(A1) M(A2) Add2 No Add3 No 10 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


5 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

M(A1) Add1 Mult2

Timer starts down for Add1, Mult1


139

Tomasulo Example Cycle 6


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 4 5 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op 1 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 9 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


6 FU

F0

F2

F4

F6
Add2

F8

F10

F12

...

F30

Mult1 M(A2)

Add1 Mult2

Issue ADDD here despite name dependency on F6?


140

Tomasulo Example Cycle 7


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 4 5 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op 0 Add1 Yes SUBD M(A1) M(A2) Add2 Yes ADDD M(A2) Add1 Add3 No 8 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


7 FU

F0

F2

F4

F6
Add2

F8

F10

F12

...

F30

Mult1 M(A2)

Add1 Mult2

Add1 (SUBD) completing; what is waiting for it?


141

Tomasulo Example Cycle 8


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 4 5 8 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No 2 Add2 Yes ADDD (M-M) M(A2) Add3 No 7 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


8 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

Add2 (M-M) Mult2

142

Tomasulo Example Cycle 9


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 4 5 8 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No 1 Add2 Yes ADDD (M-M) M(A2) Add3 No 6 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


9 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

Add2 (M-M) Mult2

143

Tomasulo Example Cycle 10


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 10 4 5 8 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No 0 Add2 Yes ADDD (M-M) M(A2) Add3 No 5 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


10 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

Add2 (M-M) Mult2

Add2 (ADDD) completing; what is waiting for it?


144

Tomasulo Example Cycle 11


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 10 4 5 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No 4 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


11 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

(M-M+M) (M-M) Mult2

Write result of ADDD here?


145

Tomasulo Example Cycle 12


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 10 4 5 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No 3 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


12 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

(M-M+M) (M-M) Mult2

146

Tomasulo Example Cycle 13


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 10 4 5 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No 2 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


13 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

(M-M+M) (M-M) Mult2

147

Tomasulo Example Cycle 14


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 7 10 4 5 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No 1 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


14 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

(M-M+M) (M-M) Mult2

148

Tomasulo Example Cycle 15


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 15 7 10 4 5 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No 0 Mult1 Yes MULTD M(A2) R(F4) Mult2 Yes DIVD M(A1) Mult1

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


15 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

Mult1 M(A2)

(M-M+M) (M-M) Mult2

Mult1 (MULTD) completing; what is waiting for it?


149

Tomasulo Example Cycle 16


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 15 7 10 4 5 16 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No 40 Mult2 Yes DIVD M*F4 M(A1)

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


16 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

M*F4 M(A2)

(M-M+M) (M-M) Mult2

Just waiting for Mult2 (DIVD) to complete


150

(skip a couple of cycles)

151

Tomasulo Example Cycle 55


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 15 7 10 4 5 16 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No 1 Mult2 Yes DIVD M*F4 M(A1)

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


55 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

M*F4 M(A2)

(M-M+M) (M-M) Mult2

152

Tomasulo Example Cycle 56


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 15 7 56 10 4 5 16 8 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No 0 Mult2 Yes DIVD M*F4 M(A1)

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


56 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

M*F4 M(A2)

(M-M+M) (M-M) Mult2

Mult2 (DIVD) is completing; what is waiting for it?


153

Tomasulo Example Cycle 57


Instruction status:
Instruction LD F6 LD F2 MULTD F0 SUBD F8 DIVD F10 ADDD F6 j 34+ 45+ F2 F6 F0 F8 k R2 R3 F4 F2 F6 F2

Exec Write Issue Comp Result


1 2 3 4 5 6 3 4 15 7 56 10 4 5 16 8 57 11 Load1 Load2 Load3

Busy Address
No No No

Reservation Stations:

Time Name Busy Op Add1 No Add2 No Add3 No Mult1 No Mult2 Yes DIVD M*F4 M(A1)

S1 Vj

S2 Vk

RS Qj

RS Qk

Register result status: Clock


56 FU

F0

F2

F4

F6

F8

F10

F12

...

F30

M*F4 M(A2)

(M-M+M) (M-M) Result

Once again: In-order issue, out-of-order execution 154 and out-of-order completion.

Performance is limited by CDB:

Tomasulos Scheme: Drawbacks

CDB connects to multiple functional units high capacitance, high wiring density Number of functional units that can complete per cycle limited to one!

Imprecise exceptions!

Multiple CDBs more FU logic for parallel stores.

Effective handling is a major performance bottleneck.


155

Interrupts/Exceptions

Interrupts: external, I/O devices, OS. Exceptions: internal, errors


Illegal

OS needs to intervene to handle exceptions.

op code, divide by 0, overflow/underflow, page faults.

156

Imprecise Exceptions

An exception is called imprecise when:


The

processor state when an exception is raised, does not look exactly the same compared to when the instructions are executed inorder.
157

Imprecise Exceptions

In an out of order execution model, an imprecise exception is said to occur if


When

exception is raised by an instruction:

For example:
A

some instructions before it may not be complete some instructions after it are already complete

floating point instruction exception could be detected after an integer instruction that is much later in the program order is complete.
158

Handling Imprecise Exceptions in Dynamic Scheduling

Instructions are issued in-order:


But, may execute out-of-order. However, unless control-dependence is resolved an instruction is not executed. No instruction is allowed to initiate execution until all branches that precede the instructions are complete.

This is a performance bottleneck:


Average basic block size is about 6 instructions.

159

Modern Computer Architectures

Lecture 14: Dynamic Instruction Scheduling: Loop Example


Mr. SUBHASIS DASH
KIIT UNIVERSITY
BHUBANESWAR
160

Tomasulos Scheme- Loop Example


Loop: LD MULTD F0 F4 0 F0 R1 F2

SD
SUBI

F4

R1

R1 R1 #8

BNEZ

R1 Loop

161

Assume Multiply takes 4 clocks. Assume:


1st

Tomasulos Scheme- Loop Example

To be clear, we will show clocks for SUBI, BNEZ:


Reality:

load takes 8 clocks (L1 cache miss) 2nd load takes 1 clock (hit)
integer instructions ahead of FP Instructions.
162

Show 2 iterations

Instruction status:
ITER Instruction
Iter1 ation Count 2
2 2 1 1 LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4

Loop Example
ExecWrite j
0 F0 0 0 F0 0

k IssueCompResult
R1 F2 R1 R1 F2 R1

Busy Addr
Load1 No Load2 No Load3 No Store1 No Store2 No Store3 No

Fu

Reservation Stations:
Add1 Add2 Add3 Mult1 Mult2 R1 80 No No No No No

S1 Vj Vk

S2 Qj

RS Qk Code:

Time Name Busy Op

Added Store Buffers


0 F0 0 R1 Loop R1 F2 R1 #8

LD F0 MULTD F4 SD SUBI BNEZ F4 R1 R1

Register result status

Instruction Loop

Clock
0

F0 F2 F4 F6 F8
Fu

F10 F12

...

F30 163

Loop Example Cycle 1


Instruction status:
ITER Instruction
1 LD F0

j
0

k
R1

Exec Write Issue CompResult


1 Load1 Load2 Load3 Store1 Store2 Store3 S2 Qj RS Qk Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes No No No No No 80

Fu

Reservation Stations:
Time Name Busy Add1 No Add2 No Add3 No Mult1 No Mult2 No R1 80 Op Vj

S1 Vk

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
1

F0
Fu Load1

F2

F4

F6

F8

F10 F12

...

F30

164

Loop Example Cycle 2


Instruction status:
ITER Instruction
1 1 LD MULTD F0 F4

j
0 F0

k
R1 F2

Exec Write Issue CompResult


1 2 Load1 Load2 Load3 Store1 Store2 Store3 S2 Qj RS Qk Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes No No No No No 80

Fu

Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 80 Vj

S1 Vk

R(F2) Load1

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
2

F0
Fu Load1

F2

F4
Mult1

F6

F8

F10 F12

...

F30

165

Loop Example Cycle 3


Instruction status:
ITER Instruction
1 1 1 LD MULTD SD F0 F4 F4

j
0 F0 0

k
R1 F2 R1

Exec Write Issue CompResult


1 2 3 Load1 Load2 Load3 Store1 Store2 Store3 S2 Qj RS Qk Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes No No Yes No No 80

Fu

80

Mult1

Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 80 Vj

S1 Vk

R(F2) Load1

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
3

F0
Fu Load1

F2

F4
Mult1

F6

F8

F10 F12

...

F30

Implicit renaming sets up data flow graph

166

Loop Example Cycle 4


Instruction status:
ITER Instruction
1 1 1 LD MULTD SD F0 F4 F4

j
0 F0 0

k
R1 F2 R1

Exec Write Issue CompResult


1 2 3 Load1 Load2 Load3 Store1 Store2 Store3 S2 Qj RS Qk Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes No No Yes No No 80

Fu

80

Mult1

Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 80 Vj

S1 Vk

R(F2) Load1

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
4

F0
Fu Load1

F2

F4
Mult1

F6

F8

F10 F12

...

F30

Dispatching SUBI Instruction (not in FP queue)

167

Loop Example Cycle 5


Instruction status:
ITER Instruction
1 1 1 LD MULTD SD F0 F4 F4

j
0 F0 0

k
R1 F2 R1

Exec Write Issue CompResult


1 2 3 Load1 Load2 Load3 Store1 Store2 Store3 S2 Qj RS Qk Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes No No Yes No No 80

Fu

80

Mult1

Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 72 Vj

S1 Vk

R(F2) Load1

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
5

F0
Fu Load1

F2

F4
Mult1

F6

F8

F10 F12

...

F30

BNEZ instruction (not in FP queue)

168

Loop Example Cycle 6


Instruction status:
ITER Instruction
1 1 1 2 LD MULTD SD LD F0 F4 F4 F0

j
0 F0 0 0

k
R1 F2 R1 R1

Exec Write Issue CompResult


1 2 3 6 Load1 Load2 Load3 Store1 Store2 Store3 S2 Qj RS Qk Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes Yes No Yes No No 80 72 80

Fu

Mult1

Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 72 Vj

S1 Vk

R(F2) Load1

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
6

F0
Fu Load2

F2

F4
Mult1

F6

F8

F10 F12

...

F30

Notice that F0 never sees Load from location 80

169

Loop Example Cycle 7


Instruction status:
ITER Instruction
1 1 1 2 2 LD MULTD SD LD MULTD F0 F4 F4 F0 F4

j
0 F0 0 0 F0

k
R1 F2 R1 R1 F2

Exec Write Issue CompResult


1 2 3 6 7 S1 Vk S2 Qj RS Qk Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes Yes No Yes No No 80 72 80

Fu

Mult1

Reservation Stations:
Time Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 Yes Multd R1 72 Vj

R(F2) Load1 R(F2) Load2

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
7

F0
Fu Load2

F2

F4
Mult2

F6

F8

F10 F12

...

F30

Register file completely detached from computation First and Second iteration completely overlapped

170

Loop Example Cycle 8


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1 Vj

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk S2 Qj RS Qk Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes Yes No Yes Yes No 80 72 80 72

Fu

Mult1 Mult2

Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 Yes Multd R1 72

R(F2) Load1 R(F2) Load2

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
8

F0
Fu Load2

F2

F4
Mult2

F6

F8

F10 F12

...

F30

171

Loop Example Cycle 9


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1 Vj

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 Load1 Load2 Load3 Store1 Store2 Store3 RS Qk Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes Yes No Yes Yes No 80 72 80 72

Fu

Mult1 Mult2

Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 Yes Multd R1 72

S2 Qj

R(F2) Load1 R(F2) Load2

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
9

F0
Fu Load2

F2

F4
Mult2

F6

F8

F10 F12

...

F30

Load1 completing: who is waiting? Note: Dispatching SUBI

172

Loop Example Cycle 10


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 10 Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No Yes No Yes Yes No 72 80 72

Fu

10

Mult1 Mult2

Reservation Stations:

Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd R(F2) Load2 R1 64

S2 Qj

RS Qk

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
10

F0
Fu Load2

F2

F4
Mult2

F6

F8

F10 F12

...

F30

Load2 completing: who is waiting? Note: Dispatching BNEZ

173

Loop Example Cycle 11


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 10 Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No No Yes Yes Yes No

Fu

10

11

64 80 72

Mult1 Mult2

Reservation Stations:

3 4

Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd M[72] R(F2) R1 64

S2 Qj

RS Qk

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
11

F0
Fu Load3

F2

F4
Mult2

F6

F8

F10 F12

...

F30

Next load in sequence

174

Loop Example Cycle 12


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 10 Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No No Yes Yes Yes No

Fu

10

11

64 80 72

Mult1 Mult2

Reservation Stations:

2 3

Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd M[72] R(F2) R1 64

S2 Qj

RS Qk

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
12

F0
Fu Load3

F2

F4
Mult2

F6

F8

F10 F12

...

F30

Why not issue third multiply?

175

Loop Example Cycle 13


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 10 Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No No Yes Yes Yes No

Fu

10

11

64 80 72

Mult1 Mult2

Reservation Stations:

1 2

Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd M[72] R(F2) R1 64

S2 Qj

RS Qk

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
13

F0
Fu Load3

F2

F4
Mult2

F6

F8

F10 F12

...

F30

Why not issue third store?

176

Loop Example Cycle 14


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 14 10 10 Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No No Yes Yes Yes No

Fu

11

64 80 72

Mult1 Mult2

Reservation Stations:

0 1

Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 Yes Multd M[80] R(F2) Mult2 Yes Multd M[72] R(F2) R1 64

S2 Qj

RS Qk

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
14

F0
Fu Load3

F2

F4
Mult2

F6

F8

F10 F12

...

F30

Mult1 completing. Who is waiting?

177

Loop Example Cycle 15


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 14 10 15 S2 Qj 10 15 11 Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No No Yes Yes Yes No

Fu

64 80 72

[80]*R2 Mult2

Reservation Stations:

Name Busy Op Vj Add1 No Add2 No Add3 No Mult1 No Mult2 Yes Multd M[72] R(F2) R1 64

RS Qk

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
15

F0
Fu Load3

F2

F4
Mult2

F6

F8

F10 F12

...

F30

Mult2 completing. Who is waiting?

178

Loop Example Cycle 16


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1 Vj

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 14 10 15 S2 Qj 10 15 11 16 RS Qk Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No No Yes Yes Yes No

Fu

64 80 72

[80]*R2 [72]*R2

Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 64

R(F2) Load3

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
16

F0
Fu Load3

F2

F4
Mult1

F6

F8

F10 F12

...

F30

179

Loop Example Cycle 17


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1 Vj

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 14 10 15 S2 Qj 10 15 11 16 RS Qk Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No No Yes Yes Yes Yes

Fu

64 80 72 64

[80]*R2 [72]*R2 Mult1

Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 64

R(F2) Load3

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
17

F0
Fu Load3

F2

F4
Mult1

F6

F8

F10 F12

...

F30

180

Loop Example Cycle 18


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1 Vj

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 14 18 10 15 S2 Qj 10 15 11 16 RS Qk Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No No Yes Yes Yes Yes

Fu

64 80 72 64

[80]*R2 [72]*R2 Mult1

Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 64

R(F2) Load3

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
18

F0
Fu Load3

F2

F4
Mult1

F6

F8

F10 F12

...

F30

181

Loop Example Cycle 19


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1 Vj

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 14 18 10 15 19 S2 Qj 10 15 19 11 16 RS Qk Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
No No Yes No Yes Yes

Fu

64 72 64 [72]*R2 Mult1

Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 56

R(F2) Load3

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
19

F0
Fu Load3

F2

F4
Mult1

F6

F8

F10 F12

...

F30

182

Loop Example Cycle 20


Instruction status:
ITER Instruction
1 1 1 2 2 2 Time LD MULTD SD LD MULTD SD F0 F4 F4 F0 F4 F4

j
0 F0 0 0 F0 0

k
R1 F2 R1 R1 F2 R1 Vj

Exec Write Issue CompResult


1 2 3 6 7 8 S1 Vk 9 14 18 10 15 19 S2 Qj 10 15 19 11 16 20 RS Qk Load1 Load2 Load3 Store1 Store2 Store3 Code: LD MULTD SD SUBI BNEZ

Busy Addr
Yes No Yes No No Yes 56 64

Fu

64

Mult1

Reservation Stations:
Name Busy Op Add1 No Add2 No Add3 No Mult1 Yes Multd Mult2 No R1 56

R(F2) Load3

F0 F4 F4 R1 R1

0 F0 0 R1 Loop

R1 F2 R1 #8

Register result status

Clock
20

F0
Fu Load1

F2

F4
Mult1

F6

F8

F10 F12

...

F30

Once again: In-order issue, out-of-order execution and 183 out-of-order completion.

Why Can Tomasulos Scheme Overlap Iterations of Loops?

Register renaming using reservation stations:


Avoids

the WAR stall that occurred in the scoreboard. Also, multiple iterations use different physical destinations facilitating dynamic loop unrolling.
184

Tomasulos Scheme Offers Three Major Advantages

1. Distribution of hazard detection logic:

Distributed reservation stations. If multiple instructions wait on a single result,

If a centralized register file were used,

Instructions can be passed simultaneously by broadcast on CDB. Units would have to read their results from registers .

2. Elimination of stalls for WAW and WAR hazards. 3. Possible to have superscalar execution:

Because results directly available to FUs, rather than from registers.


185

Advanced Computer Architectures

Superscalar and VLIW Processors


Mr. B.S.P.Mishra SCHOLE OF TECHNOLOGY

KIIT UNIVERSITY
BHUBANESWAR
186

A Practice Problem on Dependence Analysis


Identify all dependences in the following code.

Transform the code to eliminate the dependences.


for(i=1;i<1000;i++){ y[i]=x[i]/c; x[i]=x[i]+c; z[i]=y[i]+c; y[i]=c-y[i]; }
187

Transformed Code Without Dependence


for(i=1;i<1000;i++){ t[i]=x[i]/c; x[i]=x[i]+c; z[i]=t[i]+c; y[i]=c-t[i]; }

188

Simple code transformations work well, only if:

Global Code Scheduling


The loop body is a straight line code.

Issues become more complex in the presence of:

Instructions might have to be moved across branches:

Nested loops, nested branches, etc.

This is called global code scheduling.


189

Two Paths to Higher ILP

Superscalar processors:
Multiple

issue, dynamically scheduled, speculative execution, branch prediction More hardware functionalities and complexities.

VLIW:
Let

complier take the complexity. Simple hardware, smart compiler.


190

Superscalar Execution

Scheduling of instructions is determined by a number of factors:


True Data Dependency: The result of one operation is an input to the next. Resource constraints: Two operations require the same resource. Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a-priori. Superscalar processor of degree m.
191

An appropriate number of instructions issued.

Very Long Instruction Word (VLIW) Processors

Hardware cost and complexity of superscalar schedulers is a major consideration in processor design.

These instructions are packed and dispatched together,

VLIW processors rely on compile time analysis to identify and bundle together instructions that can be executed concurrently.

Thus the name very long instruction word. This concept is employed in the Intel IA64 processors.

192

VLIW Processors

The compiler has complete responsibility of selecting a set of instructions:

These can be concurrently be executed.

VLIW processors have static instruction issue capability:

As compared, superscalar processors have dynamic issue capability.


193

The Basic VLIW Approach


VLIW processors deploy multiple independent functional units. Early VLIW processors operated lock step:
There

was no hazard detection in hardware at all. A stall in any functional unit causes the entire pipeline to stall.
194

Assume a 4-issue static superscalar processor:


During fetch stage, 1 to 4 instructions would be fetched. The group of instructions that could be issued in a single cycle are called:

VLIW Processors

If an instruction could cause a structural or data hazard:

An issue packet or a Bundle.

It is not issued.
195

One single VLIW instruction:

VLIW (Very Long Instruction Word)


separately targets differently functional units.

MultiFlow TRACE, TI C6X, IA-64


Bundle
add r1,r2,r3 load r4,r5+4 mov r6,r2 mul r7,r8,r9

FU

FU

FU

FU

Schematic Explanation for a VLIW Instruction

196

VLIW Processors: Some Considerations


Issue hardware is simpler. Compiler has a bigger context from which to select co-scheduled instructions. Compilers, however, do not have runtime information such as cache misses.

Scheduling is, therefore, inherently conservative. Branch and memory prediction is more difficult.

Typical VLIW processors are limited to 4-way to 8-way parallelism.


197

VLIW Summary

Each instruction is very large

Complier detects hazard, and determines scheduling. There is no (or only partial) hardware hazard detection:

Bundles multiple operations that are independent.

Tradeoff instruction space for simple decoding


The long instruction word has room for many operations. But have to fill with NOP if enough operations cannot be found.

No dependence check logic for instructions issued at the same cycle.

198

VLIW vs Superscalar

VLIW - Compiler finds parallelism:

VLIW Simpler hardware:

Superscalar hardware finds parallelism Superscalar More complex hardware

VLIW less parallelism can be exploited for a typical program:

Superscalar Better performance

199

Superscalar Processors

Commercial desktop processors now do four or more issues per clock:


Even

in the embedded processor market, dual issue superscalar pipelines are becoming common.

200

Superscalar Execution With Dynamic Scheduling

Multiple instruction issue:


Very

well accommodated with dynamic instruction scheduling approach.


pipelined, or both.

The issue stage can be:


Replicated,

201

Limitations of Scalar Pipelines: A Reflection

Maximum throughput bounded by one instruction per cycle. Inefficient unification of instructions into one pipeline:

ALU, MEM stages very diverse eg: FP If a leading instruction is stalled every subsequent instruction is stalled
202

Rigid nature of in-order pipeline:

A Rigid Pipeline

Bypassing of stalled instruction not allowed

Stalled Instruction

Backward Propagation of stalling

203

Solving Problems of Scalar Pipelines: Modern Processors

Maximum throughput bounded by one instruction per cycle:

Inefficient unification into a single pipeline:

parallel pipelines (superscalar)

Rigid nature of in order pipeline

diversified pipelines.

Allow out of ordering or dynamic instruction scheduling.


204

Machine Parallelism
(a) No Parallelism (Nonpipelined) (b) Temporal Parallelism (Pipelined) (c) Spatial Parallelism (Multiple units) (d) Combined Temporal and Spatial Parallelism

205

A Parallel Pipeline
Width = 3

206

Scalar and Parallel Pipeline


(a) The five-stage i486 scalar pipeline (b) The five-stage Pentium Parallel Pipeline of width=2

207

Diversified Parallel Pipeline

208

A Dynamically Scheduled Speculative Pipeline

209

Distributed Reservation Stations

210

A Superscalar Pipeline
A degree six superscalar pipeline

211

Superscalar Pipeline Design


Fetch

Instruction Flow

Instruction Buffer Decode Dispatch Buffer Dispatch Issuing Buffer Execute

Data Flow
Complete

Completion Buffer

Store Buffer

Retire

212

A Superscalar MIPS Processor

Assume two instructions can be issued per clock cycle:


One

of the instructions can be load, store, or integer ALU operations. The other can be a floating point operation.

213

MIPS Pipeline with Pipelined MultiCycle Operations


EX

M1 IF ID

M2

M3

M4

M5

M6

M7 M WB

A1

A2

A3

A4

DIV
Pipelined implementations ex: 7 outstanding MUL, 4 outstanding Add, unpipelined DIV. In-order execution, out-of-order completion

Tomasulo w/o ROB: out-of-order execution, out-of-order completion, in- 214 order commit