ILP Techniques: Laxmi N. Bhuyan CS 162 Spring 2003

Lecture 6:
ILP Techniques
Laxmi N. Bhuyan
CS 162
Spring 2003
DAP Spr.98 UCB 1
HW Schemes: Instruction Parallelism

Why in HW at run time?
Works when real dependences can't be known at compile time
Compiler simpler
Code for one machine runs well on another
Key idea: Allow instructions behind stall to proceed

DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
Enables out-of-order execution => out-of-order completion

ID stage checks for hazards. If there are none, the instruction is issued
for execution. Scoreboard dates to CDC 6600 in 1963
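The dependences in the DIVD/ADDD/SUBD example can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the names `Instr` and `hazards` are hypothetical. It classifies the data hazards between an earlier and a later instruction and suggests why SUBD may proceed while ADDD waits on DIVD.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str

def hazards(earlier: Instr, later: Instr) -> set:
    """Return the data hazards `later` has with respect to `earlier`."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")   # later reads a register that earlier writes
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")   # later overwrites a register that earlier reads
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    return found

divd = Instr("DIVD", "F0", "F2", "F4")
addd = Instr("ADDD", "F10", "F0", "F8")
subd = Instr("SUBD", "F12", "F8", "F14")

print(hazards(divd, addd))  # ADDD needs F0 from DIVD, so it must wait
print(hazards(divd, subd))  # empty: SUBD does not depend on DIVD
print(hazards(addd, subd))  # empty: SUBD may pass the stalled ADDD
```

Because SUBD has no hazard with either earlier instruction, a machine that lets instructions behind a stall proceed can execute it while ADDD is still waiting for the divide.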
How ILP Works

Issuing multiple instructions per cycle
requires fetching multiple instructions from
memory per cycle; the number issued per cycle
is called the superscalar degree or issue width
To find independent instructions, we must
have a big pool of instructions to choose from,
called the instruction buffer (IB). As IB length
increases, the complexity of the decoder (control)
increases, which increases the datapath cycle
time
Prefetch instructions sequentially with an instruction
fetch unit (IFU) that operates independently from datapath
control. Fetch the instruction at (PC)+L, where L is
the IB size, or as directed by the branch
predictor. (See Fig. 6-1 Pentium diagram)
Pentium Datapath
Pentium consists of two pipes (U-pipe
and V-pipe) operating in parallel. U-pipe
contains an 8-stage FP pipeline (see
Pentium Figure)
Two stages of decode: decode and
control in the first stage, register read
in the second stage
See I-cache and D-cache in Fig. 6-1.
What is a TLB? How does virtual
memory work?
HW Schemes: Instruction Parallelism

Two types: Scoreboard and Tomasulo
Scoreboard (EX: PENTIUM):
Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
Scoreboards allow an instruction to execute whenever
there is no structural hazard and it is not waiting on prior
instructions. Instructions are issued in order,
but can bypass waiting instructions in the read
operands stage => in-order issue, out-of-order
execution => out-of-order completion
Named after the CDC 6600 scoreboard, which introduced
this capability
Scoreboard Implications
Scoreboard replaces ID, EX, WB with 4 stages
Out-of-order completion => WAR, WAW hazards?
Solution for WAR => wait at the WB stage until the
earlier instruction has read its operands
For WAW, must detect hazard at the ID stage: stall
issue until the other instruction completes
Need to have multiple instructions in execution
phase => multiple execution units or pipelined
execution units
Scoreboard keeps track of dependencies, state or
operations
Four Stages of Scoreboard Control

1. Issue: decode instructions & check for
structural hazards (ID1)
If a functional unit for the instruction is free and no other
active instruction has the same destination register (WAW),
the scoreboard issues the instruction to the functional unit
and updates its internal data structure. If a structural or
WAW hazard exists, then the instruction issue stalls, and no
further instructions will issue until these hazards are cleared.
2. Read operands: wait until no data hazards, then
read operands (ID2)
A source operand is available if no earlier issued active
instruction is going to write it, or if the register containing
the operand is being written by a currently active functional
unit. If the source operands are available for an instruction, the
scoreboard tells the functional unit to proceed to read the
operands from the registers and begin execution. The
scoreboard resolves RAW hazards dynamically in this step,
and instructions may be sent into execution out of order.
Four Stages of Scoreboard Control

3. Execution: operate on operands (EX)
The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the scoreboard
that it has completed execution.
4. Write result: finish execution (WB)
Once the scoreboard is aware that the functional unit has
completed execution, the scoreboard checks for WAR
hazards. If none, it writes results. If WAR, then it stalls the
instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
CDC 6600 scoreboard would stall SUBD until ADDD reads
operands
Design of the Scoreboard

1. Instruction status: which of the 4 steps the instruction is in
2. Functional unit status: indicates the state of the functional unit (FU).
9 fields for each functional unit:
Busy: indicates whether the unit is busy or not
Op: operation to perform in the unit (e.g., + or -)
Fi: destination register
Fj, Fk: source-register numbers
Qj, Qk: functional units producing source registers Fj, Fk
Rj, Rk: flags indicating when Fj, Fk are ready
3. Register result status: indicates which functional unit will write each
register, if one exists. Blank when no pending instructions will write
that register
Detailed Scoreboard Pipeline Control

Instruction status: Issue
  Wait until:  not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← yes; Op(FU) ← op;
               Fi(FU) ← D; Fj(FU) ← S1;
               Fk(FU) ← S2; Qj ← Result(S1);
               Qk ← Result(S2); Rj ← not Qj;
               Rk ← not Qk; Result(D) ← FU
Instruction status: Read operands
  Wait until:  Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No
Instruction status: Execution complete
  Wait until:  Functional unit done
Instruction status: Write result
  Wait until:  ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) &
               (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes);
               ∀f(if Qk(f) = FU then Rk(f) ← Yes);
               Result(Fi(FU)) ← 0; Busy(FU) ← No
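The wait conditions and bookkeeping rules of the scoreboard can be sketched in a few lines of Python. This is an illustrative model only, not the CDC 6600 logic: the class names `FU` and `Scoreboard`, the unit names, and the use of two adders are assumptions made for the example, and clearing Qj/Qk at write-back is a small addition so stale producer names do not linger.

```python
from dataclasses import dataclass

@dataclass
class FU:
    busy: bool = False
    op: str = ""
    fi: str = ""      # destination register
    fj: str = ""      # source registers
    fk: str = ""
    qj: str = ""      # units producing fj / fk ("" means none)
    qk: str = ""
    rj: bool = False  # True: operand ready but not yet read
    rk: bool = False

class Scoreboard:
    def __init__(self, fu_names):
        self.fu = {name: FU() for name in fu_names}
        self.result = {}  # register -> unit that will write it

    def can_issue(self, name, dest):
        # Wait until: not Busy(FU) and not Result(D)  (structural, WAW)
        return not self.fu[name].busy and dest not in self.result

    def issue(self, name, op, dest, s1, s2):
        assert self.can_issue(name, dest)
        qj, qk = self.result.get(s1, ""), self.result.get(s2, "")
        self.fu[name] = FU(True, op, dest, s1, s2, qj, qk, not qj, not qk)
        self.result[dest] = name

    def can_read_operands(self, name):
        u = self.fu[name]
        return u.rj and u.rk  # Wait until: Rj and Rk  (RAW)

    def read_operands(self, name):
        assert self.can_read_operands(name)
        u = self.fu[name]
        u.rj = u.rk = False

    def can_write_result(self, name):
        # WAR check: no unit may still need to read the old value of Fi.
        fi = self.fu[name].fi
        return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
                   for f in self.fu.values())

    def write_result(self, name):
        assert self.can_write_result(name)
        for f in self.fu.values():
            if f.qj == name:
                f.qj, f.rj = "", True
            if f.qk == name:
                f.qk, f.rk = "", True
        del self.result[self.fu[name].fi]
        self.fu[name] = FU()

sb = Scoreboard(["Divide", "Add1", "Add2"])
sb.issue("Divide", "DIVD", "F0", "F2", "F4")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")
sb.issue("Add2", "SUBD", "F8", "F8", "F14")
sb.read_operands("Divide")
sb.read_operands("Add2")              # SUBD passes the stalled ADDD
print(sb.can_read_operands("Add1"))   # ADDD waits for F0 (RAW)
print(sb.can_write_result("Add2"))    # SUBD waits: ADDD has not read F8 (WAR)
```

Once DIVD writes F0 and ADDD reads its operands, the WAR condition clears and SUBD is allowed to write F8.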
CDC 6600 Scoreboard

Speedup 1.7 from compiler; 2.5 by hand
BUT slow memory (no cache) limits benefit
Limitations of 6600 scoreboard:
No forwarding hardware
Limited to instructions in basic block (small window)
Small number of functional units (structural hazards),
especially integer/load-store units
Do not issue on structural hazards
Wait for WAR hazards
Prevent WAW hazards
Summary
Instruction Level Parallelism (ILP) in SW or HW
Loop level parallelism is easiest to see
SW parallelism dependencies defined for program,
hazards if HW cannot resolve
SW dependencies/compiler sophistication determine if
compiler can unroll loops
Memory dependencies hardest to determine
HW exploiting ILP
Works when dependences can't be known at compile time
Code for one machine runs well on another
Key idea of Scoreboard: Allow instructions behind stall
to proceed (Decode => Issue instr & read operands)
Enables out-of-order execution => out-of-order completion
ID stage checked both for structural and data hazards
Tomasulo Algorithm
(Implemented in IBM 360/91 in 1966)

Control & buffers distributed with functional units (FUs) vs. centralized
in scoreboard
FU buffers are called reservation stations (RS); they hold pending operands
Registers in instructions are replaced by values or pointers to
reservation stations; this is called register renaming
and avoids WAR, WAW hazards
More reservation stations than registers, so can do optimizations compilers
can't
Results go to FUs from RSs, not through registers, over a Common Data Bus
that broadcasts results to all FUs
Loads and stores treated as FUs with RSs as well
Integer instructions can go past branches, allowing
FP ops beyond basic block in FP queue
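Why renaming removes WAR and WAW hazards can be seen with a small Python sketch. It is a simplification and an assumption of this example, not the 360/91 mechanism: destinations get fresh tags (T1, T2, ...) standing in for reservation station names, the function name `rename` is hypothetical, and the final MULD instruction is added here only to create a WAW case.

```python
from itertools import count

def rename(program):
    """Give every destination a fresh tag; sources read the latest mapping."""
    fresh = (f"T{i}" for i in count(1))
    mapping = {}  # architectural register -> tag of its latest producer
    renamed = []
    for op, dest, s1, s2 in program:
        # Look up sources BEFORE remapping the destination.
        s1, s2 = mapping.get(s1, s1), mapping.get(s2, s2)
        mapping[dest] = next(fresh)
        renamed.append((op, mapping[dest], s1, s2))
    return renamed

program = [
    ("DIVD", "F0", "F2", "F4"),
    ("ADDD", "F10", "F0", "F8"),
    ("SUBD", "F8", "F8", "F14"),   # WAR with ADDD on F8
    ("MULD", "F10", "F8", "F2"),   # hypothetical: WAW with ADDD on F10
]

for instr in rename(program):
    print(instr)
```

After renaming, SUBD writes T3 rather than F8 and MULD writes T4 rather than F10, so only the true (RAW) dependences remain: ADDD on T1 and MULD on T3.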
Tomasulo Organization
[Figure: block diagram with FP Op Queue, FP Registers, Load Buffer, Store Buffer, FP Add Res. Station, FP Mul Res. Station, and Common Data Bus]
Reservation Station Components

Op: operation to perform in the unit (e.g., + or -)
Vj, Vk: values of source operands
Store buffers have a V field: the result to be stored
Qj, Qk: reservation stations producing source registers
(value to be written)
Note: no ready flags as in Scoreboard; Qj,Qk=0 => ready
Store buffers only have Qi for the RS producing the result
Busy: indicates reservation station or FU is busy

Register result status: indicates which functional unit will
write each register, if one exists. Blank when no pending
instructions will write that register.
Three Stages of Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instr & sends operands (renames registers).
2. Execution: operate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write result: finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
Normal data bus: data + destination (go to bus)
Common data bus: data + source (come from bus)
64 bits of data + 4 bits of Functional Unit source address
Write if matches expected Functional Unit (produces result)
Does the broadcast
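The three stages can be sketched as a small Python model. It is a teaching sketch under stated assumptions, not the 360/91 design: the class names `RS` and `Tomasulo`, the station names, and the register values are hypothetical, and execution and write result are folded into one `write_result` call for brevity.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RS:
    name: str
    op: str = ""
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: str = ""   # "" means the value is already in vj
    qk: str = ""
    busy: bool = False

    def ready(self):
        return self.busy and not self.qj and not self.qk

class Tomasulo:
    OPS = {"ADDD": lambda a, b: a + b, "SUBD": lambda a, b: a - b,
           "MULD": lambda a, b: a * b, "DIVD": lambda a, b: a / b}

    def __init__(self, regs, station_names):
        self.regs = dict(regs)   # register -> value
        self.reg_status = {}     # register -> station that will produce it
        self.rs = {n: RS(n) for n in station_names}

    def _operand(self, reg):
        # Return (value, tag): a tag if a station will produce the register.
        tag = self.reg_status.get(reg, "")
        return (None, tag) if tag else (self.regs[reg], "")

    def issue(self, station, op, dest, s1, s2):
        assert not self.rs[station].busy, "structural hazard: station in use"
        (vj, qj), (vk, qk) = self._operand(s1), self._operand(s2)
        self.rs[station] = RS(station, op, vj, vk, qj, qk, True)
        self.reg_status[dest] = station   # renaming: dest now names this RS

    def write_result(self, station):
        r = self.rs[station]
        assert r.ready()
        value = self.OPS[r.op](r.vj, r.vk)
        # Broadcast (value, source tag) on the CDB; waiting stations snoop it.
        for w in self.rs.values():
            if w.qj == station:
                w.vj, w.qj = value, ""
            if w.qk == station:
                w.vk, w.qk = value, ""
        for reg, tag in list(self.reg_status.items()):
            if tag == station:
                self.regs[reg] = value
                del self.reg_status[reg]
        self.rs[station] = RS(station)
        return value

m = Tomasulo({"F0": 0.0, "F2": 6.0, "F4": 2.0, "F8": 1.0,
              "F10": 0.0, "F14": 0.5}, ["Mult1", "Add1", "Add2"])
m.issue("Mult1", "DIVD", "F0", "F2", "F4")
m.issue("Add1", "ADDD", "F10", "F0", "F8")   # waits on tag Mult1, copies F8
m.issue("Add2", "SUBD", "F8", "F8", "F14")
m.write_result("Add2")    # SUBD may write F8 at once: ADDD holds its own copy
m.write_result("Mult1")   # broadcast wakes up ADDD
print(m.write_result("Add1"), m.regs["F8"])
```

Because ADDD captured the old value of F8 at issue, SUBD can complete first without a WAR stall, which is the behavior the scoreboard could only get by delaying SUBD's write.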
Tomasulo v. Scoreboard
(IBM 360/91 v. CDC 6600)
Tomasulo (IBM 360/91)             Scoreboard (CDC 6600)
Pipelined functional units        Multiple functional units
(6 load, 3 store, 3 +, 2 ×/÷)     (1 load/store, 1 +, 2 ×, 1 ÷)
Window size: 14 instructions      5 instructions
No issue on structural hazard     Same
WAR: renaming avoids              Stall completion
WAW: renaming avoids              Stall issue
Broadcast results from FU         Write/read registers
Distributed reservation stations  Central scoreboard
Tomasulo Drawbacks
Complexity
delays of 360/91, MIPS 10000, IBM 620?
Many associative stores (CDB) at high speed

Performance limited by Common Data Bus
Multiple CDBs => more FU logic for parallel assoc stores
Tomasulo Summary
Reservation stations: renaming to larger set of registers +
buffering source operands
Prevents registers as bottleneck
Avoids WAR, WAW hazards of Scoreboard
Allows loop unrolling in HW
Not limited to basic blocks
(integer unit gets ahead, beyond branches)
Helps cache misses as well
Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
360/91 descendants are Pentium II; PowerPC 604; MIPS
R10000; HP-PA 8000; Alpha 21264
HW support for More ILP

Speculation: allow an instruction that depends on a branch
predicted to be taken to issue, with no consequences (including
exceptions) if the branch is not actually taken (HW undo); called
boosting
Combine branch prediction with dynamic scheduling to execute
before branches are resolved
Separate speculative bypassing of results from real bypassing of
results
When an instruction is no longer speculative,
write boosted results (instruction commit)
or discard boosted results
Execute out-of-order but commit in-order
to prevent irrevocable actions (state updates or exceptions)
until the instruction commits
HW support for More ILP

Need HW buffer for results of
uncommitted instructions: reorder
buffer
3 fields: instr, destination, value
Reorder buffer can be operand source =>
more registers like RS
Use reorder buffer number instead of
reservation station when execution
completes
Supplies operands between execution
complete & commit
Once instruction commits,
result is put into register
Instructions commit in order
As a result, it's easy to undo speculated
instructions on mispredicted branches
or on exceptions
[Figure: FP Op Queue, Reorder Buffer, FP Regs, Res Stations, FP Adder]
Four Steps of Speculative Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send
operands & reorder buffer no. for destination (this stage sometimes
called dispatch)
2. Execution: operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for
result; when both in reservation station, execute; checks RAW
(sometimes called issue)
3. Write result: finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit: update register with reorder result
When instr. at head of reorder buffer & result present, update register
with result (or store to memory) and remove instr from reorder buffer.
Mispredicted branch flushes reorder buffer (sometimes called
graduation)
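The commit step can be illustrated with a minimal reorder buffer in Python. This is a sketch under assumptions, not a description of any real machine: the names `ROBEntry` and `ReorderBuffer` are hypothetical, the BNEZ branch is an invented example instruction, and only the three fields from the slides (instr, destination, value) plus two status flags are modeled.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class ROBEntry:
    instr: str
    dest: Optional[str]            # None for instructions with no register result
    value: Optional[float] = None
    done: bool = False
    mispredicted: bool = False

class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()     # head (oldest instruction) is on the left

    def issue(self, instr, dest=None):
        assert len(self.entries) < self.size, "no free slot: stall issue"
        entry = ROBEntry(instr, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        entry.value, entry.done, entry.mispredicted = value, True, mispredicted

    def commit(self, regs):
        """Retire finished instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            committed.append(head.instr)
            if head.mispredicted:
                self.entries.clear()   # flush everything after the branch
                break
            if head.dest is not None:
                regs[head.dest] = head.value
        return committed

regs = {"F0": 0.0, "F10": 0.0}
rob = ReorderBuffer(4)
div = rob.issue("DIVD F0,F2,F4", "F0")
br = rob.issue("BNEZ R1,loop")              # hypothetical branch
add = rob.issue("ADDD F10,F0,F8", "F10")    # speculative: issued past the branch

rob.write_result(add, 4.0)                  # completes out of order
print(rob.commit(regs))                     # nothing commits: DIVD is not done
rob.write_result(div, 3.0)
print(rob.commit(regs))                     # DIVD commits; the branch is pending
rob.write_result(br, mispredicted=True)
print(rob.commit(regs), regs)               # branch retires, ADDD is discarded
```

The speculative ADDD result sits in the buffer and never reaches F10, so undoing it on the misprediction is just a matter of clearing the buffer.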
Renaming Registers
Common variation of speculative design
Reorder buffer keeps instruction information
but not the result
Extend register file with extra
renaming registers to hold speculative results
Rename register allocated at issue;
result into rename register on execution complete;
rename register into real register on commit
Operands read either from register file
(real or speculative) or via Common Data Bus
Advantage: operands are always from single source (extended
register file)