
Lecture 6:

ILP Techniques
Laxmi N. Bhuyan
CS 162
Spring 2003

DAP Spr.98 UCB 1

HW Schemes: Instruction Parallelism


Why in HW at run time?
Works when can't know real dependence at compile time
Compiler simpler
Code for one machine runs well on another

Key idea: Allow instructions behind stall to proceed


DIVD  F0,F2,F4
ADDD  F10,F0,F8
SUBD  F12,F8,F14

Enables out-of-order execution => out-of-order completion


ID stage checks for hazards. If there are no hazards, issue the instruction for
execution. The scoreboard dates to the CDC 6600 in 1963
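The DIVD/ADDD/SUBD sequence on this slide can be checked mechanically. The following is a minimal sketch, not part of the original slides: the `Instr` tuple and the `hazards` helper are hypothetical names introduced only to show which register overlaps create RAW, WAR, and WAW dependences.

```python
from typing import NamedTuple, Set


class Instr(NamedTuple):
    """Hypothetical three-register instruction: op dest,src1,src2."""
    op: str
    dest: str
    src1: str
    src2: str


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Return the dependences of `later` on `earlier` (program order)."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")   # later reads what earlier writes
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")   # later overwrites a source of earlier
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    return found


divd = Instr("DIVD", "F0", "F2", "F4")
addd = Instr("ADDD", "F10", "F0", "F8")
subd = Instr("SUBD", "F12", "F8", "F14")

print(hazards(divd, addd))  # ADDD must wait for F0 from DIVD
print(hazards(divd, subd))  # no dependence
print(hazards(addd, subd))  # no dependence
```

Because SUBD has no dependence on either earlier instruction, it is the instruction "behind the stall" that dynamic scheduling lets proceed while ADDD waits for DIVD.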


How ILP Works


Issuing multiple instructions per cycle requires fetching multiple
instructions from memory per cycle. The number issued per cycle is called
the superscalar degree or issue width.
To find independent instructions, we must have a big pool of instructions
to choose from, called the instruction buffer (IB). As the IB length
increases, the complexity of the decoder (control) increases, which
increases the datapath cycle time.
Instructions are prefetched sequentially by an instruction fetch unit (IFU)
that operates independently of the datapath control. It fetches the
instruction at (PC)+L, where L is the IB size, or as directed by the branch
predictor. (See Fig. 6-1, Pentium diagram.)

Pentium Datapath
The Pentium consists of two pipes (U-pipe and V-pipe) operating in
parallel. The U-pipe contains an 8-stage FP pipeline (see Pentium figure).
Two decode stages: decode and control in the first stage, register read in
the second stage.
See the I-cache and D-cache in Fig. 6-1. What is a TLB? How does virtual
memory work?


HW Schemes: Instruction Parallelism


Two types: Scoreboard and Tomasulo
Scoreboard (EX: PENTIUM):
Out-of-order execution divides ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands

Scoreboards allow an instruction to execute whenever there is no
structural hazard and it is not waiting for prior instructions. So
instructions are issued in order, but can bypass waiting instructions in
the read operands stage => in-order issue, out-of-order execution =>
out-of-order completion
Named after the CDC 6600 scoreboard, which developed this capability

Scoreboard Implications
Scoreboard replaces ID, EX, WB with 4 stages
Out-of-order completion => WAR, WAW hazards?
Solution for WAR => wait at the WB stage until the earlier instruction
has read its operands
For WAW, must detect the hazard at the ID stage: stall issue until the
other instruction completes
Need to have multiple instructions in the execution phase => multiple
execution units or pipelined execution units
Scoreboard keeps track of dependencies and the state of operations


Four Stages of Scoreboard Control


1. Issue: decode instructions & check for structural hazards (ID1)
If a functional unit for the instruction is free and no other
active instruction has the same destination register (WAW),
the scoreboard issues the instruction to the functional unit
and updates its internal data structure. If a structural or
WAW hazard exists, then the instruction issue stalls, and no
further instructions will issue until these hazards are cleared.

2. Read operands: wait until no data hazards, then read operands (ID2)
A source operand is available if no earlier issued active
instruction is going to write it, or if the register containing
the operand is being written by a currently active functional
unit. If the source operands are available for an instruction, the
scoreboard tells the functional unit to proceed to read the
operands from the registers and begin execution. The
scoreboard resolves RAW hazards dynamically in this step,
and instructions may be sent into execution out of order.

Four Stages of Scoreboard Control


3. Execution: operate on operands (EX)
The functional unit begins execution upon receiving
operands. When the result is ready, it notifies the scoreboard
that it has completed execution.

4. Write result: finish execution (WB)
Once the scoreboard is aware that the functional unit has
completed execution, the scoreboard checks for WAR
hazards. If none, it writes results. If WAR, then it stalls the
instruction.
Example:
DIVD  F0,F2,F4
ADDD  F10,F0,F8
SUBD  F8,F8,F14
The CDC 6600 scoreboard would stall SUBD until ADDD reads
operands

Design of the Scoreboard


1. Instruction status: which of 4 steps the instruction is in
2. Functional unit status: indicates the state of the functional unit (FU).
9 fields for each functional unit:
Busy: indicates whether the unit is busy or not
Op: operation to perform in the unit (e.g., + or -)
Fi: destination register
Fj, Fk: source-register numbers
Qj, Qk: functional units producing source registers Fj, Fk
Rj, Rk: flags indicating when Fj, Fk are ready

3. Register result status: indicates which functional unit will write each
register, if one exists. Blank when no pending instructions will write
that register


Detailed Scoreboard Pipeline Control

Instruction status    Wait until                          Bookkeeping

Issue                 Not Busy(FU) and not Result(D)      Busy(FU) ← yes; Op(FU) ← op;
                                                          Fi(FU) ← D; Fj(FU) ← S1;
                                                          Fk(FU) ← S2; Qj ← Result(S1);
                                                          Qk ← Result(S2); Rj ← not Qj;
                                                          Rk ← not Qk; Result(D) ← FU

Read operands         Rj and Rk                           Rj ← No; Rk ← No

Execution complete    Functional unit done

Write result          ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No)   ∀f(if Qj(f) = FU then Rj(f) ← Yes);
                      & (Fk(f) ≠ Fi(FU) or Rk(f) = No))   ∀f(if Qk(f) = FU then Rk(f) ← Yes);
                                                          Result(Fi(FU)) ← 0; Busy(FU) ← No
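The wait conditions and bookkeeping in the table above translate almost line for line into code. The sketch below is an illustration under assumptions, not the CDC 6600 logic itself: the class and method names are hypothetical, the unit names `Div`, `Add1`, and `Add2` are made up (a second adder is assumed so that ADDD and SUBD can both be in flight), and execution latency is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FU:
    """Functional unit status: the 9 fields listed on the design slide."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # units producing Fj, Fk
    qk: Optional[str] = None
    rj: bool = False           # Fj, Fk ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, unit_names):
        self.units: Dict[str, FU] = {n: FU() for n in unit_names}
        self.result: Dict[str, str] = {}   # register -> unit that will write it

    def can_issue(self, fu, d):
        # Not Busy(FU) and not Result(D): no structural or WAW hazard
        return not self.units[fu].busy and d not in self.result

    def issue(self, fu, op, d, s1, s2):
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.units[fu] = FU(True, op, d, s1, s2, qj, qk, qj is None, qk is None)
        self.result[d] = fu

    def can_read(self, fu):
        return self.units[fu].rj and self.units[fu].rk   # RAW resolved

    def read_operands(self, fu):
        self.units[fu].rj = self.units[fu].rk = False

    def can_write(self, fu):
        # WAR check: no busy unit still has to read the old value of Fi(FU)
        fi = self.units[fu].fi
        return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
                   for f in self.units.values() if f.busy)

    def write_result(self, fu):
        for f in self.units.values():
            if f.qj == fu:
                f.rj, f.qj = True, None   # clearing Qj is an added assumption
            if f.qk == fu:
                f.rk, f.qk = True, None
        del self.result[self.units[fu].fi]
        self.units[fu] = FU()


sb = Scoreboard(["Div", "Add1", "Add2"])
sb.issue("Div", "DIVD", "F0", "F2", "F4")
sb.read_operands("Div")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")   # waits on F0 (RAW)
sb.issue("Add2", "SUBD", "F8", "F8", "F14")
sb.read_operands("Add2")                      # SUBD runs ahead of ADDD
stalled = sb.can_write("Add2")                # ADDD has not read F8 yet
sb.write_result("Div")
sb.read_operands("Add1")
released = sb.can_write("Add2")
print(stalled, released)
```

SUBD finishes executing first but may not write F8 until ADDD has read it, which is the WAR stall described for the write result stage.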

CDC 6600 Scoreboard


Speedup 1.7 from compiler; 2.5 by hand
BUT slow memory (no cache) limits benefit
Limitations of 6600 scoreboard:
No forwarding hardware
Limited to instructions in basic block (small window)
Small number of functional units (structural hazards),
especially integer/load-store units
Do not issue on structural hazards
Wait for WAR hazards
Prevent WAW hazards


Summary
Instruction Level Parallelism (ILP) in SW or HW
Loop level parallelism is easiest to see
SW parallelism dependencies defined for program,
hazards if HW cannot resolve
SW dependencies/compiler sophistication determine if
compiler can unroll loops
Memory dependencies hardest to determine

HW exploiting ILP
Works when can't know dependence at compile time
Code for one machine runs well on another

Key idea of Scoreboard: Allow instructions behind stall to proceed
(Decode => Issue instr & read operands)
Enables out-of-order execution => out-of-order completion
ID stage checked both for structural and data hazards

Tomasulo Algorithm

(Implemented in IBM 360/91 in 1966)


Control & buffers distributed with functional units (FUs) vs. centralized
in scoreboard
FU buffers are called reservation stations (RS); they hold pending operands
Registers in instructions are replaced by values or pointers to
reservation stations; this is called register renaming
Renaming avoids WAR, WAW hazards
More reservation stations than registers, so can do optimizations
compilers can't
Results go to FUs from RSs, not through registers, over a Common Data Bus
that broadcasts results to all FUs
Loads and stores treated as FUs with RSs as well
Integer instructions can go past branches, allowing
FP ops beyond basic block in FP queue

Tomasulo Organization
[Figure: block diagram showing the FP Op Queue, FP Registers, Load Buffer,
Store Buffer, FP Add reservation stations, FP Mul reservation stations,
and the Common Data Bus]

Reservation Station Components


Op: operation to perform in the unit (e.g., + or -)
Vj, Vk: values of source operands
Store buffers have a V field: the result to be stored
Qj, Qk: reservation stations producing source registers
(value to be written)
Note: no ready flags as in Scoreboard; Qj, Qk = 0 => ready
Store buffers only have Qi for the RS producing the result
Busy: indicates reservation station or FU is busy

Register result status: indicates which functional unit will
write each register, if one exists. Blank when no pending
instructions will write that register.

Three Stages of Tomasulo Algorithm


1. Issue: get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instr & sends operands (renames registers).

2. Execution: operate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result

3. Write result: finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available

Normal data bus: data + destination ("go to" bus)
Common data bus: data + source ("come from" bus)
64 bits of data + 4 bits of Functional Unit source address
Write if matches expected Functional Unit (produces result)
Does the broadcast
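The three stages can be sketched in a few lines to show how copying operand values at issue removes the WAR hazard that the scoreboard has to stall on. This is a simplified illustration, not the 360/91 design: the `RS` and `Tomasulo` classes, the station names `Mult1`, `Add1`, `Add2`, and the register values are all assumptions, and loads, stores, and timing are left out.

```python
from dataclasses import dataclass
from typing import Dict, Optional

OPS = {"ADDD": lambda a, b: a + b,
       "SUBD": lambda a, b: a - b,
       "DIVD": lambda a, b: a / b}


@dataclass
class RS:
    """Reservation station: Op, Vj/Vk values, Qj/Qk producer tags, Busy."""
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None
    qk: Optional[str] = None

    def ready(self):
        return self.busy and self.qj is None and self.qk is None


class Tomasulo:
    def __init__(self, regs, station_names):
        self.regs: Dict[str, float] = dict(regs)
        self.reg_status: Dict[str, str] = {}   # register -> producing RS
        self.stations = {n: RS(n) for n in station_names}

    def _operand(self, reg):
        tag = self.reg_status.get(reg)
        return (None, tag) if tag else (self.regs[reg], None)

    def issue(self, rs_name, op, dest, s1, s2):
        rs = self.stations[rs_name]
        if rs.busy:
            return False                       # structural hazard: stall
        rs.vj, rs.qj = self._operand(s1)       # value now, or tag to wait on
        rs.vk, rs.qk = self._operand(s2)
        rs.busy, rs.op = True, op
        self.reg_status[dest] = rs_name        # rename the destination
        return True

    def broadcast(self, tag, value):
        """Common Data Bus: data plus the source tag, seen by all waiters."""
        for rs in self.stations.values():
            if rs.qj == tag:
                rs.vj, rs.qj = value, None
            if rs.qk == tag:
                rs.vk, rs.qk = value, None
        for reg, producer in list(self.reg_status.items()):
            if producer == tag:                # skip if renamed again (WAW)
                self.regs[reg] = value
                del self.reg_status[reg]
        self.stations[tag] = RS(tag)           # station is free again

    def execute(self, rs_name):
        rs = self.stations[rs_name]
        assert rs.ready(), "operands still pending on the CDB"
        self.broadcast(rs_name, OPS[rs.op](rs.vj, rs.vk))


t = Tomasulo({"F0": 0.0, "F2": 8.0, "F4": 2.0,
              "F8": 3.0, "F10": 0.0, "F14": 1.0},
             ["Mult1", "Add1", "Add2"])
t.issue("Mult1", "DIVD", "F0", "F2", "F4")
t.issue("Add1", "ADDD", "F10", "F0", "F8")   # Vk copies F8 = 3.0 at issue
t.issue("Add2", "SUBD", "F8", "F8", "F14")
t.execute("Add2")                            # SUBD completes first, F8 = 2.0
vk_seen_by_addd = t.stations["Add1"].vk      # still the old F8
t.execute("Mult1")
t.execute("Add1")
print(t.regs["F8"], vk_seen_by_addd, t.regs["F10"])
```

SUBD overwrites F8 before ADDD executes, yet ADDD still computes with the old value because it was captured in the reservation station at issue.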


Tomasulo v. Scoreboard
(IBM 360/91 v. CDC 6600)
                    Tomasulo (IBM 360/91)              Scoreboard (CDC 6600)
Functional units    Pipelined                          Multiple
                    (6 load, 3 store, 3 +, 2 x/÷)      (1 load/store, 1 +, 2 x, 1 ÷)
Window size         14 instructions                    5 instructions
Structural hazard   No issue                           same
WAR                 renaming avoids                    stall completion
WAW                 renaming avoids                    stall issue
Results             Broadcast from FU                  Write/read registers
Control             distributed reservation stations   central scoreboard


Tomasulo Drawbacks
Complexity
delays of 360/91, MIPS 10000, IBM 620?

Many associative stores (CDB) at high speed


Performance limited by Common Data Bus
Multiple CDBs => more FU logic for parallel assoc stores


Tomasulo Summary
Reservation stations: renaming to larger set of registers +
buffering source operands
Prevents registers as bottleneck
Avoids WAR, WAW hazards of Scoreboard
Allows loop unrolling in HW
Not limited to basic blocks
(integer unit gets ahead, beyond branches)
Helps cache misses as well
Lasting Contributions
Dynamic scheduling
Register renaming
Load/store disambiguation
360/91 descendants are Pentium II; PowerPC 604; MIPS R10000;
HP-PA 8000; Alpha 21264


HW support for More ILP


Speculation: allow an instruction that depends on a branch predicted
taken to issue, with no consequences (including exceptions) if the branch
is not actually taken ("HW undo"); called "boosting"
Combine branch prediction with dynamic scheduling to execute
before branches are resolved
Separate speculative bypassing of results from real bypassing of
results
When an instruction is no longer speculative,
write boosted results (instruction commit)
or discard boosted results
Execute out-of-order but commit in-order
to prevent any irrevocable action (update state or exception)
until the instruction commits


HW support for More ILP


Need HW buffer for results of uncommitted instructions: reorder buffer
3 fields: instr, destination, value
Reorder buffer can be operand source => more registers, like RS
Use reorder buffer number instead of reservation station when
execution completes
Supplies operands between execution complete & commit
Once the instruction commits, result is put into register
Instructions commit in order
As a result, it's easy to undo speculated instructions
on mispredicted branches or on exceptions

[Figure: block diagram showing the FP Op Queue, Reorder Buffer, FP Regs,
and two sets of Res Stations, each feeding an FP Adder]


Four Steps of Speculative Tomasulo Algorithm

1. Issue: get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send
operands & reorder buffer no. for destination (this stage sometimes
called "dispatch")

2. Execution: operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for
result; when both in reservation station, execute; checks RAW
(sometimes called "issue")

3. Write result: finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.

4. Commit: update register with reorder result
When instr. at head of reorder buffer & result present, update register
with result (or store to memory) and remove instr from reorder buffer.
Mispredicted branch flushes reorder buffer (sometimes called
"graduation")
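The commit step is the part that makes speculation safe, so it is worth a small sketch. The code below assumes a reorder buffer with the three fields named earlier (instr, destination, value) plus `done` and `mispredicted` flags, which are additions made for this illustration. The class names and the branch instruction text are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ROBEntry:
    instr: str
    dest: Optional[str]            # None for branches
    value: Optional[float] = None
    done: bool = False             # result has been written on the CDB
    mispredicted: bool = False     # only meaningful for branches


class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()     # head of the queue is the oldest instr

    def allocate(self, instr: str, dest: Optional[str]) -> Optional[ROBEntry]:
        """Issue: take a slot in program order, or None if the buffer is full."""
        if len(self.entries) >= self.size:
            return None
        entry = ROBEntry(instr, dest)
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value=None, mispredicted=False):
        """Write result: may happen in any order."""
        entry.value, entry.done, entry.mispredicted = value, True, mispredicted

    def commit(self, regs: Dict[str, float]) -> List[str]:
        """Commit: retire finished instructions from the head, in order."""
        committed = []
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            committed.append(head.instr)
            if head.mispredicted:
                self.entries.clear()   # discard all speculative work
                break
            if head.dest is not None:
                regs[head.dest] = head.value
        return committed


regs: Dict[str, float] = {}
rob = ReorderBuffer(size=4)
a = rob.allocate("DIVD F0,F2,F4", "F0")
b = rob.allocate("BNEZ R1,Loop", None)        # hypothetical branch
c = rob.allocate("ADDD F10,F0,F8", "F10")     # speculative, after the branch

rob.write_result(c, 7.0)                      # finishes first
first = rob.commit(regs)                      # nothing: head not done
rob.write_result(a, 4.0)
second = rob.commit(regs)                     # DIVD retires, branch blocks
rob.write_result(b, mispredicted=True)
third = rob.commit(regs)                      # branch retires, ADDD flushed
print(first, second, third, regs)
```

The speculative ADDD result never reaches the register file, because it sat behind the unresolved branch and was flushed with the rest of the buffer.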


Renaming Registers
Common variation of speculative design
Reorder buffer keeps instruction information but not the result
Extend register file with extra renaming registers to hold
speculative results
Rename register allocated at issue; result into rename register on
execution complete; rename register into real register on commit
Operands read either from register file (real or speculative) or via
Common Data Bus
Advantage: operands are always from single source (extended
register file)
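The allocate, complete, and commit lifecycle of a rename register can be sketched as follows. This is one possible arrangement under assumptions, not a description of a specific processor: the `RenameFile` class, its map table, and its free list are hypothetical names, and a pending speculative value is represented by `None` (in hardware the operand would instead be picked up from the Common Data Bus).

```python
from typing import Dict, List, Optional


class RenameFile:
    """Real registers extended with a small pool of rename registers."""

    def __init__(self, arch_regs: Dict[str, float], num_rename: int):
        self.real = dict(arch_regs)                   # committed state
        self.spec: List[Optional[float]] = [None] * num_rename
        self.free = list(range(num_rename))           # unused rename registers
        self.map: Dict[str, int] = {}                 # reg -> newest rename reg

    def allocate(self, dest: str) -> Optional[int]:
        """Issue: claim a rename register for dest, or None if none is free."""
        if not self.free:
            return None
        idx = self.free.pop(0)
        self.spec[idx] = None
        self.map[dest] = idx
        return idx

    def complete(self, idx: int, value: float) -> None:
        """Execution complete: result goes into the rename register."""
        self.spec[idx] = value

    def read(self, reg: str) -> Optional[float]:
        """Operand read: speculative copy if one exists, else the real one."""
        idx = self.map.get(reg)
        return self.real[reg] if idx is None else self.spec[idx]

    def commit(self, dest: str, idx: int) -> None:
        """Commit: copy the rename register into the real register."""
        self.real[dest] = self.spec[idx]
        if self.map.get(dest) == idx:                 # no newer rename exists
            del self.map[dest]
        self.free.append(idx)


rf = RenameFile({"F0": 1.0, "F8": 3.0}, num_rename=2)
r = rf.allocate("F8")
pending = rf.read("F8")          # None: speculative result not produced yet
rf.complete(r, 2.0)
speculative = rf.read("F8")      # 2.0 from the rename register
real_before = rf.real["F8"]      # committed state still holds 3.0
rf.commit("F8", r)
print(pending, speculative, real_before, rf.real["F8"])
```

Until commit, the committed register keeps its old value, so discarding a mispredicted instruction only requires freeing its rename register.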
