
Advanced Topic:

Hardware-Based Speculation
Speculating on Branch
Outcomes
 To optimally exploit ILP (instruction-level parallelism) – e.g., with
pipelining, Tomasulo, etc. – it is critical to efficiently maintain
control dependencies (i.e., branch dependencies)
 Key idea: speculate on the outcome of branches (i.e., predict) and
execute instructions as if the predictions are correct
 of course, we must proceed in such a manner as to be able to
recover if our speculation turns out wrong
 Three components of hardware-based speculation
 dynamic branch prediction to predict branch outcomes
 speculation to allow instructions to execute before control
dependencies are resolved, i.e., before branch outcomes become
known – with the ability to undo in case of incorrect speculation
 dynamic scheduling
Speculating with Tomasulo
 Modern processors such as PowerPC 603/604, MIPS R10000,
Intel Pentium II/III/4, Alpha 21264 extend Tomasulo’s
approach to support speculation
 Key ideas:
 separate execution from completion: allow instructions to execute
speculatively but do not let instructions update registers or
memory until they are no longer speculative
 therefore, add a final step – after an instruction is no longer
speculative – when it is allowed to make register and memory
updates, called instruction commit
 allow instructions to execute and complete out of order but force
them to commit in order
 add a hardware buffer, called the reorder buffer (ROB), with
registers to hold the result of an instruction between completion
and commit
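The separation of completion from commit can be illustrated with a small sketch (illustrative Python, not any real processor's structures): the ROB behaves as a FIFO, so results may arrive out of order but drain to architectural state strictly in order.

```python
from collections import deque

# Minimal ROB sketch: instructions enter at issue, may complete out of
# order, but update architectural state only from the head, in order.
class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.entries = deque()          # head = oldest instruction

    def issue(self, instr):
        if len(self.entries) == self.size:
            return None                 # structural hazard: ROB full
        entry = {"instr": instr, "ready": False, "value": None}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        entry.update(ready=True, value=value)   # out-of-order completion

    def commit(self):
        # Only a ready head may commit, keeping updates in program order.
        if self.entries and self.entries[0]["ready"]:
            return self.entries.popleft()
        return None

rob = ReorderBuffer(4)
a = rob.issue("L.D F6,34(R2)")
b = rob.issue("MUL.D F0,F2,F4")
rob.complete(b, 42.0)        # younger instruction finishes first...
assert rob.commit() is None  # ...but cannot commit past the unready head
rob.complete(a, 1.5)
assert rob.commit()["instr"] == "L.D F6,34(R2)"
```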
Tomasulo Hardware with
Speculation
Basic structure of the MIPS floating-point unit based on Tomasulo, extended
to handle speculation: a ROB is added, and the store buffer of the original
Tomasulo scheme is eliminated since its functionality is integrated into the ROB

ROB is a queue!
Tomasulo’s Algorithm with
Speculation: Four Stages
1. Issue: get instruction from Instruction Queue
 if a reservation station and a ROB slot are free (no structural hazard),
control issues the instruction to the reservation station and the ROB, and
sends the reservation station the operand values (or the reservation
stations that will produce them) as well as the allocated ROB slot number
2. Execution: operate on operands (EX)
 when both operands are ready, execute;
if not ready, watch the CDB for the result
3. Write result: finish execution (WB)
 write on the CDB to all awaiting units and the ROB;
mark the reservation station available
4. Commit: update register or memory with ROB result
 when an instruction reaches the head of the ROB and its result is present,
update the register with the result (or store to memory) and remove the
instruction from the ROB
 if an incorrectly predicted branch reaches the head of the ROB,
flush the ROB and restart at the correct successor of the branch
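The commit stage, including misprediction recovery, might be sketched as follows (a simplified model; the field names and tuple-return convention are mine, not from any real design):

```python
# Sketch of stage 4 (commit): when a mispredicted branch reaches the ROB
# head, everything younger is squashed and fetch redirects to the correct
# target; otherwise the head's result reaches registers or memory.
def commit_step(rob, regs, mem):
    if not rob or not rob[0]["done"]:
        return None                      # head not ready: nothing commits
    head = rob.pop(0)
    if head["type"] == "branch":
        if head["mispredicted"]:
            rob.clear()                  # squash all speculative successors
            return ("redirect", head["correct_target"])
    elif head["type"] == "store":
        mem[head["dest"]] = head["value"]
    else:                                # ALU op or load
        regs[head["dest"]] = head["value"]
    return ("committed", head)

rob = [
    {"done": True, "type": "branch", "mispredicted": True,
     "correct_target": 0x40},
    {"done": True, "type": "alu", "dest": "F0", "value": 7},
]
regs, mem = {}, {}
assert commit_step(rob, regs, mem) == ("redirect", 0x40)
assert rob == [] and regs == {}   # speculative ALU op never reached F0
```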
ROB Data Structure
ROB entry fields
 Instruction type: branch, store, or register operation (i.e., ALU
op or load)
 State: indicates whether the instruction has completed and its value is ready
 Destination: where the result is to be written – register number
for a register operation (i.e., ALU op or load), memory address for a
store
 a branch has no destination result
 Value: holds the value of the instruction result until commit

Additional reservation station field
 Destination: corresponding ROB entry number
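As a rough illustration, the four ROB entry fields could be encoded like this (the Python field names are illustrative, not from any real design):

```python
from dataclasses import dataclass
from typing import Optional

# One ROB entry: instruction type, state, destination, and the result
# value held between completion and commit.
@dataclass
class ROBEntry:
    itype: str                    # "branch", "store", or "reg" (ALU/load)
    state: str                    # e.g. "issue", "execute", "write", "commit"
    dest: Optional[str]           # register name, memory address, or None for a branch
    value: Optional[float] = None # result, held until commit

e = ROBEntry(itype="reg", state="write", dest="F0", value=3.5)
assert e.dest == "F0" and e.value == 3.5
```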
Example
1. L.D F6, 34(R2)
2. L.D F2, 45(R3)
3. MUL.D F0, F2, F4
4. SUB.D F8, F2, F6
5. DIV.D F10, F0, F6
6. ADD.D F6, F8, F2

 Assume latencies: load 1 clock, add 2 clocks, multiply 10 clocks,
divide 40 clocks
 Show data structures just before MUL.D goes to commit…
Reservation Stations
Name Busy Op Vj Vk Qj Qk Dest A

Load1 no
Load2 no
Add1 no
Add2 no
Add3 no
Mult1 yes MUL Vj=Mem[45+Regs[R3]] Vk=Regs[F4] Dest=#3
Mult2 yes DIV Vk=Mem[34+Regs[R2]] Qj=#3 Dest=#5

Addi indicates the ith reservation station for the FP add unit, etc.
Reorder Buffer
Entry Busy Instruction State Destination Value

1 no L.D F6, 34(R2) Commit F6 Mem[34+Regs[R2]]
2 no L.D F2, 45(R3) Commit F2 Mem[45+Regs[R3]]
3 yes MUL.D F0, F2, F4 Write result F0 #2 × Regs[F4]
4 yes SUB.D F8, F2, F6 Write result F8 #2 – #1
5 yes DIV.D F10, F0, F6 Execute F10
6 yes ADD.D F6, F8, F2 Write result F6 #4 + #2

At the time MUL.D is ready to commit, only the two L.D instructions have
already committed, though the others have completed execution
Actually, the MUL.D is at the head of the ROB – the L.D instructions are
shown only for understanding purposes
#X represents the value field of ROB entry number X
Registers
Floating-point register status (busy registers only):
F0 → ROB #3, F6 → ROB #6, F8 → ROB #4, F10 → ROB #5
All other FP registers are not busy
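The register-status fields above drive operand fetch at issue time: a busy register's value comes from (or is awaited in) the ROB entry named by Reorder#, otherwise from the register file. A possible sketch (names are illustrative):

```python
# Sketch of operand fetch at issue: if a register is Busy, read the ROB
# entry in its Reorder# field (value if ready, else a tag to wait on);
# otherwise read the committed value from the register file.
def read_operand(reg, reg_status, regfile, rob):
    st = reg_status.get(reg)
    if st and st["busy"]:
        entry = rob[st["reorder"]]
        if entry["ready"]:
            return ("value", entry["value"])   # forward completed result
        return ("tag", st["reorder"])          # wait for this ROB entry on the CDB
    return ("value", regfile[reg])             # committed architectural value

reg_status = {"F0": {"busy": True, "reorder": 3}}
rob = {3: {"ready": False, "value": None}}
regfile = {"F0": 0.0, "F4": 2.5}
assert read_operand("F0", reg_status, regfile, rob) == ("tag", 3)
assert read_operand("F4", reg_status, regfile, rob) == ("value", 2.5)
```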
Example
Loop: L.D F0, 0(R1)
MUL.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, #-8
BNE R1, R2, Loop

 Assume instructions in the loop have been issued twice
 Assume L.D and MUL.D from the first iteration have committed and
all other instructions have completed
 Assume the effective address for the store is computed prior to its issue
 Show data structures…
Reorder Buffer
Entry Busy Instruction State Destination Value

1 no L.D F0, 0(R1) Commit F0 Mem[0+Regs[R1]]
2 no MUL.D F4, F0, F2 Commit F4 #1 × Regs[F2]
3 yes S.D F4, 0(R1) Write result 0 + Regs[R1] #2
4 yes DADDUI R1, R1, #-8 Write result R1 Regs[R1] – 8
5 yes BNE R1, R2, Loop Write result
6 yes L.D F0, 0(R1) Write result F0 Mem[#4]
7 yes MUL.D F4, F0, F2 Write result F4 #6 × Regs[F2]
8 yes S.D F4, 0(R1) Write result 0 + #4 #7
9 yes DADDUI R1, R1, #-8 Write result R1 #4 – 8
10 yes BNE R1, R2, Loop Write result

Registers
Floating-point register status (busy registers only):
F0 → ROB #6, F4 → ROB #7
All other FP registers are not busy
Notes
 If a branch is mispredicted, recovery is done by flushing the
ROB of all entries that appear after the mispredicted branch
 entries before the branch are allowed to continue
 restart the fetch at the correct branch successor
 When an instruction commits or is flushed from the ROB, its
slot becomes available for subsequent instructions
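The flush described above can be sketched as keeping only the entries older than the mispredicted branch (a simplified list model; a real ROB just moves its head/tail pointers):

```python
# Misprediction recovery: entries before the branch continue; the resolved
# branch and every younger (speculative) entry are squashed.
def flush_after(rob, branch_idx):
    kept = rob[:branch_idx]          # entries before the branch survive
    del rob[branch_idx:]             # branch and its successors are squashed
    return kept

rob = ["L.D", "MUL.D", "BNE(mispredicted)", "S.D", "DADDUI"]
flush_after(rob, 2)
assert rob == ["L.D", "MUL.D"]       # freed slots become available again
```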
Getting CPI Below 1
 CPI ≥ 1 if we issue only 1 instruction per clock cycle
 Multiple-issue processors come in 3 flavors:
1. Statically-scheduled superscalar,
2. Dynamically-scheduled superscalar, and
3. VLIW (very long instruction word)
 The 2 types of superscalar processors issue varying
numbers of instructions per clock
 use in-order execution if statically scheduled, or
 out-of-order execution if dynamically scheduled
 VLIW processors issue a fixed number of instructions,
formatted either as one large instruction or as a fixed
instruction packet, with explicitly indicated parallelism
(Intel/HP Itanium)
VLIW: Very Long Instruction Word

 Each “instruction” has explicit coding for multiple operations
 In IA-64, grouping called a “packet”
 In Transmeta, grouping called a “molecule” (with “atoms” as ops)
 Trade off instruction space for simple decoding
 The long instruction word has room for many operations
 By definition, all operations the compiler puts in the instruction word
are independent => they can execute in parallel
 E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
 16 to 24 bits per field => 7×16 = 112 bits to 7×24 = 168 bits wide
 Need a compiling technique that schedules across several branches
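The instruction-width arithmetic above is easy to check: seven operation fields at 16 to 24 bits each.

```python
# 2 integer ops + 2 FP ops + 2 memory refs + 1 branch = 7 fields,
# each 16 to 24 bits wide.
fields = 2 + 2 + 2 + 1
assert fields * 16 == 112   # narrowest encoding, in bits
assert fields * 24 == 168   # widest encoding, in bits
```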
VLIW: Very Long Instruction Word

 Scheduling of instructions is done statically (by the compiler)
 Higher clock rate, less control, more room for execution units
 Less hardware
 Less cost and power
 Complex compiler needed to uncover enough parallelism
 Loop unrolling
 Local scheduling – within the same basic block
 Global scheduling – across branches
 Currently very popular in embedded systems such as
DSPs and multimedia applications
Software Techniques -
Example
 This code adds a scalar to a vector:
for (i = 1000; --i >= 0; )
x[i] = x[i] + s;
 Assume the following latencies for all examples
 Ignore delayed branches in these examples

Instruction producing result   Instruction using result   Latency in cycles   Stalls in cycles
FP ALU op                      Another FP ALU op          4                   3
FP ALU op                      Store double               3                   2
Load double                    FP ALU op                  1                   1
Load double                    Store double               1                   0
Integer op                     Integer op                 1                   0
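The latency table can be encoded as a lookup from producer/consumer pairs to stall cycles (an illustrative encoding; the pair names are mine, and unlisted pairs stall 0 cycles):

```python
# Stall cycles between a producing and a consuming instruction, from the
# table above.
STALLS = {
    ("fp_alu", "fp_alu"): 3,    # FP ALU op feeding another FP ALU op
    ("fp_alu", "store_d"): 2,   # FP ALU op feeding a store double
    ("load_d", "fp_alu"): 1,    # load double feeding an FP ALU op
    ("load_d", "store_d"): 0,   # load double feeding a store double
    ("int_op", "int_op"): 0,    # integer op feeding an integer op
}

def stalls_between(producer, consumer):
    return STALLS.get((producer, consumer), 0)

assert stalls_between("fp_alu", "fp_alu") == 3
assert stalls_between("load_d", "fp_alu") == 1
```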

Recall: Unrolled Loop
that Minimizes Scalar Stalls
1 Loop: L.D F0,0(R1) L.D to ADD.D: 1 Cycle
2 L.D F6,-8(R1) ADD.D to S.D: 2 Cycles
3 L.D F10,-16(R1)
4 L.D F14,-24(R1)
5 ADD.D F4,F0,F2
6 ADD.D F8,F6,F2
7 ADD.D F12,F10,F2
8 ADD.D F16,F14,F2
9 S.D 0(R1),F4
10 S.D -8(R1),F8
11 S.D -16(R1),F12
12 DSUBUI R1,R1,#32
13 BNEZ R1,LOOP
14 S.D 8(R1),F16 ; 8-32 = -24

14 clock cycles, or 3.5 per iteration
Loop Unrolling in VLIW
Memory Memory FP FP Int. op/ Clock
reference 1 reference 2 operation 1 op. 2 branch
L.D F0,0(R1) L.D F6,-8(R1) 1
L.D F10,-16(R1) L.D F14,-24(R1) 2
L.D F18,-32(R1) L.D F22,-40(R1) ADD.D F4,F0,F2 ADD.D F8,F6,F2 3
L.D F26,-48(R1) ADD.D F12,F10,F2 ADD.D F16,F14,F2 4
ADD.D F20,F18,F2 ADD.D F24,F22,F2 5
S.D 0(R1),F4 S.D -8(R1),F8 ADD.D F28,F26,F2 6
S.D -16(R1),F12 S.D -24(R1),F16 7
S.D -32(R1),F20 S.D -40(R1),F24 DSUBUI R1,R1,#56 8
S.D 8(R1),F28 BNEZ R1,LOOP 9 ; 8-56 = -48
VLIW with 1 int op, 2 mem ref ops, 2 FP ops
Unrolled 7 times to avoid delays
7 iterations in 9 clocks, or 1.3 clocks per iteration (1.8X)
Average: 2.5 ops per clock, 50% efficiency
Note: Need more registers in VLIW (15 vs. 6 in SS)
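The quoted figures follow from counting the operations in the 5-slot schedule above: 23 operations (7 loads, 7 adds, 7 stores, 1 decrement, 1 branch) in 9 clocks.

```python
# Checking the summary numbers for the 5-slot VLIW schedule.
ops, clocks, slots, iterations = 23, 9, 5, 7
assert round(clocks / iterations, 1) == 1.3        # clocks per iteration
assert round(ops / clocks, 1) == 2.6               # ops per clock (~2.5 quoted)
assert round(ops / (clocks * slots), 2) == 0.51    # ~50% slot efficiency
```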
Problems with 1st-Generation VLIW
 Increase in code size
 Generating enough operations in straight-line code fragment requires
ambitious loop unrolling
 Whenever VLIW instructions aren’t full, unused functional units translate
to wasted bits in instruction encoding
 Solutions: clever encoding, or compress the instructions in main
memory and expand them when brought into the cache or decoded
 Operated in lock-step; no hazard detection HW
 Stall in any functional unit pipeline causes entire processor to stall, since
all functional units must be kept synchronized
 Compiler can predict functional-unit latencies, but cache behavior is hard to predict
 Binary code compatibility (migration problem)
 Pure VLIW => different numbers of functional units and unit latencies
require different versions of the code
Increasing Instruction-Fetch Bandwidth
Branch-Target Buffer (BTB), also called a branch-target cache
 Predicts the next instruction address and sends it out before
the instruction is decoded
 The PC of the fetched branch is sent to the BTB
 When a match is found, the predicted PC is returned
 If the branch is predicted taken, instruction fetch continues at
the predicted PC
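A minimal BTB sketch (illustrative Python; a real BTB is a tagged hardware cache, and the 4-byte fall-through below assumes fixed-width instructions):

```python
# BTB as a map from branch PC to predicted target, consulted while the
# instruction is still being fetched.
class BTB:
    def __init__(self):
        self.table = {}

    def predict(self, pc):
        # Hit: fetch continues at the stored target next cycle.
        # Miss: fall through to pc + 4 (sequential fetch).
        return self.table.get(pc, pc + 4)

    def update(self, pc, taken, target):
        if taken:
            self.table[pc] = target      # remember a taken branch's target
        else:
            self.table.pop(pc, None)     # drop entries for not-taken branches

btb = BTB()
assert btb.predict(0x100) == 0x104       # cold miss: sequential fetch
btb.update(0x100, taken=True, target=0x40)
assert btb.predict(0x100) == 0x40        # hit: predicted-taken target
```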

More Instruction-Fetch Bandwidth
 Integrated branch prediction: the branch predictor is part of the
instruction fetch unit and is constantly predicting branches
 Instruction prefetch: instruction fetch units prefetch to deliver
multiple instructions per clock, integrating fetch with branch
prediction
 Instruction memory access and buffering: fetching multiple
instructions per cycle
 may require accessing multiple cache blocks (prefetch to hide the
cost of crossing cache blocks)
 provides buffering, acting as an on-demand unit to provide
instructions to the issue stage as needed and in the quantity needed
Speculation:
Register Renaming vs. ROB
 An alternative to the ROB is a larger physical set of registers combined
with register renaming
 extended registers replace the function of both the ROB and the
reservation stations
 Instruction issue maps names of architectural registers to
physical register numbers in the extended register set
 on issue, a new unused register is allocated for the destination
(which avoids WAW and WAR hazards)
 speculation recovery is easy because a physical register holding
an instruction’s destination does not become architectural
until the instruction commits
 Most out-of-order processors today use extended registers with
renaming
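Renaming as described above might be sketched with a map plus a free list (illustrative only; real designs also checkpoint the map for misprediction recovery):

```python
# Architectural registers map to physical registers; each destination gets
# a fresh physical register from a free list, removing WAW/WAR hazards.
class RenameMap:
    def __init__(self, arch_regs, num_phys):
        self.map = {r: i for i, r in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_phys))

    def rename_dest(self, reg):
        old = self.map[reg]
        new = self.free.pop(0)       # fresh physical register
        self.map[reg] = new
        return new, old              # old is freed when this instr commits

    def rename_src(self, reg):
        return self.map[reg]         # sources read the current mapping

rm = RenameMap(["F0", "F2"], num_phys=6)
p_new, p_old = rm.rename_dest("F0")
assert p_old == 0 and p_new == 2
assert rm.rename_src("F0") == 2      # later readers see the new mapping
assert rm.rename_src("F2") == 1      # other registers are unaffected
```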
Perspective
 Interest in multiple issue arose from the desire to improve performance
without affecting the uniprocessor programming model
 Taking advantage of ILP is conceptually simple, but the design
problems are amazingly complex in practice
 designs have been conservative in ideas: mostly just faster clocks
and bigger structures
 Processors of the last 5 years (Pentium 4, IBM Power5, AMD Opteron)
have the same basic structure and similar sustained issue rates (3 to 4
instructions per clock) as the first dynamically scheduled, multiple-issue
processors announced in 1995
 clocks 10 to 20X faster, caches 4 to 8X bigger, 2 to 4X as many
renaming registers, and 2X as many load-store units
 yet performance only 8 to 16X
 The gap between peak and delivered performance is increasing

In Conclusion …
 Interrupts and exceptions either interrupt the current instruction or happen
between instructions
 large quantities of state must potentially be saved before
interrupting
 Machines with precise exceptions provide one single point in the program
at which to restart execution
 all instructions before that point have completed
 no instruction after or including that point has completed
 Hardware techniques exist for precise exceptions even in the face of
out-of-order execution!
 an important enabling factor for out-of-order execution
