
Topics Left

Superscalar machines

IA64 / EPIC architecture


Multithreading (explicit and implicit)

Multicore Machines
Clusters

Parallel Processors
Hardware implementation vs microprogramming

Chapter 14
Superscalar Processors
Definition of Superscalar
Design Issues:
- Instruction Issue Policy
- Register renaming
- Machine parallelism
- Branch Prediction
- Execution
Pentium 4 example

What is Superscalar?
A Superscalar machine executes multiple independent
instructions in parallel.
They are pipelined as well.

Common instructions (arithmetic, load/store, conditional branch)
can be executed independently.

Equally applicable to RISC & CISC, but more straightforward in
RISC machines.

The order of execution is usually assisted by the compiler.

Example of Superscalar Organization

- 2 Integer ALU pipelines
- 2 FP ALU pipelines
- 1 memory pipeline (?)

Superscalar v Superpipelined

Limitations of Superscalar
Dependent upon:
- Instruction level parallelism possible
- Compiler based optimization
- Hardware support

Limited by:
- Data dependency
- Procedural dependency
- Resource conflicts

(Recall) True Data Dependency (Must W before R)

ADD r1, r2     (r1 + r2 → r1)
MOVE r3, r1    (r1 → r3)

Can fetch and decode the second instruction in parallel with the
first.

LOAD r1, X     (X (memory) → r1)
MOVE r3, r1    (r1 → r3)

Can NOT execute the second instruction until the first is
finished.
The second instruction is dependent on the first (R after W).

(Recall) Antidependency (Must R before W)

ADD R4, R3, 1    (R3 + 1 → R4)
ADD R3, R5, 1    (R5 + 1 → R3)

Cannot complete the second instruction before the first has
read R3.
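The three hazard types just recalled (true/RAW, anti/WAR, and output/WAW dependencies) can all be detected by comparing the register sets two instructions read and write. A minimal sketch, assuming a simple instruction model as dicts of register sets (the encoding is illustrative, not from the slides):

```python
def classify_hazards(first, second):
    """Return the hazard types between two instructions in program order.

    Each instruction is a dict with 'reads' and 'writes' register sets.
    RAW: second reads what first writes  (true dependency, W before R)
    WAR: second writes what first reads  (antidependency, R before W)
    WAW: both write the same register    (output dependency)
    """
    hazards = set()
    if first["writes"] & second["reads"]:
        hazards.add("RAW")
    if first["reads"] & second["writes"]:
        hazards.add("WAR")
    if first["writes"] & second["writes"]:
        hazards.add("WAW")
    return hazards

# LOAD r1, X ; MOVE r3, r1  -> true (RAW) dependency on r1
load = {"reads": set(), "writes": {"r1"}}
move = {"reads": {"r1"}, "writes": {"r3"}}
print(classify_hazards(load, move))   # {'RAW'}

# ADD R4, R3, 1 ; ADD R3, R5, 1  -> antidependency (WAR) on R3
add1 = {"reads": {"R3"}, "writes": {"R4"}}
add2 = {"reads": {"R5"}, "writes": {"R3"}}
print(classify_hazards(add1, add2))   # {'WAR'}
```

This is exactly the check issue logic must perform before launching two instructions in the same cycle.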

(Recall) Procedural Dependency

Can't execute instructions after a branch in parallel
with instructions before the branch, because until the branch
resolves it is not known whether those instructions will
execute at all.

Note: Also, if instruction length is not fixed,
instructions have to be decoded to find out how many
fetches are needed.

(Recall) Resource Conflict

Two or more instructions requiring access to the
same resource at the same time,
e.g. two arithmetic instructions both need the ALU.

Solution: resources can possibly be duplicated,
e.g. have two arithmetic units.

Effect of Dependencies on Superscalar Operation

Notes:

1) Superscalar operation is doubly impacted by a stall.

2) CISC machines typically have different-length instructions that need to be at least
partially decoded before the next can be fetched, which is not good for superscalar operation.

Degree of Instruction-Level Parallelism

Consider:
LOAD R1, R2
ADD  R3, 1
ADD  R4, R2

These can be handled in parallel.

Consider:
ADD R3, 1
ADD R4, R3
STO (R4), R0

These cannot be handled in parallel.

The degree of instruction-level parallelism is determined by the
number of instructions that can be executed in parallel without
stalling for dependencies.
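This degree can be estimated mechanically: scan forward from the first instruction and stop at the first one that conflicts with anything already in the group. A rough sketch using the same toy register-set encoding as above (the instruction model is an assumption for illustration):

```python
def parallel_group_size(instrs):
    """Count how many leading instructions could issue together:
    stop at the first instruction that reads or writes a register
    written by an earlier one, or writes a register an earlier one reads.
    """
    group = []
    for ins in instrs:
        for prev in group:
            if (prev["writes"] & (ins["reads"] | ins["writes"])
                    or prev["reads"] & ins["writes"]):
                return len(group)
        group.append(ins)
    return len(group)

# LOAD R1,R2 ; ADD R3,1 ; ADD R4,R2 -- independent, degree 3
seq1 = [
    {"reads": {"R2"}, "writes": {"R1"}},          # LOAD R1, R2
    {"reads": {"R3"}, "writes": {"R3"}},          # ADD  R3, 1
    {"reads": {"R4", "R2"}, "writes": {"R4"}},    # ADD  R4, R2
]
# ADD R3,1 ; ADD R4,R3 ; STO (R4),R0 -- chained, degree 1
seq2 = [
    {"reads": {"R3"}, "writes": {"R3"}},          # ADD R3, 1
    {"reads": {"R4", "R3"}, "writes": {"R4"}},    # ADD R4, R3
    {"reads": {"R4", "R0"}, "writes": set()},     # STO (R4), R0
]
print(parallel_group_size(seq1), parallel_group_size(seq2))  # 3 1
```

The two sequences reproduce the slide's examples: the first issues as one group of three, the second is serialized by the chain through R3 and R4.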

Instruction Issue Policies

Order in which instructions are fetched
Order in which instructions are executed
Order in which instructions update registers and
memory values (order of completion)

Standard Categories:
- In-order issue with in-order completion
- In-order issue with out-of-order completion
- Out-of-order issue with out-of-order completion

In-Order Issue -- In-Order Completion

Issue instructions in the order they occur:
- Not very efficient
- Instructions must stall if necessary (and stalling in
superpipelining is expensive)

In-Order Issue -- In-Order Completion


(Example)
Assume:
I1 requires 2 cycles to execute
I3 & I4 conflict for the same functional unit
I5 depends upon value produced by I4
I5 & I6 conflict for a functional unit

In-Order Issue -- Out-of-Order Completion


(Example)

Again:
I1 requires 2 cycles to execute
I3 & I4 conflict for the same functional unit
I5 depends upon value produced by I4
I5 & I6 conflict for a functional unit

How does this affect interrupts?

Out-of-Order Issue -- Out-of-Order Completion


Decouple the decode pipeline from the execution pipeline.

Can continue to fetch and decode until the window
is full.
When a functional unit becomes available, an
instruction can be executed (usually staying as close to
in-order as possible).

Since instructions have been decoded, the processor can
look ahead.

Out-of-Order Issue -- Out-of-Order Completion


(Example)

Again:
I1 requires 2 cycles to execute
I3 & I4 conflict for the same functional unit
I5 depends upon value produced by I4
I5 & I6 conflict for a functional unit

Note: I5 depends upon I4, but I6 does not

Register Renaming
to avoid hazards

Output and antidependencies occur because register
contents may not reflect the correct ordering from the
program.
Can require a pipeline stall.
One solution: allocate registers dynamically
(register renaming).
Register Renaming example


Add R3, R3, R5   (I1)      R3b := R3a + R5a
Add R4, R3, 1    (I2)      R4b := R3b + 1
Add R3, R5, 1    (I3)      R3c := R5a + 1
Add R7, R3, R4   (I4)      R7b := R3c + R4b

Without a subscript: the logical register named in the
instruction.
With a subscript: the hardware register allocated
(R3a, R3b, R3c).
Note: R3c avoids the antidependency on I2 and the
output dependency on I1.
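The renaming above follows a simple rule: sources read the current physical name of each register, and every write allocates a fresh one. A minimal sketch that reproduces the slide's example (the suffix-letter naming and instruction tuples are illustrative conventions, not real hardware state):

```python
import string

def rename(instrs):
    """Rename architectural registers to fresh physical names.

    instrs: list of (dest, [sources]); immediates are plain strings
    like '1'. Every register starts at suffix 'a' (its initial value);
    each write allocates the next suffix, removing WAR and WAW hazards.
    """
    idx = {}  # architectural register -> suffix index of current name

    def cur(reg):
        if not reg.startswith("R"):
            return reg                       # immediate operand, unchanged
        return reg + string.ascii_lowercase[idx.get(reg, 0)]

    result = []
    for dest, srcs in instrs:
        rhs = [cur(s) for s in srcs]         # read current names first...
        idx[dest] = idx.get(dest, 0) + 1     # ...then allocate a new dest
        result.append(f"{cur(dest)} := {' + '.join(rhs)}")
    return result

prog = [("R3", ["R3", "R5"]),   # I1: Add R3, R3, R5
        ("R4", ["R3", "1"]),    # I2: Add R4, R3, 1
        ("R3", ["R5", "1"]),    # I3: Add R3, R5, 1
        ("R7", ["R3", "R4"])]   # I4: Add R7, R3, R4
for line in rename(prog):
    print(line)
```

Running it prints exactly the four renamed assignments from the slide; note how I3 gets R3c, so I2's read of R3b and I1's write of R3b are unaffected.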

Recapping: Machine Parallelism Support

Duplication of Resources
Out of order issue hardware
Windowing to decouple execution from decode
Register Renaming capability

Speedups of Machine Organizations
(Without Procedural Dependencies)

- Not worth duplicating functional units without register renaming
- Need an instruction window large enough (more than 8, probably not more than 32)

Branch Prediction in Superscalar Machines

Delayed branch is not used much. Why?
Multiple instructions need to execute in the delay slot.
This leads to much complexity in recovery.

Branch prediction should be used; branch history is
very useful.
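A classic way branch history is used is a 2-bit saturating counter per branch, which only flips its prediction after two consecutive mispredictions. This is a generic textbook scheme, not the specific predictor of any machine in these slides; a minimal sketch:

```python
class TwoBitPredictor:
    """One 2-bit saturating counter per branch address.

    Counter values 0-3; predict taken when >= 2. A single
    misprediction nudges the counter; only two in a row flip
    the prediction, so a loop's one exit miss is tolerated.
    """
    def __init__(self):
        self.counters = {}   # branch PC -> counter value 0..3

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2   # start weakly not-taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

p = TwoBitPredictor()
hits = 0
for taken in [True, True, True, False, True, True]:  # loop-like pattern
    hits += p.predict(0x40) == taken
    p.update(0x40, taken)
print(hits)  # 4 of 6 correct: one warm-up miss, one loop-exit miss
```

After warming up, the predictor stays "taken" through the single not-taken loop exit, which is exactly why history-based prediction beats a static guess on loop branches.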

View of Superscalar Execution

Committing or Retiring Instructions


Results need to be put back into program order (committed or
retired).
Results must sometimes be held in temporary storage
until it is certain they can be placed in permanent
storage (then they are either committed or retired/flushed).
Temporary storage requires regular cleanup;
this overhead is done in hardware.
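The commit step can be sketched as a reorder buffer drained strictly in program order: a finished result that is younger than an unfinished instruction must wait in temporary storage. A simplified model (the entry fields are illustrative):

```python
from collections import deque

def retire_ready(rob):
    """Commit finished entries strictly in program order.

    Stops at the first entry whose result is not yet ready: it
    blocks all younger entries, even completed ones, so permanent
    state is always updated in original program order.
    """
    committed = []
    while rob and rob[0]["done"]:
        committed.append(rob.popleft()["tag"])
    return committed

rob = deque([
    {"tag": "I1", "done": True},
    {"tag": "I2", "done": False},   # still executing
    {"tag": "I3", "done": True},    # finished out of order, must wait
])
print(retire_ready(rob))  # ['I1'] -- I3 waits behind I2
```

If I2 later turns out to be on a mispredicted path, I3's buffered result is simply flushed instead of retired, which is the "committed or flushed" choice described above.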

Superscalar Hardware Support


- Facilities to simultaneously fetch multiple
instructions
- Logic to determine true dependencies involving
register values, and mechanisms to communicate
these values
- Mechanisms to initiate multiple instructions in
parallel
- Resources for parallel execution of multiple
instructions
- Mechanisms for committing process state in correct
order

Example: Pentium 4
A Superscalar CISC Machine

Pentium 4 alternate view

Pentium 4 pipeline

20 stages !

a) Generation of Micro-ops (stages 1 &2)

Using the Branch Target Buffer and the Instruction Translation
Lookaside Buffer, the x86 instructions are fetched 64 bytes at a
time from the L2 cache.
The instruction boundaries are determined and instructions are
decoded into 1-4 118-bit RISC micro-ops.
Micro-ops are stored in the trace cache.

b) Trace cache next instruction pointer (stage 3)

The Trace Cache Branch Target Buffer contains dynamically
gathered history information (4-bit tag).
If the target is not in the BTB:
- Branch not PC-relative: predict taken if it is a return,
predict not taken otherwise
- For PC-relative backward conditional branches, predict taken,
otherwise predict not taken

c) Trace Cache fetch (stage 4)

Orders micro-ops into program-ordered sequences called traces.
These are fetched in order, subject to branch prediction.
Some instructions translate into many micro-ops (complex CISC
instructions); these are coded into ROM and fetched from the ROM.

d) Drive (stage 5)

Delivers instructions from the Trace Cache to the
Rename/Allocator module for reordering.

e) Allocate: register renaming (stages 6, 7, & 8)

Allocates resources for execution (3 micro-ops arrive per clock cycle):

- Each micro-op is allocated a slot in the 126-position circular Reorder Buffer (ROB), which
tracks the progress of the micro-ops.
Buffer entries include:
- State: scheduled, dispatched, completed, ready for retire
- Address that generated the micro-op
- Operation
- Alias registers are assigned for one of the 16 architectural registers (128 alias registers)
{to remove data dependencies}

The micro-ops are dispatched out of order as resources become available.

Allocates an entry in one of the 2 scheduler queues: memory access or not.

f) Micro-op queuing (stage 9)

Micro-ops are loaded into one of 2 queues:
- one for memory operations
- one for non-memory operations

Each queue operates on a FIFO policy.

g) Micro-op scheduling
(stages 10, 11, & 12)

h) Dispatch
(stages 13 & 14)

The 2 schedulers retrieve micro-ops based upon having all
the operands ready and dispatch them to an available unit (up
to 6 per clock cycle).
If two micro-ops need the same unit, they are dispatched
in sequence.
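The dispatch rule just described can be sketched in a few lines: in each cycle, issue a micro-op only if all its source operands are ready and its functional unit is still free, consuming each unit once per cycle. This is a deliberately simplified model (the micro-op fields, unit names, and single-cycle units are assumptions for illustration, not the actual NetBurst mechanism):

```python
def dispatch(queue, ready_regs, free_units, max_per_cycle=6):
    """One dispatch cycle.

    Issues micro-ops whose source operands are all in ready_regs and
    whose functional unit is in free_units; a unit is consumed once
    per cycle, so two micro-ops needing the same unit go in sequence
    (the loser waits for a later cycle). At most max_per_cycle issue.
    """
    issued = []
    for uop in list(queue):                 # iterate a copy while removing
        if len(issued) == max_per_cycle:
            break
        if uop["srcs"] <= ready_regs and uop["unit"] in free_units:
            free_units.remove(uop["unit"])  # unit busy for this cycle
            queue.remove(uop)
            issued.append(uop["id"])
    return issued

queue = [
    {"id": "u1", "srcs": {"r1"}, "unit": "alu"},
    {"id": "u2", "srcs": {"r9"}, "unit": "alu"},   # r9 not ready yet
    {"id": "u3", "srcs": {"r2"}, "unit": "alu"},   # ALU taken by u1
    {"id": "u4", "srcs": {"r1", "r2"}, "unit": "mem"},
]
print(dispatch(queue, {"r1", "r2"}, {"alu", "mem"}))  # ['u1', 'u4']
```

In this cycle u1 and u4 issue; u2 stalls on an operand and u3 loses the ALU conflict, both waiting for a later cycle.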

i) Register file
(stages 15 & 16)

j) Execute: flags
(stages 17 & 18)

The register files are the sources for pending fixed-point and FP
operations.
A separate stage is used to compute the flags.

k) Branch check
(stage 19)

l) Branch check results (stage 20)

Checks flags and compares results with predictions.

If the branch prediction was wrong:
- all incorrect micro-ops must be flushed (don't want to be wrong!)
- the correct branch destination is provided to the Branch Predictor
- the pipeline is restarted from the new target address
