Superscalar Microprocessors

PowerPC 620
[Figure: pipelines of the PowerPC 620. Stage labels: Fetch, Issue, Execute, Complete, Write back (integer); Load Addr. Calc., Dcache Access, Align (load); Store Addr. Calc., Dcache Lookup, Dcache Write (store); FPR Access, FP Multiply, FP Add, FP Normalize (floating point); GPR Access.]

Components of a Superscalar Processor
[Figure: block diagram with I-cache and MMU, BHT, BTAC, Branch Unit, Instruction Fetch Unit, Instruction Decode and Register Rename Unit, Instruction Buffer, Instruction Issue Unit, Reorder Buffer, Load/Store Unit, Floating-Point Unit(s), Integer Unit(s), Retire Unit, General Purpose Registers, Floating-Point Registers, Rename Registers, D-cache and MMU, and a Bus Interface Unit with 32 (64) bit Data Bus, 32 (64) bit Address Bus and Control Bus.]
Superscalar
• Instructions are issued from a sequential stream of normal instructions.
• The instructions that are issued are scheduled dynamically by the hardware.
• More than one instruction can be issued each cycle (motivating the term superscalar instead of scalar).
• The number of issued instructions is determined dynamically by hardware, that is, the actual number of instructions issued in a single cycle can range from zero up to a maximum instruction issue bandwidth.

Superscalar Processors
• Concurrency
• Out-of-order issue
• Out-of-order execution
Phases of Instruction Processing
[Figure: pipeline phases Instruction Fetch, Prefetch / Predecode, Decode / Register Rename, Issue, Dispatch, Execute, Write-Back, Complete / Reorder, Commit, Retire.]

Details of Superscalar Pipeline
• In-order section:
– Instruction Fetch (BTAC access, simple branch prediction), feeding the fetch buffer
– Instruction decode
• often: more complex branch prediction techniques
• Register Rename, feeding the instruction window
• Out-of-order section:
– Instruction issue to FU or Reservation station
– Execute till completion
• In-order section:
– Retire (commit or remove)
– Write-back
Decode Stage
• Fetch and decode instructions at a higher bandwidth than execute them.
• If the instruction window is kept full, the deeper instruction look-ahead allows the processor to find more instructions to issue to the execution units.
• The processor fetches and decodes more (today about 1.4 to twice as many) instructions than it commits, because it discards instructions on mispredicted branch paths.
• Typically the decode bandwidth is the same as the instruction fetch bandwidth.
• Multiple instruction fetch and decode is supported by a fixed instruction length.

Predecode
[Figure: Second-level cache (or memory), typically 128 bits/cycle, into the predecode unit; e.g. 148 bits/cycle from the predecode unit into the Icache.]
When instructions are written into the Icache, the predecode unit (1) appends 4-7 bits to each RISC instruction.
(1) In the AMD K5, which is an x86-compatible CISC processor, the predecode unit appends 5 bits to each byte.
Predecoding
• Predecoding can be done when the instructions are transferred from memory or secondary cache to the I-cache.
• ⇒ The decode stage becomes simpler.
• MIPS R10000: predecodes each 32-bit instruction into a 36-bit format stored in the I-cache.
– The four extra bits indicate which functional unit should execute the instruction.
– The predecoding also rearranges operand- and destination-select fields to be in the same position for every instruction, and
– modifies opcodes to simplify decoding of integer or floating-point destination registers.
• The decoder can decode this expanded format more rapidly than the original instruction format.

Rename Stage
• The aim of register renaming is to remove anti and output dependencies dynamically by the processor hardware.
• Register renaming is the process of dynamically associating physical registers (rename registers) with the architectural registers (logical registers) referred to in the instruction set of the architecture.
• Register renaming is implemented by allocating a new physical register for every destination register specified in an instruction.
• Each physical register is written only once after each assignment from the free list of available registers.
• If a subsequent instruction needs its value, that instruction must wait until it is written (true data dependence).
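A minimal sketch of the rename stage, under stated assumptions: the function `rename`, the register names `r0`..`r3` and `p0`..`p7`, and the three-instruction program are hypothetical and serve only as illustration. Each destination receives a fresh physical register from the free list, so anti and output dependences disappear while true dependences stay visible through the mapping. Returning registers to the free list after commit is left out.

```python
from collections import deque


def rename(instructions, num_arch=4, num_phys=8):
    """Rename (dest, src1, src2) triples of architectural registers.

    Assumption: architectural register rN initially maps to physical
    register pN; the remaining physical registers form the free list.
    """
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch)}
    free_list = deque(f"p{i}" for i in range(num_arch, num_phys))
    renamed = []
    for dest, src1, src2 in instructions:
        # Sources read the current mapping, so true (RAW) dependences remain.
        s1, s2 = mapping[src1], mapping[src2]
        # Every destination gets a new physical register from the free list.
        if not free_list:
            raise RuntimeError("no free physical register: rename must stall")
        new_dest = free_list.popleft()
        mapping[dest] = new_dest
        renamed.append((new_dest, s1, s2))
    return renamed


program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r2", "r1", "r3"),  # anti dependence on r2, true dependence on r1
    ("r1", "r0", "r3"),  # output dependence on r1
]
renamed_program = rename(program)
for instruction in renamed_program:
    print(instruction)
```

After renaming, the second instruction still reads the physical register written by the first (true dependence), while the third writes a different physical register than the first, so the output dependence on r1 is gone.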
Physical/Rename Register
[Figure: Format of the rename register entries. Fields as far as recoverable: rename valid, register number, result, result valid.]

Techniques to implement renaming:
• Separate sets of architectural registers and rename (physical) registers are provided.
– The physical registers only contain temporary values (of completed but not yet retired instructions),
– the architectural registers store the committed values.
– After commitment of an instruction, copying its result from the rename register to the architectural register is required.
Register Renaming
[Figure: for an instruction "ad r2, ..., ..." a mapping table indexed by architectural reg. numbers 0 to 31 (fields: entry valid, RB-index) points into a physical register file with physical reg. numbers 0 to 39; the free physical register p3 is allocated to r2.]

Techniques to implement renaming:
• Only a single set of registers is provided and architectural registers are dynamically mapped to physical registers.
– The physical registers contain committed values and temporary results.
– After commitment of an instruction, the physical register is made permanent and no copying is necessary.
• An alternative to dynamic renaming is the use of a large register file as defined for the Intel Merced.
Type of rename buffers
(The basic approach of how rename buffers are implemented)
• Merged architectural and rename register file
– Method of fetching operands: from the merged rename and architectural reg. file
– Examples: Power1 (1990), Power2 (1993), ES/9000 (1992p), Nx586 (1994), PM1 (Sparc64, 1995), R10000 (1996)
• Separate rename and architectural register files
– Method of fetching operands: from the rename reg. file and the architectural reg. file
– Examples: PowerPC 603 (1993), PowerPC 604 (1994), PowerPC 620 (1995)
• Holding renamed values in the ROB
– Method of fetching operands: from the ROB and the architectural reg. file
– Examples: Pentium Pro (1995), Am29000 sup (1995), K5 (1995), M1 (1995)
• Holding renamed values in the DRIS
– Method of fetching operands: from the DRIS and the architectural reg. file
– Examples: Lightning (1991p)

Issue and Dispatch
• Instruction issue is the process of initiating instruction execution in the processor's functional units (scheduling).
– issue to a FU or a reservation station
– dispatch, if a second issue stage exists, denotes when an instruction starts to execute in the functional unit.
• The instruction-issue policy is the protocol used to issue instructions.
• The processor's lookahead capability is the ability to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute.
Instruction scheduling
[Figure: Static scheduling (accomplished by the compiler): input code passes through a software window in the IFP compiler, producing dependence free and optimized code for a superscalar or pipelined processor, or for a VLIW processor. Dynamic scheduling (accomplished by the processor): input code passes through the issue window of the instruction issue unit of the processor (instruction to be issued) into the execution window (instruction in execution), yielding dependence free code for a pipelined or a superscalar processor.]

Issue
• The issue (scheduling) logic examines the waiting instructions in the instruction window and simultaneously assigns (issues) a number of instructions to the FUs up to a maximum issue bandwidth.
• Several instructions can be issued simultaneously.
• The program order of the issued instructions is stored in the reorder buffer.
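Why the program order has to be stored can be illustrated with a small reorder buffer sketch. The class `ReorderBuffer`, its method names and the instruction names a, b, c are hypothetical. The assumption is that instructions may complete in any order but retire only from the head of the buffer, that is, in program order.

```python
from collections import deque


class ReorderBuffer:
    """Keeps issued instructions in program order until they retire."""

    def __init__(self):
        self.entries = deque()  # each entry: [name, completed flag]

    def allocate(self, name):
        # Called at issue time, in program order.
        self.entries.append([name, False])

    def complete(self, name):
        # Execution may finish in any order.
        for entry in self.entries:
            if entry[0] == name:
                entry[1] = True

    def retire(self, max_retire):
        # Only completed instructions at the head may retire.
        retired = []
        while self.entries and self.entries[0][1] and len(retired) < max_retire:
            retired.append(self.entries.popleft()[0])
        return retired


rob = ReorderBuffer()
for name in ["a", "b", "c"]:
    rob.allocate(name)
rob.complete("c")
rob.complete("a")
first = rob.retire(max_retire=2)   # c is finished but must wait behind b
rob.complete("b")
second = rob.retire(max_retire=2)
print(first, second)
```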
Issue
• Instruction issue from the instruction buffer can be:
– in-order (only in program order) or out-of-order
– it can be subject to simultaneous data dependences and resource (structural) constraints,
– or it can be divided into two (or more) stages
• checking structural conflicts in the first and data dependences in the next stage (or vice versa).
• In the case of structural conflicts, the instructions are issued first to reservation stations (buffers) in front of the FUs where the issued instructions await missing operands.

Issue order
[Figure: In-order issue: the issue window holds the instructions to be issued e d c b a, and only a is issued; instructions are issued strictly in program order. Examples: most superscalar processors. Out-of-order issue: the issue window holds e d c b a, and c and a are issued; instructions may be issued out of order. Examples: MC 88110 (1991) (partially), PPC 601 (1993) (partially).]
Comments:
For illustration purposes issue in-order superscalar processor is assumed.
The figure legend designates independent, dependent and issued instructions.
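The two issue orders can be sketched as one selection loop. The function `issue` and its parameters are hypothetical, and the assumption that a and c are independent while b, d and e are dependent is chosen only to match the issued instructions shown in the figure.

```python
def issue(window, ready, issue_width, in_order):
    """Return the instructions issued in one cycle.

    window: instruction names in program order (oldest first).
    ready: names of instructions without unresolved dependences.
    """
    issued = []
    for instr in window:
        if len(issued) == issue_width:
            break  # maximum issue bandwidth reached
        if instr in ready:
            issued.append(instr)
        elif in_order:
            break  # a dependent instruction blocks all younger ones
    return issued


window = ["a", "b", "c", "d", "e"]
ready = {"a", "c"}  # assumption: b, d and e are dependent
in_order_issued = issue(window, ready, issue_width=4, in_order=True)
out_of_order_issued = issue(window, ready, issue_width=4, in_order=False)
print(in_order_issued, out_of_order_issued)
```

With in-order issue the dependent instruction b stops the scan, so only a is issued; with out-of-order issue the scan skips b and also issues c.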
Dispatch
• An instruction is then said to be dispatched from a reservation station to the FU when all operands are available, and execution starts.
• If all its operands are available during issue and the FU is not busy, an instruction is immediately dispatched, starting execution in the next cycle after the issue.
• So, the dispatch is usually not a pipeline stage.
• An issued instruction may stay in the reservation station for zero to several cycles.
• Dispatch and execution are performed out of program order.
• micro-dataflow

Instruction Window
• The notion of the instruction window comprises all the waiting stations between decode (rename) and execute stages.
• The instruction window isolates the decode/rename from the execution stages of the pipeline.
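The dispatch rule from the Dispatch slide can be sketched with a single reservation station. The class `ReservationStation`, the instruction names i1 and i2 and the result tag p4 are hypothetical; the sketch assumes one FU per station and oldest-first selection among ready entries.

```python
class ReservationStation:
    """Buffer in front of one functional unit (FU)."""

    def __init__(self):
        self.entries = []  # each entry: [name, set of missing operands]

    def issue(self, name, missing_operands):
        self.entries.append([name, set(missing_operands)])

    def broadcast(self, result_tag):
        # A completed result wakes up every entry waiting for it.
        for entry in self.entries:
            entry[1].discard(result_tag)

    def dispatch(self, fu_busy=False):
        # Dispatch the oldest entry whose operands are all available.
        if fu_busy:
            return None
        for entry in self.entries:
            if not entry[1]:
                self.entries.remove(entry)
                return entry[0]
        return None


rs = ReservationStation()
rs.issue("i1", ["p4"])  # waits for the result tagged p4
rs.issue("i2", [])      # all operands available at issue
first = rs.dispatch()   # i2 overtakes the older i1
rs.broadcast("p4")
second = rs.dispatch()  # i1 has all operands now
print(first, second)
```

The younger instruction i2 is dispatched before i1, which is the out-of-program-order behaviour the slide describes.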
Instruction Window Organizations
• single-stage issue out of a central instruction window
• multi-stage issue: Operand availability and resource availability checking is split into two separate stages.

Multiple Instruction Issue
[Figure: Icache, I-buffer with issue window (n), Decode/check/issue, issue of up to n instructions to the execution units (EU). Dependent instructions block instruction issue.]
Issue Schemes
• Single-level, central issue: single-level issue out of a central window as in the Pentium II processor
[Figure: Decode and Rename, Issue and Dispatch, Functional Units.]

Instruction Window Organizations
• decoupling of instruction windows: Each instruction window is shared by a group of (usually related) functional units, most common: separate floating-point window and integer window
• combination of multi-stage issue and decoupling of instruction windows:
– In a two-stage issue scheme with resource-dependent issue preceding the data-dependent dispatch, the first stage is done in-order, the second stage is performed out-of-order.
Single-Level, Two-Window Issue
• Single-level, two-window issue: single-level issue with instruction window decoupling using two separate windows
– most common: separate floating point and integer windows as in HP 8000 processor
[Figure: Decode and Rename, two Issue and Dispatch windows, each feeding its own group of Functional Units.]

Two-Level Issue with Multiple Windows
• Two-level issue with multiple windows with a centralized window in the first stage and separate windows in the second stage (PowerPC 604 and 620 processors).
[Figure: Decode and Rename, Issue, Reservation Stations, Dispatch, one Functional Unit per reservation station.]
Execution Stages
• Various types of FUs classified as:
– single-cycle (latency of one) or
– multiple-cycle (latency more than one) units.
• Single-cycle units produce a result one cycle after an instruction started execution.
• Usually they are also able to accept a new instruction each cycle (throughput of one).

Execution Stages
• Multi-cycle units perform more complex operations that cannot be implemented within a single cycle.
• Multi-cycle units
– can be pipelined to accept a new operation each cycle or every other cycle
– or they are non-pipelined.
• Another class of units exists that perform the operations with variable cycle times.
Types of Functional Units
• single-cycle (single latency) units:
– (simple) integer and (integer-based) multimedia units
• multicycle units that are pipelined (throughput of one):
– complex integer, floating-point, and (floating-point-based) multimedia units (also called multimedia vector units)

Types of Functional Units
• multicycle units that are pipelined but do not accept a new operation each cycle (throughput of 1/2 or less):
– often the 64-bit floating-point operations in a floating-point unit,
• multicycle units that are not pipelined:
– division unit, square root units, complex multimedia units
• variable cycle time units:
– load/store unit (depending on cache misses) and special implementations of e.g. floating-point units.
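The difference between latency and throughput for these unit types can be worked through with a small calculation. The function `completion_cycles` and the latency of 3 cycles are assumptions for illustration; the first operation is taken to start in cycle 0 and operations are fed back to back.

```python
def completion_cycles(num_ops, latency, initiation_interval):
    """Cycle in which each of num_ops back-to-back operations completes.

    latency: cycles from start of execution to the result.
    initiation_interval: cycles between two operation starts
    (1 = throughput of one, 2 = throughput of 1/2,
    equal to latency = non-pipelined unit).
    """
    return [k * initiation_interval + latency for k in range(num_ops)]


single_cycle = completion_cycles(3, latency=1, initiation_interval=1)
pipelined = completion_cycles(3, latency=3, initiation_interval=1)
half_rate = completion_cycles(3, latency=3, initiation_interval=2)
non_pipelined = completion_cycles(3, latency=3, initiation_interval=3)
print(single_cycle, pipelined, half_rate, non_pipelined)
```

For three operations the fully pipelined unit finishes in cycles 3, 4, 5, the half-rate unit in cycles 3, 5, 7, and the non-pipelined unit in cycles 3, 6, 9, although all three have the same latency.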
Constraints
Constraints related to IFP execution
• Constraints to preserve sequential consistency of execution
– Constraints to preserve sequential consistency of instruction processing: data and control dependences (see sections 8.2.2 and 8.2.3)
– Constraints to preserve sequential consistency of exception processing: requirement of precise exceptions (see section 8.3)
• Resource constraints
– Resource dependences (see section 8.2.4)