Instruction Level Parallelism

2. Superscalar and VLIW processors

Superscalar and VLIW Processors

Vittorio Zaccaria – Alari @ ST 2001

• Scalar processors fetch and issue at most 1 operation in each clock cycle.
• Multiple-issue processors:
    • Superscalar (issue a varying number of instructions at each clock cycle).
    • VLIW (issue a fixed number of instructions at each clock cycle).

Superscalar Processors
• Issue from 1 to 8 instructions at each clock cycle.
• If an instruction depends on an earlier instruction in the same group, only the instructions preceding it are issued (in-order issue).
• This decision is made at run-time by the processor.
=> Variability in the issue rate.
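A minimal Python sketch of the in-order issue rule above. The `Instr` tuple, the `issue_group` function and the 4-wide default are illustrative assumptions, not a description of any real processor; only RAW dependences inside the issue group are checked.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]       # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]     # registers read


def issue_group(window: List[Instr], max_issue: int = 4) -> int:
    """Return how many instructions at the head of `window` issue this cycle.

    In-order issue: scanning stops at the first instruction that reads a
    register written by an earlier instruction of the same group.
    """
    written = set()
    count = 0
    for ins in window[:max_issue]:
        if any(src in written for src in ins.srcs):
            break                      # RAW hazard: this and later ones wait
        if ins.dest is not None:
            written.add(ins.dest)
        count += 1
    return count


# Hypothetical window: LD F0,0(R1); LD F6,-8(R1); ADDD F4,F0,F2; SD 0(R1),F4
window = [
    Instr("F0", ("R1",)),
    Instr("F6", ("R1",)),
    Instr("F4", ("F0", "F2")),
    Instr(None, ("R1", "F4")),
]
issued = issue_group(window)
```

Here the ADDD reads F0, which the first load of the same group writes, so only the two loads issue in this cycle. The issue rate therefore varies from cycle to cycle with the dependences found at run-time.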

Superscalar Processors
Can be:
• Statically scheduled: instructions behind a stall are not allowed to issue, or
• Dynamically scheduled and speculative: instructions behind RAW hazards are allowed to proceed.

How to optimize code for Superscalar Processors (1)

Original loop:

    Loop: LD   F0,0(R1)
          ADDD F4,F0,F2
          SD   0(R1),F4
          SUBI R1,R1,#8
          BNEZ R1,LOOP

Unrolled loop:

    Loop: LD   F0,0(R1)
          LD   F6,-8(R1)
          LD   F10,-16(R1)
          LD   F14,-24(R1)
          ADDD F4,F0,F2
          ADDD F8,F6,F2
          ADDD F12,F10,F2
          ADDD F16,F14,F2
          SD   0(R1),F4
          SD   -8(R1),F8
          SD   -16(R1),F12
          SUBI R1,R1,#32
          BNEZ R1,LOOP
          SD   8(R1),F16      ; 8-32 = -24

• The loop is unrolled 4 times (load/addd/store). RAW hazards have been reduced, but there are resource conflicts on the pipelines.
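The same transformation can be sketched at source level. This is only an illustration in Python: the function names are hypothetical, the array stands in for the memory addressed through R1, and the scalar `s` stands in for F2.

```python
def add_scalar(x, s):
    """Rolled loop: one load, one add, one store per iteration."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1


def add_scalar_unrolled4(x, s):
    """Loop unrolled 4 times: loads, adds and stores are grouped so that
    no add immediately follows the load it depends on."""
    if len(x) % 4 != 0:
        raise ValueError("this sketch assumes a length that is a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        # four independent loads
        f0, f6, f10, f14 = x[i], x[i - 1], x[i - 2], x[i - 3]
        # four independent adds
        f4, f8, f12, f16 = f0 + s, f6 + s, f10 + s, f14 + s
        # four stores
        x[i], x[i - 1], x[i - 2], x[i - 3] = f4, f8, f12, f16
        i -= 4          # one index update and one branch per 4 elements


a = [float(v) for v in range(8)]
b = list(a)
add_scalar(a, 2.5)
add_scalar_unrolled4(b, 2.5)
```

Both versions compute the same result; the unrolled one pays the loop overhead (index update and branch) once per four elements and exposes four independent load/add/store chains to the issue logic.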

How to optimize code for Superscalar processors (2)

          Integer instruction     FP instruction
    Loop: LD   F0,0(R1)           //
          LD   F6,-8(R1)          //
          LD   F10,-16(R1)        ADDD F4,F0,F2
          LD   F14,-24(R1)        ADDD F8,F6,F2
          LD   F18,-32(R1)        ADDD F12,F10,F2
          SD   0(R1),F4           ADDD F16,F14,F2
          SD   -8(R1),F8          ADDD F20,F18,F2
          SD   -16(R1),F12        //
          SD   -24(R1),F16        //
          SUBI R1,R1,#40          //
          BNEZ R1,LOOP            //
          SD   8(R1),F20          //      ; 8-40 = -32

• 5 times unrolled loop.

The PowerPC 620 ['94]
• Superscalar architecture, similar to:
    • MIPS R10000
    • HP PA 8000
• Fetch, issue and completion of up to 4 instructions per clock cycle.
• Six separate execution units buffered with reservation stations.

PowerPC functional units
• 2 integer units (XSU0, XSU1), 0 cycles latency [+, -, shift, ...]
• 1 complex integer function unit MCFXU for integer (pipelined *, unpipelined /). Latency from 3 to 20 cycles.
• 1 Load store unit. Latency=1 for integer loads, 2 for FP loads.

PowerPC functional units
• 1 FPU (fully pipelined except for divide) with latencies of:
    • 2 cycles for multiply, add, multiply-add
    • 31 for DP FP divide.
• 1 BRU, completes branches and informs the fetch unit of mispredictions. Includes the condition register used for conditional branches.

PowerPC Architecture
• Speculative Tomasulo with register renaming. An extended register file holds the speculative result of an instruction until the instruction commits.
• The ROB enforces only in-order commit.
• Advantages: operands are available from a single location (no need for additional complex logic to access ROB result values).

[Figure: PowerPC 620 architecture]

PowerPC Pipeline
• Fetch: The Fetch unit loads the decode queue with instructions from the cache. Next address is predicted through a 256-entry, two-way set associative BTB. A BPB is used if there is a miss in the BTB.

PowerPC Pipeline
• Instruction decode: Instructions are decoded and inserted into an 8-entry instruction queue.
• Instruction Issue: 4 instructions are taken from the 8-entry instruction queue and are issued to the RS. Allocate a rename register and a reorder buffer entry for the instruction issued. If we can't, stall.

PowerPC Pipeline
• Execution: Proceeds with execution when all operands are available. At the end, the result is written on the result bus. The completion unit is notified that the instruction has completed.

PowerPC Pipeline
• Commit: When all previous instructions have been committed, commit the result into the RF and free the rename buffer. Stores also commit from store buffer to memory.
• If the instruction is a (mispredicted) branch, IFU and IC(ompletion)U are notified. Instruction fetch restarts, and the ICU discards all the speculated instructions after the branch and frees the rename buffers.
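The commit and misprediction rules can be sketched with a small reorder buffer model. This is an illustrative sketch, not the 620's actual completion logic: the class, its method names and the commit width of 4 are assumptions made for the example.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are kept in program order, oldest on the left."""

    def __init__(self):
        self.entries = deque()
        self.regfile = {}              # architectural register file

    def allocate(self, tag, dest):
        """Called at issue; `dest` is None for branches and stores."""
        self.entries.append({"tag": tag, "dest": dest, "done": False, "value": None})

    def complete(self, tag, value=None):
        """Called when a functional unit puts its result on the result bus."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True
                entry["value"] = value
                return
        raise KeyError(tag)

    def commit(self, width=4):
        """Commit finished instructions from the head only (in-order commit)."""
        committed = []
        while self.entries and self.entries[0]["done"] and len(committed) < width:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                self.regfile[entry["dest"]] = entry["value"]
            committed.append(entry["tag"])
        return committed

    def squash_after(self, branch_tag):
        """Discard every entry younger than a mispredicted branch."""
        kept = deque()
        for entry in self.entries:
            kept.append(entry)
            if entry["tag"] == branch_tag:
                break
        dropped = len(self.entries) - len(kept)
        self.entries = kept
        return dropped
```

An instruction that finishes early still waits in the buffer until every older instruction has committed, which is what keeps the architectural state precise when speculated work has to be thrown away.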

Performance results
• IPC from under 1 to 1.8.
• We do not reach IPC=4 due to:
    • FUs are not replicated for each instruction (structural hazards)
    • Limited instruction level parallelism or limited buffering (insufficient buffers).

P6 Processor Family: Intel Pentium II/III
• 3-way superscalar.
• Basic idea: three engines.

P6 Pipeline
• Fetch/Decode Unit: decodes instructions and puts them in the instruction pool in-order.
    • Converts the instructions into micro-ops that represent the instruction code.
• Dispatch/Execute Unit: out-of-order issue from the instruction pool in a reservation station and out-of-order execution of micro-ops.
• Retire Unit: reorders the instructions and commits speculative results to the architectural state.

P6 Instruction Decode
• The decoder fetches 16 bytes at each clock cycle from the cache.
• 3 parallel decoders convert most of the instructions into one or more triadic micro-ops. Some instructions need microcode (several micro-ops) to be executed.
• The Register Alias Table unit converts logical register references into physical register references in the ROB (register renaming).
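A minimal sketch of renaming through a register alias table, assuming triadic micro-ops written as `(dest, src1, src2)` and one ROB entry per micro-op. The `rename` function and the `robN` naming are hypothetical; the real RAT also handles retirement and partial registers, which this sketch ignores.

```python
def rename(micro_ops):
    """Map logical register references to ROB entries.

    `micro_ops` is a list of (dest, src1, src2) logical register names.
    Returns a list of (rob_entry, src1_ref, src2_ref) where a source is
    either a ROB entry name (value still speculative) or the logical
    register itself (value read from the architectural state).
    """
    rat = {}          # logical register -> index of latest producing ROB entry
    renamed = []
    for rob_index, (dest, src1, src2) in enumerate(micro_ops):
        refs = tuple("rob%d" % rat[s] if s in rat else s for s in (src1, src2))
        renamed.append(("rob%d" % rob_index,) + refs)
        rat[dest] = rob_index          # later readers of `dest` use this entry
    return renamed


ops = [
    ("eax", "ebx", "ecx"),   # eax <- ebx op ecx
    ("eax", "eax", "edx"),   # eax <- eax op edx  (reuses the name eax)
    ("esi", "eax", "eax"),   # esi <- eax op eax
]
renamed = rename(ops)
```

After renaming, the two writes to `eax` target different ROB entries, so only the true (RAW) dependences remain between the micro-ops.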

P6 Instruction Dispatch/Execute
• The dispatch unit dispatches the micro-ops in the instruction pool out-of-order through the reservation station unit.
• This happens when:
    • All the operands are ready
    • The resource needed is ready.
• Maximum throughput: 5 micro-ops/cycle.
• If micro-ops are branches, their execution is compared with the predicted address (in the Fetch phase). If mispredicted, the JEU changes the status of all the micro-ops behind the branch and removes them from the instruction pool.

P6 Instruction Retire
• The retire unit looks for micro-ops that have been executed and can be removed from the pool.
• The original architectural target of the micro-ops is written.
• This is done in-order by committing an instruction only if:
    • Previous instructions have been committed
    • The instruction has been executed.
• Up to 3 micro-ops can be retired at each clock cycle.

[Figure: Pentium 4]

Pentium 4
• New NetBurst micro-architecture
• 20 pipeline stages (hyper-pipeline)
• 1.4 GHz to 2 GHz
• 3 prefetching mechanisms:
    • Hardware instruction prefetcher (based on BTB).
    • Software controlled data cache prefetching.
    • L3->L2 data and instruction hardware prefetcher.

Pentium 4
• Execution Trace Cache
    • TC stores decoded IA-32 instructions or micro-ops.
    • Removes decoding costs
    • 12K micro-ops, 3 micro-ops per cycle fetch bandwidth
    • It stores traces built across predicted branches.
    • However some instructions need micro-code from ROM.

Pentium 4
• Branch penalty delay can be much more than 10 cycles
• Uses BTB
• In case of a miss in the BTB, static prediction is used (back=T, forw=NT)
• Use of software branch hints during the trace construction that override static prediction.
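The fallback order described above can be sketched as follows. The function name, the dictionary standing in for the BTB and the `hint` parameter are assumptions made for illustration, and the priority of a BTB hit over a hint is likewise an assumption of this sketch; only the backward-taken / forward-not-taken rule and the fact that hints override static prediction come from the slide.

```python
from typing import Dict, Optional


def predict_taken(branch_pc: int,
                  target_pc: int,
                  btb: Dict[int, bool],
                  hint: Optional[bool] = None) -> bool:
    """Return True if the branch is predicted taken."""
    if branch_pc in btb:
        return btb[branch_pc]          # dynamic prediction on a BTB hit
    if hint is not None:
        return hint                    # software hint overrides static rule
    return target_pc < branch_pc       # static: backward taken, forward not


loop_branch = predict_taken(0x100, 0x080, btb={})      # backward branch
skip_branch = predict_taken(0x100, 0x140, btb={})      # forward branch
```

A backward branch usually closes a loop, so predicting it taken is right on every iteration except the last one, while a forward branch is treated as a rarely taken skip.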

Pentium 4
• Execution Units and Issue Ports
[Figure: Pentium 4 execution units and issue ports]

Pentium 4
• 1 load and 1 store issue for each cycle.
• Loads can be reordered w.r.t. other loads and stores
• Loads can be executed speculatively
• Up to 4 outstanding load misses.
• Load/store forwarding

AMD Athlon K7
• Nine-issue (micro-ops), super-pipelined, superscalar x86 processor
• Multiple x86 instruction decoders (into triadic micro-ops)
• Three out-of-order, superscalar, fully pipelined floating point execution units.
• Three out-of-order, superscalar, pipelined integer units.
• Three out-of-order, superscalar, pipelined address calculation units.
• 72-entry instruction control unit (ROB)

[Figure: AMD Athlon K7]

AMD Athlon K7
• The Instruction Control Unit contains a reorder buffer and distributed reservation stations to hold operands while OPs wait to be scheduled.
• The Integer Instruction Scheduler is instruction scheduling logic that picks OPs for execution based on their operand availability and issues them to functional units or address generation units.
• The function units perform transformations on data and return their results to the reorder buffer, while the address-generation units send calculated memory addresses to the Load/Store Unit for further processing.

Clustered VLIW

Multi-Ported Register File Limits
• Area of the register file grows approximately with the square of the number of ports
[Figure: a bit cell with 1 write port and 2 read ports (Write1, Read1A, Read1B; outputs Dout1A, Dout1B) next to a bit cell with 2 write ports and 4 read ports (Write1, Write2, Read1A, Read1B, Read2A, Read2B; outputs Dout1A, Dout1B, Dout2A, Dout2B)]

Multiported Register File
• Read access time of a register file grows approximately linearly with the number of ports
    • Internal bit cell loading becomes larger
    • Larger area of register file causes longer wire delays
• What is reasonable today in terms of number of ports?
    • Changes with technology; 15-20 ports is currently about the maximum (read ports + write ports)

Clustered VLIW
• To solve the bottleneck, create partitioned register files connected to small numbers of Execution Units
[Figure: three Register Files, each connected to its own EU, linked by a Global Bus]
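A back-of-the-envelope sketch of why partitioning helps, using only the rule that register file area grows roughly with the square of the number of ports. The function names, the 3 ports per execution unit (2 reads, 1 write) and the choice to ignore the ports needed for inter-cluster copies are assumptions of this example, and the numbers are relative units, not real silicon areas.

```python
def relative_rf_area(ports: int) -> int:
    """Relative area of one register file: proportional to ports squared."""
    return ports * ports


def clustered_area(total_eus: int, ports_per_eu: int, clusters: int) -> int:
    """Total relative area when the EUs are split evenly over `clusters`
    register files (each file is assumed to keep the same number of registers)."""
    if clusters <= 0 or total_eus % clusters != 0:
        raise ValueError("EUs must divide evenly over the clusters")
    ports = (total_eus // clusters) * ports_per_eu
    return clusters * relative_rf_area(ports)


monolithic = clustered_area(total_eus=8, ports_per_eu=3, clusters=1)   # one 24-port file
two_clusters = clustered_area(total_eus=8, ports_per_eu=3, clusters=2) # two 12-port files
four_clusters = clustered_area(total_eus=8, ports_per_eu=3, clusters=4)
```

Under these assumptions each doubling of the number of clusters halves the total register file area, and each individual file also has fewer ports, which shortens its access time.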

Register File Communication
• Architecturally Invisible
    • Partitioned RFs appear as one large register file to the compiler
    • Copying between RFs is done by control
    • Detection of when copying is needed can be complicated; goes against VLIW philosophy of minimal control overhead

Register File Communication
• Architecturally Visible
    • Remote and Local versions of instructions
    • Explicit copy primitives
• Remote Instructions:
    • have one or more operands in non-local RF
    • Copying of remote operands to local RFs takes clock cycles.
    • Because copying is an 'atomic' part of the remote instruction, the execution unit is idle while copying is done => performance loss

Register File Communication
• Copy instructions:
    • Separation of copy and execution allows more flexible scheduling by compiler

    move r1, r60           //(r60 in another RF)
    independent instr a    //do not waste useful
    independent instr b    //clock cycles
    add r2, r1, r3

Instruction Compression
• Embedded processors often put a limit on code size
• How to reduce size?
    • NOPs are common, use only a few bits (2-3) to represent a NOP.
    • Mark explicitly start and stop of the long instruction and do not insert NOPs.
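A rough sketch of the saving from short NOP encodings. The 32-bit operation size, the 2-bit NOP and the function name are assumptions chosen for the example; the sketch also ignores any extra bits a real format needs to tell operations and NOPs apart.

```python
def long_instruction_bits(slots, op_bits=32, nop_bits=2):
    """Return (uncompressed_bits, compressed_bits) for one long instruction.

    `slots` is a list of booleans: True for a real operation, False for a NOP.
    """
    uncompressed = len(slots) * op_bits
    compressed = sum(op_bits if used else nop_bits for used in slots)
    return uncompressed, compressed


# Hypothetical 8-slot long instruction with only two useful operations.
slots = [True, False, False, True, False, False, False, False]
uncompressed, compressed = long_instruction_bits(slots)
```

With two real operations out of eight slots, the long instruction shrinks from 256 bits to 76 bits under these assumptions.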

Instructions Decompression
• On Instruction Cache fill
    • ICache has to hold uncompressed instructions: limits cache size
• On instruction fetch
    • Decompression in critical path of fetch stage, may have to add one or more pipeline stages just for decompression

VLIW Architectures: Some real-world examples

TMS320C6X CPU
• 8 independent execution units
    • Split into two identical datapaths, each contains the same four units (L, S, D, M)
• Execution unit types:
    • L : Integer adder, Logical, Bit Counting, FP adder, FP conversion
    • S : Integer adder, Logical, Bit Manipulation, Shifting, Constant, Branch/Control, FP compare
    • D : Integer adder, Load-Store
    • M : Integer Multiplier, FP multiplier

TMS320C6X CPU (cont).
• Max clock speed of 200 MHz
• Each datapath has a 16 x 32-bit register file
[Figure: two 16 x 32 RFs joined by a Global Bus, each feeding its own L, S, M, D units]

Instruction Encoding
• Internal execution path is 256 bits wide
• Each operation is 32 bits wide => 8 operations per clock
• A fetch packet is a group of instructions fetched simultaneously. A fetch packet has 8 instructions.
• An execute packet is a group of instructions beginning execution in parallel. An execute packet has up to 8 instructions.

Instruction Encoding
• Instructions in ICache have an associated P-bit (Parallel-bit).
• Fetch packet expanded to 1 to 8 Execute packets during fetch stage depending on P-bits

Fetch Packet to Execute Packet Expansion

    Fetch Packet (8 instructions):   A|B|C|D|E|F|G|H
    P-bits:                          0|0|0|0|0|0|0|0

    Execute Packets (64 instructions):
        n|n|A|n|n|n|n|n
        n|B|n|n|n|n|n|n
        n|n|n|n|n|C|n|n
        n|n|n|n|n|D|n|n
        n|n|n|E|n|n|n|n
        F|n|n|n|n|n|n|n
        n|n|n|n|n|n|G|n
        n|n|n|n|n|n|n|H

• A-H executed serially.

Fetch Packet to Execute Packet Expansion

    Fetch Packet:   A|B|C|D|E|F|G|H
    P-bits:         1|1|0|1|0|0|1|0

    Execute Packets:
        n|B|A|n|n|C|n|n
        n|n|n|E|n|D|n|n
        F|n|n|n|n|n|n|n
        n|n|n|n|n|n|G|H

    40 instructions

• A||B||C, D||E, F, G||H
• P-bit: a string of '1's followed by a '0' means those instructions execute in parallel. A string starting with '0' indicates sequential execution.
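The P-bit rule can be sketched as a small grouping function. The function name and the list-based representation are illustrative assumptions; the placement of each instruction on a particular unit (the `n` slots in the diagrams) is not modelled, and a fetch packet whose last P-bit is 1 is simply rejected here.

```python
def expand_fetch_packet(instrs, p_bits):
    """Split a fetch packet into execute packets according to its P-bits.

    A P-bit of 1 means "the next instruction executes in parallel with this
    one"; a P-bit of 0 closes the current execute packet.
    """
    if len(instrs) != len(p_bits):
        raise ValueError("one P-bit per instruction is required")
    packets, current = [], []
    for ins, p in zip(instrs, p_bits):
        current.append(ins)
        if p == 0:
            packets.append(current)
            current = []
    if current:
        raise ValueError("last P-bit is 1: packet would chain past the fetch packet")
    return packets


fetch = list("ABCDEFGH")
serial = expand_fetch_packet(fetch, [0, 0, 0, 0, 0, 0, 0, 0])
mixed = expand_fetch_packet(fetch, [1, 1, 0, 1, 0, 0, 1, 0])
```

With all P-bits at 0 every instruction becomes its own execute packet, while the second pattern yields the four packets A||B||C, D||E, F and G||H.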

[Figure: Philips TM 1000 Multimedia Processor]

Philips Trimedia
• Five Execution Units => Five operations per clock issued
• 15 Read and 5 Write Ports on register file
    • Need 15 read ports for 5 Execution Units because each operation requires two operands and a Guard operand.
    • Guard operand makes each operation conditional based upon value of LSB of the guard operand => Predicated Execution.
• 128 Registers (r0 always 0, r1 always 1)
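A minimal sketch of guarded (predicated) execution as described above. The function name and the register numbers are arbitrary choices for the example; the only behaviour taken from the slide is that the operation takes effect when the least significant bit of the guard register is 1.

```python
import operator


def guarded(regs, guard, dest, op, src1, src2):
    """Execute `regs[dest] = op(regs[src1], regs[src2])` only if the LSB of
    the guard register is 1. Returns True if the result was written."""
    if regs[guard] & 1:
        regs[dest] = op(regs[src1], regs[src2])
        return True
    return False


regs = [0] * 128
regs[10], regs[11] = 7, 5
regs[20] = 1        # guard with LSB = 1: operation takes effect
regs[21] = 2        # guard with LSB = 0: operation is nullified

wrote_add = guarded(regs, 20, 12, operator.add, 10, 11)
wrote_sub = guarded(regs, 21, 13, operator.sub, 10, 11)
```

Each operation reads three registers (the guard and two sources), which is where the 15 read ports for five operations per clock come from.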

Philips Trimedia Instructions
• Multiple operation sizes
    • 2 bits for NOP, 26 bits, 34 bits, and 44 bits.
