CS-421 Parallel Processing Handout_9

BE (CIS) Batch 2004-05

Beyond Simple Pipelining
Although instruction pipelining achieves ILP (Instruction-Level Parallelism), i.e., parallelism among instructions, by overlapping the different phases of instructions, resulting in CPIideal = 1, this unit CPI is unachievable in practice due to hazards in real programs. Many architectural techniques are in vogue that push performance beyond unit CPI (i.e., CPI < 1, supporting execution of multiple instructions per cycle) and extract more ILP from programs. A brief description of these techniques follows:

1. Superpipelining
Superpipelining breaks the stages of a given pipeline into smaller stages (thus making the pipeline deeper) in an attempt to shorten the clock period and thereby enhance instruction throughput by keeping more and more instructions in flight at a time. In a simple scalar pipeline the clock period is dictated by the slowest, most time-consuming stage in the system. It is often the case that the slower and more complex operations occurring in a stage can be further broken down into simpler tasks. For example, the instruction-fetch stage and the data-memory access stage are generally the most time-consuming in any pipeline and can be broken down into smaller steps. The execute stage may likewise be broken into two or more smaller steps, depending upon the type of operation performed. If each of these smaller steps is performed in a single clock cycle, with the more time-consuming operations taking two or more of these shorter cycles, the effective clock cycle time is reduced. The net effect is to allow more instructions to have an earlier start in the pipeline. This is the essence of superpipelined operation. The performance improvement resulting from superpipelining is shown below.
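The timing argument above can be sketched with a small model. The stage latencies below are invented for illustration; the point is only that the clock period equals the slowest stage, so splitting the slow stages roughly halves the cycle time while adding only a few fill cycles.

```python
# Sketch with made-up stage latencies: a base 5-stage pipeline versus a
# superpipelined version in which the two slowest stages are each split in half.

def pipeline_time(stage_latencies_ns, n_instructions):
    """Clock period = slowest stage; total time = (depth + N - 1) cycles."""
    clock = max(stage_latencies_ns)
    depth = len(stage_latencies_ns)
    cycles = depth + n_instructions - 1
    return cycles * clock

# Base 5-stage pipeline: IF and MEM (2 ns each) dictate the clock.
base = [2.0, 1.0, 1.0, 2.0, 1.0]            # IF, ID, EX, MEM, WB
# Superpipelined: IF and MEM each split into two 1 ns sub-stages.
deep = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # IF1, IF2, ID, EX, MEM1, MEM2, WB

n = 1000
t_base = pipeline_time(base, n)   # (5 + 999) * 2.0 = 2008 ns
t_deep = pipeline_time(deep, n)   # (7 + 999) * 1.0 = 1006 ns
print(t_base, t_deep, t_base / t_deep)
```

For a long instruction stream the speedup approaches the ratio of the clock periods (here close to 2), since the extra pipeline-fill cycles are amortized.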

The MIPS R4000 processor is an example of a machine that employs this technique; its pipeline contains 8 stages.


The downside of superpipelining, however, is that more dependencies arise among in-flight instructions, necessitating increased complexity in data forwarding, hazard detection units, and branch predictors.

2. Multiple-Issue Architectures
The basic idea is to fetch multiple instructions per cycle from memory and, after checking inter-dependencies, issue those instructions to independent functional units so that they can be executed simultaneously, generating increased ILP. These architectures are also known as wide-issue architectures. There are two methods of implementing a multiple-issue processor:
§ Static multiple-issue
§ Dynamic multiple-issue
Multiple instructions issued in a given clock cycle are said to form an instruction packet. The decision of packaging instructions into issue slots is made by the compiler. Only independent instructions can be placed in the predefined instruction slots of a packet. For example, an instruction packet of a static quad-issue machine can have the following form:

Instruction Packet:
  Slot 1: FP Instruction
  Slot 2: Integer Instruction
  Slot 3: Integer Instruction
  Slot 4: Load/Store Instruction
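The slot assignment above can be sketched in a few lines. This is an illustrative toy, not a real compiler pass: the slot layout follows the quad-issue example, while the instruction texts and the `form_packet` helper are invented.

```python
# Sketch: place a group of (already independence-checked) instructions into
# the packet's predefined issue slots; unfilled slots would become NOPs.

SLOT_TYPES = ["fp", "int", "int", "mem"]   # the quad-issue packet layout above

def form_packet(instructions):
    """instructions: list of (text, type). Returns the 4 slots, NOP-filled."""
    slots = [None] * len(SLOT_TYPES)
    for text, typ in instructions:
        for k, slot_type in enumerate(SLOT_TYPES):
            if slots[k] is None and slot_type == typ:
                slots[k] = text
                break
    return [s if s is not None else "nop" for s in slots]

packet = form_packet([
    ("add.s $f0, $f1, $f2", "fp"),
    ("add   $t0, $t1, $t2", "int"),
    ("lw    $t3, 0($s1)",   "mem"),
])
print(packet)   # only one integer instruction available, so slot 3 gets a NOP
```

Note how a cycle with too few instructions of the right type immediately produces NOPs, which is the source of the code-bloat problem discussed under VLIW below.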

a. Static Multiple Issue

The instruction packet can be thought of as a very long instruction comprising multiple base-machine instructions. This was the reason behind the original name for this approach: Very Long Instruction Word (VLIW). Intel has its own name for the technique, EPIC (Explicitly Parallel Instruction Computing), used in the Itanium series.

Example: Static Dual Issue (i.e. 2-way) MIPS
Let the issue packet contain an ALU or branch instruction (appearing first) and a load or store instruction. This design is akin to some embedded MIPS processors.

64 bits: [ R-Type or Branch Instruction | Load/Store Instruction ]

The following figure shows such a pipelined processor in operation.
Instruction Type       Pipeline Stages
ALU or Branch    IF  ID  EX  M   WB
Load or Store    IF  ID  EX  M   WB
ALU or Branch        IF  ID  EX  M   WB
Load or Store        IF  ID  EX  M   WB
ALU or Branch            IF  ID  EX  M   WB
Load or Store            IF  ID  EX  M   WB
ALU or Branch                IF  ID  EX  M   WB
Load or Store                IF  ID  EX  M   WB


For simultaneous issue of ALU and data-transfer instructions, the following additional hardware is required to avoid structural hazards:
§ Additional ports in the register file:
  o 2 extra read ports
  o 1 extra write port
§ An additional ALU
§ An additional read port in the instruction memory

Figure: A static two-issue MIPS datapath.

The datapath contains no hazard detection unit, so no load-use stall can be inserted. However, a static multiple-issue processor may adopt one of the following approaches to handle control and data hazards:
§ The compiler takes full responsibility, without any support in hardware.
§ The compiler is responsible for removing intra-packet dependencies, while the hardware supports removal of inter-packet hazards. We adopt this approach.

To effectively exploit the parallelism available in a multiple-issue processor, more ambitious compilers are required. If it is not possible to find operations that can be carried out at the same time for all functional units, the instruction slots for the unneeded units are filled with NOPs. Since most instruction words then contain some NOPs, VLIW programs tend to be very long. The VLIW architecture also requires the compiler to be very knowledgeable of the implementation details of the target computer, and may require a program to be recompiled if moved to a different implementation of the same architecture.

Code Scheduling Example
Consider scheduling the following loop on a static 2-way MIPS pipeline.
Loop: lw   $t0, 0($s1)
      add  $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $0, Loop

We must “schedule” the instructions to avoid pipeline stalls:
§ Instructions in one bundle must be independent
§ Load-use instructions must be separated from their loads by one cycle
§ Assume branches are perfectly predicted by the hardware
§ Assume forwarding hardware as necessary
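The load-use rule above is mechanical enough to check automatically. Below is an assumption-laden toy (the tuple representation and the `violates_load_use` helper are invented): each packet is a pair (ALU/branch slot, memory slot), and each slot is either `None` (a NOP) or an `(opcode, dest, sources)` tuple.

```python
# Sketch: detect a load-use violation in a proposed dual-issue schedule.
# A load's result must not be consumed by the very next packet; it needs
# at least one intervening cycle (independence rules out same-packet use).

def violates_load_use(packets):
    for k in range(len(packets) - 1):
        for instr in packets[k]:
            if instr and instr[0] == "lw":
                dest = instr[1]
                for nxt in packets[k + 1]:
                    if nxt and dest in nxt[2]:
                        return True   # consumed one cycle too soon
    return False

bad = [
    (None,                            ("lw", "$t0", ["$s1"])),
    (("add", "$t0", ["$t0", "$s2"]),  None),   # uses $t0 in the next packet
]
good = [
    (None,                            ("lw", "$t0", ["$s1"])),
    (("addi", "$s1", ["$s1"]),        None),   # filler cycle
    (("add", "$t0", ["$t0", "$s2"]),  None),
]
print(violates_load_use(bad))    # True
print(violates_load_use(good))   # False
```

The `good` schedule mirrors the trick used in the optimal schedule below the rules: an independent instruction is hoisted into the slot between the load and its use.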
Optimal Schedule

      ALU/Branch              Memory Reference       Issue Packet (CC)
Loop: NOP                     lw $t0, 0($s1)         1
      addi $s1, $s1, -4       NOP                    2
      add  $t0, $t0, $s2      NOP                    3
      bne  $s1, $0, Loop      sw $t0, 4($s1)         4

Ignoring pipeline startup, 5 instructions are executed in 4 clock cycles. Hence, we achieve a CPI of 4/5 = 0.8 (versus the best case of 0.5), or equivalently an IPC of 1.25 (versus the best case of 2.0). NOPs don't count towards performance!

Loop Unrolling
Loop unrolling is a technique to extract more performance from loops that access arrays: multiple copies of the loop body are made, and instructions from different iterations are scheduled together. We apply loop unrolling by a factor of 4, eliminating the redundant loop-overhead instructions. Note that the compiler must rename registers so as to avoid name (false) dependencies, and must adjust the offsets in the load and store instructions. A name dependence is said to exist between two instructions when they use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.
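The unrolling-plus-renaming transformation can be sketched as a small generator. The `unroll_by_4` helper is hypothetical, but the output matches the handout's unrolled loop: $t0 is renamed to $t0..$t3 so the four copies carry no false dependences, and only one pointer update and one branch remain.

```python
# Sketch: unroll the word-array loop by 4 with register renaming.

def unroll_by_4():
    temps = ["$t0", "$t1", "$t2", "$t3"]   # renamed registers, one per iteration
    offsets = [0, -4, -8, -12]             # one word per iteration
    body = []
    for t, off in zip(temps, offsets):
        body.append(f"lw   {t}, {off}($s1)")
    for t in temps:
        body.append(f"add  {t}, {t}, $s2")
    for t, off in zip(temps, offsets):
        body.append(f"sw   {t}, {off}($s1)")
    body.append("addi $s1, $s1, -16")      # one pointer update for 4 iterations
    body.append("bne  $s1, $0, Loop")      # one branch for 4 iterations
    return body

code = unroll_by_4()
print(len(code))   # 14 instructions instead of 4 * 5 = 20
```

The 6 instructions saved are exactly the three redundant addi/bne pairs, which is where the loop-overhead reduction comes from.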


Loop: lw   $t0, 0($s1)       # iteration 1
      lw   $t1, -4($s1)      # iteration 2
      lw   $t2, -8($s1)      # iteration 3
      lw   $t3, -12($s1)     # iteration 4
      add  $t0, $t0, $s2
      add  $t1, $t1, $s2
      add  $t2, $t2, $s2
      add  $t3, $t3, $s2
      sw   $t0, 0($s1)
      sw   $t1, -4($s1)
      sw   $t2, -8($s1)
      sw   $t3, -12($s1)
      addi $s1, $s1, -16
      bne  $s1, $0, Loop

Now we schedule the resulting unrolled code. Due to the absence of a hazard detection unit, we must schedule so as to avoid load-use hazards.
Optimal Schedule

      ALU/Branch              Memory Reference       Issue Packet (CC)
Loop: addi $s1, $s1, -16      lw $t0, 0($s1)         1
      NOP                     lw $t1, 12($s1)        2
      add  $t0, $t0, $s2      lw $t2, 8($s1)         3
      add  $t1, $t1, $s2      lw $t3, 4($s1)         4
      add  $t2, $t2, $s2      sw $t0, 16($s1)        5
      add  $t3, $t3, $s2      sw $t1, 12($s1)        6
      NOP                     sw $t2, 8($s1)         7
      bne  $s1, $0, Loop      sw $t3, 4($s1)         8

Hence, by loop unrolling, we are able to execute 14 instructions in 8 clock cycles,

corresponding to a CPI of 8/14 ≈ 0.57 (versus the best case of 0.5), or an IPC of 14/8 = 1.75 (versus the best case of 2.0).

VLIW Advantages & Disadvantages
§ Simpler hardware, and therefore potentially less power-hungry; for this reason VLIW designs have gained popularity in the embedded domain, and almost all Digital Signal Processors use a VLIW architecture.
§ Compiler complexity
§ Object (binary) code incompatibility
§ Code bloat:
  o NOPs are a waste of program memory space
  o Loop unrolling uses more program memory space


b. Dynamic Multiple-Issue Processors
Dynamic multiple-issue architectures are also known as superscalars. Unlike the compiler in VLIW machines, the processor hardware decides whether zero, one, or more instructions can be issued in a given clock cycle. Superscalars that allow only in-order execution of instructions are called static superscalars. There are also dynamic superscalars, which allow out-of-order execution (also called dynamic pipeline scheduling or dynamic execution).

Dynamic Execution

Motivation
When executing in order, we fetch instructions and execute them in the order in which the compiler produced the object code. But what if a long-running instruction (e.g., a floating-point divide that takes 40 cycles) is followed by instructions that do not depend on the value it produces? If we could somehow allow those instructions to "go around" the divide and execute in some other functional unit while the divide unit is busy, we would get better performance.

Detail
Instructions are fetched and decoded in program order. They are then sent to reservation stations (buffers within the functional units that hold the operands and the operation until the corresponding functional unit becomes ready to execute), along with whatever operands are already available. As soon as all operands for an instruction become available and its dependencies are discharged, the instruction begins execution on its functional unit, possibly out of program order. When an instruction completes, its result is sent to a commit unit. Committing an instruction involves writing back any values to memory or the register file. The commit unit holds result values in a reorder buffer until they can be committed in order (i.e., in program order); this step is also called retirement or graduation (of instructions). In summary, dynamic execution is about IN-ORDER ISSUE, OUT-OF-ORDER EXECUTION, and IN-ORDER COMMIT. Dynamically scheduled pipelines are used in both the PowerPC 604 and the Pentium Pro. Compiler support is even more crucial for the performance of superscalars, because a superscalar processor can only look at a small window of the program; a good compiler schedules code in a way that facilitates the scheduling decisions made by the processor.
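The reorder-buffer behaviour can be sketched with a toy simulation (the data structures and latencies are invented for illustration; real hardware tracks far more state). Results arrive out of order, but the reorder buffer releases them only from its head, in program order.

```python
# Toy sketch of out-of-order completion with in-order commit via a
# reorder buffer (ROB). Each instruction is (name, latency_in_cycles).
from collections import OrderedDict

def run(program):
    # ROB entries in program order; value filled in when the result arrives.
    rob = OrderedDict((name, None) for name, _ in program)
    # Model completion: shorter-latency instructions finish first.
    completion_order = [n for n, _ in sorted(program, key=lambda p: p[1])]
    committed = []
    for done in completion_order:
        rob[done] = "done"                   # result arrives out of order
        # Commit only from the head of the ROB (program order).
        while rob and next(iter(rob.items()))[1] == "done":
            committed.append(rob.popitem(last=False)[0])
    return completion_order, committed

program = [("div", 40), ("add", 1), ("sub", 1)]  # div is long-running
finished, committed = run(program)
print(finished)    # add and sub "go around" the divide and finish first
print(committed)   # but commit order is still div, add, sub
```

The `while` loop is the key: `add` and `sub` sit finished in the ROB until `div` reaches the head and commits, preserving the program's architectural state.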


In-Order Fetch → In-Order Issue → Out-of-Order Execute → In-Order Commit

Out-of-Order Execution & New Data Hazards
We have witnessed the RAW (Read-After-Write) hazard in the normal, in-order operation of a pipeline. This hazard is the result of a flow dependence between two instructions. Out-of-order execution, however, gives rise to two more data hazards, which cannot occur in normal in-order execution.

§ WAR (Write-After-Read) Hazard
This is caused by an anti-dependence between two instructions. An instruction J is said to be anti-dependent on a preceding instruction I if the destination of J and a source of I are the same. E.g.

add $1, $2, $3
sub $2, $4, $5

§ WAW (Write-After-Write) Hazard
This is caused by an output dependence between two instructions. An instruction J is said to be output-dependent on a preceding instruction I if the destinations of J and I are the same. E.g.

add $1, $2, $3
sub $1, $4, $5

WAR and WAW are name or false dependencies, as they can be avoided simply by renaming.
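The three definitions reduce to simple set tests on registers. The sketch below uses an assumed `(dest, sources)` representation; applied to the two examples above, it reports WAR and WAW respectively.

```python
# Sketch: classify the data hazards a later instruction J can have with
# respect to an earlier instruction I. Each instruction is (dest, sources).

def hazards(i, j):
    """Return the set of hazards J (later) has with respect to I (earlier)."""
    found = set()
    if i[0] is not None and i[0] in j[1]:
        found.add("RAW")   # flow dependence: J reads I's result
    if j[0] is not None and j[0] in i[1]:
        found.add("WAR")   # anti-dependence: J writes a register I reads
    if i[0] is not None and i[0] == j[0]:
        found.add("WAW")   # output dependence: both write the same register
    return found

i_add = ("$1", {"$2", "$3"})                  # add $1, $2, $3
print(hazards(i_add, ("$2", {"$4", "$5"})))   # sub $2, $4, $5 -> {'WAR'}
print(hazards(i_add, ("$1", {"$4", "$5"})))   # sub $1, $4, $5 -> {'WAW'}
print(hazards(i_add, ("$6", {"$1", "$5"})))   # sub $6, $1, $5 -> {'RAW'}
```

Only the RAW case represents a true flow of data; the WAR and WAW cases disappear if the second instruction is simply given a different destination register, which is exactly what renaming does.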