
UNIT-5

PIPELINE AND VECTOR PROCESSING


Topics:
- Introduction to pipelining and pipeline hazards
- Design issues of pipeline architecture
- Instruction-level parallelism and advanced issues
- Parallel processing concepts
- Vector processing
- Array processors
- CISC, RISC, VLIW

PARALLEL PROCESSING

1. Parallel processing provides simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system.
   Example: while one instruction is being executed in the ALU, the next instruction can be read from memory.
2. Parallel processing increases hardware complexity and cost.

A processor with multiple functional units may include:
- Adder-subtractor
- Integer multiply
- Logic unit
- Shift unit
- Incrementer
- Floating-point add-subtract
- Floating-point multiply
- Floating-point divide

PROCESSOR WITH MULTIPLE FUNCTIONAL UNITS

[Figure: the processor registers feed, in parallel, an adder-subtractor, integer multiply, logic unit, shift unit, incrementer, floating-point add-subtract, floating-point multiply, and floating-point divide unit, each connected to memory.]

Parallel Processing

PARALLEL COMPUTERS
Architectural Classification

Flynn's classification is based on the multiplicity of instruction streams and data streams:
- Instruction stream: the sequence of instructions read from memory.
- Data stream: the operations performed on the data in the processor.

                                    Number of data streams
Number of instruction streams       Single      Multiple
Single                              SISD        SIMD
Multiple                            MISD        MIMD

SISD: Instructions are executed sequentially; the system may or may not have internal parallel processing capabilities.

SIMD: Many processing units operate under the supervision of a common control unit. All processors receive the same instruction from the control unit but operate on different items of data.

MISD: Of theoretical interest only; not practically implemented.

MIMD: Several programs are processed at the same time.

Parallel processing techniques:

- Pipeline processing: the arithmetic sub-operations or the phases of the computer instruction cycle overlap in execution.
- Vector processing: deals with computations involving large vectors and matrices.
- Array processing: performs computations on large arrays of data.

PIPELINING
A technique of decomposing a sequential process into sub-operations, with each sub-process executed in a special dedicated segment that operates concurrently with all other segments. The result obtained from the computation in each segment is transferred to the next segment in the pipeline, so computations overlap. In each segment, a register holds the data and a combinational circuit performs the sub-operation.

WHAT IS PIPELINING??

Pipelining is an implementation technique in which multiple instructions are overlapped in execution to make fast CPUs. It exploits parallelism among the instructions in a sequential instruction stream.

THE METHODOLOGY

In a pipeline, each step is called a pipe stage (or pipe segment) and completes a part of an instruction. The stages are connected one to the next to form a pipe. Instructions enter at one end, progress through each stage, and exit at the other end.

Pipelining

PIPELINING

Example of pipeline processing: Ai * Bi + Ci   for i = 1, 2, 3, ..., 7

The computation is split across three segments; R1 through R5 are the segment registers, and Ai, Bi, Ci come from memory:

Segment 1:  R1 <- Ai,  R2 <- Bi           (load Ai and Bi)
Segment 2:  R3 <- R1 * R2,  R4 <- Ci      (multiply and load Ci)
Segment 3:  R5 <- R3 + R4                 (add)

[Figure: memory feeds R1 and R2 in segment 1; the multiplier produces R3 while Ci is loaded into R4 in segment 2; the adder produces R5 in segment 3.]


OPERATIONS IN EACH PIPELINE STAGE

Clock    Segment 1      Segment 2             Segment 3
Pulse    R1     R2      R3          R4        R5
  1      A1     B1
  2      A2     B2      A1 * B1     C1
  3      A3     B3      A2 * B2     C2        A1 * B1 + C1
  4      A4     B4      A3 * B3     C3        A2 * B2 + C2
  5      A5     B5      A4 * B4     C4        A3 * B3 + C3
  6      A6     B6      A5 * B5     C5        A4 * B4 + C4
  7      A7     B7      A6 * B6     C6        A5 * B5 + C5
  8                     A7 * B7     C7        A6 * B6 + C6
  9                                           A7 * B7 + C7
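The clock-by-clock behavior in the table can be checked with a small simulation sketch in Python. The numeric values for Ai, Bi, and Ci below are made up purely for illustration; the latching order mirrors the three segments in the figure.

```python
# Simulate the three-segment pipeline computing Ai * Bi + Ci.
# Each clock pulse moves data one segment down the pipe.
A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [10, 20, 30, 40, 50, 60, 70]

n = len(A)
results = []
seg1 = seg2 = None                 # latches between segments
for clock in range(n + 2):         # k + n - 1 = 3 + 7 - 1 = 9 clock pulses
    # Segment 3: R5 <- R3 + R4
    if seg2 is not None:
        results.append(seg2[0] + seg2[1])
    # Segment 2: R3 <- R1 * R2, R4 <- Ci
    seg2 = (seg1[0] * seg1[1], C[clock - 1]) if seg1 is not None else None
    # Segment 1: R1 <- Ai, R2 <- Bi
    seg1 = (A[clock], B[clock]) if clock < n else None

print(results)   # [17, 32, 45, 56, 65, 72, 77]
```

After a fill-up latency of two clocks, one result emerges per clock pulse, exactly as in the table.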


GENERAL PIPELINE
General Structure of a 4-Segment Pipeline

[Figure: Input -> R1 -> S1 -> R2 -> S2 -> R3 -> S3 -> R4 -> S4; a common clock loads all registers Ri, and each Si is the combinational circuit of segment i.]

Space-Time Diagram

Clock cycle:  1   2   3   4   5   6   7   8   9
Segment 1:    T1  T2  T3  T4  T5  T6
Segment 2:        T1  T2  T3  T4  T5  T6
Segment 3:            T1  T2  T3  T4  T5  T6
Segment 4:                T1  T2  T3  T4  T5  T6


PIPELINE SPEEDUP

n: number of tasks to be performed

Conventional machine (non-pipelined):
  tn: clock cycle (time to complete one task)
  t1: time required to complete the n tasks
  t1 = n * tn

Pipelined machine (k stages):
  tp: clock cycle (time to complete each sub-operation)
  tk: time required to complete the n tasks
  tk = (k + n - 1) * tp

Speedup:
  Sk = n * tn / ((k + n - 1) * tp)

  lim (n -> infinity) Sk = tn / tp   ( = k, if tn = k * tp )


PIPELINE AND MULTIPLE FUNCTION UNITS

Example: 4-stage pipeline
- sub-operation in each stage: tp = 20 ns
- 100 tasks to be executed
- one task in the non-pipelined system: tn = 4 * 20 = 80 ns

Pipelined system:     (k + n - 1) * tp = (4 + 99) * 20 = 2060 ns
Non-pipelined system: n * tn = 100 * 80 = 8000 ns
Speedup:              Sk = 8000 / 2060 = 3.88

A 4-stage pipeline is thus basically identical to a system with 4 identical function units working in parallel.

[Figure: instructions Ii, Ii+1, Ii+2, Ii+3 dispatched to four identical function units P1, P2, P3, P4.]
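The arithmetic in the worked example above can be double-checked with a few lines of Python, plugging the example's values into the speedup formulas:

```python
# Pipeline speedup for the example: k = 4 stages, tp = 20 ns, n = 100 tasks.
k, tp, n = 4, 20, 100
tn = k * tp                        # non-pipelined time per task: 80 ns

t_pipelined = (k + n - 1) * tp     # (4 + 99) * 20 = 2060 ns
t_nonpipelined = n * tn            # 100 * 80 = 8000 ns
speedup = t_nonpipelined / t_pipelined

print(t_pipelined, t_nonpipelined, round(speedup, 2))   # 2060 8000 3.88
```

Note that the speedup (3.88) is close to, but below, the theoretical maximum of k = 4, because of the k - 1 cycles spent filling the pipe.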

Disadvantage of pipeline

Different segments may take different times to complete their sub-operations. The clock cycle must be chosen to equal the time delay of the segment with the maximum propagation time, so all other segments waste time waiting for the next clock.

Two areas of computer design where pipeline organization is applicable:
1. Arithmetic pipeline: divides an arithmetic operation into sub-operations for execution in the pipeline segments.
2. Instruction pipeline: operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle.

PIPELINE HAZARDS
WHAT ARE PIPELINE HAZARDS? Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. They reduce performance from the ideal speedup gained by pipelining.

CLASSIFICATION OF HAZARDS

Structural hazards: arise from resource conflicts, when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.

Control hazards: arise from the pipelining of branches and other instructions that change the PC.

STRUCTURAL HAZARDS

For a pipeline to be free of structural hazards, functional units must be pipelined and resources duplicated enough to allow all possible combinations of instructions in the pipeline. Structural hazards arise for the following reasons:

- When a functional unit is not fully pipelined, a sequence of instructions using that unit cannot proceed at the rate of one per clock cycle.
- When a resource is not duplicated enough to allow all possible combinations of instructions. Example: a machine may have one register-file write port but want to perform two writes during the same clock cycle.
- When a machine shares a single memory for data and instructions, an instruction containing a data-memory reference will conflict with the instruction fetch of a later instruction. This is resolved by stalling the pipeline for one clock cycle when the data-memory access occurs.

DATA HAZARDS
Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine.

CLASSIFICATION OF DATA HAZARDS


RAW (read after write): consider two instructions i and j, with i occurring before j. j tries to read a source before i actually writes to it, so j gets the old value.

Example (the ADD writes R1, which the following instructions read):
  ADD R1,R2,R3
  SUB R4,R1,R5
  AND R6,R1,R7
  OR  R8,R1,R9
  XOR R10,R1,R11
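A RAW dependence like the one above can be detected mechanically by checking whether an instruction's source registers match the destination of an earlier instruction whose result is not yet written back. The tuple encoding of instructions below, and the 2-instruction write latency, are simplifying assumptions for illustration:

```python
# Detect RAW hazards in a straight-line instruction sequence.
# Each instruction is (mnemonic, dest, src1, src2) - an assumed encoding.
program = [
    ("ADD", "R1", "R2", "R3"),
    ("SUB", "R4", "R1", "R5"),    # reads R1 before ADD's write-back
    ("AND", "R6", "R1", "R7"),
    ("OR",  "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]

def raw_hazards(prog, write_latency=2):
    """Report (i, j) pairs where instruction j reads a register that
    instruction i writes, within write_latency instructions of i."""
    hazards = []
    for i, (_, dest, *_srcs) in enumerate(prog):
        for j in range(i + 1, min(i + 1 + write_latency, len(prog))):
            if dest in prog[j][2:]:            # dest is a source of j
                hazards.append((i, j))
    return hazards

print(raw_hazards(program))   # [(0, 1), (0, 2)]
```

With forwarding or register-file write-before-read, later readers of R1 (the OR and XOR here) see the correct value, which is why only the first two uses are flagged.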

This hazard is overcome by a simple hardware technique called forwarding. In forwarding, the ALU result from the EX/MEM register is always fed back to the ALU input latches. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, the control logic selects the forwarded result as the ALU input rather than the value read from the register file.

WAW (write after write): j tries to write an operand before it is written by i. The writes are performed in the wrong order, leaving the value written by i as the final value. This hazard is present in pipelines that write in more than one pipe stage. In DLX this is not a hazard, since DLX writes only in the WB stage.

Example (both instructions write R1):
  LW  R1,0(R2)
  ADD R1,R2,R3

WAR (write after read): j tries to write a destination before it is read by i. This does not happen in DLX, since all reads occur early (in the ID stage) and all writes occur late (in the WB stage).

Example:
  SW  0(R1),R2
  ADD R2,R3,R4

CONTROL HAZARDS
Control hazards cause a greater performance loss than data hazards. The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected in the ID stage, and to hold the stall until the MEM stage, where the new PC is finally determined.

Each branch then causes a 3-cycle stall in the DLX pipeline, a significant loss given that roughly 30% of the instructions executed are branches.

The number of stall cycles per branch is reduced by testing the branch condition in the ID stage and computing the destination address there as well, using a separate adder. This leaves only a one-clock-cycle stall on branches.

WHAT MAKES PIPELINING HARD TO IMPLEMENT?


EXCEPTIONAL SITUATIONS: situations in which the normal order of execution is changed. Instructions that raise exceptions may force the machine to abort the instructions in the pipeline before they complete.

Some of the exceptions include:
- Integer arithmetic overflow/underflow
- Power failure
- Hardware malfunctions
- I/O device requests

Arithmetic pipeline

Arithmetic pipelines are usually found in very high-speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations.

Example: floating-point addition and subtraction of X = A x 2^a and Y = B x 2^b, where A and B are fractions representing the mantissas and a and b are the exponents. The sub-operations performed in the four segments are:
[1] Compare the exponents
[2] Align the mantissas
[3] Add or subtract the mantissas
[4] Normalize the result
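The four sub-operations can be expressed directly in code. The sketch below uses base-10 mantissa/exponent pairs for readability (the pipeline itself works the same way in base 2), with normalization meaning 0.1 <= |mantissa| < 1:

```python
# Four-segment floating-point add sketch: X = A * 10^a, Y = B * 10^b.
def fp_add(A, a, B, b):
    # Segment 1: compare exponents by subtraction
    diff = a - b
    # Segment 2: choose the larger exponent, align the smaller mantissa
    if diff >= 0:
        exp, B = a, B / (10 ** diff)
    else:
        exp, A = b, A / (10 ** -diff)
    # Segment 3: add the mantissas
    mant = A + B
    # Segment 4: normalize (shift so 0.1 <= |mant| < 1), adjust exponent
    while abs(mant) >= 1:
        mant /= 10
        exp += 1
    while 0 < abs(mant) < 0.1:
        mant *= 10
        exp -= 1
    return mant, exp

# 0.9504 * 10^3 + 0.8200 * 10^2 = 950.4 + 82.0 = 1032.4
print(fp_add(0.9504, 3, 0.8200, 2))   # normalized: mantissa ~0.10324, exponent 4
```

In the hardware pipeline these four steps run concurrently on four different operand pairs, one pair per segment, rather than sequentially as in this sketch.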


ARITHMETIC PIPELINE

Floating-point adder: X = A x 2^a, Y = B x 2^b

[Figure: the exponents a, b and mantissas A, B enter through input registers R.
Segment 1: compare exponents by subtraction (produces the difference).
Segment 2: choose the larger exponent; align the mantissa of the number with the smaller exponent.
Segment 3: add or subtract the mantissas.
Segment 4: adjust the exponent and normalize the result.
A register R separates each pair of adjacent segments.]

[1] Compare the exponents
[2] Align the mantissas
[3] Add or subtract the mantissas
[4] Normalize the result

Instruction pipeline
An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations.


INSTRUCTION CYCLE
Six phases in an instruction cycle:
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place

Some instructions skip some phases:
* Effective-address calculation can be done as part of the decoding phase
* Storage of the operation result into a register is done automatically in the execution phase

==> 4-stage pipeline:
[1] FI: Fetch an instruction from memory
[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation
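The overlap of the four stages (in the absence of branches and stalls) follows a simple rule: instruction i enters stage s at clock cycle i + s (0-based). A short sketch that tabulates this:

```python
# Space-time table for a 4-stage instruction pipeline, no branches/stalls.
STAGES = ["FI", "DA", "FO", "EX"]

def timing_table(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1    # k + n - 1 cycles total
    table = [["--"] * n_cycles for _ in range(n_instructions)]
    for i in range(n_instructions):
        for s, name in enumerate(STAGES):
            table[i][i + s] = name                 # instr i, stage s, cycle i+s
    return table

for row in timing_table(4):
    print(" ".join(row))
# FI DA FO EX -- -- --
# -- FI DA FO EX -- --
# -- -- FI DA FO EX --
# -- -- -- FI DA FO EX
```

Branches break this regular pattern, as the timing table in the next section shows.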


INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

[Flowchart:
Segment 1: fetch instruction from memory.
Segment 2: decode instruction and calculate effective address; if the instruction is a branch, empty the pipe, update the PC, and resume fetching.
Segment 3: fetch operand from memory.
Segment 4: execute instruction; if an interrupt is pending, empty the pipe, handle the interrupt, and update the PC before fetching continues.]

Four segment CPU pipeline

Timing of instruction pipeline

Step:             1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction  1:   FI  DA  FO  EX
             2:       FI  DA  FO  EX
(Branch)     3:           FI  DA  FO  EX
             4:               FI  -   -   FI  DA  FO  EX
             5:                   -   -   -   FI  DA  FO  EX
             6:                               FI  DA  FO  EX
             7:                                   FI  DA  FO  EX

Because instruction 3 is a branch, the instruction fetched in step 4 is held, and the pipeline resumes fetching from the branch target in step 7.

Major difficulties
Three major difficulties:
- Resource conflicts: caused by access to memory by two segments at the same time. These conflicts can be resolved by using separate instruction and data memories.
- Data dependency conflicts: an instruction depends on the result of a previous instruction, but this result is not yet available.
- Branch difficulties: arise from branch and other instructions that change the value of the PC.

Handling data dependency:
1. Hardware interlocks: a circuit detects instructions whose source operands are destinations of instructions farther up in the pipeline; the detected instruction is delayed by enough clock cycles to resolve the conflict.
2. Operand forwarding: special hardware detects a conflict and avoids it by routing the data through special paths between pipeline segments.
3. Delayed load: the compiler reorders the instructions as necessary to delay the loading of conflicting data, inserting no-operation instructions where needed.

Handling of branch instructions


Prefetch target instruction: fetch instructions in both streams, branch taken and branch not taken. Both are saved until the branch is executed; then the right instruction stream is selected and the wrong stream discarded.

Branch target buffer (BTB, an associative memory): each entry holds the address of a previously executed branch together with its target instruction and the next few instructions. When fetching an instruction, the BTB is searched; if the instruction is found, the stream stored in the BTB is fetched; if not, a new stream is fetched and the BTB updated.

Loop buffer (a small, high-speed register file): stores an entire loop, allowing a loop to be executed without accessing memory.

Branch prediction: guess the branch condition and fetch an instruction stream based on the guess. A correct guess eliminates the branch penalty.

Delayed branch: the compiler detects the branch and rearranges the instruction sequence, inserting useful instructions that keep the pipeline busy in the presence of a branch instruction.

RISC pipeline

RISC: a machine with a very fast clock cycle that executes at the rate of one instruction per cycle. This is enabled by:
- a simple instruction set
- a fixed-length instruction format
- register-to-register operations


RISC PIPELINE
Instruction cycle of a three-stage instruction pipeline:

Data manipulation instructions:
  I: Instruction fetch
  A: Decode, read registers, ALU operation
  E: Write a register

Load and store instructions:
  I: Instruction fetch
  A: Decode, evaluate effective address
  E: Register-to-memory or memory-to-register transfer

Program control instructions:
  I: Instruction fetch
  A: Decode, evaluate branch address
  E: Write register (PC)


DELAYED LOAD
LOAD:  R1 <- M[address 1]
LOAD:  R2 <- M[address 2]
ADD:   R3 <- R1 + R2
STORE: M[address 3] <- R3

Three-segment pipeline timing with data conflict:

Clock cycle:  1  2  3  4  5  6
Load R1       I  A  E
Load R2          I  A  E
Add R1+R2           I  A  E
Store R3               I  A  E

Pipeline timing with delayed load:

Clock cycle:  1  2  3  4  5  6  7
Load R1       I  A  E
Load R2          I  A  E
NOP                 I  A  E
Add R1+R2              I  A  E
Store R3                  I  A  E

Advantage: the data dependency is taken care of by the compiler rather than the hardware.
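The compiler transformation above (insert a NOP after any load whose destination is used by the very next instruction) can be sketched as follows. The tuple encoding of instructions is an assumption made for illustration:

```python
# Delayed-load fix: insert a NOP after a load whose destination register
# is a source of the immediately following instruction.
# Instructions are (mnemonic, dest, sources...) - an assumed encoding.
def insert_delayed_load_nops(prog):
    out = []
    for i, inst in enumerate(prog):
        out.append(inst)
        is_load = inst[0] == "LOAD"
        if is_load and i + 1 < len(prog) and inst[1] in prog[i + 1][2:]:
            out.append(("NOP",))               # fill the load delay slot
    return out

prog = [
    ("LOAD", "R1", "M1"),
    ("LOAD", "R2", "M2"),
    ("ADD",  "R3", "R1", "R2"),   # uses R2 loaded one cycle earlier
    ("STORE", "M3", "R3"),
]
for inst in insert_delayed_load_nops(prog):
    print(inst)
```

Only the second load gets a NOP: R1 is not needed until two cycles after its load, so the first load has no conflict, matching the timing tables above.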


DELAYED BRANCH
The compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay steps.

Using no-operation instructions:

Clock cycles:     1  2  3  4  5  6  7  8  9  10
1. Load           I  A  E
2. Increment         I  A  E
3. Add                  I  A  E
4. Subtract                I  A  E
5. Branch to X                I  A  E
6. NOP                           I  A  E
7. NOP                              I  A  E
8. Instr. in X                         I  A  E

Rearranging the instructions:

Clock cycles:     1  2  3  4  5  6  7  8
1. Load           I  A  E
2. Increment         I  A  E
3. Branch to X          I  A  E
4. Add                     I  A  E
5. Subtract                   I  A  E
6. Instr. in X                   I  A  E

CISC (COMPLEX INSTRUCTION SET COMPUTING)


A CISC (complex instruction set computer) is a computer in which a single instruction can execute several low-level operations and instructions are capable of multi-step operations or addressing modes within a single instruction. Characteristics:
- Some complex instructions are difficult or impossible to execute in one cycle through the pipeline
- Many different addressing modes
- Instructions of different lengths

Implementing a CISC Architecture


There are simple and complex instructions in a CISC architecture One approach:
Adapt the RISC pipeline
Execute

the simple, frequently-used CISC instructions as in RISC For the more complex instructions, use microinstructions.

That means, a sequence of microinstructions is stored on ROM for each complex CISC instruction Complex instructions often involve multiple microoperations or memory accesses in sequence When a complex instruction is decoded in the DOF stage of the pipeline, the microcode address and control is given to the microcode counter. Microinstructions are executed until the instruction is completed. Each microoperation is simply a set of control input signals Example: A certain CISC instruction I has microcode written in the microcode ROM at address A. When I is decoded in the main pipeline, stall the main pipeline, and give control to the microcode control (MC). At each subsequent clock cycle, MC will increment and execute the next microoperation (which is a control word that controls the datapath). The last microoperation in the sequence will give control back to the main pipeline, and un-stall the main pipeline.

CISC Approach
The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible. This is achieved by building processor hardware that is capable of understanding and executing a series of operations.

Example: MULT 2:3, 5:2. For this task a CISC processor comes prepared with a specific instruction. The instruction loads the two values into separate registers, multiplies the operands in the execution unit, and stores the product in the appropriate register. The entire task of multiplying two numbers is completed with one instruction; MULT is a complex instruction. CISC minimizes the number of instructions per program at the cost of more cycles per instruction.

Vector Processing

VECTOR PROCESSING
Vector processing applications: problems that can be efficiently formulated in terms of vectors, e.g.
- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space-flight simulations
- Artificial intelligence and expert systems
- Image processing

Vector processor (computer): processes vectors, and related data structures such as matrices and multi-dimensional arrays, much faster than conventional computers. Vector processors may also be pipelined.


VECTOR PROGRAMMING
Fortran loop:

      DO 20 I = 1, 100
   20 C(I) = B(I) + A(I)

Conventional computer:

      Initialize I = 0
   20 Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I <= 100 goto 20

Vector computer:

      C(1:100) = A(1:100) + B(1:100)
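The contrast between the two styles can be sketched in Python. The element values are made up for illustration; the whole-array expression stands in for the single vector instruction (a library like NumPy would execute it natively as one array operation):

```python
# Conventional (scalar loop) vs. vector-style computation of C = A + B.
A = list(range(1, 101))        # A(1:100) = 1, 2, ..., 100
B = list(range(100, 0, -1))    # B(1:100) = 100, 99, ..., 1

# Conventional computer: one element per loop iteration,
# with explicit index bookkeeping.
C_loop = []
for i in range(100):
    C_loop.append(A[i] + B[i])

# Vector computer: a single vector operation covers all 100 elements;
# no per-element loop overhead is visible to the programmer.
C_vector = [a + b for a, b in zip(A, B)]

print(C_loop == C_vector)   # True
print(C_vector[0])          # 101
```

The result is the same; the difference is that the vector form exposes all 100 additions at once, letting pipelined or parallel hardware overlap them.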


VECTOR INSTRUCTIONS

Instruction types:  f1: V -> V    f2: V -> S    f3: V x V -> V    f4: V x S -> V
(V: vector operand, S: scalar operand)

Type  Mnemonic  Description            Effect (I = 1, ..., n)
f1    VSQR      Vector square root     B(I) <- SQR(A(I))
f1    VSIN      Vector sine            B(I) <- sin(A(I))
f1    VCOM      Vector complement      A(I) <- complement of A(I)
f2    VSUM      Vector summation       S <- sum of A(I)
f2    VMAX      Vector maximum         S <- max{A(I)}
f3    VADD      Vector add             C(I) <- A(I) + B(I)
f3    VMPY      Vector multiply        C(I) <- A(I) * B(I)
f3    VAND      Vector AND             C(I) <- A(I) AND B(I)
f3    VLAR      Vector larger          C(I) <- max(A(I), B(I))
f3    VTGE      Vector test >          C(I) <- 0 if A(I) < B(I); 1 otherwise
f4    SADD      Vector-scalar add      B(I) <- S + A(I)
f4    SDIV      Vector-scalar divide   B(I) <- A(I) / S


VECTOR INSTRUCTION FORMAT

| Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length |

Pipeline for inner product:

[Figure: Source A and Source B feed a multiplier pipeline; the products feed an adder pipeline that accumulates the inner product.]


MULTIPLE MEMORY MODULES AND INTERLEAVING

Multiple-module memory:

[Figure: an address bus and a data bus shared by four memory modules M0..M3; each module contains its own address register (AR), memory array, and data register (DR).]

Address interleaving: different sets of addresses are assigned to different memory modules.

Pipeline and vector processors often require simultaneous access to memory from two or more sources. An instruction pipeline, for example, may need to fetch an instruction and an operand at the same time from two different segments. Memory can therefore be partitioned into a number of modules connected to common memory address and data buses. A memory module is a memory array together with its own address and data registers. One module can initiate a memory access while other modules are still reading or writing, and each module can honor a memory request independently of the state of the other modules.

Advantage of modular memory


Interleaving:

In an interleaved memory, different sets of addresses are assigned to different memory modules. A vector processor that uses an n-way interleaved memory can fetch n operands from n different modules. For example, in a two-module memory system, the even addresses may be in one module and the odd addresses in the other. A CPU with an instruction pipeline can take advantage of multiple memory modules so that each segment in the pipeline can access memory independently of memory accesses from other segments.
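The usual low-order interleaving scheme maps an address to a module by its remainder and to an offset within that module by its quotient; a small sketch:

```python
# n-way low-order address interleaving: address -> (module, offset).
def interleave(address, n_modules=4):
    return address % n_modules, address // n_modules

# Consecutive addresses land in different modules, so a vector processor
# can have n memory accesses in flight at once.
for addr in range(8):
    print(addr, "-> module", interleave(addr)[0])

# Two-module case from the text: even addresses in M0, odd in M1.
assert interleave(6, 2) == (0, 3)
assert interleave(7, 2) == (1, 3)
```

With n_modules a power of two, the module number is simply the low-order address bits, which is why hardware implements this with no arithmetic at all.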

Array processors
An array processor performs computations on large arrays of data.
- Attached array processor: an auxiliary processor attached to a general-purpose computer.
- SIMD array processor: a processor with a single-instruction, multiple-data organization. It manipulates vector instructions by means of multiple functional units responding to a common instruction.

Attached array processor


An attached array processor enhances the performance of a computer by providing vector processing for complex scientific applications.

[Figure: the general-purpose computer connects to the attached array processor through an input-output interface; the computer's main memory and the array processor's local memory are linked by a high-speed memory-to-memory bus.]

SIMD ARRAY PROCESSORS

[Figure: SIMD array processor organization. A master control unit and main memory drive processing elements PE1..PEn, each with its own local memory M1..Mn.]

A SIMD array processor is a computer with multiple processing units operating in parallel. It consists of a set of identical processing elements (PEs), each having a local memory. Each processing element includes an ALU, a floating-point arithmetic unit, and working registers. The master control unit controls the operations in the processing elements, and the main memory is used for storage of the program. The function of the master control unit is to decode the instructions and determine how they are to be executed. Vector instructions are broadcast to all PEs simultaneously, and vector operands are distributed to the local memories prior to the parallel execution of the instruction.
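The broadcast-and-execute pattern can be sketched as follows. The operand values are hypothetical, and the lockstep PEs are modeled sequentially here for clarity:

```python
# SIMD sketch: the master control unit broadcasts one instruction;
# every processing element PEi applies it to operands already
# distributed into its own local memory.
local_A = [1, 2, 3, 4]       # operand slices distributed to PE1..PE4
local_B = [10, 20, 30, 40]

def broadcast(op, A, B):
    # Each PE i computes op(A[i], B[i]); real hardware does this in
    # lockstep across all PEs in a single instruction time.
    return [op(a, b) for a, b in zip(A, B)]

result = broadcast(lambda a, b: a + b, local_A, local_B)
print(result)   # [11, 22, 33, 44]
```

The key point the sketch illustrates: there is one instruction stream (the broadcast op) but many data streams (one element pair per PE), which is exactly the SIMD category of Flynn's classification.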
