
UNIT-5

PIPELINE AND VECTOR PROCESSING


Topics:
- Introduction to pipelining and pipeline hazards
- Design issues of pipeline architecture
- Instruction-level parallelism and advanced issues
- Parallel processing concepts
- Vector processing
- Array processors
- CISC, RISC, VLIW

PARALLEL PROCESSING

1. Parallel processing provides simultaneous data-processing tasks for the purpose of increasing the computational speed of a computer system.
   Example: while one instruction is being executed in the ALU, the next instruction can be read from memory.
2. Parallel processing increases hardware complexity and cost.

A processor with multiple functional units may include:
- Adder-subtractor
- Integer multiply
- Logic unit
- Shift unit
- Incrementer
- Floating-point add-subtract
- Floating-point multiply
- Floating-point divide

PROCESSOR WITH MULTIPLE FUNCTIONAL UNITS

[Figure: the processor registers feed, in parallel, an adder-subtractor, integer multiply, logic unit, shift unit, incrementer, floating-point add-subtract, floating-point multiply, and floating-point divide unit, each connected to memory.]

Parallel Processing

PARALLEL COMPUTERS
Architectural Classification

Flynn's classification is based on the multiplicity of instruction streams and data streams:
- Instruction stream: the sequence of instructions read from memory.
- Data stream: the operations performed on the data in the processor.

                                    Number of data streams
Number of instruction streams       Single      Multiple
Single                              SISD        SIMD
Multiple                            MISD        MIMD

SISD: Instructions are executed sequentially; the system may or may not have internal parallel processing capabilities.

SIMD: Many processing units operate under the supervision of a common control unit. All processors receive the same instruction from the control unit but operate on different items of data.

MISD: Of theoretical interest only; not practically implemented.

MIMD: Several programs are processed at the same time.

Parallel processing techniques:

- Pipeline processing: the arithmetic sub-operations or the phases of the computer instruction cycle overlap in execution.
- Vector processing: deals with computations involving large vectors and matrices.
- Array processing: performs computations on large arrays of data.

PIPELINING
A technique of decomposing a sequential process into sub-operations, with each sub-process executed in a special dedicated segment that operates concurrently with all other segments. The result obtained from the computation in each segment is transferred to the next segment in the pipeline, so computations overlap. In each segment, a register holds the data and a combinational circuit performs the sub-operation.

WHAT IS PIPELINING??

Pipelining is an implementation technique in which multiple instructions are overlapped in execution to make fast CPUs. It exploits parallelism among the instructions in a sequential instruction stream.

THE METHODOLOGY

In a pipeline, each step is called a pipe stage (or pipe segment) and completes a part of an instruction. The stages are connected one to the next to form a pipe. Instructions enter at one end, progress through each stage, and exit at the other end.

Pipelining

PIPELINING

Example of pipeline processing: Ai * Bi + Ci   for i = 1, 2, 3, ..., 7

The computation is split across three segments; R1 through R5 are the segment registers, and Ai, Bi, Ci come from memory:

Segment 1:  R1 <- Ai,  R2 <- Bi           (load Ai and Bi)
Segment 2:  R3 <- R1 * R2,  R4 <- Ci      (multiply and load Ci)
Segment 3:  R5 <- R3 + R4                 (add)

[Figure: memory feeds R1 and R2 in segment 1; the multiplier produces R3 while Ci is loaded into R4 in segment 2; the adder produces R5 in segment 3.]


OPERATIONS IN EACH PIPELINE STAGE

Clock    Segment 1      Segment 2             Segment 3
Pulse    R1     R2      R3          R4        R5
  1      A1     B1
  2      A2     B2      A1 * B1     C1
  3      A3     B3      A2 * B2     C2        A1 * B1 + C1
  4      A4     B4      A3 * B3     C3        A2 * B2 + C2
  5      A5     B5      A4 * B4     C4        A3 * B3 + C3
  6      A6     B6      A5 * B5     C5        A4 * B4 + C4
  7      A7     B7      A6 * B6     C6        A5 * B5 + C5
  8                     A7 * B7     C7        A6 * B6 + C6
  9                                           A7 * B7 + C7
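The clock-by-clock behavior in the table can be checked with a small simulation sketch in Python. The numeric values for Ai, Bi, and Ci below are made up purely for illustration; the latching order mirrors the three segments in the figure.

```python
# Simulate the three-segment pipeline computing Ai * Bi + Ci.
# Each clock pulse moves data one segment down the pipe.
A = [1, 2, 3, 4, 5, 6, 7]
B = [7, 6, 5, 4, 3, 2, 1]
C = [10, 20, 30, 40, 50, 60, 70]

n = len(A)
results = []
seg1 = seg2 = None                 # latches between segments
for clock in range(n + 2):         # k + n - 1 = 3 + 7 - 1 = 9 clock pulses
    # Segment 3: R5 <- R3 + R4
    if seg2 is not None:
        results.append(seg2[0] + seg2[1])
    # Segment 2: R3 <- R1 * R2, R4 <- Ci
    seg2 = (seg1[0] * seg1[1], C[clock - 1]) if seg1 is not None else None
    # Segment 1: R1 <- Ai, R2 <- Bi
    seg1 = (A[clock], B[clock]) if clock < n else None

print(results)   # [17, 32, 45, 56, 65, 72, 77]
```

After a fill-up latency of two clocks, one result emerges per clock pulse, exactly as in the table.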


GENERAL PIPELINE
General Structure of a 4-Segment Pipeline

[Figure: Input -> R1 -> S1 -> R2 -> S2 -> R3 -> S3 -> R4 -> S4; a common clock loads all registers Ri, and each Si is the combinational circuit of segment i.]

Space-Time Diagram

Clock cycle:  1   2   3   4   5   6   7   8   9
Segment 1:    T1  T2  T3  T4  T5  T6
Segment 2:        T1  T2  T3  T4  T5  T6
Segment 3:            T1  T2  T3  T4  T5  T6
Segment 4:                T1  T2  T3  T4  T5  T6


PIPELINE SPEEDUP

n: number of tasks to be performed

Conventional machine (non-pipelined):
  tn: clock cycle (time to complete one task)
  t1: time required to complete the n tasks
  t1 = n * tn

Pipelined machine (k stages):
  tp: clock cycle (time to complete each sub-operation)
  tk: time required to complete the n tasks
  tk = (k + n - 1) * tp

Speedup:
  Sk = n * tn / ((k + n - 1) * tp)

  lim (n -> infinity) Sk = tn / tp   ( = k, if tn = k * tp )


PIPELINE AND MULTIPLE FUNCTION UNITS

Example: 4-stage pipeline
- sub-operation in each stage: tp = 20 ns
- 100 tasks to be executed
- one task in the non-pipelined system: tn = 4 * 20 = 80 ns

Pipelined system:     (k + n - 1) * tp = (4 + 99) * 20 = 2060 ns
Non-pipelined system: n * tn = 100 * 80 = 8000 ns
Speedup:              Sk = 8000 / 2060 = 3.88

A 4-stage pipeline is thus basically identical to a system with 4 identical function units working in parallel.

[Figure: instructions Ii, Ii+1, Ii+2, Ii+3 dispatched to four identical function units P1, P2, P3, P4.]
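The arithmetic in the worked example above can be double-checked with a few lines of Python, plugging the example's values into the speedup formulas:

```python
# Pipeline speedup for the example: k = 4 stages, tp = 20 ns, n = 100 tasks.
k, tp, n = 4, 20, 100
tn = k * tp                        # non-pipelined time per task: 80 ns

t_pipelined = (k + n - 1) * tp     # (4 + 99) * 20 = 2060 ns
t_nonpipelined = n * tn            # 100 * 80 = 8000 ns
speedup = t_nonpipelined / t_pipelined

print(t_pipelined, t_nonpipelined, round(speedup, 2))   # 2060 8000 3.88
```

Note that the speedup (3.88) is close to, but below, the theoretical maximum of k = 4, because of the k - 1 cycles spent filling the pipe.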

Disadvantage of pipeline

Different segments may take different times to complete their sub-operations. The clock cycle must be chosen to equal the time delay of the segment with the maximum propagation time, so all other segments waste time waiting for the next clock.

Two areas of computer design where pipeline organization is applicable:
1. Arithmetic pipeline: divides an arithmetic operation into sub-operations for execution in the pipeline segments.
2. Instruction pipeline: operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle.

PIPELINE HAZARDS
WHAT ARE PIPELINE HAZARDS? Hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle. They reduce performance from the ideal speedup gained by pipelining.

CLASSIFICATION OF HAZARDS

Structural hazards: arise from resource conflicts, when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
Data hazards: arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.

Control hazards: arise from the pipelining of branches and other instructions that change the PC.

STRUCTURAL HAZARDS

For a pipeline to be free of structural hazards, functional units must be pipelined and resources duplicated enough to allow all possible combinations of instructions in the pipeline. Structural hazards arise for the following reasons:

- When a functional unit is not fully pipelined, a sequence of instructions using that unit cannot proceed at the rate of one per clock cycle.
- When a resource is not duplicated enough to allow all possible combinations of instructions. Example: a machine may have one register-file write port but want to perform two writes during the same clock cycle.
- When a machine shares a single memory for data and instructions, an instruction containing a data-memory reference will conflict with the instruction fetch of a later instruction. This is resolved by stalling the pipeline for one clock cycle when the data-memory access occurs.

DATA HAZARDS
Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine.

CLASSIFICATION OF DATA HAZARDS


RAW (read after write): consider two instructions i and j, with i occurring before j. j tries to read a source before i actually writes to it, so j gets the old value.

Example (the ADD writes R1, which the following instructions read):
  ADD R1,R2,R3
  SUB R4,R1,R5
  AND R6,R1,R7
  OR  R8,R1,R9
  XOR R10,R1,R11
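A RAW dependence like the one above can be detected mechanically by checking whether an instruction's source registers match the destination of an earlier instruction whose result is not yet written back. The tuple encoding of instructions below, and the 2-instruction write latency, are simplifying assumptions for illustration:

```python
# Detect RAW hazards in a straight-line instruction sequence.
# Each instruction is (mnemonic, dest, src1, src2) - an assumed encoding.
program = [
    ("ADD", "R1", "R2", "R3"),
    ("SUB", "R4", "R1", "R5"),    # reads R1 before ADD's write-back
    ("AND", "R6", "R1", "R7"),
    ("OR",  "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]

def raw_hazards(prog, write_latency=2):
    """Report (i, j) pairs where instruction j reads a register that
    instruction i writes, within write_latency instructions of i."""
    hazards = []
    for i, (_, dest, *_srcs) in enumerate(prog):
        for j in range(i + 1, min(i + 1 + write_latency, len(prog))):
            if dest in prog[j][2:]:            # dest is a source of j
                hazards.append((i, j))
    return hazards

print(raw_hazards(program))   # [(0, 1), (0, 2)]
```

With forwarding or register-file write-before-read, later readers of R1 (the OR and XOR here) see the correct value, which is why only the first two uses are flagged.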

This hazard is overcome by a simple hardware technique called forwarding. In forwarding, the ALU result from the EX/MEM register is always fed back to the ALU input latches. If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, the control logic selects the forwarded result as the ALU input rather than the value read from the register file.

WAW (write after write): j tries to write an operand before it is written by i. The writes are performed in the wrong order, leaving the value written by i as the final value. This hazard is present in pipelines that write in more than one pipe stage. In DLX this is not a hazard, since DLX writes only in the WB stage.

Example (both instructions write R1):
  LW  R1,0(R2)
  ADD R1,R2,R3

WAR (write after read): j tries to write a destination before it is read by i. This does not happen in DLX, since all reads occur early (in the ID stage) and all writes occur late (in the WB stage).

Example:
  SW  0(R1),R2
  ADD R2,R3,R4

CONTROL HAZARDS
Control hazards cause a greater performance loss than data hazards. The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected in the ID stage, and to hold the stall until the MEM stage, where the new PC is finally determined.

Each branch then causes a 3-cycle stall in the DLX pipeline, a significant loss given that roughly 30% of the instructions executed are branches.

The number of stall cycles per branch is reduced by testing the branch condition in the ID stage and computing the destination address there as well, using a separate adder. This leaves only a one-clock-cycle stall on branches.

WHAT MAKES PIPELINING HARD TO IMPLEMENT?


EXCEPTIONAL SITUATIONS: situations in which the normal order of execution is changed. Instructions that raise exceptions may force the machine to abort the instructions in the pipeline before they complete.

Some of the exceptions include:
- Integer arithmetic overflow/underflow
- Power failure
- Hardware malfunctions
- I/O device requests

Arithmetic pipeline

Arithmetic pipelines are usually found in very high-speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations.

Example: floating-point addition and subtraction of X = A x 2^a and Y = B x 2^b, where A and B are fractions representing the mantissas and a and b are the exponents. The sub-operations performed in the four segments are:
[1] Compare the exponents
[2] Align the mantissas
[3] Add or subtract the mantissas
[4] Normalize the result
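The four sub-operations can be expressed directly in code. The sketch below uses base-10 mantissa/exponent pairs for readability (the pipeline itself works the same way in base 2), with normalization meaning 0.1 <= |mantissa| < 1:

```python
# Four-segment floating-point add sketch: X = A * 10^a, Y = B * 10^b.
def fp_add(A, a, B, b):
    # Segment 1: compare exponents by subtraction
    diff = a - b
    # Segment 2: choose the larger exponent, align the smaller mantissa
    if diff >= 0:
        exp, B = a, B / (10 ** diff)
    else:
        exp, A = b, A / (10 ** -diff)
    # Segment 3: add the mantissas
    mant = A + B
    # Segment 4: normalize (shift so 0.1 <= |mant| < 1), adjust exponent
    while abs(mant) >= 1:
        mant /= 10
        exp += 1
    while 0 < abs(mant) < 0.1:
        mant *= 10
        exp -= 1
    return mant, exp

# 0.9504 * 10^3 + 0.8200 * 10^2 = 950.4 + 82.0 = 1032.4
print(fp_add(0.9504, 3, 0.8200, 2))   # normalized: mantissa ~0.10324, exponent 4
```

In the hardware pipeline these four steps run concurrently on four different operand pairs, one pair per segment, rather than sequentially as in this sketch.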


ARITHMETIC PIPELINE

Floating-point adder: X = A x 2^a, Y = B x 2^b

[Figure: the exponents a, b and mantissas A, B enter through input registers R.
Segment 1: compare exponents by subtraction (produces the difference).
Segment 2: choose the larger exponent; align the mantissa of the number with the smaller exponent.
Segment 3: add or subtract the mantissas.
Segment 4: adjust the exponent and normalize the result.
A register R separates each pair of adjacent segments.]

[1] Compare the exponents
[2] Align the mantissas
[3] Add or subtract the mantissas
[4] Normalize the result

Instruction pipeline
An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations.


INSTRUCTION CYCLE
Six phases in an instruction cycle:
[1] Fetch an instruction from memory
[2] Decode the instruction
[3] Calculate the effective address of the operand
[4] Fetch the operands from memory
[5] Execute the operation
[6] Store the result in the proper place

Some instructions skip some phases:
* Effective-address calculation can be done as part of the decoding phase
* Storage of the operation result into a register is done automatically in the execution phase

==> 4-stage pipeline:
[1] FI: Fetch an instruction from memory
[2] DA: Decode the instruction and calculate the effective address of the operand
[3] FO: Fetch the operand
[4] EX: Execute the operation
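The overlap of the four stages (in the absence of branches and stalls) follows a simple rule: instruction i enters stage s at clock cycle i + s (0-based). A short sketch that tabulates this:

```python
# Space-time table for a 4-stage instruction pipeline, no branches/stalls.
STAGES = ["FI", "DA", "FO", "EX"]

def timing_table(n_instructions):
    n_cycles = n_instructions + len(STAGES) - 1    # k + n - 1 cycles total
    table = [["--"] * n_cycles for _ in range(n_instructions)]
    for i in range(n_instructions):
        for s, name in enumerate(STAGES):
            table[i][i + s] = name                 # instr i, stage s, cycle i+s
    return table

for row in timing_table(4):
    print(" ".join(row))
# FI DA FO EX -- -- --
# -- FI DA FO EX -- --
# -- -- FI DA FO EX --
# -- -- -- FI DA FO EX
```

Branches break this regular pattern, as the timing table in the next section shows.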


INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

[Flowchart:
Segment 1: fetch instruction from memory.
Segment 2: decode instruction and calculate effective address; if the instruction is a branch, empty the pipe, update the PC, and resume fetching.
Segment 3: fetch operand from memory.
Segment 4: execute instruction; if an interrupt is pending, empty the pipe, handle the interrupt, and update the PC before fetching continues.]

Four segment CPU pipeline

Timing of instruction pipeline

Step:             1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction  1:   FI  DA  FO  EX
             2:       FI  DA  FO  EX
(Branch)     3:           FI  DA  FO  EX
             4:               FI  -   -   FI  DA  FO  EX
             5:                   -   -   -   FI  DA  FO  EX
             6:                               FI  DA  FO  EX
             7:                                   FI  DA  FO  EX

Because instruction 3 is a branch, the instruction fetched in step 4 is held, and the pipeline resumes fetching from the branch target in step 7.

Major difficulties
Three major difficulties:
- Resource conflicts: caused by access to memory by two segments at the same time. These conflicts can be resolved by using separate instruction and data memories.
- Data dependency conflicts: an instruction depends on the result of a previous instruction, but this result is not yet available.
- Branch difficulties: arise from branch and other instructions that change the value of the PC.

Handling data dependency:
1. Hardware interlocks: a circuit detects instructions whose source operands are destinations of instructions farther up in the pipeline; the detected instruction is delayed by enough clock cycles to resolve the conflict.
2. Operand forwarding: special hardware detects a conflict and avoids it by routing the data through special paths between pipeline segments.
3. Delayed load: the compiler reorders the instructions as necessary to delay the loading of conflicting data, inserting no-operation instructions where needed.

Handling of branch instructions


Prefetch target instruction: fetch instructions in both streams, branch taken and branch not taken. Both are saved until the branch is executed; then the right instruction stream is selected and the wrong stream discarded.

Branch target buffer (BTB, an associative memory): each entry holds the address of a previously executed branch together with its target instruction and the next few instructions. When fetching an instruction, the BTB is searched; if the instruction is found, the stream stored in the BTB is fetched; if not, a new stream is fetched and the BTB updated.

Loop buffer (a small, high-speed register file): stores an entire loop, allowing a loop to be executed without accessing memory.

Branch prediction: guess the branch condition and fetch an instruction stream based on the guess. A correct guess eliminates the branch penalty.

Delayed branch: the compiler detects the branch and rearranges the instruction sequence, inserting useful instructions that keep the pipeline busy in the presence of a branch instruction.

RISC pipeline

RISC: a machine with a very fast clock cycle that executes at the rate of one instruction per cycle. This is enabled by:
- a simple instruction set
- a fixed-length instruction format
- register-to-register operations


RISC PIPELINE
Instruction cycle of a three-stage instruction pipeline:

Data manipulation instructions:
  I: Instruction fetch
  A: Decode, read registers, ALU operation
  E: Write a register

Load and store instructions:
  I: Instruction fetch
  A: Decode, evaluate effective address
  E: Register-to-memory or memory-to-register transfer

Program control instructions:
  I: Instruction fetch
  A: Decode, evaluate branch address
  E: Write register (PC)


DELAYED LOAD
LOAD:  R1 <- M[address 1]
LOAD:  R2 <- M[address 2]
ADD:   R3 <- R1 + R2
STORE: M[address 3] <- R3

Three-segment pipeline timing with data conflict:

Clock cycle:  1  2  3  4  5  6
Load R1       I  A  E
Load R2          I  A  E
Add R1+R2           I  A  E
Store R3               I  A  E

Pipeline timing with delayed load:

Clock cycle:  1  2  3  4  5  6  7
Load R1       I  A  E
Load R2          I  A  E
NOP                 I  A  E
Add R1+R2              I  A  E
Store R3                  I  A  E

Advantage: the data dependency is taken care of by the compiler rather than the hardware.
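The compiler transformation above (insert a NOP after any load whose destination is used by the very next instruction) can be sketched as follows. The tuple encoding of instructions is an assumption made for illustration:

```python
# Delayed-load fix: insert a NOP after a load whose destination register
# is a source of the immediately following instruction.
# Instructions are (mnemonic, dest, sources...) - an assumed encoding.
def insert_delayed_load_nops(prog):
    out = []
    for i, inst in enumerate(prog):
        out.append(inst)
        is_load = inst[0] == "LOAD"
        if is_load and i + 1 < len(prog) and inst[1] in prog[i + 1][2:]:
            out.append(("NOP",))               # fill the load delay slot
    return out

prog = [
    ("LOAD", "R1", "M1"),
    ("LOAD", "R2", "M2"),
    ("ADD",  "R3", "R1", "R2"),   # uses R2 loaded one cycle earlier
    ("STORE", "M3", "R3"),
]
for inst in insert_delayed_load_nops(prog):
    print(inst)
```

Only the second load gets a NOP: R1 is not needed until two cycles after its load, so the first load has no conflict, matching the timing tables above.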


DELAYED BRANCH
The compiler analyzes the instructions before and after the branch and rearranges the program sequence by inserting useful instructions in the delay steps.

Using no-operation instructions:

Clock cycles:     1  2  3  4  5  6  7  8  9  10
1. Load           I  A  E
2. Increment         I  A  E
3. Add                  I  A  E
4. Subtract                I  A  E
5. Branch to X                I  A  E
6. NOP                           I  A  E
7. NOP                              I  A  E
8. Instr. in X                         I  A  E

Rearranging the instructions:

Clock cycles:     1  2  3  4  5  6  7  8
1. Load           I  A  E
2. Increment         I  A  E
3. Branch to X          I  A  E
4. Add                     I  A  E
5. Subtract                   I  A  E
6. Instr. in X                   I  A  E

CISC (COMPLEX INSTRUCTION SET COMPUTING)


A CISC (complex instruction set computer) is a computer in which a single instruction can execute several low-level operations and instructions are capable of multi-step operations or addressing modes within a single instruction. Characteristics:
- Some complex instructions are difficult or impossible to execute in one cycle through the pipeline
- Many different addressing modes
- Instructions of different lengths

Implementing a CISC Architecture


There are simple and complex instructions in a CISC architecture One approach:
Adapt the RISC pipeline
Execute

the simple, frequently-used CISC instructions as in RISC For the more complex instructions, use microinstructions.

That means, a sequence of microinstructions is stored on ROM for each complex CISC instruction Complex instructions often involve multiple microoperations or memory accesses in sequence When a complex instruction is decoded in the DOF stage of the pipeline, the microcode address and control is given to the microcode counter. Microinstructions are executed until the instruction is completed. Each microoperation is simply a set of control input signals Example: A certain CISC instruction I has microcode written in the microcode ROM at address A. When I is decoded in the main pipeline, stall the main pipeline, and give control to the microcode control (MC). At each subsequent clock cycle, MC will increment and execute the next microoperation (which is a control word that controls the datapath). The last microoperation in the sequence will give control back to the main pipeline, and un-stall the main pipeline.

CISC Approach
The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible. This is achieved by building processor hardware that is capable of understanding and executing a series of operations.

Example: MULT 2:3, 5:2. For this task a CISC processor comes prepared with a specific instruction. The instruction loads the two values into separate registers, multiplies the operands in the execution unit, and stores the product in the appropriate register. The entire task of multiplying two numbers is completed with one instruction; MULT is a complex instruction. CISC minimizes the number of instructions per program at the cost of more cycles per instruction.

Vector Processing

VECTOR PROCESSING
Vector processing applications: problems that can be efficiently formulated in terms of vectors, e.g.
- Long-range weather forecasting
- Petroleum exploration
- Seismic data analysis
- Medical diagnosis
- Aerodynamics and space-flight simulations
- Artificial intelligence and expert systems
- Image processing

Vector processor (computer): processes vectors, and related data structures such as matrices and multi-dimensional arrays, much faster than conventional computers. Vector processors may also be pipelined.


VECTOR PROGRAMMING
Fortran loop:

      DO 20 I = 1, 100
   20 C(I) = B(I) + A(I)

Conventional computer:

      Initialize I = 0
   20 Read A(I)
      Read B(I)
      Store C(I) = A(I) + B(I)
      Increment I = I + 1
      If I <= 100 goto 20

Vector computer:

      C(1:100) = A(1:100) + B(1:100)
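The contrast between the two styles can be sketched in Python. The element values are made up for illustration; the whole-array expression stands in for the single vector instruction (a library like NumPy would execute it natively as one array operation):

```python
# Conventional (scalar loop) vs. vector-style computation of C = A + B.
A = list(range(1, 101))        # A(1:100) = 1, 2, ..., 100
B = list(range(100, 0, -1))    # B(1:100) = 100, 99, ..., 1

# Conventional computer: one element per loop iteration,
# with explicit index bookkeeping.
C_loop = []
for i in range(100):
    C_loop.append(A[i] + B[i])

# Vector computer: a single vector operation covers all 100 elements;
# no per-element loop overhead is visible to the programmer.
C_vector = [a + b for a, b in zip(A, B)]

print(C_loop == C_vector)   # True
print(C_vector[0])          # 101
```

The result is the same; the difference is that the vector form exposes all 100 additions at once, letting pipelined or parallel hardware overlap them.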


VECTOR INSTRUCTIONS

Instruction types:  f1: V -> V    f2: V -> S    f3: V x V -> V    f4: V x S -> V
(V: vector operand, S: scalar operand)

Type  Mnemonic  Description            Effect (I = 1, ..., n)
f1    VSQR      Vector square root     B(I) <- SQR(A(I))
f1    VSIN      Vector sine            B(I) <- sin(A(I))
f1    VCOM      Vector complement      A(I) <- complement of A(I)
f2    VSUM      Vector summation       S <- sum of A(I)
f2    VMAX      Vector maximum         S <- max{A(I)}
f3    VADD      Vector add             C(I) <- A(I) + B(I)
f3    VMPY      Vector multiply        C(I) <- A(I) * B(I)
f3    VAND      Vector AND             C(I) <- A(I) AND B(I)
f3    VLAR      Vector larger          C(I) <- max(A(I), B(I))
f3    VTGE      Vector test >          C(I) <- 0 if A(I) < B(I); 1 otherwise
f4    SADD      Vector-scalar add      B(I) <- S + A(I)
f4    SDIV      Vector-scalar divide   B(I) <- A(I) / S


VECTOR INSTRUCTION FORMAT

| Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length |

Pipeline for inner product:

[Figure: Source A and Source B feed a multiplier pipeline; the products feed an adder pipeline that accumulates the inner product.]


MULTIPLE MEMORY MODULES AND INTERLEAVING

Multiple-module memory:

[Figure: an address bus and a data bus shared by four memory modules M0..M3; each module contains its own address register (AR), memory array, and data register (DR).]

Address interleaving: different sets of addresses are assigned to different memory modules.

Pipeline and vector processors often require simultaneous access to memory from two or more sources. An instruction pipeline, for example, may need to fetch an instruction and an operand at the same time from two different segments. Memory can therefore be partitioned into a number of modules connected to common memory address and data buses. A memory module is a memory array together with its own address and data registers. One module can initiate a memory access while other modules are still reading or writing, and each module can honor a memory request independently of the state of the other modules.

Advantage of modular memory


Interleaving:

In an interleaved memory, different sets of addresses are assigned to different memory modules. A vector processor that uses an n-way interleaved memory can fetch n operands from n different modules. For example, in a two-module memory system, the even addresses may be in one module and the odd addresses in the other. A CPU with an instruction pipeline can take advantage of multiple memory modules so that each segment in the pipeline can access memory independently of memory accesses from other segments.
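The usual low-order interleaving scheme maps an address to a module by its remainder and to an offset within that module by its quotient; a small sketch:

```python
# n-way low-order address interleaving: address -> (module, offset).
def interleave(address, n_modules=4):
    return address % n_modules, address // n_modules

# Consecutive addresses land in different modules, so a vector processor
# can have n memory accesses in flight at once.
for addr in range(8):
    print(addr, "-> module", interleave(addr)[0])

# Two-module case from the text: even addresses in M0, odd in M1.
assert interleave(6, 2) == (0, 3)
assert interleave(7, 2) == (1, 3)
```

With n_modules a power of two, the module number is simply the low-order address bits, which is why hardware implements this with no arithmetic at all.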

Array processors
An array processor performs computations on large arrays of data.
- Attached array processor: an auxiliary processor attached to a general-purpose computer.
- SIMD array processor: a processor with a single-instruction, multiple-data organization. It manipulates vector instructions by means of multiple functional units responding to a common instruction.

Attached array processor


An attached array processor enhances the performance of a computer by providing vector processing for complex scientific applications.

[Figure: the general-purpose computer connects to the attached array processor through an input-output interface; the computer's main memory and the array processor's local memory are linked by a high-speed memory-to-memory bus.]

SIMD ARRAY PROCESSORS

[Figure: SIMD array processor organization. A master control unit and main memory drive processing elements PE1..PEn, each with its own local memory M1..Mn.]

A SIMD array processor is a computer with multiple processing units operating in parallel. It consists of a set of identical processing elements (PEs), each having a local memory. Each processing element includes an ALU, a floating-point arithmetic unit, and working registers. The master control unit controls the operations in the processing elements, and the main memory is used for storage of the program. The function of the master control unit is to decode the instructions and determine how they are to be executed. Vector instructions are broadcast to all PEs simultaneously, and vector operands are distributed to the local memories prior to the parallel execution of the instruction.
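The broadcast-and-execute pattern can be sketched as follows. The operand values are hypothetical, and the lockstep PEs are modeled sequentially here for clarity:

```python
# SIMD sketch: the master control unit broadcasts one instruction;
# every processing element PEi applies it to operands already
# distributed into its own local memory.
local_A = [1, 2, 3, 4]       # operand slices distributed to PE1..PE4
local_B = [10, 20, 30, 40]

def broadcast(op, A, B):
    # Each PE i computes op(A[i], B[i]); real hardware does this in
    # lockstep across all PEs in a single instruction time.
    return [op(a, b) for a, b in zip(A, B)]

result = broadcast(lambda a, b: a + b, local_A, local_B)
print(result)   # [11, 22, 33, 44]
```

The key point the sketch illustrates: there is one instruction stream (the broadcast op) but many data streams (one element pair per PE), which is exactly the SIMD category of Flynn's classification.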
