
3140707

Computer Organization & Architecture

Unit – 6
Pipeline & Vector Processing

Parallel Processing
6.1 Parallel Processing

Fig.: Processor with multiple functional units


Parallel Processing
 Parallel processing is a term used to denote a large class of
techniques that are used to provide simultaneous data-
processing tasks for the purpose of increasing the
computational speed of a computer system.
 Purpose of parallel processing is to speed up the computer
processing capability and increase its throughput.
 Throughput:
The amount of processing that can be accomplished during a
given interval of time.
Flynn’s Taxonomy / Classification
6.1.1 Flynn’s Taxonomy / Classification

 Instruction Stream: Sequence of instructions read from memory.
 Data Stream: Operations performed on the data in the processor.
Flynn’s Taxonomy / Classification
Flynn’s classification divides computers into four major groups as
follows:
 Single Instruction Single Data (SISD)
 Single Instruction Multiple Data (SIMD)
 Multiple Instruction Single Data (MISD)
 Multiple Instruction Multiple Data (MIMD)
Single Instruction Single Data (SISD)
 SISD represents the organization of a single computer
containing a control unit, a processor unit, and a memory unit.
 Instructions are executed sequentially and the system may or
may not have internal parallel processing capabilities.
Single Instruction Multiple Data (SIMD)
 SIMD represents an organization that includes many processing units under the supervision of a common control unit.
 All processors receive the same instruction from the control unit but operate on different items of data.
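As a loose software analogy (not from the slides), a NumPy array operation applies a single instruction to many data items at once, which is the SIMD idea in miniature:

```python
import numpy as np

# One "instruction" (element-wise add) carried out on many data items:
# each array position behaves like a processing unit that receives the
# same instruction but operates on its own pair of operands.
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
print(a + b)   # -> [11 22 33 44]
```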
Multiple Instruction Single Data (MISD)
 There is no computer at present that can be classified as MISD.
 MISD structure is only of theoretical interest since no practical
system has been constructed using this organization.
Multiple Instruction Multiple Data (MIMD)
 MIMD organization refers to a computer system capable of processing
several programs at the same time.
 Most multiprocessor and multicomputer systems can be classified in this
category.
 Contains multiple processing units.
 Execution of multiple instructions on multiple data.
Pipelining
6.2 Pipelining
 Pipelining is a technique of decomposing a sequential process into sub-operations, with each sub-operation being executed in a special dedicated segment that operates concurrently with all other segments.
 A pipeline can be visualized as a collection of processing segments through which binary information flows.
 The result obtained from the computation in each segment is transferred to the next segment in the pipeline.
 The registers provide isolation between segments.
 The technique is efficient for applications that need to repeat the same task many times with different sets of data.
 The overlapping of computation is made possible by associating a register with each segment in the pipeline.
6.2.1 Pipelining example
Ai * Bi + Ci for i = 1, 2, 3, …, 7

Fig.: Pipeline for Ai * Bi + Ci (Segment 1: input registers R1 and R2; Segment 2: multiplier with registers R3 and R4; Segment 3: adder with register R5)
Pipelining
Sub-operations performed in each segment of the pipeline are as follows:
 R1 ← Ai, R2 ← Bi (Load Ai and Bi)
 R3 ← R1 * R2, R4 ← Ci (Multiply and load Ci)
 R5 ← R3 + R4 (Add Ci to the product)
Pipelining
Content of registers in pipeline example
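The register-contents table itself did not survive extraction; as a substitute, here is a minimal Python sketch (illustrative data values) that simulates the three segments and prints what each register holds at every clock pulse:

```python
# Simulates the three pipeline segments for Ai * Bi + Ci, i = 1..7.
# On every clock pulse all registers are loaded simultaneously, so the
# new values are computed from the old ones before any register changes.
A = [1, 2, 3, 4, 5, 6, 7]
B = [8, 9, 10, 11, 12, 13, 14]
C = [15, 16, 17, 18, 19, 20, 21]
n = len(A)

R1 = R2 = R3 = R4 = R5 = None
print(f"{'Clock':>5} {'R1':>4} {'R2':>4} {'R3':>6} {'R4':>4} {'R5':>6}")
for clock in range(1, n + 3):                     # n tasks finish in n + 2 clocks
    new_R5 = R3 + R4 if R3 is not None else None  # segment 3: add Ci to product
    new_R3 = R1 * R2 if R1 is not None else None  # segment 2: multiply
    new_R4 = C[clock - 2] if 2 <= clock <= n + 1 else None  # segment 2: load Ci
    new_R1 = A[clock - 1] if clock <= n else None           # segment 1: load Ai
    new_R2 = B[clock - 1] if clock <= n else None           # segment 1: load Bi
    R1, R2, R3, R4, R5 = new_R1, new_R2, new_R3, new_R4, new_R5
    print(f"{clock:>5} {str(R1):>4} {str(R2):>4} {str(R3):>6} {str(R4):>4} {str(R5):>6}")
```

The first result A1*B1 + C1 appears in R5 at clock 3; after that, one new result emerges every clock.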
6.2.2 Four Segment Pipeline:
 General structure of four segment pipeline

Fig.: General structure of a four-segment pipeline (Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with a common clock driving all the registers)

This technique is efficient for those applications that need to repeat the same task many times with different sets of data.
6.2.3 Space-time Diagram
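The diagram itself did not survive extraction. Its content: with k segments, a new task enters the pipeline every clock, so n tasks finish in k + n − 1 clocks. A short Python sketch (illustrative, not from the slides) that prints such a diagram:

```python
# Prints a space-time diagram: rows are segments, columns are clock cycles.
# Task i occupies segment j during clock cycle i + j - 1, so the last of
# n tasks leaves segment k at clock k + n - 1.
def space_time(k, n):
    clocks = k + n - 1
    print(f"{'Seg/Clk':<8}" + "".join(f"{c:>4}" for c in range(1, clocks + 1)))
    for seg in range(1, k + 1):
        cells = []
        for c in range(1, clocks + 1):
            task = c - seg + 1                 # task in this segment right now
            cells.append(f"T{task}" if 1 <= task <= n else "")
        print(f"{'S' + str(seg):<8}" + "".join(f"{cell:>4}" for cell in cells))

space_time(k=4, n=6)   # four segments, six tasks -> 9 clock cycles
```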
Arithmetic Pipeline
6.3 Arithmetic Pipeline
 Usually found in high speed computers.
 Divides arithmetic operation into sub operations for execution
in the pipeline segments.
 Used to implement floating point operations, multiplication of
fixed point numbers and similar operations.
6.3.1 Pipeline for Floating point Arithmetic
 Consider an example of floating point addition and subtraction.
X = A × 10^a
Y = B × 10^b
 A and B are two fractions that represent the mantissas and a and b are the
exponents.
 The floating point addition and subtraction can be performed in four
segments.
 The sub-operations that are performed in the four segments are:
1. Compare the exponents
2. Align the mantissas
3. Add or subtract the mantissas
4. Normalize the result
Arithmetic pipeline
Fig.: Pipeline for floating-point addition and subtraction (inputs: exponents a, b and mantissas A, B; Segment 1: compare exponents by subtraction; Segment 2: choose exponent, align mantissas; Segment 3: add or subtract mantissas; Segment 4: adjust exponent, normalize result; R denotes the registers between segments)
6.3.2 Example of Arithmetic Pipeline
 Consider the two normalized floating-point numbers:
X = 0.9504 x 10^3, Y = 0.8200 x 10^2
 Segment-1: The two exponents are subtracted in the first segment to obtain 3 - 2 = 1.
 Segment-2: The next segment shifts the mantissa of Y one position to the right, aligning the two mantissas under the same exponent:
X = 0.9504 x 10^3, Y = 0.0820 x 10^3
 Segment-3: Addition of the two mantissas produces the sum
Z = 1.0324 x 10^3
 Segment-4: Normalize the result by shifting the mantissa once to the right and incrementing the exponent by one to obtain the normalized sum:
Z = 0.10324 x 10^4
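A minimal Python sketch of the four segments applied to this example, using decimal mantissa/exponent pairs as in the text (only the right-shift normalization shown in the example is handled; binary floating-point rounding noise is masked by rounding the printout):

```python
# The four segments of floating-point addition from the worked example.
def fp_add(a_man, a_exp, b_man, b_exp):
    diff = a_exp - b_exp              # Segment 1: compare exponents
    if diff >= 0:                     # Segment 2: align mantissas and
        b_man /= 10 ** diff           # choose the larger exponent
        exp = a_exp
    else:
        a_man /= 10 ** (-diff)
        exp = b_exp
    man = a_man + b_man               # Segment 3: add the mantissas
    while abs(man) >= 1.0:            # Segment 4: normalize the result
        man /= 10
        exp += 1
    return man, exp

man, exp = fp_add(0.9504, 3, 0.8200, 2)
print(round(man, 5), exp)             # -> 0.10324 4, i.e. 0.10324 x 10^4
```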
Instruction Pipeline
6.4 Instruction Pipeline
 In the most general case, the computer needs to process each instruction with the following sequence of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
 Different segments may take different times to operate on the
incoming information.
 Some segments are skipped for certain operations.
 The design of an instruction pipeline will be most efficient if
the instruction cycle is divided into segments of equal duration.
Instruction Pipeline
 Assume that the decoding of the instruction can be combined
with the calculation of the effective address into one segment.
 Assume further that most of the instructions place the result into a processor register, so that instruction execution and storing of the result can be combined into one segment.
 This reduces the instruction pipeline into four segments.
1. FI: Fetch an instruction from memory
2. DA: Decode the instruction and calculate the effective address of
the operand
3. FO: Fetch the operand
4. EX: Execute the operation
6.4.1 Four segment CPU pipeline
Fig.: Flowchart of a four-segment instruction pipeline.
Segment 1: fetch instruction from memory; Segment 2: decode instruction and calculate effective address, then test for branch; Segment 3: fetch operand from memory; Segment 4: execute instruction, then test for interrupt. On a branch, or on an interrupt (after interrupt handling), the PC is updated and the pipe is emptied.
Four segment CPU pipeline
 The figure shows how the instruction cycle in the CPU can be processed with a four-segment pipeline.
 While an instruction is being executed in segment 4, the next
instruction in sequence is busy fetching an operand from
memory in segment 3.
 The effective address may be calculated in a separate
arithmetic circuit for the third instruction, and whenever the
memory is available, the fourth and all subsequent instructions
can be fetched and placed in an instruction FIFO.
 Thus up to four sub-operations in the instruction cycle can overlap, and up to four different instructions can be in progress at the same time.
6.4.2 Instruction Pipeline
Fig.: Execution of three instructions in a 4-stage pipeline, conventional (sequential) versus pipelined (overlapped)
6.4.3 Space-time Diagram
Step:       1   2   3   4   5   6   7   8   9  10  11  12  13
Instr 1:   FI  DA  FO  EX
Instr 2:       FI  DA  FO  EX
Instr 3:           FI  DA  FO  EX   (branch)
Instr 4:               FI   -   -  FI  DA  FO  EX
Instr 5:                    -   -   -  FI  DA  FO  EX
Instr 6:                                   FI  DA  FO  EX
Instr 7:                                       FI  DA  FO  EX
Space-time Diagram
 The figure shows the operation of the instruction pipeline. Time on the horizontal axis is divided into steps of equal duration. The four segments are represented in the diagram with abbreviated symbols.
 It is assumed that the processor has separate instruction and data
memories so that the operation in FI and FO can proceed at the same
time.
 Thus, in step 4, instruction 1 is being executed in segment EX; the
operand for instruction 2 is being fetched in segment FO; instruction
3 is being decoded in segment DA; and instruction 4 is being fetched
from memory in segment FI.
 Assume now that instruction 3 is a branch instruction.
 As soon as this instruction is decoded in segment DA in step 4, the
transfer from FI to DA of the other instructions is halted until the
branch instruction is executed in step 6.
Space-time Diagram
 If the branch is taken, a new instruction is fetched in step 7. If
the branch is not taken, the instruction fetched previously in
step 4 can be used.
 The pipeline then continues until a new branch instruction is
encountered.
 Another delay may occur in the pipeline if the EX segment
needs to store the result of the operation in the data memory
while the FO segment needs to fetch an operand.
 In that case, segment FO must wait until segment EX has
finished its operation.
6.4.4 Pipeline Conflicts (Major Hazards in Pipelined Execution)
Structural hazards (resource conflicts)
 Caused by access to memory by two segments at the same time.
 Most of these conflicts can be resolved by using separate instruction and data memories.
Data hazards (data dependency conflicts)
 Arise when an instruction scheduled to be executed in the pipeline requires the result of a previous instruction, which is not yet available.
Control hazards (branch difficulties)
 Arise from branches and other instructions that change the value of the PC, causing the fetch of the next instruction to be delayed.
Pipeline Conflict
 There are three major difficulties that cause instruction pipeline conflicts.
1. Resource conflicts caused by access to memory by two segments at the
same time.
2. Data dependency conflicts arise when an instruction depends on the
result of a previous instruction, but this result is not yet available.
3. Branch difficulties arise from branch and other instructions that
change the value of PC.
Data Dependency
 Data dependency occurs when an instruction needs data that are not yet available.
 Pipelined computers deal with such data dependency conflicts in a variety of ways, as follows (see the sketch after this list):
1. Hardware interlocks: a circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline, and delays them until the conflict is resolved.
2. Operand forwarding: special hardware detects a conflict and then avoids it by routing the data through a special path between pipeline segments.
3. Delayed load: the compiler is designed to detect a data conflict and reorder the instructions as necessary to delay the loading of the conflicting data by inserting no-operation instructions. This method is referred to as delayed load.
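A minimal sketch of the detection step a hardware interlock performs, with illustrative names (in real hardware this is done by comparator circuits on register numbers, not software):

```python
# The source registers of the incoming instruction are compared against
# the destination registers of instructions still in the pipeline; a
# match is a read-after-write (RAW) hazard that forces a stall (or, with
# operand forwarding, a bypass).
def raw_hazard(in_flight_dests, incoming_sources):
    """Return True if the incoming instruction conflicts with one in flight."""
    return any(dest in incoming_sources for dest in in_flight_dests)

# LOAD R1 and LOAD R2 are still in the pipeline when ADD R3 <- R1 + R2 arrives:
print(raw_hazard(["R1", "R2"], {"R1", "R2"}))   # True  -> stall or forward
print(raw_hazard(["R1", "R2"], {"R4", "R5"}))   # False -> no conflict
```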
Handling Branch Instructions
 The branch instruction breaks the normal sequence of the
instruction stream, causing difficulties in the operation of the
instruction pipeline.
 Hardware techniques available to minimize the performance
degradation caused by instruction branching are as follows:
• Pre-fetch target
• Branch target buffer (BTB)
• Loop buffer
• Branch prediction
• Delayed branch
Handling Branch Instructions
Pre-fetch target:
 One way of handling a conditional branch is to pre-fetch the
target instruction in addition to the instruction following the
branch.
 If the branch condition is successful, the pipeline continues
from the branch target instruction.
 An extension of this procedure is to continue fetching
instructions from both places until the branch decision is
made.
Handling Branch Instructions
Branch target buffer:
 Another possibility is the use of a branch target buffer or BTB.
 The BTB is an associative memory included in the fetch
segment of the pipeline.
 Each entry in the BTB consists of the address of a previously
executed branch instruction and the target instruction for that
branch.
 It also stores the next few instructions after the branch target
instruction.
 The advantage of this scheme is that branch instructions that
have occurred previously are readily available in the pipeline
without interruption.
Handling Branch Instructions
Loop buffer:
 A variation of the BTB is the loop buffer. This is a small, very high-speed register file maintained by the instruction fetch segment of the pipeline.
 When a program loop is detected in the program, it is stored in the loop buffer in its entirety, including all branches.
Branch prediction:
 A pipeline with branch prediction uses some additional logic to guess the outcome of a conditional branch instruction before it is executed (see the sketch below).
 The pipeline then begins pre-fetching the instruction stream from the predicted path.
 A correct prediction eliminates the wasted time caused by branch penalties.
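The slides do not fix a particular prediction scheme; one common choice is a 2-bit saturating counter per branch, sketched here in Python:

```python
# 2-bit saturating counter: states 0-1 predict "not taken", states 2-3
# predict "taken". Each actual outcome nudges the counter toward itself,
# so a single anomalous outcome does not flip a stable prediction.
class TwoBitPredictor:
    def __init__(self):
        self.state = 2                    # start at "weakly taken"

    def predict(self):
        return self.state >= 2            # True means "predict taken"

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
for outcome in [True, True, False, True]:     # actual branch outcomes
    print("predicted:", p.predict(), "actual:", outcome)
    p.update(outcome)
```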
Handling Branch Instructions
Delayed branch:
 A procedure employed in most RISC processors is the delayed
branch.
 In this procedure, the compiler detects the branch instructions
and rearranges the machine language code sequence by
inserting useful instructions that keep the pipeline operating
without interruptions.
RISC Pipeline
6.5 RISC Pipeline
RISC: Reduced Instruction Set Computer
- A machine with a very fast clock cycle that executes at the rate of one instruction per cycle.
- Simple instruction set.
- Fixed-length instruction format.
- Register-to-register operations.
 All data manipulation instructions have register-to-register operations. Since all operands are in registers, there is no need for calculating an effective address or fetching operands from memory.
 One segment fetches the instruction from program memory, and the other segment executes the instruction in the ALU.
 A third segment may be used to store the result of the ALU operation in a destination register.
 The data transfer instructions in RISC are limited to load and store instructions. These instructions use register indirect addressing and usually need three or four stages in the pipeline.
6.5.1 Three Segment Instruction Pipeline
The instruction cycle can be divided into three sub-operations, implemented in three segments:
I: Instruction fetch
A: ALU operation
E: Execute instruction

Instruction cycles of the three-stage (segment) instruction pipeline:

Data manipulation instructions
I: Instruction fetch
A: Decode, read registers, ALU operation
E: Write a register

Load and store instructions
I: Instruction fetch
A: Decode, evaluate effective address
E: Register-to-memory or memory-to-register transfer

Program control instructions
I: Instruction fetch
A: Decode, evaluate branch address
E: Write register (PC)
6.5.2 Delayed Load
LOAD: R1 ← M[address 1]
LOAD: R2 ← M[address 2]
ADD: R3 ← R1 + R2
STORE: M[address 3] ← R3

Fig.: Three-segment pipeline timing: (a) pipeline timing with data conflict; (b) pipeline timing with delayed load

The data dependency is taken care of by the compiler rather than by the hardware.
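The timing diagrams did not survive extraction; the following is a reconstruction based on the three-segment (I, A, E) stages defined above. In (a), the ADD reads R2 in its A stage at clock 4, the same clock in which the second LOAD is still writing R2 in its E stage; in (b), the compiler-inserted NOP delays the ADD by one clock, so R2 is already written when the ADD reads it:

(a) Pipeline timing with data conflict
Clock:         1   2   3   4   5   6
1. LOAD  R1    I   A   E
2. LOAD  R2        I   A   E
3. ADD R1+R2           I   A   E
4. STORE R3                I   A   E

(b) Pipeline timing with delayed load
Clock:         1   2   3   4   5   6   7
1. LOAD  R1    I   A   E
2. LOAD  R2        I   A   E
3. NOP                 I   A   E
4. ADD R1+R2               I   A   E
5. STORE R3                    I   A   E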
Vector Processing
6.6 Vector Processing
 In many science and engineering applications, the problems
can be formulated in terms of vectors and matrices that lend
themselves to vector processing.
 Applications of Vector processing
1. Long-range weather forecasting
2. Petroleum explorations
3. Seismic data analysis
4. Medical diagnosis
5. Aerodynamics and space flight simulations
6. Artificial intelligence and expert systems
7. Mapping the human genome
8. Image processing
6.6.1 Vector Operation
Vector Instruction Format:

Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length

 This is a three-address instruction, with three fields specifying the base addresses of the operands and an additional field that gives the length of the data items in the vectors.
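A sketch of what a single vector add instruction accomplishes, with hypothetical parameters mirroring the fields of the format above (two source base addresses, a destination base address, and a length):

```python
# One vector instruction conceptually drives "length" element-wise
# operations; in a pipelined vector unit, one result emerges per cycle.
def vector_add(mem, base_src1, base_src2, base_dst, length):
    for i in range(length):
        mem[base_dst + i] = mem[base_src1 + i] + mem[base_src2 + i]

mem = list(range(16))                    # toy flat memory
vector_add(mem, base_src1=0, base_src2=4, base_dst=8, length=4)
print(mem[8:12])                         # -> [4, 6, 8, 10]
```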
6.6.2 Matrix Multiplication
Pipeline for Inner Product of Matrix Multiplication:
Fig.: Pipeline for the inner product (Source A and Source B feed a multiplier pipeline, whose products feed an adder pipeline)
 A and B are either in memory or in processor registers.
 Each floating-point adder and multiplier unit is assumed to have 4 segments.
 All segment registers are initialized to zero.
 Ai and Bi pairs are brought in and multiplied at a rate of one pair per cycle.
 After the first 4 cycles, the products begin to enter the adder.
 During the next 4 cycles, zero is added to the products entering the adder pipeline.
 At the end of the 8th cycle, the first four products A1B1 through A4B4 are in the four adder segments, and the next four products A5B5 through A8B8 are in the multiplier segments.
 Thus the 9th cycle starts the addition A1B1 + A5B5 in the adder pipeline.
 The 10th cycle starts the addition A2B2 + A6B6, and so on.
Matrix Multiplication
C = A1B1 + A5B5 + A9B9  + A13B13 + ⋯
  + A2B2 + A6B6 + A10B10 + A14B14 + ⋯
  + A3B3 + A7B7 + A11B11 + A15B15 + ⋯
  + A4B4 + A8B8 + A12B12 + A16B16 + ⋯
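The effect of the 4-segment adder is that the products accumulate into four interleaved partial sums, exactly the four rows of the expansion above, which are combined at the end. A small Python sketch with illustrative values:

```python
# Products interleave into four partial sums, one per adder segment.
A = [float(i + 1) for i in range(16)]    # illustrative values
B = [2.0] * 16

partial = [0.0] * 4                      # one partial sum per adder segment
for i in range(16):
    partial[i % 4] += A[i] * B[i]        # A1B1 + A5B5 + ..., A2B2 + A6B6 + ...

print(partial, sum(partial))             # sum(partial) is the full inner product
```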
6.6.3 Memory Interleaving

Fig.: Multiple module memory organization


Memory Interleaving
 Pipeline and vector processors often require simultaneous access
to memory from two or more sources.
 Instead of using two memory buses for simultaneous access, the memory can be partitioned into a number of modules connected to common address and data buses.
 Each memory array has its own address register AR and data
register DR.
 The address registers receive information from a common
address bus and data registers communicate with a bidirectional
bus.
 Advantage: supports interleaving, in which different sets of addresses are assigned to different memory modules (see the sketch below).
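A minimal sketch of the interleaved mapping, assuming the low-order address bits select the module (a common arrangement; the module count here is illustrative):

```python
# With 4 modules, consecutive addresses land in different modules, so
# sequential accesses can overlap across modules.
NUM_MODULES = 4                          # assume a power of two

def interleave(addr):
    module = addr % NUM_MODULES          # low-order bits pick the module
    offset = addr // NUM_MODULES         # remaining bits address within it
    return module, offset

for addr in range(8):
    print(addr, "->", interleave(addr))  # 0->M0, 1->M1, 2->M2, 3->M3, 4->M0, ...
```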
