
EA 2004

Computer Architecture - II

Pipelining
Parallel processing

A parallel processing system is able to perform simultaneous data processing to achieve faster execution time.

The system may have two or more ALUs and be able to execute two or more instructions at the same time.

The goal is to increase throughput: the amount of processing that can be accomplished during a given interval of time.
Parallel processing

Parallel processing can happen in two basic streams:

instruction stream - the sequence of instructions read from memory

data stream - the operations performed on the data

Based on these streams, computers can be classified into 4 categories (Flynn's classification) focused on the behavioral aspects of parallel processing.
Parallel processing classification

Single instruction stream, single data stream (SISD)

Single instruction stream, multiple data stream (SIMD)

Multiple instruction stream, single data stream (MISD)

Multiple instruction stream, multiple data stream (MIMD)


Single instruction stream, single data stream (SISD)

A single control unit
A processor unit
A memory unit

Instructions are executed sequentially. Parallel processing may be achieved by means of multiple functional units or by pipeline processing.
Single instruction stream, multiple data stream (SIMD)

A single control unit
Many processor units
A memory unit

Includes multiple processing units with a single control unit. All processors receive the same instruction, but operate on different data.
Multiple instruction stream, single data stream (MISD)

Many processor units, each of which contains:
a control unit
a local memory

Theoretical only

Processors receive different instructions, but operate on the same data.

e.g. Space Shuttle flight control systems


Multiple instruction stream, multiple data stream (MIMD)

Many processor units
Many control units

A computer system capable of processing several programs at the same time.

Most multiprocessor and supercomputer systems can be classified in this category.

Parallel processing can also be achieved via pipelining, which concerns the operational and structural interconnection of hardware segments.
What is a Pipeline
Pipelining is used by all modern microprocessors to enhance performance by overlapping the execution of instructions.

A common analogy for a pipeline is a factory assembly line. Assume that there are three stages:
o Welding
o Painting
o Polishing

For simplicity, assume that each task takes one hour.
What is a Pipeline
A single person would take three hours to produce one product.

With three people, one person could work on each stage; upon completing their stage they pass the product on to the next person (since each stage takes one hour there will be no waiting).

The line then produces one product per hour once the assembly line has been filled.
Pipelining: Laundry Example

A small laundry has one washer, one dryer and one operator; it takes 90 minutes to finish one load (loads A, B, C and D):

Washer takes 30 minutes
Dryer takes 40 minutes
Operator folding takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight; loads A, B, C and D each take 30 + 40 + 20 = 90 minutes and are run back to back.]
This operator scheduled his loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he will not start a new load unless he is already done with the previous one. The process is sequential. Sequential laundry takes 6 hours for 4 loads.
Efficiently scheduled laundry: Pipelined Laundry

[Figure: timeline from 6 PM to midnight; loads A, B, C and D overlap, with a new load entering the washer every 40 minutes.]
Another operator asks for the loads to be delivered to the laundry every 40 minutes.
Pipelined laundry takes 3.5 hours for 4 loads.
Pipelining Facts

Multiple tasks operate simultaneously.

Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload.

Pipeline rate is limited by the slowest pipeline stage (in the laundry example, the washer waits 10 minutes for the 40-minute dryer).

Potential speedup = number of pipe stages.

Unbalanced lengths of pipe stages reduce speedup.

Time to fill the pipeline and time to drain it reduce speedup.
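As a quick check of these facts, here is a minimal Python sketch (not part of the original slides) that reproduces the laundry timings; the 40-minute dryer, the slowest stage, sets the pipelined rate.

```python
# Stage times in minutes (wash, dry, fold) and number of loads (A-D).
stage_times = [30, 40, 20]
n_loads = 4

# Sequential: every load waits until the previous one has completely finished.
sequential_min = n_loads * sum(stage_times)            # 4 * 90 = 360 min = 6 hours

# Pipelined: once the pipe is full, a load finishes every max(stage) minutes,
# because the pipeline rate is limited by the slowest stage (the 40-min dryer).
pipelined_min = sum(stage_times) + (n_loads - 1) * max(stage_times)   # 90 + 3*40 = 210 min

print(sequential_min / 60, pipelined_min / 60)          # 6.0 hours vs 3.5 hours
```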
Building a Car
Unpipelined: start and finish a job before moving to the next.

Parallelism = 1 car
Latency = 24 hrs.
Throughput = 1/24 hrs.

[Figure: each car occupies the line for a full 24 hrs. before the next one starts.]

Latency: the amount of time that a single operation takes to execute.
Throughput: the rate at which operations get executed (generally expressed as operations/second or operations/cycle).
The Assembly Line
Pipelined: break the job into smaller stages (Engine, Body, Paint), each taking 8 hrs.

Parallelism = 3 cars
Latency = 24 hrs.
Throughput = 1/8 hrs. (3X)

[Figure: successive cars overlap; while one car is in Paint, the next is in Body and a third is in Engine.]
In a computer..
Unpipelined: start and finish a job before moving to the next.

[Figure: each instruction passes through FET, DEC and EXE completely before the next instruction starts.]
In a computer..
Pipelined: break the job into smaller stages (A = FET, B = DEC, C = EXE).

[Figure: Cycle 1 - I1 in FET; Cycle 2 - I1 in DEC while I2 is in FET; Cycle 3 - I1 in EXE, I2 in DEC, I3 in FET.]
In a computer..
Unpipelined: start and finish a job before moving to the next.

[Figure: each instruction takes one 3 ns cycle covering FET, DEC and EXE.]

Clock Speed = 1/3 ns = 333 MHz
In a computer..
Pipelined: break the job into smaller stages.

[Figure: FET, DEC and EXE each take 1 ns; I1, I2 and I3 overlap in successive cycles, so a whole instruction still takes 3 ns.]

Clock Speed = 1/1 ns = 1 GHz
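A minimal sketch of the arithmetic behind these clock speeds, using the stage delays given on the slides: the clock period of a pipelined design is set by its longest stage, not by the whole instruction.

```python
# Unpipelined: the whole FET+DEC+EXE job must fit in one clock cycle (3 ns).
unpipelined_period_ns = 3.0
# Pipelined: FET, DEC and EXE become separate 1 ns stages.
stage_delays_ns = [1.0, 1.0, 1.0]

unpipelined_clock_mhz = 1000 / unpipelined_period_ns    # ~333 MHz
pipelined_clock_mhz = 1000 / max(stage_delays_ns)       # 1000 MHz = 1 GHz

print(round(unpipelined_clock_mhz), round(pipelined_clock_mhz))
```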
Pipelining

Latency: the amount of time that a single operation takes to execute.

Throughput: the rate at which operations get executed (generally expressed as operations/second or operations/cycle).
Clocks and Latches

[Figure: two pipeline stages (Stage 1, Stage 2), each followed by a latch (L); all latches are driven by a common clock (Clk).]

Four segment pipeline:

[Figure: Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4, where each Si is a segment and each Ri is a register (latch) loaded on the common clock.]
Example
Assume a 2 ns flip-flop delay
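The slide gives only the flip-flop delay; a common way to complete such an example (the four segment delays below are assumed for illustration, not taken from the slides) is to set the clock period to the slowest segment delay plus the flip-flop delay between segments.

```python
# Hypothetical combinational delays of the four segments, in ns.
segment_delays_ns = [30, 30, 45, 45]
flip_flop_delay_ns = 2          # register/latch delay from the slide

# The clock must accommodate the slowest segment plus one flip-flop.
clock_period_ns = max(segment_delays_ns) + flip_flop_delay_ns   # 47 ns
clock_freq_mhz = 1000 / clock_period_ns

print(clock_period_ns, round(clock_freq_mhz, 1))   # 47 ns, ~21.3 MHz
```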
Characteristics of Pipelining
Decomposes a sequential process into segments.

Divides the processor into segment processors, each one dedicated to a particular segment.

Each segment is executed in a dedicated segment processor that operates concurrently with all other segments.

Information flows through these multiple hardware segments.

If the stages of a pipeline are not balanced and one stage is slower than another, the entire throughput of the pipeline is affected.
Pipelining
Instruction execution is divided into k segments or stages.

An instruction exits pipe stage k-1 and proceeds into pipe stage k.

All pipe stages take the same amount of time, called one processor cycle.

The length of the processor cycle is determined by the slowest pipe stage.
Pipeline Performance

n: number of instructions (equivalent to the number of loads in the laundry example)
k: number of stages in the pipeline (washing, drying and folding)
τ: clock cycle time (the slowest stage time)
Tk: total time

Pipelined:      Tk = (k + (n - 1)) * τ
Non-pipelined:  T1 = n * k * τ

Speedup = T1 / Tk = n * k * τ / ((k + (n - 1)) * τ) = n * k / (k + n - 1)
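These formulas translate directly into a short helper. The example call below uses the laundry numbers (k = 3 stages, n = 4 loads) with τ taken as the slowest stage time, since the formula assumes every stage takes one full clock cycle τ.

```python
def pipeline_speedup(k, n, tau):
    t_pipelined = (k + (n - 1)) * tau       # Tk
    t_sequential = n * k * tau              # T1
    return t_pipelined, t_sequential, t_sequential / t_pipelined

# Laundry: k = 3, n = 4, tau = 40 min (slowest stage sets the cycle time).
print(pipeline_speedup(3, 4, 40))           # (240, 480, 2.0)
```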
Efficiently scheduled laundry: Pipelined Laundry

[Figure repeated: pipelined laundry timeline; loads A, B, C and D overlap, a new load entering the washer every 40 minutes; 3.5 hours for 4 loads.]
Speedup
Consider a k-segment pipeline operating on n data sets. (In the above example, k = 3 and n = 4.)

It takes k clock cycles to fill the pipeline and get the first result from the output of the pipeline.

After that, the remaining (n - 1) results come out at each clock cycle.

It therefore takes (k + n - 1) clock cycles to complete the task.
Speedup
If we execute the same task sequentially in a
single processing unit, it takes (k * n) clock
cycles.
The speedup gained by using the pipeline is:

S = k * n / (k + n - 1 )
Speedup
S = k * n / (k + n - 1)

For n >> k (such as 1 million data sets on a 3-stage pipeline),

S ≈ k

So for large data sets we gain a speedup roughly equal to the number of pipeline stages. This is because the multiple functional units work in parallel except during the filling and draining cycles.
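A quick numerical check of this limit for a 3-stage pipeline:

```python
def speedup(k, n):
    return k * n / (k + n - 1)

# As n grows, the speedup of a 3-stage pipeline approaches k = 3.
for n in (4, 100, 1_000_000):
    print(n, round(speedup(3, n), 4))       # 2.0, 2.9412, ~3.0
```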
Speedup
Example
- 4-stage pipeline
- sub-operation in each stage: tp = 20 ns
- 100 tasks to be executed
- 1 task in a non-pipelined system: 4 * 20 = 80 ns

Pipelined system:
(k + n - 1) * tp = (4 + 99) * 20 = 2060 ns

Non-pipelined system:
n * k * tp = 100 * 80 = 8000 ns

Speedup:
Sk = 8000 / 2060 = 3.88

For large n, the 4-stage pipeline approaches the performance of a system with 4 identical functional units.
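The same numbers can be reproduced in a couple of lines:

```python
k, n, tp = 4, 100, 20                       # stages, tasks, ns per stage
pipelined_ns = (k + n - 1) * tp             # (4 + 99) * 20 = 2060 ns
non_pipelined_ns = n * k * tp               # 100 * 80 = 8000 ns
print(non_pipelined_ns / pipelined_ns)      # 3.883..., i.e. ~3.88
```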
Example of Pipelining
Suppose we want to perform the combined multiply and add operations with a stream of numbers:

Ai * Bi + Ci for i = 1, 2, 3, ..., 7
Example of Pipelining
The sub-operations performed in each segment of the pipeline are as follows:

R1 ← Ai, R2 ← Bi
R3 ← R1 * R2, R4 ← Ci
R5 ← R3 + R4
Example of Pipelining

[Figure: three-segment pipeline with inputs Ai, Bi, Ci.
Segment 1: R1 ← Ai, R2 ← Bi (input Ai and Bi).
Segment 2: R3 ← R1 * R2 (multiplier), R4 ← Ci (multiply and input Ci).
Segment 3: R5 ← R3 + R4 (adder; add Ci to the product).]
Content of registers in pipeline example

Clock pulse   Segment 1      Segment 2        Segment 3
number        R1    R2       R3       R4      R5
1             A1    B1       ----     ----    ----
2             A2    B2       A1*B1    C1      ----
3             A3    B3       A2*B2    C2      A1*B1+C1
4             A4    B4       A3*B3    C3      A2*B2+C2
5             A5    B5       A4*B4    C4      A3*B3+C3
6             A6    B6       A5*B5    C5      A4*B4+C4
7             A7    B7       A6*B6    C6      A5*B5+C5
8             ----  ----     A7*B7    C7      A6*B6+C6
9             ----  ----     ----     ----    A7*B7+C7

Exercise: Looking at the above example, define how the operation

Ai*Bi + Ci*Di + Ei

is executed using a pipeline.
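For the Ai * Bi + Ci pipeline above (not the exercise), the register-content table can be reproduced with a small simulation sketch; the input values are hypothetical, and on each clock pulse every segment latches the values produced by the previous segment on the preceding pulse.

```python
# Hypothetical input streams for A, B and C (7 data sets, as in the table).
A = [2, 3, 4, 5, 6, 7, 8]
B = [1, 2, 3, 4, 5, 6, 7]
C = [9, 9, 9, 9, 9, 9, 9]

R1 = R2 = R3 = R4 = R5 = None
for pulse in range(1, len(A) + 3):          # 7 pulses to feed + 2 to drain
    i = pulse - 1
    # Segment 3 uses the values latched by segment 2 on the previous pulse.
    R5 = R3 + R4 if R3 is not None else None
    # Segment 2 uses the values latched by segment 1 on the previous pulse.
    R3, R4 = (R1 * R2, C[i - 1]) if R1 is not None else (None, None)
    # Segment 1 latches the next inputs, if any remain.
    R1, R2 = (A[i], B[i]) if i < len(A) else (None, None)
    print(pulse, R1, R2, R3, R4, R5)
```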
Arithmetic Pipeline
Arithmetic has been an important part of computing from the earliest days, and arithmetic operations consume much of the time spent within the arithmetic and logic unit.

Thus pipelining is used to boost the performance of ALUs and has opened up many means of high-performance computing.

Arithmetic pipelines are generally used for fixed-point and floating-point operations.
Arithmetic Pipeline: Floating Point Adder

A generic floating point number can be stated as

X = A * 2^a

where X is a binary value, A is the mantissa and a is the exponent.
Arithmetic Pipeline: Floating Point Adder

X = A * 2^a
Y = B * 2^b

A floating point addition can be executed via 4 simple sub-operations:

Compare the exponents.
Align the mantissas.
Add or subtract the mantissas.
Normalize the result.
Arithmetic Pipeline: Floating Point Adder

Given below is a simple demonstration of how two decimal floats are added.

Consider the two input floats X and Y:

X = 0.9832 * 10^3
Y = 0.8929 * 10^2

Note: decimal numbers are used for simplicity of explanation.
Arithmetic Pipeline: Floating Point Adder

X = 0.9832 * 10^3
Y = 0.8929 * 10^2

In the initial segment the two exponents are compared. The larger exponent is 3, and thus it is chosen as the exponent for the result.

The difference between the two exponents is 1 (3 - 2).
Arithmetic Pipeline: Floating Point Adder
X = 0.9832 * 10^3
Y = 0.8929 * 10^2

Since Y has the smaller exponent, its mantissa is shifted one place to the right, giving the aligned values:
X = 0.9832 * 10^3
Y = 0.08929 * 10^3

Afterwards the two mantissas are simply added, giving the value Z:
Z = 1.07249 * 10^3

Finally the result is normalized so that the mantissa is a fraction with a non-zero digit in the first decimal place:
Z = 0.107249 * 10^4
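A sketch of the four sub-operations on (mantissa, exponent) pairs, using base 10 as in the example above; it is a simplified illustration (positive operands, add-only, right-normalization), not a full floating point implementation.

```python
def fp_add(ma, ea, mb, eb):
    # 1. Compare the exponents; the larger one becomes the result exponent.
    if ea < eb:
        ma, ea, mb, eb = mb, eb, ma, ea
    # 2. Align the mantissas: shift the smaller-exponent mantissa right.
    mb = mb / (10 ** (ea - eb))
    # 3. Add the mantissas.
    m, e = ma + mb, ea
    # 4. Normalize so the mantissa is a fraction with a non-zero first digit.
    while m >= 1.0:
        m, e = m / 10, e + 1
    return m, e

m, e = fp_add(0.9832, 3, 0.8929, 2)
print(round(m, 6), e)     # 0.107249 4, i.e. Z = 0.107249 * 10^4
```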
Arithmetic Pipeline for Floating Point Adder

[Figure: four-segment pipeline with registers (R) between segments; the inputs are the exponents a, b and the mantissas A, B.
Segment 1: compare exponents by subtraction (produce the difference).
Segment 2: choose the exponent and align the mantissas.
Segment 3: add or subtract the mantissas.
Segment 4: normalize the result and adjust the exponent.]
Instruction Pipeline

An instruction pipeline works in a similar manner to the arithmetic pipeline, except that it operates on an instruction stream rather than a data stream.
Instruction Pipeline
Processing an instruction requires the following sequence of steps:

Fetch the instruction from memory.
Decode the instruction.
Calculate the effective address.
Fetch the operands from memory.
Execute the instruction.
Store the result in the proper place.
Instruction Pipeline
Consider the following specification of a pipeline designed to have 4 separate segments.

In such a system, up to 4 different instructions can be processed at the same time.
Pipeline Conflicts
In general, difficulties can be caused by the reasons specified below.

Resource conflicts
occur when two segments access memory at the same time.

Data dependency conflicts
occur when an instruction depends on the result of a previous instruction which is not available yet.

Branch difficulty conflicts
arise when branch and other instructions change the value of the PC.
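As an illustration of a data dependency conflict, the following sketch (with a hypothetical instruction format) flags an instruction that reads a register which the immediately preceding instruction has not yet written back:

```python
# Each instruction: (opcode, destination register, source registers).
instructions = [
    ("ADD", "R1", ["R2", "R3"]),   # R1 <- R2 + R3
    ("SUB", "R4", ["R1", "R5"]),   # R4 <- R1 - R5 : needs R1 before it is ready
]

for prev, curr in zip(instructions, instructions[1:]):
    _, prev_dest, _ = prev
    op, dest, sources = curr
    if prev_dest in sources:
        print(f"data dependency: {op} {dest} reads {prev_dest} "
              f"before the previous instruction has written it back")
```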
Four-segment CPU pipeline to overcome Pipeline Conflicts
[Flowchart:
Segment 1: fetch instruction from memory.
Segment 2: decode instruction and calculate effective address.
Branch? If yes, update the PC and empty the pipe before fetching continues; if no, proceed.
Segment 3: fetch operand from memory.
Segment 4: execute instruction.
Interrupt? If yes, go to interrupt handling and empty the pipe; if no, update the PC and continue fetching.]
Timing of Instruction Pipeline

Step:              1    2    3    4    5    6    7    8    9    10   11   12   13
Instruction 1:     FI   DA   FO   EX
            2:          FI   DA   FO   EX
   (Branch) 3:               FI   DA   FO   EX
            4:                    FI   --   --   FI   DA   FO   EX
            5:                         --   --   --   FI   DA   FO   EX
            6:                                        FI   DA   FO   EX
            7:                                             FI   DA   FO   EX
Four-segment CPU pipeline to overcome Pipeline Conflicts
The four segments illustrated in the above table have the following meanings:

FI is the segment that fetches an instruction.

DA is the segment that decodes the instruction and calculates the effective address.

FO is the segment that fetches the operand.

EX is the segment that executes the instruction.


Thank You