Professional Documents
Culture Documents
Chapter1 PDF
Chapter1 PDF
in
PARALLEL PROCESSING
Prepared by Rza Bashirov
High-performance computers are increasingly in demand in the
areas of structural analysis, weather forecasting, petroleum
exploration, medical diagnosis, aerodynamics simulation, artificial
intelligence, expert systems, genetic engineering, signal and image
processing, among many other scientific and engineering applications.
Without superpower computers, many of these challenges to advance
human civilization cannot be made within a reasonable time period.
Achieving high performance depends not only on using faster and
more reliable hardware devices but also on major improvements in
computer architecture and processing techniques.
1
FLYNNS TAXONOMY
that some task Ti cannot start until some earlier task T j (i < j ) finishes. A linear
pipeline can process a succession of subtasks with a linear precedence graph.
A linear pipeline consists of cascade of processing stages. High-speed
interface latches separate the stages. The latches are fast registers for holding
the intermediate results between the stages. Information flows between
adjacent stages are under the control of a common clock applied to all the
latches simultaneously.
Clock period
denoted by i . Let l be the time delay of each interface latch. The clock period
of a linear pipeline is defined by
= max{ i } k + l = m + l .
i =1
f = 1 .
The reciprocal of the clock period is called the frequency
Ideally, a linear pipeline with k stages can process n tasks in
Tk = k + (n 1) periods, where k cycles are used to fill up the pipeline or to
complete execution of the first task and n 1 cycles are needed to complete
the remaining n 1 tasks. The same number of tasks (operand pairs) can be
executed in a nonpipeline processor with an equivalent function in
T1 = n k time delay.
Speedup
We define the speedup of a k -stage linear pipeline processor
over an equivalent nonpipeline processor as
Sk =
T1
nk
=
Tk k + ( n 1) .
sum of all busy and idle time-space spans. Let n, k , be the number of tasks
(instructions), the number of pipeline stages, and the clock period of a linear
pipeline, respectively. The pipeline efficiency is defined by
n k
n
=
k [ k + ( n 1) ] k + ( n 1) .
n
=
k + ( n 1)
Processor pipelining
This refers to the pipeline processing of the same
data stream by a cascade of processors, each of which processes a specific
task. The data stream passes the first processor with results stored in a
memory block which is also accessible by the second processor. The second
processor than passes the refined results to the third, and so on.
Unifunctional vs. multifunctional pipelines A pipeline unit with a fixed
and dedicated function, such as the floating-point adder, is called
unifuctional. A multifunctional pipe may perform different functions, either
at different subsets of stages in the pipeline.
Static vs. Dynamic pipelines
A static pipeline may assume only one
fuctional configuration at a time. Static pipelines can be either unifunctional
or multifunctional. Pipelining is made possible in static pipes only if
instructions of the same type are to be executed continuously. The function
performed by a static pipeline should not change frequently. Otherwise, its
performance may be very low. A dynamic pipeline processor permits several
functional configurations to exist simultaneously. In this sense, a dynamic
pipeline must be multifunctional. On the other hand, a unifunctional pipe
must be static. The dynamic configuration needs much more elaborate control
and sequencing mechanisms than those for static pipelines. Most existing
computers are equipped with static pipes, either unifunctional or
multifunctional.
Scalar vs. vector pipelines Depending on instruction or data types, pipeline
processors can be also classified as scalar pipelines and vector pipelines. A
scalar pipeline processes a sequence of scalar operands under the control of a
DO loop. Instructions in a small DO loop are often prefetched into the
instruction buffer. The required scalar operands for repeated scalar
instructions are moved into a data cache in order to continuously supply the
pipeline with operands. Vector pipelines are specially designed to handle
vector instructions over vector operands. Computers having vector
instructions are often called vector processors. The design of a vector pipeline
is expanded from that of a scalar pipeline.
I
n
p
u
t
S1
S3
S2
Output
O
u
t
p
u
t
2.2
S1
S2
S3
S1
S2
S3
t0
A
T1
t2
t3
A
t4
t5
t6
A
A
A
t0
B
t1
t2
t3
B
B
t7
t4
B
t5
t6
B
B
R
e
c
e
i
v
e
2.3
M
u
l
t
I
p
l
A
c
c
u
m
u
l
a
t
E
x
p.
A
l
I
g
n
S
u
b
t
A
d
d
N
o
r
m
a
l
I
z
O
u
t
p
u
t
R
e
c
e
i
v
e
A
d
d
Fixed-point add
O
u
t
p
u
t
R
e
c
e
i
v
e
E
x
p.
A
l
I
g
n
S
u
b
t
A
d
d
N
o
r
m
a
l
I
z
O
u
t
p
u
t
Floating-point add
R
e
c
e
i
v
e
M
u
l
t
I
p
l
O
u
t
p
u
t
A
d
d
Fixed-point multiply
R
e
c
e
i
v
e
M
u
l
t
i
p
l
A
c
c
u
m
u
l
a
t
E
x
p.
A
l
i
g
n
S
u
b
t
A
d
d
N
o
r
m
a
l
i
z
O
u
t
p
u
t
Floating-point multiply
Array pipelines are two-dimensional pipelines with multiple data-flow
streams for high-level arithmetic computations, such as matrix multiplication,
inversion, and L-U decomposition. The pipeline is usually constructed with a
cellular array of arithmetic units. The cellular array is usually regularly
structured and suitable for microprocessor implementation.
The basic building blocks in the array are the M cells. Each M cell
performs an additive inner-product operation as illustrated in figure. Each
such cell performs an additive inner0product operation. Each cell has the
three input operands a, b and c and the three outputs a ' = a , b' = b, d = a b + c .
Fast latches are used in all input output terminals and all interconnecting
paths in the array pipeline. All the latches are synchronously controlled by the
same clock. The array shown in figure performs the multiplication of two
3 3 dense matrices A B = C
t3 t2 t1
t8
t7
t6
c13
c23
c12
c33
b33 0 0
c11
c22
c21
c32
b23 b32 0
c31
t8
t7
t6
0 b12 b21
0 0 b11
a11
a12
a13
t1
a21
a22
a23
t2
a31
a33
t3
a32
a11
A B = a 21
a 31
a12
a 22
a 32
b13
b23 =
b33
c11
c21
c31
c13
c23 = C
c33
.
c12
c22
c32
2
In general, to multiply two ( n n) matrices requires 3n 4n + 2 cells. It takes
3n 1 clock periods to complete the multiply process.
2.4
Problems
1. Consider a four-segment floating-point adder with a 10ns clock period.
Find the minimum number of periods required for 99 floating-points
additions, which use pipeline adder.
X
Y
S4
S3
S2
S1
f =
Sk =
= 0.1
1
,
ns
T1
nk
99 4
=
=
3.88 ,
Tk k + (n 1) 4 + 98
n k
n
=
0.97 ,
k [k + (n 1) ] k + (n 1)
n
1
w=
= 9.7 .
k + (n 1)
ns
10
t0
A
S1
S2
S3
t1
t2
t3
A
t4
t5
A
A
t0
B
S1
S2
S3
S4
t1
t2
t3
t4
t5
B
B
B
B
Solution.
Input
S1
S2
S3
S4
Input
Input
4. Describe the following terminologies associated with pipeline
computers
(a)
(b)
(c)
(d)
(e)
(f)
Static pipeline
Dynamic pipeline
Unifunctional pipeline
Multifunctional pipeline
Instruction pipeline
Vector pipeline
Solution.
(a) A static pipeline may assume only one fuctional configuration at a
time. Static pipelines can be either unifunctional or multifunctional.
Pipelining is made possible in static pipes only if instructions of the
same type are to be executed continuously. The function performed by
a static pipeline should not change frequently. Otherwise, its
performance may be very low.
(b) A dynamic pipeline processor permits several functional
configurations to exist simultaneously. In this sense, a dynamic
pipeline must be multifunctional. On the other hand, a unifunctional
pipe must be static. The dynamic configuration needs much more
elaborate control and sequencing mechanisms than those for static
pipelines.
11
(c) A pipeline unit with a fixed and dedicated function, such as the
floating-point adder, is called unifuctional.
(d) A multifunctional pipe may perform different functions, either at
different subsets of stages in the pipeline.
(e) The execution of a stream of instructions can be pipelined by
overlapping the execution of the current instruction with the fetch,
decode, and operand fetch of subsequent instructions. This technique is
also known as instruction lookahead. Almost all high-performance
computers are now equipped with instruction-execution pipelines.
(f) Vector pipelines are specially designed to handle vector instructions
over vector operands. Computers having vector instructions are often
called vector processors. The design of a vector pipeline is expanded
from that of a scalar pipeline.
12