
Conditions of Parallelism

Data and Resource Dependences:


• The ability to execute several program segments
in parallel requires each segment to be
independent of the other segments.
• We use a dependence graph to describe the
relations.
• The nodes of a dependence graph correspond to
the program statements (instructions)
• The directed edges with different labels show the
ordered relations among the statements. The
analysis of the dependence graph shows where
opportunities exist for parallelization.
Data Dependence
• Data dependence indicates the ordered
relationship between statements.
• Types of data dependences:

– Flow dependency
– Anti dependency
– Output dependency
– I/O dependency
Flow dependence:
• A statement S2 is flow-dependent on
statement S1 if an execution path exists
from S1 to S2 and if at least one output of
S1 feeds in as input to S2.

• Flow dependence is denoted as S1 → S2.
Antidependence
• Statement S2 is anti-dependent on
statement S1 if S2 follows S1 in program
order and if the output of S2 overlaps the
input to S1.
• Antidependence is denoted as S1 ⇸ S2
(a directed edge crossed with a bar).
Output dependence:
• Two statements are output-dependent if
they produce the same output variable.
• Output dependence is denoted as S1 ∘→ S2
(a directed edge marked with a small circle).
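As a quick illustration (a hypothetical four-statement fragment, not taken from the lecture), the snippet below marks where each kind of data dependence arises:

b, c = 2, 3
a = b + c      # S1: writes a, reads b and c
d = a * 2      # S2: reads a written by S1 -> S2 is flow-dependent on S1
b = 5          # S3: writes b, which S1 reads -> S3 is anti-dependent on S1
a = 7          # S4: writes a, which S1 also writes -> S4 is output-dependent on S1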
I/O dependence:
• Read and Write are both I/O statements.
• I/O dependence occurs not because the
same variable is involved but because the
same file is referenced by both I/O
statements.
• I/O dependence is denoted by an edge labeled I/O between the two statements: S1 —I/O— S2.
Example: Consider the following code fragment of four
instructions:
S1: Load R1, A     /R1 ← Memory(A)/
S2: Add R2, R1     /R2 ← (R1) + (R2)/
S3: Move R1, R3    /R1 ← (R3)/
S4: Store B, R1    /Memory(B) ← (R1)/

Fig: Dependence graph — S1 → S2 and S3 → S4 (flow dependences through R1); S2 ⇸ S3 (antidependence: S2 reads R1, which S3 overwrites); S1 ∘→ S3 (output dependence: both write R1).
Example:

• S1: Read(4), A(I) /Read array A from tape unit 4/


• S2: Rewind(4) /Rewind tape unit 4/
• S3: Write(4), B(I) /Write array B into tape unit 4/
• S4: Rewind(4) /Rewind the tape unit 4/

S1 —I/O— S3

• S1 and S3 are I/O-dependent on each other because they both access the same file from tape unit 4.
Resource Dependence

• Resource dependence is concerned with the conflicts in using shared resources, such as integer units, floating-point units, registers, and memory areas, among parallel events.
• When the conflicting resource is an ALU,
we call it ALU dependence.
• If the conflicts involve workplace storage,
we call it storage dependence.
Bernstein’s Conditions
• Let the input set Ii of a process Pi be the set of all input variables needed to execute the process.
• Let the output set Oi consist of all output variables generated after execution of the process Pi.
• Now, consider two processes P1 and P2 with
their input sets I1 and I2 and output sets O1
and O2, respectively.
• These processes can execute in parallel
(P1|| P2) if they are independent.
These conditions are stated as:
• I1 ∩ O2 = ∅
• I2 ∩ O1 = ∅
• O1 ∩ O2 = ∅
• These three equations are known as
Bernstein’s conditions.
• In general, a set of processes P1, P2, …, Pk can execute in parallel if Bernstein's conditions are satisfied on a pairwise basis; i.e. P1 || P2 || P3 || … || Pk if and only if Pi || Pj for all i ≠ j.
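A minimal sketch of this pairwise check (the function name and the set representation are illustrative assumptions, not part of the lecture):

def bernstein_parallel(i1, o1, i2, o2):
    # True if the two processes satisfy all three of Bernstein's conditions.
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

# Example: P1 reads {b, c} and writes {a}; P2 reads {d, e} and writes {f}.
print(bernstein_parallel({"b", "c"}, {"a"}, {"d", "e"}, {"f"}))   # True, so P1 || P2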
Bernstein’s Conditions (cont.)

• In terms of data dependences, Bernstein’s


conditions simply imply that two processes
can execute in parallel if they are:
Flow-independent
Anti-independent
Output-independent.
Bernstein’s Conditions (cont.)

• The parallelism relation || is commutative: Pi || Pj implies Pj || Pi.

• The parallelism relation || may or may not be transitive: Pi || Pj and Pj || Pk do not necessarily imply Pi || Pk.
Example: Detect the parallelism in the following five instructions labeled P1,
P2, P3, P4, and P5 in program order. Assume that each statement requires
one step to execute. No pipelining is considered here. Two adders are
available per step.

P1: C = D * E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
Fig: Dependence graph — P1 → P2 and P1 → P3 (flow dependences through C), P2 → P4 (flow dependence through M), P2 ⇸ P4 and P3 ⇸ P4 (antidependences: P4 overwrites C), and P1 ∘→ P4 (output dependence: both write C). P1 uses the multiplier, P2, P3, and P4 use the adders (+1, +2, +3), and P5 uses the divider; P5 has no dependence on the other statements.
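Under the same illustrative set representation as before, the five statements' input and output sets can be checked pairwise (a sketch; the Bernstein test is applied exactly as stated above):

from itertools import combinations

procs = {
    "P1": ({"D", "E"}, {"C"}),   # C = D * E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G / E
}

for (n1, (i1, o1)), (n2, (i2, o2)) in combinations(procs.items(), 2):
    if not (i1 & o2) and not (i2 & o1) and not (o1 & o2):
        print(n1, "||", n2)

This prints P1 || P5, P2 || P3, P2 || P5, P3 || P5, and P4 || P5; every other pair is blocked by a dependence on C or M, in agreement with the dependence graph.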
Grain Size & Latency
• Grain size or granularity is a measure of the amount of
computation involved in a software process. The simplest
measure is to count the number of instructions in a grain.

• Grain sizes are commonly described as fine, medium, or coarse.

• Latency is the time measure of communication overhead incurred between machine subsystems.

• Memory latency is the time required by a processor to access the memory.

• The time required for two processes to synchronize with each other is called the synchronization latency.
Processing Levels:
• Parallelism has been exploited at various
processing levels.
• Different levels of program execution
represent different computational grain
sizes and changing communication and
control requirements.
• The execution of a program may involve a
combination of these levels.
Instruction Level:
• At instruction or statement level, a typical
grain contains less than 20 instructions,
called fine grain.

• The fine-grain parallelism at this level may range from two to thousands.

• Instruction-level parallelism is rather tedious for an ordinary programmer to detect in source code.
Loop level:
• At loop level, a typical grain contains less
than 500 instructions.

• Loop-level parallelism is the most optimized program construct to execute on a parallel or vector computer.

• Loop level is still considered a fine grain of computation.
Procedure level :
• This level corresponds to medium-grain size at the task, procedure, and subroutine levels.

• A typical grain at this level contains less than 2000 instructions.

• Detection of parallelism at this level is much more difficult than at the finer-grain levels.
Subprogram Level:
• This corresponds to the level of job steps and
related subprograms.

• The grain size may typically contain thousands of instructions.

• We do not have good compilers for exploiting medium- or coarse-grain parallelism at present.
Job (Program) Level:

• This corresponds to the parallel execution of essentially independent jobs (programs) on a parallel computer.

• The grain size can be as high as tens of thousands of instructions in a single program.
Grain Packing and Scheduling
• Two fundamental questions to ask in parallel programming are:

– How can we partition a program into parallel branches, program modules, microtasks, or grains to yield the shortest possible execution time?
– What is the optimal size of concurrent grains in a computation?

• This grain-size problem demands determination of both the number and the size of grains in a parallel program.

• The solution is both problem-dependent and machine-dependent.

• The goal is to produce a short schedule for fast execution of subdivided program modules.
Example (Grain packing)
1. a = 1
2. b = 2
3. c = 3
4. d = 4
5. e = 5
6. f = 6
7. g = a * b
8. h = c * d
9. i = d * e
10. j = e * f
11. k = d * f
12. l = j * k
13. m = 4 * l
14. n = 3 * m
15. o = n * i
16. p = o * h
17. q = p * g
Grain packing (cont.)
• Denote each node by a pair (n, s), where n is the node name (id) and s is the grain size of the node.

• The grain size reflects the number of computations involved in a program segment.

• The edge label (v, d) between two end nodes specifies the output variable v from the source node (or the input variable to the destination node) and the communication delay d between them.

• This delay includes all the path delays and memory latency involved.
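As a small sketch of this notation in code (the values are read off the fine-grain graph below; the dictionary layout itself is only illustrative):

# (node id -> grain size): node 1 computes a = 1 (size 1), node 7 computes g = a * b (size 2).
grain_size = {1: 1, 2: 1, 7: 2}

# Edge label (v, d): the variable carried and the communication delay between the end nodes,
# e.g. node 1 sends variable 'a' to node 7 with delay 6, and node 2 sends 'b' with delay 6.
edges = {(1, 7): ("a", 6), (2, 7): ("b", 6)}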
Before Grain Packing

Fig: fine-grain program graph before packing — 17 nodes labeled (1,1) through (6,1) for the assignments and (7,2) through (17,2) for the multiplications; the edges carry labels such as (a, 6), (b, 6), (d, 6), (f, 6), (h, 4), (j, 4), (k, 4), (l, 3), (m, 3), (n, 4), (o, 3), (p, 3), and (q, 0).
After Grain Packing

Fig: coarse-grain program graph after packing — five grains A (size 8), B (4), C (4), D (6), and E (6); the remaining inter-grain edges carry labels such as (a, 6); (b, 6), (e, 6), (g, 4); (h, 4), (i, 4), (j, 4), (k, 4), and (n, 4).
Fig: Scheduling of the fine-grain program on two processors (Processor 1 and Processor 2), showing each node's execution time, the communication delays between processors, and idle time (I); the fine-grain schedule completes at time 42.

Fig: Scheduling of the coarse-grain (packed) program on the same two processors using grains A–E; with fewer interprocessor communications and less idle time, the coarse-grain schedule completes at time 38.

Notation used in both figures: node id with execution time for the grain, communication delay on the edges, and I for idle time.
Node Duplication:

• Some of the nodes can be duplicated in more than one processor:

– to eliminate the idle time, and
– to further reduce the communication delays among processors.
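A back-of-the-envelope sketch of the trade-off, using two values that appear in the example below (node A executes in 4 time units and the edge (a, 8) carries an 8-unit interprocessor delay); the variable names are illustrative:

exec_time_A = 4     # execution time of node A
comm_delay_a = 8    # delay to ship A's result 'a' to the other processor

# Earliest time the other processor can consume 'a':
without_duplication = exec_time_A + comm_delay_a   # wait for the message: 12
with_duplication = exec_time_A                     # run a local copy A' instead: 4
print(without_duplication, with_duplication)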
Example: Schedule without node duplication

Fig: program graph with nodes A (4), B (1), C (1), D (2), E (2) and edges labeled (a, 1), (a, 8), (b, 1), (c, 1), (c, 8), (d, 4), (e, 4), and its schedule on two processors (P1, P2) without node duplication; waiting for a and c to be communicated between processors leaves idle time (I), and the schedule completes at time 27.
Schedule with node duplication

Fig: Schedule with node duplication — node A is duplicated as A' and node C as C' so that each processor computes a and c locally; the idle time is eliminated and the schedule completes at time 14.
Example: Two 2×2 matrices A and B are multiplied to compute the sum of the four elements in the resulting product matrix C = A × B.

| A11 A12 |   | B11 B12 |   | C11 C12 |
| A21 A22 | × | B21 B22 | = | C21 C22 |
C11 = A11 * B11 + A12 * B21


C12 = A11 * B12 + A12 * B22
C21 = A21 * B11 + A22 * B21
C22 = A21 * B12 + A22 * B22
Sum = C11 + C12 + C21 + C22
Given:
• Eight multiplications are performed in eight nodes. Each node has a grain size of 101.

• Seven additions are performed in seven nodes. Each node has a grain size of 8.

• The interprocessor communication latency along all edges in the program graph is estimated as d = 212.
Fine-grain program graph

Fig: fine-grain program graph — eight multiply nodes A through H feed four add nodes J, K, L, and M; these feed add nodes N and O, which feed the final add node P that produces Sum. Every edge carries the communication delay d.
A Sequential schedule
Fig: sequential schedule on a single processor — the multiplications A–H finish at times 101, 202, …, 808 and the additions J–P at 816, 824, …, 864, so the total sequential execution time is 864 time units.
A Parallel schedule

Fig: parallel schedule on eight processors (P1–P8) — all eight multiplications finish at time 101; after a communication delay the additions J–M finish at 321, N and O at 541, and P at 761, so the total parallel execution time is 761 time units.
After grain packing
Fig: packed program graph — the fifteen fine-grain nodes are packed into five coarse grains: V = {A, B, J}, W = {C, D, K}, X = {E, F, L}, Y = {G, H, M}, and Z = {N, O, P}.
Parallel schedule for the packed programs

Fig: parallel schedule of the packed program on four processors (P1–P4) — grains V, W, X, and Y each execute two multiplications and one addition, finishing at time 210; after one communication delay, grain Z executes N (finishing at 430), O (438), and P (446), so the total execution time is 446 time units.
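A quick arithmetic check of the three schedule lengths, using only the given grain sizes (101 per multiplication node, 8 per addition node) and the communication delay d = 212:

MUL, ADD, D = 101, 8, 212

sequential = 8 * MUL + 7 * ADD              # one processor: 864
fine_grain = MUL + 3 * (D + ADD)            # eight processors, three addition levels: 761
packed = (2 * MUL + ADD) + D + 3 * ADD      # four processors; grain Z runs N, O, P: 446

print(sequential, fine_grain, packed)       # 864 761 446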
