
Conditions of Parallelism

Data and Resource Dependences:


• The ability to execute several program segments
in parallel requires each segment to be
independent of the other segments.
• We use a dependence graph to describe the
relations.
• The nodes of a dependence graph correspond to
the program statements (instructions)
• The directed edges with different labels show the
ordered relations among the statements. The
analysis of the dependence graph shows where
opportunities exist for parallelization.
Data Dependence
• Data dependence indicates the ordered
relationship between statements.
• Types of data dependences:

– Flow dependency
– Anti dependency
– Output dependency
– I/O dependency
Flow dependence:
• A statement S2 is flow-dependent on
statement S1 if an execution path exists
from S1 to S2 and if at least one output of
S1 feeds in as input to S2.

• Flow dependence is denoted as S1 → S2.
Antidependence
• Statement S2 is anti-dependent on
statement S1 if S2 follows S1 in program
order and if the output of S2 overlaps the
input to S1.
• Antidependence is denoted as S1 ⇸ S2
(a directed edge crossed with a bar).
Output dependence:
• Two statements are output-dependent if
they produce the same output variable.
• Output dependence is denoted as S1 ∘→ S2
(a directed edge marked with a small circle).
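As a quick illustration (a hypothetical four-statement fragment, not taken from the lecture), the snippet below marks where each kind of data dependence arises:

b, c = 2, 3
a = b + c      # S1: writes a, reads b and c
d = a * 2      # S2: reads a written by S1 -> S2 is flow-dependent on S1
b = 5          # S3: writes b, which S1 reads -> S3 is anti-dependent on S1
a = 7          # S4: writes a, which S1 also writes -> S4 is output-dependent on S1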
I/O dependence:
• Read and Write are both I/O statements.
• I/O dependence occurs not because the
same variable is involved but because the
same file is referenced by both I/O
statements.
• I/O dependence is denoted by an edge labeled I/O between the two statements: S1 —I/O— S2.
Example: Consider the following code fragment of four
instructions:
S1: Load R1, A     /R1 ← Memory(A)/
S2: Add R2, R1     /R2 ← (R1) + (R2)/
S3: Move R1, R3    /R1 ← (R3)/
S4: Store B, R1    /Memory(B) ← (R1)/

Fig: Dependence graph — S1 → S2 and S3 → S4 (flow dependences through R1); S2 ⇸ S3 (antidependence: S2 reads R1, which S3 overwrites); S1 ∘→ S3 (output dependence: both write R1).
Example:

• S1: Read(4), A(I) /Read array A from tape unit 4/


• S2: Rewind(4) /Rewind tape unit 4/
• S3: Write(4), B(I) /Write array B into tape unit 4/
• S4: Rewind(4) /Rewind the tape unit 4/

S1 —I/O— S3

• S1 and S3 are I/O-dependent on each other because they both access the same file from tape unit 4.
Resource Dependence

• Resource dependence is concerned with the conflicts in using shared resources, such as integer units, floating-point units, registers, and memory areas, among parallel events.
• When the conflicting resource is an ALU,
we call it ALU dependence.
• If the conflicts involve workplace storage,
we call it storage dependence.
Bernstein’s Conditions
• Let the input set Ii of a process Pi be the set of all input variables needed to execute the process.
• Let the output set Oi consist of all output variables generated after execution of the process Pi.
• Now, consider two processes P1 and P2 with
their input sets I1 and I2 and output sets O1
and O2, respectively.
• These processes can execute in parallel
(P1|| P2) if they are independent.
These conditions are stated as:
• I1 ∩ O2 = ∅
• I2 ∩ O1 = ∅
• O1 ∩ O2 = ∅
• These three equations are known as
Bernstein’s conditions.
• In general, a set of processes P1, P2, …, Pk can execute in parallel if Bernstein's conditions are satisfied on a pairwise basis; i.e. P1 || P2 || P3 || … || Pk if and only if Pi || Pj for all i ≠ j.
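A minimal sketch of this pairwise check (the function name and the set representation are illustrative assumptions, not part of the lecture):

def bernstein_parallel(i1, o1, i2, o2):
    # True if the two processes satisfy all three of Bernstein's conditions.
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

# Example: P1 reads {b, c} and writes {a}; P2 reads {d, e} and writes {f}.
print(bernstein_parallel({"b", "c"}, {"a"}, {"d", "e"}, {"f"}))   # True, so P1 || P2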
Bernstein’s Conditions (cont.)

• In terms of data dependences, Bernstein’s


conditions simply imply that two processes
can execute in parallel if they are:
Flow-independent
Anti-independent
Output-independent.
Bernstein’s Conditions (cont.)

• The parallelism relation || is commutative: Pi || Pj implies Pj || Pi.

• The parallelism relation || may or may not be transitive: Pi || Pj and Pj || Pk do not necessarily imply Pi || Pk.
Example: Detect the parallelism in the following five instructions labeled P1,
P2, P3, P4, and P5 in program order. Assume that each statement requires
one step to execute. No pipelining is considered here. Two adders are
available per step.

P1: C = D * E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
Fig: Dependence graph — P1 → P2 and P1 → P3 (flow dependences through C), P2 → P4 (flow dependence through M), P2 ⇸ P4 and P3 ⇸ P4 (antidependences: P4 overwrites C), and P1 ∘→ P4 (output dependence: both write C). P1 uses the multiplier, P2, P3, and P4 use the adders (+1, +2, +3), and P5 uses the divider; P5 has no dependence on the other statements.
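Under the same illustrative set representation as before, the five statements' input and output sets can be checked pairwise (a sketch; the Bernstein test is applied exactly as stated above):

from itertools import combinations

procs = {
    "P1": ({"D", "E"}, {"C"}),   # C = D * E
    "P2": ({"G", "C"}, {"M"}),   # M = G + C
    "P3": ({"B", "C"}, {"A"}),   # A = B + C
    "P4": ({"L", "M"}, {"C"}),   # C = L + M
    "P5": ({"G", "E"}, {"F"}),   # F = G / E
}

for (n1, (i1, o1)), (n2, (i2, o2)) in combinations(procs.items(), 2):
    if not (i1 & o2) and not (i2 & o1) and not (o1 & o2):
        print(n1, "||", n2)

This prints P1 || P5, P2 || P3, P2 || P5, P3 || P5, and P4 || P5; every other pair is blocked by a dependence on C or M, in agreement with the dependence graph.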
Grain Size & Latency
• Grain size or granularity is a measure of the amount of
computation involved in a software process. The simplest
measure is to count the number of instructions in a grain.

• Grain sizes are commonly described as fine, medium, or coarse.

• Latency is the time measure of communication overhead incurred between machine subsystems.

• Memory latency is the time required by a processor to access the memory.

• The time required for two processes to synchronize with each other is called the synchronization latency.
Processing Levels:
• Parallelism has been exploited at various
processing levels.
• Different levels of program execution
represent different computational grain
sizes and changing communication and
control requirements.
• The execution of a program may involve a
combination of these levels.
Instruction Level:
• At instruction or statement level, a typical
grain contains less than 20 instructions,
called fine grain.

• The fine-grain parallelism at this level may range from two to thousands.

• Instruction-level parallelism is rather tedious for an ordinary programmer to detect in source code.
Loop level:
• At loop level, a typical grain contains less
than 500 instructions.

• Loop-level parallelism is the most optimized program construct to execute on a parallel or vector computer.

• Loop level is still considered a fine grain of computation.
Procedure level :
• This level corresponds to medium-grain size at the task, procedure, and subroutine levels.

• A typical grain at this level contains less than 2000 instructions.

• Detection of parallelism at this level is much more difficult than at the finer-grain levels.
Subprogram Level:
• This corresponds to the level of job steps and
related subprograms.

• The grain size may typically contain thousands of instructions.

• We do not have good compilers for exploiting medium- or coarse-grain parallelism at present.
Job (Program) Level:

• This corresponds to the parallel execution of essentially independent jobs (programs) on a parallel computer.

• The grain size can be as high as tens of thousands of instructions in a single program.
Grain Packing and Scheduling
• Two fundamental questions to ask in parallel programming are:

– How can we partition a program into parallel branches, program modules, microtasks, or grains to yield the shortest possible execution time?
– What is the optimal size of concurrent grains in a computation?

• This grain-size problem demands determination of both the number and the size of grains in a parallel program.

• The solution is both problem-dependent and machine-dependent.

• The goal is to produce a short schedule for fast execution of subdivided program modules.
Example (Grain packing)
1. a = 1
2. b = 2
3. c = 3
4. d = 4
5. e = 5
6. f = 6
7. g = a * b
8. h = c * d
9. i = d * e
10. j = e * f
11. k = d * f
12. l = j * k
13. m = 4 * l
14. n = 3 * m
15. o = n * i
16. p = o * h
17. q = p * g
Grain packing (cont.)
• Denote each node by a pair (n, s), where n is the node name (id) and s is the grain size of the node.

• The grain size reflects the number of computations involved in a program segment.

• The edge label (v, d) between two end nodes specifies the output variable v from the source node (or the input variable to the destination node) and the communication delay d between them.

• This delay includes all the path delays and memory latency involved.
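As a small sketch of this notation in code (the values are read off the fine-grain graph below; the dictionary layout itself is only illustrative):

# (node id -> grain size): node 1 computes a = 1 (size 1), node 7 computes g = a * b (size 2).
grain_size = {1: 1, 2: 1, 7: 2}

# Edge label (v, d): the variable carried and the communication delay between the end nodes,
# e.g. node 1 sends variable 'a' to node 7 with delay 6, and node 2 sends 'b' with delay 6.
edges = {(1, 7): ("a", 6), (2, 7): ("b", 6)}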
Before Grain Packing

Fig: fine-grain program graph before packing — 17 nodes labeled (1,1) through (6,1) for the assignments and (7,2) through (17,2) for the multiplications; the edges carry labels such as (a, 6), (b, 6), (d, 6), (f, 6), (h, 4), (j, 4), (k, 4), (l, 3), (m, 3), (n, 4), (o, 3), (p, 3), and (q, 0).
After Grain Packing

Fig: coarse-grain program graph after packing — five grains A (size 8), B (4), C (4), D (6), and E (6); the remaining inter-grain edges carry labels such as (a, 6); (b, 6), (e, 6), (g, 4); (h, 4), (i, 4), (j, 4), (k, 4), and (n, 4).
Fig: Scheduling of the fine-grain program on two processors (Processor 1 and Processor 2), showing each node's execution time, the communication delays between processors, and idle time (I); the fine-grain schedule completes at time 42.

Fig: Scheduling of the coarse-grain (packed) program on the same two processors using grains A–E; with fewer interprocessor communications and less idle time, the coarse-grain schedule completes at time 38.

Notation used in both figures: node id with execution time for the grain, communication delay on the edges, and I for idle time.
Node Duplication:

• Some of the nodes can be duplicated in more than one processor:

– to eliminate the idle time, and
– to further reduce the communication delays among processors.
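A back-of-the-envelope sketch of the trade-off, using two values that appear in the example below (node A executes in 4 time units and the edge (a, 8) carries an 8-unit interprocessor delay); the variable names are illustrative:

exec_time_A = 4     # execution time of node A
comm_delay_a = 8    # delay to ship A's result 'a' to the other processor

# Earliest time the other processor can consume 'a':
without_duplication = exec_time_A + comm_delay_a   # wait for the message: 12
with_duplication = exec_time_A                     # run a local copy A' instead: 4
print(without_duplication, with_duplication)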
Example: Schedule without node duplication

Fig: program graph with nodes A (4), B (1), C (1), D (2), E (2) and edges labeled (a, 1), (a, 8), (b, 1), (c, 1), (c, 8), (d, 4), (e, 4), and its schedule on two processors (P1, P2) without node duplication; waiting for a and c to be communicated between processors leaves idle time (I), and the schedule completes at time 27.
Schedule with node duplication

Fig: Schedule with node duplication — node A is duplicated as A' and node C as C' so that each processor computes a and c locally; the idle time is eliminated and the schedule completes at time 14.
Example: Two 2×2 matrices A and B are multiplied to compute the sum of the four elements in the resulting product matrix C = A × B.

| A11 A12 |   | B11 B12 |   | C11 C12 |
| A21 A22 | × | B21 B22 | = | C21 C22 |
C11 = A11 * B11 + A12 * B21


C12 = A11 * B12 + A12 * B22
C21 = A21 * B11 + A22 * B21
C22 = A21 * B12 + A22 * B22
Sum = C11 + C12 + C21 + C22
Given:
• Eight multiplications are performed in eight nodes. Each node has a grain size of 101.

• Seven additions are performed in seven nodes. Each node has a grain size of 8.

• The interprocessor communication latency along all edges in the program graph is estimated as d = 212.
Fine-grain program graph

Fig: fine-grain program graph — eight multiply nodes A through H feed four add nodes J, K, L, and M; these feed add nodes N and O, which feed the final add node P that produces Sum. Every edge carries the communication delay d.
A Sequential schedule
Fig: sequential schedule on a single processor — the multiplications A–H finish at times 101, 202, …, 808 and the additions J–P at 816, 824, …, 864, so the total sequential execution time is 864 time units.
A Parallel schedule

Fig: parallel schedule on eight processors (P1–P8) — all eight multiplications finish at time 101; after a communication delay the additions J–M finish at 321, N and O at 541, and P at 761, so the total parallel execution time is 761 time units.
After grain packing
Fig: packed program graph — the fifteen fine-grain nodes are packed into five coarse grains: V = {A, B, J}, W = {C, D, K}, X = {E, F, L}, Y = {G, H, M}, and Z = {N, O, P}.
Parallel schedule for the packed programs

Fig: parallel schedule of the packed program on four processors (P1–P4) — grains V, W, X, and Y each execute two multiplications and one addition, finishing at time 210; after one communication delay, grain Z executes N (finishing at 430), O (438), and P (446), so the total execution time is 446 time units.
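A quick arithmetic check of the three schedule lengths, using only the given grain sizes (101 per multiplication node, 8 per addition node) and the communication delay d = 212:

MUL, ADD, D = 101, 8, 212

sequential = 8 * MUL + 7 * ADD              # one processor: 864
fine_grain = MUL + 3 * (D + ADD)            # eight processors, three addition levels: 761
packed = (2 * MUL + ADD) + D + 3 * ADD      # four processors; grain Z runs N, O, P: 446

print(sequential, fine_grain, packed)       # 864 761 446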
