
Program and Network Properties

2.1 Conditions of Parallelism


2.2 Program Partitioning and Scheduling
 Program segments cannot be executed in parallel
unless they are independent.
 Independence comes in several forms:
◦ Data dependence: data modified by one segment must not be
modified by another parallel segment.
◦ Control dependence: if the control flow of segments cannot be
identified before run time, then the data dependence between
the segments is variable.
◦ Resource dependence: even if several segments are
independent in other ways, they cannot be executed in parallel
if there aren’t sufficient processing resources (e.g. functional
units).
Assume task S2 follows task S1 in sequential program order.

1 True data (flow) dependence: Task S2 is data dependent on task S1 if an
execution path exists from S1 to S2 and if at least one output variable of S1
feeds in as an input operand used by S2.
Represented by an arrow S1 → S2 in task dependency graphs.

2 Anti-dependence: Task S2 is anti-dependent on S1 if S2 follows S1 in
program order and if the output of S2 overlaps the input of S1.
Represented by a crossed arrow (S1 ↛ S2) in dependency graphs.

3 Output dependence: Two tasks S1, S2 are output dependent if they produce
the same output variable.
Represented by an arrow marked with a circle (S1 ○→ S2) in task dependency
graphs.

Anti-dependence and output dependence are together called name dependencies.

4 I/O dependence: Two I/O statements (read/write) reference the same
variable and/or the same file.

Example:
S1: Read(4), A(I)
S2: Rewind(4)
S3: Write(4), B(I)
S4: Rewind(4)
S1 and S3 are I/O dependent because both reference the same file (unit 4);
this is represented by an arrow marked I/O (S1 —I/O→ S3) in the dependency
graph.
• Assume task S2 follows task S1 in sequential program order.
• Task S1 produces one or more results used by task S2.
– Then task S2 is said to be data dependent on task S1 (S1 → S2).
• Changing the relative execution order of tasks S1, S2 in the parallel
program violates this data dependence and results in incorrect execution.

[Figure: task dependency graph representation. S1 (Write, i.e. producer) and
S2 (Read, i.e. consumer) share operands; S1 → S2, so task S2 is data
dependent on task S1.]


 Assume task S2 follows task S1 in sequential program order.
 Task S1 reads one or more values from one or more names (registers or
memory locations).
 Task S2 writes one or more values to the same names (the same registers or
memory locations read by S1).
◦ Then task S2 is said to be anti-dependent on task S1 (S1 ↛ S2).
 Changing the relative execution order of tasks S1, S2 in the parallel
program violates this name dependence and may result in incorrect execution.
[Figure: task dependency graph representation. S1 (Read) and S2 (Write) share
names (registers or memory locations); S1 ↛ S2, so task S2 is anti-dependent
on task S1.]

 Assume task S2 follows task S1 in sequential program order.
 Both tasks S1, S2 write to the same name or names (the same registers or
memory locations).
◦ Then task S2 is said to be output-dependent on task S1 (S1 ○→ S2).
 Changing the relative execution order of tasks S1, S2 in the parallel
program violates this name dependence and may result in incorrect execution.
[Figure: task dependency graph representation. S1 (Write) and S2 (Write)
share names (registers or memory locations); S1 ○→ S2, so task S2 is
output-dependent on task S1.]


Here assume each instruction is treated as a task:

S1: Load R1, A    / R1 ← Memory(A) /
S2: Add R2, R1    / R2 ← R1 + R2 /
S3: Move R1, R3   / R1 ← R3 /
S4: Store B, R1   / Memory(B) ← R1 /

True data dependence: (S1, S2), (S3, S4)
Output dependence: (S1, S3)
Anti-dependence: (S2, S3)

[Figure: dependency graph over S1–S4 showing these dependences.]
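A hypothetical C rendering of the same four tasks (the variable names and
initial values are illustrative, not from the slides); the dependences among
the statements are identical:

    #include <stdio.h>

    int main(void) {
        int a = 5, b, r1, r2 = 1, r3 = 7;   /* illustrative initial values */

        r1 = a;         /* S1: Load R1, A                                     */
        r2 = r2 + r1;   /* S2: flow-dependent on S1 (reads r1 written by S1)  */
        r1 = r3;        /* S3: anti-dependent on S2 (overwrites r1 that S2
                           read); output-dependent on S1 (both write r1)      */
        b = r1;         /* S4: Store B, R1 -- flow-dependent on S3 (reads r1) */

        printf("b = %d\n", b);
        return 0;
    }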
 Control-independent example:
    for (i=0;i<n;i++) {
        a[i] = c[i];
        if (a[i] < 0) a[i] = 1;
    }
 Control-dependent example:
    for (i=1;i<n;i++) {
        if (a[i-1] < 0) a[i] = 1;
    }
 Compiler techniques are needed to get around control dependence
limitations (see the sketch below).
 Order of execution cannot be determined before runtime due
to conditional statements.
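One such technique is if-conversion, sketched below for the
control-independent loop above (this transformation is an illustration, not
taken from the slides): the branch in the loop body is replaced by a
conditional select, so every iteration executes the same straight-line code
and the loop can be pipelined or vectorized. The control-dependent loop
cannot be handled this way, because whether iteration i executes its
assignment depends on what iteration i-1 wrote.

    /* If-converted form of:  a[i] = c[i]; if (a[i] < 0) a[i] = 1;  */
    void if_converted(int *a, const int *c, int n) {
        for (int i = 0; i < n; i++)
            a[i] = (c[i] < 0) ? 1 : c[i];   /* branch replaced by a select */
    }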
 Data and control dependencies are based on the independence of the work to
be done.
 Resource independence is concerned with conflicts in using shared
resources, such as registers, integer and floating-point ALUs, etc.
 ALU conflicts are called ALU dependence.
 Memory (storage) conflicts are called storage dependence.
 Bernstein’s conditions are a set of conditions that must be satisfied for
two processes to execute in parallel.
 Notation
◦ Ii is the set of all input variables for a process Pi.
◦ Oi is the set of all output variables for a process Pi.
 If P1 and P2 can execute in parallel (which is written as P1 || P2), then:
I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅
 In terms of data dependencies, Bernstein’s conditions imply that two
processes can execute in parallel if they are flow-independent,
anti-independent, and output-independent.
 The parallelism relation || is commutative (Pi || Pj implies Pj || Pi), but
not transitive (Pi || Pj and Pj || Pk does not imply Pi || Pk). Therefore, ||
is not an equivalence relation.
 Intersection of the input sets is allowed.
 For the following instructions P1, P2, P3, P4, P5:
◦ Each instruction requires one step to execute.
◦ Two adders are available per step.

P1: C = D × E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G ÷ E

Using Bernstein’s conditions after checking statement pairs:
P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5

This allows parallel execution in three steps (assuming two adders are
available per step): P1; Co-Begin P2, P3, P5; Co-End; P4. Sequential
execution takes five steps.

[Figure: dependence graph for P1–P5 with data dependences shown as solid
lines and resource dependences as dashed lines, alongside the five-step
sequential and three-step parallel execution schedules.]
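A minimal C sketch (not part of the slides) of this check: each statement’s
input and output sets are encoded as bit masks, and Bernstein’s three
conditions are tested for every pair. Running it prints exactly the five
parallel pairs listed above.

    #include <stdio.h>

    enum { A = 1 << 0, B = 1 << 1, C = 1 << 2, D = 1 << 3, E = 1 << 4,
           F = 1 << 5, G = 1 << 6, L = 1 << 7, M = 1 << 8 };

    typedef struct { const char *name; unsigned in, out; } Proc;

    /* Pi || Pj  iff  Ii ∩ Oj = ∅,  Ij ∩ Oi = ∅,  Oi ∩ Oj = ∅ */
    static int bernstein(Proc x, Proc y) {
        return !(x.in & y.out) && !(y.in & x.out) && !(x.out & y.out);
    }

    int main(void) {
        Proc p[] = {
            { "P1", D | E, C },   /* C = D × E */
            { "P2", G | C, M },   /* M = G + C */
            { "P3", B | C, A },   /* A = B + C */
            { "P4", L | M, C },   /* C = L + M */
            { "P5", G | E, F },   /* F = G ÷ E */
        };
        for (int i = 0; i < 5; i++)
            for (int j = i + 1; j < 5; j++)
                if (bernstein(p[i], p[j]))
                    printf("%s || %s\n", p[i].name, p[j].name);
        return 0;   /* prints the five parallel pairs listed above */
    }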
 Hardware parallelism is defined by machine
architecture and hardware multiplicity.
 It can be characterized by the number of
instructions that can be issued per machine cycle.
If a processor issues k instructions per machine
cycle, it is called a k-issue processor. Conventional
processors are one-issue machines.

 A machine with n k-issue processors should be able to handle a maximum of
nk threads simultaneously.
 Software parallelism is defined by the control
and data dependence of programs, and is
revealed in the program’s flow graph.
 It is a function of algorithm, programming style, and compiler
optimization.


[Figure: maximum software parallelism for the example (L = load, X/+/− =
arithmetic). Cycle 1: L1 L2 L3 L4; Cycle 2: X1 X2; Cycle 3: +, −; results A
and B. Eight instructions in three cycles, i.e. 8/3 instructions per cycle.]

[Figure: the same problem on a two-issue superscalar processor. Cycle 1: L1;
Cycle 2: L2; Cycle 3: X1, L3; Cycle 4: L4; Cycle 5: X2; Cycle 6: +; Cycle 7:
−. Eight instructions in seven cycles, i.e. 8/7 instructions per cycle.]

[Figure: the same problem with two single-issue processors. Cycle 1: L1, L3;
Cycle 2: L2, L4; Cycle 3: X1, X2; Cycle 4: S1, S2; Cycle 5: L5, L6; Cycle 6:
+, −; results A and B. The store (S1, S2) and load (L5, L6) operations are
inserted for synchronization between the two processors.]
 Control Parallelism – two or more operations
can be performed simultaneously. This can
be detected by a compiler, or a programmer
can explicitly indicate control parallelism by
using special language constructs or dividing
a program into multiple processes.
 Data parallelism – multiple data elements
have the same operations applied to them at
the same time.
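A minimal sketch of the two forms in C, assuming an OpenMP-capable compiler
(OpenMP is used here only for illustration and is not part of the slides):
the sections construct expresses control parallelism explicitly, while the
parallel for expresses data parallelism.

    #define N 1024

    /* Compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp. */
    void control_and_data_parallelism(float *sum, float *prod,
                                      const float *b, const float *c) {
        /* Control parallelism: two different, independent operations are
           performed simultaneously, indicated explicitly by the programmer. */
        #pragma omp parallel sections
        {
            #pragma omp section
            for (int i = 0; i < N; i++) sum[i] = b[i] + c[i];

            #pragma omp section
            for (int i = 0; i < N; i++) prod[i] = b[i] * c[i];
        }

        /* Data parallelism: the same operation is applied to every element. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) sum[i] += prod[i];
    }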
 Compilers are used to exploit hardware features to improve performance.
 Interaction between compiler and architecture
design is a necessity in modern computer
development.
 It is not necessarily the case that more
software parallelism will improve
performance in conventional scalar
processors.
 The hardware and compiler should be
designed at the same time.
 The size of the parts or pieces of a program that can be considered for
parallel execution can vary.
 The sizes are roughly classified using the term “granule size,” or simply
“granularity.”
 The simplest measure, for example, is the number of instructions in a
program part.
 Grain sizes are usually described as fine, medium or coarse, depending on
the level of parallelism involved.
 Latency is the time required for communication between different
subsystems in a computer.
 Memory latency, for example, is the time required by a processor to access
memory.
 Synchronization latency is the time required for two processes to
synchronize their execution.
[Figure: levels of program parallelism by grain size. Coarse grain: jobs or
programs; subprograms, job steps or related parts of a program (coarse to
medium grain). Medium grain: procedures, subroutines, tasks, or coroutines.
Fine grain: non-recursive loops or unfolded iterations; instructions or
statements. Communication demand and scheduling overhead increase, and the
degree of parallelism is higher, toward the finer grain levels.]
 This fine-grained, or smallest, granularity level typically involves fewer
than 20 instructions per grain. The number of candidates for parallel
execution varies from 2 to thousands, and the average degree of parallelism
is about five instructions or statements.
 Advantages:
◦ There are usually many candidates for parallel
execution
◦ Compilers can usually do a reasonable job of finding
this parallelism
 A typical loop has fewer than 500 instructions.
 If a loop operation is independent between iterations, it can be handled
by a pipeline or by a SIMD machine.
 Loops are the most optimized program construct to execute on a parallel or
vector machine.
 Some loops (e.g. recursive loops) are difficult to handle.
 Loop-level parallelism is still considered fine-grain computation.
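A short C illustration (not from the slides) of the distinction: the first
loop is independent between iterations and maps directly onto a pipeline or
SIMD machine, while the second carries a dependence from one iteration to the
next and is the kind of “recursive” loop that is difficult to handle.

    #include <stddef.h>

    /* Independent iterations: each iteration touches only its own elements,
       so the loop can be pipelined or executed on a SIMD machine. */
    void vector_add(float *a, const float *b, const float *c, size_t n) {
        for (size_t i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }

    /* Loop-carried (recurrence) dependence: iteration i reads a[i-1], written
       by iteration i-1, so the iterations cannot simply run in parallel. */
    void prefix_recurrence(float *a, size_t n) {
        for (size_t i = 1; i < n; i++)
            a[i] = a[i] + a[i - 1];
    }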
 Medium-sized grains; usually fewer than 2,000 instructions.
 Detection of parallelism is more difficult than with smaller grains;
interprocedural dependence analysis is difficult and history-sensitive.
 Communication requirements are less than at the instruction level.
 SPMD (single procedure multiple data) is a special case.
 Multitasking belongs to this level.
 Job step level; grains typically have thousands of instructions; medium-
or coarse-grain level.
 Job steps can overlap across different jobs.
 Multiprogramming is conducted at this level.
 No compilers are available to exploit medium- or coarse-grain parallelism
at present.


 Corresponds to the execution of essentially independent jobs or programs
on a parallel computer.
 Grain sizes are typically around 10,000 instructions.
 This is practical for a machine with a small number of powerful
processors, but impractical for a machine with a large number of simple
processors (since each processor would take too long to process a single
job).
 Balancing granularity and latency can yield better performance.
 Various latencies are attributable to machine architecture, technology,
and the communication patterns used.
 Latency imposes a limiting factor on machine scalability. For example,
memory latency increases as memory capacity increases, limiting the amount of
memory that can be used with a given tolerance for communication latency.
 Communication latency needs to be minimized by the system designer.
 It is affected by signal delays and communication patterns.
 For example, n communicating tasks may require n(n − 1)/2 communication
links (e.g. 120 links for n = 16), and this quadratic growth effectively
limits the number of processors in the system.
 Determined by algorithms used and
architectural support provided
 Patterns include

◦ permutations
◦ broadcast
◦ multicast
◦ conference
 Tradeoffs often exist between granularity of
parallelism and communication demand.
 Two questions:
◦ How can I partition a program into parallel “pieces”
to yield the shortest execution time?
◦ What is the optimal size of parallel grains?
 There is an obvious tradeoff between the time spent scheduling and
synchronizing parallel grains and the speedup obtained by parallel execution.
 One approach to the problem is called “grain packing.”
 A program graph is similar to a dependence
graph
◦ Nodes = { (n,s) }, where n = node name, s = size
(larger s = larger grain size).
◦ Edges = { (v,d) }, where v = variable being
“communicated,” and d = communication delay.
 Packing two (or more) nodes produces a node
with a larger grain size and possibly more
edges to other nodes.
 Packing is done to eliminate unnecessary
communication delays or reduce overall
scheduling overhead.
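A minimal C sketch (the names and fixed array sizes are assumptions, not from
the slides) of this program-graph representation and of a pack operation that
merges node b into node a: the packed grain’s size is the sum of the two
sizes, edges between the two nodes become internal so their communication
delay disappears, and edges to other nodes are redirected to the packed node.

    #define MAX_NODES 64

    typedef struct { char name[8]; int size; } Node;               /* (n, s) */
    typedef struct { int from, to; char var[8]; int delay; } Edge; /* (v, d) */

    typedef struct {
        Node nodes[MAX_NODES];
        Edge edges[4 * MAX_NODES];
        int  n_nodes, n_edges;
    } ProgramGraph;

    /* Pack node b into node a. */
    void pack(ProgramGraph *g, int a, int b) {
        g->nodes[a].size += g->nodes[b].size;  /* larger grain size          */
        int k = 0;
        for (int i = 0; i < g->n_edges; i++) {
            Edge e = g->edges[i];
            if (e.from == b) e.from = a;       /* redirect b's edges to a    */
            if (e.to   == b) e.to   = a;
            if (e.from == e.to) continue;      /* now-internal edge: its
                                                  communication delay is gone */
            g->edges[k++] = e;
        }
        g->n_edges = k;
    }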
 A schedule is a mapping of nodes to
processors and start times such that
communication delay requirements are
observed, and no two nodes are executing on
the same processor at the same time.
 Some general scheduling goals:
◦ Schedule all fine-grain activities in a node to the same processor to
minimize communication delays.
◦ Select grain sizes for packing to achieve better schedules for a
particular parallel machine.
 Four major steps are involved in grain determination and the process of
scheduling optimization:
◦ Step 1. Construct a fine-grain program graph.
◦ Step 2. Schedule the fine-grain computation.
◦ Step 3. Grain packing to produce the coarse grains.
◦ Step 4. Generate a parallel schedule based on the packed graph.

 Two 2 × 2 matrices A and B are multiplied to compute the sum of the four
elements in the resulting product matrix C = A × B. There are eight
multiplications and seven additions to be performed in this program, as
written below:

[ A11 A12 ]   [ B11 B12 ]   [ C11 C12 ]
[ A21 A22 ] × [ B21 B22 ] = [ C21 C22 ]

◦ C11 = A11 × B11 + A12 × B21
◦ C12 = A11 × B12 + A12 × B22
◦ C21 = A21 × B11 + A22 × B21
◦ C22 = A21 × B12 + A22 × B22
◦ Sum = C11 + C12 + C21 + C22
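A small C sketch of the computation above (the sample matrix values are
assumptions, not from the slides); the inner statement performs the eight
multiplications plus one addition per element, and the running total supplies
the additions needed for Sum:

    #include <stdio.h>

    int main(void) {
        double A[2][2] = { {1, 2}, {3, 4} };   /* sample values */
        double B[2][2] = { {5, 6}, {7, 8} };
        double C[2][2];
        double sum = 0.0;

        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++) {
                /* two multiplications and one addition per element of C */
                C[i][j] = A[i][0] * B[0][j] + A[i][1] * B[1][j];
                sum += C[i][j];   /* accumulate the four elements of C */
            }

        printf("Sum = %g\n", sum);   /* prints 134 for the sample values */
        return 0;
    }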

 Grain packing may potentially eliminate
interprocessor communication, but it may not
always produce a shorter schedule (see figure
2.8 (a)).
 By duplicating nodes (that is, executing some instructions on multiple
processors), we may eliminate some interprocessor communication, and thus
produce a shorter schedule.
[Figure: node duplication example. A task graph with tasks a, b, c and
communicated value x is scheduled on processors P1 and P2 first with no
duplication and then with task a duplicated on both processors, which
eliminates the interprocessor communication and shortens the schedule.]
 Example 2.5 illustrates a matrix
multiplication program requiring 8
multiplications and 7 additions.
 Using various approaches, the program

requires:
◦ 212 cycles (software parallelism only)
◦ 864 cycles (sequential program on one processor)
◦ 741 cycles (8 processors) - speedup = 1.16
◦ 446 cycles (4 processors) - speedup = 1.94
