
Chapter 2

Program and Network Properties


In this chapter
1. Conditions of parallelism

2. Program partitioning and scheduling

3. Program flow mechanisms


4. System interconnect architectures
Conditions of Parallelism
Three types:
• Data dependence
• Resource dependence
• Control dependence
Conditions of Parallelism
• The aim of a parallel programmer is to exploit the parallelism in a given
program.
• To execute several program segments in parallel, each segment must be
independent of the other segments.
• So, before exploiting parallelism, all the conditions of parallelism
between the segments must be analyzed.
• Three types of dependencies exist between the program segments
Dependence graph
• A dependence graph is used to describe the dependence relationships
between statements.
Nodes correspond to the program statements (instructions), and
directed edges with different labels show the ordering relations among the
statements.
1. Data Dependence
• Data dependence: the ordering relationship between statements is
indicated by data dependence.
• A data dependency is a situation in which a program
statement (instruction) refers to the data of a preceding statement.
• There are five types of data dependence:
1. Flow dependence
2. Anti dependence
3. Output dependence
4. I/O dependence
5. Unknown dependence
1. Flow dependence (S1 → S2)

A statement S2 is flow dependent on S1

- if an execution path exists from S1 to S2, and at least one output (a variable
assigned) of S1 is an input (an operand to be used) of S2.
- It is denoted S1 → S2.

Example:
S1: Load R1, A
S2: Add R2, R1

Note: the output of S1 (R1) is used as an input to S2.

2. Anti dependence (S1 ↛ S2)
A statement S2 is anti dependent on the statement S1

- if S2 follows S1 in program order, and

- if the output of S2 overlaps the input of S1.

- It is denoted S1 ↛ S2 (an arrow crossed with a bar).

Example:
S1: Add R2, R1
S2: Move R1, R3

• Note: the output of S2 (R1) is an input to S1.


3. Output dependence (S1 o→ S2)

Two statements are output dependent

- if S1 and S2 write to the same output variable.
- It is denoted S1 o→ S2 (an arrow marked with a small circle).

Example:
S1: Load R1, A
S2: Move R1, R3

Note: both statements write to R1.
4. I/O dependence
Read and write are I/O statements.
- I/O dependence occurs not because the same variable is
involved, but because the same file is referenced by both I/O
statements.
Denoted as
S1 I/O S3
Example:
S1: Read(4), A(I)
S3: Write(4), A(I)
5. Unknown dependence
The dependence relation between two statements cannot be
determined.
Example: indirect addressing.
Unknown dependence may exist in the following situations:
• The subscript of a variable is itself subscripted,
e.g., a(i(j)).
• The subscript does not contain the loop index variable.
• The subscript is nonlinear in the loop index variable.
• The variable appears more than once with subscripts having
different coefficients of the loop variable,
e.g., a(i) and a(j).
Data Dependence in Program
Consider the following code segment of 4
instructions
S1 : Load R1,A
S2 : Add R2, R1
S3 : Move R1, R3
S4 : Store B, R1

Solution:
S1 to S2: S2 is flow dependent on S1 (S1 writes R1, S2 reads R1)
S1 to S3: S3 is output dependent on S1 (both write R1)
S1 to S4: S4 is flow dependent on S1 (through R1)
S2 to S3: S3 is anti dependent on S2 (S2 reads R1, S3 writes R1)
S2 to S2: S2 is flow dependent on S2 itself (Add R2, R1 both reads and writes R2)
S2 to S4: no dependency
S3 to S4: S4 is flow dependent on S3 (S3 writes R1, S4 stores R1)
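To make this pairwise analysis concrete, here is a small Python sketch (not from the original slides; the read/write-set encoding of the four statements is my own, assuming Add R2, R1 computes R2 := R2 + R1 and Move R1, R3 computes R1 := R3):

```python
# Each statement is encoded as (name, locations read, locations written).
stmts = [
    ("S1", {"A"},        {"R1"}),  # Load  R1, A   : R1 := Mem[A]
    ("S2", {"R2", "R1"}, {"R2"}),  # Add   R2, R1  : R2 := R2 + R1
    ("S3", {"R3"},       {"R1"}),  # Move  R1, R3  : R1 := R3
    ("S4", {"R1"},       {"B"}),   # Store B, R1   : Mem[B] := R1
]

def dependences(earlier, later):
    """Classify the dependence of the later statement on the earlier one."""
    _, r_e, w_e = earlier
    _, r_l, w_l = later
    deps = []
    if w_e & r_l: deps.append("flow")    # earlier writes what later reads
    if r_e & w_l: deps.append("anti")    # later overwrites what earlier reads
    if w_e & w_l: deps.append("output")  # both write the same variable
    return deps

for i in range(len(stmts)):
    for j in range(i + 1, len(stmts)):
        d = dependences(stmts[i], stmts[j])
        if d:
            print(f"{stmts[i][0]} -> {stmts[j][0]}: {', '.join(d)}")
```

Running it reproduces the table above (S1→S2 flow, S1→S3 output, S1→S4 flow, S2→S3 anti, S3→S4 flow); the S2-to-itself case is not covered because the loop only compares distinct statement pairs.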
Data Dependence in Program
Consider the following code segment of 4 instructions; find the types of
dependence and draw the dependence graph.

S1 : Read (4), A(I)


S2 : Rewind (4)
S3 : Write (4), B(I)
S4 : Rewind (4)

Solution:
I/O dependence: S1 I/O S3

S1 to S3:
S1 and S3 (the Read and Write statements) are I/O dependent on each other because they
both access the same file on tape unit 4.
There is no other dependency.
2.Control Dependence
• Control dependence arises
when the order of execution of statements
cannot be determined before run time.
• For example, an IF condition will not be resolved
until run time.
• Control dependence often prohibits
parallelism from being exploited.
• Compilers are used to eliminate this control
dependence and exploit the parallelism.
In the first code snippet, inside the loop we have the statement a[i] = c[i] followed by the
condition if (a[i] < 0). Let us assume c[i] = 5; then a[i] = 5, the condition is false, and the
conditional statement does not execute. Since the outcome of the condition in each iteration
depends only on values computed in that same iteration, this is an example of control
independence.
In the second code snippet, the conditional statement tests a[i-1] after a[i] = c[i]. When
i = 1, the value a[i-1] becomes a[0], a value produced outside the current iteration, so
whether the conditional executes depends on the previous iteration. The iterations are
therefore control dependent.
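The two loops being described can be sketched in Python as follows (an assumed rendering, since the original snippets appear only as figures; the reinitialization values are placeholders):

```python
n = 8
c = [5] * n            # assume c[i] = 5, as in the discussion above
a = [0] * n

# Snippet 1: control-independent iterations.
# The IF in iteration i depends only on a[i], which is computed in that same
# iteration, so every iteration can be scheduled independently.
for i in range(n):
    a[i] = c[i]
    if a[i] < 0:       # false here, since a[i] = 5
        a[i] = 1

# Snippet 2: control-dependent iterations.
# The IF in iteration i tests a[i-1], a value produced by the *previous*
# iteration (for i = 1 it is a[0]), so the iterations cannot be reordered
# or run in parallel without further analysis.
for i in range(1, n):
    a[i] = c[i]
    if a[i - 1] < 0:
        a[i] = 0
```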
3.Resource Dependence
• Resource dependence is concerned with conflicts in using shared resources,
such as integer and floating-point units (ALUs), registers and memory
areas, among parallel events.
ALU conflicts are called ALU dependence.
Memory (storage) conflicts are called storage dependence.

Example:
I1: A = B + C
I2: G = D + H
I1 and I2 both need the addition unit, so they are ALU dependent.
Bernstein’s conditions
• The transformation of a sequentially coded program into a parallel
executable form can be
1. done manually by the programmer using explicit parallelism, or
2. done by a compiler detecting implicit parallelism automatically.
• Bernstein revealed a set of conditions which must hold
if two processes are to execute in parallel.

Notation
• Process Pi is a software entity.
• Ii is the set of all input variables for a process Pi .
• Oi is the set of all output variables for a process Pi .
Bernstein’s conditions.......
• Consider two processes P1 and P2. Their inputs are I1 and I2 and their outputs
are O1 and O2, respectively.

• These two processes P1 and P2 (i.e. P1 || P2) can execute in parallel if

they are independent and do not produce confusing results.
Bernstein’s conditions.......
• According to Bernstein:
• Two processes P1, P2 with input sets I1, I2 and output sets O1, O2 can
execute in parallel (denoted by P1 || P2) if

I1 ∩ O2 = ∅
I2 ∩ O1 = ∅
O1 ∩ O2 = ∅

• These three conditions are known as Bernstein's conditions.
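As a quick sketch (not part of the original slides; the processes used here are hypothetical), the three conditions are just set-intersection tests:

```python
def bernstein_parallel(I1, O1, I2, O2):
    """True if P1 || P2, i.e. all three Bernstein conditions hold."""
    return (not (I1 & O2) and    # I1 ∩ O2 = ∅
            not (I2 & O1) and    # I2 ∩ O1 = ∅
            not (O1 & O2))       # O1 ∩ O2 = ∅

# P1: C = A + B   reads {A, B}, writes {C}
# P2: D = A * E   reads {A, E}, writes {D}
print(bernstein_parallel({"A", "B"}, {"C"}, {"A", "E"}, {"D"}))  # True

# P3: A = C + 1   reads P1's output C and writes P1's input A
print(bernstein_parallel({"A", "B"}, {"C"}, {"C"}, {"A"}))       # False
```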


Bernstein’s conditions.......
• In terms of data dependencies,
• Bernstein's conditions imply that two processes can execute in parallel if
they are:
• Flow-independent
• Anti-independent
• Output-independent
Bernstein’s conditions.......
[Figures: Example 1, Example 2, Problem 1 with Solutions (a) and (b), Problem 2, and Problem 3 with Solutions (A) and (B) are worked out graphically in the original slides.]
HARDWARE AND SOFTWARE PARALLELISM
• What is Parallelism?
• Performing more than one task at a time.
• It speeds up the process.
• More work in less time.
• It reduces the cost of work.
• Modern computer architecture implementation requires special
hardware and software support for parallelism.
Types of parallelism are-
1. Hardware parallelism
2. Software parallelism
1. Hardware Parallelism:
• This refers to the type of parallelism defined by the machine architecture and
hardware multiplicity.
• Hardware parallelism is a function of cost and performance tradeoffs. It displays the
resource utilization patterns of simultaneously executable operations. It can also
indicate the peak performance of the processors.
• One way to characterize the parallelism in a processor is by the number of
instruction issues per machine cycle.
• If a processor issues k instructions per machine cycle, it is called a k-issue
processor.

Examples:
The Intel i960CA is a three-issue processor (arithmetic, memory access, branch).
The IBM RS/6000 is a four-issue processor (arithmetic, floating-point, memory
access, branch).
2. Software Parallelism:

It is defined by the control and data dependence of programs.


• The degree of parallelism is revealed in the program profile or in the program flow
graph.
• Software parallelism is a function of algorithm, programming style, and compiler
optimization.
• The program flow graph displays the patterns of simultaneously executable
operations.
• Parallelism in a program varies during the execution period. This limits the sustained
performance of the processor.
Types of Software Parallelism
1. Control Parallelism
• two or more operations can be performed simultaneously.
2. Data parallelism
• multiple data elements have the same operations applied to them at the same
time.
Mismatch between software and hardware parallelism
Example: A= L1*L2 + L3*L4
B= L1*L2 - L3*L4
Software Parallelism:
There are 8 instructions:
FOUR Load instructions (L1, L2, L3 and L4),
TWO Multiply instructions (X1 and X2),
ONE Add instruction (+),
ONE Subtract instruction (−).
• The parallelism varies from 4 to 2 over three cycles.
• The average software parallelism is 8/3 = 2.67 instructions per cycle.
• Hardware Parallelism:
• Parallel execution using a TWO-issue processor:
• The processor can execute one memory
access (Load or Store) and one
arithmetic operation (multiply, add,
subtract) simultaneously in each cycle.
• The program must execute in 7 cycles.
• The average hardware parallelism is 8/7 = 1.14
instructions per cycle.
• This example makes clear the
mismatch between the s/w and h/w
parallelism.
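A small sketch of the arithmetic (the 7-cycle schedule shown is one plausible ordering consistent with the two-issue constraint stated above, not necessarily the exact schedule of the original figure):

```python
# Two-issue constraint: at most one memory access and one arithmetic
# operation per cycle.  X1 = L1*L2, X2 = L3*L4.
schedule = [
    ["L1"],         # cycle 1
    ["L2"],         # cycle 2
    ["L3", "X1"],   # cycle 3: L1 and L2 are available, so X1 can issue
    ["L4"],         # cycle 4
    ["X2"],         # cycle 5
    ["+"],          # cycle 6: A = X1 + X2
    ["-"],          # cycle 7: B = X1 - X2
]

n_instr = sum(len(c) for c in schedule)                 # 8 instructions
print("software parallelism:", round(8 / 3, 2))         # 2.67 (cycles of 4, 2, 2)
print("hardware parallelism:", round(n_instr / len(schedule), 2))  # 1.14
```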
• Example:
• Match between hardware and software
parallelism using a dual-processor system.
• On a hardware platform with a dual-processor
system, single-issue processors are used to
execute the same program.
• Six processor cycles are needed to execute
the 12 instructions on the two processors.
• S1 and S2 are two inserted Store operations.
• L5 and L6 are two inserted Load
operations.
• The added instructions are needed for
inter-processor communication through
the shared memory.
The Role of Compilers
• Compilers are used to exploit hardware features to improve performance.
Interaction between compiler and architecture design is a necessity in
modern computer development. It is not necessarily the case that
more software parallelism will improve performance on conventional
scalar processors. The hardware and the compiler should be designed at
the same time.
Program Partitioning & Scheduling
• Parallelism can be classified on the basis of computational granularity or grain size.
• Grain size is the amount of computation involved in a software process; the simplest measure is to count the
number of instructions.
• Grain sizes are commonly described as fine, medium and coarse, depending on the processing level.

• Grain size and latency

• Grain size or granularity is the amount of computation and manipulation involved in a software process.
• The simplest way to determine grain size is to count the number of instructions in a given program segment.
• Grain size decides the basic program segment chosen for parallel processing.

Latency
• Latency is a time measure of the communication overhead incurred between machine subsystems.
• The memory latency is the time required by a processor to access memory.
• The time required for two processes to synchronize with each other is called the synchronization latency.
Computational granularity and communication latency are closely related.
Levels of Parallelism

There are 5 Levels of Parallelism


1. Instruction Level Parallelism
2. Loop-level Parallelism
3. Procedure-level Parallelism
4. Subprogram-level Parallelism
5. Job or Program-Level Parallelism
1.Instruction Level Parallelism
• At the lowest level, a typical grain size contains less than 20 instructions, which is
called Fine Grain.
• This type of parallelism is assisted by an optimizing compiler which can
automatically detect such parallelism.
Advantages:
• There are usually many candidates for parallel execution
• Compilers can usually do a reasonable job of finding this parallelism
2.Loop-level Parallelism
• This level corresponds to iterative loop operations.
• A typical loop has less than 500 instructions.
• If a loop operation is independent between iterations, it can be handled by a
pipeline, or by a SIMD machine.
• Most optimized program construct to execute on a parallel or vector machine.
• Some loops (e.g. recursive) are difficult to handle.
• Loop-level parallelism is still considered fine grain computation.
3.Procedure-level Parallelism
• This level corresponds to medium-sized grains, usually with less than 2000 instructions,
where a specific task, procedure or subroutine is considered.
• Detection of parallelism at this level is more difficult than with smaller grains.
4.Subprogram-level Parallelism
• this level corresponds to levels of Job steps - grain typically has thousands of instructions,
it comes under medium- or coarse-grain level. Job steps can overlap across different jobs.
• Multiprogramming conducted at this level, No compilers available to exploit medium- or
coarse-grain parallelism at present.
5.Job or Program-Level Parallelism
• Corresponds to parallel execution of independent jobs or programs on a parallel computer.
• The grain size at this level can be as high as Million of instructions in program. this level
falls under the coarse grain size and is handled by program loader and operating system.
• This is practical for a machine with a small number of powerful processors, but impractical
for a machine with a large number of simple processors (since each processor would take
too long to process a single job).
Grain Packing and Scheduling
• Grain Packing:
How do we partition a program into program segments to get the shortest
possible execution time?
What is the optimal size of the concurrent grains?
There is an obvious tradeoff between the time spent scheduling and
synchronizing parallel grains and the speedup obtained by parallel
execution.
One approach to this problem is called "grain packing".
The grain size problem requires determination of both the number of
partitions and the size of the grains in a parallel program.
The solution is both problem dependent and machine dependent.
We want a short schedule for fast execution of the subdivided program modules.
Program Graph:
-- Each node (n, s) corresponds to a computational unit:
n -- node name; s -- grain size
-- Each edge label (v, d) between two nodes denotes the output variable v and the
communication delay d.
Example:
1. a := 1      10. j := e x f
2. b := 2      11. k := d x f
3. c := 3      12. l := j x k
4. d := 4      13. m := 4 x l
5. e := 5      14. n := 3 x m
6. f := 6      15. o := n x i
7. g := a x b  16. p := o x h
8. h := c x d  17. q := p x g
9. i := d x e
Program Graphs and Packing…
Grain Packing — Partitioning

[Figure: the fine-grain program graph of the 17 nodes above, with each node labeled (name, grain size) and each edge labeled (output variable, communication delay), and its partition into coarse grains A, B, C, D and E.]
Program Graphs and Packing…
 Nodes 1, 2, 3, 4, 5, and 6 are memory reference (data fetch)
operations.
 Each takes one cycle to address and six cycles to fetch from memory.
 All the remaining nodes (7 to 17) are CPU operations, each requiring two
cycles to complete. After packing, the coarse-grain nodes have larger
grain sizes, ranging from 4 to 8, as shown.
 The node (A, 8) in Fig. (b) is obtained by combining the nodes (1, 1),
(2, 1), (3, 1), (4, 1), (5, 1), (6, 1) and (11, 2) in Fig. (a). The grain size 8
of node A is the sum of all the grain sizes being combined
(1 + 1 + 1 + 1 + 1 + 1 + 2 = 8).
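A tiny sketch of that grain-size bookkeeping (node sizes taken from the description above):

```python
# Fine-grain sizes: nodes 1-6 are 1-cycle memory-address operations,
# nodes 7-17 are 2-cycle CPU operations (the 6-cycle fetch delays live
# on the edges of the graph, not in the node grain sizes).
grain = {n: 1 for n in range(1, 7)}
grain.update({n: 2 for n in range(7, 18)})

node_A = [1, 2, 3, 4, 5, 6, 11]                  # the packing described above
print("grain size of A:", sum(grain[n] for n in node_A))   # 1*6 + 2 = 8
```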
Program Flow Mechanism
The program flow mechanism determines the order in which a program is executed; in
conventional computers this order is explicitly stated by the user program.
There are three flow mechanisms:
1. Control flow mechanism (used by conventional computers)
2. Data-driven mechanism (used by dataflow computers)
3. Demand-driven mechanism (used by reduction computers)
1.Control flow mechanism
 Conventional von Neumann computers use a program counter (PC) to sequence the execution of the instructions in a program. This
sequential execution style is called control-driven.
 Conventional computers are based on a control flow mechanism by which the order of program execution is explicitly
stated in the user program.
 Control flow can be made parallel by using parallel language constructs or a parallelizing compiler.
 Control flow machines give complete control, but are less efficient than other approaches.

2.Data-driven mechanism
 Dataflow computers are based on a data-driven mechanism which allows the execution of any instruction to be
driven by data (operand) availability.
 Dataflow computers emphasize a high degree of parallelism at the fine-grain instruction level.
 But they have high control overhead, lose time waiting for unneeded arguments, and have difficulty manipulating data
structures.
Dataflow features
No need for
 shared memory
 program counter
 control sequencer
Special mechanisms are required to
 detect data availability
 match data tokens with instructions needing them
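A minimal sketch of the data-driven firing rule (the three-node graph here is hypothetical, chosen only to show that instructions fire on operand availability rather than under a program counter):

```python
import operator

# Each node: (result name, operation, operand names).
nodes = [
    ("t1", operator.add, ("a", "b")),
    ("t2", operator.mul, ("c", "d")),
    ("r",  operator.sub, ("t1", "t2")),
]

tokens = {"a": 1, "b": 2, "c": 3, "d": 4}   # initial data tokens
fired = set()

while len(fired) < len(nodes):
    for out, op, ins in nodes:
        # Data-driven rule: a node fires as soon as all of its operand
        # tokens are available; there is no sequencing by a program counter.
        if out not in fired and all(name in tokens for name in ins):
            tokens[out] = op(*(tokens[name] for name in ins))
            fired.add(out)

print(tokens["r"])   # (1 + 2) - (3 * 4) = -9
```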
Interconnection Network
• When more than one processor needs to access memory structures, peripheral devices, etc., an interconnection
network is required between the processing elements and the other subcomponents of the computer.
• This interconnection network/architecture can be viewed as a topology whose nodes are the various sub-
components of the computer.
• System interconnection networks are categorized into two classes:
1. Static interconnection networks
2. Dynamic interconnection networks
• Static networks use direct links which are fixed once built. Such a network does not change with
respect to time or requirements.
• Such networks are suitable when the communication patterns among the various subcomponents are
predictable.
• There are various topologies under this category
1. Linear array
2. Ring and chordal ring
3. Barrel shifter
4. Trees and stars
5. Fat tree
6. Mesh and Torus
1. Linear array
• It is a one-dimensional network with the simplest topology, where n nodes are
connected by n−1 links.
• In this topology, internal nodes have degree 2 and the two terminal nodes have
degree 1.
• The bisection width is b = 1 for this topology. This structure suffers from communication
inefficiency when n becomes large.

2. Ring and Chordal ring


• A ring is obtained by interconnecting the two terminal
nodes of a linear array with an extra link. A ring can be uni-directional
or bi-directional.
• This is a symmetric structure with bisection width b = 2.
• By increasing the node degree from 2 to 3, 4, 5 and so
on, we obtain chordal rings.
3. Barrel shifter:
• We can obtain a barrel shifter from a
ring by adding extra links from each node
to those nodes whose distance is an
integral power of 2, i.e. node i is connected
to node j if |j − i| = 2^r, where
r = 0, 1, 2, …, n−1.
• For N = 2^n nodes, such a barrel shifter has node
degree d = 2n − 1 and diameter D = n/2.
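A short sketch (assuming N = 2^n nodes, as above) that lists a node's barrel-shifter neighbours and confirms the node degree of 2n − 1:

```python
def barrel_shifter_neighbours(i, n):
    """Neighbours of node i in a barrel shifter with N = 2**n nodes."""
    N = 1 << n
    nbrs = set()
    for r in range(n):                        # distances 1, 2, 4, ..., 2**(n-1)
        nbrs.add((i + (1 << r)) % N)
        nbrs.add((i - (1 << r)) % N)
    return nbrs

n = 4                                          # N = 16 nodes
print(sorted(barrel_shifter_neighbours(0, n)))       # [1, 2, 4, 8, 12, 14, 15]
print(len(barrel_shifter_neighbours(0, n)) == 2 * n - 1)   # degree 2n - 1 = 7
```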

4. Trees and stars:


• A tree is a hierarchical structure with a root node
and subsequent nodes as its children. A k-level
completely balanced binary tree has 2^k − 1 nodes.

• The star is a two-level tree with a high node

degree of n − 1 and a constant diameter of 2.
5. Fat tree
• The conventional tree structure can be modified to become a fat
tree.
• The channel bandwidth of a fat tree increases as we move from the
leaf nodes toward the root. The fat tree is like a real tree in
which the branches get thicker toward the root.
• The fat tree is used to solve the bottleneck problem at the
root node of a conventional tree.

6. Mesh and Torus

• A k-dimensional mesh with n^k nodes has an interior
node degree of 2k and a network diameter of k(n − 1).

• The torus can be viewed as another variant of the mesh with an even

shorter diameter. This topology combines the ring and mesh
topologies and extends them to higher dimensions.
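A small sketch (a hypothetical helper for the 2-D case, k = 2) showing how the torus adds wraparound links to the mesh, which is exactly what shortens the diameter:

```python
def neighbours(x, y, n, wraparound):
    """Neighbours of node (x, y) in an n x n mesh, or an n x n torus if wraparound."""
    result = []
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if wraparound:
            result.append((nx % n, ny % n))       # torus: edges wrap around
        elif 0 <= nx < n and 0 <= ny < n:
            result.append((nx, ny))               # mesh: boundary nodes lose links
    return result

n = 4
print(neighbours(0, 0, n, wraparound=False))  # mesh corner: degree 2
print(neighbours(0, 0, n, wraparound=True))   # torus: every node has degree 4 (= 2k)
# For k = 2: mesh diameter k(n-1) = 6, torus diameter 2*(n//2) = 4.
```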

7. Systolic Array:
• This class of multidimensional pipelined array architectures is used for computations such as matrix
multiplication. The interior node degree is 6.
• This architecture is specially suited to applications like signal and image
processing. Such structures have limited applicability and are difficult to program.
• Dynamic interconnection networks
• Such networks are implemented with switches and
arbiters.
• They can provide dynamic connectivity for all types of
communication patterns based on the demands of the program.
• There are 2 major classes of dynamic interconnection networks:
1. Single-stage networks
2. Multi-stage networks
• A bus system is essentially a collection of wires and
connectors for data transactions among the processors,
memory modules and peripheral devices attached to
the system.
• The bus is used for only one transaction at a time
between a source and a destination. In case of multiple
requests, arbitration logic is used.
• A digital bus system is also called a contention bus or a
time-sharing bus.
• These MINs have been used in both SIMD
and MIMD computers. A number of a×b
switches are used in each stage. Fixed
inter-stage connections are used
between the switches in adjacent stages.
The switches can be set dynamically to
establish the desired connections
between the inputs and outputs.
• Different classes of MINs differ in the switch
modules used and in the kind of inter-
stage connection (ISC) pattern used.
• The simplest switch module is the 2×2
switch, and the ISC patterns used include the
perfect shuffle, butterfly, multiway
shuffle, crossbar, cube connection, etc.
• A 2×2 switch can be set to 4 different
legitimate states; such switches are used in
the construction of the Omega network.

• A 16×16 Omega network is shown in the figure

below; it requires 4 stages of 2×2
switches. There are 16 inputs on the left and 16 outputs
on the right. The ISC pattern used is the perfect
shuffle.
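The perfect-shuffle ISC has a simple address description: an input at position i is routed to the position obtained by rotating the binary address of i left by one bit. A small sketch (helper name is my own) for the 16-input case:

```python
def perfect_shuffle(i, bits):
    """Rotate the 'bits'-bit address i left by one position."""
    msb = (i >> (bits - 1)) & 1
    return ((i << 1) | msb) & ((1 << bits) - 1)

# 16-input Omega network: 4-bit addresses, 4 stages of 2x2 switches.
print([perfect_shuffle(i, 4) for i in range(16)])
# e.g. input 1 (0001) goes to position 2 (0010); input 8 (1000) goes to 1 (0001).
```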
• The highest bandwidth and interconnection capability are provided by
crossbar networks. A crossbar is a network like a telephone switchboard, having
crosspoint switches that provide dynamic connections between source-
destination pairs. Each crosspoint switch provides a dedicated path
between one pair. The switches can be set on or off dynamically on program
demand.
• Baseline network:
• This network falls under the category
of multistage interconnection
networks, and is generated by using a
recursive pattern.
• The first stage contains one N×N
block, the second stage
contains two (N/2)×(N/2) sub-blocks,
and the construction process is
applied recursively to the sub-blocks
until the block size reaches 2×2.
• This network is built using 2×2
switches with two legitimate
connection states: straight and
crossover.
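A minimal sketch of the recursive decomposition just described (it prints the sub-block sizes per stage rather than the actual switch wiring):

```python
def baseline_blocks(n, stage=1):
    """Block sizes produced at each stage of an N x N baseline network."""
    print(f"stage {stage}: {2 ** (stage - 1)} block(s) of size {n} x {n}")
    if n > 2:                      # recurse until the blocks are 2 x 2 switches
        baseline_blocks(n // 2, stage + 1)

baseline_blocks(16)
# stage 1: 1 block(s) of size 16 x 16
# stage 2: 2 block(s) of size 8 x 8
# stage 3: 4 block(s) of size 4 x 4
# stage 4: 8 block(s) of size 2 x 2
```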
