
Module-1 Theory of Parallelism

Parallel Computer Models, The State of Computing, Multiprocessors and Multicomputers, Multivector and SIMD Computers, PRAM and VLSI Models, Program and Network Properties, Conditions of Parallelism, Program Partitioning and Scheduling, Program Flow Mechanisms, System Interconnect Architectures, Principles of Scalable Performance, Performance Metrics and Measures, Parallel Processing Applications, Speedup Performance Laws, Scalability Analysis and Approaches.

What is Computer Architecture?


• Computer Architecture refers to

 Instruction set design or Instruction set architecture (ISA)


 Design of the various system building blocks: processor, main memory, cache, data paths
 The arrangement by which the various system building blocks are interconnected and interoperated to achieve the desired system performance.
Instruction set architecture (ISA): ISA involves specifying the
 registers,
 data types,
 instruction set,
 instruction formats, and
 addressing modes of a machine or computer, which can be used by the compiler writer.
ISA is the portion of the computer visible to the programmer or compiler writer.

THE STATE OF COMPUTING


Computer Development Milestones
• Prior to 1945, computers were made with mechanical or electromechanical parts. Modern computers were marked by the introduction of electronic components. Over the past several decades, electronic computers have gone through roughly five generations of development.
The first generation (1945-1954)
• Vacuum tubes and magnetic drums for memories.
• A single central processing unit (CPU) which performed serial fixed-point arithmetic.
• Machine or assembly languages were used.

The second generation (1955-1964)


• The second generation (1955-1964) was marked by the use of
• discrete transistors, diodes, magnetic ferrite cores, and floating-point arithmetic.

• High level languages (HLLs), such as Fortran, Algol, and Cobol, were introduced along
with compilers.

The third generation (1965-1974)


• The third generation (1965-1974) began to use

• integrated circuits (ICs) for both logic and memory in small-scale or medium-scale
integration (SSI or MSI)
• Microprogrammed control, pipelining, and cache memory were introduced.
• Multiuser applications

The fourth generation (1974-1991)


The fourth generation (1974-1991) used
• large-scale or very large- scale integration (LSI or VLSI),
• Semiconductor memory replaced core memory

• Multiprocessors, multicomputers, supercomputers


• Multiprocessor OS, languages, compilers, and environments for parallel processing.

The fifth generation (1991- present)


• The fifth generation (1991-present) is highlighted by the use of advanced VLSI in processors and memory, superscalar processors, and heterogeneous processing.

Summary of the five generations of electronic computer development.


Elements of Modern Computers

• Hardware, software, and programming elements of a modern computer system are briefly
introduced below in the context of parallel processing.
• Computing problems have been labeled as numerical computing (tedious integer or floating-point computation), transaction processing, and large database management and information retrieval.

Algorithms and Data Structures


• Traditional algorithms and data structures are designed for sequential machines.
• New, specialized algorithms and data structures are needed to exploit the capabilities of parallel
architectures.
Hardware Resources

• Processors, memory, and peripheral devices form the hardware core of a computer
system.

• Important hardware issues include the instruction set, memory organization, and system classes such as multiprocessors, supercomputers, multicomputers, and massively parallel computers.
Operating System

• An effective operating system manages the allocation and deallocation of resources


during the execution of user programs.

• Mapping is a bidirectional process matching algorithmic structure with hardware


architecture, and vice versa. The mapping of algorithmic and data structures onto the
machine architecture includes processor scheduling, memory maps, interprocessor
communications, etc. These activities are usually architecture-dependent.
System Software Support
• The source code written in a HLL must first be translated into object code by an optimizing compiler.

• The compiler assigns variables to registers or to memory words and reserves functional
units for operators.
• A loader is used to initiate the program execution through the OS kernel.

• Resource binding demands the use of the compiler, assembler, loader, and OS kernel to
commit physical machine resources to program execution.

• Parallel software can be developed using entirely new languages designed specifically
with parallel support as its goal, or by using extensions to existing sequential languages.

Compiler Support
• Parallelizing Compilers requires full detection of parallelism in source code, and
transformation of sequential code into parallel constructs

• Compiler directives are often inserted into source code to aid compiler parallelizing
efforts
Evolution of Computer Architecture

• Started with the von Neumann architecture built as a sequential machine executing scalar
data .

• The von Neumann architecture is slow due to sequential execution of instructions in


programs.

• Lookahead techniques were introduced to prefetch instructions in order to overlap I/E


(instruction fetch/decode and execution) operations and to enable functional parallelism.

• Functional parallelism was supported by two approaches: one is to use multiple


functional units simultaneously, and the other is to use pipelining.

• Pipelining has proven especially attractive in performing identical operations repeatedly


over vector data strings.

• Vector operations were originally carried out implicitly by software-controlled looping


using scalar pipeline processors.

Flynn's Classification
• Michael Flynn (1972) introduced a classification of various computer architectures based
on notions of instruction and data streams processed simultaneously.
• SISD (single instruction stream over a single data stream)
• SIMD (single instruction stream over multiple data streams) machines

• MIMD(multiple instruction streams over multiple data streams)


• MISD (multiple instruction streams and a single data stream)

SISD (single instruction stream over a single data stream)


• Conventional sequential machines are called SISD (single instruction stream over a
single data stream) computers.

• Instructions are executed sequentially.

SIMD (single instruction stream over multiple data streams) machines


• SIMD (single instruction stream over multiple data streams) machines allow vector
processing.

• Uses an array of processing elements(PEs) synchronized by the same controller.


• A single instruction stream is executed by different processing elements on different sets of data.

PE = Processing Element, LM = Local Memory


MIMD(multiple instruction streams over multiple data streams)
• Parallel computers (multicomputers and multiprocessors)

• Multiple instruction: every processor may be executing a different instruction stream


• Multiple data: every processor may be working with a different data stream
• There are different processors, each processing a different task.

• In a shared-memory MIMD machine, the multiple data streams are drawn from the shared memory.

MISD (multiple instruction streams and a single data stream) machines


• The same data stream flows through a linear array of processors

• They execute different instruction streams.


• This architecture is also known as systolic arrays.

• Most parallel computers are built with the MIMD model for general-purpose computations. The SIMD and MISD models are more suitable for special-purpose computations. For this reason, MIMD is the most popular model, SIMD next, and MISD the least popular model applied in commercial machines.

Development Layers

A layered development of parallel computers is illustrated in Fig. Hardware configurations differ


from machine to machine, even those of the same model. The address space of a processor in a
computer system varies among different architectures. It depends on the memory organization,
which is machine-dependent. These features are up to the designer and should match the target
application domains.
We want to develop application programs and programming environments which are machine-independent. Independent of the machine architecture, user programs can then be ported to many computers with minimum conversion costs.
High-level languages and communication models depend on the architectural choices made in a
computer system. From a programmer's viewpoint, these two layers should be architecture-
transparent.
However, the communication models, shared variables versus message passing, are mostly
machine-dependent. Application programmers prefer more architectural transparency. However,
kernel programmers have to explore the opportunities supported by hardware.
As a good computer architect, one has to approach the problem from both ends. The compilers
and OS support should be designed to remove as many architectural constraints as possible from
the programmer.

System Attributes to Performance


• System attributes can be used to guide system architects in designing better machines, or to educate programmers and compiler writers in optimizing codes for more efficient execution by the hardware.

The total CPU time needed for program execution


• The CPU (or the processor) is driven by a clock with a constant cycle time τ.
• The size of a program is its instruction count (Ic), in terms of the number of machine
instructions to be executed in the program.
• The CPU time (T in seconds/program) needed to execute the program is :
T = Ic * CPI * τ
Clock Rate and CPI
• Different machine instructions may require different numbers of clock cycles to execute.
• Therefore, the term cycles per instruction (CPI) is used to mean the average number of cycles over all instruction types.
• The CPU of computer is driven by a clock with a constant cycle time τ.
• The inverse of the cycle time is the clock rate
f= 1 / τ
• The execution of an instruction requires access to the memory.
• The CPI of an instruction type can be divided into two component terms corresponding to
the total processor cycles and memory cycles needed to complete the execution of the
instruction.
• The total time needed to execute a program can then be rewritten as
T = Ic* (p + m*k)*τ
where
• p is the number of processor cycles needed for the instruction decode and execution,
• m is the number of memory references needed,
• k is the memory access latency,
• Ic is the instruction count, and
• τ is the processor cycle time.

System Attributes

T = Ic* (p + m*k)*τ
Five performance factors : Ic, p, m, k, τ

Where, p is the number of processor cycles needed for the instruction decode and execution, m
is the number of memory references needed, k is the memory access latency, Ic is the instruction
count, and τ is the processor cycle time.
These are influenced by four system attributes:
• Instruction-set architecture,
• Compiler technology,

• CPU implementation and control,


• Cache and memory hierarchy as specified in Table
The instruction-set architecture affects the program length (Ic) and processor cycle needed (p).
The compiler technology affects the values of Ic ,p, and the memory reference count (m). The
CPU implementation and control determine the total processor time (p * τ) needed. Finally, the memory technology and hierarchy design affect the memory access latency (k * τ). The above
CPU time can be used as a basis in estimating the execution rate of a processor.
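As a quick numerical illustration (not from the text), the refined CPU-time formula can be evaluated directly; all parameter values in the Python sketch below are hypothetical.

# Sketch: evaluating T = Ic * (p + m*k) * tau for hypothetical values.
Ic  = 200_000      # instruction count (hypothetical)
p   = 4            # processor cycles per instruction (hypothetical)
m   = 0.4          # memory references per instruction (hypothetical)
k   = 10           # memory access latency in cycles (hypothetical)
tau = 1 / 40e6     # cycle time of a 40 MHz clock, in seconds

effective_cpi = p + m * k       # cycles per instruction including memory cycles
T = Ic * effective_cpi * tau    # total CPU time in seconds
print(effective_cpi)            # 8.0
print(T)                        # 0.04 s, i.e. 40 ms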

MIPS Rate
• The processor speed is often measured in terms of million instructions per second
(MIPS).
• The MIPS rate of a given processor is
MIPS rate = Ic / (T * 10^6) = f / (CPI * 10^6) = (f * Ic) / (C * 10^6)
where C is the total number of clock cycles needed to execute a given program.

Floating Point Operations per Second


• Most compute-intensive applications in science and engineering make heavy use of
floating point operations.
• For such applications, a more relevant measure of performance is floating-point operations per second, or flops.
• This is written as megaflops (Mflops), gigaflops (Gflops), teraflops, or petaflops.
Throughput Rate:
• The CPU throughput is a measure of how many programs can be executed per second
• unit is programs/second
1. Consider the execution of an object code with 200,000 instructions on a 40 MHz
processor. The program consists of four major types of instructions. The instruction mix
and the number of cycles [CPI] needed for each instruction type are given below:

I. Find total number of cycles required to execute program.


II. Calculate the average CPI when the program is executed on a uniprocessor.
III. Calculate MIPS rate.

2. A 400 MHz processor was used to execute a benchmark program with the following
instruction mix and clock cycle counts:

Determine the effective CPI, MIPS rate and execution time for this program.
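The instruction-mix tables for the two problems above are not reproduced here, but the calculation pattern is the same in both: weight each instruction type's CPI by its count, then derive the MIPS rate and execution time. A sketch with a purely hypothetical mix (the counts, CPI values and the 40 MHz clock below are invented for illustration, not the exercise data):

# Sketch: effective CPI, MIPS rate, execution time and throughput from an
# instruction mix. All numbers below are hypothetical.
f = 40e6   # clock rate in Hz (hypothetical 40 MHz processor)

mix = [                       # (instruction type, count, CPI) -- hypothetical
    ("arithmetic/logic", 90_000, 1),
    ("load/store",       60_000, 2),
    ("branch",           30_000, 2),
    ("floating point",   20_000, 4),
]

Ic = sum(count for _, count, _ in mix)                  # total instruction count
total_cycles = sum(count * cpi for _, count, cpi in mix)

effective_cpi = total_cycles / Ic                       # weighted average CPI
T = Ic * effective_cpi / f                              # execution time in seconds
mips = f / (effective_cpi * 1e6)                        # MIPS rate
throughput = 1 / T                                      # programs per second

print(total_cycles, effective_cpi, T, mips, throughput)
# 350000 cycles, CPI = 1.75, T = 8.75 ms, about 22.9 MIPS, about 114 programs/second

Substituting the actual table values gives the answers to problems 1 and 2.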

Programming Environments

• When using a parallel computer, one desires a parallel environment where parallelism is
automatically exploited.
• Two approaches to parallel programming
Implicit Parallelism and Explicit Parallelism
Implicit Parallelism
• An implicit approach uses a conventional language, such as C, C++ to write the source
program.
• The sequentially coded source program is translated into parallel object code by a
parallelizing compiler.
• This compiler must be able to detect parallelism and assign target machine resources.
• This compiler approach has been applied in programming shared-memory
multiprocessors.
Explicit Parallelism
• The second approach requires more effort by the programmer to develop a source
program using parallel dialects of C, C++.
• Parallelism is explicitly specified in the user programs.
• This reduces the burden on the compiler to detect parallelism.
• Instead, the compiler needs to preserve parallelism.
Multiprocessors and Multicomputers
• Two categories of parallel computers.
• These physical models are distinguished by having a shared common memory or
unshared distributed memories.
Shared-Memory Multiprocessors
Three shared-memory multiprocessor models. These models differ in how the memory and
peripheral resources are shared or distributed.
• Uniform Memory-Access (UMA) model
• Non-Uniform Memory Access (NUMA) model
• Cache-only Memory Access (COMA) model
The UMA Model
• Uniform memory-access (UMA) model
• In a UMA multiprocessor model, the physical memory is uniformly shared by all the
processors. All processors have equal access time to all memory words. Each processor
may use a private cache. Peripherals are also shared in some fashion. Communication
among processors takes place using variables in the common memory.
• The system interconnect takes the form of a common bus, a crossbar switch, or a
multistage network. The UMA model is suitable for general purpose applications by
multiple users.
• It can be used to speed up the execution of a single large program in time-critical
applications. To coordinate parallel events, synchronization and communication among
processors are done through using shared variables in the common memory.
Non-Uniform Memory Access (NUMA) model
• A NUMA multiprocessor is a shared-memory system in which the access time varies
with the location of the memory word.
• Two NUMA machine models:

The shared local memory


• The shared memory is physically distributed to all processors, called local memories.
• Local memories are accessible by all processors. It is faster to access a local memory
with a local processor. The access of remote memory attached to other processors takes
longer due to the added delay through the interconnection network.
• [The BBN TC-2000 Butterfly multiprocessor had this configuration; the machine had up to 512 processors, each with a local memory.]
A hierarchical cluster model
• Globally shared memory can be added to a multiprocessor system.
• In this case, there are three memory-access patterns: The fastest is local memory access.
The next is global memory access. The slowest is access of remote memory.
• The processors are divided into several clusters. Each cluster is itself a UMA or a NUMA multiprocessor.
• The clusters are connected to global shared-memory modules. The entire system is
considered a NUMA multiprocessor. All processors belonging to the same cluster are
allowed to uniformly access the cluster shared-memory modules.
• [The Cedar multiprocessor, built at the University of Illinois, assumes such a structure in which each cluster is an Alliant FX/80 multiprocessor.]
The COMA model

• A multiprocessor using cache-only memory assumes the COMA model.


• The COMA model is a special case of a NUMA machine, in which the distributed main
memories are converted to caches.
• All the caches form a global address space.
• Remote cache access is assisted by the distributed cache directories. Depending on the
interconnection network used, sometimes hierarchical directories may be used to help
locate copies of cache blocks.

Distributed-Memory Multicomputers

• The system consists of multiple computers, often called nodes, interconnected by a


message-passing network.
• Each node is an autonomous computer consisting of a processor, local memory, and
sometimes attached disks or I/O peripherals.
• All local memories are private and are accessible only by local processors.
• For this reason, traditional multicomputers have been called no-remote-memory-access
(NORMA) machines.
• Internode communication is carried out by passing messages through the static connection network. With advances in interconnection and network technologies, this model of computing has gained importance because of its suitability for certain applications, its scalability, and its fault tolerance.

Multivector and SIMD Computers


• We classify supercomputers either as pipelined vector machines or as SIMD computers
emphasizing massive data parallelism.
[A vector operand contains an ordered set of n elements, where n is called the length of the
vector. Each element in a vector is a scalar quantity, which may be a floating point number, an
integer, a logical value or a character.]

Vector Supercomputers

• A vector computer is often built on top of a scalar processor. The vector processor is
attached to the scalar processor. Program and data are first loaded into the main memory
through a host computer. All instructions are first decoded by the scalar control unit. If
the decoded instruction is a scalar operation, it will be directly executed by the scalar
processor using the scalar functional pipelines.
• If the instruction is decoded as a vector operation, it will be sent to the vector control
unit.
This control unit will supervise the flow of vector data between the main memory and vector
functional pipelines. The vector data flow is coordinated by the control unit. A number of vector
functional pipelines may be built into a vector processor.

Two pipeline vector supercomputer models are described below.


Vector Processor Models:
register-to-register architecture.
• Vector registers are used to hold the vector operands, intermediate and final vector
results. The vector functional pipelines retrieve operands from and put results into the
vector registers.
• The length of each vector register is usually fixed. In general, there are fixed numbers of
vector registers and functional pipelines in a vector processor. Therefore, both resources
must be reserved in advance to avoid resource conflicts between different vector
operations.
memory-to-memory architecture.
• A memory-to-memory architecture differs from a register-to-register architecture in the
use of a vector stream unit to replace the vector registers.
• Vector operands and results are directly retrieved from the main memory.

SIMD Supercomputers

• An operational model of an SIMD computer is specified by a 5-tuple:


M=(N, C, I, M, R)
• N is the number of processing elements (PEs) in the machine.
• C is the set of instructions directly executed by the control unit (CU), including scalar and
program flow control instructions.
• I is the set of instructions broadcast by the CU to all PEs for parallel execution.
• M is the set of masking schemes, where each mask partitions the set of PEs into enabled
and disabled subsets.
• R is the set of data-routing functions, specifying various patterns to be set up in the
interconnection network for inter-PE communications.
One can describe a particular SIMD machine architecture by specifying the 5-tuple.

Program & Network properties

Conditions of parallelism
• The ability to execute several program segments in parallel requires each segment to be
independent of the other segments.
• The independence comes in various forms as defined below separately.
• We consider the dependence relations among instructions in a program. In general, each
code segment may contain one or more statements. We use a dependence graph to
describe the relations. The nodes of a dependence graph correspond to the program
statements [instructions], and the directed edges with different labels show the ordered
relations among the statements. The analysis of dependence graphs shows where
opportunity exists for parallelization and vectorization.
Data Dependence
• The ordering relationship between statements is indicated by the data dependence.
• Five types of data dependence are defined below:
1. Flow dependence: A statement S2 is flow-dependent on S1 if an execution path exists from S1 to S2 and if at least one output (a variable assigned) of S1 feeds in as input (an operand to be used) to S2. Flow dependence is denoted S1 -> S2.
2. Antidependence: Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1. This is also called a WAR hazard and is denoted by a crossed arrow from S1 to S2.
3. Output dependence: Two statements are output-dependent if they produce (write) the same output variable.

4. I/O dependence: Read and write are I/O statements. I/O dependence occurs when the same file is referenced by both I/O statements.
5. Unknown dependence: The dependence relation between two statements cannot be
determined in the following situations:
• The subscript of a variable is itself subscripted (indirect addressing).
• The subscript does not contain the loop index variable.
• A variable appears more than once with subscripts having different coefficients of the
loop variable.
• The subscript is nonlinear in the loop index variable.

Ex: Consider the following fragment of any program:


S1 Load R1, A
S2 Add R2, R1
S3 Move R1, R3
S4 Store B, R1
Draw the Dependence graph.
Here we have:
• Flow dependence: S1 to S2, S3 to S4, S2 to S2
• Antidependence: S2 to S3
• Output dependence: S1 to S3
Dependence graph:

• The above data dependence relations should not be arbitrarily violated during program
execution. Otherwise, erroneous results may be produced with changed program order. On a
multiprocessor system, the program order may or may not be preserved, depending on the
memory model used.
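The dependences listed in the example can also be derived mechanically from the read and write sets of each statement. The sketch below does this for the S1-S4 fragment; the read/write sets encode an assumed two-operand semantics (e.g. Add R2, R1 reads R2 and R1 and writes R2), and the self-reference of S2 through R2 is not tracked.

# Sketch: deriving flow, anti and output dependences for the S1-S4 example
# from per-statement read/write sets (operand semantics are assumed).
stmts = {
    "S1": {"reads": {"A"},        "writes": {"R1"}},   # Load  R1, A
    "S2": {"reads": {"R2", "R1"}, "writes": {"R2"}},   # Add   R2, R1
    "S3": {"reads": {"R3"},       "writes": {"R1"}},   # Move  R1, R3
    "S4": {"reads": {"R1"},       "writes": {"B"}},    # Store B, R1
}
order = ["S1", "S2", "S3", "S4"]

last_writer = {}            # variable -> statement that most recently wrote it
readers_since_write = {}    # variable -> statements reading it since its last write

for s in order:
    for v in stmts[s]["reads"]:                       # flow dependences (RAW)
        if v in last_writer:
            print(f"flow dependence   {last_writer[v]} -> {s}  (via {v})")
        readers_since_write.setdefault(v, []).append(s)
    for v in stmts[s]["writes"]:
        for r in readers_since_write.get(v, []):      # antidependences (WAR)
            if r != s:
                print(f"antidependence    {r} -> {s}  (via {v})")
        if v in last_writer and last_writer[v] != s:  # output dependences (WAW)
            print(f"output dependence {last_writer[v]} -> {s}  (via {v})")
        last_writer[v] = s
        readers_since_write[v] = []

The printed edges match the list above: flow dependences S1 -> S2 and S3 -> S4, antidependence S2 -> S3, and output dependence S1 -> S3.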
Control Dependence
• This refers to the situation where the order of the execution of statements cannot be
determined before run time.

• For example, in conditional statements the flow of control depends on the outcome of the condition, which is known only at run time.
• Paths taken after a conditional branch may introduce or eliminate data dependence among
instructions.
• Dependence may also exist between operations performed in successive iterations of a
looping procedure.
• Loop example in which the successive iterations are control-independent:

• Loop example in which the successive iterations are control-dependent:

• Control dependence often prohibits parallelism from being exploited. Compiler


techniques or hardware branch prediction techniques are needed to get around the control
dependence in order to exploit more parallelism.
Resource Dependence

• Resource dependence is concerned with conflicts in using shared resources, such as


registers, integer and floating point ALUs, etc.

• ALU conflicts are called ALU dependence. Memory (storage) conflicts are called storage
dependence.
The detection of parallelism in programs requires a check of the various dependence
relations.

Bernstein's Conditions
• Bernstein revealed a set of conditions based on which two processes can execute in
parallel.
• We define the input set Ii of a process Pi as the set of all input variables needed to execute the process, and the output set Oi as the set of all output variables generated after execution of the process Pi. Bernstein's conditions, which apply to the input and output sets of processes, must be satisfied for parallel execution of processes.
• Consider two processes P1 and P2 with their input sets I1 and I2 and output sets O1 and O2, respectively. These two processes can execute in parallel and are denoted P1 || P2 if they are independent and do not create confusing results.
• The conditions are stated as follows:
I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, and O1 ∩ O2 = ∅
These three conditions are known as Bernstein's conditions.
• In terms of data dependences, Bernstein's conditions imply that two processes can execute in parallel if they are flow-independent, anti-independent, and output-independent.
• In general, a set of processes P1, P2, P3, ..., Pk can execute in parallel if Bernstein's conditions are satisfied on a pairwise basis; that is, P1 || P2 || P3 || ... || Pk if and only if Pi || Pj for all i ≠ j.
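A minimal sketch of the pairwise check, assuming each process is described only by its input set I and output set O; the three processes below are hypothetical.

# Sketch: pairwise check of Bernstein's conditions (hypothetical processes).
from itertools import combinations

def bernstein_parallel(proc1, proc2):
    """True if I1∩O2, I2∩O1 and O1∩O2 are all empty (Bernstein's conditions)."""
    I1, O1 = proc1
    I2, O2 = proc2
    return not (I1 & O2) and not (I2 & O1) and not (O1 & O2)

procs = {                      # name: (input set, output set) -- hypothetical
    "P1": ({"B", "C"}, {"A"}),
    "P2": ({"D", "E"}, {"C"}),
    "P3": ({"A", "F"}, {"G"}),
}

for a, b in combinations(procs, 2):
    print(f"{a} || {b}:", bernstein_parallel(procs[a], procs[b]))
# P1 || P2: False (C is an input of P1 and an output of P2)
# P1 || P3: False (A is an output of P1 and an input of P3)
# P2 || P3: True

A whole set of processes can run in parallel only if every pair passes this check, which is exactly the pairwise rule stated above.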
Ex: Detection of parallelism in a program using Bernstein's conditions

• The dependence graph shows data dependence (solid arrows) as well as resource dependence (dashed arrows).

• Sequential execution and parallel execution of the above program are compared. In sequential execution, five steps are needed. If two adders are available simultaneously, the parallel execution requires only three steps, as shown.
• There are 10 pairs of statements to check against Bernstein's conditions. Only 5 pairs, P1 || P5, P2 || P3, P2 || P5, P5 || P3, and P4 || P5, can execute in parallel if there are no resource conflicts. Collectively, only P2 || P3 || P5 is possible, because P2 || P3, P3 || P5, and P5 || P2 are all possible.
------------------------------------------------------------------------------------------------------------
Violation of any one or more of the three Bernstein's conditions prohibits parallelism between two processes. In general, violation of any one or more of the 3n(n-1)/2 Bernstein's conditions among n processes prohibits parallelism collectively or partially.

Hardware and Software Parallelism


• For implementation of parallelism, we need special hardware and software support.

• Hardware parallelism refers to the type of parallelism defined by the machine architecture and hardware multiplicity; software parallelism is defined by the structure of the program itself.

• One way to characterize the parallelism in a processor is by the number of instruction issues per machine cycle. If a processor issues k instructions per machine cycle, then it is called a k-issue processor.

Hardware parallelism:
• It refers to support to parallelism through hardware multiplicity.It is characterized by the
number of instruction issues per machine cycle. If a processor issues k instructions per
machine cycle, then it is called a k-issue processor.
• For example, the Intel i960CA is a three-issue processor with one arithmetic, one memory-access, and one branch instruction issued per cycle.
• Software parallelism: This type of parallelism is revealed in the program flow graph, which displays the patterns (and number) of simultaneously executable operations, and thus the number of instructions that could be executed per machine cycle.

Example : Mismatch between software parallelism and hardware parallelism

• Consider the example program graph. There are eight instructions (four loads and four
arithmetic operations) to be executed in three consecutive machine cycles. Four load
operations are performed in the first cycle, followed by two multiply operations in the
second cycle and two add/subtract operations in the third cycle. Here,
• The software parallelism varies from 4 to 2 in three cycles. The average software
parallelism is equal to 8/3 = 2.67 instructions per cycle in this example program.

• Consider execution of the same program by a two-issue processor which can execute one memory access (load or store) and one arithmetic (add, subtract, multiply, etc.) operation simultaneously. With this hardware restriction, the program must execute in seven machine cycles.
• Hardware parallelism displays an average value of 8/7 = 1.14 instructions executed per
cycle.
• This demonstrates a mismatch between the software parallelism and the hardware
parallelism.
• To match the software parallelism, we consider a hardware platform of a dual-processor system in which single-issue processors are used. Here, L/S stands for load/store operations. Six processor cycles are needed to execute the 12 instructions on the two processors. S1 and S2 are two inserted store operations, and L5 and L6 are two inserted load operations. These added instructions are needed for interprocessor communication through the shared memory.
• Hardware parallelism displays an average value of 12/6 = 2 instructions executed per cycle.
• To solve the mismatch problem between software parallelism and hardware parallelism,
one approach is to develop compilation support, and the other is through hardware
redesign.
• The Role of Compilers: Compiler techniques are used to exploit hardware features to
improve performance. Loop transformation, software pipelining, and features of
optimizing compilers are used for supporting parallelism.
• One must design the compiler and the hardware jointly at the same time. Interaction
between the two can lead to a better solution to the mismatch problem between software
and hardware parallelism.

Program Partitioning & Scheduling


• Program partitioning is a technique for decomposing a large program into many small
pieces for parallel execution by multiple processors. Program partitioning involves both
programmers and the compiler.

• Grain Sizes and Latency


• Grain size or granularity is a measure of the amount of computation involved in a
software process. The simplest measure is to count the number of instructions in a grain
(program segment). Grain sizes are commonly described as fine, medium, or coarse
depending on the processing levels involved.
• Grain size determines the basic program segment chosen for parallel processing.

• Latency is a time measure of the communication overhead incurred between machine


subsystems. For example, the memory latency is the time required by a processor to
access the memory.
• Parallelism has been exploited at various processing levels. As illustrated in the figure, five levels of program execution represent different grain sizes and changing communication and control requirements.

• Level 5: The grain size can be as big as millions of instructions (coarse grain).
• Level 4: The grain size may typically contain tens or hundreds of thousands of instructions.
• Level 3: The grain size is less than 2000 instructions (medium grain).
• Level 2: The grain size is less than 500 instructions (fine grain).
• Level 1: The grain size is less than 20 instructions (fine grain).

Instruction Level: At the instruction or statement level, a typical grain contains less than 20 instructions, called fine grain. Depending on individual programs, fine-grain parallelism at this level may range from two to thousands. The exploitation of fine-grain parallelism can be assisted by an optimizing compiler, which should be able to automatically detect parallelism and translate the source code to a parallel form that can be recognized by the run-time system.

Loop Level: This corresponds to iterative loop operations. A typical loop contains less than 500 instructions. Independent loop operations can be exploited by vector processing or pipelined execution. The loop level is considered a fine grain of computation.

Procedure Level This level corresponds to medium-grain size at the task, procedural,
subroutine, and coroutine levels. A typical grain at this level contains less than 2000 instructions.
Detection of parallelism at this level is much more difficult than at the finer-grain levels. The
communication requirement is often less.

Subprogram Level : This corresponds to the level of job steps and related subprograms.
The grain size may typically contain thousands of instructions. Parallelism at this level has been
exploited by algorithm designers or programmers, rather than by compilers. We do not have
good compilers for exploiting medium- or coarse-grain parallelism at present.

Job (Program) Level : This corresponds to the parallel execution of independent jobs
(programs) on a parallel computer. The grain size can be as high as tens of thousands of
instructions in a single program.

To summarize, fine-grain parallelism is often exploited at instruction or loop levels,


preferably assisted by a compiler. Medium-grain parallelism at the task or job step demands
significant roles for the programmer as well as compilers. Coarse-grain parallelism at the
program level relies heavily on an effective OS and on the efficiency of the algorithm used.

Grain Packing and Scheduling


• Grain packing is to apply fine grain first in order to achieve a higher degree of
parallelism.
• Then combine (pack) multiple fine grain nodes into a coarse grain node if it can eliminate
unnecessary communications delays or reduce the overall scheduling overhead.
• To yield the shortest possible execution time.
• Introduced by Kruatrachue and Lewis (1988) for parallel programming applications.
• Usually, all fine-grain operations within a single coarse-grain node are assigned to the
same processor for execution.
• Fine-grain partition of a program often demands more inter processor communication
than that required in a coarse-grain partition.

The basic concept of Program partitioning, Grain Packing and Scheduling introduced below.

Example : Program graph before and after grain packing


• An example program graph is shown in two different grain sizes. A program graph shows the structure of a program. Each node in the program graph corresponds to a grain in the program.
• The grain size is measured by the number of basic machine cycles needed to execute all
the operations within the node.
• We denote each node by a pair (n, s), where n is the node name (id) and s is the grain size
of the node.
• Fine-grain nodes have a smaller grain size, and coarse-grain nodes have a larger grain
size.
• The edge label (v, d) between two end nodes specifies the output variable v from the
source node or the input variable to the destination node, and the communication delay d
between them.
• There are 17 nodes in the fine-grain program graph (Fig. a) and 5 in the coarse-grain
program graph (Fig. b). The coarse-grain node is obtained by combining (grouping)
multiple fine-grain nodes. The fine grain corresponds to the following program.
• The idea of grain packing is to apply fine grain first in order to achieve a higher degree of
parallelism. Then one combines (packs) multiple fine-grain nodes into a coarse grain
node if it can eliminate unnecessary communications delays or reduce the overall
scheduling overhead.

• The fine-grain schedule is longer (42 time units) because more communication delays are included, as shown by the shaded area.
• The coarse-grain schedule is shorter (38 time units) because the communication delays among nodes 12, 13, and 14 within node D (and also the delays among nodes 15, 16, and 17 within node E) are eliminated after grain packing.

Static Multiprocessor Scheduling


• We consider Basic concepts behind multiprocessor scheduling using static schemes to
Produce a shorter schedule.
• Node duplication In order to eliminate the idle time and to further reduce the
communication delays among processors, one can duplicate some of the nodes in more
than one processor.

• Ex: Figure (a) shows a schedule without duplicating any of the five nodes. This schedule contains idle time as well as long interprocessor delays (8 units) between P1 and P2. In Fig. (b), node A is duplicated into A' and assigned to P2, besides retaining the original copy A in P1. Similarly, a duplicated node C' is copied into P1, besides the original node C in P2.
• The new schedule is almost 50% shorter. The reduction in schedule time is caused by elimination of the (a, 8) and (c, 8) delays between the two processors.

• Four major steps are involved in the grain determination and the process of scheduling
optimization:
Step l . Construct a fine-grain program graph.
Step 2. Schedule the fine-grain computation.
Step 3. Perform grain packing to produce the coarse grains.
Step 4. Generate a parallel schedule based on the packed graph.

PROGRAM FLOW MECHANISMS


• Order of execution
Computers are based on
1. Control flow mechanism
2. Dataflow mechanism
3. Demand-Driven Mechanisms
Control flow mechanism
• In computers based on a control flow mechanism the order of program execution is
explicitly stated in the user programs.
• Execution is guided by a program counter.
• Conventional von Neumann computers use a program counter (PC) to sequence the execution of instructions in a program.
Dataflow mechanism
• The execution of any instruction depends on data (operand or data tokens) availability.
• The data generated by an instruction will be forwarded directly to all needy instructions.
• This scheme requires no program counter. It requires special mechanisms to detect data
availability, to match data tokens with needy instructions.
A Dataflow Architecture
• The following is a tagged-token architecture for dataflow computers. As shown in the figure, the global architecture consists of n processing elements (PEs) interconnected by an n-by-n routing network.
• Each PE has a token-matching mechanism that dispatches only instructions with data
tokens available.
• Each datum is tagged with the address of the instruction to which it belongs. Instructions are stored in the program memory. Tagged tokens enter the PE through a local path. The tokens can also be passed to other PEs through the routing network. All internal token circulation operations are pipelined.
• Another synchronization mechanism, called the I-structure, is provided within each PE. It
is a tagged memory unit. The purpose is to reduce excessive copying of large data
structures in dataflow operations.

Comparison of dataflow and control-flow computers

• A sample program and its dataflow graph

• The dataflow graph shows that 24 instructions are to be executed (8 divides, 8 multiplies,
and 8 adds).
• Assume that each add, multiply, and divide requires 1, 2, and 3 cycles to complete, respectively. Sequential execution of the 24 instructions on a control-flow uniprocessor takes 48 cycles to complete. On the other hand, a dataflow multiprocessor completes the execution in 14 cycles.
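The 48-cycle figure follows directly from the instruction counts and the assumed latencies; a quick check under those same assumptions:

# Check of the sequential cycle count on the control-flow uniprocessor,
# using the latencies assumed above (add = 1, multiply = 2, divide = 3 cycles).
counts    = {"add": 8, "multiply": 8, "divide": 8}
latencies = {"add": 1, "multiply": 2, "divide": 3}
print(sum(counts[op] * latencies[op] for op in counts))   # 8*1 + 8*2 + 8*3 = 48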
Demand-Driven Mechanisms
Reduction machines are based on Demand-Driven Mechanisms.

• Reduction machines trigger an instruction’s execution based on the demand for its
results. Initiates an operation based on the demand for its results by other computations.
• Consider the evaluation of the nested arithmetic expression a = ((b + 1) × c) − (d / e).
• The data-driven computation chooses a bottom-up approach, starting from the innermost operations b + 1 and d/e, then proceeding to the × operation, and finally to the outermost operation −.
• A demand-driven computation chooses a top-down approach by first demanding the value of a, which triggers the demand for evaluating the next-level expressions (b + 1) × c and d/e, which in turn triggers the demand for evaluating b + 1 at the innermost level.
• The results are then returned to the nested demander in the reverse order before a is evaluated.
Dataflow and reduction models are still concepts in the research stage. Control-flow machines dominate the market. Until the data-driven or demand-driven mechanism is proven to be cost-effective, the control-flow approach will continue to dominate the computer industry.
Comparison of Flow Mechanisms
System interconnect architecture
• Interconnection networks are used for interconnecting computer subsystems or for constructing multiprocessors or multicomputers.
• Static networks
• Dynamic networks
• These networks can be used for internal connections among processors, memory
modules, and I/O adaptors in a centralized system, or for distributed networking of
multicomputer nodes.
Static Networks: Static networks are formed of point-to-point direct connections which will not change during program execution. They are used for fixed connections among the subsystems of a centralized system or the multiple computing nodes of a distributed system.
Dynamic networks: Dynamic networks are implemented with switched channels, which are dynamically configured to match the communication demand in user programs. They are used in shared-memory multiprocessors.

Network Properties and Routing:


Before we consider various network topologies, we define several parameters often used to
estimate the complexity, communication efficiency, and cost of a network.
 In general, a network is represented by the graph of a finite number of nodes linked by
directed or undirected edges. The number of nodes in the graph is called the network
size.
 Number of nodes in graph = network size .
 Number of edges incident on a node = node degree d (also note in and out degrees when
edges are directed).The node degree reflects the number of I/O ports required per node,
and thus the cost of a node. Therefore, the node degree should be kept a (small) constant,
in order to reduce cost.
 Diameter: the maximum shortest-path distance between any two nodes. The network diameter should be as small as possible from a communication point of view.
 Bisection Width: When a given network is cut into two equal halves, the minimum
number of edges along the cut is called the bisection width.
 A network is symmetric if the topology looks the same from every node.
 Data Routing Functions: A data routing network is used for inter-PE data exchange. Various types of data routing functions are Shifting, Rotating, Permutation (one to one), Broadcast (one to all), Multicast (many to many), Personalized broadcast (one to many), Shuffle, Exchange, etc.
• Permutations: Given n objects, there are n! ways in which they can be reordered. A permutation function reorders the connections between nodes. One can use cycle notation to specify a permutation function. Permutations can be implemented using crossbar switches, multistage networks, shifting, and broadcast operations.
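These parameters can be measured directly on a topology's adjacency structure. A minimal sketch, building an N-node bidirectional ring (N = 16 is an arbitrary choice) and computing the node degree and the diameter by breadth-first search; the bisection width is stated from the definition rather than computed, since computing it for an arbitrary graph is hard.

# Sketch: node degree and diameter of an N-node bidirectional ring.
from collections import deque

def ring(n):
    """Adjacency lists of an n-node bidirectional ring."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def diameter(adj):
    """Largest shortest-path distance between any two nodes (via BFS)."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst

N = 16
adj = ring(N)
print({len(neigh) for neigh in adj.values()})   # {2}: every node has degree 2
print(diameter(adj))                            # 8, i.e. floor(N/2)
# Bisection width: cutting the ring into two equal halves severs 2 links.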

Perfect Shuffle and Exchange:
• Perfect shuffle is a special permutation function for parallel processing applications. The
mapping corresponding to a perfect shuffle is shown in Fig. Its inverse is shown on the
right-hand side.
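Assuming n = 2^k nodes whose addresses are k-bit numbers, the perfect shuffle maps the address a(k-1) a(k-2) ... a0 to a(k-2) ... a0 a(k-1), i.e. a one-bit left rotation of the address; its inverse is the corresponding right rotation. A small sketch (the 8-node size is arbitrary):

# Sketch: perfect shuffle and its inverse as bit rotations of node addresses.
def perfect_shuffle(x, k):
    """Rotate the k-bit address x left by one bit."""
    msb = (x >> (k - 1)) & 1
    return ((x << 1) | msb) & ((1 << k) - 1)

def inverse_shuffle(x, k):
    """Rotate the k-bit address x right by one bit."""
    return (x >> 1) | ((x & 1) << (k - 1))

k = 3                                     # 2**3 = 8 nodes (arbitrary size)
for node in range(1 << k):
    s = perfect_shuffle(node, k)
    assert inverse_shuffle(s, k) == node
    print(f"{node:0{k}b} -> {s:0{k}b}")   # e.g. 001 -> 010, 100 -> 001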

Hypercube Routing Functions:


A three-dimensional binary cube network is shown in the figure. Three routing functions are defined by the three bits in the node address. For example, one can exchange data between adjacent nodes which differ in the least significant bit C0, as shown in Fig. (b). Two other routing patterns can be obtained by checking the middle bit C1 (Fig. c) and the most significant bit C2 (Fig. d), respectively.
• In general, an n-dimensional hypercube has n routing functions, defined by each bit of the
n-bit address.
• These data exchange functions can be used in routing messages in a hypercube
multicomputer.
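Since the routing function defined by bit i connects each node to the node whose address differs only in that bit, the neighbour along dimension i is obtained by complementing the bit, i.e. an XOR with 2^i. A short sketch for the 3-cube (node labels 000 to 111):

# Sketch: hypercube routing functions as single-bit complements (XOR).
def cube_route(node, i):
    """Neighbour of `node` along dimension i of a hypercube (flip bit i)."""
    return node ^ (1 << i)

n = 3                                      # 3-cube: 2**3 = 8 nodes
for i in range(n):                         # routing functions C0, C1, C2
    pairs = sorted({tuple(sorted((x, cube_route(x, i)))) for x in range(1 << n)})
    print(f"C{i}:", [(f"{a:03b}", f"{b:03b}") for a, b in pairs])
# C0 pairs nodes differing in the least significant bit, C2 in the most significant.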
Static connection Networks
Linear Array
• This is a one-dimensional network in which N nodes are connected by N - 1 links in a line.
• Internal nodes have degree 2, and the terminal nodes have degree 1.
• The diameter is N - 1.
• The bisection width is 1.
• The structure is not symmetric and poses a communication inefficiency when N becomes very large.

Ring and Chordal Ring


• A ring is obtained by connecting the two terminal nodes of a linear array with one extra
link.
• A ring can be unidirectional or bidirectional.
• It is symmetric with a node degree of 2.
• The diameter is ⌊N/2⌋ for a bidirectional ring and N for a unidirectional ring.

• By increasing the node degree from 2 to 3 or 4, we obtain two chordal rings


• In general, the more links added, the higher the node degree and the shorter the network
diameter.
For example, a completely connected network of 16 nodes has a node degree of 15 and the shortest possible diameter of 1.

Barrel Shifter: For a network of N = 16 nodes, the barrel shifter is obtained from the ring by adding extra links from each node to those nodes whose distance from it is an integer power of 2.

For N = 16, the barrel shifter has a node degree of 7 with a diameter of 2. The barrel shifter
complexity is still much lower than that of the completely connected network.

Tree and Star


• A binary tree of 31 nodes in five levels is shown in Fig. a.
• In general, a k-level, completely balanced binary tree has N = 2^k - 1 nodes.
• The maximum node degree is 3 and the diameter is 2(k - 1).
• However, the diameter is relatively long.
• The star is a two-level tree with a high node degree at the central node (d = N - 1) and a small constant diameter of 2.

Fat tree
• The channel width of a fat tree increases as we ascend from leaves to the root.
• Branches get thicker toward the root.
• The traffic toward the root becomes heavier in the conventional binary tree.
• The fat tree has been proposed to alleviate the problem.

Mesh and Torus


• A 3 * 3 example mesh network is shown in Fig.
• The mesh is a frequently used architecture which has been implemented in the Illiac IV,
MPP, DAP, and Intel Paragon with variations.
• In general, a k-dimensional mesh with N = n^k nodes has an interior node degree of 2k, and the network diameter is k(n - 1).
• The pure mesh as shown in the figure is not symmetric. The node degrees at the boundary and corner nodes are 3 and 2, respectively.

Illiac mesh: A variation of the mesh obtained by allowing wraparound connections. The Illiac mesh is topologically equivalent to a chordal ring of degree 4 (as shown in Fig. d above) for an n x n configuration.
In general, an n x n Illiac mesh has a diameter of d = n - 1, which is only half of the diameter of a pure mesh.
Torus
• The torus can be viewed as another variant of the mesh with an even shorter diameter.
• This topology combines the ring and mesh and extends to higher dimensions.

Systolic Arrays :
• This is a class of multidimensional pipelined array architectures
• In the example systolic array shown, the interior node degree is 6.
• In general, static systolic arrays are pipelined with multidirectional flow of data streams.

Hypercubes
• This is a binary n-cube architecture which has been implemented in the iPSC, nCUBE,
and CM-2 systems.
• In general, an n-cube consists of N = 2^n nodes spanning n dimensions, with two nodes per dimension. A 3-cube with 8 nodes is shown.
• A 4-cube can be formed by interconnecting the corresponding nodes of two 3-cubes.
Cube-Connected Cycles: This architecture is modified from the hypercube.
• A 3-cube is modified to form a 3-cube-connected-cycles (CCC) network.
• The idea is to cut off the corner nodes of the 3-cube and replace each by a ring (cycle) of 3 nodes.
• The CCC is a better architecture for building scalable systems if latency can be tolerated
in some way.
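The node-degree and diameter figures quoted for the individual static topologies above can be tabulated programmatically. The sketch below simply encodes the formulas stated in this section (the parameter values chosen for the comparison are arbitrary):

# Sketch: degree/diameter formulas for the static topologies described above.
def linear_array(N): return 2, N - 1            # interior degree 2, diameter N-1
def ring(N):         return 2, N // 2           # bidirectional ring
def complete(N):     return N - 1, 1            # completely connected network
def binary_tree(k):  return 3, 2 * (k - 1)      # k levels, N = 2**k - 1 nodes
def mesh_2d(n):      return 4, 2 * (n - 1)      # n x n mesh, interior degree 4
def hypercube(n):    return n, n                # n-cube, N = 2**n nodes

examples = [
    ("linear array, N = 16", linear_array(16)),
    ("ring, N = 16",         ring(16)),
    ("complete, N = 16",     complete(16)),
    ("binary tree, k = 5",   binary_tree(5)),   # 31 nodes
    ("2-D mesh, 4 x 4",      mesh_2d(4)),
    ("hypercube, n = 4",     hypercube(4)),     # 16 nodes
]
for name, (degree, diam) in examples:
    print(f"{name:22s} degree = {degree:2d}  diameter = {diam:2d}")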
Dynamic connection Networks
• Dynamic networks provide dynamic connectivity.
• The route through which data move from one PE to another is established at the time the communication has to be performed.
• All communication patterns are based on program demands.
• Instead of using fixed connections, switches or arbiters must be used along the
connecting paths
• More difficult to expand as compared to static network
• Paths are established as needed between processors

In increasing order of cost and performance, dynamic connection networks include


• Bus-based Networks
• Crossbar Switching Network
• Multistage Interconnection Networks

Bus-based Networks
• In a bus-based network, processors share a single communication resource [the bus].
• A bus is a highly non-scalable architecture, because only one processor can communicate on the bus at a time.
• Used in shared-memory parallel computers to communicate read and write requests to a shared
global memory.
• A bus-based interconnection network, used here to implement a shared-memory parallel
computer. Each processor (P) is connected to the bus, which in turn is connected to the
global memory. A cache associated with each processor stores recently accessed memory
values in an effort to reduce the bus traffic.

Switch Modules

• Each input can be connected to one or more of the outputs. One-to-one and one-to-many
mappings are allowed.
• When only one-to-one mappings (permutations) are allowed, we call the module an n x n
crossbar switch.
• For example, a 2 x 2 crossbar switch can connect two possible patterns: straight or
crossover.

In general, an n x n crossbar can achieve n! permutations. The numbers of legitimate connection patterns for switch modules of various sizes are listed in the table.

Multistage Interconnection Networks


• MINs have been used in both MIMD and SIMD computers.
• A generalized multistage network is shown.
A number of a x b switches are used in each stage. Fixed inter-stage connections are used between the switches in adjacent stages. The switches can be dynamically set to establish the desired connections between the inputs and outputs. MINs differ in the switch modules used and in the kinds of inter-stage connection (ISC) patterns used. The simplest switch module is the 2 x 2 switch.
The ISC patterns often used include perfect shuffle, butterfly, crossbar, cube connection, etc.

Omega Network
• A 16 x 16 Omega network is shown in the figure.
• Four possible connections of 2 x 2 switches used in constructing the Omega network.
• Four stages of 2 X 2 switches are needed.
• There are 16 inputs on the left and 16 outputs on the right.
• The ISC pattern is the perfect shuffle over 16 objects.
[Note: figure of an 8 x 8 Omega network.]
• In general, an n-input Omega network requires log2(n) stages of 2 x 2 switches. Each stage requires n/2 switch modules.
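A sketch of the sizing rule just stated, together with the perfect-shuffle inter-stage connection used between stages (n is assumed to be a power of two):

# Sketch: sizing an n-input Omega network and its perfect-shuffle ISC wiring.
import math

def omega_dimensions(n):
    """Number of stages and of 2x2 switch modules in an n-input Omega network."""
    stages = int(math.log2(n))          # log2(n) stages of 2 x 2 switches
    return stages, stages * (n // 2)    # n/2 switches per stage

def shuffle(line, n):
    """Perfect-shuffle ISC: rotate the log2(n)-bit line number left by one bit."""
    k = int(math.log2(n))
    return ((line << 1) | (line >> (k - 1))) & (n - 1)

for n in (8, 16):
    stages, switches = omega_dimensions(n)
    print(f"{n}-input Omega network: {stages} stages, {switches} switches")
    print("  shuffle wiring:", [shuffle(i, n) for i in range(n)])
# 8-input: 3 stages, 12 switches; 16-input: 4 stages, 32 switches.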
Crossbar Network
• A crossbar network can be visualized as a single-stage switch network.
• The cross point switches provide dynamic connections between (source, destination)
pairs.
• Each cross point switch can provide a dedicated connection path between a pair.
• The switch can be set on or off dynamically upon program demand.

Two types of crossbar networks as shown below.


• To build a shared-memory multiprocessor, one can use a crossbar network between the
processors and memory modules

• The figure above shows a 16 x 16 crossbar network which connected 16 PDP-11 processors to 16 memory modules. The 16 memory modules could be accessed by the processors in parallel, but each memory module can satisfy only one processor request at a time.
• Each processor can generate a sequence of addresses to access multiple memory modules simultaneously. Only one crosspoint switch can be set on in each column; however, several crosspoint switches can be set on simultaneously in order to support parallel memory accesses.
• The second crossbar network shown is for interprocessor communication.
• The PEs are processors with attached memory. The CPs stand for control processors, which supervise the entire system operation, including the crossbar network. In this crossbar, at any one time only one crosspoint switch can be set on in each row and each column; only one-to-one connections are provided.
• An n x n crossbar connects at most n (source, destination) pairs at a time.

STUDY 3RD CHAPTER FROM TEXT BOOK

MODULE – 1 REVIEW QUESTIONS

1. With a neat diagram, explain the operation of tagged-token data flow computer.
2. Briefly explain the different types of data dependencies. For the following code segment,
draw the dependence graph:
S1 Load R1, A
S2 Add R2, R1
S3 Move R1, R3
S4 Store B, R1
3. Write a note on speedup performance laws.
4. Explain the different performance metrics.
5. Explain the following
Clock rate and CPI
Performance factors.
MIPS rate
Floating point operations per second
Throughput rate
6. Write a note on grain sizes.
1. Explain two categories of parallel computers with their architecture. ( 10 m)
2. Explain the Flynn’s classification of computer architectures. (6 m)
3. Explain data, control and resource dependence. 10 m
4. Explain Bernstein’s conditions. (4 m)

5.

7. With neat diagram, discuss the operation of vector super computers. (8 M)


8. Write system attributes, which affects performance factors of the system. ( 04 Marks)
9. Consider the execution of an object code with 200,000 instructions on a 40 MHz
processor. The program consists of four major types of instructions. The instruction mix and the
number of cycles [CPI] needed for each instruction type are given below:

I. Find total number of cycles required to execute program.

II. Calculate the average CPI when the program is executed on a uniprocessor.

III. Calculate MIPS rate.


10. A 400 MHz processor was used to execute a benchmark program with the following
instruction mix and clock cycle counts:

Determine the effective CPI, MIPS rate and execution time for this program.

11.

12. Consider the following sequence of instructions:


13. Explain performance factors vs system attributes.

14. Explain the UMA, NUMA and COMA multiprocessor models.

15. Explain the following static connection network topologies a) Linear array b) Ring and
chordal ring c)Barrel shifter d) Tree and star e) Fat tree f) Mesh and Torus
16. Compare control flow and data flow architectures.
17.

18. Define the following terms:


Granularity
Latency
Grain packing and scheduling (4 m)
19.

20. Differentiate between shared-memory multiprocessors and distributed-memory


multicomputers.
ACA MODULE-2-REVIEW QUESTIONS
1. Define the following basic terms related to modern processor technology.
a) Instruction issue latency b) Instruction issue rate c) Simple operation latency
2. Describe the design space of modern processor families.
3. With diagrams, explain the pipelined execution of successive instructions in a base scalar
processor and in two underpipelined cases.
4. Explain the architectural models of a basic scalar computer system using block
diagrams.( CISC and RISC architectures )
5. Draw and explain the block diagram of a typical superscalar RISC processor architecture
consisting of an integer unit and a floating-point unit.
6. Explain the architecture of VLIW processor and its pipeline operations with the help of
diagrams.
7. Explain architectural distinctions between CISC and RISC processors.
8. Compare the following:
I. Base scalar processing
II. Super scalar processing
III. Pipelining technique

9. Compare the superscalar processor and VLIW processor.


10. Write a note on pipelining in superscalar processors.
11. Draw and explain a typical VLIW processor architecture and its instruction format.
Explain its pipeline operation (with degree m=3) using timing diagram.
12. Explain the overlapping register window mechanism of SPARC architecture.
13. Compare the pipelined execution style in a vector processor with that in a base scalar
processor.
14. Write a short note on vector instructions.
15. Explain hierarchical memory technology.
16. Explain the inclusion, coherence and locality properties with reference to memory
hierarchy.
17. Explain locality of references.
18. Write a note on TLB, Paging and Segmentation.
19. Explain two virtual memory models for multiprocessor systems.
20. Explain different page replacement policies.
