
INTRODUCTION TO PARALLEL PROCESSING

Parallel computer structures will be characterized as pipelined computers, array processors, and multiprocessor systems. Several new computing concepts, including the data-flow and VLSI approaches, are also introduced.
1.1 EVOLUTION OF COMPUTER SYSTEMS
The evolution of computer systems has been physically marked by the rapid changing of building blocks from relays and vacuum tubes (1940-1950s) to discrete diodes and transistors (1950-1960s), to small- and medium-scale integrated (SSI/MSI) circuits (1960-1970s), and to large- and very-large-scale integrated (LSI/VLSI) devices (1970s and beyond). Increases in device speed and reliability and reductions in hardware cost and physical size have greatly enhanced computer performance.

1.1.1 Generations of Computer Systems


The first generation (1938-1953) The introduction of the first electronic analog computer in 1938 and the first electronic
digital computer, ENIAC (Electronic Numerical Integrator and Computer), in 1946 marked the beginning of the first
generation of computers. Electromechanical relays were used as switching devices in the 1940s, and vacuum tubes were
used in the 1950s.

The second generation (1952-1963) Transistors were invented in 1948. The first transistorized digital computer, TRADIC, was built by Bell Laboratories in 1954. Discrete transistors and diodes were the building blocks: 800 transistors were used in TRADIC. Printed circuits appeared.

The third generation (1962-1975) This generation was marked by the use of small-scale integrated (SSI) and medium-scale integrated (MSI) circuits as the basic building blocks. Multilayered printed circuits were used. Core memory was still used in the CDC-6600 and other machines, but by 1968 many fast computers, like the CDC-7600, began to replace cores with solid-state memories.

The fourth generation (1972-present) The present generation of computers emphasizes the use of large-scale integrated (LSI) circuits for both logic and memory sections. High-density packaging has appeared. High-level languages are being extended to handle both scalar and vector data, like the extended Fortran in many vector processors.

The future Computers to be used in the 1990s may be the next generation. Very large-scale integrated (VLSI) chips will be used along with high-density modular design. Multiprocessors, like the 16 processors in the S-1 project at Lawrence Livermore National Laboratory and in Denelcor's HEP, will be required. The Cray-2 is expected to have four processors, to be delivered in 1985. More than 1000 mega floating-point operations per second (megaflops) are expected in these future supercomputers.

1.1.2 Trends Towards Parallel Processing


According to Sidney Fernbach: "Today's large computers (mainframes) would have been considered 'supercomputers' 10 to 20 years ago. By the same token, today's supercomputers will be considered 'state-of-the-art' standard equipment 10 to 20 years from now." From an application point of view, the mainstream usage of computers is experiencing a trend of four ascending levels of sophistication:

Data processing
Information processing
Knowledge processing
Intelligence processing

We are in an era which is promoting the use of computers not only for conventional data-information processing, but also toward the building of workable machine knowledge-intelligence systems to advance human civilization. Many computer scientists feel that the degree of parallelism exploitable at the two highest processing levels should be higher than that at the data-information processing levels.

From an operating system point of view, computer systems have improved chronologically in four phases:
Batch processing
Multiprogramming
Time sharing
Multiprocessing

In these four operating modes, the degree of parallelism increases sharply from phase to phase. The general trend is to emphasize parallel processing of information. In what follows, the term information is used with an extended meaning to include data, information, knowledge, and intelligence. We formally define parallel processing as follows:

Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the
computing process. Concurrency implies parallelism, simultaneity, and pipelining. Parallel events may occur in multiple resources
during the same time interval; simultaneous events may occur at the same time instant; and pipelined events may occur in
overlapped time spans.

Parallel processing can be challenged in four programmatic levels:


Job or program level
Task or procedure level
Interinstruction level
Intrainstruction level

The highest job level is often conducted algorithmically. The lowest intrainstruction level is often implemented directly by hardware means. Hardware roles increase from high to low levels. On the other hand, software implementations increase from low to high levels. The trade-off between hardware and software approaches to solving a problem is always a very controversial issue. As hardware cost declines and software cost increases, more and more hardware methods are replacing the conventional software approaches. The trend is also supported by the increasing demand for a faster real-time, resource-sharing, and fault-tolerant computing environment.

As far as parallel processing is concerned, the general architectural trend is being shifted away from conventional uniprocessor systems to multiprocessor systems or to an array of processing elements controlled by one uniprocessor. In all cases, a high degree of pipelining is being incorporated into the various system levels.

1.2 PARALLELISM IN UNIPROCESSOR SYSTEMS


1.2.1 Basic Uniprocessor Architecture
A typical uniprocessor computer consists of three major components: the main memory, the central processing unit (CPU), and the input-output (I/O) subsystem.

The architectures of two commercially available uniprocessor computers are given below to show the possible interconnection of structures among the three subsystems. We will examine major components in the CPU and in the I/O subsystem.

1.2.2 Parallel Processing Mechanisms


A number of parallel processing mechanisms have been developed in uniprocessor computers. We identify them in the following six
categories:
Multiplicity of functional units
Parallelism and pipelining within the CPU
Overlapped CPU and I/O operations
Use of a hierarchical memory system
Balancing of subsystem bandwidths
Multiprogramming and time sharing

Multiplicity of functional units The early computer had only one arithmetic and logic unit (ALU) in its CPU. Furthermore, the ALU could only perform one function at a time, a rather slow process for executing a long sequence of arithmetic logic instructions. In practice, many of the functions of the ALU can be distributed to multiple and specialized functional units which can operate in parallel. The CDC-6600 (designed in 1964) has 10 functional units built into its CPU (Figure 1.5). These 10 units are independent of each other and may operate simultaneously. A scoreboard is used to keep track of the availability of the functional units and registers being demanded. With 10 functional units and 24 registers available, the instruction issue rate can be significantly increased.
Another good example of a multifunction uniprocessor is the IBM 360/91 (1968), which has two parallel execution units.

Parallelism and pipelining within the CPU Parallel adders, using such techniques as carry-lookahead and carry-save, are now built into almost all ALUs. This is in contrast to the bit-serial adders used in the first-generation machines. High-speed multiplier recoding and convergence division are techniques for exploring parallelism and the sharing of hardware resources for the functions of multiply and divide. The use of multiple functional units is a form of parallelism within the CPU. Various phases of instruction execution are now pipelined, including instruction fetch, decode, operand fetch, arithmetic logic execution, and store result. To facilitate overlapped instruction executions through the pipe, instruction prefetch and data buffering techniques have been developed.

Overlapped CPU and I/O operations I/O operations can be performed simultaneously with the CPU computations by using separate I/O controllers, channels, or I/O processors. The direct-memory-access (DMA) channel can be used to provide direct information transfer between the I/O devices and the main memory. The DMA is conducted on a cycle-stealing basis, which is apparent to the CPU.

Use of hierarchical memory system Usually, the CPU is about 1000 times faster than memory access. A hierarchical memory system can be used to close up the speed gap. The computer memory hierarchy is conceptually illustrated in Figure 1.6. The innermost level is the register files directly addressable by the ALU. Cache memory can be used to serve as a buffer between the CPU and the main memory. Block access of the main memory can be achieved through multiway interleaving across parallel memory modules (see Figure 1.4). Virtual memory space can be established with the use of disks and tape units at the outer levels.

Multiprogramming and Time Sharing

Multiprogramming Within the same time interval, there may be multiple processes active in a computer, competing for memory, I/O, and CPU resources. We are aware of the fact that some computer programs are CPU-bound (computation intensive), and some are I/O-bound (input-output intensive).

Time sharing Multiprogramming on a uniprocessor is centered around the sharing of the CPU by many programs. Sometimes a high-priority program may occupy the CPU for too long to allow others to share. This problem can be overcome by using a time-sharing operating system.
1.3 PARALLEL COMPUTER STRUCTURES
Parallel computers are those systems that emphasize parallel processing. The basic architectural features of parallel computers are introduced below. We divide parallel computers into three architectural configurations:

Pipeline computers
Array processors
Multiprocessor systems

A pipeline computer performs overlapped computations to exploit temporal parallelism. An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism. A multiprocessor system achieves asynchronous parallelism through a set of interactive processors with shared resources (memories, database, etc.).

1.4 ARCHITECTURAL CLASSIFICATION SCHEMES


Three computer architectural classification schemes are presented in this section. Flynn's classification (1966) is based on the multiplicity of instruction streams and data streams in a computer system. Feng's scheme (1972) is based on serial versus parallel processing. Händler's classification (1977) is determined by the degree of parallelism and pipelining in various subsystem levels.
1.4.1 Multiplicity of Instruction-Data Streams
In general, digital computers may be classified into four categories, according to the multiplicity of instruction and data
streams. This scheme for classifying computer organizations was introduced by Michael J. Flynn.
Computer organizations are characterized by the multiplicity of the hardware provided to service the instruction and data streams. Listed below are Flynn's four machine organizations:

Single instruction stream-single data stream (SISD)
Single instruction stream-multiple data stream (SIMD)
Multiple instruction stream-single data stream (MISD)
Multiple instruction stream-multiple data stream (MIMD)

SISD computer organization This organization, shown in Figure 1.16a, represents most serial computers available today. Instructions are executed sequentially but may be overlapped in their execution stages (pipelining). Most SISD uniprocessor systems are pipelined. An SISD computer may have more than one functional unit in it. All the functional units are under the supervision of one control unit.
SIMD computer organization This class corresponds to array processors, introduced in Section 1.3.2. As illustrated in Figure 1.16b, there are multiple processing elements supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams. The shared memory subsystem may contain multiple modules.
MISD computer organization This organization is conceptually illustrated in Figure 1.16c. There are n processor units, each receiving distinct instructions operating over the same data stream and its derivatives. The results (output) of one processor become the input (operands) of the next processor in the micropipe. This structure has received much less attention and has been challenged as impractical by some computer architects. No real embodiment of this class exists.
MIMD computer organization Most multiprocessor systems and multiple computer systems can be classified in this category (Figure 1.16d). An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors. If the n data streams were derived from disjoint subspaces of the shared memories, then we would have the so-called multiple SISD (MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems.

1.4.2 Serial Versus Parallel Processing


Tse-yun Feng has suggested the use of the degree of parallelism to classify various computer architectures.

There are four types of processing methods under this classification:
Word-serial and bit-serial (WSBS)
Word-parallel and bit-serial (WPBS)
Word-serial and bit-parallel (WSBP)
Word-parallel and bit-parallel (WPBP)
WSBS has been called bit-serial processing because one bit (n = m = 1) is processed at a time, a rather slow process. This was done only in the first-generation computers. WPBS (n = 1, m > 1) has been called bis (bit-slice) processing because an m-bit slice is processed at a time. WSBP (n > 1, m = 1), as found in most existing computers, has been called word-slice processing because one word of n bits is processed at a time. Finally, WPBP (n > 1, m > 1) is known as fully parallel processing (or simply parallel processing, if no confusion exists), in which an array of n x m bits is processed at one time, the fastest processing mode of the four. In Table 1.4, we have listed a number of computer systems under each processing mode. The system parameters n, m are also shown for each system. The bit-slice processors, like STARAN, MPP, and DAP, all have long bit slices. Illiac-IV and PEPE are two word-slice array processors.

1.4.3 Parallelism Versus Pipelining


Wolfgang Händler has proposed a classification scheme for identifying the parallelism degree and pipelining degree built into the hardware structures of a computer system. He considers parallel-pipeline processing at three subsystem levels:

Processor control unit (PCU)
Arithmetic logic unit (ALU)
Bit-level circuit (BLC)

The functions of the PCU and the ALU should be clear to us. Each PCU corresponds to one processor or one CPU. The ALU is equivalent to the processing element (PE) we specified for SIMD array processors. The BLC corresponds to the combinational logic circuitry needed to perform 1-bit operations in the ALU. A computer system C can be characterized by a triple containing six independent entities, as defined below:

T(C) = <K x K', D x D', W x W'>    (1.13)

where
K  = the number of processors (PCUs) within the computer
D  = the number of ALUs (or PEs) under the control of one PCU
W  = the word length of an ALU or of a PE
W' = the number of pipeline stages in all ALUs or in a PE
D' = the number of ALUs that can be pipelined (pipeline chaining, to be described in Chapter 4)
K' = the number of PCUs that can be pipelined

Several real computer examples are used to clarify the above parametric descriptions. The Texas Instruments Advanced Scientific Computer (TI-ASC) has one controller controlling four arithmetic pipelines, each with a 64-bit word length and eight stages. Thus, we have

T(ASC) = <1 x 1, 4 x 1, 64 x 8> = <1, 4, 64 x 8>

Amdahl's law
In computer architecture, Amdahl's law (or Amdahl's argument[1]) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967. Amdahl's law is often used in parallel computing to predict the theoretical speedup when using multiple processors.

Amdahl's law applies only to the cases where the problem size is fixed.
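For completeness, the standard fixed-workload form of the law (stated here by us, not quoted from these notes) is: if a fraction f of the work is inherently serial and the remaining 1 - f is perfectly parallelizable over n processors, then

S(n) = \frac{1}{f + \frac{1-f}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{f}

For example, with f = 0.1 and n = 16, S = 1/(0.1 + 0.9/16) = 6.4, far below the ideal speedup of 16.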

Moore's Law
The quest for higher-performance digital computers seems unending. In the past two decades, the performance of microprocessors has enjoyed an exponential growth. The growth of microprocessor speed/performance by a factor of 2 every 18 months (or about 60% per year) is known as Moore's law. This growth is the result of a combination of two factors:
1. Increase in complexity (related both to higher device density and to
larger size) of VLSI chips, projected to rise to around 10 M transistors per
chip for microprocessors, and 1B for dynamic random-access memories
(DRAMs), by the year 2000 [SIA94]
2. Introduction of, and improvements in, architectural features such as on-
chip cache memories, large instruction buffers, multiple instruction issue
per cycle, multithreading, deep pipelines, out-of-order instruction
execution, and branch prediction
Moore's law was originally formulated in 1965 in terms of the doubling of chip complexity every year (later revised to every 18 months) based only on a small number of data points [Scha97]. Moore's revised prediction matches almost perfectly the actual increases in the number of transistors in DRAM and microprocessor chips.
Moore's law seems to hold regardless of how one measures processor performance: counting the number of executed instructions per second (IPS), counting the number of floating-point operations per second (FLOPS), or using sophisticated benchmark suites that attempt to measure the processor's performance on real applications. This is because all of these measures, though numerically different, tend to rise at roughly the same rate. Figure 1.1 shows that the performance of actual processors has in fact followed Moore's law quite closely since 1980 and is on the verge of reaching the GIPS (giga IPS = 10^9 IPS) milestone.

[Figure 1.1. The exponential growth of microprocessor performance, known as Moore's law, shown over the past two decades.]

PRINCIPLES OF SCALABLE PERFORMANCE


1. Performance Metrics and Measures
1.1. Parallelism Profile in Programs
1.1.1. Degree of Parallelism The number of processors used at any instant to execute a program is called the degree of
parallelism (DOP); this can vary over time.
DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel
program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting
conditions.
A plot of DOP vs. time is called a parallelism profile.
1.1.2. Average Parallelism - 1
Assume the following:
n homogeneous processors
maximum parallelism in a profile is m
Ideally, n >> m
D, the computing capacity of a processor, is something like MIPS or Mflops without regard for memory latency, etc.
i is the number of processors busy in an observation period (e.g. DOP = i)
W is the total work (instructions or computations) performed by a program
A is the average parallelism in the program
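As a small illustration of the quantities defined above (the profile values and variable names below are ours, chosen only for the example): the average parallelism A is the time-weighted average of the DOP over the parallelism profile, and W is the total work accumulated under that profile.

# Average parallelism from a discrete parallelism profile (illustrative values).
# 'profile' maps a DOP value i to t_i, the total time the program spends at that DOP.
profile = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}

total_time = sum(profile.values())
total_work = sum(i * t for i, t in profile.items())   # W, in units of one processor's capacity
A = total_work / total_time                           # average parallelism

print(f"W = {total_work}, T = {total_time}, A = {A:.2f}")   # W = 26.0, T = 10.0, A = 2.60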

1.1.3. Average Parallelism - 2
1.1.4. Average Parallelism - 3

1.1.5. Available Parallelism


Various studies have shown that the potential parallelism in scientific and engineering calculations can be very
high (e.g. hundreds or thousands of instructions per clock cycle).
But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).
1.1.6. Basic Blocks

A basic block is a sequence or block of instructions with one entry and one exit.
Basic blocks are frequently used as the focus of optimizers in compilers (since it's easier to manage the use of registers utilized in the block).
Limiting optimization to basic blocks limits the instruction level parallelism that can be obtained (to about 2 to 5 in
typical code).
1.1.7. Asymptotic Speedup - 1
1.1.8. Asymptotic Speedup - 2

1.2. Mean Performance
We seek to obtain a measure that characterizes the mean, or average, performance of a set of
benchmark programs with potentially many different execution modes (e.g. scalar, vector, sequential, parallel).
We may also wish to associate weights with these programs to emphasize these different modes and yield a more
meaningful performance measure.
1.2.1. Arithmetic Mean

The arithmetic mean is familiar (sum of the terms divided by the number of terms).
Our measures will use execution rates expressed in MIPS or Mflops.
The arithmetic mean of a set of execution rates is proportional to the sum of the inverses of the execution times; it
is not inversely proportional to the sum of the execution times.
Thus the arithmetic mean fails to represent the real time consumed by the benchmarks when executed.

1.2.2. Harmonic Mean


Instead of using arithmetic or geometric mean, we use the harmonic mean execution rate,
which is just the inverse of the arithmetic mean of the execution time (thus guaranteeing
the inverse relation not exhibited by the other means).
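A small sketch (the rates are invented) of why the harmonic mean is the right choice here: when each benchmark performs the same amount of work, the harmonic mean of the execution rates equals total work divided by total time, whereas the arithmetic mean is dominated by the fastest benchmark.

# Three benchmarks, each executing 1 million instructions, at the rates below (MIPS).
rates = [10.0, 100.0, 1000.0]

arithmetic_mean = sum(rates) / len(rates)
harmonic_mean = len(rates) / sum(1.0 / r for r in rates)

total_work = float(len(rates))                # millions of instructions
total_time = sum(1.0 / r for r in rates)      # seconds
actual_rate = total_work / total_time         # MIPS actually delivered over the whole run

print(arithmetic_mean)   # 370.0   -- wildly optimistic
print(harmonic_mean)     # ~27.03  -- matches the delivered rate
print(actual_rate)       # ~27.03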

1.2.3. Weighted Harmonic Mean


If we associate weights fi with the benchmarks, then we can compute the weighted harmonic
mean:
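(Standard definition, with execution rates R_i and weights f_i summing to 1:)

R_h^{*} = \left( \sum_{i=1}^{n} \frac{f_i}{R_i} \right)^{-1}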

1.2.4. Weighted Harmonic Mean Speedup


T1 = 1/R1 = 1 is the sequential execution time on a single processor with rate R1 = 1.
Ti = 1/Ri = 1/i is the execution time using i processors with a combined execution rate of Ri = i.
Now suppose a program has n execution modes with associated weights f1 ... fn. The weighted harmonic mean speedup is defined as:
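(In the standard formulation, with R_1 = 1 so that T_1 = 1:)

S = \frac{T_1}{T^{*}} = \left( \sum_{i=1}^{n} \frac{f_i}{R_i} \right)^{-1}

where T^{*} = \sum_i f_i / R_i is the weighted mean of the per-mode execution times T_i = 1/R_i.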
1.2.5. Amdahl's Law
Assume Ri = i, and the weights are (α, 0, ..., 0, 1-α).
Basically this means the system is used sequentially (with probability α) or all n processors are used (with probability 1-α).
This yields the speedup equation known as Amdahl's law:

The implication is that the best speedup possible is 1/α, regardless of n, the number of processors.
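A minimal sketch (our own helper names) that evaluates this speedup numerically for the Amdahl weight vector; the closed form works out to S_n = n / (1 + (n - 1)α), which indeed approaches 1/α as n grows.

def weighted_harmonic_speedup(weights):
    # S = 1 / sum(f_i / R_i), assuming R_i = i (mode i uses i processors at rate i).
    return 1.0 / sum(f / i for i, f in enumerate(weights, start=1) if f > 0)

def amdahl_weights(alpha, n):
    # Only mode 1 (sequential, weight alpha) and mode n (fully parallel, weight 1 - alpha).
    w = [0.0] * n
    w[0] = alpha
    w[-1] = 1.0 - alpha
    return w

for n in (4, 16, 1024):
    print(n, weighted_harmonic_speedup(amdahl_weights(0.1, n)))
# Prints speedups of about 3.08, 6.4, and 9.91: bounded by 1/alpha = 10 no matter how large n gets.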
1.3. Efficiency, Utilization, and Quality
1.3.1. System Efficiency 1

Assume the following definitions:
O (n) = total number of unit operations performed by an n processor system in completing a program P.
T (n) = execution time required to execute the program P on an n processor system.
O (n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by
a constant factor.
If we define O (1) = T (1), then it is logical to expect that T (n) < O (n) when n > 1 if the program P is able to make
any use at all of the extra processor(s).
1.3.2. System Efficiency 2
Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as
S(n) = T(1) / T(n)
Recall that we expect T(n) < T(1), so S(n) ≥ 1.
System efficiency is defined as
E(n) = S(n) / n = T(1) / (n x T(n))
It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup. Thus 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.

1.3.3. Redundancy
The redundancy in a parallel computation is defined as
R(n) = O(n) / O(1)
What values can R(n) obtain?
R(n) = 1 when O(n) = O(1), or when the number of operations performed is independent of the number of processors, n. This is the ideal case.

R(n) = n when all processors perform the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!

The R(n) figure indicates to what extent the software parallelism is carried over to the hardware implementation without having extra operations performed.

1.3.4. System Utilization


System utilization is defined as
U(n) = R(n) x E(n) = O(n) / (n x T(n))
It indicates the degree to which the system resources were kept busy during execution of the program. Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1, and the worst is 1/n.
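A tiny worked example tying the four measures together (the values of T and O below are invented):

# Illustrative figures for a program P run on an n = 8 processor system.
n = 8
T1, Tn = 1000.0, 200.0      # execution times (arbitrary units); O(1) = T(1) by definition
O1, On = 1000.0, 1400.0     # unit operations performed (the parallel run does some extra work)

S = T1 / Tn                 # speedup     = 5.0
E = S / n                   # efficiency  = 0.625
R = On / O1                 # redundancy  = 1.4
U = R * E                   # utilization = O(n) / (n * T(n)) = 0.875

print(S, E, R, U)           # note 1/n <= E <= 1 and 1/n <= U <= 1, as stated above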
SPEEDUP PERFORMANCE LAWS
The main objective is to produce the results as early as possible. In other words minimal turnaround time is the
primary goal.
Three performance laws are defined below:
1. Amdahl's law (1967) is based on a fixed workload or fixed problem size.
2. Gustafson's law (1987) applies to scalable problems, where the problem size increases with the increase in machine size.
3. The speedup model by Sun and Ni (1993) is for scaled problems bounded by memory capacity.

Amdahl's Law for fixed workload


In many practical applications the computational workload is often fixed with a fixed problem size. As the number
of processors increases, the fixed workload is distributed.
Speedup obtained for time-critical applications is called fixed-load speedup.
Fixed-Load Speedup

The ideal speedup formula given below is based on a fixed workload, regardless of machine size.

We consider below two cases: DOP < n and DOP ≥ n.

Parallel algorithm
In computer science, a parallel algorithm, as opposed to a traditional serial algorithm, is an algorithm which can
be executed a piece at a time on many different processing devices, and then combined together again at the end
to get the correct result.[1]
Many parallel algorithms are executed concurrently, though concurrent algorithms are in general a distinct concept; the two are often conflated, with which aspect of an algorithm is parallel and which is concurrent not being clearly distinguished. Further, non-parallel, non-concurrent algorithms are often referred to as "sequential algorithms", by contrast with concurrent algorithms.

Examples of Parallel Algorithms


This section describes and analyzes several parallel algorithms. These algorithms provide examples of how to analyze algorithms in terms of work
and depth and of how to use nested data-parallel constructs. They also introduce some important ideas concerning parallel algorithms. We mention
again that the main goals are to have the code closely match the high-level intuition of the algorithm, and to make it easy to analyze the asymptotic
performance from the code.

Parallel Algorithm Complexity


Analysis of an algorithm helps us determine whether the algorithm is useful or not. Generally, an algorithm is analyzed based on its execution time (time complexity) and the amount of space (space complexity) it requires.
Since we have sophisticated memory devices available at reasonable cost, storage space is no longer an issue. Hence, space complexity is not given as much importance.
Parallel algorithms are designed to improve the computation speed of a computer. For analyzing a parallel algorithm, we normally consider the following parameters:

Time complexity (execution time),
Total number of processors used, and
Total cost.

Time Complexity
The main reason behind developing parallel algorithms was to reduce the computation time of an algorithm. Thus, evaluating the
execution time of an algorithm is extremely important in analyzing its efficiency.
Execution time is measured on the basis of the time taken by the algorithm to solve a problem. The total execution time is calculated from the moment when the algorithm starts executing to the moment it stops. If all the processors do not start or end execution at the same time, then the total execution time of the algorithm is measured from the moment the first processor starts its execution to the moment the last processor stops its execution.
The time complexity of an algorithm can be classified into three categories:
Worst-case complexity When the amount of time required by an algorithm for a given input is maximum.

Average-case complexity When the amount of time required by an algorithm for a given input is average.

Best-case complexity When the amount of time required by an algorithm for a given input is minimum.

Asymptotic Analysis
The complexity or efficiency of an algorithm is the number of steps executed by the algorithm to get the desired output. Asymptotic
analysis is done to calculate the complexity of an algorithm in its theoretical analysis. In asymptotic analysis, a large length of input
is used to calculate the complexity function of the algorithm.
Note: Asymptotic is a condition where a line tends to meet a curve, but they do not intersect. Here the line and the curve are asymptotic to each other.
Asymptotic notation is the easiest way to describe the fastest and slowest possible execution time for an algorithm using high bounds and low bounds on speed. For this, we use the following notations:

Big O notation
Omega notation
Theta notation
Big O notation
In mathematics, Big O notation is used to represent the asymptotic characteristics of functions. It represents the behavior of a function for large inputs in a simple and accurate way. It is a method of representing the upper bound of an algorithm's execution time. It represents the longest amount of time that the algorithm could take to complete its execution. The function
f(n) = O(g(n))
if and only if there exist positive constants c and n0 such that f(n) ≤ c · g(n) for all n ≥ n0.

Omega notation
Omega notation is a method of representing the lower bound of an algorithm's execution time. The function
f(n) = Ω(g(n))
if and only if there exist positive constants c and n0 such that f(n) ≥ c · g(n) for all n ≥ n0.

Theta Notation
Theta notation is a method of representing both the lower bound and the upper bound of an algorithm's execution time. The function

f(n) = Θ(g(n))
if and only if there exist positive constants c1, c2, and n0 such that c1 · g(n) ≤ f(n) ≤ c2 · g(n) for all n ≥ n0.
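A short worked example covering all three notations (the function is chosen arbitrarily): take f(n) = 3n^2 + 2n. Then

3n^2 + 2n \le 5n^2 \;\; (n \ge 1) \;\Rightarrow\; f(n) = O(n^2), \qquad 3n^2 + 2n \ge 3n^2 \;\; (n \ge 1) \;\Rightarrow\; f(n) = \Omega(n^2)

so with c1 = 3, c2 = 5, and n0 = 1 we also have f(n) = Θ(n^2).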

Speedup of an Algorithm
The performance of a parallel algorithm is determined by calculating its speedup. Speedup is defined as the ratio of the worst-case
execution time of the fastest known sequential algorithm for a particular problem to the worst-case execution time of the parallel
algorithm.

Speedup = Worst-case execution time of the fastest known sequential algorithm for a particular problem / Worst-case execution time of the parallel algorithm
Number of Processors Used
The number of processors used is an important factor in analyzing the efficiency of a parallel algorithm. The cost to buy, maintain, and run the computers is calculated. The larger the number of processors used by an algorithm to solve a problem, the more costly the obtained result becomes.

Total Cost
The total cost of a parallel algorithm is the product of time complexity and the number of processors used in that particular algorithm.
Total cost = Time complexity x Number of processors used
Therefore, the efficiency of a parallel algorithm is

Efficiency = Worst-case execution time of the sequential algorithm / Total cost of the parallel algorithm
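A standard illustration of these definitions (our example, not from these notes): summing n numbers on p = n processors by a balanced binary tree of pairwise additions gives

T_{parallel} = \Theta(\log n), \qquad \text{Total cost} = n \cdot \Theta(\log n) = \Theta(n \log n), \qquad \text{Efficiency} = \frac{\Theta(n)}{\Theta(n \log n)} = \Theta\!\left(\frac{1}{\log n}\right)

so the parallel algorithm is fast but not cost-optimal, since the sequential algorithm performs only Θ(n) work.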

Models of Parallel Processing


Parallel processors come in many different varieties.
1. SIMD VERSUS MIMD ARCHITECTURES
Within the SIMD category, two fundamental design choices exist:
1. Synchronous versus asynchronous SIMD. In a SIMD machine, each processor can execute or ignore the instruction being broadcast
based on its local state or data-dependent conditions. However, this leads to some inefficiency in executing conditional computations.
For example, an if-then-else statement is executed by first enabling the processors for which the condition is satisfied and then flipping
the enable bit before getting into the else part. On the average, half of the processors will be idle for each branch. The situation is
even worse for case statements involving multiway branches. A possible cure is to use the asynchronous version of SIMD, known as
SPMD (spim-dee or single-program, multiple data), where each processor runs its own copy of the common program. The advantage of
SPMD is that in an if-then-else computation, each processor will only spend time on the relevant branch. The disadvantages include
the need for occasional synchronization and the higher complexity of each processor, which must now have a program memory and
instruction fetch/decode logic.
2. Custom- versus commodity-chip SIMD. A SIMD machine can be designed based on commodity (off-the-shelf) components or with
custom chips. In the first approach, components tend to be inexpensive because of mass production. However, such general-purpose
components will likely contain elements that may not be needed for a particular design. These extra components may complicate the
design, manufacture, and testing of the SIMD machine and may introduce speed penalties as well. Custom components (including ASICs
= application-specific ICs, multichip modules, or WSI = wafer-scale integrated circuits) generally offer better performance but lead to
much higher cost in view of their development costs being borne by a relatively small number of parallel machine users (as opposed to
commodity microprocessors that are produced in millions). As integrating multiple processors along with ample memory on a single
VLSI chip becomes feasible, a type of convergence between the two approaches appears imminent.
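A small sketch of the if-then-else inefficiency described in point 1 above, using NumPy masking purely as an analogy for lockstep SIMD lanes (the data and operation are invented): both branches are evaluated across all lanes and the results are merged under a mask, so each lane sits idle for one of the two branches, whereas an SPMD processor would execute only the branch relevant to its own data.

import numpy as np

x = np.array([1, -2, 3, -4, 5, -6, 7, -8])   # one element per processing element

# Lockstep-SIMD view of: if x > 0 then y = x * 2 else y = -x
mask = x > 0
then_result = x * 2          # computed on every lane; kept only where mask is True
else_result = -x             # computed on every lane; kept only where mask is False
y = np.where(mask, then_result, else_result)

print(y)                     # [ 2  2  6  4 10  6 14  8]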

Within the MIMD class, three fundamental issues or design choices are subjects of ongoing debates in the research community:
1. MPP: massively or moderately parallel processor. Is it more cost-effective to build a parallel processor out of a relatively small number of powerful processors or a massive number of very simple processors (the "herd of elephants" or the "army of ants" approach)? Referring to Amdahl's law, the first choice does better on the inherently sequential part of a computation while the second approach might allow a higher speedup for the parallelizable part. A general answer cannot be given to this question, as the best choice is both application- and technology-dependent.
2. Tightly versus loosely coupled MIMD. Which is a better approach to high-performance computing: using specially designed multiprocessors/multicomputers, or a collection of ordinary workstations that are interconnected by commodity networks (such as Ethernet or ATM) and whose interactions are coordinated by special system software and distributed file systems? The latter choice, sometimes referred to as network of workstations (NOW) or cluster computing, has been gaining popularity in recent years. However, many open problems exist for taking full advantage of such network-based loosely coupled architectures. The hardware, system software, and applications aspects of NOWs are being investigated by numerous research groups.
3. Explicit message passing versus virtual shared memory. Which scheme is better: forcing the users to explicitly specify all messages that must be sent between processors, or allowing them to program in an abstract higher-level model, with the required messages automatically generated by the system software? This question is essentially very similar to the one asked in the early days of high-level languages and virtual memory. At some point in the past, programming in assembly languages and doing explicit transfers between secondary and primary memories could lead to higher efficiency. However, nowadays, software is so complex and compilers and operating systems so advanced (not to mention processing power so cheap) that it no longer makes sense to hand-optimize programs, except in limited time-critical instances. However, we are not yet at that point in parallel processing, and hiding the explicit communication structure of a parallel machine from the programmer has nontrivial consequences for performance.

THE PRAM SHARED-MEMORY MODEL
The theoretical model used for conventional or sequential computers (SISD class) is
known as the random-access machine (RAM) (not to be confused with random-access
memory, which has the same acronym). The parallel version of RAM [PRAM (pea-ram)] constitutes an abstract model of the class of global-memory parallel processors. The abstraction consists of ignoring the details of the processor-to-memory interconnection network and taking the view that each processor can access any memory location in each machine cycle, independent of what other processors are doing.
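A toy, sequentially simulated sketch of how algorithms are usually stated for the PRAM abstraction (the helper name is ours): in each synchronized step every participating processor reads and writes the shared array, so n values are summed in about log2(n) steps.

def pram_sum(shared):
    # Simulate a PRAM-style tree reduction over a shared array (one loop body = one processor's work).
    a = list(shared)
    n = len(a)
    stride = 1
    while stride < n:
        # On a real PRAM, all of these iterations would happen in the same machine cycle.
        for i in range(0, n - stride, 2 * stride):
            a[i] += a[i + stride]
        stride *= 2
    return a[0]

print(pram_sum(range(1, 9)))   # 36, computed in 3 synchronized steps for n = 8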

DISTRIBUTED-MEMORY OR GRAPH MODELS


This network is usually represented as a graph, with vertices corresponding to processor-memory nodes and edges corresponding to communication links. If communication links are unidirectional, then directed edges are used. Undirected edges imply bidirectional communication, although not necessarily in both directions at once. Important parameters of an interconnection network include:

1. Network diameter: the longest of the shortest paths between various pairs of nodes, which should be relatively small if network latency is to be minimized. The network diameter is more important with store-and-forward routing (when a message is stored in its entirety and retransmitted by intermediate nodes) than with wormhole routing (when a message is quickly relayed through a node in small pieces).
2. Bisection (band)width: the smallest number (total capacity) of links that need to be cut in order to divide the network into two subnetworks of half the size. This is important when nodes communicate with each other in a random fashion. A small bisection (band)width limits the rate of data transfer between the two halves of the network, thus affecting the performance of communication-intensive algorithms.
3. Vertex or node degree: the number of communication ports required of each node, which should be a constant independent of network size if the architecture is to be readily scalable to larger sizes. The node degree has a direct effect on the cost of each node, with the effect being more significant for parallel ports containing several wires or when the node is required to communicate over all of its ports at once.
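As a quick illustration, the three parameters for two common topologies of roughly a thousand nodes, using the standard textbook formulas (the code is ours):

def hypercube_params(d):
    # d-dimensional binary hypercube with 2**d nodes
    return {"nodes": 2 ** d, "diameter": d, "bisection_width": 2 ** (d - 1), "node_degree": d}

def mesh_params(k):
    # k x k two-dimensional mesh without wraparound links
    return {"nodes": k * k, "diameter": 2 * (k - 1), "bisection_width": k, "node_degree": 4}

print(hypercube_params(10))   # 1024 nodes: diameter 10, bisection width 512, degree 10
print(mesh_params(32))        # 1024 nodes: diameter 62, bisection width 32, degree at most 4

The hypercube buys its small diameter and large bisection width with a node degree that grows with network size, which is exactly the scalability concern raised in point 3 above.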

CIRCUIT MODEL AND PHYSICAL REALIZATIONS


In a sense, the only sure way to predict the performance of a parallel architecture on a
given set of problems is to actually build the machine and run the programs on it. Because
this is often impossible or very costly, the next best thing is to model the machine at the
circuit level, so that all computational and signal propagation delays can be taken into
account. Unfortunately, this is also impossible for a complex supercomputer, both because
generating and debugging detailed circuit specifications are not much easier than a full-blown implementation and because a circuit simulator would take eons to run the simulation.
Despite the above observations, we can produce and evaluate circuit-level designs for
specific applications.

GLOBAL VERSUS DISTRIBUTED MEMORY


Within the MIMD class of parallel processors, memory can be global or distributed.
Global memory may be visualized as being in a central location where all processors can
access it with equal ease (or with equal difficulty, if you are a half-empty-glass type of
person). Figure 4.3 shows a possible hardware organization for a global-memory parallel
processor. Processors can access memory through a special processor-to-memory network.
A global-memory multiprocessor is characterized by the type and number p of processors,
the capacity and number m of memory modules, and the network architecture. Even though
p and m are independent parameters, achieving high performance typically requires that they
be comparable in magnitude (e.g., too few memory modules will cause contention among
the processors and too many would complicate the network design).

Distributed-memory architectures can be conceptually viewed as in Fig. 4.5. A collection of p processors, each with its own private memory,
communicates through an interconnection network. Here, the latency of the
interconnection network may be less critical, as each processor is likely to
access its own local memory most of the time. However, the communication
bandwidth of the network may or may not be critical, depending on the type of
parallel applications and the extent of task interdependencies. Note that each
processor is usually connected to the network through multiple links or channels
(this is the norm here, although it can also be the case for shared-memory parallel processors).

Cache coherence
In computer architecture, cache coherence is the uniformity
of shared resource data that ends up stored in multiple local
caches. When clients in a system maintain caches of a
common memory resource, problems may arise with
incoherent data, which is particularly the case with CPUs in
a multiprocessing system.
As an illustration, consider two clients that both have a cached copy of a particular memory block from a previous read. Suppose one client updates/changes that memory block; the other client could be left with an invalid cache of memory without any notification of the change. Cache coherence is intended to manage such conflicts by maintaining a coherent view of the data values in multiple caches.
The following are the requirements for cache coherence:[2]
Write Propagation
Changes to the data in any cache must be propagated to other copies (of that cache line) in the peer caches.
Transaction Serialization
Reads/Writes to a single memory location must be seen by all processors in the same order.

Coherence protocols
Coherence Protocols apply cache coherence in multiprocessor systems. The intention is that two clients must
never see different values of the same shared data.
The protocol must implement the basic requirements for coherence. It can be tailor made for the target
system/application.
Protocols can also be classified as Snooping (Snoopy/Broadcast) or Directory based. Typically, early systems used directory-based protocols where a directory would keep track of the data being shared and the sharers. In snoopy protocols, the transaction requests (read/write/upgrade) are sent out to all processors. All processors snoop the request and respond appropriately.
Write Propagation in Snoopy protocols can be implemented by either of the following:
Write Invalidate
When a write operation is observed to a location that a cache has a copy of, the cache controller invalidates its own copy of the snooped memory location, thus forcing a read from main memory of the new value on the next access.[4]
Write Update
When a write operation is observed to a location that a cache has a copy of, the cache controller updates
its own copy of the snooped memory location with the new data.

If the protocol design states that whenever any copy of the shared data is changed, all the other copies must be "updated" to reflect the change, then it is a write update protocol. If the design states that a write to a cached copy by any processor requires other processors to discard/invalidate their cached copies, then it is a write invalidate protocol.
However, scalability is one shortcoming of broadcast protocols.
Various models and protocols have been devised for maintaining coherence.
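A deliberately oversimplified toy model of write-invalidate snooping (not any real protocol: a single memory block, write-through updates, and only valid/invalid cache states), just to make the invalidation step concrete:

class ToyBus:
    # One shared memory block and one private cache entry per CPU.
    def __init__(self, n_cpus, value=0):
        self.memory = value
        self.caches = [None] * n_cpus      # None means "not cached / invalidated"

    def read(self, cpu):
        if self.caches[cpu] is None:       # miss: fetch the current value from memory
            self.caches[cpu] = self.memory
        return self.caches[cpu]

    def write(self, cpu, value):
        for other in range(len(self.caches)):
            if other != cpu:               # broadcast an invalidate to every other cache
                self.caches[other] = None
        self.caches[cpu] = value
        self.memory = value                # write-through, to keep the sketch short

bus = ToyBus(2)
print(bus.read(0), bus.read(1))   # 0 0   -- both clients now cache the block
bus.write(1, 42)                  # client 1 writes; client 0's copy is invalidated
print(bus.read(0))                # 42    -- client 0 misses and re-reads the new value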

Parallel Algorithm - Models


The model of a parallel algorithm is developed by considering a strategy for dividing the data and the processing method, and applying a suitable strategy to reduce interactions. In this chapter, we will discuss the following parallel algorithm models:

Data parallel model

Task graph model

Work pool model

Master slave model

Producer consumer or pipeline model

Hybrid model

Data Parallel
In the data-parallel model, tasks are assigned to processes and each task performs similar types of operations on different data. Data parallelism is a consequence of single operations being applied to multiple data items.

The data-parallel model can be applied to shared-address-space and message-passing paradigms. In the data-parallel model, interaction overheads can be reduced by selecting a locality-preserving decomposition, by using optimized collective interaction routines, or by overlapping computation and interaction.

The primary characteristic of data-parallel model problems is that the intensity of data
parallelism increases with the size of the problem, which in turn makes it possible to
use more processes to solve larger problems.

Example Dense matrix multiplication.
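A minimal data-parallel sketch of this example in plain Python (block-row decomposition; the matrices and pool size are arbitrary): every worker applies the same multiply-a-block-of-rows operation to a different slice of A.

from multiprocessing import Pool

def multiply_block(args):
    # Multiply a block of rows of A by the full matrix B (lists of lists, for brevity).
    rows, B = args
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in rows]

if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6], [7, 8]]
    B = [[1, 0], [0, 1]]                       # identity, so the product should equal A
    blocks = [A[0:2], A[2:4]]                  # one block of rows per worker process
    with Pool(processes=2) as pool:
        partial = pool.map(multiply_block, [(blk, B) for blk in blocks])
    C = [row for block in partial for row in block]
    print(C)                                   # [[1, 2], [3, 4], [5, 6], [7, 8]]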

Task Graph Model
In the task graph model, parallelism is expressed by a task graph. A task graph can be either trivial or nontrivial. In this model, the correlation among the tasks is utilized to promote locality or to minimize interaction costs. This model is used to solve problems in which the quantity of data associated with the tasks is huge compared to the amount of computation associated with them. The tasks are assigned so as to reduce the cost of data movement among the tasks.

Examples Parallel quick sort, sparse matrix factorization, and parallel algorithms
derived via divide-and-conquer approach.

Here, problems are divided into atomic tasks and implemented as a graph. Each task is an independent unit of job that has dependencies on one or more antecedent tasks. After the completion of a task, the output of an antecedent task is passed to the dependent task. A task with an antecedent task starts execution only when its entire antecedent task is completed. The final output of the graph is received when the last dependent task is completed (Task 6 in the above figure).

Work Pool Model


In the work pool model, tasks are dynamically assigned to the processes for balancing the load. Therefore, any process may potentially execute any task. This model is used when the quantity of data associated with tasks is comparatively smaller than the computation associated with the tasks.

There is no pre-assignment of tasks onto the processes. The assignment of tasks may be centralized or decentralized. Pointers to the tasks are saved in a physically shared list, in a priority queue, or in a hash table or tree, or they could be saved in a physically distributed data structure.

Tasks may be available at the beginning, or may be generated dynamically. If tasks are generated dynamically and assignment is decentralized, then a termination detection algorithm is required so that all the processes can actually detect the completion of the entire program and stop looking for more tasks.

Example Parallel tree search
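A minimal work-pool sketch using a shared queue and threads (the task, squaring integers, is a stand-in for real work): workers pull whatever task is available next, so the load balances dynamically, and sentinel values handle termination in this centralized, statically generated case.

import queue
import threading

tasks = queue.Queue()
results = queue.Queue()

def worker():
    while True:
        item = tasks.get()
        if item is None:            # sentinel: no more work for this worker
            tasks.task_done()
            break
        results.put(item * item)    # "execute" the task
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for n in range(10):
    tasks.put(n)
for _ in workers:                   # one sentinel per worker
    tasks.put(None)
tasks.join()
print(sorted(results.queue))        # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]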

Master-Slave Model
In the master-slave model, one or more master processes generate tasks and allocate them to slave processes. The tasks may be allocated beforehand if

the master can estimate the volume of the tasks, or

a random assigning can do a satisfactory job of balancing load, or

slaves are assigned smaller pieces of task at different times.

This model is generally equally suitable to shared-address-space or message-passing paradigms, since the interaction is naturally two-way.

In some cases, a task may need to be completed in phases, and the task in each phase must be completed before the tasks in the next phase can be generated. The master-slave model can be generalized to a hierarchical or multi-level master-slave model in which the top-level master feeds a large portion of tasks to the second-level masters, who further subdivide the tasks among their own slaves and may perform a part of the task themselves.

Precautions in using the master-slave model


Care should be taken to assure that the master does not become a congestion point. It
may happen if the tasks are too small or the workers are comparatively fast.

The tasks should be selected in a way that the cost of performing a task dominates the
cost of communication and the cost of synchronization.

Asynchronous interaction may help overlap interaction and the computation associated
with work generation by the master.

Pipeline Model
It is also known as the producer-consumer model. Here a set of data is passed on
through a series of processes, each of which performs some task on it. Here, the arrival
of new data generates the execution of a new task by a process in the queue. The
processes could form a queue in the shape of linear or multidimensional arrays, trees,
or general graphs with or without cycles.

This model is a chain of producers and consumers. Each process in the queue can be considered as a consumer of a sequence of data items for the process preceding it in the queue and as a producer of data for the process following it in the queue. The queue does not need to be a linear chain; it can be a directed graph. The most common interaction minimization technique applicable to this model is overlapping interaction with computation.

Example Parallel LU factorization algorithm.
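A minimal three-stage pipeline sketch connected by queues (the stage functions and data are invented): each stage consumes from the queue before it and produces into the queue after it, and an end-of-stream marker flows through to shut the pipeline down.

import queue
import threading

q12, q23 = queue.Queue(), queue.Queue()

def stage1():                        # producer
    for i in range(5):
        q12.put(i)
    q12.put(None)                    # end-of-stream marker

def stage2():                        # intermediate consumer/producer
    while (item := q12.get()) is not None:
        q23.put(item * 10)           # some work on each data item
    q23.put(None)

def stage3():                        # final consumer
    while (item := q23.get()) is not None:
        print("consumed", item)

threads = [threading.Thread(target=s) for s in (stage1, stage2, stage3)]
for t in threads:
    t.start()
for t in threads:
    t.join()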

Hybrid Models
A hybrid algorithm model is required when more than one model may be needed to
solve a problem.

A hybrid model may be composed of either multiple models applied hierarchically or


multiple models applied sequentially to different phases of a parallel algorithm.

Example Parallel quick sort

Shared memory/ Parallel Processing in Memory


In computer science, shared memory is memory that may be
simultaneously accessed by multiple programs with an intent
to provide communication among them or avoid redundant
copies. Shared memory is an efficient means of passing data
between programs. Depending on context, programs may
run on a single processor or on multiple separate
processors.
Using memory for communication inside a single
program, e.g. among its multiple threads, is also referred
to as shared memory.

In hardware
In computer hardware, shared memory refers to a (typically large) block of random access memory (RAM) that
can be accessed by several different central processing units (CPUs) in a multiprocessor computer system.
Shared memory systems may use:[1]

uniform memory access (UMA): all the processors share the physical memory uniformly;
non-uniform memory access (NUMA): memory access time depends on the memory location relative to a
processor;
cache-only memory architecture (COMA): the local memories for the processors at each node are used as cache instead of as actual main memory.

In software
In computer software, shared memory is either

a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at the same time. One process will create an area in RAM which other processes can access;
a method of conserving memory space by directing accesses to what would ordinarily be copies of a piece of data to a single instance instead, by using virtual memory mappings or with explicit support of the program in question. This is most often used for shared libraries and for XIP.
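A minimal sketch using Python's standard multiprocessing.shared_memory module (Python 3.8+) to show the IPC flavor described above; here both handles live in one process for brevity, but the second handle attaches by name exactly as a separate process would.

from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=16)   # "writer" creates a named block
shm.buf[:5] = b"hello"                                   # and fills part of it

view = shared_memory.SharedMemory(name=shm.name)         # a second process would attach by name
print(bytes(view.buf[:5]))                               # b'hello' -- same physical memory, no copy

view.close()
shm.close()
shm.unlink()                                             # release the block when finished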
