
A New Simplified Approach
To
Parallel Architecture and Computing
IT (FIFTH SEMESTER)

PTU Jalandhar

By: Er. Umesh Vats
Information Technology
BCET Ludhiana

PTU Syllabus
IT-309
PARALLEL ARCHITECTURE AND COMPUTING

PREREQUISITES: Computer Architecture.

OBJECTIVES: This course offers a good understanding of the various functional units of a computer system and prepares the student to be in a position to design a basic computer system.

COURSE CONTENTS

1. Introduction and Classification of Parallel Computers [8%]
Parallel processing terminology, Flynn's and Handler's classifications, Amdahl's law.

2. Pipelined and Vector Processors [12%]
Instruction pipelining, reservation tables, data and control hazards and methods to remove them.

3. SIMD or Array Processors [12%]
Various interconnection networks, data routing through various networks, comparison of various networks, simulation of one network by another.

4. MIMD and Multiprocessor Systems [12%]
Uniform and non-uniform memory access multiprocessors, scheduling in multiprocessor systems, load balancing in multiprocessor systems.

5. PRAM Model of Parallel Computing and Basic Algorithms [7%]
PRAM model and its variations, relative powers of various PRAM models.

6. Parallel Algorithms for Multiprocessor Systems [25%]
Basic constructs for representing PRAM algorithms, parallel reduction algorithm, parallel prefix computation, parallel list ranking, parallel merge, Brent's theorem and cost-optimal algorithms, NC class of parallel algorithms.

7. Parallel Algorithms for SIMD and Multiprocessor Systems [4%]
Introduction to parallel algorithms for SIMD and multiprocessor systems.
Chapter 1.

Introduction and Classification of Parallel Computers


What is a parallel computer:-

We will briefly review the structure of a single-processor computer commonly used today. It consists of:

Input unit:- accepts the list of instructions to solve a problem, and the data relevant to that problem.

Storage unit:- the memory or storage unit in which the program, data and intermediate results are stored.

Output unit:- displays or prints the results.

CPU or processing element:- the combination of the ALU (arithmetic and logic unit) and the control unit.

ALU:- where arithmetic and logic operations are performed.

Control unit:- interprets the instructions stored in memory and obediently carries them out.

[Figure: block diagram of a sequential computer — input unit, processing element (ALU + control unit), memory, and output unit.]
The PE (processing element) retrieves one instruction at a time, interprets it and executes it. The operation of this computer is sequential: at any one time a PE can execute only one instruction. The speed of a sequential computer is limited by the speed at which it can process the retrieved data.

To increase the speed of processing one may interconnect many sequential computers to work together. Such a computer, which consists of a number of interconnected sequential computers that cooperatively execute a single program to solve a problem, is called a parallel computer.

COMPARISON OF TEMPORAL AND DATA PARALLEL PROCESSING

1. Temporal: the job is divided into a set of independent tasks, and the tasks are assigned for processing. Data parallel: full jobs are assigned for processing.
2. Temporal: tasks should take equal time, and the beginning of tasks must be synchronized. Data parallel: jobs may take different times; there is no need to synchronize the beginning of jobs.
3. Temporal: processors are specialized to do specific tasks efficiently. Data parallel: processors should be general purpose, and may not do all tasks efficiently.
4. Temporal: task assignment is static. Data parallel: job assignment may be static, dynamic or quasi-dynamic.
5. Temporal: not tolerant of processor faults. Data parallel: tolerates processor faults.
6. Temporal: efficient with fine-grained tasks. Data parallel: efficient with coarse-grained tasks.

Parallel processing:-

Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity and pipelining.

Parallel events may occur in multiple resources during the same time interval, and pipelined events may occur in overlapped time spans.

The highest level of parallel processing is conducted among multiple jobs or programs through multiprogramming, time sharing and multiprocessing.

Parallel computer structures:-

Parallel computers are those systems that emphasize parallel processing. The basic architectural features of parallel computers are introduced below. We divide parallel computers into three architectural configurations:

1. Pipeline computers
2. Array processors
3. Multiprocessor systems

Pipeline computer:-

A pipeline computer performs overlapped computation to exploit temporal parallelism. The process of executing an instruction in a digital computer involves four steps:

. Instruction fetch (IF)
. Instruction decode (ID)
. Operand fetch (OF)
. Execution (EX)

The instruction is fetched (IF) from memory, then decoded (ID) to identify the operation to be performed, the operands are fetched (OF), and finally the decoded arithmetic/logic operation is executed (EX). In a pipelined computer successive instructions are executed in an overlapped fashion. The four pipeline stages IF, ID, OF and EX are arranged in a linear cascade. A pipeline processor can be represented as:

IF -> ID -> OF -> EX
(A linear pipeline processor)

In a non-pipelined computer, by contrast, these four steps must be completed before the next instruction can be issued.

The space-time diagram contrasts pipelined and non-pipelined processors. In a pipelined processor the operation of all stages is synchronized under a common clock control, and interface latches are used between adjacent segments to hold the intermediate results.

A non-pipelined processor takes four pipeline cycles to complete one instruction.

Some main issues in designing a pipeline computer include job sequencing, collision prevention, congestion control, branch handling, reconfiguration and hazard resolution. Pipeline computers are most attractive for vector processing.

Array processor:-

An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism. The fundamental difference between an array processor and a multiprocessor system is that the processing elements in an array processor operate synchronously, while the processors in a multiprocessor system may operate asynchronously.

Array computer:-

An array processor is a synchronized parallel computer with multiple arithmetic logic units, called processing elements (PEs). By replication of ALUs one can achieve spatial parallelism. Scalar and control-type instructions are directly executed in the control unit. The PEs are interconnected by a data-routing network.

[Figure: array processor — a control processor (CP) with control memory (CM) supervises processing elements PE1 ... PEn, each with a private memory M, connected through an inter-PE connection network.]

The control unit is a combination of a control processor and a control memory. In an array processor each processing element has its own private memory. Instruction fetch (from the local memory or from the control memory) and decode are done by the control unit. Array processors designed with associative memories are called associative processors. Parallel algorithms on array processors will be given for matrix multiplication, merge sort and the fast Fourier transform (FFT).

Multiprocessor system:-

Research and development of multiprocessor systems are aimed at improving throughput, reliability, flexibility and availability. A basic multiprocessor organization is drawn below: the system contains two or more processors of approximately comparable capability. All processors share access to common sets of memory modules, I/O channels and peripheral devices. Most importantly, the entire system must be controlled by a single integrated operating system providing interaction between the processors and their programs at various levels. Three different interconnections have been practiced in the past:

. Time-shared common bus
. Crossbar switch
. Multiport memories

[Figure: basic multiprocessor system organization.]

Pipelining of processing elements:-

Pipelining uses temporal parallelism to increase the speed of processing, and is an important method of increasing the speed of a processor. Pipelining works best when the following ideal conditions are satisfied:

1. It is possible to break up an instruction into a number of independent parts, each part taking nearly equal time to execute.
2. There is so-called locality in instruction execution. By this we mean that instructions are executed in sequence, one after the other, in the order in which they are written. If the instructions do not execute in sequence but "jump around" due to many branch instructions, then pipelining is not effective.
3. Successive instructions are such that the work done during the execution of an instruction can be effectively used by the next instruction.
4. Sufficient resources are available in the processor, so that if a resource is required by successive instructions in the pipeline it is readily available.
Data flow computer:-

1. To exploit maximum parallelism in a program, data flow computers are used.
2. The basic concept is to enable the execution of an instruction whenever its required operands become available.
3. No program counter is needed in data-driven computation.
4. Programs for data-driven computation can be represented by data flow graphs.
5. An example of a data flow graph is given below for:

Z = (x + y) * 2

[Figure: data flow graph for Z = (x + y) * 2.]

Flynn’s classification:-

Flynn’s classification is based on the multiplicity of instruction streams and data streams in a computer system. The essential computing process is the execution of a sequence of instructions on a set of data.

1. The term stream is used to denote a sequence of items as executed or operated upon by a single processor.
2. An instruction stream is a sequence of instructions as executed by the machine; a data stream is a sequence of data operated on by an instruction stream.

Computer organizations are characterized by the multiplicity of the hardware provided to serve the instruction and data streams. Flynn’s classification distinguishes four types:

1. Single instruction stream - single data stream (SISD)
2. Single instruction stream - multiple data stream (SIMD)
3. Multiple instruction stream - single data stream (MISD)
4. Multiple instruction stream - multiple data stream (MIMD)

1. SISD:-
a. Instructions are executed sequentially but may be overlapped in their execution stages.
b. Single instruction: only one instruction stream is acted on by the CPU during any one clock cycle.
c. Single data: only one data stream is used as input during any one clock cycle.
d. An SISD computer may have more than one functional unit in it.
e. Deterministic execution.
f. E.g. older-generation mainframes, minicomputers and workstations.
Fig:
[Figure: SISD computer — the CU issues an instruction stream (IS) to the PU, which exchanges a data stream (DS) with the MM.]

2. SIMD:-
a. SIMD stands for single instruction stream, multiple data streams.
b. There are multiple processing elements supervised by the same control unit.
c. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams.
d. SIMD machines are further divided into word-slice and bit-slice modes.
e. Deterministic execution.
f. E.g. array processors and vector pipelines.

[Figure: SIMD computer — one CU broadcasts the instruction stream (IS) to processing units PU1 ... PUn, each operating on its own data stream DS1 ... DSn with memory modules MM1 ... MMn.]

CU = control unit, PU = processor unit, MM = memory module, IS = instruction stream, DS = data stream

3. MISD:-
a. MISD stands for multiple instruction streams, single data stream.
b. There are n processor units, each receiving distinct instructions but operating over the same data stream.
c. This organization exists conceptually rather than physically.
The diagram is given below:

MISD COMPUTER

[Figure: MISD computer — control units CU1 ... CUn issue distinct instruction streams IS1 ... ISn to processor units PU1 ... PUn, all of which operate on the same data stream (DS) drawn from memory modules MM1 ... MMn.]

4. MIMD:-
a. MIMD stands for multiple instruction streams, multiple data streams.
b. This is the most common type of parallel computer.
c. Multiple instruction: every processor may be executing a different instruction stream.
d. Multiple data: every processor may be working with a different data stream.
e. E.g. most current supercomputers and multi-core PCs.
The diagram is given below:
MIMD COMPUTER:

[Figure: MIMD computer — control units CU1 ... CUn issue independent instruction streams IS1 ... ISn to processor units PU1 ... PUn, each operating on its own data stream DS1 ... DSn with memory modules MM1 ... MMn.]

Handler’s classification:

Wolfgang Handler has proposed a classification for identifying the degrees of parallelism and pipelining built into the hardware structure of a computer system. Handler’s taxonomy addresses the computer at three distinct levels:

1. Processor control unit (PCU).
2. Arithmetic logic unit (ALU).
3. Bit-level circuit (BLC).

The PCU corresponds to one processor or one CPU, the ALU is equivalent to the processing element in an array processor, and the BLC corresponds to the combinational logic circuitry needed to perform 1-bit operations in the ALU.

A computer system C can be characterized by a triple containing six independent entities, as defined below:

T(C) = <K * K', D * D', W * W'>

where
K  = the number of processors (PCUs) within the computer,
K' = the number of PCUs that can be pipelined,
D  = the number of ALUs (or PEs) under the control of one PCU,
D' = the number of ALUs that can be pipelined,
W  = the word length of an ALU or of a PE,
W' = the number of pipeline stages in all ALUs or in a PE.

e.g.
The Texas Instruments Advanced Scientific Computer (TI-ASC) has one controller controlling four arithmetic pipelines, each with a 64-bit word length and eight stages. Thus we have:

T(ASC) = <1 * 1, 4 * 1, 64 * 8> = <1, 4, 64 * 8>

Whenever a second entity K', D' or W' equals 1, we drop it, since pipelining of one stage or of one unit is meaningless.
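
The drop-the-1 convention is easy to mechanize. As a small illustration (the helper name is ours, not from the text), the following Python sketch formats a Handler triple, dropping any second entity equal to 1:

def handler_triple(k, k2, d, d2, w, w2):
    """Format T(C) = <K*K', D*D', W*W'>, dropping second entities equal to 1."""
    def entity(a, b):
        return str(a) if b == 1 else f"{a}*{b}"
    return "<" + ", ".join([entity(k, k2), entity(d, d2), entity(w, w2)]) + ">"

# The TI-ASC example from the text: 1 PCU, 4 ALU pipelines, 64-bit words, 8 stages.
print(handler_triple(1, 1, 4, 1, 64, 8))   # -> <1, 4, 64*8>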

Amdahl’s law:

Amdahl’s law is based on a fixed problem size, or a fixed workload. We have seen that for a given problem size the speedup does not increase linearly as the number of processors increases; in fact the speedup tends to saturate.

According to Amdahl’s law, a program contains two types of operations:

a. completely sequential operations, which must be done sequentially, and
b. completely parallel operations, which can be run on multiple processors.

Let the time taken to perform the sequential operations, Ts, be a fraction α (0 < α < 1) of the total execution time of the program, T(1). Then the parallel operations take Tp = (1 - α) T(1) time. Assume that the parallel operations in the program achieve a linear speedup (i.e. these operations take (1/n)th of the time on n processors that they take on one processor). Then the speedup with n processors is:

S(n) = T(1) / T(n)
     = T(1) / [α T(1) + (1 - α) T(1) / n]
     = 1 / [α + (1 - α)/n]

As n grows without bound, S(n) approaches 1/α, so the sequential fraction alone bounds the achievable speedup.

[Figure: speedup S(n) versus number of processors n under Amdahl's law.]
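
The formula is worth plugging numbers into, because the saturation it predicts is severe. A minimal Python sketch (the function name is ours):

def amdahl_speedup(alpha, n):
    """S(n) = 1 / (alpha + (1 - alpha)/n) for sequential fraction alpha."""
    return 1.0 / (alpha + (1.0 - alpha) / n)

# With a 10% sequential fraction the speedup can never exceed 1/alpha = 10,
# no matter how many processors are used.
for n in (1, 2, 4, 8, 16, 64, 1024):
    print(n, round(amdahl_speedup(0.1, n), 2))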


Parallel computing:-

Parallel computing is a form of computation that carries out many calculations simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently.

There are several different forms of parallel computing:
1. Bit-level parallelism
2. Instruction-level parallelism
3. Data parallelism
4. Task parallelism

Uses of parallel computing:

Parallel computing has been used to model difficult scientific and engineering problems found in the real world. Some examples:

1. Atmosphere and earth environment.
2. Databases and data mining.
3. Oil exploration.
4. Web search engines and web-based business services.
5. Medical imaging and diagnosis.
6. Management of national and multinational corporations.
7. Financial and economic modelling.
8. Advanced graphics and virtual reality, particularly in the entertainment industry.
9. Networked video and multimedia technologies.
Chapter 2.

Pipelined and Vector Processors

Pipelining:-

To achieve pipelined processing one must subdivide the input task (process) into a sequence of subtasks, each of which can be executed by a specialized hardware stage that operates concurrently with the other stages in the pipeline.

Classification of pipelines:-

The following types of pipelining are described below:

1. Arithmetic pipelining:-
The arithmetic logic units of a computer can be segmented for pipelined operation over various data formats.

[Figure: arithmetic pipeline.]
2. Processor pipelining:-

This refers to the pipelined processing of the same data stream by a cascade of processors, each of which processes a specific task. The data stream passes through the first processor, with the results stored in a memory block that is also accessible by the second processor; the second passes its results to the third, and so on.

[Figure: processor pipeline.]

3. Unifunctional vs. multifunctional:-

A pipeline unit with a fixed and dedicated function, such as a floating-point adder, is called a unifunctional pipeline.

A multifunctional pipe may perform different functions, either at different times or at the same time, by interconnecting different subsets of stages in the pipeline.

4. Static vs. dynamic:-

A static pipeline may assume only one functional configuration at a time. A static pipeline may be unifunctional or multifunctional; pipelining is possible in static pipes only if instructions of the same type are executed continuously.

A dynamic pipeline processor permits several functional configurations to exist simultaneously; a dynamic pipeline must therefore be multifunctional.

Pipelined processing:-

Pipelined processing is a technique for doing certain repetitive work efficiently. It is possible to implement pipelining at various levels of abstraction; pipelined logic circuits provide concurrency at the lowest level.

Speed:-

Once the pipe is filled up, it outputs one result per clock period, independent of the number of stages in the pipe. Ideally, a linear pipeline with k stages can process n tasks in

Tk = k + (n - 1) clock periods,

where k cycles are used to fill up the pipeline (i.e. to complete the execution of the first task) and (n - 1) cycles are needed to complete the remaining (n - 1) tasks. The same n tasks executed on a non-pipelined computer of equivalent function take T1 = n * k clock periods.
Speedup:-

We define the speedup of a k-stage linear pipeline processor over an equivalent non-pipelined processor as:

S = time taken by the non-pipelined processor / time taken by the pipelined processor

S = T1 / Tk = n * k / (k + (n - 1))

Efficiency:-

The efficiency of a linear pipeline is the percentage of busy time-space spans over the total time-space span, which equals the sum of all busy and idle time-space spans. Let n, k and τ be the number of tasks, the number of pipeline stages and the clock period of a linear pipeline, respectively. The pipeline efficiency is given by:

η = n * k * τ / (k * [k + (n - 1)] * τ) = n / (k + (n - 1))

Throughput:-

The number of results that can be completed by a pipeline per unit time is called its throughput:

W = n / ([k + (n - 1)] * τ) = η / τ

In the ideal case, as n grows large, the throughput approaches 1/τ, i.e. one result per clock period.

Performance:-

Pipeline performance is described by a few important attributes, chiefly the length of the pipe. The performance measures of importance are the efficiency, speedup and throughput defined above, expressed in terms of k (the number of stages in the pipeline) and n (the number of tasks run on the pipe).
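
Since all three measures are simple arithmetic in k, n and τ, they are easy to tabulate. A short Python sketch (names are ours) for a k-stage pipeline running n tasks with clock period tau:

def pipeline_metrics(k, n, tau=1.0):
    """Time, speedup, efficiency and throughput of a k-stage linear pipeline on n tasks."""
    t_pipe = (k + (n - 1)) * tau     # Tk: k cycles to fill the pipe, n-1 more to finish
    t_seq = n * k * tau              # T1: equivalent non-pipelined time
    speedup = t_seq / t_pipe         # approaches k as n grows large
    efficiency = n / (k + (n - 1))   # fraction of busy time-space spans
    throughput = n / t_pipe          # approaches 1/tau as n grows large
    return t_pipe, speedup, efficiency, throughput

# A four-stage pipeline (IF, ID, OF, EX) running 100 instructions:
print(pipeline_metrics(k=4, n=100))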

Reservation table:-

A reservation table is another way of representing the task flow pattern of a pipelined system. It represents the flow of data through the pipeline for one complete evaluation of a given function.

The table consists of rows and columns. The rows of the reservation table represent the stages (the resources of the pipeline) and each column represents one clock time unit. The total number of clock units in the table is called the evaluation time of the given function.

E.g., if a pipelined system has four resources and five time slices, then the reservation table will have four rows and five columns. All the elements of the table are either 0 or 1: if a resource (say, resource i) is used in a time slice (say, time slice j), then the (i, j)th element of the table has the entry 1. On the other hand, if a resource is not used in a particular time slice, then that entry has the value 0.

An example of a reservation table:-

Suppose that we have four resources and six time slices, and the usage of the resources is as follows:

1) Resource 1 is used in time slices 1, 3 and 6.
2) Resource 2 is used in time slice 2.
3) Resource 3 is used in time slices 2 and 4.
4) Resource 4 is used in time slice 5.

The reservation table will be as follows:

Index        Time 1  Time 2  Time 3  Time 4  Time 5  Time 6
Resource 1     1       0       1       0       0       1
Resource 2     0       1       0       0       0       0
Resource 3     0       1       0       1       0       0
Resource 4     0       0       0       0       1       0

Often 0 is represented by a blank and 1 is represented by an X, so the corresponding reservation table is:

Index        Time 1  Time 2  Time 3  Time 4  Time 5  Time 6
Resource 1     X               X                       X
Resource 2             X
Resource 3             X               X
Resource 4                                     X

Different functions may have different evaluation times; for example, the table above has an evaluation time of 6.
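
A reservation table is just a small 0/1 matrix, so it can be built and printed programmatically. The sketch below (our own helper, fed with the resource-usage lists of the example) reproduces the X-form of the table:

def reservation_table(usage, n_slices):
    """Build a 0/1 reservation table from {resource: [time slices used]} (1-indexed)."""
    return {r: [1 if t + 1 in slices else 0 for t in range(n_slices)]
            for r, slices in usage.items()}

usage = {1: [1, 3, 6], 2: [2], 3: [2, 4], 4: [5]}   # the example above
for r, row in reservation_table(usage, n_slices=6).items():
    print(f"Resource {r}:", " ".join("X" if bit else "." for bit in row))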

Hazard:-

A hazard is a potential problem that can arise in a pipelined processor. It refers to the possibility of erroneous computation when the CPU tries to execute simultaneously multiple instructions which exhibit data dependence.

Or:

Delays in the pipelined execution of instructions due to non-ideal conditions are called pipeline hazards.

Types of hazards:-

Structural hazard:-

When the number of resources in a processor is limited and causes delay due to resource constraints, this is known as a structural hazard. A structural hazard takes place when a part of the processor's hardware is needed by two or more instructions at the same time.

Control hazard:-

A control hazard is also known as a branch hazard. Control hazards occur because programs have branches and loops, so the execution of a program is not in a "straight line": if a certain condition is true, execution jumps from one part of the instruction stream to another. Delay due to branch instructions, or control dependency in a program, is known as a control hazard.

Data hazard:-

A data hazard arises when the result generated by one instruction is required by the next instruction, i.e. successive instructions are not independent of one another. Delay due to data dependency between instructions is known as a data hazard.

When successive instructions overlap their fetch, decode and execution through a pipelined processor, inter-instruction dependencies may arise that prevent the sequential data flow in the pipeline; e.g. an instruction may depend upon the result of a previous instruction.

Writing D(I) for the domain of an instruction I (the data objects it reads) and R(I) for its range (the data objects it writes), data hazards are divided into three types:

1. Read after write (RAW)
2. Write after read (WAR)
3. Write after write (WAW)

RAW hazard:-

A RAW hazard between two instructions I and J may occur when J attempts to read some data object that has been modified by I:

R(I) ∩ D(J) ≠ ∅ for RAW

[Figure: RAW hazard — the range R(I) of instruction I overlaps the domain D(J) of the later instruction J.]

2. WAR hazard:-

A WAR hazard may occur when J attempts to modify some data object that is read by I:

D(I) ∩ R(J) ≠ ∅ for WAR

[Figure: WAR hazard — the domain D(I) of instruction I overlaps the range R(J) of the later instruction J.]

3. WAW hazard:-

A WAW hazard may occur if both I and J attempt to modify the same data object:

R(I) ∩ R(J) ≠ ∅ for WAW

[Figure: WAW hazard — the ranges R(I) and R(J) overlap.]
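
The three conditions translate directly into set intersections. A minimal sketch (our own representation: an instruction is a pair of read/write sets) classifies the hazards between an earlier instruction I and a later instruction J:

def classify_hazards(domain_i, range_i, domain_j, range_j):
    """Data hazards between earlier instruction I and later instruction J.
    domain = objects read, range = objects written."""
    hazards = []
    if range_i & domain_j:
        hazards.append("RAW")   # J reads something that I writes
    if domain_i & range_j:
        hazards.append("WAR")   # J writes something that I reads
    if range_i & range_j:
        hazards.append("WAW")   # both write the same object
    return hazards

# I: R1 <- R2 + R3, then J: R4 <- R1 * R5 (J reads R1 written by I, so RAW).
print(classify_hazards({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))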

Hazard detection:-

Hazard detection is necessary because a hazard that is not properly detected and resolved could result in an interlock situation in the pipeline, or produce unreliable results by overwriting.

Hazard detection basically falls into two categories:

1. Hazard detection can be done in the instruction fetch stage of a pipelined processor by comparing the domain and range of the incoming instruction with those of the instructions being processed in the pipe. A warning signal can be generated to prevent the hazard from taking place.
2. Another approach is to allow the incoming instruction through the pipe and distribute the detection to all the potential pipeline stages. This distributed approach offers better flexibility at the expense of increased hardware control.

Hazard resolution:-

Once a hazard is detected, the system should resolve the interlock situation.

1. Consider the instruction sequence {... I, I+1, ... J, J+1, ...} in which a hazard has been detected between the current instruction J and a previous instruction I. A straightforward approach is to stop the pipe and suspend the execution of instructions J, J+1, J+2, ... until instruction I has passed the point of resource conflict.
2. A more sophisticated approach is to suspend only instruction J and continue the flow of instructions J+1, J+2, ... down the pipe. Of course, the potential hazards due to the suspension of J must be continuously checked as instructions J+1, J+2, ... move ahead of J.

Note: to avoid RAW hazards, IBM engineers developed a short-circuiting approach which gives a copy of the data object to be written directly to the instruction waiting to read it. This technique is also known as data forwarding.

Vector processor:-

A vector pipeline is specially designed to handle vector instructions over vector operands. Computers having vector instructions are often called vector processors.

Vector (SIMD) processors are microprocessors specialized for operating on vector or matrix data elements. These processors have specialized hardware for performing vector operations such as vector addition, vector multiplication and other operations.

Instruction pipeline:-

An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations. One possible digression associated with such a scheme is that an instruction may cause a branch out of sequence; in that case the pipeline must be emptied and all the instructions that were read from memory after the branch instruction must be discarded.

In general the computer needs to process each instruction with the following sequence of steps:
1. Fetch the instruction from memory. (IF)
2. Decode the instruction. (ID)
3. Calculate the effective address.
4. Fetch the operands from memory. (OF)
5. Execute the instruction. (EX)
6. Store the result in the proper place. (ST)

Difficulties in instruction pipelining:

1. Different segments take different times to operate on the incoming information, and some segments are skipped for certain operations; e.g. a register-mode instruction does not need an effective-address calculation.
2. Two or more segments may require memory access at the same time.

The design of an instruction pipeline will be most efficient if the instruction cycle is divided into segments of equal duration.

Chapter 3.

SIMD ARRAY PROCESSORS

A synchronous array of parallel processors is called an array processor; it consists of multiple processing elements (PEs) under the supervision of one control unit (CU). An array processor can handle a single instruction stream and multiple data streams (SIMD).

In this sense, array processors are also known as SIMD computers. SIMD machines are especially designed to perform vector computations over matrices or arrays of data.

SIMD computers appear in two basic architectural organisations: array processors, using random-access memory; and associative processors, using content-addressable (or associative) memory.

SIMD COMPUTER ORGANISATIONS

In general, an array processor may assume one of two slightly different configurations, as illustrated in the figures. Configuration I has been implemented in the well-publicized Illiac-IV computer. This configuration is structured with N synchronized PEs, all of which are under the control of one CU.

[Figure: Configuration I (Illiac-IV) — the CU with its CU memory receives data and instructions over the I/O data bus; processing elements PE0 ... PE N-1, each with a local memory PEM0 ... PEM N-1, are connected through an interconnection network under CU control.]

a. Each PEi is essentially an arithmetic logic unit (ALU) with attached working registers and a local memory PEMi.
b. The CU also has its own memory for the storage of programs. The system and user programs are executed under the control of the CU. The user programs are loaded into the CU memory from an external source. The function of the CU is to decode all the instructions and determine where the decoded instructions should be executed.
c. Vector instructions are broadcast to the PEs for distributed execution to achieve spatial parallelism. All the PEs perform the same function synchronously, in lock-step fashion, under the command of the CU.
d. Vector operands are distributed to the PEMs before parallel execution in the array of PEs. The distributed data can be loaded into the PEMs from an external source via the system data bus, or via the CU in a broadcast mode using the control bus. Masking schemes are used to control the status of each PE during the execution of a vector instruction: each PE may be either active or disabled during an instruction cycle, and only enabled PEs perform computation. The interconnection network is under the control of the control unit.
[Figure: Configuration II (BSP) — the CU with its CU memory; PE0 ... PE N-1 are connected through an alignment network, under CU control, to shared parallel memory modules M0 ... M P-1; I/O is over the data bus.]

Configuration II differs from Configuration I in two respects.

First, the local memories attached to the PEs are replaced by parallel memory modules shared by all the PEs through an alignment network.

Second, the inter-PE permutation network is replaced by the inter-PE memory alignment network, which is again controlled by the CU.

In Configuration II (BSP), there are N PEs and P memory modules. The two numbers are not necessarily equal; in fact, they have been chosen to be relatively prime. The alignment network is a path-switching network between the PEs and the parallel memories.

Formally, an SIMD computer C is characterized by the following set of parameters:

C = <N, F, I, M>

where N = the number of PEs in the system;
F = a set of data-routing functions provided by the interconnection network or by the alignment network;
I = the set of machine instructions for scalar-vector, data-routing and network-manipulation operations;
M = the set of masking schemes, where each mask partitions the set of PEs into the two disjoint subsets of enabled PEs and disabled PEs.
Inter-PE communications

Network design decisions for inter-PE communications are discussed below. The decisions are made among operation modes, control strategies, switching methodologies and network topologies.

Operation mode. Two types of communication can be identified: synchronous and asynchronous.

Synchronous communication is needed for establishing communication paths synchronously, either for a data-manipulating function or for a data/instruction broadcast.

Asynchronous communication is needed for multiprocessing, in which connection requests are issued dynamically.

A system may also be designed to facilitate both synchronous and asynchronous processing. Therefore, the typical operation modes of interconnection networks can be classified into three categories: synchronous, asynchronous and combined.

Control strategy. A typical interconnection network consists of a number of switching elements and interconnecting links. Interconnection functions are realized by properly setting the control of the switching elements. The control-setting function can be managed by a centralized controller or by the individual switching elements.

The latter strategy is called distributed control, and the former corresponds to centralized control. Most existing SIMD interconnection networks choose centralized control of all switch elements by the control unit.

Switching methodologies. The two major switching methodologies are circuit switching and packet switching.

In circuit switching, a physical path is actually established between a source and a destination.

In packet switching, data is put in a packet and routed through the interconnection network without establishing a physical connection path.

In general, circuit switching is more suitable for bulk data transmission, and packet switching is more efficient for many short data messages. Another option, integrated switching, includes the capabilities of both circuit switching and packet switching.
Network topology:-

A network can be depicted by a graph in which nodes represent switching points and edges represent communication links. The topologies tend to be regular and can be grouped into two categories: static and dynamic.

In a static topology, links between two processors are passive, and dedicated buses cannot be reconfigured for direct connections to other processors.

On the other hand, links in the dynamic category can be reconfigured by setting the network's active switching elements.

The space of interconnection networks can be represented by the Cartesian product of the above four sets of design features: {operation mode} * {control strategy} * {switching methodology} * {network topology}.

SIMD INTERCONNECTION NETWORKS

Various interconnection networks have been suggested for SIMD computers. We distinguish between single-stage (recirculating) networks and multistage networks. Important network classes to be presented include the Illiac network, the flip network, the n-cube network, the omega network, the data manipulator, the barrel shifter and the shuffle-exchange network.

Static versus dynamic networks

The topological structure of an SIMD array processor is characterized by the data-routing network used to interconnect the processing elements. Formally, such an inter-PE communication network can be specified by a set of data-routing functions. To pass data between PEs that are not directly connected in the network, the data must be passed through intermediate PEs by executing a sequence of routing functions through the interconnection network.

SIMD interconnection networks are classified into the following two categories based on network topology: static networks and dynamic networks.

Static networks. Topologies in the static category can be classified according to the dimensions required for layout: for illustration, one-dimensional, two-dimensional, three-dimensional and hypercube.

Examples of one-dimensional topologies include the linear array used for some pipeline architectures. Two-dimensional topologies include the ring, tree, star, mesh and systolic array; examples of these structures are shown in the figures. Three-dimensional topologies include the completely connected, chordal ring, 3-cube and 3-cube-connected-cycle networks. The mesh and the 3-cube are actually two- and three-dimensional hypercubes, respectively.

Dynamic networks. We consider two classes of dynamic networks: single-stage and multistage.

Single-stage networks. A single-stage network is a switching network with N input selectors (IS) and N output selectors (OS), as demonstrated in the figure. Each IS is essentially a 1-to-D demultiplexer and each OS is an M-to-1 multiplexer, where 1 <= D <= N and 1 <= M <= N.

A single-stage network is also called a recirculating network. Data items may have to recirculate through the single stage several times before reaching their final destinations. The number of recirculations needed depends on the connectivity of the single-stage network: in general, the higher the hardware connectivity, the smaller the number of recirculations.
Multistage networks. Multistage networks are described by three characterizing features: the switch box, the network topology and the control structure.

Many switch boxes are used in a multistage network. Each box is essentially an interchange device with two inputs and two outputs, as depicted in the figure. There are four states of a switch box: straight, exchange, upper broadcast and lower broadcast.

1. Straight
2. Exchange
3. Upper broadcast
4. Lower broadcast

A two-function switch box can assume either the straight or the exchange state. A four-function switch box can be in any one of the four legitimate states.

A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal. Multistage networks can be one-sided or two-sided.

One-sided networks, sometimes called full switches, have their input and output ports on the same side.

One-sided network:-

[Figure: one-sided connection network with ports 1 ... N all on the same side.]

Two-sided multistage networks, which usually have an input side and an output side, can be divided into three classes: blocking, rearrangeable and nonblocking.

[Figure: two-sided connection network with inputs 1 ... N on one side and outputs 1 ... N on the other.]

Blocking networks:-

In blocking networks, simultaneous connections of more than one terminal pair may result in conflicts in the use of network communication links. Examples of blocking networks are the data manipulator, omega, flip, n-cube and baseline networks.

Rearrangeable networks:-

A network is called rearrangeable if it can perform all possible connections between inputs and outputs by rearranging its existing connections, so that a connection path for a new input-output pair can always be established.

Nonblocking networks:-

A network which can handle all possible connections without blocking is called a nonblocking network.

Mesh-connected Illiac network

A single-stage recirculating network has been implemented in the Illiac-IV array processor with N = 64 PEs. Each PEi is allowed to send data to any one of PEi+1, PEi-1, PEi+r and PEi-r, where r = √N, in one circulation step through the network. Formally, the Illiac network is characterized by the following four routing functions:

R+1(i) = (i + 1) mod N
R-1(i) = (i - 1) mod N
R+r(i) = (i + r) mod N
R-r(i) = (i - r) mod N

where 0 <= i <= N - 1. N is commonly a perfect square, such as N = 64 with r = 8 in the Illiac-IV network. A reduced Illiac network is illustrated in the figure for N = 16 and r = 4.

Each PEi in the figure is directly connected to its four nearest neighbours in the mesh network.

In terms of permutation cycles, we can express the above routing functions as follows. Horizontally, all the PEs of all rows form a linear circular list governed by the following two permutations, each with a single cycle of order N. The permutation cycles (a b c)(d e) stand for the permutations a→b, b→c, c→a and d→e, e→d, in a circular fashion within each pair of parentheses:

R+1 = (0 1 2 ... N-1)
R-1 = (N-1 ... 2 1 0)

For the example network with N = 16 and r = 4, the shift by a distance of four is specified by the following two permutations, each with four cycles of order four:

R+4 = (0 4 8 12)(1 5 9 13)(2 6 10 14)(3 7 11 15)
R-4 = (12 8 4 0)(13 9 5 1)(14 10 6 2)(15 11 7 3)

The Illiac network is only a partially connected network. The figure shows the connectivity of the example Illiac network with N = 16: four PEs can be reached from any PE in one step, seven PEs in two steps, and eleven PEs in three steps.
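
Because the four routing functions are one-line modular arithmetic, the connectivity of the network can be explored in a few lines of Python (our own sketch), counting the PEs newly reachable from PE 0 after each circulation step:

N, r = 16, 4   # the reduced Illiac network of the example

def illiac_steps(i):
    """The four PEs reachable from PE i in one circulation step."""
    return {(i + 1) % N, (i - 1) % N, (i + r) % N, (i - r) % N}

frontier, seen = {0}, {0}
for step in (1, 2, 3):
    frontier = {j for i in frontier for j in illiac_steps(i)} - seen
    seen |= frontier
    print(f"step {step}: {len(frontier)} PEs newly reached")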

Cube interconnection networks

The cube network can be implemented as either a recirculating network or as a multistage network for SIMD machines. A three-dimensional cube is illustrated in the figure.

Vertical lines connect vertices (PEs) whose addresses differ in the most significant bit position. Vertices at both ends of the diagonal lines differ in the middle bit position. Horizontal lines differ in the least significant bit position. This unit-cube concept can be extended to an n-dimensional unit space, called an n-cube, with n bits per vertex.

We shall use the binary sequence A = (an-1 ... a2 a1 a0) to represent a vertex (PE) address. The complement of bit ai will be denoted āi, for any 0 <= i <= n-1.

Formally, an n-dimensional cube network of N = 2^n PEs is specified by the following n routing functions:

Ci(an-1 ... ai+1 ai ai-1 ... a0) = an-1 ... ai+1 āi ai-1 ... a0    for i = 0, 1, 2, ..., n-1

In the n-cube, each PE located at a corner is directly connected to n neighbours; the neighbouring PEs differ in exactly one bit position. The implementation of a single-stage cube network is illustrated in the figure for N = 8.

The interconnections of the PEs corresponding to the three routing functions C0, C1 and C2 are shown separately; if one assembles all three connecting patterns together, the 3-cube shown in the figure results.

The same set of cube-routing functions C0, C1, C2 can also be implemented by a three-stage cube network, as modelled in the figure for N = 8.

Two-function (straight and exchange) switch boxes are used in constructing multistage cube networks. The stages are numbered 0 at the input end, increasing to n-1 at the output end. Stage i implements the Ci routing function for i = 0, 1, 2, ..., n-1; this means that switch boxes at stage i connect an input line to an output line that differs from it only in the ith bit position.
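
Since Ci simply complements bit i of the PE address, one routing step is a single XOR. This sketch (our own) lists the n neighbours of a PE in an n-cube:

def cube_route(i, addr):
    """C_i: complement bit i of the PE address (one hop along dimension i)."""
    return addr ^ (1 << i)

n = 3                                 # a 3-cube, N = 2**n = 8 PEs
pe = 0b101                            # PE 5
neighbours = [cube_route(i, pe) for i in range(n)]
print([format(a, "03b") for a in neighbours])   # ['100', '111', '001']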

Barrel shifter and data manipulator

Barrel shifters are also known as plus-minus-2^i (PM2I) networks. This type of network is based on the following routing functions:

B+i(j) = (j + 2^i) mod N
B-i(j) = (j - 2^i) mod N

where 0 <= j <= N-1, 0 <= i <= n-1 and n = log2 N. Comparing these with the Illiac routing functions, the following equivalences are revealed when r = √N = 2^(n/2):

B+0 = R+1
B-0 = R-1
B+n/2 = R+r
B-n/2 = R-r

This implies that the Illiac routing functions are a subset of the barrel-shifting functions. In addition to the adjacent (±1) and fixed-distance (±r) shifts, the barrel-shifting functions allow either forward or backward shifting by any power-of-two distance. Each PE in a barrel shifter is directly connected to 2n - 1 PEs; therefore, the connectivity in a barrel shifter is increased over the Illiac network by 2n - 5 extra direct links per PE.

The barrel shifter can be implemented as either a recirculating single-stage network or a multistage network. The figure shows the interconnection patterns in a recirculating barrel shifter for N = 8; the barrel-shifting functions B±i are executed by the interconnection patterns shown.
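
The PM2I routing functions are again modular arithmetic. A short sketch (our own) enumerates the distinct PEs directly reachable from one PE, confirming the 2n - 1 connectivity noted above (the +2^(n-1) and -2^(n-1) shifts coincide modulo N):

N = 8
n = N.bit_length() - 1   # n = log2(N) = 3

def pm2i_neighbours(j):
    """All PEs reachable from PE j in one step: B+i(j) and B-i(j), 0 <= i < n."""
    return ({(j + 2 ** i) % N for i in range(n)} |
            {(j - 2 ** i) % N for i in range(n)})

print(sorted(pm2i_neighbours(0)))            # [1, 2, 4, 6, 7]
print(len(pm2i_neighbours(0)) == 2 * n - 1)  # True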

A barrel shifter has been implemented with multiple stages in the form of a data manipulator. As shown in the figure, the data manipulator consists of N cells, each of which is essentially a controlled shifter. This network is designed for implementing data-manipulating functions such as permuting, replicating, spacing, masking and complementing.

To implement a data-manipulating function, the six groups of control lines (u1, u2, h1, h2, d1, d2, one set per column of cells) must be properly set through the use of the control register and the associated decoder.

Shuffle-exchange and omega networks

The class of shuffle-exchange networks is based on two routing functions, shuffle (S) and exchange (E). Let A = an-1 ... a1 a0 be a PE address:

S(an-1 ... a1 a0) = an-2 ... a1 a0 an-1

where 0 <= A <= N-1 and n = log2 N. The S function performs a cyclic shift of the bits in A to the left by one bit position. This action corresponds to perfectly shuffling a deck of N cards, as demonstrated in the figure.

The perfect shuffle cuts the deck into two halves from the centre and intermixes them evenly. The inverse perfect shuffle does the opposite, restoring the original ordering. The exchange routing function E is defined by:

E(an-1 ... a1 a0) = an-1 ... a1 ā0

Complementing the least significant bit means exchanging data between two PEs with adjacent addresses.

These shuffle-exchange functions can be implemented as either a recirculating network or a multistage network. For N = 8, a single-stage recirculating shuffle-exchange network is shown in the figure; the solid lines indicate exchange and the dashed lines indicate shuffle.

The use of the recirculating shuffle-exchange network for parallel processing was proposed by Stone.
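
Both routing functions are single bit-manipulations of the PE address. The sketch below (our own) implements S and E for N = 8; the printed shuffle permutation matches the perfect card shuffle described above:

n = 3          # N = 2**n = 8 PEs
N = 2 ** n

def shuffle(a):
    """S: cyclic left shift of the n-bit address (perfect shuffle)."""
    return ((a << 1) | (a >> (n - 1))) & (N - 1)

def exchange(a):
    """E: complement the least significant address bit."""
    return a ^ 1

print([shuffle(a) for a in range(N)])    # [0, 2, 4, 6, 1, 3, 5, 7]
print([exchange(a) for a in range(N)])   # [1, 0, 3, 2, 5, 4, 7, 6]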

The shuffle-exchange functions have been implemented with the multistage omega network by Lawrie. The omega network for N = 8 is illustrated in the figure. An N * N omega network consists of n identical stages; between two adjacent stages is a perfect-shuffle interconnection. Each stage has N/2 switch boxes under independent box control.

Each box has four functions (straight, exchange, upper broadcast, lower broadcast), as illustrated in the figure. The switch boxes in the omega network can be repositioned as shown in the figure without violating the perfect-shuffle interconnections between stages.
Chapter 4.

Parallel computers:-

Loosely coupled:-

Physical connection: PEs with private memories, connected via a network.
Logical connection: PEs compute independently and cooperate by exchanging messages.
Type of parallel computer: message-passing multicomputer.

Tightly coupled:-

Physical connection: PEs share a common memory and communicate via the shared memory.
Logical connection: PEs cooperate by exchanging results stored in the common memory.
Type of parallel computer: shared-memory multiprocessor.

Tightly coupled multiprocessors:-

1. A multiprocessor system with a common shared memory is classified as a shared-memory, or tightly coupled, multiprocessor.
2. Tightly coupled systems provide a cache memory with each CPU; in addition there is a common global memory that is accessed by all CPUs.
3. Information can be shared among the CPUs by placing it in the common global memory.

Loosely coupled multiprocessors:-

1. This type of multiprocessor is also called a distributed system.
2. Each PE in a loosely coupled system has its own private local memory.
3. The processors are tied together by a switching scheme designed to route information from one processor to another through a message-passing scheme.
4. A loosely coupled network is efficient when the interaction between tasks is minimal.
5. A tightly coupled system is efficient for real-time or high-speed interaction.

Uniform Memory Access

Uniform Memory Access (UMA) is a shared-memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly: in a UMA architecture, the access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.

Uniform Memory Access architectures are often contrasted with Non-Uniform Memory Access (NUMA) architectures.

In the UMA architecture, each processor may use a private cache, and peripherals are also shared in some fashion. The UMA model is suitable for general-purpose and time-sharing applications by multiple users. It can also be used to speed up the execution of a single large program in time-critical applications.

Types of UMA architectures

1. UMA using bus-based SMP architectures
2. UMA using crossbar switches
3. UMA using multistage switching networks

Time-shared bus (features)

1. The bus system is almost totally passive, with few active components such as switches.
2. Conflict-resolution methods such as fixed priorities, FIFO queues and daisy chaining are devised for efficient utilization of the bus.
3. Time-shared bus organizations are further divided into different sub-categories:
a. Single-bus multiprocessors
b. Unidirectional buses
c. Multibus multiprocessor organizations
4. It has the lowest overall system cost for the network and is the least complex.
5. It is very easy to physically modify the system into a new configuration.

Disadvantages:-

1. This organization is usually appropriate for smaller systems only.
2. Its efficiency is the lowest among all the organizations.
3. Expanding the system by adding functional units may degrade the overall performance.

Crossbar switch and multiport memory:-

1. This is an expansion of the time-shared common bus.
2. It requires the most expensive memory units, since most of the control and switching circuitry is included in the memory unit.
3. The data transfer rate is quite high, and the system remains stable as functional units are added; that is why it has the highest rate of data transfer.
4. A large number of cables and connectors are required.

Multistage networks for multiprocessors:-

1. This is the most complex interconnection system, and it supports a high data transfer rate.
2. The functional units are the cheapest.
3. Switches are used for routing requests to memory and other ports.
4. This organization is usually cost-effective for multiprocessors only.
5. System expansion (the addition of functional units) usually improves the overall performance.
6. The reliability of the switches and the system can be improved by segmentation within the switches.
Non-Uniform Memory Access

Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. Their commercial development came in work by Burroughs (later Unisys), Convex Computer (later Hewlett-Packard), Silicon Graphics, Sequent Computer Systems, Data General and Digital during the 1990s. Techniques developed by these companies later featured in a variety of Unix-like operating systems, and to some extent in Windows NT.

Basic concept

[Figure: one possible architecture of a NUMA system. The processors are connected to the bus or crossbar by connections of varying thickness/number, showing that different CPUs have different priorities for memory access based on their location.]

Modern CPUs operate considerably faster than the main memory to which they are attached.
In the early days of high-speed computing and supercomputers the CPU generally ran slower
than its memory, until the performance lines crossed in the 1970s. Since then, CPUs,
increasingly starved for data, have had to stall while they wait for memory accesses to
complete. Many supercomputer designs of the 1980s and 90s focused on providing high-
speed memory access as opposed to faster processors, allowing them to work on large data
sets at speeds other systems could not approach.

Limiting the number of memory accesses provided the key to extracting high performance
from a modern computer. For commodity processors, this means installing an ever-increasing
amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid
"cache misses". But the dramatic increase in size of the operating systems and of the
applications run on them has generally overwhelmed these cache-processing improvements.
Multi-processor systems make the problem considerably worse. Now a system can starve
several processors at the same time, notably because only one processor can access memory
at a time.
NUMA attempts to address this problem by providing separate memory for each processor,
avoiding the performance hit when several processors attempt to address the same memory.
For problems involving spread data (common for servers and similar applications), NUMA
can improve the performance over a single shared memory by a factor of roughly the number
of processors (or separate memory banks).

Of course, not all data ends up confined to a single task, which means that more than one
processor may require the same data. To handle these cases, NUMA systems include
additional hardware or software to move data between banks. This operation has the effect of
slowing down the processors attached to those banks, so the overall speed increase due to
NUMA will depend heavily on the exact nature of the tasks run on the system at any given
time.

Cache coherent NUMA (ccNUMA)

Nearly all CPU architectures use a small amount of very fast non-shared memory known as
cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache
coherence across shared memory has a significant overhead.

Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann architecture programming model. As a result, all NUMA computers sold to the market use special-purpose hardware to maintain cache coherence, and are thus classed as "cache-coherent NUMA", or ccNUMA.

Typically, this takes place by using inter-processor communication between cache controllers
to keep a consistent memory image when more than one cache stores the same memory
location. For this reason, ccNUMA performs poorly when multiple processors attempt to
access the same memory area in rapid succession. Operating-system support for NUMA
attempts to reduce the frequency of this kind of access by allocating processors and memory
in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make
NUMA-unfriendly accesses necessary. Alternatively, cache coherency protocols such as the
MESIF protocol attempt to reduce the communication required to maintain cache coherency.

Current ccNUMA systems are multiprocessor systems based on the AMD Opteron, which can be implemented without external logic, and the Intel Itanium, which requires the chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and those found in recent NEC Itanium-based systems. Earlier ccNUMA systems such as those from Silicon Graphics were based on MIPS processors and the DEC Alpha 21364 (EV7) processor.

Intel announced NUMA support for its x86 and Itanium servers in late 2007 with the Nehalem and Tukwila CPUs. Both CPU families share a common chipset; the interconnection is called the Intel QuickPath Interconnect (QPI).

NUMA vs. cluster computing

One can view NUMA as a very tightly coupled form of cluster computing. The addition of
virtual memory paging to a cluster architecture can allow the implementation of NUMA
entirely in software where no NUMA hardware exists. However, the inter-node latency of
software-based NUMA remains several orders of magnitude greater than that of hardware-
based NUMA.

In a typical SMP (symmetric multiprocessor) architecture, all memory accesses are posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but the problem with the shared bus appears when you have dozens, even hundreds, of CPUs competing for access to the shared memory bus. This leads to a major performance bottleneck due to the extremely high contention rate between the multiple CPUs on the single memory bus.

The NUMA architecture was designed to surpass these scalability limits of the SMP architecture. NUMA computers offer the scalability of MPP (Massively Parallel Processing), in that processors can be added and removed at will without loss of efficiency, together with the programming ease of SMP.

Load balancing:-

Load balancing is used to distribute computation fairly across processors in order to obtain the highest possible execution speed.

So far it was assumed that processes were simply distributed among the available processors, without any consideration of the types of processors or their speeds. However, it may be that some processors complete their tasks before others and become idle, because the work is unevenly divided or because some processors operate faster than others.

Load balancing is particularly useful when the amount of work is known prior to execution.

Load balancing is of two types:

1. Static load balancing
2. Dynamic load balancing

Static load balancing:

Static load balancing is load balancing attempted statically, before the execution of any process; it is usually referred to as the scheduling problem. Static load balancing techniques include the following (a sketch of the first appears after this list):

1. Round-robin algorithm:-
The round-robin algorithm passes out tasks to processes in sequential order, coming back to the first process after all processes have been given a task.
2. Randomized algorithm:-
Selects processes at random to take tasks.
3. Simulated annealing:-
An optimization technique.
4. Genetic algorithm:-
Another optimization technique.
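
A minimal sketch of round-robin assignment (our own helper; itertools.cycle supplies the wrap-around):

from itertools import cycle

def round_robin(tasks, processes):
    """Statically assign tasks to processes in cyclic order."""
    assignment = {p: [] for p in processes}
    for task, p in zip(tasks, cycle(processes)):
        assignment[p].append(task)
    return assignment

print(round_robin(range(7), ["P0", "P1", "P2"]))
# -> {'P0': [0, 3, 6], 'P1': [1, 4], 'P2': [2, 5]}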

Disadvantage of static load balancing:-

It is impossible to estimate accurately the execution time of the various parts of a program without actually executing those parts.

Dynamic load balancing:-

Dynamic load balancing is load balancing attempted during the execution of the processes. All the relevant factors are taken into account by making the division of load depend upon the execution of the parts as they are being executed.

Dynamic load balancing can be divided into the following categories:

1. Centralized LBA
2. Fully distributed (decentralized) LBA
3. Semi-distributed LBA
1. Centralized load balancing:

a. A master process holds the collection of tasks to be performed. Tasks are sent out to the slave processes; when a slave completes one task, it requests another task from the master process.
[Figure: centralized load balancing — a master process with a task queue sends tasks to slave processes and receives their requests for more work.]

b. The master makes a global decision about the relocation of work among the processors. Some centralized algorithms assign the maintenance of system-wide global state information to a particular node.
c. Global state information allows the algorithm to do a good job of balancing the work among the processes.
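
The master/slave scheme maps naturally onto a shared work queue. A minimal sketch using Python threads (threads stand in for slave processors here, purely to illustrate the ask-for-work-when-idle protocol):

import queue
import threading

tasks = queue.Queue()
for t in range(10):              # the master's collection of tasks
    tasks.put(t)

def slave(name):
    while True:
        try:
            t = tasks.get_nowait()   # request another task from the master queue
        except queue.Empty:
            return                   # no work left
        print(f"{name} completed task {t}")

workers = [threading.Thread(target=slave, args=(f"slave{i}",)) for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()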

Disadvantages:

a. The master process can issue only one task at a time.
b. The algorithm does not scale well, because the amount of state information grows linearly with the number of processors.

2. Fully distributed LBA:-

a. The master process can be divided into mini-masters.

[Figure: hierarchical load balancing — a master process with a task queue sends tasks to mini-masters, which in turn distribute them to their slaves.]

b. Each processing element holds its own information about its state. These processing elements are free to interact with each other and to balance the load among themselves.
c. An interchange of data takes place if one processor has more data and another has less.

Disadvantage:

Without a global view of the load distribution, the work load may not balance as well as in the centralized load balancing algorithm.

3. Semi-distributed LBA:

In this LBA we divide the network into various regions; each region has a centralized processor which holds the state information, and these regional processors are in turn coordinated by a single centralized processor.

Dynamic load balancing algorithms can also be divided into two categories:
1. Sender-initiated algorithms
2. Receiver-initiated algorithms
Sender initiated algorithm:-

A process sends tasks to another process that it selects.

A heavily loaded process passes out some of its tasks to other processes that are willing to accept them.

Receiver initiated algorithm:-

A process requests tasks from another process that it selects. A process requests tasks from
another process when it is free and has no task to perform.

Process scheduling:-

Process scheduling is the allocation of tasks to the processors. A schedule is illustrated by a
Gantt chart indicating the time each task spends in execution as well as the processor on
which it executes. So we can say that scheduling is the division of work among the processes.

Types of scheduling:-

1. Static scheduling
2. Dynamic scheduling

Static scheduling:-

The technique of separating dependent instructions to minimize the number of actual
hazards and resultant stalls is called static scheduling. It became popular with pipelining.

There are three main reasons why static scheduling is still of interest:

1. Static scheduling sometimes results in lower execution times than dynamic scheduling.
2. Static scheduling can be used to predict the speed-up that can be achieved by a
particular parallel algorithm on a target machine, assuming no pre-emption of processes
occurs.
3. Static scheduling can allow the generation of only one process per processor,
reducing process creation, synchronization, and termination overhead.

Dynamic scheduling:

The technique where the hardware rearranges the instruction execution at run time to reduce
the stalls. Dynamic scheduling offers several advantages:
1. It simplifies the compiler.
2. It allows code that was compiled with one pipeline in mind to run efficiently on a
different pipeline.

These advantages are gained at the cost of a significant increase in hardware complexity.

Scheduling algorithms:

1. Deterministic model
2. Graham's list scheduling algorithm
3. Coffman-Graham scheduling algorithm

1. Deterministic model:

A parallel algorithm is a collection of tasks, some of which must be completed before others begin.

In the deterministic model the precedence relations between tasks and their execution times
are predefined, i.e. known before run time.

This information is usually represented by a task graph.

For example, consider the task graph illustrated in the figure below. We are given a set
of seven tasks and their precedence relations (information about which tasks must be completed
before other tasks can be started).

[Figure: task graph with seven tasks T1 to T7; directed arcs give the precedence relations.]

In the figure each node represents a task to be performed. A directed arc from Ti to Tj indicates
that the task Ti must be completed before Tj starts.

The allocation of tasks to the processors is given by a schedule. Gantt charts are the best way
to illustrate schedules. A Gantt chart indicates the time each task spends in execution as
well as the processor on which it executes.

[Figure: Gantt chart illustrating a schedule for the task graph; each row is a processor and the horizontal axis marks time units 1 to 9.]

2. Coffman-Graham scheduling algorithm:-

a. This algorithm constructs a list of tasks for the simple case when all the tasks take the
same amount of time.
b. Once the list L has been constructed, the algorithm applies Graham's list scheduling
algorithm.
c. Let T = {T1, T2, ..., Tn} be a set of n unit-time tasks to be executed
on p processors.
d. Let < be a partial order on T that specifies which tasks must be completed before
other tasks begin.
e. If Ti < Tj then the task Ti is an immediate predecessor of the task Tj.
f. Let α(T) be an integer label assigned to task T, and let N(T) denote the decreasing
sequence of integers formed by ordering the set of labels of T's immediate successors.

[Figure: example task graph with nine tasks T1 to T9 used to illustrate the Coffman-Graham scheduling algorithm.]

Algorithm steps:-
1. Choose an arbitrary task Tk from T such that S(Tk) = Ø (Tk has no successors) and define α(Tk) = 1.
2. For i ← 2 to n do
a. Let R be the set of unlabeled tasks with no unlabeled successors.
b. Let T* be the task in R such that N(T*) is lexicographically smaller than
N(T) for all T in R.
c. Let α(T*) ← i.
3. Construct a list of tasks L = (Un, Un-1, ..., U2, U1) such that α(Ui) = i for all i,
where 1 ≤ i ≤ n.
4. Given (T, <, L), use Graham's list scheduling algorithm to schedule the tasks in T.

3. Non-deterministic model:-

In this model the execution times and the precedence of the task execution are not
known before run time.
The tasks with no predecessors are called initial tasks.

[Figure: computation graph whose nodes N1 to N5 execute tasks T1 to T5.]

G1 = N1N3N5 + N2N3N5 + N4N5
   = [(N1 + N2)N3 + N4]N5

G3 = X1X3 + X1X4 + X2X4

G is said to be simple if its polynomial can be factored so that each variable appears
exactly once.

4. Graham's list scheduling algorithm:-

Let T = {T1, T2, ..., Tn} be a set of tasks.
Let µ : T → (0, ∞) be a function that associates an execution time with each task.
We are also given a partial order < on T and a list L of the tasks in T.
Whenever a processor has no work to do, it instantaneously removes from L the
first ready task, that is, an unscheduled task whose predecessors under < have all
completed execution.
If two or more processors simultaneously attempt to execute the same task, the
processor with the lowest index succeeds, and the other processors look for another
suitable task.
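A minimal discrete-event sketch of this list scheduling rule in Python, assuming the precedence graph is acyclic; the helper names (list_schedule, tasks, preds) and the sample task set are invented for the illustration.

import heapq

def list_schedule(tasks, preds, L, p):
    # tasks: task -> execution time; preds: task -> set of predecessor tasks
    # L: the priority list; p: number of processors
    time = 0.0
    finished = {}                    # task -> completion time
    running = []                     # heap of (finish_time, processor, task)
    free_procs = list(range(p))      # idle processors, lowest index first
    pending = list(L)
    schedule = {}                    # task -> (processor, start time)
    while pending or running:
        i = 0
        while free_procs and i < len(pending):
            t = pending[i]
            if all(q in finished for q in preds[t]):   # first ready task in L
                proc = free_procs.pop(0)               # lowest index succeeds
                schedule[t] = (proc, time)
                heapq.heappush(running, (time + tasks[t], proc, t))
                pending.pop(i)
            else:
                i += 1
        # advance time to the next task completion (graph must be acyclic)
        fin, proc, t = heapq.heappop(running)
        time, finished[t] = fin, fin
        free_procs.append(proc)
        free_procs.sort()
    return schedule

tasks = {"T1": 1, "T2": 2, "T3": 1, "T4": 2}
preds = {"T1": set(), "T2": {"T1"}, "T3": {"T1"}, "T4": {"T2", "T3"}}
print(list_schedule(tasks, preds, ["T1", "T2", "T3", "T4"], p=2))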

Chapter 5 & 6

Parallel Algorithms:-

Algorithms designed for sequential computers are known as sequential algorithms.
The algorithms which are used to solve a problem on a parallel computer are known
as parallel algorithms.

A parallel algorithm defines how the problem is divided into subproblems, how the
processors communicate, and how the partial solutions are combined to produce the
final result.

Parallel algorithms depend upon the type of parallel computer they are designed for.

Thus for a single problem we may need to design different parallel algorithms for different
architectures.

In order to simplify the design and analysis of parallel algorithms, parallel computers are
represented by various abstract machine models. These machine models try to
capture the important features of parallel computers.

Assumptions:-

1. In designing algorithms for these models we learn about the inherent parallelism in the
problem.
2. These models help us to compare the relative computational power of various
parallel computers.
3. These models help us to determine the kind of parallel architecture that is best
suited for a given problem.

Model of computation:-
RAM (Random Access Machine)
This machine model of computation abstracts the sequential computer.
[Figure: RAM model — a processor connected to a memory unit through a memory access unit.]
1. Memory unit:-
A memory unit has M locations, where M can be unbounded.
2. Processor:-
A processor that operates under the control of a sequential algorithm.
The processor can read from and write to memory locations and can perform
basic arithmetic and logic (ALU) operations.
3. MAU (memory access unit):-
It creates a path from the processor to an arbitrary location
in the memory. The processor provides the memory access unit with the
address of the location that it wishes to access and the operation it wants to
perform. The memory access unit uses that address to establish a direct
connection between the processor and the memory location.
An algorithm for the RAM consists of the following steps:
1. Read:-
The processor reads data from memory and stores it in its local registers.
2. Execute:-
The processor performs a basic arithmetic or logic operation on the
contents of one or two of its local registers.
3. Write:-
The processor writes the contents of one register into an arbitrary memory
location.

PRAM (Parallel Random Access Machine)

The PRAM is one of the most popular models for designing parallel algorithms.

[Figure: PRAM model — processors P1, P2, ..., PN connected to a shared memory through a memory access unit (MAU).]

a) A set of N processors (P1, P2, P3, ..., PN), where N is unbounded.

b) A memory with M locations which is shared by all processors, where M is
unbounded.
c) MAU:-
The MAU allows the processors to access the shared memory. The shared memory
functions as the communication medium for the processors.

Steps of a PRAM algorithm:-

Read:-
The N processors read simultaneously from M memory locations and store the
values in their registers.

Compute:-
The N processors perform basic arithmetic and logic operations on the values
in their registers.

Write:-
The N processors write simultaneously from their registers into M memory locations.

Types of PRAM model:-

1. EREW
2. CREW
3. ERCW
4. CRCW

EREW:-
EREW stands for exclusive read exclusive write. In this model every
access to a memory location (read or write) has to be exclusive; concurrent read and
concurrent write operations are not allowed. This model provides the least amount of
concurrency and is therefore the weakest model.

2. CREW:-
CREW stands for concurrent read exclusive write PRAM. In this model
only write operations to a memory location are exclusive. Concurrent reads are
allowed: two or more processors can concurrently read from the same memory
location. This is one of the most commonly used models.

3. ERCW:-
ERCW stands for exclusive read concurrent write PRAM model. In
this model read operations are exclusive, while multiple processors may
write concurrently to the same memory location. This model is not frequently used and
is defined here for the sake of completeness.
4. CRCW:-
CRCW stands for concurrent read concurrent write PRAM model. This
model allows multiple processors to concurrently read from and write to a memory
location. It provides the maximum amount of concurrency and is the most powerful
of the four memory models.

Types of CRCW:-
When many processors try to write to the same memory location, the model has to
specify the value that is actually written. Several protocols are used for this:

1. Priority CRCW:-

In this protocol processors are assigned priorities. The processor
with the highest priority (among those contending to write) succeeds in writing its value to
the memory location.
2. Common CRCW:-
Here the processors that are trying to write to a memory location
are allowed to do so only when they all write the same value.
If the processors are writing the same value the write is allowed; otherwise it is an
illegal operation.

3. Arbitrary CRCW:-

Here the processor that writes to the memory location is selected arbitrarily,
without affecting the correctness of the algorithm; the algorithm must work correctly
no matter which processor is selected.

4. Combining CRCW:-

In this protocol there is a function that maps the multiple values that the processors
try to write into a single value that is actually written into the memory location.

E.g. the summation function.
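The four write-resolution protocols can be sketched in Python as follows, simulating a single synchronous write step to one memory cell; the function names and the sample writes are invented for the illustration.

# Hypothetical illustration: resolving concurrent writes to one memory cell.
# writes is a list of (processor_id, value) pairs attempting the same cell.

def priority_crcw(writes):
    # lowest processor id is taken as highest priority; its value wins
    return min(writes)[1]

def common_crcw(writes):
    values = {v for _, v in writes}
    if len(values) != 1:
        raise ValueError("illegal: processors wrote different values")
    return values.pop()

def arbitrary_crcw(writes):
    # any contender may win; here we simply take the first one
    return writes[0][1]

def combining_crcw(writes):
    # combine all attempted values, here with the summation function
    return sum(v for _, v in writes)

writes = [(0, 5), (2, 7), (1, 5)]
print(priority_crcw(writes))     # 5  (processor 0 wins)
print(arbitrary_crcw(writes))    # 5  (some contender wins)
print(combining_crcw(writes))    # 17 (sum of the attempted values)
# common_crcw(writes) would raise here, since the values differ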

Interconnection network:-

In the PRAM model the exchange of data between processors takes place either through the
shared memory or through direct links connecting the processors.

The network topology is important as it specifies the interconnection network model of a
parallel computer. This model helps in determining the communication paths of the parallel
computer.
Combinational circuit:-

The combinational circuit is another model of parallel computation; it refers to a family of
models of computation.

A combinational circuit is viewed as a device that has a set of input lines at one end and a set
of output (O/P) lines at the other end.

[Figure: a combinational circuit as an interconnection network of components arranged in stages between the input lines and the output lines.]

The circuit is made up of a number of interconnected components arranged in columns
called stages.
A combinational circuit consists of:-
1. Interconnected components (processors):-
Each component has a fixed number of input lines,
called its fan-in, and a fixed number of output lines, called its fan-out.
A component is active only when it has received all the inputs necessary for its
computation.
2. An important feature of a combinational circuit is that it has no feedback: no
component can be used more than once while computing the output for a
given input.
The parameters used to analyse a combinational circuit are:-
1. Size:- the number of components used in the circuit.
2. Depth:- the number of stages in the circuit.
3. Width:- the maximum number of components in any one stage.

Analysis of the performance of parallel algorithms:-

Parallel algorithms are evaluated on the basis of the following criteria:

1. Running time
2. Number of processors
3. Cost
Relative strengths/power/features of the PRAM models:-
1. EREW PRAM:- The EREW PRAM model has the least concurrency and is therefore the
weakest model.
2. CREW PRAM:- The CREW PRAM model is more commonly used than the
EREW PRAM model.
3. ERCW PRAM:- The ERCW PRAM is rarely used; it is defined mainly for the sake of
completeness.
4. CRCW PRAM:- The CRCW PRAM is the strongest model and is widely used.

Chapter 6

PRAM ALGORITHMS

A PRAM algorithm has two phases:

1. Initialization or activation of the P.E.s (processing elements):

a) Spawn(<processor names>)
This instruction is used to initialize the processors, i.e. to decide which
processing elements are used for further processing.
2. The active processors are then made to do the desired work:
a) for all <processor list> do <statement list> endfor
This instruction is used to allocate the task to the processors.

PARALLEL REDUCTION ALGORITHMS

The binary tree is one of the most important structures in parallel computing.
In some algorithms data flows from top to bottom, i.e. from the root to the leaves, e.g.
divide-and-conquer algorithms and broadcast algorithms.
In other algorithms data flows from the bottom up, i.e. from the leaves to the root; these are
called reduction operations. An example of this is summation.

Array representation of parallel reduction. Starting from the ten values

4 6 1 2 9 3 0 5 7 3

each step adds disjoint pairs of partial sums:

step 1: 10 3 12 5 10
step 2: 13 17 10
step 3: 30 10
step 4: 40

The overall time complexity of this algorithm is O(log n) using n/2 processors.
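A minimal Python sketch of this pairwise reduction, simulating the ⌈log n⌉ synchronous steps one after another and reproducing the worked example above; the function name is invented for the illustration.

def parallel_reduce(values):
    # each round simulates one synchronous step in which disjoint
    # pairs of partial sums are added by separate processors
    step = 1
    while len(values) > 1:
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
        print(f"step {step}: {values}")
        step += 1
    return values[0]

print(parallel_reduce([4, 6, 1, 2, 9, 3, 0, 5, 7, 3]))   # 40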

Prefix sum algorithm:-

Given a set of n values a1, a2, a3, a4, ..., an and an associative operation +, the
prefix sum problem is to compute the n quantities:

a1
a1 + a2
a1 + a2 + a3
...
a1 + a2 + ... + an

For example, applying the operation "+" to the array (3, 10, 4, 2), the prefix sums are:

1) a1 = 3
2) a1 + a2 = 13
3) a1 + a2 + a3 = 3 + 10 + 4 = 17
4) a1 + a2 + a3 + a4 = 3 + 10 + 4 + 2 = 19

So the result array will be (3, 13, 17, 19).

PRAM algorithm:-
To find the prefix sums of an n-element list using n-1 processors.

Prefix sums (CREW PRAM)
Initial condition:- a list of n ≥ 2 elements stored in A[0 ... (n-1)]
Final condition:- each A[i] contains A[0] + A[1] + ... + A[i]
Global variables:- n, A[0 ... (n-1)], j
begin
  spawn(P1, P2, ..., Pn-1)
  for all Pi where 1 ≤ i ≤ n-1 do
    for j ← 0 to ⌈log n⌉ - 1 do
      if i - 2^j ≥ 0 then
        A[i] ← A[i] + A[i - 2^j]
      endif
    endfor
  endfor
end
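A minimal Python sketch of this doubling algorithm; the copy of the array before each round stands in for the PRAM rule that all reads of a step happen before the writes.

import math

def prefix_sums(A):
    # simulate the CREW PRAM doubling algorithm in place
    n = len(A)
    for j in range(math.ceil(math.log2(n))):
        old = A[:]                       # all processors read the old values...
        for i in range(1, n):            # ...then each Pi writes its update
            if i - 2**j >= 0:
                A[i] = old[i] + old[i - 2**j]
    return A

print(prefix_sums([3, 10, 4, 2]))                      # [3, 13, 17, 19]
print(prefix_sums([4, 6, 1, 2, 8, 3, 0, 5, 7, 3]))     # matches the array representation below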

Prefix sums have many applications.

Let us take another problem: to separate upper case letters from lower case letters while
maintaining their order, keeping the upper case letters first in the array A.

Let us take an array A of n letters with the following elements:

A b C D e F g h

Array A

We also build another array T of the same size, putting a "1" wherever we find an upper case
letter and a "0" wherever we find a lower case letter:

1 0 1 1 0 1 0 0

Now compute the prefix sums of T using the addition operation:

1 1 2 3 3 4 4 4

Where a prefix-sum value appears for the first time, it gives the destination index of an upper
case letter: the first "1" corresponds to A, the first "2" to C, the first "3" to D, and the
first "4" to F.

So the upper case letters are placed at positions 1 to 4 of the result, and the lower case
letters follow in their original order.

Now the array will look like:

A C D F b e g h

The complexity of this algorithm is O(log n) using n-1 processors.
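A minimal Python sketch of this packing application; the prefix sums are computed sequentially here for brevity (a PRAM would use the doubling algorithm above), and the placement of the lower case letters by a running counter is likewise only for clarity.

def pack_uppercase(letters):
    # flags[i] = 1 for an upper case letter, 0 otherwise
    flags = [1 if c.isupper() else 0 for c in letters]
    ranks, s = [], 0
    for f in flags:                        # sequential prefix sums of the flags
        s += f
        ranks.append(s)
    total_upper = ranks[-1] if ranks else 0
    out = [None] * len(letters)
    lower_pos = total_upper                # lower case letters go after the block
    for i, c in enumerate(letters):
        if flags[i]:
            out[ranks[i] - 1] = c          # prefix sum gives the 1-based position
        else:
            out[lower_pos] = c
            lower_pos += 1
    return "".join(out)

print(pack_uppercase("AbCDeFgh"))   # ACDFbegh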

Array representation of prefix sums:-

        A[0] A[1] A[2] A[3] A[4] A[5] A[6] A[7] A[8] A[9]
step 0:  4    6    1    2    8    3    0    5    7    3
step 1:  4   10    7    3   10   11    3    5   12   10
step 2:  4   10   11   13   17   14   13   16   15   15
step 3:  4   10   11   13   21   24   24   29   32   29
step 4:  4   10   11   13   21   24   24   29   36   39

From step 0 to 1:

Each element sends its value to its first right neighbour, and the neighbour's value is updated
by the addition.

From step 1 to 2:

The addition is performed with the second right neighbour, whose value is updated.

From step 2 to 3:

The addition is performed with the fourth right neighbour.

From step 3 to 4:

The addition is performed with the eighth right neighbour.

Parallel list ranking / suffix sums:

The suffix sum problem is a variant of the prefix sum problem in which the array is replaced
by a linked list, and the sums are computed from the end rather than from the beginning.

Algorithm to find the suffix sums of n elements:

List ranking (CREW PRAM)

Initial condition: the values in array 'next' represent a linked list.

Final condition: the values in array 'position' contain the original distance of each element
from the end of the list.

Global variables: n, position[0 ... (n-1)], next[0 ... (n-1)], j

begin
  spawn(P0, P1, ..., Pn-1)
  for all Pi where 0 ≤ i ≤ n-1 do
    if next[i] = i then
      position[i] ← 0
    else
      position[i] ← 1
    endif
    for j ← 1 to ⌈log n⌉ do
      position[i] ← position[i] + position[next[i]]
      next[i] ← next[next[i]]
    endfor
  endfor
end
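A minimal Python sketch of this pointer-jumping algorithm; the copies of the arrays before each round again simulate the synchronous PRAM step, and the sample list is invented for the illustration.

import math

def list_rank(next_):
    # next_[i] is the index of i's successor; the last node points to itself
    n = len(next_)
    position = [0 if next_[i] == i else 1 for i in range(n)]
    for _ in range(math.ceil(math.log2(n))):
        old_pos, old_next = position[:], next_[:]
        for i in range(n):                    # one synchronous PRAM step
            position[i] = old_pos[i] + old_pos[old_next[i]]
            next_[i] = old_next[old_next[i]]
    return position

# list 0 -> 1 -> 2 -> 3 -> 4 (node 4 is the end and points to itself)
print(list_rank([1, 2, 3, 4, 4]))   # [4, 3, 2, 1, 0]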
Parallel merge:

Merging combines two sorted lists into a single sorted list.

Merge lists (CREW PRAM)

Given: two sorted lists of n/2 elements each, stored in A[1] ... A[n/2] and A[(n/2)+1] ... A[n].

Global: A[1 ... n]

Local: i, x, low, high, index

begin
  spawn(P1, P2, ..., Pn)
  for all Pi where 1 ≤ i ≤ n do
    if i ≤ n/2 then
      low ← n/2 + 1
      high ← n
    else
      low ← 1
      high ← n/2
    endif
    x ← A[i]
    repeat
      index ← ⌊(low + high)/2⌋
      if x < A[index] then
        high ← index - 1
      else
        low ← index + 1
      endif
    until low > high
    A[high + i - n/2] ← x
  endfor
end
On a PRAM the complexity reduces to O(log n), compared with the O(n) sequential RAM algorithm.

One processor is assigned to each element; it determines the position of its element in the
final merged list. n processors are required for n elements. Processing elements holding
elements of the upper list search the lower list and vice versa.

[Figure: merging two sorted sublists; each element finds its rank in the other list by binary search.]

Every processor finds the position of its own element in the other list using binary search.
Since the element's index in its own list is already known, its position in the merged list can
be computed by adding the two indices once its rank in the other list is found.
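A minimal Python sketch of this rank-based merge, with one loop iteration standing in for each processor and the standard bisect module providing the binary search; the names are invented for the illustration.

import bisect

def parallel_merge(left, right):
    # each element's final position = its index in its own list
    # + the number of smaller elements in the other list (its rank there)
    out = [None] * (len(left) + len(right))
    for i, x in enumerate(left):            # conceptually one processor per element
        out[i + bisect.bisect_left(right, x)] = x
    for i, x in enumerate(right):
        out[i + bisect.bisect_right(left, x)] = x
    return out

print(parallel_merge([1, 5, 7, 9], [2, 3, 11, 19]))
# [1, 2, 3, 5, 7, 9, 11, 19]

Using bisect_left for one list and bisect_right for the other ensures that equal elements are assigned distinct positions, so no two "processors" write to the same slot.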

Cost optimal algorithms:

A cost optimal algorithm is an algorithm whose cost is in the same complexity class as an
optimal sequential algorithm.

In other words, a parallel algorithm whose cost has the same complexity as the optimal RAM
algorithm is termed a cost optimal algorithm.

A cost optimal parallel reduction algorithm has time complexity O(log n) using n/log n processors.

To establish whether a cost optimal algorithm exists:

1. Determine the number of processors needed to perform the (n-1) operations.

2. Having determined the number of processors, verify whether a cost optimal parallel
reduction algorithm with O(log n) complexity exists. This is done using Brent's theorem.

Cost optimality is considered important in the design of parallel algorithms.

The amount of work an algorithm performs is the running time of the algorithm multiplied by
the number of processors it uses. A conventional (sequential) algorithm may be thought of as
a parallel algorithm designed for one processor. An algorithm is said to be cost optimal if the
amount of work it does is the same as that of the best known sequential algorithm.

Brent's theorem:-

Brent's theorem states that if a parallel algorithm A performs m computational operations in
t time steps (with an unlimited number of processors), then p processors can execute
algorithm A in time T, where

T ≤ t + (m - t)/p

The typical application is when p is smaller than the number of processors that gave rise to
the time t.

Note that when p = 1, Brent's bound gives T = m (the single processor executes the operations
sequentially, one after another).

When p = ∞, Brent's bound gives T = t.

Brent's theorem says that, for a parallel algorithm with t time steps and a total of m
operations, a run time of T is definitely achievable on a shared memory machine with p
processors. There may be an algorithm that solves the problem faster, or it may be possible to
implement this algorithm faster (by scheduling instructions differently to minimize idle
processors, for instance), but it is definitely possible to implement this algorithm in this time,
given p processors.

Key to understanding Brent's theorem is understanding time steps. In a single time step every
instruction that has no pending dependencies is executed; therefore t is equal to the length of the
longest chain of instructions in which each instruction depends on the result of the previous
one (any shorter chains will have finished executing by the time the longest chain has).

E.g. suppose we are summing an array:

for (i = 0; i < length(a); i++)
    sum = sum + a[i];    /* each addition depends on the previous one */

Using this algorithm each add operation depends on the result of the previous one, forming a
chain of length n; thus t = n. There are n operations, so m = n.

T = t + (m - t)/p = n + 0/p = n

Since (m - t) = 0, no matter how many processors are available this algorithm will take time n.

If, however, the additions are reorganized as a balanced binary tree, the longest dependence
chain has length log n, so t = log n while m = n - 1 ≈ n.

IMPORTANT PREDICTIONS ABOUT THIS ALGORITHM:-

1. No matter how many processors are used, there can be no implementation of this
algorithm that is faster than O(log n).
2. If we have n processors, the algorithm can be implemented in O(log n) time.
3. If we have n/log n processors, the algorithm can be implemented in about 2 log(n) time.
4. If we have one processor, the algorithm can be implemented in n time.
If we consider the amount of work done in each case: with one processor we do n
work, with n/log n processors we do about 2n = O(n) work, but with n processors we do
n log(n) work (the small check below verifies these numbers).
The implementations with 1 or n/log n processors are therefore cost optimal,
while the implementation with n processors is not cost optimal.
Brent's theorem does not tell us how to implement a parallel algorithm, but it tells us
what is possible.
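A small Python check of these predictions using the bound T = t + (m - t)/p; the function name brent_bound and the value of n are invented for the illustration.

import math

def brent_bound(t, m, p):
    # Brent's theorem: p processors need at most t + (m - t)/p steps
    return t + (m - t) / p

n = 1024
t = math.log2(n)          # depth of the balanced summation tree
m = n - 1                 # total number of additions

print(brent_bound(t, m, 1))                  # ~= m: one processor runs sequentially
print(brent_bound(t, m, n))                  # ~= t = log n with n processors
print(brent_bound(t, m, n / math.log2(n)))   # ~= 2 log n with n/log n processors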
NC algorithms:-
The class NC is the set of languages decidable in polylogarithmic parallel time:
NC is the class of problems solvable on a PRAM in polylogarithmic time using a
number of processors that is a polynomial function of the problem size.
If an algorithm is in NC, it remains in NC regardless of which PRAM submodel
we assume.
The class NC includes:
1. Parallel prefix computation.
2. Parallel sorting and selection.
3. Matrix operations.
4. Parallel tree contraction and expression evaluation.
5. Parallel algorithms for graphs.
6. Parallel algorithms for computational geometry.
7. Parallel algorithms for biconnectivity and triconnectivity.
8. Parallel string matching algorithms.
Many NC algorithms are cost optimal; for example, an algorithm with
T(n, p(n)) = O(log n) and p(n) = n/log n has cost O(n),
where O denotes the usual complexity notation.
Drawbacks of NC theory:-
1. The class NC may include some algorithms which are not efficiently
parallelizable; the most infamous example is parallel binary search.
2. NC theory assumes a situation where a huge machine solves a moderately
sized problem very quickly.
In practice, however, moderately sized machines are used to solve large problems, so
the number of processors tends to be small compared with the problem size.

Chapter:- 7

PARALLEL ALGORITHMS FOR ARRAY PROCESSORS

SIMD matrix multiplication

Matrix manipulation is frequently needed in solving linear systems of equations. Important
matrix operations include matrix multiplication, L-U decomposition, and matrix inversion.
The differences between SISD and SIMD matrix algorithms show up in their program
structures and speed performances.

Algorithm for SIMD matrix multiplication:

For i = 1 to n Do
  Par for k = 1 to n Do
    C[i,k] = 0                              (vector load)
  For j = 1 to n Do
    Par for k = 1 to n Do
      C[i,k] = C[i,k] + a[i,j] . b[j,k]     (vector multiply)
  End of j loop
End of i loop

The vector load operation is performed to initialize the row vectors of matrix C one row at a
time. In the vector multiply operation, the same multiplier a[i,j] is broadcast from the CU to all
PEs to multiply all n elements of the jth row vector of B.
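A minimal Python emulation of this scheme, in which a list comprehension stands in for the n PEs updating one row of C in lockstep; the explicit aij variable mirrors the broadcast from the CU, and the names are invented for the illustration.

def simd_matmul(a, b):
    n = len(a)
    c = [[0] * n for _ in range(n)]          # vector load, one row at a time
    for i in range(n):
        for j in range(n):
            aij = a[i][j]                    # broadcast a[i,j] from the CU to all PEs
            # the n PEs update C[i,k] for k = 1..n in one lockstep vector multiply
            c[i] = [c[i][k] + aij * b[j][k] for k in range(n)]
    return c

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(simd_matmul(a, b))    # [[19, 22], [43, 50]]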

Parallel sorting on array processors

An SIMD algorithm is presented for sorting n2 elements on a mesh-connected processor
array in O(n) routing and comparison steps. We assume an array processor with N = n2
identical PEs interconnected by a mesh network similar to the Illiac-IV, except that the PEs at
the perimeter have two or three rather than four neighbours.

Two time measures are needed to estimate the time complexity of the parallel sorting
algorithm. Let tR be the routing time required to move one item from a PE to one of its
neighbours, and tC be the comparison time required for one comparison step. This means that
a comparison-interchange step between the two items in adjacent PEs can be done in 2tR + tC
time units.

The sorting problem depends on the indexing scheme for the PEs. The PEs may be indexed
by a bijection from {1, 2, ..., n} x {1, 2, ..., n} to {0, 1, ..., N-1}, where N = n2. Three
indexing patterns are formed after sorting the given array (part a) with respect to three different
ways of indexing the PEs: the pattern in part b corresponds to row-major indexing, part
c corresponds to shuffled row-major indexing, and part d is based on snake-like row-major
indexing.

Shuffle and unshuffle operations can each be implemented with a sequence of interchange
operations. Both the perfect shuffle and its inverse can be done in k - 1 interchanges, or
2(k - 1) routing steps, on a linear array of 2^k PEs.

PARALLEL ALGORITHMS FOR MULTIPROCESSORS (MIMD)

A parallel algorithm for multiprocessors is a set of k concurrent processes which may operate
simultaneously and cooperatively to solve a given problem.

Interaction points:- Interaction points are those points where processes communicate with
other processes. These interaction points divide a process into stages.

There are two types of parallel algorithm:

1. Synchronized algorithms:- Parallel algorithms in which some processes have to wait on other
processes are called synchronized algorithms. In these algorithms the processes have the
property that there exists a process such that some stage of that process is not activated until
another process has completed a certain stage of its program. The execution time of a process
is a variable which depends on the input data and on system interruptions.

2. Asynchronous algorithms:- In asynchronous algorithms, processes are not required to wait for
each other, and communication is achieved by reading dynamically updated global variables
stored in a shared memory. There is a set of global variables accessible to all the processes.
When a stage of a process is completed, based on the values of the variables read together with
the results just obtained from the last stage, the process modifies a subset of the global variables
and then activates its next stage or terminates itself. In some cases, operations on the global
variables are programmed as critical sections.

The main characteristic of an asynchronous parallel algorithm is that its processes never wait
for inputs at any time, but continue execution or terminate according to whatever information
is currently contained in the global variables.

Alternative approach

Macro pipelining:- It is applicable if the computation can be divided into parts called stages,
so that the output of one or several collected parts is the input of another part. In this case, as
each computational part is realized as a separate process, the communication cost may be high.

An algorithm which is to be executed on a multiprocessor system must be decomposed
into a set of processes in order to exploit parallelism.

In static decomposition:- the set of processes and their precedence relations are known before
execution.

In dynamic decomposition:- the set of processes changes during execution.
