# A New Simplified Approach to

Parallel Architecture and Computing
IT (Fifth Semester)

PTU Jalandhar

By: Er. Umesh Vats
Information Technology, BCET Ludhiana

PTU Syllabus

IT-309 PARALLEL ARCHITECTURE AND COMPUTING

PREREQUISITES: Computer Architecture.

OBJECTIVES: This course offers a good understanding of the various functional units of a computer system and prepares the student to be in a position to design a basic computer system.

COURSE CONTENTS

1. Introduction and Classification of Parallel Computers [8%]: Parallel processing terminology, Flynn's and Händler's classifications, Amdahl's law.
2. Pipelined and Vector Processors [12%]: Instruction pipelining, reservation table, data and control hazards and methods to remove them.
3. SIMD or Array Processors [12%]: Various interconnection networks, data routing through various networks, comparison of various networks, simulation of one network by another.
4. MIMD and Multiprocessor Systems [12%]: Uniform and non-uniform memory access multiprocessors, scheduling in multiprocessor systems, load balancing in multiprocessor systems.
5. PRAM Model of Parallel Computing and Basic Algorithms [7%]: PRAM model and its variations, relative powers of various PRAM models.
6. Parallel Algorithms for Multiprocessor Systems [25%]: Basic constructs for representing PRAM algorithms, parallel reduction, parallel prefix computation, parallel list ranking, parallel merge, Brent's theorem and cost-optimal algorithms, NC class of parallel algorithms.
7. Parallel Algorithms for SIMD and Multiprocessor Systems [4%]: Introduction to parallel algorithms for SIMD and multiprocessor systems.

Chapter 1

Introduction and Classification of Parallel Computers

What is a parallel computer?
We first briefly review the structure of a single-processor computer commonly used today. It consists of:
- Input unit: accepts the list of instructions to solve a problem and the data relevant to that problem.
- Storage unit: the memory in which the program, data and intermediate results are stored.
- Output unit: displays or prints the results.
- CPU or processing element (PE): the combination of the arithmetic and logic unit (ALU) and the control unit.
- ALU: where arithmetic and logic operations are performed.
- Control unit: interprets the instructions stored in memory and carries them out.

[Figure: block diagram of a single-processor computer — the input unit feeds the processing element (ALU + control unit), which stores intermediate results in memory and delivers results through the output unit.]

The processing element (PE) retrieves one instruction at a time, interprets it and executes it. The operation of such a computer is sequential: at any time the PE can execute only one instruction, so the speed of a sequential computer is limited by the speed at which it can process the retrieved data. To increase the speed of processing, one may interconnect many sequential computers to work together. Such a computer, consisting of a number of interconnected sequential computers which cooperatively execute a single program to solve a problem, is called a parallel computer.

COMPARISON OF TEMPORAL AND DATA PARALLEL PROCESSING

Temporal parallel processing:
1. A job is divided into a set of independent tasks, and the tasks are assigned for processing.
2. Tasks should take equal time.
3. Processors are specialized to do specific tasks efficiently.
4. Task assignment is static.
5. Not tolerant of processor faults.
6. Efficient with fine-grained tasks.

Data parallel processing:
1. Full jobs are assigned for processing.
2. Jobs may take different times; there is no need to synchronize the beginning of jobs.
3. Processors should be general purpose, and may not do all tasks efficiently.
4. Job assignment may be static, dynamic or quasi-dynamic.
5. Tolerates processor faults.
6. Efficient with coarse-grained tasks and quasi-dynamic assignment.

Parallel processing:
Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity and pipelining. Parallel events may occur in multiple resources during the same time interval, and pipelined events may occur in overlapped time spans. The highest level of parallel processing is conducted among multiple jobs or programs through multiprogramming, time sharing and multiprocessing.

Parallel computer structures:
Parallel computers are systems that emphasize parallel processing. The basic architectural features of parallel computers are introduced below. We divide parallel computers into three architectural configurations:
1. Pipeline computers
2. Array processors

3. Multiprocessor systems

Pipeline computer:
A pipeline computer performs overlapped computation to exploit temporal parallelism. The process of executing an instruction in a digital computer involves four steps: instruction fetch (IF) from memory, instruction decode (ID) to identify the operation to be performed, operand fetch (OF), and execution (EX) of the decoded arithmetic or logic operation. In a pipelined computer, successive instructions are executed in an overlapped fashion. The four pipeline stages IF, ID, OF and EX are arranged in a linear cascade. A pipeline processor can be represented as:

IF → ID → OF → EX

PIPELINE PROCESSOR

In a non-pipelined computer, by contrast, these four steps must be completed before the next instruction can be issued; the space-time diagrams for pipelined and non-pipelined processors make the difference clear. In a pipelined processor the operation of all stages is synchronized under a common clock control, and interface latches are used between adjacent segments to hold intermediate results. A non-pipelined computer takes four pipeline cycles to complete one instruction. The main issues in designing a pipeline computer include job sequencing, collision prevention, congestion control, branch handling, reconfiguration and hazard resolution. Pipeline computers are particularly attractive for vector processing.
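The overlap of the four stages can be visualized with a small space-time diagram generator. This is an illustrative sketch (the function and stage names are ours, not from the text): each row is a stage, each column a clock cycle.

```python
STAGES = ["IF", "ID", "OF", "EX"]

def space_time(n_instructions, stages=STAGES):
    """Space-time diagram of a linear pipeline: one row per stage, one
    column per clock cycle; entry (s, c) names the instruction occupying
    stage s in cycle c, or '--' when that stage is idle."""
    k = len(stages)
    total_cycles = k + n_instructions - 1  # fill time + one completion per cycle
    rows = []
    for s in range(k):
        row = ["--"] * total_cycles
        for i in range(n_instructions):
            row[i + s] = f"I{i + 1}"  # instruction i reaches stage s at cycle i + s
        rows.append(row)
    return rows

for name, row in zip(STAGES, space_time(5)):
    print(name, " ".join(row))
```

With 5 instructions and 4 stages the diagram spans 4 + (5 - 1) = 8 clock cycles, versus 5 × 4 = 20 cycles on an equivalent non-pipelined machine.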

Array processor:
An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism. The fundamental difference between an array processor and a multiprocessor system is that the processing elements in an array processor operate synchronously, while those in a multiprocessor system may operate asynchronously.

Array computer:
An array processor is a synchronized parallel computer with multiple arithmetic logic units, called processing elements (PEs). By replication of ALUs one can achieve spatial parallelism. Scalar and control-type instructions are directly executed in the control unit. The PEs are interconnected by a data-routing network.

[Figure: diagram of an array processor — a control unit (control processor CP + control memory CM) broadcasts instructions to PE1 … PEn, each with its own processor P and local memory M, linked by an inter-PE connection network.]

The control unit is a combination of a control processor and a control memory. In an array processor each processing element has its own private memory. Instruction fetch (from local memory or from the control memory) and decode are done by the control unit.

Array processors designed with associative memories are called associative processors. Parallel algorithms on array processors will be given for matrix multiplication, merge sort and the fast Fourier transform (FFT).

Multiprocessor system:
Research and development of multiprocessor systems are aimed at improving throughput, reliability, flexibility and availability. A basic multiprocessor organization contains two or more processors of approximately comparable capabilities. All processors share access to common sets of memory modules, I/O channels and peripheral devices. Most importantly, the entire system must be controlled by a single integrated operating system providing interaction between processors and their programs at various levels. Three different interconnection schemes have been practiced in the past:
- Time-shared common bus
- Crossbar switch
- Multiport memories

[Figure: organization of a multiprocessor system.]

Pipelining of processing elements:
Pipelining uses temporal parallelism to increase the speed of processing, and is an important method of increasing processor speed. Pipelining is effective when the following ideal conditions are satisfied:

1. It is possible to break an instruction into a number of independent parts, each part taking nearly equal time to execute.
2. There is so-called locality in instruction execution: instructions are executed in sequence, one after the other, in the order in which they are written. If instructions are not executed in sequence but "jump around" due to many branch instructions, then pipelining is not effective.
3. Successive instructions are such that the work done during the execution of one instruction can be effectively used by the next instruction.
4. Sufficient resources are available in the processor, so that if a resource is required by successive instructions in the pipeline it is readily available.

Data flow computer:
1. To exploit maximum parallelism in a program, data flow computers are used.
2. The basic concept is to enable the execution of an instruction whenever its required operands become available.
3. No program counter is needed in data-driven computation.
4. A program for data-driven computation can be represented by a data flow graph.
5. An example is the computation Z = (x + y) * 2: one node adds x and y, and a second node multiplies the sum by the constant 2.

[Figure: data flow graph for Z = (x + y) * 2.]
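The fire-when-operands-are-ready rule can be sketched in a few lines. This is an illustrative interpreter (the graph encoding and function names are ours): nodes fire in any order, driven only by data availability, with no program counter.

```python
import operator

def eval_dataflow(graph, values):
    """Fire any node whose input operands are all available; repeat until
    no node can fire. Data availability alone drives execution - there is
    no program counter."""
    values = dict(values)
    fired = True
    while fired:
        fired = False
        for node, (op, args) in graph.items():
            if node not in values and all(a in values for a in args):
                values[node] = op(*(values[a] for a in args))
                fired = True
    return values

# Z = (x + y) * 2: an add node feeds a multiply node.
graph = {
    "t": (operator.add, ("x", "y")),
    "Z": (operator.mul, ("t", "two")),
}
print(eval_dataflow(graph, {"x": 3, "y": 4, "two": 2})["Z"])  # 14
```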

Flynn's classification:
Flynn's classification is based on the multiplicity of instruction streams and data streams in a computer system. The essential computing process is the execution of a sequence of instructions on a set of data.
1. The term "stream" denotes a sequence of items as executed or operated upon by a single processor.
2. An instruction stream is a sequence of instructions as executed by the machine.
Computer organizations are characterized by the multiplicity of the hardware provided to serve the instruction and data streams. Flynn's classification distinguishes four types:
1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data stream (SIMD)
3. Multiple instruction stream, single data stream (MISD)
4. Multiple instruction stream, multiple data stream (MIMD)

1. SISD:
a. Instructions are executed sequentially but may be overlapped in their execution stages.
b. Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle.
c. Single data: only one data stream is being used as input during any one clock cycle.
d. An SISD computer may have more than one functional unit in it.
e. Deterministic execution.
f. Examples: older-generation mainframes, minicomputers and workstations.

[Figure: SISD organization — CU --IS--> PU --DS--> MM.]
2. SIMD:
a. SIMD stands for single instruction stream, multiple data stream.
b. There are multiple processing elements supervised by the same control unit.
c. All PEs receive the same instruction broadcast from the control unit, but operate on different data sets from distinct data streams.
d. SIMD machines are further divided into word-slice and bit-slice modes.
e. Deterministic execution.
f. Examples: array processors and vector pipelines.

[Figure: SIMD organization — one CU broadcasts the instruction stream IS to PU1 … PUn, which operate on data streams DS1 … DSn from memory modules MM1 … MMn.]

CU = control unit, PU = processor unit, MM = memory module, IS = instruction stream, DS = data stream

3. MISD:
a. MISD stands for multiple instruction stream, single data stream.
b. There are n processor units, each receiving distinct instructions but operating over the same data stream.
c. This organization exists conceptually rather than physically.

[Figure: MISD organization — CU1 … CUn issue instruction streams IS1 … ISn to PU1 … PUn, all of which operate on the same data stream DS drawn from memory modules MM1 … MMn.]

4. MIMD:
a. MIMD stands for multiple instruction stream, multiple data stream.
b. This is the most common type of parallel computer.
c. Multiple instruction: every processor may be executing a different instruction stream.
d. Multiple data: every processor may be working with a different data stream.
e. Examples: most current supercomputers and multi-core PCs.

[Figure: MIMD organization — CU1 … CUn issue instruction streams IS1 … ISn to PU1 … PUn, each processing its own data stream DS1 … DSn against memory modules MM1 … MMn.]

Händler's classification:
Wolfgang Händler has proposed a classification for identifying the degrees of parallelism and pipelining built into the hardware structure of a computer system. Händler's taxonomy addresses the computer at three distinct levels:
1. Processor control unit (PCU)
2. Arithmetic logic unit (ALU)
3. Bit-level circuit (BLC)
The PCU corresponds to one processor or one CPU. The ALU is equivalent to a processing element in an array processor, and the BLC corresponds to the combinational logic circuitry needed to perform 1-bit operations in the ALU. A computer system C can be characterized by a triple containing six independent entities, defined as:

T(C) = <K*K', D*D', W*W'>

where
K  = the number of processors (PCUs) within the computer,
K' = the number of PCUs that can be pipelined,
D  = the number of ALUs (or PEs) under the control of one PCU,
D' = the number of ALUs that can be pipelined,
W  = the word length of an ALU or of a PE,
W' = the number of pipeline stages in all ALUs or in a PE.

For example, the Texas Instruments Advanced Scientific Computer (TI-ASC) has one controller controlling four arithmetic pipelines, each with a 64-bit word length and eight stages. Thus we have:

T(ASC) = <1*1, 4*1, 64*8> = <1, 4, 64*8>

Whenever a second entity K', D' or W' equals 1, we drop it, since pipelining of one stage or of one unit is meaningless.
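The drop-when-1 convention can be captured in a few lines. This sketch (the function name is ours) reproduces the TI-ASC example from the text:

```python
def handler_triple(k, k2, d, d2, w, w2):
    """Format Handler's triple T(C) = <K*K', D*D', W*W'>, dropping any
    second entity that equals 1 (pipelining of one unit is meaningless)."""
    def entity(a, b):
        return str(a) if b == 1 else f"{a}*{b}"
    return f"<{entity(k, k2)}, {entity(d, d2)}, {entity(w, w2)}>"

print(handler_triple(1, 1, 4, 1, 64, 8))  # TI-ASC: <1, 4, 64*8>
```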

Amdahl's law:
Amdahl's law is based on a fixed problem size, or fixed workload. For a given problem size the speedup does not increase linearly as the number of processors increases; in fact, the speedup tends to saturate. According to Amdahl's law a program contains two types of operations:
a. completely sequential operations, which must be done sequentially, and
b. completely parallel operations, which can be run on multiple processors.

Let the time taken to perform the sequential operations, Ts, be a fraction f (0 < f < 1) of the total execution time of the program, T(1). The parallel operations then take Tp = (1 - f)*T(1) time. Assume that the parallel operations in the program achieve a linear speedup, i.e. on n processors they take (1/n)th of the time taken on one processor. Then the speedup with n processors is:

S(n) = T(1)/T(n) = T(1) / [f*T(1) + (1 - f)*T(1)/n] = 1 / [f + (1 - f)/n]

As n grows, S(n) approaches 1/f, so the sequential fraction alone limits the achievable speedup.

[Figure: speedup S(n) versus number of processors n under Amdahl's law, saturating at 1/f.]
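The saturation predicted by the law is easy to see numerically. A minimal sketch (function name ours), evaluating S(n) = 1/(f + (1 - f)/n):

```python
def amdahl_speedup(f, n):
    """Speedup on n processors when a fraction f of the work is strictly
    sequential (Amdahl's law): S(n) = 1 / (f + (1 - f)/n)."""
    return 1.0 / (f + (1.0 - f) / n)

# With 10% sequential work, even 1000 processors give under 10x speedup:
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.1, n), 2))
```

The limit 1/f = 10 is never reached, no matter how many processors are added.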

Parallel computing:
Parallel computing is a form of computation that carries out many calculations simultaneously, operating on the principle that large problems can often be divided into smaller ones which are then solved concurrently. There are several different forms of parallel computing:
1. Bit-level parallelism
2. Instruction-level parallelism
3. Data parallelism
4. Task parallelism

Uses of parallel computing:
Parallel computing has been used to model difficult scientific and engineering problems found in the real world. Some examples:
1. Atmosphere and earth environment modelling
2. Databases and data mining
3. Oil exploration
4. Web search engines and web-based business services
5. Medical imaging and diagnosis
6. Management of national and multinational corporations
7. Financial and economic modelling
8. Advanced graphics and virtual reality, particularly in the entertainment industry
9. Networked video and multimedia technologies

Chapter 2

Pipelined and Vector Processors

Pipelining:
To achieve pipelined processing, one must subdivide the input task (process) into a sequence of subtasks, each of which can be executed by a specialized hardware stage that operates concurrently with the other stages in the pipeline.

Classification of pipelining:
The following types of pipelining are described below.

1. Arithmetic pipelining:
The arithmetic logic units of a computer can be segmented for pipelined operation in various data formats.

2. Processor pipelining:
This refers to the pipelined processing of the same data stream by a cascade of processors, each of which processes a specific task. The data stream passes through the first processor, with results stored in a memory block that is also accessible by the second processor; the second passes its results to the third, and so on.

3. Unifunctional vs. multifunctional pipelines:
A pipeline unit with a fixed and dedicated function, such as a floating-point adder, is called unifunctional. A multifunctional pipe may perform different functions, either at different times or at the same time, by interconnecting different subsets of stages in the pipeline.

4. Static vs. dynamic pipelines:
A static pipeline may assume only one functional configuration at a time; it may be unifunctional or multifunctional. Pipelining is effective in static pipes only if instructions of the same type are executed continuously. A dynamic pipeline processor permits several functional configurations to exist simultaneously; a dynamic pipeline must therefore be multifunctional.

Pipelined processing:
Pipelined processing is a technique for doing repetitive work efficiently. It is possible to implement pipelines at various levels of abstraction; pipelined logic circuits provide concurrency at the lowest level.

Speed:
Once the pipe is filled up, it outputs one result per clock period, independent of the number of stages in the pipe. Ideally, a linear pipeline with k stages can process n tasks in

Tk = k + (n - 1) clock periods,

where k cycles are used to fill up the pipeline (i.e. to complete execution of the first task) and (n - 1) cycles are needed to complete the remaining (n - 1) tasks. The same n tasks executed on an equivalent non-pipelined computer take T1 = n*k clock periods.

Speedup:
We define the speedup of a k-stage linear pipeline processor over an equivalent non-pipelined processor as

S = (time taken by the non-pipelined processor) / (time taken by the pipelined processor)
S = T1/Tk = n*k / (k + n - 1)

Efficiency:
The efficiency of a linear pipeline is the percentage of busy time-space spans over the total time-space span, which equals the sum of all busy and idle time-space spans. Let n, k and t be the number of tasks, the number of pipeline stages and the clock period of a linear pipeline respectively. The pipeline efficiency is

e = n*k*t / (k * [k*t + (n - 1)*t]) = n / (k + n - 1)
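These performance formulas are easy to tabulate. A minimal sketch (function names ours) showing that speedup approaches k and efficiency approaches 1 as the task stream grows:

```python
def pipeline_time(n, k):
    """Clock periods for a k-stage linear pipeline to process n tasks:
    Tk = k + (n - 1)."""
    return k + (n - 1)

def speedup(n, k):
    """S = T1/Tk = n*k / (k + n - 1)."""
    return n * k / pipeline_time(n, k)

def efficiency(n, k):
    """e = n / (k + n - 1), the fraction of busy time-space spans."""
    return n / pipeline_time(n, k)

# A 4-stage pipeline on a growing task stream approaches speedup k = 4:
for n in (4, 40, 400):
    print(n, round(speedup(n, 4), 2), round(efficiency(n, 4), 2))
```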

Throughput:
The number of results that can be completed by a pipeline per unit time is called its throughput:

w = n / [k*t + (n - 1)*t]

which approaches 1/t as n grows large.

Performance:
Performance is described by a few important attributes, among them the length of the pipe. The performance measures of importance are efficiency, speedup and throughput. Let n be the length of the pipeline, i.e. the number of stages in the pipeline, and m the number of tasks run on the pipe.

Reservation table:
A reservation table is another way of representing the task flow pattern of a pipelined system. It represents the flow of data through the pipeline for one complete evaluation of a given function, as a grid of rows and columns: the rows represent the stages (the resources of the pipeline) and each column represents a clock time unit. The total number of clock units in the table is called the evaluation time for the given function. For example, if a pipelined system has four resources and five time slices, then the reservation table has four rows and five columns. All entries of the table are either 0 or 1: if a resource i is used in time slice j, then the (i, j)th entry of the table is 1.

On the other hand, if a resource is not used in a particular time slice, then that entry of the table has the value 0.

An example reservation table:
Suppose we have four resources and six time slices, and the usage of resources is as follows:
1. Resource 1 is used in time slices 1, 3 and 6.
2. Resource 2 is used in time slice 2.
3. Resource 3 is used in time slices 2 and 4.
4. Resource 4 is used in time slice 5.

The reservation table is as follows:

Index       Time 1  Time 2  Time 3  Time 4  Time 5  Time 6
Resource 1    1       0       1       0       0       1
Resource 2    0       1       0       0       0       0
Resource 3    0       1       0       1       0       0
Resource 4    0       0       0       0       1       0

Often 0 is represented by a blank and 1 by an X, so the corresponding reservation table is:

Index       Time 1  Time 2  Time 3  Time 4  Time 5  Time 6
Resource 1    X               X                       X
Resource 2            X
Resource 3            X               X
Resource 4                                    X
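Building such a table from a resource-usage description is mechanical. An illustrative sketch (encoding and names ours) that reproduces the four-resource, six-slice example:

```python
def reservation_table(usage, n_slices):
    """Build a 0/1 reservation table from a map of resource -> list of
    time slices (both 1-based, as in the example above)."""
    n_resources = max(usage)
    table = [[0] * n_slices for _ in range(n_resources)]
    for resource, slices in usage.items():
        for t in slices:
            table[resource - 1][t - 1] = 1
    return table

usage = {1: [1, 3, 6], 2: [2], 3: [2, 4], 4: [5]}
for row in reservation_table(usage, 6):
    print(row)
```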

Different functions may have different evaluation times; the table above, for example, has an evaluation time of 6.

Hazards:
A hazard is a potential problem that can arise in a pipelined processor. It refers to the possibility of erroneous computation when the CPU tries to execute simultaneously multiple instructions that exhibit data dependence. Equivalently, delays in the pipelined execution of instructions due to non-ideal conditions are called pipeline hazards.

Types of hazards:

Structural hazard:
When the number of resources in a processor is limited and causes delay due to resource constraints, this is known as a structural hazard. A structural hazard takes place when a part of the processor hardware is needed by two or more instructions at the same time.

Control hazard:
A control hazard, also known as a branch hazard, arises because programs have branches and loops, so execution does not proceed in a "straight line": if a certain condition is true, execution jumps from one part of the instruction stream to another. Delay due to branch instructions, or control dependency in a program, is known as a control hazard.

Data hazard:
A data hazard arises when the result generated by one instruction may be required by the next, so that successive instructions are not independent of one another. Delay due to data dependency between instructions is known as a data hazard. When successive instructions overlap their fetch, decode and execution through a pipelined processor, inter-instruction dependencies may arise that prevent the sequential data flow in the pipeline; for example, an instruction may depend upon the result of a previous instruction. Data hazards are divided into three types:
1. Read after write (RAW)
2. Write after read (WAR)
3. Write after write (WAW)

Let D(I) denote the domain of instruction I (the set of data objects it reads) and R(I) its range (the set of data objects it writes), and let J be an instruction that follows I.

RAW hazard:
A RAW hazard between two instructions I and J may occur when J attempts to read some data object that is modified by I, i.e. when R(I) and D(J) overlap.

WAR hazard:
A WAR hazard may occur when J attempts to modify some data object that is read by I, i.e. when D(I) and R(J) overlap.

WAW hazard:
A WAW hazard may occur when both I and J attempt to modify the same data object, i.e. when R(I) and R(J) overlap.
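The three overlap conditions translate directly into set intersections. An illustrative sketch (function and register names ours) classifying the potential hazards between an instruction I and a later instruction J:

```python
def detect_hazards(domain_i, range_i, domain_j, range_j):
    """Classify potential hazards between instruction I and a later
    instruction J, given their domains (objects read) and ranges
    (objects written) as sets."""
    return {
        "RAW": bool(range_i & domain_j),   # J reads what I writes
        "WAR": bool(domain_i & range_j),   # J writes what I reads
        "WAW": bool(range_i & range_j),    # both write the same object
    }

# I: R1 = R2 + R3, then J: R4 = R1 + R5  ->  RAW hazard on R1
print(detect_hazards({"R2", "R3"}, {"R1"}, {"R1", "R5"}, {"R4"}))
```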

Hazard detection:
Hazard detection is necessary because a hazard that is not properly detected and resolved can result in an interlock situation in the pipeline, or produce unreliable results through overwriting. Hazard detection falls into two categories:

1. Hazard detection can be done in the instruction fetch stage of a pipelined processor by comparing the domain and range of the incoming instruction with those of the instructions being processed in the pipe. A warning signal can be generated to prevent the hazard from taking place.
2. Another approach is to allow the incoming instruction through the pipe and distribute the detection to all the potential pipeline stages. This distributed approach offers better flexibility at the expense of increased hardware control.

Hazard resolution:
Once a hazard is detected, the system should resolve the interlock situation.
1. Consider the instruction sequence {..., I, I+1, ..., J, J+1, ...} in which a hazard has been detected between the current instruction J and a previous instruction I. A straightforward approach is to stop the pipe and suspend the execution of instructions J, J+1, J+2, ... until instruction I has passed the point of resource conflict.
2. A more sophisticated approach is to suspend only instruction J and continue the flow of instructions J+1, J+2, ... down the pipe. Of course, potential hazards due to the suspension of J must be continuously checked as instructions J+1, J+2, ... move ahead of J.

Note: to avoid RAW hazards, IBM engineers developed a short-circuiting approach which gives a copy of the data object to be written directly to the instruction waiting to read it. This technique is also known as data forwarding.

Vector processor:
A vector pipeline is specially designed to handle vector instructions over vector operands. Computers with vector instructions are often called vector processors. Vector (or SIMD) processors are specialized for operating on vector or matrix data elements; they have dedicated hardware for performing vector operations such as vector addition, vector multiplication and others.

Instruction pipeline:
An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations. One possible complication with such a scheme occurs when an instruction causes a branch out of sequence: the pipeline must then be emptied, and all instructions that were read from memory after the branch instruction must be discarded. In general, the computer needs to process each instruction with the following sequence of steps:

1. Fetch the instruction from memory (IF).
2. Decode the instruction (ID).
3. Calculate the effective address.
4. Fetch the operands from memory (OF).
5. Execute the instruction (EX).
6. Store the result in the proper place (ST).

Difficulties in instruction pipelining:
1. Different segments take different times to operate on the incoming information, and some segments are skipped for certain operations; e.g. a register-mode instruction does not need an effective address calculation.
2. Two or more segments may require memory access at the same time.
The design of an instruction pipeline is most efficient when the instruction cycle is divided into segments of equal duration.

Chapter 3

SIMD ARRAY PROCESSORS

A synchronous array of parallel processors is called an array processor; it consists of multiple processing elements (PEs) under the supervision of one control unit (CU). An array processor handles a single instruction stream over multiple data streams (SIMD), and in this sense array processors are also known as SIMD computers. SIMD machines are especially designed to perform vector computations over matrices or arrays of data. SIMD computers appear in two basic architectural organisations: array processors, using random-access memory, and associative processors, using content-addressable (associative) memory.

SIMD COMPUTER ORGANISATIONS

In general, an array processor may assume one of two slightly different configurations, as illustrated in the figures below. Configuration I has been implemented in the well-publicized Illiac-IV computer; it is structured with N synchronized PEs, all of which are under the control of one CU.

[Figure: Configuration I (Illiac IV) — the CU (with its CU memory, fed data and instructions over the I/O data bus) controls PE0 … PE(N-1), each with its own memory PEM0 … PEM(N-1), joined by an interconnection network under CU control.]

a. Each PEi is essentially an arithmetic logic unit (ALU) with attached working registers and a local memory PEMi.
b. The CU also has its own memory for the storage of programs. The system and user programs are executed under the control of the CU; user programs are loaded into the CU memory from an external source. The function of the CU is to decode all instructions and determine where each decoded instruction should be executed.
c. Vector instructions are broadcast to the PEs for distributed execution, to achieve spatial parallelism. All PEs perform the same function synchronously, in lock-step fashion, under the command of the CU.
d. Vector operands are distributed to the PEMs before parallel execution in the array of PEs. The distributed data can be loaded into the PEMs from an external source via the system data bus, or via the CU in a broadcast mode using the control bus.
e. Masking schemes are used to control the status of each PE during the execution of a vector instruction. Each PE may be either active or disabled during an instruction cycle; only enabled PEs perform computation. The interconnection network is under the control of the one control unit.
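The lock-step broadcast with masking described in points c and e can be sketched in one line of logic. This is an illustrative model (names ours), not a hardware description: every enabled PE applies the same broadcast operation to its own operand, while disabled PEs hold their values.

```python
def simd_step(values, mask, op):
    """One lock-step SIMD instruction: PE i applies the broadcast
    operation op to its operand if mask[i] == 1, and is idle otherwise."""
    return [op(v) if m else v for v, m in zip(values, mask)]

# Broadcast "multiply by 10" with PEs 0 and 2 enabled:
print(simd_step([1, 2, 3, 4], [1, 0, 1, 0], lambda x: x * 10))  # [10, 2, 30, 4]
```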

[Figure: Configuration II (BSP) — the CU (with its CU memory on the I/O data bus) controls PE0 … PE(N-1), which share parallel memory modules M0 … M(P-1) through an alignment network.]

Configuration II differs from Configuration I in two respects. First, the local memories attached to the PEs are replaced by parallel memory modules shared by all the PEs through an alignment network. Second, the inter-PE permutation network is replaced by the inter-PE memory alignment network, which is again controlled by the CU. In Configuration II (BSP) there are N PEs and P memory modules; the two numbers are not necessarily equal, and in fact they are chosen to be relatively prime. The alignment network is a path-switching network between the PEs and the parallel memories.

Formally, an SIMD computer C is characterized by the following set of parameters:

C = <N, F, I, M>

where
N = the number of PEs in the system;
F = a set of data-routing functions provided by the interconnection network or by the alignment network;
I = the set of machine instructions for scalar, vector, data-routing and network-manipulation operations;
M = the set of masking schemes, where each mask partitions the set of PEs into the two disjoint subsets of enabled PEs and disabled PEs.

Inter-PE communication networks:
Design decisions for inter-PE communication networks are discussed below. The decisions concern operation modes, control strategies, switching methodologies and network topologies.

Operation mode:
Two types of communication can be identified: synchronous and asynchronous. Synchronous communication is needed to establish communication paths synchronously, either for a data-manipulating function or for a data/instruction broadcast. Asynchronous communication is needed for multiprocessing, in which connection requests are issued dynamically. A system may also be designed to facilitate both synchronous and asynchronous processing. The typical operation modes of interconnection networks are therefore classified into three categories: synchronous, asynchronous and combined.

Control strategy:
A typical interconnection network consists of a number of switching elements and interconnecting links. Interconnection functions are realized by properly setting the control of the switching elements. The control-setting function can be managed by a centralized controller or by the individual switching elements; the latter strategy is called distributed control and the former centralized control. Most existing SIMD interconnection networks use centralized control of all switch elements by the control unit.

Switching methodologies:
The two major switching methodologies are circuit switching and packet switching. In circuit switching, a physical path is actually established between a source and a destination. In packet switching, data is put in a packet and routed through the interconnection network without establishing a dedicated path. In general, circuit switching is more suitable for bulk data transmission, and packet switching is more efficient for many short data messages. Another option, integrated switching, includes the capabilities of both circuit and packet switching.

Network topology:
A network can be depicted by a graph in which nodes represent switching points and edges represent communication links. The topologies tend to be regular and can be grouped into two categories: static and dynamic. In a static topology, links between two processors are passive, and dedicated buses cannot be reconfigured for direct connections to other processors. Links in the dynamic category, on the other hand, can be reconfigured by setting the network's active switching elements. The space of interconnection networks can be represented by the Cartesian product of the four sets of design features above: {operation mode} x {control strategy} x {switching methodology} x {network topology}.

SIMD INTERCONNECTION NETWORKS

Various interconnection networks have been suggested for SIMD computers. We distinguish between single-stage (recirculating) networks and multistage networks. Important network classes include the Illiac network, the flip network, the n-cube network, the omega network, the data manipulator, the barrel shifter and the shuffle-exchange network.

Static versus dynamic networks:
The topological structure of an SIMD array processor is characterized by the data-routing network used to interconnect the processing elements. Formally, such an inter-PE communication network can be specified by a set of data-routing functions. To pass data between PEs that are not directly connected in the network, the data must be passed through intermediate PEs by executing a sequence of routing functions through the interconnection network. SIMD interconnection networks are classified into two categories based on network topology: static networks and dynamic networks.

Static networks:
Topologies of static networks can be classified according to the dimensions required for layout: one-dimensional, two-dimensional, three-dimensional and hypercube. Examples of one-dimensional topologies include the linear array used for some pipeline architectures.

Two-dimensional topologies include the ring, tree, star, mesh, and systolic array; examples of these structures are shown in figures. Three-dimensional topologies include the completely connected, chordal ring, 3-cube, and 3-cube-connected-cycle networks. The mesh and the 3-cube are actually two- and three-dimensional hypercubes, respectively.
Dynamic Networks
We consider two classes of dynamic networks: single-stage and multistage.

Single-stage networks
A single-stage network is a switching network with N input selectors (IS) and N output selectors (OS), as demonstrated in figure. Each IS is essentially a 1-to-D demultiplexer and each OS is an M-to-1 multiplexer, where 1<=D<=N and 1<=M<=N.

The single-stage network is also called a recirculating network: data items may have to recirculate through the single stage several times before reaching their final destinations. The number of recirculations needed depends on the connectivity of the single-stage network. In general, the higher the hardware connectivity, the smaller the number of recirculations.

Multistage networks
Multistage networks are described by three characterizing features: the switch box, the network topology, and the control structure. Many switch boxes are used in a multistage network. Each box is essentially an interchange device with two inputs and two outputs, as depicted in figure. A switch box has four states: straight, exchange, upper broadcast, and lower broadcast.

1. Straight:- input i is connected to output i.
2. Exchange:- the two inputs are crossed over to the opposite outputs.
3. Upper broadcast:- the upper input is broadcast to both outputs.
4. Lower broadcast:- the lower input is broadcast to both outputs.

A two-function switch box can assume either the straight or the exchange state. A four-function switch box can be in any one of the four legitimate states. A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal. Multistage networks can be one-sided or two-sided. The one-sided networks, sometimes called full switches, have their input and output ports on the same side. Single-sided network:-

(figure: one-sided network with ports 1, 2, 3, ..., N on one side of a connection network.)

The two-sided multistage networks, which usually have an input side and an output side, can be divided into three classes: blocking, rearrangeable, and nonblocking.

(figure: two-sided network with inputs 1, 2, 3, ..., n on one side of the connection network and outputs on the other.)

Blocking networks:- In blocking networks, simultaneous connections of more than one terminal pair may result in conflicts in the use of network communication links. Examples of blocking networks are the data manipulator, omega, flip, n-cube, and baseline.
Rearrangeable networks:- A network is called rearrangeable if it can perform all possible connections between inputs and outputs by rearranging its existing connections, so that a connection path for a new input-output pair can always be established.
Nonblocking networks:- A network which can handle all possible connections without blocking is called a nonblocking network.
Mesh-connected Illiac Network
A single-stage recirculating network has been implemented in the Illiac-IV array processor with N=64 PEs. Each PEi is allowed to send data to any one of PEi+1, PEi-1, PEi+r, and PEi-r, where r = sqrt(N), in one circulation step through the network. Formally, the Illiac network is characterized by the following four routing functions:
R+1(i) = (i+1) mod N
R-1(i) = (i-1) mod N
R+r(i) = (i+r) mod N
R-r(i) = (i-r) mod N

where 0<=i<=N-1. N is commonly a perfect square, such as N=64 and r=8 in the Illiac-IV network. A reduced Illiac network is illustrated in figure for N=16 and r=4. Each PEi in the figure is directly connected to its four nearest neighbors in the mesh network. In terms of permutation cycles, we can express the above routing functions as follows. Horizontally, the PEs of all rows form a linear circular list, as governed by the following two permutations, each with a single cycle of order N. The permutation cycles (a b c)(d e) stand for the permutations a->b, b->c, c->a and d->e, e->d, applied in a circular fashion within each pair of parentheses:
R+1 = (0 1 2 ... N-1)
R-1 = (N-1 ... 2 1 0)
For the example network with N=16 and r=4, the shift by a distance of four is specified by the following two permutations, each with four cycles of order four:
R+4 = (0 4 8 12)(1 5 9 13)(2 6 10 14)(3 7 11 15)
R-4 = (12 8 4 0)(13 9 5 1)(14 10 6 2)(15 11 7 3)
The Illiac network is only a partially connected network. Figure shows the connectivity of the example Illiac network with N=16. This graph shows that four PEs can be reached from any PE in one step, seven PEs in two steps, and eleven PEs in three steps.
Cube Interconnection Networks
The cube network can be implemented either as a recirculating network or as a multistage network for SIMD machines. A three-dimensional cube is illustrated in figure. Vertical lines connect vertices (PEs) whose addresses differ in the most significant bit position; vertices at both ends of the diagonal lines differ in the middle bit position; and horizontal lines differ in the least significant bit position. This unit-cube concept can be extended to an n-dimensional unit space, called an n-cube, with n bits per vertex. We shall use the binary sequence A = (a(n-1) ... a2 a1 a0) to represent a vertex (PE) address. The complement of bit ai will be denoted āi for any 0<=i<=n-1.
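The four Illiac routing functions are easy to state directly in code; the following is an illustrative sketch (the values N = 16 and r = 4 match the reduced example network, and PE index 5 is an arbitrary choice):

```python
# Illiac routing functions for N PEs with r = sqrt(N); here N = 16, r = 4
# as in the reduced example network in the text.
N, r = 16, 4

def r_plus_1(i):  return (i + 1) % N
def r_minus_1(i): return (i - 1) % N
def r_plus_r(i):  return (i + r) % N
def r_minus_r(i): return (i - r) % N

# each PE reaches exactly four mesh neighbours in one circulation step
neighbours_of_5 = {f(5) for f in (r_plus_1, r_minus_1, r_plus_r, r_minus_r)}
```

For PE 5 this yields its four nearest neighbours in the 4x4 mesh: 4, 6, 1, and 9.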
Formally, an n-dimensional cube network of N PEs is specified by the following n routing functions:
Ci(a(n-1) ... a(i+1) ai a(i-1) ... a0) = a(n-1) ... a(i+1) āi a(i-1) ... a0, for i = 0, 1, 2, ..., n-1

In the n-cube, each PE located at a corner is directly connected to n neighbors. Neighboring PEs differ in exactly one bit position.
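The cube routing function Ci amounts to flipping bit i of the PE address; a minimal sketch:

```python
# Cube routing function C_i: complement bit i of the n-bit PE address.
def cube_route(addr, i):
    return addr ^ (1 << i)

# In a 3-cube, PE 2 (binary 010) has the three neighbours obtained by
# flipping each address bit: C0 -> 3 (011), C1 -> 0 (000), C2 -> 6 (110).
```

Each neighbour produced this way differs from the source PE in exactly one bit position, as stated above.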

The implementation of a single-stage cube network is illustrated in figure for N=8. The interconnections of the PEs corresponding to the three routing functions C0, C1, C2 are shown separately. If one assembles all three connecting patterns together, the result is the 3-cube shown in figure. The same set of cube-routing functions C0, C1, C2 can also be implemented by a three-stage cube network, as modeled in figure for N=8. Two-function (straight and exchange) switch boxes are used in constructing multistage cube networks. The stages are numbered 0 at the input end, increasing to n-1 at the output end. Stage i implements the Ci routing function for i = 0, 1, 2, ..., n-1; that is, switch boxes at stage i connect an input line to an output line that differs from it only in the ith bit position.
Barrel Shifter and Data Manipulator
Barrel shifters are also known as plus-minus-2^i (PM2I) networks. This type of network is based on the following routing functions:
B+i(j) = (j + 2^i) mod N

B-i(j) = (j - 2^i) mod N
where 0<=j<=N-1, 0<=i<=n-1, and n = log2 N. Comparing these equations with the Illiac routing functions, the following equivalences are revealed when r = sqrt(N) = 2^(n/2):
B+0 = R+1
B-0 = R-1

B+(n/2) = R+r
B-(n/2) = R-r
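These equivalences can be checked mechanically; here is a small sketch for N = 16 (so n = 4, r = 4), matching the Illiac example used earlier:

```python
# PM2I (barrel-shifter) routing functions for N = 2^n PEs.
N, n = 16, 4
r = 2 ** (n // 2)          # r = sqrt(N) = 4

def b_plus(i, j):  return (j + 2 ** i) % N
def b_minus(i, j): return (j - 2 ** i) % N

# B+0 coincides with the Illiac R+1, and B+(n/2) with R+r, on every PE:
same_plus1 = all(b_plus(0, j) == (j + 1) % N for j in range(N))
same_plusr = all(b_plus(n // 2, j) == (j + r) % N for j in range(N))
```

Both checks hold for every PE index j, confirming that the Illiac routing functions are a subset of the barrel-shifting functions.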

This implies that the Illiac routing functions are a subset of the barrel-shifting functions. In addition to the adjacent (+-1) and fixed-distance (+-r) shifts, the barrel-shifting functions allow either forward or backward shifting by any power-of-two distance. Each PE in a barrel shifter is directly connected to 2n-1 PEs; the connectivity is therefore increased over the Illiac network by 2n-5 links per PE. The barrel shifter can be implemented either as a recirculating single-stage network or as a multistage network. Figure shows the interconnection patterns in a recirculating barrel shifter for N=8; the barrel-shifting functions B are executed by the interconnection patterns shown. A barrel shifter has been implemented with multiple stages in the form of a data manipulator. As shown in figure, the data manipulator consists of N cells; each cell is essentially a

controlled shifter. This network is designed to implement data-manipulating functions such as permuting, replicating, spacing, masking, and complementing. To implement a data-manipulating function, the six groups of control lines (u1, u2, h1, h2, d1, d2) in each column must be properly set through the control register and the associated decoder.
Shuffle-Exchange and Omega Networks
The class of shuffle-exchange networks is based on two routing functions, shuffle (S) and exchange (E). Let A = a(n-1) ... a1 a0 be a PE address:
S(a(n-1) ... a1 a0) = a(n-2) ... a1 a0 a(n-1)
where 0<=A<=N-1 and n = log2 N. The S function performs a cyclic shift of the bits of A to the left by one position. This action corresponds to perfectly shuffling a deck of N cards, as demonstrated in figure: the perfect shuffle cuts the deck into two halves from the center and intermixes them evenly. The inverse perfect shuffle does the opposite, restoring the original ordering. The exchange routing function E is defined by:
E(a(n-1) ... a1 a0) = a(n-1) ... a1 ā0
Complementing the least significant bit means exchanging data between two PEs with adjacent addresses. These shuffle-exchange functions can be implemented either as a recirculating network or as a multistage network. For N=8, a single-stage recirculating shuffle-exchange network is shown in figure; the solid lines indicate exchange and the dashed lines indicate shuffle. The use of a recirculating shuffle-exchange network for parallel processing was proposed by Stone. The shuffle-exchange functions have been implemented in the multistage omega network by Lawrie. The omega network for N=8 is illustrated in figure. An N * N omega network consists of n identical stages; between two adjacent stages is a perfect-shuffle interconnection. Each stage has N/2 switch boxes under independent box control, and each box has four functions (straight, exchange, upper broadcast, lower broadcast), as illustrated in figure.
The switch boxes in the Omega network can be repositioned as shown in figure without violating the perfect-shuffle interconnections between stages.
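The shuffle and exchange functions are straightforward to express on n-bit addresses; a sketch, with N = 8 (n = 3) chosen to match the example network:

```python
# Shuffle S: cyclic left shift of the n address bits.
# Exchange E: complement the least significant bit.
def shuffle(a, n):
    msb = (a >> (n - 1)) & 1                   # bit shifted out on the left...
    return ((a << 1) & ((1 << n) - 1)) | msb   # ...re-enters on the right

def exchange(a):
    return a ^ 1

# For n = 3: S(110) = 101 and E(110) = 111, i.e. S(6) = 5 and E(6) = 7.
```

Applying S a total of n times rotates the address all the way around and restores it, which is why an omega network needs exactly n shuffle stages.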

Chapter 4

Types of parallel computer:-

Loosely coupled:-
Physical connection:- PEs with private memory, connected via a network.
Logical connection:- PEs compute independently and co-operate by exchanging messages.
Type of parallel computer:- message-passing multicomputer.

Tightly coupled:-
Physical connection:- PEs share a common memory and communicate via the shared memory.
Logical connection:- PEs co-operate by exchanging results stored in the common memory.
Type of parallel computer:- shared-memory multiprocessor.

Tightly coupled multiprocessor:- 1. A multiprocessor system with a common shared memory is classified as a shared-memory, or tightly coupled, multiprocessor.

2. Tightly coupled systems provide a cache memory with each CPU; in addition, there is a common global memory that is accessed by all CPUs. 3. Information can be shared among the CPUs by placing it in the common global memory.

Loosely coupled multiprocessor:- 1. This type of multiprocessor is also called a distributed system. 2. Each PE in a loosely coupled system has its own private local memory. 3. The processors are tied together by a switching scheme designed to route information from one processor to another through a message-passing scheme. 4. A loosely coupled network is efficient when interaction between tasks is minimal. 5. A tightly coupled system is efficient for real-time or high-speed applications.

Uniform Memory Access
Uniform Memory Access (UMA) is a shared-memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data. UMA architectures are often contrasted with Non-Uniform Memory Access (NUMA) architectures. In the UMA architecture, each processor may use a private cache, and peripherals are also shared in some fashion. The UMA model is suitable for general-purpose and time-sharing applications by multiple users. It can also be used to speed up the execution of a single large program in time-critical applications.[1]
Types of UMA architectures:
1. UMA using bus-based SMP architectures
2. UMA using crossbar switches
3. UMA using multistage switching networks
Time-shared bus (features):
1. The bus system is totally passive, with few active components such as switches.

2. Conflict-resolution methods such as fixed priorities, FIFO queues, and daisy chaining are devised for efficient utilization of the resource.
3. Time-shared buses are further divided into sub-categories:
a. Single-bus multiprocessor
b. Unidirectional buses
c. Multibus multiprocessor organization
4. It has the lowest overall system cost for the network and is the least complex.
5. It is very easy to physically modify to a new configuration.
Disadvantages:-
1. This organization is usually appropriate for small systems only.
2. Efficiency is the lowest among all the organizations.
3. Expanding the system by adding functional units may degrade overall performance.

Crossbar switch and multiport memory:-
1. This is an expansion of the time-shared common bus.
2. It requires the most expensive memory units, since most of the control and switching circuitry is included in the memory unit.
3. The data-transfer rate is quite high, and the system remains stable as functional units are added; consequently it has the highest rate of data transfer.
4. A large number of cables and connectors are required.

Multistage network for multiprocessors:-
1. It is the most complex interconnection system and has the highest data-transfer rate.
2. The functional units are the cheapest.
3. Switches are used for routing requests to memory and other ports.
4. This organization is usually cost-effective for multiprocessors only.
5. System expansion (addition of functional units) usually improves overall performance.
6. Reliability of the switches and the system can be improved by segmentation within the switches.

Non-Uniform Memory Access Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors. NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures. Their commercial development came in work by Burroughs (later Unisys), Convex Computer (later Hewlett-Packard), Silicon Graphics, Sequent Computer Systems, Data General and Digital during the 1990s. Techniques developed by these companies later featured in a variety of Unix-like operating systems, and somewhat in Windows NT. Basic concept

(figure: one possible architecture of a NUMA system. The processors are connected to the bus or crossbar by connections of varying thickness/number, showing that different CPUs have different priorities for memory access based on their location.)
Modern CPUs operate considerably faster than the main memory to which they are attached. In the early days of high-speed computing and supercomputers, the CPU generally ran slower than its memory, until the performance lines crossed in the 1970s. Since then, CPUs, increasingly starved for data, have had to stall while they wait for memory accesses to complete. Many supercomputer designs of the 1980s and 90s focused on providing high-speed memory access as opposed to faster processors, allowing them to work on large data sets at speeds other systems could not approach. Limiting the number of memory accesses provides the key to extracting high performance from a modern computer. For commodity processors, this means installing an ever-increasing amount of high-speed cache memory and using increasingly sophisticated algorithms to avoid "cache misses". But the dramatic increase in the size of operating systems and of the applications run on them has generally overwhelmed these cache-processing improvements. Multiprocessor systems make the problem considerably worse: now a system can starve several processors at the same time, notably because only one processor can access memory at a time.


NUMA attempts to address this problem by providing separate memory for each processor, avoiding the performance hit that occurs when several processors attempt to address the same memory. For problems involving spread data (common for servers and similar applications), NUMA can improve performance over a single shared memory by a factor of roughly the number of processors (or separate memory banks). Of course, not all data ends up confined to a single task, which means that more than one processor may require the same data. To handle these cases, NUMA systems include additional hardware or software to move data between banks. This operation has the effect of slowing down the processors attached to those banks, so the overall speed increase due to NUMA depends heavily on the exact nature of the tasks run on the system at any given time.
Cache-coherent NUMA (ccNUMA)
Nearly all CPU architectures use a small amount of very fast non-shared memory known as cache to exploit locality of reference in memory accesses. With NUMA, maintaining cache coherence across shared memory has a significant overhead. Although simpler to design and build, non-cache-coherent NUMA systems become prohibitively complex to program in the standard von Neumann programming model. As a result, all NUMA computers sold on the market use special-purpose hardware to maintain cache coherence, and thus class as "cache-coherent NUMA", or ccNUMA. Typically, this takes place by using inter-processor communication between cache controllers to keep a consistent memory image when more than one cache stores the same memory location. For this reason, ccNUMA performs poorly when multiple processors attempt to access the same memory area in rapid succession.
Operating-system support for NUMA attempts to reduce the frequency of this kind of access by allocating processors and memory in NUMA-friendly ways and by avoiding scheduling and locking algorithms that make NUMA-unfriendly accesses necessary. Alternatively, cache-coherency protocols such as the MESIF protocol attempt to reduce the communication required to maintain cache coherency. Current ccNUMA systems are multiprocessor systems based on the AMD Opteron, which can be implemented without external logic, and the Intel Itanium, which requires the chipset to support NUMA. Examples of ccNUMA-enabled chipsets are the SGI Shub (Super hub), the Intel E8870, the HP sx2000 (used in the Integrity and Superdome servers), and those found in recent NEC Itanium-based systems. Earlier ccNUMA systems, such as those from Silicon Graphics, were based on MIPS processors and the DEC Alpha 21364 (EV7) processor. Intel announced NUMA support for its x86 and Itanium servers in late 2007 with the Nehalem and Tukwila CPUs. Both CPU families share a common chipset; the interconnection is called the Intel QuickPath Interconnect (QPI).
NUMA vs. cluster computing
One can view NUMA as a very tightly coupled form of cluster computing. The addition of virtual-memory paging to a cluster architecture can allow the implementation of NUMA

entirely in software where no NUMA hardware exists. However, the inter-node latency of software-based NUMA remains several orders of magnitude greater than that of hardware-based NUMA.

In a typical SMP (symmetric multiprocessor) architecture, all memory accesses are posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but the problem with the shared bus appears when you have dozens, even hundreds, of CPUs competing for access to it. This leads to a major performance bottleneck due to the extremely high contention among the multiple CPUs on the single memory bus. The NUMA architecture was designed to surpass these scalability limits of the SMP architecture. NUMA computers offer the scalability of MPP (Massively Parallel Processing), in that processors can be added and removed at will without loss of efficiency, together with the programming ease of SMP.

Load balancing:-
Load balancing is used to distribute computation fairly across processors in order to obtain the highest possible execution speed. So far, processes were simply distributed among the available processors without any consideration of the type of processors or their speeds. However, it may happen that some processors complete their tasks before others and become idle, because the work is unevenly divided or because some processors operate faster than others. Load balancing is particularly useful when the amount of work is not known prior to execution.
Load balancing is of two types:-
1. Static load balancing
2. Dynamic load balancing

Static load balancing:- Static load balancing is load balancing attempted statically, before the execution of any process. It is usually referred to as the scheduling problem.

(figure: static load balancing - a master process distributes tasks from a work queue to several slave processes.)

Dynamic load balancing:- Dynamic load balancing is attempted during the execution of processes. Dynamic load-balancing algorithms (LBA) can be classified as follows:
1. Centralized LBA:-
a. It makes global decisions about the relocation of work among processors.
b. Some centralized algorithms assign the maintenance of system-wide global state information to a particular node.
c. Global state information allows the algorithm to do a good job of balancing the work among the processes.
Disadvantages:
a. The master process can issue only one process at a time.
b. The algorithm does not scale well, because the amount of state information grows linearly with the number of processors.
2. Fully distributed LBA:-
a. The master process can be divided into mini-masters.

(figure: fully distributed load balancing - the master is split into mini-masters, each managing a subset of processes.)

b. Each processing element holds its own information about its state. The processing elements are free to interact with each other and to balance the load among themselves.
c. Data is interchanged when one processor has more work and another has less.
Disadvantage: without proper distribution of load, the workload may not balance as well as in a centralized load-balancing algorithm.
3. Semi-distributed LBA:- In this LBA we divide the network into various regions; each region has a centralized processor which holds the state information, and these centralized processors are in turn controlled by a single central processor.
Dynamic load-balancing algorithms can be divided into two categories:
1. Sender-initiated algorithms
2. Receiver-initiated algorithms


Sender-initiated algorithm:- A process sends tasks to other processes that it selects. A heavily loaded process passes some of its tasks to others that are willing to accept them.

Receiver-initiated algorithm:- A process requests tasks from other processes that it selects. A process requests tasks from another process when it is free and has no tasks to perform.

Process scheduling:- Process scheduling is the allocation of tasks to processors. A schedule is illustrated by a Gantt chart indicating the time each task spends in execution as well as the processor on which it executes. So we can say that scheduling is the division of work among the processors.

Types of scheduling:-
1. Static scheduling
2. Dynamic scheduling
Static scheduling:- The technique of separating dependent instructions to minimize the number of actual hazards and resultant stalls is called static scheduling. It became popular with pipelining. There are three main reasons why static scheduling is still of interest:
1. Static scheduling sometimes results in lower execution times than dynamic scheduling.
2. Static scheduling can be used to predict the speedup achievable by a particular parallel algorithm on a target machine, assuming no pre-emption of processes occurs.
3. Static scheduling can allow the generation of only one process per processor, reducing process creation, synchronization, and termination overhead.
Dynamic scheduling:- The technique whereby the hardware rearranges the instruction execution to reduce stalls. Dynamic scheduling offers several advantages:

1. It simplifies the compiler.
2. It allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline.
These advantages are gained at the cost of a significant increase in hardware complexity.
Scheduling algorithms:
1. Deterministic model
2. Coffman-Graham scheduling algorithm
3. Non-deterministic model
4. Graham's list scheduling algorithm
1. Deterministic model:- A parallel algorithm is a collection of tasks, some of which must be completed before others begin. In the deterministic model, the precedence relations and the execution times of the tasks are known before run time. This information is usually represented by a task graph. For example, consider the task graph illustrated in the figure below: we are given a set of seven tasks and their precedence relations (information about which tasks must be completed before other tasks can start).

(figure: task graph of tasks T1-T7 connected by precedence arcs.)

In the figure, each node represents a task to be performed. A directed arc from Ti to Tj indicates that task Ti must be completed before Tj starts. The allocation of tasks to processors is given by a schedule. Gantt charts are the best way to illustrate schedules: a Gantt chart indicates the time each task spends in execution as well as the processor on which it executes.

(figure: Gantt chart illustrating a schedule for the task graph, placing tasks T1-T7 on processors across time units 1-9.)
2. Coffman-Graham scheduling algorithm:-
a. This algorithm constructs a list of tasks for the simple case in which all tasks take the same amount of time.
b. Once the list L has been constructed, the algorithm applies Graham's list scheduling algorithm.
c. Let T = {T1, T2, ..., Tn} be a set of n unit-time tasks to be executed on P processors.
d. Let < be a partial order on T that specifies which tasks must be completed before other tasks begin.
e. If Ti < Tj, then task Ti is an immediate predecessor of task Tj.
f. Let α(T) be an integer label assigned to T, and let N(T) denote the decreasing sequence of integers formed by ordering the set of labels of T's immediate successors.

(figure: example task graph with tasks T1-T9 for the Coffman-Graham scheduling algorithm.)

Algorithm steps:-
1. Choose an arbitrary task Tk from T such that S(Tk) = ∅ (Tk has no successors) and define α(Tk) = 1.
2. For i = 2 to n do:
a. Let R be the set of unlabeled tasks with no unlabeled successors.
b. Let T* be the task in R such that N(T*) is lexicographically smaller than N(T) for all T in R.
c. Define α(T*) = i.
3. Construct a list of tasks L = (Un, Un-1, ..., U2, U1) such that α(Ui) = i for all i, where 1 <= i <= n.
4. Use Graham's list scheduling algorithm with the list L to schedule the tasks in T.
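The labeling steps above can be sketched in code as follows; the small task graph used to exercise it is an illustrative assumption, not the figure's exact graph:

```python
# Coffman-Graham labeling for unit-time tasks.
# succ maps each task to the set of its immediate successors.
def coffman_graham_labels(succ):
    label = {}
    for i in range(1, len(succ) + 1):
        # R: the unlabeled tasks all of whose successors are already labeled
        ready = [t for t in succ if t not in label
                 and all(s in label for s in succ[t])]
        # N(t): decreasing sequence of successor labels; choose the task
        # whose N(t) is lexicographically smallest
        best = min(ready,
                   key=lambda t: sorted((label[s] for s in succ[t]),
                                        reverse=True))
        label[best] = i
    return label
```

The list L is then simply the tasks in decreasing label order, which is handed to Graham's list scheduling algorithm.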

3. Non-deterministic model:- In this model the execution times and the precedence of task execution are not known before run time. Tasks with no predecessors are called initial tasks.

(figure: probabilistic task graph with tasks T1-T5 and branch nodes N1-N5.)

G1 = N1N3N5 + N2N3N5 + N4N5 = [(N1 + N2)N3 + N4]N5


G3 = X1X3 + X1X4 + X2X4. G is said to be simple if the polynomial can be factored so that each variable appears exactly once.

4. Graham's list scheduling algorithm:-

Let T = {T1, T2, ..., Tn} be a set of tasks. Let µ: T -> (0, ∞) be a function that associates an execution time with each task. We are also given a partial order < on T. Let L be a list of the tasks in T. Whenever a processor has no work to do, it instantaneously removes from L the first ready task, that is, an unscheduled task whose predecessors under < have all completed execution. If two or more processors simultaneously attempt to execute the same task, the processor with the lowest index succeeds, and the other processors look for another suitable task.
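A minimal event-driven sketch of Graham's list scheduling; the three-task example in the test below (with its execution times and predecessors) is illustrative:

```python
import heapq

# Graham's list scheduling: whenever a processor is idle, it takes the first
# ready task from the list L. mu gives each task's execution time and pred
# its set of predecessors under the partial order <.
def graham_list_schedule(mu, pred, L, nproc):
    done, scheduled = set(), set()
    running = []                 # min-heap of (finish_time, task)
    start, t, free = {}, 0.0, nproc
    while len(done) < len(mu):
        # assign ready tasks to idle processors, scanning L in order
        for task in L:
            if free == 0:
                break
            if task not in scheduled and pred[task] <= done:
                heapq.heappush(running, (t + mu[task], task))
                start[task] = t
                scheduled.add(task)
                free -= 1
        t, finished = heapq.heappop(running)   # advance to next completion
        done.add(finished)
        free += 1
    return start, t              # start times and the makespan
```

For example, with tasks T1 (time 2), T2 (time 1), T3 (time 1, which must follow T1) on two processors, T1 and T2 start at time 0 and T3 starts at time 2, giving a makespan of 3.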

Chapter 5 & 6

Parallel Algorithms:-
Algorithms designed for sequential computers are known as sequential algorithms. Algorithms used to solve a problem on a parallel computer are known as parallel algorithms. A parallel algorithm defines how the problem is divided into sub-problems, how the processors communicate, and how the partial solutions are combined to produce the final result. Parallel algorithms depend on the kind of parallel computer they are designed for; thus, for a single problem we may need to design different parallel algorithms for different architectures. To simplify the design and analysis of parallel algorithms, parallel computers are represented by various abstract machine models. These machine models try to capture the important features of parallel computers.
Assumptions:-
1. In designing algorithms for these models we learn about the inherent parallelism in the problem.

2. These models help us compare the relative computational power of various parallel computers.
3. These models help us determine the kind of parallel architecture that is best suited for a given problem.
Models of computation:-
RAM (Random Access Machine)
This machine model abstracts the sequential computer.
(diagram of the RAM model: a processor connected through a memory access unit to the memory.)
1. Memory unit:- A memory unit has M locations, where M can be unbounded.
2. Processor:- A processor that operates under the control of a sequential algorithm. The processor can read from and write to memory locations and can perform basic arithmetic and logic operations (ALU).
3. MAU (Memory Access Unit):- It creates a path from the processor to an arbitrary location in memory. The processor provides the memory access unit with the address of the location it wishes to access and the operation it wants to perform; the memory access unit uses that address to establish a direct connection between the processor and the memory location.
An algorithm step on the RAM consists of the following phases:
1. Read:- The processor reads data from memory and stores it in one of its local registers.
2. Execute:- The processor performs a basic arithmetic or logic operation on the contents of one or two of its local registers.
3. Write:- The processor writes the contents of one register into an arbitrary memory location.
PRAM (Parallel Random Access Machine)
The PRAM is one of the most popular models for designing parallel algorithms.
(diagram of the PRAM model: processors P1, P2, ..., PN connected through a memory access unit (MAU) to a shared memory.)

The PRAM model consists of:
a) A set of N processors (P1, P2, P3, ..., PN), where N can be unbounded.
b) A memory with M locations which is shared by all processors, where M can be unbounded.
c) An MAU, which allows the processors to access the shared memory. The shared memory functions as the communication medium for the processors.
Steps of a PRAM computation:-
Read:- N processors read simultaneously from M memory locations and store the values in their registers.
Compute:- N processors perform basic arithmetic and logic operations on the values in their registers.
Write:- N processors write simultaneously from their registers into memory.
Types of PRAM model:-
1. EREW
2. CREW
3. ERCW
4. CRCW

1. EREW:- EREW stands for exclusive read, exclusive write. In this model every access to a memory location (read or write) has to be exclusive; concurrent read and concurrent write operations are not allowed. This model provides the least amount of concurrency and is therefore the weakest model.
2. CREW:- CREW stands for concurrent read, exclusive write PRAM. In this model only write operations to a memory location are exclusive. Concurrent reads are allowed, meaning two or more processors can concurrently read from the same memory location. This is one of the most commonly used models.
3. ERCW:- ERCW stands for exclusive read, concurrent write PRAM model. In this model read operations are exclusive. This model allows multiple processors to

write concurrently to the same memory location. This model is not frequently used and is defined here for the sake of completeness.
4. CRCW:- CRCW stands for concurrent read, concurrent write PRAM model. This model allows multiple processors to read from and write to the same memory location. It provides the maximum amount of concurrency and is the most powerful of the four models.
Types of CRCW:-
Several protocols are used to specify the value that is written to memory in situations where many processors try to write to the same memory location; the model has to specify the value that actually gets written.
1. Priority CRCW:- In this protocol processors are assigned priorities. The processor with the highest priority (among those contending to write) succeeds in writing its value to the memory location.
2. Common CRCW:- Here the processors trying to write to a memory location are allowed to do so only when they all write the same value. If the processors write the same value, the write is allowed; otherwise it is an illegal operation.
3. Arbitrary CRCW:- Here the processor that writes to the memory location is selected arbitrarily, without affecting the correctness of the algorithm; the algorithm must therefore be correct regardless of which processor is selected.

4. Combining CRCW: In this protocol a function maps the multiple values that the processors try to write to a single value that is actually written into the memory location.

E.g. the summation function.
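The four write-resolution protocols above can be simulated in a few lines. This is a sketch, not from the text: the function name and the convention that a lower processor id means higher priority are my assumptions.

```python
# Sketch: resolving simultaneous CRCW writes to one memory location.
# Each write is a (processor_id, value) pair; lower id = higher priority here.

def resolve_crcw(writes, protocol):
    """Return the value stored after the simultaneous writes."""
    if protocol == "priority":
        # The highest-priority processor (lowest id) wins.
        return min(writes)[1]
    if protocol == "common":
        # Legal only if every processor writes the same value.
        values = {v for _, v in writes}
        if len(values) != 1:
            raise ValueError("illegal common-CRCW write: values differ")
        return values.pop()
    if protocol == "arbitrary":
        # Any one write may succeed; the algorithm must not depend on which.
        return writes[0][1]
    if protocol == "combining":
        # Map all written values to one, e.g. by summation.
        return sum(v for _, v in writes)
    raise ValueError(protocol)

writes = [(2, 5), (0, 7), (1, 5)]
print(resolve_crcw(writes, "priority"))   # processor 0 wins -> 7
print(resolve_crcw(writes, "combining"))  # 5 + 7 + 5 -> 17
```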

Interconnection network: In the PRAM memory model, exchange of data between processors takes place either through the shared memory or through direct links connecting the processors.

A network topology is important because it specifies the interconnection model of a parallel computer; this model helps determine the computational paths of the parallel computer. Combinational circuit: The combinational circuit is another model of parallel computation; it refers to a family of models of computation. A combinational circuit is viewed as a device that has a set of input lines at one end and a set of output lines at the other end.

[Figure: a combinational circuit, showing stages and the interconnection network]

The circuit is made up of a number of interconnected components arranged in columns called stages. A combinational circuit consists of:
1. Interconnected components (processors): each has a fixed number of input lines, called its fan-in, and a fixed number of output lines, called its fan-out. A component is active only when it has received all the inputs necessary for its computation.
2. An important feature of a combinational circuit is that it has no feedback: no component can be used more than once while computing the output for a given input.
Parameters used to analyse a combinational circuit:
1. Size: the number of components used in the circuit.

2. Depth: the number of stages in the circuit.
3. Width: the maximum number of components in any one stage.
ANALYSIS of the performance of parallel algorithms: parallel algorithms are evaluated on the basis of three criteria:
1. Running time
2. Number of processors
3. Cost
Relative strengths/powers of the PRAM models:
1. EREW PRAM: has the least concurrency and is therefore the weakest model.
2. CREW PRAM: more commonly used than the EREW PRAM model.
3. ERCW PRAM: seldom used in practice; it is defined mainly for completeness.
4. CRCW PRAM: the strongest model, and widely used.

Chapter 6

PRAM ALGORITHMS
A PRAM algorithm has two phases:
1. Initialization or activation of the PEs (processing elements):
a) Spawn(<processor names>) - this instruction initializes the processors, i.e. decides which processing elements are used for further processing.
2. The active processors are then made to do the desired work:

a) for all <processor list> do <statement list> end for - this instruction allocates work to the processors.
PARALLEL REDUCTION ALGORITHMS
The binary tree is one of the most important structures in parallel computing. In some algorithms data flows from top to bottom, i.e. from root to leaves, e.g. divide-and-conquer algorithms and broadcast algorithms. In other algorithms data flows from bottom to top, i.e. from leaves to root; these are called reduction operations. An example is summation.
[Figure: binary-tree reduction (summation) of ten values]

[Table: array representation of parallel reduction - the partial sums held in A(0), A(1), A(2), ... after each step]

The complexity of this algorithm is O(log n) using n/2 processors; this is the overall time complexity.
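The tree reduction described above can be sketched as log2(n) rounds of pairwise additions. This is a sketch, not from the text; the sequential loop stands in for the processors that would act simultaneously in one PRAM round.

```python
# Sketch: pairwise tree reduction. In round r, "processor" i combines
# a[i] with a[i + stride]; all additions in a round are independent,
# so on a PRAM they would execute simultaneously.
def parallel_reduce(a):
    a = list(a)
    n = len(a)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            a[i] += a[i + stride]
        stride *= 2
    return a[0]

print(parallel_reduce([4, 3, 8, 2, 9, 1, 0, 5, 6, 7]))  # 45
```

Each round halves the number of active positions, giving the O(log n) depth stated above.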

Prefix sum algorithms: Given a set of n values a1, a2, a3, ..., an and an associative operation +, the prefix sum problem is to compute the n quantities:
a1
a1 + a2
a1 + a2 + a3
...
a1 + a2 + ... + an
For example, applying the operation "+" to the array (3, 10, 4, 2), the prefix sums are:
1) a1 = 3
2) a1 + a2 = 13
3) a1 + a2 + a3 = 3 + 10 + 4 = 17
4) a1 + a2 + a3 + a4 = 3 + 10 + 4 + 2 = 19
So the array will be (3, 13, 17, 19).

PRAM algorithm to find the prefix sums of an n-element list using n - 1 processors:
Prefix sums (CREW PRAM)
Initial condition: list of n elements stored in A[0..(n-1)]
Final condition: each A[i] contains a[0] + a[1] + ... + a[i]
Global variables: n, A[0..(n-1)], j
begin
  spawn(P0, P1, P2, ..., Pn-1)
  for all Pi where 0 <= i <= n-1 do
    for j <- 0 to ceil(log n) - 1 do
      if i - 2^j >= 0 then
        A[i] <- A[i] + A[i - 2^j]
      end if
    end for
  end for
end
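The pointer-doubling pseudocode above can be sketched in Python. This is a sketch, not from the text: the copy of the array simulates the rule that in each PRAM round all reads happen before any write.

```python
import math

# Sketch of the CREW prefix-sum pseudocode: in round j, every position i
# with i - 2**j >= 0 adds A[i - 2**j] into A[i].
def prefix_sums(A):
    A = list(A)
    n = len(A)
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for j in range(rounds):
        old = list(A)                 # simulate the simultaneous PRAM reads
        for i in range(2**j, n):      # exactly the indices with i - 2**j >= 0
            A[i] = old[i] + old[i - 2**j]
    return A

print(prefix_sums([3, 10, 4, 2]))  # [3, 13, 17, 19]
```

After ceil(log n) rounds every position holds its full prefix sum, matching the worked example (3, 13, 17, 19).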

Prefix sums have many applications. Consider the problem of separating upper-case letters from lower-case letters while maintaining their order, keeping the upper-case letters first in the array. Let A be an array of n letters:
Array A: A b C D e F g h

We also take another array T of the same size, putting a 1 wherever there is an upper-case letter and a 0 wherever there is a lower-case letter:
T: 1 0 1 1 0 1 0 0

Now compute the prefix sums of T using the addition operation:
Prefix sums of T: 1 1 2 3 3 4 4 4

Each upper-case letter's prefix-sum value gives its position in the output: where a value first appears, the corresponding letter takes that position. The first 1 corresponds to A, the first 2 to C, the first 3 to D, and so on; repeated values indicate positions where no new upper-case letter occurs. The upper-case letters therefore occupy indices 1, 2, 3, 4, ...

Now the array will look like:
A C D F b e g h
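The case-separation procedure above can be sketched as a stable partition driven by the prefix sums of T. This is a sketch, not from the text: the function name is mine, and the prefix sums are computed serially here where the text would use the parallel algorithm.

```python
# Sketch: separate upper-case letters first, preserving order, using the
# prefix sums of the 0/1 indicator array T described in the text.
def separate_upper(A):
    T = [1 if c.isupper() else 0 for c in A]
    S, total = [], 0
    for t in T:                       # prefix sums of T (serial stand-in)
        total += t
        S.append(total)
    n_upper = S[-1]
    out = [None] * len(A)
    for i, c in enumerate(A):
        if T[i]:
            out[S[i] - 1] = c                 # S[i] is this letter's rank
        else:
            out[n_upper + (i - S[i])] = c     # lower-case keeps its order
    return out

print(separate_upper(list("AbCDeFgh")))
```

Note that i - S[i] counts the lower-case letters seen up to position i, which is why the lower-case block also stays in order.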

The complexity of this algorithm is O(log n) using n - 1 processors.
Array representation of prefix sums:
[Table: the array A[0..9] after each of the pointer-doubling steps, ending with the full prefix sums]

From step 0 to 1: the addition is performed with the first right neighbour, and the neighbour's value is updated.

From step 1 to 2: the addition is performed with the second right neighbour, and the neighbour's value is updated.
From step 2 to 3: the addition is performed with the fourth right neighbour.
From step 3 to 4: the addition is performed with the eighth right neighbour.

Parallel list ranking / suffix sums: The suffix sum problem is a variant of the prefix sum problem in which the array is replaced by a linked list, and the sums are computed from the end rather than from the beginning.
Algorithm to find the rank of each of n elements:
List ranking (CREW PRAM)
Initial condition: values in array next represent a linked list
Final condition: values in array position contain the original distance of each element from the end of the list
Global variables: n, position[0..(n-1)], next[0..(n-1)], j
begin
  spawn(P0, P1, ..., Pn-1)
  for all Pi where 0 <= i <= n-1 do
    if next[i] = i then
      position[i] <- 0
    else
      position[i] <- 1
    end if
    for j <- 1 to ceil(log n) do
      position[i] <- position[i] + position[next[i]]
      next[i] <- next[next[i]]
    end for
  end for
end
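The pointer-jumping idea in the list-ranking pseudocode can be sketched as follows. This is a sketch, not from the text: copies of the arrays stand in for the simultaneous PRAM reads, and the tail of the list is marked by next[i] = i as in the initial condition.

```python
import math

# Sketch of list ranking by pointer jumping: position[i] ends up as the
# distance from element i to the end of the list.
def list_rank(nxt):
    n = len(nxt)
    position = [0 if nxt[i] == i else 1 for i in range(n)]
    nxt = list(nxt)
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(rounds):
        old_pos, old_nxt = list(position), list(nxt)  # simultaneous reads
        for i in range(n):
            position[i] = old_pos[i] + old_pos[old_nxt[i]]
            nxt[i] = old_nxt[old_nxt[i]]              # jump the pointer
    return position

# List 0 -> 1 -> 2 -> 3 -> 4, where the tail points to itself.
print(list_rank([1, 2, 3, 4, 4]))  # [4, 3, 2, 1, 0]
```

Each round doubles the distance each pointer spans, so ceil(log n) rounds suffice.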

Parallel merge: merging combines two sorted lists into one.
Merge lists (CREW PRAM)
Initial condition: two sorted lists of n/2 elements each, stored in A[1..(n/2)] and A[(n/2+1)..n]
Global variables: A[1..n]
Local variables: x, low, high, index
begin
  spawn(P1, P2, ..., Pn)
  for all Pi where 1 <= i <= n do
    if i <= n/2 then
      low <- n/2 + 1
      high <- n
    else
      low <- 1
      high <- n/2
    end if
    x <- A[i]
    repeat
      index <- floor((low + high)/2)
      if x < A[index] then
        high <- index - 1
      else
        low <- index + 1
      end if
    until low > high
    A[high + i - n/2] <- x
  end for
end
On a PRAM the complexity reduces to O(log n), compared with the RAM algorithm. One processor is assigned to each element and determines that element's position in the final merged list; the index found by the search gives the position of the element. n processors are required for n elements: processing elements holding upper-list elements search the lower list, and vice versa.
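The merge above can be sketched with Python's bisect module doing the binary search. This is a sketch, not from the text: the two lists are passed separately rather than packed into one array, the example values are mine, and bisect_left versus bisect_right breaks ties so that equal elements from the first list come first.

```python
import bisect

# Sketch: each element's final index is its own index within its half plus
# the number of elements in the other half that precede it, found by binary
# search. Each loop iteration plays the role of one PRAM processor.
def parallel_merge(A, B):
    out = [None] * (len(A) + len(B))
    for i, x in enumerate(A):
        out[i + bisect.bisect_left(B, x)] = x
    for i, x in enumerate(B):
        out[i + bisect.bisect_right(A, x)] = x
    return out

print(parallel_merge([1, 5, 7, 9], [2, 3, 11, 14]))
```

Every placement is independent of the others, which is why all n searches can run in parallel in O(log n) time.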

[Figure: example of merging two sorted lists of n/2 elements each; each processor finds its element's position in the other list by binary search]

Every processor finds the position of its own element in the other list using binary search: an element's index in its own list is known, and its position in the merged list can be computed once its index in the other list is found, by adding the two indices.
Cost optimal algorithms: A cost optimal algorithm is a parallel algorithm whose cost is in the same complexity class as an optimal sequential algorithm. In other words, a parallel algorithm whose cost matches the running time of the optimal RAM algorithm is termed cost optimal. A cost optimal parallel reduction algorithm has time complexity O(log n). To show that a cost optimal algorithm exists we must:
1. Determine the number of processors needed to perform the n - 1 operations.
2. Verify whether a cost optimal parallel reduction algorithm with O(log n) complexity exists; this is done using Brent's theorem.
Cost optimality is considered important in the design of parallel algorithms. The amount of work an algorithm performs is its running time multiplied by the number of processors it uses; a conventional (sequential) algorithm may be thought of as a parallel algorithm designed for one processor. An algorithm is said to be cost optimal if the amount of work it does is the same as that of the best known sequential algorithm.
Brent's theorem:

This theorem states that if a parallel algorithm A performs m computational operations in t time steps (with as many processors as needed), then p processors can execute algorithm A in time
T = t + (m - t)/p
The typical application is when p is smaller than the number of processors that gave rise to the time t. Note that when p = 1, T = m (the single processor executes the operations sequentially, one after another), and as p grows very large, T approaches t.
Brent's theorem guarantees that, for an algorithm with t time steps and a total of m operations, a running time of T is definitely achievable on a shared-memory machine with p processors. There may be an algorithm that solves the problem faster, or it may be possible to implement this algorithm faster (by scheduling instructions differently to minimize idle processors, for instance), but it is definitely possible to implement this algorithm in this time, given p processors.
The key to understanding Brent's theorem is the notion of time steps: in a single time step, every instruction that has no unmet dependencies is executed. Therefore t equals the length of the longest chain of instructions in which each depends on the result of the previous one (any shorter chain will have finished executing by the time the longest chain has).
E.g. suppose we are summing an array sequentially:
for (i = 0; i < length(a); i++)
    sum += a[i];
With this algorithm each add operation depends on the result of the previous one, forming a chain of length n; thus t = n. There are n operations, so m = n, and T = t + (m - t)/p = n. So no matter how many processors are available, this algorithm will take time n.
IMPORTANT PREDICTIONS ABOUT THE TREE-STRUCTURED SUMMATION (where m = n - 1 and t = log n):
1. No matter how many processors are used, no implementation of this algorithm can be faster than O(log n).
2. If we have n processors, the algorithm can be implemented in O(log n) time.
3. If we have n/log n processors, the algorithm can be implemented in about 2 log n time.
4. If we have one processor, the algorithm can be implemented in n time.
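The predictions above can be checked numerically from the bound T = t + (m - t)/p. This is a sketch, not from the text; the function name and the choice n = 1024 are mine.

```python
import math

# Sketch: Brent's bound T = t + (m - t)/p for summing n numbers with a
# balanced reduction tree: m = n - 1 operations, t = ceil(log2 n) steps.
def brent_bound(m, t, p):
    return t + (m - t) / p

n = 1024
m, t = n - 1, math.ceil(math.log2(n))   # 1023 operations, 10 time steps
print(brent_bound(m, t, p=1))           # 1023.0 -> sequential time n
print(brent_bound(m, t, p=n))           # close to the depth t = log n
print(brent_bound(m, t, p=n // t))      # n/log n processors -> about 2 log n
```

With p = n // t the bound comes out near 2 log n while the work p * T stays O(n), which is the cost optimal regime described above.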

Consider the amount of work done in each case: with one processor we do n work, with n/log n processors we do O(n) work, but with n processors we do n log n work. The implementations with 1 or n/log n processors are therefore cost optimal, while the implementation with n processors is not. Brent's theorem does not tell us how to implement a parallel algorithm, but it tells us what is possible.
NC algorithms: The class NC is the set of problems solvable on a PRAM in polylogarithmic time using a number of processors that is a polynomial function of the problem size. If an algorithm is in NC, it remains in NC regardless of which PRAM sub-model we assume. The class NC includes:
1. Parallel prefix computation.
2. Parallel sorting and selection.
3. Matrix operations.
4. Parallel tree contraction and expression evaluation.
5. Parallel algorithms for graphs.
6. Parallel algorithms for computational geometry.
7. Parallel algorithms for biconnectivity and triconnectivity.
8. Parallel string matching algorithms.
Many NC algorithms are cost optimal, that is, they have T(n, p(n)) = O(log n) with p(n) = n/log n, for a cost of O(n) (O is the complexity notation).
Drawbacks of NC theory:
1. The class NC may include some algorithms which are not efficiently parallelizable; the most infamous example is parallel binary search.
2. NC theory assumes a situation where a huge machine solves a moderately sized problem very quickly. In practice, however, moderately sized machines are used to solve large problems, so assuming a polynomial number of processors is unrealistic.

Chapter 7
PARALLEL ALGORITHMS FOR ARRAY PROCESSORS
SIMD matrix multiplication
Matrix manipulation is frequently needed in solving linear systems of equations. Important matrix operations include matrix multiplication, L-U decomposition, and matrix inversion. The differences between SISD and SIMD matrix algorithms show up in their program structures and speed performances.
Algorithm for SIMD matrix multiplication:
for i = 1 to n do
  par for k = 1 to n do
    C[i,k] = 0                              (vector load)
  for j = 1 to n do
    par for k = 1 to n do
      C[i,k] = C[i,k] + a[i,j] * b[j,k]     (vector multiply)
  end of j loop
end of i loop
The vector load operation initializes the row vectors of matrix C one row at a time. In the vector multiply operation, the same multiplier a[i,j] is broadcast from the CU to all PEs to multiply all n elements of the jth row vector of B.
Parallel sorting on array processors
An SIMD algorithm is presented for sorting n^2 elements on a mesh-connected processor array in O(n) routing and comparison steps. We assume an array processor with N = n^2 identical PEs interconnected by a mesh network similar to the Illiac-IV, except that the PEs at the perimeter have two or three rather than four neighbours. Two time measures are needed to estimate the time complexity of the parallel sorting algorithm. Let tR be the routing time required to move one item from a PE to one of its neighbours, and tC the comparison time required for one comparison step. A comparison-interchange step between the two items in adjacent PEs can then be done in 2tR + tC time units. The sorting problem depends on the indexing scheme for the PEs. The PEs may be indexed by a bijection from {1, 2, ..., n} x {1, 2, ..., n} to {0, 1, ..., N-1}, where N = n^2. Three indexing patterns are formed after sorting the given array in part a with respect to three different ways of indexing the PEs: the pattern in part b corresponds to row-major indexing, part c to shuffled row-major indexing, and part d is based on snake-like row-major indexing. Shuffle and unshuffle operations can each be implemented with a sequence of interchange operations. Both the perfect shuffle and its inverse can be done in k - 1 interchanges, or 2(k - 1) routing steps, on a linear array of 2k PEs.
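The SIMD matrix-multiply loop above can be sketched in Python. This is a sketch, not from the text: the inner k loop stands in for the "par for" that all PEs would execute simultaneously, and the example matrices are mine.

```python
# Sketch of the SIMD matrix-multiply algorithm: the inner loop over k
# updates a whole row of C at once (one vector operation per PE set);
# a[i][j] is the scalar broadcast from the CU to all PEs.
def simd_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]      # vector load, one row at a time
    for i in range(n):
        for j in range(n):
            aij = A[i][j]                # broadcast multiplier
            for k in range(n):           # conceptually one vector multiply
                C[i][k] += aij * B[j][k]
    return C

print(simd_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

With n PEs the inner loop collapses to one step, so each row of C takes O(n) vector operations instead of O(n^2) scalar ones.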

PARALLEL ALGORITHMS FOR MULTIPROCESSORS (MIMD)
A parallel algorithm for multiprocessors is a set of k concurrent processes which may operate simultaneously and cooperatively to solve a given problem.
Interaction points: interaction points are the points at which processes communicate with other processes. These interaction points divide a process into stages.
Two types of parallel algorithm:
1. Synchronized algorithms: parallel algorithms in which some processes have to wait on other processes are called synchronized algorithms. In these algorithms there exists a process such that some stage of it is not activated until another process has completed a certain stage of its program. The execution time of a process is a variable that depends on the input data and on system interruptions.
2. Asynchronous algorithms: in an asynchronous algorithm, processes are not required to wait for each other; communication is achieved by reading dynamically updated global variables stored in a shared memory accessible to all the processes. When a stage of a process is completed, the process modifies a subset of the global variables, based on the values it reads together with the results just obtained from its last stage, and then activates its next stage or terminates itself. In some cases, operations on global variables are programmed as critical sections. The main characteristic of an asynchronous parallel algorithm is that its processes never wait for inputs at any time, but continue executing or terminate according to whatever information is currently contained in the global variables.
Alternative approach, macro pipelining: this is applicable when the computation can be divided into parts called stages, so that the output of one or several parts is the input of other parts. Since each computation part is realized as a separate process, communication cost may be high.
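The asynchronous style described above can be sketched with threads. This is a sketch, not from the text: the task (finding a global minimum), the variable names, and the data are mine; the point is that workers never wait for each other and only guard the shared-variable update with a critical section.

```python
import threading

# Sketch: an asynchronous algorithm. Workers read a dynamically updated
# global, do local work, and update the global inside a critical section;
# no worker ever waits on another worker's stage.
best = float("inf")
lock = threading.Lock()

def worker(values):
    global best
    for v in values:
        local = min(v, best)       # read the shared global (no waiting)
        with lock:                 # critical section for the update
            if local < best:
                best = local

chunks = [[9, 4, 7], [8, 2, 6], [5, 3, 1]]
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(best)  # 1
```

A synchronized version of the same computation would instead make each stage wait at a barrier before combining results.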
An algorithm that is to run on a multiprocessor system must be decomposed into a set of processes in order to exploit parallelism. In static decomposition, the set of processes and their precedence relations are known before execution.

In dynamic decomposition, the set of processes changes during execution.