Lecture Notes in PARALLEL PROCESSING

Prepared by Rza Bashirov

High-performance computers are increasingly in demand in the areas of structural analysis, weather forecasting, petroleum exploration, medical diagnosis, aerodynamics simulation, artificial intelligence, expert systems, genetic engineering, and signal and image processing, among many other scientific and engineering applications. Without such powerful computers, many of these challenges to the advance of human civilization could not be met within a reasonable time period. Achieving high performance depends not only on using faster and more reliable hardware devices but also on major improvements in computer architecture and processing techniques.

1 FLYNN'S TAXONOMY

In general, digital computers may be classified into four categories according to the multiplicity of instruction and data streams. This scheme for classifying computer organizations was introduced by Michael J. Flynn. The essential computing process is the execution of a sequence of instructions on a set of data. The term stream is used here to denote a sequence of items (instructions or data) as executed or operated upon by a single processor. Instructions and data are defined with respect to a reference machine. An instruction stream is a sequence of instructions as executed by the machine; a data stream is a sequence of data, including input, partial, or temporary results, called for by the instruction stream. Computer organizations are characterized by the multiplicity of the hardware provided to service the instruction and data streams. Listed below are Flynn's four machine organizations:

• Single instruction stream single data stream (SISD)
• Single instruction stream multiple data stream (SIMD)
• Multiple instruction stream single data stream (MISD)
• Multiple instruction stream multiple data stream (MIMD)

SISD computer organization

This organization represents most serial computers available today. Instructions are executed sequentially but may be overlapped in their execution stages.

SIMD computer organization

In this organization, there are multiple processing elements (PEs) supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams.
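The contrast between these first two organizations can be sketched in a few lines of Python; this is a minimal illustration, and the data values and the three-PE setup are made up for the example.

# SISD vs. SIMD in miniature: one instruction stream either walks a single
# data stream, or is broadcast to several processing elements (PEs) that
# each hold their own data stream. Pure-Python sketch; values illustrative.

data_streams = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]

# SISD: one instruction stream, one data stream, one PE.
acc = 0
for x in data_streams[0]:
    acc += x                      # the single PE executes each instruction

# SIMD: the same "add" instruction is broadcast to all PEs at each step;
# PE i operates on element j of its own data stream.
pe_acc = [0, 0, 0]
for j in range(3):
    for i, stream in enumerate(data_streams):
        pe_acc[i] += stream[j]    # every PE executes the same instruction

print(acc, pe_acc)                # 6 [6, 60, 600]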

MISD computer organization

There are n processor units, each receiving distinct instructions operating over the same data stream and its derivatives. The results (output) of one processor become the input (operands) of the next processor in the macropipe.

MIMD computer organization

Most multiprocessor systems and multiple computer systems can be classified in this category. An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors. If the n data streams were from disjointed subspaces of the shared memories, then we would have the so-called multiple SISD (MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems.

The last three classes of computer organization are the classes of parallel computers.

2 PIPELINING: AN OVERLAPPED PARALLELISM

Pipelining offers an economical way to realise temporal parallelism in digital computers. The concept of pipeline processing in a computer is similar to assembly lines in an industrial plant. To achieve pipelining, one must subdivide the input task (process) into a sequence of subtasks, each of which can be executed by a specialised hardware stage that operates concurrently with other stages in the pipeline. Successive tasks are streamed into the pipe and get executed in an overlapped fashion at the subtask level. The subdivision of the input task into a proper sequence of subtasks becomes a crucial factor in determining the performance of the pipeline.

2.1 Principles of linear pipelining

Assembly lines have been used in automated industrial plants in order to increase productivity. Their original form is a flow line (pipeline) of assembly stations where items are assembled continuously from separate parts along a moving conveyor belt. Ideally, all the assembly stations should have equal processing speed; otherwise, the slowest station becomes the bottleneck of the entire pipe. This bottleneck problem, plus the congestion caused by improper buffering, may result in many idle stations waiting for new parts. The stations in an ideal assembly line can operate synchronously with full resource utilisation; in a uniform-delay pipeline, all tasks have equal processing time in all station facilities. In reality, however, the successive stations have unequal delays. The optimal partition of the assembly line depends on a number of factors, including the quality (efficiency and capability) of the working units, the desired processing speed, the cost effectiveness of the entire assembly line, and other factors. The subdivision of labour in assembly lines has contributed to the success of mass production in modern industry. By the same token, pipeline processing has led to the improvement of system throughput in the modern digital computer.
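To see the overlapped execution concretely, here is a minimal Python sketch (with made-up stage and task counts) that simulates which task occupies which stage of a linear pipeline at each clock period; the k + (n − 1)-period count it demonstrates is derived formally below.

# Minimal simulation of a k-stage linear pipeline (illustrative sketch).
# Task i enters stage s at clock period i + s, so the last of n tasks
# leaves the final stage after k + (n - 1) periods.

def trace(k: int, n: int) -> int:
    """Print stage occupancy per clock period; return total periods."""
    total = k + (n - 1)
    for t in range(total):
        # Stage s holds task t - s if that task has entered and not yet left.
        row = [(t - s) if 0 <= t - s < n else None for s in range(k)]
        print(f"period {t}: " +
              " ".join(f"T{i}" if i is not None else "--" for i in row))
    return total

if __name__ == "__main__":
    assert trace(k=4, n=6) == 9   # 4 + (6 - 1) clock periods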

A linear pipeline consists of a cascade of processing stages. High-speed interface latches separate the stages; the latches are fast registers for holding the intermediate results between the stages. Information flows between adjacent stages are under the control of a common clock applied to all the latches simultaneously.

[Figure: a linear pipeline with stages S1, S2, S3 between the input and output latches.]

A linear pipeline can process a succession of subtasks with a linear precedence graph. The precedence relation of a set of subtasks {T1, ..., Tk} for a task T implies that a subtask Tj cannot start until some earlier subtask Ti (i < j) finishes.

Clock period. The logic circuitry in each stage Si has a time delay denoted by τi. Let τl be the time delay of each interface latch. The clock period of a linear pipeline is defined by

τ = max{τi, i = 1, ..., k} + τl = τm + τl.

The reciprocal of the clock period is called the frequency, f = 1/τ.

Ideally, a linear pipeline with k stages can process n tasks in Tk = k + (n − 1) periods, where k cycles are used to fill up the pipeline or to complete execution of the first task and n − 1 cycles are needed to complete the remaining n − 1 tasks. The same number of tasks (operand pairs) can be executed in a nonpipeline processor with an equivalent function in T1 = n · k time delay.

Speedup. We define the speedup of a k-stage linear pipeline processor over an equivalent nonpipeline processor as

Sk = T1/Tk = n · k / (k + (n − 1)).

It should be noted that Sk → k for n >> k. In other words, the maximum speedup that a linear pipeline can provide is k, where k is the number of stages in the pipe. The maximum speedup is never fully achievable because of data dependencies between instructions, interrupts, and other factors.
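A small Python sketch, using made-up stage and latch delays, ties these definitions together: the clock period follows the slowest stage, and the measured speedup approaches k as n grows.

# Clock period, frequency, and speedup of a linear pipeline, following the
# formulas above. Stage and latch delays are made-up numbers.

stage_delays = [9.0, 7.5, 10.0, 8.0]   # tau_i per stage, in ns (illustrative)
latch_delay = 1.0                       # tau_l, in ns (illustrative)

k = len(stage_delays)
tau = max(stage_delays) + latch_delay   # tau = max{tau_i} + tau_l = 11 ns
f = 1.0 / tau                           # frequency

n = 1000                                # number of tasks
T_pipe = (k + (n - 1)) * tau            # pipelined execution time
T_nonpipe = n * k * tau                 # nonpipeline time (T1 = n*k periods)
S = T_nonpipe / T_pipe                  # speedup S_k = n*k / (k + (n - 1))

print(f"tau = {tau} ns, f = {f:.4f}/ns, S_{k} = {S:.3f} (-> {k} as n grows)")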

Efficiency. The efficiency of a linear pipeline is measured by the percentage of busy time-space spans over the total time-space span, which equals the sum of all busy and idle time-space spans. Let n, k, and τ be the number of tasks (instructions), the number of pipeline stages, and the clock period of a linear pipeline, respectively. The pipeline efficiency is defined by

η = n · k · τ / (k · [kτ + (n − 1)τ]) = n / (k + (n − 1)).

Note that η → 1 as n → ∞. This implies that the larger the number of tasks flowing through the pipeline, the better is its efficiency. Moreover, we realize that η = Sk/k. This provides another view of the efficiency of a linear pipeline: the ratio of its actual speedup to the ideal speedup k. In the steady state of a pipeline, we have n >> k, and the efficiency η should approach 1. However, this ideal case may not hold all the time because of program branches, interrupts, data dependency, and other reasons.

Throughput. The number of results (tasks) that can be completed by a pipeline per unit time is called its throughput. This rate reflects the computing power of a pipeline. In terms of the efficiency η and the clock period τ of a linear pipeline, we define the throughput as

w = n / (kτ + (n − 1)τ) = η/τ,

where n equals the total number of tasks being processed during the observation period kτ + (n − 1)τ. In the ideal case, w = 1/τ = f when η → 1. This means that the maximum throughput of a linear pipeline is equal to its frequency, which corresponds to one output result per clock period.
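The following Python fragment (with arbitrary n, k, and τ) checks the two identities above, η = Sk/k and w = η/τ, and shows the throughput approaching the frequency f as n grows.

# Efficiency and throughput of a k-stage linear pipeline, from the formulas
# above; n, k, and tau are arbitrary illustrative values.

def metrics(n: int, k: int, tau: float):
    S = n * k / (k + (n - 1))        # speedup
    eta = n / (k + (n - 1))          # efficiency, also S / k
    w = n / ((k + (n - 1)) * tau)    # throughput, also eta / tau
    return S, eta, w

tau = 10.0                            # clock period in ns (illustrative)
for n in (10, 100, 10_000):
    S, eta, w = metrics(n, k=4, tau=tau)
    assert abs(eta - S / 4) < 1e-12 and abs(w - eta / tau) < 1e-12
    print(f"n={n}: eta={eta:.4f}, w={w:.6f}/ns (f = {1/tau}/ns)")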

According to the levels of processing, pipeline processors can be classified into the following classes: arithmetic, instruction, and processor pipelines; unifunctional vs. multifunctional pipelines; static vs. dynamic pipelines; and scalar vs. vector pipelines.

Arithmetic pipelining. The arithmetic logic units of a computer can be segmentized for pipeline operations in various data formats. Well-known arithmetic pipeline examples are the four-stage pipes used in the STAR-100, the eight-stage pipelines used in the TI-ASC, the up to 14 pipeline stages used in the CRAY-1, and the up to 26 stages per pipe in the CYBER-205.

Instruction pipelining. The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This technique is also known as instruction lookahead. Almost all high-performance computers are now equipped with instruction-execution pipelines.

Processor pipelining. This refers to the pipeline processing of the same data stream by a cascade of processors, each of which processes a specific task. The data stream passes the first processor, with results stored in a memory block which is also accessible by the second processor. The second processor then passes the refined results to the third, and so on.

Unifunctional vs. multifunctional pipelines. A pipeline unit with a fixed and dedicated function, such as the floating-point adder, is called unifunctional. A multifunctional pipe may perform different functions, either at different times or at different subsets of stages in the pipeline.

Static vs. dynamic pipelines. A static pipeline may assume only one functional configuration at a time. The function performed by a static pipeline should not change frequently; otherwise, its performance may be very low. Pipelining is made possible in static pipes only if instructions of the same type are to be executed continuously. Static pipelines can be either unifunctional or multifunctional. A dynamic pipeline processor permits several functional configurations to exist simultaneously. In this sense, a dynamic pipeline must be multifunctional; on the other hand, a unifunctional pipe must be static. The dynamic configuration needs much more elaborate control and sequencing mechanisms than those for static pipelines. Most existing computers are equipped with static pipes.

Scalar vs. vector pipelines. Depending on instruction or data types, pipeline processors can also be classified as scalar pipelines and vector pipelines. A scalar pipeline processes a sequence of scalar operands under the control of a DO loop. Instructions in a small DO loop are often prefetched into the instruction buffer. The required scalar operands for repeated scalar instructions are moved into a data cache in order to continuously supply the pipeline with operands. Vector pipelines are specially designed to handle vector instructions over vector operands. Computers having vector instructions are often called vector processors. The design of a vector pipeline is expanded from that of a scalar pipeline.
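Returning to processor pipelining above, the macropipe idea can be sketched as a cascade of Python generators; the three stage functions here are hypothetical placeholders for real processors.

# Sketch of processor pipelining (a "macropipe"): each processor refines
# the result of the previous one over the same data stream. The stage
# functions are hypothetical placeholders.

def processor1(stream):
    for x in stream:
        yield x * 2          # first processor produces partial results

def processor2(stream):
    for x in stream:
        yield x + 1          # second processor refines them

def processor3(stream):
    for x in stream:
        yield x ** 2         # third processor produces final results

data = range(5)
macropipe = processor3(processor2(processor1(data)))
print(list(macropipe))       # [1, 9, 25, 49, 81]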

2.2 General pipelines and reservation tables

What we have studied so far are linear pipelines without feedback connections; the inputs and outputs of such pipelines are totally independent. In some computations, like linear recurrence, the outputs of the pipeline are fed back as future inputs. In other words, the inputs may depend on previous outputs. Pipelines with feedback may have a nonlinear flow of data. Improper use of feed-forward or feedback inputs may destroy the inherent advantages of pipelining. On the other hand, proper sequencing with nonlinear data flow may enhance the pipeline efficiency. In practice, many of the arithmetic pipeline processors allow nonlinear connections as a mechanism to implement recursion and multiple functions. The timing of the feedback inputs becomes crucial to the nonlinear data flow, and the utilization history of the pipeline determines the present state of the pipeline.

The one-way connections between adjacent stages form the original linear cascade of the pipeline. A feed-forward connection connects a stage Si to a stage Sj such that j ≥ i + 2, and a feedback connection connects a stage Si to a stage Sj such that j ≤ i. In this sense, a "pure" linear pipeline is a pipeline without any feedback or feed-forward connections.

Consider a simple pipeline that has a structure with both feed-forward and feedback connections, as shown in the figure. We will number the pipeline stages S1, S2, S3 from the input end to the output end. The circles in the figure refer to data multiplexers. Assume that this pipeline is dual functional, with the two functions denoted as function A and function B.

[Figure: a three-stage sample pipeline with feed-forward and feedback connections among stages S1, S2, S3.]

A reservation table represents the flow of data through the pipeline for one complete evaluation of a given function. The rows correspond to pipeline stages and the columns to clock time units. A marked entry in the (i, j)th square of the table indicates that stage Si will be used j time units after the initiation of the function evaluation. For a unifunctional pipeline, one can simply use an "x" to mark the table entries; for a multifunctional pipeline, different marks are used for different functions, such as the A's and B's in the two reservation tables for the sample pipeline. The total number of clock units in the table is called the evaluation time for the given function. Different functions may have different evaluation times, such as 8 and 7 time units for functions A and B, respectively. The two reservation tables shown below correspond to the two functions of the sample pipeline.

[Reservation tables for functions A and B: rows S1–S3; columns t0–t7 for function A and t0–t6 for function B, with the marked entries as in the original figure.]
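A reservation table is naturally represented as a set of (stage, time) marks. The marks below are illustrative only — the lecture's actual tables for A and B are the ones in the figure — but the two helpers show how the evaluation time and the per-period stage usage are read off such a table.

# A reservation table as a set of (stage, time) marks. Marks are
# illustrative, not the lecture's actual table for function A.

table_A = {("S1", 0), ("S2", 1), ("S3", 2), ("S2", 3),
           ("S3", 4), ("S1", 5), ("S3", 6), ("S1", 7)}

def evaluation_time(table) -> int:
    """Total number of clock units used by one evaluation."""
    return max(t for _, t in table) + 1

def stages_used_at(table, t):
    """Stages busy t time units after initiation."""
    return sorted(s for s, u in table if u == t)

print(evaluation_time(table_A))      # 8
print(stages_used_at(table_A, 3))    # ['S2']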

The data-flow pattern in a static, unifunctional pipeline can be fully described by one reservation table, while a multifunctional pipeline may use different reservation tables for the different functions to be performed. On the other hand, a given reservation table does not uniquely correspond to one particular hardware pipeline: one may find that several hardware pipelines with different interconnection structures can use the same reservation table. The 8 steps required to evaluate function A and the 7 steps needed to evaluate function B can be read from the marked entries of the corresponding reservation tables, stage by stage in time order.

2.3 Multifunction and Array Pipelines

Texas Instruments' Advanced Scientific Computer (TI-ASC) was the first vector processor to be installed with multifunction pipelines in its arithmetic processors. The ASC arithmetic pipeline consists of eight stages, as illustrated in the figure, and all the interconnection routes among the eight stages are shown there. This pipeline can perform either fixed-point or floating-point arithmetic functions, many logical-shifting operations, and conversions between fixed-point and floating-point operands, over scalar and vector operands of length 16, 32, or 64 bits.

[Figure: the eight stages of the ASC arithmetic pipeline — Receive, Exponent Subtract, Align, Add, Normalize, Multiply, Accumulate, and Output — with their interconnection routes.]

Different arithmetic-logic instructions are allowed to use different connecting paths through the pipeline. The figure shows four interconnection patterns of the ASC pipeline for the evaluation of four functions: fixed-point add, fixed-point multiply, floating-point add, and floating-point multiply. It is not difficult to see that the receiver and output stages are used by all instructions. The exponent subtract stage determines the exponent difference and sends this shift count to the align stage, which aligns the fractions for floating-point add or subtract instructions. All right-shift operations are also implemented in this align stage. The normalize stage does the floating-point normalization, and all left-shift operations are implemented there. The multiply stage performs multiplication and produces two results, called the pseudo sum and pseudo carry, which are sent to the accumulator stage or the add stage to produce the desired product.

[Figure: the four connecting paths through the ASC pipeline for fixed-point add, floating-point add, fixed-point multiply, and floating-point multiply.]
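The connecting paths can be recorded as ordinary data. The stage sequences below are an illustrative reading of the figure described above, not the exact ASC routing; the assertion checks the observation that the receiver and output stages appear on every path.

# Stage paths through the ASC pipeline for some of the four functions.
# Paths are inferred from the text and figure description; treat them as
# an illustrative reading, not the exact ASC routing.

asc_paths = {
    "fixed-point add":    ["Receive", "Add", "Output"],
    "floating-point add": ["Receive", "Exp. Subtract", "Align",
                           "Add", "Normalize", "Output"],
    # Multiplies route through Multiply and then Accumulate (or Add) to
    # combine the pseudo sum and pseudo carry, per the text:
    "fixed-point multiply": ["Receive", "Multiply", "Accumulate", "Output"],
}

# The receiver and output stages are used by all instructions.
assert all(p[0] == "Receive" and p[-1] == "Output"
           for p in asc_paths.values())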

Array pipelines are two-dimensional pipelines with multiple data-flow streams for high-level arithmetic computations, such as matrix multiplication, inversion, and L-U decomposition. The pipeline is usually constructed with a cellular array of arithmetic units; the cellular array is usually regularly structured and suitable for microprocessor implementation. The basic building blocks in the array are the M cells. Each M cell performs an additive inner-product operation, as illustrated in the figure. Each cell has the three input operands a, b, and c and the three outputs a' = a, b' = b, and d = a × b + c. Fast latches are used in all input-output terminals and all interconnecting paths in the array pipeline, and all the latches are synchronously controlled by the same clock.

The array shown in the figure performs the multiplication of two 3 × 3 dense matrices, A · B = C:

        | a11 a12 a13 |   | b11 b12 b13 |   | c11 c12 c13 |
A · B = | a21 a22 a23 | · | b21 b22 b23 | = | c21 c22 c23 | = C.
        | a31 a32 a33 |   | b31 b32 b33 |   | c31 c32 c33 |

[Figure: the 3 × 3 matrix-multiply array pipeline; the elements of A and B enter in a skewed order, padded with zeros, starting at time steps t1–t3, and the results cij emerge at time steps t6–t8.]
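The M cell's behaviour, and the way a chain of such cells accumulates one entry of C, can be checked with a short, untimed Python sketch (it ignores the array's skewed timing and simply iterates the d = a × b + c step):

# The M cell's additive inner-product step: outputs a' = a, b' = b,
# d = a * b + c. Chaining n such cells accumulates one entry of C = A . B.

def m_cell(a, b, c):
    return a, b, a * b + c        # (a', b', d)

def matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = 0
            for k in range(n):    # each pass models one M cell in the chain
                _, _, d = m_cell(A[i][k], B[k][j], d)
            C[i][j] = d
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul(A, B))   # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]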

to multiply two ( n × n) matrices requires 3n − 4n + 2 cells. which use pipeline adder. 1020ns is enough to evaluate A1 + + A100 in four-segment pipeline.1 1 . efficiency. 10 . ns Sk = T1 n⋅k 99 ⋅ 4 = = ≈ 3.88 . So. Find the minimum number of periods required for 99 floating-points additions.4 Problems 1. Find frequency. X Y S1 S2 S3 S4 Z Solution.97 . Substituting n=99 and k=4 in Tk = k + (n − 1) we obtain T4 = 4 + (99 − 1) = 102 . It takes 3n − 1 clock periods to complete the multiply process. 2. Substituting n=99 and k=4 in respective formulae we obtain f = 1 τ = 0. 2 In general. Draw the four-segment pipeline to realize two functions A and B described in the following reservation table. Consider a four-segment floating-point adder with a 10ns clock period. The problem of computing A1 + + A100 is equivalent to the processing of a stream of 99 problems of type A+B. throughput and speedup of a four-segment linear pipeline in problem 1 over an equivalent nonpipeline processor.7 . kτ + (n − 1)τ τ ns 3. Solution. 2. k ⋅ [kτ + (n − 1)τ ] k + (n − 1) n 1 η w= = ≈ 9.a11  A ⋅ B = a 21 a 31  a12 a 22 a 32 a13  b11 b12   a 23  ⋅ b21 b22 a 33  b31 b32   b13   b23  = b33   c11  c21 c31  c12 c22 c32 c13   c23  = C c33   . Tk k + (n − 1) 4 + 98 η= n ⋅ k ⋅τ n = ≈ 0.

3. Draw the four-segment pipeline that realizes the two functions A and B described in the following reservation table.

[Reservation table for functions A and B: rows S1–S4, columns t0–t5, with the A and B marks as in the original figure.]

Solution.

[Figure: a four-segment pipeline with stages S1–S4 and the input and interstage connections required to realize both functions; see the original figure.]

4. Describe the following terminologies associated with pipeline computers:
(a) Static pipeline
(b) Dynamic pipeline
(c) Unifunctional pipeline
(d) Multifunctional pipeline
(e) Instruction pipeline
(f) Vector pipeline

Solution.

(a) A static pipeline may assume only one functional configuration at a time. The function performed by a static pipeline should not change frequently; otherwise, its performance may be very low. Pipelining is made possible in static pipes only if instructions of the same type are to be executed continuously. Static pipelines can be either unifunctional or multifunctional.

(b) A dynamic pipeline processor permits several functional configurations to exist simultaneously. In this sense, a dynamic pipeline must be multifunctional; on the other hand, a unifunctional pipe must be static. The dynamic configuration needs much more elaborate control and sequencing mechanisms than those for static pipelines.

12 . either at different subsets of stages in the pipeline. decode. (f) Vector pipelines are specially designed to handle vector instructions over vector operands. such as the floating-point adder. (d) A multifunctional pipe may perform different functions. Almost all high-performance computers are now equipped with instruction-execution pipelines. This technique is also known as instruction lookahead. is called unifuctional. and operand fetch of subsequent instructions. Computers having vector instructions are often called vector processors. The design of a vector pipeline is expanded from that of a scalar pipeline.(c) A pipeline unit with a fixed and dedicated function. (e) The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch.
