The University of New South Wales

School of Computer Science & Engineering

COMP9243 — Week 13 (05s1)
Manuel M. T. Chakravarty Gernot Heiser

Parallel Processing
At the current state of the art, writing parallel programs is a complex endeavour for almost all classes of applications. This complexity is acceptable only if it provides a significant benefit, which in the case of parallel computing usually means a significant increase in processing speed. In some cases, however, parallel processing is used to enable the processing of larger quantities of data instead of achieving faster processing for smaller data sets. While parallel processing can be based on a range of architectures, including shared-memory systems, the use of distributed systems is of particular interest. This is because distributed systems scale further than shared-memory systems and can more easily be built from commodity hardware. In particular, distributed-memory parallel systems dominate the "Top 500" supercomputer list.1 The disadvantage of systems based on distributed memory is that they can be hard to program and, in particular, it usually requires substantial work to achieve a sufficient performance gain.

Measuring Performance
Due to the focus on performance in parallel processing, it is important to have clear performance measures that facilitate reasoning about the performance of competing approaches to a problem. We review some of the basic terminology in the following. Let us assume a parallel system with p processing elements (PEs). We denote the running time of the best sequential algorithm for a problem of size n by T*(n). Moreover, we denote the running time of a parallel algorithm on p processors, for problem size n, by T_p(n). Judgements involving the gain of using a parallel algorithm are usually based on the speedup achieved by the parallel algorithm. Speedup comes in two flavours:

• Relative speedup:   S^p_r(n) = T_1(n) / T_p(n)

• Absolute speedup:   S^p_a(n) = T*(n) / T_p(n)

The essential difference between the two is that relative speedup quantifies the improvement achieved by using more processors to execute a parallel algorithm, whereas absolute speedup compares a parallel to a sequential algorithm. If we are interested in evaluating the usefulness of accepting the added complexity, only absolute speedup measures are meaningful. Relative speedup already takes much of the overhead of parallel processing for granted.
1 www.top500.org


Traditionally, parallel architectures are classified by whether they feature a single or multiple instruction streams and by whether they have a single or multiple data streams. This leads to four possible categories, of which only the following three have practical relevance:

• SISD: uniprocessor
• SIMD: one instruction stream is executed by many processors operating on different data items (e.g., Thinking Machines CM-2, MasPar MP-2)
• MIMD: many independent processors (e.g., Cray T3E, CM-5, IBM SP-2, clusters)

The major advantage of SIMD architectures is that their single control flow makes them easier to program, as the programming model is closer to the sequential model. In some cases, the shared control logic may also reduce hardware costs. In contrast, the major advantages of MIMD are that it is the more general model and that MIMD machines are more easily assembled from commodity parts.

Given the use of a MIMD architecture, the designer of a parallel system is faced with an important question: Given a fixed budget, we might choose to use many cheap and simple processors or a few expensive, high-end processors. Which alternative gives rise to better performance? Experience so far encourages the use of a few high-end processors, as it is usually hard to exploit parallelism and a larger number of processors induces larger overheads.

Overall, we have that SIMD is easier to program (due to the single control flow), whereas MIMD gives a better price-performance ratio due to the ability to use commodity parts. This immediately leads to the question of whether we cannot have a SIMD-like programming model implemented on MIMD hardware. This has led to the SPMD (single program multiple data) model, where the same program is executed on all nodes of a MIMD machine with the nodes synchronising frequently (typically using barriers). This is the model that is currently in most widespread use.

Today's Dominant Architectures

Currently, three main distributed architectures are in use for parallel processing:

1. Distributed shared memory: These are machines that implement DSM in hardware, such as the SGI Origin.

2. Distributed-memory multiprocessors: These are machines based on custom processors connected via special-purpose high-speed interconnect, such as the Cray T3E or the IBM SP-2. These systems often require special-purpose software (such as communication libraries and parallelising compilers).

3. Workstation clusters: These are systems assembled from off-the-shelf workstations connected by a fast commodity network, such as Linux Beowulf clusters or Compaq's high-performance clusters.

The first class of machines promises ease of migration from sequential to parallel code, but requires a considerable amount of specialised hardware, which makes it doubtful that the architecture can compete in the market of massively parallel systems in the long term.

When comparing the second and third architectures, the main difference is in the interconnect. Distributed-memory multiprocessors are usually tightly integrated and use expensive crossbar switches or ultra high-performance networks that can be directly integrated into the motherboard to optimise the data path from the CPU and/or memory to the network. In contrast, clusters draw a huge cost benefit from the use of commodity hardware and software. Their drawback is that communication latency and bandwidth still lag significantly behind those of specialised hardware. However, it is to be expected that economies of scale will make clusters an ever more attractive option.

Programming Models

The two basic programming models for parallel programming are data parallel programming and control parallel programming. Data parallel programming may be regarded as an abstraction of SIMD architectures, where parallel operations act on collections as a whole (e.g., to add two arrays elementwise or to compute the sum over all elements of an array). Conversely, control parallel programming may be regarded as an abstraction of MIMD architectures, where all processing elements may execute independent processes, which usually communicate via message passing; this makes it the more general model. Data parallelism is the easier of the two models to understand and debug, and it is typically implemented using the SPMD model.

Data Parallelism

In data parallel programs, parallel computations operate on collections (such as arrays, lists, or sets). A single parallel operation usually processes all elements of a collection at once, such that the operation is executed in parallel on individual elements, instead of sequentially. Moreover, the behaviour of such programs is deterministic, in the sense that execution on a uniprocessor yields the same result as execution on a multiprocessor, which simplifies debugging significantly.

As a typical example of a data parallel routine, consider the following function for computing the dot product of two vectors:

    int dotp (int n, int *a, int *b)
    {
      int c[n];    /* distributed across PEs */

      forall (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
      return sum (c);
    }

The function dotp is denoted in a fictive extension of the language C, where forall loops execute the various iterations in parallel. The function multiplies the two input vectors elementwise, and then sums the vector of products; this is graphically represented in Figure 1. The similarity to a sequential program makes this function easy to read.

[Figure 1: Dataflow in dotp — the vectors a and b are multiplied elementwise into c, which is then summed into the result]

Flat Versus Nested Data Parallelism

Data parallelism itself comes in two flavours: it may be flat or it may be nested. In flat data parallelism, the nesting of parallel operations is restricted; in particular, inner parallel loops may not depend on outer parallel loops. This restriction is met in a loop nest such as

    forall (i = 0; i < n; i++)
      forall (j = 0; j < m; j++)
        ...

which corresponds to the traversal of multi-dimensional arrays. The iteration space of j is independent of i; hence, the combined iteration space is regular — in fact, it is rectangular. In contrast, in

    forall (i = 0; i < n; i++)
      forall (j = 0; j < 2 * i; j++)
        ...

the inner loop depends on the outer, as i is used to determine the upper bound of the j iteration space. In the case of the second loop nest the dependence is linear, which still leads to a regular (albeit not rectangular) structure. However, for an arbitrary function over i, we get an irregular iteration space. Such irregular iterations are important for the processing of irregular structures, such as sparse matrices and trees. The iteration space of the two loop nests is illustrated in Figure 2, where the added complications with respect to proper load balancing are indicated.

[Figure 2: Rectangular and non-rectangular iteration space, partitioned across processors P1–P4]

Flat data parallelism is clearly less convenient once irregular computations are implemented, but it also is much easier to implement efficiently on standard architectures. In particular, regular structures can be more easily distributed, often good load balancing can be computed statically, and communication patterns are statically fixed. SIMD and vector machines support flat data parallelism directly in hardware; in fact, this holds also for the SIMD instructions that are part of the graphics instruction sets of modern microprocessors. As a consequence, early parallel Fortran dialects were restricted to flat data parallelism. Modern dialects, such as High Performance Fortran (HPF), can express nested data parallelism, but current compilers often do not generate efficient parallel code from such programs.

There exist two main approaches to the implementation of nested data parallelism:

• The compiler may transform nested into flat data parallelism; this process is called flattening.

• The compiler transforms nested data parallelism into a multi-threaded SPMD program. However, the result is very fine-grained parallelism and communication, which, by the current state of the art, largely restricts this approach to targeting shared-memory machines.

So, overall, the tradeoff between flat and nested data parallelism is a tradeoff between ease of system/compiler implementation and ease of application programming.

Control Parallelism

Control parallel programming allows arbitrary tasks to run in parallel. In a distributed setting, communication is usually by message passing. Nevertheless, most applications constrain themselves to SPMD to impose more structure, which usually simplifies development and maintenance of the code. However, there are applications that make use of full control parallelism. As an example, consider image recognition, where a pipeline consists of multiple stages that apply different transformations to an image. This structure is illustrated in Figure 3.

[Figure 3: Pipelined image processor — Line Recognition → Feature Recognition → Grouping → Classification]

From Data Parallelism to Control Parallelism

Flat data parallelism (independent of whether it is specified in the application program or generated from nested data parallelism) is implemented on current MIMD hardware by the compiler transforming it into an SPMD program, where implicit per-instruction synchronisation is replaced by explicit barrier synchronisation. To illustrate the implementation of data parallel programs on MIMD hardware by exploiting the structured control parallelism of the SPMD model, let us reconsider the dotp() code and discuss its transformation.

For simplicity, let us assume that n is divisible by p (the number of processors) and that the input arrays are already uniformly distributed when dotp is called. On each processing node, the variable pe contains the node number (counting from 0). Under these conditions, the original data parallel code that provides a global view on the computation can be rewritten into the following node program:

    int dotp (int n, int *a, int *b)
    {
      int i;
      int c[n/p];

      for (i = 0; i < n/p; i++)
        c[i] = a[i] * b[i];
      return sum (n, c);
    }

This same code is executed on each node (as dictated by the SPMD model). In the node program, data distribution is explicit (and part of the index calculations); furthermore, synchronisation (via barrier()) is explicit. The value of pe will be different for the different nodes. Differing pe values do not make a difference in the code so far, but we will see an example of their impact in the following implementation of the routine sum().

To complete the node program, we have to provide a parallel version of sum(). The function proceeds by having each processor compute the sum of its local portion of the input vector. Then the processors cooperate by performing a tree-shaped reduction. The code begins as follows:

    int sum (int n, int *a)
    {
      int i, buf;
      int local_result = 0;

      for (i = 0; i < n/p; i++)
        local_result += a[i];
      barrier ();

      for (i = 1; i <= ld (p); i++)
        if (pe < p / (1 << (i - 1))) {
          if (pe >= p / (1 << i))
            send_int (pe - (p / (1 << i)), local_result);
          else {
            receive_int (pe + (p / (1 << i)), &buf);
            local_result += buf;
          }
        }
      bcast_int (&local_result, 0);
      return local_result;
    }

Here, ld (p) denotes the base-2 logarithm of p. In each round, the upper half of the still-active nodes sends its partial result to a partner in the lower half, until node 0 holds the overall sum, which is finally broadcast to all nodes. The structure of the computation is illustrated in Figure 4.

[Figure 4: Tree reduction — partial results combined pairwise with + nodes]

Support for Programming Parallel Systems

Flexible operating system support for cluster programming is provided by MOSIX, which is a patch to the Linux kernel. It supports automatic load balancing via process migration, but no special support for communication among processes. MOSIX is based on the assumption that, once started, processes are largely independent. However, an I/O, even after migration, always goes via the node on which a process was created. This is clearly very inefficient if there is much I/O. As a result, MOSIX is only suitable for a rather restricted set of applications, which are often called embarrassingly parallel programs. Hence, while MOSIX solves the problem of load balancing, it does nothing to help an application programmer with the communication and synchronisation of parallel processes.

In contrast, the message passing interface (MPI) provides a standardised infrastructure to facilitate communication and synchronisation, but does not help the programmer with load balancing. The basic functionality of MPI provides message send and receive as well as various forms of barrier synchronisation and collective communication operations (e.g., scatter/gather and reductions). More sophisticated features, such as one-sided communication, parallel I/O, and dynamic process creation, were introduced in the revised standard MPI-2. MPI is widely used and MPI code is fairly portable. However, as programs based on message passing can be hard to write and understand, MPI has been called "the assembly language of parallel programming."

Programming language support for data parallel programming is provided by High Performance Fortran (HPF), which is based on collective array operations and features annotations for controlling data distribution as well as parallelism. Current compilers require fairly flat data parallelism to generate good code.
