
1. Identifying portions of the work that can be performed concurrently.
2. Mapping the concurrent pieces of work onto multiple processes running in parallel.
3. Distributing the input, output, and intermediate data associated with the program.
4. Managing accesses to data shared by multiple processors.
5. Synchronizing the processors at various stages of the parallel program execution.

Preliminaries: Decomposition, Tasks, and Dependency Graphs
• The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently.
• A given problem may be decomposed into tasks in many different ways. Tasks may be of the same or different sizes.
• A decomposition can be modeled using a directed graph with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task dependency graph.


Granularity of Task Decompositions
• The number of tasks into which a problem is decomposed determines its granularity.
• Decomposition into a large number of tasks results in a fine-grained decomposition; decomposition into a small number of tasks results in a coarse-grained decomposition.

Degree of Concurrency
• The number of tasks that can be executed in parallel is the degree of concurrency of a decomposition.
• Since the number of tasks that can be executed in parallel may change over program execution, the maximum degree of concurrency is the maximum number of such tasks at any point during execution.
• The average degree of concurrency is the average number of tasks that can be processed in parallel over the execution of the program.
• The degree of concurrency increases as the decomposition becomes finer in granularity, and vice versa.

Critical Path Length
• A directed path in the task dependency graph represents a sequence of tasks that must be processed one after the other.
• The longest such path determines the shortest time in which the program can be executed in parallel.
• The length of the longest path in a task dependency graph is called the critical path length.

Limits on Parallel Performance
• It would appear that the parallel time can be made arbitrarily small by making the decomposition finer in granularity.
• There is an inherent bound on how fine the granularity of a computation can be. For example, in the case of multiplying a dense matrix with a vector, there can be no more than n² concurrent tasks.
• Concurrent tasks may also have to exchange data with other tasks. This results in communication overhead.
• The tradeoff between the granularity of a decomposition and the associated overheads often determines performance bounds. The finer the granularity of a decomposition, the larger the associated overhead as a ratio of the useful work per task.
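The definitions above can be made concrete on a small task dependency graph. The sketch below is a minimal illustration (the graph and the assumption of unit-time tasks are hypothetical, not from the text): it computes the critical path length and the maximum degree of concurrency of a DAG given as a map from each task to its prerequisites.

```python
# Critical path length and maximum degree of concurrency of a task
# dependency graph, assuming unit-time tasks (hypothetical example).

def critical_path_length(deps):
    """deps maps task -> list of prerequisite tasks. Returns the number
    of tasks on the longest dependency chain (unit task sizes)."""
    memo = {}
    def finish(t):
        if t not in memo:
            memo[t] = 1 + max((finish(p) for p in deps[t]), default=0)
        return memo[t]
    return max(finish(t) for t in deps)

def max_degree_of_concurrency(deps):
    """Maximum number of tasks sitting at the same depth level, i.e.,
    the width of a level-by-level schedule of the DAG."""
    level = {}
    def depth(t):
        if t not in level:
            level[t] = 1 + max((depth(p) for p in deps[t]), default=0)
        return level[t]
    for t in deps:
        depth(t)
    counts = {}
    for l in level.values():
        counts[l] = counts.get(l, 0) + 1
    return max(counts.values())

# Diamond-shaped DAG: a -> {b, c} -> d
deps = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(critical_path_length(deps))       # 3 (a, then b, then d)
print(max_degree_of_concurrency(deps))  # 2 (b and c run in parallel)
```

With unit-size tasks, the critical path length directly bounds the shortest possible parallel execution time, as described above.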



Task Interaction Graphs
• The graph of tasks (nodes) and their interactions/data exchanges (edges) is referred to as a task interaction graph.
• Note that task interaction graphs represent data dependencies, whereas task dependency graphs represent control dependencies.

Processes and Mapping
• In general, the number of tasks in a decomposition exceeds the number of processing elements available.
• For this reason, a parallel algorithm must also provide a mapping of tasks to processes.
• Appropriate mapping of tasks to processes is critical to the parallel performance of an algorithm.
• Mappings are determined by both the task dependency and task interaction graphs.
• Task dependency graphs can be used to ensure that work is equally spread across all processes at any point (minimum idling and optimal load balance).
• Task interaction graphs can be used to make sure that processes need minimum interaction with other processes (minimum communication).

An appropriate mapping must minimize parallel execution time by:
• Mapping independent tasks to different processes.
• Assigning tasks on the critical path to processes as soon as they become available.
• Minimizing interaction between processes by mapping tasks with dense interactions to the same process.

Note: These criteria often conflict with each other. For example, a decomposition into one task (or no decomposition at all) minimizes interaction but does not result in any speedup at all!

Decomposition Techniques
So how does one decompose a problem into various subtasks? While there is no single recipe that works for all problems, we present a set of commonly used techniques that apply to broad classes of problems. These include:
1. Recursive decomposition
2. Data decomposition
3. Exploratory decomposition
4. Speculative decomposition
5. Hybrid decomposition

Recursive Decomposition
• Generally suited to problems that are solved using the divide-and-conquer strategy.
• A given problem is first decomposed into a set of sub-problems.
• These sub-problems are recursively decomposed further until a desired granularity is reached.

Data Decomposition
• Identify the data on which computations are performed.
• Partition this data across various tasks.
• This partitioning induces a decomposition of the problem.
• Data can be partitioned in various ways; this critically impacts the performance of a parallel algorithm.

1. Output Data Decomposition
• Often, each element of the output can be computed independently of the others (simply as a function of the input).
• A partition of the output across tasks decomposes the problem naturally.
• Consider, for example, the problem of counting the frequency of itemsets in a database of transactions. If the database of transactions is replicated across the processes, each task can be accomplished independently with no communication.
• If the database is partitioned across processes as well (for reasons of memory utilization), each task first computes partial counts. These counts are then aggregated at the appropriate task.

2. Input Data Partitioning
• Generally applicable if each output can be naturally computed as a function of the input.


• In many cases, this is the only natural decomposition because the output is not clearly known a priori (e.g., the problem of finding the minimum in a list, sorting a given list, etc.).
• A task is associated with each input data partition. The task performs as much of the computation as possible with its part of the data. Subsequent processing combines these partial results.

3. Partitioning both Input and Output Data
• Often input and output data decomposition can be combined for a higher degree of concurrency. For the itemset counting example, the transaction set (input) and itemset counts (output) can both be decomposed.

4. Intermediate Data Partitioning
• Computation can often be viewed as a sequence of transformations from the input to the output data.
• In these cases, it is often beneficial to use one of the intermediate stages as a basis for decomposition.

The Owner Computes Rule
• The owner-computes rule generally states that the process assigned a particular data item is responsible for all computation associated with it.
• In the case of input data decomposition, the owner-computes rule implies that all computations that use an input data item are performed by the process that owns it.
• In the case of output data decomposition, the owner-computes rule implies that the output is computed by the process to which the output data is assigned.

Exploratory Decomposition
• In many cases, the decomposition of the problem goes hand-in-hand with its execution.
• These problems typically involve the exploration (search) of a state space of solutions.
• Problems in this class include a variety of discrete optimization problems (0/1 integer programming, QAP, etc.), theorem proving, game playing, etc.
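The partial-count aggregation described for the partitioned itemset-counting example can be sketched in a few lines. This is a toy serial simulation (the transaction data and function names are illustrative, not from the text): each "task" counts over its own partition of the database, and a final reduction combines the partial counts.

```python
# Output data decomposition with a partitioned transaction database:
# each task counts itemset frequencies over its own partition of the
# transactions, then the partial counts are aggregated (toy example).

def partial_counts(transactions, itemsets):
    """One task's local counts over its partition of the database."""
    counts = {s: 0 for s in itemsets}
    for t in transactions:
        for s in itemsets:
            if set(s) <= set(t):
                counts[s] += 1
    return counts

def aggregate(all_partials):
    """Combine partial counts from every task (the reduction step)."""
    total = {}
    for part in all_partials:
        for s, c in part.items():
            total[s] = total.get(s, 0) + c
    return total

# Database split across two tasks; itemsets whose frequency we want.
part0 = [("bread", "milk"), ("bread", "butter")]
part1 = [("milk",), ("bread", "milk", "butter")]
itemsets = [("bread", "milk"), ("milk",)]

totals = aggregate([partial_counts(part0, itemsets),
                    partial_counts(part1, itemsets)])
print(totals)  # {('bread', 'milk'): 2, ('milk',): 3}
```

The two calls to `partial_counts` are independent (no communication), which is exactly what makes the output/input decomposition attractive; only the final `aggregate` step requires interaction.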


Exploratory Decomposition: Anomalous Computations
• In many instances of exploratory decomposition, the decomposition technique may change the amount of work done by the parallel formulation.
• This change can result in super- or sub-linear speedups.

Speculative Decomposition
• In some applications, dependencies between tasks are not known a priori.
• For such applications, it is impossible to identify independent tasks.
• There are generally two approaches to dealing with such applications: conservative approaches, which identify independent tasks only when they are guaranteed not to have dependencies, and optimistic approaches, which schedule tasks even when they may potentially be erroneous.
• Conservative approaches may yield little concurrency, and optimistic approaches may require a rollback mechanism in the case of an error.

Hybrid Decompositions
Often, a mix of decomposition techniques is necessary for decomposing a problem. Consider the following examples:
• In quicksort, recursive decomposition alone limits concurrency (why?). A mix of data and recursive decompositions is more desirable.
• In discrete event simulation, there might be concurrency in task processing. A mix of speculative decomposition and data decomposition may work well.
• Even for simple problems like finding the minimum of a list of numbers, a mix of data and recursive decomposition works well.

Characteristics of Tasks
Once a problem has been decomposed into independent tasks, the characteristics of these tasks critically impact the choice and performance of parallel algorithms. Relevant task characteristics include:
• Task generation.
• Task sizes.
• Size of data associated with tasks.
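The minimum-of-a-list example of a hybrid decomposition can be sketched as a serial simulation (chunk counts and data are illustrative): data decomposition splits the list into chunks, one local minimum per task, and a recursive pairwise reduction then combines the partial results.

```python
# Hybrid decomposition for finding the minimum of a list: data
# decomposition produces one local minimum per chunk, then a
# recursive (tree) reduction combines the partial results.

def local_minima(data, ntasks):
    """Data decomposition: each task takes a contiguous chunk."""
    size = (len(data) + ntasks - 1) // ntasks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return [min(c) for c in chunks]

def tree_reduce(values):
    """Recursive decomposition: combine partial results pairwise."""
    if len(values) == 1:
        return values[0]
    paired = [min(values[i:i + 2]) for i in range(0, len(values), 2)]
    return tree_reduce(paired)

data = [7, 3, 9, 1, 4, 8, 2, 6]
print(tree_reduce(local_minima(data, 4)))  # 1
```

The chunking step supplies enough concurrency for all processes, while the tree reduction finishes in a logarithmic number of combining steps, which is why neither technique alone is as effective.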


Task Generation
• Static task generation: Concurrent tasks can be identified a priori. Typical matrix operations, graph algorithms, image processing applications, and other regularly structured problems fall in this class. These can typically be decomposed using data or recursive decomposition techniques.
• Dynamic task generation: Tasks are generated as we perform the computation. A classic example of this is in game playing: each 15-puzzle board is generated from the previous one. These applications are typically decomposed using exploratory or speculative decompositions.

Task Sizes
• Task sizes may be uniform (i.e., all tasks are the same size) or non-uniform.
• Non-uniform task sizes may be such that they can be determined (or estimated) a priori or not. Examples in the latter class include discrete optimization problems, in which it is difficult to estimate the effective size of a state space.

Size of Data Associated with Tasks
• The size of data associated with a task may be small or large when viewed in the context of the size of the task.
• A small context implies that an algorithm can easily communicate the task to other processes dynamically (e.g., the 15 puzzle).
• A large context ties the task to a process; alternately, an algorithm may attempt to reconstruct the context at another process instead of communicating the context of the task (e.g., 0/1 integer programming).

Characteristics of Task Interactions
Tasks may communicate with each other in various ways. The associated dichotomies are:
• Static interactions: The tasks and their interactions are known a priori. These are relatively simpler to code into programs.
• Dynamic interactions: The timing of interacting tasks cannot be determined a priori. These interactions are harder to code, especially, as we shall see, using message-passing APIs.
• Regular interactions: There is a definite pattern (in the graph sense) to the interactions. These patterns can be exploited for efficient implementation.
• Irregular interactions: Interactions lack well-defined topologies.
• Interactions may be read-only or read-write. In read-only interactions, tasks just read data items associated with other tasks. In read-write interactions, tasks read as well as modify data items associated with other tasks. In general, read-write interactions are harder to code, since they require additional synchronization primitives.
• Interactions may be one-way or two-way. A one-way interaction can be initiated and accomplished by one of the two interacting tasks. A two-way interaction requires participation from both tasks involved in the interaction. One-way interactions are somewhat harder to code in message-passing APIs.

Mapping Techniques
• Once a problem has been decomposed into concurrent tasks, these must be mapped to processes (that can be executed on a parallel platform).
• Mappings must minimize overheads.
• Primary overheads are communication and idling.
• Minimizing these overheads often represents contradictory objectives.


Assigning all work to one processor trivially minimizes communication at the expense of significant idling.

Mapping Techniques for Minimum Idling
• Mapping must simultaneously minimize idling and balance load.
• Merely balancing load does not minimize idling.
• Mapping techniques can be static or dynamic.
• Static Mapping: Tasks are mapped to processes a priori. For this to work, we must have a good estimate of the size of each task. Even in these cases, the problem may be NP-complete.
• Dynamic Mapping: Tasks are mapped to processes at runtime. This may be because the tasks are generated at runtime, or because their sizes are not known until then.
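Static mapping from a-priori size estimates can be sketched with a simple greedy heuristic (one common choice; the task sizes here are hypothetical). Since optimal mapping is NP-complete, a heuristic such as "largest task first, to the least-loaded process" is typically used:

```python
# Static mapping sketch: given a-priori task size estimates, a greedy
# longest-task-first heuristic assigns each task to the currently
# least-loaded process. (Optimal mapping itself is NP-complete, so
# this only approximates the best load balance.)

def static_map(task_sizes, nprocs):
    """Return per-process loads and the task -> process assignment."""
    loads = [0] * nprocs
    assignment = {}
    # Place big tasks first; each goes to the least-loaded process.
    for task, size in sorted(task_sizes.items(),
                             key=lambda kv: kv[1], reverse=True):
        p = loads.index(min(loads))
        assignment[task] = p
        loads[p] += size
    return loads, assignment

sizes = {"t0": 7, "t1": 5, "t2": 4, "t3": 3, "t4": 3}
loads, assignment = static_map(sizes, 2)
print(loads)  # [10, 12]
```

Note that the perfectly balanced split for these sizes would be 11/11 ({7, 4} and {5, 3, 3}); the greedy heuristic reaches 10/12, illustrating why good static mapping is hard even with exact size estimates.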


Other factors that determine the choice of mapping technique include the size of data associated with a task and the nature of the underlying domain.

Schemes for Static Mapping
• Mappings based on data partitioning.
• Mappings based on task graph partitioning.
• Hybrid mappings.

Mappings Based on Data Partitioning
We can combine data partitioning with the "owner-computes" rule to partition the computation into subtasks. The simplest data decomposition schemes for dense matrices are 1-D block distribution schemes.

Block Array Distribution Schemes
Block distribution schemes can be generalized to higher dimensions as well.

Cyclic and Block Cyclic Distributions
• If the amount of computation associated with data items varies, a block decomposition may lead to significant load imbalances.
• A simple example of this is the LU decomposition (or Gaussian elimination) of dense matrices.


LU Factorization of a Dense Matrix
A decomposition of LU factorization into 14 tasks. With a plain block distribution, notice the significant load imbalance.

Block-Cyclic Distributions
• A variation of the block distribution scheme that can be used to alleviate the load-imbalance and idling problems.
• Partition an array into many more blocks than the number of available processes.
• Blocks are assigned to processes in a round-robin manner so that each process gets several non-adjacent blocks.
• A cyclic distribution is a special case in which the block size is one.
• A block distribution is a special case in which the block size is n/p, where n is the dimension of the matrix and p is the number of processes.

Block-Cyclic Distribution: Examples
One- and two-dimensional block-cyclic distributions among 4 processes.
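The three 1-D distributions above differ only in how an array index is mapped to its owning process. A minimal sketch (array length, process count, and block size are illustrative) makes the special cases explicit:

```python
# Owner of array index i under the three 1-D distributions discussed
# above, for an n-element array on p processes with block size b.

def block_owner(i, n, p):
    return i // (n // p)            # assumes p divides n

def cyclic_owner(i, p):
    return i % p                    # block-cyclic with block size 1

def block_cyclic_owner(i, b, p):
    return (i // b) % p             # blocks dealt out round-robin

n, p, b = 16, 4, 2
print([block_owner(i, n, p) for i in range(n)])
# [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
print([block_cyclic_owner(i, b, p) for i in range(n)])
# [0, 0, 1, 1, 2, 2, 3, 3, 0, 0, 1, 1, 2, 2, 3, 3]
```

With block size b = n/p the block-cyclic mapping collapses to the block mapping, and with b = 1 it collapses to the cyclic mapping, matching the two special cases stated above.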

Graph Partitioning Based Data Decomposition
• In the case of sparse matrices, block decompositions are more complex. Consider the problem of multiplying a sparse matrix with a vector.
• The graph of the matrix is a useful indicator of the work (number of nodes) and communication (the degree of each node).
• For this reason, we would like to partition the graph so as to assign an equal number of nodes to each process, while minimizing the edge count of the graph partition.

Mappings Based on Task Partitioning
• Partitioning a given task-dependency graph across processes.
• Determining an optimal mapping for a general task-dependency graph is an NP-complete problem.
• Excellent heuristics exist for structured graphs.
• For example, the task mapping of the binary tree (quicksort) cannot use a large number of processors.

Hierarchical Mappings
• Sometimes a single mapping technique is inadequate.
• In this case, task mapping can be used at the top level and data partitioning within each level.

Schemes for Dynamic Mapping
• Dynamic mapping is sometimes also referred to as dynamic load balancing, since load balancing is the primary motivation for dynamic mapping.
• Dynamic mapping schemes can be centralized or distributed.

Centralized Dynamic Mapping
• Processes are designated as masters or slaves.
• When a process runs out of work, it requests the master for more work.
• When the number of processes increases, the master may become the bottleneck.
• To alleviate this, a process may pick up a number of tasks (a chunk) at one time. This is called chunk scheduling.
• Selecting large chunk sizes may lead to significant load imbalances as well.
• A number of schemes have been used to gradually decrease chunk size as the computation progresses.

Distributed Dynamic Mapping
• Each process can send work to or receive work from other processes.
• This alleviates the bottleneck in centralized schemes.
• There are four critical questions: how are sending and receiving processes paired together, who initiates the work transfer, how much work is transferred, and when is a transfer triggered?
• Answers to these questions are generally application-specific.
• We will look at some of these techniques later in this class.

Minimizing Interaction Overheads
• Maximize data locality: Where possible, reuse intermediate data. Restructure the computation so that data can be reused in smaller time windows.
• Minimize volume of data exchange: There is a cost associated with each word that is communicated. For this reason, we must minimize the volume of data communicated.
• Minimize frequency of interactions: There is a startup cost associated with each interaction. Therefore, where possible, try to merge multiple interactions into one.
• Minimize contention and hot-spots: Use decentralized techniques; replicate data where necessary.
• Overlap computations with interactions: Use non-blocking communications, multithreading, and prefetching to hide latencies.
• Replicate data or computations.
• Use group communications instead of point-to-point primitives.
• Overlap interactions with other interactions.

Parallel Algorithm Models
An algorithm model is a way of structuring a parallel algorithm by selecting a decomposition and mapping technique and applying the appropriate strategy to minimize interactions.
• Data Parallel Model: Tasks are statically (or semi-statically) mapped to processes and each task performs similar operations on different data.
• Task Graph Model: Starting from a task dependency graph, the interrelationships among the tasks are utilized to promote locality or to reduce interaction costs.
• Master-Slave Model: One or more processes generate work and allocate it to worker processes. This allocation may be static or dynamic.
• Pipeline / Producer-Consumer Model: A stream of data is passed through a succession of processes, each of which performs some task on it.
• Hybrid Models: A hybrid model may be composed either of multiple models applied hierarchically or of multiple models applied sequentially to different phases of a parallel algorithm.
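The idea of gradually decreasing chunk sizes mentioned for chunk scheduling can be sketched as follows. This is one commonly used scheme (in the style of guided self-scheduling, where each request receives roughly the remaining work divided by the number of processes); the numbers are illustrative:

```python
# Chunk scheduling with gradually decreasing chunk sizes: each work
# request is handed remaining/p tasks, so early chunks are large
# (low scheduling overhead) and late chunks are small (good balance).

def chunk_sizes(total_tasks, p):
    sizes = []
    remaining = total_tasks
    while remaining > 0:
        chunk = max(1, remaining // p)  # never hand out an empty chunk
        sizes.append(chunk)
        remaining -= chunk
    return sizes

sizes = chunk_sizes(100, 4)
print(sizes[:5])   # [25, 18, 14, 10, 8]
print(sum(sizes))  # 100
```

The sequence is non-increasing: the first chunks amortize the master's per-request overhead, while the shrinking tail chunks limit the imbalance that a single large final chunk would cause.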

CHAPTER 4: Basic Communication Operations

Basic Communication Operations: Introduction
• Many interactions in practical parallel programs occur in well-defined patterns involving groups of processors.
• Efficient implementations of these operations can improve performance, reduce development effort and cost, and improve software quality.
• Efficient implementations must leverage the underlying architecture. For this reason, we refer to specific architectures here.
• We select a descriptive set of architectures to illustrate the process of algorithm design.
• Group communication operations are built using point-to-point messaging primitives.
• Recall from our discussion of architectures that communicating a message of size m over an uncongested network takes time ts + twm.
• We use this as the basis for our analyses. Where necessary, we take congestion into account explicitly by scaling the tw term.
• We assume that the network is bidirectional and that communication is single-ported.

One-to-All Broadcast and All-to-One Reduction
• One processor has a piece of data (of size m) it needs to send to everyone: one-to-all broadcast and all-to-one reduction among processors.
• The dual of one-to-all broadcast is all-to-one reduction.
• In all-to-one reduction, each processor has m units of data. These data items must be combined piece-wise (using some associative operator, such as addition or min), and the result made available at a target processor.

One-to-All Broadcast and All-to-One Reduction on Rings
• The simplest way is to send p-1 messages from the source to the other p-1 processors; this is not very efficient.
• Use recursive doubling: the source sends a message to a selected processor. We now have two independent problems defined over halves of the machine.
• Reduction can be performed in an identical fashion by inverting the process.
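The recursive doubling scheme can be checked with a short simulation (a sketch, assuming p is a power of two and nodes are paired along one label bit per step): in step k, every node that already holds the message forwards it to the node whose label differs in bit k.

```python
# Recursive-doubling one-to-all broadcast simulated on p = 2^d nodes:
# each step doubles the set of nodes holding the message, so the
# broadcast completes in log2(p) steps.

def broadcast_steps(p, source=0):
    has = {source}                  # nodes that hold the message
    steps = 0
    d = p.bit_length() - 1          # p assumed to be a power of two
    for k in range(d):
        # every holder sends to its partner across dimension k
        has |= {node ^ (1 << k) for node in has}
        steps += 1
    assert has == set(range(p))     # everyone received the message
    return steps

print(broadcast_steps(8))   # 3
print(broadcast_steps(16))  # 4
```

Each step is a single point-to-point transfer per sender, which is what yields the (ts + twm) log p cost derived later in this chapter.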

One-to-All Broadcast
One-to-all broadcast on an eight-node ring. Node 0 is the source of the broadcast. Each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination. The number on an arrow indicates the time step during which the message is transferred.

All-to-One Reduction
Reduction on an eight-node ring with node 0 as the destination of the reduction.

Broadcast and Reduction: Example
Consider the problem of multiplying a matrix with a vector.
• The n x n matrix is assigned to an n x n (virtual) processor grid. The vector is assumed to be on the first row of processors.
• The first step of the product requires a one-to-all broadcast of the vector element along the corresponding column of processors. This can be done concurrently for all n columns.
• The processors compute the local product of the vector element and the local matrix entry.
• In the final step, the results of these products are accumulated to the first row using n concurrent all-to-one reduction operations along the columns (using the sum operation).

Broadcast and Reduction: Matrix-Vector Multiplication Example
One-to-all broadcast and all-to-one reduction in the multiplication of a 4 x 4 matrix with a 4 x 1 vector.

Broadcast and Reduction on a Mesh
• We can view each row and column of a square mesh of p nodes as a linear array of √p nodes.
• Broadcast and reduction operations can be performed in two steps: the first step does the operation along a row, and the second step does it along each column concurrently.
• This process generalizes to higher dimensions as well.

Broadcast and Reduction on a Mesh: Example
One-to-all broadcast on a 16-node mesh.

Broadcast and Reduction on a Hypercube
• A hypercube with 2^d nodes can be regarded as a d-dimensional mesh with two nodes in each dimension.
• The mesh algorithm can be generalized to a hypercube, and the operation is carried out in d (= log p) steps.

Broadcast and Reduction on a Hypercube: Example
One-to-all broadcast on a three-dimensional hypercube. The binary representations of node labels are shown in parentheses.

Broadcast and Reduction on a Balanced Binary Tree
• Consider a binary tree in which processors are (logically) at the leaves and internal nodes are routing nodes.
• Assume that the source processor is the root of this tree. In the first step, the source sends the data to the right child (assuming the source is also the left child). The problem has now been decomposed into two problems with half the number of processors.

One-to-all broadcast on an eight-node tree.

Broadcast and Reduction Algorithms
• All of the algorithms described above are adaptations of the same algorithmic template.
• We illustrate the algorithm for a hypercube, but the algorithm, as has been seen, can be adapted to other architectures.
• X is the message to be broadcast, which initially resides at the source node 0.

One-to-all broadcast of a message X from the source on a hypercube. The hypercube has 2^d nodes and my_id is the label of a node.

Single-node accumulation on a d-dimensional hypercube. Each node contributes a message X containing m words, and node 0 is the destination.

Cost Analysis
• The broadcast or reduction procedure involves log p point-to-point simple message transfers, each at a time cost of ts + twm.
• The total time is therefore given by: T = (ts + twm) log p.

All-to-All Broadcast and Reduction
• Generalization of broadcast in which each processor is the source as well as the destination.
• A process sends the same m-word message to every other process, but different processes may broadcast different messages.

All-to-all broadcast and all-to-all reduction.

All-to-All Broadcast and Reduction on a Ring
• Simplest approach: perform p one-to-all broadcasts. This is not the most efficient way, though.
• Each node first sends to one of its neighbors the data it needs to broadcast.
• In subsequent steps, it forwards the data received from one of its neighbors to its other neighbor.
• The algorithm terminates in p-1 steps.

All-to-all broadcast on an eight-node ring.

All-to-all broadcast on a p-node ring.
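The ring algorithm above can be simulated directly (a sketch; node and message labels are illustrative): each node repeatedly forwards to its right neighbor the message it received in the previous step, starting with its own.

```python
# All-to-all broadcast on a p-node ring, simulated: in each of the
# p-1 steps every node passes along the message it received in the
# previous step (initially its own), so messages circulate the ring.

def all_to_all_ring(p):
    received = [{i} for i in range(p)]   # messages known at each node
    in_flight = list(range(p))           # message each node sends next
    for _ in range(p - 1):
        # node i receives what its left neighbour (i-1) is sending
        incoming = [in_flight[(i - 1) % p] for i in range(p)]
        for i in range(p):
            received[i].add(incoming[i])
        in_flight = incoming             # forward what was received
    return received

result = all_to_all_ring(5)
print(all(r == set(range(5)) for r in result))  # True
```

After exactly p-1 steps every node has collected all p messages, matching the (ts + twm)(p-1) cost stated in the analysis below, since every step moves one m-word message per link.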

All-to-all Broadcast on a Mesh
• Performed in two phases. In the first phase, each row of the mesh performs an all-to-all broadcast using the procedure for the linear array.
• In this phase, all nodes collect √p messages corresponding to the √p nodes of their respective rows. Each node consolidates this information into a single message of size m√p.
• The second communication phase is a columnwise all-to-all broadcast of the consolidated messages.
• By the end of the second phase, all nodes have received a message from every node in the mesh.

All-to-all broadcast on a 3 x 3 mesh. The groups of nodes communicating with each other in each phase are enclosed by dotted boundaries.

All-to-all Broadcast on a Hypercube
• Generalization of the mesh algorithm to log p dimensions.
• Message size doubles at each of the log p steps.

All-to-all broadcast on an eight-node hypercube.

All-to-all broadcast on a d-dimensional hypercube.

All-to-all Reduction
• Similar communication pattern to all-to-all broadcast, except in the reverse order.
• On receiving a message, a node must combine it with the local copy of the message that has the same destination as the received message, before forwarding the combined message to the next neighbor.

Cost Analysis
• On a ring, the time is given by: (ts + twm)(p-1).
• On a mesh, the time is given by: 2ts(√p – 1) + twm(p-1).
• On a hypercube, message size doubles each step, so we have: T = Σ (ts + 2^(i-1)twm) over the log p steps, i.e., T = ts log p + twm(p-1).

All-to-all broadcast: Notes
• All of the algorithms presented above are asymptotically optimal in message size.
• It is not possible to port algorithms for higher-dimensional networks (such as a hypercube) onto a ring, because this would cause contention.

Contention for a channel when the hypercube is mapped onto a ring.

All-Reduce and Prefix-Sum Operations
• In all-reduce, each node starts with a buffer of size m, and the final results of the operation are identical buffers of size m on each node, formed by combining the original p buffers using an associative operator.
• Identical to all-to-one reduction followed by a one-to-all broadcast. This formulation is not the most efficient; use the pattern of all-to-all broadcast instead. The only difference is that the message size does not increase here. Time for this operation is (ts + twm) log p.
• Different from all-to-all reduction, in which p simultaneous all-to-one reductions take place, each with a different destination for the result.

The Prefix-Sum Operation
• Given p numbers n0, n1, …, np-1 (one on each node), the problem is to compute the sums s_k = Σ_{i=0..k} n_i for all k between 0 and p-1.
• Initially, n_k resides on the node labeled k, and at the end of the procedure, the same node holds s_k.
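The hypercube prefix-sum algorithm described below (separate result and outgoing-message buffers, with incoming data folded into the result only when it comes from a smaller label) can be simulated serially. This is a sketch assuming p is a power of two:

```python
# Prefix sums on p = 2^d nodes, simulated: every node keeps a result
# buffer (sum over labels <= its own seen so far) and a message buffer
# (sum over its whole current subcube), exchanged one dimension per step.

def hypercube_prefix_sums(values):
    p = len(values)                  # assumed to be a power of two
    d = p.bit_length() - 1
    result = list(values)            # local prefix sums
    msg = list(values)               # outgoing message buffers
    for k in range(d):
        for i in range(p):
            partner = i ^ (1 << k)
            if partner < i:                  # message from smaller label:
                result[i] += msg[partner]    # fold its subcube sum in
        # both partners now hold the sum over the merged subcube
        msg = [msg[i] + msg[i ^ (1 << k)] for i in range(p)]
    return result

print(hypercube_prefix_sums([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

The message buffer always carries the full subcube sum (as in all-to-all style exchanges), while the result buffer only ever absorbs contributions from smaller labels, which is exactly the correction noted above for prefix sums.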


Scatter and Gather
• In the scatter operation, a single node sends a unique message of size m to every other node (also called a one-to-all personalized communication).
• In the gather operation, a single node collects a unique message from each node.
• While the scatter operation is fundamentally different from broadcast, the algorithmic structure is similar, except for differences in message sizes (messages get smaller in scatter and stay constant in broadcast).
• The gather operation is exactly the inverse of the scatter operation and can be executed as such.

Gather and Scatter Operations
Scatter and gather operations.

Example of the Scatter Operation
The scatter operation on an eight-node hypercube.

Cost of Scatter and Gather
• There are log p steps; in each step, the machine size halves and the data size halves.
• We have the time for this operation to be: T = ts log p + twm(p – 1).
• This time holds for a linear array as well as a 2-D mesh.
• These times are asymptotically optimal in message size.

All-to-All Personalized Communication
• Each node has a distinct message of size m for every other node.
• This is unlike all-to-all broadcast, in which each node sends the same message to all other nodes.
• All-to-all personalized communication is also known as total exchange.
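The halving structure of the hypercube scatter can be simulated as a sketch (assuming p is a power of two; message contents are illustrative): the source starts with all p messages and, one dimension at a time, hands its partner the half destined for the other subcube.

```python
# Scatter on a p = 2^d node hypercube, simulated: in each of the
# log2(p) steps, every node holding data passes to its partner the
# messages whose destinations lie in the partner's half of the subcube.

def hypercube_scatter(p, messages):
    holding = {0: dict(messages)}     # node -> messages it still holds
    d = p.bit_length() - 1
    for k in reversed(range(d)):      # split along the highest dim first
        for node in list(holding):
            partner = node ^ (1 << k)
            if partner not in holding:
                # hand over every message whose destination differs
                # from this node in bit k (the partner's half)
                give = {dst: m for dst, m in holding[node].items()
                        if (dst >> k) & 1 != (node >> k) & 1}
                for dst in give:
                    del holding[node][dst]
                holding[partner] = give
    return holding

msgs = {i: f"m{i}" for i in range(8)}
out = hypercube_scatter(8, msgs)
print(sorted((n, list(h)) for n, h in out.items()))
```

After log p steps every node holds exactly its own message; since the data volume halves each step, summing the per-step transfer sizes gives the ts log p + twm(p – 1) cost stated above.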

All-to-All Personalized Communication on a Ring
• Each node sends all pieces of data as one consolidated message of size m(p – 1) to one of its neighbors.
• Each node extracts the information meant for it from the data received, and forwards the remaining (p – 2) pieces of size m each to the next node.
• The size of the message reduces by m at each step.
• The algorithm terminates in p – 1 steps.

All-to-all personalized communication on a six-node ring. The label of each message is of the form {x,y}, where x is the label of the node that originally owned the message, and y is the label of the node that is the final destination of the message. A label ({x1,y1}, {x2,y2}, …, {xn,yn}) indicates a message formed by concatenating n individual messages.

All-to-All Personalized Communication on a Ring: Cost
• We have p – 1 steps in all.
• In step i, the message size is m(p – i).
• The total time is given by: T = (ts + twmp/2)(p – 1).
• The tw term in this equation can be reduced by a factor of 2 by communicating messages in both directions.

All-to-All Personalized Communication on a Mesh
• Each node first groups its p messages according to the columns of their destination nodes.
• All-to-all personalized communication is performed independently in each row with clustered messages of size m√p.
• Messages in each node are then sorted again, this time according to the rows of their destination nodes.
• All-to-all personalized communication is performed independently in each column with clustered messages of size m√p.

The distribution of messages at the beginning of each phase of all-to-all personalized communication on a 3 x 3 mesh. At the end of the second phase, node i has messages ({0,i}, …, {8,i}), where 0 ≤ i ≤ 8. The groups of nodes communicating together in each phase are enclosed in dotted boundaries.

All-to-All Personalized Communication on a Mesh: Cost
• Time for the first phase is identical to that in a ring with √p processors, i.e., (ts + twmp/2)(√p – 1).
• Time in the second phase is identical to the first phase. Therefore, the total time is twice this time.
• It can be shown that the time for the local rearrangement of messages is much less than this communication time.

All-to-All Personalized Communication on a Hypercube
• Generalize the mesh algorithm to log p steps.
• At any stage in all-to-all personalized communication, every node holds p packets of size m each.

• While communicating in a particular dimension, every node sends p/2 of these packets (consolidated as one message).
• A node must rearrange its messages locally before each of the log p communication steps.
An all-to-all personalized communication algorithm on a three-dimensional hypercube.

All-to-All Personalized Communication on a Hypercube: Cost
• We have log p iterations and mp/2 words are communicated in each iteration. Therefore, the cost is:
T = (ts + tw mp/2) log p
• This is not optimal!

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm
• Each node simply performs p – 1 communication steps, exchanging m words of data with a different node in every step.
• A node must choose its communication partner in each step so that the hypercube links do not suffer congestion.
• In the jth communication step, node i exchanges data with node (i XOR j).
• In this schedule, all paths in every communication step are congestion-free, and none of the bidirectional links carry more than one message in the same direction.
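The XOR pairing rule can be verified mechanically. This small Python check (illustrative, not from the notes) confirms that in every step the pairing i ↔ (i XOR j) is a perfect matching, and that over the p – 1 steps each node exchanges data with every other node exactly once.

```python
def xor_schedule(p):
    """Partner schedule for the optimal hypercube algorithm: in step j,
    node i exchanges m words with node i XOR j (p must be a power of 2)."""
    assert p & (p - 1) == 0, "p must be a power of 2"
    return [[i ^ j for i in range(p)] for j in range(1, p)]

p = 8
steps = xor_schedule(p)
for partners in steps:
    # a perfect matching: symmetric, no self-pairs, everyone paired
    assert sorted(partners) == list(range(p))
    assert all(partners[partners[i]] == i and partners[i] != i for i in range(p))
for i in range(p):
    # across the p - 1 steps, node i meets every other node exactly once
    assert sorted(s[i] for s in steps) == sorted(set(range(p)) - {i})
```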

Seven steps in all-to-all personalized communication on an eight-node hypercube.

All-to-All Personalized Communication on a Hypercube: Optimal Algorithm
A procedure to perform all-to-all personalized communication on a d-dimensional hypercube. The message Mi,j initially resides on node i and is destined for node j.

All-to-All Personalized Communication on a Hypercube: Cost Analysis of Optimal Algorithm
• There are p – 1 steps and each step involves a non-congesting message transfer of m words. We have:
T = (ts + tw m)(p – 1)
• This is asymptotically optimal in message size.

Circular Shift
• A special permutation in which node i sends a data packet to node (i + q) mod p in a p-node ensemble (0 < q < p).

Circular Shift on a Mesh
• The implementation on a ring is rather intuitive. It can be performed in min{q, p – q} neighbor communications.
• Mesh algorithms follow from this as well. We shift in one direction (all processors) followed by the next direction.
• The associated time has an upper bound of:
T = (ts + tw m)(√p + 1)

The communication steps in a circular 5-shift on a 4 x 4 mesh.

Circular Shift on a Hypercube
• Map a linear array with 2^d nodes onto a d-dimensional hypercube.
• To perform a q-shift, we expand q as a sum of distinct powers of 2.
• If q is the sum of s distinct powers of 2, then the circular q-shift on a hypercube is performed in s phases.
• The time for this is upper bounded by:
T = (ts + tw m)(2 log p – 1)
• If E-cube routing is used, this time can be reduced to:
T = ts + tw m
The mapping of an eight-node linear array onto a three-dimensional hypercube to perform a circular 5-shift as a combination of a 4-shift and a 1-shift.
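The decomposition of a q-shift into shifts by distinct powers of 2 can be sketched directly. The helpers below are hypothetical (not from the notes); they show that, e.g., a 5-shift is a 4-shift followed by a 1-shift, and that applying the phases one by one gives the same result as a single q-shift.

```python
def power_of_two_shifts(q):
    """Expand q as a sum of distinct powers of 2; a circular q-shift is then
    performed as one phase per power (s = popcount(q) phases)."""
    shifts, bit = [], 1
    while bit <= q:
        if q & bit:
            shifts.append(bit)
        bit <<= 1
    return shifts

def circular_shift(data, q):
    """Apply a circular q-shift (node i sends its data to node (i + q) mod p),
    performed phase by phase as on the hypercube."""
    p = len(data)
    for s in power_of_two_shifts(q):
        data = [data[(i - s) % p] for i in range(p)]  # each element moves forward by s
    return data
```

For q = 5 on 8 nodes this yields the phases [1, 4], i.e., s = 2 phases, matching the 4-shift + 1-shift combination in the figure.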

Circular q-shifts on an 8-node hypercube for 1 ≤ q < 8.

Improving the Performance of Operations
• Splitting and routing messages into parts: If the message can be split into p parts, a one-to-all broadcast can be implemented as a scatter operation followed by an all-to-all broadcast operation. The time for this is approximately:
T ≈ 2 (ts log p + tw m)
• Similarly, all-to-one reduction can be performed by performing an all-to-all reduction (the dual of all-to-all broadcast) followed by a gather operation (the dual of scatter).
• Since an all-reduce operation is semantically equivalent to an all-to-one reduction followed by a one-to-all broadcast, the asymptotically optimal algorithms for these two operations can be used to construct a similar algorithm for the all-reduce operation.
• The intervening gather and scatter operations cancel each other. Therefore, an all-reduce operation requires an all-to-all reduction and an all-to-all broadcast.
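The split-and-route idea can be illustrated with a toy model (not from the notes; names are made up): the source's m-word message is scattered into p pieces, and an all-gather then leaves every node holding the full message, which is exactly the effect of a one-to-all broadcast.

```python
def broadcast_via_scatter_allgather(message, p):
    """Model a one-to-all broadcast of an m-word message done as a scatter of
    p parts followed by an all-to-all broadcast (all-gather) of the parts.
    Returns the message as seen by each of the p nodes."""
    assert len(message) % p == 0, "assume m is divisible by p for simplicity"
    part = len(message) // p
    # scatter: node i receives the i-th piece of the message
    pieces = [message[i * part:(i + 1) * part] for i in range(p)]
    # all-to-all broadcast: every node assembles all pieces, in order
    assembled = [v for piece in pieces for v in piece]
    return [list(assembled) for _ in range(p)]
```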

CHAPTER-5 ANALYTICAL MODELING OF PARALLEL SYSTEMS

Analytical Modeling - Basics
• A sequential algorithm is evaluated by its runtime (in general, asymptotic runtime as a function of input size). The asymptotic runtime of a sequential program is identical on any serial platform.
• The parallel runtime of a program depends on the input size, the number of processors, and the communication parameters of the machine. An algorithm must therefore be analyzed in the context of the underlying platform.
• A parallel system is a combination of a parallel algorithm and an underlying platform.
• A number of performance measures are intuitive. Raw FLOP count - what good are FLOP counts when they do not solve a problem? How much faster is the parallel version? This begs the obvious follow-up question - what is the baseline serial version with which we compare? Can we use a suboptimal serial program to make our parallel program look better? Wall clock time - the time from the start of the first processor to the stopping time of the last processor in a parallel ensemble. But how does this scale when the number of processors is changed or the program is ported to another machine altogether?

Sources of Overhead in Parallel Programs
• Interprocess interactions: Processors working on any non-trivial parallel problem will need to talk to each other.
• Idling: Processes may idle because of load imbalance, synchronization, or serial components.
• Excess computation: This is computation not performed by the serial version. This might be because the serial algorithm is difficult to parallelize, or that some computations are repeated across processors to minimize communication.

Performance Metrics for Parallel Systems:
1) Execution Time:
• The serial runtime of a program is the time elapsed between the beginning and the end of its execution on a sequential computer.
• The parallel runtime is the time that elapses from the moment the first processor starts to the moment the last processor finishes execution.
• We denote the serial runtime by TS and the parallel runtime by TP.
2) Total Parallel Overhead:
• Let Tall be the total time collectively spent by all the processing elements: Tall = p TP (p is the number of processors). TS is the serial time.
• Observe that Tall - TS is then the total time spent by all processors combined in non-useful work. This is called the total overhead.
• The overhead function (To) is therefore given by To = p TP - TS (1)
3) Speedup:
• Speedup (S) is the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with p identical processing elements.
• For a given problem, there might be many serial algorithms available. These algorithms may have different asymptotic runtimes and may be parallelizable to different degrees.

• For the purpose of computing speedup, we always consider the best sequential program as the baseline.

Speedup Bounds
• Speedup can be as low as 0 (the parallel program never terminates).
• Speedup, in theory, should be upper bounded by p - after all, we can only expect a p-fold speedup if we use p times as many resources.
• A speedup greater than p is possible only if each processing element spends less than time TS / p solving the problem. In this case, a single processor could be timesliced to achieve a faster serial program, which contradicts our assumption of the fastest serial program as the basis for speedup.

Superlinear Speedups
• One reason for superlinearity is that the parallel version does less work than the corresponding serial algorithm.
• Example: Searching an unstructured tree for a node with a given label, `S', on two processing elements using depth-first traversal. The two-processor version with processor 0 searching the left subtree and processor 1 searching the right subtree expands only the shaded nodes before the solution is found. The corresponding serial formulation expands the entire tree. It is clear that the serial algorithm does more work than the parallel algorithm.
• Resource-based superlinearity: The higher aggregate cache/memory bandwidth can result in better cache-hit ratios, and therefore superlinearity.
• Example: A processor with 64KB of cache yields an 80% hit ratio. If two processors are used, the hit ratio goes up to 90%, since the problem size per processor is smaller. Of the remaining 10% of accesses, 8% come from local memory and 2% from remote memory. If DRAM access time is 100 ns, cache access time is 2 ns, and remote memory access time is 400 ns, this corresponds to a speedup of 2.43!

4) Efficiency:
• Efficiency is a measure of the fraction of time for which a processing element is usefully employed.
• Mathematically, it is given by E = S / p (2)
• Following the bounds on speedup, efficiency can be as low as 0 and as high as 1.

5) Cost:
• Cost is the product of parallel runtime and the number of processing elements used (p x TP).
• Cost reflects the sum of the time that each processing element spends solving the problem.
• Cost is sometimes referred to as work or processor-time product.
• A parallel system is said to be cost-optimal if the cost of solving a problem on a parallel computer is asymptotically identical to the serial cost.
• Since E = TS / (p TP), for cost-optimal systems E = O(1).

Effect of Granularity on Performance:
• Often, using fewer processors improves performance of parallel systems.
• Using fewer than the maximum possible number of processing elements to execute a parallel algorithm is called scaling down a parallel system.
• A naive way of scaling down is to think of each processor in the original case as a virtual processor and to assign virtual processors equally to scaled-down processors.
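The 2.43 figure in the cache example can be reproduced with a short calculation. This sketch models average per-access latency and assumes the two processors work concurrently (so the aggregate access rate doubles); the helper name is illustrative.

```python
def mean_access_time(hit_ratio, cache_ns, miss_profile):
    """Average memory access time: hit_ratio of accesses served by cache,
    the rest split across (fraction, latency_ns) pairs in miss_profile."""
    return hit_ratio * cache_ns + sum(f * t for f, t in miss_profile)

# serial: 80% cache hits at 2 ns, 20% DRAM accesses at 100 ns
t_serial = mean_access_time(0.80, 2, [(0.20, 100)])
# two processors: 90% hits; of the remaining 10%, 8% local DRAM, 2% remote (400 ns)
t_parallel = mean_access_time(0.90, 2, [(0.08, 100), (0.02, 400)])
# both processors compute concurrently, so the effective rate doubles
speedup = 2 * t_serial / t_parallel
```

This gives t_serial = 21.6 ns, t_parallel = 17.8 ns, and speedup = 2 x 21.6 / 17.8 ≈ 2.43, as claimed.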

• Since the number of processing elements decreases by a factor of n / p, the computation at each processing element increases by a factor of n / p.
• The communication cost should not increase by this factor since some of the virtual processors assigned to a physical processor might talk to each other. This is the basic reason for the improvement from building granularity.
• Example: Consider the problem of adding n numbers on p processing elements such that p < n and both n and p are powers of 2. Use the parallel algorithm for n processors, except, in this case, we think of them as virtual processors. Each of the p processors is now assigned n / p virtual processors.
• The first log p of the log n steps of the original algorithm are simulated in (n / p) log p steps on p processing elements. Subsequent log n - log p steps do not require any communication.
• The overall parallel execution time of this parallel system is Θ((n / p) log p). The cost is Θ(n log p), which is asymptotically higher than the Θ(n) cost of adding n numbers sequentially. Therefore, the parallel system is not cost-optimal.

Isoefficiency Metric of Scalability
• For a given problem size (i.e., the value of TS remains constant), as we increase the number of processing elements, To increases.
• The overall efficiency of the parallel program goes down. This is the case for all parallel programs.
• For some systems, the efficiency of a parallel system increases if the problem size is increased while keeping the number of processing elements constant.
• A scalable parallel system can always be made cost-optimal if the number of processing elements and the size of the computation are chosen appropriately.
• Scalability and cost-optimality are therefore related.
Scalability of Parallel Systems
• The efficiency of a parallel program can be written as:
E = S / p = TS / (p TP)
Using the expression for the parallel overhead (To = p TP - TS), we get:
E = 1 / (1 + To / TS)
• The total overhead function To is an increasing function of p.
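Both forms of the efficiency expression can be checked numerically for the running example of adding n numbers, where TS = n and TP = n/p + 2 log p under a unit-cost model (the helper names below are illustrative, not from the notes).

```python
from math import log2

def tp_add(n, p):
    """Parallel time for adding n numbers on p processors: n/p + 2 log p."""
    return n / p + 2 * log2(p)

def efficiency(n, p):
    ts = n
    tp = tp_add(n, p)
    to = p * tp - ts                    # total overhead: To = p*TP - TS
    e_direct = ts / (p * tp)            # E = TS / (p*TP)
    e_overhead = 1 / (1 + to / ts)      # E = 1 / (1 + To/TS)
    assert abs(e_direct - e_overhead) < 1e-12   # the two forms agree
    return e_direct
```

For fixed n, efficiency falls as p grows, which is the behavior described above.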

Variation of efficiency: (a) as the number of processing elements is increased for a given problem size, and (b) as the problem size is increased for a given number of processing elements. The phenomenon illustrated in graph (b) is not common to all parallel systems.
• What is the rate at which the problem size must increase with respect to the number of processing elements to keep the efficiency fixed? This rate determines the scalability of the system. The slower this rate, the better.
• Before we formalize this rate, we define the problem size W as the asymptotic number of operations associated with the best serial algorithm to solve the problem.
• Isoefficiency function: The problem size W can usually be obtained as a function of p by algebraic manipulations to keep the efficiency constant. This function is called the isoefficiency function.
• This function determines the ease with which a parallel system can maintain a constant efficiency and hence achieve speedups increasing in proportion to the number of processing elements.

Cost-Optimality and the Isoefficiency Function
• A parallel system is cost-optimal if and only if p TP = Θ(W).
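For the adding-n-numbers system, To = 2p log p, so growing the problem as W = K · 2p log p should hold efficiency exactly at K/(K + 1); this is what its isoefficiency function Θ(p log p) promises. A small numeric check (illustrative sketch, not from the notes):

```python
from math import log2

def efficiency(n, p):
    """E = TS / (p*TP) for adding n numbers, with TP = n/p + 2 log p."""
    tp = n / p + 2 * log2(p)
    return n / (p * tp)

K = 3   # K = E / (1 - E), i.e., target efficiency E = K/(K+1) = 0.75
for p in (8, 64, 512, 4096):
    n = K * 2 * p * log2(p)             # scale W = n as Theta(p log p)
    assert abs(efficiency(n, p) - K / (K + 1)) < 1e-12
```

The efficiency stays pinned at 0.75 for every p once the problem grows at the isoefficiency rate.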

Lower Bound on the Isoefficiency Function
• For a problem consisting of W units of work, no more than W processing elements can be used cost-optimally.
• The problem size must increase at least as fast as Θ(p) to maintain fixed efficiency; hence, Ω(p) is the asymptotic lower bound on the isoefficiency function.

Degree of Concurrency and the Isoefficiency Function
• The maximum number of tasks that can be executed simultaneously at any time in a parallel algorithm is called its degree of concurrency.
• If C(W) is the degree of concurrency of a parallel algorithm, then for a problem of size W, no more than C(W) processing elements can be employed effectively.

Minimum Execution Time and Minimum Cost-Optimal Execution Time
• We can determine the minimum parallel runtime TPmin for a given W by differentiating the expression for TP w.r.t. p and equating it to zero:
dTP / dp = 0
If p0 is the value of p as determined by this equation, TP(p0) is the minimum parallel time.
• Example: Consider the minimum execution time for adding n numbers:
TP = n/p + 2 log p
Setting the derivative w.r.t. p to zero, we have p = n / 2. The corresponding runtime is
TPmin = 2 log n
(One may verify that this is indeed a minimum by verifying that the second derivative is positive.) Note that at this point, the formulation is not cost-optimal.
• Let TPcost_opt be the minimum cost-optimal parallel time. If the isoefficiency function of a parallel system is Θ(f(p)), then a problem of size W can be solved cost-optimally if and only if W = Ω(f(p)). In other words, for cost optimality, p = O(f⁻¹(W)).
• For cost-optimal systems, TP = Θ(W/p); therefore, TPcost_opt = Θ(W / f⁻¹(W)).
• Example: For adding n numbers, the isoefficiency function f(p) of this parallel system is Θ(p log p). From this, we have p ≈ n / log n. At this processor count, the parallel runtime is Θ(log n).
• Note that both TPmin and TPcost_opt for adding n numbers are Θ(log n). This may not always be the case.

Asymptotic Analysis of Parallel Programs
Consider the problem of sorting a list of n numbers. The fastest serial programs for this problem run in time Θ(n log n). Consider four parallel algorithms, A1, A2, A3, and A4, as follows:
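The TPmin claim for adding n numbers can be checked numerically. Restricting p to powers of 2 (as the algorithm requires), TP = n/p + 2 log p attains its minimum value 2 log n at p = n/2; the sketch below scans all power-of-2 processor counts to confirm this (illustrative, not from the notes).

```python
from math import log2

def tp(n, p):
    """Parallel time for adding n numbers on p processors."""
    return n / p + 2 * log2(p)

n = 1024
candidates = [2 ** k for k in range(1, 11)]     # p = 2, 4, ..., 1024
best = min(tp(n, p) for p in candidates)
# p = n/2 achieves the minimum over power-of-2 processor counts,
# and the minimum time equals 2 log n
assert tp(n, n // 2) == 2 * log2(n) == best
```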

Comparison of four different algorithms for sorting a given list of numbers. The table shows the number of processing elements, parallel runtime, speedup, efficiency, and the pTP product.
• If the metric is speed, algorithm A1 is the best, followed by A3, A4, and A2 (in order of increasing TP).
• In terms of efficiency, A2 and A4 are the best, followed by A3 and A1.
• In terms of cost, algorithms A2 and A4 are cost-optimal; A1 and A3 are not.
• It is important to identify the objectives of analysis and to use appropriate metrics!

Other Scalability Metrics
• A number of other metrics have been proposed, dictated by specific needs of applications.
• For real-time applications, the objective is to scale up a system to accomplish a task in a specified time bound.
• In memory-constrained environments, metrics operate at the limit of memory and estimate performance under this problem growth rate.

Scaled Speedup
• Speedup obtained when the problem size is increased linearly with the number of processing elements.
• If scaled speedup is close to linear, the system is considered scalable.
• If the isoefficiency is near linear, the scaled speedup curve is close to linear as well.
• Example: The serial runtime of multiplying a matrix of dimension n x n with a vector is tc n². For a given parallel algorithm, the total memory requirement is Θ(n²).
• If the aggregate memory grows linearly in p, scaled speedup increases the problem size to fill memory.
• Alternately, the size of the problem is increased subject to an upper bound on parallel execution time.

Serial Fraction (f)
• It is used to quantify the performance of a parallel system on a fixed-size problem:
f = (1/S - 1/p) / (1 - 1/p)
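The serial-fraction formula (commonly known as the Karp-Flatt metric) is easy to compute and to sanity-check: a perfectly parallel program yields f = 0, and a program obeying Amdahl's law with serial fraction f returns exactly that f.

```python
def serial_fraction(speedup, p):
    """Experimentally determined serial fraction:
    f = (1/S - 1/p) / (1 - 1/p)."""
    return (1 / speedup - 1 / p) / (1 - 1 / p)

# consistency check against Amdahl's law: S = 1 / (f + (1 - f)/p)
p, f_true = 4, 0.1
s = 1 / (f_true + (1 - f_true) / p)
assert abs(serial_fraction(s, p) - f_true) < 1e-12
assert serial_fraction(p, p) == 0.0   # ideal speedup -> zero serial fraction
```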

CHAPTER-6 Programming Using the Message Passing Paradigm

Principles of Message-Passing Programming:
• The logical view of a machine supporting the message-passing paradigm consists of p processes, each with its own exclusive address space.
• Each data element must belong to one of the partitions of the space; hence, data must be explicitly partitioned and placed.
• All interactions (read-only or read/write) require cooperation of two processes - the process that has the data and the process that wants to access the data.
• These two constraints, while onerous, make underlying costs very explicit to the programmer.

The Building Blocks: Send and Receive Operations
• The prototypes of these operations are as follows:
send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)
• Consider the following code segments:
P0:                          P1:
a = 100;                     receive(&a, 1, 0);
send(&a, 1, 1);              printf("%d\n", a);
a = 0;
• The semantics of the send operation require that the value received by process P1 must be 100 as opposed to 0. This motivates the design of the send and receive protocols.

Non-Buffered Blocking Message Passing Operations
• A simple method for forcing send/receive semantics is for the send operation to return only when it is safe to do so.
• In the non-buffered blocking send, the operation does not return until the matching receive has been encountered at the receiving process.
• Idling and deadlocks are major issues with non-buffered blocking sends.

Handshake for a blocking non-buffered send/receive operation. It is easy to see that in cases where the sender and receiver do not reach the communication point at similar times, there can be considerable idling overheads.

Buffered Blocking Message Passing Operations
• A simple solution to the idling and deadlocking problem outlined above is to rely on buffers at the sending and receiving ends.
• In buffered blocking sends, the sender simply copies the data into the designated buffer and returns after the copy operation has been completed. The data must be buffered at the receiving end as well.
• Buffering trades off idling overhead for buffer copying overhead.

Blocking buffered transfer protocols: (a) in the presence of communication hardware with buffers at send and receive ends, and (b) in the absence of communication hardware, the sender interrupts the receiver and deposits the data in a buffer at the receiver end.
• Bounded buffer sizes can have significant impact on performance.

• Impact of finite buffers. Consider:
P0:                                P1:
for (i = 0; i < 1000; i++) {       for (i = 0; i < 1000; i++) {
    produce_data(&a);                  receive(&a, 1, 0);
    send(&a, 1, 1);                    consume_data(&a);
}                                  }
What if the consumer was much slower than the producer?
• Deadlocks are still possible with buffering since receive operations block. Consider:
P0:                       P1:
receive(&a, 1, 1);        receive(&a, 1, 0);
send(&b, 1, 1);           send(&b, 1, 0);

Non-Blocking Message Passing Operations
• This class of non-blocking protocols returns from the send or receive operation before it is semantically safe to do so.
• The programmer must ensure the semantics of the send and receive.
• Non-blocking operations are generally accompanied by a check-status operation.
• When used correctly, these primitives are capable of overlapping communication overheads with useful computations.
• Message passing libraries typically provide both blocking and non-blocking primitives.

Non-blocking non-buffered send and receive operations: (a) in the absence of communication hardware; (b) in the presence of communication hardware.

MPI: the Message Passing Interface
• MPI defines a standard library for message-passing that can be used to develop portable message-passing programs using either C or Fortran.
• The MPI standard defines both the syntax as well as the semantics of a core set of library routines.
• Vendor implementations of MPI are available on almost all commercial parallel computers.
• It is possible to write fully-functional message-passing programs by using only six routines.

The minimal set of MPI routines:
MPI_Init        Initializes MPI.
MPI_Finalize    Terminates MPI.
MPI_Comm_size   Determines the number of processes.
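The value-semantics requirement above (the receiver must see 100, not 0) is exactly what copying into a buffer guarantees. A minimal single-process sketch, with a hypothetical Channel class standing in for the library's buffer (not a real MPI construct):

```python
class Channel:
    """Toy buffered channel: send() copies the value into the buffer
    immediately, so later writes by the sender cannot change what the
    receiver observes."""
    def __init__(self):
        self._buffer = []
    def send(self, value):
        self._buffer.append(value)       # copy into the communication buffer
    def receive(self):
        return self._buffer.pop(0)       # FIFO delivery

ch = Channel()
a = 100
ch.send(a)        # buffered blocking send returns after the copy
a = 0             # sender may immediately reuse its variable
assert ch.receive() == 100   # receiver still sees the value at send time
```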

MPI_Comm_rank   Determines the label of calling process.
MPI_Send        Sends a message.
MPI_Recv        Receives a message.

Starting and Terminating the MPI Library
• MPI_Init is called prior to any calls to other MPI routines. Its purpose is to initialize the MPI environment.
• MPI_Finalize is called at the end of the computation, and it performs various clean-up tasks to terminate the MPI environment.
• The prototypes of these two functions are:
int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()
• MPI_Init also strips off any MPI-related command-line arguments.
• All MPI routines, data-types, and constants are prefixed by "MPI_". The return code for successful completion is MPI_SUCCESS.

Communicators
• A communicator defines a communication domain - a set of processes that are allowed to communicate with each other.
• Information about communication domains is stored in variables of type MPI_Comm.
• Communicators are used as arguments to all message transfer MPI routines.
• A process can belong to many different (possibly overlapping) communication domains.
• MPI defines a default communicator called MPI_COMM_WORLD which includes all the processes.

Querying Information
• The MPI_Comm_size and MPI_Comm_rank functions are used to determine the number of processes and the label of the calling process, respectively.
• The calling sequences of these routines are as follows:
int MPI_Comm_size(MPI_Comm comm, int *size)
int MPI_Comm_rank(MPI_Comm comm, int *rank)
• The rank of a process is an integer that ranges from zero up to the size of the communicator minus one.

MPI Program:
#include <mpi.h>
main(int argc, char *argv[])
{
    int npes, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("From process %d out of %d, Hello World!\n", myrank, npes);
    MPI_Finalize();
}

Sending and Receiving Messages

• The basic functions for sending and receiving messages in MPI are MPI_Send and MPI_Recv, respectively.
• The calling sequences of these routines are as follows:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
• MPI provides equivalent datatypes for all C datatypes. This is done for portability reasons.
• The datatype MPI_BYTE corresponds to a byte (8 bits) and MPI_PACKED corresponds to a collection of data items that has been created by packing non-contiguous data.
• The message-tag can take values ranging from zero up to the MPI defined constant MPI_TAG_UB.
• MPI allows specification of wildcard arguments for both source and tag. If source is set to MPI_ANY_SOURCE, then any process of the communication domain can be the source of the message. If tag is set to MPI_ANY_TAG, then messages with any tag are accepted.
• On the receive side, the message must be of length equal to or less than the length field specified.
• On the receiving end, the status variable can be used to get information about the MPI_Recv operation. The corresponding data structure contains:
typedef struct MPI_Status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
};
• The MPI_Get_count function returns the precise count of data items received:
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

Sending and Receiving Messages Simultaneously
• To exchange messages, MPI provides the following function:
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
• The arguments include arguments to the send and receive functions. If we wish to use the same buffer for both send and receive, we can use:
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

Topologies and Embeddings
• MPI allows a programmer to organize processors into logical k-d meshes.
• The processor ids in MPI_COMM_WORLD can be mapped to other communicators (corresponding to higher-dimensional meshes) in many ways.
• The goodness of any such mapping is determined by the interaction pattern of the underlying program and the topology of the machine.
• MPI does not provide the programmer any control over these mappings.
37 . int count. MPI_Comm comm. int source.The basic functions for sending and receiving messages in MPI are the MPI_Send and MPI_Recv. The corresponding data structure contains: typedef struct MPI_Status { int MPI_SOURCE. MPI_Status *status) Topologies and Embeddings • • • • MPI allows a programmer to organize processors into logical k-d meshes. int tag. The processor ids in MPI_COMM_WORLD can be mapped to other communicators (corresponding to higher-dimensional meshes) in many ways. int dest. If tag is set to MPI_ANY_TAG.

Different ways to map a set of processes to a two-dimensional grid. (a) and (b) show a row- and column-wise mapping of these processes, (c) shows a mapping that follows a space-filling curve (dotted line), and (d) shows a mapping in which neighboring processes are directly connected in a hypercube.

Creating and Using Cartesian Topologies
• We can create cartesian topologies using the function:
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
This function takes the processes in the old communicator and creates a new communicator with ndims dimensions.
• Each processor can now be identified in this new cartesian topology by a vector of coordinates of dimension ndims.
• Since sending and receiving messages still require (one-dimensional) ranks, MPI provides routines to convert ranks to cartesian coordinates and vice-versa:
int MPI_Cart_coord(MPI_Comm comm_cart, int rank, int maxdims, int *coords)
int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)
• The most common operation on cartesian topologies is a shift. To determine the rank of source and destination of such shifts, MPI provides the following function:
int MPI_Cart_shift(MPI_Comm comm_cart, int dir, int s_step, int *rank_source, int *rank_dest)

Overlapping Communication with Computation:
• In order to overlap communication with computation, MPI provides a pair of functions for performing non-blocking send and receive operations:
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
• These operations return before the operations have been completed.
• Function MPI_Test tests whether or not the non-blocking send or receive operation identified by its request has finished:
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
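The rank ↔ coordinate conversions have simple row-major analogues. The sketch below mimics the effect of MPI_Cart_rank / MPI_Cart_coord for a non-periodic grid; these are hypothetical pure-Python helpers, not MPI itself.

```python
def cart_rank(dims, coords):
    """Row-major rank of a coordinate vector in a cartesian topology."""
    rank = 0
    for d, c in zip(dims, coords):
        assert 0 <= c < d
        rank = rank * d + c
    return rank

def cart_coords(dims, rank):
    """Inverse of cart_rank: recover the coordinate vector of a rank."""
    coords = []
    for d in reversed(dims):
        coords.append(rank % d)
        rank //= d
    return list(reversed(coords))

dims = (3, 4)                       # a 3 x 4 process grid
assert cart_rank(dims, (2, 1)) == 9
assert cart_coords(dims, 9) == [2, 1]
# the two functions are inverses for every rank in the grid
assert all(cart_rank(dims, cart_coords(dims, r)) == r for r in range(12))
```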

• MPI_Wait waits for the operation to complete:
int MPI_Wait(MPI_Request *request, MPI_Status *status)

Collective Communication and Computation Operations
• MPI provides an extensive set of functions for performing common collective communication operations.
• Each of these operations is defined over a group corresponding to the communicator.
• All processors in a communicator must call these operations.
• The barrier synchronization operation is performed in MPI using:
int MPI_Barrier(MPI_Comm comm)
• The one-to-all broadcast operation is:
int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int source, MPI_Comm comm)
• The all-to-one reduction operation is:
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int target, MPI_Comm comm)
• If the result of the reduction operation is needed by all processes, MPI provides:
int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
• To compute prefix-sums, MPI provides:
int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
• The gather operation is performed in MPI using:
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int target, MPI_Comm comm)
• MPI also provides the MPI_Allgather function in which the data are gathered at all the processes:
int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm)
• The corresponding scatter operation is:
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, MPI_Comm comm)
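The semantics of the prefix-sum operation are worth pinning down: MPI_Scan delivers to rank i the reduction of the values contributed by ranks 0 through i, inclusive. A tiny sequential model (illustrative, not MPI):

```python
def scan(values, op):
    """Inclusive prefix reduction: out[i] = op(values[0], ..., values[i]),
    mirroring what MPI_Scan delivers to rank i (values[i] is rank i's input)."""
    out = []
    acc = None
    for v in values:
        acc = v if acc is None else op(acc, v)
        out.append(acc)
    return out

# one value per rank; the lambda plays the role of MPI_SUM
assert scan([3, 1, 4, 1, 5], lambda a, b: a + b) == [3, 4, 8, 9, 14]
```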

Groups and Communicators
• In many parallel algorithms, communication operations need to be restricted to certain subsets of processes.
• MPI provides mechanisms for partitioning the group of processes that belong to a communicator into subgroups each corresponding to a different communicator.
• The simplest such mechanism is:
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
• This operation groups processors by color and sorts the resulting groups on the key.

Using MPI_Comm_split to split a group of processes in a communicator into subgroups.

• In many parallel algorithms, processes are arranged in a virtual grid, and in different steps of the algorithm, communication needs to be restricted to a different subset of the grid.
• MPI provides a convenient way to partition a Cartesian topology to form lower-dimensional grids:
int MPI_Cart_sub(MPI_Comm comm_cart, int *keep_dims, MPI_Comm *comm_subcart)
• If keep_dims[i] is true (non-zero value in C) then the ith dimension is retained in the new sub-topology.
• The coordinate of a process in a sub-topology created by MPI_Cart_sub can be obtained from its coordinate in the original topology by disregarding the coordinates that correspond to the dimensions that were not retained.

CHAPTER-7 DENSE MATRIX ALGORITHMS

Introduction
• Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data-decomposition.
• Typical algorithms rely on input, output, or intermediate data decomposition.
• Most algorithms use one- and two-dimensional block, cyclic, and block-cyclic partitionings.
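MPI_Comm_split's grouping rule (same color goes to the same new communicator; new ranks are ordered by key, ties broken by old rank) can be modeled directly. This is an illustrative sketch of the semantics, not an MPI binding:

```python
def comm_split(colors, keys):
    """Model of MPI_Comm_split: process r (contributing colors[r], keys[r])
    is mapped to (its color group, its new rank). Ties in key are broken
    by the rank in the old communicator."""
    groups = {}
    for rank, color in enumerate(colors):
        groups.setdefault(color, []).append(rank)
    result = {}
    for color, members in groups.items():
        ordered = sorted(members, key=lambda r: (keys[r], r))
        for new_rank, old_rank in enumerate(ordered):
            result[old_rank] = (color, new_rank)
    return result

# 6 processes: ranks 0-2 pick color 0, ranks 3-5 color 1; key = -rank
# reverses the ordering inside each subgroup
res = comm_split([0, 0, 0, 1, 1, 1], [0, -1, -2, 0, -1, -2])
assert res[2] == (0, 0) and res[0] == (0, 2)
assert res[5] == (1, 0) and res[3] == (1, 2)
```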

Matrix-Vector Multiplication
• We aim to multiply a dense n x n matrix A with an n x 1 vector x to yield the n x 1 result vector y.
• The serial algorithm requires n² multiplications and additions.

Matrix-Vector Multiplication: Rowwise 1-D Partitioning
• The n x n matrix is partitioned among n processors, with each processor storing a complete row of the matrix.
• The n x 1 vector x is distributed such that each process owns one of its elements.
Multiplication of an n x n matrix with an n x 1 vector using rowwise block 1-D partitioning.
• Since each process starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processes.
• Process Pi now computes y[i] = Sum over j of (A[i, j] * x[j]).
• The all-to-all broadcast and the computation of y[i] both take time Θ(n). Therefore, the parallel time is Θ(n). For the one-row-per-process case, p = n.
• Consider now the case when p < n and we use block 1-D partitioning.
• Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes and involves messages of size n/p. This is followed by n/p local dot products.
• Thus, the parallel run time of this procedure is:
TP = n²/p + ts log p + tw n
• This is cost-optimal.
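The rowwise algorithm is easy to simulate: give each process n/p rows of A and the matching n/p elements of x, perform the all-to-all broadcast (modeled as assembling the full vector), then do the local dot products. A pure-Python sketch with no MPI (names are illustrative):

```python
def matvec_rowwise_1d(A, x, p):
    """Simulate rowwise block 1-D matrix-vector multiplication on p processes."""
    n = len(A)
    assert n % p == 0
    rows = n // p
    # each process owns rows [i*rows, (i+1)*rows) of A and the same slice of x
    x_parts = [x[i * rows:(i + 1) * rows] for i in range(p)]
    # all-to-all broadcast: every process assembles the full vector
    full_x = [v for part in x_parts for v in part]
    y = []
    for i in range(p):
        for r in range(i * rows, (i + 1) * rows):   # local dot products
            y.append(sum(A[r][j] * full_x[j] for j in range(n)))
    return y

A = [[1, 2, 0, 1], [0, 1, 3, 0], [2, 0, 1, 1], [1, 1, 1, 1]]
x = [1, 2, 3, 4]
serial = [sum(A[i][j] * x[j] for j in range(4)) for i in range(4)]
assert matvec_rowwise_1d(A, x, 2) == serial
```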

• We know that T_0 = p T_P - W; for the block 1-D partitioning, therefore, T_0 = t_s p log p + t_w n p.
• For isoefficiency, we have W = K T_0, where K = E/(1 - E) for a desired efficiency E. From the t_w term, this gives W = n² = O(p²).
• There is also a bound on isoefficiency because of concurrency: since p ≤ n, we have W = n² = Ω(p²).
• The overall isoefficiency is therefore Θ(p²).

Matrix-Vector Multiplication: 2-D Partitioning

• The n x n matrix is partitioned among n² processors such that each processor owns a single element; p = n² if the matrix size is n x n.
• The n x 1 vector x is distributed only in the last column of n processors.

Matrix-vector multiplication with block 2-D partitioning.

• We must first align the vector with the matrix appropriately.
• The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix.
• The second step copies the vector elements from each diagonal process to all the processes in the corresponding column, using n simultaneous broadcasts among all processors in each column.
• Finally, the result vector is computed by performing an all-to-one reduction along the columns.
• Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row.
• Each of these operations takes Θ(log n) time, and the parallel time is Θ(log n).
• The cost (process-time product) is Θ(n² log n); therefore, the algorithm is not cost-optimal.

Matrix-Vector Multiplication: 2-D Partitioning

• When using fewer than n² processors, each process owns an (n/√p) x (n/√p) block of the matrix.
• The vector is distributed in portions of n/√p elements in the last process-column only.
• The first alignment step takes time t_s + t_w n/√p.
• The broadcast and reduction each take time (t_s + t_w n/√p) log √p.
• Local matrix-vector products take time n²/p.
• The total time is T_P ≈ n²/p + t_s log p + t_w (n/√p) log p.
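The three communication steps of the one-element-per-process 2-D algorithm (diagonal alignment, column broadcast, row reduction) can be traced serially. This is a sketch for illustration only; the data structures standing in for the n x n process mesh are inventions of the example.

```python
# Serial trace of the 2-D (one element per process) matrix-vector
# algorithm: x starts in the last process-column, is aligned onto the
# diagonal, broadcast down each column, multiplied elementwise, and the
# partial products are reduced along each row to form y.

def mat_vec_2d(A, x):
    n = len(A)
    # step 1: one-to-one alignment -- x[i] moves from P[i][n-1] to P[i][i]
    diag = {i: x[i] for i in range(n)}
    # step 2: one-to-all broadcast of diag[j] down column j
    col_val = [[diag[j] for j in range(n)] for _ in range(n)]
    # local multiply, then step 3: all-to-one reduction along each row
    return [sum(A[i][j] * col_val[i][j] for j in range(n)) for i in range(n)]

A = [[1, 0], [2, 3]]
print(mat_vec_2d(A, [5, 7]))  # [5, 31]
```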

Scalability Analysis:

• Equating T_0 with W, term by term, for isoefficiency, we have, as the dominant term, W = K² t_w² p log² p.
• The isoefficiency due to concurrency is O(p).
• The overall isoefficiency is Θ(p log² p) (due to the network bandwidth term).
• For cost optimality, we have W = n² = Ω(p log² p); the algorithm is cost-optimal for p = O(n²/log² n).

Matrix-Matrix Multiplication

• Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B.
• The serial complexity is O(n³).
• We do not consider better serial algorithms (such as Strassen's method), although these can be used as serial kernels in the parallel algorithms.
• A useful concept in this case is called block operations. In this view, an n x n matrix A can be regarded as a q x q array of blocks Ai,j (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix.
• In this view, we perform q³ block matrix multiplications, each involving (n/q) x (n/q) matrices.

Matrix-Matrix Multiplication

• Consider two n x n matrices A and B partitioned into p blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k and Bk,j for 0 ≤ k < √p.
• All-to-all broadcast blocks of A along rows and blocks of B along columns, then perform local submatrix multiplications.
• The two broadcasts take time 2(t_s log √p + t_w (n²/p)(√p - 1)).
• The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices.
• The parallel run time is approximately T_P = n³/p + t_s log p + 2 t_w n²/√p.
• The algorithm is cost optimal, and the isoefficiency is O(p^1.5) due to the bandwidth term t_w and to concurrency.
• A major drawback of the algorithm is that it is not memory optimal.
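The block view of matrix multiplication can be verified directly: treating an n x n matrix as a q x q array of (n/q) x (n/q) blocks and summing block products gives the same result as the elementwise algorithm. The helper names below are illustrative; pure Python lists are used instead of a matrix library.

```python
# Block view of matrix multiplication: C block (i, j) is the sum over k
# of the block products A(i, k) * B(k, j) -- q^3 block multiplications in
# all, exactly the work distributed by the all-to-all broadcast algorithm.

def block(M, i, j, b):
    """Extract the (i, j) block of size b x b from matrix M."""
    return [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]

def mat_mul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def mat_add(X, Y):
    return [[a + c for a, c in zip(rx, ry)] for rx, ry in zip(X, Y)]

def block_mat_mul(A, B, q):
    n = len(A); b = n // q
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            acc = [[0] * b for _ in range(b)]
            for k in range(q):            # q block multiplications per C block
                acc = mat_add(acc, mat_mul(block(A, i, k, b),
                                           block(B, k, j, b)))
            for r in range(b):
                C[i*b + r][j*b:(j+1)*b] = acc[r]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(block_mat_mul(A, B, q=2))  # [[19, 22], [43, 50]]
```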

Matrix-Matrix Multiplication: Cannon's Algorithm

• In this algorithm, we schedule the computations of the √p processes of the ith row such that, at any given time, each process is using a different block Ai,k.
• These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh Ai,k after each rotation.

Communication steps in Cannon's algorithm on 16 processes.
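The whole rotation scheme, including the initial skew alignment, can be simulated serially on a q x q grid of blocks. This is an illustrative sketch (q plays the role of √p; block containers are plain nested lists), not a parallel implementation.

```python
# Serial simulation of Cannon's algorithm on a q x q grid of blocks:
# after skewing row i of A left by i steps and column j of B up by j
# steps, each of q multiply-and-shift rounds gives every grid position a
# fresh (A block, B block) pair whose product contributes to its C block.

def cannon(A, B, q):
    n = len(A); b = n // q
    blk = lambda M, i, j: [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    # initial skew alignment (with wraparound)
    Aloc = [[blk(A, i, (j + i) % q) for j in range(q)] for i in range(q)]
    Bloc = [[blk(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    Cloc = [[[[0]*b for _ in range(b)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        for i in range(q):
            for j in range(q):
                X, Y, acc = Aloc[i][j], Bloc[i][j], Cloc[i][j]
                for r in range(b):
                    for c in range(b):
                        acc[r][c] += sum(X[r][k] * Y[k][c] for k in range(b))
        # single-step shifts: A blocks left, B blocks up (with wraparound)
        Aloc = [[Aloc[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        Bloc = [[Bloc[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    C = [[0]*n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(b):
                C[i*b + r][j*b:(j+1)*b] = Cloc[i][j][r]
    return C

print(cannon([[1, 2], [3, 4]], [[5, 6], [7, 8]], q=2))  # [[19, 22], [43, 50]]
```

Note that, unlike the all-to-all broadcast algorithm, each grid position ever holds only one A block and one B block at a time, which is why Cannon's algorithm is memory optimal.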

Matrix-Matrix Multiplication: Cannon's Algorithm

• Align the blocks of A and B in such a way that each process multiplies its local submatrices. This is done by shifting all submatrices Ai,j to the left (with wraparound) by i steps and all submatrices Bi,j up (with wraparound) by j steps.
• Perform local block multiplication.
• Each block of A then moves one step left and each block of B moves one step up (again with wraparound).
• Perform the next block multiplication, add to the partial result, and repeat until all √p blocks have been multiplied.
• In the alignment step, since the maximum distance over which a block shifts is √p - 1, the two shift operations require a total of 2(t_s + t_w n²/p) time.
• Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes t_s + t_w n²/p time.
• The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is n³/p.
• The parallel time is approximately T_P = n³/p + 2√p t_s + 2 t_w n²/√p.
• The cost-optimality and isoefficiency of the algorithm are identical to those of the first algorithm, except that Cannon's algorithm is memory optimal.

Matrix-Matrix Multiplication: DNS Algorithm

• Uses a 3-D partitioning.
• Visualize the matrix multiplication algorithm as a cube: matrices A and B come in at two orthogonal faces, and the result C comes out at the other orthogonal face.
• Each internal node in the cube represents a single add-multiply operation (and thus the n³ complexity).
• The DNS algorithm partitions this cube using a 3-D block scheme.
• Assume an n x n x n mesh of processors.
• Move the columns of A and the rows of B into position and perform broadcasts.
• Each processor computes a single add-multiply.
• This is followed by an accumulation along the C dimension.
• Since each add-multiply takes constant time and the accumulation and broadcast take log n time, the total runtime is Θ(log n).
• This is not cost optimal. It can be made cost optimal by using n/log n processors along the direction of accumulation.

Matrix-Matrix Multiplication: DNS Algorithm

The communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes.

Matrix-Matrix Multiplication: DNS Algorithm Using Fewer than n³ Processors

• Assume that the number of processes p is equal to q³ for some q < n.
• The two matrices are partitioned into blocks of size (n/q) x (n/q).
• Each matrix can thus be regarded as a q x q two-dimensional square array of blocks.
• The algorithm follows from the previous one, except that, in this case, we operate on blocks rather than on individual elements.
• The first one-to-one communication step is performed for both A and B, and takes t_s + t_w (n/q)² time for each matrix.
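The block formulation of the DNS data flow can be checked with a serial sketch: plane k of the logical q x q x q mesh forms the block product A(i, k) * B(k, j), and the accumulation along the third dimension sums the q partial results into C(i, j). Names and structure are illustrative, not from the source.

```python
# Serial sketch of block DNS with p = q^3 logical processes: for each
# C block (i, j), plane k contributes the block product A(i, k)*B(k, j);
# the loop over k stands in for the accumulation along the C dimension.

def dns(A, B, q):
    n = len(A); b = n // q
    blk = lambda M, i, j: [row[j*b:(j+1)*b] for row in M[i*b:(i+1)*b]]
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for k in range(q):           # accumulation along the k dimension
                X, Y = blk(A, i, k), blk(B, k, j)
                for r in range(b):
                    for c in range(b):
                        C[i*b + r][j*b + c] += sum(
                            X[r][m] * Y[m][c] for m in range(b))
    return C

print(dns([[1, 2], [3, 4]], [[5, 6], [7, 8]], q=2))  # [[19, 22], [43, 50]]
```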

• The two one-to-all broadcasts take (t_s + t_w (n/q)²) log q time for each matrix.
• The final reduction takes (t_s + t_w (n/q)²) log q time.
• Multiplication of (n/q) x (n/q) submatrices takes (n/q)³ time.
• The parallel time is approximated by T_P ≈ n³/p + t_s log p + t_w (n²/p^(2/3)) log p.
• The isoefficiency function is Θ(p log³ p).

Solving a System of Linear Equations

• Consider the problem of solving linear equations of the kind:
a0,0 x0 + a0,1 x1 + ... + a0,n-1 xn-1 = b0
a1,0 x0 + a1,1 x1 + ... + a1,n-1 xn-1 = b1
...
an-1,0 x0 + an-1,1 x1 + ... + an-1,n-1 xn-1 = bn-1
• This is written as Ax = b, where A is an n x n matrix with A[i, j] = ai,j, b is an n x 1 vector [b0, b1, ..., bn-1]^T, and x is the solution vector.

Solving a System of Linear Equations

• Two steps in the solution are: reduction to triangular form, and back-substitution.
• The triangular form is as follows:
x0 + u0,1 x1 + u0,2 x2 + ... + u0,n-1 xn-1 = y0
x1 + u1,2 x2 + ... + u1,n-1 xn-1 = y1
...
xn-1 = yn-1

• We write this as Ux = y, where U is an upper-triangular matrix with all subdiagonal entries zero and all principal diagonal entries equal to one.
• A commonly used method for transforming a given matrix into an upper-triangular matrix is Gaussian elimination.

Gaussian Elimination

Serial Gaussian elimination.

• The computation has three nested loops. In the kth iteration of the outer loop, the algorithm performs (n-k)² computations.
• Summing from k = 1 to n, we have roughly n³/3 multiplications and subtractions.
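The serial elimination procedure referred to above can be sketched in Python (no pivoting; function and variable names are illustrative). The kth outer iteration divides row k by the pivot A[k][k] and then eliminates column k from all rows below it, producing the unit-diagonal U and the transformed right-hand side y.

```python
# Serial Gaussian elimination sketch (no pivoting): reduces A to a
# unit-diagonal upper-triangular U and transforms b into y in place.

def gaussian_eliminate(A, b):
    n = len(A)
    for k in range(n):
        # normalization (division step): n - k - 1 divisions plus one for b
        for j in range(k + 1, n):
            A[k][j] /= A[k][k]
        b[k] /= A[k][k]
        A[k][k] = 1.0
        # elimination step: roughly (n - k)^2 multiply-subtracts
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
            b[i] -= A[i][k] * b[k]
            A[i][k] = 0.0
    return A, b

U, y = gaussian_eliminate([[2.0, 1.0], [4.0, 3.0]], [3.0, 7.0])
print(U)  # [[1.0, 0.5], [0.0, 1.0]]
print(y)  # [1.5, 1.0]
```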

A typical computation in Gaussian elimination.

Parallel Gaussian Elimination

• Assume p = n, with each row assigned to a processor.
• The first step of the algorithm normalizes the row. This is a serial operation and takes time (n-k) in the kth iteration.
• In the second step, the normalized row is broadcast to all the processors. This takes time (t_s + t_w (n-k-1)) log n.
• Each processor can then independently eliminate this row from its own row. This requires (n-k-1) multiplications and subtractions.
• The total parallel time can be computed by summing from k = 1 to n-1 as T_P = (3/2) n(n-1) + t_s n log n + (1/2) t_w n(n-1) log n.
• The formulation is not cost optimal because of the t_w term.

Gaussian elimination steps during the iteration corresponding to k = 3 for an 8 x 8 matrix partitioned rowwise among eight processes.

Parallel Gaussian Elimination: Pipelined Execution

• In the previous formulation, the (k+1)st iteration starts only after all the computation and communication for the kth iteration is complete.
• In the pipelined version, there are three steps: normalization of a row, communication, and elimination. These steps are performed in an asynchronous fashion.
• A processor Pk waits to receive and eliminate all rows prior to k. Once it has done this, it forwards its own row to processor Pk+1.

Pipelined Gaussian elimination on a 5 x 5 matrix partitioned with one row per process.

Parallel Gaussian Elimination: Pipelined Execution

• The total number of steps in the entire pipelined procedure is Θ(n).
• In any step, either O(n) elements are communicated between directly-connected processes, or a division step is performed on O(n) elements of a row, or an elimination step is performed on O(n) elements of a row.
• The parallel time is therefore O(n²). This is cost optimal.

The communication in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix distributed among four processes using block 1-D partitioning.

Parallel Gaussian Elimination: Block 1-D with p < n

• The above algorithm can be easily adapted to the case when p < n.
• In the kth iteration, a processor with all rows belonging to the active part of the matrix performs (n - k - 1)n/p multiplications and subtractions.
• In the pipelined version, for n > p, computation dominates communication.
• The parallel time is given by 2(n/p) Σ (n - k - 1) over all iterations, or approximately n³/p.
• While the algorithm is cost optimal, the cost of the parallel algorithm is higher than the sequential run time by a factor of 3/2.

Computation load on different processes in block and cyclic 1-D partitioning of an 8 x 8 matrix on four processes during the Gaussian elimination iteration corresponding to k = 3.

Parallel Gaussian Elimination: Block 1-D with p < n

• The load imbalance problem can be alleviated by using a cyclic mapping.
• In this case, other than the processing of the last p rows, there is no load imbalance.
• This corresponds to a cumulative load imbalance overhead of O(n²p) (instead of O(n³) in the block mapping).

Parallel Gaussian Elimination: 2-D Mapping

• Assume an n x n matrix A mapped onto an n x n mesh of processors.
• Each update of the partial matrix can be thought of as a scaled rank-one update (scaling by the pivot element).
• In the first step, the pivot is broadcast to the row of processors.
• In the second step, each processor locally updates its value. For this it needs the corresponding value from the pivot row, and the scaling value from its own row.
• This requires two broadcasts, each of which takes log n time.
• This results in a non-cost-optimal algorithm.
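The difference between block and cyclic 1-D mappings discussed above can be made concrete by counting, for a given outer iteration k, how many active rows each process still owns. The function below is an illustrative sketch, not from the source.

```python
# Counting active rows per process at outer iteration k: with a block
# 1-D mapping, low-numbered processes go idle once their rows have been
# eliminated, while a cyclic mapping keeps the active rows spread almost
# evenly across all processes.

def active_rows(n, p, k, mapping):
    counts = [0] * p
    for row in range(k + 1, n):          # rows still in the active submatrix
        owner = row // (n // p) if mapping == "block" else row % p
        counts[owner] += 1
    return counts

n, p, k = 8, 4, 3
print(active_rows(n, p, k, "block"))   # [0, 0, 2, 2]  -- two processes idle
print(active_rows(n, p, k, "cyclic"))  # [1, 1, 1, 1]  -- balanced
```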

Various steps in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix on 64 processes arranged in a logical two-dimensional mesh.

Parallel Gaussian Elimination: 2-D Mapping with Pipelining

• We pipeline along two dimensions. First, the pivot value is pipelined along the row. Then the scaled pivot row is pipelined down.
• Processor Pi,j (not on the pivot row) performs the elimination step A[i, j] := A[i, j] - A[i, k] x A[k, j] as soon as A[i, k] and A[k, j] are available.
• The computation and communication for each iteration moves through the mesh from top-left to bottom-right as a "front."
• After the front corresponding to a certain iteration passes through a process, the process is free to perform subsequent iterations.
• Multiple fronts that correspond to different iterations are active simultaneously.
• If each step (division, elimination, or communication) is assumed to take constant time, the front moves a single step in this time. The front takes Θ(n) time to reach Pn-1,n-1.
• Once the front has progressed past a diagonal processor, the next front can be initiated. In this way, the last front passes the bottom-right corner of the matrix Θ(n) steps after the first one.
• The parallel time is therefore O(n), which is cost-optimal.

Pipelined Gaussian elimination for a 5 x 5 matrix with 25 processors.

Parallel Gaussian Elimination: 2-D Mapping with Pipelining and p < n

• In this case, a processor containing a completely active part of the matrix performs n²/p multiplications and subtractions, and communicates n/√p words along its row and its column.
• The computation dominates communication for n >> p.
• The total parallel run time of this algorithm is (2n²/p) x n, since there are n iterations. This is equal to 2n³/p.
• This is three times the serial operation count!

The communication steps in the Gaussian elimination iteration corresponding to k = 3 for an 8 x 8 matrix on 16 processes of a two-dimensional mesh.

Computational load on different processes in block and cyclic 2-D mappings of an 8 x 8 matrix onto 16 processes during the Gaussian elimination iteration corresponding to k = 3.

Parallel Gaussian Elimination: 2-D Cyclic Mapping

• The idling in the block mapping can be alleviated by using a cyclic mapping.
• The maximum difference in computational load between any two processes in any iteration is that of one row and one column update.

• This contributes Θ(n√p) to the overhead function. Since there are n iterations, the total overhead is Θ(n²√p).

Gaussian Elimination with Partial Pivoting

• For numerical stability, one generally uses partial pivoting.
• In the kth iteration, we select a column i (called the pivot column) such that A[k, i] is the largest in magnitude among all A[k, j] with k ≤ j < n.
• The kth and the ith columns are interchanged.
• Partial pivoting is simple to implement with row-partitioning and does not add overhead, since the division step takes the same time as computing the maximum.
• Column-partitioning, however, requires a global reduction, adding a log p term to the overhead.
• Pivoting precludes the use of pipelining.

Gaussian Elimination with Partial Pivoting: 2-D Partitioning

• Partial pivoting restricts the use of pipelining, resulting in performance loss.
• This loss can be alleviated by restricting pivoting to specific columns.
• Alternately, we can use faster algorithms for broadcast.

Solving a Triangular System: Back-Substitution

• The upper triangular matrix U undergoes back-substitution to determine the vector x.
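The pivot-column selection and interchange described in the partial pivoting section above can be sketched as follows; pivot_columns and the returned permutation list are illustrative inventions, and in a full implementation the interchange would be interleaved with the elimination iterations rather than done up front.

```python
# Sketch of column-interchange partial pivoting: in iteration k the
# pivot column i maximizes |A[k][i]| over i >= k, and columns k and i
# are swapped (the ordering of the unknowns is permuted accordingly).

def pivot_columns(A):
    n = len(A)
    perm = list(range(n))                # tracks the permutation of unknowns
    for k in range(n):
        i = max(range(k, n), key=lambda j: abs(A[k][j]))
        if i != k:                       # interchange columns k and i
            for row in A:
                row[k], row[i] = row[i], row[k]
            perm[k], perm[i] = perm[i], perm[k]
    return A, perm

A, perm = pivot_columns([[0.0, 2.0], [1.0, 1.0]])
print(A)     # [[2.0, 0.0], [1.0, 1.0]] -- zero pivot moved off the diagonal
print(perm)  # [1, 0]
```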

A serial algorithm for back-substitution.

Solving a Triangular System: Back-Substitution

• The algorithm performs approximately n²/2 multiplications and subtractions.
• Since the complexity of this part is asymptotically lower than that of the factorization, we should optimize the data distribution for the factorization part.
• Consider a rowwise block 1-D mapping of the n x n matrix U with the vector y distributed uniformly.
• The value of the variable solved at a step can be pipelined back.
• Each step of a pipelined implementation requires a constant amount of time for communication and Θ(n/p) time for computation.
• The parallel run time of the entire algorithm is Θ(n²/p).

Solving a Triangular System: Back-Substitution

• If the matrix is partitioned by using 2-D partitioning on a √p x √p logical mesh of processes, and the elements of the vector are distributed along one of the columns of the process mesh, then only the √p processes containing the vector perform any computation.
• Using pipelining to communicate the appropriate elements of U to the process containing the corresponding elements of y for the substitution step (line 7), the algorithm can be executed in Θ(n²/√p) time.
• While this is not cost optimal, since this part does not dominate the overall computation, the cost optimality is determined by the factorization.
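The serial back-substitution referred to above can be sketched in Python for a unit-diagonal upper-triangular U, as produced by the elimination step; the function name is illustrative.

```python
# Serial back-substitution for a unit-diagonal upper-triangular U:
# roughly n^2/2 multiply-subtracts, solving from x[n-1] upward.

def back_substitute(U, y):
    n = len(U)
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        # subtract the already-solved unknowns from the kth equation
        x[k] = y[k] - sum(U[k][j] * x[j] for j in range(k + 1, n))
    return x

U = [[1.0, 0.5], [0.0, 1.0]]
print(back_substitute(U, [1.5, 1.0]))  # [1.0, 1.0]
```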
