
# Tutorial 1 / CS-3211 / week 19-24 Jan 2004

1. Suppose a galaxy has 10^11 stars. Estimate the time it would take to perform 100 iterations of the basic N-body algorithm using O(N^2) computations and a computer that is capable of 500 MFlops.
2. Find the diameter of: (a) a torus; (b) a tree network; (c) a d-dimensional mesh.
3. Look at the minimal-distance deadlock-free algorithm for hypercube networks described in the textbook, page 15. Apply it for: (a) a five-dimensional hypercube network from node 7 to node 22; (b) repeat for an 8 × 8 mesh, using its perfect embedding in a hypercube network.
4. Determine how the largest complete binary tree can be embedded into a hypercube. What is the dilation of the mapping?
5. What is the average distance between two nodes in: (a) a mesh network; (b) a hypercube?

## CS-3211: Parallel and Concurrent Programming

Tutorial 3 Week 2-7 Feb 2004

1. Develop an equation for the message communication time t_comm that incorporates the delay through multiple links, as would occur in a static interconnection network. Develop the equation for a mesh, for a tree, and for a hypercube network, assuming that all destinations are randomly chosen.
2. (i) Devise an efficient way that a scatter operation can be done on an n-dimensional hypercube. What is its time complexity? (ii) Repeat for an n × n torus.
3. In Linux (tembusu) you may use the gettimeofday() function to record small amounts of time (microseconds) as follows:

```c
#include <sys/time.h>   /* add this */
...
struct timeval start, stop;
...
gettimeofday(&start, NULL);
...
gettimeofday(&stop, NULL);
/* include the seconds field, or the result is wrong whenever
   the two calls fall in different seconds */
printf(".. %ld ..(microsec)\n",
       (long)(stop.tv_sec - start.tv_sec) * 1000000L
       + (long)(stop.tv_usec - start.tv_usec));
```

Measure the time to send a message in our parallel programming system (tembusu) using various communication routines [individual send-recv, broadcast, scatter, gather]. Repeat with the ping-pong method described in the class. Estimate the startup time t_startup and the time to send one data item t_data.
4. Certain complex MPI communication functions can be simulated by a series of more basic ones. (i) Use MPI_Send(..) and MPI_Recv(..) to write a few procedures MyBcast, MyScatter, etc. simulating MPI_Bcast(..), MPI_Scatter(..), MPI_Gather(..), MPI_Reduce(..). (See the MPI manual or Appendix B

of the textbook for details on these routines.) (ii) Use the procedure described in the above question to estimate the time taken by your simulating routines and compare with the time taken by the corresponding MPI routines.
5. Experiment with latency hiding on your system to determine how much computation is possible between sending messages. Investigate using both nonblocking and locally blocking send routines.
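One common way to build a MyBcast-style routine from point-to-point sends (question 4) is a binomial-tree schedule. The sketch below is an assumption about how such a routine could be organized; it simulates the schedule with an array of "mailboxes" instead of real MPI_Send/MPI_Recv calls, so the communication pattern can be checked without an MPI runtime:

```c
#include <assert.h>

#define P 8   /* number of simulated processes (a power of two here) */

/* One "mailbox" per simulated process; -1 means no data yet. */
static int data[P];

/* Binomial-tree broadcast from rank 0, as a MyBcast could schedule it:
   in the round with stride 2^r, every rank < 2^r that already holds
   the data sends it to rank + 2^r.  With MPI each arrow would be an
   MPI_Send/MPI_Recv pair; here we just copy between mailboxes. */
static void tree_bcast(int root_value)
{
    for (int i = 0; i < P; i++) data[i] = -1;
    data[0] = root_value;
    for (int stride = 1; stride < P; stride *= 2)   /* log2(P) rounds */
        for (int src = 0; src < stride; src++) {
            int dst = src + stride;
            if (dst < P) {
                assert(data[src] != -1);            /* sender has data */
                data[dst] = data[src];              /* "send-recv"     */
            }
        }
}
```

The point of the tree schedule is that the broadcast completes in ceil(log2 p) rounds rather than the p − 1 sends of a naive loop at the root, which matters when you time MyBcast against MPI_Bcast in question (ii).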

## CS-3211: Parallel and Concurrent Programming

Tutorial 4 Week 9-14 Feb 2004

1. Image transformations, square partitions: Write (in pseudocode) a parallel program for the following question and analyze its efficiency.
   1. Write a parallel program to perform image transformations (shifting, scaling, rotation) based on static task assignment and square partitions. (For example, for a 50 × 60 image one may divide the row number by 2 and the column number by 3 to get 6 square parts, each of size 25 × 20.)
   2. Analyze your code in terms of communication, computation, overall parallel execution time, speedup, and efficiency.
2. Implementing image transformations: Implement the following image transformations: shifting, scaling, rotation (slide 3.22 or textbook) and run the program on the Tembusu cluster.
   1. Start using a simple graphical interface: an image is just a, say, 50 × 60 matrix filled in with digits; then you may simply display the image by printing on a terminal window.
   2. Adapt the above program to handle real images, e.g., in PPM format. (An example of a PPM file, including a few explanations, may be found at the cs3211 course web page - Tutorial table, Misc column.)
3. Mandelbrot, static task assignment: Write an MPI program for Mandelbrot computation using a static task assignment (that is, simply divide the image into fixed areas). Run it on the Tembusu cluster.
4. Mandelbrot, dynamic task assignment: Repeat the above question, but using a dynamic task assignment (slides 3.33-34).

5. Monte-Carlo method: Write an MPI program to compute π/4 using Monte-Carlo methods. (Run it on the Tembusu cluster.)
   1. Use a sequential random number generator and both methods described in the class (slides 3.34-35): (1) score how many random points within a 2 × 2 square lie within a circle of unit radius and (2) compute the corresponding integral ∫_0^1 √(1 − x^2) dx.
   2. Repeat the above question using a parallel random number generator (write your own implementation of such a parallel random number generator using the method described in the class - slides 3.36-37).
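The scoring method (1) can be sketched sequentially as below; this is only an illustration of the estimator, not the MPI program the question asks for. In the parallel version each rank would score trials/nprocs points with its own random stream and the hit counts would be combined with a reduction:

```c
#include <stdlib.h>

/* Throw random points into the unit square and score those falling
   inside the quarter disc x^2 + y^2 <= 1; the hit ratio estimates
   pi/4 (the full circle in the 2 x 2 square gives the same ratio). */
static double estimate_quarter_pi(long trials, unsigned seed)
{
    srand(seed);
    long hits = 0;
    for (long t = 0; t < trials; t++) {
        double x = rand() / (double)RAND_MAX;
        double y = rand() / (double)RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return (double)hits / (double)trials;   /* approximates pi/4 */
}
```

With 10^6 trials the estimate is typically within about 0.001 of π/4 ≈ 0.7854; the error shrinks only as 1/√trials, which is worth noting when you analyze the program's accuracy.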

## CS-3211: Parallel and Concurrent Programming

Tutorial 5 Week 16-21 Feb 2004

Regular questions
1. Analysis of divide-and-conquer method: Analyze the divide-and-conquer method of assigning one processor to each node in a tree for adding numbers (see textbook, sec. 4.1.2) in terms of communication, computation, overall parallel execution time, speedup, and efficiency.
2. Holes: Suppose you own a hole punch capable of putting a hole in an arbitrarily thick stack of paper. If you insert the paper into the hole punch and activate it, you will get a piece of paper with one hole in it. If you fold the paper in half before inserting it into the hole punch, you will have a piece of paper with two holes in it. If you can only use the hole punch once, how many times must you fold a piece of paper in order to put n holes in it? Prove that your answer is correct and optimal.
3. Smallest value with an arbitrary number of processes: Develop a divide-and-conquer algorithm that finds the smallest value in a set of n values in O(log n) steps using n/2 processors. What is the time complexity if there are fewer than n/2 processors?
4. Two variants of summation: Write a parallel program to compute the summation of n integers in each of the following ways and assess their performance. Assume that n is a power of 2.
   (a) Partition the n integers into n/2 pairs. Use n/2 processes to add together each pair of integers, resulting in n/2 integers. Repeat the method on the n/2 integers to obtain n/4 integers, and continue until the final result is obtained. (Binary tree algorithm.)

   (b) Divide the n integers into n/log n groups of log n numbers each. Use n/log n processes, each adding the numbers in one group sequentially. Then add the n/log n results using method (a).
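The binary-tree method (a) can be simulated sequentially as follows; this is a sketch of the combining pattern only, not the message-passing program itself. In each of log2(n) rounds, partners at distance `stride` are combined, halving the number of partial sums:

```c
/* Method (a): in a message-passing version, process i would receive
   the value held by process i + stride and add it to its own; here
   the array slots play the role of the processes. */
static long tree_sum(long *a, int n)     /* n must be a power of 2 */
{
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];       /* "process i" absorbs its partner */
    return a[0];
}
```

Method (b) would simply replace the leaves of this tree by sequential sums of log n numbers, trading processes for a longer first phase.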

5. Integration: Write a static-assignment parallel program to compute π using the formula ∫_0^1 4√(1 − x^2) dx = π, in each of the following ways:
   1. rectangular decomposition 1 (slide 4.22)
   2. rectangular decomposition 2 (slide 4.23)
   3. trapezoidal decomposition (slide 4.24)
   Analyze each method in terms of speed and accuracy.

## Additional, research-like question

Convex hull problem: Given a set of n points in a plane, develop an algorithm and a parallel program to find the points that are on the perimeter of the smallest convex region containing all the points. (See textbook, Ex. 4-22.)

## CS-3211: Parallel and Concurrent Programming

Tutorial 6 Week 23-28 Feb 2004

1. Analyze insertion sort:
   1. Compare the sequential and parallel, pipeline-like versions of insertion sort (textbook 5.3.2, slides 5.24-28) in terms of speedup and time complexity.
   2. Modify the method to work for a sequence of n numbers using p processes (for arbitrary n, p), then repeat the above question for this new version.
2. Pipeline programs for ordinary calculations:
   1. Develop a pipeline method to compute sin(θ) using the formula
      sin(θ) = θ − θ^3/3! + θ^5/5! − θ^7/7! + θ^9/9! − ...
      for a series of inputs θ_0, θ_1, θ_2, .... Repeat for cos(θ) and tan(θ).
   2. Write a parallel program using pipelining (pseudocode!) to compute the polynomial
      f(x) = a_0 x^0 + a_1 x^1 + ... + a_{n−1} x^{n−1}
      where the a_i, x, and n are inputs.
3. Radix sort: Radix sort is similar to the bucket sort described in lecture 4, but specifically uses the bits of the numbers to identify the bucket into which each number is placed. First the most significant bit is used to place the numbers into two buckets, say B0 or B1. Then the next most significant bit is used to place the numbers from B0 into two new buckets, say B00 or B01; similarly with B1. Repeat till the least significant bit is reached. Reformulate the method to become a pipeline solution, write a program (pseudocode!), and analyze its time complexity.
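The series for sin(θ) in question 2 pipelines naturally because each term is obtained from the previous one by a fixed multiplication. The sketch below simulates the stages as loop iterations; in the pipeline program each stage would be a process receiving (sum, term) from its left neighbour:

```c
#define STAGES 5   /* terms theta, -theta^3/3!, ..., +theta^9/9! */

/* Stage k adds its term to the running sum and produces the next
   stage's term by multiplying by -theta^2 / ((2k+2)(2k+3)), which
   turns theta^(2k+1)/(2k+1)! into the next series term. */
static double sin_series(double theta)
{
    double term = theta;     /* term held by stage 0 */
    double sum  = 0.0;
    for (int k = 0; k < STAGES; k++) {
        sum += term;
        term *= -theta * theta / ((2.0 * k + 2.0) * (2.0 * k + 3.0));
    }
    return sum;
}
```

The same incremental-term trick works for cos(θ), and the polynomial in question 2.2 pipelines the same way via Horner's rule, each stage computing s := s·x + a_k.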
1

4. Outer product of two vectors: The outer product of two vectors A = (a_0, ..., a_{n−1}) and B = (b_0, ..., b_{n−1}) is the n × n matrix C with entries C_{ij} = a_i b_j, that is,

       | a_0 b_0      ...  a_0 b_{n−1}     |
   C = |    .         ...       .          |
       | a_{n−1} b_0  ...  a_{n−1} b_{n−1} |

   Develop a pipeline implementation for the outer product of two vectors and analyze it.
5. Pipeline, sieve of Eratosthenes: Consider the following methods for implementing the sieve of Eratosthenes:
   1. By a pipeline approach (textbook 5.3.3; slides 5.29-33).
   2. By dividing the range of the numbers into m regions and assigning one region to each process to strike out multiples of prime numbers; use a master process to broadcast each already-found prime number to the processes.
   Write parallel programs (pseudocode!) for each method and estimate their time complexity.
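The pipeline sieve of question 5.1 can be simulated sequentially as below: each stage holds one prime and filters its multiples out of the stream, and a number surviving every stage founds a new stage. In the parallel version each stage is a process forwarding survivors to its right neighbour:

```c
#define MAX_STAGES 32

/* Feed 2, 3, ..., limit through the pipeline; returns the number of
   primes found and stores them (one per stage) in primes[]. */
static int pipeline_sieve(int limit, int primes[MAX_STAGES])
{
    int stages = 0;
    for (int x = 2; x <= limit; x++) {
        int k;
        for (k = 0; k < stages; k++)
            if (x % primes[k] == 0)      /* stage k filters x out */
                break;
        if (k == stages && stages < MAX_STAGES)
            primes[stages++] = x;        /* x starts a new stage */
    }
    return stages;
}
```

In the pipeline the stages work on different numbers at the same time, so the total time is governed by the stream length plus the pipeline depth rather than their product.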

## CS-3211: Parallel and Concurrent Programming

Tutorial 7 Week 1-6 Mar 2004

Regular questions
1. Partial barrier: Write a barrier, barrier(procno), which will block until procno processes reach the barrier and then release the processes. Allow for the barrier to be called with different numbers of processes and with different values for procno.
2. Implementations for tree and butterfly barriers: Implement the tree barrier described in Slide 6.9 (textbook 6.1.3) using individual send-receive routines. Analyze its time complexity. Repeat for the butterfly barrier described in Slide 6.10 (textbook 6.1.4).
3. Prefix calculation: Analyze the prefix calculation method described in Slides 6.17-19 (textbook p. 171) and determine its efficiency. Modify the method to work for m numbers and p processes [arbitrary m, p] and repeat the above question for this new version.
4. Strips vs. squares: In our presentation of the heat distribution problem (Slides 6.33-45, textbook 6.3.2) we supposed we have a square array. What are the mathematical conditions for choosing block or strip partition if the array is an m × n rectangle [arbitrary m, n]? Suppose the communication of a rectangle is proportional to its perimeter. Show that the square has the minimum communication of all rectangles of a fixed area (i.e., in the class of rectangles R of dimensions (x, y) such that x·y = a, for a fixed a).
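The data-parallel prefix calculation of question 3 can be simulated as below: in the round with stride 2^r, element i ≥ 2^r adds the value that element i − 2^r held at the start of the round, so after log2(N) rounds slot i holds a_0 + ... + a_i. The per-round snapshot mimics the synchronous parallel steps:

```c
#define N 8   /* a power of two for this sketch */

static void prefix_sum(long a[N])
{
    long snap[N];
    for (int stride = 1; stride < N; stride *= 2) {
        for (int i = 0; i < N; i++)          /* values at round start */
            snap[i] = a[i];
        for (int i = stride; i < N; i++)     /* all "processes" in parallel */
            a[i] = snap[i] + snap[i - stride];
    }
}
```

The modification for m numbers and p < m processes would give each process a block of m/p values, do a sequential prefix within each block, run this log-step scan on the p block totals, and finally add each process's incoming offset to its block.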

5. Second-largest key: Given a list of n keys a[0], ..., a[n − 1], design a parallel algorithm to find the second-largest key in the list. [Note: Keys do not necessarily have distinct values.]

## Additional, programming question

Game of Life: Write an MPI parallel program to simulate the Game of Life as described in the textbook, 6.3.3, and experiment with different initial populations. Try to implement both strip and square partitions and compare their performance on the Tembusu cluster.

## CS-3211: Parallel and Concurrent Programming

Tutorial 8 Week 8-13 Mar 2004

1. Implementation of the load-balancing line-structure technique: Implement the load-balancing line-structure technique (textbook, 7.2.3; Slides 7.17-19) and use it in one of your parallel programs.
2. Parallel Moore, centralized work pool approach: Implement parallel Moore's single-source shortest-path algorithm using the centralized work pool approach (textbook, p. 217; Slide 7.39). (Hint: Use a vertexQueue to store the tasks and a requestQueue to store the still-unsolved task requests. Then the master process will exit the main loop when vertexQueue is empty and requestQueue is full.)
3. Parallel Moore, load-balancing line structure: Implement Moore's algorithm using the load-balancing line-structure technique.
4. Parallel Moore, decentralized work pool approach: The decentralized work pool approach described in the textbook, Section 7.4, for searching a graph is inefficient in that processes are only active after their vertex is placed on the queue. Develop a more efficient work pool approach that keeps processes more active.
5. Parallel Dijkstra vs. parallel Moore: Write (in pseudocode) a load-balancing parallel version of Dijkstra's algorithm for searching a graph. Compare its performance with the performance of a corresponding load-balancing parallel version of Moore's algorithm.
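The sequential core of Moore's algorithm, which the work-pool versions above parallelize, is sketched below on a small hypothetical graph (the adjacency matrix and weights are illustrative, not from the textbook). In the centralized work pool, the master owns this queue and hands vertices to slave processes on request:

```c
#include <limits.h>

#define NV  5
#define INF INT_MAX

/* Moore's single-source shortest paths: repeatedly take a vertex
   from the queue and relax its outgoing edges; a vertex whose
   distance improves re-enters the queue as a new task.
   w[u][v] > 0 means an edge u -> v of that weight. */
static void moore(const int w[NV][NV], int src, int dist[NV])
{
    int queue[NV * NV], head = 0, tail = 0;   /* ample for this sketch */
    for (int v = 0; v < NV; v++) dist[v] = INF;
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < NV; v++)
            if (w[u][v] > 0 && dist[u] + w[u][v] < dist[v]) {
                dist[v] = dist[u] + w[u][v];  /* shorter path found */
                queue[tail++] = v;            /* v becomes a new task */
            }
    }
}
```

Unlike Dijkstra's algorithm, a vertex may be processed several times here, which is exactly why the queue maps so directly onto a work pool of independent tasks.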

## CS-3211: Parallel and Concurrent Programming

Tutorial 9 Week 15-20 Mar 2004

1: Analyze the code (using Bernstein's conditions)

```c
forall (i = 2; i < 6; i++) {
    x = i - 2*i + i*i;
    a[i] = a[x];
}
```

and determine whether any instances of the body can be executed simultaneously.

2: For the following code

```c
int a[100], b[100];
forall (i = 2; i < 6; i++) {
    forall (j = 1; j < 8; j++) {
        a[i] = b[2*j] + a[i+j];
        b[j] = b[i+2*j];
    }
}
```

find the instances of the body that can be executed simultaneously and provide a schedule that minimizes parallel execution time.

3: List all possible outputs when the following code is executed

```c
j = 0;
k = 0;
forall (i = 1; i <= 2; i++) {
    j = j + 10;
    k = k + 100;
}
printf("i=%i,j=%i,k=%i\n", i, j, k);
```

assuming that each assignment statement is atomic.

4: The following C-like parallel program is supposed to transpose a matrix:

```c
forall (i = 0; i < n; i++)
    forall (j = 0; j < n; j++)
        a[i][j] = a[j][i];
```

Explain why the code will not work and correct it.

5: Determine and explain how the following code for a barrier works (based upon the two-phase barrier given in textbook Section 6.1.3):

```c
void barrier()
{
    lock(arrival);
    count++;
    if (count < n) unlock(arrival);
    else unlock(departure);

    lock(departure);
    count--;
    if (count > 0) unlock(departure);
    else unlock(arrival);
    return;
}
```

Why is it necessary to use two lock variables, arrival and departure?
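One way to make the two-phase barrier of question 5 runnable is to realize lock/unlock with counting semaphores, an assumption about the intended semantics: `arrival` starts open (value 1) and `departure` starts closed (value 0), so the last arriver is the one that opens the departure phase:

```c
#include <pthread.h>
#include <semaphore.h>

#define NTHREADS 4

static sem_t arrival, departure;   /* init to 1 and 0 respectively */
static int count = 0;

static void barrier(void)
{
    sem_wait(&arrival);                   /* "lock(arrival)"   */
    count++;
    if (count < NTHREADS) sem_post(&arrival);
    else                  sem_post(&departure);

    sem_wait(&departure);                 /* "lock(departure)" */
    count--;
    if (count > 0) sem_post(&departure);
    else           sem_post(&arrival);    /* re-open for next use */
}
```

The two variables answer the question's last part: with only one, a fast thread could race around and re-enter the barrier while slow threads are still leaving, corrupting count; the closed departure phase pens arrivals of the next round until everyone from this round has left.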

## CS-3211: Parallel and Concurrent Programming

Tutorial 10 Week 22-27 Mar 2004

Regular questions
1: Modify the rank sort code given in Sec. 9.1.3

```c
for (i = 0; i < n; i++) {        /* for each number */
    x = 0;
    for (j = 0; j < n; j++)      /* count numbers less than it */
        if (a[i] > a[j]) x++;
    b[x] = a[i];                 /* copy number into correct place */
}
```

to cope with duplicates in the sequence of numbers (i.e., for it to sort in nondecreasing order).

2: The following is an attempt to code the odd-even transposition sort of Sec. 9.2.2 as an SPMD program (code for process P_i):
```c
evenprocess = (i % 2 == 0);
evenphase = 1;
for (step = 0; step < n; step++, evenphase = !evenphase) {
    if ((evenphase && evenprocess) || (!evenphase) && !(evenprocess)) {
        send(&a, P_{i+1});
        recv(&x, P_{i+1});
        if (x < a) a = x;        /* keep smaller number */
    } else {
        send(&a, P_{i-1});
        recv(&x, P_{i-1});
        if (x > a) a = x;        /* keep larger number */
    }
}
```

Determine whether the code is correct and, if not, correct it.

3: Implement (in pseudocode) shearsort (Sec. 9.2.3). Explain why log n + 1 phases are to be used.

4: Draw the exchange of numbers for quicksort on a hypercube (Sec. 9.2.6) using the algorithm based on Gray code ordering (Fig. 9.21). Illustrate the procedure on a particular set of numbers.

5: Draw the compare-and-exchange circuit configurations for the odd-even merge-sort algorithm described in Sec. 9.2.7 to sort 16 numbers. Sort a sequence of numbers by hand using the odd-even merge-sort algorithm.
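For question 4, the reflected binary Gray code is what orders the hypercube nodes so that consecutive ranks differ in exactly one bit, i.e., are hypercube neighbours:

```c
/* i-th reflected binary Gray code word */
static unsigned gray(unsigned i)
{
    return i ^ (i >> 1);
}

/* number of set bits, to check the one-bit-difference property */
static int popcount(unsigned x)
{
    int c = 0;
    while (x) { c += x & 1u; x >>= 1; }
    return c;
}
```

For a 3-cube the sequence is 000, 001, 011, 010, 110, 111, 101, 100; drawing the quicksort exchanges along this order keeps every transfer on a single hypercube link.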

## More questions - [no automatic allocation; send request email to kzhu]

6: Repeat the above problem 5 for bitonic merge-sort (Sec. 9.2.8).

7: Analyze the systolic array for matrix multiplication as described in Sec. 10.2.4, deriving equations for the computation and for the communication.

8: Develop a parallel program for convolution: given x_1, ..., x_{N+n−1} and w_1, ..., w_n, compute y_i = Σ_{j=1}^{n} x_{i+n−j} w_j, for i = 1, ..., N. (See the textbook for a more detailed explanation.)

9: Develop a linear pipeline solution of the Gauss-Seidel method described in Sec. 10.5.1 and write a pseudocode parallel program to implement it.

10: Derive the system efficiency when implementing Gaussian elimination with the strip partition and the cyclic partition, as described in Sec. 10.3.2.

## CS-3211: Tutorial 1 (solutions)
1. Suppose a galaxy has 10^11 stars. Estimate the time it would take to perform 100 iterations of the basic N-body algorithm using O(N^2) computations and a computer that is capable of 500 MFlops.

Solution: Each iteration takes 10^11 × 10^11 = 10^22 steps, so 100 iterations take 10^24 steps. The computer performs 500 × 10^6 = 5 × 10^8 floating-point operations per second. Hence the computation takes 10^24 / (5 × 10^8) = 2 × 10^15 seconds, which gives about 63,419,500 years.

Notice: As was pointed out at one tutorial, this is correct provided we suppose that each step takes 1 Flop. Otherwise the time is even larger.

2. Find the diameter of: (a) a torus; (b) a tree network; (c) a k-dimensional mesh.

Solution: (a) For an m × n torus (m lines and n columns), this is

d = ⌊n/2⌋ + ⌊m/2⌋

The reason is that we may go in both directions on a line (respectively, column), so the shortest distance between two nodes in the same line is at most ⌊n/2⌋. Similarly for the columns. As an example, for a 7 × 10 torus two points which realize this diameter are (1,1) and (4,6).

(b) In a (complete, balanced, binary) tree network, the longest (minimal) path is, for instance, between the left-most and the right-most leaves. If the tree has k levels, this is 2(k − 1).

We have to express this in terms of the number of the network's nodes. If the tree has k levels, then the number of vertices is 1 + 2 + 2^2 + ... + 2^(k−1) = 2^k − 1. If there are n nodes in the tree, this gives n = 2^k − 1, hence k = log2(n + 1). To conclude, the diameter is

d = 2(log2(n + 1) − 1)

Notice: If the branching degree r is not 2, but still constant, a similar result is obtained, but the logarithm is in base r. If the tree is not balanced or the branching degree may be different for different nodes, then the analysis is more complicated and less precise results are obtained.

(c) We suppose that the mesh is cubic, hence it has the same length in all directions. In a k-dimensional mesh with n nodes, each direction has length n^(1/k), and the greatest (minimal) distance is between two opposite corners. A path between them has to traverse all k directions, covering a length of n^(1/k) − 1 along each direction. Hence the result is

d = k(n^(1/k) − 1)

3. Look at the minimal-distance deadlock-free algorithm for hypercube networks described in the textbook, page 15. Apply it for: (a) a five-dimensional hypercube network from node 7 to node 22; (b) repeat for an 8 × 8 mesh, using its perfect embedding in a hypercube network.

Solution: (a) The binary representation of 7 is 00111 and of 22 is 10110. The algorithm requires: (i) to compute the exclusive or of the two addresses, which is 10001, and (ii) to traverse the hypercube along the directions having 1 in the result, in our case,

directions 1 and 5 (left-to-right). The obtained length-2 routing is: 7 = 00111 → 23 = 10111 → 22 = 10110.

(b) In the mesh, the routing algorithm is to go, say, first horizontally and then vertically from one node to the other. If the mesh is embedded in a hypercube, then this routing is different from the hypercube routing (generally it is longer), as the mesh has forgotten many of the hypercube links.

4. Determine how the largest complete binary tree can be embedded into a hypercube. What is the dilation of the mapping?

Solution: We may recursively define a perfect embedding as follows. If we know how to embed a k-level tree in an r-dimensional hypercube, then we take the (r + 2)-dimensional hypercube and map: the root of the tree to (0, 0, ..., 0), the left subtree into the (sub) r-dimensional hypercube (1, 0, *, ..., *), and the right subtree into the (sub) r-dimensional hypercube (0, 1, *, ..., *). This is a perfect embedding (one connection in the tree network is realized by one connection in the hypercube), but only a very small number of the nodes of the hypercube are used.

If we relax the condition to have a perfect embedding, sometimes it is possible to get an irregular embedding that wastes fewer nodes of the hypercube. E.g., a 3-level tree with the nodes represented by (), (0), (1), (00), (01), (10), (11) may be embedded in a 3-dimensional cube by the mapping: () → (0, 0, 0), (0) → (1, 0, 0), (1) → (0, 0, 1), (00) → (1, 1, 0), (01) → (1, 0, 1), (10) → (0, 1, 0), (11) → (1, 1, 1), having dilation 2.
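The dimension-ordered (e-cube) routing of question 3(a) can be sketched as follows: XOR the source and destination addresses, then correct the differing bits in a fixed order (here most-significant first, matching the left-to-right order used above):

```c
#define DIM 5

/* Route src -> dst on a DIM-cube; records the visited nodes in
   path[] and returns the number of hops (= popcount of src^dst). */
static int ecube_route(int src, int dst, int path[DIM + 1])
{
    int hops = 0, node = src;
    path[hops] = node;
    int diff = src ^ dst;                 /* bits where the two differ */
    for (int b = DIM - 1; b >= 0; b--)
        if (diff & (1 << b)) {
            node ^= 1 << b;               /* flip one differing bit */
            path[++hops] = node;
        }
    return hops;
}
```

Because every message corrects dimensions in the same fixed order, no cycle of channel dependencies can form, which is what makes the algorithm deadlock-free.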

5. What is the average distance between two nodes in: (a) a mesh network; (b) a hypercube?

Solution: 5(a) Take an arbitrary position (i, j) of an m × n mesh. The sum of the distances to the cells of the i-th line is

S = (j − 1) + (j − 2) + ... + 2 + 1 + 0 + 1 + 2 + ... + (n − j − 1) + (n − j) = (j − 1)j/2 + (n − j)(n − j + 1)/2

For a line which departs from the i-th line by k lines we have to add kn (an extra k appears for each of the n cells), hence the total sum of the distances from cell (i, j) to all the other ones is

S_{i,j} = [S + (i − 1)n] + [S + (i − 2)n] + ... + [S + n] + [S] + [S + n] + ... + [S + (m − i)n]
        = mS + n(i − 1)i/2 + n(m − i)(m − i + 1)/2
        = m[(j − 1)j/2 + (n − j)(n − j + 1)/2] + n[(i − 1)i/2 + (m − i)(m − i + 1)/2]

Here we may apply two different, but equivalent, methods:

Method 1: Find the total length of all paths and divide by the number of paths. (This is a simple general method.)

Method 2: Find the average of the lengths from one cell to the other ones, then take the average of these results over all the cells. (The number of paths counted for each vertex is the same, so a simple, non-weighted, average is enough.)

By the first method, using Σ_{j=1}^{n} [(j − 1)j/2 + (n − j)(n − j + 1)/2] = (n − 1)n(n + 1)/3, the total sum of the lengths is

S_t = Σ_{i,j} S_{i,j}
    = m^2 (n − 1)n(n + 1)/3 + n^2 (m − 1)m(m + 1)/3
    = (mn/3)(mn − 1)(m + n)

The number of paths is N = C(mn, 2) = mn(mn − 1)/2, but each path is counted twice in the above sum (once for each endpoint), hence the average is

A = S_t / (2N) = (m + n)/3

Nice formula... Maybe there is a different, simpler proof...

5(b): In a hypercube all vertices are equivalent, so it is enough to count the average path length for one vertex only. If we start with vertex 00...0 (k zeros, for a k-dimensional hypercube), then the distance to an arbitrary vertex is given by the number of 1s in its representation. The total sum of the lengths to the other vertices is then

S = 1·C(k, 1) + 2·C(k, 2) + ... + k·C(k, k)

This sum may be computed by taking the derivative of the well-known identity

(1 + x)^k = 1 + C(k, 1)x + C(k, 2)x^2 + ... + C(k, k)x^k

The derivative is

k(1 + x)^(k−1) = 1·C(k, 1) + 2·C(k, 2)x + ... + k·C(k, k)x^(k−1)

Our sum is the right-hand side of this identity at x = 1, hence

S = k · 2^(k−1)

The number of vertices (different from 00...0) is 2^k − 1, hence the average path length (for 00...0 and also for the whole hypercube) is

A = S / (2^k − 1) = k · 2^(k−1) / (2^k − 1)

For large k a good approximation of this is k/2.

5(c): Try to find the average distance between two vertices of a tree network.
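Both closed forms can be checked by brute force on small instances, which is a useful sanity test for derivations like the ones above: (m + n)/3 for the mesh (averaging over distinct unordered pairs) and k·2^(k−1)/(2^k − 1) for the k-cube:

```c
/* number of set bits */
static int popcount(unsigned x)
{
    int c = 0;
    while (x) { c += x & 1u; x >>= 1; }
    return c;
}

/* 3 * (sum of Manhattan distances over unordered pairs) in an
   m x n mesh; the formula predicts this equals (m+n) * C(mn, 2). */
static long mesh_3sum(int m, int n)
{
    long s = 0;
    for (int a = 0; a < m * n; a++)
        for (int b = a + 1; b < m * n; b++) {
            int di = a / n - b / n, dj = a % n - b % n;
            s += (di < 0 ? -di : di) + (dj < 0 ? -dj : dj);
        }
    return 3 * s;
}

/* sum of Hamming distances from vertex 0 in a k-cube;
   the derivation above says this is k * 2^(k-1). */
static long cube_sum(int k)
{
    long s = 0;
    for (unsigned v = 1; v < (1u << k); v++)
        s += popcount(v);
    return s;
}
```

Running these for a few small m, n, and k confirms both formulas exactly, not just asymptotically.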