
Programming of Parallel Computers 2010-03-24

Compulsory Assignment 1

Implementing dense matrix multiplication in a parallel distributed memory environment

1 Problem setting

Let two dense matrices A, B ∈ R^{n×n} be given, i.e., A = (a_ij) and B = (b_ij), i, j = 1, 2, …, n. Let C = AB. The elements of C are given by

    c_ij = sum_{k=1}^{n} a_ik b_kj,    i, j = 1, 2, …, n.
The task is to design a matrix-matrix multiplication procedure for performing the multiplication in a distributed memory parallel computer environment and implement the algorithm using Fortran/C/C++ and MPI. The implementation must be independent of the size of the matrices Ò and the number of processors Ô to be used, thus, Ò and Ô must be input parameters.

2 Data generation and partitioning strategies

Create your matrices with entries initialized as random double-precision numbers on a single processor, and distribute the elements according to the given partitioning strategy (see below).

[Figure: p1 × p2 grid of blocks A(0,0), A(0,1), …, A(p1−1,p2−1).]

Figure 1: Two-dimensional data partitioning.

As shown in Figure 1, the processors are assumed to form a 2D Cartesian grid where p = p1 × p2. It is also assumed that n is divisible by both p1 and p2; let r1 = n/p1 and r2 = n/p2 denote the local block dimensions. For simplicity you can also assume, or restrict to, the case p1 = p2 = √p.


Assume that the matrices A, B, and C are partitioned in blocks as shown in Figure 1, i.e., the blocks A(s,r), B(s,r), and C(s,r), s = 0, 1, …, p1−1, r = 0, 1, …, p2−1, are local to processor P(s,r).

3 Fox's algorithm

Processor P(s,r) must compute

    C(s,r) = sum_{q=0}^{√p−1} A(s,q) B(q,r),

where A(s,q) B(q,r) is a matrix-matrix multiplication of two blocks. Processor P(s,r) therefore needs all blocks A(s,:) (in row s) and all blocks B(:,r) (in column r), which requires interprocessor communication. The communication and computations can be organized as follows (Fox's algorithm). For each step k = 0, 1, …, √p − 1:

1. Broadcast block A(s,m) within each block-row s, where m = (s + k) mod √p.
2. Multiply the broadcast block with the B-block held in each processor: C(s,r) = C(s,r) + A(s,m) B(m,r).
3. Shift the blocks of B one step cyclically upwards.

4 MPI implementation

To get an efficient MPI implementation you are required to use a Cartesian 2D processor topology and to create separate communicators for the rows and the columns. In phase 1, utilize the row-communicators by using collective communication, i.e., MPI_Bcast. Then overlap the communication in phase 3 with the computations in phase 2 by using non-blocking send and receive within the column-communicators. (Start the communication of the B-blocks before phase 2 and then, in phase 3, wait for these communication operations to finish.)

5 Numerical experiments

Check the correctness of your algorithm by applying it to two simple matrices. Then use your algorithm to compute C = AB for matrices of different sizes. Go up in matrix size from 100 × 100 to 1440 × 1440 with a few intervals. For each matrix size, vary the number of processors and measure speedup. Make several runs for each problem size and processor configuration and report the lowest run-time. Pay attention to your results and note any abnormalities.
6 Coaching session

You are welcome to ask about the assignment at any time, but we would also like to meet all groups in coaching sessions at an early stage. The intention is to see that you are on the right track and to correct any misunderstandings. Before the coaching session you should have started the implementation of the assignment, but there is no requirement that the program can run (or even compile correctly). Please contact us to book a time for the coaching session.

7 Writing a report on the results

The report can be written in Swedish or English, but you should use a word processor or a typesetting system (e.g., LaTeX, Word, StarOffice). The plots can be done using Matlab, Maple, or any other graphical tool. The report should cover the following issues:

1. Problem description, presenting the task.
2. Solution method, a description of the parallel implementation.
3. Results, presenting plots of speedup.
4. Discussion, with observations and comments on the results, explanations of the results, and ideas for possible optimizations or improvements (can also be included in 3 above).
5. Conclusions.
6. Appendix, with a listing of the program code and tables of the numerical results.

The last day to hand in the report is May 5, 2010. The assignments are a part of the examination; this means that you should protect your source code from others and you may not copy others' solutions. (See http://www.it.uu.se/edu/policydokument.)
