Advanced Parallel Computing for Scientific Applications

Prof. I. F. Sbalzarini, ETH Zentrum, CAB G34, CH-8092 Zürich
Prof. P. Arbenz, ETH Zentrum, CAB H89, CH-8092 Zürich
Autumn Term 2010

Exercise 3
Release: 12 Oct. 2010    Due: 26 Oct. 2010


1 Practice in C/C++

The following two assignments illustrate the effects of caching in matrix operations. C uses row-major memory layout for storing matrices and higher-dimensional arrays. Hence, row-wise access of the elements is more cache-efficient than column-wise access.
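As a small illustration of this point (the function names here are illustrative and not part of the exercise files), row-major layout means that element (i, j) of an m × n matrix stored in a flat array sits at index i*n + j, so only a loop whose inner index is j walks through memory with stride 1:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major indexing: element (i, j) of an m x n matrix stored in a flat
// array sits at offset i * n + j. Consecutive j values are adjacent in
// memory, so one cache line serves several iterations of the inner loop.
double sum_row_wise(const std::vector<double>& a, std::size_t m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < m; ++i)      // outer loop over rows
        for (std::size_t j = 0; j < n; ++j)  // inner loop strides by 1
            s += a[i * n + j];
    return s;
}

// The same sum, but the inner loop strides by n elements: for large n every
// access may touch a different cache line, which is much slower.
double sum_column_wise(const std::vector<double>& a, std::size_t m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < m; ++i)  // inner loop strides by n
            s += a[i * n + j];
    return s;
}
```

Both functions compute the same value; only the traversal order, and hence the cache behaviour, differs.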

Question 1: Matrix multiplication
The file matrixMult.cpp contains a program that measures the execution time of the matrix multiplication

    C = A · B

Each matrix is stored as a 1-D array. The multiplication is performed in the method void Multiply(...). To calculate each element of C, the elements of A are accessed row-wise and those of B column-wise, resulting in many cache misses, especially for large matrix sizes. Better cache usage can be achieved if the matrix B is transposed and the matrix multiplication is modified accordingly so that it gives the same result as before. Your task is to implement the methods void InPlaceTranspose(...) and void MultiplyEfficient(...). Compile your code using the default GNU compiler:

    g++ -o mult matrixMult.cpp

Do you observe better performance in the case of large matrices?
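A minimal sketch of the idea for square matrices (the actual signatures in matrixMult.cpp may differ; this only shows the access-pattern transformation): after transposing B, the inner product for C(i, j) reads both operands row-wise.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// In-place transpose of a square n x n matrix stored row-major in a flat
// array: swap the elements above the diagonal with those below it.
void InPlaceTranspose(std::vector<double>& b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            std::swap(b[i * n + j], b[j * n + i]);
}

// C = A * B, assuming bt already holds B transposed. Since B(k, j) == Bt(j, k),
// the inner loop over k now walks both a and bt with stride 1, so both
// operand streams are cache friendly.
void MultiplyEfficient(const std::vector<double>& a,
                       const std::vector<double>& bt,
                       std::vector<double>& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * bt[j * n + k];
            c[i * n + j] = sum;
        }
}
```

The transpose costs O(n^2) extra work, which is negligible against the O(n^3) multiplication for large n.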

Question 2: Matrix norm
You have to calculate the 1-norm and the infinity norm of a matrix A of size m × n, given by

    ||A||_1 = max_{1 <= j <= n} sum_{i=1}^{m} |a_ij|

    ||A||_∞ = max_{1 <= i <= m} sum_{j=1}^{n} |a_ij|

Implement the above equations in the appropriate methods double Norm_1() and double Norm_Inf() in the file matrixNorm.cpp. Count the floating point operations in the calculation and compute Mflop/s for different matrix sizes n. Compile your code using:

    g++ -o norm matrixNorm.cpp

Which of the above norms is calculated faster and why?

In each of the above examples, time measurement is done using the method double walltime(...), which is implemented in the file walltime.h. Please submit the jobs to the batch queue as follows:

    bsub -o <op_file> ./<executable>


2 Introduction to OpenMP

1. OpenMP is an application programming interface that provides a parallel programming model for shared memory and distributed shared memory multiprocessors.
2. OpenMP is based on the Fork/Join execution model: an OpenMP program starts as a single thread (the master), and additional threads are created when the master hits a parallel region.
3. The directive #pragma omp parallel in the program marks the beginning of a parallel section.
4. The number of threads is fixed a priori by the programmer using the environment variable OMP_NUM_THREADS.
5. omp_get_num_threads() and omp_get_thread_num() can be used to get the number of threads created and the local number assigned to each thread.
6. There is a standard include file omp.h for C/C++ OpenMP programs.
7. The keywords used for distributing work among threads are for, sections, critical, etc.

Question 3: First OpenMP program
Using the above information, write a simple program that creates n = 2, ..., 6 threads, where each thread displays one of the following messages along with its own thread number:

    Hello World
    This is Advanced Parallel Computing tutorial
    This is the first OpenMP program

Compile the program using the GNU compiler as follows:

    gcc -lgomp -fopenmp -o omp1 omp1.c

Question 4: Work sharing among OpenMP threads
The file vectorAdd.cpp contains the code for serial and parallel execution of the SAXPY operation, along with time measurement.

a) Compile the program and execute it using, say, 4 threads. Why is there no speedup? Modify the code in order to achieve appreciable speedup.

b) Write a code to calculate the dot product a · b in parallel.

You may submit the jobs to the batch queue as follows:

    bsub -n N -o <op_file> ./<executable>

where N is the number of processors. Do not forget to set the environment variable OMP_NUM_THREADS before execution.
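The OpenMP constructs above can be sketched in a single self-contained example (function names are illustrative, not taken from the exercise files; it assumes compilation with -fopenmp, and the _OPENMP guards let the same code also build and run serially):

```cpp
#include <cstdio>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Each thread in the parallel region prints a greeting tagged with its
// thread number (the Question 3 pattern). Returns the team size.
int hello_threads() {
    int nthreads = 1;
#ifdef _OPENMP
#pragma omp parallel
    {
#pragma omp critical  // serialise printing so lines do not interleave
        std::printf("Hello World from thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
#pragma omp single    // one thread records the team size
        nthreads = omp_get_num_threads();
    }
#else
    std::printf("Hello World from thread 0 of 1\n");
#endif
    return nthreads;
}

// Parallel dot product via an OpenMP reduction: each thread accumulates a
// private partial sum, and the reduction clause combines the partial sums
// at the end of the loop. Without -fopenmp the pragma is ignored and the
// loop simply runs serially.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        sum += a[i] * b[i];
    return sum;
}
```

A typical run would be, e.g., `g++ -fopenmp -o omp_sketch omp_sketch.cpp` followed by `OMP_NUM_THREADS=4 ./omp_sketch`. The reduction clause is what makes the dot product correct: a plain shared `sum +=` inside the loop would be a data race.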