
Vivian Liu (cs61c-ec), Alexander Javad (cs61c-dr)

Part 2

A brief description of any changes you made to your code from part 1 in order to get it to run well on a range of matrix sizes (max 150 words)

We implemented cache blocking to improve throughput for larger matrices, and padded each matrix so that the fringes would not need a separate case. The padded size of a matrix was its original size rounded up to the next multiple of the block size. We also kept 13 sum vectors in local variables at a time to improve performance. Matrices whose size was already divisible by 32 were not padded; we simply used a block size of 32 for those cases. We placed the entire body of the code inside a #pragma omp parallel block and put #pragma omp for directives in front of the outer for loops.
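As a rough illustration of the structure described above (a minimal sketch, not our actual kernel: the helper names pad_size and sgemm_blocked are invented, column-major storage and a result of C = A * B are assumed, and the 13 local sum vectors are collapsed into an ordinary inner loop). Something like gcc -O3 -std=c99 -fopenmp is assumed for compilation.

/*
 * Sketch of padding + cache blocking + OpenMP work sharing.
 * Illustrative only; names and layout are assumptions, not the
 * submitted sgemm-openmp.c code.
 */
#include <stdlib.h>
#include <string.h>

#define BLOCK 32

/* Round n up to the next multiple of the block size; sizes already
 * divisible by 32 are left unchanged. */
static int pad_size(int n) {
    return (n % BLOCK == 0) ? n : (n / BLOCK + 1) * BLOCK;
}

/* Compute C = A * B for column-major n x n matrices via padded copies. */
static void sgemm_blocked(int n, const float *A, const float *B, float *C) {
    int p = pad_size(n);
    float *Ap = calloc((size_t)p * p, sizeof(float));
    float *Bp = calloc((size_t)p * p, sizeof(float));
    float *Cp = calloc((size_t)p * p, sizeof(float));

    #pragma omp parallel
    {
        /* Copy into zero-padded buffers so the blocked loops below
         * never need a fringe case. */
        #pragma omp for
        for (int j = 0; j < n; j++) {
            memcpy(Ap + j * p, A + j * n, n * sizeof(float));
            memcpy(Bp + j * p, B + j * n, n * sizeof(float));
        }

        /* Work-share the outermost block loop across threads; each
         * thread writes a disjoint set of columns of Cp. */
        #pragma omp for
        for (int jb = 0; jb < p; jb += BLOCK)
            for (int kb = 0; kb < p; kb += BLOCK)
                for (int ib = 0; ib < p; ib += BLOCK)
                    for (int j = jb; j < jb + BLOCK; j++)
                        for (int k = kb; k < kb + BLOCK; k++) {
                            float b = Bp[k + j * p];
                            for (int i = ib; i < ib + BLOCK; i++)
                                Cp[i + j * p] += Ap[i + k * p] * b;
                        }

        /* Copy the valid n x n region back out, discarding the padding. */
        #pragma omp for
        for (int j = 0; j < n; j++)
            memcpy(C + j * n, Cp + j * p, n * sizeof(float));
    }

    free(Ap);
    free(Bp);
    free(Cp);
}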

A brief description of how you used OpenMP pragmas to parallelize your code in sgemm-openmp.c (max 150 words)

A plot showing the speedup of sgemm-all.c over sgemm-naive.c for values of N between 64 and 1024
[Plot: speedup of sgemm-all over sgemm-naive; GFlops/s vs. n (64 to 1024) for sgemm-naive and sgemm-all]

A weak scaling plot of the performance of your sgemm-openmp.c code (use your sgemm-all.c code as the baseline for the single-threaded case)

[Plot: weak scaling of sgemm-openmp; GFlops/s vs. number of threads, 1 to 16, with n = number of threads * 64]

A strong scaling plot of the performance of your sgemm-openmp.c code

[Plot: strong scaling of sgemm-openmp; GFlops/s vs. number of threads, 1 to 16, with n = 512]
