Master in High Performance Computing Advanced Parallel Programming Mpi: Rma and I/O

Master in High Performance Computing
Advanced Parallel Programming

MPI: RMA and I/O
April 15, 2020
1 Exercise description
This MPI Lab covers One-Sided Communication techniques included in MPI 3 standard: Generalized Active
Target Synchronization (PSCW) and Passive Target Synchronization (Lock/Unlock). Write a small report in
English or Spanish addressing the tasks described in the following points:
1. Generalized Active Target Synchronization (PSCW)

For this exercise, we will modify a parallel version of the Conway’s Game of Life. In Appendix A you
can see an explanation of the Game of Life and the algorithm implemented in gameoflife seq.c and
gameoflife mpi.c.
In the data directory there are test input files with the following dimensions:
• gol grow 40 80.input (Grow pattern of size 40 × 80)

• gol bell 1K.input (Bell Gun pattern of size 1024 × 1024)
• gol max 4K.input (Max pattern of size 4096 × 4096)
• gol 16384.input (Random pattern of size 16384 × 16384)
For the sake of simplicity, we set the precondition that the number of rows/columns must be divisible
by the number of processes in the corresponding dimension. Make sure that the output of the parallel
versions (checksum and output files) are the same regardless of the number of processes used.
(a) Read carefully the Appendix A and understand the algorithm and its communications. Test both se-
quential (gameoflife seq.c) and basic parallel implementations (gameoflife mpi.c) using several input
matrices and number of iterations.
(b) Starting from the MPI implementation, think about using RMA functions (active target using PSCW)
instead of blocking operations for (a) the communications within the main loop and (b) the reduction
of the global checksum. Discuss the suitability of RMA for this exercise.
You can test this locally in your computer or in any multiprocessor architecture of your choice.
2. Passive Target Synchronization (Lock/Unlock)

Consider the source code in dyn balance.c. It reads a list of numbers from the standard input (process
0), that represents the workload of a set of tasks. Despite the name of the file, workload is balanced
statically by assigning tasks to processes cyclically. You can try it by running ”mpirun -np NPROCS
dyn balance < data/workloads”.
Given the heterogeneous computational workload of the tasks, which may vary in several orders of mag-
nitude, a cyclic or a block distribution is not the best choice here. Also, we assume that we don’t know
the workload in advance, so any static balancing algorithm won’t perform any better.
Your task is to design and implement a dynamic workload balancing algorithm that solves the problem in
run time, either centralized or distributed, using Passive Target Synchronization RMA. Remember that
Lock/Unlock are not mutexes nor a way to establish a critical section, but there are a couple of atomic
operations that can be useful here: MPI Compare and swap and MPI Fetch and op.
1
(a) Design an algorithm for balancing the workload dynamically. Before modifying the source code,
describe briefly how would you improve the original distribution using Passive Target Synchronization
RMA. Think, from a theoretical point of view, how your algorithm compares to the original one in
terms of performance.
(b) Implement your algorithm and check if your design and prediction were correct or not. In the case
you fail either in the design or the expectations, don’t worry. Identify the misconceptions/mistakes
and suggest a fix for them.
A Conway’s Game of Life

The Conway’s Game of Life (CGL) is a cellular automaton devised by the British mathematician John Horton
Conway in 1970. The game evolution is determined by its initial state, requiring no further input. One interacts
with the Game of Life by creating an initial configuration and observing how it evolves, or, for advanced players,
by creating patterns with particular properties (see Section A.2).
A.1 Rules
The universe of the Game of Life is an infinite, two-dimensional orthogonal grid of square cells, each of which
is in one of two possible states, alive or dead, (or populated and unpopulated, respectively).
Every cell interacts with its eight neighbours, which are the cells that are horizontally, vertically, or diagonally
adjacent.
At each step in time, the following transitions occur:
• For a space that is ’populated’:
– Each cell with less than two neighbors dies, as if by solitude.

– Each cell with two or three neighbors lives on to the next generation.
– Each cell with four or more neighbors dies, as if by overpopulation.
• For a space that is ’empty’ or ’unpopulated’
2
– Each cell with three neighbors becomes populated, as if by reproduction.
Below you can see an example of transition (evolution) from one state to the next one after applying the
previous rules:
A.2 Patterns
Given these simple rules, people usually try to find initial states or patterns for the game that present a particular
behavior while evolving. The most typical examples are the following:
• Oscillator: Pattern that is a predecessor of itself. That is, it is a pattern that repeats itself after a fixed
number of generations (known as its period). For example, the oscillator named pulsar has a period of 3:
• Still life: Pattern that do not change over time (particular kind of oscillator of period 1).
Block Hive Loaf Boat tie Ship tie

• Spaceships: Finite pattern that re-appears after a fixed number of generations displaced by a non-zero
amount. Therefore, the pattern (spaceship) will move around the game space until it hits another structure.
3
A.3 Turing Completeness
By applying specific patterns, like the ones listed before and several others, and given the infinite space (mem-
ory), it has been proven that CGL is Turing Complete.
You can find a Universal Turing Machine pattern made for CGL by Paul Rendell in the following article:
https://www.ics.uci.edu/~welling/teaching/271fall09/Turing-Machine-Life.pdf.
So now, just try to imagine what you could create with the Game of Life if you had time, computational
memory, and patience enough.
B Computing the Conway’s Game of Life

The infinite space of the theoretical CGL is somehow a problem for today’s machines. For us, the universe will
be a two-dimensional orthogonal periodic grid. It means that border cells are neighbors of the opposite border.
First, we define what a state is for us. A state contains a finite space of size r ows by cols. Additionally,
also contains the generation count, a checksum and third extra variable called halo.
1 typedef struct {
2 int rows ; /* no . of rows in grid */
3 int cols ; /* no . of columns in grid */
4 char ** space ; /* a pointer to a list of pointers for storing
5 a dynamic NxM grid of cells */
6 char * s_tmp ; /* auxiliary space for the evolution process */
7 long generation ; /* current state generation */
8 long checksum ; /* evolution checksum */
9 int halo ; /* ( boolean ) space contains halo rows / columns */
10 } state ;
C hecksum is the count of differences between state of consecutive generations, and it is useful for us to check
the correctness of our parallel implementations. The fastest way to check whether 2 implementations or runs
behave the same, is to ensure that both reach the same checksum after running from the same initial state and
the same number of generations.
H alo is a boolean variable that indicates whether the allocated space contains halo rows/columns or not. If
true, it means that the universe space was actually allocated for (rows + 2) by (cols + 2) cells. Since the universe
is cyclic, the borders of the space must check the opposite borders in order to count the neighbors. Also, it is
possible that the owner of the state does not have the whole (finite) space, but just a subset of it. While non of
these are problems for sequential execution, if we split the CGL space into several processes, it is possible that
we do not have access to the actual neighbors locally, and we need a place to store those neighbors.
The main algorithm (pseudocode) is as simple as follows:
1 initialize state s
2
3 /* * encapsulated in game () procedure * */

4 // main loop
4
5 while ( s . generation < MAX_GENERATIONS )
6 {
7 evolve ( s )
8 }
9 /* * end of game () procedure */
10
11 display final state and / or checksum

12 write bmp image
The evolve function, calculates the next generation of the universe at the current state (s→space). Therer
are 2 loops (y,x) that go across the space matrix. For each cell the algorithm visits its neighborhood using 2
inner loops (y1,x1) and counts the alive cells. Afterwards, it computes the new state for the cell [y, x] as a
function of n and the current state.
1 void evolve ( state * s ) // active ( current ) state [ in , out ]
2 {
3 char * temp_ptr = s - > s_temp ;
4 // for each cell (y , x )
5 for ( int y = 0; y < s - > rows ; y ++)
6 {
7 for ( int x = 0; x < s - > cols ; x ++)
8 {
9 // count neighbors ( n )
10 int n = 0;
11 if (s - > space [ y ][ x ]) n - -;
12 for ( int y1 = y - 1; y1 <= y + 1; y1 ++)
13 {
14 for ( int x1 = x - 1; x1 <= x + 1; x1 ++)
15 {
16 n += (s - > space [( y1 + s - > rows ) % s - > rows ]
17 [( x1 + s - > cols ) % s - > cols ]);
18 }
19 }
20 // apply CGL rules as a function of n and the current state
21 * temp_ptr = ( n == 3 || ( n == 2 && s - > space [ y ][ x ]));
22 ++ temp_ptr ;
23 }
24 }
25
26 // copy the new state to the active state

27 memcpy (s - > space [0] , s - > s_temp , s - > rows * s - > cols );
28 }
Note that line 16 uses a modulus operation to account for the periodicity of the grid. That works when the
process running the algorithm owns the whole space matrix. However, when running the algorithm in parallel,
(1) the process does not contain locally the state of the opposite cell for at least one dimension, and (2) the
dimensions of the space (s → rows and s → cols) are the dimensions of the local space (i.e., the whole space
divided by the number of processors).
C Parallel approach
In order to split the workload of each iteration among several processes, we need to break the game space into
equally sized blocks. However, after splitting the game space in such a way, the previous evolve algorithm will
not work anymore. Instead of using the modulus operation, we define extra rows and columns at the borders
of the space, that contain the neighborhood data for the border cells.
5
Thus, we need to update the algorithm to account for the possibility of using halo rows:
1 initialize state s
2
3 /* * encapsulated in game () procedure * */

4 // main loop
5 while ( s . generation < MAX_GENERATIONS )
6 {
7 /* update halo ( rows , columns and corners ):
8 * local : copy cells from the opposite side of the space
9 * remote : get the neighbor cells from neighbor processes */
10 update halo ( s . space )
11
12 /* evolves the - local - space */

13 evolve ( s )
14 }
15 /* * end of game () procedure */
16
17 reduce local checksums to global checksum ( sum )

18 display final checksum
19
20 write bmp image in parallel

Also, the evolve function looks slightly different. Now, if halo == 1, the 2 nested loops that go across the
matrix have a range of [1, rows] and [1, cols] instead of [0, rows) and [0, cols). Therefore, the 2 inner nested
loops for the neighborhood (y1,x1) have a range of [0, rows + 1] and [0, cols + 1], which is the allocated space
for the game state plus the halo.
Line 19 in the algorithm below is equivalent to line 16 in the sequential algorithm, but checks the halo cells
instead of using the modulus operation.
1 void evolve ( state * s ) // active ( current ) state [ in , out ]
2 {
3 assert ( halo == 1);
4
5 char * temp_ptr = s - > s_temp ;

6 // for each cell (y , x )
7 for ( int y = halo ; y < h + halo ; y ++)
8 {
9 for ( int x = halo ; x < w + halo ; x ++)
10 {
11 int n = 0 , y1 , x1 ;
12
13 // count neighbors ( n )
14 if (s - > space [ y ][ x ]) n - -;
15 for ( y1 = y - 1; y1 <= y + 1; y1 ++)
16 {
17 for ( x1 = x - 1; x1 <= x + 1; x1 ++)
18 {
19 n += s - > space [ y1 ][ x1 ];
20 }
21 }
6
22 // apply CGL rules as a function of n and the current state
23 * temp_ptr = ( n == 3 || ( n == 2 && s - > space [ y ][ x ]));
24 checksum += s - > space [ y ][ x ] != * temp_ptr ;
25 ++ temp_ptr ;
26 }
27 }
28 ...
29 }

Master in High Performance Computing Advanced Parallel Programming Mpi: Rma and I/O

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Master in High Performance Computing Advanced Parallel Programming Mpi: Rma and I/O

Uploaded by

Copyright:

Available Formats

Master in High Performance Computing

Advanced Parallel Programming

April 15, 2020

1. Generalized Active Target Synchronization (PSCW)

• gol grow 40 80.input (Grow pattern of size 40 × 80)

2. Passive Target Synchronization (Lock/Unlock)

A Conway’s Game of Life

At each step in time, the following transitions occur:

• For a space that is ’populated’:

– Each cell with less than two neighbors dies, as if by solitude.

Block Hive Loaf Boat tie Ship tie

B Computing the Conway’s Game of Life

3 /* * encapsulated in game () procedure * */

11 display final state and / or checksum

26 // copy the new state to the active state

3 /* * encapsulated in game () procedure * */

12 /* evolves the - local - space */

17 reduce local checksums to global checksum ( sum )

20 write bmp image in parallel

5 char * temp_ptr = s - > s_temp ;

You might also like