
Domain Decomposition With MPI

Numerical Problem
• Transient Large Eddy Simulation (LES) of turbulence
• Finite difference scheme
• Three-dimensional rectangular domain
• Uniform domain decomposition in each direction – every CPU gets an equal-size sub-domain

MPI Communications
• Computation of variable derivatives – buffer neighboring CPU
• Filtering of variables – buffer neighboring CPU
• Implicit solution of variables, i.e. iteration of coupled equations – buffer neighboring CPU while converging for each time step
• Determination of maximum error while converging – scalars
• Determine maximum variable values each time step – scalars
• Post processing
Motivation

 f  f i  2  8 f i 1  8 f i 1  f i  2
4th Order First Derivative: 
x 12 Δ
Required buffer width = 2

4th Order Second Derivative:
$\dfrac{\partial^2 f}{\partial x^2} \approx \dfrac{-f_{i+2} + 16 f_{i+1} - 30 f_i + 16 f_{i-1} - f_{i-2}}{12\,\Delta^2}$
Required buffer width = 2

8th Order Filter:
$\hat{f}_i = 0.7265625\, f_i + 0.21875\,(f_{i+1} + f_{i-1}) - 0.109375\,(f_{i+2} + f_{i-2}) + 0.03125\,(f_{i+3} + f_{i-3}) - 0.00390625\,(f_{i+4} + f_{i-4})$

Required buffer width = 4
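
The required buffer width follows from the widest stencil applied across a sub-domain boundary. As a minimal sketch (not the original code; the array names, PARAMETER values and grid spacing DX are illustrative assumptions), evaluating the 4th-order first derivative at the first owned point I = 1 reaches back to I-2, so two ghost planes suffice for the derivatives, while the 8th-order filter reaches I-4 and therefore needs four:

C     Illustrative sketch: F carries NB ghost planes on each side in x,
C     filled beforehand by the buffer exchange with the neighboring CPUs.
C     DX = uniform grid spacing.
      PARAMETER (NB=4, NXP=32, NY=32, NZ=32)
      REAL F(1-NB:NXP+NB,NY,NZ), DFDX(NXP,NY,NZ)
      DO K=1,NZ
      DO J=1,NY
      DO I=1,NXP
C     At I=1 the stencil uses F(-1,J,K), i.e. the second ghost plane
      DFDX(I,J,K)=(-F(I+2,J,K)+8.0*F(I+1,J,K)
~     -8.0*F(I-1,J,K)+F(I-2,J,K))/(12.0*DX)
      ENDDO
      ENDDO
      ENDDO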

Development

1 Dimensional Domain Decomposition


• NX, NY, NZ = grid points in the x, y and z directions
• NX/CPUs = NXP must be an integer > 2 × NB
• NXP = grid points on 1 CPU in the x direction
• f(NXP,NY,NZ) = array dimension on each CPU
• buf(NB,NY,NZ) = buffer dimensions at each boundary
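
A minimal declaration sketch of this 1-dimensional decomposition (the PARAMETER values and the names F, BUFL, BUFR are illustrative assumptions, not taken from the original code):

C     1-D decomposition in x: each CPU owns NXP = NX/NPROC planes and
C     exchanges NB-deep buffers with its two x-neighbors.
      PARAMETER (NX=256, NY=128, NZ=128, NPROC=8, NB=4)
      PARAMETER (NXP=NX/NPROC)
      REAL F(NXP,NY,NZ)
      REAL BUFL(NB,NY,NZ), BUFR(NB,NY,NZ)
C     NXP must be an integer and NXP > 2*NB, so the forward and
C     backward buffer regions on one CPU do not overlap.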

Communication Pattern

(Figure: nearest-neighbor buffer exchange between adjacent CPUs, with a "forward" tag for one transfer direction and a "backward" tag for the other.)

Note:
• Blocking SEND/RECV requires synchronization to avoid dead-lock
• Distributed memory + proper parallelization means per-CPU RAM does not limit the total problem size

Implementation (FORTRAN)

Blocking SEND/RECV
CALL MPI_SEND(buffer,size,MPI_Datatype,destination,tag,MPI_COMM_WORLD,IERR)
CALL MPI_RECV(buffer,size,MPI_Datatype,source,tag,MPI_COMM_WORLD,STATUS,IERR)

Synchronization
send(…,destination,tag,…)
recv(…,source,tag,…)
NPROC = total number of CPUs

Rank = 0:
send(…,Rank+1,NFOR+Rank,…)
recv(…,Rank+1,NBAC+Rank,…)

0 < Rank < NPROC-1:
recv(…,Rank-1,NFOR+Rank-1,…)
send(…,Rank-1,NBAC+Rank-1,…)
send(…,Rank+1,NFOR+Rank,…)
recv(…,Rank+1,NBAC+Rank,…)

Rank = NPROC-1:
recv(…,Rank-1,NFOR+Rank-1,…)
send(…,Rank-1,NBAC+Rank-1,…)

Note:
• All send/recv calls are ordered for synchronization; otherwise dead-lock
• All tags are unique positive integers; otherwise mismatched messages can dead-lock

SEND/RECV Example Code (FORTRAN)

C     TAGS: unique forward and backward tags; the next subroutine
C     gets a fresh tag range (RANK+1 or RANK-1 is the destination or
C     source, NFOR/NBAC plus the rank is the tag)
      NFOR=NSTRT
      NBAC=NFOR+NPROC
      NSTRT=NBAC+1
      IF(RANK .EQ. 0)THEN
      CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
~     MPI_COMM_WORLD,IERR)
      CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
~     MPI_COMM_WORLD,STATUS,IERR)
      ELSEIF(RANK .EQ. NPROC-1)THEN
      CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
~     MPI_COMM_WORLD,STATUS,IERR)
      CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
~     MPI_COMM_WORLD,IERR)
      ELSE
C     The SEND/RECV order matches the synchronization pattern above
      CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
~     MPI_COMM_WORLD,STATUS,IERR)
      CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
~     MPI_COMM_WORLD,IERR)
      CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
~     MPI_COMM_WORLD,IERR)
      CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
~     MPI_COMM_WORLD,STATUS,IERR)
      ENDIF

Note:
• All multidimensional arrays are packed into a 1-dimensional "BUF" before SEND (see the pack/unpack sketch below)
• After RECV, each 1-dimensional "BUF" is unpacked back into 3 dimensions
• The color coding on the original slide applies to NPROC = 3
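
A minimal sketch of the pack/unpack step described in the notes above, assuming the NB boundary planes nearest the forward face are sent and the received data fills the forward ghost planes (names, bounds and the direction chosen are illustrative, not the original code):

C     Pack the NB forward-most owned planes of F into a 1-D buffer.
      REAL F(1-NB:NXP+NB,NY,NZ), BUF(NB*NY*NZ)
      N=0
      DO K=1,NZ
      DO J=1,NY
      DO I=NXP-NB+1,NXP
      N=N+1
      BUF(N)=F(I,J,K)
      ENDDO
      ENDDO
      ENDDO
C     ... MPI_SEND(BUF,NB*NY*NZ,...) and MPI_RECV(BUF,NB*NY*NZ,...) ...
C     Unpack the received buffer into the ghost planes NXP+1..NXP+NB.
      N=0
      DO K=1,NZ
      DO J=1,NY
      DO I=NXP+1,NXP+NB
      N=N+1
      F(I,J,K)=BUF(N)
      ENDDO
      ENDDO
      ENDDO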

Current Practices

3 Dimensional Parallelization
• Smaller buffers (see the size comparison below)
- 1-D buffers, buf(NB=4,NY,NZ), are large slabs, which limits the number of CPUs
- 3-D buffers, e.g. buf(NB=4,NYP,NZP), are smaller, e.g. NYP = NY/4
- 3-D buffering requires more communication
• Greater potential for scaling up
- a large slab sub-domain cannot be narrower than 2 × 4 points (2 × buffer width NB)
- arbitrary rectangular proportions are possible
• Buffer arrays are converted to vectors before and after MPI communication
• Use non-blocking ISEND/IRECV + WAIT for buffer arrays
• Use blocking SEND/RECV for scalars
• Extensive use of COMMON variables for RAM minimization
• Possible recalculation in different subroutines for RAM minimization
• As much as possible, locate MPI communication in separate subroutines
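
To make the buffer-size comparison concrete (the grid sizes here are illustrative, not from the original work): with NX = NY = NZ = 256 and NB = 4, a 1-D slab buffer holds 4 × 256 × 256 ≈ 262,000 values per face, while a 3-D decomposition onto a 4 × 4 × 4 processor grid gives NYP = NZP = 64 and a face buffer of only 4 × 64 × 64 ≈ 16,000 values, i.e. 16 times smaller, at the cost of exchanging across up to six faces instead of two.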
Lessons Learned using SEND/RECV:
• For a relatively small number of processors, NPROC < 16, the anticipated speed-up was achieved
• For NPROC > 16, performance degraded

3D Domain Decomposition

Utilize 1-dimensional flags (IZNX, IZNY and IZNZ) for implementation of boundary conditions

(Figure: 3-D domain decomposition onto NPROC = 64 processors; each Rank is labeled with its IZNX, IZNY and IZNZ position flags.)
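
One possible way to derive the three position flags from the rank, assuming a 4 × 4 × 4 processor grid with x varying fastest; this ordering is an assumption for illustration and may differ from the original code:

C     Assumed ordering: RANK = IZNX + NPX*IZNY + NPX*NPY*IZNZ
      PARAMETER (NPX=4, NPY=4, NPZ=4)
      IZNX=MOD(RANK,NPX)
      IZNY=MOD(RANK/NPX,NPY)
      IZNZ=RANK/(NPX*NPY)
C     Physical boundary conditions apply on faces where a flag is
C     0 or NPX-1 (NPY-1, NPZ-1), e.g. IZNX=0 is the x-minimum face.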

Buffer Subroutines, ISEND/IRECV (FORTRAN)

Time loop tag initiation:
      DO IT=1,NT
C     first tag of each time iteration
      NSTRT=1

Call buffer subroutine:
      CALL BUFFERSXX(NSTRT,
~     RANK,MPI_REAL,MPI_COMM_WORLD,MPI_STATUS_SIZE,STATUS,IERR)

Buffer subroutine:
      SUBROUTINE BUFFERSXX(NSTRT,
~     RANK,MPI_REAL,MPI_COMM_WORLD,MPI_STATUS_SIZE,STATUS,IERR)
C     TAGS: unique forward and backward tags; the next subroutine
C     gets a fresh tag range
      NXF=NSTRT
      NXB=NXF+NPROC
      NSTRT=NXB+1
C     Use MPI_WAIT for BOTH ISEND and IRECV; each MPI_WAIT is tied
C     to its own request handle, which acts like a "tag"
      CALL MPI_ISEND(SENDBUFX,NB*NYP*NZP,MPI_REAL,RANK+1,NXF+RANK,
~     MPI_COMM_WORLD,SENDXF,IERR)
      CALL MPI_IRECV(RECVBUFX,NB*NYP*NZP,MPI_REAL,RANK+1,NXB+RANK,
~     MPI_COMM_WORLD,RECVXB,IERR)
      CALL MPI_WAIT(SENDXF,STATUS,IERR)
      CALL MPI_WAIT(RECVXB,STATUS,IERR)
C     Synchronize at the end of each buffer communication
      CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
Note:
• Tags are recycled each time step; an MPI tag must be greater than zero and less than a large implementation-defined upper bound
• Non-blocking ISEND/IRECV need not be written in a synchronized pattern; completion is matched through the MPI_WAIT request ("tag")

Buffer Subroutines for Convergence

• Crank-Nicolson method for x, y and z velocity components


• Implicit Poisson-type equation
• Solution by Jacobi iteration
- All grid points advance together at each iteration level
- Slowest method to converge
- Most stable iterative method (Neumann problem)
- Number of iterations a function of time step size

     
u i,n j,1k  c1 u in1, j,k  u in1, j,k  c 2 u i,n j1, k  u i,n j1, k  c3 u i,n j,k 1  u i,n j,k 1  c 4g i, j,k

Note: Stencil size is 1 in each direction
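
A minimal sketch of one Jacobi sweep over the local sub-domain, assuming the one-deep ghost layer (I = 0 and I = NXP+1, and likewise in y and z) has been filled by the buffer exchange before the sweep; array names and bounds are illustrative, not the original code:

C     One Jacobi iteration: new level UNEW from old level U.
C     The stencil reaches one point in each direction, so only a
C     one-deep ghost layer must be current for each iteration.
      REAL U(0:NXP+1,0:NYP+1,0:NZP+1), UNEW(NXP,NYP,NZP), G(NXP,NYP,NZP)
      DO K=1,NZP
      DO J=1,NYP
      DO I=1,NXP
      UNEW(I,J,K)=C1*(U(I+1,J,K)+U(I-1,J,K))
~     +C2*(U(I,J+1,K)+U(I,J-1,K))
~     +C3*(U(I,J,K+1)+U(I,J,K-1))
~     +C4*G(I,J,K)
      ENDDO
      ENDDO
      ENDDO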

Scalar Communication, SEND/RECV (FORTRAN)

• Iterative solution requires maximum error over the entire domain


• Poisson pressure equation also requires communication for the compatibility condition
• Treat Rank = 0 as Master

Step 1: Master receives the slaves' maximum errors
Step 2: Master sends the global maximum error to the slaves

Note: Typically domain decomposition would not require a Master/Slave method; rather, use distributed memory

Scalar Communication (FORTRAN)

Determine max error during convergence:


C     TAGS: indexed from the same set of tags as the buffer
C     communication
      NFOR=NSTRT
      NBAC=NFOR+NPROC
      NSTRT=NBAC+1
C     A synchronizing SEND/RECV pattern is required to avoid
C     dead-lock
      IF(RANK .EQ. 0)THEN
      DO I=1,NPROC-1
      CALL MPI_RECV(ERR,1,MPI_REAL,I,NBAC+I,
~     MPI_COMM_WORLD,STATUS,IERR)
      IF(ERR .GT. ERRMAX)THEN
      ERRMAX=ERR
      ENDIF
      ENDDO
      DO I=1,NPROC-1
      CALL MPI_SEND(ERRMAX,1,MPI_REAL,I,NFOR+I,
~     MPI_COMM_WORLD,IERR)
      ENDDO
      ELSE
      CALL MPI_SEND(ERRMAX,1,MPI_REAL,0,NBAC+RANK,
~     MPI_COMM_WORLD,IERR)
      CALL MPI_RECV(ERRMAX,1,MPI_REAL,0,NFOR+RANK,
~     MPI_COMM_WORLD,STATUS,IERR)
      ENDIF
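
As the note on the previous slide suggests, the Master/Slave exchange is not the only option: a single collective reduction returns the global maximum on every rank. A hedged sketch of that alternative (ERRLOC is the rank-local maximum error, a name assumed here; this call is not part of the original code):

C     Global maximum error delivered to every rank in one call.
      CALL MPI_ALLREDUCE(ERRLOC,ERRMAX,1,MPI_REAL,MPI_MAX,
~     MPI_COMM_WORLD,IERR)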

Lessons Learned and Open Questions

• Three-dimensional domain decomposition yields maximum efficiency
- more communication
- smaller buffer sizes
- lower sub-domain surface-area-to-volume ratio, so less data is exchanged per grid point
• SEND/RECV fine for small buffer sizes
• ISEND/IRECV provides appropriate scaling performance
• Both MPI_ISEND and MPI_IRECV require MPI_WAIT linked with a request, which is unique like a "tag"

• Open question: can buffer sizes be tailored to exactly what each communication requires (particularly in an iterative loop) for communication speed-up and memory reduction?

