
Domain Decomposition With MPI

Numerical Problem
• Transient Large Eddy Simulation (LES) of turbulence
• Finite difference scheme
• Three-dimensional rectangular domain
• Uniform domain decomposition in each direction – every CPU gets an equal-size sub-domain

MPI Communications
• Computation of variable derivatives – buffer neighboring CPU
• Filtering of variables – buffer neighboring CPU
• Implicit solution of variables, i.e. iteration of coupled equations – buffer neighboring CPU while converging for each time step
• Determination of maximum error while converging – scalars
• Determine maximum variable values each time step – scalars
• Post processing
Motivation

 f  f i  2  8 f i 1  8 f i 1  f i  2
4th Order First Derivative: 
x 12 Δ
Required buffer width = 2

4th Order Second Derivative:
$\dfrac{\partial^2 f}{\partial x^2} \approx \dfrac{-f_{i+2} + 16 f_{i+1} - 30 f_i + 16 f_{i-1} - f_{i-2}}{12\,\Delta^2}$
Required buffer width = 2

8th Order Filter:
$\hat{f}_i = 0.7265625\, f_i + 0.21875\,(f_{i+1} + f_{i-1}) - 0.109375\,(f_{i+2} + f_{i-2}) + 0.03125\,(f_{i+3} + f_{i-3}) - 0.00390625\,(f_{i+4} + f_{i-4})$

Required buffer width = 4
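
The required buffer width follows from the widest stencil applied across a sub-domain boundary. As a minimal sketch (not the original code; the array names, PARAMETER values and grid spacing DX are illustrative assumptions), evaluating the 4th-order first derivative at the first owned point I = 1 reaches back to I-2, so two ghost planes suffice for the derivatives, while the 8th-order filter reaches I-4 and therefore needs four:

C     Illustrative sketch: F carries NB ghost planes on each side in x,
C     filled beforehand by the buffer exchange with the neighboring CPUs.
C     DX = uniform grid spacing.
      PARAMETER (NB=4, NXP=32, NY=32, NZ=32)
      REAL F(1-NB:NXP+NB,NY,NZ), DFDX(NXP,NY,NZ)
      DO K=1,NZ
      DO J=1,NY
      DO I=1,NXP
C     At I=1 the stencil uses F(-1,J,K), i.e. the second ghost plane
      DFDX(I,J,K)=(-F(I+2,J,K)+8.0*F(I+1,J,K)
~     -8.0*F(I-1,J,K)+F(I-2,J,K))/(12.0*DX)
      ENDDO
      ENDDO
      ENDDO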

Development

1 Dimensional Domain Decomposition


• NX, NY, NZ = grid points in the x, y and z directions
• NX/CPUs = NXP must be an integer > 2 × NB
• NXP = grid points on 1 CPU in the x direction
• f(NXP,NY,NZ) = array dimension on each CPU
• buf(NB,NY,NZ) = buffer dimensions at each boundary
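
A minimal declaration sketch of this 1-dimensional decomposition (the PARAMETER values and the names F, BUFL, BUFR are illustrative assumptions, not taken from the original code):

C     1-D decomposition in x: each CPU owns NXP = NX/NPROC planes and
C     exchanges NB-deep buffers with its two x-neighbors.
      PARAMETER (NX=256, NY=128, NZ=128, NPROC=8, NB=4)
      PARAMETER (NXP=NX/NPROC)
      REAL F(NXP,NY,NZ)
      REAL BUFL(NB,NY,NZ), BUFR(NB,NY,NZ)
C     NXP must be an integer and NXP > 2*NB, so the forward and
C     backward buffer regions on one CPU do not overlap.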

Communication Pattern

(Figure: nearest-neighbor buffer exchange between adjacent CPUs, with a "forward" tag for one transfer direction and a "backward" tag for the other.)

Note:
• Blocking SEND/RECV requires synchronization to avoid dead-lock
• Distributed memory + proper parallelization means per-CPU RAM does not limit the total problem size

Implementation (FORTRAN)

Blocking SEND/RECV
CALL MPI_SEND(buffer,size,MPI_Datatype,destination,tag,MPI_COMM_WORLD,IERR)
CALL MPI_RECV(buffer,size,MPI_Datatype,source,tag,MPI_COMM_WORLD,STATUS,IERR)

Synchronization
send(…,destination,tag,…)
recv(…,source,tag,…)
NPROC = total number of CPUs

Rank = 0:
send(…,Rank+1,NFOR+Rank,…)
recv(…,Rank+1,NBAC+Rank,…)

0 < Rank < NPROC-1:
recv(…,Rank-1,NFOR+Rank-1,…)
send(…,Rank-1,NBAC+Rank-1,…)
send(…,Rank+1,NFOR+Rank,…)
recv(…,Rank+1,NBAC+Rank,…)

Rank = NPROC-1:
recv(…,Rank-1,NFOR+Rank-1,…)
send(…,Rank-1,NBAC+Rank-1,…)

Note:
• All send/recv calls are ordered for synchronization; otherwise dead-lock
• All tags are unique positive integers; otherwise mismatched messages can dead-lock

SEND/RECV Example Code (FORTRAN)

C     TAGS: unique forward and backward tags; the next subroutine
C     gets a fresh tag range (RANK+1 or RANK-1 is the destination or
C     source, NFOR/NBAC plus the rank is the tag)
      NFOR=NSTRT
      NBAC=NFOR+NPROC
      NSTRT=NBAC+1
      IF(RANK .EQ. 0)THEN
      CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
~     MPI_COMM_WORLD,IERR)
      CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
~     MPI_COMM_WORLD,STATUS,IERR)
      ELSEIF(RANK .EQ. NPROC-1)THEN
      CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
~     MPI_COMM_WORLD,STATUS,IERR)
      CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
~     MPI_COMM_WORLD,IERR)
      ELSE
C     The SEND/RECV order matches the synchronization pattern above
      CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
~     MPI_COMM_WORLD,STATUS,IERR)
      CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
~     MPI_COMM_WORLD,IERR)
      CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
~     MPI_COMM_WORLD,IERR)
      CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
~     MPI_COMM_WORLD,STATUS,IERR)
      ENDIF

Note:
• All multidimensional arrays are packed into a 1-dimensional "BUF" before SEND (see the pack/unpack sketch below)
• After RECV, each 1-dimensional "BUF" is unpacked back into 3 dimensions
• The color coding on the original slide applies to NPROC = 3
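
A minimal sketch of the pack/unpack step described in the notes above, assuming the NB boundary planes nearest the forward face are sent and the received data fills the forward ghost planes (names, bounds and the direction chosen are illustrative, not the original code):

C     Pack the NB forward-most owned planes of F into a 1-D buffer.
      REAL F(1-NB:NXP+NB,NY,NZ), BUF(NB*NY*NZ)
      N=0
      DO K=1,NZ
      DO J=1,NY
      DO I=NXP-NB+1,NXP
      N=N+1
      BUF(N)=F(I,J,K)
      ENDDO
      ENDDO
      ENDDO
C     ... MPI_SEND(BUF,NB*NY*NZ,...) and MPI_RECV(BUF,NB*NY*NZ,...) ...
C     Unpack the received buffer into the ghost planes NXP+1..NXP+NB.
      N=0
      DO K=1,NZ
      DO J=1,NY
      DO I=NXP+1,NXP+NB
      N=N+1
      F(I,J,K)=BUF(N)
      ENDDO
      ENDDO
      ENDDO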

Current Practices

3 Dimensional Parallelization
• Smaller buffers (see the size comparison below)
- 1-D buffers, buf(NB=4,NY,NZ), are large slabs, which limits the number of CPUs
- 3-D buffers, e.g. buf(NB=4,NYP,NZP), are smaller, e.g. NYP = NY/4
- 3-D buffering requires more communication
• Greater potential for scaling up
- a large slab sub-domain cannot be narrower than 2 × 4 points (2 × buffer width NB)
- arbitrary rectangular proportions are possible
• Buffer arrays are converted to vectors before and after MPI communication
• Use non-blocking ISEND/IRECV + WAIT for buffer arrays
• Use blocking SEND/RECV for scalars
• Extensive use of COMMON variables for RAM minimization
• Possible recalculation in different subroutines for RAM minimization
• As much as possible, locate MPI communication in separate subroutines
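
To make the buffer-size comparison concrete (the grid sizes here are illustrative, not from the original work): with NX = NY = NZ = 256 and NB = 4, a 1-D slab buffer holds 4 × 256 × 256 ≈ 262,000 values per face, while a 3-D decomposition onto a 4 × 4 × 4 processor grid gives NYP = NZP = 64 and a face buffer of only 4 × 64 × 64 ≈ 16,000 values, i.e. 16 times smaller, at the cost of exchanging across up to six faces instead of two.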
Lessons Learned using SEND/RECV:
• For a relatively small number of processors, NPROC < 16, the anticipated speed-up was achieved
• For NPROC > 16, performance degraded

3D Domain Decomposition

Utilize 1-dimensional flags (IZNX, IZNY and IZNZ) for implementation of boundary conditions

(Figure: 3-D domain decomposition onto NPROC = 64 processors; each Rank is labeled with its IZNX, IZNY and IZNZ position flags.)
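
One possible way to derive the three position flags from the rank, assuming a 4 × 4 × 4 processor grid with x varying fastest; this ordering is an assumption for illustration and may differ from the original code:

C     Assumed ordering: RANK = IZNX + NPX*IZNY + NPX*NPY*IZNZ
      PARAMETER (NPX=4, NPY=4, NPZ=4)
      IZNX=MOD(RANK,NPX)
      IZNY=MOD(RANK/NPX,NPY)
      IZNZ=RANK/(NPX*NPY)
C     Physical boundary conditions apply on faces where a flag is
C     0 or NPX-1 (NPY-1, NPZ-1), e.g. IZNX=0 is the x-minimum face.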

Buffer Subroutines, ISEND/IRECV (FORTRAN)

Time loop tag initiation:
      DO IT=1,NT
C     first tag of each time iteration
      NSTRT=1

Call buffer subroutine:
      CALL BUFFERSXX(NSTRT,
~     RANK,MPI_REAL,MPI_COMM_WORLD,MPI_STATUS_SIZE,STATUS,IERR)

Buffer subroutine:
      SUBROUTINE BUFFERSXX(NSTRT,
~     RANK,MPI_REAL,MPI_COMM_WORLD,MPI_STATUS_SIZE,STATUS,IERR)
C     TAGS: unique forward and backward tags; the next subroutine
C     gets a fresh tag range
      NXF=NSTRT
      NXB=NXF+NPROC
      NSTRT=NXB+1
C     Use MPI_WAIT for BOTH ISEND and IRECV; each MPI_WAIT is tied
C     to its own request handle, which acts like a "tag"
      CALL MPI_ISEND(SENDBUFX,NB*NYP*NZP,MPI_REAL,RANK+1,NXF+RANK,
~     MPI_COMM_WORLD,SENDXF,IERR)
      CALL MPI_IRECV(RECVBUFX,NB*NYP*NZP,MPI_REAL,RANK+1,NXB+RANK,
~     MPI_COMM_WORLD,RECVXB,IERR)
      CALL MPI_WAIT(SENDXF,STATUS,IERR)
      CALL MPI_WAIT(RECVXB,STATUS,IERR)
C     Synchronize at the end of each buffer communication
      CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
Note:
• Tags are recycled each time step; an MPI tag must be greater than zero and less than a large implementation-defined upper bound
• Non-blocking ISEND/IRECV need not be written in a synchronized pattern; completion is matched through the MPI_WAIT request ("tag")

Buffer Subroutines for Convergence

• Crank-Nicolson method for x, y and z velocity components


• Implicit Poisson-type equation
• Solution by Jacobi iteration
- All grid points advance together at each iteration level
- Slowest method to converge
- Most stable iterative method (Neumann problem)
- Number of iterations a function of time step size

     
u i,n j,1k  c1 u in1, j,k  u in1, j,k  c 2 u i,n j1, k  u i,n j1, k  c3 u i,n j,k 1  u i,n j,k 1  c 4g i, j,k

Note: Stencil size is 1 in each direction
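
A minimal sketch of one Jacobi sweep over the local sub-domain, assuming the one-deep ghost layer (I = 0 and I = NXP+1, and likewise in y and z) has been filled by the buffer exchange before the sweep; array names and bounds are illustrative, not the original code:

C     One Jacobi iteration: new level UNEW from old level U.
C     The stencil reaches one point in each direction, so only a
C     one-deep ghost layer must be current for each iteration.
      REAL U(0:NXP+1,0:NYP+1,0:NZP+1), UNEW(NXP,NYP,NZP), G(NXP,NYP,NZP)
      DO K=1,NZP
      DO J=1,NYP
      DO I=1,NXP
      UNEW(I,J,K)=C1*(U(I+1,J,K)+U(I-1,J,K))
~     +C2*(U(I,J+1,K)+U(I,J-1,K))
~     +C3*(U(I,J,K+1)+U(I,J,K-1))
~     +C4*G(I,J,K)
      ENDDO
      ENDDO
      ENDDO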

Scalar Communication, SEND/RECV (FORTRAN)

• Iterative solution requires maximum error over the entire domain


• Poisson pressure equation also requires communication for the compatibility condition
• Treat Rank = 0 as Master

Step 1: Master receives the slaves' maximum errors
Step 2: Master sends the global maximum error to the slaves

Note: Typically domain decomposition would not require a Master/Slave method; rather, use distributed memory

Scalar Communication (FORTRAN)

Determine max error during convergence:


C     TAGS: indexed from the same set of tags as the buffer
C     communication
      NFOR=NSTRT
      NBAC=NFOR+NPROC
      NSTRT=NBAC+1
C     A synchronizing SEND/RECV pattern is required to avoid
C     dead-lock
      IF(RANK .EQ. 0)THEN
      DO I=1,NPROC-1
      CALL MPI_RECV(ERR,1,MPI_REAL,I,NBAC+I,
~     MPI_COMM_WORLD,STATUS,IERR)
      IF(ERR .GT. ERRMAX)THEN
      ERRMAX=ERR
      ENDIF
      ENDDO
      DO I=1,NPROC-1
      CALL MPI_SEND(ERRMAX,1,MPI_REAL,I,NFOR+I,
~     MPI_COMM_WORLD,IERR)
      ENDDO
      ELSE
      CALL MPI_SEND(ERRMAX,1,MPI_REAL,0,NBAC+RANK,
~     MPI_COMM_WORLD,IERR)
      CALL MPI_RECV(ERRMAX,1,MPI_REAL,0,NFOR+RANK,
~     MPI_COMM_WORLD,STATUS,IERR)
      ENDIF
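
As the note on the previous slide suggests, the Master/Slave exchange is not the only option: a single collective reduction returns the global maximum on every rank. A hedged sketch of that alternative (ERRLOC is the rank-local maximum error, a name assumed here; this call is not part of the original code):

C     Global maximum error delivered to every rank in one call.
      CALL MPI_ALLREDUCE(ERRLOC,ERRMAX,1,MPI_REAL,MPI_MAX,
~     MPI_COMM_WORLD,IERR)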

Lessons Learned and Open Questions

• Three-dimensional domain decomposition yields maximum efficiency
- more communication
- smaller buffer sizes
- lower sub-domain surface-area-to-volume ratio, so less data is exchanged per grid point
• SEND/RECV fine for small buffer sizes
• ISEND/IRECV provides appropriate scaling performance
• Both MPI_ISEND and MPI_IRECV require MPI_WAIT linked with a request, which is unique like a "tag"

• Open question: can buffer sizes be tailored to exactly what each communication requires (particularly in an iterative loop) for communication speed-up and memory reduction?

