Numerical Problem
• Transient Large Eddy Simulation (LES) of turbulence
• Finite difference scheme
• Three-dimensional rectangular domain
• Uniform domain decomposition in each direction – every CPU runs a copy of the same code on its own sub-domain
4th Order First Derivative:
\frac{\partial f}{\partial x}\Big|_i = \frac{-f_{i+2} + 8f_{i+1} - 8f_{i-1} + f_{i-2}}{12\,\Delta x}
Required buffer width = 2
4th Order Second Derivative:
\frac{\partial^2 f}{\partial x^2}\Big|_i = \frac{-f_{i+2} + 16f_{i+1} - 30f_i + 16f_{i-1} - f_{i-2}}{12\,\Delta x^2}
Required buffer width = 2
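As a minimal sketch of how these stencils use the buffer (the subroutine name DERIV1 and the names NXP, DX, DFDX are hypothetical, not from the slides): the 4th-order first derivative at every interior point reaches at most two cells beyond the rank's sub-domain, which is exactly the buffer of width 2 that must be exchanged with the neighbors.

      SUBROUTINE DERIV1(F,DFDX,NXP,DX)
      INTEGER NXP,I
      REAL F(-1:NXP+2),DFDX(NXP),DX
C     F HOLDS THE LOCAL POINTS 1..NXP PLUS TWO HALO ("BUFFER") CELLS
C     ON EACH SIDE; THE HALO IS FILLED BY THE MPI BUFFER EXCHANGE
      DO I=1,NXP
        DFDX(I)=(-F(I+2)+8.0*F(I+1)-8.0*F(I-1)+F(I-2))/(12.0*DX)
      ENDDO
      RETURN
      END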
Development
Communication Pattern
[Diagram: neighboring ranks exchange halo buffers; each exchange uses a “forward” tag and a “backward” tag.]
Note:
• Blocking SEND/RECV requires synchronization to avoid dead-lock
• Distributed memory + proper parallelization removes the single-node RAM limitation, since each rank stores only its own sub-domain
Implementation (FORTRAN)
Blocking SEND/RECV
CALL MPI_SEND(buffer,size,MPI_Datatype,destination,tag,MPI_COMM_WORLD,IERR)
CALL MPI_RECV(buffer,size,MPI_Datatype,source,tag,MPI_COMM_WORLD,STATUS,IERR)
Synchronization
[Diagram: the ordering of send(…,destination,tag,…) and recv(…,source,tag,…) calls across all ranks.]
NPROC = total number of CPUs
Note:
• All send/recv’s are ordered for synchronization, otherwise dead-lock
• All tags are unique positive integers, otherwise dead-lock
SEND/RECV Example Code (FORTRAN)
C     TAGS (Destination/Source is the neighbor rank, RANK+1 or RANK-1)
C     unique forward and backward tags
      NFOR=NSTRT
      NBAC=NFOR+NPROC
C     next subroutine has a fresh tag
      NSTRT=NBAC+1
      IF(RANK .EQ. 0)THEN
        CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
     ~    MPI_COMM_WORLD,IERR)
        CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
     ~    MPI_COMM_WORLD,STATUS,IERR)
      ELSEIF(RANK .EQ. NPROC-1)THEN
        CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
     ~    MPI_COMM_WORLD,STATUS,IERR)
        CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
     ~    MPI_COMM_WORLD,IERR)
      ELSE
C     SEND/RECV pattern matches synchronization pattern
        CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
     ~    MPI_COMM_WORLD,STATUS,IERR)
        CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
     ~    MPI_COMM_WORLD,IERR)
        CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
     ~    MPI_COMM_WORLD,IERR)
        CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
     ~    MPI_COMM_WORLD,STATUS,IERR)
      ENDIF
Note:
• All multidimensional arrays are packed into a 1-dimensional "BUF" before SEND
• After RECV, the 1-dimensional "BUF" is unpacked back into 3 dimensions
• The color coding on the original slide corresponds to the case NPROC=3
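The pack/unpack step itself is not shown on the slide; a minimal sketch, assuming a 3-D field U with NB halo cells in the x-direction and hypothetical loop variables I, J, K, L, of the conversion between the 3-D array and the 1-D BUF of size NB*NY*NZ:

C     PACK THE NB-WIDE BOUNDARY SLAB OF U INTO THE 1-D BUFFER
      L=0
      DO K=1,NZ
        DO J=1,NY
          DO I=1,NB
            L=L+1
            BUF(L)=U(I,J,K)
          ENDDO
        ENDDO
      ENDDO
C     ... MPI_SEND / MPI_RECV OF BUF(1:NB*NY*NZ) AS ON THE SLIDE ABOVE ...
C     UNPACK THE RECEIVED BUFFER INTO THE HALO CELLS U(1-NB:0,:,:)
      L=0
      DO K=1,NZ
        DO J=1,NY
          DO I=1,NB
            L=L+1
            U(I-NB,J,K)=BUF(L)
          ENDDO
        ENDDO
      ENDDO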
Current Practices
3 Dimensional Parallelization
• Smaller buffers
- 1D-decomposition buffers, buf(NB=4,NY,NZ), are large slabs, which limits the number of CPUs
- 3D-decomposition buffers, e.g. buf(NB=4,NYP,NZP) with NYP=NY/4, are smaller
- 3D buffering requires more communication calls
• Greater potential for scaling up
- a large-slab sub-domain cannot be narrower than 2 x 4 cells (2 x buffer width, NB)
- arbitrary rectangular proportions are possible
• Buffer arrays are converted to vectors before and after MPI communication
• Use non-blocking ISEND/IRECV + WAIT for buffer arrays (see the sketch after this list)
• Use blocking SEND/RECV for scalars
• Extensive use of COMMON variables for RAM minimization
• Quantities may be recalculated in different subroutines rather than stored, to minimize RAM
• As much as possible, locate MPI communication in separate subroutines
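A minimal sketch of the non-blocking buffer exchange mentioned above, reusing the NXF/NXB tag convention from the buffer-subroutine slide further on; the buffer names (BUFSL, BUFSR, BUFRL, BUFRR), the neighbor ranks LEFT/RIGHT, and the use of MPI_WAITALL for the WAIT step are assumptions, not the code from the original slides:

C     NON-BLOCKING EXCHANGE OF THE X-DIRECTION BUFFERS WITH BOTH
C     NEIGHBORS (LEFT=RANK-1, RIGHT=RANK+1, ASSUMED TO EXIST HERE)
      INTEGER REQ(4),STAT(MPI_STATUS_SIZE,4)
      CALL MPI_IRECV(BUFRL,NB*NYP*NZP,MPI_REAL,LEFT,NXF+LEFT,
     ~  MPI_COMM_WORLD,REQ(1),IERR)
      CALL MPI_IRECV(BUFRR,NB*NYP*NZP,MPI_REAL,RIGHT,NXB+RANK,
     ~  MPI_COMM_WORLD,REQ(2),IERR)
      CALL MPI_ISEND(BUFSR,NB*NYP*NZP,MPI_REAL,RIGHT,NXF+RANK,
     ~  MPI_COMM_WORLD,REQ(3),IERR)
      CALL MPI_ISEND(BUFSL,NB*NYP*NZP,MPI_REAL,LEFT,NXB+LEFT,
     ~  MPI_COMM_WORLD,REQ(4),IERR)
C     COMPLETE ALL FOUR TRANSFERS BEFORE UNPACKING THE RECEIVE BUFFERS
      CALL MPI_WAITALL(4,REQ,STAT,IERR)

Because the receives are posted first and completion is deferred to the WAIT, no global ordering of the calls is needed to avoid dead-lock, which is what makes this pattern simpler than the blocking version.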
Lessons Learned using SEND/RECV:
• For a relatively small number of processors (NPROC < 16), the anticipated speed-up was achieved
• For NPROC > 16, performance degraded
3D Domain Decomposition
[Diagram: 3D domain decomposition of the computational domain; each sub-domain is labelled by its Rank and by the decomposition indices IZNY and IZNZ.]
NPROC = 64 Processors
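As a minimal sketch, assuming the 64 ranks are arranged as a uniform 4 x 4 x 4 processor grid and using hypothetical names (NPX, NPY, NPZ, IX, IY, IZ, IXPLUS), the position of a rank in the 3D decomposition can be recovered from its rank number:

C     POSITION OF THIS RANK IN A NPX*NPY*NPZ = 4*4*4 PROCESSOR GRID
      NPX=4
      NPY=4
      NPZ=4
      IX=MOD(RANK,NPX)
      IY=MOD(RANK/NPX,NPY)
      IZ=MOD(RANK/(NPX*NPY),NPZ)
C     NEIGHBOR RANK IN THE +X DIRECTION, IF IT EXISTS
      IF(IX .LT. NPX-1) IXPLUS=RANK+1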
Buffer Subroutines, ISEND/IRECV (FORTRAN)
Buffer subroutine:
      SUBROUTINE BUFFERSXX(NSTRT,
     ~  RANK,MPI_REAL,MPI_COMM_WORLD,MPI_STATUS_SIZE,STATUS,IERR)
C     TAGS: unique forward and backward tags
      NXF=NSTRT
      NXB=NXF+NPROC
C     next subroutine has a fresh tag
      NSTRT=NXB+1
Buffer Subroutines for Convergence
u^{n+1}_{i,j,k} = c_1\left(u^{n}_{i+1,j,k} + u^{n}_{i-1,j,k}\right) + c_2\left(u^{n}_{i,j+1,k} + u^{n}_{i,j-1,k}\right) + c_3\left(u^{n}_{i,j,k+1} + u^{n}_{i,j,k-1}\right) + c_4\, g_{i,j,k}
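A minimal sketch of one sweep of this point-iterative update on a rank's sub-domain (the names NXP, NYP, NZP for the local extents and UNEW for the updated field are hypothetical); the neighbor values at the sub-domain faces come from the buffer exchange, which is why the buffer subroutines must be called on every iteration until convergence:

C     ONE SWEEP OF THE ITERATIVE UPDATE; HALO (BUFFER) VALUES OF U
C     MUST BE REFRESHED BY THE BUFFER SUBROUTINES BEFORE EACH SWEEP
      DO K=1,NZP
        DO J=1,NYP
          DO I=1,NXP
            UNEW(I,J,K)=C1*(U(I+1,J,K)+U(I-1,J,K))
     ~       +C2*(U(I,J+1,K)+U(I,J-1,K))
     ~       +C3*(U(I,J,K+1)+U(I,J,K-1))
     ~       +C4*G(I,J,K)
          ENDDO
        ENDDO
      ENDDO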
Scalar Communication, SEND/RECV (FORTRAN)
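The code from this slide is not preserved in the text version; a minimal sketch of blocking scalar communication, assuming a convergence check in which each rank sends its local residual (hypothetical name RESLOC) to rank 0 using a unique tag offset NSCL, in the spirit of the SEND/RECV pattern above:

C     GATHER A SCALAR RESIDUAL ON RANK 0 WITH BLOCKING SEND/RECV
      IF(RANK .EQ. 0)THEN
        RESGLO=RESLOC
        DO N=1,NPROC-1
          CALL MPI_RECV(RESLOC,1,MPI_REAL,N,NSCL+N,
     ~      MPI_COMM_WORLD,STATUS,IERR)
          RESGLO=RESGLO+RESLOC
        ENDDO
      ELSE
        CALL MPI_SEND(RESLOC,1,MPI_REAL,0,NSCL+RANK,
     ~      MPI_COMM_WORLD,IERR)
      ENDIF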
Scalar Communication FORTRAN
Lessons Learned and Open Questions