Bug with delayed communications in `sbc_csupdate`
Context
- Branches impacted: main
- Dependencies: ln_closea=T
Analysis
sbc_csupdate contains:
zcsfw(:) = glob_2Dsum( 'closea', zmsk_src, cdelay = 'cs1' )
zcsh(:) = glob_2Dsum( 'closea', zmsk_src, cdelay = 'cs2' )
There are multiple calls to this routine from sbc_clo for different variables:
CALL sbc_csupdate( ncsg, mcsgrpg, mask_csglo, mask_csgrpglo, rsurfsrcg, rsurftrgg, 'glo', mask_opnsea, rsurftrgg, zwcs, zqcs )
CALL sbc_csupdate( ncsr, mcsgrpr, mask_csrnf, mask_csgrprnf, rsurfsrcr, rsurftrgr, 'rnf', mask_opnsea, rsurftrgg, zwcs, zqcs )
CALL sbc_csupdate( ncse, mcsgrpe, mask_csemp, mask_csgrpemp, rsurfsrce, rsurftrge, 'emp', mask_opnsea, rsurftrgg, zwcs, zqcs )
The issue is that the result of glob_2Dsum for mask_csglo will be received by the call for mask_csrnf, rather than the call for mask_csglo on the following timestep (and similarly, the mask_csrnf result will be received by the mask_csemp call).
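The crossed delivery can be illustrated with a toy model. This is only a sketch, not the lib_mpp implementation: `toy_delayed`, `ctags`, `zbuf` and the values 1/2/3 are hypothetical, and each tag is assumed to own a single buffer slot that returns whatever the previous call with that tag stored.

```fortran
PROGRAM delayed_tag_demo
   ! Toy model of a delayed sum (hypothetical, NOT the lib_mpp code):
   ! one buffer slot per tag; a call receives the value stored by the
   ! previous call with the same tag, then stores its own value.
   IMPLICIT NONE
   INTEGER, PARAMETER :: jpmaxtag = 8
   CHARACTER(len=16)  :: ctags(jpmaxtag)
   REAL               :: zbuf(jpmaxtag)
   INTEGER            :: ntag
   CHARACTER(len=3)   :: ctype(3) = (/ 'glo', 'rnf', 'emp' /)
   REAL               :: zval(3)  = (/ 1., 2., 3. /)   ! stand-ins for the three sums
   REAL               :: zout
   INTEGER            :: jt, jc

   ntag = 0                                   ! shared tag, as in the current code
   DO jt = 1, 2
      DO jc = 1, 3
         CALL toy_delayed( 'cs1', zval(jc), zout )
         PRINT *, 'shared tag, step', jt, ctype(jc), ' sent', zval(jc), ' received', zout
      END DO
   END DO

   ntag = 0                                   ! one tag per call site
   DO jt = 1, 2
      DO jc = 1, 3
         CALL toy_delayed( 'cs1_'//ctype(jc), zval(jc), zout )
         PRINT *, 'unique tag, step', jt, ctype(jc), ' sent', zval(jc), ' received', zout
      END DO
   END DO

CONTAINS

   SUBROUTINE toy_delayed( cdtag, pin, pout )
      CHARACTER(len=*), INTENT(in ) :: cdtag
      REAL            , INTENT(in ) :: pin
      REAL            , INTENT(out) :: pout
      INTEGER :: ji, idx
      idx = 0
      DO ji = 1, ntag                          ! look for an existing slot for this tag
         IF( TRIM(ctags(ji)) == cdtag )   idx = ji
      END DO
      IF( idx == 0 ) THEN                      ! first use of the tag: "blocking" sum
         ntag = ntag + 1
         idx  = ntag
         ctags(idx) = cdtag
         zbuf(idx)  = pin
      ENDIF
      pout      = zbuf(idx)                    ! receive what the previous call stored
      zbuf(idx) = pin                          ! store the current value for the next call
   END SUBROUTINE toy_delayed

END PROGRAM delayed_tag_demo
```

With the shared tag, every call after the first receives the value sent by the preceding call site instead of its own; with one tag per call site, each call receives its own previous value.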
In sbc_csupdate, zmsk_src has the shape DIMENSION(A2D(0),kncs), where kncs is the first argument in the above calls, one of ncsg/ncsr/ncse, which are defined in clo_sea and may have different values. This means glob_2Dsum will be called with the same cdelay tag for arrays of different shapes.
Let's say ncsg = 2 and ncsr = 4. In the following code extract from the preprocessed lib_mpp.F90, these correspond to ipi:
SUBROUTINE mppsum1d_cplx_dp( cdname, ptab, kcom, cdelay )
   ...
   ipi = SIZE(ptab,1) ! 1st dimension
   ...
   IF( ndelayid(idvar) == ndlrstoff ) THEN ! first call without restart: define ydpbuffout from ptab(:) with a blocking allreduce
      ! --------------------------
      ALLOCATE(todelay(idvar)%ydpbuffin(ipi), todelay(idvar)%ydpbuffout(ipi))
      CALL mpi_allreduce( ptab(:), todelay(idvar)%ydpbuffout, ipi, MPI_DOUBLE_COMPLEX, mpi_sumdd, ilocalcomm, ierr ) ! get ydpbuffout
      ndelayid(idvar) = MPI_REQUEST_NULL
   ENDIF
   ...
   ptab(:) = todelay(idvar)%ydpbuffout(:)
The first call to sbc_csupdate (with ipi = ncsg = 2) will allocate the receiving buffer such that SIZE( todelay(idvar)%ydpbuffout(:) ) == 2. The second call (with ipi = ncsr = 4) will then attempt to copy data from this receiving buffer into ptab, for which SIZE( ptab(:) ) == 4. This causes a conformance error when using debug flags; otherwise the mpi_iallreduce call just hangs.
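The size mismatch can be reproduced in isolation with a minimal sketch. The names `toy_sum` and `zbuffout` are hypothetical stand-ins (for the delayed sum and for todelay(idvar)%ydpbuffout), no MPI is involved, and the explicit size check replaces the conformance error a debug build would raise.

```fortran
PROGRAM delay_size_mismatch
   ! Minimal sketch (hypothetical, no MPI): a single buffer is sized by the
   ! first caller and then reused by a caller with a different array size.
   IMPLICIT NONE
   REAL, ALLOCATABLE :: zbuffout(:)   ! stands in for todelay(idvar)%ydpbuffout
   REAL :: zglo(2), zrnf(4)           ! ncsg = 2, ncsr = 4 as in the example above

   zglo(:) = 1.
   zrnf(:) = 2.
   CALL toy_sum( zglo )               ! allocates the buffer with 2 elements
   CALL toy_sum( zrnf )               ! same buffer, but ptab now has 4 elements

CONTAINS

   SUBROUTINE toy_sum( ptab )
      REAL, INTENT(inout) :: ptab(:)
      INTEGER :: ipi
      ipi = SIZE(ptab,1)
      IF( .NOT. ALLOCATED(zbuffout) ) THEN     ! first call: size the buffer from ptab
         ALLOCATE( zbuffout(ipi) )
         zbuffout(:) = ptab(:)
      ENDIF
      IF( SIZE(zbuffout) /= ipi ) THEN         ! explicit check instead of a runtime error
         PRINT *, 'size mismatch: buffer holds', SIZE(zbuffout), ' values, ptab expects', ipi
         RETURN
      ENDIF
      ptab(:) = zbuffout(:)                    ! only valid when the shapes conform
   END SUBROUTINE toy_sum

END PROGRAM delay_size_mismatch
```

The second call reports the mismatch because the buffer keeps the size chosen by the first caller.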
Fix
This can be resolved by using unique cdelay tags for the glob_2Dsum calls in sbc_csupdate:
- zcsfw(:) = glob_2Dsum( 'closea', zmsk_src, cdelay = 'cs1' )
- zcsh(:) = glob_2Dsum( 'closea', zmsk_src, cdelay = 'cs2' )
+ zcsfw(:) = glob_2Dsum( 'closea', zmsk_src, cdelay = 'cs1_'//cdcstype ) ! cdelay = "cs1_glo"/"cs1_rnf"/"cs1_emp"
+ zcsh(:) = glob_2Dsum( 'closea', zmsk_src, cdelay = 'cs2_'//cdcstype ) ! cdelay = "cs2_glo"/"cs2_rnf"/"cs2_emp"