Replace DO_[23]D_OVR macros with a threadsafe solution for tiling overlap

Context

Reference configuration/test case (to add, chosen or used as template)
Modifications of versioned files: Fortran routines (*.[Ffh]90), namelists (namelist_*cfg), outputs settings (*.xml), ...
Additional dependencies
New datasets
Any other relevant information

Proposal

Overlapping calculations are the biggest fundamental issue remaining to be solved for the tiling.

This issue arises as a result of tiling at the timestep level, where each tile occupies a different part of the MPI domain, rather than at the DO loop level. For arrays that are "persistent" in memory (e.g. module variables and those declared with ALLOCATABLE, SAVE), this means that all tiles will be accessing the same array. Therefore, calculations with a horizontal stencil (i.e. those that would access a halo point) on one tile will not be independent of those on adjacent tiles.
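The effect can be reproduced with a small standalone program. This is a minimal 1D sketch, not NEMO code: the array `a`, the routine `step` and the two-stage calculation are hypothetical, chosen only so that a halo calculation on one tile touches points belonging to the adjacent tile.

```fortran
program tile_overlap_demo
   implicit none
   integer, parameter :: n = 8, ntile = 2, nt = n / ntile
   real    :: a(0:n+1)              ! "persistent" array shared by all tiles (points 0 and n+1 are halo)
   real    :: b_ref(n), b_tiled(n)
   integer :: jt, is, ie

   ! Reference: the whole domain is processed in one pass
   call init(a)
   call step(a, b_ref, 1, n)

   ! Tiled: the same routine is called once per tile on the same array
   call init(a)
   do jt = 1, ntile
      is = (jt - 1) * nt + 1
      ie = jt * nt
      call step(a, b_tiled, is, ie)
   end do

   print *, 'max |tiled - reference| = ', maxval(abs(b_tiled - b_ref))
   ! The overlap is expected to change the result in this sketch
   if (all(b_tiled == b_ref)) error stop 'expected tiling to change the result'

contains

   subroutine init(x)
      real, intent(out) :: x(0:)
      integer :: i
      do i = 0, n + 1
         x(i) = real(i)
      end do
   end subroutine init

   subroutine step(x, y, is, ie)
      real,    intent(inout) :: x(0:)
      real,    intent(inout) :: y(:)
      integer, intent(in)    :: is, ie
      integer :: i
      ! Stage 1: in-place update that includes one halo point on each side
      do i = is - 1, ie + 1
         x(i) = 2.0 * x(i)
      end do
      ! Stage 2: horizontal stencil over the interior of the tile
      do i = is, ie
         y(i) = x(i-1) + x(i+1)
      end do
   end subroutine step

end program tile_overlap_demo
```

In the tiled case, the points on either side of the tile boundary are updated by stage 1 of both tiles, so the second tile reads values that the first tile has already modified.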

Previously, this issue was addressed by using the DO_.D_OVR macros to adjust DO loop bounds so that calculations did not overlap between tiles. However, this approach is difficult to understand, does not resolve all cases, and crucially, does not work with OpenMP.

#68 (closed) has addressed many of these overlap cases by removing halo calculations where possible, but many instances of overlap remain. Further refactoring of the code is not a maintainable approach, since it is difficult to diagnose where the overlap occurs and therefore also easy to reintroduce it. Even if overlapping calculations do not change the results, parallelisation of the tiling with OpenMP will not be possible due to memory access conflicts and race conditions.

Developers should not be constrained by the tiling when writing their code, so any solution to the overlap issue must be straightforward to understand and apply. For example, developers generally do not need to consider MPI other than calling a few bespoke subroutines (lbc_lnk, mpp_sum, etc.). The aim is for the tiling to have a similar level of impact.

General method

The solution implemented here is to introduce two subroutines, dom_tile_copyin and dom_tile_copyout, which are used to generate independent copies of an array in memory. These copies are used in place of the original array, making calculations on this array independent of other tiles. The modified data from these copies are then copied back to the original array.
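The idea can be illustrated with a minimal 1D sketch. It is not the actual implementation: the `work` array, the `step` routine and the choice to copy the whole array for each tile are assumptions made for brevity. Each tile computes on its own copy of `a`, and only the interior of the tile is written back, so the tiled result matches the untiled reference.

```fortran
program tile_copy_demo
   implicit none
   integer, parameter :: n = 8, ntile = 2, nt = n / ntile
   real    :: a(0:n+1), a_ref(0:n+1)
   real    :: work(0:n+1, ntile)    ! one independent copy of a per tile (hypothetical storage)
   real    :: b_ref(n), b_tiled(n)
   integer :: jt, is, ie

   ! Reference: the whole domain is processed in one pass
   call init(a_ref)
   call step(a_ref, b_ref, 1, n)

   call init(a)

   ! "Copy in": done serially, before any tile has modified the data
   do jt = 1, ntile
      work(:, jt) = a
   end do

   do jt = 1, ntile
      is = (jt - 1) * nt + 1
      ie = jt * nt
      ! The tile works on its own copy, so its halo calculations cannot
      ! disturb the data seen by the other tile
      call step(work(:, jt), b_tiled, is, ie)
      ! "Copy out": only the interior of the tile is written back
      a(is:ie) = work(is:ie, jt)
   end do

   if (any(b_tiled /= b_ref))       error stop 'stencil result differs'
   if (any(a(1:n) /= a_ref(1:n)))   error stop 'updated array differs'
   print *, 'tiled result matches reference on interior points'

contains

   subroutine init(x)
      real, intent(out) :: x(0:)
      integer :: i
      do i = 0, n + 1
         x(i) = real(i)
      end do
   end subroutine init

   subroutine step(x, y, is, ie)
      real,    intent(inout) :: x(0:)
      real,    intent(inout) :: y(:)
      integer, intent(in)    :: is, ie
      integer :: i
      ! Stage 1: in-place update that includes one halo point on each side
      do i = is - 1, ie + 1
         x(i) = 2.0 * x(i)
      end do
      ! Stage 2: horizontal stencil over the interior of the tile
      do i = is, ie
         y(i) = x(i-1) + x(i+1)
      end do
   end subroutine step

end program tile_copy_demo
```

Because the write-back regions of different tiles do not intersect, the loop body has no dependence between tiles in this sketch.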

Each subroutine must be called twice: once outside the tiling loop (dom_tile_copyin immediately before it, dom_tile_copyout immediately after it) and once within it. This is because the code is split into serial and parallel parts, where the serial part must not be executed inside an OpenMP parallel region. This is described in more detail below.

Method in detail

dom_tile_copyin and dom_tile_copyout use a similar syntax to lbc_lnk, except every array must be paired with a string name. A typical usage is:

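! Serial: copy data from the original arrays to the copies for each tile
IF( ln_tile ) CALL dom_tile_copyin( 'arr1', arr1, 'arr2', arr2 )
IF( ln_tile ) CALL dom_tile_start

DO jtile = 1, nijtile
   IF( ln_tile ) CALL dom_tile( ntsi, ntsj, ntei, ntej, ktile=jtile )
   ! Point arr1/arr2 to the copies for the current tile
   IF( ln_tile ) CALL dom_tile_copyin( 'arr1', arr1, 'arr2', arr2 )

   ! Tiled region

   ! Copy data from the copies for the current tile back to the original arrays
   IF( ln_tile ) CALL dom_tile_copyout( 'arr1', arr1, 'arr2', arr2 )
END DO

IF( ln_tile ) CALL dom_tile_stop
! Serial: point arr1/arr2 back to the original arrays
IF( ln_tile ) CALL dom_tile_copyout( 'arr1', arr1, 'arr2', arr2 )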

Within the tiled region, arrays arr1 and arr2 will instead point to copies of these arrays for the current tile. After exiting the tiled region, arr1 and arr2 will again point to the original arrays, which will have been updated using the modified data from the copies for each tile.

This is achieved by using move_alloc to move the allocation of (rather than copy data from) arrays where possible. For each tile, the data are actually copied only twice: once from the original array to the tile copy, and once back. A derived type (twrk) is used to store the data as it is swapped around.
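The swapping mechanism can be sketched with the standard MOVE_ALLOC intrinsic. The derived type `twrk_sketch` and its components are hypothetical stand-ins, since the layout of the real twrk type is not described here; the sketch only shows how an allocation can be parked and restored without copying its data.

```fortran
program move_alloc_demo
   implicit none

   type :: twrk_sketch                 ! hypothetical stand-in for twrk
      real, allocatable :: orig(:)     ! holds the original allocation while tiling is active
      real, allocatable :: tile(:)     ! holds the copy for a tile when it is not in use
   end type twrk_sketch

   type(twrk_sketch) :: store
   real, allocatable :: arr(:)

   allocate(arr(4))
   arr = [1.0, 2.0, 3.0, 4.0]

   ! Real data copy number one: original array -> tile copy
   allocate(store%tile, source=arr)

   ! Swap: park the original allocation, then let arr take over the tile copy.
   ! MOVE_ALLOC transfers the allocation itself; no array data are copied.
   call move_alloc(arr, store%orig)
   call move_alloc(store%tile, arr)

   ! "Tiled region": only the copy is modified
   arr = 10.0 * arr
   if (any(store%orig /= [1.0, 2.0, 3.0, 4.0])) error stop 'original was modified'

   ! Real data copy number two: the tile's own points (here 2:3) -> original array
   store%orig(2:3) = arr(2:3)

   ! Swap back: arr refers to the original allocation again
   call move_alloc(arr, store%tile)
   call move_alloc(store%orig, arr)

   if (any(arr /= [1.0, 20.0, 30.0, 4.0])) error stop 'unexpected result'
   print *, arr
end program move_alloc_demo
```

After each MOVE_ALLOC call the source argument is left unallocated, which is why the name `arr` can be reused for whichever allocation is currently active.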

dom_tile_copyin must first be called outside of the tiling loop, before the call to dom_tile_start. This copies data from the original array to the tile copies, which must be done serially for each tile to preserve results.

The second call to dom_tile_copyin is within the tiling loop, and points arr1/arr2 to the array copy for the current tile. This part can be done concurrently.

The first call to dom_tile_copyout is also within the tiling loop. This copies data from the tile copies back to the original array. Unlike the first copying of data, this can be done concurrently.

Finally, the second call to dom_tile_copyout must be outside of the tiling loop, after the call to dom_tile_stop. This simply points arr1 and arr2 to the original arrays.

Other changes

Most of the DO_.D_OVR macros have been removed, with those remaining generally found in the RK3 code (which is not tiled). It has not been necessary to add any calls to dom_tile_copyin/dom_tile_copyout.

Some refactoring of tra_bbl_adv was performed to preserve the results when using tiling.

Final thoughts

The subroutines introduced by this development are intended as a general solution to be used by developers when tiling changes the results of their code. While it has not been necessary to use these new subroutines for this reason, this does not mean that calculations are independent between tiles. Calculations on adjacent tiles still overlap; they simply do not change the results. This issue requires a more fundamental solution if tiling is to be parallelised with OpenMP.

Edited Feb 17, 2023 by Daley Calvert
