review and improve timing
The timing diagnostics have been reviewed and improved in order to make the timing easier to use.
NetCDF files are now used to store timing results:
- the timing of step is no longer written in timing.output but directly in a NetCDF file called timing_step.nc (a small inspection sketch is given after this list). For performance reasons, this file is written in chunks of keepnc time steps. The value of keepnc is specified in the call to timing_start; we use 1000. As for timing.output, this file is written only by the MPI process with rank 0 if sn_cfctl%l_oceout = .false., or by all MPI processes if sn_cfctl%l_oceout = .true.
- the timing of the last keepnc time steps of all MPI processes is stored in another NetCDF file called timing_ts_allmpi_step.nc. This file is written only by the MPI process with rank 0 at the end of the run.
- these 2 files are written only for the timing of step, but they could be written for any timed section of the code by using the optional argument keepnc in timing_start.
- the total time (net and full) spent by each MPI task in each timed part of the model is written in NetCDF files called timing_tsum_allmpi_txxx_tyyy.nc, where xxx and yyy are the time step numbers delimiting the window used in the timing. These files are written only by the MPI process with rank 0 at the end of the run.
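As a minimal sketch (not part of the NEMO sources, and making no assumption about the variable names stored inside the files), the new timing files can be inspected from Python with the netCDF4 package:

```python
# Minimal inspection of the NetCDF timing files described above. Only the file
# names are taken from the run; the variable names inside are not assumed, they
# are simply listed.
import glob
from netCDF4 import Dataset

def show(path):
    """Print the dimensions and variables of one NetCDF timing file."""
    print(f"=== {path} ===")
    with Dataset(path) as ds:
        for name, dim in ds.dimensions.items():
            print(f"  dimension {name}: {len(dim)}")
        for name, var in ds.variables.items():
            print(f"  variable  {name}{var.dimensions}: shape {var.shape}")

# per-time-step timing of step, written in chunks of keepnc steps during the run
show("timing_step.nc")
# per-task totals (net and full) for each timing window, written at the end of the run
for path in sorted(glob.glob("timing_tsum_allmpi_t*_t*.nc")):
    show(path)
```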
Other minor improvements to the timing:
- the gnuplot script, timing_gnuplot.sh, has been adapted to the new timing_step.nc and generalized.
- the content of timing.output has been slightly improved by adding the statistics for each timing window.
- even if ln_timing = .false., we time step and provide the corresponding timing information. This timing, limited to step, is extremely light. A simple ncview of the files timing_step.nc or timing_ts_allmpi_step.nc can give very useful information even in production mode (a scripted alternative is sketched after this list).
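If ncview is not at hand, a quick look at the per-process series can also be scripted. This is only a sketch: the variable name step_time below is a placeholder, to be replaced by one of the names actually listed by the inspection script above.

```python
# Quick look at the last keepnc time steps kept for all MPI processes, as a
# scripted alternative to ncview. "step_time" is a placeholder variable name:
# replace it with one of the names actually present in the file.
import matplotlib.pyplot as plt
from netCDF4 import Dataset

VARNAME = "step_time"  # hypothetical name, check the file contents first

with Dataset("timing_ts_allmpi_step.nc") as ds:
    data = ds.variables[VARNAME][:]

if data.ndim == 2:              # e.g. one series per MPI rank
    for series in data:
        plt.plot(series, linewidth=0.5)
else:                           # a single series
    plt.plot(data)
plt.xlabel("time step within the kept window")
plt.ylabel("time per step")
plt.title("timing_ts_allmpi_step.nc")
plt.show()
```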
Other side improvements:
- for benchmarking purposes, we introduced nn_comm = 0, which suppresses all MPI communications in lbc_lnk (of course, in this case, model results are meaningless from a physical point of view).
- in order to reduce the size of ocean.output and layout.dat when a large number of MPI processes is used, we limited the number of lines printed in these ascii files to describe the MPI domain decomposition. A new NetCDF file called layout.nc provides all the detailed information, including the send and receive neighbours in the 8 directions (see the sketch after this list).
- as the finalization of the timing can require a large amount of memory when a large number of MPI processes is used, we introduced the routine nemo_dealloc, which deallocates the arrays allocated by nemo_alloc.
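A possible way to browse layout.nc, assuming only the file name given above (the script prints whatever variables the file contains, without assuming their names):

```python
# Browse the MPI domain-decomposition details now stored in layout.nc rather
# than in the truncated ascii files. No variable names are assumed: everything
# found in the file is printed, so the send/receive neighbour tables for the
# 8 directions can be located by inspection.
import numpy as np
from netCDF4 import Dataset

with Dataset("layout.nc") as ds:
    for name, var in ds.variables.items():
        values = np.asarray(var[:])
        print(f"{name} {var.dimensions} shape={values.shape}")
        if values.size <= 64:   # print small variables in full
            print(values)
```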