review and improve timing
The timing diagnostics have been reviewed and improved to make them easier to use.

NetCDF files are now used to store timing results:

- The timing of `step` is no longer written in `timing.output`; it is written directly to a NetCDF file called `timing_step.nc`. For performance reasons, this file is written in chunks of `keepnc` time steps. The value of `keepnc` is specified in the call to `timing_start`; we use 1000. As with `timing.output`, this file is written only by the MPI process with rank 0 if `sn_cfctl%l_oceout = .false.`, or by all MPI processes if `sn_cfctl%l_oceout = .true.`.
- The timing of the last `keepnc` time steps of all MPI processes is stored in another NetCDF file called `timing_ts_allmpi_step.nc`. This file is written only by the MPI process with rank 0, at the end of the run.
- These 2 files are written only for the timing of `step`, but they could be written for any timed section of the code by using the optional argument `keepnc` in `timing_start`.
- The total time (net and full) spent by each MPI task in each timed part of the model is written to NetCDF files called `timing_tsum_allmpi_txxx_tyyy.nc`, where xxx and yyy are the time step numbers of the window used in the timing. These files are written only by the MPI process with rank 0, at the end of the run.

Other minor improvements of the timing:

- The gnuplot script, `timing_gnuplot.sh`, has been adapted to the new `timing_step.nc` and generalized.
- The content of `timing.output` has been slightly improved by adding the statistics for each timing window.
- Even if `ln_timing = .false.`, we time `step` and report its timing information. This timing, limited to `step`, is extremely light. Simply opening `timing_step.nc` or `timing_ts_allmpi_step.nc` with ncview can give very useful information, even in production mode.
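To illustrate the idea of writing `timing_step.nc` in chunks of `keepnc` time steps, here is a minimal Python sketch of the buffering pattern. It is not the NEMO implementation (which is Fortran): the class `ChunkedStepTimer` and the callback `write_chunk` are hypothetical names, and the callback only stands in for the actual NetCDF write.

```python
from typing import Callable, List


class ChunkedStepTimer:
    """Buffer per-step timings and hand them over every `keepnc` steps."""

    def __init__(self, keepnc: int,
                 write_chunk: Callable[[int, List[float]], None]):
        if keepnc < 1:
            raise ValueError("keepnc must be >= 1")
        self.keepnc = keepnc
        # Hypothetical stand-in for the routine that appends to the NetCDF file.
        self.write_chunk = write_chunk
        self.buffer: List[float] = []
        self.first_step = 1  # time step number of the first buffered value

    def record(self, elapsed: float) -> None:
        """Store the elapsed time of one step; flush when the buffer is full."""
        self.buffer.append(elapsed)
        if len(self.buffer) == self.keepnc:
            self.flush()

    def flush(self) -> None:
        """Write whatever is buffered (possibly an incomplete chunk)."""
        if self.buffer:
            self.write_chunk(self.first_step, list(self.buffer))
            self.first_step += len(self.buffer)
            self.buffer.clear()


# Usage: 10 steps with keepnc = 4 give chunks of 4, 4 and 2 values.
written = []
timer = ChunkedStepTimer(
    keepnc=4,
    write_chunk=lambda first, chunk: written.append((first, chunk)),
)
for step in range(1, 11):
    timer.record(0.01 * step)
timer.flush()  # write the incomplete last chunk at the end of the run
```

With a larger `keepnc` (we use 1000 for `step`), the file is touched less often, at the cost of keeping more values in memory between two writes.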
Other side improvements:

- For benchmarking purposes, we introduced `nn_comm = 0`, which suppresses all MPI communications in `lbc_lnk` (of course, in this case, model results are meaningless from a physical point of view).
- To reduce the size of `ocean.output` and `layout.dat` when a large number of MPI processes is used, we limited the number of lines printed in these ASCII files to describe the MPI domain decomposition. A new NetCDF file called `layout.nc` provides all the detailed information, including the send and receive neighbours in the 8 directions.
- As the finalization of the timing can require a large amount of memory when a large number of MPI processes is used, we introduced the routine `nemo_dealloc`, which deallocates the arrays allocated by `nemo_alloc`.
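As a rough illustration of what "neighbours in the 8 directions" means for a 2D domain decomposition, the following Python sketch computes them for one rank. It rests on simplifying assumptions that do not necessarily match NEMO or the content of `layout.nc`: a full `jpni` x `jpnj` grid of subdomains, row-major rank numbering starting at 0, no periodicity, no north fold and no suppressed land-only subdomains. The function name and the direction labels are hypothetical.

```python
from typing import Dict, Optional

# Offsets (di, dj) for the 8 neighbour directions; labels are illustrative.
DIRECTIONS = {
    "W": (-1, 0), "E": (1, 0), "S": (0, -1), "N": (0, 1),
    "SW": (-1, -1), "SE": (1, -1), "NW": (-1, 1), "NE": (1, 1),
}


def neighbours(rank: int, jpni: int, jpnj: int) -> Dict[str, Optional[int]]:
    """Return the neighbour rank in each direction, or None outside the grid.

    Assumes row-major numbering: rank = j * jpni + i, with 0 <= i < jpni
    and 0 <= j < jpnj (an assumption of this sketch, not a NEMO guarantee).
    """
    i, j = rank % jpni, rank // jpni
    result: Dict[str, Optional[int]] = {}
    for name, (di, dj) in DIRECTIONS.items():
        ni, nj = i + di, j + dj
        inside = 0 <= ni < jpni and 0 <= nj < jpnj
        result[name] = nj * jpni + ni if inside else None
    return result


# Usage: centre and corner of a 3 x 3 decomposition.
centre = neighbours(4, jpni=3, jpnj=3)
corner = neighbours(0, jpni=3, jpnj=3)
```

In a real configuration, the send and receive neighbours can differ from this simple picture, which is why `layout.nc` stores them explicitly.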
issue