
Advanced Topics


Debug mode

When troubleshooting MPI applications built with Intel MPI, the debug versions of the Intel MPI library should be used. Adding -link_mpi=dbg (or -link_mpi=dbg_mt for multi-threaded applications) to FFLAGS_DEBUG tells the compiler wrappers to link the debug versions of the MPI library.
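For example (a hypothetical illustration; the actual debug flags are set per platform in the machine's configure file and will differ), the option can simply be appended to the existing FFLAGS_DEBUG definition:

FFLAGS_DEBUG = -g -O0 -traceback -check bounds -link_mpi=dbg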

Use ESMF in debug mode

The correct module file must be loaded. The module file location can be found by looking at the CHOSEN_MODULE variable in the conf/before_components.mk file. On Hera, for example, the location is modulefiles/hera.intel/fv3_coupled. To use ESMF in debug mode on Hera, change

module load esmf/8.0.0

to

module load esmf/8.0.0g

On Cheyenne, in modulefiles/cheyenne.intel/fv3_coupled, change

module load esmf/8.0.0

to

module load esmf_libs/8.0.0
module load esmf-8.0.0-ncdfio-mpi-g

Compiling in debug mode

The ufs-s2s-model can be built in debug mode using make:

cd NEMS
make app=coupledFV3_MOM6_CICE_debug build

This will pass the DEBUG=Y flag down to the component repositories and trigger the use of the FFLAGS_DEBUG setting in the appropriate file for each component (a rough sketch of the mechanism follows the note below). These files are:

For NEMS and FV3:
In conf/configure.fv3_coupled.{machine}

For CICE:
In CICE/bld/Macros.Linux.NEMS.{machine}

For MOM6:
In MOM6/src/mkmf/templates/{machine}.mk

Note: The debug settings have been added to the appropriate files for machines Hera and Cheyenne. Developers working on other platforms can use these as templates for porting the debug settings to other platforms. If a new platform has debug support added, please create an issue and an accompanying PR in the ufs-s2s-model repository so that these can be made available to the wider community.
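The sketch below shows roughly how DEBUG=Y selects the debug flags (a hypothetical make excerpt, not the actual contents of any of these files; the conditional and the non-debug variable name vary by machine and component):

ifeq ($(DEBUG),Y)
  FFLAGS += $(FFLAGS_DEBUG)   # debug compile flags for this machine
else
  FFLAGS += $(FFLAGS_OPT)     # optimized flags; variable name is illustrative
endif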

Running ufs-s2s-model in debug mode

Some things to keep in mind when running in debug mode:

  1. The default walltime settings will not be long enough. For example, to change the walltime setting for the C384 cold start test, add
walltime=10800   # seconds

to the section defining the cold C384 test in fv3mom6cice5.input. On Hera, in debug mode, the C384 cold start requires approximately 1 hour of walltime.

  2. The regression tests will fail because the baselines were generated with the model compiled in non-debug mode.

Restarting the coupled model

The applicable settings and parameters can be found here.

The list of restart files produced by the UFS-S2S-model can be found here.

Changing the number of PEs for FV3

The following variables all need to be changed:

In input.nml: layout and ntiles

In model_configure: TASKS, quilting, write_groups, and write_tasks_per_group

The definitions of these variables and the files where they are located:

TASKS = total number of tasks for all components (model_configure)
layout = INPES,JNPES, the decomposition of each tile into PEs in the x and y directions (input.nml)
ntiles = the number of tiles, typically 6 (input.nml)
quilting = true/false flag controlling whether write (quilting) tasks are used for FV3GFS (model_configure)
write_groups = the number of write groups for FV3GFS (model_configure)
write_tasks_per_group = the number of tasks per FV3GFS write group, a multiple of ntiles (model_configure)

The number of FV3 tasks is then given by:

(INPES x JNPES x 6) + (write_groups x write_tasks_per_group)

The PET layout for each component then needs to be adjusted to be consistent with TASKS. The default values are set in compsets/MACHINEID.input (e.g. default_cpl in compsets/hera.input). If the number of PEs for FV3 is changed, non-default values consistent with the total number of tasks will need to be added to the appropriate compset test. Copy the values from default_cpl into the compset, changing values where needed to be consistent with the number of FV3 tasks. Typically, the mediator is given the number of FV3 tasks, not counting the write tasks. For example, if INPES x JNPES x ntiles = 3 x 8 x 6 = 144, then the mediator is given 144 tasks and FV3 is given 144 plus the number of write tasks.
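As a minimal sketch of that example (hypothetical values; the exact entries and formatting should follow the existing lines in each file), assuming one write group with 6 write tasks:

In input.nml (&fv_core_nml):

layout = 3,8
ntiles = 6

In model_configure:

quilting:                .true.
write_groups:            1
write_tasks_per_group:   6

This gives (3 x 8 x 6) + (1 x 6) = 150 FV3 tasks, and the mediator would be given 144.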

Profiling Timing Across Components

To check run times of different components for load balancing, the following two environment variables must be set:

export ESMF_RUNTIME_PROFILE=ON
export ESMF_RUNTIME_PROFILE_OUTPUT=SUMMARY

For ufs-s2s-model, the environment variables should be added to the file ufs-s2s-model/tests/fv3_conf/fv3_slurm.IN_<platform>, where <platform> is Hera, Orion, etc. This will produce an ESMF_Profile.summary file in the run directory, which gives timing information for the run. See the ESMF Reference Manual for more details.

The ESMF_Profile.summary can also include MPI functions to indicate how much time is spent inside communication calls. To use this feature, modify the file ufs-s2s-model/tests/fv3_conf/fv3_slurm.IN_<platform> to set the environment variable for the location of the MPI profiling preload script, and include the script in the srun command, as shown below. See the ESMF Reference Manual for more details:

# set location of mpi profiling preload script
export ESMF_PRELOAD=${ESMFMKFILE/esmf.mk/preload.sh}

# include preload script before forecast executable in srun command
srun --label -n @[TASKS] $ESMF_PRELOAD ./fcst.exe

Note: the job must complete for the summary table to be written, so make sure to adjust the wall clock limit or the run length accordingly.
