Artificial differences seen in fluxsite results when openmpi is loaded #304

ccarouge · 2024-07-24T01:18:27Z

When working on issue #335 in CABLE, the associated benchcab simulations returned numerical precision differences in all variables for the fluxsite experiments, see here.

After investigation, it turns out this is due to loading openmpi when doing serial compilation.

Tests performed

Running benchcab with main and #335 branch returned differences in fluxsite outputs between realisations.
Running one of the tasks using a serial compilation of main and #335 branch done outside benchcab returned no differences between the outputs. These tests were done using the build.bash script from the CABLE repository and ensuring we loaded the same versions of netcdf and intel compiler modules. These outputs are identical to the outputs of main using benchcab.

It turns out that if we compile CABLE serially using the build.bash script but with an openmpi module loaded (3 versions were tested), the #335 branch gives slightly different results from the main branch. This happens even though the compilation does not use the openmpi module directly; it is probably caused by a difference in some environment variable.

What do we want to do?

This is annoying as it may result in false negative results from benchcab.

Do we want to investigate further to identify where the difference in the environment actually is? Is that useful?
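One way to start that investigation is to dump the environment before and after loading the module and compare the two dumps. The sketch below is only an illustration: the helper names `parse_env` and `diff_env` are hypothetical, and it assumes the dumps come from something like `env > before.txt`, then `module load openmpi`, then `env > after.txt` in the same shell.

```python
def parse_env(dump: str) -> dict[str, str]:
    """Parse the text output of `env` into a dict.

    Assumption: one NAME=value pair per line; multi-line values
    (e.g. exported shell functions) are not handled.
    """
    env = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            env[name] = value
    return env


def diff_env(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (old, new)} for every variable that was added,
    removed, or changed. A missing variable is reported as None."""
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


if __name__ == "__main__":
    # Made-up dumps for illustration; real ones would be read from files.
    before = parse_env("PATH=/usr/bin\nHOME=/home/user")
    after = parse_env("PATH=/opt/mpi/bin:/usr/bin\nHOME=/home/user\nMPI_HOME=/opt/mpi")
    for name, (old, new) in diff_env(before, after).items():
        print(f"{name}: {old!r} -> {new!r}")
```

Any variable that appears in the diff and is read by the compiler or build script (include paths, library paths, compiler wrappers) would be a candidate for explaining the changed results.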

Do we want to fix that in benchcab? Would that mean only loading the necessary modules at compilation time or is there another solution?

@SeanBryan51 @bschroeter @abhaasgoyal @Whyborn mentioning you since I'd appreciate some discussion here.

SeanBryan51 · 2024-08-08T06:37:42Z

> Running one of the tasks using a serial compilation of main and #335 branch done outside benchcab returned no differences between the outputs. These tests were done using the build.bash script from the CABLE repository and ensuring we loaded the same versions of netcdf and intel compiler modules.

I wasn't able to reproduce this - I found differences in the output when building and running CABLE outside of benchcab.

I narrowed down the differences in output to the following commit: CABLE-LSM/CABLE@0a69346.

See here for the commit history of 335-facilitate-output-of-potential-evaporation-directly-from-the-offline-code-base.

ccarouge · 2024-08-09T05:28:18Z

This does not make any sense. The commit you highlighted changes the calculation of canopy%epot. This variable is only used for canopy%wetfac_cs which is not used in standalone, only in the coupled model. So changing the equation to calculate canopy%epot should not change the results at all in standalone!

And the potential evaporation (epot) was not part of the outputs before that branch so the only change in the output we should see is an additional variable in the file. All other variables should be the same.

SeanBryan51 · 2024-08-09T06:21:28Z

I tried running a debugger on the AU-Tum fluxsite configuration and I now seem to be getting a floating point overflow error:

forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source             
cable              0000000000B357D4  Unknown               Unknown  Unknown
libpthread-2.28.s  00007FFFEC28DD20  Unknown               Unknown  Unknown
cable              0000000000717C6D  cable_canopy_modu         497  cable_canopy.F90
cable              00000000006BC2FE  cable_cbm_module_         169  cbl_model_driver_offline.F90
cable              000000000041864D  MAIN__                    798  cable_driver.F90
cable              000000000040CBA2  Unknown               Unknown  Unknown
libc-2.28.so       00007FFFEBEDF7E5  __libc_start_main     Unknown  Unknown
cable              000000000040CAAE  Unknown               Unknown  Unknown

Still not sure if this is related to the original problem, investigating further.

SeanBryan51 · 2024-08-13T00:54:12Z

Investigating the above error further, I found that the debug build and release build in the main branch are not bit reproducible in model output for fluxsite tests (tested with commit CABLE-LSM/CABLE@860094b).

The floating point overflow errors were due to uninitialised variables: 1. canopy%DvLitt and 2. sum_rad_gradis. Fixing 1 does not change results. Fixing 2 does change results (see CABLE-LSM/CABLE#351).

Fixing the floating point errors restores bit reproducibility between release and debug builds. Applying the fix to commits CABLE-LSM/CABLE@860094b and CABLE-LSM/CABLE@0a69346 and comparing the results shows that the two commits now differ in model output only with respect to the PotEvap variable, which is expected.
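For reference, a bit-reproducibility comparison of this kind can be sketched as follows. This is not how benchcab does its comparison; it is a minimal illustration that assumes the output variables of each run have already been read into plain Python mappings from variable name to a flat list of floats (reading the netCDF files is left out). The function names and the `Qle` and `Qh` variable names in the usage are hypothetical.

```python
import struct


def bits(value: float) -> bytes:
    """Return the IEEE 754 double-precision bit pattern of a float."""
    return struct.pack("<d", value)


def differing_variables(run_a: dict, run_b: dict) -> list[str]:
    """List the variables that are not bit-identical between two runs.

    A variable counts as differing if it is missing from one run,
    has a different length, or has any element whose bit pattern differs.
    """
    differing = []
    for name in sorted(run_a.keys() | run_b.keys()):
        a, b = run_a.get(name), run_b.get(name)
        if a is None or b is None or len(a) != len(b):
            differing.append(name)
        elif any(bits(x) != bits(y) for x, y in zip(a, b)):
            differing.append(name)
    return differing


if __name__ == "__main__":
    # Made-up values: the second run adds PotEvap and is otherwise identical.
    old = {"Qle": [1.0, 2.0], "Qh": [0.5]}
    new = {"Qle": [1.0, 2.0], "Qh": [0.5], "PotEvap": [5.0]}
    print(differing_variables(old, new))
```

Comparing bit patterns rather than using a tolerance means that even last-digit differences, like the ones reported in this issue, are flagged.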

SeanBryan51 mentioned this issue Aug 8, 2024

added potential evaporation to offline output, changed checks range, … CABLE-LSM/CABLE#346

Merged

5 tasks

SeanBryan51 self-assigned this Aug 8, 2024

SeanBryan51 closed this as completed Aug 13, 2024

SeanBryan51 mentioned this issue Aug 15, 2024

Support for debug builds and/or configurable compiler flags #307

Open
