Artificial differences seen in fluxsite results when openmpi is loaded #304

Closed
ccarouge opened this issue Jul 24, 2024 · 4 comments

@ccarouge
Collaborator

When working on issue #335 in CABLE, the associated benchcab simulations returned numerical precision differences in all variables for the fluxsite experiments, see here.

After investigation, it turns out this is due to loading openmpi when doing serial compilation.

Tests performed

Running benchcab with main and #335 branch returned differences in fluxsite outputs between realisations.
Running one of the tasks using a serial compilation of main and #335 branch done outside benchcab returned no differences between the outputs. These tests were done using the build.bash script from the CABLE repository and ensuring we loaded the same versions of netcdf and intel compiler modules. These outputs are identical to the outputs of main using benchcab.

It turns out that if we compile CABLE serially using the build.bash script but with an openmpi module loaded (3 versions were tested), then the #335 branch gives slightly different results from the main branch. This happens even though the compilation does not use the openmpi module directly, so it's probably caused by a difference in some environment variable.

What do we want to do?

This is annoying as it may cause benchcab to report spurious differences, i.e. differences that are not caused by the code changes being tested.

Do we want to investigate further to identify where the difference in the environment actually is? Is that useful?
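
One way to start on this question is to save the output of `env` once without and once with the openmpi module loaded, then diff the two dumps. The following is a minimal sketch of that comparison, not a tested procedure: the sample dumps and the `OMPI_EXAMPLE` variable are made up for illustration, and in practice the text would be read from the saved files.

```python
def parse_env(dump: str) -> dict[str, str]:
    """Parse the text output of `env` into a {name: value} mapping."""
    env = {}
    for line in dump.splitlines():
        name, sep, value = line.partition("=")
        if sep:  # lines without '=' are skipped
            env[name] = value
    return env


def diff_env(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (old, new)} for each variable added, removed or changed."""
    changes = {}
    for name in sorted(before.keys() | after.keys()):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Hypothetical dumps: replace with the contents of the two saved `env` outputs.
without_mpi = "PATH=/usr/bin\nFC=ifort\n"
with_mpi = "PATH=/opt/openmpi/bin:/usr/bin\nFC=ifort\nOMPI_EXAMPLE=1\n"

for name, (old, new) in diff_env(parse_env(without_mpi), parse_env(with_mpi)).items():
    print(f"{name}: {old!r} -> {new!r}")
```

A value of `None` on either side means the variable exists in only one of the two environments.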

Do we want to fix that in benchcab? Would that mean only loading the necessary modules at compilation time or is there another solution?

@SeanBryan51 @bschroeter @abhaasgoyal @Whyborn mentioning you since I'd appreciate some discussion here.

@SeanBryan51
Collaborator

SeanBryan51 commented Aug 8, 2024

> Running one of the tasks using a serial compilation of main and #335 branch done outside benchcab returned no differences between the outputs. These tests were done using the build.bash script from the CABLE repository and ensuring we loaded the same versions of netcdf and intel compiler modules.

I wasn't able to reproduce this - I found differences in the output when building and running CABLE outside of benchcab.

I narrowed down the differences in output to the following commit: CABLE-LSM/CABLE@0a69346.

See here for the commit history of 335-facilitate-output-of-potential-evaporation-directly-from-the-offline-code-base.

@ccarouge
Collaborator Author

ccarouge commented Aug 9, 2024

This does not make any sense. The commit you highlighted changes the calculation of canopy%epot. This variable is only used for canopy%wetfac_cs which is not used in standalone, only in the coupled model. So changing the equation to calculate canopy%epot should not change the results at all in standalone!

And the potential evaporation (epot) was not part of the outputs before that branch so the only change in the output we should see is an additional variable in the file. All other variables should be the same.

@SeanBryan51
Collaborator

I tried running a debugger on the AU-Tum fluxsite configuration and I now seem to be getting a floating point overflow error:

forrtl: error (72): floating overflow
Image              PC                Routine            Line        Source             
cable              0000000000B357D4  Unknown               Unknown  Unknown
libpthread-2.28.s  00007FFFEC28DD20  Unknown               Unknown  Unknown
cable              0000000000717C6D  cable_canopy_modu         497  cable_canopy.F90
cable              00000000006BC2FE  cable_cbm_module_         169  cbl_model_driver_offline.F90
cable              000000000041864D  MAIN__                    798  cable_driver.F90
cable              000000000040CBA2  Unknown               Unknown  Unknown
libc-2.28.so       00007FFFEBEDF7E5  __libc_start_main     Unknown  Unknown
cable              000000000040CAAE  Unknown               Unknown  Unknown

Still not sure if this is related to the original problem, investigating further.

@SeanBryan51
Collaborator

SeanBryan51 commented Aug 13, 2024

Investigating the above error further, I found that the debug build and release build in the main branch are not bit reproducible in model output for fluxsite tests (tested with commit CABLE-LSM/CABLE@860094b).

The floating point overflow errors were due to uninitialised variables: 1. canopy%DvLitt and 2. sum_rad_gradis. Fixing 1 does not change results. Fixing 2 does change results (see CABLE-LSM/CABLE#351).

Fixing the floating point errors restores bit reproducibility between release and debug builds. Applying the fix to commits CABLE-LSM/CABLE@860094b and CABLE-LSM/CABLE@0a69346 and comparing the results shows that the two commits now differ in model output only in the PotEvap variable, which is expected.
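
For reference, the kind of bit-for-bit comparison described above can be sketched as follows. This is only an illustration and not the comparison benchcab actually performs: the two runs are assumed to be already loaded into dictionaries of float lists, and the variable names and values are made up.

```python
import struct


def bits(x: float) -> bytes:
    """Return the 8-byte IEEE 754 double representation of x."""
    return struct.pack("<d", x)


def differing_variables(run_a: dict, run_b: dict) -> list:
    """Return sorted names of variables missing from one run or not bit-identical."""
    differing = []
    for name in sorted(run_a.keys() | run_b.keys()):
        a, b = run_a.get(name), run_b.get(name)
        if a is None or b is None or len(a) != len(b):
            differing.append(name)  # variable absent in one run, or shapes differ
        elif any(bits(x) != bits(y) for x, y in zip(a, b)):
            differing.append(name)  # at least one value differs at the bit level
    return differing


# Made-up values for illustration only.
old = {"Qle": [101.5, 98.25], "Qh": [40.0, 42.5]}
new = {"Qle": [101.5, 98.25], "Qh": [40.0, 42.5], "PotEvap": [3.1, 2.9]}
print(differing_variables(old, new))  # ['PotEvap']
```

Comparing bit patterns rather than using a tolerance means that even a last-digit rounding difference is reported, which is the level of reproducibility discussed in this thread.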

2 participants