Artificial differences seen in fluxsite results when openmpi is loaded #304
Comments
I wasn't able to reproduce this - I found differences in the output when building and running CABLE outside of benchcab. I narrowed down the differences in output to the following commit: CABLE-LSM/CABLE@0a69346. See here for the commit history of |
This does not make any sense. The commit you highlighted changes the calculation of And the potential evaporation (epot) was not part of the outputs before that branch so the only change in the output we should see is an additional variable in the file. All other variables should be the same. |
I tried running a debugger on the AU-Tum fluxsite configuration and I now seem to be getting a floating point overflow error:
I am still not sure whether this is related to the original problem; investigating further.
Investigating the above error further, I found that the debug build and release build in the main branch are not bit reproducible in model output for fluxsite tests (tested with commit CABLE-LSM/CABLE@860094b). The floating point overflow errors were due to uninitialised variables: 1. Fixing the floating point errors restores bit reproducibility between release and debug builds. Applying the fix to commits CABLE-LSM/CABLE@860094b and CABLE-LSM/CABLE@0a69346 and doing a comparison shows that the two commits now only differ in model output w.r.t the |
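To make the distinction between "bit reproducible" and "numerical precision differences" concrete, here is a minimal sketch in Python. It is not how benchcab compares outputs; the function names and sample values are made up for illustration, and it assumes the values have already been read into plain lists of floats (NaN values are not handled).

```python
import struct


def float_bits(x):
    """Return the IEEE 754 double-precision bit pattern of x as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def compare_outputs(a, b):
    """Compare two equal-length sequences of floats.

    Returns a tuple (bit_identical, max_abs_diff):
    - bit_identical is True only if every pair of values has the same bits
    - max_abs_diff is the largest absolute difference between paired values
    """
    if len(a) != len(b):
        raise ValueError("sequences differ in length")
    bit_identical = all(float_bits(x) == float_bits(y) for x, y in zip(a, b))
    max_abs_diff = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    return bit_identical, max_abs_diff


# Hypothetical values standing in for one output variable from two builds.
build_a = [0.1 + 0.2, 1.0]
build_b = [0.3, 1.0]
identical, diff = compare_outputs(build_a, build_b)
print(identical, diff)
```

Two builds can agree to within rounding error and still fail a bitwise comparison, which is the kind of difference reported in this issue.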
When working on issue #335 in CABLE, the associated benchcab simulations returned numerical precision differences in all variables for the fluxsite experiments, see here.
After investigation, it turns out this is due to loading openmpi when doing serial compilation.
Tests performed
Running benchcab with main and the #335 branch returned differences in fluxsite outputs between realisations.
Running one of the tasks using a serial compilation of main and the #335 branch done outside benchcab returned no differences between the outputs. These tests were done using the build.bash script from the CABLE repository, ensuring we loaded the same versions of the netcdf and intel compiler modules. These outputs are identical to the outputs of main using benchcab.
If we compile CABLE serially using the build.bash script but with an openmpi module loaded (3 versions were tested), then the #335 branch gives slightly different results to the main branch. This happens even though the compilation does not use the openmpi module directly; it is probably a difference in some environment variable.
What do we want to do?
This is annoying as it may result in false negative results from benchcab.
Do we want to investigate further to identify where the difference in the environment actually is? Is that useful?
Do we want to fix that in benchcab? Would that mean only loading the necessary modules at compilation time or is there another solution?
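If we do want to find the difference in the environment, one possible starting point is to capture the output of env once without and once with the openmpi module loaded, then diff the two. The sketch below is only an illustration: the function names, variable names and paths are hypothetical, and it assumes one NAME=value pair per line (multi-line values such as exported shell functions would not be parsed correctly).

```python
def parse_env(text):
    """Parse env-style output (one NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:  # skip lines that do not look like NAME=value
            env[name] = value
    return env


def diff_env(before, after):
    """Return (added, removed, changed) between two environment dicts."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed


# Made-up captures; real ones would come from running `env` in each setup.
without_mpi = "PATH=/usr/bin\nLD_LIBRARY_PATH=/apps/netcdf/lib\n"
with_mpi = (
    "PATH=/usr/bin\n"
    "LD_LIBRARY_PATH=/apps/openmpi/lib:/apps/netcdf/lib\n"
    "EXAMPLE_MPI_VAR=1\n"
)

added, removed, changed = diff_env(parse_env(without_mpi), parse_env(with_mpi))
for name in sorted(added):
    print("added:", name, "=", added[name])
for name in sorted(changed):
    print("changed:", name, changed[name][0], "->", changed[name][1])
```

Any variable that shows up as added or changed is a candidate for what alters the serial build, and could then be unset one at a time to narrow it down.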
@SeanBryan51 @bschroeter @abhaasgoyal @Whyborn mentioning you since I'd appreciate some discussion here.