evp kernel version 2 testing and validation #279
Some more information on the test failures with evp_kernel_ver=2, running just with the intel compiler for now (other compilers have similar issues).
There are also some cases (which I haven't reported on before) that fail bit-for-bit comparison between different cases when the results should be identical. These are comparisons with the same model, just different decompositions and so forth.
If I run these tests with the OMP commented out in ice_dyn_evp_1d.F90, they pass, and some of the other failed tests also run. So OMP may explain some of the failures, but it does not explain all of them. Commenting out the OMP is not going to be acceptable longer term for evp_kernel_ver=2, since it relies only on OpenMP for performance. There are also a couple of other tests that still fail with evp_kernel_ver=2 even with all OpenMP turned off (by selecting pe layouts with 1 thread/task, which does not invoke the OpenMP compiler option), although they run fine with evp_kernel_ver=0. I do not expect evp_kernel_ver=2 to get bit-for-bit results with ver=0, although maybe it should. At this point, I am not testing that. The first step is to make sure all the tests that pass with ver=0 also pass with ver=2. Thoughts?
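As background on how threading alone can break a bit-for-bit comparison: floating-point addition is not associative, so a sum accumulated as per-thread partial sums can differ from the same sum done serially. The sketch below is a generic Python illustration, not code from ice_dyn_evp_1d.F90. The helper name `chunked_sum` and the sample values are hypothetical, and nothing here shows that this is the actual cause of the failures described above.

```python
def chunked_sum(values, nchunks):
    """Sum values as nchunks contiguous partial sums, then combine them.

    Mimics how a threaded reduction may group additions. Explicit loops
    are used so the order of additions is exactly what is written here.
    """
    n = len(values)
    partials = []
    for c in range(nchunks):
        start = c * n // nchunks
        stop = (c + 1) * n // nchunks
        acc = 0.0
        for v in values[start:stop]:
            acc += v
        partials.append(acc)
    total = 0.0
    for p in partials:
        total += p
    return total


values = [1e16, 1.0, -1e16, 1.0]
serial = chunked_sum(values, 1)    # one chunk: plain left-to-right sum
threaded = chunked_sum(values, 2)  # two chunks, as with 2 threads
print(serial, threaded, serial == threaded)
```

For these values the serial sum is 1.0 and the two-chunk sum is 0.0, so the same data fails a bit-for-bit comparison purely because the additions were grouped differently.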
#318 addresses several of the outstanding points to some degree,
#318 also documents that the current implementation is not validated and the code will abort when kevp_kernel=2. As a workaround for testing, kevp_kernel=102 will turn on version=2. Once this is validated, we will remove the abort and the workaround.
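The selection logic described above can be sketched as follows. This is only an illustration in Python of the behaviour stated in this comment, not the Fortran in #318. The function name `resolve_evp_kernel` is hypothetical, and passing other values through unchanged is an assumption.

```python
def resolve_evp_kernel(kevp_kernel):
    """Return the EVP kernel version to run for a given kevp_kernel value.

    Hypothetical helper mirroring the abort and workaround described above.
    """
    if kevp_kernel == 2:
        # Version 2 is not validated yet, so selecting it directly aborts.
        raise RuntimeError("kevp_kernel=2 is not validated; aborting")
    if kevp_kernel == 102:
        # Testing workaround: 102 turns on version 2.
        return 2
    # Assumption: any other value is passed through unchanged.
    return kevp_kernel
```

Once version 2 is validated, both special cases would be dropped and the value used directly.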
In nag testing, a separate issue was discovered. When I add -nan to the compile, I get a separate error,
I spent a few minutes trying to understand this. It only happens when threading is on and -nan is set. The error went away when I removed the threadprivate OMP definition at line 51 and changed the OMP PARALLEL DEFAULT(none) statement on line 70 to OMP PARALLEL PRIVATE(domp_iam,domp_nt,rdomp_iam,rdomp_nt). At this time, I think we have to assume this is a compiler bug. It doesn't make a lot of sense that the -nan compiler argument breaks some thread private implementation. On the other hand, it would be nice to be able to use the -nan option, and I do have some lingering concerns about whether the thread implementation in evp_1d is completely correct.
Also want to note the fix to the kind types due to nag testing in ice_dyn_evp_1d.F90 that was part of #356.
@apcraig The tx1 test case is set to exit within the 1d solver, as the conversion from 2d to 1d and vice versa is not tested for tripole grids.
@TillRasmussen and @apcraig : The 1d solver will not work correctly for tripole grids (will run but gives wrong results at "tripole boundaries"). That's why an exit with an error is implemented for tri-grids.
tripole is not the issue. It's fine that it's not working for those grids. I think I knew that and did not raise it as a concern. All the documented problems are for non-tripole grids. gx1 and gx3 are not tripole grids.
I just ran a full test suite with kevp_kernel = 102 set in ice_in on cheyenne with the intel compiler. I compared it to this weekend's full test suite, which passes all tests. The evp1d results are at https://github.com/CICE-Consortium/Test-Results/wiki/922b998005.cheyenne.intel.20-10-26.221713.0 and, for comparison, the baseline results are at https://github.com/CICE-Consortium/Test-Results/wiki/922b998005.cheyenne.intel.20-10-25.025008.0 . Most of the tests pass, and most of the tests produce results that are not bit-for-bit with the baseline. But some tests fail, those are
which is largely consistent with results from last year. I have not looked into each of the failures in any detail. If you want me to do that, I can. If you need any help understanding any of the tests, let me know. And again, happy to help do some additional analysis if that would be helpful.
Following up, have all of the issues in this issue been addressed adequately, up to and including #568?
See #623 for followup issues.
We are going to merge PR #278 and PR #252. There are several outstanding issues, basically copied from the end of #252,
Let me summarize where we are.
With evp_kernel_ver=0, results are bit-for-bit for most tests against the current master. This is running full test suites on gordon for 4 compilers. A subset of box tests is NOT bit-for-bit on 3 of the 4 compilers. Rerunning the failed box tests with the debug flag (reduced optimization and run-time checks) on both master and this PR results in bit-for-bit identical answers. It seems the changes in the answers in the box tests are caused by some compiler optimization as a result of the code changes. This might be associated with the evp kernel changes (although @mhrib makes a case it shouldn't) or it might be associated with some of the code cleanup. We could look into this further or we could accept it. Personally, I am comfortable with this outcome as it stands. I believe we've shown the answers are roundoff different (see above gbox128 diff) as a result of compiler optimization and that we can make this bit-for-bit if we reduce compiler optimization. I think, based on these results, we could merge this PR. evp_kernel_ver=0 will be the default setting.
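To make the distinction between "bit-for-bit" and "roundoff different" concrete, here is a minimal Python sketch. It is not the comparison tool used by the test suite. The names `bits` and `classify`, the tolerance `rtol`, and the sample values are all assumptions chosen for illustration, and NaN handling is left out.

```python
import struct


def bits(x):
    """Return the raw 64-bit pattern of a double as an integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def classify(a, b, rtol=1e-12):
    """Classify two equal-length sequences of doubles.

    Returns "bit-for-bit" if every pair has identical bits, "roundoff" if
    every pair agrees within the assumed relative tolerance rtol, and
    "different" otherwise.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if all(bits(x) == bits(y) for x, y in zip(a, b)):
        return "bit-for-bit"
    for x, y in zip(a, b):
        scale = max(abs(x), abs(y))
        if abs(x - y) > rtol * scale:
            return "different"
    return "roundoff"


baseline = [0.3, 1.0]
optimized = [0.1 + 0.2, 1.0]  # same math, different operation order
print(classify(baseline, baseline))   # identical bits
print(classify(baseline, optimized))  # differs only in the last bits
```

The second comparison is the situation described above: the answers are not bit-for-bit, yet they differ only at roundoff level.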
Separately, there is an effort to test and validate evp_kernel_ver=2. The same test suite on gordon was run with the new kernel on. Results can be found at https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks, hash aa6de33...+evpk=2. Three to four tests fail on each compiler, and they are the same tests across the compilers. Looking at the intel results, https://github.com/CICE-Consortium/Test-Results/wiki/aa6de33f19.gordon.pgi.190128.235649, there are four failures.
Again, many tests passed, but these 4 failures need to be debugged. In addition, the qc test relies on the gx1 configuration, so the qc testing comparing evp_kernel_ver=2 to 0 could not be done.
So, the outstanding tasks are