Make graceful exit feature MPI-compatible #2475
Comments
I think it would be helpful to clarify what situations you want to handle. A couple I can think of:
MPI
There is some complexity when dealing with multiple processes:
One additional challenge with MPI is that you can end up in a case where a process is waiting on a receive that will never come. This is likely for case 1, but even in cases 2 and 3, there is no guarantee that all processes will receive the notification to cancel jobs at the same time (e.g. if we only check for the existence of a sentinel file every time step, then one process may be slightly ahead and miss the notification). This is further compounded by the fact that if we want to write to HDF5 in the exit procedure, that needs to be done collectively over MPI. There are two options I can see here:
|
I think the most promising case might be to
|
Ctrl-C does not work reliably (for me, at least), so I'd strongly prefer not going down that route. Whether we dump data or not seems orthogonal to me. Users may just want to leave the stepping loop so that they can interactively investigate the time stepper / solution at that time. Also, I'd prefer a solution that is not tied to Buildkite. Regarding MPI + crashed / partially crashed simulations, I think a combination of try-catch and a separate MPI channel (mentioned in ii of bullet 2) would be ideal.
For MPI, if we cannot assume that all processors have the same view of the file system, can we do something like:
|
All that said, a lot of complications come from MPI + crashed simulations. For non-MPI runs, the solution in #2481 seems to be pretty helpful with little effort. |
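For reference, the per-time-step sentinel check discussed in this thread can be sketched for a single process as below. This is only an illustration in Python, not the implementation in #2481: the function `run_steps`, the `on_step` callback, and the file name `graceful_exit` are all hypothetical.

```python
import os
import tempfile


def run_steps(n_steps, sentinel_path, on_step=None):
    """Advance a toy stepping loop, leaving early if the sentinel file exists.

    Returns the number of steps actually completed.
    """
    completed = 0
    for step in range(n_steps):
        # Poll the file system once per step; a real code might poll less often.
        if os.path.exists(sentinel_path):
            break
        if on_step is not None:
            on_step(step)  # stand-in for the real time step
        completed += 1
    return completed


with tempfile.TemporaryDirectory() as tmp:
    sentinel = os.path.join(tmp, "graceful_exit")  # hypothetical file name

    def request_exit_at_step_3(step):
        # Simulate a user creating the sentinel file while the run is in progress.
        if step == 3:
            open(sentinel, "w").close()

    steps_without_request = run_steps(10, sentinel)
    steps_with_request = run_steps(10, sentinel, on_step=request_exit_at_step_3)
```

The loop only leaves at a step boundary, so the state left behind is a consistent end-of-step state that can be inspected interactively afterwards.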
My investigations led me to two discoveries:
Agreed. I don't like tying it to file system behavior either, though.
That would work, but then you can't do any collective operations (like writing HDF5 files) |
One potential wrinkle with my plan: JuliaLang/julia#52771 (polling via |
I don't understand under which conditions using a sentinel file will fail for MPI runs. I understand that we cannot assume |
One MPI process may be ahead of others, and not see that you've changed the graceful exit bool in the file. Processors late to the game do see it, and then the process that missed it will likely end up hanging while waiting for the other processors. |
But wouldn't the faster process just exit at the following iteration? I agree that we cannot (easily) ensure consistency. |
I got it. It would wait at DSS. |
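The hang described in the last few comments can be reproduced with a small simulation that needs no MPI. Everything here is an assumption made for illustration: ranks are modelled as plain list entries, each rank polls the sentinel at wall-clock time `step + offset`, and the built-in `any` stands in for a logical-OR all-reduce across ranks. It is one possible way to make the ranks agree, not necessarily one of the options proposed above.

```python
def local_flag(step, offset, t_create):
    """True if this rank's poll at `step` happens after the sentinel appeared."""
    return step + offset >= t_create


def naive_exit_steps(offsets, t_create, n_steps):
    """Each rank acts on its own observation only."""
    exits = []
    for offset in offsets:
        exit_step = next(
            (s for s in range(n_steps) if local_flag(s, offset, t_create)),
            n_steps,
        )
        exits.append(exit_step)
    return exits


def agreed_exit_steps(offsets, t_create, n_steps):
    """Ranks combine their observations at every step before deciding."""
    exits = [n_steps] * len(offsets)
    for s in range(n_steps):
        flags = [local_flag(s, off, t_create) for off in offsets]
        # Stand-in for a logical-OR all-reduce: every rank gets the same result.
        if any(flags):
            exits = [s] * len(offsets)
            break
    return exits


offsets = [0.0, 0.2, 0.6]  # hypothetical: rank 0 is ahead, rank 2 polls latest
t_create = 4.5             # hypothetical time at which the sentinel appears
naive = naive_exit_steps(offsets, t_create, n_steps=10)
agreed = agreed_exit_steps(offsets, t_create, n_steps=10)
```

With these numbers the naive scheme gives exit steps `[5, 5, 4]`: rank 2 leaves at step 4 while ranks 0 and 1 carry on into step 4 and would then wait for rank 2 in the next collective (e.g. DSS). With the combined flag all ranks leave at step 4, so a collective HDF5 write in the exit path is still possible. The cost is one extra small collective per check, and this does nothing for the crashed-process case.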
I asked on the Julia Slack, and someone pointed me to this: Unfortunately, it doesn't appear to be registered, but it might be possible to extract the signal handling machinery (https://github.com/emmt/InterProcessCommunication.jl/blob/87786bb966809b9d0248d24c77a7601a3abe2ba9/src/signals.jl) out to its own package? |
We added a graceful exit feature, but it has limitations, so I’ll change the title to fix/improve on the limitations (MPI).