Make graceful exit feature MPI-compatible #2475

Open
charleskawczynski opened this issue Jan 3, 2024 · 12 comments

@charleskawczynski
Member

No description provided.

@simonbyrne
Member

I think it would be helpful to clarify what situations you want to handle. A couple I can think of:

  1. Dump the most recent state when a job crashes
    • Could do this with a try/catch or an atexit hook. The question is how to handle MPI (see below).
  2. Allow users to externally cancel a job, dumping the current state.
  3. Dump the most recent state when a job is about to time out. Schedulers usually have a mechanism to signal that they're about to do this, e.g.
    • Slurm has a --signal option, which will send a signal to the job when its time limit is about to expire.
    • Most cloud preemptible instances will give you a short notice period (e.g. 30 seconds) before the VM is killed. For example, GCP will let you register a shutdown script that will be called.

MPI

There is some complexity when dealing with multiple processes:

  • How to signal to all processes that they need to exit gracefully?
  • In case 1, if one process crashes, we need to be able to alert all other processes. Unfortunately, the only mechanism MPI provides is MPI_Abort, which doesn't let us exit gracefully. Two options I can think of:
    1. Call scancel to send the signal to all other jobs (this is Slurm-specific though)
    2. Have another MPI channel (e.g. tag=0) that we can use to send control messages between processes (I've sort of wanted something like this for logging and other non-compute workloads); a sketch follows below.
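
A minimal sketch of option 2, assuming MPI.jl; the names `setup_control_channel`, `abort_requested`, and `CONTROL_TAG` are made up for illustration. Each rank duplicates the communicator so control traffic cannot collide with compute traffic, posts a nonblocking receive for an abort flag, and checks it once per time step.

```julia
using MPI

const CONTROL_TAG = 0  # hypothetical tag reserved for control messages

# Duplicate the communicator and post a nonblocking receive that completes
# once any rank sends the abort flag on the control channel.
function setup_control_channel(comm::MPI.Comm)
    ctrl_comm = MPI.Comm_dup(comm)
    buf = zeros(Cint, 1)
    req = MPI.Irecv!(buf, ctrl_comm; source = MPI.ANY_SOURCE, tag = CONTROL_TAG)
    return (; ctrl_comm, buf, req)
end

# Cheap per-time-step check: returns true once an abort message has arrived.
abort_requested(ctrl) = MPI.Test(ctrl.req)
```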

One additional challenge with MPI is that you can end up in a case where a process is waiting on a receive which will never come. This is likely for case 1, but even in cases 2 and 3, there is no guarantee that all processes will receive the notification to cancel jobs at the same time (e.g. if we only check the existence of a sentinel file every time step, then 1 process may be slightly ahead and miss the notification). This is further compounded by the fact that if we want to write to HDF5 in the exit procedure, that needs to be done collectively over MPI.

There are two options I can see here:

  • We run the neighbor communication and the HDF5 output on separate CPU threads. For this to work, we would need to initialize with MPI_THREAD_MULTIPLE (as we could have two separate MPI calls going at the same time).
  • We switch the MPI.Waitalls to MPI.Testall (basically a cooperative waitall, similar to JuliaParallel/MPI.jl#766, "Add non-blocking wait"), which would let the Julia task scheduler switch between different MPI comms; a sketch follows this list.
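
A minimal sketch of the second option, assuming `reqs` is a vector of MPI.Request objects from already-posted Isend/Irecv! calls: instead of blocking inside MPI.Waitall, poll with MPI.Testall and yield, so other Julia tasks (e.g. one watching for a cancellation signal) get a chance to run in between.

```julia
using MPI

# Cooperative replacement for MPI.Waitall(reqs): poll instead of blocking,
# yielding to the Julia scheduler between polls.
function cooperative_waitall(reqs::Vector{MPI.Request})
    while !MPI.Testall(reqs)
        yield()
    end
    return nothing
end
```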

@simonbyrne
Member

simonbyrne commented Jan 5, 2024

I think the most promising approach might be to:

  1. Wrap solve in try/catch.
  2. Set up our CI to send SIGINT on cancellation, as that seems the most well-supported, and is capturable as an exception by Julia. This would probably require setting Base.exit_on_sigint(false) so that we can capture it correctly.
  3. Not sure what to do about MPI: I think the easiest option for now would be to make all MPI communication happen on a different thread (basically change our MPI.Waitall(...) to wait(Threads.@spawn(MPI.Waitall(...)))); a sketch follows below.
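
A rough sketch of points 1 and 2; `solve!` and `dump_state` are placeholders standing in for the real driver code, not actual APIs. With `Base.exit_on_sigint(false)`, a SIGINT sent by CI arrives as an `InterruptException` that the try/catch can handle.

```julia
Base.exit_on_sigint(false)  # deliver SIGINT as an InterruptException instead of exiting

function run_with_graceful_exit(integrator)
    try
        solve!(integrator)                  # the normal stepping loop (placeholder)
    catch e
        if e isa InterruptException
            @info "Received SIGINT; exiting gracefully"
        else
            @error "Simulation crashed" exception = (e, catch_backtrace())
        end
        dump_state(integrator)              # write out whatever state we have (placeholder)
    end
end
```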

@charleskawczynski
Member Author

charleskawczynski commented Jan 5, 2024

Ctrl-C does not reliably work (for me, at least). So, I'd strongly prefer not going down that route.

Whether we dump data or not seems orthogonal to me. Users may just want to leave the stepping loop so that they can interactively investigate the time stepper / solution at that time.

Also, I'd prefer a solution that is not tied to buildkite.

Regarding MPI + crashed / partially crashed simulations, I think a combination of try-catch and a separate MPI channel (mentioned in ii of bullet 2) would be ideal.

(e.g. if we only check the existence of a sentinel file every time step, then 1 process may be slightly ahead and miss the notification).

For MPI, if we cannot assume that all processors have the same view of the file system, can we do something like:

  • If a processor (let's say X) crashes (e.g., in saturation adjustment), we can use the try-catch and that processor will gracefully exit
  • Another processor (Y) may not have crashed and is waiting on X to communicate (e.g., DSS); if we use tasks to perform that communication and also watch for a kill signal, we can be sure that one of the two will work.
  • We can have processor X send kill signals to all other processors after it gracefully exits?
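
A sketch of that last bullet, reusing the hypothetical control channel from the earlier comment: once rank X has caught the exception, it sends the abort flag to every other rank so that nobody waits forever.

```julia
using MPI

# Notify every other rank on the control communicator that this rank is exiting.
function notify_other_ranks(ctrl_comm::MPI.Comm; tag = 0)
    me = MPI.Comm_rank(ctrl_comm)
    flag = Cint[1]
    reqs = MPI.Request[]
    for dest in 0:(MPI.Comm_size(ctrl_comm) - 1)
        dest == me && continue
        push!(reqs, MPI.Isend(flag, ctrl_comm; dest, tag))
    end
    MPI.Waitall(reqs)  # sends complete quickly if every rank has a matching Irecv! posted
end
```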

@charleskawczynski
Member Author

All that said, a lot of complications come from MPI + crashed simulations. For non-MPI runs, the solution in #2481 seems to be pretty helpful with little effort.

@simonbyrne
Member

Ctrl-C does not reliably work (for me, at least). So, I'd strongly prefer not going down that route.

My investigations led me to two discoveries:

  1. You need to set Base.exit_on_sigint(false) when using a script (this ensures an InterruptException is thrown).
  2. This will not, however, break you out of ccalls (e.g. if you're stuck in an MPI.Waitall). Calling the blocking operation on a different thread (wait(Threads.@spawn(...))) solves the issue.
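
A one-line sketch of point 2 (assuming Julia is started with more than one thread): run the blocking call on another thread and `wait` on the task, so the main task stays interruptible.

```julia
using MPI

# Blocking MPI call moved off the main thread; an InterruptException can still
# be delivered to the task that is `wait`ing.
interruptible_waitall(reqs) = wait(Threads.@spawn MPI.Waitall(reqs))
```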

Also, I'd prefer a solution that is not tied to buildkite.

Agreed. I don't like tying it to file system behavior either, though.

For MPI, if we cannot assume that all processors have the same view of the file system, can we do something like:

* If a processor (let's say X) crashes (e.g., in saturation adjustment), we can use the try-catch and that processor will gracefully exit

* Another processor (Y) may not have crashed and is waiting on X to communicate (e.g., DSS); if we use tasks to perform that communication and also watch for a kill signal, we can be sure that one of the two will work.

* We can have processor X send kill signals to all other processors after it gracefully exits.

That would work, but then you can't do any collective operations (like writing HDF5 files).

@simonbyrne
Member

One potential wrinkle with my plan: JuliaLang/julia#52771 (polling via MPI.Testall would still work though)

@charleskawczynski changed the title from "Add graceful exit watchdog file" to "Add graceful exit feature" on Jan 11, 2024
@Sbozzolo
Member

I don't understand under which conditions using a sentinel file will fail for MPI runs. I understand that we cannot assume the filesystem to be perfectly identical at all times across different processes, but the output directory will likely be shared and synced. There might be some latency, but eventually all the processes will see that there's a termination file. Isn't this the case?

@charleskawczynski
Member Author

I don't understand under which conditions using a sentinel file will fail for MPI runs.

One MPI process may be ahead of the others and not see that the graceful-exit flag in the file has changed. Processes that are behind do see it and exit, and the process that missed it will likely end up hanging while waiting for them.
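
A minimal sketch of the check being described (the file name is hypothetical): each rank looks for a sentinel file at the top of its time step, so a rank that checks just before the file appears proceeds into the next DSS exchange while the others leave the loop, and it then hangs.

```julia
# Per-time-step check of a sentinel file in the (shared) output directory.
# If rank A calls this just before the file appears and rank B just after,
# A continues into the next DSS exchange while B exits, and A hangs.
should_exit_gracefully(output_dir) = isfile(joinpath(output_dir, "graceful_exit"))
```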

@Sbozzolo
Member

I don't understand under which conditions using a sentinel file will fail for MPI runs.

One MPI process may be ahead of the others and not see that the graceful-exit flag in the file has changed. Processes that are behind do see it and exit, and the process that missed it will likely end up hanging while waiting for them.

But wouldn't the faster process just exit at the following iteration?

I agree that we cannot (easily) ensure consistency.

@Sbozzolo
Member

I got it.

It would wait at DSS.

@simonbyrne
Member

I asked on the Julia Slack, and someone pointed me to this:
https://gitlab.gwdg.de/eDLS/InPartS.jl/-/blob/main/ext/InterProcessCommunicationExt.jl?ref_type=heads
which uses this package
https://github.com/emmt/InterProcessCommunication.jl

Unfortunately, it doesn't appear to be registered, but it might be possible to extract the signal handling machinery (https://github.com/emmt/InterProcessCommunication.jl/blob/87786bb966809b9d0248d24c77a7601a3abe2ba9/src/signals.jl) out to its own package?

@charleskawczynski
Member Author

We added a graceful exit feature, but it has limitations, so I'll change the title to focus on fixing/improving the limitations (MPI).

@charleskawczynski changed the title from "Add graceful exit feature" to "Make graceful exit feature MPI-compatible" on Feb 15, 2024