Make graceful exit feature MPI-compatible #2475

Open
charleskawczynski opened this issue Jan 3, 2024 · 12 comments

@charleskawczynski
Member

No description provided.

@simonbyrne
Member

I think it would be helpful to clarify what situations you want to handle. A couple I can think of:

  1. Dump the most recent state when a job crashes
    • Could do this with a try/catch or an atexit hook. The question is how to handle MPI (see below).
  2. Allow users to externally cancel a job, dumping the current state.
  3. Dump the most recent state when a job is about to time out. Schedulers usually have a mechanism to signal that they're about to do this, e.g.
    • Slurm has a --signal option, which will send a signal to the job when its time limit is about to expire.
    • Most cloud preemptible instances will give you a short notice period (e.g. 30 seconds) before the VM is killed. For example, GCP will let you register a shutdown script that will be called.

MPI

There is some complexity when dealing with multiple processes:

  • How to signal to all processes that they need to exit gracefully?
  • In case 1, if one process crashes, we need to be able to alert all other processes. Unfortunately, the only mechanism MPI provides is MPI_Abort, which doesn't let us exit gracefully. Two options I can think of:
    1. Call scancel to send the signal to all other jobs (this is Slurm-specific though)
    2. Have another MPI channel (e.g. tag=0) that we can use to send control messages between processes (I've sort of wanted something like this for logging and other non-compute workloads); a sketch follows below.
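
A minimal sketch of option 2, assuming MPI.jl; the names `setup_control_channel`, `abort_requested`, and `CONTROL_TAG` are made up for illustration. Each rank duplicates the communicator so control traffic cannot collide with compute traffic, posts a nonblocking receive for an abort flag, and checks it once per time step.

```julia
using MPI

const CONTROL_TAG = 0  # hypothetical tag reserved for control messages

# Duplicate the communicator and post a nonblocking receive that completes
# once any rank sends the abort flag on the control channel.
function setup_control_channel(comm::MPI.Comm)
    ctrl_comm = MPI.Comm_dup(comm)
    buf = zeros(Cint, 1)
    req = MPI.Irecv!(buf, ctrl_comm; source = MPI.ANY_SOURCE, tag = CONTROL_TAG)
    return (; ctrl_comm, buf, req)
end

# Cheap per-time-step check: returns true once an abort message has arrived.
abort_requested(ctrl) = MPI.Test(ctrl.req)
```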

One additional challenge with MPI is that you can end up in a case where a process is waiting on a receive which will never come. This is likely for case 1, but even in cases 2 and 3, there is no guarantee that all processes will receive the notification to cancel jobs at the same time (e.g. if we only check the existence of a sentinel file every time step, then 1 process may be slightly ahead and miss the notification). This is further compounded by the fact that if we want to write to HDF5 in the exit procedure, that needs to be done collectively over MPI.

There are two options I can see here:

  • We run the neighbor communication and the HDF5 output on separate CPU threads. For this to work, we would need to initialize with MPI_THREAD_MULTIPLE (as we could have two separate MPI calls going at the same time).
  • We switch the MPI.Waitalls to MPI.Testall (basically a cooperative waitall, similar to JuliaParallel/MPI.jl#766, "Add non-blocking wait"), which would let the Julia task scheduler switch between different MPI comms; a sketch follows this list.
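
A minimal sketch of the second option, assuming `reqs` is a vector of MPI.Request objects from already-posted Isend/Irecv! calls: instead of blocking inside MPI.Waitall, poll with MPI.Testall and yield, so other Julia tasks (e.g. one watching for a cancellation signal) get a chance to run in between.

```julia
using MPI

# Cooperative replacement for MPI.Waitall(reqs): poll instead of blocking,
# yielding to the Julia scheduler between polls.
function cooperative_waitall(reqs::Vector{MPI.Request})
    while !MPI.Testall(reqs)
        yield()
    end
    return nothing
end
```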

@simonbyrne
Member

simonbyrne commented Jan 5, 2024

I think the most promising approach might be to:

  1. Wrap solve in try/catch.
  2. Set up our CI to send SIGINT on cancellation, as that seems the most well-supported, and is capturable as an exception by Julia. This would probably require setting Base.exit_on_sigint(false) so that we can capture it correctly.
  3. Not sure what to do about MPI: I think the easiest option for now would be to make all MPI communication happen on a different thread (basically change our MPI.Waitall(...) to wait(Threads.@spawn(MPI.Waitall(...)))); a sketch follows below.
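
A rough sketch of points 1 and 2; `solve!` and `dump_state` are placeholders standing in for the real driver code, not actual APIs. With `Base.exit_on_sigint(false)`, a SIGINT sent by CI arrives as an `InterruptException` that the try/catch can handle.

```julia
Base.exit_on_sigint(false)  # deliver SIGINT as an InterruptException instead of exiting

function run_with_graceful_exit(integrator)
    try
        solve!(integrator)                  # the normal stepping loop (placeholder)
    catch e
        if e isa InterruptException
            @info "Received SIGINT; exiting gracefully"
        else
            @error "Simulation crashed" exception = (e, catch_backtrace())
        end
        dump_state(integrator)              # write out whatever state we have (placeholder)
    end
end
```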

@charleskawczynski
Member Author

charleskawczynski commented Jan 5, 2024

Ctrl-C does not reliably work (for me, at least). So, I'd strongly prefer not going down that route.

Whether we dump data or not seems orthogonal to me. Users may just want to leave the stepping loop so that they can interactively investigate the time stepper / solution at that time.

Also, I'd prefer a solution that is not tied to buildkite.

Regarding MPI + crashed / partially crashed simulations, I think a combination of try-catch and a separate MPI channel (mentioned in ii of bullet 2) would be ideal.

(e.g. if we only check the existence of a sentinel file every time step, then 1 process may be slightly ahead and miss the notification).

For MPI, if we cannot assume that all processors have the same view of the file system, can we do something like:

  • If a processor (let's say X) crashes (e.g., in saturation adjustment), we can use the try-catch and that processor will gracefully exit
  • Another processor (Y) may not have crashed and is waiting on X to communicate (e.g., DSS); if we use tasks to perform that communication and also watch for a kill signal, we can be sure that one of the two will work.
  • We can have processor X send kill signals to all other processors after it gracefully exits?
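
A sketch of that last bullet, reusing the hypothetical control channel from the earlier comment: once rank X has caught the exception, it sends the abort flag to every other rank so that nobody waits forever.

```julia
using MPI

# Notify every other rank on the control communicator that this rank is exiting.
function notify_other_ranks(ctrl_comm::MPI.Comm; tag = 0)
    me = MPI.Comm_rank(ctrl_comm)
    flag = Cint[1]
    reqs = MPI.Request[]
    for dest in 0:(MPI.Comm_size(ctrl_comm) - 1)
        dest == me && continue
        push!(reqs, MPI.Isend(flag, ctrl_comm; dest, tag))
    end
    MPI.Waitall(reqs)  # sends complete quickly if every rank has a matching Irecv! posted
end
```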

@charleskawczynski
Member Author

All that said, a lot of complications come from MPI + crashed simulations. For non-MPI runs, the solution in #2481 seems to be pretty helpful with little effort.

@simonbyrne
Member

Ctrl-C does not reliably work (for me, at least). So, I'd strongly prefer not going down that route.

My investigations led me to two discoveries:

  1. You need to set Base.exit_on_sigint(false) when using a script (this ensures an InterruptException is thrown).
  2. This will not, however, break you out of ccalls (e.g. if you're stuck in an MPI.Waitall). Calling the blocking operation on a different thread (wait(Threads.@spawn(...))) solves the issue.
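
A one-line sketch of point 2 (assuming Julia is started with more than one thread): run the blocking call on another thread and `wait` on the task, so the main task stays interruptible.

```julia
using MPI

# Blocking MPI call moved off the main thread; an InterruptException can still
# be delivered to the task that is `wait`ing.
interruptible_waitall(reqs) = wait(Threads.@spawn MPI.Waitall(reqs))
```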

Also, I'd prefer a solution that is not tied to buildkite.

Agreed. I don't like tying it to file system behavior either, though.

For MPI, if we cannot assume that all processors have the same view of the file system, can we do something like:

* If a processor (let's say X) crashes (e.g., in saturation adjustment), we can use the try-catch and that processor will gracefully exit

* Another processor (Y) may not have crashed and is waiting on X to communicate (e.g., DSS); if we use tasks to perform that communication and also watch for a kill signal, we can be sure that one of the two will work.

* We can have processor X send kill signals to all other processors after it gracefully exits.

That would work, but then you can't do any collective operations (like writing HDF5 files).

@simonbyrne
Member

One potential wrinkle with my plan: JuliaLang/julia#52771 (polling via MPI.Testall would still work though)

@charleskawczynski changed the title from "Add graceful exit watchdog file" to "Add graceful exit feature" on Jan 11, 2024
@Sbozzolo
Member

I don't understand under which conditions using a sentinel file will fail for MPI runs. I understand that we cannot assume the filesystem to be perfectly identical at all times across different processes, but the output directory will likely be shared and synced. There might be some latency, but eventually all the processes will see that there's a termination file. Isn't this the case?

@charleskawczynski
Member Author

I don't understand under which conditions using a sentinel file will fail for MPI runs.

One MPI process may be ahead of the others and not see that the graceful-exit flag in the file has changed. Processes that are behind do see it and exit, and the process that missed it will likely end up hanging while waiting for them.
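
A minimal sketch of the check being described (the file name is hypothetical): each rank looks for a sentinel file at the top of its time step, so a rank that checks just before the file appears proceeds into the next DSS exchange while the others leave the loop, and it then hangs.

```julia
# Per-time-step check of a sentinel file in the (shared) output directory.
# If rank A calls this just before the file appears and rank B just after,
# A continues into the next DSS exchange while B exits, and A hangs.
should_exit_gracefully(output_dir) = isfile(joinpath(output_dir, "graceful_exit"))
```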

@Sbozzolo
Member

I don't understand under which conditions using a sentinel file will fail for MPI runs.

One MPI process may be ahead of the others and not see that the graceful-exit flag in the file has changed. Processes that are behind do see it and exit, and the process that missed it will likely end up hanging while waiting for them.

But wouldn't the faster process just exit at the following iteration?

I agree that we cannot (easily) ensure consistency.

@Sbozzolo
Member

I got it.

It would wait at DSS.

@simonbyrne
Member

I asked on the Julia Slack, and someone pointed me to this:
https://gitlab.gwdg.de/eDLS/InPartS.jl/-/blob/main/ext/InterProcessCommunicationExt.jl?ref_type=heads
which uses this package
https://github.com/emmt/InterProcessCommunication.jl

Unfortunately, it doesn't appear to be registered, but it might be possible to extract the signal handling machinery (https://github.com/emmt/InterProcessCommunication.jl/blob/87786bb966809b9d0248d24c77a7601a3abe2ba9/src/signals.jl) out to its own package?

@charleskawczynski
Member Author

We added a graceful exit feature, but it has limitations, so I'll change the title to focus on fixing/improving the limitations (MPI).

@charleskawczynski changed the title from "Add graceful exit feature" to "Make graceful exit feature MPI-compatible" on Feb 15, 2024