Add support for Sierra port #43
base: master
BTW, I built this with |
From @lee218llnl:
Will add this new config into install target. |
From @lee218llnl:
I looked at the (mis)behavior with Greg this morning. At first glance, this looks to me like issues with |
d24cc6a to 2b262c5
c723213: jsrun now fixes the
As you can see from here, my initial suspicion is |
A hang occurs about once every 8 - 10 times with Trace from the successful runs:
Trace from the hang case:
|
It seems like the jsrun doesn't call MPIR_Breakpoint every once in a while even after being poked at |
@lee218llnl: In the meanwhile, I did |
OK, I confirmed that it occasionally works. Anecdotally, it appears to hang more often than not for me when running STAT. |
@dongahn is this ready to be merged? The jsrun fixes appear to resolve the hangs we previously saw. |
@lee218llnl: Well, I don't feel comfortable merging this in yet. The lack of CI support makes it difficult to easily check the sanity of a patch on all of the RMs we need to support. There are tons of things that we need to do, but I'm resource bound. |
Adjust for the OpenPower ABI function call convention change: the ABI adds two different entry points, one for intramodule calls and the other for intermodule calls (made through the TOC).

Port for IBM Spectrum jsrun. Add a new Spectrum definition and adjust the back-end daemon code in how it synchronizes itself with the Spectrum MPI target.

Adjust how new thread creation is handled. When a new thread is created, we get the SIGTRAP notification via waitpid on the parent thread that spawned the new thread. But on a recent Linux kernel, we have to check the high-order bits of the status returned by waitpid: shift the status right by 8 bits and check whether the result equals (SIGTRAP | (LINUX_TRACER_EVENT_CLONE << 8)).

Add a workaround for jsrun's broken MPIR_attach_fifo. It expects ASCII '1' to be sent to the FIFO when it should expect a numeric 1. The workaround needs to be dropped when IBM ultimately fixes the issue. (Bug filed with IBM.)
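The waitpid status check described in the commit message above can be sketched as follows. This is a minimal illustration, not launchmon's actual code: `EVENT_CLONE` is a stand-in for `LINUX_TRACER_EVENT_CLONE` and is assumed to carry the value of Linux's `PTRACE_EVENT_CLONE` (3), and the function names are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Assumption: same value as Linux's PTRACE_EVENT_CLONE. */
enum { EVENT_CLONE = 3 };

/* True if a waitpid status reports a ptrace clone event stop:
 * after dropping the low 8 bits, the stop signal (SIGTRAP) sits in
 * the low byte and the event code in the next byte. */
bool is_clone_event(int status)
{
    return (status >> 8) == (SIGTRAP | (EVENT_CLONE << 8));
}

/* Build a status word shaped like what waitpid reports for a clone
 * event stop: 0x7f marks "stopped", then the signal, then the event. */
int make_clone_stop_status(void)
{
    return 0x7f | (SIGTRAP << 8) | (EVENT_CLONE << 16);
}

/* Usage sketch: a clone event stop matches, a plain SIGTRAP stop does not. */
void clone_event_check(void)
{
    assert(is_clone_event(make_clone_stop_status()));
    assert(!is_clone_event(0x7f | (SIGTRAP << 8)));
}
```

In a real tracer the status would come from `waitpid` on a traced thread with the clone trace option enabled; the synthetic status here only makes the bit layout visible.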
Remove BG deadwood from test programs
Add misaligned header comments
Refactor printing of launcher path
jsrun had a bug where MPIR_attach_fifo expected a byte containing ASCII '1' (49) when it should have expected a byte containing the integer 1. The specification is at https://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf
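The difference between the two byte values can be made concrete with a small sketch. The helper name and the `jsrun_workaround` flag are hypothetical and are not taken from launchmon's source; they only illustrate which byte would be written in each case.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: choose the byte to write to the attach FIFO.
 * With the workaround enabled, send the ASCII character '1' (value 49);
 * otherwise send the integer value 1. */
unsigned char attach_fifo_byte(bool jsrun_workaround)
{
    return jsrun_workaround ? (unsigned char)'1' : (unsigned char)1;
}

/* Usage sketch: the two bytes differ even though both "look like" 1. */
void attach_fifo_byte_check(void)
{
    assert(attach_fifo_byte(false) == 1);
    assert(attach_fifo_byte(true) == 49);
}
```

Keeping the choice behind a single flag would make the workaround easy to revert once the launcher is fixed.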
Warn about a common bug where the target RM process cannot detect the FIFO poke. When launchmon sends integer 1 to the FIFO while the RM process is still stopped, the write can go undetected even after the RM process resumes execution, depending on how the FIFO is polled.
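For contrast, here is a sketch of a reader-side check that does not depend on being awake at the moment of the write. It is not jsrun's or launchmon's code; it uses an anonymous pipe in place of the FIFO, and all function names are hypothetical. It relies only on the fact that a byte written to a pipe stays queued until it is read, so a level-style `poll` issued later still reports it.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* True if at least one byte is waiting to be read on fd (non-blocking). */
bool poke_pending(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    return poll(&p, 1, 0) == 1 && (p.revents & POLLIN) != 0;
}

/* Write the poke first, check for it afterwards, as a reader that was
 * stopped during the write would after resuming. */
bool poke_then_check(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return false;

    unsigned char one = 1;
    bool ok = write(fds[1], &one, 1) == 1;

    /* The byte is still queued, so a later poll sees it. */
    ok = ok && poke_pending(fds[0]);

    unsigned char got = 0;
    ok = ok && read(fds[0], &got, 1) == 1 && got == 1;

    /* Once consumed, nothing is pending any more. */
    ok = ok && !poke_pending(fds[0]);

    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

A reader that instead waits for a notification delivered at write time could miss a poke made while it was stopped, which is one way the behavior described above might arise.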
@lee218llnl: This PR is definitely experimental; please don't merge.
To facilitate your STAT testing, I installed this version into /usr/global/tools/launchmon/blueos_3_ppc64le_ib/default for both CZ and RZ.

TODOs:
- Verify all of the test cases. (I only verified test.launch_1 and test_attach_1. Even then, there are some minor issues with the launch case.)
- Add support for exec. Right now, for launch mode, you should use /opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun instead of /opt/ibm/spectrum_mpi/jsm_pmix/bin/jsrun, because the latter execs the former and this version can't handle exec events yet. This is okay with attach mode.
- Add support for better handling of threads for the attach case.
- Test it on other key platforms.
- When IBM fixes the MPIR_attach_fifo issue, revert the current workaround.