
Add support for Sierra port #43

Open
wants to merge 10 commits into
base: master

dongahn
Collaborator

@dongahn dongahn commented Jul 17, 2018

@lee218llnl: This PR is definitely experimental; please don't merge.

To facilitate your STAT testing, I installed this version into /usr/global/tools/launchmon/blueos_3_ppc64le_ib/default for both CZ and RZ.

TODOs:

  • Verify all of the test cases. (I only verified test.launch_1 and test_attach_1. Even then, there are some minor issues with the launch case.)

  • Add support for exec. Right now, for launch mode, you should use /opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun instead of /opt/ibm/spectrum_mpi/jsm_pmix/bin/jsrun, because the latter execs the former and this version can't handle the exec event yet. Attach mode is not affected.

  • Add support for better handling of threads for attach case

  • Test it on other key platforms

  • When IBM fixes the MPIR_attach_fifo issue, revert the current workaround

@dongahn
Collaborator Author

dongahn commented Jul 17, 2018

BTW, I built this with --enable-verbose=/usr/global/tools/launchmon/logs/blueos_3_ppc64le_ib so your backend logs will go there. Further, it is configured with --enable-debug.

@dongahn dongahn mentioned this pull request Jul 17, 2018
@dongahn
Collaborator Author

dongahn commented Jul 17, 2018

From @lee218llnl:

I had to add /collab/usr/global/tools/launchmon/blueos_3_ppc64le_ib/launchmon-1.1.0-20180716/etc/rm_ibm_spectrum.conf. Otherwise, initial testing (attach and launch) is looking OK. My STAT build in /collab/usr/global/tools/stat/blueos_3_ppc64le_ib/stat-test is using this new launchmon build.

I will add this new config to the install target.

@dongahn
Collaborator Author

dongahn commented Jul 17, 2018

From @lee218llnl:

Occasionally I get hangs with STAT, particularly after running it multiple times. It appears to be in lmon__fe.cxx on line 4601 in a pthread_cond_timedwait. I don’t know if this is an actual effect or just a correlation, but it seems that if I subsequently attach TV to the job and detach TV, then I am able to attach again with STAT.

I have also seen hang-like behavior (looping) in cobo on cobo_connect_hostname. This also appears to happen if I aggressively attach/detach/attach STAT multiple times.

I looked at the (mis)behavior with Greg this morning. At first glance, this looks to me like issues with jsrun's FIFO support and co-location support. I will create a new issue to track this. If this is an IBM problem, we will need a few simpler test cases.

@dongahn dongahn force-pushed the ibm_spectrum branch 2 times, most recently from d24cc6a to 2b262c5 Compare July 18, 2018 22:03
@dongahn
Collaborator Author

dongahn commented Sep 17, 2018

@lee218llnl:

c723213: jsrun now fixes the MPIR_attach_fifo issue of expecting ASCII 1 ('1') instead of integer 1. Even after adjusting this, I think I still see the same intermittent issue you reported:

Occasionally I get hangs with STAT, particularly after running it multiple times. It appears to be in lmon__fe.cxx on line 4601 in a pthread_cond_timedwait.

As you can see from here, my initial suspicion is that jsrun occasionally fails to launch the back-end daemon. I will spend some more time looking at this more closely.

@dongahn
Collaborator Author

dongahn commented Sep 18, 2018

A hang occurs about once every 8-10 runs with the test.attach_1 case.

Trace from the successful runs:

<Sep 17 17:08:56> <LMON FE API> (INFO): GCRYPT shared_key: 1501848644
<Sep 17 17:08:56> <LMON FE API> (INFO): GCRYPT setkey, 31303531:36383438:3434:0
<Sep 17 17:08:56> <Launchmon> (INFO): linux_launchmon_t initialized.
<Sep 17 17:08:56> <Driver> (INFO): now creating a process object based on pid=27599, exe=/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun
<Sep 17 17:08:56> <Symtable> (INFO): reading linkage symbol table for image[=jsrun]
<Sep 17 17:08:56> <Symtable> (INFO): interpreter found in .interp section [/lib64/ld-2.17.so]
<Sep 17 17:08:56> <Symtable> (INFO): reading linkage symbol table for image[=ld-2.17.so]
<Sep 17 17:08:56> <Launchmon> (INFO): The RM process has just been trapped due to attach
<Sep 17 17:08:56> <Launchmon> (INFO):  trap after attach event handler invoked.
<Sep 17 17:08:56> <Symtable> (INFO): reading linkage symbol table for image[=libpthread.so.0]
<Sep 17 17:08:56> <Symtable> (INFO): reading linkage symbol table for image[=libc.so.6]
<Sep 17 17:08:56> <Launchmon> (INFO): trap after attach event handler completed.
<Sep 17 17:08:56> <Launchmon> (INFO): Just continued the RM process out of the first trap
<Sep 17 17:08:56> <Launchmon> (INFO): launch-breakpoint hit event handler invoked.
<Sep 17 17:08:56> <Launchmon> (INFO): launch-breakpoint hit event handler completing with MPIR_DEBUG_SPAWNED
<Sep 17 17:08:56> <Launchmon> (INFO): a proctable message shipped out
<Sep 17 17:08:56> <Launchmon> (INFO): a reshandle message shipped out
<Sep 17 17:08:56> <LMON FE API> (INFO): RPDTAB message received...
<Sep 17 17:08:56> <Launchmon> (INFO): a reshandle message shipped out
<Sep 17 17:08:56> <LMON FE API> (INFO): rid available event received...
<Sep 17 17:08:56> <LMON FE API> (INFO): rminfo event received...
<Sep 17 17:08:56> <LMON FE API> (INFO): launch_bp or first_attach done...

Trace from the hang case:

<Sep 17 17:16:53> <LMON FE API> (INFO): GCRYPT shared_key: 1592601243
<Sep 17 17:16:53> <LMON FE API> (INFO): GCRYPT setkey, 32393531:32313036:3334:0
<Sep 17 17:16:53> <Launchmon> (INFO): linux_launchmon_t initialized.
<Sep 17 17:16:53> <Driver> (INFO): now creating a process object based on pid=35414, exe=/opt/ibm/spectrum_mpi/jsm_pmix/bin/stock/jsrun
<Sep 17 17:16:53> <Symtable> (INFO): reading linkage symbol table for image[=jsrun]
<Sep 17 17:16:53> <Symtable> (INFO): interpreter found in .interp section [/lib64/ld-2.17.so]
<Sep 17 17:16:53> <Symtable> (INFO): reading linkage symbol table for image[=ld-2.17.so]
<Sep 17 17:16:53> <Launchmon> (INFO): The RM process has just been trapped due to attach
<Sep 17 17:16:53> <Launchmon> (INFO):  trap after attach event handler invoked.
<Sep 17 17:16:53> <Symtable> (INFO): reading linkage symbol table for image[=libpthread.so.0]
<Sep 17 17:16:53> <Symtable> (INFO): reading linkage symbol table for image[=libc.so.6]
<Sep 17 17:16:53> <Launchmon> (INFO): trap after attach event handler completed.
<Sep 17 17:16:53> <Launchmon> (INFO): Just continued the RM process out of the first trap

@dongahn
Collaborator Author

dongahn commented Sep 18, 2018

It seems that, every once in a while, jsrun doesn't call MPIR_Breakpoint even after being poked through MPIR_attach_fifo. I think the key is whether we can reproduce this with a simple case using gdb.

@dongahn
Collaborator Author

dongahn commented Sep 18, 2018

@lee218llnl: In the meantime, I ran make install for the version with c723213 into the /collab area so that you can test STAT. You may hit the occasional hangs, but I don't think they should prevent you from doing some scalability benchmarking.

@lee218llnl
Collaborator

OK, I confirmed that it occasionally works. Anecdotally, it appears to hang more often than not for me when running STAT.

@lee218llnl
Collaborator

@dongahn is this ready to be merged? The jsrun fixes appear to resolve the hangs we previously saw.

@dongahn
Collaborator Author

dongahn commented Feb 12, 2019

@lee218llnl: Well, I don't feel comfortable merging this in yet. The lack of CI support makes it difficult to easily check the sanity of a patch on all of the RMs we need to support. There are tons of things that we need to do, but I'm resource bound.

Adjust for the OpenPower ABI function call convention
change: The ABI adds two different entry points, one
for intramodule calls and the other for intermodule calls
(done through the TOC).
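
To illustrate the two entry points: under the ELFv2 ABI, as far as I understand it, the top three bits of a symbol's st_other field encode how far the local (intramodule) entry point sits past the global (intermodule) one. The sketch below decodes that offset. The helper names are hypothetical and this is not launchmon's actual code; the bit layout mirrors the STO_PPC64_LOCAL_* definitions in elf.h.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout (mirrors STO_PPC64_LOCAL_BIT / STO_PPC64_LOCAL_MASK in elf.h):
// bits 5..7 of st_other hold a small code describing the local entry offset.
const unsigned kLocalBit = 5;
const unsigned kLocalMask = 7u << kLocalBit;

// Byte distance from the global entry point (used by intermodule calls, which
// must first set up the TOC pointer) to the local entry point (intramodule calls).
inline std::uint64_t local_entry_offset(unsigned char st_other) {
    const unsigned code = (st_other & kLocalMask) >> kLocalBit;
    assert(code != 7 && "code 7 is reserved");
    // Codes 0 and 1 mean both entry points coincide; code n >= 2 means 2^n bytes.
    return ((std::uint64_t(1) << code) >> 2) << 2;
}

// Hypothetical helper: the address a tool would use if it wanted to stop at
// the local entry point of a function symbol.
inline std::uint64_t local_entry_address(std::uint64_t sym_value,
                                         unsigned char st_other) {
    return sym_value + local_entry_offset(st_other);
}
```

A function whose global entry spends two instructions setting up the TOC pointer would carry code 3, so its local entry is 8 bytes past the symbol value.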

Port for IBM Spectrum jsrun. Add a new Spectrum definition
and also adjust how the back-end daemon code
synchronizes itself with the Spectrum MPI target.

Adjust how new thread creation is handled. When a new thread
is created, we get the SIGTRAP notification via waitpid
on the parent thread that spawns the new thread.
But on a recent Linux kernel, we have to check the high-order
bits of the status returned by waitpid:

right-shift status by 8 bits and then check whether
it is equal to (SIGTRAP | (LINUX_TRACER_EVENT_CLONE << 8)).
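
A minimal sketch of that status check, following the form documented in the Linux ptrace man page (status >> 8 compared against the signal combined with the event code). The function names are hypothetical, and kTraceEventClone is a local stand-in for PTRACE_EVENT_CLONE (value 3 in the Linux headers), which I assume is what LINUX_TRACER_EVENT_CLONE maps to.

```cpp
#include <cassert>
#include <csignal>
#include <sys/wait.h>

// Assumption: same value as PTRACE_EVENT_CLONE in <linux/ptrace.h>.
const int kTraceEventClone = 3;

// True if a waitpid status reports a SIGTRAP stop caused by a clone event,
// i.e. the traced parent thread has just spawned a new thread.
inline bool is_clone_event_stop(int status) {
    if (!WIFSTOPPED(status)) {
        return false;
    }
    return (status >> 8) == (SIGTRAP | (kTraceEventClone << 8));
}

// Extracts the event code stored above the stop signal; 0 for a plain stop.
inline int trace_event(int status) {
    assert(WIFSTOPPED(status));
    return (status >> 16) & 0xffff;
}
```

A plain SIGTRAP stop has an event code of 0, so the two cases can be told apart from the status word alone.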

Add a workaround for jsrun's broken MPIR_attach_fifo.
It is expecting ASCII 1 to be sent to the FIFO when
it should expect a numeric 1. Need to drop it when
IBM ultimately fixes the issue. (Bug filed with IBM.)

Remove BG deadwood from test programs

Add misaligned header comments

Refactor printing of launcher path

jsrun had a bug where MPIR_attach_fifo expected a byte
that contains an ASCII '1' (49) when it should expect a byte
that contains integer 1.

The specification is at
https://www.mpi-forum.org/docs/mpir-specification-10-11-2010.pdf
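
The difference between the two bytes can be sketched as follows. This is an illustration only, not launchmon's actual code: poke_attach_fifo and poke_roundtrip are hypothetical helpers, and a pipe stands in for the FIFO so the example is self-contained.

```cpp
#include <cassert>
#include <unistd.h>

// Writes the single attach byte to an already-open FIFO descriptor.
// ascii_workaround selects the byte the buggy jsrun expected ('1' == 49)
// instead of the integer value 1.
inline bool poke_attach_fifo(int fd, bool ascii_workaround) {
    assert(fd >= 0);
    const unsigned char poke =
        ascii_workaround ? static_cast<unsigned char>('1')
                         : static_cast<unsigned char>(1);
    return write(fd, &poke, 1) == 1;
}

// Usage example: sends the poke through a pipe and returns the byte the
// reading side sees, or -1 on failure.
inline int poke_roundtrip(bool ascii_workaround) {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    unsigned char seen = 0;
    const bool ok = poke_attach_fifo(fds[1], ascii_workaround) &&
                    read(fds[0], &seen, 1) == 1;
    close(fds[0]);
    close(fds[1]);
    return ok ? seen : -1;
}
```

With the workaround the reader sees 49; without it the reader sees 1.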

Warn about a common bug where the target RM process cannot
detect the FIFO poke.

When launchmon sends integer 1 to the FIFO while the RM process
is still stopped, the write can go undetected even after the
RM process resumes its execution, depending on how the FIFO
is polled.

2 participants