
libpmi: add support for PMI_PORT #2156

Closed
garlick opened this issue May 10, 2019 · 4 comments

Comments

@garlick
Member

garlick commented May 10, 2019

Currently the Flux simple PMI implementation exclusively uses file descriptor passing to establish a connection between a PMI provider such as flux-start or wrexecd (job shell) and a program rank. The PMI provider creates a socketpair, then passes the client end of it to the program rank via the PMI_FD environment variable. This mechanism is used in mpich and is sort of a de facto standard for the PMI-1 wire protocol. I documented it in Flux RFC 13.

It may be useful to allow program ranks to connect remotely to a PMI provider, or to allow multiple threads within a rank to establish independent connections.

There exists in the MPICH code base another option for establishing a PMI-1 wire protocol connection that is less commonly used (and configured off by default, IIRC). If one sets an environment variable PMI_PORT to a hostname:port tuple, a program rank can connect to a PMI provider over TCP.

Supporting this mechanism in the PMI implementation in flux-start could enable an instance to be started with pdsh or similar. flux-start on rank 0 would need to obtain the allocated port number, then run a script that calls something like

pdsh -w hostlist flux-start --pmi-port=hostname:port --pmi-rank=%n --pmi-size=size

Supporting it in wrexecd (or job shell) could help with #1789

Security and scalability concerns apply of course.

@grondo
Contributor

grondo commented May 10, 2019

Great idea, and this might help with auto-start of a Flux instance under Slurm.

For this bootstrap mode, two new options would be needed:

  • The hostlist of target systems
  • The path to the script used to run the remaining (non rank 0) flux-start commands

e.g., a flux session could be started with simply:

$ flux start --bootstrap=rsh --hostlist=foo[0-12] --rsh-command=/path/to/script

Once flux-start has opened the PMI port, it could then launch the configured rsh-command (perhaps with a sensible default), substituting $PMI_HOST, $PMI_PORT, $PMI_SIZE, and $PMI_HOSTLIST in the environment of the script.

@garlick
Member Author

garlick commented Jun 5, 2019

I got part way done implementing this in #2172 and realized that, for the multi-thread case (e.g. PAMI with PMIx calls intercepted + openmpi in spectrum MPI), if both threads are doing a put / barrier / get pattern, there was no way to prevent the barrier calls from becoming interspersed, since the barriers are "anonymous" (unnamed). For example, thread 0 might enter the barrier first on one rank, and thread 1 might enter it first on another rank, and the barrier count might be reached before either barrier is complete, causing premature barrier exit.

However there may be a way to distinguish the two threads. When PMI_PORT is used, an additional initack handshake is performed in which the client presents itself with an ID in addition to the rank. The ABNF wire protocol looks like this:

    C:initack = "cmd=initack" SP "pmiid=" int LF
    S:initack = "cmd=initack" LF
    S:initack = "cmd=set" SP "size=" int LF
    S:initack = "cmd=set" SP "rank=" int LF
    S:initack = "cmd=set" SP "debug=" int LF

In the mpich code it looked like the ID was passed to the client in a PMI_ID environment variable, and I think this is used to identify which rank is connecting on the common listen port. But if, say, the PAMI interceptor were to create a unique ID (like rank + (size * tid), where tid=1 for pami and 0 for openmpi), maybe we could use this to keep the barriers separate?

@garlick
Member Author

garlick commented Jun 5, 2019

This cannot be done using the PMI-1 API, however, since PMI-1 doesn't provide a way to set the ID via the API. So I guess it would need to be done via direct use of the flux `pmi_simple_client` class (not currently exported). This behavior is also not defined for PMI-1 as far as I know, so we may be straying into a bad place: adding complexity in order to introduce unexpected behavior.

@garlick
Member Author

garlick commented Aug 17, 2022

This is probably not an advisable change to make, despite its existence in the mpich reference implementation. We now have flux-pmix to deal with PAMI/spectrum MPI (or at least that is the preferred path to a solution).

@garlick garlick closed this as completed Aug 17, 2022