Add support for Flux resource manager #50

Open
dongahn opened this issue Aug 4, 2020 · 2 comments


dongahn commented Aug 4, 2020

https://github.com/flux-framework/flux-core

The first level of support should be to enable STAT to attach to a running job. My current approach is to support STAT through flux-framework/flux-core#3108 (comment).

Using STAT on LLNL's quartz10, I got:

<Aug 04 14:56:47> <Driver> (INFO): Unknown resource manager type: it could be misconfiguration in a rm config file.
<Aug 04 14:56:47> <LMON FE API> (ERROR): read_lmonp_msg returned a negative return code
<Aug 04 14:56:47> <LMON FE API> (ERROR): front end's connection to launchmon engine is disconnected?

dongahn commented Aug 4, 2020

At first glance, adding flux to LaunchMON's configuration file, plus some minor code changes to accommodate it, should address the above issue.

However, Flux currently has no bulk tool daemon launch support, so we may need to add some support to Flux as well.


dongahn commented Aug 5, 2020

Given flux-framework/flux-core#3110 (comment), the approach that I am thinking about:

  1. Introduce etc/rm_flux.conf with something like the following:
RM=flux
RM_MPIR=STD
RM_launcher=flux-job
RM_launcher_id=RM_launcher|sym|cmd_attach
RM_jobid=RM_launcher|sym|totalview_jobid|string
RM_launch_helper=flux-helper.sh
RM_fail_detection=true
RM_launch_str=--jobid=%j --daemon=%d --daemon-opts=%o --lmonsharedsec=%s --lmonsecchk=%c
  2. Add support to LaunchMON's C++ code proper under RC_flux. I think the support will be almost identical to RC_slurm, given how Flux's exec system implements the initial synchronization with the tools.

  3. Add flux-helper.sh to tools/flux. Introducing a helper script for bulk launching would be a bit more future-proof, as the bulk launch interface is still in the works in Flux (see "Add totalview_jobid symbol into flux-job", flux-framework/flux-core#3110 (comment)). This script will just perform something like the following:

flux exec -r `flux jobs -no {ranks} JOBID` DAEMON_PATH DAEMON_OPTIONS --lmonsharedsec=SHARED_SECRET --lmonsecchk=SECRET_CHECK
  4. Add totalview_jobid to the flux-job front-end command as well, so that LaunchMON can fetch the jobid directly from the address space of the flux-job process.

I think this should at least get us the "attach mode" for STAT under Flux.
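
For illustration, here is a minimal sketch of what flux-helper.sh might look like. The option names are taken from the RM_launch_str line above, and the two flux commands are the ones from the one-liner. Everything else is an assumption: the flux_helper function name, the error handling, and the expectation that LaunchMON passes each --key=value option as a single argument. It is meant as a starting point, not the final script.

```shell
#!/bin/sh
# Hypothetical sketch of tools/flux/flux-helper.sh.
# Option names mirror RM_launch_str; the parsing logic is an assumption.

flux_helper() {
    jobid= daemon= daemon_opts= sharedsec= secchk=
    for arg in "$@"; do
        case "$arg" in
            --jobid=*)         jobid=${arg#--jobid=} ;;
            --daemon=*)        daemon=${arg#--daemon=} ;;
            --daemon-opts=*)   daemon_opts=${arg#--daemon-opts=} ;;
            --lmonsharedsec=*) sharedsec=${arg#--lmonsharedsec=} ;;
            --lmonsecchk=*)    secchk=${arg#--lmonsecchk=} ;;
            *) echo "flux-helper: unknown option: $arg" >&2; return 2 ;;
        esac
    done

    if [ -z "$jobid" ] || [ -z "$daemon" ]; then
        echo "flux-helper: --jobid and --daemon are required" >&2
        return 2
    fi

    # Look up the broker ranks the job runs on, as in the one-liner above.
    ranks=$(flux jobs -no '{ranks}' "$jobid") || return 1

    # daemon_opts is left unquoted on purpose so that it splits into words.
    flux exec -r "$ranks" "$daemon" $daemon_opts \
        "--lmonsharedsec=$sharedsec" "--lmonsecchk=$secchk"
}

# Only run when invoked with arguments (as LaunchMON would do).
if [ "$#" -gt 0 ]; then
    flux_helper "$@"
fi
```

With the configuration above, LaunchMON would presumably expand RM_launch_str and invoke the helper along the lines of `flux-helper.sh --jobid=JOBID --daemon=DAEMON_PATH --daemon-opts=DAEMON_OPTIONS --lmonsharedsec=SHARED_SECRET --lmonsecchk=SECRET_CHECK`. Keeping the flux-specific commands inside the helper means only this script needs to change once Flux grows a proper bulk launch interface.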
