
threads and multiprocessing #18

Open
jeremymanning opened this issue Oct 30, 2020 · 4 comments

@jeremymanning
Member

To make the cluster.submit function return faster, it could be implemented as a thread (or a separate process).

The submit process could also be made more robust by having cluster.submit run a single "meta" job whose sole function is to submit all of the other jobs and then save out a file with the corresponding job IDs. That way, if the user ends their session after calling cluster.submit, their jobs will still be submitted and run as long as that first meta job has been submitted.

One tricky thing will be handling job IDs, since only the meta job would have access to those IDs (i.e., not whatever session the user was running). A potential solution would be to have the meta job return only its own ID. Then, when cluster.collect is run on a meta job's ID, it would load in the saved IDs and, in a second pass, load in the corresponding results files for those IDs (or None for jobs that haven't finished running yet).
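
A minimal sketch of the meta-job idea, assuming a Torque-style qsub is available on the cluster; the file location and the load_result helper are hypothetical, not the package's actual API:

```python
# Sketch of a "meta" job: a single queued job whose payload submits all of the
# real jobs, then records their IDs so collect() can find them later.
import json
import subprocess
from pathlib import Path

JOB_ID_FILE = Path("submitted_job_ids.json")  # hypothetical location

def submit_one_job(script_path):
    # Torque's qsub prints the new job's ID on stdout
    result = subprocess.run(["qsub", script_path], capture_output=True,
                            text=True, check=True)
    return result.stdout.strip()

def meta_job(job_scripts):
    """Runs on the cluster: submit every real job, then save their IDs."""
    job_ids = [submit_one_job(script) for script in job_scripts]
    JOB_ID_FILE.write_text(json.dumps(job_ids))

def collect(meta_job_id):
    """Resolve a meta job's ID into the real jobs' results."""
    if not JOB_ID_FILE.exists():
        return None  # the meta job hasn't written its ID file yet
    job_ids = json.loads(JOB_ID_FILE.read_text())
    # second pass: load each job's results, or None for unfinished jobs
    return [load_result(jid) for jid in job_ids]  # load_result: hypothetical
```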

@paxtonfitzpatrick
Member

  • I'm not sure how threading or multiprocessing would help in this case. What were you imagining?

  • submitting from a "meta" job is definitely necessary:

    • remote-submitting from the head node would require keeping the ssh connection open the whole time, which A) would be blocking, and B) could take a long time if your submit script does any preprocessing before sending jobs to the queue. You wouldn't be able to put your computer to sleep, disconnect from wifi, etc. during this time or the submission would fail
    • submitting a large number of jobs from the head node slows it down for everyone else and can cause other users' submissions to fail
  • Fortunately, there's a fairly easy way to handle this issue with the job IDs, and it's partially implemented in the current version of the package:

    • job IDs for all completed, running, and queued jobs (as well as their status) are available via qstat or showq | grep <username>
    • job IDs for running and completed jobs are available via the stdout/stderr files, whose contents can be used to map the job IDs onto their corresponding shell scripts
    • jobs that show a "completed" status but don't have corresponding stdout/stderr files failed with exit status -9 and should be resubmitted

    I tried the approach of saving out a file of job IDs before switching to this, and the problem is that (unless you want to open, write to, and close the output file after every individual job submission) the file won't get created if the submission script fails partway through submitting jobs. I think using the existing Torque commands is the way to go (rough sketch below).
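
A rough sketch of recovering job IDs and statuses straight from Torque instead of a saved file; qstat's exact column layout varies by site, so the parsing here is only illustrative:

```python
# Rough sketch: ask Torque which of this user's jobs exist and what state
# they're in, rather than relying on a file written by the submission script.
import subprocess

def get_job_statuses(username):
    """Return {job_id: status_code} for the user's queued/running/completed jobs."""
    out = subprocess.run(["qstat", "-u", username], capture_output=True,
                         text=True, check=True).stdout
    statuses = {}
    for line in out.splitlines():
        fields = line.split()
        # data rows start with a job ID like "12345.head-node"; skip headers
        if fields and fields[0][0].isdigit():
            statuses[fields[0]] = fields[-2]  # status column, e.g. 'Q', 'R', 'C'
    return statuses
```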

@jeremymanning
Member Author

I'm thinking that it could take a while to execute a remote command on the cluster, but we could help the code return faster on the user's side by running the submit function in a thread so that their interpreter can continue before the function actually returns.
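
A minimal sketch of that, assuming a blocking submit function already exists (the names are illustrative):

```python
# Minimal sketch: run a blocking submit call in a background thread so the
# user's interpreter gets control back immediately.
from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=1)

def submit_async(blocking_submit, *args, **kwargs):
    """Return a Future; call .result() later to get the job ID(s) or re-raise errors."""
    return _executor.submit(blocking_submit, *args, **kwargs)

# usage (cluster.submit stands in for whatever the blocking call ends up being):
# future = submit_async(cluster.submit, job_scripts)
# ...keep working...
# job_ids = future.result()  # blocks only if the submission hasn't finished
```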

@paxtonfitzpatrick
Member

Ah, okay, that makes sense.

The only command I can think of that we'd want to be non-blocking (i.e., subsequent operations are unlikely to directly depend on its output) is qdel. I think everything else would fall under two categories:

  • "check status"-type commands, which would be run either A) by themselves to display cluster/job/etc. info, in which case there's nothing for the interpreter to continue onto, or B) to determine whether do/not do something depending on their output, in which case we wouldn't want the interpreter to continue before they execute
  • "submit jobs"-type commands, which I think should cause a script to fail before allowing the interpreter to continue if exceptions are raised either locally (e.g., user entered their password wrong) or remotely (e.g., a SyntaxError in a job script).

@jeremymanning
Member Author

  • If the submit function only ever submits a single job, I think it's okay for that function to be blocking.
  • I also agree that we need the status commands to be blocking, since otherwise we won't have any information to report.

Some special cases:

  • When a cluster instance is created, it should check to make sure that the connection is available and valid; otherwise we should get an error.
  • Each time any remote command is run, we should re-check the connection before actually doing anything related to the command. If the connection has been lost, throw an error immediately rather than trying to run the command (otherwise we could be stuck waiting for a command sent over a lost connection to return).
  • If we're making the submit function block, then that function should return only once the newly submitted job appears in the appropriate queue (e.g., visible with qstat); see the sketch after this list.
  • If the user requests a status update before the "submit other jobs" job (i.e., the "meta job") has finished running, then we need to be able to handle that situation. The collect and status functions should treat those not-yet-run jobs as incomplete, even when we might not know the corresponding job IDs. Once the meta job starts running, it could create a file with the to-be-submitted job IDs, but we'll also need to handle the special case where the meta job hasn't yet started running (or hasn't yet created that file).
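
A rough sketch of the connection checks and the block-until-queued behavior, assuming a paramiko-style SSH client; the error type and polling parameters are illustrative:

```python
# Rough sketch: verify the SSH connection before every remote command, and
# block after submission until the new job shows up in qstat.
import time

class ConnectionLostError(RuntimeError):
    pass

def check_connection(ssh_client):
    """Raise immediately if the SSH connection is gone (paramiko-style client)."""
    transport = ssh_client.get_transport()
    if transport is None or not transport.is_active():
        raise ConnectionLostError("lost connection to the cluster")

def run_remote(ssh_client, command):
    """Re-check the connection, then run a command on the cluster."""
    check_connection(ssh_client)
    _, stdout, _ = ssh_client.exec_command(command)
    return stdout.read().decode()

def wait_until_queued(ssh_client, job_id, timeout=60, poll=2):
    """Block until the newly submitted job appears in qstat (or time out)."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if job_id in run_remote(ssh_client, "qstat"):
            return
        time.sleep(poll)
    raise TimeoutError(f"job {job_id} never appeared in the queue")
```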
