
Additional manual jobs cause REMOTE_ERROR state for jobflow-remote jobs when submission maximum is reached #135

Open
QuantumChemist opened this issue Jun 28, 2024 · 5 comments

Comments

@QuantumChemist
Contributor

Hi 😀

we have a submission limit of 40 or 20 jobs (depending on the queue of our HPC cluster). When I manually start additional VASP jobs and thereby reach that limit, the submission of the next jobs from the jobflow-remote queue fails and they go into the REMOTE_ERROR state. I then have to retry the jobs once I'm below the limit again, so I cannot let the jobs run overnight or over the weekend and instead have to constantly watch the workflow. I'm using the interactive branch.

Do you have an idea how to solve this problem? I temporarily worked around it with a bash one-liner that retries all jobs in the REMOTE_ERROR state every few hours, but that's not really a desirable solution. Have you ever faced a similar issue?
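For reference, the periodic-retry workaround described above could be sketched like this. The `jf job retry -s REMOTE_ERROR` invocation is an assumption on my part; check `jf job retry --help` for the exact flags your version accepts:

```python
# Sketch of the periodic-retry workaround (assumption: "jf job retry -s
# REMOTE_ERROR" retries all jobs in that state; verify with --help).
import subprocess
import time


def retry_command(state: str = "REMOTE_ERROR") -> list[str]:
    """Build the CLI invocation that retries all jobs in the given state."""
    return ["jf", "job", "retry", "-s", state]


def watch_loop(interval_hours: float = 3.0) -> None:
    """Retry REMOTE_ERROR jobs every few hours, forever."""
    while True:
        subprocess.run(retry_command(), check=False)
        time.sleep(interval_hours * 3600)
```

A cron entry calling the same command would avoid keeping a process alive, but either way it only papers over the error instead of preventing it.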

@QuantumChemist
Contributor Author

QuantumChemist commented Jun 28, 2024

I have no example at hand right now, but the error I retrieved via `jf job info <id>` is something along the lines of "submission limit reached".

@gpetretto
Contributor

Hi, I am not sure if this will solve your issue, but it is possible to set a maximum number of jobs submitted to a worker:
https://matgenix.github.io/jobflow-remote/user/advancedoptions.html#limiting-the-number-of-submitted-jobs.
However this will only keep track of the jobs submitted by jobflow and will not take into account those submitted manually.
Would this help or would you need something that keeps track of the total number of jobs in the cluster instead?
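As a sketch of the option from the linked docs (the worker name and the surrounding YAML layout are assumptions here; see the documentation for the exact project-configuration schema):

```yaml
# Sketch only: per-worker submission cap from the docs linked above.
# Worker name and layout are placeholders, not a verified config.
workers:
  my_hpc_worker:
    max_jobs: 40   # caps jobs submitted by jobflow-remote to this worker
```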

@QuantumChemist
Contributor Author

QuantumChemist commented Jun 28, 2024

> Hi, I am not sure if this will solve your issue, but it is possible to set a maximum number of jobs submitted to a worker: https://matgenix.github.io/jobflow-remote/user/advancedoptions.html#limiting-the-number-of-submitted-jobs. However this will only keep track of the jobs submitted by jobflow and will not take into account those submitted manually. Would this help or would you need something that keeps track of the total number of jobs in the cluster instead?

Hi! I have already set this maximum number of jobs, because I have far more jobs than the limit. Unfortunately it doesn't help, so I would really appreciate a way to keep track of all the jobs in the cluster (and in a given queue).
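Until such a feature exists, one way to approximate it externally is to count all of the user's jobs in the scheduler queue before submitting more. A minimal sketch, assuming a SLURM cluster (the limit of 40 comes from this thread; the `squeue` flags are standard SLURM options):

```python
# Sketch (not part of jobflow-remote): count ALL of this user's jobs in a
# SLURM partition, including manually submitted ones, and check the limit.
import subprocess


def count_jobs_from_output(squeue_output: str) -> int:
    """Count jobs from 'squeue -h -o %i' output (one job id per line)."""
    return len(squeue_output.split())


def count_my_jobs(partition: str, user: str) -> int:
    """Query SLURM for this user's jobs in the given partition."""
    out = subprocess.run(
        ["squeue", "-h", "-p", partition, "-u", user, "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_jobs_from_output(out)


def can_submit(n_jobs: int, limit: int = 40) -> bool:
    """True if one more submission would stay within the queue limit."""
    return n_jobs < limit
```

A wrapper like this could gate submissions from outside jobflow-remote, though a race is still possible if jobs are submitted between the count and the submission.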

@gpetretto
Contributor

I see. Indeed, I suspected this could be the case.
Unfortunately, at the moment there is no way of enforcing such a constraint.
One of the issues is that there is no strict link between a worker and the resources used by each job. The current implementation relies only on information known to jobflow-remote's runner or DB. Implementing this will thus mean adding some ad hoc configuration parameter and relying on the user to assign jobs to the proper worker.
I will think about the best way of adding this feature.

@QuantumChemist
Contributor Author

Oh, I see. That indeed doesn't sound so easy to implement.
Maybe I can get it working by playing around with the runner options like `step_attempts` and `get_delta_retry` etc.
