SLURM jobs being set as RUNNING when the actual state of the job is failed #5298
Comments
Thanks for reporting this. I've noticed you are using version
Thanks for responding! I am trying it with the latest version now and will report the findings. However, my understanding of the code snippet (which isn't much, to be honest) is that for any SLURM status other than PENDING, the job would be considered started. I will confirm this with my run on the latest version and update you with the findings.
This seems to be the case for the version: for Job Id 4151, lines 14407, 14432, and 14498 in the nextflow.log say that it has NODE FAILURE in SLURM, but Nextflow picks it up as running.
The NF (NODE FAIL) status is managed in the same way as F (FAILED). Both are considered a QueueStatus.ERROR that has been started and is not active. For all submitted and started tasks that are not active, Nextflow tries to get the exit code. If it doesn't exist, it waits until exitReadTimeout is reached. According to some comments in the code, this behaviour seems to be intended to avoid NFS issues.
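For context, here is a minimal Groovy sketch of the logic described above. It is an illustration only: the identifiers approximate, but are not copied from, the SlurmExecutor/AbstractGridExecutor sources.

```groovy
// Illustrative sketch of the behaviour described above; names are
// approximations of the real Nextflow code, not a verbatim copy.

enum QueueStatus { PENDING, RUNNING, HOLD, ERROR, DONE, UNKNOWN }

// Slurm state codes mapped to queue states. Both 'F' (FAILED) and
// 'NF' (NODE FAIL) collapse into the same ERROR status, so the fact
// that a node failed is no longer visible after this mapping.
def decode = [
    'PD': QueueStatus.PENDING,   // pending
    'R' : QueueStatus.RUNNING,   // running
    'CD': QueueStatus.DONE,      // completed
    'F' : QueueStatus.ERROR,     // failed
    'NF': QueueStatus.ERROR      // node fail -- handled exactly like 'F'
]

// "Started" means any state other than PENDING; "active" means still
// running or held. An ERROR state is therefore started but not active,
// which makes Nextflow poll for the .exitcode file until exitReadTimeout.
def started = { QueueStatus st -> st != QueueStatus.PENDING }
def active  = { QueueStatus st -> st in [QueueStatus.RUNNING, QueueStatus.HOLD] }

def st = decode['NF']
assert started(st) && !active(st)   // NODE FAIL looks like a started, inactive job
```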
Thank you for looking into this! I understand that the job fails eventually, but the main problem is that there is no log saying that the node failed and that this is why the process failed.
I see what is happening. The final error message includes the dump of the queue status, but the failing job doesn't appear because the Slurm squeue command does not print old jobs after a certain time. I will check how it could be improved.
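As a side note on the exitReadTimeout discussed earlier: the wait for the missing .exitcode file is bounded by the executor's exitReadTimeout setting. A minimal nextflow.config sketch, with an arbitrary example value rather than a recommendation, would look like this:

```groovy
// nextflow.config -- illustrative values only.
// exitReadTimeout bounds how long Nextflow waits for the .exitcode file
// of a job it believes has started (the default is around 270 sec).
executor {
    name            = 'slurm'
    exitReadTimeout = '5 min'
}
```

Raising this value only postpones the eventual error; it does not surface the underlying NODE FAIL reason.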
Bug report
Expected behavior and actual behavior
Situation:
When SLURM jobs are submitted by Nextflow to a SLURM node, sometimes the node fails to start up. This causes the SLURM job to take the NF (Node Fail) status.

Expected behaviour: Nextflow should report that this job failed because of a node failure.

Actual behaviour: Nextflow interprets the job as started (because the code does not handle this case) but then neither finds the .exitcode file nor marks the job as active (the job will never reach the RUNNING state). So the job eventually throws an error because of exitReadTimeout.

Steps to reproduce the problem
A Nextflow process configured to run on a node that fails to start up.
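As an illustration only (not part of the original report), one hypothetical way to provoke this state is to pin a process to a node that is known to fail at startup; the node name below is a placeholder, not a real host:

```groovy
// main.nf -- hypothetical reproduction sketch. 'bad-node-01' stands in for
// a node on your cluster that is known to fail at startup.
nextflow.enable.dsl = 2

process failing_node_test {
    executor 'slurm'
    clusterOptions '--nodelist=bad-node-01'   // pin the job to the flaky node

    output:
    stdout

    script:
    """
    echo 'this should never run: the node fails before the job starts'
    """
}

workflow {
    failing_node_test()
}
```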
Program output
I don't have the program output, but I have the nextflow.log file with TRACE enabled. Please look for Job ID 4078, lines 13076, 13101, and 13126.

Environment
Additional context
The problem, I think, is in
nextflow/modules/nextflow/src/main/groovy/nextflow/executor/AbstractGridExecutor.groovy
Lines 371 to 385 in a0f6902
This might be related to #4962, but I am not sure... I didn't read through the full logs.