
SLURM jobs being set as RUNNING when the actual state of the job is failed #5298

Open
sgopalan98 opened this issue Sep 12, 2024 · 6 comments · May be fixed by #5514

@sgopalan98

Bug report

Expected behavior and actual behavior

Situation:
When Nextflow submits SLURM jobs, the node allocated to a job sometimes fails to start up. This causes the SLURM job to take the NF (NODE FAIL) status.

Expected behaviour: Nextflow should report that the job failed because of a node failure.
Actual behaviour: because the code does not handle this case, Nextflow assumes the job has started, but it never finds the .exitcode file and never marks the job as active (the job never reaches the RUNNING state). Eventually the task fails with an exitReadTimeout error.

Steps to reproduce the problem

Configure a Nextflow process to run on a node that fails to start up.

Program output

I don't have the program output, but I have the nextflow.log file with TRACE enabled. Please look for job ID 4078, at lines 13076, 13101, and 13126 of the log.

Environment

  • Nextflow version: 22.10.6
  • Java version: 17.0.9
  • Operating system: Linux
  • Bash version: GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)

Additional context

I think the problem is in this method:

boolean checkStartedStatus(jobId, queueName) {
    assert jobId
    // -- fetch the queue status
    final queue = getQueueStatus(queueName)
    if( !queue )
        return false
    if( !queue.containsKey(jobId) )
        return false
    if( queue.get(jobId) == QueueStatus.PENDING )
        return false
    // any status other than PENDING -- including an error status -- counts as started
    return true
}
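To illustrate, here is a small self-contained sketch (the QueueStatus values and the stubbed queue are mine, for illustration only) showing that this check returns true for a job whose status is an error state such as a node failure:

// Illustration only: a stripped-down copy of the check above,
// with a stubbed queue standing in for the real squeue parsing.
enum QueueStatus { PENDING, RUNNING, ERROR, DONE }

// pretend squeue reported our job with an error status (e.g. NODE_FAIL)
def queue = ['4078': QueueStatus.ERROR]

def checkStartedStatus = { String jobId ->
    if( !queue ) return false
    if( !queue.containsKey(jobId) ) return false
    if( queue.get(jobId) == QueueStatus.PENDING ) return false
    return true
}

assert checkStartedStatus('4078')   // true: the failed job is treated as "started"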

This might be related to #4962, but I am not sure; I didn't read through the full logs.

@pditommaso
Member

Thanks for reporting this. I've noticed you are using version 22.10.6; any chance to try the latest version?

@sgopalan98
Author

Thanks for responding!

I am trying it with the latest version now. However, my understanding of the code snippet (which isn't much, to be honest) is that for any SLURM status other than PENDING, the job is considered started. I will confirm this with my run on the latest version and report the findings.

@sgopalan98
Author

This seems to be the case for version 24.04.4 as well. I am attaching the .nextflow.log (with TRACE enabled) from the newer version.

For job ID 4151, lines 14407, 14432, and 14498 in the nextflow.log show that the job has a NODE FAIL status in SLURM, but Nextflow picks it up as running.

@jorgee
Contributor

jorgee commented Oct 15, 2024

NF (NODE FAIL) is managed in the same way as F (FAILED): both are considered a QueueStatus.ERROR, which counts as started but not active. For every submitted and started task that is not active, Nextflow tries to read the exit code; if the file doesn't exist, it waits until exitReadTimeout is reached. According to some comments in the code, this behavior is intended to avoid NFS issues.
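In other words, something like the following sketch applies (illustrative only, not the actual Nextflow source): the node-fail code lands in the same ERROR bucket as an ordinary failure, so nothing downstream can tell them apart.

// Illustrative decode table for squeue state codes -- not the real Nextflow map
enum QueueStatus { PENDING, RUNNING, HOLD, ERROR, DONE }

def decodeStatus = [
    'PD': QueueStatus.PENDING,   // pending
    'R' : QueueStatus.RUNNING,   // running
    'CD': QueueStatus.DONE,      // completed
    'F' : QueueStatus.ERROR,     // failed
    'NF': QueueStatus.ERROR      // node fail -- same bucket as FAILED
]

// a NODE_FAIL is indistinguishable from a plain FAILED after decoding
assert decodeStatus['NF'] == decodeStatus['F']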

@sgopalan98
Author

sgopalan98 commented Oct 15, 2024

Thank you for looking into this! I understand that the job fails eventually, but the main problem is that there is no log message saying that the node failed and that this is why the process failed.

@jorgee
Contributor

jorgee commented Oct 16, 2024

I see what is happening. The final error message includes a dump of the queue status, but the failing job doesn't appear in it because the SLURM squeue command stops listing old jobs after a certain time. I will check how this could be improved.
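One possible direction is to fall back to sacct, which still reports terminal states for jobs that have aged out of squeue. A rough sketch along these lines (the sacct flags are standard SLURM options, but the surrounding code is mine and not necessarily what #5514 implements):

// Rough sketch: query sacct to recover the terminal state of a job that
// no longer shows up in squeue, so a NODE_FAIL can be reported explicitly.
String fetchTerminalState(String jobId) {
    // -X: allocation only, -n: no header, -P: parsable output, -o: fields to print
    def cmd = ['sacct', '-X', '-n', '-P', '-o', 'State', '-j', jobId]
    def proc = cmd.execute()
    proc.waitForOrKill(10_000)
    if( proc.exitValue() != 0 )
        return null
    // e.g. "NODE_FAIL", "FAILED", "COMPLETED" (possibly with a suffix such as "CANCELLED by 0")
    return proc.text.readLines().find()?.trim()
}

def state = fetchTerminalState('4151')
if( state?.startsWith('NODE_FAIL') )
    println "Job 4151 failed because its allocated node failed (SLURM state: $state)"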
