
SLURM jobs being set as RUNNING when the actual state of the job is failed #5298

Open
sgopalan98 opened this issue Sep 12, 2024 · 6 comments · May be fixed by #5514

@sgopalan98

Bug report

Expected behavior and actual behavior

Situation:
When Nextflow submits SLURM jobs, the node allocated to a job sometimes fails to start up. This causes the SLURM job to take the NF (NODE FAIL) status.

Expected behaviour: Nextflow should report that the job failed because of a node failure.
Actual behaviour: because the code does not handle this case, Nextflow assumes the job has started, but it never finds the .exitcode file and never marks the job as active (the job never reaches the RUNNING state). Eventually the task fails with an exitReadTimeout error.

Steps to reproduce the problem

Configure a Nextflow process to run on a node that fails to start up.

Program output

I don't have the program output, but I have the nextflow.log file with TRACE enabled. Please look for job ID 4078, at lines 13076, 13101, and 13126 of the log.

Environment

  • Nextflow version: 22.10.6
  • Java version: 17.0.9
  • Operating system: Linux
  • Bash version: GNU bash, version 5.1.16(1)-release (x86_64-pc-linux-gnu)

Additional context

I think the problem is in this method:

boolean checkStartedStatus(jobId, queueName) {
    assert jobId
    // -- fetch the queue status
    final queue = getQueueStatus(queueName)
    if( !queue )
        return false
    if( !queue.containsKey(jobId) )
        return false
    if( queue.get(jobId) == QueueStatus.PENDING )
        return false
    // any status other than PENDING -- including an error status -- counts as started
    return true
}
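To illustrate, here is a small self-contained sketch (the QueueStatus values and the stubbed queue are mine, for illustration only) showing that this check returns true for a job whose status is an error state such as a node failure:

// Illustration only: a stripped-down copy of the check above,
// with a stubbed queue standing in for the real squeue parsing.
enum QueueStatus { PENDING, RUNNING, ERROR, DONE }

// pretend squeue reported our job with an error status (e.g. NODE_FAIL)
def queue = ['4078': QueueStatus.ERROR]

def checkStartedStatus = { String jobId ->
    if( !queue ) return false
    if( !queue.containsKey(jobId) ) return false
    if( queue.get(jobId) == QueueStatus.PENDING ) return false
    return true
}

assert checkStartedStatus('4078')   // true: the failed job is treated as "started"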

This might be related to #4962, but I am not sure; I didn't read through the full logs.

@pditommaso
Member

Thanks for reporting this. I've noticed you are using version 22.10.6; any chance to try the latest version?

@sgopalan98
Author

Thanks for responding!

I am trying it with the latest version now. However, my understanding of the code snippet (which isn't much, to be honest) is that for any SLURM status other than PENDING, the job is considered started. I will confirm this with my run on the latest version and report the findings.

@sgopalan98
Author

This seems to be the case for version 24.04.4 as well. I am attaching the .nextflow.log (with TRACE enabled) from the newer version.

For job ID 4151, lines 14407, 14432, and 14498 in the nextflow.log show that the job has a NODE FAIL status in SLURM, but Nextflow picks it up as running.

@jorgee
Contributor

jorgee commented Oct 15, 2024

NF (NODE FAIL) is managed in the same way as F (FAILED): both are considered a QueueStatus.ERROR, which counts as started but not active. For every submitted and started task that is not active, Nextflow tries to read the exit code; if the file doesn't exist, it waits until exitReadTimeout is reached. According to some comments in the code, this behavior is intended to avoid NFS issues.
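In other words, something like the following sketch applies (illustrative only, not the actual Nextflow source): the node-fail code lands in the same ERROR bucket as an ordinary failure, so nothing downstream can tell them apart.

// Illustrative decode table for squeue state codes -- not the real Nextflow map
enum QueueStatus { PENDING, RUNNING, HOLD, ERROR, DONE }

def decodeStatus = [
    'PD': QueueStatus.PENDING,   // pending
    'R' : QueueStatus.RUNNING,   // running
    'CD': QueueStatus.DONE,      // completed
    'F' : QueueStatus.ERROR,     // failed
    'NF': QueueStatus.ERROR      // node fail -- same bucket as FAILED
]

// a NODE_FAIL is indistinguishable from a plain FAILED after decoding
assert decodeStatus['NF'] == decodeStatus['F']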

@sgopalan98
Author

sgopalan98 commented Oct 15, 2024

Thank you for looking into this! I understand that the job fails eventually, but the main problem is that there is no log message saying that the node failed and that this is why the process failed.

@jorgee
Contributor

jorgee commented Oct 16, 2024

I see what is happening. The final error message includes a dump of the queue status, but the failing job doesn't appear in it because the SLURM squeue command stops listing old jobs after a certain time. I will check how this could be improved.
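One possible direction is to fall back to sacct, which still reports terminal states for jobs that have aged out of squeue. A rough sketch along these lines (the sacct flags are standard SLURM options, but the surrounding code is mine and not necessarily what #5514 implements):

// Rough sketch: query sacct to recover the terminal state of a job that
// no longer shows up in squeue, so a NODE_FAIL can be reported explicitly.
String fetchTerminalState(String jobId) {
    // -X: allocation only, -n: no header, -P: parsable output, -o: fields to print
    def cmd = ['sacct', '-X', '-n', '-P', '-o', 'State', '-j', jobId]
    def proc = cmd.execute()
    proc.waitForOrKill(10_000)
    if( proc.exitValue() != 0 )
        return null
    // e.g. "NODE_FAIL", "FAILED", "COMPLETED" (possibly with a suffix such as "CANCELLED by 0")
    return proc.text.readLines().find()?.trim()
}

def state = fetchTerminalState('4151')
if( state?.startsWith('NODE_FAIL') )
    println "Job 4151 failed because its allocated node failed (SLURM state: $state)"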
