
Resume from crashed head job caused process and its downstream dependency to run at the same time #5595

oalmelid-gel opened this issue Dec 10, 2024 · 8 comments · May be fixed by #5600

@oalmelid-gel

oalmelid-gel commented Dec 10, 2024

Bug report

Due to an LSF compute cluster outage, two of our pipelines ended up in a state where the nextflow process was unable to submit jobs to the cluster and crashed. However, the running processes completed successfully.

When we resumed the pipeline after the cluster had recovered from the crash, we saw nextflow

  • re-launch the processes that had already completed successfully
  • immediately conclude that each re-launched process had completed, and launch the process that depended on its output

The re-launched process remained running for another ten minutes and wrote output files while its dependent process ran, leading to truncated output and a false-negative result.

After some digging, I believe that what happened is related to the behaviour of the GridTaskHandler, specifically this line:

https://github.com/nextflow-io/nextflow/blob/18f7de132c0342102f7160ce97b2e68dc6b67[…]xtflow/src/main/groovy/nextflow/executor/GridTaskHandler.groovy

The function readExitStatus() looks for a file called .exitcode and reads the exit code of the completed process from it. Since nextflow resumed a process that had completed successfully after the head job crashed, that file was already present in the work directory when the pipeline resumed, which caused a premature launch of the downstream process.

I think an easy fix would be to make sure the .exitcode file is deleted before submitting a process, possibly just with an rm -f .exitcode.
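For illustration, a minimal Groovy sketch of that idea might look like the following. The helper name and the example path are hypothetical, and this is not the actual GridTaskHandler code; the point is just that a stale .exitcode is removed before submission, so the only exit code the poller can ever see is one written by the newly submitted job.

import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical helper, not the actual GridTaskHandler code: remove a stale
// .exitcode left over from an earlier run before (re)submitting the job, so
// the status poller cannot mistake it for the new job's exit status.
void clearStaleExitFile(Path workDir) {
    // deleteIfExists() is a no-op when the file is absent
    Files.deleteIfExists(workDir.resolve('.exitcode'))
}

clearStaleExitFile(Paths.get('/path/to/work/ab/cdef12'))   // example path only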

Expected behavior and actual behavior

Expected behaviour: when resuming a pipeline, nextflow should wait for the resumed process to finish executing on the cluster.
Actual behaviour: nextflow immediately launches a downstream process while the resumed process is still running.

Steps to reproduce the problem

The following script illustrates the problem. It has a different output on a grid scheduler than when run locally.
A similar result can be achieved by removing the line echo 0 > .exitcode, launching nextflow on a grid scheduler, killing nextflow when the process has started, and then relaunching with -resume.

workflow {
    messWithExitCode | view
}


process messWithExitCode {
    beforeScript 'module load cluster/test'

    output:
    stdout

    script:
    """
    echo On a Grid scheduler, this line will print
    # the line below is just for demo purposes. On our grid cluster, this was achieved by
    # - letting nextflow launch the child process
    # - killing the nextflow executor
    # - resuming the workflow after the child process had completed
    echo 0 > .exitcode
    sleep 60s
    echo On a Grid scheduler, this line will not print
    """
}
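For reference, a minimal nextflow.config that points the reproduction at an LSF cluster could look like this (illustrative only; the queue name is a placeholder for whatever your site uses):

// nextflow.config -- illustrative only; 'normal' is a placeholder queue name
process {
    executor = 'lsf'
    queue    = 'normal'
}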

Program output

On a grid scheduler, the above will only print

On a Grid scheduler, this line will print

Environment

  • Nextflow version: 22.04.5
  • Java version: 1.8.0
  • Operating system: linux
  • Bash version: 4.3.48(1)-release

Additional context

N/A

@robsyme
Collaborator

robsyme commented Dec 10, 2024

Hi Øyvind

When you say

When we resumed the pipeline after the cluster had recovered from the crash, we saw nextflow launch the processes that had completed successfully

Are you sure that it was Nextflow that relaunched the task? If a task directory already exists, Nextflow will not resubmit a task to that same directory.

Is it possible that LSF re-submitted the task when it came back online?

@oalmelid-gel
Author

oalmelid-gel commented Dec 10, 2024

Hi Robert,

Is it possible that LSF re-submitted the task when it came back online?

I don't think that's possible. The resume happened many hours after the cluster came back up, and the resume was submitted through seqera platform.

I've got traces from nextflow showing that it thought the resumed process had completed, and logs from the process showing it continued running.

The logs in the process working directory also show two separate runs - one on December 4th and one on the 5th.

I've verified through seqera platform that both in the original and resumed runs, the process in question has the same working directory.

@robsyme
Collaborator

robsyme commented Dec 10, 2024

Thanks for the clarification, Øyvind.

Nextflow should not re-use the same task working directory on a resumed task, so it's very unexpected to see the same directory being re-used on resume. In the checkCachedOrLaunchTask method, Nextflow checks to see if the work directory exists and if so, avoids submitting to that directory. In pseudocode:

tries = 0
loop:
    cache = session.getCache(taskHash)
    taskWorkDirHash = Hash(taskHash, tries)
    taskWorkDir = getWorkDir(taskWorkDirHash)
    cached = taskWorkDir.exists() && cache.isCompleted() && checkCachedOutput(taskWorkDir)
    if cached:
        break
    if taskWorkDir.exists():
        tries += 1
        continue
    submitTask(task, taskWorkDir)
    break

Essentially, Nextflow spins around this loop, generating a deterministic series of hashes, looking either for a fully cached directory whose results can be reused, or for a directory that does not yet exist in which to submit the job.
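To make the idea concrete, here is a simplified Groovy sketch (not Nextflow's actual hashing code; the digest and directory layout are illustrative only) of how combining the task hash with the retry counter yields a deterministic series of candidate work directories:

import java.security.MessageDigest

// Simplified illustration only -- not Nextflow's real implementation: derive a
// deterministic series of candidate work directories from the task hash and a
// retry counter.
String candidateWorkDir(String taskHash, int tries) {
    def hex = MessageDigest.getInstance('MD5')
            .digest("${taskHash}:${tries}".toString().bytes)
            .encodeHex()
            .toString()
    return "work/${hex[0..1]}/${hex[2..-1]}"
}

// The same inputs always produce the same directories, so a resumed run can
// re-generate the series and walk it in order.
(0..2).each { println candidateWorkDir('9f86d081884c7d65', it) }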

@bentsherman - do you have any idea how a task workDir could possibly be re-used on resume?

@bentsherman
Member

Indeed, Nextflow should never submit a new task to an existing directory. It should only reuse the directory if the task is fully cached and doesn't need to be re-executed. At first glance, I'm not sure how this situation would happen. The expected sequence is:

  • The scheduler goes down and the pipeline crashes
  • Assuming the compute nodes still work, the submitted jobs keep running to completion
  • Later, when the scheduler is back up and the run is resumed, Nextflow should either
    1. resume the task, if the job completed successfully, or
    2. re-execute the task in a new directory, if the job failed or is still running (in which case the previous job is ignored)

@robsyme
Collaborator

robsyme commented Dec 10, 2024

I'm having trouble reproducing the example. When you say

killing the nextflow executor

Does this require forcibly terminating the scheduler (LSF), or simply killing the Nextflow process?

@oalmelid-gel
Author

oalmelid-gel commented Dec 11, 2024

Thanks both! And apologies for not having a clean repro.

I've reviewed the logs from our cluster, and I think the situation is more complex than I initially realised.
Here's the timeline of what happened:

  • 12:01 - nextflow head job is submitted to the cluster
  • 13:14 - nextflow head job starts running on the cluster
  • 15:04 - a network outage prevents the cluster from talking to the node on which nextflow is running; the cluster automatically requeues the nextflow job
  • 16:28 - nextflow job is automatically re-run by the cluster but immediately crashes when attempting to fetch ephemeral config from seqera platform, since the endpoint no longer exists.
  • next day at 12:02 - workflow is manually resumed from seqera platform and exhibits the symptoms above.

Looking at the code you linked, I can see nextflow launching a job in the same work directory if either of the two variables below ends up null.

final entry = session.cache.getTaskEntry(hash, this)
resumeDir = entry ? FileHelper.asPath(entry.trace.getWorkDir()) : null

That would mean that the first try block completes fine with no existence check on the directory. The block below does check existence, but does not abort a launch if the directory exists.

final workDir = task.getWorkDirFor(hash)
try {
    if( resumeDir != workDir )
        exists = workDir.exists()
    if( !exists && !workDir.mkdirs() )
        throw new IOException("Unable to create directory=$workDir -- check file system permissions")
}
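For what it's worth, a guard along these lines would turn that silent reuse into a hard error. This is just a sketch of the idea with a hypothetical helper name, not the actual change in the linked PR:

import java.nio.file.Files
import java.nio.file.Path

// Sketch only, not the actual fix in the linked PR: refuse to submit a new
// task into a work directory that already exists but was not validated as a
// cached result.
void assertFreshWorkDir(Path workDir, Path resumeDir) {
    if( workDir != resumeDir && Files.exists(workDir) )
        throw new IllegalStateException("Task work directory already exists and is not a cached result: $workDir")
}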

My best guess is that the failed relaunch (not resume) did something to the cache, though I'm not sure if that's possible. The resume definitely skipped over all the tasks that had completed before the outage, but then somehow decided to re-launch the two processes that were running at the time of the network outage.

@bentsherman
Member

bentsherman commented Dec 11, 2024

Indeed, if there is no cache entry for a given task hash, Nextflow will proceed to execute the task in that directory without checking whether the directory already exists. It assumes that if there is no cache entry, there must be no directory either.

Meanwhile, the cache entry for a task is not saved until Nextflow sees that the task is complete:

void notifyTaskComplete( TaskHandler handler ) {
    // save the completed task in the cache DB

Here is a plausible scenario:

  1. nextflow submits a job
  2. nextflow crashes
  3. the job completes successfully
  4. nextflow is resumed
  5. there is no cache entry for the job, so nextflow re-executes the job in the same directory as before, overwriting the previous job's outputs
  6. on the next status poll, nextflow sees the .exitcode file from the previous job and thinks the current job is done (see the sketch after this list)
  7. nextflow collects the outputs in the job directory, which will be a mish-mash of the previous and current jobs depending on how the job saves its outputs -- likely all of them will be the in-progress outputs of the current job
  8. nextflow submits a downstream job based on the upstream outputs, which are now incomplete. if you're lucky, the downstream completes, otherwise, your results are incomplete
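As a rough illustration of step 6 (a simplified sketch, not the actual GridTaskHandler.readExitStatus() implementation), the poller just reads whatever .exitcode file is present in the work directory and has no way to tell which submission wrote it:

import java.nio.file.Files
import java.nio.file.Path

// Simplified sketch of the polling behaviour described in step 6 (not the
// actual GridTaskHandler code): any .exitcode found in the work directory is
// taken as the exit status, regardless of which submission wrote it.
Integer readExitStatus(Path workDir) {
    def exitFile = workDir.resolve('.exitcode')
    if( !Files.exists(exitFile) )
        return null                              // job considered still running
    return exitFile.text.trim().toInteger()      // a stale file => premature "done"
}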

So you might be right after all. Let me see if I can reproduce this with the local executor. I think I know what we need to change but I will feel more confident if we also have a red-green test case.

@bentsherman
Member

bentsherman commented Dec 11, 2024

There is also this minor failsafe to make the job script delete any existing input files:

if( len<100 )
    delete << "rm -f ${Escape.path(stageName)}"

But it's only intended for debugging purposes. I think during normal execution, Nextflow would see the .exitcode file and mark the job as completed before the job has a chance to reset the work directory, especially if the job stays in the submit queue for more than a few minutes.

bentsherman linked a pull request on Dec 11, 2024 that will close this issue