Fusion on GCP, tool can't find file #5566

Open
giuseppemartone opened this issue Dec 3, 2024 · 7 comments
@giuseppemartone

Hi, I am writing because I have a problem with Nextflow version 24.10.2 when using Fusion.
I should preface that I only have this problem with the latest version, not with version 24.04. When I enable Fusion, the process stops: it is as if the file is not seen correctly inside the process.

Below is my main.nf file:

#!/usr/bin/env nextflow

nextflow.preview.output = true

input_ch = Channel.fromPath( params.input )
                    .splitCsv( header: true)
                    .map { row -> tuple(samplename = row.sample_name, down_size = row.down_size, input = row.input_path, output = row.output_path) }

process DOWNSAMPLING {
    container 'europe-west1-docker.pkg.dev/ngdx-nextflow/negedia/seqtk:r132'
    tag "$samplename"

    input:
    tuple val(samplename), val(down_size), path(input), val(output)

    output:
    tuple val(output), val(samplename), path("${samplename}_sub.fastq.gz"), emit: fq

    script:
    """
    echo "Sample Name $samplename"
    echo "Down Size $down_size"

    seqtk sample -s100 $input $down_size > ${samplename}_sub.fastq
    pigz ${samplename}_sub.fastq
    """
}

workflow PREPROCESS {
    take:
    input_ch
 
    main:
    DOWNSAMPLING( input_ch )

    publish:
    DOWNSAMPLING.out.fq >> "Subsampling_Reads"

}

workflow NEGEDIA {
    PREPROCESS( input_ch )
}

workflow {
    NEGEDIA ()
}

output {
    'Subsampling_Reads' {
        mode 'copy'
        path { out, name, fastq  -> "$params.outdir/Subsampling_Reads/${out}" }
    }
}

And the nextflow.config file:

params {
    input = "NULL"
    outdir = "NULL"
}

process {
    withName: 'DOWNSAMPLING' { cpus = 4; memory = 16.GB }
}

profiles {
    docker {
        docker.enabled         = true
        docker.runOptions      = '-u $(id -u):$(id -g)'
    }
    googlebatch {
        process.executor = "google-batch"
        google.project = "ngdx-nextflow"
        google.location = "europe-west1"
        google.batch.bootDiskSize = "10.GB"
        google.region = "europe-west1-b"
    }
    wave {
        wave.enabled = true
        wave.strategy = ['container','dockerfile']
        wave.freeze = true
        wave.build.repository = 'europe-west1-docker.pkg.dev/ngdx-nextflow/wave/digital'
        wave.build.cacheRepository = 'europe-west1-docker.pkg.dev/ngdx-nextflow/wave-cache/digital'
    }
    fusion {
        fusion.enabled = true
        process.scratch = false
    }
}

To replicate the error:

nextflow run main.nf -profile docker,wave,fusion,googlebatch --input test.csv --output gs://<YOUR-BUCKET>/fusion-tmp -with-tower -w gs://<YOUR-BUCKET>/tmp

This is the error:

[83/049e77] process > NEGEDIA:PREPROCESS:DOWNSAMPLING (WT_REP1) [100%] 1 of 1, failed: 1 ✘
ERROR ~ Error executing process > 'NEGEDIA:PREPROCESS:DOWNSAMPLING (WT_REP1)'

Caused by:
  Process `NEGEDIA:PREPROCESS:DOWNSAMPLING (WT_REP1)` terminated with an error exit status (1)


Command executed:

  echo "Sample Name WT_REP1"
  echo "Down Size 10000"
  
  seqtk sample -s100 SRR6357070_1.fastq.gz 10000 > WT_REP1_sub.fastq
  pigz WT_REP1_sub.fastq

Command exit status:
  1

Command output:
  [E::stk_sample] failed to open the input file/stream.
  Fusion Info:
      clone_namespace: false
      kernel_version: 6.6
      disk_cache_size: 368Gb
      max_open_files: 1048576
      fusion_version: 2.4.6-5529968

Command error:
  Sample Name WT_REP1
  Down Size 10000

Work dir:
  gs://ngdx-scratch-west1/tmp/83/049e77e81551108c0b5822f7fb4808

Container:
  europe-west1-docker.pkg.dev/ngdx-nextflow/wave/digital:92ab05481cf7817d

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

I have attached the file test.csv.
test.csv
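For reference, based on the column names used in the channel map and the values visible in the error output (sample WT_REP1, down size 10000, input file SRR6357070_1.fastq.gz), test.csv presumably looks something like the sketch below; the bucket prefix and the output_path value are placeholders, not taken from the thread.

sample_name,down_size,input_path,output_path
WT_REP1,10000,gs://<YOUR-BUCKET>/SRR6357070_1.fastq.gz,<OUTPUT-DIR>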

Can you help me set up my pipelines better for Wave and Fusion? With the older version, this setup was very useful for big files like WGS or WES data.

Regards,
Giuseppe

@pditommaso
Member

Bit hard to help here without further details. Any chance to upload the .fusion.log file? Adding @jordeu for visibility.

@giuseppemartone
Author

Sure, ty :)
fusion.log

@giuseppemartone
Author

Dear all,
I tried to write the pipeline from scratch, taking some suggestions from the latest nf-core pipelines, like this.

Nextflow version: 24.10.2

main.nf:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

include { PREPROCESS } from './workflows/preprocess'

workflow {
    input_ch = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row -> 
            def input_files = row.input_path.split(';').collect { it.trim() }
            [
                [
                    samplename: row.sample_name, 
                    down_size: row.down_size.toInteger(), 
                    output: row.output_path, 
                    is_paired: input_files.size() == 2
                ],
                input_files
            ]
        }

    input_ch.view { "Sample: $it" }

    PREPROCESS(input_ch)
}

workflow:

include { DOWNSAMPLING } from '../modules/downsampling'

workflow PREPROCESS {
    take:
    input_ch

    main:
    DOWNSAMPLING(input_ch)

    emit:
    fq = DOWNSAMPLING.out.fq
}

module:

process DOWNSAMPLING {
    tag "$meta.samplename"
    publishDir "${params.outdir}/${meta.output}", mode: 'copy'

    container 'europe-west1-docker.pkg.dev/ngdx-nextflow/negedia/seqtk:r132'

    input:
    tuple val(meta), path(input_files)

    output:
    tuple val(meta.output), val(meta.samplename), path("${meta.samplename}_sub*.fastq.gz"), emit: fq

    script:
    def seqtk_cmd = seqtkCommand(meta.samplename, meta.down_size, input_files, meta.is_paired)
    """
    echo "Debug: meta = ${meta}"
    echo "Debug: Input files = ${input_files}"
    
    # List the contents of the current directory
    echo "Current directory contents:"
    ls -la

    $seqtk_cmd

    pigz ${meta.samplename}_sub*.fastq
    """
}

def seqtkCommand(samplename, down_size, input_files, is_paired) {
    if (is_paired) {
        """
        seqtk sample -s100 ${input_files[0]} $down_size > ${samplename}_sub_1.fastq
        seqtk sample -s100 ${input_files[1]} $down_size > ${samplename}_sub_2.fastq
        """
    } else {
        "seqtk sample -s100 ${input_files[0]} $down_size > ${samplename}_sub.fastq"
    }
}

and profile configs:

profiles {
    docker {
        docker.enabled         = true
        docker.runOptions      = '-u $(id -u):$(id -g)'
    }
    googlebatch {
        process.executor = "google-batch"
        google.project = "ngdx-nextflow"
        google.location = "europe-west1"
        google.batch.bootDiskSize = "10.GB"
    }
    wave {
        wave.enabled = true
        wave.strategy = ['container','dockerfile']
        wave.freeze = true
        wave.build.repository = 'europe-west1-docker.pkg.dev/ngdx-nextflow/wave/digital'
        wave.build.cacheRepository = 'europe-west1-docker.pkg.dev/ngdx-nextflow/wave-cache/digital'
    }
    fusion {
        fusion.enabled = true
        process.scratch = false
    }
}
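For completeness, the refactored module still references params.input and params.outdir, which are not shown in the profiles above; a minimal params block, mirroring the one from the first configuration, would look like this:

params {
    input  = "NULL"
    outdir = "NULL"
}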

And it worked, using the wave, fusion, googlebatch and docker profiles.
Can I send you some log files to help understand the issue?

From the code, the only big difference is the input channel, and I do not understand how the previous version worked with Nextflow 24.04.4 but not with 24.10.2.
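For comparison, a minimal sketch (illustrative only, not code from the thread) of how the original channel could be reworked without the named assignments, wrapping the remote path with file() so it is handed to the process as a path object rather than a plain string:

input_ch = Channel.fromPath( params.input )
                    .splitCsv( header: true )
                    .map { row -> tuple( row.sample_name, row.down_size, file(row.input_path), row.output_path ) }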

Bye :)

@pditommaso
Member

Not sure either. I'm inclined to close this if there's no better way to isolate the issue you have experienced.

@giuseppemartone
Author

I will do other tests with a more complex pipeline and I will update this thread with any news.
The different method for the input channel could be a solution. :)

We can close.

@giuseppemartone
Author

No way, the same pipeline now gives me an error.
[screenshot attached]

@giuseppemartone
Author

giuseppemartone commented Dec 9, 2024

Hi, I am reopening this issue because today I have some new problems.
Fusion does not work anymore with 24.10 and I want to help find a solution. For big data, like pipelines for single-cell demultiplexing, it reduces the runtime by at least 5 times.

I changed the visibility of my pipeline, so you can read the complete structure:
https://github.com/NEGEDIA/downsampling.git

Regards,
Giuseppe
