Fusion on GCP, tool can't find file #5566

Open
giuseppemartone opened this issue Dec 3, 2024 · 7 comments
@giuseppemartone

Hi, I am writing because I have a problem with Nextflow version 24.10.2 when using Fusion.
I should preface that I only have this problem with the latest version, not with version 24.04. When I enable Fusion, the process stops: it is as if the file is not seen correctly inside the process.

Below is my main.nf file:

#!/usr/bin/env nextflow

nextflow.preview.output = true

input_ch = Channel.fromPath( params.input )
                    .splitCsv( header: true)
                    .map { row -> tuple(samplename = row.sample_name, down_size = row.down_size, input = row.input_path, output = row.output_path) }

process DOWNSAMPLING {
    container 'europe-west1-docker.pkg.dev/ngdx-nextflow/negedia/seqtk:r132'
    tag "$samplename"

    input:
    tuple val(samplename), val(down_size), path(input), val(output)

    output:
    tuple val(output), val(samplename), path("${samplename}_sub.fastq.gz"), emit: fq

    script:
    """
    echo "Sample Name $samplename"
    echo "Down Size $down_size"

    seqtk sample -s100 $input $down_size > ${samplename}_sub.fastq
    pigz ${samplename}_sub.fastq
    """
}

workflow PREPROCESS {
    take:
    input_ch
 
    main:
    DOWNSAMPLING( input_ch )

    publish:
    DOWNSAMPLING.out.fq >> "Subsampling_Reads"

}

workflow NEGEDIA {
    PREPROCESS( input_ch )
}

workflow {
    NEGEDIA ()
}

output {
    'Subsampling_Reads' {
        mode 'copy'
        path { out, name, fastq  -> "$params.outdir/Subsampling_Reads/${out}" }
    }
}

And the nextflow.config file:

params {
    input = "NULL"
    outdir = "NULL"
}

process {
    withName: 'DOWNSAMPLING' { cpus = 4; memory = 16.GB }
}

profiles {
    docker {
        docker.enabled         = true
        docker.runOptions      = '-u $(id -u):$(id -g)'
    }
    googlebatch {
        process.executor = "google-batch"
        google.project = "ngdx-nextflow"
        google.location = "europe-west1"
        google.batch.bootDiskSize = "10.GB"
        google.region = "europe-west1-b"
    }
    wave {
        wave.enabled = true
        wave.strategy = ['container','dockerfile']
        wave.freeze = true
        wave.build.repository = 'europe-west1-docker.pkg.dev/ngdx-nextflow/wave/digital'
        wave.build.cacheRepository = 'europe-west1-docker.pkg.dev/ngdx-nextflow/wave-cache/digital'
    }
    fusion {
        fusion.enabled = true
        process.scratch = false
    }
}

To replicate the error:

nextflow run main.nf -profile docker,wave,fusion,googlebatch --input test.csv --output gs://<YOUR-BUCKET>/fusion-tmp -with-tower -w gs://<YOUR-BUCKET>/tmp

This is the error:

[83/049e77] process > NEGEDIA:PREPROCESS:DOWNSAMPLING (WT_REP1) [100%] 1 of 1, failed: 1 ✘
ERROR ~ Error executing process > 'NEGEDIA:PREPROCESS:DOWNSAMPLING (WT_REP1)'

Caused by:
  Process `NEGEDIA:PREPROCESS:DOWNSAMPLING (WT_REP1)` terminated with an error exit status (1)


Command executed:

  echo "Sample Name WT_REP1"
  echo "Down Size 10000"
  
  seqtk sample -s100 SRR6357070_1.fastq.gz 10000 > WT_REP1_sub.fastq
  pigz WT_REP1_sub.fastq

Command exit status:
  1

Command output:
  [E::stk_sample] failed to open the input file/stream.
  Fusion Info:
      clone_namespace: false
      kernel_version: 6.6
      disk_cache_size: 368Gb
      max_open_files: 1048576
      fusion_version: 2.4.6-5529968

Command error:
  Sample Name WT_REP1
  Down Size 10000

Work dir:
  gs://ngdx-scratch-west1/tmp/83/049e77e81551108c0b5822f7fb4808

Container:
  europe-west1-docker.pkg.dev/ngdx-nextflow/wave/digital:92ab05481cf7817d

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

I have attached the file test.csv.
test.csv
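For reference, based on the column names used in the channel map and the values visible in the error output (sample WT_REP1, down size 10000, input file SRR6357070_1.fastq.gz), test.csv presumably looks something like the sketch below; the bucket prefix and the output_path value are placeholders, not taken from the thread.

sample_name,down_size,input_path,output_path
WT_REP1,10000,gs://<YOUR-BUCKET>/SRR6357070_1.fastq.gz,<OUTPUT-DIR>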

Can you help me set up my pipelines better for Wave and Fusion? With the older version, this setup was very useful for big files like WGS or WES data.

Regards,
Giuseppe

@pditommaso
Member

Bit hard to help here without further details. Any chance to upload the .fusion.log file? Adding @jordeu for visibility.

@giuseppemartone
Author

Sure, ty :)
fusion.log

@giuseppemartone
Author

Dear all,
I tried to write the pipeline from scratch, taking some suggestions from the latest nf-core pipelines, like this.

Nextflow version: 24.10.2

main.nf:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

include { PREPROCESS } from './workflows/preprocess'

workflow {
    input_ch = Channel.fromPath(params.input)
        .splitCsv(header: true)
        .map { row -> 
            def input_files = row.input_path.split(';').collect { it.trim() }
            [
                [
                    samplename: row.sample_name, 
                    down_size: row.down_size.toInteger(), 
                    output: row.output_path, 
                    is_paired: input_files.size() == 2
                ],
                input_files
            ]
        }

    input_ch.view { "Sample: $it" }

    PREPROCESS(input_ch)
}

workflow:

include { DOWNSAMPLING } from '../modules/downsampling'

workflow PREPROCESS {
    take:
    input_ch

    main:
    DOWNSAMPLING(input_ch)

    emit:
    fq = DOWNSAMPLING.out.fq
}

module:

process DOWNSAMPLING {
    tag "$meta.samplename"
    publishDir "${params.outdir}/${meta.output}", mode: 'copy'

    container 'europe-west1-docker.pkg.dev/ngdx-nextflow/negedia/seqtk:r132'

    input:
    tuple val(meta), path(input_files)

    output:
    tuple val(meta.output), val(meta.samplename), path("${meta.samplename}_sub*.fastq.gz"), emit: fq

    script:
    def seqtk_cmd = seqtkCommand(meta.samplename, meta.down_size, input_files, meta.is_paired)
    """
    echo "Debug: meta = ${meta}"
    echo "Debug: Input files = ${input_files}"
    
    # List the contents of the current directory
    echo "Current directory contents:"
    ls -la

    $seqtk_cmd

    pigz ${meta.samplename}_sub*.fastq
    """
}

def seqtkCommand(samplename, down_size, input_files, is_paired) {
    if (is_paired) {
        """
        seqtk sample -s100 ${input_files[0]} $down_size > ${samplename}_sub_1.fastq
        seqtk sample -s100 ${input_files[1]} $down_size > ${samplename}_sub_2.fastq
        """
    } else {
        "seqtk sample -s100 ${input_files[0]} $down_size > ${samplename}_sub.fastq"
    }
}

and profile configs:

profiles {
    docker {
        docker.enabled         = true
        docker.runOptions      = '-u $(id -u):$(id -g)'
    }
    googlebatch {
        process.executor = "google-batch"
        google.project = "ngdx-nextflow"
        google.location = "europe-west1"
        google.batch.bootDiskSize = "10.GB"
    }
    wave {
        wave.enabled = true
        wave.strategy = ['container','dockerfile']
        wave.freeze = true
        wave.build.repository = 'europe-west1-docker.pkg.dev/ngdx-nextflow/wave/digital'
        wave.build.cacheRepository = 'europe-west1-docker.pkg.dev/ngdx-nextflow/wave-cache/digital'
    }
    fusion {
        fusion.enabled = true
        process.scratch = false
    }
}
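For completeness, the refactored module still references params.input and params.outdir, which are not shown in the profiles above; a minimal params block, mirroring the one from the first configuration, would look like this:

params {
    input  = "NULL"
    outdir = "NULL"
}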

And it worked, using the wave, fusion, googlebatch and docker profiles.
Can I send you some log files to help understand the issue?

From the code, the only big difference is the input channel, and I do not understand how the previous version worked with Nextflow 24.04.4 but not with 24.10.2.
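For comparison, a minimal sketch (illustrative only, not code from the thread) of how the original channel could be reworked without the named assignments, wrapping the remote path with file() so it is handed to the process as a path object rather than a plain string:

input_ch = Channel.fromPath( params.input )
                    .splitCsv( header: true )
                    .map { row -> tuple( row.sample_name, row.down_size, file(row.input_path), row.output_path ) }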

Bye :)

@pditommaso
Member

Not sure either. I'm inclined to close this if there's no better way to isolate the issue you have experienced.

@giuseppemartone
Author

I will do other tests with a more complex pipeline and I will update this thread with any news.
The different method for the input channel could be a solution. :)

We can close.

@giuseppemartone
Author

No way, the same pipeline now gives me an error.
[screenshot attached]

@giuseppemartone
Author

giuseppemartone commented Dec 9, 2024

Hi, I am reopening this issue because today I have some new problems.
Fusion does not work anymore with 24.10 and I want to help find a solution. For big data, like pipelines for single-cell demultiplexing, it reduces the runtime by at least 5 times.

I changed the visibility of my pipeline, so you can read the complete structure:
https://github.com/NEGEDIA/downsampling.git

Regards,
Giuseppe
