Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PublishDir allows closures for mode, enabled, overwrite fields #2432

Closed
wants to merge 7 commits into from

Conversation

mjhipp
Copy link

@mjhipp mjhipp commented Nov 4, 2021

This is a suggestion to make PublishDir more flexible by allowing more of the fields to be set using a closure.

Use Case

I am using logic suggested here to set publishDir for all processes: #1933 (comment)

That requires including these (or similar) lines in every single process file:

process foo {
  publishDir params.outputPublishDir, enabled: params.outputPublishDir, mode: params.publishMode, pattern: '[!.]**' // Non-hidden files
  publishDir "$params.logDir/${task.process.replaceAll(':', '/')}", enabled: params.publishLogs, pattern: '.command*'

Rather than copy/pasting all of that into every single process file, I would like to set default global publishDir logic in my config files, like this:

process.publishDir = [
  // Publishes all ".command.xx" files to the logging root dir (params.logDir)
  [
    path: { "$params.logDir" + '/' + "$task.process".replaceAll(':', '/') },
    mode: { "$params.publishMode" }, // needs this PR to allow updating with addParams()
    enabled: { "${params.publishLogs}" },  // needs this PR to allow updating with addParams()
    pattern: '.command*'
  ],
  // Publishes all files that are not marked as "hidden" to the output root
  [
    path: { "$params.outputPublishDir" ?: 'placeholder' }, // Place holder string used to avoid PublishDir construction error when output not set (will disable publishing below)
    mode: { "$params.publishMode" }, // needs this PR to allow updating with addParams()
    enabled: { "$params.outputPublishDir" as Boolean }, // needs this PR to allow updating with addParams()
    pattern: '[!.]**' // non-hidden files
  ]
]

Some of this is possible without closures if I am to stick to the params provided at the start. But without closures, I cannot take advantage of the addParams feature when importing processes into a workflow.
With this PR, it is now possible, and it will reflect changes on import. For example:

include { foo } from './processes/foo.nf' addParams(publishMode: 'copy')
include { bar } from './processes/bar.nf' addParams(outputPublishDir: '') // turns off publishing for output files

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from acb53ee to 2c6665e Compare November 20, 2021 17:52
@clintval
Copy link

@pditommaso I have a strong use case for this PR. Anything I can help do to move it closer to merge-ready?

@pditommaso
Copy link
Member

This was already discussed in the past.

The point a closure is not needed to make these options conditionally configurable. A conditional expression should be enough e.g. enabled: "${params.something ? true : false }"

Would that not fit your use case?

@mjhipp
Copy link
Author

mjhipp commented Mar 18, 2022

That does not fit my case for the reasons outlined above.

Using a conditional expression in the process script means that each process script file must specify the publishDir directive individually. That means a lot of copy/pasted code. Being able to specify in the configuration process scope would mean only having to specify this default behavior in one place. The main point is that it is not possible to use the addParams() feature with the current way publishDir is handled, if specified in the nextflow.config. One example use is that for large complex workflows, it is often requested to move or copy the files to the final location for terminal processes, rather than symlink. To set global publish mode to copy would be a huge use of storage, and to set global publish mode to move would break the data flow.

For enabled, that means you would currently have to have all processes publish or not publish, without the possibility to use addParams: [enable-publish: true] on include.

@pditommaso
Copy link
Member

Ok, this is interesting. Therefore the main concern is related to global settings via the config file. I understand the need and always thought there should be a way to change the publishDir attributes default value in an easier way.

Still think the closure approach is a hack and it's likely to go away from future config system (#2723).

My proposal instead is to add a setting in the config file that allows changing the publishDir defaults globally. For example:

process.defaults.publishDir = [enabled: false, overwrite: false, mode: 'copy']

What do you think?

@mjhipp
Copy link
Author

mjhipp commented Mar 21, 2022

I am interested to see what happens with #2723 in the future. Anecdotally, I have had some issues with Lightbend config in the past (it is used in the scala workflow DSL, dagr). In my case with that specific workflow manager and config system, the configuration override file did not work, and in order to override config properties, you had to pass them as java system properties. For example, rather than passing a file with lines like scope.param = value, you needed to pass a java flag -Dscope.param=value.

My proposal instead is to add a setting in the config file that allows changing the publishDir defaults globally. For example:

process.defaults.publishDir = [enabled: false, overwrite: false, mode: 'copy']

What do you think?

It is possible to set the publishDir logic globally in the config now, when you say defaults, do you mean I would be able to override individual parts of the publishDir config in a process?

Like:

process.defaults.publishDir = [path: { "$params.outputPublishDir" ?: 'placeholder' }, mode: 'symlink', enabled: { "$params.outputPublishDir" as Boolean }]

And then inside a process do something like this, and have the rest of the fields already set?:

process foo {
  publishDir mode: { params.publishMode }
}

It sounds like I would still need to include the above override in every process if I wish to use the addParams feature for publishing, which is the main thing I am hoping to avoid (copying boilerplate config into every process script in a large project). Also, what would the behavior be if there are multiple publish directives? As in the original case above, I have a publishDir directive for publishing main outputs, and a directive for publishing the log files.

@mjhipp
Copy link
Author

mjhipp commented Mar 30, 2022

@pditommaso to clarify, it is possible to set most values in publishDir as a closure, except a few (mode, overwrite). I guess I am not understanding why these values cannot be closures and considered a hack, while it is perfectly fine to have closures for path, pattern etc.

Even something like this does not work:

publishDir = [
  [mode: { 'copy' }]
]

While this is perfectly fine:

publishDir = [
  [path: { "$params.outputDir" }]
]

It would be very helpful to allow someone to specify something like this in the config:

publishDir = [
  [mode: { params.mode }]
]

@pditommaso
Copy link
Member

This could be a short term solution, but I feel like a hack. I have to say I'm bit lost with your use case when mentioned the need for the addParams.

Could you please isolate a minimal example that shows what you are trying to achieve?

@mjhipp
Copy link
Author

mjhipp commented Apr 5, 2022

@pditommaso here is an over simplified example where I only want to output the files from certain processes. In many workflows, I will want to publish the output of the SortSam process, but in this example, it is an intermediate file that I do not wish to publish:

config:

process {
  publishDir = [
    [
      path: { params.output },
      enabled: { params.enablePublish }  // params.enablePublish has a default of `true` in params config
    ]
  ]
}

workflow:

include { SortSam } from 'sortsam.nf' addParams(enablePublish: false)
include { VariantCalling } from 'variantcalling.nf

workflow {
  // variant calling process requires coordinate sorted input, but I do not want to publish that intermediate file
  SortSam(params.input-sam, 'coordinate')
  VariantCalling(SortSam.out.bam_ch)
}

@pditommaso
Copy link
Member

This likely will be addressed by #4186. Closing without merging

@pditommaso pditommaso closed this Sep 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants