IO file storm when pipelines are run at scale #4158
-
Hi @PeteClapham, Nextflow potentially creates several small files for each task that is executed (the per-task .command.* scripts and logs plus the .exitcode file).
This approach indeed puts a huge load on the shared filesystem at scale. Here are two things you can try:
We are also working on a number of features that should alleviate this somewhat, but at the end of the day it's a fundamental part of how Nextflow works. To avoid writing all of these files, the equivalent information would need to be stored by the HPC scheduler or some kind of database service. HPC schedulers typically do not store all of this info, or at least not for long, and while you could use a database service, well... most people already have a shared filesystem 😄 We're always open to new ideas though. I think the TES API aims to do exactly what I'm describing: storing this metadata in a database instead of on the filesystem. Funnel is the main TES backend that I know of, though I'm not sure whether it supports LSF.
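For context, routing task state through TES instead of the filesystem looks roughly like this. A minimal sketch, assuming the nf-ga4gh plugin and a Funnel server reachable at its default local HTTP endpoint; both are assumptions you would adapt to your own deployment:

```groovy
// nextflow.config -- sketch of sending tasks to a GA4GH TES backend such as Funnel.
// The endpoint is an assumption (Funnel's default local HTTP port); change it for your setup.
plugins {
    id 'nf-ga4gh'
}

process.executor = 'tes'

tes {
    endpoint = 'http://localhost:8000'
}
```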
-
Hi all, let me add more details. Nextflow indeed creates several small metadata files, but it's hard to believe that this alone causes heavy IO pressure on the filesystem. Those files are accessed only sporadically, to determine the task status or to fetch the error logs in case of failure. More likely the problem comes from the fact that Nextflow makes it easy for users to submit a very large number of jobs, which can themselves be IO-intensive. @PeteClapham I have a question: do the nodes in your cluster have local scratch storage? If yes, do your users set the process.scratch directive to true?
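For anyone landing here, the directive in question can be set globally in nextflow.config. A minimal sketch, assuming your nodes expose local scratch space (the commented path is purely illustrative):

```groovy
// nextflow.config -- run task work directories on node-local scratch storage.
// Inputs are staged to local disk, the task runs there, and only the declared
// outputs are copied back to the shared work directory, which cuts the random
// IO hitting the parallel filesystem.
process.scratch = true                 // use the node's default temporary directory
// process.scratch = '/local/scratch'  // illustrative: point at a specific local path instead
```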
-
I can imagine that in very large runs (10-100k tasks) that are resumed, Nextflow will request thousands or tens of thousands of file reads (e.g. of .exitcode) in a short period of time. Perhaps this is the issue that Sanger is running into. Is there a Nextflow configuration option to throttle these cache checks?
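Not an answer to the resume question specifically, but the executor scope does expose documented settings that throttle how often Nextflow polls the scheduler and reads per-task status files such as .exitcode. A hedged sketch with illustrative values only:

```groovy
// nextflow.config -- illustrative values; tune for your site and scheduler.
executor {
    queueSize         = 500        // max jobs queued/running at once
    pollInterval      = '30 sec'   // how often task status is checked
    queueStatInterval = '5 min'    // how often the scheduler queue status is fetched
    exitReadTimeout   = '10 min'   // how long to wait for the .exitcode file to appear
    submitRateLimit   = '20/1min'  // at most 20 job submissions per minute
}
```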
-
Expected behavior and actual behavior
Nextflow currently creates numerous small files during pipeline runs to maintain and track state. When pipelines are run at scale, this can create an IO storm that prevents other workflows from making progress and puts high-performance parallel filesystems, and the wider cluster, at risk of service failure.
Steps to reproduce the problem
High numbers of multi-component runs executing in parallel across a large-scale HPC cluster (approx. 20k cores) can create these file storms. We have now seen this on multiple occasions; the last recurrence was two weeks ago, when the resulting IO storm peaked at 448 million IOPS.
Due to the impact upon the filesystems, logging is unable to write out the usual debugging data.
Environment
Nextflow version: Current release
Java version: openjdk 11.0.19 2023-04-18 (OpenJDK Runtime Environment build 11.0.19+7-post-Ubuntu-0ubuntu118.04.1; OpenJDK 64-Bit Server VM build 11.0.19+7-post-Ubuntu-0ubuntu118.04.1, mixed mode, sharing)
Operating system: Ubuntu 18.04 (Linux)
Bash version: GNU bash, version 4.4.20(1)-release (x86_64-pc-linux-gnu)
The issue seems unrelated to any given informatics software stack; we have seen it arise from different research areas, data sets, and data types.
The clusters are running with IBM Spectrum LSF 10.1.0.13
It is unclear at which point in the process the storm is created; however, the use of many small files at scale to maintain state is a potential architectural challenge.