Nextflow 151 errors - potential Storage latency ? #4272

colindaven · 2023-09-06T11:38:01Z

colindaven
Sep 6, 2023

Hello,

We have a substantial number of jobs (5%) that fail with exit status 151 for no obvious reason when running Nextflow with Singularity containers on LSF, yet complete without error on retry. The workaround is easy enough, but it wastes resources for long-running jobs.

One idea was that the storage systems aren't keeping up, but I can't see an option that makes Nextflow wait until a file (not just a file handle) has actually been written. In any case, I am not seeing file-not-found errors. Also, this happens across diverse pipelines from different devs.

I can't see anything on latency here:
https://www.nextflow.io/docs/latest/config.html

Others have reported 151 errors on SLURM with Apptainer for no obvious reason too, with retry being the workaround: https://sciwiki.fredhutch.org/hdc/workflows/running/on_gizmo/

Maybe someone has seen this problem and found a solution?

Thanks, Colin


aghr · 2023-10-13T11:45:36Z

aghr
Oct 13, 2023

Hello,

I seem to have the same issue with the nf-core rnaseq pipeline. As I couldn't figure out why I get this error 151 for no obvious reason, I thought the cause must be connected to the rnaseq pipeline. I posted more info on my error 151 in another issue:

nf-core/rnaseq#1024

I hope this helps to figure out what is going wrong here. When I first tried the nf-core rnaseq pipeline a few months (3-5) ago on the very same computer, I could get it running. Now I can't.

Thanks. Andre

1 reply

colindaven Oct 17, 2023
Author

How are you running the pipeline? LSF? SLURM? Locally? Which storage do you have? Is it NFS?

I'm not sure it's a latency problem, since I tried it on faster SSD storage and still got 151 exit codes. I don't get the problem when running the Nextflow task commands directly through .command.sh.

My feeling is that it's a bug at the interface between the scheduler (LSF for me) and Nextflow. But there's little information to debug further. I'll try another storage system soon to see if the problem occurs there too. It may also depend on storage load, which is transient.

bentsherman · 2023-10-27T20:09:02Z

bentsherman
Oct 27, 2023
Maintainer

Exit codes 129-160 or so are supposed to correspond to POSIX signals. If LSF is following this convention, then 151 should correspond to SIGURG ("Urgent condition on socket (4.2BSD)") according to the man pages.

So that could be a network issue. Or it could be something completely different; I just wanted to give a possible lead.
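To make the arithmetic behind that convention concrete, here is a minimal sketch in bash. It assumes the scheduler reports the usual shell-style status of 128 plus the signal number; whether LSF really does so is not confirmed in this thread. The helper name `exit_code_to_signal` is hypothetical.

```shell
#!/usr/bin/env bash
# Convert a shell-style exit status (128 + N) into the signal number N.
# Prints nothing and returns 1 if the status is outside the 129-192 range.
exit_code_to_signal() {
    local code="$1"
    if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        echo $(( code - 128 ))
    else
        return 1
    fi
}

sig="$(exit_code_to_signal 151)"   # 151 - 128 = 23
echo "signal number: $sig"

# kill -l maps a signal number to its name. Signal numbering is
# platform-specific, so the name printed depends on where this runs.
echo "signal name on this machine: $(kill -l "$sig")"
```

Running this on the compute node itself, rather than on a laptop, is the safer way to read the name, since the number-to-name mapping differs between operating systems.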


colindaven · 2023-11-21T07:57:03Z

colindaven
Nov 21, 2023
Author

@bentsherman Yeah, thanks for your input. It could indeed be a network issue, but that is impossible for me to work out. When we change over from LSF to SLURM, I hope this problem will go away, as it happens frequently. Sometimes output is even produced for a process, but the 151 error causes the process to fail and a new job to start.

The only workaround is setting retry to 4 and letting it run. I can't see any correlation to very high network or cluster load.

[b9/d846c3] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.2) 
[6a/b11b8b] NOTE: Process `fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.1)` terminated with an error exit status (151) -- Execution is retried (3)
[9d/d2e6ad] Re-submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.1)
[2f/393ce6] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.3) 
[27/6b5e20] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.4) 
[27/6b5e20] NOTE: Process `fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.4)` terminated with an error exit status (151) -- Execution is retried (1)
[34/0b7d5b] Re-submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.4)
[9b/a731f2] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.5) 
[9b/a731f2] NOTE: Process `fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.5)` terminated with an error exit status (151) -- Execution is retried (1)
[6c/8016db] Re-submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.5)
[48/fb3ec8] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.6) 
[b7/0a0d19] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.7) 
[31/757893] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.8) 
[31/757893] NOTE: Process `fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.8)` terminated with an error exit status (151) -- Execution is retried (1)
[0c/d111da] Re-submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.8)
[76/73317e] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.9) 
[a7/132881] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.10)
[a7/132881] NOTE: Process `fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.10)` terminated with an error exit status (151) -- Execution is retried (1)
[89/937161] Re-submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.10)
[d0/eadaaa] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.11)
[98/1dfcc3] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.12)
[98/1dfcc3] NOTE: Process `fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.12)` terminated with an error exit status (151) -- Execution is retried (1)
[ae/c0d75d] Re-submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.12)
[7c/182383] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.13)
[7c/182383] NOTE: Process `fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.13)` terminated with an error exit status (151) -- Execution is retried (1)
[61/4f36d8] Re-submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.13)
[61/4f36d8] NOTE: Process `fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.13)` terminated with an error exit status (151) -- Execution is retried (2)
[ba/6624a6] Re-submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.13)
[48/1f35a7] Submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.14)
[48/1f35a7] NOTE: Process `fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.14)` terminated with an error exit status (151) -- Execution is retried (1)
[74/4fe59b] Re-submitted process > fastp_split (1277628-20230619_1559_1E_PAQ27114_0cde6561_ont.14)


Grelot · 2024-01-27T00:39:09Z

Grelot
Jan 27, 2024

I got an unexplained error 151 using a Singularity container with Nextflow version 23.10.1.

Command exit status:
  151


colindaven · 2024-12-20T10:35:26Z

colindaven
Dec 20, 2024
Author

I have almost completely resolved the 151 issue by adding sleep commands after jobs that create huge outputs.

This gives the storage (especially NFS, but not only NFS) a grace period to catch up, so that the file is really written there rather than just a file stub being created. I use sleep values of 180-300 seconds for huge files (>40 GB) or very long-running processes like assemblies.

One example:

fastp in.fastq 
sleep 300
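
If a fixed sleep is too wasteful for small outputs, the grace period could be made to depend on the output size. The following is only a sketch of that idea, not something tested in this thread: the function names, the single 40 GiB threshold, and the choice of 300 seconds are assumptions drawn loosely from the values above.

```shell
#!/usr/bin/env bash
# Pick a grace period in seconds from an output size given in bytes.
# Assumption: only outputs larger than 40 GiB need the long sleep.
grace_period_for_size() {
    local bytes="$1"
    local gib=$(( bytes / 1024 / 1024 / 1024 ))
    if [ "$gib" -gt 40 ]; then
        echo 300
    else
        echo 0
    fi
}

# Sleep after a command has finished writing its output.
# out_file is a hypothetical path supplied by the caller.
sleep_after_output() {
    local out_file="$1"
    local bytes
    bytes=$(( $(wc -c < "$out_file") ))   # file size in bytes
    sleep "$(grace_period_for_size "$bytes")"
}
```

A task script would then call `sleep_after_output` with its output file right after the tool finishes, in place of the unconditional `sleep 300`.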

