The AMI is simply the standard AWS ECS-optimized AMI, with the addition of the awscli exactly as documented here.
So far, I've only seen this happen when many jobs land on a large (64-CPU) instance. I'm guessing the local disk is thrashing (or maybe the network is saturated), causing some transient errors, similar to the Docker timeout errors I was seeing prior to increasing the ECS timeout.
I guess I could exclude large instances from the compute environment, but that seems a little hacky and impacts throughput. Is there a way to have Nextflow retry with backoff when these types of underlying infrastructure errors happen, instead of exiting?
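For example, something along these lines in nextflow.config is roughly the behaviour I'm after for the S3 staging side; I'm not certain I have the option names exactly right, so treat it as a sketch rather than a tested config:

```groovy
// Untested sketch: retry the awscli-based S3 staging operations a few
// times, with a pause between attempts, instead of failing the task on
// the first transient error. Option names per my reading of the
// Nextflow AWS Batch docs; the values are arbitrary.
aws {
    batch {
        maxTransferAttempts  = 5        // retry failed uploads/downloads
        delayBetweenAttempts = '30 sec' // wait between attempts
    }
}
```

That would only cover staging, though; ideally the task itself would also be retried with backoff when the underlying instance is the problem.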
Bug report
Expected behavior and actual behavior
Nextflow sometimes exits when running the nf-core rnaseq workflow against AWS Batch with the following error:
If this type of error happens occasionally, it would be good if Nextflow could retry as usual instead of exiting.
.nextflow.log
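One workaround I can think of is marking tasks as retryable in nextflow.config, roughly as below, though I'm not sure it would catch this particular error, since Nextflow seems to exit outright rather than treat it as a normal task failure:

```groovy
// Workaround sketch (the values are illustrative, not a verified fix):
// re-run a failed task a couple of times before giving up, rather than
// aborting the whole workflow on the first error.
process {
    errorStrategy = 'retry'  // retry failed tasks instead of stopping the run
    maxRetries    = 2        // give up after two retries
}
```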
Steps to reproduce the problem
Run the nf-core rnaseq workflow on AWS Batch with a handful of samples, using the plain awsbatch profile (no Wave or Fusion).
Note that the problem is intermittent; it does not occur on every run.
I am using the latest AWS ECS-optimized AMI, customized to include the awscli installed via Miniconda.
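For reference, the relevant part of my nextflow.config is along these lines (the queue name, region, and awscli path below are placeholders rather than my exact values):

```groovy
// Plain awsbatch setup, no Wave or Fusion. All values are placeholders.
process {
    executor = 'awsbatch'
    queue    = 'my-batch-queue'   // placeholder job queue name
}

aws {
    region = 'us-east-1'          // placeholder region
    batch {
        // awscli installed via Miniconda on the custom AMI
        cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}
```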
Also, I am using a launch template to increase the disk size and the ECS timeouts:
Program output
There are a couple of additional errors upstream from this in the .nextflow.log:
Environment
Additional context