The AMI is simply the standard AWS ECS-optimized AMI, with the addition of the awscli exactly as documented here.
So far, I've only seen this happen when many jobs land on a large (64-CPU) instance. I'm guessing the local disk is thrashing (or maybe the network is saturated), causing some transient errors, similar to the Docker timeout errors I was seeing prior to increasing the ECS timeout.
I guess I could exclude large instances from the compute environment, but that seems a little hacky and impacts throughput. Is there a way to have Nextflow retry with backoff when these types of underlying infrastructure errors happen, instead of exiting?
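For example, something along these lines in nextflow.config is roughly the behaviour I'm after for the S3 staging side; I'm not certain I have the option names exactly right, so treat it as a sketch rather than a tested config:

```groovy
// Untested sketch: retry the awscli-based S3 staging operations a few
// times, with a pause between attempts, instead of failing the task on
// the first transient error. Option names per my reading of the
// Nextflow AWS Batch docs; the values are arbitrary.
aws {
    batch {
        maxTransferAttempts  = 5        // retry failed uploads/downloads
        delayBetweenAttempts = '30 sec' // wait between attempts
    }
}
```

That would only cover staging, though; ideally the task itself would also be retried with backoff when the underlying instance is the problem.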
Bug report
Expected behavior and actual behavior
Nextflow sometimes exits when running the nf-core rnaseq workflow against AWS Batch with the following error:
If this type of error happens occasionally, it would be good if Nextflow could retry as usual instead of exiting.
.nextflow.log
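One workaround I can think of is marking tasks as retryable in nextflow.config, roughly as below, though I'm not sure it would catch this particular error, since Nextflow seems to exit outright rather than treat it as a normal task failure:

```groovy
// Workaround sketch (the values are illustrative, not a verified fix):
// re-run a failed task a couple of times before giving up, rather than
// aborting the whole workflow on the first error.
process {
    errorStrategy = 'retry'  // retry failed tasks instead of stopping the run
    maxRetries    = 2        // give up after two retries
}
```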
Steps to reproduce the problem
Run the nf-core rnaseq workflow on AWS Batch with a handful of samples, using the plain awsbatch profile (no Wave or Fusion).
Note that the problem is intermittent; it does not occur on every run.
I am using the latest AWS ECS-optimized AMI, customized to include the awscli installed via Miniconda.
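For reference, the relevant part of my nextflow.config is along these lines (the queue name, region, and awscli path below are placeholders rather than my exact values):

```groovy
// Plain awsbatch setup, no Wave or Fusion. All values are placeholders.
process {
    executor = 'awsbatch'
    queue    = 'my-batch-queue'   // placeholder job queue name
}

aws {
    region = 'us-east-1'          // placeholder region
    batch {
        // awscli installed via Miniconda on the custom AMI
        cliPath = '/home/ec2-user/miniconda/bin/aws'
    }
}
```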
Also, I am using a launch template to increase the disk size and the ECS timeouts:
Program output
There are a couple of additional errors upstream from this in the .nextflow.log:
Environment
Additional context