Segmentation fault during S3 file transfer with AWSBatch executor #2514
-
I am using Nextflow 21.10.5.5658 with AWS Batch as the executor, and I have a simple process.
When I run the same Nextflow script shown below on my Slurm cluster, the job finishes without any issues. I am running it on a Slurm partition that has the same resources as those provisioned by my compute environment (CE) in AWS Batch.
My nextflow script:
join_csvs_on_column1_using_paste.sh in my bin directory
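The process itself has roughly this shape (the S3 path, process name, and channel logic below are placeholders rather than my exact script):

```nextflow
// Rough sketch only -- S3 path and process name are placeholders
params.input = 's3://my-bucket/individual_files/*.csv'

process JOIN_CSVS {
    input:
    path csv_files          // all CSV files staged from S3 into the task dir

    output:
    path 'joined.csv'

    script:
    """
    join_csvs_on_column1_using_paste.sh ${csv_files} > joined.csv
    """
}

workflow {
    JOIN_CSVS( Channel.fromPath(params.input).collect() )
}
```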
Sanitized lines from my nextflow.config file, modelled on #1371 (comment), after trying many other permutations/combinations.
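They have roughly this shape (queue, bucket, region, and numbers below are placeholder values rather than my actual settings):

```nextflow
// Placeholder values -- the general shape of an AWS Batch setup
process.executor = 'awsbatch'
process.queue    = 'my-batch-queue'
workDir          = 's3://my-bucket/work'

aws.region                     = 'us-east-1'
aws.batch.cliPath              = '/home/ec2-user/miniconda/bin/aws'
aws.batch.maxParallelTransfers = 5
aws.batch.maxTransferAttempts  = 3
```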
I would appreciate any suggestions or workarounds for the above. Thanks in advance.

Edit: Added the relevant …
-
After digging around, I stumbled upon a similar issue that was raised in 2019: #1364. I have tried the fixes suggested there, i.e. adding … A similar issue related to Nextflow/AWS Batch was raised in the aws-genomics-cli GitHub repo: aws/amazon-genomics-cli#45
-
This happens because all the input files get listed in the script created by Nextflow to launch the job, which likely becomes too big for the Bash interpreter. You should consider either splitting that job into many sub-jobs, each handling a portion of those files at a time, or alternatively you can try to increase the …
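As a sketch of the first option (path, batch size, and process below are placeholders for your own), the buffer operator can split the channel so that each task only stages a bounded number of files:

```nextflow
// Sketch only -- split the inputs so each task stages a bounded number of files
process JOIN_BATCH {
    input:
    path csv_files                       // at most 100 files per task

    output:
    path 'partial.csv'

    script:
    """
    join_csvs_on_column1_using_paste.sh ${csv_files} > partial.csv
    """
}

workflow {
    Channel
        .fromPath('s3://my-bucket/individual_files/*.csv')   // placeholder path
        .buffer(size: 100, remainder: true)                   // batches of up to 100 files
        | JOIN_BATCH                                          // one Batch job per chunk
}
```

Each batch then becomes its own AWS Batch job, and a second step would be needed to merge the partial outputs.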
-
A better alternative could be to download the folder instead of the (long) list of files:
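For instance (bucket, process name, and script body below are placeholders for your own setup):

```nextflow
// Sketch only -- stage the whole S3 prefix as a single directory
process JOIN_CSVS {
    input:
    path 'individual_files'              // the folder arrives as one staged directory

    output:
    path 'joined.csv'

    script:
    """
    join_csvs_on_column1_using_paste.sh individual_files/*.csv > joined.csv
    """
}

workflow {
    JOIN_CSVS( Channel.fromPath('s3://my-bucket/individual_files', type: 'dir') )
}
```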
and then read the files from the folder in your script:
```bash
# count the CSV files inside the staged folder
NUM_FILES=$(ls -1 $PWD/individual_files/*.csv | wc -l)
```