For HPC workloads, when a job with tens of thousands of tasks is submitted, we often see the same object being requested by many clients, running in pods or on EC2 instances, at the same time. A typical example is a queue-based architecture, where existing pods or EC2 instances are up and running and actively consuming messages from the queue. When this happens, multiple pods and instances often request the same file from an S3 bucket in order to complete the calculation. The impact of this, when using Mountpoint for Amazon S3 (and the CSI driver if using Amazon EKS), is thousands of S3 API calls happening at the same time while the same object is being fetched. This causes the following error:

mountpoint_s3::fuse: lookup failed: inode error: error from ObjectClient: ListObjectsV2 failed: Client error: Unknown CRT error: CRT error 14342: aws-c-s3:AWS_ERROR_S3_SLOW_DOWN, Response code indicates throttling
A more efficient approach would be for Mountpoint to know when an object is already being requested and, if so, pool subsequent requests until the object has been fetched and is available in the configured cache location, allowing those requests to read the file locally rather than making additional API calls to Amazon S3. This enhancement would dramatically cut the number of Amazon S3 API calls for HPC workloads, improve overall job performance, and be more cost efficient.
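The pooling idea described above is often called request coalescing or "single flight": the first request for a key performs the fetch, and concurrent requests for the same key wait for that one result instead of issuing their own. A minimal sketch in Python for illustration only; `SingleFlightFetcher` and `fetch_fn` are hypothetical names, and this is not Mountpoint's actual implementation (Mountpoint itself is written in Rust):

```python
import threading

class SingleFlightFetcher:
    """Coalesce concurrent fetches of the same key: the first caller
    performs the fetch, later callers wait for and reuse its result.
    Illustrative sketch only, not Mountpoint's real code."""

    def __init__(self, fetch_fn):
        self._fetch_fn = fetch_fn     # e.g. a function wrapping GetObject
        self._lock = threading.Lock()
        self._in_flight = {}          # key -> threading.Event
        self._cache = {}              # key -> fetched result

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]      # served from local cache
            event = self._in_flight.get(key)
            if event is None:
                # First request for this key: this caller is the leader.
                event = threading.Event()
                self._in_flight[key] = event
                leader = True
            else:
                leader = False
        if leader:
            result = self._fetch_fn(key)     # single remote API call
            with self._lock:
                self._cache[key] = result
                del self._in_flight[key]
            event.set()                      # wake all waiting requests
            return result
        event.wait()                         # wait for the leader's fetch
        with self._lock:
            return self._cache[key]
```

With a scheme like this, eight concurrent requests for the same object would trigger a single remote fetch rather than eight, which is the reduction in ListObjectsV2/GetObject traffic this feature request is asking for.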
We've looked into this issue and can state that, when a single Mountpoint process is used, the number of ListObjectsV2 requests may be reduced by enabling the metadata cache with the --metadata-ttl flag.
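For reference, metadata caching is enabled at mount time. A sketch of what that might look like, where the bucket name, mount point, and cache directory are placeholders:

```shell
# Cache metadata for 60 seconds, so repeated lookups within the TTL are
# served locally instead of issuing new S3 API calls.
mount-s3 amzn-example-bucket /mnt/s3 --metadata-ttl 60

# Optionally also enable a local data cache, so repeated reads of the
# same object within the TTL are served from disk.
mount-s3 amzn-example-bucket /mnt/s3 --cache /tmp/mountpoint-cache --metadata-ttl 60
```

Note that this cache is per mount: each Mountpoint process keeps its own cache, which is why it does not address the multi-process case discussed next.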
As for reducing the total number of ListObjectsV2 requests emitted by multiple Mountpoint processes running on the same machine, we don't have that feature planned at this point.
For --metadata-ttl to help, the metadata first needs to be cached by something, for example by listing the directory. Otherwise it may not benefit workloads where there is no repeated access within the same mount.
Different Mountpoint processes, and even the same Mountpoint process, will perform GetObject requests for the same data, as we do not coordinate between file handles for the same object today.