CRT raises AWS_ERROR_S3_REQUEST_HAS_COMPLETED when uploading Llama 70b checkpoints. #219

crazyguitar · 2024-07-30T18:49:08Z

s3torchconnector version

s3torchconnector-1.2.3

s3torchconnectorclient version

s3torchconnectorclient-1.2.3

AWS Region

us-west-2

Describe the running environment

EC2 instance p4d.24xlarge
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Amazon Linux release 2 (Karoo)

What happened?

Hi team,

I was running the Llama v2 70b model on 32 nodes under the Slurm job scheduler, based on the SageMaker Model Parallelism Library v2 (using a specific Docker image). To improve checkpoint performance, I tried using s3torchconnector. However, when writing checkpoints to S3, the writer stream failed with: "Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete". Note that Llama v2 7b and 13b did not encounter this issue.

This error seemed to indicate that I had hit the S3 rate limit. The issue was resolved when I increased the part_size from 8 MB to 32 MB. However, the error code is ambiguous, and I am unsure whether the problem was really related to the rate limit. Could the team help explain this specific error? Thank you.

Relevant log output

9397  4: [rank34]:   File "/opt/conda/lib/python3.11/site-packages/s3torchconnector/s3writer.py", line 40, in write
9398  4: [rank34]:     self.stream.write(data)
9399  4: [rank34]: s3torchconnectorclient._mountpoint_s3_client.S3Exception: Client error: Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete

Code of Conduct

I agree to follow this project's Code of Conduct

IsaevIlya · 2024-07-31T15:08:13Z

Hi @crazyguitar,

Thank you for sharing your experience with the s3-connector-for-pytorch. The error message "AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already completed" is indeed ambiguous and can be confusing.

It seems that the issue you encountered was related to the object size limit for the given part_size. Increasing the part_size from 8 MB to 32 MB was the right decision, as it allowed you to upload larger objects without hitting the object size limit.

The s3-connector-for-pytorch uses the AWS Common Runtime (CRT) under the hood, which breaks large requests into smaller part-sized requests and executes them in parallel. An upload to S3 can consist of at most 10,000 parts, so with a part_size of 8 MB the maximum upload size is around 80 GB. If your model checkpoint was larger than this limit, increasing the part_size was the appropriate solution. Was your model larger than 80 GB?
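To make the arithmetic above concrete, here is a minimal sketch in plain Python. It does not call the connector itself; the helper names (max_upload_bytes, min_part_size) and the 140 GB checkpoint size are assumptions for illustration only, and I am treating "8 MB" as 8 MiB.

```python
# Minimal sketch of the part-count arithmetic, not part of s3torchconnector.
# Helper names and the example checkpoint size below are hypothetical.

MAX_PARTS = 10_000       # maximum number of parts in one multipart upload
MIB = 1024 * 1024


def max_upload_bytes(part_size: int) -> int:
    """Largest object that fits in MAX_PARTS parts of the given size."""
    return MAX_PARTS * part_size


def min_part_size(object_size: int, granularity: int = MIB) -> int:
    """Smallest part size (rounded up to `granularity`) that fits the object."""
    per_part = -(-object_size // MAX_PARTS)          # integer ceiling division
    return -(-per_part // granularity) * granularity


if __name__ == "__main__":
    for size_mib in (8, 32):
        limit = max_upload_bytes(size_mib * MIB)
        print(f"part_size={size_mib} MiB -> max object {limit / 2**30:.1f} GiB")

    # Hypothetical 140 GB object (for example, 70e9 parameters at 2 bytes each).
    needed = min_part_size(140 * 10**9)
    print(f"140 GB object needs part_size >= {needed // MIB} MiB")
```

With these assumptions, an 8 MiB part size tops out at roughly 78 GiB per object, while 32 MiB allows roughly 312 GiB. Whether a single object actually crosses the limit depends on how the checkpoint is sharded across ranks.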

I will reach out to the CRT team to discuss if it is possible to provide a more meaningful error message in situations where the object size for upload exceeds the current part_size limit. I will also take a look at our documentation to make it more helpful regarding the usage of part_size.
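Until a clearer error is available, a caller-side guard is one possible workaround. The sketch below is an assumption-heavy illustration, not the connector's or the CRT's behavior: SizeCheckedWriter is a hypothetical wrapper around any object with a write() method, and it simply counts bytes and fails early with a descriptive message once the total would exceed 10,000 parts of the configured part_size.

```python
# Hypothetical caller-side guard; not an API of s3torchconnector or the CRT.
import io

MAX_PARTS = 10_000  # maximum number of parts in one multipart upload


class SizeCheckedWriter:
    """Wraps a writable stream and rejects writes past MAX_PARTS * part_size."""

    def __init__(self, stream, part_size: int):
        self._stream = stream
        self._part_size = part_size
        self._limit = MAX_PARTS * part_size
        self._written = 0

    def write(self, data: bytes) -> int:
        if self._written + len(data) > self._limit:
            raise ValueError(
                f"upload would exceed {self._limit} bytes "
                f"({MAX_PARTS} parts x part_size={self._part_size}); "
                "increase part_size"
            )
        self._written += len(data)
        return self._stream.write(data)


if __name__ == "__main__":
    # Tiny part_size so the limit (10,000 bytes) is easy to hit in a demo.
    writer = SizeCheckedWriter(io.BytesIO(), part_size=1)
    writer.write(b"x" * 10_000)      # exactly at the limit: accepted
    try:
        writer.write(b"x")           # one byte over: rejected
    except ValueError as err:
        print(err)
```

In real use the wrapped stream would be whatever writer the checkpoint code already uses, and part_size would need to match the value configured for the client.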

Thank you for your feedback and for sharing your experience. It will help us improve the user experience and documentation for the s3-connector-for-pytorch.

crazyguitar added the "bug" (Something isn't working) label on Jul 30, 2024
