CRT raises AWS_ERROR_S3_REQUEST_HAS_COMPLETED when uploading Llama 70b checkpoints #219

Open
crazyguitar opened this issue Jul 30, 2024 · 1 comment
Labels
bug Something isn't working

Comments

crazyguitar commented Jul 30, 2024

s3torchconnector version

s3torchconnector-1.2.3

s3torchconnectorclient version

s3torchconnectorclient-1.2.3

AWS Region

us-west-2

Describe the running environment

EC2 instance p4d.24xlarge
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Amazon Linux release 2 (Karoo)

What happened?

Hi team,

I was running the Llama v2 70b model on 32 nodes using the Slurm job scheduler, based on the SageMaker Model Parallelism Library v2 (using a specific Docker image). To improve checkpoint performance, I tried using the s3-connector-for-pytorch. However, when writing checkpoints to S3, the writer stream raised the error: "Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete". Note that Llama v2 7b and 13b did not encounter this issue.

This error seemed to indicate that I had hit an S3 rate limit. The issue was resolved when I increased the part_size from 8 MB to 32 MB. However, the error code was ambiguous, and I was unsure whether the problem was really related to rate limiting. Could the team help explain this specific error? Thank you.

Relevant log output

9397  4: [rank34]:   File "/opt/conda/lib/python3.11/site-packages/s3torchconnector/s3writer.py", line 40, in write                                                                                      
9398  4: [rank34]:     self.stream.write(data)                                                                                                                                                           
9399  4: [rank34]: s3torchconnectorclient._mountpoint_s3_client.S3Exception: Client error: Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete
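For reference, the workaround the reporter describes (raising part_size) can be sketched roughly as below. This is an assumption-laden sketch, not the reporter's actual code: the `S3ClientConfig`/`s3client_config` knob is only available in some s3torchconnector versions, and the bucket, key, and `model` are placeholders.

```python
import torch
from s3torchconnector import S3Checkpoint, S3ClientConfig  # S3ClientConfig availability depends on the installed version

# Assumption: part_size is given in bytes; use 32 MiB instead of the 8 MiB default
config = S3ClientConfig(part_size=32 * 1024 * 1024)

# "my-bucket" and the object key are hypothetical placeholders
checkpoint = S3Checkpoint(region="us-west-2", s3client_config=config)
with checkpoint.writer("s3://my-bucket/llama-70b/ckpt.pt") as stream:
    torch.save(model.state_dict(), stream)  # `model` stands in for the sharded Llama module
```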

Code of Conduct

  • I agree to follow this project's Code of Conduct
@crazyguitar crazyguitar added the bug Something isn't working label Jul 30, 2024
@IsaevIlya
Contributor

Hi @crazyguitar,

Thank you for sharing your experience with the s3-connector-for-pytorch. The error message "AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already completed" is indeed ambiguous and can be confusing.

It seems that the issue you encountered was related to the object size limit for the given part_size. Increasing the part_size from 8 MB to 32 MB was the right decision, as it allowed you to upload larger objects without hitting the object size limit.

The s3-connector-for-pytorch uses the AWS Common Runtime (CRT) under the hood, which breaks large requests into smaller part-sized requests and executes them in parallel. A multipart upload to S3 can contain at most 10,000 parts, so with a part_size of 8 MB the maximum upload size is around 80 GB. If your model checkpoint was larger than this limit, increasing the part_size was the appropriate solution. Was your model larger than 80 GB?
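The arithmetic behind the ~80 GB figure above can be checked directly; it is just S3's 10,000-part multipart limit multiplied by the part size (the helper name here is illustrative, not part of the library):

```python
MAX_PARTS = 10_000  # S3 multipart upload part-count limit

def max_object_size_gb(part_size_bytes: int) -> float:
    """Largest object a multipart upload can write for a given part size."""
    return MAX_PARTS * part_size_bytes / 1e9

# 8 MiB parts cap uploads near 80 GB; 32 MiB quadruples that headroom
print(max_object_size_gb(8 * 1024 * 1024))   # ≈ 83.9
print(max_object_size_gb(32 * 1024 * 1024))  # ≈ 335.5
```

So a 70b-parameter checkpoint can plausibly exceed the 8 MiB-part ceiling while the 7b and 13b checkpoints stay well under it, which matches the reporter's observation.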

I will reach out to the CRT team to discuss whether it is possible to provide a more meaningful error message when the object size of an upload exceeds the limit implied by the current part_size. I will also review our documentation to make the guidance on part_size more helpful.

Thank you for your feedback and for sharing your experience. It will help us improve the user experience and documentation for the s3-connector-for-pytorch.
