CRT raises AWS_ERROR_S3_REQUEST_HAS_COMPLETED when uploading Llama 70b checkpoints. #219
Open
Labels
bug
s3torchconnector version
s3torchconnector-1.2.3
s3torchconnectorclient version
s3torchconnectorclient-1.2.3
AWS Region
us-west-2
Describe the running environment
EC2 instance p4d.24xlarge
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2025-06-30"
Amazon Linux release 2 (Karoo)
What happened?
Hi team,
I was running the Llama v2 70b model on 32 nodes using the Slurm job scheduler, based on the SageMaker Model Parallelism Library v2 (using a specific Docker image). To improve checkpoint performance, I tried using s3torchconnector. However, when writing checkpoints to S3, the writer stream failed with this error: "Unknown CRT error: CRT error 14366: aws-c-s3: AWS_ERROR_S3_REQUEST_HAS_COMPLETED, Request has already complete". Note that Llama v2 7b and 13b did not encounter this issue.
At first this error seemed to indicate that I had hit the S3 rate limit. The issue was resolved when I increased part_size from 8 MB to 32 MB. However, the error code is ambiguous, and I am unsure whether the problem is actually related to the rate limit. Could the team help explain this specific error? Thank you.
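For reference, here is a small back-of-the-envelope sketch of why part_size might matter independently of rate limiting. S3 multipart uploads are limited to 10,000 parts per object, so the part size caps the largest object a single upload can hold. Whether this limit is what actually triggers the CRT error above is only an assumption on my side, and the 100 GiB shard size below is a hypothetical value, not a measurement from my job.

```python
# Sketch: how part size bounds the largest object one multipart upload can hold.
# S3_MAX_PARTS is the documented S3 limit; the shard size is hypothetical.

MIB = 1024 ** 2
GIB = 1024 ** 3
S3_MAX_PARTS = 10_000  # S3 multipart upload limit: at most 10,000 parts


def max_object_bytes(part_size_bytes: int) -> int:
    """Largest object that fits in one multipart upload at this part size."""
    return part_size_bytes * S3_MAX_PARTS


def parts_needed(object_bytes: int, part_size_bytes: int) -> int:
    """Number of parts required, rounding up for a final partial part."""
    return -(-object_bytes // part_size_bytes)


def fits(object_bytes: int, part_size_bytes: int) -> bool:
    """True if the object can be uploaded without exceeding the part limit."""
    return parts_needed(object_bytes, part_size_bytes) <= S3_MAX_PARTS


if __name__ == "__main__":
    hypothetical_shard = 100 * GIB  # assumed size, for illustration only
    for part_mib in (8, 32):
        part = part_mib * MIB
        print(
            f"part_size={part_mib} MiB: "
            f"max object {max_object_bytes(part) / GIB:.1f} GiB, "
            f"parts for shard {parts_needed(hypothetical_shard, part)}, "
            f"fits={fits(hypothetical_shard, part)}"
        )
```

Under these assumptions, 8 MiB parts cap a single object at roughly 78 GiB, while 32 MiB parts raise the cap to roughly 312 GiB. That would be consistent with the 7b and 13b checkpoints succeeding and the 70b checkpoint failing, if a single 70b checkpoint object exceeds the smaller cap.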