aws-c-common: AWS_ERROR_SYS_CALL_FAILURE, System call failure. while running checkpointing #281

Open
1 task done
harshavardhana opened this issue Dec 15, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@harshavardhana

s3torchconnector version

s3torchconnector-1.2.7

s3torchconnectorclient version

s3torchconnectorclient-1.2.7

AWS Region

us-east-1

Describe the running environment

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.1 LTS"

Not on EC2

What happened?

Running the following checkpoint code fails with the panic shown after it:

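import torch
import torchvision
import s3torchconnector

# REGION, CHECKPOINT_URI, and config are defined earlier in the script (not shown).

# Model checkpointing
checkpoint = s3torchconnector.S3Checkpoint(region=REGION, s3client_config=config)
model = torchvision.models.resnet18()

# Save to MinIO (per the traceback, the S3 client is built lazily on this first writer call)
with checkpoint.writer(CHECKPOINT_URI + "epoch0.ckpt") as writer:
    torch.save(model.state_dict(), writer)

# Load from MinIO
with checkpoint.reader(CHECKPOINT_URI + "epoch0.ckpt") as reader:
    state_dict = torch.load(reader)

model.load_state_dict(state_dict)

Output: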
thread '<unnamed>' panicked at /root/.cargo/registry/src/index.crates.io-6f17d22bba15001f/mountpoint-s3-client-0.11.0/src/s3_crt_client.rs:295:89:
called `Result::unwrap()` on an `Err` value: Error(46, "aws-c-common: AWS_ERROR_SYS_CALL_FAILURE, System call failure.")
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: mountpoint_s3_client::s3_crt_client::S3CrtClient::new
   4: _mountpoint_s3_client::mountpoint_s3_client::MountpointS3Client::new_s3_client
   5: _mountpoint_s3_client::mountpoint_s3_client::_::<impl pyo3::impl_::pyclass::PyMethods<_mountpoint_s3_client::mountpoint_s3_client::MountpointS3Client> for pyo3::impl_::pyclass::PyClassImplCollector<_mountpoint_s3_client::mountpoint_s3_client::MountpointS3Client>>::py_methods::ITEMS::trampoline
   6: <unknown>
   7: _PyObject_MakeTpCall
   8: _PyEval_EvalFrameDefault
   9: PyObject_CallOneArg
  10: _PyObject_GenericGetAttrWithDict
  11: PyObject_GetAttr
  12: _PyEval_EvalFrameDefault
  13: PyEval_EvalCode
  14: <unknown>
  15: <unknown>
  16: _PyRun_SimpleFileObject
  17: _PyRun_AnyFileObject
  18: Py_RunMain
  19: Py_BytesMain
  20: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  21: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:360:3
  22: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "/home/minio/s3pytorch.py", line 75, in <module>
    with checkpoint.writer(CHECKPOINT_URI + "epoch0.ckpt") as writer:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/minio/pytorchtest/lib/python3.12/site-packages/s3torchconnector/s3checkpoint.py", line 60, in writer
    return self._client.put_object(bucket, key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/minio/pytorchtest/lib/python3.12/site-packages/s3torchconnector/_s3client/_s3client.py", line 116, in put_object
    return S3Writer(self._client.put_object(bucket, key, storage_class))
                    ^^^^^^^^^^^^
  File "/home/minio/pytorchtest/lib/python3.12/site-packages/s3torchconnector/_s3client/_s3client.py", line 65, in _client
    self._real_client = self._client_builder()
                        ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/minio/pytorchtest/lib/python3.12/site-packages/s3torchconnector/_s3client/_s3client.py", line 83, in _client_builder
    return MountpointS3Client(
           ^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Error(46, "aws-c-common: AWS_ERROR_SYS_CALL_FAILURE, System call failure.")
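
The note in the trace suggests rerunning with `RUST_BACKTRACE=full`, and the Python traceback shows that the client is only constructed on the first `writer` call. A minimal sketch for capturing more detail from that step follows. It assumes the panic reaches Python as `pyo3_runtime.PanicException`, which PyO3 derives from `BaseException`, so a plain `except Exception` would not catch it. The helper name `run_with_full_backtrace` is hypothetical and not part of s3torchconnector.

```python
import os
import traceback


def run_with_full_backtrace(action):
    """Run `action()` with RUST_BACKTRACE=full and report any failure.

    `action` is any zero-argument callable, for example a function that
    opens the checkpoint writer. Hypothetical helper for illustration.
    Returns (True, result) on success or (False, description) on failure.
    """
    # Set the variable before the Rust code runs so that a panic should
    # print the verbose backtrace instead of the abbreviated one.
    os.environ["RUST_BACKTRACE"] = "full"
    try:
        return True, action()
    except (KeyboardInterrupt, SystemExit):
        # Do not swallow deliberate interruption or exit.
        raise
    except BaseException as exc:  # PanicException is not an Exception subclass
        traceback.print_exc()
        return False, f"{type(exc).__name__}: {exc}"
```

With the script from the report, this could be called as `run_with_full_backtrace(lambda: checkpoint.writer(CHECKPOINT_URI + "epoch0.ckpt"))` to isolate the client construction from the `torch.save` call.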

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
harshavardhana added the "bug" (Something isn't working) label on Dec 15, 2024
@IsaevIlya
Contributor

Hello @harshavardhana,

Thank you for reaching out and providing the detailed information about the issue you are facing. We appreciate your interest in using the S3 Connector for PyTorch.

The error you encountered seems to be related to the interaction between our library and MinIO, an Amazon S3-compatible object storage server. Our project's primary goal is to provide optimized access to Amazon S3, and we do not actively maintain compatibility with other S3-compatible storage systems.

Could you please try to run your example in your environment against Amazon S3 to understand if the issue is a regression in our underlying libraries or a compatibility issue specific to MinIO? This would help us narrow down the root cause of the problem.
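
One way to run that comparison, sketched under assumptions: execute the same small write/read round trip against each target and compare the outcomes. The `roundtrip` helper is hypothetical. It only assumes the checkpoint object exposes `writer(uri)` and `reader(uri)` context managers, as `S3Checkpoint` is used in the report, and that the returned objects support `write` and `read`.

```python
def roundtrip(checkpoint, uri, payload=b"probe"):
    """Write `payload` to `uri`, read it back, and report the outcome.

    `checkpoint` is any object with writer(uri) and reader(uri) context
    managers. Hypothetical helper for illustration.
    Returns (ok, detail).
    """
    try:
        with checkpoint.writer(uri) as writer:
            writer.write(payload)
        with checkpoint.reader(uri) as reader:
            data = reader.read()
    except (KeyboardInterrupt, SystemExit):
        # Do not swallow deliberate interruption or exit.
        raise
    except BaseException as exc:  # BaseException so that pyo3 panics are caught too
        return False, f"{type(exc).__name__}: {exc}"
    if data != payload:
        return False, "read-back mismatch"
    return True, "ok"
```

Calling it once with a checkpoint configured for Amazon S3 and once with the MinIO configuration from the report, each with its own bucket URI, would show whether the failure follows the storage backend or the local environment. If both calls fail in the same way, MinIO compatibility is less likely to be the cause.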

While we strive for compatibility with other S3-compatible systems whenever possible, we do not dedicate resources specifically for this purpose. If our library happens to work with a compatible system, that's great, but we cannot guarantee support or prioritize addressing issues related to non-Amazon S3 storage providers.

We appreciate your understanding that our focus is solely on Amazon S3, and we may not be able to assist with issues related to other storage systems. However, if you discover a regression or bug in our library when used with Amazon S3, we will be happy to investigate and address it.
