
S3 support for a file which is not on active storage #94

Open
bnlawrence opened this issue Jun 16, 2023 · 2 comments
@bnlawrence
Collaborator
We need to include support within PyActiveStorage for the situation where the client has requested active storage but the remote server does not support it; in that case we should fail over to computing the operations ourselves. To do that, our version of reduce_chunk needs to fetch the necessary blocks and perform the operations itself, as it currently does for POSIX storage.

In the long term we would hope that netcdf4-python would do this transparently, but for the moment we need to use h5netcdf to do it.
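A minimal sketch of that failover, assuming s3fs for the byte fetch and h5netcdf for decoding; the name `local_reduce_chunk` and its signature are illustrative, not the existing PyActiveStorage API:

```python
import h5netcdf
import numpy as np
import s3fs

def local_reduce_chunk(s3_uri, varname, chunk_slices, operation=np.sum):
    """Fetch one chunk of `varname` from S3 and reduce it client-side.

    Mirrors what the POSIX reduce_chunk already does for local files,
    for use when the remote server cannot perform the reduction itself.
    """
    fs = s3fs.S3FileSystem()  # credentials from the usual AWS config
    with fs.open(s3_uri, "rb") as f, h5netcdf.File(f, "r") as ds:
        # Indexing with the chunk's slices pulls back only that slab.
        data = np.asarray(ds.variables[varname][chunk_slices])
    return operation(data)
```

Calling this with, say, `operation=np.max` on a chunk's slices should give the same answer the active server would have returned for that block.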

@markgoddard
This is a similar scenario to when S3 active storage is broken or too busy to handle the request.

Should activestorage.s3.reduce_chunk handle these cases transparently, or raise an error that is handled by Active, which then propagates the request to activestorage.storage.reduce_chunk? I lean towards the former approach, keeping all S3 interaction within the s3 module. In that case it would make sense to extract some of the NumPy operations into a common module shared by the storage and s3 modules.
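For the transparent option, the shape might be something like the sketch below; `s3_active_reduce` and `numpy_reduce` are hypothetical names standing in for the proxy request and the shared NumPy code, and the wire format is invented purely for illustration:

```python
import numpy as np
import requests

def s3_active_reduce(chunk_url, operation):
    # Hypothetical: ask the active storage proxy to do the reduction.
    resp = requests.post(chunk_url, json={"operation": operation}, timeout=30)
    resp.raise_for_status()
    return np.frombuffer(resp.content, dtype="f8")

def numpy_reduce(chunk_url, operation):
    # Hypothetical: fetch the raw bytes and reduce them client-side.
    resp = requests.get(chunk_url, timeout=30)
    resp.raise_for_status()
    return getattr(np, operation)(np.frombuffer(resp.content, dtype="f8"))

def reduce_chunk(chunk_url, operation):
    """Try active storage first; fall back transparently to local NumPy."""
    try:
        return s3_active_reduce(chunk_url, operation)
    except requests.RequestException:
        # Server broken, busy, or unreachable: compute the answer ourselves.
        return numpy_reduce(chunk_url, operation)
```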

@bnlawrence
Collaborator (Author)

bnlawrence commented Jun 22, 2023

  1. I think we need to handle S3 independently of S3 active storage. There are going to be many use cases where the Dask workflow has identified a need to bring all the data back to the client, whether or not active storage is present.
  2. We think the error needs to propagate up to PyActiveStorage so it can avoid making unnecessary repeated requests, which would add extra latency to every block.

Context: each computational chunk in Dask has its own PyActiveStorage instance, and they are likely to be making requests in parallel, so once a computational chunk sees a problem it should give up on using active storage, while others may still work fine (see the sketch below).
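A sketch of that "fail once, then stop trying" behaviour, using illustrative names rather than the real Active class: each instance keeps its own flag, tries active storage until the first failure, and from then on reduces locally.

```python
class Active:
    """Illustrative stand-in for PyActiveStorage's Active class."""

    def __init__(self, reduce_active, reduce_local):
        self._reduce_active = reduce_active  # talks to the S3 proxy
        self._reduce_local = reduce_local    # client-side NumPy fallback
        self._use_active = True              # flips to False on first failure

    def reduce_chunk(self, *args, **kwargs):
        if self._use_active:
            try:
                return self._reduce_active(*args, **kwargs)
            except Exception:
                # Remember the failure so later blocks handled by this
                # instance skip the wasted round trip; parallel instances
                # each keep their own flag and may still use active storage.
                self._use_active = False
        return self._reduce_local(*args, **kwargs)
```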
