
Convert zarr helpers to utilities, update numpy chunk encoding in zarr router #260

Merged (14 commits) on Apr 12, 2024

Conversation

mpiannucci (Contributor):

We have been working on building a subset router (https://github.com/asascience-open/xreds/blob/fa3aa81e398c280cef34fd6e0846880df0bb2aef/xreds/plugins/subset_plugin.py#L138) which introduces a nested dataset router. The core zarr plugin did not work with this because the zmetadata and zvariable dependencies use the xpublish global get_dataset dependency. Instead, this PR simply moves those functions to utils and removes the Depends functionality. If there is a better way to handle this, I am all ears; I was not sure why they were dependencies in the first place, so this may be incorrect.

This also includes a patch for numpy arrays when using the zarr router (#207). In some cases (especially kerchunk-concatenated datasets) there may be a mix of numpy and dask arrays, and the numpy arrays may retain encoding information even if the encoding is reset beforehand. This PR forces the chunk encoding to match the array shape whenever the underlying array is not a dask array.
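As a rough illustration (not the exact PR code), the chunk-encoding rule described above can be sketched as follows; the helper name and the duck-typing check for dask arrays are assumptions for the sake of the example:

```python
import numpy as np


def chunk_encoding_for(arr, encoded_chunks=None):
    """Sketch of the rule above: keep existing chunk encoding only for
    dask-backed arrays; for plain numpy arrays, force a single chunk
    spanning the full array shape, discarding any stale encoding
    (e.g. left over from kerchunk concatenation)."""
    is_dask = hasattr(arr, "dask")  # crude duck-typing check (an assumption)
    if is_dask and encoded_chunks is not None:
        return tuple(encoded_chunks)
    return tuple(arr.shape)
```

For a numpy array of shape `(4, 5)` carrying stale encoded chunks of `(2, 5)`, this returns `(4, 5)`, so the written zarr chunks always agree with the in-memory array.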

@mpiannucci (Contributor, Author):

pre-commit.ci autofix

```python
):
    """Returns a consolidated zmetadata dictionary, using the cache when possible."""
    cache_key = dataset.attrs.get(DATASET_ID_ATTR_KEY, '') + '/' + ZARR_METADATA_KEY
    zmeta = cache.get(cache_key)
```
mpiannucci (Contributor, Author):

This may be an issue for nested datasets, have to verify
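To illustrate the concern (a hypothetical reconstruction based on the snippet above, not the actual module): when a dataset lacks the ID attribute, the lookup falls back to an empty string, so every such dataset produces the same cache key.

```python
DATASET_ID_ATTR_KEY = "_xpublish_id"
ZARR_METADATA_KEY = ".zmetadata"


def zmetadata_cache_key(attrs: dict) -> str:
    """Mirror of the cache-key construction in the snippet above."""
    return attrs.get(DATASET_ID_ATTR_KEY, "") + "/" + ZARR_METADATA_KEY


# Two distinct datasets without the attribute collide on "/.zmetadata",
# so one dataset's consolidated metadata could be served for another.
key_a = zmetadata_cache_key({})
key_b = zmetadata_cache_key({})
```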

mpiannucci (Contributor, Author), Apr 11, 2024:

```python
DATASET_ID_ATTR_KEY = '_xpublish_id'
```

Ok, I think it just needs to be documented that if you build a dataset provider you should set `_xpublish_id` as an attribute on the dataset, or else the zarr attrs will be cached for the remote zarr dataset. An alternative is to change the dataset provider hook to assign an ID if one does not exist.

Currently there is no DATASET_ID set for datasets that are provided by the dataset provider, so the `.zmetadata` etc. is cached across datasets.

Any thoughts @abkfenris ?

abkfenris (Member):

That makes sense, as I definitely think I've got dataset provider implementations that would have problems with this.

@mpiannucci mpiannucci marked this pull request as draft April 11, 2024 14:02
@mpiannucci (Contributor, Author):
pre-commit.ci autofix

```diff
@@ -348,7 +348,7 @@ def test_cache(airtemp_ds):

     response1 = client.get('/zarr/air/0.0.0')
     assert response1.status_code == 200
-    assert '/air/0.0.0' in rest.cache
+    assert 'airtemp/air/0.0.0' in rest.cache
```
mpiannucci (Contributor, Author):

All datasets now have an `_xpublish_id` attr, so the cache string changed in this test.

@mpiannucci mpiannucci marked this pull request as ready for review April 11, 2024 15:20
@mpiannucci mpiannucci requested a review from abkfenris April 11, 2024 15:20
@mpiannucci (Contributor, Author):

Okay, I think this is good to go now.

@abkfenris (Member) left a comment:

Good catch with the caching issues!

Comment on lines 315 to 319
```{note}
Some routers may want to cache data computed from datasets that they serve to avoid unnecessary recomputation. In this case, routers may rely on the
`_xpublish_id` attribute (`DATASET_ID_ATTR_KEY` from `xpublish.api`) on each dataset. If this attribute is set, it should be a unique identifier for the dataset, otherwise the `dataset_id` used to load the dataset will be set as the `_xpublish_id` automatically.
```

abkfenris (Member):

It's probably worth being more explicit, and making it a more prominent admonition than a note. Also, could you tweak the example above to explicitly set the attr as an example?

mpiannucci (Contributor, Author):

Do you mean move that up and change it to a warning? I'll update the example code too.

abkfenris (Member):

Ya, warning is probably the right level. I was also thinking of tweaking the tone to go with the warning: 'you need to set a unique `DATASET_ID_ATTR_KEY` (from `xpublish.api`) on each dataset for routers to manage caching appropriately'.

mpiannucci (Contributor, Author):

Nice, I like that. Updated it!

Review thread on xpublish/utils/zarr.py (resolved)
@mpiannucci (Contributor, Author):

pre-commit.ci autofix

@abkfenris (Member) left a comment:

Some tweaks to help folks make sure keys are unique.

Review threads on docs/source/user-guide/plugins.md (outdated, resolved)
@mpiannucci (Contributor, Author):

> Some tweaks to help folks make sure keys are unique.

Great catches!! Thank you!

@mpiannucci mpiannucci merged commit a4c7d17 into xpublish-community:main Apr 12, 2024
15 checks passed
@mpiannucci mpiannucci deleted the zarr-utils branch April 12, 2024 13:00