Convert zarr helpers to utilities, update numpy chunk encoding in zarr router #260
Conversation
pre-commit.ci autofix (for more information, see https://pre-commit.ci)
```python
):
    """Returns a consolidated zmetadata dictionary, using the cache when possible."""
    cache_key = dataset.attrs.get(DATASET_ID_ATTR_KEY, '') + '/' + ZARR_METADATA_KEY
    zmeta = cache.get(cache_key)
```
This may be an issue for nested datasets, have to verify
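For example, if two provided datasets both lack the attribute, they collapse to the same cache key; a minimal sketch of the key construction shown above:

```python
import xarray as xr

# Two distinct provided datasets, neither with '_xpublish_id' set
ds_a, ds_b = xr.Dataset(), xr.Dataset()

key_a = ds_a.attrs.get('_xpublish_id', '') + '/' + '.zmetadata'
key_b = ds_b.attrs.get('_xpublish_id', '') + '/' + '.zmetadata'
assert key_a == key_b == '/.zmetadata'  # one cache entry shared by both datasets
```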
`DATASET_ID_ATTR_KEY = '_xpublish_id'`

Ok, I think it just needs to be documented that if you build a dataset provider you should set `_xpublish_id` as an attribute on the dataset, or else the zarr attrs will be cached for the remote zarr dataset. The alternative is to change the dataset provider hook to assign an ID if one does not exist.

Currently there is no dataset ID set for datasets that are provided by the dataset provider, so the `.zmetadata` etc. is cached across datasets.
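For illustration, the provider could set the attribute itself when it builds the dataset; a minimal sketch, assuming the usual xpublish dataset-provider hook names:

```python
import xarray as xr
from xpublish import Plugin, hookimpl

class MyProvider(Plugin):
    name: str = 'my-provider'

    @hookimpl
    def get_datasets(self):
        return ['airtemp']

    @hookimpl
    def get_dataset(self, dataset_id: str):
        if dataset_id != 'airtemp':
            return None
        ds = xr.tutorial.open_dataset('air_temperature')
        # Set a unique ID so cached zarr metadata is keyed per dataset
        ds.attrs['_xpublish_id'] = dataset_id
        return ds
```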
Any thoughts @abkfenris ?
That makes sense, as I definitely think I've got dataset provider implementations that would have problems with this.
pre-commit.ci autofix (for more information, see https://pre-commit.ci)
```diff
@@ -348,7 +348,7 @@ def test_cache(airtemp_ds):

     response1 = client.get('/zarr/air/0.0.0')
     assert response1.status_code == 200
-    assert '/air/0.0.0' in rest.cache
+    assert 'airtemp/air/0.0.0' in rest.cache
```
All datasets now have an `_xpublish_id` attr, so the cache string changed in this test.
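To spell out where the new prefix comes from, a sketch mirroring the key construction in the helper above:

```python
import xarray as xr

ds = xr.Dataset(attrs={'_xpublish_id': 'airtemp'})

# Chunk cache keys are now prefixed with the dataset's _xpublish_id,
# so entries from different datasets can no longer collide.
cache_key = ds.attrs.get('_xpublish_id', '') + '/air/0.0.0'
assert cache_key == 'airtemp/air/0.0.0'
```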
Okay, I think this is good to go now.
Good catch with the caching issues!
docs/source/user-guide/plugins.md (Outdated)
```{note}
Some routers may want to cache data computed from datasets that they serve to avoid unnecessary recomputation. In this case, routers may rely on the `_xpublish_id` attribute (`DATASET_ID_ATTR_KEY` from `xpublish.api`) on each dataset. If this attribute is set, it should be a unique identifier for the dataset; otherwise, the `dataset_id` used to load the dataset will be set as the `_xpublish_id` automatically.
```
It's probably worth being more explicit, and making it a more prominent alert than a note. Also, could you tweak the example above to explicitly set the attr as an example?
Do you mean move that up and change it to a warning? I'll update the example code too.
Ya, warning is probably the right level. I was thinking of tweaking the tone as well, something like 'you need to set a unique `DATASET_ID_ATTR_KEY` (from `xpublish.api`) on each dataset for routers to manage caching appropriately' to go with the warning.
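Rendered in the docs' admonition style, that suggestion might look like the following (a sketch; the wording actually merged may differ):

```{warning}
You need to set a unique `DATASET_ID_ATTR_KEY` (`_xpublish_id`, importable from `xpublish.api`) as an attribute on each dataset for routers to manage caching appropriately.
```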
Nice, I like that. Ok, updated it!
pre-commit.ci autofix (for more information, see https://pre-commit.ci)
Some tweaks to help folks make sure keys are unique.
Co-authored-by: Alex Kerney <[email protected]>
Great catches!! Thank you!
We have been working on building a subset router (https://github.com/asascience-open/xreds/blob/fa3aa81e398c280cef34fd6e0846880df0bb2aef/xreds/plugins/subset_plugin.py#L138), which introduces a nested dataset router. The core zarr plugin did not work with this because the zmetadata and zvariable dependencies used the xpublish global get_dataset dependency. Instead, this PR simply moves those functions to utils and removes the Depends functionality. If there is a better way to handle this, I am all ears; I was not sure why they were dependencies in the first place, so this may be incorrect.
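A minimal sketch of the shape of the change (names simplified; not the PR's exact code): the helper becomes a plain function that takes the dataset as an argument, so a nested router can call it with whatever dataset it resolved itself, instead of relying on FastAPI's `Depends` on the global `get_dataset`.

```python
import xarray as xr

DATASET_ID_ATTR_KEY = '_xpublish_id'
ZARR_METADATA_KEY = '.zmetadata'

def get_zmetadata(dataset: xr.Dataset, cache: dict) -> dict:
    """Plain utility: any router, nested or not, passes its own dataset."""
    cache_key = dataset.attrs.get(DATASET_ID_ATTR_KEY, '') + '/' + ZARR_METADATA_KEY
    zmeta = cache.get(cache_key)
    if zmeta is None:
        zmeta = {'metadata': {}}  # placeholder for the real consolidated metadata
        cache[cache_key] = zmeta
    return zmeta
```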
This also includes a patch for numpy arrays when using the zarr router (#207). In some cases (especially kerchunk-concatenated datasets) there may be a mix of numpy and dask arrays, and the numpy arrays may carry chunk encoding information even if the encoding is reset beforehand. This PR forces the encoding to match the array shape when the underlying array is not a dask array.
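The idea, roughly (a sketch of the approach with assumed function names, not the PR's exact code):

```python
import dask.array
import xarray as xr

def encode_chunks(da: xr.DataArray) -> xr.DataArray:
    """Make the zarr chunk encoding consistent with the backing array."""
    if isinstance(da.variable.data, dask.array.Array):
        # Dask-backed: the real chunking is authoritative, keep it.
        return da
    # numpy-backed: stale 'chunks' encoding (e.g. from a kerchunk source)
    # would mislead the zarr router, so force a single chunk spanning
    # the full array shape.
    da.encoding['chunks'] = da.shape
    return da
```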