
Convert zarr helpers to utilities, update numpy chunk encoding in zarr router #260

Merged (14 commits) on Apr 12, 2024

Conversation

mpiannucci (Contributor):

We have been working on building a subset router (https://github.com/asascience-open/xreds/blob/fa3aa81e398c280cef34fd6e0846880df0bb2aef/xreds/plugins/subset_plugin.py#L138) which introduces a nested dataset router. The core zarr plugin did not work with this because the zmetadata and zvariable dependencies use the xpublish global get_dataset dependency. Instead, this PR simply moves those functions to utils and removes the Depends functionality. If there is a better way to handle this, I am all ears; I was not sure why they were dependencies in the first place, so this may be incorrect.

This also includes a patch for numpy arrays when using the zarr router (#207). In some cases (especially kerchunk-concatenated datasets) there may be a mix of numpy and dask arrays, and the numpy arrays may retain encoding information even if the encoding is reset beforehand. This PR forces the chunk encoding to match the array shape whenever the underlying array is not a dask array.
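As a rough illustration (not the exact PR code), the chunk-encoding rule described above can be sketched as follows; the helper name and the duck-typing check for dask arrays are assumptions for the sake of the example:

```python
import numpy as np


def chunk_encoding_for(arr, encoded_chunks=None):
    """Sketch of the rule above: keep existing chunk encoding only for
    dask-backed arrays; for plain numpy arrays, force a single chunk
    spanning the full array shape, discarding any stale encoding
    (e.g. left over from kerchunk concatenation)."""
    is_dask = hasattr(arr, "dask")  # crude duck-typing check (an assumption)
    if is_dask and encoded_chunks is not None:
        return tuple(encoded_chunks)
    return tuple(arr.shape)
```

For a numpy array of shape `(4, 5)` carrying stale encoded chunks of `(2, 5)`, this returns `(4, 5)`, so the written zarr chunks always agree with the in-memory array.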

@mpiannucci (Contributor, Author):

pre-commit.ci autofix

```python
):
    """Returns a consolidated zmetadata dictionary, using the cache when possible."""
    cache_key = dataset.attrs.get(DATASET_ID_ATTR_KEY, '') + '/' + ZARR_METADATA_KEY
    zmeta = cache.get(cache_key)
```
mpiannucci (Contributor, Author):

This may be an issue for nested datasets, have to verify
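To illustrate the concern (a hypothetical reconstruction based on the snippet above, not the actual module): when a dataset lacks the ID attribute, the lookup falls back to an empty string, so every such dataset produces the same cache key.

```python
DATASET_ID_ATTR_KEY = "_xpublish_id"
ZARR_METADATA_KEY = ".zmetadata"


def zmetadata_cache_key(attrs: dict) -> str:
    """Mirror of the cache-key construction in the snippet above."""
    return attrs.get(DATASET_ID_ATTR_KEY, "") + "/" + ZARR_METADATA_KEY


# Two distinct datasets without the attribute collide on "/.zmetadata",
# so one dataset's consolidated metadata could be served for another.
key_a = zmetadata_cache_key({})
key_b = zmetadata_cache_key({})
```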

mpiannucci (Contributor, Author), Apr 11, 2024:

```python
DATASET_ID_ATTR_KEY = '_xpublish_id'
```

Ok, I think it just needs to be documented that if you build a dataset provider you should set `_xpublish_id` as an attribute on the dataset, or else the zarr attrs will be cached for the remote zarr dataset. An alternative is to change the dataset provider hook to assign an ID if one does not exist.

Currently there is no DATASET_ID set for datasets that are provided by the dataset provider, so the `.zmetadata` etc. is cached across datasets.

Any thoughts @abkfenris ?

abkfenris (Member):

That makes sense, as I definitely think I've got dataset provider implementations that would have problems with this.

@mpiannucci mpiannucci marked this pull request as draft April 11, 2024 14:02
@mpiannucci (Contributor, Author):
pre-commit.ci autofix

```diff
@@ -348,7 +348,7 @@ def test_cache(airtemp_ds):

     response1 = client.get('/zarr/air/0.0.0')
     assert response1.status_code == 200
-    assert '/air/0.0.0' in rest.cache
+    assert 'airtemp/air/0.0.0' in rest.cache
```
mpiannucci (Contributor, Author):

All datasets now have an `_xpublish_id` attr, so the cache string changed in this test.

@mpiannucci mpiannucci marked this pull request as ready for review April 11, 2024 15:20
@mpiannucci mpiannucci requested a review from abkfenris April 11, 2024 15:20
@mpiannucci (Contributor, Author):

Okay, I think this is good to go now.

@abkfenris (Member) left a comment:

Good catch with the caching issues!

Comment on lines 315 to 319
```{note}
Some routers may want to cache data computed from datasets that they serve to avoid unnecessary recomputation. In this case, routers may rely on the
`_xpublish_id` attribute (`DATASET_ID_ATTR_KEY` from `xpublish.api`) on each dataset. If this attribute is set, it should be a unique identifier for the dataset, otherwise the `dataset_id` used to load the dataset will be set as the `_xpublish_id` automatically.
```

abkfenris (Member):

It's probably worth being more explicit, and making it a more prominent admonition than a note. Also, could you tweak the example above to explicitly set the attr as an example?

mpiannucci (Contributor, Author):

Do you mean move that up and change it to a warning? I'll update the example code too.

abkfenris (Member):

Ya, warning is probably the right level. I was also thinking of tweaking the tone to go with the warning: 'you need to set a unique `DATASET_ID_ATTR_KEY` (from `xpublish.api`) on each dataset for routers to manage caching appropriately'.

mpiannucci (Contributor, Author):

Nice, I like that. Updated it!

Review thread on xpublish/utils/zarr.py (resolved)
@mpiannucci (Contributor, Author):

pre-commit.ci autofix

@abkfenris (Member) left a comment:

Some tweaks to help folks make sure keys are unique.

Review threads on docs/source/user-guide/plugins.md (outdated, resolved)
@mpiannucci (Contributor, Author):

> Some tweaks to help folks make sure keys are unique.

Great catches!! Thank you!

@mpiannucci mpiannucci merged commit a4c7d17 into xpublish-community:main Apr 12, 2024
15 checks passed
@mpiannucci mpiannucci deleted the zarr-utils branch April 12, 2024 13:00