-
Notifications
You must be signed in to change notification settings - Fork 0
Conversation
This raises a good question: do we want to expose a sync interface for setting virtual refs? Otherwise virtualizarr will either have to
|
Why would virtualizarr want to provide an async interface to users (which is presumably what you mean by "go full async")? The whole point of virtualizarr is to make manipulation of virtual references to datasets as simple as manipulation of in-memory data, by copying/re-using xarray's API. Xarray's API is not async, so virtualizarr's should not be either.
You don't need to I don't think, because virtualizarr can reasonably use the async interface of icechunk to set virtual references for many groups/variables concurrently.
If I understand |
Welcome to the async rabbit hole! 🐰 That sounds reasonable, but it it will not work if your user happens to be calling VirtualiZarr from within an existing async event loop. Like in a jupyter notebook. 🙃 This is why both Zarr and fsspec have this gnarly code to allow you to safely call async functions: https://github.com/zarr-developers/zarr-python/blob/v3/src/zarr/core/sync.py |
virtualizarr/writers/icechunk.py
Outdated
store, | ||
group, | ||
): | ||
await asyncio.gather( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the right approach to me. its work stealing not parallel so remember that to keep perf expecations in check
], | ||
op_flags=[["readonly"]] * 3, | ||
) | ||
for path, offset, length in it: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works, but it will set them in serial. You can do this in parallel, creating the tasks with asyncio.ensure_future
instead of calling await
. Then you just gather them as you do above into a single future to return out to await later, or you can await the gathered future in the function, depending on the level of concurrency you desire
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, but wouldn't it be better for this iteration to just happen on the icechunk end instead? That would both simplify this code and presumably be more efficient overall.
e.g. can icechunk expose a method for setting the virtual refs of an entire array at once like
async def set_array_as_virtual_refs(
self,
key_prefix: str,
paths: np.ndarray[Any, np.dtypes.StringDType],
offsets: np.ndarray[Any, np.dtype[np.uint64]],
lengths: np.ndarray[Any, np.dtype[np.uint64]],
):
...
then do the loop over the chunks in rust? Or do you think that's departing from the zarr-like abstraction that icechunk presents?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a technical level supporting this on the rust side would be fast, but i am worried about a little about departure from being a Zarr store first and foremost and leaking the abstraction. This would most likely require a new dependency on the numpy rust bindings to be efficient enough to make a difference.
@rabernat also mentioned the possibility of adding another package specifically to make icechunk/virtualizarr more efficient which is possible as another option if we dont want to put it in main python bindings.
I think in the short term, lets focus on getting it to work as is and leave this as future optimization after we have some idea of the real world performance? Adding bulk will not be as hard as rounding out all the initial support things IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a little worried about departure from being a Zarr store
require a new dependency on the numpy rust bindings
This is reasonable, but the alternative requires me to iterate in python over many millions of elements of numpy arrays, and send every single one off to icechunk as a separate async request. That seems unnecessary gymnastics when we already have all the elements arranged very neatly in-memory.
another package specifically to make icechunk/virtualizarr more efficient
Seems kinda over-complicated, but could definitely solve the problem.
I think in the short term, lets focus on getting it to work as is and leave this as future optimization after we have some idea of the real world performance? Adding bulk will not be as hard as rounding out all the initial support things IMO
Agree that getting it to work correctly with a nice user API is the priority for now, and we can worry about this again after measuring performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another idea to think about would be if virtualizarr's implementation of the chunk manifest actually used icechunk's rust implementation
zarr-developers/VirtualiZarr#23
This would be a stronger argument for a separate package IMO - a rust crate implementing the Manifest
class that both icechunk
and virtualizarr
depended on, and could be used to exchange references efficiently between the two libraries.
cc @rabernat
Christ - I already regret
I'll worry about getting correct behaviour outside of a notebook first. Presumably because virtualizarr depends on zarr anyway I can just import that |
This PR doesn't work yet - when I run it locally the first test passes (as that one doesn't actually check the data stored at the location the virtual reference points to) but the second fails with FAILED virtualizarr/tests/test_writers/test_icechunk.py::TestWriteVirtualRefs::test_set_single_virtual_ref - AssertionError:
Arrays are not equal
Mismatched elements: 3869000 / 3869000 (100%)
Max absolute difference among violations: 317.4
Max relative difference among violations: 1.
ACTUAL: array([[[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],...
DESIRED: array([[[241.2 , 242.5 , 243.5 , ..., 232.8 , 235.5 , 238.6 ],
[243.8 , 244.5 , 244.7 , ..., 232.8 , 235.3 , 239.3 ],
[250. , 249.8 , 248.89, ..., 233.2 , 236.39, 241.7 ],... Looks like it's just giving me the |
How much should be shared between the icechunk writer and the zarr v3 writer? icechunk is just a zarr v3 store with the added set_virtual_refs function. So should they just be the same for all of the metadata? |
In terms of writing they are not represented the same on disk though are they? So it's technically a v3 store that additionally follows the icechunk spec. Having said that the zarr v3 writer was just an experiment based on Joe's proposed json chunk manifest format in zarr-developers/zarr-specs#287. I think icechunk completely supercedes this existing No-one should be using that code right now because there is no way to read data via references that were written that way (see https://virtualizarr.readthedocs.io/en/latest/usage.html#writing-as-zarr), so there are no real backwards-compatibility concerns. (Technically someone like Raphael might be using this as a serialization format for virtual references, but we shouldn't worry about that.) So for the API going forward, we could just remove
I think therefore the real question here is how much should be shared between |
Great answer, thanks thats super helpful to read.
This is exactly what i had in mind while filling out the metadata for the virtual icechunk variables |
Yeah, it's frustrating that we can't test with xarray yet, because that would make this correspondence a lot clearer. |
Also we could literally import some of the internals of xarray's |
I think I am going to take a stab at doing this with tom a's branch: pydata/xarray#9552 . It should be as simple is as reusing the encoding logic and then calling set virtual ref |
Matt/icechunk encoding
@mpiannucci the example in earth-mover/icechunk#197 is awesome! I'm wondering how to "bank" the progress here and split off future work. We don't need to make loadable variables work in this PR, which allows us to punt on this Q
We do want to have virtualizarr work with zarr v3 in general, but our tests can't pass (in their current form) without either kerchunk working with zarr v3 or maybe a non-kerchunk way to generate references (zarr-developers/VirtualiZarr#87). |
Ok I have been thinking a lot about this. I think the easiest way for us to talk through this is to list out what depends on what:
Icechunk is isolated but depends on zarr 3 to work. We can check the zarr version at import time and maybe get it into main that way, ahead of kerchunk being finished. Or we can wait for the whole chain of dependencies to be ready. There are a lot of moving parts but we have to start somewhere syncing things up |
It needs kerchunk unless we use @sharkinsspatial's non-kerchunk hdf5 reader instead - kerchunk should really be an optional dependency for virtualizarr. But that path would only be quicker if it's a real pain to get kerchunk to work with v3. |
The only real blocker for kerchunk and v3 is the codecs (numcodecs are not available by default now in zarr 3 + kerchunks grib codec needs to be updated. Everything else i can brute force through. |
Do we know that VirtualiZarr actually works with Zarr v3? I think we import some small utility functions. |
I have not tested the zarr to zarr functionality, it might not work yet. It does work with zarr 3 + icechunk tho |
Fix v3 codec pipeline
I think this is vaguely along the right lines? But I'm getting a bit confused by whether I'm supposed to manually create the zarr Arrays and Groups. (Also I'm completely unfamiliar with
async
code so not sure I'm doing that right either.)