Skip to content
This repository has been archived by the owner on Nov 12, 2024. It is now read-only.

Write virtual references to Icechunk #1

Closed
wants to merge 43 commits into from
Closed

Write virtual references to Icechunk #1

wants to merge 43 commits into from

Conversation

TomNicholas
Copy link

I think this is vaguely along the right lines? But I'm getting a bit confused by whether I'm supposed to manually create the zarr Arrays and Groups. (Also I'm completely unfamiliar with async code so not sure I'm doing that right either.)

@TomNicholas TomNicholas added enhancement New feature or request help wanted Extra attention is needed labels Sep 27, 2024
@rabernat
Copy link

This raises a good question: do we want to expose a sync interface for setting virtual refs? Otherwise virtualizarr will either have to

  • Go full async
  • Make its own sync / async bridge

@TomNicholas
Copy link
Author

TomNicholas commented Sep 28, 2024

virtualizarr will either have to

  • Go full async

Why would virtualizarr want to provide an async interface to users (which is presumably what you mean by "go full async")? The whole point of virtualizarr is to make manipulation of virtual references to datasets as simple as manipulation of in-memory data, by copying/re-using xarray's API. Xarray's API is not async, so virtualizarr's should not be either.

do we want to expose a sync interface for setting virtual refs?

You don't need to I don't think, because virtualizarr can reasonably use the async interface of icechunk to set virtual references for many groups/variables concurrently.

Make its own sync / async bridge

If I understand async/await correctly I would achieve the above by using asyncio.run in virtualizarr's to_icechunk, where I'm running an async function that awaits the setting of virtual references for each variable (i.e. it awaits icechunk's async set_virtual_refs).

@rabernat
Copy link

If I understand async/await correctly I would achieve the above by using asyncio.run in virtualizarr's to_icechunk, where I'm running an async function that awaits the setting of virtual references for each variable (i.e. it awaits icechunk's async set_virtual_refs).

Welcome to the async rabbit hole! 🐰 That sounds reasonable, but it it will not work if your user happens to be calling VirtualiZarr from within an existing async event loop. Like in a jupyter notebook. 🙃

This is why both Zarr and fsspec have this gnarly code to allow you to safely call async functions: https://github.com/zarr-developers/zarr-python/blob/v3/src/zarr/core/sync.py

virtualizarr/writers/icechunk.py Outdated Show resolved Hide resolved
store,
group,
):
await asyncio.gather(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the right approach to me. its work stealing not parallel so remember that to keep perf expecations in check

virtualizarr/writers/icechunk.py Outdated Show resolved Hide resolved
],
op_flags=[["readonly"]] * 3,
)
for path, offset, length in it:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works, but it will set them in serial. You can do this in parallel, creating the tasks with asyncio.ensure_future instead of calling await. Then you just gather them as you do above into a single future to return out to await later, or you can await the gathered future in the function, depending on the level of concurrency you desire

Copy link
Author

@TomNicholas TomNicholas Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, but wouldn't it be better for this iteration to just happen on the icechunk end instead? That would both simplify this code and presumably be more efficient overall.

e.g. can icechunk expose a method for setting the virtual refs of an entire array at once like

async def set_array_as_virtual_refs(
    self,
    key_prefix: str,
    paths: np.ndarray[Any, np.dtypes.StringDType],
    offsets: np.ndarray[Any, np.dtype[np.uint64]],
    lengths: np.ndarray[Any, np.dtype[np.uint64]],
):
    ...

then do the loop over the chunks in rust? Or do you think that's departing from the zarr-like abstraction that icechunk presents?

Copy link

@mpiannucci mpiannucci Oct 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a technical level supporting this on the rust side would be fast, but i am worried about a little about departure from being a Zarr store first and foremost and leaking the abstraction. This would most likely require a new dependency on the numpy rust bindings to be efficient enough to make a difference.

@rabernat also mentioned the possibility of adding another package specifically to make icechunk/virtualizarr more efficient which is possible as another option if we dont want to put it in main python bindings.

I think in the short term, lets focus on getting it to work as is and leave this as future optimization after we have some idea of the real world performance? Adding bulk will not be as hard as rounding out all the initial support things IMO

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a little worried about departure from being a Zarr store

require a new dependency on the numpy rust bindings

This is reasonable, but the alternative requires me to iterate in python over many millions of elements of numpy arrays, and send every single one off to icechunk as a separate async request. That seems unnecessary gymnastics when we already have all the elements arranged very neatly in-memory.

another package specifically to make icechunk/virtualizarr more efficient

Seems kinda over-complicated, but could definitely solve the problem.

I think in the short term, lets focus on getting it to work as is and leave this as future optimization after we have some idea of the real world performance? Adding bulk will not be as hard as rounding out all the initial support things IMO

Agree that getting it to work correctly with a nice user API is the priority for now, and we can worry about this again after measuring performance.

Copy link
Author

@TomNicholas TomNicholas Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea to think about would be if virtualizarr's implementation of the chunk manifest actually used icechunk's rust implementation

zarr-developers/VirtualiZarr#23

This would be a stronger argument for a separate package IMO - a rust crate implementing the Manifest class that both icechunk and virtualizarr depended on, and could be used to exchange references efficiently between the two libraries.

cc @rabernat

@TomNicholas
Copy link
Author

this gnarly code

Christ - I already regret async 🤣

Like in a jupyter notebook.

I'll worry about getting correct behaviour outside of a notebook first. Presumably because virtualizarr depends on zarr anyway I can just import that sync function to use inside dataset_to_icechunk.

@TomNicholas
Copy link
Author

TomNicholas commented Oct 1, 2024

This PR doesn't work yet - when I run it locally the first test passes (as that one doesn't actually check the data stored at the location the virtual reference points to) but the second fails with

FAILED virtualizarr/tests/test_writers/test_icechunk.py::TestWriteVirtualRefs::test_set_single_virtual_ref - AssertionError: 
Arrays are not equal

Mismatched elements: 3869000 / 3869000 (100%)
Max absolute difference among violations: 317.4
Max relative difference among violations: 1.
 ACTUAL: array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],...
 DESIRED: array([[[241.2 , 242.5 , 243.5 , ..., 232.8 , 235.5 , 238.6 ],
        [243.8 , 244.5 , 244.7 , ..., 232.8 , 235.3 , 239.3 ],
        [250.  , 249.8 , 248.89, ..., 233.2 , 236.39, 241.7 ],...

Looks like it's just giving me the fill_value, but I'm unclear if this is my fault or icechunk's fault.

@mpiannucci
Copy link

mpiannucci commented Oct 2, 2024

How much should be shared between the icechunk writer and the zarr v3 writer? icechunk is just a zarr v3 store with the added set_virtual_refs function. So should they just be the same for all of the metadata?

@TomNicholas
Copy link
Author

TomNicholas commented Oct 2, 2024

icechunk is just a zarr v3 store

In terms of writing they are not represented the same on disk though are they? So it's technically a v3 store that additionally follows the icechunk spec.

Having said that the zarr v3 writer was just an experiment based on Joe's proposed json chunk manifest format in zarr-developers/zarr-specs#287. I think icechunk completely supercedes this existing virtualize.to_zarr writer.

No-one should be using that code right now because there is no way to read data via references that were written that way (see https://virtualizarr.readthedocs.io/en/latest/usage.html#writing-as-zarr), so there are no real backwards-compatibility concerns. (Technically someone like Raphael might be using this as a serialization format for virtual references, but we shouldn't worry about that.)

So for the API going forward, we could just remove .virtualize.to_zarr in favour of a public .virtualize.to_icechunk API. Or we could repurpose .to_zarr to use dataset_to_icechunk internally. I prefer the latter, for the following reason:

How much should be shared

I think therefore the real question here is how much should be shared between ds.virtualizarr.to_zarr and xarray's ds.to_zarr? For a dataset containing only "loadable variables" being written to an icechunk store, these are exactly the same thing. It's only for virtual variables that they are different. See also earth-mover/icechunk#104 (comment)

@mpiannucci
Copy link

Great answer, thanks thats super helpful to read.

I think therefore the real question here is how much should be shared between ds.virtualizarr.to_zarr and xarray's ds.to_zarr? For a dataset containing only "loadable variables" being written to an icechunk store, these are exactly the same thing. It's only for virtual variables that they are different

This is exactly what i had in mind while filling out the metadata for the virtual icechunk variables

@TomNicholas
Copy link
Author

This is exactly what i had in mind while filling out the metadata for the virtual icechunk variables

Yeah, it's frustrating that we can't test with xarray yet, because that would make this correspondence a lot clearer.

@TomNicholas
Copy link
Author

TomNicholas commented Oct 2, 2024

Also we could literally import some of the internals of xarray's to_zarr backend here... We already import some other semi-private xarray internals, and there is an eventual path to doing this using only public xarray functions (which would be for xarray's backend entrypoint system to support configurable writers).

@mpiannucci
Copy link

mpiannucci commented Oct 3, 2024

Also we could literally import some of the internals of xarray's to_zarr backend here

I think I am going to take a stab at doing this with tom a's branch: pydata/xarray#9552 . It should be as simple is as reusing the encoding logic and then calling set virtual ref

@TomNicholas
Copy link
Author

@mpiannucci the example in earth-mover/icechunk#197 is awesome!

I'm wondering how to "bank" the progress here and split off future work. We don't need to make loadable variables work in this PR, which allows us to punt on this Q

I think therefore the real question here is how much should be shared between ds.virtualizarr.to_zarr and xarray's ds.to_zarr?

We do want to have virtualizarr work with zarr v3 in general, but our tests can't pass (in their current form) without either kerchunk working with zarr v3 or maybe a non-kerchunk way to generate references (zarr-developers/VirtualiZarr#87).

@mpiannucci
Copy link

I'm wondering how to "bank" the progress here and split off future work

Ok I have been thinking a lot about this. I think the easiest way for us to talk through this is to list out what depends on what:

  1. Virtualizarr needs kerchunk to create manifest arrays from existsing data
  2. Kerchunk currently only fully works with zarr python 2. My v3 branch works with hdf files only currently.
  3. With this PR VirtualiZarr can write references to icechunk, succesfully. The one caveat is that numcodecs that v2 expects are not yet supported on the main branches, and need a special numcodecs branch installed, which is subject to change.

Icechunk is isolated but depends on zarr 3 to work. We can check the zarr version at import time and maybe get it into main that way, ahead of kerchunk being finished. Or we can wait for the whole chain of dependencies to be ready. There are a lot of moving parts but we have to start somewhere syncing things up

@TomNicholas
Copy link
Author

Virtualizarr needs kerchunk to create manifest arrays from existsing data

It needs kerchunk unless we use @sharkinsspatial's non-kerchunk hdf5 reader instead - kerchunk should really be an optional dependency for virtualizarr. But that path would only be quicker if it's a real pain to get kerchunk to work with v3.

@mpiannucci
Copy link

mpiannucci commented Oct 14, 2024

The only real blocker for kerchunk and v3 is the codecs (numcodecs are not available by default now in zarr 3 + kerchunks grib codec needs to be updated. Everything else i can brute force through.

@TomNicholas
Copy link
Author

Do we know that VirtualiZarr actually works with Zarr v3? I think we import some small utility functions.

@mpiannucci
Copy link

mpiannucci commented Oct 14, 2024

I have not tested the zarr to zarr functionality, it might not work yet. It does work with zarr 3 + icechunk tho

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants