Write virtual references to Icechunk #1

TomNicholas · 2024-09-27T23:48:39Z

I think this is vaguely along the right lines? But I'm getting a bit confused by whether I'm supposed to manually create the zarr Arrays and Groups. (Also I'm completely unfamiliar with async code so not sure I'm doing that right either.)

rabernat · 2024-09-28T14:40:07Z

This raises a good question: do we want to expose a sync interface for setting virtual refs? Otherwise virtualizarr will either have to

Go full async
Make its own sync / async bridge

TomNicholas · 2024-09-28T15:46:37Z

virtualizarr will either have to

Go full async

Why would virtualizarr want to provide an async interface to users (which is presumably what you mean by "go full async")? The whole point of virtualizarr is to make manipulation of virtual references to datasets as simple as manipulation of in-memory data, by copying/re-using xarray's API. Xarray's API is not async, so virtualizarr's should not be either.

do we want to expose a sync interface for setting virtual refs?

You don't need to I don't think, because virtualizarr can reasonably use the async interface of icechunk to set virtual references for many groups/variables concurrently.

Make its own sync / async bridge

If I understand async/await correctly I would achieve the above by using asyncio.run in virtualizarr's to_icechunk, where I'm running an async function that awaits the setting of virtual references for each variable (i.e. it awaits icechunk's async set_virtual_refs).
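A minimal sketch of the pattern described in this comment, assuming a plain dict as the store. The helper `write_variable_refs` is a hypothetical stand-in for awaiting icechunk's async call, not icechunk's actual API, and the chunk-key strings are purely illustrative.

```python
import asyncio


async def write_variable_refs(store: dict, name: str, refs: list) -> None:
    # Hypothetical stand-in: a real implementation would await the store's
    # async reference-setting call here instead of writing to a dict.
    for i, ref in enumerate(refs):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        store[f"{name}/c/{i}"] = ref


async def write_all_refs(store: dict, variables: dict) -> None:
    # One coroutine per variable, all awaited concurrently.
    await asyncio.gather(
        *(write_variable_refs(store, name, refs) for name, refs in variables.items())
    )


def to_icechunk_sketch(store: dict, variables: dict) -> None:
    # Synchronous entry point. asyncio.run raises RuntimeError if an event
    # loop is already running in the calling thread.
    asyncio.run(write_all_refs(store, variables))
```

The public function stays synchronous, matching xarray-style usage, while the per-variable work is still awaited concurrently inside it.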

rabernat · 2024-09-30T19:16:16Z

If I understand async/await correctly I would achieve the above by using asyncio.run in virtualizarr's to_icechunk, where I'm running an async function that awaits the setting of virtual references for each variable (i.e. it awaits icechunk's async set_virtual_refs).

Welcome to the async rabbit hole! 🐰 That sounds reasonable, but it will not work if your user happens to be calling VirtualiZarr from within an existing async event loop. Like in a jupyter notebook. 🙃

This is why both Zarr and fsspec have this gnarly code to allow you to safely call async functions: https://github.com/zarr-developers/zarr-python/blob/v3/src/zarr/core/sync.py
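One common way to build such a sync/async bridge is to keep a dedicated event loop running on a background thread and submit coroutines to it. The sketch below illustrates that general idea only; it is not a copy of the linked module, and the names `run_sync` and `_get_loop` are made up for this example.

```python
import asyncio
import threading

_loop = None
_lock = threading.Lock()


def _get_loop() -> asyncio.AbstractEventLoop:
    # Lazily start one event loop on a daemon thread and reuse it afterwards.
    global _loop
    with _lock:
        if _loop is None:
            _loop = asyncio.new_event_loop()
            threading.Thread(target=_loop.run_forever, daemon=True).start()
    return _loop


def run_sync(coro):
    # Submit the coroutine to the background loop and block until it finishes.
    # This works even if the calling thread already runs its own event loop,
    # but it must not be called from the background loop's own thread.
    future = asyncio.run_coroutine_threadsafe(coro, _get_loop())
    return future.result()


async def double(x: int) -> int:
    await asyncio.sleep(0)
    return 2 * x
```

Calling `run_sync(double(21))` from ordinary code, or from inside a coroutine that is itself running under another loop, returns the result without hitting the "event loop is already running" error that `asyncio.run` would raise in the second case.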

virtualizarr/writers/icechunk.py

mpiannucci · 2024-10-01T00:26:08Z

+    store,
+    group,
+):
+    await asyncio.gather(


Looks like the right approach to me. It's work stealing, not parallel, so remember that to keep perf expectations in check.

virtualizarr/writers/icechunk.py

mpiannucci · 2024-10-01T00:36:29Z

+        ],
+        op_flags=[["readonly"]] * 3,
+    )
+    for path, offset, length in it:


This works, but it will set them in serial. You can do this in parallel, creating the tasks with asyncio.ensure_future instead of calling await. Then you just gather them as you do above into a single future to return out to await later, or you can await the gathered future in the function, depending on the level of concurrency you desire
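The difference between the two approaches can be sketched as follows. `set_ref` is a hypothetical placeholder for a single async reference-setting call, with a short sleep standing in for I/O latency; it is not icechunk's API.

```python
import asyncio


async def set_ref(store: dict, key: str, path: str, offset: int, length: int) -> None:
    # Hypothetical placeholder for one async "set virtual ref" request.
    await asyncio.sleep(0.001)
    store[key] = (path, offset, length)


async def set_refs_serial(store: dict, refs: dict) -> None:
    # Each await finishes before the next request starts.
    for key, (path, offset, length) in refs.items():
        await set_ref(store, key, path, offset, length)


async def set_refs_concurrent(store: dict, refs: dict) -> None:
    # Schedule every request first, then wait for all of them together.
    tasks = [
        asyncio.ensure_future(set_ref(store, key, path, offset, length))
        for key, (path, offset, length) in refs.items()
    ]
    await asyncio.gather(*tasks)
```

Both versions leave the store in the same state; the concurrent one lets the waits overlap. Returning the gathered future instead of awaiting it would push the decision about when to wait up to the caller.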

Interesting, but wouldn't it be better for this iteration to just happen on the icechunk end instead? That would both simplify this code and presumably be more efficient overall.

e.g. can icechunk expose a method for setting the virtual refs of an entire array at once like

async def set_array_as_virtual_refs(
    self,
    key_prefix: str,
    paths: np.ndarray[Any, np.dtypes.StringDType],
    offsets: np.ndarray[Any, np.dtype[np.uint64]],
    lengths: np.ndarray[Any, np.dtype[np.uint64]],
): ...

then do the loop over the chunks in rust? Or do you think that's departing from the zarr-like abstraction that icechunk presents?

On a technical level, supporting this on the rust side would be fast, but I am a little worried about departure from being a Zarr store first and foremost, and about leaking the abstraction. This would most likely require a new dependency on the numpy rust bindings to be efficient enough to make a difference.

@rabernat also mentioned the possibility of adding another package specifically to make icechunk/virtualizarr more efficient, which is another option if we don't want to put it in the main python bindings.

I think in the short term, let's focus on getting it to work as is and leave this as a future optimization after we have some idea of the real-world performance? Adding bulk will not be as hard as rounding out all the initial support things IMO

I am a little worried about departure from being a Zarr store

require a new dependency on the numpy rust bindings

This is reasonable, but the alternative requires me to iterate in python over many millions of elements of numpy arrays, and send every single one off to icechunk as a separate async request. That seems unnecessary gymnastics when we already have all the elements arranged very neatly in-memory.
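For reference, the Python-side iteration being discussed looks roughly like the sketch below, which walks the three arrays in lockstep with `np.nditer` and yields one reference per chunk. The function name `iter_chunk_refs` is hypothetical, and the `c/0/1`-style key layout is an assumption made for illustration.

```python
import numpy as np


def iter_chunk_refs(paths, offsets, lengths, key_prefix):
    # Iterate over the three equally shaped arrays together, tracking the
    # N-dimensional index of the current chunk.
    it = np.nditer(
        [paths, offsets, lengths],
        flags=["multi_index"],
        op_flags=[["readonly"]] * 3,
    )
    for path, offset, length in it:
        # Assumed key layout: "<prefix>/c/<i>/<j>/..."
        index = "/".join(str(i) for i in it.multi_index)
        yield f"{key_prefix}/c/{index}", path.item(), offset.item(), length.item()
```

Every yielded tuple would then become a separate request to the store, which is the per-element overhead the comment above objects to.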

another package specifically to make icechunk/virtualizarr more efficient

Seems kinda over-complicated, but could definitely solve the problem.

I think in the short term, let's focus on getting it to work as is and leave this as a future optimization after we have some idea of the real-world performance? Adding bulk will not be as hard as rounding out all the initial support things IMO

Agree that getting it to work correctly with a nice user API is the priority for now, and we can worry about this again after measuring performance.

Another idea to think about would be if virtualizarr's implementation of the chunk manifest actually used icechunk's rust implementation

zarr-developers/VirtualiZarr#23

This would be a stronger argument for a separate package IMO - a rust crate implementing the Manifest class that both icechunk and virtualizarr depended on, and could be used to exchange references efficiently between the two libraries.

cc @rabernat

TomNicholas · 2024-10-01T04:21:07Z

this gnarly code

Christ - I already regret async 🤣

Like in a jupyter notebook.

I'll worry about getting correct behaviour outside of a notebook first. Presumably because virtualizarr depends on zarr anyway I can just import that sync function to use inside dataset_to_icechunk.

TomNicholas · 2024-10-01T04:21:10Z

This PR doesn't work yet - when I run it locally the first test passes (as that one doesn't actually check the data stored at the location the virtual reference points to) but the second fails with

FAILED virtualizarr/tests/test_writers/test_icechunk.py::TestWriteVirtualRefs::test_set_single_virtual_ref - AssertionError: 
Arrays are not equal

Mismatched elements: 3869000 / 3869000 (100%)
Max absolute difference among violations: 317.4
Max relative difference among violations: 1.
 ACTUAL: array([[[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],...
 DESIRED: array([[[241.2 , 242.5 , 243.5 , ..., 232.8 , 235.5 , 238.6 ],
        [243.8 , 244.5 , 244.7 , ..., 232.8 , 235.3 , 239.3 ],
        [250.  , 249.8 , 248.89, ..., 233.2 , 236.39, 241.7 ],...

Looks like it's just giving me the fill_value, but I'm unclear if this is my fault or icechunk's fault.

mpiannucci · 2024-10-02T18:25:16Z

How much should be shared between the icechunk writer and the zarr v3 writer? icechunk is just a zarr v3 store with the added set_virtual_refs function. So should they just be the same for all of the metadata?

TomNicholas · 2024-10-02T18:41:49Z

icechunk is just a zarr v3 store

In terms of writing they are not represented the same on disk though are they? So it's technically a v3 store that additionally follows the icechunk spec.

Having said that, the zarr v3 writer was just an experiment based on Joe's proposed json chunk manifest format in zarr-developers/zarr-specs#287. I think icechunk completely supersedes this existing .virtualize.to_zarr writer.

No-one should be using that code right now because there is no way to read data via references that were written that way (see https://virtualizarr.readthedocs.io/en/latest/usage.html#writing-as-zarr), so there are no real backwards-compatibility concerns. (Technically someone like Raphael might be using this as a serialization format for virtual references, but we shouldn't worry about that.)

So for the API going forward, we could just remove .virtualize.to_zarr in favour of a public .virtualize.to_icechunk API. Or we could repurpose .to_zarr to use dataset_to_icechunk internally. I prefer the latter, for the following reason:

How much should be shared

I think therefore the real question here is how much should be shared between ds.virtualizarr.to_zarr and xarray's ds.to_zarr? For a dataset containing only "loadable variables" being written to an icechunk store, these are exactly the same thing. It's only for virtual variables that they are different. See also earth-mover/icechunk#104 (comment)

mpiannucci · 2024-10-02T18:46:26Z

Great answer, thanks, that's super helpful to read.

I think therefore the real question here is how much should be shared between ds.virtualizarr.to_zarr and xarray's ds.to_zarr? For a dataset containing only "loadable variables" being written to an icechunk store, these are exactly the same thing. It's only for virtual variables that they are different

This is exactly what I had in mind while filling out the metadata for the virtual icechunk variables

TomNicholas · 2024-10-02T18:58:03Z

This is exactly what I had in mind while filling out the metadata for the virtual icechunk variables

Yeah, it's frustrating that we can't test with xarray yet, because that would make this correspondence a lot clearer.

TomNicholas · 2024-10-02T19:00:40Z

Also we could literally import some of the internals of xarray's to_zarr backend here... We already import some other semi-private xarray internals, and there is an eventual path to doing this using only public xarray functions (which would be for xarray's backend entrypoint system to support configurable writers).

mpiannucci · 2024-10-03T13:21:02Z

Also we could literally import some of the internals of xarray's to_zarr backend here

I think I am going to take a stab at doing this with Tom A's branch: pydata/xarray#9552. It should be as simple as reusing the encoding logic and then calling set virtual ref

Matt/icechunk encoding

TomNicholas · 2024-10-12T21:51:53Z

@mpiannucci the example in earth-mover/icechunk#197 is awesome!

I'm wondering how to "bank" the progress here and split off future work. We don't need to make loadable variables work in this PR, which allows us to punt on this Q

I think therefore the real question here is how much should be shared between ds.virtualizarr.to_zarr and xarray's ds.to_zarr?

We do want to have virtualizarr work with zarr v3 in general, but our tests can't pass (in their current form) without either kerchunk working with zarr v3 or maybe a non-kerchunk way to generate references (zarr-developers/VirtualiZarr#87).

mpiannucci · 2024-10-14T15:30:49Z

I'm wondering how to "bank" the progress here and split off future work

Ok I have been thinking a lot about this. I think the easiest way for us to talk through this is to list out what depends on what:

Virtualizarr needs kerchunk to create manifest arrays from existing data
Kerchunk currently only fully works with zarr python 2. My v3 branch works with hdf files only currently.
With this PR VirtualiZarr can write references to icechunk successfully. The one caveat is that the numcodecs that v2 expects are not yet supported on the main branches, and need a special numcodecs branch installed, which is subject to change.

Icechunk is isolated but depends on zarr 3 to work. We can check the zarr version at import time and maybe get it into main that way, ahead of kerchunk being finished. Or we can wait for the whole chain of dependencies to be ready. There are a lot of moving parts but we have to start somewhere syncing things up
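A minimal sketch of what checking the zarr version at import time could look like, using only the standard library. The helper names are hypothetical, and where the check would live (and what happens when it fails) is left open here.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string: str) -> int:
    # "3.0.0b1" -> 3, "2.18.3" -> 2
    return int(version_string.split(".")[0])


def has_zarr_v3() -> bool:
    # True only if zarr is installed and its major version is at least 3.
    try:
        return major_version(version("zarr")) >= 3
    except PackageNotFoundError:
        return False
```

The icechunk writer could then be imported or skipped depending on `has_zarr_v3()`, so the rest of the package keeps working with zarr-python 2.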

TomNicholas · 2024-10-14T15:38:32Z

Virtualizarr needs kerchunk to create manifest arrays from existing data

It needs kerchunk unless we use @sharkinsspatial's non-kerchunk hdf5 reader instead - kerchunk should really be an optional dependency for virtualizarr. But that path would only be quicker if it's a real pain to get kerchunk to work with v3.

mpiannucci · 2024-10-14T15:41:21Z

The only real blocker for kerchunk and v3 is the codecs (numcodecs are not available by default now in zarr 3, and kerchunk's grib codec needs to be updated). Everything else I can brute force through.

TomNicholas · 2024-10-14T15:44:47Z

Do we know that VirtualiZarr actually works with Zarr v3? I think we import some small utility functions.

mpiannucci · 2024-10-14T15:48:23Z

I have not tested the zarr to zarr functionality, it might not work yet. It does work with zarr 3 + icechunk tho

Fix v3 codec pipeline

TomNicholas added 8 commits September 27, 2024 17:49

move vds_with_manifest_arrays fixture up

7b00e41

sketch implementation

c82221c

test that we can create an icechunk store

d29362b

fixture to create icechunk filestore in temporary directory

2aa3cb5

get the async fixture working properly

f2c095c

split into more functions

6abe32d

change mode

93080b3

try creating zarr group and arrays explicitly

bebf370

TomNicholas added enhancement New feature or request help wanted Extra attention is needed labels Sep 27, 2024

TomNicholas added 9 commits September 28, 2024 13:38

create root group from store

833e5f0

todos

9853140

do away with the async pytest fixtures/functions

030a96e

successfully writes root group attrs

90ca5cf

check array metadata is correct

b138dde

try to write array attributes

6631102

sketch test for checking virtual references have been set correctly

e92b56c

test setting single virtual ref

2c8c0ee

use async properly

a2ce1ed

mpiannucci reviewed Oct 1, 2024

TomNicholas added 5 commits September 30, 2024 23:33

better separation of handling of loadable variables

9393995

fix chunk key format

956e324

use require_array

2d7d5f6

check that store supports writes

8726e23

removed outdated note about awaiting

387f345

more comprehensive

7e4e2ce

TomNicholas mentioned this pull request Oct 2, 2024

Use Case: [C]Worthy OAE dataset earth-mover/icechunk#119

Open

8 tasks

mpiannucci and others added 8 commits October 3, 2024 13:53

add attrtirbute encoding

9a03245

Merge pull request #2 from earth-mover/matt/icechunk-encoding

9676485

Matt/icechunk encoding

Fix array dimensions

bbaf3ba

Merge pull request #3 from earth-mover/matt/array-dims

31945cd

Fix v3 codec pipeline

49daa7e

Put xarray dep back

756ff92

Handle codecs, but get bad results

8c7242e

Gzip an d zlib are not directly working

666b676

mpiannucci mentioned this pull request Oct 12, 2024

Virtual Dataset Workflow Tracking Issue earth-mover/icechunk#197

Open

5 tasks

Get up working with numcodecs zarr 3 codecs

9076ad7

Update codec pipeline

7a160fd

TomNicholas mentioned this pull request Oct 15, 2024

[DOCS] Add Virtual Ref Documentation and tutorial earth-mover/icechunk#240

Merged

mpiannucci and others added 2 commits October 15, 2024 09:36

Merge pull request #4 from earth-mover/matt/v3-codecs

286a9b5

Fix v3 codec pipeline

oUdpate to latest icechunk using sync api

8f1f96e

mpiannucci mentioned this pull request Oct 15, 2024

Add Icechunk Support zarr-developers/VirtualiZarr#256

Merged

7 tasks

jhamman closed this Nov 12, 2024
