Extend refspec support to [path] entries (without offset/length) #187

maresb · 2024-07-15T13:18:50Z

I'm trying to open a Zarr v2. I couldn't find a good obvious way to do this, so my attempt was:

from kerchunk.zarr import ZarrToZarr
from virtualizarr.xarray import dataset_from_kerchunk_refs

ref = ZarrToZarr(url).translate()
ds = dataset_from_kerchunk_refs(ref)

This fails here in unpacking because kerchunk produces reference entries of the form [path] instead of [path, length, offset].

I was wondering if I could work around this by stating the chunks, and sure enough it worked! (See my second commit.)

I'd love to get some feedback on this, including if there's a better way to open v2 zarrs.

In preparation for my second commit I needed to clean up some types. To make things more precise and clean, I tell a minor lie: that the [path, offset, length] lists are tuples. This was done in the spirit of duck typing, and I'd argue is more semantically accurate. (Not to mention that tuples serialize to lists in JSON.)

TomNicholas

ref = ZarrToZarr(url).translate()

When I saw this line I thought your PR was going to add another possible filetype to open_virtual_dataset that called kerchunk.zarr.ZarrToZarr. My response was going to be that I actually don't want to add any further dependency on kerchunk because we're trying to depend on kerchunk as little as possible (see #87).

But it seems your PR doesn't do that yet, instead it adds a useful little workaround that I didn't know was possible! Amazing.

To support reading Zarr v2 with open_virtual_dataset I would prefer that (in a follow-up PR) we wrote code to manually parse the contents of a zarr v2 store ourselves, using the new ManifestArray.from_array constructor for maximum efficiency.

TomNicholas · 2024-07-16T06:01:06Z

virtualizarr/manifests/manifest.py

+            path = path_and_byte_range_info[0]
+            offset = 0
+            length = UPath(path).stat().st_size


This trick might be really useful in general, because it seems kerchunk regularly returns a single string path instead of a tuple here. See for example #186 (comment)

TomNicholas · 2024-07-16T06:14:35Z

virtualizarr/manifests/manifest.py

+    def from_kerchunk(
+        cls, path_and_byte_range_info: tuple[str] | tuple[str, int, int]
+    ) -> "ChunkEntry":
+        if len(path_and_byte_range_info) == 1:


I think this could be an isinstance(..., str) check instead perhaps? Kerchunk's docs on its specification indicate that it should return either a bare str or a tuple[str, int, int]. We can't predict when kerchunk will randomly decide to return the former, but we should be as explicit as possible when checking to see if that is what has happened. (I'm worried that there may be other situations in which kerchunk returns a length-1 something that isn't actually a string...)

Ooh, excellent catch! Kerchunk doesn't like to inline Zarr variable data, effectively masking the case you point out.

I just lazily turned it into a NotImplementedError for now since it would require a bit more thought. (Probably best to follow up in a subsequent PR.)

Sorry, when does the inline case come up here exactly? I feel like it is also relevant for #119?

Here's a simple reproducer:

import kerchunk.zarr import xarray as xr ds = xr.Dataset({}, coords={"time": [0, 1, 2]}) ds.to_zarr("/tmp/ds.zarr", mode="w") ref = kerchunk.zarr.ZarrToZarr("/tmp/ds.zarr").translate() ref["refs"]["time/0"] # ['file:///tmp/ds.zarr/time/0']

I think it's a kerchunk bug that Zarrs are never inlined, even when inline_threshold is set.

I think it's a kerchunk bug that Zarrs are never inlined, even when inline_threshold is set.

That doesn't surprise me - we already had to deal with fsspec/kerchunk#445.

But we actually don't use kerchunk's inlining feature at all in virtualizarr - instead if we want eager access to some variable we just load it as a numpy array-backed xarray variable instead (
https://virtualizarr.readthedocs.io/en/latest/usage.html#loading-variables). So we just set inline_threshold=0 whenever we call any kerchunk reader internally.

But I'm a little confused - that's all separate from how kerchunk will sometimes return a str instead of tuple[str, int, int]. Your example ref["refs"]["time/0"] # ['file:///tmp/ds.zarr/time/0'] is not an example of inlining, it's just kerchunk using a different way to implicitly represent the byte range reference to an entire chunk.

Also as I said above I don't think we should call kerchunk.zarr.ZarrToZarr inside virtualizarr.open_virtual_dataset, we should rewrite that function ourselves (but not in this PR).

Oh, sorry, I gave you an example for the wrong case.

I don't actually have any particular case in mind where data is inlined. I was assuming that inlined data is possible because kerchunk does it. Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.

If the only intended use of kerchunk is internal, should I remove that check? Or at least reword the text so that it's not a TODO?

Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.

Currently the only intended use case of kerchunk is internal, but this case:

Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.

we would like to support, which is being (intermittently) worked on in #119 and recently #186.

I think that the minimal version of this PR does not need to handle kerchunk-inlined data though, and we should merge this useful PR before worrying about that.

TomNicholas · 2024-07-16T06:15:47Z

virtualizarr/manifests/manifest.py

@@ -285,7 +299,7 @@ def to_zarr_json(self, filepath: str) -> None:
    @classmethod
    def _from_kerchunk_chunk_dict(cls, kerchunk_chunk_dict) -> "ChunkManifest":
        chunkentries = {
-            cast(ChunkKey, k): ChunkEntry.from_kerchunk(v).dict()
+            cast(ChunkKey, k): ChunkEntry.from_kerchunk(tuple(v)).dict()


I think if you don't add this tuple() call then my suggestion of using isinstance(..., str) above will work.

Ya, with my last commit the validation takes care of this.

TomNicholas · 2024-07-16T06:18:07Z

virtualizarr/manifests/manifest.py

+class ChunkDictEntry(TypedDict):
+    path: str
+    offset: int
+    length: int
+
+
+ChunkDict = NewType("ChunkDict", dict[ChunkKey, ChunkDictEntry])


This is nice and probably what I should have used in the first place if I was better with python typing 😅

I wonder if we should do a similar change for any of the other types in the codebase...

maresb · 2024-07-16T07:17:33Z

@TomNicholas, I think I'm ready for another round of review, whenever you are.

TomNicholas · 2024-07-21T04:23:22Z

virtualizarr/manifests/manifest.py

-        return ChunkManifest(entries=cast(ChunkDict, chunkentries))
+    def _from_kerchunk_chunk_dict(
+        cls,
+        kerchunk_chunk_dict: Dict[ChunkKey, str | tuple[str] | tuple[str, int, int]],


Suggested change

kerchunk_chunk_dict: Dict[ChunkKey, str | tuple[str] | tuple[str, int, int]],

kerchunk_chunk_dict: dict[ChunkKey, str | tuple[str] | tuple[str, int, int]],

Pretty sure in a recent enough version of python you can just use the built-in dict for typing, same as how you're using tuple.

I agree that this looks like a mistake, but strangely it's necessary. I added an explanatory comment to prevent others from attempting this obvious change.

TomNicholas · 2024-07-21T04:30:05Z

virtualizarr/manifests/manifest.py

+        for k, v in kerchunk_chunk_dict.items():
+            if isinstance(v, (str, bytes)):
+                raise NotImplementedError("TODO: handle inlined data")
+            elif not isinstance(v, (tuple, list)):


Isn't this incorrect typing? You allow v to be a list, then pass the list into a method which only accepts tuples (i.e. ChunkEntry.from_kerchunk()).

I'm pulling some really sketchy tricks with typing here.

You allow v to be a list

Actually, according to my type hint for kerchunk_chunk_dict, v should be a str or tuple but never a list. Thus checking if v is a list should be vacuous but not technically incorrect.

My rationale is that kerchunk can and probably will give us weird types, and we should be robust with handling them at runtime. While ChunkEntry.from_kerchunk expects tuples it can still handle lists if need be, even though I don't tell the typing system. Thus I simply don't want to fail in case a list makes its way to this point.

So basically what I'm saying is, "I want to raise a type error if I don't get a tuple, but if I happen to get a list I'll continue, pretending as if it were a tuple."

Do you find this an acceptable strategy, or would you like to change it?

Ideally we'd add types to kerchunk, but I don't have time for that right now.

Okay! So it's intentional.

I think this is fine - it still deals fine with the types the hints claim it will, plus you're saying that it will also silently succeed on some other types too.

Ideally we'd add types to kerchunk, but I don't have time for that right now.

Yep. In the long term the plan is to re-implement kerchunk's readers without going via the intermediate kerchunk reference format, at which point we will hopefully be able to remove all the code in kerchunk.py anyway.

TomNicholas · 2024-07-21T04:30:59Z

Sorry for the slow review @maresb ! Just some very small comments here.

maresb · 2024-07-21T11:41:41Z

Sorry for the slow review @maresb ! Just some very small comments here.

No worries, no rush, and thanks so much for maintaining!!!

maresb · 2024-07-21T19:21:23Z

I think I'm all caught up with your comments.

I don't think you're asking for a change regarding the NotImplementedError on inlined data, but I'm not certain about this.

Please let me know if you'd like any further changes, or if you'd like me to rebase.

Thanks @TomNicholas!

TomNicholas · 2024-07-22T06:09:14Z

I think this looks good, thank you @maresb !

maresb added 2 commits July 15, 2024 15:00

Improve typing of ChunkEntry

63cb722

Handle kerchunk [path] with no offset/length

606a465

maresb temporarily deployed to test-release July 16, 2024 06:01 — with GitHub Actions Inactive

TomNicholas reviewed Jul 16, 2024

View reviewed changes

TomNicholas added the Kerchunk Relating to the kerchunk library / specification itself label Jul 16, 2024

Raise NotImplementedError on inlined data

746ef50

maresb mentioned this pull request Jul 16, 2024

Open kerchunk ref as virtual dataset, only json (from PR 119) #186

Open

6 tasks

maresb temporarily deployed to test-release July 21, 2024 04:18 — with GitHub Actions Inactive

TomNicholas reviewed Jul 21, 2024

View reviewed changes

Explain the need for Dict type hint

22e32f7

maresb temporarily deployed to test-release July 21, 2024 11:28 — with GitHub Actions Inactive

TomNicholas merged commit 0ad4de5 into zarr-developers:main Jul 22, 2024
7 checks passed

maresb deleted the improve-refspec branch July 27, 2024 18:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend refspec support to [path] entries (without offset/length) #187

Extend refspec support to [path] entries (without offset/length) #187

maresb commented Jul 15, 2024

TomNicholas left a comment

TomNicholas Jul 16, 2024

TomNicholas Jul 16, 2024

maresb Jul 16, 2024

TomNicholas Jul 21, 2024

maresb Jul 21, 2024

TomNicholas Jul 21, 2024

maresb Jul 21, 2024

TomNicholas Jul 21, 2024

TomNicholas Jul 16, 2024

maresb Jul 16, 2024

TomNicholas Jul 16, 2024

TomNicholas Jul 16, 2024

maresb commented Jul 16, 2024

TomNicholas Jul 21, 2024

maresb Jul 21, 2024

TomNicholas Jul 21, 2024 •

edited

Loading

maresb Jul 21, 2024 •

edited

Loading

TomNicholas Jul 21, 2024

TomNicholas commented Jul 21, 2024

maresb commented Jul 21, 2024

maresb commented Jul 21, 2024

TomNicholas commented Jul 22, 2024

	kerchunk_chunk_dict: Dict[ChunkKey, str \| tuple[str] \| tuple[str, int, int]],
	kerchunk_chunk_dict: dict[ChunkKey, str \| tuple[str] \| tuple[str, int, int]],

Extend refspec support to [path] entries (without offset/length) #187

Extend refspec support to [path] entries (without offset/length) #187

Conversation

maresb commented Jul 15, 2024

TomNicholas left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

maresb commented Jul 16, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomNicholas Jul 21, 2024 • edited Loading

Choose a reason for hiding this comment

maresb Jul 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

TomNicholas commented Jul 21, 2024

maresb commented Jul 21, 2024

maresb commented Jul 21, 2024

TomNicholas commented Jul 22, 2024

TomNicholas Jul 21, 2024 •

edited

Loading

maresb Jul 21, 2024 •

edited

Loading