Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend refspec support to [path] entries (without offset/length) #187

Merged
merged 4 commits into from
Jul 22, 2024

Conversation

maresb
Copy link
Contributor

@maresb maresb commented Jul 15, 2024

I'm trying to open a Zarr v2. I couldn't find a good obvious way to do this, so my attempt was:

from kerchunk.zarr import ZarrToZarr
from virtualizarr.xarray import dataset_from_kerchunk_refs

ref = ZarrToZarr(url).translate()
ds = dataset_from_kerchunk_refs(ref)

This fails here in unpacking because kerchunk produces reference entries of the form [path] instead of [path, length, offset].

I was wondering if I could work around this by stating the chunks, and sure enough it worked! (See my second commit.)

I'd love to get some feedback on this, including if there's a better way to open v2 zarrs.

In preparation for my second commit I needed to clean up some types. To make things more precise and clean, I tell a minor lie: that the [path, offset, length] lists are tuples. This was done in the spirit of duck typing, and I'd argue is more semantically accurate. (Not to mention that tuples serialize to lists in JSON.)

  • Closes #xxxx
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

Copy link
Collaborator

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref = ZarrToZarr(url).translate()

When I saw this line I thought your PR was going to add another possible filetype to open_virtual_dataset that called kerchunk.zarr.ZarrToZarr. My response was going to be that I actually don't want to add any further dependency on kerchunk because we're trying to depend on kerchunk as little as possible (see #87).

But it seems your PR doesn't do that yet, instead it adds a useful little workaround that I didn't know was possible! Amazing.

To support reading Zarr v2 with open_virtual_dataset I would prefer that (in a follow-up PR) we wrote code to manually parse the contents of a zarr v2 store ourselves, using the new ManifestArray.from_array constructor for maximum efficiency.

Comment on lines +49 to +51
path = path_and_byte_range_info[0]
offset = 0
length = UPath(path).stat().st_size
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This trick might be really useful in general, because it seems kerchunk regularly returns a single string path instead of a tuple here. See for example #186 (comment)

def from_kerchunk(
cls, path_and_byte_range_info: tuple[str] | tuple[str, int, int]
) -> "ChunkEntry":
if len(path_and_byte_range_info) == 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be an isinstance(..., str) check instead perhaps? Kerchunk's docs on its specification indicate that it should return either a bare str or a tuple[str, int, int]. We can't predict when kerchunk will randomly decide to return the former, but we should be as explicit as possible when checking to see if that is what has happened. (I'm worried that there may be other situations in which kerchunk returns a length-1 something that isn't actually a string...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, excellent catch! Kerchunk doesn't like to inline Zarr variable data, effectively masking the case you point out.

I just lazily turned it into a NotImplementedError for now since it would require a bit more thought. (Probably best to follow up in a subsequent PR.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, when does the inline case come up here exactly? I feel like it is also relevant for #119?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a simple reproducer:

import kerchunk.zarr
import xarray as xr

ds = xr.Dataset({}, coords={"time": [0, 1, 2]})
ds.to_zarr("/tmp/ds.zarr", mode="w")
ref = kerchunk.zarr.ZarrToZarr("/tmp/ds.zarr").translate()
ref["refs"]["time/0"]  # ['file:///tmp/ds.zarr/time/0']

I think it's a kerchunk bug that Zarrs are never inlined, even when inline_threshold is set.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a kerchunk bug that Zarrs are never inlined, even when inline_threshold is set.

That doesn't surprise me - we already had to deal with fsspec/kerchunk#445.

But we actually don't use kerchunk's inlining feature at all in virtualizarr - instead if we want eager access to some variable we just load it as a numpy array-backed xarray variable instead (
https://virtualizarr.readthedocs.io/en/latest/usage.html#loading-variables). So we just set inline_threshold=0 whenever we call any kerchunk reader internally.

But I'm a little confused - that's all separate from how kerchunk will sometimes return a str instead of tuple[str, int, int]. Your example ref["refs"]["time/0"] # ['file:///tmp/ds.zarr/time/0'] is not an example of inlining, it's just kerchunk using a different way to implicitly represent the byte range reference to an entire chunk.

Also as I said above I don't think we should call kerchunk.zarr.ZarrToZarr inside virtualizarr.open_virtual_dataset, we should rewrite that function ourselves (but not in this PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I gave you an example for the wrong case.

I don't actually have any particular case in mind where data is inlined. I was assuming that inlined data is possible because kerchunk does it. Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.

If the only intended use of kerchunk is internal, should I remove that check? Or at least reword the text so that it's not a TODO?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.

Currently the only intended use case of kerchunk is internal, but this case:

Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.

we would like to support, which is being (intermittently) worked on in #119 and recently #186.

I think that the minimal version of this PR does not need to handle kerchunk-inlined data though, and we should merge this useful PR before worrying about that.

@@ -285,7 +299,7 @@ def to_zarr_json(self, filepath: str) -> None:
@classmethod
def _from_kerchunk_chunk_dict(cls, kerchunk_chunk_dict) -> "ChunkManifest":
chunkentries = {
cast(ChunkKey, k): ChunkEntry.from_kerchunk(v).dict()
cast(ChunkKey, k): ChunkEntry.from_kerchunk(tuple(v)).dict()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you don't add this tuple() call then my suggestion of using isinstance(..., str) above will work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, with my last commit the validation takes care of this.

Comment on lines +19 to +25
class ChunkDictEntry(TypedDict):
path: str
offset: int
length: int


ChunkDict = NewType("ChunkDict", dict[ChunkKey, ChunkDictEntry])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice and probably what I should have used in the first place if I was better with python typing 😅

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should do a similar change for any of the other types in the codebase...

@TomNicholas TomNicholas added the Kerchunk Relating to the kerchunk library / specification itself label Jul 16, 2024
@maresb
Copy link
Contributor Author

maresb commented Jul 16, 2024

@TomNicholas, I think I'm ready for another round of review, whenever you are.

return ChunkManifest(entries=cast(ChunkDict, chunkentries))
def _from_kerchunk_chunk_dict(
cls,
kerchunk_chunk_dict: Dict[ChunkKey, str | tuple[str] | tuple[str, int, int]],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kerchunk_chunk_dict: Dict[ChunkKey, str | tuple[str] | tuple[str, int, int]],
kerchunk_chunk_dict: dict[ChunkKey, str | tuple[str] | tuple[str, int, int]],

Pretty sure in a recent enough version of python you can just use the built-in dict for typing, same as how you're using tuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this looks like a mistake, but strangely it's necessary. I added an explanatory comment to prevent others from attempting this obvious change.

for k, v in kerchunk_chunk_dict.items():
if isinstance(v, (str, bytes)):
raise NotImplementedError("TODO: handle inlined data")
elif not isinstance(v, (tuple, list)):
Copy link
Collaborator

@TomNicholas TomNicholas Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this incorrect typing? You allow v to be a list, then pass the list into a method which only accepts tuples (i.e. ChunkEntry.from_kerchunk()).

Copy link
Contributor Author

@maresb maresb Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pulling some really sketchy tricks with typing here.

You allow v to be a list

Actually, according to my type hint for kerchunk_chunk_dict, v should be a str or tuple but never a list. Thus checking if v is a list should be vacuous but not technically incorrect.

My rationale is that kerchunk can and probably will give us weird types, and we should be robust with handling them at runtime. While ChunkEntry.from_kerchunk expects tuples it can still handle lists if need be, even though I don't tell the typing system. Thus I simply don't want to fail in case a list makes its way to this point.

So basically what I'm saying is, "I want to raise a type error if I don't get a tuple, but if I happen to get a list I'll continue, pretending as if it were a tuple."

Do you find this an acceptable strategy, or would you like to change it?

Ideally we'd add types to kerchunk, but I don't have time for that right now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! So it's intentional.

I think this is fine - it still deals fine with the types the hints claim it will, plus you're saying that it will also silently succeed on some other types too.

Ideally we'd add types to kerchunk, but I don't have time for that right now.

Yep. In the long term the plan is to re-implement kerchunk's readers without going via the intermediate kerchunk reference format, at which point we will hopefully be able to remove all the code in kerchunk.py anyway.

@TomNicholas
Copy link
Collaborator

Sorry for the slow review @maresb ! Just some very small comments here.

@maresb
Copy link
Contributor Author

maresb commented Jul 21, 2024

Sorry for the slow review @maresb ! Just some very small comments here.

No worries, no rush, and thanks so much for maintaining!!!

@maresb
Copy link
Contributor Author

maresb commented Jul 21, 2024

I think I'm all caught up with your comments.

I don't think you're asking for a change regarding the NotImplementedError on inlined data, but I'm not certain about this.

Please let me know if you'd like any further changes, or if you'd like me to rebase.

Thanks @TomNicholas!

@TomNicholas
Copy link
Collaborator

I think this looks good, thank you @maresb !

@TomNicholas TomNicholas merged commit 0ad4de5 into zarr-developers:main Jul 22, 2024
7 checks passed
@maresb maresb deleted the improve-refspec branch July 27, 2024 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Kerchunk Relating to the kerchunk library / specification itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants