Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend refspec support to [path] entries (without offset/length) #187

Merged
merged 4 commits into from
Jul 22, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 36 additions & 14 deletions virtualizarr/manifests/manifest.py
Original file line number Diff line number Diff line change
@@ -1,10 +1,11 @@
import json
import re
from collections.abc import Iterable, Iterator
from typing import Any, Callable, NewType, Tuple, Union, cast
from typing import Any, Callable, Dict, NewType, Tuple, TypedDict, cast

import numpy as np
from pydantic import BaseModel, ConfigDict
from upath import UPath

from virtualizarr.types import ChunkKey

Expand All @@ -15,7 +16,13 @@
_CHUNK_KEY = rf"^{_INTEGER}+({_SEPARATOR}{_INTEGER})*$" # matches 1 integer, optionally followed by more integers each separated by a separator (i.e. a period)


ChunkDict = NewType("ChunkDict", dict[ChunkKey, dict[str, Union[str, int]]])
class ChunkDictEntry(TypedDict):
path: str
offset: int
length: int


ChunkDict = NewType("ChunkDict", dict[ChunkKey, ChunkDictEntry])
Comment on lines +19 to +25
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is nice and probably what I should have used in the first place if I was better with python typing 😅

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should do a similar change for any of the other types in the codebase...



class ChunkEntry(BaseModel):
Expand All @@ -35,16 +42,23 @@ def __repr__(self) -> str:
return f"ChunkEntry(path='{self.path}', offset={self.offset}, length={self.length})"

@classmethod
def from_kerchunk(cls, path_and_byte_range_info: list[str | int]) -> "ChunkEntry":
path, offset, length = path_and_byte_range_info
def from_kerchunk(
cls, path_and_byte_range_info: tuple[str] | tuple[str, int, int]
) -> "ChunkEntry":
if len(path_and_byte_range_info) == 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be an isinstance(..., str) check instead perhaps? Kerchunk's docs on its specification indicate that it should return either a bare str or a tuple[str, int, int]. We can't predict when kerchunk will randomly decide to return the former, but we should be as explicit as possible when checking to see if that is what has happened. (I'm worried that there may be other situations in which kerchunk returns a length-1 something that isn't actually a string...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, excellent catch! Kerchunk doesn't like to inline Zarr variable data, effectively masking the case you point out.

I just lazily turned it into a NotImplementedError for now since it would require a bit more thought. (Probably best to follow up in a subsequent PR.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, when does the inline case come up here exactly? I feel like it is also relevant for #119?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a simple reproducer:

import kerchunk.zarr
import xarray as xr

ds = xr.Dataset({}, coords={"time": [0, 1, 2]})
ds.to_zarr("/tmp/ds.zarr", mode="w")
ref = kerchunk.zarr.ZarrToZarr("/tmp/ds.zarr").translate()
ref["refs"]["time/0"]  # ['file:///tmp/ds.zarr/time/0']

I think it's a kerchunk bug that Zarrs are never inlined, even when inline_threshold is set.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a kerchunk bug that Zarrs are never inlined, even when inline_threshold is set.

That doesn't surprise me - we already had to deal with fsspec/kerchunk#445.

But we actually don't use kerchunk's inlining feature at all in virtualizarr - instead if we want eager access to some variable we just load it as a numpy array-backed xarray variable instead (
https://virtualizarr.readthedocs.io/en/latest/usage.html#loading-variables). So we just set inline_threshold=0 whenever we call any kerchunk reader internally.

But I'm a little confused - that's all separate from how kerchunk will sometimes return a str instead of tuple[str, int, int]. Your example ref["refs"]["time/0"] # ['file:///tmp/ds.zarr/time/0'] is not an example of inlining, it's just kerchunk using a different way to implicitly represent the byte range reference to an entire chunk.

Also as I said above I don't think we should call kerchunk.zarr.ZarrToZarr inside virtualizarr.open_virtual_dataset, we should rewrite that function ourselves (but not in this PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I gave you an example for the wrong case.

I don't actually have any particular case in mind where data is inlined. I was assuming that inlined data is possible because kerchunk does it. Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.

If the only intended use of kerchunk is internal, should I remove that check? Or at least reword the text so that it's not a TODO?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.

Currently the only intended use case of kerchunk is internal, but this case:

Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur.

we would like to support, which is being (intermittently) worked on in #119 and recently #186.

I think that the minimal version of this PR does not need to handle kerchunk-inlined data though, and we should merge this useful PR before worrying about that.

path = path_and_byte_range_info[0]
offset = 0
length = UPath(path).stat().st_size
Comment on lines +49 to +51
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This trick might be really useful in general, because it seems kerchunk regularly returns a single string path instead of a tuple here. See for example #186 (comment)

else:
path, offset, length = path_and_byte_range_info
return ChunkEntry(path=path, offset=offset, length=length)

def to_kerchunk(self) -> list[str | int]:
def to_kerchunk(self) -> tuple[str, int, int]:
"""Write out in the format that kerchunk uses for chunk entries."""
return [self.path, self.offset, self.length]
return (self.path, self.offset, self.length)

def dict(self) -> dict[str, Union[str, int]]:
return dict(path=self.path, offset=self.offset, length=self.length)
def dict(self) -> ChunkDictEntry:
return ChunkDictEntry(path=self.path, offset=self.offset, length=self.length)


class ChunkManifest:
Expand Down Expand Up @@ -283,12 +297,20 @@ def to_zarr_json(self, filepath: str) -> None:
json.dump(entries, json_file, indent=4, separators=(", ", ": "))

@classmethod
def _from_kerchunk_chunk_dict(cls, kerchunk_chunk_dict) -> "ChunkManifest":
chunkentries = {
cast(ChunkKey, k): ChunkEntry.from_kerchunk(v).dict()
for k, v in kerchunk_chunk_dict.items()
}
return ChunkManifest(entries=cast(ChunkDict, chunkentries))
def _from_kerchunk_chunk_dict(
cls,
# The type hint requires `Dict` instead of `dict` due to
# the conflicting ChunkManifest.dict method.
kerchunk_chunk_dict: Dict[ChunkKey, str | tuple[str] | tuple[str, int, int]],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kerchunk_chunk_dict: Dict[ChunkKey, str | tuple[str] | tuple[str, int, int]],
kerchunk_chunk_dict: dict[ChunkKey, str | tuple[str] | tuple[str, int, int]],

Pretty sure in a recent enough version of python you can just use the built-in dict for typing, same as how you're using tuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this looks like a mistake, but strangely it's necessary. I added an explanatory comment to prevent others from attempting this obvious change.

) -> "ChunkManifest":
chunk_entries: dict[ChunkKey, ChunkDictEntry] = {}
for k, v in kerchunk_chunk_dict.items():
if isinstance(v, (str, bytes)):
raise NotImplementedError("TODO: handle inlined data")
elif not isinstance(v, (tuple, list)):
Copy link
Collaborator

@TomNicholas TomNicholas Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this incorrect typing? You allow v to be a list, then pass the list into a method which only accepts tuples (i.e. ChunkEntry.from_kerchunk()).

Copy link
Contributor Author

@maresb maresb Jul 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pulling some really sketchy tricks with typing here.

You allow v to be a list

Actually, according to my type hint for kerchunk_chunk_dict, v should be a str or tuple but never a list. Thus checking if v is a list should be vacuous but not technically incorrect.

My rationale is that kerchunk can and probably will give us weird types, and we should be robust with handling them at runtime. While ChunkEntry.from_kerchunk expects tuples it can still handle lists if need be, even though I don't tell the typing system. Thus I simply don't want to fail in case a list makes its way to this point.

So basically what I'm saying is, "I want to raise a type error if I don't get a tuple, but if I happen to get a list I'll continue, pretending as if it were a tuple."

Do you find this an acceptable strategy, or would you like to change it?

Ideally we'd add types to kerchunk, but I don't have time for that right now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay! So it's intentional.

I think this is fine - it still deals fine with the types the hints claim it will, plus you're saying that it will also silently succeed on some other types too.

Ideally we'd add types to kerchunk, but I don't have time for that right now.

Yep. In the long term the plan is to re-implement kerchunk's readers without going via the intermediate kerchunk reference format, at which point we will hopefully be able to remove all the code in kerchunk.py anyway.

raise TypeError(f"Unexpected type {type(v)} for chunk value: {v}")
chunk_entries[k] = ChunkEntry.from_kerchunk(v).dict()
return ChunkManifest(entries=chunk_entries)

def rename_paths(
self,
Expand Down
Loading