-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend refspec support to [path] entries (without offset/length) #187
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -1,10 +1,11 @@ | ||||||
import json | ||||||
import re | ||||||
from collections.abc import Iterable, Iterator | ||||||
from typing import Any, Callable, NewType, Tuple, Union, cast | ||||||
from typing import Any, Callable, Dict, NewType, Tuple, TypedDict, cast | ||||||
|
||||||
import numpy as np | ||||||
from pydantic import BaseModel, ConfigDict | ||||||
from upath import UPath | ||||||
|
||||||
from virtualizarr.types import ChunkKey | ||||||
|
||||||
|
@@ -15,7 +16,13 @@ | |||||
_CHUNK_KEY = rf"^{_INTEGER}+({_SEPARATOR}{_INTEGER})*$" # matches 1 integer, optionally followed by more integers each separated by a separator (i.e. a period) | ||||||
|
||||||
|
||||||
ChunkDict = NewType("ChunkDict", dict[ChunkKey, dict[str, Union[str, int]]]) | ||||||
class ChunkDictEntry(TypedDict): | ||||||
path: str | ||||||
offset: int | ||||||
length: int | ||||||
|
||||||
|
||||||
ChunkDict = NewType("ChunkDict", dict[ChunkKey, ChunkDictEntry]) | ||||||
|
||||||
|
||||||
class ChunkEntry(BaseModel): | ||||||
|
@@ -35,16 +42,23 @@ def __repr__(self) -> str: | |||||
return f"ChunkEntry(path='{self.path}', offset={self.offset}, length={self.length})" | ||||||
|
||||||
@classmethod | ||||||
def from_kerchunk(cls, path_and_byte_range_info: list[str | int]) -> "ChunkEntry": | ||||||
path, offset, length = path_and_byte_range_info | ||||||
def from_kerchunk( | ||||||
cls, path_and_byte_range_info: tuple[str] | tuple[str, int, int] | ||||||
) -> "ChunkEntry": | ||||||
if len(path_and_byte_range_info) == 1: | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this could be an There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ooh, excellent catch! Kerchunk doesn't like to inline Zarr variable data, effectively masking the case you point out. I just lazily turned it into a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, when does the inline case come up here exactly? I feel like it is also relevant for #119? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here's a simple reproducer: import kerchunk.zarr
import xarray as xr
ds = xr.Dataset({}, coords={"time": [0, 1, 2]})
ds.to_zarr("/tmp/ds.zarr", mode="w")
ref = kerchunk.zarr.ZarrToZarr("/tmp/ds.zarr").translate()
ref["refs"]["time/0"] # ['file:///tmp/ds.zarr/time/0'] I think it's a kerchunk bug that Zarrs are never inlined, even when There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That doesn't surprise me - we already had to deal with fsspec/kerchunk#445. But we actually don't use kerchunk's inlining feature at all in virtualizarr - instead if we want eager access to some variable we just load it as a numpy array-backed xarray variable instead ( But I'm a little confused - that's all separate from how kerchunk will sometimes return a Also as I said above I don't think we should call There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, sorry, I gave you an example for the wrong case. I don't actually have any particular case in mind where data is inlined. I was assuming that inlined data is possible because kerchunk does it. Also I was hoping that I'd be able to use VirtualiZarr to open references that I've already kerchunked, in which case it would certainly occur. If the only intended use of kerchunk is internal, should I remove that check? Or at least reword the text so that it's not a TODO? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Currently the only intended use case of kerchunk is internal, but this case:
we would like to support, which is being (intermittently) worked on in #119 and recently #186. I think that the minimal version of this PR does not need to handle kerchunk-inlined data though, and we should merge this useful PR before worrying about that. |
||||||
path = path_and_byte_range_info[0] | ||||||
offset = 0 | ||||||
length = UPath(path).stat().st_size | ||||||
Comment on lines
+49
to
+51
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This trick might be really useful in general, because it seems kerchunk regularly returns a single string path instead of a tuple here. See for example #186 (comment) |
||||||
else: | ||||||
path, offset, length = path_and_byte_range_info | ||||||
return ChunkEntry(path=path, offset=offset, length=length) | ||||||
|
||||||
def to_kerchunk(self) -> list[str | int]: | ||||||
def to_kerchunk(self) -> tuple[str, int, int]: | ||||||
"""Write out in the format that kerchunk uses for chunk entries.""" | ||||||
return [self.path, self.offset, self.length] | ||||||
return (self.path, self.offset, self.length) | ||||||
|
||||||
def dict(self) -> dict[str, Union[str, int]]: | ||||||
return dict(path=self.path, offset=self.offset, length=self.length) | ||||||
def dict(self) -> ChunkDictEntry: | ||||||
return ChunkDictEntry(path=self.path, offset=self.offset, length=self.length) | ||||||
|
||||||
|
||||||
class ChunkManifest: | ||||||
|
@@ -283,12 +297,20 @@ def to_zarr_json(self, filepath: str) -> None: | |||||
json.dump(entries, json_file, indent=4, separators=(", ", ": ")) | ||||||
|
||||||
@classmethod | ||||||
def _from_kerchunk_chunk_dict(cls, kerchunk_chunk_dict) -> "ChunkManifest": | ||||||
chunkentries = { | ||||||
cast(ChunkKey, k): ChunkEntry.from_kerchunk(v).dict() | ||||||
for k, v in kerchunk_chunk_dict.items() | ||||||
} | ||||||
return ChunkManifest(entries=cast(ChunkDict, chunkentries)) | ||||||
def _from_kerchunk_chunk_dict( | ||||||
cls, | ||||||
# The type hint requires `Dict` instead of `dict` due to | ||||||
# the conflicting ChunkManifest.dict method. | ||||||
kerchunk_chunk_dict: Dict[ChunkKey, str | tuple[str] | tuple[str, int, int]], | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Pretty sure in a recent enough version of python you can just use the built-in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that this looks like a mistake, but strangely it's necessary. I added an explanatory comment to prevent others from attempting this obvious change. |
||||||
) -> "ChunkManifest": | ||||||
chunk_entries: dict[ChunkKey, ChunkDictEntry] = {} | ||||||
for k, v in kerchunk_chunk_dict.items(): | ||||||
if isinstance(v, (str, bytes)): | ||||||
raise NotImplementedError("TODO: handle inlined data") | ||||||
elif not isinstance(v, (tuple, list)): | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Isn't this incorrect typing? You allow There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm pulling some really sketchy tricks with typing here.
Actually, according to my type hint for My rationale is that kerchunk can and probably will give us weird types, and we should be robust with handling them at runtime. While So basically what I'm saying is, "I want to raise a type error if I don't get a Do you find this an acceptable strategy, or would you like to change it? Ideally we'd add types to kerchunk, but I don't have time for that right now. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay! So it's intentional. I think this is fine - it still deals fine with the types the hints claim it will, plus you're saying that it will also silently succeed on some other types too.
Yep. In the long term the plan is to re-implement kerchunk's readers without going via the intermediate kerchunk reference format, at which point we will hopefully be able to remove all the code in |
||||||
raise TypeError(f"Unexpected type {type(v)} for chunk value: {v}") | ||||||
chunk_entries[k] = ChunkEntry.from_kerchunk(v).dict() | ||||||
return ChunkManifest(entries=chunk_entries) | ||||||
|
||||||
def rename_paths( | ||||||
self, | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is nice and probably what I should have used in the first place if I was better with python typing 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should do a similar change for any of the other types in the codebase...