Xarray backend which loads data by default #221

Open
TomNicholas opened this issue Aug 20, 2024 · 1 comment
Labels: references generation (Reading byte ranges from archival files)

Comments

TomNicholas (Collaborator) commented Aug 20, 2024

I think the idea of a virtualizarr backend is appealing. One way it could be implemented is by loading the actual data file and then also creating and storing the byte references in the virtual dataset. That way, dataset structure creation and data loading are handled by xarray, and the byte references are just an add-on. This all hinges on whether doing both data loading and reference creation takes much extra time compared to doing just one.

As a side note, this might make the reference creation process (#87) simpler as well. Instead of searching for attrs, encoding, dimension names, etc., the "chunk reader" would only need to create a low-level chunk manifest (path, offset, length); the rest of the information would be retrieved from the netCDF file by xarray. I'm not sure whether that is actually a time-consuming part of reference creation, however.
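For concreteness, such a low-level chunk manifest would look roughly like the sketch below (the keys, filename, and numbers are illustrative assumptions, not a guaranteed schema):

```python
# A minimal sketch of a per-variable chunk manifest: one entry per chunk,
# keyed by chunk index, recording only where the bytes live on disk.
chunk_manifest = {
    "0.0": {"path": "data.nc", "offset": 8192,    "length": 1048576},
    "0.1": {"path": "data.nc", "offset": 1056768, "length": 1048576},
}
# Everything else (attrs, encoding, dimension names) would come from
# xarray's normal open of the netCDF file.
```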

It would also add the ability to easily inline data (#62).

Basically the idea is that `xr.open_dataset("data.nc", engine="virtualizarr")` loads the netCDF file normally but then also reads byte ranges and creates ManifestArrays. Since I don't think it's possible to have two data arrays within one variable, perhaps all the data arrays would be replaced with ManifestArrays unless top-level parameters such as loadable_variables and cftime_variables explicitly ask for loaded data.

Originally posted by @ayushnag in #157 (comment)
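A sketch of what the entry point suggested above might look like (the engine name comes from the suggestion; the keyword and behaviour are the proposal, not an existing xarray/virtualizarr API):

```python
import xarray as xr

# Hypothetical usage of the proposed backend: open the file normally,
# but return ManifestArrays instead of loaded data by default.
ds = xr.open_dataset(
    "data.nc",
    engine="virtualizarr",
    loadable_variables=["time"],  # hypothetically loaded as real numpy data
)
# Every other variable would then wrap a ManifestArray holding only byte
# references (path, offset, length) instead of the actual values.
```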

TomNicholas added the references generation label Aug 20, 2024
TomNicholas (Collaborator, Author) commented

@ayushnag thanks for the suggestion!

I see a few issues with this idea though:

> This all hinges on whether doing both data loading and reference creation takes much extra time compared to doing just one.

It's not just about time; it's also about memory usage. Loading all the data up front would use ~1e6x as much RAM as storing only the byte references (assuming each chunk is ~1MB). That's very wasteful if we know that all we want to do is write the metadata out in a new form.
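To make the scale concrete, here's a back-of-envelope comparison (all numbers below are illustrative assumptions):

```python
# Rough scale of the difference for a dataset of one million ~1 MB chunks.
n_chunks = 1_000_000
chunk_nbytes = 1_000_000        # ~1 MB of data per chunk if loaded
entry_nbytes = 100              # rough size of one (path, offset, length) entry

loaded_ram = n_chunks * chunk_nbytes     # ~1 TB to hold every chunk in memory
manifest_ram = n_chunks * entry_nbytes   # ~100 MB to hold only the references
print(f"{loaded_ram / 1e12:.0f} TB loaded vs {manifest_ram / 1e6:.0f} MB virtual")
```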

We also can't do this when opening metadata-only representations (e.g. DMR++ or existing kerchunk JSON) without incurring a big performance hit, because we would have to GET the original data files in addition to the metadata files.

> Since I don't think it's possible to have two data arrays within one variable,

It's not, by definition.

> perhaps all the data arrays would be replaced with ManifestArrays unless top-level parameters such as loadable_variables and cftime_variables explicitly ask for loaded data

If the reader creates metadata-only ManifestArrays but has the option to materialize some of them via loadable_variables, then that's what we have already (see the sketch below), and if there is an additional way to materialize the ManifestArrays afterwards, then that's just the suggestion in #124.
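For reference, that existing behaviour looks something like this (a sketch; the variable names are made up, and the current virtualizarr docs are authoritative for the exact signature):

```python
from virtualizarr import open_virtual_dataset

# Existing approach: every variable becomes a ManifestArray except those
# explicitly requested as loadable.
vds = open_virtual_dataset(
    "data.nc",
    loadable_variables=["time", "lat", "lon"],
)
```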


I feel like your suggestion amounts to having a virtualizarr function that maps xr.Dataset[np.ndarray] -> xr.Dataset[ManifestArray]. But that isn't possible, because an np.ndarray carries no knowledge of the filepath it was loaded from.
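In code terms, the hypothetical function would look something like the following sketch, with the comment marking exactly where it breaks down:

```python
import numpy as np
import xarray as xr

def virtualize(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical xr.Dataset[np.ndarray] -> xr.Dataset[ManifestArray]."""
    for name, var in ds.variables.items():
        data = np.asarray(var.data)
        # Dead end: a plain np.ndarray records no provenance. Nothing on
        # `data` says which file it came from or at what byte offset each
        # chunk lives, so a chunk manifest cannot be reconstructed here.
        ...
    raise NotImplementedError("np.ndarray has no record of its source file")
```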

Maybe I've misunderstood something?
