Xarray backend which loads data by default #221

Open
TomNicholas opened this issue Aug 20, 2024 · 1 comment
Labels: references generation (Reading byte ranges from archival files)

Comments

TomNicholas (Collaborator) commented Aug 20, 2024

I think the idea of a virtualizarr backend is appealing. One way it could be implemented is by loading the actual data file and then also creating and storing the byte references in the virtual dataset. That way, dataset structure creation and data loading are handled by xarray, and the byte references are just an add-on. This all hinges on whether doing both data loading and reference creation takes much extra time compared to doing just one.

As a side note, this might make the reference creation process (#87) simpler as well. Instead of searching for attrs, encoding, dimension names, etc., the "chunk reader" would only need to create a low-level chunk manifest (path, offset, length); the rest of the information would be retrieved from the netCDF file by xarray. I'm not sure whether that is actually a time-consuming part of reference creation, however.
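For concreteness, such a low-level chunk manifest would look roughly like the sketch below (the keys, filename, and numbers are illustrative assumptions, not a guaranteed schema):

```python
# A minimal sketch of a per-variable chunk manifest: one entry per chunk,
# keyed by chunk index, recording only where the bytes live on disk.
chunk_manifest = {
    "0.0": {"path": "data.nc", "offset": 8192,    "length": 1048576},
    "0.1": {"path": "data.nc", "offset": 1056768, "length": 1048576},
}
# Everything else (attrs, encoding, dimension names) would come from
# xarray's normal open of the netCDF file.
```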

It would also add the ability to easily inline data (#62).

Basically the idea is that `xr.open_dataset("data.nc", engine="virtualizarr")` loads the netCDF file normally but then also reads byte ranges and creates ManifestArrays. Since I don't think it's possible to have two data arrays within one variable, perhaps all the data arrays would be replaced with ManifestArrays unless top-level parameters such as loadable_variables and cftime_variables explicitly ask for loaded data.

Originally posted by @ayushnag in #157 (comment)
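A sketch of what the entry point suggested above might look like (the engine name comes from the suggestion; the keyword and behaviour are the proposal, not an existing xarray/virtualizarr API):

```python
import xarray as xr

# Hypothetical usage of the proposed backend: open the file normally,
# but return ManifestArrays instead of loaded data by default.
ds = xr.open_dataset(
    "data.nc",
    engine="virtualizarr",
    loadable_variables=["time"],  # hypothetically loaded as real numpy data
)
# Every other variable would then wrap a ManifestArray holding only byte
# references (path, offset, length) instead of the actual values.
```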

TomNicholas added the references generation label Aug 20, 2024
TomNicholas (Collaborator, Author) commented

@ayushnag thanks for the suggestion!

I see a few issues with this idea though:

> This all hinges on whether doing both data loading and reference creation takes much extra time compared to doing just one.

It's not just about time; it's also about memory usage. Loading all the data up front would use ~1e6x as much RAM as storing only the byte references (assuming each chunk is ~1MB). That's very wasteful if we know that all we want to do is write the metadata out in a new form.
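To make the scale concrete, here's a back-of-envelope comparison (all numbers below are illustrative assumptions):

```python
# Rough scale of the difference for a dataset of one million ~1 MB chunks.
n_chunks = 1_000_000
chunk_nbytes = 1_000_000        # ~1 MB of data per chunk if loaded
entry_nbytes = 100              # rough size of one (path, offset, length) entry

loaded_ram = n_chunks * chunk_nbytes     # ~1 TB to hold every chunk in memory
manifest_ram = n_chunks * entry_nbytes   # ~100 MB to hold only the references
print(f"{loaded_ram / 1e12:.0f} TB loaded vs {manifest_ram / 1e6:.0f} MB virtual")
```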

We also can't do this when opening metadata-only representations (e.g. DMR++ or existing kerchunk JSON) without incurring a big performance hit, because we would have to GET the original data files in addition to the metadata files.

> Since I don't think it's possible to have two data arrays within one variable,

It's not, by definition.

> perhaps all the data arrays would be replaced with ManifestArrays unless top-level parameters such as loadable_variables and cftime_variables explicitly ask for loaded data

If the reader creates metadata-only ManifestArrays but has the option to materialize some of them via loadable_variables, then that's what we have already (see the sketch below), and if there is an additional way to materialize the ManifestArrays afterwards, then that's just the suggestion in #124.
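For reference, that existing behaviour looks something like this (a sketch; the variable names are made up, and the current virtualizarr docs are authoritative for the exact signature):

```python
from virtualizarr import open_virtual_dataset

# Existing approach: every variable becomes a ManifestArray except those
# explicitly requested as loadable.
vds = open_virtual_dataset(
    "data.nc",
    loadable_variables=["time", "lat", "lon"],
)
```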


I feel like your suggestion amounts to having a virtualizarr function that maps xr.Dataset[np.ndarray] -> xr.Dataset[ManifestArray]. But that isn't possible, because an np.ndarray carries no knowledge of the filepath it was loaded from.
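In code terms, the hypothetical function would look something like the following sketch, with the comment marking exactly where it breaks down:

```python
import numpy as np
import xarray as xr

def virtualize(ds: xr.Dataset) -> xr.Dataset:
    """Hypothetical xr.Dataset[np.ndarray] -> xr.Dataset[ManifestArray]."""
    for name, var in ds.variables.items():
        data = np.asarray(var.data)
        # Dead end: a plain np.ndarray records no provenance. Nothing on
        # `data` says which file it came from or at what byte offset each
        # chunk lives, so a chunk manifest cannot be reconstructed here.
        ...
    raise NotImplementedError("np.ndarray has no record of its source file")
```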

Maybe I've misunderstood something?
