
Kerchunk / VirtualiZarr way to open radar files #187

Open
aladinor opened this issue Aug 5, 2024 · 8 comments
Comments

@aladinor
Member

aladinor commented Aug 5, 2024

Hi everyone,

Handling historical radar datasets can often be overwhelming. To simplify this process, I propose we adopt the concepts from Kerchunk / VirtualiZarr to create reference files. By leveraging these tools, we can read multiple radar data files in a Zarr-like manner, significantly enhancing our capabilities for big-data historical analysis.

Proposed Approach:

  • Kerchunk/VirtualiZarr Integration: Create reference files for our radar datasets to enable efficient access without extensive I/O operations.
  • Boosting Big Data Analysis: By reading the radar datasets in a Zarr-like format, we can perform more efficient and scalable analyses on historical radar data.

Benefits:

  • Efficiency: Reduced time and computational resources needed for data access and preprocessing.
  • Scalability: Ability to handle and analyze large volumes of historical radar data.
  • Convenience: Simplified workflow for accessing and working with radar datasets.
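To make the proposal concrete, here is a minimal sketch of the kind of kerchunk-style reference set the approach above would produce for one radar sweep. All paths, variable names, and byte offsets are hypothetical; the point is only that each chunk is addressed by a `[url, offset, length]` triple that a reader can fetch with a range request instead of downloading whole files.

```python
import json

# Hypothetical kerchunk-style (version 1) reference set for one radar sweep.
# Each chunk key maps to [source_url, byte_offset, byte_length], so a reader
# can fetch raw chunks with HTTP range requests instead of full downloads.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        # reflectivity chunk 0 lives at bytes 4096..4096+65536 of the source
        "DBZH/0.0": ["s3://radar-archive/2024/scan_0001.nc", 4096, 65536],
    },
}

def chunk_byte_range(refs, key):
    """Return (url, offset, length) for a chunk key in a v1 reference set."""
    url, offset, length = refs["refs"][key]
    return url, offset, length

print(chunk_byte_range(refs, "DBZH/0.0"))
```

In practice kerchunk or VirtualiZarr would generate these references by scanning the source files once; afterwards, opening the whole archive is just a matter of loading the (small) reference file.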

I've previously discussed this idea with @TomNicholas, @kmuehlbauer, and @mgrover1. I'd like to start a discussion thread and possibly arrange a meeting to explore this further.

I look forward to your feedback and thoughts on this proposal. Let's collaborate to make historical radar data analysis more efficient and accessible for everyone!

@aladinor aladinor changed the title Kerchunk/VirtualiZarr way to open radar files Kerchunk / VirtualiZarr way to open radar files Aug 5, 2024
@TomNicholas

Sounds great! Happy to chat.

@kmuehlbauer
Collaborator

Thanks @aladinor for this initiative!

I've quickly skimmed the documentation of kerchunk and VirtualiZarr and am still trying to wrap my head around it. But I already have some questions.

  1. The individual chunks of some variable are specified by file offset and length. How does this work for file formats where variables are written interleaved within one chunk of data (e.g. 100 bytes v1, 100 bytes v2, 100 bytes v3, 100 bytes v1, 100 bytes v2, 100 bytes v3, ...)? Is there something like strides available?
  2. We have source files where the data is interleaved and run-length encoded. That means we can't know the length of each chunk beforehand. Due to the structure of the overall file we can make educated guesses (or skim over the file while keeping track of the individual data chunks). How would we handle those files?
  3. We have source files which contain binary blobs of data. For those we can easily identify the needed offset and length. If the blob itself is compressed, how does this work?

(1: Furuno, 2: Sigmet/NEXRAD, 3: Rainbow)
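A small illustration of question 1, under the interleaving pattern described above (all sizes are the hypothetical 100-byte pieces from the example): one variable's chunk cannot be described by a single (offset, length) pair, only by many separate ranges, which is exactly what a stride would compress into one description.

```python
# Question 1 illustrated: three variables interleaved in 100-byte pieces
# (v1 v2 v3 v1 v2 v3 ...). A single (offset, length) pair cannot describe
# one variable's data; a strided description could.

PIECE = 100   # bytes per interleaved piece (from the example above)
NVARS = 3     # v1, v2, v3
NPIECES = 4   # pieces per variable in this toy layout

def byte_ranges(var_index, base_offset=0):
    """All (offset, length) ranges holding one variable's data."""
    stride = PIECE * NVARS
    return [(base_offset + var_index * PIECE + i * stride, PIECE)
            for i in range(NPIECES)]

# v2 (index 1) needs four separate range requests:
print(byte_ranges(1))  # [(100, 100), (400, 100), (700, 100), (1000, 100)]
```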

The acknowledged standards are CfRadial1/2 (NetCDF4) and ODIM_H5 (hdf5). So I do not see issues for these files.

@TomNicholas

Okay @kmuehlbauer, I'm very pleased to have your thoughts on all this, but you've really come in with the hardest questions and trickiest file formats here!!

The individual chunks of some variable are specified by file offset and length.

Yes. This is a limitation of VirtualiZarr right now, which originally comes from the kerchunk definition of references, and would be enshrined in the proposed zarr chunk manifest specification. This format is sufficient for HDF5, netCDF, TIFF, GRIB and FITS, but possibly not for your filetypes!
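For reference, the entry shape below is a sketch of the chunk-manifest idea mentioned above (field names taken from the VirtualiZarr documentation; the Zarr spec itself is still a proposal, so treat this as illustrative). Every chunk is addressed by exactly one contiguous (path, offset, length) triple, which is why interleaved or run-length-encoded layouts do not fit the model as-is.

```python
# Sketch of a chunk manifest: one contiguous byte range per chunk.
# Paths and offsets are made up for illustration.
manifest = {
    "0.0": {"path": "s3://bucket/scan.nc", "offset": 4096, "length": 65536},
    "0.1": {"path": "s3://bucket/scan.nc", "offset": 69632, "length": 65536},
}

def range_header(entry):
    """Turn a manifest entry into the HTTP Range header a reader would send."""
    return f"bytes={entry['offset']}-{entry['offset'] + entry['length'] - 1}"

print(range_header(manifest["0.0"]))  # bytes=4096-69631
```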

Note also that my understanding is that this is motivated by what's supported when reading from cloud object storage, i.e. http range requests. If we cannot think of a way to read data efficiently from your file formats using http range requests, then perhaps there is not much point in trying to kerchunk/virtualize them...

file formats where variables are written interleaved within one chunk of data

So multiple variables are compressed into a single chunk? That already is outside of the Zarr model, where compression is always defined per-array.

Is there something like strides available?

I don't think so... Again, does cloud object storage actually support reading with a strided pattern? If not then I suppose you would either have to read all 3 variables' data to get at one, or issue many many http requests to get at each 100 bytes. Both of those sound very inefficient.

You might want to ask about these things on the chunk manifest Zarr spec proposal issue.

runlength encoded

That means we can't know beforehand the length of each chunk.

I'm not quite sure I follow. Do you mean that a single chunk does not correspond to a fixed byte length?

If the blob itself is compressed, how does this work?

Again I'm not sure I follow. As long as given a byte range, one could read those bytes and apply a known decoding step to get out the array bytes then you are good.
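A minimal sketch of that point, using zlib as a stand-in codec and an in-memory byte string in place of a real file and range request (the layout and offsets are invented): as long as the reference records where the compressed blob sits and which codec to apply, the reader recovers the array bytes.

```python
import struct
import zlib

# A compressed binary blob is fine as long as a known (offset, length) range
# plus a known codec recovers the array bytes.
raw = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)    # pretend radar moment data
blob = zlib.compress(raw)
container = b"HEADER--" + blob + b"--FOOTER"    # hypothetical file layout
offset, length = 8, len(blob)                   # recorded in the reference

# The reader's job: range-read, then apply the known decoding step.
chunk = container[offset:offset + length]       # stands in for a range request
values = struct.unpack("<4f", zlib.decompress(chunk))
print(values)  # (1.0, 2.0, 3.0, 4.0)
```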

The acknowledged standards are CfRadial1/2 (NetCDF4) and ODIM_H5 (hdf5). So I do not see issues for these files.

You should be fine with these formats.

@TomNicholas

See zarr-developers/zarr-specs#287

@kmuehlbauer
Collaborator

Thanks Tom for your detailed explanations. ❤️

It looks like we are good for question 3 and can hopefully find at least a partial solution for questions 1 and 2.

Let's get more insights here from others.

@TomNicholas

TomNicholas commented Aug 7, 2024

I mentioned this to @d-v-b in the zarr call just now, and we thought that the virtualizarr effort is essentially trying to make zarr into a "superformat", a superset of other formats such as netCDF. However, although you might imagine altering the proposed chunk manifest to accommodate more formats (e.g. by adding a strides entry), zarr's goal is to be cloud-native, so it's likely that zarr would be a superformat only for formats that can actually be efficiently accessed via http range requests. In other words, the scope of the zarr project is such that it might exclude supporting some of these non-cloud-amenable formats you're talking about here, @kmuehlbauer.

@kmuehlbauer
Collaborator

Thanks @TomNicholas, I thought so.

So we still can engage here for the formats which fit.

And for the other formats it's a clear message to radar manufacturers, weather services and data providers what their formats should be like if cloud readiness is the aim.

The good thing is that with the new FM301/CfRadial2 standard, the WMO wisely chose an HDF5/NetCDF-based format.

@aladinor
Member Author

aladinor commented Aug 8, 2024

Thanks, @kmuehlbauer and @TomNicholas, for bringing this all up in the conversation. We can start looking for radar cloud-amenable formats, see if we can create a backend for VirtualiZarr, and then see what possible solution we can find for the other formats.
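As a conceptual sketch of that next step (this is what kerchunk's MultiZarrToZarr and VirtualiZarr's concatenation do for real; the function and keys here are purely illustrative): per-scan reference sets get re-keyed along a new leading axis so the whole archive reads as one virtual array.

```python
# Hypothetical sketch: merge per-scan chunk references into one virtual
# dataset by prefixing a new leading (time-like) axis index.

def combine_refs(per_scan_refs):
    """Re-key chunks so scan i becomes slab i of a new leading dimension."""
    combined = {}
    for i, refs in enumerate(per_scan_refs):
        for chunk_key, entry in refs.items():
            combined[f"{i}.{chunk_key}"] = entry
    return combined

scan0 = {"0.0": ["s3://archive/scan0.nc", 4096, 65536]}
scan1 = {"0.0": ["s3://archive/scan1.nc", 4096, 65536]}
print(sorted(combine_refs([scan0, scan1])))  # ['0.0.0', '1.0.0']
```

The real tools also have to check that dtypes, codecs, and chunk shapes agree across scans before concatenating, which is where a radar-aware backend would come in.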

Please let me know your thoughts.
