
Utility to extract, reshape, and store a subset of the data, e.g. for extracting timeseries for single PV sites from gridded NWPs #141

Open
JackKelly opened this issue Jun 21, 2024 · 0 comments

JackKelly commented Jun 21, 2024

If I put on my hat as an energy forecasting ML researcher, then one of the "dreams" would be to be able to use a single on-disk dataset (e.g. 500 TBytes of NWPs) for multiple ML experiments:

  1. a neural net, which takes in dense imagery from NWPs and satellite imagery, covering the same regions in space and time
  2. an XGBoost model to forecast solar PV power for a handful of specific sites. For each site, the input might be a single "pixel" (a single lat-lon location) across time.

If the data is chunked on disk to support use-case 1 (the neural net), then we might use chunks something like y=128, x=128, t=1, c=10. But that sucks for use-case 2 (which only wants a single pixel).
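
To make the mismatch concrete, here's a back-of-the-envelope sketch (the chunk shapes are the ones above; the arithmetic is illustrative and assumes whole chunks must be fetched and decompressed):

```python
# Chunk shapes for the two use-cases above.
dense_chunks = {"y": 128, "x": 128, "t": 1, "c": 10}   # good for use-case 1
sparse_chunks = {"y": 1, "x": 1, "t": 4096, "c": 10}   # good for use-case 2

# Reading one pixel's timeseries from the dense layout forces a full
# 128 x 128 spatial chunk to be fetched per timestep, so the read
# amplification is:
amplification = dense_chunks["y"] * dense_chunks["x"]
print(f"{amplification}x more data read than needed")  # 16384x
```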

So it'd be nice to have a tool to (see the sketch after this list):

  • easily extract long timeseries for a handful of sparse locations, and maybe save those in chunk sizes of something like y=1, x=1, t=4096, c=10
  • append to these timeseries
  • automatically append to the timeseries datasets when new timesteps are added to the dense dataset?
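
A minimal xarray sketch of what the extract-and-rechunk step might look like (the store paths, site coordinates, and dimension names are all hypothetical, and this deliberately ignores the performance requirements discussed below):

```python
import xarray as xr

# Hypothetical stores: a dense gridded NWP and a per-site timeseries copy.
DENSE_STORE = "s3://bucket/nwp_dense.zarr"    # chunked y=128, x=128, t=1, c=10
SPARSE_STORE = "s3://bucket/nwp_sites.zarr"   # chunked y=1, x=1, t=4096, c=10

# A handful of PV site locations (made-up coordinates).
site_ys = xr.DataArray([51.5, 53.4, 50.7], dims="site")
site_xs = xr.DataArray([-0.1, -2.9, -3.5], dims="site")

ds = xr.open_zarr(DENSE_STORE)

# Vectorised point selection: the nearest grid cell for each site,
# keeping the full time and channel dimensions.
points = ds.sel(y=site_ys, x=site_xs, method="nearest")

# Re-chunk for long-timeseries reads, then write the sparse store.
points.chunk({"site": 1, "t": 4096}).to_zarr(SPARSE_STORE, mode="w")

# Appending new timesteps later (e.g. as the dense dataset grows):
# new_points.to_zarr(SPARSE_STORE, append_dim="t")
```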

Maybe the ideal would be for the user to be able to express these conversions in a few lines of Python, perhaps using xarray, whilst still saturating the IO (e.g. on a cloud instance with a 200 Gbps NIC, reading from and writing to object storage). The user shouldn't have to worry about parallelising stuff.
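
For what it's worth, xarray already gets part-way there when backed by a dask cluster: the parallelism is implicit, though saturating a 200 Gbps NIC would almost certainly need careful tuning. Continuing the sketch above (the cluster sizing is made up):

```python
from dask.distributed import Client

# Spin up a cluster; on a big cloud instance this would be sized to the
# machine (these worker counts are purely illustrative).
client = Client(n_workers=16, threads_per_worker=2)

# With dask-backed arrays, to_zarr() schedules the chunk reads and writes
# in parallel across the cluster; the user writes no explicit parallel code.
points.chunk({"site": 1, "t": 4096}).to_zarr(SPARSE_STORE, mode="w")
```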

Perhaps you'd have multiple on-disk datasets (each optimised for a different read pattern). But the user wouldn't have to manually manage these multiple datasets. Instead, the user would interact with a "multi-dataset" layer which would manage the underlying datasets (see #142).
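
Purely to illustrate the shape of that layer (nothing below exists; the class, its methods, and the routing heuristic are all invented for this sketch), it might look something like:

```python
import xarray as xr

class MultiDataset:
    """Hypothetical facade over several physical copies of one logical
    dataset, each chunked for a different read pattern (see #142)."""

    def __init__(self, stores: dict[str, str]):
        # e.g. {"dense": "s3://bucket/nwp_dense.zarr",
        #       "timeseries": "s3://bucket/nwp_sites.zarr"}
        self._datasets = {name: xr.open_zarr(path)
                          for name, path in stores.items()}

    def sel(self, **indexers):
        # Toy routing heuristic: point-wise reads go to the timeseries
        # copy; everything else goes to the dense copy.
        is_point_read = "y" in indexers and "x" in indexers
        ds = self._datasets["timeseries" if is_point_read else "dense"]
        return ds.sel(**indexers)
```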

JackKelly added the enhancement (New feature or request), performance (Improvements to runtime performance), and usability (Make things more user-friendly) labels on Jun 21, 2024