Added a new documentation page for faster GRIB aggregations #495

Merged
merged 7 commits into from
Sep 9, 2024
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/source/index.rst
Expand Up @@ -68,6 +68,7 @@ so that blocks from one or more files can be arranged into aggregate datasets ac
beyond
nonzarr
reference
reference_aggregation
contributing
advanced

212 changes: 212 additions & 0 deletions docs/source/reference_aggregation.rst
@@ -0,0 +1,212 @@
Aggregation special cases
=============================

As we have already seen on the `multi-file JSONs <https://fsspec.github.io/kerchunk/test_example.html#multi-file-jsons>`_ page,
the main purpose of ``kerchunk`` is to generate references that let us view a whole archive
of files (GRIB2, NetCDF, etc.) and access the data directly. In
this part of the documentation, we look at some other, more efficient ways of
combining references.

GRIB Aggregations
-----------------

This reference aggregation method for GRIB files, developed by **Camus Energy**, works when
accompanying ``.idx`` files are present.

**This procedure has certain restrictions:**

- The GRIB files must be paired with their ``.idx`` files.
- The ``.idx`` files must be plain *text*.
- It is specialised for time-series data in which the GRIB files
  have an *identical* structure.
- Aggregation only works across files for a single **forecast horizon**;
  each horizon must be indexed separately.
Contributor @emfdavid commented on Aug 28, 2024:

    The reference index can be combined across many horizons, but each horizon must be indexed separately.
    Looking forward to seeing what you make of the reinflate API... there you can see that all of the FMRC slices are supported against a collection of indexed data from many horizons, runtimes and valid times.

Member:

    > the reinflate api

    ooh, what is this?

Contributor:

    The method to turn the k_index and the metadata back into a ref_spec you can use in zarr/xarray:
    https://github.com/asascience-open/nextgen-dmac/blob/main/grib_index_aggregation/dynamic_zarr_store.py#L198
    I think @Anu-Ra-g is already working on adding it into kerchunk?

Member:

    Oh, in that case I suspect it works already, right @Anu-Ra-g: but you can only work on one set of horizons OR one set of timepoints, not both at once? Something like that.

Contributor:

    I think you can return an array with multiple dimensions.
    I didn't have a strong use for this, so I struggled to do something general and practical.
    For instance, if you request by horizon, you can provide multiple horizon axes, and your dimensions should include 'horizon' and 'valid_time'. Similarly, you can request multiple runtimes, and then your dimensions should include 'runtime' and 'step'.
    Honestly, not sure if this is helpful or over-complicated.

Contributor (author):

    @martindurant I tried it out with one set of horizons with the original code. Actually, I'm still figuring out the reinflating part of the code, the aggregation types and the new indexes.

    I noticed that reinflating can also work with a grib_tree model made from a single GRIB file.
    @emfdavid, can you confirm this in this notebook that I made?

Using this method can significantly reduce the time required to combine
references, cutting it down to a fraction of the previous duration. The original
idea was showcased in this `talk <https://discourse.pangeo.io/t/pangeo-showcase-optimizations-for-kerchunk-aggregation-and-zarr-i-o-at-scale-for-machine-learning/4074>`_.

*How is it faster?*

The ``.idx`` file, otherwise known as an *index* file, contains the key
metadata of the messages in a GRIB file. These metadata include the `index`, `offset`, `datetime`,
`variable` and `forecast time` of the respective messages stored in the file.
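
The text ``.idx`` format is simple enough to parse with pandas alone. Below is a minimal sketch, assuming the standard colon-separated NODD ``.idx`` layout (``message:offset:d=date:variable:level:forecast:``); the helper name and the sample lines are illustrative, not kerchunk's API:

```python
import pandas as pd

def parse_idx(text: str) -> pd.DataFrame:
    """Parse a text .idx listing into a table of message byte ranges."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split(":")
        rows.append(
            {
                "idx": int(parts[0]),                 # message number
                "offset": int(parts[1]),              # start byte of the message
                "date": parts[2].removeprefix("d="),
                "varname": parts[3],
                "level": parts[4],
                "forecast": parts[5],
            }
        )
    df = pd.DataFrame(rows)
    # A message's length is the gap to the next offset; the final message
    # runs to the end of the file, so its length is unknown here (NaN).
    df["length"] = df["offset"].shift(-1) - df["offset"]
    return df

sample = """\
1:0:d=2017010106:HGT:10 mb:6 hour fcst:
2:47493:d=2017010106:TMP:10 mb:6 hour fcst:
3:66931:d=2017010106:RH:10 mb:6 hour fcst:"""
print(parse_idx(sample)[["varname", "offset", "length"]])
```

Because only these small text files are read, no bytes of the GRIB files themselves need to be fetched to build the index.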

**It follows a three-step approach:**

1. Extract and persist metadata directly from a few arbitrary GRIB
   files for a given product, such as HRRR SUBH, GEFS or GFS.
2. Use the metadata mapping to build an index table of every GRIB
   message from the ``.idx`` files.
3. Combine the index data with the metadata to build any FMRC
   slice (Horizon, RunTime, ValidTime, BestAvailable).
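
The three steps above can be sketched with plain pandas. Everything below (column names, variables, URIs) is illustrative rather than kerchunk's actual API:

```python
import pandas as pd

# Step 1 (sketch): static metadata per variable, extracted once from a few
# arbitrary GRIB files of the product.
mapping = pd.DataFrame(
    {
        "varname": ["gh", "t"],
        "typeOfLevel": ["isobaricInhPa", "isobaricInhPa"],
        "stepType": ["instant", "instant"],
    }
)

# Step 2 (sketch): byte ranges for each message of each file, parsed from
# the cheap-to-read .idx files (hypothetical URIs).
idx_table = pd.DataFrame(
    {
        "varname": ["gh", "t", "gh", "t"],
        "uri": ["s3://bucket/f06.grib2"] * 2 + ["s3://bucket/f12.grib2"] * 2,
        "offset": [0, 47493, 0, 51200],
        "length": [47493, 19438, 51200, 20480],
    }
)

# Step 3 (sketch): join byte ranges with metadata; each row is one chunk of
# one variable, ready to be turned into references for an FMRC slice.
k_index = idx_table.merge(mapping, on="varname", how="left")
print(k_index)
```

The expensive metadata extraction happens only once per product (step 1), while steps 2 and 3 are fast table operations repeated per file.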

.. tip::
    To confirm the indexing of messages, see this `notebook <https://gist.github.com/Anu-Ra-g/efa01ad1c274c1bd1c14ee01666caa77>`_.

These metadata are used to build a k_index entry for every GRIB message that we
index. The indexing process primarily relies on the `pandas <https://pandas.pydata.org/>`_ library.

.. note::
    The index in the ``.idx`` file indexes the GRIB messages, whereas the ``k_index`` (kerchunk index)
    we build as part of this workflow indexes the variables in those messages.

.. list-table:: k_index for a single GRIB file
:header-rows: 1
:widths: 5 10 15 10 20 15 10 20 20 30 10 10 10

* -
- varname
- typeOfLevel
- stepType
- name
- step
- level
- time
- valid_time
- uri
- offset
- length
- inline_value
* - 0
- gh
- isobaricInhPa
- instant
- Geopotential height
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 0
- 47493
- None
* - 1
- t
- isobaricInhPa
- instant
- Temperature
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 47493
- 19438
- None
* - 2
- r
- isobaricInhPa
- instant
- Relative humidity
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 66931
- 10835
- None
* - 3
- u
- isobaricInhPa
- instant
- U component of wind
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 77766
- 22625
- None
* - 4
- v
- isobaricInhPa
- instant
- V component of wind
- 0 days 06:00:00
- 0.0
- 2017-01-01 06:00:00
- 2017-01-01 12:00:00
- s3://noaa-gefs-pds/gefs.20170101/06/gec00.t06z...
- 100391
- 20488
- None
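
Each row of such a table ultimately becomes one zarr chunk reference: a ``[uri, offset, length]`` triple keyed by the chunk's path. A hypothetical sketch (the key layout and the URI are illustrative, not the exact keys kerchunk emits):

```python
# One k_index row corresponds to one zarr chunk reference.
row = {
    "varname": "gh",
    "uri": "s3://bucket/example.grib2",  # hypothetical URI
    "offset": 0,
    "length": 47493,
}

ref_key = f"{row['varname']}/0.0.0"  # illustrative zarr chunk key
ref_val = [row["uri"], row["offset"], row["length"]]
print({ref_key: ref_val})
```

A reader following such a reference fetches exactly that byte range from the remote file, instead of downloading the whole GRIB archive.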


*What now?*

After creating the k_index for the desired duration, we use the ``DataTree`` model
from `xarray-datatree <https://xarray-datatree.readthedocs.io/en/latest/>`_ to view
part of the aggregation, or all of it. Below is a tree model made from an aggregation of
GRIB files produced by the **GEFS** model, hosted in an AWS S3 bucket.

.. code-block:: text

DataTree('None', parent=None)
├── DataTree('prmsl')
│ │ Dimensions: ()
│ │ Data variables:
│ │ *empty*
│ │ Attributes:
│ │ name: Pressure reduced to MSL
│ └── DataTree('instant')
│ │ Dimensions: ()
│ │ Data variables:
│ │ *empty*
│ │ Attributes:
│ │ stepType: instant
│ └── DataTree('meanSea')
│ Dimensions: (latitude: 181, longitude: 360, time: 1, step: 1,
│ model_horizons: 1, valid_times: 237)
│ Coordinates:
│ * latitude (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0
│ * longitude (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
│ meanSea float64 8B ...
│ number (time, step) int64 8B ...
│ step (model_horizons, valid_times) timedelta64[ns] 2kB ...
│ time (model_horizons, valid_times) datetime64[ns] 2kB ...
│ valid_time (model_horizons, valid_times) datetime64[ns] 2kB ...
│ Dimensions without coordinates: model_horizons, valid_times
│ Data variables:
│ prmsl (model_horizons, valid_times, latitude, longitude) float64 124MB ...
│ Attributes:
│ typeOfLevel: meanSea
└── DataTree('ulwrf')
│ Dimensions: ()
│ Data variables:
│ *empty*
│ Attributes:
│ name: Upward long-wave radiation flux
└── DataTree('avg')
│ Dimensions: ()
│ Data variables:
│ *empty*
│ Attributes:
│ stepType: avg
└── DataTree('nominalTop')
Dimensions: (latitude: 181, longitude: 360, time: 1, step: 1,
model_horizons: 1, valid_times: 237)
Coordinates:
* latitude (latitude) float64 1kB 90.0 89.0 88.0 87.0 ... -88.0 -89.0 -90.0
* longitude (longitude) float64 3kB 0.0 1.0 2.0 3.0 ... 357.0 358.0 359.0
nominalTop float64 8B ...
number (time, step) int64 8B ...
step (model_horizons, valid_times) timedelta64[ns] 2kB ...
time (model_horizons, valid_times) datetime64[ns] 2kB ...
valid_time (model_horizons, valid_times) datetime64[ns] 2kB ...
Dimensions without coordinates: model_horizons, valid_times
Data variables:
ulwrf (model_horizons, valid_times, latitude, longitude) float64 124MB ...
Attributes:
typeOfLevel: nominalTop
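
Under the hood, a tree like this is opened through fsspec's ``reference://`` filesystem. A minimal, self-contained sketch with a hand-written, hypothetical reference set; in a real aggregation the chunk keys point at byte ranges inside the remote GRIB files:

```python
import json

import fsspec

# A tiny, hypothetical reference set. Real chunk references look like
#   "prmsl/instant/meanSea/prmsl/0.0.0.0": ["s3://...grib2", offset, length]
refs = {
    "version": 1,
    "refs": {".zgroup": json.dumps({"zarr_format": 2})},
}

fs = fsspec.filesystem("reference", fo=refs)
print(fs.cat(".zgroup"))
```

The resulting filesystem (or its mapper) can then be handed to zarr/xarray, which is how the DataTree above reads its data lazily.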


.. tip::
    For a full tutorial on this workflow, refer to this `kerchunk cookbook <https://projectpythia.org/kerchunk-cookbook/README.html>`_
    in `Project Pythia <https://projectpythia.org/>`_.

.. raw:: html

<script data-goatcounter="https://kerchunk.goatcounter.com/count"
async src="//gc.zgo.at/count.js"></script>