Future vision and features of s2spy #135
Thanks for summarizing the status. A few comments.
ESMValTool is not built on CDO. It is built on Iris, which uses NumPy and Dask. This is a good way to get good performance out of Python.
Related to another point raised by @geek-yang about ESMValTool: ESMValTool is heavy because it includes a big collection of user-contributed scripts that make use of all kinds of dependencies (including Python, R, NCL, ...). However, ESMValCore is quite lightweight, so including that as a dependency is perhaps not too much of a concern. I think that's actually the most interesting part for us.
We had some discussion, but I hadn't put it here yet. This "roadmap" looks good to me. We can try to come up with some use cases with different frameworks (e.g. PyTorch, TensorFlow) and see how we could facilitate them in a generic way. Based on my experience, especially for deep learning, those frameworks all have their own data structures for defining tensors, which makes it a bit difficult for us to generalize. I think there are more opportunities in XAI, as there is currently no best practice for it, especially for S2S and the whole climate community.
Final goal
s2spy should be able to run full ML pipelines (recipes) from a simple text file, and these analyses should be scalable to HPC systems. Furthermore, it should be possible to run the ML pipeline on multiple Earth System Models (ESMs) and transfer this to the real world (sometimes called transfer learning). For scaling to ESMs, we will likely use ESMValTool for the preprocessing (and downloading) because it is built upon CDO (written in C++), making its performance better than doing this in Python. One of the pioneering studies applying transfer learning is Ham et al. 2019; it is similar to the CNN use case below.
In order to build ML pipelines from text recipes, we want to leverage well-known ML Python packages (such as scikit-learn, PyTorch, Keras, and TensorFlow) and guide the data through our ML pipeline (related to #89, originally #71). We might need to write some small wrapper functions to simplify this; however, what this should look like is currently unknown and open for debate. A rough sketch is given below.
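To make this concrete, here is a purely hypothetical sketch, assuming PyYAML and scikit-learn; the recipe keys and the `registry` mapping are made up for illustration and are not an existing s2spy format:

```python
import yaml  # PyYAML
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical recipe text; these keys are illustrative, not an existing s2spy format.
recipe = yaml.safe_load("""
steps:
  - scale
  - ridge
""")

# A small wrapper could map recipe step names to estimators (this registry is made up).
registry = {"scale": StandardScaler, "ridge": Ridge}
pipeline = Pipeline([(name, registry[name]()) for name in recipe["steps"]])
print(pipeline)  # Pipeline(steps=[('scale', StandardScaler()), ('ridge', Ridge())])
```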
Current status
At this stage (November 2022), we continue to work on the fundamentals of s2spy. These fundamentals encompass:
Example use cases
To clarify the need for and prioritization of our next steps, I present some use cases related to the input and output shapes of the data (each followed by a minimal code sketch):
Regression/classification/clustering models (scikit-learn, simple artificial neural networks):
Input shape: (to my understanding) always (n_samples, n_features).
Target shape: (n_samples) or (n_samples, n_classes)
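A minimal sketch of these shapes with scikit-learn, using random data purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 5))             # (n_samples, n_features)
y = rng.integers(0, 2, 100)          # (n_samples,), here two classes

model = LogisticRegression().fit(X, y)
print(model.predict(X).shape)        # (100,)    -> (n_samples,)
print(model.predict_proba(X).shape)  # (100, 2)  -> (n_samples, n_classes)
```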
LSTM (TensorFlow)
Input shape: (n_samples, n_lags, n_features).
Target shape: (n_samples) or (n_samples, n_classes)
This example stems from the Lorentz XAI workshop 2022
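A minimal sketch of these shapes with Keras, using random data; the layer sizes are arbitrary:

```python
import numpy as np
from tensorflow import keras

n_samples, n_lags, n_features = 100, 4, 5
X = np.random.rand(n_samples, n_lags, n_features).astype("float32")
y = np.random.rand(n_samples).astype("float32")  # (n_samples,) regression target

model = keras.Sequential([
    keras.layers.Input(shape=(n_lags, n_features)),  # the samples dimension is implicit
    keras.layers.LSTM(16),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=1, verbose=0)
```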
CNNs (PyTorch)
Uses the torch.utils.data.DataLoader class to load batches of the predictors and predictands (for the train dataset).
```python
from torch.utils.data import DataLoader

# training_data and test_data are torch Dataset objects holding predictors and predictands
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=False)  # no need to shuffle for evaluation
```
Input shape: (n_samples, n_channels, n_latitude, n_longitude). n_channels can hold different variables and/or different lags.
Target shape: (n_samples) or (n_samples, n_classes)
This example stems from the AI4ESS 2020 summer school.
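A minimal sketch of these shapes with PyTorch, using random data; the channel count, grid size, and network layers are arbitrary:

```python
import torch
from torch import nn

# 32 samples, 3 channels (e.g. variables and/or lags), on a 64x64 lat/lon grid
x = torch.randn(32, 3, 64, 64)  # (n_samples, n_channels, n_latitude, n_longitude)

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # collapse the spatial dimensions
    nn.Flatten(),
    nn.Linear(8, 2),          # two classes -> output (n_samples, n_classes)
)
print(model(x).shape)  # torch.Size([32, 2])
```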
*There are likely some more exotic ML algorithms that might require different input shapes, but these three are, I think, the most widely used and thus the most important to support.
**Different libraries expect different shapes.