Future vision and features of s2spy #135
Thanks for summarizing the status. A few comments.
ESMValTool is not built on CDO. It is built on Iris, which uses NumPy and Dask. This is a good way to get good performance out of Python.
Related to another point raised by @geek-yang about ESMValTool: ESMValTool is heavy because it includes a big collection of user-contributed scripts that make use of all kinds of dependencies (including Python, R, NCL, ...). However, ESMValCore is quite lightweight, so including that as a dependency is perhaps not too much of a concern. I think that's actually the most interesting part for us.
We had some discussion, but I hadn't put it here yet. This "roadmap" looks good to me. We can try to come up with some use cases with different frameworks (e.g. PyTorch, TensorFlow) and see how we could facilitate them in a generic way. Based on my experience, especially for deep learning, those frameworks all have their own data structures for defining tensors, which makes it a bit difficult for us to generalize. I think there are more opportunities in XAI, as there is currently no best practice for it, especially for S2S and the whole climate community.
Final goal
s2spy should be able to run full ML pipelines (recipes) from a simple text file, and these analyses should be scalable to HPC systems. Furthermore, it should be possible to run the ML pipeline on multiple Earth System Models (ESMs) and transfer this to the real world (sometimes called transfer learning). For scaling to ESMs, we will likely use ESMValTool for the preprocessing (and downloading) because it is built upon CDO (written in C++), making its performance better than doing this in Python. One of the pioneering studies applying transfer learning is Ham et al. 2019; it is similar to the CNN use case below.
In order to build ML pipelines from text recipes, we want to leverage well-known ML Python packages (such as scikit-learn, PyTorch, Keras, and TensorFlow) and guide the data through our ML pipeline (related to #89, originally #71). We might need to write some small wrapper functions to simplify this; however, what this should look like is currently unknown and open for debate. A rough sketch is given below.
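To make this concrete, here is a purely hypothetical sketch, assuming PyYAML and scikit-learn; the recipe keys and the `registry` mapping are made up for illustration and are not an existing s2spy format:

```python
import yaml  # PyYAML
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical recipe text; these keys are illustrative, not an existing s2spy format.
recipe = yaml.safe_load("""
steps:
  - scale
  - ridge
""")

# A small wrapper could map recipe step names to estimators (this registry is made up).
registry = {"scale": StandardScaler, "ridge": Ridge}
pipeline = Pipeline([(name, registry[name]()) for name in recipe["steps"]])
print(pipeline)  # Pipeline(steps=[('scale', StandardScaler()), ('ridge', Ridge())])
```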
Current status
At this stage (November 2022), we continue to work on the fundamentals of s2spy. These fundamentals encompass:
Example use cases
To clarify the need for and prioritization of our next steps, I present some use cases related to the input and output shapes of the data (each followed by a minimal code sketch):
Regression/classification/clustering models (scikit-learn, simple artificial neural networks):
Input shape: (to my understanding) always (n_samples, n_features).
Target shape: (n_samples) or (n_samples, n_classes)
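A minimal sketch of these shapes with scikit-learn, using random data purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((100, 5))             # (n_samples, n_features)
y = rng.integers(0, 2, 100)          # (n_samples,), here two classes

model = LogisticRegression().fit(X, y)
print(model.predict(X).shape)        # (100,)    -> (n_samples,)
print(model.predict_proba(X).shape)  # (100, 2)  -> (n_samples, n_classes)
```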
LSTM (TensorFlow)
Input shape: (n_samples, n_lags, n_features).
Target shape: (n_samples) or (n_samples, n_classes)
This example stems from the Lorentz XAI workshop 2022
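A minimal sketch of these shapes with Keras, using random data; the layer sizes are arbitrary:

```python
import numpy as np
from tensorflow import keras

n_samples, n_lags, n_features = 100, 4, 5
X = np.random.rand(n_samples, n_lags, n_features).astype("float32")
y = np.random.rand(n_samples).astype("float32")  # (n_samples,) regression target

model = keras.Sequential([
    keras.layers.Input(shape=(n_lags, n_features)),  # the samples dimension is implicit
    keras.layers.LSTM(16),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=1, verbose=0)
```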
CNNs (PyTorch)
Uses the torch.utils.data.DataLoader class to load batches of the predictors and predictands (for the train dataset).
```python
from torch.utils.data import DataLoader

# training_data and test_data are torch Dataset objects holding predictors and predictands
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=False)  # no need to shuffle for evaluation
```
Input shape: (n_samples, n_channels, n_latitude, n_longitude). n_channels can hold different variables and/or different lags.
Target shape: (n_samples) or (n_samples, n_classes)
This example stems from the AI4ESS 2020 summer school.
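A minimal sketch of these shapes with PyTorch, using random data; the channel count, grid size, and network layers are arbitrary:

```python
import torch
from torch import nn

# 32 samples, 3 channels (e.g. variables and/or lags), on a 64x64 lat/lon grid
x = torch.randn(32, 3, 64, 64)  # (n_samples, n_channels, n_latitude, n_longitude)

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # collapse the spatial dimensions
    nn.Flatten(),
    nn.Linear(8, 2),          # two classes -> output (n_samples, n_classes)
)
print(model(x).shape)  # torch.Size([32, 2])
```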
*There are likely some more exotic ML algorithms that might require different input shapes, but these three are, I think, the most widely used and thus the most important to support.
**Different libraries expect different shapes.