Suggestions for RF inference notebooks #17

Open
fnattino opened this issue Sep 29, 2023 · 6 comments

Comments

@fnattino
Collaborator

Just a few questions/suggestions for improving the notebooks:

  • When using Xarray objects, avoid the use of .values as much as possible: this triggers immediate loading of data/computations, which can result in memory usage peaks on the node running the Jupyter server and heavy communication between the workers and the client.
  • Avoid re-chunking as much as possible: right now some of the variables are loaded directly as Dask arrays, while others are not and are converted to Dask arrays later on by re-chunking. This is bad because it results in a lot of communication (and re-chunking tasks). I would suggest loading all variables as Dask arrays (thus adding the chunks=... argument to the xr.open_dataset calls). For large files, you would ideally use the same chunk sizes used in the .nc files; for the others, you could use the same value consistently (maybe something like 50 in both x and y?).
  • Some of the variables are read using the netcdf4 library (I believe), while others are loaded with rasterio as the engine. Why? Note that the latter is deprecated; you should use rioxarray.open_rasterio instead, which loads the variable directly as a DataArray instead of a Dataset.
  • Try to separate conversions and transformations of the data from the more implementation-specific tasks (loading/re-chunking), so it is clearer where the "physical" operations are. Ideally, apply transformations after having chunked the data in the most suitable way. Also, use descriptive variable names throughout (try to avoid names like all and result1).
  • If you want to index an array to match the coordinates of another array in a robust way, you can use: x = x.sel(longitude=y.longitude, latitude=y.latitude, method='nearest', tolerance=0.01) - I see that you had to play with the values in the slices to match the coordinates of the different arrays (see the sketch after this list).
  • In the Europe notebook, is loading the ERA5 dataset that you previously used as a template still needed? I don't see it used anywhere.
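
A minimal sketch of what the chunked loading and coordinate alignment could look like (the file names, variable names, and chunk sizes below are placeholders, not the ones actually used in the notebooks):

```python
import xarray as xr
import rioxarray

# Open the netCDF inputs lazily as Dask arrays by passing `chunks`
# (ideally matching the chunk sizes stored in the .nc files).
era5 = xr.open_dataset("era5.nc", chunks={"latitude": 50, "longitude": 50})

# Open raster inputs with rioxarray instead of the deprecated
# engine="rasterio"; this returns a (chunked) DataArray directly.
vcmax = rioxarray.open_rasterio("vcmax.tif", chunks={"x": 50, "y": 50})
vcmax = vcmax.squeeze("band", drop=True).rename({"x": "longitude", "y": "latitude"})

# Align one array onto the coordinates of another without manual slicing.
vcmax = vcmax.sel(
    longitude=era5.longitude, latitude=era5.latitude,
    method="nearest", tolerance=0.01,
)
```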
@QianqianHan96
Collaborator

QianqianHan96 commented Oct 2, 2023

  • If you want to index an array to match the coordinates of another array in a robust way, you can use: x = x.sel(longitude=y.longitude, latitude=y.latitude, method='nearest', tolerance=0.01) - I see that you had to play with the values in the slices to match the coordinates of the different arrays.
  • In the Europe notebook, is loading the ERA5 dataset that you previously used as a template still needed? I don't see it used anywhere.

Thanks for your advice, Francesco. I will update the related slicing code after #19.

You are right, the reference ERA5 dataset is not used in the Europe notebook. But if I use x = x.sel(longitude=y.longitude, latitude=y.latitude, method='nearest', tolerance=0.01), I would use the ERA5 dataset as "y", so I will keep it for now.

@QianqianHan96
Collaborator

  • Avoid re-chunking as much as possible: right now some of the variables are loaded directly as Dask arrays, while others are not and are converted to Dask arrays later on by re-chunking. This is bad because it results in a lot of communication (and re-chunking tasks). I would suggest loading all variables as Dask arrays (thus adding the chunks=... argument to the xr.open_dataset calls). For large files, you would ideally use the same chunk sizes used in the .nc files; for the others, you could use the same value consistently (maybe something like 50 in both x and y?).
  • Some of the variables are read using the netcdf4 library (I believe), while others are loaded with rasterio as the engine. Why? Note that the latter is deprecated; you should use rioxarray.open_rasterio instead, which loads the variable directly as a DataArray instead of a Dataset.
  • Try to separate conversions and transformations of the data from the more implementation-specific tasks (loading/re-chunking), so it is clearer where the "physical" operations are. Ideally, apply transformations after having chunked the data in the most suitable way. Also, use descriptive variable names throughout (try to avoid names like all and result1).

These three suggestions are all about how to load the data. You are right: I checked, and Vcmo and landcover were indeed loaded with engine="rasterio". From your suggestions, I understand that we should load all variables with xr.open_dataset(chunks=...)?

There is one thing I did not understand: "For large files, you would ideally use the same chunk sizes used in the .nc files". Could you explain what "large files" and ".nc files" refer to here?

I will modify variable names like "all" and "result1" after we fix the above problems.

@QianqianHan96
Collaborator

QianqianHan96 commented Oct 3, 2023

  • When using Xarray objects, avoid the use of .values as much as possible: this triggers immediate loading of data/computations, which can result in memory usage peaks on the node running the Jupyter server and heavy communication between the workers and the client.

For this suggestion, I will check all the places where I use .values and discuss with you which ones can be removed.

@fnattino
Collaborator Author

fnattino commented Oct 3, 2023

Thanks for having a look @QianqianHan96! We can definitely go through the points above on one of the co-working days.

There is one thing I did not understand: "For large files, you would ideally use the same chunk sizes used in the .nc files". Could you explain what "large files" and ".nc files" refer to here?

I think all your input data is stored in netCDF files (with extension .nc). With the most recent version of netCDF (which I think is what we are using), data can be stored in compressed blocks. Ideally, you would align the chunks that Dask uses to read the data with the blocks in the data files.
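
For example, one way to check and reuse the on-disk chunking (sketch only; the file and variable names here are placeholders):

```python
import xarray as xr

# Inspect the on-disk (compressed) chunk sizes of a netCDF-4 variable;
# "era5.nc" and "t2m" are placeholder names for the actual file/variable.
ds = xr.open_dataset("era5.nc")
print(ds["t2m"].encoding.get("chunksizes"))  # e.g. (1, 100, 100), or None if contiguous

# Passing chunks={} asks xarray/Dask to use the chunks preferred by the backend,
# which for netCDF-4 files follow the compressed blocks stored in the file.
ds = xr.open_dataset("era5.nc", chunks={})
```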

@QianqianHan96
Collaborator

  • When using Xarray objects, avoid the use of .values as much as possible: this triggers immediate loading of data/computations, which can result in memory usage peaks on the node running the Jupyter server and heavy communication between the workers and the client.

For this suggestion, I will check all the places where I use .values and discuss with you which ones can be removed.

I removed .values when calculating Rin and Rli; it is much faster now.
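
A minimal illustration of the difference (here era5.nc and ssrd are placeholder names standing in for the actual inputs of Rin/Rli):

```python
import xarray as xr

era5 = xr.open_dataset("era5.nc", chunks={})  # lazy, Dask-backed dataset

# Eager: .values first pulls the full array into the Jupyter server's memory.
Rin_eager = era5["ssrd"].values / 3600

# Lazy: the same operation on the DataArray only builds a Dask task graph;
# the data is loaded and computed chunk by chunk on the workers when needed.
Rin_lazy = era5["ssrd"] / 3600
```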

@QianqianHan96
Collaborator

QianqianHan96 commented Oct 25, 2023
