Suggestions for RF inference notebooks #17
Just a few questions/suggestions for improvements of the notebooks:

- Try to avoid `.values` as much as possible: this triggers immediate loading of data/computations, which can result in memory usage peaks on the node running the Jupyter server and heavy communications between workers and the client (see the first sketch after this list).
- Load the data in chunks (`chunk=...` argument to the `xr.open_dataset` functions). For large files, you would ideally use the same chunk sizes used for the .nc files, for others you could use the same value consistently (maybe something like 50 in both x and y?).
- Some variables are loaded with the `netcdf4` library (I believe), some others are loaded with `rasterio` as engine. Why? Note that the latter is deprecated, and you should do instead `rioxarray.open_rasterio`, which loads the variable directly as a `DataArray` instead of a `Dataset` (second sketch below).
- Use more descriptive variable names (e.g. `all` and `result1`).
- `x = x.sel(longitude=y.longitude, latitude=y.latitude, method='nearest', tolerance=0.01)` - I see that you had to play with the values in the slices in order to match the coordinates of the different arrays (third sketch below).
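On the first point, a minimal sketch of the lazy pattern (the file name `features.nc` and variable name `t2m` are placeholders, not the actual notebook inputs): keep the arrays Dask-backed and trigger computation once at the end, instead of calling `.values` on intermediate results.

```python
import xarray as xr

# Open lazily: with `chunks` given, nothing is read into memory yet.
# "features.nc" and "t2m" are hypothetical names.
ds = xr.open_dataset("features.nc", chunks={"latitude": 50, "longitude": 50})

# Build the computation lazily -- no `.values` on intermediate results.
anomaly = ds["t2m"] - ds["t2m"].mean(dim="time")

# Trigger the computation once, on the Dask workers, at the very end.
result = anomaly.compute()
```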
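On the deprecated engine, a sketch of the suggested replacement (the GeoTIFF path is a placeholder): `rioxarray.open_rasterio` returns a `DataArray` directly and also accepts `chunks` for lazy loading.

```python
import rioxarray

# Hypothetical path; returns a DataArray (not a Dataset), lazily when chunked.
vcmax = rioxarray.open_rasterio("Vcmo.tif", chunks={"x": 50, "y": 50})
```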
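And on the last point, a self-contained toy example of the nearest-neighbour matching (all coordinate values are made up): two grids offset by a fraction of the tolerance get aligned without hand-tuned slices.

```python
import numpy as np
import xarray as xr

# Two toy arrays whose grids are offset by a small amount (made-up values).
y = xr.DataArray(
    np.zeros((3, 3)),
    dims=("latitude", "longitude"),
    coords={"latitude": [50.0, 50.1, 50.2], "longitude": [4.0, 4.1, 4.2]},
)
x = xr.DataArray(
    np.ones((3, 3)),
    dims=("latitude", "longitude"),
    coords={"latitude": [50.001, 50.101, 50.201],
            "longitude": [4.001, 4.101, 4.201]},
)

# Nearest-neighbour matching within 0.01 degrees replaces manual slicing;
# points further apart than the tolerance raise an error instead of
# silently pairing the wrong cells.
x_aligned = x.sel(longitude=y.longitude, latitude=y.latitude,
                  method="nearest", tolerance=0.01)
```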
Thanks for your advice, Francesco. I will update the related slicing code after #19. You are right, the reference ERA5 dataset was not used in the Europe notebook. But if I plan to use ...
I see these 3 suggestions are all about how to load data. You are right: I checked, and Vcmo and landcover were loaded with `xr.open_rasterio(engine="rasterio")`. From your suggestions, I take it we should load all variables with `xr.open_dataset(chunk=...)`? There is one thing I did not understand: "For large files, you would ideally use the same chunk sizes used for the .nc files". Could you explain what "large files" and ".nc files" refer to here? I will rename variables like `all` and `result1` after we fix the above problems.
For this suggestion, I will check all the places I use ...
Thanks for having a look @QianqianHan96! We can definitely go through the points above on one of the co-working days.
I think all your input data is stored in netCDF files (with extension .nc). In the most recent version of netCDF (which I think is what we are using), data can be stored in compressed blocks. Ideally, you would align the chunks that Dask uses to read the data with the blocks in the data files.
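A minimal sketch of how this alignment could be checked, assuming a hypothetical file `era5.nc` with a variable `t2m` (both names are placeholders): the `netCDF4` library can report the compressed block sizes stored in the file, and those can then be passed on to `xr.open_dataset`.

```python
import netCDF4
import xarray as xr

path = "era5.nc"  # hypothetical file name

# Report the on-disk block ("chunk") layout of the compressed variable.
with netCDF4.Dataset(path) as nc:
    var = nc.variables["t2m"]  # hypothetical variable name
    print(var.dimensions)      # e.g. ('time', 'latitude', 'longitude')
    print(var.chunking())      # 'contiguous', or a list of block sizes

# Align the Dask chunks with the reported blocks (values here are examples).
ds = xr.open_dataset(path, chunks={"time": 1, "latitude": 50, "longitude": 50})
```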
I removed ...
Hi Francesco, I managed to run the prediction for 15 degrees (364 seconds), 20 degrees (615 seconds), and 25 degrees (1409 seconds) with 32 CPUs and 240 GB of memory, but for 30 degrees (1403 seconds) we need 64 CPUs and 480 GB of memory.