How to select Europe-area ERA5-Land data and export it efficiently #21
I tried different chunk sizes (in space and time), but with `open_mfdataset` the chunk size along time is determined by the individual NetCDF files, so if I want a different time chunking I have to rechunk after loading. However, I read that rechunking after `open_mfdataset` is not recommended (https://docs.xarray.dev/en/stable/user-guide/dask.html#parallel-computing-with-dask). I have not found a chunk size that makes the NetCDF export fast.
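(A minimal sketch of the pattern described above; the file pattern and chunk sizes are placeholders, not values from the actual notebooks:)

```python
import xarray as xr

# Open many NetCDF files lazily; chunking only in space means the time
# chunking follows the individual file boundaries.
ds = xr.open_mfdataset(
    "era5land_*.nc",                             # hypothetical file pattern
    chunks={"latitude": 500, "longitude": 500},  # spatial chunking only
    parallel=True,
)

# Rechunking along time afterwards works, but as the xarray docs warn it
# can be expensive: each output chunk may need pieces of many input files.
ds = ds.chunk({"time": 24, "latitude": 500, "longitude": 500})
```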
Hi @QianqianHan96, just some random thoughts on the issue. Writing to a single NetCDF file in parallel is always tricky; have you considered other formats (Zarr, for instance)? Also, xarray supports different libraries ("engines") for reading/writing NetCDF files, and they have very different performance when running in parallel. Finally, having a few workers with many threads versus many workers with one or a few threads can make a difference when reading NetCDF files (though this also depends on the engine used).
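(A minimal sketch of those knobs; the worker/thread counts, engine choice, and paths are illustrative, not recommendations:)

```python
import xarray as xr
from dask.distributed import Client

# The worker/thread balance can matter for parallel NetCDF reads;
# these numbers are only an example.
client = Client(n_workers=4, threads_per_worker=8)

ds = xr.open_mfdataset(
    "era5land_*.nc",        # hypothetical file pattern
    engine="netcdf4",       # alternatives include "h5netcdf"
    chunks={"latitude": 500, "longitude": 500},
    parallel=True,
)

# Zarr stores each chunk as a separate object, so parallel writes avoid
# contending on a single output file the way NetCDF writes do.
# (Zarr requires uniform chunk sizes, so a rechunk may be needed first.)
ds.to_zarr("era5land_europe.zarr", mode="w")
```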
Hi Francesco, thanks for your advice. I just tried exporting to Zarr, but the speed seems similar to exporting to NetCDF (most of the tasks in the dask dashboard are "open_dataset" and "concatenate", so the bottleneck is probably still reading the data?). I also checked xarray's engines; netcdf4 seems to be the best one for now. I also tried 20 workers with 2 threads and 4 workers with 8 threads; neither worked well.
I found that using the `preprocess` parameter of `open_mfdataset` to clip the data is faster than loading the global data and then clipping. However, ERA5-Land longitudes are in [0, 360], while Europe spans [-31.28903052, 68.93136141] in longitude, i.e. [328.71096947, 360] and [0, 68.93136141], so I cannot clip the whole of Europe in a single `sel` call. When I only export [0, 68.93136141] with `preprocess`, it took … In my opinion there are two ways to clip the two European parts: 1) in the `preprocess` function, convert longitudes from [0, 360] to [-180, 180] so Europe becomes one contiguous area that can be clipped in one go, but this longitude conversion seems computationally heavy and the dask dashboard reports very high unmanaged memory; 2) in the `preprocess` function, clip the two parts separately and then merge them, which raises no errors or warnings but is much slower than exporting a single contiguous area; it took …
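(A sketch of the two options described above; only the longitude bounds come from the comment, while the latitude bounds and file pattern are hypothetical:)

```python
import xarray as xr

LAT_SLICE = slice(72.0, 34.0)  # hypothetical bounds; ERA5-Land latitude is descending

# Option 1: shift longitudes from [0, 360] to [-180, 180] so Europe is one
# contiguous box. The sortby forces a global reshuffle, consistent with the
# high unmanaged memory seen on the dashboard.
def preprocess_shift(ds):
    ds = ds.assign_coords(longitude=((ds.longitude + 180) % 360) - 180)
    ds = ds.sortby("longitude")
    return ds.sel(longitude=slice(-31.28903052, 68.93136141), latitude=LAT_SLICE)

# Option 2: clip the two longitude ranges separately and concatenate them.
def preprocess_two_parts(ds):
    west = ds.sel(longitude=slice(328.71096947, 360.0))
    east = ds.sel(longitude=slice(0.0, 68.93136141))
    west = west.assign_coords(longitude=west.longitude - 360.0)
    return xr.concat([west, east], dim="longitude").sel(latitude=LAT_SLICE)

ds = xr.open_mfdataset(
    "era5land_*.nc",                  # hypothetical file pattern
    preprocess=preprocess_shift,      # or preprocess_two_parts
    parallel=True,
)
```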
In notebook https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land_clipDuringLoad_allLon.ipynb, with method …
Not sure if I am missing some context 😅. For such a big NetCDF dataset, loading and clipping it in one go will of course be slow. We can use some tricks to accelerate the process a bit, but since we only need to do it once, I don't think it is necessary to put too much effort into it. Just clip and save the preprocessed data. If space is an issue, we can move the global data to tape and keep only the European data on disk.
Hi Yang, actually we need to do this for 20 years over Europe, and if we need to do it for other areas it will be even more; that's why I have put so much time into this. Luckily, I have now found a fast way to do it (3 min 40 s to export the Europe area for one year). Maybe the last thing: could you and Francesco check tomorrow whether the method I am using now is correct?
Ok, that's good to know. Then let's take a look tomorrow!
Hi Francesco and Yang,
I found that the exported file inputData+year+"/era5land/era5land2015_10km1.nc" in cell 21 of notebook https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/0921_1year_Europe.ipynb is not complete: it has no data in 8088 bands. I used https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land.ipynb to export the Europe-area ERA5-Land data (because I downloaded global ERA5-Land data), but it is very slow (around 60 GB of data takes several hours to export). If we instead do the selection in the predict notebook, its chunk size differs from the other variables, which makes prediction take more time, so in my opinion it is better to do this in the preprocessing notebook. Do you know a faster way to export big NetCDF files?
@geek-yang @fnattino
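(For reference, a minimal sketch of exporting a clipped subset to NetCDF with a deferred dask write; paths and chunk sizes are illustrative, and the longitude bound reuses a value from the discussion above:)

```python
import xarray as xr
from dask.diagnostics import ProgressBar

ds = xr.open_mfdataset(
    "era5land_global_*.nc",          # hypothetical file pattern
    chunks={"latitude": 500, "longitude": 500},
    parallel=True,
)

# One of the two European longitude ranges from the discussion above.
europe = ds.sel(longitude=slice(0.0, 68.93136141))

# compute=False returns a dask delayed object, so the (slow) write can be
# triggered explicitly and monitored rather than running eagerly.
write = europe.to_netcdf("era5land2015_europe.nc", compute=False)
with ProgressBar():
    write.compute()
```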