
how to select Europe area ERA5land data and export efficiently #21

Open
QianqianHan96 opened this issue Oct 5, 2023 · 8 comments

@QianqianHan96
Collaborator

QianqianHan96 commented Oct 5, 2023

Hi Francesco and Yang,

I found that the exported file `inputData + year + "/era5land/era5land2015_10km1.nc"` in cell 21 of the notebook https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/0921_1year_Europe.ipynb is incomplete: it has no data in 8088 bands. I used https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land.ipynb to export the Europe-area ERA5-Land data (because I downloaded the global ERA5-Land data), but it is very slow: exporting around 60 GB of data takes several hours. If we instead do the selection in the prediction notebook, its chunk size differs from the other variables', which makes prediction take longer, so in my opinion it is better to do this in the preprocessing notebook. Do you know a faster way to export large NetCDF files?
@geek-yang @fnattino

@QianqianHan96
Collaborator Author

QianqianHan96 commented Oct 6, 2023

I tried different chunk sizes (in space and time), but with `open_mfdataset` the chunk size along time depends on the individual NetCDF files, so I would have to rechunk along time after loading with `open_mfdataset` if I want a different time chunking. However, I read that rechunking after `open_mfdataset` is not recommended (https://docs.xarray.dev/en/stable/user-guide/dask.html#parallel-computing-with-dask). I have not yet found a chunk size that exports the NetCDF quickly.
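To make the trade-off concrete, here is a minimal, self-contained sketch with synthetic stand-in files (all file names, variable names, and sizes are made up for illustration): the target chunking is passed to `open_mfdataset` at open time rather than applied afterwards with `.chunk()`.

```python
# Illustrative sketch: pass chunks to open_mfdataset instead of
# rechunking afterwards. All paths and sizes are made up.
import os
import tempfile

import numpy as np
import xarray as xr

tmp = tempfile.mkdtemp()
# Two tiny files stand in for the per-period ERA5-Land inputs.
for i in range(2):
    part = xr.Dataset(
        {"t2m": (("time", "latitude", "longitude"),
                 np.random.rand(4, 6, 8))},
        coords={"time": np.arange(i * 4, (i + 1) * 4),
                "latitude": np.linspace(30, 70, 6),
                "longitude": np.linspace(0, 70, 8)},
    )
    part.to_netcdf(os.path.join(tmp, f"part{i}.nc"))

# Chunks given here are applied while each file is opened, so each Dask
# chunk is read straight from disk; a later .chunk() call would instead
# stack a rechunk layer on top of the per-file chunks.
combined = xr.open_mfdataset(
    os.path.join(tmp, "part*.nc"),
    combine="by_coords",
    chunks={"time": 4, "latitude": 3, "longitude": 4},
)
print(combined.t2m.shape)  # (8, 6, 8)
```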

@fnattino
Collaborator

fnattino commented Oct 6, 2023

Hi @QianqianHan96, just some random thoughts on the issue. Writing to a single NetCDF file in parallel is always tricky; have you considered other formats (Zarr, for instance)? Also, Xarray supports different libraries ("engines") to read/write NetCDF files, and they have very different performance when running things in parallel. Finally, having a few workers with many threads versus many workers with one or a few threads can make a difference when reading NetCDF files (though this also depends on the engine used).

@QianqianHan96
Collaborator Author

QianqianHan96 commented Oct 6, 2023

> Writing to a single NetCDF file in parallel is always tricky, have you considered other formats (Zarr for instance?). […]

Hi Francesco, thanks for your advice. I just tried exporting to Zarr, but the speed seems similar to exporting NetCDF (most tasks in the Dask dashboard are "open_dataset" and "concatenate", so the bottleneck is probably still reading the data?). I also checked Xarray's engines; netcdf4 seems to be the best one for now. I also tried 20 workers with 2 threads and 4 workers with 8 threads; neither worked well.
Do you think it is better to clip the Europe-area ERA5-Land data in the preprocessing notebook or in the prediction notebook? `open_mfdataset` takes a lot of time and produces many tasks in the Dask dashboard, so in my opinion it is better to clip in the preprocessing notebook and export to a single NetCDF/Zarr file. In that case, when you have time, could you have a look at https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land.ipynb? Thanks for your help!

@QianqianHan96
Collaborator Author

QianqianHan96 commented Oct 25, 2023

I found that using the "preprocess" parameter of `open_mfdataset` to clip the data is faster than loading the global data and clipping afterwards. However, ERA5-Land longitudes are in [0, 360], while the European longitude range is [-31.28903052, 68.93136141], i.e. [328.71096947, 360] plus [0, 68.93136141], so I cannot clip the whole of Europe with a single `sel` call. When I export only [0, 68.93136141] with "preprocess", it takes 1 min 41 s with 8 workers and 8 threads on 32 CPUs and 240 GB of memory (https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land_clipDuringLoad_onlyLonGT0.ipynb).
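The `preprocess` approach can be sketched with synthetic stand-in files; only the 68.93136141 longitude bound comes from the text above, everything else (file pattern, variable names, grid) is made up:

```python
# Illustrative sketch (synthetic files): clip each file while it is
# opened via the `preprocess` argument of open_mfdataset.
import os
import tempfile

import numpy as np
import xarray as xr

def clip_lon_ge_zero(ds):
    # ERA5-Land longitudes run 0..360; keep only [0, 68.93136141].
    return ds.sel(longitude=slice(0, 68.93136141))

tmp = tempfile.mkdtemp()
for i in range(2):
    xr.Dataset(
        {"t2m": (("time", "longitude"), np.random.rand(2, 360))},
        coords={"time": [2 * i, 2 * i + 1],
                "longitude": np.arange(0.0, 360.0)},
    ).to_netcdf(os.path.join(tmp, f"part{i}.nc"))

# Each file is clipped as it is opened, so the combined dataset never
# carries the global grid through the task graph.
clipped = xr.open_mfdataset(
    os.path.join(tmp, "part*.nc"),
    combine="by_coords",
    preprocess=clip_lon_ge_zero,
)
print(clipped.t2m.shape)  # (4, 69) on this 1-degree toy grid
```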

There are two ways to clip the two European parts, in my opinion: 1) in the "preprocess" function, convert [0, 360] to [-180, 180] so the region becomes contiguous and can be clipped in one go, but this longitude conversion seems computationally heavy (the Dask dashboard reports too much unmanaged memory); 2) in the "preprocess" function, clip the two parts separately and then merge them; this gives no errors or warnings but is much slower than exporting a single contiguous area: 33 min 55 s with 8 workers and 8 threads on 32 CPUs and 240 GB of memory (https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land_clipDuringLoad_allLon.ipynb).
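Both options can be sketched on a synthetic 1-degree global grid; the longitude bounds are the ones quoted above, while the variable and dataset names are made up:

```python
# Illustrative sketch of both clipping options on a synthetic grid.
import numpy as np
import xarray as xr

lon = np.arange(0.0, 360.0)
ds = xr.Dataset(
    {"t2m": (("time", "longitude"), np.random.rand(2, lon.size))},
    coords={"time": [0, 1], "longitude": lon},
)

# Option 2): clip the two European bands separately, then concatenate.
west = ds.sel(longitude=slice(328.71096947, 360.0))  # the [-31.29, 0) part
east = ds.sel(longitude=slice(0.0, 68.93136141))
europe = xr.concat([west, east], dim="longitude")
print(europe.longitude.size)  # 100 grid points (31 + 69) at 1 degree

# Option 1): shift longitudes to [-180, 180] first, so a single slice
# covers Europe (this is the step that proved memory-heavy at scale).
shifted = ds.assign_coords(
    longitude=((ds.longitude + 180.0) % 360.0) - 180.0
).sortby("longitude")
europe2 = shifted.sel(longitude=slice(-31.28903052, 68.93136141))
print(europe2.longitude.size)  # 100
```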

@QianqianHan96
Collaborator Author

QianqianHan96 commented Oct 25, 2023

In the notebook https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land_clipDuringLoad_allLon.ipynb, with method 2) (clip the areas [328, 360] and [0, 68] separately, then concat), after I added `all2 = all1.chunk({"latitude": 469, "longitude": 1002})` the export time is now 3 min 40 s with 8 workers and 8 threads on 32 CPUs and 240 GB of memory.
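The fix can be sketched at toy scale (the real chunk sizes `{"latitude": 469, "longitude": 1002}` are replaced by small illustrative values, and the dataset is synthetic):

```python
# Illustrative sketch: after concatenating two longitude bands, give
# the result one regular chunking before writing it out.
import os
import tempfile

import numpy as np
import xarray as xr

# Two bands with different widths, mimicking the concatenated clip.
west = xr.Dataset(
    {"t2m": (("time", "longitude"), np.random.rand(4, 31))},
    coords={"time": range(4), "longitude": np.arange(-31.0, 0.0)},
).chunk({"longitude": 31})
east = xr.Dataset(
    {"t2m": (("time", "longitude"), np.random.rand(4, 69))},
    coords={"time": range(4), "longitude": np.arange(0.0, 69.0)},
).chunk({"longitude": 69})
merged = xr.concat([west, east], dim="longitude")

# One explicit rechunk replaces the irregular (31, 69) longitude chunks
# with a uniform block before the export.
merged = merged.chunk({"longitude": 100})
merged.to_netcdf(os.path.join(tempfile.mkdtemp(), "era5land_europe.nc"))
print(merged.chunks["longitude"])  # (100,)
```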

@geek-yang
Member

Not sure if I'm missing some context 😅. For such a big NetCDF dataset, loading and clipping it in one go will of course be slow. We can use some tricks to accelerate the process a bit, but since we only need to do it once, I don't think it is necessary to put too much effort into it.

Just clip and save the preprocessed data. If space is an issue, we can move the global data to tape and keep only the European data on disk.

@QianqianHan96
Collaborator Author

> But since we only need to do it once, I don't think it is necessary to put too much effort on it. Just clip and save the preprocessed data. […]

Hi Yang, actually we need to do this for 20 years for Europe, and if we need other areas it will be even more; that's why I put so much time into this. Luckily, I have now found a fast way to do it (3 min 40 s to export the Europe area for one year). One last thing: could you and Francesco check tomorrow whether the method I am using is correct?

@geek-yang
Member

> Luckily, I have now found a fast way to do it (3 min 40 s to export the Europe area for one year). One last thing: could you and Francesco check tomorrow whether the method I am using is correct?

Ok. That's good to know. Then let's take a look tomorrow!

3 participants