
how to select Europe area ERA5land data and export efficiently #21

Open
QianqianHan96 opened this issue Oct 5, 2023 · 8 comments

@QianqianHan96
Collaborator

QianqianHan96 commented Oct 5, 2023

Hi Francesco and Yang,

I found that the exported file `inputData + year + "/era5land/era5land2015_10km1.nc"` in cell 21 of the notebook https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/0921_1year_Europe.ipynb is incomplete: it has no data in 8088 bands. I used https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land.ipynb to export the Europe-area ERA5-Land data (because I downloaded the global ERA5-Land data), but it is very slow: exporting around 60 GB of data takes several hours. If we instead do the selection in the prediction notebook, its chunk size differs from the other variables', which makes prediction take longer, so in my opinion it is better to do this in the preprocessing notebook. Do you know a faster way to export large NetCDF files?
@geek-yang @fnattino

@QianqianHan96
Collaborator Author

QianqianHan96 commented Oct 6, 2023

I tried different chunk sizes (in space and time), but with `open_mfdataset` the chunk size along time depends on the individual NetCDF files, so I would have to rechunk along time after loading with `open_mfdataset` if I want a different time chunking. However, I read that rechunking after `open_mfdataset` is not recommended (https://docs.xarray.dev/en/stable/user-guide/dask.html#parallel-computing-with-dask). I have not yet found a chunk size that exports the NetCDF quickly.
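To make the trade-off concrete, here is a minimal, self-contained sketch with synthetic stand-in files (all file names, variable names, and sizes are made up for illustration): the target chunking is passed to `open_mfdataset` at open time rather than applied afterwards with `.chunk()`.

```python
# Illustrative sketch: pass chunks to open_mfdataset instead of
# rechunking afterwards. All paths and sizes are made up.
import os
import tempfile

import numpy as np
import xarray as xr

tmp = tempfile.mkdtemp()
# Two tiny files stand in for the per-period ERA5-Land inputs.
for i in range(2):
    part = xr.Dataset(
        {"t2m": (("time", "latitude", "longitude"),
                 np.random.rand(4, 6, 8))},
        coords={"time": np.arange(i * 4, (i + 1) * 4),
                "latitude": np.linspace(30, 70, 6),
                "longitude": np.linspace(0, 70, 8)},
    )
    part.to_netcdf(os.path.join(tmp, f"part{i}.nc"))

# Chunks given here are applied while each file is opened, so each Dask
# chunk is read straight from disk; a later .chunk() call would instead
# stack a rechunk layer on top of the per-file chunks.
combined = xr.open_mfdataset(
    os.path.join(tmp, "part*.nc"),
    combine="by_coords",
    chunks={"time": 4, "latitude": 3, "longitude": 4},
)
print(combined.t2m.shape)  # (8, 6, 8)
```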

@fnattino
Collaborator

fnattino commented Oct 6, 2023

Hi @QianqianHan96, just some random thoughts on the issue. Writing to a single NetCDF file in parallel is always tricky; have you considered other formats (Zarr, for instance)? Also, Xarray supports different libraries ("engines") to read/write NetCDF files, and they have very different performance when running things in parallel. Finally, having a few workers with many threads versus many workers with one or a few threads can make a difference when reading NetCDF files (though this also depends on the engine used).

@QianqianHan96
Collaborator Author

QianqianHan96 commented Oct 6, 2023

> Writing to a single NetCDF file in parallel is always tricky, have you considered other formats (Zarr for instance?). […]

Hi Francesco, thanks for your advice. I just tried exporting to Zarr, but the speed seems similar to exporting NetCDF (most tasks in the Dask dashboard are "open_dataset" and "concatenate", so the bottleneck is probably still reading the data?). I also checked Xarray's engines; netcdf4 seems to be the best one for now. I also tried 20 workers with 2 threads and 4 workers with 8 threads; neither worked well.
Do you think it is better to clip the Europe-area ERA5-Land data in the preprocessing notebook or in the prediction notebook? `open_mfdataset` takes a lot of time and produces many tasks in the Dask dashboard, so in my opinion it is better to clip in the preprocessing notebook and export to a single NetCDF/Zarr file. In that case, when you have time, could you have a look at https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land.ipynb? Thanks for your help!

@QianqianHan96
Collaborator Author

QianqianHan96 commented Oct 25, 2023

I found that using the "preprocess" parameter of `open_mfdataset` to clip the data is faster than loading the global data and clipping afterwards. However, ERA5-Land longitudes are in [0, 360], while the European longitude range is [-31.28903052, 68.93136141], i.e. [328.71096947, 360] plus [0, 68.93136141], so I cannot clip the whole of Europe with a single `sel` call. When I export only [0, 68.93136141] with "preprocess", it takes 1 min 41 s with 8 workers and 8 threads on 32 CPUs and 240 GB of memory (https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land_clipDuringLoad_onlyLonGT0.ipynb).
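The `preprocess` approach can be sketched with synthetic stand-in files; only the 68.93136141 longitude bound comes from the text above, everything else (file pattern, variable names, grid) is made up:

```python
# Illustrative sketch (synthetic files): clip each file while it is
# opened via the `preprocess` argument of open_mfdataset.
import os
import tempfile

import numpy as np
import xarray as xr

def clip_lon_ge_zero(ds):
    # ERA5-Land longitudes run 0..360; keep only [0, 68.93136141].
    return ds.sel(longitude=slice(0, 68.93136141))

tmp = tempfile.mkdtemp()
for i in range(2):
    xr.Dataset(
        {"t2m": (("time", "longitude"), np.random.rand(2, 360))},
        coords={"time": [2 * i, 2 * i + 1],
                "longitude": np.arange(0.0, 360.0)},
    ).to_netcdf(os.path.join(tmp, f"part{i}.nc"))

# Each file is clipped as it is opened, so the combined dataset never
# carries the global grid through the task graph.
clipped = xr.open_mfdataset(
    os.path.join(tmp, "part*.nc"),
    combine="by_coords",
    preprocess=clip_lon_ge_zero,
)
print(clipped.t2m.shape)  # (4, 69) on this 1-degree toy grid
```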

There are two ways to clip the two European parts, in my opinion: 1) in the "preprocess" function, convert [0, 360] to [-180, 180] so the region becomes contiguous and can be clipped in one go, but this longitude conversion seems computationally heavy (the Dask dashboard reports too much unmanaged memory); 2) in the "preprocess" function, clip the two parts separately and then merge them; this gives no errors or warnings but is much slower than exporting a single contiguous area: 33 min 55 s with 8 workers and 8 threads on 32 CPUs and 240 GB of memory (https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land_clipDuringLoad_allLon.ipynb).
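Both options can be sketched on a synthetic 1-degree global grid; the longitude bounds are the ones quoted above, while the variable and dataset names are made up:

```python
# Illustrative sketch of both clipping options on a synthetic grid.
import numpy as np
import xarray as xr

lon = np.arange(0.0, 360.0)
ds = xr.Dataset(
    {"t2m": (("time", "longitude"), np.random.rand(2, lon.size))},
    coords={"time": [0, 1], "longitude": lon},
)

# Option 2): clip the two European bands separately, then concatenate.
west = ds.sel(longitude=slice(328.71096947, 360.0))  # the [-31.29, 0) part
east = ds.sel(longitude=slice(0.0, 68.93136141))
europe = xr.concat([west, east], dim="longitude")
print(europe.longitude.size)  # 100 grid points (31 + 69) at 1 degree

# Option 1): shift longitudes to [-180, 180] first, so a single slice
# covers Europe (this is the step that proved memory-heavy at scale).
shifted = ds.assign_coords(
    longitude=((ds.longitude + 180.0) % 360.0) - 180.0
).sortby("longitude")
europe2 = shifted.sel(longitude=slice(-31.28903052, 68.93136141))
print(europe2.longitude.size)  # 100
```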

@QianqianHan96
Collaborator Author

QianqianHan96 commented Oct 25, 2023

In the notebook https://github.com/EcoExtreML/Emulator/blob/main/0preprocessing/0-0sel_era5land_clipDuringLoad_allLon.ipynb, with method 2) (clip the areas [328, 360] and [0, 68] separately, then concat), after I added `all2 = all1.chunk({"latitude": 469, "longitude": 1002})` the export time is now 3 min 40 s with 8 workers and 8 threads on 32 CPUs and 240 GB of memory.
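The fix can be sketched at toy scale (the real chunk sizes `{"latitude": 469, "longitude": 1002}` are replaced by small illustrative values, and the dataset is synthetic):

```python
# Illustrative sketch: after concatenating two longitude bands, give
# the result one regular chunking before writing it out.
import os
import tempfile

import numpy as np
import xarray as xr

# Two bands with different widths, mimicking the concatenated clip.
west = xr.Dataset(
    {"t2m": (("time", "longitude"), np.random.rand(4, 31))},
    coords={"time": range(4), "longitude": np.arange(-31.0, 0.0)},
).chunk({"longitude": 31})
east = xr.Dataset(
    {"t2m": (("time", "longitude"), np.random.rand(4, 69))},
    coords={"time": range(4), "longitude": np.arange(0.0, 69.0)},
).chunk({"longitude": 69})
merged = xr.concat([west, east], dim="longitude")

# One explicit rechunk replaces the irregular (31, 69) longitude chunks
# with a uniform block before the export.
merged = merged.chunk({"longitude": 100})
merged.to_netcdf(os.path.join(tempfile.mkdtemp(), "era5land_europe.nc"))
print(merged.chunks["longitude"])  # (100,)
```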

@geek-yang
Member

Not sure if I'm missing some context 😅. For such a big NetCDF dataset, loading and clipping it in one go will of course be slow. We can use some tricks to accelerate the process a bit, but since we only need to do it once, I don't think it is necessary to put too much effort into it.

Just clip and save the preprocessed data. If space is an issue, we can move the global data to tape and keep only the European data on disk.

@QianqianHan96
Collaborator Author

> But since we only need to do it once, I don't think it is necessary to put too much effort on it. Just clip and save the preprocessed data. […]

Hi Yang, actually we need to do this for 20 years for Europe, and if we need other areas it will be even more; that's why I put so much time into this. Luckily, I have now found a fast way to do it (3 min 40 s to export the Europe area for one year). One last thing: could you and Francesco check tomorrow whether the method I am using is correct?

@geek-yang
Member

> Luckily, I have now found a fast way to do it (3 min 40 s to export the Europe area for one year). One last thing: could you and Francesco check tomorrow whether the method I am using is correct?

Ok. That's good to know. Then let's take a look tomorrow!

3 participants