Sprint about Dask computing summary #5

Open · QianqianHan96 opened this issue Jul 10, 2023 · 0 comments

QianqianHan96 (Collaborator) commented Jul 10, 2023

The procedure to use Dask on Snellius:

  1. Configuration: https://github.com/RS-DAT/JupyterDaskOnSLURM
  2. Dask script preparation: the map_blocks() function (see the sketch after this list).
  3. Run the Dask script and monitor the run in the Dask dashboard.
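A minimal sketch of steps 2 and 3, under assumptions not stated in the issue: a local Client stands in for the JupyterDaskOnSLURM deployment from step 1, and the file name `input.nc`, variable name `t2m`, chunk sizes, and per-chunk function are all illustrative.

```python
import xarray as xr
from dask.distributed import Client

# On Snellius the client would come from the JupyterDaskOnSLURM setup
# (step 1); a local client is used here only to keep the sketch runnable.
client = Client(n_workers=4, threads_per_worker=1)
print(client.dashboard_link)  # open this URL to monitor the run (step 3)

# Chunk by space and time at load time, as early as possible (tip 2 below).
ds = xr.open_dataset("input.nc", chunks={"time": 24, "y": 1000, "x": 1000})

def process_chunk(da):
    # Placeholder computation; map_blocks calls this once per chunk.
    return da * 2.0

# template tells xarray the output looks like the input, so nothing
# is computed eagerly; .compute() triggers the actual chunked run.
result = xr.map_blocks(process_chunk, ds["t2m"], template=ds["t2m"])
result = result.compute()
```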

Things that might be helpful when using Dask:

  1. Put the reprojection and other preprocessing into a separate script; this makes your main script run faster and look cleaner.
  2. Chunk the data by space and time when loading it, and make sure every step keeps the same chunk size. It is better to chunk the data as early as possible (see the chunked open_dataset() call in the sketch above).
  3. If you load a trained model, such as a machine learning or deep learning model, make sure the model is not too big. My trained model was 15 GB because I did not set max_depth when training the Random Forest; map_blocks() cannot handle a model that big, and you will get unexpected errors. My updated model is 245 MB, and I can pass the model path to map_blocks(). If I instead load the model outside of map_blocks() and then pass the loaded object in, the unmanaged memory becomes extremely high. Although loading the model outside map_blocks() is faster (you load it only once), it triggers the "unmanaged memory too high" warning, so we can only load the model inside map_blocks(); this way the model is loaded for every chunk (see the model-loading sketch after this list).
  4. When exporting to netCDF, use netCDF4, not netCDF3, and export with xarray (see the export sketch after this list).
  5. Client(n_workers=4, threads_per_worker=1). More workers and more threads might make your script run faster, but if your data is too big and you set too many workers or threads_per_worker, the dashboard webpage might freeze. I am still experimenting with this point.
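A sketch of tip 3, assuming a scikit-learn Random Forest saved with joblib and a single-feature layout; the model path, file name, and variable names are hypothetical, not from the issue.

```python
import joblib
import xarray as xr

MODEL_PATH = "rf_model.joblib"  # e.g., the 245 MB Random Forest

ds = xr.open_dataset("input.nc", chunks={"time": 24, "y": 1000, "x": 1000})

def predict_chunk(da, model_path):
    # Load the model on the worker, once per chunk, instead of shipping
    # a loaded model object through the task graph (which is what drove
    # unmanaged memory up in the issue).
    model = joblib.load(model_path)
    features = da.values.reshape(-1, 1)  # assume one feature for illustration
    pred = model.predict(features).reshape(da.shape)
    return xr.DataArray(pred, dims=da.dims, coords=da.coords)

# Only the small path string enters the graph, not the model itself.
predicted = xr.map_blocks(
    predict_chunk,
    ds["t2m"],
    kwargs={"model_path": MODEL_PATH},
    template=ds["t2m"],
)
```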
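And a short sketch of tip 4, exporting with xarray through the netCDF4 engine; the output file name, variable name, and compression settings are assumptions for illustration.

```python
import numpy as np
import xarray as xr

# Stand-in for the map_blocks output from the sketches above.
result = xr.DataArray(np.zeros((2, 3)), dims=("y", "x"), name="prediction")

result.to_dataset().to_netcdf(
    "output.nc",
    engine="netcdf4",   # requires the netCDF4 package
    format="NETCDF4",   # netCDF-4, not the classic netCDF-3 formats
    encoding={"prediction": {"zlib": True, "complevel": 4}},
)
```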