Sprint about Dask computing summary #5

Open · QianqianHan96 opened this issue Jul 10, 2023 · 0 comments

QianqianHan96 (Collaborator) commented Jul 10, 2023

The procedure to use Dask on Snellius:

  1. Configuration: https://github.com/RS-DAT/JupyterDaskOnSLURM
  2. Dask script preparation: the map_blocks() function (see the sketch after this list).
  3. Run the Dask script and monitor the run in the Dask dashboard.
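A minimal sketch of steps 2 and 3, under assumptions not stated in the issue: a local Client stands in for the JupyterDaskOnSLURM deployment from step 1, and the file name `input.nc`, variable name `t2m`, chunk sizes, and per-chunk function are all illustrative.

```python
import xarray as xr
from dask.distributed import Client

# On Snellius the client would come from the JupyterDaskOnSLURM setup
# (step 1); a local client is used here only to keep the sketch runnable.
client = Client(n_workers=4, threads_per_worker=1)
print(client.dashboard_link)  # open this URL to monitor the run (step 3)

# Chunk by space and time at load time, as early as possible (tip 2 below).
ds = xr.open_dataset("input.nc", chunks={"time": 24, "y": 1000, "x": 1000})

def process_chunk(da):
    # Placeholder computation; map_blocks calls this once per chunk.
    return da * 2.0

# template tells xarray the output looks like the input, so nothing
# is computed eagerly; .compute() triggers the actual chunked run.
result = xr.map_blocks(process_chunk, ds["t2m"], template=ds["t2m"])
result = result.compute()
```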

Things that might be helpful when using Dask:

  1. Put the reprojection and other preprocessing into a separate script; this makes your main script run faster and look cleaner.
  2. Chunk the data by space and time when loading it, and make sure every step keeps the same chunk size. It is better to chunk the data as early as possible (see the chunked open_dataset() call in the sketch above).
  3. If you load a trained model, such as a machine learning or deep learning model, make sure the model is not too big. My trained model was 15 GB because I did not set max_depth when training the Random Forest; map_blocks() cannot handle a model that big, and you will get unexpected errors. My updated model is 245 MB, and I can pass the model path to map_blocks(). If I instead load the model outside of map_blocks() and then pass the loaded object in, the unmanaged memory becomes extremely high. Although loading the model outside map_blocks() is faster (you load it only once), it triggers the "unmanaged memory too high" warning, so we can only load the model inside map_blocks(); this way the model is loaded for every chunk (see the model-loading sketch after this list).
  4. When exporting to netCDF, use netCDF4, not netCDF3, and export with xarray (see the export sketch after this list).
  5. Client(n_workers=4, threads_per_worker=1). More workers and more threads might make your script run faster, but if your data is too big and you set too many workers or threads_per_worker, the dashboard webpage might freeze. I am still experimenting with this point.
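A sketch of tip 3, assuming a scikit-learn Random Forest saved with joblib and a single-feature layout; the model path, file name, and variable names are hypothetical, not from the issue.

```python
import joblib
import xarray as xr

MODEL_PATH = "rf_model.joblib"  # e.g., the 245 MB Random Forest

ds = xr.open_dataset("input.nc", chunks={"time": 24, "y": 1000, "x": 1000})

def predict_chunk(da, model_path):
    # Load the model on the worker, once per chunk, instead of shipping
    # a loaded model object through the task graph (which is what drove
    # unmanaged memory up in the issue).
    model = joblib.load(model_path)
    features = da.values.reshape(-1, 1)  # assume one feature for illustration
    pred = model.predict(features).reshape(da.shape)
    return xr.DataArray(pred, dims=da.dims, coords=da.coords)

# Only the small path string enters the graph, not the model itself.
predicted = xr.map_blocks(
    predict_chunk,
    ds["t2m"],
    kwargs={"model_path": MODEL_PATH},
    template=ds["t2m"],
)
```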
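And a short sketch of tip 4, exporting with xarray through the netCDF4 engine; the output file name, variable name, and compression settings are assumptions for illustration.

```python
import numpy as np
import xarray as xr

# Stand-in for the map_blocks output from the sketches above.
result = xr.DataArray(np.zeros((2, 3)), dims=("y", "x"), name="prediction")

result.to_dataset().to_netcdf(
    "output.nc",
    engine="netcdf4",   # requires the netCDF4 package
    format="NETCDF4",   # netCDF-4, not the classic netCDF-3 formats
    encoding={"prediction": {"zlib": True, "complevel": 4}},
)
```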