multiple nodes for batch run #25
Hi @QianqianHan96, yes, the code snippet above can use multiple nodes.
The reason why I have run the calculations using that setup is that I could use a small job for the Jupyter server (16 cores on a thin node, i.e. the smallest allocation you can get on Snellius) and then let the SLURMCluster submit additional 16-core worker jobs only when they are actually needed for the computation. Instead, when using a single large allocation for the whole session, all those cores stay booked even while they are idle. Apart from the fact that you are then charged for more CPU time, booking 96 CPUs simultaneously for a longer time period often involves more waiting time (if you instead ask for 6 times 16-CPU jobs with shorter wall time, your jobs will likely start earlier).
I don't think billing is rounded to the hour. You should be charged only for the 20 minutes (times the number of cores you are occupying, times the weight factor of the node - fat nodes and GPU nodes are more expensive). Pinging @SarahAlidoost for a clarification: when you suggested that @QianqianHan96 use multiple nodes, did you mean to increase the number of nodes used in one calculation, or did you mean to run multiple calculations at the same time (keeping one node per calculation)? I thought Sarah meant the latter: for instance, instead of running one year after the other on one node, you could run e.g. 20 years simultaneously on 20 different nodes.
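For concreteness, here is a minimal sketch of this kind of setup (the memory, partition, and wall-time values below are placeholders, not necessarily the ones used in the actual notebook):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each Dask worker job is a separate 16-core SLURM job, so the scheduler
# is free to place the jobs on whichever nodes happen to be available.
cluster = SLURMCluster(
    cores=16,             # cores per worker job
    processes=1,          # one Dask worker per job, using all 16 cores
    memory="28GiB",       # memory per worker job (placeholder)
    queue="thin",         # partition name (placeholder)
    walltime="01:00:00",  # wall time per worker job (placeholder)
)
cluster.scale(jobs=6)     # 6 x 16-core jobs = 96 cores in total
client = Client(cluster)
```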
Hi Francesco, Thanks a lot for your explanation. Maybe you are right that Sarah meant the latter. When I talked with Sarah today, I was just thinking about how to use multiple nodes with Dask. Then I remembered that I saw 6 jobs running simultaneously on 6 nodes when I ran your script, so I was curious whether the current script for running one year (above snippet) already involves multiple nodes. So the 6 times 16-CPU jobs means using 6 nodes, right? Following Sarah's advice, do you know how to run 20 years simultaneously on 20 different nodes? My idea was to copy the same Jupyter notebook 20 times, which is maybe too dumb a way. If I set the wall time to 1 hour and finish my task earlier (e.g. in 20 mins), and then run client.shutdown() in the Jupyter notebook or scancel the job in the terminal, will they stop charging me, or will I still be charged until the end of the wall time?
It means that you are running 6 jobs with 16 CPUs each. The jobs can run on the same node or on different ones, depending on what is available at that moment.
Asking for a given number of worker jobs when you create the SLURMCluster is exactly the same thing as the following (just shorter): creating the cluster first and then scaling it up to that number of jobs. So, when you create a SLURMCluster and scale it, each worker job is submitted to SLURM as an independent 16-core batch job, and SLURM is free to place those jobs on the same node or spread them over several nodes.
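As a rough sketch of the two equivalent forms (assuming a dask-jobqueue SLURMCluster; the exact keyword arguments in the notebook may differ):

```python
from dask_jobqueue import SLURMCluster

# Form 1: request the worker jobs directly when creating the cluster ...
cluster = SLURMCluster(cores=16, processes=1, memory="28GiB", n_workers=6)

# Form 2: ... which is equivalent to creating the cluster first
# and scaling it up afterwards. Either way, 6 independent 16-core SLURM
# jobs are submitted, and they may end up on the same node or on different ones.
cluster = SLURMCluster(cores=16, processes=1, memory="28GiB")
cluster.scale(6)
```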
I think the best way to do that would be to convert the notebook into a Python script that takes some input (e.g. the year of interest), then you can submit 20 jobs with 20 different inputs. Let me or @SarahAlidoost know if you need help with this!
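A minimal sketch of what such a script could look like (the script name run_year.py and the function run_emulator are hypothetical; the actual computation would be the code that is currently in the notebook):

```python
# run_year.py - hypothetical driver script: run the emulator for a single year.
import argparse

from dask.distributed import Client
from dask_jobqueue import SLURMCluster


def run_emulator(year: int, client: Client) -> None:
    """Placeholder for the per-year computation from the notebook."""
    print(f"Running emulator for {year}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Run the emulator for one year")
    parser.add_argument("year", type=int, help="year to process, e.g. 2014")
    args = parser.parse_args()

    # SLURMCluster setup as in the notebook (placeholder arguments here).
    cluster = SLURMCluster(cores=16, processes=1, memory="28GiB")
    cluster.scale(4)
    client = Client(cluster)
    client.wait_for_workers(4)

    run_emulator(args.year, client)

    client.shutdown()


if __name__ == "__main__":
    main()
```

Each of the 20 submissions would then just pass a different year, e.g. python run_year.py 2014.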
You are only charged for the time that you actually use, so 20 min in your case (you can make sure you don't have any more running jobs with `squeue`).
Hi Francesco, Thanks a lot for your detailed explanation. I understand now.
Could you help me convert the notebook into a Python script when you have time? Maybe just use the years 2014 and 2015 as an example, because the input data for the other years are not ready yet. The latest notebook is this one (https://github.com/EcoExtreML/Emulator/blob/main/2daskParallel/0117_1year_125degrees.ipynb).
You can convert a notebook to a Python script with `jupyter nbconvert --to python your_notebook.ipynb`. Let me know if you need help with this.
Thanks, Sarah. If it is just a matter of converting the Jupyter notebook to a Python script and submitting 20 jobs with 20 sbatch scripts, I know how to do that. I was thinking of submitting the 20 jobs from one sbatch script.
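Submitting all the jobs from a single place is indeed possible; as a sketch (reusing the hypothetical run_year.py driver from above, with a placeholder range of years), the submission loop could also live in a short Python script:

```python
# submit_years.py - hypothetical helper that submits one SLURM job per year.
import subprocess

YEARS = range(1996, 2016)  # placeholder: adjust to the years you actually have data for

for year in YEARS:
    # Each job runs the per-year driver script with its own input year.
    subprocess.run(
        ["sbatch", "--job-name", f"emulator-{year}", "--wrap", f"python run_year.py {year}"],
        check=True,
    )
```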
@fnattino Francesco, one more thing: if I convert the notebook to a Python script, can I still use the SLURMCluster like I am doing now?
Hi @QianqianHan96, yes, you can use the SLURMCluster as in the notebooks, but it might be useful to add a call that waits for the full cluster to connect before actually going on and running the calculations. Something like the following should work:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

NWORKERS = 4

cluster = SLURMCluster(...)
cluster.scale(NWORKERS)
client = Client(cluster)
client.wait_for_workers(NWORKERS)  # this blocks the execution until all required workers connect to the cluster
```
Hi Francesco,
Sarah suggested that I use multiple nodes if time is a concern. For now, 64 CPUs in one node are enough, but it is maybe good to know for the future, for running a bigger area.
Are multiple nodes already used in the current script, with `jobs=6` or 4 (I set it to 4)? If so, why do you use 96 CPUs from 6 nodes rather than 96 CPUs from one node? Is this planned in advance for a bigger computation?
Another question: if I use 20 minutes on a node, will SURF charge me for 1 hour or for 20 minutes? The prices I saw are all per hour.
@fnattino @SarahAlidoost