
Running STEMMUS_SCOPE sensitivity analysis on Snellius with parallel computing #187

Open
Crystal-szj opened this issue Jun 23, 2023 · 14 comments


@Crystal-szj
Contributor

@SarahAlidoost Hi Sarah,

I hope this message finds you well.
I want to do a sensitivity analysis on STEMMUS_SCOPE by setting different sets of parameters (running the model 380 times), and I would like to use parallel computing for this.

To begin, I have created a new executable file via STEMMUS_SCOPE_SS_exe.m that requires two input arguments: one for the config file (config_file) and the other for the parameters (parameter_setting_file). The MATLAB code portion has been completed.

My intention is to use the existing run_STEMMUS_SCOPE_in_Snellius framework. If I understand correctly, I need to modify the run_model.py file to iterate through the input parameter files instead of the input forcing data for the 170 sites.

The pystemmusscope environment is activated. Now I have a couple of questions:

  1. In the run_model.py file, see here, we need to create an instance of the model. However, the parameter_file is not an input for the StemmusScope class. Does this mean I need to modify the StemmusScope class in the 'pystemmusscope' package and reinstall the package?
  2. Since run_model.py requires the input of job_id, is it possible for me to test this modified version on my local computer to ensure there are no bugs before submitting it to Snellius? I'm not sure how to handle this during the development phase.

Please let me know if any information needs to be provided. I would greatly appreciate it if you could share your experience and provide guidance on these questions.

Best regards,
Zengjing

@SarahAlidoost
Member

SarahAlidoost commented Jun 23, 2023

@Crystal-szj nice job creating the issue 👍, thanks. Here are the answers:

  1. You don't need a new function STEMMUS_SCOPE_SS_exe.m that accepts two input variables. Instead, you can move things around in STEMMUS_SCOPE_SS.m so that the parameter settings are read from the config file. If the path to parameter_setting_file changes every time you run the model, write a function in the run_model.py file that reads the config_file and writes it again with a new path to parameter_setting_file.

@SarahAlidoost
Member

  2. The variable job_id is only used to write the log file. You can create a run_model_local.py file and remove job_id. Then use run_model_local.py locally, e.g. along the lines of the sketch below.
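
A minimal sketch of what run_model_local.py could look like (the StemmusScope constructor keywords here follow the PyStemmusScope documentation; all paths are placeholders, not the actual files in this project):

from pathlib import Path

from PyStemmusScope import StemmusScope

def run_model_local():
    # Placeholder paths; adjust to your own setup.
    config_file = "./config_file_local.txt"
    exe_file = "./exe/STEMMUS_SCOPE"

    model = StemmusScope(config_file=config_file, model_src_path=exe_file)

    # setup() prepares the input directory for this run.
    model.setup()

    # run() executes the model; without job_id, simply write the log
    # to a local file instead.
    model_log = model.run()
    Path("stemmus_scope.log").write_text(model_log)

if __name__ == "__main__":
    run_model_local()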

@SarahAlidoost
Member

Just to give you ideas about reading and writing the config_file in Python, here are some examples: read_config and update_config. Use them as examples; you need to write your own functions.
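
For illustration, a minimal sketch of such functions, assuming the config file is plain text with key=value lines; the function names and the parameter_setting_file key are placeholders:

def read_config(config_path):
    """Read a key=value config file into a dictionary."""
    config = {}
    with open(config_path, encoding="utf8") as file:
        for line in file:
            line = line.strip()
            if line and "=" in line:
                key, value = line.split("=", 1)
                config[key] = value
    return config

def update_config(config_path, parameter_setting_file):
    """Rewrite the config file with a new path to parameter_setting_file."""
    config = read_config(config_path)
    config["parameter_setting_file"] = parameter_setting_file
    with open(config_path, "w", encoding="utf8") as file:
        for key, value in config.items():
            file.write(f"{key}={value}\n")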

@Crystal-szj
Contributor Author

> 2. The variable job_id is only used to write the log file. You can create a run_model_local.py file and remove job_id. Then use run_model_local.py locally.

@SarahAlidoost Hi Sarah, many thanks for your suggestions. I commented out the job_id and argparse parts and tried running run_model_local.py on my computer in PyCharm. I'm testing with a test file at AR-SLu, but I get an error when I run this line here. The exit_code is 1 instead of 0 or 139, see here. I copied the error message here:

D:\software\Anaconda3\envs\pystemmusscope\python.exe F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\run_model_local.py
D:\software\Anaconda3\envs\pystemmusscope\lib\site-packages\xarray\core\accessor_dt.py:72: FutureWarning: Index.ravel returning ndarray is deprecated; in a future version this will return a view on self.
values_as_series = pd.Series(values.ravel(), copy=False)
Traceback (most recent call last):
File "F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\run_model_local.py", line 93, in
run_model_local(0)
File "F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\run_model_local.py", line 40, in run_model_local
model_log = model.run()
File "D:\software\Anaconda3\envs\pystemmusscope\lib\site-packages\PyStemmusScope\stemmus_scope.py", line 206, in run
result = _run_sub_process(args, None)
File "D:\software\Anaconda3\envs\pystemmusscope\lib\site-packages\PyStemmusScope\stemmus_scope.py", line 85, in _run_sub_process
raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\exe\STEMMUS_SCOPE F:\P1\sensitivitiy_analysis_CLM5_scheme\STEMMUS_SCOPE_SS\run_model_on_snellius\input\AR-SLu_2023-06-26-1225\AR-SLu_2023-06-26-1225_config.txt']' returned non-zero exit status 1.

Process finished with exit code 1

Could you please help me figure out what's wrong here? Please let me know if more information is needed. Thanks very much.

@SarahAlidoost
Member

@Crystal-szj there are several things to check:

  • the version of pystemmusscope and stemmus_scope, see here.
  • if you are running the exe file with the MATLAB Runtime, you might need to set LD_LIBRARY_PATH, see the documentation and the sketch after this list.
  • model.setup() generates input data in an input directory. Could you check if you can run your stemmus_scope code using that input data with MATLAB? It should return more info about errors, if any.
  • check if the generated exe file works.
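
On the LD_LIBRARY_PATH point: if the compiled model is launched from Python, the MATLAB Runtime libraries can be put on the library path through the environment. A minimal sketch, assuming a Linux MATLAB Runtime installed under the example path below:

import os
import subprocess

# Example MATLAB Runtime location; adjust to your installation and version.
mcr_root = "/opt/MATLAB/MATLAB_Runtime/v910"

env = os.environ.copy()
env["LD_LIBRARY_PATH"] = ":".join(
    f"{mcr_root}/{sub}"
    for sub in ("runtime/glnxa64", "bin/glnxa64", "sys/os/glnxa64")
)

# Run the compiled model with the Runtime libraries on the library path.
subprocess.run(["./exe/STEMMUS_SCOPE", "./config_file.txt"], env=env, check=True)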

@Crystal-szj
Contributor Author

@SarahAlidoost Many thanks for your advice.

  • About the version of pystemmusscope: I use the latest version, 0.3.0.
  • I set the LD_LIBRARY_PATH accordingly. However, I'm running and debugging the Python code run_model.py and am currently not using the MATLAB Runtime.
  • After running model.setup(), yes, it created an input directory, but without the .nc file. When I use the config_file it generated in the input directory together with the NetCDF file in InputPath, it works.
  • About the generated exe file: in my case, I created an executable file named STEMMUS_SCOPE_SS.exe and a config file named config_file_snellius_sensitivity_analysis.
    I can run the exe file via the Python console
import subprocess
subprocess.run([r'.\exe\STEMMUS_SCOPE_SS.exe', r'.\config_file_snellius_sensitivity_analysis.txt'])

or via the WSL terminal

./exe/STEMMUS_SCOPE_SS.exe ./config_file_snellius_sensitivity_analysis.txt

Both of the above commands work well, so I think the exe file works.

However, when I run model.run(), the program breaks and doesn't continue executing.

@Crystal-szj
Contributor Author

The above problem may be caused by the different operating systems (e.g. Linux and Windows). The documented workflow works well on Linux but failed on WSL, see here. In addition, an executable file generated on one system may not be compatible with another, so it's better to regenerate the executable file when running on a new system.

@Crystal-szj
Contributor Author

> @Crystal-szj there are several things to check:
>
>   • the version of pystemmusscope and stemmus_scope, see here.
>   • if you are running the exe file with the MATLAB Runtime, you might need to set LD_LIBRARY_PATH, see the documentation
>   • model.setup() generates input data in an input directory. Could you check if you can run your stemmus_scope code using that input data with MATLAB? It should return more info about errors, if any.
>   • check if the generated exe file works.

@SarahAlidoost Hi Sarah, many thanks for your advice. I installed a Linux system, and now the code works well.
However, when I did the test run, I encountered the same issue as Qianqian about allocating one core per task. We discussed it together, but it's still a challenge for us to find a solution. I wonder if you encountered a similar situation in your experience running the 170 sites, and whether you could share any insights or suggestions.

All the code has been uploaded to the EcoExtreML/STEMMUS_SCOPE_sensitivity_analysis repository. Here is some detailed information.

  1. To submit the task to Snellius, I used run_stemmus_scope_snellius.sh. This shell script calls a Python script named run_model_on_snellius_sensitivity_analysis.py, which executes the MATLAB executable file named STEMMUS_SCOPE_SS. For the test run, I limited it to only 480 timesteps (instead of the complete study period of 10608 timesteps) to assess CPU performance.
  2. To monitor the CPU usage, I used squeue to obtain the node_id information and then accessed the node using ssh node_id. After that, I used the command htop -u <user name> to gather the following information.
    [screenshot: htop output]
  3. Here is the log file:
    [screenshot: log file]

Please let me know if you need further information. Any insights or suggestions you can provide would be immensely helpful. Sincerely thanks for your time and support.

@SarahAlidoost
Member

> 1. To submit the task to Snellius, I used run_stemmus_scope_snellius.sh.

I see that you commented out the for loop. Also, the variables ncores, i, and k are not used in your code. That loop is exactly where parallel execution is implemented. I am not sure if you saw the SURF documentation that I already sent to Qianqian; here are the links:
https://servicedesk.surf.nl/wiki/display/WIKI/Methods+of+parallelization
https://servicedesk.surf.nl/wiki/display/WIKI/Example+job+scripts#Examplejobscripts-Singlenode,concurrentprogramsonthesamenode(CPUandGPU)

@Crystal-szj
Contributor Author

> I see that you commented out the for loop. Also, the variables ncores, i, and k are not used in your code. That loop is exactly where parallel execution is implemented.

Thanks for your prompt response and the links. I understand your approach, where each site is assigned to a separate core for parallel execution. That enables the completion of the 170 sites in six rounds, with 32 sites processed per round.

However, considering the need for one task to run on a single core, as both you and Qianqian mentioned, I believe I should follow the 'parallel execution of serial programs' approach, where parallelism is not programmed into the STEMMUS_SCOPE model itself. According to this method, if I submit one task, only one CPU should be utilized, and if I submit ten tasks, ten CPUs should work concurrently.

I noticed from the above screenshot that multiple cores were active even though I submitted only one task. Does this indicate the presence of parallelism within the executable file? My question is whether I should ensure "one task, one CPU", or whether I can overlook this issue and proceed with using the for loop to run the 380 cases.

Thanks again for your guidance and expertise.

@SarahAlidoost
Member

SarahAlidoost commented Jul 6, 2023

> According to this method, if I submit one task, only one CPU should be utilized, and if I submit ten tasks, ten CPUs should work concurrently.

No, this is not the case unless we tell the computer to run ten tasks on ten cores. That means we should implement a method of parallelization, e.g. a for loop with & and wait. You need to figure out how many cores are used by one task (your code). It is okay if the task needs more than one core, but we need this information, i.e. number of cores, memory usage, etc., to be able to implement a method of parallelization.
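
To illustrate the same "parallel execution of serial programs" idea in Python (launching each case as a background process and then waiting, analogous to & and wait in a job script; the paths and case names below are placeholders):

import subprocess

# Hypothetical per-case config files, e.g. generated by run_model.py
# for each parameter set.
config_files = [f"./input/case_{i:03d}_config.txt" for i in range(4)]

# Start one model run per case as a background process (like `cmd &`).
processes = [
    subprocess.Popen(["./exe/STEMMUS_SCOPE_SS", config]) for config in config_files
]

# Block until all of them have finished (like `wait`).
for process in processes:
    process.wait()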

@SarahAlidoost
Member

> However, considering the need for one task to run on a single core, as both you and Qianqian mentioned, I believe I should follow the 'parallel execution of serial programs' approach, where parallelism is not programmed into the STEMMUS_SCOPE model itself. According to this method, if I submit one task, only one CPU should be utilized, and if I submit ten tasks, ten CPUs should work concurrently.

Your code is different from Qianqian's code and does not use many Python libraries. If you are just running stemmus_scope, it should use only one core, unless your stemmus_scope is very different from the one in the main branch. If it uses more, please check the code that builds the exe file and make sure that the argument -R singleCompThread is set.

@Crystal-szj
Contributor Author

Crystal-szj commented Jul 10, 2023

@SarahAlidoost Hi Sarah, many thanks for your reply.

> You need to figure out how many cores are used by one task (your code). It is okay if the task needs more than one core, but we need this information, i.e. number of cores, memory usage, etc., to be able to implement a method of parallelization.

  • About the core usage in parallel computing: I uncommented the for loop in run_stemmus_scope_snellius.sh and performed test runs with 1, 2, and 4 cases. Sometimes the CPU usage per core exceeds 100%, and two cores are activated for each single case. It's worth noting that I did set the argument "-R singleCompThread" when building the executable file, see here. Detailed information for each of the test runs:
  1. Test run with 1 case: this shell script.
    When I used htop -u <username> to check the CPU performance, two cores were activated (one running and one sleeping, see the values in the "S" column).
    [screenshot: htop output, 2023-07-10 15:22]

  2. Test run with 2 cases: this shell script, but it threw an error:
    [screenshot: error message]
    I added sleep 90 to solve this problem and ran it again, see here. While it was running, 4 cores were activated, with 2 running and 2 sleeping.
    [screenshot: htop output, 2023-07-10 16:12]

  3. Test run with 4 cases:
    This test run involved four cases submitted via the script. 8 cores were activated.
    [screenshot: htop output, 2023-07-10 16:47]

I would like to ask whether the occasional CPU usage above 100% and the two cores activated per case are common situations on a supercomputer.

  • In addition, I found that the cores-per-node values in the slurm log changed for the same task, even though I didn't change any setting in run_stemmus_scope_snellius.sh. The terminal displayed a message stating "You will be charged for 0.25 core", but when I submitted the same job twice, the slurm_{job_id}.out files showed different cores per node, Job Wall-clock time, CPU utilization, CPU efficiency, Memory utilization, and Memory efficiency between the two executions.
    [screenshot: slurm job efficiency report, 2023-07-11 03:18]

I'm seeking your advice on any additional steps or considerations that should be taken before executing the 380 cases. Thanks again for your help and time.

@Crystal-szj
Contributor Author

> If you are just running stemmus_scope, it should use only one core, unless your stemmus_scope is very different from the one in the main branch. If it uses more, please check the code that builds the exe file and make sure that the argument -R singleCompThread is set.

The STEMMUS_SCOPE version I used is based on version 1.1.9, with the plant hydraulics part added as a separate function. I'd like to clarify that I have not used any parallel computing constructs such as parfor within my function; the execution currently runs sequentially.

If you have any further questions or require more details, please let me know. Thanks for your support.
