Template generation pipeline crashing on the Compute Canada cluster #86

Open
rohanbanerjee opened this issue Apr 4, 2024 · 6 comments

Comments

@rohanbanerjee
Contributor

rohanbanerjee commented Apr 4, 2024

Moving issue spinalcordtoolbox/template-dog#18 to this repository, since it is more relevant here.

As described here, the generate_template script depends on:

  1. minc-toolkit-v2
  2. minc2-simple
  3. nist_mni_pipelines

We have been using commit cadc7219e79d6edb90742e1e340f8eee76332006 of nist_mni_pipelines, which used the scoop package for parallelization. Newer versions of nist_mni_pipelines (I'm using commit 608acff75601bf80f79334abc0434bbc0734af0d) use the ray package instead. After I install ray with pip install ray, the jobs crash with the following error:

error stack
[2024-04-04 07:13:48,381] launcher  INFO    SCOOP 0.7 2.0 on linux using Python 3.8.10 (default, Jun 16 2021, 14:19:02) [GCC 9.3.0], API: 1013
[2024-04-04 07:13:48,382] launcher  INFO    Detected SLURM environment.
[2024-04-04 07:13:48,382] launcher  INFO    Deploying 1 worker(s) over 1 host(s).
[2024-04-04 07:13:48,382] launcher  DEBUG   Using hostname/ip: "bc11259" as external broker reference.
[2024-04-04 07:13:48,382] launcher  DEBUG   The python executable to execute the program with is: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python.
[2024-04-04 07:13:48,382] launcher  INFO    Worker distribution: 
[2024-04-04 07:13:48,382] launcher  INFO       bc11259:	0 + origin
[2024-04-04 07:13:48,816] brokerLaunch (127.0.0.1:36071) DEBUG   Local broker launched on ports 36071, 33491.
[2024-04-04 07:13:48,816] launcher  (127.0.0.1:36071) DEBUG   Initialising local origin worker 1 [bc11259].
[2024-04-04 07:13:48,816] launcher  (127.0.0.1:36071) DEBUG   bc11259: Launching 'env PYTHONPATH=/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/ni
st_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python -m scoop.launch.__main__ 1 3 --size 1 --workingDirectory /lustre04/scratch/rohanb1/dog_template/template --brokerHostname 127.0.0.1 --externalBrokerHostname bc11259 --taskPort 36071 --metaPort 33491 --origin --backend=ZMQ -vvv generate_template_pediatric.py'
Launching 1 worker(s) using /bin/bash.
Executing '['/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.8.10/bin/python', '-m', 'scoop.bootstrap.__main__', '--size', '1', '--workingDirectory', '/lustre04/scratch/rohanb1/dog_template/template', '--brokerHostname', '127.0.0.1', '--externalBrokerHostname', 'bc11259', '--taskPort', '36071', '--metaPort', '33491', '--origin', '--backend=ZMQ', '-vvv', 'generate_template_pediatric.py']'...
2024-04-04 07:14:35,671	INFO worker.py:1553 -- Started a local Ray instance.
[2024-04-04 07:15:06,066 E 449836 449836] core_worker.cc:191: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
[2024-04-04 07:15:06,132] launcher  (127.0.0.1:36071) INFO    Root process is done.
[2024-04-04 07:15:06,132] workerLaunch (127.0.0.1:36071) DEBUG   Closing workers on bc11259 (1 workers).
[2024-04-04 07:15:06,132] brokerLaunch (127.0.0.1:36071) DEBUG   Closing local broker.
[2024-04-04 07:15:06,132] launcher  (127.0.0.1:36071) INFO    Finished cleaning spawned subprocesses.

I searched and found a temporary fix for this issue here: https://stackoverflow.com/a/72492737. It did resolve the above error, but the job still crashes. The crash output is attached:
slurm-46365536.out.zip

Steps to reproduce this issue:

  1. Download the following data: https://drive.google.com/file/d/13yE3sS-GpawC-JcP-uDCJ-FTT1Jzmca9/view?usp=sharing
  2. For step 2a here, drag and drop the file into the scratch folder on Compute Canada and unzip it
  3. Open bids_data_final/derivatives/template/subjects.csv and update the paths
  4. Follow the rest of the steps mentioned here https://github.com/neuropoly/template?tab=readme-ov-file#step-2-template-creation

I'm trying to solve this issue on my side, but if anyone has any insights, please share!
(tagging @namgo if you have any information on this)

@vfonov

vfonov commented Apr 4, 2024

Looks like you are using a new version of my scripts that doesn't actually need SCOOP anymore, but rather uses ray (ray.io) to parallelize execution.

@rohanbanerjee
Contributor Author

rohanbanerjee commented Apr 4, 2024

Thank you for your quick response @vfonov. I am indeed using the version that uses ray. My concern is whether it is compatible with the Alliance Canada clusters. I did install ray on the cluster, as mentioned above. However, I suspect that this error is caused not by ray but by minc-toolkit-v2. Quick question: have you come across this error?

Message: b'/tmp/ebuser/avx512/MINCToolkit/1.9.18.1/GCC-9.3.0/minc-toolkit-v2/libminc/volume_io/Prog_utils/print.c:226 (from mivarput1): volume_io error: copy_volume():  copying cached volumes not implemented.\n\n'

I'm asking because this error line appears whether I use the scoop version or the ray version, and I think it might be the root of the issue I am facing.

@vfonov

vfonov commented Apr 4, 2024

I haven't seen this error message appearing before. When does this happen?

@rohanbanerjee
Contributor Author

This happens when I use the csv file below, which contains paths to the normalized, straightened .mnc files and the template mask. For example, a line from my csv file looks like this:

/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/sub-HarshmanDobby_T2_straight_norm.mnc,/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/template_mask.mnc
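
Before launching the job, it can help to confirm that every path listed in the csv actually exists on the cluster. The following is a minimal sketch, assuming the file has no header row and every field on a row is a file path (as in the example line above). `find_missing_paths` is a hypothetical helper written for illustration; it is not part of nist_mni_pipelines.

```python
import csv
from pathlib import Path


def find_missing_paths(csv_path):
    """Return (row_number, path) pairs for listed files that do not exist.

    Assumes a header-less csv in which every non-empty field is a file path.
    """
    missing = []
    with open(csv_path, newline="") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            for entry in row:
                entry = entry.strip()
                # Report any field that does not point at a regular file.
                if entry and not Path(entry).is_file():
                    missing.append((row_number, entry))
    return missing


# Example usage (the file name is an assumption):
# for row_number, path in find_missing_paths("subjects.csv"):
#     print(f"row {row_number}: missing {path}")
```

An empty result only shows that the files exist; it says nothing about whether they are valid MINC volumes.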

I then pass this subjects.csv to the generate_template.py script and launch it on the cluster. With the scoop version, the process crashes with the following error message:

error stack
[2024-04-04 05:17:19,225] launcher  INFO    SCOOP 0.7 2.0 on linux using Python 3.9.6 (default, Jul 12 2021, 18:24:27) [GCC 9.3.0], API: 1013
[2024-04-04 05:17:19,225] launcher  INFO    Detected SLURM environment.
[2024-04-04 05:17:19,225] launcher  INFO    Deploying 1 worker(s) over 1 host(s).
[2024-04-04 05:17:19,225] launcher  DEBUG   Using hostname/ip: "bc11203" as external broker reference.
[2024-04-04 05:17:19,225] launcher  DEBUG   The python executable to execute the program with is: /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.9.6/bin/python.
[2024-04-04 05:17:19,226] launcher  INFO    Worker distribution: 
[2024-04-04 05:17:19,226] launcher  INFO       bc11203:	0 + origin
[2024-04-04 05:17:20,331] brokerLaunch (127.0.0.1:43067) DEBUG   Local broker launched on ports 43067, 36013.
[2024-04-04 05:17:20,331] launcher  (127.0.0.1:43067) DEBUG   Initialising local origin worker 1 [bc11203].
[2024-04-04 05:17:20,331] launcher  (127.0.0.1:43067) DEBUG   bc11203: Launching 'env PYTHONPATH=/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/cvmfs/soft.computecanada.ca/easybuild/python/site-packages:/cvmfs/soft.computecanada.ca/custom/python/site-packages:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/ni
st_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl:/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.9.6/bin/python -m scoop.launch.__main__ 1 3 --size 1 --workingDirectory /lustre04/scratch/rohanb1/dog_template/template --brokerHostname 127.0.0.1 --externalBrokerHostname bc11203 --taskPort 43067 --metaPort 36013 --origin --backend=ZMQ -vvv generate_template_pediatric.py'
Launching 1 worker(s) using /bin/bash.
Executing '['/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx512/Core/python/3.9.6/bin/python', '-m', 'scoop.bootstrap.__main__', '--size', '1', '--workingDirectory', '/lustre04/scratch/rohanb1/dog_template/template', '--brokerHostname', '127.0.0.1', '--externalBrokerHostname', 'bc11203', '--taskPort', '43067', '--metaPort', '36013', '--origin', '--backend=ZMQ', '-vvv', 'generate_template_pediatric.py']'...
 -- Skipping: Output Exists:[['model_nl_all_t2_trial/flip/mask_sub-HarshmanDobby_T2_straight_norm.mnc']]
 -- Skipping: Output Exists:[['model_nl_all_t2_trial/flip/mask_sub-JacobsenPetri_T2_straight_norm.mnc']]
 -- Skipping: Output Exists:[['model_nl_all_t2_trial/flip/mask_sub-JamesMonk_T2_straight_norm.mnc']]
 -- Skipping: Output Exists:[['model_nl_all_t2_trial/flip/mask_sub-TrainqueNewt_T2_straight_norm.mnc']]
 -- Skipping: Output Exists:[['model_nl_all_t2_trial/flip/mask_sub-WalkerBo_T2_straight_norm.mnc']]
command ['minctracc', '/tmp/iplMincToolsm4uznzj7/p8sa6gg_sub-HarshmanDobby_T2_straight_norm_blur_8.0_0.mnc', '/tmp/iplMincToolsm4uznzj7/p8sa6gg_sub-HarshmanDobby_T2_straight_norm_blur_8.0_0.mnc', '-clobber', '-nonlinear', 'corrcoeff', '-weight', '1', '-stiffness', '1', '-similarity', '0.3', '-sub_lattice', '6', '-iterations', '20', '-lattice_diam', '48.0', '48.0', '48.0', '-step', '16.0', '16.0', '16.0', '-identity', '-source_mask', '/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/template_mask.mnc', '-model_mask', '/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/template_mask.mnc', '/tmp/iplMincToolsm4uznzj7/j8c90ngtsub-HarshmanDobby_T2_straight_norm_sub-HarshmanDobby_T2_straight_norm_1.xfm'] failed -11!
Message: b'/tmp/ebuser/avx512/MINCToolkit/1.9.18.1/GCC-9.3.0/minc-toolkit-v2/libminc/volume_io/Prog_utils/print.c:226 (from mivarput1): volume_io error: copy_volume():  copying cached volumes not implemented.\n\n'
NoneType: None

[2024-04-04 05:17:46,626] _control  (b'10.70.12.3:49263') DEBUG   The following error occured on a worker:
mincError:ERROR: command ['minctracc', '/tmp/iplMincToolsm4uznzj7/p8sa6gg_sub-HarshmanDobby_T2_straight_norm_blur_8.0_0.mnc', '/tmp/iplMincToolsm4uznzj7/p8sa6gg_sub-HarshmanDobby_T2_straight_norm_blur_8.0_0.mnc', '-clobber', '-nonlinear', 'corrcoeff', '-weight', '1', '-stiffness', '1', '-similarity', '0.3', '-sub_lattice', '6', '-iterations', '20', '-lattice_diam', '48.0', '48.0', '48.0', '-step', '16.0', '16.0', '16.0', '-identity', '-source_mask', '/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/template_mask.mnc', '-model_mask', '/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/template_mask.mnc', '/tmp/iplMincToolsm4uznzj7/j8c90ngtsub-HarshmanDobby_T2_straight_norm_sub-HarshmanDobby_T2_straight_norm_1.xfm'] failed -11!
Message: b'/tmp/ebuser/avx512/MINCToolkit/1.9.18.1/GCC-9.3.0/minc-toolkit-v2/libminc/volume_io/Prog_utils/print.c:226 (from mivarput1): volume_io error: copy_volume():  copying cached volumes not implemented.\n\n'
NoneType: None

AT:[<FrameSummary file /home/rohanb1/.local/lib/python3.9/site-packages/scoop/_control.py, line 150 in runFuture>, <FrameSummary file /home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/model/registration.py, line 232 in non_linear_register_step>, <FrameSummary file /home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/registration.py, line 759 in non_linear_register_full>, <FrameSummary file /home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/minc_tools.py, line 418 in command>, <FrameSummary file /home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/minc_tools.py, line 68 in __init__>]
Traceback (most recent call last):
  File "/home/rohanb1/.local/lib/python3.9/site-packages/scoop/_control.py", line 150, in runFuture
    future.resultValue = future.callable(*future.args, **future.kargs)
  File "/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/model/registration.py", line 232, in non_linear_register_step
    ipl.registration.non_linear_register_full(
  File "/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/registration.py", line 759, in non_linear_register_full
    minc.command([str(ii) for ii in args],
  File "/home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/minc_tools.py", line 418, in command
    raise mincError("ERROR: command {} failed {}!\nMessage: {}\n{}".format(str(cmds),str(outvalue),output_stderr,traceback.format_exc()))
ipl.minc_tools.mincError: mincError:ERROR: command ['minctracc', '/tmp/iplMincToolsm4uznzj7/p8sa6gg_sub-HarshmanDobby_T2_straight_norm_blur_8.0_0.mnc', '/tmp/iplMincToolsm4uznzj7/p8sa6gg_sub-HarshmanDobby_T2_straight_norm_blur_8.0_0.mnc', '-clobber', '-nonlinear', 'corrcoeff', '-weight', '1', '-stiffness', '1', '-similarity', '0.3', '-sub_lattice', '6', '-iterations', '20', '-lattice_diam', '48.0', '48.0', '48.0', '-step', '16.0', '16.0', '16.0', '-identity', '-source_mask', '/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/template_mask.mnc', '-model_mask', '/home/rohanb1/scratch/dog_template/bids_data_final/derivatives/template/template_mask.mnc', '/tmp/iplMincToolsm4uznzj7/j8c90ngtsub-HarshmanDobby_T2_straight_norm_sub-HarshmanDobby_T2_straight_norm_1.xfm'] failed -11!
Message: b'/tmp/ebuser/avx512/MINCToolkit/1.9.18.1/GCC-9.3.0/minc-toolkit-v2/libminc/volume_io/Prog_utils/print.c:226 (from mivarput1): volume_io error: copy_volume():  copying cached volumes not implemented.\n\n'
NoneType: None

AT:[<FrameSummary file /home/rohanb1/.local/lib/python3.9/site-packages/scoop/_control.py, line 150 in runFuture>, <FrameSummary file /home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/model/registration.py, line 232 in non_linear_register_step>, <FrameSummary file /home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/registration.py, line 759 in non_linear_register_full>, <FrameSummary file /home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/minc_tools.py, line 418 in command>, <FrameSummary file /home/rohanb1/scratch/dog_template/template/nist_mni_pipelines/ipl/minc_tools.py, line 68 in __init__>]

[2024-04-04 05:17:46,706] launcher  (127.0.0.1:43067) INFO    Root process is done.
[2024-04-04 05:17:46,706] workerLaunch (127.0.0.1:43067) DEBUG   Closing workers on bc11203 (1 workers).
[2024-04-04 05:17:46,706] brokerLaunch (127.0.0.1:43067) DEBUG   Closing local broker.
[2024-04-04 05:17:46,707] launcher  (127.0.0.1:43067) INFO    Finished cleaning spawned subprocesses.
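
A note on the `failed -11` in the log above: if that number is the return code reported by Python's `subprocess` module (an assumption about how `ipl/minc_tools.py` launches `minctracc`, not confirmed in this thread), then a negative value -N means the child process was terminated by signal N. The sketch below decodes such a code; `describe_returncode` is a hypothetical helper name.

```python
import signal


def describe_returncode(code: int) -> str:
    """Turn a subprocess-style return code into a readable description."""
    if code < 0:
        # subprocess reports "killed by signal N" as the return code -N.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {-code} ({name})"
    return f"exited with status {code}"


print(describe_returncode(-11))
```

Under that assumption, -11 corresponds to SIGSEGV, meaning `minctracc` crashed with a segmentation fault after printing the volume_io error rather than exiting cleanly.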

@vfonov

vfonov commented Apr 5, 2024

OK, it looks like the environment variable VOLUME_CACHE_THRESHOLD is set to a value that's smaller than the volume size you are using in template building.
Can you set it to -1 to completely disable caching?
I.e., export VOLUME_CACHE_THRESHOLD=-1 when you set up your environment.
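
If editing the job's shell environment is inconvenient, the same variable can be set from Python before any MINC tools are started. This is a minimal sketch, assuming the tools are launched as child processes of the Python script (as the `minctracc` command in the log suggests) and therefore inherit its environment. `child_sees_threshold` is a hypothetical helper that only demonstrates the inheritance.

```python
import os
import subprocess
import sys

# Set the variable suggested above; this must happen before any child
# process (for example a MINC command-line tool) is launched.
os.environ["VOLUME_CACHE_THRESHOLD"] = "-1"


def child_sees_threshold() -> str:
    """Start a child Python process and return the value it inherits."""
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('VOLUME_CACHE_THRESHOLD'))"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print(child_sees_threshold())
```

Workers started on other nodes do not necessarily inherit this environment, so the export in the job setup remains the simpler option.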

@rohanbanerjee
Contributor Author

This works perfectly with the scoop version, thank you! I am now testing it with the latest version (which uses Ray) and will update on whether it works.
