My setup has 8 Titan X GPUs. When I try to set --ref 32, I get this error:
/var/spool/slurm/slurmd/job86812/slurm_script: line 50: $benchmarch_logs: ambiguous redirect
Traceback (most recent call last):
  File "/home/mu480317/ODISE/./tools/train_net.py", line 392, in <module>
    launch(
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/detectron2/engine/launch.py", line 67, in launch
    mp.spawn(
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/mu480317/.conda/envs/ODISE/lib/python3.9/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)
  File "/home/mu480317/ODISE/tools/train_net.py", line 319, in main
    cfg = auto_scale_workers(cfg, comm.get_world_size())
  File "/home/mu480317/ODISE/odise/config/utils.py", line 65, in auto_scale_workers
    assert cfg.dataloader.train.total_batch_size % old_world_size == 0, (
AssertionError: Invalid reference_world_size in config! 8 % 32 != 0
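For context, the assertion in odise/config/utils.py fires because the configured total batch size must be divisible by the reference world size before any scaling happens. Below is a minimal sketch of that kind of check, assuming ODISE mirrors detectron2's auto_scale_workers semantics; the function and variable names here are illustrative, not ODISE's actual code.

# Minimal sketch of the divisibility check behind the assertion above;
# assumes detectron2-style auto_scale_workers behavior (illustrative names).
def sketch_auto_scale(total_batch_size, reference_world_size, new_world_size):
    old_world_size = reference_world_size  # the value passed via --ref
    # The config's total batch size must split evenly across the reference
    # GPUs, otherwise the scaling ratio is ill-defined.
    assert total_batch_size % old_world_size == 0, (
        f"Invalid reference_world_size in config! "
        f"{total_batch_size} % {old_world_size} != 0"
    )
    scale = new_world_size / old_world_size
    # The total batch size is rescaled by this ratio, so the per-GPU batch
    # stays at total_batch_size / reference_world_size.
    return int(round(total_batch_size * scale))

# Values from the traceback: total_batch_size=8, --ref 32, 8 physical GPUs.
# 8 % 32 == 8, so the assertion fires; only --ref values that divide 8
# (1, 2, 4, 8) would pass this check with that batch size.
# sketch_auto_scale(8, 32, 8)  -> AssertionError
# sketch_auto_scale(8, 8, 8)   -> 8 (1 image per GPU)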
With --ref 8, the GPU memory overflows instead.
Please help me solve this. Thank you.