How to reproduce
Using a p4d.24xlarge:
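Roughly, this is a standard parallelformers run; a minimal sketch of the kind of setup that triggers this is below (the model name and generation parameters are placeholders, not the exact ones used):

from transformers import AutoModelForCausalLM, AutoTokenizer
from parallelformers import parallelize

# Placeholder model name, not the exact model used.
model_name = "EleutherAI/gpt-neo-2.7B"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Shard the model across all 8 GPUs of the p4d.24xlarge.
parallelize(model, num_gpus=8, fp16=True)

# Keep generating; the failure shows up after it has been running
# for a while, not immediately.
inputs = tokenizer("Parallelformers is", return_tensors="pt")
for _ in range(1000):
    outputs = model.generate(**inputs, num_beams=5, max_length=40)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])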
The model loads fine and begins performing inference.
nvidia-smi shows all 8 GPUs at 90%+ utilization for a while.
Then eventually one GPU drops to 0% while the others jump to 100%.
The terminal shows:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 367, in reduce_storage
    df = multiprocessing.reduction.DupFd(fd)
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/reduction.py", line 198, in DupFd
    return resource_sharer.DupFd(fd)
  File "/home/ubuntu/miniconda3/envs/deepspeed/lib/python3.8/multiprocessing/resource_sharer.py", line 48, in __init__
    new_fd = os.dup(fd)
OSError: [Errno 9] Bad file descriptor
It then seems to hang forever from there.
I realize this stack trace doesn't give enough context to trace the problem back to parallelformers, which is frustrating. Maybe it's actually a bug in PyTorch or multiprocessing?
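For what it's worth, every failing frame above is in torch.multiprocessing's file-descriptor based storage sharing (reduce_storage -> DupFd -> os.dup), which PyTorch uses to pass tensors between processes over queues. One thing worth trying (just a guess on my part, not a confirmed fix) is switching the sharing strategy to 'file_system' before the model is loaded and parallelized, since that path doesn't duplicate raw file descriptors at all:

import torch.multiprocessing as mp

# On Linux the default is 'file_descriptor', which is exactly the code path
# that fails above (os.dup on a descriptor that is apparently already closed).
# 'file_system' shares tensor storage via files in shared memory instead,
# so no descriptors get passed between processes.
print(mp.get_sharing_strategy())        # 'file_descriptor' by default on Linux
print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'}

mp.set_sharing_strategy("file_system")  # set this before loading/parallelizing the model

If it still hangs with 'file_system', that would at least narrow things down to something other than the fd-sharing path.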
Environment
OS: Ubuntu 20.04.4 LTS
Python version: 3.8.13
Transformers version: 4.24.0
Docker: No
Misc.: N/A