
RuntimeError: DataLoader worker (pid xxx) exited unexpectedly with exit code 1. #70

Open
minygd opened this issue Dec 8, 2019 · 1 comment


minygd commented Dec 8, 2019

Hello, sorry to bother you. I have come across the same problem while running the train.py script: it occurs at Epoch[1](1438/4431) when I use --thread=8 and at Epoch[1](1426/4431) when I use --thread=1.
Following the related pytorch-issue, I still couldn't fix it.

===> Epoch[1](1438/4431): Loss: 2.4208, Error: (2.7224 1.7324 1.4768)
===> Epoch[1](1439/4431): Loss: 7.0227, Error: (7.0972 4.2546 3.7592)
===> Epoch[1](1440/4431): Loss: 6.4023, Error: (6.2473 4.1015 3.3724)
Traceback (most recent call last):
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Traceback (most recent call last):
  File "train.py", line 189, in <module>
    train(epoch)
  File "train.py", line 115, in train
    disp0, disp1, disp2 = model(input1, input2)
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 75, in parallel_apply
    thread.join()
  File "/home/~/anaconda3/envs/former/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/home/~/anaconda3/envs/former/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7094) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

The code is running under PyTorch 1.1.0, cudatoolkit 10.1, and torchvision 0.3.0 on an Ubuntu 18.04 machine with four Titan X GPUs.
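As the error message itself suggests, rerunning with num_workers=0 keeps data loading in the main process, so the real exception from the dataset surfaces with a full traceback instead of being lost in a worker subprocess. A minimal sketch of that configuration (the DummyPairs dataset and the loader arguments are placeholders, not the actual ones in train.py):

```python
# Sketch: run the DataLoader with num_workers=0 so that any exception raised
# inside __getitem__ propagates directly to the main process.
import torch
from torch.utils.data import DataLoader, Dataset

class DummyPairs(Dataset):
    """Placeholder for the stereo-pair dataset used by train.py."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # A real dataset would load and return an image pair here; a
        # corrupt file would raise at this point, with a clear traceback.
        return torch.zeros(3, 4, 4), torch.zeros(3, 4, 4)

# num_workers=0 disables worker subprocesses entirely.
loader = DataLoader(DummyPairs(), batch_size=2, num_workers=0)
for input1, input2 in loader:
    pass  # the training step would go here
```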

Thx!

@feihuzhang (Owner)

I have never encountered this problem. Usually, broken image files cause a similar (but not identical) issue. You could print the file name of each image to check whether training always stops at the same file, or train on another dataset to see whether it happens again.
It seems to be a PyTorch, CUDA, or hardware problem.
Did you try installing PyTorch from source?
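One cheap way to act on the broken-image suggestion is to scan the dataset directory before training. A minimal stdlib-only sketch that flags PNG files missing the 8-byte PNG signature (the directory layout and .png extension are assumptions; truncated files past the header would still need a full decode, e.g. with PIL, to catch):

```python
# Sketch: walk a dataset directory and report .png files that are empty or
# do not start with the standard PNG file signature.
import os

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte signature from the PNG spec

def find_broken_pngs(root):
    """Return paths of .png files under root lacking the PNG signature."""
    broken = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".png"):
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    if f.read(len(PNG_MAGIC)) != PNG_MAGIC:
                        broken.append(path)
    return broken
```

If this reports nothing but training still dies at the same batch index, the dataset is probably fine and the problem lies elsewhere (PyTorch build, CUDA, or hardware).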
