
RuntimeError: DataLoader worker (pid xxx) exited unexpectedly with exit code 1. #70

Open
minygd opened this issue Dec 8, 2019 · 1 comment


minygd commented Dec 8, 2019

Hello, sorry to bother you. I have come across the same problem while running the train.py script: it occurs at Epoch[1](1438/4431) when I use --thread=8 and at Epoch[1](1426/4431) when I use --thread=1.
Following the related pytorch-issue, I still couldn't fix it.

===> Epoch[1](1438/4431): Loss: 2.4208, Error: (2.7224 1.7324 1.4768)
===> Epoch[1](1439/4431): Loss: 7.0227, Error: (7.0972 4.2546 3.7592)
===> Epoch[1](1440/4431): Loss: 6.4023, Error: (6.2473 4.1015 3.3724)
Traceback (most recent call last):
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/home/~/anaconda3/envs/former/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe

Traceback (most recent call last):
  File "train.py", line 189, in <module>
    train(epoch)
  File "train.py", line 115, in train
    disp0, disp1, disp2 = model(input1, input2)
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 75, in parallel_apply
    thread.join()
  File "/home/~/anaconda3/envs/former/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/home/~/anaconda3/envs/former/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
  File "/home/~/anaconda3/envs/former/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7094) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.

The code is running under PyTorch 1.1.0, cudatoolkit 10.1, and torchvision 0.3.0 on an Ubuntu 18.04 machine with four Titan X GPUs.
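As the error message itself suggests, rerunning with num_workers=0 keeps data loading in the main process, so the real exception from the dataset surfaces with a full traceback instead of being lost in a worker subprocess. A minimal sketch of that configuration (the DummyPairs dataset and the loader arguments are placeholders, not the actual ones in train.py):

```python
# Sketch: run the DataLoader with num_workers=0 so that any exception raised
# inside __getitem__ propagates directly to the main process.
import torch
from torch.utils.data import DataLoader, Dataset

class DummyPairs(Dataset):
    """Placeholder for the stereo-pair dataset used by train.py."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        # A real dataset would load and return an image pair here; a
        # corrupt file would raise at this point, with a clear traceback.
        return torch.zeros(3, 4, 4), torch.zeros(3, 4, 4)

# num_workers=0 disables worker subprocesses entirely.
loader = DataLoader(DummyPairs(), batch_size=2, num_workers=0)
for input1, input2 in loader:
    pass  # the training step would go here
```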

Thx!

@feihuzhang (Owner)

I have never encountered this problem. Usually, broken image files cause a similar (but not identical) issue. You could print the file name of each image to check whether training always stops at the same file, or train on another dataset to see whether it happens again.
It seems to be a PyTorch, CUDA, or hardware problem.
Did you try installing PyTorch from source?
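One cheap way to act on the broken-image suggestion is to scan the dataset directory before training. A minimal stdlib-only sketch that flags PNG files missing the 8-byte PNG signature (the directory layout and .png extension are assumptions; truncated files past the header would still need a full decode, e.g. with PIL, to catch):

```python
# Sketch: walk a dataset directory and report .png files that are empty or
# do not start with the standard PNG file signature.
import os

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte signature from the PNG spec

def find_broken_pngs(root):
    """Return paths of .png files under root lacking the PNG signature."""
    broken = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".png"):
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    if f.read(len(PNG_MAGIC)) != PNG_MAGIC:
                        broken.append(path)
    return broken
```

If this reports nothing but training still dies at the same batch index, the dataset is probably fine and the problem lies elsewhere (PyTorch build, CUDA, or hardware).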
