
dap hang when debug torch dataloader #438

Open
wztdream opened this issue Feb 18, 2021 · 4 comments

@wztdream
Contributor

Hi,
I use dap-mode to debug Python code and everything works fine, but when I debug a DataLoader in PyTorch code, it hangs.

from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

train_dataset = datasets.CIFAR10("/home/usr/research/code/data",
                                 train=True,
                                 download=True,
                                 transform=ToTensor())

train_loader = DataLoader(train_dataset, 128, shuffle=True, num_workers=3)
for sample in train_loader: # set a breakpoint here, then dap-next: nothing happens...
    print(type(sample))
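The QueueFeederThread entries that show up in the thread list below can be reproduced with the standard library alone. This is a hypothetical, stdlib-only sketch (not PyTorch code): a multiprocessing.Queue, which DataLoader workers communicate through, lazily starts a feeder thread in the parent process on the first put(), which is where those threads come from.

```python
import multiprocessing
import threading


def feeder_thread_names():
    """Return the names of all live threads after touching a Queue."""
    q = multiprocessing.Queue()
    q.put(0)  # the first put() lazily spawns a 'QueueFeederThread'
    names = [t.name for t in threading.enumerate()]
    q.close()
    return names


if __name__ == "__main__":
    print(feeder_thread_names())  # includes 'QueueFeederThread'
```

Each worker index queue adds one such thread in the parent, which matches the three QueueFeederThread entries reported for num_workers=3.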

Here is the I/O log after dap-next:

Sending: 
{
  "command": "next",
  "arguments": {
    "threadId": 1
  },
  "type": "request",
  "seq": 11
}
Received:
{
  "command": "next",
  "success": true,
  "request_seq": 11,
  "type": "response",
  "seq": 20
}
Received:
{
  "body": {
    "allThreadsContinued": true,
    "threadId": 1
  },
  "event": "continued",
  "type": "event",
  "seq": 21
}
Received:
{
  "body": {
    "threadId": 12,
    "reason": "started"
  },
  "event": "thread",
  "type": "event",
  "seq": 22
}
Sending: 
{
  "command": "threads",
  "type": "request",
  "seq": 12
}
Received:
{
  "body": {
    "threadId": 13,
    "reason": "started"
  },
  "event": "thread",
  "type": "event",
  "seq": 23
}
Sending: 
{
  "command": "threads",
  "type": "request",
  "seq": 13
}
Received:
{
  "body": {
    "threadId": 14,
    "reason": "started"
  },
  "event": "thread",
  "type": "event",
  "seq": 24
}
Sending: 
{
  "command": "threads",
  "type": "request",
  "seq": 14
}
Received:
{
  "body": {
    "threads": [
      {
        "name": "MainThread",
        "id": 1
      },
      {
        "name": "QueueFeederThread",
        "id": 12
      },
      {
        "name": "QueueFeederThread",
        "id": 13
      },
      {
        "name": "QueueFeederThread",
        "id": 14
      }
    ]
  },
  "command": "threads",
  "success": true,
  "request_seq": 12,
  "type": "response",
  "seq": 25
}
Received:
{
  "body": {
    "threads": [
      {
        "name": "MainThread",
        "id": 1
      },
      {
        "name": "QueueFeederThread",
        "id": 12
      },
      {
        "name": "QueueFeederThread",
        "id": 13
      },
      {
        "name": "QueueFeederThread",
        "id": 14
      }
    ]
  },
  "command": "threads",
  "success": true,
  "request_seq": 13,
  "type": "response",
  "seq": 26
}
Received:
{
  "body": {
    "connect": {
      "port": 43119,
      "host": "127.0.0.1"
    },
    "subProcessId": 15309,
    "python": [
      "/home/wangzongtao/anaconda3/envs/torch/bin/python"
    ],
    "program": "/home/wangzongtao/test/test.py",
    "name": "Subprocess 15309",
    "request": "attach",
    "type": "python"
  },
  "event": "debugpyAttach",
  "type": "event",
  "seq": 27
}
Received:
{
  "body": {
    "connect": {
      "port": 43119,
      "host": "127.0.0.1"
    },
    "subProcessId": 15310,
    "python": [
      "/home/wangzongtao/anaconda3/envs/torch/bin/python"
    ],
    "program": "/home/wangzongtao/test/test.py",
    "name": "Subprocess 15310",
    "request": "attach",
    "type": "python"
  },
  "event": "debugpyAttach",
  "type": "event",
  "seq": 28
}
Received:
{
  "body": {
    "threads": [
      {
        "name": "MainThread",
        "id": 1
      },
      {
        "name": "QueueFeederThread",
        "id": 12
      },
      {
        "name": "QueueFeederThread",
        "id": 13
      },
      {
        "name": "QueueFeederThread",
        "id": 14
      }
    ]
  },
  "command": "threads",
  "success": true,
  "request_seq": 14,
  "type": "response",
  "seq": 29
}
Received:
{
  "body": {
    "connect": {
      "port": 43119,
      "host": "127.0.0.1"
    },
    "subProcessId": 15315,
    "python": [
      "/home/wangzongtao/anaconda3/envs/torch/bin/python"
    ],
    "program": "/home/wangzongtao/test/test.py",
    "name": "Subprocess 15315",
    "request": "attach",
    "type": "python"
  },
  "event": "debugpyAttach",
  "type": "event",
  "seq": 30
}
  1. What is special about the dataloader?
    The torchvision datasets try to download the dataset from the internet; if the dataset has already been downloaded, the local copy is used. In the test above I had already downloaded the dataset.
  2. What is special about my internet connection?
    I access the internet through a proxy, and that works fine otherwise.
  3. Does VS Code work correctly?
    Yes, VS Code can debug this correctly.

At first it seemed this issue was related to the internet connection, but I have no idea where the problem is.

@wztdream
Contributor Author

Maybe related to attach mode, #384?
After some experimentation I noticed that this issue is related to multiprocessing (the DataLoader workers are separate processes, as the forked-subprocess events in the log show), and that the behavior depends on the Python debug backend.

  1. Multiprocessing
    In the code below, num_workers=3 starts 3 worker processes to load the data. If I remove this parameter, so that only the main process is used, everything works fine with both the debugpy and ptvsd backends.
    train_loader = DataLoader(train_dataset, 128, shuffle=True, num_workers=3)

  2. Debug backend
    With num_workers=3, only the backend changes the outcome.
    If I use debugpy as the backend, dap-mode just hangs (the I/O log above is from the debugpy backend), but if I use ptvsd as the backend there is an error:

/home/wangzongtao/anaconda3/envs/torch/bin/python -m ptvsd --wait --host localhost --port 43209 /home/wangzongtao/test/test.py
Files already downloaded and verified
Traceback (most recent call last):
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/__main__.py", line 446, in <module>
    main(sys.argv)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wangzongtao/test/test.py", line 11, in <module>
    for sample in train_loader:
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 801, in __init__
    w.start()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 528, in new_fork
    _on_forked_process()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 50, in _on_forked_process
    pydevd.settrace_forked()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 2427, in settrace_forked
    patch_multiprocessing=True,
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 2179, in settrace
    wait_for_ready_to_run,
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 2230, in _locked_settrace
    debugger.connect(host, port)  # Note: connect can raise error.
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 1060, in connect
    s = start_client(host, port)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/pydevd_hooks.py", line 136, in _start_client
    return start_client(daemon, h, p)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_remote.py", line 62, in <lambda>
    start_client=(lambda daemon, h, port: start_daemon()),
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_remote.py", line 50, in start_daemon
    _, next_session = daemon.start_server(addr=(host, port))
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/daemon.py", line 158, in start_server
    with self.started():
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/daemon.py", line 110, in started
    self.start()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/daemon.py", line 145, in start
    raise RuntimeError('already started')
RuntimeError: already started
Traceback (most recent call last):
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/__main__.py", line 446, in <module>
    main(sys.argv)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wangzongtao/test/test.py", line 11, in <module>
    for sample in train_loader:
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 801, in __init__
    w.start()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
    self.pid = os.fork()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 528, in new_fork
    _on_forked_process()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 50, in _on_forked_process
    pydevd.settrace_forked()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 2427, in settrace_forked
    patch_multiprocessing=True,
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 2179, in settrace
    wait_for_ready_to_run,
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 2230, in _locked_settrace
    debugger.connect(host, port)  # Note: connect can raise error.
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_vendored/pydevd/pydevd.py", line 1060, in connect
    s = start_client(host, port)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/pydevd_hooks.py", line 136, in _start_client
    return start_client(daemon, h, p)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_remote.py", line 62, in <lambda>
    start_client=(lambda daemon, h, port: start_daemon()),
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/_remote.py", line 50, in start_daemon
    _, next_session = daemon.start_server(addr=(host, port))
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/daemon.py", line 158, in start_server
    with self.started():
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/daemon.py", line 110, in started
    self.start()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/daemon.py", line 145, in start
    raise RuntimeError('already started')
RuntimeError: already started
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/util.py", line 319, in _exit_function
    p.join()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/process.py", line 122, in join
    assert self._parent_pid == os.getpid(), 'can only join a child process'
AssertionError: can only join a child process
Traceback (most recent call last):
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/__main__.py", line 446, in <module>
    main(sys.argv)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/wangzongtao/test/test.py", line 11, in <module>
    for sample in train_loader:
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 827, in __init__
    self._reset(loader, first_iter=True)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 857, in _reset
    self._try_put_index()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1102, in _try_put_index
    self._index_queues[worker_queue_idx].put((self._send_idx, index))
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/queues.py", line 87, in put
    self._start_thread()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/multiprocessing/queues.py", line 169, in _start_thread
    self._thread.start()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/threading.py", line 851, in start
    self._started.wait()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()
  File "/home/wangzongtao/anaconda3/envs/torch/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 7037) is killed by signal: Terminated. 

Debug Adapter exited abnormally with code 1 at Sat Feb 20 10:17:19

I checked that VS Code also uses debugpy as its backend and works fine with the num_workers=3 setting, so this seems to be related to dap-mode. I am not sure whether #384 can solve this issue.

@nbfalcon
Member

The PR you mentioned is for ptvsd, not for debugpy. Currently, debugpy doesn't work in attach mode; see #406.

@wztdream
Contributor Author

wztdream commented Feb 20, 2021

@nbfalcon thank you for looking at this issue. But my issue seems to be related to multiprocessing, perhaps not to attach mode (which, as far as I can tell, only concerns remote debugging, and that is not my case).

So, any idea where the problem is? Is it that dap-mode currently does not support multi-process debugging, or is it a debugpy issue?

@dsyzling
Member

I think this might be related to multi-process debugging, which is triggered by the debugpyAttach message and, more recently, by the standardised startDebugging request sent from the debug adapter to the debug client (dap-mode). I've noticed that with recent changes to pytables, loading HDF5 files via pandas causes a separate subprocess to start. This makes dap-mode hang under Emacs, because the debugpy adapter expects a response from the client to handle the subprocess starting: the client is expected to start a new debug session, run through initialisation, and then attach to the debugger using the connection details in the debugpyAttach/startDebugging message.
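To make the expected handshake concrete, here is a hedged sketch (in Python for illustration; dap-mode's real implementation is Elisp, and the function name here is hypothetical) of how a client could derive, from a debugpyAttach event body like the ones in the log above, the address to connect to and the attach request to send on the new connection:

```python
def subprocess_attach_request(event_body, seq):
    """Build the (host, port) to connect to and the DAP 'attach' request
    for one child session, from a debugpyAttach event body.

    Field names mirror the event bodies in the I/O log above; this is a
    sketch, not dap-mode's or debugpy's actual code.
    """
    connect = event_body["connect"]
    request = {
        "seq": seq,
        "type": "request",
        "command": "attach",
        "arguments": {
            # Connection details for the child debug adapter.
            "connect": {"host": connect["host"], "port": connect["port"]},
            "subProcessId": event_body["subProcessId"],
        },
    }
    return (connect["host"], connect["port"]), request
```

The client would open a new socket to that address, run the usual initialize/attach/configurationDone sequence on it, and track the result as a separate child session, which is exactly the hierarchical-session bookkeeping dap-mode currently lacks.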

I've also verified this with a simple script that spawns a separate sub process which hangs. There are issues raised against libraries such as FastAPI which would spawn processes - which refer to hanging in the debugger.

Incidentally, when I debug the HDF5 file open within VS Code, I can see the separate subprocess start and stop as I step over it. I've also tracked the message sequences that are sent, and I have dap-mode code that can work around the issue, but I fear that some form of hierarchical debug-session support will be needed within dap-mode to handle switching between debug sessions as breakpoints are triggered and sessions are closed.

I'm going to raise a new issue to discuss this because the implications are wider than dap-python and involve dap-mode and possibly dap-ui changes.
