Investigate ghost processes in tproxy and JDC #1266

Open
jbesraa opened this issue Nov 28, 2024 · 3 comments

Comments

@jbesraa
Contributor

jbesraa commented Nov 28, 2024

While working on #1207, it was noticed that some of the spawned tasks do not exit when the main task is dropped. This results in a hanging test: the futures of those spawned tasks never resolve, so the test never ends.
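
The behavior described above can be reproduced in miniature. This is only a sketch under stated assumptions: it uses std threads instead of tokio tasks so that it runs without external crates, and `spawn_worker` / `worker_survives_handle_drop` are hypothetical names, not code from tProxy or JDC. The analogy is that dropping a tokio `JoinHandle` also detaches the task rather than cancelling it.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Spawns a background worker that increments `ticks` until `stop` is set.
/// The handle is returned so the caller decides whether to keep or drop it.
fn spawn_worker(ticks: Arc<AtomicU64>, stop: Arc<AtomicBool>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while !stop.load(Ordering::SeqCst) {
            ticks.fetch_add(1, Ordering::SeqCst);
            thread::sleep(Duration::from_millis(5));
        }
    })
}

/// Drops the worker's handle, then reports whether the worker kept ticking.
fn worker_survives_handle_drop() -> bool {
    let ticks = Arc::new(AtomicU64::new(0));
    let stop = Arc::new(AtomicBool::new(false));

    let handle = spawn_worker(ticks.clone(), stop.clone());
    drop(handle); // detaches the worker; it does NOT stop it

    let before = ticks.load(Ordering::SeqCst);
    thread::sleep(Duration::from_millis(100));
    let after = ticks.load(Ordering::SeqCst);

    // Only an explicit signal ends the worker (hypothetical cooperative flag).
    stop.store(true, Ordering::SeqCst);
    after > before
}

fn main() {
    println!(
        "worker survived handle drop: {}",
        worker_survives_handle_drop()
    );
}
```

In other words, every spawned task needs an explicit stop path (an abort, a cancellation signal, or a closed channel); dropping whatever spawned it is not enough on its own.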

@Shourya742 @plebhash tagging you as you have done some work here already

edit by @plebhash: it's important to get to the bottom of this, because it seems it's become a blocker to Integration Test issues such as #1207

@plebhash
Collaborator

plebhash commented Nov 28, 2024

one of the main findings is that this usually happens with JDC and tProxy, and both use a task_collector pattern in very similar ways:

  • tProxy: task_collector: Arc<Mutex<Vec<(AbortHandle, String)>>>
  • JDC: task_collector: Arc<Mutex<Vec<AbortHandle>>>

we don't observe this behavior in any other role, and none of them have the task_collector pattern
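
for readers unfamiliar with the pattern, here is a minimal sketch of a collector with the same shape as the tProxy one. Assumptions: `StopFlag` is a hypothetical std-only stand-in for tokio's `AbortHandle` (a flag the worker polls, so the example runs without external crates), and `spawn_tracked`, `shutdown_all`, and the task names are made up for illustration. It is not the actual tProxy or JDC code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for tokio's `AbortHandle`: a flag the worker polls.
type StopFlag = Arc<AtomicBool>;

/// Same shape as the tProxy collector, with `StopFlag` replacing `AbortHandle`.
type TaskCollector = Arc<Mutex<Vec<(StopFlag, String)>>>;

/// Spawns a worker and registers its stop flag in the collector under `name`.
fn spawn_tracked(collector: &TaskCollector, name: &str) -> thread::JoinHandle<()> {
    let stop: StopFlag = Arc::new(AtomicBool::new(false));
    collector
        .lock()
        .unwrap()
        .push((stop.clone(), name.to_string()));
    thread::spawn(move || {
        while !stop.load(Ordering::SeqCst) {
            thread::sleep(Duration::from_millis(5));
        }
    })
}

/// Signals every registered worker to stop and returns the names signalled.
fn shutdown_all(collector: &TaskCollector) -> Vec<String> {
    let mut tasks = collector.lock().unwrap();
    let names: Vec<String> = tasks
        .drain(..)
        .map(|(stop, name)| {
            stop.store(true, Ordering::SeqCst);
            name
        })
        .collect();
    names
}

fn main() {
    let collector: TaskCollector = Arc::new(Mutex::new(Vec::new()));
    let a = spawn_tracked(&collector, "upstream_reader");
    let b = spawn_tracked(&collector, "downstream_writer");

    let stopped = shutdown_all(&collector);
    // Joining returns only because both workers were registered and signalled.
    a.join().unwrap();
    b.join().unwrap();
    println!("stopped: {:?}", stopped);
}
```

the point of the sketch: shutdown only reaches tasks that were pushed into the collector, and only if the shutdown path can actually take the lock and walk the list.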

@plebhash
Collaborator

plebhash commented Nov 28, 2024

here are some findings from manual tests

they were observed without any Integration Tests or tokio-console

the text below might seem very dense, but we can actually infer some patterns from these observations


The first setup had the following:

  • TP on testnet4
  • Pool
  • JDS
  • JDC
  • tProxy (connected to JDC)

I tried killing JDC and observing tProxy in two different ways:

  • wait for tProxy to establish the connection with JDC before killing JDC: tProxy becomes unresponsive to ctrl+C and needs to be killed manually
  • don't wait for tProxy to open an extended channel with JDC before killing JDC: tProxy is still responsive to ctrl+C

then I tried killing tProxy and observing JDC, and the same pattern was observed:

  • wait for tProxy to establish the connection with JDC before killing tProxy: JDC becomes unresponsive to ctrl+C and needs to be killed manually
  • don't wait for tProxy to establish the connection with JDC before killing tProxy: JDC is still responsive to ctrl+C

then I tried killing Pool and observing JDC, but the same pattern was not observed:

  • wait for JDC to establish connection with Pool before killing Pool: JDC is still responsive to ctrl+C
  • don't wait for JDC to establish connection with Pool before killing Pool: JDC is still responsive to ctrl+C

then I tried killing JDS and observing JDC, but the same pattern was not observed:

  • wait for JDC to establish connection with JDS before killing JDS: JDC is still responsive to ctrl+C
  • don't wait for JDC to establish connection with JDS before killing JDS: JDC is still responsive to ctrl+C

The second setup was the following:

  • TP on testnet4
  • Pool
  • tProxy (connected to Pool)

I tried killing Pool and observing tProxy in two different ways:

  • wait for tProxy to establish the connection with Pool before killing Pool: tProxy becomes unresponsive to ctrl+C and needs to be killed manually
  • don't wait for tProxy to establish the connection with Pool before killing Pool: tProxy is still responsive to ctrl+C

then I tried killing tProxy and observing Pool, but the same pattern was not observed:

  • wait for tProxy to establish connection with Pool before killing tProxy: Pool is still responsive to ctrl+C
  • don't wait for tProxy to establish connection with Pool before killing tProxy: Pool is still responsive to ctrl+C

@plebhash
Collaborator

plebhash commented Nov 30, 2024

another relevant observation on JD-less setup:

I tried killing Pool and observing tProxy in two different ways:

  • wait for tProxy to establish the connection with Pool before killing Pool: tProxy becomes unresponsive to ctrl+C and needs to be killed manually
  • don't wait for tProxy to establish the connection with Pool before killing Pool: tProxy is still responsive to ctrl+C

after killing Pool and making tProxy unresponsive to Ctrl+C, I avoided trying to force-kill tProxy (via pkill or kill -9), and simply re-launched the Pool

then tProxy re-connected to Pool, and went back to responding to Ctrl+C normally again


this should provide meaningful insight wrt task_collector
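
one hypothetical mechanism that would be consistent with this observation (it is NOT confirmed anywhere in this thread) is a shutdown path that needs a lock which some task is holding while it waits for the peer to come back. The sketch below models only that hypothesis, using std threads and channels; `simulate_blocked_shutdown` and all names in it are assumptions for illustration.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Returns `(lock_free_while_peer_down, lock_free_after_peer_returns)`.
fn simulate_blocked_shutdown() -> (bool, bool) {
    // Hypothetical collector; the element type does not matter for the demo.
    let collector: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));
    let (peer_back_tx, peer_back_rx) = mpsc::channel::<()>();
    let (locked_tx, locked_rx) = mpsc::channel::<()>();

    let worker_collector = collector.clone();
    let worker = thread::spawn(move || {
        // The worker takes the lock, then blocks until the peer is back.
        let mut tasks = worker_collector.lock().unwrap();
        locked_tx.send(()).unwrap();
        peer_back_rx.recv().unwrap();
        tasks.push("reconnected".to_string());
        // The guard is dropped here, releasing the lock.
    });

    // Wait until the worker really holds the lock.
    locked_rx.recv().unwrap();

    // A ctrl+C handler that needs this lock could not make progress now.
    let while_down = collector.try_lock().is_ok();

    // Simulate re-launching the peer: the worker unblocks and releases the lock.
    peer_back_tx.send(()).unwrap();
    worker.join().unwrap();
    let after_return = collector.try_lock().is_ok();

    (while_down, after_return)
}

fn main() {
    let (while_down, after_return) = simulate_blocked_shutdown();
    println!("while down: {while_down}, after return: {after_return}");
}
```

in this model the lock is unavailable while the peer is down and available again once it returns, which mirrors the "unresponsive until Pool is re-launched" behavior. Whether the real code has such a lock-and-wait path is something the investigation would need to verify.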
