[Bug] Disconnect not always detected #307

Open
allada opened this issue Oct 28, 2024 · 4 comments
Labels: bug (Something isn't working)

Comments

allada commented Oct 28, 2024

We recently got all the configs in place to handle reconnects in fred, but there appears to be some state in which fred does not detect a disconnect and the connection hangs forever.

I cannot reproduce it consistently, but the failure usually shows up after a few minutes.

From what I can tell, read timeouts trigger everything properly. Writes and reconnects are harder to observe, so that is where I started focusing.

Before digging into this, I figured I'd post here to ask whether there are any known issues around this, or any hints.

allada added the bug label Oct 28, 2024
aembke (Owner) commented Oct 28, 2024

Not that I'm aware of, no. Can you post the configuration details you're using and describe the server/cluster deployment model?

allada (Author) commented Oct 28, 2024

I can see this in end-to-end setups, so it is hard to pinpoint the relevant parts/configs because there are so many moving parts, but here's the high level...

Here is the code that is relevant (can reproduce with default config to that store): https://github.com/TraceMachina/nativelink/blob/b2386fdd16ccc4d3330fcf91f593c7e9262a6197/nativelink-store/src/redis_store.rs#L212

Reproduction case:

  1. Spin up a Redis Sentinel cluster and the related nativelink config (easier to do over a call than in a GitHub issue).
  2. Compile nativelink with remote execution pointing to the deployed cluster.
  3. While compiling, kill the sentinel redis node.

Nativelink should take at most 10-20 seconds to realize something is wrong and reschedule the downloads, uploads, and jobs. Sometimes it never recovers.

If possible it might be easier to show this in action over a call [email in my github user profile]. I started writing my own RedisPool to work around this issue to try and move fast, but if we can get some support here we can probably help everyone.
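
For readers hitting the same hang, one caller-side way to bound it, independent of fred's own timeout settings, is to put a deadline around each operation so that a stalled write surfaces as an error instead of blocking forever. The following is only a minimal std-only sketch of that idea: `run_with_deadline` and `DeadlineError` are hypothetical names invented for illustration, and nothing here is fred's or nativelink's actual API or the workaround mentioned above.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical error type for this sketch; not part of fred or nativelink.
#[derive(Debug, PartialEq)]
pub enum DeadlineError {
    /// The operation did not finish before the deadline.
    TimedOut,
    /// The worker thread ended without sending a result (for example, it panicked).
    WorkerGone,
}

/// Runs `op` on a helper thread and waits at most `deadline` for its result.
/// Assumption: `op` stands in for any blocking call that may hang on a dead connection.
pub fn run_with_deadline<T, F>(deadline: Duration, op: F) -> Result<T, DeadlineError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already have given up, so a failed send is ignored.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(deadline) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Err(DeadlineError::TimedOut),
        Err(mpsc::RecvTimeoutError::Disconnected) => Err(DeadlineError::WorkerGone),
    }
}
```

A caller would treat `Err(DeadlineError::TimedOut)` as a signal to drop the connection and reschedule the work. Note that the helper thread is not cancelled in this sketch and keeps running until `op` returns; in async code the same idea is usually expressed with a timeout future such as `tokio::time::timeout`.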

aembke (Owner) commented Oct 29, 2024

Yeah no problem, happy to get on a call. I'll send you an email tomorrow and we can set something up.

aembke (Owner) commented Nov 30, 2024

Hey @allada, did you have any luck repro'ing this on your side? I added sentinel support to the failover testing tool I usually use for this kind of thing (https://github.com/aembke/fred.rs/tree/main/bin/inf_loop) and tested sentinel failover for a few hours but wasn't able to repro the issue.

Also for what it's worth I just released version 10.0.0, which contains a ton of routing changes, so you might want to try against that version too.
