[Bug] Disconnect not always detected #307
Comments
Not that I'm aware of, no. Can you post the configuration details you're using and describe the server/cluster deployment model?
I can see this in end-to-end setups, so it is hard to pinpoint the relevant parts/configs because there are so many moving parts, but here's the high level... Here is the relevant code (it can be reproduced with the default config for that store): https://github.com/TraceMachina/nativelink/blob/b2386fdd16ccc4d3330fcf91f593c7e9262a6197/nativelink-store/src/redis_store.rs#L212 Reproduction case:
Nativelink should take up to 10-20 seconds to realize something is wrong and reschedule the downloads, uploads and jobs... Sometimes it never recovers. If possible it might be easier to show this in action over a call [email in my github user profile]. I started writing my own
Yeah no problem, happy to get on a call. I'll send you an email tomorrow and we can set something up.
Hey @allada, did you have any luck repro'ing this on your side? I added sentinel support to the failover testing tool I usually use for this kind of thing (https://github.com/aembke/fred.rs/tree/main/bin/inf_loop) and tested sentinel failover for a few hours but wasn't able to repro the issue. Also for what it's worth I just released version 10.0.0, which contains a ton of routing changes, so you might want to try against that version too.
We recently got all the configs going to handle reconnect in fred, but it appears that there is some state where disconnects are not detected by fred and end up hanging the connection forever.
I cannot get it to reproduce consistently, but I can usually get the reproduction case to happen after a few minutes.
From what I can tell, read timeouts do trigger everything properly. Writes and reconnects, however, seemed more obscure, and that is where I started focusing my thoughts.
Before digging into this, I figured I'd post here to see whether there are any known issues around this, or any hints.
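
For illustration only, here is a minimal sketch of the kind of caller-side deadline that bounds a hung write or reconnect instead of waiting on it forever. It uses only the Rust standard library. `with_deadline` and `Guarded` are hypothetical names, not part of fred or Nativelink, and the sleeping closure merely stands in for a command stuck on a dead connection. In async code the same idea would normally be expressed with an async timeout such as `tokio::time::timeout`.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Outcome of an operation run under a deadline (hypothetical type).
#[derive(Debug, PartialEq)]
enum Guarded<T> {
    /// The operation finished before the deadline.
    Done(T),
    /// The deadline elapsed first; the caller should treat the connection as suspect.
    TimedOut,
    /// The worker thread ended without producing a value (for example, it panicked).
    WorkerDied,
}

/// Runs `op` on a helper thread and waits at most `deadline` for its result.
/// The helper thread is not cancelled on timeout; it is simply abandoned.
fn with_deadline<T, F>(deadline: Duration, op: F) -> Guarded<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails if the caller already gave up; that is fine to ignore.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(deadline) {
        Ok(value) => Guarded::Done(value),
        Err(RecvTimeoutError::Timeout) => Guarded::TimedOut,
        Err(RecvTimeoutError::Disconnected) => Guarded::WorkerDied,
    }
}

fn main() {
    // A command that returns promptly.
    let fast = with_deadline(Duration::from_millis(500), || 42);

    // A stand-in for a write stuck on a connection that silently went away.
    let hung = with_deadline(Duration::from_millis(50), || {
        thread::sleep(Duration::from_millis(500));
        0
    });

    println!("fast = {:?}, hung = {:?}", fast, hung);
}
```

A guard like this only bounds how long the caller waits. It does not explain why the disconnect goes undetected inside the client, which is the actual question here.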