[Bug] Disconnect not always detected #307

allada · 2024-10-28T15:50:51Z

We recently got all the configs in place to handle reconnects in fred, but there appears to be some state where fred does not detect a disconnect and the connection ends up hanging forever.

I cannot get it to reproduce consistently, but I can usually trigger it within a few minutes.

From what I can tell, read timeouts do trigger everything properly. Writes and reconnects, however, were harder to trace, and that is where I started focusing my attention.

Before digging into this, I figured I'd post here to see whether there are any known issues around this, or any hints.

aembke · 2024-10-28T16:58:46Z

Not that I'm aware of, no. Can you post the configuration details you're using and describe the server/cluster deployment model?

allada · 2024-10-28T18:18:33Z

I can see this in end-to-end setups, so it is hard to pinpoint the relevant parts/configs because there are so many moving parts, but here's the high level...

Here is the code that is relevant (can reproduce with default config to that store): https://github.com/TraceMachina/nativelink/blob/b2386fdd16ccc4d3330fcf91f593c7e9262a6197/nativelink-store/src/redis_store.rs#L212

Reproduction case:

1. Spin up redis sentinel cluster & related nativelink config (easier to do over call than over github issue).
2. Compile nativelink with remote execution pointing to deployed cluster.
3. While compiling, kill the sentinel redis node.

Nativelink should take up to 10-20 seconds to realize something is wrong and reschedule the downloads, uploads and jobs... Sometimes it never recovers.
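
To make the "recover within a bounded time" expectation concrete, here is a minimal sketch of a caller-side deadline guard, using only the Rust standard library. It is not fred's API and not what nativelink does; `call_with_deadline` and `Guarded` are hypothetical names made up for illustration. The idea is only that a hung write gets reported as a timeout instead of blocking forever, so the caller can decide to reschedule.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Outcome of an operation guarded by a deadline (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Guarded<T> {
    /// The operation finished before the deadline.
    Done(T),
    /// No result arrived within the deadline; the operation may be hung.
    TimedOut,
    /// The worker thread ended without sending a result (e.g. it panicked).
    Failed,
}

/// Runs `op` on a helper thread and waits at most `deadline` for its result.
/// Hypothetical helper: a real async client would more likely wrap the
/// future in its runtime's timeout facility instead of spawning a thread.
pub fn call_with_deadline<T, F>(deadline: Duration, op: F) -> Guarded<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already gave up, the receiver is gone and the
        // send fails; that is fine, so the error is ignored.
        let _ = tx.send(op());
    });
    match rx.recv_timeout(deadline) {
        Ok(value) => Guarded::Done(value),
        Err(mpsc::RecvTimeoutError::Timeout) => Guarded::TimedOut,
        Err(mpsc::RecvTimeoutError::Disconnected) => Guarded::Failed,
    }
}
```

Note that a timed-out operation is not cancelled here; the helper thread keeps running until `op` returns. That is acceptable for a sketch, but a real workaround would also need to tear down the stuck connection.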

If possible it might be easier to show this in action over a call [email in my github user profile]. I started writing my own RedisPool to work around this issue to try and move fast, but if we can get some support here we can probably help everyone.

aembke · 2024-10-29T01:40:30Z

Yeah no problem, happy to get on a call. I'll send you an email tomorrow and we can set something up.

aembke · 2024-11-30T23:26:46Z

Hey @allada, did you have any luck repro'ing this on your side? I added sentinel support to the failover testing tool I usually use for this kind of thing (https://github.com/aembke/fred.rs/tree/main/bin/inf_loop) and tested sentinel failover for a few hours but wasn't able to repro the issue.

Also for what it's worth I just released version 10.0.0, which contains a ton of routing changes, so you might want to try against that version too.

allada added the bug label Oct 28, 2024

allada assigned aembke Oct 28, 2024
