Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite WEBSOCKET: Error reading websocket scenarios #83

Closed
jessegrosjean opened this issue May 10, 2024 · 14 comments · Fixed by #86
Closed

Infinite WEBSOCKET: Error reading websocket scenarios #83

jessegrosjean opened this issue May 10, 2024 · 14 comments · Fixed by #86
Assignees
Labels
bug Something isn't working

Comments

@jessegrosjean
Copy link
Contributor

I'm periodically seeing tons of WEBSOCKET: Error reading websocket errors that grind my app to a halt. I've found one way to recreate, which is to connect the web socket provider, and then turn off my wifi-connection, but I also think the error loop can happen in other less reproducible cases.

I don't know the correct fix, but problem is happening in while true loop in ongoingReceiveWebSocketMessages. Maybe the reconnectAttempts counter needs to factor in with that error too?

@heckj heckj added the bug Something isn't working label May 10, 2024
@heckj
Copy link
Collaborator

heckj commented May 10, 2024

When it's in a reconnect loop, it will try indefinitely until it's told to stop with .disconnect() - there's nothing (currently) feeding information about network paths and availability of connectivity into the heuristics. It should be delaying (slowly attempts to retry) as it goes, but there's no "stop after N attempts" concept.

When you say "grinding app to a halt" - do you mean this is impacting the rest of your app's UI or that it's not updating the relevant Document that you expect, and that makes everything appear to freeze? (want to understand if there are unexpected side effects here that I haven't accounted for)

What would be your desired behavior when it can't connect, but you've told it to - the "connect and then disable WIFI" scenario?

@jessegrosjean
Copy link
Contributor Author

I'm working on a simple project to reproducible the error. Will share that soon with more tracing and info.

@heckj
Copy link
Collaborator

heckj commented May 10, 2024

That's wonderful Jesse, thank you!

@jessegrosjean
Copy link
Contributor Author

Here's what I'm seeing:

https://github.com/jessegrosjean/AutomergeWebsocketError

  1. Start this app with wifi on
  2. Once app has started then turn wifi off

Once you have done this many WEBSOCKET: Errors start happening. CPU Usage reported by Xcode goes from 0 to 60%. App does stay responsive (menus and such). Upon quitting app Xcode then beachballs for 5-10 seconds... as I guess mabye its cleaning out logging? Not sure.

I did a trace to see what's taking up the CPU and it seems like it's almost all logging related CPU. I'm not sure how to interrupt that. Still seems to me like the retry loop should slow down in this case and then error out eventually. This is from someone who's done very little network code, so I really don't know what proper behavior should be.

Also even when re-enabling Wifi CPU remains high, pattern changes a little, but I think problem persists.

@heckj
Copy link
Collaborator

heckj commented May 10, 2024

First, we can definitely add a "stop after X/N" - where that's either time or attempts. that begs the question of how does your app know what's happening - that it's terminated after being unable to conenct. My first reaction is to look at enabling a publisher that provides state for the WebSocket - loosely one of [connected|ready|disconnected|waiting_to_retry].

@heckj heckj self-assigned this May 10, 2024
@heckj
Copy link
Collaborator

heckj commented May 10, 2024

Wow - I see what you mean, there's something VERY awry with the backoff loop - it's just NOT there and tight spinning on attempting to reconnect.

@jessegrosjean
Copy link
Contributor Author

jessegrosjean commented May 10, 2024

First, we can definitely add a "stop after X/N" - where that's either time or attempts. that begs the question of how does your app know what's happening - that it's terminated after being unable to conenct. My first reaction is to look at enabling a publisher that provides state for the WebSocket - loosely one of [connected|ready|disconnected|waiting_to_retry].

I don't really know how these things normally work at API level. From users perspective it would be nice if app can detect when network is available again, and automatically reconnect while also displaying some sort of offline/online status. I'm not sure if all that logic should be in app code, or in automerge-repo code.

@heckj
Copy link
Collaborator

heckj commented May 10, 2024

The PR I made to help issue-83 is still leaving something in a tight loop, so it's not working quite as I expected. I need to run for a bit, but I'll resume debugging on this a bit later with your example app. I really wish there was an easy way to control network connectivity in an integration test... Ah well, this is getting us there.

@heckj
Copy link
Collaborator

heckj commented May 10, 2024

Found at least two logic issues:
1 - when a read error occured, I wasn't setting Peered=false to trigger the reconnect, which left that Task spinning in the background with no delay, consuming CPU like a fiend
2 - I also wasn't acknowledging config.attemptToReconnect, so when it's set to false, it should NEVER get into a loop like this.

Both of those are now set, and the background loop terminates nicely when reconnectOnError=false. However, when it's true and I toggle wifi on and off again, its apepars to be fully terminating that background loop (although I'm missing the specific exception path of that process)

So this PR as it stands is a notable improvement, but insufficient as far as I can see. This begs the question for state information about what's happening with the provider: is it connected, is it not, and how would an app using it know? So next steps in my mind is adding a publisher to provide state, as well as tracking down what the exception flow is that appears to be terminating that background loop on an error (toggling wifi) such that it can't attempt to recover. I'm just not seeing the spot now, but maybe after a few hours away it'll be more clear.

@jessegrosjean
Copy link
Contributor Author

That all sounds good to me. Thanks!

@heckj
Copy link
Collaborator

heckj commented May 10, 2024

I've verified that something is triggering an exception that isn't being caught, the flow of which terminates the recurring loop.

WEBSOCKET: Activating websocket to wss://sync.automerge.org/
WEBSOCKET: RECV: PEER[version: 1, sender: storage-server-sync-automerge-org, target: F7357FD8-C9F7-4925-BE0F-399D29E5E0FD, metadata: [storageId: 3760df37-a4c6-4f66-9ecd-732039a9385d, ephemeral: false]]
WEBSOCKET: Peered to targetId: storage-server-sync-automerge-org PEER[version: 1, sender: storage-server-sync-automerge-org, target: F7357FD8-C9F7-4925-BE0F-399D29E5E0FD, metadata: [storageId: 3760df37-a4c6-4f66-9ecd-732039a9385d, ephemeral: false]]
WEBSOCKET: connected to wss://sync.automerge.org/
WEBSOCKET: await receive()
nw_proto_tcp_route_init [C1.1.1:3] no mtu received
Task <0498939E-9515-4D72-90FA-3996BF837D6A>.<1> finished with error [-1005] Error Domain=NSURLErrorDomain Code=-1005 "The network connection was lost." UserInfo={NSErrorFailingURLStringKey=wss://sync.automerge.org/, NSErrorFailingURLKey=wss://sync.automerge.org/, _NSURLErrorRelatedURLSessionTaskErrorKey=(
    "LocalWebSocketTask <0498939E-9515-4D72-90FA-3996BF837D6A>.<1>"
), _NSURLErrorFailingURLSessionTaskErrorKey=LocalWebSocketTask <0498939E-9515-4D72-90FA-3996BF837D6A>.<1>, NSLocalizedDescription=The network connection was lost.}
WEBSOCKET: Error reading websocket: The operation couldn’t be completed. Socket is not connected
WEBSOCKET: await receive()
WEBSOCKET: Reconnect attempt #0, waiting for 0 seconds.
WEBSOCKET: Activating websocket to wss://sync.automerge.org/
Connection 3: received failure notification
Connection 3: failed to connect 1:50, reason -1
Connection 3: encountered error(1:50)
Task <CBE8C785-034A-4578-A1F5-9685269FC453>.<2> HTTP load failed, 0/0 bytes (error code: -1009 [1:50])
EXCEPTION from background receive loop: The Internet connection appears to be offline.
Task <CBE8C785-034A-4578-A1F5-9685269FC453>.<2> finished with error [-1009] Error Domain=NSURLErrorDomain Code=-1009 "The Internet connection appears to be offline." UserInfo={NSErrorFailingURLStringKey=wss://sync.automerge.org/, NSErrorFailingURLKey=wss://sync.automerge.org/, _NSURLErrorRelatedURLSessionTaskErrorKey=(
    "LocalWebSocketTask <CBE8C785-034A-4578-A1F5-9685269FC453>.<2>"
), _NSURLErrorFailingURLSessionTaskErrorKey=LocalWebSocketTask <CBE8C785-034A-4578-A1F5-9685269FC453>.<2>, NSLocalizedDescription=The Internet connection appears to be offline.}
Terminated background read loop - socket expected to be disconnected
BOOL _NSPersistentUIDeleteItemAtFileURL(NSURL *const __strong) Failed to stat item: file:///Users/heckj/Library/Containers/hogbaysoftware.AutomergeWebsocketError/Data/Library/Saved%20Application%20State/hogbaysoftware.AutomergeWebsocketError.savedState/restorecount.plist

Okay, now I'm really, really going to step away from this and go for some exercise and try to clear my head.

@heckj
Copy link
Collaborator

heckj commented May 10, 2024

Found it - if the network is unavailable when the reconnect attempt happens, it'll throw an error that terminates out of the retry loop. I'm going to remove throws from that method overall to force (my) awareness as well as add handling of the exception as a failure that continues to (hopefully) back off and retry.

@heckj
Copy link
Collaborator

heckj commented May 10, 2024

And that's got it - I had a few more places where I missed things, but now it'll slowly back-off on reconnect attempts - BUT CONTINUE TO TRY - until wifi is restored, after which it'll reconnect and you're back in action. The code is doing redundant changes of state to make sure it knows "NO - YOU'RE A DEAD CONNECTION", but I'm not as fussed with the extra work in there, even if it's a clear state of my confusion and back-and-forth while developing this. ;-)

On the plus side, the manual testing that I did while working this out with your quick&dirty sample project answers the question to #14. I'm leaving that issue open because I still think it would be a good test setup, but this has sorted a lot of it. There was also a question in my mind of "doing I need to manually send pings every few seconds to make sure the connection is retained, but apparently that's something that URLSession.WebSocketDataTask does on our behalf, as the remote side never closed the connection, even while waiting ~15 seconds without activity.

@heckj heckj closed this as completed in #86 May 10, 2024
@heckj
Copy link
Collaborator

heckj commented May 10, 2024

(@jessegrosjean - reopen this, or kick up a new issue, if something still seems awry on this front)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants