Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

aaronbarsky · 2024-06-10T16:55:46Z

Summary

Crashlytics is reporting a very rare but consistent crash on the com.apollograph.websocket queue.

Crashed: com.apollographql.websocket
EXC_BAD_ACCESS KERN_INVALID_ADDRESS 0x0000000f9d3993a0
0 libobjc.A.dylib objc_retain + 16
1 libobjc.A.dylib objc_retain_x0 + 16
2 CoreFoundation _inputStreamCallbackFunc + 60
3 CoreFoundation _signalEventSync + 216
4 CoreFoundation ___signalEventQueue_block_invoke + 28
5 libdispatch.dylib _dispatch_call_block_and_release + 32

The app is a mix of 60% foreground and 40% background.

Version

1.9.2

Steps to reproduce the behavior

Unfortunately I only have crash logs

Logs

No response

Anything else?

My hunch is that it's the FoundationStream teardown sequence:

 func cleanup() {
    if let stream = inputStream {
      stream.delegate = nil
      CFReadStreamSetDispatchQueue(stream, nil)
      stream.close()
    }

stream.delegate is an unowned(unsafe) var.

There are several codepaths where cleanup is not called on com.apollographql.websocket. When this happens, delegate can be set to nil as the CFReadStream is attempting to invoke a callback on the deallocated delegate.

The text was updated successfully, but these errors were encountered:

calvincestari · 2024-06-10T20:34:53Z

Hi @aaronbarsky - you mentioned this being consistent; do you know the conditions or sequence of events that leads to the crash. Being able to have a reproduction case would be immensely helpful to resolving this.

aaronbarsky · 2024-06-10T21:12:38Z

By consistent, I mean that the call stack is exactly the same in each case. It happens about 1/500 sessions.

The challenge is that the start of the crash is

CoreFoundation _inputStreamCallbackFunc

which makes it quite tough to write a unit test or minimally reproducible example.

AnthonyMDev · 2024-08-09T17:57:24Z

We're going to really struggle to create or verify a fix for this since, as you said, its difficult to reproduce it.

There are several codepaths where cleanup is not called on com.apollographql.websocket. When this happens, delegate can be set to nil as the CFReadStream is attempting to invoke a callback on the deallocated delegate.

What codepaths are you referring to here? If there are some obvious areas where we are creating unsafe memory access, I'd suggest trying to fork the library, fix the cleanup() methods calls and ship a new version. If that prevents the crashes, we'd be happy to take a PR with the fixes you've made?

aaronbarsky · 2024-09-03T17:00:10Z

I worked around the problem with an explicit call to closeConnection.

Previously I had

func makeWebSocketClient(connectionHeaders:JSONEncodableDictionary) -> ApolloClientProtocol {
   let configuration = WebSocketTransport.Configuration(
            reconnect: false,
            connectingPayload: connectionHeaders)

    let webSocketClient = WebSocket(
            url: ...,
            protocol: .graphql_transport_ws)

    let wsTransport = WebSocketTransport(
            websocket: webSocketClient,
            config: configuration)

    return ApolloClient(networkTransport: wsTransport, store: ApolloStore(cache: InMemoryNormalizedCache()))
}

    // use client
    apolloClient = nil

Now I keep an explicit reference to the networkTransport so I can close it before deallocating the client

   let configuration = ...
   let webSocketClient = ...
   self.wsTransport = ...
   return  apolloClient = ...

   // use client
   self.wsTransport.closeConnection()
   self.wsTransport = nil
   self.apolloClient = nil

(Note that networkTransport is internal to ApolloClient so I need to keep the extra reference for access to the closeConnection function)

Fixing it in the library would require a lot of changes, that could be complex.

The problem is that FoundationStream marks the delgate as @unowned(unsafe) instead of @weak. This means that when the delegate is deallocated, rather than skipping the callback, it crashes. To avoid this, we need to make sure that the delegate is only released after close has been called on the callback queue.

Current situation

   FoundationStreamQueue         DelegateCallbackQueue          ClientQueue
                                                                apolloClient = nil
    receiveData
    scheduleCallbackOnDelegate                                  deinit
                                                                cleanupStream
                                                                cleanup
                                                                delegate deallocated
                                 delegate.stream(handle:)
                                 Crash()

My theoretical fix would look something like

   FoundationStreamQueue         DelegateCallbackQueue          ClientQueue
                                                                apolloClient = nil
    receiveData
    scheduleCallbackOnDelegate                                  deinit
                                                                cleanupStream
                                                                cleanup
                                                                capture strong reference to delegate in work block
                                                                schedule block on StreamCallbackQueue
                                 delegate.stream(handle:)
                                 complete work block releasing last reference
                                 delegate deallocated

The tricky bit is that you can't make strong references to self in the deinit. So rather than the delegate being an instance of Apollo.FoundationStream, it would have to be a separate object with a weak reference back to FoundationStream.

I'm too busy to make an attempt on this now. Perhaps if this code doesn't change much in Apollo 2.0, I can give it another look in the future.

calvincestari · 2024-09-11T19:09:32Z

Thanks for the great debugging @aaronbarsky.

I'm too busy to make an attempt on this now. Perhaps if this code doesn't change much in Apollo 2.0, I can give it another look in the future.

The current StarScream-based websocket implementation will be removed from Apollo iOS; it's old, unsupported and hindering progress. Ideally we build a default websocket implementation using iOS native websocket APIs but allow users to swap in whatever websocket library they choose to use. When that happens is still undecided though. We're trying to keep the scope of 2.0 breaking changes as small as possible and it really depends how much work is needed here to achieve that.

pixelmatrix · 2024-09-12T18:15:17Z

I'm seeing a handful of crashes like this as well in my app

pixelmatrix · 2024-09-12T19:15:02Z

@aaronbarsky Isn't setting the delegate to nil supposed to set the delegate back to the stream instance itself? I'm not seeing this happening in the debugger, but that's what the documentation says:

By a default, a stream object must be its own delegate; so a delegate message with an argument of nil should restore this delegate. Don’t retain the delegate to prevent retain cycles.

If that's the case, could the crash be that inputStream or outputStream are getting set to nil, thus the delegate is nil when the system attempts to call it?

On my end, the crashes seem to be happening because I call pauseWebSocketConnection when the app goes into the background, which eventually triggers cleanup, but leaves the Apollo.FoundationStream stream instance in place. The subscriptions can be pretty active, so I'm thinking the crash is happening when a pending message is queued up while we pause the connection.

arnauddorgans · 2024-10-06T23:37:28Z

Any update here? I see a lot of crashes with 100% iOS 18 on com.apollographql.websocket

calvincestari · 2024-10-07T14:06:07Z

@arnauddorgans - if you're able to provide a reproduction sample project or more debugging on your end may help in figuring out the root cause. I think this issue is becoming a catch-all for anything websocket related and they may not all be the same thing.

arnauddorgans · 2024-10-08T09:04:27Z

I am not able to reproduce it, but it is our top crash

          Crashed: com.apollographql.websocket
0  CoreFoundation                 0xfe54 <redacted> + 76
1  CoreFoundation                 0xcd3c CFRelease + 60
2  CoreFoundation                 0xc05e0 <redacted> + 252
3  CoreFoundation                 0x112050 <redacted> + 28
4  libdispatch.dylib              0x2370 _dispatch_call_block_and_release + 32
5  libdispatch.dylib              0x40d0 _dispatch_client_callout + 20
6  libdispatch.dylib              0xb6d8 _dispatch_lane_serial_drain + 744
7  libdispatch.dylib              0xc1e0 _dispatch_lane_invoke + 380
8  libdispatch.dylib              0x17258 _dispatch_root_queue_drain_deferred_wlh + 288
9  libdispatch.dylib              0x16aa4 _dispatch_workloop_worker_thread + 540
10 libsystem_pthread.dylib        0x4c7c _pthread_wqthread + 288
11 libsystem_pthread.dylib        0x1488 start_wqthread + 8

aaronbarsky · 2024-10-30T21:11:28Z

I finally caught this crash in action. Not sure if this is helpful or not:

arnauddorgans · 2024-11-07T22:53:27Z

Same here

vladdorfman · 2024-11-12T10:03:20Z

Is it possible that next minor version will contains the fix for this? It seems reproducible.

calvincestari · 2024-11-12T16:15:47Z

Is it possible that next minor version will contains the fix for this? It seems reproducible.

@vladdorfman, all the recent instances of crash captures seem to be adhoc. I don't think it's reliably reproducible yet but if you're able to write a test that can consistently crash please do.

AnthonyMDev · 2024-11-12T16:20:27Z

Yeah, we will need a reproduction case we can use to debug and verify a fix. Even if it only crashes one out of ten times. But a screenshot of the stack trace in your existing project is not enough for us to track this down.

If you are able to create a unit test or provide an example project with the crash occurring, we will prioritize a fix.

calvincestari · 2024-11-14T00:45:20Z

I spent some time on this issue yesterday and again today trying to craft a test that fails. I haven't managed to replicate a single failure in 10000 repetitions of various tests; closing the websocket, leaving it dangling, setting the transport to nil, cancelling the subscription, etc.

The stack trace screenshots unfortunately don't tell us much other than it's happening in cleanup, and I don't think we can confidently make any changes to the cleanup code without knowing exactly how things are failing.

pixelmatrix · 2024-12-02T21:00:47Z

This is still the number one crash for our app by a large margin. I can't reproduce it in the debugger, but I have experienced it occasionally on my device in the wild.

This morning I found a potentially related discussion on the Apple Developer forums. The stack traces are very similar to what I'm seeing in Sentry & Xcode Organizer. https://forums.developer.apple.com/forums/thread/769191

calvincestari · 2024-12-02T21:18:44Z

Thanks for linking to that discussion @pixelmatrix, it's an interesting read. I agree it sounds very similar to what you and others are experiencing in this issue.

While there are probably improvements we can make to the cleanup sequence this is looking more like a iOS 18 + CFStream issue in Foundation. I still don't know what changes we could make now to resolve this but I'll keep watching that discussion in case a solution is posted.

pixelmatrix · 2024-12-02T21:49:10Z

That was my thought as well. Given that it may not be actionable on our side, any recommendation for how we can get this in front of someone at Apple?

calvincestari · 2024-12-02T23:43:56Z

Maybe a +1 on that Apple issue with your own traces if available? I don't have any contacts that could ensure prompt action on this unfortunately.

aaronbarsky added bug Generally incorrect behavior needs investigation labels Jun 10, 2024

AnthonyMDev added the apollo-websockets label Jun 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

aaronbarsky commented Jun 10, 2024

calvincestari commented Jun 10, 2024

aaronbarsky commented Jun 10, 2024

AnthonyMDev commented Aug 9, 2024

aaronbarsky commented Sep 3, 2024

calvincestari commented Sep 11, 2024 •

edited

Loading

pixelmatrix commented Sep 12, 2024

pixelmatrix commented Sep 12, 2024 •

edited

Loading

arnauddorgans commented Oct 6, 2024

calvincestari commented Oct 7, 2024

arnauddorgans commented Oct 8, 2024

aaronbarsky commented Oct 30, 2024

arnauddorgans commented Nov 7, 2024

vladdorfman commented Nov 12, 2024

calvincestari commented Nov 12, 2024

AnthonyMDev commented Nov 12, 2024

calvincestari commented Nov 14, 2024

pixelmatrix commented Dec 2, 2024

calvincestari commented Dec 2, 2024

pixelmatrix commented Dec 2, 2024

calvincestari commented Dec 2, 2024

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

Crash on com.apollographql.websocket on _inputStreamCallbackFunc #3390

Comments

aaronbarsky commented Jun 10, 2024

Summary

Version

Steps to reproduce the behavior

Logs

Anything else?

calvincestari commented Jun 10, 2024

aaronbarsky commented Jun 10, 2024

AnthonyMDev commented Aug 9, 2024

aaronbarsky commented Sep 3, 2024

calvincestari commented Sep 11, 2024 • edited Loading

pixelmatrix commented Sep 12, 2024

pixelmatrix commented Sep 12, 2024 • edited Loading

arnauddorgans commented Oct 6, 2024

calvincestari commented Oct 7, 2024

arnauddorgans commented Oct 8, 2024

aaronbarsky commented Oct 30, 2024

arnauddorgans commented Nov 7, 2024

vladdorfman commented Nov 12, 2024

calvincestari commented Nov 12, 2024

AnthonyMDev commented Nov 12, 2024

calvincestari commented Nov 14, 2024

pixelmatrix commented Dec 2, 2024

calvincestari commented Dec 2, 2024

pixelmatrix commented Dec 2, 2024

calvincestari commented Dec 2, 2024

calvincestari commented Sep 11, 2024 •

edited

Loading

pixelmatrix commented Sep 12, 2024 •

edited

Loading