Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling timeouts/disconnects, dropping some pushes? #47

Open
finkel opened this issue May 15, 2015 · 14 comments
Open

Handling timeouts/disconnects, dropping some pushes? #47

finkel opened this issue May 15, 2015 · 14 comments

Comments

@finkel
Copy link

finkel commented May 15, 2015

Since we switched to using this library in production, we've had many reports of dropped push notifications. I think it's because of the persistent connection, which may at times error and then disconnect. Sounds like they're trying to handle this in https://github.com/pranavraja/apns, though that looks like less mature.

Any thoughts on how to detect/avoid this, or any recommendations? Appreciate the library and the help.

Thanks,
Ben

@freeeve
Copy link

freeeve commented May 15, 2015

I also recently switched, and I've noticed this as well. (at least, increased dropped notifications) I wasn't sure where to assign blame yet, as Apple doesn't guarantee delivery either.

@nathany
Copy link
Contributor

nathany commented May 15, 2015

Hi Ben,

The intent is when an error comes back from Apple, any notifications from that point on are resent. See:
https://developer.apple.com/library/ios/technotes/tn2265/_index.html#//apple_ref/doc/uid/DTS40010376-CH1-TNTAG44

However, it only currently buffers 50 notifications, which may not be enough. @bdotdub is the original maintainer of the project, and I imagine he hasn't ran into this issue in production. I'm still trying to get my app ready for production, so you're further along than me. :-)

@finkel
Copy link
Author

finkel commented May 15, 2015

Hi Nathan,

Thanks for the response. That's true, but I'm pretty sure that there's also a scenario when Apple gets a certain kind of error that causes no response, just a silent disconnect (great API, I know), and the persistent connection used in this library doesn't handle that.

@nathany
Copy link
Contributor

nathany commented May 15, 2015

It does have a retry loop (in RunLoop). But right now it's not exposing the reason why there was an error and is retrying.

@finkel
Copy link
Author

finkel commented May 15, 2015

Hmm, I see that. I'm not seeing "EOF trying to write notification", or "err writing to apns" in my logs, so I wonder if there's a kind of error that doesn't break it out of the run loop, thus it gets hung in a state that drops pushes until restart.

@nathany
Copy link
Contributor

nathany commented May 15, 2015

Unfortunately all I can suggest atm is to add more logging in a fork of timehop/apns. 😦

If you do end up trying @pranavraja's library, it would be great to hear how it goes.

@taylortrimble
Copy link
Contributor

I'll help look into this.

On Fri, May 15, 2015 at 12:48 PM, Nathan Youngman
[email protected] wrote:

Unfortunately all I can suggest atm is to add more logging in a fork of timehop/apns. 😦

If you do end up trying @pranavraja's library, it would be great to hear how it goes.

Reply to this email directly or view it on GitHub:
#47 (comment)

@bdotdub
Copy link
Contributor

bdotdub commented May 15, 2015

We've actually only just ran into this issue last week.

@finkel you're right re: error in the run loop. Doing some logging and debugging on our end, we've found that it gets stuck in some failure condition in these lines and can continually get an error, ie. never successfully reconnects: https://github.com/timehop/apns/blob/master/client.go#L132-L136

I can't tell if Apple is giving us a slap on the wrist for trying to connect to quickly.

We added logging in that if block in our local version and have seen a lot of EOF errors:

2015/05/04 21:37:22 apns: client connect error: EOF
2015/05/04 21:37:24 apns: client connect error: EOF
2015/05/04 21:37:26 apns: client connect error: EOF
2015/05/04 21:37:27 apns: client connect error: read tcp xx.xxxx.xxxx:2195: connection reset by peer

I'm currently at a bit of a loss as to why it's happening. In the short term, we could put a log statement in there or pass the connection error back on the channel. In the longer term, we've been talking about doing a refactor to remove the runLoop approach for exactly this scenario, where we sweep some class of errors under the rug.

@finkel @wfreeman how are you guys detecting dropped pushes?

@nathany
Copy link
Contributor

nathany commented May 15, 2015

Thanks Benny. I think it makes sense to add this logging into master in the interim.

Passing connection errors over FailedNotifs might be a bit odd right now though, since it's currently a chan of NotificationResults.

@finkel
Copy link
Author

finkel commented May 15, 2015

@bdotdub we've only been detecting them anecdotally (reports from users) which definitely started when we switched to this library, and of course they never repro on staging. They're very hard to log on the client. But recently I saw some very specific times when I could confirm they were being dropped, so now I'm investigated...

@freeeve
Copy link

freeeve commented May 15, 2015

For me it's just from user reports. I'm not seeing them in the FailedNotifs unless I'm doing something wrong (I increment an error count when I find those, and log the error count).

@taylortrimble
Copy link
Contributor

I think it makes sense to add this logging into master in the interim.

I'd prefer a callback, so I could do the logging (and other custom app error recording) myself!

@nathany
Copy link
Contributor

nathany commented May 22, 2015

Reporting connection errors and read errors over a channel (FailedNotifs atm) seems like a reasonable solution as well.

I'm curious how many messages are lost?

The resend logic starts after a failed notification. What happens if the notification failed for reasons other than it was invalid? What happens if the error from Apple is Shutdown or (maybe) Processing error, or if the connection is just dropped? Does it (should it) resend the notification that "caused" the failure?

@nathany
Copy link
Contributor

nathany commented Dec 9, 2015

One cause of EOF looping I've seen is expired certificates.

The crypto/x509 package provides a Verify() method for certificates. If it returns an error of type x509.CertificateInvalidError, the Reason field will be set to x509.Expired if the certificate is expired. I'm not sure if it's a good idea to verify the certificates every time before connecting or just when first loaded.

Verify() will also return errors of type x509.UnknownAuthorityError though, as the Apple certificate isn't in the system roots (and I haven't figured out what needs to be added to VerifyOptions.Roots). For now I just ignore this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants