ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:2488) #175
So after a lot of staring at the backtrace and the code, I believe what's happening is this: the publish API is raising the proper "ConnectionException" in response to "pika.exceptions.StreamLostError", so this isn't exactly a bug in fedora-messaging. What it's not doing is automatically retrying before giving up, because it's entering the generic AMQPError handler. We should fix this, especially because the blocking connection times out regularly as it cannot heartbeat. The fix will cause users to hit this less frequently, but they still need to handle exceptions where the automatic retry doesn't work.
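For the application side of that last point, here is a minimal sketch of wrapping a publish call so that a connection failure is retried a few times before being handed back to the caller. It is an illustration only: `ConnectionException` below is a local stand-in class, and `publish_with_retry`, its `attempts` and `delay` parameters, and the `publish` callable are hypothetical names, not part of fedora-messaging.

```python
import time


class ConnectionException(Exception):
    """Local stand-in for the library's connection error (assumption)."""


def publish_with_retry(publish, message, attempts=3, delay=0.0):
    """Call publish(message), retrying when ConnectionException is raised.

    The last ConnectionException is re-raised once the attempts are used up,
    so the caller still decides what happens to the unsent message.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return publish(message)
        except ConnectionException as exc:
            last_error = exc
            if attempt < attempts:
                # Simple linear backoff between attempts.
                time.sleep(delay * attempt)
    raise last_error
```

The caller passes in whatever function actually sends the message, which keeps the retry policy separate from the transport and easy to test with a fake.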
Previously publishing was only retried if the connection was closed cleanly. However, it might close for other reasons (someone trips over a cord, a firewall kills a connection, etc). We should go through the retry process for any connection exception. Fixes fedora-infra#175 Signed-off-by: Jeremy Cline <[email protected]>
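The change described in that commit message can be sketched roughly as follows. This is not the actual patch: the exception classes are local stand-ins loosely modeled on the names mentioned in this thread, and `Publisher`, `_send`, and the `connect` callable are hypothetical. The point is only that every connection-level error, not just a clean close, goes through reconnect-and-retry before the caller sees a `ConnectionException`.

```python
class AMQPError(Exception):
    """Stand-in for a generic AMQP error (assumption)."""


class AMQPConnectionError(AMQPError):
    """Stand-in for any connection-level failure (assumption)."""


class ConnectionClosed(AMQPConnectionError):
    """Stand-in for a cleanly closed connection."""


class StreamLostError(AMQPConnectionError):
    """Stand-in for a connection that dropped unexpectedly."""


class ConnectionException(Exception):
    """Stand-in for the error reported to the publishing application."""


class Publisher:
    def __init__(self, connect):
        # connect: callable returning an object with a send(message) method.
        self._connect = connect
        self._conn = None

    def _send(self, message):
        if self._conn is None:
            self._conn = self._connect()
        self._conn.send(message)

    def publish(self, message):
        try:
            self._send(message)
        except AMQPConnectionError:
            # Clean close or lost stream: drop the connection and retry once.
            self._conn = None
            try:
                self._send(message)
            except AMQPError as exc:
                self._conn = None
                raise ConnectionException(str(exc)) from exc
        except AMQPError as exc:
            # Anything else is reported without a retry.
            self._conn = None
            raise ConnectionException(str(exc)) from exc
```

Catching the shared base class first is what keeps a lost stream from falling through to the generic handler.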
Hmm, I was able to reproduce this again with fedora-messaging-1.7.0:
It's not immediately obvious to me by looking at the tracebacks whether this is due to my publishing code, or whether this is due to my consuming code. Since this is the composer, we are both publishing and consuming in the same process. And it is likely that I attempted to publish a message about the failure due to being out of disk space, so perhaps that is what triggered it. However, I do see calls to a |
I just did a test with Bodhi:
Thus, I am satisfied that this error is actually solved. I think it's just getting logged by pika, which is why it shows up in my logs, but fedora-messaging does seem to respond to it favorably and makes sure the message is still sent. On my end, I will probably just stop logging pika so I don't see the tracebacks and call it a day. We can close this if you want, unless there's something you'd like to use it to track.
I'd like to keep this open to track not leaving stale connections open in the publish code. The reason I originally kept the connection around is that making a new TLS connection is expensive, and if you publish rapidly you can reuse the connection. The problem is that most things don't publish rapidly, and this leaves open connections that the broker eventually has to clean up, along with nasty tracebacks. We either need to find a way to heartbeat with the synchronous API (e.g. run a thread somewhere) or open and close a connection per publish.
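The second option, opening and closing a connection per publish, could look roughly like this. It is a sketch under assumptions: `connect` is a hypothetical callable returning an object with `send()` and `close()` methods, and `short_lived_connection` and `publish_once` are illustrative names rather than existing fedora-messaging functions.

```python
from contextlib import contextmanager


@contextmanager
def short_lived_connection(connect):
    """Open a connection for one unit of work and always close it afterwards."""
    conn = connect()
    try:
        yield conn
    finally:
        # Closing here means no idle connection is left for the broker to reap.
        conn.close()


def publish_once(connect, message):
    """Publish a single message over a fresh connection."""
    with short_lived_connection(connect) as conn:
        conn.send(message)
```

The trade-off is the one already described above: each call pays for a new TLS handshake, which is fine for occasional publishers but costly for anything that publishes rapidly.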
In Fedora's staging environment, I have observed this problem in both OpenShift and bodhi-backend01.stg today:
In the case of Bodhi's OpenShift consumer, it ends up crashing the process, systemd restarts it, the message is tried again, and it was successful.
However, in Bodhi's composer, the composer itself caught the exception (it has a catch-all so we can log and mark the compose as failed), which means that the message was ultimately ACK'd, the process did not halt, and the compose was left in the failed state.
Should it be the responsibility of the application (i.e., Bodhi) to catch this kind of error, or should fedora-messaging catch and handle it for me? I'm not sure fedora-messaging could catch it in the case of Bodhi's composer…
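If the application ends up owning this, one possible shape is to let connection errors escape the catch-all so the message is not ACK'd and can be tried again after a restart, as happens with the OpenShift consumer. This is a sketch only: `ConnectionException` is a local stand-in class, and `handle_message`, `run_compose`, and `mark_failed` are hypothetical names, not Bodhi's actual code.

```python
class ConnectionException(Exception):
    """Local stand-in for the messaging library's connection error (assumption)."""


def handle_message(message, run_compose, mark_failed):
    """Run a compose, treating broker connection failures as retryable."""
    try:
        run_compose(message)
    except ConnectionException:
        # Do not swallow this: propagating it lets the consumer stop
        # without acknowledging the message, so it can be retried.
        raise
    except Exception as exc:
        # Everything else keeps the existing catch-all behaviour:
        # record the failure and let the message be acknowledged.
        mark_failed(message, exc)
```

Whether a compose interrupted partway through can safely be retried is a separate question that this sketch does not answer.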