Bubble channel failures up to the connection #16
Conversation
I'm not very familiar with this codebase, so I'd appreciate any advice you could give me toward resolving the build failures. Or, if I'm way off base and this is better suited as an issue, I'm happy to just open an issue instead.
@DavidWittman are you able to provide steps to consistently reproduce? Are you actually losing messages? Based on the description you provided, it sounds like the scenario in question results in an error and ultimately a

As for tests, this proposal is causing some other tests to fail, as you can see in the details link: https://travis-ci.org/github/Foo-Foo-MQ/foo-foo-mq/builds/710152398?utm_source=github_status&utm_medium=notification and, more specifically, https://travis-ci.org/github/Foo-Foo-MQ/foo-foo-mq/jobs/710152399. I would expect PRs to come with corresponding tests, and existing tests should not be failing (or should be updated to match the new behavior).

Whether this should be an issue or a pull request comes down to whether you want to help fix it. A PR is the fastest way to get any fixes in; an open issue without steps to reproduce will likely not get much attention.

We have contribution guidelines; please read through https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/master/HOW_TO_CONTRIBUTE.md. They cover things such as the commit message format (which feeds auto-generation of the CHANGELOG.md file) and the requirement that tests exist and pass. In its current state the PR doesn't meet some of these guidelines, so we won't consider merging it.
Yes, sorry. I should have explained that better up front. Here's an overview of what happens:

RabbitMQ setup: a cluster of 3 RabbitMQ servers behind a load balancer (an AWS ELB in this case).

RabbitMQ logs:

Node A:

Node B:
As for the "consistently reproducing" part, that's a little more difficult because it's a somewhat-complex distributed system and we're trying to mimic a network failure. However, I was able to trigger the same behavior (4 and 5 above) just by modifying foo-foo-mq to send a bad delivery tag down a channel to a single RabbitMQ instance.
Losing messages? No. But the consumers effectively seize up and stop consuming from their queues when this happens.
This assumption is incorrect. My consumer has already acked the message, and it never receives an error.
Thanks for the reply, logs, and description. I will look a little more at this as time allows and reach back out if I have further questions. Feel free to add additional context as you see fit that may be useful.
@DavidWittman As you can see, time has not been on our side. Wondering if this is still an issue and/or if you are still using this library? I would be interested in any information you have. If you feel this is no longer an issue, or you found a solution, I would like to know so I can address it accordingly or close it.
Still an issue, and we're still using this library. We have seen the issue in production on three distinct RabbitMQ HA architectures (including Amazon MQ). We're currently working around the bug with a watchdog thread which restarts the services when they can no longer read from RabbitMQ. Honestly I'd like to spend more time resolving this particular issue so we can get rid of that, but it's a pain in the ass to reproduce and I've done a poor job of explaining the issue up to this point.
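For anyone hitting the same wall, the watchdog workaround mentioned above can be sketched roughly as below. Everything here is an assumption, not foo-foo-mq API: `restartService`, `getLastMessageAt`, and the timeout values are hypothetical names, and the staleness check is kept as a pure function so it is easy to test.

```javascript
// Pure staleness check: has the consumer gone quiet for too long?
function isStalled (lastMessageAtMs, nowMs, maxIdleMs) {
  return nowMs - lastMessageAtMs > maxIdleMs;
}

// Hypothetical watchdog wiring: poll the staleness check on a timer and
// restart the service when the consumer appears stuck. Returns the timer
// so the caller can clearInterval() it on shutdown.
function startWatchdog ({ getLastMessageAt, restartService, maxIdleMs, intervalMs }) {
  return setInterval(() => {
    if (isStalled(getLastMessageAt(), Date.now(), maxIdleMs)) {
      restartService(); // e.g. process.exit(1) under a supervisor like systemd
    }
  }, intervalMs);
}

// The message handler only needs to record activity, e.g.:
// let lastMessageAt = Date.now();
// broker.handle('my.message', (msg) => { lastMessageAt = Date.now(); msg.ack(); });
```

Crude, but it converts a silent stall into a visible restart until the underlying channel failure is surfaced properly.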
This is still an issue (I've been tracking this PR for a while) and it seems to happen more frequently under high load. It's a big problem for us that, when this happens, foo-foo-mq just silently stops consuming from a queue.

I tested the simple code change you have in this PR @DavidWittman and was able to confirm that it fixes the issue: it forces foo-foo-mq to reconnect to RabbitMQ and start consuming from the queue again. Unfortunately the error is hard to reproduce, as you pointed out, and I didn't bother trying to reproduce it in a dev environment. Instead, I just deployed the change to our live environment and waited for the error to happen again (it happens every couple of days in one of our services, which constantly receives 3k messages per second or more).

@zlintz Any chance this PR could be merged so that we don't have to use our own fork of foo-foo-mq? Do you have any concerns with this change?
I don't think I have concerns based on the change. If you/others are confident in the change as well, I don't see this as an issue and I will get it added. It does need to follow the Conventional Commits format, though: https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/master/HOW_TO_CONTRIBUTE.md. Maybe just rebase your branch and use
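The Conventional Commits shape referenced in the contribution guide can be sanity-checked with a small pattern. This is a rough, illustrative check, not the spec's full grammar, and the example commit subject is just a plausible one for this PR:

```javascript
// Rough Conventional Commits check: type, optional (scope), optional "!",
// then ": " and a description. Illustrative only, not the complete spec.
const conventional = /^(feat|fix|docs|chore|refactor|perf|test|build|ci)(\([^)]+\))?(!)?: .+/;

console.log(conventional.test('fix: bubble channel failures up to the connection')); // true
console.log(conventional.test('fixed some stuff')); // false
```

A `fix:`-prefixed subject is what lets the changelog tooling pick the change up as a patch-level fix.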
Send channel failures up to the connection so that they can be handled appropriately. Previously, errors on channels caused silent failures, which could result in queues being unable to receive messages without any indication of failure, such as described in arobson#155.

This approach is a bit naive in that it causes the whole connection to fail when a single channel fails, but I don't think the abstractions in this library allow the user to respond to channel-level events.

In my case, this type of failure occurred any time my consumers lost their connection to the RabbitMQ cluster and tried to send an ack for an old delivery tag to a different RMQ node in the cluster (i.e. the queue is re-established on a new node and batched acks are sent after reconnecting). This would result in the following error in amqplib:

`Error: Channel closed by server: 406 (PRECONDITION-FAILED) with message "PRECONDITION_FAILED - unknown delivery tag 42"`

But from the perspective of foo-foo-mq/rabbot, everything was fine. With these changes in place, I can at least listen for a `failed` event on the broker and respond appropriately.
Force-pushed from ac94edc to 01418d1
I rebased and switched to conventional commit logs, but
@zlintz nudge :)
@DavidWittman seems like this is breaking tests. Unfortunately I am going to have to revert this for the moment until it can pass the tests. Please see https://github.com/Foo-Foo-MQ/foo-foo-mq/actions/runs/7660491629/job/20877897843. I may try to dig into why as well.
@DavidWittman @doubliez I think I am approaching the answer. It looks like the

For more context, if you look here https://github.com/Foo-Foo-MQ/foo-foo-mq/blob/main/src/connectionFsm.js#L192 during the state transition

So what appears to be the problem relative to the tests was multiple

For anyone curious to try this out, put a

What I could use confirmation on from you all is any additional context of why the