fix #254 Channel.Publish blocks indefinitely #508

Open
sljeff wants to merge 1 commit into master

Conversation

@sljeff sljeff commented Jun 16, 2021

I encountered a bug similar to the one described in this comment: #254 (comment)

However, in a weak network environment, I get a deadlock after N+1 messages have been sent, where N is the size of the buffered confirmation channel.


How to Reproduce:
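Roughly, the pattern looks like this (a minimal sketch: the connection URL, routing key, and message body are placeholders, and the Sleep only stands in for the slow-network delay that lets two acks pile up):

```go
package main

import (
	"log"
	"time"

	"github.com/streadway/amqp"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	if err := ch.Confirm(false); err != nil {
		log.Fatal(err)
	}

	// Listener buffer of size 1, so the problem shows up once two acks are in flight.
	confirms := ch.NotifyPublish(make(chan amqp.Confirmation, 1))

	msg := amqp.Publishing{Body: []byte("hello")}

	// Publish two messages before draining any confirmation.
	for i := 0; i < 2; i++ {
		if err := ch.Publish("", "test", false, false, msg); err != nil {
			log.Fatal(err)
		}
	}

	// Let both acks arrive; this stands in for the weak network where acks
	// for old messages show up while later publishes are still in progress.
	time.Sleep(2 * time.Second)

	// This Publish never returns: the reader goroutine is blocked pushing the
	// second ack into the full listener channel while holding confirms.m, and
	// Publish waits on confirms.m, so the drain loop below is never reached.
	if err := ch.Publish("", "test", false, false, msg); err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 3; i++ {
		log.Printf("confirmed: %+v", <-confirms)
	}
}
```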

Then it will deadlock.


The sequence of events is:

  1. An ack for an old message is received -> One() -> resequence() -> confirm() sends the ack to the listener channel
  2. An ack for another old message is received -> One() takes the lock -> resequence() -> confirm() sends the ack to the listener channel -> blocks (channel buffer == 1), and the confirms.m lock is never released
  3. My code calls Publish to send a message -> it fails to get the lock

At this point it blocks forever, simply because the buffer size is 1 and two acks arrived at almost the same time (even though I published the messages serially, this extreme case is indeed reproducible).
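For reference, this is the path I'm describing in confirms.go, paraphrased from my reading of the current code (not a verbatim copy):

```go
// One is called from the connection's reader goroutine for every basic.ack.
func (c *confirms) One(confirmed Confirmation) {
	c.m.Lock()
	defer c.m.Unlock() // held across everything below, including the channel send

	if c.expecting == confirmed.DeliveryTag {
		c.confirm(confirmed)
	} else {
		c.sequencer[confirmed.DeliveryTag] = confirmed
	}
	c.resequence()
}

// confirm forwards an ack to every NotifyPublish listener. When a buffered
// listener channel is full, the send blocks while c.m is still held, and
// Publish (which also wants c.m) can never run, so the app never drains it.
func (c *confirms) confirm(confirmation Confirmation) {
	delete(c.sequencer, c.expecting)
	c.expecting++
	for _, l := range c.listeners {
		l <- confirmation
	}
}
```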


I don't think Publish needs to use the same lock as the other methods; it only increments a counter. Even if the library is in the middle of resequence(), letting Publish() run at that point shouldn't cause a problem.
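A sketch of what I mean, assuming published stays a plain uint64 field on the confirms struct (this shows the idea, not the exact diff in this commit):

```go
import "sync/atomic"

// Publish reserves the next delivery tag for an outgoing message. With an
// atomic increment it no longer takes c.m, so it cannot be stuck behind
// One()/resequence() while they forward acks to the listener channels.
// (On 32-bit platforms the published field must stay 64-bit aligned.)
func (c *confirms) Publish() uint64 {
	return atomic.AddUint64(&c.published, 1)
}
```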

nikitasadok commented Jun 19, 2021

I have had a very similar issue lately. There isn't anything special about reproducing it: a random deadlock would appear and block every subsequent Publish on that channel. It would stay locked for at least an hour (I didn't check longer periods, but I am fairly sure the deadlock persists).
There was one odd detail, though: at about the time the deadlock appeared, the RabbitMQ container logged [error] <...> closing AMQP connection <...>, so I assume that connection problems caused the mixup in acks that led to the deadlock.
I completely agree with this pull request: semantically, Publish should not share that lock, since it is just an increment, which can also be done more efficiently and more clearly with an atomic operation.

sljeff (Author) commented Jul 7, 2021

@michaelklishin @streadway

hj24 commented Jul 12, 2021

Same issue here. @michaelklishin @streadway
