[fix] retry producer creation upon error after successful topic lookup #1139
base: master
Great work @zzzming! I'll review again after you reply to the question.
pulsar/producer_partition.go
Outdated
}
p.log.WithError(err).Error("Failed to create producer at newPartitionProducer")
errMsg := err.Error()
if strings.Contains(errMsg, errTopicNotFount) {
if strings.Contains(errMsg, errTopicNotFount) {
if errors.Is(err, ErrTopicNotfound) {
Rebased with the latest changes and fixed the error evaluation per your review comment.
pulsar/producer_partition.go
Outdated
break
}

if strings.Contains(errMsg, "TopicTerminatedError") {
if strings.Contains(errMsg, "TopicTerminatedError") {
if errors.Is(err, ErrTopicTerminated) {
fixed.
LGTM. Could you rebase this PR? #1143 exports some error variables, so you need to update your PR to use them.
c676c7b to a84c97d
@nodece I fixed the code based on your review comments. CI does not seem to run. Does it require approval to run CI?
CI triggered.
Ping @zzzming
Fixes #1138
Motivation
The newPartitionProducer() function should retry grabCnx(), similar to the grabCnx() retry logic in reconnectToBroker.
The Java producer has this retry logic.
During producer creation, after a successful topic lookup in grabCnx() in producer_partition.go, if a network issue occurs before the command to create the producer is sent, grabCnx() exits without retrying.
This implementation follows the same retry logic as reconnectToBroker.
We had frequent failures during initial producer creation under unstable network conditions.
The problem is tricky to reproduce, but we observe it more frequently during the initialization stage of pods on Azure. After implementing the grabCnx() retry in newPartitionProducer(), the problem went away. The error often shows a connection closed (EOF) by the other side, but based on the logs on the Pulsar side, the broker (or Pulsar) did not close it. The cause may be network issues between the producer pod and the Pulsar cluster. That is why a grabCnx() retry is needed.
System configuration
Pulsar version: 2.10
Modifications
In the newPartitionProducer() function, add a retry of grabCnx() that uses the same retry logic as the grabCnx() retry in reconnectToBroker.
Verifying this change
This change is already covered by existing tests, such as (please describe tests).
Does this pull request potentially affect one of the following parts:
If yes was chosen, please highlight the changes.
Documentation