don't block TGB reconciliation loop on failed SG ingress reconciliation #3296
Conversation
the controller performs an SG reconciliation step for all (cluster-wide) SGs gathered from all TGBs during the TGB reconciliation loop, before TG endpoints are reconciled: https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/d177c898ddd86071eecc2fd918d72ebfb0af7892/pkg/targetgroupbinding/resource_manager.go#L139-L141

the way the code is currently written, any failure during SG reconciliation blocks reconciliation of all targets across the whole cluster. such a failure can be caused by something as innocuous as an SG being deleted before the associated TGB is deleted, or an SG being entered on a TGB erroneously. this can easily lead to severe outages if not remediated quickly.

this commit changes the method reconcileWithIngressPermissionsPerSG to not exit on a single failed SG ingress reconciliation: instead it logs the offending error and continues through the loop.
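For illustration, here is a minimal, self-contained sketch of the pattern the PR describes, assuming a hypothetical sgReconciler type and reconcileIngressForSG helper (these names are stand-ins, not the controller's actual API): the loop logs a per-SG failure and moves on instead of returning, so one bad SG no longer blocks every other SG and TGB.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// sgReconciler is an illustrative stand-in for the controller's resource manager.
type sgReconciler struct {
	reconcileIngressForSG func(ctx context.Context, sgID string) error
}

// reconcileWithIngressPermissionsPerSG iterates over every security group and,
// instead of returning on the first failure (which would block reconciliation
// of all other SGs and TGBs), logs the offending error and continues the loop.
func (r *sgReconciler) reconcileWithIngressPermissionsPerSG(ctx context.Context, sgIDs []string) error {
	for _, sgID := range sgIDs {
		if err := r.reconcileIngressForSG(ctx, sgID); err != nil {
			// previously: return err
			fmt.Printf("failed to reconcile ingress for %s: %v\n", sgID, err)
			continue
		}
	}
	return nil
}

func main() {
	r := &sgReconciler{
		reconcileIngressForSG: func(ctx context.Context, sgID string) error {
			if sgID == "sg-deleted" {
				return errors.New("InvalidGroup.NotFound: the security group does not exist")
			}
			return nil
		},
	}
	// the deleted SG fails and is logged, but "sg-ok" is still reconciled
	_ = r.reconcileWithIngressPermissionsPerSG(context.Background(), []string{"sg-deleted", "sg-ok"})
}
```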
Welcome @michaelsaah!
Hi @michaelsaah. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I cannot support what-me-worry ignoring of errors. At the very least, the error needs to be surfaced as a Kubernetes event and the handler needs to mark the work item for retry.
thanks @johngmyers. surfacing the error via an Event sounds fine to me. I'm not sure what you mean by "mark the work item for retry." can you point me towards any examples in the code where this is performed?
@johngmyers @M00nF1sh just bumping this, it's an ongoing operational issue for us (Twilio) on EKS. any help appreciated.
just another bump, any eyes appreciated
Thanks for this contribution. I agree with @johngmyers that the error cannot be ignored without retry, since such SG modification errors can be transient (for example, EC2 API throttling) and should be retried to avoid leaving the security group settings in a bad state. How about the change below:
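The actual diff suggested above isn't shown, but the general approach under discussion (surface each failure, keep processing the remaining SGs, and still return an error so the work item is retried) can be sketched roughly as below. Everything here is a hypothetical illustration, not the code that was merged; errors.Join is just one possible way to aggregate the per-SG failures.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// reconcilePerSG attempts ingress reconciliation for every SG, collects the
// individual failures, and returns them as one aggregated error at the end,
// so transient failures still cause the work item to be requeued and retried
// while the remaining, healthy SGs keep getting reconciled.
func reconcilePerSG(ctx context.Context, sgIDs []string, reconcileOne func(context.Context, string) error) error {
	var errs []error
	for _, sgID := range sgIDs {
		if err := reconcileOne(ctx, sgID); err != nil {
			// in the real controller this is also where the failure could be
			// surfaced as a Kubernetes event on the offending TGB
			fmt.Printf("ingress reconcile failed for %s: %v\n", sgID, err)
			errs = append(errs, err)
			continue
		}
	}
	// nil when everything succeeded; a joined error otherwise
	return errors.Join(errs...)
}

func main() {
	err := reconcilePerSG(context.Background(), []string{"sg-a", "sg-b"}, func(ctx context.Context, sgID string) error {
		if sgID == "sg-b" {
			return errors.New("RequestLimitExceeded")
		}
		return nil
	})
	fmt.Println("aggregate error:", err)
}
```

Returning the joined error means transient problems such as EC2 API throttling still trigger a requeue with backoff, while a single broken SG no longer starves reconciliation for the rest of the cluster.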
force-pushed from b817502 to d6e8667
thanks for the review @M00nF1sh, addressed your feedback. have a good weekend!
just bumping this for a re-review, thanks |
morning, just a monday review bump, thanks |
Just a minor style nit
/lgtm
/ok-to-test
Co-authored-by: John Gardiner Myers <[email protected]>
thanks for the re-review @johngmyers, committed your style suggestion
/lgtm
looks like that test might be flaky: #3289 (comment)
/retest |
Sorry for late review. This change looks good to me now, thanks a lot for your contribution :D
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: johngmyers, M00nF1sh, michaelsaah. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Codecov Report
Patch coverage has no change and project coverage change: +0.85%
Additional details and impacted files
@@ Coverage Diff @@
## main #3296 +/- ##
==========================================
+ Coverage 54.70% 55.56% +0.85%
==========================================
Files 148 149 +1
Lines 8590 8828 +238
==========================================
+ Hits 4699 4905 +206
- Misses 3559 3586 +27
- Partials 332 337 +5
☔ View full report in Codecov by Sentry.
/retest |
1 similar comment
/retest |
Issue
#3285
Description
the controller performs an SG reconciliation step for all (cluster-wide) SGs gathered from all TGBs during the TGB reconciliation loop, before TG endpoints are reconciled:
aws-load-balancer-controller/pkg/targetgroupbinding/resource_manager.go
Lines 139 to 141 in d177c89
the way the code is currently written, any failure during SG reconciliation blocks reconciliation of all targets across the whole cluster. such a failure can be caused by something as innocuous as an SG being deleted before the associated TGB is deleted, or an SG being entered on a TGB erroneously. this can easily lead to severe outages if not remediated quickly.
this commit changes the method reconcileWithIngressPermissionsPerSG to not exit on a single failed SG ingress reconciliation: instead it logs the offending error and continues through the loop.

I tested this by setting up two TGBs, modifying one to have a non-existent ingress SG, deleting a target pod, then ensuring that the new target is reconciled. the new error message was also logged:
Checklist
Added/modified documentation as required (such as the README.md, or the docs directory)
BONUS POINTS checklist: complete for good vibes and maybe prizes?! 🤯