Allow configuration of the MutatingWebhook failure policy #2711

Closed
sidewinder12s opened this issue Jul 1, 2022 · 13 comments · May be fixed by #3261
Labels: good first issue, kind/feature, lifecycle/stale

Comments

@sidewinder12s

Describe the bug
I ran into issues with TLS certs being regenerated due to these bugs:

#2312
#2264

Once the TLS certs changed, the MutatingWebhook for PodReadinessGate started failing and blocking the rollout of pods on services using this feature.

This was the error:

Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": Post "https://aws-lb-controller-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")

I think this exposes an availability concern: if all the pods backing a service get rescheduled while the mutating webhook is broken, the service will go down. My understanding is that the PodReadinessGate is a bonus feature to make rollouts smoother in Kubernetes, so I think it'd be preferable for the feature to simply not work rather than block rollouts altogether.
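
For context, the blocking behavior comes from the failurePolicy on the registered webhook: with Fail, the API server rejects pod creation whenever the webhook call errors. A sketch of the relevant object, with names and values assumed for illustration (pieced together from the error above, not copied from the chart):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-load-balancer-webhook    # name assumed
webhooks:
  - name: mpod.elbv2.k8s.aws
    failurePolicy: Fail              # Fail blocks pod creation on error; Ignore would fail open
    clientConfig:
      service:
        name: aws-lb-controller-webhook-service
        namespace: kube-system
        path: /mutate-v1-pod
    # (rules, sideEffects, admissionReviewVersions, etc. omitted)
```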

Steps to reproduce

Break TLS certs on the LB controller while using PodReadinessGates, then reschedule pods backing an LB in that namespace.
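
One way to confirm the broken state is to compare the CA bundle registered on the webhook with the CA that signed the serving certificate. A sketch, with resource names assumed from a default helm install rather than taken from this cluster:

```sh
# Fingerprint of the CA the API server trusts for the webhook
# (webhook configuration name assumed):
kubectl get mutatingwebhookconfiguration aws-load-balancer-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d |
  openssl x509 -noout -fingerprint

# Fingerprint of the CA bundled with the serving cert
# (secret name assumed; adjust to your install):
kubectl get secret -n kube-system aws-load-balancer-tls \
  -o jsonpath='{.data.ca\.crt}' | base64 -d |
  openssl x509 -noout -fingerprint

# If the fingerprints differ, webhook calls fail TLS verification
# exactly as in the error above.
```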

Expected outcome
I'd like to be able to either configure the webhook's failure policy or have it fail open.
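
As a stopgap until that's configurable, the policy can be flipped by hand. A sketch: the webhook configuration name and webhook index are assumptions, and a chart upgrade or the controller may reconcile the change away:

```sh
# Check which index holds mpod.elbv2.k8s.aws before patching:
kubectl get mutatingwebhookconfiguration aws-load-balancer-webhook \
  -o jsonpath='{range .webhooks[*]}{.name}{"\n"}{end}'

# Set the pod webhook to fail open (index 0 assumed here):
kubectl patch mutatingwebhookconfiguration aws-load-balancer-webhook \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
```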

Environment

  • AWS Load Balancer controller version: 2.4.1
  • Kubernetes version: 1.21
  • Using EKS (yes/no), if so version? Yes, platform version 7


@M00nF1sh added the kind/feature label on Jul 14, 2022
@M00nF1sh
Collaborator

Thanks for requesting this feature.
We can add an option to specify it.

/kind good-first-issue

@k8s-ci-robot
Contributor

@M00nF1sh: The label(s) kind/good-first-issue cannot be applied, because the repository doesn't have them.

In response to this:

Thanks for requesting this feature.
We can add an option to specify it.

/kind good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@M00nF1sh added the good first issue label on Jul 14, 2022
@fabianberisha

/assign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 8, 2023
@reixd

reixd commented Nov 14, 2023

This would also be useful when the TargetGroupBinding admission webhook fails, for example with this error message during a helm upgrade:

Error: UPGRADE FAILED: failed to create resource: admission webhook "vtargetgroupbinding.elbv2.k8s.aws" denied the request: TargetGroup arn:aws:elasticloadbalancing:xxxxx:123456789:targetgroup/my-custom-name-of-target-group/8u738u4iojd23 is already bound to TargetGroupBinding prod/php-9fba15d9

@juozasget

It looks like this configuration option would also be needed in the event of an availability-zone failure when running a multi-AZ EKS cluster.

We ran into a similar issue when simulating a network AZ outage in our environment. We were surprised to see that even after all the nodes had failed over to healthy availability zones, no new pods could start. Investigating further, we saw errors from the load balancer controller's mutating webhook, which for some reason stops working during an AZ failure.

replicaset-controller Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded

All the pods in namespaces with PodReadinessGates enabled were stuck, and the ReplicaSet controller was unable to create new pods. To work around it, we now need human intervention and a procedure in place where we disable the PodReadinessGates in the event of an AZ failure to recover the cluster.
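
For what it's worth, readiness-gate injection is opt-in per namespace via a label, so that manual recovery step can be scripted. A sketch, with an illustrative namespace name:

```sh
# Disable injection: the webhook's namespaceSelector no longer matches,
# so the API server stops calling it for pods in this namespace.
# (The trailing "-" removes the label.)
kubectl label namespace my-app elbv2.k8s.aws/pod-readiness-gate-inject-

# Re-enable once the AZ failure is resolved:
kubectl label namespace my-app elbv2.k8s.aws/pod-readiness-gate-inject=enabled
```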

@M00nF1sh Could you confirm whether this feature would help in our scenario, or should we open a new issue?

@josh-ferrell
Contributor

/assign

@josh-ferrell
Contributor

Closing as it appears this was addressed in #3653

@josh-ferrell
Contributor

/close

@k8s-ci-robot
Contributor

@josh-ferrell: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mikutas
Contributor

mikutas commented Oct 27, 2024

/reopen

@k8s-ci-robot
Contributor

@mikutas: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mikutas
Contributor

mikutas commented Oct 27, 2024

This issue is about the pod mutating webhook; #3653 is about the service mutating webhook.
