Allow configuration of the MutatingWebhook failure policy #2711

Closed
sidewinder12s opened this issue Jul 1, 2022 · 13 comments · May be fixed by #3261
Labels: good first issue, kind/feature, lifecycle/stale

Comments

@sidewinder12s

Describe the bug
I ran into issues with TLS certs being regenerated due to these bugs:

#2312
#2264

Once the TLS certs changed, the MutatingWebhook for PodReadinessGate started failing and blocking the rollout of pods on services using this feature.

This was the error:

Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": Post "https://aws-lb-controller-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "aws-load-balancer-controller-ca")

I think this exposes an availability concern: if all the pods backing a service get rescheduled while the mutating webhook is broken, the service will go down. My understanding is that the PodReadinessGate is a bonus feature to make rollouts smoother in Kubernetes, so I think it'd be preferable for the feature to simply not work rather than block rollouts altogether.
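
For context, the blocking behavior comes from the failurePolicy on the registered webhook: with Fail, the API server rejects pod creation whenever the webhook call errors. A sketch of the relevant object, with names and values assumed for illustration (pieced together from the error above, not copied from the chart):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: aws-load-balancer-webhook    # name assumed
webhooks:
  - name: mpod.elbv2.k8s.aws
    failurePolicy: Fail              # Fail blocks pod creation on error; Ignore would fail open
    clientConfig:
      service:
        name: aws-lb-controller-webhook-service
        namespace: kube-system
        path: /mutate-v1-pod
    # (rules, sideEffects, admissionReviewVersions, etc. omitted)
```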

Steps to reproduce

Break TLS certs on the LB controller while using PodReadinessGates, then reschedule pods backing an LB in that namespace.
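
One way to confirm the broken state is to compare the CA bundle registered on the webhook with the CA that signed the serving certificate. A sketch, with resource names assumed from a default helm install rather than taken from this cluster:

```sh
# Fingerprint of the CA the API server trusts for the webhook
# (webhook configuration name assumed):
kubectl get mutatingwebhookconfiguration aws-load-balancer-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d |
  openssl x509 -noout -fingerprint

# Fingerprint of the CA bundled with the serving cert
# (secret name assumed; adjust to your install):
kubectl get secret -n kube-system aws-load-balancer-tls \
  -o jsonpath='{.data.ca\.crt}' | base64 -d |
  openssl x509 -noout -fingerprint

# If the fingerprints differ, webhook calls fail TLS verification
# exactly as in the error above.
```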

Expected outcome
I'd like to be able to either configure the webhook's failure policy or have it fail open.
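
As a stopgap until that's configurable, the policy can be flipped by hand. A sketch: the webhook configuration name and webhook index are assumptions, and a chart upgrade or the controller may reconcile the change away:

```sh
# Check which index holds mpod.elbv2.k8s.aws before patching:
kubectl get mutatingwebhookconfiguration aws-load-balancer-webhook \
  -o jsonpath='{range .webhooks[*]}{.name}{"\n"}{end}'

# Set the pod webhook to fail open (index 0 assumed here):
kubectl patch mutatingwebhookconfiguration aws-load-balancer-webhook \
  --type=json \
  -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
```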

Environment

  • AWS Load Balancer controller version: 2.4.1
  • Kubernetes version: 1.21
  • Using EKS (yes/no), if so version? Yes, platform version 7


@M00nF1sh added the kind/feature label on Jul 14, 2022
@M00nF1sh
Collaborator

Thanks for requesting this feature.
We can add an option to specify it.

/kind good-first-issue

@k8s-ci-robot
Contributor

@M00nF1sh: The label(s) kind/good-first-issue cannot be applied, because the repository doesn't have them.

In response to this:

Thanks for requesting this feature.
We can add an option to specify it.

/kind good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@M00nF1sh added the good first issue label on Jul 14, 2022
@fabianberisha

/assign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Feb 8, 2023
@reixd

reixd commented Nov 14, 2023

This would also be useful when the TargetGroupBinding admission webhook fails, for example with this error message during a helm upgrade:

Error: UPGRADE FAILED: failed to create resource: admission webhook "vtargetgroupbinding.elbv2.k8s.aws" denied the request: TargetGroup arn:aws:elasticloadbalancing:xxxxx:123456789:targetgroup/my-custom-name-of-target-group/8u738u4iojd23 is already bound to TargetGroupBinding prod/php-9fba15d9

@juozasget

It looks like this configuration option would also be needed in the event of an availability-zone failure when running a multi-AZ EKS cluster.

We ran into a similar issue when simulating a network AZ outage in our environment. We were surprised to see that even after all the nodes had failed over to healthy availability zones, no new pods could start. Investigating further, we saw errors from the load balancer controller's mutating webhook, which for some reason stops working during an AZ failure.

replicaset-controller Error creating: Internal error occurred: failed calling webhook "mpod.elbv2.k8s.aws": failed to call webhook: Post "https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-v1-pod?timeout=10s": context deadline exceeded

All the pods in namespaces with PodReadinessGates enabled were stuck, and the ReplicaSet controller was unable to create new pods. To work around it, we now need human intervention and a procedure in place where we disable the PodReadinessGates in the event of an AZ failure to recover the cluster.
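
For what it's worth, readiness-gate injection is opt-in per namespace via a label, so that manual recovery step can be scripted. A sketch, with an illustrative namespace name:

```sh
# Disable injection: the webhook's namespaceSelector no longer matches,
# so the API server stops calling it for pods in this namespace.
# (The trailing "-" removes the label.)
kubectl label namespace my-app elbv2.k8s.aws/pod-readiness-gate-inject-

# Re-enable once the AZ failure is resolved:
kubectl label namespace my-app elbv2.k8s.aws/pod-readiness-gate-inject=enabled
```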

@M00nF1sh Could you confirm whether this feature would help in our scenario, or should we open a new issue?

@josh-ferrell
Contributor

/assign

@josh-ferrell
Contributor

Closing as it appears this was addressed in #3653

@josh-ferrell
Contributor

/close

@k8s-ci-robot
Contributor

@josh-ferrell: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mikutas
Contributor

mikutas commented Oct 27, 2024

/reopen

@k8s-ci-robot
Contributor

@mikutas: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mikutas
Contributor

mikutas commented Oct 27, 2024

This issue is about the pod mutating webhook; #3653 is about the service mutating webhook.
