Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle "release not found" errors thrown on uninstall / equivalent to helm uninstall --ignore-not-found #1488

Closed
sbocinec opened this issue Sep 17, 2024 · 4 comments · Fixed by #1487
Assignees

Comments

@sbocinec
Copy link
Contributor

sbocinec commented Sep 17, 2024

Description

We are frequently hitting an issue with helm TF provider failing to destroy a helm_release resource when a K8s cluster (specifically GKE 1.28) throws an Internal Server Error and terraform destroy than fails with:

Error: uninstallation completed with 1 error(s): an error on the server (\\\"Internal Server Error: \\\\\\\"/api/v1/namespaces/ingress-nginx/services/ingress-nginx-controller-admission\\\\\\\": the server is currently unable to handle the request\\\") has prevented the request from succeeding (get services ingress-nginx-controller-admission)\\n\\n\"

Retrying the destroy operation then fails with release not found uninstallation error, e.g.:

Error: uninstall: Release not loaded: ingress-nginx: release: not found"

as the helm_release resource remains in the TF state, but its resources are already deleted from the K8s cluster.

In our case, we are destroying whole cluster so we want to ignore such errors. Helm CLI already provides a solution for this specific situation: helm uninstall --ignore-not-found:

$ helm uninstall --help |grep ignore-not-found
      --ignore-not-found     Treat "release not found" as a successful uninstall

Potential Terraform Configuration

#1487

Add ignore_not_found boolean attribute (defaulting to false) to the helm_release resource, enabling to ignore the error on uninstall.

resource "helm_release" "ingress_nginx" {
  name              = local.ingress_nginx
  chart             = local.ingress_nginx
  repository        = "https://kubernetes.github.io/ingress-nginx"
  version           = var.ingress_nginx_chart_version
  namespace         = local.ingress_nginx_namespace
  atomic            = true
  wait              = true
  wait_for_jobs     = true
  skip_crds         = true
  lint              = true
  cleanup_on_fail   = true
  dependency_update = false
  create_namespace  = false
  ignore_not_found  = true
}

References

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@sbocinec sbocinec changed the title Handle "release not found" errors thrown on uninstall / equivalent to helm uninistall --ignore-not-found Handle "release not found" errors thrown on uninstall / equivalent to helm uninstall --ignore-not-found Sep 17, 2024
@appilon
Copy link
Contributor

appilon commented Sep 18, 2024

Thank you @sbocinec for the issue and PR. Normally we treat 404 errors in the DELETE methods as a "success" so Terraform removes the resource from state. So no need to use the --ignore-not-found flag we should have been doing this by default. Prefer the PR to address this explicitly check 404/not found and swallow the error (meaning success). It would look something like this

(see updated code)

If you'd like to update #1487 to do that instead? Otherwise I can open a PR to fix it.

UPDATE:
Looking at resourceReleaseDelete right now I think you can just add the string Contains logic to the error check, such that if its NOT a "not found" error, then error out, otherwise fallthru to the happy path

if err != nil && !strings.Contains(err.Error(), "not found") {
  return diag.FromErr(err)
}

Bonus points if you use !errors.Is(err, driver.ErrReleaseNotFound)

@sbocinec
Copy link
Contributor Author

sbocinec commented Sep 19, 2024

Thank you @appilon for your reply.

I'm not sure about your proposal. It makes the code dependent on the exact error message content.

Checking the upstream code how the ignoreNotFound behaves I see, it actually handles any error returned by the uninstall action:

	if err != nil {
		if u.IgnoreNotFound {
			return nil, nil
		}
		return nil, errors.Wrapf(err, "uninstall: Release not loaded: %s", name)
	}

and returns Release not loaded actually, not a *not found* error.

I think it's much less error prone to go use the upstream provided option than to assume any error content.

Ofc, I can modify the PR if you have a strong opinion about.

EDIT:
To add more context, we are destroying K8s workloads in an automated way on scale and our primary motivation is to have an option to configure helm provider to ignore any uninstallation error on a TF destroy.

@appilon
Copy link
Contributor

appilon commented Sep 19, 2024

I'm finding the upstream code is a bit of a misnomer

if err != nil {
		if u.IgnoreNotFound {

The error isn't checked I'm assuming it could be any number of errors, IMO this behavior should be described as --ignore-all-errors not just IgnoreNotFound 🤷‍♂️

Although it might feel like a pain, I think its better if the provider explicitly checks the errors and makes the decision on if it thinks "the release no longer exists". Terraform ultimately wants you to be confident there are no dangling resources. Provider code is often quite messy as a result. Sadly this often means check for a few known substrings that semantically mean the same thing, and probably down the road some third or fourth error that semantically means the same thing will have to be added. Provider codebases are often just a bunch of "heuristics" stapled together.

@sbocinec
Copy link
Contributor Author

sbocinec commented Sep 20, 2024

I'm finding the upstream code is a bit of a misnomer
The error isn't checked I'm assuming it could be any number of errors, IMO this behavior should be described as --ignore-all-errors not just IgnoreNotFound 🤷‍♂️

Completely agree, checking the code yesterday I thought the same.

Anyway, I have updated the PR following your suggestion. I have also added an acceptance test, but I feel it's false positive, as playing with the terraform-plugin-testing framework I was never able to make the test to fail on a destroy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants