Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing test(s): TestAccKMSAutokeyConfig_kmsAutokeyConfigAllExample #18935

Open
SarahFrench opened this issue Jul 31, 2024 · 11 comments
Open

Failing test(s): TestAccKMSAutokeyConfig_kmsAutokeyConfigAllExample #18935

SarahFrench opened this issue Jul 31, 2024 · 11 comments

Comments

@SarahFrench
Copy link
Member

SarahFrench commented Jul 31, 2024

Impacted tests

  • TestAccKMSAutokeyConfig_kmsAutokeyConfigAllExample

Affected Resource(s)

  • google_kms_autokey_config

Failure rates

  • 60% since 2024-05-18

Message(s)

------- Stdout: -------
=== RUN   TestAccKMSAutokeyConfig_kmsAutokeyConfigAllExample
=== PAUSE TestAccKMSAutokeyConfig_kmsAutokeyConfigAllExample
=== CONT  TestAccKMSAutokeyConfig_kmsAutokeyConfigAllExample
    vcr_utils.go:152: Step 1/2 error: After applying this test step, the plan was not empty.
        stdout:
        Terraform used the selected providers to generate the following execution
        plan. Resource actions are indicated with the following symbols:
          ~ update in-place
        Terraform will perform the following actions:
          # google_kms_autokey_config.example-autokeyconfig will be updated in-place
          ~ resource "google_kms_autokey_config" "example-autokeyconfig" {
                id          = "folders/81326716737/autokeyConfig"
              + key_project = "projects/tf-test-key-projwccymsfw2d"
                # (1 unchanged attribute hidden)
            }
        Plan: 0 to add, 1 to change, 0 to destroy.
--- FAIL: TestAccKMSAutokeyConfig_kmsAutokeyConfigAllExample (140.99s)
FAIL

Nightly build test history

b/357622504

@github-actions github-actions bot added forward/review In review; remove label to forward service/cloudkms labels Jul 31, 2024
@melinath
Copy link
Collaborator

melinath commented Aug 5, 2024

Note from triage: This looks like a permadiff: https://googlecloudplatform.github.io/magic-modules/develop/permadiff/

Weird that it only happens sometimes though.

@melinath melinath added size/xs and removed forward/review In review; remove label to forward labels Aug 5, 2024
@melinath melinath added this to the Goals milestone Aug 5, 2024
@tdbhacks
Copy link

tdbhacks commented Aug 8, 2024

The fact that this isn't failing all the time is weird, I wonder if it's just because I've seen the AutokeyConfig take a while to be updated. There is a short, dependency-caused propagation delay for AutokeyConfigs, and I've personally seen failures in the past where setting the AutokeyConfig did not complete fast enough for the KeyHandle creation request to succeed right after. Re-applying worked fine because at that point the config had been set.

IIUC, this test is terraform apply-ing, and then trying to re-apply right away. I wonder if that re-apply happens quickly enough to run into this propagation delay issue. Not sure if the server-side inconsistency would manifest as a diff though, but this should be easy enough to test by adding a short wait after the AutokeyConfig block?

@melinath
Copy link
Collaborator

melinath commented Aug 9, 2024

that seems plausible! a time_sleep resource would let one insert a wait

@tdbhacks
Copy link

tdbhacks commented Aug 9, 2024

that seems plausible! a time_sleep resource would let one insert a wait

Indeed, I've added it to the relevant test as part of GoogleCloudPlatform/magic-modules#11413. Fingers crossed!

@tdbhacks
Copy link

@melinath @SarahFrench is the test reliably passing now that the change has been merged? (also curious if I have a way to check myself?)

@SarahFrench
Copy link
Member Author

image

The test is still failing for the same reason I'm afraid. This screenshot shows a chart where each bar is a day the test ran, red means test failure, and the height is the duration of the test. The last 6 bars show the new sleep's effect on test duration.

I'm not sure on how Googlers access nightly test data, so I defer to @melinath

@melinath
Copy link
Collaborator

I'll reach out internally.

@tdbhacks
Copy link

Hmm, weird, I'm pretty confused then.. the google_kms_key_handle basic example test hasn't been failing in a similar way, right? We have a similar wait in there after the AutokeyConfig resource (with an even shorter time of 10s): https://github.com/GoogleCloudPlatform/magic-modules/blob/82cf1c53c32b0a5c7b5d29d4742a8b7bbb471eb0/mmv1/templates/terraform/examples/kms_key_handle_basic.tf.erb#L83

Are there any differences caused by the fact that the AutokeyConfig resource is the last one being created in this test, and that the time_sleep is at the very end of the test? It's definitely not being ignored, because of the test duration increase highlighted above. But if it was a true permadiff, it would show up every time, no?

@melinath
Copy link
Collaborator

It is currently showing up as a permadiff every time in nightly tests since 8/4/2024. Previously it was failing much less frequently. This kind of pattern often indicates a feature that is rolling out to an increasing percentage of servers over time before finally going to 100%. Maybe something went live around 8/4/24 that could be related?

The basic example is also failing at 100% (though with a different error message) and was previously flakey with this same permadiff. That failure is tracked in #18934.

@tdbhacks
Copy link

tdbhacks commented Sep 12, 2024

@melinath thanks for letting me know about that issue, I believe it got misrouted internally so I wasn't aware of it! I've opened a draft PR to start addressing the folder name / making resources sweepable: GoogleCloudPlatform/magic-modules#11693

I'd like to use that PR to also fix the permadiff about AutokeyConfig, but I haven't been able to figure out a root cause for it yet :( I'm not aware of anything rolling out and going live around that time, but I've just asked the team.

@romanini-ciandt
Copy link

I couldn't figure it out the root cause of it, but I made some tests and I strongly believe that this is being caused by an infrastructure issue. Feels like the CI is hitting a different version of the KMS API (maybe an alpha non released version) instead of v1... I'm not sure, it's just a guess and I can't confirm by myself.

Here are the tests I made to help on this investigation:

  1. First, I tried to run a dozen times TestAccKMSAutokeyConfig_kmsAutokeyConfigAllExample in my local, and it passed every time;
  2. Then, in order to try to replicate the Autokey config resource permadiff error locally, I:
    1. Applied manually the terraform code of the test, just removing the time_sleep resource (to increase the change of any eventual propagation issue);
    2. Made a looping shell script to destroy the autokey config resource (using terraform destroy + a curl request hitting the updateAutokeyConfig endpoint), auto apply and auto re-apply the autokey config resource, storing logs for all re-apply iterations;
    3. More than 1.1K re-applies were run and not a single diff was logged;

Some other interesting points:

With TG_LOG=DEBUG env var set in my local, I could notice that:
When we apply an autokey config tf resource with no key_project set, this is the behavior:
image

  • An empty dict {} on request body, and the response for both PATCH (autokey config doesn't have a POST) and GET with state "UNINITIALIZED"

When we have the key_project attribute on autokey config tf resource, this is the behavior:
image

So with these APIs requests and responses, with all the fields being returned, I don't see how TF could be reporting a permadiff (and I could evidence that with the 1.1K requests test - check re-apply logs here. That's why I suspect that the API that CI is hitting might be having a different behavior.

Maybe a good approach to continue this investigation is: someone with proper access into CI infrastructure set TG_LOG env var there and check which is the behavior of the KMS API for autokey config requests. Probably will be different from the one posted in this comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants