Support automatic apply on merge queue #227

fpacifici · 2024-06-24T23:00:33Z

Today TACOs require changes to be applied manually.
This was a deliberate choice as silently apply potentially large changes upon merge was considered too scary (@bukzor / @ellisonmarks , this is my understanding of the rationale, please correct me if I am wrong).

While this approach may be desirable for some large changes, it is counterproductive in several scenario like the Kafka Control Plane where the terraform changes are automatically generated and forgetting to apply has worse consequences.
Examples:

the system will be used from people across different teams who are used to changes being applied after merging the PR and are not used to deal with the drift problem.

We should enable merge queues and support automatic apply of TACOs changes while allowing for manually apply of critical changes.

Ideas on how to select slices or changes that would be applied automatically:

the apply label should still work to enable the PR based workflow as per TACOs requirements. In this case the operation on the merge queue will be a noop.
Default auto-apply. By default a tacos plan is applied on the merge queue.
Label to flag a PR for manual apply only. In this scenario we expect people to apply manually the PR will not be applied on the merge queue, for the PR to merge the change must have been applied manually. If not the queue action fails. (I am not even sure we would want this feature).

mwarkentin · 2024-06-27T01:39:47Z

Am -1 to the idea of requiring a PR label to opt out of auto apply from master, too easy to forget to apply, or there is a case where something changed after the plan was posted on the PR but before merge.

Terraform will happily destroy things that could have severe consequences for our production infrastructure.

Eg. Someone sees a disk space ticket and scales up the snuba errors cluster. I merge my PR adding a COGS label to the errors cluster (should be safe to auto apply right)? Terraform deletes all 30 data disks in about 15s because the only way to shrink a GCP disk is to destroy it and then recreate it.

Similar story for adding a label to a boot disk, the GCP provider has currently decided that requires recreating the disk to update.

If possible I would recommend that TACOS should support different flows depending on the slice being applied, then Kafka control plane, Datadog monitors, ops workflows, root account workflows, etc can all be handled in the way that fits them best.

Or switch to spacelift which supports this kind of configuration out of the box.

keeakita self-assigned this Sep 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support automatic apply on merge queue #227

Support automatic apply on merge queue #227

fpacifici commented Jun 24, 2024

mwarkentin commented Jun 27, 2024

Support automatic apply on merge queue #227

Support automatic apply on merge queue #227

Comments

fpacifici commented Jun 24, 2024

mwarkentin commented Jun 27, 2024