Upgrade/Rollout System of Clusters #204

DinaBelova · 2024-08-18T23:20:39Z

Goals

Cluster management and operations cover not only the creation and deletion of clusters but also their upgrade.
Upgrades can be a major source of stress for any platform engineering team, so we should make them as easy and automated as possible while giving the team that performs them as much insight as possible.
Fully automated upgrades require the least interaction and seem the easiest, but they do not fit the operational procedures of enterprise customers, who want to trigger upgrades of production clusters in a controlled way.

Major deliverables

Who it benefits

Customer Business: Plannability and controlled cluster upgrades that fit the needs of enterprise k8s cluster management
Platform: Stress-free upgrades without a massive amount of work
Mirantis: Great customer experience and happy customers

Acceptance criteria

Upgrading a cluster involves 3 steps:
- Update the Helm Chart with the changes and push it to an OCI registry as a new version of the Helm Chart
- Create a new Template Object with a new name that references the pushed version of the Helm Chart
- Upgrade/migrate the Deployment object to point to the new Template name, which then actually triggers the upgrade of the cluster
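The last two steps can be sketched as plain data manipulation. This is only an illustration of the flow: the API group, the field names (`spec.helm.chartVersion`, `spec.template`, `spec.config`) and the object names are assumptions made for the example, not the actual HMC schema.

```python
import copy

# Hypothetical API group/version and field layout, for illustration only.
API_VERSION = "hmc.example.com/v1alpha1"


def new_template(name, chart_name, chart_version):
    """Step 2: build a Template object that references the pushed chart version."""
    return {
        "apiVersion": API_VERSION,
        "kind": "Template",
        "metadata": {"name": name},
        "spec": {"helm": {"chartName": chart_name, "chartVersion": chart_version}},
    }


def retarget_deployment(deployment, template_name):
    """Step 3: return a copy of the Deployment that points at the new Template."""
    updated = copy.deepcopy(deployment)  # leave the caller's object untouched
    updated["spec"]["template"] = template_name
    return updated


# Example objects; all names and values are made up.
deployment = {
    "apiVersion": API_VERSION,
    "kind": "Deployment",
    "metadata": {"name": "prod-cluster"},
    "spec": {
        "template": "aws-standalone-1-0-0",
        "config": {"instanceType": "m5.large"},
    },
}
template = new_template("aws-standalone-1-1-0", "aws-standalone", "1.1.0")
upgraded = retarget_deployment(deployment, template["metadata"]["name"])
```

Creating a Template under a new name, rather than editing the existing one, keeps the previous version available so the Deployment can still reference it.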
The Deployment Object shows status information similar to what CAPI itself provides
- The expectation is to have three statuses: Upgrade in Progress, Upgrade successful, Upgrade failed
Failed Upgrades are clearly marked in the Deployment Object
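One possible way to reduce CAPI-style conditions to the three expected statuses is sketched below. The mapping rule is an assumption for discussion (a `False` condition with `Error` severity means failed, all conditions `True` means successful, anything else means in progress); the epic does not prescribe it.

```python
from enum import Enum


class UpgradeStatus(Enum):
    IN_PROGRESS = "Upgrade in Progress"
    SUCCESSFUL = "Upgrade successful"
    FAILED = "Upgrade failed"


def summarize(conditions):
    """Collapse a list of condition dicts into one upgrade status.

    Each condition is expected to carry "type" and "status" keys and,
    optionally, "severity". The rule below is an assumed mapping.
    """
    if any(
        c["status"] == "False" and c.get("severity") == "Error"
        for c in conditions
    ):
        return UpgradeStatus.FAILED
    if conditions and all(c["status"] == "True" for c in conditions):
        return UpgradeStatus.SUCCESSFUL
    # Unknown, missing, or non-error False conditions: still rolling out.
    return UpgradeStatus.IN_PROGRESS
```

Checking for failure first ensures a failed upgrade is never reported as still in progress, which matches the requirement that failures be clearly marked.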
Changes to the template variables and to the template name of the Deployment object are treated the same way, because either can trigger cluster changes (for example, changing the instance type in AWS requires replacing all k8s nodes, just as a template name change that upgrades the k0s version does)
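Treating both kinds of change the same way could come down to a single check, as in this minimal sketch. The field names `template` and `config` are hypothetical stand-ins for the Deployment object's template name and template variables.

```python
def triggers_rollout(old_spec, new_spec):
    """Return True if the change must be tracked as an upgrade.

    A changed template name and changed template variables are handled
    identically: either one may cause the cluster to change.
    """
    template_changed = old_spec.get("template") != new_spec.get("template")
    config_changed = old_spec.get("config", {}) != new_spec.get("config", {})
    return template_changed or config_changed


# Example specs; names and values are made up.
old = {"template": "aws-standalone-1-0-0", "config": {"instanceType": "m5.large"}}
same = {"template": "aws-standalone-1-0-0", "config": {"instanceType": "m5.large"}}
new_name = {"template": "aws-standalone-1-1-0", "config": {"instanceType": "m5.large"}}
new_vars = {"template": "aws-standalone-1-0-0", "config": {"instanceType": "m5.xlarge"}}
```

Here `triggers_rollout(old, new_name)` and `triggers_rollout(old, new_vars)` both return `True`, while `triggers_rollout(old, same)` returns `False`.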

Assumptions

Telemetry & Success Criteria

Each upgrade triggers a telemetry event with the following information after the upgrade is completed:
- cluster_id
- target_infrastructure
- New template name
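As a rough illustration, the event could be serialized as follows. `cluster_id` and `target_infrastructure` come from the list above; the key `new_template_name`, the JSON encoding and the sample values are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class UpgradeEvent:
    cluster_id: str
    target_infrastructure: str
    new_template_name: str  # assumed key for "New template name"


def encode_event(event):
    """Serialize the event to JSON with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


# Sample values are made up for the example.
payload = encode_event(
    UpgradeEvent(
        cluster_id="prod-cluster",
        target_infrastructure="aws",
        new_template_name="aws-standalone-1-1-0",
    )
)
```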

Out of scope

The actual upgrade of the cluster is handled by CAPI, and we should not write any code in the HMC repo that upgrades clusters. HMC code should only observe the actual upgrade and surface as much information as needed from CAPI in the Deployment Object. Any bugs we find that prevent upgrades should be fixed in CAPI or the affected CAPI providers.
CAPI is sometimes finicky about which objects can be upgraded in place and which need a rolling change (new ones added, then old ones removed). In this epic we don't want to worry about this yet and assume that the templates themselves don't modify in place those parts of CAPI objects that can't actually be modified in place.
Multi Cluster Upgrades will be implemented later
Auto Cluster Upgrade will be implemented later
Upgrading of Mirantis templates and mgmt control plane itself is not part of this epic

related issues:

DinaBelova added the epic label Aug 18, 2024

DinaBelova added this to Project 2A Aug 18, 2024

DinaBelova moved this to Todo in Project 2A Aug 19, 2024

DinaBelova moved this from Todo to In Progress in Project 2A Sep 4, 2024
