OCPEDGE-1191: feat: initial arbiter cluster enhancement #1674

Open · wants to merge 8 commits into master

Conversation

eggfoobar
Contributor

No description provided.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 4, 2024
@openshift-ci-robot

openshift-ci-robot commented Sep 4, 2024

@eggfoobar: This pull request references OCPEDGE-1191 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set.


@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 4, 2024
Contributor

openshift-ci bot commented Sep 4, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Contributor

@brandisher brandisher left a comment


Questions from my side:

  1. What is the concrete list of control plane components that will run on the arbiter node?
  2. What non-control plane or control-plane supporting components need to exist on the arbiter node? (e.g. cluster observability)
  3. How do we let cluster admins (or other RH operators) opt-in to adding components to the arbiter? I'm thinking in cases where a customer has a monitoring workload that they need to run on the control plane, or we have operators with agents that must run on the control plane like ACM or GitOps.
  4. Are there any considerations needed for components that utilize the infrastructureTopology field to determine their replicas? (I believe this applies to observability components like prometheus et al)

updated CEO changes to include idea for wrapping informer and lister to minimize CEO code changes
updated MCO changes to include bootstrap files for arbiter node
added installer changes needed to include new machine types and install config changes

Signed-off-by: ehila <[email protected]>
@eggfoobar
Contributor Author

This is in good enough shape to move out of draft, I think we captured most of the concerns here.

@eggfoobar eggfoobar marked this pull request as ready for review October 16, 2024 06:20
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 16, 2024
added appropriate reviewers for the component changes
added more detail around the specific component changes

Signed-off-by: ehila <[email protected]>
@rphillips
Contributor

rphillips commented Oct 18, 2024

Was Chad's comment addressed? What is running on the arbiter node?

Something has to be compromised on the arbiter - cpu, disk, memory? What is compromised on the arbiter? Nodes have classically had performance and runtime issues with slow disks or not enough cpu or memory. Is there a definition of what specs the arbiter has?

F1 F2 A1 (full-node-1 full-node-2 arbiter-node-1) ...

Additionally, based on the Arbiter's specs, if F1 F2 and A1 are nodes, and F2 goes down for a reboot, then F1 will likely be taking the brunt of the API+DB load of the cluster. Good loadbalancers route traffic to fast responding API backends. Usually when a highly available node goes down, there is enough capacity of the other nodes to take that load. If one node is under specced (A1) and F1 has to take the load, then it stands to reason that the specs for F1 need to be higher.

I'm a bit confused on how we would create a supported cluster in this situation. etcd needs fast disks, so we can't compromise on disk speed on A1. If load shifts from F2 to F1, then F1 should have the capacity of F1 + F2.

Optimizing and trimming down OpenShift components might be doable. Instead of creating an arbiter, could we still require 3 nodes but just run fewer things on them?

@DanielFroehlich

Right, the arbiter is like a third node, but only the etcd components and a bit of other absolutely essential stuff (e.g. networking components) run on the arbiter. No components like the API server, Console, Monitoring, etc. Exact details still need to be defined, though.

F1 and F2 currently, per the OCP docs, have a minimum system requirement of 4C/16G. That's just for the control plane; the requirements of the workload need to be added and are not considered here.
For A1, we are currently discussing 2C/8G as the minimum system requirement, but that is yet to be decided and tested.

Etcd and the control plane can work with an effective end-to-end latency (which includes disk) of 500ms using the slow profile, so no concerns here; see [here](https://docs.google.com/presentation/d/1eMz3ON5y1Rk2QI2zH-QUGh5oKoMyOHZrCSSZ8aHQQPQ/edit#slide=id.g26e42bb9faa_0_251). Yes, a slow arbiter node will slow down overall API performance.

During an outage of F2, not all load of F2 shifts to F1; actually next to none. Take the console as an example: it is replica 2 with a required podAntiAffinity on the hostname, meaning the pod running on F2 will not start on F1.
Scaled control plane components are scaled mostly for availability reasons, not for scalability.

@eggfoobar and @jerpeter1 keep me honest here please.
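To make the console example above concrete, here is a minimal sketch of a 2-replica workload with the required hostname anti-affinity Daniel describes; it is not the actual console operator manifest (names, labels, and the image are illustrative). With F2 down, the second replica simply stays Pending rather than moving its load onto F1:

```yaml
# Illustrative only: a deployment whose replicas must land on different hosts.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: console
  namespace: openshift-console
spec:
  replicas: 2
  selector:
    matchLabels:
      app: console
  template:
    metadata:
      labels:
        app: console
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: console
              topologyKey: kubernetes.io/hostname   # one replica per host
      containers:
        - name: console
          image: quay.io/example/console:latest     # placeholder image
```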

@rphillips
Contributor

rphillips commented Oct 18, 2024

@DanielFroehlich All Kube API load from F2 will transition to F1, since F1 will be the only machine with a kube-apiserver running.

How many workers will we support? Kubelets and their API calls and watches are usually load-balanced between 2-3 masters. This prevents thundering-herd issues on the API servers. With only 1 API server, that could limit the number of workers a customer can add to the cluster.

| [Installer](#installer-changes) | Update install config API to support arbiter machine types |
| [MCO](#mco-changes) | Update validation hook to support arbiter role and add bootstrapping configurations needed for arbiter machines |
| [Kubernetes](#kubernetes-change) | Update allowed `well_known_openshift_labels.go` to include new node role as an upstream carry |
| [ETCD Operator](#etcd-operator-change) | Update operator to deploy operands on both `master` and `arbiter` node roles |
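To illustrate the ETCD Operator row above: a pod-spec fragment along these lines could allow operands to schedule onto both roles. This is only a sketch, assuming the arbiter node carries a `node-role.kubernetes.io/arbiter` label and taint; it is not the actual cluster-etcd-operator change.

```yaml
# Sketch: allow scheduling onto master OR arbiter nodes and tolerate both taints.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:              # terms are ORed
          - matchExpressions:
              - key: node-role.kubernetes.io/master
                operator: Exists
          - matchExpressions:
              - key: node-role.kubernetes.io/arbiter
                operator: Exists
  tolerations:
    - key: node-role.kubernetes.io/master
      operator: Exists
      effect: NoSchedule
    - key: node-role.kubernetes.io/arbiter
      operator: Exists
      effect: NoSchedule
```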
Contributor


If the arbiter must not be the leader, then that's an additional feature that needs to be listed and designed.

Contributor Author


After some discussion on what we want to happen here: we're okay with the arbiter node's etcd member becoming leader. Since etcd will only be as fast as its slowest member, this might be fine for now. I'll document this as an open question and plan to revisit it as its own potential EP, wdyt?

@tjungblu Any concerns with this assumption on my part?


there is no functionality that can prevent it from becoming a leader. I'm mostly concerned about the performance impact reported here: etcd-io/etcd#14501

There are some upstream k8s PRs in the works that enable gRPC "circuit breakers", so this will get resolved at some point in the future.

compact cluster, do we want to support changing ControlPlaneTopology field
after the fact?

## Test Plan
Contributor


OLM+whoever to provide a way to make the arbiter configuration easy to test.

cc @joelanford

allow all taints can still run on the node, thus making it easier for smaller
arbiter nodes to be deployed without higher memory and CPU requirements.

Components that we are proposing to change:
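For the taint-tolerance point quoted above, a minimal sketch of the idea (the taint key and node name are assumptions, not the final design): the arbiter node carries a dedicated NoSchedule taint so ordinary workloads stay off it, while components that already tolerate all taints can still land there.

```yaml
# Hypothetical arbiter node: the taint keeps regular workloads off,
# but components with a blanket toleration still schedule here.
apiVersion: v1
kind: Node
metadata:
  name: arbiter-0                        # illustrative name
  labels:
    node-role.kubernetes.io/arbiter: ""
spec:
  taints:
    - key: node-role.kubernetes.io/arbiter
      effect: NoSchedule
```

A component that sets `tolerations: [{operator: Exists}]` in its pod spec tolerates this (and every other) taint, which is what makes the smaller arbiter node workable without changes to those components.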
Contributor


OLM to do some work fanning out for an infrastructure annotation.

Contributor Author


Is your thinking here to add support to OLM to filter components based on the topology type?


Customers at the edge are requiring a more economical solution for HA
deployments at the edge. They can support running 2 node clusters for redundancy
but would like the option to deploy a lower cost node as an arbiter to supply
Contributor


It would also be worth considering having an arm64 node for the arbiter to lower costs and deploy the cluster with the multi-arch payload.


That is a great idea, but unfortunately it is currently not supported by the multi-arch team: it's either all 3 nodes x86 or all arm64, no mixture allowed. We could look into supporting this when there is customer demand for it, so I'll keep the idea in the backlog! But for now, it is out of scope.

Contributor


is the arbiter node considered strictly control plane?

Member


AIUI we know the specific hardware we are targeting here (it has the low-powered node built-in).


We need an authoritative flag to make it easy for components and tools to
identify if they are running against an `arbiter` backed HA cluster. We are
proposing adding a new `ControlPlaneTopology` and `InfrastructureTopology` key
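For illustration, and assuming the `HighlyAvailableArbiter` value listed in the component table, the authoritative flag might surface on the cluster-scoped Infrastructure object roughly like this. This is a sketch, not the final API; whether `infrastructureTopology` also changes is exactly the question raised in the next comment.

```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  # Proposed new value indicating a 2-node + arbiter control plane
  controlPlaneTopology: HighlyAvailableArbiter
  # Possibly left as an existing value, per the discussion below
  infrastructureTopology: HighlyAvailable
```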
Member


Arguably there is no new InfrastructureTopology, because the arbiter node does not act as a worker.

Contributor Author

@eggfoobar eggfoobar Oct 25, 2024


That's fair, I wasn't sure if there would be a future need for the distinction if users were to add workers to these clusters. That's not something that's supported here, given the desire is low-cost deployment. However, would someone adding workers change your opinion on needing to supply this flag? I don't know the full breadth of infra services that would find this authoritative flag useful.

4. The user then enters the machine information for
`installConfig.controlPlane.platform.baremetal` and
`installConfig.arbiterNode.platform.baremetal`
5. The installer creates a new `arbiter` MachineSet with a replica of `1` and
Member


So the arbiter node will be created like a worker, i.e. provisioned by a MachineSet in the cluster after the control plane comes up? (As opposed to like the control plane nodes, which are created directly by the bootstrap and do not have a MachineSet.)
We may have more work to do around labelling BaremetalHosts and filtering by label in the MachineSet so that the correct BaremetalHost is selected as the arbiter node.

Contributor Author


Ah, nope, this is a mistake in my original POC wording, thanks for pointing that out. The arbiter node follows the same install path as a master, so we create 2 master machines and 1 arbiter machine. They come up and work normally, except that the third machine acts only as an arbiter.

Regarding BaremetalHost, I have not POCed that flow just yet, but I was also thinking labeling a host explicitly would be the path there. I'll flesh out that flow a bit more clearly here.

name: worker
platform: {}
replicas: 0
arbiterNode:
Member


A simpler alternative - that would support baremetal clusters only - is to assume that the arbiter is part of the control plane, and identify its existence only by applying the arbiter role to one of the hosts in platform.baremetal.hosts.
This would result in the installer (rather than a MachineSet) creating the arbiter Node with a taint and a separate pointer ignition.
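A rough sketch of what that baremetal-only alternative could look like in the install-config (host names are made up, `role: arbiter` is the hypothetical addition, and BMC credentials and other required host fields are omitted):

```yaml
platform:
  baremetal:
    hosts:
      - name: control-0
        role: master
      - name: control-1
        role: master
      - name: arbiter-0
        role: arbiter   # hypothetical new role identifying the arbiter host
```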

Contributor Author


Originally I was leaning that way, but we also had some testing and cloud deployment desires, so I switched the flow to follow the typical install flow for a master; it's now a full control plane machine at install. I'll update that step-by-step to reflect that correctly; my mistake on the MachineSet wording.


Before going GA we need to have a proper idea of the role that the
[control-plane machine-set
operator](https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/docs/user#overview)
Member


We don't use ControlPlaneMachineSetOperator on platform:baremetal, so you only have to deal with this for clouds.

installs in the installer. However, the cloud install is primarily for testing;
this might mean that we simplify the installer changes if we are the only ones
using cloud installs, since we can simply alter the manifests in the pipeline
without needing to change the installer.
Member


I would seriously question whether the work saved in manually setting up tests for etcd justifies the extra work in supporting this on cloud platforms.
Not only are cloud platforms an extra thing to support, but installing the arbiter node via a MachineSet in a platform-independent way complicates the baremetal implementation, and the Assisted/ABI implementations as well.

Contributor Author


I understand the concern, and I'll try to capture this better in the documentation, but there is value in cloud deployment, especially as a TechPreview iteration of this feature. Currently there are some unknowns around the hardware users will be running, so a cloud environment would make it easier for them to deploy and kick the tires on it, as well as make testing and validation a bit easier for us.

I went ahead and updated the doc to reflect the fact that the arbiter node is installed in the same way as the master nodes; would that help smooth out your concerns?


added more components that need to be changed
updated wording to correctly reflect what's being changed during installer file generation
added clarity around the use and purpose of the currently identified components
added more test cases that should be covered

Signed-off-by: ehila <[email protected]>
@jerpeter1 jerpeter1 self-assigned this Oct 29, 2024
Contributor

openshift-ci bot commented Nov 1, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jerpeter1. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Nov 1, 2024

@eggfoobar: all tests passed!



Contributor

@patrickdillon patrickdillon left a comment


The main questions from an installer perspective:

  • Are we supporting installer-provisioned infrastructure, as opposed to an assisted-installer upi-style provisioning
  • Are we supporting cloud platforms--or baremetal only?

The enhancement answers both of these questions "yes," but in the recent meeting it seems like they may still be open questions. I have left comments to expand on the Installer scope if we are supporting IPI & cloud platforms.

It seems like the assisted installer is a primary use case for this feature, but the workflow description describes an IPI-based workflow using create cluster. If assisted installer is a primary motivation it would be good to describe that use case.

Finally it would be good to loop in @JoelSpeed for input on the control plane machineset operator.

| Component | Change |
| ------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------- |
| [Infrastructure API](#infrastructure-api) | Add `HighlyAvailableArbiter` as a new value for `ControlPlaneTopology` and `InfrastructureTopology` |
| [Installer](#installer-changes) | Update install config API to support arbiter machine types |
Contributor


The installer will need additional changes to:

  • provision different types of control plane machines (if installer-provisioned infrastructure is supported)
  • render machine & control plane machineset operator manifests differently (if cpmso is supported)
  • validation (some of these validations are already captured in the enhancement)

Contributor Author


I'm still in the process of deploying a baremetal cluster with these changes, but ideally, since this is going TechPreview, we'd love an easy way for customers to try it out, so I hope to have an IPI flow as well for ease of deployment.

After some conversations around what CPMSO does, and since we are focusing on baremetal for the time being, we can punt that question to any future conversations about cloud deployments.

I do have a WIP PR here openshift/installer#9159 but that will probably change with the focus on the baremetal flow.


Components that we are proposing to change:

| Component | Change |
Contributor


Two components are not listed here, which could be affected:

  • control plane machine set operator - cc @JoelSpeed. Perhaps the only change needed is for the installer to correctly render the manifests to have a different machine type for arbiter nodes, but it would be good to check whether there are assumptions of a homogeneous control plane built in (this assumes cpmso is supported when using arbiter nodes)

  • baremetal operator - @zaneb has already taken a look, but perhaps you could tag in more reviewers if needed for any changes to the baremetal operator. If we are supporting installer-provisioned infrastructure, the baremetal operator (or whichever operator is used; I'm not familiar with the provisioning flow) would need to be updated to provision the arbiter node.

Contributor Author


Thanks Patrick, the control plane machine set operator should not be in play for this feature for the time being, since our focus is bare metal for support. If we do end up doing cloud installs for CI validation of core components, we will be making that operator Inactive, but we will need to revisit it for any cloud support.

name: worker
platform: {}
replicas: 0
arbiterNode:
Contributor


At risk of bikeshedding, I would suggest arbiter rather than arbiterNode. arbiterNode is reminiscent of Hungarian notation and doesn't parallel the existing controlPlane or worker fields. Also, arbiterNode implies replicas should only == 1.

Contributor Author


I like that suggestion, I'll update it to use that wording.
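Under that suggestion, the install-config stanza from the snippet above would presumably end up looking something like this; a sketch only, with field placement and defaults still to be settled in the enhancement:

```yaml
controlPlane:
  name: master
  platform: {}
  replicas: 2
compute:
  - name: worker
    platform: {}
    replicas: 0
arbiter:            # renamed from arbiterNode per the suggestion above
  name: arbiter
  platform: {}
  replicas: 1
```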

Contributor

@yuqi-zhang yuqi-zhang left a comment


@eggfoobar helped bring up a cluster and I poked around a bit on the MCO side. Functionally speaking it seems operational, and I think the general path seems logical. Just a few things to note (and some comments inline)

  1. we should check whether containerruntime/kubelet/node/image/IDMS etc. work properly for this pool (I feel like they mostly should?)
  2. this would be the first time we introduce a default pool other than master/worker into the MCO. We should be a bit careful on the scope of impact
  3. the master pool specifically in the MCO has some special casing around e.g. status reporting. (for example, it's a required pool for updates, whereas worker isn't). We should check the boundaries and error handling for this new pool

[template/<master/worker>/00-<master/worker>](https://github.com/openshift/machine-config-operator/tree/master/templates)
that are rendered in the [template
controller](https://github.com/openshift/machine-config-operator/blob/7c5ae75515fe373e3eecee711c18b07d01b5e759/pkg/controller/template/render.go#L102).
We will need to add the appropriate folder with the files relevant to the
Contributor


This should work but does require quite a bit of duplication (we'd have to copy over all master-* templates verbatim with only the kubelet template being slightly different)

Fine for now for TP but maybe worth revisiting before GA to see if we can make that better.

Contributor Author


Yeah agreed, I'll reevaluate this change and make some notes for MCO to help noodle some ideas on de-duplication of files for GA


We then need to add new rendered initial files for the arbiter role so
bootstrapping occurs correctly. When MCO stands up the Machine Config Server
Contributor


As a note, this also creates a set of arbiter-user-data and arbiter-user-data-managed secrets in the MAPI namespace. I don't suppose we'd want to scale additional arbiter nodes post-install though, so those would just be unused (which is probably fine?), but the user would theoretically be able to leverage them to scale additional arbiter nodes.

Also, to check: do we foresee anyone adding arbiter nodes post-install? Maybe we can just have the bootstrap MCS serve the arbiter and block the in-cluster MCS from doing so.

Contributor Author


For the time being there is no need to scale the arbiter node, but there is no reason why we can't do that in the future for small HA deployments. In the installer changes I left the replica field to allow us to revisit this in the future; we can put in a validation check for now to enforce 1 arbiter. However, I can imagine a future scenario where we configure 2 arbiters and 1 master.

We then need to add new rendered initial files for the arbiter role so
bootstrapping occurs correctly. When MCO stands up the Machine Config Server
today, we supply initial configs from the
[template/<master/worker>/00-<master/worker>](https://github.com/openshift/machine-config-operator/tree/master/templates)
Contributor


One thing here: I'd like to make it such that any cluster that isn't using arbiter nodes does not have an arbiter pool. As it stands, I think the proposal will make it such that we always create an arbiter pool, just empty and with no nodes. Since it almost looks like a custom pool, I'd prefer we don't confuse users into thinking we generated a new pool for them.

Contributor Author


That's a great callout, I'll add that wording to make sure we only create this pool when the arbiter is specified.
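For reference, the conditionally generated pool could look roughly like the existing master pool. This is a sketch with assumed selector and label names, rendered only when an arbiter machine is requested at install time:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: arbiter
spec:
  machineConfigSelector:
    matchExpressions:
      - key: machineconfiguration.openshift.io/role
        operator: In
        values: [arbiter]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/arbiter: ""
```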

@patrickdillon
Contributor

Hive impact/support should also be considered cc @2uasimojo

arbiter would be a new MachinePool role

@eggfoobar
Contributor Author

The main questions from an installer perspective:

  • Are we supporting installer-provisioned infrastructure, as opposed to an assisted-installer upi-style provisioning
  • Are we supporting cloud platforms--or baremetal only?

The enhancement answers both of these questions "yes," but in the recent meeting it seems like they may still be open questions. I have left comments to expand on the Installer scope if we are supporting IPI & cloud platforms.

It seems like the assisted installer is a primary use case for this feature, but the workflow description describes an IPI-based workflow using create cluster. If assisted installer is a primary motivation it would be good to describe that use case.

Finally it would be good to loop in @JoelSpeed for input on the control plane machineset operator.

I'm still in the process of doing some investigation on baremetal deployments. However, you are correct that the main focus will be baremetal. Regarding the assisted-installer flow, that was brought up as another path that we should cover, so you're right that that flow is missing; I'll capture it as well in the EP.

@eggfoobar
Contributor Author

@eggfoobar helped bring up a cluster and I poked around a bit on the MCO side. Functionally speaking it seems operational, and I think the general path seems logical. Just a few things to note (and some comments inline)

  1. we should check whether containerruntime/kubelet/node/image/IDMS etc. work properly for this pool (I feel like they mostly should?)
  2. this would be the first time we introduce a default pool other than master/worker into the MCO. We should be a bit careful on the scope of impact
  3. the master pool specifically in the MCO has some special casing around e.g. status reporting. (for example, it's a required pool for updates, whereas worker isn't). We should check the boundaries and error handling for this new pool
  1. Fully agree, that will be part of our testing in TechPreview to make sure we identify any behavior changes. At a glance they seem to work fine, and running a simple e2e upgrade didn't show any errors there. We'll know more when we have these running in a pipeline.
  2. Agreed; as you mentioned, Jerry, we'll add the MCP creation as part of the feature gate and authoritative flag, so users are not surprised with new MCPs.
  3. Doing a simple upgrade did work out pretty well for this configuration, but you're right, we do intend to pay extra attention to upgrades for this configuration. I'll capture this in the EP.
