OCPEDGE-1191: feat: initial arbiter cluster enhancement #1674
base: master
Conversation
Signed-off-by: ehila <[email protected]>
@eggfoobar: This pull request references OCPEDGE-1191, which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.18.0" version, but no target version was set.
Skipping CI for Draft Pull Request.
Questions from my side:
- What is the concrete list of control plane components that will run on the arbiter node?
- What non-control plane or control-plane supporting components need to exist on the arbiter node? (e.g. cluster observability)
- How do we let cluster admins (or other RH operators) opt-in to adding components to the arbiter? I'm thinking in cases where a customer has a monitoring workload that they need to run on the control plane, or we have operators with agents that must run on the control plane like ACM or GitOps.
- Are there any considerations needed for components that utilize the `infrastructureTopology` field to determine their replicas? (I believe this applies to observability components like Prometheus et al.)
Signed-off-by: ehila <[email protected]>
Force-pushed from bc94bef to 368a00d
Signed-off-by: ehila <[email protected]>
updated CEO changes to include idea for wrapping informer and lister to minimize CEO code changes
updated MCO changes to include bootstrap files for arbiter node
added installer changes needed to include new machine types and install config changes
Signed-off-by: ehila <[email protected]>
This is in good enough shape to move out of draft; I think we captured most of the concerns here.
added appropriate reviewers for the component changes
added more detail around the specific component changes
Signed-off-by: ehila <[email protected]>
Force-pushed from 3e2bf61 to 643fe55
Signed-off-by: ehila <[email protected]>
Was Chad's comment addressed? What is running on the arbiter node? Something has to be compromised on the arbiter: CPU, disk, memory? What is compromised on the arbiter? Nodes have classically had performance and runtime issues with slow disks or not enough CPU or memory. Is there a definition of what specs the arbiter has? F1 F2 A1 (full-node-1, full-node-2, arbiter-node-1) ...

Additionally, based on the arbiter's specs: if F1, F2, and A1 are nodes, and F2 goes down for a reboot, then F1 will likely take the brunt of the API and DB load of the cluster. Good load balancers route traffic to fast-responding API backends. Usually when a highly available node goes down, there is enough capacity on the other nodes to take that load. If one node is under-specced (A1) and F1 has to take the load, then it stands to reason that the specs for F1 need to be higher.

I'm a bit confused about how we would create a supported cluster in this situation. etcd needs fast disks, so we can't compromise on disk speed on A1. If load shifts from F2 to F1, then F1 should have the capacity of F1 + F2. Optimizing and trimming down OpenShift components might be doable. Instead of creating an arbiter, could we still require 3 nodes but run fewer things?
Right, the arbiter is like a third node, but only the etcd components and a bit of other absolutely essential stuff (e.g. networking components) run on the arbiter. No components like the API server, Console, Monitoring, etc. Exact details still need to be defined, though.

F1 and F2 currently, per OCP docs, have a minimum system requirement of 4C/16G. That's just for the control plane; the requirements of the workload need to be added and are not considered here. etcd and the control plane can work with an effective end-to-end latency (which includes disk) of 500ms using the slow profile, so no concerns here - see [here](https://docs.google.com/presentation/d/1eMz3ON5y1Rk2QI2zH-QUGh5oKoMyOHZrCSSZ8aHQQPQ/edit#slide=id.g26e42bb9faa_0_251). Yes, a slow arbiter node will slow down overall API performance.

During an outage of F2, not all of F2's load shifts to F1; actually next to none. Take the console as an example: it runs with 2 replicas and a required podAntiAffinity on the hostname, meaning the pod running on F2 will not start on F1. @eggfoobar and @jerpeter1 keep me honest here please.
@DanielFroehlich All Kube API load from F2 will transition to F1, since F1 will be the only machine with a kube-apiserver running.

How many workers will we support? Kubelets and their API calls and watches are usually load balanced between 2-3 masters, which prevents thundering-herd issues on the API servers. With only 1 API server, that will potentially limit the number of workers a customer can add to the cluster.
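As a side note on the console example above: a required podAntiAffinity on the hostname topology key is what keeps the second replica from landing on the surviving node. A minimal sketch of that constraint (labels are illustrative, not the actual console manifest):

```yaml
# Sketch only: a pod carrying this affinity will not schedule onto a node that
# already runs another pod labeled app: console.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: console
      topologyKey: kubernetes.io/hostname
```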
enhancements/arbiter-clusters.md
| [Installer](#installer-changes) | Update install config API to support arbiter machine types |
| [MCO](#mco-changes) | Update validation hook to support arbiter role and add bootstrapping configurations needed for arbiter machines |
| [Kubernetes](#kubernetes-change) | Update allowed `well_known_openshift_labels.go` to include new node role as an upstream carry |
| [ETCD Operator](#etcd-operator-change) | Update operator to deploy operands on both `master` and `arbiter` node roles |
If the arbiter must not be the leader, then that's an additional feature that needs to be listed and designed.
After some discussion on what we want to happen here: we're okay with the arbiter node's etcd member becoming leader. We're under the impression that etcd will be as fast as its slowest member, so this might be fine for now. I'll document this as an open question and plan to revisit it as its own potential EP, wdyt?
@tjungblu Any concerns with this assumption on my part?
There is no functionality that can prevent it from becoming a leader. I'm mostly concerned about the performance impact reported here: etcd-io/etcd#14501
There are some upstream k8s PRs in the works that enable gRPC "circuit breakers", so this will get resolved at some point in the future.
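On the mechanics of the ETCD Operator row quoted above (deploying operands on both `master` and `arbiter` node roles): one plausible shape, assuming a `node-role.kubernetes.io/arbiter` label is added to the arbiter node, is a node affinity whose terms are ORed. This is a sketch, not the actual cluster-etcd-operator change:

```yaml
# Sketch: nodeSelectorTerms are ORed, so a pod may land on either node role.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node-role.kubernetes.io/master
          operator: Exists
      - matchExpressions:
        - key: node-role.kubernetes.io/arbiter
          operator: Exists
```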
compact cluster, do we want to support changing ControlPlaneTopology field
after the fact?

## Test Plan
OLM+whoever to provide a way to make the arbiter configuration easy to test.
cc @joelanford
allow all taints can still run on the node, thus making it easier for smaller
arbiter nodes to be deployed without higher memory and cpu requirements.

Components that we are proposing to change:
OLM to do some work fanning out for an infrastructure annotation.
Is your thinking here to add support to OLM to filter components based on topology type?
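Stepping back to the taint-based scheduling quoted a little earlier (only pods that tolerate all taints would run on the arbiter): a hedged sketch of how that could look, with an assumed taint key and effect:

```yaml
# Hypothetical taint on the arbiter node; the key and effect are assumptions,
# not a committed design.
apiVersion: v1
kind: Node
metadata:
  name: arbiter-0
spec:
  taints:
  - key: node-role.kubernetes.io/arbiter
    effect: NoSchedule
```

Essential pods (etcd, networking, and the like) would then carry a matching toleration, for example `tolerations: [{key: node-role.kubernetes.io/arbiter, operator: Exists, effect: NoSchedule}]`, while everything else stays off the arbiter.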
Customers at the edge are requiring a more economical solution for HA
deployments at the edge. They can support running 2 node clusters for redundancy
but would like the option to deploy a lower cost node as an arbiter to supply
It would also be worth considering having an arm64 node for the arbiter to lower costs and deploy the cluster with the multi-arch payload.
That is a great idea, but it is unfortunately not currently supported by the multi-arch team: it's either all 3 nodes x86 or all arm64, no mixture allowed. We could look into supporting this when there is customer demand for it, so I'll keep the idea in the backlog! But for now, it is out of scope.
Is the arbiter node considered strictly control plane?
AIUI we know the specific hardware we are targeting here (it has the low-powered node built-in).
We need an authoritative flag to make it easy for components and tools to
identify if they are running against an `arbiter` backed HA cluster. We are
proposing adding a new `ControlPlaneTopology` and `InfrastructureTopology` key
Arguably there is no new InfrastructureTopology, because the arbiter node does not act as a worker.
That's fair, I wasn't sure if there would be a future need for the distinction if users were to add workers to these clusters. That's not something that's supported here, given the desire is low-cost deployment. However, would someone adding workers change your opinion on needing to supply this flag? I don't know the full breadth of infra services that would find this authoritative flag useful.
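For concreteness, a hedged sketch of what the Infrastructure object could report on an arbiter-backed cluster, assuming the proposed `HighlyAvailableArbiter` value lands on `controlPlaneTopology` (whether `infrastructureTopology` changes at all is exactly the open question above):

```yaml
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  controlPlaneTopology: HighlyAvailableArbiter   # proposed new value
  infrastructureTopology: HighlyAvailable        # possibly unchanged, per the discussion
```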
enhancements/arbiter-clusters.md
4. The user then enters the machine information for
   `installConfig.controlPlane.platform.baremetal` and
   `installConfig.arbiterNode.platform.baremetal`
5. The installer creates a new `arbiter` MachineSet with a replica of `1` and
So the arbiter node will be created like a worker, i.e. provisioned by a MachineSet in the cluster after the control plane comes up? (As opposed to like the control plane nodes, which are created directly by the bootstrap and do not have a MachineSet.)
We may have more work to do around labelling BaremetalHosts and filtering by label in the MachineSet so that the correct BaremetalHost is selected as the arbiter node.
Ah, nope, this is a mistake in my original POC wording, thanks for pointing that out. The arbiter node follows the same install path as a master, so we create 2 master machines and 1 arbiter machine. They come up and work normally, except for one of the masters being only an arbiter.
Regarding BaremetalHost, I have not POCed that flow just yet, but I was also thinking labeling a host explicitly would be the path there. I'll flesh out that flow a bit more clearly here.
name: worker
platform: {}
replicas: 0
arbiterNode:
A simpler alternative - that would support baremetal clusters only - is to assume that the arbiter is part of the control plane, and identify its existence only by applying the `arbiter` role to one of the hosts in `platform.baremetal.hosts`.
This would result in the installer (rather than a MachineSet) creating the arbiter Node with a taint and a separate pointer ignition.
Originally I was leaning that way, but we also had some testing and cloud deployment desires, so I switched the flow to follow the typical install flow for a master; it is now a full control plane machine at install. I'll update that step-by-step to correctly reflect that; my mistake on the MachineSet wording.
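For comparison, a sketch of the baremetal-only alternative described above, assuming the existing host `role` field accepted an `arbiter` value (illustrative only; BMC and MAC details omitted):

```yaml
platform:
  baremetal:
    hosts:
    - name: master-0
      role: master
    - name: master-1
      role: master
    - name: arbiter-0
      role: arbiter   # assumed new role value, not part of the current API
```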
Before going GA we need to have a proper idea of the role that the
[control-plane machine-set operator](https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/docs/user#overview)
We don't use ControlPlaneMachineSetOperator on `platform: baremetal`, so you only have to deal with this for clouds.
installs in the installer. However, the cloud install is primarily for testing,
this might mean that we simplify the installer changes if we are the only ones
using cloud installs, since we can simply alter the manifests in the pipeline
without needing to change the installer.
I would seriously question whether the work saved in manually setting up tests for etcd justifies the extra work in supporting this on cloud platforms.
Not only are cloud platforms an extra thing to support, but installing the arbiter node via a MachineSet in a platform-independent way complicates the baremetal implementation, and the Assisted/ABI implementations as well.
I understand the concern, and I'll try to capture this better in the documentation, but there is value in cloud deployment, especially as a TechPreview iteration of this feature. Currently there are some unknowns around the hardware users will be using, so having a cloud environment would make it easier for them to deploy and kick the tires on it, as well as make testing and validation a bit easier for us.
I went ahead and updated the doc to reflect the fact that the arbiter node is installed in the same way as the master nodes; would that help smooth out your concerns?
added more components that need to be changed
updated wording to correctly reflect what's being changed during installer file generation
added clarity around the use and purpose of the currently identified components
added more test cases that should be covered
Signed-off-by: ehila <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Signed-off-by: ehila <[email protected]>
Force-pushed from ce07648 to 891e81d
@eggfoobar: all tests passed!
The main questions from an installer perspective:
- Are we supporting installer-provisioned infrastructure, as opposed to assisted-installer, UPI-style provisioning?
- Are we supporting cloud platforms, or baremetal only?
The enhancement answers both of these questions "yes," but in the recent meeting it seems like they may still be open questions. I have left comments to expand on the Installer scope if we are supporting IPI & cloud platforms.
It seems like the assisted installer is a primary use case for this feature, but the workflow description describes an IPI-based workflow using `create cluster`. If assisted installer is a primary motivation, it would be good to describe that use case.
Finally it would be good to loop in @JoelSpeed for input on the control plane machineset operator.
| Component | Change |
| --------- | ------ |
| [Infrastructure API](#infrastructure-api) | Add `HighlyAvailableArbiter` as a new value for `ControlPlaneTopology` and `InfrastructureTopology` |
| [Installer](#installer-changes) | Update install config API to support arbiter machine types |
The installer will need additional changes to:
- provision different types of control plane machines (if installer-provisioned infrastructure is supported)
- render machine & control plane machineset operator manifests differently (if cpmso is supported)
- validation (some of these validations are already captured in the enhancement)
I'm still in the process of deploying a baremetal cluster with these changes, but since this is going TechPreview, I would ideally love an easy way for customers to try it out, so I hope to have an IPI flow as well for ease of deployment.
After some conversations around what CPMSO does, and since we are focusing on baremetal for the time being, we can punt that question to any future conversations about cloud deployments.
I do have a WIP PR here: openshift/installer#9159, but that will probably change with the focus on the baremetal flow.
Components that we are proposing to change:

| Component | Change |
Two components are not listed here which could be affected:
- control plane machine set operator - cc @JoelSpeed. Perhaps the only change needed is for the installer to correctly render the manifests to have a different machine type for arbiter nodes, but it would be good to check whether there are assumptions of a homogeneous control plane built in (this assumes CPMSO is supported when using arbiter nodes).
- baremetal operator - @zaneb has already taken a look, but perhaps you could tag in more reviewers if needed for any changes to the baremetal operator. If we are supporting installer-provisioned infrastructure, the baremetal operator (or whichever operator is used; I'm not familiar with the provisioning flow) would need to be updated to provision the arbiter node.
Thanks Patrick, the control plane machine set operator should not be in play for this feature at the time being, since our focus is bare metal support. If we do end up doing cloud installs for CI validation of core components, we will be making that operator `Inactive`, but we will need to revisit it for any cloud support.
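For reference, a minimal sketch of what marking the ControlPlaneMachineSet `Inactive` could look like for such cloud CI installs (required fields such as `replicas`, `selector`, and `template` are omitted for brevity; this is not a committed design):

```yaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  state: Inactive   # the operator observes but does not reconcile control plane machines
```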
name: worker
platform: {}
replicas: 0
arbiterNode:
At risk of bikeshedding, I would suggest `arbiter` rather than `arbiterNode`. `arbiterNode` is reminiscent of Hungarian notation, and doesn't parallel the existing `controlPlane` or `worker` fields. Also, `arbiterNode` implies `replicas` should only == 1.
I like that suggestion, I'll update it to use that wording.
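A hedged install-config sketch reflecting the suggested `arbiter` stanza (field names and shape are illustrative, not the final API):

```yaml
controlPlane:
  name: master
  replicas: 2
  platform:
    baremetal: {}
arbiter:              # suggested name, replacing arbiterNode
  name: arbiter
  replicas: 1
  platform:
    baremetal: {}
compute:
- name: worker
  platform: {}
  replicas: 0
```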
@eggfoobar helped bring up a cluster and I poked around a bit on the MCO side. Functionally speaking it seems operational, and the general path seems logical. Just a few things to note (plus some comments inline):
- we should check whether containerruntime/kubelet/node/image/IDMS etc. work properly for this pool (I feel like they mostly should?)
- this would be the first time we introduce a default pool other than master/worker into the MCO. We should be a bit careful on the scope of impact
- the master pool specifically in the MCO has some special casing around e.g. status reporting. (for example, it's a required pool for updates, whereas worker isn't). We should check the boundaries and error handling for this new pool
[template/<master/worker>/00-<master/worker>](https://github.com/openshift/machine-config-operator/tree/master/templates)
that are rendered in the [template controller](https://github.com/openshift/machine-config-operator/blob/7c5ae75515fe373e3eecee711c18b07d01b5e759/pkg/controller/template/render.go#L102).
We will need to add the appropriate folder with the files relevant to the
This should work but does require quite a bit of duplication (we'd have to copy over all master-* templates verbatim with only the kubelet template being slightly different)
Fine for now for TP but maybe worth revisiting before GA to see if we can make that better.
Yeah, agreed. I'll reevaluate this change and make some notes for MCO to help noodle on some ideas for de-duplicating files for GA.
```

We then need to add new rendered initial files for the arbiter role so
bootstrapping occurs correctly. When MCO stands up the Machine Config Server
As a note, this also creates a set of `arbiter-user-data` and `arbiter-user-data-managed` secrets in the MAPI namespace. I don't suppose we'd want to scale additional arbiter nodes post-install though, so those would just be unused (which is probably fine?), but the user would theoretically be able to leverage them to scale additional arbiter nodes.
Also, to check: do we foresee anyone adding arbiter nodes post-install? Maybe we can just have the bootstrap MCS serve the arbiter and block the in-cluster MCS from doing so.
At the moment there is no need to scale the arbiter node, but there is no reason why we can't do that in the future for small HA deployments. In the installer changes I left the replica field to allow us to revisit this in the future; we can put in a validation check for now to enforce 1 arbiter. However, I can imagine a future scenario where we configure 2 arbiters and 1 master.
We then need to add new rendered initial files for the arbiter role so
bootstrapping occurs correctly. When MCO stands up the Machine Config Server
today, we supply initial configs from the
[template/<master/worker>/00-<master/worker>](https://github.com/openshift/machine-config-operator/tree/master/templates)
One thing here: I'd like to make it so that any cluster that isn't using arbiter nodes does not have an arbiter pool. As it stands, I think the proposal will make it such that we always create an arbiter pool, just empty and with no nodes. Since it almost looks like a custom pool, I'd prefer we don't confuse users into thinking we generated a new pool for them.
That's a great call-out; I'll add wording to make sure we only create this pool when the arbiter is specified.
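A sketch of what an arbiter MachineConfigPool could look like when, and only when, an arbiter is requested at install time; the shape mirrors the existing master pool and is illustrative rather than the rendered manifest:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: arbiter
spec:
  machineConfigSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/role
      operator: In
      values: [arbiter]
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/arbiter: ""
```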
Hive impact/support should also be considered. cc @2uasimojo
I'm still in the process of doing some investigation on baremetal deployments. However, you are correct that the main focus will be baremetal. Regarding the assisted installer flow, that was brought up as another path we should cover, so you're right that that flow is missing; I'll capture that as well in the EP.