AzureMachinePool indefinitely scaling up and down the pool size #5240

Open · MadJlzz opened this issue Nov 5, 2024 · 8 comments
Labels: kind/bug, needs-triage

@MadJlzz (Contributor) commented Nov 5, 2024

/kind bug

What steps did you take and what happened:

We're having a small problem with capz v1.17.1 and capi v1.8.4 using AzureMachinePool.

Even though we've set the number of replicas for the pool to 1, capz keeps updating the underlying VMSS, scaling it up to 2 and reverting it back to 1 shortly afterwards.
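
For reference, here's a minimal sketch of the kind of manifests involved; all names, namespaces, and values below are placeholders rather than our actual spec, and other required fields (OS image, disk, SSH key, bootstrap details) are trimmed for brevity:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: pool0                  # placeholder name
  namespace: default
spec:
  clusterName: my-cluster      # placeholder cluster name
  replicas: 1                  # the size capz keeps overriding to 2 and back
  template:
    spec:
      clusterName: my-cluster
      version: v1.30.5         # placeholder Kubernetes version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: pool0
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AzureMachinePool
        name: pool0
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachinePool
metadata:
  name: pool0
  namespace: default
spec:
  location: westeurope         # placeholder region
  template:
    vmSize: Standard_D4s_v3    # placeholder VM SKU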

What did you expect to happen:

For the VMSS to stay at size 1, as stated in the spec.

Anything else you would like to add:

In production, we're running the following versions and haven't hit this particular error:

capi-kubeadm-bootstrap-system       bootstrap-kubeadm       194d   BootstrapProvider        kubeadm       v1.7.2
capi-kubeadm-control-plane-system   control-plane-kubeadm   194d   ControlPlaneProvider     kubeadm       v1.7.2
capi-system                         cluster-api             194d   CoreProvider             cluster-api   v1.7.2
capz-system                         infrastructure-azure    194d   InfrastructureProvider   azure         v1.15.1

Here's an image showcasing the issue:

[image attachment]

Environment:

  • cluster-api-provider-azure version: 1.17.1
  • Kubernetes version: (use kubectl version): 1.30.x
  • OS (e.g. from /etc/os-release): Ubuntu 22.04
@k8s-ci-robot added the kind/bug label Nov 5, 2024
@nawazkh self-assigned this Nov 14, 2024
@dtzar added the needs-triage label Nov 14, 2024
@willie-yao (Contributor) commented:

/unassign @nawazkh
/assign

@k8s-ci-robot assigned willie-yao and unassigned nawazkh Nov 22, 2024
@willie-yao (Contributor) commented:

@MadJlzz I'm trying to reproduce this error now. Did you initially set the number of replicas to 2 or 1? Can you also post your cluster spec?

@willie-yao moved this from Todo to In Progress in CAPZ Planning Nov 22, 2024
@willie-yao added this to the next milestone Nov 22, 2024
@willie-yao (Contributor) commented:

I'm seeing a different error when creating an mp with 2 replicas, scaling down to 1, then scaling back up to 2:

VM has reported a failure when processing extension 'CAPZ.Linux.Bootstrapping' (publisher 'Microsoft.Azure.ContainerUpstream' and type 'CAPZ.Linux.Bootstrapping'). Error message: 'Enable failed: failed to execute command: command terminated with exit status=1 [stdout] [stderr] '. More information on troubleshooting is available at https://aka.ms/vmextensionlinuxtroubleshoot.

Is this similar to what you're seeing? I can't seem to reproduce the exact bug you're having.
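
For anyone trying to follow along, the sequence above just amounts to driving spec.replicas on the MachinePool; a rough sketch with a placeholder name (not my exact spec):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: pool0      # placeholder name
spec:
  replicas: 2      # step 1: create the MachinePool with 2 replicas
# step 2: patch spec.replicas down to 1 and wait for the VMSS to finish scaling in
# step 3: patch spec.replicas back up to 2 -- the new instance then fails with the
#         CAPZ.Linux.Bootstrapping extension error quoted above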

@jackfrancis (Contributor) commented:

@willie-yao /var/log/cloud-init-output.log on the new node that failed to come up will give you more information.

@willie-yao (Contributor) commented:

I see the same error in the serial console and in cloud-init-output.log:

[2024-11-25 19:32:19] error execution phase preflight: couldn't validate the identity of the API Server: failed to request the cluster-info ConfigMap: Get "https://willie-machinepool-5848-16fdbd27.northeurope.cloudapp.azure.com:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": context deadline exceeded
[2024-11-25 19:32:19] To see the stack trace of this error execute with --v=5 or higher
[2024-11-25 19:32:19] 2024-11-25 19:32:19,380 - cc_scripts_user.py[WARNING]: Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)
[2024-11-25 19:32:19] 2024-11-25 19:32:19,380 - util.py[WARNING]: Running module scripts_user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
[2024-11-25 19:32:20] Cloud-init v. 24.1.3-0ubuntu1~22.04.4 finished at Mon, 25 Nov 2024 19:32:20 +0000. Datasource DataSourceAzure [seed=/dev/sr0].  Up 343.46 seconds

@MadJlzz (Contributor, Author) commented Nov 26, 2024

We're trying to reproduce it on our side as well; unfortunately, we haven't been able to so far.

If I remember correctly, we started with 2, scaled down to 1, and back up to 2, like you did.

@willie-yao (Contributor) commented:

Thanks for the update. @MadJlzz, are you also seeing the issue with the CAPZ bootstrapping extension shown above? Let me know if you run into the original bug as well.

@MadJlzz (Contributor, Author) commented Nov 26, 2024

Regarding the CAPZ bootstrapping extension above, the logs look good; no errors to mention. The tests we performed were run against both the production versions I mentioned above and the latest capi/capz versions (capi 1.8.5, capz 1.17.2).

We even tested with the versions mentioned initially in this issue: capz 1.17.1 and capi 1.8.4.
