AzureMachinePool indefinitely scaling up and down the pool size #5240

Open · MadJlzz opened this issue Nov 5, 2024 · 8 comments
Labels: kind/bug, needs-triage

@MadJlzz (Contributor) commented Nov 5, 2024

/kind bug

What steps did you take and what happened:

We're having a small problem with capz v1.17.1 and capi v1.8.4 using AzureMachinePool.

Even though we've set the number of replicas for the pool to 1, capz keeps updating the underlying VMSS, scaling it up to 2 and reverting it back to 1 shortly afterwards.
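
For reference, here's a minimal sketch of the kind of manifests involved; all names, namespaces, and values below are placeholders rather than our actual spec, and other required fields (OS image, disk, SSH key, bootstrap details) are trimmed for brevity:

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: pool0                  # placeholder name
  namespace: default
spec:
  clusterName: my-cluster      # placeholder cluster name
  replicas: 1                  # the size capz keeps overriding to 2 and back
  template:
    spec:
      clusterName: my-cluster
      version: v1.30.5         # placeholder Kubernetes version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfig
          name: pool0
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AzureMachinePool
        name: pool0
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AzureMachinePool
metadata:
  name: pool0
  namespace: default
spec:
  location: westeurope         # placeholder region
  template:
    vmSize: Standard_D4s_v3    # placeholder VM SKU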

What did you expect to happen:

For the VMSS to stay at size 1, as stated in the spec.

Anything else you would like to add:

In production, we're running the following versions and haven't hit this particular error:

capi-kubeadm-bootstrap-system       bootstrap-kubeadm       194d   BootstrapProvider        kubeadm       v1.7.2
capi-kubeadm-control-plane-system   control-plane-kubeadm   194d   ControlPlaneProvider     kubeadm       v1.7.2
capi-system                         cluster-api             194d   CoreProvider             cluster-api   v1.7.2
capz-system                         infrastructure-azure    194d   InfrastructureProvider   azure         v1.15.1

Here's an image showcasing the issue:

[image attachment]

Environment:

  • cluster-api-provider-azure version: 1.17.1
  • Kubernetes version: (use kubectl version): 1.30.x
  • OS (e.g. from /etc/os-release): Ubuntu 22.04
@k8s-ci-robot added the kind/bug label Nov 5, 2024
@nawazkh self-assigned this Nov 14, 2024
@dtzar added the needs-triage label Nov 14, 2024
@willie-yao (Contributor) commented:

/unassign @nawazkh
/assign

@k8s-ci-robot assigned willie-yao and unassigned nawazkh Nov 22, 2024
@willie-yao (Contributor) commented:

@MadJlzz I'm trying to reproduce this error now. Did you initially set the number of replicas to 2 or 1? Can you also post your cluster spec?

@willie-yao moved this from Todo to In Progress in CAPZ Planning Nov 22, 2024
@willie-yao added this to the next milestone Nov 22, 2024
@willie-yao (Contributor) commented:

I'm seeing a different error when creating an mp with 2 replicas, scaling down to 1, then scaling back up to 2:

VM has reported a failure when processing extension 'CAPZ.Linux.Bootstrapping' (publisher 'Microsoft.Azure.ContainerUpstream' and type 'CAPZ.Linux.Bootstrapping'). Error message: 'Enable failed: failed to execute command: command terminated with exit status=1 [stdout] [stderr] '. More information on troubleshooting is available at https://aka.ms/vmextensionlinuxtroubleshoot.

Is this similar to what you're seeing? I can't seem to reproduce the exact bug you're having.
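
For anyone trying to follow along, the sequence above just amounts to driving spec.replicas on the MachinePool; a rough sketch with a placeholder name (not my exact spec):

apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: pool0      # placeholder name
spec:
  replicas: 2      # step 1: create the MachinePool with 2 replicas
# step 2: patch spec.replicas down to 1 and wait for the VMSS to finish scaling in
# step 3: patch spec.replicas back up to 2 -- the new instance then fails with the
#         CAPZ.Linux.Bootstrapping extension error quoted above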

@jackfrancis (Contributor) commented:

@willie-yao /var/log/cloud-init-output.log on the new node that failed to come up will give you more information.

@willie-yao (Contributor) commented:

I see the same error in the serial console and in cloud-init-output.log:

[2024-11-25 19:32:19] error execution phase preflight: couldn't validate the identity of the API Server: failed to request the cluster-info ConfigMap: Get "https://willie-machinepool-5848-16fdbd27.northeurope.cloudapp.azure.com:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s": context deadline exceeded
[2024-11-25 19:32:19] To see the stack trace of this error execute with --v=5 or higher
[2024-11-25 19:32:19] 2024-11-25 19:32:19,380 - cc_scripts_user.py[WARNING]: Failed to run module scripts_user (scripts in /var/lib/cloud/instance/scripts)
[2024-11-25 19:32:19] 2024-11-25 19:32:19,380 - util.py[WARNING]: Running module scripts_user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
[2024-11-25 19:32:20] Cloud-init v. 24.1.3-0ubuntu1~22.04.4 finished at Mon, 25 Nov 2024 19:32:20 +0000. Datasource DataSourceAzure [seed=/dev/sr0].  Up 343.46 seconds

@MadJlzz (Contributor, Author) commented Nov 26, 2024

We're trying to reproduce it on our side as well; unfortunately, we haven't been able to so far.

If I remember correctly, we started with 2, scaled down to 1, and back up to 2, like you did.

@willie-yao (Contributor) commented:

Thanks for the update. @MadJlzz, are you also seeing the issue with the CAPZ bootstrapping extension shown above? Let me know if you run into the original bug as well.

@MadJlzz (Contributor, Author) commented Nov 26, 2024

Regarding the CAPZ bootstrapping extension above, the logs look good; no errors to mention. The tests we performed were run against both the production versions I mentioned above and the latest capi/capz versions (capi 1.8.5, capz 1.17.2).

We even tested with the versions mentioned initially in this issue: capz 1.17.1 and capi 1.8.4.
