
AKS Node not rebooted with lock held for not existing node #847

Open · andres32168 opened this issue Nov 6, 2023 · 16 comments

@andres32168

Hi,

we're facing an issue with the newest version of Kured, 1.14.0.

Nodes are not rebooted.

The Prometheus metrics say that a reboot is required, but there is no /var/run/reboot-required file present on the node host.

# HELP kured_reboot_required OS requires reboot due to software updates.
# TYPE kured_reboot_required gauge
kured_reboot_required{node="aks-XXX-XXXXXXXXXX-vmss000000"} 1
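To double-check the mismatch on the node itself, a debug pod with the host filesystem mounted works; a minimal sketch, assuming kubectl debug is available and using a generic busybox image (node name taken from the metric above):

```sh
# kubectl debug mounts the node's root filesystem under /host in the debug pod
kubectl debug node/aks-XXX-XXXXXXXXXX-vmss000000 -it --image=busybox -- \
  sh -c 'test -f /host/var/run/reboot-required && echo "sentinel present" || echo "no sentinel file"'
```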

Adding the file manually on the node results in a level=warning msg="Lock already held:" message for a node that no longer exists.

time="2023-11-06T06:37:59Z" level=info msg="Binding node-id command flag to environment variable: KURED_NODE_ID"
time="2023-11-06T06:37:59Z" level=info msg="Kubernetes Reboot Daemon: 1.14.0"
time="2023-11-06T06:37:59Z" level=info msg="Node ID: aks-XXX-XXXXXXXXXX-vmss000000"
time="2023-11-06T06:37:59Z" level=info msg="Lock Annotation: base-mon/kured:weave.works/kured-node-lock"
time="2023-11-06T06:37:59Z" level=info msg="Lock TTL set, lock will expire after: 30m0s"
time="2023-11-06T06:37:59Z" level=info msg="Lock release delay not set, lock will be released immediately after rebooting"
time="2023-11-06T06:37:59Z" level=info msg="PreferNoSchedule taint: "
time="2023-11-06T06:37:59Z" level=info msg="Blocking Pod Selectors: []"
time="2023-11-06T06:37:59Z" level=info msg="Reboot schedule: ---MonTueWedThu------ between 02:00 and 08:00 UTC"
time="2023-11-06T06:37:59Z" level=info msg="Reboot check command: [test -f /var/run/reboot-required] every 1h0m0s"
time="2023-11-06T06:37:59Z" level=info msg="Concurrency: 1"
time="2023-11-06T06:37:59Z" level=info msg="Reboot command: [/bin/systemctl reboot]"
time="2023-11-06T06:37:59Z" level=info msg="Will annotate nodes during kured reboot operations"
time="2023-11-06T07:09:55Z" level=info msg="Reboot required"
time="2023-11-06T07:09:55Z" level=warning msg="Lock already held: aks-XXX-XXXXXXXXXX-vmss000024"

We set the lockTtl value via the Helm chart:

configuration:
  endTime: "08:00"                           # only reboot before this time of day (default "23:59") time is UTC
  rebootDays: ["mo", "tu", "we", "th"]       # only reboot on these days (default [su,mo,tu,we,th,fr,sa])
  startTime: "02:00"                         # only reboot after this time of day (default "0:00") time is UTC
  concurrency: 1
  lockTtl: "30m"
  annotateNodes: true
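Assuming the chart maps these values to kured's command-line flags in the usual way (which the schedule, concurrency, and TTL lines in the startup log above appear to confirm), the daemon should effectively be running with something like:

```sh
# Hypothetical rendering of the Helm values above as kured flags
/usr/bin/kured \
  --reboot-days=mo,tu,we,th \
  --start-time=02:00 \
  --end-time=08:00 \
  --concurrency=1 \
  --lock-ttl=30m \
  --annotate-nodes
```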

The nodes do get the annotations:

weave.works/kured-most-recent-reboot-needed: 2023-11-06T02:21:56Z
weave.works/kured-reboot-in-progress: 2023-11-06T02:21:56Z

Do you have any idea why this happens?

I know these messages are from today, with not much time between the configuration change and the possible reboot window, but the same thing happened all of last week; these are just the newest logs after redeploying with an increased endTime.

Thank you in advance
André

@ckotzbauer
Member

This seems to be related to #822. The problem can appear when a lock is held by a node that has been removed from the cluster. That the metric flags a node which does not need a reboot is new, however, but it may be a consequence of the faulty lock behaviour.
We'll have a look in the next few days.
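A quick way to confirm that the holder named in the warning really is gone from the cluster:

```sh
# Expected to return NotFound if the lock holder was removed from the cluster
kubectl get node aks-XXX-XXXXXXXXXX-vmss000024
```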

github-actions bot commented Jan 6, 2024

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@andres32168
Author

not stale

@gyoza
gyoza commented Feb 14, 2024

This seems to happen when Karpenter nodes get rebooted and Karpenter removes them before they come back. It's happening to me on EKS as well, with lock-ttl set to 30m.

@jackfrancis
Collaborator

@gyoza that sounds like a scenario where this could happen.

My main question for folks who are experiencing this: is the TTL configuration not working? At present, kured makes no guarantee that a node will continue to exist after it successfully acquires the lock (an annotation on the kured DaemonSet). It does guarantee, however, that if you configure a lock TTL, the lock will be released after the TTL expires, whether or not the node that acquired it still exists at that time.

Are we seeing different behavior than described above?
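One way to check whether the TTL is actually being honoured is to look at the lock record itself. A sketch, assuming the annotation value is kured's JSON lock record with nodeID and created fields (field names may differ between versions; verify against yours):

```sh
# Pretty-print the lock record; if "created" is older than the configured
# lock-ttl and the warning persists, the TTL is not being honoured.
kubectl -n <kured-namespace> get ds kured \
  -o jsonpath='{.metadata.annotations.weave\.works/kured-node-lock}' | jq .
```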

@gyoza
gyoza commented Feb 14, 2024

@jackfrancis Exactly! I figured the lock would expire whether the node was around or not, but that does not seem to be the case.

Daemonset:

Image:      ghcr.io/kubereboot/kured:1.15.0
Port:       8080/TCP
Host Port:  0/TCP
Command:
  /usr/bin/kured
Args:
  --ds-name=kured
  --ds-namespace=core
  --metrics-port=8080
  --lock-ttl=30m

Logs:

kured-gzmfr kured time="2024-02-14T07:36:15Z" level=warning msg="Lock already held: replaced-name.deadnode.compute.internal"
kured-8kn2v kured time="2024-02-14T17:14:49Z" level=warning msg="Lock already held: replaced-name.deadnode.compute.internal"

The only way I can get things back to work, momentarily, is to rollout-restart the daemonset in each context.
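For reference, that workaround is (namespace taken from the DaemonSet args above):

```sh
kubectl -n core rollout restart ds/kured
```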

@gyoza
gyoza commented Feb 14, 2024

It appears that even after a daemonset rollout restart, that specific node lock shows up again.

@gyoza
gyoza commented Feb 15, 2024

Is there a way to force-clear the lock manually?

github-actions bot

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@andres32168
Author

not stale

github-actions bot

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@MCBBosch

> Is there a way to force-clear the lock manually?

This could help: https://kured.dev/docs/operation/#manual-unlock. At least, it works for us.
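For convenience, the linked procedure amounts to removing the lock annotation from the DaemonSet (namespace and annotation as seen earlier in this thread; double-check the docs for your version):

```sh
# The trailing "-" removes the annotation, releasing the lock
kubectl -n base-mon annotate ds kured weave.works/kured-node-lock-
```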

Nevertheless, it is really annoying that the lock-ttl setting doesn't solve the problem.

github-actions bot

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@MCBBosch

not stale

github-actions bot

This issue was automatically considered stale due to lack of activity. Please update it and/or join our slack channels to promote it, before it automatically closes (in 7 days).

@andres32168
Author

not stale
