CNI: Docker containers lose network configuration after host reboot #24292
Hi @mr-karan, that's pretty weird! When the new tasks get started, we should be running the CNI plugins again, and arguably we must be, otherwise we wouldn't have the rest of the networking setup. Something that stands out to me here is the
Sharing the client config for

It's a Hetzner VM and I am running Tailscale on it. Tailscale would set up its own separate network interface, but I believe
```ini
[Unit]
Description=Nomad
Documentation=https://nomadproject.io/docs/
Wants=network-online.target
After=network-online.target
[Service]
# Nomad clients need to be run as "root" whereas Nomad servers should be run as
# the "nomad" user. Please change this if needed.
User=root
Group=root
Type=notify
EnvironmentFile=-/etc/nomad.d/nomad.env
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
KillMode=process
KillSignal=SIGINT
LimitNOFILE=65536
LimitNPROC=infinity
Restart=on-failure
RestartSec=2
TasksMax=infinity
OOMScoreAdjust=-1000
[Install]
WantedBy=multi-user.target
```

Since the dependency states
Ok thanks @mr-karan, I'll see if I can reproduce and report back.
Hi everyone, it seems we've encountered the same issue, though in a slightly different context. After restarting the Nomad client, any allocation that restarts (e.g., due to a template render) loses its DNS settings.

Setup

```hcl
group "some_group" {
  network {
    mode = "cni/custom_cni"
  }
}
```

The CNI plugin provides IP and DNS settings for jobs.

Steps to Reproduce
Possible Cause of the Problem

When a container stops, Nomad calls the CNI plugin with the DEL command, as described in the CNI specification. During this process, it provides all necessary settings via prevResult. If this network state data is stored only in the Nomad client's runtime memory, it might be lost after a client restart. This could explain why DNS settings or routes cannot be restored properly following the restart.
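For reference, a rough sketch of the configuration a plugin receives on stdin during DEL, with the prior ADD result embedded as prevResult (all values here are hypothetical, and the exact shape depends on the plugin and CNI version):

```json
{
  "cniVersion": "1.0.0",
  "name": "custom_cni",
  "type": "example-cni-plugin",
  "prevResult": {
    "ips": [{ "address": "172.26.64.10/20", "gateway": "172.26.64.1" }],
    "routes": [{ "dst": "0.0.0.0/0" }],
    "dns": { "nameservers": ["172.26.64.1"] }
  }
}
```

If the runtime can no longer produce that prevResult after a restart, the plugin has nothing to reconcile against during teardown.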
I wanted to give a little bit of an update on this, which is that I've reproduced the behavior, and with a bit of printf-debugging it seems to be originating from here in the network manager. What's puzzling is that in a reboot scenario I would definitely not expect the netns file to exist! But it occurred to me that Docker wants to manage its own networks. So maybe this is a Docker issue specifically? I tried again with the
That alloc ID is the allocation we're trying to restore, so it's conflicting with itself. 🤦 Nomad looks for the netns file somewhere like

The problem appears to be that the ipam plugin is writing its configuration to a persistent directory:
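To illustrate, the host-local IPAM plugin records each lease as a file named for the allocated IP under its state directory on persistent disk (the paths and values here are assumptions based on the plugin's documentation, not taken from this reproduction):

```console
$ ls /var/lib/cni/networks/nomad/
172.26.64.10  172.26.64.11  last_reserved_ip.0  lock
```

Those files survive a reboot even though the interfaces and netns they describe do not.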
After a quick patch to write that to a tmpfs as recommended here, the problem goes away for the non-Docker drivers. Unfortunately that doesn't fix the problem for Docker. I'm still investigating here, but it's a bit slow-going until I can reproduce the behavior without actually rebooting the VM. But that narrows it down to Docker. I'll have a patch up for the non-Docker case shortly.
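A sketch of what that patch amounts to in the bridge network's conflist, assuming the host-local IPAM plugin (the dataDir option is documented by host-local; the subnet shown is Nomad's default bridge range):

```json
{
  "type": "bridge",
  "bridge": "nomad",
  "ipam": {
    "type": "host-local",
    "ranges": [[{ "subnet": "172.26.64.0/20" }]],
    "dataDir": "/var/run/cni"
  }
}
```

With state under `/var/run/cni` (a tmpfs), stale leases are wiped together with the netns files on reboot.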
When a Nomad host reboots, the network namespace files in the tmpfs in `/var/run` are wiped out. So when we restore allocations after a host reboot, we need to be able to restore both the network namespace and the network configuration. But because the netns is newly created and we need to run the CNI plugins again, this creates potential conflicts with the IPAM plugin, which has written state to persistent disk at `/var/lib/cni`. These IPs aren't the ones advertised to Consul, so there's no particular reason to keep them around after a host reboot, because all the virtual interfaces need to be recreated too.

Reconfigure the CNI bridge configuration to use `/var/run/cni` as its state directory. We already expect this location to be created by CNI, because the netns files are hard-coded in `libcni` to be created there too.

Note this does not fix the problem described for Docker in #24292, because that appears to be related to the netns itself being restored unexpectedly from Docker's state.

Ref: #24292 (comment)
Ref: https://www.cni.dev/plugins/current/ipam/host-local/#files
Also cross-linking #19962 as potentially related.
What happens during host shutdown from Nomad's PoV? I am asking because the pause container has a restart policy of
If my assumption is correct, the Docker issue should go away by reverting cd48910 (at least to confirm…)
Good catch, @apollo13. I'll take a look at that tomorrow.
I've confirmed that with the patch in #24650 and reverting cd48910, the issue goes away for Docker:
The fix in cd48910 really did improve the situation around client restarts outside of host reboots though, so lemme circle up with the team and figure out a next step.
@tgross I heard there is a PR somewhere that gets rid of the Docker pause container, which should solve that nicely ;)
The challenge here, as I see it, is that the CNI spec doesn't allow the runtime to re-run the ADD command after the container has been created, so we can't simply re-run CNI setup blindly. We also can't destroy the netns if it already exists, because that breaks client restarts without a host reboot. Fortunately there's a CHECK command in CNI that, given a network namespace, verifies that the networking is set up correctly. With some quick-and-dirty hacking to add support for that when the netns already exists, I get an error like the following:

But because this fails the allocation, it reschedules the alloc rather than retrying. So I'd need to add extra logic to tear down the old netns and then recreate it.

Edit: so far my approach isn't working, because it also fails (with a different error) when the client agent restarts. So it's the CHECK parameters themselves that aren't quite right as well.
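For context, a minimal sketch of driving CHECK through libcni, the library Nomad uses to invoke the plugins (the conflist path, netns path, and alloc ID below are hypothetical, and this is not Nomad's actual call site):

```go
package main

import (
	"context"
	"log"

	"github.com/containernetworking/cni/libcni"
)

func main() {
	// Load the same conflist the plugins were originally invoked with.
	conf, err := libcni.ConfListFromFile("/opt/cni/config/nomad.conflist") // hypothetical path
	if err != nil {
		log.Fatal(err)
	}

	cni := libcni.NewCNIConfig([]string{"/opt/cni/bin"}, nil)

	rt := &libcni.RuntimeConf{
		ContainerID: "c0ffee00-alloc-id",      // hypothetical allocation ID
		NetNS:       "/var/run/netns/example", // the netns found during restore
		IfName:      "eth0",
	}

	// CHECK asks every plugin in the list to verify that the netns still
	// matches its recorded ADD result (interfaces, IPs, routes).
	if err := cni.CheckNetworkList(context.Background(), conf, rt); err != nil {
		log.Printf("netns failed CNI check, needs to be recreated: %v", err)
	}
}
```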
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the Docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0, to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, fail the restore so that the allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650
I've got #24658 up with a fix. It has a caveat though, which is that it requires CNI plugins >=1.2.0 (released Jan 2023) and will throw an error if using an older version, because of a bug in the older bridge network plugin. Going to chat with the team to see what they think about having a toggle to turn the behavior off for folks who don't want it. I'd rather not put a recent version constraint on the CNI plugins if we can get away with it, as some distros ship CNI plugins that are quite old. 😿
That looks interesting. Out of curiosity (and since you are already knee-deep in this): what does the check do when the CNI state directory is on a tmpfs and has been wiped?
If the CNI data is in a tmpfs, then the check will fail, because it won't have anything to compare against. That's what we want here, because we want to force Nomad to fail the allocation instead of saying "ok, there's a netns, everything must be ok". But see my comment here: #24658 (comment), because it might be nice if we could make a best-effort attempt to start over to account for that.
When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the Docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0, to avoid a bug in older plugin versions that would cause the check to fail. If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650
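A rough sketch of the restore flow this commit describes, in terms of the libcni calls involved (the function name and error handling are hypothetical simplifications, not Nomad's actual code, and the real change also destroys the netns itself rather than only running DEL):

```go
package network

import (
	"context"

	"github.com/containernetworking/cni/libcni"
)

// restoreNetwork validates an existing netns with CHECK and rebuilds the
// network configuration once from scratch if validation fails.
func restoreNetwork(ctx context.Context, cni *libcni.CNIConfig,
	conf *libcni.NetworkConfigList, rt *libcni.RuntimeConf) error {

	if err := cni.CheckNetworkList(ctx, conf, rt); err == nil {
		return nil // existing netns still matches the expected result
	}

	// Check failed: tear down whatever half-configured state is there
	// and retry exactly once.
	if err := cni.DelNetworkList(ctx, conf, rt); err != nil {
		return err
	}
	if _, err := cni.AddNetworkList(ctx, conf, rt); err != nil {
		// Second failure: fail the restore so the alloc is recreated.
		return err
	}
	return nil
}
```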
Nomad version
Operating system and Environment details
Ubuntu 24.04, Hetzner VM
Issue
After a node reboot following system upgrades, containers started by Nomad lose network connectivity due to missing network routes, despite having proper IP allocation. The issue resolves after a job restart. That the issue only manifests after a reboot indicates that CNI's network state recovery isn't handling route restoration correctly, even though all other network components are in place.
Key Findings
Partial Network Setup: The critical observation is that CNI completes IP allocation (the container gets a proper IP) and iptables rules are set up correctly, but routes are missing. This suggests CNI's routing setup phase is failing silently after reboot.
Networking State Inconsistency:
This specific combination indicates CNI's network setup is partially complete but fails during route installation.
Reproduction steps
Run `ip route` in the container network namespace; it shows an empty routing table (see the inspection sketch below).
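One way to inspect that state from the host (the netns path is an assumption; Nomad usually creates namespaces under /var/run/netns, while Docker manages its own under /var/run/docker/netns):

```console
$ sudo nsenter --net=/var/run/netns/<alloc-id> ip addr   # the IP is still assigned
$ sudo nsenter --net=/var/run/netns/<alloc-id> ip route  # but the routing table is empty
```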
Technical Details

Non-working state (after reboot):
Working state (after job restart):
Job file (if appropriate)
Impact