
CNI: Docker containers lose network configuration after host reboot #24292

Open
mr-karan opened this issue Oct 24, 2024 · 15 comments · May be fixed by #24658
@mr-karan
Contributor

Nomad version

-> % nomad version
Nomad v1.9.1
BuildDate 2024-10-21T09:00:50Z
Revision d9ec23f0c1035401e9df6c64d6ffb8bffc555a5e

Operating system and Environment details

Ubuntu 24.04, Hetzner VM

Issue

After a node reboot following system upgrades, containers started by Nomad lose network connectivity due to missing network routes, despite having proper IP allocation. The issue resolves after a job restart. The fact that it only manifests after a reboot indicates that CNI's network state recovery isn't handling route restoration correctly, even though all other network components are in place.

Key Findings

  1. Partial Network Setup: The critical observation is that CNI completes IP allocation (container gets proper IP) and iptables rules are set up correctly, but routes are missing. This suggests CNI's routing setup phase is failing silently after reboot.

  2. Networking State Inconsistency:

    • IP is allocated from correct IPAM range (172.26.64.0/20)
    • IPTables rules are properly created
    • eth0 interface exists
    • BUT routing table is empty

    This specific combination indicates CNI's network setup is partially complete but fails during route installation.

Reproduction steps

  1. Have Nomad jobs running with bridge networking
  2. Perform system upgrade requiring reboot
  3. After node is back, containers start but have no network connectivity
  4. ip route shows empty routing table in container namespace (see the sketch after these steps)
  5. Job restart fixes the issue
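
To confirm step 4 from the host without exec'ing into the container, here is a minimal sketch using the vishvananda/netlink and vishvananda/netns Go packages; the netns path is a placeholder you would substitute with the container's actual namespace file:

package main

import (
    "fmt"
    "log"

    "github.com/vishvananda/netlink"
    "github.com/vishvananda/netns"
)

func main() {
    // Placeholder path: substitute the container's actual netns file,
    // e.g. something under /var/run/docker/netns/.
    ns, err := netns.GetFromPath("/var/run/docker/netns/<container-id>")
    if err != nil {
        log.Fatal(err)
    }
    defer ns.Close()

    // Open a netlink handle inside that namespace and list IPv4 routes.
    h, err := netlink.NewHandleAt(ns)
    if err != nil {
        log.Fatal(err)
    }
    defer h.Close()

    routes, err := h.RouteList(nil, netlink.FAMILY_V4)
    if err != nil {
        log.Fatal(err)
    }
    // In the broken state this prints 0: the namespace has an IP but no routes.
    fmt.Printf("routes in namespace: %d\n", len(routes))
}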

Technical Details

Non-working state (after reboot):

# Container network namespace has only loopback
1: lo: <LOOPBACK,UP,LOWER_UP>
    inet 127.0.0.1/8

# Empty routing table
/app # ip route
/app #

# Primary container shows no network
"NetworkSettings": {
    "Bridge": "",
    "Networks": {
        "none": {
            "IPAMConfig": null,

Working state (after job restart):

# Container has proper interface and IP
1: lo: <LOOPBACK,UP,LOWER_UP>
2: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP>
    inet 172.26.64.31/20 brd 172.26.79.255 scope global eth0

# IPTables rules exist for both states
-A NOMAD-ADMIN -d 172.26.64.0/20 -o nomad -j ACCEPT
-A POSTROUTING -s 172.26.64.31/32 -j CNI-c1ff8471ac64d933382ca8e1

Job file (if appropriate)

job "doggo" {
  datacenters = ["dc1"]
  type        = "service"

  group "doggo" {
    count = 1

    network {
      mode = "bridge"

      port "web" {
        to           = 8080
        host_network = "private"
      }
    }

    service {
      name     = "doggo-web"
      port     = "web"
      provider = "nomad"
    }

    task "doggo" {
      driver = "docker"

      config {
        image = "ghcr.io/mr-karan/doggo-api:latest"
        ports = ["web"]
      }

      resources {
        cpu    = 100 # MHz - Adjust based on the needs of doggo
        memory = 50  # MB - Adjust based on the needs of doggo
      }
    }
  }
}

Impact

  • After node reboots, containers start without network connectivity
  • Applications cannot make outbound connections
  • Requires manual intervention (job restart) to restore networking
@tgross
Member

tgross commented Nov 8, 2024

Hi @mr-karan, that's pretty weird!

When the new tasks get started, we should be running the CNI plugins again, and arguably we must be, because otherwise we wouldn't have the rest of the networking setup. Something that stands out to me here is the network.port.host_network configuration. Is there any chance this alternate network isn't up yet when the allocations are being restored? I would expect that to return an error, of course, but it would contribute to an explanation of what's wrong.

@mr-karan
Contributor Author

Sharing the client config for the host_network stanza:

client {
  enabled = true
  servers = ["127.0.0.1"]

  host_network "tailscale" {
    cidr = "100.93.59.52/32"
    reserved_ports = "22,4646"
  }

  host_network "private" {
    cidr = "10.0.0.3/32"
    reserved_ports = "22,4646"
  }

  host_network "public" {
    cidr = "37.yy.xxx.2/32"
    reserved_ports = "22,4646"
  }
}

The private network is configured as:

3: enp7s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether 86:00:00:f5:eb:99 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.3/32 brd 10.0.0.3 scope global dynamic enp7s0
       valid_lft 66380sec preferred_lft 55580sec

It's a Hetzner VM and I am running Tailscale on it. Tailscale sets up its own separate network interface, but I believe enp7s0 should already have been set up. I run Nomad as a systemd service; sharing its contents:

sudo cat /usr/lib/systemd/system/nomad.service
[Unit]
Description=Nomad
Documentation=https://nomadproject.io/docs/
Wants=network-online.target
After=network-online.target

[Service]

# Nomad clients need to be run as "root" whereas Nomad servers should be run as
# the "nomad" user. Please change this if needed.
User=root
Group=root

Type=notify
EnvironmentFile=-/etc/nomad.d/nomad.env
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
KillMode=process
KillSignal=SIGINT
LimitNOFILE=65536
LimitNPROC=infinity
Restart=on-failure
RestartSec=2

TasksMax=infinity

OOMScoreAdjust=-1000

[Install]
WantedBy=multi-user.target

Since the unit declares After=network-online.target, I believe enp7s0, which is the default private network interface on Hetzner, should be up.

@tgross
Member

tgross commented Nov 11, 2024

Ok thanks @mr-karan, I'll see if I can reproduce and report back.

@qk4l

qk4l commented Dec 11, 2024

Hi everyone,

It seems we’ve encountered the same issue, though in a slightly different context. After restarting the Nomad client, any allocation that restarts (e.g., due to a template render) loses its DNS settings.

Setup
Nomad version: 1.8.3
Driver: Docker version 24.0.5, build 24.0.5-0ubuntu1~22.04.1
CNI plugin: Custom plugin

    group "some_group" {
       network {
            mode = "cni/custom_cni"
        }

The CNI plugin provides IP and DNS settings for jobs.

Steps to Reproduce

  1. Deploy a job with a CNI plugin that provides custom DNS settings.
  2. Restart the Nomad client.
  3. Restart the allocation.
  4. Observe that during the allocation restart process, the CNI plugin does not run.
  5. The container loses its custom DNS settings, and Docker applies the default nameservers.

Possible Cause of the Problem
Where is the network state (CNI response) saved?

When a container stops, Nomad calls the CNI plugin with the DEL command as described in the CNI specification. During this process, it provides all necessary settings via prevResult.

If this network state data is stored only in the Nomad client’s runtime memory, it might be lost after a client restart. This could explain why DNS settings or routes cannot be restored properly following the restart.
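
If it helps the investigation: libcni (which Nomad uses for CNI) can cache the ADD result on disk and replay it as prevResult on DEL/CHECK, so the state isn't necessarily only in memory. A sketch of reading that cache through libcni's public API; the plugin dir, config dir, and IDs below are illustrative, not Nomad's real ones:

package main

import (
    "fmt"
    "log"

    "github.com/containernetworking/cni/libcni"
)

func main() {
    cni := libcni.NewCNIConfig([]string{"/opt/cni/bin"}, nil)

    // Illustrative paths and names.
    netList, err := libcni.LoadConfList("/opt/cni/config", "nomad")
    if err != nil {
        log.Fatal(err)
    }
    rt := &libcni.RuntimeConf{
        ContainerID: "example-alloc-id",
        NetNS:       "/var/run/netns/example-alloc-id",
        IfName:      "eth0",
    }

    // The cache lives under libcni.CacheDir (/var/lib/cni by default),
    // which is persistent disk: it survives a reboot, but then describes
    // interfaces that no longer exist.
    cached, err := cni.GetNetworkListCachedResult(netList, rt)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("cached prevResult: %v\n", cached)
}

Whether a given restart path actually consults this cache is a separate question, which is what gets dug into below.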

@tgross
Member

tgross commented Dec 11, 2024

I wanted to give a little bit of an update on this, which that I've reproduced the behavior and with a bit of printf-debugging it seems to be originating from here in the network manager: network_manager_linux.go#L139. We check to see if the network namespace exists by checking for the existence of the netns file. That file is written somewhere like /var/run/docker/netns/$container_id. If that file exists, we assume the network namespace has been previously created and therefore there's nothing to do for CNI (see network_hook.go#L145).
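
As a rough illustration of that restore-time check (names here are hypothetical, not Nomad's actual code):

package netnsutil

import (
    "os"
    "path/filepath"
)

// netnsAlreadyExists sketches the logic described above: if the netns
// file is already on disk, the restore path assumes networking was set
// up previously and skips re-running the CNI plugins.
func netnsAlreadyExists(netnsDir, id string) bool {
    _, err := os.Stat(filepath.Join(netnsDir, id))
    return err == nil
}

After a reboot, /var/run is a fresh tmpfs, so for Nomad-managed paths like /var/run/netns/$alloc_id this should return false; the surprise case is Docker having already recreated /var/run/docker/netns/$container_id.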

What's puzzling is that in a reboot scenario I would definitely not expect the netns file to exist! The /var/run directory is a symlink to /run, which is a tmpfs on every distro I can think of.

But it occurred to me that Docker wants to manage its own networks. So maybe this is a Docker issue specifically? I tried again with the exec2 driver and this time when we reboot we get different results. I end up with an error like the following fairly reliably:

Recent Events:
Time                       Type           Description
2024-12-11T14:53:17-05:00  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.6 has been allocated to dbe23c1a-32bd-5f72-2fb9-17fba563c4b5, duplicate allocation is not allowed

That alloc ID is the allocation we're trying to restore, so it's conflicting with itself. 🤦 Nomad looks for the netns file somewhere like /var/run/netns/$alloc_id, doesn't find it, and so creates it, and we can see that the created flag is set to true, so that part is working here. But then the downstream CNI configuration for the bridge isn't working.

The problem appears to be that the ipam plugin writes its state to a persistent directory:

$ ls /var/lib/cni/networks/nomad/
172.26.64.10  172.26.64.5  172.26.64.6  172.26.64.7  last_reserved_ip.0  lock

After a quick patch to write that to a tmpfs as recommended here, the exec2 driver works fine! (Well except for hashicorp/nomad-driver-exec2#63 which is unrelated.)

Unfortunately that doesn't fix the problem for Docker. I'm still investigating here, but it's a bit slow-going until I can reproduce the behavior without actually rebooting the VM. But that narrows it down to Docker.

I'll have a patch up for the non-Docker case shortly.
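
For anyone following along: the host-local IPAM plugin's state location is configurable via its dataDir option (see the host-local docs linked in the commit below), so the gist of the patch is a conflist along these lines; the values here are illustrative, not Nomad's generated config:

{
  "cniVersion": "1.0.0",
  "name": "nomad",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "nomad",
      "ipMasq": true,
      "isGateway": true,
      "ipam": {
        "type": "host-local",
        "ranges": [[{ "subnet": "172.26.64.0/20" }]],
        "dataDir": "/var/run/cni"
      }
    }
  ]
}

With dataDir on the /var/run tmpfs, the IPAM state is wiped on reboot together with the netns files, so restored allocations can't conflict with stale pre-reboot leases.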

@tgross tgross changed the title from "CNI: Container loses network routes after node reboot, requires job restart" to "CNI: Docker containers lose network configuration after host reboot" Dec 11, 2024
tgross added a commit that referenced this issue Dec 11, 2024
When a Nomad host reboots, the network namespace files in the tmpfs in
`/var/run` are wiped out. So when we restore allocations after a host reboot, we
need to be able to restore both the network namespace and the network
configuration. But because the netns is newly created and we need to run the CNI
plugins again, this creates potential conflicts with the IPAM plugin which has
written state to persistent disk at `/var/lib/cni`. These IPs aren't the ones
advertised to Consul, so there's no particular reason to keep them around after
a host reboot because all virtual interfaces need to be recreated too.

Reconfigure the CNI bridge configuration to use `/var/run/cni` as its state
directory. We already expect this location to be created by CNI because the
netns files are hard-coded to be created there too in `libcni`.

Note this does not fix the problem described for Docker in #24292 because that
appears to be related to the netns itself being restored unexpectedly from
Docker's state.

Ref: #24292 (comment)
Ref: https://www.cni.dev/plugins/current/ipam/host-local/#files
@tgross
Member

tgross commented Dec 11, 2024

Also cross-linking #19962 as potentially related.

@apollo13
Contributor

> Unfortunately that doesn't fix the problem for Docker. I'm still investigating here, but it's a bit slow-going until I can reproduce the behavior without actually rebooting the VM. But that narrows it down to Docker.

What happens during host shutdown from Nomad's PoV? I am asking because the pause container has a restart policy of unless-stopped, which would cause it to be restored by Docker itself upon reboot, if I am not mistaken. If Docker brings up that container before Nomad checks for the netns, then Nomad would see /var/run/docker/netns/$container_id.
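
A quick way to check that hypothesis, sketched against the Docker Go SDK (the container ID below is a placeholder for the alloc's pause container):

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/docker/docker/client"
)

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        log.Fatal(err)
    }
    // If the pause container's restart policy is "unless-stopped",
    // dockerd recreates it (and its netns) at boot, potentially before
    // Nomad restores the allocation.
    info, err := cli.ContainerInspect(context.Background(), "<pause-container-id>")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("restart policy:", info.HostConfig.RestartPolicy.Name)
}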

@apollo13
Contributor

If my assumption is correct the docker issue should go away by reverting cd48910 (at least to confirm…)

@tgross
Member

tgross commented Dec 11, 2024

Good catch, @apollo13. I'll take a look at that tomorrow.

@tgross
Member

tgross commented Dec 12, 2024

I've confirmed that with the patch in #24650 and cd48910 reverted, the issue goes away for Docker:

Recent Events:
Time                       Type        Description
2024-12-12T09:39:11-05:00  Started     Task started by client
2024-12-12T09:38:56-05:00  Restarting  Task restarting in 15.437798526s
2024-12-12T09:38:56-05:00  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2024-12-12T09:36:46-05:00  Started     Task started by client
2024-12-12T09:36:46-05:00  Task Setup  Building Task Directory
2024-12-12T09:36:45-05:00  Received    Task received by client

The fix in cd48910 really did improve the situation around client restarts outside of host reboots though, so lemme circle-up with the team and figure out a next step.

@apollo13
Contributor

@tgross I heard there is a PR somewhere that gets rid of the docker pause container which should solve that nicely ;)

@tgross
Member

tgross commented Dec 12, 2024

The challenge here as I see it is that the CNI spec doesn't allow for the runtime to re-run the ADD command after the container has been created. So we can't simply re-run CNI setup blindly. We also can't destroy the netns if it already exists because that breaks client restarts without a host reboot. Fortunately there's a CHECK command in CNI that, given a network namespace, verifies that the networking is set up correctly. With some quick-and-dirty hacking to add support for that if the netns already exists, I get an error like the following:

Recent Events:
Time                       Type           Description
2024-12-12T11:10:48-05:00  Terminated     Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2024-12-12T11:10:48-05:00  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: host-local: Failed to find address added by container 7fc5d0ba-5646-51da-1e7b-a459b79a66ee
2024-12-12T11:09:47-05:00  Started        Task started by client
2024-12-12T11:09:47-05:00  Task Setup     Building Task Directory
2024-12-12T11:09:47-05:00  Received       Task received by client

But because this fails the allocation, it gets rescheduled rather than retried. So I'd need to add extra logic to tear down the old netns and then recreate it.

Edit: so far my approach isn't working, because it also fails (with a different error) when the client agent restarts. So the CHECK parameters themselves aren't quite right either.
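
For context, the CHECK invocation through libcni looks roughly like this; a sketch, not Nomad's actual integration, with illustrative paths and IDs:

package main

import (
    "context"
    "log"

    "github.com/containernetworking/cni/libcni"
)

func main() {
    cni := libcni.NewCNIConfig([]string{"/opt/cni/bin"}, nil)
    netList, err := libcni.LoadConfList("/opt/cni/config", "nomad")
    if err != nil {
        log.Fatal(err)
    }
    rt := &libcni.RuntimeConf{
        ContainerID: "example-alloc-id", // placeholder
        NetNS:       "/var/run/netns/example-alloc-id",
        IfName:      "eth0",
    }
    // CHECK validates the live namespace against the cached ADD result,
    // erroring if the prevResult is missing or no longer matches.
    if err := cni.CheckNetworkList(context.Background(), netList, rt); err != nil {
        log.Printf("CNI check failed; netns must be torn down and recreated: %v", err)
    }
}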

tgross added a commit that referenced this issue Dec 12, 2024
When the Nomad client restarts and restores allocations, the network namespace
for an allocation may exist but no longer be correctly configured. For example,
if the host is rebooted and the task was a Docker task using a pause container,
the network namespace may be recreated by the docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any
existing network namespace matches the expected configuration. This requires CNI
plugins of at least version 1.2.0 to avoid a bug in older plugin versions that
would cause the check to fail. If the check fails, fail the restore so that the
allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other
drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650
@tgross tgross linked a pull request Dec 12, 2024 that will close this issue
@tgross
Member

tgross commented Dec 12, 2024

I've got #24658 up with a fix. It has a caveat though, which is that it requires CNI plugins >=1.2.0 (released Jan 2023) and will throw an error if using an older version because of a bug in the older bridge network plugin. Going to chat with the team to see what they think about having a toggle to turn the behavior off for folks who don't want it. I'd rather not put a recent version constraint on the CNI plugins if we can get away with it, as some distros ship CNI plugins that are quite old. 😿

@apollo13
Contributor

That looks interesting. Out of curiosity (and since you are already knee-deep in this) -- what does CHECK actually do? Does it also check if the IP addr is the same as before the reboot; if yes how does that work if we move the CNI data to a tmpfs?

@tgross
Member

tgross commented Dec 12, 2024

> Out of curiosity (and since you are already knee-deep in this) -- what does CHECK actually do? Does it also check if the IP addr is the same as before the reboot; if yes how does that work if we move the CNI data to a tmpfs?

If the CNI data is in a tmpfs, then the check will fail because it won't have anything to compare against. That's what we want here, because we want to force Nomad to fail the allocation instead of saying "ok, there's a netns, everything must be ok". But see my comment here: #24658 (comment). It might be nice if we could make a best-effort attempt to start over to account for that.

tgross added a commit that referenced this issue Dec 13, 2024
When the Nomad client restarts and restores allocations, the network namespace
for an allocation may exist but no longer be correctly configured. For example,
if the host is rebooted and the task was a Docker task using a pause container,
the network namespace may be recreated by the docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any
existing network namespace matches the expected configuration. This requires CNI
plugins of at least version 1.2.0 to avoid a bug in older plugin versions that
would cause the check to fail.

If the check fails, destroy the network namespace and try to recreate it from
scratch once. If that fails in the second pass, fail the restore so that the
allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other
drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650

@tgross tgross added this to the 1.9.x milestone Dec 13, 2024
@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Dec 13, 2024