
CNI: Docker containers lose network configuration after host reboot #24292

Open
mr-karan opened this issue Oct 24, 2024 · 15 comments · May be fixed by #24658
@mr-karan
Contributor

Nomad version

-> % nomad version
Nomad v1.9.1
BuildDate 2024-10-21T09:00:50Z
Revision d9ec23f0c1035401e9df6c64d6ffb8bffc555a5e

Operating system and Environment details

Ubuntu 24.04, Hetzner VM

Issue

After a node reboot following system upgrades, containers started by Nomad lose network connectivity due to missing network routes, despite having proper IP allocation. The issue resolves after a job restart. The fact that it only manifests after a reboot indicates that CNI's network state recovery isn't handling route restoration correctly, even though all other network components are in place.

Key Findings

  1. Partial Network Setup: The critical observation is that CNI completes IP allocation (container gets proper IP) and iptables rules are set up correctly, but routes are missing. This suggests CNI's routing setup phase is failing silently after reboot.

  2. Networking State Inconsistency:

    • IP is allocated from correct IPAM range (172.26.64.0/20)
    • IPTables rules are properly created
    • eth0 interface exists
    • BUT routing table is empty

    This specific combination indicates CNI's network setup is partially complete but fails during route installation.

Reproduction steps

  1. Have Nomad jobs running with bridge networking
  2. Perform system upgrade requiring reboot
  3. After node is back, containers start but have no network connectivity
  4. ip route shows empty routing table in container namespace (see the sketch after these steps)
  5. Job restart fixes the issue
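
To confirm step 4 from the host without exec'ing into the container, here is a minimal sketch using the vishvananda/netlink and vishvananda/netns Go packages; the netns path is a placeholder you would substitute with the container's actual namespace file:

package main

import (
    "fmt"
    "log"

    "github.com/vishvananda/netlink"
    "github.com/vishvananda/netns"
)

func main() {
    // Placeholder path: substitute the container's actual netns file,
    // e.g. something under /var/run/docker/netns/.
    ns, err := netns.GetFromPath("/var/run/docker/netns/<container-id>")
    if err != nil {
        log.Fatal(err)
    }
    defer ns.Close()

    // Open a netlink handle inside that namespace and list IPv4 routes.
    h, err := netlink.NewHandleAt(ns)
    if err != nil {
        log.Fatal(err)
    }
    defer h.Close()

    routes, err := h.RouteList(nil, netlink.FAMILY_V4)
    if err != nil {
        log.Fatal(err)
    }
    // In the broken state this prints 0: the namespace has an IP but no routes.
    fmt.Printf("routes in namespace: %d\n", len(routes))
}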

Technical Details

Non-working state (after reboot):

# Container network namespace has only loopback
1: lo: <LOOPBACK,UP,LOWER_UP>
    inet 127.0.0.1/8

# Empty routing table
/app # ip route
/app #

# Primary container shows no network
"NetworkSettings": {
    "Bridge": "",
    "Networks": {
        "none": {
            "IPAMConfig": null,

Working state (after job restart):

# Container has proper interface and IP
1: lo: <LOOPBACK,UP,LOWER_UP>
2: eth0@if19: <BROADCAST,MULTICAST,UP,LOWER_UP>
    inet 172.26.64.31/20 brd 172.26.79.255 scope global eth0

# IPTables rules exist for both states
-A NOMAD-ADMIN -d 172.26.64.0/20 -o nomad -j ACCEPT
-A POSTROUTING -s 172.26.64.31/32 -j CNI-c1ff8471ac64d933382ca8e1

Job file (if appropriate)

job "doggo" {
  datacenters = ["dc1"]
  type        = "service"

  group "doggo" {
    count = 1

    network {
      mode = "bridge"

      port "web" {
        to           = 8080
        host_network = "private"
      }
    }

    service {
      name     = "doggo-web"
      port     = "web"
      provider = "nomad"
    }

    task "doggo" {
      driver = "docker"

      config {
        image = "ghcr.io/mr-karan/doggo-api:latest"
        ports = ["web"]
      }

      resources {
        cpu    = 100 # MHz - Adjust based on the needs of doggo
        memory = 50  # MB - Adjust based on the needs of doggo
      }
    }
  }
}

Impact

  • After node reboots, containers start without network connectivity
  • Applications cannot make outbound connections
  • Requires manual intervention (job restart) to restore networking
@tgross
Member

tgross commented Nov 8, 2024

Hi @mr-karan, that's pretty weird!

When the new tasks get started, we should be running the CNI plugins again, and arguably we must be, because otherwise we wouldn't have the rest of the networking setup. Something that stands out to me here is the network.port.host_network configuration. Is there any chance this alternate network isn't up yet when the allocations are being restored? I would expect that to return an error, of course, but it would contribute to an explanation of what's wrong.

@mr-karan
Contributor Author

Sharing the client config for the host_network stanza:

client {
  enabled = true
  servers = ["127.0.0.1"]

  host_network "tailscale" {
    cidr = "100.93.59.52/32"
    reserved_ports = "22,4646"
  }

  host_network "private" {
    cidr = "10.0.0.3/32"
    reserved_ports = "22,4646"
  }

  host_network "public" {
    cidr = "37.yy.xxx.2/32"
    reserved_ports = "22,4646"
  }
}

The private network is configured as:

3: enp7s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc fq_codel state UP group default qlen 1000
    link/ether 86:00:00:f5:eb:99 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.3/32 brd 10.0.0.3 scope global dynamic enp7s0
       valid_lft 66380sec preferred_lft 55580sec

It's a Hetzner VM and I am running Tailscale on it. Tailscale sets up its own separate network interface, but I believe enp7s0 should already have been set up. I run Nomad as a systemd service; sharing its contents:

sudo cat /usr/lib/systemd/system/nomad.service
[Unit]
Description=Nomad
Documentation=https://nomadproject.io/docs/
Wants=network-online.target
After=network-online.target

[Service]

# Nomad clients need to be run as "root" whereas Nomad servers should be run as
# the "nomad" user. Please change this if needed.
User=root
Group=root

Type=notify
EnvironmentFile=-/etc/nomad.d/nomad.env
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/bin/nomad agent -config /etc/nomad.d
KillMode=process
KillSignal=SIGINT
LimitNOFILE=65536
LimitNPROC=infinity
Restart=on-failure
RestartSec=2

TasksMax=infinity

OOMScoreAdjust=-1000

[Install]
WantedBy=multi-user.target

Since the unit declares After=network-online.target, I believe enp7s0, which is the default private network interface on Hetzner, should be up.

@tgross
Member

tgross commented Nov 11, 2024

Ok thanks @mr-karan, I'll see if I can reproduce and report back.

@qk4l

qk4l commented Dec 11, 2024

Hi everyone,

It seems we’ve encountered the same issue, though in a slightly different context. After restarting the Nomad client, any allocation that restarts (e.g., due to a template render) loses its DNS settings.

Setup
Nomad version: 1.8.3
Driver: Docker version 24.0.5, build 24.0.5-0ubuntu1~22.04.1
CNI plugin: Custom plugin

    group "some_group" {
       network {
            mode = "cni/custom_cni"
        }

The CNI plugin provides IP and DNS settings for jobs.

Steps to Reproduce

  1. Deploy a job with a CNI plugin that provides custom DNS settings.
  2. Restart the Nomad client.
  3. Restart the allocation.
  4. Observe that during the allocation restart process, the CNI plugin does not run.
  5. The container loses its custom DNS settings, and Docker applies the default nameservers.

Possible Cause of the Problem
Where is the network state (CNI response) saved?

When a container stops, Nomad calls the CNI plugin with the DEL command as described in the CNI specification. During this process, it provides all necessary settings via prevResult.

If this network state data is stored only in the Nomad client’s runtime memory, it might be lost after a client restart. This could explain why DNS settings or routes cannot be restored properly following the restart.
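
If it helps the investigation: libcni (which Nomad uses for CNI) can cache the ADD result on disk and replay it as prevResult on DEL/CHECK, so the state isn't necessarily only in memory. A sketch of reading that cache through libcni's public API; the plugin dir, config dir, and IDs below are illustrative, not Nomad's real ones:

package main

import (
    "fmt"
    "log"

    "github.com/containernetworking/cni/libcni"
)

func main() {
    cni := libcni.NewCNIConfig([]string{"/opt/cni/bin"}, nil)

    // Illustrative paths and names.
    netList, err := libcni.LoadConfList("/opt/cni/config", "nomad")
    if err != nil {
        log.Fatal(err)
    }
    rt := &libcni.RuntimeConf{
        ContainerID: "example-alloc-id",
        NetNS:       "/var/run/netns/example-alloc-id",
        IfName:      "eth0",
    }

    // The cache lives under libcni.CacheDir (/var/lib/cni by default),
    // which is persistent disk: it survives a reboot, but then describes
    // interfaces that no longer exist.
    cached, err := cni.GetNetworkListCachedResult(netList, rt)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("cached prevResult: %v\n", cached)
}

Whether a given restart path actually consults this cache is a separate question, which is what gets dug into below.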

@tgross
Member

tgross commented Dec 11, 2024

I wanted to give a little bit of an update on this, which that I've reproduced the behavior and with a bit of printf-debugging it seems to be originating from here in the network manager: network_manager_linux.go#L139. We check to see if the network namespace exists by checking for the existence of the netns file. That file is written somewhere like /var/run/docker/netns/$container_id. If that file exists, we assume the network namespace has been previously created and therefore there's nothing to do for CNI (see network_hook.go#L145).
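
As a rough illustration of that restore-time check (names here are hypothetical, not Nomad's actual code):

package netnsutil

import (
    "os"
    "path/filepath"
)

// netnsAlreadyExists sketches the logic described above: if the netns
// file is already on disk, the restore path assumes networking was set
// up previously and skips re-running the CNI plugins.
func netnsAlreadyExists(netnsDir, id string) bool {
    _, err := os.Stat(filepath.Join(netnsDir, id))
    return err == nil
}

After a reboot, /var/run is a fresh tmpfs, so for Nomad-managed paths like /var/run/netns/$alloc_id this should return false; the surprise case is Docker having already recreated /var/run/docker/netns/$container_id.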

What's puzzling is that in a reboot scenario I would definitely not expect the netns file to exist! The /var/run directory is a symlink to /run, which is a tmpfs on every distro I can think of.

But it occurred to me that Docker wants to manage its own networks. So maybe this is a Docker issue specifically? I tried again with the exec2 driver and this time when we reboot we get different results. I end up with an error like the following fairly reliably:

Recent Events:
Time                       Type           Description
2024-12-11T14:53:17-05:00  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: failed to configure network: plugin type="bridge" failed (add): failed to allocate for range 0: 172.26.64.6 has been allocated to dbe23c1a-32bd-5f72-2fb9-17fba563c4b5, duplicate allocation is not allowed

That alloc ID is the allocation we're trying to restore, so it's conflicting with itself. 🤦 Nomad looks for the netns file somewhere like /var/run/netns/$alloc_id, doesn't find it, and so creates it, and we can see that the created flag is set to true, so that part is working here. But then the downstream CNI configuration for the bridge isn't working.

The problem appears to be that the ipam plugin writes its state to a persistent directory:

$ ls /var/lib/cni/networks/nomad/
172.26.64.10  172.26.64.5  172.26.64.6  172.26.64.7  last_reserved_ip.0  lock

After a quick patch to write that to a tmpfs as recommended here, the exec2 driver works fine! (Well except for hashicorp/nomad-driver-exec2#63 which is unrelated.)

Unfortunately that doesn't fix the problem for Docker. I'm still investigating here, but it's a bit slow-going until I can reproduce the behavior without actually rebooting the VM. But that narrows it down to Docker.

I'll have a patch up for the non-Docker case shortly.
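
For anyone following along: the host-local IPAM plugin's state location is configurable via its dataDir option (see the host-local docs linked in the commit below), so the gist of the patch is a conflist along these lines; the values here are illustrative, not Nomad's generated config:

{
  "cniVersion": "1.0.0",
  "name": "nomad",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "nomad",
      "ipMasq": true,
      "isGateway": true,
      "ipam": {
        "type": "host-local",
        "ranges": [[{ "subnet": "172.26.64.0/20" }]],
        "dataDir": "/var/run/cni"
      }
    }
  ]
}

With dataDir on the /var/run tmpfs, the IPAM state is wiped on reboot together with the netns files, so restored allocations can't conflict with stale pre-reboot leases.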

@tgross tgross changed the title from "CNI: Container loses network routes after node reboot, requires job restart" to "CNI: Docker containers lose network configuration after host reboot" Dec 11, 2024
tgross added a commit that referenced this issue Dec 11, 2024
When a Nomad host reboots, the network namespace files in the tmpfs in
`/var/run` are wiped out. So when we restore allocations after a host reboot, we
need to be able to restore both the network namespace and the network
configuration. But because the netns is newly created and we need to run the CNI
plugins again, this creates potential conflicts with the IPAM plugin which has
written state to persistent disk at `/var/lib/cni`. These IPs aren't the ones
advertised to Consul, so there's no particular reason to keep them around after
a host reboot because all virtual interfaces need to be recreated too.

Reconfigure the CNI bridge configuration to use `/var/run/cni` as its state
directory. We already expect this location to be created by CNI because the
netns files are hard-coded to be created there too in `libcni`.

Note this does not fix the problem described for Docker in #24292 because that
appears to be related to the netns itself being restored unexpectedly from
Docker's state.

Ref: #24292 (comment)
Ref: https://www.cni.dev/plugins/current/ipam/host-local/#files
@tgross
Member

tgross commented Dec 11, 2024

Also cross-linking #19962 as potentially related.

@apollo13
Contributor

> Unfortunately that doesn't fix the problem for Docker. I'm still investigating here, but it's a bit slow-going until I can reproduce the behavior without actually rebooting the VM. But that narrows it down to Docker.

What happens during host shutdown from Nomad's PoV? I am asking because the pause container has a restart policy of unless-stopped, which would cause it to be restored by Docker itself upon reboot, if I am not mistaken. If Docker brings up that container before Nomad checks for the netns, then Nomad would see /var/run/docker/netns/$container_id.
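
A quick way to check that hypothesis, sketched against the Docker Go SDK (the container ID below is a placeholder for the alloc's pause container):

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/docker/docker/client"
)

func main() {
    cli, err := client.NewClientWithOpts(client.FromEnv)
    if err != nil {
        log.Fatal(err)
    }
    // If the pause container's restart policy is "unless-stopped",
    // dockerd recreates it (and its netns) at boot, potentially before
    // Nomad restores the allocation.
    info, err := cli.ContainerInspect(context.Background(), "<pause-container-id>")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("restart policy:", info.HostConfig.RestartPolicy.Name)
}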

@apollo13
Contributor

If my assumption is correct the docker issue should go away by reverting cd48910 (at least to confirm…)

@tgross
Member

tgross commented Dec 11, 2024

Good catch, @apollo13. I'll take a look at that tomorrow.

@tgross
Member

tgross commented Dec 12, 2024

I've confirmed that with the patch in #24650 and cd48910 reverted, the issue goes away for Docker:

Recent Events:
Time                       Type        Description
2024-12-12T09:39:11-05:00  Started     Task started by client
2024-12-12T09:38:56-05:00  Restarting  Task restarting in 15.437798526s
2024-12-12T09:38:56-05:00  Terminated  Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2024-12-12T09:36:46-05:00  Started     Task started by client
2024-12-12T09:36:46-05:00  Task Setup  Building Task Directory
2024-12-12T09:36:45-05:00  Received    Task received by client

The fix in cd48910 really did improve the situation around client restarts outside of host reboots though, so lemme circle-up with the team and figure out a next step.

@apollo13
Contributor

@tgross I heard there is a PR somewhere that gets rid of the docker pause container which should solve that nicely ;)

@tgross
Member

tgross commented Dec 12, 2024

The challenge here as I see it is that the CNI spec doesn't allow for the runtime to re-run the ADD command after the container has been created. So we can't simply re-run CNI setup blindly. We also can't destroy the netns if it already exists because that breaks client restarts without a host reboot. Fortunately there's a CHECK command in CNI that, given a network namespace, verifies that the networking is set up correctly. With some quick-and-dirty hacking to add support for that if the netns already exists, I get an error like the following:

Recent Events:
Time                       Type           Description
2024-12-12T11:10:48-05:00  Terminated     Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
2024-12-12T11:10:48-05:00  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: host-local: Failed to find address added by container 7fc5d0ba-5646-51da-1e7b-a459b79a66ee
2024-12-12T11:09:47-05:00  Started        Task started by client
2024-12-12T11:09:47-05:00  Task Setup     Building Task Directory
2024-12-12T11:09:47-05:00  Received       Task received by client

But because this fails the allocation, it gets rescheduled rather than retried. So I'd need to add extra logic to tear down the old netns and then recreate it.

Edit: so far my approach isn't working, because it also fails (with a different error) when the client agent restarts. So the CHECK parameters themselves aren't quite right either.
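
For context, the CHECK invocation through libcni looks roughly like this; a sketch, not Nomad's actual integration, with illustrative paths and IDs:

package main

import (
    "context"
    "log"

    "github.com/containernetworking/cni/libcni"
)

func main() {
    cni := libcni.NewCNIConfig([]string{"/opt/cni/bin"}, nil)
    netList, err := libcni.LoadConfList("/opt/cni/config", "nomad")
    if err != nil {
        log.Fatal(err)
    }
    rt := &libcni.RuntimeConf{
        ContainerID: "example-alloc-id", // placeholder
        NetNS:       "/var/run/netns/example-alloc-id",
        IfName:      "eth0",
    }
    // CHECK validates the live namespace against the cached ADD result,
    // erroring if the prevResult is missing or no longer matches.
    if err := cni.CheckNetworkList(context.Background(), netList, rt); err != nil {
        log.Printf("CNI check failed; netns must be torn down and recreated: %v", err)
    }
}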

tgross added a commit that referenced this issue Dec 12, 2024
When the Nomad client restarts and restores allocations, the network namespace
for an allocation may exist but no longer be correctly configured. For example,
if the host is rebooted and the task was a Docker task using a pause container,
the network namespace may be recreated by the docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any
existing network namespace matches the expected configuration. This requires CNI
plugins of at least version 1.2.0 to avoid a bug in older plugin versions that
would cause the check to fail. If the check fails, fail the restore so that the
allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other
drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650
@tgross tgross linked a pull request Dec 12, 2024 that will close this issue
@tgross
Member

tgross commented Dec 12, 2024

I've got #24658 up with a fix. It has a caveat though, which is that it requires CNI plugins >=1.2.0 (released Jan 2023) and will throw an error if using an older version because of a bug in the older bridge network plugin. Going to chat with the team to see what they think about having a toggle to turn the behavior off for folks who don't want it. I'd rather not put a recent version constraint on the CNI plugins if we can get away with it, as some distros ship CNI plugins that are quite old. 😿

@apollo13
Contributor

That looks interesting. Out of curiosity (and since you are already knee-deep in this) -- what does CHECK actually do? Does it also check if the IP addr is the same as before the reboot; if yes how does that work if we move the CNI data to a tmpfs?

@tgross
Member

tgross commented Dec 12, 2024

> Out of curiosity (and since you are already knee-deep in this) -- what does CHECK actually do? Does it also check if the IP addr is the same as before the reboot; if yes how does that work if we move the CNI data to a tmpfs?

If the CNI data is in a tmpfs, then the check will fail because it won't have anything to compare against. That's what we want here, because we want to force Nomad to fail the allocation instead of saying "ok, there's a netns, everything must be ok". But see my comment here: #24658 (comment). It might be nice if we could make a best-effort attempt to start over to account for that.

tgross added a commit that referenced this issue Dec 13, 2024
When the Nomad client restarts and restores allocations, the network namespace
for an allocation may exist but no longer be correctly configured. For example,
if the host is rebooted and the task was a Docker task using a pause container,
the network namespace may be recreated by the docker daemon.

When we restore an allocation, use the CNI "check" command to verify that any
existing network namespace matches the expected configuration. This requires CNI
plugins of at least version 1.2.0 to avoid a bug in older plugin versions that
would cause the check to fail.

If the check fails, destroy the network namespace and try to recreate it from
scratch once. If that fails in the second pass, fail the restore so that the
allocation can be recreated (rather than silently having networking fail).

This should fix the gap left by #24650 for Docker task drivers and any other
drivers with the `MustInitiateNetwork` capability.

Fixes: #24292
Ref: #24650

@tgross tgross added this to the 1.9.x milestone Dec 13, 2024
@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Dec 13, 2024