
SecondaryNetwork breaks network connection of compute nodes #6448

Closed
meibensteiner opened this issue Jun 15, 2024 · 10 comments · Fixed by #6504
Labels: kind/bug (Categorizes issue or PR as related to a bug.), reported-by/end-user (Issues reported by end users.)

@meibensteiner

Describe the bug

When using the SecondaryNetwork (via OVS) feature, the control-plane nodes start perfectly fine. On compute nodes, however, the antrea-agent startup fails, leaving the nodes inaccessible.

To Reproduce
Ubuntu 24.04 LTS
RKE2 1.29
Helm chart values:

  featureGates:
    Multicast: true
    ServiceExternalIP: true
    SecondaryNetwork: true
    AntreaIPAM: true
  multicast:
    enable: true
  tunnelType: "vxlan"
  logVerbosity: 2
  secondaryNetwork:
    ovsBridges: [{bridgeName: "br-ext", physicalInterfaces: ["eth0"]}]
  agentImage:
    repository: "antrea/antrea-agent-ubuntu"
    pullPolicy: "Always"
    tag: "latest"

Expected
antrea-agent is supposed to attach eth0 to br-ext, and the node should retain its network connection.

Actual behavior

antrea-agent on the agent node:

antrea-agent I0615 08:59:26.438809       1 log_file.go:93] Set log file max size to 104857600
antrea-agent I0615 08:59:26.439750       1 feature_gate.go:249] feature gates: &{map[AntreaIPAM:true Multicast:true SecondaryNetwork:true ServiceExternalIP:true]}
antrea-agent I0615 08:59:26.439778       1 agent.go:107] "Starting Antrea agent" version="v2.1.0-dev-6e00ebf"
antrea-agent I0615 08:59:26.439794       1 client.go:89] No kubeconfig file was specified. Falling back to in-cluster config
antrea-agent I0615 08:59:26.440235       1 client.go:86] "No Antrea kubeconfig file was specified. Falling back to in-cluster config"
antrea-agent I0615 08:59:26.440289       1 prometheus.go:189] Initializing prometheus metrics
antrea-agent I0615 08:59:26.440392       1 ovs_client.go:70] Connecting to OVSDB at address /var/run/openvswitch/db.sock
antrea-agent I0615 08:59:27.441113       1 ovs_client.go:89] Not connected yet, will try again in 2s
antrea-agent I0615 08:59:29.441592       1 agent.go:383] Setting up node network
antrea-agent I0615 08:59:29.441803       1 discoverer.go:82] Starting ServiceCIDRDiscoverer
antrea-agent F0615 08:59:59.442410       1 main.go:53] Error running agent: error initializing agent: failed to get Node with name server4 from K8s: Get "https://10.43.0.1:443/api/v1/nodes/server4": dial tcp 10.43.0.1:443: i/o timeout
antrea-agent goroutine 1 [running]:
antrea-agent k8s.io/klog/v2/internal/dbg.Stacks(0x0)
antrea-agent     k8s.io/klog/[email protected]/internal/dbg/dbg.go:35 +0x85
antrea-agent k8s.io/klog/v2.(*loggingT).output(0x49d60c0, 0x3, 0x0, 0xc000292000, 0x1, {0x3a1b7a4?, 0x1?}, 0x10?, 0x0)
antrea-agent     k8s.io/klog/[email protected]/klog.go:957 +0x6fa
antrea-agent k8s.io/klog/v2.(*loggingT).printfDepth(0x0?, 0x0?, 0x0, {0x0, 0x0}, 0x0?, {0x2cd7cd7, 0x17}, {0xc000358410, 0x1, ...})
antrea-agent     k8s.io/klog/[email protected]/klog.go:750 +0x1dd
antrea-agent k8s.io/klog/v2.(*loggingT).printf(...)
antrea-agent     k8s.io/klog/[email protected]/klog.go:727
antrea-agent k8s.io/klog/v2.Fatalf(...)
antrea-agent     k8s.io/klog/[email protected]/klog.go:1661
antrea-agent main.newAgentCommand.func1(0xc00050c500?, {0xc00035e5b0, 0x0, 0x7})
antrea-agent     antrea.io/antrea/cmd/antrea-agent/main.go:53 +0x2d1
antrea-agent github.com/spf13/cobra.(*Command).execute(0xc000004900, {0xc000050110, 0x7, 0x7})
antrea-agent     github.com/spf13/[email protected]/command.go:987 +0xaa3
antrea-agent github.com/spf13/cobra.(*Command).ExecuteC(0xc000004900)
antrea-agent     github.com/spf13/[email protected]/command.go:1115 +0x3ff
antrea-agent github.com/spf13/cobra.(*Command).Execute(...)
antrea-agent     github.com/spf13/[email protected]/command.go:1039
antrea-agent main.main()
antrea-agent     antrea.io/antrea/cmd/antrea-agent/main.go:32 +0x18
Stream closed EOF for networking/antrea-agent-cqx88 (antrea-agent)

Versions:

  • Helm chart version 2.0.0 with agent set to latest
  • Server Version: v1.29.0+rke2r1
  • containerd://1.7.11-k3s2
  • Kernel 6.8.0-35-generic

Additional context

@meibensteiner added the kind/bug label on Jun 15, 2024
@antoninbas
Contributor

This is very early during initialization (based on the logs), and at that point the uplink should not have been moved to the bridge yet, so there should still be connectivity and kube-proxy should be able to let you access the K8s API.

Is it possible that there was a previous run of the Agent, with a different failure? I suppose there could have been an Agent crash after connecting the uplink to the bridge, in which case you could end up in this situation.
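
A quick way to check for an earlier crashed run (a minimal sketch; the pod name is taken from the log stream above and the "networking" namespace is inferred from the API URLs in the logs):

# show the agent pod and, if it restarted, the logs of the previous container instance
kubectl -n networking get pods -l app=antrea -o wide
kubectl -n networking logs antrea-agent-cqx88 -c antrea-agent --previous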

@jianjuns
Contributor

It would be good to have the outputs of the following commands:

  1. ovs-vsctl show
  2. ip addr
  3. ip route

Also check the previous antrea-agent logs in the host /var/log/antrea/ directory for any failure after moving the host interface to the OVS bridge, or any sign that antrea-agent stopped and then failed while restoring the host interface from OVS.
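
Something like the following, run on the affected node, should capture all of that in one go (a sketch; the output filenames are arbitrary):

# collect OVS and host network state, plus the Antrea host logs
ovs-vsctl show > ovs-vsctl-show.txt
ip addr > ip-addr.txt
ip route > ip-route.txt
tar czf var-log-antrea.tar.gz -C /var/log antrea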

@meibensteiner
Author

I am on it. But since every install breaks the network connection, I want to script the creation of the debug setup.

@meibensteiner
Author

ip route

root@node2:~# ip route
default via 192.168.64.1 dev enp0s1 proto dhcp src 192.168.64.28 metric 100 
10.42.1.0/24 dev antrea-gw0 scope link src 10.42.1.1 
192.168.64.0/24 dev enp0s1 proto kernel scope link src 192.168.64.28 
192.168.64.0/24 dev enp0s1 proto kernel scope link src 192.168.64.28 metric 100 
192.168.64.1 dev enp0s1 proto dhcp scope link src 192.168.64.28 metric 100

ip addr

root@node2:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: enp0s1~: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP group default qlen 1000
    link/ether 06:7b:b6:19:fb:49 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::47b:b6ff:fe19:fb49/64 scope link
       valid_lft forever preferred_lft forever
3: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 7e:32:46:6b:c7:aa brd ff:ff:ff:ff:ff:ff
4: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65000 qdisc noqueue master ovs-system state UNKNOWN group default qlen 1000
    link/ether 4a:d7:2d:44:39:82 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::3022:71ff:fe57:7d62/64 scope link
       valid_lft forever preferred_lft forever
5: antrea-gw0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 6e:0d:cb:b5:3d:52 brd ff:ff:ff:ff:ff:ff
    inet 10.42.1.1/24 brd 10.42.1.255 scope global antrea-gw0
       valid_lft forever preferred_lft forever
    inet6 fe80::6c0d:cbff:feb5:3d52/64 scope link
       valid_lft forever preferred_lft forever
6: antrea-egress0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
    link/ether 26:04:ec:a2:4b:d8 brd ff:ff:ff:ff:ff:ff
7: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 06:7b:b6:19:fb:49 brd ff:ff:ff:ff:ff:ff
    inet 192.168.64.28/24 brd 192.168.64.255 scope global enp0s1
       valid_lft forever preferred_lft forever
    inet6 fd51:b881:783e:89ae:47b:b6ff:fe19:fb49/64 scope global
       valid_lft forever preferred_lft forever
    inet6 fe80::47b:b6ff:fe19:fb49/64 scope link
       valid_lft forever preferred_lft forever
8: rke2-cor-0961dc@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master ovs-system state UP group default
    link/ether aa:ce:dc:32:29:ac brd ff:ff:ff:ff:ff:ff link-netns cni-7ca6c697-97db-1418-1c2f-08cb2df8050f
    inet6 fe80::c81e:58ff:fe72:c25d/64 scope link
       valid_lft forever preferred_lft forever

ovs-vsctl show

root@node2:/# ovs-vsctl show
e4a567f7-ef96-471e-82d0-6aa279e22bc9
    Bridge br-int
        datapath_type: system
        Port antrea-tun0
            Interface antrea-tun0
                type: vxlan
                options: {key=flow, remote_ip=flow}
        Port rke2-cor-0961dc
            Interface rke2-cor-0961dc
        Port antrea-gw0
            Interface antrea-gw0
                type: internal
    Bridge br-ext
        datapath_type: system
        Port "enp0s1~"
            Interface "enp0s1~"
        Port enp0s1
            Interface enp0s1
                type: internal
    ovs_version: "2.17.7"

@meibensteiner
Author

root@node2:/var/log/antrea# cat antrea-agent.ERROR

Log file created at: 2024/06/28 18:57:21
Running on machine: node2
Binary: Built with gc go1.21.11 for linux/arm64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0628 18:57:21.360180       1 arp_responder.go:101] "Failed to handle ARP request" err="read packet 06:7b:b6:19:fb:49: recvfrom: network is down" deviceName="enp0s1"
E0628 18:57:52.105875       1 shared_informer.go:314] unable to sync caches for endpoint slice config
E0628 18:57:52.105886       1 shared_informer.go:314] unable to sync caches for service config
E0628 18:57:52.105899       1 shared_informer.go:314] unable to sync caches for node config
E0628 18:57:52.105902       1 shared_informer.go:314] unable to sync caches for AntreaAgentTraceflowController
E0628 18:57:52.105904       1 shared_informer.go:314] unable to sync caches for ServiceExternalIPController
E0628 18:57:52.105908       1 shared_informer.go:314] unable to sync caches for MemberListCluster
E0628 18:57:52.105914       1 shared_informer.go:314] unable to sync caches for ExternalIPPoolController
E0628 18:57:52.105919       1 shared_informer.go:314] unable to sync caches for SecondaryNetworkController
E0628 18:57:52.105922       1 shared_informer.go:314] unable to sync caches for AntreaAgentEgressController
E0628 18:57:52.105925       1 shared_informer.go:314] unable to sync caches for AntreaAgentNodeRouteController
E0628 18:57:52.105966       1 mcast_socket_linux.go:93] "Failed to clear multicast routing table entries" err="invalid argument"
E0628 18:58:07.426844       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://10.43.0.1:443/api/v1/namespaces/networking/configmaps?fieldSelector=metadata.name%3Dantrea-ca&resourceVersion=1139": dial tcp 10.43.0.1:443: connect: no route to host
E0628 18:58:10.499593       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.43.0.1:443/api/v1/namespaces/networking/endpoints?fieldSelector=metadata.name%3Dantrea&resourceVersion=1139": dial tcp 10.43.0.1:443: connect: no route to host
E0628 18:58:10.499625       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.43.0.1:443/api/v1/namespaces/networking/services?fieldSelector=metadata.name%3Dantrea&resourceVersion=1139": dial tcp 10.43.0.1:443: connect: no route to host
E0628 18:58:10.499715       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://10.43.0.1:443/api/v1/namespaces/networking/configmaps?fieldSelector=metadata.name%3Dantrea-ca&resourceVersion=1139": dial tcp 10.43.0.1:443: connect: no route to host
E0628 18:58:13.569049       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.43.0.1:443/api/v1/namespaces/networking/services?fieldSelector=metadata.name%3Dantrea&resourceVersion=1139": dial tcp 10.43.0.1:443: connect: no route to host
E0628 18:58:13.569085       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.43.0.1:443/api/v1/namespaces/networking/endpoints?fieldSelector=metadata.name%3Dantrea&resourceVersion=1139": dial tcp 10.43.0.1:443: connect: no route to host
E0628 18:58:16.641232       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: Get "https://10.43.0.1:443/api/v1/namespaces/networking/configmaps?fieldSelector=metadata.name%3Dantrea-ca&resourceVersion=1139": dial tcp 10.43.0.1:443: connect: no route to host
E0628 18:58:19.715535       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://10.43.0.1:443/api/v1/namespaces/networking/services?fieldSelector=metadata.name%3Dantrea&resourceVersion=1139": dial tcp 10.43.0.1:443: connect: no route to host
E0628 18:58:19.715603       1 reflector.go:147] k8s.io/[email protected]/tools/cache/reflector.go:229: Failed to watch *v1.Endpoints: failed to list *v1.Endpoints: Get "https://10.43.0.1:443/api/v1/namespaces/networking/endpoints?fieldSelector=metadata.name%3Dantrea&resourceVersion=1139": dial tcp 10.43.0.1:443: connect: no route to host

root@node2:/var/log/antrea# cat antrea-agent.FATAL

Log file created at: 2024/06/28 18:57:17
Running on machine: node2
Binary: Built with gc go1.21.11 for linux/arm64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0628 18:57:17.883444       1 main.go:53] Error running agent: error initializing agent: Spec.PodCIDR is empty for Node node2
goroutine 1 [running]:
k8s.io/klog/v2/internal/dbg.Stacks(0x1)
        k8s.io/klog/[email protected]/internal/dbg/dbg.go:35 +0x8c
k8s.io/klog/v2.(*loggingT).output(0x43cb280, 0x3, 0x0, 0x400023fb90, 0x1, {0x34550ed?, 0x1?}, 0x4000072400?, 0x0)
        k8s.io/klog/[email protected]/klog.go:961 +0x640
k8s.io/klog/v2.(*loggingT).printfDepth(0x40004f8b40?, 0x1d6e00?, 0x0, {0x0, 0x0}, 0x0?, {0x270e078, 0x17}, {0x40006a8130, 0x1, ...})
        k8s.io/klog/[email protected]/klog.go:750 +0x1ac
k8s.io/klog/v2.(*loggingT).printf(...)
        k8s.io/klog/[email protected]/klog.go:727
k8s.io/klog/v2.Fatalf(...)
        k8s.io/klog/[email protected]/klog.go:1661
main.newAgentCommand.func1(0x4000636900?, {0x40001d6e00, 0x0, 0x7})
        antrea.io/antrea/cmd/antrea-agent/main.go:53 +0x258
github.com/spf13/cobra.(*Command).execute(0x40005e1800, {0x400004c090, 0x7, 0x7})
        github.com/spf13/[email protected]/command.go:989 +0x828
github.com/spf13/cobra.(*Command).ExecuteC(0x40005e1800)
        github.com/spf13/[email protected]/command.go:1117 +0x344
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/[email protected]/command.go:1041
main.main()
        antrea.io/antrea/cmd/antrea-agent/main.go:32 +0x20

goroutine 116 [chan receive]:
k8s.io/client-go/util/workqueue.(*Type).updateUnfinishedWorkLoop(0x4000576cc0)
        k8s.io/[email protected]/util/workqueue/queue.go:281 +0x94
created by k8s.io/client-go/util/workqueue.newQueue in goroutine 1
        k8s.io/[email protected]/util/workqueue/queue.go:106 +0x1c4

goroutine 115 [select]:
k8s.io/klog/v2.(*flushDaemon).run.func1()
        k8s.io/klog/[email protected]/klog.go:1157 +0xec
created by k8s.io/klog/v2.(*flushDaemon).run in goroutine 1
        k8s.io/klog/[email protected]/klog.go:1153 +0x19c

goroutine 117 [select]:
k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0x4000576fc0)
        k8s.io/[email protected]/util/workqueue/delaying_queue.go:276 +0x254
created by k8s.io/client-go/util/workqueue.newDelayingQueue in goroutine 1
        k8s.io/[email protected]/util/workqueue/delaying_queue.go:113 +0x1f8

goroutine 118 [chan receive]:
k8s.io/client-go/util/workqueue.(*Type).updateUnfinishedWorkLoop(0x4000577200)
        k8s.io/[email protected]/util/workqueue/queue.go:281 +0x94
created by k8s.io/client-go/util/workqueue.newQueue in goroutine 1
        k8s.io/[email protected]/util/workqueue/queue.go:106 +0x1c4

goroutine 119 [select]:
k8s.io/client-go/util/workqueue.(*delayingType).waitingLoop(0x4000577560)
        k8s.io/[email protected]/util/workqueue/delaying_queue.go:276 +0x254
created by k8s.io/client-go/util/workqueue.newDelayingQueue in goroutine 1
        k8s.io/[email protected]/util/workqueue/delaying_queue.go:113 +0x1f8

goroutine 110 [syscall]:
os/signal.signal_recv()
        runtime/sigqueue.go:152 +0x30
os/signal.loop()
        os/signal/signal_unix.go:23 +0x1c
created by os/signal.Notify.func1.1 in goroutine 1
        os/signal/signal.go:151 +0x28

goroutine 136 [sync.Cond.Wait]:
sync.runtime_notifyListWait(0x400024da50, 0x1)
        runtime/sema.go:527 +0x154
sync.(*Cond).Wait(0x400024da40)
        sync/cond.go:70 +0xcc
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.(*Synchronize).WaitError(0x40007ade00?)
        github.com/TomCodeLV/[email protected]/pkg/ovsdb/client.go:119 +0xa0
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.Dial.func1()
        github.com/TomCodeLV/[email protected]/pkg/ovsdb/client.go:245 +0x964
created by github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.Dial in goroutine 1
        github.com/TomCodeLV/[email protected]/pkg/ovsdb/client.go:167 +0x440

goroutine 146 [IO wait]:
internal/poll.runtime_pollWait(0xfea9a802aeb0, 0x72)
        runtime/netpoll.go:343 +0xa0
internal/poll.(*pollDesc).wait(0x40007b3100?, 0x40007fc200?, 0x0)
        internal/poll/fd_poll_runtime.go:84 +0x28
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0x40007b3100, {0x40007fc200, 0x200, 0x200})
        internal/poll/fd_unix.go:164 +0x200
net.(*netFD).Read(0x40007b3100, {0x40007fc200?, 0xfea9612939c0?, 0xfea9a81a6108?})
        net/fd_posix.go:55 +0x28
net.(*conn).Read(0x40004ea1c8, {0x40007fc200?, 0x400006be01?, 0x7e12c?})
        net/net.go:179 +0x34
encoding/json.(*Decoder).refill(0x4000678b40)
        encoding/json/stream.go:165 +0x164
encoding/json.(*Decoder).readValue(0x4000678b40)
        encoding/json/stream.go:140 +0x88
encoding/json.(*Decoder).Decode(0x4000678b40, {0x2054800, 0x4000571810})
        encoding/json/stream.go:63 +0x5c
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.(*OVSDB).decodeWrapper(0x40006bae60, 0x0?)
        github.com/TomCodeLV/[email protected]/pkg/ovsdb/client.go:318 +0x9c
github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.(*OVSDB).loop(0x40006bae60)
        github.com/TomCodeLV/[email protected]/pkg/ovsdb/client.go:335 +0x3c
created by github.com/TomCodeLV/OVSDB-golang-lib/pkg/ovsdb.Dial.func1 in goroutine 136
        github.com/TomCodeLV/[email protected]/pkg/ovsdb/client.go:228 +0x8ec

goroutine 139 [chan receive]:
antrea.io/antrea/pkg/signals.RegisterSignalHandlers.func1()
        antrea.io/antrea/pkg/signals/signals.go:38 +0x30
created by antrea.io/antrea/pkg/signals.RegisterSignalHandlers in goroutine 1
        antrea.io/antrea/pkg/signals/signals.go:37 +0x6c

goroutine 141 [select]:
k8s.io/apimachinery/pkg/util/wait.waitForWithContext({0x2c0bda0, 0x40004b7800}, 0x4000376e30, 0x4000000000?)
        k8s.io/[email protected]/pkg/util/wait/wait.go:205 +0xb0
k8s.io/apimachinery/pkg/util/wait.poll({0x2c0bda0, 0x40004b7800}, 0x0?, 0x52?, 0x2?)
        k8s.io/[email protected]/pkg/util/wait/poll.go:260 +0x90
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext({0x2c0bda0?, 0x40004b7800?}, 0x344d828?, 0xd?)
        k8s.io/[email protected]/pkg/util/wait/poll.go:200 +0x4c
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil(0x0?, 0x0?, 0x40006a80e0?)
        k8s.io/[email protected]/pkg/util/wait/poll.go:187 +0x44
k8s.io/client-go/tools/cache.WaitForCacheSync(0x0?, {0x4000376f48?, 0x0?, 0x0?})
        k8s.io/[email protected]/tools/cache/shared_informer.go:326 +0x50
antrea.io/antrea/pkg/agent/servicecidr.(*Discoverer).Run(0x40007b2e80, 0x0?)
        antrea.io/antrea/pkg/agent/servicecidr/discoverer.go:84 +0x14c
created by main.run in goroutine 1
        antrea.io/antrea/cmd/antrea-agent/agent.go:273 +0x1824

goroutine 111 [select]:
k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext.poller.func1.1()
        k8s.io/[email protected]/pkg/util/wait/poll.go:297 +0x15c
created by k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext.poller.func1 in goroutine 141
        k8s.io/[email protected]/pkg/util/wait/poll.go:280 +0xc4

@antoninbas
Contributor

@meibensteiner there are logs missing here; it would help to provide a tarball of the entire directory contents.
For example, the logs in antrea-agent.ERROR and antrea-agent.FATAL don't seem to correspond to the same run of the Agent based on their timestamps, perhaps because of log rotation.

In your original post, the transport interface was named "eth0", but now it is showing as "enp0s1", so I assume this is a different testbed?

Finally, the network configuration looks correct to me. Do you not have connectivity to your network gateway (192.168.64.1)?

IIRC, the log Error running agent: error initializing agent: Spec.PodCIDR is empty for Node node2 indicates that we timed out waiting for the PodCIDR field to be populated, while Error running agent: error initializing agent: failed to get Node with name server4 from K8s: Get "https://10.43.0.1:443/api/v1/nodes/server4": dial tcp 10.43.0.1:443: i/o timeout (fatal log from your original post) indicates that the K8s apiserver could not be reached. These are 2 pretty different error scenarios. Could you also confirm that your Nodes do indeed have a PodCIDR allocated to them?
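
For example, a per-Node view of the PodCIDR allocation could be obtained with something like this (a sketch; any equivalent kubectl query works):

# list each Node together with its allocated PodCIDR
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR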

@meibensteiner
Author

meibensteiner commented Jun 30, 2024

I reran the test and got that tarball. Just FYI: in order to regain access to the node via SSH, I had to uninstall the rke2-agent entirely. var-log-antrea.tar.gz

This is a different testbed; I had to get my dev environment working again.

My two nodes seem to both get the PodCIDR populated:

~/Projects/ main > kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}'                   
10.42.0.0/24 10.42.1.0/24

The network gateway (192.168.64.1) is also not reachable via ping.

@jianjuns
Contributor

jianjuns commented Jul 1, 2024

@hongliangl will be looking at this issue.

@wenyingd
Contributor

wenyingd commented Jul 4, 2024

@meibensteiner Thanks for providing the logs. After looking into them, I found the following:

  1. The NIC configured for SecondaryNetwork is the same one that is used as the Node NIC (i.e. it carries the Node IP used in the cluster).
  2. Connections between the current Node and the api-server / antrea-controller are broken after the physical NIC is attached to the secondary OVS bridge (br-ext).
  3. The OVS local port and uplink configurations on the secondary OVS bridge are as expected, including the migration of the host IP and routes from the uplink to the OVS local port.
  4. No additional OpenFlow entries are installed on the secondary OVS bridge. In theory, the "normal" flow is supposed to be installed by default; this needs to be double-checked (see the commands sketched below).

More time is needed for further investigation.
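
For reference, the flow table and the flow-restore-wait setting on the secondary bridge could be checked on the node with something like this (a sketch; br-ext is the bridge name from the reported configuration):

# dump the flows on the secondary bridge; a single "actions=NORMAL" flow is expected here
ovs-ofctl dump-flows br-ext
# check whether flow-restore-wait is still set in the OVSDB
ovs-vsctl get Open_vSwitch . other_config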

@wenyingd
Contributor

wenyingd commented Jul 4, 2024

The root cause is that antrea-agent doesn't remove flow-restore-wait="true" after attaching the uplink to the secondary OVS bridge, so the "normal" flow on the secondary OVS bridge can't forward packets between the OVS internal port and the uplink as expected. Since the Node's NIC is the same physical NIC used for the secondary network, the lost connections to kube-apiserver and antrea-controller cause antrea-agent to stop itself due to the network errors. As a result, there is no chance to remove flow-restore-wait="true" from the Open_vSwitch configuration, and the OpenFlow entries never take effect.
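
As a rough manual recovery sketch for an already-affected node (an assumption about a possible workaround, not necessarily what the patch does), clearing the leftover flag should let the installed flows take effect again:

# remove the stale flow-restore-wait flag from the OVSDB so OVS starts honoring the installed flows
ovs-vsctl remove Open_vSwitch . other_config flow-restore-wait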

A patch has been created to resolve the issue: #6504
