fix: http(s) connection time out from VMs in VLAN network #98

Open
wants to merge 1 commit into
base: master


mingshuoqiu
Contributor

Problem:
A VM attached to the VLAN network fails to reach the management URL over HTTP(S). SSH and ping, which do not use the HTTP(S) ports, work without problems, because HTTP(S) traffic needs an extra route to the CNI interface.

Solution:
HTTP(S) egress traffic from the uplink bridge interface should be untagged so that it is routed correctly, because routing decisions are made at L3 while VLAN tagging happens at L2.

Related Issue:
harvester/harvester#4359

Member

@w13915984028 w13915984028 left a comment


LGTM, thanks.

btw, please make the newly wrapped errors consistent.

@@ -108,6 +108,16 @@ func (v *Vlan) AddLocalArea(la *LocalArea) error {
	if err := v.uplink.AddBridgeVlan(la.Vid); err != nil {
		return fmt.Errorf("add bridge vlanconfig %d failed, error: %w", la.Vid, err)
	}
	br, err := netlink.LinkByName(v.bridge.Name)
	if err != nil {
		return fmt.Errorf("bridge %s not found error", v.bridge.Name)
Member


Please wrap the original err in the new Errorf.

Contributor Author


fmt.Errorf("bridge %s not found error: %w", v.bridge.Name, err), you mean like this?

Member


Yes

	brlnk := iface.NewLink(br)
	if brlnk.IsBridgeVlanPVID(la.Vid) {
		if err := netlink.BridgeVlanAdd(brlnk, la.Vid, false, true, true, false); err != nil {
			return fmt.Errorf("add iface untagged vlan failed, error: %v, link: %s, vid: %d", err, br.Attrs().Name, la.Vid)
Member


I suggest using %w for the error format here.

Member


The VM attached to the VLAN network fails to http(s) with management
URL. However, there's no problem with SSH/ping which connect to ports
other than http(s) since http(s) needs extra route to the CNI interface.

The http(s) egress from the uplink bridge interface should be untagged
to be correctly routed.

Link: harvester/harvester#4359
Signed-off-by: Chris Chiu <[email protected]>
Member

@starbops starbops left a comment


As discussed, I suggest we completely disable the /net/bridge/bridge-nf-call-iptables kernel tunable during the network controller initialization because

  • Calico/Canal CNI's working model does not involve Linux bridges (a contributor to the project also confirmed on the Slack channel that setting this tunable to 1 is unnecessary, ref)
  • We already have the code that disables the tunable whenever users create new ClusterNetwork/Vlanconfig objects ref

Making packets leave the mgmt-br bridge interface untagged could solve the issue, but it still leaves the underlying mistake uncorrected. Packet flows to and from Pods with hostPort configured will be asymmetric, whereas they should be forwarded out of the Harvester cluster through the uplink and routed by an external gateway. This issue arises when we mix Canal and the Bridge CNI under the Multus meta plugin. Everything merely appears fine as long as no VLANs are involved.

Thank you!

@mingshuoqiu
Contributor Author

As discussed, I suggest we completely disable the /net/bridge/bridge-nf-call-iptables kernel tunable during the network controller initialization because

I still have concerns about simply disabling bridge-nf-call-iptables. I hit another problem today when creating a worker node on a different VLAN from the management node. It fails to update the secret because HTTP(S) requests to the management URL time out. However, ping/SSH from the worker node to the management node's IP works.

When I create a worker node in the same VLAN as the management node, there is no such problem, since all traffic from this node to the management URL is SNAT/DNAT-ed from 172.24.1.56 to 172.24.1.52 to 10.2.4.0 to 10.2.0.45. The HTTP(S) traffic then takes flannel as the next hop instead of the default gateway.

Maybe we need to list more use cases to find out the real solution.

@starbops
Member

starbops commented Mar 7, 2024

I feel it is a different issue since no bridge CNI is involved; it's just pure canal stuff. But I strongly agree with you that we need to test thoroughly if we want to turn off the kernel tunable. Thanks!

3 participants