Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DSCP value to outgoing HTTP requests #464

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

garyclee
Copy link

Add ability to specify a DSCP value in IPv4/IPv6 header of outgoing HTTP requests

Add ability to specify a DSCP value in IPv4/IPv6 header of
outgoing HTTP requests

Signed-off-by: Gary Lee <[email protected]>
@garyclee
Copy link
Author

garyclee commented Mar 27, 2023

Hi @roidelapluie

We would like to add DSCP support to remote_write in Prometheus, specifically when it's running under Kubernetes. This PR is the first step.

The change in prometheus will be small, something like:

diff --git a/storage/remote/client.go b/storage/remote/client.go
index 92666cd1d..e141b132d 100644
--- a/storage/remote/client.go
+++ b/storage/remote/client.go
@@ -133,7 +133,9 @@ func NewReadClient(name string, conf *ClientConfig) (ReadClient, error) {
 
 // NewWriteClient creates a new client for remote write.
 func NewWriteClient(name string, conf *ClientConfig) (WriteClient, error) {
-       httpClient, err := config_util.NewClientFromConfig(conf.HTTPClientConfig, "remote_storage_write_client")
+       dialer := &config_util.DSCPDialer{DSCP: conf.HTTPClientConfig.DSCP}
+       httpClientOption := config_util.WithDialContextFunc(dialer.DialContext)
+       httpClient, err := config_util.NewClientFromConfig(conf.HTTPClientConfig, "remote_storage_write_client", httpClientOption)
        if err != nil {
                return nil, err
        }

Thanks
Gary

@roidelapluie
Copy link
Member

Hi @garyclee,

Thank you for the pull request. To be honest, I didn't know about this feature before. It seems very niche. However, after reviewing the code, I don't think this is a good idea to use syscall to configure the DSCP field in the IP header of an HTTP request. It might cause issues on other platforms and also seems to be something I am not certain I want to maintain in the long run.

A better way to achieve this might be to use a proxy server that can be configured to set the DSCP value in the IP header.

What do you think about this approach?

@roidelapluie
Copy link
Member

As a side note, you could probably use

func (*Conn) SetTOS
func (c *Conn) SetTOS(tos int) error
SetTOS sets the type-of-service field value for future outgoing packets.

@garyclee
Copy link
Author

Hi @roidelapluie

Thanks a lot for the tip about SetTOS() and SetTrafficClass(). I've started changing it and it looks promising. Would you be more inclined to accept the PR if I removed the the syscall? On platforms that don't support it, it will just ignore the DSCP option, and continue with DSCP = 0.

Thanks
Gary

Replace syscall with golang.org/x/net calls

Signed-off-by: Gary Lee <[email protected]>
Remove whitespace

Signed-off-by: Gary Lee <[email protected]>
@garyclee
Copy link
Author

garyclee commented Apr 3, 2023

Hi @roidelapluie

The syscalls have been removed. I hope it's a little more acceptable.

Using a proxy would be an option, but we would like to avoid it if possible, given it's relatively easy to do directly from Prometheus.

Thanks a lot
Gary

@dswarbrick
Copy link
Contributor

dswarbrick commented Apr 24, 2023

OpenShift supports setting specific DSCP for a pod's egress traffic: https://cloud.redhat.com/blog/using-qos-dscp-in-openshift-container-platform

Perhaps standard k8s does also?

Requiring the application to set TOS / DSCP may not work out that well, since setting some of the higher priorities will require CAP_NET_ADMIN (or equivalent on non-Linux platforms), which is why it usually handled at the kernel level via iptable / nftables packet mangling or tc.

@hobbytp
Copy link

hobbytp commented Apr 25, 2023

@dswarbrick I don't think setting TOS/DSCP require CAP_NET_ADMIN in k8s, I always configure DSCP (e.g. set to 63) and TCP KA (also one socket option) in istio ingress gateway, which default will drop all linux capabilities in securityContext. I also verified it in my docker image and k8s deployment with drop ALL. I captured the pkt to verify the DSCP value is correct.
Or I missed anything in your points?

@dswarbrick
Copy link
Contributor

dswarbrick commented Apr 25, 2023

@hobbytp https://man7.org/linux/man-pages/man7/socket.7.html

       SO_PRIORITY
              Set the protocol-defined priority for all packets to be
              sent on this socket.  Linux uses this value to order the
              networking queues: packets with a higher priority may be
              processed first depending on the selected device queueing
              discipline.  Setting a priority outside the range 0 to 6
              requires the CAP_NET_ADMIN capability.

These restrictions exist for a reason. On a multi-user system, if all applications marked all their traffic as highest priority, it would defeat the entire purpose of QoS.

@hobbytp
Copy link

hobbytp commented Apr 25, 2023

@dswarbrick

@hobbytp https://man7.org/linux/man-pages/man7/socket.7.html

       SO_PRIORITY
              Set the protocol-defined priority for all packets to be
              sent on this socket.  Linux uses this value to order the
              networking queues: packets with a higher priority may be
              processed first depending on the selected device queueing
              discipline.  Setting a priority outside the range 0 to 6
              requires the CAP_NET_ADMIN capability.

These restrictions exist for a reason. On a multi-user system, if all applications marked all their traffic as highest priority, it would defeat the entire purpose of QoS.

But I set DSCP by using following option level and name, instead of SO_PRIORITY.
IPv4: setsockopt(sock, IPPROTO_IP, IP_TOS, &opt, sizeof(opt))
IPv6: setsockopt(sock, IPPROTO_IPv6, IPV6_TCLASS, &opt, sizeof(opt))

@dswarbrick
Copy link
Contributor

@hobbytp In that case... https://man7.org/linux/man-pages/man7/ip.7.html

       IP_TOS (since Linux 1.0)
              Set or receive the Type-Of-Service (TOS) field that is
              sent with every IP packet originating from this socket.
              It is used to prioritize packets on the network.  TOS is a
              byte.  There are some standard TOS flags defined:
              IPTOS_LOWDELAY to minimize delays for interactive traffic,
              IPTOS_THROUGHPUT to optimize throughput, IPTOS_RELIABILITY
              to optimize for reliability, IPTOS_MINCOST should be used
              for "filler data" where slow transmission doesn't matter.
              At most one of these TOS values can be specified.  Other
              bits are invalid and shall be cleared.  Linux sends
              IPTOS_LOWDELAY datagrams first by default, but the exact
              behavior depends on the configured queueing discipline.
              Some high-priority levels may require superuser privileges
              (the CAP_NET_ADMIN capability).

@dswarbrick
Copy link
Contributor

dswarbrick commented Apr 25, 2023

I did some testing with SetTOS, and it looks like it successfully sets any desired TOS value, including high values like DSCP EF. Captured traffic confirms that the TOS value is being set as expected.

The __ip_sock_set_tos function in the Linux kernel maps TOS values to skb priorities:

void __ip_sock_set_tos(struct sock *sk, int val)
{
	if (sk->sk_type == SOCK_STREAM) {
		val &= ~INET_ECN_MASK;
		val |= inet_sk(sk)->tos & INET_ECN_MASK;
	}
	if (inet_sk(sk)->tos != val) {
		inet_sk(sk)->tos = val;
		sk->sk_priority = rt_tos2priority(val);
		sk_dst_reset(sk);
	}
}

You would need to poke around in the mapping that rt_tos2priority uses to calculate what skb priority will be assigned for a given TOS.

Going back to the SO_PRIORITY topic - that definitely has guards against using high priority values in sk_setsockopt without CAP_NET_ADMIN:

	case SO_PRIORITY:
		if ((val >= 0 && val <= 6) ||
		    sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_RAW) ||
		    sockopt_ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
			sk->sk_priority = val;
		else
			ret = -EPERM;
		break;

In summary, if you only care about the TOS bits being set and (presumably) honoured by downstream routers, then you can simply use SetTOS as @roidelapluie suggested. However, if you want to also ensure that the packets (or skbs, technically) are handled with high priority in the kernel, you may have to dig a little deeper.

@hobbytp
Copy link

hobbytp commented Apr 26, 2023

Thank dswarbrick for the testing and detail explanation from linux code level . It is quite clear and useful.
I dig the codes you shown, and found that any TOS value can NOT make the sk_priority >6. I collect all related code together as below.
I think that also approves your test result. I double checked the code in linux kernel v5.15 (the link you provided is for v6.3), the logic is the same. So seems it is safe to SetTos currently.

#define TC_PRIO_BESTEFFORT		0
#define TC_PRIO_FILLER			1
#define TC_PRIO_BULK			2
#define TC_PRIO_INTERACTIVE_BULK	4
#define TC_PRIO_INTERACTIVE		6
#define TC_PRIO_CONTROL			7                 //==>Only this one meanS SO_PRIORITY>6?

#define ECN_OR_COST(class)	TC_PRIO_##class

const __u8 ip_tos2prio[16] = {                                   //==>Key points: Seems all items <=6 ?
	TC_PRIO_BESTEFFORT,
	ECN_OR_COST(BESTEFFORT),
	TC_PRIO_BESTEFFORT,
	ECN_OR_COST(BESTEFFORT),
	TC_PRIO_BULK,
	ECN_OR_COST(BULK),
	TC_PRIO_BULK,
	ECN_OR_COST(BULK),
	TC_PRIO_INTERACTIVE,
	ECN_OR_COST(INTERACTIVE),
	TC_PRIO_INTERACTIVE,
	ECN_OR_COST(INTERACTIVE),
	TC_PRIO_INTERACTIVE_BULK,
	ECN_OR_COST(INTERACTIVE_BULK),
	TC_PRIO_INTERACTIVE_BULK,
	ECN_OR_COST(INTERACTIVE_BULK)
};

void ip_sock_set_tos(struct sock *sk, int val)
{
	lock_sock(sk);
	__ip_sock_set_tos(sk, val);
	release_sock(sk);
}
void __ip_sock_set_tos(struct sock *sk, int val)
{
	if (sk->sk_type == SOCK_STREAM) {
		val &= ~INET_ECN_MASK;
		val |= inet_sk(sk)->tos & INET_ECN_MASK;
	}
	if (inet_sk(sk)->tos != val) {
		inet_sk(sk)->tos = val;
		sk->sk_priority = rt_tos2priority(val);             //==> so sk_priority always <=6?
		sk_dst_reset(sk);
	}
}

#define IPTOS_TOS_MASK		0x1E
#define IPTOS_TOS(tos)		((tos)&IPTOS_TOS_MASK)

static inline char rt_tos2priority(u8 tos)
{
	return ip_tos2prio[IPTOS_TOS(tos)>>1];         //==>Seems only can return value <=6, e.g. DSCP EP(e.g. 46) will return 4
}

@garyclee
Copy link
Author

garyclee commented Apr 27, 2023

Hi @dswarbrick

Thanks a lot for the info regarding OpenShift.

My colleage Hobby checked a number of CNIs, including Calico, Flannel, Weave Net, Cilium CNI and CNI-Genie. The only one that seems to support DSCP marking is ovn-kubernetes, as you pointed out, that is the default in OpenShift. So the only portable way to achieve this seems to be doing in Prometheus directly.

Thanks
Gary

@roidelapluie
Copy link
Member

roidelapluie commented Apr 27, 2023

To me it still sound like you are bringing more proof that this feature is too niche to be integrated upstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants