Cluster domain "cluster.local" is hardcoded #15311

Open
pisto opened this issue Dec 8, 2024 · 5 comments
Labels
sig/operator type/bug Something is not working as expected

Comments


pisto commented Dec 8, 2024

After jumping through a number of hoops, I have hit a blocker in the deployment of the operator. I believe the issue is common to all clusters, including OpenShift (I am testing on minikube).

It appears that the cluster domain is hardcoded here. It also appears that there is no way to override loki-config.yaml values, which would solve this issue and most likely many other potential issues caused by the rigidity of the configuration generated by the operator.

A solution would be to reference all services not by the FQDN but just by the short .svc name (a sketch follows below). I cannot foresee any issue with that, as I don't believe that the operator components access services outside the namespace where the stack is installed.
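To illustrate, a minimal sketch of the two forms; the option keys are real Loki configuration fields, but the service and namespace names here are hypothetical, not necessarily what the operator emits:

```yaml
# hypothetical excerpt of an operator-generated loki-config.yaml
common:
  # current form: hardcoded suffix, breaks when the cluster domain is not cluster.local
  compactor_address: http://loki-compactor-http.lokistack-dev.svc.cluster.local:3100
memberlist:
  join_members:
    # proposed form: the short .svc name is completed by the pod's DNS search
    # domains, so it resolves under any configured cluster domain
    - loki-gossip-ring.lokistack-dev.svc:7946
```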

Additional places where cluster.local is hardcoded: the handling of the certificates for the webhooks, here and here. In my minikube installation with a custom cluster domain the webhooks are still reached, which means that the apiserver uses and validates the short name, not the FQDN, so it appears that the FQDN names can simply be removed.
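For illustration, this is roughly what the certificate side looks like; the service and namespace names below are assumptions for the sketch, not taken from the operator code:

```yaml
# hypothetical DNS SANs on the webhook serving certificate
dnsNames:
  - loki-operator-webhook-service.loki-operator.svc                # what the apiserver validates
  - loki-operator-webhook-service.loki-operator.svc.cluster.local  # hardcoded suffix; wrong on custom domains
```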

JStickler added the sig/operator and type/feature (Something new we should do) labels Dec 9, 2024
@JoaoBraveCoding
Collaborator

Hello @pisto 👋 First of all thank you for trying the operator and opening this issue!
I understand that "cluster.local" is hardcoded, but I do not understand how this is breaking your deployment. I'm asking because we can see in the k8s documentation that DNS lookups with .cluster.local are completely fine and are in fact part of the default k8s DNS name resolution. So could you provide more detail on what the issue you are facing actually is?

JoaoBraveCoding added the type/bug (Something is not working as expected) label and removed the type/feature (Something new we should do) label Dec 13, 2024

pisto commented Dec 13, 2024

It is breaking the deployment because in the generated loki-config.yaml mounted in the containers, the addresses of the services are full FQDNs with the cluster domain included. If the cluster uses a non-default cluster domain, those hostnames simply do not resolve.
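To make it concrete, here is the kind of check that fails, assuming a cluster domain of some-dev-cluster.local and a gossip-ring service named loki-gossip-ring in namespace lokistack-dev (names are illustrative):

```sh
# from any pod in the stack's namespace:
nslookup loki-gossip-ring.lokistack-dev.svc.cluster.local   # NXDOMAIN: this suffix does not exist on the cluster
nslookup loki-gossip-ring.lokistack-dev.svc                 # resolves via the pod's DNS search list
```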

The cluster domain is generally configured directly in the DNS resolution stack of k8s (CoreDNS, the cloud platform, ...), and the root issue here is that it is in general not discoverable. In OpenShift you have the DNS Operator and can query it (kubectl get dns.operator/default), but that's just one platform. This means that as a cluster operator I have to configure the cluster domain in essentially all deployments and configurations, be it Helm chart values, obscure operator-managed CR fields, or others, and I've lost count of the instances where I had to rediscover and fix this issue.
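For reference, the OpenShift query I mean, plus the heuristic I fall back to elsewhere; the jsonpath field name is taken from the OpenShift DNS operator CRD, and the pod name is a placeholder:

```sh
# OpenShift only: the DNS operator reports the configured cluster domain
kubectl get dns.operator/default -o jsonpath='{.status.clusterDomain}'

# other platforms: no standard API; inspect the search list kubelet injects into pods
kubectl exec <any-pod> -- cat /etc/resolv.conf
```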

As I mentioned in my first message, the easiest solution is to avoid generating FQDN hostnames at all and just truncate them at the .svc part. As far as I know, there is no real requirement or advantage to using the FQDN, except marginally faster resolution (without an FQDN you rely on search domains, which means multiple DNS queries may be sent before a result is found).
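To make that trade-off concrete, this is roughly the resolv.conf kubelet writes into a pod on a default kubeadm-style cluster (the nameserver IP is the common default, illustrative only):

```
search lokistack-dev.svc.cluster.local svc.cluster.local cluster.local
nameserver 10.96.0.10
options ndots:5
```

A short name like loki-gossip-ring.lokistack-dev.svc has fewer than 5 dots, so the resolver tries each search domain in order until one answers; with an FQDN the lookup is a single query.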

Fortunately in my testing I managed to catch this blocker early in development, thanks to the fact that in our minikube development setup we use a custom domain (minikube start ... --dns-domain=some-dev-cluster.local ...). I suggest doing the same in your test pipelines.

@xperimental
Collaborator

Hey @pisto and thanks for the question.

May I ask what the use-case for changing the cluster domain is? There have already been quite a few deployments of Loki managed with the Operator, and we have so far not encountered this issue. Granted, most of them were based on OpenShift; customization might be more prevalent in other flavors of Kubernetes.

From my point of view the cluster domain is only for "cluster internal" usage and would not need to be customized for interaction with "outside DNS", so I'm interested in why your use-case requires customization of that domain.


pisto commented Dec 13, 2024

Non-default cluster domains are used in multi-cluster deployments with inter-cluster connectivity, or in systems where the k8s DNS resolution is integrated into a broader DNS zone.

My specific use case is GKE (GCP managed Kubernetes): we have multiple clusters installed in the same cross-project VPC. In this mode, each cluster must (hence the blocker) have a different cluster domain. This configuration allows pods in one cluster to use VPC-native connectivity to talk to other pods and services, and k8s DNS names of any cluster can be resolved seamlessly from any other cluster (or any other VPC workload, actually). For more information, see https://cloud.google.com/kubernetes-engine/docs/how-to/cloud-dns#vpc_scope_dns.
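For concreteness, a sketch of that setup per the linked docs; the flag names are from gcloud's Cloud DNS for GKE support, while the cluster names and domains are made up:

```sh
# each cluster on the VPC gets a unique cluster domain
gcloud container clusters create cluster-a \
  --cluster-dns=clouddns --cluster-dns-scope=vpc \
  --cluster-dns-domain=cluster-a.internal

# any workload on the VPC, including pods in other clusters, can then resolve:
nslookup my-service.my-namespace.svc.cluster-a.internal
```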

Summing it up: I understand that most commonly, and apparently in all OpenShift environments, the cluster domain is cluster.local. However, it is perfectly legal per the Kubernetes specs to have any cluster domain; cluster.local is just a default value that happens to be very common. In my opinion the inability to set the domain is a bug, and the fix is low-hanging fruit.

@xperimental
Collaborator

Sorry for being a bit slow to respond; I needed to focus on a different topic for a few days.

Thanks for providing that context; I think I now understand how it is used and that it's an actual use-case. I have to admit it had not occurred to me, and I still think I would prefer to use cluster-internal DNS only for cluster-internal communication, but I can also see the advantages of your approach. I had used GKE before, but that was before this feature existed 🙂

I think we'll have a look at it, and if it's as easy as your suggestion makes it seem, without breaking existing deployments, then this looks like a probable change. 👍

And to give a bit of context myself: Most of us will have a short or longer break over the next two weeks, so I wouldn't expect many updates over that time. 🌴
