Agent kubelet connection failures on GKE Autopilot after disabling insecure port 10255 #1635

Closed
tkoft opened this issue Dec 13, 2024 · 2 comments


tkoft commented Dec 13, 2024

The kubelet HTTPS port is set to 0 for GKE Autopilot. This was done to fix this issue years ago, but I'm not sure why it was necessary. It seems like it was meant to force the agent to fall back to the unsecured HTTP port 10255? Google is now deprecating the insecure read-only port and is emailing users to migrate to HTTPS on 10250.

I disabled the insecure port on our autopilot cluster, and it broke the agent's connection to kubelet, since the HTTPS port 0 obviously doesn't work and the fallback 10255 is gone too.

I then went to manually set

env:
  - name: DD_KUBERNETES_HTTPS_KUBELET_PORT
    value: "10250"

in my datadog-values.yaml.

Nit: annoyingly, the top-level datadog.env value doesn't propagate down; it's overridden by each agent container. I had to add it to agents.containers.agent.env, agents.containers.traceAgent.env, agents.containers.processAgent.env, etc.

Anyway, after doing this, I'm still getting failures:

2024-12-13 00:04:39 UTC | CORE | WARN | (comp/core/workloadmeta/impl/store.go:599 in func1) | error pulling from collector "kube_metadata": couldn't fetch "podlist": unexpected status code 403 on https://10.10.15.215:10250/pods: Forbidden (user=system:serviceaccount:ddagent:datadog-agent, verb=get, resource=nodes, subresource=proxy)

I thought this was an RBAC permissions issue with the agent's service account, surfaced by switching to the HTTPS port instead of the open read-only HTTP port:

> kubectl -n ddagent get clusterrole datadog-agent -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    meta.helm.sh/release-name: datadog-agent
    meta.helm.sh/release-namespace: ddagent
  creationTimestamp: "2024-03-07T01:39:08Z"
  labels:
    app.kubernetes.io/instance: datadog-agent
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: datadog-agent
    app.kubernetes.io/version: "7"
    helm.sh/chart: datadog-3.83.0
  name: datadog-agent
  resourceVersion: "751533027"
  uid: 899d6260-a88b-4027-b3da-69c83a3415f9
rules:
- nonResourceURLs:
  - /metrics
  - /metrics/slis
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  - nodes/spec
  - nodes/proxy
  - nodes/stats
  verbs:
  - get
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - get
- apiGroups:
  - security.openshift.io
  resourceNames:
  - datadog-agent
  - hostaccess
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - get
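
For readers mapping the 403 message onto the role: the denied request (verb=get, resource=nodes, subresource=proxy) corresponds to the RBAC resource string nodes/proxy, which this ClusterRole does list. Below is a minimal Python sketch of that matching. The helper name `rule_allows` is hypothetical, and the model is deliberately simplified (it ignores resourceNames, `*/subresource` patterns, and role bindings), so it only shows that the rule matches on paper, not how the API server or GKE actually decides.

```python
def rule_allows(rule, verb, resource, subresource="", api_group=""):
    """Simplified check of one RBAC PolicyRule (given as a dict).

    Assumption: only exact matches and the "*" wildcard are modeled.
    """
    # RBAC names a subresource as "<resource>/<subresource>".
    full = f"{resource}/{subresource}" if subresource else resource
    groups = rule.get("apiGroups", [])
    resources = rule.get("resources", [])
    verbs = rule.get("verbs", [])
    return (
        (api_group in groups or "*" in groups)
        and (full in resources or "*" in resources)
        and (verb in verbs or "*" in verbs)
    )


# The nodes rule from the ClusterRole output, transcribed by hand.
nodes_rule = {
    "apiGroups": [""],
    "resources": ["nodes/metrics", "nodes/spec", "nodes/proxy", "nodes/stats"],
    "verbs": ["get"],
}

# The request from the 403 log line: verb=get, resource=nodes, subresource=proxy.
allowed = rule_allows(nodes_rule, "get", "nodes", "proxy")
print(allowed)
```

Since the rule matches, the 403 is not explained by a missing entry in this role.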

Although nodes/proxy is supposed to be the correct resource name to grant access to the /pods endpoint, granting this permission is disabled in GKE Autopilot:

If your workload uses the /pods endpoint on the insecure kubelet read-only port, you need to grant the nodes/proxy RBAC permission to access the endpoint on the secure kubelet port. nodes/proxy is a powerful permission that you can't grant in GKE Autopilot clusters and that you shouldn't grant in GKE Standard clusters. Use the Kubernetes API with a fieldSelector for the node name instead.

Ah, so that's why the HTTPS port was bypassed for GKE autopilot in the first place.

The core issue here may be on the agent, but I think it's worth raising here too, since the workaround (setting DD_KUBERNETES_HTTPS_KUBELET_PORT=0) will soon no longer be supported by GKE. And if the agent does get updated to use the Kubernetes API instead of kubelet, the RBAC roles will still have to be updated here.
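
For reference, the fieldSelector approach recommended in the GKE guidance quoted above amounts to listing pods through the Kubernetes API server, filtered by spec.nodeName, rather than calling the kubelet's /pods endpoint. Here is a minimal Python sketch of the request URL. The function name `pods_on_node_url`, the API server address, and the node name are illustrative assumptions, and this is not how the Datadog Agent is implemented.

```python
from urllib.parse import urlencode


def pods_on_node_url(api_server, node_name):
    """Build the Kubernetes API URL that lists pods scheduled on one node."""
    # fieldSelector=spec.nodeName=<node> restricts the list to that node.
    query = urlencode({"fieldSelector": f"spec.nodeName={node_name}"})
    return f"{api_server.rstrip('/')}/api/v1/pods?{query}"


# Hypothetical values, for illustration only.
url = pods_on_node_url("https://kubernetes.default.svc", "example-node")
print(url)
```

A request like this is authorized as a list on pods rather than a get on nodes/proxy, which is why the RBAC rules in the chart would need to change along with the agent.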

@tbavelier
Member

Hello @tkoft, thank you for raising this issue. We are aware of this upcoming deprecation and are working with Google and other Datadog engineering teams towards a solution since, as you note, nodes/proxy cannot be used, which prevents us from using the HTTPS kubelet port. Until that work is completed, the insecure port should remain enabled on GKE Autopilot clusters to keep full Agent functionality.

@tbavelier
Member

Will be closing this in favour of DataDog/datadog-agent#32120
