
kubernetes 1.30.5 support #23230

Open
karatkep opened this issue Nov 4, 2024 · 11 comments
Labels
area/che-server, area/dashboard, area/install, severity/P1, status/analyzing

Comments

@karatkep

karatkep commented Nov 4, 2024

Summary

Dear Community,

Could you please help me verify if Eclipse Che 7.93.0 supports Kubernetes 1.30.5? The che-dashboard and che pods stopped working when our Kubernetes cluster was updated to version 1.30.5.

Here is a sample of the error from the che-dashboard log:

ERROR[12:03:22 UTC]: HTTP request failed
    err: {
      "type": "le",
      "message": "HTTP request failed",
      "stack":
          HttpError: HTTP request failed
              at q._callback (/backend/server/backend.js:8:898957)
              at t._callback.t.callback.t.callback (/backend/server/backend.js:14:1087840)
              at q.emit (node:events:517:28)
              at q.<anonymous> (/backend/server/backend.js:14:1100418)
              at q.emit (node:events:517:28)
              at IncomingMessage.<anonymous> (/backend/server/backend.js:14:1099250)
              at Object.onceWrapper (node:events:631:28)
              at IncomingMessage.emit (node:events:529:35)
              at endReadableNT (node:internal/streams/readable:1400:12)
              at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
      "response": {
        "statusCode": 401,
        "body": {
          "kind": "Status",
          "apiVersion": "v1",
          "metadata": {},
          "status": "Failure",
          "message": "Unauthorized",
          "reason": "Unauthorized",
          "code": 401
        },
        "headers": {
          "audit-id": "6b14e1b5-8a08-41a8-a093-5e00693737a6",
          "cache-control": "no-cache, private",
          "content-type": "application/json",
          "date": "Mon, 04 Nov 2024 12:03:21 GMT",
          "content-length": "129",
          "connection": "close"
        },
        "request": {
          "uri": {
            "protocol": "https:",
            "slashes": true,
            "auth": null,
            "host": "10.1.0.1:443",
            "port": "443",
            "hostname": "10.1.0.1",
            "hash": null,
            "search": null,
            "query": null,
            "pathname": "/apis/org.eclipse.che/v2/checlusters",
            "path": "/apis/org.eclipse.che/v2/checlusters",
            "href": "https://10.1.0.1:443/apis/org.eclipse.che/v2/checlusters"
          },
          "method": "GET",
          "headers": {
            "Accept": "application/json",
            "Authorization": "Bearer MASKED"
          }
        }
      },
      "body": {
        "type": "Object",
        "message": "Unauthorized",
        "stack":
            
        "kind": "Status",
        "apiVersion": "v1",
        "metadata": {},
        "status": "Failure",
        "reason": "Unauthorized",
        "code": 401
      },
      "statusCode": 401,
      "name": "HttpError"
    }

The same issue affects the che pod. It appears that both lost access to the Kubernetes API after the upgrade to version 1.30.5.

ServiceAccounts, ClusterRoles, and ClusterRoleBindings are in place for both the che-dashboard and che pods.


@karatkep karatkep added the kind/question label Nov 4, 2024
@che-bot che-bot added the status/need-triage label Nov 4, 2024
@tolusha
Contributor

tolusha commented Nov 4, 2024

@karatkep
Could you show che pod logs?

I've tried to reproduce it on Minikube with Kubernetes 1.31.0, but no luck.

@ibuziuk ibuziuk added the area/install, severity/P1, status/analyzing, area/dashboard, and area/che-server labels and removed the kind/question and status/need-triage labels Nov 5, 2024
@karatkep
Author

karatkep commented Nov 6, 2024

@tolusha
According to the che logs, the pod starts receiving 401 errors from the kube-api exactly one hour after it launches:

06-Nov-2024 08:26:02.136 INFO [main] org.apache.catalina.startup.HostConfig.deployWAR Deployment of web application archive [/home/user/eclipse-che/tomcat/webapps/ROOT.war] has finished in [2,488] ms
06-Nov-2024 08:26:02.138 INFO [main] org.apache.coyote.AbstractProtocol.start Starting ProtocolHandler ["http-nio-8080"]
06-Nov-2024 08:26:02.144 INFO [main] org.apache.catalina.startup.Catalina.start Server startup in [40907] milliseconds
2024-11-06 09:26:32,950[c4d-k5x9l-37628]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@6c199c1d] for cluster [RemoteSubscriptionChannel], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]
2024-11-06 09:26:42,473[4c4d-k5x9l-3460]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@f31944b] for cluster [WorkspaceStateCache], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]
2024-11-06 09:26:47,468[c4d-k5x9l-46003]  [WARN ] [o.j.p.kubernetes.KUBE_PING 115]      - failed getting JSON response from Kubernetes Client[masterUrl=https://10.1.0.1:443/api/v1, headers={Authorization=#MASKED:1868#}, connectTimeout=5000, readTimeout=30000, operationAttempts=3, operationSleep=1000, streamProvider=org.jgroups.protocols.kubernetes.stream.TokenStreamProvider@5ed91d32] for cluster [WorkspaceLocks], namespace [eclipse-che], labels [app.kubernetes.io/component=che,app.kubernetes.io/instance=che,app.kubernetes.io/managed-by=che-operator,app.kubernetes.io/name=che,app.kubernetes.io/part-of=che.eclipse.org]; encountered [java.lang.Exception: 3 attempt(s) with a 1000ms sleep to execute [OpenStream] failed. Last failure was [java.io.IOException: Server returned HTTP response code: 401 for URL: https://10.1.0.1:443/api/v1/namespaces/eclipse-che/pods?labelSelector=app.kubernetes.io%2Fcomponent%3Dche%2Capp.kubernetes.io%2Finstance%3Dche%2Capp.kubernetes.io%2Fmanaged-by%3Dche-operator%2Capp.kubernetes.io%2Fname%3Dche%2Capp.kubernetes.io%2Fpart-of%3Dche.eclipse.org]]

@karatkep
Author

@tolusha, as far as I can see, the issue is that the token is not being refreshed: it is issued for one hour, and after that time the che-dashboard keeps using it despite its expiration. Is there any way to make the che-dashboard refresh the token before using it for kube-api calls?

@tolusha
Contributor

tolusha commented Nov 12, 2024

@karatkep
Could you share CheCluster CR?
What OIDC provider do you use?

@karatkep
Author

@tolusha,
Yes, of course, I will provide the CheCluster CR. However, I don't think the issue lies with the CheCluster CR or OIDC. The same Eclipse Che 7.93.0 was deployed to two identical AKS clusters (Kubernetes 1.27.9), and everything was fine until one of them was upgraded to 1.30.5; immediately after that update, the problems with the kube-api started. Inspecting the token used by, for example, the che-dashboard, I see that its expiration claim "exp" is always the same and lies in the past. From this I conclude that on Kubernetes 1.30.5 the token is not being refreshed.
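
For reference, this check can be repeated with a few lines of TypeScript. This is a minimal sketch, not Che's own code: it assumes Node.js (>= 16, for base64url) running inside the pod, and uses only the standard projected token mount.

    // Minimal sketch: decode the mounted service account token (a JWT) and
    // print its "iat"/"exp" claims to see whether the pod picks up rotation.
    import { readFileSync } from "fs";

    const token = readFileSync(
      "/var/run/secrets/kubernetes.io/serviceaccount/token",
      "utf8"
    ).trim();

    // A JWT is three base64url segments; the second one is the JSON payload.
    const payload = JSON.parse(
      Buffer.from(token.split(".")[1], "base64url").toString("utf8")
    );
    console.log("issued at:", new Date(payload.iat * 1000).toISOString());
    console.log("expires at:", new Date(payload.exp * 1000).toISOString());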

@karatkep
Author

karatkep commented Nov 12, 2024

@tolusha, @ibuziuk, we found the root cause of the issue. In Kubernetes 1.27.9, the token (located at /var/run/secrets/kubernetes.io/serviceaccount/token) is issued for one year, although it is refreshed every hour (more precisely, every 50 minutes). In Kubernetes 1.30.5, the token is issued for one hour and is likewise refreshed every 50 minutes. However, Che (che-dashboard, che, and most likely che-gateway) reads this token once at startup, caches it, and keeps using it. Consequently, on Kubernetes 1.27.9 there is no problem, since the cached token is valid for a year, but on Kubernetes 1.30.5 the problem begins one hour after startup, once the cached token has expired.
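
To make the fix direction concrete, here is a minimal TypeScript sketch (illustrative only, not Che's actual code; the in-cluster host and the checlusters path are taken from the error above): read the projected token from disk right before each Kubernetes API call instead of caching it once at startup, so the kubelet's ~50-minute rotation is always picked up.

    // Sketch: list CheCluster resources with a token read fresh per request.
    import { readFileSync } from "fs";
    import * as https from "https";

    const TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token";
    const CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt";

    // Re-read the projected token on every call instead of caching it at
    // startup; the kubelet rewrites the file before the old token expires.
    function currentToken(): string {
      return readFileSync(TOKEN_PATH, "utf8").trim();
    }

    function listCheClusters(): Promise<string> {
      return new Promise((resolve, reject) => {
        const req = https.request(
          {
            host: "kubernetes.default.svc",
            path: "/apis/org.eclipse.che/v2/checlusters",
            method: "GET",
            ca: readFileSync(CA_PATH),
            headers: {
              Accept: "application/json",
              Authorization: `Bearer ${currentToken()}`, // fresh token per call
            },
          },
          (res) => {
            let body = "";
            res.on("data", (chunk) => (body += chunk));
            res.on("end", () => resolve(body));
          }
        );
        req.on("error", reject);
        req.end();
      });
    }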

@tolusha
Contributor

tolusha commented Nov 13, 2024

@karatkep
So, if you restart all pods, Che will continue working, right?

@karatkep
Author

@tolusha
Correct, we need to restart the Che pods every hour to ensure they remain operational.

@karatkep
Author

@tolusha, @ibuziuk,
Could you please share the current status and plans for this issue? Is the root cause clear? Were you able to reproduce it? Are you already working on a fix, or planning to start soon?

Just to be on the same page: there is absolutely no pressure from my side; I only want to understand the current status and plans. For my part, I have already applied one possible workaround and written a CronJob that restarts the necessary Che pods (a sketch follows below). If other Eclipse Che users are facing, or will face, the same issue, I am more than willing to share this workaround.
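
For reference, such a workaround can look roughly like the sketch below. This is hypothetical, not my exact CronJob: the pod-restarter ServiceAccount and the deployment names are assumptions that must match your install, and the ServiceAccount needs RBAC permitting deployment patches in the eclipse-che namespace.

    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: che-pod-restarter
      namespace: eclipse-che
    spec:
      schedule: "0 * * * *"   # hourly; keeps cached tokens from outliving their 1h lifetime for long
      jobTemplate:
        spec:
          template:
            spec:
              serviceAccountName: pod-restarter   # assumed SA with patch rights on deployments
              restartPolicy: OnFailure
              containers:
                - name: kubectl
                  image: bitnami/kubectl:latest
                  command:
                    - /bin/sh
                    - -c
                    - >
                      kubectl -n eclipse-che rollout restart
                      deployment/che deployment/che-dashboard deployment/che-gateway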

@ibuziuk ibuziuk moved this to 📅 Planned in Eclipse Che Team A Backlog Nov 15, 2024
@ibuziuk
Member

ibuziuk commented Nov 15, 2024

@karatkep Thank you for the follow-up and investigation details - #23230 (comment)

I'm still wondering whether the token lifetime is configurable on the k8s side in general.
Do you happen to have a link to the release notes, docs, or commit where this lifetime change was introduced? Could it be some AKS config?

The issue has been planned for the next sprint (Nov 20 - Dec 10); however, so far @tolusha has not been able to reproduce it on vanilla Minikube.

@karatkep also, contributions from the community are most welcome if you would like to change or update the token caching mechanism in the project ;-)

@karatkep
Author

@ibuziuk,
When I was researching this issue, I came across the documentation at https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#tokenrequest-api, which contains detailed information about configuring the token lifetime. I also ran an experiment: I disabled the che-operator (so it wouldn't interfere with my changes) and used expirationSeconds to set the token lifetime to one day (86400 seconds) in the che-dashboard Deployment. After restarting the che-dashboard pod, I confirmed that the lifetime of the token (located at /var/run/secrets/kubernetes.io/serviceaccount/token) had indeed changed.
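
For context, the relevant pod-spec fragment from that experiment looks like this (a sketch following the linked Kubernetes docs; the volume name is an assumption):

    # Projected service account token with a one-day lifetime instead of the
    # 1h default; "kube-api-access" is an assumed volume name.
    volumes:
      - name: kube-api-access
        projected:
          sources:
            - serviceAccountToken:
                path: token
                expirationSeconds: 86400   # 1 day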

P.S. Frankly speaking, I don't like the option of using a long-lived token; it contradicts security best practices. Whoever made this change (token lifetime: 1y -> 1h) took a step in the right direction toward short-lived tokens, and in my opinion a well-written application should not cache the token indefinitely.
