Konnectivity server leaks memory and free sockets #255
One issue I'm following up on right now has to do with losing connections during a dial timeout. The most obvious symptom would be lines in the server log containing the string connectionID=0. Do you see this? (I have a fix for GRPC checked in, but I think a bit extra needs to be done for http-connect, which I believe is what you are using)
It would also be interesting to see what pprof says as you're ramping up on memory and sockets.
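For reference, a minimal sketch of how such a pprof endpoint can be exposed and then queried in a Go process; the port and the standalone setup here are arbitrary choices for illustration, and the actual proxy server may already expose profiling through its own flags:

```go
// Minimal sketch (not the proxy-server code): expose net/http/pprof so heap
// and goroutine profiles can be pulled while memory and socket usage climb.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// With this endpoint up, profiles can be fetched with, for example,
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	//   go tool pprof http://localhost:6060/debug/pprof/goroutine
	// Port 6060 is an arbitrary choice for this sketch.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```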
@cheftako yes, we are using http-connect. I cannot see connectionID=0 in the server logs.
#253. I think the handling of client.PacketType_DIAL_CLS should also close the http connection, but I wanted to get more testing before adding that change. However, if you're not seeing connectionID=0, then I think you're seeing a different issue.
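For illustration only (this is not the apiserver-network-proxy source, and the type and field names below are assumptions), a sketch of the idea that a DIAL_CLS for a connection should also tear down the corresponding frontend http-connect connection rather than leaving it open:

```go
// Illustrative sketch of closing the frontend connection when DIAL_CLS arrives.
package tunnel

import (
	"net"
	"sync"
)

// frontends maps connectionID -> the hijacked http-connect connection.
// The structure is hypothetical; it only exists to show the cleanup step.
type frontends struct {
	mu    sync.Mutex
	conns map[int64]net.Conn
}

// onDialClosed would be invoked when a DIAL_CLS packet arrives for connID.
func (f *frontends) onDialClosed(connID int64) {
	f.mu.Lock()
	conn, ok := f.conns[connID]
	delete(f.conns, connID) // drop the bookkeeping entry so it cannot leak
	f.mu.Unlock()
	if ok {
		conn.Close() // close the frontend socket instead of leaving it dangling
	}
}
```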
Hi! It is possible that there was some networking issue between the konnectivity server and agent while the traffic was high. It is possible we have found this issue. @Tamas-Biro1 I think, based on this, we could try to simulate this somehow. What do you think? Thank you!
Okay, after some investigation, I see the following:
Configuration:
How do I make the agent hang? I use my …
How do I make a request to the tunnel? I'm executing …
In the konnectivity server logs, the following is repeating:
The leak is visible in Prometheus; memory is increasing similarly. What do you think? @cheftako @jkh52 @caesarxuchao Thank you!
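As a cross-check that is independent of Prometheus, the open file descriptors of the proxy-server process can be sampled directly; a rough sketch assuming a Linux host and that the process PID is passed on the command line (both assumptions, not taken from this thread):

```go
// Rough sketch: periodically count the open file descriptors of a PID
// by listing /proc/<pid>/fd, to confirm a socket leak over time.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: fdcount <pid>")
		os.Exit(1)
	}
	pid := os.Args[1]
	for {
		entries, err := os.ReadDir(fmt.Sprintf("/proc/%s/fd", pid))
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%s open fds: %d\n", time.Now().Format(time.RFC3339), len(entries))
		time.Sleep(30 * time.Second)
	}
}
```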
In the test I'm using a container built from the main branch. Thanks.
In the agent, the log is still rolling:
Once the
This query is for the metrics server inside the cluster. Seems … Thanks!
So to summarize: it seems the agent can open connections and proxy to a cluster service, but not to the kubelet? Thank you!
Note this is still occurring on the latest master (although not as frequently). Here's an example of one pod that has leaked to 12 GB server-side, with logs.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@relyt0925 which commit is your image based off of? It's hard to tell from the manifest you shared because it's just the image SHA.
Reposting from #276 (comment): An interesting observation I noticed while trying to reproduce this issue is that restarting Konnectivity Server does not completely clear the open files and goroutines. So if I restart just the proxy server and immediately check the goroutine and open file counts, they are a bit lower but still very high. But if I restart kube-apiserver, I see the goroutines and open files reset across both components. So at this point I'm thinking this is mainly a leak in konnectivity-client. Going to run some validation against #325 today, which has some potential fixes in konnectivity-client, but I would appreciate it if anyone else that is able to reproduce the issue can also test it. Note that the fix does require a rebuild of kube-apiserver, so it might be difficult to test.
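A sketch of the before/after comparison described above, assuming goroutine profiles are reachable over pprof on both the kube-apiserver side and the konnectivity-server side; the URLs below are placeholders, not the components' real ports:

```go
// Sketch: print the goroutine totals of two pprof endpoints so the counts
// can be compared before and after restarting one of the components.
package main

import (
	"bufio"
	"fmt"
	"net/http"
)

// goroutineSummary returns the first line of the text goroutine profile,
// which includes the total count ("goroutine profile: total N").
func goroutineSummary(baseURL string) (string, error) {
	resp, err := http.Get(baseURL + "/debug/pprof/goroutine?debug=1")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	scanner := bufio.NewScanner(resp.Body)
	if scanner.Scan() {
		return scanner.Text(), nil
	}
	return "", scanner.Err()
}

func main() {
	targets := []string{
		"http://localhost:6060", // placeholder for the kube-apiserver side
		"http://localhost:8095", // placeholder for the konnectivity-server side
	}
	for _, target := range targets {
		line, err := goroutineSummary(target)
		if err != nil {
			fmt.Println(target, "error:", err)
			continue
		}
		fmt.Println(target, line)
	}
}
```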
@Tamas-Biro1 in your cluster, do you also see kube-apiserver memory spike in a similar pattern, or is it only the ANP server/agent?
@andrewsykim We are using Konnectivity with the http-connect configuration, NOT gRPC mode. I think konnectivity-client is only used for gRPC, is that right? Of course, we could check, but then please share the commands you are interested in so the results are comparable. Thanks,
@mihivagyok you're right, I got this mixed up with #276, which was using gRPC + UDS. They seem like separate issues with similar symptoms.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
/reopen
@Tamas-Biro1: Reopened this issue.
What does your … ? With 0.1.2, using gRPC mode, we do not see sawtooth-like leaks as in earlier versions, so there could be remaining leaks in http-connect-specific codepaths. @ipochi have you seen this as well, with recent binaries?
The logs above (from 2021) must be from an older version; do you have a log sample from 0.1.2? Also, I would check the metrics.
Recently I came across a scenario in which I noticed a similar spike in http-connect mode, with proxy-server v0.0.33 getting OOMKilled. The number of pending dials was high (500+, I don't have an exact number). What's peculiar/interesting to me is that after the first OOMKill of the proxy-server, the new container was already occupying 75% of memory at the time of its own OOMKill.
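A loose sketch of tracking the pending-dial count over time by scraping the server's metrics endpoint; the port and the metric-name substring used here are assumptions for illustration, not values confirmed in this thread:

```go
// Loose sketch: poll a Prometheus-format /metrics endpoint and print any
// non-comment lines whose metric name mentions pending dials.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	const metricsURL = "http://localhost:8095/metrics" // assumed admin/metrics address
	for {
		resp, err := http.Get(metricsURL)
		if err != nil {
			fmt.Println("scrape error:", err)
		} else {
			scanner := bufio.NewScanner(resp.Body)
			for scanner.Scan() {
				line := scanner.Text()
				if strings.Contains(line, "pending_dial") && !strings.HasPrefix(line, "#") {
					fmt.Println(time.Now().Format(time.RFC3339), line)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(time.Minute)
	}
}
```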
Hello,
We are using ANP version 0.0.21 with patch #179 (I know it is already in 0.0.22, but upgrading is currently not an option).
We have a cluster with 1 ANP server and 4 ANP agents. Sometimes we see that the server starts consuming a lot of memory, and once it reaches the limit it restarts, and the whole thing starts from the beginning. The same is true for the number of free sockets. Here is the Prometheus graph for memory and socket numbers:
From the server logs I only see the "usual" things that could be seen in previous versions:
and
Did you face this problem? Or any idea what should be changed / investigated? Thanks!