Optimization is not working - Azure AKS - v1.25.6 #32

Open
zohebk8s opened this issue May 30, 2023 · 19 comments

@zohebk8s

Hi Team,

First of all, this looks like a promising new tool that can play an important role.

I quickly tested it on Azure AKS v1.25.6. Below are my findings/comments:

  1. First, a small correction to the helm install command: the release name should be supplied as well when installing.

helm install kube-reqsizer/kube-reqsizer --> helm install kube-reqsizer kube-reqsizer/kube-reqsizer

  2. I deployed a basic application in the default namespace with high CPU/memory requests to test whether kube-reqsizer would optimize them. I waited for 22 minutes, but the requests were still the same.

  3. Logs, for your reference:

I0530 15:58:39.252063 1 request.go:601] Waited for 1.996392782s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/argocd
I0530 15:58:49.252749 1 request.go:601] Waited for 1.995931495s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/argocd
I0530 15:58:59.450551 1 request.go:601] Waited for 1.994652278s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/argocd
I0530 15:59:09.450621 1 request.go:601] Waited for 1.994074539s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kube-system
I0530 15:59:19.450824 1 request.go:601] Waited for 1.99598317s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kubescape
I0530 15:59:29.650328 1 request.go:601] Waited for 1.993913908s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/tigera-operator
I0530 15:59:39.650831 1 request.go:601] Waited for 1.996110718s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kubescape
I0530 15:59:49.850897 1 request.go:601] Waited for 1.995571438s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/kube-system
I0530 16:00:00.049996 1 request.go:601] Waited for 1.994819712s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/calico-system
I0530 16:00:10.050864 1 request.go:601] Waited for 1.991681441s due to client-side throttling, not priority and fairness, request: GET:https://10.0.0.1:443/api/v1/namespaces/default
(screenshot attached)

  4. How much time will it take to optimize? Will it restart the pods automatically?

  5. I haven't customized any values; I just used the commands below to install:

helm repo add kube-reqsizer https://jatalocks.github.io/kube-reqsizer/
helm repo update
helm install kube-reqsizer kube-reqsizer/kube-reqsizer

@ElementTech
Owner

ElementTech commented May 30, 2023

Hey @zohebk8s, thanks for trying out the tool.

I've seen this happen to other users; it seems the Kubernetes API is too slow for the chart's default configuration. To work around that, you need to set concurrentWorkers to 1.
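
For reference, a minimal values override might look like the sketch below (a sketch only: concurrentWorkers is the key name used in this thread, but verify the exact path against the chart's values.yaml), applied with helm upgrade kube-reqsizer kube-reqsizer/kube-reqsizer -f values-override.yaml:

# values-override.yaml (assumed key path; check the chart's values.yaml)
concurrentWorkers: 1   # a single worker keeps calls to a slow API server from being throttled client-side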

Another issue had the same problem as yours. Please see the correspondence here:

#30

Thanks! Let me know how it goes.

@ElementTech
Owner

#30 (comment)

@zohebk8s
Author

@jatalocks Thanks for your response.

I've updated concurrentWorkers to "1", and the value of min-seconds in kube-reqsizer is also "1", as shown below. But it's still not updating the values. Am I missing something here?

(screenshots attached)

I've added the below annotations to that deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
  annotations:
    reqsizer.jatalocks.github.io/optimize: "true" # Ignore Pod/Namespace when optimizing entire cluster
    reqsizer.jatalocks.github.io/mode: "average" # Default mode. Optimizes based on average. If omitted, mode is average
    reqsizer.jatalocks.github.io/mode: "max" # Sets the request to the MAXIMUM of all sample points
    reqsizer.jatalocks.github.io/mode: "min" # Sets the request to the MINIMUM of all sample points

@ElementTech
Owner

Hey @zohebk8s, can you send a screenshot of the logs now (a few minutes after the controller has started working)? It might take a few minutes for it to resize.

@ElementTech
Owner

Also, try adding the "optimize" annotation to the namespace this deployment is in

@zohebk8s
Author

I've added the annotation to the default namespace, where this deployment is running, but the values are still the same and haven't changed. The full controller logs are attached:
kube-reqsizer-controller-manager-795bbd7677-dl4xx-logs.txt

The utilization of the pods is completely normal, and I was expecting a change/optimization from kube-reqsizer. In the requests, I've specified the values below:
cpu: "100m"
memory: 400Mi

I've attached the full log file above for your reference.

(screenshots attached)

@ElementTech
Owner

@zohebk8s it appears to be working. If you gave it time overnight, did it eventually work? It can take a while with concurrentWorkers=1, but eventually it has enough data in the cache to make the decision.

@zohebk8s
Author

From the logs, it looks like it's working, but it's not resizing/optimizing the workload. I still see no changes in the CPU/memory requests for that deployment. It shouldn't normally take this long to take action.

(screenshot attached)

@zohebk8s
Author

@jatalocks As you can see, the cache sample count is 278. Do you think that's not enough data for decision-making? Is there a specific number of samples it collects before taking a decision?

(screenshot attached)

@ElementTech
Owner

That's odd, it should have worked immediately. I think something is preventing it from resizing. What are your values/configuration? You should make sure minSeconds=1 and sampleSize=1 as well.
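
For completeness, a values override covering all three settings might look like the sketch below (key names as they appear in this thread; the exact paths in the chart's values.yaml may differ):

# assumed top-level keys; check the chart's values.yaml for the real paths
concurrentWorkers: 1
minSeconds: 1
sampleSize: 1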

@ElementTech
Owner

The configuration should match what's at the top of the README (except for concurrentWorkers=1).

@zohebk8s
Author

It's already "1" for concurrent-workers, minSeconds and sampleSize.

It's Azure AKS v1.25.6, and the default namespace is Istio-injected. I hope it's not something specific to Istio.

configuration:

spec:
  containers:
  - args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=:8080
    - --leader-elect
    - --annotation-filter=true
    - --sample-size=1
    - --min-seconds=1
    - --zap-log-level=info
    - --enable-increase=true
    - --enable-reduce=true
    - --max-cpu=0
    - --max-memory=0
    - --min-cpu=0
    - --min-memory=0
    - --min-cpu-increase-percentage=0
    - --min-memory-increase-percentage=0
    - --min-cpu-decrease-percentage=0
    - --min-memory-decrease-percentage=0
    - --cpu-factor=1
    - --memory-factor=1
    - --concurrent-workers=1
    - --enable-persistence=true
    - --redis-host=kube-reqsizer-redis-master

@ElementTech
Owner

What are the resource requirements for the deployments in the default namespace? The only thing I can think of is that it has nothing to resize, so it just keeps sampling the pods. If there are no requests/limits to begin with, there's nothing to resize from. I'd check that the pods are configured with resources.
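
For illustration, a container spec with explicit requests might look roughly like the sketch below (the request values mirror the ones mentioned in this thread; the container name, image and limits are placeholders):

# illustrative only: explicit requests/limits give kube-reqsizer something to adjust
spec:
  containers:
  - name: app          # placeholder name
    image: nginx       # placeholder image
    resources:
      requests:
        cpu: "100m"
        memory: 400Mi
      limits:          # placeholder limits, not taken from this thread
        cpu: "500m"
        memory: 512Mi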

@zohebk8s
Author

I've defined requests/limits for this deployment, and the utilization is much lower than the requests; that's why I raised this question/issue.

(screenshots attached)

If a deployment doesn't have requests/limits, then, as you said, it won't work. But in this case I have defined requests/limits, and the CPU/memory utilization is well below them.

@ElementTech
Owner

I see that reqsizer has been alive for 11 minutes. I'd give it some more time for now, and I'll check whether there's a specific problem with AKS.

@zohebk8s
Author

@jatalocks Thank you for your patience and responses. I feel this product can make a difference if it works properly, since resource optimization translates directly into cost optimization.

@zohebk8s
Author

zohebk8s commented Jun 2, 2023

@jatalocks Is this a bug, or is some kind of enhancement required at the product level?

I hope the information I've shared is helpful.

@ElementTech
Owner

@zohebk8s I think that by now, if the controller has been running continuously, the app should already have been resized.

@darkxeno

darkxeno commented Feb 9, 2024

@ElementTech I see that @zohebk8s seems to be using Argo CD in this cluster. Could it be that Argo CD is directly undoing all the changes made to the Deployment's resources?
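
If that's the case, one possible mitigation (a sketch only, assuming a standard Argo CD Application with automated self-heal enabled) is to tell Argo CD to ignore drift on container resources so that kube-reqsizer's changes are not reverted:

# Hypothetical Argo CD Application excerpt; the application name and paths are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app-deployment
spec:
  syncPolicy:
    automated:
      selfHeal: true                  # with self-heal on, Argo CD reverts in-cluster edits
    syncOptions:
    - RespectIgnoreDifferences=true   # keep ignored fields untouched during syncs as well
  ignoreDifferences:                  # stop treating request/limit changes as drift
  - group: apps
    kind: Deployment
    jqPathExpressions:
    - .spec.template.spec.containers[].resources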
