Controller CPU Utilization #3752

johnmwood · 2024-07-26T22:03:01Z

johnmwood
Jul 26, 2024

I'm evaluating Argo Rollouts for my organization. If we go through with the full migration, we would likely end up with around 2,000 Rollout objects in a single cluster. While running some scaling tests, I ran into CPU throttling on the Argo Rollouts controller.

When running `kubectl argo rollouts set image` on only 100-200 rollouts simultaneously, the controller gets CPU throttled (limit: 8 CPUs), and its CPU usage slowly goes down as rollouts promote and eventually complete. We have the infra to scale the CPU cores on the controller, but I would love to understand in more depth how the CPU is being utilized on the controller during a rollout.
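For anyone wanting to put a number on "CPU throttled": the kernel exposes throttling counters in the container's cgroup `cpu.stat` file (`nr_periods` and `nr_throttled`). Below is a minimal Go sketch that computes the fraction of throttled scheduling periods from that file's contents. The function name `throttledRatio` and the sample numbers are made up for illustration; this is not part of Argo Rollouts.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// throttledRatio parses the text of a cgroup cpu.stat file and returns
// nr_throttled / nr_periods, i.e. the share of CFS periods in which the
// container hit its CPU limit. (Hypothetical helper for illustration.)
func throttledRatio(cpuStat string) (float64, error) {
	vals := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(cpuStat))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or unexpected lines
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return 0, fmt.Errorf("parsing %q: %w", sc.Text(), err)
		}
		vals[fields[0]] = v
	}
	periods, ok := vals["nr_periods"]
	if !ok || periods == 0 {
		return 0, fmt.Errorf("nr_periods missing or zero")
	}
	return vals["nr_throttled"] / periods, nil
}

func main() {
	// Made-up sample values, not measurements from this thread.
	sample := "nr_periods 1000\nnr_throttled 250\nthrottled_usec 1234567\n"
	ratio, err := throttledRatio(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("throttled in %.1f%% of periods\n", ratio*100)
}
```

Sampling this ratio before and after a batch of rollouts gives a rough measure of how hard the limit is being hit, independent of dashboard graphs.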

1. How is the Argo Rollouts controller managing Prometheus queries? I'm assuming that the AnalysisRun is querying Prometheus, waiting on those results, and that Prometheus should be taking on the compute burden. In my testing, our Prometheus instance held up fine.
2. In a longer AnalysisRun of, say, 20+ minutes with larger pause times between steps, is the controller continually running queries against Prometheus?
3. The FAQ docs state: "The recommended way to use Argo Rollouts is for brief deployments that take 15-20 minutes or maximum 1-2 hours." Are long-running rollouts straining the controller in some form? Any details on why this is the recommendation would be helpful.

I appreciate any context here. Thank you!

Answered by kostis-codefresh


johnmwood · 2024-07-26T22:42:46Z

johnmwood
Jul 26, 2024
Author

I will add that the rollouts I'm running use extremely simple Prometheus queries that check whether memory is maxed out. I also checked the controller logs and see no sign that rollouts are failing due to CPU throttling; they all seem to progress through to completion.

Here is an example of the spikes I'm seeing when running only 100 simultaneous rollouts.


kostis-codefresh · 2024-07-29T18:20:55Z

kostis-codefresh
Jul 29, 2024
Collaborator

Hello

The first 2 questions can only be answered by looking at the source code.

For the third one I added some clarifications here #3529
I am the author of that recommendation and it has nothing to do with resource constraints.


kostis-codefresh · 2024-07-29T18:50:13Z

kostis-codefresh
Jul 29, 2024
Collaborator

I just explained here that the recommendation for a short release duration wasn't about resources #3753


kostis-codefresh · 2024-07-30T15:50:57Z

kostis-codefresh
Jul 30, 2024
Collaborator

The prometheus code is here https://github.com/argoproj/argo-rollouts/blob/master/metricproviders/prometheus/prometheus.go

But frankly I think running Argo Rollouts with a profiler might be a better idea, as the bottleneck might be somewhere else and not in metrics.

Out of curiosity, do you really need 2000 Rollout objects in a single cluster? Are these 2000 unique applications that need progressive delivery and your developers create new versions all the time? Are they in the same namespace or different namespaces? How many rollouts are actually under deployment at any given time?


johnmwood · 2024-07-30T16:25:17Z

johnmwood
Jul 30, 2024
Author

Thank you for all the clarifications on question 3.

> Do you really need 2000 Rollout objects in a single cluster?

Great question. Our two largest clusters currently run about 1,200 deployments at any given time, with more teams expected to migrate their workloads. We are working through other scaling issues at this time that will involve using multiple smaller clusters; however, in the short term we need more teams using our current clusters.

> Are these 2000 unique applications that need progressive delivery and your developers create new versions all the time? Are they in the same namespace or different namespaces? How many rollouts are actually under deployment at any given time?

We run large multi-tenant clusters with unique applications per namespace. All those applications need to migrate to some form of progressive delivery. Application teams, for the most part, aren't going to be constantly releasing new versions. We don't have hard numbers; however, I would expect around 50 rollouts to be in progress at any given time.

My goal in running the larger mass rollouts was to load test whether the controller and our Prometheus instances can withstand mass rollouts in any form, and what impact that has on application teams.


kostis-codefresh · 2024-07-30T16:58:53Z

kostis-codefresh
Jul 30, 2024
Collaborator

I added a future enhancement here: #3757. No solid ETA at the moment, though.


zachaller · 2024-07-30T17:21:27Z

zachaller
Jul 30, 2024
Maintainer

@johnmwood also feel free to reach out to me in CNCF slack if you are interested in adding pprof support or anything like that, my username: @zachaller

5 replies

johnmwood Jul 30, 2024
Author

Thank you. I appreciate that. In the meantime, I built a local controller image with a patch for the pprof server. I've managed to generate some traces but can't find any obvious issues. Ironically, when I push our controller into CPU throttling, the connection to the debug endpoint gets killed and I lose the file.

zachaller Jul 31, 2024
Maintainer

Can you remove the limit and just get a pprof under load with no limit (not sure what your node sizes are)? The other option is to try not to generate so much load. Having the pprof would be super telling.

As a side note, there is also an instance-id CLI flag and label you can use to manually "shard" your rollouts if you don't like scaling vertically so much.
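To illustrate the manual sharding idea, here is a minimal Go sketch that deterministically assigns each namespace to one of N shard names; each Rollout would then be labeled with its shard, and one controller would be run per shard with the matching instance ID. The helper `shardFor`, the `shard-N` naming, and the example namespaces are hypothetical. The label key shown is an assumption and should be checked against the Argo Rollouts documentation before use.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// instanceIDLabel is ASSUMED to be the label key that a controller started
// with an instance ID filters on; verify it against the Argo Rollouts docs.
const instanceIDLabel = "argo-rollouts.argoproj.io/controller-instance-id"

// shardFor maps a namespace to one of n shard names using an FNV-1a hash,
// so the same namespace always lands on the same shard.
// (Hypothetical helper for illustration.)
func shardFor(namespace string, n int) string {
	if n < 1 {
		n = 1 // guard against division by zero
	}
	h := fnv.New32a()
	h.Write([]byte(namespace))
	return fmt.Sprintf("shard-%d", h.Sum32()%uint32(n))
}

func main() {
	// Example namespaces are made up.
	for _, ns := range []string{"team-a", "team-b", "team-c"} {
		fmt.Printf("%s: %s=%s\n", ns, instanceIDLabel, shardFor(ns, 3))
	}
}
```

Hashing by namespace keeps all of a tenant's rollouts on one controller. Note that changing the shard count reassigns most namespaces, so a fixed, explicit mapping may be preferable in practice.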

zachaller Jul 31, 2024
Maintainer

Also, if you can wrap enabling the pprof endpoints behind a CLI flag (--enable-pprof), I would love a PR!

johnmwood Aug 5, 2024
Author

@zachaller I have a branch up for the profiling and will put up the PR shortly. I'm very open to feedback and adjustments. I also just submitted the USERS.md PR to add my org.

johnmwood Aug 5, 2024
Author

Here's the profiling PR. I added some comments with questions. Thank you! 👍🏻
