
Controller stops accepting jobs from the cluster queue #302

Open
aressem opened this issue Apr 8, 2024 · 5 comments

aressem commented Apr 8, 2024

We have agent-stack-k8s up and running and it works fine for a while. However, it suddenly stops accepting new jobs, and the last thing it outputs is (we turned on debug):

2024-04-08T11:38:23.100Z	DEBUG	limiter	scheduler/limiter.go:77	max-in-flight reached	{"in-flight": 25}

We currently only have a single pipeline, single cluster and single queue. When this happens there are no jobs or pods named buildkite-${UUID} in the k8s cluster. Executing kubectl -n buildkite rollout restart deployment agent-stack-k8s makes the controller happy again and it starts jobs from the queue.

I suspect that there is something that should decrement the in-flight number, but fails to do so. We are now running a test where this number is set to 0 to see if that works around the problem.

DrJosh9000 (Contributor) commented:

Hi @aressem, did you discover anything with your tests where the number is set to 0?

aressem (Author) commented Apr 23, 2024

@DrJosh9000 , the pipeline works as expected with in-flight set to 0. I don't know what that number might be now, but I suspect it is steadily increasing :)

artem-zinnatullin (Contributor) commented:

Same issue when testing with max-in-flight: 1 on v0.11.0: at some point the controller stops taking new jobs even though there are no jobs/pods running in the namespace besides the controller itself.

2024-05-21T21:31:57.923Z	DEBUG	limiter	scheduler/limiter.go:79	max-in-flight reached	{"in-flight": 1}

calvinbui commented:

I saw the same issue: num-in-flight does not decrease, so available-tokens eventually reaches 0 and no new jobs are run.

DrJosh9000 (Contributor) commented:

num-in-flight and available-tokens are now somewhat decoupled, so it would be useful to compare available-tokens against the number of job pods actually pending or running in the k8s cluster.

🤔 Maybe the controller should periodically survey the cluster, and adjust tokens accordingly.
