
Jobs that ran into OOM issues appear as still running in buildkite UI #182

Open · wallyqs opened this issue Jul 14, 2023 · 2 comments

wallyqs commented Jul 14, 2023

For example, this container-0 job ran into an OOM, so the container has already exited:

 - containerID: containerd://868661c9da807af9428729518d1c95a52c1bb5efac68df8799cd6b24b475125c
    image: docker.io/library/golang:1.20-alpine
    imageID: docker.io/library/golang@sha256:59fc0dc542a38bb5b94cd1529e5f4663b4e7cc2f4a6c352b826dafe00d820031
    lastState: {}
    name: container-0
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: containerd://868661c9da807af9428729518d1c95a52c1bb5efac68df8799cd6b24b475125c
        exitCode: 137
        finishedAt: "2023-07-14T10:40:18Z"
        reason: OOMKilled
        startedAt: "2023-07-14T10:31:49Z"

But in the Buildkite UI it still appears as running:

[Screenshot: Buildkite UI showing the job as still running]

Maybe the controller needs a way to detect OOM events in the jobs so it can clean them up?
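Not part of the original report, just to illustrate the idea: a minimal sketch of what that detection could look like with client-go, watching pod container statuses for a terminated state with reason OOMKilled. The "buildkite" namespace is only a placeholder, and this is not the controller's actual implementation.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Watch pods in the namespace the controller schedules jobs into
	// ("buildkite" is just an example).
	w, err := clientset.CoreV1().Pods("buildkite").Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, cs := range pod.Status.ContainerStatuses {
			t := cs.State.Terminated
			if t != nil && t.Reason == "OOMKilled" {
				fmt.Printf("pod %s container %s was OOM killed (exit code %d)\n",
					pod.Name, cs.Name, t.ExitCode)
				// Here the controller could cancel the Buildkite job
				// and clean up the Kubernetes Job/pod.
			}
		}
	}
}
```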

triarius (Contributor) commented

Thanks for raising this @wallyqs. We have a plan for how to proceed. It involves detecting OOM killed containers from the controller and cancelling them on Buildkite. We'll let you know when this is implemented. Let us know if there are more things to clean up for OOM killed jobs that we should catch as well.
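For the "cancelling them on Buildkite" half, a hedged sketch only: given the job's GraphQL ID, the controller could call Buildkite's GraphQL endpoint. The cancelJob helper is hypothetical, and the jobTypeCommandCancel mutation name is an assumption to verify against the current Buildkite GraphQL schema.

```go
package buildkite

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// cancelJob is a hypothetical helper, not part of agent-stack-k8s.
// It cancels a Buildkite job by its GraphQL ID using an API access token.
func cancelJob(token, jobGraphQLID string) error {
	query := `mutation($id: ID!) { jobTypeCommandCancel(input: { id: $id }) { clientMutationId } }`
	body, err := json.Marshal(map[string]any{
		"query":     query,
		"variables": map[string]string{"id": jobGraphQLID},
	})
	if err != nil {
		return err
	}

	req, err := http.NewRequest(http.MethodPost, "https://graphql.buildkite.com/v1", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("buildkite graphql returned %s", resp.Status)
	}
	return nil
}
```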

calvinbui commented, quoting triarius's reply above:

Probably also clean up the pod and Job in the cluster.
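A minimal sketch of that cleanup with client-go, assuming the controller already knows the Job's name and namespace. deleteJob is a hypothetical helper; background propagation is used so the pod is garbage-collected along with the Job.

```go
package cleanup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteJob removes the Kubernetes Job for an OOM-killed Buildkite job.
// Background propagation ensures the Job's pod is deleted as well.
func deleteJob(ctx context.Context, clientset kubernetes.Interface, namespace, jobName string) error {
	propagation := metav1.DeletePropagationBackground
	return clientset.BatchV1().Jobs(namespace).Delete(ctx, jobName, metav1.DeleteOptions{
		PropagationPolicy: &propagation,
	})
}
```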
