worker.py: Avoid letting a job get stuck if exception occurs #2

Open: 2 commits to be merged into master
JonathonHall-Purism

Some instances of stuck jobs were observed recently for PureOS. From the logs, I think a Python exception may have occurred after the build completed but before the artifacts were uploaded. I can't tell what might have caused that exception, if it did occur.

This change would ensure that a 'rejected' status is sent if this occurs, rather than leaving the job stuck with no result.
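As a rough illustration of the idea (not the actual patch to spark/worker.py), the sketch below wraps the build and upload steps in one `try` block so that a final status is always reported. The function name `run_job`, the callables `build`, `upload_artifacts` and `send_status`, and the 'accepted' status string are all hypothetical; only the 'rejected' status comes from the description above.

```python
import logging

log = logging.getLogger("worker_sketch")  # hypothetical logger name


def run_job(job_id, build, upload_artifacts, send_status):
    """Run a job and always report a final status (illustrative only).

    build, upload_artifacts and send_status are hypothetical callables
    standing in for whatever the real worker does at each step.
    """
    try:
        success = build(job_id)
        # An exception raised here used to leave the job with no result.
        upload_artifacts(job_id)
    except Exception:
        log.error("Job %s failed with an unexpected exception", job_id)
        # Report a result instead of leaving the job stuck.
        send_status(job_id, "rejected")
        return False
    # 'accepted' is an assumed name for the success status.
    send_status(job_id, "accepted" if success else "rejected")
    return success
```

With this shape, an exception raised between the end of the build and the upload still results in exactly one status message for the job.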

I applied this change to fennel.pureos.net (the new worker) to try to identify what was causing the stuck jobs, but the problem never recurred. No jobs got stuck after that and the new exception-handling code was never hit, so I can't identify the root cause.

JonathonHall-Purism (author)

I have not properly tested this change - I don't have a test environment set up, and this never happened again after I applied the change to fennel.pureos.net to see what was causing these jobs to get stuck this way.

The jobs were very old (there had been no amd64 worker for months), so it is possible that this would not happen if the job queue is being processed regularly.

So with that in mind, although it seems like a good robustness improvement to avoid leaving a job stuck if this happens, I understand if you'd rather reject this and see if it ever happens again or can be reproduced.

Signed-off-by: Jonathon Hall <[email protected]>
jlog was accidentally changed to log in _run_job(), revert it.

There's no job log in _request_job(), log traceback to the regular log.

Fix a typo in a comment.

Signed-off-by: Jonathon Hall <[email protected]>
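
The second commit's distinction between the job log and the regular log could look roughly like the sketch below. It is only an assumption about the shape of the code: `report_exception` is a hypothetical helper, `jlog` is assumed here to be a file-like object with a `write()` method, and `log` is assumed to be a standard `logging.Logger`. The real `jlog` in spark/worker.py may have a different interface.

```python
import io
import logging
import traceback


def report_exception(message, log, jlog=None):
    """Write message plus the current traceback to jlog, else to log.

    Must be called from inside an `except` block so that
    traceback.format_exc() has an active exception to format.
    """
    text = "%s\n%s" % (message, traceback.format_exc())
    if jlog is not None:
        # A job is running (as in _run_job()): use its own log.
        jlog.write(text)
    else:
        # No job log exists yet (as in _request_job()): use the regular log.
        log.error(text)
```

Called without `jlog`, the traceback ends up in the regular log, which is the behaviour the commit message describes for `_request_job()`.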