Seeming race condition with job status #610
Yes, I have seen this issue. The race condition is in the job status handling in job_event.erl: if the job has finished successfully, we can still respond that the job is dead.
We see this error quite often as well. Is there any update on when this will be fixed? I would like to look at it myself, but I'm not good with Erlang and not yet very familiar with the Erlang part of the Disco codebase. Am I correct to say that the job coordinator sends a …
Locally, to avoid most of the issues with this (at least to keep the client job code from thinking the job died), I changed disco/core.py, replacing the check_results function with a version that re-checks a transient 'dead' status instead of failing immediately (a sketch of that kind of change is below).
I realize this is really suboptimal and hacky, but I was running into issues daily on our busy cluster and now very rarely see the issue, so it's good enough for me until the actual backend gets fixed.
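For reference, a minimal sketch of that kind of client-side workaround is below. It assumes Disco 0.5.x's Disco.results(jobname) returns a (status, results) pair; the helper name and the RuntimeError are illustrative stand-ins, not the actual patch that was applied.

```python
import time

from disco.core import Disco


def results_tolerating_transient_dead(master_url, jobname,
                                      poll_interval=2, retry_delay=1):
    """Poll a job's status, but re-check once before trusting 'dead'.

    A job that has just finished can briefly be reported as 'dead' before
    the master flips it to 'ready', so a single 'dead' reading is retried.
    (No timeout handling, for brevity.)
    """
    disco = Disco(master_url)
    while True:
        status, results = disco.results(jobname)
        if status == 'ready':
            return results
        if status != 'active':
            # Possible race: wait a moment and look again before failing.
            time.sleep(retry_delay)
            status, results = disco.results(jobname)
            if status == 'ready':
                return results
            if status != 'active':
                # The real check_results raises disco.error.JobError here.
                raise RuntimeError("job %s failed with status %r"
                                   % (jobname, status))
        time.sleep(poll_interval)
```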
We've scaled up our Disco usage a fair amount and I've been seeing a few strange errors when the cluster is under heavy load. It looks like jobs semi-randomly have a JobError exception thrown, and the more load on the cluster, the more often it seems to happen.
It appears that this is happening inside job.wait(). When tracking down what was happening, I added some extra code to check_results in disco/core.py and found that if the poll of the job status happens right around the instant the job finishes all of its tasks, it sometimes receives status = 'dead'. To remedy this I added a retry: check the status one more time after 1 second before actually throwing the exception. On retry, the status consistently comes back as 'ready', so a completing job is sometimes being labelled 'dead' before it reaches the 'ready' state. Doing this retry allowed us to hack around the issue.
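To illustrate the retry described above from the caller's side (without patching disco/core.py), here is a hedged sketch. It assumes job.wait() raises disco.error.JobError on failure and that Disco.results(jobname) returns a (status, results) pair; wait_with_retry is a hypothetical helper, not part of Disco.

```python
import time

from disco.core import Disco
from disco.error import JobError


def wait_with_retry(job, master_url, retry_delay=1):
    """Call job.wait(), but double-check a reported failure once.

    Under heavy load a finishing job is sometimes reported as failed for an
    instant before it turns 'ready'; re-checking after a short pause avoids
    treating that transient state as a real failure.
    """
    try:
        return job.wait()
    except JobError:
        time.sleep(retry_delay)
        status, results = Disco(master_url).results(job.name)
        if status == 'ready':
            return results
        raise
```

This keeps the workaround in the job-submission script rather than in the Disco installation itself.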
Additionally (not sure if this is related), when the cluster is under very heavy load I can see the same thing happen with the status indicator for jobs: as they complete, they first turn red (failure), then green (ready).
One note: we are still on 0.5.3, so please let me know if this is a known issue that has been fixed in a later version.