Seeming race condition with job status #610

Open
aegray opened this issue Feb 3, 2015 · 3 comments

aegray commented Feb 3, 2015

We've scaled up our disco usage a fair amount, and I've been seeing a few strange errors when the cluster is under heavy load. Jobs semi-randomly have a JobError exception thrown; the more load on the cluster, the more often it happens.

It appears to be happening inside job.wait(). While tracking it down, I added some extra code to check_results in disco/core.py and found that if the poll of the job status happens right around the instant the job finishes all its tasks, it sometimes receives status = 'dead'. To remedy this, I added some code to retry once more and check the status again after 1 second before actually throwing the exception, and on the retry the status consistently comes back as 'ready'. So basically, a completing job is sometimes labelled 'dead' before it reaches the 'ready' state. This retry let us hack around the issue.

Additionally (not sure if this is related), when the cluster is under very heavy load I can see the same thing happen with the status indicator for jobs: as they complete, they first turn red (failure), then green (ready).

One note: we are still on 0.5.3, so please let me know if this is known and has been fixed in a later version.

pooya (Member) commented Feb 8, 2015

Yes, I have seen this issue. The race condition is in the following code in job_event.erl:

    %% Job status is decided solely by whether the process behind Pid is
    %% still alive, so once that process has exited after the job finished
    %% successfully, the job is reported as dead.
    case is_process_alive(Pid) of
        true  -> active;
        false -> dead
    end;

Even if the job has finished successfully, we still respond that the job is dead.

erikdubbelboer (Contributor) commented

We see this error quite often as well. Is there any update on when this will be fixed?

I would like to look at it myself, but I'm not good with Erlang and not yet very familiar with the Erlang part of the disco codebase.

Am I correct in saying that the job coordinator sends a ready message to do_update and then exits, and that before do_update actually processes the message, do_get_status gets called by Python and reports the job coordinator as dead?
If that is correct, I guess the fix is either to check again from Python after a delay (easy, I can fix that), or to have the job coordinator wait for a message back from do_update confirming it was received before exiting (somewhere in do_job_done_event maybe?).
Or am I completely wrong here?

aegray (Author) commented Jun 12, 2015

Locally, to avoid most of the issues with this (at least so the client job code doesn't think the job died), I changed disco/core.py to replace the check_results function with this:

    def check_results(self, jobname, start_time, timeout, poll_interval, recurse=False):
        try:
            status, results = self.results(jobname, timeout=poll_interval)
        except CommError:
            # Treat a transient communication error as the job still running.
            status = 'active'

        if status == 'ready':
            return results
        if status != 'active':
            if not recurse:
                # A job that has just finished can briefly be reported as
                # 'dead', so wait and check one more time before giving up.
                time.sleep(5)
                return self.check_results(jobname, start_time, timeout,
                                          poll_interval, recurse=True)
            raise JobError(Job(name=jobname, master=self), "Status {0}".format(status))
        if timeout and time.time() - start_time > timeout:
            raise JobError(Job(name=jobname, master=self), "Timeout")
        raise Continue()

I realize this is really suboptimal and hacky, but I was running into the issue daily on our busy cluster and now see it very rarely, so it's good enough for me until the actual backend gets fixed.
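
If you'd rather not edit the installed disco/core.py in place, a monkeypatch along these lines should also work. This is just an untested sketch: it assumes the patched check_results above is defined in your own module with time, CommError, JobError, Job, and Continue imported from your Disco version, and that check_results is a method on disco.core.Disco (which is where it appears to live in 0.5.x).

    # Untested sketch: swap in the patched check_results at runtime rather
    # than editing disco/core.py. Assumes the function above is defined in
    # this module with time, CommError, JobError, Job and Continue in scope.
    import disco.core

    disco.core.Disco.check_results = check_results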
