Seeming race condition with job status #610

Open
aegray opened this issue Feb 3, 2015 · 3 comments

aegray commented Feb 3, 2015

We've scaled up our disco usage a fair amount, and I've been seeing a few strange errors when the cluster is under heavy load. Jobs semi-randomly have a JobError exception thrown; the more load on the cluster, the more often it happens.

It appears to be happening inside job.wait(). While tracking it down, I added some extra code to check_results in disco/core.py and found that if the poll of the job status happens right around the instant the job finishes all its tasks, it sometimes receives status = 'dead'. To remedy this, I added some code to retry once more and check the status again after 1 second before actually throwing the exception, and on the retry the status consistently comes back as 'ready'. So basically, a completing job is sometimes labelled 'dead' before it reaches the 'ready' state. This retry let us hack around the issue.

Additionally (not sure if this is related), when the cluster is under very heavy load I can see the same thing happen with the status indicator for jobs: as they complete, they first turn red (failure), then green (ready).

One note: we are still on 0.5.3, so please let me know if this is known and has been fixed in a later version.

pooya (Member) commented Feb 8, 2015

Yes, I have seen this issue. The race condition is in the following code in job_event.erl:

    %% Job status is decided solely by whether the process behind Pid is
    %% still alive, so once that process has exited after the job finished
    %% successfully, the job is reported as dead.
    case is_process_alive(Pid) of
        true  -> active;
        false -> dead
    end;

Even if the job has finished successfully, we still respond that the job is dead.

erikdubbelboer (Contributor) commented

We see this error quite often as well. Is there any update on when this will be fixed?

I would like to look at it myself, but I'm not good with Erlang and not yet very familiar with the Erlang part of the disco codebase.

Am I correct in saying that the job coordinator sends a ready message to do_update and then exits, and that before do_update actually processes the message, do_get_status gets called by Python and reports the job coordinator as dead?
If that is correct, I guess the fix is either to check again from Python after a delay (easy, I can fix that), or to have the job coordinator wait for a message back from do_update confirming it was received before exiting (somewhere in do_job_done_event maybe?).
Or am I completely wrong here?

aegray (Author) commented Jun 12, 2015

Locally, to avoid most of the issues with this (at least so the client job code doesn't think the job died), I changed disco/core.py to replace the check_results function with this:

    def check_results(self, jobname, start_time, timeout, poll_interval, recurse=False):
        try:
            status, results = self.results(jobname, timeout=poll_interval)
        except CommError:
            # Treat a transient communication error as the job still running.
            status = 'active'

        if status == 'ready':
            return results
        if status != 'active':
            if not recurse:
                # A job that has just finished can briefly be reported as
                # 'dead', so wait and check one more time before giving up.
                time.sleep(5)
                return self.check_results(jobname, start_time, timeout,
                                          poll_interval, recurse=True)
            raise JobError(Job(name=jobname, master=self), "Status {0}".format(status))
        if timeout and time.time() - start_time > timeout:
            raise JobError(Job(name=jobname, master=self), "Timeout")
        raise Continue()

I realize this is really suboptimal and hacky, but I was running into the issue daily on our busy cluster and now see it very rarely, so it's good enough for me until the actual backend gets fixed.
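
If you'd rather not edit the installed disco/core.py in place, a monkeypatch along these lines should also work. This is just an untested sketch: it assumes the patched check_results above is defined in your own module with time, CommError, JobError, Job, and Continue imported from your Disco version, and that check_results is a method on disco.core.Disco (which is where it appears to live in 0.5.x).

    # Untested sketch: swap in the patched check_results at runtime rather
    # than editing disco/core.py. Assumes the function above is defined in
    # this module with time, CommError, JobError, Job and Continue in scope.
    import disco.core

    disco.core.Disco.check_results = check_results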
