Continuation of #2003
Expected Behavior
Scatter outputs should be collected only once for each scatter job.
Actual Behavior
The patch provided by @GlassOfWhiskey in PR #2051 has done well to ensure the expected behavior, but as they note, it doesn't address the root cause. Under certain conditions, ReceiveScatterOutput.receive_scatter_output() may still be called twice within a narrow window of time for the same scatter job output.

The root cause lies within WorkflowJob and the conditions it uses to determine if .do_output_callback() should be called. This method is intended to be called from either WorkflowJob.receive_output() OR WorkflowJob.job(), but during the race condition this method is called from both.

WorkflowJob.job() runs in one thread while WorkflowJob.receive_output() is the callback bound to a work unit being executed by one of the TaskQueue workers, i.e. it executes in a separate scatter job thread. The race condition is that the .receive_output() thread calls do_output_callback() (which later sets self.did_callback = True) and, while it is in the body of that method, the .job() thread queries the value of did_callback, which is still False, so it also calls do_output_callback().

The relevant shared state is:
WorkflowJob.did_callback
WorkflowJob.steps where completed==True

I wanted to be sure that both methods were executing the same callback, which can be a little tricky with all of the nested functools partials that obscure the object to which each level is bound. If you unwind the callback chain you'll see they are the same in both branches:
Workflow Code
See #2003
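
For anyone who wants to see the check-then-act race from Actual Behavior in isolation, here is a minimal sketch. It is not cwltool code: ToyWorkflowJob, maybe_callback_unguarded, maybe_callback_guarded, and run are hypothetical names, and the Barrier exists only to force the unlucky interleaving that happens by chance in the real workflow. The guarded variant shows one possible way to make the check and the flag update atomic. It is an assumption about the shape of a fix, not a proposal for the exact change.

```python
import threading


class ToyWorkflowJob:
    """Hypothetical stand-in for WorkflowJob; only models the did_callback flag."""

    def __init__(self) -> None:
        self.did_callback = False
        self.callback_count = 0
        self._lock = threading.Lock()        # guards check + callback in the guarded path
        self._count_lock = threading.Lock()  # keeps the counter itself consistent
        # Test-only device: holds both threads after the check so that each
        # one has already seen did_callback == False before either proceeds.
        self._both_checked = threading.Barrier(2)

    def _do_output_callback(self) -> None:
        with self._count_lock:
            self.callback_count += 1
        # As in the issue, the flag is set late, after the callback work.
        self.did_callback = True

    def maybe_callback_unguarded(self) -> None:
        # Check-then-act without synchronization: both callers can pass the check.
        if not self.did_callback:
            self._both_checked.wait()
            self._do_output_callback()

    def maybe_callback_guarded(self) -> None:
        # Check and update happen under one lock, so only one caller can win.
        with self._lock:
            if not self.did_callback:
                self._do_output_callback()


def run(guarded: bool) -> int:
    """Call the chosen method from two threads and return the callback count."""
    job = ToyWorkflowJob()
    target = job.maybe_callback_guarded if guarded else job.maybe_callback_unguarded
    threads = [threading.Thread(target=target) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return job.callback_count


if __name__ == "__main__":
    print("unguarded callbacks:", run(guarded=False))
    print("guarded callbacks:", run(guarded=True))
```

In this toy the two threads play the roles of the .job() thread and the TaskQueue worker running .receive_output(). The unguarded path fires the callback from both, while the guarded path fires it once.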