Pull timeout #2418
base: main
Conversation
dbg("eish, i don't love this. what if there was some other reason for them getting lost?") | ||
dbg("like, what if they did start, and did the work, and all that, but we never heard back from them because of network issues?") | ||
dbg("i wouldn't want to re-do the work.") | ||
dbg("i wish there was some way to only mark them as claimed once we know that the worker actually got the run.") |
@taylordowns2000 If I may chip in (uninvited):
As you mentioned yesterday, the janitor approach is a band-aid until a better fix is available. If I were applying the band-aid, I would not tie it to the timeout value - I would make it a large multiple of the timeout. E.g. if your timeout is 20 sec, maybe wait 10 min (after all, you wait two South African hours to determine if something is lost, right?). And notify Sentry whenever this happens, so that we have an opportunity to look at what is going on and tune things. Regarding the concern about potentially redoing work, you could make this user-controllable (i.e. if a Run is 'LostAfterClaimed', would you like us to automagically resubmit it, or would you rather do that manually?).
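To make the band-aid concrete, something along these lines is what I have in mind - a minimal sketch only, assuming an Ecto `Run` schema with `state` and `claimed_at` fields, a `Lightning.Repo`, and the sentry-elixir client; module, field, and constant names here are hypothetical, not the actual Lightning schema:

```elixir
defmodule Lightning.LostRunJanitor do
  @moduledoc false
  import Ecto.Query

  # Band-aid sweep: a run is only declared lost once it has sat in :claimed
  # for a large multiple of the claim timeout, and every occurrence is sent
  # to Sentry so the timeout/multiplier can be tuned. Names are illustrative.
  @claim_timeout_seconds 20
  @lost_multiplier 30

  def sweep(repo \\ Lightning.Repo) do
    cutoff =
      DateTime.add(DateTime.utc_now(), -@claim_timeout_seconds * @lost_multiplier, :second)

    from(r in Lightning.Run, where: r.state == ^:claimed and r.claimed_at < ^cutoff)
    |> repo.all()
    |> Enum.each(fn run ->
      # surface every occurrence so we can see how often this actually happens
      Sentry.capture_message("Run lost after claim", extra: %{run_id: run.id})

      run
      |> Ecto.Changeset.change(state: :lost)
      |> repo.update!()
    end)
  end
end
```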
A longer-term fix (which I can only assume will mean a fair amount of work) would be to allow a run to be claimed by multiple worker processes but only started by one (whoever gets there first); a rough sketch follows below. This obviously creates inefficiency elsewhere. Personally, I would do the band-aid if it helps us satisfy the immediate requirements and then identify the underlying issue - e.g. maybe it is as simple as scaling up web pods alongside worker pods?
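And a rough sketch of the "claimed by many, started by one" idea, hedged the same way (names are again hypothetical; the point is that the atomic `update_all` guarded by `state == :claimed` lets only the first worker's transition succeed):

```elixir
defmodule Lightning.StartOnce do
  @moduledoc false
  import Ecto.Query

  # The first worker to flip the row from :claimed to :started wins;
  # every later caller matches zero rows and gets false back.
  def try_start(repo, run_id, worker_id) do
    {count, _} =
      repo.update_all(
        from(r in Lightning.Run, where: r.id == ^run_id and r.state == ^:claimed),
        set: [
          state: :started,
          started_by: worker_id,
          started_at: DateTime.utc_now() |> DateTime.truncate(:second)
        ]
      )

    # true only for whichever worker got there first
    count == 1
  end
end
```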
@rorymckinley, @stuartc, do you know if this code is still needed? Would love to shut down this PR if it's not helpful (or has been implemented elsewhere in response to #2465 or #2416).
@taylordowns2000 I have not implemented anything that addresses this - did changing the timeout values make the original problem go away?
This is a concept. @stuartc, we're trying to:
lost
(@josephjclark is on it.)