[rush] unassigned operations can ignore weighting constraints #4821
If I understand correctly, the current challenge is to consistently reproduce the same schedule across the different cobuild processes. In the current lightweight implementation, there is no centralized scheduler to control which runner is assigned to run a task. To tackle this, a few initial thoughts come to mind:
The workflow for 2 could be: b: Reproduce the schedule log by specifying a CLI parameter or ENV. Back to this PR,
@chengcyber I posted about this on Zulip. I think my path forward is to get rid of the unassigned status and move the sleep to [...]. I don't think we need a central scheduler at this point; we're just experiencing some growing pains with cobuilds and trying to lessen the pain. I'm working on a plugin at the moment to detect cache entry drift, which is the other issue we're seeing.
Summary
We're dealing with some build cache inconsistencies across cobuild agents. We're seeing agents using the same lock, but unable to restore completed state as they don't have the same build cache ID. This bug is allowing 2 expensive operations to run side-by-side causing memory issues and timeouts related to memory pressure.
This PR uses the remote executing operation when reasonable and moves the unassigned status to just denote a sleep. It also improves the remote executing evaluation: since we no longer greedily get an event during execution time, we need to move that complexity into the scheduler. I added `checkAfter` and `lastCheckedAt` with the idea that someone could write a plugin to better set those values based on previous execution times, since waiting the default 5 seconds is wasteful.
Details
From what I can tell, the unassigned operation needed to have a weight matching the possible operation it will pick up as
rushstack/libraries/rush-lib/src/logic/operations/OperationExecutionManager.ts (line 265 in c1effc3)
rushstack/libraries/rush-lib/src/logic/operations/CacheableOperationPlugin.ts (lines 388 to 398 in c1effc3)
Start: Machine 1 picks up Operation A, Machine 2 picks up Operation B
Step 1: Machine 1 finishes but fails to mark complete Operation A, Machine 2 finishes Operation B
Step 2: Machine 2 picks up Operation A and Operation C
The "finishes but fails to mark complete" case is possible with a build cache ID inconsistency (what we're seeing), or if a machine gets lost during execution and doesn't report its success state, so another machine picks up the operation.
Removing the ability to assign a remote executing operation to an unassigned operation allows the weight to be correctly determined based on the specific operation that is being picked up.
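To make the weighting problem concrete, here is a minimal sketch of how a placeholder slot can slip past a weight budget. Everything in it is hypothetical: `Operation`, `MAX_WEIGHT`, `totalWeight`, and `canStart` are illustrative names and values, not rush-lib APIs, and the real scheduler logic is more involved.

```typescript
// Hypothetical model for illustration only; none of these names are rush-lib APIs.
interface Operation {
  name: string;
  weight: number;
}

// Assumed weight budget for a single agent.
const MAX_WEIGHT = 4;

// Sum of the weights of the operations an agent is currently running.
function totalWeight(running: Operation[]): number {
  return running.reduce((sum, op) => sum + op.weight, 0);
}

// Admission check that uses the weight of the operation being considered.
function canStart(running: Operation[], candidate: Operation): boolean {
  return totalWeight(running) + candidate.weight <= MAX_WEIGHT;
}

// Two expensive operations, as in the Machine 1 / Machine 2 scenario above.
const operationA: Operation = { name: "A", weight: 3 };
const operationC: Operation = { name: "C", weight: 3 };

// A placeholder admitted with a nominal weight, before the real operation is known.
const placeholder: Operation = { name: "unassigned", weight: 1 };

// Machine 2 is already running C. The placeholder fits in the budget (3 + 1 <= 4)...
const placeholderAdmitted = canStart([operationC], placeholder);

// ...but once it is bound to A, the real load is 3 + 3 = 6, over the budget.
const actualLoad = totalWeight([operationC, operationA]);

// Checking the specific operation up front rejects it instead.
const specificAdmitted = canStart([operationC], operationA);
```

In this sketch `placeholderAdmitted` is `true` while `specificAdmitted` is `false`, which mirrors the argument above: the weight can only be evaluated correctly once the specific operation is known.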
How it was tested
I tested this with the sharded repo and made sure that both cobuilds aren't deadlocking and that sharding is still working and working effectively.
Interesting sidenote: this change may open up the ability to do more dynamic wait times, which could help improve overall run times for agents that spend a lot of time in the sleep loop. Adjusting the sleep from 5s -> 1s dropped times from 40s -> 28s.
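As a rough illustration of what a dynamic wait could look like, the sketch below derives the next check time from a previous execution time. Only the names `checkAfter` and `lastCheckedAt` come from this PR; their types and semantics here (millisecond timestamps), along with `RemoteCheckState`, `estimateWaitMs`, `scheduleNextCheck`, `isDue`, and the 1 second floor, are assumptions made for the example.

```typescript
// Hypothetical shape; only the field names checkAfter and lastCheckedAt come from the PR.
interface RemoteCheckState {
  lastCheckedAt: number; // assumed: ms timestamp of the most recent check
  checkAfter: number; // assumed: ms timestamp before which no check is made
}

const DEFAULT_WAIT_MS = 5000; // the default 5 second sleep mentioned in the PR
const MIN_WAIT_MS = 1000; // assumed lower bound to avoid busy polling

// Pick a wait based on how long the operation took last time.
// With no history, fall back to the default wait.
function estimateWaitMs(previousDurationMs: number | undefined, elapsedMs: number): number {
  if (previousDurationMs === undefined) {
    return DEFAULT_WAIT_MS;
  }
  const expectedRemainingMs = previousDurationMs - elapsedMs;
  // Clamp to [MIN_WAIT_MS, DEFAULT_WAIT_MS] so checks get more frequent near the expected finish.
  return Math.min(DEFAULT_WAIT_MS, Math.max(MIN_WAIT_MS, expectedRemainingMs));
}

// Record the check and compute when the next one is allowed.
function scheduleNextCheck(
  now: number,
  startedAt: number,
  previousDurationMs: number | undefined
): RemoteCheckState {
  const waitMs = estimateWaitMs(previousDurationMs, now - startedAt);
  return { lastCheckedAt: now, checkAfter: now + waitMs };
}

// True once the scheduler may re-check the remote executing operation.
function isDue(state: RemoteCheckState, now: number): boolean {
  return now >= state.checkAfter;
}
```

For example, under these assumptions an operation that took 12 seconds last time and has been running remotely for 10 seconds would be re-checked after 2 seconds rather than 5.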
Impacted documentation
None.