Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rush] unassigned operations can ignore weighting constraints #4821

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

aramissennyeydd
Copy link
Contributor

@aramissennyeydd aramissennyeydd commented Jun 28, 2024

Summary

We're dealing with some build cache inconsistencies across cobuild agents. We're seeing agents using the same lock, but unable to restore completed state as they don't have the same build cache ID. This bug is allowing 2 expensive operations to run side-by-side causing memory issues and timeouts related to memory pressure.

This PR uses the remote executing operation when reasonable and moves the unassigned status to just denote a sleep. It also improves the remote executing evaluation, since we no longer greedily get an event during execution time, we need to move that complexity into the scheduler. I added checkAfter and lastCheckedAt with the idea that someone could write a plugin to better set those values based on previous execution times, since waiting the default 5 seconds is wasteful.

Details

From what I can tell, the unassigned operation needed to have a weight matching the possible operation it will pick up as

record = this._executionQueue.tryGetRemoteExecutingOperation();
can start an operation that it reasons was not completed on another agent that locked it initially in the CacheableOperationPlugin,
const acquireSuccess: boolean = await cobuildLock.tryAcquireLockAsync();
if (acquireSuccess) {
const { periodicCallback } = buildCacheContext;
periodicCallback.addCallback(async () => {
await cobuildLock?.renewLockAsync();
});
periodicCallback.start();
} else {
// failed to acquire the lock, mark current operation to remote executing
return OperationStatus.RemoteExecuting;
}
. If the weight is not reflected onto the unassigned operation, multiple operations can be picked up and will skip the weight check. For example, say you have 3 operations with parallelism set to 8 and 2 machines:

  1. Operation A with weight 8
  2. Operation B with weight 8
  3. Operation C with weight 4

Start: Machine 1 picks up Operation A, Machine 2 picks up Operation B
Step 1: Machine 1 finishes but fails to mark complete Operation A, Machine 2 finishes Operation B
Step 2: Machine 2 picks up Operation A and Operation C

The finishes but fails to mark complete is possible with a build cache ID inconsistency (what we're seeing) or if a machine gets lost during execution and doesn't report its success state, so another machine picks up the operation.

Removing the ability to assign a remote executing operation to a unassigned operation allows the weight to be correctly determined based on the specific operation that is being picked up.

How it was tested

I tested this with the sharded repo and made sure that both cobuilds aren't deadlocking and that sharding is still working and working effectively.

\nteresting sidenote, this change may open up the ability to do more dynamic wait times, which could help improve overall run times for agents that spend a lot of time in the sleep loop. Adjusting the sleep from 5s -> 1s dropped times from 40s -> 28s.

Impacted documentation

None.

@chengcyber
Copy link
Contributor

chengcyber commented Jul 1, 2024

If I understand it correctly, the current challenge is to consistently reproduce the same schedule across the different cobuild processes. In the current lightweight implementation, there is no centralized scheduler to control which runner will be assigned to run a task.

To tackle with this, an initiative thought comes into my mind are:

  1. Introduce a centralized scheduler
    The centralized scheduler introduces a much more complicated and powerful architecture to distributed execution. The drawback is that it goes a little bit against the designated lightweight approach for cobuild.

  2. Each runner can record how the tasks get scheduled, and cobuilds can reproduce the scheduler based on a log-based approach.

The workflow for 2 could be:
a: Turn on generating schedule logs for every runner by specifying a CLI parameter or ENV. (can be a default behaviour)
Running RUSH_COBUILD_SCHEDULE_LOG=1 rush cobuild --to <package_name> writes a log file somewhere, let's say common/temp/cobuild.schedule.log

b: Reproduce the schedule log by specifying a CLI parameter or ENV.
Running RUSH_COBUILD_REPRODUCE_BY_SCHEDULE_LOG=1 rush cobuild -to <package_name>, Rush instead of scheduling by AsyncOperationQueue, it invokes a similar OperationQueue instance (maybe ReproducibleAsyncOperationQueue) to accurate assign the task in a predefined order based on the common/temp/cobuild.schedule.log says.


Back to this PR, weight property is actually in the same direction for Option 2. But it has some drawbacks I can tell:

  1. Manually managing on weight setup is sort of loose and error-prune. A better approach is generated the "weight"(the generic idea I mean) from the actual run, such as schedule.log I proposed here.
  2. It's not easy to use. For a cobuild runs a lot of tasks, weight setup comes trivial to use.

@aramissennyeydd
Copy link
Contributor Author

@chengcyber I posted about this on Zulip, I think my path forward is to get rid of the unassigned status and move the sleep toCacheableOperationPlugin, that way we no longer need to interact with the scheduling loop at all or assign weights.

I don't think we need a central scheduler at this point, we're just experiencing some growing pains with cobuilds and trying to lessen the pain. I'm working on a plugin at the moment to detect cache entry drift which is the other issue we're seeing.

@aramissennyeydd aramissennyeydd force-pushed the sennyeya/unassigned-operation-weight branch 2 times, most recently from e764ceb to de396f2 Compare July 2, 2024 19:56
@aramissennyeydd aramissennyeydd force-pushed the sennyeya/unassigned-operation-weight branch from de396f2 to 5c22575 Compare July 3, 2024 16:30
@aramissennyeydd aramissennyeydd marked this pull request as ready for review July 8, 2024 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Needs triage
Development

Successfully merging this pull request may close these issues.

None yet

2 participants