[rush] unassigned operations can ignore weighting constraints #4821

aramissennyeydd · 2024-06-28T16:20:40Z

Summary

We're dealing with some build cache inconsistencies across cobuild agents. We're seeing agents using the same lock, but unable to restore completed state as they don't have the same build cache ID. This bug is allowing 2 expensive operations to run side-by-side causing memory issues and timeouts related to memory pressure.

This PR uses the remote executing operation when reasonable and moves the unassigned status to just denote a sleep. It also improves the remote executing evaluation, since we no longer greedily get an event during execution time, we need to move that complexity into the scheduler. I added checkAfter and lastCheckedAt with the idea that someone could write a plugin to better set those values based on previous execution times, since waiting the default 5 seconds is wasteful.

Details

From what I can tell, the unassigned operation needed to have a weight matching the possible operation it will pick up as

rushstack/libraries/rush-lib/src/logic/operations/OperationExecutionManager.ts

Line 265 in c1effc3

record = this._executionQueue.tryGetRemoteExecutingOperation();

can start an operation that it reasons was not completed on another agent that locked it initially in the CacheableOperationPlugin,

rushstack/libraries/rush-lib/src/logic/operations/CacheableOperationPlugin.ts

Lines 388 to 398 in c1effc3

    
           const acquireSuccess: boolean = await cobuildLock.tryAcquireLockAsync(); 
        
           if (acquireSuccess) { 
        
             const { periodicCallback } = buildCacheContext; 
        
             periodicCallback.addCallback(async () => { 
        
               await cobuildLock?.renewLockAsync(); 
        
             }); 
        
             periodicCallback.start(); 
        
           } else { 
        
             // failed to acquire the lock, mark current operation to remote executing 
        
             return OperationStatus.RemoteExecuting; 
        
           }

. If the weight is not reflected onto the unassigned operation, multiple operations can be picked up and will skip the weight check. For example, say you have 3 operations with parallelism set to 8 and 2 machines:

Operation A with weight 8
Operation B with weight 8
Operation C with weight 4

Start: Machine 1 picks up Operation A, Machine 2 picks up Operation B
Step 1: Machine 1 finishes but fails to mark complete Operation A, Machine 2 finishes Operation B
Step 2: Machine 2 picks up Operation A and Operation C

The finishes but fails to mark complete is possible with a build cache ID inconsistency (what we're seeing) or if a machine gets lost during execution and doesn't report its success state, so another machine picks up the operation.

Removing the ability to assign a remote executing operation to a unassigned operation allows the weight to be correctly determined based on the specific operation that is being picked up.

How it was tested

I tested this with the sharded repo and made sure that both cobuilds aren't deadlocking and that sharding is still working and working effectively.

\nteresting sidenote, this change may open up the ability to do more dynamic wait times, which could help improve overall run times for agents that spend a lot of time in the sleep loop. Adjusting the sleep from 5s -> 1s dropped times from 40s -> 28s.

Impacted documentation

None.

chengcyber · 2024-07-01T20:41:58Z

If I understand it correctly, the current challenge is to consistently reproduce the same schedule across the different cobuild processes. In the current lightweight implementation, there is no centralized scheduler to control which runner will be assigned to run a task.

To tackle with this, an initiative thought comes into my mind are:

Introduce a centralized scheduler
The centralized scheduler introduces a much more complicated and powerful architecture to distributed execution. The drawback is that it goes a little bit against the designated lightweight approach for cobuild.
Each runner can record how the tasks get scheduled, and cobuilds can reproduce the scheduler based on a log-based approach.

The workflow for 2 could be:
a: Turn on generating schedule logs for every runner by specifying a CLI parameter or ENV. (can be a default behaviour)
Running RUSH_COBUILD_SCHEDULE_LOG=1 rush cobuild --to <package_name> writes a log file somewhere, let's say common/temp/cobuild.schedule.log

b: Reproduce the schedule log by specifying a CLI parameter or ENV.
Running RUSH_COBUILD_REPRODUCE_BY_SCHEDULE_LOG=1 rush cobuild -to <package_name>, Rush instead of scheduling by AsyncOperationQueue, it invokes a similar OperationQueue instance (maybe ReproducibleAsyncOperationQueue) to accurate assign the task in a predefined order based on the common/temp/cobuild.schedule.log says.

Back to this PR, weight property is actually in the same direction for Option 2. But it has some drawbacks I can tell:

Manually managing on weight setup is sort of loose and error-prune. A better approach is generated the "weight"（the generic idea I mean） from the actual run, such as schedule.log I proposed here.
It's not easy to use. For a cobuild runs a lot of tasks, weight setup comes trivial to use.

aramissennyeydd · 2024-07-01T23:25:19Z

@chengcyber I posted about this on Zulip, I think my path forward is to get rid of the unassigned status and move the sleep toCacheableOperationPlugin, that way we no longer need to interact with the scheduling loop at all or assign weights.

I don't think we need a central scheduler at this point, we're just experiencing some growing pains with cobuilds and trying to lessen the pain. I'm working on a plugin at the moment to detect cache entry drift which is the other issue we're seeing.

aramissennyeydd force-pushed the sennyeya/unassigned-operation-weight branch 2 times, most recently from e764ceb to de396f2 Compare July 2, 2024 19:56

feat(rush-lib): adjust scheduling for remote executing operations.

5c22575

aramissennyeydd force-pushed the sennyeya/unassigned-operation-weight branch from de396f2 to 5c22575 Compare July 3, 2024 16:30

aramissennyeydd added 3 commits July 3, 2024 12:46

allow customization of polling interval

a9fe560

adjusting logic to sit at the operationexecutionrecord

50dd0c2

fix test cases

d0fb2e6

aramissennyeydd marked this pull request as ready for review July 8, 2024 15:45

aramissennyeydd requested review from iclanton, octogonz, apostolisms, D4N14L and dmichon-msft as code owners July 8, 2024 15:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[rush] unassigned operations can ignore weighting constraints #4821

[rush] unassigned operations can ignore weighting constraints #4821

aramissennyeydd commented Jun 28, 2024 •

edited

Loading

chengcyber commented Jul 1, 2024 •

edited

Loading

aramissennyeydd commented Jul 1, 2024

	const acquireSuccess: boolean = await cobuildLock.tryAcquireLockAsync();
	if (acquireSuccess) {
	const { periodicCallback } = buildCacheContext;
	periodicCallback.addCallback(async () => {
	await cobuildLock?.renewLockAsync();
	});
	periodicCallback.start();
	} else {
	// failed to acquire the lock, mark current operation to remote executing
	return OperationStatus.RemoteExecuting;
	}

[rush] unassigned operations can ignore weighting constraints #4821

Are you sure you want to change the base?

[rush] unassigned operations can ignore weighting constraints #4821

Conversation

aramissennyeydd commented Jun 28, 2024 • edited Loading

Summary

Details

How it was tested

Impacted documentation

chengcyber commented Jul 1, 2024 • edited Loading

aramissennyeydd commented Jul 1, 2024

aramissennyeydd commented Jun 28, 2024 •

edited

Loading

chengcyber commented Jul 1, 2024 •

edited

Loading