[Refactor] Move worker notification in SimpleScheduler under Workers #1069
base: main
Conversation
In preparation for supporting a distributed/Redis scheduler, prepare the state interface to no longer take mutable references. This is a partial PR and should be landed immediately, with follow-up PRs that will remove much of the locking in SimpleScheduler. towards: TraceMachina#359
Worker logic should not be visible to StateManager just yet. In the future this will likely change, but for this phase of the refactor SimpleScheduler should own all information about workers. towards: TraceMachina#359
Moves the logic for when the matching-engine trigger runs under the Workers struct, where it is straightforward to do so. This separates the case where a task changes and the matching engine needs to run from the case where a task is assigned to a worker and the matching engine needs to run. towards: TraceMachina#359
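The split described in the commit messages above can be sketched roughly as follows. This is a hedged illustration, not the actual nativelink-scheduler code: the `Workers` struct here, its `matching_engine_wakeups` counter, and the `notify_matching_engine` method are hypothetical stand-ins for the real trigger mechanism, which in the PR is an async notification owned by the scheduler.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical stand-in for the real Workers struct: it owns the decision
// of when the matching engine must re-run, instead of the StateManager.
struct Workers {
    // Counter standing in for an async wake-up channel to the matching engine.
    matching_engine_wakeups: AtomicUsize,
}

impl Workers {
    fn new() -> Self {
        Workers {
            matching_engine_wakeups: AtomicUsize::new(0),
        }
    }

    // Worker-side events (a worker connects, frees capacity, or finishes a
    // task) trigger matching from here, keeping worker logic out of the
    // StateManager for this phase of the refactor.
    fn notify_matching_engine(&self) {
        self.matching_engine_wakeups.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let workers = Workers::new();
    workers.notify_matching_engine(); // e.g. a worker connected
    workers.notify_matching_engine(); // e.g. a task completed on a worker
    let wakeups = workers.matching_engine_wakeups.load(Ordering::SeqCst);
    assert_eq!(wakeups, 2);
    println!("matching engine wakeups: {wakeups}");
}
```

Task-side changes would go through a separate, StateManager-facing path, which is the split the commit message describes.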
Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable (waiting on @adam-singer)
Reviewable status: 0 of 1 LGTMs obtained, and 2 discussions need to be resolved (waiting on @adam-singer)
nativelink-scheduler/src/simple_scheduler.rs
line 389 at r3 (raw file):
```rust
let operation_id = state.id.clone();
let ret = <StateManager as MatchingEngineStateManager>::update_operation(
```

If maybe_worker_id is None, we shouldn't update the action at all. We should only mark it as Executing when we know for certain that there is a worker to run it on.

If the ID is None, we failed to find a worker that meets the action's criteria, which is a totally valid case; just emit an INFO event and continue. That should clean up all the logic below substantially.

This is a bit simpler and should be functionally equivalent:
```rust
let maybe_worker_id = self
    .workers
    .find_worker_for_action(&action_info.platform_properties);
let Some(worker_id) = maybe_worker_id else {
    event!(
        Level::INFO,
        "Failed to find worker for action: {}",
        action_info.unique_qualifier.action_name()
    );
    continue;
};

// At this point we know the worker id is Some and can mark the action as Executing.
let operation_id = state.id.clone();
let ret = <StateManager as MatchingEngineStateManager>::update_operation(
    &self.state_manager,
    operation_id.clone(),
    maybe_worker_id,
    Ok(ActionStage::Executing),
)
.await;
if let Err(e) = ret {
    let max_job_retries = self.max_job_retries;
    let metrics = self.metrics.clone();
    // TODO(allada) This is to get around the rust borrow checker with double
    // mut borrows of a mutex lock. Once the scheduler is fully moved to the
    // state manager this can be removed.
    let state_manager = self.state_manager.clone();
    let mut inner_state = state_manager.inner.lock().await;
    SimpleSchedulerImpl::immediate_evict_worker(
        &mut inner_state,
        &mut self.workers,
        max_job_retries,
        &metrics,
        &worker_id,
        e.clone(),
    );
    event!(
        Level::ERROR,
        ?e,
        "update operation failed for {}",
        operation_id
    );
    continue;
}

// Once we get here we know the operation update was successful, so notify the worker.
let run_action_result = self
    .worker_notify_run_action(worker_id, action_info.clone())
    .await;
if let Err(err) = run_action_result {
    event!(
        Level::ERROR,
        ?err,
        ?worker_id,
        ?action_info,
        "failed to run worker_notify_run_action in SimpleSchedulerImpl::do_try_match"
    );
}
```
nativelink-scheduler/src/simple_scheduler.rs
line 427 at r3 (raw file):
.await { event!(
Is it really OK to just emit an event here? Wouldn't the operation be stuck if we update the scheduler-side state to say it is executing on a worker before we actually succeed in assigning it to the worker?

We should do a second update here to fix the mismatch:

```rust
let revert_res = <StateManager as MatchingEngineStateManager>::update_operation(
    &self.state_manager,
    operation_id.clone(),
    None,
    Ok(ActionStage::Queued),
)
.await;
```

It's pretty unlikely revert_res will fail here, but if it does, that's a pretty bad error.
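The revert idea can be shown in a minimal, self-contained sketch. Note that `update_operation` and `ActionStage` below are simplified stand-ins for the real nativelink-scheduler API, not the actual types:

```rust
#[derive(Debug, PartialEq)]
enum ActionStage {
    Queued,
    Executing,
}

// Simplified stand-in for MatchingEngineStateManager::update_operation:
// it either records the new stage or fails the way a state-manager error would.
fn update_operation(stage: ActionStage, fail: bool) -> Result<ActionStage, String> {
    if fail {
        Err("state manager unavailable".to_string())
    } else {
        Ok(stage)
    }
}

fn main() {
    // worker_notify_run_action failed after we already marked the operation
    // Executing, so revert the scheduler-side state back to Queued to avoid
    // a stuck operation.
    let revert_res = update_operation(ActionStage::Queued, false);
    match revert_res {
        Ok(stage) => {
            assert_eq!(stage, ActionStage::Queued);
            println!("reverted to {stage:?}");
        }
        // If even the revert fails, the operation is likely stuck; this
        // deserves an ERROR-level log rather than a silent continue.
        Err(e) => eprintln!("revert failed, operation may be stuck: {e}"),
    }
}
```

The key point of the review comment survives the simplification: a failed worker assignment must be followed by a compensating state update, and a failure of that compensating update is itself a serious error worth surfacing loudly.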
Reviewed 13 of 13 files at r1.
Reviewable status: 1 of 1 LGTMs obtained, and 2 discussions need to be resolved