
[Refactor] Move worker notification in SimpleScheduler under Workers #1069

Open · wants to merge 3 commits into main
Conversation

@allada (Member) commented Jun 29, 2024

Moves the logic for when the matching engine trigger gets run to
under the Workers struct where it is easy to do so. This splits the
logic for when a task is changed and the matching engine needs to run
from when a task gets run and the matching engine needs to be run.

towards: #359
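To illustrate the direction of this change, here is a minimal sketch (not the actual NativeLink code; `Workers`, `Worker`, `worker_change_notify`, and the `u64` worker id are placeholder names) of how a Workers struct could own the matching-engine wakeup so that callers mutating worker state no longer have to remember to trigger it themselves:

```rust
// Minimal sketch, assuming a tokio runtime; all names here are illustrative,
// not the real NativeLink types.
use std::collections::HashMap;
use std::sync::Arc;

use tokio::sync::Notify;

struct Worker; // Placeholder for the real worker handle.

struct Workers {
    workers: HashMap<u64, Worker>,
    // Shared with the matching-engine task, which awaits `notified()` and
    // re-runs matching (e.g. do_try_match) whenever worker availability changes.
    worker_change_notify: Arc<Notify>,
}

impl Workers {
    fn add_worker(&mut self, id: u64, worker: Worker) {
        self.workers.insert(id, worker);
        // A new worker may be able to take queued tasks, so wake the matcher.
        self.worker_change_notify.notify_one();
    }

    fn remove_worker(&mut self, id: u64) {
        if self.workers.remove(&id).is_some() {
            // Tasks assigned to this worker may need to be re-matched.
            self.worker_change_notify.notify_one();
        }
    }
}
```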


This change is Reviewable

In preparation for supporting a distributed/Redis scheduler, prepare the
state interface to no longer take mutable references.

This is a partial PR and should land immediately, with follow-up PRs
that will remove much of the locking in the SimpleScheduler.

towards: TraceMachina#359
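As a rough illustration of what "no longer take mutable references" could look like (the trait and type names below are assumptions, not the real nativelink-scheduler API): mutation moves behind `&self` and interior mutability, so the same interface can later be implemented by a shared or remote (Redis-backed) store.

```rust
// Sketch only: `StateManagerApi`, `InMemoryStateManager`, and `OperationState`
// are made-up names standing in for the real state interface.
use std::sync::Arc;

use async_trait::async_trait;
use tokio::sync::Mutex;

#[derive(Clone)]
struct OperationState; // Placeholder for per-operation state.

#[async_trait]
trait StateManagerApi: Send + Sync {
    // Note `&self`: callers never need an exclusive borrow, so the manager
    // can be shared behind an Arc or backed by a remote store such as Redis.
    async fn update_operation(&self, operation_id: String, state: OperationState);
}

struct InMemoryStateManager {
    inner: Arc<Mutex<Vec<(String, OperationState)>>>,
}

#[async_trait]
impl StateManagerApi for InMemoryStateManager {
    async fn update_operation(&self, operation_id: String, state: OperationState) {
        // Interior mutability replaces the `&mut self` the interface used to require.
        self.inner.lock().await.push((operation_id, state));
    }
}
```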
Worker logic should not be visible to StateManager just yet. In the
future this will likely change, but for this phase of the refactor
SimpleScheduler should own all information about workers.

towards: TraceMachina#359
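A bare-bones sketch of the ownership split described above (field layout is an assumption): SimpleScheduler holds the Workers collection directly, while the StateManager tracks only operation state and never sees worker details.

```rust
// Ownership sketch only; these are not the real struct definitions.
struct StateManager {
    // Operation/queue state lives here; no worker information is visible.
}

struct Workers {
    // Worker registry plus the matching-engine trigger from the sketch above.
}

struct SimpleScheduler {
    state_manager: StateManager,
    // Workers stay owned by the scheduler for this phase of the refactor.
    workers: Workers,
}
```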
Moves the logic for when the matching engine trigger gets run to
under the Workers struct where it is easy to do so. This splits the
logic for when a task is changed and the matching engine needs to run
from when a task gets run and the matching engine needs to be run.

towards: TraceMachina#359
@allada (Member, Author) left a comment


+@adam-singer

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable (waiting on @adam-singer)

@zbirenbaum (Member) left a comment


Reviewable status: 0 of 1 LGTMs obtained, and 2 discussions need to be resolved (waiting on @adam-singer)


nativelink-scheduler/src/simple_scheduler.rs line 389 at r3 (raw file):

                    let operation_id = state.id.clone();
                    let ret = <StateManager as MatchingEngineStateManager>::update_operation(

If maybe_worker_id is None we shouldn't update the action at all. We should only mark it as Executing when we know for certain that there is a worker to run it on.

If the ID is None, we failed to find a worker that meets the action criteria, which is a totally valid case; just create an INFO event and continue. That should clean up all the logic below substantially.

This is a bit simpler and should be functionally equivalent:

let maybe_worker_id = self
    .workers
    .find_worker_for_action(&action_info.platform_properties);

let Some(worker_id) = maybe_worker_id else {
    event!(
        Level::INFO,
        "Failed to find worker for action: {}",
        action_info.unique_qualifier.action_name()
    );
    continue;
};

// At this point we know WorkerId is Some and can mark it as Executing.
let operation_id = state.id.clone();
let ret = <StateManager as MatchingEngineStateManager>::update_operation(
    &self.state_manager,
    operation_id.clone(),
    Some(worker_id),
    Ok(ActionStage::Executing),
)
.await;

if let Err(e) = ret {
    let max_job_retries = self.max_job_retries;
    let metrics = self.metrics.clone();
    // TODO(allada) This is to get around rust borrow checker with double mut borrows
    // of a mutex lock. Once the scheduler is fully moved to state manager this can be
    // removed.
    let state_manager = self.state_manager.clone();
    let mut inner_state = state_manager.inner.lock().await;
    SimpleSchedulerImpl::immediate_evict_worker(
        &mut inner_state,
        &mut self.workers,
        max_job_retries,
        &metrics,
        &worker_id,
        e.clone(),
    );
    event!(
        Level::ERROR,
        ?e,
        "update operation failed for {}",
        operation_id
    );
    continue;
}

// Once we get here we know that the operation update was successful so we notify the worker.
let run_action_result = self
    .worker_notify_run_action(worker_id, action_info.clone())
    .await;

if let Err(err) = run_action_result {
    event!(
        Level::ERROR,
        ?err,
        ?worker_id,
        ?action_info,
        "failed to run worker_notify_run_action in SimpleSchedulerImpl::do_try_match"
    );
}


nativelink-scheduler/src/simple_scheduler.rs line 427 at r3 (raw file):

                            .await
                        {
                            event!(

Is it really ok to just emit an event here? Wouldn't the operation be stuck if we update the scheduler-side state to say it is executing on a worker before we actually succeed in assigning it to the worker?

We should do a second update here to fix the mismatch:

let revert_res = <StateManager as MatchingEngineStateManager>::update_operation(
    &self.state_manager,
    operation_id.clone(),
    None,
    Ok(ActionStage::Queued),
)
.await;

It's pretty unlikely revert_res will fail here, but if it does that's a pretty bad error I think.

@adam-singer (Member) left a comment


:lgtm:

Reviewed 13 of 13 files at r1.
Reviewable status: 1 of 1 LGTMs obtained, and 2 discussions need to be resolved

@zbirenbaum zbirenbaum mentioned this pull request Jul 1, 2024