
[Refactor] Move worker notification in SimpleScheduler under Workers #1069

Open · wants to merge 3 commits into main
Conversation

@allada (Member) commented Jun 29, 2024

Moves the logic for when the matching engine trigger gets run to
under the Workers struct where it is easy to do so. This splits the
logic for when a task is changed and the matching engine needs to run
from when a task gets run and the matching engine needs to be run.

towards: #359
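To illustrate the direction of this change, here is a minimal sketch (not the actual NativeLink code; `Workers`, `Worker`, `worker_change_notify`, and the `u64` worker id are placeholder names) of how a Workers struct could own the matching-engine wakeup so that callers mutating worker state no longer have to remember to trigger it themselves:

```rust
// Minimal sketch, assuming a tokio runtime; all names here are illustrative,
// not the real NativeLink types.
use std::collections::HashMap;
use std::sync::Arc;

use tokio::sync::Notify;

struct Worker; // Placeholder for the real worker handle.

struct Workers {
    workers: HashMap<u64, Worker>,
    // Shared with the matching-engine task, which awaits `notified()` and
    // re-runs matching (e.g. do_try_match) whenever worker availability changes.
    worker_change_notify: Arc<Notify>,
}

impl Workers {
    fn add_worker(&mut self, id: u64, worker: Worker) {
        self.workers.insert(id, worker);
        // A new worker may be able to take queued tasks, so wake the matcher.
        self.worker_change_notify.notify_one();
    }

    fn remove_worker(&mut self, id: u64) {
        if self.workers.remove(&id).is_some() {
            // Tasks assigned to this worker may need to be re-matched.
            self.worker_change_notify.notify_one();
        }
    }
}
```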


This change is Reviewable

In preparation for supporting a distributed/Redis scheduler, prepare the
state interface to no longer take mutable references.

This is a partial PR and should land immediately, with follow-up PRs
that will remove much of the locking in the SimpleScheduler.

towards: TraceMachina#359
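As a rough illustration of what "no longer take mutable references" could look like (the trait and type names below are assumptions, not the real nativelink-scheduler API): mutation moves behind `&self` and interior mutability, so the same interface can later be implemented by a shared or remote (Redis-backed) store.

```rust
// Sketch only: `StateManagerApi`, `InMemoryStateManager`, and `OperationState`
// are made-up names standing in for the real state interface.
use std::sync::Arc;

use async_trait::async_trait;
use tokio::sync::Mutex;

#[derive(Clone)]
struct OperationState; // Placeholder for per-operation state.

#[async_trait]
trait StateManagerApi: Send + Sync {
    // Note `&self`: callers never need an exclusive borrow, so the manager
    // can be shared behind an Arc or backed by a remote store such as Redis.
    async fn update_operation(&self, operation_id: String, state: OperationState);
}

struct InMemoryStateManager {
    inner: Arc<Mutex<Vec<(String, OperationState)>>>,
}

#[async_trait]
impl StateManagerApi for InMemoryStateManager {
    async fn update_operation(&self, operation_id: String, state: OperationState) {
        // Interior mutability replaces the `&mut self` the interface used to require.
        self.inner.lock().await.push((operation_id, state));
    }
}
```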
Worker logic should not be visible to StateManager just yet. In the
future this will likely change, but for this phase of the refactor
SimpleScheduler should own all information about workers.

towards: TraceMachina#359
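A bare-bones sketch of the ownership split described above (field layout is an assumption): SimpleScheduler holds the Workers collection directly, while the StateManager tracks only operation state and never sees worker details.

```rust
// Ownership sketch only; these are not the real struct definitions.
struct StateManager {
    // Operation/queue state lives here; no worker information is visible.
}

struct Workers {
    // Worker registry plus the matching-engine trigger from the sketch above.
}

struct SimpleScheduler {
    state_manager: StateManager,
    // Workers stay owned by the scheduler for this phase of the refactor.
    workers: Workers,
}
```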
Moves the logic for when the matching engine trigger gets run to
under the Workers struct where it is easy to do so. This splits the
logic for when a task is changed and the matching engine needs to run
from when a task gets run and the matching engine needs to be run.

towards: TraceMachina#359
@allada (Member, Author) left a comment


+@adam-singer

Reviewable status: 0 of 1 LGTMs obtained, and pending CI: Analyze (python), Bazel Dev / ubuntu-22.04, Cargo Dev / macos-13, Cargo Dev / ubuntu-22.04, Installation / macos-13, Installation / macos-14, Installation / ubuntu-22.04, Local / ubuntu-22.04, Publish image, Publish nativelink-worker-init, Publish nativelink-worker-lre-cc, Remote / large-ubuntu-22.04, asan / ubuntu-22.04, docker-compose-compiles-nativelink (20.04), docker-compose-compiles-nativelink (22.04), integration-tests (20.04), integration-tests (22.04), macos-13, pre-commit-checks, ubuntu-20.04 / stable, ubuntu-22.04, ubuntu-22.04 / stable, windows-2022 / stable (waiting on @adam-singer)

@zbirenbaum (Member) left a comment


Reviewable status: 0 of 1 LGTMs obtained, and 2 discussions need to be resolved (waiting on @adam-singer)


nativelink-scheduler/src/simple_scheduler.rs line 389 at r3 (raw file):

                    let operation_id = state.id.clone();
                    let ret = <StateManager as MatchingEngineStateManager>::update_operation(

If maybe_worker_id is None we shouldn't update the action at all. We should only mark it as Executing when we know for certain that there is a worker to run it on.

If the ID is None, we failed to find a worker that meets the action criteria, which is a totally valid case; just create an INFO event and continue. That should clean up all the logic below substantially.

This is a bit simpler and should be functionally equivalent:

let maybe_worker_id = self
    .workers
    .find_worker_for_action(&action_info.platform_properties);

let Some(worker_id) = maybe_worker_id else {
    event!(
        Level::INFO,
        "Failed to find worker for action: {}",
        action_info.unique_qualifier.action_name()
    );
    continue;
};

// At this point we know WorkerId is Some and can mark it as Executing.
let operation_id = state.id.clone();
let ret = <StateManager as MatchingEngineStateManager>::update_operation(
    &self.state_manager,
    operation_id.clone(),
    Some(worker_id),
    Ok(ActionStage::Executing),
)
.await;

if let Err(e) = ret {
    let max_job_retries = self.max_job_retries;
    let metrics = self.metrics.clone();
    // TODO(allada) This is to get around rust borrow checker with double mut borrows
    // of a mutex lock. Once the scheduler is fully moved to state manager this can be
    // removed.
    let state_manager = self.state_manager.clone();
    let mut inner_state = state_manager.inner.lock().await;
    SimpleSchedulerImpl::immediate_evict_worker(
        &mut inner_state,
        &mut self.workers,
        max_job_retries,
        &metrics,
        &worker_id,
        e.clone(),
    );
    event!(
        Level::ERROR,
        ?e,
        "update operation failed for {}",
        operation_id
    );
    continue;
}

// Once we get here we know that the operation update was successful so we notify the worker.
let run_action_result = self
    .worker_notify_run_action(worker_id, action_info.clone())
    .await;

if let Err(err) = run_action_result {
    event!(
        Level::ERROR,
        ?err,
        ?worker_id,
        ?action_info,
        "failed to run worker_notify_run_action in SimpleSchedulerImpl::do_try_match"
    );
}


nativelink-scheduler/src/simple_scheduler.rs line 427 at r3 (raw file):

                            .await
                        {
                            event!(

Is it really ok to just emit an event here? Wouldn't the operation be stuck if we update the scheduler-side state to say it is executing on a worker before we actually succeed in assigning it to the worker?

We should do a second update here to fix the mismatch:

let revert_res = <StateManager as MatchingEngineStateManager>::update_operation(
    &self.state_manager,
    operation_id.clone(),
    None,
    Ok(ActionStage::Queued),
)
.await;

It's pretty unlikely revert_res will fail here, but if it does that's a pretty bad error I think.

@adam-singer (Member) left a comment


:lgtm:

Reviewed 13 of 13 files at r1.
Reviewable status: 1 of 1 LGTMs obtained, and 2 discussions need to be resolved

@zbirenbaum zbirenbaum mentioned this pull request Jul 1, 2024