Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't spin on the main mutex while waiting for new work #8433

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Commits on Oct 8, 2024

  1. Don't spin on the main mutex while waiting for new work

    This is one solution to an issue identified by Marcos, opened for
    discussion. Here's the full description of the issue:
    
    Once they run out of work to do, Halide worker threads spin for a bit
    checking if new work has been enqueued before calling cond_wait, which
    puts them to sleep until signaled. Job owners also spin waiting for
    their job to complete before going to sleep on a different condition
    variable. I hate this, but all previous attempts I have made at removing
    or reducing the spinning have made things slower.
    
    One problem with this approach is that spinning is done by releasing the
    work queue lock, yielding, reacquiring the work queue lock, and doing
    the somewhat involved check to see if there's something useful for this
    thread to do, either because new work was enqueued, the last item on a
    job completed, or a semaphore was released. This hammering of the lock
    by idle worker threads can starve the thread that actually completed the
    last task, delaying its ability to tell the job owner the job is done,
    and can also starve the job owner, causing it to take extra time to
    realize the job is all done and return back into Halide code. So this
    adds some wasted time at the end of every parallel for loop.
    
    This PR gets these idle threads to spin off the main mutex. I did this
    by adding a counter to each condition variable. Any time they are
    signaled, the counter is atomically incremented. Before they first
    release the lock, the idlers atomically capture the value of this
    counter. Then in cond_wait they spin for a bit doing atomic loads of the
    counter in between yields until it changes, in which case they grab the
    lock and return, or until they reach the spin count limit, in which case
    they go to sleep. This improved performance quite a bit over main for
    the blur app, which is a fast pipeline (~100us) with fine-grained
    parallelism. The speed-up was 1.2x! Not much effect on the more complex
    apps.
    
    I'm not entirely sure it's correct, because I think the counter has to
    be incremented with the lock held, so that it serializes correctly with
    the idlers capturing the value of the counter before releasing the lock,
    and you can call cond_signal/broadcast without holding the mutex (though
    we don't do that currently). It also has the unfortunate effect of
    waking up all spinning threads when you signal, instead of just one of
    them. However we never actually call signal, just broadcast. It also
    increases the size of a cond var, which might be considered a breaking
    change in the Halide runtime API.
    abadams committed Oct 8, 2024
    Configuration menu
    Copy the full SHA
    79197e3 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    44a1cd7 View commit details
    Browse the repository at this point in the history

Commits on Oct 9, 2024

  1. Add final check

    abadams committed Oct 9, 2024
    Configuration menu
    Copy the full SHA
    c7ebdf9 View commit details
    Browse the repository at this point in the history
  2. Remove debugging assert

    abadams committed Oct 9, 2024
    Configuration menu
    Copy the full SHA
    6d46367 View commit details
    Browse the repository at this point in the history

Commits on Oct 10, 2024

  1. Configuration menu
    Copy the full SHA
    945e889 View commit details
    Browse the repository at this point in the history

Commits on Oct 28, 2024

  1. Configuration menu
    Copy the full SHA
    edeae97 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    4b25a9e View commit details
    Browse the repository at this point in the history
  3. Configuration menu
    Copy the full SHA
    3c15e7f View commit details
    Browse the repository at this point in the history
  4. Fix atomics test

    abadams committed Oct 28, 2024
    Configuration menu
    Copy the full SHA
    299293f View commit details
    Browse the repository at this point in the history
  5. Fix comment

    abadams committed Oct 28, 2024
    Configuration menu
    Copy the full SHA
    34a7cef View commit details
    Browse the repository at this point in the history

Commits on Oct 29, 2024

  1. Configuration menu
    Copy the full SHA
    c5a8bdc View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    45cfbe1 View commit details
    Browse the repository at this point in the history

Commits on Oct 30, 2024

  1. Configuration menu
    Copy the full SHA
    f0df1c7 View commit details
    Browse the repository at this point in the history
  2. Fix decile reporting

    abadams committed Oct 30, 2024
    Configuration menu
    Copy the full SHA
    3b9b6a7 View commit details
    Browse the repository at this point in the history
  3. clang-format

    abadams committed Oct 30, 2024
    Configuration menu
    Copy the full SHA
    6eb29b9 View commit details
    Browse the repository at this point in the history
  4. Merge remote-tracking branch 'origin/abadams/less_noisy_performance_s…

    …cenarios' into abadams/cond_spin_wait
    abadams committed Oct 30, 2024
    Configuration menu
    Copy the full SHA
    6a93dd6 View commit details
    Browse the repository at this point in the history