Don't spin on the main mutex while waiting for new work #8433
base: main
Conversation
This is one solution to an issue identified by Marcos, opened for discussion. Here's the full description of the issue:

Once they run out of work to do, Halide worker threads spin for a bit checking whether new work has been enqueued before calling cond_wait, which puts them to sleep until signaled. Job owners also spin waiting for their job to complete before going to sleep on a different condition variable. I hate this, but all previous attempts I have made at removing or reducing the spinning have made things slower.

One problem with this approach is that spinning is done by releasing the work queue lock, yielding, reacquiring the work queue lock, and doing the somewhat involved check to see if there's something useful for this thread to do, either because new work was enqueued, the last item on a job completed, or a semaphore was released. This hammering of the lock by idle worker threads can starve the thread that actually completed the last task, delaying its ability to tell the job owner the job is done, and can also starve the job owner, causing it to take extra time to realize the job is all done and return back into Halide code. So this adds some wasted time at the end of every parallel for loop.

This PR gets these idle threads to spin off the main mutex. I did this by adding a counter to each condition variable. Any time one is signaled, the counter is atomically incremented. Before they first release the lock, the idlers atomically capture the value of this counter. Then in cond_wait they spin for a bit doing atomic loads of the counter in between yields, until either it changes, in which case they grab the lock and return, or they reach the spin count limit, in which case they go to sleep. This improved performance quite a bit over main for the blur app, which is a fast pipeline (~100us) with fine-grained parallelism. The speed-up was 1.2x! Not much effect on the more complex apps.

I'm not entirely sure it's correct: I think the counter has to be incremented with the lock held, so that it serializes correctly with the idlers capturing the value of the counter before releasing the lock, but cond_signal/broadcast can be called without holding the mutex (though we don't do that currently). It also has the unfortunate effect of waking up all spinning threads when you signal, instead of just one of them. However, we never actually call signal, just broadcast. It also increases the size of a cond var, which might be considered a breaking change in the Halide runtime API.

Oh, another problem with the current behavior is that workers spin 40 times waiting for new work, and each spin grabs the mutex, which may itself spin 40 times to acquire it, so the upper limit on yield calls before sleeping is 40 x 40 = 1600 (!). This way the upper limit is 80 yields before going to sleep: 40 spins on the counter, and then 40 spins to acquire the mutex again to do the rest of cond_wait.
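To make the mechanism concrete, here is a minimal sketch of the counter-augmented wait/broadcast described above, written against standard C++ primitives rather than the Halide runtime's halide_mutex/halide_cond. The type name, spin limit constant, and memory orderings are illustrative assumptions, not the code in this PR.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

struct SpinCountingCondVar {
    std::condition_variable cv;
    std::atomic<uint64_t> wakeups{0};  // bumped on every broadcast

    static constexpr int kSpinLimit = 40;

    // Caller must hold 'lock'. Mirrors the cond_wait described above: capture
    // the counter while the lock is still held, spin on it with the lock
    // released, and only fall back to a real sleep after kSpinLimit yields.
    void wait(std::unique_lock<std::mutex> &lock) {
        const uint64_t seen = wakeups.load(std::memory_order_relaxed);
        lock.unlock();
        for (int i = 0; i < kSpinLimit; i++) {
            if (wakeups.load(std::memory_order_acquire) != seen) {
                lock.lock();
                return;  // a broadcast arrived while we were spinning
            }
            std::this_thread::yield();
        }
        lock.lock();
        if (wakeups.load(std::memory_order_acquire) != seen) {
            return;  // a broadcast raced with us reacquiring the lock
        }
        cv.wait(lock);  // give up and sleep until notified
    }

    // Should be called with the associated mutex held, so that the increment
    // serializes with waiters capturing the counter (the correctness concern
    // raised above). Wakes every spinner as well as the sleepers.
    void broadcast() {
        wakeups.fetch_add(1, std::memory_order_release);
        cv.notify_all();
    }
};
```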
uh oh, looks like a deadlock on the bots
To me it sounds like you could get away with checking the bit without locking the mutex. Only when you want to actually synchronize do you lock the mutex; as long as no work is available (and the bit reflects that), reading the bit without locking sounds fine to me. Once an idle worker finds an indication of more work, it takes the more expensive code path with locking. Then idle workers would no longer compete for the lock while they are idle, giving the worker that's actually doing something a chance to signal that it finished. Perhaps I misunderstood.
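For reference, a rough sketch of the lock-free pre-check being suggested here, assuming a single atomic flag summarizes "work might be available". All of the names (work_available, work_mutex, pending_tasks) are hypothetical and not part of the Halide runtime.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

std::atomic<bool> work_available{false};
std::mutex work_mutex;
int pending_tasks = 0;  // guarded by work_mutex

// Producer side: enqueue work under the lock, then flip the flag.
void enqueue_task() {
    {
        std::lock_guard<std::mutex> guard(work_mutex);
        pending_tasks++;
    }
    work_available.store(true, std::memory_order_release);
}

// Worker side: poll the flag without the lock; only lock once it flips.
bool try_run_one_task() {
    if (!work_available.load(std::memory_order_acquire)) {
        std::this_thread::yield();  // cheap idle spin, no lock traffic
        return false;
    }
    std::lock_guard<std::mutex> guard(work_mutex);
    if (pending_tasks == 0) {
        return false;  // lost the race to another worker
    }
    if (--pending_tasks == 0) {
        work_available.store(false, std::memory_order_relaxed);
    }
    // ... run the claimed task here ...
    return true;
}
```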
Unfortunately I can't make the meeting today. I have not convinced myself this provides a correct implementation of a condition variable, specifically with multiple threads calling wait and signal. Though I think it strictly just increases the number of false returns from wait, which are allowed arbitrarily, so it likely does not break the contract. That is, however, a pretty weak design to be standing on: such returns are allowed, but a good implementation is supposed to minimize them.

One thought is that if this is really improving things, a conditional critical section, which moves the actual test of whether there is more work to do or not inside the synchronization primitive, might help. But mostly my feeling is we need to get rid of the spinning entirely, at least on Linux.

The other issue here is the degree of testing required, both for correctness and, more importantly, for performance. It is very easy to improve things in one place and worsen them elsewhere. Toward that end, it would be good to collect the performance data behind the change at least minimally, including the platform info it is from, and maybe some baselines for other work to make sure there are no regressions.
It definitely wakes too many threads on signal - all the spinners plus one sleeping thread. We don't currently use signal. Viewed as a whole (i.e. including the spinning in thread_pool_common.h that this PR removes), if a spurious wake-up is a lock grab for no good reason, this design is a huge improvement over main (max of 1 vs 40).

There are a few alternative designs that amount to the same thing for current usage, but perhaps won't sabotage any future uses of the existing cond var. One is just moving the raw counter into the thread pool and spinning on it before calling cond_wait. The places where we signal would have to both signal and increment the counter. Another is making a "halide_spin_cond_var", where there's no signal method, and wait returns after 40 spins max even if not signaled. It would return a bool saying whether it was signaled or timed out. This halide_spin_cond_var would be waited on before waiting on the real cond var.

One could also imagine a design where the spin_cond_var tracks the number of spinners. New waiters spin until the counter is the initial value plus the number of spinners (including itself). Broadcast increments the counter by the number of spinners, and signal increments it by one. The number of spinners is guarded by the mutex. I think this is basically a spin semaphore? (EDIT: Actually no, I think this still has spurious wake-ups if there are new waiters and new wake-ups while a spinning thread is yielding.)

We're trying to get some data on our production workloads now. If that's positive, I'll also instrument the bots to record something useful on the open source apps on various platforms.
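A minimal sketch of what that "halide_spin_cond_var" could look like, assuming the 40-spin limit and bool return described above. Only the name comes from the comment; the rest is an illustrative guess rather than existing Halide runtime API.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

struct halide_spin_cond_var {
    std::atomic<uint64_t> broadcasts{0};

    // Returns true if a broadcast arrived while spinning, false if we timed
    // out after max_spins yields and the caller should go sleep on the real
    // cond var instead.
    bool wait(int max_spins = 40) const {
        const uint64_t seen = broadcasts.load(std::memory_order_relaxed);
        for (int i = 0; i < max_spins; i++) {
            if (broadcasts.load(std::memory_order_acquire) != seen) {
                return true;
            }
            std::this_thread::yield();
        }
        return false;
    }

    // No signal(): waking exactly one spinner isn't possible with this
    // scheme, which is why the comment proposes a broadcast-only interface.
    void broadcast() {
        broadcasts.fetch_add(1, std::memory_order_release);
    }
};
```

A waiter would call wait() on this first and only block on the real cond var (under the mutex) when it returns false; the enqueueing side would broadcast on it in the same places it broadcasts the real cond var.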
OK, bots have been instrumented, so we should have results for a wide variety of scenarios and machines once all the bots go green.
Here are the geomean speedups. This is (geometrically) averaged over different numbers of tasks and different sizes of tasks. It's also geometrically averaged over different deciles in the distribution, over whether or not there are contending tasks in the system also trying to do work, and over whether or not it's memory bound. Being memory bound didn't make a difference. Contention just added noise and made the distributions a lot wider, but didn't change the story. I'll also break out the median speed-up in an easy case (256 tasks, lots of work per task, no contention), and the median speed-up in a hard case (16 tasks, little work per task, contention).
So we have solid wins on average on those four platforms, but something wonky is going on in the contended 16-task case on Mac. If I look at the same number but in the memory-bound regime, I see a speed-up of 4.9! So I think it's just incredibly noisy, and we should focus on the geomean. I'm rerunning to be sure. In the easy cases (e.g. most of our app schedules) this is about the same as main.
The rerun on the Mac showed incredibly noisy results again. I suspect it's something about whether or not the jobs are scheduled on the performance or efficiency cores. I'm going to do a little more investigation.
I wonder if it would help to issue a bunch of Neon instructions during the spin loops so the threads don't get rescheduled to the efficiency cores?
Add the zero-vector to the zero-vector. Transistors only produce heat when they flip state. I believe no vector circuitry should cause flipping transistors on its computation lines when adding zero to zero. 😝 I wonder if anybody has tried to rigorously test the impact on heat production when doing that, compared to adding random numbers. On a more serious note: perhaps there is a way to pin the thread to an efficiency core? (I understand that your suggestion is a funny quick way to test this hypothesis, though.)
I don't think I want the spinning threads on the performance cores; I'd rather the final thread doing the work they're all waiting for went there.
Updated, less noisy results. I have also included the hard case, uncontended, but with lots of memory traffic to a large input image, because it's interestingly different on x86 (up to 3x faster!). A cause might be that the cache line containing the counter is now in a many-readers-and-occasionally-one-writer state, as opposed to main, where all threads were competing to write to it. Maybe this reduces inter-core traffic in a meaningful way.
The geomean on Mac ARM is lower than the cases I looked at, because it seems to do poorly in general when there are 8 small tasks to perform. The thread pool has 10 threads. The CPU has 8 performance cores and 2 efficiency cores. Maybe this makes it more likely for the threads on the efficiency cores to get the work somehow. If I restrict it to 8 threads it gets slower. If I use 16 threads it's better, but not as good as main. I think I need to do some instrumentation to see what the threads are doing in that case.

One general cause of potential slowdowns is that this branch spins a lot less than main (40 + 40 vs 40 * 40), which reduces the amount of time that threads spin waiting for new work, and for some workloads this causes them to give up just before they would have received new work. I think this is an acceptable trade-off. We should be spending less time spinning if it doesn't hurt performance on average.
Actually, on my M1 Max laptop I can't reproduce the slowness on the build bot. It's 2x faster in that case. The laptop has 8 performance cores and 2 efficiency cores. The buildbot actually has 4 of each.
Is there consensus on this PR, or are we still working on it?
There are some concrete action items left to do:
Here's what I believe is happening in the one case that got slower, after looking at some traces: on an M1, efficiency cores are about 4x slower than performance cores at this inner loop, so when you have 8 tasks to perform, total runtime is 2x lower if the four performance cores take two tasks each vs all 8 cores taking 1 task each. Running the test with HL_NUM_THREADS=4 only uses performance cores, and confirms this. I believe that due to increased contention in main, and the fact that efficiency cores go away for longer when they yield compared to performance cores, those threads were getting starved out of the lock and were not able to claim tasks before the performance cores finished the whole job. So paradoxically, increased contention and the resulting starvation actually helped performance.

Using efficiency cores does help in cases where there are more tasks to perform, so it's not like we should just never use efficiency cores. I confirmed this on the apps, and it also shows up in the parallel scenarios test if you have more tasks. If a task takes time t on a performance core and 4t on an efficiency core, and you have 8n of them, then using 4 performance cores only, the runtime is going to be 2nt, because each core will perform 2n tasks; but using efficiency cores too, assuming at least one of them claims a task, the minimum runtime is going to be 4t. The break-even point is at n = 2, which is 16 tasks. Above 16 tasks you can save time by letting the efficiency cores help, but below that it's a bad idea to get them involved.

We could change the thread pool to be big+little-aware and only use some fraction of the thread pool if there are too few tasks for it to be a good idea to wake all the threads. Pinning the threads is probably not necessary - if we're only running four of them the OS will put them on the performance cores. Instead of doing this automatically, we could also add a max-threads-to-use Expr that's distinct from the number-of-parallel-tasks Expr (the extent divided by the split factor), as Zalman has previously suggested. Either of those is a much more complicated project.

I'm not going to consider this behavior a blocker for this PR. The fact that increased contention actually helps in a corner case by starving out slow threads seems like spacebar heating. Here's my current data: https://docs.google.com/spreadsheets/d/1Qdc0UHbF1hctxuYESJNHAqJegGDKlafuSqYc3t07u7M/edit?usp=sharing

Next task is to move the atomic counters out of the cond vars.
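Spelling out the break-even arithmetic above (same assumptions as the comment: four performance cores, a task costs t on a performance core and 4t on an efficiency core, 8n tasks in total):

```latex
T_{\text{perf only}} = \frac{8n}{4}\,t = 2nt,
\qquad
T_{\text{with efficiency cores}} \ge 4t,
\qquad
2nt = 4t \;\Rightarrow\; n = 2 \;\Rightarrow\; 8n = 16 \text{ tasks}.
```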
I'm OK to land this after moving the logic out of the condition variable code. The justification for moving the logic out is as follows:
Regarding the different types of cores, the other issue here is total power use for a given job. Generally one can assume getting done faster uses less power, but for some algorithms efficiency cores likely do reduce total power. There are also cases where using the efficiency cores for an independent part of a pipeline, in parallel with the performance cores, is likely the way to go. (The idea is that some algorithms are better suited for one or the other, and doing distinct loops removes a large difference in time per index from the scheduling decision.) The bottom line is that the benchmark used likely doesn't model significant pipelines completely. (Which isn't to say it's not useful, just that I doubt the larger speedups are representative of situations with lots of work on the queue.) Adding a scheduling directive to control the number of threads in each processing class (performance, efficiency, etc.), possibly as a function of hardware core count (e.g. a percentage of it), would be a possible approach.
Approving, pending moving the logic out of the condvar. I'm guessing having Google, etc. weigh in with their own performance tests is a good idea, if that hasn't already been done.