Scheduling is a simple, minimalistic, and fast library for running async tasks and task graphs, developed with simplicity and performance in mind.
Below are some GitHub projects related to thread pools and task graphs in which you may be interested:
- Taskflow
- CGraph
- https://github.com/bshoshany/thread-pool
- https://github.com/DeveloperPaul123/thread-pool
We start with examples demonstrating how to run async tasks and task graphs.
To add Scheduling to your project, you can use CMake. For example:

```cmake
target_link_libraries(${PROJECT_NAME} PRIVATE scheduling)
```
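If you vendor the sources, a minimal setup might look like this (a sketch only, assuming the Scheduling sources live in a `scheduling/` subdirectory; the actual integration depends on how you obtain the library):

```cmake
# Sketch: assumes the Scheduling sources are vendored in scheduling/.
add_subdirectory(scheduling)
target_link_libraries(${PROJECT_NAME} PRIVATE scheduling)
```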
To run async tasks, create a `ThreadPool` instance. For example:

```cpp
#include "scheduling/scheduling.hpp"
...
scheduling::ThreadPool thread_pool;
```
In the constructor, the `ThreadPool` class creates several worker threads that run in the background until the instance is destroyed. As an optional argument, the constructor of the `ThreadPool` class accepts the number of threads to create. By default, the number of threads is equal to `std::thread::hardware_concurrency()`.
When the `ThreadPool` instance is created, submit a task. For example:

```cpp
thread_pool.Submit([] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  std::cout << "Completed.\n";
});
```
A task is a function that accepts no arguments and returns `void`. Use lambda captures to pass input and output arguments to the task. Eventually, the task will be executed on one of the worker threads owned by the `ThreadPool` instance.
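For instance, captured references can carry a task's input and output (the captured variables must outlive the task):

```cpp
int input = 42;
int output = 0;

// input and output are captured by reference, so the task can read
// its argument and publish its result.
thread_pool.Submit([&input, &output] { output = input * 2; });

thread_pool.Wait();  // do not read output before the task completes
```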
If needed, call `Wait()` to block the current thread until all submitted tasks are completed. For example:

```cpp
thread_pool.Wait();
```
In the destructor, the `ThreadPool` class blocks the current thread until all submitted tasks are completed.
A task graph is a collection of tasks and the dependencies between them; the dependencies define the order in which the tasks are executed.
To define a task graph, create an iterable collection of `Task` instances. For example:

```cpp
#include "scheduling/scheduling.hpp"
...
std::vector<scheduling::Task> tasks;
// Reserve capacity up front: the emplace_back calls below must not
// reallocate, because references to the stored tasks are kept.
tasks.reserve(7);
```
The `Task` instances must not be destroyed or reallocated until their execution by `ThreadPool` is completed.
Add elements to `tasks`. For example, add tasks to calculate the value of `(a + b) * (c + d)` asynchronously. First, add tasks to get the values of `a`, `b`, `c`, and `d`:
```cpp
int a, b, c, d;

auto& get_a = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  a = 1;
});
auto& get_b = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  b = 2;
});
auto& get_c = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  c = 3;
});
auto& get_d = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  d = 4;
});
```
Next, add tasks to calculate `a + b` and `c + d`:
```cpp
int sum_ab, sum_cd;

auto& get_sum_ab = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  sum_ab = a + b;
});
auto& get_sum_cd = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  sum_cd = c + d;
});
```
Finally, add the task to calculate the product `(a + b) * (c + d)`:
```cpp
int product;

auto& get_product = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  product = sum_ab * sum_cd;
});
```
When all tasks are added, define task dependencies. The task `get_sum_ab` should be executed after `get_a` and `get_b`:

```cpp
get_sum_ab.Succeed(&get_a, &get_b);
```
Similarly, the task `get_sum_cd` should be executed after `get_c` and `get_d`:

```cpp
get_sum_cd.Succeed(&get_c, &get_d);
```
Finally, the task `get_product` should be executed after `get_sum_ab` and `get_sum_cd`:

```cpp
get_product.Succeed(&get_sum_ab, &get_sum_cd);
```
When dependencies between tasks are defined, create a `ThreadPool` instance and submit the task graph for execution:

```cpp
scheduling::ThreadPool thread_pool;
thread_pool.Submit(tasks);
```
If needed, call `Wait()` to block the current thread until all submitted tasks are completed:

```cpp
thread_pool.Wait();
```

After `Wait()` returns, `product` holds `(1 + 2) * (3 + 4) = 21`.
If needed, use synchronization primitives to wait until a specific task is completed. For example:

```cpp
scheduling::ThreadPool thread_pool;
std::atomic_flag is_completed;

thread_pool.Submit([&] {
  is_completed.test_and_set();
  is_completed.notify_one();
});

is_completed.wait(false);
```
Under certain scenarios, you might need to block the current thread and execute submitted tasks in a loop until a specified condition is met. For example, this may happen when you submit tasks recursively and a task cannot complete until some other tasks are completed. In this case, all worker threads may get blocked by tasks that are waiting for other tasks to complete. To keep executing tasks while waiting until a specified condition is met, use the `Wait` overload accepting a predicate.
For example, below is a recursive function, used in the benchmarks, that calculates Fibonacci numbers without memoization:

```cpp
int Fibonacci(scheduling::ThreadPool& thread_pool, const int n) {
  if (n < 2) {
    return 1;
  }

  int a, b;
  std::atomic counter{0};

  thread_pool.Submit([&, n] {
    a = Fibonacci(thread_pool, n - 1);
    counter.fetch_add(1);
  });
  thread_pool.Submit([&, n] {
    b = Fibonacci(thread_pool, n - 2);
    counter.fetch_add(1);
  });

  // Keep executing other submitted tasks until both subproblems are done.
  thread_pool.Wait([&] { return counter.load() == 2; });
  return a + b;
}
```
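For instance (note that this variant is 1-based, so `Fibonacci(0)` and `Fibonacci(1)` both return 1):

```cpp
scheduling::ThreadPool thread_pool;
const int result = Fibonacci(thread_pool, 10);  // 89 with this variant
```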
To define dependencies between task groups, use empty tasks. For example, assume that the tasks `a` and `b` should be executed before the tasks `c` and `d`. To ensure the order, create an empty task `e` that follows `a` and `b` and precedes `c` and `d`:

```cpp
scheduling::Task e;
e.Succeed(&a, &b);
e.Precede(&c, &d);
```
To cancel a task, call `Task::Cancel()`. Cancelling a task never fails. If `Task::Cancel()` returns `false`, this means that the task has already been invoked or will be invoked at least once after the cancellation. When a task is cancelled and will not be invoked anymore, its successors will not be invoked either.

For example:

```cpp
scheduling::Task task;
thread_pool.Submit(&task);

if (task.Cancel()) {
  std::cout << "The task will not be invoked.\n";
}
```
To undo cancellation, call `Task::Reset()`:

```cpp
task.Reset();
```
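A sketch of combining the two calls (assuming that a reset task may be submitted again):

```cpp
if (task.Cancel()) {
  // The task will not be invoked this time around.
  task.Reset();               // undo the cancellation
  thread_pool.Submit(&task);  // schedule the task once more
}
```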
You can submit a task graph multiple times, but you must make sure that the graph has completed before resubmitting it. You cannot submit the same graph simultaneously from different threads.
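For example, using only the calls shown above:

```cpp
thread_pool.Submit(tasks);
thread_pool.Wait();         // ensure the graph has completed...
thread_pool.Submit(tasks);  // ...before resubmitting it
```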
We compare Scheduling with Taskflow, a highly optimized library for parallel and heterogeneous programming [1], [2]. We use Google Benchmark for benchmarking. A comparison of Taskflow with other popular libraries for async programming can be found in the Taskflow documentation.

We measure the total time needed to create and execute tasks. The benchmarks were run on an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (4 cores, 8 logical processors) running Windows 10.0.19045:
```text
Run on (8 X 2803 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)
```
To benchmark running async tasks using the `ThreadPool` class, we use the Taskflow example that calculates Fibonacci numbers recursively without memoization.
The charts below compare the two libraries:

*(Charts: wall time and CPU time.)*
Below is a sample output of Google Benchmark:
```text
------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
scheduling/fibonacci/25       28.0 ms         13.4 ms           50
scheduling/fibonacci/26       31.6 ms         11.8 ms           45
scheduling/fibonacci/27       64.0 ms         43.9 ms           16
scheduling/fibonacci/28       97.4 ms         78.1 ms           11
scheduling/fibonacci/29        134 ms         99.0 ms            6
scheduling/fibonacci/30        228 ms          184 ms            4
scheduling/fibonacci/31        363 ms          349 ms            3
scheduling/fibonacci/32        547 ms          539 ms            2
scheduling/fibonacci/33        854 ms          656 ms            1
scheduling/fibonacci/34       1349 ms         1359 ms            1
scheduling/fibonacci/35       2203 ms         2188 ms            1
taskflow/fibonacci/25         49.0 ms         31.2 ms           22
taskflow/fibonacci/26         80.0 ms         49.7 ms           11
taskflow/fibonacci/27          129 ms         89.3 ms            7
taskflow/fibonacci/28          212 ms          144 ms            5
taskflow/fibonacci/29          343 ms          234 ms            3
taskflow/fibonacci/30          536 ms          383 ms            2
taskflow/fibonacci/31          982 ms          625 ms            1
taskflow/fibonacci/32         1506 ms         1016 ms            1
taskflow/fibonacci/33         2540 ms         1734 ms            1
```
To benchmark the simplest example of a task graph, we use the Taskflow linear chain benchmark. In this benchmark, we create a task graph where all tasks are executed sequentially, each task incrementing an integer.
The charts below compare the two libraries:

*(Charts: wall time and CPU time.)*
Below is a sample output of Google Benchmark:
```text
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
scheduling/linear_chain/1048576        156 ms          116 ms            5
scheduling/linear_chain/2097152        297 ms          229 ms            3
scheduling/linear_chain/4194304        519 ms          391 ms            2
scheduling/linear_chain/8388608       1064 ms          828 ms            1
scheduling/linear_chain/16777216      2173 ms         1688 ms            1
scheduling/linear_chain/33554432      4431 ms         3453 ms            1
taskflow/linear_chain/1048576          158 ms          125 ms            6
taskflow/linear_chain/2097152          314 ms          258 ms            2
taskflow/linear_chain/4194304          875 ms          750 ms            1
taskflow/linear_chain/8388608         1748 ms         1531 ms            1
taskflow/linear_chain/16777216        3714 ms         3219 ms            1
taskflow/linear_chain/33554432        8137 ms         7219 ms            1
```
To benchmark a more involved example of a task graph, we use the Taskflow matrix multiplication benchmark.
The charts below compare the two libraries:

*(Charts: wall time and CPU time.)*
Below is a sample output of Google Benchmark:
```text
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
scheduling/matrix_multiplication/128        2.22 ms        0.719 ms         1000
scheduling/matrix_multiplication/256        11.1 ms        0.969 ms         1000
scheduling/matrix_multiplication/512         102 ms         1.41 ms          100
scheduling/matrix_multiplication/1024        961 ms         1.56 ms           10
scheduling/matrix_multiplication/2048      10461 ms         15.6 ms            1
taskflow/matrix_multiplication/128          1.88 ms        0.393 ms         1948
taskflow/matrix_multiplication/256          11.6 ms        0.732 ms          640
taskflow/matrix_multiplication/512           100 ms        0.938 ms          100
taskflow/matrix_multiplication/1024          960 ms         9.38 ms           10
taskflow/matrix_multiplication/2048        10443 ms         31.2 ms            1
```
In this section, we briefly describe some implementation details of the Scheduling library.
The idea of work-stealing queues is to provide each worker thread with its own task queue to reduce thread contention [3]. When a task is submitted, it is pushed to one of the queues. The thread owning the queue can eventually pick up the task and execute it. If there are no tasks in the queue owned by a worker thread, the thread attempts to steal a task from another queue.
Work-stealing queues are typically implemented as lock-free deques. The owning thread pops elements at one end of the deque, while other threads steal elements at the other end.
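To illustrate the owner/thief protocol, below is a deliberately simplified, mutex-based deque (a sketch only, not the library's actual lock-free implementation):

```cpp
#include <deque>
#include <functional>
#include <mutex>
#include <optional>

// Simplified illustration of a work-stealing deque. Real implementations,
// such as the Chase-Lev deque discussed below, are lock-free.
class WorkStealingQueue {
 public:
  // Called only by the owning worker thread.
  void Push(std::function<void()> task) {
    const std::lock_guard<std::mutex> lock{mutex_};
    tasks_.push_back(std::move(task));
  }

  // Called only by the owning worker thread: pops from the back.
  std::optional<std::function<void()>> Pop() {
    const std::lock_guard<std::mutex> lock{mutex_};
    if (tasks_.empty()) return std::nullopt;
    auto task = std::move(tasks_.back());
    tasks_.pop_back();
    return task;
  }

  // Called by other worker threads: steals from the front.
  std::optional<std::function<void()>> Steal() {
    const std::lock_guard<std::mutex> lock{mutex_};
    if (tasks_.empty()) return std::nullopt;
    auto task = std::move(tasks_.front());
    tasks_.pop_front();
    return task;
  }

 private:
  std::mutex mutex_;
  std::deque<std::function<void()>> tasks_;
};
```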
Implementing a work-stealing deque is not an easy task. The Chase-Lev deque [4], [5] is one of the most commonly used implementations of such a deque. The original C11 implementation of the Chase-Lev deque [5] uses atomic thread fences without associated atomic operations. When compiling with a thread sanitizer, GCC 13 issues a warning saying that `atomic_thread_fence` is not supported with `-fsanitize=thread`. Thread sanitizers may produce false positives when atomic thread fences are used. For example, when using the Taskflow implementation of the work-stealing deque, the thread sanitizer detects data races in the solution suggested in this paper. The Taskflow implementation of the deque contains the following lines of code:
```cpp
std::atomic_thread_fence(std::memory_order_release);
_bottom[p].data.store(b + 1, std::memory_order_relaxed);
```
If `memory_order_relaxed` is replaced here by `memory_order_release`, the sanitizer stops detecting the data races. This might indicate a false positive related to the usage of `std::atomic_thread_fence`. It is worth noting that Taskflow unit tests and examples pass with the thread sanitizer even though `std::atomic_thread_fence` is used.
An example of a work-stealing deque implementation that does not use `std::atomic_thread_fence` can be found in Google Filament. When using the implementation from Google Filament, the thread sanitizer does not detect data races in the suggested solution.
Concurrent push and pop operations are not allowed in most implementations of the work-stealing deque. To ensure that there are no concurrent push and pop operations, mappings from thread IDs to queue indices are typically used: when a task is pushed to or popped from a queue, the correct queue is found using the current thread ID. Unlike this typical approach, the solution suggested in this paper uses a thread-local variable to find the correct task queue. This makes the solution not header-only, but that will not matter once modules from the C++20 standard are used. Unfortunately, at the time of writing this paper, compiler support does not yet seem sufficient to present the suggested solution in the form of a cross-platform module.
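A sketch of the thread-local approach (the names are hypothetical, not the library's actual internals):

```cpp
#include <cstddef>

// Defined in a .cpp file, which is why the solution is not header-only.
// Worker threads set this once at startup; external threads keep the
// invalid sentinel value.
constexpr std::size_t kInvalidQueueIndex = static_cast<std::size_t>(-1);
thread_local std::size_t current_queue_index = kInvalidQueueIndex;

// Each worker thread records the index of its own queue once.
void WorkerMain(const std::size_t queue_index) {
  current_queue_index = queue_index;
  // ... the worker loop then accesses queues[current_queue_index]
  // directly, with no thread-ID-to-index lookup ...
}
```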
To run task graphs, simple wrappers over an `std::function<void()>` are used. Each wrapper stores references to successor tasks and the number of uncompleted predecessor tasks. When the thread pool executes a task, it first executes the wrapped function. Then, for each successor task, it decrements the number of uncompleted predecessor tasks. One of the successor tasks for which the number of uncompleted predecessor tasks becomes zero is then executed on the same worker thread; the other successors whose counts reach zero are submitted to the same thread pool instance for execution.
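A sketch of such a wrapper (hypothetical names; the library's actual types may differ):

```cpp
#include <atomic>
#include <functional>
#include <vector>

// Hypothetical sketch of a task wrapper, not the library's actual type.
struct TaskNode {
  std::function<void()> function;
  std::vector<TaskNode*> successors;
  std::atomic<int> uncompleted_predecessors{0};
};

// Called by a worker thread after it runs node.function. Returns the
// successor (if any) to run next on the same worker thread; the remaining
// ready successors are submitted to the thread pool.
TaskNode* OnTaskCompleted(TaskNode& node, auto& thread_pool) {
  TaskNode* next = nullptr;
  for (TaskNode* successor : node.successors) {
    // fetch_sub returns the previous value; 1 means the count is now zero.
    if (successor->uncompleted_predecessors.fetch_sub(1) == 1) {
      if (next == nullptr) {
        next = successor;               // run the first ready successor in place
      } else {
        thread_pool.Submit(successor);  // hand the rest to the pool
      }
    }
  }
  return next;
}
```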
This project is licensed under the MIT License.
This project includes code from Google Filament, which is licensed under the Apache License 2.0. You can find the text of the Apache License 2.0 in the LICENSE-APACHE file.
[1] Chun-Xun Lin, Tsung-Wei Huang and Martin D. F. Wong, An Efficient Work-Stealing Scheduler for Task Dependency Graph
[2] Tsung-Wei Huang, Dian-Lun Lin, Chun-Xun Lin, and Yibo Lin, Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System
[3] Anthony Williams, C++ Concurrency in Action, 2nd Edition, Manning, 592 pages, 2019.
[4] David Chase and Yossi Lev. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA '05, pages 21-28, New York, NY, USA, 2005. ACM.
[5] N.M. Le, A. Pop, A. Cohen, and F. Zappa Nardelli. Correct and efficient work-stealing for weak memory models. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 69-80, New York, NY, USA, 2013. ACM.