Skip to content

A simple and fast library allowing to run async tasks and execute task graphs.

License

MIT, Apache-2.0 licenses found

Licenses found

MIT
LICENSE
Apache-2.0
LICENSE-APACHE
Notifications You must be signed in to change notification settings

dpuyda/scheduling

Repository files navigation

Scheduling

License Standard Build arXiv:2407.15805

Scheduling is a simple, minimalistic and fast library allowing you to run async tasks and task graphs.

Scheduling is developed with simplicity and performance in mind.

Below are some GitHub projects related to thread pools and task graphs in which you may be interested:

Examples

We start with examples demonstrating how to run async tasks and task graphs.

Add Scheduling to your project

To add Scheduling to your project, you can use CMake. For example:

target_link_libraries(${PROJECT_NAME} PRIVATE scheduling)

Run async tasks

To run async tasks, create a ThreadPool instance. For example:

#include "scheduling/scheduling.hpp"
...
scheduling::ThreadPool thread_pool;

In the constructor, the ThreadPool class creates several worker threads that will be running in the background until the instance is destroyed. As an optional argument, the constructor of the ThreadPool class accepts the number of threads to create. By default, the number of threads is equal to std::thread::hardware_concurrency().

When the ThreadPool instance is created, submit a task. For example:

thread_pool.Submit([] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  std::cout << "Completed.\n";
});

A task is a function that does not accept arguments and returns void. Use lambda captures to pass input and output arguments to the task. Eventually, the task will be executed on one of the worker threads owned by the ThreadPool instance.

If needed, call Wait() to block the current thread until all submitted tasks are completed. For example:

thread_pool.Wait();

In the destructor, the ThreadPool class blocks the current thread until all submitted tasks are completed.

Build task graphs

A task graph is a collection of tasks and dependencies between them. Dependencies between tasks define the order in which the tasks should be executed.

To define a task graph, create an iterable collection of Task instances. For example:

#include "scheduling/scheduling.hpp"
...
std::vector<scheduling::Task> tasks;

The Task instances should not be destroyed or reallocated until their execution is completed by ThreadPool.

Add elements to tasks. For example, add tasks to calculate the value of (a + b) * (c + d) asynchronously. First, add tasks to get the values of a, b, c and d:

int a, b, c, d;

auto& get_a = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  a = 1;
});

auto& get_b = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  b = 2;
});

auto& get_c = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  c = 3;
});

auto& get_d = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  d = 4;
});

Next, add tasks to calculate a + b and c + d:

int sum_ab, sum_cd;

auto& get_sum_ab = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  sum_ab = a + b;
});

auto& get_sum_cd = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  sum_cd = c + d;
});

Finally, add the task to calculate the product (a + b) * (c + d):

int product;

auto& get_product = tasks.emplace_back([&] {
  std::this_thread::sleep_for(std::chrono::seconds(1));
  product = sum_ab * sum_cd;
});

When all tasks are added, define task dependencies. The task get_sum_ab should be executed after get_a and get_b:

get_sum_ab.Succeed(&get_a, &get_b);

Similarly, the task get_sum_cd should be executed after get_c and get_d:

get_sum_cd.Succeed(&get_c, &get_d);

Finally, the task get_product should be executed after get_sum_ab and get_sum_cd:

get_product.Succeed(&get_sum_ab, &get_sum_cd);

When dependencies between tasks are defined, create a ThreadPool instance and submit the task graph for execution:

scheduling::ThreadPool thread_pool;
thread_pool.Submit(tasks);

If needed, call Wait() to block the current thread until all submitted tasks are completed:

thread_pool.Wait();

Wait until a specific task is completed

If needed, use synchronization primitives to wait until a specific task is completed. For example:

scheduling::ThreadPool thread_pool;
std::atomic_flag is_completed;
thread_pool.Submit([&] {
  is_completed.test_and_set();
  is_completed.notify_one();
});
is_completed.wait(false);

Wait until a predicate is satisfied

Under certain scenarios, you might need to block the current thread and execute submitted tasks in a loop until a specified condition is met. For example, this may happen when you submit tasks recursively and a task cannot complete until some other tasks are completed. In this case, all worker threads may get blocked by tasks that are waiting for other tasks to complete. To keep executing tasks while waiting until a specified condition is met, use the Wait overload accepting a predicate.

For example, below is a recursive function to calculate Fibonacci numbers without memoization used in the benchmarks:

int Fibonacci(ThreadPool& thread_pool, const int n) {
  if (n < 2) {
    return 1;
  }
  int a, b;
  std::atomic counter{0};
  thread_pool.Submit([&, n] {
    a = Fibonacci(thread_pool, n - 1);
    counter.fetch_add(1);
  });
  thread_pool.Submit([&, n] {
    b = Fibonacci(thread_pool, n - 2);
    counter.fetch_add(1);
  });
  thread_pool.Wait([&] { return counter.load() == 2; });
  return a + b;
}

Define dependencies between task groups

To define dependencies between task groups, use empty tasks.

For example, assume that the tasks a and b should be executed before the tasks c and d. To ensure the order, create an empty task d which will follow a and b and precede c and d:

scheduling::Task d;
d.Succeed(&a, &b);
d.Precede(&c, &d);

Cancel a task

To cancel a task, call Task::Cancel().

Cancelling a task never fails. If Task::Cancel() returns false, this means that the task has been invoked earlier or will be invoked at least once after the cancellation. When a task is cancelled and will not be invoked anymore, its successors also will not be invoked.

For example:

scheduling::Task task;
thread_pool.Submit(&task);
if (task.Cancel()) {
  std::cout << "The task will not be invoked.\n";
}

To undo cancellation, call Task::Reset():

task.Reset();

Resubmit a task graph

You can submit a task graph multiple times, but you need to make sure that the graph is completed before resubmitting it. You cannot submit the same graph simultaneously from different threads.

Benchmarks

We compare Scheduling with Taskflow, which is a highly optimized library for parallel and heterogeneous programming [1], [2]. We use Google Benchmark for benchmarking. Comparison of Taskflow with other popular libraries for async programming can be found in Taskflow documentation.

We measure the total time needed to create and execute tasks. The benchmarks were run on 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, 2083 Mhz, 4 Cores, 8 Logical Processors running Windows 10.0.19045:

Run on (8 X 2803 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 1280 KiB (x4)
  L3 Unified 12288 KiB (x1)

Fibonacci numbers

To benchmark running async tasks using the ThreadPool class, we use the Taskflow example to calculate Fibonacci numbers recursively without memoization.

The charts below demonstrate comparison between the two libraries:

Wall time CPU time

Below is a sample output of Google Benchmark:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
scheduling/fibonacci/25       28.0 ms         13.4 ms           50
scheduling/fibonacci/26       31.6 ms         11.8 ms           45
scheduling/fibonacci/27       64.0 ms         43.9 ms           16
scheduling/fibonacci/28       97.4 ms         78.1 ms           11
scheduling/fibonacci/29        134 ms         99.0 ms            6
scheduling/fibonacci/30        228 ms          184 ms            4
scheduling/fibonacci/31        363 ms          349 ms            3
scheduling/fibonacci/32        547 ms          539 ms            2
scheduling/fibonacci/33        854 ms          656 ms            1
scheduling/fibonacci/34       1349 ms         1359 ms            1
scheduling/fibonacci/35       2203 ms         2188 ms            1
taskflow/fibonacci/25         49.0 ms         31.2 ms           22
taskflow/fibonacci/26         80.0 ms         49.7 ms           11
taskflow/fibonacci/27          129 ms         89.3 ms            7
taskflow/fibonacci/28          212 ms          144 ms            5
taskflow/fibonacci/29          343 ms          234 ms            3
taskflow/fibonacci/30          536 ms          383 ms            2
taskflow/fibonacci/31          982 ms          625 ms            1
taskflow/fibonacci/32         1506 ms         1016 ms            1
taskflow/fibonacci/33         2540 ms         1734 ms            1

Linear chain

To benchmark the simplest example of a task graph, we use the Taskflow linear chain benchmark. In this benchmark, we create a task graph where all tasks are executed sequentially, each task incrementing an integer.

The charts below demonstrate comparison between the two libraries:

Wall time CPU time

Below is a sample output of Google Benchmark:

---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
scheduling/linear_chain/1048576         156 ms          116 ms            5
scheduling/linear_chain/2097152         297 ms          229 ms            3
scheduling/linear_chain/4194304         519 ms          391 ms            2
scheduling/linear_chain/8388608        1064 ms          828 ms            1
scheduling/linear_chain/16777216       2173 ms         1688 ms            1
scheduling/linear_chain/33554432       4431 ms         3453 ms            1
taskflow/linear_chain/1048576           158 ms          125 ms            6
taskflow/linear_chain/2097152           314 ms          258 ms            2
taskflow/linear_chain/4194304           875 ms          750 ms            1
taskflow/linear_chain/8388608          1748 ms         1531 ms            1
taskflow/linear_chain/16777216         3714 ms         3219 ms            1
taskflow/linear_chain/33554432         8137 ms         7219 ms            1

Matrix multiplication

To benchmark a more involved example of a task graph, we use the Taskflow matrix multiplication benchmark.

The charts below demonstrate comparison between the two libraries:

Wall time CPU time

Below is a sample output of Google Benchmark:

--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
scheduling/matrix_multiplication/128        2.22 ms        0.719 ms         1000
scheduling/matrix_multiplication/256        11.1 ms        0.969 ms         1000
scheduling/matrix_multiplication/512         102 ms         1.41 ms          100
scheduling/matrix_multiplication/1024        961 ms         1.56 ms           10
scheduling/matrix_multiplication/2048      10461 ms         15.6 ms            1
taskflow/matrix_multiplication/128          1.88 ms        0.393 ms         1948
taskflow/matrix_multiplication/256          11.6 ms        0.732 ms          640
taskflow/matrix_multiplication/512           100 ms        0.938 ms          100
taskflow/matrix_multiplication/1024          960 ms         9.38 ms           10
taskflow/matrix_multiplication/2048        10443 ms         31.2 ms            1

Implementation details

In this section, we briefly describe some implementation details of the Scheduling library.

Chase-Lev deque

The idea of work-stealing queues is to provide each worker thread with its own task queue to reduce thread contention [3]. When a task is submitted, it is pushed to one of the queues. The thread owning the queue can eventually pick up the task and execute it. If there are no tasks in the queue owned by a worker thread, the thread attempts to steal a task from another queue.

Work-stealing queues are typically implemented as lock-free deques. The owning thread pops elements at one end of the deque, while other threads steal elements at the other end.

Implementing a work-stealing deque is not an easy task. The Chase-Lev deque [4], [5] is one of the most commonly used implementations of such a deque. The original C11 implementation of the Chase-Lev deque [5] uses atomic thread fences without associated atomic operations. When compiling with a thread sanitizer, GCC 13 issues a warning saying that atomic_thread_fence is not supported with -fsanitize=thread. Thread sanitizers may produce false positives when atomic thread fences are used. For example, when using the Taskflow implementation of the work-stealing deque, the thread sanitizer detects data races in the solution suggested in this paper. The Taskflow implementation of the deque contains the following lines of code:

std::atomic_thread_fence(std::memory_order_release);
_bottom[p].data.store(b + 1, std::memory_order_relaxed);

If memory_order_relaxed is replaced here by memory_order_release, the sanitizer stops detecting the data races. This might indicate a false positive related to the usage of std::atomic_thread_fence. It is worth noting that Taskflow unit tests and examples pass with the thread sanitizer even though std::atomic_thread_fence is used.

An example of a work-stealing deque implementation that does not use std::atomic_thread_fence can be found in Google Filament. When using the implementation from Google Filament, the thread sanitizer does not detect data races in the suggested solution.

Concurrent push and pop operations are not allowed in most implementations of the work-stealing deque. To ensure that there are no concurrent push and pop operations, mappings from thread ID to queue indices are typically used. When a task is pushed to or popped from a queue, the correct queue is usually found using the current thread ID. Unlike this typical approach, the solution suggested in this paper uses a thread-local variable to find the correct task queue. It makes the solution not header-only, but this will not matter once modules from the C++20 standard are used. Unfortunately, at the time of writing this paper, there seems to be not enough compiler support to present the suggested solution in the form of a cross-platform module.

Task graphs

To run task graphs, simple wrappers over an std::function<void()> are used. Each wrapper stores references to successor tasks and the number of uncompleted predecessor tasks. When the thread pool executes a task, it first executes the wrapped function. Then, for each successor task, it decrements the number of uncompleted predecessor tasks. One of the successor tasks, for which the number of uncompleted predecessor tasks becomes equal to zero, is then executed on the same worker thread. Other successor tasks, for which the number of uncompleted predecessor tasks becomes equal to zero, are submitted to the same thread pool instance for execution.

License

This project is licensed under the MIT License.

Third-party licenses

This project includes code from Google Filament, which is licensed under the Apache License 2.0. You can find the text of the Apache License 2.0 in the LICENSE-APACHE file.

References

[1] Chun-Xun Lin, Tsung-Wei Huang and Martin D. F. Wong, An Efficient Work-Stealing Scheduler for Task Dependency Graph

[2] Tsung-Wei Huang, Dian-Lun Lin, Chun-Xun Lin, and Yibo Lin, Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

[3] Anthony Williams, C++ Concurrency in Action, 2nd Edition, Manning, 592 pages, 2019.

[4] David Chase and Yossi Lev. Dynamic circular work-stealing deque. In Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, SPAA '05, pages 21-28, New York, NY, USA, 2005. ACM.

[5] N.M. Le, A. Pop, A. Cohen, and F. Zappa Nardelli. Correct and efficient work-stealing for weak memory models. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 69-80, New York, NY, USA, 2013. ACM.

About

A simple and fast library allowing to run async tasks and execute task graphs.

Topics

Resources

License

MIT, Apache-2.0 licenses found

Licenses found

MIT
LICENSE
Apache-2.0
LICENSE-APACHE

Stars

Watchers

Forks

Releases

No releases published

Packages