diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.cpp
new file mode 100644
index 000000000..24cc178ac
--- /dev/null
+++ b/docs/source/API/core/Graph.axpby.kokkos.graph.cpp
@@ -0,0 +1,12 @@
+auto graph = Kokkos::Experimental::create_graph(exec_A, [&](auto root){
+    auto node_xpy = root.then_parallel_for(N, MyAxpby{x, y, alpha, beta});
+    auto node_zpy = root.then_parallel_for(N, MyAxpby{z, y, gamma, beta});
+
+    auto node_dotp = Kokkos::Experimental::when_all(node_xpy, node_zpy).then_parallel_reduce(
+        N, MyDotp{x, z}, dotp
+    );
+});
+
+graph.submit(exec_A);
+
+exec_A.fence();
diff --git a/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
new file mode 100644
index 000000000..3d129d2a4
--- /dev/null
+++ b/docs/source/API/core/Graph.axpby.kokkos.graph.p2300.cpp
@@ -0,0 +1,15 @@
+auto graph = Kokkos::construct_graph();
+
+auto node_xpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{x, y, alpha, beta}));
+auto node_zpy = Kokkos::then(graph, Kokkos::parallel_for(N, MyAxpby{z, y, gamma, beta}));
+
+auto node_dotp = Kokkos::then(
+    Kokkos::when_all(node_xpy, node_zpy),
+    Kokkos::parallel_reduce(N, MyDotp{x, z}, dotp)
+);
+
+graph.instantiate();
+
+graph.submit(exec_A);
+
+exec_A.fence();
diff --git a/docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp b/docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp
new file mode 100644
index 000000000..3789ba4d7
--- /dev/null
+++ b/docs/source/API/core/Graph.axpby.kokkos.vanilla.cpp
@@ -0,0 +1,8 @@
+Kokkos::parallel_for(policy_t(exec_A, 0, N), MyAxpby{x, y, alpha, beta});
+Kokkos::parallel_for(policy_t(exec_B, 0, N), MyAxpby{z, y, gamma, beta});
+
+exec_B.fence();
+
+Kokkos::parallel_reduce(policy_t(exec_A, 0, N), MyDotp{x, z}, dotp);
+
+exec_A.fence();
diff --git a/docs/source/API/core/Graph.rst b/docs/source/API/core/Graph.rst
index 1a9f3df6a..42beb0860 100644
--- a/docs/source/API/core/Graph.rst
+++
b/docs/source/API/core/Graph.rst
@@ -4,10 +4,10 @@ Graph and related
 Usage
 -----

-:code:`Kokkos::Graph` is an abstraction that can be used to define a group of asynchronous workloads that are organised as a direct acyclic graph.
-A :code:`Kokkos::Graph` is defined separatly from its execution, allowing it to be re-executed multiple times.
+:cppkokkos:`Kokkos::Graph` is an abstraction that can be used to define a group of asynchronous workloads that are organised as a directed acyclic graph.
+A :cppkokkos:`Kokkos::Graph` is defined separately from its execution, allowing it to be re-executed multiple times.

-:code:`Kokkos::Graph` is a powerful way of describing workload dependencies. It is also a good opportunity to present all workloads
+:cppkokkos:`Kokkos::Graph` is a powerful way of describing workload dependencies. It is also a good opportunity to present all workloads
 at once to the driver, and allow some optimizations [ref].

 .. note::
@@ -16,18 +16,18 @@ at once to the driver, and allow some optimizations [ref].
     For small workloads that need to be sumitted several times, it might save you some overhead
     [reference to some presentation / paper].

-:code:`Kokkos::Graph` is specialized for some backends:
+:cppkokkos:`Kokkos::Graph` is specialized for some backends:

-* :code:`Cuda`: [ref to vendor doc]
-* :code:`HIP`: [ref to vendor doc]
-* :code:`SYCL`: [ref to vendor doc] -> https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc
+* :cppkokkos:`Cuda`: [ref to vendor doc]
+* :cppkokkos:`HIP`: [ref to vendor doc]
+* :cppkokkos:`SYCL`: [ref to vendor doc] -> https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc

 For other backends, Kokkos provides a defaulted implementation [ref to file].

 Philosophy
 ----------

-As mentioned earlier, the :code:`Kokkos::Graph` is first defined, and then executed.
In fact, before the graph can be executed,
+As mentioned earlier, the :cppkokkos:`Kokkos::Graph` is first defined, and then executed. In fact, before the graph can be executed,
 it needs to be *instantiated*. During the *instantiation* phase, the topology of the graph is **locked**, and an
 *executable graph* is created.

@@ -40,53 +40,23 @@ In short, we have 3 phases:

 "Splitting command construction from execution is a proven solution." (https://www.iwocl.org/wp-content/uploads/iwocl-2023-Ewan-Crawford-4608.pdf)

-Basic example
--------------
-
-This example showcases how three workloads can be organised as a :code:`Kokkos::Graph`.
-
-Workloads A and B are independent, but workload C needs the completion of A and B.
-
-.. code-block:: cpp
-
-    int main()
-    {
-        auto graph = Kokkos::Experimental::create_graph([&](auto root) {
-            const auto node_A = root.then_parallel_for(...label..., ...policy..., ...body...);
-            const auto node_B = root.then_parallel_for(...label..., ...policy..., ...body...);
-            const auto ready = Kokkos::Experimental::when_all(node_A, node_B);
-            const auto node_C = ready.then_parallel_for(...label..., ...policy..., ...body...);
-        });
-
-        for(int irep = 0; irep < nrep; ++irep)
-            graph.submit();
-    }
-
-Advanced example
-----------------
-
-To be done soon.
-
-References
-----------
-
-* https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
-* https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md
-* https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/
-
-
 Use cases
 ---------

 Diamond with closure, don't care about `exec`
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Create a simple diamond-like graph within a closure, no caring about execution space instances.
+Create a simple diamond-like graph within a closure, not caring too much about execution space instances.

 This use case demonstrates how a graph can be created from inside a closure, and how it could look like in the future.
 It is a very simple use case.

-Note that I'm not sure why we should support the closure anyway.
+.. note::
+
+    I'm not sure why we should support the closure anyway. I don't see the benefits of forcing the
+    user to create the whole graph in there.
+
+    See :ref:`no_root_node` for discussion.

 .. graphviz::
     :caption: Diamond topology
@@ -99,9 +69,9 @@ Note that I'm not sure why we should support the closure anyway.
     }

 .. code-block:: c++
-    :caption: Current pseudo-code
+    :caption: Current `Kokkos` pseudo-code.

-    auto graph = Kokkos::create_graph([&](const auto& root){
+    auto graph = Kokkos::create_graph([&](auto root){
         auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...);

         auto node_B = node_A.then_parallel_...(...label..., ...policy..., ...functor...);
@@ -113,9 +83,9 @@ Note that I'm not sure why we should support the closure anyway.
     graph.submit()

 .. code-block:: c++
-    :caption: P2300 (but really I don't like that because `graph` itself is already a *sender*)
+    :caption: *à la* P2300 (but really I don't like that because `graph` itself is already a *sender*).

-    auto graph = Kokkos::create_graph([&](const auto& root){
+    auto graph = Kokkos::create_graph([&](auto root){
         auto node_A = then(root, parallel_...(...label..., ...policy..., ...functor...));

         auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
@@ -129,7 +99,7 @@ Note that I'm not sure why we should support the closure anyway.
 Diamond, caring about `exec`
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-Create a simple diamond-like graph, caring about execution space instances.
+Create a simple diamond-like graph, caring about execution space instances. No closure.

 This use case demonstrates how a graph can be created without a closure, and how it could look like in the future.
 It also focuses on where steps occur.
@@ -147,9 +117,9 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel
     }

 ..
code-block:: c++
-    :caption: Current pseudo-code
+    :caption: Current `Kokkos` pseudo-code.

-    auto graph = Kokkos::create_graph(exec_A, [&](const auto& root){});
+    auto graph = Kokkos::create_graph(exec_A, [&](auto root){});
     auto root = Kokkos::Impl::GraphAccess::create_root_node_ref(graph);

     auto node_A = root.then_parallel_...(...label..., ...policy..., ...functor...);
@@ -161,19 +131,17 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel

     graph.instantiate();
     exec_A.fence("The graph might make some async to-device copies.");
+
     graph.submit(exec_B);

 .. code-block:: c++
-    :caption: P2300 + defer when Kokkos performs internal async to-device copies
+    :caption: *à la* P2300, deferring `Kokkos` internal async to-device copies to the `instantiate` step.

-    // Step 1: define topology (no execution space instance required)
+    // Step 1: define graph topology (note that no execution space instance is required).
     auto graph = Kokkos::create_graph();

     auto node_A = then(graph, parallel_...(...label..., ...policy..., ...functor...));

-    // what happens to an exec space instance passed to the policy ? is it used somehow or just ignored ?
-    // when dispatching the driver to global memory, what exec space instance is used for the async copies ?
-
     auto node_B = then(node_A, parallel_...(...label..., ...policy..., ...functor...));
     auto node_C = then(node_A, parallel_...(...label..., ...policy..., ...functor...));

@@ -186,15 +154,17 @@ Graph topology is known at compile, thus enabling a lot of optimizations (kernel
     // Step 3: execute
     graph.submit(exec_B)

-No "root" node
-~~~~~~~~~~~~~~
+.. _no_root_node:

-Currently, the :code:`Kokkos::Graph` would expose to the user a "root node" concept that is not needed
+To root or not to root?
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Currently, the :cppkokkos:`Kokkos::Graph` API would expose to the user a "root node" concept that is not strictly needed
 by any backend (but might be needed by the default implementation that works with *sinks*).

-The "root node" might be confusing. It sould not appear in the API for 2 reasons:
+I think the "root node" might be confusing. IMO, it should not appear in the API for 2 reasons:

-1. It can be misleading, as the user might think it's necessary though I think it's an artifact of how :code:`Kokkos::Graph`
+1. It can be misleading, as the user might think it's necessary though I think it's an artifact of how :cppkokkos:`Kokkos::Graph`
    is currently implemented for graph construction, and because of the *sink*-based defaulted implementation.
 2. With P2300, it's clear that *root* is an empty useless sender that can be thrown away at compile time.

@@ -208,15 +178,15 @@ The "root node" might be confusing. It sould not appear in the API for 2 reasons
     }

 .. code-block:: c++
-    :caption: P2300
+    :caption: *à la* P2300.

-    auto graph = construct_graph();
+    auto graph = Kokkos::construct_graph();

-    auto A1 = then(graph, ...);
-    auto A2 = then(graph, ...);
-    auto A3 = then(graph, ...);
+    auto A1 = Kokkos::then(graph, Kokkos::parallel_...(...));
+    auto A2 = Kokkos::then(graph, Kokkos::parallel_...(...));
+    auto A3 = Kokkos::then(graph, Kokkos::parallel_...(...));

-    auto B = then(when_all(A1, A2, A3), ...);
+    auto B = Kokkos::then(Kokkos::when_all(A1, A2, A3), Kokkos::parallel_...(...));

 Complex DAG topology
 ~~~~~~~~~~~~~~~~~~~~
@@ -234,13 +204,13 @@ Any complex-but-valid DAG topology should work.
         A2 -> B1;
         A2 -> B3;
         A3 -> B4;
-
+
         B1 -> C1;
         B3 -> C1;
-
+
         B2 -> C2;
         B4 -> C2;
-
+
         // Enfore ordering of nodes with invisible edges.
         {
             rank = same;
@@ -255,59 +225,58 @@ Changing scheduler

 This is the purpose of PR https://github.com/kokkos/kokkos/pull/7249, and should be further documented.
-Towards https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html#design-sender-adaptor-starts_on.
+This is a step towards https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p2300r10.html#design-sender-adaptor-starts_on.

 .. code-block:: c++
+    :caption: *à la* P2300.

-    auto graph = construct()
-
-    auto node_1 = ...
+    // Step 1: construct.
+    auto graph = Kokkos::construct_graph();
+    auto node_1 = Kokkos::then(graph, ...);

     ...

+    // Step 2: instantiate.
     graph.instantiate();

+    // Step 3: execute, again and again.
     graph.submit(exec_A);
-
     ...
-
     graph.submit(exec_C);
-
     ...
-
     graph.submit(exec_D);

 Interoperability
 ~~~~~~~~~~~~~~~~

-Why interoperability matters (helps adoption of :code:`Kokkos::Graph`, extensibility, corner cases):
+Why interoperability matters (helps adoption of :cppkokkos:`Kokkos::Graph`, extensibility, corner cases):

-1. Attract users that already use some backend graph (*e.g.* `cudaGraph_t`) towards `Kokkos`. It helps them transition smoothly.
-2. Help user integrate backend-specific graph capabilities that are not part of the :code:`Kokkos::Graph` API for whatever reason.
+1. Attract users that already use some backend graph (*e.g.* :code:`cudaGraph_t`) towards `Kokkos`. It helps them transition smoothly.
+2. Help users integrate backend-specific graph capabilities that are not part of the :cppkokkos:`Kokkos::Graph` API for whatever reason.

 Since `Kokkos` might run some stuff linked to its internals at *instantiation* stage, and since in PR https://github.com/kokkos/kokkos/pull/7240
 we decided to ensure that before the submission, the graph needs to be instantiated in `Kokkos`, interoperability implies that the user
-passes through `Kokkos` for both *instantiation* and *submission*.
+relies on `Kokkos` for both *instantiation* and *submission*.

 .. graphviz::
-    :caption: Dark nodes/edges are added through :code:`Kokkos::Graph`.
+    :caption: Dark nodes/edges are added through the :cppkokkos:`Kokkos::Graph` API; the rest is pre-existing.

     digraph interoperability {

         A[color=darksalmon];
-
+
         B1[color=darksalmon];
         B2[color=darksalmon];
         B3[color=darksalmon];
-
+
         C3[color=darksalmon];

         A -> B1[color=darksalmon];
         A -> B2[color=darksalmon];
         A -> B3[color=darksalmon];
-
+
         B3 -> C3[color=darksalmon];
-
+
         // Enfore ordering of nodes with invisible edges.
         {
             rank = same;
@@ -315,50 +284,102 @@ passes through `Kokkos` for both *instantiation* and *submission*.
             B1 -> B2 -> B3 ;
             rankdir = LR;
         }
-
+
         B1 -> C1;
         B2 -> C1;
-
+
         C1 -> D1;
         C3 -> D1;
-    }
+    }

 .. code-block:: c++
-    :caption: interoperability pseudo-code P2300
+    :caption: Interoperability pseudo-code *à la* P2300.

+    // The user starts creating their graph with a backend API for some reason.
     cudaGraph_t graph;
     cudaGraphCreate(&graph, ...);

     cudaGraphNode_t A, B1, B2, B3, C3;

     ... create kernel nodes and add dependencies ...

-    auto kokkos_graph = construct(graph);
+    // But at some point, they want interoperability with Kokkos.
+    auto kokkos_graph = Kokkos::construct_graph(graph);

-    auto C1 = then(when_all(B1, B2), ...);
-    auto D1 = then(when_all(C1, C3), ...);
+    auto C1 = Kokkos::then(Kokkos::when_all(B1, B2), ...);
+    auto D1 = Kokkos::then(Kokkos::when_all(C1, C3), ...);

+    // The user is now bound to Kokkos for instantiation and submission.
     kokkos_graph.instantiate();
     kokkos_graph.submit();

 Graph update
 ~~~~~~~~~~~~

-From reading `Cuda`, `HIP` and `SYCL` documentations, all have some *executable graph update* mechanisms.
+From reading the :cppkokkos:`Cuda`, :cppkokkos:`HIP` and :cppkokkos:`SYCL` documentation, all three have some *executable graph update* mechanism.

-For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`, not in `HIP` yet) can support complex graphs that might slightly change from one submission to another.
+For instance, disabling a node from host (:code:`hipGraphNodeSetEnabled`) can support complex graphs that might slightly change from one submission to another.

 Updates to a graph will be scheduled after any in-flight executions of the same graph and will not affect previous submissions of the same graph.
 The user is not required to wait on any previous submissions of a graph before updating it.

-As the topology is fixed, we can only reasonably update kernel parameters.
+As the topology is fixed, we can only reasonably update kernel parameters or skip a node.
+
+.. graphviz::
+    :caption: An iterative loop that needs to seed under some condition (to be enhanced).
+
+    digraph graph_update {
+
+        S[label="start", shape=diamond];
+
+        A[label="seed"];
+        B[label="compute"];
+        C[label="solve"];
+
+        S -> A[color=green];
+
+        A -> B[color=green];
+
+        B -> C;
+
+        C -> S;
+
+        S -> B[color="red"];
+
+    }
+
+Iterative processes
+~~~~~~~~~~~~~~~~~~~

-Iterative process
------------------
+Plenty of opportunities for :cppkokkos:`Kokkos::Graph` to lean in:

-- iterative solver (our assembly case)
+- iterative solver
 - line search in optimization
+- you name it
+
+Let's take the `AXPBY` micro-benchmark from https://hihat.opencommons.org/images/1/1a/Kokkos-graphs-presentation.pdf:
+
+.. graphviz::
+    :caption: Two `AXPBY` operations followed by a dot product.
+
+    digraph axpby {
+        A[label="axpby"];
+        B[label="axpby"];
+        C[label="dotp"];
+        A->C;
+        B->C;
+    }
+
+.. literalinclude:: Graph.axpby.kokkos.vanilla.cpp
+    :language: c++
+    :caption: Vanilla `Kokkos`.
+.. literalinclude:: Graph.axpby.kokkos.graph.cpp
+    :language: c++
+    :caption: Current :cppkokkos:`Kokkos::Graph`.
+.. literalinclude:: Graph.axpby.kokkos.graph.p2300.cpp
+    :language: c++
+    :caption: *à la* P2300.

 They also use graphs...
 -----------------------
@@ -366,11 +387,29 @@ They also use graphs...
 * `PyTorch` https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/
 * `GROMACS` https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/

+Design choices
+--------------
+
+Questions we need to answer before going further in the :cppkokkos:`Graph` refactor.
+
+Dispatching
+~~~~~~~~~~~

-Homework
+- Do we allow node policies to have a user-provided execution space instance?
+- When does `Kokkos` make its to-device dispatching (*e.g.* to global memory)?

-- what does Kokkos during dispatching ? (HIP CUDA SYCL) Execution space instance from the policy, used or ignored ?
-- for each example 3 columns how to write it in CUDA SYCL P2300 Kokkos
-- développer l'update
-- essayer de démontrer qu'on peut écrire un seul code, et dire si on veut que ce soit un graph ou pas
-  (why it matters: write single source code , kokkos premise 'single source code')
\ No newline at end of file
+Write a single source code, but allow skipping backend graph
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We should be able to write a single source code and decide if we want the graph to map to the backend graph or just
+execute nodes.
+
+This would greatly benefit adoption, and respect the `Kokkos` single source code promise.
+
+References
+----------
+
+* https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
+* https://github.com/intel/llvm/blob/sycl/sycl/doc/syclgraph/SYCLGraphUsageGuide.md
+* https://developer.nvidia.com/blog/a-guide-to-cuda-graphs-in-gromacs-2023/
+* https://hihat.opencommons.org/images/1/1a/Kokkos-graphs-presentation.pdf