Add support for host operations to executor #1232
Conversation
I like the approach of having the function as a kernel instead of in the core, but does there need to be a separate register operation? Why not use exec->get_master()->run(...)?
```diff
 {
     using matrix_type = matrix::Csr<ValueType, IndexType>;
     const auto exec = mtx->get_executor();
     const auto host_exec = exec->get_master();
-    const auto forest = compute_elim_forest(mtx);
+    std::unique_ptr<elimination_forest<IndexType>> forest;
+    exec->run(make_compute_elim_forest(mtx, forest));
```
Is there some drawback in doing exec->get_master()->run(...) instead of having a separate call? Then you don't need the GKO_REGISTER_HOST_OPERATION at all, right?
It's not being logged with the executor the object lives on, which is also the reason that we need the workaround in the benchmarks. With the suggested approach, we'd need to pass an additional Executor parameter and provide overloads for both OmpExecutor and ReferenceExecutor, and put them inside the kernels namespace, which IMO isn't a good place to put core functionality.
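To make that alternative concrete, here is a rough, self-contained sketch (stand-in types, not actual Ginkgo code) of what dispatching per host executor type inside the kernels namespace would look like:

```cpp
#include <iostream>
#include <memory>

// stand-ins for the host executor types
struct ReferenceExecutor {};
struct OmpExecutor {};

namespace kernels {
namespace reference {
void compute_something(std::shared_ptr<const ReferenceExecutor>, int& out)
{
    out = 1;  // sequential host implementation
}
}  // namespace reference
namespace omp {
void compute_something(std::shared_ptr<const OmpExecutor>, int& out)
{
    out = 1;  // the same implementation, duplicated for the OpenMP executor
}
}  // namespace omp
}  // namespace kernels

int main()
{
    auto ref = std::make_shared<const ReferenceExecutor>();
    int result = 0;
    kernels::reference::compute_something(ref, result);
    std::cout << result << "\n";
}
```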
In my opinion, anything computationally expensive shouldn't be happening on the host. It should be on the kernel side. That is also in line with the Ginkgo philosophy of kernels doing the hard work and the host orchestrating the calls. It is perfectly fine if these kernels are then serial (with Reference or with single-threaded OpenMP), but I don't think we should be doing anything intensive on the host.
The operations we are looking at here (symbolic factorizations, elimination tree, later on MC64, AMD, METIS, ...) are either inherently sequential or very hard to parallelize (see e.g. RCM), let alone implement on GPUs. So either we duplicate the same implementation for OpenMP and Reference and call the kernels for host data only, or we have just a single place where they are implemented, in this case core. I think the latter is the more sensible choice here.
I agree that the operations are sequential and probably not suited for GPUs. My point is that they perform computations which are potentially expensive (as the problem size scales up), and hence they should not be on the core side. Regarding the duplication, we can just have a common kernel file (like we do for HIP and CUDA) and have the serial implementation for OpenMP (forcing it onto one thread) and Reference alike.
I think performing computation on the host side can have implications for future design. For example, if we want to add some stream/asynchronous tasking for these operations, performing them on the kernel side would allow us to keep the asynchronous tasking/streaming kernel interface and keep the core side clean.
IMO avoiding this also keeps our logging and profiling interface clean, without fragmenting it into host operations and kernel operations.
I think we need to unpack the meaning of core and kernel a bit more here: core is everything in libginkgo.so, kernel is everything in libginkgo_<backend>.so. You are talking about a conceptual separation that IMO doesn't apply here - we already have some expensive operations in core, mainly factorization-related. The only thing this PR does is make them visible to profilers regardless of the executor the objects live on. The logging interface is not impacted by this change, since to the loggers, host and kernel operations look the same.
If we were to make them kernel operations, we would have to do a lot more work: kernels can't create Ginkgo objects, so we would need to operate on arrays and combine them into Ginkgo objects on the core side again. As a minor point, it also doubles the compilation time and binary size of the resulting functions.
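A minimal self-contained sketch of the mechanism being described (stand-in classes, not Ginkgo's actual API): wrapping a host-side function in an operation object means the executor's loggers see it exactly like a device kernel launch, regardless of which executor the data lives on.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <vector>

struct Logger {
    void on_operation_launched(const std::string& name)
    {
        std::cout << "operation launched: " << name << "\n";
    }
};

struct Operation {
    std::string name;
    std::function<void()> work;
};

struct Executor {
    std::vector<Logger*> loggers;

    void run(const Operation& op)
    {
        for (auto* logger : loggers) {
            logger->on_operation_launched(op.name);
        }
        op.work();  // for a host operation, this simply runs on the CPU
    }
};

int main()
{
    Logger profiler;
    Executor exec;  // stand-in for the executor the objects live on
    exec.loggers.push_back(&profiler);

    // a "host operation": expensive sequential work wrapped like a kernel launch
    Operation symbolic_factorization{"compute_elim_forest", [] {
        // ... sequential host-side computation would go here ...
    }};
    exec.run(symbolic_factorization);  // visible to the profiler
}
```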
Yes, I would advocate for making them kernel operations, so this might be something we discuss before merging this PR.
Allocations should not happen inside an operation, and I think that is all the more reason to have allocations on the host side and do the operations on the kernel side. We log and profile the allocations separately anyway, so having the allocations inside the host operation makes the profiling and logging muddled.
We allocate a lot inside kernels, e.g. SpGEMM/SpGEAM, sparselib bindings, all kernels that use temporary memory, ...; removing these allocations or moving them into core would complicate the control flow significantly.
Codecov Report
Base: 91.55% // Head: 91.51% // Decreases project coverage by -0.04%.
Additional details and impacted files:
```
@@            Coverage Diff             @@
##           develop    #1232      +/-   ##
===========================================
- Coverage    91.55%   91.51%    -0.04%
===========================================
  Files          556      556
  Lines        47418    47455       +37
===========================================
+ Hits         43413    43429       +16
- Misses        4005     4026       +21
```
☔ View full report at Codecov.
It seems to be a bit annoying that this approach doesn't allow for return values of the host kernels. Perhaps that could also be managed by adding a run_host to Executor (or just a run overload) that takes a HostOperation instead. I'm thinking of something like: https://godbolt.org/z/TTd71zKnq. With the appropriate macros, that should be as simple as what we have right now. In C++17 this could be simpler, see https://stackoverflow.com/a/53673648
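A self-contained C++17 sketch along these lines (not the linked snippet, and not Ginkgo's actual API): a templated helper wraps a value-returning callable into a void operation, runs it through the existing run() entry point, and hands the result back to the caller. Only default-constructible result types are handled here.

```cpp
#include <functional>
#include <iostream>
#include <type_traits>
#include <utility>

class Executor {
public:
    // existing, non-templated entry point: operations return nothing
    void run(const std::function<void()>& op) { op(); }

    // hypothetical overload: capture the result of a value-returning callable
    // inside a void operation, so logging/profiling still sees a single operation
    template <typename Callable, typename... Args>
    auto run_host(Callable&& fn, Args&&... args)
    {
        using result_type = std::invoke_result_t<Callable, Args&&...>;
        result_type result{};
        this->run([&] { result = fn(std::forward<Args>(args)...); });
        return result;
    }
};

int main()
{
    Executor exec;
    auto sum = exec.run_host([](int a, int b) { return a + b; }, 20, 22);
    std::cout << sum << "\n";  // prints 42
}
```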
Yes, I agree this is a bit unergonomic. Do you see a way to make that work, considering that template functions can't be virtual?
Which virtual function do you refer to,
Having an overloaded function that is partially virtual and partially templated seems a bit dangerous, so I think we should choose only one of the two. Note that there is no such thing as a HostOperation ATM, as any potential later addition to what constitutes an Operation would need to be mirrored with HostOperation (e.g. work estimates, relocation and asynchronous execution for runtime systems). If we added return types, I would ideally like to support them both for host and device operations, and have them look as similar as possible to each other.
We could use
```cpp
 *
 * @ingroup Executor
 */
#define GKO_REGISTER_HOST_OPERATION(_name, _kernel) \
```
With this, will you still need to add the logger to host_executor?
No, but if the host operation calls kernels on the host side, they would not be visible in the benchmark runtime breakdown.
I will not hold up this PR anymore, because it is a fairly small change, and even though it adds the interface for host operations, I think removing these would require updating code in develop. Additionally, as @upsj needs it for benchmarking purposes, from my side we can probably merge this once approved. I have created #1240 to document what I propose.
LGTM. The out-parameters are annoying, but that is what we have to use for now. I think we could try to tackle kernels with return values later on. Although, now that I think of it, I don't think that returning non-trivial types from kernels is possible: in most cases, a kernel can't create a Ginkgo object, because the constructors are implemented in core. But for host operations, that should still work.
@upsj Before I approve it,
Force-pushed from a9ff5ef to 995f9e2
Force-pushed from 995f9e2 to 4852cef
@yhmtsai now I have, thanks for the reminder!
LGTM, only a minor format change.
Co-authored-by: Yuhsiang M. Tsai <[email protected]>
Note: This PR changes the Ginkgo ABI. For details, check the full ABI diff under Artifacts here.
Kudos, SonarCloud Quality Gate passed!
Release 1.6.0 of Ginkgo

The Ginkgo team is proud to announce the new Ginkgo minor release 1.6.0. This release brings new features such as:
- Several building blocks for GPU-resident sparse direct solvers like symbolic and numerical LU and Cholesky factorization, ...,
- A distributed Schwarz preconditioner,
- New FGMRES and GCR solvers,
- Distributed benchmarks for the SpMV operation, solvers, ...
- Support for non-default streams in the CUDA and HIP backends,
- Mixed precision support for the CSR SpMV,
- A new profiling logger which integrates with NVTX, ROCTX, TAU and VTune to provide internal Ginkgo knowledge to most HPC profilers!

and much more.

If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions).

Supported systems and requirements:
+ For all platforms, CMake 3.13+
+ C++14 compliant compiler
+ Linux and macOS
  + GCC: 5.5+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple Clang: 14.0 is tested. Earlier versions might also work.
  + NVHPC: 22.7+
  + Cray Compiler: 14.0.1+
  + CUDA module: CUDA 9.2+ or NVHPC 22.7+
  + HIP module: ROCm 4.5+
  + DPC++ module: Intel OneAPI 2021.3+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW: GCC 5.5+
  + Microsoft Visual Studio: VS 2019+
  + CUDA module: CUDA 9.2+, Microsoft Visual Studio
  + OpenMP module: MinGW.

### Version Support Changes
+ ROCm 4.0+ -> 4.5+ after [#1303](#1303)
+ Removed Cygwin pipeline and support [#1283](#1283)

### Interface Changes
+ Due to internal changes, `ConcreteExecutor::run` will now always throw if the corresponding module for the `ConcreteExecutor` is not built [#1234](#1234)
+ The constructor of `experimental::distributed::Vector` was changed to only accept local vectors as `std::unique_ptr` [#1284](#1284)
+ The default parameters for the `solver::MultiGrid` were improved. In particular, the smoother defaults to one iteration of `Ir` with `Jacobi` preconditioner, and the coarse grid solver uses the new direct solver with LU factorization. [#1291](#1291) [#1327](#1327)
+ The `iteration_complete` event gained a more expressive overload with additional parameters, the old overloads were deprecated. [#1288](#1288) [#1327](#1327)

### Deprecations
+ Deprecated less expressive `iteration_complete` event. Users are advised to now implement the function `void iteration_complete(const LinOp* solver, const LinOp* b, const LinOp* x, const size_type& it, const LinOp* r, const LinOp* tau, const LinOp* implicit_tau_sq, const array<stopping_status>* status, bool stopped)` [#1288](#1288)

### Added Features
+ A distributed Schwarz preconditioner. [#1248](#1248)
+ A GCR solver [#1239](#1239)
+ Flexible Gmres solver [#1244](#1244)
+ Enable Gmres solver for distributed matrices and vectors [#1201](#1201)
+ An example that uses Kokkos to assemble the system matrix [#1216](#1216)
+ A symbolic LU factorization allowing the `gko::experimental::factorization::Lu` and `gko::experimental::solver::Direct` classes to be used for matrices with non-symmetric sparsity pattern [#1210](#1210)
+ A numerical Cholesky factorization [#1215](#1215)
+ Symbolic factorizations in host-side operations are now wrapped in a host-side `Operation` to make their execution visible to loggers. This means that profiling loggers and benchmarks are no longer missing a separate entry for their runtime [#1232](#1232)
+ Symbolic factorization benchmark [#1302](#1302)
+ The `ProfilerHook` logger allows annotating the Ginkgo execution (apply, operations, ...) for profiling frameworks like NVTX, ROCTX and TAU. [#1055](#1055)
+ `ProfilerHook::created_(nested_)summary` allows the generation of a lightweight runtime profile over all Ginkgo functions written to a user-defined stream [#1270](#1270) for both host and device timing functionality [#1313](#1313)
+ It is now possible to enable host buffers for MPI communications at runtime even if the compile option `GINKGO_FORCE_GPU_AWARE_MPI` is set. [#1228](#1228)
+ A stencil matrices generator (5-pt, 7-pt, 9-pt, and 27-pt) for benchmarks [#1204](#1204)
+ Distributed benchmarks (multi-vector blas, SpMV, solver) [#1204](#1204)
+ Benchmarks for CSR sorting and lookup [#1219](#1219)
+ A timer for MPI benchmarks that reports the longest time [#1217](#1217)
+ A `timer_method=min|max|average|median` flag for benchmark timing summary [#1294](#1294)
+ Support for non-default streams in CUDA and HIP executors [#1236](#1236)
+ METIS integration for nested dissection reordering [#1296](#1296)
+ SuiteSparse AMD integration for fillin-reducing reordering [#1328](#1328)
+ Csr mixed-precision SpMV support [#1319](#1319)
+ A `with_loggers` function for all `Factory` parameters [#1337](#1337)

### Improvements
+ Improve naming of kernel operations for loggers [#1277](#1277)
+ Annotate solver iterations in `ProfilerHook` [#1290](#1290)
+ Allow using the profiler hooks and inline input strings in benchmarks [#1342](#1342)
+ Allow passing smart pointers in place of raw pointers to most matrix functions. This means that things like `vec->compute_norm2(x.get())` or `vec->compute_norm2(lend(x))` can be simplified to `vec->compute_norm2(x)` [#1279](#1279) [#1261](#1261)
+ Catch overflows in prefix sum operations, which makes Ginkgo's operations much less likely to crash. This also improves the performance of the prefix sum kernel [#1303](#1303)
+ Make the installed GinkgoConfig.cmake file relocatable and follow more best practices [#1325](#1325)

### Fixes
+ Fix OpenMPI version check [#1200](#1200)
+ Fix the mpi cxx type binding by c binding [#1306](#1306)
+ Fix runtime failures for one-sided MPI wrapper functions observed on some OpenMPI versions [#1249](#1249)
+ Disable thread pinning with GPU executors due to poor performance [#1230](#1230)
+ Fix hwloc version detection [#1266](#1266)
+ Fix PAPI detection in non-implicit include directories [#1268](#1268)
+ Fix PAPI support for newer PAPI versions [#1321](#1321)
+ Fix pkg-config file generation for library paths outside prefix [#1271](#1271)
+ Fix various build failures with ROCm 5.4, CUDA 12, and OneAPI 6 [#1214](#1214), [#1235](#1235), [#1251](#1251)
+ Fix incorrect read for skew-symmetric MatrixMarket files with explicit diagonal entries [#1272](#1272)
+ Fix handling of missing diagonal entries in symbolic factorizations [#1263](#1263)
+ Fix segmentation fault in benchmark matrix construction [#1299](#1299)
+ Fix the stencil matrix creation for benchmarking [#1305](#1305)
+ Fix the additional residual check in IR [#1307](#1307)
+ Fix the cuSPARSE CSR SpMM issue on single strided vector when cuda >= 11.6 [#1322](#1322) [#1331](#1331)
+ Fix Isai generation for large sparsity powers [#1327](#1327)
+ Fix Ginkgo compilation and test with NVHPC >= 22.7 [#1331](#1331)
+ Fix Ginkgo compilation of 32 bit binaries with MSVC [#1349](#1349)
This adds the `GKO_REGISTER_HOST_OPERATION` macro, which can be used to annotate a costly `core`-side computation for benchmarking and (later on, once #1036 is merged) profiling. Additionally, it adds the `OperationLogger` to the host executor in GPU executions, which will make host-side operations visible in detailed benchmark results.

TODO:
Closes #1211
Closes #1113
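A hedged usage sketch assembled from the diff snippet earlier in this thread; the registration call, the kernel name, and the surrounding function are illustrative and may not match the final Ginkgo code exactly:

```cpp
// Hypothetical usage of the new macro, following the pattern from the diff above.
// compute_elim_forest is assumed to be an existing core-side free function.
GKO_REGISTER_HOST_OPERATION(compute_elim_forest, compute_elim_forest);

template <typename ValueType, typename IndexType>
void symbolic_cholesky_sketch(const matrix::Csr<ValueType, IndexType>* mtx)
{
    const auto exec = mtx->get_executor();
    std::unique_ptr<elimination_forest<IndexType>> forest;
    // executes on the host, but is logged/profiled on exec like any other operation
    exec->run(make_compute_elim_forest(mtx, forest));
}
```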