
Add support for host operations to executor #1232

Merged 3 commits into develop from host_operations on Dec 18, 2022

Conversation

@upsj (Member) commented Dec 8, 2022

This adds the GKO_REGISTER_HOST_OPERATION macro which can be used to annotate a costly core-side computation for benchmarking and (later on, once #1036 is merged) profiling. Additionally, it adds the OperationLogger to the host executor in GPU executions, which will make host-side operations visible in detailed benchmark results.

TODO:

  • Add tests

Closes #1211
Closes #1113
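
For illustration, a minimal sketch of how the macro might be used. The names compute_elim_forest, elimination_forest and make_compute_elim_forest are taken from the diff snippets further down in this conversation, and the macro is assumed to mirror GKO_REGISTER_OPERATION in generating a make_<name>(...) helper that wraps the core-side function in a loggable operation:

// assumed: <ginkgo/ginkgo.hpp> plus the core-internal header declaring
// elimination_forest and compute_elim_forest are included
GKO_REGISTER_HOST_OPERATION(compute_elim_forest, compute_elim_forest);

template <typename ValueType, typename IndexType>
void symbolic_example(const gko::matrix::Csr<ValueType, IndexType>* mtx,
                      std::unique_ptr<elimination_forest<IndexType>>& forest)
{
    const auto exec = mtx->get_executor();
    // the computation runs on the host (core side), but is dispatched through
    // the executor the matrix lives on, so OperationLogger and profiler hooks
    // attached to that executor see it like any other kernel operation
    exec->run(make_compute_elim_forest(mtx, forest));
}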

@upsj added the 1:ST:ready-for-review label on Dec 8, 2022
@upsj requested a review from a team on December 8, 2022 18:44
@upsj self-assigned this on Dec 8, 2022
@ginkgo-bot added the labels mod:core, mod:reference, reg:benchmarking, reg:testing, type:factorization, type:preconditioner, type:solver on Dec 8, 2022
@pratikvn (Member) left a comment:

I like the approach of having the function as a kernel instead of in the core, but does there need to be a separate register operation? Why not use exec->get_master()->run(...)?

 {
     using matrix_type = matrix::Csr<ValueType, IndexType>;
     const auto exec = mtx->get_executor();
     const auto host_exec = exec->get_master();
-    const auto forest = compute_elim_forest(mtx);
+    std::unique_ptr<elimination_forest<IndexType>> forest;
+    exec->run(make_compute_elim_forest(mtx, forest));
Member commented:

Is there some drawback to doing exec->get_master()->run(...) instead of having a separate call? Then you don't need the GKO_REGISTER_HOST_OPERATION at all, right?

@upsj (Member, author) replied Dec 8, 2022:

With exec->get_master()->run(...), the operation is not logged with the executor the object lives on, which is also the reason we need the workaround in the benchmarks. With the suggested approach, we'd need to pass an additional Executor parameter and provide overloads for both OmpExecutor and ReferenceExecutor and put them inside the kernels namespace, which IMO isn't a good place to put core functionality.
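
(Rough sketch of the two call patterns being compared here; make_some_host_kernel is purely hypothetical, the rest follows the snippet above:)

// suggested alternative: dispatch the work as a regular kernel on the master
// executor; it is then logged against the host executor rather than exec, and
// needs OmpExecutor/ReferenceExecutor overloads in the kernels namespace
exec->get_master()->run(make_some_host_kernel(mtx, forest));

// this PR: register the core-side function as a host operation and dispatch
// it through the executor the object lives on, so it is logged alongside the
// device kernels
exec->run(make_compute_elim_forest(mtx, forest));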

Member replied:

In my opinion, anything computationally expensive shouldn't happen on the host; it should be on the kernel side. That is also in line with the Ginkgo philosophy of kernels doing the hard work and the host orchestrating the calls. It is perfectly fine if these kernels are then serial (with reference or with single-threaded OpenMP), but I don't think we should be doing anything intensive on the host.

@upsj (Member, author) replied:

The operations we are looking at here (symbolic factorizations, elimination tree, later on MC64, AMD, METIS, ...) are either inherently sequential or very hard to parallelize (see e.g. RCM), let alone implement on GPUs, so either we duplicate the same implementation to OpenMP and Reference and call the kernels for host data only, or we have just a single place where they are implemented, in this case core. I think the latter is the more sensible choice here.

Member replied:

I agree that the operations are sequential and probably not suited for GPUs. My point is that they perform computations which are potentially expensive (with scaled up problem size) and hence they should not be on the core side. Regarding the duplication, we can just have a common kernel file (like we do for HIP and CUDA) and have the serial implementation on OpenMP (forcing it on one thread) and reference alike.

I think performing computation on the host side can have implications for future design. For example, if we want to add some stream/asynchronous tasking for these operations. Performing these operations on the kernel side would allow us to keep the asynchronous tasking/streaming kernel interface and keep the core side clean.

IMO avoiding this also keeps our logging and profiling interface clean, without fragmenting it into host operations and kernel operations.

@upsj (Member, author) replied Dec 9, 2022:

I think we need to unpack the meaning of core and kernel a bit more here. core is everything in libginkgo.so, kernel is everything in libginkgo_<backend>.so. You are talking about a conceptual separation that IMO doesn't apply here - we already have some expensive operations in core, mainly factorization-related. The only thing this PR does is make them visible to profilers regardless of the executor the objects live on. The logging interface is not impacted by this change, since to the loggers, host and kernel operations look the same.

If we were to make them kernel operations, we would have to do a lot more work: kernels can't create Ginkgo objects, so we would need to operate on arrays and combine them into Ginkgo objects on the core side again. As a minor point, it would also double the compilation time and binary size of the resulting functions.

Member replied:

Yes, I would advocate for making them kernel operations, so this might be something we discuss before merging this PR.

Allocations should not happen inside an operation, and I think that is all the more reason to have allocations on the host side and do the operations on the kernel side. We log and profile the allocations separately anyway, so having the allocations inside the host operation makes the profiling and logging muddled.

@upsj (Member, author) replied:

We allocate a lot inside kernels, e.g. SpGEMM/SpGEAM, sparselib bindings, all kernels that use temporary memory, ...; removing these allocations or moving them into core would complicate the control flow significantly.

@codecov (bot) commented Dec 9, 2022

Codecov Report

Base: 91.55% // Head: 91.51% // Decreases project coverage by -0.03% ⚠️

Coverage data is based on head (9373771) compared to base (5b2f7a0).
Patch coverage: 77.89% of modified lines in pull request are covered.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1232      +/-   ##
===========================================
- Coverage    91.55%   91.51%   -0.04%     
===========================================
  Files          556      556              
  Lines        47418    47455      +37     
===========================================
+ Hits         43413    43429      +16     
- Misses        4005     4026      +21     
Impacted Files Coverage Δ
include/ginkgo/core/base/executor.hpp 71.28% <ø> (ø)
core/test/factorization/elimination_forest.cpp 85.29% <50.00%> (-14.71%) ⬇️
reference/test/factorization/lu_kernels.cpp 97.70% <50.00%> (-2.30%) ⬇️
reference/test/factorization/cholesky_kernels.cpp 90.99% <68.75%> (-9.01%) ⬇️
test/factorization/cholesky_kernels.cpp 92.85% <71.42%> (-7.15%) ⬇️
core/factorization/elimination_forest.cpp 100.00% <100.00%> (ø)
core/factorization/lu.cpp 100.00% <100.00%> (ø)
core/factorization/symbolic.cpp 100.00% <100.00%> (ø)
test/base/executor.cpp 73.91% <100.00%> (+7.24%) ⬆️
reference/base/index_set_kernels.cpp 94.20% <0.00%> (+0.08%) ⬆️


@MarcelKoch self-requested a review December 9, 2022 09:12
@MarcelKoch (Member) left a comment:

It seems a bit annoying that this approach doesn't allow for return values from the host kernels. Perhaps that could also be managed by adding a run_host to Executor (or just a run overload) that takes a HostOperation instead. I'm thinking of something like https://godbolt.org/z/TTd71zKnq. With the appropriate macros, that should be as simple as what we have right now. In C++17 this could be simpler, see https://stackoverflow.com/a/53673648
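
(An illustrative sketch of the kind of interface suggested here; run_host and the surrounding names do not exist in Ginkgo at this point and are only placeholders to make the idea concrete:)

#include <memory>

#include <ginkgo/ginkgo.hpp>

// a non-virtual, templated helper that wraps a host-side callable, would
// notify the executor's loggers around the call (omitted here), and forwards
// the callable's result, avoiding the out-parameter pattern (C++14-compatible)
template <typename Closure>
auto run_host(std::shared_ptr<const gko::Executor> exec, Closure&& op)
    -> decltype(op())
{
    // logger notifications for exec would go here
    return op();
}

// hypothetical usage: the host computation returns its result directly
// auto forest = run_host(exec, [&] { return compute_elim_forest(mtx); });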

@upsj (Member, author) commented Dec 9, 2022

Yes, I agree this is a bit unergonomic. Do you see a way to make that work, considering that template functions can't be virtual?

@MarcelKoch (Member) commented:

Which virtual function do you refer to, Executor::run or HostOperation::run? I think neither of them needs to be virtual.

@upsj (Member, author) commented Dec 9, 2022

Having an overloaded function that is partially virtual and partially templated seems a bit dangerous, so I think we should choose only one of the two. Note that there is no such thing as a HostOperation at the moment, and any potential later addition to what constitutes an Operation would need to be mirrored in HostOperation (e.g. work estimates, relocation and asynchronous execution for runtime systems). If we added return types, I would ideally like to support them for both host and device operations, and have them look as similar as possible to each other.

@thoasm self-requested a review December 9, 2022 11:38
@MarcelKoch (Member) commented:

We could use Executor::run_host, which might also be helpful to add some verbosity. I think we could derive HostOperation from Operation, then changes that apply to Operations could be easily integrated.
But of course that would not help with return values for device operations. Perhaps with some CRTP magic it could be possible, but that is a bigger head scratcher.

*
* @ingroup Executor
*/
#define GKO_REGISTER_HOST_OPERATION(_name, _kernel) \
Member commented:

With this, will you still need to add the logger to host_executor?

@upsj (Member, author) replied:

No, but if the host operation calls kernels on the host side, they would not be visible in the benchmark runtime breakdown.

@pratikvn (Member) commented Dec 13, 2022

I will not hold up this PR any longer, because it is a fairly small change, and even though it adds the interface for host operations, removing it later would only require updating code in develop. Additionally, as @upsj needs it for benchmarking purposes, from my side we can merge this once it is approved.

I have created #1240 to document what I propose.

@MarcelKoch self-requested a review December 16, 2022 13:52
@MarcelKoch (Member) left a comment:

LGTM. The out-parameters are annoying, but that is what we have to use for now. I think we could try to tackle kernels with return values later on. Although, now that I think of it, I don't think that returning non-trivial types from kernels is possible. In most cases, a kernel can't create a Ginkgo object, because the constructors are implemented in core. But for host operations, that should still work.

Resolved review thread on core/factorization/symbolic.cpp
@yhmtsai (Member) commented Dec 16, 2022

@upsj Before I approve it, "Add tests" is still listed in the TODO. Did you already finish it?

@upsj (Member, author) commented Dec 16, 2022

@yhmtsai now I have, thanks for the reminder!

@yhmtsai (Member) left a comment:

LGTM, only a minor format change.

Two resolved (outdated) review threads on core/factorization/symbolic.hpp
Co-authored-by: Yuhsiang M. Tsai <[email protected]>
@upsj added the 1:ST:ready-to-merge label and removed the 1:ST:ready-for-review label on Dec 16, 2022
@ginkgo-bot commented:

Note: This PR changes the Ginkgo ABI:

Functions changes summary: 26 Removed, 0 Changed, 108 Added functions
Variables changes summary: 0 Removed, 0 Changed, 0 Added variable

For details check the full ABI diff under Artifacts here

@sonarqubecloud (bot) commented:

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 10 (rating A)

Coverage: 64.6%
Duplication: 0.0%

@upsj merged commit 60b0514 into develop on Dec 18, 2022
@upsj deleted the host_operations branch on December 18, 2022 03:39
tcojean added a commit that referenced this pull request Jun 16, 2023
Release 1.6.0 of Ginkgo.

The Ginkgo team is proud to announce the new Ginkgo minor release 1.6.0. This release brings new features such as:
- Several building blocks for GPU-resident sparse direct solvers like symbolic
  and numerical LU and Cholesky factorization, ...,
- A distributed Schwarz preconditioner,
- New FGMRES and GCR solvers,
- Distributed benchmarks for the SpMV operation, solvers, ...
- Support for non-default streams in the CUDA and HIP backends,
- Mixed precision support for the CSR SpMV,
- A new profiling logger which integrates with NVTX, ROCTX, TAU and VTune to
  provide internal Ginkgo knowledge to most HPC profilers!

and much more.

If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions).

Supported systems and requirements:
+ For all platforms, CMake 3.13+
+ C++14 compliant compiler
+ Linux and macOS
  + GCC: 5.5+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple Clang: 14.0 is tested. Earlier versions might also work.
  + NVHPC: 22.7+
  + Cray Compiler: 14.0.1+
  + CUDA module: CUDA 9.2+ or NVHPC 22.7+
  + HIP module: ROCm 4.5+
  + DPC++ module: Intel OneAPI 2021.3+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW: GCC 5.5+
  + Microsoft Visual Studio: VS 2019+
  + CUDA module: CUDA 9.2+, Microsoft Visual Studio
  + OpenMP module: MinGW.

### Version Support Changes
+ ROCm 4.0+ -> 4.5+ after [#1303](#1303)
+ Removed Cygwin pipeline and support [#1283](#1283)

### Interface Changes
+ Due to internal changes, `ConcreteExecutor::run` will now always throw if the corresponding module for the `ConcreteExecutor` is not built [#1234](#1234)
+ The constructor of `experimental::distributed::Vector` was changed to only accept local vectors as `std::unique_ptr` [#1284](#1284)
+ The default parameters for the `solver::MultiGrid` were improved. In particular, the smoother defaults to one iteration of `Ir` with `Jacobi` preconditioner, and the coarse grid solver uses the new direct solver with LU factorization. [#1291](#1291) [#1327](#1327)
+ The `iteration_complete` event gained a more expressive overload with additional parameters, the old overloads were deprecated. [#1288](#1288) [#1327](#1327)

### Deprecations
+ Deprecated less expressive `iteration_complete` event. Users are advised to now implement the function `void iteration_complete(const LinOp* solver, const LinOp* b, const LinOp* x, const size_type& it, const LinOp* r, const LinOp* tau, const LinOp* implicit_tau_sq, const array<stopping_status>* status, bool stopped)` [#1288](#1288)

### Added Features
+ A distributed Schwarz preconditioner. [#1248](#1248)
+ A GCR solver [#1239](#1239)
+ Flexible Gmres solver [#1244](#1244)
+ Enable Gmres solver for distributed matrices and vectors [#1201](#1201)
+ An example that uses Kokkos to assemble the system matrix [#1216](#1216)
+ A symbolic LU factorization allowing the `gko::experimental::factorization::Lu` and `gko::experimental::solver::Direct` classes to be used for matrices with non-symmetric sparsity pattern [#1210](#1210)
+ A numerical Cholesky factorization [#1215](#1215)
+ Symbolic factorizations in host-side operations are now wrapped in a host-side `Operation` to make their execution visible to loggers. This means that profiling loggers and benchmarks are no longer missing a separate entry for their runtime [#1232](#1232)
+ Symbolic factorization benchmark [#1302](#1302)
+ The `ProfilerHook` logger allows annotating the Ginkgo execution (apply, operations, ...) for profiling frameworks like NVTX, ROCTX and TAU. [#1055](#1055)
+ `ProfilerHook::created_(nested_)summary` allows the generation of a lightweight runtime profile over all Ginkgo functions written to a user-defined stream [#1270](#1270) for both host and device timing functionality [#1313](#1313)
+ It is now possible to enable host buffers for MPI communications at runtime even if the compile option `GINKGO_FORCE_GPU_AWARE_MPI` is set. [#1228](#1228)
+ A stencil matrices generator (5-pt, 7-pt, 9-pt, and 27-pt) for benchmarks [#1204](#1204)
+ Distributed benchmarks (multi-vector blas, SpMV, solver) [#1204](#1204)
+ Benchmarks for CSR sorting and lookup [#1219](#1219)
+ A timer for MPI benchmarks that reports the longest time [#1217](#1217)
+ A `timer_method=min|max|average|median` flag for benchmark timing summary [#1294](#1294)
+ Support for non-default streams in CUDA and HIP executors [#1236](#1236)
+ METIS integration for nested dissection reordering [#1296](#1296)
+ SuiteSparse AMD integration for fillin-reducing reordering [#1328](#1328)
+ Csr mixed-precision SpMV support [#1319](#1319)
+ A `with_loggers` function for all `Factory` parameters [#1337](#1337)

### Improvements
+ Improve naming of kernel operations for loggers [#1277](#1277)
+ Annotate solver iterations in `ProfilerHook` [#1290](#1290)
+ Allow using the profiler hooks and inline input strings in benchmarks [#1342](#1342)
+ Allow passing smart pointers in place of raw pointers to most matrix functions. This means that things like `vec->compute_norm2(x.get())` or `vec->compute_norm2(lend(x))` can be simplified to `vec->compute_norm2(x)` [#1279](#1279) [#1261](#1261)
+ Catch overflows in prefix sum operations, which makes Ginkgo's operations much less likely to crash. This also improves the performance of the prefix sum kernel [#1303](#1303)
+ Make the installed GinkgoConfig.cmake file relocatable and follow more best practices [#1325](#1325)

### Fixes
+ Fix OpenMPI version check [#1200](#1200)
+ Fix the mpi cxx type binding by c binding [#1306](#1306)
+ Fix runtime failures for one-sided MPI wrapper functions observed on some OpenMPI versions [#1249](#1249)
+ Disable thread pinning with GPU executors due to poor performance [#1230](#1230)
+ Fix hwloc version detection [#1266](#1266)
+ Fix PAPI detection in non-implicit include directories [#1268](#1268)
+ Fix PAPI support for newer PAPI versions: [#1321](#1321)
+ Fix pkg-config file generation for library paths outside prefix [#1271](#1271)
+ Fix various build failures with ROCm 5.4, CUDA 12, and OneAPI 6 [#1214](#1214), [#1235](#1235), [#1251](#1251)
+ Fix incorrect read for skew-symmetric MatrixMarket files with explicit diagonal entries [#1272](#1272)
+ Fix handling of missing diagonal entries in symbolic factorizations [#1263](#1263)
+ Fix segmentation fault in benchmark matrix construction [#1299](#1299)
+ Fix the stencil matrix creation for benchmarking [#1305](#1305)
+ Fix the additional residual check in IR [#1307](#1307)
+ Fix the cuSPARSE CSR SpMM issue on single strided vector when cuda >= 11.6 [#1322](#1322) [#1331](#1331)
+ Fix Isai generation for large sparsity powers [#1327](#1327)
+ Fix Ginkgo compilation and test with NVHPC >= 22.7 [#1331](#1331)
+ Fix Ginkgo compilation of 32 bit binaries with MSVC [#1349](#1349)
Labels: 1:ST:ready-to-merge, mod:core, mod:reference, reg:benchmarking, reg:testing, type:factorization, type:preconditioner, type:solver

Successfully merging this pull request may close these issues:
- Add HostOperation to enable profiling sequential core functionality
- Host operations get lost in benchmarks

5 participants