Add BatchMultiVector class, core, kernels and tests #1371

pratikvn · 2023-07-20T13:48:04Z

This PR adds the BatchMultiVector class, which implements the batched multi-vector object. Following our discussion, this is now separate from the LinOp/BatchLinOp hierarchy. In addition, this PR adds:

batch_dim<> functionality, a simple wrapper over dim<> for storing the sizes of uniform batched objects.
batch_struct, to ease data access in, and the implementation of, the backend kernels.
batch_initialize functions, used to initialize all the matrix formats in a fashion similar to Dense<>::initialize(...).
Backend kernels for OpenMP, CUDA, HIP and DPCPP, and their tests.
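To make the idea of a uniform batch concrete, here is a minimal, self-contained sketch. It is not Ginkgo's implementation: the names `uniform_batch_dim`, `item_view` and `batched_multi_vector` are hypothetical, and the row-major contiguous layout is an assumption made for illustration. It shows how one common size plus an item count can describe the whole batch, and how a lightweight per-item view (in the spirit of batch_struct) can be handed to kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a uniform batch dimension: every batch item
// shares the same rows x cols size, so one size plus a count is enough.
struct uniform_batch_dim {
    std::size_t num_items;
    std::size_t rows;
    std::size_t cols;

    std::size_t item_size() const { return rows * cols; }
};

// Non-owning, row-major view of a single batch item (illustrative only).
struct item_view {
    double* values;
    std::size_t rows;
    std::size_t cols;

    double& at(std::size_t r, std::size_t c) const
    {
        assert(r < rows && c < cols);
        return values[r * cols + c];
    }
};

// Owning batched multi-vector: all items live in one contiguous buffer,
// item after item (an assumed layout for this sketch).
class batched_multi_vector {
public:
    explicit batched_multi_vector(uniform_batch_dim dim)
        : dim_{dim}, values_(dim.num_items * dim.item_size(), 0.0)
    {}

    uniform_batch_dim size() const { return dim_; }

    // Returns a view of the item with index batch_id.
    item_view item(std::size_t batch_id)
    {
        assert(batch_id < dim_.num_items);
        return {values_.data() + batch_id * dim_.item_size(), dim_.rows,
                dim_.cols};
    }

private:
    uniform_batch_dim dim_;
    std::vector<double> values_;
};
```

Because every item has the same size, the offset of an item is a single multiplication, which is what makes the uniform case cheap to index inside kernels.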

There are no apply kernels, as this is only a multi-vector object and hence does not lend itself to the apply functionality.

To be general enough, multiple vectors (a MultiVector) are supported, even though the solvers in batch_develop currently support only a single vector.

This PR is the first in a series that brings batched functionality to Ginkgo develop; part of this functionality has already been available in batch-develop.

pratikvn · 2023-07-20T13:52:36Z

format!

MarcelKoch

I think I will break up my review into a few pieces. This first one looks mostly at the interface and core stuff.

include/ginkgo/core/base/batch_multi_vector.hpp

common/cuda_hip/base/batch_multi_vector_kernels.hpp.inc

cuda/base/batch_multi_vector_kernels.cu

include/ginkgo/core/base/batch_multi_vector.hpp

core/base/batch_multi_vector.cpp

MarcelKoch · 2023-07-21T08:02:29Z

I feel that some of my comments may be too detailed and not critical to this PR. If you think that's the case, those comments can be handled in another PR.

upsj

First coarse pass with a handful of specific suggestions:

We will be doing a lot of scheduling that is the same everywhere: looping over all batches in thread blocks, looping over all entries inside a batch. I would suggest providing abstractions for those to a) give a single place where we can work on scheduling, b) reduce the likelihood of indexing mistakes during development, and c) give explicit names to the different levels of parallelism we are using here.
We don't need raw reduction implementations; we have abstractions for those already.
Certain functions (read, write, unbatch) seem to be written mainly for testability purposes. I think we should find clear demarcations between functions that are supposed to be used in applications and functions that are mostly there for testing (and decide whether we even want to expose the latter to users easily).
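The first suggestion can be illustrated with a small CPU-only sketch. The helper names `for_each_batch_item` and `for_each_entry` are hypothetical and do not come from the PR; on a GPU the outer helper would presumably map to thread blocks and the inner one to threads, whereas here both are plain loops. The point is only that the indexing lives in one place and each level of parallelism has a name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: outer level of parallelism, one call per batch item.
template <typename Fn>
void for_each_batch_item(std::size_t num_items, Fn fn)
{
    for (std::size_t batch_id = 0; batch_id < num_items; ++batch_id) {
        fn(batch_id);
    }
}

// Hypothetical helper: inner level, one call per (row, col) entry of an item.
template <typename Fn>
void for_each_entry(std::size_t rows, std::size_t cols, Fn fn)
{
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            fn(r, c);
        }
    }
}

// Column-wise dot product of two batched multi-vectors stored item after
// item in row-major order (assumed layout). The result holds one value per
// column per batch item: result[batch_id * cols + col].
std::vector<double> batched_dot(const std::vector<double>& x,
                                const std::vector<double>& y,
                                std::size_t num_items, std::size_t rows,
                                std::size_t cols)
{
    assert(x.size() == num_items * rows * cols && y.size() == x.size());
    std::vector<double> result(num_items * cols, 0.0);
    for_each_batch_item(num_items, [&](std::size_t batch_id) {
        const std::size_t offset = batch_id * rows * cols;
        for_each_entry(rows, cols, [&](std::size_t r, std::size_t c) {
            const std::size_t idx = offset + r * cols + c;
            result[batch_id * cols + c] += x[idx] * y[idx];
        });
    });
    return result;
}
```

A kernel written this way never computes a batch offset by hand outside the two helpers, which is the indexing-mistake argument made in point b).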

common/cuda_hip/base/batch_multi_vector_kernels.hpp.inc

core/base/batch_multi_vector.cpp

core/base/batch_struct.hpp

include/ginkgo/core/base/batch_multi_vector.hpp

omp/base/batch_multi_vector_kernels.cpp

pratikvn · 2023-07-21T08:51:02Z

format!

pratikvn · 2023-07-24T08:41:57Z

format!

include/ginkgo/core/base/batch_multi_vector.hpp

core/test/base/batch_multi_vector.cpp

dpcpp/base/batch_multi_vector_kernels.dp.cpp

cuda/base/batch_multi_vector_kernels.cu

dpcpp/base/batch_multi_vector_kernels.dp.cpp

core/base/batch_multi_vector.cpp

include/ginkgo/core/base/batch_multi_vector.hpp

core/base/batch_struct.hpp

include/ginkgo/core/base/batch_dim.hpp

include/ginkgo/ginkgo.hpp

reference/test/base/batch_multi_vector_kernels.cpp

test/base/batch_multi_vector_kernels.cpp

pratikvn · 2023-07-27T07:37:17Z

format!

yhmtsai

LGTM in general. There are some potential performance improvements we can try.
Should we also have a stride for the BatchMultiVector?
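For readers unfamiliar with the term, the following sketch shows what a stride would mean for indexing. It is an assumption-laden illustration, not part of the PR: `strided_batch_layout` is a hypothetical name, the layout is assumed to be row-major, and the thread shown here does not say whether a stride was adopted. With a stride, consecutive rows are `stride` elements apart (with `stride >= cols`), so rows can be padded, for example for alignment.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical strided layout for a uniform batch of row-major items.
struct strided_batch_layout {
    std::size_t rows;
    std::size_t cols;
    std::size_t stride;  // elements between consecutive rows, stride >= cols

    // Number of stored elements per batch item, including padding.
    std::size_t item_size() const { return rows * stride; }

    // Linear index of entry (r, c) in batch item batch_id.
    std::size_t index(std::size_t batch_id, std::size_t r,
                      std::size_t c) const
    {
        assert(stride >= cols && r < rows && c < cols);
        return batch_id * item_size() + r * stride + c;
    }
};
```

Setting `stride == cols` recovers the compact layout, so a strided interface is a strict generalization at the cost of one extra stored value and one extra multiplication per access.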

common/cuda_hip/base/batch_multi_vector_kernels.hpp.inc

core/base/batch_multi_vector.cpp

include/ginkgo/core/base/batch_multi_vector.hpp

dpcpp/base/batch_multi_vector_kernels.dp.cpp

hip/base/batch_struct.hip.hpp

include/ginkgo/core/base/batch_dim.hpp

cuda/base/batch_struct.hpp

core/base/batch_struct.hpp


sonarqubecloud · 2023-08-04T00:02:09Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
38 Code Smells

83.9% Coverage
3.2% Duplication

The version of Java (11.0.3) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.

Release 1.7.0 to master

The Ginkgo team is proud to announce the new Ginkgo minor release 1.7.0. This release brings new features such as:
- Complete GPU-resident sparse direct solvers feature set and interfaces,
- Improved Cholesky factorization performance,
- A new MC64 reordering,
- Batched iterative solver support with the BiCGSTAB solver with batched Dense and ELL matrix types,
- MPI support for the SYCL backend,
- Improved ParILU(T)/ParIC(T) preconditioner convergence,

and more!

If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions).

Supported systems and requirements:
+ For all platforms, CMake 3.16+
+ C++14 compliant compiler
+ Linux and macOS
  + GCC: 5.5+
  + clang: 3.9+
  + Intel compiler: 2019+
  + Apple Clang: 14.0 is tested. Earlier versions might also work.
  + NVHPC: 22.7+
  + Cray Compiler: 14.0.1+
  + CUDA module: CMake 3.18+, and CUDA 10.1+ or NVHPC 22.7+
  + HIP module: ROCm 4.5+
  + DPC++ module: Intel oneAPI 2022.1+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp` or `icpx`.
  + MPI: standard version 3.1+, ideally GPU Aware, for best performance
+ Windows
  + MinGW: GCC 5.5+
  + Microsoft Visual Studio: VS 2019+
  + CUDA module: CUDA 10.1+, Microsoft Visual Studio
  + OpenMP module: MinGW.

### Version support changes
+ CUDA 9.2 is no longer supported and 10.0 is untested [#1382](#1382)
+ Ginkgo now requires CMake version 3.16 (and 3.18 for CUDA) [#1368](#1368)

### Interface changes
+ `const` Factory parameters can no longer be modified through `with_*` functions, as this breaks const-correctness [#1336](#1336) [#1439](#1439)

### New Deprecations
+ The `device_reset` parameter of CUDA and HIP executors no longer has an effect, and its `allocation_mode` parameters have been deprecated in favor of the `Allocator` interface. [#1315](#1315)
+ The CMake parameter `GINKGO_BUILD_DPCPP` has been deprecated in favor of `GINKGO_BUILD_SYCL`. [#1350](#1350)
+ The `gko::reorder::Rcm` interface has been deprecated in favor of `gko::experimental::reorder::Rcm` based on `Permutation`. [#1418](#1418)
+ The Permutation class' `permute_mask` functionality. [#1415](#1415)
+ Multiple functions with typos (`set_complex_subpsace()`, range functions such as `conj_operaton` etc). [#1348](#1348)

### Summary of previous deprecations
+ `gko::lend()` is not necessary anymore.
+ The classes `RelativeResidualNorm` and `AbsoluteResidualNorm` are deprecated in favor of `ResidualNorm`.
+ The class `AmgxPgm` is deprecated in favor of `Pgm`.
+ Default constructors for the CSR `load_balance` and `automatical` strategies
+ The PolymorphicObject's move-semantic `copy_from` variant
+ The templated `SolverBase` class.
+ The class `MachineTopology` is deprecated in favor of `machine_topology`.
+ Logger constructors and create functions with the `executor` parameter.
+ The virtual, protected, Dense functions `compute_norm1_impl`, `add_scaled_impl`, etc.
+ Logger events for solvers and criterion without the additional `implicit_tau_sq` parameter.
+ The global `gko::solver::default_krylov_dim`, use instead `gko::solver::gmres_default_krylov_dim`.

### Added features
+ Adds a batch::BatchLinOp class that forms a base class for batched linear operators such as batched matrix formats, solver and preconditioners [#1379](#1379)
+ Adds a batch::MultiVector class that enables operations such as dot, norm, scale on batched vectors [#1371](#1371)
+ Adds a batch::Dense matrix format that stores batched dense matrices and provides gemv operations for these dense matrices. [#1413](#1413)
+ Adds a batch::Ell matrix format that stores batched Ell matrices and provides spmv operations for these batched Ell matrices. [#1416](#1416) [#1437](#1437)
+ Add a batch::Bicgstab solver (class, core, and reference kernels) that enables iterative solution of batched linear systems [#1438](#1438).
+ Add device kernels (CUDA, HIP, and DPCPP) for batch::Bicgstab solver. [#1443](#1443).
+ New MC64 reordering algorithm which optimizes the diagonal product or sum of a matrix by permuting the rows, and computes additional scaling factors for equilibration [#1120](#1120)
+ New interface for (non-symmetric) permutation and scaled permutation of Dense and Csr matrices [#1415](#1415)
+ LU and Cholesky Factorizations can now be separated into their factors [#1432](#1432)
+ New symbolic LU factorization algorithm that is optimized for matrices with an almost-symmetric sparsity pattern [#1445](#1445)
+ Sorting kernels for SparsityCsr on all backends [#1343](#1343)
+ Allow passing pre-generated local solver as factory parameter for the distributed Schwarz preconditioner [#1426](#1426)
+ Add DPCPP kernels for Partition [#1034](#1034), and CSR's `check_diagonal_entries` and `add_scaled_identity` functionality [#1436](#1436)
+ Adds a helper function to create a partition based on either local sizes, or local ranges [#1227](#1227)
+ Add function to compute arithmetic mean of dense and distributed vectors [#1275](#1275)
+ Adds `icpx` compiler support [#1350](#1350)
+ All backends can be built simultaneously [#1333](#1333)
+ Emits a CMake warning in downstream projects that use different compilers than the installed Ginkgo [#1372](#1372)
+ Reordering algorithms in sparse_blas benchmark [#1354](#1354)
+ Benchmarks gained an `-allocator` parameter to specify device allocators [#1385](#1385)
+ Benchmarks gained an `-input_matrix` parameter that initializes the input JSON based on the filename [#1387](#1387)
+ Benchmark inputs can now be reordered as a preprocessing step [#1408](#1408)

### Improvements
+ Significantly improve Cholesky factorization performance [#1366](#1366)
+ Improve parallel build performance [#1378](#1378)
+ Allow constrained parallel test execution using CTest resources [#1373](#1373)
+ Use arithmetic type more inside mixed precision ELL [#1414](#1414)
+ Most factory parameters of factory type no longer need to be constructed explicitly via `.on(exec)` [#1336](#1336) [#1439](#1439)
+ Improve ParILU(T)/ParIC(T) convergence by using more appropriate atomic operations [#1434](#1434)

### Fixes
+ Fix an over-allocation for OpenMP reductions [#1369](#1369)
+ Fix DPCPP's common-kernel reduction for empty input sizes [#1362](#1362)
+ Fix several typos in the API and documentation [#1348](#1348)
+ Fix inconsistent `Threads` between generations [#1388](#1388)
+ Fix benchmark median condition [#1398](#1398)
+ Fix HIP 5.6.0 compilation [#1411](#1411)
+ Fix missing destruction of rand_generator from cuda/hip [#1417](#1417)
+ Fix PAPI logger destruction order [#1419](#1419)
+ Fix TAU logger compilation [#1422](#1422)
+ Fix relative criterion to not iterate if the residual is already zero [#1079](#1079)
+ Fix memory_order invocations with C++20 changes [#1402](#1402)
+ Fix `check_diagonal_entries_exist` report correctly when only missing diagonal value in the last rows. [#1440](#1440)
+ Fix checking OpenMPI version in cross-compilation settings [#1446](#1446)
+ Fix false-positive deprecation warnings in Ginkgo, especially for the old Rcm (it doesn't emit deprecation warnings anymore as a result but is still considered deprecated) [#1444](#1444)

### Related PR: #1451

Release 1.7.0 to develop

The release notes are the same as for "Release 1.7.0 to master" above.

### Related PR: #1454

pratikvn added this to the Release 1.7.0 milestone Jul 20, 2023

pratikvn requested review from a team July 20, 2023 13:48

pratikvn self-assigned this Jul 20, 2023

ginkgo-bot added the reg:build (related to the build system) and reg:testing (related to testing) labels Jul 20, 2023

MarcelKoch reviewed Jul 21, 2023


upsj reviewed Jul 21, 2023


pratikvn force-pushed the batch-vector branch from 6a7a788 to ed39a7e July 21, 2023 08:45

pratikvn force-pushed the batch-vector branch 2 times, most recently from 4b98377 to a60ae1e July 21, 2023 09:03

pratikvn force-pushed the batch-vector branch from bd5c11a to 19b40f2 July 24, 2023 13:51

MarcelKoch reviewed Jul 24, 2023


yhmtsai reviewed Jul 26, 2023


pratikvn force-pushed the batch-vector branch 4 times, most recently from 3fe28eb to 62e84a2 July 27, 2023 07:37

pratikvn force-pushed the batch-vector branch from a07acbc to a618e7f July 27, 2023 09:55

yhmtsai approved these changes Jul 27, 2023


pratikvn and others added 23 commits August 3, 2023 13:31

Add compute_conj_dot and kernels

bc2de26

Generalize CUDA/HIP kernels and use reduce prim

f5bb2e3

Format files

1d1cf6b

Co-authored-by: Pratik Nayak <[email protected]>

Add a fill method and test

e953b84

Update dpcpp kernels and fix for 2022-1

fd3e5da

Cannot use sycl::reduce_over_group for older DPCPP versions.

Fix dpcpp CPU subgroup_size issue

7e82d5d

Co-authored-by: Yu-Hsiang Mike Tsai <[email protected]>

Move impls to source from header

41ad026

Update docs and zero-size issues

8bbac45

Review and doc updates

446ab5d

Review updates

e5f8387

Co-authored-by: Yu-Hsiang Tsai<[email protected]>

Format files

c7aa2ba

Co-authored-by: Pratik Nayak <[email protected]>

Update get_values and add test

4c6984e

Fix read bug and add test

5a48a0a

Review updates.

ebea8a1

Co-authored-by: Thomas Grützmacher <[email protected]> Co-authored-by: Yu-Hsiang Tsai <[email protected]> Co-authored-by: Marcel Koch <[email protected]>

Rename: batch_entry -> batch_item

59ff7e9

Use batch:: namespace,rename to batch::MultiVector

82c7d08

Rename to extract_batch_item

ef69dcd

Format files

2cf53a0

Co-authored-by: Pratik Nayak <[email protected]>

Add Dense matrix view creation

4f63e29

Move read/write/unbatch to Ginkgo internal

66d5d68

Remove warnings from CI builds

ce86e2b

Format files

8e11938

Co-authored-by: Pratik Nayak <[email protected]>

Fix warning in exception

86e9312

pratikvn force-pushed the batch-vector branch from 40ea6d9 to 86e9312 August 3, 2023 11:32

pratikvn merged commit 1882753 into develop Aug 3, 2023

pratikvn deleted the batch-vector branch August 3, 2023 19:49

tcojean mentioned this pull request Nov 6, 2023

Release 1.7.0 to master #1451

Merged
