Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CUDA, HIP and DPCPP batch bicgstab kernels #1443

Merged
merged 28 commits into from
Nov 5, 2023
Merged

Conversation

pratikvn
Copy link
Member

@pratikvn pratikvn commented Oct 26, 2023

This PR adds the batch bicgstab solver kernels for CUDA, HIP and DPCPP backends. Some additional single rhs vector kernels are also added into the batch multivector kernels.

TODO

  • Add DPCPP kernels

@pratikvn pratikvn added 1:ST:WIP This PR is a work in progress. Not ready for review. type:batched-functionality This is related to the batched functionality in Ginkgo labels Oct 26, 2023
@pratikvn pratikvn added this to the Release 1.7.0 milestone Oct 26, 2023
@pratikvn pratikvn self-assigned this Oct 26, 2023
@ginkgo-bot ginkgo-bot added reg:build This is related to the build system. reg:testing This is related to testing. mod:core This is related to the core module. mod:cuda This is related to the CUDA module. type:solver This is related to the solvers mod:hip This is related to the HIP module. labels Oct 26, 2023
Copy link
Member

@MarcelKoch MarcelKoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the kernels look good so far. I have mostly comments outside of those.

Here are some things to be tackled later:

  • use dispatch instead of manual switch
  • make reductions work with more than 1 warp

cuda/solver/batch_bicgstab_kernels.cu Outdated Show resolved Hide resolved
// Compute norms of rhs
single_rhs_compute_norm2(subgroup, num_rows, b_global_entry, rhs_norm);
}
__syncthreads();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? The above code writes only to the norm.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diverging paths between subwarps. To ensure consistency, I think it is good to synchronize them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, they diverge, but I don't see how that would affect the following code. But I'm no expert on this, so I won't push anything here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not requesting any changes, but I wanted to elaborate on this a bit. I agree here, I think we could take a page from CUB's book, where they ensure synchronization always happens inside functions that require it (i.e. SpMVs and reductions) and are entirely absent from the code otherwise.
To make this work, you need a "default" work assignment (like the default for (int iz = threadIdx.x; iz < num_rows; iz += blockDim.x) loop) and every time you read from values outside your own assigned set, you have a threadsync before, and if you write to values outside your set (also computing reductions), you have a threadsync after. This may even allow you to keep all values in registers most of the time, as long as you don't have huge blocks. But that is an optional detail.

Outside of this, there is also some potential for "kernel fusion" (i.e. removing the __syncthreads and computing directly on values in registers) by computing the dot product on the result of the SpMV, but I don't have a clear idea how large the runtime impact of that would be.

}
__syncthreads();

for (int iz = threadIdx.x; iz < num_rows; iz += blockDim.x) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in the other kernels you are using r as index variable.

core/solver/batch_bicgstab_kernels.hpp Show resolved Hide resolved
core/solver/batch_bicgstab_kernels.hpp Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.dp.cpp Outdated Show resolved Hide resolved

// template
// launch_apply_kernel<StopType, SIMDLEN, n_shared_total, sg_kernel_all>
if (num_rows <= 32 && n_shared_total == 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuda/hip uses 9 vectors in shmem. Why does this check for 10? Also the kernel only checks until n_shared_total == 9

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the strategy is slightly different. Here the count includes the prec_shared vector. The number of shared vectors is always 9, so you can only check until 9. If it is greater than 9, then you know that the prec is also in shared memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but isn't that what storage_config::prec_shared is there for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a bit easier with looking at n_shared as 10 vectors. Otherwise, prec_shared will need to be a template parameter as well. But I understand your point that it makes the cuda/dpcpp kernels more confusing to compare.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer the additional template parameter then. But that might also be done later.

dpcpp/solver/batch_bicgstab_kernels.dp.cpp Outdated Show resolved Hide resolved
core/solver/batch_bicgstab_kernels.hpp Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.hpp.inc Outdated Show resolved Hide resolved
@pratikvn
Copy link
Member Author

format!

common/cuda_hip/base/batch_multi_vector_kernels.hpp.inc Outdated Show resolved Hide resolved
core/test/utils/batch_helpers.hpp Outdated Show resolved Hide resolved
Comment on lines 53 to 57
if (sizeof(ValueType) == 4) {
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);
} else if (sizeof(ValueType) % 8 == 0) {
cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do they have TwoByte? Otherwise, it may introduce some troubles when adding half

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I dont think that is necessary. Only a value of 8 is recommended for double to avoid bank conflicts. You can just set it to 4 for half I think .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of problematic - it configures the entire device, but we only run on a single stream. At the very least, we need to revert it after the kernel finished, otherwise we interfere with other applications' performance

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess a scope guard similar to the one for the device id could work here.

}
}
x.values[tidx * x.stride] = temp;
x[tidx] = temp;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete stride?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just use the plain pointers as arguments here. I guess technically we should have another stride parameter to the function, but I think that is unnecessary for now and we can add that when we support stride later.

dpcpp/preconditioner/batch_identity.hpp.inc Outdated Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.hpp.inc Outdated Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.hpp.inc Outdated Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.hpp.inc Show resolved Hide resolved
Comment on lines 264 to 272
ValueType values[5];
real_type reals[2];
rho_old_sh = &values[0];
rho_new_sh = &values[1];
alpha_sh = &values[2];
omega_sh = &values[3];
temp_sh = &values[4];
norms_rhs_sh = &reals[0];
norms_res_sh = &reals[1];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

segfault.
values and reals will be destroies after else.

dpcpp/solver/batch_bicgstab_kernels.hpp.inc Outdated Show resolved Hide resolved
core/solver/batch_bicgstab_kernels.hpp Outdated Show resolved Hide resolved
{
using real_type = gko::remove_complex<value_type>;
const size_type num_batch_items = mat.num_batch_items;
constexpr int align_multiple = 8;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, that alignment is only relevant if the vectors are stored in global memory, right?

Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except for the shared_memory in dpcpp and storage computation (not reviewed yet), others LGTM

common/cuda_hip/solver/batch_bicgstab_kernels.hpp.inc Outdated Show resolved Hide resolved
dpcpp/base/batch_multi_vector_kernels.dp.cpp Outdated Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.dp.cpp Outdated Show resolved Hide resolved
Comment on lines 35 to 44
__dpct_inline__ void initialize(
const int num_rows, const BatchMatrixType_entry& mat_global_entry,
const ValueType* const b_global_entry,
const ValueType* const x_global_entry, ValueType& rho_old, ValueType& omega,
ValueType& alpha, ValueType* const x_shared_entry,
ValueType* const r_shared_entry, ValueType* const r_hat_shared_entry,
ValueType* const p_shared_entry, ValueType* const v_shared_entry,
typename gko::remove_complex<ValueType>& rhs_norm,
typename gko::remove_complex<ValueType>& res_norm,
sycl::nd_item<3> item_ct1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think from CUDA, it will use __ldg() automatically if it is const __restrict__*. That's why we do not need to use __ldg

dpcpp/solver/batch_bicgstab_kernels.hpp.inc Outdated Show resolved Hide resolved
test/solver/batch_bicgstab_kernels.cpp Outdated Show resolved Hide resolved
core/solver/batch_bicgstab_kernels.hpp Outdated Show resolved Hide resolved
@pratikvn pratikvn force-pushed the batch-bicgstab-device branch from b8def5b to b653d3b Compare October 30, 2023 13:38
cuda/solver/batch_bicgstab_kernels.cu Show resolved Hide resolved
cuda/solver/batch_bicgstab_kernels.cu Outdated Show resolved Hide resolved
dpcpp/base/batch_multi_vector_kernels.hpp.inc Outdated Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.dp.cpp Outdated Show resolved Hide resolved
hip/solver/batch_bicgstab_kernels.hip.cpp Outdated Show resolved Hide resolved
hip/solver/batch_bicgstab_kernels.hip.cpp Outdated Show resolved Hide resolved
hip/solver/batch_bicgstab_kernels.hip.cpp Outdated Show resolved Hide resolved
Comment on lines +95 to +96
inline batch::matrix::ell::uniform_batch<const hip_type<ValueType>,
const IndexType>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the to_const usually face this issue.
Could you check the other const version also correct?
If all related to this issue are not in public interface, it are not urgent before release

test/solver/batch_bicgstab_kernels.cpp Outdated Show resolved Hide resolved
@pratikvn pratikvn force-pushed the batch-bicgstab-device branch from b653d3b to fb50eaf Compare October 30, 2023 21:37
@pratikvn pratikvn added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels Oct 31, 2023
@pratikvn pratikvn force-pushed the batch-bicgstab branch 3 times, most recently from 8982811 to 28560a5 Compare October 31, 2023 14:04
hip/base/exception.hip.hpp Outdated Show resolved Hide resolved
Base automatically changed from batch-bicgstab to develop November 1, 2023 09:06
common/cuda_hip/stop/batch_criteria.hpp.inc Outdated Show resolved Hide resolved
core/base/batch_utilities.hpp Outdated Show resolved Hide resolved
core/device_hooks/common_kernels.inc.cpp Outdated Show resolved Hide resolved
@pratikvn pratikvn force-pushed the batch-bicgstab-device branch from fb50eaf to d21d5fd Compare November 1, 2023 10:57
@pratikvn
Copy link
Member Author

pratikvn commented Nov 1, 2023

format!

@pratikvn pratikvn force-pushed the batch-bicgstab-device branch from f48179b to f600023 Compare November 5, 2023 16:11
Co-authored-by: Yu-Hsiang Tsai <[email protected]>
@pratikvn pratikvn force-pushed the batch-bicgstab-device branch from f600023 to 79e68b3 Compare November 5, 2023 16:17
@pratikvn
Copy link
Member Author

pratikvn commented Nov 5, 2023

format!

Co-authored-by: Pratik Nayak <[email protected]>
@pratikvn
Copy link
Member Author

pratikvn commented Nov 5, 2023

Turns out the no-circular-deps job is terribly slow. I verified (with the same config and flags as the job, inside the same image with a docker container) that it builds successfully with GINKGO_CHECK_CIRCULAR_DEPS=on, so I will go ahead and merge this.

@pratikvn pratikvn merged commit 47b3267 into develop Nov 5, 2023
12 of 13 checks passed
@pratikvn pratikvn deleted the batch-bicgstab-device branch November 5, 2023 23:44
Copy link

sonarqubecloud bot commented Nov 6, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 7 Code Smells

98.6% 98.6% Coverage
17.7% 17.7% Duplication

warning The version of Java (11.0.3) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.
Read more here

@tcojean tcojean mentioned this pull request Nov 6, 2023
tcojean added a commit that referenced this pull request Nov 10, 2023
Release 1.7.0 to master

The Ginkgo team is proud to announce the new Ginkgo minor release 1.7.0. This release brings new features such as:
- Complete GPU-resident sparse direct solvers feature set and interfaces,
- Improved Cholesky factorization performance,
- A new MC64 reordering,
- Batched iterative solver support with the BiCGSTAB solver with batched Dense and ELL matrix types,
- MPI support for the SYCL backend,
- Improved ParILU(T)/ParIC(T) preconditioner convergence,
and more!

If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions).

Supported systems and requirements:
+ For all platforms, CMake 3.16+
+ C++14 compliant compiler
+ Linux and macOS
  + GCC: 5.5+
  + clang: 3.9+
  + Intel compiler: 2019+
  + Apple Clang: 14.0 is tested. Earlier versions might also work.
  + NVHPC: 22.7+
  + Cray Compiler: 14.0.1+
  + CUDA module: CMake 3.18+, and CUDA 10.1+ or NVHPC 22.7+
  + HIP module: ROCm 4.5+
  + DPC++ module: Intel oneAPI 2022.1+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp` or `icpx`.
  + MPI: standard version 3.1+, ideally GPU Aware, for best performance
+ Windows
  + MinGW: GCC 5.5+
  + Microsoft Visual Studio: VS 2019+
  + CUDA module: CUDA 10.1+, Microsoft Visual Studio
  + OpenMP module: MinGW.

### Version support changes

+ CUDA 9.2 is no longer supported and 10.0 is untested [#1382](#1382)
+ Ginkgo now requires CMake version 3.16 (and 3.18 for CUDA) [#1368](#1368)

### Interface changes

+ `const` Factory parameters can no longer be modified through `with_*` functions, as this breaks const-correctness [#1336](#1336) [#1439](#1439)

### New Deprecations

+ The `device_reset` parameter of CUDA and HIP executors no longer has an effect, and its `allocation_mode` parameters have been deprecated in favor of the `Allocator` interface. [#1315](#1315)
+ The CMake parameter `GINKGO_BUILD_DPCPP` has been deprecated in favor of `GINKGO_BUILD_SYCL`. [#1350](#1350)
+ The `gko::reorder::Rcm` interface has been deprecated in favor of `gko::experimental::reorder::Rcm` based on `Permutation`. [#1418](#1418)
+ The Permutation class' `permute_mask` functionality. [#1415](#1415)
+ Multiple functions with typos (`set_complex_subpsace()`, range functions such as `conj_operaton` etc). [#1348](#1348)

### Summary of previous deprecations
+ `gko::lend()` is not necessary anymore.
+ The classes `RelativeResidualNorm` and `AbsoluteResidualNorm` are deprecated in favor of `ResidualNorm`.
+ The class `AmgxPgm` is deprecated in favor of `Pgm`.
+ Default constructors for the CSR `load_balance` and `automatical` strategies
+ The PolymorphicObject's move-semantic `copy_from` variant
+ The templated `SolverBase` class.
+ The class `MachineTopology` is deprecated in favor of `machine_topology`.
+ Logger constructors and create functions with the `executor` parameter.
+ The virtual, protected, Dense functions `compute_norm1_impl`, `add_scaled_impl`, etc.
+ Logger events for solvers and criterion without the additional `implicit_tau_sq` parameter.
+ The global `gko::solver::default_krylov_dim`, use instead `gko::solver::gmres_default_krylov_dim`.

### Added features

+ Adds a batch::BatchLinOp class that forms a base class for batched linear operators such as batched matrix formats, solver and preconditioners [#1379](#1379)
+ Adds a batch::MultiVector class that enables operations such as dot, norm, scale on batched vectors [#1371](#1371)
+ Adds a batch::Dense matrix format that stores batched dense matrices and provides gemv operations for these dense matrices. [#1413](#1413)
+ Adds a batch::Ell matrix format that stores batched Ell matrices and provides spmv operations for these batched Ell matrices. [#1416](#1416) [#1437](#1437)
+ Add a batch::Bicgstab solver (class, core, and reference kernels) that enables iterative solution of batched linear systems [#1438](#1438).
+ Add device kernels (CUDA, HIP, and DPCPP) for batch::Bicgstab solver. [#1443](#1443).
+ New MC64 reordering algorithm which optimizes the diagonal product or sum of a matrix by permuting the rows, and computes additional scaling factors for equilibriation [#1120](#1120)
+ New interface for (non-symmetric) permutation and scaled permutation of Dense and Csr matrices [#1415](#1415)
+ LU and Cholesky Factorizations can now be separated into their factors [#1432](#1432)
+ New symbolic LU factorization algorithm that is optimized for matrices with an almost-symmetric sparsity pattern [#1445](#1445)
+ Sorting kernels for SparsityCsr on all backends [#1343](#1343)
+ Allow passing pre-generated local solver as factory parameter for the distributed Schwarz preconditioner [#1426](#1426)
+ Add DPCPP kernels for Partition [#1034](#1034), and CSR's `check_diagonal_entries` and `add_scaled_identity` functionality [#1436](#1436)
+ Adds a helper function to create a partition based on either local sizes, or local ranges [#1227](#1227)
+ Add function to compute arithmetic mean of dense and distributed vectors [#1275](#1275)
+ Adds `icpx` compiler supports [#1350](#1350)
+ All backends can be built simultaneously [#1333](#1333)
+ Emits a CMake warning in downstream projects that use different compilers than the installed Ginkgo [#1372](#1372)
+ Reordering algorithms in sparse_blas benchmark [#1354](#1354)
+ Benchmarks gained an `-allocator` parameter to specify device allocators [#1385](#1385)
+ Benchmarks gained an `-input_matrix` parameter that initializes the input JSON based on the filename [#1387](#1387)
+ Benchmark inputs can now be reordered as a preprocessing step [#1408](#1408)


### Improvements

+ Significantly improve Cholesky factorization performance [#1366](#1366)
+ Improve parallel build performance [#1378](#1378)
+ Allow constrained parallel test execution using CTest resources [#1373](#1373)
+ Use arithmetic type more inside mixed precision ELL [#1414](#1414)
+ Most factory parameters of factory type no longer need to be constructed explicitly via `.on(exec)` [#1336](#1336) [#1439](#1439)
+ Improve ParILU(T)/ParIC(T) convergence by using more appropriate atomic operations [#1434](#1434)

### Fixes

+ Fix an over-allocation for OpenMP reductions [#1369](#1369)
+ Fix DPCPP's common-kernel reduction for empty input sizes [#1362](#1362)
+ Fix several typos in the API and documentation [#1348](#1348)
+ Fix inconsistent `Threads` between generations [#1388](#1388)
+ Fix benchmark median condition [#1398](#1398)
+ Fix HIP 5.6.0 compilation [#1411](#1411)
+ Fix missing destruction of rand_generator from cuda/hip [#1417](#1417)
+ Fix PAPI logger destruction order [#1419](#1419)
+ Fix TAU logger compilation [#1422](#1422)
+ Fix relative criterion to not iterate if the residual is already zero [#1079](#1079)
+ Fix memory_order invocations with C++20 changes [#1402](#1402)
+ Fix `check_diagonal_entries_exist` report correctly when only missing diagonal value in the last rows. [#1440](#1440)
+ Fix checking OpenMPI version in cross-compilation settings [#1446](#1446)
+ Fix false-positive deprecation warnings in Ginkgo, especially for the old Rcm (it doesn't emit deprecation warnings anymore as a result but is still considered deprecated) [#1444](#1444)


### Related PR: #1451
tcojean added a commit that referenced this pull request Nov 10, 2023
Release 1.7.0 to develop

The Ginkgo team is proud to announce the new Ginkgo minor release 1.7.0. This release brings new features such as:
- Complete GPU-resident sparse direct solvers feature set and interfaces,
- Improved Cholesky factorization performance,
- A new MC64 reordering,
- Batched iterative solver support with the BiCGSTAB solver with batched Dense and ELL matrix types,
- MPI support for the SYCL backend,
- Improved ParILU(T)/ParIC(T) preconditioner convergence,
and more!

If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions).

Supported systems and requirements:
+ For all platforms, CMake 3.16+
+ C++14 compliant compiler
+ Linux and macOS
  + GCC: 5.5+
  + clang: 3.9+
  + Intel compiler: 2019+
  + Apple Clang: 14.0 is tested. Earlier versions might also work.
  + NVHPC: 22.7+
  + Cray Compiler: 14.0.1+
  + CUDA module: CMake 3.18+, and CUDA 10.1+ or NVHPC 22.7+
  + HIP module: ROCm 4.5+
  + DPC++ module: Intel oneAPI 2022.1+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp` or `icpx`.
  + MPI: standard version 3.1+, ideally GPU Aware, for best performance
+ Windows
  + MinGW: GCC 5.5+
  + Microsoft Visual Studio: VS 2019+
  + CUDA module: CUDA 10.1+, Microsoft Visual Studio
  + OpenMP module: MinGW.

### Version support changes

+ CUDA 9.2 is no longer supported and 10.0 is untested [#1382](#1382)
+ Ginkgo now requires CMake version 3.16 (and 3.18 for CUDA) [#1368](#1368)

### Interface changes

+ `const` Factory parameters can no longer be modified through `with_*` functions, as this breaks const-correctness [#1336](#1336) [#1439](#1439)

### New Deprecations

+ The `device_reset` parameter of CUDA and HIP executors no longer has an effect, and its `allocation_mode` parameters have been deprecated in favor of the `Allocator` interface. [#1315](#1315)
+ The CMake parameter `GINKGO_BUILD_DPCPP` has been deprecated in favor of `GINKGO_BUILD_SYCL`. [#1350](#1350)
+ The `gko::reorder::Rcm` interface has been deprecated in favor of `gko::experimental::reorder::Rcm` based on `Permutation`. [#1418](#1418)
+ The Permutation class' `permute_mask` functionality. [#1415](#1415)
+ Multiple functions with typos (`set_complex_subpsace()`, range functions such as `conj_operaton` etc). [#1348](#1348)

### Summary of previous deprecations
+ `gko::lend()` is not necessary anymore.
+ The classes `RelativeResidualNorm` and `AbsoluteResidualNorm` are deprecated in favor of `ResidualNorm`.
+ The class `AmgxPgm` is deprecated in favor of `Pgm`.
+ Default constructors for the CSR `load_balance` and `automatical` strategies
+ The PolymorphicObject's move-semantic `copy_from` variant
+ The templated `SolverBase` class.
+ The class `MachineTopology` is deprecated in favor of `machine_topology`.
+ Logger constructors and create functions with the `executor` parameter.
+ The virtual, protected, Dense functions `compute_norm1_impl`, `add_scaled_impl`, etc.
+ Logger events for solvers and criterion without the additional `implicit_tau_sq` parameter.
+ The global `gko::solver::default_krylov_dim`, use instead `gko::solver::gmres_default_krylov_dim`.

### Added features

+ Adds a batch::BatchLinOp class that forms a base class for batched linear operators such as batched matrix formats, solver and preconditioners [#1379](#1379)
+ Adds a batch::MultiVector class that enables operations such as dot, norm, scale on batched vectors [#1371](#1371)
+ Adds a batch::Dense matrix format that stores batched dense matrices and provides gemv operations for these dense matrices. [#1413](#1413)
+ Adds a batch::Ell matrix format that stores batched Ell matrices and provides spmv operations for these batched Ell matrices. [#1416](#1416) [#1437](#1437)
+ Add a batch::Bicgstab solver (class, core, and reference kernels) that enables iterative solution of batched linear systems [#1438](#1438).
+ Add device kernels (CUDA, HIP, and DPCPP) for batch::Bicgstab solver. [#1443](#1443).
+ New MC64 reordering algorithm which optimizes the diagonal product or sum of a matrix by permuting the rows, and computes additional scaling factors for equilibriation [#1120](#1120)
+ New interface for (non-symmetric) permutation and scaled permutation of Dense and Csr matrices [#1415](#1415)
+ LU and Cholesky Factorizations can now be separated into their factors [#1432](#1432)
+ New symbolic LU factorization algorithm that is optimized for matrices with an almost-symmetric sparsity pattern [#1445](#1445)
+ Sorting kernels for SparsityCsr on all backends [#1343](#1343)
+ Allow passing pre-generated local solver as factory parameter for the distributed Schwarz preconditioner [#1426](#1426)
+ Add DPCPP kernels for Partition [#1034](#1034), and CSR's `check_diagonal_entries` and `add_scaled_identity` functionality [#1436](#1436)
+ Adds a helper function to create a partition based on either local sizes, or local ranges [#1227](#1227)
+ Add function to compute arithmetic mean of dense and distributed vectors [#1275](#1275)
+ Adds `icpx` compiler supports [#1350](#1350)
+ All backends can be built simultaneously [#1333](#1333)
+ Emits a CMake warning in downstream projects that use different compilers than the installed Ginkgo [#1372](#1372)
+ Reordering algorithms in sparse_blas benchmark [#1354](#1354)
+ Benchmarks gained an `-allocator` parameter to specify device allocators [#1385](#1385)
+ Benchmarks gained an `-input_matrix` parameter that initializes the input JSON based on the filename [#1387](#1387)
+ Benchmark inputs can now be reordered as a preprocessing step [#1408](#1408)


### Improvements

+ Significantly improve Cholesky factorization performance [#1366](#1366)
+ Improve parallel build performance [#1378](#1378)
+ Allow constrained parallel test execution using CTest resources [#1373](#1373)
+ Use arithmetic type more inside mixed precision ELL [#1414](#1414)
+ Most factory parameters of factory type no longer need to be constructed explicitly via `.on(exec)` [#1336](#1336) [#1439](#1439)
+ Improve ParILU(T)/ParIC(T) convergence by using more appropriate atomic operations [#1434](#1434)

### Fixes

+ Fix an over-allocation for OpenMP reductions [#1369](#1369)
+ Fix DPCPP's common-kernel reduction for empty input sizes [#1362](#1362)
+ Fix several typos in the API and documentation [#1348](#1348)
+ Fix inconsistent `Threads` between generations [#1388](#1388)
+ Fix benchmark median condition [#1398](#1398)
+ Fix HIP 5.6.0 compilation [#1411](#1411)
+ Fix missing destruction of rand_generator from cuda/hip [#1417](#1417)
+ Fix PAPI logger destruction order [#1419](#1419)
+ Fix TAU logger compilation [#1422](#1422)
+ Fix relative criterion to not iterate if the residual is already zero [#1079](#1079)
+ Fix memory_order invocations with C++20 changes [#1402](#1402)
+ Fix `check_diagonal_entries_exist` report correctly when only missing diagonal value in the last rows. [#1440](#1440)
+ Fix checking OpenMPI version in cross-compilation settings [#1446](#1446)
+ Fix false-positive deprecation warnings in Ginkgo, especially for the old Rcm (it doesn't emit deprecation warnings anymore as a result but is still considered deprecated) [#1444](#1444)

### Related PR: #1454
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:ready-for-review This PR is ready for review 1:ST:run-full-test mod:core This is related to the core module. mod:cuda This is related to the CUDA module. mod:hip This is related to the HIP module. reg:build This is related to the build system. reg:testing This is related to testing. type:batched-functionality This is related to the batched functionality in Ginkgo type:solver This is related to the solvers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants