Add CUDA, HIP and DPCPP batch bicgstab kernels #1443

pratikvn · 2023-10-26T22:10:23Z

This PR adds the batch bicgstab solver kernels for CUDA, HIP and DPCPP backends. Some additional single rhs vector kernels are also added into the batch multivector kernels.

TODO

Add DPCPP kernels

MarcelKoch

I think the kernels look good so far. I have mostly comments outside of those.

Here are some things to be tackled later:

use dispatch instead of manual switch
make reductions work with more than 1 warp

cuda/solver/batch_bicgstab_kernels.cu

MarcelKoch · 2023-10-27T09:14:56Z

common/cuda_hip/solver/batch_bicgstab_kernels.hpp.inc

+        // Compute norms of rhs
+        single_rhs_compute_norm2(subgroup, num_rows, b_global_entry, rhs_norm);
+    }
+    __syncthreads();


Is this necessary? The above code writes only to the norm.

Diverging paths between subwarps. To ensure consistency, I think it is good to synchronize them.

Sure, they diverge, but I don't see how that would affect the following code. But I'm no expert on this, so I won't push anything here.

Not requesting any changes, but I wanted to elaborate on this a bit. I agree here, I think we could take a page from CUB's book, where they ensure synchronization always happens inside functions that require it (i.e. SpMVs and reductions) and are entirely absent from the code otherwise.
To make this work, you need a "default" work assignment (like the default for (int iz = threadIdx.x; iz < num_rows; iz += blockDim.x) loop) and every time you read from values outside your own assigned set, you have a threadsync before, and if you write to values outside your set (also computing reductions), you have a threadsync after. This may even allow you to keep all values in registers most of the time, as long as you don't have huge blocks. But that is an optional detail.

Outside of this, there is also some potential for "kernel fusion" (i.e. removing the __syncthreads and computing directly on values in registers) by computing the dot product on the result of the SpMV, but I don't have a clear idea how large the runtime impact of that would be.

MarcelKoch · 2023-10-27T09:15:54Z

common/cuda_hip/solver/batch_bicgstab_kernels.hpp.inc

+    }
+    __syncthreads();
+
+    for (int iz = threadIdx.x; iz < num_rows; iz += blockDim.x) {


nit: in the other kernels you are using r as index variable.

core/solver/batch_bicgstab_kernels.hpp

dpcpp/solver/batch_bicgstab_kernels.dp.cpp

MarcelKoch · 2023-10-27T14:09:14Z

dpcpp/solver/batch_bicgstab_kernels.dp.cpp

+
+        // template
+        // launch_apply_kernel<StopType, SIMDLEN, n_shared_total, sg_kernel_all>
+        if (num_rows <= 32 && n_shared_total == 10)


cuda/hip uses 9 vectors in shmem. Why does this check for 10? Also the kernel only checks until n_shared_total == 9

the strategy is slightly different. Here the count includes the prec_shared vector. The number of shared vectors is always 9, so you can only check until 9. If it is greater than 9, then you know that the prec is also in shared memory.

but isn't that what storage_config::prec_shared is there for?

I think it is a bit easier with looking at n_shared as 10 vectors. Otherwise, prec_shared will need to be a template parameter as well. But I understand your point that it makes the cuda/dpcpp kernels more confusing to compare.

I would prefer the additional template parameter then. But that might also be done later.

dpcpp/solver/batch_bicgstab_kernels.dp.cpp

core/solver/batch_bicgstab_kernels.hpp

dpcpp/solver/batch_bicgstab_kernels.hpp.inc

pratikvn · 2023-10-29T19:43:12Z

format!

common/cuda_hip/base/batch_multi_vector_kernels.hpp.inc

core/test/utils/batch_helpers.hpp

yhmtsai · 2023-10-29T20:31:43Z

cuda/base/kernel_config.cuh

+    if (sizeof(ValueType) == 4) {
+        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeFourByte);
+    } else if (sizeof(ValueType) % 8 == 0) {
+        cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
+    }


do they have TwoByte? Otherwise, it may introduce some troubles when adding half

No, I dont think that is necessary. Only a value of 8 is recommended for double to avoid bank conflicts. You can just set it to 4 for half I think .

This is kind of problematic - it configures the entire device, but we only run on a single stream. At the very least, we need to revert it after the kernel finished, otherwise we interfere with other applications' performance

I guess a scope guard similar to the one for the device id could work here.

yhmtsai · 2023-10-29T20:35:03Z

dpcpp/matrix/batch_ell_kernels.hpp.inc

            }
        }
-        x.values[tidx * x.stride] = temp;
+        x[tidx] = temp;


delete stride?

I just use the plain pointers as arguments here. I guess technically we should have another stride parameter to the function, but I think that is unnecessary for now and we can add that when we support stride later.

dpcpp/preconditioner/batch_identity.hpp.inc

dpcpp/solver/batch_bicgstab_kernels.hpp.inc

yhmtsai · 2023-10-29T20:51:49Z

dpcpp/solver/batch_bicgstab_kernels.hpp.inc

+        ValueType values[5];
+        real_type reals[2];
+        rho_old_sh = &values[0];
+        rho_new_sh = &values[1];
+        alpha_sh = &values[2];
+        omega_sh = &values[3];
+        temp_sh = &values[4];
+        norms_rhs_sh = &reals[0];
+        norms_res_sh = &reals[1];


segfault.
values and reals will be destroies after else.

dpcpp/solver/batch_bicgstab_kernels.hpp.inc

core/solver/batch_bicgstab_kernels.hpp

MarcelKoch · 2023-10-30T08:15:18Z

cuda/solver/batch_bicgstab_kernels.cu

+    {
+        using real_type = gko::remove_complex<value_type>;
+        const size_type num_batch_items = mat.num_batch_items;
+        constexpr int align_multiple = 8;


So, that alignment is only relevant if the vectors are stored in global memory, right?

yhmtsai

except for the shared_memory in dpcpp and storage computation (not reviewed yet), others LGTM

common/cuda_hip/solver/batch_bicgstab_kernels.hpp.inc

dpcpp/base/batch_multi_vector_kernels.dp.cpp

dpcpp/solver/batch_bicgstab_kernels.dp.cpp

yhmtsai · 2023-10-30T10:30:07Z

dpcpp/solver/batch_bicgstab_kernels.hpp.inc

+__dpct_inline__ void initialize(
+    const int num_rows, const BatchMatrixType_entry& mat_global_entry,
+    const ValueType* const b_global_entry,
+    const ValueType* const x_global_entry, ValueType& rho_old, ValueType& omega,
+    ValueType& alpha, ValueType* const x_shared_entry,
+    ValueType* const r_shared_entry, ValueType* const r_hat_shared_entry,
+    ValueType* const p_shared_entry, ValueType* const v_shared_entry,
+    typename gko::remove_complex<ValueType>& rhs_norm,
+    typename gko::remove_complex<ValueType>& res_norm,
+    sycl::nd_item<3> item_ct1)


I think from CUDA, it will use __ldg() automatically if it is const __restrict__*. That's why we do not need to use __ldg

dpcpp/solver/batch_bicgstab_kernels.hpp.inc

test/solver/batch_bicgstab_kernels.cpp

core/solver/batch_bicgstab_kernels.hpp

common/cuda_hip/solver/batch_bicgstab_kernels.hpp.inc

cuda/solver/batch_bicgstab_kernels.cu

dpcpp/base/batch_multi_vector_kernels.hpp.inc

dpcpp/solver/batch_bicgstab_kernels.dp.cpp

hip/solver/batch_bicgstab_kernels.hip.cpp

yhmtsai · 2023-10-30T16:39:42Z

hip/matrix/batch_struct.hip.hpp

+inline batch::matrix::ell::uniform_batch<const hip_type<ValueType>,
+                                         const IndexType>


I think the to_const usually face this issue.
Could you check the other const version also correct?
If all related to this issue are not in public interface, it are not urgent before release

test/solver/batch_bicgstab_kernels.cpp

hip/base/exception.hip.hpp

common/cuda_hip/stop/batch_criteria.hpp.inc

core/base/batch_utilities.hpp

core/device_hooks/common_kernels.inc.cpp

pratikvn · 2023-11-01T10:58:59Z

format!

Co-authored-by: Pratik Nayak <[email protected]>

Co-authored-by: Phuong Nguyen <[email protected]>

Co-authored-by: Yu-Hsiang Tsai <[email protected]>

Co-authored-by: Pratik Nayak <[email protected]>

Co-authored-by: Yu-Hsiang Tsai <[email protected]>

pratikvn · 2023-11-05T16:20:26Z

format!

Co-authored-by: Pratik Nayak <[email protected]>

pratikvn · 2023-11-05T23:43:27Z

Turns out the no-circular-deps job is terribly slow. I verified (with the same config and flags as the job, inside the same image with a docker container) that it builds successfully with GINKGO_CHECK_CIRCULAR_DEPS=on, so I will go ahead and merge this.

sonarqubecloud · 2023-11-06T05:07:42Z

Kudos, SonarCloud Quality Gate passed!

0 Bugs
0 Vulnerabilities
0 Security Hotspots
7 Code Smells

98.6% Coverage
17.7% Duplication

The version of Java (11.0.3) you have used to run this analysis is deprecated and we will stop accepting it soon. Please update to at least Java 17.
Read more here

Release 1.7.0 to master The Ginkgo team is proud to announce the new Ginkgo minor release 1.7.0. This release brings new features such as: - Complete GPU-resident sparse direct solvers feature set and interfaces, - Improved Cholesky factorization performance, - A new MC64 reordering, - Batched iterative solver support with the BiCGSTAB solver with batched Dense and ELL matrix types, - MPI support for the SYCL backend, - Improved ParILU(T)/ParIC(T) preconditioner convergence, and more! If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions). Supported systems and requirements: + For all platforms, CMake 3.16+ + C++14 compliant compiler + Linux and macOS + GCC: 5.5+ + clang: 3.9+ + Intel compiler: 2019+ + Apple Clang: 14.0 is tested. Earlier versions might also work. + NVHPC: 22.7+ + Cray Compiler: 14.0.1+ + CUDA module: CMake 3.18+, and CUDA 10.1+ or NVHPC 22.7+ + HIP module: ROCm 4.5+ + DPC++ module: Intel oneAPI 2022.1+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp` or `icpx`. + MPI: standard version 3.1+, ideally GPU Aware, for best performance + Windows + MinGW: GCC 5.5+ + Microsoft Visual Studio: VS 2019+ + CUDA module: CUDA 10.1+, Microsoft Visual Studio + OpenMP module: MinGW. ### Version support changes + CUDA 9.2 is no longer supported and 10.0 is untested [#1382](#1382) + Ginkgo now requires CMake version 3.16 (and 3.18 for CUDA) [#1368](#1368) ### Interface changes + `const` Factory parameters can no longer be modified through `with_*` functions, as this breaks const-correctness [#1336](#1336) [#1439](#1439) ### New Deprecations + The `device_reset` parameter of CUDA and HIP executors no longer has an effect, and its `allocation_mode` parameters have been deprecated in favor of the `Allocator` interface. [#1315](#1315) + The CMake parameter `GINKGO_BUILD_DPCPP` has been deprecated in favor of `GINKGO_BUILD_SYCL`. [#1350](#1350) + The `gko::reorder::Rcm` interface has been deprecated in favor of `gko::experimental::reorder::Rcm` based on `Permutation`. [#1418](#1418) + The Permutation class' `permute_mask` functionality. [#1415](#1415) + Multiple functions with typos (`set_complex_subpsace()`, range functions such as `conj_operaton` etc). [#1348](#1348) ### Summary of previous deprecations + `gko::lend()` is not necessary anymore. + The classes `RelativeResidualNorm` and `AbsoluteResidualNorm` are deprecated in favor of `ResidualNorm`. + The class `AmgxPgm` is deprecated in favor of `Pgm`. + Default constructors for the CSR `load_balance` and `automatical` strategies + The PolymorphicObject's move-semantic `copy_from` variant + The templated `SolverBase` class. + The class `MachineTopology` is deprecated in favor of `machine_topology`. + Logger constructors and create functions with the `executor` parameter. + The virtual, protected, Dense functions `compute_norm1_impl`, `add_scaled_impl`, etc. + Logger events for solvers and criterion without the additional `implicit_tau_sq` parameter. + The global `gko::solver::default_krylov_dim`, use instead `gko::solver::gmres_default_krylov_dim`. ### Added features + Adds a batch::BatchLinOp class that forms a base class for batched linear operators such as batched matrix formats, solver and preconditioners [#1379](#1379) + Adds a batch::MultiVector class that enables operations such as dot, norm, scale on batched vectors [#1371](#1371) + Adds a batch::Dense matrix format that stores batched dense matrices and provides gemv operations for these dense matrices. [#1413](#1413) + Adds a batch::Ell matrix format that stores batched Ell matrices and provides spmv operations for these batched Ell matrices. [#1416](#1416) [#1437](#1437) + Add a batch::Bicgstab solver (class, core, and reference kernels) that enables iterative solution of batched linear systems [#1438](#1438). + Add device kernels (CUDA, HIP, and DPCPP) for batch::Bicgstab solver. [#1443](#1443). + New MC64 reordering algorithm which optimizes the diagonal product or sum of a matrix by permuting the rows, and computes additional scaling factors for equilibriation [#1120](#1120) + New interface for (non-symmetric) permutation and scaled permutation of Dense and Csr matrices [#1415](#1415) + LU and Cholesky Factorizations can now be separated into their factors [#1432](#1432) + New symbolic LU factorization algorithm that is optimized for matrices with an almost-symmetric sparsity pattern [#1445](#1445) + Sorting kernels for SparsityCsr on all backends [#1343](#1343) + Allow passing pre-generated local solver as factory parameter for the distributed Schwarz preconditioner [#1426](#1426) + Add DPCPP kernels for Partition [#1034](#1034), and CSR's `check_diagonal_entries` and `add_scaled_identity` functionality [#1436](#1436) + Adds a helper function to create a partition based on either local sizes, or local ranges [#1227](#1227) + Add function to compute arithmetic mean of dense and distributed vectors [#1275](#1275) + Adds `icpx` compiler supports [#1350](#1350) + All backends can be built simultaneously [#1333](#1333) + Emits a CMake warning in downstream projects that use different compilers than the installed Ginkgo [#1372](#1372) + Reordering algorithms in sparse_blas benchmark [#1354](#1354) + Benchmarks gained an `-allocator` parameter to specify device allocators [#1385](#1385) + Benchmarks gained an `-input_matrix` parameter that initializes the input JSON based on the filename [#1387](#1387) + Benchmark inputs can now be reordered as a preprocessing step [#1408](#1408) ### Improvements + Significantly improve Cholesky factorization performance [#1366](#1366) + Improve parallel build performance [#1378](#1378) + Allow constrained parallel test execution using CTest resources [#1373](#1373) + Use arithmetic type more inside mixed precision ELL [#1414](#1414) + Most factory parameters of factory type no longer need to be constructed explicitly via `.on(exec)` [#1336](#1336) [#1439](#1439) + Improve ParILU(T)/ParIC(T) convergence by using more appropriate atomic operations [#1434](#1434) ### Fixes + Fix an over-allocation for OpenMP reductions [#1369](#1369) + Fix DPCPP's common-kernel reduction for empty input sizes [#1362](#1362) + Fix several typos in the API and documentation [#1348](#1348) + Fix inconsistent `Threads` between generations [#1388](#1388) + Fix benchmark median condition [#1398](#1398) + Fix HIP 5.6.0 compilation [#1411](#1411) + Fix missing destruction of rand_generator from cuda/hip [#1417](#1417) + Fix PAPI logger destruction order [#1419](#1419) + Fix TAU logger compilation [#1422](#1422) + Fix relative criterion to not iterate if the residual is already zero [#1079](#1079) + Fix memory_order invocations with C++20 changes [#1402](#1402) + Fix `check_diagonal_entries_exist` report correctly when only missing diagonal value in the last rows. [#1440](#1440) + Fix checking OpenMPI version in cross-compilation settings [#1446](#1446) + Fix false-positive deprecation warnings in Ginkgo, especially for the old Rcm (it doesn't emit deprecation warnings anymore as a result but is still considered deprecated) [#1444](#1444) ### Related PR: #1451

Release 1.7.0 to develop The Ginkgo team is proud to announce the new Ginkgo minor release 1.7.0. This release brings new features such as: - Complete GPU-resident sparse direct solvers feature set and interfaces, - Improved Cholesky factorization performance, - A new MC64 reordering, - Batched iterative solver support with the BiCGSTAB solver with batched Dense and ELL matrix types, - MPI support for the SYCL backend, - Improved ParILU(T)/ParIC(T) preconditioner convergence, and more! If you face an issue, please first check our [known issues page](https://github.com/ginkgo-project/ginkgo/wiki/Known-Issues) and the [open issues list](https://github.com/ginkgo-project/ginkgo/issues) and if you do not find a solution, feel free to [open a new issue](https://github.com/ginkgo-project/ginkgo/issues/new/choose) or ask a question using the [github discussions](https://github.com/ginkgo-project/ginkgo/discussions). Supported systems and requirements: + For all platforms, CMake 3.16+ + C++14 compliant compiler + Linux and macOS + GCC: 5.5+ + clang: 3.9+ + Intel compiler: 2019+ + Apple Clang: 14.0 is tested. Earlier versions might also work. + NVHPC: 22.7+ + Cray Compiler: 14.0.1+ + CUDA module: CMake 3.18+, and CUDA 10.1+ or NVHPC 22.7+ + HIP module: ROCm 4.5+ + DPC++ module: Intel oneAPI 2022.1+ with oneMKL and oneDPL. Set the CXX compiler to `dpcpp` or `icpx`. + MPI: standard version 3.1+, ideally GPU Aware, for best performance + Windows + MinGW: GCC 5.5+ + Microsoft Visual Studio: VS 2019+ + CUDA module: CUDA 10.1+, Microsoft Visual Studio + OpenMP module: MinGW. ### Version support changes + CUDA 9.2 is no longer supported and 10.0 is untested [#1382](#1382) + Ginkgo now requires CMake version 3.16 (and 3.18 for CUDA) [#1368](#1368) ### Interface changes + `const` Factory parameters can no longer be modified through `with_*` functions, as this breaks const-correctness [#1336](#1336) [#1439](#1439) ### New Deprecations + The `device_reset` parameter of CUDA and HIP executors no longer has an effect, and its `allocation_mode` parameters have been deprecated in favor of the `Allocator` interface. [#1315](#1315) + The CMake parameter `GINKGO_BUILD_DPCPP` has been deprecated in favor of `GINKGO_BUILD_SYCL`. [#1350](#1350) + The `gko::reorder::Rcm` interface has been deprecated in favor of `gko::experimental::reorder::Rcm` based on `Permutation`. [#1418](#1418) + The Permutation class' `permute_mask` functionality. [#1415](#1415) + Multiple functions with typos (`set_complex_subpsace()`, range functions such as `conj_operaton` etc). [#1348](#1348) ### Summary of previous deprecations + `gko::lend()` is not necessary anymore. + The classes `RelativeResidualNorm` and `AbsoluteResidualNorm` are deprecated in favor of `ResidualNorm`. + The class `AmgxPgm` is deprecated in favor of `Pgm`. + Default constructors for the CSR `load_balance` and `automatical` strategies + The PolymorphicObject's move-semantic `copy_from` variant + The templated `SolverBase` class. + The class `MachineTopology` is deprecated in favor of `machine_topology`. + Logger constructors and create functions with the `executor` parameter. + The virtual, protected, Dense functions `compute_norm1_impl`, `add_scaled_impl`, etc. + Logger events for solvers and criterion without the additional `implicit_tau_sq` parameter. + The global `gko::solver::default_krylov_dim`, use instead `gko::solver::gmres_default_krylov_dim`. ### Added features + Adds a batch::BatchLinOp class that forms a base class for batched linear operators such as batched matrix formats, solver and preconditioners [#1379](#1379) + Adds a batch::MultiVector class that enables operations such as dot, norm, scale on batched vectors [#1371](#1371) + Adds a batch::Dense matrix format that stores batched dense matrices and provides gemv operations for these dense matrices. [#1413](#1413) + Adds a batch::Ell matrix format that stores batched Ell matrices and provides spmv operations for these batched Ell matrices. [#1416](#1416) [#1437](#1437) + Add a batch::Bicgstab solver (class, core, and reference kernels) that enables iterative solution of batched linear systems [#1438](#1438). + Add device kernels (CUDA, HIP, and DPCPP) for batch::Bicgstab solver. [#1443](#1443). + New MC64 reordering algorithm which optimizes the diagonal product or sum of a matrix by permuting the rows, and computes additional scaling factors for equilibriation [#1120](#1120) + New interface for (non-symmetric) permutation and scaled permutation of Dense and Csr matrices [#1415](#1415) + LU and Cholesky Factorizations can now be separated into their factors [#1432](#1432) + New symbolic LU factorization algorithm that is optimized for matrices with an almost-symmetric sparsity pattern [#1445](#1445) + Sorting kernels for SparsityCsr on all backends [#1343](#1343) + Allow passing pre-generated local solver as factory parameter for the distributed Schwarz preconditioner [#1426](#1426) + Add DPCPP kernels for Partition [#1034](#1034), and CSR's `check_diagonal_entries` and `add_scaled_identity` functionality [#1436](#1436) + Adds a helper function to create a partition based on either local sizes, or local ranges [#1227](#1227) + Add function to compute arithmetic mean of dense and distributed vectors [#1275](#1275) + Adds `icpx` compiler supports [#1350](#1350) + All backends can be built simultaneously [#1333](#1333) + Emits a CMake warning in downstream projects that use different compilers than the installed Ginkgo [#1372](#1372) + Reordering algorithms in sparse_blas benchmark [#1354](#1354) + Benchmarks gained an `-allocator` parameter to specify device allocators [#1385](#1385) + Benchmarks gained an `-input_matrix` parameter that initializes the input JSON based on the filename [#1387](#1387) + Benchmark inputs can now be reordered as a preprocessing step [#1408](#1408) ### Improvements + Significantly improve Cholesky factorization performance [#1366](#1366) + Improve parallel build performance [#1378](#1378) + Allow constrained parallel test execution using CTest resources [#1373](#1373) + Use arithmetic type more inside mixed precision ELL [#1414](#1414) + Most factory parameters of factory type no longer need to be constructed explicitly via `.on(exec)` [#1336](#1336) [#1439](#1439) + Improve ParILU(T)/ParIC(T) convergence by using more appropriate atomic operations [#1434](#1434) ### Fixes + Fix an over-allocation for OpenMP reductions [#1369](#1369) + Fix DPCPP's common-kernel reduction for empty input sizes [#1362](#1362) + Fix several typos in the API and documentation [#1348](#1348) + Fix inconsistent `Threads` between generations [#1388](#1388) + Fix benchmark median condition [#1398](#1398) + Fix HIP 5.6.0 compilation [#1411](#1411) + Fix missing destruction of rand_generator from cuda/hip [#1417](#1417) + Fix PAPI logger destruction order [#1419](#1419) + Fix TAU logger compilation [#1422](#1422) + Fix relative criterion to not iterate if the residual is already zero [#1079](#1079) + Fix memory_order invocations with C++20 changes [#1402](#1402) + Fix `check_diagonal_entries_exist` report correctly when only missing diagonal value in the last rows. [#1440](#1440) + Fix checking OpenMPI version in cross-compilation settings [#1446](#1446) + Fix false-positive deprecation warnings in Ginkgo, especially for the old Rcm (it doesn't emit deprecation warnings anymore as a result but is still considered deprecated) [#1444](#1444) ### Related PR: #1454

pratikvn added 1:ST:WIP This PR is a work in progress. Not ready for review. type:batched-functionality This is related to the batched functionality in Ginkgo labels Oct 26, 2023

pratikvn added this to the Release 1.7.0 milestone Oct 26, 2023

pratikvn self-assigned this Oct 26, 2023

ginkgo-bot added reg:build This is related to the build system. reg:testing This is related to testing. mod:core This is related to the core module. mod:cuda This is related to the CUDA module. type:solver This is related to the solvers mod:hip This is related to the HIP module. labels Oct 26, 2023

MarcelKoch reviewed Oct 27, 2023

View reviewed changes

yhmtsai reviewed Oct 29, 2023

View reviewed changes

MarcelKoch reviewed Oct 30, 2023

View reviewed changes

yhmtsai mentioned this pull request Oct 30, 2023

Add a batch::Bicgstab solver class, core, ref and omp kernels #1438

Merged

yhmtsai reviewed Oct 30, 2023

View reviewed changes

pratikvn force-pushed the batch-bicgstab-device branch from b8def5b to b653d3b Compare October 30, 2023 13:38

yhmtsai reviewed Oct 30, 2023

View reviewed changes

pratikvn force-pushed the batch-bicgstab-device branch from b653d3b to fb50eaf Compare October 30, 2023 21:37

pratikvn added 1:ST:ready-for-review This PR is ready for review and removed 1:ST:WIP This PR is a work in progress. Not ready for review. labels Oct 31, 2023

pratikvn force-pushed the batch-bicgstab branch 3 times, most recently from 8982811 to 28560a5 Compare October 31, 2023 14:04

upsj reviewed Oct 31, 2023

View reviewed changes

hip/base/exception.hip.hpp Outdated Show resolved Hide resolved

pratikvn force-pushed the batch-bicgstab branch from e21b275 to 2260c8f Compare October 31, 2023 22:47

Base automatically changed from batch-bicgstab to develop November 1, 2023 09:06

yhmtsai reviewed Nov 1, 2023

View reviewed changes

common/cuda_hip/stop/batch_criteria.hpp.inc Outdated Show resolved Hide resolved

core/base/batch_utilities.hpp Outdated Show resolved Hide resolved

core/device_hooks/common_kernels.inc.cpp Outdated Show resolved Hide resolved

pratikvn force-pushed the batch-bicgstab-device branch from fb50eaf to d21d5fd Compare November 1, 2023 10:57

pratikvn and others added 16 commits November 5, 2023 17:10

Use synchronize for error handling

84be7dd

Format files

c6e9543

Co-authored-by: Pratik Nayak <[email protected]>

Add scoped cuda shmem config

1054b7b

move max_shmem query to internal

cc22557

Update size_type in tests

501c4e7

Update contributors.txt

7b0ebfd

Co-authored-by: Phuong Nguyen <[email protected]>

review updates

f1babfd

Co-authored-by: Yu-Hsiang Tsai <[email protected]>

Format files

221bba9

Co-authored-by: Pratik Nayak <[email protected]>

dpcpp group size and doc fixes

aa026c1

use global_and_local barrier

79e5cad

Fix Intel2020 apply call issue

693d308

Fix diag_dominance and tol issue

705339e

Fix some include issues

6729f68

Review updates

eebc06a

Co-authored-by: Yu-Hsiang Tsai <[email protected]>

use fence_space::global_and_local

498512c

Use updated deferred factory macros.

1bc6d83

pratikvn force-pushed the batch-bicgstab-device branch from f48179b to f600023 Compare November 5, 2023 16:11

Review updates

79e68b3

Co-authored-by: Yu-Hsiang Tsai <[email protected]>

pratikvn force-pushed the batch-bicgstab-device branch from f600023 to 79e68b3 Compare November 5, 2023 16:17

Format files

a1b84d4

Co-authored-by: Pratik Nayak <[email protected]>

yhmtsai approved these changes Nov 5, 2023

View reviewed changes

pratikvn merged commit 47b3267 into develop Nov 5, 2023
12 of 13 checks passed

pratikvn deleted the batch-bicgstab-device branch November 5, 2023 23:44

tcojean mentioned this pull request Nov 6, 2023

Release 1.7.0 to master #1451

Merged

pratikvn mentioned this pull request Dec 18, 2023

Update contributors.txt #1399

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add CUDA, HIP and DPCPP batch bicgstab kernels #1443

Add CUDA, HIP and DPCPP batch bicgstab kernels #1443

pratikvn commented Oct 26, 2023 •

edited

Loading

MarcelKoch left a comment

MarcelKoch Oct 27, 2023

pratikvn Oct 29, 2023

MarcelKoch Oct 30, 2023

upsj Oct 31, 2023

MarcelKoch Oct 27, 2023

MarcelKoch Oct 27, 2023

pratikvn Oct 29, 2023

MarcelKoch Oct 29, 2023

pratikvn Oct 30, 2023

MarcelKoch Oct 30, 2023

pratikvn commented Oct 29, 2023

yhmtsai Oct 29, 2023

pratikvn Oct 30, 2023

upsj Oct 31, 2023

MarcelKoch Nov 1, 2023

yhmtsai Oct 29, 2023

pratikvn Oct 30, 2023

yhmtsai Oct 29, 2023

MarcelKoch Oct 30, 2023

yhmtsai left a comment

yhmtsai Oct 30, 2023

yhmtsai Oct 30, 2023

pratikvn commented Nov 1, 2023

pratikvn commented Nov 5, 2023

pratikvn commented Nov 5, 2023

sonarqubecloud bot commented Nov 6, 2023

		inline batch::matrix::ell::uniform_batch<const hip_type<ValueType>,
		const IndexType>

Add CUDA, HIP and DPCPP batch bicgstab kernels #1443

Add CUDA, HIP and DPCPP batch bicgstab kernels #1443

Conversation

pratikvn commented Oct 26, 2023 • edited Loading

TODO

MarcelKoch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pratikvn commented Oct 29, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yhmtsai left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pratikvn commented Nov 1, 2023

pratikvn commented Nov 5, 2023

pratikvn commented Nov 5, 2023

sonarqubecloud bot commented Nov 6, 2023

pratikvn commented Oct 26, 2023 •

edited

Loading