Add batch::cg solver device kernels #1609
Conversation
Looks good. I've left mostly nits and some open-ended questions.
Maybe also list the unrelated changes. So far I gathered these:
- snake_case for bicgstab kernel_caller
- return bytes from scalar jacobi dynamic_work_size
@@ -17,7 +17,7 @@ public:
     __host__ __device__ static constexpr int dynamic_work_size(
         const int num_rows, int)
     {
-        return num_rows;
+        return num_rows * sizeof(value_type);
Is this a rebase leftover?
Yes, but I moved it to #1600 now. I think that will be merged first, so I will rebase this on top of it afterwards.
Then maybe change the base of the PR? That would make it easier to review.
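Side note on the dynamic_work_size change above: returning a byte count rather than an element count lets callers size raw workspace buffers directly. A minimal hypothetical sketch of that usage (names illustrative, not the actual Ginkgo code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a preconditioner that reports its per-item
// workspace requirement in bytes, mirroring the diff above.
template <typename ValueType>
struct scalar_jacobi_like {
    static constexpr int dynamic_work_size(const int num_rows, int /* nnz */)
    {
        return num_rows * sizeof(ValueType);  // bytes, not element count
    }
};

// The caller can then allocate a raw byte buffer of exactly that size.
std::vector<char> allocate_workspace(const int num_rows, const int nnz)
{
    return std::vector<char>(
        scalar_jacobi_like<double>::dynamic_work_size(num_rows, nnz));
}
```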
template <typename StopType, const int n_shared,
          const bool prec_shared_bool, typename PrecType, typename LogType,
          typename BatchMatrixType>
nit: these parameters are ordered differently than for call_apply. Maybe order the shared parameters consistently.
Some SYCL algorithm parts (not kernel details) are different from CUDA/HIP.
test/solver/batch_cg_kernels.cpp
    auto linear_system =
        setup_linsys_and_solver(mat, num_rhs, tol / 100, max_iters);
Stopping by the residual norm but checking the true error is still weird to me, and the scale of 50000 is a little high to me.
You also check the residual norm, so I won't hold up this PR over this question for now.
Do you suggest I don't check against the true solution at all? I am definitely having issues with DPCPP and the tolerance. I also agree that 500 is too high.
Yes, but the tol needs to be lower than in the current setup.
If the issue only appears in DPCPP, I think we need to be more careful about this.
For example, use the same n_shared settings, subgroup_size, and group_size, and maybe use the same reduction implementation (not reduce_over_group) on the SYCL and CUDA sides. If they still give quite different results, I think something is wrong in the synchronization.
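To illustrate that suggestion, here is a minimal sketch (hypothetical, not the Ginkgo implementation) of an explicit shared-memory tree reduction that could be written the same way on the CUDA and SYCL sides instead of relying on the vendor's group reduction:

```cuda
// Hypothetical sketch: block-wide sum reduction with explicit shared memory
// and barriers, so the identical algorithm can be mirrored in SYCL with
// group barriers when debugging cross-backend differences.
template <typename ValueType>
__device__ ValueType block_sum(ValueType my_val, ValueType* sh_work,
                               const int tid, const int block_size)
{
    sh_work[tid] = my_val;
    __syncthreads();
    // tree reduction; assumes block_size is a power of two
    for (int offset = block_size / 2; offset > 0; offset /= 2) {
        if (tid < offset) {
            sh_work[tid] += sh_work[tid + offset];
        }
        __syncthreads();
    }
    return sh_work[0];  // valid for all threads after the final barrier
}
```

If both backends run the same explicit reduction and still disagree, the problem is more likely in synchronization than in the reduction itself.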
Quality Gate failed
dpcpp/solver/batch_cg_kernels.dp.cpp
    // reserve 3 for intermediate rho,
    // alpha, reduce_over_group, and two norms
    // If the value available is negative, then set it to 0
    const int static_var_mem =
        (group_size + 3) * sizeof(ValueType) + 2 * sizeof(real_type);
Does the description still miss group_size?
Sorry, I don't understand what you mean.
The description only mentions 3 for the results, right? But what is the group_size * sizeof(ValueType) here for?
That was local memory for reduce_over_group, but I think that was from a previous version of the code, so it is removed now.
Does the CUDA/HIP part need to change, or do they indeed use shared memory there?
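For reference, a minimal sketch of the local-memory bookkeeping under discussion (hypothetical helper, not the actual DPCPP launch code): subtract the statically reserved scratch from the per-group local-memory budget and use the remainder to decide how many intermediate vectors fit:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch only: the real kernels use Ginkgo's own configuration
// helpers. RealType stands in for remove_complex<ValueType>.
template <typename ValueType, typename RealType>
int num_vectors_fitting_in_local_mem(const int local_mem_per_group_bytes,
                                     const int num_rows)
{
    // statically reserved scratch: rho, alpha, one reduction slot,
    // plus two residual norms
    const int static_var_mem =
        3 * sizeof(ValueType) + 2 * sizeof(RealType);
    // remaining budget for per-row intermediate vectors, clamped at 0
    const int avail =
        std::max(0, local_mem_per_group_bytes - static_var_mem);
    // each shared vector stores one ValueType per row
    return avail / (num_rows * static_cast<int>(sizeof(ValueType)));
}
```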
@@ -190,16 +190,15 @@ TEST_F(BatchCg, CanSolveLargeBatchSizeHpdSystem)
        &logger->get_num_iterations());
    auto res_norm = gko::make_temporary_clone(exec->get_master(),
                                              &logger->get_residual_norm());
    GKO_ASSERT_BATCH_MTX_NEAR(res.x, linear_system.exact_sol, tol * 50);
    for (size_t i = 0; i < num_batch_items; i++) {
        auto comp_res_norm = res.host_res_norm->get_const_values()[i] /
                             linear_system.host_rhs_norm->get_const_values()[i];
        ASSERT_LE(iter_counts->get_const_data()[i], max_iters);
        EXPECT_LE(res_norm->get_const_data()[i] /
Are host_res_norm and the res_norm from the logger different?
Yes, host_res_norm is the explicit residual norm: ||b-Ax||
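As a small illustration of that distinction (hypothetical helper using a dense matrix for simplicity, not the test's actual batch utilities), the explicit residual norm recomputes ||b - Ax|| from the returned solution, independently of whatever residual the solver tracked internally:

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch: explicit residual norm ||b - A x|| for one dense
// n x n batch item stored in row-major order.
double explicit_residual_norm(const std::vector<double>& A,
                              const std::vector<double>& b,
                              const std::vector<double>& x, const int n)
{
    double sq_norm = 0.0;
    for (int i = 0; i < n; ++i) {
        double r_i = b[i];
        for (int j = 0; j < n; ++j) {
            r_i -= A[i * n + j] * x[j];
        }
        sq_norm += r_i * r_i;
    }
    return std::sqrt(sq_norm);
}
```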
test/solver/batch_cg_kernels.cpp
     for (size_t i = 0; i < num_batch_items; i++) {
         auto comp_res_norm = res.host_res_norm->get_const_values()[i] /
                              linear_system.host_rhs_norm->get_const_values()[i];
         ASSERT_LE(iter_counts->get_const_data()[i], max_iters);
         EXPECT_LE(res_norm->get_const_data()[i] /
                       linear_system.host_rhs_norm->get_const_values()[i],
-                  tol * 20);
+                  tol * 100);
Isn't the stopping criterion based on this condition being < tol?
It may contain numerical rounding error from CG itself, but 100 times would be 1e-3?
Also, the later test does not need to change the tol.
    auto shem_guard =
        gko::kernels::cuda::detail::shared_memory_config_guard<
            value_type>();
    const int shmem_per_blk =
Here it does not consider the 3 * ValueType and 2 * real_type.
Same for HIP.
Okay, this is a bit different from SYCL here: it only considers the dynamic shared memory, and the getter does not contain static shared memory limitation information.
LGTM. It would be great if you can confirm that CUDA/HIP only considers the dynamic shared memory size.
Co-authored-by: Isha Aggarwal <[email protected]>
Co-authored-by: Aditya Kashi <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
Co-authored-by: Marcel Koch <[email protected]>
Co-authored-by: Yu-Hsiang Tsai <[email protected]>
- remove checks against true solution
@yhmtsai, yes. For CUDA/HIP we only consider dynamic shared memory, and only that needs to be passed into the kernel. I don't think it is necessary to check for the static shared memory with CUDA/HIP.
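For readers less familiar with the distinction, here is a minimal CUDA sketch (illustrative only, not the Ginkgo launch code) of the dynamic shared-memory handling referred to above; static shared memory declared inside the kernel is accounted for by the compiler and does not enter this query:

```cuda
#include <cuda_runtime.h>

__global__ void batch_solver_kernel_stub()
{
    // dynamic shared memory, sized via the third launch-configuration argument
    extern __shared__ char dyn_shared[];
    // ... a real kernel would carve its shared vectors out of dyn_shared ...
}

int query_and_enable_dynamic_shmem(const int device_id)
{
    int max_dyn_shmem = 0;
    // maximum opt-in dynamic shared memory per block on this device
    cudaDeviceGetAttribute(&max_dyn_shmem,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device_id);
    // allow the kernel to actually use that much dynamic shared memory
    cudaFuncSetAttribute(batch_solver_kernel_stub,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         max_dyn_shmem);
    return max_dyn_shmem;
}
```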
This PR adds the CUDA/HIP/DPCPP device kernels for the batch CG solver.
There are a lot of similarities between the existing bicgstab kernels and this one; they will be unified at a later stage.