Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add batch::cg solver device kernels #1609

Merged
merged 11 commits into from
May 23, 2024
Merged

Add batch::cg solver device kernels #1609

merged 11 commits into from
May 23, 2024

Conversation

pratikvn
Copy link
Member

@pratikvn pratikvn commented May 10, 2024

This PR adds the CUDA/HIP/DPCPP device kernels for the batch CG solver.

A lot of similarities between existing bicgstab kernels and this one, which will be unified at a later stage.

@pratikvn pratikvn added 1:ST:ready-for-review This PR is ready for review type:batched-functionality This is related to the batched functionality in Ginkgo labels May 10, 2024
@pratikvn pratikvn requested a review from a team May 10, 2024 11:26
@pratikvn pratikvn self-assigned this May 10, 2024
@MarcelKoch MarcelKoch added this to the Ginkgo 1.8.0 milestone May 10, 2024
@MarcelKoch MarcelKoch self-requested a review May 10, 2024 11:27
@ginkgo-bot ginkgo-bot added reg:build This is related to the build system. mod:core This is related to the core module. mod:cuda This is related to the CUDA module. mod:reference This is related to the reference module. type:solver This is related to the solvers type:preconditioner This is related to the preconditioners mod:hip This is related to the HIP module. mod:dpcpp This is related to the DPC++ module. labels May 10, 2024
@pratikvn pratikvn added the 1:ST:no-changelog-entry Skip the wiki check for changelog update label May 10, 2024
@pratikvn pratikvn changed the title Add batch::cg solver cuda/hip kernels Add batch::cg solver device kernels May 10, 2024
@pratikvn pratikvn force-pushed the batch-cg-device branch 6 times, most recently from 03b7cac to a0b40d5 Compare May 10, 2024 13:15
Copy link
Member

@MarcelKoch MarcelKoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I've left mostly nits and some open-ended questions.

Maybe also list the unrelated changes. So far I gathered these:

  • snake_case for bicgstab kernel_caller
  • return bytes from scalar jacobi dynamic_work_size

@@ -17,7 +17,7 @@ public:
__host__ __device__ static constexpr int dynamic_work_size(
const int num_rows, int)
{
return num_rows;
return num_rows * sizeof(value_type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that some rebase left over?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I moved it to #1600 now. I think that will be merged first, so will rebase this on that afterwards

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then maybe change the base of the PR? Makes it easier to review.

cuda/solver/batch_cg_kernels.cu Outdated Show resolved Hide resolved
Comment on lines +117 to +119
template <typename StopType, const int n_shared,
const bool prec_shared_bool, typename PrecType, typename LogType,
typename BatchMatrixType>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: these parameters are ordered differently than for call_apply. Maybe order the shared parameters consistently.

cuda/solver/batch_cg_kernels.cu Outdated Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.hpp.inc Outdated Show resolved Hide resolved
dpcpp/solver/batch_cg_kernels.dp.cpp Outdated Show resolved Hide resolved
dpcpp/solver/batch_cg_kernels.hpp.inc Show resolved Hide resolved
dpcpp/solver/batch_cg_kernels.hpp.inc Outdated Show resolved Hide resolved
core/solver/batch_bicgstab_kernels.hpp Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.hpp.inc Outdated Show resolved Hide resolved
dpcpp/solver/batch_bicgstab_kernels.hpp.inc Outdated Show resolved Hide resolved
Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some sycl algorithm part (not kernel details) are different from cuda hip.

core/solver/batch_bicgstab_kernels.hpp Show resolved Hide resolved
cuda/solver/batch_cg_kernels.cu Outdated Show resolved Hide resolved
dpcpp/solver/batch_cg_kernels.dp.cpp Outdated Show resolved Hide resolved
dpcpp/solver/batch_cg_kernels.dp.cpp Outdated Show resolved Hide resolved
dpcpp/solver/batch_cg_kernels.dp.cpp Outdated Show resolved Hide resolved
dpcpp/solver/batch_cg_kernels.hpp.inc Show resolved Hide resolved
hip/solver/batch_cg_kernels.hip.cpp Show resolved Hide resolved
hip/solver/batch_cg_kernels.hip.cpp Show resolved Hide resolved
Comment on lines 182 to 183
auto linear_system =
setup_linsys_and_solver(mat, num_rhs, tol / 100, max_iters);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stopping by residual norm but checking the true error is still weird to me. the scale is 50000, which is a little high to me.
you also check the residual norm, so I do not hold this pr by this question now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suggest I dont check with the true solution at all, because I am definitely having issues with DPCPP with the tolerance. I also agree that 500 is too high.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the tol needs to be lower than current setup.
If it is the issue only in dpcpp, I think we need to be more careful on this.
For example, using the same n_shared settings, subgroup_size, group_size, and maybe using the same impl of reduction not from reduce_by_group on sycl and cuda side. If they still give quite different result, I think there are something wrong in the sync.

test/solver/batch_cg_kernels.cpp Outdated Show resolved Hide resolved
@pratikvn pratikvn force-pushed the batch-cg-device branch 2 times, most recently from c7894eb to 8bc651d Compare May 13, 2024 11:12
@pratikvn pratikvn force-pushed the batch-cg-device branch 3 times, most recently from adf2563 to 541e29a Compare May 16, 2024 21:19
Copy link

Quality Gate Failed Quality Gate failed

Failed conditions
42.9% Duplication on New Code (required ≤ 20%)

See analysis details on SonarCloud

Comment on lines 137 to 141
// reserve 3 for intermediate rho,
// alpha, reduce_over_group, and two norms
// If the value available is negative, then set it to 0
const int static_var_mem =
(group_size + 3) * sizeof(ValueType) + 2 * sizeof(real_type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still miss group_size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I dont understand what you mean.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the description only mention 3 for the result, right? but what' the group_size * sizeof(ValueType) here for

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was for local memory for reduce_over_group. But I think that was in a previous code. So, removed now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the cuda/hip part need to change? or they indeed use shared_memory?

dpcpp/solver/batch_cg_kernels.hpp.inc Show resolved Hide resolved
@@ -190,16 +190,15 @@ TEST_F(BatchCg, CanSolveLargeBatchSizeHpdSystem)
&logger->get_num_iterations());
auto res_norm = gko::make_temporary_clone(exec->get_master(),
&logger->get_residual_norm());
GKO_ASSERT_BATCH_MTX_NEAR(res.x, linear_system.exact_sol, tol * 50);
for (size_t i = 0; i < num_batch_items; i++) {
auto comp_res_norm = res.host_res_norm->get_const_values()[i] /
linear_system.host_rhs_norm->get_const_values()[i];
ASSERT_LE(iter_counts->get_const_data()[i], max_iters);
EXPECT_LE(res_norm->get_const_data()[i] /
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the` host_res_norm and res_norm from logger different?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, host_res_norm is the explicit residual norm: ||b-Ax||

for (size_t i = 0; i < num_batch_items; i++) {
auto comp_res_norm = res.host_res_norm->get_const_values()[i] /
linear_system.host_rhs_norm->get_const_values()[i];
ASSERT_LE(iter_counts->get_const_data()[i], max_iters);
EXPECT_LE(res_norm->get_const_data()[i] /
linear_system.host_rhs_norm->get_const_values()[i],
tol * 20);
tol * 100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the stopping criterion not based on this condition < tol?
It may contain the numerical rounding error from cg itself, but 100 times is 1e-3?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and later test does not need to change the tol.

auto shem_guard =
gko::kernels::cuda::detail::shared_memory_config_guard<
value_type>();
const int shmem_per_blk =
Copy link
Member

@yhmtsai yhmtsai May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here does not consider the 3 * ValueType and 2 * real_type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same for hip

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, here is a bit different from SYCL. It only considers the DnamicSharedMemory and the getter does not contain static shared memory limitation information.

Copy link
Member

@yhmtsai yhmtsai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. That will be great if you can confirm the CUDA/HIP only considers the DynamicSharedMemory Size

auto shem_guard =
gko::kernels::cuda::detail::shared_memory_config_guard<
value_type>();
const int shmem_per_blk =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, here is a bit different from SYCL. It only considers the DnamicSharedMemory and the getter does not contain static shared memory limitation information.

@pratikvn
Copy link
Member Author

@yhmtsai , yes. For CUDA/HIP we only consider dynamic shared memory and only that needs to be passed into the kernel. I dont think it is necessary to check for the static shared memory with CUDA/HIP

@pratikvn pratikvn merged commit 1782029 into develop May 23, 2024
12 of 15 checks passed
@pratikvn pratikvn deleted the batch-cg-device branch May 23, 2024 04:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1:ST:no-changelog-entry Skip the wiki check for changelog update 1:ST:ready-for-review This PR is ready for review mod:core This is related to the core module. mod:cuda This is related to the CUDA module. mod:dpcpp This is related to the DPC++ module. mod:hip This is related to the HIP module. mod:reference This is related to the reference module. reg:build This is related to the build system. type:batched-functionality This is related to the batched functionality in Ginkgo type:preconditioner This is related to the preconditioners type:solver This is related to the solvers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants