Add batch::cg solver device kernels #1609
Conversation
Looks good. I've left mostly nits and some open-ended questions.
Maybe also list the unrelated changes. So far I gathered these:
- snake_case for bicgstab kernel_caller
- return bytes from scalar jacobi dynamic_work_size
@@ -17,7 +17,7 @@ public:
     __host__ __device__ static constexpr int dynamic_work_size(
         const int num_rows, int)
     {
-        return num_rows;
+        return num_rows * sizeof(value_type);
Is this a rebase leftover?
Yes, but I moved it to #1600 now. I think that will be merged first, so I will rebase this on top of it afterwards.
Then maybe change the base of the PR? That would make it easier to review.
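Side note on the dynamic_work_size change above: returning a byte count rather than an element count lets callers size raw workspace buffers directly. A minimal hypothetical sketch of that usage (names illustrative, not the actual Ginkgo code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a preconditioner that reports its per-item
// workspace requirement in bytes, mirroring the diff above.
template <typename ValueType>
struct scalar_jacobi_like {
    static constexpr int dynamic_work_size(const int num_rows, int /* nnz */)
    {
        return num_rows * sizeof(ValueType);  // bytes, not element count
    }
};

// The caller can then allocate a raw byte buffer of exactly that size.
std::vector<char> allocate_workspace(const int num_rows, const int nnz)
{
    return std::vector<char>(
        scalar_jacobi_like<double>::dynamic_work_size(num_rows, nnz));
}
```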
template <typename StopType, const int n_shared,
          const bool prec_shared_bool, typename PrecType, typename LogType,
          typename BatchMatrixType>
nit: these parameters are ordered differently than for call_apply. Maybe order the shared parameters consistently.
Some SYCL algorithm parts (not kernel details) are different from CUDA/HIP.
test/solver/batch_cg_kernels.cpp
    auto linear_system =
        setup_linsys_and_solver(mat, num_rhs, tol / 100, max_iters);
Stopping by the residual norm but checking the true error is still weird to me, and the scale of 50000 is a little high to me.
You also check the residual norm, so I won't hold up this PR over this question for now.
Do you suggest I don't check against the true solution at all? I am definitely having issues with DPCPP and the tolerance. I also agree that 500 is too high.
Yes, but the tol needs to be lower than in the current setup.
If the issue only appears in DPCPP, I think we need to be more careful about this.
For example, use the same n_shared settings, subgroup_size, and group_size, and maybe use the same reduction implementation (not reduce_over_group) on the SYCL and CUDA sides. If they still give quite different results, I think something is wrong in the synchronization.
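To illustrate that suggestion, here is a minimal sketch (hypothetical, not the Ginkgo implementation) of an explicit shared-memory tree reduction that could be written the same way on the CUDA and SYCL sides instead of relying on the vendor's group reduction:

```cuda
// Hypothetical sketch: block-wide sum reduction with explicit shared memory
// and barriers, so the identical algorithm can be mirrored in SYCL with
// group barriers when debugging cross-backend differences.
template <typename ValueType>
__device__ ValueType block_sum(ValueType my_val, ValueType* sh_work,
                               const int tid, const int block_size)
{
    sh_work[tid] = my_val;
    __syncthreads();
    // tree reduction; assumes block_size is a power of two
    for (int offset = block_size / 2; offset > 0; offset /= 2) {
        if (tid < offset) {
            sh_work[tid] += sh_work[tid + offset];
        }
        __syncthreads();
    }
    return sh_work[0];  // valid for all threads after the final barrier
}
```

If both backends run the same explicit reduction and still disagree, the problem is more likely in synchronization than in the reduction itself.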
Quality Gate failed
dpcpp/solver/batch_cg_kernels.dp.cpp
    // reserve 3 for intermediate rho,
    // alpha, reduce_over_group, and two norms
    // If the value available is negative, then set it to 0
    const int static_var_mem =
        (group_size + 3) * sizeof(ValueType) + 2 * sizeof(real_type);
Does the description still miss group_size?
Sorry, I don't understand what you mean.
The description only mentions 3 for the results, right? But what is the group_size * sizeof(ValueType) here for?
That was local memory for reduce_over_group, but I think that was from a previous version of the code, so it is removed now.
Does the CUDA/HIP part need to change, or do they indeed use shared memory there?
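For reference, a minimal sketch of the local-memory bookkeeping under discussion (hypothetical helper, not the actual DPCPP launch code): subtract the statically reserved scratch from the per-group local-memory budget and use the remainder to decide how many intermediate vectors fit:

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch only: the real kernels use Ginkgo's own configuration
// helpers. RealType stands in for remove_complex<ValueType>.
template <typename ValueType, typename RealType>
int num_vectors_fitting_in_local_mem(const int local_mem_per_group_bytes,
                                     const int num_rows)
{
    // statically reserved scratch: rho, alpha, one reduction slot,
    // plus two residual norms
    const int static_var_mem =
        3 * sizeof(ValueType) + 2 * sizeof(RealType);
    // remaining budget for per-row intermediate vectors, clamped at 0
    const int avail =
        std::max(0, local_mem_per_group_bytes - static_var_mem);
    // each shared vector stores one ValueType per row
    return avail / (num_rows * static_cast<int>(sizeof(ValueType)));
}
```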
@@ -190,16 +190,15 @@ TEST_F(BatchCg, CanSolveLargeBatchSizeHpdSystem)
        &logger->get_num_iterations());
    auto res_norm = gko::make_temporary_clone(exec->get_master(),
                                              &logger->get_residual_norm());
    GKO_ASSERT_BATCH_MTX_NEAR(res.x, linear_system.exact_sol, tol * 50);
    for (size_t i = 0; i < num_batch_items; i++) {
        auto comp_res_norm = res.host_res_norm->get_const_values()[i] /
                             linear_system.host_rhs_norm->get_const_values()[i];
        ASSERT_LE(iter_counts->get_const_data()[i], max_iters);
        EXPECT_LE(res_norm->get_const_data()[i] /
Are host_res_norm and the res_norm from the logger different?
Yes, host_res_norm is the explicit residual norm: ||b-Ax||
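As a small illustration of that distinction (hypothetical helper using a dense matrix for simplicity, not the test's actual batch utilities), the explicit residual norm recomputes ||b - Ax|| from the returned solution, independently of whatever residual the solver tracked internally:

```cpp
#include <cmath>
#include <vector>

// Hypothetical sketch: explicit residual norm ||b - A x|| for one dense
// n x n batch item stored in row-major order.
double explicit_residual_norm(const std::vector<double>& A,
                              const std::vector<double>& b,
                              const std::vector<double>& x, const int n)
{
    double sq_norm = 0.0;
    for (int i = 0; i < n; ++i) {
        double r_i = b[i];
        for (int j = 0; j < n; ++j) {
            r_i -= A[i * n + j] * x[j];
        }
        sq_norm += r_i * r_i;
    }
    return std::sqrt(sq_norm);
}
```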
test/solver/batch_cg_kernels.cpp
     for (size_t i = 0; i < num_batch_items; i++) {
         auto comp_res_norm = res.host_res_norm->get_const_values()[i] /
                              linear_system.host_rhs_norm->get_const_values()[i];
         ASSERT_LE(iter_counts->get_const_data()[i], max_iters);
         EXPECT_LE(res_norm->get_const_data()[i] /
                       linear_system.host_rhs_norm->get_const_values()[i],
-                  tol * 20);
+                  tol * 100);
Isn't the stopping criterion based on this condition being < tol?
It may contain numerical rounding error from CG itself, but 100 times would be 1e-3?
Also, the later test does not need to change the tol.
    auto shem_guard =
        gko::kernels::cuda::detail::shared_memory_config_guard<
            value_type>();
    const int shmem_per_blk =
Here it does not consider the 3 * ValueType and 2 * real_type.
Same for HIP.
Okay, this is a bit different from SYCL here: it only considers the dynamic shared memory, and the getter does not contain static shared memory limitation information.
LGTM. It would be great if you can confirm that CUDA/HIP only considers the dynamic shared memory size.
Co-authored-by: Isha Aggarwal <[email protected]>
Co-authored-by: Aditya Kashi <[email protected]>
Co-authored-by: Phuong Nguyen <[email protected]>
Co-authored-by: Marcel Koch <[email protected]>
Co-authored-by: Yu-Hsiang Tsai <[email protected]>
- remove checks against true solution
@yhmtsai, yes. For CUDA/HIP we only consider dynamic shared memory, and only that needs to be passed into the kernel. I don't think it is necessary to check for the static shared memory with CUDA/HIP.
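For readers less familiar with the distinction, here is a minimal CUDA sketch (illustrative only, not the Ginkgo launch code) of the dynamic shared-memory handling referred to above; static shared memory declared inside the kernel is accounted for by the compiler and does not enter this query:

```cuda
#include <cuda_runtime.h>

__global__ void batch_solver_kernel_stub()
{
    // dynamic shared memory, sized via the third launch-configuration argument
    extern __shared__ char dyn_shared[];
    // ... a real kernel would carve its shared vectors out of dyn_shared ...
}

int query_and_enable_dynamic_shmem(const int device_id)
{
    int max_dyn_shmem = 0;
    // maximum opt-in dynamic shared memory per block on this device
    cudaDeviceGetAttribute(&max_dyn_shmem,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device_id);
    // allow the kernel to actually use that much dynamic shared memory
    cudaFuncSetAttribute(batch_solver_kernel_stub,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         max_dyn_shmem);
    return max_dyn_shmem;
}
```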
This PR adds the CUDA/HIP/DPCPP device kernels for the batch CG solver.
There are a lot of similarities between the existing bicgstab kernels and this one; they will be unified at a later stage.