NaN Residuals with CUDA GMRES+ParILUT #1486

Open · iontcheva opened this issue Dec 5, 2023 · 11 comments
Labels: is:bug, is:idea

@iontcheva
Hello,

I have been testing ParILUT with GMRES on linear systems extracted from my application.

I am seeing NaN residuals with the CUDA executor on an H100. The exact same code with the OMP and Reference executors converges in 14 GMRES iterations without issues.

The matrix of the linear system has on the order of 500K rows; due to the limit on the size of files that can be uploaded, I have split it into 7 parts:
cis.mtx.gz.part-aa
cis.mtx.gz.part-ab
cis.mtx.gz.part-ac
cis.mtx.gz.part-ad
cis.mtx.gz.part-ae
cis.mtx.gz.part-af
cis.mtx.gz.part-ag

To merge these files into the actual matrix you can use:

```sh
cat cis.mtx.gz.part* > cis.mtx.gz
```

The rhs file is relatively small:
cis_rhs.mtx.gz

I will upload the files in a few submissions that follow.

See the attached screenshots showing the NaNs with the CUDA executor and the runs with the OMP and Reference executors:
[Screenshots: CUDA_exec_1, CUDA_exec_2, OMP_exec, Reference_exec]

@iontcheva

Part 1 of the matrix
cis.mtx.gz.part-aa.gz

@iontcheva

I have been debugging the issue and tracked it down to ginkgo/common/cuda_hip/factorization/par_ilut_spgeam_kernels.hpp.inc, line 131: lu_cur_val is NaN or Inf.

I am using a slightly modified version of ilu-preconditioned-solver-example.cpp for my testing.

These are the parameters that I am using:
```cpp
auto par_ilu_fact =
    gko::factorization::ParIlut<ValueType, IndexType>::build()
        .with_iterations(10u)
        .with_fill_in_limit(2.0)
        .on(exec);

const RealValueType reduction_factor{1e-7};
auto ilu_gmres_factory =
    gmres::build()
        .with_criteria(gko::stop::Iteration::build().with_max_iters(100u),
                       gko::stop::ResidualNorm<ValueType>::build()
                           .with_reduction_factor(reduction_factor))
        .with_generated_preconditioner(ilu_preconditioner)
        .on(exec);
```
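For completeness, `ilu_preconditioner` above is generated from the factorization as in the stock example (modulo my small modifications):

```cpp
// Following ilu-preconditioned-solver-example.cpp: generate the (approximate)
// factors, then wrap them in an ILU preconditioner that applies the two
// exact triangular solves.
auto par_ilu = gko::share(par_ilu_fact->generate(A));
auto ilu_pre_factory =
    gko::preconditioner::Ilu<gko::solver::LowerTrs<ValueType, IndexType>,
                             gko::solver::UpperTrs<ValueType, IndexType>,
                             false>::build()
        .on(exec);
auto ilu_preconditioner = gko::share(ilu_pre_factory->generate(par_ilu));
```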

Please let me know whether you can reproduce the issue.

@MarcelKoch (Member)

Hi @iontcheva, ParILUT can be quite unstable, especially across different executors, since it only computes an approximation of the ILU factorization. How close the approximation is usually depends on the .with_iterations parameter, so as a first step, maybe try increasing that parameter.
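For instance (a minimal sketch; the value 30u is just an illustration, not a tuned recommendation):

```cpp
// More asynchronous sweeps bring the iterates closer to the exact ILU
// factors, at the cost of more work during preconditioner generation.
auto par_ilu_fact =
    gko::factorization::ParIlut<ValueType, IndexType>::build()
        .with_iterations(30u)  // the report above used 10u
        .with_fill_in_limit(2.0)
        .on(exec);
```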

@upsj (Member) commented Dec 7, 2023

I think what's likely happening here is that while we guard against NaNs/Infs in our asynchronous sweep, they may still come up in the other operations, e.g. from an overflowing value in the SpGEMM (which is where lu_val comes from). Without looking at the specific problem, I'm not sure we can do much about this; the preconditioner may just not work on certain problems.
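To illustrate the failure mode: even if every input value is finite, an intermediate result can overflow to Inf, and a subsequent subtraction of Infs yields NaN (a self-contained sketch, not Ginkgo code):

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

int main()
{
    // A finite product can overflow to +Inf, and Inf - Inf == NaN, which
    // then propagates through the factorization and the solver residuals.
    double big = std::numeric_limits<double>::max();
    double prod = big * 2.0;    // overflows to +Inf
    double diff = prod - prod;  // Inf - Inf == NaN
    std::printf("prod finite? %d, diff NaN? %d\n",
                (int) std::isfinite(prod), (int) std::isnan(diff));
}
```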

@iontcheva

Hi @MarcelKoch, @upsj,

I did try many options for .with_iterations, including the value 20, which is quite high, but the issue is not resolved.

I think there is a bug in the implementation of the CUDA version.

The OMP and Reference versions work perfectly fine, as shown above, so I do not think the ParILUT algorithm itself is the problem.

Regarding the comment from @upsj: in tri_spgeam_init I have added a check after line 118 of ginkgo/common/cuda_hip/factorization/par_ilut_spgeam_kernels.hpp.inc, i.e. after

```cpp
auto lu_val =
    checked_load(lu_vals, lu_begin + lane, lu_end, zero());
```

the check

```cpp
if (!is_finite(lu_val) || is_nan(lu_val))
```

and it does not seem to get triggered, which I think should mean that the values computed by the cuSPARSE SpGEMM are fine.

What gets triggered is a similar check on the value lu_cur_val after line 131.
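That is, a check of the following form, placed right after the lu_cur_val load (sketched here rather than the exact patch):

```cpp
// This guard fires with the CUDA executor but not with OMP/Reference:
if (!is_finite(lu_cur_val) || is_nan(lu_cur_val)) {
    printf("non-finite lu_cur_val encountered\n");
}
```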

Did you manage to assemble the matrix that I sent and try the specific example? Without looking at the specific example, I do not think one can say much anyway.

If you can reproduce and resolve the issue on your side, I think it would make Ginkgo's CUDA backend much more useful for many applications.

One cannot solve anything harder (and all the cases in real applications are of that kind) with a simple ILU(0)-type preconditioner like ParILU; one needs more advanced preconditioners like ParILUT.

I am not sure whether this is relevant, but just an observation: after the fix with the atomic load_relaxed and store_relaxed in the sweep a few weeks ago, I was able to get some of my smaller examples to work with Ginkgo CUDA GMRES+ParILUT, which was not the case before - I was getting NaNs on all of my examples.

@uboats commented Dec 14, 2023

If using the Ginkgo static lib, it crashed at

```cpp
auto par_ilu_fact = gko::factorization::ParIlut
```

here:

```cpp
auto par_ilu = gko::share(par_ilu_fact->generate(A))
```

with

```
terminate called after throwing an instance of 'gko::CusparseError'
what(): /tools/ginkgo/ginkgo-git/cuda/base/cusparse_bindings.hpp:524: spgemm_work_estimation: Unknown error
Abort (core dumped)
```

@MarcelKoch (Member)

@uboats Are you also running this on an H100? I think our cuSPARSE exceptions might be a bit outdated. In any case, this error might be due to insufficient GPU memory for the factorization; I had the same issue on my old personal GPU.

@uboats commented Dec 14, 2023

@MarcelKoch yes, an H100 (80GB).
For a smaller case (a 357K-dimensional matrix), the error is:

```
Unrecoverable CUDA error on device 0 in deallocate:69: cudaErrorIllegalAddress: an illegal memory access was encountered
```

@uboats commented Dec 14, 2023

I will try ALG3 for the SpGEMM and see.

@uboats commented Dec 14, 2023

Tried ALG2 and it works, so it is a GPU memory issue: ALG1 needs too much memory.
I am not sure whether it is the CUDA SpGEMM that cannot return a correct error code or Ginkgo that does not translate it correctly; the reported error is just "Unknown error".
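For reference, this is roughly where the algorithm choice enters the cuSPARSE generic SpGEMM API (a sketch assuming CUDA 12.x; descriptor setup for A, B, C and error checking are omitted, and ALG2/ALG3 additionally offer cusparseSpGEMM_estimateMemory to split the work buffer into chunks, skipped here):

```cpp
#include <cuda_runtime.h>
#include <cusparse.h>

cusparseStatus_t spgemm_with_alg2(cusparseHandle_t handle,
                                  cusparseSpMatDescr_t A,
                                  cusparseSpMatDescr_t B,
                                  cusparseSpMatDescr_t C)
{
    const double alpha = 1.0;
    const double beta = 0.0;
    const auto op = CUSPARSE_OPERATION_NON_TRANSPOSE;
    const auto alg = CUSPARSE_SPGEMM_ALG2;  // less temporary memory than ALG1
    cusparseSpGEMMDescr_t desc;
    cusparseSpGEMM_createDescr(&desc);

    // Stage 1: size and fill the work-estimation buffer.
    size_t buf1_size = 0;
    void* buf1 = nullptr;
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, A, B, &beta, C,
                                  CUDA_R_64F, alg, desc, &buf1_size, nullptr);
    cudaMalloc(&buf1, buf1_size);
    cusparseSpGEMM_workEstimation(handle, op, op, &alpha, A, B, &beta, C,
                                  CUDA_R_64F, alg, desc, &buf1_size, buf1);

    // Stage 2: size and fill the compute buffer, then run the numeric phase.
    size_t buf2_size = 0;
    void* buf2 = nullptr;
    cusparseSpGEMM_compute(handle, op, op, &alpha, A, B, &beta, C,
                           CUDA_R_64F, alg, desc, &buf2_size, nullptr);
    cudaMalloc(&buf2, buf2_size);
    auto status =
        cusparseSpGEMM_compute(handle, op, op, &alpha, A, B, &beta, C,
                               CUDA_R_64F, alg, desc, &buf2_size, buf2);

    // Stage 3: copy the result into C (C must have been resized beforehand
    // via cusparseSpMatGetSize + cusparseCsrSetPointers).
    cusparseSpGEMM_copy(handle, op, op, &alpha, A, B, &beta, C, CUDA_R_64F,
                        alg, desc);

    cudaFree(buf1);
    cudaFree(buf2);
    cusparseSpGEMM_destroyDescr(desc);
    return status;
}
```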

@uboats commented Dec 14, 2023

For ParILUT, could we have one more parameter to choose the SpGEMM algorithm?
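Something along these lines (a purely hypothetical API; with_spgemm_algorithm does not exist in Ginkgo today):

```cpp
// Hypothetical extension of the ParIlut factory: let the user pick the
// backend SpGEMM algorithm, trading speed for a smaller memory footprint.
auto par_ilut_fact =
    gko::factorization::ParIlut<ValueType, IndexType>::build()
        .with_iterations(10u)
        .with_fill_in_limit(2.0)
        .with_spgemm_algorithm(spgemm_algorithm::memory_efficient)  // hypothetical
        .on(exec);
```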

@MarcelKoch added the is:bug and is:idea labels on Nov 6, 2024