Intel MKL gives nan on cholmod_dl_demo for nd6k.mtx as the input. #554
The CUDA version is 12.0 and OpenMP is enabled. |
This can happen if the BLAS integer size is not detected properly. What is the SUITESPARSE_BLAS_INT, and does it match the MKL library you are using? |
On my machine, SUITESPARSE_BLAS_INT is int32_t in cholmod_dl_demo on Windows. |
I use cholmod_l_print_common to print information about my system
|
I think the BLAS integer type is right.
|
OK. But MKL has both 32-bit and 64-bit integer versions. If your code uses 4-byte integers but MKL expects 8 bytes, that would break. OpenBLAS has 32-bit integers only. |
Got it. Just saw that now. That all looks ok. I haven't tried the 2024 mkl BLAS yet on Linux. I will give it a try |
Oh, I see. One should always set up the environment variable with
Everything works well now. On Windows, the internal environment of Visual Studio somewhat conflicts with MKL. When the environment is set up via cmd, the program works well; however, in Visual Studio it reports an error. I will try to work out a way to set up the environment in the Visual Studio IDE. |
For Windows Visual Studio users:
With the same cmd, compile your code
With the same cmd, open the project with
It seems that this way, MSVC inherits the environment variables set up by Intel oneAPI. |
Oops, it seems that Windows with the GPU enabled is the only configuration that cannot pass cholmod_dl_demo.exe. Environment
BLAS
|
I just opened #559 that adds a runner for that configuration to the CI. All ctests have passed on prior tests on my fork on GitHub. |
Thanks for developing the library; configuring the project has become super easy (with vcpkg). I have rebuilt the newest dev branch (0b41a3e) with Intel oneAPI MKL 2024.0 + CUDA 12.0, and specified GPU use manually
The cholmod_sl_demo gets the correct result
However, the cholmod_dl_demo gets the wrong result
I can observe in the Windows Task Manager that the GPU is used. Besides, the CPU versions of both demos get the correct result. For example, the cholmod_dl_demo gives
|
The GPU is only used for the double case, so cholmod_sl_demo won't trigger it. If you uncomment these lines: SuiteSparse/CHOLMOD/Include/cholmod_internal.h Lines 47 to 49 in 0b41a3e
and recompile, it will print the # of calls to the GPU. That will verify that the GPU is being used. If you enable the GPU but use another BLAS, does the error go away? I don't do a lot of testing on Windows (just the CI on GitHub), so it's possible something is not working there. This cholmod_dl_demo works fine on my Linux desktop with its GPUs. But I'm using CUDA 11.7, not 12, and an older MKL (2022 if I recall). |
Thank you for your kind reply. I'll try another BLAS/LAPACK on Windows for testing. I'm not entirely sure that cholmod_sl_demo doesn't use the GPU at all, although the printed result shows no GPU calls. cm->useGPU=1
However, with cm->useGPU = 0, the demo runs much, much faster than with cm->useGPU = 1. And in this case, the Windows Task Manager shows no GPU utilization at all. cm->useGPU=0
I'm quite confused that useGPU=1 causes the demo to run much slower in the analyze step; how can that be? |
I've searched desperately for a BLAS and LAPACK that can be used on Windows, but failed to find one. Instead, I tried the project suitesparse-metis-for-windows. First, I commented out the Hunter download package for MKL in the top-level CMakeLists.txt (line 275) to ensure the library uses the system-wide MKL, i.e., the MKL provided with Intel oneAPI 2024.0.
Then I built the project (suitesparse-metis-for-windows) with MKL and CUDA.
Finally, I used the aforementioned compiled library to run cholmod_dl_demo (I changed very little code due to API compatibility issues); with useGPU=1, the demo runs well.
I really wonder whether the implementation in version 5.4.0 differs from that in version 7.4.0. |
I manually set cm->useGPU=1 in cholmod_dl_demo. All tests passed on my Windows machine with CUDA 12.0 + MKL 2024.0. However, when I wanted to compare the performance of the CPU/GPU versions using the recommended nd6k.mtx dataset, the error happened. |
Specify cm->useGPU=1 in cholmod_dl_demo and recompile the code. After that run the target RUN_TESTS
|
The CMake tests for CHOLMOD are very short, and do not exercise the GPU. So the fact that they passed doesn't have any impact on how the GPU is working. |
The analyze phase is independent of the data type (single or double), so it can be used by a subsequent single or double factorization, regardless of the type of the input matrix. Only double is done on the GPU, but the analyze phase just looks at cm->useGPU, not the type. I can revise my rules so that the analysis for the GPU is only done if the input matrix has a dtype of CHOLMOD_DOUBLE. The analyze phase has more work to do for the GPU factorization, but I'm not sure why it's that slow. I will double check to see what's happening. |
Regarding the analyze time: I forgot, but it's during the symbolic analysis that I query the GPU to see how much memory it has. That requires a lot of time to "warm up" the GPU. It requires that the GPU be initialized, which takes a lot of time. If you did any subsequent calls to cholmod_analyze, you wouldn't see that run time. For the case of symbolic analysis with a single-precision input matrix: I can simply turn off the GPU analysis in that case. The remaining issue is just the nan, right? It's not clear to me which cases give you a nan, and which ones work. Can you clarify? Is it just the case of MSVC with Intel oneAPI MKL 2024.0 + CUDA 12.0, and cm->useGPU=1 (for cholmod_dl_demo)? And the same thing works fine with WSL, the Windows Subsystem for Linux? |
Yes. I cloned the latest dev branch to confirm the result. My computer is Windows 11 with an Intel i5-13490F CPU and a GeForce RTX 2080 Ti GPU. And I installed WSL2 with Ubuntu 22.04 for Linux environment simulation. The CUDA library on my system is slightly different from the CUDA library from when the issue was opened several days ago. Besides, I uncommented #define BLAS_TIMER to give more statistics. It seems that only Windows with cm->useGPU = 1 gives nan. Windows With cm->useGPU = 1
With cm->useGPU = 0
WSL With cm->useGPU = 1
With cm->useGPU = 0
|
Thanks for the details. For the Windows case, do you use MSVC or MINGW? This will be difficult for me to replicate; I don't use Windows at all, myself, so the support I can provide for Windows is very limited. The CUDA version you have is very new (12.3). Perhaps an older version might be OK. I see you have 12.1 on Linux and 12.3 on Windows. |
OK, I’ll give CUDA 11.7 a try. If you want me to dump the intermediate result, I can send it to you by e-mail.
If someone would like to reproduce the issue on Windows, here are the steps to set up the environment for the project. The following software is assumed to be installed beforehand. The test matrix data is extracted to
The following shell commands are assumed to be executed in the same cmd (rather than PowerShell). The local path of the cloned repository is C:\Users\syby119\github Step 1: install gmp / mpfr with vcpkg
Step 2: Get SuiteSparse
Step 3: Activate oneApi environment
Step 4: Configure and build the project
Step 5: Copy the dll
**Run the demo**
|
I have tried CUDA 11.7, but the result is the same, i.e. cholmod_dl_demo.exe gives nan with GPU factorization. I'm pretty satisfied with the speed of the CPU version, so I'll use the CPU version for now. I hope some experienced Windows users can help debug the problem someday. |
OK. I'll go ahead and release SuiteSparse without fixing this, and perhaps add a note about using the GPU on Windows. |
Alternatively, I could just disable the GPU entirely when working on Windows with MSVC. |
I was struggling with the exact same issue between Intel MKL (2024.0) and CHOLMOD; however, I'm using Visual Studio 2019 (16.11.32, Win10) and CMake 3.28 with MSVC directly, without relying on the vcpkg toolchain. Reading through this, I included CUDA (12.1) in the mix and went deeper. I spent a few days debugging and I think I might have a working hypothesis, but I need some more benchmarking to be certain for my case, and maybe @syby119 can also test it on his build environment/toolchain.

At the moment I believe the issue is caused by OpenMP and not by CUDA/MKL directly. As soon as I disable OpenMP in the language extensions for C/C++ and override the 'NOT MSVC' clause from 68d1071, the tests pass and the GPU is utilized. I've tried both the MSVC and Intel C/C++ compilers, and the behavior is the same as described by @syby119. With OMP disabled and CUDA enabled, the tests pass (GPU calls verified using #define BLAS_TIMER). What I'm investigating:

P.S. What triggered me to stumble upon this thread initially: it all started from a strange issue I had when trying to link everything statically (including the BLAS); needless to say, mixing Intel's and Microsoft's OpenMP implementations is a recipe for disaster. As a workaround I excluded MS's implementation of OpenMP (/NODEFAULTLIB:vcomp.lib), which forced the application to link against symbols from libiomp5md.lib and load libiomp5md.dll only; factorizations then ran OK. Otherwise I got both vcomp140.dll and libiomp5md.dll loaded in the debugger, and factorizations failed. The other workaround was to disable OpenMP in the compiler entirely and use TBB/OMP from the MKL oneAPI only for the internal BLAS parallelization, de facto leaving the CHOLMOD code serial on the MSVC side. |
Thanks for the update. It's been a difficult problem to tackle since I don't have access to a Windows machine with CUDA. I'll take a look at this, perhaps for a SuiteSparse 7.4.1 release (I'm about to release a stable SuiteSparse 7.4.0). I'll reopen this issue. |
I can confirm that declaring local function variables as private in t_cholmod_gpu.c, produces the correct behavior using MSVC toolchain. I used these override directives for each of the functions during testing (I hope the short names are indicative of the respective template function names, where they've been applied):
// tl;dr Regarding the observation of @syby119 about the slow analyze phase: I've added some modifications to the BLAS_TIMER sections and timed the allocations during initialization (which actually happens in the supersymbolic stage of the matrix structure analysis). Here are my results on [email protected]/32GB & RTX 3070Ti/8GB:
I cannot be 100% certain whether there is some other performance hit on my machine for using the GPU, but the CPU performs much better for nd6k; using pwtk produces even less gain, as the matrix contains smaller nodes and doesn't utilize the GPU at all. And I don't have sufficient resources to run audikw_1... // tl;dr: P.S. As I'm the opposite case of @DrTimothyAldenDavis, I cannot currently test this on a Linux environment (I have a CentOS VM, but it doesn't support GPU passthrough and CUDA). I'll see if I have the time to build a physical Linux environment, do more comparisons on the same machine, and check whether the OS/MSVC have something to do with CUDA performance. |
Thanks for the update! I see where the problem is. Yes, those variables should be private. The openmp loops need to be rewritten so that they declare their variables internal to the loops themselves. Those loops were written a while back and they use an older style. It's more clear, I think, to rewrite them so that the private clause isn't needed at all. For example, rather than declaring iidx at the top of the code, it's more clear to do something like
instead of
I'll revise the code and re-enable the GPU and MSVC and post an update for you to try. Thanks again! |
I will also revise the demos so they factor the matrix twice, to get a different timing result. |
Thanks for your feedback. I think I've fixed this issue in my latest commit to dev2: f7a8349 I will post an update in SuiteSparse 7.4.0.beta9 shortly, or you can try the dev2 branch. |
Great, I have tried the cholmod_dl_demo on dev2 and it works!
Don't forget to allow CUDA support with MSVC in all the SuiteSparsePolicy.cmake files
instead of
|
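A hedged guess at the shape of the elided CMake lines above, based on the 'NOT MSVC' clause from commit 68d1071 mentioned earlier in the thread (the variable and command names here are assumptions, not the verbatim file contents):

```cmake
# allow CUDA to be enabled with MSVC as well
if ( SUITESPARSE_USE_CUDA )
    enable_language ( CUDA )
endif ( )
```

instead of

```cmake
if ( SUITESPARSE_USE_CUDA AND NOT MSVC )
    enable_language ( CUDA )
endif ( )
```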
I found that Nsight Systems is a really great profiling tool for the program. From the timeline I can see that the CUDA API call cudaMallocHost takes most of the time. It seems that the demo needs to allocate pinned memory for CPU-to-GPU data transfer once at the beginning of all calculations. Here are the percentages of the CUDA API calls; the kernel execution time is not fully included and can be inferred from the CUDA synchronize function calls.
|
I would appreciate it if CHOLMOD could support single-precision calculation on the GPU, since NVIDIA GeForce-series cards have very few FP64 cores compared to FP32 cores. Besides, single-precision data storage consumes less memory bandwidth and improves GPU caching performance in L1/L2 and shared memory.
Great! For the single precision support on the GPU: yes, I agree that would be useful. I can't add it now though; it will have to wait for a future enhancement. (That would be best posted as a separate issue, since I'll close this soon, when 7.4.0 looks stable for this issue). The cudaMallocHost is a known thing. It has to do a large malloc to pin memory on the host, and that is costly. It only needs to be done once for the entire application. It's done on the first call to cholmod_l_analyze when using the GPU. |
Fixed in SuiteSparse 7.4.0 |
Using Intel MKL as the backend to provide BLAS and LAPACK, the cholmod_dl_demo gives a warning when factorizing the matrix.
And the result is wrong.
This happens on both Windows and WSL with Ubuntu 22.04, with both CPU and GPU.
However, when I tried OpenBLAS and LAPACK as the backend, the result is correct, with both CPU and GPU.
I think maybe this is an external bug and we should not use MKL as a backend for now.