kokkos kernels: broken unit test w/ cuda 12.4 on h100 gpus with UVM enabled #2316
Comments
Thanks for reporting this, @vasylivy. We will have a look!
I tested on Blake with cuda/12.0+gcc/11.3.0 on H100 (cuda/12.4 is available there, but the driver only supports up to cuda/12.2). The test passes both with no TPLs enabled and with CUSPARSE enabled.
Here are some reference notes on my attempts to reproduce (no TPLs enabled in the commands below):
ssh blake
salloc -N 1 -p H100
module load cmake gcc/11.3.0 cuda/12.0.0
# kokkos configuration
cmake -DCMAKE_CXX_COMPILER=$KOKKOS_PATH/bin/nvcc_wrapper -DCMAKE_INSTALL_PREFIX=$KOKKOS_INSTALL -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_H100=ON -DKokkos_ENABLE_TESTS=OFF -DKokkos_ENABLE_EXAMPLES=OFF -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_CXX_EXTENSIONS=OFF -DBUILD_SHARED_LIBS=OFF -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF -DKokkos_ENABLE_DEPRECATED_CODE_4=OFF -DKokkos_ENABLE_DEPRECATION_WARNINGS=OFF $KOKKOS_PATH
# kokkos-kernels configuration
cmake -DCMAKE_CXX_COMPILER=$KOKKOS_PATH/bin/nvcc_wrapper -DKokkos_DIR=$KOKKOS_INSTALL/lib64/cmake/Kokkos -DKokkosKernels_ENABLE_TESTS_AND_PERFSUITE=OFF -DKokkosKernels_ENABLE_TESTS=ON -DKokkosKernels_ENABLE_PERFTESTS=ON -DKokkosKernels_ENABLE_EXAMPLES:BOOL=ON -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=OFF -DKokkosKernels_ENABLE_TPL_ROCSPARSE=OFF -DKokkosKernels_ENABLE_TPL_ROCBLAS=OFF -DKokkosKernels_ENABLE_TPL_CUSOLVER=OFF -DKokkosKernels_ENABLE_TPL_CUSPARSE=OFF -DKokkosKernels_ENABLE_TPL_CUBLAS=OFF -DBUILD_SHARED_LIBS=OFF -DKokkosKernels_ENABLE_DOCS=OFF $KOKKOSKERNELS_PATH
I'm not sure at the moment where to test on H100 with cuda/12.4; I will need to find a machine.
The other configuration, a slight tweak of config 1 in that issue, did pass all tests. The machine is down at the moment, so I can't test things. Is UVM enabled by default with Kokkos?
Yaro
@vasylivy ah, I didn't enable UVM in my testing. I'll do that now and retest.
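For anyone trying to reproduce this, one way to see whether a given build tree has UVM-related options turned on is to search its CMake cache. The helper below is only a sketch: the name `list_uvm_options` is hypothetical, and it assumes that the relevant option names contain the string "UVM", which may not hold for every Kokkos or KokkosKernels version.

```shell
# Hypothetical helper (not part of Kokkos or KokkosKernels):
# list cache entries whose option name contains "UVM".
# Usage: list_uvm_options path/to/CMakeCache.txt
list_uvm_options() {
  cache_file="$1"
  if [ ! -f "$cache_file" ]; then
    echo "no such cache file: $cache_file" >&2
    return 1
  fi
  # Cache entries have the form NAME:TYPE=VALUE; match on the NAME part only.
  grep -i -E '^[A-Za-z0-9_]*UVM[A-Za-z0-9_]*:' "$cache_file" \
    || echo "no UVM-related entries found"
}
```

Running it against both the Kokkos and the KokkosKernels build directories shows the cached value of each matching option, which makes it easier to compare two configurations.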
Yep, with UVM enabled I see the same failure with cuda/12.0 on H100 in the build with TPLs enabled:
Reproducer configuration notes for Blake:
These graph tests also failed in that build:
It looks like the issue also exists with other CUDA compilers on Hopper when UVM is enabled, for example cuda/11.8.0+gcc/11.3.0:
More details:
Similar with cuda/12.0, with or without TPLs.
Setting
Edit: this refers to a cuda/12.0 build with UVM enabled on Hopper, no TPLs.
An added data point: I tested another configuration combination with the cuda/12.0 H100 no-TPL build, with UVM disabled in Kokkos but still enabled in KokkosKernels, so these changes to the Kokkos config
while leaving In this case:
Edit: to clarify, the testing results here are consistent with and without deprecated code (the same in either case of
Summarizing the multiple comments I added above:
Testing was done on the Blake H100 queue (Hopper GPUs) with cuda/12.0 and no TPLs enabled.
This table summarizes the UVM combination that triggers test failures:
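To re-run this kind of sweep, it can help to enumerate the on/off combinations of the UVM setting in Kokkos and in KokkosKernels before configuring each build. The sketch below only prints candidate flag pairs and does not run cmake. The function name `uvm_flag_matrix` is hypothetical, and the two option names (`Kokkos_ENABLE_CUDA_UVM` and `KokkosKernels_INST_MEMSPACE_CUDAUVMSPACE`) are my assumption about which switches are involved; they are not taken from the configurations tested in this thread and may differ between versions.

```shell
# Hypothetical helper: print the four on/off combinations of two UVM-related
# CMake options, one combination per line. The option names are assumptions.
uvm_flag_matrix() {
  for kokkos_uvm in ON OFF; do
    for kk_uvm in ON OFF; do
      # The first flag would go on the Kokkos configure line,
      # the second on the KokkosKernels configure line.
      echo "-DKokkos_ENABLE_CUDA_UVM=$kokkos_uvm -DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=$kk_uvm"
    done
  done
}
```

Each printed line can be split across the Kokkos and KokkosKernels configure commands shown earlier in the thread, giving one build directory per combination.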
Hi,
I've been testing Trilinos and came across a broken Kokkos Kernels unit test on H100s with CUDA 12.4. I have not tried to reproduce the broken test standalone, but figured I'd report it. See configuration 1 reported in trilinos/Trilinos#13397. The following test fails:
Thanks,
Yaro